
Exploring Global Information for Session-based Recommendation

Ziyang Wang, Wei Wei, Gao Cong, Xiao-Li Li, Xian-Ling Mao, Minghui Qiu This paper is an extended version of the SIGIR '20 conference paper [1]. Wei Wei is the corresponding author. Ziyang Wang and Wei Wei were with Cognitive Computing and Intelligent Information Processing (CCIIP) Laboratory, School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, China. E-mail: {ziyang1997,weiw}@hust.edu.cn. Gao Cong was with School of Computer Science and Engineering, Nanyang Technological University, Singapore. E-mail: [email protected]. Xiao-Li Li was with Institute for Infocomm Research, Singapore. E-mail: [email protected]. Xian-Ling Mao was with School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China. E-mail: [email protected]. Minghui Qiu was with Alibaba Group, Hangzhou, China. E-mail: [email protected].
Abstract

Session-based recommendation (SBR) is a challenging task, which aims at recommending items based on anonymous behavior sequences. Most existing SBR studies model user preferences based only on the current session while neglecting the item-transition information from other sessions, and thus suffer from an inability to model complicated item-transition patterns. To address this limitation, we introduce global item-transition information to strengthen the modeling of dynamic item transitions. To fully exploit the global item-transition information, two ways of exploring global information for SBR are studied in this work. Specifically, we first propose a basic GNN-based framework (B-GNN), which solely uses session-level item-transition information on the session graph. Based on B-GNN, we propose a novel approach, called Session-based Recommendation with Global Information (SRGI), which infers user preferences by fully exploring global item transitions over all sessions from two different perspectives: (i) the Fusion-based Model (SRGI-FM), which recursively incorporates the neighbor embeddings of each node on the global graph into the learning process of session-level item representations; and (ii) the Constraint-based Model (SRGI-CM), which treats the global-level item-transition information as a constraint to ensure the learned item embeddings are consistent with the global item transitions. Extensive experiments conducted on three popular benchmark datasets demonstrate that both SRGI-FM and SRGI-CM consistently outperform state-of-the-art methods.

Index Terms:
Session-based Recommendation, Graph Neural Network, Graph Contrastive Learning

1 Introduction

With the explosion of information on the Internet, recommendation systems play an increasingly important role, as they recommend useful content to address the information overload problem. Conventional recommendation approaches (e.g., collaborative filtering [2, 3, 4]) usually make predictions based on user profiles and long-term interaction history. However, in many recent real-world scenarios (e.g., on-line shopping platforms such as https://www.amazon.com/ and mobile stream media such as https://www.tiktok.com/), such information may not exist. In addition, users' recent behaviors indicate their current interests, which are neglected by these methods. To bridge these gaps in conventional recommendation approaches, session-based recommendation (SBR) has emerged to predict the next action (e.g., click) of users based on anonymous behavior sequences in chronological order.

Due to its high practical value, SBR has attracted increasing attention and many approaches have been proposed. Most early studies on session-based recommendation are based on Markov chains [5], which predict the user's next interest solely based on the previous action. However, these methods rely on a strong independence assumption, and may suffer from noisy data and intractable computation problems in real-world applications.

Recent years have witnessed the rapid development of deep learning techniques [6, 7, 8], and many neural-network-based approaches have been proposed for SBR, which model the item transitions within the current session and can be roughly categorized into two types, i.e., RNN-based and GNN-based. The former [9, 10, 11] regard SBR as a sequence modeling task and leverage recurrent neural networks (RNNs) to capture the item-transition information in chronological order, which is then extended with attention networks [12] and memory networks [13]. The latter [14, 15, 16] focus on capturing the dependencies between each item and its context (i.e., adjacent items) in the session; they first convert each session into a subgraph and then employ a graph neural network (e.g., GGNN [10], GAT [17]) to extract item features from the graph.

Figure 1: Example of item-transition of sessions.

Although encouraging results have been achieved, these methods still suffer from an inability to model the complicated inherent order of item transitions in each session, due to noisy data and limited session information. Recently, some efforts [18, 19] have been devoted to incorporating collaborative information into SBR to strengthen the modeling of item transitions. Specifically, CSRM [18] fuses the latest $m$ sessions to enrich the representation of the current session via a memory network. CoSAN [19] injects feature embeddings from neighborhood sessions to enrich the item representations in the current session. However, these methods may unfortunately encode both relevant and irrelevant information of the other sessions into the current session representation, which may even deteriorate the performance [11]. To illustrate this, an example is shown in Figure 1. Without loss of generality, suppose the current session is "Session 3", and session-based recommendation aims to recommend the user's next interest based on the past three clicked items. From Figure 1, we observe that: (i) Utilizing the item transitions of other sessions might help model the user preference of the current session. For example, from "Session 1" and "Session 2" we can find items relevant to "Iphone" via item-transition information, such as "HUAWEI Mate" and "OnePlus"; and (ii) Directly utilizing the item-transition information of an entire other session may introduce noise when part of the item-transition information encoded in that session is not relevant to the current session. For instance, CSRM [18] and CoSAN [19] may also utilize "Session 1" and "Session 2" to help model the user preference of "Session 3", which introduces irrelevant items (i.e., "LEGO" and "Electric Toothbrush") when learning the embedding of "Session 2", since they treat each other session as a whole without distinguishing relevant item transitions from irrelevant ones, which is challenging.

To this end, we propose a novel approach that exploits item transitions over all sessions to better infer the user preference of the current session, named Session-based Recommendation with Global Information (SRGI). In SRGI, we propose to learn two levels of item features from a session graph and a global graph, respectively: (i) the session graph, which learns session-level item features by modeling pairwise item transitions within the current session; and (ii) the global graph, which learns global-level item features by modeling pairwise item transitions over all sessions (including the current one). SRGI first employs a basic GNN framework (B-GNN) on the session graph to learn session-level item embeddings within the current session from three kinds of relations. Then, based on the constructed global graph, we present two versions of SRGI that incorporate the global transition information into SBR: (i) the Fusion-based Model (SRGI-FM), which directly incorporates the neighbor features of each node on the global graph through a session-aware attention mechanism to enrich the features of the current session; and (ii) the Constraint-based Model (SRGI-CM), which preserves the global proximity via graph contrastive learning to ensure the learnt item embeddings are consistent with the global item transitions.

The main contributions of this work are summarized as follows:

  • We propose a basic GNN-based framework for SBR, which employs a graph neural network layer to explicitly extract item features derived from three kinds of relations on the session graph.

  • We propose a novel unified model (named SRGI) to fully and effectively explore the pairwise item-transition information from two levels of graph models (i.e., the session graph and the global graph) in two different ways: the Fusion-based Model, which recursively incorporates global neighbor information into the item-embedding learning process to enhance the session-level item representations; and the Constraint-based Model, which treats the global-level item-transition information as a constraint and employs graph contrastive learning to ensure the learnt item embeddings are consistent with the global graph structure.

  • We conduct extensive experiments on three real-world datasets, which demonstrate the efficacy of SRGI-FM and SRGI-CM over state-of-the-art baselines.

2 Related Work

2.1 Session-based Recommendation

Many research efforts have been conducted on session-based recommendation, which will be reviewed in this section.

Markov Chain-based SBR. Several traditional methods can be employed for SBR although they were not originally designed for it. For example, Markov chain-based methods map each session into a Markov chain, where the user's next action is inferred based on the previous one. Shani et al. [20] utilize Markov Decision Processes (MDPs) to model the transitions of items within a session, where the simplest MDP boils down to a first-order Markov chain [21]. Rendle et al. [5] propose FPMC, a hybrid method combining matrix factorization and a first-order Markov chain, to capture both sequential patterns and long-term user preferences for recommendation. It can be adapted for SBR by ignoring the user latent representation, which is not available in anonymous SBR. However, MC-based methods usually focus on modeling the sequential transition between two adjacent items. In contrast, our proposed model converts sequential item transitions into graph-structured data to capture the inherent order of item-transition patterns for SBR.

Deep-learning based SBR. In recent years, neural-network-based methods capable of modeling sequential data have been utilized for SBR. Hidasi et al. [9] propose GRU4REC, the first work to apply RNNs to SBR, which adopts a multi-layer Gated Recurrent Unit (GRU) to model item interaction sequences. Tan et al. [22] extend this method [9] by introducing data augmentation. Li et al. [12] propose NARM, which incorporates an attention mechanism into a stacked GRU encoder to capture more representative item-transition information for SBR. Liu et al. [13] propose an attention-based short-term memory network (named STAMP) to capture the user's current interest without using an RNN. Both NARM and STAMP emphasize the importance of the last click by using an attention mechanism. Inspired by Transformer [23], SASRec [24] stacks multiple self-attention layers to capture the relevance between items. ISLF [25] takes the user's interest shift into account, and employs a variational auto-encoder (VAE) [26] and an RNN to capture the user's sequential behavior characteristics for SBR. MCPRN [11] models the multiple purposes of a given session with a mixture-channel model for SBR. However, similar to MC-based methods, RNN-based methods focus on modeling the sequential transitions of adjacent items [27] to infer user preference via the chronology of the given sequence, and thus cannot model complex item-transition patterns (e.g., non-adjacent item transitions).

Recently, several proposals employ GNN-based models on graphs built from the current session to learn item embeddings for SBR. Wu et al. [14] convert each session into a graph and leverage a gated GNN to explore the complex transitions among items. Following the success of SR-GNN, several variants have been proposed for SBR, such as GC-SAN [28], which combines GNN layers and self-attention layers to learn the dependencies within the session. Qiu et al. [15] propose FGNN to learn each item representation by aggregating its neighbors' embeddings with multi-head attention. Yu et al. [16] leverage target-aware attention to dynamically obtain the importance of each item within the session. Meng et al. [29] incorporate a knowledge graph to capture the dependencies between items. However, all these approaches only model the item-transition information within the current session. In contrast, our proposed model learns the item-transition information over all sessions to enhance the modeling of item transitions and the inference of user interests.

Collaborative Filtering-based SBR. Although deep-learning-based methods have achieved remarkable performance, collaborative filtering (CF) based methods can still provide competitive results. Item-KNN [30] can be extended for SBR by recommending items that are most similar to the last item of the current session. KNN-RNN [31] makes use of GRU4REC [9] and a co-occurrence-based KNN model to extract sequential patterns for SBR. CSRM [18] first utilizes NARM over item transitions to encode each session, and then enriches the representation of the current session by exploring the latest $m$ neighborhood sessions. CoSAN [19] incorporates neighborhood session embeddings into the items of the current session and employs multi-head self-attention to capture the dependencies between items. However, CSRM and CoSAN may suffer from noise when integrating other sessions' embeddings into the current one. In contrast, our proposed method considers the collaborative information at the item level: we use the item embeddings in other sessions to enrich the item embeddings of the current session, and then integrate them into the session representation for SBR.

2.2 Graph Contrastive Learning

Graph contrastive learning aims to learn discriminative node representations by contrasting positive and negative samples in a graph. Early work on graph contrastive learning focuses on capturing local structural patterns, which forces neighboring nodes to have similar feature representations. For example, DeepWalk [32] and node2vec [33] obtain walk sequences on the graph and consider nodes within the same window of a sequence as positive samples. However, random-walk based approaches overly emphasize the structural information [34], and their performance is heavily affected by hyperparameter choices.

Recently, with the development of deep learning techniques, some studies employ graph neural networks to obtain better node representations for graph contrastive learning. Based on powerful graph convolutional architectures, DGI [35] maximizes the mutual information between global and local representations. GMI [36] focuses on the mutual information between the input and the representation of both nodes and edges in the graph. Via data augmentations, GraphCL [37] maximizes the agreement between two augmented views of the input graph. Considering adaptive graph augmentations, GCA [38] incorporates priors for topological and semantic aspects of the graph.

3 Preliminaries

In this section, we first present the problem statement, and then introduce two types of graph, i.e., session graph and global graph, based on different levels of pair-wise item transitions over sessions for learning item representations, in which we highlight the modeling of global-level item transition information as it is the basis of global graph construction. For clarity, frequently used notations are summarized in Table I.

TABLE I: Notations and Definitions.
Notation  Definition
Input
$S$  Anonymous session, $S=\{v^{s}_{1},v^{s}_{2},...,v^{s}_{l}\}$
Graph
$\mathcal{G}_{s},\mathcal{G}_{g}$  Session graph and global graph
$\mathcal{V}_{s},\mathcal{V}_{g}$  Nodes in the session graph and global graph
$\mathcal{E}_{s},\mathcal{E}_{g}$  Edges in the session graph and global graph
$\mathcal{N}^{s}_{v}$  Neighbor set of node $v$ in the session graph
$\mathcal{N}_{\varepsilon}(v)$  $\varepsilon$-neighbor set of node $v$ in the global graph
Latent Variables
$\textbf{h}_{\mathcal{N}^{g}_{v_{i}}}$  Features of the neighbors of item $v_{i}$ obtained from the global graph
$\textbf{h}^{g}_{v_{i}}$  Features of item $v_{i}$ obtained from the global graph
$\textbf{h}^{s}_{v_{i}}$  Features of item $v_{i}$ obtained from the session graph
$\textbf{h}^{\prime}_{v_{i}}$  Final representation of item $v^{s}_{i}$
$\textbf{S}$  Final representation of the current session
$\textbf{o}$  Features of items from graph views after data augmentation
Output
$\hat{\textbf{y}}$  Output probability of each item
Loss Functions
$\mathcal{L}_{S}$  Cross-entropy loss of the prediction results
$\mathcal{L}_{C}$  Contrastive loss for the global graph structure

3.1 Problem Statement

Let $V=\{v_{1},v_{2},...,v_{m}\}$ be the set of all items, and let each session $S$ be denoted by $S=\{v^{s}_{1},v^{s}_{2},...,v^{s}_{l}\}$, consisting of a sequence of interactions (i.e., items clicked by a user) in chronological order, where $v^{s}_{i}$ denotes the item clicked at time-step $i$ within session $S$, and $l$ is the length of $S$. Given a session $S$, session-based recommendation aims to recommend the top-$N$ items ($1\leq N\leq|V|$) from $V$ that are most likely to be clicked by the user at the next timestamp (i.e., $v^{s}_{l+1}$).

3.2 Graph Models: Session Graph and Global Graph

In this subsection, we present two different graph models to capture different levels of item transition information over all available sessions for item representation learning.

3.2.1 Session Graph Model

The session graph aims to learn session-level item embeddings by modeling sequential patterns over pairwise adjacent items in the current session. Inspired by [14], each session sequence is converted into a session graph for learning GNN-based item embeddings of the current session. More concretely, for each session $S=\{v^{s}_{i}\}^{l}_{i=1}$, the corresponding session graph is defined as a 2-tuple $\mathcal{G}_{s}=(\mathcal{V}_{s},\mathcal{E}_{s})$, where $\mathcal{V}_{s}\subseteq V$ and $\mathcal{E}_{s}=\{e^{s}_{ij}\}$ denote the clicked-item set and the edge set in $S$ respectively, and $e^{s}_{ij}=(v^{s}_{i},v^{s}_{j})$ indicates an edge between adjacent nodes $v^{s}_{i}$ and $v^{s}_{j}$ in $S$, which is called a session-level item-transition pattern.
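For concreteness, the construction can be sketched in a few lines of Python (the helper name and input format are illustrative, not a prescribed implementation):

    def build_session_graph(session):
        # Nodes are the unique items of the session; a directed edge
        # (v_i, v_j) is added for every pair of consecutive clicks.
        nodes = list(dict.fromkeys(session))
        edges = set()
        for i in range(len(session) - 1):
            edges.add((session[i], session[i + 1]))
        return nodes, edges

    # S = [v1, v2, v3, v2] yields edges {(v1,v2), (v2,v3), (v3,v2)}
    nodes, edges = build_session_graph(["v1", "v2", "v3", "v2"])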

Figure 2: Illustrations of the construction of (a) the session graph and (b) the global graph.

3.2.2 Global Graph Model

Similar to conventional recurrent neural network (RNN) based approaches (e.g., [12]), the session graph can efficiently capture sequential intra-session relations to learn session-level item embeddings. However, it neglects the complicated inter-session relations of items, e.g., the item-transition information from other sessions, which we call global-level item-transition information.

Global-level Item Transition Modeling. Here, we take global-level item transitions into account for global-level item representation learning, by integrating all pairwise item transitions over sessions. To this end, we propose a novel global graph model for learning global-level item embeddings, which breaks the sequence-independence assumption by linking all pairs of items based on pairwise transitions over all sessions (including the current one). Next, we first introduce a new definition (i.e., the $\varepsilon$-neighbor set) for modeling global-level item transitions, and then give the definition of the global graph.

Definition 1

$\varepsilon$-Neighbor Set ($\mathcal{N}_{\varepsilon}(v)$). For any item $v^{p}_{i}$ in session $S_{p}$, the $\varepsilon$-neighbor set of $v^{p}_{i}$ is a set of items, defined as follows,

\mathcal{N}_{\varepsilon}(v^{p}_{i})=\{v^{q}_{j}\,|\,v^{p}_{i}=v^{q}_{i^{\prime}}\in S_{p}\cap S_{q};\ v^{q}_{j}\in S_{q};\ j\in[i^{\prime}-\varepsilon,i^{\prime}+\varepsilon];\ S_{p}\neq S_{q}\},    (1)

where $i^{\prime}$ is the position of item $v^{p}_{i}$ in session $S_{q}$, and $\varepsilon$ is a hyperparameter that controls the scope of item-transition modeling between $v^{p}_{i}$ and the items in $S_{q}$. Note that parameter $\varepsilon$ favors the modeling of short-range item transitions over sessions, since transitions beyond this scope are unhelpful (and may even introduce noise, e.g., irrelevant dependencies) for capturing global-level item-transition information.

According to Definition 1, for each item $v_{i}\in V$, the global-level item transitions are defined as $\{(v_{i},v_{j})\,|\,v_{i},v_{j}\in V;\ v_{j}\in\mathcal{N}_{\varepsilon}(v_{i})\}$.

Global Graph. The global graph aims to capture global-level item-transition information, which is used to learn item embeddings over all sessions. Specifically, the global graph is built based on the $\varepsilon$-neighbor sets of items in all sessions. Formally, let $\mathcal{G}_{g}=(\mathcal{V}_{g},\mathcal{E}_{g})$ be the global graph, where $\mathcal{V}_{g}$ denotes the node set containing all items in $V$, and $\mathcal{E}_{g}=\{e^{g}_{ij}\,|\,(v_{i},v_{j});\ v_{i}\in V,\ v_{j}\in\mathcal{N}_{\varepsilon}(v_{i})\}$ denotes the edge set, each edge corresponding to a pair of items from the sessions. Figure 2b shows an example of building the global graph with $\varepsilon=2$. To distinguish the importance of $v_{i}$'s neighbors ($\mathcal{N}_{\varepsilon}(v_{i})$), we use the co-occurrence frequency of node $v_{i}$ and its neighbor node $v_{j}\in\mathcal{N}_{\varepsilon}(v_{i})$ over all sessions as the weight of the corresponding edge.
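A minimal Python sketch of this construction is given below; it counts windowed co-occurrences over all sessions as edge weights, and in practice only the top-$N$ heaviest edges per item are kept to limit noise (cf. Section 5.6.3):

    from collections import Counter

    def build_global_graph(sessions, epsilon=2):
        # Undirected weighted edges: for each occurrence of an item, every
        # distinct item within distance epsilon in the same session becomes
        # a neighbor; the weight counts co-occurrences over all sessions.
        weights = Counter()
        for s in sessions:
            for i, v in enumerate(s):
                lo, hi = max(0, i - epsilon), min(len(s), i + epsilon + 1)
                for j in range(lo, hi):
                    if s[j] != v:
                        weights[tuple(sorted((v, s[j])))] += 1
        return weights  # {(v_i, v_j): edge weight}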

Remark. (i) The definition of the neighbors (i.e., $\mathcal{N}^{g}_{v}$) of item $v$ on graph $\mathcal{G}_{g}$ is the same as $\mathcal{N}_{\varepsilon}(v)$; we do not distinguish $\mathcal{N}_{\varepsilon}(v)$ and $\mathcal{N}^{g}_{v}$ when the context is clear. (ii) $\mathcal{G}_{g}$ is an undirected weighted graph, as the $\varepsilon$-neighbor set is undirected. (iii) Each item in $V$ is encoded into a unified embedding space at time-step $t$, i.e., $\textbf{h}^{t}_{i}\in\mathbb{R}^{d}$ ($d$ denotes the dimension of the item embedding), which is fed with an initialization embedding $\textbf{h}^{0}_{i}\in\mathbb{R}^{|V|}$; here we use a one-hot embedding, which is transformed into the $d$-dimensional latent vector space by a trainable matrix $\textbf{W}_{0}\in\mathbb{R}^{d\times|V|}$.

4 The Proposed Method

Figure 3: An overview of the proposed framework. (a) Basic GNN-based framework, which utilizes graph neural networks to exploit the item transitions in the session graph. (b) Fusion-based model, where the neighbors of each item in the global graph are incorporated into the current session by a session-level attention mechanism. (c) Constraint-based model, which introduces the global transition information by preserving the global proximity in the embedding space.

To leverage different levels (i.e., session-level and global-level) of information for SBR, we first propose a basic GNN-based framework (B-GNN, as shown in Fig. 3a) in Section 4.1. Based on B-GNN, we then propose a novel approach (named Session-based Recommendation with Global Information, SRGI) with two variants, presented in Section 4.2 and Section 4.3, which leverage the global item-transition information in two different ways: (a) SRGI-FM (cf. Fig. 3b) incorporates the global item-transition information into the learning process of session-level item representations; and (b) SRGI-CM (cf. Fig. 3c) treats the global-level item-transition information as a constraint to ensure the learnt item embeddings are consistent with the structure of the global graph. Next, we illustrate each part in detail.

4.1 Basic GNN-based Framework (B-GNN)

In this section, we first propose a basic GNN-based model (called B-GNN, as shown in Fig. 3a), which solely utilizes three different types (i.e., in, out and in-out, as shown in Fig. 2a) of session-level item-transition information for session-based recommendation. Specifically, B-GNN consists of three sub-components, i.e., a session-level item representation learning layer, a session representation learning layer, and a prediction layer, which are detailed in the following sub-sections, respectively.

4.1.1 Session-level Item Representation Learning Layer

To learn session-level item representations, we focus on how to enrich the representation of each item with the help of its 1-hop neighbors, by exploring the pairwise item transitions within the current session.

Information Propagation: For each node $v_{i}$ of session $S$, we consider three different types of relations between $v_{i}$ and its neighbors $\mathcal{N}^{s}_{v_{i}}$, i.e., in-coming, out-going, and in-out neighbors, denoted by $\mathcal{N}^{s,in}_{v_{i}}$, $\mathcal{N}^{s,out}_{v_{i}}$ and $\mathcal{N}^{s,in\text{-}out}_{v_{i}}$, respectively. To compute the importance of different neighbors, we treat these three kinds of neighbors differently during the propagation of information; each is first aggregated via mean pooling, namely,

\begin{split}
\textbf{h}_{N_{v_{i}}^{s,in}}&=\text{MeanPooling}(\{\textbf{h}_{v_{j}}\,|\,v_{j}\in\mathcal{N}_{v_{i}}^{s,in}\})\\
\textbf{h}_{N_{v_{i}}^{s,out}}&=\text{MeanPooling}(\{\textbf{h}_{v_{j}}\,|\,v_{j}\in\mathcal{N}_{v_{i}}^{s,out}\})\\
\textbf{h}_{N_{v_{i}}^{s,in\text{-}out}}&=\text{MeanPooling}(\{\textbf{h}_{v_{j}}\,|\,v_{j}\in\mathcal{N}_{v_{i}}^{s,in\text{-}out}\})
\end{split}    (2)

Then, the neighbor representation $\textbf{h}_{N_{v_{i}}^{s}}$ of node $v_{i}$ is obtained by concatenating $\textbf{h}_{N_{v_{i}}^{s,in}}$, $\textbf{h}_{N_{v_{i}}^{s,out}}$ and $\textbf{h}_{N_{v_{i}}^{s,in\text{-}out}}$, namely,

\textbf{h}_{N_{v_{i}}^{s}}=[\textbf{h}_{N_{v_{i}}^{s,in}}\,\|\,\textbf{h}_{N_{v_{i}}^{s,out}}\,\|\,\textbf{h}_{N_{v_{i}}^{s,in\text{-}out}}],    (3)

where $\|$ denotes the concatenation operation. In particular, different from SR-GNN [14] and FGNN [15], we consider three kinds of relations on the session graph and use mean pooling in our propagation layer, which reduces the number of training parameters to avoid over-fitting.

Information Aggregation: The session-level item representation is generated by aggregating the embedding of item $v_{i}$ and its neighbor representation $\textbf{h}_{N_{v_{i}}^{s}}$ via a fully connected layer,

\textbf{h}^{s}_{v_{i}}=\text{tanh}(\textbf{W}_{1}\textbf{h}_{v_{i}}+\textbf{W}_{2}\textbf{h}_{N_{v_{i}}^{s}}+\textbf{b}_{1}),    (4)

where $\textbf{W}_{1}\in\mathbb{R}^{d\times d}$, $\textbf{W}_{2}\in\mathbb{R}^{d\times 3d}$ and $\textbf{b}_{1}\in\mathbb{R}^{d}$ are trainable parameters. The new representation $\textbf{h}^{s}_{v_{i}}$ of each item thus aggregates the features of the item itself and of its neighbors in the current session.
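For concreteness, the following PyTorch sketch illustrates one such layer implementing Eqs. (2)-(4); encoding the three neighbor relations as row-normalized adjacency matrices is an illustrative choice of this sketch, which turns the mean pooling of Eq. (2) into a matrix product:

    import torch
    import torch.nn as nn

    class BGNNLayer(nn.Module):
        def __init__(self, d):
            super().__init__()
            self.W1 = nn.Linear(d, d, bias=False)
            self.W2 = nn.Linear(3 * d, d)  # the bias plays the role of b_1

        def forward(self, h, A_in, A_out, A_io):
            # h: (n, d) item embeddings; A_*: (n, n) row-normalized
            # adjacencies, so A_* @ h is the mean pooling of Eq. (2).
            h_neigh = torch.cat([A_in @ h, A_out @ h, A_io @ h], -1)  # Eq. (3)
            return torch.tanh(self.W1(h) + self.W2(h_neigh))          # Eq. (4)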

4.1.2 Session Representation Learning Layer

Based on the learnt item representations, we now present how to obtain the session representation. Note that different items do not contribute equally to the next prediction. Intuitively, the items clicked later in the session are more representative of the user's current preferences. Moreover, it is important to find the main purpose of the user and filter noise in the current session [12]. Hence, we incorporate reversed position information and session information to make a better prediction. After feeding a session sequence into the graph neural network, we obtain the representations of the items involved in the session, i.e., $\textbf{H}=[\textbf{h}_{v^{s}_{1}}^{\prime},\textbf{h}_{v^{s}_{2}}^{\prime},...,\textbf{h}_{v^{s}_{l}}^{\prime}]$ (for B-GNN, $\textbf{h}_{v^{s}_{i}}^{\prime}=\textbf{h}^{s}_{v_{i}}$). We also use a learnable position embedding matrix $\textbf{P}=[\textbf{p}_{1},\textbf{p}_{2},...,\textbf{p}_{l}]$, where $\textbf{p}_{i}\in\mathbb{R}^{d}$ is the position vector for position $i$ and $l$ is the length of the current session sequence. The position information is integrated through concatenation and a non-linear transformation:

\textbf{z}_{i}=\text{tanh}\left(\textbf{W}_{3}\left[\textbf{h}_{v^{s}_{i}}^{\prime}\,\|\,\textbf{p}_{l-i+1}\right]+\textbf{b}_{2}\right),    (5)

where $\textbf{W}_{3}\in\mathbb{R}^{d\times 2d}$ and $\textbf{b}_{2}\in\mathbb{R}^{d}$ are trainable parameters. Here, the reversed position $l-i+1$ can be regarded as the distance between the current item and the predicted item, which carries more effective information than the forward position $i$. Next, the corresponding weight is learned through a soft-attention mechanism:

\beta_{i}=\textbf{q}_{2}^{\top}\sigma\left(\textbf{W}_{4}\textbf{z}_{i}+\textbf{W}_{5}\textbf{s}^{\prime}+\textbf{b}_{3}\right),    (6)

where $\textbf{W}_{4},\textbf{W}_{5}\in\mathbb{R}^{d\times d}$ and $\textbf{q}_{2},\textbf{b}_{3}\in\mathbb{R}^{d}$ are learnable parameters and $\textbf{s}^{\prime}$ denotes the session information, obtained by sum pooling (i.e., $\textbf{s}^{\prime}=\sum_{i=1}^{l}\textbf{h}_{v^{s}_{i}}^{\prime}$). Finally, the session representation is obtained by linearly combining the item representations:

\textbf{S}=\sum_{i=1}^{l}\beta_{i}\textbf{h}_{v^{s}_{i}}^{\prime}.    (7)

The session representation $\textbf{S}$ is constructed from all the items involved in the current session, where the contribution of each item is determined not only by the information in the session graph but also by its chronological order in the sequence.
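A PyTorch sketch of this session encoder (Eqs. (5)-(7)) is shown below; treating $\sigma$ as the sigmoid function and folding $\textbf{b}_{2}$ into a linear layer's bias are assumptions of the sketch:

    import torch
    import torch.nn as nn

    class SessionEncoder(nn.Module):
        def __init__(self, d, max_len):
            super().__init__()
            self.pos = nn.Embedding(max_len, d)   # position matrix P
            self.W3 = nn.Linear(2 * d, d)         # b_2 folded into the bias
            self.W4 = nn.Linear(d, d, bias=False)
            self.W5 = nn.Linear(d, d, bias=False)
            self.b3 = nn.Parameter(torch.zeros(d))
            self.q2 = nn.Linear(d, 1, bias=False)

        def forward(self, H):
            # H: (l, d) item representations of the current session
            l = H.size(0)
            rev = self.pos(torch.arange(l - 1, -1, -1))        # p_{l-i+1}
            z = torch.tanh(self.W3(torch.cat([H, rev], -1)))   # Eq. (5)
            s = H.sum(0, keepdim=True)                         # s' (sum pooling)
            beta = self.q2(torch.sigmoid(
                self.W4(z) + self.W5(s) + self.b3))            # Eq. (6)
            return (beta * H).sum(0)                           # Eq. (7), S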

4.1.3 Objective Function

Based on the obtained session representation $\textbf{S}$, the final recommendation probability is computed for each candidate item from its initial embedding and the current session representation. Here, the $\ell_{2}$-norm is employed to normalize the session and item representations (i.e., $\hat{\textbf{S}}=\frac{\textbf{S}}{\|\textbf{S}\|_{2}}$ and $\hat{\textbf{h}}_{v}=\frac{\textbf{h}_{v}}{\|\textbf{h}_{v}\|_{2}}$) to avoid popularity bias [39]. Then, we use the inner product and apply a softmax function to obtain the output $\hat{\textbf{y}}$:

\hat{\textbf{y}}_{i}=\text{Softmax}\left(\alpha\,\hat{\textbf{S}}^{\top}\hat{\textbf{h}}_{v_{i}}\right),    (8)

where $\alpha$ is a scale coefficient for better convergence [39, 40] and $\hat{\textbf{y}}_{i}\in\hat{\textbf{y}}$ denotes the probability of item $v_{i}$ being the next click in the current session.

The loss function is defined as the cross-entropy of the prediction results $\hat{\textbf{y}}$:

\mathcal{L}_{S}=-\sum_{i=1}^{m}\textbf{y}_{i}\log\left(\hat{\textbf{y}}_{i}\right)+\left(1-\textbf{y}_{i}\right)\log\left(1-\hat{\textbf{y}}_{i}\right),    (9)

where $\textbf{y}$ denotes the one-hot encoding vector of the ground-truth item.
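The scoring and loss computation (Eqs. (8)-(9)) can be sketched as follows; the small epsilon inside the logarithms is added only for numerical stability:

    import torch
    import torch.nn.functional as F

    def predict(S, item_emb, alpha=12.0):
        # S: (d,) session vector; item_emb: (|V|, d) item embeddings
        S_hat = F.normalize(S, dim=-1)           # l2-normalized session
        H_hat = F.normalize(item_emb, dim=-1)    # l2-normalized items
        return torch.softmax(alpha * (H_hat @ S_hat), dim=-1)   # Eq. (8)

    def prediction_loss(y_hat, y, eps=1e-8):
        # y: one-hot ground-truth vector over all items; Eq. (9)
        return -(y * torch.log(y_hat + eps)
                 + (1 - y) * torch.log(1 - y_hat + eps)).sum()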

4.2 Fusion-based Model

In this section, we propose a novel fusion-based model under the B-GNN framework, i.e., a variant of our proposed model SRGI (SRGI-FM, as shown in Fig. 3b). We detail how to incorporate global neighbor features into the current session, building on the architecture of graph convolutional networks [41]. We first describe a single layer, consisting of two components, fusion-based information propagation and information aggregation, and then show how to generalize it to multiple layers.

Fusion-based Information Propagation: To obtain the first-order (i.e., 1-hop neighbors in the global graph) features of item $v$, one straightforward solution is mean pooling [42]. However, not all items in $v$'s $\varepsilon$-neighbor set are relevant to the user preference of the current session, and thus we utilize a session-aware attention to distinguish the importance of the items in $\mathcal{N}_{\varepsilon}(v)$. Each item in $\mathcal{N}_{\varepsilon}(v)$ is linearly combined according to its session-aware attention score,

\textbf{h}_{\mathcal{N}^{g}_{v_{i}}}=\sum_{v_{j}\in\mathcal{N}^{g}_{v_{i}}}\pi(v_{i},v_{j})\,\textbf{h}_{v_{j}},    (10)

where $\pi(v_{i},v_{j})$ estimates the importance weight of each neighbor. Intuitively, the more consistent an item is with the preference of the current session, the more important it is for the recommendation. Therefore, we implement $\pi(v_{i},v_{j})$ following the principle of attention networks [17]:

\pi(v_{i},v_{j})=\textbf{q}^{T}_{1}\,\text{LeakyRelu}\big(\textbf{W}_{6}[(\textbf{s}^{\prime}\odot\textbf{h}_{v_{j}})\,\|\,w_{ij}]\big),    (11)

where $\|$ indicates the concatenation operation; $w_{ij}\in\mathbb{R}^{1}$ is the weight of edge $(v_{i},v_{j})$ in the global graph; $\textbf{W}_{6}\in\mathbb{R}^{(d+1)\times(d+1)}$ and $\textbf{q}_{1}\in\mathbb{R}^{d+1}$ are trainable parameters; and $\odot$ indicates the element-wise product.

Different from mean pooling, in our model the propagation of information depends on the affinity between $\textbf{s}^{\prime}$ and $v_{j}$, which means that neighbors matching the preference of the current session are favored. We then normalize the coefficients across all neighbors connected with $v_{i}$ by adopting the softmax function:

\pi(v_{i},v_{j})=\frac{\exp\big(\pi(v_{i},v_{j})\big)}{\sum_{v_{k}\in\mathcal{N}^{g}_{v_{i}}}\exp\big(\pi(v_{i},v_{k})\big)}.    (12)

As such, Eq. (12) is capable of assigning high attention scores to the important global-level 1-hop neighbors of items in the current session.
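A sketch of this session-aware neighbor attention (Eqs. (10)-(12)) for a single item is given below; applying $\textbf{W}_{6}$ as a right multiplication is a layout convention of the sketch:

    import torch
    import torch.nn.functional as F

    def neighbor_attention(s, h_neigh, w, W6, q1):
        # s: (d,) session feature s'; h_neigh: (n, d) neighbor embeddings;
        # w: (n,) global-graph edge weights; W6: (d+1, d+1); q1: (d+1,)
        feat = torch.cat([s * h_neigh, w.unsqueeze(-1)], -1)  # (s'⊙h) ‖ w
        pi = F.leaky_relu(feat @ W6) @ q1                     # Eq. (11)
        pi = torch.softmax(pi, dim=0)                         # Eq. (12)
        return (pi.unsqueeze(-1) * h_neigh).sum(0)            # Eq. (10)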

Fusion-based Information Aggregation: The final step is to aggregate the item representation $\textbf{h}_{v}$ and its neighborhood representation $\textbf{h}_{\mathcal{N}^{g}_{v}}$; we implement the aggregator function agg as follows,

\textbf{h}^{g}_{v}=\text{Relu}\big(\textbf{W}_{7}[\textbf{h}_{v}\,\|\,\textbf{h}_{\mathcal{N}^{g}_{v}}]\big),    (13)

where $\textbf{W}_{7}\in\mathbb{R}^{d\times 2d}$ is a transformation weight.

Through a single aggregator layer, the representation of an item depends on itself and its immediate neighbors. We can explore higher-order connectivity information by extending the aggregator from one layer to multiple layers, which allows more global information to be incorporated into the current representation. We formulate the representation of an item at the $k$-th step as:

\textbf{h}^{g,(k)}_{v}=\text{agg}\big(\textbf{h}^{(k-1)}_{v},\textbf{h}^{(k-1)}_{\mathcal{N}^{g}_{v}}\big),    (14)

where $\textbf{h}^{(k-1)}_{v}$ is the representation of item $v$ generated from the previous information propagation step, and $\textbf{h}^{(0)}_{v}$ is set to $\textbf{h}_{v}$ at the initial propagation iteration. In this way, the $k$-order representation of an item is a mixture of its initial representation and those of its neighbors up to $k$ hops away, enabling more effective messages to be incorporated into the representation of the current session.

After obtaining $\textbf{h}^{g,(k)}_{v}$, the new representation of each item is obtained as follows,

\textbf{h}^{\prime}_{v}=\textbf{h}^{g,(k)}_{v}+\textbf{h}^{s}_{v}.    (15)

Consequently, we obtain a new representation for each item of the current session that integrates both session-level and global-level item-transition information; the recommendation prediction is then carried out via Eqs. (5)-(9) using the new item embeddings $\textbf{h}^{\prime}_{v}$.
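The multi-layer aggregation and final fusion (Eqs. (13)-(15)) can be sketched as follows, where `gather_neighbors` is a hypothetical callback standing in for the attention-weighted pooling of Eq. (10):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class GlobalAggregator(nn.Module):
        def __init__(self, d, k):
            super().__init__()
            # one W_7 per propagation step
            self.layers = nn.ModuleList(
                [nn.Linear(2 * d, d, bias=False) for _ in range(k)])

        def forward(self, h, gather_neighbors):
            # h: (n, d) initial item embeddings h_v^{(0)}
            for W7 in self.layers:
                h_neigh = gather_neighbors(h)                # h_{N_v^g}
                h = F.relu(W7(torch.cat([h, h_neigh], -1)))  # Eqs. (13)-(14)
            return h                                         # h_v^{g,(k)}

    # Final item representation (Eq. (15)): h_prime = h_global + h_session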

4.3 Constraint-based Model

Recall that the strategy of building the global graph is based on the co-occurrence of pairwise items over sessions: item pairs with high-frequency co-occurrence are retained, while those with low-frequency co-occurrence are filtered out. In this section, we propose another variant of our model SRGI (i.e., SRGI-CM, as shown in Fig. 3c). Different from SRGI-FM, which integrates different levels of item transitions when learning item representations, the principle of SRGI-CM is to ensure the learnt item embeddings are consistent with the global-level item-transition information, which we call global proximity. Global proximity can be modeled by graph contrastive learning from two aspects: (i) Preserving the local structural patterns. An item and its global neighbors often have high relevancy (e.g., milk and bread) due to high-frequency item transitions over sessions, and thus should be assigned proximal vectors in the learnt embedding space; and (ii) Enforcing discrimination between different items. The representations of different nodes should be discriminable in the learnt embedding space. Next, we show the process of performing contrastive learning on the global graph.

Graph Data Augmentation. Given the global graph $\mathcal{G}_{g}=(\mathcal{V}_{g},\mathcal{E}_{g})$, we first construct two correlated graph views based on data augmentation. Graph augmentation aims to create novel and realistically rational data via certain transformations of the original graph. Different graph augmentations provide different contexts for each node, which is the key component of graph contrastive learning. Following previous graph contrastive learning approaches [37, 38], three general data augmentations are used (sketched in code after the list):

  1. Node dropping. It randomly discards a certain portion of nodes in the global graph and removes their corresponding connections, where the dropping probability of each node follows a uniform distribution.

  2. Edge dropping. Similar to node dropping, edge dropping randomly removes a certain portion of edges in the original graph.

  3. Attribute masking. It masks a certain fraction of the dimensions of the node features with zeros.
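Minimal sketches of the three augmentations on an edge-list graph (with integer node ids; the drop and mask ratios are illustrative hyperparameters):

    import torch

    def drop_nodes(edges, num_nodes, p=0.1):
        # Randomly discard nodes and remove their incident edges.
        keep = torch.rand(num_nodes) > p
        return [(u, v) for u, v in edges if keep[u] and keep[v]]

    def drop_edges(edges, p=0.1):
        # Randomly remove a portion of the edges.
        keep = torch.rand(len(edges)) > p
        return [e for e, k in zip(edges, keep) if k]

    def mask_attributes(x, p=0.1):
        # x: (n, d) node features; zero a random fraction of dimensions.
        mask = (torch.rand(x.size(1)) > p).float()
        return x * mask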

Contrastive Learning. After employing the data augmentations, we obtain two correlated graph views, both corrupted from the original global graph. We denote the two graph views as $\mathcal{G}^{1}_{g}=(\mathcal{V}^{1}_{g},\mathcal{E}^{1}_{g})$ and $\mathcal{G}^{2}_{g}=(\mathcal{V}^{2}_{g},\mathcal{E}^{2}_{g})$, where $\mathcal{V}^{*}_{g}$ and $\mathcal{E}^{*}_{g}$ are the node sets and edge sets of the two graph views. A GNN-based encoder $f(\cdot,\cdot)$ is then applied to extract the features of each node in the graph:

\begin{split}\textbf{o}^{1}&=f(\mathcal{V}^{1}_{g},\mathcal{E}^{1}_{g})\\ \textbf{o}^{2}&=f(\mathcal{V}^{2}_{g},\mathcal{E}^{2}_{g})\end{split}    (16)

where $\textbf{o}^{*}_{i}\in\textbf{o}^{*}$ is the learnt representation of node $i$ in graph view $\mathcal{G}^{*}_{g}$. Contrastive learning does not impose any constraint on the GNN architecture [37]; here we choose GCN [41] for its low computational cost and stability. After obtaining the node representations $\textbf{o}^{1}$ and $\textbf{o}^{2}$, we use a projection function (i.e., a single fully connected layer) to transform the learnt node representations into another latent space, where the contrastive loss is calculated,

\textbf{z}=\text{Relu}(\textbf{W}_{8}\textbf{o}+\textbf{b}_{4}).    (17)

Finally, the contrastive loss is employed to maximize node-level agreement while keeping the representations of different nodes discriminable in the latent space. For node $i$ in graph view $\mathcal{G}^{1}_{g}$, only the corresponding node in graph view $\mathcal{G}^{2}_{g}$ is regarded as the positive sample, while all other nodes are treated as negative samples. The contrastive loss for the global graph is given as follows:

\mathcal{L}_{C}=-\sum^{N}_{i=1}\log\frac{e^{\theta(\textbf{z}^{1}_{i},\textbf{z}^{2}_{i})}}{e^{\theta(\textbf{z}^{1}_{i},\textbf{z}^{2}_{i})}+\sum^{N}_{k=1}\textbf{I}_{[k\neq i]}e^{\theta(\textbf{z}^{1}_{i},\textbf{z}^{2}_{k})}+\sum^{N}_{k=1}\textbf{I}_{[k\neq i]}e^{\theta(\textbf{z}^{1}_{i},\textbf{z}^{1}_{k})}},    (18)

where $\textbf{I}_{[*]}$ is an indicator function and $\theta(a,b)=\cos(a,b)$ denotes the cosine similarity. The contrastive loss $\mathcal{L}_{C}$ aims to: (i) maximize the consistency between corresponding nodes in the two graph views, which forces each node's features to be proximal to its neighbors' features and thus makes the learned node representations more robust; and (ii) minimize the agreement between different nodes, which makes the representations of different nodes discriminable in the latent embedding space.
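A PyTorch sketch of Eq. (18) is given below; the temperature $\tau$ is an assumption of the sketch (Eq. (18) as written corresponds to $\tau=1$), and averaging over nodes instead of summing only rescales the loss:

    import torch
    import torch.nn.functional as F

    def contrastive_loss(z1, z2, tau=1.0):
        # z1, z2: (N, d) projected features of the two graph views;
        # cosine similarity via l2-normalized inner products.
        z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
        between = torch.exp(z1 @ z2.t() / tau)   # theta(z_i^1, z_k^2)
        within = torch.exp(z1 @ z1.t() / tau)    # theta(z_i^1, z_k^1)
        pos = between.diag()                     # positive pairs (k = i)
        denom = between.sum(1) + within.sum(1) - within.diag()
        return -torch.log(pos / denom).mean()    # Eq. (18)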

To learn the item transitions within the session while preserving the global proximity, the final loss function of SRGI-CM combines the prediction loss and the contrastive loss,

\mathcal{L}_{G}=\mathcal{L}_{S}+\lambda_{C}\mathcal{L}_{C},    (19)

where $\lambda_{C}$ is a trade-off parameter that controls the impact of global proximity.

4.4 Discussion

Having detailed SRGI-FM and SRGI-CM, we note that the two models share a common design: both are built on GNNs and the proposed global graph. Global item transitions are important for modeling the item transitions within the current session. By exploring the node features and structural information in the global graph, global context information is introduced to enrich the representation of each item. The experimental results demonstrate that both SRGI-FM and SRGI-CM achieve promising improvements over B-GNN.

The main difference between SRGI-FM and SRGI-CM lies in two aspects: (i) The fusion-based model SRGI-FM and the constraint-based model SRGI-CM present two different ways to incorporate effective global information into the current session. SRGI-FM uses an attention-based GNN to directly incorporate neighbor features from the global graph into the current session, while SRGI-CM focuses on the global graph structure and leverages graph contrastive learning to constrain the representation of each item in the latent space. (ii) Compared with SRGI-FM, SRGI-CM needs to employ data augmentation to create different graph views of the global graph during training. However, at inference time, SRGI-CM needs neither the global graph as input nor the contrastive loss, while SRGI-FM still needs to incorporate features from the global graph into the current session. This means that SRGI-CM requires more time to process the data during training, while at inference its computational cost is much smaller than that of SRGI-FM.

5 Experiments

We have conducted extensive experiments to evaluate the accuracy of the proposed method by answering the following four key research questions:

  • RQ1: Do the two versions of SRGI outperform state-of-the-art SBR baselines on real-world datasets?

  • RQ2: Does global graph information improve the performance of SRGI? What is the impact of different hyper-parameters on learning global item transitions?

  • RQ3: How does the basic GNN framework (i.e., B-GNN) perform compared with other session-level item representation learning methods?

  • RQ4: Is the reversed-position-aware session encoder effective for session-based recommendation?

5.1 Datasets and Preprocessing

To fully evaluate the effectiveness of our proposed method, we employ three real-world datasets in the experiment.

Diginetica (https://competitions.codalab.org/competitions/11161) is from CIKM Cup 2016 and consists of typical transaction data over five months from an e-commerce website. Following [14], we use the sessions of the last week (the latest data) as the test data, and the remaining historical data for training.

Tmall (https://tianchi.aliyun.com/dataset/dataDetail?dataId=42) comes from the IJCAI-15 competition and contains anonymized users' shopping logs on the Tmall online shopping platform. Since the number of items in Tmall is extremely large, we use the first 120,000 sessions for training and testing. The sessions of the last 100 seconds are used as the test data, and the remaining historical data for training.

Nowplaying (https://dbis-nowplaying.uibk.ac.at/#nowplaying) comes from music-related tweets [43] and describes the music listening behavior of users. We use the sessions of the last two months as the test data, and the remaining historical data for training.

Following [14, 28], we preprocess the three datasets. More specifically, sessions of length 1 and items appearing fewer than 5 times are filtered out across all three datasets. For Tmall, sessions longer than 40 are also filtered [18]. Furthermore, for a session $S=[s_{1},s_{2},...,s_{n}]$, we generate sequences and corresponding labels by a sequence-splitting preprocessing, i.e., $([s_{1}],s_{2})$, $([s_{1},s_{2}],s_{3})$, ..., $([s_{1},s_{2},...,s_{n-1}],s_{n})$, for both training and testing across all three datasets. The statistics of the datasets after preprocessing are summarized in Table II.
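The sequence-splitting step can be sketched in a line of Python:

    def split_session(session):
        # Each prefix becomes one sample, labeled with the next click.
        return [(session[:i], session[i]) for i in range(1, len(session))]

    # [s1, s2, s3] -> [([s1], s2), ([s1, s2], s3)]
    samples = split_session(["s1", "s2", "s3"])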

TABLE II: Statistics of the used datasets.
Dataset Diginetica Tmall Nowplaying
# train sessions 719,470 351,268 825,304
# test sessions 60,858 25,898 89,824
# items 43,097 40,728 60,417
avg. len. 5.12 6.69 7.42

5.2 Evaluation Metrics

We adopt two widely used ranking-based metrics, P@N and MRR@N, following previous work [13, 14].

P@N (Precision) [14]: The P@N score is typically used as a measure of accuracy. It represents the proportion of correctly recommended items in the top-$N$ recommendation list, which is defined as:

P@N=\frac{n_{hit}}{n_{test}},    (20)

where $n_{test}$ denotes the number of test cases and $n_{hit}$ denotes the number of cases in which the target item appears among the top-$N$ recommended items.

MRR@N (Mean Reciprocal Rank) [13]: The MRR@N score is the average reciprocal rank of the correctly recommended items, where the reciprocal rank is set to zero if the rank exceeds $N$:

MRR@N=\frac{1}{n_{test}}\sum\frac{1}{\text{Rank}(v_{target})}.    (21)

MRR is a normalized score in the range $[0,1]$, and a larger MRR value means that correct recommendations rank higher in the recommendation list.

Here, we choose $N=20$ for both P@N and MRR@N, as recommendation systems should focus on top-ranked items.
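For reference, both metrics can be sketched as follows, assuming each test case provides a ranked item list and a ground-truth target:

    def p_at_n(rankings, targets, n=20):
        # rankings: list of ranked item lists; targets: ground-truth items
        hits = sum(t in r[:n] for r, t in zip(rankings, targets))
        return hits / len(targets)                     # Eq. (20)

    def mrr_at_n(rankings, targets, n=20):
        rr = [1.0 / (r[:n].index(t) + 1) if t in r[:n] else 0.0
              for r, t in zip(rankings, targets)]
        return sum(rr) / len(targets)                  # Eq. (21)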

5.3 Baseline Algorithms

We compare our method with classic methods as well as state-of-the-art models. The following baseline models are evaluated.

POP: It recommends the top-$N$ most frequent items of the training set.

Item-KNN [30]: It recommends items based on the similarity between items of the current session and items of other sessions.

FPMC [5]: It combines matrix factorization and the first-order Markov chain to capture both sequential effects and user preferences. Following previous work, we also ignore the user latent representations when computing recommendation scores.

GRU4Rec (https://github.com/hidasib/GRU4Rec) [9]: It is an RNN-based model that uses Gated Recurrent Units (GRUs) to model user sequences.

NARM (https://github.com/lijingsdu/sessionRec_NARM) [12]: It improves over GRU4Rec [9] by incorporating an attention mechanism into a hierarchical RNN for SBR.

STAMP (https://github.com/uestcnlp/STAMP) [13]: It employs attention layers to replace all RNN encoders in previous work, fully relying on self-attention over the last item in the current session to capture the user's short-term interest.

SR-GNN (https://github.com/CRIPAC-DIG/SR-GNN) [14]: It converts sessions into graphs and leverages a gated GNN layer to capture the dependencies between items and their contexts, followed by a self-attention over the last item, as STAMP [13] does, to obtain the session-level representation.

CSRM (https://github.com/wmeirui/CSRM_SIGIR2019) [18]: It utilizes memory networks [44] to investigate the latest $m$ sessions for better predicting the intent of the current session.

CoSAN [19]: It injects the embeddings of neighbor sessions to enrich the item representations and employs multi-head attention to capture the dependencies between items.

GCE-GNN (https://github.com/CCIIPLab/GCE-GNN) [1]: A state-of-the-art GNN-based model that directly aggregates global information into the current session and utilizes an attention mechanism to obtain the session representation.

5.4 Parameter Setup

Following previous methods [12, 13, 14], the dimension of the latent vectors is fixed to 100, and the mini-batch size is set to 100 for all models. We keep the hyper-parameters of each model consistent for a fair comparison. For CSRM, we set the memory size to 100, consistent with the batch size. For our model, all parameters are initialized using a Gaussian distribution with a mean of 0 and a standard deviation of 0.1. We use the Adam optimizer with an initial learning rate of 0.001, which decays by 0.1 after every 3 epochs. The L2 penalty is set to $10^{-5}$ and the scale coefficient $\alpha$ is set to 12. Moreover, we set the maximum distance of adjacent items $\varepsilon$ to 3 and the number of neighbors to 12. In SRGI-FM, we use dropout [45] to avoid overfitting, with the dropout ratio set to 0.5 and the graph depth searched in $\{1,2\}$. In SRGI-CM, the graph depth is set to 2 and the trade-off parameter $\lambda_{C}$ is searched in $\{10,50,100,150\}$. All hyper-parameters are tuned on a validation set, which is a random 10% subset of the training set.

TABLE III: The performance of evaluated methods on three datasets.
Method Diginetica Tmall Nowplaying
P@20 MRR@20 P@20 MRR@20 P@20 MRR@20
POP 1.18 0.28 2.00 0.90 1.76 0.58
Item-KNN 35.75 11.57 9.15 3.31 14.78 4.67
FPMC 22.14 6.66 16.06 7.32 7.36 2.82
GRU4Rec 30.79 8.22 10.93 5.89 7.92 4.48
NARM 48.32 16.00 23.30 10.70 18.59 6.93
STAMP 46.62 15.13 26.47 13.36 17.66 6.88
CSRM 48.49 17.13 29.46 13.96 18.14 6.42
CoSAN 51.97 17.92 32.68 14.09 21.05 7.53
SR-GNN 50.73 17.59 27.57 13.72 18.87 7.47
GCE-GNN 54.22 19.04 33.42 15.42 22.37 8.40
B-GNN 53.93 19.05 35.77 16.23 22.16 8.78
SRGI-FM 54.35 19.14 36.33 16.42 22.77 8.87
SRGI-CM 54.71 19.43 37.28 17.85 22.73 8.80

5.5 Overall Comparison (RQ1)

Table III reports the experimental results of the state-of-the-art baselines and our proposed model on three real-world datasets, in which the best result of each column is highlighted in boldface. It can be observed that SRGI-FM and SRGI-CM achieve better performance than state-of-the-art baselines across all three datasets in terms of the two metrics, which ascertains the effectiveness of our proposed method.

Among the traditional methods, POP performs the worst, as it only recommends the most popular items without learning the user's preference. Compared with POP, FPMC performs better on all three datasets, which shows the strength of Markov chains in modeling sequential data. Moreover, Item-KNN achieves the best results among the traditional methods on the Diginetica and Nowplaying datasets, which demonstrates the importance of collaborative information. However, it cannot capture the complex sequential transitions within a session.

Compared with traditional methods, neural-network-based methods usually perform better for session-based recommendation. In spite of performing worse than Item-KNN on Diginetica, GRU4Rec, as the first RNN-based method for SBR, still demonstrates the capability of RNNs in modeling sequences. Nevertheless, it relies entirely on an RNN to model the complex transitions within the session, which can only capture point-wise dependencies and may generate false dependencies [27].

The subsequent methods, NARM and STAMP, outperform GRU4REC significantly. NARM combines an RNN and an attention mechanism, using the last hidden state of the RNN as the main preference of the user. This result indicates that directly using an RNN to encode the session sequence may not be sufficient for SBR, as an RNN only models one-way transitions between adjacent items in a session. We also observe that STAMP, a purely attention-based method, achieves better performance than NARM on Tmall by incorporating self-attention over the last item of a session to model the short-term interest. This result demonstrates the effectiveness of assigning different attention weights to different items for session encoding. Compared with RNNs, the attention mechanism appears to be a better option, even though STAMP neglects the chronological order of items in a session.

CSRM and CoSAN perform better than NARM and STAMP on all three datasets, which shows the effectiveness of using item transitions from other sessions. However, the memory network used by CSRM has a limited number of slots, and both methods treat each other session as a whole without distinguishing relevant item transitions from irrelevant ones.

By modeling every session sequence as a subgraph and applying a GNN to encode the items, SR-GNN and GCE-GNN demonstrate the effectiveness of GNNs for session-based recommendation. This indicates that graph modeling is more suitable for SBR than sequence modeling (RNNs) or set modeling (attention). Our approaches SRGI-FM and SRGI-CM outperform SR-GNN and GCE-GNN on all three datasets. Specifically, SRGI-CM outperforms GCE-GNN by 1.5% on Diginetica, 13.6% on Tmall and 4.4% on Nowplaying on average. Different from SR-GNN and CoSAN, our approach integrates item-level transition information from the global context (i.e., other sessions) and the local context (i.e., the current session), and also incorporates reversed position information, leading to consistently better performance.

5.6 Impact of Global Feature Encoder (RQ2)

In this section, we study the effect of global transition information and the impact of different graph parameters (i.e., the graph depth, the number of neighbors, and the trade-off parameter $\lambda_{C}$) by conducting experiments on the three datasets.

5.6.1 Effect of global transition information.

From Table III, we observe that both SRGI-FM and SRGI-CM achieve better performance than B-GNN, which verifies that global transition information provides useful signals for learning the preference of the current session. Comparing the two versions of SRGI, SRGI-CM performs better than SRGI-FM in most cases. This is because SRGI-FM directly incorporates neighbors' features from the global graph, which can easily introduce extra noise into the current session. Although SRGI-FM leverages a session-aware attention mechanism to reduce the influence of noise, its performance is still affected. In comparison, SRGI-CM is less influenced by noisy information, as it does not fuse global features directly into the current session representation. Specifically, SRGI-CM obtains the representation of each session from the item features within the session and utilizes contrastive learning to preserve global proximity. The global proximity forces neighboring items to be closer while keeping different items discriminable in the learned embedding space, which benefits item representation learning and the prediction of the current session. The results in Table III demonstrate the effectiveness of both proposed versions of SRGI.

Figure 4: The performance of models with different graph depths: (a)-(c) MRR@20 on Diginetica, Tmall, and Nowplaying; (d)-(f) P@20 on Diginetica, Tmall, and Nowplaying.

5.6.2 Impact of the graph depth.

We next conduct experiments on the three datasets to evaluate the impact of the graph depth on the two versions of SRGI. For SRGI-FM, the computational cost is high: it uses an attention network in the graph encoder, and the number of neighbors of each item grows exponentially with the graph depth. We therefore only evaluate SRGI-FM with 1-hop and 2-hop neighbors due to limited GPU memory. In contrast, SRGI-CM employs a GCN architecture, whose computational cost is relatively low, so we evaluate SRGI-CM with graph depths from 1 hop to 4 hops.
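To see why a deeper receptive field is affordable for the GCN-based encoder, consider the following minimal sketch of multi-hop propagation, in which each additional hop costs one sparse matrix multiplication. Here adj_norm is an assumed pre-normalized sparse adjacency matrix of the global graph; the actual encoder may add per-layer weights and nonlinearities.

```python
import torch

def gcn_propagate(adj_norm, item_emb, depth):
    """Aggregate item features over `depth` hops of the global graph.
    Each hop is a single sparse matmul, so the cost grows linearly with
    depth, unlike attention over an exponentially growing neighbor set."""
    h = item_emb
    for _ in range(depth):
        h = torch.sparse.mm(adj_norm, h)  # one more hop of neighbors
    return h
```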

Figure 4 shows the performance of the two versions of SRGI with different graph depths. Both SRGI-CM and SRGI-FM outperform B-GNN at every graph depth on the three datasets. Specifically, SRGI-FM with 2-hop neighbors performs better than SRGI-FM with 1-hop neighbors in most cases, which indicates that higher-order exploration can obtain more useful information from the global graph. Besides, SRGI-CM performs better than SRGI-FM on Diginetica and Tmall, which demonstrates the effectiveness of global proximity for session-based recommendation. Furthermore, the performance of SRGI-CM drops in terms of both P@20 and MRR@20 on all three datasets when the graph depth is set to 4, which shows that deeper exploration can also introduce more noise into the global proximity learning. Overall, the results in Figure 4 demonstrate that the structure of the global graph contains global transition information that is useful for the current session.

[Figure 5: The performance of models with different numbers of neighbors. Panels (a)–(c): MRR@20 on Diginetica, Tmall, and Nowplaying.]

5.6.3 Impact of the number of neighbors on the global graph.

As illustrated in Figure 1, a session may contain items that are irrelevant to the user's main intent. Since we collect the $\epsilon$-neighbor set of each item to obtain the global item-transition information, the inclusion of noise is unavoidable. In the proposed method, we therefore keep only the top-$N$ edges with the highest weights (i.e., co-occurrence frequencies) to reduce the influence of noise. As the number of neighbors $N$ increases, more useful global transition information can be captured, but more noise is introduced at the same time. Here, we conduct experiments on the three datasets to evaluate the impact of $N$ on the performance of our proposed method.
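As a rough sketch of this construction (with hypothetical names and defaults; the actual procedure may differ in details such as edge direction and weighting), the global graph can be built by counting $\epsilon$-window co-occurrences and pruning to the top-$N$ edges per item:

```python
from collections import Counter, defaultdict

def build_global_graph(sessions, epsilon=2, top_n=12):
    """Count co-occurrences within an epsilon-sized window over all
    sessions, then keep only the top-N highest-weight edges per item."""
    weight = Counter()
    for s in sessions:
        for i, item in enumerate(s):
            lo, hi = max(0, i - epsilon), min(len(s), i + epsilon + 1)
            for j in range(lo, hi):
                if j != i:
                    weight[(item, s[j])] += 1

    neighbors = defaultdict(list)
    for (u, v), w in weight.items():
        neighbors[u].append((v, w))
    return {u: sorted(nbs, key=lambda x: -x[1])[:top_n]
            for u, nbs in neighbors.items()}
```

For example, build_global_graph([[1, 2, 3], [2, 3, 4]], epsilon=1, top_n=4) retains, for item 2, the neighbors 3 (weight 2) and 1 (weight 1).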

From Figure 5, we observe that as the number of neighbors increases from 4 to 12, the performance of SRGI-CM in terms of MRR@20 improves on all three datasets. This is because SRGI-CM can capture more useful global transition information from the neighbors of each item, which benefits the prediction for the current session. However, the performance of SRGI-CM drops when more neighbors are kept in the global graph, as it is affected by the increasing noise among the neighbors.

5.6.4 Impact of the trade-off parameter $\lambda_C$ in SRGI-CM.

The trade-off parameter $\lambda_C$ controls the influence of the global-proximity objective: a higher $\lambda_C$ means that SRGI-CM pays more attention to the global-level item transitions. We therefore conduct experiments to study the impact of $\lambda_C$ on the performance of SRGI-CM on the three datasets. From Table IV, we observe that as $\lambda_C$ increases, the performance of SRGI-CM in terms of P@20 improves on Diginetica and Nowplaying, which shows the benefit of global information to the current session. However, the performance of SRGI-CM in terms of MRR@20 drops on Nowplaying as $\lambda_C$ increases. This is because paying too much attention to global information interferes with learning the current session. We consider $\lambda_C = 50$ or $\lambda_C = 100$ appropriate for most scenarios.
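For clarity, the role of $\lambda_C$ in the joint objective can be sketched as follows, where scores, targets, and con_loss are hypothetical names for the next-item logits, the ground-truth items, and the global-proximity contrastive term:

```python
import torch.nn.functional as F

def joint_loss(scores, targets, con_loss, lambda_c=50.0):
    """Multi-task objective: next-item cross-entropy plus the
    contrastive term weighted by the trade-off parameter lambda_C."""
    rec_loss = F.cross_entropy(scores, targets)
    return rec_loss + lambda_c * con_loss
```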

TABLE IV: The performance of SRGI-CM with different $\lambda_C$.

Dataset             |  Diginetica    |     Tmall      |   Nowplaying
Measures            | P@20   MRR@20  | P@20   MRR@20  | P@20   MRR@20
$\lambda_C = 10$    | 54.26  19.33   | 36.83  17.02   | 22.52  8.92
$\lambda_C = 50$    | 54.61  19.37   | 37.28  17.85   | 22.83  8.86
$\lambda_C = 100$   | 54.71  19.32   | 36.63  17.77   | 22.86  8.73
$\lambda_C = 150$   | 54.71  19.43   | 35.84  18.02   | 22.72  8.55
$\lambda_C = 200$   | 54.57  19.47   | 35.03  17.98   | 22.75  8.49
TABLE V: The performance of B-GNN with different session-graph encoders.

Dataset   |  Diginetica    |     Tmall      |   Nowplaying
Measures  | P@20   MRR@20  | P@20   MRR@20  | P@20   MRR@20
B-GGNN    | 53.89  18.81   | 34.88  14.89   | 22.35  7.80
B-GAT     | 53.61  18.85   | 35.16  15.40   | 21.70  8.20
B-GNN     | 53.93  19.05   | 35.77  16.23   | 22.16  8.78
TABLE VI: The performance of B-GNN with different position encodings.

Dataset    |  Diginetica    |     Tmall      |   Nowplaying
Measures   | P@20   MRR@20  | P@20   MRR@20  | P@20   MRR@20
B-GNN-WP   | 53.08  18.54   | 34.66  16.04   | 21.43  8.08
B-GNN-FP   | 53.64  18.76   | 35.36  16.19   | 22.07  8.65
B-GNN      | 53.93  19.05   | 35.77  16.23   | 22.16  8.78

5.7 Comparison with Different Session-level Item Representation Learning Methods (RQ3)

In this work, we use mean pooling-based GNNs to convey information from an item's neighbors to the item itself in the session graph. Specifically, we consider three different kinds of relations for each item and utilize a fully connected layer to obtain the new representation of each item. As the representation learning of items is crucial for session-based recommendation, we conduct experiments to compare our proposed B-GNN with a series of contrast models:

  • B-GGNN: B-GNN with gated GNNs replacing the mean pooling-based GNNs.

  • B-GAT: B-GNN with GAT replacing the mean pooling-based GNNs.

Table V shows the performance of B-GNN and the contrast models. We can observe that the proposed B-GNN outperforms B-GGNN and B-GAT on Diginetica and Tmall. Gated GNNs introduce gated recurrent units into GNNs, and GAT brings the attention mechanism into graphs; both enhance the feature-extraction ability of a graph neural network. However, the extra parameters and nonlinear layers make the network more prone to overfitting, and these models do not consider the three kinds of relations in the session graph. Our GNN layer instead uses mean pooling to reduce the number of parameters while still capturing the three kinds of relations in the session graph (a minimal sketch of this layer is given below). The results in Table V demonstrate the effectiveness of our proposed GNN layer.
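The following sketch illustrates such a mean-pooling layer, assuming for illustration that the three relations are incoming edges, outgoing edges, and the self-connection, and that the adjacency matrices are row-normalized so that a matrix product equals mean pooling; the actual B-GNN layer may differ in these details.

```python
import torch
import torch.nn as nn

class MeanPoolSessionGNN(nn.Module):
    """Sketch of a mean pooling-based session-graph layer: neighbor
    features are mean-pooled per relation, concatenated, and mapped
    back to the item dimension by one fully connected layer."""
    def __init__(self, dim):
        super().__init__()
        self.fc = nn.Linear(3 * dim, dim)

    def forward(self, h, adj_in, adj_out):
        h_in = adj_in @ h    # mean of in-neighbors
        h_out = adj_out @ h  # mean of out-neighbors
        return torch.relu(self.fc(torch.cat([h_in, h_out, h], dim=-1)))
```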

5.8 Comparison with Different Session Encoders (RQ4)

To effectively learn the importance of each item in the session sequence, we propose a reversed-position-aware session encoder. It combines reversed position information with the session information, driving the model to learn the contribution of each item in the current session. To verify this design and evaluate the effectiveness of using the position vectors in reverse order, we design a series of contrast models:

  • B-GNN-WP: B-GNN without using position information when aggregating the item features.

  • B-GNN-FP: B-GNN with forward position vector replacing the reversed position vector.

Table VI shows the performance of the different contrast models. We can observe that the performance of B-GNN-WP is not satisfactory, as without position vectors the model is unable to capture the chronological information in the sessions. B-GNN-FP performs better than B-GNN-WP on all three datasets, which demonstrates the importance of position information. However, the forward position vector cannot capture the distance between each item and the prediction target, which limits the improvement. Our attention network with reversed position vectors performs better than B-GNN-FP on all three datasets, because the reversed position vector encodes the relative distance between each item and the prediction target. These results demonstrate the effectiveness of the reversed position vector and the superiority of our session encoder.
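To make the reversed-position idea concrete, here is a minimal sketch in which item i in a session of length L receives position index L - 1 - i, so the item closest to the prediction target always gets index 0 regardless of session length. The position embeddings are added for simplicity; the actual encoder may concatenate and transform them instead.

```python
import torch
import torch.nn as nn

class ReversedPositionEncoder(nn.Module):
    """Attach reversed position embeddings: the last clicked item
    (nearest to the prediction target) always gets position 0."""
    def __init__(self, max_len, dim):
        super().__init__()
        self.pos_emb = nn.Embedding(max_len, dim)

    def forward(self, item_h):  # item_h: (L, d) item representations
        L = item_h.size(0)
        rev = torch.arange(L - 1, -1, -1, device=item_h.device)  # L-1, ..., 0
        return item_h + self.pos_emb(rev)
```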

6 Conclusion

This paper studies the problem of session-based recommendation, which is challenging because user identities and historical interactions are often unavailable in real-world scenarios. It presents two different ways to leverage high-order global information for session-based recommendation via GNNs: (i) SRGI-FM, which recursively incorporates the neighbors' features of each node on the global graph via a session-aware attention mechanism; and (ii) SRGI-CM, which treats session-based recommendation as a multi-task learning problem and utilizes graph contrastive learning to preserve global proximity when learning item representations. Furthermore, the proposed method captures three kinds of relations within the session and incorporates reversed position embeddings to better learn the contribution of each item. Comprehensive experiments demonstrate that the proposed method consistently and significantly outperforms state-of-the-art baselines on three benchmark datasets, indicating that it can be effectively applied to real-world session-based recommendation problems.

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China under Grant No.61602197 and Grant No.61772076, and in part by Equipment Pre-Research Fund for The 13th Five-year Plan under Grant No.41412050801.

References

  • [1] Z. Wang, W. Wei, G. Cong, X.-L. Li, X.-L. Mao, and M. Qiu, “Global context enhanced graph neural networks for session-based recommendation,” in Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2020, pp. 169–178.
  • [2] A. Mnih and R. R. Salakhutdinov, “Probabilistic matrix factorization,” in Advances in neural information processing systems, 2008, pp. 1257–1264.
  • [3] S. Kabbur, X. Ning, and G. Karypis, “Fism: factored item similarity models for top-n recommender systems,” in Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, 2013, pp. 659–667.
  • [4] C.-K. Hsieh, L. Yang, Y. Cui, T.-Y. Lin, S. Belongie, and D. Estrin, “Collaborative metric learning,” in Proceedings of the 26th international conference on world wide web, 2017, pp. 193–201.
  • [5] S. Rendle, C. Freudenthaler, and L. Schmidt-Thieme, “Factorizing personalized markov chains for next-basket recommendation,” in Proceedings of the 19th international conference on World wide web, 2010, pp. 811–820.
  • [6] X. He, L. Liao, H. Zhang, L. Nie, X. Hu, and T.-S. Chua, “Neural collaborative filtering,” in Proceedings of the 26th international conference on world wide web, 2017, pp. 173–182.
  • [7] D. Liang, R. G. Krishnan, M. D. Hoffman, and T. Jebara, “Variational autoencoders for collaborative filtering,” in Proceedings of the 2018 world wide web conference, 2018, pp. 689–698.
  • [8] X. Li, M. de Rijke, Y. Liu, J. Mao, W. Ma, M. Zhang, and S. Ma, “Learning better representations for neural information retrieval with graph information,” in Proceedings of the 29th ACM International Conference on Information & Knowledge Management, 2020, pp. 795–804.
  • [9] B. Hidasi, A. Karatzoglou, L. Baltrunas, and D. Tikk, “Session-based recommendations with recurrent neural networks,” in ICLR, 2016.
  • [10] Y. Li, D. Tarlow, M. Brockschmidt, and R. Zemel, “Gated graph sequence neural networks,” in ICLR, 2016.
  • [11] S. Wang, L. Hu, Y. Wang, Q. Z. Sheng, M. Orgun, and L. Cao, “Modeling multi-purpose sessions for next-item recommendations via mixture-channel purpose routing networks,” in IJCAI, 2019, pp. 3771–3777.
  • [12] J. Li, P. Ren, Z. Chen, Z. Ren, T. Lian, and J. Ma, “Neural attentive session-based recommendation,” in CIKM, 2017, pp. 1419–1428.
  • [13] Q. Liu, Y. Zeng, R. Mokhosi, and H. Zhang, “Stamp: short-term attention/memory priority model for session-based recommendation,” in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2018, pp. 1831–1839.
  • [14] S. Wu, Y. Tang, Y. Zhu, L. Wang, X. Xie, and T. Tan, “Session-based recommendation with graph neural networks,” in AAAI, 2019, pp. 346–353.
  • [15] R. Qiu, J. Li, Z. Huang, and H. Yin, “Rethinking the item order in session-based recommendation with graph neural networks,” in CIKM, 2019, pp. 579–588.
  • [16] F. Yu, Y. Zhu, Q. Liu, S. Wu, L. Wang, and T. Tan, “Tagnn: Target attentive graph neural networks for session-based recommendation,” arXiv preprint arXiv:2005.02844, 2020.
  • [17] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio, “Graph attention networks,” in ICLR, 2018.
  • [18] M. Wang, P. Ren, L. Mei, Z. Chen, J. Ma, and M. de Rijke, “A collaborative session-based recommendation approach with parallel memory modules,” in SIGIR, 2019.
  • [19] A. Luo, P. Zhao, Y. Liu, F. Zhuang, D. Wang, J. Xu, J. Fang, and V. S. Sheng, “Collaborative self-attention network for session-based recommendation,” in Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI-20.   International Joint Conferences on Artificial Intelligence Organization, 7 2020, pp. 2591–2597. [Online]. Available: https://doi.org/10.24963/ijcai.2020/359
  • [20] G. Shani, D. Heckerman, and R. I. Brafman, “An mdp-based recommender system,” Journal of Machine Learning Research, vol. 6, 2005, pp. 1265–1295.
  • [21] P. Ren, Z. Chen, J. Li, Z. Ren, J. Ma, and M. de Rijke, “Repeatnet: A repeat aware neural recommendation machine for session-based recommendation,” in AAAI, 2019.
  • [22] Y. K. Tan, X. Xu, and Y. Liu, “Improved recurrent neural networks for session-based recommendations,” in Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, 2016, pp. 17–22.
  • [23] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in NIPS, 2017, pp. 5998–6008.
  • [24] W.-C. Kang and J. McAuley, “Self-attentive sequential recommendation,” in ICDM, 2018, pp. 197–206.
  • [25] J. Song, H. Shen, Z. Ou, J. Zhang, T. Xiao, and S. Liang, “Islf: Interest shift and latent factors combination model for session-based recommendation,” in IJCAI, 2019, pp. 5765–5771.
  • [26] Y. Pu, Z. Gan, R. Henao, X. Yuan, C. Li, A. Stevens, and L. Carin, “Variational autoencoder for deep learning of images, labels and captions,” in Advances in neural information processing systems, 2016, pp. 2352–2360.
  • [27] S. Wang, L. Hu, Y. Wang, L. Cao, Q. Z. Sheng, and M. Orgun, “Sequential recommender systems: Challenges, progress and prospects,” in IJCAI, 2019, pp. 6332–6338.
  • [28] C. Xu, P. Zhao, Y. Liu, V. S. Sheng, J. Xu, F. Zhuang, J. Fang, and X. Zhou, “Graph contextualized self-attention network for session-based recommendation,” in IJCAI, 2019, pp. 3940–3946.
  • [29] W. Meng, D. Yang, and Y. Xiao, “Incorporating user micro-behaviors and item knowledge into multi-task learning for session-based recommendation,” in Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2020, pp. 1091–1100.
  • [30] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl, “Item-based collaborative filtering recommendation algorithms,” in Proceedings of the 10th international conference on World Wide Web, 2001, pp. 285–295.
  • [31] D. Jannach and M. Ludewig, “When recurrent neural networks meet the neighborhood for session-based recommendation,” in Proceedings of the Eleventh ACM Conference on Recommender Systems, 2017, pp. 306–310.
  • [32] B. Perozzi, R. Al-Rfou, and S. Skiena, “Deepwalk: Online learning of social representations,” in Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, 2014, pp. 701–710.
  • [33] A. Grover and J. Leskovec, “node2vec: Scalable feature learning for networks,” in Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, 2016, pp. 855–864.
  • [34] J. Qiu, Y. Dong, H. Ma, J. Li, K. Wang, and J. Tang, “Network embedding as matrix factorization: Unifying deepwalk, line, pte, and node2vec,” in Proceedings of the eleventh ACM international conference on web search and data mining, 2018, pp. 459–467.
  • [35] P. Velickovic, W. Fedus, W. L. Hamilton, P. Liò, Y. Bengio, and R. D. Hjelm, “Deep graph infomax.” in ICLR (Poster), 2019.
  • [36] Z. Peng, W. Huang, M. Luo, Q. Zheng, Y. Rong, T. Xu, and J. Huang, “Graph representation learning via graphical mutual information maximization,” in Proceedings of The Web Conference 2020, 2020, pp. 259–270.
  • [37] Y. You, T. Chen, Y. Sui, T. Chen, Z. Wang, and Y. Shen, “Graph contrastive learning with augmentations,” arXiv preprint arXiv:2010.13902, 2020.
  • [38] Y. Zhu, Y. Xu, F. Yu, Q. Liu, S. Wu, and L. Wang, “Graph contrastive learning with adaptive augmentation,” arXiv preprint arXiv:2010.14945, 2020.
  • [39] P. Gupta, D. Garg, P. Malhotra, L. Vig, and G. Shroff, “Niser: Normalized item and session representations with graph neural networks,” arXiv preprint arXiv:1909.04276, 2019.
  • [40] F. Wang, X. Xiang, J. Cheng, and A. L. Yuille, “Normface: L2 hypersphere embedding for face verification,” in Proceedings of the 25th ACM international conference on Multimedia, 2017, pp. 1041–1049.
  • [41] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” in ICLR, 2017.
  • [42] W. Hamilton, Z. Ying, and J. Leskovec, “Inductive representation learning on large graphs,” in NIPS, 2017, pp. 1024–1034.
  • [43] E. Zangerle, M. Pichl, W. Gassler, and G. Specht, “#nowplaying music dataset: Extracting listening behavior from twitter,” in ISMM, 2014, pp. 21–26.
  • [44] J. Weston, S. Chopra, and A. Bordes, “Memory networks,” arXiv preprint arXiv:1410.3916, 2014.
  • [45] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” JMLR, pp. 1929–1958, 2014.