
GRAINRec: Graph and Attention Integrated Approach for Real-Time Session-Based Item Recommendations

Bhavtosh Rath Data Sciences, Target Corporation
Brooklyn Park, MN, USA
[email protected]
   Pushkar Chennu Data Sciences, Target Corporation
Brooklyn Park, MN, USA
[email protected]
   David Relyea Data Sciences, Target Corporation
Brooklyn Park, MN, USA
[email protected]
Prathyusha Kanmanth Reddy Data Sciences, Target Corporation
Brooklyn Park, MN, USA
[email protected]
   Amit Pande Data Sciences, Target Corporation
Brooklyn Park, MN, USA
[email protected]
  
Abstract

Recent advancements in session-based recommendation models using deep learning techniques have demonstrated significant performance improvements. While these techniques enhance model sophistication and improve the relevance of recommendations, they also make it challenging to implement a scalable real-time solution. To address this challenge, we propose GRAINRec, a Graph and Attention Integrated session-based recommendation model that generates recommendations in real time. Our scope of work is item recommendations in online retail, where a session is defined as an ordered sequence of digital guest actions, such as page views or adds to cart. The proposed model generates recommendations by considering the importance of all items in the session together, letting us predict relevant recommendations dynamically as the session evolves rather than relying on pre-computed recommendations for each item. We also propose a heuristic approach to implement real-time inferencing that meets Target platform's service level agreement (SLA). Evaluation results of the proposed model show an average improvement of 1.5% across all offline evaluation metrics. A/B tests run over a 2-week period showed a 10% increase in click-through rate and a 9% increase in attributable demand. Extensive ablation studies are also done to understand model performance across different parameters.

Index Terms:
Session based recommendation, Graph Neural Network, Attention, Deep Learning, Real-time inferencing.

I Introduction

Recommender systems play an important role in retail guest experiences by predicting and displaying products that guests would be most interested in purchasing. As per a report by BusinessWire (https://www.businesswire.com/news/home/20220530005180/en/Artificial-Intelligence-in-Retail-2022-Market-Research-Report—Global-Industry-Analysis-and-Growth-Forecast-to-2030—ResearchAndMarkets.com), e-commerce sales are expected to reach $7.3 trillion by 2025, which will drive the AI-in-retail market value to $36,462.5 million by 2030 from an estimated $1,714.3 million in 2021. It is thus a critical area of research, not just in academia but, more importantly, in industry. In this paper we propose a session-based recommendation model with real-time inferencing capability and analyze the quality of its recommendations.

To explain what a session-based recommendation model is, we first define a session. A session is a sequence of ordered items in close temporal proximity. In the context of item recommendations in online retail, a session might consist of items browsed, purchased, or added to cart within a few minutes of each other. Session-based recommendations face two major challenges relative to traditional single-item recommender systems:
1) While traditional models provide recommendations in the context of the current item being considered by a user, a session-based model generates recommendations considering the overall context of all items in the session, which is comparatively harder to capture. Suppose a guest intends to buy three items in a session. When milk is added as the first item, traditional recommendations are items related to milk, like almond milk, butter, and yogurt. If the user adds eggs next, traditional models tend to recommend items related to eggs (like egg whites, brown eggs, 12-count eggs, 18-count eggs). However, GRAINRec considers the combined context (milk and eggs) to recommend breakfast items like bread, cereal, and coffee. If sugar gets added as the third item, existing models might recommend items related to sugar like stevia sweetener, granulated sugar, and organic sugar. GRAINRec would consider the basket context of all three items and recommend baking items like cake flour, frosting, and baking powder.
2) A major challenge for session-based models is real-time inferencing. For single-item recommender models, collaborative filtering-based or item embedding-based recommendations can be generated offline and served via a lookup in real time. In the previous example, for a retailer with a catalog of, say, 1 million items, a session of length just N = 3 would require a $10^{12}$-sized lookup table! These kinds of dynamic recommendations cannot be pre-computed using existing models.

The contributions of the paper can be summarized as follows:

1. We propose GRAINRec, which builds upon the foundation of the LESSR model [23] while incorporating enhancements (detailed in Section V) critical to achieving both high performance and scalability in a high-rate real-time framework.

2. Retail sessions evolve dynamically, so item recommendations must be generated in real time. This paper addresses this concern via a nearest neighbor matrix approach.

3. The paper presents comprehensive offline and online evaluation results, along with ablation studies that demonstrate GRAINRec’s effectiveness. Additionally, we provide a detailed explanation of the model parameters and the inference setup used to deploy the training and inference architecture on Target’s platform.

II Related Work

II-A Session-based recommender systems

Session-based item recommendation models have garnered significant attention in recent years due to their ability to provide personalized recommendations based on item relationships. They model dynamic user preferences and provide recommendations that are sensitive to the evolution of session context. Many papers have proposed session-based recommendation models for different use cases. Wang et al. [1] performed a comprehensive survey in this regard. In the domain of news recommendations, Song et al. [2] proposed a session-based model that optimizes for the freshness of recommended content while also modeling long-term topic preferences. Shen et al. [3] proposed a recency-regularized neural attentive framework that uses users' sequential commenting behavior to recommend news. For music recommendation, Wang et al. [4] proposed a context-based model that mines knowledge from public opinion such as comments, music reviews, and social tags. Bao et al. [5] presented a session-based location recommendation system that leverages sparse geo-social data, combining user preferences and geographic information to suggest locations. Wang et al. [6] presented a dynamic time-aware attention model for recommending points of interest (POIs) by analyzing users' check-in sessions, capturing temporal patterns and user preferences. For video recommendations, Beutel et al. [7] proposed a recurrent neural network-based approach over sequences of YouTube watches to integrate the session context.

II-B Session-based recommendations for items

This subsection focuses on papers proposed specifically for item recommendations. Yap et al. [8] proposed a sequential pattern mining based next-items recommendation algorithm that predicts users' next accesses by identifying frequent patterns; the paper introduces a personalized framework using a competence score measure to improve recommendation accuracy for individual users. Hu et al. [9] proposed a KNN-based model that captures personalized item frequency information. Le et al. [10] proposed a Markov chain-based model that introduces generative modeling incorporating dynamic user-biased emission and context-biased transition, improving next-item prediction accuracy. Hidasi et al. [11] introduced GRU4Rec, which used Gated Recurrent Units (GRUs) for session-based recommendations; its ability to model complex user behavior sequences laid the foundation for subsequent research in session-based recommendations using deep learning. Li et al. [12] proposed NARM, a neural attentive model that incorporated an attention mechanism to focus on relevant parts of the session history, enhancing recommendation accuracy. Kang and McAuley [13] introduced SASRec, a self-attentive sequential recommendation model that employs the transformer architecture; by leveraging self-attention, SASRec captures long-range dependencies within a session more effectively than RNN-based approaches. Wang et al. [14] proposed a novel approach using hypergraphs for session-based recommendations. The Sequential Hypergraph Attention Network (SHAN) models user sessions as hypergraphs, where hyperedges connect multiple items; this structure allows the model to capture higher-order item co-occurrences and complex dependencies within sessions, and its attention mechanism further enhances its ability to focus on the most relevant parts of the session, improving recommendation performance. Tang and Wang [15] introduced Caser, a Convolutional Sequence Embedding Recommendation model, which applies convolutional neural networks (CNNs) to model user behavior sequences; Caser captures both point-level and union-level sequential patterns by treating the embedding matrix as an "image" and applying horizontal and vertical convolutional filters, showing significant improvement over traditional sequential recommendation methods. Shi et al. [16] proposed a session-based recommendation model using Heterogeneous Information Network (HIN) embeddings; this model captures various types of interactions and relationships among items by representing the session data as a heterogeneous graph, and the resulting embeddings incorporate both structural and semantic information, improving accuracy. Liu et al. [17] proposed STAMP, a session-based recommendation model that integrates short-term attention and long-term memory: an attention mechanism captures users' short-term interests within a session, while a memory network stores long-term preferences, allowing STAMP to dynamically balance recent interactions against long-standing preferences.

II-C Session-based recommender models on graph data

This subsection highlights session-based recommendation research on graph data. Gao et al. [18] conducted a comprehensive review of the literature on graph neural network-based recommender systems. Wu et al. [19] proposed SR-GNN, which models sessions as graphs where items are nodes and interactions are edges; the model utilizes Graph Neural Networks (GNNs) to capture complex item transitions within sessions, significantly improving recommendation accuracy. Pan et al. [20] proposed a star graph neural network model for session-based recommendations; its gated mechanism helps the model selectively focus on important parts of the session graph, enhancing overall recommendation performance by capturing more nuanced user behaviors. Wang et al. [21] introduced a knowledge-aware GNN that incorporates external knowledge graphs into the recommendation process; by integrating knowledge with session data and applying label smoothness regularization, the model achieves improved recommendation accuracy. Li et al. [22] presented a Temporal Graph Neural Network (TGNN) for session-based recommendation that captures temporal dynamics within sessions by modeling time-aware item transitions, enhancing the model's ability to understand user behavior changes over time. Chen and Wong [23] proposed LESSR, the GNN- and attention-based approach from which this work borrows its model; their work, however, does not propose a real-time inferencing solution.

III Problem Definition

A session-based item recommendation model recommends the next best item relevant to the context of the item(s) in the session. Let $I = \{i_1, i_2, \ldots, i_n\}$ denote the set of all $n$ items in the model. A session $S_i = [s_{(i,1)}, s_{(i,2)}, \ldots, s_{(i,t)}]$ is an item sequence of length $t$ with items ordered with respect to time and $s_{(i,j)} \in I$. The objective of the model is to predict the next item $s_{(i,t+1)}$.

IV Model Architecture

Figure 1: Architecture of GRAINRec. Data processing generates ordered sequences of items interacted within a session, which are batched together as directed graphs. Model training consists of interchanging graph neural network and attention layers that generate item embeddings, followed by a readout layer that generates session embeddings. A nearest neighbor matrix is used to perform real-time inferencing.

GRAINRec's architecture can be divided into 3 modules: data processing, model training, and real-time inferencing. The architecture is shown in Figure 1. We explain each module in detail below.

IV-A Data Processing

In the data processing phase we generate a temporally-ordered sequence of items added to a cart. Each sequence can be represented as a directed graph. For example, if a guest adds three items $a$, $b$, and $c$ in that order, the sequence can be represented as a directed graph ($a \rightarrow b \rightarrow c$), where the edge $a \rightarrow b$ means item $b$ was added to cart after item $a$. The graph does not contain self-loops ($a \rightarrow a$), but the same item $a$ can appear more than once in the sequence. During model training the sequences are batched, meaning all sequences in the batch are merged to generate a single graph representation. An illustration is shown in Figure 1, where two ordered sequences ($a \rightarrow b \rightarrow c$ and $c \rightarrow b \rightarrow d$) are merged to generate a single graph representation.
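To make the batching step concrete, the following is a minimal Python sketch of how ordered sessions might be merged into a single directed graph. The function name and edge-list representation are our own illustration; a production pipeline would typically use a graph library instead.

```python
from collections import defaultdict

def build_batched_graph(sessions):
    """Merge ordered item sequences into a single directed graph.

    sessions: list of item-ID lists, e.g. [["a", "b", "c"], ["c", "b", "d"]].
    Returns (nodes, edges), where edges maps (src, dst) -> occurrence count.
    """
    nodes = set()
    edges = defaultdict(int)
    for session in sessions:
        nodes.update(session)
        for src, dst in zip(session, session[1:]):
            if src != dst:  # self-loops (a -> a) are excluded
                edges[(src, dst)] += 1
    return nodes, dict(edges)

nodes, edges = build_batched_graph([["a", "b", "c"], ["c", "b", "d"]])
# nodes == {"a", "b", "c", "d"}
# edges == {("a", "b"): 1, ("b", "c"): 1, ("c", "b"): 1, ("b", "d"): 1}
```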

IV-B Model Training

The proposed session-based recommendation architecture is a deep learning model with two types of neural network layers: a Gated Recurrent Unit (GRU)-based graph neural network layer that captures short-range relationships in the sequences, and an attention-based layer that captures long-range relationships. For simplicity, Figure 1 shows item $c$ predominantly learning from item $b$ through the GRU mechanism and from items $a, d$ through attention. The model alternates between layer types: layer=1 is GNN, layer=2 is GNN-Attention, layer=3 is GNN-Attention-GNN, and so on. This alternation propagates the features captured by each kind of layer more effectively. We also apply a readout layer that calculates a session embedding by concatenating the attention-based aggregate of all ordered items in the session (the global embedding) and the embedding of the last session item (the local embedding).

IV-B1 GRU-based graph neural network

We use a GRU unit instead of an LSTM because GRUs have been found to be more effective than LSTM units for session-based recommendation models [11]. Let $G = (V, E)$ represent the batched item sequence graph, where $V$ represents a set of nodes and $E$ represents a set of edges. Each node $v \in V$ has an associated feature vector $h_{v}^{l}$ ($l$ is the layer), which is passed as item embeddings into the GNN layer. The GNN layer employs a GRU unit, which allows for sequential processing of node features while maintaining the order of message passing in the graph. We can divide the GRU unit's task into 3 parts showing how we update the embeddings of $v$:
1. Message Passing: Each node $v$ updates its feature vector $h_{v}$ with messages from its neighbors $u$ at layer $l$ using a binary message passing function ($f_{msg}^{l}$).

$f_{msg}^{l} = m_{u^{l} \rightarrow v^{l}}$ (1)

2. Message Aggregation: For each node $v$, the received messages from neighbors are aggregated using a GRU unit. $\mathcal{N}(v)$ denotes the set of neighbors of node $v$, and the aggregate message $h_{\mathcal{N}(v)}$ can be represented as:

$h_{\mathcal{N}(v^{l})} = \mathrm{GRU}(f_{msg}^{l}, \forall u \in \mathcal{N}(v))$ (2)

3. Node Update: Here we combine the node's own features with the aggregated features from its neighbors and update the embeddings of $v$. $\mathbf{W}_{v}$ and $\mathbf{W}_{neigh}$ are weight matrices for transforming the node's own features and the aggregated neighbor features, respectively.

$h_{v}^{l+1} = \mathbf{W}_{v}h_{v}^{l} + \mathbf{W}_{neigh}h_{\mathcal{N}(v)}$ (3)
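A simplified PyTorch sketch of Equations (1)-(3) is shown below. The per-node Python loop and dense edge list are for readability only, and the class name is our own; the deployed layer operates on batched session graphs through library-level message passing.

```python
import torch
import torch.nn as nn

class GRUMessagePassing(nn.Module):
    """Sketch of Eqs. (1)-(3): GRU aggregation of ordered neighbor messages."""
    def __init__(self, dim):
        super().__init__()
        self.gru = nn.GRU(dim, dim, batch_first=True)   # aggregator in Eq. (2)
        self.w_v = nn.Linear(dim, dim, bias=False)      # W_v in Eq. (3)
        self.w_neigh = nn.Linear(dim, dim, bias=False)  # W_neigh in Eq. (3)

    def forward(self, h, edges):
        """h: (num_nodes, dim) node features; edges: list of (src, dst) pairs."""
        out = []
        for v in range(h.size(0)):
            # Eq. (1): collect messages m_{u->v} from in-neighbors, in edge order
            msgs = [h[u] for (u, dst) in edges if dst == v]
            if not msgs:
                out.append(h[v])
                continue
            seq = torch.stack(msgs).unsqueeze(0)        # (1, |N(v)|, dim)
            _, h_neigh = self.gru(seq)                  # Eq. (2): GRU aggregation
            # Eq. (3): combine own features with aggregated neighbor features
            out.append(self.w_v(h[v]) + self.w_neigh(h_neigh.view(-1)))
        return torch.stack(out)
```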

IV-B2 Attention mechanism on graphs

We also leverage an attention mechanism to dynamically weigh node features. Given an input feature matrix $X \in \mathbb{R}^{V \times F}$ for a graph with $V$ nodes and $F$ features per node, we first apply batch normalization and dropout to improve model generalization. The implementation then applies linear layers for query, key, and value transformations to project the input features to a higher-dimensional space.

$Q = X\mathbf{W}_{Q}$ (4)
$K = X\mathbf{W}_{K}$ (5)
$V = X\mathbf{W}_{V}$ (6)

Here, $\mathbf{W}_{Q} \in \mathbb{R}^{F \times H}$, $\mathbf{W}_{K} \in \mathbb{R}^{F \times H}$, and $\mathbf{W}_{V} \in \mathbb{R}^{F \times O}$ are the learnable weight matrices. $Q$, $K$, and $V$ represent the queries, keys, and values, respectively.

Then, edge features $e$ are computed from the element-wise sum of queries and keys. These features are passed through a sigmoid function and transformed to derive the attention coefficients,

$e_{ij} = \sigma(Q_{i} + K_{j})\mathbf{W}_{e},$ (7)

where $e_{ij}$ is the attention score between node $i$ and node $j$, $\mathbf{W}_{e} \in \mathbb{R}^{H}$ is the weight vector for edge features, and $\sigma$ is the sigmoid activation function.

We then obtain the attention weights by applying a softmax function to the edge attention scores,

$\alpha_{ij}^{l} = \frac{\exp(e_{ij})}{\sum_{k \in \mathcal{N}(i)} \exp(e_{ik})},$ (8)

where $\mathcal{N}(i)$ denotes the set of neighbors of node $i$ in the session graph.

Finally, the output representation for each node $i$ is computed as a weighted sum of the value vectors of its neighbors.

$\mathbf{h}_{i}^{l+1} = \sum_{j \in \mathcal{N}(i)} \alpha_{ij}^{l}\mathbf{V}_{j}$ (9)
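The following is a condensed PyTorch sketch of Equations (4)-(9). For brevity it assumes the query/key width $H$ equals the value width $O$; the class and argument names are illustrative, not the paper's implementation.

```python
import torch
import torch.nn as nn

class EdgeAttention(nn.Module):
    """Sketch of Eqs. (4)-(9): edge-wise attention over a session graph."""
    def __init__(self, feat_dim, hid_dim, dropout=0.146):
        super().__init__()
        self.norm = nn.BatchNorm1d(feat_dim)
        self.drop = nn.Dropout(dropout)
        self.w_q = nn.Linear(feat_dim, hid_dim, bias=False)  # W_Q in Eq. (4)
        self.w_k = nn.Linear(feat_dim, hid_dim, bias=False)  # W_K in Eq. (5)
        self.w_v = nn.Linear(feat_dim, hid_dim, bias=False)  # W_V in Eq. (6)
        self.w_e = nn.Linear(hid_dim, 1, bias=False)         # W_e in Eq. (7)

    def forward(self, x, src, dst):
        """x: (N, F) node features; src/dst: (E,) edge index tensors."""
        x = self.drop(self.norm(x))
        q, k, v = self.w_q(x), self.w_k(x), self.w_v(x)
        # Eq. (7): e_ij = sigmoid(Q_i + K_j) W_e for each edge j -> i
        e = self.w_e(torch.sigmoid(q[dst] + k[src])).squeeze(-1)  # (E,)
        # Eq. (8): softmax over each node's incoming edges
        alpha = torch.zeros_like(e)
        for i in dst.unique():
            mask = dst == i
            alpha[mask] = torch.softmax(e[mask], dim=0)
        # Eq. (9): weighted sum of neighbor value vectors
        out = torch.zeros_like(v)
        out.index_add_(0, dst, alpha.unsqueeze(-1) * v[src])
        return out
```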

IV-B3 Readout layer

The readout module aggregates node features in a sequence using the attention mechanism. This module is particularly suited to scenarios where the importance of nodes varies and must be dynamically weighted, making it especially effective for tasks that require feature aggregation. In our case, after item embeddings are generated, the readout technique is used to generate a session embedding. The session is represented as an embedding vector $s \in \mathbb{R}^{d}$. We define the local session embedding ($s_{local}$) as the item embedding of the most recent item in the session; for session $[v_{1}, v_{2}, v_{3}, \ldots, v_{n}]$, $s_{local} = v_{n}$. Then, we consider the notion of a global session embedding ($s_{global}$), which is calculated using the attention mechanism. For the above session example,

$s_{local} = v_{n}$ (10)
$\alpha_{i} = q^{T}\sigma(\mathbf{W}_{1}x_{i}^{(L)} + \mathbf{W}_{2}x_{n}^{(L)} + r)$ (11)
$s_{global} = \sum_{i=1}^{n}\alpha_{i}x_{i}^{(L)}$ (12)

where $q \in \mathbb{R}^{d}$ and $\mathbf{W}_{1}, \mathbf{W}_{2} \in \mathbb{R}^{d \times d}$ are learned parameters and $r$ is the bias term. $\alpha_{i}$ denotes the attention score for each item $i$ in the session containing $n$ items. Finally, session embeddings are generated by applying a linear transformation to the concatenation of the local and global session embeddings,

$s = \mathbf{W}_{3}[s_{local} \,\|\, s_{global}],$ (13)

where $\mathbf{W}_{3} \in \mathbb{R}^{d \times 2d}$ is a parameter to be learned.

We obtain a score for the $n^{th}$ item by taking the dot product of the session embedding with its item embedding $V_{n}$:

$h_{n} = s^{T}V_{n}$ (14)

As we have set up our model as a multi-item classification problem, we use a softmax function to generate a probability distribution over the next-item prediction,

$\hat{y} = \mathrm{softmax}(h_{n})$ (15)

where $\hat{y} \in \mathbb{R}^{m}$ and $m$ is the number of items. We then use cross-entropy as the loss function:

$\mathcal{L}(y, \hat{y}) = -y^{T}\log(\hat{y})$ (16)

All parameters, including the item embeddings, are randomly initialized and jointly learned through an end-to-end backpropagation training process.
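A compact PyTorch sketch of the readout and prediction steps (Equations (10)-(16)) is given below for a single unbatched session. Reusing the item embedding table as the scoring matrix $V$ in Eq. (14) is our simplifying assumption, as are the class and parameter names.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Readout(nn.Module):
    """Sketch of Eqs. (10)-(16): session embedding and next-item prediction."""
    def __init__(self, d, num_items):
        super().__init__()
        self.q = nn.Parameter(torch.randn(d))      # q in Eq. (11)
        self.w1 = nn.Linear(d, d, bias=False)      # W_1 in Eq. (11)
        self.w2 = nn.Linear(d, d, bias=True)       # W_2 in Eq. (11), bias = r
        self.w3 = nn.Linear(2 * d, d, bias=False)  # W_3 in Eq. (13)
        self.item_emb = nn.Embedding(num_items, d) # doubles as V in Eq. (14)

    def forward(self, x, target=None):
        """x: (n, d) final-layer embeddings of the n session items, in order.
        target: optional (1,) LongTensor holding the true next item ID."""
        s_local = x[-1]                                               # Eq. (10)
        alpha = torch.sigmoid(self.w1(x) + self.w2(x[-1])) @ self.q   # Eq. (11)
        s_global = (alpha.unsqueeze(-1) * x).sum(dim=0)               # Eq. (12)
        s = self.w3(torch.cat([s_local, s_global]))                   # Eq. (13)
        scores = self.item_emb.weight @ s                             # Eq. (14)
        y_hat = F.softmax(scores, dim=0)                              # Eq. (15)
        loss = None
        if target is not None:
            # Eq. (16): cross-entropy against the true next item
            loss = F.cross_entropy(scores.unsqueeze(0), target)
        return y_hat, loss
```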

The primary distinction between our attention module and the Graph Attention Network (GAT) model [29] lies in how connections between items are represented. Our attention module models sequences as a multigraph: an edge exists from item1 to item2 if item2 appears after item1 anywhere within a sequence. In contrast, GAT uses a standard adjacency matrix (or edge list), where an edge from item1 to item2 exists only if item2 is the immediate successor of item1 in the sequence.
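As a toy illustration of the difference (helper names are our own, and the set form shown here drops the multiplicity a true multigraph would keep):

```python
def edges_immediate(seq):
    """GAT-style edge list: item1 -> item2 only when item2 is the
    immediate successor of item1 in the sequence."""
    return {(a, b) for a, b in zip(seq, seq[1:]) if a != b}

def edges_any_later(seq):
    """Multigraph-style edges used by our attention module: item1 -> item2
    whenever item2 appears anywhere after item1 in the sequence."""
    return {(seq[i], seq[j]) for i in range(len(seq))
            for j in range(i + 1, len(seq)) if seq[i] != seq[j]}

seq = ["a", "b", "c"]
# edges_immediate(seq) == {("a", "b"), ("b", "c")}
# edges_any_later(seq) == {("a", "b"), ("a", "c"), ("b", "c")}
```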

IV-C Real time inferencing

Motivation: Our original latency test was run on a microservice fetching real-time recommendations in the development environment. This test used a model trained only on guest actions on frequency items (separate from the model used for offline evaluation in this paper). Inference latency exceeded 400 ms, which was unacceptable. This motivated the nearest neighbor matrix solution for real-time inference.

Inference in our context means that as the guest session evolves dynamically, the model calculates a real-time session embedding and uses it to generate a handful of relevant recommendations from a catalog of millions of items. To meet the latency threshold, we use a heuristic approach: as part of model training, we generate a nearest neighbor matrix of size $no\_items \times no\_neighbors$ by finding the top-k nearest item embeddings to every item. Suppose we choose $no\_neighbors = 100$; then while inferring recommendations for the session $d \rightarrow e \rightarrow f$, our lookup space for recommendations is not the entire item catalog but roughly 300 items (i.e., the union of the 100 pre-calculated neighbors for each of the items $d, e, f$). We use this heuristic approach to address the latency-accuracy trade-off. Figure 2 shows how only the nearest neighbors of session items are considered to generate the final recommendations list.

Approximate Nearest Neighbor (ANN) search is a popular method for nearest neighbor generation; however, we opted for a nearest neighbor matrix because ANN-based recommendations across the entire catalog often lacked relevance. Our approach allows us to filter the item search space for each item more effectively. Additionally, ANN indexing introduces extra latency, which prevented our microservice from meeting SLA requirements during model inference.
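A brute-force sketch of how such a nearest neighbor matrix might be precomputed and used at serving time is shown below. The function names are illustrative, cosine similarity is our assumption, and a real catalog of millions of items would require chunked, memory-aware computation; k=100 matches the deployed neighborhood size.

```python
import torch

def build_neighbor_matrix(item_emb, k=100):
    """Precompute the top-k nearest items per item from trained embeddings.

    item_emb: (no_items, emb_dim) tensor of item embeddings.
    Returns a (no_items, k) integer matrix of neighbor indices.
    """
    emb = torch.nn.functional.normalize(item_emb, dim=1)
    sims = emb @ emb.T                      # pairwise cosine similarities
    sims.fill_diagonal_(-float("inf"))      # exclude each item itself
    return sims.topk(k, dim=1).indices

def candidate_set(neighbor_matrix, session_items):
    """Union of precomputed neighbors for all session items (~k per item),
    used as the restricted lookup space at inference time."""
    return torch.unique(neighbor_matrix[session_items].flatten())

# At serving time, only the items in candidate_set(...) are scored against
# the real-time session embedding, instead of the full catalog.
```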

Figure 2: Architecture of model inferencing

V Comparison of GRAINRec and LESSR

The proposed GRAINRec architecture is an enhanced version of the LESSR model [23], incorporating several key modifications to improve performance (Table III). Unlike the original LESSR model, which included self-loops allowing consecutive item repetitions, GRAINRec was trained without self-loops, which we found improved performance. Cross-category items were excluded from sequences before training and during inference to ensure more coherent session-based recommendations. We also conducted extensive hyperparameter tuning to optimize GRAINRec’s performance, an effort not explored in the LESSR study. Finally, to facilitate deployment in Target’s production environment, we implemented real-time inference using a nearest neighbor matrix.

VI Experiments

VI-A Evaluation against baseline models

We compared our proposed model against the following baseline models:
1. Item-KNN [25]: Recommends items similar to the most recently added item in the session by calculating cosine similarity scores; item embeddings are generated using word2vec.
2. FPMC [26]: A Markov chain-based model for next-basket recommendation, customized in our case to predict the next item.
3. GRU4Rec [11]: A recurrent neural network model for session-based next-item recommendations.
4. SR-GNN [19]: A session-based recommendation model using graph neural networks to model session sequences as directed graph-structured data.
5. SASRec [13]: A self-attention based sequential model for item recommendations.
6. LESSR [23]: A model that combines graph neural networks with attention.
7. GRAINRec: Our proposed graph- and attention-integrated model described above.

VI-B Dataset statistics and parameter setup

TABLE I: Dataset statistics

| Item type | No. of items | No. of sequences |
| Frequency items | 36,669 | 68,923,237 |
| Discretionary items | 257,397 | 97,140,831 |

All models were trained using item sequences collected by Target, a large retail company. To mitigate data sparsity, we included only items that appeared in the sequence dataset at least 10 times during this period. We capped each sequence length at 20 items and filtered out consecutive duplicate items. The entire catalog was divided into two types: frequency (items usually purchased multiple times, such as groceries and cleaning products) and discretionary (items typically purchased once, such as electronics and furniture). Table I summarizes item and sequence count statistics for both item types used for offline evaluation. While frequency items make up a little over 10% of the catalog, they account for just over 40% of the guest action sequences. Also, for every sequence of length greater than 2 we include its prefixes for model training. Prefixes are contiguous subsequences of a given sequence that start at the first element and extend to every other element in the sequence; for the sequence $a \rightarrow b \rightarrow c \rightarrow d$, the prefixes are $a \rightarrow b$, $a \rightarrow b \rightarrow c$, and $a \rightarrow b \rightarrow c \rightarrow d$.
Training parameters: Hyperparameter tuning was performed using the scikit-optimize Python library. The model parameters were set as follows: an embedding dimension of 256, two layers (GNN-Attention), a dropout rate of 0.146, a learning rate of 0.00045, a batch size of 1024, and a decay rate of 0.0001. Weight matrix dimensions for outputs are as follows: model input: $no\_items \times emb\_dim$; GNN layer output: $emb\_dim \times emb\_dim$; attention layer output: $1 \times emb\_dim$; readout layer: $emb\_dim \times (3 \times emb\_dim)$; fully connected layer: $emb\_dim \times batch\_size$. We selected a 2-layer model to leverage the strengths of both the GNN and attention mechanisms while maintaining a balance between model complexity and efficiency. Adding more layers would increase the model's complexity, potentially extending training and inference times, which could prevent the model from meeting the required service-level agreements (SLAs) for real-time inference. The 2-layer architecture also lets the training workflow run more frequently. The model was trained using NVIDIA A100 3G 40GB multi-instance GPUs (3 instances, 40 GB of GPU memory).
Inferencing parameters: For model inference we set an upper limit on the session length to 3.

For both training and inference, a filter was applied to the session to ensure all items belong to the same category as the most recently added item. For example, if the session contains the items $sugar \rightarrow iphone \rightarrow carrots \rightarrow milk$, the filtered session is $sugar \rightarrow carrots \rightarrow milk$: $iphone$ belongs to a different category than $milk$ and is excluded, while $sugar$ and $carrots$ belong to the same category as $milk$ and are retained.
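The prefix generation and category filtering described above can be sketched as follows; `category_of` is an assumed item-to-category lookup, and the function names are our own.

```python
def make_prefixes(seq, min_len=2):
    """All prefixes of length >= min_len; each prefix yields one training
    example (prefix[:-1] as input, prefix[-1] as the next-item target)."""
    return [seq[:i] for i in range(min_len, len(seq) + 1)]

def filter_to_anchor_category(seq, category_of):
    """Keep only items in the same category as the most recent item.

    category_of: assumed item -> category mapping. For example, the session
    [sugar, iphone, carrots, milk] filters to [sugar, carrots, milk] when
    iphone is the only item outside milk's category."""
    anchor = category_of[seq[-1]]
    return [item for item in seq if category_of[item] == anchor]

# make_prefixes(["a", "b", "c", "d"]) ->
#   [["a", "b"], ["a", "b", "c"], ["a", "b", "c", "d"]]
```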

VI-C Inferencing setup

GRAINRec, like other deployed recommendation models at Target, is a guest-facing application running on Kubernetes (k8s). When setting up the inference infrastructure, we faced a critical decision: whether to run inference on CPU or GPU. For our use case, CPU was chosen for its simplicity and favorable cost economics. While GPUs are generally faster for model training, the simplicity of using CPUs for real-time request-response operations without the overhead of batching made them the preferred choice. Additionally, after optimizing our systems, we found that the cost-per-inference on CPU was significantly more economical.

It took less than 50ms (95th percentile latency) to generate GRAINRec recommendations for guests. We were able to achieve this low latency by implementing custom load balancing using gRPC as our RPC framework. We further reduced latency by building an in-memory customized local cache. Each instance of the GRAINRec PyTorch model is hosted within a Kubernetes microservice allocated with 2 CPUs and 10 GB of memory. This microservice has direct access to the real-time feature store, which contains comprehensive data on the purchase/browse/add-to-cart history of each guest. The throughput of each k8s pod is around 60 requests per second.

In our k8s environment, we faced challenges scaling GRAINRec PyTorch model inference on CPUs due to resource contention. PyTorch's default behavior of using multiple CPU cores per inference caused performance issues when multiple pods competed for the same cores. To resolve this, we limited each pod to a single thread for PyTorch inference by setting torch.set_num_threads(1), ensuring efficient concurrent inference across multiple pods.
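Concretely, this is a one-line configuration applied at service startup (shown here in minimal form; the surrounding service code is assumed):

```python
import torch

# Applied once at microservice startup: cap this pod's PyTorch runtime at a
# single intra-op thread so concurrently scheduled pods on the same host do
# not contend for CPU cores during inference.
torch.set_num_threads(1)
```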

VI-D A/B test results (Online evaluation)

TABLE II: Effect of hyperparameter tuning on online evaluation, shown as lift over the production control (CTR: Click-Through Rate, AD: Attributable Demand)

| | CTR | AD |
| Before hyperparameter tuning | +6.4% | +5.8% |
| After hyperparameter tuning | +10.1% | +9.2% |

The proposed model was A/B tested against a link prediction model already running in production; all reported increases are relative to that model. The evaluation used two metrics:
1. Click-Through Rate (CTR): This metric quantifies how often guests click on recommended items relative to how often those items are displayed.
2. Attributable Demand (AD): This metric quantifies the dollar sales generated directly from guests purchasing the recommended items.

Table II presents the impact of hyperparameter tuning on key performance metrics for the recommendation system. Before hyperparameter tuning, CTR increased by 6.4% and AD by 5.8%. When A/B testing was repeated after tuning, the CTR lift rose to 10.1% and the AD lift to 9.2%. These results demonstrate that hyperparameter tuning significantly enhances the effectiveness of the recommendation system, leading to greater guest engagement.

VI-E Evaluation on dataset (Offline evaluation)

TABLE III: Evaluation of proposed approach (GRAINRec) against baseline models

| | Frequency items | | | Discretionary items | | | Entire catalog | | |
| Models | hit@10 | mrr@10 | ndcg@10 | hit@10 | mrr@10 | ndcg@10 | hit@10 | mrr@10 | ndcg@10 |
| Item-KNN | 0.134 | 0.127 | 0.131 | 0.111 | 0.105 | 0.108 | 0.122 | 0.118 | 0.120 |
| FPMC | 0.151 | 0.143 | 0.149 | 0.123 | 0.116 | 0.119 | 0.145 | 0.133 | 0.139 |
| GRU4Rec | 0.219 | 0.103 | 0.121 | 0.207 | 0.101 | 0.123 | 0.221 | 0.123 | 0.129 |
| SR-GNN | 0.222 | 0.101 | 0.123 | 0.209 | 0.105 | 0.122 | 0.231 | 0.109 | 0.133 |
| SASRec | 0.248 | 0.119 | 0.148 | 0.228 | 0.109 | 0.139 | 0.235 | 0.134 | 0.144 |
| LESSR | 0.255 | 0.122 | 0.151 | 0.234 | 0.112 | 0.141 | 0.239 | 0.136 | 0.147 |
| GRAINRec | 0.258 | 0.124 | 0.153 | 0.238 | 0.114 | 0.144 | 0.245 | 0.139 | 0.149 |
| Improvement | 1.18% | 1.64% | 1.32% | 1.71% | 1.79% | 1.98% | 2.51% | 2.21% | 1.36% |

We evaluate our model against the baselines listed in Section VI-A using 3 popular metrics (sketched in code below):
1. Hit rate at 10 (hit@10): The proportion of test cases in which the relevant item appears within the top 10 recommendations provided by the system.
2. Mean Reciprocal Rank at 10 (mrr@10): The average reciprocal rank (the average of the inverse ranks) of the first relevant item across test cases.
3. Normalized Discounted Cumulative Gain at 10 (ndcg@10): Combines cumulative gain with a positional discount on the gain of relevant items in the recommendation list, normalized to ensure comparability across different recommendation lists.
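For single-target next-item evaluation these metrics reduce to simple rank statistics; a small sketch (our own helper, not the paper's evaluation code) follows. With one relevant item per test case, ndcg reduces to $1/\log_2(\text{rank}+1)$.

```python
import math

def offline_metrics(ranked_lists, targets, k=10):
    """hit@k, mrr@k and ndcg@k for single-target next-item evaluation.

    ranked_lists: per test case, the recommended item IDs in rank order.
    targets: per test case, the true next item.
    """
    hit = mrr = ndcg = 0.0
    for recs, target in zip(ranked_lists, targets):
        if target in recs[:k]:
            rank = recs[:k].index(target) + 1   # 1-based rank
            hit += 1
            mrr += 1.0 / rank
            ndcg += 1.0 / math.log2(rank + 1)   # single relevant item
    n = len(targets)
    return hit / n, mrr / n, ndcg / n
```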

The performance analysis of various recommendation models across different item categories is encapsulated in Table III.

For frequency item categories, Item-KNN is the weakest model, with scores of 0.134 (hit@10), 0.127 (mrr@10), and 0.131 (ndcg@10). As a word2vec-based approach, its results are low because it cannot capture sequential patterns in the item sequences. FPMC shows a slight improvement over Item-KNN, with scores of 0.151 (hit@10), 0.143 (mrr@10), and 0.149 (ndcg@10), as it captures position information by modeling sequential data through Markov chains. GRU4Rec and SR-GNN significantly outperform the traditional models, as they employ deep learning techniques. GRU4Rec, with 0.219 (hit@10), 0.103 (mrr@10), and 0.121 (ndcg@10), leverages gated recurrent units to capture sequential dependencies; its improved performance can be attributed to a gated neural network that carries forward information from previous items. A drawback of the model is that information passes in only one direction, which is overcome by a graphical representation of sequences. SR-GNN, with 0.222 (hit@10), 0.101 (mrr@10), and 0.123 (ndcg@10), enhances this further by incorporating graph neural networks to better model complex item relationships. SASRec exhibits stronger performance with 0.248 (hit@10), 0.119 (mrr@10), and 0.148 (ndcg@10); its self-attentive mechanism effectively models sequence patterns that capture long-range dependencies, providing a significant boost in recommendation accuracy. The LESSR model reaches 0.255 (hit@10), 0.122 (mrr@10), and 0.151 (ndcg@10), better than the previous baselines, as it incorporates both GNN and attention. Our proposed GRAINRec emerges as the best performer in this category with 0.258 (hit@10), 0.124 (mrr@10), and 0.153 (ndcg@10). In addition to the multi-layer attention and GNN-based architecture, the performance improvement stems from the removal of self-loops and cross-category items and from the nearest neighbor matrix.

In discretionary item categories, the trends observed are similar to those in the frequency item categories. Item-KNN and FPMC maintain their lower performance, with FPMC slightly better at 0.123 (hit@10), 0.116 (mrr@10), and 0.119 (ndcg@10) compared to Item-KNN's 0.111 (hit@10), 0.105 (mrr@10), and 0.108 (ndcg@10). GRU4Rec (0.207 hit@10, 0.101 mrr@10, 0.123 ndcg@10) and SR-GNN (0.209 hit@10, 0.105 mrr@10, 0.122 ndcg@10) continue to show their strength. SASRec again shows superior performance, achieving 0.228 (hit@10), 0.109 (mrr@10), and 0.139 (ndcg@10). LESSR does even better with 0.234 (hit@10), 0.112 (mrr@10), and 0.141 (ndcg@10). GRAINRec leads with 0.238 (hit@10), 0.114 (mrr@10), and 0.144 (ndcg@10), indicating its robustness.

When considering the entire catalog, Item-KNN (0.122 hit@10, 0.118 mrr@10, 0.12 ndcg@10) and FPMC (0.145 hit@10, 0.133 mrr@10, 0.139 ndcg@10) again exhibit lower scores compared to more sophisticated models. GRU4Rec and SR-GNN continue their competitive performance, with GRU4Rec at 0.221 (hit@10), 0.123 (mrr@10), and 0.129 (ndcg@10) showing strong consistency across metrics. SR-GNN, however, shows a marked drop in mrr@10 (0.109), possibly indicating some limitations in its architecture for broader item catalogs despite having good hit@10 (0.231) and ndcg@10 (0.133) scores. SASRec remains effective with scores of 0.235 (hit@10), 0.134 (mrr@10), and 0.144 (ndcg@10). LESSR outperforms other baselines with 0.239 (hit@10), 0.136 (mrr@10), and 0.147 (ndcg@10), but GRAINRec maintains its leading position, achieving 0.245 (hit@10), 0.139 (mrr@10), and 0.149 (ndcg@10).

Our proposed model shows an average improvement of over 1.5% over LESSR across all metrics. Comparing performance across the frequency, discretionary, and entire-catalog settings, we observe that the model trained on frequency categories performs best. This can be attributed to the lower data sparsity of the frequency dataset, which allows the model to learn item relationships more effectively. Items in the discretionary category, by contrast, are more numerous but occur less frequently, leading to poorer learning of their representations.

VI-F Multi-class classification

TABLE IV: Comparison of single- and multi-class classification for GRAINRec

| | hit@10 | mrr@10 | ndcg@10 |
| Single-class classification | 0.245 | 0.139 | 0.149 |
| Multi-class classification | 0.156 | 0.091 | 0.101 |

Table IV compares the performance of GRAINRec under the single-class and multi-class classification settings. Single-class classification significantly outperforms multi-class classification across all evaluation metrics. Specifically, hit@10, mrr@10, and ndcg@10 values for multi-class classification show a drop of 36.33%, 34.53% and 32.21% respectively, compared to single-class classification. This could be attributed to the fact that as the number of classes grows (the catalog has roughly 600k items), the number of decision boundaries that a learning algorithm must address also increases. Experimental evidence[28] suggests that with more decision boundaries, the complexity of the problem rises which significantly reduces model performance.

VII Ablation Studies

In this section, we conduct a series of ablation studies to systematically investigate the contributions of various components of our proposed model to its performance on the ndcg@10 metric. By selectively removing or modifying specific parameters of the architecture, we aim to understand the impact of each component on the model's performance, providing deeper insights into the strengths and potential areas for improvement of our approach.

VII-A Ordering of layers

Figure 3: ndcg@10 for different 4-layer configurations (G- Graph neural network, A- Attention mechanism)

Figure 3 summarizes the performance of the proposed model across different item categories using various configurations of graph neural network (G) and attention (A) layers. The configurations tested include a 4-layer GNN model (GGGG), a 4-layer attention model (AAAA), a 2-layer GNN followed by a 2-layer attention model (GGAA), the reverse order (AAGG), and alternating layer models (GAGA and AGAG).

The results show that GAGA, the proposed alternating layer model, outperforms all other configurations, demonstrating the effectiveness of alternating GNN and attention layers. This approach leverages the strengths of both types of layers, with the GNN layers capturing the context of nearby items and the attention layers capturing dependencies of more distant items. The alternation also helps mitigate the over-squashing [27] issue seen in consecutive GNN layers, where information propagation across long distances in the graph becomes inefficient. The AGAG configuration showed similar performance to GAGA, reinforcing the advantage of alternating the two types of layers for improving recommendation accuracy across all item categories.

VII-B Size of nearest neighbor matrix

Figure 4: ndcg@10 for different neighborhood sizes in the Nearest Neighbor Matrix

We investigate the relationship between model performance and the size of the nearest neighbor matrix in Figure 4, evaluating the model using ndcg@10 across six neighborhood sizes: 25, 50, 75, 100, 125, and 150. For the full catalog, the ndcg@10 value increases from 0.1316 to 0.1541. Discretionary items show a similar improvement, rising from 0.1251 to 0.1515, while frequency items exhibit the most significant growth, from 0.1370 to 0.1560. These results suggest that larger neighborhood sizes enhance recommendation accuracy across all categories, with frequency items achieving the highest ndcg@10 values. While the general trend shows increasing ndcg@10 values with larger neighborhood sizes, we observe diminishing marginal returns beyond a size of 100; the rate of increase slows down past this point. Consequently, we select a neighborhood size of 100 for our model, as larger sizes lead to inferencing times that fail to meet the platform's SLA.

VII-C Embedding dimension and number of layers

Figure 5: ndcg@10 for different embedding dimensions and layer configurations

The experiment results are presented in Figure 5, which displays the performance of the model across varying embedding dimensions: 32, 64, 128, 256, 512, and 1024. The models compared are a 2-layer Graph Neural Network with Attention (GNN-Attention), a 1-layer Attention model, and a 1-layer Graph Neural Network. Each subplot in the figure corresponds to a different item category: frequency items, discretionary items, and the full catalog. All three subplots share a consistent y-axis range, making it easier to compare results across item categories.

In the first subplot, representing frequency items, the 2-layer GNN-Attention model consistently outperforms the other models across all embedding dimensions. Specifically, the ndcg@10 values for the GNN-Attention model range from 0.1421 to 0.1555. The 1-layer Attention model shows moderate performance, with ndcg@10 values ranging from 0.1357 to 0.1509. The 1-layer GNN model demonstrates the lowest performance, with ndcg@10 values between 0.1305 and 0.1449. This indicates that incorporating both GNN and Attention mechanisms yields the best results for frequency item recommendations.

The second subplot, which depicts discretionary items, shows a similar trend. The 2-layer GNN-Attention model again leads in performance, with ndcg@10 values between 0.1317 and 0.1450. The 1-layer Attention model follows, with values ranging from 0.1254 to 0.1404. The 1-layer GNN model has the lowest ndcg@10 values, ranging from 0.1202 to 0.1346.

In the third subplot, which represents the full catalog, the performance of the models is evaluated on all items. The 2-layer GNN-Attention model maintains its superior performance, with ndcg@10 values ranging from 0.1357 to 0.1510. The 1-layer Attention model achieves ndcg@10 values between 0.1302 and 0.1455. The 1-layer GNN model shows the lowest performance, with values ranging from 0.1239 to 0.1392. The consistent outperformance of the 2-layer GNN-Attention model across different item categories and embedding dimensions highlights its robustness and suitability for diverse recommendation scenarios.

Overall, the experiment results clearly demonstrate the effectiveness of the proposed 2-layer GNN-Attention model in achieving higher ndcg@10 values compared to the 1-layer Attention and 1-layer GNN models. The enhanced performance is observed across all embedding dimensions and item categories, indicating the model's capability to capture complex interactions and improve recommendation accuracy in real-time session-based systems. While GNNs are adept at capturing short-range dependencies, attention mechanisms additionally capture long-range dependencies, providing a more comprehensive representation of the data. Furthermore, we observed that model performance improves with larger embedding dimensions, as they enable the model to capture more information. However, this improvement follows a trend of diminishing returns: the incremental gains decrease as the embedding dimension continues to increase.

VIII Conclusion and Future Work

We propose a session-based recommendation model that integrates graph neural network and attention mechanisms to generate item recommendations. We also integrated a readout layer that uses the attention mechanism to generate session embeddings capturing the context of all items in the session, and proposed a k-nearest neighbor-based heuristic approach that allows recommendations to be generated in real time. We compared our model against state-of-the-art session-based models and saw an improvement across all metrics. The model also showed improvements in click-through rate and attributable demand during online evaluation.

As part of future work, we will evaluate our model in a contrastive learning setup with the InfoNCE loss function. While the proposed model is implemented for the add-to-cart placement, we intend to test it in placements that consider other guest actions, such as page views or item purchases, and are currently setting up A/B tests for those placements. We also intend to improve the model's ability to handle cross-category item relationships. Finally, we are in the process of integrating item facets (product title, brand, color, size) and language models during model training to generate pretrained embeddings that would likely improve the quality of recommendations.

Acknowledgment: The authors would like to thank all members of the Item Personalization team at Target for their constructive feedback and support for this work.

IX COMPANY PORTRAIT

Target Corporation is a leading American retailer based in Minneapolis, Minnesota, operating around 2,000 stores nationwide with a workforce of about 400,000 employees. Known for its high-quality products and excellent customer service, Target offers a wide range of products, including food, clothing, home goods, electronics, and health and beauty items, with a mission to help families discover the joy of everyday life.

References

  • [1] Wang, S., Cao, L., Wang, Y., Sheng, Q., Orgun, M. & Lian, D. A survey on session-based recommender systems. ACM Computing Surveys (CSUR). 54, 1-38 (2021)
  • [2] Song, Y., Elkahky, A. & He, X. Multi-rate deep learning for temporal recommendation. Proceedings Of The 39th International ACM SIGIR Conference On Research And Development In Information Retrieval. pp. 909-912 (2016)
  • [3] Shen, C., Han, C., He, L., Mukherjee, A., Obradovic, Z. & Dragut, E. Session-based News Recommendation from Temporal User Commenting Dynamics. 2022 IEEE/ACM International Conference On Advances In Social Networks Analysis And Mining (ASONAM). pp. 163-170 (2022)
  • [4] Wang, D., Deng, S., Zhang, X. & Xu, G. Learning music embedding with metadata for context aware recommendation. Proceedings Of The 2016 ACM On International Conference On Multimedia Retrieval. pp. 249-253 (2016)
  • [5] Bao, J., Zheng, Y. & Mokbel, M. Location-based and preference-aware recommendation using sparse geo-social networking data. Proceedings Of The 20th International Conference On Advances In Geographic Information Systems. pp. 199-208 (2012)
  • [6] Wang, X., Liu, X., Li, L., Chen, X., Liu, J. & Wu, H. Time-aware user modeling with check-in time prediction for next POI recommendation. 2021 IEEE International Conference On Web Services (ICWS). pp. 125-134 (2021)
  • [7] Beutel, A., Covington, P., Jain, S., Xu, C., Li, J., Gatto, V. & Chi, E. Latent cross: Making use of context in recurrent recommender systems. Proceedings Of The Eleventh ACM International Conference On Web Search And Data Mining. pp. 46-54 (2018)
  • [8] Yap, G., Li, X. & Yu, P. Effective next-items recommendation via personalized sequential pattern mining. International Conference On Database Systems For Advanced Applications. pp. 48-64 (2012)
  • [9] Hu, H., He, X., Gao, J. & Zhang, Z. Modeling personalized item frequency information for next-basket recommendation. Proceedings Of The 43rd International ACM SIGIR Conference On Research And Development In Information Retrieval. pp. 1071-1080 (2020)
  • [10] Le, D., Fang, Y. & Lauw, H. Modeling sequential preferences with dynamic user and context factors. Joint European Conference On Machine Learning And Knowledge Discovery In Databases. pp. 145-161 (2016)
  • [11] Hidasi, B., Karatzoglou, A., Baltrunas, L. & Tikk, D. Session-based recommendations with recurrent neural networks. ArXiv Preprint ArXiv:1511.06939. (2015)
  • [12] Li, J., Ren, P., Chen, Z., Ren, Z., Lian, T. & Ma, J. Neural attentive session-based recommendation. Proceedings Of The 2017 ACM On Conference On Information And Knowledge Management. pp. 1419-1428 (2017)
  • [13] Kang, W. & McAuley, J. Self-attentive sequential recommendation. 2018 IEEE International Conference On Data Mining (ICDM). pp. 197-206 (2018)
  • [14] Wang, J., Ding, K., Hong, L., Liu, H. & Caverlee, J. Next-item recommendation with sequential hypergraphs. Proceedings Of The 43rd International ACM SIGIR Conference On Research And Development In Information Retrieval. pp. 1101-1110 (2020)
  • [15] Tang, J. & Wang, K. Personalized top-n sequential recommendation via convolutional sequence embedding. Proceedings Of The Eleventh ACM International Conference On Web Search And Data Mining. pp. 565-573 (2018)
  • [16] Shi, C., Hu, B., Zhao, W. & Philip, S. Heterogeneous information network embedding for recommendation. IEEE Transactions On Knowledge And Data Engineering. 31, 357-370 (2018)
  • [17] Liu, Q., Zeng, Y., Mokhosi, R. & Zhang, H. STAMP: short-term attention/memory priority model for session-based recommendation. Proceedings Of The 24th ACM SIGKDD International Conference On Knowledge Discovery & Data Mining. pp. 1831-1839 (2018)
  • [18] Gao, C., Zheng, Y., Li, N., Li, Y., Qin, Y., Piao, J., Quan, Y., Chang, J., Jin, D., He, X. & Others A survey of graph neural networks for recommender systems: Challenges, methods, and directions. ACM Transactions On Recommender Systems. 1, 1-51 (2023)
  • [19] Wu, S., Tang, Y., Zhu, Y., Wang, L., Xie, X. & Tan, T. Session-based recommendation with graph neural networks. Proceedings Of The AAAI Conference On Artificial Intelligence. 33, 346-353 (2019)
  • [20] Pan, Z., Cai, F., Chen, W., Chen, H. & De Rijke, M. Star graph neural networks for session-based recommendation. Proceedings Of The 29th ACM International Conference On Information & Knowledge Management. pp. 1195-1204 (2020)
  • [21] Wang, H., Zhang, F., Zhang, M., Leskovec, J., Zhao, M., Li, W. & Wang, Z. Knowledge-aware graph neural networks with label smoothness regularization for recommender systems. Proceedings Of The 25th ACM SIGKDD International Conference On Knowledge Discovery & Data Mining. pp. 968-977 (2019)
  • [22] Li, X., Wang, X., Zhang, H. & Zhang, J. Session-based Recommendation with Temporal Graph Neural Network and Contrastive Learning. 2023 3rd International Conference On Neural Networks, Information And Communication Engineering (NNICE). pp. 10-14 (2023)
  • [23] Chen, T. & Wong, R. Handling information loss of graph neural networks for session-based recommendation. Proceedings Of The 26th ACM SIGKDD International Conference On Knowledge Discovery & Data Mining. pp. 1172-1180 (2020)
  • [24] Pei, W., Yang, J., Sun, Z., Zhang, J., Bozzon, A. & Tax, D. Interacting attention-gated recurrent networks for recommendation. Proceedings Of The 2017 ACM On Conference On Information And Knowledge Management. pp. 1459-1468 (2017)
  • [25] Davidson, J., Liebald, B., Liu, J., Nandy, P., Van Vleet, T., Gargi, U., Gupta, S., He, Y., Lambert, M., Livingston, B. & Others The YouTube video recommendation system. Proceedings Of The Fourth ACM Conference On Recommender Systems. pp. 293-296 (2010)
  • [26] Rendle, S., Freudenthaler, C. & Schmidt-Thieme, L. Factorizing personalized markov chains for next-basket recommendation. Proceedings Of The 19th International Conference On World Wide Web. pp. 811-820 (2010)
  • [27] Alon, U. & Yahav, E. On the bottleneck of graph neural networks and its practical implications. ArXiv Preprint ArXiv:2006.05205. (2020)
  • [28] Del Moral, P., Nowaczyk, S. & Pashami, S. Why is multiclass classification hard?. IEEE Access. 10 pp. 80448-80462 (2022)
  • [29] Veličković, P., Cucurull, G., Casanova, A., Romero, A., Lio, P. & Bengio, Y. Graph attention networks. ArXiv Preprint ArXiv:1710.10903. (2017)