SR-GCL: Session-Based Recommendation with Global Context Enhanced Augmentation in Contrastive Learning

Eunkyu Oh, Taehun Kim, Minsoo Kim, Yunhu Ji, Sushil Khyalia

Abstract

Session-based recommendations aim to predict the next behavior of users based on ongoing sessions. The previous works have been modeling the session as a variable-length of a sequence of items and learning the representation of both individual items and the aggregated session. Recent research has applied graph neural networks with an attention mechanism to capture complicated item transitions and dependencies by modeling the sessions into graph-structured data. However, they still face fundamental challenges in terms of data and learning methodology such as sparse supervision signals and noisy interactions in sessions, leading to sub-optimal performance. In this paper, we propose SR-GCL, a novel contrastive learning framework for a session-based recommendation. As a crucial component of contrastive learning, we propose two global context enhanced data augmentation methods while maintaining the semantics of the original session. The extensive experiment results on two real-world E-commerce datasets demonstrate the superiority of SR-GCL as compared to other state-of-the-art methods.

Refer to caption — Figure 1: Overall flow of SR-GCL. Augmentations selected from a method pool are applied to a local session graph. Then, the GNN-based encoder, Next Item Prediction Layer, and Projection Head are jointly optimized with two different tasks.

Introduction

A Session-Based Recommendation (SBR) has attracted increasing attention as a new paradigm for Recommender Systems (RecSys) in E-commerce and online streaming services, which aims to predict successive items that a user is likely to interact with given a sequence of previous items consumed in the session. By definition, a session is a slice of user-item interactions separated by arbitrary time, so it is naturally expressed as a variable-length time-series event sequence (Wang et al. 2019b). With the advances of deep learning techniques, Graph Neural Networks (GNNs)-based approaches (Wu et al. 2019; Pan et al. 2020a) show the effectiveness to capture sophisticated transition relationships of items within the given session.

Although existing methods achieve promising results, they still suffer from some limitations in terms of data and learning methodology. Firstly, they mostly follow the supervised learning paradigm (Hidasi et al. 2015; Wu et al. 2019) where the supervision signal comes from observed user-item interactions. However, as well known problems in RecSys, the observed data are extremely sparse (Bayer et al. 2017; He and McAuley 2016) compared to the whole interaction space, and usually follow a power-law distribution (Milojević 2010). Due to these data characteristics, high-degree items receive more supervision signals, while it is hard to obtain high-quality representations of low-degree ones. Secondly, they are vulnerable to noisy interactions that deviate from the user’s main intention, especially in the case of a long-range session (Wang et al. 2019b).

To overcome the aforementioned problems, we adopt Contrastive Learning (CL) that has renewed a surge of interest in Computer Vision (CV) and Natural Language Processing (NLP) domains (Hjelm et al. 2018; Wu et al. 2018). This leverages the fact that the underlying data has a richer structure than the information that sparse labels or rewards can provide. Specifically, CL utilizes data- or task-specific augmentations to inject the desired feature invariance, distills additional supervision signals from the unlabelled data itself, and represents semantically similar objects (i.e., positive pairs) closer to each other, while dissimilar ones (i.e., negative pairs) further away. In this process, data augmentation is an essential component and has been researched for images (Perez and Wang 2017) (e.g., color jitter, random flip, etc.) and texts (Wei and Zou 2019) (e.g., synonym replacement, random insertion, etc.). In contrast, an augmentation scheme for an SBR has not been sufficiently studied. As an alternative, applying conventional ones into a CL framework for an SBR is also not straightforward because such perturbation can loose the context of the whole session.

Along these lines, we propose SR-GCL, a novel contrastive learning framework for a session-based recommendation. To be specific, each session is firstly constructed to a directed graph as an anchor, then augmented session graphs are generated as positive pairs by choosing arbitrary augmentation methods. We also obtain negative pairs by augmenting other samples. In this process, we propose two data augmentation methods for maintaining the semantics of the original session. For this, we consider the global context by integrating all pairwise item transitions over whole sessions. In this way, the influence of noisy items and high-degree nodes can be reduced, so that accurate neighbor information of nodes can be extracted. Our model is optimized by a next-item prediction objective and multi-positive contrastive learning simultaneously. Extensive experiments on two public datasets demonstrate the effectiveness of our framework.

Related Work

Session-Based Recommendation

In an SBR, a session is formally represented as a variable-length time-series event sequence. With the outstanding performance of deep learning-based models, RNN-based methods model a series of events in the session as a sequence and take the last hidden state as the context representation (Hidasi et al. 2015). Recently, GNNs have become dominant, which more accurately capture the transition pattern and coherence of items within a session by modeling the session sequences as graph-structured data (Wu et al. 2019; Pan et al. 2020a). Despite the remarkable success, many existing works following a supervised learning paradigm still suffer from data sparsity and a long-tail problem in user-item interactions (Bayer et al. 2017; Clauset, Shalizi, and Newman 2009). To address the weak supervision from insufficient labels and to better learn the optimal representation of items and sessions, we adopt a contrastive learning paradigm for an SBR.

Contrastive Self-Supervised Learning

Recently, Self-Supervised Learning (SSL) has shown impressive performance in various fields while complementing the limitations of supervised learning. Among SSL, CL builds representations by learning to encode what makes two things similar or different via positive or negative samples (He et al. 2020; Chen et al. 2020c, a, b). For sequential recommendations, (Xie et al. 2020) proposed Contrastive Learning for Sequential Recommendation (CL4SRec) extracting more meaningful user patterns within a session and three augmentation approaches such as Crop, Mask, and Reorder. However, the augmentation methods have no consideration of session data characteristics as those are merely borrowed from other domains (e.g., image). Unlike the previous methods, our novel augmentation approaches retain the original intention of a given session by taking into account the global context.

Method

Contrastive Learning Framework

Figure 1 presents the overview of our contrastive learning framework consisting of four major components: Augmentation, GNN Encoder, Next Item Prediction Layer, and Projection Head. The proposed framework builds representations of items and a session (i.e., a sequence of items) while maximizing agreement between different views on the session via a data augmentation pipeline. The trainable modules are jointly optimized with two different tasks: a next item prediction as the main task on which most of the previous models have focused and contrastive learning as an auxiliary task.

For the next item prediction task, each session $s$ is first constructed as a directed graph and fed to an encoder, which computes the representation $r$ of the session. In our case, $f(\cdot):\mathbb{R}^{\left|s\right|\times d}\to\mathbb{R}^{2d}$ denotes the encoder, where $\left|s\right|$ is the length of $s$ and $d$ is the dimension of an item embedding. In practice, the proposed framework is model-agnostic and can adopt any encoder for session-based recommendations. In this paper, we adopt SGNN-HN, a state-of-the-art GNN-based model (Pan et al. 2020a) (See Appendix A for the detail). The representation of a session via the encoder can be expressed as

r=f(s)

(1)

Then, Next Item Prediction Layer uses $r$ to obtain a predicted next click $t$ . $g(\cdot):\mathbb{R}^{2d}\to\mathbb{R}^{d}$ denotes the prediction function, which consists of one linear layer as

t=g(r)=W_{g}r+b_{g}

(2)

where $W_{g}\in\mathbb{R}^{d\times 2d}$ is a trainable matrix and $b_{g}\in\mathbb{R}^{d}$ is a bias. We obtain main loss $\mathcal{L}_{main}$ by measuring the similarities between $t$ and all items $V$ .

On the other hand, the process of contrastive learning starts with augmentations. Our framework has no restriction on how many augmentation methods are used per batch, but we assume two randomly selected augmentations for simplicity (He et al. 2020; Chen et al. 2020c, a, b). Augmentation module generates two augmented sessions, $s^{a_{i}}$ and $s^{a_{j}}$ by each method. Thereafter, we obtain their representations, $r^{a_{i}}$ and $r^{a_{j}}$ by passing them through the same encoder with shared parameters as Equation 1. Projection Head (See Appendix C for the detail) represented by $h(\cdot):\mathbb{R}^{2d}\to\mathbb{R}^{d}$ maps the representations into another embedding space, and outputs each $z$ as

z=h(r)=W_{h2}\sigma(W_{h1}r+b_{h1})+b_{h2}

(3)

where $W_{h1}\in\mathbb{R}^{d\times 2d}$ , $W_{h2}\in\mathbb{R}^{d\times d}$ are learnable matrices, $b_{h1},b_{h2}\in\mathbb{R}^{d}$ are biases, and $\sigma$ is a ReLU. Finally, we calculate contrastive loss $\mathcal{L}_{cl}$ through comparison with all $z$ obtained in the same batch.

Data Augmentations for a Session

A successful session augmentation should be able to obtain different views of the same sequence but still, maintain the main characteristics hidden in historical behaviors. Inspired by the principles of augmentations in NLP (Wei and Zou 2019; Jiao et al. 2019), we devise two global context enhanced augmentation approaches (i.e., Item Change and Injection) by referring to the global graph having pairwise connections over whole sessions as shown in Figure 2. Synonyms of a node are sampled by the degree of connections from the global graph. These methods can reflect the global intra-session context, not just the given session.

Our proposed approaches have two main advantages compared with previous methods (See Appendix B for the details of conventional methods for session data). First, the conventional methods mainly deal with limited information in a current session, whereas our approaches considering the entire session context information can generate more plausible session data. Second, we can obtain many variants of the original sample. If we should augment only within the given session, the maximum number of variants is restricted. These advantages are particularly pronounced for session data with a short length.

For common notations, $\gamma\in\left[0,1\right]$ means an augmented ratio. $\delta(x)$ is the function that picks one node from the neighbors of a given item $x$ . The neighbor nodes are termed synonyms, which have similar interactions with $x$ . $\delta(x)$ samples the synonym with the frequency of connections to the power of $k\in\left[0,1\right]$ to lower the impact of high-degree nodes with respect to a long-tail problem.

Item Change

Item Change is to replace a selected item with its synonym. To apply this augmentation to a session $s$ , we randomly make a set $\mathcal{I}_{ch}=\left\{i_{1},i_{2},...,i_{N_{ch}}\right\}$ indicating changed indices, where $N_{ch}=\lceil\gamma_{ch}\times\left|s\right|\rceil$ . The items pointed by $\mathcal{I}_{ch}$ are changed to their synonyms. This method $a_{ch}$ is formulated as

	$\displaystyle s^{a_{ch}}=a_{ch}(s)=\left[\tilde{x}_{1},\tilde{x}_{2},...,\tilde{x}_{\left\|s\right\|}\right],$		(4)
	$\displaystyle\mbox{where }\tilde{x}_{j}=\begin{cases}x_{j}\qquad\qquad\mbox{if }j\notin\mathcal{I}_{ch}\\ \delta(x_{j})\qquad\;\;\;\mbox{if }j\in\mathcal{I}_{ch}\end{cases}$		(4)

Item Injection

This augmentation is to insert synonyms into random indices of a sequence. Formally, this inserts an item $N_{in}=\lceil\gamma_{in}\times\left|s\right|\rceil$ times to the random positions of a session $s$ . When $i(s)$ denotes the function that one synonym of a randomly selected item is injected before or after the item, this method $a_{in}$ repeats $i(\cdot)$ $N_{in}$ times (i.e., $i^{N_{in}}(\cdot)$ ), which can be formulated as

	$\displaystyle s^{a_{in}}=a_{in}(s)=i^{N_{in}}(s),$		(5)
	$\displaystyle\mbox{where }i(s)=\left[x_{1},...,x_{i},\delta(x_{i}),x_{i+1},...,x_{\left\|s\right\|}\right]$		(5)

Objective Function for Multi-task Learning

We take multi-task learning to enhance the performance of a session-based recommendation. The main task for a next item prediction and an auxiliary task for contrastive learning are jointly optimized. The total loss $\mathcal{L}_{total}$ is the weighted sum of main loss $\mathcal{L}_{main}$ and contrastive loss $\mathcal{L}_{cl}$ as

\mathcal{L}_{total}=\mathcal{L}_{main}+\lambda\mathcal{L}_{cl}

(6)

where $\lambda$ is a weight for contrastive loss.

The process of deriving $\mathcal{L}_{main}$ for the prediction task is as follows. A session $s$ is encoded to the representation via Encoder as Equation 1, and a predicted next click $t$ is obtained via Next Item Prediction Layer as Equation 2. Then, we compare the predicted click $t$ with all items in $V$ by calculating the similarities. To solve a long-tail problem in a recommendation (Abdollahpouri, Burke, and Mobasher 2017), we apply layer normalization before dot-product as

	$\displaystyle sim(v_{1},v_{2})=\tilde{v}_{1}^{\top}\tilde{v}_{2}$		(7)
	$\displaystyle\mbox{where }\tilde{v}=LayerNorm(v)$		(7)

To normalize the similarities, a softmax layer is applied with a temperature parameter $\tau_{m}$ . Finally, we adopt cross-entropy as an optimization objective. These are summarized as

\mathcal{L}_{main}=-\log\frac{\exp(sim(t,y)/\tau_{m})}{\sum_{v_{i}\in V}\exp(sim(t,v_{i})/\tau_{m})}

(8)

where $y\in V$ is the next clicked item of $s$ .

For contrastive loss, we extend a Simple framework for Contrastive Learning of visual Representations (SimCLR) (Chen et al. 2020a) that does not require a separate module to manage negative samples. However, the number of negative samples is determined by the number of mini-batch $N$ since the negative samples are generated from other samples in the same batch. If more negative samples are required for enhancing a model, $N$ should also increase. To overcome this dependency, we propose multi-positive contrastive loss inspired by supervised contrastive loss (Khosla et al. 2020). This allows us to flexibly set the number of negative samples, which can employ multiple augmentation methods per batch (i.e., more than two). That is, we generalize InfoNCE loss (Oord, Li, and Vinyals 2018) based on noise contrastive estimation for the multiple positive pairs.

The procedure to compute multi-positive contrastive loss $\mathcal{L}_{cl}$ is as follows. Augmentation module generates $M\times N$ augmented sessions, where $M$ is the number of selected augmentations. Then, the augmented sessions are encoded to the representations as Equation 1, and those are projected into another space via Projection Head as Equation 3. For example, when a session $s_{n}$ augmented by $a_{m}$ passes Encoder $f(\cdot)$ and Projection Head $h(\cdot)$ , it becomes $z_{n}^{a_{m}}$ . Therefore, we obtain a set $Z^{\prime}=\left\{z_{1}^{a_{1}},...,z_{N}^{a_{1}},...,z_{1}^{a_{M}},...,z_{N}^{a_{M}}\right\}$ . To compute $\mathcal{L}_{cl}$ , we define a query set $Z_{q}^{\prime}=\left\{z_{1}^{a_{1}},z_{2}^{a_{1}},...,z_{N}^{a_{1}}\right\}$ , which contains $N$ representations of the sessions augmented by the first method $a_{1}$ . For each query, the similarities with $M\times N$ keys $Z_{k}^{\prime}=Z^{\prime}$ are calculated as Figure 1. The same distance as Equation 7 is used to measure the similarities of the query and keys. A softmax layer with a temperature parameter $\tau_{c}$ is applied. Cross-entropy is normalized for the multiple positives. That is, multi-positive contrastive loss for the $q$ -th query $z_{q}\in Z_{q}^{\prime}$ is so defined as

\mathcal{L}_{cl}=\frac{-1}{M}\!\!\sum_{z_{pos}\in P(z_{q})}\!\!\!\!\!\log\frac{\exp(sim(z_{q},z_{pos})/\tau_{c})}{\sum_{z_{k}\in Z_{k}^{\prime}}\exp(sim(z_{q},z_{k})/\tau_{c})}

(9)

where $P(z_{q})\subseteq Z_{k}^{\prime}$ is a set of all positive pairs of the $q$ -th query, and $M=\left|P(z_{q})\right|$ is the number of selected methods.

Experiments and Analysis

Experimental Settings

Datasets

•

Yoochoose was released by RecSys Challenge 2015¹¹1https://recsys.acm.org/recsys15/challenge/, which contains click-streams from an E-commerce website within 6 months.
•

Diginetica was used as a challenge dataset for CIKM Cup 2016²²2https://competitions.codalab.org/competitions/11161. We only adopt the transaction data which is suitable for a session-based recommendation.

We follow the preprocessing of previous works (Wu et al. 2019; Pan et al. 2020a) for fairness. We filter out sessions of length 1 and items which occur less than 5 times. Then, we split the sessions for train and test, where the last day of Yoochoose and the last week of Diginetica are used for test. Furthermore, we exclude the items which are not included in the train set. Finally, we split the sessions to several sub-sequences. Specifically, for a session $s=\left[x_{1},x_{2},...,x_{\left|s\right|}\right]$ , we generate sub-sequences and the corresponding labels as $\left(\left[x_{1}\right],x_{2}\right),\left(\left[x_{1},x_{2}\right],x_{3}\right),...,\left(\left[x_{1},...,x_{\left|s\right|-1}\right],x_{\left|s\right|}\right)$ for train and test. As Yoochoose is too large, we only utilize the recent 1/64 and 1/4 fractions of the train set, denoted as Yoochoose1/64 and Yoochoose1/4, respectively. Statistics for the three datasets are summarized in Appendix E.

Baselines

•

GRU4Rec (Hidasi et al. 2015) applies GRUs to model the sequential information in a session-based recommendation.
•

CSRM (Wang et al. 2019a) employs GRUs to model the sequential behavior, adopts an attention mechanism to capture the main purpose, and uses neighbor sessions as auxiliary information.
•

SR-IEM (Pan et al. 2020b) utilizes a modified self-attention mechanism to estimate the item importance and recommends based on the global preference and current interest.
•

SR-GNN (Wu et al. 2019) as the first proposed GNN-based model, adopts gated GNNs to obtain item embeddings and recommends by generating the session representation with an attention mechanism.
•

NISER+ (Gupta et al. 2019) extends SR-GNN by introducing L2 normalization, positional embedding, and dropout.
•

SGNN-HN (Pan et al. 2020a) extends SR-GNN by introducing a highway gate after the GNNs and a star node, which is connected with all items in the given session graph.

Evaluation Metric

Following (Wu et al. 2019; Pan et al. 2020a), we use P@K (Precision) and MRR@K (Mean Reciprocal Rank) to evaluate the recommendation performance where K is 20.

Parameter Setup

We use the recent 10 items of the given session. We adopt an Adam optimizer with an initial learning rate $1e^{-3}$ , $\beta_{1}\!=\!0.9$ , $\beta_{2}\!=\!0.999$ , and a decay factor 0.1 for every 3 epochs. L2 regularization is set to $1e^{-5}$ . The batch size $N$ is set to 100 and the dimension of item embedding $d$ is set to 256. The weight of contrastive learning loss $\lambda$ is set to 0.7. The temperature parameters $\tau$ are set to 0.085 and 0.005 for main loss and contrastive loss. All ratios $\gamma$ for augmentation methods are set to 0.5. The factor for frequency of node connections $k$ is set to 0.75. All parameters are initialized using a uniform distribution with a range of $\left[\frac{-1}{\sqrt{d}},\frac{1}{\sqrt{d}}\right]$ . In addition, the setup of each encoder (e.g., SR-GNN and NISER+) follows that of the corresponding paper.

Method

Yoochoose1/64

Yoochoose1/4

Diginetica

P@20

MRR@20

P@20

MRR@20

P@20

MRR@20

RNN- based

GRU4REC

60.64

22.89

59.53

22.60

29.45

8.33

CSRM

69.85

29.71

70.63

29.48

51.69

16.92

Attention-

based

SR-IEM

71.15

31.71

71.67

31.82

52.35

17.64

GNN- based

SR-GNN

70.57

30.94

71.36

31.89

50.73

17.59

NISER+

71.27

31.61

71.80

31.80

53.39

18.72

SGNN-HN

72.06

32.61

72.85

32.55

55.67

19.45

CL- based

SR-GNN w/ GCL

71.16

31.27

71.79

31.42

52.73

17.90

NISER+ w/ GCL

71.64

32.08

72.30

32.00

54.74

19.26

SR-GCL

72.14

32.33

73.11

32.70

55.93

19.53

Table 1: Overall performance on three datasets. For each column, the bold-faced number is the best score and the second performer is underlined.

Results and Discussions

In this section, we present the evaluation results. Except for the overall comparison result, we present the results only on Yoochoose1/64 and Diginetica.

Overall Performance Comparison

The overall experimental results are summarized in Table 1. Although CSRM shows a great improvement from GRU4Rec in RNN-based models, SR-IEM outperforms those because its attention mechanism helps avoid the bias caused by unrelated items. SR-GNN shows a comparable result with SR-IEM by modeling the transition relationship of items. SGNN-HN, a state-of-the-art, shows significant improvements on all metrics by proposing a virtual node that connects items without direct connections. SR-GCL on top of the SGNN-HN encoder outperforms the state-of-the-art baselines except for MRR@20 on Yoochoose1/64. We also experimented by applying GCL to other encoders, and the results show better performance than the vanilla encoders. The lower result of MRR@20 on Yoochoose1/64 is repeatedly reported in the following experiments, so we will discuss this in detail at the end.

Impact of Considering the Global Context

To analyze the impact of considering the global context in our proposed augmentation approaches, we experimented with two extreme ways to find synonyms. The one is selecting from 1-hop neighbor nodes in the global item graph, which is currently adopted. The other one is selecting from all items regardless of their connections. In Figure 3, the bars are the means of nine performances when varying $\gamma$ is from 0.1 to 0.9. Each cap means the standard deviation of each case. The results have a similar tendency except Figure 3 (b). The methods considering the global context (i.e., blue bars) show better performance than those which randomly select synonyms without the context (i.e., red bars). In addition, our proposed approaches have lower standard deviations as shown in the caps. Note that the way without the context is a similar operation to masking some items.

Impact of Contrastive Loss

We investigate the interaction effect between contrastive loss and prediction loss by varying the weight $\lambda$ . We expect that contrastive learning as an auxiliary task is helpful for the recommendation performance, but it could degrade the performance if the effect is too large. So we experimented to find the optimal $\lambda$ for our task. In addition, we applied this experiment to other GNN-based encoders, SR-GNN and NISER+ to confirm that our framework is model-agnostic used for general session-based recommendations. Figure 4 shows that our proposed framework using SGNN-HN as an encoder achieves the best performance at less than 1. Then, as $\lambda$ increases in more than 1, the performance decreases. The case of using NISER+ has a similar result, but that looks relatively less sensitive to $\lambda$ . When we replace the encoder with SR-GNN, the performance improvement is most prominent. There is no phenomenon that the performance is lowered as $\lambda$ increases.

Discussions

The experiment results show the superiority of SR-GCL in all evaluations except MRR@20 on Yoochoose1/64 consistently. We speculate the reason why SR-GCL is not effective in that case in two aspects. The first factor is the characteristics of the dataset. The ratio of a sequence length of 3 or less in Yoochoose1/64 is 63%. Compared with that in Diginetica (i.e., 47%), session augmentations are more difficult to work on Yoochoose1/64. In addition, the ratio for the number of neighbor items with 2 or less in Yoochoose1/64 is 31%, indicating significantly lower connectivity compared with others. This may cause over-fittings, leading to degrading the performance when considering the global context (See Appendix E for detailed statistics). The second one is the relationship between contrastive learning and metrics. In terms of P@20, contrastive learning introduces to explore a wider range of target items. However, this could conflict with the main task which adopts cross-entropy for one target, therefore, degrade MRR@20.

Conclusion

We present SR-GCL, a contrastive learning framework for an SBR to better capture the item and session representations, and propose two global context enhanced augmentation methods. For future work, we plan to devise better augmentation techniques suitable for shorter session data. Beyond the current contrastive learning framework where positive and negative pairs should be obtained, we plan to develop a non-contrastive learning method for an SBR where negative pairs are not required.

Acknowledgments

We thank Prof. Kyunghyun Cho (New York University) for a fruitful discussion and long-term collaboration.

References

Abdollahpouri, Burke, and Mobasher (2017) Abdollahpouri, H.; Burke, R.; and Mobasher, B. 2017. Controlling popularity bias in learning-to-rank recommendation. In Proceedings of the eleventh ACM conference on recommender systems, 42–46.
Bayer et al. (2017) Bayer, I.; He, X.; Kanagal, B.; and Rendle, S. 2017. A generic coordinate descent framework for learning from implicit feedback. In Proceedings of the 26th International Conference on World Wide Web, 1341–1350.
Bowman et al. (2015) Bowman, S. R.; Vilnis, L.; Vinyals, O.; Dai, A. M.; Jozefowicz, R.; and Bengio, S. 2015. Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349.
Chen et al. (2020a) Chen, T.; Kornblith, S.; Norouzi, M.; and Hinton, G. 2020a. A simple framework for contrastive learning of visual representations. In International conference on machine learning, 1597–1607. PMLR.
Chen et al. (2020b) Chen, T.; Kornblith, S.; Swersky, K.; Norouzi, M.; and Hinton, G. 2020b. Big self-supervised models are strong semi-supervised learners. arXiv preprint arXiv:2006.10029.
Chen et al. (2020c) Chen, X.; Fan, H.; Girshick, R.; and He, K. 2020c. Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297.
Clauset, Shalizi, and Newman (2009) Clauset, A.; Shalizi, C. R.; and Newman, M. E. 2009. Power-law distributions in empirical data. SIAM review, 51(4): 661–703.
Covington, Adams, and Sargin (2016) Covington, P.; Adams, J.; and Sargin, E. 2016. Deep neural networks for youtube recommendations. In Proceedings of the 10th ACM conference on recommender systems, 191–198.
Gupta et al. (2019) Gupta, P.; Garg, D.; Malhotra, P.; Vig, L.; and Shroff, G. M. 2019. NISER: Normalized Item and Session Representations with Graph Neural Networks. arXiv preprint arXiv:1909.04276.
He et al. (2020) He, K.; Fan, H.; Wu, Y.; Xie, S.; and Girshick, R. 2020. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 9729–9738.
He and McAuley (2016) He, R.; and McAuley, J. 2016. Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. In proceedings of the 25th international conference on world wide web, 507–517.
Hidasi et al. (2015) Hidasi, B.; Karatzoglou, A.; Baltrunas, L.; and Tikk, D. 2015. Session-based recommendations with recurrent neural networks. arXiv preprint arXiv:1511.06939.
Hjelm et al. (2018) Hjelm, R. D.; Fedorov, A.; Lavoie-Marchildon, S.; Grewal, K.; Bachman, P.; Trischler, A.; and Bengio, Y. 2018. Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670.
Jiao et al. (2019) Jiao, X.; Yin, Y.; Shang, L.; Jiang, X.; Chen, X.; Li, L.; Wang, F.; and Liu, Q. 2019. Tinybert: Distilling bert for natural language understanding. arXiv preprint arXiv:1909.10351.
Khosla et al. (2020) Khosla, P.; Teterwak, P.; Wang, C.; Sarna, A.; Tian, Y.; Isola, P.; Maschinot, A.; Liu, C.; and Krishnan, D. 2020. Supervised contrastive learning. arXiv preprint arXiv:2004.11362.
Li et al. (2015) Li, Y.; Tarlow, D.; Brockschmidt, M.; and Zemel, R. 2015. Gated graph sequence neural networks. arXiv preprint arXiv:1511.05493.
Milojević (2010) Milojević, S. 2010. Power law distributions in information science: Making the case for logarithmic binning. Journal of the American Society for Information Science and Technology, 61(12): 2417–2425.
Oord, Li, and Vinyals (2018) Oord, A. v. d.; Li, Y.; and Vinyals, O. 2018. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748.
Pan et al. (2020a) Pan, Z.; Cai, F.; Chen, W.; Chen, H.; and de Rijke, M. 2020a. Star graph neural networks for session-based recommendation. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, 1195–1204.
Pan et al. (2020b) Pan, Z.; Cai, F.; Ling, Y.; and de Rijke, M. 2020b. Rethinking item importance in session-based recommendation. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, 1837–1840.
Perez and Wang (2017) Perez, L.; and Wang, J. 2017. The effectiveness of data augmentation in image classification using deep learning. arXiv preprint arXiv:1712.04621.
Qiu et al. (2019) Qiu, R.; Li, J.; Huang, Z.; and Yin, H. 2019. Rethinking the item order in session-based recommendation with graph neural networks. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, 579–588.
Takahashi, Matsubara, and Uehara (2019) Takahashi, R.; Matsubara, T.; and Uehara, K. 2019. Data augmentation using random image cropping and patching for deep CNNs. IEEE Transactions on Circuits and Systems for Video Technology, 30(9): 2917–2931.
Tang and Wang (2018) Tang, J.; and Wang, K. 2018. Personalized top-n sequential recommendation via convolutional sequence embedding. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, 565–573.
Wang et al. (2019a) Wang, M.; Ren, P.; Mei, L.; Chen, Z.; Ma, J.; and de Rijke, M. 2019a. A collaborative session-based recommendation approach with parallel memory modules. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, 345–354.
Wang et al. (2019b) Wang, S.; Cao, L.; Wang, Y.; Sheng, Q. Z.; Orgun, M.; and Lian, D. 2019b. A survey on session-based recommender systems. arXiv preprint arXiv:1902.04864.
Wei and Zou (2019) Wei, J.; and Zou, K. 2019. EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 6382–6388.
Wu et al. (2021) Wu, J.; Wang, X.; Feng, F.; He, X.; Chen, L.; Lian, J.; and Xie, X. 2021. Self-supervised graph learning for recommendation. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, 726–735.
Wu et al. (2019) Wu, S.; Tang, Y.; Zhu, Y.; Wang, L.; Xie, X.; and Tan, T. 2019. Session-based recommendation with graph neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, 346–353.
Wu et al. (2018) Wu, Z.; Xiong, Y.; Yu, S. X.; and Lin, D. 2018. Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of the IEEE conference on computer vision and pattern recognition, 3733–3742.
Xie et al. (2020) Xie, X.; Sun, F.; Liu, Z.; Wu, S.; Gao, J.; Ding, B.; and Cui, B. 2020. Contrastive Learning for Sequential Recommendation. arXiv preprint arXiv:2010.14395.
You et al. (2020) You, Y.; Chen, T.; Sui, Y.; Chen, T.; Wang, Z.; and Shen, Y. 2020. Graph contrastive learning with augmentations. Advances in Neural Information Processing Systems, 33: 5812–5823.

Supplementary Materials

Organization

The appendix is organized as follows. In Sections A-C for supplementing Section ’Method’ of the main text, we describe the GNN-based encoder consisting of three procedures in detail, three conventional augmentations for session data, and Projection Head for contrastive learning. Then, for supplementing Section ’Experiments and Analysis’, we further report the three experimental results and analysis in Section D. Finally, in Section E, we present the data statistics in terms of distribution to complement our discussion of the main text.

Appendix A Graph Neural Network Based Encoder

There have been many models to learn a session representation for a recommendation (Hidasi et al. 2015; Wang et al. 2019a; Pan et al. 2020b; Wu et al. 2019; Gupta et al. 2019; Pan et al. 2020a). Among them, we leverage a GNN-based encoder which outperforms RNN-based models and shows decent improvements. In particular, we adopt Star Graph Neural Networks with Highway Networks (SGNN-HN) (Pan et al. 2020a) as a state-of-the-art model. We detail three processes, constructing a session graph, updating the graph, and obtaining the representation of the session as follows.

A.1 Constructing a Session Graph

We construct a local session graph $\mathcal{G}\!=\!\left(\mathcal{V},\mathcal{E}\right)$ from each session $s$ to pass the graph-based model. $\mathcal{V}\!=\!\left\{\left\{v_{1},v_{2},...,v_{\left|\mathcal{V}\right|-1}\right\},v_{s}\right\}$ includes all unique items in the session and a star node $v_{s}$ . The $i$ -th node embedding after passing the $l$ -th layer of GNN is represented by a hidden state $h_{i}^{l}\in\mathbb{R}^{d}$ . We initialize the hidden states as $h_{i}^{0}\!=\!v_{i}$ and initialization for the hidden state of a star node $v_{s}$ is $h_{s}^{0}$ , which is the average value of all $h_{i}^{0}$ as

h_{s}^{0}=\frac{1}{N}\sum_{i=1}^{N}h_{i}^{0}

(10)

where $N\!=\!\left|\mathcal{V}\right|-1$ is the number of all unique items. A star node is connected to all nodes with bidirectional edges. Information from non-adjacent nodes can be propagated by taking the star node as an intermediate node (Pan et al. 2020a). $\mathcal{E}$ is an edge set consisting of two types of edges. The first one is for connections of all nodes, which are converted into two normalized adjacency matrices (i.e., incoming matrix $A^{I}$ and outgoing matrix $A^{O}$ ) to pass the GNN. The second one is for connections of a star node.

A.2 Updating a Graph

First, nodes in a graph $\mathcal{G}$ are updated by propagating information of the neighbor nodes. $m_{i}^{l}$ denotes message (i.e., information from the neighbors) for a hidden state $h_{i}$ at the $l$ -th layer, which can be formulated as

	$\displaystyle m_{i}^{l}=\left[A_{i}^{I}(\left[h_{1}^{l-1},h_{2}^{l-1},...,h_{N}^{l-1}\right]^{\top}W^{I}+b^{I})\;;\quad\right.$		(11)
	$\displaystyle\left.A_{i}^{O}(\left[h_{1}^{l-1},h_{2}^{l-1},...,h_{N}^{l-1}\right]^{\top}W^{O}+b^{O})\right]$		(11)

where N is the number of nodes, and $\left[;\right]$ denotes a concatenation. $A_{i}^{I},A_{i}^{O}\in\mathbb{R}^{1\times N}$ are the corresponding incoming and outgoing weights for the node $h_{i}$ (i.e., the $i$ -th row of $A^{I}$ and $A^{O}$ ). $W^{I},W^{O}\in\mathbb{R}^{d\times d}$ and $b^{I},b^{O}\in\mathbb{R}^{d}$ are learnable parameters for the incoming and outgoing edges, respectively. So we feed the message $m_{i}^{l}$ and the previous state $h_{i}^{l-1}$ into the Gated Graph Neural Networks (GGNNs) (Li et al. 2015) for each node as

		$\displaystyle z_{i}^{l}=\sigma(W_{z}m_{i}^{l}+U_{z}h_{i}^{l-1}),$		(12)
		$\displaystyle r_{i}^{l}=\sigma(W_{r}m_{i}^{l}+U_{r}h_{i}^{l-1}),$
		$\displaystyle\tilde{h}_{i}^{l}=tanh(W_{h}m_{i}^{l}+U_{h}(r_{i}^{l}\odot h_{i}^{l-1})),$
		$\displaystyle\hat{h}_{i}^{l}=(1-z_{i}^{l})\odot h_{i}^{l-1}+z_{i}^{l}\odot\tilde{h}_{i}^{l}$

where $\sigma$ is a sigmoid and $\odot$ is an element-wise multiplication. $z_{i}^{l}$ and $r_{i}^{l}$ are an update gate and reset gate. $W_{z},W_{r},W_{h}\in\mathbb{R}^{d\times 2d}$ and $U_{z},U_{r},U_{h}\in\mathbb{R}^{d\times d}$ are trainable matrices. After the propagation of adjacent nodes, we also consider the overall information from the previous star node $h_{s}^{l-1}$ . For each node, we decide how much information from the star node should be propagated with a gate network and attention mechanism as

		$\displaystyle\alpha_{i}^{l}=\frac{(W_{q1}\hat{h}_{i}^{l})^{\top}W_{k1}h_{s}^{l-1}}{\sqrt{d}},$		(13)
		$\displaystyle h_{i}^{l}=(1-\alpha_{i}^{l})\hat{h}_{i}^{l}+\alpha_{i}^{l}h_{s}^{l-1}$		(13)

where $W_{q1},W_{k1}\in\mathbb{R}^{d\times d}$ are trainable matrices, and $\sqrt{d}$ is a scaling factor. After all hidden states are updated, the state of a star node is also updated. We apply an attention mechanism to compute different degrees of similarity for all node states $h^{l}=\left[h_{1}^{l},h_{2}^{l},...,h_{N}^{l}\right]$ by regarding the star node as a query. So the current state of the star node $h_{s}^{l}$ is computed as

		$\displaystyle\beta^{l}=softmax\left(\frac{(W_{k2}h^{l})^{\top}W_{q2}h_{s}^{l-1}}{\sqrt{d}}\right),$		(14)
		$\displaystyle\qquad\qquad\qquad h_{s}^{l}=\beta^{l}h^{l}$		(14)

where $W_{q2},W_{k2}\in\mathbb{R}^{d\times d}$ are learnable parameters, and $\beta^{l}\in\mathbb{R}^{N}$ are weights for all nodes. Multiple updating can be stacked to accurately represent the transition relationship between items. However, the more propagating information between nodes is repeated, the easier the model could lead to an over-fitting problem with the GNN (Qiu et al. 2019; Pan et al. 2020a). To address this problem, (Pan et al. 2020a) applied a highway gate for all nodes after the multiple updating. The highway gate computes the final hidden states $h^{f}$ , which is the weighted sum of $h^{0}$ and $h^{L}$ . $h^{0}$ and $h^{L}$ denote the hidden states before and after the $L$ -layer GNN. The highway gate can be formulated as

		$\displaystyle h^{f}=g\odot h^{0}+(1-g)\odot h^{L},$		(15)
		$\displaystyle\mbox{where }g=\sigma(W_{g}\left[h^{0};h^{L}\right])$		(15)

where $\sigma$ is a sigmoid function, $\left[;\right]$ denotes a concatenation, and $W_{g}\in\mathbb{R}^{d\times 2d}$ is a trainable matrix.

A.3 Obtaining a Session Representation

We obtain sequential item embeddings $u\in\mathbb{R}^{\left|s\right|\times d}$ from the corresponding nodes $h^{f}\in\mathbb{R}^{N\times d}$ which passed the GNN, where $\left|s\right|$ is the length of a session, and $N$ is the number of all unique items of a session. We add trainable position embeddings $p\in\mathbb{R}^{\left|s\right|\times d}$ to consider sequential information in an attention mechanism as

u^{p}=u+p=\left[u_{1}^{p},u_{2}^{p},...,u_{\left|s\right|}^{p}\right]

(16)

To represent the global preference of a session, a soft attention mechanism is applied with the current interest and a star node. This combines the items according to their degrees of preference as

		$\displaystyle\gamma_{i}=W_{0}^{\top}\sigma(W_{1}u_{i}^{p}+W_{2}h_{s}^{L}+W_{3}u_{\left\|s\right\|}^{p}+b_{0}),$		(17)
		$\displaystyle\qquad\qquad\qquad\tilde{r}=\sum_{i=1}^{\left\|s\right\|}\gamma_{i}u_{i}^{p}$		(17)

where $W_{0}\in\mathbb{R}^{d}$ and $W_{1},W_{2},W_{3}\in\mathbb{R}^{d\times d}$ are learnable parameters, $b_{0}\in\mathbb{R}^{d}$ is a bias, and $\sigma$ is a sigmoid function. That is, the global preference of the session $\tilde{r}$ is determined in consideration of the star node after the $L$ -th layer $h_{s}^{L}$ and the last item $u_{\left|s\right|}^{p}$ . We finally obtain the representation of the session $s$ to consider not only the global preference but also the current preference as

r=\left[\tilde{r};u_{\left|s\right|}^{p}\right]

(18)

where $\left[;\right]$ denotes a concatenation. $r\in\mathbb{R}^{2d}$ is used for a recommendation task and contrastive learning.

Appendix B Conventional Augmentations for a Session

For comparison, we also present three conventional methods including Crop, Mask, and Reorder (Xie et al. 2020). Compared with our proposed methods, they could have little diversity of variation. In addition, the rate of information loss is high when the methods are applied.

B.1 Item Crop

This is a technique commonly used in CV, which is to obtain a subset by randomly cropping a part of an image (Takahashi, Matsubara, and Uehara 2019). Applying this method to sequence data is equivalent to obtaining a continuous sub-sequence of a sequence. This augmentation method $a_{cr}$ for a sequence $s$ is formulated as

s^{a_{cr}}=a_{cr}(s)=\left[x_{i},x_{i+1},...,x_{i+N_{cr}-1}\right]

(19)

where $i$ means a starting index randomly selected, and $N_{cr}=\lfloor\gamma_{cr}\times\left|s\right|\rfloor$ is the length of the sub-sequence. It is effective to obtain a local view of the historical session.

B.2 Item Mask

This technique applies zero-masking to some parts of a sample. It has been widely adopted to avoid over-fittings in other fields, which is called word-dropout in NLP (Bowman et al. 2015) and node-dropout in GNN (Wu et al. 2021; You et al. 2020). To apply this technique for a session $s$ , we randomly make a set $\mathcal{I}_{ma}=\left\{i_{1},i_{2},...,i_{N_{ma}}\right\}$ indicating masked indices, where $N_{ma}=\lceil\gamma_{ma}\times\left|s\right|\rceil$ is the number of masked items. The items pointed by $\mathcal{I}_{ma}$ are replaced with a special item [mask]. This method $a_{ma}$ can be formulated as

	$\displaystyle s^{a_{ma}}=a_{ma}(s)=\left[\tilde{x}_{1},\tilde{x}_{2},...,\tilde{x}_{\left\|s\right\|}\right],$		(20)
	$\displaystyle\mbox{where }\tilde{x}_{j}=\begin{cases}x_{j}\qquad\qquad\mbox{if }j\notin\mathcal{I}_{ma}\\ \mbox{[mask]}\qquad\;\mbox{if }j\in\mathcal{I}_{ma}\end{cases}$		(20)

As items within a session implicitly represent the intention of a user, even if $s^{a_{ma}}$ has only several elements of the session, it could still reserve the main intention.

B.3 Item Reorder

This aims to shuffle a part of a given sequence. For sessions in the real-world, the order of users’ interactions is not strictly enforced, but flexible due to various unobservable external factors (Tang and Wang 2018; Covington, Adams, and Sargin 2016). To apply this method to a session $s$ , we randomly shuffle a continuous sub-sequence $\left[x_{i},x_{i+1},...,x_{i+N_{re}-1}\right]$ , where $i$ is a starting index randomly selected, and $N_{re}=\lceil\gamma_{re}\times\left|s\right|\rceil$ is the length of the sub-sequence. This augmentation method $a_{re}$ is formulated as

s^{a_{re}}=a_{re}(s)=\left[x_{1},...,\tilde{x}_{i},...,\tilde{x}_{i+N_{re}-1},...,x_{\left|s\right|}\right]

(21)

where $\left[\tilde{x}_{i},\tilde{x}_{i+1},...,\tilde{x}_{i+N_{re}-1}\right]$ is a shuffled sub-sequence. With this method, we encourage our model to rely less on the order of a session. This enhances the model to be more robust when it encounters unexpected sessions.

Appendix C Projection Head

Projection Head aims to distinguish contrastive learning from the main task (i.e., next item prediction) by mapping representations to the latent space where contrastive loss is applied. Without this layer, contrastive learning could damage useful information for the main task because it is trained to be invariant to data transformation (Chen et al. 2020a). Therefore, we use another representation vector $z\in\mathbb{R}^{d}$ in the projected space through this module to separate from the main objective.

Appendix D Supplementary Experiments

D.1 Impact of Individual Augmentation

Figure 5 shows the effect of individual augmentation operators by varying the ratio $\gamma$ from 0.1 to 0.9, where $\gamma$ means the degree of transformation of each session. If this value is large, it is difficult to have the same underlying semantics as the original session. Otherwise, if it is too small, the contrastive effect between differently augmented views of the same sample would be reduced. For the experiment, we use two augmented sessions generated by using the same method with a fixed ratio. Note that even if the same augmentation method is used, two generated sessions may be different due to the randomness of applied positions.

Taken overall, we can see that Change and Injection augmentations have better performance than other methods except for the case of Figure 5 (b). Also, our proposed methods are relatively less influenced by $\gamma$ . The reason is that even if the degrees of deformation are severe, they do not generate very different views from the original sample due to the consideration of the global context. In contrast, Crop and Mask are sensitive to $\gamma$ . They also show the opposite tendency as $\gamma$ increases because they play a similar role in hiding information of a session. Meanwhile, Reorder shows a significant difference between P@20 and MRR@20. It maintains relatively low performance on MRR@20. The reason is that Reorder makes the current interest difficult to be known. To be specific, our model considers the last clicked-item as a proxy of current interest, but this method could shuffle the order of the sub-session including the last item, which dilutes the meaning. Of course, other augmentations can also stochastically affect this, but Reorder can perturb these sequential dependencies more. In summary, a high ratio of reordering could cause a mismatch between the global preference and the current interest.

D.2 Combination of Augmentations

Figure 6 shows the performance according to the combinations of two augmentation methods. We use two augmented sessions generated by two fixed methods. The results show that exactly one combination cannot be the best for session data. However, our proposed methods (i.e., Change and Injection) can provide more options for session augmentations, which show the improved performance than the combinations of conventional methods (i.e., Crop, Mask, and Reorder) except for the case of Figure 6 (b). It is notable that the combination of weak augmented methods (i.e., Change and Injection) and a strong method (i.e., Reorder) has better performance as mentioned in the advanced experiment.

D.3 Impact of the Number of Selected Methods

Our framework allows setting the number of selected methods flexibly due to our multi-positive loss described in the main text. This means that the number of negative samples of contrastive loss can be adjusted independently without changing the batch size. In this experiment, we tested how different the number of randomly selected methods impact the performance.

Figure 7 shows the performance varying the number of selected methods $M$ . When the batch size is $N$ , the number of augmented representations used for contrastive learning per batch is $N\times M$ . CMR means that we use an augmentation set including Crop, Mask, and Reorder. CI stands for Change and Injection. The cases except for Figure 7 (b) show two results. First, the performance using the augmentation sets including CI is generally better than only CMR. Second, the number of selected methods does not need to be fixed to 2. Even though there is no optimal number of augmentations to ensure the best performance for all conditions, it can be optimized according to a specific purpose. The results could confirm that our contrastive learning framework is effective when the selected number is two or more, which means that there must be other views of the same session to take advantage of contrastive learning.

	Yoochoose1/64	Yoochoose1/4	Diginetica
# of clicks	557,248	8,326,407	982,961
# of train sessions	369,859	5,917,745	719,470
# of test sessions	55,898	55,898	60,858
# of items	17,745	30,470	43,097
Avg. of session length	6.16	5.71	5.13

Table 2: Statistics of datasets.

Appendix E Inherent Characteristics of Datasets

In this section, we investigate the characteristics of the data itself from two aspects in order to analyze the tendency of the performance results. One is the distribution of session lengths, and the other is the distribution of the number of neighbors for each node.

Figure 8 (a)-(c) show the distributions of session lengths on three datasets. Compared with the distribution on Diginetica, we can see that the ratio of short sessions in Yoochoose for both 1/64 and 1/4 is high. Specifically, the ratio of a session length of 3 or less in Yoochoose1/64 is 63% while that in Diginetica is 47%. By such data characteristics, when the session is short, it is difficult to apply augmentations. This is because no matter how much deformation there is, there is little room to be applied. This means that relatively insignificant contrastive effects could be obtained from the augmentations on Yoochoose1/64.

Meanwhile, the three datasets show different aspects regarding the number of neighbors on each item as shown in Figure 8 (d)-(f). Yoochoose1/64 has significantly lower connectivity compared with others, meaning many isolated nodes. So it is easy for those many isolated nodes to find neighbors, which could be more vulnerable to over-fittings because the global context to be considered is limited. On the other hand, the neighbors of an item on Yoochoose1/4 and Diginetica are significantly distributed with higher values than those on Yoochoose1/64. This diversity of neighbors could help a model to be robust if we utilize the global context.