
Modeling Sequences as Distributions with Uncertainty for Sequential Recommendation

Ziwei Fan, Zhiwei Liu (Department of Computer Science, University of Illinois at Chicago, USA), Lei Zheng (Pinterest Inc., USA), Shen Wang, and Philip S. Yu (Department of Computer Science, University of Illinois at Chicago, USA)
Abstract.

The sequential patterns within user interactions are pivotal for representing users' preferences and capturing latent relationships among items. Recent advancements in sequence modeling with Transformers have encouraged the community to devise more effective encoders for sequential recommendation. Most existing sequential methods assume users are deterministic. However, item-item transitions might fluctuate significantly in several item aspects and exhibit randomness in user interests. This stochastic characteristic calls for including uncertainty in the representations of sequences and items. Additionally, modeling sequences and items with uncertainty expands the interaction spaces of users and items, further alleviating cold-start problems.

In this work, we propose a Distribution-based Transformer for Sequential Recommendation (DT4SR), which injects uncertainty into sequential modeling. We describe items and sequences as Elliptical Gaussian distributions, with the covariance capturing uncertainty, and adopt the Wasserstein distance to measure the similarity between distributions. We devise two novel Transformers for modeling the mean and covariance, which guarantee the positive-definite property of the distributions. The proposed method significantly outperforms state-of-the-art methods. Experiments on three benchmark datasets also demonstrate its effectiveness in alleviating cold-start issues. The code is available at https://github.com/DyGRec/DT4SR.

Sequential Recommendation, Self-Attention, Uncertainty, Data Sparsity

1. Introduction

Figure 1. An example user with uncertain interests. A random item transition connects two items with distinct genres and release years, while a correlated item transition connects items with the same genres and/or release years.

Recommender systems have achieved great success in providing personalized services in various domains, including fashion (Kang et al., 2017), music (Van Den Oord et al., 2013), book (Givon and Lavrenko, 2009), and grocery (Faggioli et al., 2020; Liu et al., 2020a, b) recommendation. Among recommendation methods, Sequential Recommendation (SR) methods (Tang and Wang, 2018; Chen et al., 2018; Kang and McAuley, 2018; Sun et al., 2019; Zheng et al., 2019a) show promising improvements in predicting users' interests. SR methods format each user's historically interacted items as a sequence and dynamically model the user's preference to predict the next item in the sequence.

The core idea of SR is modeling item-item transition relationships within the sequence. The recent advancements of the Transformer (Vaswani et al., 2017) in sequence encoding have inspired the recommendation community to adopt it for modeling user sequences (Kang and McAuley, 2018; Sun et al., 2019). SASRec (Kang and McAuley, 2018) is a pioneering work that uses the Transformer for recommendation, applying self-attention to measure the correlation among item transitions. Several follow-up works (Sun et al., 2019; Li et al., 2020a; Ma et al., 2019) enhance SASRec with more complex components to improve recommendation performance, demonstrating the effectiveness of the Transformer as a backbone encoder for sequence modeling.

However, the stochastic characteristics of item transitions in sequences undermine the ability of existing models to capture sequential correlations, as illustrated in Figure 1. The figure presents the movie-watching records of a user and the corresponding genre and release-year transitions between items. We can observe that item-item transitions fluctuate randomly from both the genre and release-year perspectives. Most item-item transitions are hard to explain because the two items in a transition pair have distinct genres and years, e.g., the transition from “Scent of a Woman” to “The Boss Baby”. This example indicates the importance of modeling each sequence with uncertainty. For each user, we should consider her interactions as a set of stochastic events controlled by distributions of dynamic user interests. Modeling sequences as distributions not only incorporates such transition uncertainty but also benefits the exploration of users' interests, because a distribution covers more of the data space than a single embedding and thus expands the interaction space of a user. As such, the user cold-start issue, i.e., users with few interactions (Liu et al., 2021), is alleviated.

Previous works (Oh et al., 2018; Zheng et al., 2019b) have demonstrated the superiority of representing a stochastic object as a distribution rather than an embedding. Nevertheless, representing items and their associated transitions as distributions is challenging. Firstly, it is non-trivial to dynamically learn mean and covariance to infer the distribution. Moreover, we need to guarantee the positive definite property of covariance during inference. Furthermore, it is still unclear how to measure the distance between the sequence distributions.

To this end, we propose a distribution-based Transformer for sequential recommendation (DT4SR) to model the dynamics of evolving distribution representations while maintaining the necessary properties of a distribution. Specifically, instead of a fixed vector, we use Elliptical Gaussian distributions (Vilnis and McCallum, 2014; Bojchevski and Günnemann, 2017; Zheng et al., 2019b) to represent items, with the covariance measuring uncertainty. We develop mean and covariance Transformers to learn the dynamics of the mean and covariance in the user sequence. Instead of using the dot product to measure the affinity between the inferred next item and item candidates, we use the Wasserstein distance under a metric learning framework (Hsieh et al., 2017; Li et al., 2020b; Tay et al., 2018) to generate a top-N recommendation list of the items with the smallest distribution distances. Our contributions are as follows:

  • To the best of our knowledge, this is the first work to model items and sequences as Elliptical Gaussian distributions in sequential recommendation. We demonstrate that distribution representations with uncertainty characterize the fluctuating and evolving interests of users well and further alleviate the cold-start problem.

  • We develop two novel Transformers for modeling mean and covariance embeddings, adapted to distribution-based sequential recommendation.

  • DT4SR achieves significant improvements over state-of-the-art recommendation methods in both the overall and cold-start settings. The experimental results verify the effectiveness of distribution-based representations in sequential modeling.

2. Problem Definition

A recommender system collects feedback, e.g., clicks, between a set of users $\mathcal{U}$ and a set of items $\mathcal{V}$. Sequential recommendation chronologically models a user $u$'s interaction sequence, denoted as $\mathcal{S}^{u}=[v^{u}_{1},v^{u}_{2},\dots,v^{u}_{|\mathcal{S}^{u}|}]$. The goal of SR is to characterize the dynamics in sequences and then predict the next item. We formulate the objective as follows:

(1) $p\left(v_{|\mathcal{S}^{u}|+1}^{(u)}=v \mid \mathcal{S}^{u}\right)$,

which measures the probability of an item $v$ being the next item, given user $u$'s sequence $\mathcal{S}^{u}$.

3. Proposed Model

In this section, we present the framework of DT4SR, as shown in Figure 2. The overall framework consists of several components: mean and covariance embeddings, distribution-based self-attention, a distribution-based Feed-Forward Network (FFN), and a covariance output layer. The mean and covariance embeddings together describe the distributions of items. The distribution-based self-attention captures dynamic correlations in both the mean and covariance aspects. The distribution-based FFN introduces non-linearity tailored to distribution representations. To guarantee the positive-definite property of the covariance, we propose a covariance output layer. This distribution-based dynamic modeling distinguishes our proposed method from existing sequential methods.

Figure 2. Illustration of the proposed DT4SR. The mean and covariance Transformers take the sequences of mean and covariance embeddings as inputs and output the next-step item's distribution. For example, at step 1, $\mathbf{E}^{\mu}_{i_{1}}$ and $\mathbf{E}^{\Sigma}_{i_{1}}$ together infer $\mathbf{E}^{\mu}_{\hat{i}_{2}}$ and $\mathbf{E}^{\Sigma}_{\hat{i}_{2}}$.

3.1. Embedding Layers

We represent each item with an elliptical Gaussian distribution governed by a mean vector and a diagonal covariance matrix. Specifically, each item has two embedding representations, one for the mean and one for the covariance. We denote the mean embedding table as $\mathbf{E}^{\mu}\in\mathbb{R}^{|\mathcal{V}|\times d}$ and the covariance embedding table as $\mathbf{E}^{\Sigma}\in\mathbb{R}^{|\mathcal{V}|\times d}$, where $\mathcal{V}$ denotes the item set and $d$ is the embedding dimensionality. We also define separate learnable positional embeddings, $\mathbf{P}^{\mu}\in\mathbb{R}^{n\times d}$ for the mean Transformer and $\mathbf{P}^{\Sigma}\in\mathbb{R}^{n\times d}$ for the covariance Transformer, where $n$ denotes the maximum sequence length. The element $\mathbf{p}_{i}$ encodes the positional information at position $i$ in the sequence. As sequence lengths vary considerably across users, we keep the most recent $n$ interactions if a user has more than $n$ interactions; otherwise, we keep all interactions and apply zero padding until the sequence has length $n$. With these embeddings, given a sequence $\mathcal{S}=[v_{1},v_{2},\dots,v_{n}]$, its mean and covariance sequence representations are:

(2) $\mathbf{E}^{\mu}_{\mathcal{S}} =[\mathbf{e}^{\mu}_{1}+\mathbf{p}^{\mu}_{1},\mathbf{e}^{\mu}_{2}+\mathbf{p}^{\mu}_{2},\dots,\mathbf{e}^{\mu}_{n}+\mathbf{p}^{\mu}_{n}]$,
$\mathbf{E}^{\Sigma}_{\mathcal{S}} =[\mathbf{e}^{\Sigma}_{1}+\mathbf{p}^{\Sigma}_{1},\mathbf{e}^{\Sigma}_{2}+\mathbf{p}^{\Sigma}_{2},\dots,\mathbf{e}^{\Sigma}_{n}+\mathbf{p}^{\Sigma}_{n}]$.
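To make the embedding layer concrete, the following PyTorch sketch assembles the mean and covariance sequence representations of Eq. (2). It is a minimal illustration, not the released implementation; the class and argument names (DistributionEmbedding, num_items, max_len) are hypothetical, and item id 0 is assumed to be the padding item.

```python
import torch
import torch.nn as nn


class DistributionEmbedding(nn.Module):
    """Mean and covariance item embeddings plus separate positional embeddings,
    following Eq. (2). A minimal sketch under assumed naming conventions."""

    def __init__(self, num_items, max_len, d):
        super().__init__()
        self.item_mu = nn.Embedding(num_items + 1, d, padding_idx=0)   # E^mu
        self.item_cov = nn.Embedding(num_items + 1, d, padding_idx=0)  # E^Sigma
        self.pos_mu = nn.Embedding(max_len, d)                         # P^mu
        self.pos_cov = nn.Embedding(max_len, d)                        # P^Sigma

    def forward(self, seq):
        # seq: (batch, n) zero-padded item ids, i.e., the most recent n interactions
        positions = torch.arange(seq.size(1), device=seq.device).unsqueeze(0)
        e_mu = self.item_mu(seq) + self.pos_mu(positions)    # mean branch of Eq. (2)
        e_cov = self.item_cov(seq) + self.pos_cov(positions) # covariance branch of Eq. (2)
        return e_mu, e_cov
```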

3.2. Mean and Covariance Self-Attentions

We introduce a novel distribution-based self-attention mechanism that accounts for numerical stability and applicability to distribution embeddings. A typical self-attention block introduces a query $\mathbf{Q}$, a key $\mathbf{K}$, and a value $\mathbf{V}$. In sequential recommendation, $\mathbf{Q}$, $\mathbf{K}$, and $\mathbf{V}$ are linear transformations of the same sequence embedding, i.e., $\mathbf{E}_{\mathcal{S}}^{*}\mathbf{W}^{Q}$, $\mathbf{E}_{\mathcal{S}}^{*}\mathbf{W}^{K}$, and $\mathbf{E}_{\mathcal{S}}^{*}\mathbf{W}^{V}$, where $\mathbf{E}_{\mathcal{S}}^{*}$ is either sequence embedding defined in Eq. (2). Self-attention adopts scaled dot-product attention (Vaswani et al., 2017) between the query and key to measure the weights of previous steps on the current step. To maintain the numerical stability of the query, key, and value, we apply the exponential linear unit (elu) activation (Clevert et al., 2015) after the linear transformations, yielding the distribution-based self-attention:

(3) $\textbf{DSA}=\sigma\left(\frac{\text{elu}(\mathbf{E}_{\mathcal{S}}^{*}\mathbf{W}^{Q})\,\text{elu}(\mathbf{E}_{\mathcal{S}}^{*}\mathbf{W}^{K})^{T}}{\sqrt{d}}\right)\text{elu}(\mathbf{E}_{\mathcal{S}}^{*}\mathbf{W}^{V})$,

where $\sigma(\cdot)$ denotes the softmax function and DSA is short for Distribution Self-Attention. We apply the DSA module in the mean Transformer as $\textbf{DSA}^{\mu}$ and in the covariance Transformer as $\textbf{DSA}^{\Sigma}$, taking $\mathbf{E}_{\mathcal{S}}^{*}=\mathbf{E}_{\mathcal{S}}^{\mu}$ or $\mathbf{E}_{\mathcal{S}}^{*}=\mathbf{E}_{\mathcal{S}}^{\Sigma}$ from Eq. (2) as input, respectively.
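A minimal PyTorch sketch of the DSA module of Eq. (3) is given below. The causal mask, which restricts each position to attend only to earlier steps as in SASRec, is an assumption on our part, and the class name DistributionSelfAttention is illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DistributionSelfAttention(nn.Module):
    """Distribution-based self-attention of Eq. (3): elu-activated query/key/value
    followed by scaled dot-product attention. A sketch, not the released code."""

    def __init__(self, d):
        super().__init__()
        self.w_q = nn.Linear(d, d, bias=False)
        self.w_k = nn.Linear(d, d, bias=False)
        self.w_v = nn.Linear(d, d, bias=False)
        self.scale = d ** 0.5

    def forward(self, e_seq):
        # e_seq: (batch, n, d), either the mean or the covariance sequence embedding
        q = F.elu(self.w_q(e_seq))
        k = F.elu(self.w_k(e_seq))
        v = F.elu(self.w_v(e_seq))
        scores = torch.matmul(q, k.transpose(-2, -1)) / self.scale   # (batch, n, n)
        n = e_seq.size(1)
        # assumed causal mask: each step only attends to itself and earlier steps
        causal = torch.tril(torch.ones(n, n, device=e_seq.device)).bool()
        scores = scores.masked_fill(~causal, float("-inf"))
        return torch.matmul(torch.softmax(scores, dim=-1), v)        # (batch, n, d)
```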

3.3. Mean and Covariance Feed-Forward Layers

In addition to uncovering sequential patterns, we introduce two novel, adapted versions of the feed-forward layers to endow extra non-linearity. We adapt the FFN layer to distribution representations by replacing the ReLU activation with elu. The FFNs with respect to $\textbf{DSA}^{\mu}_{i}$ and $\textbf{DSA}^{\Sigma}_{i}$ at position $i$ are defined as:

(4) $\textbf{FFN}^{\mu}(\textbf{DSA}^{\mu}_{i}) =\text{elu}\left(\text{elu}\left(\textbf{DSA}^{\mu}_{i}W^{\mu}_{1}+b^{\mu}_{1}\right)W^{\mu}_{2}+b^{\mu}_{2}\right)$,
$\textbf{FFN}^{\Sigma}(\textbf{DSA}^{\Sigma}_{i}) =\text{elu}\left(\text{elu}\left(\textbf{DSA}^{\Sigma}_{i}W^{\Sigma}_{1}+b^{\Sigma}_{1}\right)W^{\Sigma}_{2}+b^{\Sigma}_{2}\right)$,

respectively, where all $W$ are in $\mathbb{R}^{d\times d}$ and all $b$ are in $\mathbb{R}^{d}$.

We retain the techniques used in existing Transformer structures, including the residual connection, dropout, and layer normalization. Note that we can stack multiple DSA and FFN layers to learn more complex item relationships. Meanwhile, the multi-head mechanism is also applicable in the DT4SR framework. A sketch of one such feed-forward block is shown below.
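The following sketch implements the feed-forward block of Eq. (4) together with the residual connection, dropout, and layer normalization mentioned above. The exact dropout placement and the class name DistributionFFN are assumptions rather than the released implementation; one instance serves the mean branch and a separate instance serves the covariance branch.

```python
import torch.nn as nn
import torch.nn.functional as F


class DistributionFFN(nn.Module):
    """Point-wise feed-forward layer of Eq. (4) with elu in place of ReLU,
    wrapped with residual connection, dropout, and layer normalization.
    A hedged sketch under assumed dropout placement."""

    def __init__(self, d, dropout=0.2):
        super().__init__()
        self.w1 = nn.Linear(d, d)
        self.w2 = nn.Linear(d, d)
        self.dropout = nn.Dropout(dropout)
        self.norm = nn.LayerNorm(d)

    def forward(self, x):
        # x: (batch, n, d), the output of the corresponding DSA block
        h = F.elu(self.w2(self.dropout(F.elu(self.w1(x)))))
        return self.norm(x + self.dropout(h))   # residual connection + layer norm
```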

3.4. Layer Outputs

A valid distribution requires the covariance matrix to be positive definite, whereas the outputs of the FFN layers do not guarantee this property. Thus, we add an all-ones vector to the output covariance embedding after an elu activation, defined as follows:

(5) $\Sigma_{i}=\mathrm{diag}\left(\text{elu}\left(\mathbf{O}^{(L)}_{i}\right)+\mathbf{1}\right)$,

where $\mathbf{O}^{(L)}_{i}$ denotes the output covariance embedding of item $i$ after $L$ layers of $\textbf{DSA}^{\Sigma}$ and $\textbf{FFN}^{\Sigma}$, and $\mathbf{1}\in\mathbb{R}^{d}$ is an all-ones vector.
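A short sketch of this output layer follows. Since elu(x) + 1 > 0 elementwise, every diagonal entry is strictly positive, so the resulting diagonal covariance is positive definite; the function name covariance_output and the tensor shape are illustrative.

```python
import torch.nn.functional as F


def covariance_output(o_last):
    """Eq. (5): map the last covariance-branch output to a valid diagonal covariance.
    A sketch; o_last is assumed to be the (batch, n, d) output after L stacked
    DSA^Sigma and FFN^Sigma layers."""
    return F.elu(o_last) + 1.0   # strictly positive entries, i.e., the diagonal of Sigma
```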

3.5. Loss and Optimization

Recall that the mean and covariance Transformers' outputs infer the mean and covariance embeddings of the next-step item at each position in the sequence. Following the setting in (Kang and McAuley, 2018), we use the next-step item as the ground-truth prediction for each position in the sequence. For example, given a sequence $\mathcal{S}^{u}=[v^{u}_{1},v^{u}_{2},v^{u}_{3},\dots,v^{u}_{n}]$, the ground-truth item for $v^{u}_{1}$ is $v^{u}_{2}$. Note that if $v^{u}_{i}$ is a padding item, its ground-truth item is also a padding item.

3.5.1. Wasserstein Distance and Loss

To measure how accurately the model predicts the next item, we need to quantify the distance between the distribution of the ground-truth item and the inferred distribution. The most popular approaches for calculating the distance between two distributions are the Kullback–Leibler (KL) divergence (Kullback and Leibler, 1951) and the $p$-th Wasserstein distance (Arjovsky et al., 2017). The Wasserstein distance can measure the distance between two distributions with no overlap, while the KL divergence cannot. Therefore, we adopt the Wasserstein distance, specifically the 2-Wasserstein distance, to measure the distance between Gaussian distributions. Given two items $i_{1}$ and $i_{2}$ with distributions $\mathcal{N}(\mu_{i_{1}},\Sigma_{i_{1}})$ and $\mathcal{N}(\mu_{i_{2}},\Sigma_{i_{2}})$, the 2-Wasserstein distance is:

(6) $d_{W_{2}}(i_{1},i_{2})=\|\mu_{i_{1}}-\mu_{i_{2}}\|^{2}_{2}+\text{trace}\left(\Sigma_{i_{1}}+\Sigma_{i_{2}}-2\left(\Sigma_{i_{2}}^{1/2}\Sigma_{i_{1}}\Sigma_{i_{2}}^{1/2}\right)^{1/2}\right)$.
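For the diagonal covariances used here, the covariance matrices commute and the trace term of Eq. (6) reduces to the squared Euclidean distance between the element-wise square roots of the diagonals. The following sketch computes this simplified form; the function name wasserstein2_diag is illustrative.

```python
import torch


def wasserstein2_diag(mu1, sigma1, mu2, sigma2):
    """2-Wasserstein distance of Eq. (6) for diagonal Gaussians.
    With commuting diagonal covariances the trace term reduces to
    sum_k (sigma1_k + sigma2_k - 2 * sqrt(sigma1_k * sigma2_k))
      = || sqrt(sigma1) - sqrt(sigma2) ||_2^2.
    A sketch; mu* and sigma* are (..., d) tensors with positive sigma entries."""
    mean_term = ((mu1 - mu2) ** 2).sum(dim=-1)
    cov_term = ((sigma1.sqrt() - sigma2.sqrt()) ** 2).sum(dim=-1)
    return mean_term + cov_term
```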

3.5.2. Loss

Based on the 2-Wasserstein distance $d_{W_{2}}$, we propose a Wasserstein distance-based BPR loss (Rendle et al., 2012) to measure the correctness of the next-item prediction:

(7) $-\sum_{\mathcal{S}^{u}\in\mathcal{S}}\sum_{t\in[1,2,\dots,|\mathcal{S}^{u}|]}\log\left(\sigma\left(d_{W_{2}}(i^{\prime}_{t},\hat{t})-d_{W_{2}}(i_{t},\hat{t})\right)\right)+\lambda\|\Theta\|_{2}^{2}$,

where $\hat{t}$ denotes the inferred distribution at position $t$, $i_{t}$ is the ground-truth item, and $i^{\prime}_{t}$ denotes a negative item sampled from the items that user $u$ has never interacted with. $\Theta$ is the set of all learnable parameters in both the mean and covariance Transformers.
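A sketch of this loss, reusing the wasserstein2_diag helper from the previous sketch, is shown below. The per-position negative sampling and the padding-mask handling are illustrative assumptions, not the released implementation.

```python
import torch


def bpr_wasserstein_loss(mu_hat, sigma_hat, mu_pos, sigma_pos,
                         mu_neg, sigma_neg, mask, params=None, reg=0.0):
    """Wasserstein distance-based BPR loss of Eq. (7): the inferred distribution at
    each step should be closer to the ground-truth item than to a sampled negative.
    All distribution tensors are (batch, n, d); `mask` marks non-padding positions."""
    d_pos = wasserstein2_diag(mu_hat, sigma_hat, mu_pos, sigma_pos)   # (batch, n)
    d_neg = wasserstein2_diag(mu_hat, sigma_hat, mu_neg, sigma_neg)   # (batch, n)
    loss = -torch.log(torch.sigmoid(d_neg - d_pos) + 1e-24)           # BPR on distances
    loss = (loss * mask).sum() / mask.sum()                           # ignore padded steps
    if params is not None and reg > 0:                                # optional L2 term
        loss = loss + reg * sum((p ** 2).sum() for p in params)
    return loss
```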

For evaluation, for each user $u$, we compute distance scores based on the 2-Wasserstein distance $d_{W_{2}}(\hat{\mathcal{S}^{u}_{n}},i)$ over all candidate items $i\in\mathcal{V}$, where $\hat{\mathcal{S}^{u}_{n}}$ denotes the inferred next-item distribution at the last position $n$. We then rank the candidates in ascending order of distance to generate the recommendation list.
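The ranking step can be sketched as follows, again reusing wasserstein2_diag from above; the candidate-tensor layout and the function name recommend_top_n are assumptions.

```python
import torch


def recommend_top_n(mu_last, sigma_last, cand_mu, cand_sigma, n=5):
    """Rank candidate items by their 2-Wasserstein distance to the inferred
    next-item distribution at the last position. A sketch; mu_last/sigma_last are
    (d,) and cand_mu/cand_sigma are (num_candidates, d) item distributions."""
    dist = wasserstein2_diag(mu_last.unsqueeze(0), sigma_last.unsqueeze(0),
                             cand_mu, cand_sigma)          # (num_candidates,)
    return torch.topk(dist, k=n, largest=False).indices    # smallest distance first
```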

4. Experiments

In this section, we validate the effectiveness of the proposed DT4SR by presenting the experimental settings and results. The experiments are designed to answer the following research questions:

  • RQ1: Does DT4SR outperform the state-of-the-art recommendation methods?

  • RQ2: Do distribution representations provide better recommendations than single point embeddings per item?

  • RQ3: Are distribution representations in sequential modeling effective for alleviating cold-start user/item problems?

Table 1. Datasets Statistics
Dataset #users #items #actions density
Amazon Toys 57,617 69,147 410,920 0.010%
Amazon Beauty 52,204 57,289 394,908 0.013%
Amazon Games 31,013 23,715 287,107 0.039%

4.1. Datasets

We evaluate the proposed DT4SR on three public benchmark datasets from the Amazon review datasets (McAuley et al., 2015). Amazon datasets are known for high sparsity and cover several categories of rating reviews. The dataset statistics are presented in Table 1. Following (Kang and McAuley, 2018; Sun et al., 2019), we treat the presence of a rating as positive implicit feedback. We sort each user's interactions by rating timestamp to generate the sequence. The most recent interaction is used for testing and the second most recent for validation.

4.2. Experimental Settings

Evaluation Protocol. We evaluate all models with three standard metrics: Recall@N, NDCG@N, and Mean Reciprocal Rank (MRR). Recall@N measures the accuracy of retrieving relevant items in the top-N recommendation. NDCG@N also considers the ranked positions of the retrieved relevant items in the top-N list. MRR measures the ranking performance of the entire recommendation list. We set N to 1 and 5 for evaluation. For each user, we randomly sample 1,000 items the user has not interacted with as negative items, for ranking efficiency.
Baselines. We compare DT4SR with several state-of-the-art recommendation methods in three relevant categories: static methods with point embedding vectors, static metric learning methods, and sequential methods. We use LightGCN (He et al., 2020) as the strong baseline with point embedding vectors. We also compare with BPRMF (Rendle et al., 2012), but omit it from the performance table because of its low values. For metric learning methods, we compare with the recent DDN (Zheng et al., 2019b) and SML (Li et al., 2020b). For sequential recommendation methods, we adopt the metric learning-based TransRec (He et al., 2017) and the Transformer-based SASRec (Kang and McAuley, 2018) as baselines. Note that TransRec belongs to both the metric learning framework and sequential recommendation.
Hyper-parameter Settings. For all baselines, we search the embedding dimension in $\{32,64,128\}$. As the proposed model has both mean and covariance embeddings, we only search $\{16,32,64\}$ for DT4SR for a fair comparison. We search the $L_2$ regularization weight in $\{0.0,0.001,0.005,0.01,0.05,0.1\}$. The number of layers $L$ is selected from $\{1,2\}$. We do not list all baseline-specific hyper-parameters because of space limitations. We grid search all possible combinations for all models and report the test set performance based on the best validation MRR.

Table 2. Performance Comparison in Recall@1, Recall@5, NDCG@5, and MRR. The best and second-best results are boldfaced and underlined, respectively.
Dataset Metric LGCN TRec DDN SML SASRec DT4SR Imp.
Toys Recall@1 0.0603 0.0622 0.0657 0.0666 \ul0.0669 0.0886 +32.4%
Recall@5 0.1494 0.1566 0.1423 0.1435 \ul0.1661 0.1762 +6.1%
NDCG@5 0.1060 0.1085 0.0951 0.0933 \ul0.1179 0.1341 +14.5%
MRR 0.1084 0.1096 0.0954 0.0938 \ul0.1187 0.1348 +13.5%
Beauty Recall@1 0.0592 0.0404 0.0464 0.0412 \ul0.0667 0.0774 +16.0%
Recall@5 0.1510 0.1144 0.1232 0.1258 \ul0.1541 0.1652 +7.2%
NDCG@5 0.1060 0.0781 0.0840 0.0788 \ul0.1121 0.1230 +9.7%
MRR 0.1081 0.0825 0.0880 0.0797 \ul0.1124 0.1229 +9.3%
Games Recall@1 0.0825 0.0809 0.0722 0.0776 \ul0.1111 0.1238 +11.4%
Recall@5 0.2313 0.2106 0.2218 0.2061 \ul0.2870 0.3064 +6.7%
NDCG@5 0.1584 0.1474 0.1422 0.1437 \ul0.2014 0.2183 +8.3%
MRR 0.1610 0.1513 0.1423 0.1439 \ul0.1984 0.2152 +8.4%

4.3. Overall Comparison (RQ1 and RQ2)

We report the overall performance comparison between the proposed DT4SR and the baselines in Table 2. Our observations are as follows:

  • RQ1: DT4SR significantly outperforms all baselines on all three datasets across all evaluation metrics. The average relative improvements over the second-best baseline are 19.9% in Recall@1, 6.7% in Recall@5, 10.8% in NDCG@5, and 10.4% in MRR. These improvements demonstrate the effectiveness of distribution-based representations with uncertainty in sequential recommendation.

  • RQ2: Compared with the point embedding representation sequential model SASRec, the proposed DT4SR still achieves a great improvement margin. The reasons for this improvement are twofold. First, the distribution-based representations provide uncertainty information when we model users with various interests. Moreover, the sequential modeling for mean and covariance can dynamically capture the evolving patterns of uncertainty within the sequence.

  • Among all baselines, we can observe that sequential methods achieve outstanding performance, demonstrating the necessity of utilizing sequential information. Moreover, the state-of-the-art CF method LightGCN outperforms other static baselines, owing to its capability of using graph information.

4.4. Performance w.r.t. Sequence Length (RQ3)

Figure 3. The Recall@5 performance on different sequence lengths over Amazon Toys and Amazon Games datasets.

We present the performance of LightGCN, SASRec, and the proposed DT4SR with respect to different sequence lengths (i.e., the number of interactions of users) in Figure 3. We can observe that DT4SR performs the best on all sequence length intervals. Note that for users with only one interaction, the relative performance gain of DT4SR over SASRec is 9% in Recall@5, which demonstrates the effectiveness of DT4SR for cold-start users. We can also observe that long sequences (i.e., users with more than 20 interactions) obtain considerable improvements, especially on the Toys dataset. Long sequences typically cover diverse interests, and this observation supports the necessity of uncertainty in representing sequences.

4.5. Performance on Cold Start Items  (RQ3)

Figure 4. The Recall@5 performance on cold start items over Amazon Toys and Amazon Games datasets.

We plot the performance of LightGCN, SASRec, and the proposed DT4SR on cold-start items, i.e., items with only limited interactions in the training data, in Figure 4. For the Toys dataset, DT4SR achieves a 100% relative improvement on extremely cold-start items (i.e., items with only one interaction). For the Games dataset, the performance gain is also more than 50% for extremely cold-start items. This experiment shows that distribution representations for items and sequences are more expressive and generalize better to cold-start items. Note that DT4SR still achieves comparable performance on frequent items.

5. Conclusion

We propose DT4SR, which brings distribution representations into the Transformer framework for sequential recommendation. Different from existing sequential methods that assume users are deterministic, we introduce uncertainty into the representations of sequences and items, using Elliptical Gaussian distributions to represent items with uncertainty. Building upon distribution representations, we propose two novel Transformers for learning the mean and covariance while guaranteeing the positive-definite property of the covariance. We conduct experiments demonstrating that the proposed DT4SR significantly outperforms existing methods in both overall performance and cold-start user/item recommendation.

References

  • Arjovsky et al. (2017) Martin Arjovsky, Soumith Chintala, and Leon Bottou. 2017. Wasserstein generative adversarial networks. In International conference on machine learning. PMLR, 214–223.
  • Bojchevski and Günnemann (2017) Aleksandar Bojchevski and Stephan Günnemann. 2017. Deep gaussian embedding of graphs: Unsupervised inductive learning via ranking. arXiv preprint arXiv:1707.03815 (2017).
  • Chen et al. (2018) Xu Chen, Hongteng Xu, Yongfeng Zhang, Jiaxi Tang, Yixin Cao, Zheng Qin, and Hongyuan Zha. 2018. Sequential recommendation with user memory networks. In Proceedings of the eleventh ACM international conference on web search and data mining. 108–116.
  • Clevert et al. (2015) Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. 2015. Fast and accurate deep network learning by exponential linear units (elus). arXiv preprint arXiv:1511.07289 (2015).
  • Faggioli et al. (2020) Guglielmo Faggioli, Mirko Polato, and Fabio Aiolli. 2020. Recency aware collaborative filtering for next basket recommendation. In Proceedings of the 28th ACM Conference on User Modeling, Adaptation and Personalization. 80–87.
  • Givon and Lavrenko (2009) Sharon Givon and Victor Lavrenko. 2009. Predicting social-tags for cold start book recommendations. In Proceedings of the third ACM conference on Recommender systems. 333–336.
  • He et al. (2017) Ruining He, Wang-Cheng Kang, and Julian McAuley. 2017. Translation-based recommendation. In Proceedings of the eleventh ACM conference on recommender systems. 161–169.
  • He et al. (2020) Xiangnan He, Kuan Deng, Xiang Wang, Yan Li, Yongdong Zhang, and Meng Wang. 2020. Lightgcn: Simplifying and powering graph convolution network for recommendation. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. 639–648.
  • Hsieh et al. (2017) Cheng-Kang Hsieh, Longqi Yang, Yin Cui, Tsung-Yi Lin, Serge Belongie, and Deborah Estrin. 2017. Collaborative metric learning. In Proceedings of the 26th international conference on world wide web. 193–201.
  • Kang et al. (2017) Wang-Cheng Kang, Chen Fang, Zhaowen Wang, and Julian McAuley. 2017. Visually-aware fashion recommendation and design with generative image models. In 2017 IEEE International Conference on Data Mining (ICDM). IEEE, 207–216.
  • Kang and McAuley (2018) Wang-Cheng Kang and Julian McAuley. 2018. Self-attentive sequential recommendation. In 2018 IEEE International Conference on Data Mining (ICDM). IEEE, 197–206.
  • Kullback and Leibler (1951) S. Kullback and R. A. Leibler. 1951. On information and sufficiency. Annals of Mathematical Statistics 22, 1 (1951), 79–86.
  • Li et al. (2020a) Jiacheng Li, Yujie Wang, and Julian McAuley. 2020a. Time interval aware self-attention for sequential recommendation. In Proceedings of the 13th International Conference on Web Search and Data Mining. 322–330.
  • Li et al. (2020b) Mingming Li, Shuai Zhang, Fuqing Zhu, Wanhui Qian, Liangjun Zang, Jizhong Han, and Songlin Hu. 2020b. Symmetric metric learning with adaptive margin for recommendation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34. 4634–4641.
  • Liu et al. (2021) Zhiwei* Liu, Ziwei* Fan, Yu Wang, and Philip S. Yu. 2021. Augmenting Sequential Recommendation with Pseudo-Prior Items via Reversely Pre-training Transformer. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval.
  • Liu et al. (2020a) Zhiwei Liu, Xiaohan Li, Ziwei Fan, Stephen Guo, Kannan Achan, and Philip S Yu. 2020a. Basket Recommendation with Multi-Intent Translation Graph Neural Network. arXiv preprint arXiv:2010.11419 (2020).
  • Liu et al. (2020b) Zhiwei Liu, Mengting Wan, Stephen Guo, Kannan Achan, and Philip S Yu. 2020b. BasConv: Aggregating Heterogeneous Interactions for Basket Recommendation with Graph Convolutional Neural Network. In Proceedings of the 2020 SIAM International Conference on Data Mining. SIAM, 64–72.
  • Ma et al. (2019) Chen Ma, Peng Kang, and Xue Liu. 2019. Hierarchical gating networks for sequential recommendation. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 825–833.
  • McAuley et al. (2015) Julian McAuley, Christopher Targett, Qinfeng Shi, and Anton Van Den Hengel. 2015. Image-based recommendations on styles and substitutes. In Proceedings of the 38th international ACM SIGIR conference on research and development in information retrieval. 43–52.
  • Oh et al. (2018) Seong Joon Oh, Kevin Murphy, Jiyan Pan, Joseph Roth, Florian Schroff, and Andrew Gallagher. 2018. Modeling uncertainty with hedged instance embedding. arXiv preprint arXiv:1810.00319 (2018).
  • Rendle et al. (2012) Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. 2012. BPR: Bayesian personalized ranking from implicit feedback. arXiv preprint arXiv:1205.2618 (2012).
  • Sun et al. (2019) Fei Sun, Jun Liu, Jian Wu, Changhua Pei, Xiao Lin, Wenwu Ou, and Peng Jiang. 2019. BERT4Rec: Sequential recommendation with bidirectional encoder representations from transformer. In Proceedings of the 28th ACM international conference on information and knowledge management. 1441–1450.
  • Tang and Wang (2018) Jiaxi Tang and Ke Wang. 2018. Personalized top-n sequential recommendation via convolutional sequence embedding. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining. 565–573.
  • Tay et al. (2018) Yi Tay, Luu Anh Tuan, and Siu Cheung Hui. 2018. Latent relational metric learning via memory-based attention for collaborative ranking. In Proceedings of the 2018 World Wide Web Conference. 729–739.
  • Van Den Oord et al. (2013) Aäron Van Den Oord, Sander Dieleman, and Benjamin Schrauwen. 2013. Deep content-based music recommendation. In Neural Information Processing Systems Conference (NIPS 2013), Vol. 26. Neural Information Processing Systems Foundation (NIPS).
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. arXiv preprint arXiv:1706.03762 (2017).
  • Vilnis and McCallum (2014) Luke Vilnis and Andrew McCallum. 2014. Word representations via gaussian embedding. arXiv preprint arXiv:1412.6623 (2014).
  • Zheng et al. (2019a) Lei Zheng, Ziwei Fan, Chun-Ta Lu, Jiawei Zhang, and Philip S Yu. 2019a. Gated Spectral Units: Modeling Co-evolving Patterns for Sequential Recommendation. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval. 1077–1080.
  • Zheng et al. (2019b) Lei Zheng, Chaozhuo Li, Chun-Ta Lu, Jiawei Zhang, and Philip S Yu. 2019b. Deep Distribution Network: Addressing the Data Sparsity Issue for Top-N Recommendation. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval. 1081–1084.