
Attention on Global-Local Representation Spaces in Recommender Systems

Munlika Rattaphun, Wen-Chieh Fang, and Chih-Yi Chiu
M. Rattaphun, W. C. Fang, and C. Y. Chiu are with the Department of Computer Science and Information Engineering, National Chiayi University, Taiwan, R.O.C.
Corresponding author: Wen-Chieh Fang.
E-mail: [email protected]; [email protected]; [email protected]
Abstract

In this study, we present a novel clustering-based collaborative filtering (CF) method for recommender systems. Clustering-based CF methods can effectively deal with data sparsity and scalability problems. However, most of them are applied to a single representation space, which might not characterize complex user-item interactions well. We argue that user-item interactions should be observed from multiple views and characterized in an adaptive way. To address this issue, we leverage global and local properties to construct multiple representation spaces by learning from various training datasets and loss functions. An attention network is built to generate a blended representation according to the relative importance of the representation spaces for each user-item pair, providing a flexible way to characterize diverse user-item interactions. Substantial experiments were conducted on four popular benchmark datasets. The results show that the proposed method is superior to several CF methods in which only one representation space is considered.

Index Terms:
attention, clustering, collaborative filtering, deep learning, multi-views.

1 Introduction

Recommender systems are becoming increasingly popular in many web applications. Because customers have various opinions and preferences, satisfying their requirements becomes an important issue for content and product providers. There is a great need for personalized recommender systems that suggest appropriate content and products to users in a wide variety of online services, such as entertainment [1], travel [2], and social networks [3].

The general approaches used in recommender systems include collaborative filtering (CF) and content-based filtering. The CF approach builds a model from a user’s past behavior (items that are purchased, viewed, selected, or rated), as well as similar behaviors by other users. The model is then used to recommend unseen items that will likely interest the user. The content-based filtering approach uses a series of discrete, previously-tagged characteristics of an item to recommend other items with similar properties. A modern recommender system often combines these two approaches into a hybrid system. In this study, we focus on the CF approach.

Matrix factorization is one of the most popular CF techniques [4][5][6]. The relation between users and items is represented as an interaction matrix, and the matrix is then factorized into a user latent matrix and an item latent matrix. The similarity between a user and an item is usually measured by the inner product or Euclidean distance between their latent vectors. Missing values in the interaction matrix can thus be inferred from the predicted user-item similarity and used to recommend non-interacted items to a user. Another widely used CF technique is the neighborhood model [7][8]. It selects the nearest neighbors for each user/item based on their similarities. The missing values of the user/item are predicted by combining its neighbors' weighted similarities.

As the system scales with rapid growth in the numbers of users and items, the proportion of observed interactions over the whole set of users and items becomes very small, and these CF techniques must deal with data sparsity and scalability problems. One remedy is clustering-based recommendation [9][10][11][12][13]. This approach groups similar users or items into subsets by means of clustering. From the view of clusters, interactions between the user and item subsets are no longer sparse. In addition, the clustering technique can be integrated with an index structure to overcome the scalability issue. In other words, instead of an exhaustive search, cluster pruning can be used to keep only a few candidates for evaluation with respect to a given query. The complexity is reduced to sub-linear time in the number of items, making recommendation more efficient. Therefore, the clustering-based approach has recently been developed extensively to address sparsity and scalability problems.

Another key point is that most existing methods are applied to a single representation space. However, user-item interaction behaviors are often diverse and complex; it is beneficial to observe them from multiple views and to characterize them in an adaptive way. For example, in the clustering-based approach, we can extract not only local features from clusters but also global features from the whole dataset. These features may reflect significant properties of various user-item interactions. Moreover, they can be combined to derive a more discriminating representation.

Following the above discussion, we propose a novel method for clustering-based CF recommender systems that involves leveraging global and local views. Multiple representation spaces are learned from the training data of global and local user-item interactions with different loss functions. To fuse the information from different representation spaces, an attention network is adopted to formulate an attentive representation for each user-item pair by blending the relative weights among the representation spaces. Compared with existing attention-based methods that use an attention model at the item level (considering which item is more important), the proposed attention model is used at the representation level (considering which representation is more important).

We highlight the contributions of the proposed method as follows:

  • Multiple representation spaces are constructed based on different global/local properties and loss functions. Rather than using only one representation as in most existing methods, multiple representations can provide an abundance of features to describe the relation between users and items.

  • An attention network is presented to capture the relative importance of multiple representation spaces. It can dynamically fuse multiple representations for each user-item pair, forming a flexible measure to characterize diverse user-item interactions.

  • Substantial experiments comparing various configurations and state-of-the-art methods are conducted on four benchmark datasets. The results support the effectiveness and superiority of the proposed method.

The remainder of this paper is organized as follows. Section 2 presents related work on recommender systems. Section 3 elaborates the proposed method, including the global/local representation networks and attention network. Section 4 demonstrates and discusses experimental results. Finally, conclusions are summarized in Section 5.

2 Related Work

Among the large body of literature on recommender systems, we focus on recent CF-based methods and briefly introduce them. We also review the clustering-based approach and the attention mechanisms that are highly related to our work.

2.1 CF-based recommender systems

The core idea of the CF approach is to model the preference of users toward items according to their historical interactions. Several techniques are proposed to address this issue. In addition to the above-mentioned matrix factorization and neighborhood model, here we introduce other popular techniques, namely, ranking formulation and deep learning.

Rendle et al. [14] presented a generic optimization criterion based on Bayesian personalized ranking, called BPR-Opt, for personalized ranking in the item recommendation task. BPR-Opt is the maximum posterior estimator derived from a Bayesian analysis of the problem. The main idea is to rank observed items higher than unobserved items. They also provided a generic learning algorithm for optimizing models with respect to BPR-Opt. Following a similar ranking formulation, Hsieh et al. [6] proposed collaborative metric learning (CML), which encodes user preferences and user-user/item-item similarity in a joint metric space. CML achieves a significant speedup for top-N recommendation tasks using off-the-shelf approximate nearest-neighbor search with negligible accuracy reduction.

More recent approaches tend to use deep learning techniques. He et al. [5] presented a general neural collaborative filtering framework for recommender systems, whose fused instantiation is called NeuMF. They used nonlinear neural networks as the user-item interaction function: NeuMF expresses and generalizes matrix factorization and leverages a multi-layer perceptron to learn the user-item interaction function. The authors transformed the identity of a user/item into a binarized sparse vector with one-hot encoding, which explores the user/item features only in a limited manner. Due to the complexity of the scoring function, the NeuMF retrieval process is generally hard to accelerate. Rendle et al. [15] revisited the NeuMF experiments and concluded that a dot product might be a better default choice for combining embeddings than learned similarities using an MLP or NeuMF. Xue et al. [16] proposed a neural network framework that models higher-order item relations for item-based collaborative filtering by using multiple nonlinear layers above pairwise interaction modeling. Under this framework, the authors also used an attention model to differentiate the importance of pairwise interactions. Deng et al. [17] proposed a general framework called deep collaborative filtering (DeepCF) that combines representation learning and matching function learning. They also proposed a model called the collaborative filtering network (CFNet), based on a plain-vanilla MLP, under the DeepCF framework. CFNet has great flexibility in learning complex matching functions, with good efficiency in learning low-rank relations between users and items.

To handle high-dimensional and sparse matrices efficiently, Luo et al. presented a series of studies on latent factor-based CF methods [18][19][20][21]. For example, they investigated the stochastic gradient descent algorithm and developed several extensions to improve accuracy [18]. In addition, non-negative factorization algorithms were proposed to overcome the slow convergence problem under non-negativity constraints [19][20]. Wu et al. designed a deep structure to construct hierarchical latent factor models that achieve high computation and storage efficiency [21].

2.2 Clustering-based recommender systems

The clustering-based approach is well known for its scalability to large and sparse systems. The basic steps include calculating the similarity based on rating data, and then applying a clustering algorithm to group similar users and items into clusters. Some advanced methods are introduced in the following.

Ji et al. [22] proposed a reconstructive method for improving matrix approximation by introducing clustering and transfer learning techniques. The method compresses a low-rank approximation into a cluster-level rating pattern referred to as a codebook and then constructs an improved approximation by expanding the codebook. Wu et al. [23] proposed a scalable co-clustering method called CCCF. The idea is that users may have different preferences over different subsets of items, and these subsets may overlap. CCCF can explore subgroups of users who share interests over similar subsets of items through user-item co-clustering. Chen et al. [12] proposed a recommender system called N2VSCDNNR based on node2vec [24] and clustering technology. N2VSCDNNR assigns weights to item clusters according to the interaction frequency between items and users, and the items in the high-weight clusters are recommended to the target user cluster. Kang and McAuley [13] proposed a clustering-based framework called CIGAR, which learns a preference-preserving binary embedding model and a candidate item re-ranking model. Although using binary codes can significantly reduce the inference cost, their constrained capability limits the performance compared to models that use real-valued representations.

2.3 Recommender systems with attention mechanisms

Attention mechanisms are components of prediction systems that enable the system to focus sequentially on different subsets of the input [25]. They have recently been widely adopted in recommender systems.

Chen et al. [26] presented an attentive collaborative filtering model to address implicit feedback in multimedia recommendation. They introduced item-level and component-level attention models to assign attentive weights for inferring the underlying user preferences encoded in implicit user feedback. Zhu et al. [27] presented a tree-based deep recommendation model that predicts user preferences using a max-heap-like tree probability formulation with an attention-based deep network; the underlying deep model and tree structure are learned in turn. Tay et al. [28] proposed a metric learning approach called latent relational metric learning for recommendation. They adopt the idea of translating users to items by translation vectors while relying on a memory attention network to learn the relation vectors. He et al. [29] proposed a user-generated list recommendation model that leverages the hierarchical structure of items, lists, and users to capture the containment relationship between lists and items. They applied a self-attention network to refine the item and list representations by considering the consistency of neighboring items and lists. Hu et al. [30] introduced a model named BCFNet, an extended version of DeepCF [17]. BCFNet consists of three sub-models: a representation learning module, a matching function learning module, and a balance module. In particular, an attention mechanism is integrated to improve the representation learning and matching function learning.

For group recommendation, attention mechanisms can be employed to aggregate the properties of a group of users or items. Cao et al. [31] proposed using NeuMF with attention models to characterize the interests of users and user groups at the same time. They further integrated the modeling of user-item interactions into their method, making it possible to reinforce the two tasks of recommending items to users and to user groups. Chen et al. [1] considered the task of recommending a set of items (a bundle) to a user. They designed a factorized attention network to fuse the set of item representations into a bundle representation.

Attention mechanisms are also applied in sequential recommendation. For example, Zheng et al. [2] proposed a memory-augmented hierarchical attention network (MAHAN) for next point-of-interest (POI) recommendation. MAHAN incorporates two networks to tackle short-term check-in sequences and long-term memories for preference modeling, and a co-attention network is used to characterize the interaction between the long-term and short-term preferences. Wu et al. [32] presented a similar idea, where the long-term preference is formulated by attention weight evaluation. Hao et al. [33] explored users' short-term preferences based on the local and global features of the annular graph of the user behavior sequence via the self-attention mechanism. Fan et al. [34] introduced low-rank decomposed self-attention to generate content-aware representations, where $n$ items are aggregated into $k$ latent interests ($n>k$) to perform a lightweight self-attention computation.

2.4 Summary

To sum up, CF techniques that employ neural networks to model user-item interactions have become a popular choice. However, these techniques mainly focus on a single representation space, which we argue is insufficient to model complicated user-item interactions. Clustering-based methods have been proposed in several studies to address the sparsity and scalability problems, but few of them take advantage of global and local representations derived from the cluster structure. We point out that multiple representations are beneficial because they provide multiple views for observing user-item interactions, and attention mechanisms provide an effective way to fuse these representations into a blended one. We develop a representation-level attention, which differs from the item-level attention used by existing methods.

3 Proposed Method

In the proposed method, we build multiple representation networks and an attention network for recommendation. Fig. 1 illustrates an overview of the training process for these networks. Given a set of user-item interactions, we prepare two categories of training data, namely global and local, according to whether clustering is involved. The representation networks are then trained using the different training data categories and loss functions. Next, multiple global and local representations, which are generated from the representation networks, are employed to train the attention network. The attention network combines these representations to obtain the attentive representations of the users and items, which are used to measure the compatibility between each user-item pair. Details are elaborated in the following subsections.

Figure 1: Overview of the training process of the proposed method.

3.1 Global representation

The goal of the CF approach with implicit feedback is to explore the unobserved relations between user $u$ and item $i$ and to answer the question: will the user interact with the item in the future? A user-item interaction consists of, for example, the user assigning a score to an item or checking information about an item. The interaction history is usually described by an interaction matrix, denoted as $A=[a_{ui}]$, which contains the user ID $u$, the item ID $i$, and the interaction $a_{ui}$ that represents the preference of user $u$ for item $i$. This study focuses on evidence that the user interacted with an item, regardless of the extent of the interaction. Therefore, the matrix element $a_{ui}$ is defined by:

a_{ui}=\begin{cases}1,&\text{if }u\text{ interacted with }i,\\ 0,&\text{otherwise.}\end{cases}\qquad(1)

A triplet training set is sampled from $A$ and expressed as:

T=\left\{(u,i^{+},i^{-})\,\big|\,a_{ui^{+}}=1,\;a_{ui^{-}}=0\right\},\qquad(2)

where $i^{+}$ and $i^{-}$ are the positive and negative items, respectively. In practice, it is impossible to enumerate all triplet combinations, and many of them are trivial for training. We adopt a simple strategy in this study [6][35]. Given the row vector of user $u$ in the interaction matrix $A$, for each positive item $i^{+}$ (where $a_{ui^{+}}=1$), we randomly choose a negative item $i^{-}$ (where $a_{ui^{-}}=0$) to produce a triplet $(u,i^{+},i^{-})$. The sampling process repeats until the number of triplets is sufficient.
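To make the sampling procedure concrete, the following is a minimal NumPy sketch of this triplet sampling strategy, assuming a small dense binary interaction matrix; the function name and the rejection-sampling detail for negative items are illustrative rather than taken from the released code.

```python
import numpy as np

def sample_triplets(A, num_triplets, rng=None):
    """Sample (user, positive item, negative item) triplets from a binary
    interaction matrix A: each positive item of a user is paired with a
    uniformly drawn non-interacted item."""
    rng = rng or np.random.default_rng(0)
    num_users, num_items = A.shape
    triplets = []
    while len(triplets) < num_triplets:
        u = rng.integers(num_users)
        pos_items = np.flatnonzero(A[u])          # items with a_ui = 1
        if pos_items.size == 0:
            continue                              # skip users without interactions
        i_pos = rng.choice(pos_items)
        i_neg = rng.integers(num_items)           # rejection-sample a negative item
        while A[u, i_neg] == 1:
            i_neg = rng.integers(num_items)
        triplets.append((u, i_pos, i_neg))
    return triplets

# Toy usage: 4 users x 5 items.
A = np.array([[1, 0, 0, 1, 0],
              [0, 1, 0, 0, 1],
              [1, 1, 0, 0, 0],
              [0, 0, 1, 0, 1]])
print(sample_triplets(A, num_triplets=3))
```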

To derive the global representation, local information such as the cluster assignment of the training data is ignored during the training phase. Let $G\subseteq\mathbb{R}^{d}$ be a $d$-dimensional global representation space. To project user $u$ and item $i$ into $G$, the representation network is designed based on the Siamese network [36], as shown in Fig. 2. The embedding layer transforms a user or item ID into a dense vector of a fixed size. The embedding vector is then forwarded to hidden layers for nonlinear transformation. Note that we modify the typical Siamese network so that the variant accepts a triplet $(u,i^{+},i^{-})\in T$ as input during training. In addition, we separate the user from the item without sharing the hidden layers; that is, the network parameters can be very different between the user part and the item part. Because users and items are different in nature, they should be processed through different paths in order to be projected into the common representation space. Through the respective embedding and hidden layers, the inputs $u$, $i^{+}$, and $i^{-}$ are transformed, with L2 normalization, into $G$ as the representation outputs $\hat{u}$, $\hat{i}^{+}$, and $\hat{i}^{-}$, respectively.

Figure 2: The representation network.
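As an illustration of the network in Fig. 2, the following Keras sketch builds two unshared towers (one for users, one for items), each with an embedding layer, two 64-unit hidden layers, and an L2-normalized output, matching the layer sizes reported in Section 4.3; the activation choices and names are assumptions, not a verbatim reproduction of the authors' implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_tower(num_ids, emb_dim=64, out_dim=64, name="user"):
    """One branch of the (unshared) triplet network:
    ID -> embedding -> two hidden layers -> L2-normalized representation."""
    ids = layers.Input(shape=(), dtype="int32", name=f"{name}_id")
    x = layers.Embedding(num_ids, emb_dim)(ids)
    x = layers.Dense(64, activation="relu")(x)
    x = layers.Dense(64, activation="relu")(x)
    x = layers.Dense(out_dim)(x)
    out = layers.Lambda(lambda v: tf.math.l2_normalize(v, axis=-1))(x)
    return Model(ids, out, name=f"{name}_tower")

num_users, num_items = 943, 1682              # e.g. MovieLens-100k sizes
user_tower = build_tower(num_users, name="user")
item_tower = build_tower(num_items, name="item")   # parameters not shared with the user tower

# A triplet (u, i+, i-) is mapped to (u_hat, i_hat_pos, i_hat_neg).
u_in  = layers.Input(shape=(), dtype="int32")
ip_in = layers.Input(shape=(), dtype="int32")
in_in = layers.Input(shape=(), dtype="int32")
triplet_model = Model([u_in, ip_in, in_in],
                      [user_tower(u_in), item_tower(ip_in), item_tower(in_in)])
```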

The training of the representation network is guided by a triplet loss function. Using different loss functions to generate variant representations provides multiple views for characterizing each user and item. We introduce two loss functions for this purpose, the dot product loss and the Euclidean distance loss; more loss functions can be introduced if necessary. The dot product loss originates from BPR-Opt [14]. Its variant that adopts conventional matrix factorization, referred to as BPR-MF, is generally used as an underlying preference estimator. BPR-MF can be optimized by a contrastive pairwise ranking objective over a triplet, referred to as the product loss:

l_{prod}(\hat{u},\hat{i}^{+},\hat{i}^{-})=-\ln\sigma\left(\langle\hat{u},\hat{i}^{+}\rangle-\langle\hat{u},\hat{i}^{-}\rangle\right),\qquad(3)

where $\sigma$ is the sigmoid function and $\langle\cdot,\cdot\rangle$ denotes the dot product. Because enumerating all triplets in $T$ is typically intractable, BPR-MF uses stochastic gradient descent (SGD) to optimize the model. Specifically, in each step of SGD, a batch of triplets is dynamically sampled from $T$. In addition, L2 regularization is applied to the user and item representations, which is crucial to alleviate overfitting.

The second global representation is generated based on the Euclidean distance metric, where the objective is the distance loss [37]:

l_{dist}(\hat{u},\hat{i}^{+},\hat{i}^{-})=\max\left(0,\;\|\hat{u}-\hat{i}^{+}\|^{2}-\|\hat{u}-\hat{i}^{-}\|^{2}+\alpha\right),\qquad(4)

where $\alpha$ is the margin.

To make the optimization effective, hard examples, i.e., triplets that incur a large loss, can be selected more frequently during training. The network parameters are updated through SGD to minimize the loss function iteratively. In the following, the global representations of user $u$ and item $i$ are denoted as $\hat{u}$ and $\hat{i}$, respectively.
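A minimal TensorFlow sketch of the two triplet losses in Eqs. (3) and (4) is given below; the small epsilon inside the logarithm is an added numerical-stability assumption, not part of the original formulation.

```python
import tensorflow as tf

def product_loss(u, i_pos, i_neg):
    """BPR-style product loss of Eq. (3): -ln sigmoid(<u,i+> - <u,i->)."""
    score_pos = tf.reduce_sum(u * i_pos, axis=-1)
    score_neg = tf.reduce_sum(u * i_neg, axis=-1)
    return -tf.math.log(tf.sigmoid(score_pos - score_neg) + 1e-8)

def distance_loss(u, i_pos, i_neg, margin=0.5):
    """Triplet distance loss of Eq. (4): max(0, ||u-i+||^2 - ||u-i-||^2 + margin)."""
    d_pos = tf.reduce_sum(tf.square(u - i_pos), axis=-1)
    d_neg = tf.reduce_sum(tf.square(u - i_neg), axis=-1)
    return tf.maximum(0.0, d_pos - d_neg + margin)
```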

3.2 Local representation

Roughly speaking, the global representations provide a macro view of users and items. However, they are insufficient to model complex user-item interactions. The local representations, on the other hand, are derived by learning local data, which are extracted from the clusters that are close to the user. They can model the user-item interactions from a micro view.

For simplicity, we assume the global representation space $G$ is a Euclidean space. K-means clustering is used to construct a cluster set of size $M$. Given a query (user) $u$ and its global representation $\hat{u}$, the $J$ clusters closest to $\hat{u}$ are chosen from the $M$ clusters and denoted as the candidate clusters $\{C_{j}\}_{j=1}^{J}$; the corresponding codebook of centroids is $\{\mu_{j}\,|\,\mu_{j}\in G\}_{j=1}^{J}$. To construct the local representation network, we employ the same network architecture defined in Fig. 2 but train it with the local properties of the candidate clusters.

Suppose that a positive item $\hat{i}^{+}$ and a negative item $\hat{i}^{-}$ belong to two candidate clusters, i.e., $\hat{i}^{+}\in C_{j}$ and $\hat{i}^{-}\in C_{k}$, where $j,k\in\{1,\cdots,J\}$. There are two cases to consider when generating the triplet training set. The first case is $j=k$, i.e., $\hat{i}^{+}$ and $\hat{i}^{-}$ are in the same cluster. Such triplets are included in the local intra-triplet set $T^{\dagger}\subseteq T$ when they satisfy the following criterion:

T^{\dagger}=\left\{(u,i^{+},i^{-})\,\big|\,\|\hat{u}-\hat{i}^{+}\|>\|\hat{u}-\hat{i}^{-}\|\right\}.\qquad(5)

The intuition is that, if the distance between $u$ and $i^{+}$ is larger than that between $u$ and $i^{-}$, the triplet incurs a positive loss. We can use this hard example to tune the representation network so that the user is pulled close to its positive item and pushed away from its negative item. The other case is $j\neq k$, i.e., $\hat{i}^{+}$ and $\hat{i}^{-}$ are in different clusters. A similar definition is given for the local inter-triplet set $T^{\ddagger}\subseteq T$:

T^{\ddagger}=\left\{(u,i^{+},i^{-})\,\big|\,\|\hat{u}-\hat{\mu}_{j}\|>\|\hat{u}-\hat{\mu}_{k}\|\right\}.\qquad(6)

Unlike Eq. (5), which considers the relation between the user and items, Eq. (6) addresses the relation between the user and clusters. That is, if the distance between $u$ and $C_{j}$ (to which $i^{+}$ belongs) is larger than that between $u$ and $C_{k}$ (to which $i^{-}$ belongs), we take this hard example to tune the user representation network so that the user is close to its positive cluster $C_{j}$ and far away from its negative cluster $C_{k}$. Consequently, both $T^{\dagger}$ and $T^{\ddagger}$ serve as the local training datasets for learning local representations.

The outputs of the local representation network are expressed as the user $\tilde{u}$, the positive item $\tilde{i}^{+}$, and the negative item $\tilde{i}^{-}$. In a similar way to Eqs. (3) and (4), the loss functions for training local representations are expressed as $l_{prod}(\tilde{u},\tilde{i}^{+},\tilde{i}^{-})$ for the product loss and $l_{dist}(\tilde{u},\tilde{i}^{+},\tilde{i}^{-})$ for the distance loss.
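The following NumPy sketch illustrates how the hard-example sets $T^{\dagger}$ and $T^{\ddagger}$ could be filtered from global triplets, assuming precomputed global representations, item-to-cluster assignments, centroids, and per-user candidate clusters; all argument names are illustrative and not taken from the released code.

```python
import numpy as np

def build_local_triplets(triplets, user_emb, item_emb, item_cluster, centroids,
                         candidate_clusters):
    """Split global triplets into the local intra-triplet set (Eq. 5) and the
    inter-triplet set (Eq. 6), keeping only hard examples.
    - user_emb / item_emb: global representations (u_hat, i_hat)
    - item_cluster[i]: cluster index of item i in the global space
    - centroids[c]: centroid of cluster c
    - candidate_clusters[u]: the J clusters closest to user u"""
    T_intra, T_inter = [], []
    for (u, ip, ineg) in triplets:
        j, k = item_cluster[ip], item_cluster[ineg]
        if j not in candidate_clusters[u] or k not in candidate_clusters[u]:
            continue                       # both items must lie in candidate clusters
        uh = user_emb[u]
        if j == k:                         # same cluster: compare user-item distances
            if np.linalg.norm(uh - item_emb[ip]) > np.linalg.norm(uh - item_emb[ineg]):
                T_intra.append((u, ip, ineg))
        else:                              # different clusters: compare user-centroid distances
            if np.linalg.norm(uh - centroids[j]) > np.linalg.norm(uh - centroids[k]):
                T_inter.append((u, ip, ineg))
    return T_intra, T_inter
```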

3.3 Attentive representation

By leveraging the global and local properties with different loss functions, multiple representation networks can be trained to generate the respective representations for each user/item. Moreover, an attention network is proposed to find the best combination of these representations. That is, the proposed attention network maps a set of user and item representations to a weighted-sum representation of the user and the item, where the weight assigned to each representation space is computed from a compatibility function of the user and item. Details are described in the following.

Suppose that we construct $R$ representation networks to derive the corresponding representation spaces; each user/item thus has $R$ representations. Fig. 3 gives an overview of the proposed network, which consists of $R$ representation networks followed by an attention network, where four representation spaces ($R=4$) are taken as an example. In the training process, the input layer receives a triplet $(u,i^{+},i^{-})$. Let $g^{(r)}$ be the $r$th representation network, $r\in\{1,2,...,R\}$, which transforms the user and items into their $r$th representations. We express the $r$th representations of the triplet as $\big(g^{(r)}(u),g^{(r)}(i^{+}),g^{(r)}(i^{-})\big)$ and forward them to the $r$th attention channel.

Figure 3: The proposed model consisting of $R$ representation networks followed by an attention network, where four representation spaces ($R=4$) are taken as an example.

Each attention channel can be divided into three parts. The first part is a transformation function, denoted as $f^{(r)}$, which maps the user and positive item to the respective vectors $u^{(r)}$ and $i^{(r)}$:

u^{(r)}=f^{(r)}\big(g^{(r)}(u)\big),\quad i^{(r)}=f^{(r)}\big(g^{(r)}(i^{+})\big).\qquad(7)

Note that the negative item is excluded from the attention network. The second part calculates the compatibility between the user and the positive item. The idea is to pay more attention to the representation space in which the user and the positive item are more compatible, as measured by the dot product. The compatibility is then converted to a normalized weight, denoted as $\alpha^{(r)}$, by applying the softmax function over all representation spaces:

\alpha^{(r)}=\frac{\exp(u^{(r)}\cdot i^{(r)})}{\sum_{r'=1}^{R}\exp(u^{(r')}\cdot i^{(r')})}.\qquad(8)

In the last part, a weighted sum is calculated as the linear combination of the $R$ representations for each user/item:

\bar{u}=\sum_{r=1}^{R}\alpha^{(r)}\,g^{(r)}(u),\quad\bar{i}^{+}=\sum_{r=1}^{R}\alpha^{(r)}\,g^{(r)}(i^{+}),\quad\bar{i}^{-}=\sum_{r=1}^{R}\alpha^{(r)}\,g^{(r)}(i^{-}).\qquad(9)

The outputs $\bar{u}$, $\bar{i}^{+}$, and $\bar{i}^{-}$ are denoted as the attentive representations. The proposed model is trained by the triplet loss function based on either the product loss $l_{prod}(\bar{u},\bar{i}^{+},\bar{i}^{-})$ or the distance loss $l_{dist}(\bar{u},\bar{i}^{+},\bar{i}^{-})$, and the training data are randomly sampled from the global and local triplet sets $T$, $T^{\dagger}$, and $T^{\ddagger}$.
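The forward pass of one attention step (Eqs. (7)-(9)) can be sketched as follows in NumPy, assuming the per-channel transformations $f^{(r)}$ are supplied as callables; in the actual model these are learned layers and the whole computation is trained end-to-end with the triplet loss.

```python
import numpy as np

def attentive_representation(user_reps, pos_reps, neg_reps, transforms):
    """Blend R representation spaces with the attention weights of Eq. (8)
    and the weighted sums of Eq. (9).
    - user_reps, pos_reps, neg_reps: lists of R vectors g^(r)(u), g^(r)(i+), g^(r)(i-)
    - transforms: list of R callables playing the role of f^(r)"""
    # Eq. (7): transform the user and positive item in each channel.
    u_t = [f(g_u) for f, g_u in zip(transforms, user_reps)]
    i_t = [f(g_i) for f, g_i in zip(transforms, pos_reps)]
    # Eq. (8): softmax over the user/positive-item compatibilities.
    scores = np.array([np.dot(a, b) for a, b in zip(u_t, i_t)])
    alpha = np.exp(scores - scores.max())        # subtract max for numerical stability
    alpha /= alpha.sum()
    # Eq. (9): weighted sums over the R spaces.
    u_bar = sum(a * g for a, g in zip(alpha, user_reps))
    i_pos_bar = sum(a * g for a, g in zip(alpha, pos_reps))
    i_neg_bar = sum(a * g for a, g in zip(alpha, neg_reps))
    return u_bar, i_pos_bar, i_neg_bar, alpha
```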

3.4 Recommendation as retrieval

We treat the recommendation task as a retrieval process that searches for similar items in the multi-view representation spaces for a given user. In clustering-based recommendation, a coarse-to-fine search strategy can be applied without scanning the dataset exhaustively, making recommendation more efficient and scalable. For each item $i$ in the dataset, $R$ representations are extracted by the set of representation networks $\{g^{(r)}(i)\,|\,r=1,2,...,R\}$. We choose one of the global representation spaces and perform k-means clustering to create $M$ clusters, building an index structure in which each cluster has an inverted list of its associated items.

When a user query $u$ is submitted, the user representations $\{g^{(r)}(u)\}$ are extracted and used to search for the $K$ nearest item clusters. The items stored in these nearest clusters are the candidate items. For each pair of user $u$ and candidate item $i^{\prime}$, the attention network is used to obtain the attentive representations $\bar{u}$ and $\bar{i}^{\prime}$ defined in Eq. (9) and to compute their Euclidean distance or dot product to estimate their similarity. The candidates are sorted, and the top-$N$ items are output as the recommendation result for user $u$. The retrieval process is summarized in Algorithm 1.

By taking advantage of the coarse-to-fine search strategy, only a fraction of the dataset items are selected as candidates for the attention and re-ranking computation. Although applying the attention network to each user-item pair requires additional computation, it can be accelerated by GPUs, so the runtime overhead is usually minor. Therefore, the time complexity of the retrieval process can be considered sublinear in the size of the item dataset.

Input: a set of $M$ item clusters, $\Phi$; the set of the corresponding $M$ inverted lists of item clusters, $\Theta$; a user $u$;
Output: $N$ recommended items $\{i_{1},i_{2},...,i_{N}\}$;
1. Extract the user embeddings $\{g^{(r)}(u)\,|\,r=1,2,...,R\}$;
2. Determine the top-$K$ item clusters $\{C_{k}\,|\,C_{k}\in\Phi\}_{k=1}^{K}$, $K<M$, for $u$ by calculating the maximum dot products or minimum Euclidean distances between $u$ and the item cluster centroids;
3. Let $I=\{i\,|\,i\in\cup_{k\in\{1,2,\ldots,K\}}L_{k},\,L_{k}\in\Theta\}$ be the set of candidate items from the top-$K$ item clusters;
4. Return the top-$N$ items $\{i_{j}\,|\,i_{j}\in I\}_{j=1}^{N}$ with the maximum dot products or minimum Euclidean distances between $\bar{u}$ and $\bar{i_{j}}$ from Eq. (9);
Algorithm 1: Recommendation
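A NumPy sketch of the coarse-to-fine retrieval in Algorithm 1 (the Euclidean-distance variant) is shown below; it assumes the index was built on the first global representation space and that an attention_fn returning the attentive user/item representations is available. Both assumptions, as well as the argument names, are illustrative.

```python
import numpy as np

def recommend(u, user_towers, centroids, inverted_lists, attention_fn, K=2, N=20):
    """Coarse-to-fine retrieval following Algorithm 1 (Euclidean-distance variant).
    - user_towers: the R representation networks g^(r)
    - centroids / inverted_lists: the M cluster centroids and their item lists
    - attention_fn(u, i): returns the attentive representations (u_bar, i_bar)"""
    # Step 1: extract the user representations in all R spaces.
    user_reps = [g(u) for g in user_towers]
    # Step 2: pick the K clusters closest to the user in the indexed global space.
    u_hat = user_reps[0]                              # space assumed to build the index
    dists = np.linalg.norm(centroids - u_hat, axis=1)
    top_clusters = np.argsort(dists)[:K]
    # Step 3: gather candidate items from the inverted lists of those clusters.
    candidates = [i for c in top_clusters for i in inverted_lists[c]]
    # Step 4: re-rank candidates by the attentive user-item distance, return top-N.
    def attentive_dist(i):
        u_bar, i_bar = attention_fn(u, i)
        return np.linalg.norm(u_bar - i_bar)
    return sorted(candidates, key=attentive_dist)[:N]
```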

4 Experimental Results

This section describes the evaluation of the proposed method and a comparison with several CF-based recommender systems. The experiments were carried out under Windows 10 with an Intel Core i7-6700 CPU, an NVIDIA GeForce GTX 1080Ti GPU, and 64GB RAM. The programming environment was the Keras package of Tensorflow under Python 3.7. The source code is available at https://github.com/MunlikaRattaphun/Attention-on-Global-Local-Embedding-Spaces-in-Recommender-Systems.

4.1 Benchmark datasets

Four popular benchmark datasets, derived from MovieLens, Yelp, and Amazon, were used in the experimental evaluation. We briefly introduce these datasets, as well as the corresponding training/test data partitions:

  • MovieLens-100k. This dataset consists of 100,000 records of user-movie ratings from 943 users and 1,682 movies. Each user has rated at least 20 movies. We randomly sampled half of the user-movie views as test data, and used the other half for training.

  • Yelp Madison. The Yelp dataset is a set of user reviews of businesses in local cities. Here, we selected the city of Madison. Users who had more than five business interactions were kept; the others were removed from the dataset. For each of the remaining users, we randomly sampled three user-business interactions as the test data, and used the rest for training.

  • Yelp Pittsburgh. Yelp Pittsburgh is another dataset containing user reviews of businesses in Pittsburgh. The user filtering and training/test data generation were the same as for Yelp Madison.

  • Amazon. This dataset contains product reviews from customers of Amazon.com. For this study, the category Movie and TV was chosen. Users with fewer than six reviews were removed from the dataset. The training/test data were generated in the same way as for Yelp Madison and Yelp Pittsburgh.

Table I summarizes some statistical properties of these datasets. Their scales and densities differ considerably, providing diverse conditions for observation.

TABLE I: Dataset statistics (after preprocessing)
Datasets #Users #Items #Interactions Density
MovieLens-100k 943 1,682 100,000 0.0630
Yelp Madison 2,773 3,186 50,961 0.0058
Yelp Pittsburgh 5,116 6,443 111,246 0.0034
Amazon 32,910 45,585 1,089,539 0.0007

4.2 Evaluation metrics

Several widely used accuracy metrics for top-N recommendation, including precision, recall, hit rate (HR), average-reciprocal hit rank (ARHR), and normalized discounted cumulative gain (NDCG), were used for evaluation. For a user query $u$, the top-N recommended items are denoted as $Q_{u}=\{i_{1},i_{2},...,i_{N}\}$ and the ground truth (positive) items as $G_{u}=\{i^{\prime}_{1},i^{\prime}_{2},...,i^{\prime}_{|G_{u}|}\}$, where $|\cdot|$ is the set cardinality. These metrics are defined as follows:

\text{recall}(u)=\frac{|Q_{u}\cap G_{u}|}{|G_{u}|}\qquad(10)
\text{precision}(u)=\frac{|Q_{u}\cap G_{u}|}{|Q_{u}|}\qquad(11)
\text{HR}(u)=\begin{cases}1,&\text{if }|Q_{u}\cap G_{u}|\geqslant 1\\ 0,&\text{if }|Q_{u}\cap G_{u}|=0\end{cases}\qquad(12)
\text{ARHR}(u)=\sum_{i\in Q_{u}\cap G_{u}}\frac{1}{rank_{i}}\qquad(13)
\text{NDCG}(u)=\sum_{i\in Q_{u}}\frac{1}{\log_{2}(rank_{i}+1)}\qquad(14)

where $rank_{i}$ is the ranking position of item $i$. Although a higher value generally reflects better accuracy, it is advisable to assess the model performance with multiple metrics rather than any single one.
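The per-user metrics can be computed as in the following sketch; note that the discounted-gain term is accumulated over the hit items, which we take to be the intended reading of Eq. (14), and is left unnormalized, so it is labeled DCG here.

```python
import numpy as np

def topn_metrics(recommended, ground_truth):
    """Per-user top-N metrics of Eqs. (10)-(14). `recommended` is the ranked
    top-N list Q_u; `ground_truth` is the positive set G_u."""
    Q, G = list(recommended), set(ground_truth)
    hits = [(rank, i) for rank, i in enumerate(Q, start=1) if i in G]
    recall = len(hits) / len(G) if G else 0.0
    precision = len(hits) / len(Q) if Q else 0.0
    hr = 1.0 if hits else 0.0
    arhr = sum(1.0 / rank for rank, _ in hits)
    dcg = sum(1.0 / np.log2(rank + 1) for rank, _ in hits)   # unnormalized gain over hits
    return {"recall": recall, "precision": precision, "HR": hr,
            "ARHR": arhr, "DCG": dcg}

# Example: the user's positive items are {3, 7}; we recommended [7, 1, 4, 3, 9].
print(topn_metrics([7, 1, 4, 3, 9], {3, 7}))
```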

4.3 Hyperparameter and ablation study

To study the impact of model parameters, we take the following four global and local representations, as well as two attentive representations, for consideration:

  • Global representation / distance loss (GD) was generated based on the global training set (Eq. (2)) and the distance loss function (Eq. (4)).

  • Global representation / product loss (GP) was generated based on the global training set and product loss function (Eq. (3)).

  • Local representation / distance loss (LD) was generated based on the local training set (Eqs. (5) and (6)) and the distance loss function.

  • Local representation / product loss (LP) was generated based on the local training set and the product loss function.

  • Attentive representation / distance loss (AD) was blended from the above four representations by the proposed attention network and the distance loss function.

  • Attentive representation / product loss (AP) was blended by the proposed attention network and the product loss function.

Except for using different training datasets and loss functions, these representations adopted the same representation network shown in Fig. 2, which contains one embedding layer and two hidden layers of 64 neurons each. The dimensionality of the output layer was also set to 64. Each model was trained for a maximum of 200 epochs with an early stopping strategy and optimized with the Adam optimizer, where the batch size was fixed at 512 and the learning rate was 0.00017. The margin of the loss function in Eq. (4) was set to $\alpha=0.5$.
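A training-step sketch with the stated hyperparameters (Adam, batch size 512, learning rate 0.00017, margin 0.5) is given below for the distance-loss case; the function structure and variable names are assumed for illustration rather than taken from the released code.

```python
import tensorflow as tf

# Assumed settings matching the hyperparameters reported above.
optimizer = tf.keras.optimizers.Adam(learning_rate=0.00017)
BATCH_SIZE, MAX_EPOCHS, MARGIN = 512, 200, 0.5

@tf.function
def train_step(user_tower, item_tower, u_ids, pos_ids, neg_ids):
    """One SGD step on a batch of triplets with the distance loss of Eq. (4)."""
    with tf.GradientTape() as tape:
        u = user_tower(u_ids)
        i_pos, i_neg = item_tower(pos_ids), item_tower(neg_ids)
        d_pos = tf.reduce_sum(tf.square(u - i_pos), axis=-1)
        d_neg = tf.reduce_sum(tf.square(u - i_neg), axis=-1)
        loss = tf.reduce_mean(tf.maximum(0.0, d_pos - d_neg + MARGIN))
    variables = user_tower.trainable_variables + item_tower.trainable_variables
    grads = tape.gradient(loss, variables)
    optimizer.apply_gradients(zip(grads, variables))
    return loss
```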

We first evaluate the impact of the number of clusters $M$. Three values of $M$ are evaluated for each dataset: $M\in\{10,15,20\}$ for MovieLens, $M\in\{10,20,30\}$ for Yelp Madison, $M\in\{10,20,30\}$ for Yelp Pittsburgh, and $M\in\{30,100,256\}$ for Amazon. The range of $M$ was chosen carefully: an $M$ that is too large or too small yields a poor clustering result. The performance is plotted in Fig. 4, where each column shows a particular dataset under different $M$. Each plot shows six recall curves of the proposed representations under the top-$K$ candidate clusters, $K\in\{1,2,3,4,5\}$. We ranked the items in the candidate clusters to return the top-20 items for recommendation.

Figure 4: Top-$K$ candidate clusters (X-axis) vs. recall (Y-axis).

As the number of candidate clusters $K$ increases, more positive items are found in the candidate clusters, and therefore the recall curves generally improve. In addition, most of them show a sharp rise at $K=2$. However, some recall curves degrade as $K$ increases. This means that the corresponding model does not generate good representation spaces, so the relevance between the user and a positive item is lower than that between the user and a negative item. The proposed attentive representations (AD and AP) are relatively stable under these configurations, showing that they become more robust by adaptively fusing multiple representations.

In general, the local representations (LD and LP) are better than the global representations (GD and GP). However, no single representation always performs better than the others because of the variety of user-item interaction behaviors. Again, the proposed attention network, which combines different views of the representation spaces, yields the best recall on all the benchmark datasets. Based on the above result, we choose the following $M$ in subsequent experiments: $M=20$ for MovieLens, $M=20$ for Yelp Madison, $M=10$ for Yelp Pittsburgh, and $M=100$ for Amazon.

Next, we observe the recommendation results under more evaluation metrics, as shown in Figs. 5, 6, 7, and 8 for MovieLens, Yelp Madison, Yelp Pittsburgh, and Amazon, respectively. Each figure contains four subplots, including recall, precision, HR, and ARHR (from left to right). The X-axis represents the number of top-$N$ recommended items, $N\in\{5,10,15,20,25,30\}$, and the Y-axis represents the accuracy. The number of candidate clusters is fixed at $K=2$. Again, the proposed attention networks outperform the other representation networks under these accuracy metrics. In particular, the proposed AP is superior to AD under the ARHR metric, even though they are comparable under the other three metrics. Note that ARHR measures the ranking quality by assigning high scores to hits at top ranking positions; that is, AP provides a better ranking quality. Another interesting phenomenon is that, for the HR metric, most of the curves show a sharp rise at $N=10$. Since HR indicates whether at least one positive item is in the top-$N$ set, we suggest that $N\geq 10$ is a good number of recommended items.

Figure 5: Top-$N$ recommended items (X-axis) vs. accuracy metrics (Y-axis) in MovieLens.
Figure 6: Top-$N$ recommended items (X-axis) vs. accuracy metrics (Y-axis) in Yelp Madison.
Figure 7: Top-$N$ recommended items (X-axis) vs. accuracy metrics (Y-axis) in Yelp Pittsburgh.
Figure 8: Top-$N$ recommended items (X-axis) vs. accuracy metrics (Y-axis) in Amazon.

An ablation study is conducted for the attention network with the global and local representations. Three representation types are considered: global, local, and all. The global type consists of GD, GP, GAD (global attentive representation with distance loss), and GAP (global attentive representation with product loss). The local type follows a similar definition. We fixed the number of candidate clusters at $K=2$ and the number of recommended items at $N=15$. Tables II and III list the performance on the four benchmark datasets. The number in boldface indicates the best accuracy in each metric column. It is clear that, with the attention network, the global attentive representations (i.e., GAD and GAP) gain further improvement by combining the two global representations. The local attentive representations (i.e., LAD and LAP) gain a similar improvement. By combining all global and local representations, the attention network yields the best accuracy, as demonstrated by AD and AP.

TABLE II: Ablation study for the attention network of the global and local representations in MovieLens and Yelp Madison
MovieLens Yelp Madison
Recall Precision HR ARHR Recall Precision HR ARHR
Global type GD 0.2230 0.2829 0.9025 1.1060 0.0797 0.0160 0.2126 0.0671
GP 0.2001 0.2294 0.8956 0.8407 0.0790 0.0158 0.2124 0.0591
GAD 0.2500 0.2961 0.9275 1.1736 0.0797 0.0159 0.2124 0.0671
GAP 0.2391 0.2919 0.9166 1.1619 0.0818 0.0164 0.2172 0.0659
Local type LD 0.2965 0.3817 0.9303 1.5752 0.0873 0.0174 0.2322 0.0745
LP 0.2452 0.2896 0.9251 1.1421 0.0985 0.0197 0.2563 0.0799
LAD 0.2958 0.3743 0.9298 1.5077 0.1008 0.0201 0.2605 0.0835
LAP 0.2833 0.3478 0.9406 1.4750 0.0998 0.0200 0.2589 0.0804
All type AD 0.3045 0.3858 0.9475 1.5861 0.1006 0.0201 0.2600 0.0829
AP 0.3002 0.3841 0.9428 1.5759 0.1009 0.0201 0.2609 0.0834
TABLE III: Ablation study for the attention network of the global and local representations in Yelp Pittsburgh and Amazon
Yelp Pittsburgh Amazon
Recall Precision HR ARHR Recall Precision HR ARHR
Global type GD 0.0652 0.0130 0.1786 0.0559 0.0438 0.0088 0.1117 0.0388
GP 0.0580 0.0116 0.1594 0.0464 0.0253 0.0051 0.0683 0.0207
GAD 0.0652 0.0130 0.1786 0.0559 0.0470 0.0094 0.1207 0.0412
GAP 0.0657 0.0131 0.1785 0.0549 0.0463 0.0093 0.1220 0.0412
Local type LD 0.0710 0.0142 0.1903 0.0680 0.0513 0.0103 0.1257 0.0530
LP 0.0632 0.0127 0.1709 0.0491 0.0528 0.0106 0.1337 0.0481
LAD 0.0723 0.0144 0.1931 0.0682 0.0529 0.0106 0.1343 0.0482
LAP 0.0705 0.0141 0.1898 0.0642 0.0593 0.0119 0.1460 0.0551
All type AD 0.0722 0.0144 0.1930 0.0681 0.0569 0.0114 0.1410 0.0551
AP 0.0727 0.0145 0.1963 0.0731 0.0595 0.0119 0.1460 0.0577

4.4 Comparison with other CF methods

We compared the proposed method with several CF recommender systems, covering network-based, factorization-based, and clustering-based approaches:

  • MetaPath2Vec++ [38]: This system uses a network embedding model that characterizes user-item interactions by heterogeneous network representation learning.

  • BiNE [39]: This system is also a network embedding model that characterizes user-item interactions by bipartite representation learning.

  • CoFactor [4]: This factorization model generates the representation by jointly decomposing the user-item interaction matrix and the item-item co-occurrence matrix with shared item latent factors.

  • N2VSCDNNR [12]: This clustering-based method uses the node2vec technique to generate a common representation space and builds user and item clusters for matching.

  • CIGAR [13]: This is also a clustering-based method that uses a hashing technique to retrieve candidate items and learns a ranking model to re-rank the candidate items.

Tables IV and V present their performance in terms of recall, precision, HR, and ARHR. The clustering-based approaches (i.e., N2VSCDNNR, CIGAR, AD, and AP) yield better performance than the other approaches on the diverse benchmark datasets, demonstrating their ability to address the sparsity and scalability problems to a certain extent. Furthermore, CIGAR performs better than N2VSCDNNR on the sparser and larger datasets, i.e., Yelp and Amazon. However, these compared methods employ a single representation, which is insufficient to model complex user-item interactions. The proposed attention model combines multiple representations to regenerate a joint representation for each user-item pair dynamically and thus gains the best performance on the various benchmark datasets.

TABLE IV: Comparison with CF methods in MovieLens and Yelp Madison
MovieLens Yelp Madison
Recall Precision HR ARHR Recall Precision HR ARHR
Metapath2vec++ 0.2750 0.2810 0.8957 1.1761 0.0341 0.0101 0.1214 0.0381
BiNE 0.1236 0.1721 0.7108 0.7559 0.0542 0.0128 0.1512 0.0481
CoFactor 0.2750 0.2810 0.8957 1.1761 0.0951 0.0160 0.2088 0.0657
N2VSCDNNR 0.2865 0.2933 0.9350 1.2241 0.1042 0.0166 0.2166 0.0686
CIGAR 0.1728 0.2687 0.9300 0.9868 0.1124 0.0225 0.2952 0.0842
AD 0.2983 0.4702 0.9788 1.9169 0.1355 0.0271 0.3524 0.0922
AP 0.2983 0.4702 0.9788 1.9171 0.1355 0.0271 0.3524 0.0921
TABLE V: Comparison with CF methods in Yelp Pittsburgh and Amazon.
Yelp Pittsburgh Amazon
Recall Precision HR ARHR Recall Precision HR ARHR
Metapath2vec++ 0.0326 0.0095 0.1164 0.0426 0.0257 0.0038 0.0643 0.0223
BiNE 0.0399 0.0115 0.1352 0.0526 0.0389 0.0059 0.0875 0.0366
CoFactor 0.0735 0.0144 0.1832 0.0602 0.0575 0.0087 0.1141 0.0494
N2VSCDNNR 0.0818 0.0153 0.2008 0.0644 0.0608 0.0093 0.1237 0.0528
CIGAR 0.1259 0.0252 0.3155 0.0876 0.0701 0.0140 0.1789 0.0665
AD 0.1345 0.0269 0.3455 0.1001 0.0746 0.0149 0.1861 0.0768
AP 0.1345 0.0269 0.3433 0.1138 0.0793 0.0159 0.1980 0.0796

Finally, we performed leave-one-out evaluation to compare our methods with two other CF methods. The compared methods are end-to-end learning-based recommender systems:

  • NeuMF [5]: This system generalizes matrix factorization by nonlinear neural networks to model user-item interactions.

  • BCFNet [30]: This system, which is an extension of DeepCF [17], uses an attention mechanism to improve the representation learning and matching function learning.

In the leave-one-out evaluation, we held out the latest interaction of each user as the test item and used the remaining data for training. For testing, we followed the same protocol as the above two methods: for each user, we randomly selected 99 unobserved interactions as negative examples, together with the latest interaction as the positive example. In total, 100 test items are ranked, and the top-10 items were evaluated by HR and NDCG.
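A per-user sketch of this protocol is shown below; score_fn is an assumed callable that returns the user-item similarity under whatever model is being evaluated.

```python
import numpy as np

def leave_one_out_eval(score_fn, held_out_item, negative_items, top_n=10):
    """Leave-one-out protocol used above: rank the held-out positive item
    against 99 sampled negatives and report HR@10 and NDCG@10 for one user.
    `score_fn(i)` returns the user-item similarity (higher is better)."""
    candidates = [held_out_item] + list(negative_items)      # 100 items in total
    ranked = sorted(candidates, key=score_fn, reverse=True)
    rank = ranked.index(held_out_item) + 1
    hr = 1.0 if rank <= top_n else 0.0
    ndcg = 1.0 / np.log2(rank + 1) if rank <= top_n else 0.0  # single relevant item
    return hr, ndcg
```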

Table VI lists the leave-one-out evaluation results on the four benchmark datasets. Note that the BCFNet result for Amazon is not available because its memory consumption was too large to run on our machine. For each dataset, we generated three representation spaces of 32, 64, and 128 dimensions. Our methods yield better HR and NDCG scores in the high-dimensional representation spaces. In particular, our NDCG scores outperform NeuMF and BCFNet by a large margin, indicating that our methods rank the positive example at a superior position. The proposed method is able to find better representations for users and items through separate representation networks. Separating items into positive and negative ones also has an important impact on the recommendation task. In addition, similar to BCFNet, the attention mechanism can be used to highlight important features. However, because the local representations are derived by learning local data extracted from the clusters close to the user, a low-dimensional representation space is more likely to mix dissimilar data points within a cluster than a high-dimensional one. The resulting local representations may then be poor and, in turn, degrade the final recommendation results.

TABLE VI: Comparison with NeuMF and BCFNet by leave-one-out evaluation
MovieLens
32-d 64-d 128-d
HR NDCG HR NDCG HR NDCG
NeuMF 0.6405 0.3679 0.6617 0.3834 0.6490 0.3766
BCFNet 0.6723 0.3896 0.6702 0.3912 0.7010 0.4096
AD 0.6426 0.3787 0.7063 0.4675 0.6903 0.3902
AP 0.6501 0.4011 0.7243 0.4944 0.6999 0.4538
Yelp Madison
32-d 64-d 128-d
HR NDCG HR NDCG HR NDCG
NeuMF 0.6383 0.3917 0.6502 0.3986 0.6556 0.4104
BCFNet 0.6664 0.4199 0.6661 0.4170 0.6682 0.4235
AD 0.6304 0.3518 0.6477 0.4601 0.7010 0.5819
AP 0.6275 0.3357 0.6361 0.4672 0.7079 0.5729
Yelp Pittsburgh
32-d 64-d 128-d
HR NDCG HR NDCG HR NDCG
NeuMF 0.6855 0.4253 0.6949 0.4324 0.6933 0.4394
BCFNet 0.7000 0.4458 0.7048 0.4575 0.7062 0.4546
AD 0.6571 0.3726 0.6841 0.5677 0.7261 0.6249
AP 0.6455 0.3729 0.6841 0.5677 0.6754 0.4398
Amazon
32-d 64-d 128-d
HR NDCG HR NDCG HR NDCG
NeuMF 0.6437 0.3781 0.6422 0.3774 0.6490 0.3846
BCFNet - - - - - -
AD 0.6961 0.4356 0.6988 0.4319 0.6972 0.4302
AP 0.6746 0.3770 0.6763 0.3999 0.6972 0.4257

5 Conclusion

In this study, we proposed a novel clustering-based CF method for recommender systems. The user and item representations are learned from multiple views in terms of global/local representation spaces and dot product/Euclidean distance loss functions. An attention network is designed to generate a joint representation of these views for each user-item pair dynamically. Experimental results show the proposed method is effective and competitive compared to several CF methods where only one representation space is considered.

References

  • [1] L. Chen, Y. Liu, X. He, L. Gao, and Z. Zheng, “Matching user with item set: Collaborative bundle recommendation with deep attention network.” in IJCAI, 2019, pp. 2095–2101.
  • [2] C. Zheng, D. Tao, J. Wang, L. Cui, W. Ruan, and S. Yu, “Memory augmented hierarchical attention network for next point-of-interest recommendation,” IEEE Transactions on Computational Social Systems, vol. 8, no. 2, pp. 489–499, 2020.
  • [3] G. Zhao, Z. Liu, Y. Chao, and X. Qian, “CAPER: Context-aware personalized emoji recommendation,” IEEE Transactions on Knowledge and Data Engineering, vol. 33, no. 9, pp. 3160 – 3172, 2020.
  • [4] D. Liang, J. Altosaar, L. Charlin, and D. M. Blei, “Factorization meets the item embedding: Regularizing matrix factorization with item co-occurrence,” in Proceedings of the 10th ACM Conference on Recommender Systems, 2016, pp. 59–66.
  • [5] X. He, L. Liao, H. Zhang, L. Nie, X. Hu, and T.-S. Chua, “Neural collaborative filtering,” in Proceedings of the 26th international conference on world wide web, 2017, pp. 173–182.
  • [6] C.-K. Hsieh, L. Yang, Y. Cui, T.-Y. Lin, S. Belongie, and D. Estrin, “Collaborative metric learning,” in Proceedings of the 26th international conference on world wide web, 2017, pp. 193–201.
  • [7] Y. Koren, “Factorization meets the neighborhood: a multifaceted collaborative filtering model,” in Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, 2008, pp. 426–434.
  • [8] X. Luo, Y. Xia, Q. Zhu, and Y. Li, “Boosting the k-nearest-neighborhood based incremental collaborative filtering,” Knowledge-Based Systems, vol. 53, pp. 90–99, 2013.
  • [9] B. M. Sarwar, G. Karypis, J. Konstan, and J. Riedl, “Recommender systems for large-scale e-commerce: Scalable neighborhood formation using clustering,” in Proceedings of the fifth international conference on computer and information technology, vol. 1, 2002, pp. 291–324.
  • [10] C. Rana and S. K. Jain, “An evolutionary clustering algorithm based on temporal features for dynamic recommender systems,” Swarm and Evolutionary Computation, vol. 14, pp. 21–30, 2014.
  • [11] C.-L. Liao and S.-J. Lee, “A clustering based approach to improving the efficiency of collaborative filtering recommendation,” Electronic Commerce Research and Applications, vol. 18, pp. 1–9, 2016.
  • [12] J. Chen, Y. Wu, L. Fan, X. Lin, H. Zheng, S. Yu, and Q. Xuan, “N2VSCDNNR: A local recommender system based on node2vec and rich information network,” IEEE Transactions on Computational Social Systems, vol. 6, no. 3, pp. 456–466, 2019.
  • [13] W.-C. Kang and J. McAuley, “Candidate generation with binary codes for large-scale top-n recommendation,” in CIKM, 2019, pp. 1523–1532.
  • [14] S. Rendle, C. Freudenthaler, Z. Gantner, and L. Schmidt-Thieme, “BPR: Bayesian personalized ranking from implicit feedback,” in Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, 2009, pp. 452–461.
  • [15] S. Rendle, W. Krichene, L. Zhang, and J. Anderson, “Neural collaborative filtering vs. matrix factorization revisited,” in Fourteenth ACM Conference on Recommender Systems, 2020, pp. 240–248.
  • [16] F. Xue, X. He, X. Wang, J. Xu, K. Liu, and R. Hong, “Deep item-based collaborative filtering for top-n recommendation,” ACM Transactions on Information Systems (TOIS), vol. 37, no. 3, pp. 1–25, 2019.
  • [17] Z. Deng, L. Huang, C. Wang, J. Lai, and P. S. Yu, “DeepCF: A unified framework of representation learning and matching function learning in recommender system,” in AAAI, vol. 1, no. 33, 2019, pp. 61–68.
  • [18] X. Luo, D. Wang, M. Zhou, and H. Yuan, “Latent factor-based recommenders relying on extended stochastic gradient descent algorithms,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 51, no. 2, pp. 916–926, 2021.
  • [19] X. Luo, M. Zhou, S. Li, D. Wu, Z. Liu, and M. Shang, “Algorithms of unconstrained non-negative latent factor analysis for recommender systems,” IEEE Transactions on Big Data, vol. 7, no. 1, pp. 227–240, 2021.
  • [20] X. Luo, Z. Liu, S. Li, M. Shang, and Z. Wang, “A fast non-negative latent factor model based on generalized momentum method,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 51, no. 1, pp. 610–620, 2021.
  • [21] D. Wu, X. Luo, M. Shang, Y. He, G. Wang, and M. Zhou, “A deep latent factor model for high-dimensional and sparse matrices in recommender systems,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 51, no. 7, pp. 4285–4296, 2021.
  • [22] K. Ji, R. Sun, X. Li, and W. Shu, “Improving matrix approximation for recommendation via a clustering-based reconstructive method,” Neurocomputing, vol. 173, pp. 912–920, 2016.
  • [23] Y. Wu, X. Liu, M. Xie, M. Ester, and Q. Yang, “CCCF: Improving collaborative filtering via scalable user-item co-clustering,” in Proceedings of the Ninth ACM International Conference on Web Search and Data Mining, 2016, pp. 73–82.
  • [24] A. Grover and J. Leskovec, “node2vec: Scalable feature learning for networks,” in Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, 2016, pp. 855–864.
  • [25] K. Cho, A. Courville, and Y. Bengio, “Describing multimedia content using attention-based encoder-decoder networks,” IEEE Transactions on Multimedia, vol. 17, no. 11, pp. 1875–1886, 2015.
  • [26] J. Chen, H. Zhang, X. He, L. Nie, W. Liu, and T.-S. Chua, “Attentive collaborative filtering: Multimedia recommendation with item- and component-level attention,” in Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2017, pp. 335–344.
  • [27] H. Zhu, X. Li, P. Zhang, G. Li, J. He, H. Li, and K. Gai, “Learning tree-based deep model for recommender systems,” in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2018, pp. 1079–1088.
  • [28] Y. Tay, L. Anh Tuan, and S. C. Hui, “Latent relational metric learning via memory-based attention for collaborative ranking,” in Proceedings of the 2018 World Wide Web Conference, 2018, pp. 729–739.
  • [29] Y. He, J. Wang, W. Niu, and J. Caverlee, “A hierarchical self-attentive model for recommending user-generated item lists,” in Proceedings of the 28th ACM International Conference on Information and Knowledge Management, 2019, pp. 1481–1490.
  • [30] Z.-Y. Hu, J. Huang, Z.-H. Deng, C.-D. Wang, L. Huang, J.-H. Lai, and P. S. Yu, “BCFNet: A balanced collaborative filtering network with attention mechanism,” arXiv preprint arXiv:2103.06105, 2021.
  • [31] D. Cao, X. He, L. Miao, Y. An, C. Yang, and R. Hong, “Attentive group recommendation,” in The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, 2018, pp. 645–654.
  • [32] Y. Wu, K. Li, G. Zhao, and X. Qian, “Personalized long-and short-term preference learning for next poi recommendation,” IEEE Transactions on Knowledge and Data Engineering, 2020 (Early Access).
  • [33] J. Hao, Y. Dun, G. Zhao, Y. Wu, and X. Qian, “Annular-graph attention model for personalized sequential recommendation,” IEEE Transactions on Multimedia, 2021 (Early Access).
  • [34] X. Fan, Z. Liu, J. Lian, W. X. Zhao, X. Xie, and J.-R. Wen, “Lighter and better: Low-rank decomposed self-attention networks for next-item recommendation,” in Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2021, pp. 1733–1737.
  • [35] L. Prétet, G. Richard, and G. Peeters, “Learning to rank music tracks using triplet loss,” in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2020, pp. 511–515.
  • [36] G. Koch, R. Zemel, and R. Salakhutdinov, “Siamese neural networks for one-shot image recognition,” in ICML deep learning workshop, vol. 2.   Lille, 2015.
  • [37] F. Schroff, D. Kalenichenko, and J. Philbin, “FaceNet: A unified embedding for face recognition and clustering,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 815–823.
  • [38] Y. Dong, N. V. Chawla, and A. Swami, “Metapath2vec: Scalable representation learning for heterogeneous networks,” in Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining, 2017, pp. 135–144.
  • [39] M. Gao, L. Chen, X. He, and A. Zhou, “BiNE: Bipartite network embedding,” in The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, 2018, pp. 715–724.