Attention on Global-Local Representation Spaces in Recommender Systems
Abstract
In this study, we present a novel clustering-based collaborative filtering (CF) method for recommender systems. Clustering-based CF methods can effectively deal with data sparsity and scalability problems. However, most of them are applied to a single representation space, which might not characterize complex user-item interactions well. We argue that user-item interactions should be observed from multiple views and characterized in an adaptive way. To address this issue, we leverage global and local properties to construct multiple representation spaces by learning from various training datasets and loss functions. An attention network is built to generate a blended representation according to the relative importance of the representation spaces for each user-item pair, providing a flexible way to characterize diverse user-item interactions. Extensive experiments were conducted on four popular benchmark datasets. The results show that the proposed method is superior to several CF methods in which only one representation space is considered.
Index Terms:
attention, clustering, collaborative filtering, deep learning, multi-views.

1 Introduction
Recommender systems are becoming increasingly popular in many web applications. Because customers have various opinions and preferences, satisfying their requirements is an important issue for content and product providers. There is a great need for personalized recommender systems that suggest appropriate content and products to a user in a wide variety of online services, such as entertainment [1], travel [2], and social networks [3].
The general approaches used in recommender systems include collaborative filtering (CF) and content-based filtering. The CF approach builds a model from a user’s past behavior (items that are purchased, viewed, selected, or rated), as well as similar behaviors by other users. The model is then used to recommend unseen items that will likely interest the user. The content-based filtering approach uses a series of discrete, previously-tagged characteristics of an item to recommend other items with similar properties. A modern recommender system often combines these two approaches into a hybrid system. In this study, we focus on the CF approach.
Matrix factorization is one of the most popular CF techniques [4][5][6]. The relation between users and items is represented as an interaction matrix, and the matrix is then factorized into a user latent matrix and an item latent matrix. The similarity between a user and an item is usually measured by the inner product or Euclidean distance between their latent vectors. Missing values in the interaction matrix can thus be inferred from the predicted user-item similarity and be used to recommend non-interacted items to a user. Another widely used CF technique is the neighborhood model [7][8]. It selects the nearest neighbors for each user/item based on their similarities, and the missing values of the user/item are predicted by combining its neighbors' weighted similarities.
As the system scales with rapid growth in the numbers of users and items, the interaction proportion over the whole set of users and items becomes very small. These CF techniques must then deal with the data sparsity and scalability problems. One remedy is clustering-based recommendation [9][10][11][12][13]. This approach groups similar users or items into subsets by means of clustering. From the view of clusters, interactions between the user and item subsets are no longer sparse. In addition, the clustering technique can be integrated with an index structure to overcome the scalability issue. In other words, instead of an exhaustive search, cluster pruning can be used to keep only a few candidates for evaluation with respect to a given query. The complexity is reduced to sublinear time in the number of items, making recommendation more efficient. Therefore, the clustering-based approach has been extensively developed recently to address sparsity and scalability problems.
Another key point is that most existing methods are applied to a single representation space. However, user-item interaction behaviors are often diverse and complex; it is beneficial to observe them from multiple views and to characterize them in an adaptive way. For example, in the clustering-based approach, we can extract not only local features from clusters but also global features from the whole dataset. These features may reflect significant properties of various user-item interactions. Moreover, they can be combined to derive a more discriminating representation.
Following the above discussion, we propose a novel method for clustering-based CF recommender systems that involves leveraging global and local views. Multiple representation spaces are learned from the training data of global and local user-item interactions with different loss functions. To fuse the information from different representation spaces, an attention network is adopted to formulate an attentive representation for each user-item pair by blending the relative weights among the representation spaces. Compared with existing attention-based methods that use an attention model at the item level (considering which item is more important), the proposed attention model is used at the representation level (considering which representation is more important).
We highlight the contributions of the proposed method as follows:
• Multiple representation spaces are constructed based on different global/local properties and loss functions. Rather than using only one representation as in most existing methods, multiple representations can provide an abundance of features to describe the relation between users and items.
• An attention network is presented to capture the relative importance of multiple representation spaces. It can dynamically fuse multiple representations for each user-item pair, forming a flexible measure to characterize diverse user-item interactions.
• Extensive experiments comparing various configurations and state-of-the-art methods are conducted on four benchmark datasets. The results support the effectiveness and superiority of the proposed method.
The remainder of this paper is organized as follows. Section 2 presents related work on recommender systems. Section 3 elaborates the proposed method, including the global/local representation networks and attention network. Section 4 demonstrates and discusses experimental results. Finally, conclusions are summarized in Section 5.
2 Related Work
Among the large body of literature on recommender systems, we focus on recent CF-based methods and introduce them briefly. We also review the clustering-based approach and the attention mechanism, both of which are highly related to our work.
2.1 CF-based recommender systems
The core idea of the CF approach is to model the preference of users toward items according to their historical interactions. Several techniques are proposed to address this issue. In addition to the above-mentioned matrix factorization and neighborhood model, here we introduce other popular techniques, namely, ranking formulation and deep learning.
Rendle et al. [14] presented a generic optimization criterion based on Bayesian personalized ranking, called BPR-Opt, for personalized ranking in the item recommendation task. BPR-Opt is the maximum posterior estimator derived from a Bayesian analysis of the problem; the main idea is to rank observed items higher than unobserved items. They also provided a generic learning algorithm for optimizing models with respect to BPR-Opt. Following a similar ranking formulation, Hsieh et al. [6] proposed collaborative metric learning (CML), which encodes user preferences and user-user/item-item similarity in a joint metric space. CML achieves significant speedup for top-N recommendation tasks using off-the-shelf approximate nearest-neighbor search with negligible accuracy reduction.
More recent approaches tend to use the deep learning technique. He et al. [5] presented a general framework called neural network-based collaborative filtering (NeuMF) for recommender systems. They used nonlinear neural networks as the user–item interaction function. NeuMF expresses and generalizes matrix factorization and leverages a multi-layer perceptron to learn the user–item interaction function. The authors transformed the identity of a user/item to a binarized sparse vector with one-hot encoding, which only explored the user/item feature in a limited manner. Due to the complexity of the scoring function, the NeuMF retrieval process is generally hard to accelerate. Rendle et al. [15] revisited the NeuMF experiments, and concluded that a dot product might be a better default choice for combining embeddings than learned similarities using MLP or NeuMF. Xue et al. [16] proposed a neural network framework to model higher-order item relations for item-based collaborative filtering by using multiple nonlinear layers above pairwise interaction modeling. Under this framework, the authors also used an attention model to differentiate the importance of pairwise interactions. Deng et al. [17] proposed a general framework called deep collaborative filtering (DeepCF) to combine the representation learning and matching function learning. They also proposed a model called a collaborative filtering network (CFNet) based on a plain-vanilla MLP model under the DeepCF framework. CFNet has great flexibility in learning complex matching functions, with good efficiency in learning low-rank relations between users and items.
To handle high-dimensional and sparse matrices efficiently, Luo et al. presented a series of studies on latent factor-based CF methods [18][19][20][21]. For example, they investigated the stochastic gradient descent algorithm and developed several extensions to improve accuracy [18]. In addition, non-negative factorization algorithms have been proposed to overcome the slow convergence problem under non-negativity constraints [19][20]. Wu et al. designed a deep structure to construct hierarchical latent factor models that achieve high computation and storage efficiency [21].
2.2 Clustering-based recommender systems
The clustering-based approach is well known for its scalability to large and sparse systems. The basic steps include calculating the similarity based on rating data, and then applying a clustering algorithm to group similar users and items into clusters. Some advanced methods are introduced in the following.
Ji et al. [22] proposed a reconstructive method for improving matrix approximation by introducing clustering and transfer learning techniques. Their method compresses a low-rank approximation into a cluster-level rating pattern referred to as a codebook and then constructs an improved approximation by expanding the codebook. Wu et al. [23] proposed a scalable co-clustering method called CCCF. The idea is that users may have different preferences over different subsets of items, and these subsets may overlap. CCCF explores subgroups of users who share interests over similar subsets of items through user-item co-clustering. Chen et al. [12] proposed a recommender system called N2VSCDNNR based on node2vec [24] and clustering technology. N2VSCDNNR recommends item clusters with high weights to the target user cluster, where the weights are obtained from the interaction frequency between items and users; the items in the high-weight clusters are then recommended to the target user. Kang and McAuley [13] proposed a clustering-based framework called CIGAR, which learns a preference-preserving binary embedding model and a candidate item re-ranking model. Although using binary codes can significantly reduce the inference cost, their constrained capability limits performance compared with models that use real-valued representations.
2.3 Recommender systems with attention mechanisms
Attention mechanisms are components of prediction systems that enable a system to focus sequentially on different subsets of the input [25]. They have been widely adopted in recommender systems recently.
Chen et al. [26] presented an attentive collaborative filtering model to address implicit feedback in multimedia recommendation. They introduced item-level and component-level attention models to assign attentive weights for inferring the underlying user preferences encoded in implicit user feedback. Zhu et al. [27] presented a tree-based deep recommendation model that predicts user preferences using a max-heap-like tree probability formulation with an attention-based deep network; the underlying deep model and the tree structure are learned alternately. Tay et al. [28] proposed a metric learning approach called latent relational metric learning for recommendation. They adopt the idea of translating users to items by translation vectors while relying on a memory attention network to learn the relation vectors. He et al. [29] proposed a user-generated list recommendation model that leverages the hierarchical structure of items, lists, and users to capture the containment relationship between lists and items. They applied a self-attention network to refine the item and list representations by considering the consistency of neighboring items and lists. Hu et al. [30] introduced a model named BCFNet, an extended version of DeepCF [17]. BCFNet consists of three sub-models: representation learning, matching function learning, and a balance module. In particular, an attention mechanism is integrated to improve the representation learning and matching function learning.
For group recommendation, attention mechanisms can be employed to aggregate the properties of a group of users/items. Cao et al. [31] proposed using NeuMF with attention models to characterize user and user-group interest at the same time. They further integrated the modeling of user-item interactions into their method, making it possible to reinforce the two tasks of recommending items for users and for user groups. Chen et al. [1] considered the task of recommending a set of items (a bundle) to a user. They designed a factorized attention network to fuse the set of item representations into a bundle representation.
Attention mechanisms are also applied in sequential recommendation. For example, Zheng et al. [2] proposed a memory-augmented hierarchical attention network (MAHAN) for next point-of-interest (POI) recommendation. MAHAN incorporates two networks to tackle short-term check-in sequences and long-term memories for preference modeling, and a co-attention network characterizes the interaction between the long-term and short-term preferences. Wu et al. [32] presented a similar idea, where the long-term preference is formulated by attention weight evaluation. Hao et al. [33] explored users' short-term preferences based on the local and global features of the annular graph of the user behavior sequence via the self-attention mechanism. Fan et al. [34] introduced low-rank decomposed self-attention to generate content-aware representations, where items are aggregated into latent interests to perform a lightweight self-attention computation.
2.4 Summary
To sum up, CF techniques that employ neural networks to model user-item interactions have become a popular choice. However, these techniques mainly focus on a single representation space, which we argue is insufficient to model complicated user-item interactions. Clustering-based methods have been proposed in several studies to address the sparsity and scalability problems, but few of them take advantage of global and local representations derived from the cluster structure. We point out that multiple representations are beneficial because they provide multiple views for observing user-item interactions, and the attention mechanism offers an effective way to fuse these representations into a blended one. We develop representation-level attention, which differs from the item-level attention used by existing methods.
3 Proposed Method
In the proposed method, we build multiple representation networks and an attention network for the recommendation purpose. Fig. 1 illustrates an overview of the training process for these networks. Given a set of user-item interactions, we prepare two categories of training data, namely, global and local, according to whether clustering is involved or not. Then the representation networks are trained by using different training data categories and loss functions. Next, multiple global and local representations, which are generated from the representation networks, are employed to train the attention network. The attention network combines these representations to obtain the attentive representations of the users and items, which are used to measure the compatibility between each user-item pair. Details are elaborated in the following subsections.

3.1 Global representation
The goal of the CF approach with implicit feedback is to explore the unobserved relations between users and items and to answer the question: will the user interact with the item in the future? A user-item interaction consists of, for example, the user assigning a score to an item or checking information about an item. The interaction history is usually described by an interaction matrix, denoted as $\mathbf{R}$, whose element $r_{ui}$ is indexed by the user ID $u$ and the item ID $i$ and represents the preference of user $u$ for item $i$. This study focuses on evidence that the user interacted with an item, regardless of the extent of the interaction. Therefore, the matrix element is defined by:
$$r_{ui} = \begin{cases} 1, & \text{if the interaction between user } u \text{ and item } i \text{ is observed,} \\ 0, & \text{otherwise.} \end{cases} \qquad (1)$$
A triplet training set $T$ is sampled from $\mathbf{R}$ and expressed as:
$$T = \left\{ (u, i, j) \;\middle|\; r_{ui} = 1,\ r_{uj} = 0 \right\}, \qquad (2)$$
where $i$ and $j$ are the positive and negative items, respectively. In practice, it is impossible to enumerate all triplet combinations, and many of them are trivial for training. We adopt a simple strategy in this study [6][35]. Given the row vector of user $u$ in interaction matrix $\mathbf{R}$, for each positive item $i$ (where $r_{ui} = 1$), we randomly choose a negative item $j$ (where $r_{uj} = 0$) to produce a triplet $(u, i, j)$. The sampling process repeats until the number of triplets is sufficient.
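As a concrete illustration of this sampling strategy, the following sketch (our own illustration rather than the released code; it assumes the interaction matrix is stored as a SciPy sparse matrix) draws triplets by rejection-sampling negatives:

```python
import numpy as np
from scipy.sparse import csr_matrix

def sample_triplets(R: csr_matrix, num_triplets: int, seed: int = 0) -> np.ndarray:
    """Sample (user, positive item, negative item) triplets from a binary
    interaction matrix R, following the strategy described above."""
    rng = np.random.default_rng(seed)
    users, pos_items = R.nonzero()          # all observed (u, i) pairs with r_ui = 1
    n_items = R.shape[1]
    triplets = []
    while len(triplets) < num_triplets:
        idx = rng.integers(len(users))      # pick an observed pair uniformly at random
        u, i = users[idx], pos_items[idx]
        j = rng.integers(n_items)           # draw items until an unobserved one is found
        while R[u, j] != 0:
            j = rng.integers(n_items)
        triplets.append((u, i, j))
    return np.asarray(triplets)
```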
To derive the global representation, local information such as the cluster assignment of the training data is ignored during the training phase. Let $\mathcal{S}_g$ be a global representation space of $d$ dimensions. To project users and items into $\mathcal{S}_g$, the representation network is designed based on the Siamese network [36], as shown in Fig. 2. The embedding layer transforms a user or item ID into a dense vector of a fixed size. The embedding vector is then forwarded to hidden layers for nonlinear transformation. Note that we modify the typical Siamese network so that the variant can accept a triplet as input in the training process. In addition, we separate the user from the item without sharing the hidden layers; in other words, the network parameters can be very different between the user part and the item part. Because users and items are different in nature, they should be processed through different paths in order to be projected into the common representation space. Through the respective embedding and hidden layers, the inputs $u$, $i$, and $j$ are transformed, with L2 normalization, into the representation outputs $\mathbf{g}_u$, $\mathbf{g}_i$, and $\mathbf{g}_j$, respectively.
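A minimal Keras sketch of this triplet variant of the Siamese network is given below. The ReLU activations and the 64-unit layer widths are assumptions borrowed from the experimental settings in Section 4.3; the figure fixes only the overall structure (embedding, hidden layers, L2 normalization, and unshared user/item towers):

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_representation_network(n_users: int, n_items: int, dim: int = 64) -> Model:
    """Triplet variant of the Siamese network: separate user and item towers
    (embedding + two hidden layers) projecting IDs into a common,
    L2-normalized d-dimensional representation space."""
    def make_tower(n_ids: int) -> tf.keras.Sequential:
        return tf.keras.Sequential([
            layers.Embedding(n_ids, dim),                      # ID -> dense vector
            layers.Dense(dim, activation="relu"),              # hidden layer 1
            layers.Dense(dim, activation="relu"),              # hidden layer 2
            layers.Lambda(lambda v: tf.math.l2_normalize(v, axis=-1)),
        ])

    user_tower = make_tower(n_users)   # user and item towers share no parameters
    item_tower = make_tower(n_items)   # positive/negative items share the item tower

    u_in = layers.Input(shape=(), dtype="int32", name="user_id")
    i_in = layers.Input(shape=(), dtype="int32", name="pos_item_id")
    j_in = layers.Input(shape=(), dtype="int32", name="neg_item_id")
    return Model([u_in, i_in, j_in],
                 [user_tower(u_in), item_tower(i_in), item_tower(j_in)])
```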

The training of the representation network is guided by a triplet loss function. Using different loss functions to generate variant representations provides multiple views to characterize each user and item. We introduce two loss functions for this purpose, the dot product loss and the Euclidean distance loss; more loss functions can be introduced if necessary. The dot product loss originates from BPR-Opt [14]. Its variant that adopts conventional matrix factorization, referred to as BPR-MF, is generally used as an underlying preference estimator. BPR-MF can be optimized by a contrastive pairwise ranking objective over a triplet, referred to as the product loss:
$$L_{prod} = \sum_{(u,i,j) \in T} -\ln \sigma\!\left( \mathbf{g}_u \cdot \mathbf{g}_i - \mathbf{g}_u \cdot \mathbf{g}_j \right), \qquad (3)$$
where $\sigma(\cdot)$ is the sigmoid function and $\cdot$ denotes the dot product. Because enumerating all triplets in $T$ is typically intractable, BPR-MF uses stochastic gradient descent (SGD) to optimize the model. Specifically, in each step of SGD, a batch of triplets is dynamically sampled from $T$. In addition, L2 regularization is adopted on the user and item representations, which is crucial to alleviate overfitting.
The second global representation is generated based on the Euclidean distance metric, where the optimization objective is the distance loss [37]:
$$L_{dist} = \sum_{(u,i,j) \in T} \max\!\left( 0,\ \lVert \mathbf{g}_u - \mathbf{g}_i \rVert^2 - \lVert \mathbf{g}_u - \mathbf{g}_j \rVert^2 + m \right), \qquad (4)$$
where $m$ is the margin.
To make the optimization effective, hard examples, i.e., triplets that result in a large loss, can be selected frequently during training. The network parameters are updated through SGD to minimize the loss function iteratively. In the following, the global representations of user $u$ and item $i$ are denoted as $\mathbf{g}_u$ and $\mathbf{g}_i$, respectively.
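The two loss functions above translate directly into TensorFlow. In this sketch, the margin value is illustrative, since the text treats it as a tunable hyperparameter:

```python
import tensorflow as tf

def product_loss(g_u: tf.Tensor, g_i: tf.Tensor, g_j: tf.Tensor) -> tf.Tensor:
    """BPR-style product loss of Eq. (3): -ln sigma(g_u.g_i - g_u.g_j)."""
    diff = tf.reduce_sum(g_u * g_i, axis=-1) - tf.reduce_sum(g_u * g_j, axis=-1)
    return tf.reduce_sum(-tf.math.log_sigmoid(diff))

def distance_loss(g_u: tf.Tensor, g_i: tf.Tensor, g_j: tf.Tensor,
                  margin: float = 0.5) -> tf.Tensor:
    """Triplet distance loss of Eq. (4); margin=0.5 is an illustrative value."""
    d_pos = tf.reduce_sum(tf.square(g_u - g_i), axis=-1)
    d_neg = tf.reduce_sum(tf.square(g_u - g_j), axis=-1)
    return tf.reduce_sum(tf.maximum(d_pos - d_neg + margin, 0.0))
```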
3.2 Local representation
Roughly speaking, the global representations provide a macro view of users and items; however, they are insufficient to model complex user-item interactions. The local representations, on the other hand, are derived by learning from local data, which are extracted from the clusters that are close to the user. They model user-item interactions from a micro view.
For simplicity, we assume the global representation space is a Euclidean space. K-means clustering is used to construct a cluster set $C = \{c_1, \ldots, c_K\}$ of size $K$. Given a query (user) $u$ and its global representation $\mathbf{g}_u$, the $n$ clusters that are the closest to $\mathbf{g}_u$ are chosen from the $K$ clusters and denoted as the candidate clusters $C_u$; the corresponding codebook of centroids is $\{\boldsymbol{\mu}_c \mid c \in C_u\}$. To construct the local representation network, we employ the same network architecture defined in Fig. 2 but train it with the local properties of the candidate clusters.
Suppose that a positive item $i$ and a negative item $j$ belong to candidate clusters $c_a$ and $c_b$, i.e., $i \in c_a$ and $j \in c_b$, where $c_a, c_b \in C_u$. There are two cases to consider when generating the triplet training set. The first case is $c_a = c_b$, i.e., $i$ and $j$ are in the same cluster. The triplet is included in the local intra-triplet set $T_{intra}$ when it satisfies the following criterion:
$$T_{intra} = \left\{ (u, i, j) \;\middle|\; \lVert \mathbf{g}_u - \mathbf{g}_i \rVert > \lVert \mathbf{g}_u - \mathbf{g}_j \rVert,\ c_a = c_b \right\}. \qquad (5)$$
The intuition is that, if the distance between $\mathbf{g}_u$ and $\mathbf{g}_i$ is larger than that between $\mathbf{g}_u$ and $\mathbf{g}_j$, the triplet incurs a positive triplet loss. We can use this hard example to tune the representation network so that the user is close to its positive item and far away from its negative item. The other case is $c_a \neq c_b$, i.e., $i$ and $j$ are in different clusters. A similar definition is given for the local inter-triplet set $T_{inter}$:
$$T_{inter} = \left\{ (u, i, j) \;\middle|\; \lVert \mathbf{g}_u - \boldsymbol{\mu}_a \rVert > \lVert \mathbf{g}_u - \boldsymbol{\mu}_b \rVert,\ c_a \neq c_b \right\}, \qquad (6)$$
where $\boldsymbol{\mu}_a$ and $\boldsymbol{\mu}_b$ denote the centroids of $c_a$ and $c_b$, respectively. Unlike Eq. (5), which considers the relation between the user and the items, Eq. (6) addresses the relation between the user and the clusters. That is, if the distance between $\mathbf{g}_u$ and $\boldsymbol{\mu}_a$ (the centroid of the cluster that $i$ belongs to) is larger than that between $\mathbf{g}_u$ and $\boldsymbol{\mu}_b$ (the centroid of the cluster that $j$ belongs to), we take this hard example to tune the user representation network so that the user is close to its positive cluster $c_a$ and far away from its negative cluster $c_b$. Consequently, both $T_{intra}$ and $T_{inter}$ serve as the local training dataset for learning the local representations; a sketch of this construction follows.
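The sketch below partitions pre-sampled triplets into the intra- and inter-cluster sets of Eqs. (5) and (6), using k-means on the global representations. It is our own illustration: for brevity it omits restricting items to the user's $n$ candidate clusters, and all function and variable names are hypothetical:

```python
import numpy as np
from sklearn.cluster import KMeans

def build_local_triplet_sets(g_user, g_item, triplets, n_clusters=100, seed=0):
    """Split triplets into T_intra (Eq. (5)) and T_inter (Eq. (6)) based on
    k-means clustering of the global item representations g_item."""
    km = KMeans(n_clusters=n_clusters, random_state=seed).fit(g_item)
    assign, centroids = km.labels_, km.cluster_centers_
    T_intra, T_inter = [], []
    for u, i, j in triplets:
        gu = g_user[u]
        ca, cb = assign[i], assign[j]
        if ca == cb:
            # Hard intra-cluster example: positive item farther than negative item.
            if np.linalg.norm(gu - g_item[i]) > np.linalg.norm(gu - g_item[j]):
                T_intra.append((u, i, j))
        else:
            # Hard inter-cluster example: positive centroid farther than negative one.
            if np.linalg.norm(gu - centroids[ca]) > np.linalg.norm(gu - centroids[cb]):
                T_inter.append((u, i, j))
    return T_intra, T_inter
```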
3.3 Attentive representation
By leveraging the global and local properties with different loss functions, multiple representation networks can be trained to generate respective representations for each user/item. Moreover, an attention network is proposed to find the best combination of these representations. That is, the proposed attention network maps a set of user-item representations to weighted-sum representations of the user and item. A weight is assigned to each representation space by computing a compatibility function of the user and item. Details are described in the following.
Suppose that we construct $M$ representation networks to derive $M$ corresponding representation spaces; each user/item thus has $M$ representations. Fig. 3 gives an overview of the proposed network, consisting of the representation networks followed by an attention network, where four representation spaces ($M = 4$) are taken as an example. In the training process, the input layer receives a triplet $(u, i, j)$. Let $E^k$ be the $k$th representation network, $k = 1, \ldots, M$, which transforms the user and items to their $k$th representations. We express the $k$th representations of the triplet as $(\mathbf{e}^k_u, \mathbf{e}^k_i, \mathbf{e}^k_j)$ and forward them to the $k$th attention channel.

Each attention channel can be divided into three parts. The first part is a transformation function, denoted as $h^k(\cdot)$, which maps the user and positive item to the respective vectors $\mathbf{q}^k_u$ and $\mathbf{q}^k_i$:
$$\mathbf{q}^k_u = h^k(\mathbf{e}^k_u), \qquad \mathbf{q}^k_i = h^k(\mathbf{e}^k_i). \qquad (7)$$
Note that the negative item is excluded in the attention network. The second part calculates the compatibility between the user and the positive item; the idea is to pay more attention to the representation space in which the user and the positive item are more compatible, as measured by the dot product. The compatibility is then converted to a normalized weight, denoted as $a^k_{ui}$, by applying the softmax function over all representation spaces:
$$a^k_{ui} = \frac{\exp(\mathbf{q}^k_u \cdot \mathbf{q}^k_i)}{\sum_{l=1}^{M} \exp(\mathbf{q}^l_u \cdot \mathbf{q}^l_i)}. \qquad (8)$$
In the last part, a weighted sum is calculated as the linear combination of the representation spaces for each user/item:
$$\hat{\mathbf{e}}_u = \sum_{k=1}^{M} a^k_{ui}\, \mathbf{e}^k_u, \qquad \hat{\mathbf{e}}_i = \sum_{k=1}^{M} a^k_{ui}\, \mathbf{e}^k_i, \qquad \hat{\mathbf{e}}_j = \sum_{k=1}^{M} a^k_{ui}\, \mathbf{e}^k_j. \qquad (9)$$
The outputs $\hat{\mathbf{e}}_u$, $\hat{\mathbf{e}}_i$, and $\hat{\mathbf{e}}_j$ are denoted as the attentive representations. The proposed model is trained by the triplet loss function based on either the product loss $L_{prod}$ or the distance loss $L_{dist}$, and the training data are randomly sampled from the global and local triplet sets $T$, $T_{intra}$, and $T_{inter}$.
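The three parts of an attention channel map onto a compact Keras layer, sketched below. We stack the $M$ representations along a second axis and realize each transformation $h^k$ as a Dense layer, which is an assumption, since the exact form of $h^k$ is a design choice:

```python
import tensorflow as tf
from tensorflow.keras import layers

class RepresentationAttention(layers.Layer):
    """Representation-level attention (Eqs. (7)-(9)): weighs the M representation
    spaces per (user, positive item) pair and blends all inputs accordingly."""

    def __init__(self, num_spaces: int, dim: int, **kwargs):
        super().__init__(**kwargs)
        # One transformation h^k per representation space (Eq. (7)),
        # realized here as a linear Dense layer.
        self.transforms = [layers.Dense(dim) for _ in range(num_spaces)]

    def call(self, e_u, e_i, e_j):
        # e_u, e_i, e_j: (batch, M, dim) tensors stacking the M representations.
        q_u = tf.stack([h(e_u[:, k]) for k, h in enumerate(self.transforms)], axis=1)
        q_i = tf.stack([h(e_i[:, k]) for k, h in enumerate(self.transforms)], axis=1)
        compat = tf.reduce_sum(q_u * q_i, axis=-1)     # (batch, M) dot products
        a = tf.nn.softmax(compat, axis=-1)[..., None]  # weights a^k_ui (Eq. (8))
        # Weighted sums over the M spaces (Eq. (9)); the same weights, computed
        # from the (u, i) pair, blend the user, positive, and negative items.
        return (tf.reduce_sum(a * e_u, axis=1),
                tf.reduce_sum(a * e_i, axis=1),
                tf.reduce_sum(a * e_j, axis=1))
```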
3.4 Recommendation as retrieval
We treat the recommendation task as a retrieval process that searches for similar items in the multi-view representation spaces for a given user. In clustering-based recommendation, a coarse-to-fine search strategy can be applied without scanning the dataset exhaustively, making recommendation more efficient and scalable. For each item in the dataset, $M$ representations are extracted by the set of representation networks $\{E^k\}_{k=1}^{M}$. We choose one of the global representations and perform k-means clustering to create $K$ clusters and to build the index structure, in which each cluster has an attached inverted list of its associated items.
When a user query $u$ is submitted, the user representations are extracted and used to search for the $n$ nearest item clusters. We denote the items stored in these clusters as the candidate items. For each pair of user $u$ and candidate item $i$, the attention network is used to obtain the attentive representations $\hat{\mathbf{e}}_u$ and $\hat{\mathbf{e}}_i$ defined in Eq. (9) and to compute their Euclidean distance or dot product as the similarity estimate. The candidates are sorted to output the top-$N$ items as the recommendation result for user $u$. The retrieval process is summarized in Algorithm 1.
By taking advantage of the coarse-to-fine search strategy, only a portion of the dataset items are selected as candidates for the attention and re-ranking computation. Although applying the attention network to each user-item pair requires additional computation, it can be accelerated by GPUs, and thus the runtime overhead is usually minor. Therefore, the time complexity of the retrieval process can be considered sublinear with respect to the size of the item dataset. A sketch of the whole process is given below.
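Since Algorithm 1 itself is not reproduced here, the following sketch outlines the coarse-to-fine retrieval under stated assumptions; `global_user_rep`, `all_user_reps`, `all_item_reps`, and `blend` are hypothetical stand-ins for the trained networks and the precomputed index:

```python
import numpy as np

def recommend(u: int, n: int, top_n: int,
              global_user_rep, all_user_reps, all_item_reps,
              centroids, inverted_lists, blend) -> np.ndarray:
    """Coarse-to-fine retrieval sketch (cf. Algorithm 1). `blend` stands in
    for the trained attention network; `inverted_lists[c]` holds the item IDs
    assigned to cluster c, and `centroids` holds the k-means codebook."""
    g_u = global_user_rep(u)                      # representation in the indexing space
    dists = np.linalg.norm(centroids - g_u, axis=1)
    nearest = np.argsort(dists)[:n]               # coarse step: n nearest item clusters
    candidates = np.concatenate([inverted_lists[c] for c in nearest])
    # Fine step: blend the M representations per (u, i) pair (Eq. (9)) and re-rank.
    e_u, e_i = blend(all_user_reps(u), all_item_reps(candidates))
    scores = -np.linalg.norm(e_u - e_i, axis=1)   # or a dot product
    return candidates[np.argsort(-scores)[:top_n]]  # top-N recommendation
```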
4 Experimental Results
This section describes the evaluation of the proposed method and a comparison with several CF-based recommender systems. The experiments were carried out under Windows 10 with an Intel Core i7-6700 CPU, an NVIDIA GeForce GTX 1080Ti GPU, and 64 GB RAM. The programming environment was the Keras package of TensorFlow under Python 3.7. The source code is available at https://github.com/MunlikaRattaphun/Attention-on-Global-Local-Embedding-Spaces-in-Recommender-Systems.
4.1 Benchmark datasets
Four popular benchmark datasets, drawn from MovieLens, Yelp, and Amazon, were used in the experimental evaluation. We briefly introduce these datasets, as well as the corresponding training/test data partitions:
• MovieLens-100k. This dataset consists of 100,000 records of user-movie ratings from 943 users and 1,682 movies. Each user has rated at least 20 movies. We randomly sampled half of the user-movie interactions as test data and used the other half for training.
• Yelp Madison. The Yelp dataset is a set of user reviews of businesses in local cities; here, we selected the city of Madison. Users who had more than five business interactions were kept; the others were removed from the dataset. For each of the remaining users, we randomly sampled three user-business interactions as the test data and used the rest for training.
• Yelp Pittsburgh. This is another Yelp dataset, containing user reviews of businesses in Pittsburgh. The user filtering and training/test data generation were the same as for Yelp Madison.
• Amazon. This dataset contains product reviews from customers of Amazon.com. For this study, the category Movies and TV was chosen. Users with fewer than six reviews were removed from the dataset. The training/test data were generated in the same way as for Yelp Madison and Yelp Pittsburgh.
Table I summarizes the statistics of these datasets. Their scales and densities differ considerably, providing diverse conditions for observation.
Table I. Statistics of the benchmark datasets.

| Datasets | #Users | #Items | #Interactions | Density |
|---|---|---|---|---|
| MovieLens-100k | 943 | 1,682 | 100,000 | 0.0630 |
| Yelp Madison | 2,773 | 3,186 | 50,961 | 0.0058 |
| Yelp Pittsburgh | 5,116 | 6,443 | 111,246 | 0.0034 |
| Amazon | 32,910 | 45,585 | 1,089,539 | 0.0007 |
4.2 Evaluation metrics
Several widely used accuracy metrics for top-$N$ recommendation, including precision, recall, hit rate (HR), average reciprocal hit rank (ARHR), and normalized discounted cumulative gain (NDCG), were used for the evaluation. For a user query $u$, the top-$N$ recommended items are denoted as $\mathcal{R}_u$ and the ground-truth (positive) items as $\mathcal{G}_u$, where $|\cdot|$ is the set cardinality. These metrics are defined per query (and averaged over all user queries) as follows:

$$\text{Precision} = \frac{|\mathcal{R}_u \cap \mathcal{G}_u|}{|\mathcal{R}_u|}, \qquad (10)$$

$$\text{Recall} = \frac{|\mathcal{R}_u \cap \mathcal{G}_u|}{|\mathcal{G}_u|}, \qquad (11)$$

$$\text{HR} = \begin{cases} 1, & \text{if } |\mathcal{R}_u \cap \mathcal{G}_u| > 0, \\ 0, & \text{otherwise,} \end{cases} \qquad (12)$$

$$\text{ARHR} = \sum_{i \in \mathcal{R}_u \cap \mathcal{G}_u} \frac{1}{rank_i}, \qquad (13)$$

$$\text{NDCG} = \frac{1}{IDCG} \sum_{i \in \mathcal{R}_u \cap \mathcal{G}_u} \frac{1}{\log_2(rank_i + 1)}, \qquad (14)$$

where $rank_i$ is the ranking position of item $i$ and $IDCG$ is the discounted cumulative gain of the ideal ranking. Although a higher value reflects better accuracy in general, it is recommended to assess model performance with multiple metrics rather than any single one.
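For reference, the per-query metrics can be computed as in the sketch below; in the experiments, each metric is averaged over all user queries:

```python
import numpy as np

def topn_metrics(recommended, ground_truth):
    """Per-query metrics of Eqs. (10)-(14). `recommended` is an ordered list of
    item IDs (best first); `ground_truth` is the set of positive items."""
    gt = set(ground_truth)
    hits = [(rank, item) for rank, item in enumerate(recommended, start=1)
            if item in gt]
    n_hit = len(hits)
    ideal = sum(1.0 / np.log2(r + 1)                     # ideal DCG for normalization
                for r in range(1, min(len(gt), len(recommended)) + 1))
    return {
        "precision": n_hit / len(recommended),
        "recall": n_hit / len(gt),
        "hr": 1.0 if n_hit > 0 else 0.0,
        "arhr": sum(1.0 / rank for rank, _ in hits),
        "ndcg": sum(1.0 / np.log2(rank + 1) for rank, _ in hits) / ideal,
    }
```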
4.3 Hyperparameter and ablation study
To study the impact of the model parameters, we consider the following four global and local representations, as well as two attentive representations:
• Global representation / distance loss (GD) was generated based on the global training set and the distance loss function (Eq. (4)).
• Global representation / product loss (GP) was generated based on the global training set and the product loss function (Eq. (3)).
• Local representation / distance loss (LD) was generated based on the local training set and the distance loss function.
• Local representation / product loss (LP) was generated based on the local training set and the product loss function.
• Attentive representation / distance loss (AD) was blended from the above four representations by the proposed attention network and the distance loss function.
• Attentive representation / product loss (AP) was blended by the proposed attention network and the product loss function.
Except for using different training datasets and loss functions, these representations adopted the same representation network shown in Fig. 2, which contains one embedding layer and two hidden layers of 64 neurons each. The dimensionality of the output layer was also set to 64. Each model was trained for a maximum of 200 epochs with an early stopping strategy and optimized with the Adam optimizer, where the batch size was fixed at 512 and the learning rate was 0.00017. The margin $m$ of the loss function in Eq. (4) was set to a fixed value.
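Putting the stated configuration together, a single training step might look like the following sketch, which reuses the build_representation_network and distance_loss sketches from Section 3 (the MovieLens user/item counts come from Table I; the remaining details are assumptions):

```python
import tensorflow as tf

# Reuses build_representation_network and distance_loss from the earlier sketches.
model = build_representation_network(n_users=943, n_items=1682, dim=64)
optimizer = tf.keras.optimizers.Adam(learning_rate=0.00017)  # stated learning rate
BATCH_SIZE, MAX_EPOCHS = 512, 200                            # stated batch size / epoch cap

@tf.function
def train_step(u, i, j):
    """One optimization step on a batch of (user, positive, negative) triplets."""
    with tf.GradientTape() as tape:
        g_u, g_i, g_j = model([u, i, j], training=True)
        loss = distance_loss(g_u, g_i, g_j)                  # or product_loss
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
```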
We first evaluate the impact of the number of clusters $K$. Three different values of $K$ are evaluated for each of MovieLens, Yelp Madison, Yelp Pittsburgh, and Amazon; the range of $K$ is chosen carefully per dataset, since a $K$ that is too large or too small incurs a bad clustering result. The performance is plotted in Fig. 4, where each column shows a particular dataset with different $K$. Each plot shows six recall curves of the proposed representations under the top-$n$ candidate clusters. We ranked the items in the candidate clusters to return the top-20 items for recommendation.

As the number of candidate clusters $n$ increases, more positive items are found among the candidate clusters, and therefore the recall curves generally improve; most of them show a sharp rise at small $n$. However, some recall curves degrade as $n$ increases, which means that the corresponding model does not generate good representation spaces, so the relevance between the user and a positive item can be lower than that between the user and a negative item. The proposed attentive representations (AD and AP) are relatively stable under these configurations, showing that they become more robust by adaptively fusing multiple representations.
In general, the local representations (LD and LP) are better than the global representations (GD and GP). However, no single representation always outperforms the others, because of the variety of user-item interaction behaviors. Again, the proposed attention network, which combines different views of the representation spaces, yields the best recall on all the benchmark datasets. Based on the above results, we choose a suitable $K$ for each dataset in the subsequent experiments.
Next, we examine the recommendation results under more evaluation metrics, as shown in Figs. 5, 6, 7, and 8 for MovieLens, Yelp Madison, Yelp Pittsburgh, and Amazon, respectively. Each figure contains four subplots, namely recall, precision, HR, and ARHR (from left to right). The X-axis represents the number of top-$N$ recommendation items, and the Y-axis represents the accuracy; the number of candidate clusters $n$ is fixed. Again, we see that the proposed attention networks outperform the other representation networks under these accuracy metrics. In particular, the proposed AP is superior to AD under the ARHR metric, even though they are comparable under the other three metrics. Note that ARHR measures the ranking quality by assigning a high score to hits at top rank positions; that is, AP provides better ranking quality. Another interesting phenomenon is that, for the HR metric, most of the curves show a sharp rise at small $N$. Since HR indicates whether at least one positive item is in the top-$N$ set, we suggest that a small $N$ already serves as a good number of recommendation items.




An ablation study is conducted for the attention network on the global and local representations. Three representation types are considered: global, local, and all. The global type consists of GD, GP, GAD (global attentive representation with distance loss), and GAP (global attentive representation with product loss); the local type follows a similar definition. We fixed the number of candidate clusters $n$ and the number of recommended items $N$. Tables II and III list the performance on the four benchmark datasets; the number in boldface indicates the best accuracy in each metric column. It is clear that, with the attention network, the global attentive representations (i.e., GAD and GAP) gain further improvement by combining the two global representations, and the local attentive representations (i.e., LAD and LAP) gain a similar improvement. By combining all global and local representations together, the attention network yields the best accuracy, as demonstrated by AD and AP.
Table II. Ablation study of the attention network on MovieLens and Yelp Madison (boldface: best accuracy in each metric column).

| | | MovieLens | | | | Yelp Madison | | | |
|---|---|---|---|---|---|---|---|---|---|
| | | Recall | Precision | HR | ARHR | Recall | Precision | HR | ARHR |
| Global type | GD | 0.2230 | 0.2829 | 0.9025 | 1.1060 | 0.0797 | 0.0160 | 0.2126 | 0.0671 |
| | GP | 0.2001 | 0.2294 | 0.8956 | 0.8407 | 0.0790 | 0.0158 | 0.2124 | 0.0591 |
| | GAD | 0.2500 | 0.2961 | 0.9275 | 1.1736 | 0.0797 | 0.0159 | 0.2124 | 0.0671 |
| | GAP | 0.2391 | 0.2919 | 0.9166 | 1.1619 | 0.0818 | 0.0164 | 0.2172 | 0.0659 |
| Local type | LD | 0.2965 | 0.3817 | 0.9303 | 1.5752 | 0.0873 | 0.0174 | 0.2322 | 0.0745 |
| | LP | 0.2452 | 0.2896 | 0.9251 | 1.1421 | 0.0985 | 0.0197 | 0.2563 | 0.0799 |
| | LAD | 0.2958 | 0.3743 | 0.9298 | 1.5077 | 0.1008 | **0.0201** | 0.2605 | **0.0835** |
| | LAP | 0.2833 | 0.3478 | 0.9406 | 1.4750 | 0.0998 | 0.0200 | 0.2589 | 0.0804 |
| All type | AD | **0.3045** | **0.3858** | **0.9475** | **1.5861** | 0.1006 | **0.0201** | 0.2600 | 0.0829 |
| | AP | 0.3002 | 0.3841 | 0.9428 | 1.5759 | **0.1009** | **0.0201** | **0.2609** | 0.0834 |
Table III. Ablation study of the attention network on Yelp Pittsburgh and Amazon (boldface: best accuracy in each metric column).

| | | Yelp Pittsburgh | | | | Amazon | | | |
|---|---|---|---|---|---|---|---|---|---|
| | | Recall | Precision | HR | ARHR | Recall | Precision | HR | ARHR |
| Global type | GD | 0.0652 | 0.0130 | 0.1786 | 0.0559 | 0.0438 | 0.0088 | 0.1117 | 0.0388 |
| | GP | 0.0580 | 0.0116 | 0.1594 | 0.0464 | 0.0253 | 0.0051 | 0.0683 | 0.0207 |
| | GAD | 0.0652 | 0.0130 | 0.1786 | 0.0559 | 0.0470 | 0.0094 | 0.1207 | 0.0412 |
| | GAP | 0.0657 | 0.0131 | 0.1785 | 0.0549 | 0.0463 | 0.0093 | 0.1220 | 0.0412 |
| Local type | LD | 0.0710 | 0.0142 | 0.1903 | 0.0680 | 0.0513 | 0.0103 | 0.1257 | 0.0530 |
| | LP | 0.0632 | 0.0127 | 0.1709 | 0.0491 | 0.0528 | 0.0106 | 0.1337 | 0.0481 |
| | LAD | 0.0723 | 0.0144 | 0.1931 | 0.0682 | 0.0529 | 0.0106 | 0.1343 | 0.0482 |
| | LAP | 0.0705 | 0.0141 | 0.1898 | 0.0642 | 0.0593 | **0.0119** | **0.1460** | 0.0551 |
| All type | AD | 0.0722 | 0.0144 | 0.1930 | 0.0681 | 0.0569 | 0.0114 | 0.1410 | 0.0551 |
| | AP | **0.0727** | **0.0145** | **0.1963** | **0.0731** | **0.0595** | **0.0119** | **0.1460** | **0.0577** |
4.4 Comparison with other CF methods
We compared the proposed method with several CF recommender systems, covering network-based, factorization-based, and clustering-based approaches:
• MetaPath2Vec++ [38]: This system uses a network embedding model that characterizes user-item interactions by heterogeneous network representation learning.
• BiNE [39]: This system is also a network embedding model; it characterizes user-item interactions by bipartite representation learning.
• CoFactor [4]: This factorization model generates the representation by jointly decomposing the user-item interaction matrix and the item-item co-occurrence matrix with shared item latent factors.
• N2VSCDNNR [12]: This clustering-based method uses the node2vec technique to generate a common representation space and builds user and item clusters for matching.
• CIGAR [13]: This is also a clustering-based method; it uses a hashing technique to retrieve candidate items and learns a ranking model to re-rank them.
Tables IV and V present their performance in terms of recall, precision, HR, and ARHR. The clustering-based approaches (i.e., N2VSCDNNR, CIGAR, AD, and AP) yield better performance than the other approaches on the diverse benchmark datasets, demonstrating their ability to address the sparsity and scalability problems to a certain extent. Furthermore, CIGAR performs better than N2VSCDNNR on the sparser and larger datasets, i.e., Yelp and Amazon. However, these compared methods employ a single representation, which is insufficient to model complex user-item interactions. The proposed attention model combines multiple representations to regenerate a joint representation for each user-item pair dynamically and thus attains the best performance on the various benchmark datasets.
Table IV. Comparison with other CF methods on MovieLens and Yelp Madison.

| | MovieLens | | | | Yelp Madison | | | |
|---|---|---|---|---|---|---|---|---|
| | Recall | Precision | HR | ARHR | Recall | Precision | HR | ARHR |
| Metapath2vec++ | 0.2750 | 0.2810 | 0.8957 | 1.1761 | 0.0341 | 0.0101 | 0.1214 | 0.0381 |
| BiNE | 0.1236 | 0.1721 | 0.7108 | 0.7559 | 0.0542 | 0.0128 | 0.1512 | 0.0481 |
| CoFactor | 0.2750 | 0.2810 | 0.8957 | 1.1761 | 0.0951 | 0.0160 | 0.2088 | 0.0657 |
| N2VSCDNNR | 0.2865 | 0.2933 | 0.9350 | 1.2241 | 0.1042 | 0.0166 | 0.2166 | 0.0686 |
| CIGAR | 0.1728 | 0.2687 | 0.9300 | 0.9868 | 0.1124 | 0.0225 | 0.2952 | 0.0842 |
| AD | 0.2983 | 0.4702 | 0.9788 | 1.9169 | 0.1355 | 0.0271 | 0.3524 | 0.0922 |
| AP | 0.2983 | 0.4702 | 0.9788 | 1.9171 | 0.1355 | 0.0271 | 0.3524 | 0.0921 |
Table V. Comparison with other CF methods on Yelp Pittsburgh and Amazon.

| | Yelp Pittsburgh | | | | Amazon | | | |
|---|---|---|---|---|---|---|---|---|
| | Recall | Precision | HR | ARHR | Recall | Precision | HR | ARHR |
| Metapath2vec++ | 0.0326 | 0.0095 | 0.1164 | 0.0426 | 0.0257 | 0.0038 | 0.0643 | 0.0223 |
| BiNE | 0.0399 | 0.0115 | 0.1352 | 0.0526 | 0.0389 | 0.0059 | 0.0875 | 0.0366 |
| CoFactor | 0.0735 | 0.0144 | 0.1832 | 0.0602 | 0.0575 | 0.0087 | 0.1141 | 0.0494 |
| N2VSCDNNR | 0.0818 | 0.0153 | 0.2008 | 0.0644 | 0.0608 | 0.0093 | 0.1237 | 0.0528 |
| CIGAR | 0.1259 | 0.0252 | 0.3155 | 0.0876 | 0.0701 | 0.0140 | 0.1789 | 0.0665 |
| AD | 0.1345 | 0.0269 | 0.3455 | 0.1001 | 0.0746 | 0.0149 | 0.1861 | 0.0768 |
| AP | 0.1345 | 0.0269 | 0.3433 | 0.1138 | 0.0793 | 0.0159 | 0.1980 | 0.0796 |
Finally, we performed a leave-one-out evaluation to compare our methods with two other CF methods, both of which are end-to-end learning-based recommender systems:
• NeuMF [5]: This system generalizes matrix factorization with nonlinear neural networks to model user-item interactions.
• BCFNet [30]: This system extends DeepCF by combining representation learning and matching function learning with an attention mechanism and a balance module.
In the leave-one-out evaluation, we held out the latest interaction of each user as the test item and used the remaining data for training. For testing, we followed the same protocol as the above two methods: for each user, we randomly selected 99 unobserved interactions as negative examples, together with the latest interaction as the positive example. In total, 100 test items are ranked, and the top-10 items were evaluated by HR and NDCG, as sketched below.
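A sketch of this evaluation protocol, where `score_fn` is a hypothetical stand-in for the trained model's scoring function:

```python
import numpy as np

def leave_one_out_eval(score_fn, test_pos, neg_samples, k=10):
    """Rank each user's held-out positive item against 99 sampled negatives
    and report HR@k and NDCG@k averaged over users. `test_pos[u]` is the
    held-out item; `neg_samples[u]` is an array of 99 negative item IDs."""
    hr, ndcg = [], []
    for u, pos in test_pos.items():
        items = np.append(neg_samples[u], pos)     # 99 negatives + 1 positive
        scores = score_fn(u, items)
        rank = 1 + np.sum(scores > scores[-1])     # rank of the positive item
        hr.append(1.0 if rank <= k else 0.0)
        ndcg.append(1.0 / np.log2(rank + 1) if rank <= k else 0.0)
    return float(np.mean(hr)), float(np.mean(ndcg))
```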
Table VI lists the leave-one-out evaluation results for the four benchmark datasets. Note that BCFNet's result for Amazon is unavailable because its memory consumption is too large for our machine. For each dataset, we generated three representation spaces of 32, 64, and 128 dimensions. Our methods yield better HR and NDCG scores in the high-dimensional representation spaces. In particular, our NDCG scores outperform NeuMF and BCFNet by a large margin, indicating that our methods rank the positive example at a superior position. The proposed method is able to find better representations for users and items through separate representation networks, and separating items into positive and negative ones also has an important impact on the recommendation task. In addition, similar to BCFNet, the attention mechanism can be used to highlight important features. However, because the local representations are derived by learning from local data extracted from the clusters close to the user, a low-dimensional representation space is more likely than a high-dimensional one to mix dissimilar data points within a cluster. The resulting local representations may be poor, which in turn affects the final recommendation results.
Table VI. Leave-one-out evaluation (HR@10 and NDCG@10) under 32-, 64-, and 128-dimensional representation spaces.

MovieLens

| | 32-d | | 64-d | | 128-d | |
|---|---|---|---|---|---|---|
| | HR | NDCG | HR | NDCG | HR | NDCG |
| NeuMF | 0.6405 | 0.3679 | 0.6617 | 0.3834 | 0.6490 | 0.3766 |
| BCFNet | 0.6723 | 0.3896 | 0.6702 | 0.3912 | 0.7010 | 0.4096 |
| AD | 0.6426 | 0.3787 | 0.7063 | 0.4675 | 0.6903 | 0.3902 |
| AP | 0.6501 | 0.4011 | 0.7243 | 0.4944 | 0.6999 | 0.4538 |

Yelp Madison

| | 32-d | | 64-d | | 128-d | |
|---|---|---|---|---|---|---|
| | HR | NDCG | HR | NDCG | HR | NDCG |
| NeuMF | 0.6383 | 0.3917 | 0.6502 | 0.3986 | 0.6556 | 0.4104 |
| BCFNet | 0.6664 | 0.4199 | 0.6661 | 0.4170 | 0.6682 | 0.4235 |
| AD | 0.6304 | 0.3518 | 0.6477 | 0.4601 | 0.7010 | 0.5819 |
| AP | 0.6275 | 0.3357 | 0.6361 | 0.4672 | 0.7079 | 0.5729 |

Yelp Pittsburgh

| | 32-d | | 64-d | | 128-d | |
|---|---|---|---|---|---|---|
| | HR | NDCG | HR | NDCG | HR | NDCG |
| NeuMF | 0.6855 | 0.4253 | 0.6949 | 0.4324 | 0.6933 | 0.4394 |
| BCFNet | 0.7000 | 0.4458 | 0.7048 | 0.4575 | 0.7062 | 0.4546 |
| AD | 0.6571 | 0.3726 | 0.6841 | 0.5677 | 0.7261 | 0.6249 |
| AP | 0.6455 | 0.3729 | 0.6841 | 0.5677 | 0.6754 | 0.4398 |

Amazon

| | 32-d | | 64-d | | 128-d | |
|---|---|---|---|---|---|---|
| | HR | NDCG | HR | NDCG | HR | NDCG |
| NeuMF | 0.6437 | 0.3781 | 0.6422 | 0.3774 | 0.6490 | 0.3846 |
| BCFNet | - | - | - | - | - | - |
| AD | 0.6961 | 0.4356 | 0.6988 | 0.4319 | 0.6972 | 0.4302 |
| AP | 0.6746 | 0.3770 | 0.6763 | 0.3999 | 0.6972 | 0.4257 |
5 Conclusion
In this study, we proposed a novel clustering-based CF method for recommender systems. The user and item representations are learned from multiple views in terms of global/local representation spaces and dot product/Euclidean distance loss functions. An attention network is designed to generate a joint representation of these views for each user-item pair dynamically. Experimental results show that the proposed method is effective and competitive compared with several CF methods in which only one representation space is considered.
References
- [1] L. Chen, Y. Liu, X. He, L. Gao, and Z. Zheng, “Matching user with item set: Collaborative bundle recommendation with deep attention network.” in IJCAI, 2019, pp. 2095–2101.
- [2] C. Zheng, D. Tao, J. Wang, L. Cui, W. Ruan, and S. Yu, “Memory augmented hierarchical attention network for next point-of-interest recommendation,” IEEE Transactions on Computational Social Systems, vol. 8, no. 2, pp. 489–499, 2020.
- [3] G. Zhao, Z. Liu, Y. Chao, and X. Qian, “CAPER: Context-aware personalized emoji recommendation,” IEEE Transactions on Knowledge and Data Engineering, vol. 33, no. 9, pp. 3160–3172, 2020.
- [4] D. Liang, J. Altosaar, L. Charlin, and D. M. Blei, “Factorization meets the item embedding: Regularizing matrix factorization with item co-occurrence,” in Proceedings of the 10th ACM Conference on Recommender Systems, 2016, pp. 59–66.
- [5] X. He, L. Liao, H. Zhang, L. Nie, X. Hu, and T.-S. Chua, “Neural collaborative filtering,” in Proceedings of the 26th international conference on world wide web, 2017, pp. 173–182.
- [6] C.-K. Hsieh, L. Yang, Y. Cui, T.-Y. Lin, S. Belongie, and D. Estrin, “Collaborative metric learning,” in Proceedings of the 26th international conference on world wide web, 2017, pp. 193–201.
- [7] Y. Koren, “Factorization meets the neighborhood: a multifaceted collaborative filtering model,” in Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, 2008, pp. 426–434.
- [8] X. Luo, Y. Xia, Q. Zhu, and Y. Li, “Boosting the k-nearest-neighborhood based incremental collaborative filtering,” Knowledge-Based Systems, vol. 53, pp. 90–99, 2013.
- [9] B. M. Sarwar, G. Karypis, J. Konstan, and J. Riedl, “Recommender systems for large-scale e-commerce: Scalable neighborhood formation using clustering,” in Proceedings of the fifth international conference on computer and information technology, vol. 1, 2002, pp. 291–324.
- [10] C. Rana and S. K. Jain, “An evolutionary clustering algorithm based on temporal features for dynamic recommender systems,” Swarm and Evolutionary Computation, vol. 14, pp. 21–30, 2014.
- [11] C.-L. Liao and S.-J. Lee, “A clustering based approach to improving the efficiency of collaborative filtering recommendation,” Electronic Commerce Research and Applications, vol. 18, pp. 1–9, 2016.
- [12] J. Chen, Y. Wu, L. Fan, X. Lin, H. Zheng, S. Yu, and Q. Xuan, “N2VSCDNNR: A local recommender system based on node2vec and rich information network,” IEEE Transactions on Computational Social Systems, vol. 6, no. 3, pp. 456–466, 2019.
- [13] W.-C. Kang and J. McAuley, “Candidate generation with binary codes for large-scale top-n recommendation,” in CIKM, 2019, pp. 1523–1532.
- [14] S. Rendle, C. Freudenthaler, Z. Gantner, and L. Schmidt-Thieme, “BPR: Bayesian personalized ranking from implicit feedback,” in Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, 2009, pp. 452–461.
- [15] S. Rendle, W. Krichene, L. Zhang, and J. Anderson, “Neural collaborative filtering vs. matrix factorization revisited,” in Fourteenth ACM Conference on Recommender Systems, 2020, pp. 240–248.
- [16] F. Xue, X. He, X. Wang, J. Xu, K. Liu, and R. Hong, “Deep item-based collaborative filtering for top-n recommendation,” ACM Transactions on Information Systems (TOIS), vol. 37, no. 3, pp. 1–25, 2019.
- [17] Z. Deng, L. Huang, C. Wang, J. Lai, and P. S. Yu, “DeepCF: A unified framework of representation learning and matching function learning in recommender system,” in AAAI, vol. 1, no. 33, 2019, pp. 61–68.
- [18] X. Luo, D. Wang, M. Zhou, and H. Yuan, “Latent factor-based recommenders relying on extended stochastic gradient descent algorithms,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 51, no. 2, pp. 916–926, 2021.
- [19] X. Luo, M. Zhou, S. Li, D. Wu, Z. Liu, and M. Shang, “Algorithms of unconstrained non-negative latent factor analysis for recommender systems,” IEEE Transactions on Big Data, vol. 7, no. 1, pp. 227–240, 2021.
- [20] X. Luo, Z. Liu, S. Li, M. Shang, and Z. Wang, “A fast non-negative latent factor model based on generalized momentum method,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 51, no. 1, pp. 610–620, 2021.
- [21] D. Wu, X. Luo, M. Shang, Y. He, G. Wang, and M. Zhou, “A deep latent factor model for high-dimensional and sparse matrices in recommender systems,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 51, no. 7, pp. 4285–4296, 2021.
- [22] K. Ji, R. Sun, X. Li, and W. Shu, “Improving matrix approximation for recommendation via a clustering-based reconstructive method,” Neurocomputing, vol. 173, pp. 912–920, 2016.
- [23] Y. Wu, X. Liu, M. Xie, M. Ester, and Q. Yang, “CCCF: Improving collaborative filtering via scalable user-item co-clustering,” in Proceedings of the Ninth ACM International Conference on Web Search and Data Mining, 2016, pp. 73–82.
- [24] A. Grover and J. Leskovec, “node2vec: Scalable feature learning for networks,” in Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, 2016, pp. 855–864.
- [25] K. Cho, A. Courville, and Y. Bengio, “Describing multimedia content using attention-based encoder-decoder networks,” IEEE Transactions on Multimedia, vol. 17, no. 11, pp. 1875–1886, 2015.
- [26] J. Chen, H. Zhang, X. He, L. Nie, W. Liu, and T.-S. Chua, “Attentive collaborative filtering: Multimedia recommendation with item- and component-level attention,” in Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2017, pp. 335–344.
- [27] H. Zhu, X. Li, P. Zhang, G. Li, J. He, H. Li, and K. Gai, “Learning tree-based deep model for recommender systems,” in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2018, pp. 1079–1088.
- [28] Y. Tay, L. Anh Tuan, and S. C. Hui, “Latent relational metric learning via memory-based attention for collaborative ranking,” in Proceedings of the 2018 World Wide Web Conference, 2018, pp. 729–739.
- [29] Y. He, J. Wang, W. Niu, and J. Caverlee, “A hierarchical self-attentive model for recommending user-generated item lists,” in Proceedings of the 28th ACM International Conference on Information and Knowledge Management, 2019, pp. 1481–1490.
- [30] Z.-Y. Hu, J. Huang, Z.-H. Deng, C.-D. Wang, L. Huang, J.-H. Lai, and P. S. Yu, “BCFNet: A balanced collaborative filtering network with attention mechanism,” arXiv preprint arXiv:2103.06105, 2021.
- [31] D. Cao, X. He, L. Miao, Y. An, C. Yang, and R. Hong, “Attentive group recommendation,” in The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, 2018, pp. 645–654.
- [32] Y. Wu, K. Li, G. Zhao, and X. Qian, “Personalized long-and short-term preference learning for next poi recommendation,” IEEE Transactions on Knowledge and Data Engineering, 2020 (Early Access).
- [33] J. Hao, Y. Dun, G. Zhao, Y. Wu, and X. Qian, “Annular-graph attention model for personalized sequential recommendation,” IEEE Transactions on Multimedia, 2021 (Early Access).
- [34] X. Fan, Z. Liu, J. Lian, W. X. Zhao, X. Xie, and J.-R. Wen, “Lighter and better: Low-rank decomposed self-attention networks for next-item recommendation,” in Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2021, pp. 1733–1737.
- [35] L. Prétet, G. Richard, and G. Peeters, “Learning to rank music tracks using triplet loss,” in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 511–515.
- [36] G. Koch, R. Zemel, and R. Salakhutdinov, “Siamese neural networks for one-shot image recognition,” in ICML deep learning workshop, vol. 2. Lille, 2015.
- [37] F. Schroff, D. Kalenichenko, and J. Philbin, “FaceNet: A unified embedding for face recognition and clustering,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 815–823.
- [38] Y. Dong, N. V. Chawla, and A. Swami, “Metapath2vec: Scalable representation learning for heterogeneous networks,” in Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining, 2017, pp. 135–144.
- [39] M. Gao, L. Chen, X. He, and A. Zhou, “BiNE: Bipartite network embedding,” in The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, 2018, pp. 715–724.