This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

Probabilistic Metric Learning with Adaptive Margin for Top-K Recommendation

Chen Ma McGill University [email protected] Liheng Ma McGill University [email protected] Yingxue Zhang Huawei Noah’s Ark Lab Montreal [email protected] Ruiming Tang Huawei Noah’s Ark Lab [email protected] Xue Liu McGill University [email protected]  and  Mark Coates McGill University [email protected]
Abstract.

Personalized recommender systems are playing an increasingly important role as more content and services become available and users struggle to identify what might interest them. Although matrix factorization and deep learning based methods have proved effective in user preference modeling, they violate the triangle inequality and fail to capture fine-grained preference information. To tackle this, we develop a distance-based recommendation model with several novel aspects: (i) each user and item are parameterized by Gaussian distributions to capture the learning uncertainties; (ii) an adaptive margin generation scheme is proposed to generate the margins regarding different training triplets; (iii) explicit user-user/item-item similarity modeling is incorporated in the objective function. The Wasserstein distance is employed to determine preferences because it obeys the triangle inequality and can measure the distance between probabilistic distributions. Via a comparison using five real-world datasets with state-of-the-art methods, the proposed model outperforms the best existing models by 4-22% in terms of recall@K on Top-K recommendation.

1. Introduction

Internet users can easily access an increasingly vast number of online products and services, and it is becoming very difficult for users to identify the items that will appeal to them out of a plethora of candidates. To reduce information overload and to satisfy the diverse needs of users, personalized recommender systems have emerged and they are beginning to play an important role in modern society. These systems can provide personalized experiences, serve huge service demands, and benefit both the user-side and supply-side. They can: (i) help users easily discover products that are likely to interest them; and (ii) create opportunities for product and service providers to better serve customers and to increase revenue.

In all kinds of recommender systems, modeling the user-item interaction lies at the core. There are two common ways used in recent recommendation models to infer the user preference: matrix factorization (MF) and multi-layer perceptrons (MLPs). MF-based methods (e.g., (Hu et al., 2008; Rendle et al., 2009)) apply the inner product between latent factors of users and items to predict the user preferences for different items. The latent factors strive to depict the user-item relationships in the latent space. In contrast, MLP-based methods (e.g., (He et al., 2017b; Covington et al., 2016)) adopt (deep) neural networks to learn non-linear user-item relationships, which can generate better latent feature combinations between the embeddings of users and items (He et al., 2017b).

However, both MF-based and MLP-based methods violate the triangle inequality (Shrivastava and Li, 2014), and as a result may fail to capture the fine-grained preference information (Hsieh et al., 2017). As a concrete example in (Park et al., 2018), if a user accessed two items, MF or MLP-based methods will put both items close to the user, but will not necessarily put these two items close to each other, even if they share similar properties.

To address the limitations of MF and MLP-based methods, metric (distance) learning approaches have been utilized in the recommendation model (Hsieh et al., 2017; Park et al., 2018; Tay et al., 2018; Li et al., 2020), as the distance naturally satisfies the triangle inequality. These techniques project users and items into a low-dimensional metric space, where the user preference is measured by the distance to items. Specifically, CML (Hsieh et al., 2017) and LRML (Tay et al., 2018) are two representative models. CML minimizes the Euclidean distance between users and their accessed items, which facilitates user-user/item-item similarity learning. LRML incorporates a memory network to introduce additional capacity to learn relations between users and items in the metric space.

Refer to caption
Figure 1. A motivating example of handling uncertainties of learned embeddings.

Although existing distance-based methods have achieved satisfactory results, we argue that there are still several avenues for enhancing performance. First, previous distance-based methods (Hsieh et al., 2017; Park et al., 2018; Tay et al., 2018; Li et al., 2020) learn the user and item embeddings in a deterministic manner without handling the uncertainty. Relying solely on the learned deterministic embeddings may lead to an inaccurate understanding of user preferences. A motivating example is shown in Figure 1. After having accessed two songs s1s_{1} and s2s_{2} with different genres, the user uu may be placed between of s1s_{1} and s2s_{2}. If we only consider deterministic embeddings, s4s_{4} should be a good candidate. But if we consider the embeddings from a probabilistic perspective, s3s_{3} can be a better recommendation and it has the same genre as s1s_{1}. Second, most of the existing methods  (Hsieh et al., 2017; Tay et al., 2018; Park et al., 2018) adopt the margin ranking loss (hinge loss) with a fixed margin as the hyper-parameter. We argue that the margin value should be adaptive and relevant to corresponding training samples. Furthermore, different training phases may need different magnitudes of margin values. Setting a fixed value may not be an optimal solution. Third, previous distance-based methods  (Hsieh et al., 2017; Tay et al., 2018; Park et al., 2018; Li et al., 2020) do not explicitly model user-user and item-item relationships. Closely-related users are very likely to share the same interests, and if two items have similar attributes it is likely that a user will favour both. When inferring a user’s preferences, we should explicitly take into account the user-user and item-item similarities.

To address the shortcomings, we propose a Probabilistic Metric Learning model with an Adaptive Margin (PMLAM) for Top-K recommendation. PMLAM consists of three major components: 1) a user-item interaction module, 2) an adaptive margin generation module, and 3) a user-user/item-item relation modeling module. To capture the uncertainties in the learned user and item embeddings, each user or item is parameterized with one Gaussian distribution, where the distribution related parameters are learned by our model. In the user-item interaction module, we adopt the Wasserstein distance to measure the distances between users and items, thus not only taking into account the means but also the uncertainties. In the adaptive margin generation module, we model the learning of adaptive margins as a bilevel (inner and outer) optimization problem (Sinha et al., 2018), where we build a proxy function to explicitly link the learning of margin related parameters with the outer objective function. In the user-user and item-item relation modeling module, we incorporate two margin ranking losses with adaptive margins for user-pairs and item-pairs, respectively, to explicitly encourage similar users or items to be mapped closer to one another in the latent space. We extensively evaluate our model by comparing with many state-of-the-art methods, using two performance metrics on five real-world datasets. The experimental results not only demonstrate the improvements of our model over other baselines but also show the effectiveness of the proposed modules.

To summarize, the major contributions of this paper are:

  • To capture the uncertainties in the learned user/item embeddings, we represent each user and item as a Gaussian distribution. The Wasserstein distance is leveraged to measure the user preference for items while simultaneously considering the uncertainty.

  • To generate an adaptive margin, we cast margin generation as a bilevel optimization problem, where a proxy function is built to explicitly update the margin generation related parameters.

  • To explicitly model the user-user and item-item relationships, we apply two margin ranking losses with adaptive margins to force similar users and items to map closer to one another in the latent space.

  • Experiments on five real-world datasets show that the proposed PMLAM model significantly outperforms the state-of-the-art methods for the Top-K recommendation task.

2. Related Work

In this section we summarize and discuss work that is related to our proposed top-K recommendation model.

In many real-world recommendation scenarios, user implicit data (Wang et al., 2015; Ma et al., 2019a), e.g., clicking history, is more common than explicit feedback (Salakhutdinov et al., 2007) such as user ratings. The implicit feedback setting, also called one-class collaborative filtering (OCCF) (Pan et al., 2008), arises when only positive samples are available. To tackle this challenging problem, effective methods have been proposed.

Matrix Factorization-based Methods. Popularized by the Netflix prize competition, matrix factorization (MF) based methods have become a prominent solution for personalized recommendation (Koren et al., 2009). In (Hu et al., 2008), Hu et al. propose a weighted regularized matrix factorization (WRMF) model to treat all the missing data as negative samples, while heuristically assigning confidence weights to positive samples. Rendle et al. adopt a different approach in (Rendle et al., 2009), proposing a pair-wise ranking objective (Bayesian personalized ranking) to model the pair-wise relationships between positive items and negative items for each user, where the negative samples are randomly sampled from the unobserved feedback. To allow unobserved items to have varying degrees of importance, He et al. in (He et al., 2016) propose to weight the missing data based on item popularity, demonstrating improved performance compared to WRMF.

Multi-layer Perceptron-based Methods. Due to their ability to learn more complex non-linear relationships between users and items, (deep) neural networks have been a great success in the domain of recommender systems. He et al. in (He et al., 2017b) propose a neural network-based collaborative filtering model, where a multi-layer perceptron is used to learn the non-linear user-item interactions. In (Wu et al., 2016; Ma et al., 2018, 2019b), (denoising) autoencoders are employed to learn the user or item hidden representations from user implicit feedback. Autoencoder approaches can be shown to be generalizations of many of the MF methods (Wu et al., 2016). In (Xue et al., 2017; Guo et al., 2017), conventional matrix factorization and factorization machine methods benefit from the representation ability of deep neural networks for learning either the user-item relationships or the interactions with side information. Graph neural networks (GNNs) have recently been incorporated in recommendation algorithms because they can learn and model relationships between entities (Wang et al., 2019; Sun et al., 2019; Ma et al., 2020).

Distance-based Methods. Due to their capacity to measure the distance between users and items, distance-based methods have been successfully applied in Top-K recommendation. In (Hsieh et al., 2017), Hsieh et al. propose to compute the Euclidean distance between users and items for capturing fine-grained user preference. In (Tay et al., 2018), Tay et al. adopt a memory network (Sukhbaatar et al., 2015) to explicitly store the user preference in external memories. Park et al. in (Park et al., 2018) apply a translation emb edding to capture more complex relations between users and items, where the translation embedding is learned from the neighborhood information of users and items. In (He et al., 2017a), He et al. apply a distance metric to capture how the user interest shifts across sequential user-item interactions. In (Li et al., 2020), Li et al. propose to measure the trilateral relationship from both the user-centric and item-centric perspectives and learn adaptive margins for the central user and positive item.

Our proposed recommendation model is different in key ways from all of the methods identified above. In contrast to the matrix factorization (Hu et al., 2008; Rendle et al., 2009; He et al., 2016) and neural network methods (He et al., 2017b; Wu et al., 2016; Ma et al., 2018, 2019b; Xue et al., 2017; Guo et al., 2017; Wang et al., 2019), we employ the Wasserstein distance that obeys the triangle inequality. This is important for ensuring that users with similar interaction histories are mapped close together in the latent space. In contrast to most of the prior distance-based approaches, (Hsieh et al., 2017; Tay et al., 2018; Park et al., 2018; Li et al., 2020), we employ parameterized Gaussian distributions to represent each user and item in order to capture the uncertainties of learned user preferences and item properties. Moreover, we formulate a bilevel optimization problem and incorporate a neural network to generate adaptive margins for the commonly applied margin ranking loss function.

3. Problem Formulation

The recommendation task considered in this paper takes as input the user implicit feedback. For each user ii, the user preference data is represented by a set that includes the items she preferred, e.g., 𝒟i={I1,,Ij,,I|𝒟i|}\mathcal{D}_{i}=\{I_{1},...,I_{j},...,I_{|\mathcal{D}_{i}|}\}, where IjI_{j} is an item index in the dataset. The top-KK recommendation task in this paper is formulated as: given the training item set 𝒮i\mathcal{S}_{i}, and the non-empty test item set 𝒯i\mathcal{T}_{i} (requiring that 𝒮i𝒯i=𝒟i\mathcal{S}_{i}\cup\mathcal{T}_{i}=\mathcal{D}_{i} and 𝒮i𝒯i=\mathcal{S}_{i}\cap\mathcal{T}_{i}=\emptyset) of user ii, the model must recommend an ordered set of items 𝒳i\mathcal{X}_{i} such that |𝒳i|K|\mathcal{X}_{i}|\leq K and 𝒳i𝒮i=\mathcal{X}_{i}\cap\mathcal{S}_{i}=\emptyset. Then the recommendation quality is evaluated by a matching score between 𝒯i\mathcal{T}_{i} and 𝒳i\mathcal{X}_{i}, such as Recall@KK.

4. Methodology

In this section, we present the proposed model shown in Fig. 2. We first introduce the user-item interaction module, which captures the user-item interactions by calculating the Wasserstein distance between users’ and items’ distributions. Then we describe the adaptive margin generation module, which generates adaptive margins during the training process. Next, we present the user-user and item-item relation modeling module. Lastly, we specify the objective function and explain the training process of the proposed model.

Refer to caption
Figure 2. The demonstration of the proposed model. 𝒥UI\mathcal{J}^{U-I} denotes the combined optimization regarding 𝒥innerUI\mathcal{J}_{inner}^{U-I} and 𝒥outerUI\mathcal{J}_{outer}^{U-I}. 𝒥UU\mathcal{J}^{U-U} and 𝒥II\mathcal{J}^{I-I} follow the same manner with 𝒥UI\mathcal{J}^{U-I}.

4.1. Wasserstein Distance for Interactions

Previous works (Hsieh et al., 2017; Tay et al., 2018) use the user and item embeddings in a deterministic manner and do not measure or learn the uncertainties of user preferences and item properties. Motivated by probabilistic matrix factorization (PMF) (Salakhutdinov and Mnih, 2007), we represent each user or item as a single Gaussian distribution. In contrast to PMF, which applies Gaussian priors on user and item embeddings, users and items in our model are parameterized by Gaussian distributions, where the means μ\mu and covariances Σ\Sigma are directly learned. Specifically, the latent factors of user ii and item jj are represented as:

(1) 𝐮i\displaystyle\mathbf{u}_{i} 𝒩(μi(U),𝚺i(U)),\displaystyle\sim\mathcal{N}(\mathbf{\mu}^{(U)}_{i},\mathbf{\Sigma}^{(U)}_{i})\,,
𝐯j\displaystyle\mathbf{v}_{j} 𝒩(μj(I),𝚺j(I)).\displaystyle\sim\mathcal{N}(\mathbf{\mu}^{(I)}_{j},\mathbf{\Sigma}^{(I)}_{j})\,.

Here μi(U)\mathbf{\mu}^{(U)}_{i} and 𝚺i(U)\mathbf{\Sigma}^{(U)}_{i} are the learned mean vector and covariance matrix of user ii, respectively; μj(I)\mathbf{\mu}^{(I)}_{j} and 𝚺j(I)\mathbf{\Sigma}^{(I)}_{j} are the learned mean vector and covariance matrix of item jj. To limit the complexity of the model and reduce the computational overhead, we assume that the embedding dimensions are uncorrelated. Thus, 𝚺\mathbf{\Sigma} is a diagonal covariance matrix that can be represented as a vector. Specifically, μh\mathbf{\mu}\in\mathbb{R}^{h} and 𝚺h\mathbf{\Sigma}\in\mathbb{R}^{h}, where hh is the dimension of the latent space.

Widely used distance metrics for deterministic embeddings, like the Euclidean distance, do not properly measure the distance between distributions. Since users and items are represented by probabilistic distributions, we need a distance measure between distributions. Among the commonly used distance metric between distributions, we adopt the Wasserstein distance to measure the user preference for an item. The reasons are twofold: i) the Wasserstein distance satisfies all the properties a distance should have; and ii) the Wasserstein distance has a simple form when calculating the distance between Gaussian distributions (Mallasto and Feragen, 2017). Formally, the pp-th Wasserstein distance between two probability measures μ\mu and ν\nu on a Polish metric space (Srivastava, 2008) (𝒳,d)(\mathcal{X},d) is defined (Givens and Shortt, 1984):

𝒲p(μ,ν):=(infJ𝒥(μ,ν)𝒳×𝒳d(x,y)p𝑑J(x,y))1p,\mathcal{W}_{p}(\mu,\nu):=\Big{(}\inf_{J\in\mathcal{J}(\mu,\nu)}\int_{\mathcal{X}\times\mathcal{X}}d(x,y)^{p}dJ(x,y)\Big{)}^{\frac{1}{p}}\,,

where d(,)pd(\cdot,\cdot)^{p} is an arbitrary distance with pthp^{th} moment (Casella and Berger, 2002) for a deterministic variable, p[1,+)p\in[1,+\infty); and 𝒥(μ,ν)\mathcal{J}(\mu,\nu) denotes the set of all measures JJ on 𝒳×𝒳\mathcal{X}\times\mathcal{X} which admit μ\mu and ν\nu as marginals. When p1p\geq 1, the pp-th Wasserstein distance preserves all properties of a metric, including both symmetry and the triangle inequality.

The calculation of the general Wasserstein distance is computation-intensive (Xie et al., 2019). To reduce the computational cost, we use Gaussian distributions for the latent representations of users and items. Then when p=2p=2, the 22-nd Wasserstein distance (abbreviated as 𝒲2\mathcal{W}_{2}) has a closed form solution, thus making the calculation process much faster. Specifically, we have the following formula to calculate the 𝒲2\mathcal{W}_{2} distance between user ii and item jj (Givens and Shortt, 1984):

(2) 𝒲2(𝒩(μi(U),𝚺i(U)),𝒩(μj(I),𝚺j(I)))2\displaystyle\mathcal{W}_{2}\left(\mathcal{N}(\mathbf{\mu}^{(U)}_{i},\mathbf{\Sigma}^{(U)}_{i}),\,\mathcal{N}(\mathbf{\mu}^{(I)}_{j},\mathbf{\Sigma}^{(I)}_{j})\right)^{2}
=μi(U)μj(I)22+trace(𝚺i(U)+𝚺j(I)2((𝚺i(U))12𝚺j(I)(𝚺i(U))12)12).\displaystyle=||\mathbf{\mu}^{(U)}_{i}-\mathbf{\mu}^{(I)}_{j}||_{2}^{2}+\mathrm{trace}\bigg{(}\mathbf{\Sigma}^{(U)}_{i}+\mathbf{\Sigma}^{(I)}_{j}-2\Big{(}(\mathbf{\Sigma}^{(U)}_{i})^{\frac{1}{2}}\mathbf{\Sigma}^{(I)}_{j}(\mathbf{\Sigma}^{(U)}_{i})^{\frac{1}{2}}\Big{)}^{\frac{1}{2}}\bigg{)}\,.

In our setting, we focus on diagonal covariance matrices, thus 𝚺i(U)𝚺j(I)=𝚺j(I)𝚺i(U)\mathbf{\Sigma}^{(U)}_{i}\mathbf{\Sigma}^{(I)}_{j}=\mathbf{\Sigma}^{(I)}_{j}\mathbf{\Sigma}^{(U)}_{i}. For simplicity, we use 𝒲2(i,j)2\mathcal{W}_{2}(i,j)^{2} to denote the left hand side of Eq. 2. Then Eq. 2 can be simplified as:

(3) 𝒲2(i,j)2=μi(U)μj(I)22+(𝚺i(U))12(𝚺j(I))1222.\mathcal{W}_{2}(i,j)^{2}=||\mathbf{\mu}^{(U)}_{i}-\mathbf{\mu}^{(I)}_{j}||_{2}^{2}+||(\mathbf{\Sigma}^{(U)}_{i})^{\frac{1}{2}}-(\mathbf{\Sigma}^{(I)}_{j})^{\frac{1}{2}}||^{2}_{2}\,.

According to the above equation, the time complexity of calculating 𝒲2\mathcal{W}_{2} distance between the latent representations of users and items is linear with the embedding dimension.

4.2. Adaptive Margin in Margin Ranking Loss

To learn the distance-based model, most of the existing works (Hsieh et al., 2017; Tay et al., 2018) apply the margin ranking loss to measure the user preference difference between positive items and negative items. Specifically, the margin ranking loss makes sure the distance between a user and a positive item is less than the distance between the user and a negative item by a fixed margin m>0m>0. The loss function is:

(4) Fix(i,j,k;Θ)=[d(i,j;Θ)2d(i,k;Θ)2+m]+,\mathcal{L}_{Fix}(i,j,k;\Theta)=[d(i,j;\Theta)^{2}-d(i,k;\Theta)^{2}+m]_{+}\,,

where j𝒮ij\in\mathcal{S}_{i} is an item that user ii has accessed, and k𝒮ik\not\in\mathcal{S}_{i} is a randomly sampled item treated as the negative example, and [z]+=max(z,0)[z]_{+}=\max(z,0). Thus, (i,j,k)(i,j,k) represents a training triplet.

The safe margin mm in the margin ranking loss is a crucial hyper-parameter that has a major impact on the model performance. A fixed margin value may not achieve satisfactory performance. First, using a fixed value does not allow for adaptation to distinguish the properties of the training triplets. For example, some users have broad interests, so the margins for these users should not be so large as to make potential preferred items too far from the user. Other users have very focused interests, and it is desirable to have a larger margin to avoid recommending items that are not directly within the focus. Second, in different training phases, the model may need different magnitudes of margins. For instance, in the early stage of training, the model is not reliable enough to make strong predictions on user preferences, and thus imposing a large margin risk pushing potentially positive items too far from a user. Third, to achieve satisfactory performance, the selection of a fixed margin involves tedious hyper-parameter tuning. Based on these considerations, we conclude that setting a fixed margin value for all training triplets may limit the model expressiveness.

To address the problems outlined above, we propose an adaptive margin generation scheme which generates margins according to the training triplets. Formally, we formulate the margin ranking loss with an adaptive margin as:

(5) Ada(i,j,k;Θ,Φ)=[d(i,j;Θ)2d(i,k;Θ)2+f(i,j,k;Φ)]+.\mathcal{L}_{Ada}(i,j,k;\Theta,\Phi)=[d(i,j;\Theta)^{2}-d(i,k;\Theta)^{2}+f(i,j,k;\Phi)]_{+}\,.

Here f(i,j,k;Φ)f(i,j,k;\Phi) is a function that generates the specific margin based on the corresponding user and item embeddings and Φ\Phi is the learnable set of parameters associated with f()f(\cdot). Then we could consider optimizing Θ\Theta and Φ\Phi simultaneously:

(6) Θ=argminΘ,Φij𝒮ik𝒮iAda(i,j,k;Θ,Φ).\Theta^{*}=\underset{\Theta,\Phi}{\mathrm{argmin}}\>\sum_{i}\sum_{j\in\mathcal{S}_{i}}\sum_{k\not\in\mathcal{S}_{i}}\mathcal{L}_{Ada}(i,j,k;\Theta,\Phi)\,.

Unfortunately, directly minimizing the objective function as in Eq. 6 does not achieve the desired purpose of generating suitable adaptive margins. Since the margin-related term explicitly appears in the loss function, constantly decreasing the value of the generated margin is the straightforward way to reduce the loss. As a result all generated margins have very small values or are set to zero, leading to unsatisfactory results. In other words, the direct optimization of Ada\mathcal{L}_{Ada} with respect to Φ\Phi harms the optimization of Θ\Theta.

4.2.1. Bilevel Optimization

We model the learning of recommendation models and the generation of adaptive margins as a bilevel optimization problem (Colson et al., 2007):

(7) minΦ𝒥outer(Θ(Φ)):=ij𝒮ik𝒮iFix(i,j,k;Θ(Φ))\displaystyle\underset{\Phi}{\mathrm{min}}\>\mathcal{J}_{outer}\left(\Theta^{*}(\Phi)\right):=\sum_{i}\sum_{j\in\mathcal{S}_{i}}\sum_{k\not\in\mathcal{S}_{i}}\mathcal{L}_{Fix}\left(i,j,k;\Theta^{*}(\Phi)\right)
s.t.Θ(Φ)=argminΘ𝒥inner(Θ,Φ):=ij𝒮ik𝒮iAda(i,j,k;Θ,Φ).\displaystyle\mathrm{s.t.}\>\Theta^{*}(\Phi)=\underset{\Theta}{\mathrm{argmin}}\>\mathcal{J}_{inner}(\Theta,\Phi):=\sum_{i}\sum_{j\in\mathcal{S}_{i}}\sum_{k\not\in\mathcal{S}_{i}}\mathcal{L}_{Ada}(i,j,k;\Theta,\Phi)\,.

Here Θ\Theta contains the model parameters μ\mathbf{\mu} and 𝚺\mathbf{\Sigma}. The objective function 𝒥inner\mathcal{J}_{inner} attempts to minimize Ada\mathcal{L}_{Ada} with respect to Θ\Theta while the objective function 𝒥outer\mathcal{J}_{outer} optimizes Fix\mathcal{L}_{Fix} with respect to Φ\Phi through Θ(Φ)\Theta^{*}(\Phi). For simplicity, the mm of Fix\mathcal{L}_{Fix} in 𝒥outer\mathcal{J}_{outer} is set to 11 for guiding the learning of f(;Φ)f(\cdot;\Phi). Thus, we can have an alternating optimization to learn Θ\Theta and Φ\Phi:

  • Θ\Theta update phase (Inner Optimization): Fix Φ\Phi and optimize Θ\Theta.

  • Φ\Phi update phase (Outer Optimization): Fix Θ\Theta and optimize Φ\Phi.

4.2.2. Approximate Gradient Optimization

As most existing models utilize gradient-based methods for optimization, a simple approximation strategy with less computation is introduced as follows:

(8) Φ𝒥outer(Θ(Φ))Φ𝒥outer(ΘαΘ𝒥inner(Θ,Φ)).\displaystyle\nabla_{\Phi}\mathcal{J}_{outer}\left(\Theta^{*}(\Phi)\right)\approx\nabla_{\Phi}\mathcal{J}_{outer}\left(\Theta-\alpha\nabla_{\Theta}\mathcal{J}_{inner}(\Theta,\Phi)\right)\,.

In this expression, Θ\Theta denotes the current parameters including μ\mathbf{\mu} and 𝚺\mathbf{\Sigma}, and α\alpha is the learning rate for one step of inner optimization. Related approximations have been validated in (Rendle, 2012; Liu et al., 2019). Thus, we can define a proxy function to link Φ\Phi with the outer optimization:

(9) Θ~(Φ):=ΘαΘ𝒥inner(Θ,Φ).\tilde{\Theta}(\Phi):=\Theta-\alpha\nabla_{\Theta}\mathcal{J}_{inner}(\Theta,\Phi)\,.

For simplicity, we use two optimizers OPTΘ\operatorname{OPT}_{\Theta} and OPTΦ\operatorname{OPT}_{\Phi} to update Θ\Theta and Φ\Phi, respectively. The iterative procedure is shown in Alg. 1.

Initialize optimizers OPTΘ\operatorname{OPT}_{\Theta} and OPTΦ\operatorname{OPT}_{\Phi} ;
while not converged do
       Θ\Theta Update (fix Φt\Phi^{t}):
       Θt+1OPTΘ(Θt,Θt𝒥inner(Θt,Φt))\Theta^{t+1}\longleftarrow\operatorname{OPT}_{\Theta}\left(\Theta^{t},\nabla_{\Theta^{t}}\mathcal{J}_{inner}(\Theta^{t},\Phi^{t})\right) ;
       Proxy:
       Θ~t+1(Φt):=ΘtαΘt𝒥inner(Θt,Φt)\tilde{\Theta}^{t+1}(\Phi^{t}):=\Theta^{t}-\alpha\nabla_{\Theta^{t}}\mathcal{J}_{inner}(\Theta^{t},\Phi^{t}) ;
       Φ\Phi Update (fix Θt\Theta^{t}):
       Φt+1OPTΦ(Φt,Φt𝒥outer(Θ~t+1(Φt)))\Phi^{t+1}\longleftarrow\operatorname{OPT}_{\Phi}\left(\Phi^{t},\nabla_{\Phi^{t}}\mathcal{J}_{outer}\big{(}\tilde{\Theta}^{t+1}(\Phi^{t})\big{)}\right) ;
      
end while
Algorithm 1 Iterative Optimization Procedure

4.2.3. The design of f(;Φ)f(\cdot;\Phi)

We parameterize f(i,j,k;Φ)f(i,j,k;\Phi) with a neural network to generate the margin based on (i,j,k)(i,j,k):

(10) 𝐳ijk\displaystyle\mathbf{z}_{ijk} =tanh(𝐖1𝐬ijk+𝐛1),\displaystyle=\mathrm{tanh}(\mathbf{W}_{1}\cdot\mathbf{s}_{ijk}+\mathbf{b}_{1})\,,
mijk\displaystyle m_{ijk} =softplus(𝐖2𝐳ijk+𝐛2).\displaystyle=\mathrm{softplus}(\mathbf{W}_{2}\cdot\mathbf{z}_{ijk}+\mathbf{b}_{2})\,.

Here 𝐖\mathbf{W}_{*} and 𝐛\mathbf{b}_{*} are learnable parameters in f(;Φ)f(\cdot;\Phi), 𝐬ijk\mathbf{s}_{ijk} is the input to generate the margin, and mijkm_{ijk}\in\mathbb{R} is the generated margin of (i,j,k)(i,j,k). The activation function softplus guarantees mijk>0m_{ijk}>0. To promote a discrimative 𝐬ijk\mathbf{s}_{ijk} that reflects the relation between (i,j,k)(i,j,k) and mijkm_{ijk}, the following form can be a fine-grained indicator:

(11) χ(𝐮i,𝐯j)\displaystyle\chi(\mathbf{u}_{i},\mathbf{v}_{j}) =[(ui,1vj,1)2,(ui,2vj,2)2,,(ui,hvj,h)2]\displaystyle=[(u_{i,1}-v_{j,1})^{2},(u_{i,2}-v_{j,2})^{2},...,(u_{i,h}-v_{j,h})^{2}]^{\top}
𝐬ijk\displaystyle\mathbf{s}_{ijk} =[χ(𝐮i,𝐯j);χ(𝐮i,𝐯k);χ(𝐮i,𝐯k)χ(𝐮i,𝐯j)].\displaystyle=[\chi(\mathbf{u}_{i},\mathbf{v}_{j});\,\chi(\mathbf{u}_{i},\mathbf{v}_{k});\,\chi(\mathbf{u}_{i},\mathbf{v}_{k})\ominus\chi(\mathbf{u}_{i},\mathbf{v}_{j})]\,.

Here χ(𝐮i,𝐯j)h\chi(\mathbf{u}_{i},\mathbf{v}_{j})\in\mathbb{R}^{h} is introduced to mimic the calculation of Euclidean distance without summing over all dimensions. \ominus denotes element-wise subtraction and [;][...;...] denotes the concatenation operation. To improve the robustness of f(;Φ)f(\cdot;\Phi), we take as inputs the sampled embeddings 𝐮i\mathbf{u}_{i} and 𝐯j\mathbf{v}_{j}. To perform backpropagation from 𝐮i\mathbf{u}_{i} and 𝐯j\mathbf{v}_{j}, we adopt the reparameterization trick (Kingma and Welling, 2014) for Eq. 1:

(12) 𝐮i\displaystyle\mathbf{u}_{i} =μi(U)+(𝚺i(U))12ϵ1,\displaystyle=\mathbf{\mu}_{i}^{(U)}+(\mathbf{\Sigma}_{i}^{(U)})^{\frac{1}{2}}\odot\mathbf{\epsilon}_{1}\,,
𝐯j\displaystyle\mathbf{v}_{j} =μj(I)+(𝚺j(I))12ϵ2,\displaystyle=\mathbf{\mu}_{j}^{(I)}+(\mathbf{\Sigma}_{j}^{(I)})^{\frac{1}{2}}\odot\mathbf{\epsilon}_{2}\,,

where ϵ1,ϵ2𝒩(𝟎,𝟏)\mathbf{\epsilon}_{1},\mathbf{\epsilon}_{2}\sim\mathcal{N}(\mathbf{0},\mathbf{1}) and \odot is element-wise muliplication.

4.3. User-User and Item-Item Relations

It is important to model the relationships between pairs of users or pairs of items when developing recommender systems and strategies for doing so effectively have been studied for many years (Sarwar et al., 2001; Kabbur et al., 2013; Ning et al., 2015). For example, item-based collaborative filtering methods use item rating vectors to calculate the similarities between the items. Closely-related users or items may share the same interests or have similar attributes. For a certain user, items similar to the user’s preferred items are potential recommendation candidates.

Despite this intuition, previous distance-based recommendation methods (Hsieh et al., 2017; Tay et al., 2018) do not explicitly take the user-user or item-item relationships into consideration. As a result of relying primarily on user-item information, the systems may fail to generate appropriate user-user or item-item distances. To model the relationships between similar users or items, we employ two ranking margin losses with adaptive margins to encourage similar users or items to be mapped closer together in the latent space. Formally, the similarities between users or items are calculated from the user implicit feedback, which can be represented by a binary user-item interaction matrix. We set a threshold on the calculated similarities to identify the similar users and items for a specific user ii and item jj, respectively, denoted as 𝒩iU\mathcal{N}^{U}_{i} and 𝒩jI\mathcal{N}^{I}_{j}. We adopt the following losses for user pairs and item pairs, respectively:

(15) {𝒥outerUU:=ip𝒩iUq𝒩iUFix(i,p,q;Θ~UUt+1),𝒥innerUU:=ip𝒩iUq𝒩iUAda(i,p,q;Θt,ΦUUt),\displaystyle\left\{\begin{array}[]{rl}\mathcal{J}_{outer}^{U-U}&:=\sum_{i}\sum_{p\in\mathcal{N}_{i}^{U}}\sum_{q\not\in\mathcal{N}_{i}^{U}}\mathcal{L}_{Fix}(i,p,q;\tilde{\Theta}^{t+1}_{U-U})\,,\\ \mathcal{J}_{inner}^{U-U}&:=\sum_{i}\sum_{p\in\mathcal{N}_{i}^{U}}\sum_{q\not\in\mathcal{N}_{i}^{U}}\mathcal{L}_{Ada}(i,p,q;\Theta^{t},\Phi^{t}_{U-U})\,,\end{array}\right.
(18) {𝒥outerII:=jp𝒩jIq𝒩jIFix(j,p,q;Θ~IIt+1),𝒥innerII:=jp𝒩jIq𝒩jIAda(j,p,q;Θt,ΦIIt),\displaystyle\left\{\begin{array}[]{rl}\mathcal{J}_{outer}^{I-I}&:=\sum_{j}\sum_{p\in\mathcal{N}_{j}^{I}}\sum_{q\not\in\mathcal{N}_{j}^{I}}\mathcal{L}_{Fix}(j,p,q;\tilde{\Theta}^{t+1}_{I-I})\,,\\ \mathcal{J}_{inner}^{I-I}&:=\sum_{j}\sum_{p\in\mathcal{N}_{j}^{I}}\sum_{q\not\in\mathcal{N}_{j}^{I}}\mathcal{L}_{Ada}(j,p,q;\Theta^{t},\Phi^{t}_{I-I})\,,\end{array}\right.

where qq is a randomly sampled user in Eq. 15 and a randomly sampled item in Eq. 18. UUU-U denotes the user-user relation and III-I denotes the item-item relation. We use ΦUUt\Phi^{t}_{U-U} and ΦIIt\Phi^{t}_{I-I} to update ΘUUt+1\Theta^{t+1}_{U-U} and ΘIIt+1\Theta^{t+1}_{I-I}, respectively, which are the same as in Alg. 1. We denote the indicator in Eq. 11 as 𝐬ijkUI\mathbf{s}_{ijk}^{U-I}, then we generate 𝐬ijqUU\mathbf{s}_{ijq}^{U-U} and 𝐬ijqII\mathbf{s}_{ijq}^{I-I} following the procedure described by Eq. 11.

4.4. Model Training

Let us denote the losses 𝒥innerUI\mathcal{J}_{inner}^{U-I} and 𝒥outerUI\mathcal{J}_{outer}^{U-I} to capture the interactions between users and items. Then we combine the loss functions presented in Section 4.3 to optimize the proposed model:

(19) 𝒥inner=𝒥innerUI+𝒥innerUU+𝒥innerII,\displaystyle\mathcal{J}_{inner}=\mathcal{J}_{inner}^{U-I}+\mathcal{J}_{inner}^{U-U}+\mathcal{J}_{inner}^{I-I}\,,
𝒥outer=𝒥outerUI+𝒥outerUU+𝒥outerII+λΦF2,\displaystyle\mathcal{J}_{outer}=\mathcal{J}_{outer}^{U-I}+\mathcal{J}_{outer}^{U-U}+\mathcal{J}_{outer}^{I-I}+\lambda||\Phi||_{F}^{2}\,,

where λ\lambda is a regularization parameter. We follow the same training scheme of Section 4.2 to train Eq. 19. To mitigate the curse of dimensionality issue (Bordes et al., 2013) and prevent overfitting, we bound all the user/item embeddings within a unit sphere after each mini-batch training: μ1||\mathbf{\mu}||\leqslant 1 and 𝚺1||\mathbf{\Sigma}||\leqslant 1. When minimizing the objective function, the partial derivatives with respect to all the parameters can be computed by gradient descent with back-propagation.

Recommendation Phase. In the testing phase, for a certain user ii, we compute the distance 𝒲2(i,j)2\mathcal{W}_{2}(i,j)^{2} between user ii and each item jj in the dataset. Then the items that are not in the training set and have the shortest distances are recommended to user ii.

5. Experiments

In this section, we evaluate the proposed model, comparing with the state-of-the-art methods on five real-world datasets.

5.1. Datasets

The proposed model is evaluated on five real-world datasets from various domains with different sparsities: Books, Electronics and CDs (He and McAuley, 2016), Comics (Wan and McAuley, 2018) and Gowalla (Cho et al., 2011). The Books, Electronics and CDs datasets are adopted from the Amazon review dataset with different categories, i.e., books, electronics and CDs. These datasets include a significant amount of user-item interaction data, e.g., user ratings and reviews. The Comics dataset was collected in late 2017 from the GoodReads website with different genres, and we use the genres of comics. The Gowalla dataset was collected worldwide from the Gowalla website (a location-based social networking website) over the period from February 2009 to October 2010. In order to be consistent with the implicit feedback setting, we retain any ratings no less than four (out of five) as positive feedback and treat all other ratings as missing entries for all datasets. To filter noisy data, we only include users with at least ten ratings and items with at least five ratings. Table 1 shows the data statistics.

We employ five-fold cross-validation to evaluate the proposed model. For each user, the items she accessed are randomly split into five folds. We pick one fold each time as the ground truth for testing, and the remaining four folds constitute the training set. The average results over the five folds are reported.

Table 1. The statistics of the datasets.
Dataset #Users #Items #Interactions Density
Books 77,754 66,963 2,517,343 0.048%
Electronics 40,358 28,147 524,906 0.046%
CDs 24,934 24,634 478,048 0.079%
Comics 37,633 39,623 2,504,498 0.168%
Gowalla 64,404 72,871 1,237,869 0.034%

5.2. Evaluation Metrics

We evaluate all models in terms of Recall@k and NDCG@k. For each user, Recall@k (R@k) indicates the percentage of her rated items that appear in the top kk recommended items. NDCG@k (N@k) is the normalized discounted cumulative gain at kk, which takes the position of correctly recommended items into account.

5.3. Methods Studied

To demonstrate the effectiveness of our model, we compare to the following recommendation methods.
Classical methods for implicit feedback:

  • BPRMF, Bayesian Personalized Ranking-based Matrix Factorization (Rendle et al., 2009), which is a classic method for learning pairwise personalized rankings from user implicit feedback.

Classical neural-based recommendation methods:

  • NCF, Neural Collaborative Filtering (He et al., 2017b), which combines the matrix factorization (MF) model with a multi-layer perceptron (MLP) to learn the user-item interaction function.

  • DeepAE, the deep autoencoder (Ma et al., 2018), which utilizes a three-hidden-layer autoencoder with a weighted loss function.

State-of-the-art distance-based recommendation methods:

  • CML, Collaborative Metric Learning (Hsieh et al., 2017), which learns a metric space to encode the user-item interactions and to implicitly capture the user-user and item-item similarities.

  • LRML, Latent Relational Metric Learning (Tay et al., 2018), which exploits an attention-based memory-augmented neural architecture to model the relationships between users and items.

  • TransCF, Collaborative Translational Metric Learning (Park et al., 2018), which employs the neighborhood of users and items to construct translation vectors capturing the intensity of user–item relations.

  • SML, Symmetric Metric Learning with adaptive margin (Li et al., 2020), which measures the trilateral relationship from both the user- and item-centric perspectives and learns adaptive margins.

The proposed method:

  • PMLAM, the proposed model, which represents each user and item as Gaussian distributions to capture the uncertainties in user preferences and item properties, and incorporates an adaptive margin generation mechanism to generate the margins based on the sampled user-item triplets.

5.4. Experiment Settings

In the experiments, the latent dimension of all the models is set to 5050 for a fair comparison. All the models adopt the same negative sampling strategy with the proposed model, unless otherwise specified. For BPRMF, the learning rate is set to 0.0010.001 and the regularization parameter is set to 0.0010.001. With these parameters, the model can achieve good results. For NCF, we follow the same model structure as in the original paper (He et al., 2017b). The learning rate is set to 0.0010.001 and the batch size is set to 10241024. For DeepAE, we adopt the same model structure employed in the author-provided code and set the batch size to 512512. The weight of the positive items is selected from {5,10,15,20}\{5,10,15,20\} by a grid search and the weights of all other items are set to 11 as recommended in (Hu et al., 2008). For CML, we use the authors’ implementation to set the margin to m=1m=1 and the regularization parameter to λc=1\lambda_{c}=1. For LRML, the learning rate is set to 0.0010.001, and the number of memories is selected from {5,10,20,25,50,100}\{5,10,20,25,50,100\} by a grid search. For TransCF, we follow the settings in the original paper to select λ,λnbr,λdist{0, 0.001, 0.01, 0.1}\lambda,\,\lambda_{nbr},\,\lambda_{dist}\in\{0,\,0.001,\,0.01,\,0.1\} and set the margin to 11 and batch size to 10001000, respectively. For SML, we follow the author’s code to set the user and item margin bound ll to 1.01.0, λ\lambda to 0.010.01 and γ\gamma to 1010, respectively.

For our model, both the learning rate and λ\lambda are set to 0.0010.001. For the ElectronicsElectronics and CDsCDs datasets, we randomly sample 1010 unobserved users or items as negative samples for each user and positive item. This number is reduced to 22 for the other datasets to speed up the training process. The batch size is set to 50005000 for all datasets. The dimension hh is set to 5050. The user and item embeddings are initialized by drawing each vector element independently from a zero-mean Gaussian distribution with a standard deviation of 0.010.01. Our experiments are conducted with PyTorch running on GPU machines (Nvidia Tesla P100).

Table 2. The performance comparison of all methods in terms of Recall@10 and NDCG@10. The best performing method is boldfaced. The underlined number is the second best performing method. *, **, *** indicate the statistical significance for p<=0.05p<=0.05, p<=0.01p<=0.01, and p<=0.001p<=0.001, respectively, compared to the best baseline method based on the paired t-test. Improv. denotes the improvement of our model over the best baseline method.
BPRMF NCF DeepAE CML LRML TransCF SML PMLAM Improv.
Recall@10
Books 0.0553 0.0568 0.0817 0.0730 0.0565 0.0754 0.0581 0.0885** 8.32%
Electronics 0.0243 0.0277 0.0253 0.0395 0.0299 0.0353 0.0279 0.0469*** 18.73%
CDs 0.0730 0.0759 0.0736 0.0922 0.0822 0.0851 0.0793 0.1129*** 22.45%
Comics 0.1966 0.2092 0.2324 0.1934 0.1795 0.1967 0.1713 0.2417 4.00%
Gowalla 0.0888 0.0895 0.1113 0.0840 0.0935 0.0824 0.0894 0.1331*** 19.58%
NDCG@10
Books 0.0391 0.0404 0.0590 0.0519 0.0383 0.0542 0.0415 0.0671** 13.72%
Electronics 0.0111 0.0125 0.0134 0.0178 0.0117 0.0148 0.0105 0.0234*** 31.46%
CDs 0.0383 0.0402 0.0411 0.0502 0.0420 0.0461 0.0423 0.0619*** 23.30%
Comics 0.2247 0.2395 0.2595 0.2239 0.1922 0.2341 0.1834 0.2753* 6.08%
Gowalla 0.0806 0.0822 0.0944 0.0611 0.0670 0.0611 0.0823 0.0984* 4.23%

5.5. Implementation Details

To speed up the training process, we implement a two-phase sampling strategy. We sample a number of candidates, e.g., 500, of negative samples for each user every 20 epochs to form a candidate set. During the next 20 epochs, the negative samples of each user are sampled from her candidate set. This strategy can be implemented using multiple processes to further reduce the training time.

Since none of the processed datasets has inherent user-user/item-item information, we treat the user-item interaction as a user-item matrix and compute the cosine similarity for the user and item pairs, respectively (Sarwar et al., 2001). We set a threshold, e.g., 0.20.2 on Amazon and Gowalla datasets and 0.40.4 on the Comics dataset, to select the neighbors. These thresholds are chosen to ensure a reasonable degree of connectivity in the constructed graphs.

Refer to caption
(a) Recall@k on Books
Refer to caption
(b) NDCG@k on Books
Refer to caption
(c) Recall@k on Electronics
Refer to caption
(d) NDCG@k on Electronics
Refer to caption
(e) Recall@k on CDs
Refer to caption
(f) NDCG@k on CDs
Refer to caption
(g) Recall@k on Comics
Refer to caption
(h) NDCG@k on Comics
Refer to caption
(i) Recall@k on Gowalla
Refer to caption
(j) NDCG@k on Gowalla
Refer to caption
Figure 3. The performance comparison on all datasets.

5.6. Performance Comparison

The performance comparison is shown in Figure 3 and Table 2. Based on these results, we have several observations.

Observations about our model. First, the proposed model, PMLAM, achieves the best performance on all five datasets with both evaluation metrics, which illustrates the superiority of our model.
Second, PMLAM outperforms SML. Although SML has an adaptive margin mechanism, it is achieved by having a learnable scalar margin for each user and item and adding a regularization term to prevent the learned margins from being too small. It can be challenging to identify an appropriate regularization weight via hyperparameter tuning. By contrast, PMLAM formulates the adaptive margin generation as a bilevel optimization, avoiding the additional regularization. PMLAM employs a neural network to generate the adaptive margin, so the number of parameters related to margin generation does not increase with the number of users or items.
Third, PMLAM achieves better performance than TransCF. One major reason is that TransCF only considers the items rated by a user and the users who rated an item as the neighbors of the user and item, respectively, which neglects the user-user/item-item relations. PMLAM models the user-user/item-item relations by two margin ranking losses with adaptive margins.
Fourth, PMLAM makes better recommendations than CML and LRML. These methods apply a fixed margin for all user-item triplets and do not measure or model the uncertainty of learned user/item embeddings. PMLAM represents each user and item as a Gaussian distribution, where the uncertainties of learned user preferences and item properties are captured by the covariance matrices.
Fifth, PMLAM outperforms NCF and DeepAE. These are MLP-based recommendation methods with the ability to capture non-linear user-item relationships, but they violate the triangle inequality when modeling user-item interaction. As a result, they can struggle to capture the fine-grained user preference for particular items (Hsieh et al., 2017).

Other observations. First, all of the results reported for the Comics dataset are considerably better than those for the other datasets. The other four datasets are sparser and data sparsity negatively impacts recommendation performance.
Second, CML, LRML and TransCF perform better than SML on most of the datasets. The adaptive margin regularization term in SML struggles to adequately counterbalance SML’s tendency to reduce the loss by imposing small margins. Although it is reported that SML outperforms CML, LRML and TransCF in (Li et al., 2020), the experiments are conducted on three relatively small-scale datasets with only several thousands of users and items. We experiment with much larger datasets; identifying a successful regularization setting appears to be more difficult as the number of users increases.
Third, TransCF outperforms LRML on most of the datasets. One possible reason is that TransCF has a more effective translation embedding learning mechanism, which incorporates the neighborhood information of users and items. TransCF also has a regularization term to further pull positive items closer to the anchor user.
Fourth, CML achieves better performance than LRML on most of the datasets. CML integrates the weighted approximate-rank pairwise (WARP) weighting scheme (Weston et al., 2010) in the loss function to penalize lower-ranked positive items. The comparison between CML and LRML in (Tay et al., 2018) removes this component of CML. The WARP scheme appears to play an important role in improving CML’s performance.
Fifth, DeepAE outperforms NCF. The heuristic weighting function of DeepAE can impose useful penalties to errors that occur during training when positive items are assigned lower prediction scores.

Table 3. The ablation analysis on the CDs and Electronics datasets. cat denotes the concatenation operation and add denotes the addition operation.
Architecture CDs Electronics      \bigstrut
R@10 N@10 R@10 N@10 \bigstrut
(1) FixUI{Fix}^{U-I} + Deter_Emb 0.0721 0.0371 0.0241 0.0090
(2) FixUI{Fix}^{U-I} + Gauss_Emb 0.0815 0.0434 0.0296 0.0110
(3) AdaUI{Ada}^{U-I} + Deter_Emb 0.0777 0.0415 0.0338 0.0125
(4) AdaUI{Ada}^{U-I}-catcat + Deter_Emb 0.0408 0.0204 0.0139 0.0055
(5) AdaUI{Ada}^{U-I}-addadd + Deter_Emb 0.0311 0.0158 0.0050 0.0018
(6) AdaUI{Ada}^{U-I} + Gauss_Emb 0.0856 0.0454 0.0365 0.0155
(7) AdaUI{Ada}^{U-I} + FixUU{Fix}^{U-U} + FixII{Fix}^{I-I} 0.0966 0.0526 0.0429 0.0189
(8) PMLAM 0.1129 0.0619 0.0469 0.0234

5.7. Ablation Analysis

To verify and assess the relative effectiveness of the proposed user-item interaction module, the adaptive margin generation module, and the user-user/item-item relation module, we conduct an ablation study. Table 3 reports the performance improvement achieved by each module of the proposed model. Note that we compute Euclidean distances between deterministic embeddings. In (1), which serves as a baseline, we use the hinge loss with a fixed margin (Eq. 4) on deterministic embeddings of users and items to capture the user-item interaction (mm is set to 11 which is commonly used in (Hsieh et al., 2017; Tay et al., 2018; Park et al., 2018)). In (2), as an alternative baseline, we apply the same hinge loss as in (1), but replace the deterministic embeddings with parameterized Gaussian distributions (Section 4.1). In (3), we use the adaptive margin generation module (Section 4.2) to generate the margins for deterministic embeddings. In (4), we concatenate the deterministic embeddings of (i,j,k)(i,j,k) to generate 𝐬ijk\mathbf{s}_{ijk} instead of using Eq. 11. In (5), we sum the deterministic embeddings of (i,j,k)(i,j,k) to generate 𝐬ijk\mathbf{s}_{ijk} instead of using Eq. 11. In (6), we combine (2) and (3) to generate the adaptive margins for Gaussian embeddings. In (7), we augment (6) with user-user/item-item modeling but with a fixed margin, where the margin is also set to 11. In (8), we add the user-user/item-item modeling with adaptive margins (Section 4.3) to replace the fixed margins in the configuration of (7).

From the results in Table 3, we have several observations. First, from (1) and (2), we observe that by representing the user and item as Gaussian distributions and computing the distance between Gaussian distributions, the performance improves. This suggests that measuring the uncertainties of learned embeddings is significant. Second, from (1) and (3) along with (2) and (6), we observe that incorporating the adaptive margin generation module improves performance, irrespective of whether deterministic or Gaussian embeddings are used. These results demonstrate the effectiveness of the proposed margin generation module. Third, from (3), (4) and (5), we observe that our designed inputs (Eq. 11) for margin generation facilitate the production of appropriate margins compared to commonly used embedding concatenation or summation operations. Fourth, from (2), (3) and (6), we observe that (6) achieves better results than either (2) or (3), demonstrating that Gaussian embedddings and adaptive margin generation are compatible and can be combined to improve the model performance. Fifth, compared to (6), we observe that the inclusion of the user-user and item-item terms in the objective function (7) leads to a large improvement in recommendation performance. This demonstrates that explicit user-user/item-item modeling is essential and can be an effective supplement to infer user preferences. Sixth, from (7) and (8), we observe that adaptive margins also improve the modelling of the user-user/item-item relations.

Table 4. A case study of the generated margin of sampled training triplets. The movie genre label is from the dataset.
User Positive Sampled Movie Margin
405 Scream (Thriller) Four Rooms (Thriller) 1.2752
Toy Story (Animation) 12.8004
French Kiss (Comedy) Addicted to Love (Comedy) 2.6448
Batman (Action) 12.4607
66 Air Force One (Action) GoldenEye (Action) 0.3216
Crumb (Documentary) 5.0010
The Godfather (Crime) The Godfather II (Crime) 0.0067
Terminator (Sci-Fi) 3.6335

5.8. Case Study

In this section, we conduct case studies to confirm whether the adaptive margin generation can produce appropriate margins. To achieve this purpose, we train our model on the MovieLens-100K dataset. This dataset provides richer side information about movies (e.g., movie genres), making it easier for us to illustrate the results. Since we only focus on the adaptive margin generation, we use deterministic embeddings of users and items to avoid the interference of other modules. We randomly sample users from the dataset. For each user, we sample one item that the user has accessed as the positive item and two items the user did not access as negative items, where one item has a similar genre with the positive item and the other does not. The case study results are shown in Table 4.

As shown in Table 4, our adaptive margin generation module tends to generate a smaller margin value when the negative movie has a similar genre with the positive movie, while generating larger margins when they are distinct. The generated margins thus encourage the model to embed items with a higher probability of being preferred closer to the user’s embedding.

6. Conclusion

In this paper, we propose a distance-based recommendation model for top-K recommendation. Each user and item in our model are represented by Gaussian distributions with learnable parameters to handle the uncertainties. By incorporating an adaptive margin scheme, our model can generate fine-grained margins for the training triples during the training procedure. To explicitly capture the user-user/item-item relations, we adopt two margin ranking losses with adaptive margins to force similar user and item pairs to map closer together in the latent space. Experimental results on five real-world datasets validate the performance of our model, demonstrating improved performance compared to many state-of-the-art methods and highlighting the effectiveness of the Gaussian embeddings and the adaptive margin generation scheme. The code is available at https://github.com/huawei-noah/noah-research/tree/master/PMLAM.

References

  • (1)
  • Bordes et al. (2013) Antoine Bordes, Nicolas Usunier, Alberto García-Durán, Jason Weston, and Oksana Yakhnenko. 2013. Translating Embeddings for Modeling Multi-relational Data. In Advances in Neural Information Processing Systems.
  • Casella and Berger (2002) George Casella and Roger L Berger. 2002. Statistical inference. Duxbury Pacific Grove, CA.
  • Cho et al. (2011) Eunjoon Cho, Seth A. Myers, and Jure Leskovec. 2011. Friendship and mobility: user movement in location-based social networks. In Proc. ACM Conf. Knowledge Discovery and Data Mining.
  • Colson et al. (2007) Benoît Colson, Patrice Marcotte, and Gilles Savard. 2007. An overview of bilevel optimization. Annals OR (2007).
  • Covington et al. (2016) Paul Covington, Jay Adams, and Emre Sargin. 2016. Deep Neural Networks for YouTube Recommendations. In Proc. ACM Conf. Recommender Systems.
  • Givens and Shortt (1984) Clark R. Givens and Rae Michael Shortt. 1984. A class of Wasserstein metrics for probability distributions. The Michigan Mathematical Journal (1984).
  • Guo et al. (2017) Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, and Xiuqiang He. 2017. DeepFM: A Factorization-Machine based Neural Network for CTR Prediction. In Proc. Int. Joint Conf. Artificial Intelligence.
  • He et al. (2017a) Ruining He, Wang-Cheng Kang, and Julian McAuley. 2017a. Translation-based Recommendation. In Proc. ACM Conf. Recommender Systems.
  • He and McAuley (2016) Ruining He and Julian McAuley. 2016. Ups and Downs: Modeling the Visual Evolution of Fashion Trends with One-Class Collaborative Filtering. In Proc. Int. Conf. World Wide Web.
  • He et al. (2017b) Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. 2017b. Neural Collaborative Filtering. In Proc. Int. Conf. World Wide Web.
  • He et al. (2016) Xiangnan He, Hanwang Zhang, Min-Yen Kan, and Tat-Seng Chua. 2016. Fast Matrix Factorization for Online Recommendation with Implicit Feedback. In Proc. ACM Int. Conf. Research and Development in Information Retrieval.
  • Hsieh et al. (2017) Cheng-Kang Hsieh, Longqi Yang, Yin Cui, Tsung-Yi Lin, Serge J. Belongie, and Deborah Estrin. 2017. Collaborative Metric Learning. In Proc. Int. Conf. World Wide Web.
  • Hu et al. (2008) Yifan Hu, Yehuda Koren, and Chris Volinsky. 2008. Collaborative Filtering for Implicit Feedback Datasets. In Proc. IEEE Int. Conf. Data Mining.
  • Kabbur et al. (2013) Santosh Kabbur, Xia Ning, and George Karypis. 2013. FISM: factored item similarity models for top-N recommender systems. In Proc. ACM Conf. Knowledge Discovery and Data Mining.
  • Kingma and Welling (2014) Diederik P. Kingma and Max Welling. 2014. Auto-Encoding Variational Bayes. In Proc. Int. Conf. Learning Representations.
  • Koren et al. (2009) Yehuda Koren, Robert M. Bell, and Chris Volinsky. 2009. Matrix Factorization Techniques for Recommender Systems. IEEE Computer (2009).
  • Li et al. (2020) Mingming Li, Shuai Zhang, Fuqing Zhu, Wanhui Qian, Liangjun Zang, Jizhong Han, and Songlin Hu. 2020. Symmetric Metric Learning with Adaptive Margin for Recommendation. (2020).
  • Liu et al. (2019) Hanxiao Liu, Karen Simonyan, and Yiming Yang. 2019. DARTS: Differentiable Architecture Search. In Proc. Int. Conf. Learning Representations.
  • Ma et al. (2019a) Chen Ma, Peng Kang, and Xue Liu. 2019a. Hierarchical Gating Networks for Sequential Recommendation. In Proc. ACM Conf. Knowledge Discovery and Data Mining.
  • Ma et al. (2019b) Chen Ma, Peng Kang, Bin Wu, Qinglong Wang, and Xue Liu. 2019b. Gated Attentive-Autoencoder for Content-Aware Recommendation. In Proc. ACM Int. Conf. Web Search and Data Mining.
  • Ma et al. (2020) Chen Ma, Liheng Ma, Yingxue Zhang, Jianing Sun, Xue Liu, and Mark Coates. 2020. Memory Augmented Graph Neural Networks for Sequential Recommendation. In AAAI.
  • Ma et al. (2018) Chen Ma, Yingxue Zhang, Qinglong Wang, and Xue Liu. 2018. Point-of-Interest Recommendation: Exploiting Self-Attentive Autoencoders with Neighbor-Aware Influence. In Proc. ACM Int. Conf. Information and Knowledge Management.
  • Mallasto and Feragen (2017) Anton Mallasto and Aasa Feragen. 2017. Learning from uncertain curves: The 2-Wasserstein metric for Gaussian processes. In Advances in Neural Information Processing Systems.
  • Ning et al. (2015) Xia Ning, Christian Desrosiers, and George Karypis. 2015. A Comprehensive Survey of Neighborhood-Based Recommendation Methods. In Recommender Systems Handbook.
  • Pan et al. (2008) Rong Pan, Yunhong Zhou, Bin Cao, Nathan Nan Liu, Rajan M. Lukose, Martin Scholz, and Qiang Yang. 2008. One-Class Collaborative Filtering. In Proc. IEEE Int. Conf. Data Mining.
  • Park et al. (2018) Chanyoung Park, Donghyun Kim, Xing Xie, and Hwanjo Yu. 2018. Collaborative Translational Metric Learning. In Proc. IEEE Int. Conf. Data Mining.
  • Rendle (2012) Steffen Rendle. 2012. Learning recommender systems with adaptive regularization. In Proc. ACM Int. Conf. Web Search and Data Mining.
  • Rendle et al. (2009) Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. 2009. BPR: Bayesian Personalized Ranking from Implicit Feedback. In Proc. Conf. Uncertainty in Artificial Intelligence.
  • Salakhutdinov and Mnih (2007) Ruslan Salakhutdinov and Andriy Mnih. 2007. Probabilistic Matrix Factorization. In Advances in Neural Information Processing Systems.
  • Salakhutdinov et al. (2007) Ruslan Salakhutdinov, Andriy Mnih, and Geoffrey E. Hinton. 2007. Restricted Boltzmann machines for collaborative filtering. In Proc. Int. Conf. Machine Learning.
  • Sarwar et al. (2001) Badrul Munir Sarwar, George Karypis, Joseph A. Konstan, and John Riedl. 2001. Item-based collaborative filtering recommendation algorithms. In Proc. Int. Conf. World Wide Web.
  • Shrivastava and Li (2014) Anshumali Shrivastava and Ping Li. 2014. Asymmetric LSH (ALSH) for Sublinear Time Maximum Inner Product Search (MIPS). In Advances in Neural Information Processing Systems.
  • Sinha et al. (2018) Ankur Sinha, Pekka Malo, and Kalyanmoy Deb. 2018. A Review on Bilevel Optimization: From Classical to Evolutionary Approaches and Applications. IEEE Trans. Evolutionary Computation (2018).
  • Srivastava (2008) Sashi Mohan Srivastava. 2008. A course on Borel sets. Springer Science & Business Media.
  • Sukhbaatar et al. (2015) Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, and Rob Fergus. 2015. End-To-End Memory Networks. In Advances in Neural Information Processing Systems.
  • Sun et al. (2019) Jianing Sun, Yingxue Zhang, Chen Ma, Mark Coates, Huifeng Guo, Ruiming Tang, and Xiuqiang He. 2019. Multi-graph Convolution Collaborative Filtering. In Proc. IEEE Int. Conf. Data Mining.
  • Tay et al. (2018) Yi Tay, Luu Anh Tuan, and Siu Cheung Hui. 2018. Latent Relational Metric Learning via Memory-based Attention for Collaborative Ranking. In Proc. Int. Conf. World Wide Web.
  • Wan and McAuley (2018) Mengting Wan and Julian McAuley. 2018. Item recommendation on monotonic behavior chains. In Proc. ACM Conf. Recommender Systems.
  • Wang et al. (2015) Hao Wang, Naiyan Wang, and Dit-Yan Yeung. 2015. Collaborative Deep Learning for Recommender Systems. In Proc. ACM Conf. Knowledge Discovery and Data Mining.
  • Wang et al. (2019) Xiang Wang, Xiangnan He, Meng Wang, Fuli Feng, and Tat-Seng Chua. 2019. Neural Graph Collaborative Filtering. In Proc. ACM Int. Conf. Research and Development in Information Retrieval.
  • Weston et al. (2010) Jason Weston, Samy Bengio, and Nicolas Usunier. 2010. Large scale image annotation: learning to rank with joint word-image embeddings. Machine Learning (2010).
  • Wu et al. (2016) Yao Wu, Christopher DuBois, Alice X. Zheng, and Martin Ester. 2016. Collaborative Denoising Auto-Encoders for Top-N Recommender Systems. In Proc. ACM Int. Conf. Web Search and Data Mining.
  • Xie et al. (2019) Yujia Xie, Xiangfeng Wang, Ruijia Wang, and Hongyuan Zha. 2019. A Fast Proximal Point Method for Computing Exact Wasserstein Distance. In Proc. Conf. Uncertainty in Artificial Intelligence.
  • Xue et al. (2017) Hong-Jian Xue, Xinyu Dai, Jianbing Zhang, Shujian Huang, and Jiajun Chen. 2017. Deep Matrix Factorization Models for Recommender Systems. In Proc. Int. Joint Conf. Artificial Intelligence.