
CnGAN: Generative Adversarial Networks for Cross-network User Preference Generation for Non-overlapped Users

Dilruk Perera, School of Computing, National University of Singapore, [email protected], and Roger Zimmermann, School of Computing, National University of Singapore, [email protected]
(2019)
Abstract.

A major drawback of cross-network recommender solutions is that they can only be applied to users who are overlapped across networks. Thus, the non-overlapped users, who form the majority, are ignored. As a solution, we propose CnGAN, a novel multi-task learning based encoder-GAN-recommender architecture. The proposed model synthetically generates source network user preferences for non-overlapped users by learning the mapping from target to source network preference manifolds. The resultant user preferences are used in a Siamese network based neural recommender architecture. Furthermore, we propose a novel user-based pairwise loss function for recommendations using implicit interactions to better guide the generation process in the multi-task learning environment. We illustrate our solution by generating user preferences on the Twitter source network for recommendations on the YouTube target network. Extensive experiments show that the generated preferences can be used to improve recommendations for non-overlapped users. The resultant recommendations achieve superior performance compared to state-of-the-art cross-network recommender solutions in terms of accuracy, novelty and diversity.

Cross-network Recommendations; Generative Adversarial Networks; Collaborative Filtering; Deep learning; Implicit Feedback
journalyear: 2019; copyright: iw3c2w3
booktitle: Proceedings of the 2019 World Wide Web Conference (WWW '19), May 13–17, 2019, San Francisco, CA, USA
doi: 10.1145/3308558.3313733; isbn: 978-1-4503-6674-8/19/05
ccs: Information systems: Recommender systems; Personalization; Multimedia and multimodal retrieval

1. Introduction

Cross-network recommender systems utilize auxiliary information from multiple source networks to create comprehensive user profiles for recommendations on a target network. Therefore, unlike traditional recommender solutions, which are limited to information within a single network, cross-network solutions are more robust against cold-start and data sparsity issues (Mehta et al., 2005). For example, additional information from Twitter and Flickr increased recommender precision on Delicious by 10% (Abel et al., 2011). Similarly, transferring information from various source networks to target networks, such as Google+ to YouTube (Deng et al., 2013), Twitter to YouTube (Roy et al., 2012; Perera and Zimmermann, 2017) and Wikipedia to Twitter (Osborne et al., 2012), has consistently improved recommender accuracy. Furthermore, cross-network solutions allow user preferences to be captured from diverse perspectives, which increases the overall recommender quality in terms of diversity and novelty (Perera and Zimmermann, 2018, 2017). However, despite the growing success of cross-network recommender solutions, the majority of existing solutions can only be applied to users who exist in multiple networks (overlapped users). The remaining non-overlapped users, who form the majority, are unable to enjoy the benefits of cross-network solutions.

Deep learning and generative modeling techniques have been used successfully in the recommender systems domain in recent years. For example, restricted Boltzmann machines (Salakhutdinov et al., 2007), autoencoders (Ouyang et al., 2014), hidden Markov models (Sahoo et al., 2012) and Recurrent Neural Networks (RNNs) (Hidasi et al., 2015) have been applied to rating prediction tasks. Recent advancements in Generative Adversarial Networks (GANs) have also drawn recommender systems researchers to apply the minimax game framework to information retrieval. For example, IRGAN (Wang et al., 2017) and RecGAN (Bharadhwaj et al., 2018) were designed to predict relevant documents for users. Existing solutions use GANs to generate simple rating values (Wang et al., 2017; Bharadhwaj et al., 2018; Yoo et al., 2017) or synthetic items (e.g., images (Kang et al., 2017)). In contrast, we use a GAN to generate complex user data (i.e., user preferences on the source network).

We propose a novel GAN model (CnGAN) to generate cross-network user preferences for non-overlapped users. Unlike the standard GAN model, we view the learning process as a mapping from target to source network preference manifolds. The proposed model solves two main tasks. First, a generator task learns the corresponding mapping from target to source network user preferences and generates auxiliary preferences on the source network. Second, a recommender task uses the synthetically generated preferences to provide recommendations for users who only have interactions on the target network. Furthermore, we propose two novel loss functions to better guide the generator and recommender tasks. The proposed solution is used to provide recommendations for YouTube users by incorporating synthesized preferences on Twitter. However, the solution is generic and can be extended to incorporate multiple source networks.

In this paper, we first briefly provide preliminaries to the proposed model and present the proposed CnGAN solution detailing its two main components, the generator and recommender tasks. Then, we compare the proposed model against multiple baselines to demonstrate its effectiveness in terms of accuracy, novelty and diversity. We summarize our main contributions as follows:

  • To the best of our knowledge, this is the first attempt to apply a GAN based model to generate missing source network preferences for non-overlapped users.

  • We propose CnGAN, a novel GAN based model which includes a novel content loss function and user-based pairwise loss function for the generator and recommender tasks.

  • We carry out extensive experiments to demonstrate the effectiveness of CnGAN to conduct recommendations for non-overlapped users and improve the overall quality of recommendations compared to state-of-the-art methods.

2. Model

2.1. Model Preliminaries

2.1.1. Bayesian Personalized Ranking (BPR):

Learning user preferences from implicit feedback is challenging since implicit feedback does not indicate preference levels toward items; it only indicates the presence or absence of interactions with items. BPR (Rendle et al., 2009) was proposed to rank items based on implicit feedback using a generic optimization criterion. BPR is trained using pairwise training instances $S=\{(u,i,j) \mid u\in U \land i\in I_{u+} \land j\in I\setminus I_{u+}\}$, where $I$ is the set of all items and $I_{u+}$ is the set of items user $u$ has interacted with. BPR ranks interacted items higher than non-interacted items, and maximizes their margin using a pairwise loss as follows:

(1) $L_{BPR}(S)=\sum_{(u,i,j)\in S}-\ln\sigma(\hat{r}_{uij})+\lambda_{\Theta}\|\Theta\|^{2}$

where $\hat{r}_{uij}=\hat{r}_{ui}-\hat{r}_{uj}$, $\Theta$ are the model parameters, $\sigma(x)=e^{x}/(e^{x}+1)$ is the sigmoid function and $\lambda_{\Theta}$ are the regularization parameters.
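
As a concrete reference, the following is a minimal NumPy sketch of Equation (1) under an MF predictor $\hat{r}_{ui}=\langle w_{u},h_{i}\rangle$; the names `W`, `H`, `triplets` and `lam` are ours, not from the original implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bpr_loss(W, H, triplets, lam=0.01):
    """Eq. (1) under an MF predictor r_ui = <w_u, h_i>.

    W: (num_users, K) user factors; H: (num_items, K) item factors.
    triplets: iterable of (u, i, j) with i interacted and j not.
    """
    loss = 0.0
    for u, i, j in triplets:
        r_uij = W[u] @ (H[i] - H[j])     # r_ui - r_uj
        loss += -np.log(sigmoid(r_uij))
    return loss + lam * (np.sum(W ** 2) + np.sum(H ** 2))
```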

2.1.2. Feature extraction on a continuous space:

CnGAN is trained using the target network interactions of each user at each time interval and their corresponding source network interactions. Since CnGAN is a GAN based model and requires inputs in a continuous space, we used topic modeling to capture user interactions on a continuous topical space. Let $u_{o}\in U=[Tn_{u}^{t},Sn_{u}^{t}]$ denote an overlapped user at time $t$, where $Tn_{u}^{t}$ and $Sn_{u}^{t}$ are target and source network interactions spanning $T=[1,\dots,t]$ time intervals. A non-overlapped user $u_{no}\in U=[Tn_{u}^{t}]$ is denoted using target network interactions only. We used YouTube as the target and Twitter as the source network to conduct video recommendations on YouTube. Therefore, $Tn_{u}^{t}$ is the set of interacted videos (i.e., liked or added to playlists) and $Sn_{u}^{t}$ is the set of tweets. We assumed that each interaction (video or tweet) is associated with multiple topics, and extracted the topics from the textual data associated with each interaction (video titles, descriptions and tweet contents). We used Twitter-Latent Dirichlet Allocation (Twitter-LDA) (Zhao et al., 2011) for topic modeling since it is effective on short and noisy content. Based on topic modeling, each user $u$ on the target network is represented as a collection of topical distributions $tn_{u}=\{\boldsymbol{tn_{u}^{1}};\dots;\boldsymbol{tn_{u}^{t}}\}\in\mathbb{R}^{T\times K^{t}}$ over $T$ time intervals, where $K^{t}$ is the number of topics. Each vector $\boldsymbol{tn_{u}^{t}}=\{tn_{u}^{t,1};\dots;tn_{u}^{t,K^{t}}\}\in\mathbb{R}^{K^{t}}$ is the topical distribution at time $t$, where $tn_{u}^{t,k}=\hat{tn}_{u}^{t,k}/\sum_{c=1}^{K^{t}}\hat{tn}_{u}^{t,c}$ is the relative frequency of topic $k$, and $\hat{tn}_{u}^{t,k}$ is the absolute frequency of the corresponding topic. Similarly, on the source network, interactions are represented as $sn_{u}=\{\boldsymbol{sn_{u}^{1}};\dots;\boldsymbol{sn_{u}^{t}}\}\in\mathbb{R}^{T\times K^{t}}$. The resultant topical frequencies indicate user preference levels towards the corresponding $K^{t}$ topics. Therefore, $u=\{tn_{u},sn_{u}\}\in\mathbb{R}^{T\times 2K^{t}}$ represents user preferences over $T$ time intervals on a continuous topical space, and forms the input to the proposed model.
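
The per-interval feature in this construction reduces to a normalized topic-frequency count. A minimal sketch, assuming each interaction has already been assigned topic ids by Twitter-LDA:

```python
import numpy as np

def topical_distribution(interaction_topics, num_topics):
    """Relative topic frequencies tn_u^t for one user and one time interval.

    interaction_topics: topic ids assigned (e.g., by Twitter-LDA) to the
    user's interactions within the interval; one or more ids per interaction.
    """
    counts = np.zeros(num_topics)
    for k in interaction_topics:
        counts[k] += 1.0                 # absolute frequency of topic k
    total = counts.sum()
    return counts / total if total > 0 else counts
```

Stacking these vectors over the $T$ intervals yields the $tn_{u}\in\mathbb{R}^{T\times K^{t}}$ (and analogously $sn_{u}$) inputs described above.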

2.2. Generator Task

The generator task learns the function that maps the target network preference manifold to the source network preference manifold using Encoders ($E$), a Discriminator ($D$) and a Generator ($G$).

2.2.1. Encoders:

User preferences captured as topical distribution vectors ($\boldsymbol{tn_{u}^{t}}$ and $\boldsymbol{sn_{u}^{t}}$) become sparse when $K^{t}$ is set to a sufficiently large value to capture finer-grained user preferences, and/or when the number of interactions in a time interval is low. One of the most effective methods to train machine learning models with highly sparse data is to model interactions between input features (Blondel et al., 2016) (e.g., using neural networks (He et al., 2017)). Therefore, we used two neural encoders, $E_{tn}$ and $E_{sn}$, for the target and source networks to transform the input topical distributions into dense latent encodings. The resultant encodings are represented as $E_{tn}(tn_{u})\in\mathbb{R}^{T\times En}$ and $E_{sn}(sn_{u})\in\mathbb{R}^{T\times En}$, where $En$ is the dimensionality of the latent space. These encodings form the input to the generator task.
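
A plausible sketch of the two encoders follows. The single hidden layer of size $2 \times En$ follows the $H_{L}\times 2$ sizing rule stated in Section 3.4; the ReLU activation and exact layer arrangement are our assumptions.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """One-hidden-layer encoder: sparse topical distribution -> dense encoding."""
    def __init__(self, num_topics, enc_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_topics, 2 * enc_dim), nn.ReLU(),  # hidden = H_L x 2
            nn.Linear(2 * enc_dim, enc_dim))

    def forward(self, tn):                 # tn: (batch, K^t) topical vectors
        return self.net(tn)                # (batch, En) dense encodings

E_tn = Encoder(num_topics=64, enc_dim=32)  # target-network encoder
E_sn = Encoder(num_topics=64, enc_dim=32)  # source-network encoder
```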

Figure 1. Adversarial learning process between $E$, $D$ and $G$.

2.2.2. Generator task formulation:

For a non-overlapped user $u$, let $E_{tn}(\boldsymbol{tn_{u}^{t}})$ denote the target network encoding at time interval $t$. The generator task aims to synthetically generate a mapping source network encoding that closely reflects the missing source network encoding $E_{sn}(\boldsymbol{sn_{u}^{t}})$. Hence, $G$ uses the target encoding $E_{tn}(\boldsymbol{tn_{u}^{t}})$ and attempts to generate the mapping source encoding to fool $D$, while $D$ tries to differentiate between the real source encoding $E_{sn}(\boldsymbol{sn_{u}^{t}})$ and the generated encoding $G(E_{tn}(\boldsymbol{tn_{u}^{t}}))$. Note that we refer to the actual target and generated source network encodings of non-overlapped users $\big(E_{tn}(\boldsymbol{tn_{u}^{t}}),G(E_{tn}(\boldsymbol{tn_{u}^{t}}))\big)$ as synthetically mapped data, and the actual target and source network encodings of overlapped users $\big(E_{tn}(\boldsymbol{tn_{u}^{t}}),E_{sn}(\boldsymbol{sn_{u}^{t}})\big)$ as real mapping data.

2.2.3. Discriminator:

Analogous to $D$ in standard GANs, the real and synthetically mapped pairs could be used to train $D$. However, $D$ may then only learn to differentiate between actual and generated source network encodings in a given pair of inputs, without learning to check whether the given pair is a mapping pair. Therefore, during training, we also input mismatching source and target network encodings to ensure that an effective mapping is learned. We modified the loss function for $D$ to minimize output scores for mismatching pairs and maximize output scores for matching pairs. For a given target network encoding, we drew mismatching source encodings only from real data to avoid potential biases. Mismatching pairs were created by pairing real encodings from different users at the same time interval, or by pairing encodings from the same user at different time intervals (see Figure 1). Accordingly, the loss function for $D$ is formed as follows:

(2) $\max_{E_{tn},E_{sn},D}V(E_{tn},E_{sn},D)=\mathbb{E}_{\boldsymbol{tn_{u}^{t}},\boldsymbol{sn_{u}^{t}}\sim p_{data}(tn,sn)}L_{real}\big(E_{tn}(\boldsymbol{tn_{u}^{t}}),E_{sn}(\boldsymbol{sn_{u}^{t}})\big)+\mathbb{E}_{\boldsymbol{tn_{u}^{t}}\sim p_{data}(tn)}L_{fake}\big(E_{tn}(\boldsymbol{tn_{u}^{t}}),G(E_{tn}(\boldsymbol{tn_{u}^{t}}))\big)+\mathbb{E}_{\boldsymbol{tn_{u}^{t}},\boldsymbol{\overline{sn}_{u}^{t}}\sim\overline{p}_{data}(tn,\overline{sn})}L_{mismatch}\big(E_{tn}(\boldsymbol{tn_{u}^{t}}),E_{sn}(\boldsymbol{\overline{sn}_{u}^{t}})\big)$

where $p_{data}(tn,sn)$ and $\overline{p}_{data}(tn,\overline{sn})$ denote matching and mismatching target and source network topical distributions, $p_{data}(tn)$ denotes the target network topical distributions, and $G(x)$ is the generated matching source network encoding for a given target network encoding $x$. Furthermore, $L_{real}(x,y)=\log(D(x,y))$, $L_{fake}(x,y)=\log(1-D(x,y))$ and $L_{mismatch}(x,y)=\log(1-D(x,y))$, where $D(x,y)$ is the probability that $x$ and $y$ are matching target and source network encodings. Therefore, $D$ and $E$ work together to maximize the discriminator value function $V(E_{tn},E_{sn},D)$. The encoders ($E_{tn}$ and $E_{sn}$) assist the process by producing encodings of the topical distributions that are optimized for learning $D$ (e.g., encodings that express the latent mapping between the target and source networks).
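
The value function in Equation (2) can be sketched as follows; `D` is assumed to be a network that takes an encoding pair and outputs a matching probability, and the small `eps` guard on the logarithms is our addition for numerical stability.

```python
import torch

def discriminator_value(D, e_tn, e_sn_real, e_sn_fake, e_sn_mismatch, eps=1e-8):
    """Value function of Eq. (2), to be *maximized* w.r.t. D and the encoders.

    e_tn: target encodings E_tn(tn_u^t); e_sn_real: matching source encodings;
    e_sn_fake: generated encodings G(E_tn(tn_u^t)); e_sn_mismatch: real but
    mismatching source encodings. All tensors are (batch, En).
    """
    l_real = torch.log(D(e_tn, e_sn_real) + eps).mean()                # matching pairs
    l_fake = torch.log(1.0 - D(e_tn, e_sn_fake) + eps).mean()          # generated pairs
    l_mismatch = torch.log(1.0 - D(e_tn, e_sn_mismatch) + eps).mean()  # mismatching pairs
    return l_real + l_fake + l_mismatch
```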

2.2.4. Generator:

Once $D$ and $E$ are trained, they adversarially guide $G$ by effectively identifying real and synthetically generated data. Specifically, $G$ takes in a target network encoding $E_{tn}(\boldsymbol{tn_{u}^{t}})$ and synthetically generates a mapping source network encoding that resembles $E_{sn}(\boldsymbol{sn_{u}^{t}})$ drawn from the real mapping data.

GAN models are typically used to generate real-like images from random noise vectors. For a given noise vector, different GANs can learn multiple successful mappings from the input space to the output space. In contrast, CnGAN aims to learn a specific mapping from target to source network encodings. Hence, for given pairs of real mapping data $(\boldsymbol{tn_{u}^{t}},\boldsymbol{sn_{u}^{t}})$, $G$ minimizes an additional content loss between the generated $G(E_{tn}(\boldsymbol{tn_{u}^{t}}))$ and real $E_{sn}(\boldsymbol{sn_{u}^{t}})$ source encodings, which provides additional guidance for $G$. Therefore, we formulated the loss function for $G$ as follows:

(3) $\min_{G}V(G)=\mathbb{E}_{\boldsymbol{tn_{u}^{t}}\sim p_{data}(tn)}L_{fake}\big(E_{tn}(\boldsymbol{tn_{u}^{t}}),G(E_{tn}(\boldsymbol{tn_{u}^{t}}))\big)+\mathbb{E}_{\boldsymbol{sn_{u}^{t}},\boldsymbol{tn_{u}^{t}}\sim p_{data}(tn,sn)}L_{content}\big(E_{sn}(\boldsymbol{sn_{u}^{t}}),G(E_{tn}(\boldsymbol{tn_{u}^{t}}))\big)$

where $L_{content}(x,y)=\|x-y\|_{1}$ is the content loss, computed as the $\ell_1$-norm of the difference between the real and generated encodings.
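
A sketch of the generator objective in Equation (3); the `content_weight` knob balancing the two terms is our addition, with the default of 1.0 reproducing the unweighted sum above.

```python
import torch

def generator_loss(D, G, e_tn, e_sn_real, content_weight=1.0, eps=1e-8):
    """Generator objective of Eq. (3), to be minimized w.r.t. G.

    For real mapping pairs, the generated encoding G(e_tn) should both
    fool D and stay close (in L1 norm) to the real source encoding.
    """
    e_sn_fake = G(e_tn)                                   # generated source encoding
    l_fake = torch.log(1.0 - D(e_tn, e_sn_fake) + eps).mean()
    l_content = torch.abs(e_sn_real - e_sn_fake).sum(dim=1).mean()  # L1 content loss
    return l_fake + content_weight * l_content
```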

2.3. Recommender Task

The recommender task uses the generated preferences to conduct recommendations for non-overlapped users.

2.3.1. Recommender task formulation:

Let $\big(E_{tn}(\boldsymbol{tn_{u}^{t}}),G(E_{tn}(\boldsymbol{tn_{u}^{t}}))\big)$ denote a non-overlapped user $u$ at time $t$ based on synthetically mapped data, which reflects current preferences. To capture previous preferences, we used a previous interaction vector $\boldsymbol{R_{u}^{t-}}\in\mathbb{R}^{M}$, where $M$ is the number of items. Each element $r_{u,i}^{t-}\in\boldsymbol{R_{u}^{t-}}$ is set to 1 if the user had an interaction with item $i$ before time $t$, and 0 otherwise. Thus, we formulated the recommender task as a time-aware Top-N ranking task where current and previous user preferences are used to predict the set of $N$ items that the user is most likely to interact with in the next time interval $t+1$.
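
Constructing $\boldsymbol{R_{u}^{t-}}$ is straightforward; a minimal sketch (the `user_history` input format is our assumption):

```python
import numpy as np

def previous_interactions(user_history, t, num_items):
    """Binary previous-interaction vector R_u^{t-} of Section 2.3.1.

    user_history: iterable of (item_id, interval) pairs for one user.
    """
    r = np.zeros(num_items)
    for item, interval in user_history:
        if interval < t:                 # interacted before time t
            r[item] = 1.0
    return r
```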

2.3.2. Recommender loss:

Typically, BPR optimization is performed using Stochastic Gradient Descent (SGD) where, for each training instance $(u,i,j)$, parameter ($\Theta$) updates are performed as follows:

(4) $\Theta\leftarrow\Theta-\alpha\frac{\partial}{\partial\Theta}L_{BPR}(S)$
$\Theta\leftarrow\Theta+\alpha\Big(\frac{e^{-\hat{r}_{uij}}}{1+e^{-\hat{r}_{uij}}}\cdot\frac{\partial}{\partial\Theta}\hat{r}_{uij}+\lambda_{\Theta}\Theta\Big)$

where $\alpha$ is the learning rate, $\hat{r}_{uij}=\hat{r}_{ui}-\hat{r}_{uj}$ is the pairwise loss for a given $(u,i,j)$ triplet, and $\hat{r}_{ui}$ is the predicted rating for user $u$ and item $i$. Since BPR is a generic optimization criterion, $\hat{r}_{ui}$ can be obtained using any standard collaborative filtering approach (e.g., MF). Considering the rating prediction function in MF ($\hat{r}_{ui}=\sum_{f=1}^{K}w_{uf}\cdot h_{if}$, where $\boldsymbol{w_{u}}\in\mathbb{R}^{K}$ and $\boldsymbol{h_{i}}\in\mathbb{R}^{K}$ are the latent representations of user $u$ and item $i$), $\hat{r}_{uij}$ for MF based BPR optimization can be defined as $\hat{r}_{uij}=\sum_{f=1}^{K}w_{uf}\cdot(h_{if}-h_{jf})$.

For each $(u,i,j)$ instance, three updates are performed: when $\Theta=w_{uf}$, $\Theta=h_{if}$ and $\Theta=h_{jf}$. Hence, for each update on the latent user representation ($w_{uf}$), two updates are performed on the latent item representations ($h_{if}$ and $h_{jf}$), and the training process is biased towards item representation learning; we therefore term the standard BPR optimization function an item-based pairwise loss function. Since our goal is to learn effective user representations to conduct quality recommendations, we introduce a user-based pairwise loss function ($R$ loss), which is biased towards user representation learning.

We used a fixed-length sliding window approach in the training process where, at each time interval, the model is trained using the interactions within that time interval (see Section 3.2.1). Thus, we denote the training instances for the new user-based BPR at each time interval $t$ as $S^{t}=\{(u,v,i) \mid u\in U_{i+}^{t}\land v\in U\setminus U_{i+}^{t}\land i\in I\}$, where $U_{i+}^{t}$ is the set of users who have interactions with item $i$ at time $t$. Compared to $(u,i,j)$ in the standard item-based BPR, the user-based BPR uses triplets $(u,v,i)$ where, for item $i$ at time $t$, user $u$ has an interaction and user $v$ does not. Intuitively, the predictor should therefore assign a higher score to $(u,i)$ than to $(v,i)$. Accordingly, we defined the user-based loss function to be minimized as follows:

(5) $L_{UBPR}(S^{t})=\sum_{(u,v,i)\in S^{t}}-\ln\sigma(\hat{r}_{uvi}^{t})+\lambda_{\Theta}\|\Theta\|^{2}$

where $\hat{r}_{uvi}^{t}=\hat{r}_{ui}^{t}-\hat{r}_{vi}^{t}=\sum_{f=1}^{K}(w_{uf}^{t}-w_{vf}^{t})h_{if}$. During optimization, $\Theta$ updates are performed as follows:

(6) $\Theta\leftarrow\Theta+\alpha\Big(\frac{e^{-\hat{r}_{uvi}^{t}}}{1+e^{-\hat{r}_{uvi}^{t}}}\cdot\frac{\partial}{\partial\Theta}\hat{r}_{uvi}^{t}+\lambda_{\Theta}\Theta\Big)$

For different $\Theta$ values, the following partial derivatives are obtained:

(7) $\frac{\partial}{\partial\Theta}\hat{r}_{uvi}^{t}=\begin{cases}h_{if},&\text{if }\Theta=w_{uf}^{t}\\ -h_{if},&\text{if }\Theta=w_{vf}^{t}\\ (w_{uf}^{t}-w_{vf}^{t}),&\text{if }\Theta=h_{if}\\ 0,&\text{otherwise}\end{cases}$

Since the user-based BPR performs two user representation updates for each training instance, it can efficiently guide $G$ in learning user representations during training.
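
Putting Equations (6) and (7) together, one SGD step on a $(u,v,i)$ triplet can be sketched as follows; note that this sketch applies the regularization term with a minus sign (weight decay), as in common BPR implementations.

```python
import numpy as np

def ubpr_sgd_step(W, H, u, v, i, alpha=0.01, lam=0.01):
    """One SGD step on a (u, v, i) triplet, following Eqs. (6)-(7).

    W: (num_users, K) user factors; H: (num_items, K) item factors.
    At the current interval, user u interacted with item i and user v did not.
    """
    r_uvi = (W[u] - W[v]) @ H[i]                     # pairwise score difference
    coeff = np.exp(-r_uvi) / (1.0 + np.exp(-r_uvi))  # sigma(-r_uvi)
    grad_wu = H[i].copy()                            # d r / d w_uf  (Eq. 7)
    grad_wv = -H[i]                                  # d r / d w_vf
    grad_hi = W[u] - W[v]                            # d r / d h_if
    W[u] += alpha * (coeff * grad_wu - lam * W[u])
    W[v] += alpha * (coeff * grad_wv - lam * W[v])
    H[i] += alpha * (coeff * grad_hi - lam * H[i])
```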

2.3.3. Recommender architecture:

We used a Siamese network for the recommender architecture since it naturally supports pairwise learning (see Figure 2). Given a non-overlapped user $u$ with current preferences $\big(E_{tn}(\boldsymbol{tn_{u}^{t}}),G(E_{tn}(\boldsymbol{tn_{u}^{t}}))\big)$ and previous preferences $\boldsymbol{R_{u}^{t-}}$ at time $t$, we learned a transfer function $\Phi$, which maps the user to the latent user space $w_{u,f}^{t}$ for recommendations as follows:

(8) $w_{u,f}^{t}=\Phi\big(E_{tn}(\boldsymbol{tn_{u}^{t}}),G(E_{tn}(\boldsymbol{tn_{u}^{t}})),\boldsymbol{R_{u}^{t-}}\big)$

The transfer function $\Phi$ is learned using a neural network since neural networks are more expressive and effective than simple linear models. Accordingly, for a non-overlapped user $u$ at time $t$, the predicted rating $\hat{r}_{ui}^{t}$ for any given item $i$ is obtained as the inner product between the latent user and item representations:

(9) $\hat{r}_{ui}^{t}=\sum_{f=1}^{K}\Phi\big(E_{tn}(\boldsymbol{tn_{u}^{t}}),G(E_{tn}(\boldsymbol{tn_{u}^{t}})),\boldsymbol{R_{u}^{t-}}\big)\cdot h_{if}$
Figure 2. Recommender task architecture as a Siamese network.
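
A plausible sketch of the transfer function $\Phi$ and the scoring rule in Equation (9); the one-hidden-layer, $H_{L}\times 2$ sizing follows Section 3.4, while the input concatenation and ReLU activation are our assumptions.

```python
import torch
import torch.nn as nn

class TransferNet(nn.Module):
    """Transfer function Phi: maps (target encoding, generated source
    encoding, previous-interaction vector) to the latent user factors w_u^t."""
    def __init__(self, enc_dim, num_items, k):
        super().__init__()
        in_dim = 2 * enc_dim + num_items
        self.net = nn.Sequential(
            nn.Linear(in_dim, 2 * k), nn.ReLU(),   # one hidden layer, H_L x 2
            nn.Linear(2 * k, k))

    def forward(self, e_tn, e_sn_gen, r_prev):
        return self.net(torch.cat([e_tn, e_sn_gen, r_prev], dim=1))

def predict_ratings(phi, e_tn, e_sn_gen, r_prev, H):
    """Eq. (9): score every item by the inner product <w_u^t, h_i>."""
    w_u = phi(e_tn, e_sn_gen, r_prev)              # (batch, K)
    return w_u @ H.T                               # (batch, num_items) scores
```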

2.4. Multi-Task Learning (MTL)

Although the generator and recommender tasks are separate, they depend on each other to achieve higher overall performance. The generator task increases the accuracy of the recommender by synthesizing source network encodings that effectively represent user preferences. The recommender task guides the generator to efficiently learn the target to source mapping by reducing the search space. Therefore, we trained both interrelated tasks in an MTL environment so that each benefits from the training signals of the other. Hence, we formulated the multi-task training objectives as follows:

(10) $\min_{G}\max_{E_{tn},E_{sn},D}V(E_{tn},E_{sn},G,D)$
(11) $\min_{w_{uf}^{t},w_{vf}^{t},h_{if}}L_{UBPR}(S^{t})$

where $V(E_{tn},E_{sn},G,D)$ is the value function for $E$, $G$ and $D$, $L_{UBPR}(S^{t})$ is the $R$ loss function, and $w_{uf}^{t}$, $w_{vf}^{t}$ and $h_{if}$ are model parameters. The $R$ loss is back-propagated all the way to $G$ since $w_{uf}^{t}$ and $w_{vf}^{t}$ are compositions of the functions $G$ and $\Phi$ (see Equation 8), and the generator and recommender tasks are trained as an end-to-end process. Hence, Equation 11 can be restated as follows:

(12) $\min_{\Phi,G,h_{if}}L_{UBPR}(S^{t})$
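
One alternating MTL training step could then look as follows. This sketch reuses the `discriminator_value` and `generator_loss` sketches above; `ubpr_loss` is a hypothetical helper that evaluates Equation (5) from the $\Phi$-produced user representations, and the optimizer wiring is our assumption.

```python
import torch

def mtl_training_step(E_tn, E_sn, G, D, phi, H, batch, opt_dis, opt_gen_rec):
    """One alternating MTL step over a batch of overlapped users.

    `batch` is a hypothetical container with matching pairs (tn, sn),
    mismatching source vectors sn_bar, and (u, v, i) recommender triplets.
    """
    # 1) Maximize Eq. (2) w.r.t. D and the encoders (minimize its negation).
    opt_dis.zero_grad()
    d_value = discriminator_value(
        D, E_tn(batch.tn), E_sn(batch.sn),
        G(E_tn(batch.tn)).detach(),        # freeze G during the D update
        E_sn(batch.sn_bar))
    (-d_value).backward()
    opt_dis.step()

    # 2) Jointly minimize Eq. (3) and Eq. (12): the recommender (R) loss is
    #    back-propagated end-to-end through phi and G.
    opt_gen_rec.zero_grad()
    g_loss = generator_loss(D, G, E_tn(batch.tn), E_sn(batch.sn))
    rec_loss = ubpr_loss(phi, G, E_tn, H, batch.triplets)  # hypothetical helper (Eq. 5)
    (g_loss + rec_loss).backward()
    opt_gen_rec.step()
```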

3. Experiments

3.1. Dataset

Due to the lack of publicly available timestamped cross-network datasets, we extracted overlapped users on the YouTube (target) and Twitter (source) networks from two publicly available datasets (Yan et al., 2014; Lim et al., 2015). We scraped interaction timestamps and the associated textual contents (video titles, descriptions and tweet contents) over a 2-year period from 1st March 2015 to 28th February 2017. In line with common practice, we filtered out users with fewer than 10 interactions on both networks for effective evaluations. The final dataset contained 2,372 users and 12,782 YouTube videos.

3.2. Experimental Setup

The recommender systems literature often uses training and testing datasets with temporal overlaps (Campos et al., 2012), which provides an undue advantage since it does not reflect a realistic recommender environment. To avoid such biases, at each time interval, the proposed model is trained using only the interactions from the current and previous time intervals. We randomly selected 50% of users as non-overlapped users and removed their source network interactions. The model was first trained offline using overlapped users (see Section 3.2.1). Then, testing and online training were conducted using both overlapped and non-overlapped users (see Section 3.2.2).

3.2.1. Offline Training:

We used the sliding window approach and the data from the first 16 months ($2T/3$) of overlapped users to learn the mapping from target to source network preferences for the generator task, and the neural transfer function ($\Phi$) and item latent representations ($h_{if}$) for the recommender task. At each training epoch, the generator task is learned first: $E$, $D$ and $G$ are trained using real mapping data, and the trained $G$ is used to generate synthetically mapped data. The generated preferences are used as inputs to the recommender task to predict interactions in the next time interval. The ground truth interactions at that time interval are used to propagate the recommender error back from the recommender to the generator task and learn the parameters of both processes.

3.2.2. Testing and Online Training:

We used the data from the last 8 months (from $2T/3$ onward) to test and retrain the model online in a simulated real-world environment. At each testing time interval $t$, first, mapping source network preferences are generated for non-overlapped users based on their target network preferences. Second, the synthetically mapped data for non-overlapped users are used as inputs to conduct Top-N recommendations for $t+1$ (see Equation 9). The entire model is then retrained online before moving to the next time interval, as follows. First, the recommender parameters ($\Phi$ and $h_{if}^{t}$) are updated based on the recommendations for both overlapped and non-overlapped users. Second, the components of the generator task ($E$, $D$ and $G$) are updated based on the real mapping data from overlapped users. Finally, to guide $G$ with the training signals of the recommender, the model synthesizes mapping data for overlapped users, which are used as inputs for recommendations; the error is back-propagated from the recommender task to $G$, and consequently the parameters of $G$, $\Phi$ and $h_{if}^{t}$ are updated. This process is repeated at subsequent time intervals as the model continues the online retraining process. In line with common practice in sliding window approaches, the model is retrained multiple times before moving to the next time interval.

3.3. Evaluation

We formulated video recommendation as a Top-N recommender task and predicted a ranked set of NN videos that the user is most likely to interact with at each time interval. To evaluate model accuracy, we calculated the Hit Ratio (HR) and Normalized Discounted Cumulative Gain (NDCG) (He et al., 2015). Both metrics were calculated for each participating user at each time interval and the results were averaged across all users and testing time intervals.
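
For reference, a common per-user formulation of the two metrics follows (the exact variant used in the paper follows (He et al., 2015)):

```python
import numpy as np

def hit_ratio_ndcg(ranked_items, ground_truth, n):
    """HR@N and NDCG@N for one user at one time interval.

    ranked_items: item ids sorted by predicted score (descending).
    ground_truth: set of items the user interacted with at t+1.
    """
    top_n = ranked_items[:n]
    hits = [1.0 if item in ground_truth else 0.0 for item in top_n]
    hr = 1.0 if sum(hits) > 0 else 0.0
    dcg = sum(h / np.log2(rank + 2) for rank, h in enumerate(hits))
    idcg = sum(1.0 / np.log2(rank + 2)
               for rank in range(min(len(ground_truth), n)))
    return hr, (dcg / idcg if idcg > 0 else 0.0)
```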

3.3.1. Baselines

We evaluated CnGAN against single network, cross-network, linear factorization and GAN based baselines.

  • TimePop: Recommends the most popular $N$ items from the previous time interval to all users in the current time interval.

  • TBKNN (Campos et al., 2010): Time-Biased KNN computes a set of $K$ neighbors for each user at each time interval based on complete user interaction histories, and recommends their latest interactions to the target user in the next time interval. Following the original work, we used several $K$ values from 4 to 50 and averaged the results.

  • TDCN (Perera and Zimmermann, 2017): Time Dependent Cross-Network is an offline, MF based cross-network recommender solution, which provides recommendations only for overlapped users. TDCN learns network level transfer matrices to transfer and integrate user preferences across networks and conduct MF based recommendations.

  • CRGAN: Due to the absence of GAN based cross-network user preference synthesizing solutions, we created Cross-network Recommender GAN as a variation of our solution. Essentially, CRGAN uses the standard $D$ and $G$ loss functions and does not contain the $L_{mismatch}$ and $L_{content}$ loss components.

  • NUBPR-O and NUBPR-NO: Due to the absence of neural network based cross-network recommender solutions, we created two Neural User-based BPR solutions, as variations of our solution. Essentially, both models do not contain the generator task. NUBPR-O considers both source and target network interactions and NUBPR-NO only considers target network interactions to conduct recommendations.

3.4. Model Parameters

We used Adaptive Moment Estimation (Adam) (Kingma and Ba, 2015) for optimization since Adam adaptively updates the learning rate during training. We set the initial learning rate to 0.1, a fairly large value for faster initial learning before the rate is adapted by Adam. We used only one hidden layer for all neural architectures and, given the size of the output layer $H_{L}$, set the size of the hidden layer to $H_{L}\times 2$ to reduce the number of hyper-parameters. We used dropout regularization with a dropout rate of 0.4 during training to prevent the neural networks from overfitting.

4. Discussion

4.0.1. Offline training loss:

We compared the offline training losses of $D$, $G$ and $R$ (see Figure 3). All three converge around 50 epochs, and compared to $D$ and $R$, the drop in the $G$ loss is small. Therefore, we further examined the proposed $G$ loss against the vanilla $G$ loss over the same training process (see Figure 4). In line with standard GANs, the vanilla $G$ loss component within the proposed $G$ loss increases while the proposed $G$ loss slowly decreases. Hence, despite the small decrease in the overall $G$ loss, the $L_{content}$ loss has notably decreased.

Figure 3. Offline training loss of $D$, $G$ and $R$.
Figure 4. Offline training loss of $G$ and vanilla $G$.

4.0.2. Online training loss:

Figure 5. Online training loss of the recommender task.

We observed the $R$ loss during the online training process (see Figure 5). Updates are most effective within the first 10-15 iterations, and after around 20 iterations the recommender accuracy reaches its peak; additional iterations tend to overfit the model and degrade performance. Since retraining is costly, the fact that online training requires only a few iterations makes the proposed solution feasible for real-world applications.

4.0.3. Prediction accuracy:

Figure 6. Hit Ratio and NDCG for different Top-N values.

We compared recommender accuracy (HR and NDCG) across multiple Top-N values (see Figure 6). The neural network based solutions (Proposed, NUBPR-O, NUBPR-NO and CRGAN) show higher accuracy since they capture complex relationships in user preferences. Among the non-neural solutions, TimePop has the lowest accuracy since it is based on a simple statistic, the popularity of videos. TDCN outperforms TBKNN and TimePop since it effectively utilizes both source and target network interactions to model user preferences.

As expected, NUBPR-O, which does not have a data generation process and is trained and tested only with real mapping data from overlapped users, has the best accuracy. We used NUBPR-O as the benchmark since our goal is to synthesize data similar to the data used to train it. The proposed model shows the closest accuracy to NUBPR-O and consistently outperforms NUBPR-NO, which uses only target network interactions. Therefore, the proposed model is able to generate user preferences that increase recommender accuracy even when the user overlaps are unknown. Furthermore, the improvements gained over CRGAN, which uses the vanilla GAN loss functions, show the effectiveness of the proposed $D$ and $G$ loss functions.

4.0.4. Prediction accuracy for different numbers of topics ($K^{t}$):

Figure 7. Hit Ratio and NDCG for different numbers of topics.

We compared recommender accuracy against the dimensionality of the encoded topical space ($K^{t}$) (see Section 2.1.2 and Figure 7). A grid search was used to select an effective $K^{t}$ value, and the highest accuracy is achieved when $K^{t}$ is around 64 topics for all Top-N values. A smaller number of topics fails to effectively encode diverse user preferences, while a larger number of topics creates highly sparse inputs and reduces recommender performance. Further experiments using the standard perplexity measure (Blei et al., 2003) showed that 64 topics lead to a lower perplexity with faster convergence.

4.0.5. Diversity and novelty:

Higher recommendation accuracy alone is insufficient to capture overall user satisfaction. For example, continuously recommending similar items (in terms of topics, genre, etc.) could lead to a decline in recommendation accuracy as users lose interest over time. Therefore, we also compared the diversity (Avazpour et al., 2014) and novelty (Zhang, 2013) of the recommended items. The proposed model recommended videos with better novelty and diversity than the other baselines, and was the closest to the NUBPR-O model (only a 2.4% drop in novelty and a 3.8% drop in diversity). Compared to the single-network solutions, cross-network solutions showed better results as they utilize rich user profiles with diverse user preferences. Against the closest baseline, CRGAN, the proposed model showed considerable improvements in novelty (by 8.9%) and diversity (by 12.3%).

4.0.6. Alternative GAN architectures:

Designing and training GAN models can be challenging due to the imbalance between $D$ and $G$, diminishing gradients and non-convergence (Arjovsky and Bottou, 2017; Salimans et al., 2016). Various GAN architectures and training techniques have recently been introduced to handle such issues. Therefore, despite the effectiveness of CnGAN, we designed two alternative solutions based on two widely popular GAN architectures, the Deep Convolutional GAN (DCGAN) (Radford et al., 2015) and the Wasserstein GAN (WGAN) (Arjovsky et al., 2017), to replace the $G$ and $D$ learning processes. We found that DCGAN becomes highly unstable in this environment, with the $D$ error constantly increasing. This may be because DCGAN is based on Convolutional Neural Networks, which are known to capture local features within the input data; unlike typical image data, the user preference vectors may not provide any meaningful local features. Hence, DCGAN was less effective. Further experiments using WGAN also did not improve performance.

5. Conclusion and Further Work

Typical cross-network recommender solutions can only be applied to users who are overlapped across multiple networks. To address this limitation, we proposed, to the best of our knowledge, the first cross-network Generative Adversarial Network based model (CnGAN), which generates source network user preferences for non-overlapped users. The proposed model first uses a generator task to learn the mapping from target to source network user preferences and synthesize source network preferences for non-overlapped users. Second, the model uses a recommender task based on a Siamese network to incorporate the synthesized source network preferences and conduct recommendations. The proposed model consistently outperformed multiple baselines in terms of accuracy, diversity and novelty of recommendations. As future work, we plan to investigate recommender quality improvements when using both generated and real data for overlapped users. Furthermore, the model can be extended to use the social information of users. Overall, CnGAN alleviates a significant limitation of cross-network recommender solutions and provides a foundation for making quality cross-network recommendations for all users.

Acknowledgments

This research has been supported by Singapore Ministry of Education Academic Research Fund Tier 2 under MOE’s official grant number MOE2018-T2-1-103. We gratefully acknowledge the support of NVIDIA Corporation with the donation of a Titan Xp GPU used for this research.

References

  • Abel et al. (2011) Fabian Abel, Samur Araújo, Qi Gao, and Geert-Jan Houben. 2011. Analyzing cross-system user modeling on the social web. In International Conference on Web Engineering. Springer, 28–43.
  • Arjovsky and Bottou (2017) Martin Arjovsky and Léon Bottou. 2017. Towards principled methods for training generative adversarial networks. arXiv preprint arXiv:1701.04862 (2017).
  • Arjovsky et al. (2017) Martin Arjovsky, Soumith Chintala, and Léon Bottou. 2017. Wasserstein GAN. arXiv preprint arXiv:1701.07875 (2017).
  • Avazpour et al. (2014) Iman Avazpour, Teerat Pitakrat, Lars Grunske, and John Grundy. 2014. Dimensions and metrics for evaluating recommendation systems. In Recommendation systems in software engineering. Springer, 245–273.
  • Bharadhwaj et al. (2018) Homanga Bharadhwaj, Homin Park, and Brian Y Lim. 2018. RecGAN: Recurrent generative adversarial networks for recommendation systems. In Proceedings of the 12th ACM Conference on Recommender Systems. ACM, 372–376.
  • Blei et al. (2003) David M Blei, Andrew Y Ng, and Michael I Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research 3, Jan (2003), 993–1022.
  • Blondel et al. (2016) Mathieu Blondel, Masakazu Ishihata, Akinori Fujino, and Naonori Ueda. 2016. Polynomial networks and factorization machines: New insights and efficient training algorithms. In Proceedings of International Conference on Machine Learning.
  • Campos et al. (2010) Pedro G Campos, Alejandro Bellogín, Fernando Díez, and J Enrique Chavarriaga. 2010. Simple time-biased KNN-based recommendations. In Proceedings of the Workshop on Context-Aware Movie Recommendation. ACM, 20–23.
  • Campos et al. (2012) Pedro G Campos, Fernando Díez, and Iván Cantador. 2012. A performance comparison of time-aware recommendation models.
  • Deng et al. (2013) Zhengyu Deng, Jitao Sang, Changsheng Xu, et al. 2013. Personalized video recommendation based on cross-platform user modeling. In 2013 IEEE International Conference on Multimedia and Expo (ICME). IEEE, 1–6.
  • He et al. (2015) Xiangnan He, Tao Chen, Min-Yen Kan, and Xiao Chen. 2015. Trirank: Review-aware explainable recommendation by modeling aspects. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management. ACM, 1661–1670.
  • He et al. (2017) Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. 2017. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 173–182.
  • Hidasi et al. (2015) Balázs Hidasi, Alexandros Karatzoglou, Linas Baltrunas, and Domonkos Tikk. 2015. Session-based recommendations with recurrent neural networks. International Conference on Learning Representations (2015).
  • Kang et al. (2017) Wang-Cheng Kang, Chen Fang, Zhaowen Wang, and Julian McAuley. 2017. Visually-aware fashion recommendation and design with generative image models. In 2017 IEEE International Conference on Data Mining (ICDM). IEEE, 207–216.
  • Kingma and Ba (2015) Diederik P Kingma and Jimmy Lei Ba. 2015. Adam: A method for stochastic optimization. In International Conference on Learning Representations.
  • Lim et al. (2015) Bang Hui Lim, Dongyuan Lu, Tao Chen, and Min-Yen Kan. 2015. #mytweet via Instagram: Exploring user behaviour across multiple social networks. In Advances in Social Networks Analysis and Mining (ASONAM), 2015 IEEE/ACM International Conference on. IEEE, 113–120.
  • Mehta et al. (2005) Bhaskar Mehta, Claudia Niederee, Avare Stewart, Marco Degemmis, Pasquale Lops, and Giovanni Semeraro. 2005. Ontologically-enriched unified user modeling for cross-system personalization. In International Conference on User Modeling. Springer, 119–123.
  • Osborne et al. (2012) Miles Osborne, Saša Petrovic, Richard McCreadie, Craig Macdonald, and Iadh Ounis. 2012. Bieber no more: First story detection using Twitter and Wikipedia. In SIGIR 2012 Workshop on Time-aware Information Access.
  • Ouyang et al. (2014) Yuanxin Ouyang, Wenqi Liu, Wenge Rong, and Zhang Xiong. 2014. Autoencoder-based collaborative filtering. In International Conference on Neural Information Processing. Springer, 284–291.
  • Perera and Zimmermann (2017) Dilruk Perera and Roger Zimmermann. 2017. Exploring the use of time-dependent cross-network information for personalized recommendations. In Proceedings of the 2017 ACM on Multimedia Conference. ACM, 1780–1788.
  • Perera and Zimmermann (2018) Dilruk Perera and Roger Zimmermann. 2018. LSTM Networks for Online Cross-Network Recommendations.. In IJCAI. 3825–3833.
  • Radford et al. (2015) Alec Radford, Luke Metz, and Soumith Chintala. 2015. Unsupervised representation learning with deep convolutional generative adversarial networks. International Conference on Learning Representations (ICLR).
  • Rendle et al. (2009) Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. 2009. BPR: Bayesian personalized ranking from implicit feedback. In Proceedings of the twenty-fifth conference on uncertainty in artificial intelligence. AUAI Press, 452–461.
  • Roy et al. (2012) Suman Deb Roy, Tao Mei, Wenjun Zeng, and Shipeng Li. 2012. Socialtransfer: Cross-domain transfer learning from social streams for media applications. In Proceedings of the 20th ACM International Conference on Multimedia. ACM, 649–658.
  • Sahoo et al. (2012) Nachiketa Sahoo, Param Vir Singh, and Tridas Mukhopadhyay. 2012. A hidden Markov model for collaborative filtering. MIS Quarterly (2012), 1329–1356.
  • Salakhutdinov et al. (2007) Ruslan Salakhutdinov, Andriy Mnih, and Geoffrey Hinton. 2007. Restricted Boltzmann machines for collaborative filtering. In Proceedings of the 24th International Conference on Machine learning. ACM, 791–798.
  • Salimans et al. (2016) Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. 2016. Improved techniques for training gans. In Advances in Neural Information Processing Systems. 2234–2242.
  • Wang et al. (2017) Jun Wang, Lantao Yu, Weinan Zhang, Yu Gong, Yinghui Xu, Benyou Wang, Peng Zhang, and Dell Zhang. 2017. Irgan: A minimax game for unifying generative and discriminative information retrieval models. In Proceedings of the 40th International ACM SIGIR conference on Research and Development in Information Retrieval. ACM, 515–524.
  • Yan et al. (2014) Ming Yan, Jitao Sang, and Changsheng Xu. 2014. Mining cross-network association for youtube video promotion. In Proceedings of the 22nd ACM International conference on Multimedia. ACM, 557–566.
  • Yoo et al. (2017) Jaeyoon Yoo, Heonseok Ha, Jihun Yi, Jongha Ryu, Chanju Kim, Jung-Woo Ha, Young-Han Kim, and Sungroh Yoon. 2017. Energy-based sequence gans for recommendation and their connection to imitation learning. Proceedings of ACM Conference.
  • Zhang (2013) Liang Zhang. 2013. The definition of novelty in recommendation system. Journal of Engineering Science & Technology Review 6, 3.
  • Zhao et al. (2011) Wayne Xin Zhao, Jing Jiang, Jianshu Weng, Jing He, Ee-Peng Lim, Hongfei Yan, and Xiaoming Li. 2011. Comparing twitter and traditional media using topic models. In European Conference on Information Retrieval. Springer, 338–349.