
SocialTrans: A Deep Sequential Model with Social Information for Web-Scale Recommendation Systems

Qiaoan Chen, Hao Gu, Lingling Yi (Weixin Group, Tencent Inc.; kazechen, nickgu, [email protected])
Yishi Lin, Peng He, Chuan Chen (Weixin Group, Tencent Inc.; elsielin, paulhe, [email protected])
Yangqiu Song (Department of CSE, Hong Kong University of Science and Technology; [email protected])
Abstract.

On social network platforms, a user’s behavior is based on his/her personal interests, or influenced by his/her friends. In the literature, it is common to model either users’ personal preference or their socially influenced preference. In this paper, we present a novel deep learning model SocialTrans for social recommendations to integrate these two types of preferences. SocialTrans is composed of three modules. The first module is based on a multi-layer Transformer to model users’ personal preference. The second module is a multi-layer graph attention neural network (GAT), which is used to model the social influence strengths between friends in social networks. The last module merges users’ personal preference and socially influenced preference to produce recommendations. Our model can efficiently fit large-scale data and we deployed SocialTrans to a major article recommendation system in China. Experiments on three data sets verify the effectiveness of our model and show that it outperforms state-of-the-art social recommendation methods.


1. INTRODUCTION

Social network platforms such as Facebook and Twitter are very popular, and they have become an essential part of our daily life. These platforms provide places for people to communicate with each other, and users can share information (e.g., articles, videos, and games) with their friends. To enrich user experiences, these platforms often build recommendation systems to help their users explore new things, for example by listing "things you may be interested in". Recommendation systems deployed on social network platforms usually use users' profiles and historical behaviors to make predictions about their interests. On these platforms, users' behavior can also be significantly influenced by their friends. Thus, it is crucial to incorporate social influence into recommendation systems, which motivates this work.

Figure 1 presents how Ada behaves in an online community (icons made by Freepik from www.flaticon.com). The left part is her historical behavior, described by a sequence of actions (e.g., item clicks), and the right part is her social network. First, user interests are dynamic by nature: Ada has been interested in pets for a long period, but she may search for yoga books in the future. We should capture Ada's dynamic interests from her behaviors. Second, Ada trusts her boss, who is an expert in data mining, when searching for technology news, while she may be influenced by another friend when searching for yoga. This socially influenced preference should also be considered in modeling.

Figure 1. An illustration of Ada’s historical behavior and her social network.

To get deeper insights into this phenomenon, we analyze a real-world social network platform, WeChat, a popular mobile application in China with more than one billion monthly active users (https://www.wechat.com/en). WeChat users can read and share articles with their friends. In this analysis, if a user shares an article and his friend (or an $n$-hop friend) re-shares it, we say that his friend is influenced by him. Let $H(n)$ be the average influence probability over each user and his $n$-hop friend pairs, and $H(0)$ be the average sharing probability. This analysis answers two questions: (1) how social influence strength changes across hops; (2) how social influence strength varies across topics.

Figure 2 shows the analysis result. In the left part, we consider the increase in influence strength $H(n)-H(0)$, which describes how significantly a user is influenced by his $n$-hop friends compared to the global probability. It shows that users are significantly influenced by their 1-hop friends and that the influence strength decreases dramatically as the hop count increases. The right part of Figure 2 shows that direct friends' influence $H(1)$ differs considerably across topics. These results motivate us to model context-dependent social influence to improve the recommendation system.

Figure 2. Analysis of social influence in WeChat. $H(n)$ represents the average influence probability over each user and his $n$-hop friend pairs, while $H(0)$ represents the average sharing probability. All users are anonymous in this analysis. "Enter" is an abbreviation for entertainment.

In this paper, we propose an approach to model users' personal preferences and context-dependent socially influenced preferences. Our recommendation model, named SocialTrans, is based on two recent state-of-the-art models: the Transformer (Vaswani et al., 2017) and the graph attention network (GAT) (Veličković et al., 2018). A multi-layer Transformer captures users' personal preferences, while socially influenced preferences are captured by a multi-layer GAT extended with edge attributes and the multi-head attention mechanism. We conduct offline experiments on two data sets and online A/B testing to verify our model. The results show that SocialTrans achieves state-of-the-art performance, with at least a 9.5% relative improvement on the offline data sets and a 5.89% relative improvement in online A/B testing. Our contributions can be summarized as follows:

  • Novel methodologies. We propose SocialTrans to model both users’ personal preferences and their socially influenced preferences. We combine Transformer and GAT for social recommendation tasks. In particular, the GAT we use is an extended version that considers edge attributes and uses the multi-head attention mechanism to model context-dependent social influence.

  • Multifaceted experiments. We evaluate our approach on one benchmark data set and two commercial data sets. The experimental results verify the superiority of our approach over state-of-the-art techniques.

  • Large-scale implementation. We train and deploy SocialTrans in a real-world recommendation system that can potentially affect over one billion users. The deployed model contains a three-layer Transformer and a two-layer GAT. We provide techniques to speed up both the offline training and the online serving procedures.

  • Economical online evaluation procedure. Model evaluation in a fast-growing recommendation system is computationally expensive: many items are added or fade away every day. We provide an efficient deployment and evaluation procedure to overcome these difficulties.

Organization. We first formulate the problem in Section 2. Then, we introduce our proposed model SocialTrans in Section 3. Large scale implementation details are in Section 4. Section 5 shows our experimental results. Related works are in Section 6. Section 7 concludes the paper.

2. Problem Definition

The goal of sequence-based social recommendation is to predict which item a user will click next, based on his previous click history and the social network. In this setting, let $U$ denote the set of users and $V$ the set of items. We use $G=(U,E)$ to denote the social network, where $E$ is the set of friendship links between users. At each timestamp $t$, user $u$'s previous behavior sequence is represented by an ordered list $S_{t-1}^{u}=(v_{0}^{u},v_{1}^{u},v_{2}^{u},\cdots,v_{t-1}^{u})$, where $u\in U$ and $v^{u}_{j}\in V$ for $0\leq j\leq t-1$. Sequence-based social recommendation utilizes information from both a user $u$ and his friends, which can be represented as $\mathbb{S}^{u}_{t-1}=\{S_{t-1}^{u'}\,|\,u'\in\{u\}\cup N(u)\}$, where $N(u)$ is the set of $u$'s friends. Given $\mathbb{S}^{u}_{t-1}$, sequence-based social recommendation aims to predict which item $v$ is likely to be clicked by user $u$.

In a real-world data set, the length of a user's behavior sequence can be up to several hundred, which is hard for many models to handle. To simplify the problem, we transform a user's previous behavior sequence $S_{t-1}^{u}=(v_{0}^{u},v_{1}^{u},v_{2}^{u},\cdots,v_{t-1}^{u})$ into a fixed-length sequence $\hat{S}_{t-1}^{u}=(v_{0},v_{1},\ldots,v_{m-1})$, where $v_{j}\in V$ for $0\leq j\leq m-1$. Here $m$ is the maximum length our model can handle, and $\hat{S}_{t-1}^{u}$ consists of the most recent $m$ items in $S_{t-1}^{u}$. If the sequence length is less than $m$, we repeatedly add a 'blank' item to the left until the length is $m$. Similarly, $\hat{\mathbb{S}}^{u}_{t-1}$ denotes the fixed-length version of the sequences in $\mathbb{S}^{u}_{t-1}$. Handling longer sequences is left as future work.
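The truncate-and-pad transform is simple to implement. Below is a minimal Python sketch; the reserved padding id (0 here) stands in for the 'blank' item and is an assumption, since the text does not fix a concrete id.

```python
def to_fixed_length(sequence, m, blank_id=0):
    """Keep the most recent m items; left-pad with a 'blank' item.

    blank_id is an assumption: any reserved id works as long as
    real item ids never collide with it.
    """
    recent = sequence[-m:]                    # most recent m items
    padding = [blank_id] * (m - len(recent))  # left padding if too short
    return padding + recent

# Example with m = 5
print(to_fixed_length([3, 8, 2], 5))           # [0, 0, 3, 8, 2]
print(to_fixed_length([1, 4, 9, 7, 6, 5], 5))  # [4, 9, 7, 6, 5]
```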

3. Model Framework

Motivated by the observation that a user’s behavior can be determined by personal preference and socially influenced preference, we propose a novel method named SocialTrans for recommendation systems in social-network platforms. Figure 3 provides an overview of SocialTrans. SocialTrans is composed of three modules: personal preference modeling, socially influenced preference modeling, and rating prediction. First, a user’s personal preference is modeled by a multi-layer Transformer (Kang and McAuley, 2018), which can capture his dynamic interest (§3.1). Second, a multi-layer GAT (Veličković et al., 2018) is used to model socially influenced preference from his friends. The GAT we use is an extended version that considers edge attributes and uses the multi-head attention mechanism (§3.2). Finally, the user’s personal preference and socially influenced preference are fused to get the final representation. Rating scores between users and items are computed to produce recommendation results (§3.3).

Figure 3. SocialTrans model architecture. It contains three modules: a multi-layer Transformer for personal preference modeling, a multi-layer GAT for socially influenced preference modeling, and a rating & prediction module.

3.1. Personal Preference Modeling

The personal preference modeling module tries to capture how users’ dynamic interests evolve over time. To be specific, this module generates a user’s personal preference embedding at the current timestamp given his behavior sequence.

We use a multi-layer Transformer (Kang and McAuley, 2018) to capture users' personal preferences. The Transformer is widely used in sequence modeling and is able to capture the correlation between any pair of positions in a sequence. As shown in Figure 4, a Transformer layer contains three sub-layers: a Multi-Head Attention sub-layer, a Feed-Forward Network sub-layer, and an Add & Norm sub-layer. We now describe the input, the output, and these sub-layers in detail.

Input Embedding. The input matrix $\mathbf{H}^{(0)}\in\mathbb{R}^{m\times d}$ to the multi-layer Transformer is mainly constructed from the items in the user's behavior sequence $\hat{S}_{t-1}^{u}=(v_{0},v_{1},\ldots,v_{m-1})$. Here $d$ is the hidden dimension, and each item $v\in V$ is represented as a row $\mathbf{w}_{v}$ in the item embedding matrix $\mathbf{W}\in\mathbb{R}^{|V|\times d}$. Since the Transformer is not otherwise aware of item positions, each position $\tau$ is associated with a learnable position embedding vector $\mathbf{p}_{\tau}\in\mathbb{R}^{d}$ that carries location information for the corresponding item. Each row $\mathbf{h}^{(0)}_{\tau}$ of $\mathbf{H}^{(0)}$ is defined as:

(1) $\mathbf{h}^{(0)}_{\tau}=\mathbf{w}_{v_{\tau}}+\mathbf{p}_{\tau}$

Multi-head Self-Attention. Attention mechanisms are widely used in sequence modeling tasks. They allow a model to capture the relationship between any pair of positions in a sequence. Recent work (Li et al., 2018; Devlin et al., 2019) has shown that attending to different representation subspaces simultaneously is beneficial. In this work, we adopt multi-head self-attention as in (Vaswani et al., 2017), which allows the model to jointly attend to information from different representation subspaces.

First, the scaled dot-product attention is applied to each head. This attention function can be described as mapping a set of query-key-value tuples to an output. It is defined as:

(2) $\text{Attention}(\mathbf{Q},\mathbf{K},\mathbf{V})=\text{softmax}\big(\frac{\mathbf{Q}\mathbf{K}^{T}}{\sqrt{d_{s}}}\big)\mathbf{V}$

Here $\mathbf{Q},\mathbf{K},\mathbf{V}\in\mathbb{R}^{m\times d_{s}}$ represent the query, key, and value matrices. Moreover, $d_{s}$ is the dimensionality of each head, and $d_{s}=d/r$ for an $r$-head model. Scaled dot-product attention computes a weighted sum of all values, where the weights are calculated from the query matrix $\mathbf{Q}$ and the key matrix $\mathbf{K}$. The scaling factor $\sqrt{d_{s}}$ is used to avoid overly large weights, especially when the dimensionality is high.
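As an illustration, here is a minimal numpy sketch of Equation (2); the max-subtraction inside the softmax is a standard numerical-stability trick, not part of the formulation above.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Equation (2): softmax(Q K^T / sqrt(d_s)) V, where rows are positions."""
    d_s = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d_s)                     # (m, m) unnormalized weights
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # row-wise softmax
    return weights @ V                                  # weighted sum of values

m, d_s = 4, 8
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(m, d_s)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)      # (4, 8)
```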

For attention head $i$ in layer $l$, all inputs to the scaled dot-product attention come from layer $l-1$'s output. This implies a self-attention mechanism. The query, key, and value matrices are linear projections of $\mathbf{H}^{(l-1)}$. The head is defined as:

(3) $\text{head}_{i}^{(l)}=\text{Attention}\big(\mathbf{Q}^{(l,i)},\mathbf{K}^{(l,i)},\mathbf{V}^{(l,i)}\big)$, where $\mathbf{Q}^{(l,i)}=\mathbf{H}^{(l-1)}\mathbf{W}^{(l,i)}_{Q}$, $\mathbf{K}^{(l,i)}=\mathbf{H}^{(l-1)}\mathbf{W}^{(l,i)}_{K}$, and $\mathbf{V}^{(l,i)}=\mathbf{H}^{(l-1)}\mathbf{W}^{(l,i)}_{V}$

In the above equation, $\mathbf{W}^{(l,i)}_{Q},\mathbf{W}^{(l,i)}_{K},\mathbf{W}^{(l,i)}_{V}\in\mathbb{R}^{d\times d_{s}}$ are the matrices that project the input $\mathbf{H}^{(l-1)}$ into the latent spaces of query, key, and value. Row $\tau$ of $\text{head}_{i}^{(l)}$ corresponds to an intermediate representation of the user's behavior sequence at timestamp $\tau$.

Items in a user behavior sequence are produced one by one, so the model should only consider previous items when predicting the next item. However, the aggregated value in Equation (3) contains information about subsequent items, which makes the model ill-defined. Therefore, we remove all links between row $\tau$ in $\mathbf{Q}$ and row $\tau'$ in $\mathbf{V}$ whenever $\tau'>\tau$.
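One way to realize this causal constraint is to mask the attention logits before the softmax, setting the entries of removed links to negative infinity. A sketch, assuming the logit-masking implementation (the text above only states which links are removed):

```python
import numpy as np

def causal_attention(Q, K, V):
    """Masked variant of Equation (2): position tau attends only to tau' <= tau."""
    m, d_s = Q.shape
    logits = Q @ K.T / np.sqrt(d_s)
    mask = np.triu(np.ones((m, m), dtype=bool), k=1)  # True above the diagonal (future)
    logits = np.where(mask, -np.inf, logits)          # remove links to subsequent items
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V
```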

After all attention heads in layer $l$ are computed, their outputs are concatenated and projected by $\mathbf{W}^{(l)}_{O}\in\mathbb{R}^{d\times d}$, resulting in the final output $\mathbf{A}^{(l)}\in\mathbb{R}^{m\times d}$ of the multi-head attention sub-layer:

(4) $\mathbf{A}^{(l)}=\text{Concat}\big(\text{head}_{1}^{(l)},\text{head}_{2}^{(l)},\cdots,\text{head}_{r}^{(l)}\big)\mathbf{W}^{(l)}_{O}$

Feed Forward Network. Although the multi-head self-attention sub-layer aggregates information from previous items, it is still a linear model. To improve the representational power of our model, we apply a two-layer feed-forward network to each position:

(5) $\mathbf{F}^{(l)}=f\big(\mathbf{A}^{(l)}\mathbf{W}_{\text{FFN}}^{(l,1)}+\mathbf{b}_{\text{FFN}}^{(l,1)}\big)\mathbf{W}_{\text{FFN}}^{(l,2)}+\mathbf{b}_{\text{FFN}}^{(l,2)}$

where $\mathbf{W}_{\text{FFN}}^{(l,1)},\mathbf{W}_{\text{FFN}}^{(l,2)}\in\mathbb{R}^{d\times d}$ and $\mathbf{b}_{\text{FFN}}^{(l,1)},\mathbf{b}_{\text{FFN}}^{(l,2)}\in\mathbb{R}^{d}$. Moreover, $f$ is an activation function, for which we choose the Gaussian Error Linear Unit as in (Hendrycks and Gimpel, 2016). The nonlinear transformation is applied to each position independently, meaning that no information is exchanged across positions in this sub-layer.
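For concreteness, a sketch of Equation (5) applied row-wise, using the common tanh approximation of GELU (the exact erf-based form in (Hendrycks and Gimpel, 2016) differs slightly):

```python
import numpy as np

def gelu(x):
    # tanh approximation of the Gaussian Error Linear Unit
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def feed_forward(A, W1, b1, W2, b2):
    """Equation (5): position-wise two-layer network; no mixing across rows."""
    return gelu(A @ W1 + b1) @ W2 + b2
```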

Add & Norm Sub-layer. Training a multi-layer network is difficult because the vanishing gradient problem may occur during back-propagation. Residual networks (He et al., 2016) have shown their effectiveness in solving this problem. The core idea behind a residual network is to propagate outputs of lower layers to higher layers simply by adding them: if lower-layer outputs are useful, the model can skip higher layers to get the necessary information. Let $\mathbf{X}$ be the output from the lower layer and $\mathbf{Y}$ the output from the higher layer. The residual (add) layer is defined as:

(6) $\text{Add}(\mathbf{X},\mathbf{Y})=\mathbf{X}+\mathbf{Y}$

In addition, to stabilize and accelerate neural network training, we apply Layer Normalization (Ba et al., 2016) to the residual layer output. Assuming the input is a vector $\boldsymbol{z}$, the operation is defined as:

(7) $\text{LayerNorm}(\boldsymbol{z})=\boldsymbol{\alpha}\odot\frac{\boldsymbol{z}-\mu}{\sqrt{\sigma^{2}+\epsilon}}+\boldsymbol{\beta}$

where $\mu$ and $\sigma^{2}$ are the mean and variance of $\boldsymbol{z}$, $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$ are learned scaling and bias terms, and $\odot$ denotes the element-wise product.
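A minimal sketch of the combined Add & Norm sub-layer, applying Equations (6) and (7) row-wise to matrix-shaped sub-layer outputs (the row-wise application is our assumption; Equation (7) is stated for a single vector):

```python
import numpy as np

def add_and_norm(X, Y, alpha, beta, eps=1e-8):
    """Equations (6)-(7): residual add followed by layer normalization (per row)."""
    z = X + Y                                # Equation (6)
    mu = z.mean(axis=-1, keepdims=True)
    var = z.var(axis=-1, keepdims=True)
    return alpha * (z - mu) / np.sqrt(var + eps) + beta  # Equation (7)
```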

Personal Preference Embedding Output. SocialTrans encapsulates a user's previous behavior sequence into a $d$-dimensional embedding. Let $l_{T}$ be the number of Transformer layers, with output $\mathbf{H}^{(l_{T})}$. We take $\mathbf{h}^{(l_{T})}_{m-1}$, the last row of $\mathbf{H}^{(l_{T})}$, as the user's personal preference embedding. This embedding is expressive because $\mathbf{h}^{(l_{T})}_{m-1}$ aggregates information from all previous items through the multi-head self-attention layers; moreover, stacking multiple layers gives it highly non-linear expressive power.

Figure 4. Illustration of a layer in Transformer.

3.2. Socially Influenced Preference Modeling

Birds of a feather flock together: a user's behavior is influenced by his friends (Marsden and Friedkin, 1993; McPherson et al., 2001). We therefore incorporate social information to further model user latent factors. Meanwhile, different social connections exert different influence on a user; in other words, the learning of social-space user latent factors should account for the varying strengths of social relations. Therefore, we introduce a multi-head graph attention network (GAT) (Veličković et al., 2018) to select friends that are representative in characterizing users' socially influenced preferences. We also consider edge attributes to learn context-dependent social influence. Next, we describe the input, the output, and our modified GAT in detail.

Input Embedding. In §3.1, we described how to obtain a user's personal preference embedding from his behavior sequence. Suppose we want to generate the socially influenced preference embedding of a user $u$. This module's inputs are the personal preference embeddings of $u$ and his friends. Specifically, for a user $u$, the input is $\{\hat{\mathbf{h}}^{(0)}_{u'}\,|\,u'\in\{u\}\cup N(u)\}$, where $\hat{\mathbf{h}}^{(0)}_{u'}=\mathbf{h}^{(l_{T})}_{m-1}$ is user $u'$'s personal preference embedding and $N(u)$ is the set of user $u$'s friends.

Graph Attention Network. Veličković et al. (2018) introduce graph attention networks (GATs), which assign different weights to different nodes in a neighborhood. Social influence is often context-dependent: it depends on both friends' preferences and the degree of closeness with those friends. We use a GAT to aggregate contextual information from the user's friends. In this work, we propose an extended GAT that uses edge attributes and the multi-head attention mechanism. Figure 5 shows our modified version of GAT.

We first calculate the similarity score $\delta_{u,u'}^{(l)}$ between the target user's embedding $\hat{\mathbf{h}}^{(l-1)}_{u}$ and each neighbor's embedding $\hat{\mathbf{h}}^{(l-1)}_{u'}$:

(8) $\delta_{u,u'}^{(l)}=\big(\hat{\mathbf{W}}_{Q}^{(l)}\hat{\mathbf{h}}^{(l-1)}_{u}\big)^{T}\big(\hat{\mathbf{W}}_{K}^{(l)}\hat{\mathbf{h}}^{(l-1)}_{u'}\big)+\big(\hat{\mathbf{w}}_{E}^{(l)}\big)^{T}\mathbf{e}_{u,u'}$

Then we normalize the similarity scores into a probability distribution:

(9) $\kappa_{u,u'}^{(l)}=\frac{\exp(\delta_{u,u'}^{(l)})}{\sum_{i\in N(u)\cup\{u\}}\exp(\delta_{u,i}^{(l)})}$

where $\hat{\mathbf{W}}_{Q}^{(l)},\hat{\mathbf{W}}_{K}^{(l)}\in\mathbb{R}^{d\times d}$ in Equation (8) are the query and key projection matrices, analogous to those in Equation (3) for the Transformer. $\mathbf{e}_{u,u'}$ is a vector of attributes of the edge between $u$ and $u'$, and $\hat{\mathbf{w}}_{E}^{(l)}$ is a weight vector applied to these attributes. Equation (8) thus computes the similarity score from the user's and the friend's representations together with the corresponding edge attributes, which lets us combine both the preferences of friends and the degree of closeness with friends.

Intuitively, $\kappa_{u,u'}^{(l)}$ is the social influence strength of friend $u'$ on user $u$. We aggregate the social influence of all friends as:

(10) $\hat{\mathbf{h}}^{(l)}_{u}=f\big(\sum_{i\in N(u)\cup\{u\}}\kappa_{u,i}^{(l)}\hat{\mathbf{W}}_{V}^{(l)}\hat{\mathbf{h}}^{(l-1)}_{i}\big)$

where $\hat{\mathbf{W}}_{V}^{(l)}\in\mathbb{R}^{d\times d}$ is the value projection matrix and $f$ is the Gaussian Error Linear Unit activation, as in Equation (5).
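Putting Equations (8)-(10) together for a single head, a numpy sketch; the neighbor ordering, the self-loop edge attributes, and the illustrative shapes are our assumptions:

```python
import numpy as np

def gelu(x):  # tanh approximation, as in the FFN sketch above
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def gat_single_head(h_u, H_nbrs, E_attrs, W_Q, W_K, W_V, w_E):
    """One single-head GAT step for user u (Equations (8)-(10)).

    h_u:     (d,)   target user's previous-layer embedding
    H_nbrs:  (n, d) embeddings of N(u) plus u himself
    E_attrs: (n, a) edge-attribute vectors e_{u,u'} (self-loop attrs assumed)
    """
    q = W_Q @ h_u
    deltas = (W_K @ H_nbrs.T).T @ q + E_attrs @ w_E   # Equation (8), all neighbors
    kappa = np.exp(deltas - deltas.max())
    kappa /= kappa.sum()                              # Equation (9): softmax
    return gelu((kappa[:, None] * (H_nbrs @ W_V.T)).sum(axis=0))  # Equation (10)
```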

In practice, we find the multi-head attention mechanism useful for jointly capturing context semantics in different subspaces. We extend Equations (8), (9), and (10) to:

(11) $\delta_{u,u'}^{(l,i)}=\big(\hat{\mathbf{W}}_{Q}^{(l,i)}\hat{\mathbf{h}}^{(l-1)}_{u}\big)^{T}\big(\hat{\mathbf{W}}_{K}^{(l,i)}\hat{\mathbf{h}}^{(l-1)}_{u'}\big)+\big(\hat{\mathbf{w}}_{E}^{(l,i)}\big)^{T}\mathbf{e}_{u,u'}$
(12) $\kappa_{u,u'}^{(l,i)}=\frac{\exp(\delta_{u,u'}^{(l,i)})}{\sum_{j\in N(u)\cup\{u\}}\exp(\delta_{u,j}^{(l,i)})}$
(13) $\hat{\mathbf{h}}^{(l,i)}_{u}=f\big(\sum_{j\in N(u)\cup\{u\}}\kappa_{u,j}^{(l,i)}\hat{\mathbf{W}}_{V}^{(l,i)}\hat{\mathbf{h}}^{(l-1)}_{j}\big)$

where $\hat{\mathbf{W}}_{Q}^{(l,i)},\hat{\mathbf{W}}_{K}^{(l,i)},\hat{\mathbf{W}}_{V}^{(l,i)}\in\mathbb{R}^{d_{s}\times d}$ and $\hat{\mathbf{w}}_{E}^{(l,i)}$ are the corresponding parameters for head $i$. Finally, all $r$ heads are concatenated and projected by $\hat{\mathbf{W}}^{(l)}_{O}\in\mathbb{R}^{d\times d}$ to get the final output embedding of layer $l$:

(14) $\hat{\mathbf{h}}^{(l)}_{u}=\hat{\mathbf{W}}^{(l)}_{O}\big[\hat{\mathbf{h}}^{(l,1)}_{u};\hat{\mathbf{h}}^{(l,2)}_{u};\cdots;\hat{\mathbf{h}}^{(l,r)}_{u}\big]$

Social Embedding Output. We stack $l_{G}$ GAT layers and take the final representation $\hat{\mathbf{h}}^{(l_{G})}_{u}$ as user $u$'s socially influenced preference embedding. It encapsulates social information from user $u$ and his friends. Note that stacking too many GAT layers may hurt performance: the analysis in Figure 2 suggests that most social information comes from one-hop friends. Thus, we use at most two GAT layers.

Figure 5. Socially influenced preference is modeled by a multi-head, context-dependent GAT.

3.3. Rating & Prediction

A user's decisions depend on both his dynamic personal preference and the socially influenced preference from his friends. The final representation is obtained by merging the personal preference embedding and the socially influenced preference embedding:

(15) $\tilde{\mathbf{h}}_{u}=\mathbf{W}_{F}\big[\mathbf{h}^{(l_{T})}_{m-1};\hat{\mathbf{h}}^{(l_{G})}_{u}\big]$

where $\mathbf{W}_{F}\in\mathbb{R}^{d\times 2d}$ and $\tilde{\mathbf{h}}_{u}$ is user $u$'s final representation.

The probability that the next item will be $v$ is computed using a softmax function:

(16) $p(v_{m}=v\,|\,\hat{\mathbb{S}}_{t-1}^{u})=\frac{\exp(\tilde{\mathbf{h}}_{u}^{T}\mathbf{w}_{v})}{\sum_{i\in V}\exp(\tilde{\mathbf{h}}_{u}^{T}\mathbf{w}_{i})}$

We train the model parameters by maximizing the log-probability of all observed sequences:

(17) $\sum_{u\in U}\sum_{t}\log p(v_{m}=v\,|\,\hat{\mathbb{S}}_{t-1}^{u})$

To predict which item user $u$ will click, we compute the scores of all items according to Equation (16) and return the items with the top $K$ scores as recommendations. To avoid the large computational cost in a real-world application, an approximate nearest neighbor search method (Andoni and Indyk, 2006) is used.
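Concretely, recommendation reduces to a maximum inner product search over item embeddings. The brute-force sketch below computes exact top-$K$ scores; a production system replaces it with the approximate search mentioned above:

```python
import numpy as np

def recommend_top_k(h_user, W_items, K=20):
    """Score all items by dot product with the fused user embedding (the scores
    share their ranking with Equation (16)) and return the top-K item ids."""
    scores = W_items @ h_user                   # (|V|,)
    top_k = np.argpartition(-scores, K)[:K]     # unordered top-K
    return top_k[np.argsort(-scores[top_k])]    # sorted by descending score
```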

4. Large Scale Implementation

A real-world application may involve billions of users and millions of items. The main challenge in deploying the above model as an online service is the large number of graph edges. For example, the social network of WeChat contains hundreds of billions of edges; the data is multiple terabytes in size and cannot be loaded into the memory of any commercial server, so many graph operations fail in this scenario. In addition, directly computing the matching score between user $u$'s representation vector $\tilde{\mathbf{h}}_{u}$ and the item representation matrix $\mathbf{W}$ is computationally costly. In this part, we discuss several implementation details of applying SocialTrans to large-scale industrial recommendation systems.

Graph sampling. Directly applying the graph attention operation over the whole graph is impossible: a node can have thousands of neighbors in our data. We use a graph sampling technique to create a sub-graph containing nodes and their neighbors, which is computationally feasible within a minibatch. Each node samples its $n$-hop sub-graph independently. Neighbors in each hop can be sampled uniformly or according to a specific edge attribute; for example, when sampling by the number of commonly clicked items, a neighbor has a greater chance of being sampled if he has clicked more items in common with the user. The sampling process is repeated several times for each node and implemented in a MapReduce-style data pre-processing program. In this way, we can remove the graph storage component and reduce communication costs during the training stage.
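A sketch of the per-node neighbor sampling under these assumptions, with sampling weights taken from a generic edge attribute such as the common-click count:

```python
import numpy as np

def sample_neighbors(neighbors, weights, k, rng):
    """Sample up to k neighbors with probability proportional to an edge
    attribute (e.g., commonly clicked items); keep all if there are few."""
    if len(neighbors) <= k:
        return list(neighbors)
    p = np.asarray(weights, dtype=float)
    p /= p.sum()
    return list(rng.choice(neighbors, size=k, replace=False, p=p))

rng = np.random.default_rng(7)
print(sample_neighbors([11, 12, 13, 14], [5, 1, 1, 1], k=2, rng=rng))
```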

Sampling negative items. There are millions of candidate items in our setting. Many methods require computing a matching score between $\tilde{\mathbf{h}}_{u}$ and $\mathbf{w}_{v}$, after which the softmax function is applied to obtain the predicted probability. We use negative sampling to reduce the computational cost of the softmax. In each training minibatch, we sample a set $J$ of 1000 negative items shared by all users. The probability that the next item is $v$ is approximated in this minibatch as:

(18) $p(v_{m}=v\,|\,\hat{\mathbb{S}}_{t-1}^{u})=\frac{\exp(\tilde{\mathbf{h}}_{u}^{T}\mathbf{w}_{v})}{\sum_{i\in\{v\}\cup J}\exp(\tilde{\mathbf{h}}_{u}^{T}\mathbf{w}_{i})}$

The negative item sampling probability is proportional to the item's appearance count in the training data. Negative sampling provides an approximation of the original softmax function; empirically, we do not observe a decline in performance.
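Equation (18) and the negative-set construction in code; a sketch where the shared negatives are drawn proportionally to appearance counts, as described above:

```python
import numpy as np

def sample_negatives(appearance_counts, num_samples, rng):
    """Shared negatives for a minibatch, proportional to appearance counts."""
    p = appearance_counts / appearance_counts.sum()
    return rng.choice(len(appearance_counts), size=num_samples, p=p)

def sampled_softmax_prob(h_user, w_pos, W_neg):
    """Equation (18): positive item scored against the sampled negatives only."""
    logits = np.concatenate(([h_user @ w_pos], W_neg @ h_user))
    logits -= logits.max()          # numerical stability
    probs = np.exp(logits)
    return probs[0] / probs.sum()
```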

We use Adam (Kingma and Ba, 2015) for optimization because of its effectiveness. However, directly applying Adam would be computationally infeasible because its update touches all trainable variables: the item representation matrix $\mathbf{W}$ would be updated even though many items do not appear in a given minibatch. We therefore adopt a slightly modified version of Adam that only updates the items appearing in the minibatch, along with the other trainable variables. We update each trainable variable $\theta$ at training step $k$ according to the following equations:

(19) $\theta_{k}=\begin{cases}\theta_{k-1}-\eta\frac{m_{k}/(1-\beta_{1}^{k})}{\sqrt{n_{k}/(1-\beta_{2}^{k})}+\epsilon}&\theta\in\Theta_{k}\\\theta_{k-1}&\text{otherwise}\end{cases}$
(20) $m_{k}=\begin{cases}\beta_{1}m_{k-1}+(1-\beta_{1})\nabla_{\theta}L(\theta_{k-1})&\theta\in\Theta_{k}\\m_{k-1}&\text{otherwise}\end{cases}$
(21) $n_{k}=\begin{cases}\beta_{2}n_{k-1}+(1-\beta_{2})(\nabla_{\theta}L(\theta_{k-1}))^{2}&\theta\in\Theta_{k}\\n_{k-1}&\text{otherwise}\end{cases}$

In the above equations, $\beta_{1},\beta_{2},\epsilon$ are Adam's hyperparameters, fixed to 0.9, 0.999, and $10^{-8}$ respectively. $\Theta_{k}$ is the set of item representation parameters and other trainable variables that appear in minibatch $k$. For items that do not appear, their parameters and Adam statistics remain unchanged at this step. Empirically, combining the negative sampling and sparse Adam update techniques speeds up training by 3-5x when there are millions of items.
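Equations (19)-(21) for a single row-indexed parameter matrix; a sketch of the sparse variant in which only the rows touched in the current minibatch advance (the dense in-memory layout of the Adam statistics is an assumption):

```python
import numpy as np

def sparse_adam_step(theta, m, n, grad, touched, k, eta=1e-3,
                     beta1=0.9, beta2=0.999, eps=1e-8):
    """Update only the rows in `touched`; other rows' parameters and
    Adam statistics remain unchanged at this step."""
    g = grad[touched]
    m[touched] = beta1 * m[touched] + (1 - beta1) * g        # Equation (20)
    n[touched] = beta2 * n[touched] + (1 - beta2) * g ** 2   # Equation (21)
    m_hat = m[touched] / (1 - beta1 ** k)                    # bias correction
    n_hat = n[touched] / (1 - beta2 ** k)
    theta[touched] -= eta * m_hat / (np.sqrt(n_hat) + eps)   # Equation (19)
    return theta, m, n
```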

Multi-GPU training. The socially influenced preference module's inputs are the personal preference embeddings of a user and his friends, which are expensive to compute. It is therefore necessary to use multiple GPUs on a single machine to speed up training. We adopt the data parallelism paradigm: each minibatch is split into equal-sized sub-minibatches, and each GPU runs the forward and backward passes over one sub-minibatch using the same parameters. After all backward passes complete, gradients are aggregated and used to update the parameters. For training efficiency, we use as large a minibatch size as GPU memory allows.

Embedding Generation. Because the input and output data are large, generating the fused embeddings on a single machine would require a huge amount of disk space. Since the generation procedure is less computationally expensive than training, we implement it on a cluster without GPUs, split into three stages:

  (1) The first stage generates the personal preference embedding $\mathbf{h}^{(l_{T})}_{m-1}$ for all users and the item embedding matrix $\mathbf{W}$. This stage is less computationally expensive than the training stage, and we implement it on a distributed Spark (Zaharia et al., 2010) cluster.

  (2) The second stage retrieves each user's and his friends' personal preference embeddings. This stage requires a lot of disk access and network communication; it is implemented as a SparkSQL query on a distributed data warehouse.

  (3) The final stage generates the users' socially influenced preference embedding $\hat{\mathbf{h}}^{(l_{G})}_{u}$ and fuses it with the personal preference embedding $\mathbf{h}^{(l_{T})}_{m-1}$ to get the final embedding $\tilde{\mathbf{h}}_{u}$. Another Spark program generates the final user embeddings.

The intermediate results and all embedding outputs are stored in a distributed data warehouse. Downstream tasks retrieve the results to provide online service.

5. EXPERIMENTS

In this section, we first describe the experimental data sets, compared methods, and evaluation metrics. Then, we present offline experiments on two data sets and an online evaluation of our method on an article recommendation system in WeChat. Specifically, we aim to answer the following questions:

Q1: How does SocialTrans outperform the state-of-the-art methods for the recommendation tasks?

Q2: How does the performance of SocialTrans change under different circumstances?

Q3: What is the quality of generated user representation for online services?

5.1. Data Sets

We evaluate our model on three data sets. For offline evaluation, we test on the benchmark Yelp data set and the WeChat Official Accounts data set. For online evaluation, we conduct experiments on WeChat Top Stories, a major article recommendation application in China (all users in the WeChat data sets are anonymous). The statistics of these data sets are summarized in Table 1. We describe them in detail as follows:

Yelp (https://www.yelp.com/dataset). Yelp is a popular review website in the United States, where users review local businesses such as restaurants and shops. We treat each review from 01/01/2012 to 11/14/2018 as an interaction. This data set includes 3.3 million user reviews, more than 170,000 businesses, more than 250,000 users, and 4.7 million friendship links. The last 150 days of reviews are used as the test set, and the training set is constructed from the remaining days. Items that do not appear in the training set are removed from the test set. For each recommendation, we use the user's own interactions and his friends' interactions before the recommendation time.

WeChat Official Accounts. WeChat is a Chinese mobile messaging app with more than one billion active users. WeChat users can register an official account, which can push articles to subscribed users. We sample users and their social networks restricted to an anonymous city in China, and treat each official account reading activity as an item click event. The training set is constructed from users' reading logs in June 2019; the first four days of July 2019 are held out for testing. We remove items that appear fewer than five times in the training data to ensure recommendation quality. After processing, this data set contains more than 736,000 users and 48,000 items, and each user has an average of 81 friends.

WeChat Top Stories. WeChat users receive article recommendations in Top Stories, whose contents are provided by WeChat official accounts. Again, we treat each official account reading activity as an item click event. Training logs are constructed from users' reading logs in June 2019; testing is conducted in an online environment over five consecutive days in July 2019. This data set contains billions of users and millions of items; we withhold specific values for business confidentiality.

|              | Yelp       | WeChat Official Accounts | WeChat Top Stories |
| Users        | 251,181    | 736,984                  | ~Billions          |
| Items        | 173,567    | 48,150                   | ~Millions          |
| Events       | 3,260,808  | 11,053,791               | ~Tens of billions  |
| Relations    | 4,753,048  | 59,809,526               | -                  |
| Avg. friends | 18.92      | 81.15                    | ~Hundreds          |
| Avg. events  | 12.98      | 14.99                    | ~Tens              |
| Start date   | 01/01/2012 | 01/06/2019               | 01/06/2019         |
| End date     | 11/14/2018 | 04/07/2019               | 30/06/2019         |
| Evaluation   | Offline    | Offline                  | Online             |

Table 1. The statistics of the experimental data sets.

5.2. Offline Model Comparison - Q1

In this subsection, we evaluate the performance of different models on Yelp and WeChat Official Accounts data sets.

5.2.1. Evaluation Metrics.

In the offline evaluation, we are interested in recommending items directly. Each algorithm recommends items according to their ranking scores. We evaluate two widely used ranking-based metrics: Recall@K and Normalized Discounted Cumulative Gain (NDCG).

Recall@K measures the average proportion of the top-K recommended items that are in the test set.

NDCG measures the rank of each clicked user-item pair under a model. It is formulated as $\text{NDCG}=\frac{1}{\log_{2}(1+\text{rank})}$. We report the average value over all the testing data for model comparison.
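Both metrics are straightforward to compute per test event; a sketch under the NDCG formulation above, which assigns one rank per clicked user-item pair:

```python
import numpy as np

def recall_at_k(recommended, clicked, K):
    """Fraction of the test user's clicked items that appear in the top-K list."""
    hits = len(set(recommended[:K]) & set(clicked))
    return hits / len(clicked)

def ndcg(rank):
    """The paper's formulation for a single clicked item: 1 / log2(1 + rank)."""
    return 1.0 / np.log2(1 + rank)

print(recall_at_k([5, 2, 9, 1], clicked=[2, 7], K=3))  # 0.5
print(ndcg(rank=1))                                    # 1.0
```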

5.2.2. Compared Models.

We compare SocialTrans with several state-of-the-art baselines. The details of these models are described as follows:

POP: a rule-based method that recommends a static list of popular items, ranked by their number of appearances in the training data.

GRU4Rec (Hidasi et al., 2016): a sequence-based approach that captures users’ personal preferences by a recurrent neural network. Items are recommended according to these personal preferences.

SASRec (Kang and McAuley, 2018): another sequence-based approach, which uses a multi-layer Transformer (Vaswani et al., 2017) to capture users' personal preferences as they evolve over time.

metapath2vec (Dong et al., 2017): an unsupervised graph-based approach. It first generates meta-path-guided random walk sequences; we utilize the user-item bipartite graph and set the meta-path to "user-item-user". An embedding for each user/item is learned from these sequences to preserve neighborhood proximity. For recommendation, the rank score is the dot product of the user and item embeddings.

DGRec (Song et al., 2019): an approach utilizing both temporal and social factors. Each user representation is a fusion of personal and socially influenced preferences: a recurrent neural network models a user's short-term interest, a unique user embedding captures the long-term interest, and socially influenced preference is captured by a graph attention network without edge attributes or the multi-head attention mechanism.

Social-GAT: our modified version of the graph attention network (GAT) (Veličković et al., 2018). Here a user's personal preference embedding is the average embedding of his previously clicked items. A GAT over the social graph is then applied to capture socially influenced preference, and the user's final representation is generated by fusing the personal and socially influenced preferences.

5.2.3. Evaluation details.

We train all models with a batch size of 128 and a fixed learning rate of 0.001, using our modified version of Adam (Kingma and Ba, 2015) for optimization (see §4). For all models, the dimension of user and item embeddings is fixed to 100. We tune other model parameters using 80% of the training logs, holding out 20% for validation. To avoid overfitting, dropout (Srivastava et al., 2014) with rate 0.1 is used. The neighborhood sampling size in GAT layers is empirically set between 20 and 50.

5.2.4. Comparative Results

We summarize the experimental results of SocialTrans and baseline models in Table 2. We have the following findings:

  • SocialTrans outperforms all baseline models. It achieves a 9.56% relative recall@20 improvement over Social-GAT on the Yelp data set; on WeChat Official Accounts, the relative recall@10 improvement is 9.62%.

  • The sequence-based methods GRU4Rec and SASRec perform better than POP and metapath2vec, because the latter two do not consider that user interests evolve over time.

  • DGRec does not obtain a larger performance boost than the sequence-based methods on either data set. DGRec learns a unique user embedding to capture each user's long-term interest, which leads to a large increase in model parameters because the number of users is large in both data sets. Similar results are observed for metapath2vec, which represents each user with only a unique embedding. We believe that DGRec and metapath2vec suffer from overfitting due to their large number of model parameters on these data sets.

  • SocialTrans consists of two components: the user's personal preference and the socially influenced preference. Considering only users' own preferences, SocialTrans degenerates to SASRec; without modeling users' preferences as dynamic, SocialTrans degrades to Social-GAT. Notice that SocialTrans, SASRec, and Social-GAT all provide strong performance, which shows the effectiveness of our model's different variations.

  • SocialTrans and Social-GAT achieve a greater performance gain than other methods on these data sets, which suggests that the social factor has great potential business value. We believe the performance boost comes from the sharing services provided by the applications: in the WeChat Official Accounts scenario, users can share official accounts with their friends, so official accounts subscribed to by more friends are more likely to be recommended and clicked.

In summary, the comparison shows that (1) both temporal and social factors help recommendation; (2) restricting model capacity is critical to avoid overfitting on real-world data sets; and (3) our model outperforms the baseline methods.

| Model                            | Yelp recall@20 | Yelp NDCG | WeChat recall@10 | WeChat NDCG |
| POP                              | 1.05%          | 9.39%     | 3.87%            | 12.22%      |
| GRU4Rec (Hidasi et al., 2016)    | 5.84%          | 11.68%    | 6.27%            | 13.03%      |
| SASRec (Kang and McAuley, 2018)  | 6.18%          | 12.83%    | 7.01%            | 13.50%      |
| metapath2vec (Dong et al., 2017) | 1.06%          | 9.41%     | 5.23%            | 12.42%      |
| DGRec (Song et al., 2019)        | 5.92%          | 12.71%    | 6.59%            | 13.04%      |
| Social-GAT                       | 6.27%          | 12.92%    | 8.52%            | 14.45%      |
| SocialTrans                      | 6.87%          | 13.23%    | 9.75%            | 15.19%      |

Table 2. Offline comparison of different models.

5.3. Offline Study - Q2

We are interested in the performance of SocialTrans under different circumstances. First, we analyze the performance of different models for users with different numbers of friends. Then, we analyze the performance of SocialTrans with different numbers of layers.

5.3.1. Number of Friends.

In social recommendation systems, different users have different numbers of friends. When a user has very few friends, the number of items clicked by his friends is limited, and so is the potential of utilizing social information. On the other hand, when a user has many friends, many items are clicked by his friends both in the past and in the future. In this case, SocialTrans can leverage social information, learn more, and make better predictions.

We investigate the recommendation performance of SocialTrans and SASRec (Kang and McAuley, 2018) for users with different numbers of friends. Figure 6 shows the result: SocialTrans consistently outperforms SASRec in all groups. On the Yelp data set, the improvement of SocialTrans over SASRec grows as users have more friends, which matches the analysis above; the relative improvement is 5.35% for the 0-2 friends group and 16.15% for the group with more than 27 friends. On the WeChat Official Accounts data set, the best relative improvement is achieved for the 21-50 friends group.

5.3.2. Number of Transformer and GAT Layers.

The performance of deep learning methods can often be improved by stacking more layers. We are interested in how the performance of SocialTrans changes with different numbers of Transformer or GAT layers. Table 3 summarizes the result. Stacking more Transformer layers boosts performance considerably: on the WeChat Official Accounts data set, recall@10 improves relatively by 7.45% when the number of Transformer layers increases from 1 to 3. Stacking more GAT layers also improves the result, but less significantly. The reason is that social influence decays quickly and most of the information is provided by one-hop neighbors, which matches the analysis in Figure 2.

Figure 6. Performance for users with different numbers of friends.
| Trans. layers | GAT layers | Yelp recall@20 | Yelp NDCG | WeChat recall@10 | WeChat NDCG |
| 1             | 1          | 6.32%          | 12.87%    | 8.99%            | 14.88%      |
| 2             | 1          | 6.62%          | 13.09%    | 9.37%            | 14.97%      |
| 3             | 1          | 6.67%          | 13.11%    | 9.66%            | 15.12%      |
| 3             | 2          | 6.87%          | 13.23%    | 9.75%            | 15.19%      |

Table 3. SocialTrans performance with different numbers of layers.

5.4. Online Evaluation - Q3

5.4.1. Evaluation Procedure.

To verify the effectiveness of our model, we conduct online A/B testing between SocialTrans and competing models in WeChat Top Stories, a major article recommendation engine in China. We divide online traffic into equal-sized portions, and each model influences only one portion. Recommending $K_{a}$ items to users directly is difficult since it requires substantial changes to the online system; furthermore, this direct approach wastes computational resources because new articles are created very quickly. In this scenario, we use the user-based collaborative filtering method shown in Figure 7 as an indirect evaluation approach.

This approach only changes the recall component of the online serving system and is less computationally expensive than the direct approach. Each model first generates a fixed-size representation for each user; for SocialTrans, this is the user's fused embedding vector. The user-pair similarity is then computed on these representations to find the top $K_{u}$ similar users. Since it is impossible to compute the similarity of all user-user pairs, we adopt the approximate nearest neighbor algorithm SimHash (Andoni and Indyk, 2006) and set $K_{u}$ to 300. The top $K_{u}$ similar users' recently read articles are then fed into the ranking model, whose parameters are fixed in our evaluation. Finally, a list of articles with the top $K_{a}$ scores is presented to the user. For safety reasons, this evaluation uses only a very small portion of the online traffic.

Figure 7. Online evaluation procedure for one model. We highlight the model component and the user representation it generates; only these components are changed, while the others remain fixed.

5.4.2. Evaluation Metrics & Compared Models

In online services, we are interested in CTR (click-through rate), a common evaluation metric in recommendation systems: the number of clicks divided by the number of recommended articles. We train models on the logs of June 2019 and evaluate them on the online A/B testing platform over five consecutive days in July 2019. We compare our model with the following models:

metapath2vec (Dong et al., 2017): a graph-based approach that takes the bipartite user-item graph as input and generates a unique embedding for each user. We choose "user-item-user" as the meta-path and take these embeddings as the user representations.

SASRec (Kang and McAuley, 2018): a multi-layer Transformer (Vaswani et al., 2017) model. We take the last Transformer layer's output as the user representation.

SocialTrans: our proposed model with a three-layer Transformer and a two-layer GAT. We take the fusion embedding as the representation of users.

5.4.3. Online Results

Table 4 and Figure 8 show the results of SocialTrans and its competitors. We choose the CTR of metapath2vec on the first day as a base value and scale all CTRs by it. SocialTrans consistently outperforms metapath2vec and SASRec across the five days, with an average relative improvement of 5.89% over metapath2vec, which shows that the quality of SocialTrans's user embeddings is better than that of the baselines.

Notice that the improvement in the online evaluation is smaller than in the offline evaluation. This is because we chose an indirect evaluation approach, collaborative filtering, which requires fewer computational resources than direct recommendation when items are added and fade away quickly. This provides an economical way to verify a new model.

| Model                            | Average Scaled CTR | Relative Improvement |
| metapath2vec (Dong et al., 2017) | 105.3%             | 0%                   |
| SASRec (Kang and McAuley, 2018)  | 109.5%             | +3.99%               |
| SocialTrans                      | 111.5%             | +5.89%               |

Table 4. Online CTRs of different models over five days. We use the first day's CTR of metapath2vec as the base value to scale all CTRs.
Refer to caption
Figure 8. Online CTRs of different models over five days.

6. RELATED WORK

6.1. Sequential Recommendation

Sequential recommendation algorithms use previous interactions to predict a user's next interaction in the near future. Unlike general recommendation, sequential recommendation views the interactions as a sequence of items in time order. Early sequential recommendation methods were based on Markov chains and construct a transition matrix. Factorizing Personalized Markov Chains (FPMC) (Rendle et al., 2010) mixes Markov chains (MC) and matrix factorization (MF) and estimates a transition cube to learn a transition matrix for each user; FPMC cannot capture relationships between items in a sequence because it assumes each item independently affects the user's next interaction. HRM (Wang et al., 2015) captures both sequential behavior and users' global interests by including transaction and user representations in prediction. Recently, many methods applying deep learning to recommendation systems have emerged. Restricted Boltzmann Machines (RBM) (Salakhutdinov et al., 2007) and Autoencoders (Sedhain et al., 2015) were among the earliest, and RBM has proven to be one of the best-performing collaborative filtering models. Caser (Tang and Wang, 2018) converts the embeddings of items in a sequence into a matrix and uses a CNN to learn local features as sequential patterns; however, CNNs can hardly capture long-term dependencies between items, and users' global features are always important for recommendation. On the other hand, RNNs have been widely used to model global sequential interactions (Yu et al., 2016). GRU4Rec (Hidasi et al., 2016) is a representative RNN-based method for sequential recommendation, which uses the final hidden state of a GRU to represent the user's current preference. NARM and EDRec (Li et al., 2017; Loyola et al., 2017) use attention mechanisms to improve on GRU4Rec. SASRec (Kang and McAuley, 2018) uses a multi-layer Transformer to capture users' dynamic interests as they evolve over time. SHAN (Ying et al., 2018b) proposes a two-layer hierarchical attention network that takes both user-item and item-item interactions into account. These models assume that a user's behavior sequence is dynamically determined by his personal preference; however, they do not utilize social information, which is important in social recommendation tasks.

6.2. Social Recommendation

In online social networks, one common assumption is that a user's preference is influenced by his friends, so introducing social relationships can improve model effectiveness. Besides, using users' friends' information can effectively alleviate the cold start and data sparsity problems in recommendation systems. Many studies model the influence of friends on user interests from different aspects, and most proposed models are based on Gaussian or Poisson matrix factorization. Ma et al. (2011) proposed a matrix factorization framework with social regularization such that the distances between connected users' embedding vectors are small. TrustMF (Yang et al., 2016) adopts matrix factorization to map users into low-dimensional latent feature spaces according to their trust relationships. SBPR (Zhao et al., 2014) utilizes social information for training instance selection. Recently, many studies have leveraged deep neural networks and network embedding approaches for social recommendation because of their strong performance. The NCF model (He et al., 2017) leverages a multi-layer perceptron to learn the user-item interaction function. NSCR (Wang et al., 2017a) enhances NCF with a pairwise pooling operation and extends it to cross-domain social recommendation. The above methods neglect the weights of relationship edges, although different types of edges should play different roles. mTrust (Tang et al., 2012a) and eTrust (Tang et al., 2012b) model trust evolution, integrating multi-faceted trust relationships into traditional rating prediction algorithms to reliably evaluate their strengths. TBPR (Wang et al., 2016) and PTPMF (Wang et al., 2017b) distinguish between strong and weak ties of users for recommendation in social networks. These approaches assume that a user's behavior sequence is determined by his personal preference and his socially influenced preference; however, they model personal preferences as static and social influence as context-independent.

6.3. Graph Convolutional Networks

Graph Convolutional Networks (GCNs) are a powerful technique for encoding graph nodes into a low-dimensional space and have been proven able to extract features from graph-structured data (Defferrard et al., 2016). Kipf and Welling (2017) used GCNs for semi-supervised graph classification and achieved state-of-the-art performance. GCNs combine the features of the current node and its neighbors and handle edge features easily. However, in the original GCNs, all neighbors' weights are fixed when convolution filters update the node embeddings; GATs (Veličković et al., 2018) instead use an attention mechanism to assign different weights to different nodes in the neighborhood. In recommendation, PinSage (Ying et al., 2018a) uses efficient random walks to structure the convolutions and scales to recommendations on large networks. GCMC (Berg et al., 2017) proposes a graph auto-encoder framework based on differentiable message passing on the bipartite user-item graph and provides user and item embedding vectors for recommendation. SocialGCN (Wu et al., 2019) uses GCNs to capture how users' interests are influenced by the social diffusion process in social networks. DGRec (Song et al., 2019) proposes a dynamic-graph-attention network: it uses an RNN to model users' short-term interests and a GAT to model social influence between users and their friends. In contrast, SocialTrans uses a multi-layer Transformer to model users' personal preferences; our experiments show that SocialTrans outperforms DGRec on both the Yelp and WeChat Official Accounts data sets.

7. CONCLUSION

On social network platforms, a user's behavior is based on his personal interests and is socially influenced by his friends. It is important to study how to consider both factors in social recommendation tasks. In this paper, we presented SocialTrans, a model that jointly captures users' personal preferences and socially influenced preferences, using Transformer and GAT, two state-of-the-art models, as building blocks. We conducted extensive experiments to demonstrate the superiority of SocialTrans over previous state-of-the-art models. On the offline Yelp data set, SocialTrans achieves an 11.16%-17.63% relative improvement in recall@20 over sequence-based methods. On the WeChat Official Accounts data set, SocialTrans achieves a 39.08% relative improvement over the state-of-the-art sequence-based method. In addition, we deployed and tested our model in WeChat Top Stories, a major article recommendation platform in China; our online A/B testing shows that SocialTrans achieves a 5.89% relative improvement in click-through rate over metapath2vec. In the future, we plan to extend SocialTrans to take advantage of more side information in recommendation tasks (e.g., other available attributes of users and items). Also, social networks are not static and evolve over time; it would be interesting to explore how to handle dynamic social networks.

References

  • Andoni and Indyk (2006) Alexandr Andoni and Piotr Indyk. 2006. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. In FOCS’06. IEEE, 459–468.
  • Ba et al. (2016) Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. 2016. Layer normalization. arXiv preprint arXiv:1607.06450 (2016).
  • Berg et al. (2017) Rianne van den Berg, Thomas N Kipf, and Max Welling. 2017. Graph convolutional matrix completion. arXiv preprint arXiv:1706.02263 (2017).
  • Defferrard et al. (2016) Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. 2016. Convolutional neural networks on graphs with fast localized spectral filtering. In NIPS. 3844–3852.
  • Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL. 4171–4186.
  • Dong et al. (2017) Yuxiao Dong, Nitesh V Chawla, and Ananthram Swami. 2017. metapath2vec: Scalable Representation Learning for Heterogeneous Networks. In KDD. ACM, 135–144.
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. In CVPR. 770–778.
  • He et al. (2017) Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. 2017. Neural collaborative filtering. In WWW. ACM, 173–182.
  • Hendrycks and Gimpel (2016) Dan Hendrycks and Kevin Gimpel. 2016. Gaussian Error Linear Units (GELUs). arXiv preprint arXiv:1606.08415 (2016).
  • Hidasi et al. (2016) Balázs Hidasi, Alexandros Karatzoglou, Linas Baltrunas, and Domonkos Tikk. 2016. Session-based recommendations with recurrent neural networks. In ICLR.
  • Kang and McAuley (2018) Wang-Cheng Kang and Julian McAuley. 2018. Self-Attentive Sequential Recommendation. In ICDM. IEEE, 197–206.
  • Kingma and Ba (2015) Diederik P Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In ICLR.
  • Kipf and Welling (2017) Thomas N Kipf and Max Welling. 2017. Semi-supervised classification with graph convolutional networks. In ICLR.
  • Li et al. (2017) Jing Li, Pengjie Ren, Zhumin Chen, Zhaochun Ren, Tao Lian, and Jun Ma. 2017. Neural attentive session-based recommendation. In CIKM. ACM, 1419–1428.
  • Li et al. (2018) Jian Li, Zhaopeng Tu, Baosong Yang, Michael R Lyu, and Tong Zhang. 2018. Multi-Head Attention with Disagreement Regularization. In EMNLP. 2897–2903.
  • Loyola et al. (2017) Pablo Loyola, Chen Liu, and Yu Hirate. 2017. Modeling user session and intent with an attention-based encoder-decoder architecture. In RecSys. ACM, 147–151.
  • Ma et al. (2011) Hao Ma, Dengyong Zhou, Chao Liu, Michael R Lyu, and Irwin King. 2011. Recommender systems with social regularization. In WSDM. ACM, 287–296.
  • Marsden and Friedkin (1993) Peter V Marsden and Noah E Friedkin. 1993. Network studies of social influence. Sociological Methods & Research 22, 1 (1993), 127–151.
  • McPherson et al. (2001) Miller McPherson, Lynn Smith-Lovin, and James M Cook. 2001. Birds of a feather: Homophily in social networks. Annual review of sociology 27, 1 (2001), 415–444.
  • Rendle et al. (2010) Steffen Rendle, Christoph Freudenthaler, and Lars Schmidt-Thieme. 2010. Factorizing personalized markov chains for next-basket recommendation. In WWW. ACM, 811–820.
  • Salakhutdinov et al. (2007) Ruslan Salakhutdinov, Andriy Mnih, and Geoffrey Hinton. 2007. Restricted Boltzmann machines for collaborative filtering. In ICML. ACM, 791–798.
  • Sedhain et al. (2015) Suvash Sedhain, Aditya Krishna Menon, Scott Sanner, and Lexing Xie. 2015. Autorec: Autoencoders meet collaborative filtering. In WWW. ACM, 111–112.
  • Song et al. (2019) Weiping Song, Zhiping Xiao, Yifan Wang, Laurent Charlin, Ming Zhang, and Jian Tang. 2019. Session-based social recommendation via dynamic graph attention networks. In WSDM. ACM, 555–563.
  • Srivastava et al. (2014) Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. JMLR 15, 1, 1929–1958.
  • Tang et al. (2012a) Jiliang Tang, Huiji Gao, and Huan Liu. 2012a. mTrust: discerning multi-faceted trust in a connected world. In WSDM. ACM, 93–102.
  • Tang et al. (2012b) Jiliang Tang, Huiji Gao, Huan Liu, and Atish Das Sarma. 2012b. eTrust: Understanding trust evolution in an online world. In KDD. ACM, 253–261.
  • Tang and Wang (2018) Jiaxi Tang and Ke Wang. 2018. Personalized top-n sequential recommendation via convolutional sequence embedding. In WSDM. ACM, 565–573.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In NIPS. 5998–6008.
  • Veličković et al. (2018) Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. 2018. Graph attention networks. In ICLR.
  • Wang et al. (2015) Pengfei Wang, Jiafeng Guo, Yanyan Lan, Jun Xu, Shengxian Wan, and Xueqi Cheng. 2015. Learning hierarchical representation model for nextbasket recommendation. In SIGIR. ACM, 403–412.
  • Wang et al. (2017a) Xiang Wang, Xiangnan He, Liqiang Nie, and Tat-Seng Chua. 2017a. Item silk road: Recommending items from information domains to social users. In SIGIR. ACM, 185–194.
  • Wang et al. (2017b) Xin Wang, Steven CH Hoi, Martin Ester, Jiajun Bu, and Chun Chen. 2017b. Learning personalized preference of strong and weak ties for social recommendation. In WWW. ACM, 1601–1610.
  • Wang et al. (2016) Xin Wang, Wei Lu, Martin Ester, Can Wang, and Chun Chen. 2016. Social recommendation with strong and weak ties. In CIKM. ACM, 5–14.
  • Wu et al. (2019) Le Wu, Peijie Sun, Yanjie Fu, Richang Hong, Xiting Wang, and Meng Wang. 2019. A Neural Influence Diffusion Model for Social Recommendation. In SIGIR.
  • Yang et al. (2016) Bo Yang, Yu Lei, Jiming Liu, and Wenjie Li. 2016. Social collaborative filtering by trust. TPAMI 39, 8 (2016), 1633–1647.
  • Ying et al. (2018b) Haochao Ying, Fuzhen Zhuang, Fuzheng Zhang, Yanchi Liu, Guandong Xu, Xing Xie, Hui Xiong, and Jian Wu. 2018b. Sequential recommender system based on hierarchical attention network. In IJCAI. AAAI Press, 3926–3932.
  • Ying et al. (2018a) Rex Ying, Ruining He, Kaifeng Chen, Pong Eksombatchai, William L Hamilton, and Jure Leskovec. 2018a. Graph convolutional neural networks for web-scale recommender systems. In KDD. ACM, 974–983.
  • Yu et al. (2016) Feng Yu, Qiang Liu, Shu Wu, Liang Wang, and Tieniu Tan. 2016. A dynamic recurrent model for next basket recommendation. In SIGIR. ACM, 729–732.
  • Zaharia et al. (2010) Matei Zaharia, Mosharaf Chowdhury, Michael J Franklin, Scott Shenker, and Ion Stoica. 2010. Spark: Cluster computing with working sets. HotCloud 10, 10-10, 95.
  • Zhao et al. (2014) Tong Zhao, Julian McAuley, and Irwin King. 2014. Leveraging social connections to improve personalized ranking for collaborative filtering. In CIKM. ACM, 261–270.