
A Multi-Strategy based Pre-Training Method for Cold-Start Recommendation

Bowen Hao ([email protected]), Renmin University of China, Beijing, China; Hongzhi Yin ([email protected]), The University of Queensland, Brisbane, Australia; Jing Zhang ([email protected]), Renmin University of China, Beijing, China; Cuiping Li ([email protected]), Renmin University of China, Beijing, China; and Hong Chen ([email protected]), Renmin University of China, Beijing, China
(2022)
Abstract.

The cold-start issue is a fundamental challenge in recommender systems. The recent self-supervised learning (SSL) model on Graph Neural Networks (GNNs), PT-GNN, pre-trains the GNN model to reconstruct the cold-start embeddings and has shown great potential for cold-start recommendation. However, due to the over-smoothing problem, PT-GNN can only capture up to 3-order relations, which cannot provide much useful auxiliary information to depict the target cold-start user or item. Besides, the embedding reconstruction task only considers the intra-correlations within the subgraph of users and items, while ignoring the inter-correlations across different subgraphs. To solve the above challenges, we propose a multi-strategy based pre-training method for cold-start recommendation (MPT), which extends PT-GNN from the perspectives of model architecture and pretext tasks to improve the cold-start recommendation performance (this paper is an extension of our previously published conference version (Hao et al., 2021)). Specifically, in terms of the model architecture, in addition to the short-range dependencies of users and items captured by the GNN encoder, we introduce a Transformer encoder to capture long-range dependencies. In terms of the pretext tasks, in addition to considering the intra-correlations of users and items through the embedding reconstruction task, we add an embedding contrastive learning task to capture the inter-correlations of users and items. We train the GNN and Transformer encoders on these pretext tasks under the meta-learning setting to simulate the real cold-start scenario, so that the model can be easily and rapidly adapted to new cold-start users and items. Experiments on three public recommendation datasets show the superiority of the proposed MPT model over the vanilla GNN models and the pre-training GNN model on user/item embedding inference and the recommendation task.

Recommender system, cold-start, pre-training, self-supervised learning
copyright: acmcopyright; journal year: 2022; doi: 10.1145/1122445.1122456; journal: POMACS; journal volume: 37; journal number: 4; article: 111; publication month: 5; ccs: Information systems, Social advertising

1. Introduction

As intelligent systems, Recommender Systems (RS) (He et al., 2017, 2020; Linden et al., 2003) have been built successfully in recent years to alleviate the information overload problem. The most popular algorithm in RS is collaborative filtering, which uses either matrix factorization (Linden et al., 2003) or neural collaborative filtering (He et al., 2017) to learn user/item embeddings. However, due to the sparse interactions of cold-start users/items, it is difficult to learn high-quality embeddings for them.

To address the cold-start issue, researchers have proposed to incorporate side information such as knowledge graphs (KGs) (Wang et al., 2019d, a) or the contents of users/items (Yin et al., 2014, 2017; Chen et al., 2020b) to enhance the representations of users/items. However, the contents are not always available, and it is not easy to link items to entities in KGs due to the incompleteness and ambiguity of the entities. Another research line adopts Graph Neural Networks (GNNs) such as GraphSAGE (Hamilton et al., 2017), NGCF (Wang et al., 2019b) and LightGCN (He et al., 2020) to solve the problem. The basic idea is to incorporate the high-order neighbors to enhance the embeddings of the cold-start users/items. However, the GNN models for recommendation cannot thoroughly solve the cold-start problem, as the embeddings of the cold-start users/items are not explicitly optimized, and the cold-start neighbors are not considered in these GNNs.

In our previous work, we proposed PT-GNN (Hao et al., 2021), which pre-trains the GNN model to reconstruct the user/item embeddings under the meta-learning setting (Vinyals et al., 2016). To further reduce the impact of the cold-start neighbors, PT-GNN incorporates a self-attention based meta aggregator to enhance the aggregation ability of each graph convolution step, and an adaptive neighbor sampler to select proper high-order neighbors according to the feedback from the GNN model.

However, PT-GNN still suffers from the following challenges: 1) In terms of the model architecture, PT-GNN lacks the ability to capture long-range dependencies. Existing research such as Chen et al. (Chen and Wong, 2020) points out that GNNs can only capture up to 3-hop neighbors due to the overfitting and over-smoothing problems (He et al., 2020). However, in the cold-start scenario, both low-order neighbors ($L\leq 2$, where $L$ is the number of convolution layers) and high-order neighbors ($L>2$) are important to the target cold-start user/item. On one hand, the low-order neighbors are crucial for representing the target cold-start users/items, as the first-order neighbors directly reflect the user's preference or the item's target audience, and the second-order neighbors directly reflect the signals of the collaborative users/items. On the other hand, due to the extremely sparse interactions of the cold-start users/items, the first- and second-order neighbors are insufficient to represent a user or an item. Thus, for the target users/items, it is crucial to capture not only the short-range dependencies from their low-order neighbors, but also the long-range dependencies from their high-order neighbors. Nevertheless, due to the limitation of the network structure, most existing GNN models can only capture the short-range dependencies, which inspires the first research question: how to capture both short- and long-range dependencies of users/items? 2) In terms of the pretext task, PT-GNN only considers the intra-correlations within the subgraph of the target user or item, but ignores the inter-correlations between the subgraphs of different users or items. More concretely, given a target cold-start user or item, PT-GNN samples its high-order neighbors to form a subgraph, and leverages the subgraph itself to reconstruct the cold-start embedding.
Essentially, the embedding reconstruction task is a generative task, which focuses on exploring the intra-correlations between nodes in a subgraph. In contrast, some recent pre-training GNN models (e.g., GCC (Qiu et al., 2020), GraphCL (You et al., 2020)) rely on the contrastive learning mechanism and can capture the similarities of nodes between different subgraphs, i.e., the inter-correlations. This inspires the second research question: how to capture the inter-correlations of the target cold-start users/items in addition to the intra-correlations?

Figure 1. The improvements of MPT over the original PT-GNN. ✓ denotes a capacity of PT-GNN and ✗ denotes a missing capacity of PT-GNN.

Present work. We propose a Multi-strategy based Pre-Training method for cold-start recommendation (MPT), which extends PT-GNN from the perspectives of model architecture and pretext tasks to improve the cold-start recommendation performance. The improvements over PT-GNN are shown in Fig. 1.

First, in terms of the model architecture, in addition to the original GNN encoder which captures the short-range dependencies from the user-item edges in the user-item graph, we introduce a Transformer encoder to capture the long-range dependencies from user-item paths, which are extracted by performing random walk (Perozzi et al., 2014) starting from the target user or item in the user-item graph. The multi-head self-attention mechanism in the Transformer attends to nodes at different positions in a path (Vaswani et al., 2017), which can explicitly capture the long-range dependencies between users and items. Besides, we change the RL-based neighbor sampling strategy in the original GNN encoder into a simple yet effective dynamic sampling strategy, which reduces the time complexity of the neighbor sampling process.

Second, in terms of the pretext tasks, in addition to the original embedding reconstruction task which considers the intra-correlations within the subgraph of the target user or item, we add an embedding contrastive learning task (Wu et al., 2018) to capture the inter-correlations across the subgraphs or paths of different users or items. Specifically, we first augment a concerned subgraph or path by deleting or replacing the nodes in it. Then we treat the augmented subgraphs or paths of the concerned user or item as its positive counterparts, and those of other users or items as negative counterparts. By contrastive learning upon the set of positive and negative instances, we can pull similar user or item embeddings together while pushing dissimilar embeddings apart.

We train the GNN and Transformer encoders with the reconstruction and contrastive learning pretext tasks under the meta-learning setting (Vinyals et al., 2016) to simulate the cold-start scenario. Specifically, following PT-GNN, we first pick the users/items with sufficient interactions as the target users/items, and learn their ground-truth embeddings from the observed abundant interactions. Then for each target user/item, in order to simulate the cold-start scenario, we mask the other neighbors and only maintain $K$ first-order neighbors, based on which we form the user-item subgraph and use random walk (Perozzi et al., 2014) to generate the paths of the target user or item. Next, we perform multiple graph convolution steps on the user-item subgraph, or apply the self-attention mechanism on the path, to obtain the target user/item embeddings. Finally, we optimize the model parameters with the reconstruction and contrastive losses.

We adopt the pre-training & fine-tuning paradigm (Qiu et al., 2020) to train MPT. During the pre-training stage, in order to capture the correlations of users and items from different views (i.e., the short- and long-range dependencies, and the intra- and inter-correlations), we assign each pretext task an independent set of initialized embeddings, and train these tasks independently. During the fine-tuning stage, we fuse the pre-trained embeddings from each pretext task and fine-tune the encoders on the downstream recommendation task. The contributions can be summarized as:

  • We extend PT-GNN from the perspective of model architecture. In addition to the short-range dependencies of users and items captured by the GNN encoder, we add a Transformer encoder to capture long-range dependencies.

  • We extend PT-GNN from the perspective of pretext tasks. In addition to considering the intra-correlations of users and items through the embedding reconstruction task, we add an embedding contrastive learning task to capture the inter-correlations of users and items.

  • Experiments on both intrinsic embedding evaluation and extrinsic downstream tasks demonstrate the superiority of our proposed MPT model over the state-of-the-art GNN models and the originally proposed PT-GNN.

2. Preliminaries

In this section, we first define the problem and then review the originally proposed PT-GNN.

2.1. Notation and Problem Definition

Bipartite Graph. We denote the user-item bipartite graph as $\mathcal{G}=(\mathcal{U},\mathcal{I},\mathcal{E})$, where $\mathcal{U}=\{u_1,\cdots,u_{|\mathcal{U}|}\}$ is the set of users and $\mathcal{I}=\{i_1,\cdots,i_{|\mathcal{I}|}\}$ is the set of items. $\mathcal{E}\subseteq\mathcal{U}\times\mathcal{I}$ denotes the set of edges that connect the users and items. $\mathcal{N}^l(u)$ denotes the $l$-order neighbors of user $u$; when the superscript is omitted, $\mathcal{N}(u)$ denotes the first-order neighbors of $u$. $\mathcal{N}^l(i)$ and $\mathcal{N}(i)$ are defined similarly for items.

User-Item Path. The user-item path is generated by the random walk strategy from the user-item interaction data, and has the same node distribution as the graph $\mathcal{G}$, as Perozzi et al. (Perozzi et al., 2014) proposed. Formally, there are two types of paths: $\mathcal{P}={\rm UIUI}\cdots$ or $\mathcal{P}={\rm IUIU}\cdots$, where ${\rm U}$ denotes a user node and ${\rm I}$ denotes an item node.
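As a minimal illustration of path generation (a sketch on a toy bipartite graph; the node names and the `random_walk_path` helper are our own, not part of the paper), a random walk started from a user necessarily alternates between user and item nodes, yielding a UIUI-type path:

```python
import random

def random_walk_path(adj, start, length, seed=0):
    """Sample an alternating user-item path by random walk.

    `adj` maps each node id to its neighbors in the bipartite graph,
    so a step from a user always lands on an item and vice versa.
    """
    rng = random.Random(seed)
    path = [start]
    while len(path) < length:
        neighbors = adj[path[-1]]
        if not neighbors:
            break  # dead end: no interactions to continue the walk
        path.append(rng.choice(neighbors))
    return path

# Toy bipartite graph: users u1, u2 interact with items i1, i2.
adj = {"u1": ["i1", "i2"], "u2": ["i1"], "i1": ["u1", "u2"], "i2": ["u1"]}
path = random_walk_path(adj, "u1", length=4)  # node types alternate: U, I, U, I
```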

Problem Definition. Let $f:\mathcal{U}\cup\mathcal{I}\rightarrow\mathbf{R}^d$ be the encoding function that maps the users or items to $d$-dimensional real-valued vectors. We denote a user $u$ and an item $i$ with initial embeddings $\textbf{h}_u^0$ and $\textbf{h}_i^0$, respectively. Given the bipartite graph $\mathcal{G}$ or the path $\mathcal{P}$, we aim to pre-train the encoding function $f$ such that it can be applied to the downstream recommendation task to improve its performance. Note that for simplicity, in the following sections, we mainly take the user embedding as an example to explain the proposed model, as the item embedding can be explained in the same way.

2.2. A Brief Review of PT-GNN

The basic idea of PT-GNN is to leverage the vanilla GNN model as the encoder $f$ to reconstruct the target cold-start embeddings, thereby explicitly improving their quality. To further reduce the impact of the cold-start neighbors, PT-GNN incorporates a self-attention based meta aggregator to enhance the aggregation ability of each graph convolution step, and an adaptive neighbor sampler to select proper high-order neighbors according to the feedback from the GNN model.

2.2.1. Basic Pre-training GNN Model

The basic pre-training GNN model reconstructs the cold-start embeddings under the meta-learning setting (Vinyals et al., 2016) to simulate the cold-start scenario. Specifically, for each target user $u$, we mask some neighbors and maintain at most $K^l$ neighbors at the $l$-th layer ($K$ is empirically selected as a small number, e.g., $K=3$), based on which we perform multiple graph convolution steps to predict the target user embedding. Taking GraphSAGE (Hamilton et al., 2017) as the backbone GNN model, the graph convolution process at the $l$-th layer is:

(1) $\textbf{h}_u^l = \sigma(\mathbf{W}^l\cdot(\textbf{h}_u^{l-1}\,||\,\textbf{h}^l_{\mathcal{N}(u)}))$,

where $\textbf{h}_u^l$ is the refined user embedding at the $l$-th convolution step, $\textbf{h}_u^{l-1}$ is the previous user embedding ($\textbf{h}_u^0$ is the initialized embedding), and $\textbf{h}^l_{\mathcal{N}(u)}$ is the averaged embedding of the neighbors, which are sampled by the random sampling strategy. $\sigma$ is the sigmoid function, $\mathbf{W}^l$ is the parameter matrix, and $||$ is the concatenation operation.
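The convolution step of Eq. (1) can be sketched in plain Python (illustrative only: real implementations use learned weight matrices and a tensor library, and the toy embeddings and the `sage_step` helper are our own):

```python
import math

def sage_step(h_u, neighbor_embs, W):
    """One GraphSAGE-style convolution step (Eq. 1): concatenate the previous
    user embedding with the mean neighbor embedding, apply the linear map W,
    then the sigmoid nonlinearity."""
    sigma = lambda x: 1 / (1 + math.exp(-x))
    # h_N(u): dimension-wise mean of the sampled neighbors' embeddings
    h_nbr = [sum(col) / len(neighbor_embs) for col in zip(*neighbor_embs)]
    concat = h_u + h_nbr  # vector concatenation (the || operation)
    out = [sum(w * x for w, x in zip(row, concat)) for row in W]
    return [sigma(v) for v in out]

# 2-d embeddings, two sampled neighbors; W maps 4 dimensions back to 2.
h_u = [0.1, 0.2]
nbrs = [[0.3, 0.1], [0.5, 0.3]]
W = [[1, 0, 0, 0], [0, 0, 1, 0]]
h_next = sage_step(h_u, nbrs, W)  # refined user embedding, entries in (0, 1)
```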

We perform graph convolution $L-1$ steps, aggregate the refined embeddings of the first-order neighbors $\{\textbf{h}_1^{L-1},\cdots,\textbf{h}_K^{L-1}\}$ to obtain the smoothed embedding $\textbf{h}^L_{\mathcal{N}(u)}$, and transform it into the target embedding $\textbf{h}_u^L$, i.e., $\textbf{h}^L_{\mathcal{N}(u)}=\text{AGGREGATE}(\{\textbf{h}^{L-1}_i,\forall i\in\mathcal{N}(u)\})$ and $\textbf{h}_u^L=\sigma(\mathbf{W}^L\cdot\textbf{h}_{\mathcal{N}(u)}^L)$. Then we calculate the cosine similarity between the predicted embedding $\textbf{h}_u^L$ and the ground-truth embedding $\textbf{h}_u$ to optimize the parameters of the GNN model:

(2) $\Theta^* = \mathop{\arg\max}_{\Theta}\sum_u {\rm cos}(\textbf{h}_u^L,\textbf{h}_u)$,

where the ground-truth embedding $\textbf{h}_u$ is learned by any recommender algorithm, e.g., NCF (He et al., 2017) or LightGCN (He et al., 2020) (these are good enough to learn high-quality user/item embeddings from the abundant interactions, as we discuss in Section 5.3.1), and $\Theta$ denotes the model parameters. Similarly, we can obtain the item embedding $\textbf{h}^L_i$ and optimize the model parameters by Eq. (2).

2.2.2. Enhanced Pre-training Model: PT-GNN

The basic pre-training strategy has two problems. On one hand, it cannot explicitly deal with the high-order cold-start neighbors during the graph convolution process. On the other hand, the GNN sampling strategies such as random sampling (Hamilton et al., 2017) or importance sampling (Chen et al., 2018) may fail to sample relevant high-order cold-start neighbors due to their sparse interactions.

To solve the first challenge, we incorporate a meta aggregator to enhance the aggregation ability of each graph convolution step. Specifically, the meta aggregator uses the self-attention mechanism (Vaswani et al., 2017) to encode the initial embeddings $\{\textbf{h}_1^0,\cdots,\textbf{h}_K^0\}$ of the $K$ first-order neighbors of $u$ as input, and outputs the meta embedding $\tilde{\textbf{h}}_u$ for $u$. The process is:

(3) $\{\textbf{h}_1,\cdots,\textbf{h}_K\} \leftarrow {\rm SELF\_ATTENTION}(\{\textbf{h}_1^0,\cdots,\textbf{h}_K^0\})$, $\tilde{\textbf{h}}_u = {\rm AVERAGE}(\{\textbf{h}_1,\cdots,\textbf{h}_K\})$.

We use the same cosine similarity described in Eq. (2) to measure the similarity between the predicted meta embedding $\tilde{\textbf{h}}_u$ and the ground-truth embedding $\textbf{h}_u$.
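A minimal sketch of the meta aggregator in Eq. (3), assuming a single attention head with identity query/key/value projections (the real aggregator uses learned projections; the toy embeddings and helper names are our own):

```python
import math

def self_attention(X):
    """Single-head self-attention with identity Q/K/V projections:
    each output is a softmax-weighted combination of the inputs."""
    d = len(X[0])
    out = []
    for q in X:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in X]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]  # stable softmax
        Z = sum(exps)
        attn = [e / Z for e in exps]
        out.append([sum(a * v[j] for a, v in zip(attn, X)) for j in range(d)])
    return out

def meta_embedding(neighbor_embs):
    """Meta aggregator sketch (Eq. 3): self-attend over the K first-order
    neighbors' initial embeddings, then average the refined vectors."""
    H = self_attention(neighbor_embs)
    d = len(H[0])
    return [sum(h[j] for h in H) / len(H) for j in range(d)]

# Initial embeddings of K = 3 first-order neighbors (2-d, toy values).
h_meta = meta_embedding([[0.1, 0.3], [0.2, 0.1], [0.4, 0.2]])
```

Since attention weights are convex, each coordinate of the meta embedding stays within the range of the corresponding input coordinates.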

To solve the second challenge, we propose an adaptive neighbor sampler to select proper high-order neighbors according to the feedback from the GNN model. The adaptive neighbor sampler is formalized as a hierarchical Markov Sequential Decision Process, which sequentially samples from the low-order neighbors to the high-order neighbors and results in at most $K^l$ neighbors in the $l$-th layer for each target user $u$. After sampling the neighbors of the first $L$ layers, the GNN model accepts the sampled neighbors and produces a reward, which indicates whether the sampled neighbors are reasonable or not. By maximizing the expected reward from the GNN model, the adaptive neighbor sampler learns to sample proper high-order neighbors. Once the meta aggregator and the adaptive neighbor sampler are trained, the meta embedding $\tilde{\textbf{h}}_u$ and the averaged sampled neighbor embedding $\tilde{\textbf{h}}^l_{\mathcal{N}(u)}$ at the $l$-th layer are added into each graph convolution step in Eq. (1):

(4) $\textbf{h}_u^l = \sigma(\mathbf{W}^l\cdot(\tilde{\textbf{h}}_u\,||\,\textbf{h}_u^{l-1}\,||\,\tilde{\textbf{h}}^l_{\mathcal{N}(u)}))$.

For the target user $u$, Eq. (4) is repeated $L-1$ steps to obtain the embeddings $\{\textbf{h}_1^{L-1},\cdots,\textbf{h}_K^{L-1}\}$ of its $K$ first-order neighbors; the predicted embedding $\textbf{h}_u^L$ is then obtained by averaging these first-order embeddings. Finally, Eq. (2) is used to optimize the parameters of the pre-training GNN model. The pre-training parameter set is $\Theta=\{\Theta_{gnn},\Theta_f,\Theta_s\}$, where $\Theta_{gnn}$ denotes the parameters of the vanilla GNN, $\Theta_f$ the parameters of the meta aggregator, and $\Theta_s$ the parameters of the neighbor sampler. The item embedding $\textbf{h}^L_i$ can be obtained in a similar way.

2.2.3. Downstream Recommendation Task

We fine-tune PT-GNN on the downstream recommendation task. Specifically, for each target user $u$ and its neighbors $\{\mathcal{N}^1(u),\cdots,\mathcal{N}^L(u)\}$ of different orders, we first use the pre-trained adaptive neighbor sampler to sample proper high-order neighbors $\{\hat{\mathcal{N}}^1(u),\hat{\mathcal{N}}^2(u),\cdots,\hat{\mathcal{N}}^L(u)\}$, and then use the pre-trained meta aggregator to produce the user embedding $\textbf{h}_u^L$. The item embedding $\textbf{h}_i^L$ can be obtained in the same way. Then we transform the embeddings to calculate the relevance score between a user and an item, i.e., $y(u,i)={\sigma(\mathbf{W}\cdot\textbf{h}_u^L)}^{\mathrm{T}}\sigma(\mathbf{W}\cdot\textbf{h}_i^L)$, with parameters $\Theta_r=\{\mathbf{W}\}$. Finally, we adopt the BPR loss (He et al., 2020) to optimize $\Theta_r$ and fine-tune $\Theta$:

(5) $\mathcal{L}_{BPR}=\sum_{(u,i)\in\mathcal{E},(u,j)\notin\mathcal{E}}-\ln\sigma(y(u,i)-y(u,j))$.
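A sketch of the fine-tuning objective, combining the relevance score and the BPR loss of Eq. (5). The toy embeddings and helper names are illustrative, and the learned matrix $\mathbf{W}$ of the score function is omitted for brevity:

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def score(h_u, h_i):
    """Relevance score y(u, i): inner product of the elementwise
    sigmoid-transformed user and item embeddings (the learned linear
    transform W of the paper is dropped in this sketch)."""
    return sum(sigmoid(a) * sigmoid(b) for a, b in zip(h_u, h_i))

def bpr_loss(pairs):
    """BPR loss (Eq. 5): sum of -ln sigmoid(y(u,i) - y(u,j)) over
    (positive score, negative score) pairs."""
    return sum(-math.log(sigmoid(p - n)) for p, n in pairs)

y_pos = score([0.5, 0.8], [0.6, 0.7])    # observed (u, i) edge
y_neg = score([0.5, 0.8], [-0.9, -1.0])  # unobserved (u, j) pair
loss = bpr_loss([(y_pos, y_neg)])        # smaller when y_pos exceeds y_neg
```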

However, as shown in Fig. 1, due to the limitation of the GNN model, PT-GNN can only capture the short-range dependencies of users and items; besides, the embedding reconstruction task only focuses on exploring the intra-correlations within the subgraph of the target user or item. Therefore, it is necessary to fully capture both the short- and long-range dependencies of users and items, and both the intra- and inter-correlations of users or items.

Figure 2. Overview of the proposed pretext tasks: (a) Reconstruction with GNN Encoder; (b) Contrastive Learning with GNN Encoder; (c) Reconstruction with Transformer Encoder; (d) Contrastive Learning with Transformer Encoder. These pretext tasks are trained independently.

3. The Proposed Model MPT

We propose a novel Multi-strategy based Pre-Training method (MPT), which extends PT-GNN from the perspectives of model architecture and pretext tasks. Specifically, in terms of the model architecture, in addition to using the GNN encoder to capture the short-range dependencies of users and items, we introduce a Transformer encoder to capture long-range dependencies. In terms of the pretext tasks, in addition to considering the intra-correlations of users and items through the embedding reconstruction task, we add an embedding contrastive learning task to capture the inter-correlations of users and items. Hence, by combining each architecture with each pretext task, we obtain four different implementations of pretext tasks, as shown in Fig. 2. We detail each pretext task implementation in Sections 3.1 to 3.4, present the overall model training process in Section 3.5, and finally analyze the time complexity of the pretext task implementations in Section 3.6.

3.1. Reconstruction with GNN Encoder

In this pretext task, we primarily follow the original PT-GNN model to reconstruct the user embedding, as described in Section 2.2, but modify its neighbor sampling strategy to reduce the time complexity.

The adaptive neighbor sampling strategy in PT-GNN suffers from slow convergence, as it is essentially a Monte-Carlo based policy gradient strategy (REINFORCE) (Sutton et al., 1999; Williams, 1992) with complex action-state trajectories over the entire neighbor sampling process. To solve this problem, inspired by the dynamic sampling theory that uses the current model itself to sample instances (Burges, 2010; Zhang et al., 2013), we propose a dynamic sampling strategy, which samples from the low-order neighbors to the high-order neighbors according to the enhanced embeddings of the target user and each neighbor. The enhanced embedding of the target user (or each neighbor) is obtained by concatenating the meta embedding produced by the meta aggregator in PT-GNN (Section 2.2.2) with the current trained embedding. The meta embedding accounts for the cold-start characteristics of the neighbors, and the current trained embedding enables dynamically selecting proper neighbors. Formally, at the $l$-th layer, the sampling process is:

(6) $a_j^u = {\rm cos}(\tilde{\textbf{h}}_u\,||\,\textbf{h}_u^l,\ \tilde{\textbf{h}}_j\,||\,\textbf{h}_j^l)$, $\{\textbf{h}^l_1,\cdots,\textbf{h}^l_{K^l}\} \leftarrow {\rm S\_TOP}(a_1^u,\cdots,a_j^u,\cdots,a_{|\mathcal{N}^l(u)|}^u)$,

where $||$ is the concatenation operation, $a_j^u$ is the cosine similarity between the enhanced target user embedding $\tilde{\textbf{h}}_u\,||\,\textbf{h}_u^l$ and the enhanced $j$-th neighbor embedding $\tilde{\textbf{h}}_j\,||\,\textbf{h}_j^l$, $\tilde{\textbf{h}}_u$ and $\tilde{\textbf{h}}_j$ are the meta embeddings produced by the meta aggregator, $\textbf{h}_u^l$ and $\textbf{h}_j^l$ are the current trained embeddings, and $\{\textbf{h}^l_1,\cdots,\textbf{h}^l_{K^l}\}$ are the top-$K^l$ relevant neighbors. ${\rm S\_TOP}$ is the operation that selects the neighbor embeddings with the highest cosine similarities. Compared with the adaptive sampling strategy in PT-GNN, the proposed dynamic sampling strategy not only speeds up model convergence but also achieves competitive performance, as it does not require multiple sampling processes and considers the cold-start characteristics of the neighbors during sampling. Notably, the training process of the dynamic sampling strategy is not fully end-to-end. In each training epoch, we first select the $K^l$ relevant neighbors as the sampled neighbors for each target node according to Eq. (6), from which we obtain the adjacency matrix of the user-item graph. Next, we perform the graph convolution operation to reconstruct the target node embedding and optimize the model parameters. Finally, the updated current node embedding, together with the fixed meta node embedding produced by the meta aggregator, is used in the next training epoch, which enables dynamically sampling proper neighbors.
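The dynamic sampling of Eq. (6) reduces to a top-$K^l$ selection by cosine similarity over the concatenated (meta || current) embeddings; a sketch with toy 2-d vectors and our own `dynamic_sample` helper:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def dynamic_sample(target_meta, target_cur, nbr_meta, nbr_cur, k):
    """Dynamic sampling sketch (Eq. 6): score each candidate neighbor by the
    cosine similarity of the concatenated (meta || current) embeddings and
    keep the indices of the top-k (the S_TOP operation)."""
    target = target_meta + target_cur  # concatenation
    scores = [(cosine(target, m + c), j)
              for j, (m, c) in enumerate(zip(nbr_meta, nbr_cur))]
    scores.sort(reverse=True)
    return [j for _, j in scores[:k]]

# Target user and two candidate neighbors; neighbor 0 points the same way.
kept = dynamic_sample([1.0, 0.0], [0.5, 0.5],
                      [[1.0, 0.1], [0.0, 1.0]], [[0.5, 0.4], [0.2, 0.1]], k=1)
```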

Once the neighbors are sampled, at the $l$-th layer, we use the meta embedding $\tilde{\textbf{h}}_u$, the previous user embedding $\textbf{h}_u^{l-1}$ and the averaged dynamically sampled neighbor embedding $\tilde{\textbf{h}}^l_{\mathcal{N}(u)}$ to perform graph convolution, as shown in Eq. (4). We then perform graph convolution $L$ steps to obtain the predicted user embedding $\textbf{h}_u^{R_g}$, and use Eq. (2) to optimize the model parameters. Fig. 2(a) shows this task, and the objective function is:

(7) $\mathcal{L}_1 := \mathop{\arg\max}_{\Theta_1}\sum_u {\rm cos}(\textbf{h}_u^{R_g},\textbf{h}_u)$.

3.2. Contrastive Learning with GNN Encoder

In this pretext task, we propose to contrast the user embeddings produced by the GNN encoder across subgraphs, as shown in Fig. 2(b). Similar to PT-GNN, we also train this task under the meta-learning setting to simulate the cold-start scenario. Specifically, we select users with abundant interactions and randomly select at most $K^l$ ($1\leq l\leq L$) neighbors at layer $l$ to form the subgraph $\mathcal{G}_u$. Then we perform contrastive learning (CL) upon $\mathcal{G}_u$ to learn the representations of users. In the following, we first present the four components of the CL framework and then detail the key component, i.e., the data augmentation operation.

  • An augmentation module ${\rm AUG}(\cdot)$ augments the user subgraph $\mathcal{G}_u$ by deleting or replacing the users or items in it, which results in two random augmentations $\tilde{\mathcal{G}}_{u_1} = {\rm AUG}(\mathcal{G}_u, seed_1)$ and $\tilde{\mathcal{G}}_{u_2} = {\rm AUG}(\mathcal{G}_u, seed_2)$, where $seed_1$ and $seed_2$ are two random seeds. In this paper, we evaluate each individual data augmentation operation, i.e., performing either deletion or substitution, and leave compositional data augmentation as future work.

  • A GNN encoder performs $L$ graph convolution steps using Eq. (4) to generate the representation of the target user for each augmented subgraph, i.e., $\textbf{h}_{u_1}^{C_g}={\rm GNN}(\tilde{\mathcal{G}}_{u_1})$ and $\textbf{h}_{u_2}^{C_g}={\rm GNN}(\tilde{\mathcal{G}}_{u_2})$.

  • A neural network encoder $g$ maps the encoded augmentations $\textbf{h}_{u_1}^{C_g}$ and $\textbf{h}_{u_2}^{C_g}$ into two vectors $z_1=g(\textbf{h}_{u_1}^{C_g})$ and $z_2=g(\textbf{h}_{u_2}^{C_g})$. This follows SimCLR (Chen et al., 2020a), which observes that adding a nonlinear projection head can significantly improve the representation quality.

  • A contrastive learning loss module maximizes the agreement between a positive augmentation pair $(\tilde{s}_1,\tilde{s}_2)$ in the set $\{\tilde{s}\}$. We construct the set $\{\tilde{s}\}$ by randomly augmenting each user in a mini-batch $s$ (of size $N$) twice, which yields a set $\tilde{s}$ of size $2N$. The two variants derived from the same original user form a positive pair, while all other instances from the same mini-batch are regarded as their negative samples. The contrastive loss for a positive pair is then defined as:

    (8) $l_c(m,n)=-\log\frac{\exp({\rm cos}(z_m,z_n)/\tau)}{\sum_{k=1}^{2N}\mathbbm{1}_{k\neq m}\exp({\rm cos}(z_m,z_k)/\tau)}$,

    where $\mathbbm{1}_{k\neq m}$ is the indicator function judging whether $k\neq m$, $\tau$ is a temperature parameter, and ${\rm cos}(z_m,z_n)=z_m^{\mathrm{T}}z_n/(||z_m||_2\,||z_n||_2)$ denotes the cosine similarity of two vectors $z_m$ and $z_n$. The overall contrastive loss $\mathcal{L}_2$ over a mini-batch is:

    (9) $\mathcal{L}_2:=\min_{\Theta_2}\sum_{m=1}^{2N}\sum_{n=1}^{2N}\mathbbm{1}_{m=n}\, l_c(m,n)$,

    where $\mathbbm{1}_{m=n}$ is an indicator function that returns 1 when $(m,n)$ is a positive pair and 0 otherwise, and $\Theta_2$ denotes the parameters of the CL framework.
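The contrastive loss of Eqs. (8) and (9) can be sketched on toy projected vectors (the batch values and helper names are our own; in real training this loss is backpropagated through the GNN encoder and the projection head $g$):

```python
import math

def nt_xent(z, pos_pairs, tau=0.5):
    """Contrastive loss sketch (Eqs. 8-9) over a batch of 2N projected
    vectors z; pos_pairs lists the (m, n) index pairs that come from the
    same original user."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) *
                      math.sqrt(sum(x * x for x in b)))

    def l_c(m, n):
        num = math.exp(cos(z[m], z[n]) / tau)
        # denominator sums over all k != m in the mini-batch (Eq. 8)
        den = sum(math.exp(cos(z[m], z[k]) / tau)
                  for k in range(len(z)) if k != m)
        return -math.log(num / den)

    # sum l_c over both orderings of each positive pair, as in Eq. (9)
    return sum(l_c(m, n) + l_c(n, m) for m, n in pos_pairs)

# 2N = 4 vectors: (0, 1) and (2, 3) are augmentations of the same two users.
z = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
loss = nt_xent(z, [(0, 1), (2, 3)])  # lower when positive pairs are aligned
```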

The key component of CL is the data augmentation strategy. In recommender systems, since each neighbor may play an essential role in expressing the user profile, it remains unknown whether data augmentation would benefit representation learning and what kind of data augmentation would be useful. To answer these questions, we explore and test two basic augmentations, deletion and substitution. We believe there exist more potential augmentations, which we leave for future exploration.

  • Deletion. At each layer $l$, we randomly select $a\%$ of the users or items and delete them. If a parent user or item is deleted, its child users or items are all deleted as well.

  • Substitution. At each layer $l$, we randomly select $b\%$ of the users or items. For each user $u$ or item $i$ in the selected list, we randomly replace $u$ or $i$ with one of its parent's interacted first-order neighbors. Note that in order to illustrate the two data augmentation strategies, we present both the deletion and substitution operations in Fig. 2(b), but in practice we perform a single data augmentation strategy, i.e., either deletion or substitution, and analyze in Section 5.3.3 whether the substitution or deletion ratio used for the target user (or target item) affects the recommendation performance.
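The two augmentations can be sketched on a layered list of sampled neighbors. This is a simplification: it ignores the parent-child cascade of deletion and substitutes with a random node from the same layer rather than one of the parent's first-order neighbors; all names are illustrative:

```python
import random

def augment_layers(layers, op="delete", ratio=0.2, seed=0):
    """Augment a layered neighbor list (layers[l] holds the sampled
    l-th-order neighbors). "delete" drops roughly `ratio` of each layer;
    "substitute" swaps each selected node for a random node of the same
    layer (a simplification of the paper's parent-neighbor substitution)."""
    rng = random.Random(seed)
    out = []
    for layer in layers:
        new_layer = []
        for node in layer:
            if rng.random() < ratio:
                if op == "delete":
                    continue  # drop the node (the paper also drops its subtree)
                node = rng.choice(layer)  # substitution
            new_layer.append(node)
        out.append(new_layer)
    return out

layers = [["i1", "i2", "i3"], ["u2", "u3"]]
view1 = augment_layers(layers, op="delete", seed=1)  # first augmented view
view2 = augment_layers(layers, op="delete", seed=2)  # second augmented view
```

Two different seeds play the role of $seed_1$ and $seed_2$ in the augmentation module, producing the two views that form a positive pair.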

Once the CL framework is trained, we discard the other components and keep only the parameters of the trained GNN encoder. When a new cold-start user u arrives, as in Section 3.1, the GNN encoder performs L graph convolution steps to obtain the enhanced embedding \textbf{h}_{u}^{C_{g}}.

3.3. Reconstruction with Transformer Encoder

In this pretext task, we propose to use the Transformer encoder to reconstruct the embeddings of target users in the user-item path, as shown in Fig. 2(c). This pretext task is also trained under the meta-learning setting. Specifically, similar to PT-GNN, we first choose users with abundant interactions and use any recommendation algorithm to learn their ground-truth embeddings. Then for each user, we sample K^{l} (1\leq l\leq L) neighbors, based on which we use random walk (Perozzi et al., 2014) to obtain the user-item path \mathcal{P} = {\rm IUIU} (or \mathcal{P} = {\rm UIUI}) with path length T. Finally, we mask the target user in the input path with the special token "[mask]", and use the Transformer encoder to predict the target user embedding.

Formally, we are given an original path \mathcal{P}=[x_{1},\cdots,x_{T}], where each node x_{i} in \mathcal{P} represents the initial user embedding \textbf{h}_{u}^{0} or the initial item embedding \textbf{h}_{i}^{0}. We first construct a corrupted path \hat{\mathcal{P}} by replacing the target user token in \mathcal{P} with the special symbol "[mask]" (suppose the target user is at the t-th position in \hat{\mathcal{P}}). Then we use the Transformer encoder to map the input path \hat{\mathcal{P}} into a sequence of hidden vectors {\rm Tr}(\hat{\mathcal{P}})=[{\rm Tr}(\hat{\mathcal{P}})_{1},\cdots,{\rm Tr}(\hat{\mathcal{P}})_{T}]. Finally, we fetch the t-th embedding in {\rm Tr}(\hat{\mathcal{P}}) to predict the ground-truth embedding, and use cosine similarity to optimize the parameters \Theta_{3} of the Transformer encoder. For simplicity, we use the notation \textbf{h}_{u}^{R_{p}} to represent the t-th predicted embedding, i.e., {\rm Tr}(\hat{\mathcal{P}})_{t} = \textbf{h}_{u}^{R_{p}}. The objective function is:

(10) \displaystyle\mathcal{L}_{3}:=\mathop{\arg\max}_{\Theta_{3}}\sum_{u}{\rm cos}(\textbf{h}_{u}^{R_{p}},\textbf{h}_{u}).

Once the Transformer encoder is trained, we can use it to predict the cold-start embedding. When a new cold-start user u arrives with his interacted neighbors, we first use random walk to generate the path set \{\mathcal{P}_{1},\cdots,\mathcal{P}_{t},\cdots,\mathcal{P}_{T}\}, where in the t-th path \mathcal{P}_{t}, u is at the t-th position. Then we replace u with the "[mask]" token to generate the corrupted path \hat{\mathcal{P}}_{t}. Next we feed all the corrupted paths \{\hat{\mathcal{P}}_{1},\cdots,\hat{\mathcal{P}}_{T}\} into the Transformer encoder and obtain the predicted user embeddings \{\textbf{h}_{u_{1}}^{R_{p}},\cdots,\textbf{h}_{u_{T}}^{R_{p}}\}. Finally, we average these predicted embeddings to obtain the final user embedding \textbf{h}_{u}^{R_{p}}.
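The mask-and-average inference above can be sketched as follows, where `encoder` stands in for the trained Transformer encoder Tr(·) and is assumed to be any callable that maps a token path to a list of hidden vectors:

```python
def infer_cold_user(encoder, paths, mask_token="[mask]"):
    """Inference sketch: paths[t] is a user-item path whose t-th position
    holds the cold-start user u. Each path is corrupted by masking that
    position, encoded, and the t-th output vectors are averaged."""
    preds = []
    for t, path in enumerate(paths):
        corrupted = list(path)
        corrupted[t] = mask_token          # replace u with "[mask]"
        hidden = encoder(corrupted)        # Tr(P̂_t): one vector per token
        preds.append(hidden[t])            # fetch the t-th embedding
    d = len(preds[0])
    # average the T predicted embeddings into the final h_u^{R_p}
    return [sum(p[i] for p in preds) / len(preds) for i in range(d)]
```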

3.4. Contrastive Learning with Transformer Encoder

In this task, we propose to contrast the user embeddings produced by the Transformer encoder across different paths, as shown in Fig. 2(d). We train this task under the meta-learning setting. As in Section 3.3, we choose users with abundant interactions, sample K^{l} (1\leq l\leq L) neighbors at each order, and use random walk to obtain the path \mathcal{P}_{u}. Then we apply the CL framework to learn the cold-start user embedding:

  • An augmentation module {\rm AUG}(\cdot) augments the path \mathcal{P}_{u}=[x_{1},\cdots,x_{T}] by randomly deleting or replacing the users or items in it, i.e., \tilde{\mathcal{P}}_{u_{1}} = {\rm AUG}(\mathcal{P}_{u},seed_{1}) and \tilde{\mathcal{P}}_{u_{2}} = {\rm AUG}(\mathcal{P}_{u},seed_{2}). Similar to Section 3.2, we evaluate each data augmentation operation individually.

  • A Transformer encoder accepts 𝒫~u1\tilde{\mathcal{P}}_{u_{1}}, 𝒫~u2\tilde{\mathcal{P}}_{u_{2}} as input, and encodes the target user from two augmented paths into latent vectors, i.e., hu1Cp\textbf{h}_{u_{1}}^{C_{p}} = Tr(𝒫~u1){\rm Tr}(\tilde{\mathcal{P}}_{u_{1}}) and hu2Cp\textbf{h}_{u_{2}}^{C_{p}} = Tr(𝒫~u2){\rm Tr}(\tilde{\mathcal{P}}_{u_{2}}).

  • A neural network encoder g^{\prime} maps the encoded augmentations \textbf{h}_{u_{1}}^{C_{p}} and \textbf{h}_{u_{2}}^{C_{p}} into two vectors z_{1}^{\prime} = g^{\prime}(\textbf{h}_{u_{1}}^{C_{p}}) and z_{2}^{\prime} = g^{\prime}(\textbf{h}_{u_{2}}^{C_{p}}).

  • A contrastive learning loss module maximizes the agreement between the positive augmentation pair (\tilde{s}_{1},\tilde{s}_{2}) in the set \{\tilde{s}\}. As in Section 3.2, we construct \{\tilde{s}\} by augmenting every user in a mini-batch s twice to obtain a set of size 2N, treat the two variants of the same original user as the positive pair, use all the other instances from the same mini-batch as negative samples, and use Eq. (8) as the contrastive loss l_{c}(m,n) for a positive pair. Similar to Eq. (9), the overall contrastive loss \mathcal{L}_{4} defined over a mini-batch is:

    (11) \displaystyle\mathcal{L}_{4}:=\min_{\Theta_{4}}\sum_{m=1}^{2N}\sum_{n=1}^{2N}\mathbbm{1}_{m=n}\ l_{c}(m,n),

    where \Theta_{4} denotes the parameters of the CL framework.

The data augmentation strategies are as follows:

  • Deletion. For each path \mathcal{P}_{u}, we randomly select a% of the users and items and delete them.

  • Substitution. For each path \mathcal{P}_{u}, we randomly select b% of the users and items. For each user u or item i in the selected list, we randomly replace it with one of its parent's interacted first-order neighbors. Note that although Fig. 2(d) illustrates the deletion and substitution operations together, in practice we perform a single augmentation strategy at a time, i.e., either deletion or substitution, and analyze in Section 5.3.3 how the substitution or deletion ratio used for the target user (or target item) affects the recommendation performance.
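A minimal sketch of the path-level AUG(·) operations above, assuming a hypothetical `neighbors` dictionary that maps each node to its first-order neighbors in the interaction graph:

```python
import random

def augment_path(path, op, ratio, neighbors=None, rng=random):
    """Apply one augmentation to a user-item path.

    op='delete' drops ratio% of the positions; op='substitute' replaces
    ratio% of the positions with another first-order neighbor of the
    preceding node (its parent in the path), looked up in `neighbors`.
    """
    k = int(len(path) * ratio)
    idx = set(rng.sample(range(len(path)), k))
    if op == "delete":
        return [x for i, x in enumerate(path) if i not in idx]
    out = list(path)
    for i in idx:
        if i == 0:
            continue  # the first node has no parent inside the path
        parent = out[i - 1]
        cands = [x for x in neighbors.get(parent, []) if x != out[i]]
        if cands:
            out[i] = rng.choice(cands)
    return out
```

Running AUG twice with different seeds produces the two views \tilde{\mathcal{P}}_{u_{1}} and \tilde{\mathcal{P}}_{u_{2}} fed to the Transformer encoder.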

Once the CL framework is trained, we discard the other components and keep only the parameters of the Transformer encoder. When a new cold-start user u arrives with his interacted neighbors, as in Section 3.3, we generate the path set \{\mathcal{P}_{1},\cdots,\mathcal{P}_{T}\}, use the Transformer encoder to obtain the encoded embeddings \{\textbf{h}_{u_{1}}^{C_{p}},\cdots,\textbf{h}_{u_{T}}^{C_{p}}\}, and average these embeddings to obtain the final embedding \textbf{h}_{u}^{C_{p}}.

3.5. Model Pre-training & Fine-tuning Process

We adopt the pre-training and fine-tuning paradigm (Qiu et al., 2020) to train the GNN and Transformer encoders.

During the pre-training stage, we independently train each pretext task using the objective functions Eq. (7), Eq. (9), Eq. (10) and Eq. (11) to optimize the parameters \{\Theta_{1},\Theta_{2},\Theta_{3},\Theta_{4}\}. We assign each pretext task an independent set of initialized user and item embeddings, and do not share embeddings across the pretext tasks. Therefore, these pretext tasks can be trained fully in parallel.

During the fine-tuning process, we initialize the GNN and Transformer encoders with the trained parameters, and fine-tune them via the downstream recommendation task. Specifically, for each target cold-start user and his interacted neighbors {𝒩1(u),,𝒩L(u)}\{\mathcal{N}^{1}(u),\cdots,\mathcal{N}^{L}(u)\} of each order, we first use the trained GNN and Transformer encoders corresponding to each pretext task to generate the user embedding huRg\textbf{h}_{u}^{R_{g}}, huCg\textbf{h}_{u}^{C_{g}}, huRp\textbf{h}_{u}^{R_{p}} and huCp\textbf{h}_{u}^{C_{p}}. Then we concatenate the generated embeddings and transform them into the final user embedding:

(12) \displaystyle\textbf{h}_{u}^{*}=\mathbf{W}\cdot(\textbf{h}_{u}^{R_{g}}\,||\,\textbf{h}_{u}^{C_{g}}\,||\,\textbf{h}_{u}^{R_{p}}\,||\,\textbf{h}_{u}^{C_{p}}),

where Θr={𝐖}\Theta_{r}=\{\mathbf{W}\} is the parameter matrix. We generate the final item embedding hi\textbf{h}_{i}^{*} in a similar way. Next we calculate the relevance score as the inner product of user and item final embeddings, i.e., y(u,i)=huThiy(u,i)=\textbf{h}_{u}^{*^{\mathrm{T}}}\textbf{h}_{i}^{*}. Finally, we use BPR loss defined in Eq. (5) to optimize Θr\Theta_{r} and fine-tune {Θ1,Θ2,Θ3,Θ4}\{\Theta_{1},\Theta_{2},\Theta_{3},\Theta_{4}\}.
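The fusion of Eq. (12) and the inner-product scoring can be sketched with toy list-based vectors (a sketch only; real code would use a matrix library and learn W by back-propagation):

```python
def fuse(W, embs):
    """Eq. (12) sketch: concatenate the four pretext-task embeddings
    (h^{Rg} || h^{Cg} || h^{Rp} || h^{Cp}) and apply the linear map W,
    a d x 4d matrix stored as a list of rows."""
    x = [v for e in embs for v in e]   # concatenation
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def score(h_u, h_i):
    """Relevance y(u, i) as the inner product of the final embeddings."""
    return sum(a * b for a, b in zip(h_u, h_i))
```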

3.6. Discussions

As we can see, the GNN and Transformer encoders are the main components of the pretext tasks. For the GNN encoder, the time complexity is {\mathcal{O}}(T_{GNN}+T_{S}), where {\mathcal{O}}(T_{GNN}) is the time complexity of the layer-wise propagation of the GNN model and {\mathcal{O}}(T_{S}) is the time complexity of the sampling strategy. Since different GNN models have different {\mathcal{O}}(T_{GNN}), we select classic GNN models and list their complexities: LightGCN (He et al., 2020): {\mathcal{O}}(|\mathcal{R}^{+}|+T_{S}); NGCF (Wang et al., 2019b): {\mathcal{O}}(|\mathcal{R}^{+}|\,d^{2}+T_{S}); GCMC (van den Berg et al., 2018): {\mathcal{O}}(|\mathcal{R}^{+}|\,d^{2}+T_{S}); GraphSAGE (Hamilton et al., 2017): {\mathcal{O}}(|\mathcal{N}^{+}|+T_{S}), where |\mathcal{R}^{+}| denotes the number of nonzero entries in the Laplacian matrix, |\mathcal{N}^{+}| is the total number of sampled instances, and d is the embedding size. The time complexity of the dynamic sampling strategy is {\mathcal{O}}(T_{S}) = {\mathcal{O}}(E\cdot d\cdot|\mathcal{N}^{+}|) and that of the adaptive sampling strategy is {\mathcal{O}}(T_{S}) = {\mathcal{O}}(E\cdot M\cdot d\cdot|\mathcal{N}^{+}|), where E is the number of epochs until convergence and M is the number of sampling rounds. Compared with the adaptive sampling strategy, the dynamic strategy does not need the M sampling rounds and converges in fewer epochs E, and thus has a smaller time complexity. For the Transformer encoder, the time complexity is {\mathcal{O}}(T^{2}\cdot d), where T is the length of the user-item path.

4. Experiments

Following PT-GNN (Hao et al., 2021), we conduct an intrinsic evaluation task to assess the quality of the user/item embeddings, and an extrinsic task to assess the cold-start recommendation performance. We answer the following research questions:

  • RQ1: How does MPT{\rm MPT} perform embedding inference and cold-start recommendation compared with the state-of-the-art GNN and pre-training GNN models?

  • RQ2: What are the benefits of performing pretext tasks in both intrinsic and extrinsic evaluation?

  • RQ3: How do different settings influence the effectiveness of the proposed MPT{\rm MPT} model?

4.1. Experimental Settings

4.1.1. Datasets

We select three public datasets: MovieLens-1M (Ml-1M) (https://grouplens.org/datasets/movielens/) (Harper and Konstan, 2016), MOOCs (http://moocdata.cn/data/course-recommendation) (Zhang et al., 2019) and Gowalla (https://snap.stanford.edu/data/loc-gowalla.html) (Liang et al., 2016). Table 1 illustrates the statistics of these datasets.

Table 1. Statistics of the Datasets.
Dataset #Users #Items #Interactions #Sparse Ratio
MovieLens-1M 6,040 3,706 1,000,209 4.47%
MOOCs 82,535 1,302 458,453 0.42%
Gowalla 29,858 40,981 1,027,370 0.08%

4.1.2. Comparison Methods

We select three types of baselines, including the neural collaborative filtering model, the GNN models and the self-supervised graph learning models.

  • NCF (He et al., 2017): is a neural collaborative filtering model which uses multi-layer perceptron and matrix factorization to learn the representations of users/items.

  • PinSage (Ying et al., 2018): employs the GraphSAGE (Hamilton et al., 2017) model, which samples neighbors randomly and aggregates them by the AVERAGE function.

  • GCMC (van den Berg et al., 2018): employs the standard GCN (Kipf and Welling, 2017) model to learn the embeddings of users and items.

  • NGCF (Wang et al., 2019b): primarily follows the message-passing-based GNN model (Gilmer et al., 2017), but additionally injects second-order interactions during the message passing process.

  • LightGCN (He et al., 2020): discards the feature transformation and nonlinear activation functions in NGCF.

  • SGL (Wu et al., 2021): contrasts the node representation within the graphs from multiple views, where node dropout, edge dropout and random walk are adopted to generate these views. We find edge dropout has the best performance.

  • PT-GNN (Hao et al., 2021): takes the embedding reconstruction as the pretext task to explicitly improve the embedding quality.

For each vanilla GNN model (e.g., GCMC, NGCF), notation GNN+{\rm GNN}^{+} means we apply GNN in PT{\rm PT}-GNN{\rm GNN}, and notation GNN{\rm GNN}^{*} means we apply GNN in MPT{\rm MPT}. Since SGL adopts multi-task learning paradigm for recommendation, for fair comparison, we compare it in the extrinsic recommendation task and use notation GNN{\rm GNN}-SGL{\rm SGL} to denote it.

4.1.3. Intrinsic and Extrinsic Settings.

Following PT-GNN (Hao et al., 2021), we divide each dataset into the meta-training set D_{T} and the meta-test set D_{N}. We train and evaluate the proposed MPT model on the intrinsic user/item embedding inference task on D_{T}. Once the proposed MPT model is trained, we fine-tune it on the extrinsic task and evaluate its performance on D_{N}. We select the users/items from each dataset with sufficient interactions as the target users/items in D_{T}, as the intrinsic evaluation needs the ground-truth embeddings of users/items inferred from sufficient interactions. In the cold-start user scenario, we place the users whose number of directly interacted items (first-order neighbors) exceeds n_{i} into D_{T} and leave the rest in D_{N}. We set n_{i} to 25, 10 and 25 for Ml-1M, MOOCs and Gowalla, respectively. Similarly, in the cold-start item scenario, we place the items whose number of directly interacted users (first-order neighbors) exceeds n_{u} into D_{T} and leave the rest in D_{N}, where we set n_{u} to 15, 10 and 15 for Ml-1M, MOOCs and Gowalla, respectively. We set K to 3 and 8 in the intrinsic task, and to 8 in the extrinsic task. By default, we set d to 32, the learning rate to 0.003, L to 4, T to 6, a to 0.2, b to 0.2 and \tau to 0.2.
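The threshold-based meta-split described above can be sketched as follows (a toy sketch; `interactions` is a hypothetical mapping from each user to the items he interacted with):

```python
def meta_split(interactions, n_i):
    """Users whose first-order neighbor count exceeds n_i go to the
    meta-training set D_T; the rest form the meta-test set D_N."""
    d_t = {u: its for u, its in interactions.items() if len(its) > n_i}
    d_n = {u: its for u, its in interactions.items() if len(its) <= n_i}
    return d_t, d_n
```

The item-side split works the same way with the threshold n_u on the interacted users of each item.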

4.1.4. Intrinsic Evaluations: Embedding Inference

We conduct the intrinsic evaluation task, which aims to infer the embeddings of cold-start users and items by the proposed MPT{\rm MPT} model. Both the evaluations on user embedding inference and item embedding inference are performed.

Training and Test Settings. Following PT{\rm PT}-GNN{\rm GNN} (Hao et al., 2021), we perform intrinsic evaluation on DTD_{T}. Specifically, same as PT{\rm PT}-GNN{\rm GNN}, we train NCF (He et al., 2017) to get the ground-truth embeddings for the target users/items in DTD_{T}. We also explore whether MPT{\rm MPT} is sensitive to the ground-truth embedding generated by other models such as LightGCN (He et al., 2020). We randomly split DTD_{T} into the training set TrainTTrain_{T} and the test set TestTTest_{T} with a ratio of 7:3. To mimic the cold-start users/items on TestTTest_{T}, we randomly keep KK neighbors for each user/item, which results in at most KlK^{l} neighbors (1lL1\leq l\leq L) for each target user/item. Thus TestTTest_{T} is changed into TestTTest_{T}^{\prime}.

We train NCF transductively on the merged dataset TrainTTrain_{T} and TestTTest_{T}^{\prime}. We train the vanilla GNN models by BPR loss (He et al., 2020) on TrainTTrain_{T}. We train PT{\rm PT}-GNN{\rm GNN} on TrainTTrain_{T}, where we first perform Eq. (4) LL-1 steps to obtain the refined first-order neighbors, and then average them to reconstruct the target embedding; finally we use Eq. (2) to measure the quality of the predicted embedding. We train MPT{\rm MPT} on TrainTTrain_{T}, where we first perform four pretext tasks by Eq. (7), Eq. (9), Eq. (10) and Eq. (11), and then use Eq. (12) to fuse the generated embeddings; finally we use Eq. (2) to measure the quality of the fused embedding. Note that the embeddings in all the models are randomly initialized. We use cosine similarity to measure the agreement between the ground-truth and predicted embeddings, due to its popularity as an indicator for the semantic similarity between embeddings.

Table 2. Overall performance of user/item embedding inference with cosine similarity. LL=4.
Methods    Ml-1M (user)     MOOCs (user)     Gowalla(user)     Ml-1M (item)     MOOCs (item)     Gowalla(item)
   3-shot 8-shot     3-shot 8-shot     3-shot 8-shot     3-shot 8-shot     3-shot 8-shot     3-shot 8-shot
NCF{\rm NCF}    0.490 0.533     0.451 0.469     0.521 0.558     0.436 0.496     0.482 0.513     0.482 0.491
PinSage{\rm PinSage}    0.518 0.557     0.542 0.564     0.552 0.567     0.558 0.573     0.558 0.591     0.556 0.599
PinSage+{\rm PinSage}^{+}    0.659 0.689     0.651 0.669     0.663 0.692     0.734 0.745     0.658 0.668     0.668 0.676
PinSage{\rm PinSage}^{*}    0.690 0.698     0.668 0.673     0.687 0.695     0.741 0.748     0.661 0.669     0.673 0.683
GCMC{\rm GCMC}    0.513 0.534     0.546 0.569     0.546 0.562     0.559 0.566     0.554 0.558     0.553 0.557
GCMC+{\rm GCMC}^{+}    0.600 0.625     0.750 0.765     0.716 0.734     0.725 0.733     0.607 0.615     0.650 0.667
GCMC{\rm GCMC}^{*}    0.629 0.632     0.782 0.790     0.741 0.748     0.744 0.764     0.608 0.616     0.675 0.681
NGCF{\rm NGCF}    0.534 0.554     0.531 0.547     0.541 0.557     0.511 0.516     0.503 0.509     0.503 0.506
NGCF+{\rm NGCF}^{+}    0.591 0.618     0.541 0.573     0.552 0.574     0.584 0.596     0.549 0.560     0.576 0.591
NGCF{\rm NGCF}^{*}    0.645 0.650     0.598 0.601     0.582 0.589     0.609 0.613     0.606 0.611     0.590 0.594
LightGCN{\rm LightGCN}    0.549 0.569     0.530 0.534     0.581 0.592     0.600 0.631     0.590 0.616     0.606 0.622
LightGCN+{\rm LightGCN}^{+}    0.636 0.644     0.646 0.654     0.614 0.617     0.691 0.701     0.667 0.676     0.693 0.701
LightGCN{\rm LightGCN}^{*}    0.641 0.647     0.652 0.657     0.695 0.700     0.699 0.704     0.669 0.691     0.768 0.774

4.1.5. Extrinsic Evaluation: Recommendation

We apply the pre-training GNN model into the downstream recommendation task and evaluate its performance.

Training and Testing Settings. We consider the scenario of cold-start users and use the meta-test set D_{N} to perform recommendation. For each user in D_{N}, we select the earliest c% (c% = 20% by default) of his interacted items in chronological order into the training set Train_{N}, and leave the rest in the test set Test_{N}. We pre-train our model on D_{T} and fine-tune it on Train_{N} according to Section 3.5.
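The chronological split described above can be sketched as:

```python
def chrono_split(items_in_order, c):
    """Put the earliest c% of a cold-start user's interacted items
    (already sorted chronologically) into Train_N; the rest go to Test_N."""
    k = max(1, int(len(items_in_order) * c))
    return items_in_order[:k], items_in_order[k:]
```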

The vanilla GNN and NCF models are trained by the BPR loss on D_{T} and Train_{N}. For each user in Test_{N}, we calculate his relevance score to each of the remaining (1-c)% items. We adopt Recall@\mathcal{K} and NDCG@\mathcal{K} as the metrics to evaluate the items ranked by the relevance scores. By default, we set \mathcal{K} to 20 for Ml-1M, MOOCs and Gowalla.
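The two ranking metrics can be sketched as follows (binary relevance, a common formulation; the paper's exact NDCG variant may differ):

```python
import math

def recall_at_k(ranked, relevant, k):
    """Recall@K: fraction of the held-out relevant items found in the top K."""
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)

def ndcg_at_k(ranked, relevant, k):
    """NDCG@K with binary relevance: DCG of the ranking over the ideal DCG."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, item in enumerate(ranked[:k]) if item in relevant)
    idcg = sum(1.0 / math.log2(i + 2)
               for i in range(min(len(relevant), k)))
    return dcg / idcg
```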

5. Experimental Results

5.1. Performance Comparison (RQ1)

5.1.1. Overall Performance Comparison

We report the overall performance of intrinsic embedding inference and extrinsic recommendation tasks in Table 2 and Table 3. We find that:

  • MPT (denoted as {\rm GNN}^{*}) outperforms NCF and the vanilla GNN models, which indicates the effectiveness of the pre-training strategy. MPT also performs better than the pre-training model PT-GNN (denoted as {\rm GNN}^{+}) and SGL (denoted as {\rm GNN}-{\rm SGL}). This indicates the superiority of simultaneously considering the intra- and inter-correlations of nodes as well as the short- and long-range dependencies of nodes.

  • In Table 2, when KK decreases from 8 to 3, the performance of all the baseline methods drops by a large scale, while MPT{\rm MPT} drops a little. This indicates MPT{\rm MPT} is able to learn high-quality embeddings for users or items that have extremely sparse interactions.

  • In the intrinsic task (cf. Table 2), we notice that most models generally perform better on item embedding inference than on user embedding inference. The reason is that the users in D_{T} have more interacted neighbors than the items in D_{T}, so the learned ground-truth user embeddings show much more diversity than the learned ground-truth item embeddings, and using only K^{l} (with small K) neighbors to learn a user embedding is harder than learning an item embedding. To validate this point of view, we calculate the average number of interacted neighbors of users and items in D_{T} among the three datasets. In Ml-1M, MOOCs and Gowalla, when the layer size L=1, the users interact with an average of 178.26, 134.90 and 178.27 items, while the items interact with an average of 32.14, 15.75 and 32.14 users. When the layer size L gets larger, the numbers of interacted neighbors of users and items show the same trend. The results indicate that the abundant users have far more interacted neighbors than the abundant items, which verifies the above analysis.

  • Aligning Table 2 and Table 3, we notice that using GCMC or LightGCN as the backbone GNN model always yields better performance than using PinSage or NGCF. The possible reason is that, compared with PinSage and NGCF, GCMC and LightGCN discard the complex feature transformation operation, which contributes to learning high-quality embeddings and better recommendation performance. This phenomenon is consistent with LightGCN's (He et al., 2020) findings.

Table 3. Overall recommendation performance with sparse rate cc%=20%, LL=4.
Methods    Ml-1M     MOOCs     Gowalla
   Recall NDCG     Recall NDCG     Recall NDCG
NCF{\rm NCF}    0.004 0.009     0.132 0.087     0.039 0.013
PinSage{\rm PinSage}    0.006 0.012     0.085 0.066     0.003 0.011
PinSage{\rm PinSage}-SGL{\rm SGL}    0.040 0.033     0.172 0.086     0.011 0.021
PinSage+{\rm PinSage}^{+}    0.047 0.038     0.150 0.089     0.008 0.019
PinSage{\rm PinSage}^{*}    0.054 0.036     0.182 0.099     0.045 0.021
GCMC{\rm GCMC}    0.008 0.006     0.232 0.172     0.006 0.033
GCMC{\rm GCMC}-SGL{\rm SGL}    0.038 0.037     0.239 0.177     0.033 0.035
GCMC+{\rm GCMC}^{+}    0.044 0.041     0.248 0.187     0.018 0.032
GCMC{\rm GCMC}^{*}    0.071 0.076     0.272 0.212     0.065 0.043
NGCF{\rm NGCF}    0.052 0.058     0.208 0.171     0.009 0.013
NGCF{\rm NGCF}-SGL{\rm SGL}    0.062 0.079     0.216 0.177     0.028 0.014
NGCF+{\rm NGCF}^{+}    0.064 0.071     0.217 0.181     0.037 0.014
NGCF{\rm NGCF}^{*}    0.083 0.077     0.241 0.208     0.047 0.016
LightGCN{\rm LightGCN}    0.058 0.064     0.217 0.169     0.011 0.023
LightGCN{\rm LightGCN}-SGL{\rm SGL}    0.066 0.077     0.275 0.178     0.034 0.028
LightGCN+{\rm LightGCN}^{+}    0.078 0.071     0.308 0.184     0.049 0.031
LightGCN{\rm LightGCN}^{*}    0.094 0.101     0.346 0.223     0.051 0.036
Figure 3. Cold-start recommendation under different n_i and c%.

5.1.2. Interacted Number and Sparse Rate Analysis

It is still unclear how MPT handles cold-start users with different numbers of interacted items n_i and sparse rates c%. To this end, we vary the interaction number n_i in the range \{5,10,15\} and the sparse rate c% in the range \{20\%,40\%,60\%\}, select LightGCN as the backbone GNN model, and report the cold-start recommendation performance on the MOOCs dataset in Fig. 3. The smaller n_i and c% are, the fewer interactions the cold-start users in D_N have. We find that:

  • MPT{\rm MPT} (denoted as LightGCN{\rm LightGCN}^{*}) is consistently superior to all the other baselines, which justifies the superiority of MPT{\rm MPT} in handling cold-start recommendation with different nin_{i} and cc%.

  • When nin_{i} decreases from 15 to 5, MPT{\rm MPT} has a larger improvement compared with other baselines, which verifies its capability to solve the cold-start users with extremely sparse interactions.

  • When cc% decreases from 60% to 20%, MPT{\rm MPT} has a larger improvement compared with other baselines, which again verifies the superiority of MPT{\rm MPT} in handling cold-start users with different sparse rate.

5.2. Ablation Study (RQ2)

Figure 4. Intrinsic and extrinsic task evaluation under individual or compositional pretext tasks: (a) cosine similarity (user) %, (b) cosine similarity (item) %, (c) cosine similarity, (d) Recall@20, (e) NDCG@20, (f) Recall@20 & NDCG@20. K=3, L=4, T=6, c%=20%. We evaluate the variant models on MOOCs (results on Ml-1M and Gowalla show the same trend and are omitted for space).

5.2.1. Impact of Pretext Tasks

It is still not clear which of the pretext tasks is responsible for the good performance of MPT. To answer this question, we apply LightGCN in MPT, perform individual or compositional pretext tasks, and report the performance on the intrinsic and extrinsic tasks in Fig. 4. Notations R_g, C_g, R_p and C_p denote the pretext tasks of reconstruction with GNN (objective Eq. (7)), contrastive learning with GNN (objective Eq. (9)), reconstruction with Transformer (objective Eq. (10)) and contrastive learning with Transformer (objective Eq. (11)), respectively. Notation {\rm LightGCN}^{*}-R_g means we discard only the reconstruction task with the GNN encoder and use the other three pretext tasks (objectives Eq. (9), Eq. (10) and Eq. (11)) as the variant model; the other variant models are named in a similar way. Aligning Table 2 with Fig. 4, we find that:

  • Combining all the pretext tasks benefits both embedding quality and recommendation performance. This indicates that simultaneously considering the intra- and inter-correlations of users and items benefits cold-start representation learning and recommendation.

  • Compared with the other variant models, performing the contrastive learning task with the GNN or Transformer encoder alone, or with both encoders, leads to poor performance on the intrinsic task but satisfactory performance on the recommendation task. The reason is that contrastive learning does not focus on predicting the target embedding, but it captures the inter-correlations of users or items and thus benefits the recommendation task.

Figure 5. Comparison among different sampling strategies ((a)/(c) Recall@20, (b)/(d) NDCG@20).
Table 4. Analysis of the average sampling/training time in each epoch for different sampling strategies. We run each variant model ten times, and calculate the average sampling and training time per epoch.
Dataset Ml-1M MOOCs
Method Avg. Sampling Time Avg. Training Time Avg. Sampling Time Avg. Training Time
LightGCNr{\rm LightGCN}^{r} 0.56s 201.02s 2.78s 237.63s
LightGCNi{\rm LightGCN}^{i} 6.56s 204.64s 2.86s 240.12s
LightGCN+{\rm LightGCN}^{+} 134.64s 205.13s 160.37s 241.75s
LightGCN{\rm LightGCN}^{*} 8.93s 205.76s 3.20s 240.67s

5.2.2. Impact of the Sampling Strategy

We study the effect of the proposed dynamic sampling strategy. We compare four variants of the LightGCN model: {\rm LightGCN}^{+}, which adaptively samples neighbors (Hao et al., 2021); {\rm LightGCN}^{i}, which samples neighbors according to the importance sampling strategy (Chen et al., 2018); {\rm LightGCN}^{r}, which randomly samples neighbors (Hamilton et al., 2017); and {\rm LightGCN}^{*}, which dynamically samples neighbors. Notably, for a fair comparison, for each target node the number of sampled neighbors at the l-th layer is at most K^{l}. We report the average sampling time and the average training time per epoch (the latter includes the training time of all four pretext tasks but not the neighbor sampling time) for these sampling strategies on MOOCs and Ml-1M in Table 4, and report the recommendation performance on MOOCs and Ml-1M in Fig. 5. We find that:

  • The sampling time of the proposed dynamic sampling strategy is of the same magnitude as that of the importance and random sampling strategies, which is entirely acceptable, while the adaptive sampling strategy takes far more sampling time, since it adopts a Monte-Carlo based policy gradient strategy with complex action-state trajectories over the entire neighbor sampling process. Besides, the average training time per epoch of all the variant models is almost the same, since each variant performs the same pretext tasks, i.e., the graph convolution operation in the GNN encoder and the self-attention operation in the Transformer encoder. Aligning Table 4 and Fig. 5, we find that MPT has a quick convergence speed.

  • Although the adaptive sampling strategy achieves competitive performance compared with the random and importance sampling strategies, it converges slowly due to the complex RL-based sampling process.

  • Compared with the other sampling strategies, the proposed dynamic sampling strategy not only has the best performance, but also has quick convergence speed.

5.3. Study of MPT (RQ3)

5.3.1. Effect of Ground-truth Embedding

As mentioned in Section 2, when performing embedding reconstruction with the GNN and Transformer encoders, we choose NCF to learn the ground-truth embeddings. However, one may wonder whether MPT's performance is sensitive to the ground-truth embeddings. To this end, we use the baseline GNN models to learn the ground-truth embeddings, and only perform the embedding reconstruction task with LightGCN or Transformer. For NCF, we concatenate the embeddings produced by the MLP and GMF modules as the ground-truth embeddings, while for PinSage, GCMC, NGCF and LightGCN, we combine the embeddings obtained at each layer to form the ground-truth matrix, i.e., \textbf{E}=\textbf{E}^{(0)} + \cdots + \textbf{E}^{(L)}, where \textbf{E}^{(l)}\in\mathbb{R}^{(|\mathcal{U}|+|\mathcal{I}|)\times d} is the concatenated user-item embedding matrix at the l-th convolution step. Fig. 6(a) and Fig. 6(b) show the recommendation performance using LightGCN and the Transformer encoder, respectively. The suffixes +NCF, +L, +G, +P and +N denote that the ground-truth embeddings are obtained by NCF, LightGCN, GCMC, PinSage and NGCF, respectively. We find that:

  • All the models equipped with different ground-truth embeddings achieve almost the same performance. This indicates that our model is not sensitive to the ground-truth embeddings, as NCF and the vanilla GNN models are good enough to learn high-quality user or item embeddings from abundant interactions.

  • Besides, when LightGCN is used to obtain the ground-truth embeddings (denoted as LightGCN*+L and Trans*+L), we obtain a marginal performance gain compared with the other variant models.
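The layer combination $\textbf{E}=\textbf{E}^{(0)}+\cdots+\textbf{E}^{(L)}$ used to build the ground-truth matrix for the GNN baselines can be sketched in a few lines; the function name and the toy shapes below are our own, chosen only for illustration.

```python
import numpy as np

def combine_layers(layer_embs):
    """Form the ground-truth matrix E = E^(0) + ... + E^(L) by
    element-wise summing the per-layer user-item embedding matrices,
    each of shape (|U| + |I|, d)."""
    return np.sum(np.stack(layer_embs, axis=0), axis=0)
```

For NCF there is no layer stack, so the MLP and GMF outputs are concatenated along the feature dimension instead.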

Refer to caption
(a) LightGCN
Refer to caption
(b) Transformer Encoder
Figure 6. Sensitive analysis of ground-truth embeddings.

Refer to caption
(a) LightGCN
Refer to caption
(b) Transformer Encoder
Figure 7. Recommendation performance under individual or compositional data augmentations. $c\%$ = 20%, $L$ = 4.

5.3.2. Effect of Data Augmentation

To understand the effects of individual and compositional data augmentations in the contrastive learning (CL) task with the GNN or Transformer encoder, we investigate the performance of the CL pretext tasks in MPT when applying the augmentations individually or in pairs. We report the performance in Fig. 7. The notations LightGCN*D, LightGCN*S and LightGCN*SD mean that we apply LightGCN to the CL task with the deletion, substitution and combined operations, respectively; the notations Trans*D, Trans*S and Trans*SD denote that we use the Transformer encoder in the CL task with the deletion, substitution and combined operations, respectively. We find that:

  • Composing augmentations can lead to better recommendation performance than performing individual data augmentation.

  • Aligning Fig. 7 and Table 3, we find that combining substitution and deletion with GNNs achieves performance competitive with SGL. The reason is that, like SGL, our proposed contrastive data augmentations also change the graph structure, and thus the inter-correlations of nodes can be captured.
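The CL objective that contrasts the two augmented views of a node can be sketched with a generic InfoNCE-style loss. This is a standard formulation (Chen et al., 2020a), not the paper's exact objective; the temperature value and function name are assumptions.

```python
import numpy as np

def info_nce_loss(z1, z2, tau=0.2):
    """InfoNCE-style contrastive loss between two augmented views.
    z1, z2: (N, d) row-normalized embeddings; row i of z1 and row i of z2
    form a positive pair, while all other rows serve as negatives."""
    logits = z1 @ z2.T / tau                          # (N, N) scaled similarities
    logits -= logits.max(axis=1, keepdims=True)       # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))                # -log p(positive pair)
```

Minimizing this loss pulls the two views of the same node together while pushing apart views of different nodes, which is how the inter-correlations across subgraphs are captured.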

Refer to caption
(a) User embedding inference
Refer to caption
(b) Item embedding inference

Refer to caption
(c) Recommendation Recall@20
Refer to caption
(d) Recommendation NDCG@20
Figure 8. Embedding inference and recommendation performance under different layer LL.

Refer to caption
(a) Embedding inference
Refer to caption
(b) Recommendation
Figure 9. Embedding inference and recommendation performance under different path length TT.

5.3.3. Effect of Hyperparameters

We move on to study different designs of the layer depth $L$, the deletion ratio $a\%$, the substitution ratio $b\%$ and the user-item path length $T$ in MPT. Due to the space limitation, we omit the results on MOOCs and Gowalla, which show trends similar to those on Ml-1M.

  • We only perform embedding reconstruction with GNNs in MPT, and report the results in Fig. 8. We find that the performance first increases and then drops when increasing $L$ from 1 to 4. The peak point is 3 in most cases. This indicates that GNNs can only capture short-range dependencies, which is consistent with the finding of LightGCN (He et al., 2020).

  • We only perform embedding reconstruction with the Transformer encoder in MPT, and report the results in Fig. 9. We find that the performance first improves when the path length $T$ increases from 3 to 8, and then drops from 8 to 10. This indicates that the Transformer encoder is good at capturing long-range rather than short-range dependencies of nodes.

  • We perform contrastive learning with LightGCN and with the Transformer encoder under different deletion ratios $a\%$ and substitution ratios $b\%$ independently, and report the recommendation results in Fig. 10. Based on the results, we find that the recommendation performance is not sensitive to the substitution ratio $b\%$, but is somewhat sensitive to the deletion ratio $a\%$: as $a\%$ grows, the performance first increases and then drops, with the peak point around 0.2-0.3 in most cases. The reason is that when we perform the substitution operation, we randomly replace $u$ or $i$ with one of its parent's interacted first-order neighbors, so the replacement can still reflect its parent's characteristics to some extent. However, once the deletion ratio gets large (e.g., 0.9), it results in a highly skewed graph structure, which cannot bring much useful collaborative signal to the target node.
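The two augmentation operators analyzed above can be sketched on a node's neighbor list as follows. This is a minimal illustration under our own assumptions (function names, list representation); the paper applies these operations within sampled subgraphs.

```python
import random

def delete_neighbors(neighbors, a):
    """Deletion: randomly drop a fraction `a` of the node's neighbors,
    keeping at least one so the node is not isolated."""
    keep = max(1, round(len(neighbors) * (1 - a)))
    return random.sample(neighbors, keep)

def substitute_neighbors(neighbors, first_order, b):
    """Substitution: replace a fraction `b` of the neighbors with nodes
    drawn from the parent's interacted first-order neighbors, so the
    replacements still reflect the parent's characteristics."""
    out = list(neighbors)
    n_sub = round(len(out) * b)
    for idx in random.sample(range(len(out)), n_sub):
        out[idx] = random.choice(first_order)
    return out
```

This also makes the asymmetry in Fig. 10 plausible: substitution preserves the neighborhood size and draws replacements from related nodes, whereas aggressive deletion directly removes collaborative signal.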

Refer to caption
(a) Contrastive learning with LightGCN
Refer to caption
(b) Contrastive learning with LightGCN

Refer to caption
(c) Contrastive learning with Transformer
Refer to caption
(d) Contrastive learning with Transformer
Figure 10. Sensitive analysis under different deletion and substitution ratios in the contrastive learning pretext tasks.

6. Related Work

Self-supervised Learning with GNNs. Recent advances in self-supervised learning for recommendation have received much attention (Yu et al., 2022a; Qiu et al., 2022; Yu et al., 2021a; Xia et al., 2021b, a), where self-supervised learning with GNNs aims to find the correlations of nodes within the graph itself, so as to alleviate the data sparsity issue in graphs (Hu et al., 2020; Qiu et al., 2020; Velickovic et al., 2019; Chen et al., 2019; Sun et al., 2020; You et al., 2020; Yu et al., 2022b, 2021b). More recently, some researchers explore pre-training GNNs on user-item graphs for recommendation. For example, PT-GNN (Hao et al., 2021) reconstructs embeddings under the meta-learning setting. SGL (Wu et al., 2021) contrasts node representations via node dropout, edge dropout and random walk data augmentation operations. PMGT (Liu et al., 2021) reconstructs the graph structure and node features using side information. GCN-P/COM-P (Meng et al., 2021) learns the representations of entities constructed from the side information. However, these methods fail to capture long-range dependencies effectively, as the GNN model can only capture low-order correlations of nodes. Besides, the pretext tasks of these methods consider either the intra-correlations (Hao et al., 2021; Liu et al., 2021; Meng et al., 2021) or the inter-correlations (Wu et al., 2021) of nodes, rather than both. Further, side information is not always available, which makes it difficult to construct such pretext tasks. To solve the above problems, we propose MPT, which considers both short- and long-range dependencies of nodes, and both intra- and inter-correlations of nodes.

Cold-start Recommendation. Cold-start recommendation is an intractable problem in recommender systems. Some researchers try to incorporate side information such as knowledge graphs (KGs) (Wang et al., 2019d, a) or the contents of users/items (Yin et al., 2014, 2017; Chen et al., 2020b; Yin et al., 2019, 2020; Wang et al., 2019c; Gharibshah et al., 2020) to enhance the representations of users/items. However, such contents are not always available, and it is difficult to align users/items with the entities in KGs. Another research line explicitly improves the embedding quality of cold-start users/items, using either meta-learning (Finn et al., 2017; Munkhdalai and Yu, 2017; Vinyals et al., 2016; Snell et al., 2017) or graph neural networks (GNNs) (Ying et al., 2018; Wang et al., 2019b; He et al., 2020; Yin et al., 2020) for recommendation. However, the meta-learning techniques do not explicitly exploit the high-order neighbors of the cold-start users/items, and the GNNs do not explicitly address high-order cold-start users/items, so neither can thoroughly improve the embedding quality of the cold-start users/items.

7. Conclusion

We propose a multi-strategy based pre-training method, MPT, which extends PT-GNN from the perspectives of model architecture and pretext tasks to improve cold-start recommendation performance. Specifically, in addition to the short-range dependencies of nodes captured by the GNN encoder, we add a Transformer encoder to capture long-range dependencies. In addition to considering the intra-correlations of nodes via the embedding reconstruction task, we add an embedding contrastive learning task to capture the inter-correlations of nodes. Experiments on three datasets demonstrate the effectiveness of our proposed MPT model against the vanilla GNN and pre-training GNN models.

ACKNOWLEDGMENTS

This work is supported by the National Key Research & Development Plan (Grant No. 2018YFB1004401), the National Natural Science Foundation of China (Grant No. 62072460, 62076245, 62172424), the Beijing Natural Science Foundation (Grant No. 4212022), an Australian Research Council Future Fellowship (Grant No. FT210100624) and a Discovery Project (Grant No. DP190101985).

References

  • Burges (2010) Christopher JC Burges. 2010. From ranknet to lambdarank to lambdamart: An overview. Learning 11, 23-581 (2010), 81.
  • Chen et al. (2019) Hongxu Chen, Hongzhi Yin, Tong Chen, Quoc Viet Hung Nguyen, Wen-Chih Peng, and Xue Li. 2019. Exploiting Centrality Information with Graph Convolutions for Network Representation Learning. In ICDE’19. IEEE, 590–601.
  • Chen et al. (2020b) Hongxu Chen, Hongzhi Yin, Xiangguo Sun, Tong Chen, Bogdan Gabrys, and Katarzyna Musial. 2020b. Multi-level Graph Convolutional Networks for Cross-platform Anchor Link Prediction. In SIGKDD’20. ACM, 1503–1511.
  • Chen et al. (2018) Jie Chen, Tengfei Ma, and Cao Xiao. 2018. FastGCN: Fast Learning with Graph Convolutional Networks via Importance Sampling. In ICLR’18.
  • Chen et al. (2020a) Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey E. Hinton. 2020a. A Simple Framework for Contrastive Learning of Visual Representations. In ICML’20 (Proceedings of Machine Learning Research, Vol. 119). PMLR, 1597–1607.
  • Chen and Wong (2020) Tianwen Chen and Raymond Chi-Wing Wong. 2020. Handling Information Loss of Graph Neural Networks for Session-based Recommendation. In SIGKDD’20. ACM, 1172–1180.
  • Finn et al. (2017) Chelsea Finn, Pieter Abbeel, and Sergey Levine. 2017. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. In ICML’17, Vol. 70. 1126–1135.
  • Gharibshah et al. (2020) Zhabiz Gharibshah, Xingquan Zhu, Arthur Hainline, and Michael Conway. 2020. Deep Learning for User Interest and Response Prediction in Online Display Advertising. Data Sci. Eng. 5, 1 (2020), 12–26.
  • Gilmer et al. (2017) Justin Gilmer, Samuel S. Schoenholz, Patrick F. Riley, Oriol Vinyals, and George E. Dahl. 2017. Neural Message Passing for Quantum Chemistry. In ICML’17 (Proceedings of Machine Learning Research, Vol. 70). PMLR, 1263–1272.
  • Hamilton et al. (2017) William L. Hamilton, Zhitao Ying, and Jure Leskovec. 2017. Inductive Representation Learning on Large Graphs. In NeurlPS’17. 1024–1034.
  • Hao et al. (2021) Bowen Hao, Jing Zhang, Hongzhi Yin, Cuiping Li, and Hong Chen. 2021. Pre-Training Graph Neural Networks for Cold-Start Users and Items Representation. In WSDM’21. ACM, 265–273.
  • Harper and Konstan (2016) F. Maxwell Harper and Joseph A. Konstan. 2016. The MovieLens Datasets: History and Context. ACM Trans. Interact. Intell. Syst. (2016), 19:1–19:19.
  • He et al. (2020) Xiangnan He, Kuan Deng, Xiang Wang, Yan Li, Yongdong Zhang, and Meng Wang. 2020. LightGCN: Simplifying and Powering Graph Convolution Network for Recommendation. In SIGIR’20.
  • He et al. (2017) Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. 2017. Neural Collaborative Filtering. In WWW’17. 173–182.
  • Hu et al. (2020) Ziniu Hu, Yuxiao Dong, Kuansan Wang, Kai-Wei Chang, and Yizhou Sun. 2020. GPT-GNN: Generative Pre-Training of Graph Neural Networks. In SIGKDD’20.
  • Kipf and Welling (2017) Thomas N. Kipf and Max Welling. 2017. Semi-Supervised Classification with Graph Convolutional Networks. In ICLR’17.
  • Liang et al. (2016) Dawen Liang, Laurent Charlin, James McInerney, and David M. Blei. 2016. Modeling User Exposure in Recommendation. In WWW’16. ACM, 951–961.
  • Linden et al. (2003) Greg Linden, Brent Smith, and Jeremy York. 2003. Amazon.com Recommendations: Item-to-Item Collaborative Filtering. IEEE Internet Comput. (2003), 76–80.
  • Liu et al. (2021) Yong Liu, Susen Yang, Chenyi Lei, Guoxin Wang, Haihong Tang, Juyong Zhang, Aixin Sun, and Chunyan Miao. 2021. Pre-training Graph Transformer with Multimodal Side Information for Recommendation. In MM’21. ACM, 2853–2861.
  • Meng et al. (2021) Zaiqiao Meng, Siwei Liu, Craig Macdonald, and Iadh Ounis. 2021. Graph Neural Pre-training for Enhancing Recommendations using Side Information. CoRR abs/2107.03936 (2021). https://arxiv.org/abs/2107.03936
  • Munkhdalai and Yu (2017) Tsendsuren Munkhdalai and Hong Yu. 2017. Meta Networks. In ICML’17 (Proceedings of Machine Learning Research, Vol. 70), Doina Precup and Yee Whye Teh (Eds.). 2554–2563.
  • Perozzi et al. (2014) Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. 2014. DeepWalk: online learning of social representations. In SIGKDD’14. ACM, 701–710.
  • Qiu et al. (2020) Jiezhong Qiu, Qibin Chen, Yuxiao Dong, Jing Zhang, Hongxia Yang, Ming Ding, Kuansan Wang, and Jie Tang. 2020. GCC: Graph Contrastive Coding for Graph Neural Network Pre-Training. In SIGKDD’20.
  • Qiu et al. (2022) Ruihong Qiu, Zi Huang, Hongzhi Yin, and Zijian Wang. 2022. Contrastive Learning for Representation Degeneration Problem in Sequential Recommendation. In WSDM’22. ACM, 813–823.
  • Snell et al. (2017) Jake Snell, Kevin Swersky, and Richard S. Zemel. 2017. Prototypical Networks for Few-shot Learning. In NeurlPS’17. 4077–4087.
  • Sun et al. (2020) Fan-Yun Sun, Jordan Hoffmann, Vikas Verma, and Jian Tang. 2020. InfoGraph: Unsupervised and Semi-supervised Graph-Level Representation Learning via Mutual Information Maximization. In ICLR’20.
  • Sutton et al. (1999) Richard S. Sutton, David A. McAllester, Satinder P. Singh, and Yishay Mansour. 1999. Policy Gradient Methods for Reinforcement Learning with Function Approximation. In NeurIPS’99. 1057–1063.
  • van den Berg et al. (2018) Rianne van den Berg, Thomas N. Kipf, and Max Welling. 2018. Graph Convolutional Matrix Completion. In SIGKDD’18. ACM.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In NeurlPS’17. 5998–6008.
  • Velickovic et al. (2019) Petar Velickovic, William Fedus, William L Hamilton, Pietro Liò, Yoshua Bengio, and R Devon Hjelm. 2019. Deep Graph Infomax. In ICLR’19.
  • Vinyals et al. (2016) Oriol Vinyals, Charles Blundell, Tim Lillicrap, Koray Kavukcuoglu, and Daan Wierstra. 2016. Matching Networks for One Shot Learning. In NeurlPS’16. 3630–3638.
  • Wang et al. (2019d) Hongwei Wang, Fuzheng Zhang, Miao Zhao, Wenjie Li, Xing Xie, and Minyi Guo. 2019d. Multi-Task Feature Learning for Knowledge Graph Enhanced Recommendation. In WWW’19. 2000–2010.
  • Wang et al. (2019c) Qinyong Wang, Hongzhi Yin, Hao Wang, Quoc Viet Hung Nguyen, Zi Huang, and Lizhen Cui. 2019c. Enhancing Collaborative Filtering with Generative Augmentation. In SIGKDD’19. ACM, 548–556.
  • Wang et al. (2019a) Xiang Wang, Xiangnan He, Yixin Cao, Meng Liu, and Tat-Seng Chua. 2019a. KGAT: Knowledge Graph Attention Network for Recommendation. In SIGKDD’19. 950–958.
  • Wang et al. (2019b) Xiang Wang, Xiangnan He, Meng Wang, Fuli Feng, and Tat-Seng Chua. 2019b. Neural Graph Collaborative Filtering. In SIGIR’19. 165–174.
  • Williams (1992) Ronald J. Williams. 1992. Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning. Mach. Learn. 8 (1992), 229–256.
  • Wu et al. (2021) Jiancan Wu, Xiang Wang, Fuli Feng, Xiangnan He, Liang Chen, Jianxun Lian, and Xing Xie. 2021. Self-supervised Graph Learning for Recommendation. In SIGIR’21. ACM, 726–735.
  • Wu et al. (2018) Zhirong Wu, Yuanjun Xiong, Stella X. Yu, and Dahua Lin. 2018. Unsupervised Feature Learning via Non-Parametric Instance Discrimination. In CVPR’18. Computer Vision Foundation / IEEE Computer Society, 3733–3742.
  • Xia et al. (2021a) Xin Xia, Hongzhi Yin, Junliang Yu, Yingxia Shao, and Lizhen Cui. 2021a. Self-Supervised Graph Co-Training for Session-based Recommendation. In CIKM’21. ACM, 2180–2190.
  • Xia et al. (2021b) Xin Xia, Hongzhi Yin, Junliang Yu, Qinyong Wang, Lizhen Cui, and Xiangliang Zhang. 2021b. Self-Supervised Hypergraph Convolutional Networks for Session-based Recommendation. In AAAI’21. AAAI Press, 4503–4511.
  • Yin et al. (2014) Hongzhi Yin, Bin Cui, Yizhou Sun, Zhiting Hu, and Ling Chen. 2014. LCARS: A Spatial Item Recommender System. TOIS 32, 3 (2014), 11:1–11:37.
  • Yin et al. (2019) Hongzhi Yin, Qinyong Wang, Kai Zheng, Zhixu Li, Jiali Yang, and Xiaofang Zhou. 2019. Social Influence-Based Group Representation Learning for Group Recommendation. In ICDE’19. IEEE, 566–577.
  • Yin et al. (2020) Hongzhi Yin, Qinyong Wang, Kai Zheng, Zhixu Li, and Xiaofang Zhou. 2020. Overcoming Data Sparsity in Group Recommendation. IEEE Trans. Knowl. Data Eng. (2020).
  • Yin et al. (2017) Hongzhi Yin, Weiqing Wang, Hao Wang, Ling Chen, and Xiaofang Zhou. 2017. Spatial-Aware Hierarchical Collaborative Deep Learning for POI Recommendation. IEEE Trans. Knowl. Data Eng. 19, 11 (2017), 2537–2551.
  • Ying et al. (2018) Rex Ying, Ruining He, Kaifeng Chen, Pong Eksombatchai, William L. Hamilton, and Jure Leskovec. 2018. Graph Convolutional Neural Networks for Web-Scale Recommender Systems. In SIGKDD’18. 974–983.
  • You et al. (2020) Yuning You, Tianlong Chen, Yongduo Sui, Ting Chen, Zhangyang Wang, and Yang Shen. 2020. Graph Contrastive Learning with Augmentations. In NeurIPS’20.
  • Yu et al. (2021a) Junliang Yu, Hongzhi Yin, Min Gao, Xin Xia, Xiangliang Zhang, and Nguyen Quoc Viet Hung. 2021a. Socially-Aware Self-Supervised Tri-Training for Recommendation. In SIGKDD’21. ACM, 2084–2092.
  • Yu et al. (2021b) Junliang Yu, Hongzhi Yin, Jundong Li, Qinyong Wang, Nguyen Quoc Viet Hung, and Xiangliang Zhang. 2021b. Self-Supervised Multi-Channel Hypergraph Convolutional Network for Social Recommendation. In WWW’21. ACM / IW3C2, 413–424.
  • Yu et al. (2022a) Junliang Yu, Hongzhi Yin, Xin Xia, Tong Chen, Jundong Li, and Zi Huang. 2022a. Self-Supervised Learning for Recommender Systems: A Survey. CoRR abs/2203.15876 (2022). https://doi.org/10.48550/arXiv.2203.15876
  • Yu et al. (2022b) Junliang Yu, Hongzhi Yin, Xin Xia, Tong Chen, Lizhen Cui, and Quoc Viet Hung Nguyen. 2022b. Are Graph Augmentations Necessary? Simple Graph Contrastive Learning for Recommendation. In SIGIR’22.
  • Zhang et al. (2019) Jing Zhang, Bowen Hao, Bo Chen, Cuiping Li, Hong Chen, and Jimeng Sun. 2019. Hierarchical Reinforcement Learning for Course Recommendation in MOOCs. In AAAI’19. 435–442.
  • Zhang et al. (2013) Weinan Zhang, Tianqi Chen, Jun Wang, and Yong Yu. 2013. Optimizing top-n collaborative filtering via dynamic negative item sampling. In SIGIR’13. ACM, 785–788.