MB-HGCN: A Hierarchical Graph Convolutional Network for Multi-behavior Recommendation

Mingshi Yan, Zhiyong Cheng, Jing Sun, Fuming Sun^†, and Yuxin Peng, ^† Corresponding Author.

Abstract

Collaborative filtering-based recommender systems that rely on a single type of behavior often encounter serious sparsity issues in real-world applications, leading to unsatisfactory performance. Multi-behavior Recommendation (MBR) is a method that seeks to learn user preferences, represented as vector embeddings, from auxiliary information. By leveraging these preferences for target behavior recommendations, MBR addresses the sparsity problem and improves the accuracy of recommendations. In this paper, we propose MB-HGCN, a novel multi-behavior recommendation model that uses a hierarchical graph convolutional network to learn user and item embeddings from a coarse-grained global level to a fine-grained behavior-specific level. Our model learns global embeddings from a unified homogeneous graph constructed from the interactions of all behaviors, which are then used as initialized embeddings for behavior-specific embedding learning in each behavior graph. We also emphasize the distinctiveness of the user and item behavior-specific embeddings and design two simple-yet-effective strategies to aggregate the behavior-specific embeddings for users and items, respectively. Finally, we adopt multi-task learning for optimization. Extensive experimental results on three real-world datasets demonstrate that our model significantly outperforms the baselines, achieving a relative improvement of 73.93% and 74.21% for HR@10 and NDCG@10, respectively, on the Tmall dataset.

Index Terms:

Collaborative Filtering, Multi-behavior Recommendation, Graph Convolutional Network, Multi-task Learning.

I Introduction

Personalized recommendation is one of the most effective techniques for addressing the problem of information overload, and it has been widely deployed in various information systems [1]. Due to its simplicity and effectiveness, Collaborative Filtering (CF) [2, 3, 4] has become the mainstream approach in contemporary recommender systems. Over the past few decades, many CF-based models have been developed [5, 6, 7, 8], ranging from the early matrix factorization (MF) based methods [3, 9, 7], to deep neural network (DNN) based methods [10, 11], and more recently, graph neural network (GNN) based methods [12, 13]. The rapid advancement of recommendation techniques has greatly enhanced recommendation performance. Since CF-based models mainly rely on the interactions between users and items to learn user preferences and make recommendations, an inherent limitation of those models is that their performance degrades sharply when the available interactions are sparse.

Most existing CF-based methods consider a single behavior, which is usually the target behavior on the platforms (e.g., buy on e-commerce platforms) for modeling. However, such behaviors are often very sparse in real-world systems, leading to a serious sparsity problem in these models. In reality, users engage in various types of behaviors (e.g., view and collect) to interact with items and gather information before making a final decision (i.e., engaging in the target behavior). Those behaviors also contain valuable user preference information, and their interactions are typically richer than those of the target behavior. Therefore, they can be leveraged to learn user preference and alleviate the sparsity problem.

The utilization of other behaviors (also called auxiliary behaviors) to facilitate the recommendation of the target behavior is called multi-behavior recommendation (MBR), which has gained increasing attention in recent years [14, 15, 16, 17, 18, 19]. The key to MBR is how to utilize the auxiliary information to assist user and item embedding learning. Earlier approaches either straightforwardly extended traditional matrix factorization techniques from a single matrix to multiple matrices [20, 21] or enriched the training data with auxiliary behavior data using different sampling strategies [22, 23, 24]. With increasing evidence that MBR is effective, it has attracted more attention, and recent advanced techniques have also been progressively introduced to this task. For instance, NMTR [14] combines DNNs to model the sequence of behaviors, and MATN [25] adopts the multi-head attention mechanism to model multiple behaviors. Furthermore, GCN-based methods employ various strategies on the unified graph constructed by all behaviors to learn user preferences [15, 18, 26, 27]. MBGCN [15] constructs a heterogeneous graph that distinguishes different types of behaviors with different edges to model each behavior separately, and then aggregates the user embeddings according to the importance of each behavior for prediction.

The basic assumption behind MBR models is that the interaction information of different behaviors contains user preferences from different perspectives or to different extents. A common paradigm of existing DNN- or GNN-based MBR models is to first learn user and item embeddings from each behavior via a designed network and then aggregate the learned embeddings with different strategies for target behavior prediction (e.g., with or without attention mechanism) [25, 15, 18, 26, 28, 29]. The difference lies in how to design the network structure to learn better embeddings from different behaviors and how to distill valuable information from each behavior to contribute to the target behavior prediction.

In this work, we propose a novel MBR model with a hierarchical graph convolutional network (MB-HGCN) to utilize auxiliary behaviors for user and item embedding learning. Unlike previous GCN-based methods that directly learn user and item embeddings from the unified heterogeneous graph, our model adopts a different paradigm to learn user and item embeddings via a hierarchical network structure. Specifically, we first learn a global embedding for each user and item in a unified homogeneous graph constructed from the interactions of all behaviors without differentiating the behavior types. We then take the global embeddings as the initial embeddings for each behavior-specific graph, which is constructed from the interactions of each behavior, for subsequent behavior-specific embedding learning. By performing graph convolutional operations on the unified homogeneous graph, which contains all the interaction information of different behaviors, we can fully exploit all the interaction information to learn the global embeddings of the users and items. Although the global embeddings might be coarse-grained because they do not differentiate behavior types, they can serve as good initial embeddings for the subsequent behavior-specific embedding learning on each behavior graph. This can also alleviate the sparsity issue in each behavior, as a good embedding initialization is crucial for representation learning in deep models. The behavior-specific embedding learning on each individual behavior graph aims to capture behavior-specific features for better preference learning.

After the two stages of embedding learning, we aggregate the behavior-specific embeddings with two different strategies for user and item embedding aggregation, respectively. For user embeddings, to distill effective information from different behaviors for the target behavior prediction, we assign the weight of a behavior-specific embedding based on its similarity to the target behavior-specific embedding, with the intuition that the more similar the two embeddings are, the more they contribute to the target behavior. For item embeddings, we adopt a weighting scheme based on the interaction numbers of different behaviors [15]. The rationale behind this design is that item features should be consistent across different behaviors, and the difference between item embeddings learned from different behaviors is caused by the interactions of different users. Finally, we combine the global embedding and the aggregated behavior-specific embeddings for more comprehensive representations. Multi-task learning is adopted to treat each behavior as an independent task in optimization. To evaluate the effectiveness of our model, we perform extensive empirical studies on three large-scale real-world datasets. The experimental results demonstrate that our model outperforms the state-of-the-art MBR models by a large margin. For example, on the Tmall dataset, our model achieves an improvement of 73.93% and 74.21% over the second-best baseline in terms of HR@10 and NDCG@10, respectively. We also conduct comprehensive ablation studies to carefully examine the utility of different designs in our model.

In summary, the main contributions of this work are as follows:

• We propose a hierarchical graph convolutional network for multi-behavior recommendation that learns user and item embeddings from the coarse-grained global level with a unified graph to the fine-grained behavior-specific level in each behavior graph. We believe that this learning paradigm can better utilize the multi-behavior information to learn good user and item embeddings.
• We emphasize the distinctiveness of the user and item behavior-specific embeddings, for which we design two simple-yet-effective aggregation strategies to aggregate the behavior-specific embeddings for users and items, respectively. This is quite different from mainstream aggregation methods that use the same mechanism for both.
• We conduct extensive experiments on three real-world datasets to evaluate the effectiveness of our MB-HGCN model and examine the validity of each component in MB-HGCN. Experimental results show that MB-HGCN achieves a remarkable improvement over the state-of-the-art models in terms of recommendation accuracy. Additionally, we release the code and the parameters involved to benefit other researchers (https://github.com/MingshiYan/MB-HGCN).

The rest of this paper is structured as follows. Section II reviews the related work, and Section III describes our MB-HGCN model in detail. Next, Section IV introduces the experimental setup and reports the experimental results. Finally, Section V concludes this paper.

II Related Work

Multi-behavior recommendation refers to leveraging user-item interaction data of multi-type behaviors for recommendation [14, 23, 15]. Its advantage is to alleviate the data sparsity problem existing in single-behavior-based recommendation methods. Due to its excellent performance, it has attracted increasing attention in recent years [30, 7, 18].

Early multi-behavior recommendation methods extend traditional CF-based methods [20, 21, 22, 23, 24, 16], and the most direct approach is to apply matrix factorization designed for single-behavior data to multi-behavior data. For example, Singh and Gordon [20] proposed a collective matrix factorization model (CMF), in which entity parameters are shared among multiple matrix factorizations. This method was extended by Zhao et al. [21] to perform matrix factorization for different behaviors by sharing item embeddings. In addition, some researchers designed different sampling strategies to utilize the data from multiple behaviors. For example, Loni et al. [22] proposed a negative sampling strategy suitable for multiple behaviors to sample user-item interaction data of different behaviors. Ding et al. [23] further extended this idea with an improved negative sampling strategy that makes better use of the data. Guo et al. [24] introduced a similarity-based sampling strategy, which generates positive and negative samples from multiple auxiliary behaviors to help model training. Qiu et al. [16] proposed an adaptive sampling strategy according to the uncorrelated balance characteristics of samples between different behaviors. These methods supplement the training on the target behavior by exploiting the interaction data from the auxiliary behaviors.

With the development of deep learning, multi-behavior recommendation methods based on deep neural networks (DNN) have been developed [25, 29, 14]. The main idea of such models is to design deep neural networks to learn the embedding of users and items separately from each behavior, and then aggregate them for recommendation. The difference between these methods is mainly reflected in the design of DNN and the aggregation strategies. For example, Xia et al. [25] designed a network consisting of transformers and multi-head attention mechanisms to learn embeddings in each behavior, and then aggregated them by adopting a fully connected network. Guo et al. [29] proposed a hierarchical attention mechanism to aggregate user preferences learned from different behaviors. Unlike other methods that aggregate information learned from different behaviors for prediction, Gao et al. [14] adopted a sequential modeling approach to explore the dependencies of different behaviors by passing the current behavior prediction score forward. The advantage of deep networks in representation learning makes the DNN-based MBR models achieve great progress on recommendation performance.

Recently, with the success of graph convolutional networks in recommendation, many GCN-based MBR methods have also been proposed [26, 15, 28, 31, 30, 18, 32]. Similar to DNN-based models, the general paradigm for such methods is to model each behavior separately with GCN to learn the embeddings of users and items, and then aggregate them with different strategies. For example, Xia et al. [26] proposed a multi-behavior pattern encoding framework and a graph element network to explore complex dependencies between different types of user-item interactions. Jin et al. [15] performed user-item propagation and item-item propagation in different behaviors on a multi-behavior heterogeneous graph to learn the influence strength and semantics of different behaviors. Chang et al. [30] proposed a multi-interest learning framework, containing an interest-extracting module and a behavioral correlation module, to better model the complex dependencies among multiple behaviors. Gu et al. [28] designed different strategies to aggregate the embeddings of multi-behavior users and items separately, and adopted a star-shaped contrastive learning to capture the commonality between target behaviors and auxiliary behaviors. Different from those GCN-based methods, Yan et al. [32] proposed a cascaded residual network to explore the connection between different behaviors from the perspective of embedding propagation.

In this work, we propose a novel hierarchical GCN structure to exploit the multi-behavior data for user and item embedding learning with a different paradigm. Our model first learns the global embeddings from a unified homogeneous graph constructed on all the behavior data, and then takes them as initial embeddings for subsequent behavior-specific embedding learning. This learning strategy makes good use of the multi-behavior information and provides a good embedding initialization for the embedding learning in each behavior graph. Besides, we adopt two different strategies to aggregate the behavior-specific embeddings for users and items. The comparisons with existing advanced MBR models in the empirical study demonstrate the effectiveness of our model.

III Methodology

Figure 1: Overview of the MB-HGCN model. We take K=3 as an example, i.e., there are three kinds of behaviors, in which the last is the target behavior and the other two are auxiliary behaviors.

III-A Preliminaries

Multi-behavior recommendation (MBR) utilizes the auxiliary behaviors (e.g., view and cart) that users perform when interacting with a platform to help learn user preferences. These behaviors also reflect users’ interests in items and thus contain rich user preference information, which can be leveraged to effectively alleviate the data sparsity problem. In this work, we aim to learn better user and item embeddings by exploiting auxiliary behaviors to improve the recommendation performance.

Let $\mathcal{U}$ and $\mathcal{I}$ be the sets of users and items, and let $M$ and $N$ be the total numbers of users and items, respectively. $K$ is the number of behavior types. We use $k$ $(1\leq k\leq K)$ to denote the $k$-th behavior, and the $K$-th behavior is the target behavior. Let $\mathcal{R}_{k}$ be the binary interaction matrix for the $k$-th behavior. For $r_{ui}\in\mathcal{R}_{k}$, $r_{ui}=1$ if user $u$ has interacted with item $i$ under the $k$-th behavior; otherwise $r_{ui}=0$.

The studied problem is formulated as follows:

Input: user set $\mathcal{U}$ , item set $\mathcal{I}$ , and user-item interaction matrices $\{\mathcal{R}_{1},\cdots,\mathcal{R}_{K}\}$ for different types of behaviors.

Output: a similarity score, which indicates the possibility that a user $u$ will interact with an item $i$ in the target behavior.

Before describing our model, we would like to first introduce two types of graphs used in our model:

• Behavior-specific graph, denoted by $\mathcal{G}_{k}=(V_{k},E_{k})$, which is a bipartite graph constructed from the interactions of the $k$-th behavior type according to the interaction matrix $\mathcal{R}_{k}$. $V_{k}$ consists of the user nodes $u\in\mathcal{U}$ and the item nodes $i\in\mathcal{I}$, and $E_{k}$ denotes the user-item interaction edges in graph $\mathcal{G}_{k}$. There is an edge between a user node and an item node if $r_{ui}=1$ for $r_{ui}\in\mathcal{R}_{k}$.
• Unified graph, denoted by $\mathcal{G}=(V,E)$, which is constructed from the interactions of all types of behaviors. It is a homogeneous graph, which means we do not differentiate the types of interactions in this graph. Whether a user $u$ and an item $i$ interact under one behavior or under several behaviors, they are connected by a single edge, namely, $E=E_{1}\cup E_{2}\cup\cdots\cup E_{K}$.
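The two graph types can be sketched in a few lines. This is only an illustration under stated assumptions: each behavior is given as a list of (user, item) pairs, and the helper name `build_graphs` and the toy data are hypothetical rather than taken from the released code.

```python
def build_graphs(interactions_per_behavior):
    """Return (behavior_edges, unified_edges).

    interactions_per_behavior: list of K iterables of (user, item) pairs,
    one per behavior; by convention the last entry is the target behavior.
    """
    # Behavior-specific edge sets E_1, ..., E_K (duplicates collapse in a set).
    behavior_edges = [set(pairs) for pairs in interactions_per_behavior]
    # Unified edge set E = E_1 ∪ ... ∪ E_K, ignoring the behavior type.
    unified_edges = set().union(*behavior_edges)
    return behavior_edges, unified_edges


# Toy example with K = 3 behaviors (view, cart, buy).
view = [(0, 0), (0, 1), (1, 1)]
cart = [(0, 1), (1, 2)]
buy = [(0, 1)]
behavior_edges, unified_edges = build_graphs([view, cart, buy])
```

Note that the pair (0, 1) appears in all three behaviors but contributes a single edge to the unified graph, which is what makes that graph homogeneous.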

III-B Model Description

Overview. Users’ interaction behaviors with items reflect their interests. In multi-behavior recommendation, it is well-recognized that different types of behaviors disclose user’s preference from different perspectives or to different extents [30, 33, 14]. Based on this common assumption, many MBR approaches have been proposed to extract valuable information from multiple behaviors to learn user preferences. Most previous MBR models first learn embeddings from different behaviors separately and then aggregate them with different strategies. The ultimate goal is to exploit the auxiliary behaviors to learn better user and item embeddings, thereby enhancing the recommendation accuracy of the target behavior.

In this work, we propose a hierarchical graph convolutional network to exploit the multi-behavior information to learn the user and item embeddings. In particular, we first learn a global embedding on the unified graph constructed from the interaction information of all behaviors. The global embedding is then used as the initialized embedding and fed into each behavior-specific graph to learn behavior-specific embeddings for each type of behavior. The intuition is that users have a general interest across different behaviors, while each behavior contains some distinct features of user preference. The global embedding learned from the unified graph represents the general interests or coarse-grained preferences, and the behavior-specific embedding learned from each behavior-specific graph represents the refined or fine-grained user preferences for this particular behavior. Next, two different strategies are used to obtain the final user and item embeddings for prediction, respectively. Multi-task learning is used for optimization.

Fig. 1 shows the overall structure of our MB-HGCN model, which mainly consists of three modules: 1) Embedding learning, which learns the embeddings of users and items via a hierarchical graph network structure; 2) Embedding aggregation, which adopts two different strategies to aggregate the embeddings of users and items. More specifically, a novel weighting scheme adaptively distills valuable information from different behaviors for user embedding aggregation, and a linear aggregation approach is used for item embedding aggregation; 3) Multi-task learning, which employs the interaction information of each behavior as supervision signals for user and item embeddings. In the following subsections, we first briefly describe the embedding initialization and then describe the three modules in detail in sequence.

III-B1 Embedding initialization

Following previous works [14, 12, 15, 34], we map the ID of user $u\in\mathcal{U}$ and item $i\in\mathcal{I}$ to $d$-dimensional embedding vectors $\bm{e}_{u}^{0}$ and $\bm{e}_{i}^{0}$, respectively. Let $\bm{P}\in\mathbb{R}^{M\times d}$ and $\bm{Q}\in\mathbb{R}^{N\times d}$ be the embedding matrices for the user and item embedding initialization, where $M$ and $N$ represent the number of users and items, respectively. Each user and item ID is represented as a unique embedding. Given the one-hot matrices $\bm{ID}^{\mathcal{U}}$ and $\bm{ID}^{\mathcal{I}}$ for all users and items, the embeddings of user $u$ and item $i$ are initialized as:

$$\bm{e}_{u}^{0}=\bm{P}\cdot\bm{ID}_{u}^{\mathcal{U}},\quad\bm{e}_{i}^{0}=\bm{Q}\cdot\bm{ID}_{i}^{\mathcal{I}}, \tag{1}$$

where $\bm{ID}_{u}^{\mathcal{U}}$ and $\bm{ID}_{i}^{\mathcal{I}}$ represent the user $u$ ’s and the item $i$ ’s one-hot vector, respectively.
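As a small illustration of Eq. 1, multiplying an embedding table by a one-hot vector simply selects one row of the table. The sketch below assumes the product is read as a row selection from a table of shape $M\times d$; the helper names `one_hot` and `embed` and the toy values are hypothetical.

```python
def one_hot(index, size):
    """Return a one-hot list with a 1.0 at position `index`."""
    return [1.0 if j == index else 0.0 for j in range(size)]


def embed(table, index):
    """Multiply the one-hot vector of `index` with a (size x d) table."""
    hot = one_hot(index, len(table))
    d = len(table[0])
    return [sum(hot[r] * table[r][t] for r in range(len(table)))
            for t in range(d)]


# Toy user table P with M = 3 users and d = 2 dimensions.
P = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
e_u = embed(P, 1)  # selects the row of user 1
```

In practice the product is not computed explicitly; an embedding layer performs the equivalent row lookup directly.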

III-B2 Embedding Learning

As mentioned, our model adopts a hierarchical GCN structure to exploit the multi-behavior data for embedding learning. The interaction information of all behaviors is integrated into the unified graph $\mathcal{G}$ to learn the general user preferences and item features, denoted as the global embeddings $\bm{e}_{u}^{g}$ and $\bm{e}_{i}^{g}$ for user $u$ and item $i$, respectively. The learned global embeddings are then fed into each behavior-specific graph $\mathcal{G}_{k}$ to learn the behavior-specific embeddings $\bm{e}_{u}^{k}$ and $\bm{e}_{i}^{k}$ for user $u$ and item $i$ in $\mathcal{G}_{k}$, respectively. For the embedding learning in each graph, we employ the LightGCN [12] model, which is a lightweight CF-based single-behavior recommendation model. It simplifies the standard GCN and only retains the core neighborhood aggregation component, which has proven to be effective and superior in performance. It is worth mentioning that other GCN models can also be adopted, such as Ultra-GCN [34] and SVD-GCN [35]. The core of LightGCN is to recursively aggregate information from neighboring nodes to update the embedding of the target node. The graph convolution operation in LightGCN is:

$$\begin{aligned}\bm{e}_{u}^{(l+1)}&=\sum_{i\in N_{u}}\frac{1}{\sqrt{\lvert N_{u}\rvert}\sqrt{\lvert N_{i}\rvert}}\bm{e}_{i}^{(l)},\\ \bm{e}_{i}^{(l+1)}&=\sum_{u\in N_{i}}\frac{1}{\sqrt{\lvert N_{i}\rvert}\sqrt{\lvert N_{u}\rvert}}\bm{e}_{u}^{(l)},\end{aligned} \tag{2}$$

where $\frac{1}{\sqrt{\left\lvert N_{u}\right\rvert}\sqrt{\left\lvert N_{i}\right\rvert}}$ denotes the normalization coefficient, $N_{u}$ represents the set of items that user $u$ has interacted with, and $N_{i}$ represents the set of users who have interacted with item $i$. After $L$-layer propagation, LightGCN combines the embeddings obtained at each layer as the final user and item representations. Given the total number of layers $L$, the representations of user $u$ and item $i$ after the LightGCN process are as follows:

$$\bm{e}^{\prime}_{u}=\sum_{l=0}^{L}{\alpha_{l}\bm{e}_{u}^{(l)}},\quad\bm{e}^{\prime}_{i}=\sum_{l=0}^{L}{\alpha_{l}\bm{e}_{i}^{(l)}}, \tag{3}$$

where $\alpha_{l}$ is a hyperparameter that represents the importance of the $l$-th layer embedding, and $\bm{e}_{u}^{(0)}$ ($\bm{e}_{i}^{(0)}$) is the initial embedding of user $u$ (item $i$).
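To make Eqs. 2 and 3 concrete, the following is a minimal pure-Python sketch of the propagation and layer combination on a small bipartite graph. It is not the LightGCN reference implementation: the dictionary-based data layout and the function names `lightgcn_layer` and `lightgcn` are assumptions made for illustration, and a practical implementation would use sparse matrix products instead.

```python
import math
from collections import defaultdict


def lightgcn_layer(edges, user_emb, item_emb):
    """One propagation step of Eq. 2 on a bipartite graph.

    edges: iterable of (user, item) pairs.
    user_emb / item_emb: dict mapping node id -> list of floats.
    Nodes without neighbors receive a zero vector.
    """
    edges = set(edges)  # guard against duplicated pairs
    n_u, n_i = defaultdict(set), defaultdict(set)
    for u, i in edges:
        n_u[u].add(i)
        n_i[i].add(u)
    d = len(next(iter(user_emb.values())))
    new_user = {u: [0.0] * d for u in user_emb}
    new_item = {i: [0.0] * d for i in item_emb}
    for u, i in edges:
        # Symmetric normalization 1 / (sqrt|N_u| * sqrt|N_i|).
        coef = 1.0 / (math.sqrt(len(n_u[u])) * math.sqrt(len(n_i[i])))
        for t in range(d):
            new_user[u][t] += coef * item_emb[i][t]
            new_item[i][t] += coef * user_emb[u][t]
    return new_user, new_item


def lightgcn(edges, user_emb, item_emb, num_layers, alphas):
    """Eq. 3: weighted sum of the embeddings of layers 0..L."""
    if len(alphas) != num_layers + 1:
        raise ValueError("need one alpha per layer, including layer 0")
    user_layers, item_layers = [user_emb], [item_emb]
    for _ in range(num_layers):
        u_next, i_next = lightgcn_layer(edges, user_layers[-1], item_layers[-1])
        user_layers.append(u_next)
        item_layers.append(i_next)

    def combine(layers):
        return {
            node: [sum(a * layer[node][t] for a, layer in zip(alphas, layers))
                   for t in range(len(layers[0][node]))]
            for node in layers[0]
        }

    return combine(user_layers), combine(item_layers)


# Toy graph: user 0 interacted with items 0 and 1, user 1 with item 1.
edges = {(0, 0), (0, 1), (1, 1)}
user_emb = {0: [1.0, 0.0], 1: [0.0, 1.0]}
item_emb = {0: [1.0, 1.0], 1: [2.0, 0.0]}
new_user, new_item = lightgcn_layer(edges, user_emb, item_emb)
final_user, final_item = lightgcn(edges, user_emb, item_emb,
                                  num_layers=1, alphas=[0.5, 0.5])
```

The layer has no trainable weights and no nonlinearity; only the initial embeddings are learned, which is the simplification that distinguishes LightGCN from a standard GCN.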

As shown in Fig. 1, LightGCN with the same settings is used in both the unified graph $\mathcal{G}$ and each behavior-specific graph $\mathcal{G}_{k}$ to learn the embeddings of users and items.

Global embedding. Following Eq. 2, we obtain $L$ embeddings for a user, $\{\bm{e}_{u}^{(1)},\bm{e}_{u}^{(2)},\cdots,\bm{e}_{u}^{(L)}\}$, and for an item, $\{\bm{e}_{i}^{(1)},\bm{e}_{i}^{(2)},\cdots,\bm{e}_{i}^{(L)}\}$, after $L$-layer propagation. Before combining these embeddings, normalization is adopted to alleviate the impact of the embedding scale [12]. In our model, we apply $L_{2}$ normalization for simplicity:

$$\bm{e}^{g}_{u}=\bm{e}_{u}^{(0)}+\sum_{l=1}^{L}{\alpha_{l}\frac{\bm{e}_{u}^{(l)}}{\lVert\bm{e}_{u}^{(l)}\rVert_{2}}},\quad\bm{e}^{g}_{i}=\bm{e}_{i}^{(0)}+\sum_{l=1}^{L}{\alpha_{l}\frac{\bm{e}_{i}^{(l)}}{\lVert\bm{e}_{i}^{(l)}\rVert_{2}}}, \tag{4}$$

where $\bm{e}^{g}_{u}$ and $\bm{e}^{g}_{i}$ are the global embeddings learned from $\mathcal{G}$; they are also the shared input of every behavior-specific graph $\mathcal{G}_{k}$. $\bm{e}_{u}^{(0)}$ and $\bm{e}_{i}^{(0)}$ are the inputs of graph $\mathcal{G}$ (i.e., the initialized embeddings $\bm{e}_{u}^{0}$ and $\bm{e}_{i}^{0}$). Intuitively, farther neighbors are less important; thus, we set $\alpha_{l}$ to $1/(l+1)$.
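The combination in Eq. 4 with $\alpha_{l}=1/(l+1)$ can be sketched for a single node as follows. The function name `combine_layers` is hypothetical, and skipping zero-norm layers (isolated nodes) is an assumption of this sketch, not something the text specifies.

```python
import math


def combine_layers(layer_embs):
    """Eq. 4 for one node: layer_embs[l] is the embedding after l layers."""
    out = list(layer_embs[0])  # e^(0) is added without normalization
    for l, emb in enumerate(layer_embs[1:], start=1):
        norm = math.sqrt(sum(x * x for x in emb))
        if norm == 0.0:
            # Assumption: an all-zero layer output contributes nothing.
            continue
        alpha = 1.0 / (l + 1)  # farther layers get smaller weights
        for t, x in enumerate(emb):
            out[t] += alpha * x / norm
    return out


# Layer 0, 1, and 2 embeddings of one node (toy values).
e_g = combine_layers([[1.0, 0.0], [3.0, 4.0], [0.0, 2.0]])
```

Here the layer-1 output is scaled to unit length and weighted by 1/2, and the layer-2 output is scaled to unit length and weighted by 1/3, so the result is approximately [1.3, 0.733].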

Behavior-specific embedding. Similarly, taking $\bm{e}^{g}_{u}$ and $\bm{e}^{g}_{i}$ as the initial embeddings for embedding learning in each behavior-specific graph $\mathcal{G}_{k}$, we obtain $K$ behavior-specific embeddings for each user $u$ and item $i$ (i.e., $\{\bm{e}_{u}^{1},\bm{e}_{u}^{2},\cdots,\bm{e}_{u}^{K}\}$ and $\{\bm{e}_{i}^{1},\bm{e}_{i}^{2},\cdots,\bm{e}_{i}^{K}\}$).

III-B3 Embedding Aggregation

Through the embedding learning process, for each user $u$ and item $i$, we obtain their global embeddings $\bm{e}^{g}_{u}$ and $\bm{e}^{g}_{i}$, as well as the behavior-specific embedding sets $\{\bm{e}_{u}^{1},\bm{e}_{u}^{2},\cdots,\bm{e}_{u}^{K}\}$ and $\{\bm{e}_{i}^{1},\bm{e}_{i}^{2},\cdots,\bm{e}_{i}^{K}\}$. Next, we aggregate the user and item embeddings with different strategies to obtain the final embeddings for recommendation.

User embedding aggregation. Since different behaviors may convey distinct information about user preference, we design a novel weighting scheme for user embedding aggregation that distills valuable information from the behavior-specific embeddings for the target behavior prediction. Taking user $u$ as an example, the aggregation is formulated as:

$$\bm{U}=\bm{e}_{u}^{1}||\bm{e}_{u}^{2}||\cdots||\bm{e}_{u}^{K}, \tag{5}$$

$$\tilde{\bm{e}}_{u}^{k}=\bm{U}\bm{\delta}^{\top}, \tag{6}$$

where $||$ denotes the operation of stacking vectors to form a matrix. $\bm{U}\in\mathbb{R}^{d\times K}$ is the matrix by stacking user embeddings learned from each behavior, in which $d$ represents the embedding size and $K$ is the number of behaviors. $\bm{\delta}$ is the weight vector calculated based on the similarity between the embedding of each behavior and that of the target behavior. Formally, it is computed as

$$\bm{\delta}=\mathrm{softmax}\left(\frac{{\bm{e}_{u}^{k}}^{\top}\bm{U}}{\sqrt{d}}\right). \tag{7}$$

In this equation, ${\bm{e}_{u}^{k}}^{\top}\bm{U}$ calculates the embedding similarity between the $k$ -th behavior and other behaviors. The denominator $\sqrt{d}$ is used to prevent the vanishing gradient problem, and $softmax(\cdot)$ is adopted for normalization.

The rationale behind this aggregation strategy is that behaviors with more similar embeddings contain more preference information relevant to the target behavior, and thus contribute more to the target behavior in the aggregation. With this aggregation strategy, our model can adaptively extract valuable information from other behaviors for the target behavior prediction. Moreover, together with multi-task learning, it also prevents the behavior-specific embeddings from being optimized solely towards the target behavior in the learning process. Through the above operation, we obtain the embedding set $\{\tilde{\bm{e}}_{u}^{1},\tilde{\bm{e}}_{u}^{2},\cdots,\tilde{\bm{e}}_{u}^{K}\}$.
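The user aggregation in Eqs. 5 to 7 amounts to a scaled dot-product attention over the $K$ behavior-specific embeddings. The sketch below handles a single user with plain Python lists; the function name `aggregate_user`, the 0-based behavior index, and the toy values are assumptions for illustration.

```python
import math


def aggregate_user(behavior_embs, k):
    """Eqs. 5-7 for one user.

    behavior_embs[m] is the embedding of behavior m (0-based);
    k is the 0-based index of the behavior used as the query.
    Returns the aggregated embedding and the weight vector delta.
    """
    d = len(behavior_embs[0])
    query = behavior_embs[k]
    # Scaled dot-product similarity between behavior k and every behavior.
    scores = [sum(q * x for q, x in zip(query, emb)) / math.sqrt(d)
              for emb in behavior_embs]
    # Numerically stable softmax over the K behaviors (Eq. 7).
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    delta = [e / total for e in exps]
    # Weighted sum of the behavior-specific embeddings (Eq. 6).
    agg = [sum(w * emb[t] for w, emb in zip(delta, behavior_embs))
           for t in range(d)]
    return agg, delta


# Three behaviors with d = 2; the last one is used as the query.
embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
agg, delta = aggregate_user(embs, k=2)
```

In this toy case the query behavior receives the largest weight, and the two auxiliary behaviors, which are equally similar to it, receive equal weights.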

Item embedding aggregation. Since item features are consistent across different behaviors, we simply apply a linear combination to aggregate the embeddings learned from different behaviors. In different types of behaviors, the users who interacted with an item are different, and the total number of interactions (i.e., the number of users who interacted with the item) is also different. Intuitively, with more interactions, the learned features should be more comprehensive. Accordingly, the weight assigned to the $k$-th behavior-specific embedding for an item $i$ is defined as:

$$\gamma_{ik}=\frac{w_{k}\cdot n_{ik}}{\sum_{m=1}^{K}{w_{m}\cdot n_{im}}}, \tag{8}$$

where $w_{k}$ is a learnable parameter for the $k$ -th behavior; $n_{ik}$ denotes the number of users that interacted with item $i$ in the $k$ -th behavior. The behavior-specific embeddings of item $i$ are aggregated as:

$$\tilde{\bm{e}}_{i}=\sum_{k=1}^{K}{\gamma_{ik}\cdot\bm{e}_{i}^{k}}. \tag{9}$$
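A minimal sketch of the item aggregation in Eqs. 8 and 9 for a single item is given below. In the model the behavior weights $w_{k}$ are learnable; here they are fixed numbers, and the function name `aggregate_item` and the error handling for items without interactions are assumptions of this sketch.

```python
def aggregate_item(behavior_embs, counts, w):
    """Eqs. 8-9 for one item.

    behavior_embs[k]: embedding of the item under behavior k.
    counts[k]: number of users who interacted with the item in behavior k.
    w[k]: weight of behavior k (learnable in the model, fixed here).
    """
    weighted = [wk * nk for wk, nk in zip(w, counts)]
    total = sum(weighted)
    if total == 0:
        raise ValueError("item has no interactions in any behavior")
    gamma = [x / total for x in weighted]  # Eq. 8
    d = len(behavior_embs[0])
    agg = [sum(g * emb[t] for g, emb in zip(gamma, behavior_embs))
           for t in range(d)]  # Eq. 9
    return agg, gamma


# Two behaviors: 3 interactions in the first, 1 in the second.
agg_item, gamma = aggregate_item([[1.0, 0.0], [0.0, 1.0]],
                                 counts=[3, 1], w=[1.0, 1.0])
```

With equal behavior weights, the coefficients reduce to the interaction shares 0.75 and 0.25, so the behavior with more interactions dominates the aggregated embedding.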

Notice that both the user embedding $\tilde{\bm{e}}_{u}^{k}$ and the item embedding $\tilde{\bm{e}}_{i}$ are obtained from behavior-specific information. To obtain more comprehensive representations, we combine them with the global embeddings to obtain the final embeddings:

$$\hat{\bm{e}}_{u}^{k}=\tilde{\bm{e}}_{u}^{k}\oplus\bm{e}_{u}^{g},\quad\hat{\bm{e}}_{i}=\tilde{\bm{e}}_{i}\oplus\bm{e}_{i}^{g}, \tag{10}$$

where $\oplus$ denotes the element-wise sum.

III-B4 Multi-task Learning (MTL)

MTL [36] is a learning strategy for jointly optimizing different-yet-related tasks. To better exploit the multiple-behavior information in user and item embedding learning, we treat each behavior as an independent training task. The inner product of user and item embeddings is adopted to estimate the prediction score. Take the $k$ -th behavior as an example:

$$y_{ui}^{k}={\hat{\bm{e}}_{u}^{k}}^{\top}\hat{\bm{e}}_{i}. \tag{11}$$
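Eqs. 10 and 11 together reduce to an element-wise sum followed by an inner product, as the short sketch below shows. The function name `predict_score` and the toy vectors are hypothetical.

```python
def predict_score(user_agg, user_global, item_agg, item_global):
    """Eqs. 10-11: add the global embeddings, then take the inner product."""
    e_u = [a + g for a, g in zip(user_agg, user_global)]  # final user embedding
    e_i = [a + g for a, g in zip(item_agg, item_global)]  # final item embedding
    return sum(x * y for x, y in zip(e_u, e_i))


score = predict_score([1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [0.0, 2.0])
```

Here the final user embedding is [1, 1] and the final item embedding is [2, 2], giving a score of 4.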

The pairwise Bayesian Personalized Ranking (BPR) [7] loss is adopted in optimization for each task:

$$\mathcal{L}_{k}=\sum_{(u,i,j)\in\mathcal{O}}-\ln\sigma(y_{ui}^{k}-y_{uj}^{k}), \tag{12}$$

where $\mathcal{O}=\{(u,i,j)|(u,i)\in\mathcal{R}^{+},(u,j)\in\mathcal{R}^{-}\}$ is the set of positive and negative sample pairs, $\mathcal{R}^{+}$ ($\mathcal{R}^{-}$) denotes the observed (unobserved) samples in the $k$-th behavior, and $\sigma(\cdot)$ is the sigmoid function. Following Eq. 12, we obtain the loss functions for all $K$ tasks, i.e., $\{\mathcal{L}_{1},\mathcal{L}_{2},\cdots,\mathcal{L}_{K}\}$, which are then summed for joint optimization. Intuitively, the contributions of different tasks should differ, and assigning different weights to different losses may enhance the final performance. However, this is not our main focus in this study. Here we simply treat the tasks equally to focus on studying the effectiveness of our embedding learning strategy, and we leave the study of different weights in the loss function as future work. The final loss function is formulated as:

$$\mathcal{L}=\sum_{k=1}^{K}\mathcal{L}_{k}+\beta\cdot\left\lVert\bm{\Theta}\right\rVert_{2}, \tag{13}$$

where $\bm{\Theta}$ represents all trainable parameters in our model and $\beta$ is the coefficient that controls the strength of the $L_{2}$ regularization to prevent over-fitting. To improve the generalization ability, two widely used dropout strategies [15, 37, 38] are also adopted in training: node dropout and message dropout, which randomly drop nodes in the graph and information in the embeddings, respectively.
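The per-task BPR loss of Eq. 12 and the equally weighted sum in Eq. 13 can be sketched as follows. The regularization term is omitted, the scores are assumed to be precomputed, and the function names `bpr_loss` and `multitask_loss` are hypothetical.

```python
import math


def bpr_loss(score_pos, score_neg):
    """-ln(sigmoid(pos - neg)) for one (u, i, j) triple, computed stably."""
    x = score_pos - score_neg
    # -ln(sigmoid(x)) = ln(1 + exp(-x)); branch on the sign to avoid overflow.
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))


def multitask_loss(pairs_per_behavior):
    """Sum of the K BPR losses (Eq. 13 without the regularization term).

    pairs_per_behavior[k] is a list of (y_ui, y_uj) score pairs sampled
    for behavior k.
    """
    return sum(bpr_loss(pos, neg)
               for pairs in pairs_per_behavior
               for pos, neg in pairs)


# Two behaviors with one sampled triple each (toy scores).
loss = multitask_loss([[(2.0, 0.0)], [(0.0, 0.0)]])
```

A triple whose positive item already outscores the negative item contributes a small loss, while a tie contributes $\ln 2$; every behavior contributes with the same weight, matching the equal treatment described above.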

IV Experiments

In this section, we conduct extensive experiments on three real-world datasets to evaluate the effectiveness of our model. In particular, we aim to answer the following research questions:

•

RQ1: How does our MB-HGCN model perform compared with state-of-the-art recommendation models that are learned from single- and multi-behavior data?
•

RQ2: How do the key designs in our MB-HGCN model affect the recommendation performance?
•

RQ3: How does the number of GCN layers in the LightGCN backbone affect the performance of our model?
•

RQ4: Can MB-HGCN alleviate the cold-start problem?
•

RQ5: How are user embeddings learned in the MB-HGCN model?

IV-A Experiment Settings

IV-A1 Dataset

Three real-world datasets are adopted for experiments:

•

Tmall. This dataset is collected from Tmall (https://www.tmall.com/), which is one of the largest e-commerce platforms in China. It contains 41,738 users and 11,953 items with four types of behaviors, i.e., view, collect, cart, and buy.
•

Beibei. This dataset is collected from Beibei (https://www.beibei.com/), which is the largest infant product retail e-commerce platform in China. This dataset contains 21,716 users and 7,977 items with three types of behaviors, i.e., view, cart, and buy.
•

Jdata. This dataset is collected from JD (https://www.jd.com/), which is one of the most popular and influential e-commerce websites in China. This dataset contains 93,334 users and 24,624 items with four types of behaviors, i.e., view, collect, cart, and buy.

For the above datasets, we follow previous work [14, 15] and remove duplicated records by keeping only the earliest one. The statistical information of the three datasets is summarized in Table I.

TABLE I: Statistics of three real-world benchmark datasets.

Dataset	Users	Items	Buy	Cart	Collect	View
Tmall	41,738	11,953	255,586	1,996	221,514	1,813,498
Beibei	21,716	7,977	304,576	642,622	-	2,412,586
Jdata	93,334	24,624	333,383	49,891	45,613	1,681,430
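The de-duplication step described above can be sketched as follows. The record layout `(user, item, behavior, timestamp)` and the choice to treat records as duplicates when they share the same user, item, and behavior are assumptions made for illustration; the text only states that the earliest record is kept.

```python
def keep_earliest(records):
    """Keep the earliest record for each (user, item, behavior) key.

    records: iterable of (user, item, behavior, timestamp) tuples (assumed layout).
    """
    earliest = {}
    for user, item, behavior, ts in records:
        key = (user, item, behavior)
        if key not in earliest or ts < earliest[key]:
            earliest[key] = ts
    return sorted((u, i, b, ts) for (u, i, b), ts in earliest.items())
```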

IV-A2 Evaluation Protocols

We adopt the widely used leave-one-out strategy for model evaluation [39, 15, 14]. In the training stage, the last positive item of each user is selected to construct the validation set for hyper-parameter tuning. In the evaluation stage, all the items in the test set are ranked according to the scores predicted by the recommendation models. Two representative evaluation metrics in recommendation, Hit Ratio (HR@K) [40] and Normalized Discounted Cumulative Gain (NDCG@K) [41], are adopted to evaluate the performance:

•

HR@K: a performance metric used to evaluate the accuracy of a recommender system by measuring the proportion of test items for which the correct recommendation appears within the top K positions of the ranked list.
•

NDCG@K: a metric that measures the quality of the recommended items by considering both their relevance and their position in the ranked list.
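Under leave-one-out evaluation each user has a single held-out item, so both metrics reduce to simple functions of that item's rank. The sketch below assumes this single-target setting (where the ideal DCG equals 1) and a hypothetical data layout of per-user score dictionaries; it is not taken from the authors' code.

```python
import math

def rank_of(scores, target):
    """0-based rank of the target item when candidates are sorted by descending score."""
    return sum(1 for s in scores.values() if s > scores[target])

def hr_ndcg_at_k(all_scores, test_items, k):
    """all_scores: {user: {item: score}}; test_items: {user: held-out item}."""
    hits, ndcg = 0.0, 0.0
    for user, target in test_items.items():
        r = rank_of(all_scores[user], target)
        if r < k:
            hits += 1.0
            ndcg += 1.0 / math.log2(r + 2)  # discounted gain of a single relevant item
    n = len(test_items)
    return hits / n, ndcg / n
```

For instance, a held-out item ranked first contributes 1 to both metrics, while one ranked third contributes 1 to HR@3 and 0.5 to NDCG@3.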

TABLE II: Overall performance comparison (Impr. means the relative improvement over the best baselines, optimal and suboptimal are bolded and underlined, respectively).

Dataset	Metric	Single-behavior			Multi-behavior							Impr.
Dataset	Metric	MF-BPR	NCF	LightGCN	R-GCN	NMTR	MBGCN	GNMR	S-MBRec	CRGCN	MB-HGCN	Impr.
Tmall	HR@10	0.0230	0.0301	0.0393	0.0316	0.0517	0.0549	0.0393	0.0694	0.0840	0.1461	73.93%
	NDCG@10	0.0124	0.0153	0.0209	0.0157	0.0250	0.0285	0.0193	0.0362	0.0442	0.0770	74.21%
	HR@20	0.0316	0.0420	0.0538	0.0489	0.0847	0.0799	0.0619	0.1009	0.1238	0.2072	67.37%
	NDCG@20	0.0144	0.0182	0.0243	0.0198	0.0330	0.0345	0.0247	0.0438	0.0540	0.0920	70.37%
	HR@50	0.0434	0.0678	0.0813	0.0826	0.1498	0.1285	0.1071	0.1553	0.1994	0.3149	57.92%
	NDCG@50	0.0166	0.0231	0.0295	0.0262	0.0456	0.0438	0.0332	0.0544	0.0685	0.1130	64.96%
Beibei	HR@10	0.0268	0.0296	0.0309	0.0327	0.0315	0.0373	0.0396	0.0489	0.0539	0.0619	14.84%
	NDCG@10	0.0139	0.0146	0.0161	0.0161	0.0146	0.0193	0.0219	0.0253	0.0259	0.0297	14.67%
	HR@20	0.0427	0.0453	0.0478	0.0561	0.0587	0.0639	0.0640	0.0770	0.0944	0.1019	7.94%
	NDCG@20	0.0179	0.0185	0.0204	0.0219	0.0214	0.0259	0.0280	0.0324	0.0361	0.0397	9.97%
	HR@50	0.0793	0.0809	0.0880	0.1118	0.1276	0.1287	0.1219	0.1234	0.1817	0.2009	10.57%
	NDCG@50	0.0250	0.0216	0.0282	0.0329	0.0348	0.0386	0.0394	0.0415	0.0532	0.0592	11.28%
Jdata	HR@10	0.1850	0.2090	0.2252	0.2406	0.3142	0.2803	0.3068	0.4125	0.5001	0.5338	6.74%
	NDCG@10	0.1238	0.1410	0.1436	0.1444	0.1717	0.1572	0.1581	0.2779	0.2914	0.3238	11.12%
	HR@20	0.2192	0.2461	0.2825	0.3418	0.4086	0.3603	0.3694	0.4957	0.6190	0.6450	4.20%
	NDCG@20	0.1325	0.1504	0.1582	0.1588	0.1966	0.1790	0.1944	0.2989	0.3225	0.3533	9.55%
	HR@50	0.2652	0.2934	0.3658	0.4873	0.5227	0.5045	0.4607	0.6036	0.7685	0.7749	0.83%
	NDCG@50	0.1417	0.1599	0.1747	0.1891	0.2198	0.1984	0.2029	0.3203	0.3535	0.3804	7.61%

IV-A3 Baselines

To demonstrate the performance of our model, we compare our MB-HGCN with several representative recommendation models, including three single-behavior models and six multi-behavior models.

Single-behavior models:

•

MF-BPR [7]. BPR is a widely used optimization strategy, which assumes that the predicted scores of positive samples are higher than those of negative ones. MF-BPR has been widely used as a baseline to evaluate the performance of newly proposed models.
•

NCF [10]. It is a representative model combining neural networks and CF, which integrates a shallow generalized matrix factorization model and a deep multi-layer perceptron model to learn the interactions between users and items.
•

LightGCN [12]. It removes the feature transformation and nonlinear activation components in the standard GCN model, and only keeps the core neighborhood aggregation component, which simplifies the model structure and achieves a significant performance improvement over its counterpart.

Multi-behavior models:

•

R-GCN [17]. R-GCN differentiates the relations between nodes via edge types in the graph and designs different propagation layers for different types of edges to model the relation information. This model can adapt to the multi-behavior recommendation.
•

NMTR [14]. It is a deep learning model for multi-behavior recommendation, which designs a neural network for each behavior. It sequentially passes the interaction score among behaviors and also adopts multi-task learning for joint optimization.
•

MBGCN [15]. This model constructs a heterogeneous graph to learn user preferences through user-item propagation and adopts a linear aggregation for feature fusion. In addition, item-item propagation is exploited to enhance item embedding learning.
•

GNMR [18]. This model designs a relation aggregation network to model interaction heterogeneity and attempts to explore the dependencies among different types of behaviors via recursive embedding propagation over the heterogeneous graph.
•

S-MBRec [28]. This model consists of a supervised and a self-supervised learning task, which separately learns the user and item embeddings from each behavior and adopts a star-style contrastive learning strategy to construct a contrastive view pair for the target and each auxiliary behavior.
•

CRGCN [32]. This model designs a cascaded residual network to explore the connection between different behaviors from the perspective of embedding propagation. Multi-task learning is also adopted for joint optimization.

IV-A4 Parameter Settings

Our model is implemented in PyTorch (https://pytorch.org/). For all methods, the mini-batch size and embedding size are set to 1024 and 64, respectively [12]. The Adam [42] optimizer is adopted for optimization. In addition, we employ grid search to tune the learning rate and the regularization weight (i.e., $\beta$ ) over $\{10^{-2},3\times 10^{-3},10^{-3},10^{-4}\}$ and $\{10^{-2},10^{-3},3\times 10^{-4},10^{-4}\}$ , respectively. Meanwhile, we carefully tune the hyperparameters of the baselines according to their original papers, and an early stopping strategy is adopted in the training stage.
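The tuning procedure above can be sketched as a plain grid search combined with early stopping. The `validate` callback (standing in for a full train-and-validate run) and the `patience` parameter are hypothetical, since the text does not specify how early stopping is configured; only the two candidate grids come from the settings above.

```python
from itertools import product

LEARNING_RATES = [1e-2, 3e-3, 1e-3, 1e-4]
REG_WEIGHTS = [1e-2, 1e-3, 3e-4, 1e-4]  # candidate values of beta

def grid_search(validate):
    """validate(lr, beta) -> validation metric, higher is better (hypothetical callback)."""
    best = None
    for lr, beta in product(LEARNING_RATES, REG_WEIGHTS):
        score = validate(lr, beta)
        if best is None or score > best[0]:
            best = (score, lr, beta)
    return best

def early_stopping(epoch_scores, patience):
    """Return (best_epoch, best_score), stopping after `patience` epochs without improvement."""
    best_epoch, best_score, waited = -1, float("-inf"), 0
    for epoch, score in enumerate(epoch_scores):
        if score > best_score:
            best_epoch, best_score, waited = epoch, score, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_epoch, best_score
```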

IV-B Overall Performance (RQ1)

In this section, we report the performance comparisons between our MB-HGCN model and all the baselines. The results on the three datasets are shown in Table II. Overall, multi-behavior methods outperform single-behavior methods, which demonstrates the effectiveness of exploiting multiple behaviors. Among the multi-behavior methods, our MB-HGCN significantly outperforms all the others. Compared with the best baseline, the average improvements in HR@K and NDCG@K over $K\in\{10,20,50\}$ are 66.41% and 69.85% on the Tmall dataset, 11.12% and 11.97% on the Beibei dataset, and 3.92% and 9.43% on the Jdata dataset, respectively. This is a remarkable improvement in recommendation accuracy, demonstrating the superiority of our model.
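The reported improvements can be reproduced directly from Table II. The short example below recomputes the relative improvement of MB-HGCN over the best baseline (CRGCN) for HR@K on Tmall and then averages over the three cutoffs; the helper name is illustrative.

```python
def relative_improvement(ours, best_baseline):
    """Relative improvement in percent."""
    return (ours - best_baseline) / best_baseline * 100

# (MB-HGCN, CRGCN) HR@K values on Tmall, copied from Table II
tmall_hr = {10: (0.1461, 0.0840), 20: (0.2072, 0.1238), 50: (0.3149, 0.1994)}

per_k = {k: round(relative_improvement(ours, base), 2)
         for k, (ours, base) in tmall_hr.items()}
average = round(sum(per_k.values()) / len(per_k), 2)

print(per_k)    # per-cutoff improvements: 73.93, 67.37, 57.92
print(average)  # average improvement: 66.41
```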

Among the single-behavior methods, NCF generally outperforms MF-BPR because its neural network architecture can model the complex and nonlinear relationships in user-item interactions. LightGCN, however, exhibits the best performance among single-behavior methods. This result confirms the effectiveness of GCN-based approaches in capturing user-item interaction information, as LightGCN uses a simplified GCN model that emphasizes neighborhood aggregation, and it highlights the importance of graph-based modeling for recommendation tasks. Among the multi-behavior methods, R-GCN, which directly combines the embeddings learned separately from each behavior with a simple summation, exhibits poor performance and in some cases even performs worse than the single-behavior method LightGCN (e.g., HR@10 on Tmall). This suggests that straightforward aggregation of auxiliary behavior embeddings may harm recommendation accuracy. In contrast, MBGCN and GNMR adopt alternative strategies for embedding aggregation, and both achieve superior performance compared to R-GCN, which validates that different behaviors contribute differently to the target behavior. Moreover, NMTR and CRGCN consider the relationships among behaviors through cascading modeling and in many cases yield better performance than the aforementioned methods. NMTR models the cascading effects indirectly through the interaction scores of different behaviors, whereas CRGCN directly incorporates cascading effects into the embedding learning process, leading to superior performance over NMTR. CRGCN is also the best-performing baseline in our experiments, as it leverages multi-behavior relationships in embedding learning. Nevertheless, MB-HGCN outperforms CRGCN by a large margin, mainly due to its hierarchical learning strategy and aggregation strategies. Our ablation studies provide further insights into the effectiveness of the different components in MB-HGCN.

It is worth noting that the improvement achieved on Tmall far exceeds that on the other two datasets. This substantial gap can be primarily attributed to the differences in behavior data across the datasets. In contrast to Jdata, the collect behavior in Tmall yields an amount of data comparable to that of the buy behavior, which provides rich information. The Beibei platform, by comparison, requires users to follow a strict behavior sequence to make a purchase, i.e., view $\to$ cart $\to$ buy. Consequently, the global embedding learned by our model mainly reflects the view behavior, which limits the performance of our model.

IV-C Ablation Study (RQ2)

In this section, we conduct extensive ablation studies to examine the validity of different components in our model.

IV-C1 Effect of the embedding learning in graph $\mathcal{G}$

We design a hierarchical graph convolutional network for embedding learning, where a unified graph $\mathcal{G}$ is utilized to learn a coarse-grained global embedding, which is then used as a shared initialization for refining embeddings in behavior-specific graphs. To validate the effectiveness of learning coarse-grained global embeddings, we conduct an experiment in which we remove the unified graph component and compare the results to the original model that retains the unified graph. Specifically, we train the model without the unified graph $\mathcal{G}$ and use the initial embeddings (i.e., $\bm{e_{u}}^{0}$ and $\bm{e_{i}}^{0}$ ) as the initialization of each behavior-specific graph $\mathcal{G}_{k}$ ( $k\in[1,K]$ ). Experimental results are reported in Table III.

TABLE III: Effect of the embedding learning in graph $\mathcal{G}$ (w. $\mathcal{G}$ and w/o. $\mathcal{G}$ represent with and without embedding learning in graph $\mathcal{G}$ , respectively).

Method	Tmall		Beibei		Jdata
Method	HR@10	NDCG@10	HR@10	NDCG@10	HR@10	NDCG@10
w/o. $\mathcal{G}$	0.0394	0.0198	0.0420	0.0213	0.2709	0.1608
w. $\mathcal{G}$	0.1461	0.0770	0.0619	0.0297	0.5338	0.3238

The experimental results show that removing the unified graph component leads to a significant decrease in performance. This is because the coarse-grained global embeddings learned in the unified graph provide a better initialization for refining embeddings in the behavior-specific graphs, which allows for more accurate learning in those graphs. The performance difference between the two models with and without the unified graph is substantial. In fact, without the unified graph, the model degenerates to a variant of R-GCN that differs only in the embedding aggregation strategy. Compared with the results in Table II, the model w/o. $\mathcal{G}$ outperforms R-GCN in terms of HR@10 and NDCG@10 on all three datasets. These results provide evidence for the effectiveness of the proposed embedding aggregation strategies for users and items, and further verify the validity of the unified graph design.

IV-C2 Effect of the user embedding aggregation strategy

Our intuition for designing the user embedding aggregation strategy is that user interests vary across behaviors, and thus not all user preferences contribute to the prediction of the target behavior. Therefore, we design a simple adaptive aggregation strategy for user embeddings. To verify its effectiveness, we compare three variants: 1) sum agg., which removes our adaptive user embedding aggregation module and directly sums the behavior-specific embeddings; 2) linear agg., which replaces the adaptive module with a linear aggregation that assigns weights based on the number of interactions of each behavior (the same as the item aggregation strategy); and 3) adaptive agg., our proposed adaptive embedding aggregation strategy. Experimental results are reported in Table IV.

TABLE IV: Comparison of three different user embedding aggregation strategies.

Method	Tmall		Beibei		Jdata
Method	HR@10	NDCG@10	HR@10	NDCG@10	HR@10	NDCG@10
sum agg.	0.0971	0.0488	0.0516	0.0257	0.4068	0.2356
linear agg.	0.1263	0.0687	0.0575	0.0288	0.4888	0.2935
adaptive agg.	0.1461	0.0770	0.0619	0.0297	0.5338	0.3238

The experimental results demonstrate that the adaptive aggregation strategy achieves the best performance. The sum agg. method performs poorly because users exhibit varying interests in different behaviors, and summation ignores the importance of each behavior. Although the linear agg. method considers the importance of different behaviors, behaviors with more interactions do not necessarily reflect user preferences more accurately. In contrast, our adaptive aggregation strategy aggregates relevant information at the feature level based on the similarity between different behaviors, resulting in better aggregation of relevant information. It is worth mentioning that our aggregation scheme does not introduce any additional parameters into the model, which avoids the risk that extra aggregation parameters negatively affect the embedding learning process.

To verify this point, we perform an additional experiment. We pre-train our model to keep the optimal embeddings learned from each behavior and remove the aggregation of global embeddings (i.e., the operation of Eq. 10) to eliminate the effects of global embeddings. Only the training of the target behavior is retained, which avoids the effects of multi-task learning. On this basis, we compare the performance of the following three variants: 1) $M_{unfix}$ , which employs the linear aggregation strategy for user embedding aggregation; 2) $M_{fix}$ , which is based on the first variant but fixes the parameters of the embedding learning process; and 3) $M_{adap}$ , which adopts our adaptive aggregation strategy for user embedding aggregation. The experimental results are reported in Fig. 2.

According to Fig. 2, of the two linear aggregation methods, the one with fixed parameters significantly outperforms the one with unfixed parameters. With fixed parameters, the supervision signal cannot be propagated back to the embedding learning process, and only the parameters of the linear aggregation are optimized. This suggests that jointly optimizing the aggregation parameters may drive embedding learning toward locally optimal solutions. Furthermore, the method with the adaptive aggregation strategy performs better than both linear aggregation methods. The reason is that our adaptive aggregation strategy does not introduce any parameters, which allows embedding learning to be optimized in the right direction.
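To make the comparison in Table IV concrete, the sketch below contrasts the three kinds of user aggregation on plain Python lists. `sum_agg` and `linear_agg` follow the textual descriptions above. `similarity_gated_agg` is only a hypothetical stand-in for a parameter-free, feature-level, similarity-based aggregation: its softmax gating over element-wise products with the target-behavior embedding is an assumption and should not be read as the paper's actual aggregation equation.

```python
import math

def sum_agg(behavior_embs):
    """Sum the behavior-specific embeddings dimension by dimension."""
    return [sum(dims) for dims in zip(*behavior_embs)]

def linear_agg(behavior_embs, interaction_counts):
    """Weight each behavior by its share of the interactions."""
    total = sum(interaction_counts)
    weights = [c / total for c in interaction_counts]
    return [sum(w * x for w, x in zip(weights, dims)) for dims in zip(*behavior_embs)]

def similarity_gated_agg(behavior_embs, target_index=-1):
    """HYPOTHETICAL parameter-free variant: per-feature softmax weights computed
    from the element-wise similarity to the target-behavior embedding."""
    target = behavior_embs[target_index]
    out = []
    for d, dims in enumerate(zip(*behavior_embs)):
        logits = [x * target[d] for x in dims]
        m = max(logits)
        exps = [math.exp(l - m) for l in logits]
        z = sum(exps)
        out.append(sum(e / z * x for e, x in zip(exps, dims)))
    return out
```

Note that none of the three functions carries trainable parameters in this toy form; the learnable variant of linear aggregation discussed above would add one weight per behavior.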

IV-C3 Effect of the item embedding aggregation strategy

In this experiment, we evaluate the effectiveness of the linear aggregation strategy for item embedding aggregation. Considering that behaviors with more interactions may reflect more comprehensive features of items, we assign a weight to each behavior based on its number of interactions (as shown in Eq. 8). To verify the effectiveness of this design, we compare the following variants: 1) fix $\gamma_{ik}$ , which assigns the same weight (i.e., $\gamma_{ik}=1$ ) to each behavior for embedding aggregation; 2) w/o. $w_{k}$ , which removes the learnable parameter $w_{k}$ and assigns weights strictly based on the number of interactions of each behavior; and 3) w. $w_{k}$ , which keeps the learnable parameter $w_{k}$ to fine-tune the importance of different behaviors (our approach). The experimental results are reported in Table V.

TABLE V: Effect of the item aggregation strategy (w. $w_{k}$ and w/o. $w_{k}$ represent with and without $w_{k}$ , respectively).

Method	Tmall		Beibei		Jdata
Method	HR@10	NDCG@10	HR@10	NDCG@10	HR@10	NDCG@10
fix $\gamma_{ik}$	0.1285	0.0686	0.0587	0.0276	0.4685	0.2795
w/o. $w_{k}$	0.1408	0.0762	0.0594	0.0304	0.4814	0.2906
w. $w_{k}$	0.1461	0.0770	0.0619	0.0312	0.5338	0.3238

The results in Table V show a clear improvement of the weight allocation method (w/o. $w_{k}$ ) over the method without weight allocation (fix $\gamma_{ik}$ ), which supports our view that behaviors with more interactions reflect more comprehensive item features. Between the two weight allocation methods, w/o. $w_{k}$ and w. $w_{k}$ , the one that fine-tunes the weights with the learnable parameter achieves better performance, indicating that the contribution of different behaviors varies across items. Therefore, fine-tuning the weights via the learnable parameter can better aggregate the representation of items, further validating the effectiveness of our proposed strategy.
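The three variants in Table V can be illustrated with a small weighting function. Since Eq. 8 is not reproduced in this section, the normalized form $\gamma_{ik}=w_{k}n_{ik}/\sum_{m}w_{m}n_{im}$ used below (with $n_{ik}$ the number of interactions of item $i$ under behavior $k$) is an assumption, as are the function names.

```python
def item_weights(counts, w=None):
    """Behavior weights for one item.

    counts: interactions of the item under each behavior.
    w: per-behavior parameters w_k; None reproduces the "w/o. w_k" variant.
    """
    if w is None:
        w = [1.0] * len(counts)
    scaled = [wk * n for wk, n in zip(w, counts)]
    total = sum(scaled)
    if total == 0:  # item without interactions: no behavior receives weight
        return [0.0] * len(counts)
    return [s / total for s in scaled]

def aggregate_item(behavior_embs, gammas):
    """Weighted sum of the behavior-specific item embeddings."""
    return [sum(g * x for g, x in zip(gammas, dims)) for dims in zip(*behavior_embs)]
```

Passing `gammas=[1.0, ...]` to `aggregate_item` corresponds to the fix $\gamma_{ik}$ variant, while in training the values of `w` would be learned rather than supplied by hand.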

IV-C4 Effect of the global embedding aggregation

Aggregating the global embeddings into the final embedding, as shown in Eq. 10, aims to obtain a more comprehensive representation. We conduct an ablation study to verify this point by comparing our model with a variant that does not consider the global embeddings. The experimental results are reported in Table VI.

TABLE VI: Effect of the global embedding aggregation (w. c.g. and w/o. c.g. represent with and without global embedding aggregation, respectively).

Method	Tmall		Beibei		Jdata
Method	HR@10	NDCG@10	HR@10	NDCG@10	HR@10	NDCG@10
w/o. c.g.	0.1257	0.0665	0.0579	0.0271	0.4606	0.2739
w. c.g.	0.1461	0.0770	0.0619	0.0297	0.5338	0.3238

With global embedding aggregation, our model gains relative improvements of 16.23% and 15.79% on Tmall, 6.91% and 9.59% on Beibei, and 15.89% and 18.22% on Jdata in terms of HR@10 and NDCG@10, respectively. This demonstrates that aggregating global embeddings can indeed improve performance. Global embeddings reflect coarse-grained user preferences, while behavior-specific embeddings reflect fine-grained user preferences. Combining these two types of embeddings provides a comprehensive and hierarchical representation of user preferences, which further improves recommendation performance. It again justifies the effectiveness of our hierarchical design, which learns embeddings at both the global and behavior-specific levels.

IV-C5 Effect of multi-task learning

We adopt a multi-task learning (MTL) framework for joint optimization. To verify its effectiveness, we compare the method with and without MTL, where the method without MTL trains only the target behavior as a single task. The experimental results are reported in Table VII.

TABLE VII: Effect of the joint optimization (w. MTL and w/o. MTL represent with and without MTL, respectively).

Method	Tmall		Beibei		Jdata
Method	HR@10	NDCG@10	HR@10	NDCG@10	HR@10	NDCG@10
w/o. MTL	0.1393	0.0660	0.0385	0.0182	0.4428	0.2615
w. MTL	0.1461	0.0770	0.0619	0.0297	0.5338	0.3238

The experimental results, reported in Table VII, demonstrate that the MTL method outperforms the single-task method across all three datasets, indicating the effectiveness of MTL. The underlying reason is that our model treats each behavior as an independent task during training: the better a behavior-specific embedding fits the user preferences exhibited in the corresponding behavior, the more accurately relevant information can be aggregated in the adaptive aggregation stage, leading to more precise predictions. It is worth noting that the scale of interaction data for different behaviors may affect the performance of multi-task learning. Therefore, considering the importance of different behaviors is necessary, but it is not the focus of our current research, and we plan to study it in future work.

IV-D GCN Layer Study (RQ3)

Our model adopts LightGCN as the backbone to perform convolution operations in each graph. From the perspective of the overall structure, successively performing convolutions on graph $\mathcal{G}$ and graph $\mathcal{G}_{k}$ resembles simply increasing the number of GCN layers. We therefore compare the effects of different numbers of GCN layers. The results are reported in Fig. 3.

From Fig. 3, it can be seen that the performance first increases as the number of GCN layers grows and then drops when more layers are stacked, which is consistent with the results observed in the single-behavior methods LightGCN [12] and NGCF [37]. The best performance is obtained when the number of GCN layers is 2 in our experiments.
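The role of the layer number can be illustrated with a compact, pure-Python sketch of LightGCN-style propagation and of the two-stage (global, then behavior-specific) scheme. The symmetric normalization follows LightGCN [12]; combining layers by a simple mean, the node naming, and the function names are assumptions made for illustration and may differ from the authors' implementation.

```python
import math

def lightgcn(edges, init, num_layers):
    """LightGCN-style propagation on a bipartite graph.

    edges: iterable of (user_node, item_node); init: {node: embedding list}.
    Returns the mean of the layer-0..L embeddings of each node (assumed combination).
    """
    neighbors = {node: set() for node in init}
    for u, i in edges:
        neighbors[u].add(i)
        neighbors[i].add(u)
    dim = len(next(iter(init.values())))
    layer = {n: list(v) for n, v in init.items()}
    total = {n: list(v) for n, v in init.items()}
    for _ in range(num_layers):
        nxt = {}
        for n in layer:
            acc = [0.0] * dim
            for m in neighbors[n]:
                # symmetric normalization 1 / sqrt(|N_n| * |N_m|)
                norm = 1.0 / math.sqrt(len(neighbors[n]) * len(neighbors[m]))
                for d in range(dim):
                    acc[d] += norm * layer[m][d]
            nxt[n] = acc
        layer = nxt
        for n in total:
            for d in range(dim):
                total[n][d] += layer[n][d]
    return {n: [x / (num_layers + 1) for x in v] for n, v in total.items()}

def hierarchical_embeddings(behavior_edges, init, num_layers):
    """Global embeddings from the unified graph initialize every behavior graph."""
    unified = {e for edges in behavior_edges.values() for e in edges}
    global_emb = lightgcn(unified, init, num_layers)
    specific = {k: lightgcn(edges, global_emb, num_layers)
                for k, edges in behavior_edges.items()}
    return global_emb, specific
```

In this sketch a node that has no edges in a behavior graph receives no messages, so its behavior-specific embedding is just a rescaled copy of its global embedding.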

IV-E Cold-start Problem (RQ4)

The cold-start problem in recommender systems refers to the situation where a new user or item is added to the system and there is insufficient historical data to provide personalized recommendations. Multi-behavior recommendation is one approach to alleviating the cold-start problem, since data from multiple behaviors may include rich information that helps to better understand user preferences. In this section, we verify the capability of our model to tackle this problem.

We compare our MB-HGCN with two models, CRGCN and MBGCN, where CRGCN is the best baseline and MBGCN designs an item-based scoring module to alleviate the cold-start problem. To perform the study, we follow previous work [32, 15] to randomly select 1,000 users from the test set as cold-start users and remove all of their buy behavior records from the training set. In addition, for the other behaviors in the training set, we also remove the user-item pairs involved in the buy behavior. This process ensures that these 1,000 users do not have any prior preference information about items, thus simulating them as hard cold-start users with no buy behavior records. We then train the model on the remaining records using the settings described in Section IV-A. Finally, we use the trained model to provide personalized recommendations for these 1,000 cold-start users.
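The simulation protocol above can be sketched as follows. The data layout (a dictionary mapping each behavior to a set of user-item pairs), the function name, and the fixed random seed are illustrative assumptions. The sketch also adopts one reading of the protocol, namely that only the bought pairs of the selected cold-start users are removed from the auxiliary behaviors.

```python
import random

def simulate_cold_start(train, test_users, target="buy", num_users=1000, seed=0):
    """train: {behavior: set of (user, item)}; returns (cold_users, filtered_train)."""
    rng = random.Random(seed)
    candidates = sorted(test_users)
    cold = set(rng.sample(candidates, min(num_users, len(candidates))))
    # user-item pairs that the cold-start users bought
    bought = {(u, i) for (u, i) in train[target] if u in cold}
    filtered = {}
    for behavior, pairs in train.items():
        if behavior == target:
            # drop every target-behavior record of the cold-start users
            filtered[behavior] = {(u, i) for (u, i) in pairs if u not in cold}
        else:
            # drop the bought pairs from the auxiliary behaviors as well
            filtered[behavior] = pairs - bought
    return cold, filtered
```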

The experimental results are shown in Fig. 4. It can be observed that our MB-HGCN consistently outperforms CRGCN and MBGCN across all three datasets. Compared with CRGCN, the average improvements of our model are 19.94% and 35.30% on the Tmall dataset, 18.62% and 20.75% on the Beibei dataset, and 11.28% and 22.08% on the Jdata dataset in terms of HR@K and NDCG@K, respectively. This result indicates that our model is able to better utilize multi-behavior data to learn user preferences for the target behavior recommendation. This should be attributed to the design of the hierarchical graph convolutional network, in which we learn user preferences from a coarse-grained global level to a fine-grained behavior-specific level. Therefore, even if users do not have buy behavior, our model can still learn coarse-grained user preferences for the target behavior recommendation. In contrast, the sequential modeling of CRGCN fails to effectively learn from behaviors that occur irregularly, such as the collect behavior, resulting in suboptimal results. In addition, CRGCN shows a significant improvement over MBGCN due to its cascading design, which can effectively utilize the effect of cascading behaviors to refine user preferences, while the weighted aggregation strategy adopted by MBGCN may not be able to capture the complex interrelationships between behaviors.

IV-F Embedding Learning Analysis (RQ5)

In recommender systems, embeddings are commonly used to represent users. Each position in the embedding can be viewed as a latent interest feature of the user [3, 43, 44], and these interest features collectively form the user’s preferences. In our model, we first learn user preferences from a coarse-grained global level to a fine-grained behavior-specific level, and then adaptively aggregate relevant information from auxiliary behaviors based on their similarities. To explore how user interests change during this process, we visualize the user embeddings at each stage. In this visualization, the darkness of the color represents the importance of the feature, with darker colors indicating greater importance, and the feature values of each embedding sum to 1. Specifically, we randomly select one user from each of the Tmall, Beibei, and Jdata datasets, and display the first 8 positions of their global embeddings, behavior-specific embeddings, and the final embeddings that are used for the target behavior recommendation.

The experimental results are shown in Fig. 5. Overall, the interest distribution exhibited by the global embeddings is relatively even across all three datasets. Compared with the global embedding, the feature values of the behavior-specific embeddings change in different ways. Taking the view behavior of the Tmall dataset as an example, the 0th, 1st, 3rd and 7th features in the view behavior-specific embedding have a darker color than in the global embedding, indicating that the user pays more attention to these features during browsing. On the other hand, the 2nd, 4th, 5th and 6th features have a lighter color, suggesting that these features contribute less to the user's browsing behavior. This demonstrates that behavior-specific embeddings refine and enhance the global embedding. Another interesting observation is that behavior-specific embeddings amplify the degree of interest or disinterest (darker colors become even darker and lighter colors become even lighter) without changing the nature of a feature (i.e., an interested feature does not become a disinterested one). This indicates that the global embedding can indeed represent users’ coarse-grained preferences and further confirms that the behavior-specific embedding locally refines the global embedding. In addition, we observe that some behavior-specific features are consistent with the global features (cart in the Tmall dataset, and collect and cart in the Jdata dataset), which is due to the lack of user-item interaction records in the corresponding behaviors. It also confirms that MB-HGCN can address the cold-start problem to some extent, i.e., when there is no buy behavior, the model retains the global embedding for recommendation. Finally, we adopt the adaptive aggregation strategy to obtain the final embedding for the target behavior recommendation, which is obtained by aggregating features based on the buy behavior. Taking the Tmall dataset as an example, the 0th and 3rd features in the final embedding are enhanced, the 7th feature is slightly weakened, and other features with lower levels of interest (such as the 2nd, 5th, and 6th features) are also adjusted to some extent. This result also validates the effectiveness of the adaptive aggregation strategy we proposed.

V Conclusion

In this work, we present a novel multi-behavior recommendation model named MB-HGCN, which can effectively exploit the multi-behavior information to learn user and item embeddings. In particular, a hierarchical graph network is designed to learn user preference from global to behavior-specific level. Moreover, two different aggregation strategies are applied to aggregate user and item embeddings learned from different behaviors. Extensive experimental results on three real-world benchmark datasets demonstrate the superiority of our model over the state-of-the-art MBR models. Further ablation studies verify the effectiveness of different components in our model. In the future, we plan to explore the relations among multi-behavior interactions in the embedding learning process, and conduct experiments on online systems with A/B testing to evaluate the performance of our proposed model.

References

[1] S. Zhang, L. Yao, A. Sun, and Y. Tay, “Deep learning based recommender system: A survey and new perspectives,” ACM Comput. Surv., vol. 52, no. 1, pp. 1–38, 2019.
[2] Z. Cheng, X. Chang, L. Zhu, R. C. Kanjirathinkal, and M. S. Kankanhalli, “MMALFM: explainable recommendation by leveraging reviews and images,” ACM Trans. Inf. Syst., vol. 37, no. 2, pp. 16:1–16:28, 2019.
[3] Y. Koren, R. M. Bell, and C. Volinsky, “Matrix factorization techniques for recommender systems,” Computer, vol. 42, no. 8, pp. 30–37, 2009.
[4] Y. Xu, L. Zhu, Z. Cheng, J. Li, and J. Sun, “Multi-feature discrete collaborative filtering for fast cold-start recommendation,” in Proceedings of the 34th AAAI Conference on Artificial Intelligence. AAAI Press, 2020, pp. 270–278.
[5] Y. Koren, “Factorization meets the neighborhood: A multifaceted collaborative filtering model,” in Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2008, pp. 426–434.
[6] X. Ning and G. Karypis, “SLIM: sparse linear methods for top-n recommender systems,” in Proceedings of the 11th IEEE International Conference on Data Mining. IEEE, 2011, pp. 497–506.
[7] S. Rendle, C. Freudenthaler, Z. Gantner, and L. Schmidt-Thieme, “BPR: Bayesian personalized ranking from implicit feedback,” in Proceedings of the 25th International Conference on Uncertainty in Artificial Intelligence. AUAI, 2009, pp. 452–461.
[8] S. Kabbur, X. Ning, and G. Karypis, “FISM: factored item similarity models for top-n recommender systems,” in Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2013, pp. 659–667.
[9] R. Salakhutdinov and A. Mnih, “Probabilistic matrix factorization,” in Proceedings of the 21st Annual Conference on Neural Information Processing Systems. MIT Press, 2007, pp. 1257–1264.
[10] X. He, L. Liao, H. Zhang, L. Nie, X. Hu, and T. Chua, “Neural collaborative filtering,” in Proceedings of the 26th International Conference on World Wide Web. ACM, 2017, pp. 173–182.
[11] H. Cheng, L. Koc, J. Harmsen, T. Shaked, T. Chandra, H. Aradhye, G. Anderson, G. Corrado, W. Chai, M. Ispir, R. Anil, Z. Haque, L. Hong, V. Jain, X. Liu, and H. Shah, “Wide & deep learning for recommender systems,” in Proceedings of the 1st Workshop on Deep Learning for Recommender Systems. ACM, 2016, pp. 7–10.
[12] X. He, K. Deng, X. Wang, Y. Li, Y. Zhang, and M. Wang, “Lightgcn: Simplifying and powering graph convolution network for recommendation,” in Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2020, pp. 639–648.
[13] X. Wang, X. He, Y. Cao, M. Liu, and T. Chua, “KGAT: knowledge graph attention network for recommendation,” in Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2019, pp. 950–958.
[14] C. Gao, X. He, D. Gan, X. Chen, F. Feng, Y. Li, T. Chua, L. Yao, Y. Song, and D. Jin, “Learning to recommend with multiple cascading behaviors,” IEEE Trans. Knowl. Data Eng., vol. 33, no. 6, pp. 2588–2601, 2021.
[15] B. Jin, C. Gao, X. He, D. Jin, and Y. Li, “Multi-behavior recommendation with graph convolutional networks,” in Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2020, pp. 659–668.
[16] H. Qiu, Y. Liu, G. Guo, Z. Sun, J. Zhang, and H. T. Nguyen, “BPRH: bayesian personalized ranking for heterogeneous implicit feedback,” Inf. Sci., vol. 453, pp. 80–98, 2018.
[17] M. S. Schlichtkrull, T. N. Kipf, P. Bloem, R. van den Berg, I. Titov, and M. Welling, “Modeling relational data with graph convolutional networks,” in Proceedings of the 15th European Semantic Web Conference. Springer, 2018, pp. 593–607.
[18] L. Xia, C. Huang, Y. Xu, P. Dai, M. Lu, and L. Bo, “Multi-behavior enhanced recommendation with cross-interaction collaborative relation modeling,” in Proceedings of the 37th IEEE International Conference on Data Engineering. IEEE, 2021, pp. 1931–1936.
[19] W. Zhang, J. Mao, Y. Cao, and C. Xu, “Multiplex graph neural networks for multi-behavior recommendation,” in Proceedings of the 29th ACM International Conference on Information and Knowledge Management. ACM, 2020, pp. 2313–2316.
[20] A. P. Singh and G. J. Gordon, “Relational learning via collective matrix factorization,” in Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2008, pp. 650–658.
[21] Z. Zhao, Z. Cheng, L. Hong, and E. H. Chi, “Improving user topic interest profiles by behavior factorization,” in Proceedings of the 24th International Conference on World Wide Web. ACM, 2015, pp. 1406–1416.
[22] B. Loni, R. Pagano, M. A. Larson, and A. Hanjalic, “Bayesian personalized ranking with multi-channel user feedback,” in Proceedings of the 10th ACM Conference on Recommender Systems. ACM, 2016, pp. 361–364.
[23] J. Ding, G. Yu, X. He, Y. Quan, Y. Li, T. Chua, D. Jin, and J. Yu, “Improving implicit recommender systems with view data,” in Proceedings of the 27th International Joint Conference on Artificial Intelligence. ijcai.org, 2018, pp. 3343–3349.
[24] G. Guo, H. Qiu, Z. Tan, Y. Liu, J. Ma, and X. Wang, “Resolving data sparsity by multi-type auxiliary implicit feedback for recommender systems,” Knowl. Based Syst., vol. 138, pp. 202–207, 2017.
[25] L. Xia, C. Huang, Y. Xu, P. Dai, B. Zhang, and L. Bo, “Multiplex behavioral relation learning for recommendation via memory augmented transformer network,” in Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2020, pp. 2397–2406.
[26] L. Xia, Y. Xu, C. Huang, P. Dai, and L. Bo, “Graph meta network for multi-behavior recommendation,” in Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2021, pp. 757–766.
[27] Y. Zhu, Q. Lin, H. Lu, K. Shi, D. Liu, J. Chambua, S. Wan, and Z. Niu, “Recommending learning objects through attentive heterogeneous graph convolution and operation-aware neural network,” IEEE Trans. Knowl. Data Eng., vol. 35, no. 4, pp. 4178–4189, 2023.
[28] S. Gu, X. Wang, C. Shi, and D. Xiao, “Self-supervised graph neural networks for multi-behavior recommendation,” in Proceedings of the 31st International Joint Conference on Artificial Intelligence. ijcai.org, 2022, pp. 2052–2058.
[29] L. Guo, L. Hua, R. Jia, B. Zhao, X. Wang, and B. Cui, “Buying or browsing?: Predicting real-time purchasing intent using attention-based deep network with multiple behavior,” in Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2019, pp. 1984–1992.
[30] C. Meng, Z. Zhao, W. Guo, Y. Zhang, H. Wu, C. Gao, D. Li, X. Li, and R. Tang, “Coarse-to-fine knowledge-enhanced multi-interest learning framework for multi-behavior recommendation,” CoRR, vol. abs/2208.01849, pp. 1–11, 2022.
[31] C. Chen, W. Ma, M. Zhang, Z. Wang, X. He, C. Wang, Y. Liu, and S. Ma, “Graph heterogeneous multi-relational recommendation,” in Proceedings of the 35th AAAI Conference on Artificial Intelligence. AAAI Press, 2021, pp. 3958–3966.
[32] M. Yan, Z. Cheng, C. Gao, J. Sun, F. Liu, F. Sun, and H. Li, “Cascading residual graph convolutional network for multi-behavior recommendation,” ACM Trans. Inf. Syst., pp. 1–24, 2023.
[33] M. Wan and J. J. McAuley, “Item recommendation on monotonic behavior chains,” in Proceedings of the 12th ACM Conference on Recommender Systems. ACM, 2018, pp. 86–94.
[34] K. Mao, J. Zhu, X. Xiao, B. Lu, Z. Wang, and X. He, “Ultragcn: Ultra simplification of graph convolutional networks for recommendation,” in Proceedings of the 30th ACM International Conference on Information and Knowledge Management. ACM, 2021, pp. 1253–1262.
[35] S. Peng, K. Sugiyama, and T. Mine, “SVD-GCN: A simplified graph convolution paradigm for recommendation,” in Proceedings of the 31st ACM International Conference on Information and Knowledge Management. ACM, 2022, pp. 1625–1634.
[36] H. Tang, J. Liu, M. Zhao, and X. Gong, “Progressive layered extraction (PLE): A novel multi-task learning (MTL) model for personalized recommendations,” in Proceedings of the 14th ACM Conference on Recommender Systems. ACM, 2020, pp. 269–278.
[37] X. Wang, X. He, M. Wang, F. Feng, and T. Chua, “Neural graph collaborative filtering,” in Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2019, pp. 165–174.
[38] R. van den Berg, T. N. Kipf, and M. Welling, “Graph convolutional matrix completion,” in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2018, pp. 1–7.
[39] C. Gao, X. He, D. Gan, X. Chen, F. Feng, Y. Li, T. Chua, and D. Jin, “Neural multi-task recommendation from multi-behavior data,” in Proceedings of the 35th IEEE International Conference on Data Engineering. IEEE, 2019, pp. 1554–1557.
[40] G. Karypis, “Evaluation of item-based top-n recommendation algorithms,” in Proceedings of the 10th ACM CIKM International Conference on Information and Knowledge Management. ACM, 2001, pp. 247–254.
[41] K. Järvelin and J. Kekäläinen, “IR evaluation methods for retrieving highly relevant documents,” in Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, E. J. Yannakoudakis, N. J. Belkin, P. Ingwersen, and M. Leong, Eds. ACM, 2000, pp. 41–48.
[42] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in Proceedings of the 3rd International Conference on Learning Representations. OpenReview.net, 2015, pp. 1–15.
[43] Y. Hu, Y. Koren, and C. Volinsky, “Collaborative filtering for implicit feedback datasets,” in Proceedings of the 8th IEEE International Conference on Data Mining. IEEE Computer Society, 2008, pp. 263–272.
[44] H. Cheng, L. Koc, J. Harmsen, T. Shaked, T. Chandra, H. Aradhye, G. Anderson, G. Corrado, W. Chai, M. Ispir, R. Anil, Z. Haque, L. Hong, V. Jain, X. Liu, and H. Shah, “Wide & deep learning for recommender systems,” in Proceedings of the 1st Workshop on Deep Learning for Recommender Systems. ACM, 2016, pp. 7–10.