
Attribute-driven Disentangled Representation Learning for Multimodal Recommendation

Zhenyang Li (0000-0002-4694-1231, Shandong University, [email protected]); Fan Liu (0000-0002-4547-3982, National University of Singapore, [email protected]); Yinwei Wei (0000-0003-1791-3159, Monash University, [email protected]); Zhiyong Cheng (0000-0003-1109-5028, Hefei University of Technology, [email protected]); Liqiang Nie (0000-0003-1476-0273, Harbin Institute of Technology, Shenzhen, [email protected]); and Mohan Kankanhalli (0000-0002-4846-2015, National University of Singapore, [email protected])
(2024)
Abstract.

Recommendation algorithms predict user preferences by correlating user and item representations derived from historical interaction patterns. In pursuit of enhanced performance, many methods focus on learning robust and independent representations by disentangling the intricate factors within interaction data across various modalities in an unsupervised manner. However, such an approach obscures how specific factors (e.g., category or brand) influence the outcomes, making it challenging to regulate their effects. In response to this challenge, we introduce a novel method called Attribute-Driven Disentangled Representation Learning (AD-DRL for short), which explicitly incorporates attributes from different modalities into the disentangled representation learning process. By assigning a specific attribute to each factor in multimodal features, AD-DRL can disentangle the factors at both attribute and attribute-value levels. To obtain robust and independent representations for each factor associated with a specific attribute, we first disentangle the representations of features both within and across different modalities. Moreover, we further enhance the robustness of the representations by fusing the multimodal features of the same factor. Empirical evaluations conducted on three public real-world datasets substantiate the effectiveness of AD-DRL, as well as its interpretability and controllability.

Multimodal, Attribute, Disentangled Representation, Recommendation
journalyear: 2024; copyright: acmlicensed; conference: Proceedings of the 32nd ACM International Conference on Multimedia, October 28-November 1, 2024, Melbourne, VIC, Australia; booktitle: Proceedings of the 32nd ACM International Conference on Multimedia (MM ’24), October 28-November 1, 2024, Melbourne, VIC, Australia; doi: 10.1145/3664647.3681148; isbn: 979-8-4007-0686-8/24/10; ccs: Information systems → Personalization; ccs: Information systems → Recommender systems; ccs: Information systems → Collaborative filtering

1. Introduction

Recommender systems (RS) are integral to a myriad of online platforms, from E-commerce to advertising, helping users pinpoint items aligned with their preferences. Given their pivotal role, numerous efforts have been dedicated to developing advanced recommendation models to improve performance. Among them, Collaborative Filtering (CF) based models (Rendle et al., 2009; Rendle, 2010; He et al., 2017; Hsieh et al., 2017; Wang et al., 2019; He et al., 2020; Wang et al., 2020; Liu et al., 2021) have achieved great success by exploiting the user-item interaction data to learn user and item representations. However, these models easily encounter the sparsity problem in practical scenarios due to their exclusive dependence on interaction data. To alleviate the data sparsity problem in recommendation, side information (e.g., attributes, user reviews and item images), which contains rich information associated with users or items, is often used to enhance the representation learning of users and items (Zhang et al., 2016, 2017a; Liu et al., 2019; Wei et al., 2019, 2020; Liu et al., 2023a, c).

It is well recognized that entangled representations of users and items cannot directly capture fine-grained user preferences across diverse factors, thereby constraining both the efficacy and interpretability of recommender systems (Ma et al., 2019; Wang et al., 2020). In recent years, the study of disentangled representation learning has garnered significant attention in diverse fields, notably in computer vision (Higgins et al., 2017; Liu et al., 2020; Hsieh et al., 2018), due to its capability to identify and disentangle the underlying factors behind data. Empirical evidence suggests that disentangled representations exhibit enhanced robustness, particularly in complex application contexts. Hence, many recommendation methods adopt disentangled representation learning techniques to learn robust and independent representations, ultimately enhancing recommendation performance (Ma et al., 2019; Mu et al., 2022; Wang et al., 2020; Tran and Lauw, 2022; Liu et al., 2023a). For example, Ma et al. (Ma et al., 2019) employed disentangled representations to capture user preferences regarding different concepts associated with user intentions. Wang et al. (Wang et al., 2020) introduced a GCN-based model that produces disentangled representations by modeling a distribution over intents for each user-item interaction, exploring the diversity of user intentions in adopting items. However, these methods only concentrate on disentangling the user and item representations based on their ID embeddings. To exploit the differences between the factors behind data of various modalities, Liu et al. (Liu et al., 2023a) estimated users’ attention weights over the underlying factors of different modalities with a sophisticated attention-driven module.

Despite the considerable advancements brought about by disentangling techniques in recommender systems, existing studies typically disentangle both users and items into latent factors. This inherently constrains the model’s interpretability and controllability. For example, consider using disentangled representations to elucidate the diverse factors, such as style, brand, popularity, and price, that shape user preferences in dress selection. Because the latent factors produced by current disentangling methods are abstract, it is difficult to pinpoint which factor represents each specific dress attribute. In particular, one factor might be loosely related to a combination of brand and price, while another might represent a mixture of style and popularity. This ambiguity in the latent factors limits the interpretability of the recommendation system, making it difficult to understand why a particular item was recommended to a user. Moreover, this lack of clarity also affects the controllability of the recommendation system. Suppose a user wants to receive dress recommendations that ignore her preferences for price and popularity. It would be challenging to adjust the recommender system accordingly, as the latent factors are not clearly tied to these attributes.

Item attributes manifested in various modalities can enrich the recommendation system by offering diverse and complementary information. For example, the textual data might explicitly mention the brand or price, while visual data can reveal visual attributes, such as the category of items. Additionally, the popularity of items can be derived from the statistical information of interaction data. In fact, attributes represent specific, meaningful properties or characteristics of items. Consequently, using attributes to guide the disentanglement process is a promising way to improve the interpretability and controllability of conventional multimodal recommendation methods. In this paper, we propose an Attribute-Driven Disentangled Representation Learning method (AD-DRL for short), which disentangles factors in user and item representations across various modalities at different levels of attribute granularity. To obtain robust and independent representations for each factor associated with a specific attribute, we first disentangle the representations of features within and across different modalities. This process is guided by high-level attributes (e.g., category and popularity), which help reveal the underlying relationships between factors. Following this, we further enhance the robustness of the representations by fusing the multimodal features of the same factor. This step involves exploiting the relationships among representations of the same factor by leveraging low-level attributes (e.g., category attribute values for a clothing dataset, such as jeans, jackets, and dresses), resulting in finer-grained and more comprehensive disentangled representations. To validate the effectiveness of our method, we conduct extensive experiments and ablation studies on three real-world datasets. Experimental results demonstrate the superiority of our method AD-DRL compared to existing methods and showcase its capability in terms of interpretability and controllability.

In summary, the contributions of this paper are threefold:

  • In this paper, we highlight the limitations of traditional disentangled representation learning in multimodal recommendation systems. To overcome these shortcomings, we introduce AD-DRL, which improves interpretability and controllability by employing attributes to disentangle factors in user and item representations.

  • To achieve robust and independent representations, we assign a specific attribute to each factor in multimodal features and disentangle factors at both the attribute and attribute-value levels.

  • We conduct extensive experiments on three real-world datasets to validate the effectiveness of our method. The experimental analysis demonstrates the interpretability and controllability of our model.

2. Related Work

2.1. Multimodal Collaborative Filtering

Traditional Collaborative Filtering (CF) (Sarwar et al., 2001; Ebesu et al., 2018; Rendle et al., 2009) methods primarily rely on user-item interactions to learn representations of users and items. Consequently, the recommendation performance is negatively impacted when encountering users and items with limited interactions. To mitigate the challenges posed by data sparsity (McAuley and Leskovec, 2013; Kalantidis et al., 2013; Tan et al., 2016; Liu et al., 2023b), recent research has incorporated multimodal information, which has been extensively explored in other fields (Xu et al., 2015; Li et al., 2023b, 2024, a), into recommendation systems. The multimodal features (He and McAuley, 2016; Zhang et al., 2017a; Liu et al., 2019; Wei et al., 2019, 2020; Liu et al., 2023a; Du et al., 2023; Guo et al., 2024), such as reviews and images, provide valuable information for user preference and item characteristics, which can supplement historical user-item interactions and thus enhance recommendation performance.

Previous studies integrate multimodal information into matrix factorization-based methods in a straightforward manner. For instance, VBPR (He and McAuley, 2016) directly concatenates the visual features learned from item images with collaborative features as the joint item representation and feeds it into the MF module. With the success of deep learning techniques in modeling complex interaction behaviors (He et al., 2017) and the relations between various multimodal features (Nie et al., 2016), many deep learning techniques have been introduced into multimodal recommendation (He et al., 2017; Tay et al., 2018). For example, MAML (Liu et al., 2019) first fuses the item’s multimodal features with user features, then feeds them into an attention module to capture users’ diverse preferences. VECF (Chen et al., 2019) constructs a visually explainable collaborative filtering model using a multimodal attention network, facilitating the integrated coupling of diverse feature modalities. More recently, graph convolutional networks (GCNs) (Hamilton et al., 2017) have demonstrated a powerful capability for representation learning in recommendation (Wang et al., 2019; He et al., 2020; Zhou et al., 2023; Wei et al., 2023). Based on the GCN structure, MMGCN (Wei et al., 2019) constructs a user-item bipartite graph to learn representations of each modality and then fuses the modality representations together as the final representation. GRCN (Wei et al., 2020) utilizes the rich multimodal content of items to refine the structure of the interaction graph in order to mitigate the effect of false-positive edges on recommendation performance.

2.2. Disentangled Representation Learning

Disentangled representation learning, which seeks to identify and separate underlying explanatory factors within data, has garnered significant interest, especially in the field of computer vision (Higgins et al., 2017; Liu et al., 2020; Ge et al., 2021). For instance, the beta-VAE method (Higgins et al., 2017) employs a constrained variational framework to learn disentangled representations of fundamental visual concepts. Meanwhile, IPGDN (Liu et al., 2020) automatically uncovers independent latent factors present in graph data.

Owing to the successful application of disentangled representation learning in various domains, numerous studies in recent years have concentrated on learning disentangled representations for users and items in recommendation systems (Ma et al., 2019; Tran and Lauw, 2022; Wang et al., 2020; Mu et al., 2022; Liu et al., 2023a; Wang et al., 2023). Following the previous work in computer vision, the initial attempts adopt the Variational Auto-Encoder (VAE) (Kingma and Welling, 2014) to learn disentangled representations. For example, MacridVAE (Ma et al., 2019) separately captures user preferences regarding the different concepts associated with user intentions. Beyond the disentangled representations derived from user-item interaction modeling, ADDVAE (Tran and Lauw, 2022) acquires an additional set of disentangled user representations from textual content and subsequently aligns both sets of representations. To study the diversity of user intents in adopting items, DGCF (Wang et al., 2020) employs a graph disentangling module to iteratively refine the intent-aware interaction graph and factorial representations for recommendation. To model users’ various preferences on different factors of each modality, DMRL (Liu et al., 2023a) estimates the user’s attention weights over the underlying factors of different modalities with an attention module. As the learned disentangled representations lack clear meanings, KDR (Mu et al., 2022) uses knowledge graphs (KGs) to guide the learning process, ensuring that the resulting disentangled representations are associated with semantically meaningful information extracted from the KGs.

Although existing models achieve performance improvements through disentangled representation learning, they all disentangle the user and item representations into latent factors without clarifying the semantic meaning of each factor.

3. Method

Figure 1. High-level and low-level attribute-driven disentangled representation learning module of our proposed AD-DRL. To disentangle attribute factors in multimodal features, (a) Intra-modality disentanglement module exploits the difference between attribute factors (e.g., price, brand, category and popularity) within the same modality feature, (b) Inter-modality disentanglement module utilizes the consistency of the same attribute factor in different modality features, and (c) low-level disentangled representation learning module leverages the intrinsic relationships between items sharing the same attribute value (e.g., the popularity value of different levels: Super Popular, Popular, Moderate, Emerging, and Unknown).

3.1. Preliminaries

3.1.1. Problem Setting

Given a set of users $\mathcal{U}=\{u\}$ and items $\mathcal{I}=\{i\}$, we utilize three types of information to learn user and item representations: (1) A user-item interaction matrix $\bm{R}$, where each entry $r_{ui}\in\bm{R}$ represents the implicit feedback (e.g., clicks, likes or purchases) of user $u$ to item $i$; $r_{ui}=1$ indicates that user $u$ has interacted with item $i$ and $r_{ui}=0$ indicates that there is no interaction between user $u$ and item $i$ in the observed data; (2) Multimodal features, which mainly consist of two types of features associated with items, namely reviews and images; and (3) Attribute information, consisting of $K$ attributes and their associated attribute values of items, used as supervision for disentangled representation learning. We aim to learn robust representations of users and items, thereby predicting a user’s preference for items they have not yet interacted with.

3.1.2. Notations.

Similar to previous work (Wei et al., 2020; Liu et al., 2023a), each user $u$ and item $i$ is assigned a unique ID and represented by a vector $\bm{v}_{u}\in\mathbb{R}^{d}$ and $\bm{v}_{i}\in\mathbb{R}^{d}$, respectively, both randomly initialized in our model. For the review and image information, we use BERT (Devlin et al., 2019) and ViT (Dosovitskiy et al., 2021) to extract the raw textual features $\bm{e}_{t}\in\mathbb{R}^{d_{0}}$ and visual features $\bm{e}_{v}\in\mathbb{R}^{d_{0}}$, respectively. To tailor recommendation-oriented features, we adopt two non-linear transformations to cast $\bm{e}_{t}$ and $\bm{e}_{v}$ into the same feature space:

(1) $\bm{v}_{t}=\sigma(\bm{W}_{t}\bm{e}_{t}+\bm{b}_{t}),\quad \bm{v}_{v}=\sigma(\bm{W}_{v}\bm{e}_{v}+\bm{b}_{v}),$

where $\bm{W}_{t},\bm{W}_{v}\in\mathbb{R}^{d\times d_{0}}$ and $\bm{b}_{t},\bm{b}_{v}\in\mathbb{R}^{d}$ denote the weight matrices and bias vectors for the textual and visual modalities, respectively, and $\sigma(\cdot)$ is the activation function.

Following the disentangled representation learning process in previous work (Wang et al., 2020; Liu et al., 2023a), we split each feature vector into $K$ chunks, each corresponding to a specific item attribute (e.g., price, brand). For simplicity, we equally split each modality’s representation into $K$ continuous chunks. Take the item ID embedding as an example:

(2) $\bm{v}_{i}=(\bm{v}_{i}^{1},\bm{v}_{i}^{2},\cdots,\bm{v}_{i}^{K}),$

where $\bm{v}_{i}^{k}\in\mathbb{R}^{\frac{d}{K}}$ is the chunk of the item ID embedding corresponding to the $k$-th attribute. Analogously, $\bm{v}_{t}=(\bm{v}_{t}^{1},\bm{v}_{t}^{2},\cdots,\bm{v}_{t}^{K})$, $\bm{v}_{v}=(\bm{v}_{v}^{1},\bm{v}_{v}^{2},\cdots,\bm{v}_{v}^{K})$ and $\bm{v}_{u}=(\bm{v}_{u}^{1},\bm{v}_{u}^{2},\cdots,\bm{v}_{u}^{K})$ are defined for the textual feature, visual feature and user ID embedding, respectively.
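To make the notation concrete, the following PyTorch sketch projects raw BERT/ViT features into the shared $d$-dimensional space of Equation (1) and equally splits each resulting vector into $K$ attribute chunks as in Equation (2). The class and function names, the LeakyReLU choice for $\sigma(\cdot)$, and the raw feature size $d_0=768$ are illustrative assumptions; $d=128$ corresponds to $K=4$ attributes with 32 dimensions per factor, matching the settings reported in Section 4.

```python
import torch
import torch.nn as nn

class ModalityProjector(nn.Module):
    """Casts raw textual/visual features into the shared d-dimensional space (Equation 1)."""
    def __init__(self, d0: int = 768, d: int = 128):
        super().__init__()
        self.text_proj = nn.Linear(d0, d)    # W_t, b_t
        self.visual_proj = nn.Linear(d0, d)  # W_v, b_v
        self.act = nn.LeakyReLU()            # sigma(.); the concrete activation is an assumption

    def forward(self, e_t: torch.Tensor, e_v: torch.Tensor):
        return self.act(self.text_proj(e_t)), self.act(self.visual_proj(e_v))

def split_into_chunks(v: torch.Tensor, K: int):
    """Equally splits a (batch, d) representation into K attribute chunks (Equation 2)."""
    return torch.chunk(v, K, dim=-1)  # tuple of K tensors, each of shape (batch, d / K)

# Usage: e_t, e_v are raw BERT/ViT features for a batch of items.
proj = ModalityProjector(d0=768, d=128)
v_t, v_v = proj(torch.randn(4, 768), torch.randn(4, 768))
chunks_t = split_into_chunks(v_t, K=4)  # chunks_t[k] corresponds to the k-th attribute
```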

3.1.3. Intuition of Attribute-driven Disentanglement

Our work not only aims to recommend accurate items to users but also advocates for attribute-driven disentanglement in multimodal recommender systems to enhance the interpretability and controllability of recommendations. Specifically, unlike existing methods that disentangle the latent factors in an unsupervised manner, we leverage the semantic labels of item attributes to learn an attribute-specific sub-space for each attribute and thereby obtain disentangled representations. In this paper, the vector of each modality is composed of $K$ chunks, with the assumption that each chunk is associated with an attribute (such as price or brand). Achieving attribute-driven disentangling requires two necessary conditions. Firstly, each chunk should have a clear attribute reference, and chunks associated with different attributes should be distinguishable. For example, chunk $j$ is associated with the price, while chunk $k$ represents the vector of the brand. Secondly, each chunk of a specific attribute should accurately capture the specific attribute value. For example, chunk $j$ should reflect the price level (i.e., expensive or cheap) of the product.

3.2. Attribute-driven Disentangled Representation Learning

In this section, we elaborate on our proposed model, termed AD-DRL, an acronym for Attribute-Driven Disentangled Representation Learning model. We aim to improve the interpretability and controllability of recommendation models by assigning a specific attribute to each factor in multimodal features. Specifically, AD-DRL comprises two disentangling modules: high-level and low-level attribute-driven disentangled representation learning. In the high-level attribute-driven disentangled representation learning module, we exploit the difference between attribute factors within the same modality feature and the consistency of the same attribute factor across different modalities. In contrast, the low-level attribute-driven disentangled representation learning module leverages the intrinsic relationships between items sharing the same attribute value.

3.2.1. High-Level Attribute-driven Disentangled Representation Learning

To obtain robust and independent representations for each attribute factor in multimodal features, AD-DRL enables disentangling within each modality feature and ensures consistency of representations across modalities. Next, we detail the intra-modality disentanglement and inter-modality disentanglement.

Intra-Modality Disentanglement. After obtaining the feature vectors of each modality, we split these vectors into several chunks. Nevertheless, different attribute factors are still entangled within the chunks. In other words, a chunk may contain both price- and brand-related features. Therefore, in order to disentangle attribute factors in each modality feature, we employ an attribute classifier to encourage each chunk to predict the corresponding attribute, as shown in Figure 1a. Taking the $k$-th chunk $\bm{v}_{t}^{k}$ of the item textual embedding as an example,

(3) $\begin{cases}\bm{z}^{k}=\bm{W}_{intra,t}\bm{v}_{t}^{k}+\bm{b}_{intra,t},\\ l_{intra,t}^{k}=-\sum_{n=1}^{K}\tilde{z}_{n}^{k}\log\frac{\exp{z_{n}^{k}}}{\sum_{m}\exp{z_{m}^{k}}},\end{cases}$

where $\bm{W}_{intra,t}\in\mathbb{R}^{K\times\frac{d}{K}}$ and $\bm{b}_{intra,t}\in\mathbb{R}^{K}$ are the weight matrix and bias vector of the classifier, respectively. $\bm{z}^{k}$ is the predicted logits with $z_{n}^{k}\in\bm{z}^{k}$, and $\tilde{z}_{n}^{k}$ denotes the ground truth (attribute label) of chunk $\bm{v}_{t}^{k}$. Summing over all the chunks in the textual embedding $\bm{v}_{t}$, we have the following loss for all attribute factors in the textual features:

(4) $l_{intra,t}=\sum_{k=1}^{K}l_{intra,t}^{k}.$

Similarly, we can define the losses $l_{intra,u}$, $l_{intra,i}$ and $l_{intra,v}$ to encourage the features of each attribute factor to be concentrated in the corresponding chunk of the user ID embedding, item ID embedding and visual feature, respectively. Finally, the total loss for the intra-modality disentanglement is formulated as:

(5) $l_{intra}=l_{intra,u}+l_{intra,i}+l_{intra,t}+l_{intra,v}.$
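A minimal sketch of the intra-modality loss in Equations (3)-(5) for one modality is given below. It assumes that the ground-truth attribute label of the $k$-th chunk is simply the index $k$, that each modality has its own linear classifier, and that the loss is averaged over the batch; the function and variable names are illustrative rather than taken from the released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def intra_modality_loss(chunks, classifier: nn.Linear) -> torch.Tensor:
    """Equations (3)-(4): each chunk of one modality predicts its own attribute index.

    chunks:     list of K tensors, each of shape (batch, d/K)
    classifier: nn.Linear(d/K, K), i.e., W_intra and b_intra for this modality
    """
    loss = chunks[0].new_zeros(())
    for k, v_k in enumerate(chunks):
        logits = classifier(v_k)  # z^k, shape (batch, K)
        target = torch.full((v_k.size(0),), k, dtype=torch.long, device=v_k.device)
        loss = loss + F.cross_entropy(logits, target)
    return loss

# Equation (5): sum the per-modality losses over user ID, item ID, textual and visual chunks, e.g.
# l_intra = (intra_modality_loss(chunks_u, clf_u) + intra_modality_loss(chunks_i, clf_i)
#            + intra_modality_loss(chunks_t, clf_t) + intra_modality_loss(chunks_v, clf_v))
```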

Inter-Modality Disentanglement. In addition to separately disentangling each modality, a key challenge in disentangled representation learning for multimodal features is handling the inter-relationships between the factors disentangled from multiple modalities. Intuitively, the chunks of the same attribute factor from different modality features should be consistent. For example, for the brand attribute, an item should carry the same brand information in both the visual and textual features. In other words, chunks that share the same attribute across different modalities should be highly similar, while chunks that represent different attributes across modalities should be dissimilar.

To further achieve robust representation in multimodal recommendations, we disentangle attribute factors in different modality features by ensuring consistency of the same attribute across modalities. Inspired by this, we design a cross-modal contrastive loss as shown in Figure 1b. Specifically, for any two modality representations, such as $\bm{v}_{t}$ and $\bm{v}_{v}$, we take two chunks of the same attribute factor (i.e., $(\bm{v}_{t}^{k},\bm{v}_{v}^{k})$) as positive pairs, and two chunks corresponding to different attributes (i.e., $(\bm{v}_{t}^{k},\bm{v}_{v}^{n\neq k})$, $(\bm{v}_{t}^{n\neq k},\bm{v}_{v}^{k})$) as negative pairs. After that, the cross-modal contrastive loss between textual and visual features is defined as follows:

(6) $\begin{cases}l^{k}_{t\rightarrow v}=-\log\frac{\exp((\bm{v}_{t}^{k}\cdot\bm{v}_{v}^{k})/\tau)}{\sum_{n=1}^{K}\exp((\bm{v}_{t}^{k}\cdot\bm{v}_{v}^{n})/\tau)},\\ l^{k}_{v\rightarrow t}=-\log\frac{\exp((\bm{v}_{v}^{k}\cdot\bm{v}_{t}^{k})/\tau)}{\sum_{n=1}^{K}\exp((\bm{v}_{v}^{k}\cdot\bm{v}_{t}^{n})/\tau)},\\ l_{t\leftrightarrow v}=\sum_{k=1}^{K}l^{k}_{t\rightarrow v}+\sum_{k=1}^{K}l^{k}_{v\rightarrow t},\end{cases}$

where $\cdot$ represents the dot product and $\tau\in\mathbb{R}^{+}$ is a scalar temperature parameter. By applying this contrastive learning constraint to any two out of the three modalities, we can achieve disentanglement and alignment across features of various modalities:

(7) $l_{inter}=l_{i\leftrightarrow t}+l_{i\leftrightarrow v}+l_{t\leftrightarrow v}.$
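Equations (6)-(7) can be sketched as an InfoNCE-style loss over the $K$ chunks of two modalities for a single item, as below; batching over items, the temperature value, and any feature normalization are implementation choices we assume rather than prescribe.

```python
import torch
import torch.nn.functional as F

def cross_modal_contrastive(v_a: torch.Tensor, v_b: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """Equation (6): chunks of the same attribute across two modalities form positive pairs.

    v_a, v_b: (K, d/K) chunked representations of one item in two modalities.
    """
    sim = v_a @ v_b.t() / tau  # (K, K); entry (k, n) = (v_a^k . v_b^n) / tau
    targets = torch.arange(v_a.size(0), device=v_a.device)
    l_a2b = F.cross_entropy(sim, targets, reduction="sum")      # sum_k l^k_{a -> b}
    l_b2a = F.cross_entropy(sim.t(), targets, reduction="sum")  # sum_k l^k_{b -> a}
    return l_a2b + l_b2a

# Equation (7): apply the loss to each pair among {item ID, textual, visual} chunk matrices, e.g.
# l_inter = (cross_modal_contrastive(V_i, V_t) + cross_modal_contrastive(V_i, V_v)
#            + cross_modal_contrastive(V_t, V_v))
```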

3.2.2. Low-level Attribute-driven Disentangled Representation Learning

Each attribute $k$ is associated with a list of possible attribute values $(\tilde{y}^{k}_{1},\tilde{y}^{k}_{2},\cdots,\tilde{y}^{k}_{A_{k}})$, where $A_{k}$ is the total number of possible values for that attribute. For example, the attribute value of popularity can be stratified into five distinct levels: Super Popular, Popular, Moderate, Emerging, and Unknown. The representation disentanglement can benefit from the attribute values by exploiting the relationships among representations of the same factor. Therefore, to learn more robust representations, we encourage the attribute-based disentangled representations to predict the specific attribute values of the item.

It is obvious that any single modality may not contain sufficient information for a particular attribute. For example, images may more intuitively reflect attributes such as the color and brand of an item, but may not directly reflect attributes such as price or popularity. Therefore, in order to comprehensively and accurately depict the attribute values of the item, we first integrate the features from all modalities together. To achieve this, we apply a multimodal attention mechanism to measure the emphasis of different modalities on different attributes. Specifically, for the $k$-th attribute, the attention weights assigned to different modalities are estimated by a two-layer neural network:

(8) $\begin{cases}\hat{\bm{a}}^{k}=\bm{W}_{a2}\tanh(\bm{W}_{a1}(\bm{v}_{i}^{k}+\bm{v}_{t}^{k}+\bm{v}_{v}^{k})+\bm{b}_{a}),\\ \bm{a}^{k}=\mathrm{Softmax}(\hat{\bm{a}}^{k}),\end{cases}$

where $\bm{W}_{a1}\in\mathbb{R}^{3\times\frac{d}{K}}$ and $\bm{W}_{a2}\in\mathbb{R}^{3\times 3}$ are the weight matrices of the first and second layers of the neural network, respectively, $\bm{b}_{a}$ denotes the bias vector, and $\tanh$ is the activation function. $\mathrm{Softmax}$ is adopted to normalize $\hat{\bm{a}}^{k}$ into a probability distribution.

Thereafter, we obtain the final representation of the $k$-th attribute factor for the item as follows:

(9) $\bm{v}_{y}^{k}=a_{i}^{k}\cdot\bm{v}_{i}^{k}+a_{t}^{k}\cdot\bm{v}_{t}^{k}+a_{v}^{k}\cdot\bm{v}_{v}^{k},$

where $a_{i}^{k}$, $a_{t}^{k}$, $a_{v}^{k}\in\bm{a}^{k}$ are the attention weights of the item ID embedding, textual feature and visual feature, respectively.
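A possible implementation of the attention fusion in Equations (8)-(9) is sketched below; the weight shapes follow the dimensions stated above, while the module name and the batched slicing are our own choices.

```python
import torch
import torch.nn as nn

class AttributeAttentionFusion(nn.Module):
    """Equations (8)-(9): weight and fuse the three modality chunks of the k-th attribute."""
    def __init__(self, chunk_dim: int):
        super().__init__()
        self.w1 = nn.Linear(chunk_dim, 3)      # W_a1 with bias b_a
        self.w2 = nn.Linear(3, 3, bias=False)  # W_a2

    def forward(self, v_i_k, v_t_k, v_v_k):
        # v_i_k, v_t_k, v_v_k: (batch, d/K) chunks of item ID, textual and visual features
        a_hat = self.w2(torch.tanh(self.w1(v_i_k + v_t_k + v_v_k)))  # (batch, 3)
        a = torch.softmax(a_hat, dim=-1)                             # a^k = (a_i^k, a_t^k, a_v^k)
        v_y_k = a[:, 0:1] * v_i_k + a[:, 1:2] * v_t_k + a[:, 2:3] * v_v_k
        return v_y_k, a
```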

Similar to the intra-modality disentanglement in Section 3.2.1, we achieve disentangled representation learning at the low level through attribute value prediction via a classification layer (as shown in Figure 1c):

(10) $\begin{cases}\bm{y}^{k}=\bm{W}_{k}\bm{v}_{y}^{k}+\bm{b}_{k},\\ l^{k}_{low}=-\sum_{i=1}^{A_{k}}\tilde{y}_{i}^{k}\log\frac{\exp{y_{i}^{k}}}{\sum_{j}\exp{y_{j}^{k}}},\\ l_{low}=\sum_{k=1}^{K}l^{k}_{low},\end{cases}$

where $\bm{W}_{k}\in\mathbb{R}^{A_{k}\times\frac{d}{K}}$ and $\bm{b}_{k}\in\mathbb{R}^{A_{k}}$ are the weight matrix and bias vector of the classifier for the $k$-th attribute, respectively. We supervise the training of each attribute subspace via independent attribute-value classification tasks defined in the form of a cross-entropy loss.
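Given the fused representations, the low-level loss in Equation (10) amounts to one attribute-value classifier per attribute. The sketch below assumes integer-encoded attribute-value labels and classifier output sizes $A_k$; all names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def low_level_loss(fused_chunks, value_classifiers, value_labels) -> torch.Tensor:
    """Equation (10): each fused attribute representation predicts the item's attribute value.

    fused_chunks:      list of K tensors v_y^k, each of shape (batch, d/K)
    value_classifiers: nn.ModuleList of K layers nn.Linear(d/K, A_k)
    value_labels:      list of K LongTensors of shape (batch,) holding value indices
    """
    loss = fused_chunks[0].new_zeros(())
    for v_y_k, clf_k, y_k in zip(fused_chunks, value_classifiers, value_labels):
        loss = loss + F.cross_entropy(clf_k(v_y_k), y_k)  # l^k_low, averaged over the batch
    return loss  # l_low summed over the K attributes
```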

3.3. Preference Prediction and Model Learning

3.3.1. Preference Prediction

So far we have discussed how to obtain attribute-driven disentangled representations $\bm{v}_{u}=(\bm{v}_{u}^{1},\bm{v}_{u}^{2},\cdots,\bm{v}_{u}^{K})$ and $\bm{v}_{y}=(\bm{v}_{y}^{1},\bm{v}_{y}^{2},\cdots,\bm{v}_{y}^{K})$ for the user $u$ and item $i$, respectively. By assigning each disentangled representation an attribute as described above, we are able to predict users’ preferences at the attribute level. This enhances the interpretability and controllability of our model. Specifically, to estimate a user $u$’s preference for an item $i$, it is crucial to consider her preference for each attribute of the item. To achieve this, we first compute the user’s preference score for each attribute of the item, and subsequently, integrate the scores of each attribute to estimate the user’s overall preference for the item:

(11) $\begin{cases}s_{u,i,k}=\sigma(\bm{v}_{u}^{k}\cdot\bm{v}_{y}^{k}),\\ s_{u,i}=\sum_{k=1}^{K}s_{u,i,k},\end{cases}$

where $\cdot$ symbolizes the dot product and $\sigma(\cdot)$ denotes the activation function; in this paper, we use a softplus function to ensure the resultant score is positive. $s_{u,i,k}$ denotes user $u$’s preference score for the attribute $k$ of item $i$.
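For reference, Equation (11) for a single user-item pair can be computed as below; the chunked tensor layout and function name are illustrative.

```python
import torch
import torch.nn.functional as F

def preference_score(user_chunks: torch.Tensor, item_chunks: torch.Tensor):
    """Equation (11): sum of per-attribute scores, each a softplus of a chunk-wise dot product.

    user_chunks: (K, d/K) disentangled user representation v_u
    item_chunks: (K, d/K) fused disentangled item representation v_y
    """
    per_attribute = F.softplus((user_chunks * item_chunks).sum(dim=-1))  # s_{u,i,k}, shape (K,)
    return per_attribute.sum(), per_attribute                            # s_{u,i} and the K scores
```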

3.3.2. Training Protocol

Based on the predicted preference scores above, we recommend a list of top-$n$ ranked items that match the target user’s preferences. The Bayesian Personalized Ranking (BPR) loss function (Rendle et al., 2009) is employed to optimize the model parameters $\bm{\Theta}$,

(12) $\mathcal{L}_{BPR}=\sum_{(u,i_{+},i_{-})\in\mathcal{D}}-\log\phi(s_{u,i_{+}}-s_{u,i_{-}})+\lambda{\lVert\bm{\Theta}\rVert}_{2}^{2},$

where $\lambda$ is the coefficient controlling the $L_{2}$ regularization; $\mathcal{D}$ denotes the training set; $i_{+}$ and $i_{-}$ are the observed and unobserved items in the interaction records of user $u$, respectively. Overall, the total loss of AD-DRL is formulated as

(13) $\mathcal{L}=\mathcal{L}_{BPR}+\alpha\sum_{(u,i)\in\mathcal{D}}l_{intra}+\beta\sum_{(u,i)\in\mathcal{D}}l_{inter}+\gamma\sum_{(u,i)\in\mathcal{D}}l_{low},$

where $\alpha$, $\beta$ and $\gamma$ are the hyperparameters controlling the weights of the three disentanglement modules.
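A compact sketch of the training objective in Equations (12)-(13) follows, assuming $\phi(\cdot)$ is the sigmoid function as in standard BPR; the batching of the disentanglement terms and all names are our assumptions.

```python
import torch
import torch.nn.functional as F

def bpr_loss(s_pos: torch.Tensor, s_neg: torch.Tensor, params, lam: float = 1e-3) -> torch.Tensor:
    """Equation (12): pairwise BPR loss with L2 regularization over the model parameters."""
    ranking = -F.logsigmoid(s_pos - s_neg).sum()     # assumes phi(.) is the sigmoid
    reg = lam * sum(p.pow(2).sum() for p in params)  # lambda * ||Theta||_2^2
    return ranking + reg

def total_loss(l_bpr, l_intra, l_inter, l_low, alpha: float, beta: float, gamma: float):
    """Equation (13): BPR loss plus the weighted disentanglement terms."""
    return l_bpr + alpha * l_intra + beta * l_inter + gamma * l_low
```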

4. Experiments

4.1. Experimental Setup

Table 1. Basic statistics of the three datasets. “#” denotes the number of statistical values.
Dataset #user #item #interaction sparsity #price #popularity #brand #category
Baby 12,637 18,646 121,651 99.95% 5 5 663 1
Toys Games 18,748 30,420 161,653 99.97% 5 5 1,288 19
Sports 21,400 36,224 209,944 99.97% 5 5 2,081 18

4.1.1. Datasets

We use the widely adopted real-world Amazon review dataset (http://jmcauley.ucsd.edu/data/amazon) (McAuley and Leskovec, 2013) for evaluation in our experiments. Apart from user-item interaction data, this dataset also includes multimodal information (i.e., reviews and images) and various attributes (i.e., price, brand, etc.) of the items across 24 product categories. Three product categories from this dataset are used in the evaluation: Baby, Toys Games and Sports. Following (Liu et al., 2023a), for all datasets, unpopular items and inactive users are filtered out to ensure that all items and users have at least 5 interaction records. The basic statistics of the three datasets are shown in Table 1.

In our setting, in addition to the user-item interaction data and multimodal information of items, multiple attributes and their associated attribute values are required to guide the disentangled representation learning process. Specifically, we adopt four typical attributes (i.e., price, popularity, brand and category), where popularity is calculated by counting the number of times each item is purchased. The attribute values for each attribute are compiled from the metadata provided by the Amazon dataset: for brand and category, we directly use the values provided by Amazon; for price and popularity, following the methods used by CoHHN (Zhang et al., 2022), we discretize them into five levels according to their numerical values. Table 1 shows the specific statistics for each attribute in our experiments. It is worth noting that our method can include a wide range of attributes that objectively reflect the characteristics of an item, beyond the four attributes mentioned.
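To illustrate the preprocessing described above, the snippet below bins price and popularity into five levels. Equal-frequency (quantile) binning and the toy values are our assumptions here; the actual boundaries follow the scheme of CoHHN.

```python
import pandas as pd

# items: one row per item, with a numeric price and an interaction-count-based popularity.
items = pd.DataFrame({
    "price": [3.99, 12.5, 25.0, 7.0, 99.0, 40.0],
    "popularity": [2, 15, 120, 7, 300, 48],
})
# Discretize each attribute into five levels (integer labels 0-4).
items["price_level"] = pd.qcut(items["price"], q=5, labels=False, duplicates="drop")
items["popularity_level"] = pd.qcut(items["popularity"], q=5, labels=False, duplicates="drop")
```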

4.1.2. Baselines

We compare our method with state-of-the-art methods, including both CF-based methods (i.e., NeuMF (He et al., 2017), NGCF (Wang et al., 2019) and DGCF (Wang et al., 2020)) and multimodal CF-based methods (JRL (Zhang et al., 2017b), MMGCN (Wei et al., 2019), MAML (Liu et al., 2019), GRCN (Wei et al., 2020), DMRL (Liu et al., 2023a) and BM3 (Zhou et al., 2023)). Each group includes a disentangled representation learning model (i.e., DGCF and DMRL) for comparison. In addition to the methods mentioned above, we also create a variant of our model, denoted as $\text{AD-DRL}_{ID}$, which excludes the utilization of multimodal information, thus facilitating a fair comparison with CF-based models.

4.1.3. Evaluation Metrics and Parameter Settings

For each dataset, we randomly split the interactions from each user into training and testing sets with an 8:2 ratio. From the training set, 10% of interactions are randomly chosen as a validation set for tuning hyperparameters. We evaluate performance on the top-$n$ recommendation task, aiming to recommend the top-$n$ items, using Recall@$n$ and NDCG@$n$ as metrics for accuracy, with $n$ defaulting to 20.
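For completeness, a common definition of these two metrics for a single user is sketched below; this is a generic reference implementation and is not necessarily identical to the evaluation code used in our experiments.

```python
import numpy as np

def recall_ndcg_at_n(ranked_items, ground_truth, n: int = 20):
    """Recall@n and NDCG@n for one user.

    ranked_items: item ids sorted by predicted score (descending)
    ground_truth: set of item ids the user interacted with in the test set
    """
    hits = [1.0 if item in ground_truth else 0.0 for item in ranked_items[:n]]
    recall = sum(hits) / max(len(ground_truth), 1)
    dcg = sum(h / np.log2(idx + 2) for idx, h in enumerate(hits))
    idcg = sum(1.0 / np.log2(idx + 2) for idx in range(min(len(ground_truth), n)))
    ndcg = dcg / idcg if idcg > 0 else 0.0
    return recall, ndcg
```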

The PyTorch toolkit (Paszke et al., 2019) is utilized to implement our models. To ensure fairness, all methods are optimized using the Adam optimizer (Kingma and Ba, 2015), with a default learning rate of 0.0001 and batch size of 1024. We fix the embedding size of each factor to 32 for AD-DRL and its variants on all datasets. The Xavier initializer (Glorot and Bengio, 2010) is used to initialize the model parameters. The $\alpha$, $\beta$, $\gamma$ and $L_{2}$ regularization coefficients are searched in $\{1e^{-3},5e^{-3},1e^{-2},\cdots,5e^{+0},1e^{+1}\}$. The number of negative examples is searched in $\{2,4,8\}$. Besides, model parameters are saved every five epochs. An early stopping strategy (Wang et al., 2019) is applied, i.e., training stops prematurely if Recall@20 does not increase for 50 successive epochs. Our code and datasets are released to facilitate the replication of our experiments (https://github.com/SDLZY/AD-DRL).

4.2. Performance Comparison

Table 2. Performance comparison of different recommendation methods over three datasets. The best results are highlighted in bold.
Datasets Baby Toys Games Sports
Metrics Recall NDCG Recall NDCG Recall NDCG
NeuMF 0.0502 0.0224 0.0253 0.0128 0.0330 0.0157
NGCF 0.0694 0.0313 0.0970 0.0587 0.0707 0.0337
DGCF 0.0788 0.0465 0.1262 0.1085 0.1026 0.0629
$\text{AD-DRL}_{ID}$ 0.0852 0.0529 0.1517 0.1429 0.1120 0.0722
JRL 0.0579 0.0266 0.0472 0.0413 0.0368 0.0214
MMGCN 0.0814 0.0496 0.1171 0.1065 0.0913 0.0572
MAML 0.0867 0.0521 0.1183 0.1117 0.1029 0.0676
GRCN 0.0883 0.0541 0.1336 0.1236 0.1065 0.0693
DMRL 0.0906 0.0561 0.1434 0.1331 0.1111 0.0711
BM3 0.0911 0.0424 0.1147 0.0683 0.1121 0.0536
AD-DRL 0.0968* 0.0588* 0.1524* 0.1435* 0.1200* 0.0756*
Improv. 6.84% 4.81% 6.28% 7.81% 8.01% 6.33%
  • The symbol * denotes that the improvement is significant with $p$-value < 0.05 based on a two-tailed paired t-test.

We summarize the overall performance comparison results in Table 2. The methods in the first block solely use the user-item interactions, while the methods in the second block also exploit textual and visual information along with the user-item interactions. From this table, we can have the following observations:

  • NeuMF, NGCF, DGCF and $\text{AD-DRL}_{ID}$ are methods trained exclusively on user-item interactions. Among them, NeuMF outperforms traditional MF-based methods (He et al., 2017), benefiting from deep neural networks’ ability to model the non-linear interactions between users and items. NGCF and DGCF achieve state-of-the-art results by utilizing high-order information, with DGCF excelling over NGCF by leveraging disentangled representation to capture diverse user intents, showcasing its effectiveness in enhancing the robustness of user and item representations. Furthermore, although both DGCF and $\text{AD-DRL}_{ID}$ employ disentangled representation learning, the performance of $\text{AD-DRL}_{ID}$ is significantly better than that of DGCF. This highlights the superiority of our attribute-driven disentangled representation learning approach.

  • In general, the multimodal recommendation methods perform better than those only using user-item interactions, demonstrating the effectiveness of leveraging multimodal information for learning user and item representations. Although its simple neural network structure results in lower performance than the graph-based methods (NGCF and DGCF), JRL outperforms NeuMF by incorporating rich multimodal information. By exploiting user-item interactions to guide the representation learning in different modalities, MMGCN outperforms NGCF on all the datasets. MAML outperforms NGCF by modeling users’ diverse preferences with the multimodal features of items. GRCN surpasses MAML by employing modality features to discover and prune potential false-positive edges in the user-item interaction graph. DMRL captures the different contributions of features from diverse modalities for each disentangled factor, achieving superior results.

  • Furthermore, our proposed AD-DRL model consistently outperforms all baselines over all three datasets by a large margin. We credit this to the joint effects of two aspects. Firstly, robust user and item representations can be derived by using attributes at different levels to guide the disentangled representation learning process. Secondly, incorporating the additional attribute information into representation learning alleviates the data sparsity problem in recommendation.

4.3. Visualization of Disentangled Representations

(a) Disentangled vectors (User ID)
(b) Disentangled vectors (Item ID)
(c) Disentangled vectors (Review)
(d) Disentangled vectors (Image)
Figure 2. Visualization of disentangled vectors from ID embeddings and different modalities, with distinct colors indicating different attributes: yellow for price, orange for popularity, grey for brand, dark blue for category, and light blue for others.

AD-DRL conducts disentangled representation learning for user and item representations at two levels of attribute granularity: high and low. To further confirm the effectiveness of disentangled representation learning by AD-DRL, we use t-SNE (van der Maaten and Hinton, 2008) to cluster and visualize the disentangled vectors of users and items from the Sports dataset at each granularity level.

4.3.1. High-Level Attribute-driven Disentanglement

Figure 2 displays the disentangled vectors processed by the high-level attribute-driven disentanglement module for each modality feature, where dots of the same color denote the vectors corresponding to the same attribute. It can be observed that our model effectively distinguishes representation vectors corresponding to different attributes for both users and items across various modalities, demonstrating the effectiveness of our model in disentangling at the attribute level. Such disentangled representations can better capture user preferences and item characteristics toward different attributes.

(a) Disentangled vectors (price)
(b) Disentangled vectors (popularity)
(c) Disentangled vectors (brand)
(d) Disentangled vectors (category)
Figure 3. Visualization of the disentangled vectors corresponding to different attributes, where different colors represent different attribute values.

4.3.2. Low-Level Attribute-driven Disentanglement

To verify the effectiveness of low-level (attribute value-level) disentanglement, we visualize the representations of each attribute factor (i.e., $\bm{v}_{y}^{k}$ in Equation 9) in Figure 3. The dots of different colors represent different attribute values; for brand and category, there are too many attribute values in the dataset to display all of them in Figure 3, so for these two attributes we only show the top 5 attribute values with the most corresponding items. As observed, the representations of different attribute values learned by AD-DRL are well separated and the representations of the same attribute value are concentrated. These results demonstrate that AD-DRL can achieve finer-grained disentanglement at the attribute-value level, allowing it to better capture user preferences.

4.4. Interpretability Study

Figure 4. The preference scores of two users (u18629 and u6805) for different attributes of two items (i2099 and i3460).

To gain deeper insight into the interpretability of our model, in this section we provide some qualitative examples in Figure 4. As an illustrative example, we randomly sample two users (u18629 and u6805) from the Sports dataset who purchased the same items (i2099 and i3460). Since we have disentangled the representations of users and items based on the attributes, we can analyze how each attribute contributes to the recommendation results based on Equation 11, which greatly enhances the interpretability of our model. For example, price contributes most to the interaction (u18629, i2099), suggesting that u18629 prefers i2099 due to its price. However, u6805 is more likely to purchase i2099 because of her preference for its brand. Such an observation indicates that different users have diverse preferences for the same product. In addition, we can also see that user preferences exhibit consistency: for example, u6805 values the brand of a product more than its popularity or price, which differs from u18629.

4.5. Controllability Study

(a) Users who prefer low-price level items.
(b) Users who prefer high-price level items.
(c) Users who prefer low-popularity level items.
(d) Users who prefer high-popularity level items.
Figure 5. The proportion of items with different attribute values within AD-DRL’s recommendations when $\xi$ takes different values in Equation 14.

In this section, we evaluate the controllability of AD-DRL by manipulating user preferences for certain attributes. Specifically, we change user $u$’s preference score for a specific attribute $a$ in Equation 11 to assess whether such a change yields the anticipated variation in the recommendation results. We modify the formula for $s_{u,i}$ in Equation 11 as follows:

(14) $s_{u,i}=\xi\cdot s_{u,i,a}+\sum_{k\neq a}s_{u,i,k},$

where $\xi$ represents the scaling factor for $s_{u,i,a}$, used to adjust the impact of attribute $a$ on $s_{u,i}$. For a specific attribute (in this section, we select price and popularity for study), we systematically vary $\xi$ through values of 2, 1, 0.5, 0, and -1. This process allows us to examine the resultant shifts in our method’s recommendation outputs, with findings illustrated in Figure 5.
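The manipulation in Equation (14) only rescales one of the per-attribute scores before summation, for instance as in the sketch below (names are illustrative).

```python
import torch

def controlled_score(per_attribute_scores: torch.Tensor, attr_index: int, xi: float) -> torch.Tensor:
    """Equation (14): rescale the contribution of attribute a before summing the K scores.

    per_attribute_scores: (K,) tensor of s_{u,i,k} from Equation (11)
    attr_index:           index of the manipulated attribute a
    xi:                   scaling factor, e.g. one of {2, 1, 0.5, 0, -1}
    """
    weights = torch.ones_like(per_attribute_scores)
    weights[attr_index] = xi
    return (weights * per_attribute_scores).sum()
```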

We conduct a detailed analysis using the price attribute as an example. As shown in Figure 5a, we select 100 users from the Baby dataset who, according to their purchase records, always purchase inexpensive items. The various colors in Figure 5a denote distinct price levels, showcasing the distribution of items recommended by AD-DRL across these levels for a given value of $\xi$. By comparing the recommendations made by AD-DRL under different values of $\xi$, we have the following observations:

  • When $\xi=1$, that is, using the original AD-DRL method, it can be seen that AD-DRL is capable of capturing the preference of these users for inexpensive items and recommends more affordable items to them.

  • When $\xi=2$, meaning we have increased the impact of price on the user preference score, it can be observed that AD-DRL tends to recommend more inexpensive items to these users.

  • When $\xi=0.5$ or 0, meaning we reduce or even eliminate the impact of price on the user preference score, we find that the items recommended by AD-DRL are more diverse in terms of price. It is worth noting that even when $\xi=0$, the items recommended by AD-DRL still tend to be inexpensive. This could be attributed to the retained chunk associated with the brand, since brand and price are correlated: some high-end brands offer products at a premium price, while others provide more affordable options.

  • More interestingly, when $\xi=-1$, meaning we have AD-DRL recommend items that are opposite to the users’ price preferences, it can be observed that AD-DRL indeed recommends some more expensive items.

The above results reveal that adjusting the value of $\xi$ enables us to align AD-DRL’s recommendation results with our expectations, thereby illustrating the controllability of AD-DRL. Similar results can be observed for another group of users who prefer high-price level items.

4.6. Ablation Study

Table 3. Ablation study of our proposed AD-DRL method over three datasets. The best results are highlighted in bold.
Datasets Baby Toys Games Sports
Metrics Recall NDCG Recall NDCG Recall NDCG
$\text{AD-DRL}_{w/o\ disentangling}$ 0.0888 0.0550 0.1452 0.1375 0.1152 0.0741
$\text{AD-DRL}_{w/o\ intra}$ 0.0925 0.0567 0.1492 0.1397 0.1164 0.0750
$\text{AD-DRL}_{w/o\ inter}$ 0.0959 0.0585 0.1517 0.1429 0.1196 0.0754
$\text{AD-DRL}_{w/o\ high}$ 0.0903 0.0564 0.1492 0.1391 0.1160 0.0748
$\text{AD-DRL}_{w/o\ low}$ 0.0916 0.0553 0.1492 0.1394 0.1177 0.0755
AD-DRL 0.0968 0.0588 0.1524 0.1435 0.1200 0.0756

To validate the effects of the key components in AD-DRL, we set up the following model variants: 1) $\text{AD-DRL}_{w/o\ disentangling}$: we do not perform disentangled representation learning during the training process; 2) $\text{AD-DRL}_{w/o\ intra}$: we only remove the intra-modality disentanglement module; 3) $\text{AD-DRL}_{w/o\ inter}$: we only remove the inter-modality disentanglement module; 4) $\text{AD-DRL}_{w/o\ high}$: we remove the high-level attribute-driven disentangled representation learning module; 5) $\text{AD-DRL}_{w/o\ low}$: we remove the low-level attribute-driven disentangled representation learning module.

Table 3 shows the experimental results for all variants. From the results, we have the following observations:

  • Using either the high-level or the low-level disentangled representation learning module independently can significantly improve performance. This indicates that disentangled representation learning at both the attribute level and the attribute-value level has a positive effect.

  • AD-DRL combines the high-level and low-level disentangled representation learning modules, resulting in significantly improved performance compared to the two variants above. This demonstrates the necessity and rationality of combining the attribute and attribute-value levels.

  • Delving deeper into the high-level disentangled representation learning module, removing the intra-modality disentanglement module results in a greater decrease in model performance compared to removing the inter-modality disentanglement module. This indicates that intra-modality disentanglement is the fundamental aspect of disentangled representation learning, while inter-modality disentanglement can further enhance model performance.

5. Conclusion

In this paper, we highlight the limitations of existing disentangled representation learning techniques in recommender systems. The current methods disentangle the underlying factors behind user-item interactions in an unsupervised manner, leading to limited interpretability and controllability. To overcome this limitation, we propose an attribute-driven disentangled representation learning method, termed AD-DRL, which disentangles attribute factors in various multimodal features. More specifically, the proposed AD-DRL enables disentangling within each modality feature and ensures consistency of representations across modalities at the attribute level. Moreover, it leverages the intrinsic relationships between items sharing the same attribute value. Experimental results demonstrate the superiority of our proposed AD-DRL and showcase its capability in terms of interpretability and controllability.

References

  • Chen et al. (2019) Xu Chen, Hanxiong Chen, Hongteng Xu, Yongfeng Zhang, Yixin Cao, Zheng Qin, and Hongyuan Zha. 2019. Personalized Fashion Recommendation with Visual Explanations based on Multimodal Attention Network: Towards Visually Explainable Recommendation. In International Conference on Research and Development in Information Retrieval. ACM, 765–774.
  • Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In North American Chapter of the Association for Computational Linguistics. ACL, 4171–4186.
  • Dosovitskiy et al. (2021) Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. 2021. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In ICLR. OpenReview.net.
  • Du et al. (2023) Yali Du, Yinwei Wei, Wei Ji, Fan Liu, Xin Luo, and Liqiang Nie. 2023. Multi-queue Momentum Contrast for Microvideo-Product Retrieval. In Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining. ACM, 1003–1011.
  • Ebesu et al. (2018) Travis Ebesu, Bin Shen, and Yi Fang. 2018. Collaborative Memory Network for Recommendation Systems. In International Conference on Research and Development in Information Retrieval. ACM, 515–524.
  • Ge et al. (2021) Yunhao Ge, Sami Abu-El-Haija, Gan Xin, and Laurent Itti. 2021. Zero-shot Synthesis with Group-Supervised Learning. In International Conference on Learning Representations.
  • Glorot and Bengio (2010) Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In International Conference on Artificial Intelligence and Statistics, Vol. 9. JMLR, 249–256.
  • Guo et al. (2024) Dan Guo, Kun Li, Bin Hu, Yan Zhang, and Meng Wang. 2024. Benchmarking micro-action recognition: dataset, methods, and applications. IEEE Transactions on Circuits and Systems for Video Technology 34, 7 (2024), 6238–6252.
  • Hamilton et al. (2017) William L. Hamilton, Zhitao Ying, and Jure Leskovec. 2017. Inductive Representation Learning on Large Graphs. In Advances in Neural Information Processing Systems. 1024–1034.
  • He and McAuley (2016) Ruining He and Julian J. McAuley. 2016. VBPR: Visual Bayesian Personalized Ranking from Implicit Feedback. In AAAI Conference on Artificial Intelligence, Dale Schuurmans and Michael P. Wellman (Eds.). AAAI, 144–150.
  • He et al. (2020) Xiangnan He, Kuan Deng, Xiang Wang, Yan Li, Yong-Dong Zhang, and Meng Wang. 2020. LightGCN: Simplifying and Powering Graph Convolution Network for Recommendation. In International Conference on Research and Development in Information Retrieval. ACM, 639–648.
  • He et al. (2017) Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. 2017. Neural Collaborative Filtering. In International Conference on World Wide Web. ACM, 173–182.
  • Higgins et al. (2017) Irina Higgins, Loïc Matthey, Arka Pal, Christopher P. Burgess, Xavier Glorot, Matthew M. Botvinick, Shakir Mohamed, and Alexander Lerchner. 2017. beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework. In International Conference on Learning Representations.
  • Hsieh et al. (2017) Cheng-Kang Hsieh, Longqi Yang, Yin Cui, Tsung-Yi Lin, Serge J. Belongie, and Deborah Estrin. 2017. Collaborative Metric Learning. In International Conference on World Wide Web. ACM, 193–201.
  • Hsieh et al. (2018) Jun-Ting Hsieh, Bingbin Liu, De-An Huang, Li Fei-Fei, and Juan Carlos Niebles. 2018. Learning to Decompose and Disentangle Representations for Video Prediction. In Advances in Neural Information Processing Systems. 515–524.
  • Kalantidis et al. (2013) Yannis Kalantidis, Lyndon Kennedy, and Li-Jia Li. 2013. Getting the look: clothing recognition and segmentation for automatic product suggestions in everyday photos. In ICMR. ACM, 105–112.
  • Kingma and Ba (2015) Diederik P. Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In International Conference on Learning Representations.
  • Kingma and Welling (2014) Diederik P. Kingma and Max Welling. 2014. Auto-Encoding Variational Bayes. In International Conference on Learning Representations.
  • Li et al. (2023a) Zhenyang Li, Yangyang Guo, Kejie Wang, Xiaolin Chen, Liqiang Nie, and Mohan Kankanhalli. 2023a. Do Vision-Language Transformers Exhibit Visual Commonsense? An Empirical Study of VCR. In International Conference on Multimedia. ACM, 5634–5644.
  • Li et al. (2024) Zhenyang Li, Yangyang Guo, Kejie Wang, Fan Liu, Liqiang Nie, and Mohan S. Kankanhalli. 2024. Learning to Agree on Vision Attention for Visual Commonsense Reasoning. IEEE Transactions on Multimedia 26 (2024), 1065–1075.
  • Li et al. (2023b) Zhenyang Li, Yangyang Guo, Kejie Wang, Yinwei Wei, Liqiang Nie, and Mohan S. Kankanhalli. 2023b. Joint Answering and Explanation for Visual Commonsense Reasoning. IEEE Transactions on Image Processing 32 (2023), 3836–3846.
  • Liu et al. (2023a) Fan Liu, Huilin Chen, Zhiyong Cheng, Anan Liu, Liqiang Nie, and Mohan Kankanhalli. 2023a. Disentangled Multimodal Representation Learning for Recommendation. IEEE Transactions on Multimedia 25, 11 (2023), 7149–7159.
  • Liu et al. (2023b) Fan Liu, Huilin Chen, Zhiyong Cheng, Liqiang Nie, and Mohan Kankanhalli. 2023b. Semantic-Guided Feature Distillation for Multimodal Recommendation. In Proceedings of the 31st ACM International Conference on Multimedia. ACM, 6567–6575.
  • Liu et al. (2019) Fan Liu, Zhiyong Cheng, Changchang Sun, Yinglong Wang, Liqiang Nie, and Mohan S. Kankanhalli. 2019. User Diverse Preference Modeling by Multimodal Attentive Metric Learning. In International Conference on Multimedia. ACM, 1526–1534.
  • Liu et al. (2021) Fan Liu, Zhiyong Cheng, Lei Zhu, Zan Gao, and Liqiang Nie. 2021. Interest-Aware Message-Passing GCN for Recommendation. In Proceedings of the Web Conference 2021. ACM, 1296–1305.
  • Liu et al. (2023c) Han Liu, Yinwei Wei, Fan Liu, Wenjie Wang, Liqiang Nie, and Tat-Seng Chua. 2023c. Dynamic Multimodal Fusion via Meta-Learning Towards Micro-Video Recommendation. ACM Transactions on Information Systems 42, 2, Article 47 (2023), 26 pages.
  • Liu et al. (2020) Yanbei Liu, Xiao Wang, Shu Wu, and Zhitao Xiao. 2020. Independence Promoted Graph Disentangled Networks. In AAAI Conference on Artificial Intelligence. AAAI, 4916–4923.
  • Ma et al. (2019) Jianxin Ma, Chang Zhou, Peng Cui, Hongxia Yang, and Wenwu Zhu. 2019. Learning Disentangled Representations for Recommendation. In Advances in Neural Information Processing Systems. 5712–5723.
  • McAuley and Leskovec (2013) Julian J. McAuley and Jure Leskovec. 2013. Hidden factors and hidden topics: understanding rating dimensions with review text. In ACM Conference on Recommender Systems. ACM, 165–172.
  • Mu et al. (2022) Shanlei Mu, Yaliang Li, Wayne Xin Zhao, Siqing Li, and Ji-Rong Wen. 2022. Knowledge-Guided Disentangled Representation Learning for Recommender Systems. ACM Transactions on Information Systems 40, 1 (2022), 6:1–6:26.
  • Nie et al. (2016) Liqiang Nie, Xuemeng Song, and Tat Seng Chua. 2016. Learning from Multiple Social Networks. Synthesis Lectures on Information Concepts Retrieval & Services (2016), 1–118.
  • Paszke et al. (2019) Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems. 8024–8035.
  • Rendle (2010) Steffen Rendle. 2010. Factorization Machines. In International Conference on Data Mining. IEEE, 995–1000.
  • Rendle et al. (2009) Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. 2009. BPR: Bayesian Personalized Ranking from Implicit Feedback. In Conference on Uncertainty in Artificial Intelligence. ACM, 452–461.
  • Sarwar et al. (2001) Badrul Munir Sarwar, George Karypis, Joseph A. Konstan, and John Riedl. 2001. Item-based collaborative filtering recommendation algorithms. In International World Wide Web Conference. ACM, 285–295.
  • Tan et al. (2016) Yunzhi Tan, Min Zhang, Yiqun Liu, and Shaoping Ma. 2016. Rating-boosted latent topics: Understanding users and items with ratings and reviews. In IJCAI. AAAI Press.
  • Tay et al. (2018) Yi Tay, Luu Anh Tuan, and Siu Cheung Hui. 2018. Latent Relational Metric Learning via Memory-based Attention for Collaborative Ranking. In International Conference on World Wide Web. ACM, 729–739.
  • Tran and Lauw (2022) Nhu-Thuat Tran and Hady W. Lauw. 2022. Aligning Dual Disentangled User Representations from Ratings and Textual Content. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (Washington DC, USA) (KDD ’22). ACM, 1798–1806.
  • van der Maaten and Hinton (2008) Laurens van der Maaten and Geoffrey E. Hinton. 2008. Visualizing Data using t-SNE. Journal of Machine Learning Research 9 (2008), 2579–2605.
  • Wang et al. (2023) Xin Wang, Hong Chen, Yuwei Zhou, Jianxin Ma, and Wenwu Zhu. 2023. Disentangled Representation Learning for Recommendation. IEEE Transactions on Pattern Analysis and Machine Intelligence 45, 1 (2023), 408–424.
  • Wang et al. (2019) Xiang Wang, Xiangnan He, Meng Wang, Fuli Feng, and Tat-Seng Chua. 2019. Neural Graph Collaborative Filtering. In International Conference on Research and Development in Information Retrieval. ACM, 165–174.
  • Wang et al. (2020) Xiang Wang, Hongye Jin, An Zhang, Xiangnan He, Tong Xu, and Tat-Seng Chua. 2020. Disentangled Graph Collaborative Filtering. In International Conference on Research and Development in Information Retrieval. ACM, 1001–1010.
  • Wei et al. (2023) Yinwei Wei, Wenqi Liu, Fan Liu, Xiang Wang, Liqiang Nie, and Tat-Seng Chua. 2023. LightGT: A Light Graph Transformer for Multimedia Recommendation. In International Conference on Research and Development in Information Retrieval. ACM, 1508–1517.
  • Wei et al. (2020) Yinwei Wei, Xiang Wang, Liqiang Nie, Xiangnan He, and Tat-Seng Chua. 2020. Graph-Refined Convolutional Network for Multimedia Recommendation with Implicit Feedback. In International Conference on Multimedia. ACM, 3541–3549.
  • Wei et al. (2019) Yinwei Wei, Xiang Wang, Liqiang Nie, Xiangnan He, Richang Hong, and Tat-Seng Chua. 2019. MMGCN: Multi-modal Graph Convolution Network for Personalized Recommendation of Micro-video. In International Conference on Multimedia. ACM, 1437–1445.
  • Xu et al. (2015) Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio. 2015. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. In International Conference on Machine Learning, Vol. 37. JMLR, 2048–2057.
  • Zhang et al. (2016) Fuzheng Zhang, Nicholas Jing Yuan, Defu Lian, Xing Xie, and Wei-Ying Ma. 2016. Collaborative Knowledge Base Embedding for Recommender Systems. In International Conference on Knowledge Discovery and Data Mining. ACM, 353–362.
  • Zhang et al. (2022) Xiaokun Zhang, Bo Xu, Liang Yang, Chenliang Li, Fenglong Ma, Haifeng Liu, and Hongfei Lin. 2022. Price DOES Matter! Modeling Price and Interest Preferences in Session-Based Recommendation. In International Conference on Research and Development in Information Retrieval. ACM, 1684–1693.
  • Zhang et al. (2017a) Yongfeng Zhang, Qingyao Ai, Xu Chen, and W Bruce Croft. 2017a. Joint representation learning for top-n recommendation with heterogeneous information sources. In CIKM. ACM, 1449–1458.
  • Zhang et al. (2017b) Yongfeng Zhang, Qingyao Ai, Xu Chen, and W. Bruce Croft. 2017b. Joint Representation Learning for Top-N Recommendation with Heterogeneous Information Sources. In International Conference on Information and Knowledge Management. ACM, 1449–1458.
  • Zhou et al. (2023) Xin Zhou, Hongyu Zhou, Yong Liu, Zhiwei Zeng, Chunyan Miao, Pengwei Wang, Yuan You, and Feijun Jiang. 2023. Bootstrap Latent Representations for Multi-modal Recommendation. In International World Wide Web Conference. ACM, 845–854.