
DualVAE: Dual Disentangled Variational AutoEncoder for Recommendation

Zhiqiang Guo, School of Computer Science and Technology, Huazhong University of Science and Technology, {zhiqiangguo, jianjunli}@hust.edu.cn;    Guohui Li, School of Software Engineering, Huazhong University of Science and Technology, [email protected] (corresponding author);    Jianjun Li;    Chaoyang Wang, Wuhan Digital Engineering Institute, [email protected];    Si Shi, Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), [email protected]
Abstract

Learning precise representations of users and items to fit observed interaction data is the fundamental task of collaborative filtering. Existing studies usually infer entangled representations to fit such interaction data, neglecting to model the diverse matching relationships between users and items behind their interactions, which leads to limited performance and weak interpretability. To address this problem, we propose a Dual Disentangled Variational AutoEncoder (DualVAE) for collaborative recommendation, which combines disentangled representation learning with variational inference to facilitate the generation of implicit interaction data. Specifically, we first implement the disentangling concept by unifying an attention-aware dual disentanglement with a disentangled variational autoencoder to infer the disentangled latent representations of users and items. Further, to encourage the correspondence and independence of the disentangled representations of users and items, we design a neighborhood-enhanced representation constraint with a customized contrastive mechanism to improve the representation quality. Extensive experiments on three real-world benchmarks show that our proposed model significantly outperforms several recent state-of-the-art baselines. Further empirical results also illustrate the interpretability of the disentangled representations learned by DualVAE.

1 Introduction

Figure 1: Illustration of (a) user-side VAE; (b) disentangled user-side VAE; and (c) explicit/implicit matching of multi-aspect features between users and items in a movie recommendation scenario. Words with different colors indicate different aspects.

The powerful ability of variational autoencoders (VAEs) [13] to account for uncertainty in the latent space has sparked a surge of interest in integrating variational inference into recommender systems [27, 1, 19]. Traditional VAE-based CF methods [15, 23, 17, 4] generally infer users’ latent variables to reconstruct their interaction vectors, as shown in Figure 1 (a). These methods do not seek to learn deterministic user representations, but rather learn distributions over these representations, making them more suitable for sparse interaction data. However, due to the neglect of the complex entanglement of multiple potential factors behind user decisions, they are insufficient in learning robust and interpretable models.

Recently, disentangled representation learning has received much attention in recommender systems, aiming at improving the representation quality by disentangling latent factors that govern user behaviors [16, 25, 29, 28]. Most existing disentangled VAE-based CF methods [16, 29, 24] are typically designed based only on user-side interaction vectors to infer diverse user preference representations, while items are still entangled in the user vector space. As illustrated in Figure 1 (b), such a user-specific decoupling scheme usually ignores the diverse matching relationships between user preferences and item attributions, which makes the model performance still not satisfactory.

In realistic recommendation scenarios, items generally have multi-aspect attributes, and users correspondingly have varying degrees of preference for the different attributes of items. The preference of a user for an item can thus be regarded as the agglomeration of affinities between user preferences and item characteristics under different aspects. Take Figure 1 (c) as an example: the movie Item1 has multi-aspect attribute features, including the Sci-Fi Genre, the Producer Marvel Comics, and the Role Iron-Man. User1's explicit preferences span multiple aspects of items, including liking the Sci-Fi Genre, preferring the Marvel Comics producer, and liking Spider-Man. Given the strong match between User1's preferences and Item1's attributes across multiple aspects, there is a high probability that Item1 should be recommended to User1. Predictably, coupled user and item representations cannot capture such fine-grained correspondence to boost recommendation. Hence, decoupling and inferring the multi-aspect features of items and users is necessary to enhance the modeling capability of VAE-based CF models, thereby improving their performance and interpretability. However, disentangled representation learning in such a scenario is non-trivial, as it faces the following two challenges:

  • Firstly, explicit user preference and item attribute information generally cannot be directly obtained in CF. We can only infer implicit multi-aspect representations of users and items from the available user-item interaction data, as illustrated in Figure 1 (c). However, conventional binary interaction data typically reflect coupled interaction relationships between users and items. How to infer implicit decoupled multi-aspect representations of users and items from observed binary interaction data is thus the main challenge.

  • Secondly, unconstrained inference of decoupled representations of users and items cannot guarantee independence among multi-aspect representations. Meanwhile, the learned aspect-level user preference may deviate from the corresponding aspect-level item representation, resulting in incorrect matching of multi-aspect representations between users and items. Hence, how to maintain independence among representations of different aspects for each user and item so that they do not deviate from the corresponding aspects is another tough challenge.

In this paper, we propose a novel Dual Disentangled Variational AutoEncoder (DualVAE) for collaborative recommendation with implicit feedback, which infers both user-side and item-side disentangled representations. Specifically, to address the first challenge, we transform the traditional VAE paradigm by unifying an attention-aware dual disentanglement module with a disentangled variational inference module and a joint generative module to infer multi-aspect latent representations for users and items. Further, we develop a neighborhood-enhanced representation constraint module to address the second challenge by employing self-supervised contrastive learning with neighborhood-based positive samples and two-level negative samples. Compared to standard VAE, DualVAE can capture multi-aspect uncertainty on both the user-side and the item-side, which helps improve its robustness and performance. Extensive experiments on three real-world benchmarks demonstrate the effectiveness of the proposed DualVAE model. Moreover, we also empirically show that the learned disentangled representations provide good support for explaining user behaviors.

2 Related work

2.1 Latent Representation Learning for CF

To fit dyadic interaction data, much of the literature on collaborative filtering [7] focuses on latent representation learning. Matrix factorization (MF) [14] is a typical CF framework that directly embeds the IDs of users and items into a latent feature space. Later studies [21, 26, 10, 30] model both users and items with deep neural networks, which can be viewed as nonlinear extensions of MF. For instance, NeuMF [10] treats recommendation with implicit feedback as a binary classification problem and fuses Generalized Matrix Factorization (GMF) with an MLP. JCA [30] employs two separate autoencoders to simultaneously learn user-user and item-item correlations. To further account for uncertainty in the latent space, some researchers introduce VAEs [13] into collaborative filtering to improve model robustness [11, 22, 19, 6, 17, 4]. For instance, MultiVAE [15] extends VAEs to capture the latent variables of users with implicit feedback and estimates parameters via Bayesian inference. BiVAE [23] infers the latent representations of users and items through bilateral VAEs. Despite their achievements, these methods mostly neglect the entanglement of latent factors behind user behaviors, leading to weakly interpretable results.

2.2 Disentangled Representation Learning for CF

Due to its robust performance and interpretability, disentangled representation learning [2] is gradually being introduced into VAE-based recommender systems. Most of these methods [16, 29, 24] decouple users' diverse preferences based on their collaborative interactions. For example, MacridVAE [16] introduces categorical distributions to disentangle macro-level and micro-level factors over different items via a variational autoencoder. Zheng et al. [29] propose to structurally disentangle user interest and conformity by training with cause-specific data. Wang et al. [24] utilize structural causal models to generate causal representations that describe the causal relationships between latent factors. Nevertheless, these methods only unilaterally disentangle coarse-grained user preferences while ignoring the modeling of relationships between user preferences and item features across multiple aspects, and hence are suboptimal for fitting the two-way nature of binary interaction data.

3 Methodology

3.1 Notations and Problem Formulation

We consider an implicit collaborative recommendation scenario consisting of a user set $\mathcal{U}$ with $m$ users and an item set $\mathcal{I}$ with $n$ items. The interaction data is denoted as $\mathbf{R}\in\mathbb{R}^{m\times n}$, where $r_{u,i}=1$ indicates an observed interaction of user $u$ with item $i$. We use $\mathbf{r}_{u*}\in\mathbb{R}^{1\times n}$ to denote $u$'s interaction vector, i.e., the $u$-th row of $\mathbf{R}$, and $\mathbf{r}_{*i}\in\mathbb{R}^{m\times 1}$ to denote the $i$-th column of $\mathbf{R}$ for item $i$. In traditional VAE-based CF models, the latent variables of each user and item are the entangled vectors $\mathbf{z}_u$ and $\mathbf{z}_i$, respectively. Considering that an item in general has multi-aspect features, we denote the item latent representation as $\mathbf{z}_i=[\mathbf{z}_i^1;\mathbf{z}_i^2;\dots;\mathbf{z}_i^A]\in\mathbb{R}^{1\times Ad}$, where $\mathbf{z}_i^a\in\mathbb{R}^d$ is $i$'s representation for the $a$-th aspect, $A$ is the number of aspects, and $d$ is the embedding dimension. Similarly, user $u$'s latent representation is denoted as $\mathbf{z}_u=[\mathbf{z}_u^1;\mathbf{z}_u^2;\dots;\mathbf{z}_u^A]$, where $\mathbf{z}_u^a\in\mathbb{R}^d$ reflects $u$'s preference under the $a$-th aspect. Our objective is to learn a robust VAE-based CF model that disentangles and infers multi-aspect latent representations of users and items from user-item interactions. After training, we exploit the learned multi-aspect representations to perform top-$N$ recommendation.
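To make the notation concrete, the following is a minimal PyTorch sketch of the shapes involved; the toy sizes and random data are purely illustrative and not part of the model.

```python
# Toy illustration of the notation; sizes and data are ours, not the paper's.
import torch

m, n, A, d = 4, 6, 2, 3                        # users, items, aspects, embedding dim
R = torch.bernoulli(torch.full((m, n), 0.3))   # binary interactions R (m x n)

r_u = R[0, :]   # r_{u*}: user u's interaction vector (the u-th row, length n)
r_i = R[:, 0]   # r_{*i}: item i's interaction vector (the i-th column, length m)

# Multi-aspect latent representations: A aspect-level d-dim chunks per user/item,
# flattened into a 1 x Ad vector z = [z^1; ...; z^A].
z_u = torch.randn(m, A, d)
z_i = torch.randn(n, A, d)
assert z_u.reshape(m, A * d).shape == (m, A * d)
```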

3.2 Overview

Traditional VAE-based CF models generally define a user-side generative model that generates the observed data from the following distribution,

(3.1) $p_{\theta}(\mathbf{r}_{u*})=\int p_{\theta}(\mathbf{r}_{u*}|\mathbf{z}_{u})\,p(\mathbf{z}_{u})\,d\mathbf{z}_{u},$

where $p_{\theta}(\mathbf{r}_{u*}|\mathbf{z}_{u})=\prod_{r_{u,i}\in\mathbf{r}_{u*}}p_{\theta}(r_{u,i}|\mathbf{z}_{u})$ is a probability distribution over the $n$ items learned by a generative model with parameters $\theta$. Each observed interaction $r_{u,i}\in\mathbf{r}_{u*}$ is independently generated from the inferred user latent variable $\mathbf{z}_u$, and $p(\mathbf{z}_u)$ is the prior of the user variable $\mathbf{z}_u$. Different from this one-way generative paradigm, we propose a dual disentangled generative model that generates the observed interaction while encouraging multi-aspect disentanglement of both users and items,

(3.2) $p_{\theta}(r_{u,i})=\int p_{\theta}(r_{u,i}|\mathbf{z}_{u},\mathbf{z}_{i},\mathbf{C},\mathbf{P})\,p(\mathbf{z}_{u})\,p(\mathbf{z}_{i})\,d\mathbf{z}_{u}\,d\mathbf{z}_{i},$

where the generation of $r_{u,i}$ depends on the two latent variables $\mathbf{z}_u$ and $\mathbf{z}_i$. $\mathbf{C}=\{\mathbf{c}_i\}_{i=1}^{n}$ and $\mathbf{P}=\{\mathbf{p}_u\}_{u=1}^{m}$ are the item-aspect and user-aspect probability matrices, in which $\mathbf{c}_i\in\mathbb{R}^{1\times A}$ and $\mathbf{p}_u\in\mathbb{R}^{1\times A}$ are probability vectors drawn from the two aspect distributions $p(\mathbf{c}_i)$ and $p(\mathbf{p}_u)$, and $A$ is the number of aspects. We assume that $p(\mathbf{z}_u)=p(\mathbf{z}_u|\mathbf{C})$ and $p(\mathbf{z}_i)=p(\mathbf{z}_i|\mathbf{P})$, i.e., $\mathbf{z}_u$, $\mathbf{z}_i$, $\mathbf{C}$ and $\mathbf{P}$ can be generated independently.

Our proposed DualVAE implements the above distributions ($p(\mathbf{c}_i)$, $p(\mathbf{p}_u)$, $p(\mathbf{z}_u)$, $p(\mathbf{z}_i)$, and $p_{\theta}(r_{u,i}|\mathbf{z}_u,\mathbf{z}_i,\mathbf{C},\mathbf{P})$) with four modules: 1) an attention-aware dual disentanglement (ADD) module that generates the aspect-aware probability distributions $p(\mathbf{p}_u)$ and $p(\mathbf{c}_i)$ of users and items; 2) a disentangled variational inference (DVI) module that infers the posterior $p(\mathbf{z}_{1:|\mathcal{U}|},\mathbf{z}_{1:|\mathcal{I}|}|\mathbf{R})$ over the disentangled latent variables $\mathbf{z}_u$ and $\mathbf{z}_i$ of users and items; 3) a joint generation (JG) module that utilizes the inferred user and item latent representations and the aspect-aware probability matrices to reconstruct the observed interactions, thereby realizing $p_{\theta}(r_{u,i}|\mathbf{z}_u,\mathbf{z}_i,\mathbf{C},\mathbf{P})$; and 4) a neighborhood-enhanced representation constraint (NRC) module that maintains the correspondence and independence of the latent variables via contrastive learning. Next, we detail the implementation of each module.

3.3 Attention-aware Dual Disentanglement

The ADD module is designed to generate the item-aspect and user-aspect probability vectors, i.e., to assign each item and user a probability distribution over aspects.

Item-side aspect disentanglement.

Considering that items usually relate to different aspects to different degrees, we assign an item-aspect probability matrix $\mathbf{C}=\{\mathbf{c}_i\}_{i=1}^{n}$ over all items in $\mathcal{I}$, where $\mathbf{c}_i$ is the aspect probability vector of item $i$. To generate the matrix $\mathbf{C}$, we assume $p(\mathbf{C})=\prod_{i=1}^{n}p(\mathbf{c}_i)$ and adopt a prototype-based attention mechanism to independently infer each aspect probability vector $\mathbf{c}_i$. Specifically, we introduce $A$ aspect prototypes $\{\mathbf{h}_a\}_{a=1}^{A}$ ($\mathbf{h}_a\in\mathbb{R}^d$) shared among items, and obtain the vector $\mathbf{c}_i$ from the item-aspect distribution $p(\mathbf{c}_i)$,

(3.3) $\mathbf{c}_i=[c_i^1;c_i^2;\dots;c_i^A];\quad c_i^a=\frac{\exp(s(\mathbf{z}_i^a,\mathbf{h}_a))}{\sum_{a'=1}^{A}\exp(s(\mathbf{z}_i^{a'},\mathbf{h}_{a'}))},$

where $\mathbf{z}_i^a$ is the $a$-th aspect-level latent representation of item $i$, $c_i^a\in\mathbb{R}_+$ is a probability reflecting the relation between item $i$ and aspect $a$, and $s(\cdot)$ is an affinity function (e.g., cosine similarity) used to calculate the score of item $i$ under aspect $a$.

User-side aspect disentanglement.

Similarly, we define $A$ preference prototypes $\{\mathbf{m}_a\}_{a=1}^{A}$ ($\mathbf{m}_a\in\mathbb{R}^d$) to calculate the user-aspect probability vector $\mathbf{p}_u$ from the user-aspect distribution $p(\mathbf{p}_u)$,

(3.4) $\mathbf{p}_u=[p_u^1;p_u^2;\dots;p_u^A];\quad p_u^a=\frac{\exp(s(\mathbf{z}_u^a,\mathbf{m}_a))}{\sum_{a'=1}^{A}\exp(s(\mathbf{z}_u^{a'},\mathbf{m}_{a'}))},$

where $\mathbf{z}_u^a$ is the $a$-th aspect-level latent representation of user $u$, and $p_u^a\in\mathbb{R}_+$ is the probability of user $u$ on the $a$-th preference.
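As a concrete illustration of Eqs. (3.3) and (3.4), below is a minimal sketch of the prototype-based attention, assuming cosine similarity as the affinity $s(\cdot)$; the tensor names and toy sizes are our assumptions.

```python
# A sketch of prototype-based aspect attention (Eqs. 3.3-3.4); assumptions:
# cosine affinity, randomly initialized prototypes and representations.
import torch
import torch.nn.functional as F

def aspect_probs(z, prototypes):
    """z: [B, A, d] aspect-level representations; prototypes: [A, d].
    Returns [B, A] aspect probability vectors (c_i for items, p_u for users)."""
    # s(z^a, h_a): cosine affinity between the a-th chunk and the a-th prototype
    scores = F.cosine_similarity(z, prototypes.unsqueeze(0), dim=-1)  # [B, A]
    return torch.softmax(scores, dim=-1)  # normalize over aspects

A, d, n_items, n_users = 4, 25, 10, 8
h = torch.randn(A, d)          # item aspect prototypes {h_a}
m_proto = torch.randn(A, d)    # user preference prototypes {m_a}
C = aspect_probs(torch.randn(n_items, A, d), h)        # item-aspect matrix C
P = aspect_probs(torch.randn(n_users, A, d), m_proto)  # user-aspect matrix P
assert torch.allclose(C.sum(-1), torch.ones(n_items))  # rows sum to 1
```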

3.4 Disentangled Variational Inference

The DVI module is designed to infer the multi-aspect latent variables of users and items. In general, latent variables are drawn from prior distributions; following convention, we place standard multivariate isotropic Gaussian priors over both user and item latent variables, i.e., $p(\mathbf{z}_u)=\mathcal{N}(\mathbf{0},\mathbf{I})$ and $p(\mathbf{z}_i)=\mathcal{N}(\mathbf{0},\mathbf{I})$, where $\mathbf{I}$ is the identity matrix. Given the user-item interaction matrix $\mathbf{R}$, the goal of variational inference is to infer the posterior over the latent variables, $p(\mathbf{z}_{1:|\mathcal{U}|},\mathbf{z}_{1:|\mathcal{I}|}|\mathbf{R})$. As the true posterior is intractable, exact inference is infeasible. We therefore exploit Variational Bayes [3] to approximate this posterior with a parameterized distribution $q_{\phi}(\mathbf{z}_{1:|\mathcal{U}|},\mathbf{z}_{1:|\mathcal{I}|}|\mathbf{R})=q_{\phi_u}(\mathbf{z}_{1:|\mathcal{U}|}|\mathbf{R})\,q_{\phi_i}(\mathbf{z}_{1:|\mathcal{I}|}|\mathbf{R})$, which first breaks the coupling between $\mathbf{z}_u$ and $\mathbf{z}_i$. Assuming statistical independence among users and among items, we further set $q_{\phi_u}(\mathbf{z}_{1:|\mathcal{U}|}|\mathbf{R})=\prod_u q_{\phi_u}(\mathbf{z}_u|\mathbf{r}_{u*})$ and $q_{\phi_i}(\mathbf{z}_{1:|\mathcal{I}|}|\mathbf{R})=\prod_i q_{\phi_i}(\mathbf{z}_i|\mathbf{r}_{*i})$.

To further obtain disentangled multi-aspect latent representations of users and items, we assume that $q_{\phi_u}(\mathbf{z}_u|\mathbf{r}_{u*})=q_{\phi_u}(\mathbf{z}_u|\mathbf{r}_{u*},\mathbf{C})=\prod_{a=1}^{A}q_{\phi_u}(\mathbf{z}_u^a|\mathbf{r}_{u*},\mathbf{C})$ and $q_{\phi_i}(\mathbf{z}_i|\mathbf{r}_{*i})=q_{\phi_i}(\mathbf{z}_i|\mathbf{r}_{*i},\mathbf{P})=\prod_{a=1}^{A}q_{\phi_i}(\mathbf{z}_i^a|\mathbf{r}_{*i},\mathbf{P})$. Without loss of generality, we express the variational distributions $q_{\phi_u}(\mathbf{z}_u^a|\mathbf{r}_{u*},\mathbf{C})$ and $q_{\phi_i}(\mathbf{z}_i^a|\mathbf{r}_{*i},\mathbf{P})$ as multivariate normal distributions with diagonal covariance matrices,

(3.5) $q_{\phi_u}(\mathbf{z}_u^a|\mathbf{r}_{u*},\mathbf{C})\sim\mathcal{N}(\boldsymbol{\mu}_u^a,\boldsymbol{\sigma}_u^a),\quad q_{\phi_i}(\mathbf{z}_i^a|\mathbf{r}_{*i},\mathbf{P})\sim\mathcal{N}(\boldsymbol{\mu}_i^a,\boldsymbol{\sigma}_i^a),$

where $\boldsymbol{\mu}^a\in\mathbb{R}^d$ and $\boldsymbol{\sigma}^a\in\mathbb{R}^d$ are the mean and covariance of the disentangled variational distributions, parameterized by two shallow networks $f_{\phi_u}:\mathbb{R}^n\rightarrow\mathbb{R}^{2d}$ and $f_{\phi_i}:\mathbb{R}^m\rightarrow\mathbb{R}^{2d}$ with parameters $\phi_u$ and $\phi_i$,

(3.6) $[\boldsymbol{\mu}_u^a;\boldsymbol{\sigma}_u^a]=f_{\phi_u}(\mathbf{r}_{u*}^a),~\mathbf{r}_{u*}^a=\mathbf{r}_{u*}^{\top}\otimes\mathbf{c}^a;\qquad [\boldsymbol{\mu}_i^a;\boldsymbol{\sigma}_i^a]=f_{\phi_i}(\mathbf{r}_{*i}^a),~\mathbf{r}_{*i}^a=\mathbf{r}_{*i}\otimes\mathbf{p}^a,$

where $\mathbf{c}^a\in\mathbb{R}^{n\times 1}$ and $\mathbf{p}^a\in\mathbb{R}^{m\times 1}$ are the $a$-th columns of $\mathbf{C}$ and $\mathbf{P}$, respectively, i.e., the $a$-th aspect probability vectors over all items and users, and $\otimes$ denotes the element-wise product. $\mathbf{r}_{u*}^a$ and $\mathbf{r}_{*i}^a$ serve as the aspect-level interaction vectors of user $u$ and item $i$; clearly, $\mathbf{r}_{u*}=\sum_a\mathbf{r}_{u*}^a$ and $\mathbf{r}_{*i}=\sum_a\mathbf{r}_{*i}^a$. Notably, under aspect $a$, the aspect-level probability vector of items is employed to learn the aspect-level latent representations of users, and vice versa. This ensures a one-to-one aspect-level correspondence between the disentangled representations of users and items. Further, $\mathbf{z}_u^a$ and $\mathbf{z}_i^a$ are sampled via the reparameterization trick [13, 20],

(3.7) $\mathbf{z}_u^a=\boldsymbol{\mu}_u^a+\boldsymbol{\sigma}_u^a\otimes\boldsymbol{\epsilon},\quad \mathbf{z}_i^a=\boldsymbol{\mu}_i^a+\boldsymbol{\sigma}_i^a\otimes\boldsymbol{\epsilon},$

where $\boldsymbol{\epsilon}$ is a noise vector with $\boldsymbol{\epsilon}\sim\mathcal{N}(\mathbf{0},\mathbf{I})$.
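To illustrate Eqs. (3.6)-(3.7), here is a minimal sketch of the user-side disentangled encoder (the item side is symmetric). The width of the shallow network, sharing it across aspects, and the log-variance parameterization are our assumptions; the paper only specifies the mapping $f_{\phi_u}:\mathbb{R}^n\rightarrow\mathbb{R}^{2d}$.

```python
# A sketch of the user-side disentangled encoder (Eqs. 3.6-3.7); the exact
# architecture of f_{phi_u} is an assumption.
import torch
import torch.nn as nn

class AspectEncoder(nn.Module):
    def __init__(self, n_items, d, A):
        super().__init__()
        self.A = A
        # shallow network f_{phi_u}: R^n -> R^{2d}, here shared across aspects
        self.net = nn.Sequential(nn.Linear(n_items, 2 * d), nn.Tanh(),
                                 nn.Linear(2 * d, 2 * d))

    def forward(self, r_u, C):
        """r_u: [B, n] interaction vectors; C: [n, A] item-aspect probabilities."""
        mus, logvars = [], []
        for a in range(self.A):
            r_u_a = r_u * C[:, a]                          # r_{u*}^a (Eq. 3.6)
            mu, logvar = self.net(r_u_a).chunk(2, dim=-1)  # [mu^a; sigma^a]
            mus.append(mu); logvars.append(logvar)
        mu, logvar = torch.stack(mus, 1), torch.stack(logvars, 1)  # [B, A, d]
        eps = torch.randn_like(mu)                  # reparameterization (Eq. 3.7)
        z_u = mu + torch.exp(0.5 * logvar) * eps
        return z_u, mu, logvar
```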

3.5 Joint Generation

Given the user and item latent representations $\mathbf{z}_u$ and $\mathbf{z}_i$ and the aspect probability matrices $\mathbf{C}$ and $\mathbf{P}$, the JG module is designed to predict the preference of a user towards an item by reconstructing their observed interactions. Specifically, we model the distribution $p_{\theta}(r_{u,i}|\mathbf{z}_u,\mathbf{z}_i,\mathbf{C},\mathbf{P})$ of user $u$'s preference towards item $i$ with a Poisson likelihood [23],

(3.8) $p_{\theta}(r_{u,i}|\mathbf{z}_u,\mathbf{z}_i,\mathbf{C},\mathbf{P})=\exp\left(r_{u,i}\log g_{\theta}(\mathbf{z}_u,\mathbf{z}_i)-g_{\theta}(\mathbf{z}_u,\mathbf{z}_i)\right),$

where $g_{\theta}(\mathbf{z}_u,\mathbf{z}_i)$ is a differentiable function that utilizes the disentangled latent representations of user $u$ and item $i$ to generate the joint observation $r_{u,i}$,

(3.9) $g_{\theta}(\mathbf{z}_u,\mathbf{z}_i)=\sum_{a=1}^{A}p_u^a\cdot c_i^a\cdot\sigma\left(\textsc{Skip}_{\theta}(\mathbf{z}_u^a,\mathbf{z}_i^a)\right),\quad \textsc{Skip}_{\theta}(\mathbf{z}_u^a,\mathbf{z}_i^a)=\mathbf{z}_u^a\odot\mathbf{z}_i^a+f_{\theta}(\mathbf{z}_u^a)\odot f_{\theta}(\mathbf{z}_i^a),$

where $\textsc{Skip}_{\theta}(\cdot)$ is a skip-connection operation that avoids the issue of latent variable collapse [5], $f_{\theta}(\cdot)$ is a non-linear function parameterized by $\theta$, $\sigma$ is the sigmoid function, and $\odot$ denotes the inner product. Note that $p_{\theta}(r_{u,i}|\mathbf{z}_u,\mathbf{z}_i,\mathbf{C},\mathbf{P})\propto g_{\theta}(\mathbf{z}_u,\mathbf{z}_i)$, and $c_i^a$ and $p_u^a$ act as aspect-level weights for item $i$ and user $u$.
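The following sketch illustrates Eqs. (3.8)-(3.9) for a single user-item pair; the form of the non-linear $f_{\theta}$ is left unspecified in the paper, so the linear-tanh layer below is an assumption.

```python
# A sketch of the joint generator g_theta (Eq. 3.9) and the Poisson negative
# log-likelihood (Eq. 3.8, constants dropped); f_theta is an assumed MLP layer.
import torch
import torch.nn as nn

class JointGenerator(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(d, d), nn.Tanh())  # non-linear f_theta

    def forward(self, z_u, z_i, p_u, c_i):
        """z_u, z_i: [A, d] aspect chunks; p_u, c_i: [A] aspect probabilities."""
        # Skip_theta: z_u^a . z_i^a + f(z_u^a) . f(z_i^a), per aspect
        skip = (z_u * z_i).sum(-1) + (self.f(z_u) * self.f(z_i)).sum(-1)  # [A]
        return (p_u * c_i * torch.sigmoid(skip)).sum()  # aspect-weighted sum

def poisson_nll(r_ui, g):
    """-log p(r | g) under Eq. (3.8), up to an additive constant."""
    return g - r_ui * torch.log(g + 1e-10)

gen = JointGenerator(d=16)
g = gen(torch.randn(4, 16), torch.randn(4, 16),
        torch.softmax(torch.randn(4), -1), torch.softmax(torch.randn(4), -1))
print(poisson_nll(torch.tensor(1.0), g))
```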

To optimize the model parameters $\theta$, $\phi_u$ and $\phi_i$, we perform approximate inference and learning by maximizing the evidence lower bound (ELBO) of $\sum_{u,i}\log p_{\theta}(r_{u,i})$. However, due to the sparsity of the observed interactions, directly performing unbiased stochastic optimization on this objective is inconvenient. We hence exploit the two-way nature of our model and perform alternating optimization in a Gauss-Seidel fashion [23]: we split the model parameters into user-related and item-related parts, and then optimize them alternately. First, the user-side objective is optimized with the item-related parameters fixed,

(3.10) $\mathcal{L}_{vae}^{u}=\mathbb{E}_{p(\mathbf{C})}\big[\mathbb{E}_{q_{\phi_u}(\mathbf{z}_u|\mathbf{r}_{u*},\mathbf{C})}[\log p_{\theta_u}(\mathbf{r}_{u*}|\mathbf{z}_u,\mathbf{z}_{1:|\mathcal{I}|},\mathbf{C})]-KL(q_{\phi_u}(\mathbf{z}_u|\mathbf{r}_{u*},\mathbf{C})\,\|\,p(\mathbf{z}_u))\big],$

where $p_{\theta_u}(\mathbf{r}_{u*}|\mathbf{z}_u,\mathbf{z}_{1:|\mathcal{I}|},\mathbf{C})=\prod_i p_{\theta_u}(r_{u,i}|\mathbf{z}_u,\mathbf{z}_i,\mathbf{c}_i)$. The first term is interpreted as the reconstruction error, while the second term is the Kullback-Leibler divergence. Analogously, with the user-related parameters fixed, the item-side objective is optimized,

(3.11) $\mathcal{L}_{vae}^{i}=\mathbb{E}_{p(\mathbf{P})}\big[\mathbb{E}_{q_{\phi_i}(\mathbf{z}_i|\mathbf{r}_{*i},\mathbf{P})}[\log p_{\theta_i}(\mathbf{r}_{*i}|\mathbf{z}_{1:|\mathcal{U}|},\mathbf{z}_i,\mathbf{P})]-KL(q_{\phi_i}(\mathbf{z}_i|\mathbf{r}_{*i},\mathbf{P})\,\|\,p(\mathbf{z}_i))\big],$

where $p_{\theta_i}(\mathbf{r}_{*i}|\mathbf{z}_{1:|\mathcal{U}|},\mathbf{z}_i,\mathbf{P})=\prod_u p_{\theta_i}(r_{u,i}|\mathbf{z}_u,\mathbf{z}_i,\mathbf{p}_u)$. Note that the objectives $\mathcal{L}_{vae}^u$ and $\mathcal{L}_{vae}^i$ do not conflict with each other, since maximizing either of them corresponds to maximizing the overall ELBO.
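For clarity, a minimal sketch of the user-side objective in Eq. (3.10) follows, combining the Poisson reconstruction term with the closed-form KL divergence to the standard Gaussian prior; the shapes and log-variance parameterization are our assumptions.

```python
# A sketch of the (negated) user-side ELBO of Eq. (3.10).
import torch

def user_elbo_loss(r_u, g_u, mu, logvar):
    """r_u, g_u: [B, n] observed interactions and predicted rates g_theta;
    mu, logvar: [B, A, d] parameters of q_{phi_u}(z_u | r_{u*}, C)."""
    # log p(r_u | z) under the Poisson likelihood of Eq. (3.8), constants dropped
    log_lik = (r_u * torch.log(g_u + 1e-10) - g_u).sum(-1)            # [B]
    # KL( N(mu, sigma^2) || N(0, I) ), summed over aspects and dimensions
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=(-2, -1))
    return -(log_lik - kl).mean()  # negate the ELBO so it can be minimized
```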

3.6 Neighborhood-enhanced Representation Constraint

The NRC module is designed to prevent confusion between latent representations inferred for different aspects and to maintain correspondence between the aspect-level representations of users and items. Specifically, we first aggregate the interacted neighbors to compute the aspect-level neighborhood-based representations $\mathbf{o}_u^a$ and $\mathbf{o}_i^a$ of user $u$ and item $i$,

(3.12) $\mathbf{o}_u^a=\sum_{i\in\mathcal{N}_u}c_i^a\cdot\mathbf{z}_i^a,\quad \mathbf{o}_i^a=\sum_{u\in\mathcal{N}_i}p_u^a\cdot\mathbf{z}_u^a,$

where $\mathcal{N}_u$ and $\mathcal{N}_i$ are the neighbor sets of user $u$ and item $i$, respectively. We then treat the inferred aspect-level latent representation $\mathbf{z}_u^a$ and the neighborhood-based representation $\mathbf{o}_u^a$ of user $u$ as a positive sample pair. Moreover, we construct two kinds of negative samples. First, to constrain the independence among different aspect representations, we take representations of different aspects of the same user as aspect-level negative samples. Second, we use representations of different users under the same aspect as user-level negative samples, which guarantees the discrepancy of aspect-level representations among different users. The InfoNCE loss [8] is employed to impose the contrastive constraint,

(3.13) $\mathcal{L}_u^a=-\log\frac{e^{s(\mathbf{z}_u^a,\mathbf{o}_u^a)/\tau}}{\underbrace{e^{s(\mathbf{z}_u^a,\mathbf{o}_u^a)/\tau}}_{\text{pos-pair}}+\underbrace{\sum_{b\neq a}^{A}e^{s(\mathbf{z}_u^a,\mathbf{o}_u^b)/\tau}}_{\text{aspect-level neg-pairs}}+\underbrace{\sum_{v\neq u}^{B_u}e^{s(\mathbf{z}_u^a,\mathbf{o}_v^a)/\tau}}_{\text{user-level neg-pairs}}},$

where $\mathcal{L}_u^a$ is the contrastive loss of the $a$-th aspect-level representation of user $u$, $s(\cdot)$ denotes the cosine similarity measure, $\tau$ is a temperature parameter (generally set to $0.2$), and $B_u$ denotes the set of users within a batch. Similarly, the disentangled item representations are constrained by,

(3.14) $\mathcal{L}_i^a=-\log\frac{e^{s(\mathbf{z}_i^a,\mathbf{o}_i^a)/\tau}}{\underbrace{e^{s(\mathbf{z}_i^a,\mathbf{o}_i^a)/\tau}}_{\text{pos-pair}}+\underbrace{\sum_{b\neq a}^{A}e^{s(\mathbf{z}_i^a,\mathbf{o}_i^b)/\tau}}_{\text{aspect-level neg-pairs}}+\underbrace{\sum_{j\neq i}^{B_i}e^{s(\mathbf{z}_i^a,\mathbf{o}_j^a)/\tau}}_{\text{item-level neg-pairs}}},$

where $B_i$ denotes the set of items within a batch. Intuitively, the auxiliary supervision of positive pairs encourages directional consistency between the latent representations of users and items on the same aspect, while the supervision of negative pairs enforces divergence among different aspects. Note that some methods [25] use distance correlation as a regularizer to model representation independence. We do not adopt distance correlation because it can only ensure the independence of the aspects, but cannot guarantee the correspondence between the aspect-level representations of users and items.
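A minimal sketch of the user-side contrastive loss in Eq. (3.13) follows (the item-side loss in Eq. (3.14) is symmetric); the fully in-batch construction of negatives reflects our reading of the text.

```python
# A sketch of the NRC contrastive loss (Eq. 3.13) with aspect-level and
# user-level in-batch negatives; batching details are assumptions.
import torch
import torch.nn.functional as F

def nrc_loss(z, o, tau=0.2):
    """z, o: [B, A, d] inferred and neighborhood-based aspect representations.
    Positive pair: (z_u^a, o_u^a); negatives: other aspects of the same user
    and the same aspect of other users in the batch."""
    z, o = F.normalize(z, dim=-1), F.normalize(o, dim=-1)       # cosine via dot
    pos = torch.exp((z * o).sum(-1) / tau)                      # [B, A]
    # aspect-level negatives: sim(z_u^a, o_u^b) for all b, then drop b = a
    asp = torch.exp(torch.einsum('bad,bcd->bac', z, o) / tau)   # [B, A, A]
    asp_neg = asp.sum(-1) - pos
    # user-level negatives: sim(z_u^a, o_v^a) for all v in batch, drop v = u
    usr = torch.exp(torch.einsum('bad,cad->bac', z, o) / tau)   # [B, A, B]
    usr_neg = usr.sum(-1) - pos
    return -torch.log(pos / (pos + asp_neg + usr_neg)).mean()
```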

Finally, the overall objectives for the alternating optimization are set as,

(3.15) $\mathcal{L}_u=\mathcal{L}_{vae}^u+\gamma\cdot\sum_{a=1}^{A}\mathcal{L}_u^a,\quad \mathcal{L}_i=\mathcal{L}_{vae}^i+\gamma\cdot\sum_{a=1}^{A}\mathcal{L}_i^a,$

where $\gamma$ is an adjustable factor that balances the VAE loss and the contrastive loss.
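Putting the pieces together, the alternating Gauss-Seidel optimization of Eq. (3.15) might look like the sketch below; the `model.*_loss` methods are hypothetical stand-ins for the modules sketched above.

```python
# A sketch of one epoch of alternating optimization (Eq. 3.15); the loss
# methods on `model` are hypothetical placeholders.
def train_epoch(model, loader, opt_user, opt_item, gamma=0.01):
    for batch in loader:
        # user-side step: L_u = L_vae^u + gamma * sum_a L_u^a, items fixed
        loss_u = model.user_vae_loss(batch) + gamma * model.user_nrc_loss(batch)
        opt_user.zero_grad(); loss_u.backward(); opt_user.step()
        # item-side step: L_i = L_vae^i + gamma * sum_a L_i^a, users fixed
        loss_i = model.item_vae_loss(batch) + gamma * model.item_nrc_loss(batch)
        opt_item.zero_grad(); loss_i.backward(); opt_item.step()
```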

4 Experiment

Table 1: Statistics of three evaluation datasets.
Dataset #Users #Items #Feedback Sparsity
ML1M 6,040 3,679 1,000,180 0.9550
AKindle 14,356 15,885 367,477 0.9984
Yelp 31,668 38,048 1,561,406 0.9987
Table 2: Performance comparisons of DualVAE vs. baselines. Best performance is in boldface. Improvement is computed between DualVAE and the best baseline result (underlined). * indicates that the improvement is significant with $p<0.05$.
Datasets ML1M AKindle Yelp
Methods R@20 N@20 R@50 N@50 R@20 N@20 R@50 N@50 R@20 N@20 R@50 N@50
MF [14] 0.1846 0.3122 0.3193 0.3217 0.0453 0.0364 0.0707 0.0398 0.0298 0.0256 0.0589 0.0362
NeuMF [10] 0.2158 0.3256 0.3569 0.3472 0.0676 0.0459 0.1073 0.0582 0.0547 0.0456 0.1060 0.0622
JCA [30] 0.2251 0.3303 0.3814 0.3612 0.0745 0.0464 0.1171 0.0626 0.0550 0.0452 0.1121 0.0663
PoissVAE [15] 0.2286 0.3439 0.3790 0.3622 0.0623 0.0392 0.1084 0.0532 0.0556 0.0453 0.1076 0.0648
MultiVAE [15] 0.2301 0.3348 0.3819 0.3575 0.0739 0.0459 0.1226 0.0607 0.0563 0.0455 0.1091 0.0653
MacridVAE [16] 0.2313 0.3409 0.3865 0.3655 0.0779 0.0475 0.1335 0.0654 0.0601 0.0485 0.1165 0.0681
JoVA [1] 0.2305 0.3409 0.3840 0.3641 0.0754 0.0470 0.1270 0.0631 0.0573 0.0459 0.1122 0.0665
BiVAE [23] 0.2305 0.3450 0.3853 0.3658 0.0757 0.0477 0.1295 0.0636 0.0575 0.0460 0.1140 0.0670
DualVAE 0.2365* 0.3643* 0.3944* 0.3816* 0.0812* 0.0511* 0.1378* 0.0683* 0.0633* 0.0514* 0.1227* 0.0735*
Improvement 2.27% 5.59% 2.04% 4.32% 4.24% 7.13% 3.21% 4.37% 5.32% 5.98% 5.32% 7.93%

4.1 Experimental Setup

4.1.1 Datasets and Metrics.

The experimental study of DualVAE is conducted on three publicly available benchmark datasets from different platforms: MovieLens-1M (ML1M for short), Amazon Kindle Store (AKindle for short) and Yelp. The first is a MovieLens dataset (https://grouplens.org/datasets/movielens) in which each user has at least 20 interactions, the second is collected from Amazon (http://jmcauley.ucsd.edu/data/amazon) [18], and the third is from the 2018 Yelp challenge (https://www.yelp.com/dataset). For the last two datasets, we apply a 10-core setting to ensure data quality. For all datasets, we treat observed user-item interactions as positive feedback. Table 1 summarizes the statistics of the three evaluation datasets. Performance on the testing set is evaluated by two commonly used metrics: Recall (R@N) and Normalized Discounted Cumulative Gain (N@N) [9], with the ranked list truncated at $N\in\{20,50\}$. For both metrics, the learned recommendation model ranks all items to produce a top-$N$ recommendation list.
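For reference, the two metrics can be computed as in the following minimal sketch; this is our illustrative implementation, not necessarily the exact evaluation code used by the authors.

```python
# A sketch of Recall@N and NDCG@N with binary relevance; `ranked` is a
# model-produced item ranking, `relevant` is the held-out positive set.
import math

def recall_at_n(ranked, relevant, N):
    hits = sum(1 for item in ranked[:N] if item in relevant)
    return hits / max(len(relevant), 1)

def ndcg_at_n(ranked, relevant, N):
    dcg = sum(1.0 / math.log2(pos + 2)
              for pos, item in enumerate(ranked[:N]) if item in relevant)
    idcg = sum(1.0 / math.log2(pos + 2) for pos in range(min(len(relevant), N)))
    return dcg / idcg if idcg > 0 else 0.0

print(recall_at_n([3, 1, 7, 2], {1, 2}, N=3))  # 0.5
print(ndcg_at_n([3, 1, 7, 2], {1, 2}, N=3))    # ~0.39
```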

4.1.2 Baselines.

We compare DualVAE against the following two groups of competitive methods: 1) latent factor model-based CF methods, including MF [14], NeuMF [10] and JCA [30]; and 2) VAE-based CF methods, including PoissVAE and MultiVAE [15], MacridVAE [16], JoVA [1], and BiVAE [23]. Note that, for a fair comparison, PoissVAE, BiVAE and DualVAE all use the Poisson likelihood by default in their generative models. For all baselines, we use the implementations and parameter settings reported in their original papers.

4.1.3 Parameter Settings.

Our model is implemented in PyTorch (code: https://github.com/georgeguo-cn/DualVAE). For all datasets, we randomly select 80% of user interactions for training and use the remainder for testing; from the testing set, we randomly select 10% of interactions as a validation set to tune hyperparameters. For a fair comparison, the total embedding size is fixed to $A\times d=100$ and mini-batch Adam [12] is employed to update model parameters with a fixed batch size of $128$ for all models. The learning rate is searched in $\{1e^{-4},1e^{-3},\dots,1e^{-1}\}$, the coefficient $\gamma$ in $\{1e^{-5},1e^{-4},\dots,1e^{-1}\}$, and the aspect number $A$ in $\{1,2,4,5,10,20\}$. We perform a grid search over these hyperparameters to obtain the optimal setting on each dataset. All experiments in this paper are performed in the same environment with an Intel(R) Xeon(R) Silver 4210R CPU @ 2.40GHz and a GeForce RTX 3090 GPU.

4.2 Performance Comparison

Table 2 presents the overall performance comparison, from which we have the following key observations:

  • DualVAE consistently outperforms all the other baselines on all datasets and metrics. In particular, DualVAE’s improvement over the best baseline is statistically significant in all cases, demonstrating that disentangling multi-aspect features of users and items indeed can help improve performance. Moreover, as the sparsity of datasets increases, the improvement of DualVAE becomes more significant, indicating that capturing multi-aspect uncertainty of user and item feature spaces can further alleviate the data sparsity problem and improve model robustness.

  • Dual VAE-based methods outperform user-specific VAE-based methods on all datasets. Specifically, BiVAE performs better than PoissVAE and MultiVAE, while DualVAE achieves better performance than MacridVAE. The results indicate that inferring latent representations of both users and items better fits the two-way nature of interaction data, thereby improving accuracy.

  • Disentanglement-based methods generally achieve better performance than their non-disentanglement counterparts. For instance, MacridVAE outperforms MultiVAE by an average of 3.97% in terms of N@20 across all datasets, indicating that disentangling multi-aspect user preferences is beneficial for improving representation quality and performance.

Table 3: Performance of various variants of DualVAE.
Datasets ML1M AKindle Yelp
Methods R@20 N@20 R@20 N@20 R@20 N@20
w/o ADD 0.2305 0.3450 0.0757 0.0473 0.0575 0.0460
w/o ID 0.2325 0.3594 0.0801 0.0495 0.0602 0.0489
w/o UD 0.2316 0.3492 0.0772 0.0483 0.0574 0.0463
w/o NRC 0.2321 0.3601 0.0803 0.0501 0.0611 0.0493
w/o UNS 0.2354 0.3632 0.0805 0.0503 0.0619 0.0502
w/o ANS 0.2353 0.3632 0.0810 0.0509 0.0624 0.0510
w/o NPS 0.2345 0.3623 0.0808 0.0508 0.0620 0.0503
DualVAE 0.2365 0.3643 0.0812 0.0511 0.0633 0.0514

4.3 Ablation Studies

Table 3 reports the results of ablation studies on various variants of DualVAE. Specifically, w/o ID (w/o UD) denotes that our model only retains user(item)-side aspect disentanglement; w/o UNS and w/o ANS are two variants that do not use user-level and aspect-level negative samples in the NRC module, respectively; and w/o NPS is a variant that removes the neighborhood-based representations and pairs each inferred aspect-level representation with itself to construct positive pairs. From Table 3, we can find:

  • After removing the ADD module, the performance of DualVAE drops (vs. w/o ADD) by an average of 6.65% and 8.46% on R@20 and N@20 respectively, indicating the effectiveness of attention-aware disentanglement. The performance gap between DualVAE and w/o UD (w/o ID) demonstrates the benefit of shaping the aspect distributions of both users and items. Moreover, w/o ID performs better than w/o UD on the sparser datasets, which suggests that user-side disentanglement may bring more performance gains than item-side disentanglement in alleviating interaction sparsity.

  • The variant w/o NRC removes the NRC module and causes an average performance drop of 2.21% and 2.47% in terms of R@20 and N@20 respectively, which indicates the necessity of ensuring the correspondence and independence of disentangled representations. The comparison between DualVAE and w/o ANS (w/o UNS) reveals that exploiting two-level negative samples to maintain the independence of disentangled representations is beneficial. Moreover, w/o ANS performs slightly better than w/o UNS on AKindle and Yelp, showing that aspect-level differences may matter more than user-level differences for keeping representations independent. Finally, w/o NPS performs worse than DualVAE, which demonstrates the significance of maintaining the correspondence between the multi-aspect representations of users and items.

Figure 2: Performance of varying parameters $A$ and $\gamma$ in terms of R@20 and N@20 on AKindle and Yelp: (a) impact of aspect number $A$; (b) impact of balance coefficient $\gamma$.

4.4 Hyperparameter Analysis

4.4.1 Impact of Aspect Number AA.

To investigate the effect of the aspect number, we vary $A$ in the range $\{1,2,3,4,5,10,20\}$ and show the performance comparisons in Figure 2 (a). We observe that performance is poor when $A=1$ on AKindle and Yelp, which indicates that a single aspect is generally insufficient for fitting interaction data. Overall, performance first increases with the number of aspects and then declines after reaching the optimum ($A=5$). The results confirm that disentangled representation learning is beneficial, but too many aspects may be redundant and can hurt performance. The results on ML1M show a similar trend, with the optimal performance obtained at $A=10$.

4.4.2 Impact of Balance Coefficient γ\gamma.

The parameter $\gamma$ balances the contrastive loss against the VAE loss. To evaluate its effect, we search it in $\{1e^{-5},1e^{-4},\dots,1e^{-1}\}$. Figure 2 (b) presents the results in terms of R@20 and N@20 on AKindle and Yelp. From the results, we can see that a larger $\gamma$, i.e., a stronger independence constraint, is generally desirable for better performance. In particular, we set $\gamma=0.1$ for ML1M and Yelp, and $\gamma=0.001$ for AKindle, to learn a suitable model.

Figure 3: Visualization of the aspect probability maps of $u_{1259}$, $u_{5443}$ and 20 randomly sampled items they interacted with on the ML1M dataset.
Figure 4: Case studies of our proposed DualVAE on the AKindle dataset. Red words reflect users' attitudes in their reviews towards certain aspects of an item.

4.5 Interpretability Exploration

4.5.1 Visualization of Aspect Probability.

We first visualize the aspect probability maps of two specific users, $u_{1259}$ and $u_{5443}$, in ML1M, along with 20 randomly sampled items they interacted with, as illustrated in Figure 3. We can see that user $u_{1259}$ pays more attention to the 8-th aspect, while $u_{5443}$ focuses more on the 3-rd aspect, which demonstrates that our model can indeed learn personalized user preferences. Furthermore, the aspect probability distributions of users are similar to those of most items they interacted with. In particular, most of $u_{1259}$'s interacted items have a higher probability (brighter in the map) on the 8-th aspect, while $u_{5443}$ and most of its interacted items match well on the 3-rd aspect. This phenomenon suggests that certain aspects may dominate the interactions between users and items, which offers a useful perspective for explaining user behaviors.

4.5.2 Interpretability of Disentangled Representations.

We further explore the interpretability of the disentangled representations through a case study, as shown in Figure 4. Specifically, we take the AKindle dataset as an example, randomly select two users, $u_5$ and $u_{4587}$, together with their interacted items, and explain their behaviors by analyzing their reviews. We present the user reviews corresponding to the interacted items with the highest aspect-level prediction scores under two randomly chosen aspects. For example, interaction $(u_5, i_{1389})$ has the highest predicted score (thick solid colored line) under the 1-st aspect ($a=1$), which implies that this interaction is mainly driven by the match between the user's preference and the item's characteristics under the 1-st aspect. From the results, we have the following findings:

  • By analyzing different reviews under the same aspect, we find that the reviews are consistent with some intuitive attribute concepts. Specifically, we can summarize the semantics of the two implicit aspects from the reviews (especially the red words) as character and author. The results illustrate the capability of DualVAE to disentangle multi-aspect preferences from user behaviors.

  • A higher prediction score generally corresponds to a higher degree of aspect-level feature matching between users and items. For example, under the 2-nd aspect, both $u_5$ and $u_{4587}$ clearly appreciate the author of item $i_{4311}$, which provides a good post-hoc explanation for their purchases of item $i_{4311}$. The results further verify the rationality of multi-aspect feature matching between users and items.

5 Conclusion

In this work, we proposed DualVAE, which combines disentangled representation learning with VAEs to fit user-item interactions. Specifically, we first designed an attention-aware dual disentanglement module and unified it with a disentangled variational autoencoder to infer multi-aspect latent representations of both users and items for reconstructing the observed interactions. Moreover, we developed a neighborhood-enhanced representation constraint module that improves the quality of the disentangled representations via contrastive learning with neighborhood-based positive samples and two-level negative samples. Extensive experiments on three real-world datasets demonstrate the effectiveness of DualVAE, and further empirical studies explore the interpretability of the disentangled representations. For future work, we plan to introduce more content knowledge, such as user reviews and modality information, to establish ground-truth features of users and items for more fine-grained disentangled representations.

6 Acknowledgements

We would like to thank all anonymous reviewers for their valuable comments. The work was partially supported by the National Natural Science Foundation of China under Grant No. 62272176 and the National Key R&D Program of China under Grant No. 2022YFC3802101 and 2023YFB3308301.

References

  • [1] B. Askari, J. Szlichta, and A. Salehi-Abari, Variational autoencoders for top-k recommendation with implicit feedback, in Proceedings of SIGIR, 2021, pp. 2061–2065.
  • [2] Y. Bengio, A. Courville, and P. Vincent, Representation learning: A review and new perspectives, IEEE Transactions on Pattern Analysis and Machine Intelligence, 35 (2013), pp. 1798–1828.
  • [3] D. M. Blei, A. Kucukelbir, and J. D. McAuliffe, Variational inference: A review for statisticians, Journal of the American statistical Association, 112 (2017), pp. 859–877.
  • [4] Y. Cho and M. Oh, Stochastic-expert variational autoencoder for collaborative filtering, in Proceedings of WWW, 2022, pp. 2482–2490.
  • [5] A. B. Dieng, Y. Kim, A. M. Rush, and D. M. Blei, Avoiding latent variable collapse with generative skip models, in Proceedings of AISTATS, vol. 89, 2019, pp. 2397–2405.
  • [6] Z. Gao, T. Shen, Z. Mai, M. R. Bouadjenek, I. Waller, A. Anderson, R. Bodkin, and S. Sanner, Mitigating the filter bubble while maintaining relevance: Targeted diversification with vae-based recommender systems, in Proceedings of SIGIR, 2022, pp. 2524–2531.
  • [7] D. Goldberg, D. A. Nichols, B. M. Oki, and D. B. Terry, Using collaborative filtering to weave an information tapestry, Commun. ACM, 35 (1992), pp. 61–70.
  • [8] M. Gutmann and A. Hyvärinen, Noise-contrastive estimation: A new estimation principle for unnormalized statistical models, in Proceedings of AISTATS, vol. 9, 2010, pp. 297–304.
  • [9] X. He, T. Chen, M. Kan, and X. Chen, Trirank: Review-aware explainable recommendation by modeling aspects, in Proceedings of CIKM, 2015, pp. 1661–1670.
  • [10] X. He, L. Liao, H. Zhang, L. Nie, X. Hu, and T. Chua, Neural collaborative filtering, in Proceedings of WWW, 2017, pp. 173–182.
  • [11] G. Karamanolakis, K. R. Cherian, A. R. Narayan, J. Yuan, D. Tang, and T. Jebara, Item recommendation with variational autoencoders and heterogeneous priors, in Proceedings of DLRS@RecSys, 2018, pp. 10–14.
  • [12] D. P. Kingma and J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980, (2014).
  • [13] D. P. Kingma and M. Welling, Auto-encoding variational bayes, in Proceedings of ICLR, 2014.
  • [14] Y. Koren, R. M. Bell, and C. Volinsky, Matrix factorization techniques for recommender systems, Computer, 42 (2009), pp. 30–37.
  • [15] D. Liang, R. G. Krishnan, M. D. Hoffman, and T. Jebara, Variational autoencoders for collaborative filtering, in Proceedings of WWW, 2018, pp. 689–698.
  • [16] J. Ma, C. Zhou, P. Cui, H. Yang, and W. Zhu, Learning disentangled representations for recommendation, in Proceedings of NeurIPS, 2019, pp. 5712–5723.
  • [17] W. Ma, X. Chen, W. Pan, and Z. Ming, VAE++: variational autoencoder for heterogeneous one-class collaborative filtering, in Proceedings of WSDM, 2022, pp. 666–674.
  • [18] J. Ni, J. Li, and J. J. McAuley, Justifying recommendations using distantly-labeled reviews and fine-grained aspects, in Proceedings of EMNLP-IJCNLP, 2019, pp. 188–197.
  • [19] Z. Ren, Z. Tian, D. Li, P. Ren, L. Yang, X. Xin, H. Liang, M. de Rijke, and Z. Chen, Variational reasoning about user preferences for conversational recommendation, in Proceedings of SIGIR, 2022, pp. 165–175.
  • [20] D. J. Rezende, S. Mohamed, and D. Wierstra, Stochastic backpropagation and approximate inference in deep generative models, in Proceedings of ICML, vol. 32, 2014, pp. 1278–1286.
  • [21] S. Sedhain, A. K. Menon, S. Sanner, and L. Xie, Autorec: Autoencoders meet collaborative filtering, in Proceedings of WWW, 2015, pp. 111–112.
  • [22] I. Shenbin, A. Alekseev, E. Tutubalina, V. Malykh, and S. I. Nikolenko, Recvae: A new variational autoencoder for top-n recommendations with implicit feedback, in Proceedings of WSDM, 2020, pp. 528–536.
  • [23] Q. Truong, A. Salah, and H. W. Lauw, Bilateral variational autoencoder for collaborative filtering, in Proceedings of WSDM, 2021, pp. 292–300.
  • [24] S. Wang, X. Chen, Q. Z. Sheng, Y. Zhang, and L. Yao, Causal disentangled variational auto-encoder for preference understanding in recommendation, in Proceedings of SIGIR, 2023, pp. 1874–1878.
  • [25] X. Wang, H. Jin, A. Zhang, X. He, T. Xu, and T.-S. Chua, Disentangled graph collaborative filtering, in Proceedings of the 43rd international ACM SIGIR conference on research and development in information retrieval, 2020, pp. 1001–1010.
  • [26] Y. Wu, C. DuBois, A. X. Zheng, and M. Ester, Collaborative denoising auto-encoders for top-n recommender systems, in Proceedings of WSDM, 2016, pp. 153–162.
  • [27] X. Yu, X. Zhang, Y. Cao, and M. Xia, VAEGAN: A collaborative filtering framework based on adversarial variational autoencoders, in Proceedings of IJCAI, 2019, pp. 4206–4212.
  • [28] S. Zhao, W. Wei, D. Zou, and X. Mao, Multi-view intent disentangle graph networks for bundle recommendation, in Proceedings of AAAI, 2022, pp. 4379–4387.
  • [29] Y. Zheng, C. Gao, X. Li, X. He, Y. Li, and D. Jin, Disentangling user interest and conformity for recommendation with causal embedding, in Proceedings of WWW, 2021, pp. 2980–2991.
  • [30] Z. Zhu, J. Wang, and J. Caverlee, Improving top-k recommendation via joint collaborative autoencoders, in Proceedings of WWW, 2019, pp. 3483–3489.