Causal Disentangled Variational Auto-Encoder for Preference Understanding in Recommendation
Abstract.
Recommendation models are typically trained on observational user interaction data, but the interactions between latent factors in users’ decision-making processes lead to complex and entangled data. Disentangling these latent factors to uncover their underlying representation can improve the robustness, interpretability, and controllability of recommendation models. This paper introduces the Causal Disentangled Variational Auto-Encoder (CaD-VAE), a novel approach for learning causal disentangled representations from interaction data in recommender systems. The CaD-VAE method considers the causal relationships between semantically related factors in real-world recommendation scenarios, rather than enforcing independence as in existing disentanglement methods. The approach utilizes structural causal models to generate causal representations that describe the causal relationship between latent factors. The results demonstrate that CaD-VAE outperforms existing methods, offering a promising solution for disentangling complex user behavior data in recommendation systems.
1. Introduction
Recommender systems play a crucial role in providing personalized recommendations to users based on their behavior. Learning the representation of user preference from behavior data is a critical task in designing a recommender model. Recently, deep neural networks have demonstrated their effectiveness in representation learning in recommendation models (Zhang et al., 2019). However, the existing methods for learning representation from users’ behavior in recommender systems face several challenges. One of the key issues is the inability to disentangle the latent factors that influence users’ behavior. This often leads to a highly entangled representation, which would disregard the complex interactions between latent factors driving users’ decision-making.
Disentangled Representation Learning (DRL) has gained increasing attention as a proposed solution to these challenges (Bengio et al., 2013). Most research on DRL has concentrated on computer vision (Higgins et al., 2017; Kim and Mnih, 2018; Dupont, 2018; Liu et al., 2021; Yang et al., 2021), with few studies examining its applications in recommender systems (Wang et al., 2022). Recent works, such as CausalVAE (Yang et al., 2021) and DEAR (Shen et al., 2022), utilize weak supervision to incorporate causal structure into disentanglement, allowing for the generation of images with causal semantics.
Implementing DRL in recommendation systems can enable a more fine-grained analysis of user behavior, leading to more accurate recommendations. One popular DRL method is Variational Autoencoders (VAE) (Higgins et al., 2017; Kumar et al., 2018; Kim and Mnih, 2018; Quessard et al., 2020), which learns latent representations capturing the data’s underlying structure. Ma et al. (2019) propose MacridVAE to learn the user’s macro and micro preference on items for collaborative filtering. Wang et al. (2023) extend the MacridVAE by employing visual images and textual descriptions to extract user interests. However, these works assume that countable, independent factors generate real-world observations, which may not hold in all cases. We argue that latent factors with the semantics of interest, known as concepts (Ma et al., 2019; Yang et al., 2021), have causal relationships in the recommender system. For example, in the movie domain, different directors specialize in different film genres, and different film genres may have a preference for certain actors. As a result, learning causally disentangled representations reflecting the causal relationships between high-level concepts related to user preference would be a better solution.
In this work, we propose a novel approach for disentanglement representation learning in recommender systems by adopting a structural causal model, named Causal Disentangled Variational Auto-Encoder (CaD-VAE). Our approach integrates the causal structure among high-level concepts that are associated with user preferences (such as film directors and film genres) into DRL. Regularization is applied to ensure each dimension within a high-level concept captures an independent, fine-grained level factor (such as action movies and funny movies within the film genre). Specifically, the input data is first processed by an encoder network, which maps it to a lower-dimensional latent space, resulting in independent exogenous factors. The obtained exogenous factors are then passed through a causal layer to be transformed into causal representations, where additional information is used to recover the causal structure between latent factors. Our main contributions are summarized as follows,
• We introduce the problem of causal disentangled representation learning for sparse relational user behavior data in recommender systems.
• We propose a new framework named CaD-VAE, which is able to describe the SCMs for latent factors in representation learning for user behavior.
• We conduct extensive experiments on various real-world datasets, demonstrating the superiority of CaD-VAE over existing state-of-the-art models.
2. Methodology
In this section, we propose the Causal Disentangled Variational Auto-Encoder (CaD-VAE) method for causal disentanglement learning in recommender systems. An overview of the proposed CaD-VAE model structure is shown in Figure 1.

2.1. Problem Formulation
Let $u \in \{1, \dots, N\}$ and $i \in \{1, \dots, M\}$ index users and items, respectively. For the recommender system, the dataset of user behavior consists of interactions between $N$ users and $M$ items. For a user $u$, the historical interactions form a multi-hot vector, where $x_{ui} = 0$ represents that there is no recorded interaction between user $u$ and item $i$, and $x_{ui} = 1$ represents an interaction between the user and the item, such as a click. For notational brevity, we use $x_u$ to denote all the interactions of user $u$, that is, $x_u = [x_{u1}, \dots, x_{uM}]$. Users may have very diverse interests and interact with items that belong to many high-level concepts, such as preferred film directors, actors, genres, and years of production. We aim to learn disentangled representations from user behavior that reflect the user's preference for different high-level concepts and the causal relationships between these concepts.
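As a minimal illustration, the multi-hot vector $x_u$ can be built from a user's click log as follows; item indices $0, \dots, M-1$ and the helper name are assumptions for the sketch:

```python
import numpy as np

def multi_hot(interactions, num_items):
    """Build the multi-hot interaction vector x_u for one user.

    interactions: iterable of item indices the user clicked (duplicates allowed).
    num_items:    total number of items M.
    """
    x_u = np.zeros(num_items, dtype=np.int8)
    # x_ui = 1 iff user u has at least one recorded interaction with item i
    x_u[list(set(interactions))] = 1
    return x_u

# A user who clicked items 0 and 3 (item 3 twice) out of M = 6 items.
x = multi_hot([0, 3, 3], num_items=6)
# x -> [1, 0, 0, 1, 0, 0]
```

Repeated clicks collapse to a single 1, matching the binary interaction setting above.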
2.2. Construct Causal Structure via SCMs
Consider $k$ high-level concepts in the observations to formalize causal representation. These concepts are assumed to have causal relationships with one another that can be described by a Directed Acyclic Graph (DAG), represented as a weighted adjacency matrix $A \in \mathbb{R}^{k \times k}$. To construct this causal structure, we introduce a causal layer in our framework. This layer is specifically designed to implement the nonlinear Structural Causal Model (SCM) proposed by Yu et al. (2019):
(1)  $z = f\big((I - A^{\top})^{-1} g(\epsilon)\big)$
where $A$ is the weighted adjacency matrix among the $k$ elements of $z$, $\epsilon$ is the vector of exogenous variables with $\epsilon \sim \mathcal{N}(0, I)$, and $f, g$ are nonlinear element-wise transformations. Additionally, $A_{ij}$ is non-zero if and only if $z_i$ is a parent of $z_j$, and the corresponding binary adjacency matrix is denoted by $S = \mathbb{I}(A \neq 0)$, where $\mathbb{I}(\cdot)$ is an element-wise indicator function. When $f$ is invertible, Equation (1) can be rephrased as follows:
(2)  $f^{-1}(z) = A^{\top} f^{-1}(z) + g(\epsilon)$
which implies that after undergoing the nonlinear transformation $f^{-1}$, the factors satisfy a linear SCM. To ensure disentanglement, we use the labels of the concepts as additional information, denoted as $y$. The additional information is used to learn the weights of the non-zero elements in the prior adjacency matrix, which represent the sign and scale of causal effects.
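A minimal numerical sketch of the causal layer in Equations (1)-(2). Here tanh stands in for the element-wise transformations $f$ and $g$, which in the model are learnable; the three-concept DAG is likewise illustrative:

```python
import numpy as np

# Weighted adjacency A over k = 3 concepts; A[i, j] != 0 iff z_i is a parent of z_j.
# Illustrative DAG: z_0 -> z_1 -> z_2 (e.g. director -> genre -> actor).
A = np.array([[0.0, 0.8, 0.0],
              [0.0, 0.0, 0.5],
              [0.0, 0.0, 0.0]])

# Illustrative invertible element-wise nonlinearities (learnable in the model).
f, f_inv, g = np.tanh, np.arctanh, np.tanh

def causal_layer(eps, A):
    """Equation (1): z = f((I - A^T)^{-1} g(eps)).

    Because A is the adjacency of a DAG, (I - A^T) is invertible.
    """
    k = len(eps)
    return f(np.linalg.solve(np.eye(k) - A.T, g(eps)))

rng = np.random.default_rng(0)
eps = rng.normal(size=3)          # exogenous factors, eps ~ N(0, I)
z = causal_layer(eps, A)

# Sanity check against Equation (2): f^{-1}(z) = A^T f^{-1}(z) + g(eps).
assert np.allclose(f_inv(z), A.T @ f_inv(z) + g(eps))
```

Because the transformed factors satisfy a linear SCM, the exogenous noise can be recovered as $\epsilon = g^{-1}\big((I - A^{\top}) f^{-1}(z)\big)$, which the inference model below exploits.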
2.3. Causality-guided Generative Modeling
Our model is built upon the generative framework of the Variational Autoencoder (VAE) and introduces a causal layer to describe the SCMs. Let $E$ and $D$ denote the encoder and decoder, respectively. We use $\theta$ to denote the set that contains all the trainable parameters of our model. For a user $u$, our generative model parameterized by $\theta$ assumes that the observed data $x_u$ are generated from the following distribution:
(3)  $p_\theta(x_u \mid y) = \int p_\theta(x_u \mid z)\, p(z, \epsilon \mid y)\, \mathrm{d}z\, \mathrm{d}\epsilon$
where
(4)  $p_\theta(x_u, z, \epsilon \mid y) = p_\theta(x_u \mid z)\, p(z, \epsilon \mid y)$
Assume the encoding process $z = E(x_u, y) + \zeta$, where $\zeta$ is a vector of independent noise with probability density $p_\zeta$. The inference model that takes into account the causal structure can be defined as:
(5)  $q_\phi(z \mid x_u, y) \equiv p_\zeta\big(z - E(x_u, y)\big)$
where $q_\phi(z \mid x_u, y)$ is the approximate posterior distribution, parameterized by $\phi$, that models the distribution of the latent representation $z$ given the user's behavior data $x_u$ and additional information $y$. Since $z$ and $\epsilon$ have a one-to-one correspondence, we can simplify the variational posterior as follows:
(6)  $q_\phi(z, \epsilon \mid x_u, y) = q_\phi(z \mid x_u, y)\, \delta\big(\epsilon - g^{-1}((I - A^{\top}) f^{-1}(z))\big)$
where $\delta(\cdot)$ is the Dirac delta function. We define the joint prior distribution for the latent variables $z$ and $\epsilon$ as:
(7)  $p(z, \epsilon \mid y) = p(\epsilon)\, p(z \mid y)$
where $p(\epsilon) = \mathcal{N}(0, I)$ and $p(z \mid y)$ is a factorized Gaussian distribution such that:
(8)  $p(z \mid y) = \prod_{m=1}^{k} \mathcal{N}\big(\lambda_1(y_m), \lambda_2^2(y_m)\big)$
where $y_m$ is the label of the $m$-th concept, and $\lambda_1$ and $\lambda_2$ are arbitrary functions.
Given the representation $z$ of a user, the decoder's goal is to predict the item, out of a total of $M$ items, that the user is most likely to click. Let $D$ denote the decoder. We assume the decoding process $x_u = D(z) + \xi$, where $\xi$ is a vector of independent noise with probability density $p_\xi$. We then define the generative model parameterized by $\theta$ as follows:
(9)  $p_\theta(x_u \mid z) \equiv p_\xi\big(x_u - D(z)\big)$
2.4. Disentanglement Objective
Our objective is to learn the parameters $\theta$ and $\phi$ that maximize the evidence lower bound (ELBO) of $\log p_\theta(x_u \mid y)$. The ELBO is defined as the expectation of the log-likelihood with respect to the approximate posterior $q_\phi(z, \epsilon \mid x_u, y)$, where $z$ and $\epsilon$ are the latent variables:
(10)  $\mathrm{ELBO} = \mathbb{E}_{q_\phi(z, \epsilon \mid x_u, y)}\big[\log p_\theta(x_u \mid z)\big] - D_{\mathrm{KL}}\big(q_\phi(z, \epsilon \mid x_u, y) \,\|\, p(z, \epsilon \mid y)\big)$
Based on the definitions of the approximate posterior in Equation (6) and the prior distribution in Equation (7), the ELBO defined in Equation (10) can be expressed in a neat form as follows:
(11)  $\mathrm{ELBO} = \mathbb{E}_{q_\phi(z \mid x_u, y)}\big[\log p_\theta(x_u \mid z)\big] - D_{\mathrm{KL}}\big(q_\phi(z \mid x_u, y) \,\|\, p(z \mid y)\big) - D_{\mathrm{KL}}\big(q_\phi(\epsilon \mid x_u, y) \,\|\, p(\epsilon)\big)$
Aside from disentangling high-level concepts, we are also interested in capturing the user's specific preferences for fine-grained factors within different concepts, such as action or comedy films within the film genre. Specifically, we aim to enforce statistical independence between the dimensions of the latent representation, so that each dimension describes a single factor, which can be formulated as forcing the following:
(12)  $q_\phi(z_m \mid x_u, y) = \prod_{d} q_\phi(z_{m,d} \mid x_u, y)$
Following the idea in (Ma et al., 2019), a $\beta$-VAE can be used to encourage independence between the dimensions. By varying the value of $\beta$, the model can be encouraged to learn more disentangled representations, where each latent dimension captures a single, independent underlying factor (Higgins et al., 2017). As a result, we amplify the regularization term by a factor of $\beta$, resulting in the following ELBO:
(13)  $\mathrm{ELBO}_\beta = \mathbb{E}_{q_\phi(z \mid x_u, y)}\big[\log p_\theta(x_u \mid z)\big] - \beta\, D_{\mathrm{KL}}\big(q_\phi(z \mid x_u, y) \,\|\, p(z \mid y)\big) - \beta\, D_{\mathrm{KL}}\big(q_\phi(\epsilon \mid x_u, y) \,\|\, p(\epsilon)\big)$
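When both the posterior and the prior are factorized Gaussians, as in Equations (5)-(8), each KL term in the $\beta$-weighted ELBO has a closed form. A minimal sketch, where the function names and the scalar reconstruction term are assumptions for illustration:

```python
import numpy as np

def kl_diag_gauss(mu_q, var_q, mu_p, var_p):
    """KL( N(mu_q, var_q) || N(mu_p, var_p) ) for factorized Gaussians,
    summed over dimensions (closed form)."""
    return 0.5 * np.sum(np.log(var_p / var_q)
                        + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)

def beta_elbo(log_lik, mu_q, var_q, mu_p, var_p, beta):
    """Equation (13)-style objective: reconstruction term minus a
    beta-amplified KL regularizer (one KL term shown for brevity)."""
    return log_lik - beta * kl_diag_gauss(mu_q, var_q, mu_p, var_p)

# When the posterior matches the prior, the KL vanishes and the
# beta-ELBO reduces to the log-likelihood regardless of beta.
mu, var = np.zeros(4), np.ones(4)
assert np.isclose(beta_elbo(-1.2, mu, var, mu, var, beta=5.0), -1.2)
```

Larger $\beta$ penalizes posteriors that deviate from the factorized prior more heavily, which is what pushes each dimension toward capturing a single factor.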
To ensure the learning of the causal structure and causal representations, we include a form of supervision during the training process of the model. The first component of this supervision utilizes the extra information $y$ to establish a constraint on the weighted adjacency matrix $A$. This constraint ensures that the matrix accurately reflects the causal relationships between the labels:
(14)  $\mathcal{L}_1 = \mathbb{E}\big[\|y - \sigma(A^{\top} y)\|_2^2\big] \leq \kappa_1$
where $\sigma(\cdot)$ is the sigmoid function and $\kappa_1$ is a small positive constant. The second component of supervision constructs a constraint on learning the latent causal representation $z$:
(15)  $\mathcal{L}_2 = \mathbb{E}_{z \sim q_\phi}\big[\|z - f((I - A^{\top})^{-1} g(\epsilon))\|_2^2\big] \leq \kappa_2$
where $\kappa_2$ is a small positive constant. Therefore, we have the following training objective:
(16)  $\mathcal{L} = -\mathrm{ELBO}_\beta + \lambda_1 \mathcal{L}_1 + \lambda_2 \mathcal{L}_2$
where $\lambda_1$ and $\lambda_2$ are regularization hyperparameters.
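A sketch of how the supervised penalty on the adjacency matrix and the total objective could be assembled. The exact penalty form is an assumption in the CausalVAE style of label supervision, and all function names here are illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def label_constraint(Y, A):
    """Penalty measuring how far labels are from satisfying y ~ sigma(A^T y).

    Y: (n_samples, k) concept labels; A: (k, k) weighted adjacency.
    Row-wise, Y @ A computes (A^T y)^T for each label vector y.
    """
    return np.mean(np.sum((Y - sigmoid(Y @ A)) ** 2, axis=1))

def training_loss(neg_beta_elbo, l_label, l_latent, lam1, lam2):
    """Total objective: minimize -ELBO_beta plus the weighted penalties."""
    return neg_beta_elbo + lam1 * l_label + lam2 * l_latent

# With A = 0, sigmoid(A^T y) = 0.5 everywhere, so labels of 0.5 incur no penalty.
assert np.isclose(label_constraint(np.full((2, 3), 0.5), np.zeros((3, 3))), 0.0)
```

In practice the inequality constraints would be handled with Lagrange multipliers or fixed penalty weights, as the $\lambda_1, \lambda_2$ hyperparameters above suggest.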
Table 1. Overall performance comparison on the four datasets.

| Dataset | Method | NDCG@50 | NDCG@100 | Recall@20 | Recall@50 |
|---|---|---|---|---|---|
| ML-100k | MultDAE | 0.13226 (±0.02836) | 0.24487 (±0.02738) | 0.23794 (±0.03605) | 0.32279 (±0.04070) |
| | β-MultVAE | 0.13422 (±0.02341) | 0.27484 (±0.02883) | 0.24838 (±0.03294) | 0.35270 (±0.03927) |
| | MacridVAE | 0.14272 (±0.02877) | 0.28895 (±0.02739) | 0.30951 (±0.03808)* | 0.41309 (±0.04503) |
| | DGCF | 0.15215 (±0.03612) | 0.28229 (±0.02271) | 0.28912 (±0.03012) | 0.34233 (±0.02937) |
| | SEM-MacridVAE | 0.17322 (±0.02812)* | 0.29372 (±0.02371)* | 0.27492 (±0.02152) | 0.37026 (±0.02914) |
| | Ours | 0.19272 (±0.02515) | 0.31826 (±0.02018) | 0.31272 (±0.02612) | 0.38162 (±0.03812)* |
| ML-1M | MultDAE | 0.29172 (±0.00729) | 0.40453 (±0.00799) | 0.34382 (±0.00961) | 0.46781 (±0.01032) |
| | β-MultVAE | 0.30128 (±0.00617) | 0.40555 (±0.00809) | 0.33960 (±0.00919) | 0.45825 (±0.01039) |
| | MacridVAE | 0.31622 (±0.00499) | 0.42740 (±0.00789) | 0.36046 (±0.00947) | 0.49039 (±0.01029) |
| | DGCF | 0.32111 (±0.01028) | 0.43222 (±0.00617) | 0.37152 (±0.00891) | 0.49285 (±0.09918) |
| | SEM-MacridVAE | 0.32817 (±0.00916)* | 0.44812 (±0.00689)* | 0.38172 (±0.00798)* | 0.49871 (±0.01029)* |
| | Ours | 0.34716 (±0.00718) | 0.45971 (±0.00610) | 0.39182 (±0.00571) | 0.50127 (±0.00917) |
| ML-20M | MultDAE | 0.32822 (±0.00187) | 0.41900 (±0.00209) | 0.39169 (±0.00271) | 0.53054 (±0.00285) |
| | β-MultVAE | 0.33812 (±0.00207) | 0.41113 (±0.00212) | 0.38263 (±0.00273) | 0.51975 (±0.00289) |
| | MacridVAE | 0.34918 (±0.00271) | 0.42496 (±0.00212) | 0.39649 (±0.00271) | 0.52901 (±0.00284) |
| | DGCF | 0.36152 (±0.00281) | 0.43172 (±0.00199) | 0.40127 (±0.00284) | 0.52127 (±0.00229) |
| | SEM-MacridVAE | 0.37172 (±0.00187)* | 0.44312 (±0.00177)* | 0.41272 (±0.00300)* | 0.53212 (±0.00198)* |
| | Ours | 0.38991 (±0.00201) | 0.45126 (±0.00241) | 0.42822 (±0.00298) | 0.54316 (±0.00189) |
| Netflix | MultDAE | 0.24272 (±0.00089) | 0.37450 (±0.00095) | 0.33982 (±0.00123) | 0.43247 (±0.00126) |
| | β-MultVAE | 0.24986 (±0.00080) | 0.36291 (±0.00094) | 0.32792 (±0.00122) | 0.41960 (±0.00125) |
| | MacridVAE | 0.25717 (±0.00098) | 0.37987 (±0.00096) | 0.34587 (±0.00124) | 0.43478 (±0.00118) |
| | DGCF | 0.27128 (±0.00089)* | 0.39122 (±0.00078)* | 0.36271 (±0.00199)* | 0.45019 (±0.00102)* |
| | SEM-MacridVAE | 0.26981 (±0.00100) | 0.38012 (±0.00099) | 0.35712 (±0.00162) | 0.44172 (±0.00102) |
| | Ours | 0.29172 (±0.00080) | 0.40021 (±0.00088) | 0.38212 (±0.00062) | 0.45918 (±0.00081) |
3. Experiment
3.1. Experiment Setup
Our experiments were conducted on four real-world datasets: the large-scale Netflix Prize dataset and three MovieLens datasets of different scales (i.e., ML-100k, ML-1M, and ML-20M), following the same methodology as MacridVAE. To binarize these four datasets, we kept only ratings of four or higher and users who had watched at least five movies. We consider four causally related concepts, with the causal edges (DIRECTOR → FILM GENRE) and (FILM GENRE → ACTOR), while PRODUCTION YEAR is not causally related to the others. We compare the proposed approach with five existing baselines:
• MultDAE and β-MultVAE (Liang et al., 2018) are VAE-based methods for collaborative filtering.
• MacridVAE (Ma et al., 2019) is a disentangled representation learning method for recommendation.
• DGCF (Wang et al., 2020) is a disentangled graph-based method for collaborative filtering.
• SEM-MacridVAE (Wang et al., 2023) is an extension of MacridVAE that introduces semantic information.
The evaluation metrics used are NDCG and Recall, the same as in (Wang et al., 2023).
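The preprocessing described above (keep ratings of four or higher, then keep users with at least five retained movies) can be sketched as follows; the function name and the tuple-log input format are assumptions:

```python
def binarize(ratings, min_rating=4.0, min_interactions=5):
    """Binarize a rating log: keep ratings >= min_rating, then keep
    users with at least min_interactions retained movies.

    ratings: iterable of (user_id, item_id, rating) tuples.
    Returns {user_id: set of item_ids} for the surviving users.
    """
    kept = {}
    for user, item, r in ratings:
        if r >= min_rating:          # implicit-feedback threshold
            kept.setdefault(user, set()).add(item)
    # Drop users with too few retained interactions.
    return {u: items for u, items in kept.items()
            if len(items) >= min_interactions}

log = [("u1", i, 5.0) for i in range(5)] + [("u2", 101, 5.0), ("u2", 102, 3.0)]
kept = binarize(log)
# "u1" keeps five movies; "u2" retains only one and is filtered out.
```

The surviving per-user item sets are exactly what the multi-hot vectors $x_u$ of Section 2.1 are built from.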
3.2. Results
Overall Comparison. The overall comparison is shown in Table 1. The proposed method generally outperforms all existing baselines, demonstrating that the proposed causal disentangled representation works better than traditional disentangled representations.
Causal Disentanglement. We also provide a t-SNE (van der Maaten and Hinton, 2008) visualization of the learned causal disentangled representations for high-level concepts on ML-1M. In the representation visualization in Figure 2(a), pink represents the year of production, green the directors, blue the actors, and yellow the genres. We can clearly see that the year of production is disentangled from actors, genres, and directors, as it is not causally related to them.
Fine-grained Level Disentanglement. In Figure 2(b), we examine the relationship between the level of independence at the fine-grained level and recommendation performance by varying the hyperparameter $\beta$. To quantify the level of independence, we use a set of $d$-dimensional representations and calculate a correlation-based metric (Ma et al., 2019), where $\mathrm{corr}_{ij}$ is the correlation between dimensions $i$ and $j$. We observe a positive correlation between recommendation performance and the level of independence: higher independence leads to better performance. Our method outperforms existing disentangled representation learning methods in the level of independence.
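One way to quantify independence from the pairwise correlations $\mathrm{corr}_{ij}$ described above is the mean absolute off-diagonal correlation (lower means more independent); this particular aggregation is an assumption, sketched here on synthetic representations:

```python
import numpy as np

def mean_abs_correlation(Z):
    """Mean absolute Pearson correlation over distinct dimension pairs
    of a (n_samples, d) representation matrix."""
    C = np.corrcoef(Z, rowvar=False)   # d x d correlation matrix
    d = C.shape[0]
    return np.abs(C[~np.eye(d, dtype=bool)]).mean()

rng = np.random.default_rng(0)
Z_indep = rng.normal(size=(5000, 4))                # near-independent dims
Z_dep = Z_indep.copy()
Z_dep[:, 1] = Z_dep[:, 0] + 0.1 * rng.normal(size=5000)  # entangle two dims

# The entangled representation scores higher (less independent).
assert mean_abs_correlation(Z_dep) > mean_abs_correlation(Z_indep)
```

Plotting this score against recommendation accuracy while sweeping $\beta$ reproduces the kind of analysis reported for Figure 2(b).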


4. Conclusion
This work demonstrates the effectiveness of the CaD-VAE model in learning causal disentangled representations from user behavior. Our approach incorporates a causal layer implementing SCMs, allowing for the successful disentanglement of causally related concepts. Experimental results on four real-world datasets demonstrate that the proposed CaD-VAE model outperforms existing state-of-the-art methods for learning disentangled representations. In terms of future research, there is potential to investigate novel applications that can take advantage of the explainability and controllability offered by disentangled representations.
References
- Bengio et al. (2013) Yoshua Bengio, Aaron Courville, and Pascal Vincent. 2013. Representation Learning: A Review and New Perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence 35, 8 (2013), 1798–1828. https://doi.org/10.1109/TPAMI.2013.50
- Dupont (2018) Emilien Dupont. 2018. Learning disentangled joint continuous and discrete representations. Advances in Neural Information Processing Systems 31 (2018).
- Higgins et al. (2017) Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. 2017. beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework. In International Conference on Learning Representations. https://openreview.net/forum?id=Sy2fzU9gl
- Kim and Mnih (2018) Hyunjik Kim and Andriy Mnih. 2018. Disentangling by Factorising. In Proceedings of the 35th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 80), Jennifer Dy and Andreas Krause (Eds.). PMLR, 2649–2658. https://proceedings.mlr.press/v80/kim18b.html
- Kumar et al. (2018) Abhishek Kumar, Prasanna Sattigeri, and Avinash Balakrishnan. 2018. Variational Inference of Disentangled Latent Concepts from Unlabeled Observations. In International Conference on Learning Representations. https://openreview.net/forum?id=H1kG7GZAW
- Liang et al. (2018) Dawen Liang, Rahul G. Krishnan, Matthew D. Hoffman, and Tony Jebara. 2018. Variational Autoencoders for Collaborative Filtering. In Proceedings of the 2018 World Wide Web Conference (Lyon, France) (WWW ’18). International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, CHE, 689–698. https://doi.org/10.1145/3178876.3186150
- Liu et al. (2021) Y. Liu, E. Sangineto, Y. Chen, L. Bao, H. Zhang, N. Sebe, B. Lepri, W. Wang, and M. Nadai. 2021. Smoothing the Disentangled Latent Style Space for Unsupervised Image-to-Image Translation. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE Computer Society, Los Alamitos, CA, USA, 10780–10789. https://doi.org/10.1109/CVPR46437.2021.01064
- Ma et al. (2019) Jianxin Ma, Chang Zhou, Peng Cui, Hongxia Yang, and Wenwu Zhu. 2019. Learning Disentangled Representations for Recommendation. Advances in Neural Information Processing Systems 32 (2019).
- Quessard et al. (2020) Robin Quessard, Thomas Barrett, and William Clements. 2020. Learning disentangled representations and group structure of dynamical environments. Advances in Neural Information Processing Systems 33 (2020), 19727–19737.
- Shen et al. (2022) Xinwei Shen, Furui Liu, Hanze Dong, Qing Lian, Zhitang Chen, and Tong Zhang. 2022. Weakly Supervised Disentangled Generative Causal Representation Learning. Journal of Machine Learning Research 23 (2022), 1–55.
- van der Maaten and Hinton (2008) Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing Data using t-SNE. Journal of Machine Learning Research 9, 86 (2008), 2579–2605. http://jmlr.org/papers/v9/vandermaaten08a.html
- Wang et al. (2022) Xin Wang, Hong Chen, Si’ao Tang, Zihao Wu, and Wenwu Zhu. 2022. Disentangled Representation Learning. arXiv preprint arXiv:2211.11695 (2022).
- Wang et al. (2023) Xin Wang, Hong Chen, Yuwei Zhou, Jianxin Ma, and Wenwu Zhu. 2023. Disentangled Representation Learning for Recommendation. IEEE Transactions on Pattern Analysis and Machine Intelligence 45, 1 (2023), 408–424. https://doi.org/10.1109/TPAMI.2022.3153112
- Wang et al. (2020) Xiang Wang, Hongye Jin, An Zhang, Xiangnan He, Tong Xu, and Tat-Seng Chua. 2020. Disentangled Graph Collaborative Filtering. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (Virtual Event, China) (SIGIR ’20). Association for Computing Machinery, New York, NY, USA, 1001–1010. https://doi.org/10.1145/3397271.3401137
- Yang et al. (2021) Mengyue Yang, Furui Liu, Zhitang Chen, Xinwei Shen, Jianye Hao, and Jun Wang. 2021. CausalVAE: Disentangled Representation Learning via Neural Structural Causal Models. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 9588–9597. https://doi.org/10.1109/CVPR46437.2021.00947
- Yu et al. (2019) Yue Yu, Jie Chen, Tian Gao, and Mo Yu. 2019. DAG-GNN: DAG Structure Learning with Graph Neural Networks. In Proceedings of the 36th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 97), Kamalika Chaudhuri and Ruslan Salakhutdinov (Eds.). PMLR, 7154–7163. https://proceedings.mlr.press/v97/yu19a.html
- Zhang et al. (2019) Shuai Zhang, Lina Yao, Aixin Sun, and Yi Tay. 2019. Deep Learning Based Recommender System: A Survey and New Perspectives. ACM Comput. Surv. 52, 1, Article 5 (feb 2019), 38 pages. https://doi.org/10.1145/3285029