
Causal Disentangled Variational Auto-Encoder for Preference Understanding in Recommendation

Siyu Wang (The University of New South Wales, Sydney, NSW, Australia), Xiaocong Chen (The University of New South Wales, Sydney, Australia), Quan Z. Sheng (Macquarie University, Sydney, Australia), Yihong Zhang (Osaka University, Osaka, Japan), and Lina Yao (Data61, CSIRO, Eveleigh, Australia; The University of New South Wales, Sydney, Australia)
Abstract.

Recommendation models are typically trained on observational user interaction data, but the interactions between latent factors in users’ decision-making processes lead to complex and entangled data. Disentangling these latent factors to uncover their underlying representation can improve the robustness, interpretability, and controllability of recommendation models. This paper introduces the Causal Disentangled Variational Auto-Encoder (CaD-VAE), a novel approach for learning causal disentangled representations from interaction data in recommender systems. The CaD-VAE method considers the causal relationships between semantically related factors in real-world recommendation scenarios, rather than enforcing independence as in existing disentanglement methods. The approach utilizes structural causal models to generate causal representations that describe the causal relationship between latent factors. The results demonstrate that CaD-VAE outperforms existing methods, offering a promising solution for disentangling complex user behavior data in recommendation systems.

Recommender Systems, Causal Disentangled Representation, Variational Autoencoder
CCS Concepts: Information systems → Recommender systems

1. Introduction

Recommender systems play a crucial role in providing personalized recommendations to users based on their behavior. Learning the representation of user preference from behavior data is a critical task in designing a recommender model. Recently, deep neural networks have demonstrated their effectiveness in representation learning in recommendation models (Zhang et al., 2019). However, the existing methods for learning representation from users’ behavior in recommender systems face several challenges. One of the key issues is the inability to disentangle the latent factors that influence users’ behavior. This often leads to a highly entangled representation, which would disregard the complex interactions between latent factors driving users’ decision-making.

Disentangled Representation Learning (DRL) has gained increasing attention as a proposed solution to tackle these challenges (Bengio et al., 2013). Most research on DRL has concentrated on computer vision (Higgins et al., 2017; Kim and Mnih, 2018; Dupont, 2018; Liu et al., 2021; Yang et al., 2021), with few studies examining its applications in recommender systems (Wang et al., 2022). Recent works, such as CausalVAE (Yang et al., 2021) and DEAR (Shen et al., 2022), utilize weak supervision to incorporate causal structure into disentanglement, allowing for the generation of images with causal semantics.

Implementing DRL in recommendation systems can enable a more fine-grained analysis of user behavior, leading to more accurate recommendations. One popular DRL method is the Variational Autoencoder (VAE) (Higgins et al., 2017; Kumar et al., 2018; Kim and Mnih, 2018; Quessard et al., 2020), which learns latent representations capturing the data's underlying structure. Ma et al. (2019) propose MacridVAE to learn users' macro- and micro-level preferences on items for collaborative filtering. Wang et al. (2023) extend MacridVAE by employing visual images and textual descriptions to extract user interests. However, these works assume that countable, independent factors generate real-world observations, which may not hold in all cases. We argue that latent factors with the semantics of interest, known as concepts (Ma et al., 2019; Yang et al., 2021), have causal relationships in the recommender system. For example, in the movie domain, different directors specialize in different film genres, and different film genres may favor certain actors. As a result, learning causally disentangled representations that reflect the causal relationships between high-level concepts related to user preference would be a better solution.

In this work, we propose a novel approach for disentangled representation learning in recommender systems by adopting a structural causal model, named Causal Disentangled Variational Auto-Encoder (CaD-VAE). Our approach integrates the causal structure among high-level concepts associated with user preferences (such as film directors and film genres) into DRL. Regularization is applied to ensure that each dimension within a high-level concept captures an independent, fine-grained factor (such as action or comedy movies within the film genre). Specifically, the input data is first processed by an encoder network, which maps it to a lower-dimensional latent space, yielding independent exogenous factors. These exogenous factors are then passed through a causal layer to be transformed into causal representations, where additional information is used to recover the causal structure between latent factors. Our main contributions are summarized as follows:

  • We introduce the problem of causal disentangled representation learning for sparse relational user behavior data in recommender systems.

  • We propose a new framework named CaD-VAE, which is able to describe the structural causal models (SCMs) for latent factors in representation learning for user behavior.

  • We conduct extensive experiments on various real-world datasets, demonstrating the superiority of CaD-VAE over existing state-of-the-art models.

2. Methodology

In this section, we propose the Causal Disentangled Variational Auto-Encoder (CaD-VAE) method for causal disentanglement learning in recommender systems. An overview of the proposed CaD-VAE model structure is shown in Figure 1.

Figure 1. Model structure of CaD-VAE. The encoder takes the observation $x_u$ as input to generate an independent exogenous variable $\epsilon$, which is then transformed into causal representations $z$ by the causal layer. The decoder uses $z$ as input to reconstruct the original observation $x_u$.

2.1. Problem Formulation

Let $u\in\{1,\dots,U\}$ and $i\in\{1,\dots,I\}$ index users and items, respectively. For the recommender system, the dataset $\mathcal{D}$ of user behavior consists of interactions between $U$ users and $I$ items. For a user $u$, the historical interactions $D_u=\{x_{u,i}:x_{u,i}\in\{0,1\}\}$ form a multi-hot vector, where $x_{u,i}=0$ indicates that there is no recorded interaction between user $u$ and item $i$, and $x_{u,i}=1$ indicates an interaction, such as a click. For notational brevity, we use $\mathbf{x}_u$ to denote all the interactions of user $u$, that is, $\mathbf{x}_u=\{x_{u,i}:x_{u,i}=1\}$. Users may have very diverse interests and interact with items that belong to many high-level concepts, such as preferred film directors, actors, genres, and years of production. We aim to learn disentangled representations from user behavior that reflect user preferences related to different high-level concepts and capture the causal relationships between these concepts.
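As a toy illustration of this setup, the multi-hot interaction vectors $\mathbf{x}_u$ can be built from a raw click log as follows (all sizes and interactions below are illustrative, not from the paper's datasets):

```python
import numpy as np

# Hypothetical interaction log: (user, item) pairs of implicit feedback.
U, I = 3, 5  # number of users and items (illustrative sizes)
clicks = [(0, 1), (0, 4), (1, 2), (2, 0), (2, 3)]

# Row u is the multi-hot vector x_u: x_{u,i} = 1 iff user u interacted with item i.
X = np.zeros((U, I), dtype=np.int8)
for u, i in clicks:
    X[u, i] = 1

print(X[0])  # user 0's interaction vector: [0 1 0 0 1]
```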

2.2. Construct Causal Structure via SCMs

Consider $k$ high-level concepts in the observations to formalize causal representations. These concepts are assumed to have causal relationships with one another that can be described by a Directed Acyclic Graph (DAG), represented as an adjacency matrix, denoted by $A$. To construct this causal structure, we introduce a causal layer in our framework. This layer is specifically designed to implement the nonlinear Structural Causal Model (SCM) proposed by (Yu et al., 2019):

(1) $\mathbf{z} = g((\mathbf{I}-\mathbf{A}^{\top})^{-1}\boldsymbol{\epsilon}) := F_{\alpha}(\boldsymbol{\epsilon}),$

where $A$ is the weighted adjacency matrix among the $k$ elements of $\mathbf{z}$, $\epsilon$ is the vector of exogenous variables with $\epsilon\sim\mathcal{N}(0,I)$, and $g$ is a nonlinear element-wise transformation. The set of parameters of $A$ and $g$ is denoted by $\alpha=(A,g)$. Additionally, $A_{ij}$ is non-zero if and only if $[z]_i$ is a parent of $[z]_j$, and the corresponding binary adjacency matrix is denoted by $I_A=I(A\neq 0)$, where $I(\cdot)$ is an element-wise indicator function. When $g$ is invertible, Equation 1 can be rephrased as follows:

(2) $g^{-1}_{i}(z_{i}) = \mathbf{A}_{i}^{\top} g^{-1}_{i}(\mathbf{z}) + \epsilon_{i},$

which implies that, after the nonlinear transformation $g$ is undone, the factors $\mathbf{z}$ satisfy a linear SCM. To ensure disentanglement, we use the labels of the concepts as additional information, denoted $c$. The additional information $c$ is used to learn the weights of the non-zero elements in the prior adjacency matrix, which represent the sign and scale of causal effects.
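The causal layer of Equations 1 and 2 can be sketched numerically as follows. The adjacency weights and the choice $g=\tanh$ are illustrative stand-ins for the learned quantities; a strictly upper-triangular $A$ guarantees acyclicity:

```python
import numpy as np

rng = np.random.default_rng(0)
k = 4  # number of high-level concepts

# Weighted adjacency matrix A of the concept DAG; A[i, j] != 0 iff z_i is a
# parent of z_j. Strictly upper-triangular => acyclic. (Illustrative weights.)
A = np.zeros((k, k))
A[0, 1] = 0.8   # e.g. DIRECTOR -> FILM GENRE
A[1, 2] = -0.5  # e.g. FILM GENRE -> ACTOR

def g(x):
    # Invertible element-wise nonlinearity, a stand-in for the learned g in Eq. 1.
    return np.tanh(x)

# Eq. 1: z = g((I - A^T)^{-1} eps), mapping exogenous noise to causal factors.
eps = rng.standard_normal(k)
z = g(np.linalg.solve(np.eye(k) - A.T, eps))

# Check that Eq. 2 holds after inverting g: g^{-1}(z) = A^T g^{-1}(z) + eps.
g_inv_z = np.arctanh(z)
assert np.allclose(g_inv_z, A.T @ g_inv_z + eps)
```

Solving the linear system with `np.linalg.solve` avoids forming the explicit inverse $(\mathbf{I}-\mathbf{A}^{\top})^{-1}$.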

2.3. Causality-guided Generative Modeling

Our model is built upon the generative framework of the Variational Autoencoder (VAE) and introduces a causal layer to describe the SCMs. Let $E$ and $D$ denote the encoder and decoder, respectively. We use $\theta$ to denote the set $(E,D,\alpha,\lambda)$ that contains all the trainable parameters of our model. For a user $u$, our generative model parameterized by $\theta$ assumes that the observed data are generated from the following distribution:

(3) $p_{\theta}(\mathbf{x}_u) = \mathbb{E}_{p_{\theta}(\mathbf{c})}\Big[\iint p_{\theta}(\mathbf{x}_u|\boldsymbol{\epsilon},\mathbf{z}_u,\mathbf{c})\, p_{\theta}(\boldsymbol{\epsilon},\mathbf{z}_u|\mathbf{c})\, d\boldsymbol{\epsilon}\, d\mathbf{z}_u\Big],$

where

(4) $p_{\theta}(\mathbf{x}_u|\boldsymbol{\epsilon},\mathbf{z},\mathbf{c}) = \prod_{x_{u,i}\in\mathbf{x}_u} p_{\theta}(x_{u,i}|\boldsymbol{\epsilon},\mathbf{z},\mathbf{c}).$

We assume the encoding process $\boldsymbol{\epsilon}=E(\mathbf{x}_u,\mathbf{c})+\zeta$, where $\zeta$ is a vector of independent noise with probability density $q_{\zeta}$. The inference model that takes into account the causal structure can be defined as:

(5) $q_{\phi}(\boldsymbol{\epsilon},\mathbf{z}_u|\mathbf{x}_u,\mathbf{c}) \equiv q(\mathbf{z}_u|\boldsymbol{\epsilon})\, q_{\zeta}(\boldsymbol{\epsilon}-E(\mathbf{x}_u,\mathbf{c})),$

where $q_{\phi}(\boldsymbol{\epsilon},\mathbf{z}_u|\mathbf{x}_u,\mathbf{c})$ is the approximate posterior distribution, parameterized by $\phi$, that models the distribution of the latent representation given the user's behavior data $\mathbf{x}_u$ and additional information $\mathbf{c}$. Since $\epsilon$ and $z$ have a one-to-one correspondence, we can simplify the variational posterior distribution as follows:

(6) $q_{\phi}(\boldsymbol{\epsilon},\mathbf{z}_u|\mathbf{x}_u,\mathbf{c}) = q(\boldsymbol{\epsilon}|\mathbf{x}_u,\mathbf{c})\,\delta(\mathbf{z}_u=F_{\alpha}(\boldsymbol{\epsilon})) = q(\mathbf{z}_u|\mathbf{x}_u,\mathbf{c})\,\delta(\boldsymbol{\epsilon}=F^{-1}_{\alpha}(\mathbf{z}_u)),$

where $\delta(\cdot)$ is the Dirac delta function. We define the joint prior distribution of the latent variables $\epsilon$ and $z$ as:

(7) $p_{\theta}(\boldsymbol{\epsilon},\mathbf{z}_u|\mathbf{c}) = p_{\epsilon}(\boldsymbol{\epsilon})\, p_{\theta}(\mathbf{z}_u|\mathbf{c}),$

where $p_{\epsilon}(\epsilon)=\mathcal{N}(0,I)$ and $p_{\theta}(\mathbf{z}_u|\mathbf{c})$ is a factorized Gaussian distribution such that:

(8) $p_{\theta}(\mathbf{z}_u|\mathbf{c}) = \prod_{i=1}^{n} p_{\theta}(z_u^{(i)}|c_i),$

where $p_{\theta}(z_u^{(i)}|c_i)=\mathcal{N}(\lambda_1(c_i),\lambda_2^2(c_i))$, and $\lambda_1$ and $\lambda_2$ are arbitrary functions.

Given the representation of a user, denoted by $\mathbf{z}_u$, the decoder's goal is to predict which of the $I$ items the user is most likely to click. Let $D$ denote the decoder. We assume the decoding process $\mathbf{x}=D(\mathbf{z})+\xi$, where $\xi$ is a vector of independent noise with probability density $p_{\xi}$. We then define the generative model parameterized by $\theta$ as follows:

(9) $p_{\theta}(\mathbf{x}_u|\boldsymbol{\epsilon},\mathbf{z}_u,\mathbf{c}) = p_{\theta}(\mathbf{x}_u|\mathbf{z}_u) \equiv p_{\xi}(\mathbf{x}_u-D(\mathbf{z}_u)).$
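A minimal numerical sketch of the full encode → causal layer → decode pass described in this section, with random toy weights standing in for the learned networks $E$ and $D$ (the weight shapes, label vector, and softmax output head are all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
n_items, k = 6, 3  # item count and latent dimension (illustrative sizes)

# Toy linear stand-ins for the learned encoder E and decoder D.
W_enc = rng.standard_normal((k, n_items)) * 0.1
W_dec = rng.standard_normal((n_items, k)) * 0.1
A = np.zeros((k, k))
A[0, 1] = 0.7  # illustrative concept DAG weight

def encode(x_u, c):
    # eps = E(x_u, c) + zeta: map behavior and concept labels to exogenous noise.
    return W_enc @ x_u + 0.1 * c + 0.01 * rng.standard_normal(k)

def causal_layer(eps):
    # z = g((I - A^T)^{-1} eps) with g = tanh (Eq. 1).
    return np.tanh(np.linalg.solve(np.eye(k) - A.T, eps))

def decode(z):
    # D(z): scores over all items; a softmax yields click probabilities.
    logits = W_dec @ z
    p = np.exp(logits - logits.max())
    return p / p.sum()

x_u = np.array([1, 0, 1, 0, 0, 1], dtype=float)  # user's multi-hot history
c = np.ones(k)                                   # concept labels (illustrative)
p = decode(causal_layer(encode(x_u, c)))         # click probabilities over items
```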

2.4. Disentanglement Objective

Our objective is to learn the parameters $\phi$ and $\theta$ that maximize the evidence lower bound (ELBO) of $\sum_u \ln p_{\theta}(\mathbf{x}_u)$. The ELBO is defined as the expectation of the log-likelihood with respect to the approximate posterior $q_{\phi}(\boldsymbol{\epsilon},\mathbf{z}_u|\mathbf{x}_u,\mathbf{c})$, where $\mathbf{z}_u$ and $\epsilon$ are the latent variables:

(10) $\ln p_{\theta}(\mathbf{x}_u) \geq \mathbb{E}_{p_{\theta}(\mathbf{c})}\big[\mathbb{E}_{q_{\phi}(\boldsymbol{\epsilon},\mathbf{z}_u|\mathbf{x}_u,\mathbf{c})}[\ln p_{\theta}(\mathbf{x}_u|\boldsymbol{\epsilon},\mathbf{z}_u,\mathbf{c})] - D_{\mathrm{KL}}(q_{\phi}(\boldsymbol{\epsilon},\mathbf{z}_u|\mathbf{x}_u,\mathbf{c})\,\|\,p_{\theta}(\boldsymbol{\epsilon},\mathbf{z}_u|\mathbf{c}))\big].$

Based on the definitions of the approximate posterior in Equation 6 and the prior distribution in Equation 7, the ELBO defined in Equation 10 can be expressed in a neat form as follows:

(11) $\mathrm{ELBO} = \mathbb{E}_{p_{\theta}(\mathbf{c})}\big[\mathbb{E}_{q_{\phi}(\mathbf{z}_u|\mathbf{x}_u,\mathbf{c})}[\ln p_{\theta}(\mathbf{x}_u|\boldsymbol{\epsilon},\mathbf{z}_u,\mathbf{c})] - D_{\mathrm{KL}}(q_{\phi}(\boldsymbol{\epsilon}|\mathbf{x}_u,\mathbf{c})\,\|\,p_{\epsilon}(\boldsymbol{\epsilon})) - D_{\mathrm{KL}}(q_{\phi}(\mathbf{z}_u|\mathbf{x}_u,\mathbf{c})\,\|\,p_{\theta}(\mathbf{z}_u|\mathbf{c}))\big].$

Aside from disentangling high-level concepts, we are also interested in capturing the user's specific preferences for fine-grained factors within each concept, such as action or comedy films within the film genre. Specifically, we aim to enforce statistical independence between the dimensions of the latent representation, so that each dimension describes a single factor, which can be formulated as enforcing the following:

(12) $q_{\phi}(\mathbf{z}_u^{(i)}|\mathbf{c}) = \prod_{j=1}^{d} q_{\phi}(z_{u,j}^{(i)}|c_i).$

We follow the idea in (Ma et al., 2019) that the $\beta$-VAE can be used to encourage independence between the dimensions. By varying the value of $\beta$, the model can be encouraged to learn more disentangled representations, where each latent dimension captures a single, independent underlying factor (Higgins et al., 2017). As a result, we amplify the regularization term by a factor of $\beta\gg 1$, resulting in the following ELBO:

(13) $\mathbb{E}_{p_{\theta}(\mathbf{c})}\big[\mathbb{E}_{q_{\phi}(\mathbf{z}_u|\mathbf{x}_u,\mathbf{c})}[\ln p_{\theta}(\mathbf{x}_u|\boldsymbol{\epsilon},\mathbf{z}_u,\mathbf{c})] - D_{\mathrm{KL}}(q_{\phi}(\boldsymbol{\epsilon}|\mathbf{x}_u,\mathbf{c})\,\|\,p_{\epsilon}(\boldsymbol{\epsilon})) - \beta D_{\mathrm{KL}}(q_{\phi}(\mathbf{z}_u|\mathbf{x}_u,\mathbf{c})\,\|\,p_{\theta}(\mathbf{z}_u|\mathbf{c}))\big].$
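Because both the approximate posterior and the conditional prior are factorized Gaussians, the KL terms in the objective have a closed form. A sketch of the $\beta$-weighted loss follows; all numeric values are illustrative placeholders, and the reconstruction and $\epsilon$-KL terms are stubbed out as constants:

```python
import numpy as np

def kl_diag_gauss(mu_q, var_q, mu_p, var_p):
    # KL(N(mu_q, var_q) || N(mu_p, var_p)) for factorized Gaussians, summed over dims.
    return 0.5 * np.sum(np.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)

# Illustrative posterior q(z|x, c) and conditional prior p(z|c) = N(lambda1(c), lambda2(c)^2).
mu_q, var_q = np.array([0.5, -0.2]), np.array([0.8, 1.1])
mu_p, var_p = np.array([0.0, 0.0]), np.array([1.0, 1.0])

beta = 4.0     # beta >> 1 amplifies the disentanglement pressure
recon = -1.25  # placeholder for the log-likelihood term E_q[ln p(x | eps, z, c)]
kl_eps = 0.10  # placeholder for KL(q(eps | x, c) || N(0, I))

elbo = recon - kl_eps - beta * kl_diag_gauss(mu_q, var_q, mu_p, var_p)
loss = -elbo  # maximizing the ELBO = minimizing its negative
```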

To ensure the learning of the causal structure and causal representations, we include a form of supervision during model training. The first component of this supervision utilizes the extra information $c$ to establish a constraint on the weighted adjacency matrix $A$, ensuring that the matrix accurately reflects the causal relationships between the labels:

(14) $L_{sup}^{a} = \mathbb{E}_{q_{\mathcal{X}}}\|c-\sigma(A^{\top}c)\|_2^2 \leq \kappa_1,$

where $\sigma(\cdot)$ is the sigmoid function and $\kappa_1$ is a small positive constant. The second component imposes a constraint on learning the latent causal representation $z$:

(15) $L_{sup}^{z} = \mathbb{E}_{z\sim q_{\phi}} \sum_{i=1}^{n}\|g^{-1}_i(z_i)-\mathbf{A}_i^{\top}g^{-1}_i(\mathbf{z})\|_2^2 \leq \kappa_2,$

where $\kappa_2$ is a small positive constant. Therefore, we have the following training objective:

(16) $\mathcal{L} = -\mathrm{ELBO} + \gamma_1 L_{sup}^{a} + \gamma_2 L_{sup}^{z},$

where $\gamma_1$ and $\gamma_2$ are regularization hyperparameters.
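The two supervision terms and the final objective of Equation 16 can be sketched as follows, with illustrative values for $A$, $c$, and $z$, $g=\tanh$ as before, and the $-\mathrm{ELBO}$ term stubbed out as a constant:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Illustrative DAG weights over k = 3 concepts (strictly upper-triangular).
A = np.array([[0.0, 0.9, 0.0],
              [0.0, 0.0, 0.4],
              [0.0, 0.0, 0.0]])

c = np.array([1.0, 0.8, 0.6])   # concept labels for one sample (illustrative)
z = np.array([0.3, 0.5, -0.2])  # causal representation (illustrative)
g_inv = np.arctanh              # inverse of g = tanh

# Eq. 14: label-reconstruction constraint on the adjacency matrix A.
L_sup_a = np.sum((c - sigmoid(A.T @ c)) ** 2)

# Eq. 15: each z_i should satisfy the linear SCM after inverting g.
L_sup_z = np.sum((g_inv(z) - A.T @ g_inv(z)) ** 2)

gamma1, gamma2 = 1.0, 1.0
neg_elbo = 2.0  # placeholder for the -ELBO term of Eq. 13
loss = neg_elbo + gamma1 * L_sup_a + gamma2 * L_sup_z
```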

Table 1. Result comparison of the proposed method with several existing works. The best results are in bold, and the second best are marked with *. All methods are constrained to have around $2Md$ parameters, where $d=100$.
Dataset Method NDCG@50 NDCG@100 Recall@20 Recall@50
ML-100k MultDAE 0.13226 (±0.02836) 0.24487 (±0.02738) 0.23794 (±0.03605) 0.32279 (±0.04070)
β-MultVAE 0.13422 (±0.02341) 0.27484 (±0.02883) 0.24838 (±0.03294) 0.35270 (±0.03927)
MacridVAE 0.14272 (±0.02877) 0.28895 (±0.02739) 0.30951 (±0.03808)* 0.41309 (±0.04503)
DGCF 0.15215 (±0.03612) 0.28229 (±0.02271) 0.28912 (±0.03012) 0.34233 (±0.02937)
SEM-MacridVAE 0.17322 (±0.02812)* 0.29372 (±0.02371)* 0.27492 (±0.02152) 0.37026 (±0.02914)
Ours 0.19272 (±0.02515) 0.31826 (±0.02018) 0.31272 (±0.02612) 0.38162 (±0.03812)*
ML-1M MultDAE 0.29172 (±0.00729) 0.40453 (±0.00799) 0.34382 (±0.00961) 0.46781 (±0.01032)
β-MultVAE 0.30128 (±0.00617) 0.40555 (±0.00809) 0.33960 (±0.00919) 0.45825 (±0.01039)
MacridVAE 0.31622 (±0.00499) 0.42740 (±0.00789) 0.36046 (±0.00947) 0.49039 (±0.01029)
DGCF 0.32111 (±0.01028) 0.43222 (±0.00617) 0.37152 (±0.00891) 0.49285 (±0.09918)
SEM-MacridVAE 0.32817 (±0.00916)* 0.44812 (±0.00689)* 0.38172 (±0.00798)* 0.49871 (±0.01029)*
Ours 0.34716 (±0.00718) 0.45971 (±0.00610) 0.39182 (±0.00571) 0.50127 (±0.00917)
ML-20M MultDAE 0.32822 (±0.00187) 0.41900 (±0.00209) 0.39169 (±0.00271) 0.53054 (±0.00285)
β-MultVAE 0.33812 (±0.00207) 0.41113 (±0.00212) 0.38263 (±0.00273) 0.51975 (±0.00289)
MacridVAE 0.34918 (±0.00271) 0.42496 (±0.00212) 0.39649 (±0.00271) 0.52901 (±0.00284)
DGCF 0.36152 (±0.00281) 0.43172 (±0.00199) 0.40127 (±0.00284) 0.52127 (±0.00229)
SEM-MacridVAE 0.37172 (±0.00187)* 0.44312 (±0.00177)* 0.41272 (±0.00300)* 0.53212 (±0.00198)*
Ours 0.38991 (±0.00201) 0.45126 (±0.00241) 0.42822 (±0.00298) 0.54316 (±0.00189)
Netflix MultDAE 0.24272 (±0.00089) 0.37450 (±0.00095) 0.33982 (±0.00123) 0.43247 (±0.00126)
β-MultVAE 0.24986 (±0.00080) 0.36291 (±0.00094) 0.32792 (±0.00122) 0.41960 (±0.00125)
MacridVAE 0.25717 (±0.00098) 0.37987 (±0.00096) 0.34587 (±0.00124) 0.43478 (±0.00118)
DGCF 0.27128 (±0.00089)* 0.39122 (±0.00078)* 0.36271 (±0.00199)* 0.45019 (±0.00102)*
SEM-MacridVAE 0.26981 (±0.00100) 0.38012 (±0.00099) 0.35712 (±0.00162) 0.44172 (±0.00102)
Ours 0.29172 (±0.00080) 0.40021 (±0.00088) 0.38212 (±0.00062) 0.45918 (±0.00081)

3. Experiment

3.1. Experiment Setup

Our experiments were conducted on four real-world datasets: the large-scale Netflix Prize dataset and three MovieLens datasets of different scales (i.e., ML-100k, ML-1M, and ML-20M), processed following the same methodology as MacridVAE. To binarize these four datasets, we kept only ratings of four or higher and users who had watched at least five movies. We choose four causally related concepts: DIRECTOR → FILM GENRE, FILM GENRE → ACTOR, and PRODUCTION YEAR. We compare the proposed approach with five existing baselines:

  • MacridVAE (Ma et al., 2019) is a disentangled representation learning method for recommendation.

  • β-MultVAE (Liang et al., 2018) and MultDAE (Liang et al., 2018) are VAE-based representation learning methods for recommendation.

  • DGCF (Wang et al., 2020) is a disentangled graph-based method for collaborative filtering.

  • SEM-MacridVAE (Wang et al., 2023) is an extension of MacridVAE that introduces semantic information.

The evaluation metrics used are NDCG and Recall, the same as in (Wang et al., 2023).
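For reference, binary-relevance Recall@k and NDCG@k can be computed from predicted scores as in the following sketch (the scores and held-out set are illustrative; dataset-specific details such as tie-breaking may differ):

```python
import numpy as np

def recall_at_k(scores, heldout, k):
    # Fraction of held-out positives recovered in the top-k ranked list,
    # normalized by min(k, |heldout|) as is common in VAE-CF evaluation.
    topk = np.argsort(-scores)[:k]
    return len(set(topk) & heldout) / min(k, len(heldout))

def ndcg_at_k(scores, heldout, k):
    # Binary-relevance NDCG@k: DCG of the ranking divided by the ideal DCG.
    topk = np.argsort(-scores)[:k]
    dcg = sum(1.0 / np.log2(r + 2) for r, i in enumerate(topk) if i in heldout)
    idcg = sum(1.0 / np.log2(r + 2) for r in range(min(k, len(heldout))))
    return dcg / idcg

scores = np.array([0.9, 0.1, 0.8, 0.3, 0.05])  # predicted item scores (illustrative)
heldout = {0, 2}                               # held-out positive items

print(recall_at_k(scores, heldout, 2))  # -> 1.0 (both positives in the top 2)
```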

3.2. Results

Overall Comparison. The overall comparison can be found in Table 1. The proposed method generally outperforms all existing works, demonstrating that the proposed causal disentangled representation works better than traditional disentangled representations.

Causal Disentanglement. We also provide a t-SNE (van der Maaten and Hinton, 2008) visualization of the learned causal disentangled representations for high-level concepts on ML-1M. In the representation visualization of Figure 2(a), pink represents the year of production, green the directors, blue the actors, and yellow the genres. We can clearly see that the year of production is disentangled from actors, genres, and directors, as it is not causally related to them.

Fine-grained Level Disentanglement. In Figure 2(b), we examine the relationship between the level of independence at the fine-grained level and recommendation performance by varying the hyperparameter $\beta$. To quantify the level of independence, we use a set of $d$-dimensional representations and compute the metric $1-\frac{2}{d(d-1)}\sum_{1\leq i<j\leq d}|\mathrm{corr}_{i,j}|$ (Ma et al., 2019), where $\mathrm{corr}_{i,j}$ is the correlation between dimensions $i$ and $j$. We observe a positive correlation between recommendation performance and the level of independence: higher independence leads to better performance. Our method outperforms existing disentangled representation learning methods in the level of independence.
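The independence metric above can be computed directly from a matrix of learned representations; a minimal sketch (the representation matrices below are synthetic stand-ins):

```python
import numpy as np

def independence_score(Z):
    # 1 - mean absolute pairwise correlation across the d dimensions (Ma et al., 2019).
    d = Z.shape[1]
    corr = np.corrcoef(Z, rowvar=False)
    off_diag = np.abs(corr[np.triu_indices(d, k=1)])
    return 1.0 - off_diag.mean()

rng = np.random.default_rng(0)
Z_indep = rng.standard_normal((1000, 4))                      # nearly independent dims
Z_dup = np.repeat(rng.standard_normal((1000, 1)), 4, axis=1)  # fully correlated dims

print(independence_score(Z_dup))  # -> 0.0 (fully entangled dimensions)
```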

Figure 2. Disentanglement experiments. (a) Visualization of the learned causal disentangled representations; (b) impact of fine-grained level disentanglement on recommendation performance.

4. Conclusion

This work demonstrates the effectiveness of the CaD-VAE model in learning causal disentangled representations from user behavior. Our approach incorporates a causal layer implementing SCMs, allowing for the successful disentanglement of causally related concepts. Experimental results on four real-world datasets demonstrate that the proposed CaD-VAE model outperforms existing state-of-the-art methods for learning disentangled representations. In terms of future research, there is potential to investigate novel applications that can take advantage of the explainability and controllability offered by disentangled representations.

References

  • Bengio et al. (2013) Yoshua Bengio, Aaron Courville, and Pascal Vincent. 2013. Representation Learning: A Review and New Perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence 35, 8 (2013), 1798–1828. https://doi.org/10.1109/TPAMI.2013.50
  • Dupont (2018) Emilien Dupont. 2018. Learning disentangled joint continuous and discrete representations. Advances in Neural Information Processing Systems 31 (2018).
  • Higgins et al. (2017) Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. 2017. beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework. In International Conference on Learning Representations. https://openreview.net/forum?id=Sy2fzU9gl
  • Kim and Mnih (2018) Hyunjik Kim and Andriy Mnih. 2018. Disentangling by Factorising. In Proceedings of the 35th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 80), Jennifer Dy and Andreas Krause (Eds.). PMLR, 2649–2658. https://proceedings.mlr.press/v80/kim18b.html
  • Kumar et al. (2018) Abhishek Kumar, Prasanna Sattigeri, and Avinash Balakrishnan. 2018. Variational Inference of Disentangled Latent Concepts from Unlabeled Observations. In International Conference on Learning Representations. https://openreview.net/forum?id=H1kG7GZAW
  • Liang et al. (2018) Dawen Liang, Rahul G. Krishnan, Matthew D. Hoffman, and Tony Jebara. 2018. Variational Autoencoders for Collaborative Filtering. In Proceedings of the 2018 World Wide Web Conference (Lyon, France) (WWW ’18). International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, CHE, 689–698. https://doi.org/10.1145/3178876.3186150
  • Liu et al. (2021) Y. Liu, E. Sangineto, Y. Chen, L. Bao, H. Zhang, N. Sebe, B. Lepri, W. Wang, and M. Nadai. 2021. Smoothing the Disentangled Latent Style Space for Unsupervised Image-to-Image Translation. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE Computer Society, Los Alamitos, CA, USA, 10780–10789. https://doi.org/10.1109/CVPR46437.2021.01064
  • Ma et al. (2019) Jianxin Ma, Chang Zhou, Peng Cui, Hongxia Yang, and Wenwu Zhu. 2019. Learning Disentangled Representations for Recommendation. Advances in Neural Information Processing Systems 32 (2019).
  • Quessard et al. (2020) Robin Quessard, Thomas Barrett, and William Clements. 2020. Learning disentangled representations and group structure of dynamical environments. Advances in Neural Information Processing Systems 33 (2020), 19727–19737.
  • Shen et al. (2022) Xinwei Shen, Furui Liu, Hanze Dong, Qing Lian, Zhitang Chen, and Tong Zhang. 2022. Weakly Supervised Disentangled Generative Causal Representation Learning. Journal of Machine Learning Research 23 (2022), 1–55.
  • van der Maaten and Hinton (2008) Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing Data using t-SNE. Journal of Machine Learning Research 9, 86 (2008), 2579–2605. http://jmlr.org/papers/v9/vandermaaten08a.html
  • Wang et al. (2022) Xin Wang, Hong Chen, Si’ao Tang, Zihao Wu, and Wenwu Zhu. 2022. Disentangled Representation Learning. arXiv preprint arXiv:2211.11695 (2022).
  • Wang et al. (2023) Xin Wang, Hong Chen, Yuwei Zhou, Jianxin Ma, and Wenwu Zhu. 2023. Disentangled Representation Learning for Recommendation. IEEE Transactions on Pattern Analysis and Machine Intelligence 45, 1 (2023), 408–424. https://doi.org/10.1109/TPAMI.2022.3153112
  • Wang et al. (2020) Xiang Wang, Hongye Jin, An Zhang, Xiangnan He, Tong Xu, and Tat-Seng Chua. 2020. Disentangled Graph Collaborative Filtering. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (Virtual Event, China) (SIGIR ’20). Association for Computing Machinery, New York, NY, USA, 1001–1010. https://doi.org/10.1145/3397271.3401137
  • Yang et al. (2021) Mengyue Yang, Furui Liu, Zhitang Chen, Xinwei Shen, Jianye Hao, and Jun Wang. 2021. CausalVAE: Disentangled Representation Learning via Neural Structural Causal Models. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 9588–9597. https://doi.org/10.1109/CVPR46437.2021.00947
  • Yu et al. (2019) Yue Yu, Jie Chen, Tian Gao, and Mo Yu. 2019. DAG-GNN: DAG Structure Learning with Graph Neural Networks. In Proceedings of the 36th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 97), Kamalika Chaudhuri and Ruslan Salakhutdinov (Eds.). PMLR, 7154–7163. https://proceedings.mlr.press/v97/yu19a.html
  • Zhang et al. (2019) Shuai Zhang, Lina Yao, Aixin Sun, and Yi Tay. 2019. Deep Learning Based Recommender System: A Survey and New Perspectives. ACM Comput. Surv. 52, 1, Article 5 (feb 2019), 38 pages. https://doi.org/10.1145/3285029