
Treatment Effects Estimation on Networked Observational Data using Disentangled Variational Graph Autoencoder

Di Fan ([email protected], ORCID 0009-0001-6357-7849), School of Mathematical Sciences, Zhejiang University, 866 Yuhangtang Rd, Xihu Qu, Hangzhou, China 310012; Renlei Jiang ([email protected]), School of Mathematical Sciences, Zhejiang University, 866 Yuhangtang Rd, Xihu Qu, Hangzhou, China 310012; Yunhao Wen ([email protected]), Petrochina Engineering and Planning Institute, China; and Chuanhou Gao ([email protected], ORCID 0000-0001-9030-2042), School of Mathematical Sciences, Zhejiang University, 866 Yuhangtang Rd, Xihu Qu, Hangzhou, China 310012
Abstract.

Estimating individual treatment effect (ITE) from observational data has gained increasing attention across various domains, with a key challenge being the identification of latent confounders affecting both treatment and outcome. Networked observational data offer new opportunities to address this issue by utilizing network information to infer latent confounders. However, most existing approaches assume observed variables and network information serve only as proxy variables for latent confounders, which often fails in practice, as some variables influence treatment but not outcomes, and vice versa. Recent advances in disentangled representation learning, which disentangle latent factors into instrumental, confounding, and adjustment factors, have shown promise for ITE estimation. Building on this, we propose a novel disentangled variational graph autoencoder that learns disentangled factors for treatment effect estimation on networked observational data. Our graph encoder further ensures factor independence using the Hilbert-Schmidt Independence Criterion. Extensive experiments on two semi-synthetic datasets derived from real-world social networks and one synthetic dataset demonstrate that our method achieves state-of-the-art performance.

Causal inference, individual treatment effect, disentangled representations, networked observational data, variational graph autoencoder

1. Introduction

Research on causal effects between variables has received increasing attention. Among such problems, learning the individual-level effect of a treatment on an outcome is a fundamental question encountered by numerous researchers, with applications spanning various domains, including education (Ding and Lehrer, 2010), public policy (Athey and Imbens, 2016), economics (Zhang et al., 2021a; Gu et al., 2021), and healthcare (Shalit et al., 2017). For example, in a medical scenario, physicians seek to determine which treatment (such as which medication) is more beneficial for a patient's recovery (Wu et al., 2022). This naturally raises a question: how can we accurately infer the outcome if an instance were to receive an alternative treatment? This relates to the well-known problem of counterfactual outcome prediction (Pearl, 2009b). By predicting counterfactual outcomes, we can accurately estimate each individual's treatment effect, known as the individual treatment effect (ITE) (Rubin, 2005; Shalit et al., 2017), thereby assisting decision-making.

Randomized controlled trials (RCTs) are the gold standard for learning causal effects (Pearl, 2009b). In these trials, instances (experimental subjects) are randomly assigned to either the treatment or the control group. However, this is often costly, unethical, or even impractical (Guo et al., 2020a; Yao et al., 2021). Fortunately, the rapid expansion of big data in many fields offers significant opportunities for causal inference research (Winship and Morgan, 1999; Yao et al., 2021), as observational datasets are readily available and usually contain a large number of examples. Thus, we often concentrate on estimating treatment effects from observational data. Additionally, instances in such datasets are often intrinsically linked by auxiliary network structures, such as user-linked social networks. This type of data is typically referred to as networked observational data (Guo et al., 2020c; Huang et al., 2023).

In observational studies, treatment assignment often depends on specific attributes of an instance $\mathbf{x}$, leading to selection bias (Imbens and Rubin, 2015). In the medical scenario above, socioeconomic status influences both medication choice and patient recovery: higher socioeconomic status may increase access to expensive medications and positively impact health. Identifying and controlling for confounding factors (i.e., those affecting both treatment and outcome, thereby introducing selection bias in ITE estimation) is crucial for accurate predictions and presents the main challenge in learning ITE from observational data (Pearl, 2009a; Guo et al., 2020a). To address confounders, most existing methods assume strong ignorability (Johansson et al., 2016; Shalit et al., 2017; Yao et al., 2018), meaning all confounders are measurable and embedded within the observed features. However, this assumption is often unrealistic, as not all confounders can be measured. Bennett and Kallus (2019) proposed relaxing this assumption by using proxy variables for latent confounders. For networked observational data, several ITE estimation frameworks have been developed in recent years (Veitch et al., 2019; Guo et al., 2020c, 2021; Chu et al., 2021), which primarily leverage the network structure along with noisy, measurable observed variables as two sets of proxy variables to aid in learning and controlling for latent confounders. For instance, socioeconomic status can be inferred from easier-to-measure variables (e.g., postal codes, annual income) combined with social network patterns (e.g., community affiliation). While these methods have achieved empirical success, they focus on learning representations of latent confounding factors (latent confounders) to control confounding bias but overlook that some factors affect only the treatment, others affect only the outcome, and some may even be noise. In patient data, for example, age and socioeconomic status influence both treatment and outcome and thus act as confounding factors; the attending physician affects only the treatment and is referred to as an instrumental factor; genes and air temperature affect only the outcome and are referred to as adjustment factors; and information such as names and contact details constitutes noise factors. Using all patient features and network information solely to learn latent confounding factors introduces new biases (Abadie and Imbens, 2006; Häggström, 2018). Therefore, explicitly learning disentangled representations for these four types of latent factors is essential for accurately estimating ITE on networked observational data.

To address the aforementioned challenges, we present a novel generative framework based on the Variational Graph Autoencoder (VGAE) (Kipf and Welling, 2016b) for estimating individual treatment effects on networked observational data. We name our model Treatment effect estimation on Networked observational data by Disentangled Variational Graph Autoencoder (TNDVGA), which can effectively infer latent factors from proxy variables and auxiliary network information using a graph autoencoder, while employing the Hilbert-Schmidt Independence Criterion (HSIC) independence constraint to disentangle these factors into four mutually exclusive sets, thereby improving individual treatment effect estimation. Our main contributions are:

  • We propose a novel framework for learning individual treatment effect from networked observational data, termed TNDVGA, which can simultaneously learn representations of latent factors from both proxy variables and auxiliary network information while disentangling different latent factors to estimate treatment effect more effectively and accurately.

  • We introduce a kernel-based Hilbert-Schmidt Independence Criterion (HSIC) to assess the dependence between different representations of latent factors. This independence regularization is jointly optimized with other components of the model within a unified framework, enabling better learning of independent disentangled representations.

  • We perform extensive experiments to validate the effectiveness of our proposed framework TNDVGA. Results on multiple datasets indicate that TNDVGA achieves state-of-the-art performance, significantly outperforming baseline methods.

The rest of this article is organized as follows. Related work is reviewed in Section 2. Section 3 introduces the technical preliminaries and the problem statement. Section 4 describes the details of our proposed framework. We present comprehensive experimental results on different datasets in Section 5. Finally, Section 6 concludes our work and suggests directions for future research.

2. Related work

Three aspects of related work are reviewed in this section: (1) learning ITE from i.i.d. observational data; (2) learning ITE from networked observational data; and (3) disentangled representations for treatment effect estimation.

Learning ITE from i.i.d. observational data

Due to the substantial expense and occasional infeasibility of randomized experiments, there has been significant interest in estimating individual-level causal effects from observational data in recent years, especially with the emergence of big data. BART (Chipman et al., 2010) employed dimensionally adaptive random basis functions for causal effect estimation. Causal Forest (Wager and Athey, 2018) is a nonparametric approach that extends Breiman's random forest algorithm to estimate heterogeneous treatment effects. CFR (Shalit et al., 2017) is a representation learning approach that predicts ITE from observational data by projecting the original features into a latent space, capturing confounders by minimizing the prediction error on factual outcomes while reducing the imbalance between treatment and control groups. However, these methods depend on the strong ignorability assumption, which essentially ignores the effects of hidden confounding factors and is usually untenable in real-world observational studies. Various approaches have been suggested to relax this assumption. CEVAE (Louizos et al., 2017) followed the causal structure of inference with proxy variables and can simultaneously estimate the unknown latent space summarizing the confounders and the causal effect. Deep-Treat (Atan et al., 2018) employed a bias-removing autoencoder along with a policy optimization feedforward neural network to derive balanced representations and optimal policies from observational data. SITE (Yao et al., 2018) captured hidden confounders for individual treatment effect estimation through a local similarity-preserving method.

Learning ITE from networked observational data

Recently, the emergence of networked observational data in various real-world tasks has prompted several studies to relax the strong ignorability assumption by utilizing network information among instances, where the network also serves as a proxy for unobserved confounders. NetDeconf (Guo et al., 2020c) utilized network information and observed features to identify patterns of hidden confounders, enabling the learning of valid individual causal effects from networked observational data. CONE (Guo et al., 2020b) further employed Graph Attention Networks (GAT) to integrate network information, thereby mitigating hidden confounding effects. IGNITE (Guo et al., 2021) introduced a minimax game framework that simultaneously balances representations and predicts treatments to learn ITE from networked observational data. GIAL (Chu et al., 2021) leveraged the network structure to capture additional information by identifying imbalances within the network for estimating treatment effects. Thorat et al. (2023) utilized network information to mitigate hidden confounding bias when estimating ITE in networked observational studies with multiple treatments. However, these studies uniformly apply all feature information, including network information, to infer latent confounding factors without assuming disentanglement in treatment effect estimation, which may lead to estimation bias. In a network, the treatment administered to one instance may also influence the outcomes of its neighbors, a phenomenon known as spillover effects or interference (Arbour et al., 2016; Rakesh et al., 2018; Huang et al., 2023). Unlike previous works, we follow the assumption of Guo et al. (2020c) and Veitch et al. (2019) that conditioning on the latent confounders separates each individual's treatment and outcome from those of others. We leave the study of spillover effects as future work.

Disentangled representations for treatment effect estimation

From the perspective of causal representation learning, learning disentangled representations is one of the central challenges in machine learning (Schölkopf et al., 2021). Disentangled representations of latent factors derived from observational data can reduce the influence of instrumental factors and confounders on outcome prediction, thereby mitigating selection bias and significantly improving the accuracy of treatment effect estimation (Hassanpour and Greiner, 2019; Kuang et al., 2020b). Early methods primarily focused on variable decomposition (Kuang et al., 2017, 2020a), exploring treatment effect estimation by considering only adjustment variables and confounders as latent factors. This restricted approach resulted in imprecise confounder separation and hindered accurate estimation of individual treatment effects. Subsequently, many methods focused on decomposing pre-treatment variables into instrumental variables, confounding variables, and adjustment factors. For example, DRCFR (Hassanpour and Greiner, 2019) and DeR-CFR (Wu et al., 2022) disentangled latent factors into these three categories while balancing confounders and estimating treatment effects through counterfactual inference. Additionally, some methods imposed independence constraints on the model to achieve independent disentangled representations. RSB-Net (Zhang et al., 2019) utilized the Pearson Correlation Coefficient (PCC) to promote decorrelation between two sets of random variables. MIM-DRCFR (Cheng et al., 2022) introduced a method for learning disentangled representations by minimizing mutual information, while DeR-CFR employed an orthogonal loss to ensure that the representations of different learned latent factors contain uncorrelated information. Recently, an increasing number of methods based on the Variational Autoencoder (VAE) (Kingma, 2013) have been proposed to address disentanglement in individualized causal effect estimation. TEDVAE (Zhang et al., 2021b) employed a variational autoencoder to separate latent variables, incorporating a regularization term that included reconstruction losses for both treatments and outcomes. TVAE (Vowels et al., 2021) integrated noise factors and introduced a VAE with targeted learning regularization to estimate individual treatment effects. EDVAE (Liu et al., 2024) adopted a method of disentangling latent factors from both data and model perspectives for ITE estimation. VGANITE (Bao et al., 2022) combined VAE and Generative Adversarial Networks (GAN) (Goodfellow et al., 2014) to disentangle latent factors into three distinct sets. Intact-VAE (Wu and Fukumizu, 2021) emphasized the successful recovery of confounders through a novel prognostic score.

However, these studies primarily focus on estimating individualized causal effects from independent observational data. Given the importance of disentanglement in ITE estimation, it is crucial to incorporate this approach when estimating ITE from networked observational data. Additionally, independent regularizers need to be added to ensure that the information contained in features and network information is accurately transmitted to the corresponding representation spaces of each latent factor. Furthermore, our model takes into account often-overlooked noise factors, combines reconstruction losses for both treatment and outcome, and balances the distribution between treatment and control groups, all of which contribute to improved model performance.

3. Preliminaries

In this section, we first introduce the notations used in this article. We then outline the problem statement by providing the necessary technical preliminaries.

Notations

Throughout this work, we use unbolded lowercase letters (e.g., $t$) to denote scalars, bold lowercase letters (e.g., $\mathbf{x}$) to represent vectors, and bold uppercase letters (e.g., $\mathbf{A}$) for matrices. The $(i,j)$-th entry of a matrix $\mathbf{A}$ is denoted by $\mathbf{A}_{ij}$.

Networked observational data

In networked observational data, we define the features (covariates) of the $i$-th instance as $\mathbf{x}_{i}\in\mathbb{R}^{k}$, the treatment as $t_{i}$, and the outcome as $y_{i}\in\mathbb{R}$. We assume that all instances are connected through a network, represented by an adjacency matrix $\mathbf{A}$, and that the network is undirected with all edge weights equal (this work can be extended to weighted undirected networks and is also applicable to directed networks by utilizing specialized graph neural networks). Let $n$ denote the number of instances, so $\mathbf{A}\in\{0,1\}^{n\times n}$. The notation $\mathbf{A}_{ij}=\mathbf{A}_{ji}=1$ (or $0$) indicates the presence (or absence) of an edge between the $i$-th and $j$-th instances. Therefore, the tuple $(\{\mathbf{x}_{i},t_{i},y_{i}\}_{i=1}^{n},\mathbf{A})$ represents a networked observational dataset. Following the setup of (Shalit et al., 2017; Yao et al., 2018), we concentrate on cases where the treatment variable is binary, i.e., $t\in\{0,1\}$. Without loss of generality, $t_{i}=1$ and $t_{i}=0$ denote that the $i$-th instance is in the treatment or control group, respectively.

Next, we present the background knowledge necessary for learning individual treatment effects. We assume that for each pair of instance $i$ and treatment $t$, there exists a potential outcome $y_{i}^{t}$, representing the value that $y$ would take if treatment $t$ were applied to instance $i$ (Rubin, 1978). Note that only one potential outcome is observable, while the unobserved outcome $y_{i}^{1-t_{i}}$ is typically referred to as the counterfactual outcome. As a result, the observed outcome can be expressed as a function of the observed treatment and the potential outcomes, given by $y_{i}=t_{i}y_{i}^{1}+(1-t_{i})y_{i}^{0}$. The ITE for instance $i$ in the context of networked observational data is then defined as follows:

(1) \tau_{i}=\tau(\mathbf{x}_{i},\mathbf{A})=\mathbb{E}[y_{i}^{1}\mid\mathbf{x}_{i},\mathbf{A}]-\mathbb{E}[y_{i}^{0}\mid\mathbf{x}_{i},\mathbf{A}],

which measures the difference between the expected potential outcomes under treatment and control for instance $i$. Once the ITE has been estimated, the average treatment effect (ATE) can be obtained by averaging the ITE across all instances, $\text{ATE}=\frac{1}{n}\sum_{i=1}^{n}\tau_{i}$. Based on the aforementioned notations and definitions, we formally state the problem.

Definition 3.1 (Learning ITEs from Networked Observational Data).

Given the networked observational data $(\{\mathbf{x}_{i},t_{i},y_{i}\}_{i=1}^{n},\mathbf{A})$, our goal is to use the information from $(\mathbf{x}_{i},t_{i},y_{i})$ and the network adjacency matrix $\mathbf{A}$ to learn an estimate of the ITE $\tau_{i}$ for each instance $i$.

This paper is based on three essential assumptions necessary for estimating the individual treatment effect (Rosenbaum and Rubin, 1983):

Assumption 1 (Stable Unit Treatment Value Assumption (SUTVA)).

The potential outcomes for one unit are not affected by the treatment assigned to other units.

Assumption 2 (Overlap).

Each unit has a nonzero probability of receiving either treatment or control given the observed variables, i.e., $0<P(t=1\mid\mathbf{x})<1$.

Assumption 3 (Unconfoundedness).

Treatment assignment is independent of the potential outcomes when conditioning on the latent confounding factors, i.e., $t\perp\!\!\!\perp(y^{0},y^{1})\mid\mathbf{z}_{c}$. This assumption is a relaxed version of the unconfoundedness assumption commonly used in causal inference, as it allows for the presence of hidden confounders.

4. Methodology

In this section, we will first present a theorem on the identifiability of the individual treatment effect. Then, we introduce our TNDVGA framework designed to learn from networked observational data.

4.1. Identifiability

We introduce TNDVGA for estimating treatment effects based on the assumption that the observed covariates $\mathbf{x}$ and the network patterns $\mathbf{A}$ can be regarded as generated from four distinct sets of latent factors $\mathbf{z}=(\mathbf{z}_{t},\mathbf{z}_{c},\mathbf{z}_{y},\mathbf{z}_{o})$. In this context, $\mathbf{z}_{t}$ represents latent instrumental factors that influence the treatment but not the outcome, $\mathbf{z}_{c}$ includes latent confounding factors (latent confounders) that influence both the treatment and the outcome, $\mathbf{z}_{y}$ consists of latent adjustment factors that affect the outcome without affecting the treatment, and $\mathbf{z}_{o}$ refers to latent noise factors, i.e., covariates unrelated to either the treatment or the outcome. The proposed causal graph for ITE estimation is shown in Fig. 1. By explicitly modeling these four latent factors, the framework acknowledges that not all observed variables act as proxies for confounding factors and instead facilitates learning the various types of unobserved factors.

Figure 1. The causal diagram of the proposed TNDVGA. $\mathbf{x}$ represents the observed variables, $\mathbf{A}$ denotes the network structure, $t$ is the treatment, $y$ is the outcome, $\mathbf{z}_{t}$ denotes latent instrumental factors affecting only the treatment, $\mathbf{z}_{c}$ denotes latent confounding factors, $\mathbf{z}_{y}$ denotes latent adjustment factors affecting only the outcome, and $\mathbf{z}_{o}$ denotes latent noise factors unrelated to both treatment and outcome.

Utilizing network observational data, we formulate and prove the following theorem about the identifiability of individual treatment effects:

Theorem 4.1 (Identifiability of ITE).

If we recover $p(\mathbf{z}_{c},\mathbf{z}_{y}\mid\mathbf{x},\mathbf{A})$ and $p(y\mid t,\mathbf{z}_{c},\mathbf{z}_{y})$, then the proposed TNDVGA can recover the individual treatment effect (ITE) from networked observational data.

Proof.

According to the aforementioned assumptions and networked observational data, the potential outcome distribution for any instance 𝐱\mathbf{x} can be calculated as follows:

(2) \begin{split}&p(y^{t}\mid\mathbf{x},\mathbf{A})\\
&\overset{(i)}{=}\int_{\{\mathbf{z}_{t},\mathbf{z}_{c},\mathbf{z}_{y},\mathbf{z}_{o}\}}p(y^{t}\mid\mathbf{z}_{t},\mathbf{z}_{c},\mathbf{z}_{y},\mathbf{z}_{o})p(\mathbf{z}_{t},\mathbf{z}_{c},\mathbf{z}_{y},\mathbf{z}_{o}\mid\mathbf{x},\mathbf{A})d\mathbf{z}_{t}d\mathbf{z}_{c}d\mathbf{z}_{y}d\mathbf{z}_{o}\\
&\overset{(ii)}{=}\int_{\{\mathbf{z}_{t},\mathbf{z}_{c},\mathbf{z}_{y},\mathbf{z}_{o}\}}p(y^{t}\mid t,\mathbf{z}_{t},\mathbf{z}_{c},\mathbf{z}_{y},\mathbf{z}_{o})p(\mathbf{z}_{t},\mathbf{z}_{c},\mathbf{z}_{y},\mathbf{z}_{o}\mid\mathbf{x},\mathbf{A})d\mathbf{z}_{t}d\mathbf{z}_{c}d\mathbf{z}_{y}d\mathbf{z}_{o}\\
&\overset{(iii)}{=}\int_{\{\mathbf{z}_{t},\mathbf{z}_{c},\mathbf{z}_{y},\mathbf{z}_{o}\}}p(y\mid t,\mathbf{z}_{t},\mathbf{z}_{c},\mathbf{z}_{y},\mathbf{z}_{o})p(\mathbf{z}_{t},\mathbf{z}_{c},\mathbf{z}_{y},\mathbf{z}_{o}\mid\mathbf{x},\mathbf{A})d\mathbf{z}_{t}d\mathbf{z}_{c}d\mathbf{z}_{y}d\mathbf{z}_{o}\\
&\overset{(iv)}{=}\int_{\{\mathbf{z}_{c},\mathbf{z}_{y}\}}p(y\mid t,\mathbf{z}_{c},\mathbf{z}_{y})p(\mathbf{z}_{c},\mathbf{z}_{y}\mid\mathbf{x},\mathbf{A})d\mathbf{z}_{c}d\mathbf{z}_{y}.\end{split}

Equality (i) is a straightforward expectation over $p(\mathbf{z}_{t},\mathbf{z}_{c},\mathbf{z}_{y},\mathbf{z}_{o}\mid\mathbf{x},\mathbf{A})$, equality (ii) follows from Assumption 3, i.e., the conditional independence $t\perp\!\!\!\perp(y^{0},y^{1})\mid\mathbf{z}_{c}$, equality (iii) is derived from the commonly used consistency assumption (Imbens and Rubin, 2015), and equality (iv) follows from the Markov property $y\perp\!\!\!\perp\mathbf{z}_{t},\mathbf{z}_{o}\mid t,\mathbf{z}_{c},\mathbf{z}_{y}$. Thus, if we can model $p(\mathbf{z}_{c},\mathbf{z}_{y}\mid\mathbf{x},\mathbf{A})$ and $p(y\mid t,\mathbf{z}_{c},\mathbf{z}_{y})$ correctly, then the ITE can be identified. ∎

Previous work by Zhang et al. (2021b) derived an identifiability proof under the ignorability assumption, based on inferring the relevant parent factors from proxy variables and/or other observed variables, and we take inspiration from their approach. In contrast to their model, ours includes latent noise factors, which makes the composition of the latent variables closer to reality, and it additionally exploits the network information alongside the proxy variables $\mathbf{x}$. With these two modifications, we establish the identifiability of ITE in Theorem 4.1, which highlights the importance of distinguishing between different latent factors and utilizing only the appropriate ones for treatment effect estimation on networked observational data.

4.2. The proposed framework: TNDVGA

An overview of the proposed framework, TNDVGA, which learns individual treatment effects from networked observational data, is shown in Fig. 2. The framework consists of three key components: (1) learning disentangled latent factors through a Variational Graph Autoencoder (VGAE); (2) predicting potential outcomes and treatment assignments; and (3) enforcing independence of the latent factors. We provide a detailed explanation of these components in the following sections.

4.2.1. Learning Disentangled Latent factors through VGAE

From the theoretical analysis in the previous section, we have seen that eliminating unnecessary factors is essential to effectively and accurately estimating the treatment effect. However, in practice, we do not know the mechanism of generating 𝐱\mathbf{x} from 𝐳\mathbf{z} and the mechanism of disentangling 𝐳\mathbf{z} into different disjoint sets. This requires us to propose a method that can learn to disentangle the latent factors 𝐳\mathbf{z} and estimate ITE through what the model has learned.

Figure 2. The overall architecture of TNDVGA consists of a generative network and an inference network for disentangling latent factors.

Therefore, we aim to infer the posterior distribution $p_{\theta}(\mathbf{z}\mid\mathbf{x},\mathbf{A})$ of the latent factors $\mathbf{z}$ from the observed proxy covariates $\mathbf{x}$ and the network information $\mathbf{A}$, while disentangling $\mathbf{z}$ into latent instrumental factors $\mathbf{z}_{t}$, confounding factors $\mathbf{z}_{c}$, adjustment factors $\mathbf{z}_{y}$, and noise factors $\mathbf{z}_{o}$. Since exact inference is intractable, we use the variational inference framework to approximate the posterior with a tractable distribution. We adopt Variational Graph Autoencoders (VGAEs) to construct our model. Proposed by Kipf and Welling (2016b), VGAEs extend Variational Autoencoders (VAEs) to account for graph structure in the data. For every observed variable $\mathbf{x}$, a VGAE defines a multi-dimensional latent variable $\mathbf{z}$, and it relies on the adjacency matrix $\mathbf{A}$, which is utilized by the Graph Neural Network (GNN) in the encoder to enforce the structure of the posterior approximation $q_{\phi}(\mathbf{z}\mid\mathbf{x},\mathbf{A})$. As shown in Fig. 2, we use four separate encoders to approximate the variational posteriors $q_{\phi_{t}}(\mathbf{z}_{t}\mid\mathbf{x},\mathbf{A})$, $q_{\phi_{c}}(\mathbf{z}_{c}\mid\mathbf{x},\mathbf{A})$, $q_{\phi_{y}}(\mathbf{z}_{y}\mid\mathbf{x},\mathbf{A})$, and $q_{\phi_{o}}(\mathbf{z}_{o}\mid\mathbf{x},\mathbf{A})$, disentangling the latent variable $\mathbf{z}$ into $\mathbf{z}_{t}$, $\mathbf{z}_{c}$, $\mathbf{z}_{y}$, and $\mathbf{z}_{o}$, respectively. (Unlike (Louizos et al., 2017), our method does not employ $t$ and $y$ as inputs to the encoder, because we assume that $t$ and $y$ are generated by the latent factors; the inference of the latent factors therefore relies solely on $\mathbf{x}$. For additional information, see (Zhang et al., 2021b).) These four latent factors are then used by the decoder $p_{\theta}(\mathbf{x}\mid\mathbf{z}_{t},\mathbf{z}_{c},\mathbf{z}_{y},\mathbf{z}_{o})$ to reconstruct $\mathbf{x}$, $t$, and $y$. (Note that, as shown in Fig. 1, we have the independence property $\mathbf{x}\perp\!\!\!\perp\mathbf{A}\mid\{\mathbf{z}_{t},\mathbf{z}_{c},\mathbf{z}_{y},\mathbf{z}_{o}\}$, so the original VGAE decoder satisfies $p_{\theta}(\mathbf{x}\mid\mathbf{z}_{t},\mathbf{z}_{c},\mathbf{z}_{y},\mathbf{z}_{o},\mathbf{A})=p_{\theta}(\mathbf{x}\mid\mathbf{z}_{t},\mathbf{z}_{c},\mathbf{z}_{y},\mathbf{z}_{o})$; the derivations for $t$ and $y$ are similar.) Following the standard VGAE design, we select the prior distributions $p(\mathbf{z}_{t})$, $p(\mathbf{z}_{c})$, $p(\mathbf{z}_{y})$, and $p(\mathbf{z}_{o})$ as factorized Gaussian distributions:

(3) \begin{split}p(\mathbf{z}_{t})=\prod_{j=1}^{d_{\mathbf{z}_{t}}}\mathcal{N}(\{\mathbf{z}_{t}\}_{j}\mid 0,1);\quad p(\mathbf{z}_{c})=\prod_{j=1}^{d_{\mathbf{z}_{c}}}\mathcal{N}(\{\mathbf{z}_{c}\}_{j}\mid 0,1);\\
p(\mathbf{z}_{y})=\prod_{j=1}^{d_{\mathbf{z}_{y}}}\mathcal{N}(\{\mathbf{z}_{y}\}_{j}\mid 0,1);\quad p(\mathbf{z}_{o})=\prod_{j=1}^{d_{\mathbf{z}_{o}}}\mathcal{N}(\{\mathbf{z}_{o}\}_{j}\mid 0,1),\end{split}

where $d_{\mathbf{z}_{t}}$, $d_{\mathbf{z}_{c}}$, $d_{\mathbf{z}_{y}}$, and $d_{\mathbf{z}_{o}}$ represent the dimensions of the latent instrumental, confounding, adjustment, and noise factors, respectively, and $\{\mathbf{z}_{t}\}_{j}$ denotes the $j$-th dimension of $\mathbf{z}_{t}$; the same notation applies to $\mathbf{z}_{c}$, $\mathbf{z}_{y}$, and $\mathbf{z}_{o}$.

The probabilistic representation of the generative model for 𝐱\mathbf{x}, tt, and yy is as follows:

(4) p_{\theta_{\mathbf{x}}}(\mathbf{x}\mid\mathbf{z}_{t},\mathbf{z}_{c},\mathbf{z}_{y},\mathbf{z}_{o})=\prod_{j=1}^{k}\mathcal{N}(\mu_{j}=f_{1j}(\mathbf{z}_{\{t,c,y,o\}}),\sigma_{j}^{2}=f_{2j}(\mathbf{z}_{\{t,c,y,o\}})),
(5) p_{\theta_{t}}(t\mid\mathbf{z}_{t},\mathbf{z}_{c})=Bern(\sigma(f_{3}(\mathbf{z}_{c},\mathbf{z}_{t}))),
(6) \begin{split}p_{\theta_{y}}(y\mid t,\mathbf{z}_{c},\mathbf{z}_{y})&=\mathcal{N}(\mu=\hat{\mu},\sigma^{2}={\hat{\sigma}}^{2}),\\
\hat{\mu}=tf_{4}(\mathbf{z}_{c},\mathbf{z}_{y})+(1-t)f_{5}(\mathbf{z}_{c},\mathbf{z}_{y});\quad{\hat{\sigma}}^{2}=tf_{6}(\mathbf{z}_{c},\mathbf{z}_{y})+(1-t)f_{7}(\mathbf{z}_{c},\mathbf{z}_{y}),\end{split}

where f1f_{1} to f7f_{7} are functions parameterized by fully connected neural networks, σ()\sigma(\cdot) represents the logistic function, and BernBern refers to the Bernoulli distribution. The distribution of 𝐱\mathbf{x} should be chosen based on the dataset, and in our case, we approximate it with a Gaussian distribution, as the data we use consists of continuous variables. Similarly, for the continuous outcome variable yy, we also parameterize it as a Gaussian distribution, where the mean and variance are defined by two separate neural networks defining p(yt=1,𝐳c,𝐳y)p(y\mid t=1,\mathbf{z}_{c},\mathbf{z}_{y}) and p(yt=0,𝐳c,𝐳y)p(y\mid t=0,\mathbf{z}_{c},\mathbf{z}_{y}), following the two-headed approach proposed by Shalit et al. (2017).

In the inference model, since we input the network information 𝐀\mathbf{A} into the encoder, we design the encoder based on the idea of Variational Graph Autoencoders (VGAEs). Specifically, we utilize Graph Convolutional Networks (GCNs) (Kipf and Welling, 2016a) as the encoder to obtain latent factor representations. GCN has been shown to effectively handle non-Euclidean data, such as graph-structured data, across diverse settings. To simplify notation, we describe the message propagation rule using a single GCN layer, as shown below:

(7) \mathbf{h}=GCN(\mathbf{x},\mathbf{A})={Relu}((\hat{\mathbf{A}}\mathbf{X})_{\mathbf{x}}\mathbf{W})={Relu}((\tilde{\mathbf{D}}^{-\frac{1}{2}}\tilde{\mathbf{A}}\tilde{\mathbf{D}}^{-\frac{1}{2}}\mathbf{X})_{\mathbf{x}}\mathbf{W}),

where $\mathbf{h}\in\mathbb{R}^{d}$ is the output vector of the GCN, $\mathbf{X}\in\mathbb{R}^{n\times k}$ is the feature matrix of the instances, $(\hat{\mathbf{A}}\mathbf{X})_{\mathbf{x}}$ denotes the row of the matrix product corresponding to instance $\mathbf{x}$, $\tilde{\mathbf{A}}=\mathbf{A}+\mathbf{I}_{n}$ with $\mathbf{I}_{n}$ the identity matrix, $\tilde{\mathbf{D}}_{ii}=\sum_{j=1}^{n}\tilde{\mathbf{A}}_{ij}$, and $\mathbf{W}\in\mathbb{R}^{k\times d}$ is the weight matrix. $Relu(\cdot)$ denotes the ReLU activation function. This leads to the following definition of the variational approximation of the posterior distributions of the latent factors:

(8) \begin{split}q_{\phi_{t}}(\mathbf{z}_{t}\mid\mathbf{x},\mathbf{A})=\mathcal{N}({\boldsymbol{\mu}}={\hat{\boldsymbol{\mu}}}_{t},{\rm{diag}}(\boldsymbol{\sigma}^{2})={\rm{diag}}({\hat{\boldsymbol{\sigma}}}_{t}^{2})),\\
q_{\phi_{c}}(\mathbf{z}_{c}\mid\mathbf{x},\mathbf{A})=\mathcal{N}({\boldsymbol{\mu}}={\hat{\boldsymbol{\mu}}}_{c},{\rm{diag}}(\boldsymbol{\sigma}^{2})={\rm{diag}}({\hat{\boldsymbol{\sigma}}}_{c}^{2})),\\
q_{\phi_{y}}(\mathbf{z}_{y}\mid\mathbf{x},\mathbf{A})=\mathcal{N}({\boldsymbol{\mu}}={\hat{\boldsymbol{\mu}}}_{y},{\rm{diag}}(\boldsymbol{\sigma}^{2})={\rm{diag}}({\hat{\boldsymbol{\sigma}}}_{y}^{2})),\\
q_{\phi_{o}}(\mathbf{z}_{o}\mid\mathbf{x},\mathbf{A})=\mathcal{N}({\boldsymbol{\mu}}={\hat{\boldsymbol{\mu}}}_{o},{\rm{diag}}(\boldsymbol{\sigma}^{2})={\rm{diag}}({\hat{\boldsymbol{\sigma}}}_{o}^{2})),
\end{split}

where 𝝁^t\hat{\boldsymbol{\mu}}_{t}, 𝝁^c\hat{\boldsymbol{\mu}}_{c}, 𝝁^y\hat{\boldsymbol{\mu}}_{y}, 𝝁^o\hat{\boldsymbol{\mu}}_{o} and diag(𝝈^t2){\rm{diag}}(\hat{\boldsymbol{\sigma}}_{t}^{2}), diag(𝝈^c2){\rm{diag}}(\hat{\boldsymbol{\sigma}}_{c}^{2}), diag(𝝈^y2){\rm{diag}}(\hat{\boldsymbol{\sigma}}_{y}^{2}), diag(𝝈^o2){\rm{diag}}(\hat{\boldsymbol{\sigma}}_{o}^{2}) are the means and covariance matrix of the Gaussian distributions, parameterized by the GCN as shown in Equation (7). Additionally, 𝝁^t\hat{\boldsymbol{\mu}}_{t} and log𝝈t2{\rm log}\,{\boldsymbol{\sigma}}_{t}^{2} are learned from two GCNs that share the training parameters of the first layer, and the same applies to the remaining three pairs.
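To make the encoder concrete, the following is a minimal PyTorch sketch of one of the four factor-specific graph encoders, assuming a dense adjacency matrix; the names (FactorEncoder, gcn_propagate) and hidden sizes are illustrative assumptions and not taken from our released implementation.

```python
import torch
import torch.nn as nn

def gcn_propagate(h, adj):
    # Symmetrically normalized propagation D^{-1/2} (A + I) D^{-1/2} H, cf. Eq. (7)
    # (applying the weight matrix before propagation is algebraically equivalent).
    n = adj.size(0)
    a_tilde = adj + torch.eye(n, device=adj.device)
    d_inv_sqrt = a_tilde.sum(dim=1).pow(-0.5)
    return d_inv_sqrt.unsqueeze(1) * (a_tilde @ (d_inv_sqrt.unsqueeze(1) * h))

class FactorEncoder(nn.Module):
    """One of the four graph encoders q(z_* | x, A): a shared first GCN layer and
    separate second-layer heads for the mean and log-variance (illustrative sketch)."""
    def __init__(self, in_dim, hidden_dim, latent_dim):
        super().__init__()
        self.w_shared = nn.Linear(in_dim, hidden_dim, bias=False)
        self.w_mu = nn.Linear(hidden_dim, latent_dim, bias=False)
        self.w_logvar = nn.Linear(hidden_dim, latent_dim, bias=False)

    def forward(self, x, adj):
        h = torch.relu(gcn_propagate(self.w_shared(x), adj))  # shared first GCN layer
        mu = gcn_propagate(self.w_mu(h), adj)                  # mean head, cf. Eq. (8)
        logvar = gcn_propagate(self.w_logvar(h), adj)          # log-variance head
        return mu, logvar

def reparameterize(mu, logvar):
    # z = mu + sigma * eps with eps ~ N(0, I)
    return mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
```

In TNDVGA, four such encoders (for $\mathbf{z}_{t}$, $\mathbf{z}_{c}$, $\mathbf{z}_{y}$, and $\mathbf{z}_{o}$) would be instantiated with the same inputs but separate parameters.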

4.2.2. Predicting Potential Outcomes and Treatment Assignments

The latent factors 𝐳t\mathbf{z}_{t} and 𝐳c\mathbf{z}_{c} are associated with the treatment tt, whereas 𝐳c\mathbf{z}_{c} and 𝐳y\mathbf{z}_{y} are associated with the outcomes yy, as illustrated in Fig. 1. To ensure that the treatment information is effectively captured by the union of 𝐳t\mathbf{z}_{t} and 𝐳c\mathbf{z}_{c}, we add an auxiliary classifier to predict tt from the encoder’s output, under the assumption that 𝐳t\mathbf{z}_{t} and 𝐳c\mathbf{z}_{c} can accurately predict tt. Additionally, yy is predicted using two regression networks under different treatments to ensure that the outcome information is captured by the union of 𝐳c\mathbf{z}_{c} and 𝐳y\mathbf{z}_{y}, based on the assumption that 𝐳c\mathbf{z}_{c} and 𝐳y\mathbf{z}_{y} can accurately predict yy. Inspired by related approaches (Zhang et al., 2021b; Liu et al., 2024), the classifier and regression networks are defined as follows:

(9) q_{\eta_{t}}(t\mid\mathbf{z}_{t},\mathbf{z}_{c})=Bern(\sigma(h_{1}(\mathbf{z}_{c},\mathbf{z}_{t}))),
(10) \begin{split}q_{\eta_{y}}(y\mid t,\mathbf{z}_{c},\mathbf{z}_{y})&=\mathcal{N}(\mu=\hat{\mu},\sigma^{2}={\hat{\sigma}}^{2}),\\
\hat{\mu}=th_{2}(\mathbf{z}_{c},\mathbf{z}_{y})+(1-t)h_{3}(\mathbf{z}_{c},\mathbf{z}_{y}),\quad{\hat{\sigma}}^{2}=th_{4}(\mathbf{z}_{c},\mathbf{z}_{y})+(1-t)h_{5}(\mathbf{z}_{c},\mathbf{z}_{y}),\end{split}

where h1h_{1} to h5h_{5} are functions parameterized by fully connected neural networks, and the distribution settings are similar to those in Equations (5) and (6).
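For concreteness, below is a minimal PyTorch sketch of the auxiliary treatment classifier of Eq. (9) and a two-headed outcome regressor in the spirit of Eq. (10); only the outcome means are modeled, and the hidden sizes and class names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TreatmentClassifier(nn.Module):
    """Auxiliary classifier q(t | z_t, z_c), cf. Eq. (9) (sketch)."""
    def __init__(self, dim_zt, dim_zc, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim_zt + dim_zc, hidden), nn.ELU(), nn.Linear(hidden, 1))

    def forward(self, z_t, z_c):
        # Returns the logit of P(t = 1 | z_t, z_c).
        return self.net(torch.cat([z_t, z_c], dim=-1)).squeeze(-1)

class TwoHeadedOutcome(nn.Module):
    """Outcome model q(y | t, z_c, z_y), cf. Eq. (10): one regression head per
    treatment arm, following the two-headed design of Shalit et al. (2017);
    the variance heads are omitted for brevity."""
    def __init__(self, dim_zc, dim_zy, hidden=64):
        super().__init__()
        def head():
            return nn.Sequential(
                nn.Linear(dim_zc + dim_zy, hidden), nn.ELU(), nn.Linear(hidden, 1))
        self.mu0, self.mu1 = head(), head()

    def forward(self, t, z_c, z_y):
        z = torch.cat([z_c, z_y], dim=-1)
        y0, y1 = self.mu0(z).squeeze(-1), self.mu1(z).squeeze(-1)
        y_factual = torch.where(t.bool(), y1, y0)  # head matching the observed treatment
        return y_factual, (y0, y1)
```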

4.2.3. Enforcing Independence of Latent factors

Explicitly enhancing the independence of disentangled latent factors encourages the graph encoder to more effectively capture distinct and mutually independent information associated with each latent factor. In the following, we detail the regularization applied to enforce independence among the latent factors.

The goal of our method is for the encoder to capture disentangled latent factors—namely, 𝐳y\mathbf{z}_{y}, 𝐳c\mathbf{z}_{c}, 𝐳t\mathbf{z}_{t}, and 𝐳o\mathbf{z}_{o}—that each contain exclusive information. This requires increasing the statistical independence between these latent factors to further strengthen disentanglement. Given the high dimensionality of the latent factors, using histogram-based measures is infeasible. Therefore, we use the Hilbert-Schmidt Independence Criterion (HSIC) (Gretton et al., 2005) to promote sufficient independence among different latent factors.

Specifically, let 𝐳t,\mathbf{z}_{t,*} represent the d𝐳td_{\mathbf{z}_{t}}-dimensional random variable corresponding to the latent factor 𝐳t\mathbf{z}_{t}. Consider a measurable, positive definite kernel κt\kappa_{t} defined over the domain of 𝐳t,\mathbf{z}_{t,*}, with its associated Reproducing Kernel Hilbert Space (RKHS) denoted by t\mathcal{H}_{t}. The mapping function ψt()\psi_{t}(\cdot) transforms 𝐳t,\mathbf{z}_{t,*} into t\mathcal{H}_{t} according to the kernel κt\kappa_{t}. Similarly, for 𝐳y\mathbf{z}_{y}, 𝐳c\mathbf{z}_{c}, and 𝐳o\mathbf{z}_{o}, the same definitions apply. Given a pair of latent factors 𝐳t\mathbf{z}_{t} and 𝐳c\mathbf{z}_{c}, where 𝐳t,\mathbf{z}_{t,*} and 𝐳c,\mathbf{z}_{c,*} are jointly sampled from the distribution p(𝐳t,,𝐳c,)p(\mathbf{z}_{t,*},\mathbf{z}_{c,*}), the cross-covariance operator 𝒞𝐳t,,𝐳c,\mathcal{C}_{\mathbf{z}_{t,*},\mathbf{z}_{c,*}} in the RKHS of κt\kappa_{t} and κc\kappa_{c} is defined as:

(11) {\mathcal{C}}_{{\mathbf{z}}_{t,*},\mathbf{z}_{c,*}}=\mathbb{E}_{p(\mathbf{z}_{t,*},\mathbf{z}_{c,*})}\left[(\psi_{t}(\mathbf{z}_{t,*})-\boldsymbol{\mu}_{\mathbf{z}_{t,*}})^{\mathsf{T}}(\psi_{c}(\mathbf{z}_{c,*})-\boldsymbol{\mu}_{\mathbf{z}_{c,*}})\right],

where $\boldsymbol{\mu}_{\mathbf{z}_{t,*}}=\mathbb{E}(\psi_{t}(\mathbf{z}_{t,*}))$ and $\boldsymbol{\mu}_{\mathbf{z}_{c,*}}=\mathbb{E}(\psi_{c}(\mathbf{z}_{c,*}))$. Then, HSIC is defined as follows:

(12) {\rm{HSIC}}({\mathbf{z}}_{t,*},{\mathbf{z}}_{c,*}):={\|{\mathcal{C}}_{{\mathbf{z}}_{t,*},\mathbf{z}_{c,*}}\|}_{\rm HS}^{2},

where $\|\cdot\|_{\rm HS}$ is the Hilbert-Schmidt norm, which generalizes the Frobenius norm on matrices. It is known that for two random variables $\mathbf{z}_{t,*}$ and $\mathbf{z}_{c,*}$ and characteristic kernels $\kappa_{\mathbf{z}_{t,*}}$ and $\kappa_{\mathbf{z}_{c,*}}$, if $\mathbb{E}[\kappa_{\mathbf{z}_{t,*}}({\mathbf{z}_{t,*}},{\mathbf{z}_{t,*}})]<\infty$ and $\mathbb{E}[\kappa_{\mathbf{z}_{c,*}}({\mathbf{z}_{c,*}},{\mathbf{z}_{c,*}})]<\infty$, then ${\rm HSIC}({\mathbf{z}}_{t,*},{\mathbf{z}}_{c,*})=0$ if and only if ${\mathbf{z}}_{t,*}\perp\!\!\!\perp{\mathbf{z}}_{c,*}$. In practice, we employ an unbiased estimator of ${\rm HSIC}({\mathbf{z}}_{t,*},{\mathbf{z}}_{c,*})$ with $n$ samples (Song et al., 2012), defined as:

(13) {\rm{HSIC}}({\mathbf{z}}_{t,*},{\mathbf{z}}_{c,*})=\frac{1}{n(n-3)}\left[{\rm tr}(\tilde{\mathbf{U}}\tilde{\mathbf{V}}^{\mathsf{T}})+\frac{\mathbf{1}^{\mathsf{T}}\tilde{\mathbf{U}}\mathbf{1}\mathbf{1}^{\mathsf{T}}\tilde{\mathbf{V}}^{\mathsf{T}}\mathbf{1}}{(n-1)(n-2)}-\frac{2}{n-2}\mathbf{1}^{\mathsf{T}}\tilde{\mathbf{U}}\tilde{\mathbf{V}}^{\mathsf{T}}\mathbf{1}\right],

where $\tilde{\mathbf{U}}$ and $\tilde{\mathbf{V}}$ denote the Gram matrices computed with $\kappa_{\mathbf{z}_{t,*}}$ and $\kappa_{\mathbf{z}_{c,*}}$, respectively, with their diagonal elements set to zero. In our approach, we employ the radial basis function (RBF) kernel. The analysis for the other pairs of latent factors follows similarly.

The advantage of using Equation (13) to measure the dependence between different latent factors lies in its ability to capture more complex, nonlinear dependencies by mapping latent factors into the RKHS. The HSIC estimator we employ is unbiased, which is both effective and computationally efficient.
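Below is a small PyTorch sketch of the unbiased HSIC estimator of Eq. (13) with RBF kernels; the kernel bandwidth sigma is a hyperparameter introduced here for illustration rather than a value reported above.

```python
import torch

def rbf_gram(z, sigma=1.0):
    # RBF Gram matrix K_ij = exp(-||z_i - z_j||^2 / (2 sigma^2))
    sq_dists = torch.cdist(z, z) ** 2
    return torch.exp(-sq_dists / (2.0 * sigma ** 2))

def hsic_unbiased(z_a, z_b, sigma=1.0):
    """Unbiased HSIC estimator of Eq. (13) (Song et al., 2012); illustrative sketch."""
    n = z_a.size(0)                                  # requires n > 3
    u = rbf_gram(z_a, sigma)
    v = rbf_gram(z_b, sigma)
    u = u - torch.diag_embed(torch.diagonal(u))      # tilde-U: zero the diagonal
    v = v - torch.diag_embed(torch.diagonal(v))      # tilde-V: zero the diagonal
    one = torch.ones(n, 1, device=z_a.device, dtype=z_a.dtype)
    term1 = torch.trace(u @ v)
    term2 = (one.T @ u @ one) * (one.T @ v @ one) / ((n - 1) * (n - 2))
    term3 = 2.0 / (n - 2) * (one.T @ u @ v @ one)
    return ((term1 + term2 - term3) / (n * (n - 3))).squeeze()
```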

4.3. Loss Function of TNDVGA

In this section, we design a loss function that combines all the key components of ITE estimation, thereby facilitating the end-to-end training of disentangled latent factor representations.

4.3.1. Loss for VGAE

The encoder and decoder parameters can be learned by minimizing the negative evidence lower bound (ELBO), consistent with the standard VGAE (Kipf and Welling, 2016b), where ii denotes the ii-th instance:

(14) \begin{split}\mathcal{L}_{\rm ELBO}(\mathbf{x}_{i},t_{i},y_{i})=\,&-\mathbb{E}_{q_{\phi_{t_{i}}}q_{\phi_{c_{i}}}q_{\phi_{y_{i}}}q_{\phi_{o_{i}}}}[{\rm log}\,p_{\theta_{\mathbf{x}_{i}}}({\mathbf{x}_{i}}\mid\mathbf{z}_{t,i},\mathbf{z}_{c,i},\mathbf{z}_{y,i},\mathbf{z}_{o,i})+{\rm log}\,p_{\theta_{t_{i}}}(t_{i}\mid\mathbf{z}_{t,i},\mathbf{z}_{c,i})\\
&+\,{\rm log}\,p_{\theta_{y_{i}}}(y_{i}\mid t_{i},\mathbf{z}_{c,i},\mathbf{z}_{y,i})]+D_{KL}(q_{\phi_{t_{i}}}(\mathbf{z}_{t,i}\mid\mathbf{x}_{i},\mathbf{A})\|p(\mathbf{z}_{t,i}))\\
&+\,D_{KL}(q_{\phi_{c_{i}}}(\mathbf{z}_{c,i}\mid\mathbf{x}_{i},\mathbf{A})\|p(\mathbf{z}_{c,i}))+D_{KL}(q_{\phi_{y_{i}}}(\mathbf{z}_{y,i}\mid\mathbf{x}_{i},\mathbf{A})\|p(\mathbf{z}_{y,i}))\\
&+\,D_{KL}(q_{\phi_{o_{i}}}(\mathbf{z}_{o,i}\mid\mathbf{x}_{i},\mathbf{A})\|p(\mathbf{z}_{o,i})).\end{split}
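Each KL term in Eq. (14) is between a diagonal Gaussian posterior and a standard normal prior and therefore has the usual closed form; a short sketch (assuming the encoder returns the mean and log-variance as in the encoder sketch above):

```python
import torch

def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), averaged over the batch,
    as used for each latent factor in Eq. (14)."""
    kl_per_instance = 0.5 * torch.sum(logvar.exp() + mu ** 2 - 1.0 - logvar, dim=-1)
    return kl_per_instance.mean()
```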

4.3.2. Loss for Potential Outcome Prediction and Treatment Assignment Prediction

The factual loss function for predicting potential outcomes, along with the loss function for predicting treatment assignments, is defined as follows:

(15) \mathcal{L}_{treat}(t_{i},\mathbf{z}_{t,i},\mathbf{z}_{c,i})=-\mathbb{E}_{q_{\phi_{t_{i}}}q_{\phi_{c_{i}}}}(q_{\eta_{t_{i}}}(t_{i}\mid\mathbf{z}_{t,i},\mathbf{z}_{c,i})),
(16) \mathcal{L}_{pred}(t_{i},y_{i},\mathbf{z}_{c,i},\mathbf{z}_{y,i})=-\mathbb{E}_{q_{\phi_{c_{i}}}q_{\phi_{y_{i}}}}(q_{\eta_{y_{i}}}(y_{i}\mid t_{i},\mathbf{z}_{c,i},\mathbf{z}_{y,i})).

4.3.3. Loss for HSIC Independence Regularizer

We apply pairwise independence constraints to the latent factors $\mathbf{z}_{t}$, $\mathbf{z}_{c}$, $\mathbf{z}_{y}$, and $\mathbf{z}_{o}$ in order to improve the statistical independence between the disentangled representations. The HSIC regularizer $\mathcal{L}_{indep}$ is calculated as follows:

(17) \mathcal{L}_{indep}(\mathbf{z}_{t,*},\mathbf{z}_{c,*},\mathbf{z}_{y,*},\mathbf{z}_{o,*})=\sum_{k\neq m,\ k,m\in\{t,c,y,o\}}{\rm HSIC}({\mathbf{z}}_{k,*},{\mathbf{z}}_{m,*}).
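Using the hsic_unbiased helper sketched earlier, this regularizer can be assembled by summing over factor pairs; in the sketch below each unordered pair is counted once, which differs from the ordered sum in Eq. (17) only by a constant factor, and the dictionary keys are illustrative.

```python
from itertools import combinations

def independence_regularizer(factors, sigma=1.0):
    """L_indep of Eq. (17): sum of unbiased HSIC over all pairs of latent factors.
    `factors` is, e.g., {'t': z_t, 'c': z_c, 'y': z_y, 'o': z_o}."""
    loss = 0.0
    for (_, z_a), (_, z_b) in combinations(factors.items(), 2):
        loss = loss + hsic_unbiased(z_a, z_b, sigma)
    return loss
```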

4.3.4. Loss for Balanced Representation

As shown in Fig. 1, we observe that $\mathbf{z}_{y}\perp\!\!\!\perp t$, implying that $p(\mathbf{z}_{y}\mid t=0)=p(\mathbf{z}_{y}\mid t=1)$. Therefore, following the approach in (Hassanpour and Greiner, 2019), we aim for the learned $\mathbf{z}_{y}$ to exclude any confounding information, ensuring that all confounding factors are captured within $\mathbf{z}_{c}$. This is crucial for the accuracy of the treatment effect estimation. To quantify the discrepancy between the distributions of $\mathbf{z}_{y}$ for the treatment and control groups, we use an integral probability metric (IPM) (Müller, 1997; Sriperumbudur et al., 2012; Guo et al., 2020c). We define the balanced representation loss $\mathcal{L}_{disc}$ as

(18) \mathcal{L}_{disc}(\mathbf{z}_{y,*})=IPM(\{\mathbf{z}_{y,i}\}_{i:t_{i}=0},\{\mathbf{z}_{y,i}\}_{i:t_{i}=1}).

We instantiate the IPM with the Wasserstein-1 distance, as defined in Sriperumbudur et al. (2012), to calculate Equation (18). We employ the efficient approximation algorithm proposed by Cuturi and Doucet (2014) to compute the Wasserstein-1 distance and its gradients with respect to the model parameters when training TNDVGA.
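As a rough illustration of this balancing term, the sketch below approximates the Wasserstein distance between the $\mathbf{z}_{y}$ representations of the control and treatment groups via entropic regularization and Sinkhorn iterations, in the spirit of Cuturi and Doucet (2014); the regularization strength eps and the iteration count are illustrative, and this is not the exact solver used in our implementation.

```python
import torch

def sinkhorn_wasserstein(z_y_control, z_y_treated, eps=0.1, n_iters=50):
    """Entropic (Sinkhorn) approximation of the Wasserstein distance between the
    z_y representations of the two groups, cf. Eq. (18) (illustrative sketch)."""
    cost = torch.cdist(z_y_control, z_y_treated, p=2)     # pairwise Euclidean costs
    n, m = cost.shape
    mu = torch.full((n,), 1.0 / n, device=cost.device)    # uniform weights, control
    nu = torch.full((m,), 1.0 / m, device=cost.device)    # uniform weights, treated
    k = torch.exp(-cost / eps)                            # Gibbs kernel
    u = torch.ones_like(mu)
    for _ in range(n_iters):                              # Sinkhorn fixed-point updates
        v = nu / (k.t() @ u + 1e-8)
        u = mu / (k @ v + 1e-8)
    plan = u.unsqueeze(1) * k * v.unsqueeze(0)            # approximate transport plan
    return (plan * cost).sum()                            # approximate transport cost
```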

4.3.5. The Overall Objective Function

The following provides a summary of the overall objective function for TNDVGA:

(19) \begin{split}\mathcal{L}_{\rm{TNDVGA}}=\frac{1}{n}&\sum_{i=1}^{n}\left[\mathcal{L}_{\rm ELBO}(\mathbf{x}_{i},t_{i},y_{i})+\alpha_{t}\mathcal{L}_{treat}(t_{i},\mathbf{z}_{t,i},\mathbf{z}_{c,i})+\alpha_{y}\mathcal{L}_{pred}(t_{i},y_{i},\mathbf{z}_{c,i},\mathbf{z}_{y,i})\right]\\
&+\alpha_{1}\mathcal{L}_{indep}(\mathbf{z}_{t,*},\mathbf{z}_{c,*},\mathbf{z}_{y,*},\mathbf{z}_{o,*})+\alpha_{2}\mathcal{L}_{disc}(\mathbf{z}_{y,*})+\lambda{\|\Theta\|}_{2}^{2},\end{split}

where $\alpha_{t}$, $\alpha_{y}$, $\alpha_{1}$, and $\alpha_{2}$ are non-negative hyperparameters that balance the corresponding terms. The final term, $\lambda{\|\Theta\|}_{2}^{2}$, is applied to all model parameters $\Theta$ to avoid overfitting.
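Putting the pieces together, the training objective of Eq. (19) is a weighted sum of the per-instance losses and the two regularizers; a minimal wiring sketch is shown below, assuming the individual loss terms are already-averaged scalars computed by the components sketched above (in practice the weight-decay term can equivalently be delegated to the optimizer).

```python
def tndvga_objective(elbo, loss_treat, loss_pred, loss_indep, loss_disc, parameters,
                     alpha_t=100.0, alpha_y=100.0, alpha_1=1.0, alpha_2=1.0, lam=5e-5):
    """Overall objective of Eq. (19); the hyperparameter defaults are illustrative."""
    l2 = sum((p ** 2).sum() for p in parameters)   # ||Theta||_2^2 regularization
    return (elbo + alpha_t * loss_treat + alpha_y * loss_pred
            + alpha_1 * loss_indep + alpha_2 * loss_disc + lam * l2)
```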

After training, we can predict the ITEs of new instances from the observed covariates $\mathbf{x}$ and the network $\mathbf{A}$. We use the encoders $q_{\phi_{c}}(\mathbf{z}_{c}\mid\mathbf{x},\mathbf{A})$ and $q_{\phi_{y}}(\mathbf{z}_{y}\mid\mathbf{x},\mathbf{A})$ to sample the posteriors of the confounding and adjustment factors $l$ times, and then use the decoder $p_{\theta_{y}}(y\mid t,\mathbf{z}_{c},\mathbf{z}_{y})$ to compute the predicted outcomes $y$ under the two treatments, averaging them to obtain the estimated potential outcomes $y^{1}$ and $y^{0}$. The ATE is obtained by performing the above steps on all test samples and then averaging.
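The inference procedure can be sketched as follows, reusing the FactorEncoder, reparameterize, and TwoHeadedOutcome sketches from earlier; the sampling count corresponds to the $l$-sample averaging described above.

```python
import torch

@torch.no_grad()
def predict_ite(enc_c, enc_y, outcome_net, x, adj, n_samples=10):
    """Estimate ITEs by sampling z_c and z_y n_samples times and averaging the two
    predicted potential outcomes (illustrative sketch of the procedure above)."""
    y0_sum, y1_sum = 0.0, 0.0
    dummy_t = torch.zeros(x.size(0), device=x.device)   # only the (y0, y1) heads are used
    for _ in range(n_samples):
        z_c = reparameterize(*enc_c(x, adj))             # sample from q(z_c | x, A)
        z_y = reparameterize(*enc_y(x, adj))             # sample from q(z_y | x, A)
        _, (y0, y1) = outcome_net(dummy_t, z_c, z_y)
        y0_sum, y1_sum = y0_sum + y0, y1_sum + y1
    tau_hat = (y1_sum - y0_sum) / n_samples              # per-instance ITE estimates
    return tau_hat, tau_hat.mean()                       # ITE vector and ATE estimate
```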

5. Experiments

In this section, we perform a series of experiments to illustrate the effectiveness of the proposed TNDVGA framework. We first introduce the datasets, evaluation metrics, baselines, and model parameter configurations utilized in the experiments. Then, we compare the performance of different models in estimating ITE. After that, we conduct an ablation study to evaluate the importance of key components in the TNDVGA and conduct a hyperparameter study.

5.1. Datasets

5.1.1. Semi-synthetic datasets

Table 1. Statistics of the Two Semi-Synthetic Datasets: BlogCatalog and Flickr
Datasets | Instances | Edges | Features | κ2 | ATE mean ± STD
BlogCatalog | 5,196 | 173,468 | 8,189 | 0.5 | 4.366 ± 0.553
BlogCatalog | 5,196 | 173,468 | 8,189 | 1 | 7.446 ± 0.759
BlogCatalog | 5,196 | 173,468 | 8,189 | 2 | 13.534 ± 2.309
Flickr | 7,575 | 239,738 | 12,047 | 0.5 | 6.672 ± 3.068
Flickr | 7,575 | 239,738 | 12,047 | 1 | 8.487 ± 3.372
Flickr | 7,575 | 239,738 | 12,047 | 2 | 20.546 ± 5.718
BlogCatalog

In the BlogCatalog dataset (Tang and Liu, 2011), a social blog directory for managing bloggers and their blogs, each individual represents a blogger, and each edge represents a social connection between two bloggers. The features are represented as a bag-of-words representation of the keywords in the bloggers’ descriptions. To generate synthetic outcomes and treatments, we rely on the assumptions outlined in (Guo et al., 2020c; Veitch et al., 2019). The outcome yy refers to the readers’ opinion of each blogger, and the treatment tt represents whether the blogger’s content receives more views on mobile devices or desktops. Bloggers whose content is primarily viewed on mobile devices are placed in the treatment group, while those whose content is mainly viewed on desktops are placed in the control group. Additionally, following the assumptions in (Guo et al., 2020c), we assume that the topics discussed by the blogger and their neighbors causally affect both the blogger’s treatment assignment and outcome. In this task, our goal is to investigate the individual treatment effect (ITE) of receiving more views on mobile devices (instead of desktops) on the readers’ opinion. Specifically, a Latent Dirichlet Allocation (LDA) topic model is trained (Blei et al., 2003). Two centroids in the topic space are then defined: (i) the centroid 𝒓¯1\bar{\boldsymbol{r}}^{1} of the treatment group is set as the topic distribution of a randomly selected blogger, and (ii) the centroid 𝒓¯0\bar{\boldsymbol{r}}^{0} of the control group is set as the average topic distribution across all bloggers. We then model readers’ preference of browsing devices on the ii-th blogger content as:

(20) \begin{split}&P(t_{i}=1\mid\mathbf{x}_{i},\mathbf{A})=\frac{{\rm exp}(p_{i}^{1})}{{\rm exp}(p_{i}^{1})+{\rm exp}(p_{i}^{0})}\\
{\rm with}\quad&p_{i}^{t}=\kappa_{1}{\boldsymbol{r}}(\mathbf{x}_{i})^{\mathsf{T}}\bar{\boldsymbol{r}}^{t}+\kappa_{2}\sum_{j\in\mathcal{N}(i)}{\boldsymbol{r}}(\mathbf{x}_{j})^{\mathsf{T}}\bar{\boldsymbol{r}}^{t},\quad t\in\{0,1\},\end{split}

where κ10\kappa_{1}\geq 0 and κ20\kappa_{2}\geq 0 control the strength of the confounding bias introduced by the blogger’s topics and the topics of their neighbors, respectively. Finally, the factual outcome and the counterfactual outcome of the ii-th instance are given as:

(21) \begin{split}y_{i}^{F}&=C(p_{i}^{0}+t_{i}p_{i}^{1})+\epsilon,\\
y_{i}^{CF}&=C[p_{i}^{0}+(1-t_{i})p_{i}^{1}]+\epsilon,\end{split}

where CC serves as a scaling factor, and the noise term ϵ\epsilon follows a normal distribution, i.e., ϵ𝒩(0,1)\epsilon\sim\mathcal{N}(0,1). For this study, we set C=5C=5, κ1=10\kappa_{1}=10, and κ2{0.5,1,2}\kappa_{2}\in\{0.5,1,2\}.
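A NumPy sketch of this simulation is given below; it assumes the LDA topic proportions are available as an n-by-K matrix and follows Eqs. (20)-(21), using a numerically stable form of Eq. (20). The function name and defaults are illustrative, and the exact generator released with (Guo et al., 2020c) may differ in details.

```python
import numpy as np

def simulate_blogcatalog(topics, adj, kappa1=10.0, kappa2=1.0, C=5.0, seed=0):
    """Synthesize treatments and (counter)factual outcomes per Eqs. (20)-(21); sketch only."""
    rng = np.random.default_rng(seed)
    n = topics.shape[0]
    r1 = topics[rng.integers(n)]        # treated centroid: topics of a random blogger
    r0 = topics.mean(axis=0)            # control centroid: average topic distribution
    neighbor_topics = adj @ topics      # sum of the neighbours' topic distributions
    p1 = kappa1 * topics @ r1 + kappa2 * neighbor_topics @ r1
    p0 = kappa1 * topics @ r0 + kappa2 * neighbor_topics @ r0
    prob_treated = 1.0 / (1.0 + np.exp(p0 - p1))          # Eq. (20), stable softmax form
    t = rng.binomial(1, prob_treated)
    eps = rng.standard_normal(n)
    y_factual = C * (p0 + t * p1) + eps                    # Eq. (21)
    y_counterfactual = C * (p0 + (1 - t) * p1) + eps
    return t, y_factual, y_counterfactual
```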

Flickr

Flickr (Tang and Liu, 2011) is an online platform utilized for the purpose of sharing images and videos. In this dataset, each user is represented as an instance, with edges indicating social connections between users. The features of each user are a list of interest tags. The treatment and outcome are synthesized using the same settings and simulation process as in the BlogCatalog.

In Table 1, we provide a detailed statistical summary of the two semi-synthetic datasets. For each parameter setting, the mean and standard deviation of the ATEs are computed across 10 runs.

5.1.2. Synthetic datasets

Inspired by (Hassanpour and Greiner, 2019), we generate synthetic datasets named TNDVGASynth, which follow the structure illustrated in Fig. 1 and the relationships defined in Equations (22)-(25).

(22) \begin{split}&\mathbf{z}_{t}\sim\mathcal{N}(\mathbf{0},\mathbf{1})^{m_{t}},\quad\mathbf{z}_{c}\sim\mathcal{N}(\mathbf{0},\mathbf{1})^{m_{c}},\quad\mathbf{z}_{y}\sim\mathcal{N}(\mathbf{0},\mathbf{1})^{m_{y}},\quad\mathbf{z}_{o}\sim\mathcal{N}(\mathbf{0},\mathbf{1})^{m_{o}},\\
&\mathbf{x}=Concat(\mathbf{z}_{t},\mathbf{z}_{c},\mathbf{z}_{y},\mathbf{z}_{o}),\quad{\boldsymbol{\Psi}}=Concat(\mathbf{z}_{t},\mathbf{z}_{c}),\quad{\boldsymbol{\Phi}}=Concat(\mathbf{z}_{c},\mathbf{z}_{y}),\end{split}

(23) \begin{split}&a\sim Bernoulli\left(\frac{0.01}{1+{\rm exp}(-r)}\right)\\
{\rm with}\quad&r=\mathbf{h}\cdot\mathbf{h}+1,\quad\mathbf{h}=Concat(\mathbf{z}_{t},\mathbf{z}_{c},\mathbf{z}_{y},\mathbf{z}_{o}),\end{split}

(24) \begin{split}&t\sim Bernoulli\left(\frac{1}{1+{\rm exp}(-\zeta h)}\right)\\
{\rm with}\quad&h={\boldsymbol{\Psi}}\cdot\boldsymbol{\theta}+1,\quad\boldsymbol{\theta}\sim\mathcal{N}(\mathbf{0},\mathbf{1})^{m_{t}+m_{c}},\end{split}

(25) \begin{split}y^{0}&=\frac{({\boldsymbol{\Phi}}\circ{\boldsymbol{\Phi}}\circ{\boldsymbol{\Phi}}+0.5)\cdot{\boldsymbol{\nu}}^{0}}{m_{c}+m_{y}}+\epsilon,\\
y^{1}&=\frac{({\boldsymbol{\Phi}}\circ{\boldsymbol{\Phi}})\cdot{\boldsymbol{\nu}}^{1}}{m_{c}+m_{y}}+\epsilon,\\
{\rm with}\quad&{\boldsymbol{\nu}}^{0},{\boldsymbol{\nu}}^{1}\sim\mathcal{N}(\mathbf{0},\mathbf{1})^{m_{c}+m_{y}},\quad\epsilon\sim\mathcal{N}(0,1),\end{split}

where $Concat(\cdot,\cdot)$ denotes the vector concatenation operation; $a$ is an element of the adjacency matrix $\mathbf{A}$; $m_{t},m_{c},m_{y},m_{o}$ are the dimensions of the latent factors $\mathbf{z}_{t},\mathbf{z}_{c},\mathbf{z}_{y},\mathbf{z}_{o}$, respectively; the scalar $\zeta$ determines the slope of the logistic curve; $\cdot$ denotes the dot product; and $\circ$ denotes the element-wise (Hadamard) product. We consider all feasible datasets generated from the grid defined by $m_{t},m_{c},m_{y},m_{o}\in\{4,8\}$, creating 16 scenarios. For each scenario, we synthesize five datasets using different initial random seeds.
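The following NumPy sketch generates one TNDVGASynth dataset per Eqs. (22)-(25); the adjacency step reads the dot product in Eq. (23) as being taken between the latent vectors of the two endpoint instances, and the remaining details (sample size, seed handling, function name) are illustrative assumptions.

```python
import numpy as np

def generate_tndvga_synth(n=3000, m_t=4, m_c=4, m_y=4, m_o=4, zeta=1.0, seed=0):
    """Generate one TNDVGASynth dataset following Eqs. (22)-(25); illustrative sketch."""
    rng = np.random.default_rng(seed)
    z_t = rng.standard_normal((n, m_t))
    z_c = rng.standard_normal((n, m_c))
    z_y = rng.standard_normal((n, m_y))
    z_o = rng.standard_normal((n, m_o))
    x = np.concatenate([z_t, z_c, z_y, z_o], axis=1)   # covariates (Eq. 22)
    psi = np.concatenate([z_t, z_c], axis=1)
    phi = np.concatenate([z_c, z_y], axis=1)

    # Adjacency (Eq. 23): edge probability driven by latent-vector similarity
    r = x @ x.T + 1.0
    p_edge = 0.01 / (1.0 + np.exp(-r))
    adj = rng.binomial(1, p_edge)
    adj = np.triu(adj, 1)
    adj = adj + adj.T                                   # symmetric, no self-loops

    # Treatment (Eq. 24) and potential outcomes (Eq. 25)
    theta = rng.standard_normal(m_t + m_c)
    t = rng.binomial(1, 1.0 / (1.0 + np.exp(-zeta * (psi @ theta + 1.0))))
    nu0 = rng.standard_normal(m_c + m_y)
    nu1 = rng.standard_normal(m_c + m_y)
    eps = rng.standard_normal(n)
    y0 = (phi ** 3 + 0.5) @ nu0 / (m_c + m_y) + eps
    y1 = (phi ** 2) @ nu1 / (m_c + m_y) + eps
    y_factual = np.where(t == 1, y1, y0)
    return x, adj, t, y_factual, y0, y1
```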

5.2. Evaluation Metrics

We evaluate the performance of the proposed TNDVGA framework in learning ITE using two metrics widely used in causal inference. We report the square root of the Precision in Estimation of Heterogeneous Effect ($\sqrt{\epsilon_{PEHE}}$) to measure the accuracy of individual-level treatment effect estimates, and the Mean Absolute Error of the ATE ($\epsilon_{ATE}$) to assess the accuracy of the population-level treatment effect estimate. They are formally defined as follows:

(26) \sqrt{\epsilon_{PEHE}}=\sqrt{\frac{1}{n}\sum_{i=1}^{n}({\hat{\tau}}_{i}-\tau_{i})^{2}},
(27) \epsilon_{ATE}=\frac{1}{n}\left|\sum_{i=1}^{n}{\hat{\tau}}_{i}-\sum_{i=1}^{n}\tau_{i}\right|,

where ${\hat{\tau}}_{i}={\hat{y}}_{i}^{1}-{\hat{y}}_{i}^{0}$ and $\tau_{i}=y_{i}^{1}-y_{i}^{0}$ denote the estimated ITE and the ground-truth ITE of instance $i$, respectively. Lower values of these metrics indicate better estimation performance.
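These two metrics can be computed directly from the estimated and ground-truth potential outcomes; a small NumPy sketch:

```python
import numpy as np

def evaluate_ite(y1_hat, y0_hat, y1, y0):
    """Compute sqrt(PEHE) (Eq. 26) and the ATE error (Eq. 27); illustrative sketch."""
    tau_hat = y1_hat - y0_hat                       # estimated ITEs
    tau = y1 - y0                                   # ground-truth ITEs
    pehe = np.sqrt(np.mean((tau_hat - tau) ** 2))   # Eq. (26)
    eps_ate = np.abs(np.mean(tau_hat) - np.mean(tau))  # Eq. (27)
    return pehe, eps_ate
```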

5.3. Baselines

We compare our model against the following state-of-the-art models used for ITE estimation:

  • Bayesian Additive Regression Trees (BART) (Chipman et al., 2010). BART is a widely used nonparametric Bayesian regression model that utilizes dimensionally adaptive random basis functions.

  • Causal Forest (Wager and Athey, 2018). Causal Forest is a non-parametric causal inference method designed to estimate heterogeneous treatment effects, extending Breiman’s well-known random forest algorithm.

  • Counterfactual Regression (CFR) (Shalit et al., 2017). CFR is a representation learning-based approach that predicts individual treatment effects (ITE) from observational data. It reduces the imbalance between the latent representations of the treatment and control groups and minimizes prediction errors for factual outcomes by projecting the original features into a latent space to capture confounders. It implements Integral Probability Metrics to measure the distance between distributions. This study employs two distinct forms of balancing penalties: the Wasserstein-1 distance (CFR-Wass) and the maximum mean discrepancy (CFR-MMD).

  • Treatment-agnostic Representation Networks (TARNet) (Shalit et al., 2017). TARNet is a variant of CFR that excludes the balance regularization term from its model.

  • Causal Effect Variational Autoencoder (CEVAE) (Louizos et al., 2017). CEVAE is built upon Variational Autoencoders (VAE) (Kingma, 2013) and adheres to the causal inference framework with proxy variables. It is capable of jointly estimating the unknown latent space that captures confounders and the causal effect.

  • Treatment Effect by Disentangled Variational AutoEncoder (TEDVAE) (Zhang et al., 2021b). TEDVAE is a variational inference approach that simultaneously infers latent factors from observed variables, while disentangling these factors into three distinct sets: instrumental factors, confounding factors, and risk factors. These disentangled factors are then utilized for estimating treatment effects.

  • Network Deconfounder (NetDeconf) (Guo et al., 2020c). NetDeconf is a novel causal inference framework that leverages network information to identify patterns of hidden confounders, enabling the learning of valid individual causal effects from networked observational data.

  • Graph Infomax Adversarial Learning (GIAL) (Chu et al., 2021). GIAL is a treatment effect estimation model that exploits the network structure to capture additional information by recognizing imbalances within the network. In this work, we employ two variants of GIAL: one that uses the original graph convolutional network (GCN) (Kipf and Welling, 2016a) implementation (GIAL-GCN) and another that uses graph attention networks (GAT) (Veličković et al., 2017) (GIAL-GAT).

5.4. Parameter Settings

We implement TNDVGA using PyTorch on an NVIDIA RTX 4090D GPU. For BlogCatalog and Flickr, we run 10 experiments and report the average results. For each run, the dataset is split into training (60%), validation (20%), and test (20%) sets. Baseline methods such as BART, Causal Forest, CFR, TARNet, and CEVAE are originally designed for non-networked observational data and thus cannot leverage network information directly. To ensure a fair comparison, we concatenate the rows of the adjacency matrix with the original features; however, this does not notably enhance baseline performance due to dimensionality limitations. For the baselines, we use the default hyperparameters from previous works (Guo et al., 2020c; Chu et al., 2021). For TNDVGA, we apply grid search to identify the optimal hyperparameter settings. Specifically, the learning rate is set to $3\times 10^{-4}$, $\alpha_{t}$ and $\alpha_{y}$ are set to 100, and $\lambda$ is set to $5\times 10^{-5}$. The number of GCN layers is varied between 1, 2, and 3, the hidden dimension is set to 500, and the dimensions of $\mathbf{z}_{t}$, $\mathbf{z}_{c}$, $\mathbf{z}_{y}$, and $\mathbf{z}_{o}$ vary across {10, 20, 30, 40, 50}. The regularization coefficients $\alpha_{1}$ and $\alpha_{2}$ are tuned within the range {$10^{-2}$, $10^{-1}$, 1, 10, 100}. TNDVGA is trained for 500 epochs on BlogCatalog and 1000 epochs on Flickr, using the Adam optimizer (Kingma, 2014). For the synthetic datasets, we use the same parameter selection approach as for the semi-synthetic datasets. Unless stated otherwise, the latent variable dimensions for the different factors are set to their true values.
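The search over these settings can be organized as a simple grid loop. The sketch below only illustrates the procedure: the stand-in model class, the use of $\lambda$ as a weight-decay term, and the selection criterion are our placeholders rather than the released implementation.

```python
import itertools
import torch
import torch.nn as nn

# Stand-in for the TNDVGA network; the real model uses GCN layers over the adjacency matrix.
class StandInModel(nn.Module):
    def __init__(self, in_dim, hidden_dim, latent_dim):
        super().__init__()
        # One shared encoder producing the four latent blocks z_t, z_c, z_y, z_o.
        self.encoder = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU(),
                                     nn.Linear(hidden_dim, 4 * latent_dim))

    def forward(self, x):
        return self.encoder(x)

grid = itertools.product([1, 2, 3],                       # number of GCN layers
                         [10, 20, 30, 40, 50],            # dimension of each latent factor
                         [1e-2, 1e-1, 1.0, 10.0, 100.0],  # alpha_1 (independence weight)
                         [1e-2, 1e-1, 1.0, 10.0, 100.0])  # alpha_2 (balance weight)

for num_layers, latent_dim, alpha_1, alpha_2 in grid:
    model = StandInModel(in_dim=128, hidden_dim=500, latent_dim=latent_dim)
    optimizer = torch.optim.Adam(model.parameters(), lr=3e-4, weight_decay=5e-5)
    # Train for 500 epochs (BlogCatalog) or 1000 epochs (Flickr) and keep the
    # configuration with the lowest validation sqrt(PEHE).
    break  # illustration only: the full sweep is not run here
```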

Table 2. Performance comparison of different methods on BlogCatalog. We report the average values of $\sqrt{\epsilon_{PEHE}}$ and $\epsilon_{ATE}$ on the test sets. Baseline results are from (Chu et al., 2021), except TEDVAE.
Method            | $\kappa_{2}$ = 0.5 | $\kappa_{2}$ = 1 | $\kappa_{2}$ = 2
                  | $\sqrt{\epsilon_{PEHE}}$ / $\epsilon_{ATE}$ | $\sqrt{\epsilon_{PEHE}}$ / $\epsilon_{ATE}$ | $\sqrt{\epsilon_{PEHE}}$ / $\epsilon_{ATE}$
BART              | 4.808 / 2.680 | 5.770 / 2.278 | 11.608 / 6.418
Causal Forest     | 7.456 / 1.261 | 7.805 / 1.763 | 19.271 / 4.050
CFR-Wass          | 10.904 / 4.257 | 11.644 / 5.107 | 34.848 / 13.053
CFR-MMD           | 11.536 / 4.127 | 12.332 / 5.345 | 34.654 / 13.785
TARNet            | 11.570 / 4.228 | 13.561 / 8.170 | 34.420 / 13.122
CEVAE             | 7.481 / 1.279 | 10.387 / 1.998 | 24.215 / 5.566
TEDVAE            | 4.609 / 0.798 | 4.354 / 0.881 | 6.805 / 1.190
NetDeconf         | 4.532 / 0.979 | 4.597 / 0.984 | 9.532 / 2.130
GIAL-GCN          | 4.023 / 0.841 | 4.091 / 0.883 | 8.927 / 1.780
GIAL-GAT          | 4.215 / 0.912 | 4.258 / 0.937 | 9.119 / 1.982
TNDVGA (ours)     | 3.969 / 0.719 | 3.846 / 0.699 | 6.066 / 1.057
Table 3. Performance comparison of different methods on Flickr. We report the average values of $\sqrt{\epsilon_{PEHE}}$ and $\epsilon_{ATE}$ on the test sets. Baseline results are from (Chu et al., 2021), except TEDVAE.
Method            | $\kappa_{2}$ = 0.5 | $\kappa_{2}$ = 1 | $\kappa_{2}$ = 2
                  | $\sqrt{\epsilon_{PEHE}}$ / $\epsilon_{ATE}$ | $\sqrt{\epsilon_{PEHE}}$ / $\epsilon_{ATE}$ | $\sqrt{\epsilon_{PEHE}}$ / $\epsilon_{ATE}$
BART              | 4.907 / 2.323 | 9.517 / 6.548 | 13.155 / 9.643
Causal Forest     | 8.104 / 1.359 | 14.636 / 3.545 | 26.702 / 4.324
CFR-Wass          | 13.846 / 3.507 | 27.514 / 5.192 | 53.454 / 13.269
CFR-MMD           | 13.539 / 3.350 | 27.679 / 5.416 | 53.863 / 12.115
TARNet            | 14.329 / 3.389 | 28.466 / 5.978 | 55.066 / 13.105
CEVAE             | 12.099 / 1.732 | 22.496 / 4.415 | 42.985 / 5.393
TEDVAE            | 5.072 / 1.041 | 7.125 / 1.328 | 12.952 / 2.124
NetDeconf         | 4.286 / 0.805 | 5.789 / 1.359 | 9.817 / 2.700
GIAL-GCN          | 3.938 / 0.682 | 5.317 / 1.194 | 9.275 / 2.245
GIAL-GAT          | 4.015 / 0.773 | 5.432 / 1.231 | 9.428 / 2.586
TNDVGA            | 3.896 / 0.633 | 4.974 / 1.037 | 7.302 / 1.908

5.5. Performance Comparison

We compare the proposed framework TNDVGA with the state-of-the-art baselines for ITE estimation on both semi-synthetic datasets and synthetic datasets.

5.5.1. Performance on Semi-Synthetic Datasets

Tables 2 and 3 present the experimental results on the BlogCatalog and Flickr datasets, respectively. Through a comprehensive analysis of the experimental results, we have the following observations:

  • The proposed variational inference framework for ITE estimation, TNDVGA, consistently outperforms state-of-the-art traditional baseline methods, including BART, Causal Forest, CFR, and CEVAE, across different settings on both datasets, as these methods do not account for disentangled latent factors or leverage network information for ITE learning.

  • TNDVGA and NetDeconf, along with GIAL, outperform other baseline methods in ITE estimation due to their ability to leverage auxiliary network information to capture the impact of latent factors on ITE estimation. This result suggests that network information helps in learning representations of latent factors, leading to more accurate ITE estimation. Furthermore, TNDVGA also outperforms NetDeconf and GIAL in ITE estimation because it learns representations of four different latent factors, whereas NetDeconf and GIAL only learn representations of latent confounding factors.

  • TEDVAE also performs reasonably well in estimating ITE, mainly because its model infers and disentangles three disjoint sets of instrumental, confounding, and risk factors from the observed variables. This also highlights the importance of learning disentangled latent factors for ITE estimation. However, TNDVGA outperforms TEDVAE, as it additionally accounts for latent noise factors and effectively leverages network information, whereas TEDVAE struggles to fully utilize network information to enhance its modeling capabilities.

  • TNDVGA demonstrates strong robustness to the choice of the latent dimensionality parameters. Although we do not explicitly model the generation process of the latent factors in these two semi-synthetic real-world datasets, and their generation does not include instrumental, risk, or noise factors, TNDVGA still achieves the best performance under these conditions. These results indicate that, even on more realistic datasets, TNDVGA can effectively learn latent factors and estimate ITE.

  • When the influence of hidden confounders increases (i.e., with a growing $\kappa_{2}$ value), TNDVGA suffers the smallest degradation in $\sqrt{\epsilon_{PEHE}}$ and $\epsilon_{ATE}$. This is because TNDVGA is able to identify patterns of latent confounding factors from the network structure, enabling it to infer ITE more accurately.

5.5.2. Performance on Synthetic Datasets

Figure 3. Experimental results of different methods in ITE estimation under different levels of selection bias. As the selection bias increases, TNDVGA consistently performs the best.

First, similar to (Bao et al., 2022), we control the magnitude of selection bias in the dataset through the scalar $\zeta$. We compare TNDVGA with TEDVAE and NetDeconf when the dimensions of the latent factors are (8, 8, 8, 8). As shown in Fig. 3, as the value of $\zeta$ increases, indicating stronger selection bias, TNDVGA consistently performs the best. Furthermore, TNDVGA's performance remains stable and is largely unaffected by variations in selection bias. This demonstrates that TNDVGA is more robust to selection bias, which is crucial when handling real-world datasets. We observe similar results on the other synthetic datasets.
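As a quick illustration of how $\zeta$ induces selection bias, the snippet below (our own toy check, not part of the benchmark) shows that larger $\zeta$ pushes the treatment propensities in Eq. (24) toward 0 or 1, so treatment assignment depends more sharply on the latent factors.

```python
import numpy as np

rng = np.random.default_rng(0)
h = rng.standard_normal(100_000) + 1.0        # scores h = Psi·theta + 1 as in Eq. (24)
for zeta in [0.5, 1.0, 2.0, 5.0]:
    p = 1.0 / (1.0 + np.exp(-zeta * h))       # treatment propensities
    extreme = np.mean((p < 0.1) | (p > 0.9))  # fraction of near-deterministic assignments
    print(f"zeta={zeta}: {extreme:.2%} of propensities fall below 0.1 or above 0.9")
```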

Next, we investigate TNDVGA's ability to recover the latent components $\mathbf{z}_{t}$, $\mathbf{z}_{c}$, $\mathbf{z}_{y}$, and $\mathbf{z}_{o}$ used to construct the observed covariates $\mathbf{x}$, and examine the contribution of disentangling these latent factors to ITE estimation. To this end, similar to the settings in (Hassanpour and Greiner, 2019; Bao et al., 2022), we compare the performance of TNDVGA when the parameters $d_{\mathbf{z}_{t}}$, $d_{\mathbf{z}_{c}}$, $d_{\mathbf{z}_{y}}$, and $d_{\mathbf{z}_{o}}$ are set to the true numbers of latent factors against its performance when one of the latent dimensionality parameters is set to zero (a minimal code sketch of this setup is given after this paragraph). For example, setting $d_{\mathbf{z}_{c}}=0$ forces TNDVGA to ignore the disentanglement of confounding factors. If TNDVGA performs better when all latent factors are disentangled than when any one latent factor is ignored, we can conclude that TNDVGA recovers the latent factors and that disentangling them is beneficial for ITE estimation. Fig. 4 displays the radar charts corresponding to each factor. We can clearly see that when TNDVGA disentangles all latent factors through non-zero dimensionality parameters, it outperforms every setting in which one latent dimension is set to zero.
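A minimal sketch of this "ignore one factor" setup, assuming the encoder output is split into contiguous blocks whose widths are the dimensionality parameters (the splitting function and names are hypothetical):

```python
import torch

def split_latents(z, d_zt, d_zc, d_zy, d_zo):
    # Split the encoder output into the four factor representations.
    return torch.split(z, [d_zt, d_zc, d_zy, d_zo], dim=1)

# Full model: latent dimensions match the true generating dimensions (8, 8, 8, 8).
z_full = torch.randn(32, 32)
z_t, z_c, z_y, z_o = split_latents(z_full, 8, 8, 8, 8)

# Ablated model: d_zc = 0 removes the confounding block, so downstream treatment and
# outcome predictors receive no dedicated z_c representation.
z_ablated = torch.randn(32, 24)
z_t, z_c, z_y, z_o = split_latents(z_ablated, 8, 0, 8, 8)
print(z_c.shape)  # torch.Size([32, 0])
```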

Figure 4. In the radar charts, each vertex of the polygon is labeled with a sequence of latent factor dimensions from the synthetic dataset. For example, 8-8-8-8 indicates that the dataset is generated using 8 dimensions each for the latent instrumental, confounding, adjustment, and noise factors. Each polygon represents the PEHE metric of the model (smaller polygons indicate better performance).
Table 4. Ablation study of our method’s variants on BlogCatalog.
Method              | $\kappa_{2}$ = 0.5 | $\kappa_{2}$ = 1 | $\kappa_{2}$ = 2
                    | $\sqrt{\epsilon_{PEHE}}$ / $\epsilon_{ATE}$ | $\sqrt{\epsilon_{PEHE}}$ / $\epsilon_{ATE}$ | $\sqrt{\epsilon_{PEHE}}$ / $\epsilon_{ATE}$
TNDVGA              | 3.937 / 0.656 | 3.918 / 0.677 | 0.651 / 1.184
TNDVGA (w/o BP)     | 4.090 / 0.710 | 4.060 / 0.808 | 6.887 / 1.798
TNDVGA (w/o HSIC)   | 4.114 / 0.765 | 4.070 / 0.808 | 6.982 / 1.958
Table 5. Ablation study of our method’s variants on Flickr.
Method              | $\kappa_{2}$ = 0.5 | $\kappa_{2}$ = 1 | $\kappa_{2}$ = 2
                    | $\sqrt{\epsilon_{PEHE}}$ / $\epsilon_{ATE}$ | $\sqrt{\epsilon_{PEHE}}$ / $\epsilon_{ATE}$ | $\sqrt{\epsilon_{PEHE}}$ / $\epsilon_{ATE}$
TNDVGA              | 3.897 / 0.610 | 5.045 / 0.956 | 8.763 / 1.074
TNDVGA (w/o BP)     | 4.298 / 0.637 | 5.551 / 1.359 | 10.853 / 1.678
TNDVGA (w/o HSIC)   | 4.622 / 0.930 | 5.908 / 1.380 | 11.198 / 1.948

5.6. Ablation Study

Furthermore, we investigate the effect of key components of the proposed TNDVGA framework on learning ITE from networked observational data. In particular, we conduct an ablation study by developing two variants of TNDVGA and comparing their performance with the original TNDVGA on the BlogCatalog and Flickr datasets: (i) TNDVGA w/o Balanced Representations: this variant does not balance the learned representations, i.e., it omits the balanced representation loss $\mathcal{L}_{disc}$ during training. As a result, the learned factor $\mathbf{z}_{y}$ may embed information about $\mathbf{z}_{t}$. We refer to this variant as TNDVGA w/o BP. (ii) TNDVGA w/o HSIC Independence Regularizer: this variant omits the independence constraint between the representations of different factors, which may prevent the learned representations from being disentangled; a generic sketch of such an HSIC penalty is given below. We refer to this variant as TNDVGA w/o HSIC.
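For reference, the independence regularizer that the w/o-HSIC variant removes can be written as a standard biased empirical HSIC estimator (Gretton et al., 2005); the Gaussian kernel and fixed bandwidth below are illustrative choices and not necessarily those used in TNDVGA.

```python
import torch

def rbf_kernel(x, sigma=1.0):
    # Gaussian (RBF) kernel matrix over a batch of representations.
    sq_dists = torch.cdist(x, x) ** 2
    return torch.exp(-sq_dists / (2.0 * sigma ** 2))

def hsic(x, y, sigma=1.0):
    """Biased empirical HSIC between two representation batches; larger means more dependent."""
    n = x.size(0)
    K, L = rbf_kernel(x, sigma), rbf_kernel(y, sigma)
    H = torch.eye(n) - torch.full((n, n), 1.0 / n)    # centering matrix
    return torch.trace(K @ H @ L @ H) / (n - 1) ** 2

# The regularizer sums HSIC over all pairs of factor representations, e.g.
# hsic(z_t, z_c) + hsic(z_t, z_y) + hsic(z_c, z_y) + ..., and is dropped in TNDVGA w/o HSIC.
z_t, z_c = torch.randn(64, 10), torch.randn(64, 10)
print(hsic(z_t, z_c).item())
```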

Figure 5. Hyperparameter analysis on BlogCatalog across different $\kappa_{2}$. Panels (a)-(c) show $\sqrt{\epsilon_{PEHE}}$ and panels (d)-(f) show $\epsilon_{ATE}$ for $\kappa_{2}=0.5$, $1$, and $2$, respectively.

Figure 6. Hyperparameter analysis on Flickr across different $\kappa_{2}$. Panels (a)-(c) show $\sqrt{\epsilon_{PEHE}}$ and panels (d)-(f) show $\epsilon_{ATE}$ for $\kappa_{2}=0.5$, $1$, and $2$, respectively.

Tables 4 and 5 display the comparison results of the two variants with TNDVGA on the BlogCatalog and Flickr datasets, respectively. From the analysis, we can draw the following observations:

  • TNDVGA w/o BP cannot provide satisfactory performance because it neglects the balance of adjustment variables, which may lead to instrumental information being embedded in the adjustment variables, affecting the effectiveness of the learned representations. This highlights the necessity of balanced representations for better learning of latent factors in order to estimate ITE.

  • TNDVGA w/o HSIC also fails to provide the expected performance and typically performs the worst, as it does not impose independence constraints on the representations corresponding to different latent factors. This indicates that imposing explicit independence constraints on the representations is important for estimating ITE from network observational data.

5.7. Hyperparameter Study

We conduct an analysis of the effects of the two most important hyperparameters, $\alpha_{1}$ and $\alpha_{2}$, on the performance of TNDVGA. These parameters control how strongly the independence constraint and the representation balance contribute to ITE estimation from networked observational data. We vary $\alpha_{1}$ and $\alpha_{2}$ within the range {0.01, 0.1, 1, 10, 100} and report $\sqrt{\epsilon_{PEHE}}$ and $\epsilon_{ATE}$ on the BlogCatalog and Flickr datasets with $\kappa_{2}$ set to 0.5, 1, and 2. The results of the hyperparameter study are shown in Figs. 5 and 6. When $\alpha_{1}$ and $\alpha_{2}$ range in {0.01, 0.1, 1}, the variations in $\sqrt{\epsilon_{PEHE}}$ and $\epsilon_{ATE}$ are minimal, suggesting that TNDVGA performs stably and favorably across a wide range of parameter values. However, when $\alpha_{1}\geq 10$ or $\alpha_{2}\geq 10$, TNDVGA's performance in estimating the ATE noticeably declines. This degradation occurs because the objective function places too much emphasis on the regularization terms at large parameter values, thereby harming the accuracy of ATE estimation.

6. Conclusion and Future Work

This paper aims to improve the accuracy of individual treatment effect estimation from networked observational data by modeling disentangled latent factors. The proposed model, TNDVGA, leverages observed features and auxiliary network information to infer and disentangle four distinct sets of latent factors: instrumental, confounding, adjustment, and noise factors. Empirical results from extensive experiments on two semi-synthetic datasets and one synthetic dataset demonstrate that TNDVGA outperforms existing state-of-the-art methods in estimating ITE from networked observational data.

Two promising directions for future work are worth exploring. First, we would like to extend TNDVGA to estimate treatment effects for multiple or continuous treatments, which would enhance its applicability to a wider range of real-world scenarios. Second, we are interested in further investigating ITE estimation under network interference within a generative model framework that employs variational inference.

References

  • Abadie and Imbens (2006) Alberto Abadie and Guido W Imbens. 2006. Large sample properties of matching estimators for average treatment effects. econometrica 74, 1 (2006), 235–267.
  • Arbour et al. (2016) David Arbour, Dan Garant, and David Jensen. 2016. Inferring network effects from observational data. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 715–724.
  • Atan et al. (2018) Onur Atan, James Jordon, and Mihaela Van der Schaar. 2018. Deep-treat: Learning optimal personalized treatments from observational data using neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32.
  • Athey and Imbens (2016) Susan Athey and Guido Imbens. 2016. Recursive partitioning for heterogeneous causal effects. Proceedings of the National Academy of Sciences 113, 27 (2016), 7353–7360.
  • Bao et al. (2022) Qingsen Bao, Zeyong Mao, and Lei Chen. 2022. Learning Disentangled Latent Factors for Individual Treatment Effect Estimation Using Variational Generative Adversarial Nets. In 2022 IEEE 25th International Conference on Computer Supported Cooperative Work in Design (CSCWD). IEEE, 347–352.
  • Bennett and Kallus (2019) Andrew Bennett and Nathan Kallus. 2019. Policy evaluation with latent confounders via optimal balance. Advances in neural information processing systems 32 (2019).
  • Blei et al. (2003) David M Blei, Andrew Y Ng, and Michael I Jordan. 2003. Latent dirichlet allocation. Journal of machine Learning research 3, Jan (2003), 993–1022.
  • Cheng et al. (2022) Mingyuan Cheng, Xinru Liao, Quan Liu, Bin Ma, Jian Xu, and Bo Zheng. 2022. Learning disentangled representations for counterfactual regression via mutual information minimization. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval. 1802–1806.
  • Chipman et al. (2010) Hugh A Chipman, Edward I George, and Robert E McCulloch. 2010. BART: Bayesian additive regression trees. (2010).
  • Chu et al. (2021) Zhixuan Chu, Stephen L Rathbun, and Sheng Li. 2021. Graph infomax adversarial learning for treatment effect estimation with networked observational data. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. 176–184.
  • Cuturi and Doucet (2014) Marco Cuturi and Arnaud Doucet. 2014. Fast computation of Wasserstein barycenters. In International conference on machine learning. PMLR, 685–693.
  • Ding and Lehrer (2010) Weili Ding and Steven F Lehrer. 2010. Estimating treatment effects from contaminated multiperiod education experiments: The dynamic impacts of class size reductions. The Review of Economics and Statistics 92, 1 (2010), 31–42.
  • Goodfellow et al. (2014) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. Advances in neural information processing systems 27 (2014).
  • Gretton et al. (2005) Arthur Gretton, Olivier Bousquet, Alex Smola, and Bernhard Schölkopf. 2005. Measuring statistical dependence with Hilbert-Schmidt norms. In International conference on algorithmic learning theory. Springer, 63–77.
  • Gu et al. (2021) Tiankai Gu, Kun Kuang, Hong Zhu, Jingjie Li, Zhenhua Dong, Wenjie Hu, Zhenguo Li, Xiuqiang He, and Yue Liu. 2021. Estimating true post-click conversion via group-stratified counterfactual inference. In Proceedings of ACM SIGKDD Conference on Knowledge Discovery and Data Mining.
  • Guo et al. (2020a) Ruocheng Guo, Lu Cheng, Jundong Li, P Richard Hahn, and Huan Liu. 2020a. A survey of learning causality with data: Problems and methods. ACM Computing Surveys (CSUR) 53, 4 (2020), 1–37.
  • Guo et al. (2021) Ruocheng Guo, Jundong Li, Yichuan Li, K Selçuk Candan, Adrienne Raglin, and Huan Liu. 2021. Ignite: A minimax game toward learning individual treatment effects from networked observational data. In Proceedings of the Twenty-Ninth International Conference on International Joint Conferences on Artificial Intelligence. 4534–4540.
  • Guo et al. (2020b) Ruocheng Guo, Jundong Li, and Huan Liu. 2020b. Counterfactual evaluation of treatment assignment functions with networked observational data. In Proceedings of the 2020 SIAM International Conference on Data Mining. SIAM, 271–279.
  • Guo et al. (2020c) Ruocheng Guo, Jundong Li, and Huan Liu. 2020c. Learning individual causal effects from networked observational data. In Proceedings of the 13th international conference on web search and data mining. 232–240.
  • Häggström (2018) Jenny Häggström. 2018. Data-driven confounder selection via Markov and Bayesian networks. Biometrics 74, 2 (2018), 389–398.
  • Hassanpour and Greiner (2019) Negar Hassanpour and Russell Greiner. 2019. Learning disentangled representations for counterfactual regression. In International Conference on Learning Representations.
  • Huang et al. (2023) Qiang Huang, Jing Ma, Jundong Li, Ruocheng Guo, Huiyan Sun, and Yi Chang. 2023. Modeling Interference for Individual Treatment Effect Estimation from Networked Observational Data. ACM Transactions on Knowledge Discovery from Data 18, 3 (2023), 1–21.
  • Imbens and Rubin (2015) Guido W Imbens and Donald B Rubin. 2015. Causal inference in statistics, social, and biomedical sciences. Cambridge university press.
  • Johansson et al. (2016) Fredrik Johansson, Uri Shalit, and David Sontag. 2016. Learning representations for counterfactual inference. In International conference on machine learning. PMLR, 3020–3029.
  • Kingma (2013) Diederik P Kingma. 2013. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114 (2013).
  • Kingma (2014) Diederik P Kingma. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
  • Kipf and Welling (2016a) Thomas N Kipf and Max Welling. 2016a. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016).
  • Kipf and Welling (2016b) Thomas N Kipf and Max Welling. 2016b. Variational graph auto-encoders. arXiv preprint arXiv:1611.07308 (2016).
  • Kuang et al. (2017) Kun Kuang, Peng Cui, Bo Li, Meng Jiang, Shiqiang Yang, and Fei Wang. 2017. Treatment effect estimation with data-driven variable decomposition. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 31.
  • Kuang et al. (2020a) Kun Kuang, Peng Cui, Hao Zou, Bo Li, Jianrong Tao, Fei Wu, and Shiqiang Yang. 2020a. Data-driven variable decomposition for treatment effect estimation. IEEE Transactions on Knowledge and Data Engineering 34, 5 (2020), 2120–2134.
  • Kuang et al. (2020b) Kun Kuang, Lian Li, Zhi Geng, Lei Xu, Kun Zhang, Beishui Liao, Huaxin Huang, Peng Ding, Wang Miao, and Zhichao Jiang. 2020b. Causal inference. Engineering 6, 3 (2020), 253–263.
  • Liu et al. (2024) Yu Liu, Jian Wang, and Bing Li. 2024. EDVAE: Disentangled latent factors models in counterfactual reasoning for individual treatment effects estimation. Information Sciences 652 (2024), 119578.
  • Louizos et al. (2017) Christos Louizos, Uri Shalit, Joris M Mooij, David Sontag, Richard Zemel, and Max Welling. 2017. Causal effect inference with deep latent-variable models. Advances in neural information processing systems 30 (2017).
  • Müller (1997) Alfred Müller. 1997. Integral probability metrics and their generating classes of functions. Advances in applied probability 29, 2 (1997), 429–443.
  • Pearl (2009a) Judea Pearl. 2009a. Causal inference in statistics: An overview. (2009).
  • Pearl (2009b) Judea Pearl. 2009b. Causality. Cambridge university press.
  • Rakesh et al. (2018) Vineeth Rakesh, Ruocheng Guo, Raha Moraffah, Nitin Agarwal, and Huan Liu. 2018. Linked causal variational autoencoder for inferring paired spillover effects. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management. 1679–1682.
  • Rosenbaum and Rubin (1983) Paul R Rosenbaum and Donald B Rubin. 1983. The central role of the propensity score in observational studies for causal effects. Biometrika 70, 1 (1983), 41–55.
  • Rubin (1978) Donald B Rubin. 1978. Bayesian inference for causal effects: The role of randomization. The Annals of statistics (1978), 34–58.
  • Rubin (2005) Donald B Rubin. 2005. Causal inference using potential outcomes: Design, modeling, decisions. J. Amer. Statist. Assoc. 100, 469 (2005), 322–331.
  • Schölkopf et al. (2021) Bernhard Schölkopf, Francesco Locatello, Stefan Bauer, Nan Rosemary Ke, Nal Kalchbrenner, Anirudh Goyal, and Yoshua Bengio. 2021. Toward causal representation learning. Proc. IEEE 109, 5 (2021), 612–634.
  • Shalit et al. (2017) Uri Shalit, Fredrik D Johansson, and David Sontag. 2017. Estimating individual treatment effect: generalization bounds and algorithms. In International conference on machine learning. PMLR, 3076–3085.
  • Song et al. (2012) Le Song, Alex Smola, Arthur Gretton, Justin Bedo, and Karsten Borgwardt. 2012. Feature Selection via Dependence Maximization. Journal of Machine Learning Research 13, 5 (2012).
  • Sriperumbudur et al. (2012) Bharath K Sriperumbudur, Kenji Fukumizu, Arthur Gretton, Bernhard Schölkopf, and Gert RG Lanckriet. 2012. On the empirical estimation of integral probability metrics. (2012).
  • Tang and Liu (2011) Lei Tang and Huan Liu. 2011. Leveraging social media networks for classification. Data mining and knowledge discovery 23 (2011), 447–478.
  • Thorat et al. (2023) Abhinav Thorat, Ravi Kolla, Niranjan Pedanekar, and Naoyuki Onoe. 2023. Estimation of individual causal effects in network setup for multiple treatments. arXiv preprint arXiv:2312.11573 (2023).
  • Veitch et al. (2019) Victor Veitch, Yixin Wang, and David Blei. 2019. Using embeddings to correct for unobserved confounding in networks. Advances in Neural Information Processing Systems 32 (2019).
  • Veličković et al. (2017) Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. 2017. Graph attention networks. arXiv preprint arXiv:1710.10903 (2017).
  • Vowels et al. (2021) Matthew J Vowels, Necati Cihan Camgoz, and Richard Bowden. 2021. Targeted VAE: Variational and targeted learning for causal inference. In 2021 IEEE International Conference on Smart Data Services (SMDS). IEEE, 132–141.
  • Wager and Athey (2018) Stefan Wager and Susan Athey. 2018. Estimation and inference of heterogeneous treatment effects using random forests. J. Amer. Statist. Assoc. 113, 523 (2018), 1228–1242.
  • Winship and Morgan (1999) Christopher Winship and Stephen L Morgan. 1999. The estimation of causal effects from observational data. Annual review of sociology 25, 1 (1999), 659–706.
  • Wu et al. (2022) Anpeng Wu, Junkun Yuan, Kun Kuang, Bo Li, Runze Wu, Qiang Zhu, Yueting Zhuang, and Fei Wu. 2022. Learning decomposed representations for treatment effect estimation. IEEE Transactions on Knowledge and Data Engineering 35, 5 (2022), 4989–5001.
  • Wu and Fukumizu (2021) Pengzhou Wu and Kenji Fukumizu. 2021. Intact-VAE: Estimating treatment effects under unobserved confounding. arXiv preprint arXiv:2101.06662 (2021).
  • Yao et al. (2021) Liuyi Yao, Zhixuan Chu, Sheng Li, Yaliang Li, Jing Gao, and Aidong Zhang. 2021. A survey on causal inference. ACM Transactions on Knowledge Discovery from Data (TKDD) 15, 5 (2021), 1–46.
  • Yao et al. (2018) Liuyi Yao, Sheng Li, Yaliang Li, Mengdi Huai, Jing Gao, and Aidong Zhang. 2018. Representation learning for treatment effect estimation from observational data. Advances in neural information processing systems 31 (2018).
  • Zhang et al. (2021b) Weijia Zhang, Lin Liu, and Jiuyong Li. 2021b. Treatment effect estimation with disentangled latent factors. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35. 10923–10930.
  • Zhang et al. (2021a) Yang Zhang, Fuli Feng, Xiangnan He, Tianxin Wei, Chonggang Song, Guohui Ling, and Yongdong Zhang. 2021a. Causal intervention for leveraging popularity bias in recommendation. In Proceedings of the 44th international ACM SIGIR conference on research and development in information retrieval. 11–20.
  • Zhang et al. (2019) Zichen Zhang, Qingfeng Lan, Lei Ding, Yue Wang, Negar Hassanpour, and Russell Greiner. 2019. Reducing selection bias in counterfactual reasoning for individual treatment effects estimation. arXiv preprint arXiv:1912.09040 (2019).