
Treatment Effects Estimation on Networked Observational Data using Disentangled Variational Graph Autoencoder

Di Fan ([email protected], ORCID 0009-0001-6357-7849), School of Mathematical Sciences, Zhejiang University, 866 Yuhangtang Rd, Xihu Qu, Hangzhou, China 310012; Renlei Jiang ([email protected]), School of Mathematical Sciences, Zhejiang University, 866 Yuhangtang Rd, Xihu Qu, Hangzhou, China 310012; Yunhao Wen ([email protected]), Petrochina Engineering and Planning Institute, China; and Chuanhou Gao ([email protected], ORCID 0000-0001-9030-2042), School of Mathematical Sciences, Zhejiang University, 866 Yuhangtang Rd, Xihu Qu, Hangzhou, China 310012
Abstract.

Estimating individual treatment effect (ITE) from observational data has gained increasing attention across various domains, with a key challenge being the identification of latent confounders affecting both treatment and outcome. Networked observational data offer new opportunities to address this issue by utilizing network information to infer latent confounders. However, most existing approaches assume observed variables and network information serve only as proxy variables for latent confounders, which often fails in practice, as some variables influence treatment but not outcomes, and vice versa. Recent advances in disentangled representation learning, which disentangle latent factors into instrumental, confounding, and adjustment factors, have shown promise for ITE estimation. Building on this, we propose a novel disentangled variational graph autoencoder that learns disentangled factors for treatment effect estimation on networked observational data. Our graph encoder further ensures factor independence using the Hilbert-Schmidt Independence Criterion. Extensive experiments on two semi-synthetic datasets derived from real-world social networks and one synthetic dataset demonstrate that our method achieves state-of-the-art performance.

Causal inference, individual treatment effect, disentangled representations, networked observational data, variational graph autoencoder

1. Introduction

Research on causal effects between variables has received increasing attention. Among such problems, learning the individual-level effect of a treatment on an outcome is a fundamental question encountered by numerous researchers, with applications spanning various domains, including education (Ding and Lehrer, 2010), public policy (Athey and Imbens, 2016), economics (Zhang et al., 2021a; Gu et al., 2021), and healthcare (Shalit et al., 2017). For example, in a medical scenario, physicians seek to determine which treatment (such as which medication) is more beneficial for a patient's recovery (Wu et al., 2022). This naturally raises a question: how can we accurately infer the outcome if an instance were to receive an alternative treatment? This relates to the well-known problem of counterfactual outcome prediction (Pearl, 2009b). By predicting counterfactual outcomes, we can accurately estimate each individual's treatment effect, known as the individual treatment effect (ITE) (Rubin, 2005; Shalit et al., 2017), thereby assisting decision-making.

Randomized controlled trials (RCTs) are the gold standard for learning causal effects (Pearl, 2009b). In these trials, instances (experimental subjects) are randomly assigned to either the treatment or the control group. However, this is often costly, unethical, or even impractical (Guo et al., 2020a; Yao et al., 2021). Fortunately, the rapid expansion of big data in many fields offers significant opportunities for causal inference research (Winship and Morgan, 1999; Yao et al., 2021), as observational datasets are readily available and usually contain a large number of examples. Thus, we often concentrate on estimating treatment effects from observational data. Additionally, instances in such datasets are often intrinsically linked by auxiliary network structures, such as user-linked social networks. This type of data is typically referred to as networked observational data (Guo et al., 2020c; Huang et al., 2023).

In observational studies, treatment assignment often depends on specific attributes of an instance $\mathbf{x}$, leading to selection bias (Imbens and Rubin, 2015). In the medical scenario above, socioeconomic status influences both medication choice and patient recovery: higher socioeconomic status may increase access to expensive medications and positively impact health. Identifying and controlling for confounding factors (i.e., those affecting both treatment and outcome, thereby introducing selection bias in ITE estimation) is crucial for accurate predictions and presents the main challenge in learning ITE from observational data (Pearl, 2009a; Guo et al., 2020a). To address confounders, most existing methods assume strong ignorability (Johansson et al., 2016; Shalit et al., 2017; Yao et al., 2018), meaning all confounders are measurable and embedded within the observed features. However, this assumption is often unrealistic, as not all confounders can be measured. Bennett and Kallus (2019) proposed relaxing this assumption by using proxy variables for latent confounders. For networked observational data, several ITE estimation frameworks have been developed in recent years (Veitch et al., 2019; Guo et al., 2020c, 2021; Chu et al., 2021), which primarily leverage the network structure along with noisy, measurable observed variables as two sets of proxy variables to aid in learning and controlling for latent confounders. For instance, socioeconomic status can be inferred from easier-to-measure variables (e.g., postal codes, annual income) combined with social network patterns (e.g., community affiliation). While these methods have achieved empirical success, they focus on learning representations of latent confounding factors (latent confounders) to control confounding bias but overlook that some factors affect only the treatment, others affect only the outcome, and some may even be noise. In patient data, for example, age and socioeconomic status influence both treatment and outcome and thus act as confounding factors; the attending physician affects only the treatment and is referred to as an instrumental factor; genes and air temperature affect only the outcome and are referred to as adjustment factors; and information such as names and contact details constitutes noise factors. Using all patient features and network information solely to learn latent confounding factors introduces new biases (Abadie and Imbens, 2006; Häggström, 2018). Therefore, explicitly learning disentangled representations for these four types of latent factors is essential for accurately estimating ITE on networked observational data.

To address the aforementioned challenges, we present a novel generative framework based on the Variational Graph Autoencoder (VGAE) (Kipf and Welling, 2016b) for estimating individual treatment effects on networked observational data. We name our model Treatment effect estimation on Networked observational data by Disentangled Variational Graph Autoencoder (TNDVGA), which can effectively infer latent factors from proxy variables and auxiliary network information using a graph autoencoder, while employing the Hilbert-Schmidt Independence Criterion (HSIC) independence constraint to disentangle these factors into four mutually exclusive sets, thereby improving individual treatment effect estimation. Our main contributions are:

  • We propose a novel framework for learning individual treatment effect from networked observational data, termed TNDVGA, which can simultaneously learn representations of latent factors from both proxy variables and auxiliary network information while disentangling different latent factors to estimate treatment effect more effectively and accurately.

  • We introduce a kernel-based Hilbert-Schmidt Independence Criterion (HSIC) to assess the dependence between different representations of latent factors. This independence regularization is jointly optimized with other components of the model within a unified framework, enabling better learning of independent disentangled representations.

  • We perform extensive experiments to validate the effectiveness of our proposed framework TNDVGA. Results on multiple datasets indicate that TNDVGA achieves state-of-the-art performance, significantly outperforming baseline methods.

The rest of this article is organized as follows. Related work is reviewed in Section 2. Section 3 introduces the technical preliminaries and the problem statement. Section 4 describes the details of our proposed framework. We present comprehensive experimental results on different datasets in Section 5. Finally, Section 6 concludes our work and suggests directions for future research.

2. Related work

Three aspects of related work are reviewed in this section: (1) learning ITE from i.i.d. observational data; (2) learning ITE from networked observational data; and (3) disentangled representations for treatment effect estimation.

Learning ITE from i.i.d. observational data

Due to the substantial expense and occasional infeasibility of randomized experiments, there has been significant interest in estimating individual-level causal effects from observational data in recent years, especially with the emergence of big data. BART (Chipman et al., 2010) employed dimensionally adaptive random basis functions for causal effect estimation. Causal Forest (Wager and Athey, 2018) is a nonparametric approach that extends Breiman's random forest algorithm to estimate heterogeneous treatment effects. CFR (Shalit et al., 2017) is a representation learning approach that predicts ITE from observational data by projecting the original features into a latent space, capturing confounders by minimizing the prediction error on factual outcomes while reducing the imbalance between treatment and control groups. However, these methods depend on the strong ignorability assumption, which essentially ignores the effects of hidden confounding factors and is usually untenable in real-world observational studies. Various approaches have been suggested to relax this assumption. CEVAE (Louizos et al., 2017) followed the causal structure of inference with proxy variables and can simultaneously estimate the unknown latent space summarizing the confounders and the causal effect. Deep-Treat (Atan et al., 2018) employed a bias-removing autoencoder along with a policy optimization feedforward neural network to derive balanced representations and optimal policies from observational data. SITE (Yao et al., 2018) captured hidden confounders for individual treatment effect estimation through a local similarity-preserving method.

Learning ITE from networked observational data

Recently, the emergence of networked observational data in various real-world tasks has prompted several studies to relax the strong ignorability assumption by utilizing network information among instances, where the network also serves as a proxy for unobserved confounders. NetDeconf (Guo et al., 2020c) utilized network information and observed features to identify patterns of hidden confounders, enabling the learning of valid individual causal effects from networked observational data. CONE (Guo et al., 2020b) further employed Graph Attention Networks (GAT) to integrate network information, thereby mitigating hidden confounding effects. IGNITE (Guo et al., 2021) introduced a minimax game framework that simultaneously balances representations and predicts treatments to learn ITE from networked observational data. GIAL (Chu et al., 2021) leveraged the network structure to capture additional information by identifying imbalances within the network for estimating treatment effects. Thorat et al. (2023) utilized network information to mitigate hidden confounding bias when estimating ITE in networked observational studies with multiple treatments. However, these studies uniformly apply all feature information, including network information, to infer latent confounding factors without assuming disentanglement in treatment effect estimation, which may lead to estimation bias. In a network, the treatment administered to one instance may also influence the outcomes of its neighbors, a phenomenon known as spillover effects or interference (Arbour et al., 2016; Rakesh et al., 2018; Huang et al., 2023). Unlike previous works, we follow the assumption of Guo et al. (2020c) and Veitch et al. (2019) that conditioning on the latent confounders separates each individual's treatment and outcome from those of others. We leave the study of spillover effects as future work.

Disentangled representations for treatment effect estimation

From the perspective of causal representation learning, learning disentangled representations is one of the central challenges in machine learning (Schölkopf et al., 2021). Disentangled representations of latent factors derived from observational data can reduce the influence of instrumental factors and confounders on outcome prediction, thereby mitigating selection bias and significantly improving the accuracy of treatment effect estimation (Hassanpour and Greiner, 2019; Kuang et al., 2020b). Early methods primarily focused on variable decomposition (Kuang et al., 2017, 2020a), exploring treatment effect estimation by considering only adjustment variables and confounders as latent factors. This restricted approach resulted in imprecise confounder separation and hindered accurate estimation of individual treatment effects. Subsequently, many methods focused on decomposing pre-treatment variables into instrumental variables, confounding variables, and adjustment factors. For example, DRCFR (Hassanpour and Greiner, 2019) and DeR-CFR (Wu et al., 2022) disentangled latent factors into these three categories while balancing confounders and estimating treatment effects through counterfactual inference. Additionally, some methods imposed independence constraints on the model to achieve independent disentangled representations. RSB-Net (Zhang et al., 2019) utilized the Pearson Correlation Coefficient (PCC) to promote decorrelation between two sets of random variables. MIM-DRCFR (Cheng et al., 2022) introduced a method for learning disentangled representations by minimizing mutual information, while DeR-CFR employed an orthogonal loss to ensure that the representations of different learned latent factors contain uncorrelated information. Recently, an increasing number of methods based on the Variational Autoencoder (VAE) (Kingma, 2013) have been proposed to address disentanglement in individualized causal effect estimation. TEDVAE (Zhang et al., 2021b) employed a variational autoencoder to separate latent variables, incorporating a regularization term that included reconstruction losses for both treatments and outcomes. TVAE (Vowels et al., 2021) integrated noise factors and introduced a VAE with targeted learning regularization to estimate individual treatment effects. EDVAE (Liu et al., 2024) adopted a method of disentangling latent factors from both data and model perspectives for ITE estimation. VGANITE (Bao et al., 2022) combined VAE and Generative Adversarial Networks (GAN) (Goodfellow et al., 2014) to disentangle latent factors into three distinct sets. Intact-VAE (Wu and Fukumizu, 2021) emphasized the successful recovery of confounders through a novel prognostic score.

However, these studies primarily focus on estimating individualized causal effects from independent observational data. Given the importance of disentanglement in ITE estimation, it is crucial to incorporate this approach when estimating ITE from networked observational data. Additionally, independent regularizers need to be added to ensure that the information contained in features and network information is accurately transmitted to the corresponding representation spaces of each latent factor. Furthermore, our model takes into account often-overlooked noise factors, combines reconstruction losses for both treatment and outcome, and balances the distribution between treatment and control groups, all of which contribute to improved model performance.

3. Preliminaries

In this section, we first introduce the notations used in this article. We then outline the problem statement by providing the necessary technical preliminaries.

Notations

Throughout this work, we use unbolded lowercase letters (e.g., $t$) to denote scalars, bold lowercase letters (e.g., $\mathbf{x}$) to represent vectors, and bold uppercase letters (e.g., $\mathbf{A}$) for matrices. The $(i,j)$-th entry of a matrix $\mathbf{A}$ is denoted by $\mathbf{A}_{ij}$.

Networked observational data

In networked observational data, we define the features (covariates) of the $i$-th instance as $\mathbf{x}_{i}\in\mathbb{R}^{k}$, the treatment as $t_{i}$, and the outcome as $y_{i}\in\mathbb{R}$. We assume that all instances are connected through a network, represented by an adjacency matrix $\mathbf{A}$, and that the network is undirected with all edge weights equal (this work can be extended to weighted undirected networks and is also applicable to directed networks by utilizing specialized graph neural networks). Let $n$ denote the number of instances, so $\mathbf{A}\in\{0,1\}^{n\times n}$. The notation $\mathbf{A}_{ij}=\mathbf{A}_{ji}=1$ (or $0$) indicates the presence (or absence) of an edge between the $i$-th and $j$-th instances. Therefore, the tuple $(\{\mathbf{x}_{i},t_{i},y_{i}\}_{i=1}^{n},\mathbf{A})$ represents a networked observational dataset. Following the setup of (Shalit et al., 2017; Yao et al., 2018), we concentrate on cases where the treatment variable is binary, i.e., $t\in\{0,1\}$. Without loss of generality, $t_{i}=1$ and $t_{i}=0$ denote that the $i$-th instance is in the treatment or control group, respectively.

Next, we present the background knowledge necessary for learning individual treatment effects. We assume that for each pair of instance $i$ and treatment $t$, there exists a potential outcome $y_{i}^{t}$, representing the value that $y$ would take if treatment $t$ were applied to instance $i$ (Rubin, 1978). Note that only one potential outcome is observable, while the unobserved outcome $y_{i}^{1-t_{i}}$ is typically referred to as the counterfactual outcome. As a result, the observed outcome can be expressed as a function of the observed treatment and the potential outcomes, given by $y_{i}=t_{i}y_{i}^{1}+(1-t_{i})y_{i}^{0}$. The ITE for instance $i$ in the context of networked observational data is then defined as follows:

(1) \tau_{i}=\tau(\mathbf{x}_{i},\mathbf{A})=\mathbb{E}[y_{i}^{1}\mid\mathbf{x}_{i},\mathbf{A}]-\mathbb{E}[y_{i}^{0}\mid\mathbf{x}_{i},\mathbf{A}],

which measures the difference between the expected potential outcomes under treatment and control for instance $i$. Once the ITE has been estimated, the average treatment effect (ATE) can be obtained by averaging the ITE across all instances, $\text{ATE}=\frac{1}{n}\sum_{i=1}^{n}\tau_{i}$. Based on the aforementioned notations and definitions, we formally state the problem.

Definition 3.1 (Learning ITEs from Networked Observational Data).

Given the networked observational data $(\{\mathbf{x}_{i},t_{i},y_{i}\}_{i=1}^{n},\mathbf{A})$, our goal is to use the information from $(\mathbf{x}_{i},t_{i},y_{i})$ and the network adjacency matrix $\mathbf{A}$ to learn an estimate of the ITE $\tau_{i}$ for each instance $i$.

This paper is based on three essential assumptions necessary for estimating the individual treatment effect (Rosenbaum and Rubin, 1983):

Assumption 1 (Stable Unit Treatment Value Assumption (SUTVA)).

The potential outcomes for one unit are not affected by the treatment assigned to other units.

Assumption 2 (Overlap).

Each unit has a nonzero probability of receiving either treatment or control given the observed variables, i.e., $0<P(t=1\mid\mathbf{x})<1$.

Assumption 3 (Unconfoundedness).

Treatment assignment is independent of the potential outcomes when conditioning on the latent confounding factors, i.e., $t\perp\!\!\!\perp(y^{0},y^{1})\mid\mathbf{z}_{c}$. This assumption is a relaxed version of the unconfoundedness assumption commonly used in causal inference, as it allows for the presence of hidden confounders.

4. Methodology

In this section, we will first present a theorem on the identifiability of the individual treatment effect. Then, we introduce our TNDVGA framework designed to learn from networked observational data.

4.1. Identifiability

We introduce TNDVGA for estimating treatment effects based on the assumption that the observed covariates $\mathbf{x}$ and the network patterns $\mathbf{A}$ can be regarded as generated from four distinct sets of latent factors $\mathbf{z}=(\mathbf{z}_{t},\mathbf{z}_{c},\mathbf{z}_{y},\mathbf{z}_{o})$. In this context, $\mathbf{z}_{t}$ represents latent instrumental factors that influence the treatment but not the outcome, $\mathbf{z}_{c}$ includes latent confounding factors (latent confounders) that influence both the treatment and the outcome, $\mathbf{z}_{y}$ consists of latent adjustment factors that affect the outcome without affecting the treatment, and $\mathbf{z}_{o}$ refers to latent noise factors, i.e., covariates unrelated to either the treatment or the outcome. The proposed causal graph for ITE estimation is shown in Fig. 1. By explicitly modeling these four latent factors, the framework acknowledges that not all observed variables act as proxies for confounding factors and instead facilitates learning the various types of unobserved factors.

Figure 1. The causal diagram of the proposed TNDVGA. $\mathbf{x}$ represents the observed variables, $\mathbf{A}$ denotes the network structure, $t$ is the treatment, $y$ is the outcome, $\mathbf{z}_{t}$ denotes latent instrumental factors affecting only the treatment, $\mathbf{z}_{c}$ denotes latent confounding factors, $\mathbf{z}_{y}$ denotes latent adjustment factors affecting only the outcome, and $\mathbf{z}_{o}$ denotes latent noise factors unrelated to both treatment and outcome.

Utilizing network observational data, we formulate and prove the following theorem about the identifiability of individual treatment effects:

Theorem 4.1 (Identifiability of ITE).

If we recover $p(\mathbf{z}_{c},\mathbf{z}_{y}\mid\mathbf{x},\mathbf{A})$ and $p(y\mid t,\mathbf{z}_{c},\mathbf{z}_{y})$, then the proposed TNDVGA can recover the individual treatment effect (ITE) from networked observational data.

Proof.

According to the aforementioned assumptions and networked observational data, the potential outcome distribution for any instance 𝐱\mathbf{x} can be calculated as follows:

(2) \begin{split}&p(y^{t}\mid\mathbf{x},\mathbf{A})\\
&\overset{(i)}{=}\int_{\{\mathbf{z}_{t},\mathbf{z}_{c},\mathbf{z}_{y},\mathbf{z}_{o}\}}p(y^{t}\mid\mathbf{z}_{t},\mathbf{z}_{c},\mathbf{z}_{y},\mathbf{z}_{o})p(\mathbf{z}_{t},\mathbf{z}_{c},\mathbf{z}_{y},\mathbf{z}_{o}\mid\mathbf{x},\mathbf{A})d\mathbf{z}_{t}d\mathbf{z}_{c}d\mathbf{z}_{y}d\mathbf{z}_{o}\\
&\overset{(ii)}{=}\int_{\{\mathbf{z}_{t},\mathbf{z}_{c},\mathbf{z}_{y},\mathbf{z}_{o}\}}p(y^{t}\mid t,\mathbf{z}_{t},\mathbf{z}_{c},\mathbf{z}_{y},\mathbf{z}_{o})p(\mathbf{z}_{t},\mathbf{z}_{c},\mathbf{z}_{y},\mathbf{z}_{o}\mid\mathbf{x},\mathbf{A})d\mathbf{z}_{t}d\mathbf{z}_{c}d\mathbf{z}_{y}d\mathbf{z}_{o}\\
&\overset{(iii)}{=}\int_{\{\mathbf{z}_{t},\mathbf{z}_{c},\mathbf{z}_{y},\mathbf{z}_{o}\}}p(y\mid t,\mathbf{z}_{t},\mathbf{z}_{c},\mathbf{z}_{y},\mathbf{z}_{o})p(\mathbf{z}_{t},\mathbf{z}_{c},\mathbf{z}_{y},\mathbf{z}_{o}\mid\mathbf{x},\mathbf{A})d\mathbf{z}_{t}d\mathbf{z}_{c}d\mathbf{z}_{y}d\mathbf{z}_{o}\\
&\overset{(iv)}{=}\int_{\{\mathbf{z}_{c},\mathbf{z}_{y}\}}p(y\mid t,\mathbf{z}_{c},\mathbf{z}_{y})p(\mathbf{z}_{c},\mathbf{z}_{y}\mid\mathbf{x},\mathbf{A})d\mathbf{z}_{c}d\mathbf{z}_{y}.\end{split}

Equality (i) is a straightforward expectation over $p(\mathbf{z}_{t},\mathbf{z}_{c},\mathbf{z}_{y},\mathbf{z}_{o}\mid\mathbf{x},\mathbf{A})$, equality (ii) follows from Assumption 3, i.e., the conditional independence $t\perp\!\!\!\perp(y^{0},y^{1})\mid\mathbf{z}_{c}$, equality (iii) is derived from the commonly used consistency assumption (Imbens and Rubin, 2015), and equality (iv) follows from the Markov property $y\perp\!\!\!\perp\mathbf{z}_{t},\mathbf{z}_{o}\mid t,\mathbf{z}_{c},\mathbf{z}_{y}$. Thus, if we can model $p(\mathbf{z}_{c},\mathbf{z}_{y}\mid\mathbf{x},\mathbf{A})$ and $p(y\mid t,\mathbf{z}_{c},\mathbf{z}_{y})$ correctly, then the ITE can be identified. ∎

Previous work by Zhang et al. (2021b) derived an identifiability proof under the ignorability assumption, based on inferring the relevant parent factors from proxy variables and/or other observed variables, and we take inspiration from their approach. In contrast to their model, ours includes latent noise factors, which makes the composition of the latent variables closer to reality, and it additionally exploits the network information alongside the proxy variables $\mathbf{x}$. With these two modifications, we establish the identifiability of ITE in Theorem 4.1, which highlights the importance of distinguishing between different latent factors and utilizing only the appropriate ones for treatment effect estimation on networked observational data.

4.2. The proposed framework: TNDVGA

An overview of the proposed framework, TNDVGA, which learns individual treatment effects from networked observational data, is shown in Fig. 2. The framework consists of three key components: (1) learning disentangled latent factors through a Variational Graph Autoencoder (VGAE); (2) predicting potential outcomes and treatment assignments; and (3) enforcing independence of the latent factors. We provide a detailed explanation of these components in the following sections.

4.2.1. Learning Disentangled Latent factors through VGAE

From the theoretical analysis in the previous section, we have seen that eliminating unnecessary factors is essential to effectively and accurately estimating the treatment effect. However, in practice, we do not know the mechanism of generating 𝐱\mathbf{x} from 𝐳\mathbf{z} and the mechanism of disentangling 𝐳\mathbf{z} into different disjoint sets. This requires us to propose a method that can learn to disentangle the latent factors 𝐳\mathbf{z} and estimate ITE through what the model has learned.

Figure 2. The overall architecture of TNDVGA consists of a generative network and an inference network for disentangling latent factors.

Therefore, we aim to infer the posterior distribution $p_{\theta}(\mathbf{z}\mid\mathbf{x},\mathbf{A})$ of the latent factors $\mathbf{z}$ from the observed proxy covariates $\mathbf{x}$ and the network information $\mathbf{A}$, while disentangling $\mathbf{z}$ into latent instrumental factors $\mathbf{z}_{t}$, confounding factors $\mathbf{z}_{c}$, adjustment factors $\mathbf{z}_{y}$, and noise factors $\mathbf{z}_{o}$. Since exact inference is intractable, we use the variational inference framework to approximate the posterior with a tractable distribution. We adopt Variational Graph Autoencoders (VGAEs) to construct our model. Proposed by Kipf and Welling (2016b), VGAEs extend Variational Autoencoders (VAEs) to account for graph structure in the data. For every observed variable $\mathbf{x}$, a VGAE defines a multi-dimensional latent variable $\mathbf{z}$, and it relies on the adjacency matrix $\mathbf{A}$, which is utilized by the Graph Neural Network (GNN) in the encoder to enforce the structure of the posterior approximation $q_{\phi}(\mathbf{z}\mid\mathbf{x},\mathbf{A})$. As shown in Fig. 2, we use four separate encoders to approximate the variational posteriors $q_{\phi_{t}}(\mathbf{z}_{t}\mid\mathbf{x},\mathbf{A})$, $q_{\phi_{c}}(\mathbf{z}_{c}\mid\mathbf{x},\mathbf{A})$, $q_{\phi_{y}}(\mathbf{z}_{y}\mid\mathbf{x},\mathbf{A})$, and $q_{\phi_{o}}(\mathbf{z}_{o}\mid\mathbf{x},\mathbf{A})$, disentangling the latent variable $\mathbf{z}$ into $\mathbf{z}_{t}$, $\mathbf{z}_{c}$, $\mathbf{z}_{y}$, and $\mathbf{z}_{o}$, respectively. (Unlike (Louizos et al., 2017), our method does not employ $t$ and $y$ as inputs to the encoder, because we assume that $t$ and $y$ are generated by the latent factors; the inference of the latent factors therefore relies solely on $\mathbf{x}$. For additional information, see (Zhang et al., 2021b).) These four latent factors are then used by the decoder $p_{\theta}(\mathbf{x}\mid\mathbf{z}_{t},\mathbf{z}_{c},\mathbf{z}_{y},\mathbf{z}_{o})$ to reconstruct $\mathbf{x}$, $t$, and $y$. (Note that, as shown in Fig. 1, we have the independence property $\mathbf{x}\perp\!\!\!\perp\mathbf{A}\mid\{\mathbf{z}_{t},\mathbf{z}_{c},\mathbf{z}_{y},\mathbf{z}_{o}\}$, so the original VGAE decoder satisfies $p_{\theta}(\mathbf{x}\mid\mathbf{z}_{t},\mathbf{z}_{c},\mathbf{z}_{y},\mathbf{z}_{o},\mathbf{A})=p_{\theta}(\mathbf{x}\mid\mathbf{z}_{t},\mathbf{z}_{c},\mathbf{z}_{y},\mathbf{z}_{o})$; the derivations for $t$ and $y$ are similar.) Following the standard VGAE design, we select the prior distributions $p(\mathbf{z}_{t})$, $p(\mathbf{z}_{c})$, $p(\mathbf{z}_{y})$, and $p(\mathbf{z}_{o})$ as factorized Gaussian distributions:

(3) \begin{split}p(\mathbf{z}_{t})=\prod_{j=1}^{d_{\mathbf{z}_{t}}}\mathcal{N}(\{\mathbf{z}_{t}\}_{j}\mid 0,1);\quad p(\mathbf{z}_{c})=\prod_{j=1}^{d_{\mathbf{z}_{c}}}\mathcal{N}(\{\mathbf{z}_{c}\}_{j}\mid 0,1);\\
p(\mathbf{z}_{y})=\prod_{j=1}^{d_{\mathbf{z}_{y}}}\mathcal{N}(\{\mathbf{z}_{y}\}_{j}\mid 0,1);\quad p(\mathbf{z}_{o})=\prod_{j=1}^{d_{\mathbf{z}_{o}}}\mathcal{N}(\{\mathbf{z}_{o}\}_{j}\mid 0,1),\end{split}

where $d_{\mathbf{z}_{t}}$, $d_{\mathbf{z}_{c}}$, $d_{\mathbf{z}_{y}}$, and $d_{\mathbf{z}_{o}}$ represent the dimensions of the latent instrumental, confounding, adjustment, and noise factors, respectively, and $\{\mathbf{z}_{t}\}_{j}$ denotes the $j$-th dimension of $\mathbf{z}_{t}$; the same notation applies to $\mathbf{z}_{c}$, $\mathbf{z}_{y}$, and $\mathbf{z}_{o}$.

The probabilistic representation of the generative model for 𝐱\mathbf{x}, tt, and yy is as follows:

(4) p_{\theta_{\mathbf{x}}}(\mathbf{x}\mid\mathbf{z}_{t},\mathbf{z}_{c},\mathbf{z}_{y},\mathbf{z}_{o})=\prod_{j=1}^{k}\mathcal{N}(\mu_{j}=f_{1j}(\mathbf{z}_{\{t,c,y,o\}}),\sigma_{j}^{2}=f_{2j}(\mathbf{z}_{\{t,c,y,o\}})),
(5) p_{\theta_{t}}(t\mid\mathbf{z}_{t},\mathbf{z}_{c})=Bern(\sigma(f_{3}(\mathbf{z}_{c},\mathbf{z}_{t}))),
(6) \begin{split}p_{\theta_{y}}(y\mid t,\mathbf{z}_{c},\mathbf{z}_{y})&=\mathcal{N}(\mu=\hat{\mu},\sigma^{2}={\hat{\sigma}}^{2}),\\
\hat{\mu}=tf_{4}(\mathbf{z}_{c},\mathbf{z}_{y})+(1-t)f_{5}(\mathbf{z}_{c},\mathbf{z}_{y});\quad{\hat{\sigma}}^{2}=tf_{6}(\mathbf{z}_{c},\mathbf{z}_{y})+(1-t)f_{7}(\mathbf{z}_{c},\mathbf{z}_{y}),\end{split}

where f1f_{1} to f7f_{7} are functions parameterized by fully connected neural networks, σ()\sigma(\cdot) represents the logistic function, and BernBern refers to the Bernoulli distribution. The distribution of 𝐱\mathbf{x} should be chosen based on the dataset, and in our case, we approximate it with a Gaussian distribution, as the data we use consists of continuous variables. Similarly, for the continuous outcome variable yy, we also parameterize it as a Gaussian distribution, where the mean and variance are defined by two separate neural networks defining p(yt=1,𝐳c,𝐳y)p(y\mid t=1,\mathbf{z}_{c},\mathbf{z}_{y}) and p(yt=0,𝐳c,𝐳y)p(y\mid t=0,\mathbf{z}_{c},\mathbf{z}_{y}), following the two-headed approach proposed by Shalit et al. (2017).

In the inference model, since we input the network information 𝐀\mathbf{A} into the encoder, we design the encoder based on the idea of Variational Graph Autoencoders (VGAEs). Specifically, we utilize Graph Convolutional Networks (GCNs) (Kipf and Welling, 2016a) as the encoder to obtain latent factor representations. GCN has been shown to effectively handle non-Euclidean data, such as graph-structured data, across diverse settings. To simplify notation, we describe the message propagation rule using a single GCN layer, as shown below:

(7) \mathbf{h}=GCN(\mathbf{x},\mathbf{A})={Relu}((\hat{\mathbf{A}}\mathbf{X})_{\mathbf{x}}\mathbf{W})={Relu}((\tilde{\mathbf{D}}^{-\frac{1}{2}}\tilde{\mathbf{A}}\tilde{\mathbf{D}}^{-\frac{1}{2}}\mathbf{X})_{\mathbf{x}}\mathbf{W}),

where $\mathbf{h}\in\mathbb{R}^{d}$ is the output vector of the GCN, $\mathbf{X}\in\mathbb{R}^{n\times k}$ is the feature matrix of the instances, $(\hat{\mathbf{A}}\mathbf{X})_{\mathbf{x}}$ denotes the row of the matrix product corresponding to instance $\mathbf{x}$, $\tilde{\mathbf{A}}=\mathbf{A}+\mathbf{I}_{n}$ with $\mathbf{I}_{n}$ the identity matrix, $\tilde{\mathbf{D}}_{ii}=\sum_{j=1}^{n}\tilde{\mathbf{A}}_{ij}$, and $\mathbf{W}\in\mathbb{R}^{k\times d}$ is the weight matrix. $Relu(\cdot)$ denotes the ReLU activation function. This leads to the following definition of the variational approximation of the posterior distributions of the latent factors:

(8) \begin{split}q_{\phi_{t}}(\mathbf{z}_{t}\mid\mathbf{x},\mathbf{A})=\mathcal{N}({\boldsymbol{\mu}}={\hat{\boldsymbol{\mu}}}_{t},{\rm{diag}}(\boldsymbol{\sigma}^{2})={\rm{diag}}({\hat{\boldsymbol{\sigma}}}_{t}^{2})),\\
q_{\phi_{c}}(\mathbf{z}_{c}\mid\mathbf{x},\mathbf{A})=\mathcal{N}({\boldsymbol{\mu}}={\hat{\boldsymbol{\mu}}}_{c},{\rm{diag}}(\boldsymbol{\sigma}^{2})={\rm{diag}}({\hat{\boldsymbol{\sigma}}}_{c}^{2})),\\
q_{\phi_{y}}(\mathbf{z}_{y}\mid\mathbf{x},\mathbf{A})=\mathcal{N}({\boldsymbol{\mu}}={\hat{\boldsymbol{\mu}}}_{y},{\rm{diag}}(\boldsymbol{\sigma}^{2})={\rm{diag}}({\hat{\boldsymbol{\sigma}}}_{y}^{2})),\\
q_{\phi_{o}}(\mathbf{z}_{o}\mid\mathbf{x},\mathbf{A})=\mathcal{N}({\boldsymbol{\mu}}={\hat{\boldsymbol{\mu}}}_{o},{\rm{diag}}(\boldsymbol{\sigma}^{2})={\rm{diag}}({\hat{\boldsymbol{\sigma}}}_{o}^{2})),
\end{split}

where 𝝁^t\hat{\boldsymbol{\mu}}_{t}, 𝝁^c\hat{\boldsymbol{\mu}}_{c}, 𝝁^y\hat{\boldsymbol{\mu}}_{y}, 𝝁^o\hat{\boldsymbol{\mu}}_{o} and diag(𝝈^t2){\rm{diag}}(\hat{\boldsymbol{\sigma}}_{t}^{2}), diag(𝝈^c2){\rm{diag}}(\hat{\boldsymbol{\sigma}}_{c}^{2}), diag(𝝈^y2){\rm{diag}}(\hat{\boldsymbol{\sigma}}_{y}^{2}), diag(𝝈^o2){\rm{diag}}(\hat{\boldsymbol{\sigma}}_{o}^{2}) are the means and covariance matrix of the Gaussian distributions, parameterized by the GCN as shown in Equation (7). Additionally, 𝝁^t\hat{\boldsymbol{\mu}}_{t} and log𝝈t2{\rm log}\,{\boldsymbol{\sigma}}_{t}^{2} are learned from two GCNs that share the training parameters of the first layer, and the same applies to the remaining three pairs.
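To make the encoder concrete, the following is a minimal PyTorch sketch of one of the four factor-specific graph encoders, assuming a dense adjacency matrix; the names (FactorEncoder, gcn_propagate) and hidden sizes are illustrative assumptions and not taken from our released implementation.

```python
import torch
import torch.nn as nn

def gcn_propagate(h, adj):
    # Symmetrically normalized propagation D^{-1/2} (A + I) D^{-1/2} H, cf. Eq. (7)
    # (applying the weight matrix before propagation is algebraically equivalent).
    n = adj.size(0)
    a_tilde = adj + torch.eye(n, device=adj.device)
    d_inv_sqrt = a_tilde.sum(dim=1).pow(-0.5)
    return d_inv_sqrt.unsqueeze(1) * (a_tilde @ (d_inv_sqrt.unsqueeze(1) * h))

class FactorEncoder(nn.Module):
    """One of the four graph encoders q(z_* | x, A): a shared first GCN layer and
    separate second-layer heads for the mean and log-variance (illustrative sketch)."""
    def __init__(self, in_dim, hidden_dim, latent_dim):
        super().__init__()
        self.w_shared = nn.Linear(in_dim, hidden_dim, bias=False)
        self.w_mu = nn.Linear(hidden_dim, latent_dim, bias=False)
        self.w_logvar = nn.Linear(hidden_dim, latent_dim, bias=False)

    def forward(self, x, adj):
        h = torch.relu(gcn_propagate(self.w_shared(x), adj))  # shared first GCN layer
        mu = gcn_propagate(self.w_mu(h), adj)                  # mean head, cf. Eq. (8)
        logvar = gcn_propagate(self.w_logvar(h), adj)          # log-variance head
        return mu, logvar

def reparameterize(mu, logvar):
    # z = mu + sigma * eps with eps ~ N(0, I)
    return mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
```

In TNDVGA, four such encoders (for $\mathbf{z}_{t}$, $\mathbf{z}_{c}$, $\mathbf{z}_{y}$, and $\mathbf{z}_{o}$) would be instantiated with the same inputs but separate parameters.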

4.2.2. Predicting Potential Outcomes and Treatment Assignments

The latent factors 𝐳t\mathbf{z}_{t} and 𝐳c\mathbf{z}_{c} are associated with the treatment tt, whereas 𝐳c\mathbf{z}_{c} and 𝐳y\mathbf{z}_{y} are associated with the outcomes yy, as illustrated in Fig. 1. To ensure that the treatment information is effectively captured by the union of 𝐳t\mathbf{z}_{t} and 𝐳c\mathbf{z}_{c}, we add an auxiliary classifier to predict tt from the encoder’s output, under the assumption that 𝐳t\mathbf{z}_{t} and 𝐳c\mathbf{z}_{c} can accurately predict tt. Additionally, yy is predicted using two regression networks under different treatments to ensure that the outcome information is captured by the union of 𝐳c\mathbf{z}_{c} and 𝐳y\mathbf{z}_{y}, based on the assumption that 𝐳c\mathbf{z}_{c} and 𝐳y\mathbf{z}_{y} can accurately predict yy. Inspired by related approaches (Zhang et al., 2021b; Liu et al., 2024), the classifier and regression networks are defined as follows:

(9) q_{\eta_{t}}(t\mid\mathbf{z}_{t},\mathbf{z}_{c})=Bern(\sigma(h_{1}(\mathbf{z}_{c},\mathbf{z}_{t}))),
(10) \begin{split}q_{\eta_{y}}(y\mid t,\mathbf{z}_{c},\mathbf{z}_{y})&=\mathcal{N}(\mu=\hat{\mu},\sigma^{2}={\hat{\sigma}}^{2}),\\
\hat{\mu}=th_{2}(\mathbf{z}_{c},\mathbf{z}_{y})+(1-t)h_{3}(\mathbf{z}_{c},\mathbf{z}_{y}),\quad{\hat{\sigma}}^{2}=th_{4}(\mathbf{z}_{c},\mathbf{z}_{y})+(1-t)h_{5}(\mathbf{z}_{c},\mathbf{z}_{y}),\end{split}

where h1h_{1} to h5h_{5} are functions parameterized by fully connected neural networks, and the distribution settings are similar to those in Equations (5) and (6).
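For concreteness, below is a minimal PyTorch sketch of the auxiliary treatment classifier of Eq. (9) and a two-headed outcome regressor in the spirit of Eq. (10); only the outcome means are modeled, and the hidden sizes and class names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TreatmentClassifier(nn.Module):
    """Auxiliary classifier q(t | z_t, z_c), cf. Eq. (9) (sketch)."""
    def __init__(self, dim_zt, dim_zc, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim_zt + dim_zc, hidden), nn.ELU(), nn.Linear(hidden, 1))

    def forward(self, z_t, z_c):
        # Returns the logit of P(t = 1 | z_t, z_c).
        return self.net(torch.cat([z_t, z_c], dim=-1)).squeeze(-1)

class TwoHeadedOutcome(nn.Module):
    """Outcome model q(y | t, z_c, z_y), cf. Eq. (10): one regression head per
    treatment arm, following the two-headed design of Shalit et al. (2017);
    the variance heads are omitted for brevity."""
    def __init__(self, dim_zc, dim_zy, hidden=64):
        super().__init__()
        def head():
            return nn.Sequential(
                nn.Linear(dim_zc + dim_zy, hidden), nn.ELU(), nn.Linear(hidden, 1))
        self.mu0, self.mu1 = head(), head()

    def forward(self, t, z_c, z_y):
        z = torch.cat([z_c, z_y], dim=-1)
        y0, y1 = self.mu0(z).squeeze(-1), self.mu1(z).squeeze(-1)
        y_factual = torch.where(t.bool(), y1, y0)  # head matching the observed treatment
        return y_factual, (y0, y1)
```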

4.2.3. Enforcing Independence of Latent factors

Explicitly enhancing the independence of disentangled latent factors encourages the graph encoder to more effectively capture distinct and mutually independent information associated with each latent factor. In the following, we detail the regularization applied to enforce independence among the latent factors.

The goal of our method is for the encoder to capture disentangled latent factors—namely, 𝐳y\mathbf{z}_{y}, 𝐳c\mathbf{z}_{c}, 𝐳t\mathbf{z}_{t}, and 𝐳o\mathbf{z}_{o}—that each contain exclusive information. This requires increasing the statistical independence between these latent factors to further strengthen disentanglement. Given the high dimensionality of the latent factors, using histogram-based measures is infeasible. Therefore, we use the Hilbert-Schmidt Independence Criterion (HSIC) (Gretton et al., 2005) to promote sufficient independence among different latent factors.

Specifically, let 𝐳t,\mathbf{z}_{t,*} represent the d𝐳td_{\mathbf{z}_{t}}-dimensional random variable corresponding to the latent factor 𝐳t\mathbf{z}_{t}. Consider a measurable, positive definite kernel κt\kappa_{t} defined over the domain of 𝐳t,\mathbf{z}_{t,*}, with its associated Reproducing Kernel Hilbert Space (RKHS) denoted by t\mathcal{H}_{t}. The mapping function ψt()\psi_{t}(\cdot) transforms 𝐳t,\mathbf{z}_{t,*} into t\mathcal{H}_{t} according to the kernel κt\kappa_{t}. Similarly, for 𝐳y\mathbf{z}_{y}, 𝐳c\mathbf{z}_{c}, and 𝐳o\mathbf{z}_{o}, the same definitions apply. Given a pair of latent factors 𝐳t\mathbf{z}_{t} and 𝐳c\mathbf{z}_{c}, where 𝐳t,\mathbf{z}_{t,*} and 𝐳c,\mathbf{z}_{c,*} are jointly sampled from the distribution p(𝐳t,,𝐳c,)p(\mathbf{z}_{t,*},\mathbf{z}_{c,*}), the cross-covariance operator 𝒞𝐳t,,𝐳c,\mathcal{C}_{\mathbf{z}_{t,*},\mathbf{z}_{c,*}} in the RKHS of κt\kappa_{t} and κc\kappa_{c} is defined as:

(11) {\mathcal{C}}_{{\mathbf{z}}_{t,*},\mathbf{z}_{c,*}}=\mathbb{E}_{p(\mathbf{z}_{t,*},\mathbf{z}_{c,*})}\left[(\psi_{t}(\mathbf{z}_{t,*})-\boldsymbol{\mu}_{\mathbf{z}_{t,*}})^{\mathsf{T}}(\psi_{c}(\mathbf{z}_{c,*})-\boldsymbol{\mu}_{\mathbf{z}_{c,*}})\right],

where $\boldsymbol{\mu}_{\mathbf{z}_{t,*}}=\mathbb{E}(\psi_{t}(\mathbf{z}_{t,*}))$ and $\boldsymbol{\mu}_{\mathbf{z}_{c,*}}=\mathbb{E}(\psi_{c}(\mathbf{z}_{c,*}))$. Then, HSIC is defined as follows:

(12) {\rm{HSIC}}({\mathbf{z}}_{t,*},{\mathbf{z}}_{c,*}):={\|{\mathcal{C}}_{{\mathbf{z}}_{t,*},\mathbf{z}_{c,*}}\|}_{\rm HS}^{2},

where $\|\cdot\|_{\rm HS}$ is the Hilbert-Schmidt norm, which generalizes the Frobenius norm on matrices. It is known that for two random variables $\mathbf{z}_{t,*}$ and $\mathbf{z}_{c,*}$ and characteristic kernels $\kappa_{\mathbf{z}_{t,*}}$ and $\kappa_{\mathbf{z}_{c,*}}$, if $\mathbb{E}[\kappa_{\mathbf{z}_{t,*}}({\mathbf{z}_{t,*}},{\mathbf{z}_{t,*}})]<\infty$ and $\mathbb{E}[\kappa_{\mathbf{z}_{c,*}}({\mathbf{z}_{c,*}},{\mathbf{z}_{c,*}})]<\infty$, then ${\rm HSIC}({\mathbf{z}}_{t,*},{\mathbf{z}}_{c,*})=0$ if and only if ${\mathbf{z}}_{t,*}\perp\!\!\!\perp{\mathbf{z}}_{c,*}$. In practice, we employ an unbiased estimator of ${\rm HSIC}({\mathbf{z}}_{t,*},{\mathbf{z}}_{c,*})$ with $n$ samples (Song et al., 2012), defined as:

(13) {\rm{HSIC}}({\mathbf{z}}_{t,*},{\mathbf{z}}_{c,*})=\frac{1}{n(n-3)}\left[{\rm tr}(\tilde{\mathbf{U}}\tilde{\mathbf{V}}^{\mathsf{T}})+\frac{\mathbf{1}^{\mathsf{T}}\tilde{\mathbf{U}}\mathbf{1}\mathbf{1}^{\mathsf{T}}\tilde{\mathbf{V}}^{\mathsf{T}}\mathbf{1}}{(n-1)(n-2)}-\frac{2}{n-2}\mathbf{1}^{\mathsf{T}}\tilde{\mathbf{U}}\tilde{\mathbf{V}}^{\mathsf{T}}\mathbf{1}\right],

where $\tilde{\mathbf{U}}$ and $\tilde{\mathbf{V}}$ denote the Gram matrices computed with $\kappa_{\mathbf{z}_{t,*}}$ and $\kappa_{\mathbf{z}_{c,*}}$, respectively, with their diagonal elements set to zero. In our approach, we employ the radial basis function (RBF) kernel. The analysis for the other pairs of latent factors follows similarly.

The advantage of using Equation (13) to measure the dependence between different latent factors lies in its ability to capture more complex, nonlinear dependencies by mapping latent factors into the RKHS. The HSIC estimator we employ is unbiased, which is both effective and computationally efficient.
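Below is a small PyTorch sketch of the unbiased HSIC estimator of Eq. (13) with RBF kernels; the kernel bandwidth sigma is a hyperparameter introduced here for illustration rather than a value reported above.

```python
import torch

def rbf_gram(z, sigma=1.0):
    # RBF Gram matrix K_ij = exp(-||z_i - z_j||^2 / (2 sigma^2))
    sq_dists = torch.cdist(z, z) ** 2
    return torch.exp(-sq_dists / (2.0 * sigma ** 2))

def hsic_unbiased(z_a, z_b, sigma=1.0):
    """Unbiased HSIC estimator of Eq. (13) (Song et al., 2012); illustrative sketch."""
    n = z_a.size(0)                                  # requires n > 3
    u = rbf_gram(z_a, sigma)
    v = rbf_gram(z_b, sigma)
    u = u - torch.diag_embed(torch.diagonal(u))      # tilde-U: zero the diagonal
    v = v - torch.diag_embed(torch.diagonal(v))      # tilde-V: zero the diagonal
    one = torch.ones(n, 1, device=z_a.device, dtype=z_a.dtype)
    term1 = torch.trace(u @ v)
    term2 = (one.T @ u @ one) * (one.T @ v @ one) / ((n - 1) * (n - 2))
    term3 = 2.0 / (n - 2) * (one.T @ u @ v @ one)
    return ((term1 + term2 - term3) / (n * (n - 3))).squeeze()
```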

4.3. Loss Function of TNDVGA

In this section, we design a loss function that combines all the key components of ITE estimation, thereby facilitating the end-to-end training of disentangled latent factor representations.

4.3.1. Loss for VGAE

The encoder and decoder parameters can be learned by minimizing the negative evidence lower bound (ELBO), consistent with the standard VGAE (Kipf and Welling, 2016b), where ii denotes the ii-th instance:

(14) \begin{split}\mathcal{L}_{\rm ELBO}(\mathbf{x}_{i},t_{i},y_{i})=\,&-\mathbb{E}_{q_{\phi_{t_{i}}}q_{\phi_{c_{i}}}q_{\phi_{y_{i}}}q_{\phi_{o_{i}}}}[{\rm log}\,p_{\theta_{\mathbf{x}_{i}}}({\mathbf{x}_{i}}\mid\mathbf{z}_{t,i},\mathbf{z}_{c,i},\mathbf{z}_{y,i},\mathbf{z}_{o,i})+{\rm log}\,p_{\theta_{t_{i}}}(t_{i}\mid\mathbf{z}_{t,i},\mathbf{z}_{c,i})\\
&+\,{\rm log}\,p_{\theta_{y_{i}}}(y_{i}\mid t_{i},\mathbf{z}_{c,i},\mathbf{z}_{y,i})]+D_{KL}(q_{\phi_{t_{i}}}(\mathbf{z}_{t,i}\mid\mathbf{x}_{i},\mathbf{A})\|p(\mathbf{z}_{t,i}))\\
&+\,D_{KL}(q_{\phi_{c_{i}}}(\mathbf{z}_{c,i}\mid\mathbf{x}_{i},\mathbf{A})\|p(\mathbf{z}_{c,i}))+D_{KL}(q_{\phi_{y_{i}}}(\mathbf{z}_{y,i}\mid\mathbf{x}_{i},\mathbf{A})\|p(\mathbf{z}_{y,i}))\\
&+\,D_{KL}(q_{\phi_{o_{i}}}(\mathbf{z}_{o,i}\mid\mathbf{x}_{i},\mathbf{A})\|p(\mathbf{z}_{o,i})).\end{split}
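Each KL term in Eq. (14) is between a diagonal Gaussian posterior and a standard normal prior and therefore has the usual closed form; a short sketch (assuming the encoder returns the mean and log-variance as in the encoder sketch above):

```python
import torch

def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), averaged over the batch,
    as used for each latent factor in Eq. (14)."""
    kl_per_instance = 0.5 * torch.sum(logvar.exp() + mu ** 2 - 1.0 - logvar, dim=-1)
    return kl_per_instance.mean()
```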

4.3.2. Loss for Potential Outcome Prediction and Treatment Assignment Prediction

The factual loss function for predicting potential outcomes, along with the loss function for predicting treatment assignments, is defined as follows:

(15) \mathcal{L}_{treat}(t_{i},\mathbf{z}_{t,i},\mathbf{z}_{c,i})=-\mathbb{E}_{q_{\phi_{t_{i}}}q_{\phi_{c_{i}}}}(q_{\eta_{t_{i}}}(t_{i}\mid\mathbf{z}_{t,i},\mathbf{z}_{c,i})),
(16) \mathcal{L}_{pred}(t_{i},y_{i},\mathbf{z}_{c,i},\mathbf{z}_{y,i})=-\mathbb{E}_{q_{\phi_{c_{i}}}q_{\phi_{y_{i}}}}(q_{\eta_{y_{i}}}(y_{i}\mid t_{i},\mathbf{z}_{c,i},\mathbf{z}_{y,i})).

4.3.3. Loss for HSIC Independence Regularizer

We apply pairwise independence constraints to the latent factors $\mathbf{z}_{t}$, $\mathbf{z}_{c}$, $\mathbf{z}_{y}$, and $\mathbf{z}_{o}$ in order to improve the statistical independence between the disentangled representations. The HSIC regularizer $\mathcal{L}_{indep}$ is calculated as follows:

(17) \mathcal{L}_{indep}(\mathbf{z}_{t,*},\mathbf{z}_{c,*},\mathbf{z}_{y,*},\mathbf{z}_{o,*})=\sum_{k\neq m,\ k,m\in\{t,c,y,o\}}{\rm HSIC}({\mathbf{z}}_{k,*},{\mathbf{z}}_{m,*}).
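Using the hsic_unbiased helper sketched earlier, this regularizer can be assembled by summing over factor pairs; in the sketch below each unordered pair is counted once, which differs from the ordered sum in Eq. (17) only by a constant factor, and the dictionary keys are illustrative.

```python
from itertools import combinations

def independence_regularizer(factors, sigma=1.0):
    """L_indep of Eq. (17): sum of unbiased HSIC over all pairs of latent factors.
    `factors` is, e.g., {'t': z_t, 'c': z_c, 'y': z_y, 'o': z_o}."""
    loss = 0.0
    for (_, z_a), (_, z_b) in combinations(factors.items(), 2):
        loss = loss + hsic_unbiased(z_a, z_b, sigma)
    return loss
```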

4.3.4. Loss for Balanced Representation

As shown in Fig. 1, we observe that $\mathbf{z}_{y}\perp\!\!\!\perp t$, implying that $p(\mathbf{z}_{y}\mid t=0)=p(\mathbf{z}_{y}\mid t=1)$. Therefore, following the approach in (Hassanpour and Greiner, 2019), we aim for the learned $\mathbf{z}_{y}$ to exclude any confounding information, ensuring that all confounding factors are captured within $\mathbf{z}_{c}$. This is crucial for the accuracy of the treatment effect estimation. To quantify the discrepancy between the distributions of $\mathbf{z}_{y}$ for the treatment and control groups, we use an integral probability metric (IPM) (Müller, 1997; Sriperumbudur et al., 2012; Guo et al., 2020c). We define the balanced representation loss $\mathcal{L}_{disc}$ as

(18) \mathcal{L}_{disc}(\mathbf{z}_{y,*})=IPM(\{\mathbf{z}_{y,i}\}_{i:t_{i}=0},\{\mathbf{z}_{y,i}\}_{i:t_{i}=1}).

We instantiate the IPM with the Wasserstein-1 distance, as defined in Sriperumbudur et al. (2012), to calculate Equation (18). We employ the efficient approximation algorithm proposed by Cuturi and Doucet (2014) to compute the Wasserstein-1 distance and its gradients with respect to the model parameters when training TNDVGA.
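As a rough illustration of this balancing term, the sketch below approximates the Wasserstein distance between the $\mathbf{z}_{y}$ representations of the control and treatment groups via entropic regularization and Sinkhorn iterations, in the spirit of Cuturi and Doucet (2014); the regularization strength eps and the iteration count are illustrative, and this is not the exact solver used in our implementation.

```python
import torch

def sinkhorn_wasserstein(z_y_control, z_y_treated, eps=0.1, n_iters=50):
    """Entropic (Sinkhorn) approximation of the Wasserstein distance between the
    z_y representations of the two groups, cf. Eq. (18) (illustrative sketch)."""
    cost = torch.cdist(z_y_control, z_y_treated, p=2)     # pairwise Euclidean costs
    n, m = cost.shape
    mu = torch.full((n,), 1.0 / n, device=cost.device)    # uniform weights, control
    nu = torch.full((m,), 1.0 / m, device=cost.device)    # uniform weights, treated
    k = torch.exp(-cost / eps)                            # Gibbs kernel
    u = torch.ones_like(mu)
    for _ in range(n_iters):                              # Sinkhorn fixed-point updates
        v = nu / (k.t() @ u + 1e-8)
        u = mu / (k @ v + 1e-8)
    plan = u.unsqueeze(1) * k * v.unsqueeze(0)            # approximate transport plan
    return (plan * cost).sum()                            # approximate transport cost
```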

4.3.5. The Overall Objective Function

The following provides a summary of the overall objective function for TNDVGA:

(19) \begin{split}\mathcal{L}_{\rm{TNDVGA}}=\frac{1}{n}&\sum_{i=1}^{n}\left[\mathcal{L}_{\rm ELBO}(\mathbf{x}_{i},t_{i},y_{i})+\alpha_{t}\mathcal{L}_{treat}(t_{i},\mathbf{z}_{t,i},\mathbf{z}_{c,i})+\alpha_{y}\mathcal{L}_{pred}(t_{i},y_{i},\mathbf{z}_{c,i},\mathbf{z}_{y,i})\right]\\
&+\alpha_{1}\mathcal{L}_{indep}(\mathbf{z}_{t,*},\mathbf{z}_{c,*},\mathbf{z}_{y,*},\mathbf{z}_{o,*})+\alpha_{2}\mathcal{L}_{disc}(\mathbf{z}_{y,*})+\lambda{\|\Theta\|}_{2}^{2},\end{split}

where $\alpha_{t}$, $\alpha_{y}$, $\alpha_{1}$, and $\alpha_{2}$ are non-negative hyperparameters that balance the corresponding terms. The final term, $\lambda{\|\Theta\|}_{2}^{2}$, is applied to all model parameters $\Theta$ to avoid overfitting.
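Putting the pieces together, the training objective of Eq. (19) is a weighted sum of the per-instance losses and the two regularizers; a minimal wiring sketch is shown below, assuming the individual loss terms are already-averaged scalars computed by the components sketched above (in practice the weight-decay term can equivalently be delegated to the optimizer).

```python
def tndvga_objective(elbo, loss_treat, loss_pred, loss_indep, loss_disc, parameters,
                     alpha_t=100.0, alpha_y=100.0, alpha_1=1.0, alpha_2=1.0, lam=5e-5):
    """Overall objective of Eq. (19); the hyperparameter defaults are illustrative."""
    l2 = sum((p ** 2).sum() for p in parameters)   # ||Theta||_2^2 regularization
    return (elbo + alpha_t * loss_treat + alpha_y * loss_pred
            + alpha_1 * loss_indep + alpha_2 * loss_disc + lam * l2)
```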

After training, we can predict the ITEs of new instances from the observed covariates $\mathbf{x}$ and the network $\mathbf{A}$. We use the encoders $q_{\phi_{c}}(\mathbf{z}_{c}\mid\mathbf{x},\mathbf{A})$ and $q_{\phi_{y}}(\mathbf{z}_{y}\mid\mathbf{x},\mathbf{A})$ to sample the posteriors of the confounding and adjustment factors $l$ times, and then use the decoder $p_{\theta_{y}}(y\mid t,\mathbf{z}_{c},\mathbf{z}_{y})$ to compute the predicted outcomes $y$ under the two treatments, averaging them to obtain the estimated potential outcomes $y^{1}$ and $y^{0}$. The ATE is obtained by performing the above steps on all test samples and then averaging.
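The inference procedure can be sketched as follows, reusing the FactorEncoder, reparameterize, and TwoHeadedOutcome sketches from earlier; the sampling count corresponds to the $l$-sample averaging described above.

```python
import torch

@torch.no_grad()
def predict_ite(enc_c, enc_y, outcome_net, x, adj, n_samples=10):
    """Estimate ITEs by sampling z_c and z_y n_samples times and averaging the two
    predicted potential outcomes (illustrative sketch of the procedure above)."""
    y0_sum, y1_sum = 0.0, 0.0
    dummy_t = torch.zeros(x.size(0), device=x.device)   # only the (y0, y1) heads are used
    for _ in range(n_samples):
        z_c = reparameterize(*enc_c(x, adj))             # sample from q(z_c | x, A)
        z_y = reparameterize(*enc_y(x, adj))             # sample from q(z_y | x, A)
        _, (y0, y1) = outcome_net(dummy_t, z_c, z_y)
        y0_sum, y1_sum = y0_sum + y0, y1_sum + y1
    tau_hat = (y1_sum - y0_sum) / n_samples              # per-instance ITE estimates
    return tau_hat, tau_hat.mean()                       # ITE vector and ATE estimate
```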

5. Experiments

In this section, we perform a series of experiments to illustrate the effectiveness of the proposed TNDVGA framework. We first introduce the datasets, evaluation metrics, baselines, and model parameter configurations utilized in the experiments. Then, we compare the performance of different models in estimating ITE. After that, we conduct an ablation study to evaluate the importance of key components in the TNDVGA and conduct a hyperparameter study.

5.1. Datasets

5.1.1. Semi-synthetic datasets

Table 1. Statistics of the Two Semi-Synthetic Datasets: BlogCatalog and Flickr
Datasets | Instances | Edges | Features | κ2 | ATE mean ± STD
BlogCatalog | 5,196 | 173,468 | 8,189 | 0.5 | 4.366 ± 0.553
BlogCatalog | 5,196 | 173,468 | 8,189 | 1 | 7.446 ± 0.759
BlogCatalog | 5,196 | 173,468 | 8,189 | 2 | 13.534 ± 2.309
Flickr | 7,575 | 239,738 | 12,047 | 0.5 | 6.672 ± 3.068
Flickr | 7,575 | 239,738 | 12,047 | 1 | 8.487 ± 3.372
Flickr | 7,575 | 239,738 | 12,047 | 2 | 20.546 ± 5.718
BlogCatalog

In the BlogCatalog dataset (Tang and Liu, 2011), a social blog directory for managing bloggers and their blogs, each individual represents a blogger, and each edge represents a social connection between two bloggers. The features are represented as a bag-of-words representation of the keywords in the bloggers’ descriptions. To generate synthetic outcomes and treatments, we rely on the assumptions outlined in (Guo et al., 2020c; Veitch et al., 2019). The outcome yy refers to the readers’ opinion of each blogger, and the treatment tt represents whether the blogger’s content receives more views on mobile devices or desktops. Bloggers whose content is primarily viewed on mobile devices are placed in the treatment group, while those whose content is mainly viewed on desktops are placed in the control group. Additionally, following the assumptions in (Guo et al., 2020c), we assume that the topics discussed by the blogger and their neighbors causally affect both the blogger’s treatment assignment and outcome. In this task, our goal is to investigate the individual treatment effect (ITE) of receiving more views on mobile devices (instead of desktops) on the readers’ opinion. Specifically, a Latent Dirichlet Allocation (LDA) topic model is trained (Blei et al., 2003). Two centroids in the topic space are then defined: (i) the centroid 𝒓¯1\bar{\boldsymbol{r}}^{1} of the treatment group is set as the topic distribution of a randomly selected blogger, and (ii) the centroid 𝒓¯0\bar{\boldsymbol{r}}^{0} of the control group is set as the average topic distribution across all bloggers. We then model readers’ preference of browsing devices on the ii-th blogger content as:

(20) \begin{split}&P(t_{i}=1\mid\mathbf{x}_{i},\mathbf{A})=\frac{{\rm exp}(p_{i}^{1})}{{\rm exp}(p_{i}^{1})+{\rm exp}(p_{i}^{0})}\\
{\rm with}\quad&p_{i}^{t}=\kappa_{1}{\boldsymbol{r}}(\mathbf{x}_{i})^{\mathsf{T}}\bar{\boldsymbol{r}}^{t}+\kappa_{2}\sum_{j\in\mathcal{N}(i)}{\boldsymbol{r}}(\mathbf{x}_{j})^{\mathsf{T}}\bar{\boldsymbol{r}}^{t},\quad t\in\{0,1\},\end{split}

where κ10\kappa_{1}\geq 0 and κ20\kappa_{2}\geq 0 control the strength of the confounding bias introduced by the blogger’s topics and the topics of their neighbors, respectively. Finally, the factual outcome and the counterfactual outcome of the ii-th instance are given as:

(21) \begin{split}y_{i}^{F}&=C(p_{i}^{0}+t_{i}p_{i}^{1})+\epsilon,\\
y_{i}^{CF}&=C[p_{i}^{0}+(1-t_{i})p_{i}^{1}]+\epsilon,\end{split}

where CC serves as a scaling factor, and the noise term ϵ\epsilon follows a normal distribution, i.e., ϵ𝒩(0,1)\epsilon\sim\mathcal{N}(0,1). For this study, we set C=5C=5, κ1=10\kappa_{1}=10, and κ2{0.5,1,2}\kappa_{2}\in\{0.5,1,2\}.
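A NumPy sketch of this simulation is given below; it assumes the LDA topic proportions are available as an n-by-K matrix and follows Eqs. (20)-(21), using a numerically stable form of Eq. (20). The function name and defaults are illustrative, and the exact generator released with (Guo et al., 2020c) may differ in details.

```python
import numpy as np

def simulate_blogcatalog(topics, adj, kappa1=10.0, kappa2=1.0, C=5.0, seed=0):
    """Synthesize treatments and (counter)factual outcomes per Eqs. (20)-(21); sketch only."""
    rng = np.random.default_rng(seed)
    n = topics.shape[0]
    r1 = topics[rng.integers(n)]        # treated centroid: topics of a random blogger
    r0 = topics.mean(axis=0)            # control centroid: average topic distribution
    neighbor_topics = adj @ topics      # sum of the neighbours' topic distributions
    p1 = kappa1 * topics @ r1 + kappa2 * neighbor_topics @ r1
    p0 = kappa1 * topics @ r0 + kappa2 * neighbor_topics @ r0
    prob_treated = 1.0 / (1.0 + np.exp(p0 - p1))          # Eq. (20), stable softmax form
    t = rng.binomial(1, prob_treated)
    eps = rng.standard_normal(n)
    y_factual = C * (p0 + t * p1) + eps                    # Eq. (21)
    y_counterfactual = C * (p0 + (1 - t) * p1) + eps
    return t, y_factual, y_counterfactual
```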

Flickr

Flickr (Tang and Liu, 2011) is an online platform utilized for the purpose of sharing images and videos. In this dataset, each user is represented as an instance, with edges indicating social connections between users. The features of each user are a list of interest tags. The treatment and outcome are synthesized using the same settings and simulation process as in the BlogCatalog.

In Table 1, we provide a detailed statistical summary of the two semi-synthetic datasets. For each parameter setting, the mean and standard deviation of the ATEs are computed across 10 runs.

5.1.2. Synthetic datasets

Inspired by (Hassanpour and Greiner, 2019), we generate synthetic datasets named TNDVGASynth, which follow the structure illustrated in Fig. 1 and the relationships defined in Equations (22)-(25).

(22) \begin{split}&\mathbf{z}_{t}\sim\mathcal{N}(\mathbf{0},\mathbf{1})^{m_{t}},\quad\mathbf{z}_{c}\sim\mathcal{N}(\mathbf{0},\mathbf{1})^{m_{c}},\quad\mathbf{z}_{y}\sim\mathcal{N}(\mathbf{0},\mathbf{1})^{m_{y}},\quad\mathbf{z}_{o}\sim\mathcal{N}(\mathbf{0},\mathbf{1})^{m_{o}},\\
&\mathbf{x}=Concat(\mathbf{z}_{t},\mathbf{z}_{c},\mathbf{z}_{y},\mathbf{z}_{o}),\quad{\boldsymbol{\Psi}}=Concat(\mathbf{z}_{t},\mathbf{z}_{c}),\quad{\boldsymbol{\Phi}}=Concat(\mathbf{z}_{c},\mathbf{z}_{y}),\end{split}

(23) \begin{split}&a\sim Bernoulli\left(\frac{0.01}{1+{\rm exp}(-r)}\right)\\
{\rm with}\quad&r=\mathbf{h}\cdot\mathbf{h}+1,\quad\mathbf{h}=Concat(\mathbf{z}_{t},\mathbf{z}_{c},\mathbf{z}_{y},\mathbf{z}_{o}),\end{split}

(24) \begin{split}&t\sim Bernoulli\left(\frac{1}{1+{\rm exp}(-\zeta h)}\right)\\
{\rm with}\quad&h={\boldsymbol{\Psi}}\cdot\boldsymbol{\theta}+1,\quad\boldsymbol{\theta}\sim\mathcal{N}(\mathbf{0},\mathbf{1})^{m_{t}+m_{c}},\end{split}

(25) \begin{split}y^{0}&=\frac{({\boldsymbol{\Phi}}\circ{\boldsymbol{\Phi}}\circ{\boldsymbol{\Phi}}+0.5)\cdot{\boldsymbol{\nu}}^{0}}{m_{c}+m_{y}}+\epsilon,\\
y^{1}&=\frac{({\boldsymbol{\Phi}}\circ{\boldsymbol{\Phi}})\cdot{\boldsymbol{\nu}}^{1}}{m_{c}+m_{y}}+\epsilon,\\
{\rm with}\quad&{\boldsymbol{\nu}}^{0},{\boldsymbol{\nu}}^{1}\sim\mathcal{N}(\mathbf{0},\mathbf{1})^{m_{c}+m_{y}},\quad\epsilon\sim\mathcal{N}(0,1),\end{split}

where $Concat(\cdot,\cdot)$ denotes the vector concatenation operation; $a$ is an element of the adjacency matrix $\mathbf{A}$; $m_{t},m_{c},m_{y},m_{o}$ are the dimensions of the latent factors $\mathbf{z}_{t},\mathbf{z}_{c},\mathbf{z}_{y},\mathbf{z}_{o}$, respectively; the scalar $\zeta$ determines the slope of the logistic curve; $\cdot$ denotes the dot product; and $\circ$ denotes the element-wise (Hadamard) product. We consider all feasible datasets generated from the grid defined by $m_{t},m_{c},m_{y},m_{o}\in\{4,8\}$, creating 16 scenarios. For each scenario, we synthesize five datasets using different initial random seeds.
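The following NumPy sketch generates one TNDVGASynth dataset per Eqs. (22)-(25); the adjacency step reads the dot product in Eq. (23) as being taken between the latent vectors of the two endpoint instances, and the remaining details (sample size, seed handling, function name) are illustrative assumptions.

```python
import numpy as np

def generate_tndvga_synth(n=3000, m_t=4, m_c=4, m_y=4, m_o=4, zeta=1.0, seed=0):
    """Generate one TNDVGASynth dataset following Eqs. (22)-(25); illustrative sketch."""
    rng = np.random.default_rng(seed)
    z_t = rng.standard_normal((n, m_t))
    z_c = rng.standard_normal((n, m_c))
    z_y = rng.standard_normal((n, m_y))
    z_o = rng.standard_normal((n, m_o))
    x = np.concatenate([z_t, z_c, z_y, z_o], axis=1)   # covariates (Eq. 22)
    psi = np.concatenate([z_t, z_c], axis=1)
    phi = np.concatenate([z_c, z_y], axis=1)

    # Adjacency (Eq. 23): edge probability driven by latent-vector similarity
    r = x @ x.T + 1.0
    p_edge = 0.01 / (1.0 + np.exp(-r))
    adj = rng.binomial(1, p_edge)
    adj = np.triu(adj, 1)
    adj = adj + adj.T                                   # symmetric, no self-loops

    # Treatment (Eq. 24) and potential outcomes (Eq. 25)
    theta = rng.standard_normal(m_t + m_c)
    t = rng.binomial(1, 1.0 / (1.0 + np.exp(-zeta * (psi @ theta + 1.0))))
    nu0 = rng.standard_normal(m_c + m_y)
    nu1 = rng.standard_normal(m_c + m_y)
    eps = rng.standard_normal(n)
    y0 = (phi ** 3 + 0.5) @ nu0 / (m_c + m_y) + eps
    y1 = (phi ** 2) @ nu1 / (m_c + m_y) + eps
    y_factual = np.where(t == 1, y1, y0)
    return x, adj, t, y_factual, y0, y1
```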

5.2. Evaluation Metrics

We evaluate the performance of the proposed TNDVGA framework in learning ITE using two metrics widely used in causal inference. We report the square root of the Precision in Estimation of Heterogeneous Effect ($\sqrt{\epsilon_{PEHE}}$) to measure the accuracy of individual-level treatment effect estimates, and the Mean Absolute Error of the ATE ($\epsilon_{ATE}$) to assess the accuracy of the population-level treatment effect estimate. They are formally defined as follows:

(26) \sqrt{\epsilon_{PEHE}}=\sqrt{\frac{1}{n}\sum_{i=1}^{n}({\hat{\tau}}_{i}-\tau_{i})^{2}},
(27) \epsilon_{ATE}=\frac{1}{n}\left|\sum_{i=1}^{n}{\hat{\tau}}_{i}-\sum_{i=1}^{n}\tau_{i}\right|,

where ${\hat{\tau}}_{i}={\hat{y}}_{i}^{1}-{\hat{y}}_{i}^{0}$ and $\tau_{i}=y_{i}^{1}-y_{i}^{0}$ denote the estimated ITE and the ground-truth ITE of instance $i$, respectively. Lower values of these metrics indicate better estimation performance.
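These two metrics can be computed directly from the estimated and ground-truth potential outcomes; a small NumPy sketch:

```python
import numpy as np

def evaluate_ite(y1_hat, y0_hat, y1, y0):
    """Compute sqrt(PEHE) (Eq. 26) and the ATE error (Eq. 27); illustrative sketch."""
    tau_hat = y1_hat - y0_hat                       # estimated ITEs
    tau = y1 - y0                                   # ground-truth ITEs
    pehe = np.sqrt(np.mean((tau_hat - tau) ** 2))   # Eq. (26)
    eps_ate = np.abs(np.mean(tau_hat) - np.mean(tau))  # Eq. (27)
    return pehe, eps_ate
```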

5.3. Baselines

We compare our model against the following state-of-the-art models used for ITE estimation:

  • Bayesian Additive Regression Trees (BART) (Chipman et al., 2010). BART is a widely used nonparametric Bayesian regression model that utilizes dimensionally adaptive random basis functions.

  • Causal Forest (Wager and Athey, 2018). Causal Forest is a non-parametric causal inference method designed to estimate heterogeneous treatment effects, extending Breiman’s well-known random forest algorithm.

  • Counterfactual Regression (CFR) (Shalit et al., 2017). CFR is a representation learning-based approach that predicts individual treatment effects (ITE) from observational data. It reduces the imbalance between the latent representations of the treatment and control groups and minimizes prediction errors for factual outcomes by projecting the original features into a latent space to capture confounders. It implements Integral Probability Metrics to measure the distance between distributions. This study employs two distinct forms of balancing penalties: the Wasserstein-1 distance (CFR-Wass) and the maximum mean discrepancy (CFR-MMD).

  • Treatment-agnostic Representation Networks (TARNet) (Shalit et al., 2017). TARNet is a variant of CFR that excludes the balance regularization term from its model.

  • Causal Effect Variational Autoencoder (CEVAE) (Louizos et al., 2017). CEVAE is built upon Variational Autoencoders (VAE) (Kingma, 2013) and adheres to the causal inference framework with proxy variables. It is capable of jointly estimating the unknown latent space that captures confounders and the causal effect.

  • Treatment Effect by Disentangled Variational AutoEncoder (TEDVAE) (Zhang et al., 2021b). TEDVAE is a variational inference approach that simultaneously infers latent factors from observed variables, while disentangling these factors into three distinct sets: instrumental factors, confounding factors, and risk factors. These disentangled factors are then utilized for estimating treatment effects.

  • Network Deconfounder (NetDeconf) (Guo et al., 2020c). NetDeconf is a novel causal inference framework that leverages network information to identify patterns of hidden confounders, enabling the learning of valid individual causal effects from networked observational data.

  • Graph Infomax Adversarial Learning (GIAL) (Chu et al., 2021). GIAL is a treatment effect estimation model that exploits the network structure to capture additional information by recognizing imbalances within the network. In this work, we employ two variants of GIAL: one that uses the original graph convolutional network (GCN) (Kipf and Welling, 2016a) implementation (GIAL-GCN) and another that uses graph attention networks (GAT) (Veličković et al., 2017) (GIAL-GAT).

5.4. Parameter Settings

We implement TNDVGA using PyTorch on an NVIDIA RTX 4090D GPU. For BlogCatalog and Flickr, we run 10 experiments and report the average results. For each run, the dataset is split into training (60%), validation (20%), and test (20%) sets. Baseline methods such as BART, Causal Forest, CFR, TARNet, and CEVAE are originally designed for non-networked observational data and thus cannot leverage network information directly. To ensure a fair comparison, we concatenate the rows of the adjacency matrix with the original features; however, this does not notably enhance baseline performance due to dimensionality limitations. For the baselines, we use the default hyperparameters from previous works (Guo et al., 2020c; Chu et al., 2021). For TNDVGA, we apply grid search to identify the optimal hyperparameter settings. Specifically, the learning rate is set to $3\times 10^{-4}$, $\alpha_{t}$ and $\alpha_{y}$ are set to 100, and $\lambda$ is set to $5\times 10^{-5}$. The number of GCN layers is varied between 1, 2, and 3, the hidden dimension is set to 500, and the dimensions of $\mathbf{z}_{t}$, $\mathbf{z}_{c}$, $\mathbf{z}_{y}$, and $\mathbf{z}_{o}$ vary across {10, 20, 30, 40, 50}. The regularization coefficients $\alpha_{1}$ and $\alpha_{2}$ are tuned within the range {$10^{-2}$, $10^{-1}$, 1, 10, 100}. TNDVGA is trained for 500 epochs on BlogCatalog and 1000 epochs on Flickr, using the Adam optimizer (Kingma, 2014). For the synthetic datasets, we use the same parameter selection approach as for the semi-synthetic datasets. Unless stated otherwise, the latent variable dimensions for the different factors are set to their true values.
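The search over these settings can be organized as a simple grid loop. The sketch below only illustrates the procedure: the stand-in model class, the use of $\lambda$ as a weight-decay term, and the selection criterion are our placeholders rather than the released implementation.

```python
import itertools
import torch
import torch.nn as nn

# Stand-in for the TNDVGA network; the real model uses GCN layers over the adjacency matrix.
class StandInModel(nn.Module):
    def __init__(self, in_dim, hidden_dim, latent_dim):
        super().__init__()
        # One shared encoder producing the four latent blocks z_t, z_c, z_y, z_o.
        self.encoder = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU(),
                                     nn.Linear(hidden_dim, 4 * latent_dim))

    def forward(self, x):
        return self.encoder(x)

grid = itertools.product([1, 2, 3],                       # number of GCN layers
                         [10, 20, 30, 40, 50],            # dimension of each latent factor
                         [1e-2, 1e-1, 1.0, 10.0, 100.0],  # alpha_1 (independence weight)
                         [1e-2, 1e-1, 1.0, 10.0, 100.0])  # alpha_2 (balance weight)

for num_layers, latent_dim, alpha_1, alpha_2 in grid:
    model = StandInModel(in_dim=128, hidden_dim=500, latent_dim=latent_dim)
    optimizer = torch.optim.Adam(model.parameters(), lr=3e-4, weight_decay=5e-5)
    # Train for 500 epochs (BlogCatalog) or 1000 epochs (Flickr) and keep the
    # configuration with the lowest validation sqrt(PEHE).
    break  # illustration only: the full sweep is not run here
```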

Table 2. Performance comparison of different methods on BlogCatalog. We report the average values of $\sqrt{\epsilon_{PEHE}}$ and $\epsilon_{ATE}$ on the test sets. Baseline results are from (Chu et al., 2021), except TEDVAE.
Method            | $\kappa_{2}$ = 0.5 | $\kappa_{2}$ = 1 | $\kappa_{2}$ = 2
                  | $\sqrt{\epsilon_{PEHE}}$ / $\epsilon_{ATE}$ | $\sqrt{\epsilon_{PEHE}}$ / $\epsilon_{ATE}$ | $\sqrt{\epsilon_{PEHE}}$ / $\epsilon_{ATE}$
BART              | 4.808 / 2.680 | 5.770 / 2.278 | 11.608 / 6.418
Causal Forest     | 7.456 / 1.261 | 7.805 / 1.763 | 19.271 / 4.050
CFR-Wass          | 10.904 / 4.257 | 11.644 / 5.107 | 34.848 / 13.053
CFR-MMD           | 11.536 / 4.127 | 12.332 / 5.345 | 34.654 / 13.785
TARNet            | 11.570 / 4.228 | 13.561 / 8.170 | 34.420 / 13.122
CEVAE             | 7.481 / 1.279 | 10.387 / 1.998 | 24.215 / 5.566
TEDVAE            | 4.609 / 0.798 | 4.354 / 0.881 | 6.805 / 1.190
NetDeconf         | 4.532 / 0.979 | 4.597 / 0.984 | 9.532 / 2.130
GIAL-GCN          | 4.023 / 0.841 | 4.091 / 0.883 | 8.927 / 1.780
GIAL-GAT          | 4.215 / 0.912 | 4.258 / 0.937 | 9.119 / 1.982
TNDVGA (ours)     | 3.969 / 0.719 | 3.846 / 0.699 | 6.066 / 1.057
Table 3. Performance comparison of different methods on Flickr. We report the average values of $\sqrt{\epsilon_{PEHE}}$ and $\epsilon_{ATE}$ on the test sets. Baseline results are from (Chu et al., 2021), except TEDVAE.
Method            | $\kappa_{2}$ = 0.5 | $\kappa_{2}$ = 1 | $\kappa_{2}$ = 2
                  | $\sqrt{\epsilon_{PEHE}}$ / $\epsilon_{ATE}$ | $\sqrt{\epsilon_{PEHE}}$ / $\epsilon_{ATE}$ | $\sqrt{\epsilon_{PEHE}}$ / $\epsilon_{ATE}$
BART              | 4.907 / 2.323 | 9.517 / 6.548 | 13.155 / 9.643
Causal Forest     | 8.104 / 1.359 | 14.636 / 3.545 | 26.702 / 4.324
CFR-Wass          | 13.846 / 3.507 | 27.514 / 5.192 | 53.454 / 13.269
CFR-MMD           | 13.539 / 3.350 | 27.679 / 5.416 | 53.863 / 12.115
TARNet            | 14.329 / 3.389 | 28.466 / 5.978 | 55.066 / 13.105
CEVAE             | 12.099 / 1.732 | 22.496 / 4.415 | 42.985 / 5.393
TEDVAE            | 5.072 / 1.041 | 7.125 / 1.328 | 12.952 / 2.124
NetDeconf         | 4.286 / 0.805 | 5.789 / 1.359 | 9.817 / 2.700
GIAL-GCN          | 3.938 / 0.682 | 5.317 / 1.194 | 9.275 / 2.245
GIAL-GAT          | 4.015 / 0.773 | 5.432 / 1.231 | 9.428 / 2.586
TNDVGA            | 3.896 / 0.633 | 4.974 / 1.037 | 7.302 / 1.908

5.5. Performance Comparison

We compare the proposed framework TNDVGA with the state-of-the-art baselines for ITE estimation on both semi-synthetic datasets and synthetic datasets.

5.5.1. Performance on Semi-Synthetic Datasets

Tables 2 and 3 present the experimental results on the BlogCatalog and Flickr datasets, respectively. Through a comprehensive analysis of the experimental results, we have the following observations:

  • The proposed variational inference framework for ITE estimation, TNDVGA, consistently outperforms state-of-the-art traditional baseline methods, including BART, Causal Forest, CFR, and CEVAE, across different settings on both datasets, as these methods do not account for disentangled latent factors or leverage network information for ITE learning.

  • TNDVGA and NetDeconf, along with GIAL, outperform other baseline methods in ITE estimation due to their ability to leverage auxiliary network information to capture the impact of latent factors on ITE estimation. This result suggests that network information helps in learning representations of latent factors, leading to more accurate ITE estimation. Furthermore, TNDVGA also outperforms NetDeconf and GIAL in ITE estimation because it learns representations of four different latent factors, whereas NetDeconf and GIAL only learn representations of latent confounding factors.

  • TEDVAE also performs reasonably well in estimating ITE, mainly because its model infers and disentangles three disjoint sets of instrumental, confounding, and risk factors from the observed variables. This also highlights the importance of learning disentangled latent factors for ITE estimation. However, TNDVGA outperforms TEDVAE, as it additionally accounts for latent noise factors and effectively leverages network information, whereas TEDVAE struggles to fully utilize network information to enhance its modeling capabilities.

  • TNDVGA demonstrates strong robustness to the choice of the latent dimensionality parameters. Although we do not explicitly model the generation process of the latent factors in these two semi-synthetic real-world datasets, and their generation does not include instrumental, risk, or noise factors, TNDVGA still achieves the best performance under these conditions. These results indicate that, even on more realistic datasets, TNDVGA can effectively learn latent factors and estimate ITE.

  • When the influence of hidden confounders increases (i.e., with a growing $\kappa_{2}$ value), TNDVGA suffers the smallest degradation in $\sqrt{\epsilon_{PEHE}}$ and $\epsilon_{ATE}$. This is because TNDVGA is able to identify patterns of latent confounding factors from the network structure, enabling it to infer ITE more accurately.

5.5.2. Performance on Synthetic Datasets

Figure 3. Experimental results of different methods in ITE estimation under different levels of selection bias. As the selection bias increases, TNDVGA consistently performs the best.

First, similar to (Bao et al., 2022), we control the magnitude of selection bias in the dataset through the scalar $\zeta$. We compare TNDVGA with TEDVAE and NetDeconf when the dimensions of the latent factors are (8, 8, 8, 8). As shown in Fig. 3, as the value of $\zeta$ increases, indicating stronger selection bias, TNDVGA consistently performs the best. Furthermore, TNDVGA's performance remains stable and is largely unaffected by variations in selection bias. This demonstrates that TNDVGA is more robust to selection bias, which is crucial when handling real-world datasets. We observe similar results on the other synthetic datasets.
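As a quick illustration of how $\zeta$ induces selection bias, the snippet below (our own toy check, not part of the benchmark) shows that larger $\zeta$ pushes the treatment propensities in Eq. (24) toward 0 or 1, so treatment assignment depends more sharply on the latent factors.

```python
import numpy as np

rng = np.random.default_rng(0)
h = rng.standard_normal(100_000) + 1.0        # scores h = Psi·theta + 1 as in Eq. (24)
for zeta in [0.5, 1.0, 2.0, 5.0]:
    p = 1.0 / (1.0 + np.exp(-zeta * h))       # treatment propensities
    extreme = np.mean((p < 0.1) | (p > 0.9))  # fraction of near-deterministic assignments
    print(f"zeta={zeta}: {extreme:.2%} of propensities fall below 0.1 or above 0.9")
```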

Next, we investigate TNDVGA's ability to recover the latent components $\mathbf{z}_{t}$, $\mathbf{z}_{c}$, $\mathbf{z}_{y}$, and $\mathbf{z}_{o}$ used to construct the observed covariates $\mathbf{x}$, and examine the contribution of disentangling these latent factors to ITE estimation. To this end, similar to the settings in (Hassanpour and Greiner, 2019; Bao et al., 2022), we compare the performance of TNDVGA when the parameters $d_{\mathbf{z}_{t}}$, $d_{\mathbf{z}_{c}}$, $d_{\mathbf{z}_{y}}$, and $d_{\mathbf{z}_{o}}$ are set to the true numbers of latent factors against its performance when one of the latent dimensionality parameters is set to zero (a minimal code sketch of this setup is given after this paragraph). For example, setting $d_{\mathbf{z}_{c}}=0$ forces TNDVGA to ignore the disentanglement of confounding factors. If TNDVGA performs better when all latent factors are disentangled than when any one latent factor is ignored, we can conclude that TNDVGA recovers the latent factors and that disentangling them is beneficial for ITE estimation. Fig. 4 displays the radar charts corresponding to each factor. We can clearly see that when TNDVGA disentangles all latent factors through non-zero dimensionality parameters, it outperforms every setting in which one latent dimension is set to zero.
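A minimal sketch of this "ignore one factor" setup, assuming the encoder output is split into contiguous blocks whose widths are the dimensionality parameters (the splitting function and names are hypothetical):

```python
import torch

def split_latents(z, d_zt, d_zc, d_zy, d_zo):
    # Split the encoder output into the four factor representations.
    return torch.split(z, [d_zt, d_zc, d_zy, d_zo], dim=1)

# Full model: latent dimensions match the true generating dimensions (8, 8, 8, 8).
z_full = torch.randn(32, 32)
z_t, z_c, z_y, z_o = split_latents(z_full, 8, 8, 8, 8)

# Ablated model: d_zc = 0 removes the confounding block, so downstream treatment and
# outcome predictors receive no dedicated z_c representation.
z_ablated = torch.randn(32, 24)
z_t, z_c, z_y, z_o = split_latents(z_ablated, 8, 0, 8, 8)
print(z_c.shape)  # torch.Size([32, 0])
```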

Figure 4. In the radar charts, each vertex of the polygon is labeled with a sequence of latent factor dimensions from the synthetic dataset. For example, 8-8-8-8 indicates that the dataset is generated using 8 dimensions each for the latent instrumental, confounding, adjustment, and noise factors. Each polygon represents the PEHE metric of the model (smaller polygons indicate better performance).
Table 4. Ablation study of our method’s variants on BlogCatalog.
Method              | $\kappa_{2}$ = 0.5 | $\kappa_{2}$ = 1 | $\kappa_{2}$ = 2
                    | $\sqrt{\epsilon_{PEHE}}$ / $\epsilon_{ATE}$ | $\sqrt{\epsilon_{PEHE}}$ / $\epsilon_{ATE}$ | $\sqrt{\epsilon_{PEHE}}$ / $\epsilon_{ATE}$
TNDVGA              | 3.937 / 0.656 | 3.918 / 0.677 | 0.651 / 1.184
TNDVGA (w/o BP)     | 4.090 / 0.710 | 4.060 / 0.808 | 6.887 / 1.798
TNDVGA (w/o HSIC)   | 4.114 / 0.765 | 4.070 / 0.808 | 6.982 / 1.958
Table 5. Ablation study of our method’s variants on Flickr.
Method              | $\kappa_{2}$ = 0.5 | $\kappa_{2}$ = 1 | $\kappa_{2}$ = 2
                    | $\sqrt{\epsilon_{PEHE}}$ / $\epsilon_{ATE}$ | $\sqrt{\epsilon_{PEHE}}$ / $\epsilon_{ATE}$ | $\sqrt{\epsilon_{PEHE}}$ / $\epsilon_{ATE}$
TNDVGA              | 3.897 / 0.610 | 5.045 / 0.956 | 8.763 / 1.074
TNDVGA (w/o BP)     | 4.298 / 0.637 | 5.551 / 1.359 | 10.853 / 1.678
TNDVGA (w/o HSIC)   | 4.622 / 0.930 | 5.908 / 1.380 | 11.198 / 1.948

5.6. Ablation Study

Furthermore, we investigate the effect of key components of the proposed TNDVGA framework on learning ITE from networked observational data. In particular, we conduct an ablation study by developing two variants of TNDVGA and comparing their performance with the original TNDVGA on the BlogCatalog and Flickr datasets: (i) TNDVGA w/o Balanced Representations: this variant does not balance the learned representations, i.e., it omits the balanced representation loss $\mathcal{L}_{disc}$ during training. As a result, the learned factor $\mathbf{z}_{y}$ may embed information about $\mathbf{z}_{t}$. We refer to this variant as TNDVGA w/o BP. (ii) TNDVGA w/o HSIC Independence Regularizer: this variant omits the independence constraint between the representations of different factors, which may prevent the learned representations from being disentangled; a generic sketch of such an HSIC penalty is given below. We refer to this variant as TNDVGA w/o HSIC.
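For reference, the independence regularizer that the w/o-HSIC variant removes can be written as a standard biased empirical HSIC estimator (Gretton et al., 2005); the Gaussian kernel and fixed bandwidth below are illustrative choices and not necessarily those used in TNDVGA.

```python
import torch

def rbf_kernel(x, sigma=1.0):
    # Gaussian (RBF) kernel matrix over a batch of representations.
    sq_dists = torch.cdist(x, x) ** 2
    return torch.exp(-sq_dists / (2.0 * sigma ** 2))

def hsic(x, y, sigma=1.0):
    """Biased empirical HSIC between two representation batches; larger means more dependent."""
    n = x.size(0)
    K, L = rbf_kernel(x, sigma), rbf_kernel(y, sigma)
    H = torch.eye(n) - torch.full((n, n), 1.0 / n)    # centering matrix
    return torch.trace(K @ H @ L @ H) / (n - 1) ** 2

# The regularizer sums HSIC over all pairs of factor representations, e.g.
# hsic(z_t, z_c) + hsic(z_t, z_y) + hsic(z_c, z_y) + ..., and is dropped in TNDVGA w/o HSIC.
z_t, z_c = torch.randn(64, 10), torch.randn(64, 10)
print(hsic(z_t, z_c).item())
```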

Figure 5. Hyperparameter analysis on BlogCatalog across different $\kappa_{2}$. Panels (a)-(c) show $\sqrt{\epsilon_{PEHE}}$ and panels (d)-(f) show $\epsilon_{ATE}$ for $\kappa_{2}=0.5$, $1$, and $2$, respectively.

Figure 6. Hyperparameter analysis on Flickr across different $\kappa_{2}$. Panels (a)-(c) show $\sqrt{\epsilon_{PEHE}}$ and panels (d)-(f) show $\epsilon_{ATE}$ for $\kappa_{2}=0.5$, $1$, and $2$, respectively.

Tables 4 and 5 display the comparison results of the two variants with TNDVGA on the BlogCatalog and Flickr datasets, respectively. From the analysis, we can draw the following observations:

  • TNDVGA w/o BP cannot provide satisfactory performance because it neglects the balance of adjustment variables, which may lead to instrumental information being embedded in the adjustment variables, affecting the effectiveness of the learned representations. This highlights the necessity of balanced representations for better learning of latent factors in order to estimate ITE.

  • TNDVGA w/o HSIC also fails to provide the expected performance and typically performs the worst, as it does not impose independence constraints on the representations corresponding to different latent factors. This indicates that imposing explicit independence constraints on the representations is important for estimating ITE from network observational data.

5.7. Hyperparameter Study

We conduct an analysis of the effects of the two most important hyperparameters, $\alpha_{1}$ and $\alpha_{2}$, on the performance of TNDVGA. These parameters control how strongly the independence constraint and the representation balance contribute to ITE estimation from networked observational data. We vary $\alpha_{1}$ and $\alpha_{2}$ within the range {0.01, 0.1, 1, 10, 100} and report $\sqrt{\epsilon_{PEHE}}$ and $\epsilon_{ATE}$ on the BlogCatalog and Flickr datasets with $\kappa_{2}$ set to 0.5, 1, and 2. The results of the hyperparameter study are shown in Figs. 5 and 6. When $\alpha_{1}$ and $\alpha_{2}$ range in {0.01, 0.1, 1}, the variations in $\sqrt{\epsilon_{PEHE}}$ and $\epsilon_{ATE}$ are minimal, suggesting that TNDVGA performs stably and favorably across a wide range of parameter values. However, when $\alpha_{1}\geq 10$ or $\alpha_{2}\geq 10$, TNDVGA's performance in estimating the ATE noticeably declines. This degradation occurs because the objective function places too much emphasis on the regularization terms at large parameter values, thereby harming the accuracy of ATE estimation.

6. Conclusion and Future Work

This paper aims to improve the accuracy of individual treatment effect estimation from networked observational data by modeling disentangled latent factors. The proposed model, TNDVGA, leverages observed features and auxiliary network information to infer and disentangle four distinct sets of latent factors: instrumental, confounding, adjustment, and noise factors. Empirical results from extensive experiments on two semi-synthetic datasets and one synthetic dataset demonstrate that TNDVGA outperforms existing state-of-the-art methods in estimating ITE from networked observational data.

Two promising directions for future work are worth exploring. First, we would like to extend TNDVGA to estimate treatment effects for multiple or continuous treatments, which would enhance its applicability to a wider range of real-world scenarios. Second, we are interested in further investigating ITE estimation under network interference within a generative model framework that employs variational inference.

References

  • Abadie and Imbens (2006) Alberto Abadie and Guido W Imbens. 2006. Large sample properties of matching estimators for average treatment effects. econometrica 74, 1 (2006), 235–267.
  • Arbour et al. (2016) David Arbour, Dan Garant, and David Jensen. 2016. Inferring network effects from observational data. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 715–724.
  • Atan et al. (2018) Onur Atan, James Jordon, and Mihaela Van der Schaar. 2018. Deep-treat: Learning optimal personalized treatments from observational data using neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32.
  • Athey and Imbens (2016) Susan Athey and Guido Imbens. 2016. Recursive partitioning for heterogeneous causal effects. Proceedings of the National Academy of Sciences 113, 27 (2016), 7353–7360.
  • Bao et al. (2022) Qingsen Bao, Zeyong Mao, and Lei Chen. 2022. Learning Disentangled Latent Factors for Individual Treatment Effect Estimation Using Variational Generative Adversarial Nets. In 2022 IEEE 25th International Conference on Computer Supported Cooperative Work in Design (CSCWD). IEEE, 347–352.
  • Bennett and Kallus (2019) Andrew Bennett and Nathan Kallus. 2019. Policy evaluation with latent confounders via optimal balance. Advances in neural information processing systems 32 (2019).
  • Blei et al. (2003) David M Blei, Andrew Y Ng, and Michael I Jordan. 2003. Latent dirichlet allocation. Journal of machine Learning research 3, Jan (2003), 993–1022.
  • Cheng et al. (2022) Mingyuan Cheng, Xinru Liao, Quan Liu, Bin Ma, Jian Xu, and Bo Zheng. 2022. Learning disentangled representations for counterfactual regression via mutual information minimization. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval. 1802–1806.
  • Chipman et al. (2010) Hugh A Chipman, Edward I George, and Robert E McCulloch. 2010. BART: Bayesian additive regression trees. (2010).
  • Chu et al. (2021) Zhixuan Chu, Stephen L Rathbun, and Sheng Li. 2021. Graph infomax adversarial learning for treatment effect estimation with networked observational data. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. 176–184.
  • Cuturi and Doucet (2014) Marco Cuturi and Arnaud Doucet. 2014. Fast computation of Wasserstein barycenters. In International conference on machine learning. PMLR, 685–693.
  • Ding and Lehrer (2010) Weili Ding and Steven F Lehrer. 2010. Estimating treatment effects from contaminated multiperiod education experiments: The dynamic impacts of class size reductions. The Review of Economics and Statistics 92, 1 (2010), 31–42.
  • Goodfellow et al. (2014) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. Advances in neural information processing systems 27 (2014).
  • Gretton et al. (2005) Arthur Gretton, Olivier Bousquet, Alex Smola, and Bernhard Schölkopf. 2005. Measuring statistical dependence with Hilbert-Schmidt norms. In International conference on algorithmic learning theory. Springer, 63–77.
  • Gu et al. (2021) Tiankai Gu, Kun Kuang, Hong Zhu, Jingjie Li, Zhenhua Dong, Wenjie Hu, Zhenguo Li, Xiuqiang He, and Yue Liu. 2021. Estimating true post-click conversion via group-stratified counterfactual inference. In Proceedings of ACM SIGKDD Conference on Knowledge Discovery and Data Mining.
  • Guo et al. (2020a) Ruocheng Guo, Lu Cheng, Jundong Li, P Richard Hahn, and Huan Liu. 2020a. A survey of learning causality with data: Problems and methods. ACM Computing Surveys (CSUR) 53, 4 (2020), 1–37.
  • Guo et al. (2021) Ruocheng Guo, Jundong Li, Yichuan Li, K Selçuk Candan, Adrienne Raglin, and Huan Liu. 2021. Ignite: A minimax game toward learning individual treatment effects from networked observational data. In Proceedings of the Twenty-Ninth International Conference on International Joint Conferences on Artificial Intelligence. 4534–4540.
  • Guo et al. (2020b) Ruocheng Guo, Jundong Li, and Huan Liu. 2020b. Counterfactual evaluation of treatment assignment functions with networked observational data. In Proceedings of the 2020 SIAM International Conference on Data Mining. SIAM, 271–279.
  • Guo et al. (2020c) Ruocheng Guo, Jundong Li, and Huan Liu. 2020c. Learning individual causal effects from networked observational data. In Proceedings of the 13th international conference on web search and data mining. 232–240.
  • Häggström (2018) Jenny Häggström. 2018. Data-driven confounder selection via Markov and Bayesian networks. Biometrics 74, 2 (2018), 389–398.
  • Hassanpour and Greiner (2019) Negar Hassanpour and Russell Greiner. 2019. Learning disentangled representations for counterfactual regression. In International Conference on Learning Representations.
  • Huang et al. (2023) Qiang Huang, Jing Ma, Jundong Li, Ruocheng Guo, Huiyan Sun, and Yi Chang. 2023. Modeling Interference for Individual Treatment Effect Estimation from Networked Observational Data. ACM Transactions on Knowledge Discovery from Data 18, 3 (2023), 1–21.
  • Imbens and Rubin (2015) Guido W Imbens and Donald B Rubin. 2015. Causal inference in statistics, social, and biomedical sciences. Cambridge university press.
  • Johansson et al. (2016) Fredrik Johansson, Uri Shalit, and David Sontag. 2016. Learning representations for counterfactual inference. In International conference on machine learning. PMLR, 3020–3029.
  • Kingma (2013) Diederik P Kingma. 2013. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114 (2013).
  • Kingma (2014) Diederik P Kingma. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
  • Kipf and Welling (2016a) Thomas N Kipf and Max Welling. 2016a. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016).
  • Kipf and Welling (2016b) Thomas N Kipf and Max Welling. 2016b. Variational graph auto-encoders. arXiv preprint arXiv:1611.07308 (2016).
  • Kuang et al. (2017) Kun Kuang, Peng Cui, Bo Li, Meng Jiang, Shiqiang Yang, and Fei Wang. 2017. Treatment effect estimation with data-driven variable decomposition. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 31.
  • Kuang et al. (2020a) Kun Kuang, Peng Cui, Hao Zou, Bo Li, Jianrong Tao, Fei Wu, and Shiqiang Yang. 2020a. Data-driven variable decomposition for treatment effect estimation. IEEE Transactions on Knowledge and Data Engineering 34, 5 (2020), 2120–2134.
  • Kuang et al. (2020b) Kun Kuang, Lian Li, Zhi Geng, Lei Xu, Kun Zhang, Beishui Liao, Huaxin Huang, Peng Ding, Wang Miao, and Zhichao Jiang. 2020b. Causal inference. Engineering 6, 3 (2020), 253–263.
  • Liu et al. (2024) Yu Liu, Jian Wang, and Bing Li. 2024. EDVAE: Disentangled latent factors models in counterfactual reasoning for individual treatment effects estimation. Information Sciences 652 (2024), 119578.
  • Louizos et al. (2017) Christos Louizos, Uri Shalit, Joris M Mooij, David Sontag, Richard Zemel, and Max Welling. 2017. Causal effect inference with deep latent-variable models. Advances in neural information processing systems 30 (2017).
  • Müller (1997) Alfred Müller. 1997. Integral probability metrics and their generating classes of functions. Advances in applied probability 29, 2 (1997), 429–443.
  • Pearl (2009a) Judea Pearl. 2009a. Causal inference in statistics: An overview. (2009).
  • Pearl (2009b) Judea Pearl. 2009b. Causality. Cambridge university press.
  • Rakesh et al. (2018) Vineeth Rakesh, Ruocheng Guo, Raha Moraffah, Nitin Agarwal, and Huan Liu. 2018. Linked causal variational autoencoder for inferring paired spillover effects. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management. 1679–1682.
  • Rosenbaum and Rubin (1983) Paul R Rosenbaum and Donald B Rubin. 1983. The central role of the propensity score in observational studies for causal effects. Biometrika 70, 1 (1983), 41–55.
  • Rubin (1978) Donald B Rubin. 1978. Bayesian inference for causal effects: The role of randomization. The Annals of statistics (1978), 34–58.
  • Rubin (2005) Donald B Rubin. 2005. Causal inference using potential outcomes: Design, modeling, decisions. J. Amer. Statist. Assoc. 100, 469 (2005), 322–331.
  • Schölkopf et al. (2021) Bernhard Schölkopf, Francesco Locatello, Stefan Bauer, Nan Rosemary Ke, Nal Kalchbrenner, Anirudh Goyal, and Yoshua Bengio. 2021. Toward causal representation learning. Proc. IEEE 109, 5 (2021), 612–634.
  • Shalit et al. (2017) Uri Shalit, Fredrik D Johansson, and David Sontag. 2017. Estimating individual treatment effect: generalization bounds and algorithms. In International conference on machine learning. PMLR, 3076–3085.
  • Song et al. (2012) Le Song, Alex Smola, Arthur Gretton, Justin Bedo, and Karsten Borgwardt. 2012. Feature Selection via Dependence Maximization. Journal of Machine Learning Research 13, 5 (2012).
  • Sriperumbudur et al. (2012) Bharath K Sriperumbudur, Kenji Fukumizu, Arthur Gretton, Bernhard Schölkopf, and Gert RG Lanckriet. 2012. On the empirical estimation of integral probability metrics. (2012).
  • Tang and Liu (2011) Lei Tang and Huan Liu. 2011. Leveraging social media networks for classification. Data mining and knowledge discovery 23 (2011), 447–478.
  • Thorat et al. (2023) Abhinav Thorat, Ravi Kolla, Niranjan Pedanekar, and Naoyuki Onoe. 2023. Estimation of individual causal effects in network setup for multiple treatments. arXiv preprint arXiv:2312.11573 (2023).
  • Veitch et al. (2019) Victor Veitch, Yixin Wang, and David Blei. 2019. Using embeddings to correct for unobserved confounding in networks. Advances in Neural Information Processing Systems 32 (2019).
  • Veličković et al. (2017) Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. 2017. Graph attention networks. arXiv preprint arXiv:1710.10903 (2017).
  • Vowels et al. (2021) Matthew J Vowels, Necati Cihan Camgoz, and Richard Bowden. 2021. Targeted VAE: Variational and targeted learning for causal inference. In 2021 IEEE International Conference on Smart Data Services (SMDS). IEEE, 132–141.
  • Wager and Athey (2018) Stefan Wager and Susan Athey. 2018. Estimation and inference of heterogeneous treatment effects using random forests. J. Amer. Statist. Assoc. 113, 523 (2018), 1228–1242.
  • Winship and Morgan (1999) Christopher Winship and Stephen L Morgan. 1999. The estimation of causal effects from observational data. Annual review of sociology 25, 1 (1999), 659–706.
  • Wu et al. (2022) Anpeng Wu, Junkun Yuan, Kun Kuang, Bo Li, Runze Wu, Qiang Zhu, Yueting Zhuang, and Fei Wu. 2022. Learning decomposed representations for treatment effect estimation. IEEE Transactions on Knowledge and Data Engineering 35, 5 (2022), 4989–5001.
  • Wu and Fukumizu (2021) Pengzhou Wu and Kenji Fukumizu. 2021. Intact-VAE: Estimating treatment effects under unobserved confounding. arXiv preprint arXiv:2101.06662 (2021).
  • Yao et al. (2021) Liuyi Yao, Zhixuan Chu, Sheng Li, Yaliang Li, Jing Gao, and Aidong Zhang. 2021. A survey on causal inference. ACM Transactions on Knowledge Discovery from Data (TKDD) 15, 5 (2021), 1–46.
  • Yao et al. (2018) Liuyi Yao, Sheng Li, Yaliang Li, Mengdi Huai, Jing Gao, and Aidong Zhang. 2018. Representation learning for treatment effect estimation from observational data. Advances in neural information processing systems 31 (2018).
  • Zhang et al. (2021b) Weijia Zhang, Lin Liu, and Jiuyong Li. 2021b. Treatment effect estimation with disentangled latent factors. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35. 10923–10930.
  • Zhang et al. (2021a) Yang Zhang, Fuli Feng, Xiangnan He, Tianxin Wei, Chonggang Song, Guohui Ling, and Yongdong Zhang. 2021a. Causal intervention for leveraging popularity bias in recommendation. In Proceedings of the 44th international ACM SIGIR conference on research and development in information retrieval. 11–20.
  • Zhang et al. (2019) Zichen Zhang, Qingfeng Lan, Lei Ding, Yue Wang, Negar Hassanpour, and Russell Greiner. 2019. Reducing selection bias in counterfactual reasoning for individual treatment effects estimation. arXiv preprint arXiv:1912.09040 (2019).