
Data-Driven Causal Effect Estimation Based on Graphical Causal Modelling: A Survey

Debo Cheng [email protected] 0000-0002-0383-1462, Jiuyong Li [email protected] 0000-0002-9023-1878, Lin Liu [email protected] 0000-0003-2843-5738, Jixue Liu [email protected] 0000-0002-0794-0404, and Thuc Duy Le [email protected] 0000-0002-9732-4313
UniSA STEM, University of South Australia, Mawson Lakes, Adelaide, 5095, Australia
(202*)
Abstract.

In many fields of scientific research and real-world applications, unbiased estimation of causal effects from non-experimental data is crucial for understanding the mechanisms underlying the data and for decision-making on effective responses or interventions. A great deal of research has been conducted to address this challenging problem from different angles. For estimating causal effects from observational data, assumptions such as the Markov condition, faithfulness and causal sufficiency are commonly made, and full knowledge, such as a set of covariates or an underlying causal graph, is typically required. A practical challenge is that in many applications no such full knowledge, or only partial knowledge, is available. In recent years, research has emerged that uses search strategies based on graphical causal modelling to discover useful knowledge from data for causal effect estimation under mild assumptions, and it has shown promise in tackling this practical challenge. In this survey, we review these data-driven methods for estimating the causal effect of a single treatment on a single outcome of interest and focus on the challenges faced by data-driven causal effect estimation. We concisely summarise the basic concepts and theories that are essential for data-driven causal effect estimation using graphical causal modelling but are scattered around the literature. We identify and discuss the challenges faced by data-driven causal effect estimation and characterise the existing methods by their assumptions and their approaches to tackling the challenges. We analyse the strengths and limitations of the different types of methods and present an empirical evaluation to support the discussions. We hope this review will motivate more researchers to design better data-driven methods based on graphical causal modelling for the challenging problem of causal effect estimation.

Causal inference, Causality, Graphical causal model, Causal effect estimation, Latent confounders, Instrumental variable
copyright: acmcopyright; journalyear: 202*; doi: XXXXXXX.XXXXXXX; journal: JACM; journalvolume: xx; journalnumber: x; article: 1; publicationmonth: 1; ccs: Mathematics of computing / Causal networks; ccs: Computing methodologies / Causal reasoning and diagnostics; ccs: Artificial intelligence / Knowledge representation and reasoning

1. Introduction

Causal inference is a fundamental task in many fields, such as epidemiology (Robins and Greenland, 1992; Greenland et al., 1999; Hernán and Robins, 2020) and economics (Imbens and Rubin, 2015; Abadie and Imbens, 2016) for discovering and understanding the underlying mechanisms of different phenomena (Peters et al., 2017; Schölkopf, 2022; Pearl, 2009a; Pearl and Mackenzie, 2018; Hernán and Robins, 2020). One major task for causal inference is to estimate the causal effect (a.k.a. treatment effect) of a treatment (or intervention) on an outcome of interest (Spirtes et al., 2000; Pearl, 2009a; Perković et al., 2018). For example, medical researchers may want to query the causal effect of a new drug on a disease or the causal effect of temperature on atmospheric pollution (Robins, 1986; Pearl et al., 2009; Imbens, 2020; Glymour et al., 2019).

Randomised control trials (RCTs) are the gold standard for estimating causal effects (Deaton and Cartwright, 2018; Rubin, 1974). An RCT aims at removing the effects of other factors on the outcome by randomly assigning an individual to the treated group or the control group. Under randomised assignment, both groups of individuals have the same characteristics. Hence, the causal effects of the treatment on the outcome can be obtained directly by comparing the outcomes of the treated group and the control group. However, RCTs are often expensive or infeasible to conduct due to time constraints or ethical concerns (Pearl and Mackenzie, 2018; Maathuis et al., 2010).

Estimating causal effects from observational data is an important alternative to RCTs (Spirtes et al., 2000; Pearl, 2009a; Imbens and Rubin, 2015). Generally speaking, four types of causal effects are often estimated: the average causal effect (ACE) (Rubin, 1974; Hill, 2011; Henckel et al., 2022; Cheng et al., 2020), the average causal effect on the treated group (ACT) (Morgan and Harding, 2006; Abadie and Imbens, 2016), the conditional average treatment effect (CATE) (Athey and Imbens, 2016; Athey et al., 2018, 2019) and the individual causal effect (ICE) (Shalit et al., 2017; Yao et al., 2018). ACE measures the average change in the outcome due to the application of a treatment at the population level. ACT assesses the average change in the outcome in the treated group only. CATE measures the average causal effect within a sub-population and is also known as the heterogeneous causal effect (HCE). ICE is the causal effect at the individual level and is defined as the difference between the potential outcomes of an individual (Rubin, 1974; Imbens and Rubin, 2015).

This survey is focused on ACE estimation given its wide real-world applications. For example, doctors wish to know the overall efficacy of a new drug on blood pressure (Rubin, 2007; Deaton and Cartwright, 2018); the government wants to assess the effectiveness of job training on income in general (LaLonde, 1986; Cheng et al., 2020); an airline is interested in evaluating the effects of ticket prices on customers' purchase tendency (Hartford et al., 2017, 2021). ACE estimation can involve either a single treatment or multiple treatments on a single outcome or multiple outcomes (Wang and Blei, 2019; Ma et al., 2021; Nabi et al., 2022). This survey focuses on ACE estimation of a single treatment on a single outcome. Data-driven algorithms for causal effect estimation based on graphical causal modelling for multiple treatments and/or multiple outcomes are still at the theoretical research stage (Perković et al., 2018; Jaber et al., 2019b, a), and there are currently no fully data-driven algorithms available for such problems. Hence, we will not discuss them in this survey.

Figure 1. Exemplary DAGs showing (a) a confounder, (b) a mediator, (c) an instrumental variable and (d) a conditional instrumental variable. In DAG (a), $\mathbf{Z}$ is a set of confounders w.r.t. the pair $(W, Y)$; in DAG (b), $\mathbf{Z}$ is a set of mediators w.r.t. $(W, Y)$; in DAG (c), $S$ is an instrumental variable w.r.t. $(W, Y)$; and in DAG (d), $S$ is a conditional instrumental variable conditioning on the set $\mathbf{Z}$ w.r.t. $(W, Y)$.

The main challenge for ACE estimation from observational data is confounding bias, which is caused by confounders, i.e. common causes affecting both the treatment $W$ and the outcome $Y$ (Pearl, 2009a; VanderWeele and Shpitser, 2011). For example, the set of variables $\mathbf{Z}$ in Fig. 1 (a) is a set of confounders. To estimate the causal effect of $W$ on $Y$ unbiasedly from observational data, the confounding bias caused by the set of confounders $\mathbf{Z}$ needs to be removed. The main techniques for obtaining unbiased ACE estimation from observational data are confounding adjustment (Imbens and Rubin, 2015; Hernán and Robins, 2020) and the instrumental variable (IV) approach (Hernán and Robins, 2006; Imbens, 2014). Confounding adjustment aims at mitigating the influence of confounders on the outcome when estimating the causal effect of $W$ on $Y$, and it requires that all confounders are measured (Pearl, 2009a). The IV approach leverages a special variable called an instrumental variable to eliminate confounding bias in the estimation of the causal effect of $W$ on $Y$, and it works even when there are unmeasured confounders (Brito and Pearl, 2002).

One focus of this survey is to review the data-driven confounding adjustment methods which are based on the back-door criterion or its variations (Pearl, 2009a; Shpitser et al., 2010; Maathuis et al., 2015; Perković et al., 2018) without requiring a given causal graph. In graphical causal modelling terms, a set of variables used for confounding adjustment is a set satisfying the back-door criterion (see Definition 7 in Section 2) when the causal DAG is given, and all variables in the set need to be measured. For example, the set $\mathbf{Z}$ in Fig. 1 (a) meets the back-door criterion and hence can be used as an adjustment set. There are variations of the back-door criterion for identifying adjustment sets given a causal graph (Pearl, 2009a; Shpitser et al., 2010; Maathuis et al., 2015; Perković et al., 2018). When a causal graph is given, an adjustment set can be read off the graph. It is challenging, but of practical importance, to identify an adjustment set from data without a causal graph.

Another focus of this survey is to review data-driven methods that search for potential IVs or CIVs (conditional IVs and their corresponding conditioning sets) for ACE estimation. The IV approach (Angrist and Imbens, 1995; Angrist et al., 1996; Pearl, 1995b; Hernán and Robins, 2006; Martens et al., 2006; Yuan et al., 2022) is a powerful way to estimate causal effects from observational data in the presence of latent confounders. A standard IV is a special cause of $W$ (see Definition 9 in Section 2), and an example of an IV is shown in Fig. 1 (c). When a standard IV is known, the Two-Stage Least Squares (TSLS) estimator (Angrist and Imbens, 1995) is commonly used to estimate the ACE. A standard IV can be relaxed to a conditional IV (CIV) (Brito and Pearl, 2002; Imbens, 2014), which needs a companion set, called a conditioning set, for ACE estimation. For example, in Fig. 1 (d), $S$ is a CIV conditioning on the set $\mathbf{Z}$. With a CIV $S$ and a conditioning set $\mathbf{Z}$ for the CIV, the Two-Stage Least Squares estimator for CIVs (the TSLS.CIV estimator) (Imbens, 2014) can be used to estimate the average causal effect unbiasedly. An IV or a CIV is generally not identifiable from data and needs to be given by domain knowledge. It is desirable to find an IV or a CIV (and its conditioning set) from data without the full domain knowledge or the causal graph, but with mild assumptions.

The data-driven methods surveyed in this paper do not require the full causal knowledge, i.e. a causal graph, adjustment set, IV, or a CIV and its conditioning set, whereas most existing (non-data-driven) methods require the full knowledge (Pearl et al., 2009; Spirtes, 2010; Guo et al., 2020; Witte and Didelez, 2019). In many real-world applications, the full knowledge is unavailable. Hence, data-driven methods are needed and practically useful.

In recent years, data-driven methods have emerged that use search strategies (such as those used in causal discovery methods (Spirtes et al., 2000; Aliferis et al., 2010)) to discover the necessary information for causal effect estimation under mild assumptions (Maathuis et al., 2009; Le et al., 2013; Maathuis et al., 2015; Perkovic et al., 2017; Perković et al., 2018; Fang and He, 2020; Cheng et al., 2022c, 2023a), and they have shown promise in tackling the practical challenge that users do not have the full knowledge needed for causal effect estimation. These data-driven methods mostly leverage graphical causal models, which provide a powerful language for discovering the necessary knowledge from data for causal effect estimation (Pearl, 2009a; Pearl et al., 2009; Zander and Liśkiewicz, 2016; van der Zander et al., 2019). However, to date, there is no comprehensive survey that discusses the challenges encountered by the data-driven methods and the strategies used to tackle these challenges.

This survey reviews the theories of the graphical causal models that support data-driven causal effect estimation, identifies three challenging problems for designing data-driven methods based on graphical causal modelling and discusses how the existing data-driven causal effect estimation methods handle the challenges. Our goal is to identify the challenges and to guide users to choose appropriate methods and motivate researchers to design smarter data-driven methods for causal effect estimation based on graphical causal modelling.

To summarise, the contributions of this survey are as follows.

  • We provide a concise review of the fundamental assumptions, theories and definitions of graphical causal modelling underpinning the existing data-driven causal effect estimation methods. We aim at providing an introduction to the set of concepts and theorems that are essential for data-driven causal effect estimation and the major theoretical results in this area. The theorems and assumptions on which the data-driven methods are based are scattered across many different research papers. We summarise them for the convenience of researchers who are interested in venturing into this challenging and important area.

  • We analyse and identify the major challenges confronting data-driven causal effect estimation under the framework of graphical causal modelling, categorise the existing data-driven ACE estimation methods according to their assumptions and the approaches they take to handle the challenges, and review the methods in detail. We provide practical guidance on using the methods and an empirical evaluation of these methods.

  • To the best of our knowledge, this is the first paper that provides a systematic survey on data-driven causal effect estimation methods through the lens of graphical causal modelling. This distinguishes our paper from the existing surveys. Specifically, the surveys in (Stuart, 2010; Guo et al., 2020; Yao et al., 2021; Witte and Didelez, 2019) focus on the causal effect estimation methods based on potential outcome methods and the surveys in (Glymour et al., 2019; Vowels et al., 2022; Nogueira et al., 2022; Yu et al., 2021) are on causal structure learning methods and their tool-kits.

2. Background

In this section, we introduce the fundamental concepts of causal graphs, CPDAGs & PAGs, and causal effect estimation through covariate adjustment and through instrumental variables. Throughout the whole survey, we use an upper case letter, a lower case letter, a boldfaced upper case letter, and a calligraphic letter, e.g., $X$, $x$, $\mathbf{X}$ and $\mathcal{G}$, to denote a variable, a variable value, a set of variables and a graph, respectively. In particular, we use $W$ and $Y$ to denote the treatment variable and the outcome variable, respectively.

2.1. Causal Graphs

Causal relationships among variables are normally represented by a directed acyclic graph (DAG), denoted as $\mathcal{G}=(\mathbf{X},\mathbf{E})$, which contains a set of nodes (variables) $\mathbf{X}$ and a set of directed edges between nodes $\mathbf{E}$, and has no directed cycles. In a DAG, when there is a directed edge $X_i \rightarrow X_j$, $X_i$ is known as a parent of $X_j$ and $X_j$ is a child of $X_i$. A path between two nodes $(X_i, X_j)$ is a set of consecutive edges linking $X_i$ and $X_j$ regardless of the edge directions, and a directed path from $X_i$ to $X_j$ contains a set of consecutive edges all pointing towards $X_j$, i.e. $X_i \rightarrow \dots \rightarrow X_j$. $X_i$ is an ancestor of $X_j$ and $X_j$ is a descendant of $X_i$ if there is a directed path from $X_i$ to $X_j$. We use $Pa(X)$, $Ch(X)$, $An(X)$ and $De(X)$ to denote the sets of all parents, children, ancestors and descendants of $X$, respectively.

In a DAG, if an edge $X_i \rightarrow X_j$ denotes that $X_i$ is a direct cause of $X_j$, the DAG is termed a causal DAG. In a causal DAG $\mathcal{G}$, a directed path between two nodes is called a causal path and a non-directed path between two nodes is a non-causal path. Let $\ast$ be an arbitrary edge mark. $X_i$ is a collider on a path $\pi$ if $X_k \,{\ast\!\!\rightarrow}\, X_i \,{\leftarrow\!\!\ast}\, X_j$ is a sub-path of $\pi$. A collider path is a path with every non-endpoint node being a collider. A path of length one is a trivial collider path.

Some assumptions are needed to use a causal DAG $\mathcal{G}$ to represent the data generation mechanism, including the Markov condition, the faithfulness assumption and the causal sufficiency assumption.

Definition 1 (Markov condition (Pearl, 2009a; Aliferis et al., 2010)).

Given a joint distribution $P(\mathbf{X})$ and a DAG $\mathcal{G}=(\mathbf{X},\mathbf{E})$, $\mathcal{G}$ and $P(\mathbf{X})$ satisfy the Markov condition if $\forall X_i \in \mathbf{X}$, $X_i$ is independent of all of its non-descendants in $\mathcal{G}$, given the set of parents of $X_i$, i.e. $Pa(X_i)$.

The Markov condition indicates that the distribution of $X_i$ is determined solely by the distributions of its parents. With the Markov condition, $P(\mathbf{X})$ can be factorised as $P(\mathbf{X})=\prod_{i}P(X_i \mid Pa(X_i))$ according to $\mathcal{G}$.

Definition 2 (Faithfulness (Spirtes et al., 2000; Glymour and Cooper, 1999)).

A DAG $\mathcal{G}=(\mathbf{X},\mathbf{E})$ is faithful to a joint distribution $P(\mathbf{X})$ if and only if every independence present in $P(\mathbf{X})$ is entailed by $\mathcal{G}$ and the Markov condition. A joint distribution $P(\mathbf{X})$ is faithful to a DAG $\mathcal{G}$ if and only if the DAG $\mathcal{G}$ is faithful to $P(\mathbf{X})$.

Definition 3 (Causal sufficiency (Spirtes et al., 2000)).

A given dataset satisfies causal sufficiency if for every pair of observed variables, all their common causes are observed.

When a distribution $P(\mathbf{X})$ and a DAG $\mathcal{G}=(\mathbf{X},\mathbf{E})$ are faithful to each other, if two variables $X_i$ and $X_j$ in $\mathcal{G}$ are $d$-separated by another set of variables $\mathbf{Z}$ in $\mathcal{G}$, as defined below, then $X_i$ and $X_j$ are conditionally independent given $\mathbf{Z}$. That is, the conditional independence and dependence relationships between variables in $P(\mathbf{X})$ can be read off from the DAG based on $d$-separation or $d$-connection.

Definition 4 ($d$-separation/$d$-connection (Pearl, 2009a)).

A path $\pi$ in a DAG $\mathcal{G}=(\mathbf{X},\mathbf{E})$ is said to be $d$-separated by a set of variables $\mathbf{Z}$ if and only if (i) $\pi$ contains a chain $X_i \rightarrow X_k \rightarrow X_j$ or a fork $X_i \leftarrow X_k \rightarrow X_j$ such that the middle node $X_k \in \mathbf{Z}$, or (ii) $\pi$ contains a collider $X_k$ (i.e. $\pi$ contains the sub-path $X_i \rightarrow X_k \leftarrow X_j$) such that $X_k \notin \mathbf{Z}$ and none of the descendants of $X_k$ is in $\mathbf{Z}$. A set $\mathbf{Z}$ is said to $d$-separate $X_i$ from $X_j$ if and only if $\mathbf{Z}$ $d$-separates every path between $X_i$ and $X_j$ (denoted as $X_i \perp\!\!\!\perp_d X_j \mid \mathbf{Z}$); otherwise $X_i$ and $X_j$ are $d$-connected given $\mathbf{Z}$, i.e. $X_i \not\!\perp\!\!\!\perp_d X_j \mid \mathbf{Z}$.

An exemplary DAG representing a causal mechanism (the causal relationships between all variables in a problem domain) is shown in Fig. 2 (a). In the DAG, $X_1 \rightarrow W \rightarrow Y$ is a causal path, and this path is $d$-separated by $W$. When we consider a $d$-separating set $\mathbf{Z}$ for the pair $(X_1, Y)$, all five paths between the two variables need to be $d$-separated by $\mathbf{Z}$. $W$ must be in $\mathbf{Z}$ because it is the only variable $d$-separating the path $X_1 \rightarrow W \rightarrow Y$. However, including $W$ in $\mathbf{Z}$ $d$-connects $(X_4, X_2)$, $(X_4, X_3)$, $(X_4, X_6)$, $(X_2, X_6)$ and $(X_3, X_6)$ on the four other paths, since $W$ is a collider on each of them. Let us look into these four paths separately. Path $X_1 \rightarrow W \leftarrow X_6 \rightarrow X_7 \leftarrow X_8 \rightarrow Y$ is $d$-separated by $\emptyset$ since $X_7$ is a collider, even when the collider $W$ is included in $\mathbf{Z}$. Path $X_1 \rightarrow W \leftarrow X_4 \rightarrow X_5 \rightarrow Y$ is $d$-separated by the set $\{X_4\}$ or $\{X_5\}$. Paths $X_1 \rightarrow W \leftarrow X_2 \rightarrow X_3 \rightarrow Y$ and $X_1 \rightarrow W \leftarrow X_3 \rightarrow Y$ are $d$-separated by $\{X_3\}$. Consequently, the pair $(X_1, Y)$ is $d$-separated by the set $\mathbf{Z}=\{X_3, X_4, W\}$ or $\mathbf{Z}=\{X_3, X_5, W\}$.
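For readers who wish to verify such reasoning programmatically, the following minimal Python sketch implements the classical moralisation-based test for $d$-separation (restrict to the ancestral sub-graph of the query variables, marry co-parents, drop edge directions, delete the conditioning set, and test connectivity). The dictionary FIG2A encodes the DAG of Fig. 2 (a) as read from the paths listed above; all function and variable names are our own illustrative choices, not part of any surveyed method.

```python
from itertools import combinations

# DAG of Fig. 2 (a), encoded as child -> set of parents (read off the paths above).
FIG2A = {
    "X1": set(), "X2": set(), "X4": set(), "X6": set(), "X8": set(),
    "X3": {"X2"}, "X5": {"X4"}, "X7": {"X6", "X8"},
    "W": {"X1", "X2", "X3", "X4", "X6"},
    "Y": {"W", "X3", "X5", "X8"},
}

def ancestors(parents, nodes):
    """Return `nodes` together with all of their ancestors in the DAG `parents`."""
    result, stack = set(nodes), list(nodes)
    while stack:
        for p in parents[stack.pop()]:
            if p not in result:
                result.add(p)
                stack.append(p)
    return result

def d_separated(parents, x, y, z):
    """True if the set z d-separates x and y (moralisation-based test)."""
    keep = ancestors(parents, {x, y} | set(z))
    adj = {v: set() for v in keep}            # moralised, undirected graph
    for child in keep:
        ps = parents[child] & keep
        for p in ps:                          # skeleton edges
            adj[child].add(p); adj[p].add(child)
        for p, q in combinations(ps, 2):      # marry co-parents of `child`
            adj[p].add(q); adj[q].add(p)
    reachable, stack = {x}, [x]               # search that never enters z
    while stack:
        for nbr in adj[stack.pop()]:
            if nbr not in z and nbr not in reachable:
                reachable.add(nbr); stack.append(nbr)
    return y not in reachable

print(d_separated(FIG2A, "X1", "Y", {"W", "X3", "X4"}))  # True
print(d_separated(FIG2A, "X1", "Y", {"W", "X3", "X5"}))  # True
print(d_separated(FIG2A, "X1", "Y", {"W"}))              # False: X4 -> X5 -> Y is open
```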

Figure 2. An exemplary DAG (a) and MAG (b). DAG (a) consists of $\{X_1, X_2, \dots, X_8, W, Y\}$ and MAG (b) is obtained from DAG (a) with $\{X_4, X_6, X_8\}$ unmeasured.

In numerous real-world applications, latent common causes (a.k.a. latent confounders) are commonplace, so the causal sufficiency assumption is violated (Spirtes et al., 2000; Pearl, 2009a). In this situation, ancestral graphs have been designed to represent causal ancestral relationships (Richardson and Spirtes, 2002, 2003). An ancestral graph is a mixed graph without directed cycles or almost directed cycles, and it contains both directed edges $\rightarrow$ and bidirected edges $\leftrightarrow$ between variables. An almost directed cycle occurs if $X_i \leftrightarrow X_j$ is in $\mathcal{G}$ and $X_i \in An(X_j)$. Similar to $d$-separation in a DAG, $m$-separation serves as a general criterion to read off the independencies and dependencies between measured variables entailed by an ancestral graph (Richardson and Spirtes, 2002).

Definition 5 ($m$-separation/$m$-connection (Richardson and Spirtes, 2002)).

In an ancestral graph $\mathcal{G}=(\mathbf{X},\mathbf{E})$, a path $\pi$ between $X_i$ and $X_j$ is said to be $m$-separated by a set of nodes $\mathbf{Z} \subseteq \mathbf{X} \setminus \{X_i, X_j\}$ (possibly $\emptyset$) if (i) some non-collider on $\pi$ is a member of $\mathbf{Z}$, or (ii) some collider on $\pi$ is not a member of $\mathbf{Z}$ and none of its descendants is in $\mathbf{Z}$. Two nodes $X_i$ and $X_j$ are said to be $m$-separated by $\mathbf{Z}$ in $\mathcal{G}$, denoted as $X_i \perp\!\!\!\perp_m X_j \mid \mathbf{Z}$, if every path between $X_i$ and $X_j$ is $m$-separated by $\mathbf{Z}$; otherwise they are said to be $m$-connected given $\mathbf{Z}$, denoted as $X_i \not\!\perp\!\!\!\perp_m X_j \mid \mathbf{Z}$.

We take the ancestral graph in Fig. 2 (b) as an example for understanding $m$-separation. $X_1 \rightarrow W \rightarrow Y$ is a causal path and the path is $m$-separated by $W$. When we consider an $m$-separating set $\mathbf{Z}$ for the pair $(X_1, Y)$, all five paths between the two nodes need to be $m$-separated by $\mathbf{Z}$. $W$ should be in $\mathbf{Z}$ because it is the only variable $m$-separating the path $X_1 \rightarrow W \rightarrow Y$. Path $X_1 \rightarrow W \leftrightarrow X_7 \leftrightarrow Y$ is $m$-separated by $\emptyset$ since $X_7$ is a collider. Path $X_1 \rightarrow W \leftarrow X_2 \rightarrow X_3 \rightarrow Y$ is $m$-separated by the set $\{X_3\}$. Paths $X_1 \rightarrow W \leftarrow X_3 \rightarrow Y$ and $X_1 \rightarrow W \leftrightarrow X_5 \rightarrow Y$ are $m$-separated by $\{X_3, X_5\}$. Hence, the pair $(X_1, Y)$ is $m$-separated by the set $\mathbf{Z}=\{W, X_3, X_5\}$.

When there are latent confounders in a system, a Maximal Ancestral Graph (MAG) is commonly used to represent causal relationships among measured variables (Richardson and Spirtes, 2002, 2003; Zhang, 2008a).

Definition 6 (Maximal Ancestral Graph (MAG)).

A MAG $\mathcal{G}=(\mathbf{X},\mathbf{E})$ contains directed and bidirected edges, and satisfies the conditions that (i) there is no directed cycle or almost directed cycle, and (ii) every pair of non-adjacent nodes $X_i$ and $X_j$ in $\mathbf{X}$ is $m$-separated by a set $\mathbf{Z} \subseteq \mathbf{X} \setminus \{X_i, X_j\}$.

A bidirected edge $X_i \leftrightarrow X_j$ in a MAG indicates the presence of a latent confounder between $X_i$ and $X_j$, and the absence of a direct causal relationship between $X_i$ and $X_j$. A causal path in a MAG is a directed path without a bidirected edge. In a MAG, $X_i$ and $X_j$ are spouses if there exists a bidirected edge $X_i \leftrightarrow X_j$. The Markov condition and faithfulness assumptions still hold for a MAG. Note that the causal relationships between measured variables in a MAG are the same as in the underlying DAG over the measured and unmeasured variables (Richardson and Spirtes, 2002; Zhang, 2008b, a). Thus, in a MAG, we still use $Pa(X)$, $Ch(X)$, $An(X)$ and $De(X)$ to denote the sets of all parents, children, ancestors and descendants of $X$, respectively.

Another type of causal graph for representing causal relationships between variables with latent confounders is the acyclic directed mixed graph (ADMG) (Tian and Pearl, 2002; Peña, 2018; Runge, 2021). ADMGs are generalisations of DAGs (Silva et al., 2011; Evans and Richardson, 2014), meaning that the causal relationships between observed variables in ADMGs can be encoded by marginalising causal DAGs (Richardson and Spirtes, 2002; Richardson, 2003). In an ADMG, it is possible for two nodes to have more than one edge between them; for example, both a directed edge ($\rightarrow$) and a bidirected edge ($\leftrightarrow$) may connect two nodes. Some works using ADMGs for causal discovery and inference can be found in (Silva et al., 2011; Evans and Richardson, 2014; Bhattacharya et al., 2020; Runge, 2021). However, it is challenging to learn an ADMG from data since an ADMG allows $W \rightarrow Y$ and $W \leftrightarrow Y$ simultaneously. To the best of our knowledge, there are no established data-driven approaches for learning ADMGs from data, which leads to the limited use of ADMGs in data-driven causal effect estimation. Consequently, we do not review ADMG-based methods in the subsequent sections.

2.2. CPDAG & PAG

Learning a DAG (or a MAG) directly from observational data is challenging. Instead, most existing structure learning algorithms (Neapolitan et al., 2004; Spirtes, 2010; Aliferis et al., 2010) learn a Markov equivalence class of DAGs (or MAGs) from the data (with latent confounders). A Markov equivalence class refers to a set of DAGs (or MAGs) that encode the same set of conditional independence relations between measured variables.

A CPDAG (completed partially directed acyclic graph) is used to represent a Markov equivalence class of DAGs (Pearl, 2009a; Spirtes, 2010; Aliferis et al., 2010), i.e. all the DAGs that encode the same $d$-separations as the underlying causal DAG. In a CPDAG, an undirected edge indicates that the orientation of this edge is uncertain. For example, the undirected edge between $W$ and $X_1$ in the learned CPDAG in Fig. 3 indicates either a directed edge $W \rightarrow X_1$ or $W \leftarrow X_1$.

Figure 3. The underlying causal DAG is given in the first panel, the CPDAG learned from the data generated from the causal DAG is presented in the second panel, and three DAGs ($DAG_1$, $DAG_2$, $DAG_3$) encoded in the learned CPDAG are shown in the remaining panels.
Figure 4. The left panel shows the PAG learned from the data generated from the MAG in Fig. 2 (b). There are around 50 MAGs encoded in the learned PAG and the right panel lists three exemplary MAGs encoded in the PAG.

A PAG (Partial Ancestral Graph) is often used to represent a Markov equivalence class of MAGs (Spirtes et al., 2000; Richardson and Spirtes, 2003, 2002; Ali et al., 2009), i.e. all the MAGs that encode the same $m$-separations/$m$-connections as the underlying causal MAG. In a PAG, edges may carry the circle mark $\circ$ in addition to the edge types in a MAG. Like an undirected edge in a CPDAG, a circle mark in a PAG indicates uncertainty. For example, the edge $W \leftarrow\!\!\circ\; X_2$ in the PAG in Fig. 4 indicates either a directed edge $W \leftarrow X_2$ or $W \leftrightarrow X_2$.

Because a PAG contains some edges that are not fully oriented, we need the notion of "definite status" for nodes and paths to add certainty to $m$-separations and $m$-connections among nodes. In a PAG, a node $X_j$ is a definite non-collider on a path $\pi$ if there exists at least one edge out of $X_j$ on $\pi$, or both edges incident to $X_j$ on $\pi$ have a circle mark at $X_j$ and the sub-path $\langle X_i, X_j, X_k \rangle$ is an unshielded triple (Zhang, 2008a). A node $X_i$ has definite status if it is a collider or a definite non-collider on the path. A collider on a path $\pi$ is referred to as a definite collider since it always has definite status. A path $\pi$ is of definite status if every non-endpoint node on $\pi$ is of definite status.

2.3. Average Causal Effect, Confounding Adjustment and Graphical Criteria

Let $W$ be a binary variable representing the treatment status ($W=1$ for treated and $W=0$ for untreated) and $Y$ the outcome variable. The average causal effect (ACE) of $W$ on $Y$ is defined as $\operatorname{ACE}(W,Y)=\mathbf{E}(Y \mid do(W=1)) - \mathbf{E}(Y \mid do(W=0))$, where the $do$-operator $do(\cdot)$ (Pearl et al., 2009) represents an intervention that sets a variable to a specific value in an experiment.

Confounding adjustment is the most common way of obtaining unbiased causal effect estimation from observational data. Given an adjustment set $\mathbf{Z}$, $\operatorname{ACE}(W,Y)$ can be estimated unbiasedly as follows:

$\operatorname{ACE}(W,Y)=\sum_{\mathbf{z}}\left[\mathbf{E}(Y \mid W=1, \mathbf{Z}=\mathbf{z}) - \mathbf{E}(Y \mid W=0, \mathbf{Z}=\mathbf{z})\right] P(\mathbf{Z}=\mathbf{z})$

Therefore, the task of causal effect estimation is transformed into identifying an appropriate adjustment set. Note that there may exist multiple adjustment sets, all leading to unbiased estimation of $\operatorname{ACE}(W,Y)$.
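To illustrate the adjustment formula numerically, the short simulation below (with made-up coefficients, purely for illustration) generates data from the simple DAG $Z \rightarrow W$, $Z \rightarrow Y$, $W \rightarrow Y$ with a binary confounder $Z$ and a true $\operatorname{ACE}(W,Y)$ of 2. The naive difference of means is confounded, whereas the stratified back-door adjustment recovers the true effect.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Illustrative data-generating process: Z confounds W and Y.
Z = rng.binomial(1, 0.4, n)                    # binary confounder
W = rng.binomial(1, 0.2 + 0.6 * Z)             # treatment probability depends on Z
Y = 2.0 * W + 3.0 * Z + rng.normal(0, 1, n)    # true ACE(W, Y) = 2.0

# Naive (unadjusted) contrast is biased by the confounder Z.
naive = Y[W == 1].mean() - Y[W == 0].mean()

# Back-door adjustment: stratify on Z and average over P(Z = z).
adjusted = 0.0
for z in (0, 1):
    in_stratum = Z == z
    effect_z = (Y[in_stratum & (W == 1)].mean()
                - Y[in_stratum & (W == 0)].mean())
    adjusted += effect_z * in_stratum.mean()

print(f"naive: {naive:.2f}, adjusted: {adjusted:.2f}")   # adjusted is close to 2.0
```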

Based on causal graphs, two main criteria have been developed and are widely used for adjustment set identification: the back-door criterion (Pearl et al., 2009) for a causal DAG and the generalised back-door criterion (Maathuis et al., 2015) for a MAG.

Definition 7 (The back-door criterion (Pearl et al., 2009)).

In a causal DAG $\mathcal{G}=(\mathbf{X},\mathbf{E})$, a set $\mathbf{Z}$ satisfies the back-door criterion w.r.t. the pair of variables $(W, Y)$ if (i) $\mathbf{Z}$ does not contain a descendant of $W$, and (ii) $\mathbf{Z}$ $d$-separates all the back-door paths between $W$ and $Y$ (i.e., the paths starting with an edge pointing to $W$).

We use the DAG in Fig. 2 (a) to illustrate the back-door criterion. The set $\{X_3, X_5\}$ satisfies the back-door criterion and is a proper adjustment set w.r.t. $(W, Y)$, as it $d$-separates all three back-door paths: $W \leftarrow X_4 \rightarrow X_5 \rightarrow Y$, $W \leftarrow X_2 \rightarrow X_3 \rightarrow Y$, and $W \leftarrow X_3 \rightarrow Y$. The other sets satisfying the back-door criterion are $\{X_3, X_4\}$, $\{X_2, X_3, X_4\}$, $\{X_2, X_3, X_5\}$, $\{X_3, X_4, X_5\}$ and $\{X_2, X_3, X_4, X_5\}$.
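In code, checking whether a candidate set satisfies the back-door criterion amounts to (i) rejecting any set containing a descendant of $W$ and (ii) testing $d$-separation of $W$ and $Y$ after deleting the edges out of $W$. The sketch below reuses the FIG2A encoding and the ancestors/d_separated helpers from the $d$-separation sketch in Section 2.1 (assumed to be in scope); it is only an illustration of the criterion, not part of any surveyed method.

```python
def satisfies_backdoor(parents, zset, w="W", y="Y"):
    """Back-door check for zset w.r.t. (w, y): zset contains no descendant of w,
    and zset d-separates w and y once the edges out of w are removed."""
    descendants_of_w = {v for v in parents if v != w and w in ancestors(parents, {v})}
    if zset & descendants_of_w:
        return False
    trimmed = {child: ps - {w} for child, ps in parents.items()}  # drop edges out of w
    return d_separated(trimmed, w, y, zset)

print(satisfies_backdoor(FIG2A, {"X3", "X5"}))       # True
print(satisfies_backdoor(FIG2A, {"X3", "X4"}))       # True
print(satisfies_backdoor(FIG2A, {"X4", "X5"}))       # False: W <- X3 -> Y stays open
print(satisfies_backdoor(FIG2A, {"X3", "X5", "Y"}))  # False: Y is a descendant of W
```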

It is worth mentioning that in a given causal graph, the three do-calculus rules proposed by Pearl (Pearl, 1995a, 2009a) are sound and complete for determining whether a causal effect is identifiable (Shpitser and Pearl, 2006, 2008).

The generalised back-door criterion in a MAG is introduced as follows.

Definition 8 (The generalised back-door criterion (Maathuis et al., 2015)).

In a MAG $\mathcal{G}=(\mathbf{X},\mathbf{E})$, a set $\mathbf{Z}$ satisfies the generalised back-door criterion w.r.t. $(W, Y)$ if (i) $\mathbf{Z}$ does not include a descendant of $W$, and (ii) $\mathbf{Z}$ $m$-separates all non-causal paths between $W$ and $Y$.

We use the MAG in Fig. 2 (b) to illustrate the generalised back-door criterion. The set $\{X_3, X_5\}$ satisfies the generalised back-door criterion w.r.t. $(W, Y)$ as it $m$-separates all non-causal paths between $W$ and $Y$: $W \leftarrow X_2 \rightarrow X_3 \rightarrow Y$, $W \leftarrow X_3 \rightarrow Y$, $W \leftrightarrow X_5 \rightarrow Y$, and $W \leftrightarrow X_7 \leftrightarrow Y$ (note that $X_3$ and $X_5$ must be in any such set, since each is the only non-collider on one of these paths). The other set satisfying the generalised back-door criterion is $\{X_2, X_3, X_5\}$.

Two other graphical criteria, the adjustment criterion (Shpitser et al., 2010) and the generalised adjustment criterion (Perković et al., 2018) extend the back-door criterion and the generalised back-door criterion, respectively, for the identification of adjustment sets. The adjustment criterion can be used to determine an adjustment set in a DAG, while the generalised adjustment criterion can be used to determine an adjustment set in four types of graphs, including DAG, CPDAG, MAG and PAG. The generalised adjustment criterion used for data-driven causal effect estimation will be introduced in Subsection 4.3.4.

Once an adjustment set is obtained, a confounding adjustment method can be used to remove confounding bias. Many confounding adjustment methods have been developed, such as Nearest Neighbour Matching (NNM) (Rosenbaum and Rubin, 1985; Sekhon, 2011; Cheng et al., 2022b), Propensity Score Matching (PSM) (Morgan and Harding, 2006; Rubin, 1973; Sekhon, 2011), Covariate Balancing Propensity Score (CBPS) (Imai and Ratkovic, 2014), Inverse Probability of Treatment Weighting (IPTW) (Hirano et al., 2003; Robins et al., 2000a; Cole and Hernán, 2008; Hernán and Robins, 2020), and Doubly Robust Learning (DRL) (Bang and Robins, 2005; Funk et al., 2011; Benkeser et al., 2017). For more details on confounding adjustment, please refer to the surveys (Stuart, 2010; Sauer et al., 2013; Imai and Ratkovic, 2014; Witte and Didelez, 2019; Yao et al., 2021; Guo et al., 2020), or the books (Pearl, 2009a; Koller and Friedman, 2009; Imbens and Rubin, 2015; Morgan and Winship, 2015; Hernán and Robins, 2020).
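As a concrete illustration of one of these estimators, the sketch below implements a bare-bones IPTW estimate of the ACE: a logistic-regression propensity score fitted on an adjustment set $\mathbf{Z}$, followed by inverse-probability weighting of the observed outcomes. The simulated data and the simple (untruncated) weights are illustrative assumptions of ours; this is not a substitute for the cited implementations.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 50_000

# Simulated data (illustrative): Z is a two-dimensional adjustment set.
Z = rng.normal(size=(n, 2))
true_propensity = 1 / (1 + np.exp(-(0.8 * Z[:, 0] - 0.5 * Z[:, 1])))
W = rng.binomial(1, true_propensity)
Y = 2.0 * W + Z[:, 0] + 0.5 * Z[:, 1] + rng.normal(size=n)    # true ACE = 2.0

# Step 1: estimate the propensity score e(Z) = P(W = 1 | Z).
e_hat = LogisticRegression().fit(Z, W).predict_proba(Z)[:, 1]

# Step 2: inverse probability of treatment weighting.
ace_iptw = np.mean(W * Y / e_hat) - np.mean((1 - W) * Y / (1 - e_hat))
print(f"IPTW estimate of the ACE: {ace_iptw:.2f}")            # close to 2.0
```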

2.4. Instrumental Variable Approach & Other Alternative Approaches

A proper adjustment set must contain a set of variables sufficient to remove the confounding bias between $W$ and $Y$. However, when there is a latent common cause of $W$ and $Y$, confounding adjustment does not work. In such cases, the instrumental variable (IV) approach is a powerful way to estimate the average causal effect (Hernán and Robins, 2006; Martens et al., 2006; Abadie, 2003; Angrist and Imbens, 1995).

Figure 5. Comparing a standard IV (a) with an ancestral IV $S$ (b). In DAG (a), $S$ satisfies the three conditions of Definition 9 and thus is a standard IV. In DAG (b), $S$ does not satisfy conditions (ii) and (iii) of a standard IV, but it is an ancestral IV conditioning on $\{X_1, X_3\}$.
Definition 9 (Standard IV (Bowden and Turkington, 1990; Hernán and Robins, 2006)).

A variable $S$ is said to be a standard IV w.r.t. the pair of variables $(W, Y)$ if (i) $S$ has a causal effect on $W$, (ii) $S$ affects $Y$ only through $W$, and (iii) $S$ does not share common causes with $Y$.

As an example, in Fig. 5 (a), $S$ is a standard IV w.r.t. the pair of variables $(W, Y)$ since $S$ is a direct cause of $W$, all of its effect on $Y$ is through $W$, and $S$ and $Y$ do not share common causes.

When a standard IV is given, under the assumption that the data is generated by a linear structural equation model (Angrist et al., 1996; Robins, 1997; VanderWeele and Shpitser, 2011), $\operatorname{ACE}(W,Y)$ can be calculated as $\sigma_{sy}/\sigma_{sw}$, where $\sigma_{sy}$ and $\sigma_{sw}$ are the regression coefficients of $S$ when regressing $Y$ on $S$ and when regressing $W$ on $S$, respectively. The two-stage least squares (TSLS) regression is one of the most commonly used IV estimators (Angrist and Imbens, 1995; Angrist et al., 1996; Imbens, 2014). Non-linear or non-parametric implementations of IV estimators for calculating $\sigma_{sy}$ and $\sigma_{sw}$ can be found in (Hartford et al., 2017; Singh et al., 2019; Sjolander and Martinussen, 2019; Bennett et al., 2019). Data-driven causal effect estimation methods do not assume that an IV is known. Thus this survey does not review the standard IV methods that require a known standard IV. Reviews of the standard IV methods can be found in (Bowden and Turkington, 1990; Hernán and Robins, 2006; Martens et al., 2006; Guo et al., 2020).
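The sketch below illustrates the ratio $\sigma_{sy}/\sigma_{sw}$ on data simulated from a linear structural equation model with a latent confounder $U$; all coefficients are invented for illustration. The naive regression of $Y$ on $W$ is biased by $U$, whereas the IV ratio (equivalently, TSLS with a single instrument) recovers the true effect.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

# Linear SEM with a latent confounder U (illustrative coefficients).
U = rng.normal(size=n)                         # latent confounder of W and Y
S = rng.normal(size=n)                         # standard IV: S affects W only
W = 1.5 * S + 2.0 * U + rng.normal(size=n)
Y = 3.0 * W + 4.0 * U + rng.normal(size=n)     # true ACE(W, Y) = 3.0

naive = np.polyfit(W, Y, 1)[0]                 # slope of Y ~ W, biased by U
sigma_sy = np.polyfit(S, Y, 1)[0]              # coefficient of S when regressing Y on S
sigma_sw = np.polyfit(S, W, 1)[0]              # coefficient of S when regressing W on S
print(f"naive: {naive:.2f}, IV ratio: {sigma_sy / sigma_sw:.2f}")   # IV ratio ~ 3.0
```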

A standard IV is difficult to find because of its stringent conditions. The ancestral IV (AIV) (Van der Zander et al., 2015) defined below relaxes some of the conditions of a standard IV. An AIV is a practical form of the conditional IV (CIV) and is used in data-driven causal effect estimation. More explanation is provided later.

Definition 10 (Ancestral IV (AIV) (Van der Zander et al., 2015)).

A variable $S \in \mathbf{X}$ is said to be an AIV w.r.t. the ordered pair of variables $(W, Y)$ if there exists a set of measured variables $\mathbf{Z} \subseteq \mathbf{X} \setminus \{S\}$ such that (i) $S \not\!\perp\!\!\!\perp_d W \mid \mathbf{Z}$, (ii) $S \perp\!\!\!\perp_d Y \mid \mathbf{Z}$ in $\mathcal{G}_{\underline{W}}$, where $\mathcal{G}_{\underline{W}}$ is the DAG obtained by removing $W \rightarrow Y$ from $\mathcal{G}$, and (iii) $\mathbf{Z} \subseteq (An(Y) \cup An(S))$ and $\forall Z \in \mathbf{Z}$, $Z \notin De(Y)$. In this case, $\mathbf{Z}$ is said to instrumentalise $S$.

An example of an AIV is provided in Fig. 5 (b). We see that $S \rightarrow X_3 \rightarrow Y$ violates requirement (ii) of a standard IV, and $S \leftarrow X_1 \rightarrow X_2 \rightarrow Y$ violates requirement (iii) of a standard IV. The set of variables $\mathbf{Z} = \{X_1, X_3\}$ instrumentalises $S$ to be an AIV since $S \not\!\perp\!\!\!\perp_d W \mid \mathbf{Z}$ in the DAG, $S \perp\!\!\!\perp_d Y \mid \mathbf{Z}$ when $W \rightarrow Y$ is removed from the DAG, and all elements in $\mathbf{Z}$ are ancestors of $Y$.

Given an ancestral IV $S$ instrumentalised by $\mathbf{Z}$, the causal effect of $W$ on $Y$, $\operatorname{ACE}(W,Y)$, can be estimated by $\sigma_{sy\mathbf{z}}/\sigma_{sw\mathbf{z}}$, where $\sigma_{sy\mathbf{z}}$ is the regression coefficient of $S$ when $Y$ is regressed on $S$ and $\mathbf{Z}$, and $\sigma_{sw\mathbf{z}}$ is the regression coefficient of $S$ when $W$ is regressed on $S$ and $\mathbf{Z}$.
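Analogously to the previous sketch, the conditioning set $\mathbf{Z}$ of an ancestral IV simply enters both regressions as additional covariates. The toy data-generating process below is an assumption of ours for illustration: $Z$ is an observed ancestor of $Y$ that must be conditioned on for $S$ to act as an instrument, and $U$ is a latent confounder.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000

# Illustrative linear SEM: S is an instrument only after conditioning on Z.
U = rng.normal(size=n)                                   # latent confounder
Z = rng.normal(size=n)                                   # observed conditioning variable
S = 0.8 * Z + rng.normal(size=n)
W = 1.2 * S + 1.0 * Z + 2.0 * U + rng.normal(size=n)
Y = 3.0 * W + 2.0 * Z + 4.0 * U + rng.normal(size=n)     # true ACE(W, Y) = 3.0

def coef_of_first(target, covariates):
    """OLS coefficient of the first covariate when regressing `target` on them."""
    X = np.column_stack([np.ones(n)] + covariates)
    beta = np.linalg.lstsq(X, target, rcond=None)[0]
    return beta[1]

sigma_syz = coef_of_first(Y, [S, Z])     # coefficient of S in Y ~ S + Z
sigma_swz = coef_of_first(W, [S, Z])     # coefficient of S in W ~ S + Z
print(f"conditional IV estimate: {sigma_syz / sigma_swz:.2f}")   # ~ 3.0
```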

An AIV is a CIV (conditional IV) (Brito and Pearl, 2002), except that an AIV has the extra requirement that $\mathbf{Z} \subseteq (An(Y) \cup An(S))$. When a causal graph is not given and is to be discovered from data, a CIV may produce misleading results since it may be a variable that is uncorrelated with $W$ (Van der Zander et al., 2015). For example, in the DAG with $X_1 \leftarrow U_1 \rightarrow X_2 \leftarrow U_2 \rightarrow W \rightarrow Y$ and $W \leftarrow U \rightarrow Y$, where $\{U, U_1, U_2\}$ are latent variables, $X_1$ is a CIV by definition but is actually independent of $W$. This could result in a misleading conclusion. An AIV circumvents such problematic cases.

The IV approach has been extensively studied and developed, particularly in the field of economics (Arellano and Bover, 1995; Angrist et al., 1996; Abadie, 2003). There are two types of IV estimators: parametric and nonparametric. For a parametric IV estimator, the linearity and homogeneous causal effects assumptions are required to identify causal effects from data with latent confounders. The TSLS estimator is one of the most well-known parametric estimators (Angrist and Imbens, 1995). Without these assumptions, even if a valid IV exists, causal effects are non-identifiable in real-world scenarios. A nonparametric IV estimator relaxes these assumptions and can identify causal effects without relying on distribution assumptions and functional form restrictions (Hartford et al., 2017; Wu et al., 2022). Recently, some nonparametric IV estimators have been developed based on machine learning models, such as kernel-based IV regression (Singh et al., 2019), deep learning-based IV estimator (Hartford et al., 2017), and moment conditions-based IV estimator (Bennett et al., 2019).

The front-door criterion, introduced on page 83 of (Pearl, 2009a), is an alternative graphical criterion for causal effect estimation when there exists a latent confounder affecting both $W$ and $Y$. The front-door criterion can be used to identify a front-door adjustment set, which lies on the causal paths between $W$ and $Y$. The front-door criterion addresses the problem where the causal effect of $W$ on $Y$ is not identifiable by the back-door criterion or its extensions. For example, Fig. 1 (b) shows a case of front-door adjustment, in which $\mathbf{Z}$ satisfies the front-door criterion.

Both the front-door criterion and the IV approach are used to estimate the causal effect of $W$ on $Y$ in the presence of latent confounders (Pearl, 2009a). Broadly speaking, the IV approach selects a pretreatment variable as a valid IV, whereas the front-door criterion identifies a post-treatment variable as an adjustment set; their suitability depends on the specific research question and study design. For data-driven causal effect estimation, pretreatment variables are often assumed, since otherwise the discovery of the causal structure around $W$ and $Y$ would be very challenging. Hence, there is no data-driven method using the front-door criterion yet. Recently, Bhattacharya and Nabi (Bhattacharya and Nabi, 2022) used an auxiliary anchor variable and Verma constraints to test the satisfaction of the front-door criterion and showed that an intersection model in which both the IV and front-door conditions are satisfied can be tested from data. These results open a door to designing a data-driven solution based on the front-door criterion.

As an alternative to IV-based methods for dealing with latent confounders, sensitivity analysis can be employed to obtain bound estimates when latent confounders exist. Sensitivity analysis aims to explore the effect of residual latent confounding in causal effect estimation from observational data (VanderWeele and Ding, 2017; Robins et al., 2000b). An early work on sensitivity analysis by Cornfield et al. (Cornfield et al., 1959) focused on the relative prevalence of a binary latent factor (such as a gene switching on or off) among smokers and non-smokers and its ability to explain the observed association between smoking and lung cancer. Robins et al. (Robins et al., 2000b) provided a set of sensitivity parameters to capture the difference between the conditional distributions of the outcomes of treated and control units. Mattei et al. (Mattei et al., 2014) proposed an approach to identifying the causal effects of a binary treatment when the outcome is missing for a subset of units and nonresponse dependence on the outcome cannot be ruled out. Duarte et al. (Duarte et al., 2021) presented a novel approach to deriving bounds on causal effects in discrete settings and automating the process of partial identification. Scharfstein et al. (Scharfstein et al., 2021) used semiparametric theory to obtain a nonparametric efficient influence function of the ACE for fixed sensitivity parameters. The sensitivity analysis methods are not strictly based on graphical causal modelling, so in this survey we do not cover their details. Readers interested in sensitivity analysis methods may refer to the literature in (Christopher Frey and Patil, 2002; Tortorelli and Michaleris, 1994; Ding and VanderWeele, 2016).

3. Challenges for Data-Driven Causal Effect Estimation

In many applications, a DAG or MAG is unknown and we need to estimate causal effects directly from data. An intuitive approach is to recover the causal graph from data and subsequently use it to identify an adjustment set, or an AIV along with its conditioning set, for causal effect estimation. However, this is not straightforward, and data-driven causal effect estimation faces the following major challenges.

3.1. Latent Variables

Latent variables exist in practical scenarios because some variables in a system may be unmeasurable or may not be measured during data collection. In this survey, we consider a latent variable to be an unmeasured common cause of a pair of measured variables, as stated in (Pearl, 2009a; Spirtes et al., 2000; Richardson and Spirtes, 2003, 2002). We also assume that there are no unmeasured selection variables, since selection bias cannot be addressed solely by confounding adjustment (van der Zander et al., 2014; Bareinboim et al., 2014; Pearl, 2009a; Perković et al., 2018). When there are latent variables in a system, the causal sufficiency assumption is violated, since the assumption requires that the common causes of every pair of measured variables in the system are also measured. When causal sufficiency does not hold, causal effect estimation from such a system becomes challenging (Spirtes, 2010; Pearl et al., 2009). Latent variables mainly cause the two difficulties in causal effect estimation discussed below.

The first difficulty is that latent variables introduce bias into the estimation of causal effects. For example, assume that the system modelled by Fig. 2 (b) (i.e. the underlying data generation mechanism is a causal MAG) contains three latent variables $X_4$, $X_6$ and $X_8$. If we do not consider the latent variables and still try to learn a DAG from data, we will obtain a DAG as shown in Fig. 6 (b). When using the learned DAG in Fig. 6 (b), the estimated $\operatorname{ACE}(W,Y)$ will be biased since, based on the learned DAG, $X_7$ is on a back-door path between $W$ and $Y$ and thus is included in the adjustment set. However, the inclusion of $X_7$ in an adjustment set introduces a spurious association between $W$ and $Y$ because $X_7$ is a collider in Fig. 6 (a). This is the well-known "M-bias" (Greene, 2003; Pearl, 2009b).

Figure 6. An exemplary DAG (a) and the DAG (b) learned from data. Note that in the data, $X_4, X_6, X_8$ are unmeasured.

The second difficulty is that latent variables make causal effects impossible to estimate using the confounding adjustment approach when they are (direct or indirect) common causes of $W$ and $Y$. In this case, the unconfoundedness assumption used by confounding adjustment is violated (Imbens and Rubin, 2015; Rubin, 1974, 2007; Guo et al., 2020). Causal effect estimation in this case can be done using the instrumental variable (IV) approach with a standard IV or an ancestral IV (AIV). However, an IV normally needs to be identified by domain knowledge (Martens et al., 2006; Van der Zander et al., 2015; Silva and Shimizu, 2017; Sjolander and Martinussen, 2019; Cheng et al., 2022a), and it is difficult to identify IVs from data. There are only a few data-driven methods available for finding IVs from data (Chu et al., 2001; Kuroki and Cai, 2005; Cheng et al., 2023a), but they all require strong assumptions, such as that two variables, or half of the variables in a system, are standard IVs or AIVs.

From the above discussion, we also know that whether there is a latent confounder between $W$ and $Y$ determines the choice between the two entirely different approaches (confounding adjustment and the IV approach) for causal effect estimation. In the following, we introduce a graphical criterion for determining the visibility of $W \rightarrow Y$ in a MAG or a PAG, i.e. whether there is no latent confounder between $W$ and $Y$.

Definition 11 (Visibility (Zhang, 2008a)).

A directed edge $W \rightarrow Y$ in a MAG (or a PAG) is visible if there is a node $X_i$ not adjacent to $Y$, such that either there is an edge between $X_i$ and $W$ that is into $W$, or there is a collider path between $X_i$ and $W$ that is into $W$ and every node on this path is a parent of $Y$. Otherwise, $W \rightarrow Y$ is said to be invisible.

There are two cases in a MAG (or a PAG), as illustrated in Fig. 7, in which $W \rightarrow Y$ is visible. When $W \rightarrow Y$ is invisible, there is a latent confounder affecting $W$ and $Y$, and the causal effect of $W$ on $Y$ is non-identifiable by adjustment (Pearl et al., 2009; Zhang, 2008a), i.e. confounding adjustment is not able to recover an unbiased causal effect. When there is a valid IV in the underlying causal DAG, the IV approach can be used to obtain an unbiased causal effect. The details of the data-driven methods are introduced in Section 4.

Figure 7. Two possible configurations of the visible edge $W \rightarrow Y$. Note that $X_k$ and $Y$ are non-adjacent.

3.2. Uncertainty in Causal Structure Learning

In real-world applications, the underlying causal graphs are rarely available and have to be recovered from data (Spirtes et al., 2000; Witte and Didelez, 2019). Thus, causal structure learning algorithms play a critical role in causal effect estimation from observational data when the causal DAGs/MAGs are unknown (Maathuis et al., 2009, 2010; Malinsky and Spirtes, 2017). A causal structure learning algorithm, such as the PC algorithm (Spirtes et al., 2000) or the RFCI algorithm (Colombo et al., 2012), learns a CPDAG or a PAG from data, i.e. the output of such an algorithm is a CPDAG or a PAG. More details on causal structure learning algorithms can be found in the surveys (Yu et al., 2021; Vowels et al., 2022; Glymour et al., 2019; Nogueira et al., 2022).

The uncertainty of the causal structures learned from data leads to uncertainty in causal effect estimation from observational data. We demonstrate this with an example in Fig. 3, where the Markov equivalence class of the DAGs is represented by the CPDAG in the second panel. The causal effect of $W$ on $Y$ based on $DAG_3$ is different from the causal effect based on $DAG_1$, since $DAG_3$ has two causal paths from $W$ to $Y$, while there is only one causal path $W \rightarrow Y$ in $DAG_1$. A real-world problem is more complex than this toy example and has more variables. The number of causal graphs in a Markov equivalence class could increase exponentially with the number of variables (Cheng et al., 2020, 2023b). Thus there may be a large number of estimated values for a causal effect from one dataset, leading to uncertainty in causal effect estimation.

Two approaches have been developed so far to deal with the non-uniqueness of causal structure learning (van der Zander et al., 2014; Perković et al., 2018). The first approach provides a bound estimation of a causal effect, i.e. a multiset of causal effects estimated from the DAGs or MAGs enumerated from the CPDAG or PAG learned from data. The second approach produces a unique estimation of a causal effect. It has been shown that when a CPDAG or a PAG is adjustment amenable w.r.t. the pair of variables $(W, Y)$, as defined below, the causal effect of $W$ on $Y$ can be estimated uniquely if proper adjustment sets w.r.t. $(W, Y)$ are identified from the DAGs or MAGs encoded in the CPDAG or PAG.

Definition 12 (Amenability (van der Zander et al., 2014; Perković et al., 2018)).

A CPDAG (or a PAG) $\mathcal{G}$ is adjustment amenable w.r.t. the pair of variables $(W, Y)$ if each possibly directed path from $W$ to $Y$ in $\mathcal{G}$ starts with a visible edge out of $W$.

For example, in Fig. 4, the PAG is adjustment amenable w.r.t. the pair of variables $(W, Y)$, and the sets $\{X_3, X_5\}$ and $\{X_2, X_3, X_5\}$ are proper adjustment sets for all MAGs in the learned PAG. The causal effects estimated from all MAGs in the Markov equivalence class are consistent, and there is one unique estimate.

Some knowledge is required to determine whether a CPDAG or PAG is adjustment amenable or not. If we know that all the other variables in a system are pretreatment variables (De Luna et al., 2011; Entner et al., 2013; Häggström, 2018) w.r.t. the pair of variables $(W, Y)$ and that a variable is correlated with $W$ but not $Y$, then the CPDAG or PAG learned from the data generated by the system is adjustment amenable relative to the pair of variables $(W, Y)$ (Cheng et al., 2023b).

In general, the amenability assumption can be satisfied in real-world applications. For example, it is possible that domain experts know there is a causal effect of $W$ on $Y$ and that the directed edge between $W$ and $Y$ is the only causal path. In this case, if the edge $W \rightarrow Y$ is visible, then the amenability assumption is met.

3.3. Time Complexity

The main computational cost of causal effect estimation lies in recovering causal graphs from data (or in searching for an adjustment set globally using properties of the underlying causal graph). An optimal search for a causal structure from data is fundamentally NP-complete (Chickering, 1996), and so is an exhaustive search for an adjustment set directly from data. If a heuristic strategy is used for global causal structure learning, the optimal result may not be returned.

Alternatively, local causal structure learning (Aliferis et al., 2010; Yu et al., 2021) is faster and can handle hundreds of variables. Local causal structure learning algorithms aim to recover the local causal structure around a given variable instead of a complete causal DAG. In general, the local causal structure of a variable (referred to as $X$) contains the set of parent nodes $Pa(X)$ and the set of child nodes $Ch(X)$ of the variable. Some methods have been developed to search for an adjustment set in the local causal structure around $W$ and $Y$ (Perkovic et al., 2017; Fang and He, 2020; Cheng et al., 2023b, 2020).

4. Data-Driven Causal Effect Estimation Methods

In this section, we review the details of the existing data-driven causal effect estimation methods based on graphical causal modelling. In particular, we discuss the assumptions made and the approaches taken by the methods to address the three challenges presented in Section 3.

4.1. Methods Classification

Table 1 presents our classification of existing data-driven methods. We classify the existing data-driven causal effect estimation methods into three categories based on their approaches to handling uncertainty in structure learning, their time complexities, and whether they consider the presence of a latent variable between $W$ and $Y$.

Table 1. A summary of data-driven causal effect estimation methods. Assumptions made by the methods are in the parentheses following the names of the methods. The implementation information (if available) of the methods is provided in the bottom section of the table. The superscript of a method is the index of the implementation details of the method in the bottom section.
Uncertainty in Time Latent confounders
structure learning complexity Not considered Considered
Not between (W,Y) Between (W,Y)
Bound estimation Global IDA1 LV-IDA2 IV.tetrad9 (two valid CIVs)
Semi-local IDA1 (fixed edges) CE-SAT AIV.GT10 (two valid AIVs)
DIDA & DIDN (some fixed edges)
Bound estimation Local DICE (pretreatment)
Unique estimation Global EHS5 (pretreatment) sisVIVE7 (a half of valid IVs)
GAC1 (amenability) modeIV8 (Modal validity)
DAVS6 (amenability) AIViP (a given AIV)
Unique estimation Local CovSel3 & CovSelHigh4 (pretreatment) CEELS (pretreatment)
Packages Language URL
1. pcalg R https://cran.r-project.org/web/packages/pcalg/index.html
2. LV-IDA R https://github.com/dmalinsk/lv-ida
3. CovSel R https://cran.r-project.org/web/packages/CovSel/index.html
4. CovSelHigh R https://cran.r-project.org/web/packages/CovSelHigh/index.html
5. EHS Matlab https://sites.google.com/site/dorisentner/publications/CovariateSelection
6. DAVS R https://github.com/chengdb2016/DAVS
7. sisVIVE R https://cran.r-project.org/web/packages/sisVIVE/index.html
8. modeIV Python https://github.com/jhartford/ModeIV-public
9. IV.tetrad R http://www.homepages.ucl.ac.uk/~ucgtrbd/code/iv_discovery
10. AIV.GT R https://github.com/chengdb2016/AIV.GT

A data-driven causal effect estimation method may produce multiple estimated values for a causal effect due to the uncertainty in structure learning. Although these estimated values may contain the unbiased estimation of the causal effect, it is not clear which of them is the unbiased one. In this survey, we classify the existing causal effect estimation methods into two categories based on how they address this uncertainty: Bound estimation methods, which produce multiple estimated values of a causal effect, and Unique estimation methods, which produce a single estimated value of a causal effect.

Since no causal graph is known in advance, data-driven methods need to recover the underlying causal graph from observational data using a causal discovery method, which can be a global structure learning method or a local structure learning method (Aliferis et al., 2010; Glymour et al., 2019). Therefore, we classify these methods as either Global or Local based on the causal discovery method used. Note that the former uses a global search, resulting in higher time complexity, while the latter uses a local search, which has lower time complexity.

Regarding the challenge of latent variables, we classify existing methods into two categories: those that do not consider latent confounders and those that do. For the methods that consider latent confounders (i.e. causal sufficiency does not hold), we further categorise them as either Not between (W,Y)(W,Y), which consider latent confounders that affect either WW or YY but not both (i.e. WYW\rightarrow Y is visible in the corresponding MAG or PAG), or Between (W,Y)(W,Y), which consider latent confounders that affect both WW and YY, such as WUYW\leftarrow U\rightarrow Y in the causal DAG (i.e. WYW\rightarrow Y is invisible in the corresponding MAG or PAG).

In the bottom section of Table 1, we provide information on accessing the software packages for the methods that have been made publicly available online. Furthermore, Fig. 8 summarises the key milestones of data-driven causal effect estimation based on graphical causal modelling, together with the graphical causal modelling theories supporting the algorithms. The theories were developed in the 1990s and early 2000s, while the methods largely emerged in the past ten years. The number of methods remains small relative to the importance of causal effect estimation, and there is a need for more methods to advance this important field.

Figure 8. The key milestones of the data-driven algorithms/methods reviewed in this survey for estimating the causal effects from observational data and graphical causal modelling theories or criteria supporting the algorithms. The blue line indicates the timeline of the criteria/methods without considering latent confounders. The black line indicates the timeline of the concepts/criteria/methods considering the latent confounders that do not affect both WW and YY. The red line represents the timeline of concepts/methods considering the latent confounders that affect both WW and YY. Names in boldface font indicate theories or concepts and names in normal font indicate methods.

In the following subsections, we will review the details of the data-driven methods following the divisions in Table 1, while using latent variables as the main storyline to structure the review.

4.2. Methods Considering No Latent Variables

Many data-driven methods based on graphical causal models have been developed to estimate causal effects from data without considering latent variables. These methods can be categorised into two types: IDA and its extensions, and data-driven adjustment set discovery methods. In this section, we will review the details of these two types of methods.

4.2.1. IDA and Its Extensions

IDA (Intervention calculus when the DAG is Absent) was proposed by Maathuis et al. (Maathuis et al., 2009). IDA assumes a Gaussian (causal) Bayesian network model over the measured variables 𝐗\mathbf{X}. There are two versions of the IDA algorithm, the global version (referred to as the basic IDA method) and the local version (referred to as the IDA algorithm). The basic IDA algorithm learns a CPDAG using the PC algorithm (Spirtes et al., 2000), and enumerates all Markov equivalent DAGs encoded in the learned CPDAG. Then, for each enumerated DAG, IDA estimates the causal effect of WW on YY by adjusting for Pa(W)Pa(W) if there is a causal path between WW and YY, since Pa(W)Pa(W) blocks all back-door paths between WW and YY (Pearl, 2009a). IDA returns a multiset of causal effects that correspond to the enumerated DAGs.

The CPDAG could potentially encode an excessive number of DAGs, making the enumeration process time-consuming. Moreover, many DAGs encoded in the CPDAG may share the same adjustment set for estimating the causal effect of WW on YY, so it is not necessary to enumerate all possible DAGs for causal effect estimation. For instance, in Fig. 3, DAG1 and DAG2 have the same adjustment set 𝐙={X1}\mathbf{Z}=\{X_{1}\}. To improve computational efficiency, the IDA algorithm (Maathuis et al., 2009) uses a local criterion for finding all possible adjustment sets locally: the set of adjacent nodes of WW (referred to as Adj(W)Adj(W)) contains all possible parents of WW. For each subset of Adj(W)Adj(W), IDA checks the local validity of the orientation around WW, i.e. that no new collider around WW is created. Hence, IDA can be used for fast estimation of causal effects without enumerating all DAGs. For example, for the learned CPDAG in Fig. 3, Adj(W)={X1}Adj(W)=\{X_{1}\}, and IDA only needs to check the two sets \emptyset and {X1}\{X_{1}\} with respect to the local validity of the orientation. Both sets are locally valid and are considered possible adjustment sets, and the possible causal effects can be obtained by adjusting for \emptyset and {X1}\{X_{1}\}, respectively. The time efficiency of IDA relies largely on the PC algorithm (Spirtes et al., 2000) for discovering a CPDAG from observational data, which is fundamentally not scalable to a large number of variables; IDA can deal with high-dimensional data only when the underlying DAG is very sparse.
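As a concrete illustration, the snippet below sketches the IDA workflow with the pcalg package (the implementation listed in Table 1): a CPDAG is learned with the PC algorithm and ida() then returns the multiset of possible causal effects of WW on YY. The simulated data and parameter values are illustrative only.

```r
# A hedged sketch of the IDA workflow with pcalg: learn a CPDAG with the PC
# algorithm, then compute the multiset of possible effects of W on Y.
library(pcalg)

set.seed(2)
n  <- 5000
X1 <- rnorm(n)
W  <- 0.8 * X1 + rnorm(n)
Y  <- 0.5 * W + 0.6 * X1 + rnorm(n)
dat <- cbind(X1 = X1, W = W, Y = Y)

suffStat <- list(C = cor(dat), n = n)
cpdag <- pc(suffStat, indepTest = gaussCItest, alpha = 0.01,
            labels = colnames(dat))

# ida() returns one estimate per locally valid parent set of W (a multiset).
effects <- ida(x.pos = which(colnames(dat) == "W"),
               y.pos = which(colnames(dat) == "Y"),
               mcov = cov(dat), graphEst = cpdag@graph, method = "local")
print(effects)
```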

Using the domain knowledge about some edge directions can substantially diminish the uncertainty in causal effect estimations from observational data. The work in a biological application using IDA (Zhang et al., 2016) has shown that the uncertainty of the bound estimation of the causal effect can be reduced significantly by using known causal edges as constraints for CPDAG learning.

Perković et al. (Perkovic et al., 2017) proposed a semi-local IDA algorithm to produce a bound estimation of a causal effect when background knowledge (i.e. some directed edge orientation information) is available. The algorithm uses maximally oriented partially directed acyclic graphs (maximal PDAGs) to represent CPDAGs augmented with some known edges with or without direction information. Semi-local IDA conducts a local search to find all possible parent sets of a treatment variable in a maximal PDAG. In the process, the Meek orientation rules (Meek, 1995) are used to check the validity of the parents of the treatment variable. The check is global, and hence the algorithm is called semi-local. The multiset of causal effects is estimated by adjusting for each possible parent set. The advantage of semi-local IDA over IDA is that maximal PDAGs incorporate more edge information (i.e. orientation information) than CPDAGs. Because of the extra check on the validity of possible parents using the Meek orientation rules (Meek, 1995), the time complexity of semi-local IDA is slightly higher than that of IDA.

Fang and He proposed an improved semi-local IDA algorithm, IDA with some Direct causal information (DIDA) (Fang and He, 2020), to improve the performance and time efficiency of semi-local IDA. DIDA also uses a maximal PDAG to find possible parent sets of a treatment variable, but it employs a newly designed criterion to check the validity of parents using the variables adjacent to the treatment. DIDA improves the performance of semi-local IDA and is faster than IDA. Fang and He have also proposed an IDA algorithm using the knowledge of non-ancestral information, referred to as DIDN in Table 1 (Fang and He, 2020), for causal effect estimation. DIDN has a similar time efficiency to DIDA. Both provide bound estimates of causal effects.

IDA, semi-local IDA, DIDA and DIDN all conduct a local search for finding the possible parent sets of the treatment, but we classify them as global algorithms in Table 1 because they learn a complete CPDAG from data. These methods do not consider latent confounders.

4.2.2. Data-Driven Adjustment Set Discovery

In practice, there are two main criteria for finding an adjustment set to remove confounding bias in the estimation of the causal effect of WW on YY: the common causes of WW and YY (Glymour et al., 2008), and either all causes of WW or all causes of YY (VanderWeele and Shpitser, 2011). Both criteria need domain knowledge since the causes of WW or YY are not generally discoverable from data. When there are no latent variables, the parents of a variable in a learned DAG are its causes, but recovering a unique DAG from data is generally impossible due to the uncertainty in causal structure learning (Spirtes et al., 2000; Maathuis et al., 2009; Spirtes, 2010) as introduced in Section 3.

With the pretreatment assumption, i.e. all measured variables excluding the pair of variables (W,Y)(W,Y) are not descendant nodes of WW or YY, the parents of WW and YY can be learned from data since the PC (parent and children) set then contains parents only. Under the pretreatment assumption, De Luna et al. (De Luna et al., 2011) proposed a set of criteria for identifying adjustment sets, including the parent set of WW, the parent set of YY, the parent set of WW excluding those parents with no path to YY, the parent set of YY excluding those parents with no path to WW except via WW, the common parents of WW and YY, and the union of WW’s parent set and YY’s parent set. Häggström et al. (Häggström et al., 2015) implemented these criteria in the R package CovSel by using marginal coordinate hypothesis testing (for continuous covariates and outcome) and kernel smoothing (for continuous/discrete covariates and outcome). However, the implementation is computationally expensive and prohibitively time-consuming for high-dimensional datasets. Häggström (Häggström, 2018) thus presented an improved implementation, CovSelHigh, which uses the structure learning algorithms Max-Min Parents and Children (MMPC) and Max-Min Hill-Climbing (MMHC) (Tsamardinos et al., 2006) and is more scalable to high-dimensional datasets than CovSel.
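For illustration, once an adjustment set has been identified by one of these criteria, the ACE can be estimated by covariate adjustment; the sketch below assumes a linear-Gaussian model and an already-identified adjustment set, with illustrative variable names.

```r
# A minimal sketch of the adjustment step, assuming a linear-Gaussian model and
# an adjustment set Z that has already been identified (e.g. the learned parent
# set of Y excluding W). The data and the set Z here are illustrative only.
set.seed(3)
n <- 5000
Z <- rnorm(n)                       # plays the role of an identified adjustment set
W <- 0.7 * Z + rnorm(n)
Y <- 0.5 * W + 0.9 * Z + rnorm(n)   # true ACE(W, Y) = 0.5

# Under linearity, the coefficient of W after adjusting for Z estimates the ACE.
fit <- lm(Y ~ W + Z)
coef(fit)["W"]
```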

Another line of research in causal inference uses asymptotic variance to identify valid or optimal adjustment sets for estimating average causal effects with causal linear models. Several recent works have explored this approach, including Henckel et al. (Henckel et al., 2022), Rotnitzky and Smucler (Rotnitzky and Smucler, 2020), and Witte et al. (Witte et al., 2020). Henckel et al. (Henckel et al., 2022) proposed a graphical criterion to compare the asymptotic variances of different valid adjustment sets for causal linear models. Based on this criterion, they developed a variance-decreasing pruning procedure for any given valid adjustment set, and a graphical characterisation of the valid adjustment set that provides the optimal asymptotic variance among all valid adjustment sets. These results are applicable to DAGs, CPDAGs and maximal PDAGs. Building on this work, Rotnitzky and Smucler (Rotnitzky and Smucler, 2020) discussed several new graphical criteria for nonparametric causal graphical models when treatment effects are estimated using nonparametrically adjusted estimators of the interventional means, and proposed a new graphical criterion for discovering the optimal adjustment set among the minimal adjustment sets. Witte et al. (Witte et al., 2020) studied the optimal valid adjustment set (O-set), which is characterised graphically and yields the smallest asymptotic variance among valid adjustment sets. These approaches have the potential to improve the accuracy and efficiency of causal effect estimation in various settings.

4.3. Methods Considering Latent Variables Which Are Not Between WW and YY

When there are latent variables in data, an ancestral graph is usually used to represent the causal relationships among measured variables. From the data, only an equivalence class of MAGs, i.e. a PAG, can be recovered. The generalised back-door criterion (or generalised adjustment criterion) can be used in a MAG (PAG) to determine a proper adjustment set when there exists such a proper adjustment set (Maathuis et al., 2015; Perković et al., 2018). In this section, we will review the data-driven methods considering latent variables and finding proper adjustment sets based on the graphical criteria for the causal effect estimation.

4.3.1. LV-IDA

Similar to IDA, LV-IDA (latent variable IDA) by Malinsky and Spirtes (Malinsky and Spirtes, 2017) considers latent variables and is based on the generalised back-door criterion. There are also two versions of LV-IDA algorithms, the global version (referred to as the global LV-IDA algorithm) and the local version (referred to as the LV-IDA algorithm).

The global LV-IDA first recovers a PAG from data with latent variables, using a structure learning algorithm such as FCI (Spirtes et al., 2000), or RFCI (Colombo et al., 2012). Then it enumerates the MAGs encoded in the PAG. In each of the MAGs, LV-IDA searches for a proper adjustment set for each enumerated MAG according to the generalised back-door criterion and provides an estimate of the causal effect using the adjustment set found in the enumerated MAG. In the end, a multiset of causal effects is returned by LV-IDA.

Enumerating all MAGs encoded by a PAG is computationally expensive, and becomes infeasible even for moderately sized PAGs (with up to 20 nodes), even when using the ZML algorithm (Zhang, 2008b) to reduce the search space. To address the efficiency problem, a local search algorithm has been proposed that uses the possible-D-SEP set relative to (W,Y)(W,Y) in 𝒢\mathcal{G} (i.e. the possible dd-separated set, abbreviated as pds(W,Y,𝒢)pds(W,Y,\mathcal{G})) (Spirtes et al., 2000; Colombo et al., 2012; Malinsky and Spirtes, 2017), defined below.

Definition 0 (pds(W,Y,𝒢)pds(W,Y,\mathcal{G})).

A node XX is in the set pds(W,Y,𝒢)pds(W,Y,\mathcal{G}) if and only if there is a path between XX and WW in 𝒢\mathcal{G} such that for every sub-path Xi,Xk,Xj\left\langle X_{i},X_{k},X_{j}\right\rangle on this path either XkX_{k} is a collider, or Xi,Xk,Xj\left\langle X_{i},X_{k},X_{j}\right\rangle is a triangle in 𝒢\mathcal{G} (i.e. each pair of variables in Xi,Xk,Xj\left\langle X_{i},X_{k},X_{j}\right\rangle are adjacent).

It is worth mentioning that pds(W,Y,𝒢)pds(W,Y,\mathcal{G}) restricts attention to a sub-graph of the learned PAG, since this set is sufficient for checking the conditions of the generalised back-door criterion w.r.t. the pair of variables (W,Y)(W,Y).

Instead of enumerating all the MAGs represented by the complete PAG learned from data, the local version of LV-IDA enumerates the MAGs represented by a sub-graph of the complete PAG constructed based on the set pds(W,Y,𝒢)pds(W,Y,\mathcal{G}). For each of the MAGs enumerated from the sub-graph, the generalised back-door criterion is used to determine a proper adjustment set, and, similar to the global version, the final output of the local version of LV-IDA is a multiset of estimated causal effects.

LV-IDA has demonstrated its ability to handle high-dimensional data with thousands of variables when the underlying causal MAG is sparse. The performance of both versions of LV-IDA relies on the accuracy of the learned PAGs and the locations of latent variables. Learning an accurate PAG from data with latent confounders is still a challenging problem in causal discovery (Zhang, 2008b; Spirtes, 2010; Glymour et al., 2019; Cheng et al., 2020; Vowels et al., 2022).

4.3.2. Causal Effect Estimation Based on SAT Solvers (CE-SAT)

Hyttinen et al. (Hyttinen et al., 2015) developed a data-driven method, Causal effect Estimation based on a SATisfiability (SAT) solver (CE-SAT), for causal effect estimation from data with latent variables. CE-SAT requires learning a PAG from data. Instead of enumerating all MAGs encoded in the learned PAG, CE-SAT works on subsets of the Markov equivalence class (i.e. sub-equivalence classes) using constraint solvers (Hyttinen et al., 2014). Some MAGs in the equivalence class satisfy the same set of mm-separation/mm-connection constraints111Note that in the original paper of CE-SAT (Hyttinen et al., 2015), the terms dd-separation and dd-connection were used. However, we think it is more appropriate to use the terms mm-separation and mm-connection here since CE-SAT requires a PAG as input, rather than a CPDAG. Satisfying the same constraints guarantees satisfying the same set of do-calculus rules (Pearl, 2009a) (a rule set that is a more general criterion of causal effect identifiability than the (generalised) back-door criterion). When a set of MAGs satisfies the same set of dodo-calculus rules, they have the same do-free expression (Pearl, 2009a) for obtaining the causal effect and hence yield the same estimated causal effect from a dataset. CE-SAT translates mm-separation constraints into a logical representation and uses a SAT solver to find sub-equivalence classes satisfying the same do-calculus rules, thereby avoiding repeatedly estimating the same causal effect and avoiding enumerating all MAGs in the PAG. CE-SAT also returns a multiset of causal effects from a dataset, but the size of the multiset can be much smaller than that returned by LV-IDA. The process of finding sub-equivalence classes satisfying the same do-calculus rules does not scale well with the number of variables, and it has been shown that CE-SAT only works with datasets of a dozen or so variables.

4.3.3. Efficient Bound Estimation with Local Search

Cheng et al. (Cheng et al., 2020) proposed a Data drIven Causal Effect estimation algorithm (DICE) to achieve efficient causal effect estimation from data with latent variables. Based on the generalised back-door criterion, the authors proved that there exists at least one adjustment set in the local causal structures around WW and YY (i.e. among the adjacent nodes of WW and YY in the PAG) under the assumption of pretreatment variables and the assumption of no latent confounders between WW and YY. With this conclusion, there is no need to learn the complete PAG; it suffices to find the adjacent nodes of WW and YY locally, which significantly reduces the time complexity. DICE employs a local causal structure learning method, such as PC-Select (Kalisch et al., 2012), to recover the local structures around WW and YY, and then enumerates all subsets of the adjacent variables of WW and YY as possible adjustment sets, as illustrated in the sketch below. DICE obtains the set of possible causal effects by adjusting for the possible adjustment sets. DICE is a bound estimator and, due to its localised search, is faster than LV-IDA and CE-SAT for bound estimation of the causal effect of WW on YY.
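The following is a hedged sketch of this bound-estimation idea (an illustration of the general strategy, not the authors' implementation): the candidate variables adjacent to WW and YY are found locally with pcSelect, and one adjusted estimate is produced per subset of the candidates, assuming linear-Gaussian data.

```r
# A hedged sketch of local bound estimation: find candidates adjacent to W and
# Y with pcSelect(), then adjust for every subset of the candidates.
library(pcalg)

set.seed(4)
n  <- 5000
X1 <- rnorm(n); X2 <- rnorm(n)
W  <- 0.8 * X1 + 0.4 * X2 + rnorm(n)
Y  <- 0.5 * W + 0.7 * X2 + rnorm(n)
dat <- data.frame(X1, X2, W, Y)

covars <- c("X1", "X2")
adjW <- covars[pcSelect(dat$W, as.matrix(dat[, covars]), alpha = 0.05)$G]
adjY <- covars[pcSelect(dat$Y, as.matrix(dat[, covars]), alpha = 0.05)$G]
candidates <- union(adjW, adjY)

# Enumerate all subsets of the local candidates (including the empty set).
subsets <- list(character(0))
for (k in seq_along(candidates))
  subsets <- c(subsets, combn(candidates, k, simplify = FALSE))

# One adjusted (regression) estimate per subset, forming the bound (multiset).
bound <- sapply(subsets, function(z) {
  f <- reformulate(c("W", z), response = "Y")
  coef(lm(f, data = dat))["W"]
})
range(bound)
```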

4.3.4. Causal Effect Estimation by the Generalised Adjustment Criterion (GAC)

Perković et al. (Perković et al., 2018) developed a data-driven method, denoted as GAC, using the generalised adjustment criterion to identify an adjustment set from a given PAG.

Definition 0 (Generalised Adjustment Criterion (Perković et al., 2018)).

In a given MAG or PAG 𝒢\mathcal{G}, the set 𝐙\mathbf{Z} satisfies the generalised adjustment criterion w.r.t. (W,Y)(W,Y) if (i). 𝒢\mathcal{G} is adjustment amenable w.r.t. the pair of variables (W,Y)(W,Y), and (ii). 𝐙Forb(W,Y,𝒢)=\mathbf{Z}\cap Forb(W,Y,\mathcal{G})=\emptyset, and (iii). all definite status non-causal paths between WW and YY are blocked (i.e. mm-separated) by 𝐙\mathbf{Z} in 𝒢\mathcal{G}.

In the above definition, Forb(W,Y,𝒢)Forb(W,Y,\mathcal{G}) is the set of all nodes lying on all causal paths between WW and YY. The first condition, i.e. 𝒢\mathcal{G} is adjustment amenable means that the causal effect of WW on YY is identifiable, and the second condition denotes that 𝐙\mathbf{Z} does not contain a node lying on any causal path between WW and YY, i.e. Z𝐙\forall Z\in\mathbf{Z}, ZForb(W,Y,𝒢)Z\notin Forb(W,Y,\mathcal{G}). The third condition is used to read off a set 𝐙\mathbf{Z} from the given MAG or PAG 𝒢\mathcal{G} that blocks all definite status non-causal paths between WW and YY.

GAC needs to learn a PAG from data using a structure learning algorithm, such as FCI (Spirtes et al., 2000) or RFCI (Colombo et al., 2012). From the input PAG, GAC enumerates mm-separating sets of the pair (W,Y)(W,Y) from each MAG encoded in the PAG. Then based on the mm-separating sets, an adjustment set is identified using the generalised adjustment criterion. The performance of the data-driven GAC algorithm relies on the correctness of the learned PAG. GAC is similar to LV-IDA in time efficiency since both require a PAG learned from data.

To improve the efficiency of finding an adjustment set, a fast version of GAC uses the set of all possible ancestors of WW and YY, excluding the forbidden set, as a valid adjustment set, which can be obtained in linear time. However, the set of possible ancestors of WW and YY may be large, and hence the quality (e.g. precision) of the estimated causal effect may be low. The fast version of GAC is faster than LV-IDA.
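As an illustration of the data-driven GAC workflow with the pcalg package (listed for GAC in Table 1), the sketch below learns a PAG with FCI and then uses gac() to check whether a candidate set satisfies the generalised adjustment criterion; the simulated model and variable names are illustrative only.

```r
# A hedged sketch of the data-driven GAC workflow: learn a PAG with FCI, then
# check a candidate adjustment set with gac(). In this toy model, X1 confounds
# W and Y, and S is a cause of W only (it helps FCI orient edges).
library(pcalg)

set.seed(5)
n  <- 10000
S  <- rnorm(n)
X1 <- rnorm(n)
W  <- 0.8 * S + 0.8 * X1 + rnorm(n)
Y  <- 0.5 * W + 0.6 * X1 + rnorm(n)
dat <- cbind(S = S, X1 = X1, W = W, Y = Y)

suffStat <- list(C = cor(dat), n = n)
pag <- fci(suffStat, indepTest = gaussCItest, alpha = 0.01,
           labels = colnames(dat))

# Does Z = {X1} satisfy the generalised adjustment criterion for (W, Y)?
gac(pag@amat, x = which(colnames(dat) == "W"),
    y = which(colnames(dat) == "Y"),
    z = which(colnames(dat) == "X1"), type = "pag")$gac
```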

4.3.5. The EHS Algorithm

Entner, Hoyer, and Spirtes (Entner et al., 2013) proposed two inference rules for inferring whether a treatment variable WW has a causal effect on the outcome YY of interest and, if it has, for determining an adjustment set from data by conducting statistical tests. The two inference rules support data-driven covariate selection for estimating the causal effect of WW on YY directly from data with latent variables. We call the two inference rules the EHS conditions and the method the EHS algorithm, after the authors' initials. EHS is different from the previous methods since it does not need a complete or local causal structure, but finds an adjustment set directly from data by conducting a set of conditional independence tests. The estimated causal effect is unique if an adjustment set can be found from the data.

The EHS conditions need the pretreatment variable assumption. The EHS algorithm searches for a variable SS and a set 𝐙\mathbf{Z} simultaneously such that (1) SY𝐙S\not\!\perp\!\!\!\perp Y\mid\mathbf{Z} and SY𝐙{W}S\perp\!\!\!\perp Y\mid\mathbf{Z}\cup\{W\}, or (2) SW𝐙S\not\!\perp\!\!\!\perp W\mid\mathbf{Z} and SY𝐙S\perp\!\!\!\perp Y\mid\mathbf{Z}. If (1) is satisfied, 𝐙\mathbf{Z} is an adjustment set and is used for causal effect estimation; if (2) is satisfied, the causal effect of WW on YY is zero; and if neither is satisfied, we do not know whether the causal effect can be estimated from the dataset. The EHS conditions are sound and complete for querying the causal effect between WW and YY. However, the EHS algorithm (Entner et al., 2013) is extremely inefficient since it globally searches for both SS and 𝐙\mathbf{Z}, and its time complexity is 𝐎(2|𝐗|)\mathbf{O}(2^{|\mathbf{X}|}), where 𝐗\mathbf{X} includes all measured variables in the system except WW and YY. Hence, the EHS algorithm only works for a small number of variables.
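A minimal sketch of rule (1) of the EHS search is given below, assuming Gaussian data and pretreatment covariates; the brute-force loops mirror the exhaustive nature of the search, and the function and variable names are illustrative rather than the authors' code.

```r
# A hedged sketch of the EHS search (rule (1) only), assuming Gaussian data.
# For each candidate S and subset Z of the remaining covariates, Z is declared
# an adjustment set if S is dependent on Y given Z but independent of Y given
# Z and W.
library(pcalg)

ehs_search <- function(dat, w = "W", y = "Y", alpha = 0.05) {
  covars   <- setdiff(colnames(dat), c(w, y))
  suffStat <- list(C = cor(dat), n = nrow(dat))
  ci.p <- function(a, b, cond)
    gaussCItest(match(a, colnames(dat)), match(b, colnames(dat)),
                match(cond, colnames(dat)), suffStat)
  for (s in covars) {
    rest <- setdiff(covars, s)
    subsets <- list(character(0))
    for (k in seq_along(rest))
      subsets <- c(subsets, combn(rest, k, simplify = FALSE))
    for (z in subsets) {
      dep   <- ci.p(s, y, z) < alpha          # S not independent of Y given Z
      indep <- ci.p(s, y, c(z, w)) >= alpha   # S independent of Y given Z and W
      if (dep && indep)
        return(list(S = s, Z = z))            # rule (1): Z is an adjustment set
    }
  }
  NULL                                        # rule (1) never fired
}
```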

4.3.6. GAC Guided Conditional Independent Tests for Causal Effect Estimation

The reliability of the GAC algorithm depends on the correctness of the PAG learned from data, but the learned PAG may be imprecise because of the high complexity of graph discovery and the uncertainty of edge orientation. On the other hand, EHS makes use of statistical tests, which can yield a more reliable adjustment set than reading an adjustment set off a PAG learned from data. However, EHS requires the pretreatment variable assumption, so it will fail to determine a valid adjustment set when there is a mediator between WW and YY. For example, suppose a MAG contains a causal path WXYW\rightarrow X\rightarrow Y. In this case, XX is a child node of WW and EHS will erroneously include XX in the adjustment set, and the causal effect estimated by EHS will be biased because XX is a mediator.

Cheng et al. (Cheng et al., 2022c) developed the Data-driven Adjustment Search (DAVS) algorithm for unique causal effect estimation by combining the advantages of GAC and EHS. The authors developed a criterion which uses a conditional independence test to confirm an adjustment set after candidate adjustment sets have been found based on the generalised adjustment criterion. DAVS does not require the pretreatment variable assumption. The proposed criterion uses a COSO (Cause Or a Spouse of the treatment Only) variable which is not adjacent to YY. A COSO variable is often known by domain experts since it is common for them to know a direct cause of the treatment variable which is not a direct cause of YY. Even without such prior knowledge, a COSO variable can be identified from data by conducting statistical tests for adjacencies. A COSO variable is not an IV (Baiocchi et al., 2014; Martens et al., 2006; Hernán and Robins, 2006) as introduced in Definition 9. A COSO variable is proposed to enable a data-driven confounding adjustment method for causal effect estimation, whereas an IV is employed to address the latent confounders between WW and YY (introduced in Section 4.4). Moreover, a COSO variable can be found via statistical tests in data, whereas an IV has to be nominated using domain knowledge.

Following the GAC, an adjustment set is within a possible mm-separating set (which contains all variables possibly mm-separating all the paths between WW and YY), but not in the forbidden set (which contains all possible descendant variables of WW lying on the possible direct paths between WW and YY). Note that “possible” here reflects the uncertainty of the edge orientation in a learned PAG, as introduced in Section 3.2. Both the possible mm-separating set and the forbidden set can be read from the learned PAG. To determine a unique causal effect of WW on YY, DAVS decides whether a set 𝐙\mathbf{Z} is an adjustment set by testing whether QY|𝐙{W}Q\perp\!\!\!\perp Y|\mathbf{Z}\cup\{W\}, where QQ is a COSO variable. DAVS adopts the Apriori pruning strategy (Agrawal and Srikant, 1994) to search for the minimal adjustment sets satisfying the test, and then estimates causal effects using the identified minimal adjustment sets. Note that the criterion using a COSO variable finds an adjustment set when there is a causal relationship between WW and YY, but it cannot determine whether there is a causal relationship between WW and YY (i.e. when an adjustment set is not found, it is unclear whether there is no causal relationship between WW and YY or the relationship is unidentifiable from the data). DAVS is similar to LV-IDA in time efficiency but much faster than EHS.
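The core test of DAVS can be illustrated as follows, assuming Gaussian data; the function davs_test and the toy model are illustrative only, and the candidate sets 𝐙\mathbf{Z} would in practice be read off a learned PAG.

```r
# A hedged sketch of the core test in DAVS: given a COSO variable Q and a
# candidate set Z, Z is accepted if Q becomes independent of Y given Z and W.
library(pcalg)

davs_test <- function(dat, q, w, y, z, alpha = 0.05) {
  suffStat <- list(C = cor(dat), n = nrow(dat))
  p <- gaussCItest(match(q, colnames(dat)), match(y, colnames(dat)),
                   match(c(z, w), colnames(dat)), suffStat)
  p >= alpha          # TRUE: Q _||_ Y | Z U {W}, so Z passes the test
}

# Toy model: X1 confounds W and Y, and Q is a cause of W only.
set.seed(6)
n  <- 10000
Q  <- rnorm(n); X1 <- rnorm(n)
W  <- 0.8 * Q + 0.8 * X1 + rnorm(n)
Y  <- 0.5 * W + 0.6 * X1 + rnorm(n)
dat <- cbind(Q = Q, X1 = X1, W = W, Y = Y)

davs_test(dat, q = "Q", w = "W", y = "Y", z = "X1")          # expected TRUE
davs_test(dat, q = "Q", w = "W", y = "Y", z = character(0))  # expected FALSE
```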

4.3.7. Local search for efficient discovery of Adjustment Sets

To improve the efficiency of EHS and DAVS, Cheng et al. (Cheng et al., 2023b) proposed a fully local search algorithm, Causal Effect Estimation by Local Search (CEELS). CEELS makes use of the result from DICE (see Section 4.3.3) that at least one adjustment set exists in the adjacent set of (W,Y)(W,Y), which ensures that the search for an adjustment set can be done locally. Like DAVS, CEELS also makes use of a COSO variable given by domain experts or learned from data.

Compared to EHS, CEELS searches for a proper adjustment set by using the EHS conditions with the following improvements: a COSO variable is used, and the search for an adjustment set 𝐙\mathbf{Z} is restricted to the adjacent variables of WW and YY. The search space of CEELS is within the local structure around WW and YY and is significantly smaller than that of EHS. CEELS improves on DICE by finding a proper adjustment set and thus obtaining a unique (instead of a bound) estimation of the causal effect of WW on YY. In addition to the local search, as with DAVS, CEELS employs a frequent pattern mining approach (i.e. the Apriori pruning) (Agrawal and Srikant, 1994) to reduce the search space for finding the minimal adjustment set. Different from DAVS, CEELS uses a purely local search for unique causal effect estimation and hence has higher efficiency than DAVS. CEELS is significantly faster than EHS and DAVS, and scales well to high-dimensional datasets. CEELS is also faster than LV-IDA. However, CEELS may miss an adjustment set that EHS can find with its global search; the work in (Cheng et al., 2023b) shows that the missing rate is small in simulation studies.

4.4. Methods Considering Latent Variables Affecting Both WW and YY

When there are latent confounders, i.e. latent variables affecting both WW and YY, the causal effect of WW on YY is not identifiable via confounding adjustment. The instrumental variable (IV) approach is a powerful method for inferring the causal effect of WW on YY from data even when there exist latent confounders affecting both WW and YY. If there exists a valid IV, the causal effect of WW on YY can be recovered from data with latent confounders. However, it is challenging to determine an IV directly from data (Pearl, 1995b; Hernán and Robins, 2006), and traditionally a valid IV is nominated based on domain knowledge. Recently, research has emerged working towards data-driven methods related to the IV approach. In this section, we review this emerging research on IV-based approaches.
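Before reviewing these methods, the basic IV idea under a linear model can be illustrated with the simple (Wald) ratio estimator below; the data-generating model is illustrative, with a true effect of 0.5.

```r
# A minimal sketch of the basic IV idea under a linear model: with a valid
# instrument S, the causal effect of W on Y can be estimated by the ratio
# cov(S, Y) / cov(S, W), even though the confounder U is unmeasured.
set.seed(7)
n <- 10000
U <- rnorm(n)                        # latent confounder of W and Y
S <- rnorm(n)                        # a valid instrument: affects Y only through W
W <- 0.8 * S + 0.9 * U + rnorm(n)
Y <- 0.5 * W + 0.9 * U + rnorm(n)    # true effect of W on Y is 0.5

# Naive regression without adjustment is biased upwards by U ...
coef(lm(Y ~ W))["W"]
# ... while the IV (Wald) estimator recovers approximately 0.5.
cov(S, Y) / cov(S, W)
```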

4.4.1. sisVIVE and Its Extension

The three conditions of a standard IV introduced in Section 2.4 are untestable from data, and hence it is very difficult to discover a valid IV from data with latent variables. Instead of discovering a valid IV from data, Kang et al. (Kang et al., 2016) proposed a practical algorithm, called some invalid some Valid IV Estimator (sisVIVE), to estimate the causal effect of WW on YY from data with latent variables as long as more than 50% of the candidate instruments are valid. This assumption is referred to as majority validation of IVs, i.e. more than half of the candidate IVs are valid. sisVIVE is able to obtain an unbiased estimation of the causal effect without knowing which covariates are valid IVs, and can be applied in Mendelian randomisation (Kang et al., 2016; Hartford et al., 2021). However, the estimation may be biased when the majority validation assumption is violated. Hence, it is advisable to check this assumption before using sisVIVE.

An extended version of the sisVIVE algorithm, ModeIV, has been developed by Hartford et al. (Hartford et al., 2021). ModeIV is a robust IV technique that relies on the modal validity assumption, i.e. the largest group of candidate IVs that agree on the estimated effect of WW on YY are valid IVs, which holds, in particular, when more than half of the candidate IVs are valid. ModeIV allows the estimation of non-linear causal effects from data by employing the recently developed deep learning IV estimator in (Hartford et al., 2017). The advantage of ModeIV is its ability to remove most of the bias introduced by invalid IVs. When the modal validity assumption is violated, ModeIV will not work.

4.4.2. IV.tetrad

A CIV has more relaxed conditions than a standard IV (Brito and Pearl, 2002) and allows a variable in the covariate set to be a valid IV conditioning on a set of measured variables. IV.tetrad (Silva and Shimizu, 2017) is the first data-driven CIV-based method for discovering a pair of valid CIVs directly from data. IV.tetrad requires that there exist two valid CIVs, denoted as SiS_{i} and SjS_{j}, in the covariate set and that the conditioning sets of SiS_{i} and SjS_{j} equal the set of remaining covariates 𝐙=𝐗{Si,Sj}\mathbf{Z}=\mathbf{X}\setminus\{S_{i},S_{j}\} (i.e. the original covariate set 𝐗\mathbf{X} excluding the pair of CIVs {Si,Sj}\{S_{i},S_{j}\}). Moreover, the assumption of a linear non-Gaussian causal model is needed for the validity of IV.tetrad. For the two valid CIVs, we have βwy=σsiy𝐳/σsiw𝐳=σsjy𝐳/σsjw𝐳\beta_{wy}=\sigma_{s_{i}*y*\mathbf{z}}/\sigma_{s_{i}*w*\mathbf{z}}=\sigma_{s_{j}*y*\mathbf{z}}/\sigma_{s_{j}*w*\mathbf{z}}, where σsiy𝐳\sigma_{s_{i}*y*\mathbf{z}} is the causal effect of SiS_{i} on YY conditioning on 𝐙\mathbf{Z} and σsiw𝐳\sigma_{s_{i}*w*\mathbf{z}} is the causal effect of SiS_{i} on WW conditioning on 𝐙\mathbf{Z} (σsjy𝐳\sigma_{s_{j}*y*\mathbf{z}} and σsjw𝐳\sigma_{s_{j}*w*\mathbf{z}} are defined analogously). Thus, the tetrad constraint σsiy𝐳σsjw𝐳σsiw𝐳σsjy𝐳=0\sigma_{s_{i}*y*\mathbf{z}}\sigma_{s_{j}*w*\mathbf{z}}-\sigma_{s_{i}*w*\mathbf{z}}\sigma_{s_{j}*y*\mathbf{z}}=0 can be obtained. IV.tetrad uses the tetrad constraint to check the validity of a pair of covariates {Si,Sj}\{S_{i},S_{j}\} in 𝐗\mathbf{X} conditioning on 𝐙=𝐗{Si,Sj}\mathbf{Z}=\mathbf{X}\setminus\{S_{i},S_{j}\}. Note that the tetrad constraint is a necessary but not sufficient condition for a pair of valid CIVs, since a pair of invalid CIVs may also pass the tetrad constraint. Hence, the tetrad constraint cannot exactly identify a pair of valid CIVs from data, and IV.tetrad provides a bound estimation of causal effects.
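A hedged sketch of checking the tetrad constraint for a candidate pair is given below, where the partial covariances conditioning on 𝐙\mathbf{Z} are computed from the sample covariance matrix via the Schur complement; the ad hoc threshold stands in for the proper statistical test used by IV.tetrad, and the function names are illustrative.

```r
# A hedged sketch of the tetrad check for a candidate CIV pair (Si, Sj),
# assuming a linear model. Partial covariances given Z are computed from the
# sample covariance matrix via the Schur complement; the threshold is ad hoc.
partial_cov <- function(Sigma, a, b, z) {
  if (length(z) == 0) return(Sigma[a, b])
  Sigma[a, b] - Sigma[a, z, drop = FALSE] %*%
    solve(Sigma[z, z]) %*% Sigma[z, b, drop = FALSE]
}

tetrad_check <- function(dat, si, sj, w = "W", y = "Y", tol = 0.05) {
  z <- setdiff(colnames(dat), c(si, sj, w, y))
  Sigma <- cov(dat)
  disc <- partial_cov(Sigma, si, y, z) * partial_cov(Sigma, sj, w, z) -
          partial_cov(Sigma, si, w, z) * partial_cov(Sigma, sj, y, z)
  abs(disc) < tol    # TRUE: the pair (si, sj) passes the tetrad constraint
}

# Usage: tetrad_check(dat, si = "S1", sj = "S2") on a data matrix whose columns
# include W, Y and the candidate CIVs (column names are illustrative).
```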

IV.tetrad aims to discover valid CIVs from data directly, and it is the first data-driven algorithm to discover CIVs without domain knowledge. However, IV.tetrad may fail in obtaining an unbiased estimation when the set of remaining covariates 𝐗{Si,Sj}\mathbf{X}\setminus\{S_{i},S_{j}\} contains a collider between SiS_{i} (or SjS_{j}) and YY, i.e. there exists an MM-bias (Greenland, 2003; Pearl, 2009b).

4.4.3. Data-Driven AIV Estimator

As discussed in Section 2.4, a CIV may produce a misleading conclusion (Van der Zander et al., 2015) and an AIV remedies this limitation. The graphical criterion given in Definition 10 for finding a conditioning set relative to a given AIV needs a complete causal DAG representing the causal relationships of both measured and unmeasured variables (Brito and Pearl, 2002; Van der Zander et al., 2015). The requirement of a complete causal DAG makes it impossible to discover a conditioning set directly from data w.r.t. a given AIV. Cheng et al.  (Cheng et al., 2022a) studied a novel graphical property of an AIV in a MAG to discover from data a proper conditioning set that instrumentalises a given AIV. Using the graphical property, they proposed a data-driven method, Ancestral IV estimator in PAG (AIViP for short) to discover a conditioning set 𝐙\mathbf{Z} in a PAG that instrumentalises a given AIV SS in the underlying causal DAG. AIViP employs a causal structure learning method, i.e. RFCI (Colombo et al., 2012) to discover a PAG from data with latent variables, and then constructs a manipulated PAG to discover the conditioning set 𝐙\mathbf{Z}. Finally, AIViP uses a TSLS method to calculate the causal effect of WW on YY by using the given AIV SS and the conditioning set 𝐙\mathbf{Z}. The quality of AIViP for finding the correct conditioning sets relies on the quality of the PAG discovered from data with latent confounders.
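The final estimation step of such AIV-based methods can be illustrated with a manual two-stage least squares (TSLS) sketch in which the conditioning set is included in both stages; the data-generating model below is illustrative and assumes linearity, with a true effect of 0.5.

```r
# A hedged sketch of estimation with an ancestral IV S and a conditioning set Z
# that instrumentalises it: two-stage least squares with Z in both stages.
set.seed(8)
n <- 10000
U <- rnorm(n)                          # latent confounder of W and Y
Z <- rnorm(n)                          # conditioning set (a single variable here)
S <- 0.7 * Z + rnorm(n)                # an AIV given Z
W <- 0.8 * S + 0.6 * Z + 0.9 * U + rnorm(n)
Y <- 0.5 * W + 0.6 * Z + 0.9 * U + rnorm(n)   # true effect of W on Y is 0.5

# Stage 1: regress W on S and Z; Stage 2: regress Y on the fitted W and Z.
W.hat <- fitted(lm(W ~ S + Z))
coef(lm(Y ~ W.hat + Z))["W.hat"]       # approximately 0.5
```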

Table 2. A summary of the properties and characteristics of the data-driven causal effect estimation methods. ‘Causal sufficiency’ refers to Definition 3, ‘Amenability’ refers to Definition 2, ‘Pretreatment’ denotes the pretreatment variable assumption, ‘Bound’ means that the estimator produces a multiset of estimated causal effects (‘×\times’ indicates a unique estimate), and ‘Time-complexity’ is based on the time complexity of the causal structure learning part of a method (‘+’ denotes global causal structure learning (exponential time complexity in the number of variables) and ‘-’ indicates local structure learning (polynomial time complexity in the number of variables)).
Methods Causal sufficiency Amenability Linearity Pretreatment Bound Time-complexity
IDA (Maathuis et al., 2009) \checkmark \checkmark \checkmark ×\times \checkmark ++
Semi-local IDA (Perkovic et al., 2017) \checkmark \checkmark \checkmark ×\times \checkmark ++
DIDA (Fang and He, 2020) \checkmark \checkmark \checkmark ×\times \checkmark ++
DIDN  (Fang and He, 2020) \checkmark \checkmark \checkmark ×\times \checkmark ++
CovSel (Häggström et al., 2015) \checkmark \checkmark \checkmark \checkmark ×\times -
CovSelHigh (Häggström, 2018) \checkmark \checkmark \checkmark \checkmark ×\times -
LV-IDA (Malinsky and Spirtes, 2017) ×\times \checkmark \checkmark ×\times \checkmark ++
CE-SAT (Hyttinen et al., 2015) ×\times \checkmark ×\times ×\times \checkmark ++
DICE (Cheng et al., 2020) ×\times \checkmark ×\times \checkmark \checkmark -
EHS (Entner et al., 2013) ×\times \checkmark ×\times \checkmark \checkmark ++
GAC (Perković et al., 2018) ×\times \checkmark ×\times ×\times ×\times ++
DAVS (Cheng et al., 2022c) ×\times \checkmark ×\times ×\times ×\times ++
CEELS (Cheng et al., 2023b) ×\times \checkmark ×\times \checkmark ×\times -
sisVIVE (Kang et al., 2016) ×\times ×\times \checkmark \checkmark \checkmark -
modeIV (Hartford et al., 2021) ×\times ×\times ×\times \checkmark \checkmark -
AIViPAIViP (Cheng et al., 2022a) ×\times ×\times \checkmark \checkmark ×\times ++
IV.tetrad (Silva and Shimizu, 2017) ×\times ×\times \checkmark \checkmark \checkmark -
AIV.GT (Cheng et al., 2023a) ×\times ×\times \checkmark \checkmark \checkmark -

4.4.4. AIV Generalised Tetrad

Cheng et al. (Cheng et al., 2023a) extended the tetrad constraint (Silva and Shimizu, 2017) to a generalised tetrad condition: σsiy𝐳iσsjw𝐳jσsiw𝐳iσsjy𝐳j=0\sigma_{s_{i}*y*\mathbf{z}_{i}}\sigma_{s_{j}*w*\mathbf{z}_{j}}-\sigma_{s_{i}*w*\mathbf{z}_{i}}\sigma_{s_{j}*y*\mathbf{z}_{j}}=0, where 𝐙i\mathbf{Z}_{i} and 𝐙j\mathbf{Z}_{j}, the conditioning sets of the pair of AIVs SiS_{i} and SjS_{j} with 𝐙i𝐗{Si}\mathbf{Z}_{i}\subseteq\mathbf{X}\setminus\{S_{i}\} and 𝐙j𝐗{Sj}\mathbf{Z}_{j}\subseteq\mathbf{X}\setminus\{S_{j}\}, do not need to be equal. Under oracle conditional independence tests, 𝐙i\mathbf{Z}_{i} and 𝐙j\mathbf{Z}_{j} found in a learned PAG do not contain a collider. Hence, the generalised tetrad condition avoids the MM-bias (Greenland, 2003; Pearl, 2009b) in causal effect estimation. Furthermore, the authors have proved that the generalised tetrad condition can be used to discover a pair of valid AIVs if such a pair exists in the data. Based on the graphical property of an AIV in a MAG, Cheng et al. (Cheng et al., 2022a) found that each directed AIV is in the set Adj(Y){W}Adj(Y)\setminus\{W\} (the set of adjacent nodes of YY excluding WW) in a MAG. Using this finding and the generalised tetrad condition, a data-driven algorithm, Ancestral IV based on the Generalised Tetrad condition (AIV.GT) (Cheng et al., 2023a), was developed for estimating causal effects from data with latent variables. AIV.GT finds a pair of direct AIVs and their corresponding conditioning sets using the generalised tetrad condition, with the search space confined to the set Adj(Y){W}Adj(Y)\setminus\{W\}. Similar to IV.tetrad, AIV.GT provides a bound estimation of causal effects, but AIV.GT is faster since it uses a local search.

4.5. Summary

Table 2 provides a summary of the properties and characteristics of the data-driven causal effect estimation methods listed in Table 1. The methods listed in Table 2 are used to estimate ACEs from observational data and therefore can be widely applied in fields such as economics (Imbens and Rubin, 2015; Heckman, 2008), epidemiology (Sauer et al., 2013; Textor et al., 2016), healthcare (Robins, 1986), and bioinformatics (Maathuis et al., 2010; Zisoulis et al., 2012). Note that some of these methods are parametric and some are nonparametric (Pearl et al., 2009; Correa and Bareinboim, 2017; Jaber et al., 2019b, a). A parametric method assumes a functional form of the underlying data generation process, such as linearity, whereas a nonparametric method does not make such an assumption. As shown in Table 2, most methods are parametric, and only a few methods are nonparametric, including CE-SAT, DICE, EHS, GAC, DAVS, CEELS and modeIV. It is worth mentioning that in some cases the causal effect may not be nonparametrically identifiable, and the best option is to derive a bound estimation by conducting sensitivity analysis (Robins et al., 2000b; Scharfstein et al., 2021; Christopher Frey and Patil, 2002; Tortorelli and Michaleris, 1994). Researchers need to carefully consider these properties and characteristics in light of their specific research question, data, and resources to select an appropriate method for their application.

5. Performance and Complexity Evaluation

In this section, we design simulation experiments to illustrate the performance of the reviewed estimators for estimating ACE(W,Y)ACE(W,Y) with respect to the two challenges posed by latent confounders. The primary motivation behind this experimental design is that in real-world applications, latent confounders are widespread in the data. Hence, we consider two different scenarios: data with latent confounders which are not between (W,Y)(W,Y), and data with latent confounders, one of which is between (W,Y)(W,Y). Furthermore, we conduct simulation studies to evaluate the time complexity of these estimators by varying the numbers of samples and variables.

5.1. Performance Evaluation on Synthetic Datasets

Data Generation

We generate two groups of synthetic datasets to assess the data-driven causal effect estimators reviewed in this survey. Following the data generation procedure described in (Witte and Didelez, 2019), we first utilise the causal DAG shown in Fig. 9 (a) to generate synthetic datasets without latent confounders. In this causal DAG, there are four back-door paths between the treatment WW and the outcome YY, and the structures around WW and YY are sparse. To obtain synthetic datasets with latent confounders that are not between (W,Y)(W,Y), following the corresponding MAG in Fig. 9 (b), we remove the variables {X2,X6,X8}\{X_{2},X_{6},X_{8}\} from the datasets without latent confounders and treat them as latent confounders. These are named Group 1 datasets. Additionally, we remove the variable {X3}\{X_{3}\} from the Group 1 datasets to obtain synthetic datasets with a latent confounder between WW and YY, i.e. X3X_{3} becomes a latent confounder affecting both WW and YY. These are called Group 2 datasets. Both groups of synthetic datasets are used to evaluate the performance of the estimators reviewed in this survey.

Figure 9. (a) A causal DAG for synthetic datasets generation, and (b) its corresponding MAG, where {X2,X6,X8}\{X_{2},X_{6},X_{8}\} are removed.

To approximate real-world scenarios, we generate 90 additional random variables as noise variables in the synthetic datasets. These variables are correlated with each other but not with the variables in the causal DAG shown in Fig. 9 (a). In our simulation studies, we vary the sample sizes, generating datasets with 2k, 4k, 6k, 8k, 10k, and 100k samples. The ground-truth ACE(W,Y)ACE(W,Y) for all synthetic datasets is 0.5. To mitigate potential biases resulting from the data generation procedure, we repeat the process 20 times, generating 20 datasets for each sample size.

Parameters Setting & methods exclusion

In our simulation studies, the significance level α\alpha is set to 0.05 for all estimators involved. For conditional independence tests, we use gaussCItest in the R package pcalg (Kalisch et al., 2012) in the implementation of the estimators involved. CovSelHigh (Häggström, 2018) implements five criteria, using, respectively, all causes of WW, all causes of YY, all causes of WW excluding causes of YY, all causes of YY excluding causes of WW, and the union of all causes of WW and all causes of YY as the adjustment set. Following the notation in (De Luna et al., 2011), the identified adjustment sets are denoted as 𝐗^W\hat{\mathbf{X}}_{\rightarrow W}, 𝐗^Y\hat{\mathbf{X}}_{\rightarrow Y}, 𝐐^W\hat{\mathbf{Q}}_{\rightarrow W}, 𝐙^Y\hat{\mathbf{Z}}_{\rightarrow Y} and 𝐗^W,Y=(𝐗^W𝐗^Y)\hat{\mathbf{X}}_{\rightarrow W,Y}=(\hat{\mathbf{X}}_{\rightarrow W}\cup\hat{\mathbf{X}}_{\rightarrow Y}), respectively. DIDA and DIDN are not included since both estimators require direct causal information (Fang and He, 2020). EHS (Entner et al., 2013), CE-SAT (Hyttinen et al., 2015) and CovSel (Häggström et al., 2015) are not evaluated since their high time complexity makes them infeasible on the synthetic datasets. For the bound estimators, IDA, semi-local IDA, LV-IDA, DICE, IV.tetrad and AIV.GT, we report the mean of the bound estimations as their final results.

Methods classification

Following the classification of the methods in Section 4.1 regarding the challenge of latent confounders, the methods are divided into three types: Type I, requiring causal sufficiency: IDA, semi-local IDA, 𝐗^W\hat{\mathbf{X}}_{\rightarrow W}, 𝐗^Y\hat{\mathbf{X}}_{\rightarrow Y}, 𝐐^W\hat{\mathbf{Q}}_{\rightarrow W}, 𝐙^Y\hat{\mathbf{Z}}_{\rightarrow Y} and 𝐗^W,Y\hat{\mathbf{X}}_{\rightarrow W,Y}; Type II, handling data with latent confounders which are not between (W,Y)(W,Y): GAC, LV-IDA, DICE, DAVS and CEELS; and Type III, IV-based estimators: sisVIVE, IV.tetrad, AIViP and AIV.GT.

Evaluation metrics

We use the estimation error ϵACE=ACE^ACE\epsilon_{ACE}=\hat{ACE}-ACE to evaluate the performance of these estimators, where ACE^\hat{ACE} is the estimated causal effect of WW on YY. For both groups of synthetic datasets, the ground-truth ACE(W,Y)ACE(W,Y) is 0.5. For each sample size, we report the mean and standard deviation (STD) of ϵACE\epsilon_{ACE} over the 20 synthetic datasets as the final result.
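For clarity, the metric can be computed as follows; the estimate values shown are illustrative numbers only, not results from the experiments.

```r
# A minimal sketch of the evaluation metric: the estimation error per dataset
# and its mean and standard deviation over repeated datasets.
true.ace <- 0.5
est.ace  <- c(0.47, 0.55, 0.52, 0.44, 0.58)   # illustrative estimates, one per dataset
err <- est.ace - true.ace
c(Mean = mean(err), STD = sd(err))
```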

Table 3. Estimation errors ϵACE\epsilon_{ACE} (Mean±\pmSTD) over 20 Group 1 synthetic datasets (with latent confounders which are not between WW and YY). The first set of estimators (Type I) requires causal sufficiency assumption, the second set of estimators (Type II) are able to handle data with latent confounders which are not between WW and YY, and the third set of estimators (Type III) are IV-based estimators. Note that semiIDA is short for Semi-local IDA. Methods marked with ‘*’ are bound estimators.
Methods Number of Samples
2k 4k 6k 8k 10k 100k
Type I IDA* 0.273±\pm0.078 0.303±\pm0.077 0.285±\pm0.073 0.290±\pm0.064 0.284±\pm0.058 0.274±\pm0.015
semiIDA* 0.272±\pm0.111 0.252±\pm0.082 0.279±\pm0.039 0.288±\pm0.043 0.292±\pm0.038 0.215±\pm0.887
𝐗^W\hat{\mathbf{X}}_{\rightarrow W} 0.329±\pm0.019 0.242±\pm0.024 0.291±\pm0.027 0.279±\pm0.021 0.287±\pm0.016 0.282±\pm0.013
𝐐^W\hat{\mathbf{Q}}_{\rightarrow W} 0.286±\pm0.024 0.203±\pm0.018 0.258±\pm0.015 0.241±\pm0.019 0.241±\pm0.014 0.247±\pm0.016
𝐗^Y\hat{\mathbf{X}}_{\rightarrow Y} 0.265±\pm0.018 0.183±\pm0.015 0.261±\pm0.018 0.224±\pm0.013 0.226±\pm0.017 0.224±\pm0.012
𝐙^Y\hat{\mathbf{Z}}_{\rightarrow Y} 0.279±\pm0.018 0.187±\pm0.016 0.253±\pm0.019 0.227±\pm0.012 0.224±\pm0.013 0.225±\pm0.011
𝐗^W,Y\hat{\mathbf{X}}_{\rightarrow W,Y} 0.316±\pm0.018 0.241±\pm0.015 0.310±\pm0.021 0.278±\pm0.023 0.277±\pm0.019 0.277±\pm0.013
Type II GAC 0.847±\pm0.550 1.235±\pm0.384 1.250±\pm0.225 1.285±\pm0.272 1.174±\pm0.403 1.018±\pm0.491
LV-IDA* 0.539±\pm0.063 0.442±\pm0.065 0.487±\pm0.119 0.492±\pm0.023 0.437±\pm0.132 0.133±\pm0.101
DICE* 0.481±\pm0.096 0.569±\pm0.059 0.477±\pm0.044 0.511±\pm0.053 0.505±\pm0.045 0.553±\pm0.034
DAVS 0.634±\pm0.158 0.528±\pm0.181 0.236±\pm0.245 0.240±\pm0.209 0.235±\pm0.355 0.189±\pm0.034
CEELS 0.621±\pm0.267 0.615±\pm0.269 0.253±\pm0.204 0.215±\pm0.121 0.236±\pm0.222 0.108±\pm0.024
Type III sisVIVE 0.488±\pm0.174 0.778±\pm0.364 0.619±\pm0.278 0.686±\pm0.553 0.767±\pm0.342 0.642±\pm0.368
IV.tetrad* 0.299±\pm0.197 0.243±\pm0.183 0.324±\pm0.265 0.372±\pm0.244 0.298±\pm0.272 0.311±\pm0.214
AIViPAIViP 0.228±\pm0.222 0.242±\pm0.164 0.135±\pm0.124 0.162±\pm0.099 0.153±\pm0.125 0.109±\pm0.053
AIV.GT* 0.215±\pm0.270 0.342±\pm0.276 0.308±\pm0.250 0.268±\pm0.223 0.240±\pm0.247 0.355±\pm0.234
Results

We report all results on both groups of datasets in Tables 3 and 4, and discuss the performance of the different types of methods in detail below.

With the Group 1 datasets (with latent confounders which are not between WW and YY), the methods in Types II and III are expected to perform well since the datasets satisfy their assumptions. However, as shown in Table 3, the methods in Types I and III perform well overall, while the performance of the methods in Type II varies. For the methods in Type I, the datasets do not satisfy their causal sufficiency requirement (no latent variables); nevertheless, their performance is relatively good, since latent variables that are not between (W,Y)(W,Y) may not bias the causal effects very much, so their results can still be useful. For the methods in Type II, the datasets satisfy their requirement (latent variables not between (W,Y)(W,Y)), but they perform well only when the dataset is large; the least biased estimates are achieved by CEELS and LV-IDA on the 100k datasets, and there is a clear trend that the bias reduces as the dataset size increases. This is because PAGs are difficult to learn correctly from data, and learning a correct PAG needs large datasets, whereas learning CPDAGs is much easier and needs smaller datasets. For the methods in Type III, the datasets satisfy their requirements (except for sisVIVE) since they are supposed to handle all data types as long as IVs (AIVs) are identified. sisVIVE requires more than half of the variables to be valid IVs, while there are only two valid IVs, i.e. {X1,X4}\{X_{1},X_{4}\}, in the data. Both IV.tetrad and AIV.GT produce bound estimations and hence their mean biases are moderate.

Note that identifying an IV is quite challenging in many applications. In this simulation, we target the many applications where only a few IVs exist. However, when genetic variants are used as IVs to estimate the causal effect of a modifiable factor on a disease, i.e. in Mendelian randomisation (MR) analysis (Kang et al., 2016; Brumpton et al., 2020), many genetic variants can serve as IVs, and hence sisVIVE has many applications. See Section 6.4 for details.

With the Group 2 datasets (with a latent confounder between WW and YY), the methods in Type III are expected to work well, while the requirements of the other methods are not satisfied. From Table 4, all methods in Type I have large biases because of the latent confounder between WW and YY. Compared with the results in Table 3, a latent confounder between WW and YY biases the Type I methods much more, which shows that unconfoundedness (no latent confounder between WW and YY) is the essential assumption for most methods in causal effect estimation. None of the methods in Type II work on these data since the edge WYW\to Y is invisible, and they terminate without producing an estimate. For the methods in Type III, AIViP and AIV.GT perform well as expected. AIV.GT is a bound estimator and hence its biases are moderate. sisVIVE does not perform well because its requirement of more than half of the variables being valid IVs is not satisfied. IV.tetrad shows worse results on the Group 2 datasets than on the Group 1 datasets because fewer variables pass the tetrad test, yielding a wider bound estimation.

Table 4. Estimation errors ϵACE\epsilon_{ACE} (Mean±\pmSTD) over 20 Group 2 synthetic datasets (with latent confounders, one of which is between WW and YY). The first set of estimators (Type I) requires causal sufficiency assumption, the second set of estimators (Type II) are able to handle data with latent confounders which are not between WW and YY, and the third set of estimators (Type III) are IV-based estimators. Methods marked with ‘*’ are bound estimators. Note that semiIDA is short for Semi-local IDA and ‘-’ denotes N/A returned by the estimator on these datasets since edge WYW\rightarrow Y is invisible and the methods do not work on the datasets.
Methods Number of Samples
2k 4k 6k 8k 10k 100k
Type I IDA* 0.914±\pm0.070 0.874±\pm0.057 0.891±\pm0.081 0.885±\pm0.055 0.898±\pm0.055 0.896±\pm0.014
semiIDA* 0.914±\pm0.127 0.919±\pm0.077 0.892±\pm0.038 0.878±\pm0.042 0.885±\pm0.034 0.887±\pm0.014
𝐗^W\hat{\mathbf{X}}_{\rightarrow W} 0.858±\pm0.026 0.932±\pm0.021 0.870±\pm0.018 0.899±\pm0.016 0.890±\pm0.012 0.893±\pm0.013
𝐐^W\hat{\mathbf{Q}}_{\rightarrow W} 0.837±\pm0.023 0.925±\pm0.028 0.869±\pm0.021 0.899±\pm0.015 0.890±\pm0.014 0.893±\pm0.013
𝐗^Y\hat{\mathbf{X}}_{\rightarrow Y} 0.832±\pm0.018 0.906±\pm0.024 0.845±\pm0.021 0.895±\pm0.018 0.892±\pm0.015 0.896±\pm0.012
𝐙^Y\hat{\mathbf{Z}}_{\rightarrow Y} 0.823±\pm0.019 0.904±\pm0.021 0.855±\pm0.017 0.895±\pm0.015 0.888±\pm0.011 0.896±\pm0.012
𝐗^W,Y\hat{\mathbf{X}}_{\rightarrow W,Y} 0.867±\pm0.023 0.934±\pm0.021 0.857±\pm0.019 0.896±\pm0.015 0.892±\pm0.013 0.896±\pm0.012
Type II GAC - - - - - -
LV-IDA* - - - - - -
DICE* - - - - - -
DAVS - - - - - -
CEELS - - - - - -
Type III sisVIVE 0.609±\pm0.311 0.732±\pm0.335 0.464±\pm0.187 0.640±\pm0.466 0.633±\pm0.618 0.576±\pm0.452
IV.tetrad* 1.002±\pm0.477 0.974±\pm0.383 0.979±\pm0.335 0.891±\pm0.415 0.882±\pm0.434 0.923±\pm0.323
AIViPAIViP 0.228±\pm0.222 0.244±\pm0.163 0.141±\pm0.127 0.162±\pm0.099 0.153±\pm0.125 0.109±\pm0.053
AIV.GT* 0.270±\pm0.247 0.378±\pm0.555 0.346±\pm0.376 0.270±\pm0.408 0.295±\pm0.445 0.347±\pm0.297

5.2. Complexity Evaluation

We conduct a simulation study to examine the time complexity of the reviewed estimators using synthetic datasets with latent confounders which are not between WW and YY. The R package pcalg (Kalisch et al., 2012) is utilised for generating the maximal ancestral graphs (MAGs) and synthetic datasets, following the procedure outlined in the literature (Cheng et al., 2023b). Specifically, we employ the functions randomDAG and rmvDAG, where the parameter prob of randomDAG is set to 0.1, resulting in a sparse DAG. The sparse DAG is used to generate the synthetic data. Subsequently, we obtain a sparse MAG by removing 10 nodes from the DAG and introduce latent confounders by removing the corresponding 10 variables from the generated synthetic data. Note that we only keep settings in which WYW\rightarrow Y is visible in the MAG so that all methods can be evaluated. The computations of all estimators are conducted on a machine equipped with an Intel (R) Core (TM) i7-9700K CPU and 32 GB of RAM.
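A hedged sketch of this generation procedure is given below; the node counts and sample size are illustrative, and the step of ensuring that WYW\rightarrow Y is visible in the resulting MAG is omitted for brevity.

```r
# A hedged sketch of the synthetic data generation for the complexity study:
# generate a sparse random DAG with pcalg, simulate Gaussian data from it, and
# drop the columns of some nodes so that they act as latent confounders.
library(pcalg)

set.seed(9)
p.obs <- 50                  # number of measured variables to keep
n.lat <- 10                  # number of nodes turned into latent variables
g     <- randomDAG(p.obs + n.lat, prob = 0.1)   # sparse DAG
dat   <- rmvDAG(10000, g)                        # Gaussian data from the DAG

latent  <- sample(ncol(dat), n.lat)              # nodes treated as latent
dat.obs <- dat[, -latent]                        # observed data with latent confounders
dim(dat.obs)
```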

To evaluate the scalability in terms of the number of samples, the number of measured variables is fixed at 50 and the sample size is varied over 0.5k, 5k, 10k, 50k, 100k, and 500k. Similarly, to assess the scalability regarding the number of variables, the sample size is fixed at 10k, while the number of measured variables is varied over 10, 30, 50, 70, 100, 150, and 200. Each experiment is repeated 10 times, and the mean running time of each estimator is recorded and visualised in Fig. 10.

From the left panel of Fig. 10, we have the following observations on time efficiency and scalability regarding the number of samples: (1) Most methods have similar running times. IDA, semi-local IDA, LV-IDA, DAVS, GAC, AIViP and AIV.GT exhibit similar time efficiency since they all employ a global structure learning algorithm to learn a CPDAG or a PAG from the data before estimating causal effects. These methods exhibit good scalability with sample size. (2) The running times of DICE and IV.tetrad are similar to those of the previous methods because both methods need to enumerate all possible adjustment sets or possible IVs from a (possibly large) candidate set, although they do not need a global causal structure (they use a local structure). (3) CEELS is the fastest algorithm among all since it employs a purely local search strategy for finding the adjustment set. (4) CovSelHigh and sisVIVE do not scale well with the number of samples. They are efficient when the number of samples is small (fewer than 5k), but their run-time exceeds 2,000 seconds when the sample size is larger than 50k, since both methods involve a large number of matrix manipulations (whose complexity is polynomial in the number of samples) to obtain the corresponding coefficient weights.

From the right panel of Fig. 10, with respect to the number of variables, we have the same observations on the time efficiency and scalability of all methods as above, except for sisVIVE. sisVIVE employs the equivalent Lagrangian form for causal effect estimation, so its time complexity does not increase significantly with the number of variables. Note that the time complexity of causal structure learning is ultimately exponential in the number of variables. The algorithms can deal with a few hundred variables, not thousands, unless the causal structure is extremely sparse.

Figure 10. Running time of the data-driven estimators. We only show the runtime of CovSelHigh, since the five criteria rely on the same process of CPDAG learning. We use the logarithmic form of the running time for the Y-axis.

6. Discussions

In this section, we present some discussions related to issues with data-driven causal effect estimation methods, and their applications and potential.

6.1. Causal Graph Recovery Approach vs Conditional Independence Test based Approach

All reviewed methods, apart from the IV (and AIV) methods, fall into two broad categories: the causal graph recovery based approach and the conditional independence test based approach. We compare the two types of methods as follows.

Causal Graph Recovery based Approach

This type of data-driven method learns a causal structure (CPDAG/PAG) from observational data using a causal structure learning algorithm (Scutari, 2010; Kalisch et al., 2012; Vowels et al., 2022), then identifies adjustment sets for confounding adjustment in the learned CPDAG/PAG, and finally uses the adjustment sets for causal effect estimation. Because of the uncertainty in causal structure learning, the causal effect is a bound estimation. Since this type of method requires a discovered causal graph, its performance relies on the correctness of the recovered causal graph. Issues with current causal structure learning methods, such as cumulative statistical errors and incorrect edge orientation, affect the performance of causal effect estimation (Cheng et al., 2023b; Gradu et al., 2022). IDA (Maathuis et al., 2009), semi-local IDA (Perkovic et al., 2017), DIDA and DIDN (Fang and He, 2020) can be used for a bound estimation when there are no latent variables. Using pretreatment variables, the certainty of the estimation can be improved significantly from a bound estimation to a unique estimation by CovSel (Häggström et al., 2015) and CovSelHigh (Häggström, 2018). The time efficiency is also improved greatly since CovSel (Häggström et al., 2015) and CovSelHigh (Häggström, 2018) use local structure discovery. When there are latent variables but not between WW and YY, LV-IDA (Malinsky and Spirtes, 2017) and GAC (Perković et al., 2018) can be used.

Conditional Independence Test based Approach

When there are no latent variables between W and Y, it is possible to find a confounding adjustment set using statistical tests and, with some knowledge, to bypass learning a CPDAG or a PAG from data. One way is to utilise a COSO variable (a cause of W or a spouse of W only, denoted by Q). The COSO variable can be given by domain knowledge as a direct cause of W but not a direct cause of Y, or be discovered from data using statistical tests. The triple (W, Y, Q) is then used in statistical tests to find an adjustment set directly from data. A direct search for an adjustment set based on conditional independence tests can be global (all variables apart from W and Y are candidates for an adjustment set), as in DAVS (Cheng et al., 2022c). When pretreatment variables are known, the search can be local (only variables around W and Y are candidates for an adjustment set), as implemented in CEELS (Cheng et al., 2023b). Alternatively, when pretreatment variables are known, a COSO variable is not needed: an adjustment set can be searched for directly using the EHS conditions via a global search, as implemented by the EHS method (Entner et al., 2013). EHS provides a bound estimation. Moreover, the EHS method employs an exhaustive search and only deals with a small number of variables.
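To make the test-based idea concrete, the following Python sketch (an illustrative simplification, not the published DAVS, CEELS or EHS algorithms) searches for an adjustment set Z using a COSO/witness variable Q and a Gaussian conditional independence test: a set Z is accepted when Q is dependent on Y given Z but becomes independent of Y given Z together with W. It assumes linear-Gaussian data, assumes Q is a valid COSO variable, and uses a brute-force search over small candidate sets.

```python
from itertools import combinations

import numpy as np
from scipy import stats


def is_independent(X, i, j, cond, alpha=0.05):
    # Gaussian (partial-correlation) conditional independence test: X_i is
    # judged independent of X_j given X_cond when the Fisher z statistic of
    # their partial correlation is not significant at level alpha.
    idx = [i, j] + list(cond)
    prec = np.linalg.pinv(np.corrcoef(X[:, idx], rowvar=False))
    r = -prec[0, 1] / np.sqrt(prec[0, 0] * prec[1, 1])
    z = 0.5 * np.log((1 + r) / (1 - r))
    stat = abs(z) * np.sqrt(X.shape[0] - len(cond) - 3)
    return 2 * (1 - stats.norm.cdf(stat)) > alpha


def find_adjustment_set(X, w, y, q, candidates, max_size=3):
    # Search for a set Z of candidate covariates such that Q is dependent on
    # Y given Z but independent of Y given Z and W; under the assumptions of
    # these methods, such a Z can serve as an adjustment set for (W, Y).
    for r in range(max_size + 1):
        for Z in map(list, combinations(candidates, r)):
            if not is_independent(X, q, y, Z) and is_independent(X, q, y, Z + [w]):
                return Z
    return None
```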

The conditional independence test based approach is, in general, more accurate than the causal graph recovery approach, since a causal graph learned from data carries high uncertainty due to the high complexity of the structure learning problem itself (Chickering, 1996, 2002). In some cases, the edge directions cannot be identified from data (Spirtes, 2010; Pearl et al., 2009; Malinsky and Spirtes, 2017). For example, in the causal structure X1 → X2 → X3 ← X1, the direction of the edge between X1 and X3 is not identifiable from data alone. Note that a single incorrect edge orientation around W or Y may lead to an error in adjustment set identification. In contrast, if there are no errors in the statistical tests, the adjustment set determined by the conditional independence test based approach is correct whenever one is found.

6.2. Considerations When Using Data-Driven Estimators

One question is whether it is possible to obtain a unique estimation without domain knowledge. The answer depends on the partial causal information at hand. In the following discussion, we assume that the faithfulness assumption is satisfied and that there are no errors in the statistical tests.

If there is a latent confounder between W and Y, there is no algorithm that can identify the causal effect directly from the data. If we know an AIV, or there exist two valid (unspecified) AIVs, AIViP (Cheng et al., 2022a) and AIV.GT (Cheng et al., 2023a) can be used to recover the causal effect of W on Y from data alone. If there is no information about the number of valid IVs, it is better to use multiple IV based methods, such as AIV.GT (Cheng et al., 2023a), IV.tetrad (Silva and Shimizu, 2017), sisVIVE (Kang et al., 2016) and ModeIV (Hartford et al., 2021), to obtain a set of causal effect estimates, and then choose a range of consistent estimates to draw some common-sense conclusions. Note that before using IV based estimators, it is better to analyse the assumptions they require in order to obtain reliable conclusions.
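Once an IV (or an AIV together with its conditioning set) has been identified by one of the above methods, the causal effect in a linear model is commonly estimated with two-stage least squares (2SLS). The Python sketch below shows textbook 2SLS only; it is not the implementation of any of the reviewed methods, and it assumes the instrument S and the conditioning set Z are valid and the model is linear.

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def two_stage_least_squares(S, Z, W, Y):
    # S: (n,) instrument, Z: (n, k) conditioning covariates,
    # W: (n,) treatment, Y: (n,) outcome.
    design1 = np.column_stack([S, Z])
    W_hat = LinearRegression().fit(design1, W).predict(design1)  # first stage
    design2 = np.column_stack([W_hat, Z])
    model = LinearRegression().fit(design2, Y)                   # second stage
    return model.coef_[0]  # estimated causal effect of W on Y
```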

If there is no latent confounder between W and Y, but there are latent variables elsewhere in the system, EHS (Entner et al., 2013) and CEELS (Cheng et al., 2023b) should be the first choice since they do not rely on learning a PAG. Note that EHS is inefficient for problems with a dozen or more variables since it uses an exhaustive search strategy. DICE (Cheng et al., 2020) provides a bound estimate and does not rely on a PAG, so it can be used as the second choice. Other methods, GAC (Perković et al., 2018), DAVS (Cheng et al., 2022c) and LV-IDA (Malinsky and Spirtes, 2017), use a learned PAG, and their results need to be taken with caution since learning a PAG needs a large dataset and the learned PAG may be inaccurate for small or moderate datasets. Any partial knowledge that constrains edge directions and improves PAG accuracy will help reduce bias in causal effect estimation.

If there is no latent variable between W and Y and no latent variables elsewhere in the system, CovSel (Häggström et al., 2015) and CovSelHigh (Häggström, 2018) can estimate the causal effects from data directly, efficiently and accurately. When the dataset is large, i.e. with a large number of samples and/or a large number of variables, IDA (Maathuis et al., 2009) and semi-local IDA (Perkovic et al., 2017) can provide a reliable bound estimation.

A main challenge of causal effect estimation is that, in many real-world applications, it is difficult to verify whether latent variables exist. If no domain knowledge is available, cross-checking the results of several purely data-driven estimators is better than relying on a single method for reaching a consistent conclusion. Furthermore, after applying data-driven causal effect estimation methods, it is advisable to incorporate domain knowledge or the suggestions of a domain expert to obtain a reliable conclusion.

6.3. Double Dipping

A risk for data-driven causal effect estimation is the bias caused by “double dipping”. To estimate the causal effect of W on Y from observational data using graphical causal modelling, causal discovery methods are employed to recover the underlying causal structure, followed by a causal inference method. However, this approach faces a statistical challenge known as “double dipping”, which can invalidate the coverage guarantees of classical confidence intervals (Gradu et al., 2022). The problem arises because the causal graph is learned and the causal effect is estimated on the same data, and this data reuse may lead to a biased estimate of the causal effect. Gradu et al. (Gradu et al., 2022) have developed a randomising causal discovery method to remedy the “double dipping” bias. This is a new direction for bias mitigation in data-driven causal effect estimation, and readers should watch for future developments in this space.
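One simple (though data-hungry) way to avoid reusing the same data for both steps is sample splitting: run the structure learning or adjustment-set search on one half of the data and estimate the causal effect on the other half. The Python sketch below illustrates this idea only; it is not the randomised procedure of Gradu et al. (2022), and the two callables it takes are hypothetical placeholders for any of the reviewed search and estimation methods.

```python
import numpy as np


def split_then_estimate(X, w, y, select_adjustment_set, estimate_effect, seed=0):
    # select_adjustment_set and estimate_effect are user-supplied (hypothetical)
    # callables: the first returns the column indices of an adjustment set found
    # on the discovery half; the second computes the adjusted effect of column w
    # on column y using that set on the held-out half.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(X.shape[0])
    discover, infer = np.array_split(idx, 2)        # two disjoint halves
    Z = select_adjustment_set(X[discover], w, y)    # covariate/structure search
    return estimate_effect(X[infer], w, y, Z)       # estimation on fresh data
```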

6.4. Applications & Potential

Applications

The majority of the methods discussed in this paper have been developed in recent years. Some relatively older methods (from around ten years ago) have been applied to biological, health, and medical research. We discuss some typical applications in the following.

Maathuis et al. (Maathuis et al., 2010) applied IDA to gene expression data obtained without interventions (i.e. observational data) to estimate the regulatory causal effect of a gene on the other 5360 genes (repeated for 234 different genes). The top k genes with the highest causal effects were evaluated against the top 5% and 10% of genes with the most changes in their expression levels in the knock-out experiments, where each time one of the 5360 genes was knocked out and the changes (before and after the knock-out) in the expression levels of the other genes were measured. The ROC curves showed that, for all k, the causal effect ranking picks up more of the top 5% and 10% most changed genes in the experiments than the regression coefficient (association) rankings by Lasso and elastic net (Tibshirani, 1996).

Le et al. (Le et al., 2013) used IDA on observational gene expression data to find the target genes (i.e. the genes regulated by them) of the two microRNAs, miR200a and miR200b, and evaluated the results using experiments which knocked down miR200a and miR200b respectively. The top 20, 50 and 100 genes with the highest causal effects contained a significant proportion of genes known to be regulated by miR200a and miR200b in the evaluation experiments. Moreover, using the signs of the causal effects to indicate an increase or decrease of gene expression after knocking down a microRNA, the signs of the causal effects estimated by IDA were 95% and 94% consistent for the top 100 genes ranked by their absolute causal effects with miR200a and miR200b respectively.

Ren et al. (Ren et al., 2023) examined and visualised the intricate relationships among image features, patient demographics, and clinicopathological variables. Employing a score-based structure learning method termed “Grouped Greedy Equivalence Search” (GGES), they took existing knowledge into account to map these associations. After refining and selecting the causal variables, they computed the IDA scores (Maathuis et al., 2009), thereby quantifying the impact of each variable on patients’ postoperative survival. Leveraging these IDA scores, a predictive formula was derived using GGES, facilitating the projection of postoperative survival outcomes for individuals with esophageal cancer.

Sun et al. (Sun et al., 2021) aimed to identify potential prognostic genes within the prostate adenocarcinoma microenvironment and simultaneously estimate their causal effects. To determine the minimal sets of confounding covariates between the candidate causative genes and the biochemical recurrence (BCR) status, the R package CovSel (Häggström et al., 2015) was employed to screen all candidate causative genes and clinical covariates. In their experiments, validation was performed on five genes using another prostate adenocarcinoma cohort (GEO: GSE70770). These findings have the potential to contribute to an improved prognosis for prostate adenocarcinoma.

Sieswerda et al. (Sieswerda et al., 2023) have developed a (causal) Bayesian Network (BN) through a structure learning method that expands upon the idea of identifying an adjustment set by learning the causal Bayesian Network structure (De Luna et al., 2011; Häggström, 2018). Their emphasis is on elucidating causal relationships while disregarding non-causal associations in causal structure learning. This study showcases how structure learning, guided by clinical insights, can be harnessed to derive a causal model. The predictions of the developed BN indicate a treatment effect of 1 percentage point at 10 years in localized prostate cancer.

Mendelian randomisation (MR) analysis, employing genetic variants as IVs for causal effect estimation in the presence of latent variables, has gained significant attention (Kang et al., 2016; Brumpton et al., 2020). In MR studies, the reliability of conclusions heavily depends on the validity of the genetic variants as IVs (Burgess et al., 2020). sisVIVE does not require that all IVs are valid, and has been used in many MR studies (Allard et al., 2015; Brumpton et al., 2020).

Potential

Causal effect estimation based on graphical causal modelling provides valuable insights into the causal mechanisms underlying complex systems and is rapidly evolving into a mature scientific tool (Pearl and Mackenzie, 2018; van der Zander et al., 2019). In practice, causal effect estimation finds numerous applications in real-world scenarios, especially when dealing with latent variables. For example, Sobel and Lindquist define and estimate both “point” and “cumulated” effects for brain regions in functional magnetic resonance imaging (fMRI) (Sobel and Lindquist, 2020). Runge et al. (Runge et al., 2019) present an overview of causal inference frameworks for Earth system sciences and identify promising generic application cases. Griffith et al. (Griffith et al., 2020) explore the collider bias problem in the context of COVID-19 disease, and they effectively mitigate the bias using appropriate sampling strategies.

6.5. Challenges in Evaluation

Due to the relative newness of the methods, there are currently limited real-world applications of data-driven methods. Some typical examples are discussed in the previous subsection. Most other algorithms have been evaluated on real-world datasets such as Job training (LaLonde, 1986), Sachs (Sachs et al., 2005), 401(k) (Verbeek, 2008) and Schooling returns (Card, 1993). However, these real-world datasets normally have a small number of variables and/or data samples, so the true potential of data-driven methods has not been fully evaluated. Another challenge for real-world evaluation is the lack of ground truth. In real-world datasets, there is no ground truth of the causal effects or of the underlying causal DAG (or MAG), so there is no baseline against which to compare the performance of data-driven causal effect estimation methods.

In most of the literature, synthetic datasets with ground-truth causal effects and causal structures are generated to evaluate the performance of data-driven causal effect estimation methods. Two semi-synthetic datasets, IHDP (Hill, 2011) and Twins (Louizos et al., 2017), are also commonly used. For real-world datasets without ground truths, the empirical estimates reported in the literature are usually used as substitutes for the ground truths since these empirical estimates are likely to be consistent with domain knowledge.
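As an illustration of this evaluation protocol, the following Python sketch generates a toy linear-Gaussian dataset from a known DAG with a known ground-truth total effect (the variable names and coefficients are arbitrary assumptions) and measures the error of a simple adjusted estimate against that ground truth. Real evaluations in the literature use much larger randomly generated graphs and a range of sample sizes.

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def simulate(n=5000, seed=0):
    # A toy linear-Gaussian DAG Z -> W, Z -> Y, W -> Y; the coefficient 1.5
    # on W is the known ground-truth total effect used for evaluation.
    rng = np.random.default_rng(seed)
    Z = rng.normal(size=n)                          # confounder
    W = 0.8 * Z + rng.normal(size=n)                # treatment
    Y = 1.5 * W + 1.0 * Z + rng.normal(size=n)      # outcome
    return W, Y, Z, 1.5


W, Y, Z, true_effect = simulate()
est = LinearRegression().fit(np.column_stack([W, Z]), Y).coef_[0]  # adjust for Z
print(abs(est - true_effect))  # estimation error against the known ground truth
```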

It should be noted that cross-validation cannot be used to validate causal effect estimation, and that prediction accuracy or AUC (Area Under the Curve) cannot be used to justify the validity of the estimated causal effects (Li et al., 2020).

7. Conclusion

The estimation of causal effects from observational data is a fundamental task in causal inference. It is critical to remove confounding bias to obtain unbiased causal effect estimation, even in the presence of latent variables. Graphical causal models have become widely used in many areas to represent the underlying mechanisms that generate the data. By taking advantage of graphical causal modelling, many data-driven methods have been developed for causal effect estimation. In this survey, we have focused on and reviewed the theories and assumptions for data-driven causal effect estimation under the graphical causal modelling framework. We have identified three challenges in causal effect estimation from observational data: uncertainty in causal structure learning, time complexity, and latent variables. We have categorised the methods by the ways they handle these challenges and provided a comprehensive review of the methods based on graphical causal modelling. We have also discussed the practical issues of using the methods, offered guidance, assessed the performance of these methods, and examined their time complexity. We hope that this review will enable researchers and practitioners to understand the strengths and limitations of the methods from the theoretical and assumption viewpoints, and will support more research in this challenging area.

8. Acknowledgements

We thank the action editor and the reviewers for their valuable comments. We wish to acknowledge the support from the Australian Research Council (under grant DP200101210).

References

  • Abadie (2003) Alberto Abadie. 2003. Semiparametric instrumental variable estimation of treatment response models. Journal of Econometrics 113, 2 (2003), 231–263.
  • Abadie and Imbens (2016) Alberto Abadie and Guido W Imbens. 2016. Matching on the estimated propensity score. Econometrica 84, 2 (2016), 781–807.
  • Agrawal and Srikant (1994) Rakesh Agrawal and Ramakrishnan Srikant. 1994. Fast Algorithms for Mining Association Rules in Large Databases. In The 20th International Conference on Very Large Data Bases. 487–499.
  • Ali et al. (2009) R Ayesha Ali, Thomas S Richardson, Peter Spirtes, et al. 2009. Markov equivalence for ancestral graphs. The Annals of Statistics 37, 5B (2009), 2808–2837.
  • Aliferis et al. (2010) Constantin F Aliferis, Alexander Statnikov, Ioannis Tsamardinos, et al. 2010. Local causal and Markov blanket induction for causal discovery and feature selection for classification part i: Algorithms and empirical evaluation. Journal of Machine Learning Research 11, Jan (2010), 171–234.
  • Allard et al. (2015) C Allard, V Desgagné, et al. 2015. Mendelian randomization supports causality between maternal hyperglycemia and epigenetic regulation of leptin gene in newborns. Epigenetics 10, 4 (2015), 342–351.
  • Angrist and Imbens (1995) Joshua D Angrist and Guido W Imbens. 1995. Two-stage least squares estimation of average causal effects in models with variable treatment intensity. J. Amer. Statist. Assoc. 90, 430 (1995), 431–442.
  • Angrist et al. (1996) Joshua D Angrist, Guido W Imbens, and Donald B Rubin. 1996. Identification of causal effects using instrumental variables. J. Amer. Statist. Assoc. 91, 434 (1996), 444–455.
  • Arellano and Bover (1995) Manuel Arellano and Olympia Bover. 1995. Another look at the instrumental variable estimation of error-components models. Journal of Econometrics 68, 1 (1995), 29–51.
  • Athey and Imbens (2016) Susan Athey and Guido Imbens. 2016. Recursive partitioning for heterogeneous causal effects. Proceedings of the National Academy of Sciences 113, 27 (2016), 7353–7360.
  • Athey et al. (2018) Susan Athey, Guido W Imbens, and Stefan Wager. 2018. Approximate residual balancing: debiased inference of average treatment effects in high dimensions. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 80, 4 (2018), 597–623.
  • Athey et al. (2019) Susan Athey, Julie Tibshirani, and Stefan Wager. 2019. Generalized random forests. The Annals of Statistics 47, 2 (2019), 1148–1178.
  • Baiocchi et al. (2014) Michael Baiocchi, Jing Cheng, and Dylan S Small. 2014. Instrumental variable methods for causal inference. Statistics in Medicine 33, 13 (2014), 2297–2340.
  • Bang and Robins (2005) Heejung Bang and James M Robins. 2005. Doubly robust estimation in missing data and causal inference models. Biometrics 61, 4 (2005), 962–973.
  • Bareinboim et al. (2014) Elias Bareinboim, Jin Tian, and Judea Pearl. 2014. Recovering from selection bias in causal and statistical inference. In The Twenty-Eighth AAAI Conference on Artificial Intelligence. 2410–2416.
  • Benkeser et al. (2017) David Benkeser, Marco Carone, MJ Van Der Laan, and PB Gilbert. 2017. Doubly robust nonparametric inference on the average treatment effect. Biometrika 104, 4 (2017), 863–880.
  • Bennett et al. (2019) Andrew Bennett, Nathan Kallus, and Tobias Schnabel. 2019. Deep generalized method of moments for instrumental variable analysis. In Advances in neural information processing systems. 3564–3574.
  • Bhattacharya and Nabi (2022) Rohit Bhattacharya and Razieh Nabi. 2022. On testability of the front-door model via verma constraints. In Uncertainty in Artificial Intelligence. PMLR, 202–212.
  • Bhattacharya et al. (2020) Rohit Bhattacharya, Razieh Nabi, and Ilya Shpitser. 2020. Semiparametric Inference For Causal Effects In Graphical Models With Hidden Variables. Stat 1050 (2020), 27.
  • Bowden and Turkington (1990) Roger J Bowden and Darrell A Turkington. 1990. Instrumental Variables. Vol. 8. Cambridge university press.
  • Brito and Pearl (2002) Carlos Brito and Judea Pearl. 2002. Generalized instrumental variables. In The Conference on Uncertainty in Artificial Intelligence. 85–93.
  • Brumpton et al. (2020) Ben Brumpton, Eleanor Sanderson, et al. 2020. Avoiding dynastic, assortative mating, and population stratification biases in Mendelian randomization through within-family analyses. Nature communications 11, 1 (2020), 3519.
  • Burgess et al. (2020) Stephen Burgess, Christopher N Foley, Elias Allara, James R Staley, and Joanna MM Howson. 2020. A robust and efficient method for Mendelian randomization with hundreds of genetic variants. Nature communications 11, 1 (2020), 376.
  • Card (1993) David Card. 1993. Using Geographic Variation in College Proximity to Estimate the Return to Schooling. In Econometrica, Vol. 69. Citeseer, 1127–1160.
  • Cheng et al. (2020) Debo Cheng, Jiuyong Li, Lin Liu, et al. 2020. Causal query in observational data with hidden variables. In 24th European Conference on Artificial Intelligence. 2551–2558.
  • Cheng et al. (2022a) Debo Cheng, Jiuyong Li, Lin Liu, et al. 2022a. Ancestral Instrument Method for Causal Inference without Complete Knowledge. In International Joint Conference on Artificial Intelligence. 4843–4849.
  • Cheng et al. (2022b) Debo Cheng, Jiuyong Li, Lin Liu, et al. 2022b. Sufficient dimension reduction for average causal effect estimation. Data Mining and Knowledge Discovery 36, 3 (2022), 1174–1196.
  • Cheng et al. (2022c) Debo Cheng, Jiuyong Li, Lin Liu, et al. 2022c. Toward Unique and Unbiased Causal Effect Estimation From Data With Hidden Variables. IEEE Transactions on Neural Networks and Learning Systems 34, 11 (2022), 1–13.
  • Cheng et al. (2023a) Debo Cheng, Jiuyong Li, Lin Liu, et al. 2023a. Discovering Ancestral Instrumental Variables for Causal Inference from Observational Data. IEEE Transactions on Neural Networks and Learning Systems (2023), 1–11.
  • Cheng et al. (2023b) Debo Cheng, Jiuyong Li, Lin Liu, et al. 2023b. Local Search for Efficient Causal Effect Estimation. IEEE Transactions on Knowledge and Data Engineering 35, 9 (2023), 8823–8837. https://doi.org/10.1109/TKDE.2022.3218131
  • Chickering (1996) David Maxwell Chickering. 1996. Learning Bayesian networks is NP-complete. In Learning from Data. Springer, 121–130.
  • Chickering (2002) David Maxwell Chickering. 2002. Learning equivalence classes of Bayesian-network structures. Journal of Machine Learning Research 2, Feb (2002), 445–498.
  • Christopher Frey and Patil (2002) H Christopher Frey and Sumeet R Patil. 2002. Identification and review of sensitivity analysis methods. Risk analysis 22, 3 (2002), 553–578.
  • Chu et al. (2001) Tianjiao Chu, Richard Scheines, and Peter Spirtes. 2001. Semi-instrumental variables: a test for instrument admissibility. In The Conference on Uncertainty in Artificial Intelligence. 83–90.
  • Cole and Hernán (2008) Stephen R Cole and Miguel A Hernán. 2008. Constructing inverse probability weights for marginal structural models. American journal of epidemiology 168, 6 (2008), 656–664.
  • Colombo et al. (2012) Diego Colombo, Marloes H Maathuis, Markus Kalisch, and Thomas S Richardson. 2012. Learning high-dimensional directed acyclic graphs with latent and selection variables. The Annals of Statistics 40, 1 (2012), 294–321.
  • Cornfield et al. (1959) Jerome Cornfield, William Haenszel, et al. 1959. Smoking and lung cancer: recent evidence and a discussion of some questions. Journal of the National Cancer institute 22, 1 (1959), 173–203.
  • Correa and Bareinboim (2017) Juan D Correa and Elias Bareinboim. 2017. Causal effect identification by adjustment under confounding and selection biases. In The Thirty-First AAAI Conference on Artificial Intelligence. 3740–3746.
  • De Luna et al. (2011) Xavier De Luna, Ingeborg Waernbaum, and Thomas S Richardson. 2011. Covariate selection for the nonparametric estimation of an average treatment effect. Biometrika 98, 4 (2011), 861–875.
  • Deaton and Cartwright (2018) Angus Deaton and Nancy Cartwright. 2018. Understanding and misunderstanding randomized controlled trials. Social Science & Medicine 210 (2018), 2–21.
  • Ding and VanderWeele (2016) Peng Ding and Tyler J VanderWeele. 2016. Sensitivity analysis without assumptions. Epidemiology (Cambridge, Mass.) 27, 3 (2016), 368.
  • Duarte et al. (2021) Guilherme Duarte, Noam Finkelstein, Dean Knox, Jonathan Mummolo, and Ilya Shpitser. 2021. An automated approach to causal inference in discrete settings. arXiv preprint arXiv:2109.13471 (2021).
  • Entner et al. (2013) Doris Entner, Patrik Hoyer, and Peter Spirtes. 2013. Data-driven covariate selection for nonparametric estimation of causal effects. In Artificial Intelligence and Statistics. 256–264.
  • Evans and Richardson (2014) Robin J Evans and Thomas S Richardson. 2014. Markovian acyclic directed mixed graphs for discrete data. The Annals of Statistics 42, 4 (2014), 1452–1482.
  • Fang and He (2020) Zhuangyan Fang and Yangbo He. 2020. IDA with Background Knowledge. In Conference on Uncertainty in Artificial Intelligence. PMLR, 270–279.
  • Funk et al. (2011) Michele Jonsson Funk, Daniel Westreich, Chris Wiesen, et al. 2011. Doubly robust estimation of causal effects. American journal of epidemiology 173, 7 (2011), 761–767.
  • Glymour et al. (2019) C Glymour, K Zhang, and P Spirtes. 2019. Review of Causal Discovery Methods Based on Graphical Models. Frontiers in Genetics 10 (2019), 524–524.
  • Glymour and Cooper (1999) Clark N Glymour and Gregory Floyd Cooper. 1999. Computation, Causation, and Discovery. AAAI Press.
  • Glymour et al. (2008) M Maria Glymour, Jennifer Weuve, and Jarvis T Chen. 2008. Methodological challenges in causal research on racial and ethnic patterns of cognitive trajectories: measurement, selection, and bias. Neuropsychology review 18, 3 (2008), 194–213.
  • Gradu et al. (2022) Paula Gradu, Tijana Zrnic, Yixin Wang, and Michael I Jordan. 2022. Valid Inference after Causal Discovery. arXiv preprint arXiv:2208.05949 (2022).
  • Greene (2003) William H Greene. 2003. Econometric Analysis. Pearson Education India.
  • Greenland (2003) Sander Greenland. 2003. Quantifying biases in causal models: classical confounding vs collider-stratification bias. Epidemiology 14, 3 (2003), 300–306.
  • Greenland et al. (1999) Sander Greenland, Judea Pearl, and James M Robins. 1999. Causal diagrams for epidemiologic research. Epidemiology 10, 1 (1999), 37–48.
  • Griffith et al. (2020) Gareth J Griffith, Tim T Morris, Matthew J Tudball, Annie Herbert, Giulia Mancano, Lindsey Pike, Gemma C Sharp, Jonathan Sterne, Tom M Palmer, George Davey Smith, et al. 2020. Collider bias undermines our understanding of COVID-19 disease risk and severity. Nature communications 11, 1 (2020), 5749.
  • Guo et al. (2020) Ruocheng Guo, Lu Cheng, Jundong Li, et al. 2020. A survey of learning causality with data: Problems and methods. ACM Computing Surveys (CSUR) 53, 4 (2020), 1–37.
  • Häggström (2018) Jenny Häggström. 2018. Data-driven confounder selection via Markov and Bayesian networks. Biometrics 74, 2 (2018), 389–398.
  • Häggström et al. (2015) Jenny Häggström, Emma Persson, Ingeborg Waernbaum, and Xavier de Luna. 2015. CovSel: An R package for covariate selection when estimating average causal effects. Journal of Statistical Software 68, 1 (2015), 1–20.
  • Hartford et al. (2017) Jason Hartford, Greg Lewis, Kevin Leyton-Brown, and Matt Taddy. 2017. Deep IV: A flexible approach for counterfactual prediction. In International Conference on Machine Learning. PMLR, 1414–1423.
  • Hartford et al. (2021) Jason S Hartford, Victor Veitch, Dhanya Sridhar, and Kevin Leyton-Brown. 2021. Valid causal inference with (some) invalid instruments. In International Conference on Machine Learning. PMLR, 4096–4106.
  • Heckman (2008) James J Heckman. 2008. Econometric causality. International statistical review 76, 1 (2008), 1–27.
  • Henckel et al. (2022) Leonard Henckel, Emilija Perković, and Marloes H Maathuis. 2022. Graphical criteria for efficient total effect estimation via adjustment in causal linear models. Journal of the Royal Statistical Society Series B 84, 2 (2022), 579–599.
  • Hernán and Robins (2006) Miguel A Hernán and James M Robins. 2006. Instruments for causal inference: an epidemiologist’s dream? Epidemiology 17, 4 (2006), 360–372.
  • Hernán and Robins (2020) Miguel A Hernán and James M Robins. 2020. Causal Inference: What If. CRC, Boca Raton, FL.
  • Hill (2011) Jennifer L Hill. 2011. Bayesian nonparametric modeling for causal inference. Journal of Computational and Graphical Statistics 20, 1 (2011), 217–240.
  • Hirano et al. (2003) Keisuke Hirano, Guido W Imbens, and Geert Ridder. 2003. Efficient estimation of average treatment effects using the estimated propensity score. Econometrica 71, 4 (2003), 1161–1189.
  • Hyttinen et al. (2014) Antti Hyttinen, Frederick Eberhardt, and Matti Järvisalo. 2014. Constraint-based Causal Discovery: Conflict Resolution with Answer Set Programming.. In The Conference on Uncertainty in Artificial Intelligence. 340–349.
  • Hyttinen et al. (2015) Antti Hyttinen, Frederick Eberhardt, and Matti Järvisalo. 2015. Do-calculus when the True Graph Is Unknown.. In The Conference on Uncertainty in Artificial Intelligence. Citeseer, 395–404.
  • Imai and Ratkovic (2014) Kosuke Imai and Marc Ratkovic. 2014. Covariate balancing propensity score. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 76, 1 (2014), 243–263.
  • Imbens (2014) Guido W Imbens. 2014. Instrumental Variables: An Econometrician’s Perspective. Statist. Sci. 29, 3 (2014), 323–358.
  • Imbens (2020) Guido W Imbens. 2020. Potential outcome and directed acyclic graph approaches to causality: Relevance for empirical practice in economics. Journal of Economic Literature 58, 4 (2020), 1129–79.
  • Imbens and Rubin (2015) Guido W Imbens and Donald B Rubin. 2015. Causal Inference in Statistics, Social, and Biomedical Sciences. Cambridge University Press.
  • Jaber et al. (2019a) Amin Jaber, Jiji Zhang, and Elias Bareinboim. 2019a. Identification of conditional causal effects under markov equivalence. Advances in Neural Information Processing Systems 32 (2019).
  • Jaber et al. (2019b) Amin Jaber, Jiji Zhang, and Elias Bareinboim. 2019b. On causal identification under Markov equivalence. In 28th International Joint Conference on Artificial Intelligence, IJCAI 2019. International Joint Conferences on Artificial Intelligence, 6181–6185.
  • Kalisch et al. (2012) Markus Kalisch, Martin Mächler, Diego Colombo, et al. 2012. Causal inference using graphical models with the R package pcalg. Journal of Statistical Software 47, 11 (2012), 1–26.
  • Kang et al. (2016) Hyunseung Kang, Anru Zhang, T Tony Cai, and Dylan S Small. 2016. Instrumental variables estimation with some invalid instruments and its application to Mendelian randomization. J. Amer. Statist. Assoc. 111, 513 (2016), 132–144.
  • Koller and Friedman (2009) Daphne Koller and Nir Friedman. 2009. Probabilistic Graphical Models: Principles and Techniques. MIT Press.
  • Kuroki and Cai (2005) Manabu Kuroki and Zhihong Cai. 2005. Instrumental variable tests for Directed Acyclic Graph Models.. In International Conference on Artificial Intelligence and Statistics. 190–197.
  • LaLonde (1986) Robert J LaLonde. 1986. Evaluating the econometric evaluations of training programs with experimental data. The American Economic Review 76, 4 (1986), 604–620.
  • Le et al. (2013) Thuc Duy Le, Lin Liu, Anna Tsykin, et al. 2013. Inferring microRNA–mRNA causal regulatory relationships from expression data. Bioinformatics 29, 6 (2013), 765–771.
  • Li et al. (2020) Jiuyong Li, Lin Liu, et al. 2020. Accurate data-driven prediction does not mean high reproducibility. Nature machine intelligence 2, 1 (2020), 13–15.
  • Louizos et al. (2017) Christos Louizos, Uri Shalit, Joris Mooij, et al. 2017. Causal effect inference with deep latent-variable models. In The 31st International Conference on Neural Information Processing Systems. 6449–6459.
  • Ma et al. (2021) Jing Ma, Ruocheng Guo, Aidong Zhang, and Jundong Li. 2021. Multi-cause effect estimation with disentangled confounder representation. In Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence. 2790–2796.
  • Maathuis et al. (2015) Marloes H Maathuis, Diego Colombo, et al. 2015. A generalized back-door criterion. The Annals of Statistics 43, 3 (2015), 1060–1088.
  • Maathuis et al. (2010) Marloes H Maathuis, Diego Colombo, Markus Kalisch, and Peter Bühlmann. 2010. Predicting causal effects in large-scale systems from observational data. Nature Methods 7, 4 (2010), 247–248.
  • Maathuis et al. (2009) Marloes H Maathuis, Markus Kalisch, Peter Bühlmann, et al. 2009. Estimating high-dimensional intervention effects from observational data. The Annals of Statistics 37, 6A (2009), 3133–3164.
  • Malinsky and Spirtes (2017) Daniel Malinsky and Peter Spirtes. 2017. Estimating bounds on causal effects in high-dimensional and possibly confounded systems. International Journal of Approximate Reasoning 88 (2017), 371–384.
  • Martens et al. (2006) Edwin P Martens, Wiebe R Pestman, Anthonius de Boer, et al. 2006. Instrumental variables: application and limitations. Epidemiology 17, 3 (2006), 260–267.
  • Mattei et al. (2014) Alessandra Mattei, Fabrizia Mealli, and Barbara Pacini. 2014. Identification of causal effects in the presence of nonignorable missing outcome values. Biometrics 70, 2 (2014), 278–288.
  • Meek (1995) Christopher Meek. 1995. Causal inference and causal explanation with background knowledge. In the Eleventh Conference on Uncertainty in Artificial Intelligence. 403–411.
  • Morgan and Harding (2006) S. L. Morgan and D. J. Harding. 2006. Matching Estimators of Causal Effects: Prospects and Pitfalls in Theory and Practice. Sociological Methods & Research 35, 1 (2006), 3–60.
  • Morgan and Winship (2015) Stephen L Morgan and Christopher Winship. 2015. Counterfactuals and Causal Inference. Cambridge University Press.
  • Nabi et al. (2022) Razieh Nabi, Todd McNutt, and Ilya Shpitser. 2022. Semiparametric causal sufficient dimension reduction of multidimensional treatments. In Uncertainty in Artificial Intelligence. PMLR, 1445–1455.
  • Neapolitan et al. (2004) Richard E Neapolitan et al. 2004. Learning Bayesian Networks. Vol. 38. Pearson Prentice Hall Upper Saddle River, NJ.
  • Nogueira et al. (2022) Ana Rita Nogueira, Andrea Pugnana, Salvatore Ruggieri, Dino Pedreschi, and João Gama. 2022. Methods and tools for causal discovery and causal inference. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 12, 2 (2022), e1449.
  • Pearl (1995a) Judea Pearl. 1995a. Causal diagrams for empirical research. Biometrika 82, 4 (1995), 669–688.
  • Pearl (1995b) Judea Pearl. 1995b. On the testability of causal models with latent and instrumental variables. In The Conference on Uncertainty in Artificial Intelligence. 435–443.
  • Pearl (2009a) Judea Pearl. 2009a. Causality: Models, Reasoning, and Inference. Cambridge university press.
  • Pearl (2009b) Judea Pearl. 2009b. Myth, confusion, and science in causal analysis. Tech. Rep. R-348 (2009). Los Angeles, CA: University of California.
  • Pearl et al. (2009) Judea Pearl et al. 2009. Causal inference in statistics: An overview. Statistics Surveys 3 (2009), 96–146.
  • Pearl and Mackenzie (2018) Judea Pearl and Dana Mackenzie. 2018. The Book of Why: The New Science of Cause and Effect. Basic Books.
  • Peña (2018) Jose M Peña. 2018. Reasoning with alternative acyclic directed mixed graphs. Behaviormetrika 45, 2 (2018), 389–422.
  • Perkovic et al. (2017) Emilija Perkovic, Markus Kalisch, and Marloes H Maathuis. 2017. Interpreting and using CPDAGs with background knowledge. In The Conference on Uncertainty in Artificial Intelligence. AUAI Press, ID–120.
  • Perković et al. (2018) Emilija Perković, Johannes Textor, Markus Kalisch, and Marloes H Maathuis. 2018. Complete graphical characterization and construction of adjustment sets in Markov equivalence classes of ancestral graphs. The Journal of Machine Learning Research 18, 1 (2018), 8132–8193.
  • Peters et al. (2017) Jonas Peters, Dominik Janzing, and Bernhard Schölkopf. 2017. Elements of Causal Inference. The MIT Press.
  • Ren et al. (2023) Shangsi Ren, Cameron A Beeche, et al. 2023. Graphical modeling of causal factors associated with the postoperative survival of esophageal cancer subjects. Medical Physics (2023), 1–10.
  • Richardson (2003) Thomas Richardson. 2003. Markov properties for acyclic directed mixed graphs. Scandinavian Journal of Statistics 30, 1 (2003), 145–157.
  • Richardson and Spirtes (2002) Thomas Richardson and Peter Spirtes. 2002. Ancestral graph Markov models. The Annals of Statistics 30, 4 (2002), 962–1030.
  • Richardson and Spirtes (2003) Thomas S Richardson and Peter Spirtes. 2003. Causal inference via ancestral graph models. Oxford Statistical Science Series 27 (2003), 83–105.
  • Robins (1986) James Robins. 1986. A new approach to causal inference in mortality studies with a sustained exposure period—application to control of the healthy worker survivor effect. Mathematical Modelling 7, 9-12 (1986), 1393–1512.
  • Robins (1997) James M Robins. 1997. Causal inference from complex longitudinal data. In Latent variable modeling and applications to causality. Springer, 69–117.
  • Robins and Greenland (1992) James M Robins and Sander Greenland. 1992. Identifiability and exchangeability for direct and indirect effects. Epidemiology 3, 2 (1992), 143–155.
  • Robins et al. (2000a) James M Robins, Miguel Angel Hernán, and Babette Brumback. 2000a. Marginal Structural Models and Causal Inference in Epidemiology. Epidemiology 11, 5 (2000), 551.
  • Robins et al. (2000b) James M Robins, Andrea Rotnitzky, and Daniel O Scharfstein. 2000b. Sensitivity analysis for selection bias and unmeasured confounding in missing data and causal inference models. 116 (2000), 1–94.
  • Rosenbaum and Rubin (1985) Paul R Rosenbaum and Donald B Rubin. 1985. The bias due to incomplete matching. Biometrics 41, 1 (1985), 103–116.
  • Rotnitzky and Smucler (2020) Andrea Rotnitzky and Ezequiel Smucler. 2020. Efficient adjustment sets for population average causal treatment effect estimation in graphical models. The Journal of Machine Learning Research 21, 1 (2020), 7642–7727.
  • Rubin (1973) Donald B Rubin. 1973. Matching to remove bias in observational studies. Biometrics 29 (1973), 159–183.
  • Rubin (1974) Donald B Rubin. 1974. Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology 66, 5 (1974), 688.
  • Rubin (2007) Donald B Rubin. 2007. The design versus the analysis of observational studies for causal effects: parallels with the design of randomized trials. Statistics in Medicine 26, 1 (2007), 20–36.
  • Runge (2021) Jakob Runge. 2021. Necessary and sufficient graphical conditions for optimal adjustment sets in causal graphical models with hidden variables. Advances in Neural Information Processing Systems 34 (2021), 15762–15773.
  • Runge et al. (2019) Jakob Runge, Sebastian Bathiany, Erik Bollt, Gustau Camps-Valls, Dim Coumou, Ethan Deyle, Clark Glymour, Marlene Kretschmer, Miguel D Mahecha, Jordi Muñoz-Marí, et al. 2019. Inferring causation from time series in Earth system sciences. Nature communications 10, 1 (2019), 2553.
  • Sachs et al. (2005) Karen Sachs, Omar Perez, et al. 2005. Causal protein-signaling networks derived from multiparameter single-cell data. Science 308, 5721 (2005), 523–529.
  • Sauer et al. (2013) Brian C Sauer, M Alan Brookhart, Jason Roy, and Tyler VanderWeele. 2013. A review of covariate selection for non-experimental comparative effectiveness research. Pharmacoepidemiology and drug safety 22, 11 (2013), 1139–1145.
  • Scharfstein et al. (2021) Daniel O Scharfstein, Razieh Nabi, Edward H Kennedy, Ming-Yueh Huang, Matteo Bonvini, and Marcela Smid. 2021. Semiparametric sensitivity analysis: Unmeasured confounding in observational studies. arXiv preprint arXiv:2104.08300 (2021).
  • Schölkopf (2022) Bernhard Schölkopf. 2022. Causality for machine learning. In Probabilistic and Causal Inference: The Works of Judea Pearl. 765–804.
  • Scutari (2010) M Scutari. 2010. Learning Bayesian networks with the bnlearn R Package. Journal of Statistical Software 35, 3 (2010), 1–22.
  • Sekhon (2011) Jasjeet S Sekhon. 2011. Multivariate and Propensity Score Matching Software with Automated Balance Optimization: The Matching package for R. Journal of Statistical Software 42 (2011), 1–52.
  • Shalit et al. (2017) Uri Shalit, Fredrik D Johansson, and David Sontag. 2017. Estimating individual treatment effect: generalization bounds and algorithms. In International Conference on Machine Learning. PMLR, 3076–3085.
  • Shpitser and Pearl (2006) Ilya Shpitser and Judea Pearl. 2006. Identification of joint interventional distributions in recursive semi-Markovian causal models. In Proceedings of the National Conference on Artificial Intelligence, Vol. 21. 1219–1226.
  • Shpitser and Pearl (2008) Ilya Shpitser and Judea Pearl. 2008. Complete identification methods for the causal hierarchy. Journal of Machine Learning Research 9, Sep (2008), 1941–1979.
  • Shpitser et al. (2010) Ilya Shpitser, Tyler VanderWeele, and James M Robins. 2010. On the validity of covariate adjustment for estimating causal effects. In The Twenty-Sixth Conference on Uncertainty in Artificial Intelligence. 527–536.
  • Sieswerda et al. (2023) Melle Sieswerda, Shixuan Xie, et al. 2023. Identifying confounders using bayesian networks and estimating treatment effect in prostate cancer with observational data. JCO Clinical Cancer Informatics 7 (2023), e2200080.
  • Silva et al. (2011) Ricardo Silva, Charles Blundell, and Yee Whye Teh. 2011. Mixed cumulative distribution networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics. JMLR Workshop and Conference Proceedings, 670–678.
  • Silva and Shimizu (2017) Ricardo Silva and Shohei Shimizu. 2017. Learning instrumental variables with structural and non-gaussianity assumptions. Journal of Machine Learning Research 18, 120 (2017), 1–49.
  • Singh et al. (2019) Rahul Singh, Maneesh Sahani, and Arthur Gretton. 2019. Kernel instrumental variable regression. In International Conference on Neural Information Processing Systems. 4593–4605.
  • Sjolander and Martinussen (2019) Arvid Sjolander and Torben Martinussen. 2019. Instrumental variable estimation with the R package ivtools. Epidemiologic Methods 8, 1 (2019), 1–20.
  • Sobel and Lindquist (2020) Michael E Sobel and Martin A Lindquist. 2020. Estimating causal effects in studies of human brain function: New models, methods and estimands. The annals of applied statistics 14, 1 (2020), 452.
  • Spirtes (2010) Peter Spirtes. 2010. Introduction to causal inference. Journal of Machine Learning Research 11, 5 (2010), 1643–1662.
  • Spirtes et al. (2000) Peter Spirtes, Clark N Glymour, Richard Scheines, et al. 2000. Causation, Prediction, and Search. MIT Press.
  • Stuart (2010) Elizabeth A Stuart. 2010. Matching methods for causal inference: A review and a look forward. Statistical science: a review journal of the Institute of Mathematical Statistics 25, 1 (2010), 1–21.
  • Sun et al. (2021) Xiaoru Sun, Lu Wang, et al. 2021. Identification of microenvironment related potential biomarkers of biochemical recurrence at 3 years after prostatectomy in prostate adenocarcinoma. Aging (Albany NY) 13, 12 (2021), 16024.
  • Textor et al. (2016) Johannes Textor, Benito van der Zander, Mark S Gilthorpe, et al. 2016. Robust causal inference using directed acyclic graphs: the R package ‘dagitty’. International journal of epidemiology 45, 6 (2016), 1887–1894.
  • Tian and Pearl (2002) Jin Tian and Judea Pearl. 2002. A general identification condition for causal effects. In Aaai/iaai. 567–573.
  • Tibshirani (1996) Robert Tibshirani. 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological) 58, 1 (1996), 267–288.
  • Tortorelli and Michaleris (1994) Daniel A Tortorelli and Panagiotis Michaleris. 1994. Design sensitivity analysis: overview and review. Inverse problems in Engineering 1, 1 (1994), 71–105.
  • Tsamardinos et al. (2006) Ioannis Tsamardinos, Laura E Brown, and Constantin F Aliferis. 2006. The max-min hill-climbing Bayesian network structure learning algorithm. Machine learning 65, 1 (2006), 31–78.
  • van der Zander et al. (2014) Benito van der Zander, Maciej Liśkiewicz, and Johannes Textor. 2014. Constructing separators and adjustment sets in ancestral graphs. In The Thirtieth Conference on Uncertainty in Artificial Intelligence. 907–916.
  • Van der Zander et al. (2015) Benito Van der Zander, Maciej Liśkiewicz, and Johannes Textor. 2015. Efficiently finding conditional instruments for causal inference. (2015), 3243–3249.
  • van der Zander et al. (2019) Benito van der Zander, Maciej Liśkiewicz, and Johannes Textor. 2019. Separators and adjustment sets in causal graphs: Complete criteria and an algorithmic framework. Artificial Intelligence 270 (2019), 1–40.
  • VanderWeele and Ding (2017) Tyler J VanderWeele and Peng Ding. 2017. Sensitivity analysis in observational research: introducing the E-value. Annals of internal medicine 167, 4 (2017), 268–274.
  • VanderWeele and Shpitser (2011) Tyler J VanderWeele and Ilya Shpitser. 2011. A new criterion for confounder selection. Biometrics 67, 4 (2011), 1406–1413.
  • Verbeek (2008) Marno Verbeek. 2008. A Guide to Modern Econometrics. John Wiley & Sons.
  • Vowels et al. (2022) Matthew J Vowels, Necati Cihan Camgoz, and Richard Bowden. 2022. D’ya like dags? a survey on structure learning and causal discovery. Comput. Surveys 55, 4 (2022), 1–36.
  • Wang and Blei (2019) Yixin Wang and David M Blei. 2019. The blessings of multiple causes. J. Amer. Statist. Assoc. 114, 528 (2019), 1574–1596.
  • Witte and Didelez (2019) Janine Witte and Vanessa Didelez. 2019. Covariate selection strategies for causal inference: Classification and comparison. Biometrical Journal 61, 5 (2019), 1270–1289.
  • Witte et al. (2020) Janine Witte, Leonard Henckel, Marloes H Maathuis, and Vanessa Didelez. 2020. On efficient adjustment in causal graphs. The Journal of Machine Learning Research 21, 1 (2020), 9956–10000.
  • Wu et al. (2022) Anpeng Wu, Kun Kuang, Ruoxuan Xiong, and Fei Wu. 2022. Instrumental Variables in Causal Inference and Machine Learning: A Survey. arXiv preprint arXiv:2212.05778 (2022).
  • Yao et al. (2021) Liuyi Yao, Zhixuan Chu, Sheng Li, Yaliang Li, Jing Gao, and Aidong Zhang. 2021. A survey on causal inference. ACM Transactions on Knowledge Discovery from Data (TKDD) 15, 5 (2021), 1–46.
  • Yao et al. (2018) Liuyi Yao, Sheng Li, Yaliang Li, et al. 2018. Representation Learning for Treatment Effect Estimation from Observational Data. In Advances in Neural Information Processing Systems. 2638–2648.
  • Yu et al. (2021) Kui Yu, Lin Liu, and Jiuyong Li. 2021. A unified view of causal and non-causal feature selection. ACM Transactions on Knowledge Discovery from Data 15, 4 (2021), 1–46.
  • Yuan et al. (2022) Junkun Yuan, Anpeng Wu, Kun Kuang, et al. 2022. Auto IV: Counterfactual Prediction via Automatic Instrumental Variable Decomposition. ACM Transactions on Knowledge Discovery from Data 16, 4 (2022), 1–20.
  • Zander and Liśkiewicz (2016) Benito van der Zander and Maciej Liśkiewicz. 2016. Separators and adjustment sets in Markov equivalent DAGs. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence. 3315–3321.
  • Zhang (2008a) Jiji Zhang. 2008a. Causal reasoning with ancestral graphs. Journal of Machine Learning Research 9, Jul (2008), 1437–1474.
  • Zhang (2008b) Jiji Zhang. 2008b. On the completeness of orientation rules for causal discovery in the presence of latent confounders and selection bias. Artificial Intelligence 172, 16-17 (2008), 1873–1896.
  • Zhang et al. (2016) Weijia Zhang, Thuc Duy Le, Lin Liu, Zhi-Hua Zhou, and Jiuyong Li. 2016. Predicting miRNA targets by integrating gene regulatory knowledge with expression profiles. PloS one 11, 4 (2016), e0152860.
  • Zisoulis et al. (2012) Dimitrios G Zisoulis, Zoya S Kai, Roger K Chang, and Amy E Pasquinelli. 2012. Autoregulation of microRNA biogenesis by let-7 and Argonaute. Nature 486, 7404 (2012), 541.