
Evaluation Methods and Measures for Causal Learning Algorithms

Lu Cheng, Ruocheng Guo, Raha Moraffah, Paras Sheth, K. Selçuk Candan, and Huan Liu, Fellow, IEEE. Ruocheng Guo is with the School of Data Science, City University of Hong Kong, China. All other authors are with the School of Computing, Informatics, and Decision Systems Engineering, Arizona State University, Tempe, AZ, USA. (e-mail: {lcheng35,rmoraffa,psheth5,candan,huanliu}@asu.edu, [email protected])
Abstract

The convenient access to copious multi-faceted data has encouraged machine learning researchers to reconsider correlation-based learning and embrace the opportunity of causality-based learning, i.e., causal machine learning (causal learning). Recent years have therefore witnessed great effort in developing causal learning algorithms that aim to help AI achieve human-level intelligence. Due to the lack of ground-truth data, one of the biggest challenges in current causal learning research is algorithm evaluation. This largely impedes the cross-pollination of AI and causal inference and hinders the two fields from benefiting from each other's advances. To bridge from conventional causal inference (i.e., based on statistical methods) to causal learning with big data (i.e., the intersection of causal inference and machine learning), in this survey we review commonly used datasets, evaluation methods, and measures for causal learning using an evaluation pipeline similar to that of conventional machine learning. We focus on the two fundamental causal-inference tasks and on causality-aware machine learning tasks. Limitations of current evaluation procedures are also discussed. We then examine popular causal inference tools/packages and conclude with the primary challenges and opportunities for benchmarking causal learning algorithms in the era of big data. The survey seeks to bring to the forefront the urgency of developing publicly available benchmarks and consensus-building standards for causal learning evaluation with observational data. In doing so, we hope to broaden the discussions and facilitate collaboration to advance the innovation and application of causal learning.

Impact Statement

Causal learning goes beyond machine learning because of its power to uncover data generating processes. Causality relates to crucial open problems in machine learning; conversely, machine learning contributes to addressing fundamental challenges in causal inference. One key challenge of causal learning is that the research domain lacks public benchmark resources to support principled evaluation of research contributions. Our goal is to promote objectivity, reproducibility, fairness, collaboration, and awareness of bias in causal learning research. Arguing that this goal can only be achieved through systematic, objective, and transparent evaluation, in this survey we provide a comprehensive review of the evaluation of fundamental tasks in causal inference and causality-aware machine learning tasks. Similar to evaluation in conventional machine learning, the causal evaluation pipeline includes the evaluation protocols, metrics, datasets, and popular causal tools/packages. We also seek to expedite the marriage of causality and machine learning via discussions of prominent open problems and challenges.

Index Terms: Benchmarking, Big Data, Causal Inference, Causal Learning, Evaluation

1 Introduction


Machine learning (ML) is a key pillar of artificial intelligence (AI). The unmatched availability of big data has unleashed the unprecedented power of ML to support situational awareness and decision making. While witnessing the exceptional success of ML technologies in various applications, users have started to notice a critical shortcoming of ML: traditional ML techniques learn correlation-based patterns and relationships from data. Unfortunately, correlation is a poor substitute for causality, as in many cases data may contain spurious correlations [1, 2]. Causal inference with observational data – data that have been generated by something other than randomized experiments – offers a promising alternative to correlation-based learning. Causal inference excels at uncovering the underlying causal mechanisms, leading to its wide application in a myriad of high-impact domains, such as education [3, 4, 5, 6], medical science [7, 8], economics [9], epidemiology [10, 11], meteorology [12], and environmental health [13]. Therefore, learning causality is a significant step toward AI achieving human-level intelligence [14].

The conventional way to understand causality is to use interventions and/or randomized controlled trials (RCTs) [15, 16]. In many situations, however, these are time-consuming, impractical, or even unethical [17, 16, 18]. Attention has therefore been drawn to the recent availability of big observational data in all walks of life, as it provides new opportunities for learning causality without the disadvantages of RCTs. While relatively recent, causal ML (causal learning, CL) with observational data is emerging as a vibrant field with new opportunities and domain-specific challenges. CL "seeks to model the effect of interventions and distribution changes with a combination of data-driven learning and assumptions not already included in the statistical description of system" [19]. Note that causal inference refers to methods that perform inference (effect estimation) and structural learning (causal discovery), whereas CL is more frequently associated with methods that leverage ML to improve causal inference tasks or use causality to resolve limitations of current ML methods. In this work, we focus on the two fundamental problems in causal inference: (1) learning causal effects (i.e., estimating the causal effect of a treatment on an important outcome) and (2) learning causal structure (i.e., examining whether a certain set of causal relations exists between variables) [20]. Answering these challenging causal questions is becoming feasible as large datasets may contain sufficient samples from the joint distribution of the observed variables [15, 21].

A primary obstacle that impedes research developments in CL is the lack of public benchmark resources to support principled evaluation. Standardized evaluation played a major role in the early days of ML research. Successful early benchmarking efforts, such as the UCI ML (http://archive.ics.uci.edu/ml/) and UCI KDD (https://kdd.ics.uci.edu/) repositories, not only helped guide the development of efficient and effective ML algorithms, but also encouraged collaborative research and paved the way for the recent breakthroughs in deep learning. Unfortunately, given the different learning objectives of conventional ML and CL, these existing benchmarks are not applicable to CL.

The overarching goal is, therefore, to enable the advancement of CL research and to promote objectivity, reproducibility, fairness, collaboration, and awareness of bias in CL research. We argue that this goal can only be achieved through systematic, objective, and transparent evaluation of CL models and algorithms. In this survey, we aim to provide a comprehensive review of the evaluation of the two fundamental tasks in causal inference and of causality-aware ML tasks, such as causal interpretability. We summarize the results in Table 1. Similar to evaluation in conventional ML, the evaluation pipeline for CL includes the evaluation protocols, metrics, datasets, and popular benchmarking tools/packages. Under each task, we then discuss the limitations of current evaluation procedures. This survey also aims to help set the agenda for future research on benchmarking CL algorithms by examining prominent open problems and challenges in the current evaluation pipeline for CL.

Difference from Previous Work. Here, we first briefly summarize existing surveys on causal inference and CL and then highlight their differences from this work.

Guo et al. [16] focus on reviewing the methodologies for causal effect estimation and causal structure learning, as well as discussing the connections between causal inference and ML. Yao et al. [22] specifically survey causal effect estimation methods and tools under the potential outcome framework. Another survey by Spirtes and Zhang [23] reviews semi-parametric score-based methods for learning causal structure with i.i.d. (independent and identically distributed) and time-series observational data. To bridge from ML to Artificial General Intelligence, Schölkopf [20] shows that many challenging problems in ML and AI are inherently related to causality. The author especially examines where the links between AI and graphical causal inference have been and should be established. The rapid growth of time-series data and the unique challenges it brings in causal studies have led to surveys such as [24] reviewing problems, methods, and evaluation related to causal time series analysis. In causal ML, Chen et al. [25] summarize different types of bias in recommendation systems and review methods that aim to mitigate such bias by leveraging causal inference theories. Moraffah et al. [26] survey the methods that leverage causality to enable the interpretability of ML models. All previously mentioned surveys center on reviewing the theories and methods in CL without an in-depth discussion about the evaluation.

The most closely related works are by Shimoni et al. [27] and Mooij et al. [27]. The former introduces a benchmarking framework for causal effect estimation, summarizing evaluation metrics and data generation with ground-truth effects. The latter surveys methods and benchmarks for causal structure learning. Compared to prior studies, this survey aims to present a comprehensive review of the evaluation protocols, evaluation metrics, datasets, and causal tools/packages that have been widely used to benchmark causal effect estimation and causal structure learning. In addition, with the growing interest in causality in the AI community, we also examine existing frameworks for evaluating causality-aware ML tasks via several representative examples, such as causal interpretability and fairness. Therefore, this survey complements existing surveys that 1) focus on reviewing causal theories and methodologies, or 2) benchmark only one fundamental task in causal inference, i.e., either causal structure learning or causal effect estimation.

Intended Audience and Paper Organization. This survey will most benefit researchers and practitioners who have basic knowledge of causal inference and would like to develop or apply CL algorithms, but often find the evaluation rather challenging. It will also be useful to anyone interested in the differences between evaluating standard ML and CL algorithms. The rest of the survey is organized as follows: we discuss evaluation frameworks for causal effect estimation in Section 2, causal structure learning in Section 3, and causal ML tasks in Section 4. We then summarize and compare existing tools and packages for causal inference in Section 5. Section 6 delineates prominent open problems and challenges, and the last section concludes the survey.

Table 1: Summary of metrics, procedures, and datasets for evaluating CL approaches.

  • Causal Effect Estimation. Metrics: standard effect metrics (MAE, MSE, RMSE, PEHE, Policy Risk); heterogeneous effect metrics ($Uplift_{Coef}$, $Qini_{Coef}$); time series metrics (standard and heterogeneous effect metrics, F-Test, T-Test). Procedures with ground truth: observational data with known effect; observational and experimental data pairs; sampling from observational data; sampling from synthetic data; sampling from RCTs. Procedures without ground truth: evaluation is possible if a subset of the data is from RCTs. Datasets: under the unconfoundedness assumption; natural experiments; RCTs.

  • Causal Structure Learning. Metrics: SHD, SID, Frobenius Norm, Precision, Recall, F1, TPR, FPR, MSE, AUC, Precision-Recall Curve, FPR-TPR Curve, TVD, KL-Divergence, F-test. Procedures with ground truth: a transductive setting where both the ground-truth causal graph and the estimated graph are available. Procedures without ground truth: an inductive setting. Datasets: causal direction; causal graphs; time series datasets.

  • Causal Interpretability and Fairness. Metrics: counterfactual explanation metrics (sparsity, interpretability, speed, proximity, diversity, visual-linguistic); fairness metrics (FACE, FACT, Counterfactual Fairness, PC-Fairness, Ctf-DE, Ctf-IE, Ctf-SE). Procedures with ground truth: training on a regular dataset and testing on generated counterfactuals. Procedures without ground truth: generating counterfactual explanations for an unseen instance. Datasets: image; text; tabular.

  • Unbiased Interactive ML. Metrics: NDCG@K, MAP@K, ARP@K, APLT@K. Procedures with ground truth: the training set comes from a biased source whereas the test set comes from an unbiased source. Datasets: semi-synthetic datasets; RCTs.

2 Benchmarking Causal Effects Estimation

We specify the pipelines for evaluating the first fundamental task in causal inference – causal effect estimation. We begin by defining the problem in the Neyman/Rubin Potential Outcome Framework [28] and then introduce the benchmarking datasets, evaluation procedures, and metrics.

Many empirical analyses are motivated by the need to estimate the causal effect of a binary treatment on an outcome of interest. Let $\bm{x}_i$, $t_i \in \{0,1\}$, and $y_i$ be the features (i.e., background variables), the treatment assignment ($t_i=1$ for treated and $t_i=0$ for control), and the outcome of unit $i$, respectively. Estimating causal effects is defined as

Definition 1 (Causal Effects Estimation)

Given $n$ units $\{(\bm{x}_1,t_1,y_1),\ldots,(\bm{x}_n,t_n,y_n)\}$, estimating causal effects is to quantify the change of $Y$ as we alter the treatment assignment from $0$ to $1$.

Under different contexts, effects can be estimated within the entire population, a subpopulation defined by background variables, an unknown subpopulation, or an individual. The Average Treatment Effect (ATE) $\tau$ is typically used for assessing a population represented by the distribution of $X$:

\tau = \mathbb{E}_X[\tau(\bm{x})] = \mathbb{E}_X[Y \mid do(T=1)] - \mathbb{E}_X[Y \mid do(T=0)],    (1)

where $do(T=1)$ and $do(T=0)$ indicate that the treatment assignment is "treated" and "control", respectively. When the population is heterogeneous, ATE can be misleading because the same treatment may affect individuals differently. Under heterogeneity, a common assumption is that each subpopulation is defined by a set of features, leading to the Conditional ATE (CATE):

CATE: \tau(\bm{x}) = \mathbb{E}[Y \mid do(t=1), \bm{x}] - \mathbb{E}[Y \mid do(t=0), \bm{x}].    (2)

An Individual Treatment Effect (ITE) is a contrast between the potential outcomes of a unit:

\tau_i = Y_i(do(t=1)) - Y_i(do(t=0)).    (3)

Note that $\tau_i$ is not necessarily equal to $\tau(\bm{x})$, as the latter is an average over a subpopulation. The goal of the causal effect estimation task is to learn a function $\hat{\tau}$ that estimates ATE or CATE, depending on the degree of homogeneity of the population, for the binary treatment options $T=0$ and $T=1$.
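To make these quantities concrete, the following is a minimal, illustrative sketch (not from the surveyed works; all functional forms and variable names are assumptions) that simulates potential outcomes with a single confounder, computes the true ITEs and ATE of Eqs. (1) and (3), and contrasts the true ATE with a naive difference in observed means:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Toy data generating process (illustrative only): one confounder x
x = rng.normal(size=n)                        # background variable
t = rng.binomial(1, p=1 / (1 + np.exp(-x)))   # treatment depends on x (confounding)
y0 = x + rng.normal(scale=0.1, size=n)        # potential outcome under control
y1 = x + 2.0 + rng.normal(scale=0.1, size=n)  # potential outcome under treatment
y = np.where(t == 1, y1, y0)                  # factual (observed) outcome

ite = y1 - y0                  # individual treatment effects, Eq. (3)
true_ate = ite.mean()          # ATE, Eq. (1); equals 2 by construction
naive_ate = y[t == 1].mean() - y[t == 0].mean()  # biased: x confounds t and y

print(f"true ATE  = {true_ate:.3f}")
print(f"naive ATE = {naive_ate:.3f}  (biased by confounding)")
```

Because the confounder drives both treatment and outcome, the naive estimate deviates from the true ATE of 2; this gap is exactly what causal effect estimators, and the evaluation metrics below, are designed to measure.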

2.1 Evaluation Metrics

Evaluation metrics for causal effect estimation can be categorized into metrics for standard and heterogeneous effect estimations, as well as for time-series effect estimation. We review popular metrics in each category.

2.1.1 Metrics for Standard Causal Effect Estimation

As effects are typically continuous, most metrics are directly adapted from those for regression in ML, e.g., Root Mean Square Error (RMSE), Mean Absolute Error (MAE), Precision in Estimation of Heterogeneous Effect (PEHE), and Policy Risk. Given the ground-truth ATE $\tau$ and the predicted ATE $\hat{\tau}$, the MAE of ATE estimation is defined as

\epsilon_{MAE\_ATE} = \frac{1}{M}\sum_{j=1}^{M} |\tau_j - \hat{\tau}_j|,    (4)

where $M$ is the number of experiments and $j$ indexes the experiments. Similar metrics include the Mean Squared Error (MSE) and RMSE [29, 30]:

\epsilon_{MSE\_ATE} = \frac{1}{M}\sum_{j=1}^{M} (\tau_j - \hat{\tau}_j)^2, \qquad \epsilon_{RMSE\_ATE} = \sqrt{\frac{1}{M}\sum_{j=1}^{M} (\tau_j - \hat{\tau}_j)^2}.    (5)

PEHE is used for evaluating CATE estimates and is defined as

\epsilon_{PEHE} = \frac{1}{n}\sum_{i=1}^{n} (y_i^1 - y_i^0 - \hat{\tau}(\bm{x}_i))^2,    (6)

where $y_i^1$ and $y_i^0$ are the two potential outcomes of the $i$-th unit, $y_i^1 - y_i^0$ denotes the ground-truth CATE, and $\hat{\tau}(\cdot)$ is the learned function. When the true ITE is unknown but a subset of the dataset comes from an RCT, one can use the Policy Risk, defined as the average loss in value when treatment is assigned according to the policy implied by the ITE estimator [31, 32]:

\text{Policy Risk} = 1 - \big(\mathbb{E}[\tilde{y}_i^1 \mid t_i=1, \pi_i=1]\, p(\pi_i=1) + \mathbb{E}[\tilde{y}_i^0 \mid t_i=0, \pi_i=0]\, p(\pi_i=0)\big),    (7)

where $\pi_i=1$ denotes the policy to treat and $\pi_i=0$ the policy not to treat. $\tilde{y}_i$ is the factual outcome scaled to $[0,1]$. The second term in Eq. (7) represents the expected potential outcome under the policy, i.e., the weighted sum of the expectations of the two potential outcomes.
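As a hedged illustration of Eqs. (4)-(6), the sketch below assumes access to ground-truth effects (as in semi-synthetic benchmarks); the function and argument names are ours, not a standard API:

```python
import numpy as np

def mae_ate(true_ate, est_ate):
    """Eq. (4): mean absolute error of ATE over M experiments/replications."""
    true_ate, est_ate = np.asarray(true_ate, float), np.asarray(est_ate, float)
    return np.mean(np.abs(true_ate - est_ate))

def rmse_ate(true_ate, est_ate):
    """Eq. (5): root mean squared error of ATE over M experiments."""
    true_ate, est_ate = np.asarray(true_ate, float), np.asarray(est_ate, float)
    return np.sqrt(np.mean((true_ate - est_ate) ** 2))

def pehe(y1, y0, tau_hat):
    """Eq. (6): precision in estimation of heterogeneous effect (often reported as its square root)."""
    true_ite = np.asarray(y1, float) - np.asarray(y0, float)
    return np.mean((true_ite - np.asarray(tau_hat, float)) ** 2)

# Illustrative usage with dummy numbers
print(mae_ate([2.0, 1.5], [1.8, 1.9]), rmse_ate([2.0, 1.5], [1.8, 1.9]))  # M = 2 replications
print(pehe(y1=[3.0, 1.0], y0=[1.0, 0.5], tau_hat=[1.8, 0.7]))
```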

2.1.2 Metrics for Heterogeneous Effect Estimation

Evaluating heterogeneous effects with binary treatment typically follows the uplift modeling literature [33, 34]. Uplift modeling is an ML approach that employs the potential outcome framework to estimate ITEs in order to customize treatment assignments for different units. For example, companies use uplift models to distinguish between customers who buy a product because of a campaign and those who would buy anyway. The evaluation metrics are defined over a curve measuring the performance of algorithms: the $x$-axis of the curve represents the number of units or the percentile of the population sorted by estimated ITE (the larger the estimated ITE, the smaller the percentile); the $y$-axis represents the gain when the treatment is assigned to the top $a$ percentile. Given the population sorted by the inferred ITEs, the uplift curve (a.k.a. cumulative gain chart) measures the average cumulative gain of receiving the treatment among the first $b$ units:

Uplift(b) = \left(\frac{Y_b^1}{N_b^1} - \frac{Y_b^0}{N_b^0}\right)(N_b^1 + N_b^0),    (8)

where $Y_b^1$ ($Y_b^0$) and $N_b^1$ ($N_b^0$) denote the sum of the treated (control) outcomes and the number of treated (control) units among the first $b$ units, respectively. A variant of the uplift curve is the Qini curve, which is defined in terms of percentiles instead of the absolute number of units [35]:

Qini(a) = Y_a^1 - \frac{Y_a^0 N_a^1}{N_a^0},    (9)

where $Y_a^1$ ($Y_a^0$) and $N_a^1$ ($N_a^0$) denote the sum of the treated (control) outcomes and the number of treated (control) units in the first $a$ percentile of the population, respectively. Similar to the AUC-ROC curve, we can compute the area under the uplift/Qini curve, referred to as the uplift/Qini coefficient:

Uplift_{Coef} = \sum_{b=0}^{N-1}\frac{1}{2}\big(Uplift(b+1) + Uplift(b)\big); \qquad Qini_{Coef} = \sum_{a=0}^{N-1}\frac{1}{2}\big(Qini(a+1) + Qini(a)\big),    (10)

where $N$ is the size of the population.
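The following sketch (illustrative; it assumes $Uplift(0)=0$ and guards the first few cutoffs where a treated or control group may still be empty) computes the uplift curve of Eq. (8) and its coefficient from Eq. (10) given estimated ITEs, treatments, and outcomes:

```python
import numpy as np

def uplift_curve(ite_hat, t, y):
    """Eq. (8): uplift value at every cutoff b, with units sorted by estimated ITE (descending)."""
    order = np.argsort(-np.asarray(ite_hat))
    t = np.asarray(t, float)[order]
    y = np.asarray(y, float)[order]
    n1, n0 = np.cumsum(t), np.cumsum(1 - t)            # N_b^1, N_b^0
    y1, y0 = np.cumsum(y * t), np.cumsum(y * (1 - t))  # Y_b^1, Y_b^0
    with np.errstate(divide="ignore", invalid="ignore"):
        up = (y1 / n1 - y0 / n0) * (n1 + n0)
    return np.nan_to_num(up)   # cutoffs where one group is still empty contribute 0

def uplift_coefficient(uplift_values):
    """Eq. (10): trapezoidal area under the uplift curve, taking Uplift(0) = 0."""
    u = np.concatenate(([0.0], np.asarray(uplift_values, float)))
    return float(np.sum(0.5 * (u[1:] + u[:-1])))

# Illustrative usage with toy data: treated outcomes are shifted upward by 1
rng = np.random.default_rng(1)
t = rng.binomial(1, 0.5, size=200)
y = rng.normal(loc=t.astype(float), size=200)
ite_hat = rng.normal(size=200)   # scores from some (here random) ITE/uplift model
print(uplift_coefficient(uplift_curve(ite_hat, t, y)))
```

The Qini coefficient of Eqs. (9)-(10) follows analogously, with percentiles of the sorted population in place of unit counts.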

2.1.3 Metrics for Time Series Effect Estimation

Time series data are sequences of real-valued data ordered over time. Causal inference for time series analysis is of particular interest because many scientific questions involve causal effect estimation and causal structure learning from chronological observations [24], such as those in medicine and social science. For example, social scientists have been interested in studying the effect of minimum wages on employment, where the monthly employment rate needs to be recorded before and after a change in the minimum wage [36].

Evaluation metrics for standard effect estimation (such as the MAE described above) can be used for time series effect estimation at each time step $s$. To evaluate over the entire time series, we further average the results over time, e.g., $MAE = \frac{1}{S}\sum_{s=1}^{S} MAE_s$. In addition, we can use metrics specifically designed for sequential data, such as the F-Test and the T-Test. The F-Test assesses treatment effect heterogeneity by comparing the marginal variances of the two potential outcomes. Let $\hat{e}_t$ and $\hat{u}_t$ be the time-dependent errors for the treated and control groups, respectively, and let $p$ denote the time lag. The F-Test is defined as

F = \frac{(RSS_0 - RSS_1)/p}{RSS_1/(S - 2p - 1)},    (11)

where $RSS_0 = \sum_{t=1}^{S} \hat{e}_t^2$ and $RSS_1 = \sum_{t=1}^{S} \hat{u}_t^2$. To test the significance of a cause, one can also use the Unpaired T-Test (UTT) to compare the sequence of the treated group with that of the control group:

UTT = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\left(\frac{1}{n_1} + \frac{1}{n_2}\right)\frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1 + n_2 - 2}}},    (12)

where $\bar{x}_1$ and $\bar{x}_2$ are the means of the two sequences, $s_1$ and $s_2$ denote their standard deviations, and $n_1$ and $n_2$ denote their lengths.
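Eq. (12) is the standard pooled two-sample t statistic, so it can be computed directly or, assuming SciPy is available, cross-checked with scipy.stats.ttest_ind using equal_var=True; the data below are illustrative:

```python
import numpy as np
from scipy import stats

def unpaired_t(x1, x2):
    """Direct implementation of Eq. (12) with pooled variance."""
    x1, x2 = np.asarray(x1, float), np.asarray(x2, float)
    n1, n2 = len(x1), len(x2)
    s1, s2 = x1.std(ddof=1), x2.std(ddof=1)
    pooled = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
    return (x1.mean() - x2.mean()) / np.sqrt((1 / n1 + 1 / n2) * pooled)

treated = [2.1, 2.5, 1.9, 2.8, 2.4]
control = [1.7, 1.6, 2.0, 1.5, 1.8]
print(unpaired_t(treated, control))
print(stats.ttest_ind(treated, control, equal_var=True).statistic)  # should match
```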

2.2 Evaluation Procedures

Evaluating causal inference methods is significantly more challenging than evaluating purely associational methods. The primary obstacle in evaluation with observational data is that we typically cannot know the true causal effects. Some prior work uses observational data with a known treatment effect. However, this requires that the studied phenomenon be so well understood that the causal effect is obvious [37], limiting data availability. Another strategy uses data from pairs of observational and experimental studies to create a nearly ideal scenario for causal effect estimation with observational data [38]; this also suffers from low data availability. The most straightforward approach is to generate observational data from synthetic causal systems where the treatment effect is either directly known or can be easily derived from the formulation [39, 40]. In particular, consider a data generating process that produces a binary treatment $T$, potential outcomes $Y(t)$, and multiple covariates $X = \{X_1, X_2, \ldots, X_k\}$. For each unit $i$, both potential outcomes $Y_i(1)$ and $Y_i(0)$ are measured. Given the biasing covariates $X^b \subseteq X$ of a unit and the synthetic dataset, one begins by generating the selected treatment $T^s$ from the biasing covariates, $P(T^s \mid X^b) := f(X^b)$. If $T^s$ matches the unit's treatment in the synthetic data, the unit is added to the simulated observational dataset. This evaluation procedure is known as sampling from synthetic data. However, synthetic data may not generalize well to real-world settings.

A more recent class of work augments an observational study with synthetic treatment assignments and outcomes generated by a synthetic function [27]. Given the observed covariates, we first randomly select a subset of samples, which is then fed into a data generating process to simulate treatment assignments and potential outcomes. Causal models are then trained on tuples of treatment, covariates, and factual outcome. This approach, referred to as sampling from observational data, can be used to evaluate both individual- and population-level effects. The newest approach creates constructed observational data by sampling from RCTs [41]. It has been shown that, in expectation, this approach creates datasets equivalent to those produced by randomly sampling from empirical datasets in which all potential outcomes are known; it is only suited for estimating population-level effects. Once the ground truth is available, evaluating causal effect estimation proceeds as for supervised learning algorithms. Without ground truth, evaluation is still possible if a subset of the data comes from RCTs, e.g., via the policy risk for ITE.
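The sketch below illustrates the simulation-based procedures above: real covariates (replaced here by a random stand-in) are combined with a known treatment-assignment and outcome model so that ground-truth ITEs are retained for evaluation while only factual outcomes are exposed to the estimator. All functional forms are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-in for real observed covariates (in practice: load a real table here)
X = rng.normal(size=(5000, 10))

# Known (synthetic) treatment-assignment and outcome functions
propensity = 1 / (1 + np.exp(-X[:, 0] + 0.5 * X[:, 1]))   # depends on covariates -> confounding
T = rng.binomial(1, propensity)
tau = 1.0 + 0.5 * X[:, 2]                                  # heterogeneous ground-truth effect
y0 = X @ rng.normal(size=10) + rng.normal(scale=0.1, size=len(X))
y1 = y0 + tau
y_factual = np.where(T == 1, y1, y0)

# Models are trained on (X, T, y_factual); (y1 - y0) is kept only for evaluation,
# e.g., PEHE for CATE estimators or |ATE - ATE_hat| for population-level effects.
print("ground-truth ATE:", tau.mean())
```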

The evaluation of causal effect estimation with time series and networked data typically follows the same protocol and metrics as with i.i.d. data: the estimated ATE (ITE) on the test set is compared with the corresponding ground truth using MAE and MSE (PEHE).

2.3 Benchmarking Datasets

In general, there are two types of data used for identifying and estimating causal effects [16]: data used under the unconfoundedness assumption [15] – observed/measured variables are sufficient to capture the causal links between treatment and outcome; and data collected from natural experiments, popular alternatives to RCTs. Below, we briefly introduce exemplar benchmarking datasets in each category. For a comprehensive description, please refer to [18]. We first introduce data used under the unconfoundedness assumption.

2.3.1 Datasets with Binary Treatment Variables

Most methods in the literature of learning causal effects are evaluated on datasets with binary treatment.

  • Jobs. This dataset is used to study the effect of a job training program on an individual's real earnings. It consists of an RCT conducted by LaLonde [3] (297 treated and 425 control units) and the Panel Study of Income Dynamics comparison group (2,490 control units) [42]. Features include multiple demographic variables such as age and race.

  • IHDP (Infant Health and Development Program). This is a dataset with simulated treatments and outcomes, initially compiled by [6], seeking to evaluate the efficacy of comprehensive early intervention in reducing the developmental and health problems of low-birth-weight, premature infants. A commonly used simulation setting is setting "A" in the NPCI package (https://github.com/vdorie/npci). The dataset comprises 747 instances (139 treated and 608 control), each associated with 25 features.

  • Atlantic Causal Inference Conference Benchmark (ACICB) [43]. It inherits the same features as those in IHDP data [44]. Various settings have been adopted to synthesize the treatment and outcomes.

  • Twins. The Twins dataset in [45] is used to study ITE of twins’ weights on their mortality in the first year of their births. Each twin-pair is represented by 46 features relating to the parents, the pregnancy, and birth.

  • BlogCatalog is used to study causal effect estimation with networked data. It is collected from an online social network service where users can post blogs. Treatments and outcomes are synthesized based on the observed features, the social network structures, and the homophily phenomenon. This dataset has 5,196 instances, 173,468 edges, and 8,189 observed features.

  • Amazon [32] is an extension of the dataset in  [46]. The goal is to evaluate the efficacy of positive (or negative) reviews in promoting product sales. Treatment and potential outcomes are generated based on the ratings and by matching the most similar products with different treatment assignment.

2.3.2 Datasets with Multiple Treatments/Continuous Treatment

There are studies in which multiple treatments are studied, e.g., Twins-Mult [47], TCGA [48], and News-Mult [48]. A dataset with continuous treatment can be found in [49]. It was collected to study the causal effect of the amount of smoking on the medical expenditure.

Datasets collected from natural experiments allow us to relax the stringent unconfoundedness assumption [50], as they account for the presence of hidden confounders. (In most cases, natural experiments only allow us to identify ATEs, not ITEs or CATEs; therefore, the ATE metrics $\epsilon_{MAE\_ATE}$, $\epsilon_{MSE\_ATE}$, and $\epsilon_{RMSE\_ATE}$ can be used for evaluation [29, 30].) Such datasets include variables selected using causal knowledge, such as instrumental variable(s) (IVs) and running variable(s). Below, we review datasets with IVs [51] and datasets for RDD (Regression Discontinuity Design) [52]. We also discuss datasets for causal time series analysis and datasets with network information, given how common these two types of data are.

2.3.3 Datasets with Instrumental Variables

The IV is a powerful tool for identifying causal effects when hidden confounders exist. We first briefly describe the core idea of IVs and then describe exemplar datasets.

Definition 2 (Instrumental Variable)

Given an observed variable $Z$, observed features $\mathbf{X}$, treatment $T$, and outcome $Y$, $Z$ is a valid IV for the causal effect of $T \rightarrow Y$ iff $Z$ satisfies: (1) $Z \not\perp\!\!\!\perp T \mid \mathbf{X}$, and (2) $Z \perp\!\!\!\perp Y \mid \mathbf{X}, do(T)$ [51].

The definition indicates that a valid IV causally influences the outcome only through the treatment. As shown in the causal graph of a valid IV ($Z$) in Figure 1, the first condition requires that there is an edge $Z \rightarrow T$ or a non-empty set of collider(s) $\mathbf{X}$ such that $Z \rightarrow T \leftarrow \mathbf{X}$, where $\mathbf{X}$ denotes the observed confounders. The second condition requires that $Z \rightarrow T \rightarrow Y$ is the only path that starts from $Z$ and ends at $Y$; therefore, blocking $T$ renders $Z \perp\!\!\!\perp Y$. This implies the exclusion restriction that there is no direct edge $Z \rightarrow Y$ or path $Z \rightarrow \mathbf{X}' \rightarrow Y$ with $\mathbf{X}' \subseteq \mathbf{X}$. Formally, $Y(do(Z=z), T) = Y(do(Z=z'), T)$ for all $T$ and $z \neq z'$.

Figure 1: A causal graph of a valid IV ($Z$) when there are hidden confounders ($\mathbf{U}$). $\mathbf{X}$ stands for the observed confounders and $\mathbf{U}$ is a set of hidden confounders.

We denote a dataset with an IV as $\{(\mathbf{x}_i, t_i, y_i, z_i)\}_{i=1}^{N}$, where $z_i$ represents the IV of the $i$-th instance. We can then evaluate approaches that leverage the IV to estimate ATE, e.g., the ratio estimator [53]; a minimal sketch of this estimator is given below, followed by exemplar datasets.
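A minimal sketch of the ratio (Wald) estimator for a binary IV and binary treatment, assuming the conditions of Definition 2 hold; the data generating process is illustrative and chosen so that the true effect is 2.0:

```python
import numpy as np

def wald_ratio_estimate(z, t, y):
    """Ratio (Wald) estimator: effect of T on Y identified through a binary IV Z.

    effect = (E[Y|Z=1] - E[Y|Z=0]) / (E[T|Z=1] - E[T|Z=0])
    """
    z, t, y = (np.asarray(a, float) for a in (z, t, y))
    num = y[z == 1].mean() - y[z == 0].mean()   # reduced-form effect of Z on Y
    den = t[z == 1].mean() - t[z == 0].mean()   # first-stage effect of Z on T
    return num / den

# Toy illustration: hidden confounder u affects both t and y; z affects y only through t
rng = np.random.default_rng(7)
u = rng.normal(size=20_000)
z = rng.binomial(1, 0.5, size=20_000)
t = rng.binomial(1, 1 / (1 + np.exp(-(u + z))))
y = 2.0 * t + u + rng.normal(size=20_000)
print(wald_ratio_estimate(z, t, y))   # approximately the true effect 2.0
```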

  • 1980 Census Extract. This dataset contains 329,509 observations on the following variables: log weekly wage, quarter of birth (1-4), year of birth (30-39), place of birth (1980 census state codes), and education (highest grade completed). The goal is to study the causal effect of education on earnings. The quarter of birth is considered a legitimate IV [54] for years of schooling. This can be justified by noting that children born earlier in the year enter school at an older age and are allowed to drop out (on their 16th or 17th birthday) after having completed less schooling than those born later in the year.

  • Current Population Survey Extract. This dataset contains 30,967 instances. Each instance represents a male born in 1944-53, extracted from the 1979 and 1981-85 March Current Population Surveys. Each instance is matched with a dummy variable that takes its value from 25 lottery numbers; there are 72 variables in total. A potentially valid IV is the lottery number, which represents the likelihood of serving in the military during the Vietnam era: serving in the military influences both the schooling and earnings of a male and is determined by the lottery number, and the lottery number is randomly assigned to each individual.

  • Slave Export and Trust in Community and Society. This study evaluates the long-term causal effect of slave export on trust in the community and society [55]. In [30], the treatment is defined as $t = \ln(1 + \text{slave exports}/\text{area})$ and the outcome is trust in neighbors. The dataset has 6,932 instances and 59 features. The IV used in [55, 30] is each community's distance to the sea.

2.3.4 Datasets for Regression Discontinuity Design

RDD is a type of natural experiment that has been found useful in a variety of real-world problems [52]. A dataset for RDD is denoted by $\{(\mathbf{x}_i, t_i, y_i, r_i)\}_{i=1}^{N}$, where $r_i$ is the running variable of the $i$-th instance, which determines the treatment assignment of $i$ based on a predefined cut-off value $r_0$. The running variable is used to control for confounding bias. For example, in the study of the causal effect of drinking alcohol on youth mortality [56], age is a natural running variable and $r_0 = 21$ (years old). Below is a list of datasets for RDD methods.

  • Population Threshold Dataset [57]. Decision-making on many features of municipal governments (e.g., the electoral system, mayors’ salaries, and the number of councillors) typically depends on whether the municipality is above or below arbitrary population thresholds. This dataset enables us to evaluate RDD methods for estimating ATEs of threshold-based policies on population-level political and economic outcomes.

  • Pretest Scores and Posttest Scores [58]. It is used to study the causal effect of students’ pre-test scores on their post-test scores. This semi-synthetic dataset is generated based on actual student test scores and demographic information. It has 2,767 observations and uses pre-test score as the running variable.

2.3.5 Datasets for Time Series Effect Estimation

Below we list some commonly used real-world datasets for time series effect estimation. For a more comprehensive list of datasets, please refer to [24].

  • MIMIC II/III Data [59]. This is a large-scale dataset containing rich information on patients admitted to critical care units at a large tertiary care hospital. Key covariates include blood pressure, oxygen saturation, administered medications, and other temporal attributes such as time-stamped nurse-verified physiological measurements.

  • Advertisement Data [60]. This dataset is used to measure the effect of an advertisement campaign on the number of times a user was directed to the advertiser's website from the Google search results page. Specifically, product-related ads displayed alongside Google's search results for specific keywords went live for 6 consecutive weeks. The outcome variable includes search-related visits to the advertiser's website, such as organic clicks.

  • Air Quality Data [61]. It is used to study whether US gasoline content regulations successfully reduced ozone pollution. The data consists of two sources: data on ambient air concentrations of ozone from the EPA (Environmental Protection Agency)’s Air Quality Standards database for 1989-2003; and weather data measurements from the National Climatic Data Center’s Cooperative Station Data (NOAA 2008), such as minimum and maximum temperatures.

2.3.6 Networked Data for Effect Estimation

Here, we present a list of existing semi-synthetic datasets for evaluating causal effect estimation with networked data [21, 62].

  • BlogCatalog [21] is a semi-synthetic observational network dataset. It is based on real-world network structure and node attributes collected from the blog website BlogCatalog. Each node is a blogger and the node attributes are bag-of-words representation of her/his keywords. Each edge represents a friendship relation. The treatments and outcomes are sampled from a set of predefined structural equations, where treatment (control) means the blogger’s content is viewed more by mobile (desktop) devices. Outcome is a real number standing for users’ opinion on the bloggers’ content. Flickr [21] is created in a similar way. Note that these two datasets did not consider interference.

  • Wave 1 [62] is an in-school questionnaire dataset collected by [63]. The authors of [62] create a KNN graph based on the similarity between instances in the original Wave 1 data. The KNN network has 5,578 nodes and 100,158 edges. Each node is a student, and node attributes include age, grade, health insurance, and so on. The outcome is student performance, and the treatment is whether a student is assigned to a tutoring program. Note that this dataset includes simulated peer effects to account for interference.

  • Pokec [62] is semi-synthetic and is generated from a real-world social network dataset [64]. Each node is a user, and node attributes include age, gender, education, and so on. The treatment indicates exposure to a certain medicine advertisement, and the outcome represents whether the user purchased the product. The purchase can be caused by either the treatment or interference. Similar to Wave 1, the peer effect is simulated from a predefined structural equation.

2.4 Discussion

Revisiting current evaluation pipelines for causal effect estimation reveals that much remains to be done towards an effective benchmarking framework. The major challenge we confront when collecting data for causal effect estimation is how to design ethical, fast, reliable, and easy-to-implement experiments. RCTs are mostly impractical given financial and ethical considerations. We also have limited access to the background variables useful for controlling for confounding bias. For example, users' social network information – a common hidden confounder in observational studies – can be used to approximate their socio-economic status [21, 65]. However, collecting personal data from social networking sites may contradict the terms of user privacy [66]. For effect estimation with networked data (e.g., social network information), in addition to the limited-data challenge, extra effort is needed to examine model robustness against interference-related assumptions. Given that many outcomes of interest in the social and health sciences can be ordinal and lack a meaningful scale, model evaluation requires careful treatment in these scenarios [67]. Moreover, collecting data with ground-truth ITEs is almost impossible in reality, because counterfactual outcomes cannot be observed at the individual level; otherwise, ITEs would not need to be estimated. In time series causal effect estimation, data collection is even more challenging because outcomes and time-varying covariates may take months or even years to observe [68, 24].

Popular alternatives use synthetic or semi-synthetic datasets where treatments and outcomes are synthesized based on real-world data. Therefore, a crucial first step is to develop high-quality simulations that mimic real-world data generating processes for benchmarking causal inference algorithms. From the data perspective, comparatively little effort has focused on data with auxiliary information under the unconfoundedness assumption. Real-world data have rich auxiliary information (e.g., social media data) from multiple modalities and can overcome the limitations of synthetic and semi-synthetic datasets; developing and evaluating causal models using such data is important. Another critical missing element in current evaluation procedures is how to validate models with observational data. There have been debates about whether cross-validation – the standard recipe for validating ML models – can be used for causal inference. More recently, it has become commonly accepted that cross-validation generally cannot be extended to causal inference tasks, because cross-validation improves predictive accuracy and assures a model's repeatability but says nothing about causality. Consequently, evaluation metrics and procedures for hyperparameter tuning and causal model selection are greatly needed. The overarching goal is to develop an open-source platform/software that provides objective, transparent, independent, and easy-to-use evaluation of causal algorithms.

3 Benchmarking Causal Structure Learning

Causal structure learning refers to the task of identifying the causal relations among a given set of variables $V = \{X_1, \ldots, X_n\}$. The goal is to generate a causal graph $G = \{V, E\}$ that represents the causal relations over the variables in $V$, where $E$ is the set of directed edges between the variables in $V$. A causal relation between two variables $X_i$ and $X_j$ is defined as:

Definition 3 (Causal Relation)

For any directed edge $e$ in $G$, $X_i \rightarrow X_j$ implies that $X_i$ is a direct cause of $X_j$ relative to the variables in $V$.

Causal structure learning can then be defined as

Definition 4 (Learning Causal Structure)

Given observed samples of $J$ variables, $\{X_{,j}\}_{j=1}^{J}$, we aim to determine whether the value of the $j$-th variable $X_{,j}$ would change significantly if we modified the value of the $j'$-th variable $X_{,j'}$, for all $j \neq j'$.

For instance, learning the causal structure in the previous graduate admission example might help answer the question "does changing the sex of an applicant causally affect her/his application result?". One can use causal graphs to identify the effect on other variables when a variable $X_i$ is subject to an intervention, i.e., when the value of $X_i$ is set to a fixed value. Once a variable is intervened on, a subgraph is generated by removing all directed edges that point towards $X_i$ in $G$. By repeating this process for all variables in $V$, one can identify the causal pathways that are active in any experiment. Given observational data, algorithms for causal structure learning aim to discover a set of candidate causal graphs [69] under the assumption that causality can be identified amongst statistical dependencies [70, 71]. Methods that recover causal dependence from time-series data have the benefit of temporal precedence and are often based on bivariate Granger causality tests [72], which are used to determine whether $X$ causes $Y$ or $Y$ causes $X$ (conditioned on a set of covariates). Graphical Granger methods use a series of Granger causality tests to determine the full structure among a set of variables.

3.1 Evaluation Metrics

Evaluating causal structure learning involves comparing each learned causal graph candidate $\hat{G}$ with the ground-truth graph $G$ to determine whether they belong to the same equivalence class.

Definition 5 (Equivalence Class)

Two causal graphs $G$ and $\hat{G}$ belong to the same equivalence class iff each conditional independence implied by $G$ is also implied by $\hat{G}$ and vice versa.

For example, Figures 2 and 3 show two causal graphs that belong to the same equivalence class, as they share the same set of conditional independencies $\{X_{,2} \perp\!\!\!\perp X_{,3} \mid X_{,1}\}$.

Figure 2: A causal graph with the conditional independence $X_{,2} \perp\!\!\!\perp X_{,3} \mid X_{,1}$.
Figure 3: A different causal graph with the same conditional independence $X_{,2} \perp\!\!\!\perp X_{,3} \mid X_{,1}$.

Commonly used evaluation metrics for causal structure learning fall into two major categories: (1) graph-distance-based measures and (2) classification-based measures. For the first category, we introduce the Structural Hamming Distance (SHD), the Structural Intervention Distance (SID), and the Frobenius norm. A comprehensive study of metrics for comparing learned causal graphs can be found in [73].

  • Structural Hamming Distance (SHD) [74, 75, 76]: Given two causal graphs, one being the ground-truth partially directed acyclic graph (DAG) and the other a predicted partially directed DAG (a graph $\mathcal{G}$ is a partially directed DAG if it contains no directed cycle, i.e., no pair $(j,k)$ such that there are directed paths from $j$ to $k$ and from $k$ to $j$), SHD is defined as the number of edits (adding, removing, or reversing an edge) that have to be made to the learned graph $\hat{G}$ for it to become the ground-truth graph $G$. It can be formulated as

    \mathrm{SHD} = A + D + I,    (13)

    where $A$ is the number of added edges, $D$ the number of deleted edges, and $I$ the number of wrongly oriented edges. Researchers generally report normalized SHD, i.e., the ratio of each baseline's SHD over the SHD of the proposed method (see the sketch after this list for a direct computation from adjacency matrices).

  • Frobenius Norm measures the difference between the adjacency matrices of two causal graphs [77]. Formally,

    \text{Frobenius Norm} = \sqrt{\operatorname{trace}\left\{(\mathbf{B}_{\text{true}} - \widehat{\mathbf{B}})^{T}(\mathbf{B}_{\text{true}} - \widehat{\mathbf{B}})\right\}},    (14)

    where $\mathbf{B}_{\text{true}}$ and $\widehat{\mathbf{B}}$ are the true and predicted adjacency matrices, respectively.

  • Structural Intervention Distance (SID): For causal structure learning methods, it is important to understand the causal interpretation of a graph, since it helps predict the results of interventions. Given a true DAG $G$ and an estimated DAG $\hat{G}$, SID counts the number of falsely inferred intervention distributions. SID is formulated as [74]

    \mathrm{SID}: \mathbb{G} \times \mathbb{G} \rightarrow \mathbb{N}, \quad (G, \hat{G}) \mapsto \#\{(i,j),\ i \neq j \mid \text{the intervention distribution from } i \text{ to } j \text{ is falsely estimated by } \hat{G} \text{ with respect to } G\},    (15)

    where $\mathbb{G}$ is the space of DAGs defined over the variables in $V$. SID is well-suited to evaluating graphs for interventions.
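For illustration, the following sketch (ours, not from the surveyed packages) computes SHD and the Frobenius norm directly from binary adjacency matrices, counting a reversed edge as a single edit (conventions differ across implementations):

```python
import numpy as np

def shd(a_true, a_pred):
    """Structural Hamming Distance (Eq. 13): additions, deletions, and reversals,
    with a reversed edge counted as one edit."""
    a_true, a_pred = np.asarray(a_true), np.asarray(a_pred)
    n = a_true.shape[0]
    d = 0
    for i in range(n):
        for j in range(i + 1, n):
            if (a_true[i, j], a_true[j, i]) != (a_pred[i, j], a_pred[j, i]):
                d += 1
    return d

def frobenius_norm(a_true, a_pred):
    """Eq. (14): Frobenius norm of the difference between adjacency matrices."""
    return np.linalg.norm(np.asarray(a_true) - np.asarray(a_pred), ord="fro")

G_true = np.array([[0, 1, 0],   # X1 -> X2
                   [0, 0, 0],
                   [1, 0, 0]])  # X3 -> X1
G_pred = np.array([[0, 1, 0],
                   [0, 0, 1],   # extra edge X2 -> X3
                   [0, 0, 0]])  # missing edge X3 -> X1
print(shd(G_true, G_pred), frobenius_norm(G_true, G_pred))  # 2 and sqrt(2)
```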

The second type of metrics is based on the intuition that directional adjacency relations can be treated as a binary classification problem; therefore, a variety of classification metrics can be used (a minimal computation sketch follows the list below).

  • Precision is the ratio of true positives (TP) over the sum of TP and false positives (FP), i.e., $\text{precision} = \frac{TP}{TP+FP}$.

  • Recall is defined as the ratio of TP over the sum of TP and false negatives (FN), i.e., $\text{recall} = \frac{TP}{TP+FN}$.

  • F1 score is the harmonic mean of the precision and recall of the learned structure as compared to the true causal structure.

  • False Positive Rate (FPR): In terms of graphs, FPR is defined as the ratio of the edges present in the predicted graph $E_M$ but absent from the ground-truth graph $E_{GT}$, over the difference between the set of all possible edges $E$ and the ground-truth edges. It is formulated as

    FPR = \sum_{i} \frac{e_i}{|E - E_{GT}|},\quad e_i \in E_M \backslash E_{GT}.    (16)

  • True Positive Rate (TPR): In terms of graphs, TPR is defined as the ratio of the edges common to the ground-truth and predicted causal graphs over the number of edges in the ground-truth graph. Formally,

    TPR = \sum_{i} \frac{e_i}{|E_{GT}|},\quad e_i \in E_M \cap E_{GT}.    (17)

  • MSE is defined as the sum of squared differences between the predicted and ground-truth adjacency matrices divided by the total number of nodes. It is formulated as

    MSE = \frac{1}{|T|}\sum (T - A)^2,    (18)

    where $T$ is the predicted adjacency matrix and $A$ is the ground-truth adjacency matrix.

  • Area under the ROC curve (AUC) is the area under the curve of recall (TPR) versus FPR at different thresholds.

  • Areas under the Precision-Recall and FPR-TPR curves can also be used [75, 78], given that accuracy is measured in terms of precision and recall.
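The sketch below illustrates the classification view: every off-diagonal entry of the adjacency matrix is treated as a binary edge prediction, from which precision, recall/TPR, FPR, and F1 follow; the example graphs are illustrative:

```python
import numpy as np

def edge_classification_metrics(a_true, a_pred):
    """Treat every off-diagonal directed edge slot as a binary label."""
    a_true, a_pred = np.asarray(a_true, bool), np.asarray(a_pred, bool)
    mask = ~np.eye(a_true.shape[0], dtype=bool)       # ignore self-loops
    t, p = a_true[mask], a_pred[mask]
    tp = np.sum(t & p)
    fp = np.sum(~t & p)
    fn = np.sum(t & ~p)
    tn = np.sum(~t & ~p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0       # equals TPR
    fpr = fp / (fp + tn) if fp + tn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall/TPR": recall, "FPR": fpr, "F1": f1}

G_true = np.array([[0, 1, 0], [0, 0, 0], [1, 0, 0]])
G_pred = np.array([[0, 1, 0], [0, 0, 1], [0, 0, 0]])
print(edge_classification_metrics(G_true, G_pred))
```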

Finally, recent works [79, 80] highlight that evaluating against structural measures alone may limit what researchers can learn from experimental studies of a model's performance, which in turn creates a gap between theory and practice. To overcome this limitation, they suggest several interventional measures that compare the learned distribution to the ground truth obtained through interventions. We now list common interventional measures.

  • Total Variation Distance (TVD) measures the distance between two probability distributions. Under TVD, the quality of an estimated distribution relative to a known distribution can be computed as

    TVD_{P,\hat{P},T=t}(O) = \frac{1}{2}\sum_{o \in \Omega(O)} \left|P(O=o \mid do(T=t)) - \hat{P}(O=o \mid do(T=t))\right|,    (19)

    where $P$ denotes the true interventional distribution, $\hat{P}$ denotes the estimated interventional distribution, and $\Omega(O)$ denotes the domain of $O$.

  • KL-Divergence measures how one probability distribution differs from another. Given the estimated interventional distribution $\hat{P}$ and the true interventional distribution $P$, the KL-Divergence for continuous distributions is defined as (a discrete version is sketched after this list)

    D_{KL}(P(x)\|\hat{P}(x)) = \int_{-\infty}^{\infty} P(x)\ln\frac{P(x)}{\hat{P}(x)}\,dx.    (20)
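For discrete interventional distributions represented as probability vectors over $\Omega(O)$, TVD and a discrete analogue of the KL-Divergence can be computed as in the following sketch (the small constant guarding the logarithm is our assumption):

```python
import numpy as np

def tvd(p, p_hat):
    """Eq. (19): total variation distance between two discrete distributions."""
    p, p_hat = np.asarray(p, float), np.asarray(p_hat, float)
    return 0.5 * np.sum(np.abs(p - p_hat))

def kl_divergence(p, p_hat, eps=1e-12):
    """Discrete analogue of Eq. (20): KL(P || P_hat)."""
    p, p_hat = np.asarray(p, float), np.asarray(p_hat, float)
    return np.sum(p * np.log((p + eps) / (p_hat + eps)))

# True vs. estimated P(O | do(T=t)) over a 3-valued outcome
p_true = [0.5, 0.3, 0.2]
p_est = [0.4, 0.4, 0.2]
print(tvd(p_true, p_est), kl_divergence(p_true, p_est))
```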

For time-series data, aside from the aforementioned metrics, the F-Test is a popular metric for evaluating causal structure learning approaches. Given a defined null hypothesis, parameters are estimated for both the restricted and the unrestricted models, and an F-statistic is then computed from the residual sums of squares (RSS) of the two models:

\frac{(RSS_R - RSS_{UR})/p}{RSS_{UR}/(T - 2p - 1)} \sim F_{p,\,T-2p-1},    (21)

where $T$ is the length of the time series, $p$ is the number of lags, $RSS_R$ is the RSS of the restricted model, and $RSS_{UR}$ is the RSS of the unrestricted model. For a more detailed review, readers can refer to [79, 80].
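A minimal sketch of the Granger-style F-test of Eq. (21): ordinary least squares is fit for a restricted model (lags of the target only) and an unrestricted model (lags of the target and the candidate cause), and the F statistic is formed from their residual sums of squares. The lag construction and simulated series are illustrative:

```python
import numpy as np

def lagged(series, p):
    """Stack columns [series_{t-1}, ..., series_{t-p}] aligned with targets series_{p..T-1}."""
    return np.column_stack([series[p - k: len(series) - k] for k in range(1, p + 1)])

def granger_f(y, x, p):
    """F statistic of Eq. (21) testing whether x Granger-causes y with p lags."""
    y, x = np.asarray(y, float), np.asarray(x, float)
    target = y[p:]
    X_r = np.column_stack([np.ones(len(target)), lagged(y, p)])   # restricted: lags of y only
    X_ur = np.column_stack([X_r, lagged(x, p)])                   # unrestricted: add lags of x
    rss_r = np.sum((target - X_r @ np.linalg.lstsq(X_r, target, rcond=None)[0]) ** 2)
    rss_ur = np.sum((target - X_ur @ np.linalg.lstsq(X_ur, target, rcond=None)[0]) ** 2)
    T = len(target)
    return ((rss_r - rss_ur) / p) / (rss_ur / (T - 2 * p - 1))

# Toy series where x drives y with a one-step delay
rng = np.random.default_rng(3)
x = rng.normal(size=500)
y = np.roll(x, 1) * 0.8 + rng.normal(scale=0.5, size=500)
print(granger_f(y, x, p=2))   # a large F suggests rejecting "x does not cause y"
```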

3.2 Evaluation Procedures

The evaluation procedure for algorithms that learn causal relations from observational data often adopts a transductive setting, where the evaluation is based on the structural difference between the learned graph and the ground-truth graph. This is similar to the standard procedure for evaluating a supervised learning algorithm. In addition to evaluating against structural measures, recent works emphasize the need to evaluate causal structure learning models with interventional measures. Unlike structural measures, interventional measures penalize edge mis-orientation errors in proportion to their effect on the estimation of interventional effects.

3.3 Benchmark Datasets

Datasets for this task fall into two categories: (a) datasets for learning the causal direction between two variables, and (b) datasets for learning the underlying causal graph over a set of observed variables. Ideal datasets are collected from real-world scenarios and annotated by experts in the corresponding fields to evaluate algorithms for learning causal relations. However, since it is extremely challenging to obtain such real-world ground-truth causal graphs, synthetic datasets are commonly used for benchmarking purposes. Popular datasets for learning causal directions include the Tübingen Cause-Effect Pairs, the Alzheimer's Disease Neuroimaging Initiative (ADNI) [81], AntiCD3/CD28 [82], and Note [83]. Datasets for learning causal graphs include the LUng CAncer Simple set (LUCAS), the LUng CAncer set with Probes (LUCAP), and Random Chordal Graphs. We briefly introduce these datasets below.

3.3.1 Datasets for Learning Causal Directions

We detail exemplar datasets for learning causal relations between two variables as follows:

  • Tübingen Cause-Effect Pairs [37]: It consists of real-world samples from cause-effect pairs and is collected across various subject areas.

  • ADNI [81]: Alzheimer’s Disease Neuroimaging Initiative consists of 819 subjects who were enrolled at baseline and followed for 12 months using standard cognitive and functional measures typical of clinical trials.

  • AntiCD3/CD28 [82]: This is a dataset of causal protein-signal networks with 853 instances of multivariable individual-cell data. Each variable stands for a biomolecule.

  • Note [83]: This is a time series dataset for inferring the arrow (direction) of time.

  • Abalone [84]: It contains 4,177 samples and each sample has 4 attributes: Sex, Length, Diameter, and Height.

3.3.2 Datasets for Learning Causal Graphs

  • LUCAS and LUCAP. LUCAS (LUng CAncer Simple set) and LUCAP (LUng CAncer set with Probes) consist of data generated by predefined causal graphs with binary variables.

  • Random Chordal Graphs. This is a synthetic dataset generated using the approach discussed in [85, 86].

  • CausalWorld. The CausalWorld [87] dataset is a comprehensive robotic benchmark dataset for transfer learning. It provides a combinatorial family of tasks with common causal structures and underlying factors, allowing the user to intervene on the causal variables to determine the similarity level of different tasks.

  • SynTReN. SynTReN is a network generator that creates synthetic transcriptional regulatory networks and produces simulated gene expression data that approximates experimental data. Topologies are created by selecting subnetworks from previous regulatory networks. Results in [88] show that the simulated topologies are closer to the real biological networks when compared to different random graph models. Several user-definable parameters adjust the complexity of the resulting dataset w.r.t. the structure learning algorithms.

  • Causality 4 Climate. The Causality 4 Climate (C4C) [89] dataset comes from a NeurIPS 2019 competition for causal structure learning on climate time series. The dataset consists of an extensive number of climate model-based time series datasets with known causal ground-truth. These datasets incorporate the main challenges of causal structure learning in climate research.

3.3.3 Causal Structure Learning Datasets for Time Series

  • Temperature Ozone Data [90, 91, 37]: This dataset consists of ozone and radiation levels across 72 points in time and 16 different places. The assumed ground truth is that radiation has a causal effect on ozone.

  • Neural Activity Dataset: This dataset consists of real-time whole-brain imaging recordings of the neural activity of Caenorhabditis elegans. It covers 302 neurons and is generally used to identify which neurons are responsible for movement.

  • Traffic Prediction Dataset [92]: This dataset contains four months' worth of sensor data from Los Angeles, California, collected by 207 sensors. The location of each sensor, in the form of GPS coordinates, is also included in the dataset.

  • US Manufacturing Growth Data [93]: This dataset consists of microeconomic data of growth rates of US manufacturing firms in terms of employment, sales, research & development (R&D) expenditure, and operating income, for the years 1973–2004. It can be used to identify the causal variables that affect the growth rate of a firm.

3.4 Discussions

The transductive setting has been the norm in the literature of learning causal relations [69, 94], in part because the conventional setting of learning causal relations is similar to that of supervised learning. However, in observational studies, the ground-truth causal relations may not be used to train the algorithms even during the training phase. In addition, Guo et al. pointed out in [16] that most existing algorithms can only discover causal relations among variables that are observed in the training data. Hence, it remains an open problem to develop causal inference algorithms that can generalize to unseen variables, i.e., an inductive evaluation setting. Furthermore, the combination of observational and experimental data may provide unique opportunities to identify a model that, under various assumptions, can extract true cause-effect relationships. A recent line of research [95, 96] focuses on learning causal relations from the combination of observational data and interventional data on some selected variables. This is promising as it may overcome the limitations of using purely observational or RCT data. To this end, learning causal relations with the combined data can be reduced to an easier problem where the goal is to identify a certain set of interventions.

4 Evaluation of Causality-Aware ML Tasks

As a critical ingredient for AI to achieve human-level intelligence, causal inference has found itself in the spotlight of Trustworthy ML and Socially Responsible AI research [66, 97]. In this section, we discuss the evaluation pipeline of two representative tasks: (1) causal interpretability and fairness and (2) unbiased interactive ML.

4.1 Causal Interpretability and Fairness

The surge of ML algorithms for decision making in critical fields, such as law-making and autonomous driving, has made it necessary for the decisions made by these models to be interpretable by humans [98]. Interpretability in ML is defined as a model’s “ability to explain or to present in understandable terms to a human” [99]. To ensure the reliability and transparency of such decisions, causal interpretability aims to explain them by answering counterfactual questions such as “what decisions would have been made under alternative situations (e.g., being trained with different inputs [100, 101] or model components [102, 103])?” We refer to this as a model’s causal interpretability. Causally interpretable models are therefore human-friendly and aim to answer causal questions.

4.1.1 Problem Statement

There is no unified definition of the causal interpretability of an ML model [26]. In this survey, we use a common definition in which a model’s causal interpretability is judged by the counterfactual explanations it generates for a set of inputs. Note that we skip model-level counterfactual explanations in this paper due to the lack of proper evaluation metrics and benchmarks for such frameworks in the literature [26].

Definition 6 (Example-Level Counterfactual Explanation)

Given an ML model and a predefined label, a counterfactual explanation of an instance is defined as a new instance that is generated by performing minimal changes to the original instance’s features such that the model predicts the predefined label.

For example, when a person’s credit card application fails, she may be interested in explanations that specify what minimal changes to her profile would allow the application to pass. Note that, while similar to the definition of an adversarial example, the example-level counterfactual is designed to explain the decision of the model rather than attack it. Generally, example-level counterfactual explanations aim to answer questions such as “Why does this model predict a specific label for an instance?” or “Was it the $i$-th feature of the instance that caused the model to predict this label?”. Such questions can be answered using counterfactual inference [14]: counterfactual distributions are a new type of conditional probability (e.g., $P(y_{x}|x^{\prime},y^{\prime})$) that indicates how likely the observed outcome $y^{\prime}$ would change had $X$ been set to $x$ instead of the observed $x^{\prime}$.
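To make the notion of a minimal change concrete, the following Python sketch searches for an example-level counterfactual of a toy classifier by random perturbation; the synthetic data, the classifier, and the naive search procedure are illustrative assumptions rather than any specific method from the surveyed literature.

```python
# A minimal sketch (not a surveyed method): search for an example-level
# counterfactual of a toy classifier by random perturbation, keeping the
# change to the original instance as small as possible.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=500, n_features=5, random_state=0)
clf = LogisticRegression().fit(X, y)

x0 = X[0]                                         # original instance
target = 1 - clf.predict(x0.reshape(1, -1))[0]    # the predefined (flipped) label

best_cf, best_dist = None, np.inf
for _ in range(5000):                             # naive random search over perturbations
    delta = rng.normal(scale=0.5, size=x0.shape)
    candidate = x0 + delta
    if clf.predict(candidate.reshape(1, -1))[0] == target:
        dist = np.abs(delta).sum()                # l1 size of the change
        if dist < best_dist:
            best_cf, best_dist = candidate, dist

print("counterfactual found:", best_cf is not None, "| l1 change:", best_dist)
```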

Fairness also relates to making models more transparent and interpretable [98]. ML frameworks used for decision making in critical domains such as law-making are required to make fair decisions and not discriminate against specific groups of people [104]. Several frameworks have tried to combine causal inference with fairness and propose a criterion for making fair decisions [105, 106].

4.1.2 Evaluation Metrics

We here discuss common evaluation metrics for two primary tasks in causal interpretability – example-level counterfactual explanation and fairness.

Example-Level Counterfactual Explanations. One of the big challenges in evaluating counterfactual explanations is the lack of ground truth in traditional ML datasets. A commonly used alternative is to gauge the quality of the counterfactual explanations generated for an ML model with metrics that quantify certain characteristics of these explanations. Moraffah et al. [26] have classified these metrics based on the properties of the explanation they are designed to measure. In particular, six metrics – sparsity, interpretability, proximity, speed, diversity, and visual-linguistic metrics – have been suggested to measure the quality of the generated counterfactual explanations from different perspectives.

Sparsity metrics measure the amount of perturbation added to the original input to transform it into its corresponding counterfactual explanation; the lower the amount of perturbation, the better the quality of the counterfactual. Interpretable counterfactual explanations are often close to the training data manifold [107]. To gauge the closeness of counterfactuals to the data manifold, interpretability metrics are utilized; they are generally computed from the reconstruction errors of the counterfactuals [107]. Proximity is another counterfactual evaluation metric, which evaluates how close the counterfactual explanations are to the original samples. Unlike interpretability metrics, proximity is usually computed as the $l_{p}$ distance between the original and counterfactual samples. Since most interpretability frameworks are designed for real-world applications, counterfactual generation approaches should be fast; speed metrics measure the pace of counterfactual generation. Another characteristic of high-quality counterfactuals is their diversity, meaning that the counterfactuals generated for a given input should differ from one another; diversity metrics measure this property of counterfactual explanation frameworks. Finally, to measure the quality of multi-modal counterfactual explanations such as visual-linguistic explanations, one needs to report the positiveness/negativeness of the visual explanations as well as the consistency of the visual and linguistic explanations. Table 2 summarizes these metrics with corresponding descriptions. For more detailed explanations, please refer to [26].

# | Counterfactual Characteristic | Description
1 | Sparsity | Measures the amount of perturbation used to generate the counterfactual example [107, 108]
2 | Interpretability | Measures the closeness of the counterfactual example to the data manifold [107]
3 | Proximity | Measures the similarity of the counterfactual example to the original sample [109]
4 | Speed | Measures the pace of counterfactual generation [107]
5 | Diversity | Measures the diversity of the generated counterfactuals [109]
6 | Visual-Linguistic Counterfactuals | Measures the positiveness/negativeness of the visual explanation and the consistency of the linguistic explanation with its visual counterpart [110]
Table 2: A summary of evaluation metrics for counterfactual explanations [26].
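As a rough illustration of how some of the proximal metrics in Table 2 can be computed, the Python sketch below implements simple versions of sparsity, proximity, and diversity; the exact formulas vary across papers [107, 109], so the definitions here are illustrative assumptions.

```python
# A rough sketch of three proximal metrics from Table 2; the exact formulas
# vary across papers, so these definitions are illustrative assumptions.
import numpy as np

def sparsity(x, x_cf, tol=1e-6):
    """Fraction of features that were changed to obtain the counterfactual."""
    return float(np.mean(np.abs(x - x_cf) > tol))

def proximity(x, x_cf, p=1):
    """l_p distance between the original sample and its counterfactual."""
    return float(np.linalg.norm(x - x_cf, ord=p))

def diversity(cf_set, p=1):
    """Mean pairwise l_p distance among counterfactuals generated for one input."""
    cf_set = np.asarray(cf_set)
    n = len(cf_set)
    dists = [np.linalg.norm(cf_set[i] - cf_set[j], ord=p)
             for i in range(n) for j in range(i + 1, n)]
    return float(np.mean(dists)) if dists else 0.0

x = np.array([1.0, 0.0, 3.0])
cfs = [np.array([1.0, 1.0, 3.0]), np.array([0.5, 0.0, 2.0])]
print(sparsity(x, cfs[0]), proximity(x, cfs[0]), diversity(cfs))
```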

Fairness. Evaluation of causal fairness is challenging and is typically done by assessing models’ performance in reducing biases. Below, we discuss state-of-the-art causality-based notions of fairness. For more details, please refer to [111].

Khademi et al. [112] propose two notions based on causal effect estimation, i.e., FACE (Fair on Average Causal Effect) and FACT (Fair on Average Causal Effect on the Treated), under the potential outcome framework [28]. With the FACE notion, a classifier is fair iff:

\mathbb{E}[Y_{i}^{(a_{1})}-Y_{i}^{(a_{0})}]=0, (22)

which is equivalent to calculating the ATE. The FACT notion, on the other hand, is based on the average causal effect on the treated group, i.e., the ATT. Based on this definition, a classifier is fair iff:

\mathbb{E}[Y_{i}^{(a_{1})}-Y_{i}^{(a_{0})}|A^{i}=a_{0}]=0. (23)
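As a hedged illustration, assuming ignorability of the sensitive attribute given observed covariates (a strong assumption), FACE and FACT can be approximated by plugging in inverse-propensity-weighted (IPW) estimates of the ATE and ATT of the sensitive attribute on the classifier output; the synthetic data and variable names below are assumptions for illustration only.

```python
# A hedged sketch: approximating FACE (Eq. 22) and FACT (Eq. 23) with IPW-style
# ATE/ATT estimates of the sensitive attribute A on the classifier output Y,
# assuming ignorability given covariates X (a strong, untestable assumption).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 5000
X = rng.normal(size=(n, 3))                                   # observed covariates
A = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))               # sensitive attribute
Y = rng.binomial(1, 1 / (1 + np.exp(-(0.5 * A + X[:, 1]))))   # simulated classifier decisions

e = LogisticRegression().fit(X, A).predict_proba(X)[:, 1]     # propensity P(A=1 | X)

# FACE (Eq. 22) ~ IPW estimate of E[Y^(a1)] - E[Y^(a0)]
face = np.mean(A * Y / e) - np.mean((1 - A) * Y / (1 - e))

# FACT (Eq. 23) ~ ATT-style estimate: group mean for A=1 minus re-weighted mean for A=0
fact = Y[A == 1].mean() - np.sum((1 - A) * Y * e / (1 - e)) / np.sum((1 - A) * e / (1 - e))

print(f"FACE estimate: {face:.3f}, FACT estimate: {fact:.3f}")
```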

Another mainstream fairness notion, proposed by Kusner et al. [105], is coined counterfactual fairness; it states that a classifier is fair if for any background variables $X$ we have:

P(y_{a_{1}}|X=x,A=a_{0})=P(y_{a_{0}}|X=x,A=a_{0}), (24)

where $A$ is a set of protected attributes, $Y$ is the outcome, and $X=V\backslash\{A,Y\}$ denotes the remaining variables in the system. Due to the challenge of causal identification using observational data, Wu et al. [106] propose path-specific counterfactual fairness (PC-Fairness), a generalized causal fairness notion which covers various causality-based fairness notions by tuning its hyperparameters. A classifier achieves PC-Fairness iff:

P(y_{a_{1}|\pi,a_{0}|\bar{\pi}}|O)-P(y_{a_{0}}|O)=0, (25)

where $Y$ is the outcome, $\pi$ denotes causal paths, and $O$ is a set of observed variables.
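When the structural equations are assumed known, counterfactual fairness in the sense of Eq. (24) can be probed empirically via abduction-action-prediction. The Python sketch below does so on a toy linear SCM; the SCM, the classifier, and all coefficients are illustrative assumptions rather than part of the notions above.

```python
# A minimal sketch of checking counterfactual fairness (Eq. 24) when the SCM is
# known: recover the exogenous noise of X (abduction), intervene on A, regenerate
# X, and compare the classifier's predictions. The linear SCM and the classifier
# are toy assumptions for illustration only.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n = 4000
A = rng.binomial(1, 0.5, size=n)            # protected attribute
U_X = rng.normal(size=n)                    # exogenous noise
X = 1.5 * A + U_X                           # assumed structural equation for X
Y = (X + rng.normal(scale=0.5, size=n) > 1.0).astype(int)

clf = LogisticRegression().fit(np.column_stack([X, A]), Y)

# Abduction: U_X = X - 1.5 * A, then intervene A := 0 and A := 1.
U_hat = X - 1.5 * A
X_a0, X_a1 = 1.5 * 0 + U_hat, 1.5 * 1 + U_hat
p_a0 = clf.predict_proba(np.column_stack([X_a0, np.zeros(n)]))[:, 1]
p_a1 = clf.predict_proba(np.column_stack([X_a1, np.ones(n)]))[:, 1]

# Counterfactually fair predictions would make this gap (near) zero.
print("mean |P(y_{a1}) - P(y_{a0})|:", np.abs(p_a1 - p_a0).mean())
```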

Built on causal graphs, Zhang et al. [113] suggest counterfactual direct effect (Ctf-DE), indirect effect (Ctf-IE), and spurious effect (Ctf-SE) fairness measures. These metrics essentially assess the transmission from cause to effect. They are defined as:

\text{Ctf-DE}_{x_{0},x_{1}}(y|x)=P(y_{x_{1},W_{x_{0}}}|x)-P(y_{x_{0}}|x); (26)
\text{Ctf-IE}_{x_{0},x_{1}}(y|x)=P(y_{x_{0},W_{x_{1}}}|x)-P(y_{x_{0}}|x); (27)
\text{Ctf-SE}_{x_{0},x_{1}}(y|x)=P(y_{x_{0}}|x_{1})-P(y|x_{0}), (28)

where $X=x_{1}$ is the intervention on the outcome $Y=y$, $X=x_{0}$ is the baseline value of the sensitive attribute, and $W$ indicates the mediator. Despite recent success in utilizing causality to measure fairness, evaluation of such frameworks is still an open problem, mainly due to the “impossibility of fairness” [114]: existing fairness notions appear to be internally consistent but are often mutually incompatible.

4.1.3 Evaluation Procedures

Counterfactual explanations belong to post-hoc interpretability and are generated after a black-box ML model is trained. The first step of the evaluation procedure is to train an ML model on a regular dataset such as Adult-income [115]. Then, we generate counterfactual explanations based on the trained model and the dataset. Finally, we evaluate the generated counterfactual explanations using existing criteria such as those in Table 2. The standard procedure in existing work (e.g., [109]) is transductive. One extension is its inductive counterpart, where we check whether a method can generate counterfactual explanations for an unseen instance from a hold-out set [116].

4.1.4 Benchmark Datasets

To evaluate the causal interpretability of black-box ML models, we need datasets with two components: (1) a dataset for training and testing an ML task, and (2) a source of ground truth that allows us to evaluate the causal interpretability of the generated explanations. Ideally, this would be a set of ground-truth counterfactual explanations for a given data instance and an ML model. Such ground truth can be extremely expensive to acquire manually given the large number of data instances and ML models. Hence, the second component is commonly replaced by a set of proximal metrics [109]. These metrics can be conveniently computed given the generated counterfactual explanations and the corresponding instances from the original dataset. Below we introduce common datasets used for interpretability. It is worth mentioning that most of these datasets are not specifically designed for evaluating causal interpretability. Therefore, they do not come with ground-truth counterfactual explanations that capture the causal aspect of the model. However, the evaluation metrics can be leveraged as proximal measures to determine the quality of a generated counterfactual explanation. Counterfactual explanations are model-agnostic and can therefore work with datasets for most ML tasks. We categorize these datasets as Image, Text, and Tabular data. Below are exemplar datasets in each category.

Image Datasets.

  • ImageNet (ILSVRC) [117] is an image dataset organized according to the WordNet hierarchy. There are more than 100,000 synonym sets (synsets) in WordNet, the majority of which are nouns (80,000+). The ImageNet team aims to provide on average 1,000 images to illustrate each synset. Images of each concept are quality-controlled and human-annotated. The task is to predict which synsets an image belongs to.

  • MNIST [118] consists of images of handwritten digits. It has a training set of 60,000 examples and a test set of 10,000 examples. It is a subset of a larger set available from NIST. The digits have been size-normalized and centered in a fixed-size image.

  • PASCAL VOC 2012 [119] contains 12,031 training and validation images and 1,456 test images. Among the 12,031 training and validation images, 2,913 have associated ground-truth object regions, split evenly into segmentation train-2012 and val-2012 sets.

Text Datasets.

  • 20 Newsgroup Dataset [120] This dataset is a collection of nearly 20,000 news documents and is partitioned (nearly) evenly across 20 different newsgroups. This dataset has become increasingly popular for evaluating text mining and natural language processing algorithms w.r.t. text classification and clustering tasks.

  • IMDB [121] The IMDB dataset contains 25,000 training documents, 25,000 test documents, and 50,000 unlabeled documents. It is a dataset for the binary sentiment classification task and consists of movie reviews retrieved from the IMDB website (https://www.imdb.com/). The labeled documents are sampled such that there is a 1:1 ratio between negative and positive documents.

  • Amazon reviews [122] This dataset contains product reviews and metadata from Amazon, including up to 142.8 million reviews spanning May 1996 - July 2014. In particular, it includes reviews (ratings, text, helpfulness votes), product metadata (descriptions, category information, price, brand, and image features), and links (also-viewed/also-bought graphs). One of the tasks that can be evaluated with this dataset is binary sentiment classification.

Tabular Datasets. The UCI repository [123] provides a myriad of tabular datasets used in the ML literature. Here, we list those that are widely used for evaluating causal interpretability.

  • Adult-income [115] This dataset contains demographic, educational, and other information based on the 1994 Census database and is available in the UCI ML repository [123]. The dataset is preprocessed to include 8 features, namely, hours per week, education level, occupation, work class, race, age, marital status, and sex. The ML model’s task is to classify whether an individual’s income is over $50,000.

  • German loan dataset [124]. This dataset contains 1,000 observations of loan applicants including numeric, categorical, and ordinal attributes. The label indicates whether the application is successful.

  • LendingClub (https://www.lendingclub.com/info/download-data.action). This dataset consists of 5 years of loan records (2007–2011) issued by the LendingClub company. After preprocessing, it contains 8 features: employment years, annual income, number of open credit accounts, credit history, loan grade as decided by LendingClub, home ownership, purpose, and the state of residence in the U.S.

  • COMPAS. This is a dataset collected by ProPublica [125]. It contains information for analysis on recidivism decisions in the U.S. After preprocessing, this dataset contains the following 5 features: bail applicant’s age, gender, race, prior count of offenses, and the degree of criminal charge.

4.2 Unbiased Interactive ML

Recent years have witnessed the increasing importance of interactive ML models and systems, such as search ranking and recommendation systems, in both academia and industry. Training such interactive models and systems differs from standard prediction tasks because the supervision signals or labels – e.g., ratings, clicks, and purchases – come from users’ online behavior, collected through interactions between users and an ML model. The fundamental challenge in learning interactive ML models arises from the fact that they are trained/optimized with biased log data whilst the goal is to optimize the performance in online experiments where RCTs can be conducted [126, 127]. This mismatch between training and test data makes causality critical for understanding and mitigating various types of bias in the process of training interactive ML models. In this section, we illustrate the uses of CL in one of the most extensively studied tasks in interactive ML – unbiased learning to rank. Another related task is debiasing recommendation systems [25].

4.2.1 Problem Statement

Unbiased learning to rank is formally defined as follows.

Definition 7 (Unbiased Learning to Rank)

Given a search result page (SERP) of a query $q$ in the search log, $\bm{X}^{q}\in\mathcal{R}^{m^{q}\times d}$, $\bm{y}^{q}\in\{1,...,m^{q}\}^{m^{q}}$, and $\bm{c}^{q}\in\{0,1\}^{m^{q}}$ denote the feature matrix, the vector of item indices, and the vector of implicit feedback, respectively. $d$ is the number of features and $m^{q}$ is the number of items shown in the SERP of query $q$. Let $\mathcal{Q}$ and $\mathcal{Q}^{\prime}$ be the sets of queries for an offline training set and a test set, respectively. $f:\mathcal{R}^{d}\rightarrow\mathcal{R}$ denotes a ranking system that maps the feature vector of an item to its ranking score. Given offline search log data $\{\bm{X}^{q},\bm{y}^{q},\bm{c}^{q}\}_{q\in\mathcal{Q}}$, the goal is to optimize a specific ranking metric (e.g., NDCG) on a test set $\{\bm{X}^{q},\bm{r}^{q}\}_{q\in\mathcal{Q}^{\prime}}$, where $\bm{r}^{q}$ denotes the vector of unbiased feedback. $\bm{r}^{q}$ can be collected from RCTs or manual labeling (e.g., relevance scores).

In an RCT, we rank all items randomly and collect feedback from users. We call such feedback unbiased as it is only causally influenced by users’ preferences. With offline log data, in contrast, a user’s feedback on an item is influenced by the user’s preference, the position of the item as ranked by the logging policy (the existing ranking system that generated the log data), and other potential factors (e.g., the context of the item [128]). In some cases, if we do not have access to unbiased implicit feedback from RCTs for the test set, we can also evaluate a ranking system based on relevance scores as in [129, 130, 131].

4.2.2 Evaluation Procedures

The evaluation procedure of unbiased learning to rank is mostly similar to that of a standard learning to rank task in information retrieval. The major difference is that the training set and the test set come from different sources: the former is typically a biased offline dataset whereas the latter is a dataset with unbiased feedback (often collected through RCTs). Given a test set $\{\bm{X}^{q},\bm{r}^{q}\}_{q\in\mathcal{Q}^{\prime}}$ and the corresponding predicted rankings $\{\hat{\bm{y}}^{q}\}_{q\in\mathcal{Q}^{\prime}}$ of a ranking system, we introduce the following two most popular metrics to evaluate its performance.

Normalized Discounted Cumulative Gain (NDCG@K) [132] is defined as

\text{NDCG@K}=\frac{1}{\text{IDCG@K}}\sum_{i=1}^{K}\frac{2^{r^{q}_{i}}-1}{\log_{2}(i+1)}, (29)

where $K$ is a positive integer (e.g., 10 or 50) that varies by application, and IDCG@K is the DCG@K of the ideal ordering of the items (i.e., the maximum attainable value of the sum above), which serves as the normalizer ensuring NDCG@K lies in the range $[0,1]$. Note that an item’s rank is determined by the prediction $\hat{\bm{y}}^{q}$ of the ranking system. NDCG@K is a weighted sum of a function of the unbiased feedback (e.g., relevance scores) of the items ranked in the top-K positions. A larger NDCG@K indicates that the top-K ranked items are more relevant.

Mean Average Precision (MAP@K) is defined as

\text{MAP@K}=\frac{1}{K}\sum_{i=1}^{K}\frac{|\{j\,|\,r^{q}_{j}=1,\,j=1,...,i\}|}{i}. (30)

MAP@K describes how accurate a ranking system’s ranked predictions are, on average, over a whole validation/test dataset. The primary difference between MAP@K and NDCG@K is that the former assumes binary relevance, i.e., either relevant or irrelevant, whereas the latter also accommodates graded relevance values.
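For concreteness, the following Python sketch computes NDCG@K and MAP@K for a single query from predicted scores and unbiased relevance labels, following Eqs. (29)-(30); the example scores and labels are illustrative.

```python
# A sketch of NDCG@K (Eq. 29) and MAP@K (Eq. 30) for a single query, given a
# ranking system's predicted scores and the unbiased relevance labels r^q.
import numpy as np

def dcg_at_k(rels, k):
    rels = np.asarray(rels, dtype=float)[:k]
    return np.sum((2 ** rels - 1) / np.log2(np.arange(2, len(rels) + 2)))

def ndcg_at_k(rel_by_rank, k):
    """rel_by_rank: relevance labels ordered by the predicted ranking."""
    ideal = dcg_at_k(sorted(rel_by_rank, reverse=True), k)   # DCG of the ideal ordering
    return dcg_at_k(rel_by_rank, k) / ideal if ideal > 0 else 0.0

def map_at_k(rel_by_rank, k):
    """Binary relevance assumed; here labels > 0 are treated as relevant."""
    rel = (np.asarray(rel_by_rank[:k]) > 0).astype(float)
    precisions = np.cumsum(rel) / np.arange(1, len(rel) + 1)  # precision@i for i=1..K
    return precisions.mean() if len(rel) else 0.0

scores = np.array([0.9, 0.2, 0.7, 0.4])      # ranking system outputs
rels = np.array([3, 0, 2, 1])                # unbiased relevance labels
order = np.argsort(-scores)                  # predicted ranking
print(ndcg_at_k(rels[order], k=3), map_at_k(rels[order], k=3))
```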

In addition, there are metrics proposed specifically for measuring the popularity bias of ranking algorithms. The first metric is Average Recommended Popularity (ARP) [133], defined as:

\text{ARP@K}=\frac{1}{|\mathcal{Q}^{\prime}|}\sum_{q\in\mathcal{Q}^{\prime}}\frac{\sum_{i=1}^{K}\phi_{i}^{q}}{K}, (31)

where $\phi_{i}^{q}$ denotes the popularity of the document or product ranked at position $i$ in the SERP of query $q$. We can also focus on the documents or products that are least popular, namely the long-tail documents or products. This leads to the Average Percentage of Long Tail Items (APLT@K) [133], defined as:

\text{APLT@K}=\frac{1}{|\mathcal{Q}^{\prime}|}\sum_{q\in\mathcal{Q}^{\prime}}\frac{\sum_{i=1}^{K}1(i\in\mathcal{LT})}{K}, (32)

where $1(\cdot)$ is an indicator function and $\mathcal{LT}$ denotes a set of long-tail documents or products pre-defined by their popularity.
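A minimal sketch of ARP@K and APLT@K (Eqs. (31)-(32)) is given below; the per-query popularity values and the long-tail set are illustrative assumptions.

```python
# A sketch of ARP@K (Eq. 31) and APLT@K (Eq. 32). The popularity values and the
# long-tail set below are illustrative assumptions.
import numpy as np

def arp_at_k(popularity_by_query, k):
    """popularity_by_query: list of arrays, popularity of ranked items per query."""
    return float(np.mean([np.mean(pop[:k]) for pop in popularity_by_query]))

def aplt_at_k(items_by_query, long_tail, k):
    """Fraction of top-K items that belong to the predefined long-tail set."""
    return float(np.mean([np.mean([i in long_tail for i in items[:k]])
                          for items in items_by_query]))

pops = [np.array([120, 40, 5, 3]), np.array([80, 70, 2, 1])]
items = [["a", "b", "c", "d"], ["e", "f", "g", "h"]]
long_tail = {"c", "d", "g", "h"}   # e.g., the least popular part of the catalog
print(arp_at_k(pops, k=3), aplt_at_k(items, long_tail, k=3))
```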

4.2.3 Benchmark Datasets

Data collection in unbiased learning to rank confronts the general challenges in CL tasks. The alternative is to create semi-synthetic datasets from a search log dataset with ground-truth relevance scores for each query-item pair [131, 130]. First, we train a ranking system (e.g., RankSVM [131]) on 1% of the original training dataset. We then generate a list of items ranked in order using the entire training dataset $\{\bm{X}^{q},\bm{r}^{q}\}_{q\in\mathcal{Q}}$. The next step employs a click model (e.g., the position-based click model [134]) to simulate users’ clicks. Specifically, the position-based model assumes $P(c_{i}^{q}=1)=P(o_{i}^{q}=1)P(z_{i}^{q}=1)$, indicating that a click on the $i$-th item can be observed iff the item is examined ($o_{i}^{q}=1$) and relevant ($z_{i}^{q}=1$). The binary relevance label $z_{i}^{q}$ is defined as a function of the relevance score $r_{i}^{q}$:

P(z_{i}^{q}=1)=\epsilon+(1-\epsilon)\frac{2^{r_{i}^{q}}-1}{2^{4}-1}, (33)

where $\epsilon$ denotes the click noise and is often set to 0.1 [131, 130]. The maximum value of $r_{i}^{q}$ is 4. The probability of examination is defined as a function of the position-based examination probability $\bm{p}$ [131, 130]:

P(o_{i}^{q}=1)=(p_{i})^{\eta}, (34)

where $\eta\geq 1$ is a hyperparameter set to 1 by default. One popular dataset used for evaluating unbiased learning to rank models is the Yahoo! Learning to Rank Challenge dataset, which contains 29,921 queries and 710k items [135]. Each query-item pair is represented by a 700-dimensional feature vector and associated with a ground-truth relevance score in $\{0,1,2,3,4\}$. Any search log dataset with ground-truth relevance scores or with feedback collected through RCTs can potentially be used for this task.
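The semi-synthetic click simulation described above can be sketched as follows; the examination probabilities, relevance labels, and hyperparameter values are illustrative and follow Eqs. (33)-(34).

```python
# A sketch of the position-based click simulation in Eqs. (33)-(34). The
# examination probabilities p and relevance labels r below are illustrative.
import numpy as np

rng = np.random.default_rng(3)

def simulate_clicks(r, p, eps=0.1, eta=1.0, r_max=4):
    """Simulate one impression of a ranked list with relevance labels r."""
    p_rel = eps + (1 - eps) * (2.0 ** r - 1) / (2 ** r_max - 1)   # Eq. (33)
    p_obs = p ** eta                                              # Eq. (34)
    relevant = rng.binomial(1, p_rel)
    observed = rng.binomial(1, p_obs)
    return observed * relevant            # click iff examined and relevant

r = np.array([4, 2, 0, 3, 1])             # ground-truth relevance of top-5 items
p = 1.0 / np.arange(1, 6)                 # examination probability decays with position
print(simulate_clicks(r, p))
```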

5 Causal Inference Tools

In this section, we examine popular tools/packages for benchmarking causal inference, including tools for causal effect estimation – CausalML, EconML, DoWhy, and CauseBox; for causal structure learning – CausalNex, CausalDiscovery, pcalg, bnlearn, and TETRAD; and for evaluation – Causality-Benchmark and JustCause.

  • CausalML [136] implements an array of uplift modeling and causal inference methods using ML algorithms. It provides a standard interface for users to estimate the CATE or ITE from experimental or observational data, without strong assumptions on the model form. CausalML currently supports tree-based algorithms (e.g., uplift random forests on KL divergence), meta-learner algorithms (e.g., S-learner, T-learner), and IV algorithms (e.g., 2-Stage Least Squares). Covered evaluation metrics include RMSE and MAE.

  • EconML [137] estimates heterogeneous treatment effects from observational data via ML. The goal of EconML is to combine state-of-the-art ML techniques with econometrics to automatically solve complex causal inference problems. The supported estimation methods are in the intersection of econometrics and ML, including double ML, orthogonal random forests, meta-learners, doubly robust learners, orthogonal IV, and deep IV. It does not include evaluation metrics for effect estimation.

  • Informed by conventional ML libraries for prediction, DoWhy [138] provides a unified interface for causal inference methods under the two fundamental frameworks – graphical models and potential outcomes. First, DoWhy creates a causal graphical model for each problem to describe the causal assumptions. It then uses graph-based criteria and do-calculus to find all potential ways of identifying a desired causal effect. In the third step, it estimates causal effects based on the identified estimand. Finally, DoWhy validates an effect estimate from a causal estimator using multiple refutation methods such as adding unobserved common causes and bootstrap validation. DoWhy can also call external estimation methods such as EconML and CausalML. A minimal usage sketch is given after this list.

  • CausalNex (https://github.com/quantumblacklabs/causalnex) uses Bayesian networks to combine ML and domain expertise for causal structure learning and effect estimation. It deploys the state-of-the-art structure learning method DAGs with NO TEARS [139] to understand the causal relationships between variables. CausalNex allows users to learn the optimal graph structure through: 1) encoding domain expertise and 2) learning from data via structure learning algorithms. Users can also conduct counterfactual analysis by introducing do-calculus.

  • CausalDiscovery [140] implements an end-to-end, step-by-step pipeline for recovering the direct dependencies (the skeleton of the causal graph) and the causal relationships between variables. CausalDiscovery currently includes 17 algorithms for graph skeleton identification and 19 algorithms for causal directed graph prediction. The goal of CausalDiscovery is to learn both the causal graph and the associated causal mechanisms from the joint probability distribution of observational data. It also includes R-based algorithms, extending the toolkit with additional R packages. There are two types of supported algorithms in CausalDiscovery – graph and pairwise causal inference models. In addition to the standard evaluation metrics, CausalDiscovery also includes SHD and SID.

  • Causality-Benchmark [141] benchmarks algorithms that estimate the ATE and ITE of an intervention on the outcome of interest. It includes unlabeled data for prediction, labeled data for validation, and scoring algorithms for the automatic evaluation of prediction algorithms based on different evaluation metrics. As the ground truth of causal effects cannot be known for real-world treatments, Causality-Benchmark uses a simulation-based approach that leverages existing covariates and creates a causal graph to determine treatment assignment and effect. It supports a variety of metrics for causal effect estimation.

  • JustCause (https://github.com/inovex/justcause) evaluates causal inference methods using common datasets including Twins, the Infant Health and Development Program (IHDP), and IBM ACIC. It can also generate synthetic datasets with a generic but standardized approach used in [142]. The goal of JustCause is to provide a fair and just way to benchmark new methods against several baselines and state-of-the-art methods in causal effect estimation. The supported algorithms are doubly robust estimation, inverse propensity weighting, S-learner, and T-learner. Evaluation metrics implemented include the PEHE score, MAE, the EnoRMSE score, and bias.

  • CauseBox (https://github.com/paras2612/CauseBox) is a benchmarking suite developed in Python for treatment effect estimation methods; it consists of five deep learning based state-of-the-art methods and two tree-based methods. All methods are evaluated on benchmark datasets including the IHDP and News datasets. The goal of this toolbox is to provide a unified platform to compare algorithms on different metrics, including PEHE and policy risk.

  • bnlearn (https://www.bnlearn.com/) [143] is a causal structure learning toolbox developed in the R language. The bnlearn package provides implementations of various structure learning algorithms including, but not limited to, PC; Grow-Shrink (GS); Incremental Association Markov Blanket (IAMB); Hybrid Parents & Children (HPC); and Hill Climbing (HC). bnlearn also provides the conditional independence tests and network scores used to construct the Bayesian network. Both discrete and continuous data are supported. Furthermore, bnlearn facilitates choosing learning algorithms based on different statistical criteria, so that the best combination for the data can be utilized.

  • TETRAD (http://www.phil.cmu.edu/tetrad/) [144] is a drag-and-drop suite for causal structure learning. It can take datasets containing both continuous and categorical variables, including time series data, and it supports algorithms that search for structural relations. It also supports modeling unmeasured confounders; simulating data from a statistical model; predicting the effects on other variables of interventions or perturbations on one or more variables; and computing the probability distribution of any variable conditional on specified values of any other set of variables. It is developed in Java.

  • pcalg (https://cran.r-project.org/web/packages/pcalg/index.html) [145] is a toolbox developed in R for causal structure learning and causal effect estimation using graphical models. For structure learning, pcalg supports multiple algorithms, including PC, FCI, RFCI, and GIES, whereas for causal effect estimation, it includes the IDA algorithm, the Generalized Backdoor Criterion (GBC), and the Generalized Adjustment Criterion (GAC).
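Below is the minimal DoWhy usage sketch referenced in the list above, illustrating the four-step workflow (model, identify, estimate, refute); the toy data and column names are assumptions, and exact method names may differ across DoWhy versions.

```python
# A hedged sketch of DoWhy's four-step workflow (model, identify, estimate,
# refute). The toy data and column names are assumptions for illustration.
import numpy as np
import pandas as pd
from dowhy import CausalModel

rng = np.random.default_rng(4)
n = 2000
w = rng.normal(size=n)                          # observed confounder
t = rng.binomial(1, 1 / (1 + np.exp(-w)))       # treatment
y = 2.0 * t + w + rng.normal(size=n)            # outcome with true ATE = 2
df = pd.DataFrame({"w": w, "t": t, "y": y})

model = CausalModel(data=df, treatment="t", outcome="y", common_causes=["w"])
estimand = model.identify_effect()              # graph-based identification
estimate = model.estimate_effect(
    estimand, method_name="backdoor.propensity_score_matching")
refutation = model.refute_estimate(
    estimand, estimate, method_name="random_common_cause")
print(estimate.value, refutation)
```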

Comparisons of these tools w.r.t. data, supported methods, and metrics can be found in Table 3. It is worth noting that the data types supported by most of these tools are limited to i.i.d. data and data with IVs. Further, current evaluation-oriented tools do not support metrics for heterogeneous effect estimation or for causal structure learning tasks. In addition, none of these tools is designed to evaluate causality-aware ML tasks, e.g., causal interpretability.

Table 3: Comparisons of causal inference tools with a focus on the included datasets, methods, and metrics. The tools compared are CausalML, EconML, DoWhy, and CauseBox (causal effect estimation); CausalNex, pcalg, bnlearn, TETRAD, and CausalDiscovery (causal structure learning); and Causality-Benchmark and JustCause (evaluation for effect estimation). The comparison covers the supported data types (i.i.d., IV, networked, time series), the supported methods (propensity score, tree-based, meta-learner, double ML, doubly robust, IV, mediation, graph, pairwise), and the supported metrics (PEHE, user-input metrics, RMSE, MAE, bias, coverage, confidence interval, aggregating score, refutation, SID, SHD, classification).

6 Open Problems and Challenges

Still an early-stage research area in AI, CL presents both great opportunities and multi-faceted challenges. We list a few below to bring major concerns in the field to the forefront.

6.1 Evaluation of Intermediate Steps

The importance of validating intermediate steps in a CL pipeline is often overlooked. For example, it is critical to precisely infer propensity scores, as doing so is an intermediate step in many popular causal methods, e.g., propensity score matching. Therefore, an evaluation pipeline that effectively evaluates intermediate steps is desired. In text analysis, as text is interpretable and can be understood by humans, interpretable balance metrics such as the standardized difference in means and/or human judgements of propensity scores can be used to assess the predicted propensity scores or matching results [146]. How to improve and standardize human judgement experiments, however, remains an open problem.
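As a small illustration of such an intermediate check, the Python sketch below computes the standardized difference in means of each covariate before and after inverse-propensity weighting; the simulated data and the commonly cited 0.1 threshold are illustrative assumptions.

```python
# A sketch of an interpretable intermediate check: the standardized difference
# in means (SMD) of each covariate before vs. after inverse-propensity
# weighting. The simulated data and the 0.1 rule of thumb are illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(5)
n = 5000
X = rng.normal(size=(n, 3))
t = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0] + 0.5 * X[:, 1])))

def smd(x, t, w=None):
    """Weighted standardized difference in means between treated and control."""
    w = np.ones_like(x) if w is None else w
    m1 = np.average(x[t == 1], weights=w[t == 1])
    m0 = np.average(x[t == 0], weights=w[t == 0])
    s = np.sqrt((x[t == 1].var() + x[t == 0].var()) / 2)
    return (m1 - m0) / s

e = LogisticRegression().fit(X, t).predict_proba(X)[:, 1]
w_ipw = np.where(t == 1, 1 / e, 1 / (1 - e))     # inverse-propensity weights

for j in range(X.shape[1]):
    print(f"covariate {j}: raw SMD={smd(X[:, j], t):.3f}, "
          f"weighted SMD={smd(X[:, j], t, w_ipw):.3f} (|SMD|<0.1 often deemed balanced)")
```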

6.2 Evaluation for Networked Data

Networked data (e.g., social networks) have been of great interest to researchers due to their ubiquity in the real world. Some representative tasks include community detection and predicting relationships between individuals in a network. Despite the growing interest in identifying and estimating causal effects in networked data, methodologies and their evaluation have not kept pace [147]. Due to interference and peer effects between units, standard structural causal models are not equipped to deal with dependence across individuals or to describe the data generating process of networked data. This further impedes research in causal structure learning for networked data. On the positive side, many evaluation metrics for causal inference and CL with i.i.d. data can still be used to benchmark methods for the corresponding tasks with networked data. The major challenge lies in the lack of real-world datasets. Most existing work (e.g., [21, 148]) relies on synthetic data generated from graph models or semi-synthetic data with predefined structural equations.

6.3 Constructed Observational Studies

Constructed data – data gathered from both randomized and non-randomized experiments with similar subjects and settings – have been used in many areas, such as economics [3], marketing [149], and education [150]. AI researchers have also started using constructed data, as evidenced by recent works that leverage the advantages of both RCTs and observational studies to enhance online A/B tests (e.g., [151]). Using constructed data for CL makes up for the limitations of using only observational data or only RCTs and helps make better decisions. Further, it reveals possible solutions for evaluating intermediate steps and for validating the process of model selection and hyperparameter tuning. It also allows us to evaluate causal effect estimation (e.g., ITE) when the ground truth is not available.

6.4 Evaluation of Long-Term Effects

Short- and long-term effects can differ in a number of ways. For example, user behavior changes as users learn and adapt to a new environment. In a social network, user behavior is easily influenced by people in the user’s network, though the change may take time to reach its full effect [152]. Different from evaluating short-term effects, evaluating long-term effects is more cumbersome as we need to account for the factor of time. Current evaluation methods directly adopt pipelines from short-term effect estimation, e.g., the MAE of the estimated effect at a specific timestep. We need metrics tailored to the unique challenges of long-term effect estimation. For example, what is the best target inferential point at which to estimate the effect? The answer reveals how long it takes a user to discover a new feature in a recommender system and whether in-product education is needed to expedite the uptake. Many issues may arise given the nature of the data available for investigating long-term effects [153]. For instance, it is possible that a treatment effect depends on previous levels of covariates; therefore, evaluation measures that can test for the presence and strength of effect modification are preferred. Sensitivity analysis and model validation are especially important in this task due to the risks of mis-specifying the direction of causal pathways and of censoring, e.g., people lost to follow-up over time.

6.5 Model Validation

Causal inference is generally an unsupervised task, which makes model validation challenging. Given the required (and often unverifiable) causal assumptions, a valid and unbiased causal estimate can only be assured after the sensitivity and robustness of the estimate against these assumptions have been checked. For example, despite promising research progress on addressing unobserved confounding, current approaches can rarely (if ever) remove all confounding. Even for estimators that are unbiased in the presence of unobserved confounders, the assumptions required for causal identification and estimation can be violated. Therefore, being able to determine the robustness of causal methods is crucial for theory testing and model development. In addition, knowing to what extent the conclusions drawn from causal studies are sensitive to potential confounding and violations of assumptions can help with policy decision making [154]. Although there exists an extensive literature on sensitivity analysis [155], most results are limited to specific model structures and proceed on a case-by-case basis, whereas a general-purpose algorithmic framework for sensitivity analysis that accounts for the ubiquity of causal questions in the sciences and AI is still lacking [156].
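As a toy illustration of the spirit of sensitivity analysis, the sketch below shows how a regression-adjustment ATE estimate drifts as the strength of a simulated unobserved confounder grows; it is not a general-purpose framework, and all quantities are illustrative.

```python
# A toy illustration of sensitivity analysis: how a regression-adjustment ATE
# estimate degrades as the strength of a simulated *unobserved* confounder u
# grows. This is an illustrative sketch, not a general-purpose framework.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(6)
n, true_ate = 20000, 1.0

for gamma in [0.0, 0.5, 1.0, 2.0]:            # confounding strength
    x = rng.normal(size=n)                     # observed covariate
    u = rng.normal(size=n)                     # unobserved confounder
    t = rng.binomial(1, 1 / (1 + np.exp(-(x + gamma * u))))
    y = true_ate * t + x + gamma * u + rng.normal(size=n)
    # Adjust only for the observed covariate x (u is unavailable to the analyst).
    coef = LinearRegression().fit(np.column_stack([t, x]), y).coef_[0]
    print(f"gamma={gamma:.1f}: estimated ATE={coef:.2f} (truth={true_ate})")
```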

6.6 CL Evaluation Tools

The number of metrics and datasets for CL is growing fast. This raises a pressing need for an open-source platform to promote objective, transparent, and independent evaluation of algorithms and to support broad collaboration. Such a platform is expected to include the ever-growing set of evaluation metrics, procedures, and datasets for causal effect estimation, causal structure learning, and causality-aware ML tasks. Model validation tools such as sensemakr (https://cran.r-project.org/web/packages/sensemakr/index.html) are also needed to better understand the robustness of a causal model. By providing convenient access to a benchmarking repository for evaluating CL algorithms, we can broaden collaboration in the community for the development of new metrics and datasets.

6.7 Evaluation of Causal Interpretability

Evaluating interpretable models is particularly difficult since interpretability does not have a unified definition [26]. Therefore, no single measure can fully assess a model’s interpretability from all aspects. It is even more challenging to evaluate models for causal interpretability due to the lack of ground-truth causal relations between the components of the model or the causal effect of one component on another. Therefore, crafting new datasets with such ground truth is a vital task for the field to move forward.

7 Conclusion and Future Work

CL is essential for AI to uncover the causal mechanisms underlying real-world problems and to achieve human-level intelligence. Still in its infancy, research in CL faces many obstacles. For example, CL has burgeoned, so far, without a proper benchmarking pipeline to support fair and transparent evaluation of emerging research contributions. To bridge this gap, in this survey, we provide a comprehensive review of existing methods for evaluating the fundamental tasks in causal inference and causality-aware ML tasks and discuss potential limitations. We follow the evaluation pipeline of conventional ML and focus on widely used datasets, evaluation metrics, protocols, and causal inference tools/packages. We conclude the survey with prominent open problems and challenges that await future research. To this end, we hope to broaden the discussion about open-source platforms/software that aim to promote CL research via fair, transparent, independent, and easy-to-use evaluation procedures.

We plan to develop a Causal Inference Evaluation Toolbox that consists of mainstream CL algorithms such as CounterFactual Regression (CFR) [31], the Causal Effect Variational Autoencoder [39], Bayesian Additive Regression Trees (BART) [157], Causal Forest [158], Perfect Match [48], Disentangled Representations for CFR [159], and similarity-preserved individual treatment effect estimation [160]. This toolbox will showcase results of the previously mentioned evaluation metrics on multiple benchmark datasets such as the Jobs, IHDP, and News datasets.

Acknowledgements

This work is supported by National Science Foundation (NSF) grants #1909555, #2029044, #2125246, #1633381, #1610282, ARL W911NF2020124, and AMC W911NF2110030. The views, opinions, and/or findings expressed are those of the authors and should not be interpreted as representing the official views or policies of the Army Research Office or the U.S. Government.

References

  • [1] J. Pearl, Causality.   Cambridge University Press, 2009.
  • [2] L. Cheng, R. Guo, and H. Liu, “Robust cyberbullying detection with causal interpretation,” in WWW’ Companion, 2019, pp. 169–175.
  • [3] R. J. LaLonde, “Evaluating the econometric evaluations of training programs with experimental data,” Am. Econ. Rev., pp. 604–620, 1986.
  • [4] R. H. Dehejia and S. Wahba, “Causal effects in nonexperimental studies: Reevaluating the evaluation of training programs,” J. Amer. Statist. Assoc., vol. 94, no. 448, pp. 1053–1062, 1999.
  • [5] D. Heckerman, C. Meek, and G. Cooper, “A bayesian approach to causal discovery,” in Innovations in Machine Learning.   Springer, 2006, pp. 1–28.
  • [6] J. L. Hill, “Bayesian nonparametric modeling for causal inference,” Journal of Computational and Graphical Statistics, vol. 20, no. 1, pp. 217–240, 2011.
  • [7] S. Mani and G. F. Cooper, “Causal discovery from medical textual data.” in Proceedings of the AMIA Symposium, 2000, p. 542.
  • [8] Cross-Disorder Group of the Psychiatric Genomics Consortium et al., “Identification of risk loci with shared effects on five major psychiatric disorders: a genome-wide analysis,” The Lancet, vol. 381, no. 9875, pp. 1371–1379, 2013.
  • [9] G. W. Imbens, “Nonparametric estimation of average treatment effects under exogeneity: A review,” Rev. Econ. Stat., vol. 86, no. 1, pp. 4–29, 2004.
  • [10] J. M. Robins, M. A. Hernan, and B. Brumback, “Marginal structural models and causal inference in epidemiology,” 2000.
  • [11] M. A. Hernán and J. M. Robins, Causal Inference.   Boca Raton, FL: CRC Press, forthcoming.
  • [12] I. Ebert-Uphoff and Y. Deng, “Causal discovery for climate research using graphical models,” JCLI, vol. 25, no. 17, pp. 5648–5665, 2012.
  • [13] J. Li, O. R. Zaïane, and A. Osornio-Vargas, “Discovering statistically significant co-location rules in datasets with extended spatial objects,” in DaWaK, 2014, pp. 124–135.
  • [14] J. Pearl, “Theoretical impediments to machine learning with seven sparks from the causal revolution,” in WSDM, 2018, pp. 3–3.
  • [15] D. B. Rubin, “Causal inference using potential outcomes: Design, modeling, decisions,” JASA, vol. 100, no. 469, pp. 322–331, 2005.
  • [16] R. Guo, L. Cheng, J. Li, P. R. Hahn, and H. Liu, “A survey of learning causality with data: Problems and methods,” CSUR, vol. 53, no. 4, pp. 1–37, 2020.
  • [17] N. Kallus and A. Zhou, “Confounding-robust policy improvement,” in NeurIPS, 2018, pp. 9289–9299.
  • [18] L. Cheng, R. Moraffah, R. Guo, K. Candan, A. Raglin, and H. Liu, “A practical data repository for causal learning with big data,” in Bench, 2019.
  • [19] B. Schölkopf, F. Locatello, S. Bauer, N. R. Ke, N. Kalchbrenner, A. Goyal, and Y. Bengio, “Toward causal representation learning,” Proceedings of the IEEE, vol. 109, no. 5, pp. 612–634, 2021.
  • [20] B. Schölkopf, “Causality for machine learning,” arXiv preprint arXiv:1911.10500, 2019.
  • [21] R. Guo, J. Li, and H. Liu, “Learning individual causal effects from networked observational data,” in WSDM, 2020, pp. 232–240.
  • [22] L. Yao, Z. Chu, S. Li, Y. Li, J. Gao, and A. Zhang, “A survey on causal inference,” TKDD, vol. 15, no. 5, pp. 1–46, 2021.
  • [23] P. Spirtes and K. Zhang, “Causal discovery and inference: concepts and recent methodological advances,” in Applied informatics, vol. 3, no. 1, 2016, p. 3.
  • [24] R. Moraffah, P. Sheth, M. Karami, A. Bhattacharya, Q. Wang, A. Tahir, A. Raglin, and H. Liu, “Causal inference for time series analysis: Problems, methods and evaluation,” Knowledge and Information Systems, pp. 1–45, 2021.
  • [25] J. Chen, H. Dong, X. Wang, F. Feng, M. Wang, and X. He, “Bias and debias in recommender system: A survey and future directions,” arXiv preprint arXiv:2010.03240, 2020.
  • [26] R. Moraffah, M. Karami, R. Guo, A. Ragliny, and H. Liu, “Causal interpretability for machine learning–problems, methods and evaluation,” ACM SIGKDD Explorations Newsletter, vol. 22, 2020.
  • [27] Y. Shimoni, C. Yanover, E. Karavani, and Y. Goldschmidt, “Benchmarking framework for performance-evaluation of causal inference analysis,” arXiv preprint arXiv:1802.05046, 2018.
  • [28] G. W. Imbens and D. B. Rubin, Causal inference in statistics, social, and biomedical sciences.   Cambridge University Press, 2015.
  • [29] J. Hartford, G. Lewis, K. Leyton-Brown, and M. Taddy, “Deep iv: A flexible approach for counterfactual prediction,” in ICML.   JMLR. org, 2017, pp. 1414–1423.
  • [30] A. M. Puli and R. Ranganath, “Generalized control functions via variational decoupling,” arXiv preprint arXiv:1907.03451, 2019.
  • [31] U. Shalit, F. D. Johansson, and D. Sontag, “Estimating individual treatment effect: generalization bounds and algorithms,” in ICML, 2017, pp. 3076–3085.
  • [32] V. Rakesh, R. Guo, R. Moraffah, N. Agarwal, and H. Liu, “Linked causal variational autoencoder for inferring paired spillover effects,” in CIKM.   ACM, 2018, pp. 1679–1682.
  • [33] Y. Zhao, X. Fang, and D. Simchi-Levi, “Uplift modeling with multiple treatments and general response types,” in SDM, N. V. Chawla and W. Wang, Eds., 2017, pp. 588–596.
  • [34] J.-Y. Gérardy, “Causal inference and uplift modeling: A review of the literature.”
  • [35] N. J. Radcliffe, “Using control groups to target on predicted lift: Building and assessing uplift models,” Direct Marketing Analytics Journal, vol. 1, p. 1421, 2007.
  • [36] D. Card and A. B. Krueger, “Minimum wages and employment: A case study of the fast food industry in new jersey and pennsylvania,” 1993.
  • [37] J. M. Mooij, J. Peters, D. Janzing, J. Zscheischler, and B. Schölkopf, “Distinguishing cause from effect using observational data: methods and benchmarks,” JMLR, vol. 17, no. 1, pp. 1103–1204, 2016.
  • [38] A. Dixit, O. Parnas, B. Li, J. Chen, C. P. Fulco, L. Jerby-Arnon, N. D. Marjanovic, D. Dionne, T. Burks, R. Raychowdhury et al., “Perturb-seq: dissecting molecular circuits with scalable single-cell rna profiling of pooled genetic screens,” Cell, vol. 167, no. 7, pp. 1853–1866, 2016.
  • [39] C. Louizos, U. Shalit, J. M. Mooij, D. Sontag, R. Zemel, and M. Welling, “Causal effect inference with deep latent-variable models,” in NeurIPS, 2017, pp. 6446–6456.
  • [40] N. Kallus, X. Mao, and M. Udell, “Causal inference with noisy and missing covariates via matrix factorization,” NeurIPS, vol. 31, 2018.
  • [41] A. M. Gentzel, P. Pruthi, and D. Jensen, “How and why to use experimental data to evaluate methods for observational causal inference,” in ICML.   PMLR, 2021, pp. 3660–3671.
  • [42] J. A. Smith and P. E. Todd, “Does matching overcome lalonde’s critique of nonexperimental estimators?” J. Econ., 2005.
  • [43] P. R. Hahn, V. Dorie, and J. S. Murray, “Atlantic causal inference conference (acic) data analysis challenge 2017,” Tech. rep, Tech. Rep., 2018.
  • [44] G. J. Duncan, J. Brooks-Gunn, and P. K. Klebanov, “Economic deprivation and early childhood development,” Child development, vol. 65, no. 2, pp. 296–318, 1994.
  • [45] D. Almond, K. Y. Chay, and D. S. Lee, “The costs of low birth weight,” The Quarterly Journal of Economics, vol. 120, no. 3, pp. 1031–1083, 2005.
  • [46] J. McAuley, R. Pandey, and J. Leskovec, “Inferring networks of substitutable and complementary products,” in KDD.   ACM, 2015, pp. 785–794.
  • [47] J. Yoon, J. Jordon, and M. van der Schaar, “Ganite: Estimation of individualized treatment effects using generative adversarial nets,” 2018.
  • [48] P. Schwab, L. Linhardt, and W. Karlen, “Perfect match: A simple method for learning representations for counterfactual inference with neural networks,” arXiv preprint arXiv:1810.00656, 2018.
  • [49] D. Galagate, J. Schafer, and M. D. Galagate, “Package ‘causaldrf’,” 2015.
  • [50] J. D. Angrist and J.-S. Pischke, Mastering’metrics: The path from cause to effect.   Princeton University Press, 2014.
  • [51] J. D. Angrist, G. W. Imbens, and D. B. Rubin, “Identification of causal effects using instrumental variables,” J. Amer. Statist. Assoc., vol. 91, no. 434, pp. 444–455, 1996.
  • [52] D. T. Campbell, “Reforms as experiments.” Am. Psychol., vol. 24, no. 4, p. 409, 1969.
  • [53] J. D. Angrist and A. B. Krueger, “The effect of age at school entry on educational attainment: an application of instrumental variables with moments from two samples,” JASA, vol. 87, no. 418, pp. 328–336, 1992.
  • [54] J. D. Angrist and G. W. Imbens, “Two-stage least squares estimation of average causal effects in models with variable treatment intensity,” J. Amer. Statist. Assoc., vol. 90, no. 430, pp. 431–442, 1995.
  • [55] N. Nunn and L. Wantchekon, “The slave trade and the origins of mistrust in africa,” American Economic Review, vol. 101, no. 7, pp. 3221–52, 2011.
  • [56] C. Carpenter and C. Dobkin, “The effect of alcohol consumption on mortality: regression discontinuity evidence from the minimum drinking age,” AEJ: Applied Economics, vol. 1, no. 1, pp. 164–82, 2009.
  • [57] A. C. Eggers, R. Freier, V. Grembi, and T. Nannicini, “Regression discontinuity designs based on population thresholds: Pitfalls and solutions,” Am. J. Pol. Sci., vol. 62, no. 1, pp. 210–229, 2018.
  • [58] R. Jacob, P. Zhu, M.-A. Somers, and H. Bloom, “A practical guide to regression discontinuity.” MDRC, 2012.
  • [59] A. E. Johnson, T. J. Pollard, L. Shen, H. L. Li-Wei, M. Feng, M. Ghassemi, B. Moody, P. Szolovits, L. A. Celi, and R. G. Mark, “Mimic-iii, a freely accessible critical care database,” Scientific data, vol. 3, no. 1, pp. 1–9, 2016.
  • [60] K. H. Brodersen, F. Gallusser, J. Koehler, N. Remy, and S. L. Scott, “Inferring causal impact using bayesian structural time-series models,” The Annals of Applied Statistics, vol. 9, no. 1, pp. 247–274, 2015.
  • [61] M. Auffhammer and R. Kellogg, “Clearing the air? the effects of gasoline content regulation on air quality,” American Economic Review, vol. 101, no. 6, pp. 2687–2722, 2011.
  • [62] Y. Ma and V. Tresp, “Causal inference under networked interference and intervention policy enhancement,” in AISTATS.   PMLR, 2021, pp. 3700–3708.
  • [63] K. Chantala and J. Tabor, “National longitudinal study of adolescent health: Strategies to perform a design-based analysis using the add health data,” 1999.
  • [64] L. Takac and M. Zabovsky, “Data analysis in public social networks,” in International scientific conference and international workshop present day trends of innovations, vol. 1, no. 6, 2012.
  • [65] R. Guo, J. Li, and H. Liu, “Counterfactual evaluation of treatment assignment functions with networked observational data,” in SDM.   SIAM, 2020, pp. 271–279.
  • [66] L. Cheng, K. R. Varshney, and H. Liu, “Socially responsible ai algorithms: Issues, purposes, and challenges,” JAIR, 2021.
  • [67] A. Volfovsky, E. M. Airoldi, and D. B. Rubin, “Causal inference for ordinal outcomes,” arXiv preprint arXiv:1501.01234, 2015.
  • [68] L. Cheng, R. Guo, and H. Liu, “Long-term effect estimation with surrogate representation,” in WSDM, 2021, pp. 274–282.
  • [69] P. Spirtes, C. N. Glymour, R. Scheines, D. Heckerman, C. Meek, G. Cooper, and T. Richardson, Causation, prediction, and search.   MIT press, 2000.
  • [70] B. Schölkopf, D. Janzing, J. Peters, E. Sgouritsa, K. Zhang, and J. Mooij, “On causal and anticausal learning,” in ICML, 2012, pp. 459–466.
  • [71] J. Peters, D. Janzing, and B. Schölkopf, Elements of causal inference: foundations and learning algorithms, 2017.
  • [72] C. W. Granger, “Investigating causal relations by econometric models and cross-spectral methods,” Econometrica, pp. 424–438, 1969.
  • [73] M. de Jongh and M. J. Druzdzel, “A comparison of structural distance measures for causal bayesian network models,” Recent Advances in Intelligent Information Systems, Challenging Problems of Science, Computer Science series, pp. 443–456, 2009.
  • [74] J. Peters and P. Bühlmann, “Structural intervention distance for evaluating causal graphs,” Neural computation, vol. 27, no. 3, pp. 771–799, 2015.
  • [75] I. Tsamardinos, L. E. Brown, and C. F. Aliferis, “The max-min hill-climbing bayesian network structure learning algorithm,” Machine learning, vol. 65, no. 1, pp. 31–78, 2006.
  • [76] J. Peters, P. Bühlmann, and N. Meinshausen, “Causal inference by using invariant prediction: identification and confidence intervals,” J. R. Stat. Soc. Series B Stat. Methodol., vol. 78, no. 5, pp. 947–1012, 2016.
  • [77] S. Shimizu, T. Inazumi, Y. Sogawa, A. Hyvärinen, Y. Kawahara, T. Washio, P. O. Hoyer, and K. Bollen, “Directlingam: A direct method for learning a linear non-gaussian structural equation model,” JMLR, vol. 12, no. Apr, pp. 1225–1248, 2011.
  • [78] D. Bacciu, T. A. Etchells, P. J. Lisboa, and J. Whittaker, “Efficient identification of independence networks using mutual information,” Computational Statistics, vol. 28, no. 2, pp. 621–646, 2013.
  • [79] A. Gentzel, D. Garant, and D. Jensen, “The case for evaluating causal models using interventional measures and empirical data,” in NeurIPS, 2019.
  • [80] M. Peyrard and R. West, “A ladder of causal distances,” in IJCAI, 2021.
  • [81] R. C. Petersen, P. Aisen, L. A. Beckett, M. Donohue, A. Gamst, D. J. Harvey, C. Jack, W. Jagust, L. Shaw, A. Toga et al., “Alzheimer’s disease neuroimaging initiative (adni): clinical characterization,” Neurology, vol. 74, no. 3, pp. 201–209, 2010.
  • [82] K. Sachs, O. Perez, D. Pe’er, D. A. Lauffenburger, and G. P. Nolan, “Causal protein-signaling networks derived from multiparameter single-cell data,” Science, vol. 308, no. 5721, pp. 523–529, 2005.
  • [83] J. Mitrovic, D. Sejdinovic, and Y. W. Teh, “Causal inference via kernel deviance measures,” NeurIPS, vol. 31, 2018.
  • [84] K. Bache and M. Lichman, “UCI machine learning repository,” 2013. [Online]. Available: http://archive.ics.uci.edu/ml
  • [85] M. Kocaoglu, A. Dimakis, and S. Vishwanath, “Cost-optimal learning of causal graphs,” in ICML.   JMLR. org, 2017, pp. 1875–1884.
  • [86] K. Shanmugam, M. Kocaoglu, A. G. Dimakis, and S. Vishwanath, “Learning causal graphs with small interventions,” in NeurIPS, 2015, pp. 3195–3203.
  • [87] O. Ahmed, F. Träuble, A. Goyal, A. Neitz, Y. Bengio, B. Schölkopf, M. Wüthrich, and S. Bauer, “Causalworld: A robotic manipulation benchmark for causal structure and transfer learning,” in ICLR, 2021.
  • [88] T. Van den Bulcke, K. Van Leemput, B. Naudts, P. van Remortel, H. Ma, A. Verschoren, B. De Moor, and K. Marchal, “Syntren: a generator of synthetic gene expression data for design and analysis of structure learning algorithms,” BMC bioinformatics, vol. 7, no. 1, pp. 1–12, 2006.
  • [89] J. Runge, X.-A. Tibau, M. Bruhns, J. Muñoz-Marí, and G. Camps-Valls, “The causality for climate competition,” in NeurIPS 2019 Competition and Demonstration Track.   PMLR, 2020, pp. 110–120.
  • [90] U. Schaechtle, K. Stathis, and S. Bromuri, “Multi-dimensional causal discovery,” in IJCAI, 2013.
  • [91] M. Gong, K. Zhang, B. Schölkopf, C. Glymour, and D. Tao, “Causal discovery from temporally aggregated time series,” in UAI, vol. 2017, 2017.
  • [92] Z. Pan, Y. Liang, J. Zhang, X. Yi, Y. Yu, and Y. Zheng, “Hyperst-net: Hypernetworks for spatio-temporal forecasting,” arXiv preprint arXiv:1809.10889, 2018.
  • [93] D. Entner and P. O. Hoyer, “On causal discovery from time series data using fci,” Probabilistic graphical models, pp. 121–128, 2010.
  • [94] D. Malinsky and D. Danks, “Causal discovery algorithms: A practical guide,” Philosophy Compass, vol. 13, no. 1, p. e12470, 2018.
  • [95] F. Lattimore, T. Lattimore, and M. D. Reid, “Causal bandits: Learning good interventions via causal inference,” in NeurIPS, 2016, pp. 1181–1189.
  • [96] S. Lee and E. Bareinboim, “Structural causal bandits: where to intervene?” in NeurIPS, 2018, pp. 2568–2578.
  • [97] L. Cheng, A. Mosallanezhad, P. Sheth, and H. Liu, “Causal learning for socially responsible ai,” in IJCAI, 2021.
  • [98] Z. C. Lipton, “The mythos of model interpretability,” Queue, vol. 16, no. 3, pp. 31–57, 2018.
  • [99] F. Doshi-Velez and B. Kim, “Towards a rigorous science of interpretable machine learning,” arXiv preprint arXiv:1702.08608, 2017.
  • [100] Y. Goyal, Z. Wu, J. Ernst, D. Batra, D. Parikh, and S. Lee, “Counterfactual visual explanations,” in ICML. PMLR, 2019, pp. 2376–2384.
  • [101] Y. Goyal, A. Feder, U. Shalit, and B. Kim, “Explaining classifiers with causal concept effect (cace),” arXiv preprint arXiv:1907.07165, 2019.
  • [102] T. Narendra, A. Sankaran, D. Vijaykeerthy, and S. Mani, “Explaining deep learning models using causal inference,” arXiv preprint arXiv:1811.04376, 2018.
  • [103] M. Harradon, J. Druce, and B. Ruttenberg, “Causal learning and explanation of deep neural networks via autoencoded activations,” arXiv preprint arXiv:1802.00541, 2018.
  • [104] A. Chouldechova, “Fair prediction with disparate impact: A study of bias in recidivism prediction instruments,” Big data, vol. 5, no. 2, pp. 153–163, 2017.
  • [105] M. J. Kusner, J. Loftus, C. Russell, and R. Silva, “Counterfactual fairness,” in NeurIPS, 2017, pp. 4066–4076.
  • [106] Y. Wu, L. Zhang, X. Wu, and H. Tong, “Pc-fairness: A unified framework for measuring causality-based fairness,” in NeurIPS, 2019, pp. 3399–3409.
  • [107] A. V. Looveren and J. Klaise, “Interpretable counterfactual explanations guided by prototypes,” in ECML PKDD. Springer, 2021, pp. 650–665.
  • [108] R. Mc Grath, L. Costabello, C. Le Van, P. Sweeney, F. Kamiab, Z. Shen, and F. Lecue, “Interpretable credit application predictions with counterfactual explanations,” in NIPS FEAP-AI4F, 2018.
  • [109] R. K. Mothilal, A. Sharma, and C. Tan, “Explaining machine learning classifiers through diverse counterfactual explanations,” in FAccT, 2020, pp. 607–617.
  • [110] A. Kanehira, K. Takemoto, S. Inayoshi, and T. Harada, “Multimodal explanations by predicting counterfactuality in videos,” CoRR, vol. abs/1812.01263, 2018.
  • [111] K. Makhlouf, S. Zhioua, and C. Palamidessi, “Survey on causal-based machine learning fairness notions,” CoRR, vol. abs/2010.09553, 2020.
  • [112] A. Khademi, S. Lee, D. Foley, and V. Honavar, “Fairness in algorithmic decision making: An excursion through the lens of causality,” in WWW, 2019, pp. 2907–2914.
  • [113] J. Zhang and E. Bareinboim, “Fairness in decision-making—the causal explanation formula,” in AAAI, 2018.
  • [114] S. A. Friedler, C. Scheidegger, and S. Venkatasubramanian, “On the (im) possibility of fairness,” arXiv preprint arXiv:1609.07236, 2016.
  • [115] R. Kohavi, “Scaling up the accuracy of naive-Bayes classifiers: A decision-tree hybrid,” in KDD, vol. 96, 1996, pp. 202–207.
  • [116] S. Wachter, B. Mittelstadt, and C. Russell, “Counterfactual explanations without opening the black box: Automated decisions and the gdpr,” Harv. JL & Tech., vol. 31, p. 841, 2017.
  • [117] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, “ImageNet Large Scale Visual Recognition Challenge,” IJCV, vol. 115, no. 3, pp. 211–252, 2015.
  • [118] Y. LeCun, C. Cortes, and C. Burges, “The MNIST database,” http://yann.lecun.com/exdb/mnist/, Jan 2020.
  • [119] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, “The pascal visual object classes (voc) challenge,” IJCV, vol. 88, no. 2, pp. 303–338, 2010.
  • [120] K. Lang, “20 newsgroups,” http://qwone.com/~jason/20Newsgroups/, 2008.
  • [121] IMDb, “IMDb dataset,” https://www.imdb.com/interfaces/, 2020.
  • [122] AWS, “Amazon customer reviews dataset,” https://s3.amazonaws.com/amazon-reviews-pds/readme.html, 2020.
  • [123] UCI, “UCI machine learning repository,” https://archive.ics.uci.edu/ml/index.php, 2020.
  • [124] D. Dua and C. Graff, “UCI machine learning repository,” 2017. [Online]. Available: http://archive.ics.uci.edu/ml
  • [125] A. Flores, K. Bechtel, and C. Lowenkamp, “False positives, false negatives, and false analyses: A rejoinder to “machine bias: There’s software used across the country to predict future criminals. and it’s biased against blacks.”,” Federal probation, vol. 80, 09 2016.
  • [126] R. Guo, X. Zhao, A. Henderson, L. Hong, and H. Liu, “Debiasing grid-based product search in e-commerce,” in KDD, 2020, pp. 2852–2860.
  • [127] Z. Wang, X. Yin, T. Li, and L. Hong, “Causal meta-mediation analysis: Inferring dose-response function from summary statistics of many randomized experiments,” in KDD, 2020, pp. 2625–2635.
  • [128] X. Wu, H. Chen, J. Zhao, L. He, D. Yin, and Y. Chang, “Unbiased learning to rank in feeds recommendation,” in WSDM. ACM, 2021.
  • [129] T. Joachims, A. Swaminathan, and M. De Rijke, “Deep learning with logged bandit feedback,” in ICLR, 2018.
  • [130] Z. Hu, Y. Wang, Q. Peng, and H. Li, “Unbiased LambdaMART: An unbiased pairwise learning-to-rank algorithm,” in WWW. ACM, 2019, pp. 2830–2836.
  • [131] Q. Ai, K. Bi, C. Luo, J. Guo, and W. B. Croft, “Unbiased learning to rank with unbiased propensity estimation,” in SIGIR. ACM, 2018, pp. 385–394.
  • [132] K. Järvelin and J. Kekäläinen, “Cumulated gain-based evaluation of IR techniques,” TOIS, vol. 20, no. 4, pp. 422–446, 2002.
  • [133] H. Abdollahpouri, R. Burke, and B. Mobasher, “Managing popularity bias in recommender systems with personalized re-ranking,” in FLAIRS, 2019.
  • [134] N. Craswell, O. Zoeter, M. Taylor, and B. Ramsey, “An experimental comparison of click position-bias models,” in WSDM, 2008, pp. 87–94.
  • [135] O. Chapelle and Y. Chang, “Yahoo! learning to rank challenge overview,” in Proceedings of the Learning to Rank Challenge. PMLR, 2011, pp. 1–24.
  • [136] H. Chen, T. Harinen, J.-Y. Lee, M. Yung, and Z. Zhao, “CausalML: Python package for causal machine learning,” 2020.
  • [137] Microsoft Research, “EconML: A Python Package for ML-Based Heterogeneous Treatment Effects Estimation,” https://github.com/microsoft/EconML, 2019, version 0.x.
  • [138] “DoWhy: A Python package for causal inference,” https://github.com/microsoft/dowhy, 2019.
  • [139] X. Zheng, B. Aragam, P. K. Ravikumar, and E. P. Xing, “Dags with no tears: Continuous optimization for structure learning,” in NeurIPS, 2018, pp. 9472–9483.
  • [140] D. Kalainathan and O. Goudet, “Causal discovery toolbox: Uncover causal relationships in Python,” arXiv preprint arXiv:1903.02278, 2019.
  • [141] Y. Shimoni, C. Yanover, E. Karavani, and Y. Goldschmidt, “Benchmarking framework for performance-evaluation of causal inference analysis,” arXiv preprint arXiv:1802.05046, 2018.
  • [142] S. Wager and S. Athey, “Estimation and inference of heterogeneous treatment effects using random forests,” JASA, vol. 113, no. 523, pp. 1228–1242, 2018.
  • [143] M. Scutari, “Learning bayesian networks with the bnlearn r package,” Journal of Statistical Software, vol. 35, pp. 1–22, 2010.
  • [144] J. D. Ramsey, K. Zhang, M. Glymour, R. S. Romero, B. Huang, I. Ebert-Uphoff, S. Samarasinghe, E. A. Barnes, and C. Glymour, “Tetrad—a toolbox for causal discovery,” in 8th International Workshop on Climate Informatics, 2018.
  • [145] M. Kalisch, M. Mächler, D. Colombo, M. H. Maathuis, and P. Bühlmann, “Causal inference using graphical models with the r package pcalg,” Journal of statistical software, vol. 47, no. 11, pp. 1–26, 2012.
  • [146] K. Keith, D. Jensen, and B. O’Connor, “Text and causal inference: A review of using text to remove confounding from causal estimates,” in ACL, 2020, pp. 5332–5344.
  • [147] E. L. Ogburn, O. Sofrygin, I. Díaz, and M. J. Van der Laan, “Causal inference for social network data,” arXiv preprint, 2017.
  • [148] B. Bevilacqua, Y. Zhou, and B. Ribeiro, “Size-invariant graph representations for graph classification extrapolations,” in ICML. PMLR, 2021, pp. 837–851.
  • [149] B. R. Gordon, F. Zettelmeyer, N. Bhargava, and D. Chapsky, “A comparison of approaches to advertising measurement: Evidence from big field experiments at Facebook,” Marketing Science, vol. 38, no. 2, pp. 193–225, 2019.
  • [150] W. R. Shadish, M. H. Clark, and P. M. Steiner, “Can nonrandomized experiments yield accurate answers? a randomized experiment comparing random and nonrandom assignments,” JASA, vol. 103, no. 484, pp. 1334–1344, 2008.
  • [151] L. Ye, Y. Lin, H. Xie, and J. Lui, “Combining offline causal inference and online bandit learning for data driven decisions,” arXiv preprint arXiv:2001.05699, 2020.
  • [152] R. Kohavi and R. Longbotham, “Online controlled experiments and a/b testing.” Encyclopedia of machine learning and data mining, vol. 7, no. 8, pp. 922–929, 2017.
  • [153] S. J. Newsome, R. H. Keogh, and R. M. Daniel, “Estimating long-term treatment effects in observational data: A comparison of the performance of different methods under real-world uncertainty,” Statistics in medicine, vol. 37, no. 15, pp. 2367–2390, 2018.
  • [154] W. Liu, S. J. Kuramoto, and E. A. Stuart, “An introduction to sensitivity analysis for unobserved confounding in nonexperimental prevention research,” Prevention science, vol. 14, no. 6, pp. 570–580, 2013.
  • [155] J. Huang, “Causal inference (part 3 of 3): Model validation and applications,” https://medium.com/data-science-at-microsoft/causal-inference-part-3-of-3-model-validation-and-applications-c84764156a29, November 2020.
  • [156] C. Cinelli, D. Kumor, B. Chen, J. Pearl, and E. Bareinboim, “Sensitivity analysis of linear structural causal models,” in ICML, vol. 97. PMLR, 2019, pp. 1252–1261.
  • [157] P. R. Hahn, J. S. Murray, and C. M. Carvalho, “Bayesian regression tree models for causal inference: Regularization, confounding, and heterogeneous effects (with discussion),” Bayesian Analysis, vol. 15, no. 3, pp. 965–1056, 2020.
  • [158] S. Wager and S. Athey, “Estimation and inference of heterogeneous treatment effects using random forests,” JASA, vol. 113, no. 523, pp. 1228–1242, 2018.
  • [159] N. Hassanpour and R. Greiner, “Learning disentangled representations for counterfactual regression,” in ICLR, 2019.
  • [160] L. Yao, S. Li, Y. Li, M. Huai, J. Gao, and A. Zhang, “Representation learning for treatment effect estimation from observational data,” NeurIPS, vol. 31, pp. 2633–2643, 2018.
{IEEEbiography}

[[Uncaptioned image]]Lu Cheng received her B.Eng. degree in Logistics & Systems Engineering from Huazhong University of Science and Technology and her M.Eng. degree in Industrial Engineering from Rensselaer Polytechnic Institute. She is currently a fifth-year PhD candidate in Computer Science and Engineering at Arizona State University. Her research interests include socially responsible AI, causal learning, and data mining. She has published research papers in premier AI, data mining, and ML conferences. She is a student member of the ACM. Contact her at [email protected].

{IEEEbiography}

[[Uncaptioned image]]Ruocheng Guo received his Ph.D. degree in Computer Engineering from Arizona State University. He is currently an Assistant Professor of Data Science at City University of Hong Kong. His research interests include causal ML towards fair, interpretable, and generalizable AI; causal inference; and data mining. He is a member of the ACM, SIAM, and AAAI. Contact him at [email protected].

{IEEEbiography}

[[Uncaptioned image]]Raha Moraffah received her B.S. degree in Computer Science and Engineering from Sharif University of Technology. She is currently a fifth-year PhD student in Computer Science and Engineering at Arizona State University. Her research interests include causal inference, causal ML, and adversarial learning. She is a student member of the ACM. Contact her at [email protected].

{IEEEbiography}

[[Uncaptioned image]]Paras Sheth received his B.Tech degree in Computer Science and Engineering from the Institute of Engineering and Management. He is currently a third-year PhD student in Computer Science and Engineering at Arizona State University. His research interests include causal inference, data mining, and causal ML. Contact him at [email protected].

{IEEEbiography}

[[Uncaptioned image]]K. Selçuk Candan received his Ph.D. degree in Computer Science from the University of Maryland, College Park. He is currently a Professor of Computer Science and Engineering and the Director of the Center for Assured and Scalable Data Engineering (CASCADE) at Arizona State University. His research interests include data management, analysis, and scalable ML. He is a Distinguished Scientist of the ACM.

{IEEEbiography}

[[Uncaptioned image]]Huan Liu (F’12) received the B.Eng. degree in computer science and electrical engineering from Shanghai Jiao Tong University and the Ph.D. degree in computer science from the University of Southern California. He is currently a Professor of Computer Science and Engineering at Arizona State University. His research interests include data mining, ML, social computing, and artificial intelligence, investigating problems that arise in many real-world applications with high-dimensional data of disparate forms. His well-cited publications include books, book chapters, encyclopedia entries, and conference and journal papers. He is a Fellow of the IEEE, ACM, AAAI, and AAAS.