Learning causal graphs using variable grouping according to ancestral relationship
Abstract
Causal discovery has drawn increasing attention in recent years. Over the past quarter century, several useful algorithms for learning causal graphs have been proposed. However, when the sample size is small relative to the number of variables, the accuracy of estimating causal graphs using existing methods decreases. In addition, some methods are not feasible when the sample size is smaller than the number of variables.
To circumvent these problems, some researchers proposed causal structure learning algorithms using divide-and-conquer approaches (e.g., [1, 2]). For learning the entire causal graph, the divide-and-conquer approaches first split variables into several subsets according to the conditional independence relationships among the variables, then apply a conventional causal structure learning algorithm to each subset and merge the estimated results. Since the divide-and-conquer approach reduces the number of variables to which a causal structure learning algorithm is applied, it is expected to improve the estimation accuracy of causal graphs, especially when the sample size is small relative to the number of variables and the model is sparse. However, existing methods are either computationally expensive or do not provide sufficient accuracy when the sample size is small.
This paper proposes a new algorithm for grouping variables according to the ancestral relationships among the variables, assuming that the causal model is LiNGAM [3], where the causal relationships are linear and the exogenous variables are mutually independent and follow continuous non-Gaussian distributions. We call the proposed algorithm CAG (Causal Ancestral-relationship-based Grouping). The time complexity of the ancestor finding in CAG is shown to be cubic in the number of variables. Extensive computer experiments confirm that the proposed method outperforms the original DirectLiNGAM [4] without grouping variables and other divide-and-conquer approaches not only in estimation accuracy but also in computation time when the sample size is small relative to the number of variables and the model is sparse.
Index Terms:
causal discovery, causal DAG, conditional independence test, DirectLiNGAM, divide-and-conquer, linear regression.
I Introduction
In recent years, inference using high-dimensional causal models for observational data has played a pivotal role in various fields, such as econometrics [5], biology [6], and psychology [7]. For the high dimensional causal inference, the structural causal model (SCM, [8, 9]) defined by a directed acyclic graph (DAG) has been extensively used. Since the causal relationships among variables are usually unknown, we also need to learn the causal DAG before we use the SCM.
Over the past quarter century, several practical algorithms for learning causal DAGs without latent confounders have been proposed. These algorithms are classified into several types.
The constraint-based methods use the conditional independence (CI) relationships among variables to determine causal directions. The PC algorithm [10] is a typical example. Using hypothesis tests, the PC algorithm first estimates the CI relationships between all pairs of variables. Then, it estimates the skeleton of a causal DAG and the directions of the edges in the skeleton in turn. However, the time complexity of the PC algorithm is, in the worst case, exponential in the number of variables. Hence, the PC algorithm is not feasible for high-dimensional data.
The greedy equivalence search (GES) [11] is an algorithm that learns causal structures using a model selection criterion such as BIC [12]. This type of method is called a score-based method. As Chickering et al. [13, 14] have proven, however, score-based structure learning is NP-hard, so applying GES to high-dimensional causal models is also impractical.
It is important to note that the constraint-based and score-based methods can only identify causal DAGs up to the Markov equivalence class, the set of all DAGs compatible with the inferred CI relationships. To fully identify causal DAGs requires additional constraints on the causal model.
Shimizu et al. [3] considered the linear structural equation model as the causal model, where the exogenous variables are mutually independent and follow continuous non-Gaussian distributions. They called the model the linear non-Gaussian acyclic model (LiNGAM) and showed that the causal DAG defining LiNGAM can be fully identified using independent component analysis (ICA-LiNGAM) [15]. However, ICA-LiNGAM tends to converge to a locally optimal solution when the dimension of the model is high, resulting in lower estimation accuracy (e.g., [4]). To overcome this difficulty, Shimizu et al. [4] proposed DirectLiNGAM, which estimates causal DAGs using linear regressions between variables and CI tests.
Hoyer et al. [16] and Zhang and Hyvärinen [17] generalized LiNGAM to nonlinear and showed that when the model is nonlinear, the causal DAG is identifiable even if the exogenous variables are Gaussian distributed.
However, the estimation accuracy of existing methods tends to decrease as the sample size decreases relative to the number of variables. Furthermore, DirectLiNGAM is not feasible when the sample size is smaller than or equal to the number of variables (e.g., [18]).
Cai et al. [1] proposed the scalable causation discovery algorithm (SADA) to address this problem. The main idea of SADA is to group variables into several subsets according to the CI relationships between the variables, then learn the causal DAG of each group, merge the results, and return the estimate of the entire causal DAG. In this method, the causal DAG learning algorithm is applied to each smaller group of variables, reducing the sample size required for it to work. Through simulation studies, Cai et al. [1] showed that SADA improves on LiNGAM in terms of estimation accuracy. In SADA, however, the time complexity of grouping variables is exponential in the number of variables. In addition, SADA has a severe drawback: even if the correct CI relationships are known, the marginal model for the variables in each group is not necessarily the causal model defined by the sub-DAG induced by the variables in that group. This means that the causal DAG estimator obtained by SADA may not be consistent, depending on the true causal DAG (e.g., [2]).
Zhang et al. [2] proposed another algorithm for grouping variables called causality partitioning (CAPA). CAPA requires only low-order CI relationships among variables to group them. Letting $p$ be the number of variables and $k$ be the maximum order of the conditioning set, the time complexity of CAPA is polynomial in $p$, with the degree determined by $k$. CAPA guarantees that if the correct low-order CI relationships are known, the marginal model for each group of variables is the causal model defined by the sub-DAG induced by the variables in that group. However, especially for DAGs with high outdegree and low indegree, CAPA tends not to group variables finely enough to improve estimation accuracy even if the true conditional independence relationships are known.
Maeda and Shimizu [19] proposed the repetitive causal discovery (RCD) algorithm, which is designed for models with latent confounders but can also be applied to models without them. RCD first determines the ancestral relationships among variables using linear regressions and CI tests and creates a list of ancestor sets for each variable. The parent-child relationships between variables are then determined from the CI relationships among the variables in each estimated ancestor set. RCD can be interpreted as a divide-and-conquer approach because it splits the entire variable set into the family of ancestor sets of the individual variables and then learns the causal DAG within each of them.
In this paper, we propose another scheme for grouping variables according to the ancestral relationships between the variables, again assuming LiNGAM for the causal model. The proposed method uses the algorithm for finding the variables' ancestor sets in RCD. When the true causal DAG is connected and the ancestor sets are correctly estimated, the variable grouping obtained by the proposed method is the family of maximal elements among the unions of each variable and its ancestors. The time complexity of the proposed algorithm for grouping variables is cubic in the number of variables. Suppose the true ancestor sets are correctly estimated. In that case, the marginal model for the variables in each group obtained by the proposed method is always the LiNGAM defined by the sub-DAG induced by the variables in that group, and no edge exists across different groups. Therefore, the entire causal DAG can be consistently estimated by applying a causal structure learning algorithm such as DirectLiNGAM to each group and merging the results. We call the proposed procedure for grouping variables the causal ancestral-relationship-based grouping (CAG).
Extensive computer experiments show that CAG outperforms the original DirectLiNGAM without grouping variables, CAPA, and RCD regarding estimation accuracy and computation time when the sample size is small relative to the number of variables and the true causal DAG is sparse.
The rest of this paper is organized as follows: Section II summarizes some existing causal structure learning algorithms and clarifies the position of the proposed method. Section III describes the details of the proposed method. Section IV confirms the proposed method's usefulness through computer experiments. Section V concludes the paper. The pseudo-code for the proposed method is provided in the Appendix.
II Related Works
II-A LiNGAM and Variants
Let $x = (x_1, \dots, x_p)^\top$ be a $p$-dimensional random vector. In the following, we identify $x$ with the variable set $V = \{x_1, \dots, x_p\}$. Assume that the causal relationships among the variables are acyclic and linear. LiNGAM [3] is expressed by
$$x = Bx + e, \qquad (1)$$
where the error term $e = (e_1, \dots, e_p)^\top$ is assumed to have mutually independent components following continuous non-Gaussian distributions, and $B = (b_{ij})$ is a $p \times p$ coefficient matrix. A nonzero $b_{ij}$ represents the direct causal effect from $x_j$ to $x_i$, and $b_{ij} = 0$ indicates the absence of a direct causal effect from $x_j$ to $x_i$. We note that $B$ can be transformed into a strictly lower triangular matrix by simultaneously permuting its rows and columns when the causal relationships among $x_1, \dots, x_p$ are defined by a DAG.
The model (1) can be rewritten as
$$x = (I - B)^{-1} e, \qquad (2)$$
where $I$ denotes the identity matrix. Noting that (2) is equivalent to the independent component analysis (ICA) model, Shimizu et al. [3] showed that $B$ is identifiable and proposed an algorithm for estimating $B$. The algorithm is known as ICA-LiNGAM.
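For concreteness, the following minimal sketch simulates data from a small LiNGAM through the equivalent form (2); the coefficient matrix, error distribution, and sample size below are illustrative choices, not settings taken from this paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative coefficient matrix B (strictly lower triangular in the causal
# order x1 -> x2 -> x3): x1 is exogenous, x2 <- x1, x3 <- x2.
B = np.array([[0.0,  0.0, 0.0],
              [0.8,  0.0, 0.0],
              [0.0, -0.6, 0.0]])

n, p = 1000, B.shape[0]

# Mutually independent, continuous, non-Gaussian errors (uniform as an example).
e = rng.uniform(-1.0, 1.0, size=(n, p))

# x = B x + e  is equivalent to  x = (I - B)^{-1} e, i.e., model (2).
X = e @ np.linalg.inv(np.eye(p) - B).T
```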
However, as the number of variables increases, ICA-LiNGAM is more likely to converge to a locally optimal solution, which reduces the accuracy of causal DAG estimation. To overcome this problem, Shimizu et al. [4] proposed another algorithm for estimating $B$, named DirectLiNGAM. For each pair of variables, DirectLiNGAM fits two simple linear regressions, each taking one of the variables as the dependent variable and the other as the independent variable; if the independent variable and the residual are mutually independent in only one of the two regressions, the independent variable of that regression precedes the dependent variable in the causal order. After determining the causal order of all variables, DirectLiNGAM estimates the set of parents of each variable by sparsely estimating a linear regression model with that variable as the dependent variable and all variables preceding it in the causal order as independent variables. This approach ensures that no edges violate the estimated causal order. DirectLiNGAM is also less accurate when the number of variables is large relative to the sample size, and it tends to output redundant edges even when the sample size is large.
The computational cost of DirectLiNGAM is higher than that of ICA-LiNGAM: its time complexity is $O(np^3M^2 + p^4M^3)$, where $n$ is the sample size, $p$ is the number of variables, and $M$ is the maximal rank found by the low-rank decomposition used in the kernel-based independence measure [4]. Since DirectLiNGAM requires estimating a linear regression model on up to $p - 1$ regressors, it is not feasible when $n$ is smaller than $p$.
II-B Scalable Causation Discovery Algorithm (SADA)
Cai et al. [1] proposed a divide-and-conquer approach called the scalable causation discovery algorithm (SADA) to enhance the scalability of causal DAG learning algorithms. SADA first splits the variable set into two or more subsets. Then, a causal DAG learning algorithm such as DirectLiNGAM is applied to each group of variables, and finally, all the results are merged to estimate the entire causal DAG. Because SADA applies the causal DAG learning algorithm to each group with a smaller number of variables, if the variables are grouped correctly, SADA is expected to improve the estimation accuracy when the sample size is small relative to the number of the entire variable set. SADA is feasible even if the sample size is smaller than the number of variables as long as it is larger than the number of variables in each group.
The procedure by which SADA splits $V$ into two subsets is as follows. SADA first randomly selects two variables $x_a$ and $x_b$ that are conditionally independent given some subset $C$ of the remaining variables and finds the smallest such $C$ with respect to the inclusion relation. The disjoint subsets $V_1$, $V_2$, and $C$ are then computed according to the following procedure.
1. Initialize $V_1$ with $\{x_a\}$, $V_2$ with $\{x_b\}$, and $C$ with the separating set found above.
2. For each remaining variable, add it to $V_1$, to $V_2$, or to $C$, according to its CI relationships with the variables already assigned.
3. For each variable in $C$, move it into $V_1$ and/or $V_2$ whenever the relevant CI relationships allow.
4. Return $V_1$, $V_2$, and $C$.
In step 2, the intermediate set $C$ tends to be large, and step 3 downsizes $C$ as much as possible.
The output of SADA is the two subsets $V_1 \cup C$ and $V_2 \cup C$. This procedure can be applied recursively to group variables into smaller subsets. In the implementation, a lower bound $\theta$ on the number of variables in each group is set, and SADA is applied again to $V_1 \cup C$ or $V_2 \cup C$ to create even smaller groups only when $|V_1 \cup C| > \theta$ or $|V_2 \cup C| > \theta$. If the number of variables is less than or equal to $\theta$, or if no pair $x_a$, $x_b$ that is conditionally independent given some set $C$ can be found, no further grouping is made.
We illustrate the SADA procedure to split into two subsets where the true causal DAG is the one in Fig 1. Table I shows the process of grouping variables into two subsets. Since both and hold in the model defined by the DAG in Fig. 1, we can initialize with , and . We then use the CI tests to determine one by one to which of , , and the remaining variables belong. This procedure returns , , and hence and . In this case, we note that the marginal models for and are defined by the sub-DAGs induced by and , respectively. If the initial values of , , and or the scanning order of changes, and may also change.
Step | C | |||
Initial | , , , , , | |||
Check | , , , , , | , | ||
Check | , , , , | , , | ||
Check | , , , | , , | ||
Check | , , | , , | ||
Check | , | , , , | ||
Check | , , , | , | ||
Check | , , , | , , |
Suppose that $V$ is grouped into $V_1, \dots, V_K$ by SADA's variable grouping procedure. SADA applies a causal structure learning algorithm such as DirectLiNGAM to each of the groups. Let $\hat{G}_k = (V_k, \hat{E}_k)$, $k = 1, \dots, K$, be the estimated DAGs for the groups, where $\hat{E}_k$ is the set of directed edges in $\hat{G}_k$. Then, the entire causal DAG is estimated by $\hat{G} = (V, \bigcup_{k=1}^{K} \hat{E}_k)$.
However, SADA has several practical problems. In SADA, the marginal model , is not necessarily the causal model induced by even if we knew the correct CI relationships between variables. Assume that the true causal DAG is Fig 2 and that . Since , we can initialize with , , and , which does not need further variable checking. As a result, we group the original variable set into and . For , the marginal model is defined by the sub-DAG induced by , and . However, the marginal model with respect to is not the model defined by the sub-DAG induced by because .
SADA does not guarantee that it always returns a DAG. Cycles may appear when DAG estimates for each variable group are merged. Suppose and share and . If , , the cycle appears after merging and . Therefore, SADA needs to remove redundant edges in cycles.
In addition, SADA requires too many CI tests for grouping variables. Consider the simple model in which $x_1, \dots, x_p$ are mutually independent and no edge exists in the true DAG. In this case, any pair of variables is conditionally independent given any conditioning set. In the worst case, the number of CI tests required to verify the conditional independences used for grouping is exponential in $p$. Thus, SADA is not feasible for high-dimensional data.
II-C Causality Partitioning (CAPA)
Zhang et al. [2] proposed a refined grouping scheme called causality partitioning (CAPA). Define the order of a CI test as the number of variables in its conditioning set. CAPA reduces the time complexity by setting an upper limit on the order of the CI tests and using only CI tests whose orders do not exceed that limit.
When the maximum order of the CI tests is $k$, the time complexity of CAPA is shown to be polynomial in the number of variables, with the degree determined by $k$. In the implementation, $k$ is set to a small number, such as two or three.
The procedure for grouping $V$ into two subsets by CAPA is as follows. Let $t$ be the current order of the conditioning set. Let $A = (a_{ij})$ be the $p \times p$ matrix whose entry $a_{ij}$ is one if $x_i$ and $x_j$ cannot be made conditionally independent by any conditioning set of size at most $t$, and zero otherwise. Let $G_A$ be the undirected graph with $A$ as its adjacency matrix, and let $d_i$, $i = 1, \dots, p$, be the degree of $x_i$ in $G_A$.
1. Initialize $V_1$.
2. Initialize $V_2$.
3. For all variables, in ascending order of their degrees $d_i$ in $G_A$, do at most one of the following so that the required condition on $V_1$ and $V_2$ holds at output:
(a) if the corresponding CI condition holds, assign the variable to $V_1$;
(b) if the corresponding CI condition holds, assign the variable to $V_2$.
4.-6. Post-process the assignment of the remaining variables and update $V_1$ and $V_2$ (the detailed conditions are given in [2]).
7. Return $V_1$ and $V_2$ if the termination condition is met; otherwise, if the current order $t$ has reached the maximum order $k$, exit; otherwise increase $t$ and go back to step 2.
When we know the true CI relationships among the variables, the resulting separating set always d-separates the two groups in the true causal DAG. Therefore, the marginal model for each group is guaranteed to be defined by the sub-DAG induced by that group. In general, the ascending order of the degrees $d_i$ is not unique, and the output may change depending on the choice of the ascending order. In Step 3, both 3(a) and 3(b) may be applicable to a variable, in which case one of them is randomly selected, which may also change the output.
We illustrate the CAPA procedure with to create two subsets where the true causal DAG is the one in Fig 1. When , is an undirected graph in Fig. 3. Table II shows an example of the results of Step 3 and Step 4 when is used as the order of by . Since , we have . Therefore and are
respectively.
In this example, , , so the size of each group is not so small. CAPA can also be repeated recursively to split variables into smaller subsets. CAPA tends to fail to group finely in causal DAGs that contain vertices with high outdegree and low indegree.
order | ||||
1 | ||||
1 | ||||
1 | ||||
1 | ||||
2 | ||||
2 | ||||
2 | ||||
2 | ||||
4 |
II-D Repetitive Causal Discovery (RCD)
RCD [19] is a novel causal structure learning algorithm that can be applied even when the model contains latent confounders. RCD also assumes LiNGAM, i.e., the causal relationships are linear, and the exogenous variables are independently distributed as continuous non-Gaussian distributions. The processes in RCD are divided into the following three parts: ancestral relationship finding, parental relationship finding, and confounder determining.
The first step in RCD is to determine the ancestral relationships between variables by repeatedly conducting simple linear regressions and independence tests on each variable pair, in a way similar to how the causal order is determined in DirectLiNGAM, and to create a list of ancestor sets for each variable. Let $\widehat{Anc}(x_i)$ denote the estimated ancestor set of $x_i$.
Once the ancestral relationships are estimated, RCD extracts parent-child relationships between pairs of variables using CI tests. Assume $x_j \in \widehat{Anc}(x_i)$. If $x_i$ and $x_j$ are not conditionally independent given $\widehat{Anc}(x_i) \setminus \{x_j\}$, then $x_j$ can be determined to be a parent of $x_i$. If the RCD output contains a variable pair for which the direction of causality cannot be identified, we conclude that there are latent confounders between them.
Our study assumes that the causal model contains no latent confounders; RCD can also be applied to such models. RCD can be interpreted as another divide-and-conquer approach to improving scalability because it estimates the causal DAG by estimating the parent-child relationships within each variable's ancestor set. Our proposed method also uses RCD's algorithm for finding ancestral relationships in order to group variables.
RCD may return a graph that contains cycles even when the sample size is large. This is because errors in estimating ancestral relationships lead to an incorrect estimation of the parent-child relationships.
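The following sketch illustrates the kind of parent check described above, under our reading of RCD: a candidate ancestor $x_j$ of $x_i$ is kept as a parent when $x_i$ and $x_j$ remain dependent given the other estimated ancestors of $x_i$. RCD itself uses HSIC/KCI-type tests here; Pearson's correlation between residuals is used below only as a simple linear stand-in, and the function names are ours.

```python
import numpy as np
from scipy import stats

def is_parent(X, i, j, anc_i, alpha=0.01):
    """Judge whether x_j (an estimated ancestor of x_i) is a parent of x_i.

    x_j is kept as a parent if x_i and x_j remain dependent given the other
    estimated ancestors of x_i (dependence tested via residual correlation).
    """
    Z = [a for a in anc_i if a != j]

    def residual(y, Z):
        # OLS residual of y regressed on the columns Z of X (with intercept).
        if not Z:
            return y - y.mean()
        A = np.column_stack([X[:, Z], np.ones(len(y))])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        return y - A @ beta

    r_i = residual(X[:, i], Z)
    r_j = residual(X[:, j], Z)
    _, pval = stats.pearsonr(r_i, r_j)
    return pval < alpha  # dependence remains => x_j is judged a parent of x_i
```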
II-E Positioning of the Proposed Method
The proposed method also assumes LiNGAM (1). Like SADA and CAPA, the proposed method begins by grouping variables into multiple subsets based on the ancestral relationships between variables. The ancestral relationships are estimated in the same way as in RCD. When the true causal DAG is connected, the proposed method defines the variable grouping by the maximal elements in the family of the unions of each variable and its ancestors. Thus, if the ancestral relationships are correctly estimated, the variables are partitioned into as many groups as there are sink nodes in the true causal DAG. The proposed method therefore groups finely for DAGs with high outdegree and low indegree. The proposed method's time complexity for variable grouping is $O(p^3)$, which is of the same order as CAPA with a small maximum conditioning order.
When the ancestral relationships among the variables are correctly estimated, the marginal model for variables in each group obtained by the proposed method is the LiNGAM defined by the sub-DAG induced by the variables in each group.
In RCD, the estimated ancestral relationships are used to estimate parent-child relationships among variables. In contrast, the estimated ancestral relationships are only used to group variables in the proposed method. For each group, DirectLiNGAM is applied to estimate sub-DAGs, which are then merged to estimate the entire causal DAG.
III Causal Ancestral-relationship-based Grouping (CAG)
This section introduces the proposed algorithm in detail. Let $x = (x_1, \dots, x_p)^\top$ be a $p$-dimensional random vector and let $V = \{x_1, \dots, x_p\}$. Let $G = (V, E)$ be the true causal DAG, where $E$ is the set of directed edges in $G$. We assume that $x$ follows LiNGAM (1) without latent confounders. Let $Anc(x_i)$ and $Pa(x_i)$ be the sets of ancestors and parents of $x_i$ in $G$, respectively, and let $CommAnc(x_i, x_j) = Anc(x_i) \cap Anc(x_j)$ be the set of common ancestors of $x_i$ and $x_j$ in $G$.
III-A Finding the Ancestral Relationships Between the Variables
This subsection summarizes the algorithm in RCD [19] for finding the ancestral relationships between the variables in $V$. To determine the ancestral relationship between $x_i$ and $x_j$, $i \neq j$, Maeda and Shimizu [19] considered the simple regression models
$$x_i = \beta_i^{(j)} x_j + e_i^{(j)}, \qquad x_j = \beta_j^{(i)} x_i + e_j^{(i)}, \qquad (3)$$
where $e_i^{(j)}$ and $e_j^{(i)}$ are error terms. Maeda and Shimizu [19] focused on the independence relationship between the independent variable and the error term in each model. The following proposition holds for the ancestral relationship between $x_i$ and $x_j$.
Proposition 1 (Maeda and Shimizu [19]).
One of the following four conditions holds for the ancestral relationship between $x_i$ and $x_j$.
1. If $x_i$ and $x_j$ are independent, then $x_i$ and $x_j$ have no ancestral relationship.
2. If $x_i$ and $x_j$ are dependent and $x_j$ is independent of $e_i^{(j)}$, then $x_j \in Anc(x_i)$.
3. If $x_i$ and $x_j$ are dependent and $x_i$ is independent of $e_j^{(i)}$, then $x_i \in Anc(x_j)$.
4. If $x_j$ is dependent on $e_i^{(j)}$ and $x_i$ is dependent on $e_j^{(i)}$, then $x_i$ and $x_j$ have a common ancestor.
This algorithm first checks which of conditions 1 through 4 in Proposition 1 holds for each variable pair $(x_i, x_j)$. For implementation, the error terms $e_i^{(j)}$ and $e_j^{(i)}$ are replaced with the OLS residuals. If a pair $(x_i, x_j)$ satisfies condition 4, the determination of the ancestral relationship between $x_i$ and $x_j$ is withheld. Let $P$ be the set of variable pairs that satisfy condition 4, and let $\widehat{CommAnc}(x_i, x_j)$ be the set of common ancestors of $x_i$ and $x_j$ found while checking Proposition 1 for all pairs.
Assume that $(x_i, x_j) \in P$ and $\widehat{CommAnc}(x_i, x_j) \neq \emptyset$. Then, $\widehat{CommAnc}(x_i, x_j)$ forms part of the backdoor paths between $x_i$ and $x_j$. To remove the influence of the common ancestors from $x_i$ and $x_j$, consider the regression models
$$x_i = \gamma_i^\top \widehat{CommAnc}(x_i, x_j) + u_i, \qquad x_j = \gamma_j^\top \widehat{CommAnc}(x_i, x_j) + u_j, \qquad (4)$$
where we identify $\widehat{CommAnc}(x_i, x_j)$ with the vector of the common ancestors of $x_i$ and $x_j$. Furthermore, consider the following regression models for the error terms $u_i$ and $u_j$,
$$u_i = \delta_i^{(j)} u_j + v_i^{(j)}, \qquad u_j = \delta_j^{(i)} u_i + v_j^{(i)}. \qquad (5)$$
Then Proposition 1 is generalized as follows.
Proposition 2 (Maeda and Shimizu [19]).
One of the following four conditions holds for the ancestral relationship between $x_i$ and $x_j$.
1. If $u_i$ and $u_j$ are independent, then $x_i$ and $x_j$ have no ancestral relationship.
2. If $u_i$ and $u_j$ are dependent and $u_j$ is independent of $v_i^{(j)}$, then $x_j \in Anc(x_i)$.
3. If $u_i$ and $u_j$ are dependent and $u_i$ is independent of $v_j^{(i)}$, then $x_i \in Anc(x_j)$.
4. If $u_j$ is dependent on $v_i^{(j)}$ and $u_i$ is dependent on $v_j^{(i)}$, then $x_i$ and $x_j$ have a common ancestor that has not yet been found.
Proposition 1 is the case where $\widehat{CommAnc}(x_i, x_j) = \emptyset$. For implementation, the error terms $u_i$, $u_j$, $v_i^{(j)}$, and $v_j^{(i)}$ are replaced with OLS residuals. If a pair satisfies condition 4, the determination of the ancestral relationship between $x_i$ and $x_j$ is again withheld. After checking Proposition 2 for all pairs in $P$, we update $P$ and the sets of found common ancestors. If new ancestral relationships are found, Proposition 2 is checked again. Theoretically, if the model contains no latent confounders, the procedure based on Proposition 2 can be repeated until $P = \emptyset$, completely determining the ancestral relationships of all variable pairs.
The pseudo-code of this algorithm is described in Algorithm 1 in the Appendix. Since linearity is assumed between the variables, Pearson's correlation test [20] can be used to test the marginal independence between $x_i$ and $x_j$ (and between $u_i$ and $u_j$). For testing the independence between the independent variables and the residuals, Maeda and Shimizu [19, 21] used the Hilbert-Schmidt independence criterion (HSIC) [22]. This paper instead uses the kernel-based conditional independence (KCI) test [23], because its accuracy is comparable and its computation time is shorter.
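As an illustration, the following sketch checks the four cases of Proposition 1 for a single variable pair, with the error terms replaced by OLS residuals as described above. The dependence measure `_dep` is only a crude stand-in for the HSIC/KCI tests used in the paper (the linear correlation between a regressor and its OLS residual is zero by construction, so a nonlinear transform is applied), and the threshold value is an arbitrary illustrative choice.

```python
import numpy as np
from scipy import stats

def _resid(y, x):
    """OLS residual of y regressed on x (with intercept)."""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)

def _dep(a, b):
    """Crude surrogate for an HSIC/KCI-type dependence measure between a and b."""
    return max(abs(np.corrcoef(np.tanh(a), b)[0, 1]),
               abs(np.corrcoef(a, np.tanh(b))[0, 1]))

def ancestral_relation(xi, xj, alpha=0.01, dep_thresh=0.1):
    """Return 'none', 'j->i', 'i->j', or 'withheld' for the pair (x_i, x_j)."""
    _, p_corr = stats.pearsonr(xi, xj)
    if p_corr >= alpha:                      # case 1: marginally independent
        return 'none'
    dep_i = _dep(xj, _resid(xi, xj))         # x_j vs residual of x_i on x_j
    dep_j = _dep(xi, _resid(xj, xi))         # x_i vs residual of x_j on x_i
    if dep_i < dep_thresh <= dep_j:
        return 'j->i'                        # case 2: x_j is an ancestor of x_i
    if dep_j < dep_thresh <= dep_i:
        return 'i->j'                        # case 3: x_i is an ancestor of x_j
    return 'withheld'                        # case 4: common ancestor suspected
```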
III-B Grouping Variables and Merging Results
This subsection introduces a procedure for grouping into several subsets according to the estimated ancestral relationships among variables and merging the estimated causal DAGs for each group.
Let $\widehat{Anc}(x_i)$ be the estimated ancestor set of $x_i$. Define $\Pi$ as the output of the following procedure.
1. Let $\mathcal{M}$ be the family of maximal sets, with respect to set inclusion, in $\{\{x_i\} \cup \widehat{Anc}(x_i) : i = 1, \dots, p\}$.
2a) If $|M| = 1$ for all $M \in \mathcal{M}$, then let $\Pi$ be the family of all variable pairs $\{\{x_i, x_j\} : i \neq j\}$.
2b) Otherwise, for every $M \in \mathcal{M}$ with $|M| = 1$, merge $M$ into the other groups, and let $\Pi$ be the resulting family of groups.
3) Return $\Pi$.
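A minimal sketch of this grouping procedure follows. The maximal-set computation mirrors step 1 and the all-pairs fallback mirrors step 2a; the exact way singleton groups are folded into the other groups in step 2b is not fully specified above, so merging each singleton into every larger group is an assumption made only for illustration.

```python
from itertools import combinations

def cag_grouping(anc):
    """Group variables from estimated ancestor sets.

    anc: dict mapping each variable index i to its estimated ancestor set.
    Returns a list of variable groups (sets of indices).
    """
    closures = [frozenset({i}) | frozenset(anc[i]) for i in anc]
    # Step 1: keep only the maximal sets with respect to inclusion.
    maximal = [s for s in closures if not any(s < t for t in closures)]
    maximal = list(dict.fromkeys(maximal))            # drop duplicates
    if all(len(s) == 1 for s in maximal):
        # Step 2a: no ancestral relationship detected; fall back to all pairs.
        return [set(pair) for pair in combinations(sorted(anc), 2)]
    # Step 2b: fold singleton groups into the remaining (larger) groups.
    groups = [set(s) for s in maximal if len(s) > 1]
    for s in maximal:
        if len(s) == 1:
            for g in groups:
                g |= s
    return groups
```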
Definition 1.
Define $\Pi$ as the grouping of $V$ obtained from the ancestral relationships among $x_1, \dots, x_p$.
We call the series of procedures for obtaining $\Pi$ the causal ancestral-relationship-based grouping (CAG). If the ancestral relationships among $x_1, \dots, x_p$ are correctly estimated and the true causal DAG $G$ is connected, each element of $\Pi$ is the union of a sink node in $G$ and its ancestors. We also note that $\bigcup_{V' \in \Pi} V' = V$. Now, we have the following theorem.
Theorem 1.
Assume that we know the correct ancestral relationships between the variables. Define $G_{V'}$ as the sub-DAG of $G$ induced by $V' \in \Pi$. Then, the marginal model for $V'$ is the LiNGAM defined by $G_{V'}$.
Proof.
The probability density function of $x$ is written as
$$p(x_1, \dots, x_p) = \prod_{i=1}^{p} p(x_i \mid Pa(x_i)).$$
We note that when $x_i \in V'$, $Pa(x_i)$ does not include any variables in $V \setminus V'$. By integrating out the variables in $V \setminus V'$ according to the reverse causal order of $G$, we have
$$p(x_{V'}) = \prod_{x_i \in V'} p(x_i \mid Pa(x_i)),$$
which is the LiNGAM defined by $G_{V'}$. ∎
Theorem 1 claims that we can consistently estimate the sub-DAG $G_{V'}$ for each $V' \in \Pi$ by applying a causal structure learning algorithm such as DirectLiNGAM to the variables in $V'$. Once we obtain the estimates $\hat{G}_{V'} = (V', \hat{E}_{V'})$, $V' \in \Pi$, the entire causal graph can be estimated by $\hat{G} = (V, \bigcup_{V' \in \Pi} \hat{E}_{V'})$.
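The estimate-and-merge step can be sketched as follows, assuming the interface of the Python `lingam` package (a `DirectLiNGAM` object with a `fit` method and an `adjacency_matrix_` attribute whose entry $(i, j)$ is the estimated coefficient from $x_j$ to $x_i$); the sketch simply collects the edges estimated within each group into a single coefficient matrix.

```python
import numpy as np
import lingam  # assumed: https://github.com/cdt15/lingam

def estimate_and_merge(X, groups):
    """Apply DirectLiNGAM to each variable group and merge the sub-DAGs.

    X: (n, p) data matrix; groups: list of sets of column indices.
    Returns a (p, p) matrix B_hat with B_hat[i, j] the estimated effect x_j -> x_i.
    """
    p = X.shape[1]
    B_hat = np.zeros((p, p))
    for group in groups:
        idx = sorted(group)
        model = lingam.DirectLiNGAM()
        model.fit(X[:, idx])
        B_sub = model.adjacency_matrix_          # coefficients within the group
        for a, i in enumerate(idx):
            for b, j in enumerate(idx):
                if B_sub[a, b] != 0:
                    B_hat[i, j] = B_sub[a, b]    # merge: union of edges over groups
    return B_hat
```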
When the sample size is small, many ancestral relationships cannot be detected owing to type II errors of the CI tests in CAG, so there may be many variables whose estimated ancestor sets are empty, resulting in many groups in $\mathcal{M}$ consisting of a single variable. Using such a grouping would make the causal DAG estimates too sparse. Therefore, each group consisting of a single variable in $\mathcal{M}$ is merged with the other groups according to procedure 2b above. Note that Theorem 1 still holds even if $\mathcal{M}$ contains groups consisting of one variable.
Table III presents the correct and for of the DAG in Figure 1. In this case, we can easily see that . By applying the causal structure learning algorithm to each element of , we can consistently estimate the sub-DAGs as shown in Fig. 4. By merging the sub-DAGs as , we can obtain the true causal DAG in Figure 1.
Variables | ||

As in SADA, the estimated DAG for the entire variable set may contain cycles due to errors in learning the DAGs for the individual groups, even if the correct ancestral relationships are estimated. The proposed method applies topological sorting to the merged graph to check whether a directed cycle exists. In SADA, the significance of the coefficient of every edge in a cycle is assessed by the Wald test, and the least significant edge in the cycle is removed. In the proposed method, since the ancestral relationships between variables are estimated, directed cycles can also be eliminated by removing edges in the cycle that contradict the estimated ancestral relationships. A third possibility is to remove the edge in the cycle whose estimated coefficient has the smallest absolute value. The computer experiments in Section IV compare the performance of these three methods for removing cycles.
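As an illustration of the third rule, the following sketch repeatedly detects a directed cycle in the merged graph and removes, from each detected cycle, the edge whose estimated coefficient has the smallest absolute value; the Wald-test and ancestral-relationship variants would differ only in how the edge to be removed is chosen. Using networkx for cycle detection is an implementation choice made here, not part of the paper.

```python
import numpy as np
import networkx as nx

def remove_cycles_by_coefficient(B_hat):
    """Break directed cycles in a merged DAG estimate.

    B_hat[i, j] != 0 encodes an edge x_j -> x_i; within each detected cycle the
    edge with the smallest absolute coefficient is removed (the 'abs' rule).
    """
    B = B_hat.copy()
    while True:
        G = nx.DiGraph([(j, i) for i, j in zip(*np.nonzero(B))])
        try:
            cycle = nx.find_cycle(G)     # list of (u, v) edges forming a cycle
        except nx.NetworkXNoCycle:
            return B
        u, v = min(cycle, key=lambda e: abs(B[e[1], e[0]]))
        B[v, u] = 0.0                    # drop the weakest edge in the cycle
```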
III-C Time Complexity of Variable Grouping and Minimum Required Sample Size
It suffices to count the CI tests to evaluate the time complexity of variable grouping by CAG. The case that requires the most CI tests is when the true causal DAG is fully connected, i.e., either $x_i \in Anc(x_j)$ or $x_j \in Anc(x_i)$ for every pair $i \neq j$. In that case, since the proposed algorithm performs CI tests while removing source nodes from the DAG one by one, the number of required CI tests is
$$\sum_{m=2}^{p} \binom{m}{2} = \binom{p+1}{3} = O(p^3).$$
As mentioned in Section II-A, the time complexity of DirectLiNGAM is $O(np^3M^2 + p^4M^3)$. Even allowing for the time complexity of grouping variables, if the number of variables in each group is small relative to $p$, CAG is expected to reduce the overall computation time compared to the original DirectLiNGAM without variable grouping.
Let $p_{\max}$ be the largest cardinality of the groups in $\Pi$. Then, the minimum required sample size is reduced from more than $p$ to more than $p_{\max}$.
IV Computer experiments
This section details the results of computer experiments that compare the estimation accuracy and computation time of CAG with those of the original DirectLiNGAM without variable grouping, RCD in the absence of latent confounders, and CAPA with maximum CI-test orders $k = 0, 1, 2$. In these experiments, CAG and CAPA used DirectLiNGAM as the causal structure learning algorithm for each variable group; we refer to these procedures as CAG-LiNGAM and CAPA-LiNGAM, respectively. The accuracy of CAG-LiNGAM using the true ancestral relationships was also computed for reference. As mentioned in Section II-D, RCD can also be applied to models with latent confounders; since this paper assumes that no latent confounder exists, the procedure for determining latent confounders was omitted from RCD in this experiment, and only the ancestral and parent-child relationships among variables were estimated.
IV-A Experimental Settings
The numbers of variables and edges were set to $(p, |E|) = (10, 5)$, $(10, 9)$, $(20, 19)$, $(30, 29)$, and $(40, 39)$. The sample sizes ranged from $n = 11$ to $n = 500$, as listed in Table IV.
The error terms were generated from a uniform distribution. The true causal DAGs were randomly generated with $(p, |E|)$ fixed at each iteration. In these experiments, the true causal DAGs were assumed to be sparse, and the indegrees of all vertices were set to at most one. The nonzero elements of the coefficient matrix $B$ were randomly generated from a uniform distribution at each iteration. The number of iterations was set to 100.
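A sketch of how a sparse true DAG with all indegrees at most one could be generated for such an experiment is given below; the coefficient range and sign convention are placeholders, since the exact ranges used in the experiments are not reproduced here.

```python
import numpy as np

def random_indegree_one_dag(p, n_edges, coef_low=0.5, coef_high=1.0, seed=None):
    """Random DAG over p variables with n_edges edges and every indegree <= 1.

    Each selected child receives exactly one parent drawn from the variables
    preceding it in a random causal order, so the graph is acyclic by design.
    Requires n_edges <= p - 1 so that every indegree stays at most one.
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(p)                    # random causal order
    pos = {int(v): k for k, v in enumerate(order)}
    children = rng.choice(order[1:], size=n_edges, replace=False)
    B = np.zeros((p, p))
    for child in children:
        child = int(child)
        parent = int(order[rng.integers(0, pos[child])])  # any earlier variable
        sign = rng.choice([-1.0, 1.0])
        B[child, parent] = sign * rng.uniform(coef_low, coef_high)
    return B
```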
Let $N_c$ be the number of correctly detected edges in the 100 estimated DAGs. Denote the numbers of redundant and missing edges in the 100 estimated DAGs by $N_r$ and $N_m$, respectively. Note that if an edge in an estimated DAG is in the wrong direction, both $N_r$ and $N_m$ increase by one. We evaluated the performance of CAG-LiNGAM and the other methods using the following indices (e.g., Zhu et al. [24]).
1. The precision $= N_c / (N_c + N_r)$ is the ratio of correctly estimated edges among the edges in the estimated DAGs;
2. The recall $= N_c / (N_c + N_m)$ is the ratio of correctly estimated edges among the edges in the true DAGs;
3. The $F$-measure $= 2 \cdot \mathrm{precision} \cdot \mathrm{recall} / (\mathrm{precision} + \mathrm{recall})$ is the harmonic mean of the precision and the recall;
4. Time is the running time in seconds for estimating 100 DAGs.
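The three accuracy indices can be computed from the true and estimated coefficient matrices as in the following sketch, in which a wrongly oriented edge counts as one redundant and one missing edge, as described above; the function and variable names are ours.

```python
import numpy as np

def edge_metrics(B_true, B_hat):
    """Precision, recall, and F-measure of the directed edges of an estimated DAG.

    Edges are read as x_j -> x_i whenever B[i, j] != 0; an edge estimated in the
    wrong direction counts as one redundant and one missing edge.
    """
    true_edges = set(zip(*np.nonzero(B_true)))
    est_edges = set(zip(*np.nonzero(B_hat)))
    n_c = len(true_edges & est_edges)               # correctly detected edges
    n_r = len(est_edges - true_edges)               # redundant (incl. reversed)
    n_m = len(true_edges - est_edges)               # missing (incl. reversed)
    precision = n_c / (n_c + n_r) if est_edges else 0.0
    recall = n_c / (n_c + n_m) if true_edges else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f
```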
Let $\alpha_1$ be the significance level of Pearson's correlation test in CAG and RCD, and let $\alpha_2$ be the significance level of KCI in CAG, RCD, and CAPA. Several values of $\alpha_1$ and $\alpha_2$ were examined for each setting.
Experiments for were conducted on a machine with a 3.0 GHz Intel Core i9 processor and 256 GB memory, and experiments for were performed on a machine with a 3.0 GHz Intel Core i9 processor and 128 GB memory. The other experiments were conducted on a machine with a 2.1 GHz Dual Xeon processor and 96 GB memory.
IV-B Experimental Results and Discussion
(, , ) | ||||||||||||||||||
CAG | DLi | RCD | CAPA | CAG | DLi | RCD | CAPA | |||||||||||
W | Anc | abs | Anc | W | 0 | 1 | 2 | W | Anc | abs | Anc | W | 0 | 1 | 2 | |||
(10, 5, 11) | 0.210 | 0.221 | 0.213 | 0.507 | 0.122 | 0.268 | 0.209 | 0.166 | 0.193 | 0.434 | 0.396 | 0.446 | 0.448 | 0.374 | 0.128 | 0.197 | 0.209 | 0.194 |
(10, 5, 25) | 0.550 | 0.543 | 0.546 | 0.720 | 0.311 | 0.602 | 0.493 | 0.597 | 0.612 | 0.576 | 0.578 | 0.578 | 0.736 | 0.576 | 0.396 | 0.483 | 0.420 | 0.408 |
(10, 5, 50) | 0.773 | 0.770 | 0.773 | 0.850 | 0.521 | 0.851 | 0.725 | 0.735 | 0.779 | 0.764 | 0.764 | 0.764 | 0.928 | 0.868 | 0.592 | 0.696 | 0.723 | 0.706 |
(10, 5, 100) | 0.891 | 0.889 | 0.891 | 0.917 | 0.693 | 0.967 | 0.786 | 0.877 | 0.896 | 0.896 | 0.896 | 0.896 | 0.996 | 0.986 | 0.826 | 0.886 | 0.860 | 0.840 |
(10, 5, 500) | 0.971 | 0.971 | 0.971 | 0.963 | 0.813 | 1.000 | 0.861 | 0.901 | 0.906 | 0.998 | 0.998 | 0.998 | 1.000 | 1.000 | 0.994 | 0.945 | 0.956 | 0.960 |
(10, 9, 11) | 0.251 | 0.218 | 0.244 | 0.396 | 0.179 | 0.281 | 0.214 | 0.208 | 0.250 | 0.431 | 0.487 | 0.452 | 0.483 | 0.387 | 0.147 | 0.207 | 0.220 | 0.193 |
(10, 9, 25) | 0.369 | 0.368 | 0.350 | 0.563 | 0.339 | 0.452 | 0.388 | 0.389 | 0.429 | 0.693 | 0.486 | 0.727 | 0.729 | 0.600 | 0.301 | 0.418 | 0.447 | 0.426 |
(10, 9, 50) | 0.621 | 0.599 | 0.609 | 0.686 | 0.482 | 0.644 | 0.526 | 0.523 | 0.548 | 0.643 | 0.650 | 0.646 | 0.899 | 0.818 | 0.521 | 0.634 | 0.651 | 0.643 |
(10, 9, 100) | 0.779 | 0.768 | 0.779 | 0.834 | 0.685 | 0.914 | 0.680 | 0.686 | 0.682 | 0.851 | 0.860 | 0.859 | 0.986 | 0.974 | 0.758 | 0.845 | 0.836 | 0.839 |
(10, 9, 500) | 0.918 | 0.917 | 0.918 | 0.877 | 0.785 | 0.997 | 0.785 | 0.773 | 0.774 | 0.999 | 0.999 | 0.999 | 1.000 | 1.000 | 0.999 | 0.929 | 0.930 | 0.926 |
(20, 19, 25) | 0.313 | 0.319 | 0.267 | 0.435 | 0.172 | 0.327 | 0.235 | 0.410 | 0.468 | 0.601 | 0.423 | 0.642 | 0.665 | 0.496 | 0.284 | 0.287 | 0.220 | 0.238 |
(20, 19, 50) | 0.528 | 0.484 | 0.497 | 0.625 | 0.354 | 0.549 | 0.447 | 0.454 | 0.616 | 0.662 | 0.640 | 0.676 | 0.876 | 0.721 | 0.514 | 0.561 | 0.527 | 0.448 |
(20, 19, 100) | 0.759 | 0.723 | 0.751 | 0.780 | 0.576 | 0.837 | 0.610 | 0.594 | 0.681 | 0.776 | 0.780 | 0.778 | 0.984 | 0.943 | 0.729 | 0.779 | 0.776 | 0.685 |
(20, 19, 500) | 0.897 | 0.894 | 0.897 | 0.844 | 0.725 | 0.994 | 0.725 | 0.718 | 0.712 | 0.997 | 0.997 | 0.997 | 1.000 | 1.000 | 0.996 | 0.899 | 0.895 | 0.892 |
(30, 29, 50) | 0.535 | 0.445 | 0.489 | 0.655 | 0.276 | 0.475 | 0.374 | 0.527 | 0.631 | 0.618 | 0.616 | 0.633 | 0.855 | 0.588 | 0.479 | 0.438 | 0.390 | 0.435 |
(30, 29, 100) | 0.725 | 0.701 | 0.673 | 0.878 | 0.587 | 0.813 | 0.649 | 0.649 | 0.676 | 0.725 | 0.701 | 0.853 | 0.878 | 0.587 | 0.813 | 0.649 | 0.649 | 0.676 |
(30, 29, 500) | 0.969 | 0.967 | 0.970 | 0.976 | 0.919 | 0.998 | 0.919 | 0.916 | 0.914 | 0.998 | 0.998 | 0.998 | 1.000 | 1.000 | 0.996 | 0.970 | 0.968 | 0.969 |
(40, 39, 50) | 0.464 | 0.423 | 0.466 | 0.625 | 0.211 | 0.459 | 0.311 | 0.480 | 0.712 | 0.800 | 0.595 | 0.624 | 0.839 | 0.536 | 0.469 | 0.381 | 0.364 | 0.374 |
(40, 39, 100) | 0.696 | 0.640 | 0.652 | 0.863 | 0.477 | 0.792 | 0.573 | 0.576 | 0.679 | 0.835 | 0.741 | 0.849 | 0.973 | 0.809 | 0.686 | 0.690 | 0.680 | 0.635 |
(40, 39, 500) | 0.952 | 0.944 | 0.951 | 0.970 | 0.917 | 0.991 | 0.917 | 0.914 | 0.915 | 0.998 | 0.998 | 0.998 | 1.000 | 1.000 | 0.995 | 0.966 | 0.966 | 0.966 |
(, , ) | ||||||||||||||||||
CAG | DLi | RCD | CAPA | CAG | DLi | RCD | CAPA | |||||||||||
W | Anc | abs | Anc | W | 0 | 1 | 2 | W | Anc | abs | Anc | W | 0 | 1 | 2 | |||
(10, 5, 11) | 0.283 | 0.284 | 0.288 | 0.476 | 0.184 | 0.173 | 0.203 | 0.185 | 0.194 | 23.0 | 17.8 | 21.3 | 5.0 | 29.0 | 6.7 | 28.2 | 99.5 | 210.7 |
(10, 5, 25) | 0.563 | 0.560 | 0.562 | 0.728 | 0.404 | 0.478 | 0.488 | 0.493 | 0.489 | 13.5 | 13.4 | 13.4 | 5.0 | 27.5 | 8.0 | 32.2 | 55.0 | 57.1 |
(10, 5, 50) | 0.769 | 0.767 | 0.769 | 0.887 | 0.651 | 0.698 | 0.710 | 0.729 | 0.741 | 13.0 | 13.0 | 13.0 | 5.1 | 72.1 | 8.1 | 32.8 | 75.1 | 84.8 |
(10, 5, 100) | 0.893 | 0.892 | 0.893 | 0.955 | 0.814 | 0.891 | 0.833 | 0.868 | 0.867 | 154.6 | 154.4 | 154.5 | 48.9 | 123.6 | 123.4 | 165.2 | 371.7 | 502.2 |
(10, 5, 500) | 0.984 | 0.984 | 0.984 | 0.981 | 0.897 | 0.997 | 0.901 | 0.928 | 0.932 | 330.8 | 330.7 | 330.8 | 65.1 | 151.8 | 480.5 | 459.5 | 1867.9 | 2529.8 |
(10, 9, 11) | 0.317 | 0.301 | 0.317 | 0.435 | 0.244 | 0.193 | 0.210 | 0.214 | 0.218 | 39.4 | 23.4 | 23.5 | 12.3 | 28.4 | 10.2 | 39.0 | 112.3 | 250.8 |
(10, 9, 25) | 0.482 | 0.419 | 0.472 | 0.635 | 0.433 | 0.361 | 0.403 | 0.416 | 0.428 | 34.1 | 36.5 | 25.0 | 11.3 | 28.9 | 15.2 | 44.2 | 110.8 | 291.0 |
(10, 9, 50) | 0.632 | 0.623 | 0.627 | 0.778 | 0.607 | 0.576 | 0.575 | 0.580 | 0.591 | 22.8 | 21.7 | 21.7 | 12.3 | 71.8 | 18.6 | 80.3 | 187.3 | 370.5 |
(10, 9, 100) | 0.814 | 0.811 | 0.817 | 0.904 | 0.805 | 0.829 | 0.753 | 0.753 | 0.752 | 338.9 | 334.6 | 334.6 | 117.1 | 120.7 | 312.3 | 255.9 | 1420.2 | 6078.1 |
(10, 9, 500) | 0.957 | 0.956 | 0.957 | 0.935 | 0.879 | 0.998 | 0.851 | 0.844 | 0.843 | 1062.1 | 1061.6 | 1061.6 | 154.4 | 155.5 | 1332.3 | 425.2 | 6833.6 | 19649.6 |
(20, 19, 25) | 0.411 | 0.364 | 0.378 | 0.526 | 0.255 | 0.304 | 0.259 | 0.286 | 0.316 | 520.0 | 47.2 | 101.7 | 29.2 | 172.3 | 34.4 | 524.3 | 713.6 | 1044.2 |
(20, 19, 50) | 0.588 | 0.551 | 0.573 | 0.729 | 0.474 | 0.531 | 0.498 | 0.488 | 0.518 | 93.2 | 57.0 | 72.2 | 29.3 | 323.9 | 48.6 | 573.4 | 926.4 | 1658.1 |
(20, 19, 100) | 0.768 | 0.750 | 0.764 | 0.870 | 0.715 | 0.779 | 0.684 | 0.673 | 0.683 | 1111.4 | 1015.5 | 1015.7 | 337.2 | 417.3 | 991.4 | 1418.6 | 10457.6 | 39708.9 |
(20, 19, 500) | 0.944 | 0.943 | 0.944 | 0.916 | 0.840 | 0.995 | 0.802 | 0.797 | 0.792 | 3918.8 | 3906.2 | 3906.3 | 421.7 | 542.6 | 4757.2 | 1471.6 | 33655.4 | 166854.1 |
(30, 29, 50) | 0.573 | 0.516 | 0.552 | 0.742 | 0.376 | 0.477 | 0.403 | 0.449 | 0.515 | 196.6 | 92.7 | 129.2 | 40.2 | 554.7 | 65.8 | 1077.6 | 1962.0 | 5290.4 |
(30, 29, 100) | 0.782 | 0.726 | 0.753 | 0.925 | 0.703 | 0.747 | 0.704 | 0.703 | 0.709 | 1381.5 | 1138.7 | 1215.6 | 397.6 | 604.9 | 1287.1 | 1090.2 | 13253.2 | 117849.3 |
(30, 29, 500) | 0.983 | 0.982 | 0.984 | 0.988 | 0.958 | 0.997 | 0.943 | 0.941 | 0.941 | 5732.6 | 5731.6 | 5731.7 | 444.0 | 1449.6 | 9536.9 | 3540.0 | 45572.6 | 341942.2 |
(40, 39, 50) | 0.588 | 0.494 | 0.533 | 0.716 | 0.302 | 0.464 | 0.343 | 0.414 | 0.490 | 4233.4 | 192.9 | 258.3 | 56.9 | 962.4 | 153.3 | 1920.0 | 4387.8 | 6462.0 |
(40, 39, 100) | 0.759 | 0.687 | 0.737 | 0.914 | 0.600 | 0.735 | 0.626 | 0.624 | 0.656 | 2931.3 | 2527.1 | 2620.8 | 604.1 | 1001.2 | 2725.9 | 1740.4 | 35769.3 | 352156.9 |
(40, 39, 500) | 0.975 | 0.970 | 0.974 | 0.985 | 0.956 | 0.993 | 0.941 | 0.939 | 0.940 | 10234.1 | 10225.4 | 10225.7 | 706.8 | 2610.9 | 16099.2 | 6221.7 | 94989.2 | 779946.2 |
Table IV shows the evaluation indices for each method for some settings $(p, |E|, n)$. The experiments for CAG-LiNGAM and RCD were conducted with several combinations of the significance levels of Pearson's correlation test and KCI for each $(p, |E|, n)$. The experiments for CAPA-LiNGAM were conducted with several significance levels of KCI for each $(p, |E|, n)$. Table IV shows only the results of these methods for the significance levels that maximize the $F$-measure.
In Table IV, CAG represents CAG-LiNGAM, while the adjacent reference column refers to CAG-LiNGAM using the true ancestral relationships. DLi stands for the original DirectLiNGAM without grouping variables. W, Anc, and abs under CAG and RCD represent the cycle elimination methods based on the Wald test, the estimated ancestral relationships, and the absolute value of the coefficient, respectively. The numbers 0, 1, and 2 under CAPA are the maximum CI-test orders $k$. In the table, the bold numbers highlight the best performance in each setting $(p, |E|, n)$, excluding the reference with true ancestral relationships.
As seen from Table IV, CAG-LiNGAM outperforms the original DirectLiNGAM without variable grouping, RCD, and CAPA-LiNGAM in terms of the $F$-measure and the recall when the sample size is small relative to the number of variables. Although CAG requires $O(p^3)$ time to group variables, CAG-LiNGAM often shows shorter computation time than the original DirectLiNGAM when the sample size is small. When the sample size is large, CAG-LiNGAM takes longer than the original DirectLiNGAM, but it is superior in terms of precision and $F$-measure.
CAG with Wald tests takes longer computation time but is more accurate in terms of the $F$-measure than CAG with Anc and abs. As the sample size increases, the accuracy of the ancestral-relationship estimation and of DirectLiNGAM increases, and hence the frequency of cycles appearing in the estimated DAG decreases, so the difference in estimation accuracy between the three CAG variants becomes smaller.
When the sample size is small, CAPA-LiNGAM may show higher precision than CAG-LiNGAM. However, the recall and $F$-measure are lower in many cases. This may be because, when the sample size is small, it is difficult for CAPA to detect conditional independence relationships due to the type II error of the CI tests. The resulting causal DAG estimates then become overly sparse.
When $k$ is 0 and 1, the time complexity of the variable grouping of CAPA is less than or equal to that of CAG. However, the computation time of CAPA-LiNGAM is not much different from that of CAG-LiNGAM, or CAG-LiNGAM is faster than CAPA-LiNGAM. Besides, CAG-LiNGAM outperforms CAPA-LiNGAM in terms of the $F$-measure.
It is noteworthy that RCD shows a higher $F$-measure than CAG-LiNGAM when the sample size is large, although it takes more computation time. Conversely, when the sample size is small, CAG-LiNGAM shows a higher $F$-measure than RCD, although it takes more computation time.
The original DirectLiNGAM shows lower precision than divide-and-conquer algorithms. This may be because the original DirectLiNGAM does not group variables, leading to redundant edges across different variable groups. In general, DAGs estimated by DirectLiNGAM tend to have more redundant edges.
Even if the true causal DAG is sparse, the subgraph induced by each variable group becomes relatively denser. In the case of directed trees, the ratio of the number of edges to the number of variable pairs is $(p - 1)/\binom{p}{2} = 2/p$. This means that the smaller the number of variables in a group, the denser the sub-DAG induced by that group. When variables are correctly grouped into smaller subsets, the true sub-DAGs induced by the groups become denser, and there are no edges across groups. Therefore, if the ancestral relationships can be accurately estimated, applying DirectLiNGAM to each group is expected to produce fewer redundant edges than applying it to the entire variable set.
The CAG uses the estimated ancestral relationships for grouping variables but does not use them to estimate the causal DAG for each group. However, since ancestral relationships partially define the structure of the causal DAG, information on ancestral relationships could be used to estimate causal DAGs. The RCD can be interpreted as using information on ancestral relationships to estimate causal DAGs.

The lines in red, blue, orange, black, and pink correspond to the different experimental settings.
Let $A_c$ and $A_m$ be the numbers of correctly detected and missing ancestors in the estimated ancestor lists $\widehat{Anc}(x_i)$, $i = 1, \dots, p$, respectively. Define the recall of the ancestor lists by $R_{anc} = A_c / (A_c + A_m)$.
Figure 5 plots $R_{anc}$ for some settings $(p, |E|, n)$ when using the significance levels of the CI tests that maximize the $F$-measure of CAG-LiNGAM with Wald tests. The figure shows that when the sample size is small, the accuracy of the estimated ancestor lists is quite low; when $n = 11$, the values of $R_{anc}$ are exactly zero. However, Table IV shows that the accuracy of CAG-LiNGAM is not that bad. This suggests that applying DirectLiNGAM to each group produced by CAG may correct errors in the estimated ancestral relationships within each group.
Table V shows $R_{anc}$ computed from the ancestral relationships estimated by CAG (Before) and $R_{anc}$ computed from the estimated DAGs after applying DirectLiNGAM to each group (After). We can see that $R_{anc}$ is significantly improved by applying DirectLiNGAM to each group.
($p$, $|E|$, $n$) | Before (W) | Before (Anc) | Before (abs) | After (W) | After (Anc) | After (abs)
(10, 5, 11) | 0 | 0.049 | 0 | 0.438 | 0.405 | 0.466 |
(10, 5, 25) | 0.300 | 0.300 | 0.300 | 0.519 | 0.525 | 0.524 |
(10, 5, 50) | 0.513 | 0.513 | 0.513 | 0.677 | 0.679 | 0.679 |
(10, 5, 100) | 0.723 | 0.723 | 0.723 | 0.852 | 0.852 | 0.852 |
(10, 5, 500) | 0.919 | 0.919 | 0.919 | 0.998 | 0.998 | 0.998 |
(10, 9, 11) | 0 | 0 | 0 | 0.404 | 0.568 | 0.447 |
(10, 9, 25) | 0 | 0.323 | 0 | 0.648 | 0.483 | 0.701 |
(10, 9, 50) | 0.346 | 0.346 | 0.346 | 0.540 | 0.557 | 0.547 |
(10, 9, 100) | 0.514 | 0.514 | 0.514 | 0.791 | 0.817 | 0.805 |
(10, 9, 500) | 0.713 | 0.713 | 0.713 | 0.999 | 0.999 | 0.999 |
(20, 19, 25) | 0.001 | 0.177 | 0.001 | 0.514 | 0.366 | 0.577 |
(20, 19, 50) | 0.320 | 0.289 | 0.320 | 0.545 | 0.558 | 0.594 |
(20, 19, 100) | 0.359 | 0.359 | 0.359 | 0.638 | 0.669 | 0.651 |
(20, 19, 500) | 0.596 | 0.596 | 0.596 | 0.993 | 0.998 | 0.998 |
(30, 29, 50) | 0.268 | 0.241 | 0.268 | 0.450 | 0.562 | 0.515 |
(30, 29, 100) | 0.371 | 0.291 | 0.371 | 0.745 | 0.636 | 0.782 |
(30, 29, 500) | 0.515 | 0.515 | 0.515 | 0.993 | 0.997 | 0.993 |
(40, 39, 50) | 0.001 | 0.222 | 0.225 | 0.627 | 0.546 | 0.508 |
(40, 39, 100) | 0.356 | 0.276 | 0.356 | 0.697 | 0.624 | 0.785 |
(40, 39, 500) | 0.470 | 0.470 | 0.470 | 0.997 | 0.994 | 0.997 |
Table VI summarizes the following six indices calculated from CAG's experimental results for $p = 10$.
• Improvement ($N_I$): the number of ancestral relationships in the output DAGs that DirectLiNGAM corrects;
• Worsening ($N_W$): the number of ancestral relationships in the output DAGs that DirectLiNGAM worsens;
• Errors in estimation ($N_E$): the number of ancestral relationships that are wrongly estimated by CAG;
• Correct estimation ($N_C$): the number of ancestral relationships that are correctly estimated by CAG;
• $N_I/N_E$: the ratio of ancestral relationships incorrectly estimated by CAG that DirectLiNGAM corrected;
• $N_W/N_C$: the ratio of ancestral relationships correctly estimated by CAG that DirectLiNGAM worsened.
($p$, $|E|$, $n$) | Elimination | $N_I$ | $N_W$ | $N_C$ | $N_E$ | $N_I/N_E$ | $N_W/N_C$
(10,5,11) | Wald | 265 | 1029 | 3858 | 642 | 0.4128 | 0.2667 |
Anc | 228 | 858 | 3858 | 642 | 0.3551 | 0.2224 | |
abs | 286 | 1042 | 3858 | 642 | 0.4455 | 0.2701 | |
(10,5,25) | Wald | 239 | 700 | 3956 | 544 | 0.4393 | 0.1769 |
Anc | 227 | 636 | 3956 | 544 | 0.4173 | 0.1608 | |
abs | 238 | 720 | 3956 | 544 | 0.4375 | 0.1820 | |
(10,5,50) | Wald | 204 | 268 | 4154 | 346 | 0.5896 | 0.0645 |
Anc | 191 | 253 | 4154 | 346 | 0.5520 | 0.0609 | |
abs | 211 | 280 | 4154 | 346 | 0.6098 | 0.0674 | |
(10,9,11) | Wald | 794 | 1226 | 2627 | 1873 | 0.4239 | 0.4667 |
Anc | 374 | 559 | 2627 | 1873 | 0.1997 | 0.2128 | |
abs | 794 | 1226 | 2627 | 1873 | 0.4239 | 0.4667 | |
(10,9,25) | Wald | 657 | 678 | 2780 | 1720 | 0.3820 | 0.2439 |
Anc | 519 | 565 | 2780 | 1720 | 0.3017 | 0.2032 | |
abs | 747 | 804 | 2780 | 1720 | 0.4343 | 0.2892 | |
(10,9,50) | Wald | 704 | 373 | 3078 | 1422 | 0.4951 | 0.1212 |
Anc | 570 | 273 | 3078 | 1422 | 0.4008 | 0.0887 | |
abs | 728 | 412 | 3078 | 1422 | 0.5120 | 0.1339 | |
(10,9,100) | Wald | 682 | 130 | 3479 | 1021 | 0.6680 | 0.0374 |
Anc | 586 | 122 | 3479 | 1021 | 0.5739 | 0.0351 | |
abs | 703 | 118 | 3479 | 1021 | 0.6885 | 0.0339 | |
(10,9,500) | Wald | 626 | 32 | 3730 | 770 | 0.8130 | 0.0086 |
Anc | 626 | 33 | 3730 | 770 | 0.8078 | 0.0088 | |
abs | 626 | 32 | 3730 | 770 | 0.8104 | 0.0086 |
Table VI shows that when the sample size is small, $N_I < N_W$ but $N_I/N_E > N_W/N_C$. That is, by applying DirectLiNGAM, the number of corrected ancestral relationships is smaller than the number of worsened ones; however, the proportion of corrected ancestral relationships (among those estimated incorrectly by CAG) is larger than the proportion of worsened ones (among those estimated correctly), which may have improved the accuracy of causal DAG estimation.
Even with large sample sizes, the estimated ancestral relationships improve when DirectLiNGAM is applied. However, Table IV shows that the accuracy of CAG is then inferior to that of RCD. When the sample size is large, DirectLiNGAM tends to include redundant edges in the estimated DAG, which may make CAG less accurate than RCD.
This experiment examined 28 different combinations of significance levels for CAG's CI tests. Table VII presents the significance levels that maximize the $F$-measure of CAG-LiNGAM. A small significance level is often chosen when the sample size is small; in practice, however, the results are almost the same regardless of the chosen significance level, because when the sample size is small, ancestral relationships can hardly be detected at any significance level. In this experiment, for CAG-LiNGAM with Wald tests and $n = 11$, no ancestral relationship could be detected within the 100 repetitions. From Definition 1 in Section III-B, if no ancestral relationship is detected, the variable grouping is the family of all variable pairs. Interestingly, even in such cases, CAG-LiNGAM shows a higher $F$-measure than the other methods. When the sample size is moderately small, the $F$-measure is large when the significance level of Pearson's correlation test is 0.01 and the significance level of KCI is about 0.2 to 0.5. When the sample size is large, the smaller the significance levels, the higher the $F$-measure.
($p$, $|E|$, $n$) | W: $\alpha_1$ | $\alpha_2$ | Anc: $\alpha_1$ | $\alpha_2$ | abs: $\alpha_1$ | $\alpha_2$
(10, 5, 11) | 0.01 | 0.001 | 0.01 | 0.3 | 0.01 | 0.001 |
(10, 5, 25) | 0.01 | 0.5 | 0.01 | 0.5 | 0.01 | 0.5 |
(10, 5, 50) | 0.01 | 0.4 | 0.01 | 0.4 | 0.01 | 0.4 |
(10, 5, 100) | 0.01 | 0.3 | 0.01 | 0.3 | 0.01 | 0.3 |
(10, 5, 500) | 0.01 | 0.001 | 0.01 | 0.001 | 0.01 | 0.001 |
(10, 9, 11) | 0.05 | 0.01 | 0.05 | 0.01 | 0.05 | 0.01 |
(10, 9, 25) | 0.01 | 0.001 | 0.5 | 0.5 | 0.01 | 0.001 |
(10, 9, 50) | 0.01 | 0.3 | 0.01 | 0.3 | 0.01 | 0.3 |
(10, 9, 100) | 0.01 | 0.2 | 0.01 | 0.2 | 0.01 | 0.2 |
(10, 9, 500) | 0.01 | 0.001 | 0.01 | 0.001 | 0.01 | 0.001 |
(20, 19, 25) | 0.05 | 0.001 | 0.01 | 0.5 | 0.05 | 0.001 |
(20, 19, 50) | 0.01 | 0.4 | 0.01 | 0.3 | 0.01 | 0.4 |
(20, 19, 100) | 0.01 | 0.1 | 0.01 | 0.1 | 0.01 | 0.1 |
(20, 19, 500) | 0.01 | 0.001 | 0.01 | 0.001 | 0.01 | 0.001 |
(30, 29, 50) | 0.01 | 0.4 | 0.01 | 0.3 | 0.01 | 0.4 |
(30, 29, 100) | 0.01 | 0.2 | 0.01 | 0.1 | 0.01 | 0.2 |
(30, 29, 500) | 0.01 | 0.001 | 0.01 | 0.001 | 0.01 | 0.001 |
(40, 39, 50) | 0.01 | 0.001 | 0.01 | 0.3 | 0.01 | 0.4 |
(40, 39, 100) | 0.01 | 0.2 | 0.01 | 0.1 | 0.01 | 0.1 |
(40, 39, 500) | 0.01 | 0.001 | 0.01 | 0.001 | 0.01 | 0.001 |
As mentioned, the reference method is CAG-LiNGAM with the true ancestral relationships given. When the sample size is small, the estimation accuracy of CAG-LiNGAM is inferior to that of the reference because the ancestor estimation in CAG is less accurate. However, in some settings with large sample sizes, CAG-LiNGAM shows higher precision and $F$-measure than the reference. As seen from Figure 5 and Table V, the accuracy of ancestral relationship estimation improves as the sample size increases. However, even when $(p, |E|, n) = (10, 9, 500)$, $R_{anc}$ is 0.713, which is not close enough to 1. From Table VII, when $(p, |E|, n) = (10, 9, 500)$, $\alpha_1 = 0.01$ and $\alpha_2 = 0.001$ are used as the significance levels of the CI tests. Even with a large sample size, if small significance levels are used in CAG, some ancestral relationships cannot be detected due to type II errors in the CI tests. As a result, the number of variables in the estimated groups may be smaller than in the true CAG grouping, resulting in fewer redundant edges than the reference. Grouping by CAG is not the finest grouping that guarantees the identifiability of the true causal DAG. These results suggest that using finer groupings may improve the accuracy of causal DAG estimation when the sample size is large.
V Conclusion
This paper proposes CAG as a divide-and-conquer algorithm for learning causal DAGs. The algorithm can improve the performance of the original DirectLiNGAM, especially when the sample size is small relative to the number of variables. CAG is based on RCD's algorithm for finding ancestral relationships among variables. When the ancestral relationships are correctly estimated, CAG-LiNGAM preserves the identifiability of the true causal DAG. Detailed computer experiments confirm that CAG-LiNGAM may outperform the original DirectLiNGAM, RCD, and CAPA-LiNGAM in terms of the $F$-measure when the true causal DAG is sparse and the sample size is small relative to the number of variables.
If the true causal DAG is connected and the correct CI relationships are estimated, CAG creates as many variable groups as there are sink nodes in the causal DAG. Therefore, when the number of sink nodes in the true causal graph is small, the number of variables in each group cannot be made sufficiently small, and an improvement in accuracy may not be expected. Other computer experiments have shown that even if the true causal DAG is sparse, the estimation accuracy is not necessarily high when the indegrees of some vertices are larger than one. CAPA behaves in the opposite way to CAG: it produces few variable groups when the maximum indegree of the true causal DAG is small, while it may split the variables into many groups even when the maximum indegree is large. As noted at the end of the previous section, a divide-and-conquer algorithm is expected to estimate causal DAGs more accurately if finer groupings are used. A hybrid algorithm combining CAG and CAPA may give fine groupings even when the indegrees in the true causal DAG are larger.
We also found that the RCD has higher estimation accuracy than CAG when the sample size is large. RCD estimates causal DAGs using the estimated ancestral relationships among variables. As the sample size increases, the estimation accuracy of ancestral relationships improves. Therefore, using the estimated ancestral relationships to estimate causal DAGs may improve estimation accuracy. However, RCD is computationally expensive when the sample size is large. Using the information on the estimated ancestral relationships in DirectLiNGAM may improve the estimation accuracy of causal DAGs at a relatively low computational cost, even when the sample size is large.
Acknowledgment
This work was supported by JSPS KAKENHI Grant Number JP21K11797.
References
- [1] R. Cai, Z. Zhang, and Z. Hao, “Sada: A general framework to support robust causation discovery,” in International conference on machine learning, pp. 208–216, PMLR, 2013.
- [2] H. Zhang, S. Zhou, C. Yan, J. Guan, X. Wang, J. Zhang, and J. Huan, “Learning causal structures based on divide and conquer,” IEEE Transactions on Cybernetics, vol. 52, no. 5, pp. 3232–3243, 2020.
- [3] S. Shimizu, P. O. Hoyer, A. Hyvärinen, A. Kerminen, and M. Jordan, “A linear non-gaussian acyclic model for causal discovery.,” Journal of Machine Learning Research, vol. 7, no. 10, 2006.
- [4] S. Shimizu, T. Inazumi, Y. Sogawa, A. Hyvarinen, Y. Kawahara, T. Washio, P. O. Hoyer, K. Bollen, and P. Hoyer, “Directlingam: A direct method for learning a linear non-gaussian structural equation model,” Journal of Machine Learning Research-JMLR, vol. 12, no. Apr, pp. 1225–1248, 2011.
- [5] K. D. Hoover, Causality in economics and econometrics. SSRN eLibrary, 2006.
- [6] K. Sachs, O. Perez, D. Pe’er, D. A. Lauffenburger, and G. P. Nolan, “Causal protein-signaling networks derived from multiparameter single-cell data,” Science, vol. 308, no. 5721, pp. 523–529, 2005.
- [7] C. Glymour, “Learning causes: Psychological explanations of causal explanation1,” Minds and machines, vol. 8, no. 1, pp. 39–60, 1998.
- [8] J. Pearl, M. Glymour, and N. P. Jewell, Causal Inference in Statistics: A Primer. Hoboken, New Jersey, U.S.: John Wiley and Sons, 2016.
- [9] J. Pearl, “Causal diagrams for empirical research,” Biometrika, vol. 82, no. 4, pp. 669–688, 1995.
- [10] P. Spirtes, C. N. Glymour, and R. Scheines, Causation, prediction, and search. MIT press, 2000.
- [11] D. M. Chickering, “Optimal structure identification with greedy search,” Journal of machine learning research, vol. 3, no. Nov, pp. 507–554, 2002.
- [12] G. Schwarz, “Estimating the dimension of a model,” The Annals of Statistics, pp. 461–464, 1978.
- [13] D. M. Chickering, “Learning bayesian networks is np-complete,” Learning from data: Artificial intelligence and statistics V, pp. 121–130, 1996.
- [14] D. M. Chickering, D. Heckerman, and C. Meek, “Large-sample learning of bayesian networks is np-hard,” Journal of Machine Learning Research, vol. 5, pp. 1287–1330, 2004.
- [15] A. Hyvarinen, J. Karhunen, and E. Oja, “Independent component analysis,” Studies in informatics and control, vol. 11, no. 2, pp. 205–207, 2002.
- [16] P. Hoyer, D. Janzing, J. M. Mooij, J. Peters, and B. Schölkopf, “Nonlinear causal discovery with additive noise models,” Advances in neural information processing systems, vol. 21, 2008.
- [17] K. Zhang and A. Hyvarinen, “On the identifiability of the post-nonlinear causal model,” arXiv preprint arXiv:1205.2599, 2012.
- [18] Y. S. Wang and M. Drton, “High-dimensional causal discovery under non-gaussianity,” Biometrika, vol. 107, no. 1, pp. 41–59, 2020.
- [19] T. N. Maeda and S. Shimizu, “Rcd: Repetitive causal discovery of linear non-gaussian acyclic models with latent confounders,” in International Conference on Artificial Intelligence and Statistics, pp. 735–745, PMLR, 2020.
- [20] C. J. Kowalski, “On the effects of non-normality on the distribution of the sample product-moment correlation coefficient,” Journal of the Royal Statistical Society: Series C (Applied Statistics), vol. 21, no. 1, pp. 1–12, 1972.
- [21] T. N. Maeda and S. Shimizu, “Repetitive causal discovery of linear non-gaussian acyclic models in the presence of latent confounders,” International Journal of Data Science and Analytics, pp. 1–13, 2022.
- [22] A. Gretton, K. Fukumizu, C. Teo, L. Song, B. Schölkopf, and A. Smola, “A kernel statistical test of independence,” Advances in neural information processing systems, vol. 20, 2007.
- [23] K. Zhang, J. Peters, D. Janzing, and B. Schölkopf, “Kernel-based conditional independence test and application in causal discovery,” arXiv preprint arXiv:1202.3775, 2012.
- [24] S. Zhu, I. Ng, and Z. Chen, “Causal discovery with reinforcement learning,” arXiv preprint arXiv:1906.04477, 2019.
Appendix
V-A Pseudo-code of ancestor finding
This section presents the pseudocode of the ancestor-finding algorithm. Lines 5-6 are the procedure that avoids cyclic ancestral relationships and speeds up the ancestor finding. Two p-values appear in the pseudocode: the p-value of Pearson's test for the correlation between a pair of variables, and the p-value of KCI for the independence between a regressor and the corresponding residual.
Because no confounders are assumed to exist, we can declare that all ancestral relationships have been identified once no new ancestral relationship is discovered in a single loop. This means that, unlike the original RCD [19, 21], even if some pairs of variables halt at the point (Line 35) where additional searching for unknown common ancestors would be required, they are ultimately treated as having no ancestral relationship.
Algorithm 1 Ancestor Finding
V-B Pseudo-code of Grouping and Merging
This section presents the pseudocode for grouping variables and merging the results. In the implementation, a type II error of CI tests can result in many groups with only one variable. In that case, the output DAG will be too sparse. To avoid this, the procedure in lines 3 through 9 merges groups with one element into the other groups. When the sample size is small, no ancestral relationship may be detected, and the number of elements in all groups is one. In such a case, all variable pairs are defined as the variable grouping.
After grouping, line 13 shows the process of learning causal sub-DAGs and merging them. Finally, lines 15-17 eliminate cycles in the merged DAG.
Algorithm 2 Grouping and Merging