
Learning causal graphs using variable grouping according to ancestral relationship

Ming Cai Graduate School of Informatics
Kyoto University
Kyoto, Japan
[email protected]
   Hisayuki Hara Institute for Liberal Arts and Sciences
Kyoto University
Kyoto, Japan
[email protected]
Abstract

Causal discovery has drawn increasing attention in recent years. Over the past quarter century, several useful algorithms for learning causal graphs have been proposed. However, when the sample size is small relative to the number of variables, the accuracy of estimating causal graphs using existing methods decreases. In addition, some methods are not feasible when the sample size is smaller than the number of variables.

To circumvent these problems, some researchers proposed causal structure learning algorithms using divide-and-conquer approaches (e.g., [1, 2]). For learning the entire causal graph, the divide-and-conquer approaches first split variables into several subsets according to the conditional independence relationships among the variables, then apply a conventional causal structure learning algorithm to each subset and merge the estimated results. Since the divide-and-conquer approach reduces the number of variables to which a causal structure learning algorithm is applied, it is expected to improve the estimation accuracy of causal graphs, especially when the sample size is small relative to the number of variables and the model is sparse. However, existing methods are either computationally expensive or do not provide sufficient accuracy when the sample size is small.

This paper proposes a new algorithm for grouping variables according to the ancestral relationships among the variables, assuming that the causal model is LiNGAM [3], where the causal relationships are linear and the exogenous variables are mutually independent and follow continuous non-Gaussian distributions. We call the proposed algorithm CAG (Causal Ancestral-relationship-based Grouping). The time complexity of the ancestor-finding step in CAG is shown to be cubic in the number of variables. Extensive computer experiments confirm that the proposed method outperforms the original DirectLiNGAM [4] without grouping variables and other divide-and-conquer approaches not only in estimation accuracy but also in computation time when the sample size is small relative to the number of variables and the model is sparse.

Index Terms:
causal discovery, causal DAG, conditional independence test, DirectLiNGAM, divide-and-conquer, linear regression.

I Introduction

In recent years, inference using high-dimensional causal models for observational data has played a pivotal role in various fields, such as econometrics [5], biology [6], and psychology [7]. For high-dimensional causal inference, the structural causal model (SCM, [8, 9]) defined by a directed acyclic graph (DAG) has been extensively used. Since the causal relationships among variables are usually unknown, the causal DAG also needs to be learned before the SCM can be used.

Over the past quarter century, several practical algorithms for learning causal DAGs without latent confounders have been proposed. These algorithms are classified into several types.

The constraint-based methods use the conditional independence (CI) relationships among variables to determine causal directions. The PC algorithm [10] is a typical example. Using hypothesis tests, the PC algorithm first estimates the CI relationships between all pairs of variables. Then, it estimates the skeleton of a causal DAG and the directions of the edges in the skeleton in turn. However, the time complexity of the PC algorithm is, in the worst case, exponential in the number of variables. Hence, the PC algorithm is not feasible for high-dimensional data.

The greedy equivalence search (GES) [11] is an algorithm that learns causal structures using a model selection criterion such as BIC [12]. This type of method is called the score-based method. As Chickering et al. [13, 14] have proven, however, the structure learning problem addressed by GES is NP-hard, so its application to high-dimensional causal models is also impractical.

It is important to note that the constraint-based and score-based methods can only identify causal DAGs up to the Markov equivalence class, the set of all DAGs compatible with the inferred CI relationships. To fully identify causal DAGs requires additional constraints on the causal model.

Shimizu et al. [3] considered the linear structural equation model as the causal model, where exogenous variables are mutually independent, and the distributions of the exogenous variables are continuous and non-Gaussian. They called the model linear non-Gaussian acyclic model (LiNGAM) and showed that it is possible to fully identify the causal DAG that defines LiNGAM using independent component analysis (ICA-LiNGAM) [15]. However, ICA-LiNGAM tends to converge to a locally optimal solution when the dimension of the model is high, resulting in lower estimation accuracy (e.g.,[4]). To overcome this difficulty, Shimizu [4] proposed DirectLiNGAM that estimates causal DAGs using linear regression between variables and CI tests.

Hoyer et al. [16] and Zhang and Hyvärinen [17] generalized LiNGAM to nonlinear models and showed that when the model is nonlinear, the causal DAG is identifiable even if the exogenous variables are Gaussian.

However, the estimation accuracy of existing methods tends to decrease as the sample size decreases relative to the number of variables. Furthermore, DirectLiNGAM is not feasible when the sample size is smaller than or equal to the number of variables (e.g., [18]).

Cai et al. [1] proposed the scalable causation discovery algorithm (SADA) to address this problem. The main idea of SADA is to group variables into several subsets according to the CI relationships between the variables, then learn the causal DAG of each group, merge the results, and return an estimate of the entire causal DAG. In this method, the causal DAG learning algorithm is applied to each smaller group of variables, reducing the sample size required for it to work. Through simulation studies, Cai et al. [1] showed that SADA improves the estimation accuracy of LiNGAM. In SADA, however, the time complexity of grouping variables is exponential in the number of variables. In addition, SADA has a severe drawback in that even if the correct CI relationships are known, the marginal model for the variables in each group is not necessarily the causal model defined by the sub-DAG induced by the variables in that group. This means the causal DAG estimator of SADA may not be consistent, depending on the true causal DAG (e.g., [2]).

Zhang et al. [2] proposed another algorithm for grouping variables called causality partitioning (CAPA). CAPA requires only low-order CI relationships among variables to group them. Letting $p$ be the number of variables and $\sigma_{\max}$ be the maximum order of the conditioning set, the time complexity of CAPA is $O(p^{\sigma_{\max}+2})$. CAPA guarantees that if the correct low-order CI relationships are known, the marginal model for each group of variables is the causal model defined by the sub-DAG induced by the variables in that group. However, especially in the case of DAGs with high outdegree and low indegree, CAPA tends not to group variables finely enough to improve the estimation accuracy, even if the true conditional independence relationships are known.

Maeda and Shimizu [19] proposed repetitive causal discovery (RCD), which is intended to be applied to models with latent confounders. RCD can also be applied to models without latent confounders. RCD first determines the ancestral relationships among variables using linear regression and CI tests and creates a list of ancestor sets for each variable. The parent-child relationships between variables are then determined from the CI relationships among the variables in each estimated ancestor set. RCD can be interpreted as one of the divide-and-conquer approaches because it splits the entire variable set into the family of ancestor sets of the variables and then learns causal DAGs for each group.

In this paper, we propose another scheme for grouping variables according to the ancestral relationships between the variables. We also assume LiNGAM for the causal model. The proposed method uses the algorithm for finding variables' ancestor sets in RCD. When the true causal DAG is connected and the ancestor sets are correctly estimated, the variable grouping obtained by the proposed method is the family of maximal elements in the family of the unions of each variable and its ancestors. The time complexity of the proposed algorithm for grouping variables is cubic in the number of variables. Suppose the true ancestor sets are correctly estimated. In that case, the marginal model for the variables in each group obtained by the proposed method is always the LiNGAM defined by the sub-DAG induced by the variables in that group, and no edge exists across different groups. Therefore, the entire causal DAG can be consistently estimated by applying a causal structure learning algorithm such as DirectLiNGAM to each group and merging the results. We call the proposed procedure for grouping variables the causal ancestral-relationship-based grouping (CAG).

Extensive computer experiments show that CAG outperforms the original DirectLiNGAM without grouping variables, CAPA, and RCD regarding estimation accuracy and computation time when the sample size is small relative to the number of variables and the true causal DAG is sparse.

The rest of this paper is organized as follows: Section II summarizes some existing causal structure learning algorithms and clarifies the position of the proposed method. Section III describes the details of the proposed method. Section IV confirms the proposed method's usefulness through computer experiments. Section V concludes the paper. The pseudo-code for the proposed method is provided in the Appendix.

II Related Works

II-A LiNGAM and Variants

Let $X=(x_{1},\ldots,x_{p})^{\top}$ be a $p$-dimensional random vector. In the following, we identify $X$ with the variable set. Assume that the causal relationship among the variables is acyclic and linear. LiNGAM [3] is expressed by

$X=BX+e,$  (1)

where the error term $e=(e_{1},\ldots,e_{p})^{\top}$ has mutually independent components that follow continuous non-Gaussian distributions. $B=\{b_{ij}\}$ is a $p\times p$ coefficient matrix, and $b_{ij}$ represents the direct causal effect from $x_{j}$ to $x_{i}$; $b_{ij}=0$ indicates the absence of a direct causal effect from $x_{j}$ to $x_{i}$. We note that the matrix $B$ can be transformed into a strictly lower triangular matrix by permuting the rows and columns when a DAG defines the causal relationships among the variables in $X$.

The model (1) can be rewritten as

$X=(I-B)^{-1}e,$  (2)

where $I$ denotes the $p\times p$ identity matrix. Noting that (2) is equivalent to the independent component analysis (ICA) model, Shimizu et al. [3] showed that $B$ is identifiable and proposed an algorithm for estimating $B$. The algorithm is known as ICA-LiNGAM.
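As a concrete illustration of (2), the following short Python sketch (ours, not part of the original paper) generates data from a three-variable LiNGAM with uniform, hence non-Gaussian, exogenous variables; the particular coefficient matrix and sample size are arbitrary choices for illustration.

    import numpy as np

    rng = np.random.default_rng(0)
    p, n = 3, 1000
    # Strictly lower triangular B encodes the causal order x1 -> x2 -> x3.
    B = np.array([[0.0,  0.0, 0.0],
                  [0.8,  0.0, 0.0],
                  [0.0, -0.6, 0.0]])
    # Mutually independent non-Gaussian (uniform) exogenous variables.
    e = rng.uniform(-1.0, 1.0, size=(n, p))
    # X = (I - B)^{-1} e, applied row-wise to the n samples.
    X = e @ np.linalg.inv(np.eye(p) - B).T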

However, as the number of variables increases, ICA-LiNGAM is more likely to converge to a locally optimal solution, which reduces the accuracy of causal DAG estimation. To overcome this problem, Shimizu et al. [4] proposed another algorithm for estimating $B$, named DirectLiNGAM. For each pair of variables, DirectLiNGAM fits simple linear regressions in both directions, with one variable as the dependent variable and the other as the independent variable; if the independent variable and the residual are mutually independent in only one of the two models, the independent variable of that model precedes the dependent variable in the causal order. After determining the causal order of all variables, DirectLiNGAM estimates the set of parents of each variable via sparse estimation of a linear regression model with that variable as the dependent variable and all variables preceding it in the causal order as independent variables. This approach ensures that no edges violate the estimated causal order. DirectLiNGAM is also less accurate when the number of variables is large relative to the sample size, and it tends to output redundant edges even when the sample size is large.

The computational cost of DirectLiNGAM is higher than that of ICA-LiNGAM. The time complexity of DirectLiNGAM is $O(np^{3}M^{2}+p^{4}M^{3})$, where $n$ is the sample size, $p$ is the number of variables, and $M$ is the maximal rank found by the low-rank decomposition used in the kernel-based independence measure [4]. Since DirectLiNGAM requires estimating a linear regression model, it is not feasible when $n$ is smaller than $p$.

II-B Scalable Causation Discovery Algorithm (SADA)

Cai et al. [1] proposed a divide-and-conquer approach called the scalable causation discovery algorithm (SADA) to enhance the scalability of causal DAG learning algorithms. SADA first splits the variable set $X$ into two or more subsets. Then, a causal DAG learning algorithm such as DirectLiNGAM is applied to each group of variables, and finally, all the results are merged to estimate the entire causal DAG. Because SADA applies the causal DAG learning algorithm to each group with a smaller number of variables, if the variables are grouped correctly, SADA is expected to improve the estimation accuracy when the sample size is small relative to the number of the entire variable set. SADA is feasible even if the sample size is smaller than the number of variables, as long as it is larger than the number of variables in each group.

The procedure for grouping $X$ into two subsets by SADA is as follows. SADA first randomly selects two variables $x_{i},x_{j}\in X$ satisfying $x_{i}\perp\!\!\!\!\perp x_{j}\mid(X\setminus\{x_{i},x_{j}\})$ and finds the smallest $\hat{X}\subseteq X\setminus\{x_{i},x_{j}\}$ with respect to the inclusion relation satisfying $x_{i}\perp\!\!\!\!\perp x_{j}\mid\hat{X}$. The disjoint subsets $X_{1}$, $X_{2}$, and $C$ are computed according to the following procedure.

  1. Initialize with $X_{1}=\{x_{i}\}$, $X_{2}=\{x_{j}\}$, $C=\hat{X}$.

  2. For all $w\in X\setminus(X_{1}\cup X_{2}\cup C)$:

     (a) if $w\perp\!\!\!\!\perp X_{2}\mid\hat{C}$ for some $\hat{C}\subseteq C$, then $X_{1}\leftarrow X_{1}\cup\{w\}$;

     (b) if $w\perp\!\!\!\!\perp X_{1}\mid\hat{C}$ for some $\hat{C}\subseteq C$, then $X_{2}\leftarrow X_{2}\cup\{w\}$;

     (c) otherwise $C\leftarrow C\cup\{w\}$.

  3. For all $s\in C$:

     (a) if $s\perp\!\!\!\!\perp X_{2}\mid\hat{C}$ for some $\hat{C}\subseteq C\setminus\{s\}$, then $X_{1}\leftarrow X_{1}\cup\{s\}$ and $C\leftarrow C\setminus\{s\}$;

     (b) if $s\perp\!\!\!\!\perp X_{1}\mid\hat{C}$ for some $\hat{C}\subseteq C\setminus\{s\}$, then $X_{2}\leftarrow X_{2}\cup\{s\}$ and $C\leftarrow C\setminus\{s\}$.

  4. Return $X_{1}$, $X_{2}$, and $C$.

In step 2, the intermediate set $C$ tends to be large, and step 3 downsizes $C$ as much as possible.

The output of SADA is two subsets $V_{1}=X_{1}\cup C$ and $V_{2}=X_{2}\cup C$. This algorithm can be repeated recursively to group variables into smaller subsets. In the implementation, the lower bound of the number of variables in each group is set to $\theta$, and SADA is applied to $V_{1}$ or $V_{2}$ to create even smaller groups only when $|V_{1}|>\theta$ or $|V_{2}|>\theta$. If the number of variables is less than or equal to $\theta$, or if we cannot find $x_{i},x_{j}\in V_{k}$ satisfying $x_{i}\perp\!\!\!\!\perp x_{j}\mid(V_{k}\setminus\{x_{i},x_{j}\})$, $k=1,2$, no further grouping is made.
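The following Python sketch (ours, not from [1]) mirrors the bipartition step described above. It assumes an oracle ci(a, others, cond) that returns True when variable a is independent of every variable in others given cond; in practice this oracle is replaced by hypothesis tests, and the exhaustive search over subsets of C is exactly what drives the exponential worst case discussed below.

    from itertools import combinations

    def sada_bipartition(variables, x_i, x_j, X_hat, ci):
        X1, X2, C = {x_i}, {x_j}, set(X_hat)

        def indep_given_some_subset(a, others, pool):
            # True if a is independent of `others` given some subset of `pool`.
            return any(ci(a, others, set(s))
                       for r in range(len(pool) + 1)
                       for s in combinations(sorted(pool), r))

        # Step 2: assign the remaining variables to X1, X2, or C.
        for w in [v for v in variables if v not in X1 | X2 | C]:
            if indep_given_some_subset(w, X2, C):
                X1.add(w)
            elif indep_given_some_subset(w, X1, C):
                X2.add(w)
            else:
                C.add(w)
        # Step 3: move elements of C into X1 or X2 where possible.
        for s in list(C):
            if indep_given_some_subset(s, X2, C - {s}):
                X1.add(s); C.discard(s)
            elif indep_given_some_subset(s, X1, C - {s}):
                X2.add(s); C.discard(s)
        return X1 | C, X2 | C   # V1 and V2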

We illustrate the SADA procedure to split $X$ into two subsets where the true causal DAG is the one in Fig. 1. Table I shows the process of grouping variables $x_{1},\ldots,x_{9}$ into two subsets. Since both $x_{3}\perp\!\!\!\!\perp x_{9}\mid\{x_{1},x_{2},x_{4},\ldots,x_{8}\}$ and $x_{3}\perp\!\!\!\!\perp x_{9}$ hold in the model defined by the DAG in Fig. 1, we can initialize with $X_{1}=\{x_{3}\}$, $X_{2}=\{x_{9}\}$, and $C=\emptyset$. We then use the CI tests to determine one by one to which of $X_{1}$, $X_{2}$, and $C$ the remaining variables belong. This procedure returns $X_{1}=\{x_{1},x_{2},x_{3},x_{5},x_{6}\}$, $X_{2}=\{x_{7},x_{8},x_{9}\}$, $C=\{x_{4}\}$, and hence $V_{1}=\{x_{1},x_{2},x_{3},x_{4},x_{5},x_{6}\}$ and $V_{2}=\{x_{4},x_{7},x_{8},x_{9}\}$. In this case, we note that the marginal models for $V_{1}$ and $V_{2}$ are defined by the sub-DAGs induced by $V_{1}$ and $V_{2}$, respectively. If the initial values of $X_{1}$, $X_{2}$, and $C$ or the scanning order of $w\in X\setminus(X_{1}\cup X_{2}\cup C)$ changes, $V_{1}$ and $V_{2}$ may also change.

Figure 1: A causal DAG with nine variables.
TABLE I: The grouping process of the variables in Fig. 1 in SADA.
Step | $X$ | $X_{1}$ | $C$ | $X_{2}$
Initial | $x_{1},x_{2},x_{4},x_{5},x_{6},x_{7},x_{8}$ | $x_{3}$ | $\emptyset$ | $x_{9}$
Check $x_{1}$ | $x_{2},x_{4},x_{5},x_{6},x_{7},x_{8}$ | $x_{1},x_{3}$ | $\emptyset$ | $x_{9}$
Check $x_{2}$ | $x_{4},x_{5},x_{6},x_{7},x_{8}$ | $x_{1},x_{2},x_{3}$ | $\emptyset$ | $x_{9}$
Check $x_{4}$ | $x_{5},x_{6},x_{7},x_{8}$ | $x_{1},x_{2},x_{3}$ | $x_{4}$ | $x_{9}$
Check $x_{5}$ | $x_{6},x_{7},x_{8}$ | $x_{1},x_{2},x_{3},x_{5}$ | $x_{4}$ | $x_{9}$
Check $x_{6}$ | $x_{7},x_{8}$ | $x_{1},x_{2},x_{3},x_{5},x_{6}$ | $x_{4}$ | $x_{9}$
Check $x_{7}$ | $x_{8}$ | $x_{1},x_{2},x_{3},x_{5},x_{6}$ | $x_{4}$ | $x_{7},x_{9}$
Check $x_{8}$ | $\emptyset$ | $x_{1},x_{2},x_{3},x_{5},x_{6}$ | $x_{4}$ | $x_{7},x_{8},x_{9}$

Suppose that $X$ is grouped into $V_{1},\ldots,V_{K}$ by SADA's variable grouping procedure. SADA applies a causal structure learning algorithm such as DirectLiNGAM to each of the $K$ groups. Let $\hat{G}_{1}=(V_{1},E_{1}),\ldots,\hat{G}_{K}=(V_{K},E_{K})$ be the estimated DAGs for the $K$ groups, where $E_{k}\subset V_{k}\times V_{k}$, $k=1,\ldots,K$, is the set of directed edges in $\hat{G}_{k}$. Then, the entire causal DAG is estimated by $\hat{G}=(X,E_{1}\cup\cdots\cup E_{K})$.

However, SADA has several practical problems. In SADA, the marginal model for $V_{k}$, $k=1,\ldots,K$, is not necessarily the causal model defined by the sub-DAG induced by $V_{k}$, even if we knew the correct CI relationships between variables. Assume that the true causal DAG is the one in Fig. 2 and that $\theta\leq 3$. Since $x_{1}\perp\!\!\!\!\perp x_{4}\mid(x_{2},x_{3})$, we can initialize with $X_{1}=\{x_{1}\}$, $X_{2}=\{x_{4}\}$, and $C=\{x_{2},x_{3}\}$, which requires no further variable checking. As a result, we group the original variable set into $V_{1}=\{x_{1},x_{2},x_{3}\}$ and $V_{2}=\{x_{2},x_{3},x_{4}\}$. For $V_{1}$, the marginal model is defined by the sub-DAG induced by $x_{1}$, $x_{2}$, and $x_{3}$. However, the marginal model for $V_{2}$ is not the model defined by the sub-DAG induced by $(x_{2},x_{3},x_{4})$ because $x_{2}\not\perp\!\!\!\!\perp x_{3}$.

Figure 2: An example of SADA not working.

SADA does not guarantee that it always returns a DAG. Cycles may appear when the DAG estimates for the variable groups are merged. Suppose $V_{1}$ and $V_{2}$ share $x_{1}$ and $x_{2}$. If $x_{1}\to x_{2}\in E_{1}$ and $x_{2}\to x_{1}\in E_{2}$, the cycle $x_{1}\to x_{2}\to x_{1}$ appears after merging $\hat{G}_{1}$ and $\hat{G}_{2}$. Therefore, SADA needs to remove redundant edges in cycles.

In addition, SADA requires too many CI tests for grouping variables. Consider the simple model where $x_{1},\ldots,x_{p}$ are mutually independent and no edge exists in the true DAG. In this case, any $x_{i}$ and $x_{j}$ satisfy $x_{i}\perp\!\!\!\!\perp x_{j}\mid(X\setminus\{x_{i},x_{j}\})$. In the worst case, $2^{p-2}$ CI tests are required to ensure $\hat{C}=\emptyset$. Thus, SADA is not feasible for high-dimensional data.

II-C Causality Partitioning (CAPA)

Zhang et al. [2] proposed a refined grouping scheme called the causality partitioning (CAPA). Define the order of the CI test by the number of variables in the conditioning set. CAPA reduces time complexity by setting an upper limit on the order of CI tests and using only CI tests of orders below that limit.

When the maximum order of the CI tests is $\sigma_{\max}$, the time complexity of CAPA is shown to be $O(p^{\sigma_{\max}+2})$. In implementation, $\sigma_{\max}$ is set to a small number, such as two or three.

The procedure for grouping $X$ into two subsets by CAPA is as follows. Let $\sigma$ be the current order of the conditioning set. Let $M^{\sigma}=\{M^{\sigma}_{ij}\}$ be a $p\times p$ matrix such that

$M^{\sigma}_{ij}=\begin{cases}0, & x_{i}\perp\!\!\!\!\perp x_{j}\mid Z \text{ for some } Z\subset X\setminus\{x_{i},x_{j}\} \text{ s.t. } |Z|=\sigma,\\ 1, & \text{otherwise}.\end{cases}$

Let $G^{\sigma}=(X,E^{\sigma})$ be an undirected graph with $M^{\sigma}$ as its adjacency matrix. Let $d_{i}=\sum_{r=1,r\neq i}^{p}M^{\sigma}_{ir}$, $i=1,\ldots,p$, be the degree of $x_{i}$ in $G^{\sigma}$.

  1. Initialize with $\sigma=0$.

  2. Initialize with $A=B=C=D=\emptyset$.

  3. For all $x_{i}\in X$ in ascending order of $d_{i}$, $i=1,\ldots,p$, do at most one of the following so that $A\neq\emptyset$ and $B\neq\emptyset$ at output:

     (a) $A\leftarrow A\cup\{x_{i}\}$ if $(x_{i},x_{j})\notin E^{\sigma}$ for all $x_{j}\in B$;

     (b) $B\leftarrow B\cup\{x_{i}\}$ if $(x_{i},x_{j})\notin E^{\sigma}$ for all $x_{j}\in A$.

  4. $C\leftarrow X\setminus(A\cup B)$.

  5. For all $x_{i}\in A\cup B$: $D\leftarrow D\cup\{x_{i}\}$ if $(x_{i},x_{j})\in E^{\sigma}$ for some $x_{j}\in C$.

  6. $X_{1}=A\cup C\cup D$, $X_{2}=B\cup C\cup D$.

  7. Return $\{X_{1},X_{2}\}$ if $\max(|X_{1}|,|X_{2}|)<|X|$; else if $\sigma=\sigma_{\max}$, exit; else $\sigma\leftarrow\sigma+1$ and go to 2.

Let $S=X_{1}\cap X_{2}$. When we know the true CI relationships between the variables in $X$, $S$ always d-separates $X_{1}\setminus S$ and $X_{2}\setminus S$ in the true causal DAG. Therefore, the marginal models for $X_{1}$ and $X_{2}$ are guaranteed to be defined by the sub-DAGs induced by $X_{1}$ and $X_{2}$, respectively. In general, the ascending order of $d_{i}$, $i=1,\ldots,p$, is not unique, and the output $\{X_{1},X_{2}\}$ may change depending on the choice of the ascending order. In Step 3, both 3(a) and 3(b) may be possible for some $x_{i}$, in which case one of them is selected at random, which may also change the output.

We illustrate the CAPA procedure with $\sigma=1$ to create two subsets where the true causal DAG is the one in Fig. 1. When $\sigma=1$, $G^{\sigma}$ is the undirected graph in Fig. 3. Table II shows an example of the results of Step 3 and Step 4 when $x_{3},x_{5},x_{6},x_{9},x_{2},x_{8},x_{1},x_{7},x_{4}$ is used as the order of $x_{i}$ by $d_{i}$. Since $C=\{x_{4}\}$, we have $D=\{x_{1},x_{5},x_{6},x_{7}\}$. Therefore $X_{1}$ and $X_{2}$ are

$X_{1}=\{x_{1},x_{2},x_{3},x_{4},x_{5},x_{6},x_{7}\}$,
$X_{2}=\{x_{1},x_{4},x_{5},x_{6},x_{7},x_{8},x_{9}\}$,

respectively.

In this example, $|X_{1}|=7$ and $|X_{2}|=7$, so the size of each group is not so small. CAPA can also be repeated recursively to split variables into smaller subsets. CAPA tends to fail to group finely in causal DAGs that contain vertices with high outdegree and low indegree.
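To make the procedure concrete, here is a rough Python sketch (ours, not from [2]) of a single CAPA bipartition. It assumes the order-$\sigma$ matrix $M^{\sigma}$ has already been filled in by low-order CI tests and simply follows Steps 3 to 6; ties in Step 3 are broken by keeping the smaller side non-empty rather than at random.

    import numpy as np

    def capa_bipartition(M):
        # M: p x p symmetric 0/1 matrix; M[i, j] = 0 means x_i and x_j were found
        # independent given some conditioning set of the current order sigma.
        p = M.shape[0]
        degrees = M.sum(axis=1) - np.diag(M)
        A, B = set(), set()
        # Step 3: place each variable, in ascending order of degree, on a side
        # it is not adjacent to in G^sigma.
        for i in map(int, np.argsort(degrees)):
            can_A = all(M[i, j] == 0 for j in B)
            can_B = all(M[i, j] == 0 for j in A)
            if can_A and can_B:
                (A if len(A) <= len(B) else B).add(i)
            elif can_A:
                A.add(i)
            elif can_B:
                B.add(i)
        # Steps 4 and 5: C holds the unplaced variables, D the placed variables
        # adjacent to C.
        C = set(range(p)) - A - B
        D = {i for i in A | B if any(M[i, j] == 1 for j in C)}
        X1, X2 = A | C | D, B | C | D
        # Steps 6 and 7: accept the split only if it actually shrinks the problem.
        return (X1, X2) if max(len(X1), len(X2)) < p else None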

Figure 3: $G^{\sigma}$ with $\sigma=1$ for the DAG in Fig. 1.
TABLE II: The grouping process of the variables in Fig. 1 in CAPA.
order | $d_{i}$ | $A$ | $B$ | $C$
$x_{3}$ | 1 | $x_{3}$ | $\emptyset$ | $\emptyset$
$x_{5}$ | 1 | $x_{3}$ | $x_{5}$ | $\emptyset$
$x_{6}$ | 1 | $x_{3}$ | $x_{5},x_{6}$ | $\emptyset$
$x_{9}$ | 1 | $x_{3}$ | $x_{5},x_{6},x_{9}$ | $\emptyset$
$x_{2}$ | 2 | $x_{2},x_{3}$ | $x_{5},x_{6},x_{9}$ | $\emptyset$
$x_{8}$ | 2 | $x_{2},x_{3}$ | $x_{5},x_{6},x_{8},x_{9}$ | $\emptyset$
$x_{1}$ | 2 | $x_{1},x_{2},x_{3}$ | $x_{5},x_{6},x_{8},x_{9}$ | $\emptyset$
$x_{7}$ | 2 | $x_{1},x_{2},x_{3}$ | $x_{5},x_{6},x_{7},x_{8},x_{9}$ | $\emptyset$
$x_{4}$ | 4 | $x_{1},x_{2},x_{3}$ | $x_{5},x_{6},x_{7},x_{8},x_{9}$ | $x_{4}$

II-D Repetitive Causal Discovery (RCD)

RCD [19] is a novel causal structure learning algorithm that can be applied even when the model contains latent confounders. RCD also assumes LiNGAM, i.e., the causal relationships are linear, and the exogenous variables are independently distributed as continuous non-Gaussian distributions. The processes in RCD are divided into the following three parts: ancestral relationship finding, parental relationship finding, and confounder determining.

The first step of RCD is to determine the ancestral relationships between variables by repeatedly conducting simple linear regressions and independence tests on each variable pair, in a similar way to the determination of the causal order in DirectLiNGAM, and to create a list of ancestor sets for each variable. Let $\widehat{Anc}_{i}$ be the estimated ancestor set of $x_{i}$.

Once ancestral relationships are estimated, RCD extracts parent-child relationships between each pair of variables using CI tests. Assume $x_{j}\in\widehat{Anc}_{i}$. If $x_{i}\not\perp\!\!\!\!\perp x_{j}\mid\widehat{Anc}_{i}\setminus\{x_{j}\}$, then $x_{j}$ is determined to be a parent of $x_{i}$. If the output of RCD contains a variable pair for which the direction of causality cannot be identified, we conclude that there are latent confounders between them.
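A minimal Python sketch (ours, not the RCD implementation) of this parent-extraction step is given below; ci_test stands in for a KCI-type conditional independence test that returns True when independence is accepted.

    def find_parents(anc_hat, ci_test):
        # anc_hat: dict mapping each variable index i to its estimated ancestor set.
        # ci_test(i, j, cond) -> True if x_i and x_j are judged independent given cond.
        parents = {i: set() for i in anc_hat}
        for i, ancestors in anc_hat.items():
            for j in ancestors:
                # x_j stays a parent of x_i if conditioning on the remaining
                # ancestors does not make x_i and x_j independent.
                if not ci_test(i, j, ancestors - {j}):
                    parents[i].add(j)
        return parents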

Our study assumes that the causal model does not contain latent confounders. RCD can also be applied to the model without latent confounders. RCD is interpreted as another divide-and-conquer approach to improve scalability because it estimates the causal DAG by estimating the parent-child relationship between variables within each variable’s ancestor set. Our proposed method also uses the algorithm to find ancestral relationships in RCD to group variables.

RCD may return a graph that contains cycles even when the sample size is large. This is because errors in estimating ancestral relationships lead to an incorrect estimation of the parent-child relationships.

II-E Positioning of the Proposed Method

The proposed method also assumes LiNGAM (1). Like SADA and CAPA, the proposed method begins by grouping variables into multiple subsets based on the ancestral relationships between variables. The ancestral relationships are estimated in the same way as in RCD. When the true causal DAG is connected, the proposed method defines the variable grouping by the maximal elements in the family of the unions of each variable and its ancestors. Thus, if the ancestral relationships are correctly estimated, variables are partitioned into as many groups as the number of sink nodes in the true causal DAG. The proposed method groups finely even for DAGs with high outdegree and low indegree. The proposed method's time complexity for variable grouping is $O(p^{3})$, which is the same order as CAPA with $\sigma_{\max}=1$.

When the ancestral relationships among the variables are correctly estimated, the marginal model for variables in each group obtained by the proposed method is the LiNGAM defined by the sub-DAG induced by the variables in each group.

In RCD, the estimated ancestral relationships are used to estimate parent-child relationships among variables. In contrast, the estimated ancestral relationships are only used to group variables in the proposed method. For each group, DirectLiNGAM is applied to estimate sub-DAGs, which are then merged to estimate the entire causal DAG.

Section III provides a more comprehensive description of our proposed method. Section IV compares the performance of the proposed method, the original DirectLiNGAM, CAPA, and RCD in the absence of latent confounders by computer experiments.

III Causal Ancestral-Relationship-Based Grouping (CAG)

This section describes the proposed algorithm in detail. Let $X=(x_{1},\ldots,x_{p})^{\top}$ be a $p$-dimensional random vector. Let $G=(X,E)$ be the true causal DAG, where $E\subset X\times X$ is the set of edges in $G$. We assume that $X$ follows LiNGAM (1) without latent confounders. Let $Anc_{i}$ and $Pa_{i}$ be the sets of ancestors and parents of $x_{i}$ in $G$, respectively. $CA_{ij}:=Anc_{i}\cap Anc_{j}$ is the set of common ancestors of $x_{i}$ and $x_{j}$ in $G$.

III-A Finding the Ancestral Relationships Between the Variables

This subsection summarizes the algorithm of RCD [19] for finding the ancestral relationships between the variables in $X$. To determine the ancestral relationship between $x_{i}$ and $x_{j}$, $i\neq j$, Maeda and Shimizu [19] considered the simple regression models

$x_{i}=\beta_{ij}x_{j}+u_{i},$  (3)
$x_{j}=\beta_{ji}x_{i}+u_{j},$

where $u_{i}$ and $u_{j}$ are error terms. Maeda and Shimizu [19] focused on the independence relationship between the independent variable and the error term in each model. The following proposition holds for the ancestral relationship between $x_{i}$ and $x_{j}$.

Proposition 1 (Maeda and Shimizu [19]).

One of the following four conditions holds for the ancestral relationship between $x_{i}$ and $x_{j}$.

  1. If $x_{i}\perp\!\!\!\!\perp x_{j}$, then $x_{i}\notin Anc_{j}\wedge x_{j}\notin Anc_{i}$.

  2. If $x_{j}\perp\!\!\!\!\perp u_{i}$, then $x_{j}\in Anc_{i}$.

  3. If $x_{i}\perp\!\!\!\!\perp u_{j}$, then $x_{i}\in Anc_{j}$.

  4. If $x_{i}\not\perp\!\!\!\!\perp u_{j}\wedge x_{j}\not\perp\!\!\!\!\perp u_{i}$, then $CA_{ij}\neq\emptyset$.

This algorithm first checks which of conditions 1 through 4 in Proposition 1 holds for each variable pair $(x_{i},x_{j})$. For implementation, the error terms $u_{i}$ and $u_{j}$ are replaced with the OLS residuals. If $(x_{i},x_{j})$ satisfies condition 4, the determination of the ancestral relationship between $(x_{i},x_{j})$ is withheld. Let $R$ be the set of variable pairs $(x_{i},x_{j})$ that satisfy condition 4. Let $CA^{*}_{ij}$ be the set of common ancestors of $(x_{i},x_{j})$ found while checking Proposition 1 for all pairs $(x_{i},x_{j})$.
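For a single pair, this check can be sketched in Python as follows (our sketch, under the assumption that an unconditional independence test indep(a, b), e.g., HSIC- or kernel-based, is available and returns True when independence is accepted); a regressor that is independent of the OLS residual marks it as the upstream variable, and dependence in both directions corresponds to condition 4.

    import numpy as np

    def pairwise_ancestor_check(xi, xj, indep):
        # xi, xj: 1-D sample arrays of the two variables.
        if indep(xi, xj):
            return "none"                                 # condition 1
        xi_c, xj_c = xi - xi.mean(), xj - xj.mean()
        ui = xi_c - (xi_c @ xj_c / (xj_c @ xj_c)) * xj_c  # residual of x_i on x_j
        uj = xj_c - (xi_c @ xj_c / (xi_c @ xi_c)) * xi_c  # residual of x_j on x_i
        if indep(xj_c, ui):
            return "x_j is an ancestor of x_i"            # condition 2
        if indep(xi_c, uj):
            return "x_i is an ancestor of x_j"            # condition 3
        return "withheld (common ancestors suspected)"    # condition 4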

Assume that $(x_{i},x_{j})\in R$. Then, $CA_{ij}^{*}$ forms part of a backdoor path between $x_{i}$ and $x_{j}$. To remove the influence of $CA^{*}_{ij}$ from $x_{i}$ and $x_{j}$, consider the regression models

$x_{i}=\bm{\alpha}_{ij}^{\top}\cdot CA^{*}_{ij}+v_{i},$  (4)
$x_{j}=\bm{\alpha}_{ji}^{\top}\cdot CA^{*}_{ij}+v_{j},$

where we identify $CA^{*}_{ij}$ with the vector of common ancestors of $(x_{i},x_{j})$. Furthermore, consider the following regression models for the error terms $v_{i}$ and $v_{j}$:

$v_{i}=\beta_{ij}v_{j}+u_{i},$  (5)
$v_{j}=\beta_{ji}v_{i}+u_{j}.$

Then Proposition 1 is generalized as follows.

Proposition 2 (Maeda and Shimizu [19]).

One of the following four conditions holds for the ancestral relationship between $(x_{i},x_{j})\in R$.

  1. If $v_{i}\perp\!\!\!\!\perp v_{j}$, then $x_{i}\notin Anc_{j}\wedge x_{j}\notin Anc_{i}$.

  2. If $v_{j}\perp\!\!\!\!\perp u_{i}$, then $x_{j}\in Anc_{i}$.

  3. If $v_{i}\perp\!\!\!\!\perp u_{j}$, then $x_{i}\in Anc_{j}$.

  4. If $v_{i}\not\perp\!\!\!\!\perp u_{j}\wedge v_{j}\not\perp\!\!\!\!\perp u_{i}$, then $CA_{ij}\setminus CA^{*}_{ij}\neq\emptyset$.

Proposition 1 is the case where $CA_{ij}=\emptyset$. For implementation, the error terms $v$ and $u$ are replaced with OLS residuals. If $(x_{i},x_{j})$ satisfies condition 4, the determination of the ancestral relationship between $x_{i}$ and $x_{j}$ is withheld. After checking Proposition 2 for all $(x_{i},x_{j})\in R$, update $CA^{*}_{ij}$ and $R$. If $R\neq\emptyset$, recheck Proposition 2. Theoretically, if the model contains no latent confounders, the procedure in Proposition 2 can be repeated until $R=\emptyset$ to completely determine the ancestral relationships of all pairs $(x_{i},x_{j})$.

The pseudo-code of this algorithm is described in Algorithm V-A in Appendix A. Since linearity is assumed between the variables, Pearson's correlation test [20] can be used to test the marginal independences $x_{i}\perp\!\!\!\!\perp x_{j}$ and $v_{i}\perp\!\!\!\!\perp v_{j}$. For testing the independence between regressors and residuals, Shimizu and Maeda [19, 21] used the Hilbert-Schmidt independence criterion (HSIC) [22]. This paper instead uses the kernel-based conditional independence (KCI) test [23] because the accuracy of the test remains comparable while the computation time is shorter.

Simply checking the conditions of Proposition 1 or Proposition 2 may yield ancestor relationships such that the estimated graph contains directed cycles due to errors in testing. The proposed method adds a heuristic procedure to avoid directed cycles. See Algorithm V-A in the Appendix for details.

III-B Grouping Variables and Merging Results

This subsection introduces a procedure for grouping XX into several subsets according to the estimated ancestral relationships among variables and merging the estimated causal DAGs for each group.

Let $\widehat{Anc}_{i}$ be the estimated ancestor set of $x_{i}$. Define ${\cal V}$ as the output of the following procedure.

  1. Let ${\cal V}_{0}$ be the family of maximal sets in $\left\{\{x_{i}\}\cup\widehat{Anc}_{i},\; i=1,\ldots,p\right\}$.

  2a. If $|\bm{v}|>1$ for all $\bm{v}\in{\cal V}_{0}$, then ${\cal V}\leftarrow{\cal V}_{0}$.

  2b. Otherwise, ${\cal V}\leftarrow{\cal V}_{0}$, and for all $\bm{v}\in{\cal V}_{0}$ with $|\bm{v}|=1$,
      ${\cal V}\leftarrow({\cal V}\setminus\{\bm{v}\})\cup\left\{\bm{v}\cup\bm{v}^{\prime}:\bm{v}^{\prime}\in{\cal V}_{0},\,\bm{v}^{\prime}\neq\bm{v}\right\}$.

  3. Return ${\cal V}$.

Definition 1.

Define ${\cal V}$ as the grouping of $X$ obtained from the ancestral relationships among $X$.

We call the series of procedures for obtaining ${\cal V}$ the causal ancestral-relationship-based grouping (CAG). If the ancestral relationships among $X$ are correctly estimated and ${\cal V}={\cal V}_{0}$, each element of ${\cal V}$ is the union of a sink node in $G$ and its ancestors. We also note that $\bigcup_{\bm{v}\in{\cal V}}\bm{v}=X$. Now, we have the following theorem.
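A small Python sketch (ours) of Definition 1, assuming the estimated ancestor sets are given as a dict of index sets, is shown below; it computes the maximal sets of ${\cal V}_{0}$ and applies the merging rule 2b to singleton groups.

    def cag_grouping(anc_hat):
        # anc_hat: dict mapping each variable index i to its estimated ancestor set.
        candidates = [frozenset({i}) | frozenset(anc) for i, anc in anc_hat.items()]
        # V0: maximal elements of {x_i} union Anc_i under set inclusion.
        v0 = [s for s in candidates if not any(s < t for t in candidates)]
        v0 = list(dict.fromkeys(v0))                 # drop duplicates
        if all(len(v) > 1 for v in v0):
            return v0                                # rule 2a
        # Rule 2b: merge each singleton group with every other group in V0.
        groups = [v for v in v0 if len(v) > 1]
        for v in (v for v in v0 if len(v) == 1):
            groups.extend(v | w for w in v0 if w != v)
        return list(dict.fromkeys(groups))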

Theorem 1.

Assume that we know the correct ancestral relationships between variables. Define $G_{\bm{v}}$ to be the sub-DAG of $G$ induced by $\bm{v}\in{\cal V}$. Then, the marginal model for $\bm{v}$ is the LiNGAM defined by $G_{\bm{v}}$.

Proof.

The probability density function of $X$ is written as

$p(\bm{x})=\prod_{i=1}^{p}p(x_{i}\mid Pa_{i})=\prod_{i:x_{i}\in\bm{v}}p(x_{i}\mid Pa_{i})\cdot\prod_{j:x_{j}\notin\bm{v}}p(x_{j}\mid Pa_{j}).$

We note that when $x_{i}\in\bm{v}$, $Pa_{i}$ does not include any variable $x_{j}\notin\bm{v}$. By integrating $p(\bm{x})$ out according to the reverse causal order of $X\setminus\bm{v}$, we have

$p(\bm{v})=\prod_{i:x_{i}\in\bm{v}}p(x_{i}\mid Pa_{i}),$

which is the LiNGAM defined by $G_{\bm{v}}$. ∎

Theorem 1 claims that we can consistently estimate the sub-DAGs $G_{\bm{v}}$ for $\bm{v}\in{\cal V}$ by applying a causal structure learning algorithm such as DirectLiNGAM to $\bm{v}$. Once we obtain the estimates $\hat{G}_{\bm{v}}=(\bm{v},E_{\bm{v}})$, where $E_{\bm{v}}=(\bm{v}\times\bm{v})\cap E$, the entire causal graph $G$ can be estimated by $\hat{G}=(X,\bigcup_{\bm{v}\in{\cal V}}E_{\bm{v}})$.
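Assuming a grouping like the one produced above, the per-group estimation and merging step can be sketched as follows; the sketch uses the DirectLiNGAM implementation in the open-source lingam Python package, which is our choice of tooling rather than a requirement of the method, and simply lets later groups overwrite earlier ones on shared variable pairs (cycle handling is discussed below).

    import numpy as np
    import lingam

    def fit_and_merge(X, groups):
        # X: (n, p) data matrix; groups: list of iterables of column indices.
        p = X.shape[1]
        merged = np.zeros((p, p))        # merged[i, j] != 0 means edge x_j -> x_i
        for group in groups:
            idx = sorted(group)
            model = lingam.DirectLiNGAM()
            model.fit(X[:, idx])         # estimate the sub-DAG for this group
            sub = model.adjacency_matrix_
            for a, i in enumerate(idx):
                for b, j in enumerate(idx):
                    if sub[a, b] != 0:
                        merged[i, j] = sub[a, b]
        return merged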

When the sample size is small, many ancestral relationships cannot be detected due to the type II errors of the CI tests in CAG, so there may be many variables whose estimated ancestor set is empty, resulting in many groups consisting of a single variable in ${\cal V}_{0}$. Using such a grouping would make the causal DAG estimates too sparse. Therefore, each group consisting of a single variable in ${\cal V}_{0}$ is merged with the other groups according to procedure 2b above. Note that Theorem 1 holds even if ${\cal V}_{0}$ contains a group consisting of one variable.

Table III presents the correct $Anc_{i}$ and $\bm{v}_{i}$ for $i=1,\ldots,9$ of the DAG in Fig. 1. In this case, we can easily see that ${\cal V}_{0}={\cal V}=\{\bm{v}_{3},\bm{v}_{5},\bm{v}_{6},\bm{v}_{9}\}$. By applying the causal structure learning algorithm to each element of ${\cal V}$, we can consistently estimate the sub-DAGs as shown in Fig. 4. By merging the sub-DAGs as $(X,E_{3}\cup E_{5}\cup E_{6}\cup E_{9})$, we can obtain the true causal DAG in Fig. 1.

TABLE III: $Anc_{i}$ and $\bm{v}_{i}$ of the DAG in Fig. 1.
Variable | $Anc_{i}$ | $\bm{v}_{i}$
$x_{1}$ | $\emptyset$ | $x_{1}$
$x_{2}$ | $x_{1}$ | $x_{1},x_{2}$
$x_{3}$ | $x_{1},x_{2}$ | $x_{1},x_{2},x_{3}$
$x_{4}$ | $x_{1},x_{7}$ | $x_{1},x_{4},x_{7}$
$x_{5}$ | $x_{1},x_{4},x_{7}$ | $x_{1},x_{4},x_{5},x_{7}$
$x_{6}$ | $x_{1},x_{4},x_{7}$ | $x_{1},x_{4},x_{6},x_{7}$
$x_{7}$ | $\emptyset$ | $x_{7}$
$x_{8}$ | $x_{7}$ | $x_{7},x_{8}$
$x_{9}$ | $x_{7},x_{8}$ | $x_{7},x_{8},x_{9}$
Figure 4: An example of the process of the CAG

In the same way as in SADA, the estimated entire DAG may contain cycles due to errors in learning the DAGs for each group, even if the correct ancestral relationships are estimated. The proposed method applies topological sorting to the merged graph to check whether a directed cycle exists. In SADA, the significance of each coefficient $b_{ij}$ for all edges in a cycle is assessed by the Wald test, and the least significant edge in the cycle is removed. In the proposed method, since the ancestral relationships between variables are estimated, directed cycles can also be eliminated by removing edges in the cycle that contradict the estimated ancestral relationships. Another possible approach is to remove the edge in the cycle with the smallest absolute coefficient $|b_{ij}|$. The computer experiments in Section IV compare the performance of these methods for removing cycles.
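A possible implementation (ours) of the ancestral-relationship-based elimination is sketched below: a directed cycle is located by depth-first search, and an edge in the cycle that contradicts the estimated ancestor sets is deleted; the Wald-test and smallest-$|b_{ij}|$ variants differ only in how the edge to delete is chosen.

    def find_cycle(adj):
        # adj: dict mapping each node to the set of its children.
        # Returns the edges of one directed cycle, or None if the graph is acyclic.
        WHITE, GRAY, BLACK = 0, 1, 2
        color = {v: WHITE for v in adj}
        stack = []

        def dfs(v):
            color[v] = GRAY
            stack.append(v)
            for w in adj[v]:
                if color[w] == GRAY:                    # back edge: cycle found
                    cyc = stack[stack.index(w):] + [w]
                    return list(zip(cyc[:-1], cyc[1:]))
                if color[w] == WHITE:
                    found = dfs(w)
                    if found:
                        return found
            color[v] = BLACK
            stack.pop()
            return None

        for v in adj:
            if color[v] == WHITE:
                cycle = dfs(v)
                if cycle:
                    return cycle
        return None

    def remove_cycles_by_ancestry(adj, anc_hat):
        # Delete from each cycle an edge u -> v such that u is not in the
        # estimated ancestor set of v; fall back to an arbitrary cycle edge.
        cycle = find_cycle(adj)
        while cycle:
            bad = [(u, v) for (u, v) in cycle if u not in anc_hat[v]]
            u, v = bad[0] if bad else cycle[0]
            adj[u].discard(v)
            cycle = find_cycle(adj)
        return adj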

III-C Time Complexity of Variable Grouping and Minimum Required Sample Size

It suffices to focus on the number of CI tests to evaluate the time complexity of variable grouping by CAG. The case that requires the most CI tests is when the true causal DAG is fully connected, i.e., $Anc_{i}=Pa_{i}$ for all $x_{i}\in X$. In that case, since the proposed algorithm performs CI tests while removing source nodes one by one from the DAG, the number of required CI tests is

$\binom{p}{2}+\binom{p-1}{2}+\cdots+\binom{2}{2}=O(p^{3}).$
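For completeness, this sum can be evaluated in closed form via the hockey-stick identity, which yields the cubic bound directly:

$\sum_{k=2}^{p}\binom{k}{2}=\binom{p+1}{3}=\frac{(p+1)p(p-1)}{6}=O(p^{3}).$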

As mentioned in Section II-A, the time complexity of DirectLiNGAM is $O(np^{3}M^{2}+p^{4}M^{3})$. Even allowing for the time complexity of grouping variables, if $n$ is small relative to $p$, CAG is expected to reduce the overall computation time compared to the original DirectLiNGAM without variable grouping.

Let $n_{\max}$ be the largest cardinality of the groups $\bm{v}\in{\cal V}$. Then, the required sample size is reduced to $n_{\max}+1$.

IV Computer experiments

This section details the results of computer experiments to compare the estimation accuracy and computation time of CAG with the original DirectLiNGAM without variable grouping, RCD in the absence of latent confounders, and CAPA with $\sigma_{\max}=0,1,2$. In these experiments, CAG and CAPA used DirectLiNGAM as the causal structure learning algorithm for each variable group. In the following, we refer to these procedures as CAG-LiNGAM and CAPA-LiNGAM, respectively. The accuracy of CAG-LiNGAM using the true ancestral relationships was also computed for reference. As mentioned in Section II-D, RCD can also be applied to models with latent confounders. This paper assumes that no latent confounder exists, so in the experiments for RCD, the procedure for determining latent confounders was omitted, and only the ancestral and parent-child relationships among variables were estimated.

IV-A Experimental Settings

The number of variables $p$ and the number of edges $|E|$ were set to $(p,|E|)=(10,5)$, $(10,9)$, $(20,19)$, $(30,29)$, and $(40,39)$. The sample sizes $n$ were set to $11, 25, 50, 100$, and $500$.

The error terms $e_{i}$ were generated from the uniform distribution $U(-1,1)$. The true causal DAGs were randomly generated with $(p,|E|)$ fixed at each iteration. In these experiments, the true causal DAGs were assumed to be sparse, and the indegrees of all vertices were set to at most one. The nonzero elements of the coefficient matrix $B$ were randomly generated from the uniform distribution $U([-1,-0.5]\cup[0.5,1])$ at each iteration. The number of iterations was set to 100.

Let $|E_{c}|$ be the number of correctly detected edges in the 100 estimated DAGs. Denote the numbers of redundant and missing edges in the 100 estimated DAGs by $|E_{r}|$ and $|E_{m}|$, respectively. Note that if an edge in an estimated DAG is oriented in the wrong direction, it adds one to both $|E_{r}|$ and $|E_{m}|$. We evaluated the performance of CAG-LiNGAM and the other methods using the following indices (e.g., Zhu et al. [24]); a small computation sketch follows the list.

  1. The precision $Pre:=\frac{|E_{c}|}{|E_{c}|+|E_{r}|}$ is the ratio of correctly estimated edges among the edges in the estimated DAGs;

  2. The recall $Rec:=\frac{|E_{c}|}{|E_{c}|+|E_{m}|}=\frac{|E_{c}|}{|E|}$ is the ratio of correctly estimated edges among the edges in the true DAGs;

  3. The $F$-measure $F:=\frac{2\times Pre\times Rec}{Pre+Rec}$ is the harmonic mean of the precision and the recall;

  4. $Time$ is the running time in seconds for estimating 100 DAGs.
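A small Python sketch (ours) of how these indices can be computed for a single estimated DAG is given below; the paper aggregates the corresponding edge counts over the 100 replications before computing the ratios. A wrongly oriented edge is counted as both redundant and missing, as described above.

    import numpy as np

    def edge_metrics(B_true, B_hat):
        # B_true, B_hat: adjacency matrices with B[i, j] != 0 meaning x_j -> x_i.
        true_edges = set(zip(*np.nonzero(B_true)))
        est_edges = set(zip(*np.nonzero(B_hat)))
        correct = len(true_edges & est_edges)
        redundant = len(est_edges - true_edges)   # includes wrongly oriented edges
        missing = len(true_edges - est_edges)     # includes wrongly oriented edges
        pre = correct / (correct + redundant) if correct + redundant else 0.0
        rec = correct / (correct + missing) if correct + missing else 0.0
        f = 2 * pre * rec / (pre + rec) if pre + rec else 0.0
        return pre, rec, f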

Let $\alpha_{P}$ be the significance level of Pearson's correlation test in CAG and RCD. Let $\alpha_{CI}$ be the significance level of the KCI test in CAG, RCD, and CAPA. $\alpha_{P}$ and $\alpha_{CI}$ were set to

$\alpha_{P}\in\{0.01,0.05,0.1,0.5\}$,
$\alpha_{CI}\in\{0.001,0.01,0.1,0.2,0.3,0.4,0.5\}$,

respectively.

Experiments for p=40p=40 were conducted on a machine with a 3.0 GHz Intel Core i9 processor and 256 GB memory, and experiments for p=30p=30 were performed on a machine with a 3.0 GHz Intel Core i9 processor and 128 GB memory. The other experiments were conducted on a machine with a 2.1 GHz Dual Xeon processor and 96 GB memory.

IV-B Experimental Results and Discussion

TABLE IV: Performances of CAG, DirectLiNGAM, RCD and CAPA
($p$, $|E|$, $n$) | $Pre$ | $Rec$
Column order within $Pre$ and $Rec$: CAG (W, Anc, abs), $\mathrm{CAG}^{*}$ (Anc), DLi, RCD (W), CAPA (0, 1, 2)
(10, 5, 11) 0.210 0.221 0.213 0.507 0.122 0.268 0.209 0.166 0.193 0.434 0.396 0.446 0.448 0.374 0.128 0.197 0.209 0.194
(10, 5, 25) 0.550 0.543 0.546 0.720 0.311 0.602 0.493 0.597 0.612 0.576 0.578 0.578 0.736 0.576 0.396 0.483 0.420 0.408
(10, 5, 50) 0.773 0.770 0.773 0.850 0.521 0.851 0.725 0.735 0.779 0.764 0.764 0.764 0.928 0.868 0.592 0.696 0.723 0.706
(10, 5, 100) 0.891 0.889 0.891 0.917 0.693 0.967 0.786 0.877 0.896 0.896 0.896 0.896 0.996 0.986 0.826 0.886 0.860 0.840
(10, 5, 500) 0.971 0.971 0.971 0.963 0.813 1.000 0.861 0.901 0.906 0.998 0.998 0.998 1.000 1.000 0.994 0.945 0.956 0.960
(10, 9, 11) 0.251 0.218 0.244 0.396 0.179 0.281 0.214 0.208 0.250 0.431 0.487 0.452 0.483 0.387 0.147 0.207 0.220 0.193
(10, 9, 25) 0.369 0.368 0.350 0.563 0.339 0.452 0.388 0.389 0.429 0.693 0.486 0.727 0.729 0.600 0.301 0.418 0.447 0.426
(10, 9, 50) 0.621 0.599 0.609 0.686 0.482 0.644 0.526 0.523 0.548 0.643 0.650 0.646 0.899 0.818 0.521 0.634 0.651 0.643
(10, 9, 100) 0.779 0.768 0.779 0.834 0.685 0.914 0.680 0.686 0.682 0.851 0.860 0.859 0.986 0.974 0.758 0.845 0.836 0.839
(10, 9, 500) 0.918 0.917 0.918 0.877 0.785 0.997 0.785 0.773 0.774 0.999 0.999 0.999 1.000 1.000 0.999 0.929 0.930 0.926
(20, 19, 25) 0.313 0.319 0.267 0.435 0.172 0.327 0.235 0.410 0.468 0.601 0.423 0.642 0.665 0.496 0.284 0.287 0.220 0.238
(20, 19, 50) 0.528 0.484 0.497 0.625 0.354 0.549 0.447 0.454 0.616 0.662 0.640 0.676 0.876 0.721 0.514 0.561 0.527 0.448
(20, 19, 100) 0.759 0.723 0.751 0.780 0.576 0.837 0.610 0.594 0.681 0.776 0.780 0.778 0.984 0.943 0.729 0.779 0.776 0.685
(20, 19, 500) 0.897 0.894 0.897 0.844 0.725 0.994 0.725 0.718 0.712 0.997 0.997 0.997 1.000 1.000 0.996 0.899 0.895 0.892
(30, 29, 50) 0.535 0.445 0.489 0.655 0.276 0.475 0.374 0.527 0.631 0.618 0.616 0.633 0.855 0.588 0.479 0.438 0.390 0.435
(30, 29, 100) 0.725 0.701 0.673 0.878 0.587 0.813 0.649 0.649 0.676 0.725 0.701 0.853 0.878 0.587 0.813 0.649 0.649 0.676
(30, 29, 500) 0.969 0.967 0.970 0.976 0.919 0.998 0.919 0.916 0.914 0.998 0.998 0.998 1.000 1.000 0.996 0.970 0.968 0.969
(40, 39, 50) 0.464 0.423 0.466 0.625 0.211 0.459 0.311 0.480 0.712 0.800 0.595 0.624 0.839 0.536 0.469 0.381 0.364 0.374
(40, 39, 100) 0.696 0.640 0.652 0.863 0.477 0.792 0.573 0.576 0.679 0.835 0.741 0.849 0.973 0.809 0.686 0.690 0.680 0.635
(40, 39, 500) 0.952 0.944 0.951 0.970 0.917 0.991 0.917 0.914 0.915 0.998 0.998 0.998 1.000 1.000 0.995 0.966 0.966 0.966
($p$, $|E|$, $n$) | $F$ | $Time$ (s)
Column order within $F$ and $Time$: CAG (W, Anc, abs), $\mathrm{CAG}^{*}$ (Anc), DLi, RCD (W), CAPA (0, 1, 2)
(10, 5, 11) 0.283 0.284 0.288 0.476 0.184 0.173 0.203 0.185 0.194 23.0 17.8 21.3 5.0 29.0 6.7 28.2 99.5 210.7
(10, 5, 25) 0.563 0.560 0.562 0.728 0.404 0.478 0.488 0.493 0.489 13.5 13.4 13.4 5.0 27.5 8.0 32.2 55.0 57.1
(10, 5, 50) 0.769 0.767 0.769 0.887 0.651 0.698 0.710 0.729 0.741 13.0 13.0 13.0 5.1 72.1 8.1 32.8 75.1 84.8
(10, 5, 100) 0.893 0.892 0.893 0.955 0.814 0.891 0.833 0.868 0.867 154.6 154.4 154.5 48.9 123.6 123.4 165.2 371.7 502.2
(10, 5, 500) 0.984 0.984 0.984 0.981 0.897 0.997 0.901 0.928 0.932 330.8 330.7 330.8 65.1 151.8 480.5 459.5 1867.9 2529.8
(10, 9, 11) 0.317 0.301 0.317 0.435 0.244 0.193 0.210 0.214 0.218 39.4 23.4 23.5 12.3 28.4 10.2 39.0 112.3 250.8
(10, 9, 25) 0.482 0.419 0.472 0.635 0.433 0.361 0.403 0.416 0.428 34.1 36.5 25.0 11.3 28.9 15.2 44.2 110.8 291.0
(10, 9, 50) 0.632 0.623 0.627 0.778 0.607 0.576 0.575 0.580 0.591 22.8 21.7 21.7 12.3 71.8 18.6 80.3 187.3 370.5
(10, 9, 100) 0.814 0.811 0.817 0.904 0.805 0.829 0.753 0.753 0.752 338.9 334.6 334.6 117.1 120.7 312.3 255.9 1420.2 6078.1
(10, 9, 500) 0.957 0.956 0.957 0.935 0.879 0.998 0.851 0.844 0.843 1062.1 1061.6 1061.6 154.4 155.5 1332.3 425.2 6833.6 19649.6
(20, 19, 25) 0.411 0.364 0.378 0.526 0.255 0.304 0.259 0.286 0.316 520.0 47.2 101.7 29.2 172.3 34.4 524.3 713.6 1044.2
(20, 19, 50) 0.588 0.551 0.573 0.729 0.474 0.531 0.498 0.488 0.518 93.2 57.0 72.2 29.3 323.9 48.6 573.4 926.4 1658.1
(20, 19, 100) 0.768 0.750 0.764 0.870 0.715 0.779 0.684 0.673 0.683 1111.4 1015.5 1015.7 337.2 417.3 991.4 1418.6 10457.6 39708.9
(20, 19, 500) 0.944 0.943 0.944 0.916 0.840 0.995 0.802 0.797 0.792 3918.8 3906.2 3906.3 421.7 542.6 4757.2 1471.6 33655.4 166854.1
(30, 29, 50) 0.573 0.516 0.552 0.742 0.376 0.477 0.403 0.449 0.515 196.6 92.7 129.2 40.2 554.7 65.8 1077.6 1962.0 5290.4
(30, 29, 100) 0.782 0.726 0.753 0.925 0.703 0.747 0.704 0.703 0.709 1381.5 1138.7 1215.6 397.6 604.9 1287.1 1090.2 13253.2 117849.3
(30, 29, 500) 0.983 0.982 0.984 0.988 0.958 0.997 0.943 0.941 0.941 5732.6 5731.6 5731.7 444.0 1449.6 9536.9 3540.0 45572.6 341942.2
(40, 39, 50) 0.588 0.494 0.533 0.716 0.302 0.464 0.343 0.414 0.490 4233.4 192.9 258.3 56.9 962.4 153.3 1920.0 4387.8 6462.0
(40, 39, 100) 0.759 0.687 0.737 0.914 0.600 0.735 0.626 0.624 0.656 2931.3 2527.1 2620.8 604.1 1001.2 2725.9 1740.4 35769.3 352156.9
(40, 39, 500) 0.975 0.970 0.974 0.985 0.956 0.993 0.941 0.939 0.940 10234.1 10225.4 10225.7 706.8 2610.9 16099.2 6221.7 94989.2 779946.2

Table IV shows $Pre$, $Rec$, $F$, and $Time$ for each method for some $(p,|E|,n)$. The experiments for CAG-LiNGAM and RCD were conducted with $4\times 7$ combinations of significance levels of Pearson's correlation test and the KCI test for each $(p,|E|,n)$. The experiments for CAPA-LiNGAM were conducted with 7 significance levels of the KCI test for each $(p,|E|,n)$. Table IV shows only the results of these methods for the significance level that maximizes the $F$-measure.

In Table IV, CAG represents CAG-LiNGAM, while $\mathrm{CAG}^{*}$ refers to CAG-LiNGAM using the true ancestral relationships. DLi stands for the original DirectLiNGAM without grouping variables. W, Anc, and abs under CAG and RCD represent cycle elimination methods based on the Wald test, the estimated ancestral relationships, and the absolute value of the coefficient, respectively. The numbers 0, 1, and 2 under CAPA are the $\sigma_{\max}$ values. In the table, the bold numbers highlight the best performance in each experimental setting $(p,|E|,n)$, excluding $\mathrm{CAG}^{*}$.

As seen from Table IV, CAG-LiNGAM outperforms the original DirectLiNGAM without the variable grouping, RCD, and CAPA-LiNGAM in terms of the $F$-measure and the recall when the sample size is small relative to the number of variables. Although CAG requires $O(p^{3})$ time complexity to group variables, CAG-LiNGAM often shows shorter computation time than the original DirectLiNGAM when the sample size is small. When the sample size is large, CAG-LiNGAM takes longer computation time than the original DirectLiNGAM, but CAG-LiNGAM is superior to the original DirectLiNGAM in terms of precision and the $F$-measure.

CAG with Wald tests takes longer computation time but is more accurate in terms of the $F$-measure than CAG with Anc and abs. As the sample size increases, the accuracy of estimating ancestral relationships and of DirectLiNGAM increases, and hence the frequency of cycles appearing in the estimated DAG decreases, so the difference in estimation accuracy between the three CAG variants becomes smaller.

When the sample size is small, CAPA-LiNGAM may show higher precision than CAG-LiNGAM. However, the recall and the $F$-measure are lower in many cases. This may be because, when the sample size is small, it is difficult for CAPA to detect conditional independence relationships due to the type II error of the CI tests. Then, $G^{\sigma}$ becomes overly sparse, resulting in overly sparse causal DAG estimates.

When $\sigma_{\max}$ is 0 or 1, the time complexity of the variable grouping of CAPA is less than or equal to that of CAG. However, the computation time for CAPA-LiNGAM is not much different from that for CAG-LiNGAM, or CAG-LiNGAM is even faster than CAPA-LiNGAM. Besides, CAG-LiNGAM outperforms CAPA-LiNGAM in terms of the $F$-measure.

It is noteworthy that RCD shows a higher $F$-measure than CAG-LiNGAM when the sample size is large, although it takes more computation time. Conversely, when the sample size is small, CAG-LiNGAM shows a higher $F$-measure than RCD, although it takes more computation time.

The original DirectLiNGAM shows lower precision than divide-and-conquer algorithms. This may be because the original DirectLiNGAM does not group variables, leading to redundant edges across different variable groups. In general, DAGs estimated by DirectLiNGAM tend to have more redundant edges.

Even if the true causal DAG is sparse, the subgraph induced by each variable group becomes relatively denser. In the case of directed trees, the ratio of the number of edges to the number of variable pairs is $(p-1)/\binom{p}{2}=2/p$. This means that the smaller the number of variables in each group is, the denser the sub-DAG induced by each group is. When variables are correctly grouped into smaller subsets, the true sub-DAGs induced by each group become denser, and there are no edges across groups. Therefore, if ancestral relationships can be accurately estimated, applying DirectLiNGAM to each group is expected to avoid more redundant edges than applying it to the entire variable set.

The CAG uses the estimated ancestral relationships for grouping variables but does not use them to estimate the causal DAG for each group. However, since ancestral relationships partially define the structure of the causal DAG, information on ancestral relationships could be used to estimate causal DAGs. The RCD can be interpreted as using information on ancestral relationships to estimate causal DAGs.

Figure 5: $Anc_{rec}$ with Wald tests for some sample sizes

The lines in {red, blue, orange, black, pink} represent $(p,|E|)=(10,5),(10,9),(20,19),(30,29),(40,39)$, respectively.

Let $A_{c}$ and $A_{m}$ be the numbers of correctly detected and missing ancestors in the estimated ancestor lists $\widehat{Anc}_{i}$, $i=1,\ldots,p$, respectively. Define $Anc_{rec}$ as the recall of the list of ancestors,

$Anc_{rec}=\frac{A_{c}}{A_{c}+A_{m}}.$

Figure 5 plots $Anc_{rec}$ of the ancestor list for some $(p,|E|,n)$ when using the significance levels of the CI tests that maximize the $F$-measure of CAG-LiNGAM with Wald tests. This figure shows that when the sample size is small, the accuracy of the estimated ancestor lists is quite low. When $(p,|E|,n)=(10,5,11),(10,9,11),(10,9,25),(20,19,25),(40,39,50)$, $Anc_{rec}$ is exactly zero. However, Table IV shows that the accuracy of CAG-LiNGAM is not that bad. This fact suggests that applying DirectLiNGAM to each group produced by CAG may correct errors in the estimated ancestral relationships within each group.

Table V shows $Anc_{rec}$ computed from the ancestral relationships estimated by CAG and $Anc_{rec}$ computed from the estimated DAGs in the experiments. We can see that $Anc_{rec}$ is significantly improved by applying DirectLiNGAM to each group.

TABLE V: $Anc_{rec}$ before and after applying DirectLiNGAM
$(p,|E|,n)$ | Before: W, Anc, abs | After: W, Anc, abs
(10, 5, 11) 0 0.049 0 0.438 0.405 0.466
(10, 5, 25) 0.300 0.300 0.300 0.519 0.525 0.524
(10, 5, 50) 0.513 0.513 0.513 0.677 0.679 0.679
(10, 5, 100) 0.723 0.723 0.723 0.852 0.852 0.852
(10, 5, 500) 0.919 0.919 0.919 0.998 0.998 0.998
(10, 9, 11) 0 0 0 0.404 0.568 0.447
(10, 9, 25) 0 0.323 0 0.648 0.483 0.701
(10, 9, 50) 0.346 0.346 0.346 0.540 0.557 0.547
(10, 9, 100) 0.514 0.514 0.514 0.791 0.817 0.805
(10, 9, 500) 0.713 0.713 0.713 0.999 0.999 0.999
(20, 19, 25) 0.001 0.177 0.001 0.514 0.366 0.577
(20, 19, 50) 0.320 0.289 0.320 0.545 0.558 0.594
(20, 19, 100) 0.359 0.359 0.359 0.638 0.669 0.651
(20, 19, 500) 0.596 0.596 0.596 0.993 0.998 0.998
(30, 29, 50) 0.268 0.241 0.268 0.450 0.562 0.515
(30, 29, 100) 0.371 0.291 0.371 0.745 0.636 0.782
(30, 29, 500) 0.515 0.515 0.515 0.993 0.997 0.993
(40, 39, 50) 0.001 0.222 0.225 0.627 0.546 0.508
(40, 39, 100) 0.356 0.276 0.356 0.697 0.624 0.785
(40, 39, 500) 0.470 0.470 0.470 0.997 0.994 0.997

Table VI summarizes the following six indices calculated from CAG's experimental results when $p=10$.

  • Improvement ($Imp$): the number of ancestral relationships in the output DAGs that DirectLiNGAM corrects

  • Worsening ($Wor$): the number of ancestral relationships in the output DAGs that DirectLiNGAM worsens

  • Errors in estimation ($Err$): the number of ancestral relationships that are wrongly estimated by CAG

  • Correct estimation ($Corr$): the number of ancestral relationships that are correctly estimated by CAG

  • $Imp/Err$: the proportion of the ancestral relationships incorrectly estimated by CAG that DirectLiNGAM corrects

  • $Wor/Corr$: the proportion of the ancestral relationships correctly estimated by CAG that DirectLiNGAM worsens
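The following Python sketch illustrates one way these counts could be obtained by comparing the status of every ordered variable pair before and after applying DirectLiNGAM with the true ancestral relationships; representing relationships as sets of ordered pairs is an assumption made only for illustration.

from itertools import permutations

def imp_wor_counts(true_rel, before_rel, after_rel, p):
    """Count Corr, Err, Imp, and Wor over all ordered pairs (i, j), i != j.

    Each *_rel argument is a set of ordered pairs (i, j) meaning
    "x_j is an ancestor of x_i".  A pair is estimated correctly when its
    presence or absence agrees with true_rel.
    """
    corr = err = imp = wor = 0
    for pair in permutations(range(p), 2):
        truth = pair in true_rel
        before_ok = (pair in before_rel) == truth
        after_ok = (pair in after_rel) == truth
        if before_ok:
            corr += 1
            if not after_ok:
                wor += 1  # correct in CAG, worsened by DirectLiNGAM
        else:
            err += 1
            if after_ok:
                imp += 1  # wrong in CAG, corrected by DirectLiNGAM
    return corr, err, imp, wor

The ratios $Imp/Err$ and $Wor/Corr$ reported in Table VI then follow directly from these counts.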

TABLE VI: The improvement and worsening of ancestral relationships after applying DirectLiNGAM
$(p,|E|,n)$   Elimination   $Imp$   $Wor$   $Corr$   $Err$   $Imp/Err$   $Wor/Corr$
(10,5,11) Wald 265 1029 3858 642 0.4128 0.2667
Anc 228 858 3858 642 0.3551 0.2224
abs 286 1042 3858 642 0.4455 0.2701
(10,5,25) Wald 239 700 3956 544 0.4393 0.1769
Anc 227 636 3956 544 0.4173 0.1608
abs 238 720 3956 544 0.4375 0.1820
(10,5,50) Wald 204 268 4154 346 0.5896 0.0645
Anc 191 253 4154 346 0.5520 0.0609
abs 211 280 4154 346 0.6098 0.0674
(10,9,11) Wald 794 1226 2627 1873 0.4239 0.4667
Anc 374 559 2627 1873 0.1997 0.2128
abs 794 1226 2627 1873 0.4239 0.4667
(10,9,25) Wald 657 678 2780 1720 0.3820 0.2439
Anc 519 565 2780 1720 0.3017 0.2032
abs 747 804 2780 1720 0.4343 0.2892
(10,9,50) Wald 704 373 3078 1422 0.4951 0.1212
Anc 570 273 3078 1422 0.4008 0.0887
abs 728 412 3078 1422 0.5120 0.1339
(10,9,100) Wald 682 130 3479 1021 0.6680 0.0374
Anc 586 122 3479 1021 0.5739 0.0351
abs 703 118 3479 1021 0.6885 0.0339
(10,9,500) Wald 626 32 3730 770 0.8130 0.0086
Anc 626 33 3730 770 0.8078 0.0088
abs 626 32 3730 770 0.8104 0.0086

Table VI shows that when the sample size is small, $Imp<Wor$ but $Imp/Err>Wor/Corr$. That is, the number of ancestral relationships corrected by applying DirectLiNGAM is smaller than the number it worsens, but the proportion of incorrectly estimated relationships that are corrected exceeds the proportion of correctly estimated relationships that are worsened, which may explain the improvement in the accuracy of causal DAG estimation.

Even with large sample sizes, $Anc_{rec}$ improves by applying DirectLiNGAM. However, Table IV shows that the accuracy of CAG is inferior to that of RCD. When the sample size is large, DirectLiNGAM tends to include redundant edges in the estimated DAG, which may make CAG less accurate than RCD.

This experiment set 28 different significance levels for CAG's CI tests. Table VII presents the significance levels that maximize the $F$-measure of CAG-LiNGAM. A small significance level is often chosen when the sample size is small; in practice, however, the results are almost the same regardless of the significance level chosen. When the sample size is small, ancestral relationships can hardly be detected regardless of the significance level. In this experiment with CAG-LiNGAM and Wald tests, when $(p,|E|,n)=(10,5,11),(10,9,11),(10,9,25)$, no ancestral relationship could be detected within 100 repetitions. From Definition 1 in Section III-A, if no ancestral relationship is detected, the variable grouping is the family of all ${}_{p}C_{2}$ variable pairs. Interestingly, even in such cases, CAG-LiNGAM shows a higher $F$-measure than the other methods. When the sample size is moderately small, the $F$-measure is large when the significance level of Pearson's correlation test is 0.01 and the significance level of KCI is about 0.2 to 0.5. When the sample size is large, the smaller the significance level, the higher the $F$-measure.

TABLE VII: The significance levels used in CAG-LiNGAM in Table IV
                          W                       Anc                      abs
$(p, |E|, n)$   $\alpha_{P}$  $\alpha_{CI}$   $\alpha_{P}$  $\alpha_{CI}$   $\alpha_{P}$  $\alpha_{CI}$
(10, 5, 11) 0.01 0.001 0.01 0.3 0.01 0.001
(10, 5, 25) 0.01 0.5 0.01 0.5 0.01 0.5
(10, 5, 50) 0.01 0.4 0.01 0.4 0.01 0.4
(10, 5, 100) 0.01 0.3 0.01 0.3 0.01 0.3
(10, 5, 500) 0.01 0.001 0.01 0.001 0.01 0.001
(10, 9, 11) 0.05 0.01 0.05 0.01 0.05 0.01
(10, 9, 25) 0.01 0.001 0.5 0.5 0.01 0.001
(10, 9, 50) 0.01 0.3 0.01 0.3 0.01 0.3
(10, 9, 100) 0.01 0.2 0.01 0.2 0.01 0.2
(10, 9, 500) 0.01 0.001 0.01 0.001 0.01 0.001
(20, 19, 25) 0.05 0.001 0.01 0.5 0.05 0.001
(20, 19, 50) 0.01 0.4 0.01 0.3 0.01 0.4
(20, 19, 100) 0.01 0.1 0.01 0.1 0.01 0.1
(20, 19, 500) 0.01 0.001 0.01 0.001 0.01 0.001
(30, 29, 50) 0.01 0.4 0.01 0.3 0.01 0.4
(30, 29, 100) 0.01 0.2 0.01 0.1 0.01 0.2
(30, 29, 500) 0.01 0.001 0.01 0.001 0.01 0.001
(40, 39, 50) 0.01 0.001 0.01 0.3 0.01 0.4
(40, 39, 100) 0.01 0.2 0.01 0.1 0.01 0.1
(40, 39, 500) 0.01 0.001 0.01 0.001 0.01 0.001

As mentioned, $\mathrm{CAG}^{*}$ is CAG-LiNGAM with the true ancestral relationships given. When the sample size is small, the estimation accuracy of CAG-LiNGAM is inferior to that of $\mathrm{CAG}^{*}$ because the ancestral relationships estimated by CAG are less accurate. However, when $p=10$ and the sample size is large, CAG-LiNGAM shows higher precision and $F$-measure than $\mathrm{CAG}^{*}$. As seen from Figure 5 and Table V, the accuracy of the ancestral relationship estimation improves as the sample size increases. However, even when $(p,|E|,n)=(10,9,500)$, $Anc_{rec}$ is 0.713, which is not close enough to 1. From Table VII, when $(p,|E|,n)=(10,9,500)$, $\alpha_{P}=0.01$ and $\alpha_{CI}=0.001$ are used as the significance levels of the CI tests. Even with a large sample size, if small significance levels are used in CAG, some ancestral relationships cannot be detected due to type II errors in the CI tests. As a result, the estimated groups may contain fewer variables than the groups under the true CAG grouping, resulting in fewer redundant edges than $\mathrm{CAG}^{*}$. The grouping by CAG is not the finest grouping that guarantees the identifiability of the true causal DAG. These results suggest that using finer groupings may improve the accuracy of causal DAG estimation when the sample size is large.

V Conclusion

This paper proposes CAG, a divide-and-conquer algorithm for learning causal DAGs. The algorithm helps improve the performance of the original DirectLiNGAM, especially when the sample size is small relative to the number of variables. CAG is based on the ancestor-finding procedure of RCD. CAG-LiNGAM guarantees the identifiability of the true causal DAG. Detailed computer experiments confirm that CAG-LiNGAM can outperform the original DirectLiNGAM, RCD, and CAPA-LiNGAM in terms of the $F$-measure when the true causal DAG is sparse and the sample size is small relative to the number of variables.

If the true causal DAG is connected and the correct CI relationships are estimated, CAG creates as many variable groups as there are sink nodes in the causal DAG. Therefore, when the number of sink nodes in the true causal DAG is small, the number of variables in each group cannot be made sufficiently small, and an improvement in accuracy may not be expected. Other computer experiments have shown that even if the true causal DAG is sparse, the estimation accuracy is not necessarily high when the indegrees of some vertices are larger than one. CAPA behaves in the opposite way to CAG: the number of variable groups produced by CAPA is small when the maximum indegree of the true causal DAG is small, whereas it may be able to split the variables into many groups even when the maximum indegree is large. As noted at the end of the previous section, a divide-and-conquer algorithm is expected to estimate causal DAGs more accurately if finer groupings are used. A hybrid of CAG and CAPA may therefore give fine groupings even when the indegree of the true causal DAG is large.

We also found that RCD has higher estimation accuracy than CAG when the sample size is large. RCD estimates causal DAGs using the estimated ancestral relationships among variables, and the accuracy of the ancestral relationship estimation improves as the sample size increases. Therefore, using the estimated ancestral relationships when estimating causal DAGs may improve the estimation accuracy. However, RCD is computationally expensive when the sample size is large. Incorporating the information on the estimated ancestral relationships into DirectLiNGAM may improve the estimation accuracy of causal DAGs at a relatively low computational cost, even when the sample size is large.

Acknowledgment

This work was supported by JSPS KAKENHI Grant Number JP21K11797.

References

  • [1] R. Cai, Z. Zhang, and Z. Hao, “SADA: A general framework to support robust causation discovery,” in International Conference on Machine Learning, pp. 208–216, PMLR, 2013.
  • [2] H. Zhang, S. Zhou, C. Yan, J. Guan, X. Wang, J. Zhang, and J. Huan, “Learning causal structures based on divide and conquer,” IEEE Transactions on Cybernetics, vol. 52, no. 5, pp. 3232–3243, 2020.
  • [3] S. Shimizu, P. O. Hoyer, A. Hyvärinen, A. Kerminen, and M. Jordan, “A linear non-Gaussian acyclic model for causal discovery,” Journal of Machine Learning Research, vol. 7, no. 10, 2006.
  • [4] S. Shimizu, T. Inazumi, Y. Sogawa, A. Hyvärinen, Y. Kawahara, T. Washio, P. O. Hoyer, and K. Bollen, “DirectLiNGAM: A direct method for learning a linear non-Gaussian structural equation model,” Journal of Machine Learning Research, vol. 12, pp. 1225–1248, 2011.
  • [5] K. D. Hoover, Causality in economics and econometrics. SSRN eLibrary, 2006.
  • [6] K. Sachs, O. Perez, D. Pe’er, D. A. Lauffenburger, and G. P. Nolan, “Causal protein-signaling networks derived from multiparameter single-cell data,” Science, vol. 308, no. 5721, pp. 523–529, 2005.
  • [7] C. Glymour, “Learning causes: Psychological explanations of causal explanation,” Minds and Machines, vol. 8, no. 1, pp. 39–60, 1998.
  • [8] J. Pearl, M. Glymour, and N. P. Jewell, Causal Inference in Statistics: A Primer. Hoboken, New Jersey, U.S.: John Wiley and Sons, 2016.
  • [9] J. Pearl, “Causal diagrams for empirical research,” Biometrika, vol. 82, no. 4, pp. 669–688, 1995.
  • [10] P. Spirtes, C. N. Glymour, and R. Scheines, Causation, prediction, and search. MIT press, 2000.
  • [11] D. M. Chickering, “Optimal structure identification with greedy search,” Journal of Machine Learning Research, vol. 3, pp. 507–554, 2002.
  • [12] G. Schwarz, “Estimating the dimension of a model,” The Annals of Statistics, pp. 461–464, 1978.
  • [13] D. M. Chickering, “Learning Bayesian networks is NP-complete,” Learning from Data: Artificial Intelligence and Statistics V, pp. 121–130, 1996.
  • [14] D. M. Chickering, D. Heckerman, and C. Meek, “Large-sample learning of Bayesian networks is NP-hard,” Journal of Machine Learning Research, vol. 5, pp. 1287–1330, 2004.
  • [15] A. Hyvarinen, J. Karhunen, and E. Oja, “Independent component analysis,” Studies in informatics and control, vol. 11, no. 2, pp. 205–207, 2002.
  • [16] P. Hoyer, D. Janzing, J. M. Mooij, J. Peters, and B. Schölkopf, “Nonlinear causal discovery with additive noise models,” Advances in neural information processing systems, vol. 21, 2008.
  • [17] K. Zhang and A. Hyvarinen, “On the identifiability of the post-nonlinear causal model,” arXiv preprint arXiv:1205.2599, 2012.
  • [18] Y. S. Wang and M. Drton, “High-dimensional causal discovery under non-Gaussianity,” Biometrika, vol. 107, no. 1, pp. 41–59, 2020.
  • [19] T. N. Maeda and S. Shimizu, “RCD: Repetitive causal discovery of linear non-Gaussian acyclic models with latent confounders,” in International Conference on Artificial Intelligence and Statistics, pp. 735–745, PMLR, 2020.
  • [20] C. J. Kowalski, “On the effects of non-normality on the distribution of the sample product-moment correlation coefficient,” Journal of the Royal Statistical Society: Series C (Applied Statistics), vol. 21, no. 1, pp. 1–12, 1972.
  • [21] T. N. Maeda and S. Shimizu, “Repetitive causal discovery of linear non-Gaussian acyclic models in the presence of latent confounders,” International Journal of Data Science and Analytics, pp. 1–13, 2022.
  • [22] A. Gretton, K. Fukumizu, C. Teo, L. Song, B. Schölkopf, and A. Smola, “A kernel statistical test of independence,” Advances in neural information processing systems, vol. 20, 2007.
  • [23] K. Zhang, J. Peters, D. Janzing, and B. Schölkopf, “Kernel-based conditional independence test and application in causal discovery,” arXiv preprint arXiv:1202.3775, 2012.
  • [24] S. Zhu, I. Ng, and Z. Chen, “Causal discovery with reinforcement learning,” arXiv preprint arXiv:1906.04477, 2019.

Appendix

V-A Pseudo-code of Ancestor Finding

This section presents the pseudocode of the ancestor-finding algorithm. Lines 5–6 constitute the procedure that avoids cyclic ancestral relationships and speeds up the ancestor finding. $Pearson(x_{i},x_{j})$ denotes the p-value of Pearson's correlation test between $x_{i}$ and $x_{j}$, and $KCI(u,v)$ denotes the p-value of the kernel-based conditional independence test (KCI) for the independence between $u$ and $v$.

Because no latent confounders are assumed to exist, we can declare that all ancestral relationships have been identified once a full pass over all variable pairs discovers no new ancestral relationship. This means that, unlike the original RCD [19, 21], even if some pairs of variables reach the point in Line 35 where a further search for unknown common ancestors would otherwise be required, they are ultimately treated as having no ancestral relationship.
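As a rough illustration of the check performed for a pair with no detected common ancestors (Lines 26–38), the Python sketch below regresses each variable on the other and tests each residual against the corresponding regressor. It uses scipy.stats.pearsonr for the correlation test; kci_pvalue is a hypothetical stand-in for a KCI (or HSIC) implementation and is not tied to any particular library.

import numpy as np
from scipy.stats import pearsonr

def pairwise_ancestor_check(xi, xj, alpha_p, alpha_ci, kci_pvalue):
    """Decide the ancestral relationship between 1-D samples xi and xj.

    kci_pvalue(u, v) is assumed to return the p-value of an unconditional
    kernel independence test such as KCI or HSIC.
    """
    # Skip the pair if the two variables are uncorrelated.
    _, p_corr = pearsonr(xi, xj)
    if p_corr > alpha_p:
        return "no ancestral relationship"

    # Simple regression residuals: u_i = x_i - b_ij x_j and u_j = x_j - b_ji x_i.
    xi_c, xj_c = xi - xi.mean(), xj - xj.mean()
    b_ij = np.dot(xi_c, xj_c) / np.dot(xj_c, xj_c)
    b_ji = np.dot(xi_c, xj_c) / np.dot(xi_c, xi_c)
    u_i = xi - b_ij * xj
    u_j = xj - b_ji * xi

    p_ui_xj = kci_pvalue(u_i, xj)  # residual of x_i versus regressor x_j
    p_uj_xi = kci_pvalue(u_j, xi)  # residual of x_j versus regressor x_i

    if p_uj_xi > alpha_ci and p_ui_xj <= alpha_ci:
        return "x_i is an ancestor of x_j"
    if p_ui_xj > alpha_ci and p_uj_xi <= alpha_ci:
        return "x_j is an ancestor of x_i"
    return "possible unobserved common ancestors"

When the pair shares detected common ancestors, the same check is applied to the residuals $v_{i}$ and $v_{j}$ obtained after regressing out the common ancestors (Lines 9–24).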

 

Algorithm 1 Ancestor Finding

 

0:  The observed data of the variable set $X$
0:  The list $ANC_{L}$ of the sets of ancestors of all variables
1:  Set the significance level of the Pearson tests to $\alpha_{P}$ and the significance level of KCI to $\alpha_{CI}$
2:  $Anc_{i}\leftarrow\phi$ for $i=1,\ldots,p$; $ANC_{L}\leftarrow\{Anc_{1},\ldots,Anc_{p}\}$
3:  repeat
4:     for each pair of variables $x_{i}$ and $x_{j}$ do
5:        if $\exists x_{k}\in Anc_{i}$ such that $x_{j}\in Anc_{k}$ then
6:           Add $x_{j}$ into $Anc_{i}$
7:        else
8:           if $Anc_{i}\cap Anc_{j}\neq\phi$ then
9:              $CA\leftarrow Anc_{i}\cap Anc_{j}$
10:              $v_{i}\leftarrow x_{i}-\hat{\bm{\alpha}}_{ij}^{\top}CA$, $v_{j}\leftarrow x_{j}-\hat{\bm{\alpha}}_{ji}^{\top}CA$
11:              $pvalue_{l}\leftarrow Pearson(v_{i},v_{j})$
12:              if $pvalue_{l}\leq\alpha_{P}$ then
13:                 $u_{i}\leftarrow v_{i}-\hat{b}_{ij}v_{j}$, $u_{j}\leftarrow v_{j}-\hat{b}_{ji}v_{i}$
14:                 $pvalue^{i}_{j}\leftarrow KCI(u_{i},v_{j})$, $pvalue^{j}_{i}\leftarrow KCI(u_{j},v_{i})$
15:                 if $pvalue^{j}_{i}>\alpha_{CI}$ and $pvalue^{i}_{j}\leq\alpha_{CI}$ then
16:                    Add $x_{i}$ into $Anc_{j}$
17:                 else if $pvalue^{j}_{i}\leq\alpha_{CI}$ and $pvalue^{i}_{j}>\alpha_{CI}$ then
18:                    Add $x_{j}$ into $Anc_{i}$
19:                 else
20:                    There are unobserved common ancestors between $x_{i}$ and $x_{j}$
21:                 end if
22:              else
23:                 No ancestral relationship between $x_{i}$ and $x_{j}$
24:              end if
25:           else
26:              $pvalue_{l}\leftarrow Pearson(x_{i},x_{j})$
27:              if $pvalue_{l}\leq\alpha_{P}$ then
28:                 $u_{i}\leftarrow x_{i}-\hat{b}_{ij}x_{j}$, $u_{j}\leftarrow x_{j}-\hat{b}_{ji}x_{i}$
29:                 $pvalue^{i}_{j}\leftarrow KCI(u_{i},x_{j})$, $pvalue^{j}_{i}\leftarrow KCI(u_{j},x_{i})$
30:                 if $pvalue^{j}_{i}>\alpha_{CI}$ and $pvalue^{i}_{j}\leq\alpha_{CI}$ then
31:                    Add $x_{i}$ into $Anc_{j}$
32:                 else if $pvalue^{j}_{i}\leq\alpha_{CI}$ and $pvalue^{i}_{j}>\alpha_{CI}$ then
33:                    Add $x_{j}$ into $Anc_{i}$
34:                 else
35:                    There are unobserved common ancestors between $x_{i}$ and $x_{j}$
36:                 end if
37:              else
38:                 No ancestral relationship between $x_{i}$ and $x_{j}$
39:              end if
40:           end if
41:        end if
42:     end for
43:  until there is no new change in $ANC_{L}$ during a full pass
44:  return  $ANC_{L}$

 

V-B Pseudo-code of Grouping and Merging

This section presents the pseudocode for grouping the variables and merging the results. In the implementation, type II errors in the CI tests can produce many groups containing only one variable, in which case the output DAG would be too sparse. To avoid this, the procedure in lines 3 through 9 merges each single-element group with the other groups. When the sample size is small, no ancestral relationship may be detected, so every group contains exactly one element; in that case, all ${}_{p}C_{2}$ variable pairs are taken as the variable grouping.

After grouping, lines 10–13 learn the causal sub-DAGs for the groups and merge them. Finally, lines 15–17 eliminate any cycles in the merged DAG.
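As a rough Python sketch of the grouping step (lines 1–9), the code below builds one candidate group per variable, keeps only the maximal candidates, and merges singleton groups with the remaining groups. Representing the group of variable $x_{i}$ as $\{x_{i}\}\cup Anc_{i}$ is our reading of Definition 1 and should be taken as an assumption; the data structures are illustrative, not the authors' implementation.

def cag_grouping(anc_lists):
    """Sketch of the CAG grouping step.

    anc_lists: dict mapping each variable index i to its estimated ancestor
    set Anc_i.  Each candidate group is assumed to be {i} | Anc_i; only the
    maximal candidates (with respect to set inclusion) are kept, and any
    singleton group is merged with every other group.
    """
    candidates = {frozenset({i}) | frozenset(anc) for i, anc in anc_lists.items()}
    # V_0: maximal candidate groups.
    v0 = {g for g in candidates if not any(g < other for other in candidates)}
    if all(len(g) > 1 for g in v0):
        return v0
    groups = set()
    for g in v0:
        if len(g) == 1:
            # Merge the singleton with every other group (lines 6-8).
            groups.update(g | other for other in v0 if other != g)
    return groups

# Toy example: chains x1 -> x2 -> x4 and x1 -> x3 -> x5 (sinks x4 and x5).
anc = {1: set(), 2: {1}, 3: {1}, 4: {1, 2}, 5: {1, 3}}
print(cag_grouping(anc))  # {frozenset({1, 2, 4}), frozenset({1, 3, 5})}

Note that when no ancestral relationship is detected, every candidate is a singleton and the merging step returns all ${}_{p}C_{2}$ variable pairs, as described above.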

 

Algorithm 2 Grouping and Merging

 

0:  The set $ANC_{L}$ of ancestor lists
0:  The set of edges $E$
1:  ${\cal V}\leftarrow\emptyset$
2:  Let ${\cal V}_{0}$ be defined as the family of maximal elements in $ANC_{L}$
3:  if $|\bm{v}|>1$ for all $\bm{v}\in{\cal V}_{0}$ then
4:     ${\cal V}\leftarrow{\cal V}_{0}$
5:  else
6:     for each $\bm{v}\in{\cal V}_{0}$ with $|\bm{v}|=1$ do
7:        Add $\bigcup_{\bm{v}^{\prime}\in{\cal V}_{0},\bm{v}^{\prime}\neq\bm{v}}\{\bm{v}\cup\bm{v}^{\prime}\}$ to ${\cal V}$
8:     end for
9:  end if
10:  for each $\bm{v}$ in ${\cal V}$ do
11:     Estimate the causal sub-DAG $G_{\bm{v}}=(\bm{v},E_{\bm{v}})$ for $\bm{v}$ by DirectLiNGAM
12:  end for
13:  $E\leftarrow\bigcup_{\bm{v}\in{\cal V}}E_{\bm{v}}$
14:  $G=(X,E)$
15:  if $G$ contains cycles then
16:     Eliminate cycles in $G$
17:  end if
18:  return  $G$