
Learning causal graphs using variable grouping according to ancestral relationship

Ming Cai Graduate School of Informatics
Kyoto University
Kyoto, Japan
[email protected]
   Hisayuki Hara Institute for Liberal Arts and Sciences
Kyoto University
Kyoto, Japan
[email protected]
Abstract

Causal discovery has drawn increasing attention in recent years. Over the past quarter century, several useful algorithms for learning causal graphs have been proposed. However, when the sample size is small relative to the number of variables, the accuracy of estimating causal graphs using existing methods decreases. In addition, some methods are not feasible when the sample size is smaller than the number of variables.

To circumvent these problems, some researchers proposed causal structure learning algorithms using divide-and-conquer approaches (e.g., [1, 2]). For learning the entire causal graph, the divide-and-conquer approaches first split variables into several subsets according to the conditional independence relationships among the variables, then apply a conventional causal structure learning algorithm to each subset and merge the estimated results. Since the divide-and-conquer approach reduces the number of variables to which a causal structure learning algorithm is applied, it is expected to improve the estimation accuracy of causal graphs, especially when the sample size is small relative to the number of variables and the model is sparse. However, existing methods are either computationally expensive or do not provide sufficient accuracy when the sample size is small.

This paper proposes a new algorithm for grouping variables according to the ancestral relationships among the variables, assuming that the causal model is LiNGAM [3], where the causal relationships are linear and the exogenous variables are mutually independent and follow continuous non-Gaussian distributions. We call the proposed algorithm CAG (Causal Ancestral-relationship-based Grouping). The time complexity of the ancestor-finding step in CAG is shown to be cubic in the number of variables. Extensive computer experiments confirm that the proposed method outperforms the original DirectLiNGAM [4] without grouping variables and other divide-and-conquer approaches not only in estimation accuracy but also in computation time when the sample size is small relative to the number of variables and the model is sparse.

Index Terms:
causal discovery, causal DAG, conditional independence test, DirectLiNGAM, divide-and-conquer, linear regression.

I Introduction

In recent years, inference using high-dimensional causal models for observational data has played a pivotal role in various fields, such as econometrics [5], biology [6], and psychology [7]. For high-dimensional causal inference, the structural causal model (SCM, [8, 9]) defined by a directed acyclic graph (DAG) has been extensively used. Since the causal relationships among variables are usually unknown, the causal DAG also needs to be learned before the SCM can be used.

Over the past quarter century, several practical algorithms for learning causal DAGs without latent confounders have been proposed. These algorithms are classified into several types.

The constraint-based methods use the conditional independence (CI) relationships among variables to determine causal directions. The PC algorithm [10] is a typical example. Using hypothesis tests, the PC algorithm first estimates the CI relationships between all pairs of variables. Then, it estimates the skeleton of a causal DAG and the directions of the edges in the skeleton in turn. However, the time complexity of the PC algorithm is, in the worst case, exponential in the number of variables. Hence, the PC algorithm is not feasible for high-dimensional data.

The greedy equivalence search (GES) [11] is an algorithm that learns causal structures using a model selection criterion such as BIC [12]. This type of method is called the score-based method. As Chickering et al. [13, 14] have proven, however, the structure learning problem addressed by GES is NP-hard, so its application to high-dimensional causal models is also impractical.

It is important to note that the constraint-based and score-based methods can only identify causal DAGs up to the Markov equivalence class, the set of all DAGs compatible with the inferred CI relationships. To fully identify causal DAGs requires additional constraints on the causal model.

Shimizu et al. [3] considered the linear structural equation model as the causal model, where exogenous variables are mutually independent, and the distributions of the exogenous variables are continuous and non-Gaussian. They called the model linear non-Gaussian acyclic model (LiNGAM) and showed that it is possible to fully identify the causal DAG that defines LiNGAM using independent component analysis (ICA-LiNGAM) [15]. However, ICA-LiNGAM tends to converge to a locally optimal solution when the dimension of the model is high, resulting in lower estimation accuracy (e.g.,[4]). To overcome this difficulty, Shimizu [4] proposed DirectLiNGAM that estimates causal DAGs using linear regression between variables and CI tests.

Hoyer et al. [16] and Zhang and Hyvärinen [17] generalized LiNGAM to nonlinear models and showed that when the model is nonlinear, the causal DAG is identifiable even if the exogenous variables are Gaussian.

However, the estimation accuracy of existing methods tends to decrease as the sample size decreases relative to the number of variables. Furthermore, DirectLiNGAM is not feasible when the sample size is smaller than or equal to the number of variables (e.g., [18]).

Cai et al. [1] proposed the scalable causation discovery algorithm (SADA) to address this problem. The main idea of SADA is to group variables into several subsets according to the CI relationships between the variables, then learn the causal DAG of each group, merge the results, and return an estimate of the entire causal DAG. In this method, the causal DAG learning algorithm is applied to each smaller group of variables, reducing the sample size required for it to work. Through simulation studies, Cai et al. [1] showed that SADA improves the estimation accuracy of LiNGAM. In SADA, however, the time complexity of grouping variables is exponential in the number of variables. In addition, SADA has a severe drawback in that even if the correct CI relationships are known, the marginal model for the variables in each group is not necessarily the causal model defined by the sub-DAG induced by the variables in that group. This means the causal DAG estimator of SADA may not be consistent, depending on the true causal DAG (e.g., [2]).

Zhang et al. [2] proposed another algorithm for grouping variables called causality partitioning (CAPA). CAPA requires only low-order CI relationships among variables to group them. Letting $p$ be the number of variables and $\sigma_{\max}$ be the maximum order of the conditioning set, the time complexity of CAPA is $O(p^{\sigma_{\max}+2})$. CAPA guarantees that if the correct low-order CI relationships are known, the marginal model for each group of variables is the causal model defined by the sub-DAG induced by the variables in that group. However, especially in the case of DAGs with high outdegree and low indegree, CAPA tends not to group variables finely enough to improve the estimation accuracy, even if the true conditional independence relationships are known.

Maeda and Shimizu [19] proposed repetitive causal discovery (RCD), which is intended to be applied to models with latent confounders. RCD can also be applied to models without latent confounders. RCD first determines the ancestral relationships among variables using linear regression and CI tests and creates a list of ancestor sets for each variable. The parent-child relationships between variables are then determined from the CI relationships among the variables in each estimated ancestor set. RCD can be interpreted as one of the divide-and-conquer approaches because it splits the entire variable set into the family of ancestor sets of the variables and then learns causal DAGs for each group.

In this paper, we propose another scheme for grouping variables according to the ancestral relationships between the variables. We also assume LiNGAM for the causal model. The proposed method uses the algorithm for finding variables' ancestor sets in RCD. When the true causal DAG is connected and the ancestor sets are correctly estimated, the variable grouping obtained by the proposed method is the family of maximal elements in the family of the unions of each variable and its ancestors. The time complexity of the proposed algorithm for grouping variables is cubic in the number of variables. Suppose the true ancestor sets are correctly estimated. In that case, the marginal model for the variables in each group obtained by the proposed method is always the LiNGAM defined by the sub-DAG induced by the variables in that group, and no edge exists across different groups. Therefore, the entire causal DAG can be consistently estimated by applying a causal structure learning algorithm such as DirectLiNGAM to each group and merging the results. We call the proposed procedure for grouping variables the causal ancestral-relationship-based grouping (CAG).

Extensive computer experiments show that CAG outperforms the original DirectLiNGAM without grouping variables, CAPA, and RCD regarding estimation accuracy and computation time when the sample size is small relative to the number of variables and the true causal DAG is sparse.

The rest of this paper is organized as follows: Section II summarizes some existing causal structure learning algorithms and clarifies the position of the proposed method. Section III describes the details of the proposed method. Section IV confirms the proposed method's usefulness through computer experiments. Section V concludes the paper. The pseudo-code for the proposed method is provided in the Appendix.

II Related Works

II-A LiNGAM and Variants

Let $X=(x_{1},\ldots,x_{p})^{\top}$ be a $p$-dimensional random vector. In the following, we identify $X$ with the variable set. Assume that the causal relationship among the variables is acyclic and linear. LiNGAM [3] is expressed by

$X=BX+e,$  (1)

where the error term $e=(e_{1},\ldots,e_{p})^{\top}$ has mutually independent components that follow continuous non-Gaussian distributions. $B=\{b_{ij}\}$ is a $p\times p$ coefficient matrix, and $b_{ij}$ represents the direct causal effect from $x_{j}$ to $x_{i}$; $b_{ij}=0$ indicates the absence of a direct causal effect from $x_{j}$ to $x_{i}$. We note that the matrix $B$ can be transformed into a strictly lower triangular matrix by permuting the rows and columns when a DAG defines the causal relationships among the variables in $X$.

The model (1) can be rewritten as

$X=(I-B)^{-1}e,$  (2)

where $I$ denotes the $p\times p$ identity matrix. Noting that (2) is equivalent to the independent component analysis (ICA) model, Shimizu et al. [3] showed that $B$ is identifiable and proposed an algorithm for estimating $B$. The algorithm is known as ICA-LiNGAM.
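As a concrete illustration of (2), the following short Python sketch (ours, not part of the original paper) generates data from a three-variable LiNGAM with uniform, hence non-Gaussian, exogenous variables; the particular coefficient matrix and sample size are arbitrary choices for illustration.

    import numpy as np

    rng = np.random.default_rng(0)
    p, n = 3, 1000
    # Strictly lower triangular B encodes the causal order x1 -> x2 -> x3.
    B = np.array([[0.0,  0.0, 0.0],
                  [0.8,  0.0, 0.0],
                  [0.0, -0.6, 0.0]])
    # Mutually independent non-Gaussian (uniform) exogenous variables.
    e = rng.uniform(-1.0, 1.0, size=(n, p))
    # X = (I - B)^{-1} e, applied row-wise to the n samples.
    X = e @ np.linalg.inv(np.eye(p) - B).T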

However, as the number of variables increases, ICA-LiNGAM is more likely to converge to a locally optimal solution, which reduces the accuracy of causal DAG estimation. To overcome this problem, Shimizu et al. [4] proposed another algorithm for estimating $B$, named DirectLiNGAM. For each pair of variables, DirectLiNGAM fits simple linear regressions in both directions, with one variable as the dependent variable and the other as the independent variable; if the independent variable and the residual are mutually independent in only one of the two models, the independent variable of that model precedes the dependent variable in the causal order. After determining the causal order of all variables, DirectLiNGAM estimates the set of parents of each variable via sparse estimation of a linear regression model with that variable as the dependent variable and all variables preceding it in the causal order as independent variables. This approach ensures that no edges violate the estimated causal order. DirectLiNGAM is also less accurate when the number of variables is large relative to the sample size, and it tends to output redundant edges even when the sample size is large.

The computational cost of DirectLiNGAM is higher than that of ICA-LiNGAM. The time complexity of DirectLiNGAM is $O(np^{3}M^{2}+p^{4}M^{3})$, where $n$ is the sample size, $p$ is the number of variables, and $M$ is the maximal rank found by the low-rank decomposition used in the kernel-based independence measure [4]. Since DirectLiNGAM requires estimating a linear regression model, it is not feasible when $n$ is smaller than $p$.

II-B Scalable Causation Discovery Algorithm (SADA)

Cai et al. [1] proposed a divide-and-conquer approach called the scalable causation discovery algorithm (SADA) to enhance the scalability of causal DAG learning algorithms. SADA first splits the variable set $X$ into two or more subsets. Then, a causal DAG learning algorithm such as DirectLiNGAM is applied to each group of variables, and finally, all the results are merged to estimate the entire causal DAG. Because SADA applies the causal DAG learning algorithm to each group with a smaller number of variables, if the variables are grouped correctly, SADA is expected to improve the estimation accuracy when the sample size is small relative to the number of the entire variable set. SADA is feasible even if the sample size is smaller than the number of variables, as long as it is larger than the number of variables in each group.

The procedure for grouping $X$ into two subsets by SADA is as follows. SADA first randomly selects two variables $x_{i},x_{j}\in X$ satisfying $x_{i}\perp\!\!\!\!\perp x_{j}\mid(X\setminus\{x_{i},x_{j}\})$ and finds the smallest $\hat{X}\subseteq X\setminus\{x_{i},x_{j}\}$ with respect to the inclusion relation satisfying $x_{i}\perp\!\!\!\!\perp x_{j}\mid\hat{X}$. The disjoint subsets $X_{1}$, $X_{2}$, and $C$ are computed according to the following procedure.

  1. Initialize with $X_{1}=\{x_{i}\}$, $X_{2}=\{x_{j}\}$, $C=\hat{X}$.

  2. For all $w\in X\setminus(X_{1}\cup X_{2}\cup C)$:

     (a) if $w\perp\!\!\!\!\perp X_{2}\mid\hat{C}$ for some $\hat{C}\subseteq C$, then $X_{1}\leftarrow X_{1}\cup\{w\}$;

     (b) if $w\perp\!\!\!\!\perp X_{1}\mid\hat{C}$ for some $\hat{C}\subseteq C$, then $X_{2}\leftarrow X_{2}\cup\{w\}$;

     (c) otherwise $C\leftarrow C\cup\{w\}$.

  3. For all $s\in C$:

     (a) if $s\perp\!\!\!\!\perp X_{2}\mid\hat{C}$ for some $\hat{C}\subseteq C\setminus\{s\}$, then $X_{1}\leftarrow X_{1}\cup\{s\}$ and $C\leftarrow C\setminus\{s\}$;

     (b) if $s\perp\!\!\!\!\perp X_{1}\mid\hat{C}$ for some $\hat{C}\subseteq C\setminus\{s\}$, then $X_{2}\leftarrow X_{2}\cup\{s\}$ and $C\leftarrow C\setminus\{s\}$.

  4. Return $X_{1}$, $X_{2}$, and $C$.

In step 2, the intermediate set $C$ tends to be large, and step 3 downsizes $C$ as much as possible.

The output of SADA is two subsets $V_{1}=X_{1}\cup C$ and $V_{2}=X_{2}\cup C$. This algorithm can be repeated recursively to group variables into smaller subsets. In the implementation, the lower bound of the number of variables in each group is set to $\theta$, and SADA is applied to $V_{1}$ or $V_{2}$ to create even smaller groups only when $|V_{1}|>\theta$ or $|V_{2}|>\theta$. If the number of variables is less than or equal to $\theta$, or if we cannot find $x_{i},x_{j}\in V_{k}$ satisfying $x_{i}\perp\!\!\!\!\perp x_{j}\mid(V_{k}\setminus\{x_{i},x_{j}\})$, $k=1,2$, no further grouping is made.
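The following Python sketch (ours, not from [1]) mirrors the bipartition step described above. It assumes an oracle ci(a, others, cond) that returns True when variable a is independent of every variable in others given cond; in practice this oracle is replaced by hypothesis tests, and the exhaustive search over subsets of C is exactly what drives the exponential worst case discussed below.

    from itertools import combinations

    def sada_bipartition(variables, x_i, x_j, X_hat, ci):
        X1, X2, C = {x_i}, {x_j}, set(X_hat)

        def indep_given_some_subset(a, others, pool):
            # True if a is independent of `others` given some subset of `pool`.
            return any(ci(a, others, set(s))
                       for r in range(len(pool) + 1)
                       for s in combinations(sorted(pool), r))

        # Step 2: assign the remaining variables to X1, X2, or C.
        for w in [v for v in variables if v not in X1 | X2 | C]:
            if indep_given_some_subset(w, X2, C):
                X1.add(w)
            elif indep_given_some_subset(w, X1, C):
                X2.add(w)
            else:
                C.add(w)
        # Step 3: move elements of C into X1 or X2 where possible.
        for s in list(C):
            if indep_given_some_subset(s, X2, C - {s}):
                X1.add(s); C.discard(s)
            elif indep_given_some_subset(s, X1, C - {s}):
                X2.add(s); C.discard(s)
        return X1 | C, X2 | C   # V1 and V2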

We illustrate the SADA procedure to split $X$ into two subsets where the true causal DAG is the one in Fig. 1. Table I shows the process of grouping variables $x_{1},\ldots,x_{9}$ into two subsets. Since both $x_{3}\perp\!\!\!\!\perp x_{9}\mid\{x_{1},x_{2},x_{4},\ldots,x_{8}\}$ and $x_{3}\perp\!\!\!\!\perp x_{9}$ hold in the model defined by the DAG in Fig. 1, we can initialize with $X_{1}=\{x_{3}\}$, $X_{2}=\{x_{9}\}$, and $C=\emptyset$. We then use the CI tests to determine one by one to which of $X_{1}$, $X_{2}$, and $C$ the remaining variables belong. This procedure returns $X_{1}=\{x_{1},x_{2},x_{3},x_{5},x_{6}\}$, $X_{2}=\{x_{7},x_{8},x_{9}\}$, $C=\{x_{4}\}$, and hence $V_{1}=\{x_{1},x_{2},x_{3},x_{4},x_{5},x_{6}\}$ and $V_{2}=\{x_{4},x_{7},x_{8},x_{9}\}$. In this case, we note that the marginal models for $V_{1}$ and $V_{2}$ are defined by the sub-DAGs induced by $V_{1}$ and $V_{2}$, respectively. If the initial values of $X_{1}$, $X_{2}$, and $C$ or the scanning order of $w\in X\setminus(X_{1}\cup X_{2}\cup C)$ changes, $V_{1}$ and $V_{2}$ may also change.

Figure 1: A causal DAG with nine variables.
TABLE I: The grouping process of the variables in Fig. 1 in SADA.
Step | $X$ | $X_{1}$ | $C$ | $X_{2}$
Initial | $x_{1},x_{2},x_{4},x_{5},x_{6},x_{7},x_{8}$ | $x_{3}$ | $\emptyset$ | $x_{9}$
Check $x_{1}$ | $x_{2},x_{4},x_{5},x_{6},x_{7},x_{8}$ | $x_{1},x_{3}$ | $\emptyset$ | $x_{9}$
Check $x_{2}$ | $x_{4},x_{5},x_{6},x_{7},x_{8}$ | $x_{1},x_{2},x_{3}$ | $\emptyset$ | $x_{9}$
Check $x_{4}$ | $x_{5},x_{6},x_{7},x_{8}$ | $x_{1},x_{2},x_{3}$ | $x_{4}$ | $x_{9}$
Check $x_{5}$ | $x_{6},x_{7},x_{8}$ | $x_{1},x_{2},x_{3},x_{5}$ | $x_{4}$ | $x_{9}$
Check $x_{6}$ | $x_{7},x_{8}$ | $x_{1},x_{2},x_{3},x_{5},x_{6}$ | $x_{4}$ | $x_{9}$
Check $x_{7}$ | $x_{8}$ | $x_{1},x_{2},x_{3},x_{5},x_{6}$ | $x_{4}$ | $x_{7},x_{9}$
Check $x_{8}$ | $\emptyset$ | $x_{1},x_{2},x_{3},x_{5},x_{6}$ | $x_{4}$ | $x_{7},x_{8},x_{9}$

Suppose that $X$ is grouped into $V_{1},\ldots,V_{K}$ by SADA's variable grouping procedure. SADA applies a causal structure learning algorithm such as DirectLiNGAM to each of the $K$ groups. Let $\hat{G}_{1}=(V_{1},E_{1}),\ldots,\hat{G}_{K}=(V_{K},E_{K})$ be the estimated DAGs for the $K$ groups, where $E_{k}\subset V_{k}\times V_{k}$, $k=1,\ldots,K$, is the set of directed edges in $\hat{G}_{k}$. Then, the entire causal DAG is estimated by $\hat{G}=(X,E_{1}\cup\cdots\cup E_{K})$.

However, SADA has several practical problems. In SADA, the marginal model for $V_{k}$, $k=1,\ldots,K$, is not necessarily the causal model defined by the sub-DAG induced by $V_{k}$, even if we knew the correct CI relationships between variables. Assume that the true causal DAG is the one in Fig. 2 and that $\theta\leq 3$. Since $x_{1}\perp\!\!\!\!\perp x_{4}\mid(x_{2},x_{3})$, we can initialize with $X_{1}=\{x_{1}\}$, $X_{2}=\{x_{4}\}$, and $C=\{x_{2},x_{3}\}$, which requires no further variable checking. As a result, we group the original variable set into $V_{1}=\{x_{1},x_{2},x_{3}\}$ and $V_{2}=\{x_{2},x_{3},x_{4}\}$. For $V_{1}$, the marginal model is defined by the sub-DAG induced by $x_{1}$, $x_{2}$, and $x_{3}$. However, the marginal model for $V_{2}$ is not the model defined by the sub-DAG induced by $(x_{2},x_{3},x_{4})$ because $x_{2}\not\perp\!\!\!\!\perp x_{3}$.

Figure 2: An example of SADA not working.

SADA does not guarantee that it always returns a DAG. Cycles may appear when the DAG estimates for the variable groups are merged. Suppose $V_{1}$ and $V_{2}$ share $x_{1}$ and $x_{2}$. If $x_{1}\to x_{2}\in E_{1}$ and $x_{2}\to x_{1}\in E_{2}$, the cycle $x_{1}\to x_{2}\to x_{1}$ appears after merging $\hat{G}_{1}$ and $\hat{G}_{2}$. Therefore, SADA needs to remove redundant edges in cycles.

In addition, SADA requires too many CI tests for grouping variables. Consider the simple model where $x_{1},\ldots,x_{p}$ are mutually independent and no edge exists in the true DAG. In this case, any $x_{i}$ and $x_{j}$ satisfy $x_{i}\perp\!\!\!\!\perp x_{j}\mid(X\setminus\{x_{i},x_{j}\})$. In the worst case, $2^{p-2}$ CI tests are required to ensure $\hat{C}=\emptyset$. Thus, SADA is not feasible for high-dimensional data.

II-C Causality Partitioning (CAPA)

Zhang et al. [2] proposed a refined grouping scheme called the causality partitioning (CAPA). Define the order of the CI test by the number of variables in the conditioning set. CAPA reduces time complexity by setting an upper limit on the order of CI tests and using only CI tests of orders below that limit.

When the maximum order of the CI tests is $\sigma_{\max}$, the time complexity of CAPA is shown to be $O(p^{\sigma_{\max}+2})$. In implementation, $\sigma_{\max}$ is set to a small number, such as two or three.

The procedure for grouping $X$ into two subsets by CAPA is as follows. Let $\sigma$ be the current order of the conditioning set. Let $M^{\sigma}=\{M^{\sigma}_{ij}\}$ be a $p\times p$ matrix such that

$M^{\sigma}_{ij}=\begin{cases}0, & x_{i}\perp\!\!\!\!\perp x_{j}\mid Z \text{ for some } Z\subset X\setminus\{x_{i},x_{j}\} \text{ s.t. } |Z|=\sigma,\\ 1, & \text{otherwise}.\end{cases}$

Let $G^{\sigma}=(X,E^{\sigma})$ be an undirected graph with $M^{\sigma}$ as its adjacency matrix. Let $d_{i}=\sum_{r=1,r\neq i}^{p}M^{\sigma}_{ir}$, $i=1,\ldots,p$, be the degree of $x_{i}$ in $G^{\sigma}$.

  1. Initialize with $\sigma=0$.

  2. Initialize with $A=B=C=D=\emptyset$.

  3. For all $x_{i}\in X$ in ascending order of $d_{i}$, $i=1,\ldots,p$, do at most one of the following so that $A\neq\emptyset$ and $B\neq\emptyset$ at output:

     (a) $A\leftarrow A\cup\{x_{i}\}$ if $(x_{i},x_{j})\notin E^{\sigma}$ for all $x_{j}\in B$;

     (b) $B\leftarrow B\cup\{x_{i}\}$ if $(x_{i},x_{j})\notin E^{\sigma}$ for all $x_{j}\in A$.

  4. $C\leftarrow X\setminus(A\cup B)$.

  5. For all $x_{i}\in A\cup B$: $D\leftarrow D\cup\{x_{i}\}$ if $(x_{i},x_{j})\in E^{\sigma}$ for some $x_{j}\in C$.

  6. $X_{1}=A\cup C\cup D$, $X_{2}=B\cup C\cup D$.

  7. Return $\{X_{1},X_{2}\}$ if $\max(|X_{1}|,|X_{2}|)<|X|$; else if $\sigma=\sigma_{\max}$, exit; else $\sigma\leftarrow\sigma+1$ and go to 2.

Let $S=X_{1}\cap X_{2}$. When we know the true CI relationships between the variables in $X$, $S$ always d-separates $X_{1}\setminus S$ and $X_{2}\setminus S$ in the true causal DAG. Therefore, the marginal models for $X_{1}$ and $X_{2}$ are guaranteed to be defined by the sub-DAGs induced by $X_{1}$ and $X_{2}$, respectively. In general, the ascending order of $d_{i}$, $i=1,\ldots,p$, is not unique, and the output $\{X_{1},X_{2}\}$ may change depending on the choice of the ascending order. In Step 3, both 3(a) and 3(b) may be possible for some $x_{i}$, in which case one of them is selected at random, which may also change the output.

We illustrate the CAPA procedure with $\sigma=1$ to create two subsets where the true causal DAG is the one in Fig. 1. When $\sigma=1$, $G^{\sigma}$ is the undirected graph in Fig. 3. Table II shows an example of the results of Step 3 and Step 4 when $x_{3},x_{5},x_{6},x_{9},x_{2},x_{8},x_{1},x_{7},x_{4}$ is used as the order of $x_{i}$ by $d_{i}$. Since $C=\{x_{4}\}$, we have $D=\{x_{1},x_{5},x_{6},x_{7}\}$. Therefore $X_{1}$ and $X_{2}$ are

$X_{1}=\{x_{1},x_{2},x_{3},x_{4},x_{5},x_{6},x_{7}\}$,
$X_{2}=\{x_{1},x_{4},x_{5},x_{6},x_{7},x_{8},x_{9}\}$,

respectively.

In this example, $|X_{1}|=7$ and $|X_{2}|=7$, so the size of each group is not so small. CAPA can also be repeated recursively to split variables into smaller subsets. CAPA tends to fail to group finely in causal DAGs that contain vertices with high outdegree and low indegree.
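To make the procedure concrete, here is a rough Python sketch (ours, not from [2]) of a single CAPA bipartition. It assumes the order-$\sigma$ matrix $M^{\sigma}$ has already been filled in by low-order CI tests and simply follows Steps 3 to 6; ties in Step 3 are broken by keeping the smaller side non-empty rather than at random.

    import numpy as np

    def capa_bipartition(M):
        # M: p x p symmetric 0/1 matrix; M[i, j] = 0 means x_i and x_j were found
        # independent given some conditioning set of the current order sigma.
        p = M.shape[0]
        degrees = M.sum(axis=1) - np.diag(M)
        A, B = set(), set()
        # Step 3: place each variable, in ascending order of degree, on a side
        # it is not adjacent to in G^sigma.
        for i in map(int, np.argsort(degrees)):
            can_A = all(M[i, j] == 0 for j in B)
            can_B = all(M[i, j] == 0 for j in A)
            if can_A and can_B:
                (A if len(A) <= len(B) else B).add(i)
            elif can_A:
                A.add(i)
            elif can_B:
                B.add(i)
        # Steps 4 and 5: C holds the unplaced variables, D the placed variables
        # adjacent to C.
        C = set(range(p)) - A - B
        D = {i for i in A | B if any(M[i, j] == 1 for j in C)}
        X1, X2 = A | C | D, B | C | D
        # Steps 6 and 7: accept the split only if it actually shrinks the problem.
        return (X1, X2) if max(len(X1), len(X2)) < p else None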

Figure 3: $G^{\sigma}$ with $\sigma=1$ for the DAG in Fig. 1.
TABLE II: The grouping process of the variables in Fig. 1 in CAPA.
order | $d_{i}$ | $A$ | $B$ | $C$
$x_{3}$ | 1 | $x_{3}$ | $\emptyset$ | $\emptyset$
$x_{5}$ | 1 | $x_{3}$ | $x_{5}$ | $\emptyset$
$x_{6}$ | 1 | $x_{3}$ | $x_{5},x_{6}$ | $\emptyset$
$x_{9}$ | 1 | $x_{3}$ | $x_{5},x_{6},x_{9}$ | $\emptyset$
$x_{2}$ | 2 | $x_{2},x_{3}$ | $x_{5},x_{6},x_{9}$ | $\emptyset$
$x_{8}$ | 2 | $x_{2},x_{3}$ | $x_{5},x_{6},x_{8},x_{9}$ | $\emptyset$
$x_{1}$ | 2 | $x_{1},x_{2},x_{3}$ | $x_{5},x_{6},x_{8},x_{9}$ | $\emptyset$
$x_{7}$ | 2 | $x_{1},x_{2},x_{3}$ | $x_{5},x_{6},x_{7},x_{8},x_{9}$ | $\emptyset$
$x_{4}$ | 4 | $x_{1},x_{2},x_{3}$ | $x_{5},x_{6},x_{7},x_{8},x_{9}$ | $x_{4}$

II-D Repetitive Causal Discovery (RCD)

RCD [19] is a novel causal structure learning algorithm that can be applied even when the model contains latent confounders. RCD also assumes LiNGAM, i.e., the causal relationships are linear, and the exogenous variables are independently distributed as continuous non-Gaussian distributions. The processes in RCD are divided into the following three parts: ancestral relationship finding, parental relationship finding, and confounder determining.

The first step of RCD is to determine the ancestral relationships between variables by repeatedly conducting simple linear regressions and independence tests on each variable pair, in a similar way to the determination of the causal order in DirectLiNGAM, and to create a list of ancestor sets for each variable. Let $\widehat{Anc}_{i}$ be the estimated ancestor set of $x_{i}$.

Once ancestral relationships are estimated, RCD extracts parent-child relationships between each pair of variables using CI tests. Assume $x_{j}\in\widehat{Anc}_{i}$. If $x_{i}\not\perp\!\!\!\!\perp x_{j}\mid\widehat{Anc}_{i}\setminus\{x_{j}\}$, then $x_{j}$ is determined to be a parent of $x_{i}$. If the output of RCD contains a variable pair for which the direction of causality cannot be identified, we conclude that there are latent confounders between them.
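A minimal Python sketch (ours, not the RCD implementation) of this parent-extraction step is given below; ci_test stands in for a KCI-type conditional independence test that returns True when independence is accepted.

    def find_parents(anc_hat, ci_test):
        # anc_hat: dict mapping each variable index i to its estimated ancestor set.
        # ci_test(i, j, cond) -> True if x_i and x_j are judged independent given cond.
        parents = {i: set() for i in anc_hat}
        for i, ancestors in anc_hat.items():
            for j in ancestors:
                # x_j stays a parent of x_i if conditioning on the remaining
                # ancestors does not make x_i and x_j independent.
                if not ci_test(i, j, ancestors - {j}):
                    parents[i].add(j)
        return parents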

Our study assumes that the causal model does not contain latent confounders. RCD can also be applied to the model without latent confounders. RCD is interpreted as another divide-and-conquer approach to improve scalability because it estimates the causal DAG by estimating the parent-child relationship between variables within each variable’s ancestor set. Our proposed method also uses the algorithm to find ancestral relationships in RCD to group variables.

RCD may return a graph that contains cycles even when the sample size is large. This is because errors in estimating ancestral relationships lead to an incorrect estimation of the parent-child relationships.

II-E Positioning of the Proposed Method

The proposed method also assumes LiNGAM (1). Like SADA and CAPA, the proposed method begins by grouping variables into multiple subsets based on the ancestral relationships between variables. The ancestral relationships are estimated in the same way as in RCD. When the true causal DAG is connected, the proposed method defines the variable grouping by the maximal elements in the family of the unions of each variable and its ancestors. Thus, if the ancestral relationships are correctly estimated, variables are partitioned into as many groups as the number of sink nodes in the true causal DAG. The proposed method groups finely even for DAGs with high outdegree and low indegree. The proposed method's time complexity for variable grouping is $O(p^{3})$, which is the same order as CAPA with $\sigma_{\max}=1$.

When the ancestral relationships among the variables are correctly estimated, the marginal model for variables in each group obtained by the proposed method is the LiNGAM defined by the sub-DAG induced by the variables in each group.

In RCD, the estimated ancestral relationships are used to estimate parent-child relationships among variables. In contrast, the estimated ancestral relationships are only used to group variables in the proposed method. For each group, DirectLiNGAM is applied to estimate sub-DAGs, which are then merged to estimate the entire causal DAG.

Section III provides a more comprehensive description of our proposed method. Section IV compares the performance of the proposed method, the original DirectLiNGAM, CAPA, and RCD in the absence of latent confounders by computer experiments.

III Causal Ancestral-Relationship-Based Grouping (CAG)

This section describes the proposed algorithm in detail. Let $X=(x_{1},\ldots,x_{p})^{\top}$ be a $p$-dimensional random vector. Let $G=(X,E)$ be the true causal DAG, where $E\subset X\times X$ is the set of edges in $G$. We assume that $X$ follows LiNGAM (1) without latent confounders. Let $Anc_{i}$ and $Pa_{i}$ be the sets of ancestors and parents of $x_{i}$ in $G$, respectively. $CA_{ij}:=Anc_{i}\cap Anc_{j}$ is the set of common ancestors of $x_{i}$ and $x_{j}$ in $G$.

III-A Finding the Ancestral Relationships Between the Variables

This subsection summarizes the algorithm of RCD [19] for finding the ancestral relationships between the variables in $X$. To determine the ancestral relationship between $x_{i}$ and $x_{j}$, $i\neq j$, Maeda and Shimizu [19] considered the simple regression models

$x_{i}=\beta_{ij}x_{j}+u_{i},$  (3)
$x_{j}=\beta_{ji}x_{i}+u_{j},$

where $u_{i}$ and $u_{j}$ are error terms. Maeda and Shimizu [19] focused on the independence relationship between the independent variable and the error term in each model. The following proposition holds for the ancestral relationship between $x_{i}$ and $x_{j}$.

Proposition 1 (Maeda and Shimizu [19]).

One of the following four conditions holds for the ancestral relationship between $x_{i}$ and $x_{j}$.

  1. If $x_{i}\perp\!\!\!\!\perp x_{j}$, then $x_{i}\notin Anc_{j}\wedge x_{j}\notin Anc_{i}$.

  2. If $x_{j}\perp\!\!\!\!\perp u_{i}$, then $x_{j}\in Anc_{i}$.

  3. If $x_{i}\perp\!\!\!\!\perp u_{j}$, then $x_{i}\in Anc_{j}$.

  4. If $x_{i}\not\perp\!\!\!\!\perp u_{j}\wedge x_{j}\not\perp\!\!\!\!\perp u_{i}$, then $CA_{ij}\neq\emptyset$.

This algorithm first checks which of conditions 1 through 4 in Proposition 1 holds for each variable pair $(x_{i},x_{j})$. For implementation, the error terms $u_{i}$ and $u_{j}$ are replaced with the OLS residuals. If $(x_{i},x_{j})$ satisfies condition 4, the determination of the ancestral relationship between $(x_{i},x_{j})$ is withheld. Let $R$ be the set of variable pairs $(x_{i},x_{j})$ that satisfy condition 4. Let $CA^{*}_{ij}$ be the set of common ancestors of $(x_{i},x_{j})$ found while checking Proposition 1 for all pairs $(x_{i},x_{j})$.
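For a single pair, this check can be sketched in Python as follows (our sketch, under the assumption that an unconditional independence test indep(a, b), e.g., HSIC- or kernel-based, is available and returns True when independence is accepted); a regressor that is independent of the OLS residual marks it as the upstream variable, and dependence in both directions corresponds to condition 4.

    import numpy as np

    def pairwise_ancestor_check(xi, xj, indep):
        # xi, xj: 1-D sample arrays of the two variables.
        if indep(xi, xj):
            return "none"                                 # condition 1
        xi_c, xj_c = xi - xi.mean(), xj - xj.mean()
        ui = xi_c - (xi_c @ xj_c / (xj_c @ xj_c)) * xj_c  # residual of x_i on x_j
        uj = xj_c - (xi_c @ xj_c / (xi_c @ xi_c)) * xi_c  # residual of x_j on x_i
        if indep(xj_c, ui):
            return "x_j is an ancestor of x_i"            # condition 2
        if indep(xi_c, uj):
            return "x_i is an ancestor of x_j"            # condition 3
        return "withheld (common ancestors suspected)"    # condition 4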

Assume that $(x_{i},x_{j})\in R$. Then, $CA_{ij}^{*}$ forms part of a backdoor path between $x_{i}$ and $x_{j}$. To remove the influence of $CA^{*}_{ij}$ from $x_{i}$ and $x_{j}$, consider the regression models

$x_{i}=\bm{\alpha}_{ij}^{\top}\cdot CA^{*}_{ij}+v_{i},$  (4)
$x_{j}=\bm{\alpha}_{ji}^{\top}\cdot CA^{*}_{ij}+v_{j},$

where we identify $CA^{*}_{ij}$ with the vector of common ancestors of $(x_{i},x_{j})$. Furthermore, consider the following regression models for the error terms $v_{i}$ and $v_{j}$:

$v_{i}=\beta_{ij}v_{j}+u_{i},$  (5)
$v_{j}=\beta_{ji}v_{i}+u_{j}.$

Then Proposition 1 is generalized as follows.

Proposition 2 (Maeda and Shimizu [19]).

One of the following four conditions holds for the ancestral relationship between $(x_{i},x_{j})\in R$.

  1. If $v_{i}\perp\!\!\!\!\perp v_{j}$, then $x_{i}\notin Anc_{j}\wedge x_{j}\notin Anc_{i}$.

  2. If $v_{j}\perp\!\!\!\!\perp u_{i}$, then $x_{j}\in Anc_{i}$.

  3. If $v_{i}\perp\!\!\!\!\perp u_{j}$, then $x_{i}\in Anc_{j}$.

  4. If $v_{i}\not\perp\!\!\!\!\perp u_{j}\wedge v_{j}\not\perp\!\!\!\!\perp u_{i}$, then $CA_{ij}\setminus CA^{*}_{ij}\neq\emptyset$.

Proposition 1 is the case where $CA_{ij}=\emptyset$. For implementation, the error terms $v$ and $u$ are replaced with OLS residuals. If $(x_{i},x_{j})$ satisfies condition 4, the determination of the ancestral relationship between $x_{i}$ and $x_{j}$ is withheld. After checking Proposition 2 for all $(x_{i},x_{j})\in R$, update $CA^{*}_{ij}$ and $R$. If $R\neq\emptyset$, recheck Proposition 2. Theoretically, if the model contains no latent confounders, the procedure in Proposition 2 can be repeated until $R=\emptyset$ to completely determine the ancestral relationships of all pairs $(x_{i},x_{j})$.

The pseudo-code of this algorithm is described in Algorithm V-A in Appendix A. Since linearity is assumed between the variables, Pearson's correlation test [20] can be used to test the marginal independences $x_{i}\perp\!\!\!\!\perp x_{j}$ and $v_{i}\perp\!\!\!\!\perp v_{j}$. For testing the independence between regressors and residuals, Shimizu and Maeda [19, 21] used the Hilbert-Schmidt independence criterion (HSIC) [22]. This paper instead uses the kernel-based conditional independence (KCI) test [23] because the accuracy of the test remains comparable while the computation time is shorter.

Simply checking the conditions of Proposition 1 or Proposition 2 may yield ancestor relationships such that the estimated graph contains directed cycles due to errors in testing. The proposed method adds a heuristic procedure to avoid directed cycles. See Algorithm V-A in the Appendix for details.

III-B Grouping Variables and Merging Results

This subsection introduces a procedure for grouping XX into several subsets according to the estimated ancestral relationships among variables and merging the estimated causal DAGs for each group.

Let $\widehat{Anc}_{i}$ be the estimated ancestor set of $x_{i}$. Define ${\cal V}$ as the output of the following procedure.

  1. Let ${\cal V}_{0}$ be the family of maximal sets in $\left\{\{x_{i}\}\cup\widehat{Anc}_{i},\; i=1,\ldots,p\right\}$.

  2a. If $|\bm{v}|>1$ for all $\bm{v}\in{\cal V}_{0}$, then ${\cal V}\leftarrow{\cal V}_{0}$.

  2b. Otherwise, ${\cal V}\leftarrow{\cal V}_{0}$, and for all $\bm{v}\in{\cal V}_{0}$ with $|\bm{v}|=1$,
      ${\cal V}\leftarrow({\cal V}\setminus\{\bm{v}\})\cup\left\{\bm{v}\cup\bm{v}^{\prime}:\bm{v}^{\prime}\in{\cal V}_{0},\,\bm{v}^{\prime}\neq\bm{v}\right\}$.

  3. Return ${\cal V}$.

Definition 1.

Define ${\cal V}$ as the grouping of $X$ obtained from the ancestral relationships among $X$.

We call the series of procedures for obtaining ${\cal V}$ the causal ancestral-relationship-based grouping (CAG). If the ancestral relationships among $X$ are correctly estimated and ${\cal V}={\cal V}_{0}$, each element of ${\cal V}$ is the union of a sink node in $G$ and its ancestors. We also note that $\bigcup_{\bm{v}\in{\cal V}}\bm{v}=X$. Now, we have the following theorem.
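A small Python sketch (ours) of Definition 1, assuming the estimated ancestor sets are given as a dict of index sets, is shown below; it computes the maximal sets of ${\cal V}_{0}$ and applies the merging rule 2b to singleton groups.

    def cag_grouping(anc_hat):
        # anc_hat: dict mapping each variable index i to its estimated ancestor set.
        candidates = [frozenset({i}) | frozenset(anc) for i, anc in anc_hat.items()]
        # V0: maximal elements of {x_i} union Anc_i under set inclusion.
        v0 = [s for s in candidates if not any(s < t for t in candidates)]
        v0 = list(dict.fromkeys(v0))                 # drop duplicates
        if all(len(v) > 1 for v in v0):
            return v0                                # rule 2a
        # Rule 2b: merge each singleton group with every other group in V0.
        groups = [v for v in v0 if len(v) > 1]
        for v in (v for v in v0 if len(v) == 1):
            groups.extend(v | w for w in v0 if w != v)
        return list(dict.fromkeys(groups))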

Theorem 1.

Assume that we know the correct ancestral relationships between variables. Define $G_{\bm{v}}$ to be the sub-DAG of $G$ induced by $\bm{v}\in{\cal V}$. Then, the marginal model for $\bm{v}$ is the LiNGAM defined by $G_{\bm{v}}$.

Proof.

The probability density function of $X$ is written as

$p(\bm{x})=\prod_{i=1}^{p}p(x_{i}\mid Pa_{i})=\prod_{i:x_{i}\in\bm{v}}p(x_{i}\mid Pa_{i})\cdot\prod_{j:x_{j}\notin\bm{v}}p(x_{j}\mid Pa_{j}).$

We note that when $x_{i}\in\bm{v}$, $Pa_{i}$ does not include any variable $x_{j}\notin\bm{v}$. By integrating $p(\bm{x})$ out according to the reverse causal order of $X\setminus\bm{v}$, we have

$p(\bm{v})=\prod_{i:x_{i}\in\bm{v}}p(x_{i}\mid Pa_{i}),$

which is the LiNGAM defined by $G_{\bm{v}}$. ∎

Theorem 1 claims that we can consistently estimate the sub-DAGs $G_{\bm{v}}$ for $\bm{v}\in{\cal V}$ by applying a causal structure learning algorithm such as DirectLiNGAM to $\bm{v}$. Once we obtain the estimates $\hat{G}_{\bm{v}}=(\bm{v},E_{\bm{v}})$, where $E_{\bm{v}}=(\bm{v}\times\bm{v})\cap E$, the entire causal graph $G$ can be estimated by $\hat{G}=(X,\bigcup_{\bm{v}\in{\cal V}}E_{\bm{v}})$.
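Assuming a grouping like the one produced above, the per-group estimation and merging step can be sketched as follows; the sketch uses the DirectLiNGAM implementation in the open-source lingam Python package, which is our choice of tooling rather than a requirement of the method, and simply lets later groups overwrite earlier ones on shared variable pairs (cycle handling is discussed below).

    import numpy as np
    import lingam

    def fit_and_merge(X, groups):
        # X: (n, p) data matrix; groups: list of iterables of column indices.
        p = X.shape[1]
        merged = np.zeros((p, p))        # merged[i, j] != 0 means edge x_j -> x_i
        for group in groups:
            idx = sorted(group)
            model = lingam.DirectLiNGAM()
            model.fit(X[:, idx])         # estimate the sub-DAG for this group
            sub = model.adjacency_matrix_
            for a, i in enumerate(idx):
                for b, j in enumerate(idx):
                    if sub[a, b] != 0:
                        merged[i, j] = sub[a, b]
        return merged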

When the sample size is small, many ancestral relationships cannot be detected due to the type II errors of the CI tests in CAG, so there may be many variables whose estimated ancestor set is empty, resulting in many groups consisting of a single variable in ${\cal V}_{0}$. Using such a grouping would make the causal DAG estimates too sparse. Therefore, each group consisting of a single variable in ${\cal V}_{0}$ is merged with the other groups according to procedure 2b above. Note that Theorem 1 holds even if ${\cal V}_{0}$ contains a group consisting of one variable.

Table III presents the correct $Anc_{i}$ and $\bm{v}_{i}$ for $i=1,\ldots,9$ of the DAG in Fig. 1. In this case, we can easily see that ${\cal V}_{0}={\cal V}=\{\bm{v}_{3},\bm{v}_{5},\bm{v}_{6},\bm{v}_{9}\}$. By applying the causal structure learning algorithm to each element of ${\cal V}$, we can consistently estimate the sub-DAGs as shown in Fig. 4. By merging the sub-DAGs as $(X,E_{3}\cup E_{5}\cup E_{6}\cup E_{9})$, we can obtain the true causal DAG in Fig. 1.

TABLE III: $Anc_{i}$ and $\bm{v}_{i}$ of the DAG in Fig. 1.
Variable | $Anc_{i}$ | $\bm{v}_{i}$
$x_{1}$ | $\emptyset$ | $x_{1}$
$x_{2}$ | $x_{1}$ | $x_{1},x_{2}$
$x_{3}$ | $x_{1},x_{2}$ | $x_{1},x_{2},x_{3}$
$x_{4}$ | $x_{1},x_{7}$ | $x_{1},x_{4},x_{7}$
$x_{5}$ | $x_{1},x_{4},x_{7}$ | $x_{1},x_{4},x_{5},x_{7}$
$x_{6}$ | $x_{1},x_{4},x_{7}$ | $x_{1},x_{4},x_{6},x_{7}$
$x_{7}$ | $\emptyset$ | $x_{7}$
$x_{8}$ | $x_{7}$ | $x_{7},x_{8}$
$x_{9}$ | $x_{7},x_{8}$ | $x_{7},x_{8},x_{9}$
Figure 4: An example of the process of the CAG

In the same way as in SADA, the estimated entire DAG may contain cycles due to errors in learning the DAGs for each group, even if the correct ancestral relationships are estimated. The proposed method applies topological sorting to the merged graph to check whether a directed cycle exists. In SADA, the significance of each coefficient $b_{ij}$ for all edges in a cycle is assessed by the Wald test, and the least significant edge in the cycle is removed. In the proposed method, since the ancestral relationships between variables are estimated, directed cycles can also be eliminated by removing edges in the cycle that contradict the estimated ancestral relationships. Another possible approach is to remove the edge in the cycle with the smallest absolute coefficient $|b_{ij}|$. The computer experiments in Section IV compare the performance of these methods for removing cycles.
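A possible implementation (ours) of the ancestral-relationship-based elimination is sketched below: a directed cycle is located by depth-first search, and an edge in the cycle that contradicts the estimated ancestor sets is deleted; the Wald-test and smallest-$|b_{ij}|$ variants differ only in how the edge to delete is chosen.

    def find_cycle(adj):
        # adj: dict mapping each node to the set of its children.
        # Returns the edges of one directed cycle, or None if the graph is acyclic.
        WHITE, GRAY, BLACK = 0, 1, 2
        color = {v: WHITE for v in adj}
        stack = []

        def dfs(v):
            color[v] = GRAY
            stack.append(v)
            for w in adj[v]:
                if color[w] == GRAY:                    # back edge: cycle found
                    cyc = stack[stack.index(w):] + [w]
                    return list(zip(cyc[:-1], cyc[1:]))
                if color[w] == WHITE:
                    found = dfs(w)
                    if found:
                        return found
            color[v] = BLACK
            stack.pop()
            return None

        for v in adj:
            if color[v] == WHITE:
                cycle = dfs(v)
                if cycle:
                    return cycle
        return None

    def remove_cycles_by_ancestry(adj, anc_hat):
        # Delete from each cycle an edge u -> v such that u is not in the
        # estimated ancestor set of v; fall back to an arbitrary cycle edge.
        cycle = find_cycle(adj)
        while cycle:
            bad = [(u, v) for (u, v) in cycle if u not in anc_hat[v]]
            u, v = bad[0] if bad else cycle[0]
            adj[u].discard(v)
            cycle = find_cycle(adj)
        return adj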

III-C Time Complexity of Variable Grouping and Minimum Required Sample Size

It suffices to focus on the number of CI tests to evaluate the time complexity of variable grouping by CAG. The case that requires the most CI tests is when the true causal DAG is fully connected, i.e., $Anc_{i}=Pa_{i}$ for all $x_{i}\in X$. In that case, since the proposed algorithm performs CI tests while removing source nodes one by one from the DAG, the number of required CI tests is

$\binom{p}{2}+\binom{p-1}{2}+\cdots+\binom{2}{2}=O(p^{3}).$
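For completeness, this sum can be evaluated in closed form via the hockey-stick identity, which yields the cubic bound directly:

$\sum_{k=2}^{p}\binom{k}{2}=\binom{p+1}{3}=\frac{(p+1)p(p-1)}{6}=O(p^{3}).$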

As mentioned in Section II-A, the time complexity of DirectLiNGAM is $O(np^{3}M^{2}+p^{4}M^{3})$. Even allowing for the time complexity of grouping variables, if $n$ is small relative to $p$, CAG is expected to reduce the overall computation time compared to the original DirectLiNGAM without variable grouping.

Let $n_{\max}$ be the largest cardinality of the groups $\bm{v}\in{\cal V}$. Then, the required sample size is reduced to $n_{\max}+1$.

IV Computer experiments

This section details the results of computer experiments to compare the estimation accuracy and computation time of CAG with the original DirectLiNGAM without variable grouping, RCD in the absence of latent confounders, and CAPA with $\sigma_{\max}=0,1,2$. In these experiments, CAG and CAPA used DirectLiNGAM as the causal structure learning algorithm for each variable group. In the following, we refer to these procedures as CAG-LiNGAM and CAPA-LiNGAM, respectively. The accuracy of CAG-LiNGAM using the true ancestral relationships was also computed for reference. As mentioned in Section II-D, RCD can also be applied to models with latent confounders. This paper assumes that no latent confounder exists, so in the experiments for RCD, the procedure for determining latent confounders was omitted, and only the ancestral and parent-child relationships among variables were estimated.

IV-A Experimental Settings

The number of variables $p$ and the number of edges $|E|$ were set to $(p,|E|)=(10,5)$, $(10,9)$, $(20,19)$, $(30,29)$, and $(40,39)$. The sample sizes $n$ were set to $11, 25, 50, 100$, and $500$.

The error terms $e_{i}$ were generated from the uniform distribution $U(-1,1)$. The true causal DAGs were randomly generated with $(p,|E|)$ fixed at each iteration. In these experiments, the true causal DAGs were assumed to be sparse, and the indegrees of all vertices were set to at most one. The nonzero elements of the coefficient matrix $B$ were randomly generated from the uniform distribution $U([-1,-0.5]\cup[0.5,1])$ at each iteration. The number of iterations was set to 100.

Let $|E_{c}|$ be the number of correctly detected edges in the 100 estimated DAGs. Denote the numbers of redundant and missing edges in the 100 estimated DAGs by $|E_{r}|$ and $|E_{m}|$, respectively. Note that if an edge in an estimated DAG is oriented in the wrong direction, it adds one to both $|E_{r}|$ and $|E_{m}|$. We evaluated the performance of CAG-LiNGAM and the other methods using the following indices (e.g., Zhu et al. [24]); a small computation sketch follows the list.

  1. The precision $Pre:=\frac{|E_{c}|}{|E_{c}|+|E_{r}|}$ is the ratio of correctly estimated edges among the edges in the estimated DAGs;

  2. The recall $Rec:=\frac{|E_{c}|}{|E_{c}|+|E_{m}|}=\frac{|E_{c}|}{|E|}$ is the ratio of correctly estimated edges among the edges in the true DAGs;

  3. The $F$-measure $F:=\frac{2\times Pre\times Rec}{Pre+Rec}$ is the harmonic mean of the precision and the recall;

  4. $Time$ is the running time in seconds for estimating 100 DAGs.
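A small Python sketch (ours) of how these indices can be computed for a single estimated DAG is given below; the paper aggregates the corresponding edge counts over the 100 replications before computing the ratios. A wrongly oriented edge is counted as both redundant and missing, as described above.

    import numpy as np

    def edge_metrics(B_true, B_hat):
        # B_true, B_hat: adjacency matrices with B[i, j] != 0 meaning x_j -> x_i.
        true_edges = set(zip(*np.nonzero(B_true)))
        est_edges = set(zip(*np.nonzero(B_hat)))
        correct = len(true_edges & est_edges)
        redundant = len(est_edges - true_edges)   # includes wrongly oriented edges
        missing = len(true_edges - est_edges)     # includes wrongly oriented edges
        pre = correct / (correct + redundant) if correct + redundant else 0.0
        rec = correct / (correct + missing) if correct + missing else 0.0
        f = 2 * pre * rec / (pre + rec) if pre + rec else 0.0
        return pre, rec, f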

Let $\alpha_{P}$ be the significance level of Pearson's correlation test in CAG and RCD. Let $\alpha_{CI}$ be the significance level of the KCI test in CAG, RCD, and CAPA. $\alpha_{P}$ and $\alpha_{CI}$ were set to

$\alpha_{P}\in\{0.01,0.05,0.1,0.5\}$,
$\alpha_{CI}\in\{0.001,0.01,0.1,0.2,0.3,0.4,0.5\}$,

respectively.

Experiments for p=40p=40 were conducted on a machine with a 3.0 GHz Intel Core i9 processor and 256 GB memory, and experiments for p=30p=30 were performed on a machine with a 3.0 GHz Intel Core i9 processor and 128 GB memory. The other experiments were conducted on a machine with a 2.1 GHz Dual Xeon processor and 96 GB memory.

IV-B Experimental Results and Discussion

TABLE IV: Performances of CAG, DirectLiNGAM, RCD and CAPA
($p$, $|E|$, $n$) | $Pre$ | $Rec$
Column order within $Pre$ and $Rec$: CAG (W, Anc, abs), $\mathrm{CAG}^{*}$ (Anc), DLi, RCD (W), CAPA (0, 1, 2)
(10, 5, 11) 0.210 0.221 0.213 0.507 0.122 0.268 0.209 0.166 0.193 0.434 0.396 0.446 0.448 0.374 0.128 0.197 0.209 0.194
(10, 5, 25) 0.550 0.543 0.546 0.720 0.311 0.602 0.493 0.597 0.612 0.576 0.578 0.578 0.736 0.576 0.396 0.483 0.420 0.408
(10, 5, 50) 0.773 0.770 0.773 0.850 0.521 0.851 0.725 0.735 0.779 0.764 0.764 0.764 0.928 0.868 0.592 0.696 0.723 0.706
(10, 5, 100) 0.891 0.889 0.891 0.917 0.693 0.967 0.786 0.877 0.896 0.896 0.896 0.896 0.996 0.986 0.826 0.886 0.860 0.840
(10, 5, 500) 0.971 0.971 0.971 0.963 0.813 1.000 0.861 0.901 0.906 0.998 0.998 0.998 1.000 1.000 0.994 0.945 0.956 0.960
(10, 9, 11) 0.251 0.218 0.244 0.396 0.179 0.281 0.214 0.208 0.250 0.431 0.487 0.452 0.483 0.387 0.147 0.207 0.220 0.193
(10, 9, 25) 0.369 0.368 0.350 0.563 0.339 0.452 0.388 0.389 0.429 0.693 0.486 0.727 0.729 0.600 0.301 0.418 0.447 0.426
(10, 9, 50) 0.621 0.599 0.609 0.686 0.482 0.644 0.526 0.523 0.548 0.643 0.650 0.646 0.899 0.818 0.521 0.634 0.651 0.643
(10, 9, 100) 0.779 0.768 0.779 0.834 0.685 0.914 0.680 0.686 0.682 0.851 0.860 0.859 0.986 0.974 0.758 0.845 0.836 0.839
(10, 9, 500) 0.918 0.917 0.918 0.877 0.785 0.997 0.785 0.773 0.774 0.999 0.999 0.999 1.000 1.000 0.999 0.929 0.930 0.926
(20, 19, 25) 0.313 0.319 0.267 0.435 0.172 0.327 0.235 0.410 0.468 0.601 0.423 0.642 0.665 0.496 0.284 0.287 0.220 0.238
(20, 19, 50) 0.528 0.484 0.497 0.625 0.354 0.549 0.447 0.454 0.616 0.662 0.640 0.676 0.876 0.721 0.514 0.561 0.527 0.448
(20, 19, 100) 0.759 0.723 0.751 0.780 0.576 0.837 0.610 0.594 0.681 0.776 0.780 0.778 0.984 0.943 0.729 0.779 0.776 0.685
(20, 19, 500) 0.897 0.894 0.897 0.844 0.725 0.994 0.725 0.718 0.712 0.997 0.997 0.997 1.000 1.000 0.996 0.899 0.895 0.892
(30, 29, 50) 0.535 0.445 0.489 0.655 0.276 0.475 0.374 0.527 0.631 0.618 0.616 0.633 0.855 0.588 0.479 0.438 0.390 0.435
(30, 29, 100) 0.725 0.701 0.673 0.878 0.587 0.813 0.649 0.649 0.676 0.725 0.701 0.853 0.878 0.587 0.813 0.649 0.649 0.676
(30, 29, 500) 0.969 0.967 0.970 0.976 0.919 0.998 0.919 0.916 0.914 0.998 0.998 0.998 1.000 1.000 0.996 0.970 0.968 0.969
(40, 39, 50) 0.464 0.423 0.466 0.625 0.211 0.459 0.311 0.480 0.712 0.800 0.595 0.624 0.839 0.536 0.469 0.381 0.364 0.374
(40, 39, 100) 0.696 0.640 0.652 0.863 0.477 0.792 0.573 0.576 0.679 0.835 0.741 0.849 0.973 0.809 0.686 0.690 0.680 0.635
(40, 39, 500) 0.952 0.944 0.951 0.970 0.917 0.991 0.917 0.914 0.915 0.998 0.998 0.998 1.000 1.000 0.995 0.966 0.966 0.966
($p$, $|E|$, $n$) | $F$ | $Time$ (s)
Column order within $F$ and $Time$: CAG (W, Anc, abs), $\mathrm{CAG}^{*}$ (Anc), DLi, RCD (W), CAPA (0, 1, 2)
(10, 5, 11) 0.283 0.284 0.288 0.476 0.184 0.173 0.203 0.185 0.194 23.0 17.8 21.3 5.0 29.0 6.7 28.2 99.5 210.7
(10, 5, 25) 0.563 0.560 0.562 0.728 0.404 0.478 0.488 0.493 0.489 13.5 13.4 13.4 5.0 27.5 8.0 32.2 55.0 57.1
(10, 5, 50) 0.769 0.767 0.769 0.887 0.651 0.698 0.710 0.729 0.741 13.0 13.0 13.0 5.1 72.1 8.1 32.8 75.1 84.8
(10, 5, 100) 0.893 0.892 0.893 0.955 0.814 0.891 0.833 0.868 0.867 154.6 154.4 154.5 48.9 123.6 123.4 165.2 371.7 502.2
(10, 5, 500) 0.984 0.984 0.984 0.981 0.897 0.997 0.901 0.928 0.932 330.8 330.7 330.8 65.1 151.8 480.5 459.5 1867.9 2529.8
(10, 9, 11) 0.317 0.301 0.317 0.435 0.244 0.193 0.210 0.214 0.218 39.4 23.4 23.5 12.3 28.4 10.2 39.0 112.3 250.8
(10, 9, 25) 0.482 0.419 0.472 0.635 0.433 0.361 0.403 0.416 0.428 34.1 36.5 25.0 11.3 28.9 15.2 44.2 110.8 291.0
(10, 9, 50) 0.632 0.623 0.627 0.778 0.607 0.576 0.575 0.580 0.591 22.8 21.7 21.7 12.3 71.8 18.6 80.3 187.3 370.5
(10, 9, 100) 0.814 0.811 0.817 0.904 0.805 0.829 0.753 0.753 0.752 338.9 334.6 334.6 117.1 120.7 312.3 255.9 1420.2 6078.1
(10, 9, 500) 0.957 0.956 0.957 0.935 0.879 0.998 0.851 0.844 0.843 1062.1 1061.6 1061.6 154.4 155.5 1332.3 425.2 6833.6 19649.6
(20, 19, 25) 0.411 0.364 0.378 0.526 0.255 0.304 0.259 0.286 0.316 520.0 47.2 101.7 29.2 172.3 34.4 524.3 713.6 1044.2
(20, 19, 50) 0.588 0.551 0.573 0.729 0.474 0.531 0.498 0.488 0.518 93.2 57.0 72.2 29.3 323.9 48.6 573.4 926.4 1658.1
(20, 19, 100) 0.768 0.750 0.764 0.870 0.715 0.779 0.684 0.673 0.683 1111.4 1015.5 1015.7 337.2 417.3 991.4 1418.6 10457.6 39708.9
(20, 19, 500) 0.944 0.943 0.944 0.916 0.840 0.995 0.802 0.797 0.792 3918.8 3906.2 3906.3 421.7 542.6 4757.2 1471.6 33655.4 166854.1
(30, 29, 50) 0.573 0.516 0.552 0.742 0.376 0.477 0.403 0.449 0.515 196.6 92.7 129.2 40.2 554.7 65.8 1077.6 1962.0 5290.4
(30, 29, 100) 0.782 0.726 0.753 0.925 0.703 0.747 0.704 0.703 0.709 1381.5 1138.7 1215.6 397.6 604.9 1287.1 1090.2 13253.2 117849.3
(30, 29, 500) 0.983 0.982 0.984 0.988 0.958 0.997 0.943 0.941 0.941 5732.6 5731.6 5731.7 444.0 1449.6 9536.9 3540.0 45572.6 341942.2
(40, 39, 50) 0.588 0.494 0.533 0.716 0.302 0.464 0.343 0.414 0.490 4233.4 192.9 258.3 56.9 962.4 153.3 1920.0 4387.8 6462.0
(40, 39, 100) 0.759 0.687 0.737 0.914 0.600 0.735 0.626 0.624 0.656 2931.3 2527.1 2620.8 604.1 1001.2 2725.9 1740.4 35769.3 352156.9
(40, 39, 500) 0.975 0.970 0.974 0.985 0.956 0.993 0.941 0.939 0.940 10234.1 10225.4 10225.7 706.8 2610.9 16099.2 6221.7 94989.2 779946.2

Table IV shows $Pre$, $Rec$, $F$, and $Time$ for each method for some $(p,|E|,n)$. The experiments for CAG-LiNGAM and RCD were conducted with $4\times 7$ combinations of significance levels of Pearson's correlation test and the KCI test for each $(p,|E|,n)$. The experiments for CAPA-LiNGAM were conducted with 7 significance levels of the KCI test for each $(p,|E|,n)$. Table IV shows only the results of these methods for the significance level that maximizes the $F$-measure.

In Table IV, CAG represents CAG-LiNGAM, while $\mathrm{CAG}^{*}$ refers to CAG-LiNGAM using the true ancestral relationships. DLi stands for the original DirectLiNGAM without grouping variables. W, Anc, and abs under CAG and RCD represent cycle elimination methods based on the Wald test, the estimated ancestral relationships, and the absolute value of the coefficient, respectively. The numbers 0, 1, and 2 under CAPA are the $\sigma_{\max}$ values. In the table, the bold numbers highlight the best performance in each experimental setting $(p,|E|,n)$, excluding $\mathrm{CAG}^{*}$.

As seen from Table IV, CAG-LiNGAM outperforms the original DirectLiNGAM without the variable grouping, RCD, and CAPA-LiNGAM in terms of the $F$-measure and the recall when the sample size is small relative to the number of variables. Although CAG requires $O(p^{3})$ time complexity to group variables, CAG-LiNGAM often shows shorter computation time than the original DirectLiNGAM when the sample size is small. When the sample size is large, CAG-LiNGAM takes longer computation time than the original DirectLiNGAM, but CAG-LiNGAM is superior to the original DirectLiNGAM in terms of precision and the $F$-measure.

CAG with Wald tests takes longer computation time but is more accurate in terms of the $F$-measure than CAG with Anc and abs. As the sample size increases, the accuracy of estimating ancestral relationships and of DirectLiNGAM increases, and hence the frequency of cycles appearing in the estimated DAG decreases, so the difference in estimation accuracy between the three CAG variants becomes smaller.

When the sample size is small, CAPA-LiNGAM may show higher precision than CAG-LiNGAM. However, the recall and the $F$-measure are lower in many cases. This may be because, when the sample size is small, it is difficult for CAPA to detect conditional independence relationships due to the type II error of the CI tests. Then, $G^{\sigma}$ becomes overly sparse, resulting in overly sparse causal DAG estimates.

When $\sigma_{\max}$ is 0 or 1, the time complexity of the variable grouping of CAPA is less than or equal to that of CAG. However, the computation time for CAPA-LiNGAM is not much different from that for CAG-LiNGAM, or CAG-LiNGAM is even faster than CAPA-LiNGAM. Besides, CAG-LiNGAM outperforms CAPA-LiNGAM in terms of the $F$-measure.

It is noteworthy that RCD shows a higher $F$-measure than CAG-LiNGAM when the sample size is large, although it takes more computation time. Conversely, when the sample size is small, CAG-LiNGAM shows a higher $F$-measure than RCD, although it takes more computation time.

The original DirectLiNGAM shows lower precision than divide-and-conquer algorithms. This may be because the original DirectLiNGAM does not group variables, leading to redundant edges across different variable groups. In general, DAGs estimated by DirectLiNGAM tend to have more redundant edges.

Even if the true causal DAG is sparse, the subgraph induced by each variable group becomes relatively denser. In the case of directed trees, the ratio of the number of edges to the number of variable pairs is $(p-1)/\binom{p}{2}=2/p$. This means that the smaller the number of variables in each group is, the denser the sub-DAG induced by each group is. When variables are correctly grouped into smaller subsets, the true sub-DAGs induced by each group become denser, and there are no edges across groups. Therefore, if ancestral relationships can be accurately estimated, applying DirectLiNGAM to each group is expected to avoid more redundant edges than applying it to the entire variable set.

The CAG uses the estimated ancestral relationships for grouping variables but does not use them to estimate the causal DAG for each group. However, since ancestral relationships partially define the structure of the causal DAG, information on ancestral relationships could be used to estimate causal DAGs. The RCD can be interpreted as using information on ancestral relationships to estimate causal DAGs.

Figure 5: $Anc_{rec}$ with Wald tests for some sample sizes

The lines in {red, blue, orange, black, pink} represent $(p,|E|)=(10,5),(10,9),(20,19),(30,29),(40,39)$, respectively.

Let $A_{c}$ and $A_{m}$ be the numbers of correctly detected and missing ancestors in the estimated ancestor lists $\widehat{Anc}_{i}$, $i=1,\ldots,p$, respectively. Define $Anc_{rec}$ as the recall of the list of ancestors,

$Anc_{rec}=\frac{A_{c}}{A_{c}+A_{m}}.$

Figure 5 plots $Anc_{rec}$ of the ancestor list for some $(p,|E|,n)$ when using the significance levels of the CI tests that maximize the $F$-measure of CAG-LiNGAM with Wald tests. This figure shows that when the sample size is small, the accuracy of the estimated ancestor lists is quite low. When $(p,|E|,n)=(10,5,11),(10,9,11),(10,9,25),(20,19,25),(40,39,50)$, $Anc_{rec}$ is exactly zero. However, Table IV shows that the accuracy of CAG-LiNGAM is not that bad. This fact suggests that applying DirectLiNGAM to each group produced by CAG may correct errors in the estimated ancestral relationships within each group.

Table V shows $Anc_{rec}$ computed from the ancestral relationships estimated by CAG and $Anc_{rec}$ computed from the estimated DAGs in the experiments. We can see that $Anc_{rec}$ is significantly improved by applying DirectLiNGAM to each group.

TABLE V: $Anc_{rec}$ before and after applying DirectLiNGAM
$(p,|E|,n)$ | Before: W, Anc, abs | After: W, Anc, abs
(10, 5, 11) 0 0.049 0 0.438 0.405 0.466
(10, 5, 25) 0.300 0.300 0.300 0.519 0.525 0.524
(10, 5, 50) 0.513 0.513 0.513 0.677 0.679 0.679
(10, 5, 100) 0.723 0.723 0.723 0.852 0.852 0.852
(10, 5, 500) 0.919 0.919 0.919 0.998 0.998 0.998
(10, 9, 11) 0 0 0 0.404 0.568 0.447
(10, 9, 25) 0 0.323 0 0.648 0.483 0.701
(10, 9, 50) 0.346 0.346 0.346 0.540 0.557 0.547
(10, 9, 100) 0.514 0.514 0.514 0.791 0.817 0.805
(10, 9, 500) 0.713 0.713 0.713 0.999 0.999 0.999
(20, 19, 25) 0.001 0.177 0.001 0.514 0.366 0.577
(20, 19, 50) 0.320 0.289 0.320 0.545 0.558 0.594
(20, 19, 100) 0.359 0.359 0.359 0.638 0.669 0.651
(20, 19, 500) 0.596 0.596 0.596 0.993 0.998 0.998
(30, 29, 50) 0.268 0.241 0.268 0.450 0.562 0.515
(30, 29, 100) 0.371 0.291 0.371 0.745 0.636 0.782
(30, 29, 500) 0.515 0.515 0.515 0.993 0.997 0.993
(40, 39, 50) 0.001 0.222 0.225 0.627 0.546 0.508
(40, 39, 100) 0.356 0.276 0.356 0.697 0.624 0.785
(40, 39, 500) 0.470 0.470 0.470 0.997 0.994 0.997

Table VI summarizes the following six indices calculated from CAG's experimental results when $p=10$.

  • Improvement ($Imp$): the number of ancestral relationships in the output DAGs that DirectLiNGAM corrects

  • Worsening ($Wor$): the number of ancestral relationships in the output DAGs that DirectLiNGAM worsens

  • Errors in estimation ($Err$): the number of ancestral relationships that are wrongly estimated by CAG

  • Correct estimation ($Corr$): the number of ancestral relationships that are correctly estimated by CAG

  • $Imp/Err$: the proportion of the ancestral relationships incorrectly estimated by CAG that DirectLiNGAM corrects

  • $Wor/Corr$: the proportion of the ancestral relationships correctly estimated by CAG that DirectLiNGAM worsens
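The following Python sketch illustrates one way these counts could be obtained by comparing the status of every ordered variable pair before and after applying DirectLiNGAM with the true ancestral relationships; representing relationships as sets of ordered pairs is an assumption made only for illustration.

from itertools import permutations

def imp_wor_counts(true_rel, before_rel, after_rel, p):
    """Count Corr, Err, Imp, and Wor over all ordered pairs (i, j), i != j.

    Each *_rel argument is a set of ordered pairs (i, j) meaning
    "x_j is an ancestor of x_i".  A pair is estimated correctly when its
    presence or absence agrees with true_rel.
    """
    corr = err = imp = wor = 0
    for pair in permutations(range(p), 2):
        truth = pair in true_rel
        before_ok = (pair in before_rel) == truth
        after_ok = (pair in after_rel) == truth
        if before_ok:
            corr += 1
            if not after_ok:
                wor += 1  # correct in CAG, worsened by DirectLiNGAM
        else:
            err += 1
            if after_ok:
                imp += 1  # wrong in CAG, corrected by DirectLiNGAM
    return corr, err, imp, wor

The ratios $Imp/Err$ and $Wor/Corr$ reported in Table VI then follow directly from these counts.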

TABLE VI: The improvement and worsening of ancestral relationships after applying DirectLiNGAM
$(p,|E|,n)$   Elimination   $Imp$   $Wor$   $Corr$   $Err$   $Imp/Err$   $Wor/Corr$
(10,5,11) Wald 265 1029 3858 642 0.4128 0.2667
Anc 228 858 3858 642 0.3551 0.2224
abs 286 1042 3858 642 0.4455 0.2701
(10,5,25) Wald 239 700 3956 544 0.4393 0.1769
Anc 227 636 3956 544 0.4173 0.1608
abs 238 720 3956 544 0.4375 0.1820
(10,5,50) Wald 204 268 4154 346 0.5896 0.0645
Anc 191 253 4154 346 0.5520 0.0609
abs 211 280 4154 346 0.6098 0.0674
(10,9,11) Wald 794 1226 2627 1873 0.4239 0.4667
Anc 374 559 2627 1873 0.1997 0.2128
abs 794 1226 2627 1873 0.4239 0.4667
(10,9,25) Wald 657 678 2780 1720 0.3820 0.2439
Anc 519 565 2780 1720 0.3017 0.2032
abs 747 804 2780 1720 0.4343 0.2892
(10,9,50) Wald 704 373 3078 1422 0.4951 0.1212
Anc 570 273 3078 1422 0.4008 0.0887
abs 728 412 3078 1422 0.5120 0.1339
(10,9,100) Wald 682 130 3479 1021 0.6680 0.0374
Anc 586 122 3479 1021 0.5739 0.0351
abs 703 118 3479 1021 0.6885 0.0339
(10,9,500) Wald 626 32 3730 770 0.8130 0.0086
Anc 626 33 3730 770 0.8078 0.0088
abs 626 32 3730 770 0.8104 0.0086

Table VI shows that when the sample size is small, $Imp<Wor$ but $Imp/Err>Wor/Corr$. That is, the number of ancestral relationships corrected by applying DirectLiNGAM is smaller than the number it worsens, but the proportion of incorrectly estimated relationships that are corrected exceeds the proportion of correctly estimated relationships that are worsened, which may explain the improvement in the accuracy of causal DAG estimation.

Even with large sample sizes, $Anc_{rec}$ improves by applying DirectLiNGAM. However, Table IV shows that the accuracy of CAG is inferior to that of RCD. When the sample size is large, DirectLiNGAM tends to include redundant edges in the estimated DAG, which may make CAG less accurate than RCD.

This experiment set 28 different significance levels for CAG's CI tests. Table VII presents the significance levels that maximize the $F$-measure of CAG-LiNGAM. A small significance level is often chosen when the sample size is small; in practice, however, the results are almost the same regardless of the significance level chosen. When the sample size is small, ancestral relationships can hardly be detected regardless of the significance level. In this experiment with CAG-LiNGAM and Wald tests, when $(p,|E|,n)=(10,5,11),(10,9,11),(10,9,25)$, no ancestral relationship could be detected within 100 repetitions. From Definition 1 in Section III-A, if no ancestral relationship is detected, the variable grouping is the family of all ${}_{p}C_{2}$ variable pairs. Interestingly, even in such cases, CAG-LiNGAM shows a higher $F$-measure than the other methods. When the sample size is moderately small, the $F$-measure is large when the significance level of Pearson's correlation test is 0.01 and the significance level of KCI is about 0.2 to 0.5. When the sample size is large, the smaller the significance level, the higher the $F$-measure.

TABLE VII: The significance levels used in CAG-LiNGAM in Table IV
                          W                       Anc                      abs
$(p, |E|, n)$   $\alpha_{P}$  $\alpha_{CI}$   $\alpha_{P}$  $\alpha_{CI}$   $\alpha_{P}$  $\alpha_{CI}$
(10, 5, 11) 0.01 0.001 0.01 0.3 0.01 0.001
(10, 5, 25) 0.01 0.5 0.01 0.5 0.01 0.5
(10, 5, 50) 0.01 0.4 0.01 0.4 0.01 0.4
(10, 5, 100) 0.01 0.3 0.01 0.3 0.01 0.3
(10, 5, 500) 0.01 0.001 0.01 0.001 0.01 0.001
(10, 9, 11) 0.05 0.01 0.05 0.01 0.05 0.01
(10, 9, 25) 0.01 0.001 0.5 0.5 0.01 0.001
(10, 9, 50) 0.01 0.3 0.01 0.3 0.01 0.3
(10, 9, 100) 0.01 0.2 0.01 0.2 0.01 0.2
(10, 9, 500) 0.01 0.001 0.01 0.001 0.01 0.001
(20, 19, 25) 0.05 0.001 0.01 0.5 0.05 0.001
(20, 19, 50) 0.01 0.4 0.01 0.3 0.01 0.4
(20, 19, 100) 0.01 0.1 0.01 0.1 0.01 0.1
(20, 19, 500) 0.01 0.001 0.01 0.001 0.01 0.001
(30, 29, 50) 0.01 0.4 0.01 0.3 0.01 0.4
(30, 29, 100) 0.01 0.2 0.01 0.1 0.01 0.2
(30, 29, 500) 0.01 0.001 0.01 0.001 0.01 0.001
(40, 39, 50) 0.01 0.001 0.01 0.3 0.01 0.4
(40, 39, 100) 0.01 0.2 0.01 0.1 0.01 0.1
(40, 39, 500) 0.01 0.001 0.01 0.001 0.01 0.001

As mentioned, $\mathrm{CAG}^{*}$ is CAG-LiNGAM with the true ancestral relationships given. When the sample size is small, the estimation accuracy of CAG-LiNGAM is inferior to that of $\mathrm{CAG}^{*}$ because the ancestral relationships estimated by CAG are less accurate. However, when $p=10$ and the sample size is large, CAG-LiNGAM shows higher precision and $F$-measure than $\mathrm{CAG}^{*}$. As seen from Figure 5 and Table V, the accuracy of the ancestral relationship estimation improves as the sample size increases. However, even when $(p,|E|,n)=(10,9,500)$, $Anc_{rec}$ is 0.713, which is not close enough to 1. From Table VII, when $(p,|E|,n)=(10,9,500)$, $\alpha_{P}=0.01$ and $\alpha_{CI}=0.001$ are used as the significance levels of the CI tests. Even with a large sample size, if small significance levels are used in CAG, some ancestral relationships cannot be detected due to type II errors in the CI tests. As a result, the estimated groups may contain fewer variables than the groups under the true CAG grouping, resulting in fewer redundant edges than $\mathrm{CAG}^{*}$. The grouping by CAG is not the finest grouping that guarantees the identifiability of the true causal DAG. These results suggest that using finer groupings may improve the accuracy of causal DAG estimation when the sample size is large.

V Conclusion

This paper proposes CAG, a divide-and-conquer algorithm for learning causal DAGs. The algorithm helps improve the performance of the original DirectLiNGAM, especially when the sample size is small relative to the number of variables. CAG is based on the ancestor-finding procedure of RCD. CAG-LiNGAM guarantees the identifiability of the true causal DAG. Detailed computer experiments confirm that CAG-LiNGAM can outperform the original DirectLiNGAM, RCD, and CAPA-LiNGAM in terms of the $F$-measure when the true causal DAG is sparse and the sample size is small relative to the number of variables.

If the true causal DAG is connected and the correct CI relationships are estimated, CAG creates as many variable groups as there are sink nodes in the causal DAG. Therefore, when the number of sink nodes in the true causal DAG is small, the number of variables in each group cannot be made sufficiently small, and an improvement in accuracy may not be expected. Other computer experiments have shown that even if the true causal DAG is sparse, the estimation accuracy is not necessarily high when the indegrees of some vertices are larger than one. CAPA behaves in the opposite way to CAG: the number of variable groups produced by CAPA is small when the maximum indegree of the true causal DAG is small, whereas it may be able to split the variables into many groups even when the maximum indegree is large. As noted at the end of the previous section, a divide-and-conquer algorithm is expected to estimate causal DAGs more accurately if finer groupings are used. A hybrid of CAG and CAPA may therefore give fine groupings even when the indegree of the true causal DAG is large.

We also found that RCD has higher estimation accuracy than CAG when the sample size is large. RCD estimates causal DAGs using the estimated ancestral relationships among variables, and the accuracy of the ancestral relationship estimation improves as the sample size increases. Therefore, using the estimated ancestral relationships when estimating causal DAGs may improve the estimation accuracy. However, RCD is computationally expensive when the sample size is large. Incorporating the information on the estimated ancestral relationships into DirectLiNGAM may improve the estimation accuracy of causal DAGs at a relatively low computational cost, even when the sample size is large.

Acknowledgment

This work was supported by JSPS KAKENHI Grant Number JP21K11797.

References

  • [1] R. Cai, Z. Zhang, and Z. Hao, “SADA: A general framework to support robust causation discovery,” in International Conference on Machine Learning, pp. 208–216, PMLR, 2013.
  • [2] H. Zhang, S. Zhou, C. Yan, J. Guan, X. Wang, J. Zhang, and J. Huan, “Learning causal structures based on divide and conquer,” IEEE Transactions on Cybernetics, vol. 52, no. 5, pp. 3232–3243, 2020.
  • [3] S. Shimizu, P. O. Hoyer, A. Hyvärinen, A. Kerminen, and M. Jordan, “A linear non-Gaussian acyclic model for causal discovery,” Journal of Machine Learning Research, vol. 7, no. 10, 2006.
  • [4] S. Shimizu, T. Inazumi, Y. Sogawa, A. Hyvärinen, Y. Kawahara, T. Washio, P. O. Hoyer, and K. Bollen, “DirectLiNGAM: A direct method for learning a linear non-Gaussian structural equation model,” Journal of Machine Learning Research, vol. 12, pp. 1225–1248, 2011.
  • [5] K. D. Hoover, Causality in economics and econometrics. SSRN eLibrary, 2006.
  • [6] K. Sachs, O. Perez, D. Pe’er, D. A. Lauffenburger, and G. P. Nolan, “Causal protein-signaling networks derived from multiparameter single-cell data,” Science, vol. 308, no. 5721, pp. 523–529, 2005.
  • [7] C. Glymour, “Learning causes: Psychological explanations of causal explanation,” Minds and Machines, vol. 8, no. 1, pp. 39–60, 1998.
  • [8] J. Pearl, M. Glymour, and N. P. Jewell, Causal Inference in Statistics: A Primer. Hoboken, New Jersey, U.S.: John Wiley and Sons, 2016.
  • [9] J. Pearl, “Causal diagrams for empirical research,” Biometrika, vol. 82, no. 4, pp. 669–688, 1995.
  • [10] P. Spirtes, C. N. Glymour, and R. Scheines, Causation, prediction, and search. MIT press, 2000.
  • [11] D. M. Chickering, “Optimal structure identification with greedy search,” Journal of Machine Learning Research, vol. 3, pp. 507–554, 2002.
  • [12] G. Schwarz, “Estimating the dimension of a model,” The Annals of Statistics, pp. 461–464, 1978.
  • [13] D. M. Chickering, “Learning Bayesian networks is NP-complete,” Learning from Data: Artificial Intelligence and Statistics V, pp. 121–130, 1996.
  • [14] D. M. Chickering, D. Heckerman, and C. Meek, “Large-sample learning of Bayesian networks is NP-hard,” Journal of Machine Learning Research, vol. 5, pp. 1287–1330, 2004.
  • [15] A. Hyvarinen, J. Karhunen, and E. Oja, “Independent component analysis,” Studies in informatics and control, vol. 11, no. 2, pp. 205–207, 2002.
  • [16] P. Hoyer, D. Janzing, J. M. Mooij, J. Peters, and B. Schölkopf, “Nonlinear causal discovery with additive noise models,” Advances in neural information processing systems, vol. 21, 2008.
  • [17] K. Zhang and A. Hyvarinen, “On the identifiability of the post-nonlinear causal model,” arXiv preprint arXiv:1205.2599, 2012.
  • [18] Y. S. Wang and M. Drton, “High-dimensional causal discovery under non-Gaussianity,” Biometrika, vol. 107, no. 1, pp. 41–59, 2020.
  • [19] T. N. Maeda and S. Shimizu, “RCD: Repetitive causal discovery of linear non-Gaussian acyclic models with latent confounders,” in International Conference on Artificial Intelligence and Statistics, pp. 735–745, PMLR, 2020.
  • [20] C. J. Kowalski, “On the effects of non-normality on the distribution of the sample product-moment correlation coefficient,” Journal of the Royal Statistical Society: Series C (Applied Statistics), vol. 21, no. 1, pp. 1–12, 1972.
  • [21] T. N. Maeda and S. Shimizu, “Repetitive causal discovery of linear non-Gaussian acyclic models in the presence of latent confounders,” International Journal of Data Science and Analytics, pp. 1–13, 2022.
  • [22] A. Gretton, K. Fukumizu, C. Teo, L. Song, B. Schölkopf, and A. Smola, “A kernel statistical test of independence,” Advances in neural information processing systems, vol. 20, 2007.
  • [23] K. Zhang, J. Peters, D. Janzing, and B. Schölkopf, “Kernel-based conditional independence test and application in causal discovery,” arXiv preprint arXiv:1202.3775, 2012.
  • [24] S. Zhu, I. Ng, and Z. Chen, “Causal discovery with reinforcement learning,” arXiv preprint arXiv:1906.04477, 2019.

Appendix

V-A Pseudo-code of Ancestor Finding

This section presents the pseudocode of the ancestor-finding algorithm. Lines 5–6 constitute the procedure that avoids cyclic ancestral relationships and speeds up the ancestor finding. $Pearson(x_{i},x_{j})$ denotes the p-value of Pearson's correlation test between $x_{i}$ and $x_{j}$, and $KCI(u,v)$ denotes the p-value of the kernel-based conditional independence test (KCI) for the independence between $u$ and $v$.

Because no latent confounders are assumed to exist, we can declare that all ancestral relationships have been identified once a full pass over all variable pairs discovers no new ancestral relationship. This means that, unlike the original RCD [19, 21], even if some pairs of variables reach the point in Line 35 where a further search for unknown common ancestors would otherwise be required, they are ultimately treated as having no ancestral relationship.
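As a rough illustration of the check performed for a pair with no detected common ancestors (Lines 26–38), the Python sketch below regresses each variable on the other and tests each residual against the corresponding regressor. It uses scipy.stats.pearsonr for the correlation test; kci_pvalue is a hypothetical stand-in for a KCI (or HSIC) implementation and is not tied to any particular library.

import numpy as np
from scipy.stats import pearsonr

def pairwise_ancestor_check(xi, xj, alpha_p, alpha_ci, kci_pvalue):
    """Decide the ancestral relationship between 1-D samples xi and xj.

    kci_pvalue(u, v) is assumed to return the p-value of an unconditional
    kernel independence test such as KCI or HSIC.
    """
    # Skip the pair if the two variables are uncorrelated.
    _, p_corr = pearsonr(xi, xj)
    if p_corr > alpha_p:
        return "no ancestral relationship"

    # Simple regression residuals: u_i = x_i - b_ij x_j and u_j = x_j - b_ji x_i.
    xi_c, xj_c = xi - xi.mean(), xj - xj.mean()
    b_ij = np.dot(xi_c, xj_c) / np.dot(xj_c, xj_c)
    b_ji = np.dot(xi_c, xj_c) / np.dot(xi_c, xi_c)
    u_i = xi - b_ij * xj
    u_j = xj - b_ji * xi

    p_ui_xj = kci_pvalue(u_i, xj)  # residual of x_i versus regressor x_j
    p_uj_xi = kci_pvalue(u_j, xi)  # residual of x_j versus regressor x_i

    if p_uj_xi > alpha_ci and p_ui_xj <= alpha_ci:
        return "x_i is an ancestor of x_j"
    if p_ui_xj > alpha_ci and p_uj_xi <= alpha_ci:
        return "x_j is an ancestor of x_i"
    return "possible unobserved common ancestors"

When the pair shares detected common ancestors, the same check is applied to the residuals $v_{i}$ and $v_{j}$ obtained after regressing out the common ancestors (Lines 9–24).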

 

Algorithm 1 Ancestor Finding

 

0:  The observed data of the variable set $X$
0:  The list $ANC_{L}$ of the sets of ancestors of all variables
1:  Set the significance level of the Pearson tests to $\alpha_{P}$ and the significance level of KCI to $\alpha_{CI}$
2:  $Anc_{i}\leftarrow\phi$ for $i=1,\ldots,p$; $ANC_{L}\leftarrow\{Anc_{1},\ldots,Anc_{p}\}$
3:  repeat
4:     for each pair of variables $x_{i}$ and $x_{j}$ do
5:        if $\exists x_{k}\in Anc_{i}$ such that $x_{j}\in Anc_{k}$ then
6:           Add $x_{j}$ into $Anc_{i}$
7:        else
8:           if $Anc_{i}\cap Anc_{j}\neq\phi$ then
9:              $CA\leftarrow Anc_{i}\cap Anc_{j}$
10:              $v_{i}\leftarrow x_{i}-\hat{\bm{\alpha}}_{ij}^{\top}CA$, $v_{j}\leftarrow x_{j}-\hat{\bm{\alpha}}_{ji}^{\top}CA$
11:              $pvalue_{l}\leftarrow Pearson(v_{i},v_{j})$
12:              if $pvalue_{l}\leq\alpha_{P}$ then
13:                 $u_{i}\leftarrow v_{i}-\hat{b}_{ij}v_{j}$, $u_{j}\leftarrow v_{j}-\hat{b}_{ji}v_{i}$
14:                 $pvalue^{i}_{j}\leftarrow KCI(u_{i},v_{j})$, $pvalue^{j}_{i}\leftarrow KCI(u_{j},v_{i})$
15:                 if $pvalue^{j}_{i}>\alpha_{CI}$ and $pvalue^{i}_{j}\leq\alpha_{CI}$ then
16:                    Add $x_{i}$ into $Anc_{j}$
17:                 else if $pvalue^{j}_{i}\leq\alpha_{CI}$ and $pvalue^{i}_{j}>\alpha_{CI}$ then
18:                    Add $x_{j}$ into $Anc_{i}$
19:                 else
20:                    There are unobserved common ancestors between $x_{i}$ and $x_{j}$
21:                 end if
22:              else
23:                 No ancestral relationship between $x_{i}$ and $x_{j}$
24:              end if
25:           else
26:              $pvalue_{l}\leftarrow Pearson(x_{i},x_{j})$
27:              if $pvalue_{l}\leq\alpha_{P}$ then
28:                 $u_{i}\leftarrow x_{i}-\hat{b}_{ij}x_{j}$, $u_{j}\leftarrow x_{j}-\hat{b}_{ji}x_{i}$
29:                 $pvalue^{i}_{j}\leftarrow KCI(u_{i},x_{j})$, $pvalue^{j}_{i}\leftarrow KCI(u_{j},x_{i})$
30:                 if $pvalue^{j}_{i}>\alpha_{CI}$ and $pvalue^{i}_{j}\leq\alpha_{CI}$ then
31:                    Add $x_{i}$ into $Anc_{j}$
32:                 else if $pvalue^{j}_{i}\leq\alpha_{CI}$ and $pvalue^{i}_{j}>\alpha_{CI}$ then
33:                    Add $x_{j}$ into $Anc_{i}$
34:                 else
35:                    There are unobserved common ancestors between $x_{i}$ and $x_{j}$
36:                 end if
37:              else
38:                 No ancestral relationship between $x_{i}$ and $x_{j}$
39:              end if
40:           end if
41:        end if
42:     end for
43:  until there is no new change in $ANC_{L}$ during a full pass
44:  return  $ANC_{L}$

 

V-B Pseudo-code of Grouping and Merging

This section presents the pseudocode for grouping the variables and merging the results. In the implementation, type II errors in the CI tests can produce many groups containing only one variable, in which case the output DAG would be too sparse. To avoid this, the procedure in lines 3 through 9 merges each single-element group with the other groups. When the sample size is small, no ancestral relationship may be detected, so every group contains exactly one element; in that case, all ${}_{p}C_{2}$ variable pairs are taken as the variable grouping.

After grouping, lines 10–13 learn the causal sub-DAGs for the groups and merge them. Finally, lines 15–17 eliminate any cycles in the merged DAG.
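As a rough Python sketch of the grouping step (lines 1–9), the code below builds one candidate group per variable, keeps only the maximal candidates, and merges singleton groups with the remaining groups. Representing the group of variable $x_{i}$ as $\{x_{i}\}\cup Anc_{i}$ is our reading of Definition 1 and should be taken as an assumption; the data structures are illustrative, not the authors' implementation.

def cag_grouping(anc_lists):
    """Sketch of the CAG grouping step.

    anc_lists: dict mapping each variable index i to its estimated ancestor
    set Anc_i.  Each candidate group is assumed to be {i} | Anc_i; only the
    maximal candidates (with respect to set inclusion) are kept, and any
    singleton group is merged with every other group.
    """
    candidates = {frozenset({i}) | frozenset(anc) for i, anc in anc_lists.items()}
    # V_0: maximal candidate groups.
    v0 = {g for g in candidates if not any(g < other for other in candidates)}
    if all(len(g) > 1 for g in v0):
        return v0
    groups = set()
    for g in v0:
        if len(g) == 1:
            # Merge the singleton with every other group (lines 6-8).
            groups.update(g | other for other in v0 if other != g)
    return groups

# Toy example: chains x1 -> x2 -> x4 and x1 -> x3 -> x5 (sinks x4 and x5).
anc = {1: set(), 2: {1}, 3: {1}, 4: {1, 2}, 5: {1, 3}}
print(cag_grouping(anc))  # {frozenset({1, 2, 4}), frozenset({1, 3, 5})}

Note that when no ancestral relationship is detected, every candidate is a singleton and the merging step returns all ${}_{p}C_{2}$ variable pairs, as described above.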

 

Algorithm 2 Grouping and Merging

 

0:  The set $ANC_{L}$ of ancestor lists
0:  The set of edges $E$
1:  ${\cal V}\leftarrow\emptyset$
2:  Let ${\cal V}_{0}$ be defined as the family of maximal elements in $ANC_{L}$
3:  if $|\bm{v}|>1$ for all $\bm{v}\in{\cal V}_{0}$ then
4:     ${\cal V}\leftarrow{\cal V}_{0}$
5:  else
6:     for each $\bm{v}\in{\cal V}_{0}$ with $|\bm{v}|=1$ do
7:        Add $\bigcup_{\bm{v}^{\prime}\in{\cal V}_{0},\bm{v}^{\prime}\neq\bm{v}}\{\bm{v}\cup\bm{v}^{\prime}\}$ to ${\cal V}$
8:     end for
9:  end if
10:  for each $\bm{v}$ in ${\cal V}$ do
11:     Estimate the causal sub-DAG $G_{\bm{v}}=(\bm{v},E_{\bm{v}})$ for $\bm{v}$ by DirectLiNGAM
12:  end for
13:  $E\leftarrow\bigcup_{\bm{v}\in{\cal V}}E_{\bm{v}}$
14:  $G=(X,E)$
15:  if $G$ contains cycles then
16:     Eliminate cycles in $G$
17:  end if
18:  return  $G$