
Matching a Desired Causal State via Shift Interventions

Jiaqi Zhang
LIDS, EECS, and IDSS, MIT
Cambridge, MA 02139
[email protected]
Chandler Squires
LIDS, EECS, and IDSS, MIT
Cambridge, MA 02139
[email protected]
Caroline Uhler
LIDS, EECS, and IDSS, MIT
Cambridge, MA 02139
[email protected]
Abstract

Transforming a causal system from a given initial state to a desired target state is an important task permeating multiple fields including control theory, biology, and materials science. In causal models, such transformations can be achieved by performing a set of interventions. In this paper, we consider the problem of identifying a shift intervention that matches the desired mean of a system through active learning. We define the Markov equivalence class that is identifiable from shift interventions and propose two active learning strategies that are guaranteed to exactly match a desired mean. We then derive a worst-case lower bound for the number of interventions required and show that these strategies are optimal for certain classes of graphs. In particular, we show that our strategies may require exponentially fewer interventions than the previously considered approaches, which optimize for structure learning in the underlying causal graph. In line with our theoretical results, we also demonstrate experimentally that our proposed active learning strategies require fewer interventions compared to several baselines.

1 Introduction

Consider an experimental biologist attempting to turn cells from one type into another, e.g., from fibroblasts to neurons (Vierbuchen et al., 2010), by altering gene expression. This is known as cellular reprogramming and has shown great promise in recent years for regenerative medicine (Rackham et al., 2016). A common approach is to model gene expression of a cell, which is governed by an underlying gene regulatory network, using a structural causal model (Friedman et al., 2000; Badsha et al., 2019). Through a set of interventions, such as gene knockouts or over-expression of transcription factors (Dominguez et al., 2016), a biologist can infer the structure of the underlying regulatory network. After inferring enough about this structure, a biologist can identify the intervention needed to successfully reprogram a cell. More generally, transforming a causal system from an initial state to a desired state through interventions is an important task pervading multiple applications. Other examples include closed-loop control (Touchette and Lloyd, 2004) and pathway design of microstructures (Wodo et al., 2015).

With little prior knowledge of the underlying causal model, this task is intrinsically difficult. Previous works have addressed the problem of intervention design to achieve full identifiability of the causal model (Hauser and Bühlmann, 2014; Greenewald et al., 2019; Squires et al., 2020a). However, since interventional experiments tend to be expensive in practice, one wishes to minimize the number of trials and learn just enough information about the causal model to be able to identify the intervention that will transform it into the desired state. Furthermore, in many realistic cases, the set of interventions which can be performed is constrained. For instance, in CRISPR experiments, only a limited number of genes can be knocked out to keep the cell alive; or in robotics, a robot can only manipulate a certain number of arms at once.

Contributions. We take the first step towards the task of causal matching (formalized in Section 2), where an experimenter can perform a series of interventions in order to identify a matching intervention which transforms the system to a desired state. In particular, we consider the case where the goal is to match the mean of a distribution. We focus on a subclass of interventions called shift interventions, which can for example be used to model gene over-expression experiments (Triantafillou et al., 2017). These interventions directly increase or decrease the values of their perturbation targets, with their effect being propagated to variables which are downstream (in the underlying causal graph) of these targets. We show that there always exists a unique shift intervention (which may have multiple perturbation targets) that exactly transforms the mean of the variables into the desired mean (Lemma 1). We call this shift intervention the matching intervention.

To find the matching intervention, in Section 3 we characterize the Markov equivalence class of a causal graph induced by shift interventions, i.e., the edges in the causal graph that are identifiable from shift interventions; in particular, we show that the resulting Markov equivalence classes can be more refined than previous notions of interventional Markov equivalence classes. We then propose two active learning strategies in Section 4 based on this characterization, which are guaranteed to identify the matching intervention. These active strategies proceed in an adaptive manner, where each intervention is chosen based on all the information gathered so far.

In Section 5, we derive a worst-case lower bound on the number of interventions required to identify the matching intervention and show that the proposed strategies are optimal up to a logarithmic factor. Notably, the proposed strategies may use exponentially fewer interventions than previous active strategies for structure learning. Finally, in Section 6, we also demonstrate empirically that our proposed strategies outperform previous methods as well as other baselines in various settings.

1.1 Related Works

Experimental Design. Previous work on experimental design in causality has considered two closely related goals: learning the most structural information about the underlying DAG given a fixed budget of interventions (Ghassami et al., 2018), and fully identifying the underlying DAG while minimizing the total number or cost (Shanmugam et al., 2015; Kocaoglu et al., 2017) of interventions. These works can also be classified according to whether they consider a passive setting, i.e., the interventions are picked at a single point in time (Hyttinen et al., 2013; Shanmugam et al., 2015; Kocaoglu et al., 2017), or an active setting, i.e., interventions are decided based on the results of previous interventions (He and Geng, 2008; Agrawal et al., 2019; Greenewald et al., 2019; Squires et al., 2020a). The setting addressed in the current work is closest to the active, full-identification setting. The primary difference is that in order to match a desired mean, one does not require full identification; in fact, as we show in this work, we may require significantly fewer interventions.

Causal Bandits. Another related setting is the bandit problem in sequential decision making, where an agent aims to maximize the cumulative reward by selecting an arm at each time step. Previous works considered the setting where there are causal relations between rewards and arms (Lattimore et al., 2016; Lee and Bareinboim, 2018; Yabe et al., 2018). Using a known causal structure, these works were able to improve the dependence on the total number of arms compared to previous regret lower bounds (Bubeck and Cesa-Bianchi, 2012; Lattimore et al., 2016). These results were further extended to the case when the causal structure is unknown a priori (de Kroon et al., 2020). In all these works the variables are discrete, with arms given by do-interventions (i.e., setting variables to a given value), so that there are only a finite number of arms. In our work, we are concerned with the continuous setting and shift interventions, which correspond to an infinite (continuous) set of arms.

Correlation-based Approaches. There are also various correlation-based approaches for this task that do not make use of any causal information. For example, previous works have proposed score-based (Cahan et al., 2014), entropy-based (D’Alessio et al., 2015) and distance-based approaches (Rackham et al., 2016) for cellular reprogramming. However, as shown in bandit settings (Lattimore et al., 2016), when the system follows a causal structure, this structure can be exploited to learn the optimal intervention more efficiently. Therefore, we here focus on developing a causal approach.

2 Problem Setup

We now formally introduce the causal matching problem of identifying an intervention to match the desired state in a causal system under a given metric. Following Koller and Friedman (2009), a causal structural model is given by a directed acyclic graph (DAG) $\mathcal{G}$ with nodes $[p]=\{1,\dots,p\}$, and a set of random variables $X=\{X_1,\dots,X_p\}$ whose joint distribution $\mathrm{P}$ factorizes according to $\mathcal{G}$. Denote by $\operatorname{pa}_{\mathcal{G}}(i)=\{j\in[p] \mid j\to i\}$ the parents of node $i$ in $\mathcal{G}$. An intervention $I\subset[p]$ with multiple perturbation targets $i\in I$ either removes all incoming edges to $X_i$ (hard intervention) or modifies the conditional probability $\mathrm{P}(X_i|X_{\operatorname{pa}_{\mathcal{G}}(i)})$ (soft intervention) for all $i\in I$. This results in an interventional distribution $\mathrm{P}^I$. Given a desired joint distribution $\mathrm{Q}$ over $X$, the goal of causal matching is to find an optimal matching intervention $I$ such that $\mathrm{P}^I$ best matches $\mathrm{Q}$ under some metric. In this paper, we address a special case of the causal matching problem, which we call causal mean matching, where the distance metric between $\mathrm{P}^I$ and $\mathrm{Q}$ depends only on their expectations.

We focus on causal mean matching for a class of soft interventions, called shift interventions (Rothenhäusler et al., 2015). Formally, a shift intervention with perturbation targets $I\subset[p]$ and shift values $\{a_i\}_{i\in I}$ modifies the conditional distribution as $\mathrm{P}^I(X_i=x+a_i|X_{\operatorname{pa}_{\mathcal{G}}(i)})=\mathrm{P}(X_i=x|X_{\operatorname{pa}_{\mathcal{G}}(i)})$. Here, the shift values $\{a_i\}_{i\in I}$ are assumed to be deterministic. We aim to find $I\subset[p]$ and $\{a_i\}_{i\in I}\in\mathbb{R}^{|I|}$ such that the mean of $\mathrm{P}^I$ is closest to that of $\mathrm{Q}$, i.e., minimizes $d(\mathbb{E}_{\mathrm{P}^I}(X),\mathbb{E}_{\mathrm{Q}}(X))$ for some metric $d$. In fact, as we show in the following lemma, there always exists a unique shift intervention, which we call the matching intervention, that achieves exact mean matching. (To lighten notation, we use $I$ to denote both the perturbation targets and the shift values of this intervention.)
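To make these semantics concrete, the following is a minimal numerical sketch of a shift intervention in a hypothetical three-node linear-Gaussian model; the chain structure, edge weights, and sample size are illustrative assumptions, not part of the paper's setup. Consistent with Lemma 2 below, the mean of the (source) target moves by exactly its shift value, while downstream means move by propagated amounts.

import numpy as np

rng = np.random.default_rng(0)

def sample_means(shift=None, n=200_000):
    """Monte-Carlo means of a hypothetical linear-Gaussian SEM
    X1 -> X2 -> X3; `shift` maps a node to its shift value a_i,
    added on top of the node's structural equation."""
    shift = shift or {}
    x1 = rng.normal(1.0, 1.0, n) + shift.get(1, 0.0)
    x2 = 0.5 * x1 + rng.normal(0.0, 1.0, n) + shift.get(2, 0.0)
    x3 = 2.0 * x2 + rng.normal(0.0, 1.0, n) + shift.get(3, 0.0)
    return np.stack([x1, x2, x3], axis=1).mean(axis=0)

print(sample_means())              # approx (1.0, 0.5, 1.0)
print(sample_means({1: 2.0}))      # approx (3.0, 1.5, 3.0): a_1 = 2 propagates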

Lemma 1.

For any causal structural model and desired mean $\mathbb{E}_{\mathrm{Q}}(X)$, there exists a unique shift intervention $I^*$ such that $\mathbb{E}_{\mathrm{P}^{I^*}}(X)=\mathbb{E}_{\mathrm{Q}}(X)$.

We assume throughout that the underlying causal DAG $\mathcal{G}$ is unknown. However, we assume causal sufficiency (Spirtes et al., 2000), which excludes the existence of latent confounders, as well as access to enough observational data to determine the joint distribution $\mathrm{P}$ and thus the Markov equivalence class of $\mathcal{G}$ (Andersson et al., 1997). It is well known that with enough interventions, the causal DAG $\mathcal{G}$ becomes fully identifiable (Yang et al., 2018). Thus one strategy for causal mean matching is to first use interventions to fully identify the structure of $\mathcal{G}$, and then solve for the matching intervention given full knowledge of the graph. However, in general this strategy requires more interventions than needed. In fact, the number of interventions required by such a strategy can be exponentially larger than the number required by a strategy that directly attempts to identify the matching intervention, as illustrated in Figure 1 and proven in Theorem 2.

Figure 1: Completely identifying a DAG can require exponentially more interventions than identifying the matching intervention. Consider a graph constructed by joining $r$ size-4 cliques, where the matching intervention has the source node as the only perturbation target, as pictured in (a) with $r=2$ and the source node in purple; (b) shows the minimum-size intervention set (in purple) that completely identifies the DAG, which grows as $O(r)$ (Squires et al., 2020a). In Theorem 2, we show that the matching intervention can be identified in $O(\log r)$ single-node interventions.

In this work, we consider active intervention designs, where a series of interventions is chosen adaptively to learn the matching intervention. This means that the information obtained after performing each intervention is taken into account for future choices of interventions. We here focus on the noiseless setting, where for each intervention enough data is obtained to determine its effect. Direct implications for the noisy setting are discussed in Appendix G. To incorporate realistic cases in which the system cannot withstand an intervention with too many target variables, as in CRISPR experiments, where knocking out too many genes at once often kills the cell, we consider the setting where there is a sparsity constraint $S$ on the maximum number of perturbation targets in each intervention, i.e., we only allow interventions $I$ with $|I|\leq S$.

3 Identifiability

In this section, we characterize and provide a graphical representation of the shift interventional Markov equivalence class (shift-$\mathcal{I}$-MEC), i.e., the equivalence class of DAGs that is identifiable from shift interventions $\mathcal{I}$. We also introduce mean interventional faithfulness, an assumption that guarantees identifiability of the underlying DAG up to its shift-$\mathcal{I}$-MEC. Proofs are given in Appendix C.

3.1 Shift-interventional Markov Equivalence Class

For any DAG $\mathcal{G}$ with nodes $[p]$, a distribution $f$ is Markov with respect to $\mathcal{G}$ if it factorizes according to $f(X)=\prod_{i\in[p]}f(X_i|X_{\operatorname{pa}_{\mathcal{G}}(i)})$. Two DAGs are Markov equivalent, or in the same Markov equivalence class (MEC), if any positive distribution $f$ which is Markov with respect to (w.r.t.) one DAG is also Markov w.r.t. the other DAG. With observational data, a DAG is only identifiable up to its MEC (Andersson et al., 1997). However, identifiability improves to a smaller class of DAGs with interventions. For a set of interventions $\mathcal{I}$ (not necessarily shift interventions), the pair $(f,\{f^I\}_{I\in\mathcal{I}})$ is $\mathcal{I}$-Markov w.r.t. $\mathcal{G}$ if $f$ is Markov w.r.t. $\mathcal{G}$ and $f^I$ factorizes according to

f^I(X)=\prod_{i\notin I}f(X_i|X_{\operatorname{pa}_{\mathcal{G}}(i)})\prod_{i\in I}f^I(X_i|X_{\operatorname{pa}_{\mathcal{G}}(i)}),\quad\forall I\in\mathcal{I}.

Similarly, the interventional Markov equivalence class ($\mathcal{I}$-MEC) of a DAG can be defined, and Yang et al. (2018) provided a structural characterization of the $\mathcal{I}$-MEC for general interventions $\mathcal{I}$ (not necessarily shift interventions).

In the following, we show that if $\mathcal{I}$ consists of shift interventions, then the $\mathcal{I}$-MEC becomes smaller, i.e., identifiability of the causal DAG is improved. The proof utilizes Lemma 2 on the relationship between conditional probabilities. For this, denote by $\operatorname{an}_{\mathcal{G}}(i)$ the ancestors of node $i$, i.e., all nodes $j$ for which there is a directed path from $j$ to $i$ in $\mathcal{G}$. For a subset of nodes $I$, we say that $i\in I$ is a source w.r.t. $I$ if $\operatorname{an}_{\mathcal{G}}(i)\cap I=\varnothing$. A subset $I'\subset I$ is a source w.r.t. $I$ if every node in $I'$ is a source w.r.t. $I$.

Lemma 2.

For any distribution $f$ that factorizes according to $\mathcal{G}$, the interventional distribution $f^I$ for a shift intervention $I\subset[p]$ with shift values $\{a_i\}_{i\in I}$ satisfies

\mathbb{E}_{f^I}(X_i)=\mathbb{E}_f(X_i)+a_i,

for any source $i\in I$. Furthermore, if $i\in I$ is not a source w.r.t. $I$, then there exists a positive distribution $f$ such that $\mathbb{E}_{f^I}(X_i)\neq\mathbb{E}_f(X_i)+a_i$.

Hence, we can define the shift-$\mathcal{I}$-Markov property and the shift interventional Markov equivalence class (shift-$\mathcal{I}$-MEC) as follows.

Definition 1.

For a set of shift interventions $\mathcal{I}$, the pair $(f,\{f^I\}_{I\in\mathcal{I}})$ is shift-$\mathcal{I}$-Markov w.r.t. $\mathcal{G}$ if $(f,\{f^I\}_{I\in\mathcal{I}})$ is $\mathcal{I}$-Markov w.r.t. $\mathcal{G}$ and

\mathbb{E}_{f^I}(X_i)=\mathbb{E}_f(X_i)+a_i,\quad\forall\,i\in I\in\mathcal{I}\ \text{s.t.}\ \operatorname{an}_{\mathcal{G}}(i)\cap I=\varnothing.

Two DAGs are in the same shift-$\mathcal{I}$-MEC if any positive distribution that is shift-$\mathcal{I}$-Markov w.r.t. one DAG is also shift-$\mathcal{I}$-Markov w.r.t. the other DAG.

The following graphical characterizations are known: two DAGs are in the same MEC if and only if they share the same skeleton (adjacencies) and v-structures (induced subgraphs $i\rightarrow j\leftarrow k$); see Verma and Pearl (1991). For general interventions $\mathcal{I}$, two DAGs are in the same $\mathcal{I}$-MEC if and only if they are in the same MEC and they have the same directed edges $\{i\rightarrow j \mid i\in I, j\notin I, I\in\mathcal{I}, i-j\}$, where $i-j$ means that either $i\to j$ or $j\to i$ (Hauser and Bühlmann, 2012; Yang et al., 2018). In the following theorem, we provide a graphical criterion for two DAGs to be in the same shift-$\mathcal{I}$-MEC.

Theorem 1.

Let $\mathcal{I}$ be a set of shift interventions. Then two DAGs $\mathcal{G}_1$ and $\mathcal{G}_2$ belong to the same shift-$\mathcal{I}$-MEC if and only if they have the same skeleton, v-structures, directed edges $\{i\rightarrow j \mid i\in I, j\notin I, I\in\mathcal{I}, i-j\}$, as well as source nodes of $I$ for every $I\in\mathcal{I}$.

In other words, two DAGs are in the same shift-$\mathcal{I}$-MEC if and only if they are in the same $\mathcal{I}$-MEC and they have the same source perturbation targets. To represent an MEC, we use the essential graph (EG), which has the same skeleton as any DAG in this class and has a directed edge $i\rightarrow j$ if $i\rightarrow j$ in every DAG in this class. The essential graphs corresponding to the MEC, $\mathcal{I}$-MEC and shift-$\mathcal{I}$-MEC of a DAG $\mathcal{G}$ are referred to as the EG, $\mathcal{I}$-EG and shift-$\mathcal{I}$-EG of $\mathcal{G}$, respectively. They can be obtained from the aforementioned graphical criteria, along with a set of logical rules known as the Meek rules (Meek, 1995); see details in Appendix A. Figure 2 shows an example of the EG, $\mathcal{I}$-EG and shift-$\mathcal{I}$-EG of a four-node DAG.

Figure 2: Three types of essential graphs. (a) DAG $\mathcal{G}$; (b) EG of $\mathcal{G}$; (c) $\mathcal{I}$-EG of $\mathcal{G}$, where $\mathcal{I}$ contains one intervention with perturbation targets $X_1,X_2$ (purple); (d) shift-$\mathcal{I}$-EG of $\mathcal{G}$, which identifies an additional edge (red) compared to the $\mathcal{I}$-EG.

3.2 Mean Interventional Faithfulness

For the causal mean matching problem, the underlying $\mathcal{G}$ can be identified from shift interventions $\mathcal{I}$ up to its shift-$\mathcal{I}$-MEC. However, we may not need to identify the entire DAG to find the matching intervention $I^*$. Lemma 1 implies that if $i$ is neither in nor downstream of $I^*$, then the mean of $X_i$ already matches the desired state, i.e., $\mathbb{E}_{\mathrm{P}}(X_i)=\mathbb{E}_{\mathrm{Q}}(X_i)$; this suggests that these variables may be negligible when learning $I^*$. Unfortunately, the reverse is not true; one may design "degenerate" settings where a variable is in (or downstream of) $I^*$, but its marginal mean is also unchanged:

Example 1.

Let $X_3=X_1+2X_2$, with $\mathbb{E}_{\mathrm{P}}(X_1)=1$ and $\mathbb{E}_{\mathrm{P}}(X_2)=1$, so that $\mathbb{E}_{\mathrm{P}}(X_3)=3$. Suppose $I^*$ is a shift intervention with perturbation targets $\{X_1,X_2,X_3\}$ and shift values $a_1=1$, $a_2=-1$, and $a_3=1$. Then $\mathbb{E}_{\mathrm{P}^{I^*}}(X_3)=3$, i.e., the marginal mean of $X_3$ is unchanged under the intervention.
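The cancellation in Example 1 can be checked numerically; the sketch below simulates the structural equation and the shift intervention (the Gaussian noise is an arbitrary choice, since only the means matter).

import numpy as np

rng = np.random.default_rng(0)
n = 500_000
a1, a2, a3 = 1.0, -1.0, 1.0                  # shift values from Example 1

x1 = rng.normal(1.0, 1.0, n)                 # E[X1] = 1
x2 = rng.normal(1.0, 1.0, n)                 # E[X2] = 1
x3 = x1 + 2 * x2                             # E[X3] = 3

x1s, x2s = x1 + a1, x2 + a2                  # shifts enter each equation
x3s = x1s + 2 * x2s + a3

print(x3.mean(), x3s.mean())                 # both approx 3: the shifts cancel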

Such degenerate cases arise when the shift on a node $X_j$ (deemed 0 if $j$ is not shifted) exactly cancels out the contributions of the shifts on its ancestors. Formally, the following assumption rules out these cases.

Assumption 1 (Mean Interventional Faithfulness).

If $i\in[p]$ satisfies $\mathbb{E}_{\mathrm{P}}(X_i)=\mathbb{E}_{\mathrm{Q}}(X_i)$, then $i$ is neither a perturbation target nor downstream of any perturbation target, i.e., $i\notin I^*$ and $\operatorname{an}_{\mathcal{G}}(i)\cap I^*=\varnothing$.

This is a particularly weak form of faithfulness, which is implied by the interventional faithfulness assumptions in prior work (Yang et al., 2018; Squires et al., 2020b; Jaber et al., 2020).

Let $T$ be the collection of nodes $i\in[p]$ for which $\mathbb{E}_{\mathrm{P}}(X_i)\neq\mathbb{E}_{\mathrm{Q}}(X_i)$. The following lemma shows that under the mean interventional faithfulness assumption we can focus on the subgraph $\mathcal{G}_T$ induced by $T$, since $I^*\subset T$ and interventions on $X_T$ do not affect $X_{[p]\setminus T}$.

Lemma 3.

If Assumption 1 holds, then any edge $i-j$ with $j\in T$ and $i\notin T$ has orientation $j\leftarrow i$. Conversely, if Assumption 1 does not hold, then there exists some edge $i-j$ with $j\in T$, $i\notin T$ such that $j\to i$.

4 Algorithms

Having shown that shift interventions allow the identification of source perturbation targets and that the mean interventional faithfulness assumption allows reducing the problem to an induced subgraph, we now propose two algorithms to learn the matching intervention. The algorithms actively pick a shift intervention $I_t$ at time $t$ based on the current shift interventional essential graph (shift-$\mathcal{I}_t$-EG). Without loss of generality and for ease of discussion, we assume that the mean interventional faithfulness assumption holds, and we therefore only need to consider $\mathcal{G}_T$. In Appendix D, we show that faithfulness violations can be identified; thus Assumption 1 is not necessary for identifying the matching intervention, but additional interventions may be required.

Figure 3: Learning $I^*$ when the structure is known. Undimmed parts represent the current subgraph, with source nodes in purple. $I^*=\{1,2,4,5\}$ is solved in three steps; shift values are omitted. (a) $\mathcal{G}_T$ and $U_T$; (b) $\mathcal{G}_{T_1}$ and $U_{T_1}$; (c) $\mathcal{G}_{T_2}$ and $U_{T_2}$.

Warm-up: Upstream Search. Consider solving for the matching intervention $I^*$ when the structure of $\mathcal{G}_T$ is known, i.e., the current shift-$\mathcal{I}_t$-EG is fully directed. Let $U_T=\{i \mid i\in T, \operatorname{an}_{\mathcal{G}_T}(i)\cap T=\varnothing\}$ be the non-empty set of source nodes in $T$. We make the following key observation.

Observation 1.

$U_T\subset I^*$, and the shift values are $a_i=\mathbb{E}_{\mathrm{Q}}(X_i)-\mathbb{E}_{\mathrm{P}}(X_i)$ for each $i\in U_T$.

This follows since shifting other variables in $T$ cannot change the mean of nodes in $U_T$. Further, the shifted means of variables in $U_T$ match the desired mean (Lemma 2). Given the resulting intervention $U_T$, we obtain a new distribution $\mathrm{P}^{U_T}$. Assuming mean interventional faithfulness for this distribution, we may now remove those variables whose means in $\mathrm{P}^{U_T}$ already match $\mathrm{Q}$. We then repeat this process on the new set of unmatched nodes, $T_1$, to compute the corresponding shift intervention $U_{T_1}$. Repeating until we have matched the desired mean for all variables yields $I^*$. We illustrate this procedure in Figure 3.
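A minimal sketch of this upstream-search loop follows, assuming the DAG is known and represented as a networkx DiGraph, and assuming a hypothetical oracle mean_p that returns exact node means under a given shift intervention (our noiseless setting); both assumptions are ours, made for illustration.

import networkx as nx

def upstream_search(dag, mean_p, mean_q, tol=1e-9):
    """Sketch of upstream search on a *known* DAG (Figure 3).

    dag: nx.DiGraph; mean_q: dict node -> desired mean; mean_p(I): a
    hypothetical noiseless oracle returning the mean of every node
    under the shift intervention I (a dict node -> shift value).
    """
    I = {}                                   # matching intervention so far
    while True:
        mu = mean_p(I)
        T = {i for i in dag if abs(mu[i] - mean_q[i]) > tol}
        if not T:
            return I                         # all means matched: I = I*
        sub = dag.subgraph(T)
        sources = {i for i in T if sub.in_degree(i) == 0}
        for i in sources:                    # Observation 1: sources lie in I*
            I[i] = mean_q[i] - mu[i]
        # the next iteration recomputes T under P^I, as in Figure 3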

The idea of upstream search extends to shift-$\mathcal{I}_t$-EGs with partially directed or undirected $\mathcal{G}_T$. In this case, if one or more nodes of $\mathcal{G}_T$ are identified as sources, Observation 1 still holds. Hence, we solve a part of $I^*$ with these source nodes and then intervene on them to reduce the size of the unsolved graph.

Decomposition of Shift Interventional Essential Graphs: In order to find the source nodes, we decompose the current shift-$\mathcal{I}_t$-EG into undirected components. Hauser and Bühlmann (2014) showed that every interventional essential graph is a chain graph with chordal chain components, where the orientations in one chain component do not affect the orientations in other components. (The chain components of a chain graph are the undirected connected components after removing all its directed edges, and an undirected graph is chordal if all cycles of length greater than 3 contain a chord.) By a similar argument, we can obtain an analogous decomposition for shift interventional essential graphs and show that there is at least one chain component with no incoming edges. Let us separate out all of the chain components of the shift-$\mathcal{I}_t$-EG with no incoming edges. The following lemma proves that all sources are contained within these components.

Lemma 4.

For any shift-$\mathcal{I}$-EG of $\mathcal{G}$, each chain component has exactly one source node w.r.t. this component. This node is a source w.r.t. $\mathcal{G}$ if and only if there are no incoming edges to this component.

These results hold when replacing $\mathcal{G}$ with any induced subgraph of it. Thus, we can find the source nodes in $T$ by finding the source nodes in each of its chain components with no incoming edges.
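As a sketch of this step, assuming the essential graph is represented by a pair of graphs (its directed part and its undirected part; this two-graph representation is our choice, not a fixed API), the qualifying chain components can be extracted as follows.

import networkx as nx

def source_chain_components(directed, undirected):
    """Chain components with no incoming directed edges; by Lemma 4,
    the sources of the graph lie in exactly these components.

    directed: nx.DiGraph with the oriented edges of the essential graph;
    undirected: nx.Graph with its unoriented edges (same node set).
    """
    components = []
    for comp in nx.connected_components(undirected):
        has_incoming = any(v in comp and u not in comp
                           for u, v in directed.edges())
        if not has_incoming:
            components.append(comp)
    return components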

4.1 Two Approximate Strategies

Following the chain graph decomposition, we now focus on how to find the source node of an undirected connected chordal graph $\mathcal{C}$. If there is no sparsity constraint on the number of perturbation targets in each shift intervention, then directly intervening on all of the variables in $\mathcal{C}$ gives the source node, since by Theorem 1, all DAGs in the shift-$\mathcal{I}$-MEC share the same source node. However, when the maximum number of perturbation targets in an intervention is restricted to $S<|\mathcal{C}|$, multiple interventions may be necessary to find the source node.

After intervening on $S$ nodes, the remaining unoriented part can be decomposed into connected components. In the worst case, the source node of $\mathcal{C}$ is in the largest of these connected components. Therefore, we seek the set of nodes, within the sparsity constraint, whose removal minimizes the size of the largest remaining connected component. This is known as the MinMaxC problem (Lalou et al., 2018), which we show is NP-complete on chordal graphs (Appendix D). We propose two approximate strategies to solve this problem, one based on the clique tree representation of chordal graphs and the other based on robust supermodular optimization. The overall algorithm with these subroutines is summarized in Algorithm 1. We outline the subroutines here and give further details in Appendix D.

Input: joint distribution $\mathrm{P}$, desired joint distribution $\mathrm{Q}$, sparsity constraint $S$.
1: initialize $I^*=\varnothing$ and $\mathcal{I}=\{\varnothing\}$;
2: while $\mathbb{E}_{\mathrm{P}^{I^*}}(X)\neq\mathbb{E}_{\mathrm{Q}}(X)$ do
3:   let $T=\{i \mid i\in[p],\ \mathbb{E}_{\mathrm{P}^{I^*}}(X_i)\neq\mathbb{E}_{\mathrm{Q}}(X_i)\}$;
4:   let $\mathcal{G}$ be the subgraph of the shift-$\mathcal{I}$-EG induced by $T$;
5:   let $U_T$ be the identified source nodes in $T$;
6:   while $U_T=\varnothing$ do
7:     let $\mathcal{C}$ be a chain component of $\mathcal{G}$ with no incoming edges;
8:     select shift intervention $I$ by running CliqueTree($\mathcal{C}$, $S$) or Supermodular($\mathcal{C}$, $S$);
9:     perform $I$ and append it to $\mathcal{I}$;
10:    update $\mathcal{G}$ and $U_T$ as in the outer loop;
11:   end while
12:   set $a_i=\mathbb{E}_{\mathrm{Q}}(X_i)-\mathbb{E}_{\mathrm{P}^{I^*}}(X_i)$ for each $i\in U_T$;
13:   include perturbation targets $U_T$ and shift values $\{a_i\}_{i\in U_T}$ in $I^*$ and perform $I^*$;
14: end while
Output: matching intervention $I^*$
Algorithm 1: Active Learning for Causal Mean Matching

Clique Tree Strategy. Let $C(\mathcal{C})$ be the set of maximal cliques in the chordal graph $\mathcal{C}$. There exists a clique tree $\mathcal{T}(\mathcal{C})$ with nodes $C(\mathcal{C})$ and edges satisfying that for all $C_1,C_2\in C(\mathcal{C})$, their intersection $C_1\cap C_2$ is a subset of any clique on the unique path between $C_1$ and $C_2$ in $\mathcal{T}(\mathcal{C})$ (Blair and Peyton, 1993). Thus, deleting a clique which is not a leaf node in the clique tree breaks $\mathcal{C}$ into at least two connected components, each corresponding to a subtree of the clique tree. Inspired by the central node algorithm (Greenewald et al., 2019; Squires et al., 2020a), we find the $S$-constrained central clique of $\mathcal{T}(\mathcal{C})$ by iterating through $C(\mathcal{C})$ and returning the clique with no more than $S$ nodes that separates the graph most, i.e., solving MinMaxC when interventions are constrained to be maximal cliques. We denote this approach by CliqueTree.
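A minimal sketch of this selection step, assuming networkx: rather than building $\mathcal{T}(\mathcal{C})$ explicitly, it scores each maximal clique with at most $S$ nodes by the largest connected component left after its removal, which is the MinMaxC criterion restricted to maximal cliques (handling cliques larger than $S$ is deferred to Appendix D, as in the paper).

import networkx as nx

def clique_tree_intervention(C, S):
    """Sketch of CliqueTree: among maximal cliques of the chordal graph
    C with at most S nodes, return the one whose removal minimizes the
    largest remaining connected component."""
    best, best_score = None, float("inf")
    for clique in nx.find_cliques(C):          # maximal cliques of C
        if len(clique) > S:
            continue                           # subset handling: Appendix D
        rest = C.copy()
        rest.remove_nodes_from(clique)
        score = max((len(c) for c in nx.connected_components(rest)),
                    default=0)
        if score < best_score:
            best, best_score = set(clique), score
    return best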

Supermodular Strategy. Our second approach, denoted Supermodular, optimizes a lower bound of the objective of MinMaxC. Consider the following equivalent formulation of MinMaxC:

\min_{A\subset V_{\mathcal{C}}}\max_{i\in V_{\mathcal{C}}}f_i(A),\quad|A|\leq S, \qquad (1)

where $V_{\mathcal{C}}$ represents the nodes of $\mathcal{C}$ and, for all $i\in V_{\mathcal{C}}$, $f_i(A)=\sum_{j\in V_{\mathcal{C}}}g_{i,j}(A)$ with $g_{i,j}(A)=1$ if $i$ and $j$ are the same or connected after removing the nodes in $A$ from $\mathcal{C}$, and $g_{i,j}(A)=0$ otherwise.
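To ground this objective, the following is a brute-force reference implementation of (1) for small graphs (purely illustrative; the two strategies below exist precisely to avoid this exhaustive search).

import itertools
import networkx as nx

def minmaxc_bruteforce(C, S):
    """Exhaustively solve (1): the node set A, |A| <= S, minimizing the
    largest connected component remaining after removing A from C."""
    def largest_component(A):
        rest = C.copy()
        rest.remove_nodes_from(A)
        return max((len(c) for c in nx.connected_components(rest)),
                   default=0)
    return min(
        (set(A) for r in range(S + 1)
         for A in itertools.combinations(C.nodes(), r)),
        key=largest_component,
    )

# e.g. minmaxc_bruteforce(nx.path_graph(7), 1) returns {3}: removing the
# middle node splits the path into two components of size 3.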

MinMaxC (1) resembles the problem of robust supermodular optimization (Krause et al., 2008). Unfortunately, $f_i$ is not supermodular for chordal graphs (Appendix D). Therefore, we propose to optimize for a surrogate of $f_i$ defined as $\hat{f}_i(A)=\sum_{j\in V_{\mathcal{C}}}\hat{g}_{i,j}(A)$, where

\hat{g}_{i,j}(A)=\begin{cases}\frac{m_{i,j}(V_{\mathcal{C}}\setminus A)}{m_{i,j}(V_{\mathcal{C}})},& i\,\text{--}\,j\ \text{in}\ \mathcal{C},\\ 0,&\text{otherwise}.\end{cases} \qquad (2)

Here $m_{i,j}(V_{\mathcal{C}'})$ is the number of paths without cycles between $i$ and $j$ in $\mathcal{C}'$ (deemed 0 if $i$ or $j$ does not belong to $\mathcal{C}'$, and 1 if $i=j\in\mathcal{C}'$), and $i\,\text{--}\,j$ means that $i$ is either connected to or equal to $j$. Comparing $\hat{g}_{i,j}$ with $g_{i,j}$, we see that $\hat{f}_i(A)$ is a lower bound of $f_i(A)$ for MinMaxC, which is tight when $\mathcal{C}$ is a tree. We show that $\hat{f}_i$ is monotone supermodular for all $i$ (Appendix D). Therefore, we consider (1) with $f_i$ replaced by $\hat{f}_i$, which can be solved using the SATURATE algorithm (Krause et al., 2008). Notably, the results returned by Supermodular can be quite different from those returned by CliqueTree, since Supermodular is not constrained to pick a maximal clique; see Figure 4.
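A sketch of the surrogate (2) follows, assuming networkx; the exponential-time simple-path counting and the reading of $i\,\text{--}\,j$ as connectivity in $\mathcal{C}$ are illustrative assumptions, and the SATURATE search itself is not reproduced here.

import networkx as nx

def count_simple_paths(G, i, j):
    """m_{i,j}: number of cycle-free paths between i and j in G
    (0 if i or j is missing from G, 1 if i == j)."""
    if i not in G or j not in G:
        return 0
    if i == j:
        return 1
    return sum(1 for _ in nx.all_simple_paths(G, i, j))

def f_hat(C, i, A):
    """Surrogate f_hat_i(A) from (2): sums the surviving fraction of
    simple paths between i and every other node after removing A."""
    rest = C.subgraph(set(C.nodes()) - set(A))
    total = 0.0
    for j in C.nodes():
        m_full = count_simple_paths(C, i, j)
        if m_full:                     # i -- j holds in the connected C
            total += count_simple_paths(rest, i, j) / m_full
    return total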

Figure 4: Picking 2 nodes in an undirected connected chordal graph $\mathcal{C}$. CliqueTree picks $\{X_4,X_5\}$, while Supermodular picks the better $\{X_3,X_6\}$. (a) $\mathcal{C}$; (b) clique tree $\mathcal{T}(\mathcal{C})$.

5 Theoretical Results

In this section we derive a worst-case lower bound on the number of interventions for any algorithm to identify the source node in a chordal graph. Then we use this lower bound to show that our strategies are optimal up to a logarithmic factor. This contrasts with the structure learning strategy, which may require exponentially more interventions than our strategy (Figure 1).

The worst case is with respect to all feasible orientations of an essential graph (Hauser and Bühlmann, 2014; Shanmugam et al., 2015), i.e., orientations corresponding to DAGs in the equivalence class. Given a chordal chain component $\mathcal{C}$ of $\mathcal{G}$, let $r_{\mathcal{C}}$ be the number of maximal cliques in $\mathcal{C}$, and $m_{\mathcal{C}}$ the size of the largest maximal clique in $\mathcal{C}$. The following lemma provides a lower bound depending only on $m_{\mathcal{C}}$.

Lemma 5.

In the worst case over feasible orientations of $\mathcal{C}$, any algorithm requires at least $\lceil\frac{m_{\mathcal{C}}-1}{S}\rceil$ shift interventions to identify the source node, under the sparsity constraint $S$.

To give some intuition for this result, consider the case where the largest maximal clique is upstream of all other maximal cliques. Given such an ordering, in the worst case, each intervention rules out only $S$ nodes in this clique (namely, the most downstream ones). Now, we show that our two strategies need at most $\lceil\log_2(r_{\mathcal{C}}+1)\rceil\cdot\lceil\frac{m_{\mathcal{C}}-1}{S}\rceil$ shift interventions for the same task.

Lemma 6.

In the worst case over feasible orientations of $\mathcal{C}$, both CliqueTree and Supermodular require at most $\lceil\log_2(r_{\mathcal{C}}+1)\rceil\cdot\lceil\frac{m_{\mathcal{C}}-1}{S}\rceil$ shift interventions to identify the source node, under the sparsity constraint $S$.

By combining Lemma 5 and Lemma 6, which consider subproblems of the causal mean matching problem, we obtain a bound on the number of shift interventions needed to solve the full causal mean matching problem. Let $r$ be the largest $r_{\mathcal{C}}$ over all chain components $\mathcal{C}$ of $\mathcal{G}$:

Theorem 2.

Algorithm 1 requires at most $\lceil\log_2(r+1)\rceil$ times more shift interventions than the optimal strategy, in the worst case over feasible orientations of $\mathcal{G}$.

A direct application of this theorem is that, in terms of the number of interventions required to solve the causal mean matching problem, our algorithm is optimal in the worst case when $r=1$, i.e., when every chain component is a clique. All proofs are provided in Appendix E.

6 Experiments

We now evaluate our algorithms in several synthetic settings. (Code is publicly available at: https://github.com/uhlerlab/causal_mean_matching.) Each setting considers a particular graph type, number of nodes $p$ in the graph, and number of perturbation targets $|I^*|\leq p$ in the matching intervention. We generate 100 problem instances in each setting. Every problem instance contains a DAG with $p$ nodes generated according to the graph type and a randomly sampled subset of $|I^*|$ nodes denoting the perturbation targets of the matching intervention. We consider both random graphs, including Erdős–Rényi graphs (Erdős and Rényi, 1960) and Barabási–Albert graphs (Albert and Barabási, 2002), and structured chordal graphs, in particular rooted tree graphs and moralized Erdős–Rényi graphs (Shanmugam et al., 2015). The graph size $p$ in our simulations ranges from 10 to 1000, while the number of perturbation targets ranges from 1 to $\min\{p,100\}$.
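For concreteness, instances of this kind can be generated along the following lines (a sketch assuming networkx; the exact generation protocol, e.g., edge densities and seeds, is our assumption and may differ from the paper's).

import networkx as nx
import numpy as np

rng = np.random.default_rng(0)

def random_dag(p, kind="erdos_renyi"):
    """Orient a random undirected graph along the node order
    0 < 1 < ... < p-1 to obtain a DAG (a common construction)."""
    if kind == "erdos_renyi":
        ug = nx.erdos_renyi_graph(p, 3.0 / p, seed=0)    # density assumed
    else:
        ug = nx.barabasi_albert_graph(p, 2, seed=0)      # m = 2 assumed
    dag = nx.DiGraph((min(u, v), max(u, v)) for u, v in ug.edges())
    dag.add_nodes_from(range(p))
    return dag

dag = random_dag(100, kind="barabasi_albert")
targets = rng.choice(100, size=50, replace=False)        # nodes of I*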

We compare our two subroutines for Algorithm 1, CliqueTree and Supermodular, against three carefully constructed baselines. The UpstreamRand baseline follows Algorithm 1 with line 8 changed to selecting $I$ randomly from $\mathcal{C}$ without exceeding $S$; i.e., when there is no identified source node, it samples randomly from the chain component with no incoming edges. This strategy highlights how much benefit CliqueTree and Supermodular provide on top of upstream search. The Coloring baseline is modified from the coloring-based policy for structure learning (Shanmugam et al., 2015), previously shown to perform competitively on large graphs (Squires et al., 2020a). It first performs structure learning with the coloring-based policy and then uses upstream search with the known DAG. We also include an Oracle baseline, which performs upstream search with the known DAG.

In Figure 5 we present a subset of our results on Barabási–Albert graphs with 100 nodes; similar behavior is observed in all other settings, shown in Appendix F. In Figure 5(a), we consider problem instances with varying $|I^*|$. Each algorithm is run with sparsity constraint $S=1$. We plot the number of extra interventions compared to Oracle, averaged across the 100 problem instances. As expected, Coloring requires the largest number of extra interventions. This finding is consistent across different numbers of perturbation targets, since the same number of interventions is used to learn the structure regardless of $I^*$. As $|I^*|$ increases, CliqueTree and Supermodular outperform UpstreamRand. To further investigate this trend, we plot the rate of extra interventions used by CliqueTree and Supermodular relative to UpstreamRand in Figure 5(b). (The rate is calculated as (#Strategy − #UpstreamRand)/#UpstreamRand, where # denotes the number of extra interventions compared to Oracle and Strategy is CliqueTree, Supermodular or UpstreamRand.) This figure shows that CliqueTree and Supermodular improve upon upstream search by up to 25% as the number of perturbation targets increases. Finally, we consider the effect of the sparsity constraint $S$ in Figure 5(c) with $|I^*|=50$. In line with the discussion in Section 4.1, as $S$ increases, the task becomes easier for plain upstream search. However, when the number of perturbation targets is restricted, CliqueTree and Supermodular are superior, with Supermodular performing best in most cases.

Figure 5: Barabási–Albert graphs with 100 nodes. (a) Average (over 100 instances) number of extra interventions each algorithm (with sparsity constraint $S=1$) requires compared to Oracle, plotted against the number of perturbation targets in $I^*$; (b) rate of extra interventions required by CliqueTree and Supermodular ($S=1$) relative to UpstreamRand, plotted against the number of perturbation targets in $I^*$; (c) relative extra rate ($|I^*|=50$), plotted against the sparsity constraint $S$.

7 Discussion

In this work, we introduced the causal mean matching problem, which has important applications in medicine and engineering. We aimed to develop active learning approaches for identifying the matching intervention using shift interventions. Towards this end, we characterized the shift interventional Markov equivalence class and showed that it is in general more refined than previously defined equivalence classes. We proposed two strategies for learning the matching intervention based on this characterization, and showed that they are optimal up to a logarithmic factor. We reported experimental results on a range of settings to support these theoretical findings.

Limitations and Future Work. This work has various limitations that may be interesting to address in future work. First, we focus on the task of matching a desired mean, rather than an entire distribution. This is an inherent limitation of deterministic shift interventions: as noted by Hyttinen et al. (2012), in the linear Gaussian setting, these interventions can only modify the mean of the initial distribution. Thus, matching the entire distribution, or other relevant statistics, will require broader classes of interventions. Assumptions on the desired distribution are also required to rule out possibly non-realizable cases. Second, we have focused on causal DAG models, which assume acyclicity and the absence of latent confounders. In many realistic applications, this could be an overly optimistic assumption, requiring extensions of our results to the cyclic and/or causally insufficient setting. Finally, throughout the main text, we have focused on the noiseless setting; we briefly discuss the noisy setting in Appendix G, but there is much room for more extensive investigations.

Acknowledgements

C. Squires was partially supported by an NSF Graduate Fellowship. All authors were partially supported by NSF (DMS-1651995), ONR (N00014-17-1-2147 and N00014-18-1-2765), the MIT-IBM Watson AI Lab, and a Simons Investigator Award to C. Uhler.

References

  • Agrawal et al., (2019) Agrawal, R., Squires, C., Yang, K., Shanmugam, K., and Uhler, C. (2019). Abcd-strategy: Budgeted experimental design for targeted causal structure discovery. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 3400–3409. PMLR.
  • Albert and Barabási, (2002) Albert, R. and Barabási, A.-L. (2002). Statistical mechanics of complex networks. Reviews of modern physics, 74(1):47.
  • Andersson et al., (1997) Andersson, S. A., Madigan, D., Perlman, M. D., et al. (1997). A characterization of markov equivalence classes for acyclic digraphs. Annals of statistics, 25(2):505–541.
  • Badsha et al., (2019) Badsha, M., Fu, A. Q., et al. (2019). Learning causal biological networks with the principle of mendelian randomization. Frontiers in genetics, 10:460.
  • Blair and Peyton, (1993) Blair, J. R. and Peyton, B. (1993). An introduction to chordal graphs and clique trees. In Graph theory and sparse matrix computation, pages 1–29. Springer.
  • Bubeck and Cesa-Bianchi, (2012) Bubeck, S. and Cesa-Bianchi, N. (2012). Regret analysis of stochastic and nonstochastic multi-armed bandit problems. In Foundations and Trends® in Machine Learning, pages 1–122.
  • Cahan et al., (2014) Cahan, P., Li, H., Morris, S. A., Da Rocha, E. L., Daley, G. Q., and Collins, J. J. (2014). Cellnet: network biology applied to stem cell engineering. Cell, 158(4):903–915.
  • de Kroon et al., (2020) de Kroon, A. A., Belgrave, D., and Mooij, J. M. (2020). Causal discovery for causal bandits utilizing separating sets. arXiv preprint arXiv:2009.07916.
  • Dominguez et al., (2016) Dominguez, A. A., Lim, W. A., and Qi, L. S. (2016). Beyond editing: repurposing crispr–cas9 for precision genome regulation and interrogation. Nature reviews Molecular cell biology, 17(1):5.
  • D’Alessio et al., (2015) D’Alessio, A. C., Fan, Z. P., Wert, K. J., Baranov, P., Cohen, M. A., Saini, J. S., Cohick, E., Charniga, C., Dadon, D., Hannett, N. M., et al. (2015). A systematic approach to identify candidate transcription factors that control cell identity. Stem cell reports, 5(5):763–775.
  • Erdős and Rényi, (1960) Erdős, P. and Rényi, A. (1960). On the evolution of random graphs. Publ. Math. Inst. Hung. Acad. Sci, 5(1):17–60.
  • Friedman et al., (2000) Friedman, N., Linial, M., Nachman, I., and Pe’er, D. (2000). Using bayesian networks to analyze expression data. Journal of computational biology, 7(3-4):601–620.
  • Ghassami et al., (2018) Ghassami, A., Salehkaleybar, S., Kiyavash, N., and Bareinboim, E. (2018). Budgeted experiment design for causal structure learning. In International Conference on Machine Learning, pages 1724–1733. PMLR.
  • Greenewald et al., (2019) Greenewald, K., Katz, D., Shanmugam, K., Magliacane, S., Kocaoglu, M., Boix Adsera, E., and Bresler, G. (2019). Sample efficient active learning of causal trees. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc.
  • Hauser and Bühlmann, (2012) Hauser, A. and Bühlmann, P. (2012). Characterization and greedy learning of interventional markov equivalence classes of directed acyclic graphs. The Journal of Machine Learning Research, 13(1):2409–2464.
  • Hauser and Bühlmann, (2014) Hauser, A. and Bühlmann, P. (2014). Two optimal strategies for active learning of causal models from interventional data. International Journal of Approximate Reasoning, 55(4):926–939.
  • He and Geng, (2008) He, Y.-B. and Geng, Z. (2008). Active learning of causal networks with intervention experiments and optimal designs. Journal of Machine Learning Research, 9(Nov):2523–2547.
  • Hyttinen et al., (2012) Hyttinen, A., Eberhardt, F., and Hoyer, P. O. (2012). Learning linear cyclic causal models with latent variables. The Journal of Machine Learning Research, 13(1):3387–3439.
  • Hyttinen et al., (2013) Hyttinen, A., Eberhardt, F., and Hoyer, P. O. (2013). Experiment selection for causal discovery. Journal of Machine Learning Research, 14:3041–3071.
  • Jaber et al., (2020) Jaber, A., Kocaoglu, M., Shanmugam, K., and Bareinboim, E. (2020). Causal discovery from soft interventions with unknown targets: Characterization and learning. Advances in Neural Information Processing Systems, 33.
  • Kocaoglu et al., (2017) Kocaoglu, M., Dimakis, A., and Vishwanath, S. (2017). Cost-optimal learning of causal graphs. In International Conference on Machine Learning, pages 1875–1884. PMLR.
  • Koller and Friedman, (2009) Koller, D. and Friedman, N. (2009). Probabilistic graphical models: principles and techniques. MIT press.
  • Krause et al., (2008) Krause, A., McMahan, H. B., Guestrin, C., and Gupta, A. (2008). Robust submodular observation selection. Journal of Machine Learning Research, 9(12).
  • Lalou et al., (2018) Lalou, M., Tahraoui, M. A., and Kheddouci, H. (2018). The critical node detection problem in networks: A survey. Computer Science Review, 28:92–117.
  • Lattimore et al., (2016) Lattimore, F., Lattimore, T., and Reid, M. D. (2016). Causal bandits: Learning good interventions via causal inference. In Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc.
  • Lee and Bareinboim, (2018) Lee, S. and Bareinboim, E. (2018). Structural causal bandits: where to intervene? Advances in Neural Information Processing Systems, 31.
  • Meek, (1995) Meek, C. (1995). Causal inference and causal explanation with background knowledge. In Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, UAI’95, page 403–410, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.
  • Rackham et al., (2016) Rackham, O. J., Firas, J., Fang, H., Oates, M. E., Holmes, M. L., Knaupp, A. S., Suzuki, H., Nefzger, C. M., Daub, C. O., Shin, J. W., et al. (2016). A predictive computational framework for direct reprogramming between human cell types. Nature genetics, 48(3):331.
  • Rothenhäusler et al., (2015) Rothenhäusler, D., Heinze, C., Peters, J., and Meinshausen, N. (2015). Backshift: Learning causal cyclic graphs from unknown shift interventions. In Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc.
  • Shanmugam et al., (2015) Shanmugam, K., Kocaoglu, M., Dimakis, A. G., and Vishwanath, S. (2015). Learning causal graphs with small interventions. In Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc.
  • Spirtes et al., (2000) Spirtes, P., Glymour, C. N., Scheines, R., and Heckerman, D. (2000). Causation, prediction, and search. MIT press.
  • (32) Squires, C., Magliacane, S., Greenewald, K., Katz, D., Kocaoglu, M., and Shanmugam, K. (2020a). Active structure learning of causal dags via directed clique trees. In Advances in Neural Information Processing Systems, volume 33, pages 21500–21511. Curran Associates, Inc.
  • (33) Squires, C., Wang, Y., and Uhler, C. (2020b). Permutation-based causal structure learning with unknown intervention targets. In Conference on Uncertainty in Artificial Intelligence, pages 1039–1048. PMLR.
  • Touchette and Lloyd, (2004) Touchette, H. and Lloyd, S. (2004). Information-theoretic approach to the study of control systems. Physica A: Statistical Mechanics and its Applications, 331(1-2):140–172.
  • Triantafillou et al., (2017) Triantafillou, S., Lagani, V., Heinze-Deml, C., Schmidt, A., Tegner, J., and Tsamardinos, I. (2017). Predicting causal relationships from biological data: Applying automated causal discovery on mass cytometry data of human immune cells. Scientific reports, 7(1):1–11.
  • Verma and Pearl, (1991) Verma, T. and Pearl, J. (1991). Equivalence and synthesis of causal models. UCLA, Computer Science Department.
  • Vierbuchen et al., (2010) Vierbuchen, T., Ostermeier, A., Pang, Z. P., Kokubu, Y., Südhof, T. C., and Wernig, M. (2010). Direct conversion of fibroblasts to functional neurons by defined factors. Nature, 463(7284):1035–1041.
  • Wodo et al., (2015) Wodo, O., Zola, J., Pokuri, B. S. S., Du, P., and Ganapathysubramanian, B. (2015). Automated, high throughput exploration of process–structure–property relationships using the mapreduce paradigm. Materials discovery, 1:21–28.
  • Yabe et al., (2018) Yabe, A., Hatano, D., Sumita, H., Ito, S., Kakimura, N., Fukunaga, T., and Kawarabayashi, K.-i. (2018). Causal bandits with propagating inference. In International Conference on Machine Learning, pages 5512–5520. PMLR.
  • Yang et al., (2018) Yang, K., Katcoff, A., and Uhler, C. (2018). Characterizing and learning equivalence classes of causal dags under interventions. In International Conference on Machine Learning, pages 5541–5550. PMLR.

Appendix A Preliminaries

A.1 Meek Rules

Given any Markov equivalence class of DAGs with shared directed and undirected edges, the corresponding essential graph $\mathcal{E}$ can be obtained using a set of logical relations known as the Meek rules (Meek, 1995). The Meek rules are stated in the following proposition.

Proposition 1 (Meek Rules (Meek, 1995)).

We can infer all directed edges in $\mathcal{E}$ using the following four rules:

  1. If $i\rightarrow j-k$ and $i$ is not adjacent to $k$, then $j\rightarrow k$.

  2. If $i\rightarrow j\rightarrow k$ and $i-k$, then $i\rightarrow k$.

  3. If $i-j$, $i-k$, $i-l$, $j\rightarrow k$, $l\rightarrow k$ and $j$ is not adjacent to $l$, then $i\rightarrow k$.

  4. If $i-j$, $i-k$, $i-l$, $j\leftarrow k$, $l\rightarrow k$ and $j$ is not adjacent to $l$, then $i\rightarrow j$.

Figure 6 illustrates these four rules.

Figure 6: Meek rules R1–R4.
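As an illustration, the first rule can be implemented as a single orientation pass over a partially directed graph; the two-graph representation (directed edges in one graph, undirected edges in another) is an assumption made for this sketch.

def apply_meek_rule_1(directed, undirected):
    """One pass of Meek rule 1: for each directed edge i -> j and
    undirected edge j - k with i not adjacent to k, orient j -> k.
    Returns True if any edge was oriented. `directed` is a DiGraph-like
    object, `undirected` a Graph-like object on the same nodes."""
    changed = False
    for i, j in list(directed.edges()):
        for k in list(undirected.neighbors(j)):
            adjacent = (undirected.has_edge(i, k)
                        or directed.has_edge(i, k)
                        or directed.has_edge(k, i))
            if k != i and not adjacent:
                undirected.remove_edge(j, k)
                directed.add_edge(j, k)
                changed = True
    return changed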

Appendix B Proof of Exact Matching

Proof of Lemma 1.

Without loss of generality, assume $1,2,\dots,p$ is a topological order of the underlying DAG $\mathcal{G}$, i.e., $j\in\operatorname{pa}_{\mathcal{G}}(i)$ implies $j<i$. We will first construct $I^*$ such that $\mathbb{E}_{\mathrm{P}^{I^*}}(X)=\mathbb{E}_{\mathrm{Q}}(X)$, and then show that $I^*$ is unique.

Existence: Denote by $i_1$ the smallest $i\in[p]$ such that $\mathbb{E}_{\mathrm{P}}(X_i)\neq\mathbb{E}_{\mathrm{Q}}(X_i)$. Without loss of generality, we assume that $i_1$ exists (if $i_1$ does not exist, then $I^*=\varnothing$ suffices since $\mathbb{E}_{\mathrm{P}}(X)=\mathbb{E}_{\mathrm{Q}}(X)$).

Let $I_1$ be the shift intervention with perturbation target $i_1$ and shift value $a_{i_1}=\mathbb{E}_{\mathrm{Q}}(X_{i_1})-\mathbb{E}_{\mathrm{P}}(X_{i_1})$. Since $\mathrm{P}^{I_1}(X_{i_1}=x+a_{i_1}|X_{\operatorname{pa}_{\mathcal{G}}(i_1)})=\mathrm{P}(X_{i_1}=x|X_{\operatorname{pa}_{\mathcal{G}}(i_1)})$ and $\mathrm{P}^{I_1}(X_{\operatorname{pa}_{\mathcal{G}}(i_1)})=\mathrm{P}(X_{\operatorname{pa}_{\mathcal{G}}(i_1)})$ by definition, we have

\mathrm{P}^{I_1}(X_{i_1}=x+a_{i_1})=\mathrm{P}(X_{i_1}=x).

Thus $\mathbb{E}_{\mathrm{P}^{I_1}}(X_{i_1})=\mathbb{E}_{\mathrm{P}}(X_{i_1})+a_{i_1}=\mathbb{E}_{\mathrm{Q}}(X_{i_1})$. Also, $\mathbb{E}_{\mathrm{P}^{I_1}}(X_i)=\mathbb{E}_{\mathrm{Q}}(X_i)$ for $i<i_1$. Denote by $i_2$ the smallest $i\in[p]$ such that $\mathbb{E}_{\mathrm{P}^{I_1}}(X_i)\neq\mathbb{E}_{\mathrm{Q}}(X_i)$. If $i_2$ does not exist, then $I^*=I_1$ suffices. Otherwise, $i_2>i_1$.

Let $I_2$ be the shift intervention with perturbation targets $i_1,i_2$ and corresponding shift values $a_{i_1}$ and $a_{i_2}=\mathbb{E}_{\mathrm{Q}}(X_{i_2})-\mathbb{E}_{\mathrm{P}^{I_1}}(X_{i_2})$. We have $\mathrm{P}^{I_2}(X_{i_2}=x+a_{i_2}|X_{\operatorname{pa}_{\mathcal{G}}(i_2)})=\mathrm{P}(X_{i_2}=x|X_{\operatorname{pa}_{\mathcal{G}}(i_2)})=\mathrm{P}^{I_1}(X_{i_2}=x|X_{\operatorname{pa}_{\mathcal{G}}(i_2)})$ and $\mathrm{P}^{I_2}(X_{\operatorname{pa}_{\mathcal{G}}(i_2)})=\mathrm{P}^{I_1}(X_{\operatorname{pa}_{\mathcal{G}}(i_2)})$ by definition, the topological order, and $i_2>i_1$. Then

\mathrm{P}^{I_2}(X_{i_2}=x+a_{i_2})=\mathrm{P}^{I_1}(X_{i_2}=x).

Thus $\mathbb{E}_{\mathrm{P}^{I_2}}(X_{i_2})=\mathbb{E}_{\mathrm{P}^{I_1}}(X_{i_2})+a_{i_2}=\mathbb{E}_{\mathrm{Q}}(X_{i_2})$. Also, $\mathbb{E}_{\mathrm{P}^{I_2}}(X_i)=\mathbb{E}_{\mathrm{P}^{I_1}}(X_i)=\mathbb{E}_{\mathrm{Q}}(X_i)$ for $i<i_2$. By iterating this process, we reach $I_k$ for some $k\leq p$ such that there is no $i$ with $\mathbb{E}_{\mathrm{P}^{I_k}}(X_i)\neq\mathbb{E}_{\mathrm{Q}}(X_i)$. Taking $I^*=I_k$ suffices.

Uniqueness: Suppose there exist $I_1^*\neq I_2^*$ such that $\mathbb{E}_{\mathrm{P}^{I_1^*}}(X)=\mathbb{E}_{\mathrm{P}^{I_2^*}}(X)=\mathbb{E}_{\mathrm{Q}}(X)$. Let $i\in[p]$ be the smallest index such that either $i$ has different shift values in $I_1^*$ and $I_2^*$, or $i$ is a perturbation target of only one of the two interventions. In either case, we have $\mathrm{P}^{I_1^*}(X_{\operatorname{pa}_{\mathcal{G}}(i)})=\mathrm{P}^{I_2^*}(X_{\operatorname{pa}_{\mathcal{G}}(i)})$ by the topological order, and $\mathrm{P}^{I_1^*}(X_i=x|X_{\operatorname{pa}_{\mathcal{G}}(i)})=\mathrm{P}^{I_2^*}(X_i=x+a|X_{\operatorname{pa}_{\mathcal{G}}(i)})$ for some $a\neq 0$. Thus $\mathrm{P}^{I_1^*}(X_i=x)=\mathrm{P}^{I_2^*}(X_i=x+a)$, contradicting $\mathbb{E}_{\mathrm{P}^{I_1^*}}(X_i)=\mathbb{E}_{\mathrm{P}^{I_2^*}}(X_i)$. ∎

Appendix C Proof of Identifiability

C.1 Shift Interventional MEC

Proof of Lemma 2.

For any distribution $f$ that factorizes according to $\mathcal{G}$ and any shift intervention $I$, let $i\in I$ be a source w.r.t. $I$. By definition, $\operatorname{an}_{\mathcal{G}}(i)\cap I=\varnothing$. Thus $\operatorname{pa}_{\mathcal{G}}(i)$ contains neither a member nor a descendant of $I$, i.e., there do not exist $j\in\operatorname{pa}_{\mathcal{G}}(i)$ and $k\in I$ such that there is a directed path from $k$ to $j$ or $k=j$. Hence we have $f^I(X_{\operatorname{pa}_{\mathcal{G}}(i)})=f(X_{\operatorname{pa}_{\mathcal{G}}(i)})$, which gives

f^I(X_i=x+a_i)=f(X_i=x).

Therefore $\mathbb{E}_{f^I}(X_i)=\mathbb{E}_f(X_i)+a_i$.

On the other hand, if $i\in I$ is not a source w.r.t. $I$, consider the following linear Gaussian model:

X_j=\sum_{k\in\operatorname{pa}_{\mathcal{G}}(j)}\beta_{kj}X_k+\epsilon_j,\quad\forall j\in[p],

where the $\beta_{kj}$ are deterministic scalars and the $\epsilon_j\sim\mathcal{N}(0,1)$ are i.i.d. random variables.

Since $i$ is not a source w.r.t. $I$, there exists a source $i'$ w.r.t. $I$ such that there is a directed path $i'=i_0\to i_1\to\dots\to i_\ell=i$. From the above, $\mathbb{E}_{f^I}(X_{i'})=\mathbb{E}_f(X_{i'})+a_{i'}$ for $a_{i'}\neq 0$. Consider setting $\beta_{i_0,i_1}=2|a_i|/a_{i'}$, $\beta_{i_k,i_{k+1}}=1$ for $k=1,\dots,\ell-1$, and the remaining edge weights to $\epsilon>0$. For $\epsilon$ sufficiently small, we have that $\mathbb{E}_{f^I}(X_i)\geq\mathbb{E}_f(X_i)+1.5|a_i|$, i.e., we cannot have $\mathbb{E}_{f^I}(X_i)=\mathbb{E}_f(X_i)+a_i$. ∎

Proof of Theorem 1.

Denote $\mathcal{I}=\{I_{1},...,I_{m}\}$. For $k\in[m]=\{1,...,m\}$, let $\hat{I}_{k}$ and $\hat{I}_{k}^{\prime}$ be the collections of source nodes of $I_{k}$ in $\mathcal{G}_{1}$ and $\mathcal{G}_{2}$, respectively. From Definition 1, we know that $\mathcal{G}_{1}$ and $\mathcal{G}_{2}$ are in the same shift-$\mathcal{I}$-MEC if and only if they are in the same $\mathcal{I}$-MEC and, for any pair $(f,\{f^{I_{k}}\}_{k\in[m]})$ that is $\mathcal{I}$-Markov w.r.t. both $\mathcal{G}_{1}$ and $\mathcal{G}_{2}$, it satisfies

\[
\mathbb{E}_{f^{I_{k}}}(X_{i})=\mathbb{E}_{f}(X_{i})+a_{i},\quad\forall i\in\hat{I}_{k},\forall k\in[m], \tag{3}
\]

if and only if it also satisfies

\[
\mathbb{E}_{f^{I_{k}}}(X_{i})=\mathbb{E}_{f}(X_{i})+a_{i},\quad\forall i\in\hat{I}_{k}^{\prime},\forall k\in[m]. \tag{4}
\]

By Lemma 2, we know that $\hat{I}_{k}^{\prime}\subset\hat{I}_{k}$ for all $k\in[m]$: otherwise we could find a pair $(f,\{f^{I_{k}}\}_{k\in[m]})$ that violates (4) for some $i\in\hat{I}_{k}^{\prime}\setminus\hat{I}_{k}$. Similarly, $\hat{I}_{k}\subset\hat{I}_{k}^{\prime}$. Therefore $\hat{I}_{k}=\hat{I}_{k}^{\prime}$, in which case (3) is equivalent to (4).

Hence, $\mathcal{G}_{1}$ and $\mathcal{G}_{2}$ are in the same shift-$\mathcal{I}$-MEC if and only if they are in the same $\mathcal{I}$-MEC and they have the same source nodes of $I$ for every $I\in\mathcal{I}$. From Theorem 3.9 in \citetappendix{yang2018characterizing}, we know that $\mathcal{G}_{1}$ and $\mathcal{G}_{2}$ are in the same $\mathcal{I}$-MEC if and only if they share the same skeleton, $v$-structures, and directed edges $\{i\rightarrow j\,|\,i\in I,j\notin I,I\in\mathcal{I},i-j\}$. Therefore, $\mathcal{G}_{1}$ and $\mathcal{G}_{2}$ are in the same shift-$\mathcal{I}$-MEC if and only if they have the same skeleton, $v$-structures, directed edges $\{i\rightarrow j\,|\,i\in I,j\notin I,I\in\mathcal{I},i-j\}$, as well as the same source nodes of $I$ for every $I\in\mathcal{I}$. ∎

Let $\mathcal{D}$ be any DAG, suppose $\mathcal{I}=\{I_{1},...,I_{m}\}$, and let $\hat{I}_{k}$ be the collection of source nodes of $I_{k}$ in $\mathcal{D}$ for $k\in[m]$. Then, as a direct corollary of Theorem 1, we can represent a shift interventional Markov equivalence class by a (general) interventional Markov equivalence class.

Corollary 1.

Let $\hat{\mathcal{I}}=\mathcal{I}\cup\{\hat{I}_{k}\,|\,k\in[m]\}$; a DAG $\mathcal{D}^{\prime}$ is shift-$\mathcal{I}$-Markov equivalent to $\mathcal{D}$ if and only if $\mathcal{D}^{\prime}$ is $\hat{\mathcal{I}}$-Markov equivalent to $\mathcal{D}$.

Proof.

The proof follows as a direct application of Theorem 1, Theorem 3.9 in \citetappendix{yang2018characterizing}, and the fact that there are no edges between nodes in $\hat{I}_{k}$. ∎

C.2 Mean Interventional Faithfulness

Proof of Lemma 3.

If Assumption 1 holds, then for any $i\notin T$, since $\mathbb{E}_{{\mathrm{P}}}(X_{i})=\mathbb{E}_{{\mathrm{Q}}}(X_{i})$, we have $i\notin I^{*}$ and $\operatorname{an}_{\mathcal{G}}(i)\cap I^{*}=\varnothing$. Let $j\in T$ be such that there is an edge $i-j$ between $i$ and $j$. Since $\mathbb{E}_{{\mathrm{P}}}(X_{j})\neq\mathbb{E}_{{\mathrm{Q}}}(X_{j})$, either $j\in I^{*}$ or $\operatorname{an}_{\mathcal{G}}(j)\cap I^{*}\neq\varnothing$. Therefore, if $j\rightarrow i$, then $\operatorname{an}_{\mathcal{G}}(i)\cap I^{*}\neq\varnothing$, a contradiction. Thus $j\leftarrow i$.

Conversely, if Assumption 1 does not hold, then there exists $i\notin T$ (i.e., $\mathbb{E}_{{\mathrm{P}}}(X_{i})=\mathbb{E}_{{\mathrm{Q}}}(X_{i})$) such that either $i\in I^{*}$ or $\operatorname{an}_{\mathcal{G}}(i)\cap I^{*}\neq\varnothing$. If $i\in I^{*}$, then since $\mathbb{E}_{{\mathrm{P}}}(X_{i})=\mathbb{E}_{{\mathrm{Q}}}(X_{i})$, Lemma 2 implies that $i$ cannot be a source w.r.t. $I^{*}$, and hence $\operatorname{an}_{\mathcal{G}}(i)\cap I^{*}\neq\varnothing$ by the definition of a source. Therefore we only need to discuss the case where $i\notin T$ and $\operatorname{an}_{\mathcal{G}}(i)\cap I^{*}\neq\varnothing$.

Let $k$ be a source of $\operatorname{an}_{\mathcal{G}}(i)\cap I^{*}$; then $k$ must also be a source of $I^{*}$. Otherwise, there is a directed path from some $k^{\prime}\in I^{*}$ to $k$ with $k^{\prime}\neq k$. Since $k\in\operatorname{an}_{\mathcal{G}}(i)$, we also have $k^{\prime}\in\operatorname{an}_{\mathcal{G}}(i)$ by the definition of ancestors. Therefore $k^{\prime}\in\operatorname{an}_{\mathcal{G}}(i)\cap I^{*}$, which violates $k$ being a source of $\operatorname{an}_{\mathcal{G}}(i)\cap I^{*}$.

Since $k$ is a source of $I^{*}$, by Lemmas 1 and 2, we know that $\mathbb{E}_{{\mathrm{P}}}(X_{k})\neq\mathbb{E}_{{\mathrm{Q}}}(X_{k})$, i.e., $k\in T$. Notice that $k\in\operatorname{an}_{\mathcal{G}}(i)$, and thus there is a directed path from $k\in T$ to $i\notin T$. Thus, there exists an edge $i-j$ with $j\in T$, $i\notin T$ such that $j\rightarrow i$. ∎

Using Lemma 3, we can check whether Assumption 1 holds by looking at the orientations of the edges between $T$ and $[p]\setminus T$, which is achievable by any (general) intervention on $X_{T}$ (or $X_{[p]\setminus T}$).

Corollary 2.

Assumption 1 holds if and only if the $\{T\}$-essential graph (or $\{[p]\setminus T\}$-essential graph) of $\mathcal{G}$ has the edge $j\leftarrow i$ for every $i-j$ with $j\in T$, $i\notin T$.

Proof.

The proof follows as a direct application of the graphical characterization of the interventional equivalence class in Section 3.1 and the result of Lemma 3. ∎

Appendix D Details of Algorithms

D.1 Decomposition of Shift Interventional Essential Graphs

Chain Graph Decomposition: \citetappendix{hauser2014two} showed that every interventional essential graph is a chain graph with undirected connected chordal chain components, where the orientations in one chain component do not affect any other component. This decomposition also holds for shift interventional essential graphs, since every shift interventional essential graph is also an interventional essential graph (Corollary 1). Figure 7 shows an example of this decomposition.

Figure 7: Chain graph decomposition: (a) the essential graph; (b)–(d) its chain components 1–3.
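Concretely, the chain components are simply the connected components of the graph restricted to its undirected edges; a minimal sketch of this step (ours, using NetworkX as in our implementation) is:

```python
import networkx as nx

def chain_components(nodes, undirected_edges):
    # Chain components of a chain graph: connected components of the
    # subgraph consisting of the undirected edges only.
    U = nx.Graph()
    U.add_nodes_from(nodes)
    U.add_edges_from(undirected_edges)
    return list(nx.connected_components(U))

# Example: 1 - 2 undirected, node 3 attached only by directed edges.
print(chain_components([1, 2, 3], [(1, 2)]))  # [{1, 2}, {3}]
```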
Proof of Lemma 4.

Suppose an undirected connected chain component $\mathcal{C}$ of the essential graph has two source nodes $i$ and $j$ w.r.t. $\mathcal{C}$. Since $\mathcal{C}$ is connected, there is a path between $i$ and $j$ in $\mathcal{C}$; let $i-k_{1}-...-k_{r}-j$ be the shortest such path. Because $i$ and $j$ are sources of $\mathcal{C}$, we must have $i\rightarrow k_{1}$ and $k_{r}\leftarrow j$. Therefore, there exists $l\in\{1,...,r\}$ such that $k_{l-1}\rightarrow k_{l}\leftarrow k_{l+1}$ (where $k_{0}=i$ and $k_{r+1}=j$). By the choice of a shortest path, there is no edge between $k_{l-1}$ and $k_{l+1}$, so $k_{l-1}\rightarrow k_{l}\leftarrow k_{l+1}$ forms a $v$-structure in $\mathcal{C}$. Since all DAGs in the same shift interventional equivalence class share the same $v$-structures, $k_{l-1}\rightarrow k_{l}\leftarrow k_{l+1}$ must be oriented in the essential graph. This contradicts $k_{l-1},k_{l},k_{l+1}$ belonging to the same undirected chain component $\mathcal{C}$. Combining this with the fact that $\mathcal{C}$ must have at least one source node, we obtain that $\mathcal{C}$ has exactly one source node w.r.t. $\mathcal{C}$.

Next we show that the source node of a chain component is also a source of $\mathcal{G}$ if and only if there are no incoming edges into this component. Let $i$ be the source of the chain component $\mathcal{C}$. On the one hand, $i$ must be a source of $\mathcal{G}$ if there are no incoming edges into $\mathcal{C}$. On the other hand, if there is an incoming edge $j\rightarrow k$ for some $j\notin\mathcal{C}$ and $k\in\mathcal{C}$, then since the essential graph is closed under Meek rules R1 and R2 (Proposition 1), there must be an edge $j\rightarrow l$ for all neighbors $l$ of $k$. Repeating this argument and using the fact that $\mathcal{C}$ is connected, we obtain $j\rightarrow l$ for all $l\in\mathcal{C}$ (Figure 8). In particular $j\rightarrow i$, so $i$ cannot be a source of $\mathcal{G}$.

Figure 8: $j\rightarrow l$ for all $l\in\mathcal{C}$.

D.2 NP-completeness of MinMaxC

It was shown in \citetappendix{shen2012exact} and \citetappendix{lalou2018critical} that the MinMaxC problem is NP-complete for general graphs and for split graphs, respectively. Split graphs are a subclass of chordal graphs in which the vertices can be partitioned into a clique and an independent set (i.e., isolated nodes after removing the clique). Thus, MinMaxC is also NP-complete for chordal graphs.

D.3 Clique Tree Strategy

The clique tree strategy takes as input an undirected connected chordal graph $\mathcal{C}$ and the sparsity constraint $S$, and outputs a shift intervention with no more than $S$ perturbation targets. If $\mathcal{C}$ contains no more than $S$ nodes, it returns any shift intervention with perturbation targets in $\mathcal{C}$. If $\mathcal{C}$ contains more than $S$ nodes, it first constructs a clique tree $\mathcal{T}(\mathcal{C})$ of $\mathcal{C}$ by the maximum-weight spanning tree algorithm \citepappendix{koller2009probabilistic}. It then iterates through the nodes of $\mathcal{T}(\mathcal{C})$ (which are maximal cliques of $\mathcal{C}$) to find a maximal clique $\mathcal{K}$ that breaks $\mathcal{T}(\mathcal{C})$ into subtrees whose sizes are no more than half of the size of $\mathcal{T}(\mathcal{C})$. If $\mathcal{K}$ has no more than $S$ nodes, it returns any shift intervention with perturbation targets $\mathcal{K}$. Otherwise, it samples $S$ nodes from $\mathcal{K}$ and returns any shift intervention with these $S$ nodes as perturbation targets. The following subroutine summarizes this procedure.

Input: Chordal chain component $\mathcal{C}$, sparsity constraint $S$.
1  if $\mathcal{C}$ has no more than $S$ nodes then
2      set $I$ as any shift intervention on $\mathcal{C}$ with non-zero shift values;
4  else
5      let $C(\mathcal{C})$ be the set of maximal cliques of the chordal graph $\mathcal{C}$;
6      let $\mathcal{T}(\mathcal{C})$ be a maximum-weight spanning tree with $C(\mathcal{C})$ as nodes;
7      set $\mathcal{K}=\varnothing$;
8      for $K$ in $C(\mathcal{C})$ do
9          get the subtrees of $\mathcal{T}(\mathcal{C})$ after deleting node $K$;
10         if all subtrees have size $\leq\lceil(|C(\mathcal{C})|-1)/2\rceil$ then
11             set $\mathcal{K}=K$;
12             break;
14         end if
16     end for
17     if $|\mathcal{K}|>S$ then
18         set $\mathcal{K}$ as a random $S$-subset of $\mathcal{K}$;
19     end if
20     set $I$ as any shift intervention on $\mathcal{K}$ with non-zero shift values;
22 end if
Output: Shift intervention $I$
Algorithm 2 CliqueTree($\mathcal{C}$, $S$)

Complexity: Let $N$ denote the number of nodes in $\mathcal{C}$, i.e., $N=|\mathcal{C}|$. All maximal cliques of the chordal graph $\mathcal{C}$ can be found in $O(N^{2})$ time \citepappendix{galinier1995chordal}. We use Kruskal's algorithm for computing the maximum-weight spanning tree, which takes $O(N^{2}\log(N))$ time \citepappendix{kruskal1956shortest}. The remaining procedure of iterating through $C(\mathcal{C})$ takes no more than $O(N^{2})$, since a chordal graph with $N$ nodes has no more than $N$ maximal cliques \citepappendix{galinier1995chordal} and all subtree sizes can be obtained in $O(N)$. Therefore this subroutine runs in $O(N^{2}\log(N))$ time.
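For concreteness, a minimal sketch of this subroutine (ours; it assumes NetworkX, returns only the perturbation targets, and leaves the non-zero shift values to the caller) is:

```python
import itertools
import random
import networkx as nx

def clique_tree_targets(C, S):
    """Sketch of CliqueTree: pick <= S perturbation targets from a
    central maximal clique of the chordal chain component C (an nx.Graph)."""
    if C.number_of_nodes() <= S:
        return set(C.nodes)
    cliques = [set(c) for c in nx.chordal_graph_cliques(C)]
    r = len(cliques)
    # Clique tree: maximum-weight spanning tree of the clique graph,
    # with edge weights given by separator (intersection) sizes.
    T = nx.Graph()
    T.add_nodes_from(range(r))
    for a, b in itertools.combinations(range(r), 2):
        if cliques[a] & cliques[b]:
            T.add_edge(a, b, weight=len(cliques[a] & cliques[b]))
    T = nx.maximum_spanning_tree(T)
    # Central clique: deleting it leaves subtrees with <= ceil((r-1)/2)
    # cliques each; a tree centroid always satisfies this, so the loop breaks.
    K = set()
    for k in T.nodes:
        rest = T.subgraph(n for n in T.nodes if n != k)
        if all(len(cc) <= -(-(r - 1) // 2) for cc in nx.connected_components(rest)):
            K = cliques[k]
            break
    if len(K) > S:
        K = set(random.sample(sorted(K), S))  # random S-subset
    return K
```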

D.4 Supermodular Strategy

The supermodular procedure takes as input an undirected connected chordal graph $\mathcal{C}$ and the sparsity constraint $S$, and outputs a shift intervention whose perturbation targets are obtained by solving

\[
\min_{A\subset V_{\mathcal{C}}}\max_{i\in V_{\mathcal{C}}}\hat{f}_{i}(A),\quad|A|\leq S, \tag{5}
\]

with the SATURATE algorithm \citepappendix{krause2008robust}. Here $V_{\mathcal{C}}$ denotes the node set of $\mathcal{C}$ and $\hat{f}_{i}(A)=\sum_{j\in V_{\mathcal{C}}}\hat{g}_{i,j}(A)$ with $\hat{g}_{i,j}$ defined in (2). Algorithm 3 summarizes this subroutine.

Input: Chordal chain component $\mathcal{C}$, sparsity constraint $S$.
1  if $\mathcal{C}$ has no more than $S$ nodes then
2      set $I$ as any shift intervention on $\mathcal{C}$ with non-zero shift values;
4  else
5      let $A$ be the solution of (5) returned by SATURATE \citepappendix{krause2008robust};
6      set $I$ as any shift intervention on $A$ with non-zero shift values;
8  end if
Output: Shift intervention $I$
Algorithm 3 Supermodular($\mathcal{C}$, $S$)

Supermodularity: First we give an example showing that $f_{i}$ defined in (1) is not supermodular for chordal graphs, although it is clearly monotonic decreasing.

Example 2.

Consider the chordal graph in Figure 9; we have $f_{1}(\{2\})-f_{1}(\varnothing)=3-4=-1>-2=1-3=f_{1}(\{2,3\})-f_{1}(\{3\})$. Therefore $f_{1}$ is not supermodular for this graph.

Figure 9: $f_{i}$ is not supermodular.

Next we prove that $\hat{f}_{i}$ is supermodular and monotonic decreasing.

Proof.

Since $\hat{f}_{i}(A)=\sum_{j\in V_{\mathcal{C}}}\hat{g}_{i,j}(A)$, it suffices to show that every $\hat{g}_{i,j}$ is supermodular and monotonic decreasing. In the following, we refer to a path without cycles as a simple path.

For any $A\subset B\subset V_{\mathcal{C}}$, since $V_{\mathcal{C}}-B$ induces a subgraph of the one induced by $V_{\mathcal{C}}-A$, any simple path between $i$ and $j$ in $V_{\mathcal{C}}-B$ must also be present in $V_{\mathcal{C}}-A$. Hence $m_{i,j}(V_{\mathcal{C}}-B)\leq m_{i,j}(V_{\mathcal{C}}-A)$, which means that

\[
\hat{g}_{i,j}(A)\geq\hat{g}_{i,j}(B),
\]

i.e., $\hat{g}_{i,j}$ is monotonic decreasing.

For any $x\in V_{\mathcal{C}}\setminus B$, the difference $m_{i,j}(V_{\mathcal{C}}-B)-m_{i,j}(V_{\mathcal{C}}-(B\cup\{x\}))$ is the number of simple paths in $V_{\mathcal{C}}-B$ between $i$ and $j$ that pass through $x$. Each of these paths must also be present in $V_{\mathcal{C}}-A$, since $V_{\mathcal{C}}-B$ is a subgraph of $V_{\mathcal{C}}-A$. Therefore,

\[
m_{i,j}(V_{\mathcal{C}}-B)-m_{i,j}(V_{\mathcal{C}}-(B\cup\{x\}))\leq m_{i,j}(V_{\mathcal{C}}-A)-m_{i,j}(V_{\mathcal{C}}-(A\cup\{x\})),
\]

which means that

\[
\hat{g}_{i,j}(A\cup\{x\})-\hat{g}_{i,j}(A)\leq\hat{g}_{i,j}(B\cup\{x\})-\hat{g}_{i,j}(B),
\]

i.e., $\hat{g}_{i,j}$ is supermodular. ∎

SATURATE algorithm \citepappendix{krause2008robust}: Having shown that $\hat{f}_{i}$ is monotonic supermodular, we solve the robust supermodular optimization problem in (5) with the SATURATE algorithm of \citetappendix{krause2008robust}. SATURATE performs a binary search over potential objective values and uses a greedy partial cover algorithm to check the feasibility of each value; for a detailed description of the algorithm, see \citetappendix{krause2008robust}.
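A compact sketch of this scheme in our notation (ours, not the reference implementation; it is stated for maximizing $\min_i F_i$ with monotone submodular $F_i$, which is equivalent to (5) via $F_{i}(A)=\hat{f}_{i}(\varnothing)-\hat{f}_{i}(A)$, and it enforces the budget $|A|\leq S$ directly rather than the relaxed budget used for SATURATE's approximation guarantee):

```python
def saturate(F_list, V, S, eps=1e-6):
    """Sketch of SATURATE: binary search on the achievable value c, with a
    greedy partial cover step on the truncated objective mean_i min(F_i(A), c).
    Each F in F_list is a callable taking a set of nodes."""
    c_lo, c_hi = 0.0, min(F(set(V)) for F in F_list)
    best = set()
    while c_hi - c_lo > eps:
        c = (c_lo + c_hi) / 2.0
        trunc = lambda A: sum(min(F(A), c) for F in F_list) / len(F_list)
        A = set()
        while trunc(A) < c - 1e-12 and len(A) < S:
            gains = {v: trunc(A | {v}) - trunc(A) for v in V if v not in A}
            v_best = max(gains, key=gains.get)
            if gains[v_best] <= 0:
                break
            A.add(v_best)
        if trunc(A) >= c - 1e-12:   # value c achievable within budget S
            best, c_lo = A, c
        else:
            c_hi = c
    return best
```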

Complexity: Let $N$ denote the number of nodes in $\mathcal{C}$, i.e., $N=|\mathcal{C}|$. SATURATE uses at most $O(N^{2}S\log(N))$ evaluations of the supermodular functions $\hat{f}_{i}$ \citepappendix{krause2008robust}. Each evaluation of $\hat{f}_{i}$ requires all the simple paths between $i$ and every other $j$ in $\mathcal{C}$. A modified depth-first search is used to calculate these paths \citepappendix{sedgewick2001algorithms}, which results in $\mathcal{F}(N)$ complexity. For general graphs, this problem is #P-complete \citepappendix{valiant1979complexity}; however, the cost might be significantly lower for chordal graphs. We are unaware of specific complexity results for chordal graphs, which would be an interesting direction for future work. The total runtime of this subroutine is thus bounded by $O(N^{2}\mathcal{F}(N)S\log(N))$. (For a more efficient implementation, one could replace the undirected graph with a DAG in its MEC, which can be found in linear time using LBFS. All statements hold except that $\hat{f}_{i}$ is no longer necessarily tight for tree graphs. This replacement results in a total complexity of $O(N^{4}S\log(N))$ for the subroutine, since directed simple paths can be counted in $O(N^{2})$.)
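The path-counting primitive $m_{i,j}$ used above admits the following minimal sketch (ours; it enumerates simple paths by depth-first search via NetworkX and is therefore exponential in the worst case, matching the $\mathcal{F}(N)$ discussion above):

```python
import networkx as nx

def m(G, removed, i, j):
    """Number of simple paths between i and j in the subgraph of G induced
    by deleting the vertex set `removed`."""
    H = G.subgraph(set(G.nodes) - set(removed))
    if i not in H or j not in H:
        return 0
    if i == j:
        return 1  # the trivial single-node path
    return sum(1 for _ in nx.all_simple_paths(H, i, j))
```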

D.5 Violation of Faithfulness

From Corollary 2, we know that we can check whether Assumption 1 holds by any intervention on $X_{T}$ (or $X_{[p]\setminus T}$). However, we can run Algorithm 1 to obtain $I^{*}$ even without Assumption 1, because lines 2-14 of Algorithm 1 always return the correct $I^{*}$.

Let $I\subset I^{*}$ be the resolved part of $I^{*}$ in line 2, i.e., the shift intervention constructed by taking a subset of the perturbation targets of $I^{*}$ together with their corresponding shift values. Let $I^{*}-I$ be the remaining shift intervention constructed by deleting $I$ from $I^{*}$. Denote $T_{I}=\{i\,|\,i\in[p],\mathbb{E}_{{\mathrm{P}}^{I}}(X_{i})\neq\mathbb{E}_{{\mathrm{Q}}}(X_{i})\}$, which is returned by line 3. If $T_{I}=\varnothing$, then we have recovered $I^{*}$. Otherwise we have:

Lemma 7.

The source nodes w.r.t. $T_{I}$ must be perturbation targets of $I^{*}-I$, and their corresponding shift values are $\mathbb{E}_{{\mathrm{Q}}}(X_{i})-\mathbb{E}_{{\mathrm{P}}^{I}}(X_{i})$ (for source node $i$).

Proof.

Let $i$ be a source node w.r.t. $I^{*}-I$ and $a_{i}$ its corresponding shift value. Since intervening on the other nodes in $I^{*}-I$ does not change the marginal distribution of $X_{i}$, we must have $a_{i}=\mathbb{E}_{({\mathrm{P}}^{I})^{I^{*}-I}}(X_{i})-\mathbb{E}_{{\mathrm{P}}^{I}}(X_{i})$. Because $({\mathrm{P}}^{I})^{I^{*}-I}={\mathrm{P}}^{I^{*}}={\mathrm{Q}}$, we obtain

\[
a_{i}=\mathbb{E}_{{\mathrm{Q}}}(X_{i})-\mathbb{E}_{{\mathrm{P}}^{I}}(X_{i}).
\]

From this, we also have $\mathbb{E}_{{\mathrm{P}}^{I}}(X_{i})\neq\mathbb{E}_{{\mathrm{Q}}}(X_{i})$ since $a_{i}\neq 0$. Therefore, all source nodes $i$ w.r.t. $I^{*}-I$ are in $T_{I}$ and their corresponding shift values are $\mathbb{E}_{{\mathrm{Q}}}(X_{i})-\mathbb{E}_{{\mathrm{P}}^{I}}(X_{i})$.

Now let $i$ be a source w.r.t. $T_{I}$; we claim that $i$ must also be a source node w.r.t. $I^{*}-I$. Since $\mathbb{E}_{{\mathrm{P}}^{I}}(X_{i})\neq\mathbb{E}_{{\mathrm{Q}}}(X_{i})$, $i$ must either be a source node of $I^{*}-I$ or have a source node of $I^{*}-I$ as an ancestor. In the latter case, since all source nodes of $I^{*}-I$ are in $T_{I}$, $i$ cannot be a source w.r.t. $T_{I}$, a contradiction. Therefore a source w.r.t. $T_{I}$ must also be a source w.r.t. $I^{*}-I$. Combined with the result in the previous paragraph, all source nodes $i$ w.r.t. $T_{I}$ are perturbation targets of $I^{*}-I$ and their corresponding shift values are $\mathbb{E}_{{\mathrm{Q}}}(X_{i})-\mathbb{E}_{{\mathrm{P}}^{I}}(X_{i})$. ∎

This lemma shows that $U_{T}$ obtained in lines 5-11 of Algorithm 1 must consist of perturbation targets of $I^{*}-I$, and line 12 gives the correct shift values. Therefore Algorithm 1 must return the correct $I^{*}$.

However, to be able to obtain the shift-$\mathcal{I}$-EG of $\mathcal{G}$, we need mean interventional faithfulness to be satisfied by every $I\in\mathcal{I}$ (replacing $I^{*}$ with $I$ and ${\mathrm{Q}}$ with ${\mathrm{P}}^{I}$ in Assumption 1), as well as $\mathcal{I}$-faithfulness \citepappendix{squires2020permutation} to be satisfied by $({\mathrm{P}},\{{\mathrm{P}}^{I}\}_{I\in\mathcal{I}})$ with respect to $\mathcal{G}$.

Appendix E Proof of Worst-case Bounds

E.1 Proof of Lemma 5

To show Lemma 5, we need the following proposition, which states that any maximal clique of a chordal graph can be oriented to be most-upstream without creating cycles or $v$-structures, and that the orientation within this clique can be arbitrary. It was pointed out in \citepappendix{vandenberghe2015chordal}, using similar arguments, that any clique of a chordal graph can be made most-upstream. Here, we provide the complete proof.

Proposition 2.

Let $\mathcal{D}$ be any undirected chordal graph and $K$ any maximal clique of $\mathcal{D}$. For any permutation $\pi_{K}$ of the nodes in $K$, there exists a topological order $\pi$ of the nodes in $\mathcal{D}$ such that $\pi$ starts with $\pi_{K}$ and orienting $\mathcal{D}$ according to $\pi$ does not create any $v$-structures.

Proof.

A topological order $\pi$ of a chordal graph $\mathcal{D}$ such that orienting according to $\pi$ does not create $v$-structures corresponds to the reverse of a perfect elimination order \citepappendix{hauser2014two}. A perfect elimination order is an order of the nodes in $\mathcal{D}$ such that, for every node $i$, the neighbors of $i$ in $\mathcal{D}$ that appear after $i$ in this order constitute a clique in $\mathcal{D}$. Any chordal graph has at least one perfect elimination order \citepappendix{andersson1997characterization}. In the following, we use the reverse of a perfect elimination order to refer to a topological order that does not create $v$-structures.
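Perfect elimination orders can be computed efficiently; a minimal sketch (ours) via maximum cardinality search, whose visit order reversed is a perfect elimination order for chordal graphs, is:

```python
import networkx as nx

def perfect_elimination_order(G):
    """Maximum cardinality search on a chordal graph G (an nx.Graph):
    the reverse of the visit order is a perfect elimination order."""
    visited, weight = [], {v: 0 for v in G.nodes}
    while weight:
        v = max(weight, key=weight.get)  # node with most visited neighbors
        visited.append(v)
        del weight[v]
        for u in G[v]:
            if u in weight:
                weight[u] += 1
    return list(reversed(visited))
```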

To prove Proposition 2, we first prove the following statement: if $K\neq\mathcal{D}$, then there exists a perfect elimination order of the nodes in $\mathcal{D}$ that starts with a node not in $K$. To show this, by Proposition 6 in \citetappendix{hauser2014two}, we only need to prove that if $K\neq\mathcal{D}$, then there is a node not in $K$ whose neighbors in $\mathcal{D}$ constitute a clique.

We use induction on the number of nodes in $\mathcal{D}$. Consider $|\mathcal{D}|=1$: since $K$ is a maximal clique, $K=\mathcal{D}$, and the statement holds trivially. Suppose the statement is true for chordal graphs of size $n-1$ and consider $|\mathcal{D}|=n$. Since $\mathcal{D}$ is a chordal graph, it has a perfect elimination order; if this order starts with a node not in $K$, the statement holds immediately. If it starts with some $i\in K$, then there is no edge between $i$ and any node $j\notin K$: otherwise, since the order starts with $i$, all neighbors of $i$ (which include $j$ and $K\setminus\{i\}$) constitute a clique, so there are edges $j-k$ for all $k\in K$, contradicting $K$ being a maximal clique.


Consider the chordal graph $\mathcal{D}^{\prime}$ obtained by deleting $i$ from $\mathcal{D}$, so $|\mathcal{D}^{\prime}|=n-1$. Let $K^{\prime}$ be the maximal clique in $\mathcal{D}^{\prime}$ containing $K\setminus\{i\}$. If $K^{\prime}=\mathcal{D}^{\prime}$, let $j$ be any node in $\mathcal{D}\setminus K$. Since there is no edge $j-i$ and $\mathcal{D}^{\prime}$, which contains $j$, is a clique, $j$'s neighbors in $\mathcal{D}$ must constitute a clique. If $K^{\prime}\neq\mathcal{D}^{\prime}$, then by the induction hypothesis there exists $j\in\mathcal{D}^{\prime}\setminus K^{\prime}$ such that $j$'s neighbors in $\mathcal{D}^{\prime}$ constitute a clique. Since there is no edge $j-i$, $j$'s neighbors in $\mathcal{D}$ must also constitute a clique. Thus the statement holds for chordal graphs of size $n$, which completes the induction.


Now, we prove Proposition 2 by induction on the number of nodes in $\mathcal{D}$. Consider $|\mathcal{D}|=1$: since $K$ is a maximal clique, $K=\mathcal{D}$, and Proposition 2 holds trivially.

Suppose Proposition 2 holds for chordal graphs of size $n-1$ and consider $|\mathcal{D}|=n$. If $K=\mathcal{D}$, then Proposition 2 holds. If $K\neq\mathcal{D}$, then by the above statement there exists $j\in\mathcal{D}\setminus K$ such that some perfect elimination order of $\mathcal{D}$ starts with $j$. Let $\mathcal{D}^{\prime}$ be the chordal graph obtained by deleting $j$ from $\mathcal{D}$. By the induction hypothesis, there exists $\pi^{\prime}$, the reverse of a perfect elimination order of $\mathcal{D}^{\prime}$, that starts with $\pi_{K}$. Let $\pi=(\pi^{\prime},j)$; the reverse of $\pi$ is a perfect elimination order, since all neighbors of $j$ in $\mathcal{D}$ constitute a clique. Therefore $\pi$ gives the desired topological order and Proposition 2 holds for chordal graphs of size $n$. This completes the proof of Proposition 2. ∎

Proof of Lemma 5.

Given any algorithm $\mathcal{A}$, let $S_{1},...,S_{k}$ be the first $k$ shift interventions chosen by $\mathcal{A}$. By Proposition 2, there exists a feasible orientation of $\mathcal{C}$ such that the largest maximal clique $K$ of $\mathcal{C}$ is most-upstream and, for $k^{\prime}=1,...,k$, $S_{k^{\prime}}\cap K$ is most-downstream in $K-\cup_{l<k^{\prime}}S_{l}$. For example, in the figure below, suppose algorithm $\mathcal{A}$ chooses $S_{1}=\{3\}$ based on (a) and $S_{2}=\{2\}$ based on (b). There is a feasible orientation in (d) such that the largest clique $K=\{1,2,3\}$ is most-upstream and $S_{k^{\prime}}\cap K$ is most-downstream in $K$, for $k^{\prime}=1,2$.

(Figure: (a) $\mathcal{C}$; (b) $\mathcal{A}$ chooses $S_{1}=\{3\}$; (c) $\mathcal{A}$ chooses $S_{2}=\{2\}$; (d) orientation of $\mathcal{C}$.)

Since $|S_{k^{\prime}}|\leq S$ and $|K|=m_{\mathcal{C}}$, in this worst case at least $\lceil\frac{m_{\mathcal{C}}-1}{S}\rceil$ interventions are needed to identify the source of $K$, i.e., the source of $\mathcal{C}$ (minus $1$ because, if only one node of $K$ is left unintervened, it must be the source). ∎

E.2 Proof of Lemma 6

Let $K$ be the clique obtained by lines 7-13 in Algorithm 2; when $\mathcal{C}$ has more than $S$ nodes, we refer to $K$ as the central clique. To prove Lemma 6, we need the following proposition. It shows that, by looking at the undirected graph $\mathcal{C}$, we can find a node in the central clique $K$ satisfying certain properties, which will become useful in the proof of Lemma 6.

Proposition 3.

Let $\{\mathcal{T}_{a}\}_{a\in A}$ be the connected subtrees of $\mathcal{T}(\mathcal{C})$ after removing $K$. For a node $k\in K$, let $A_{k}\subset A$ be the set of indices $a\in A$ such that the subtree $\mathcal{T}_{a}$ is connected to $K$ only through the node $k$, and let $\mathcal{T}_{A_{k}}=\{\mathcal{T}_{a}\}_{a\in A_{k}}$ be the collection of all such subtrees. If there exists $a\in A\setminus A_{k}$ such that there is an edge between $\mathcal{T}_{a}$ and $k$, let $\mathcal{T}^{*}_{k}$ be the one with the largest number of maximal cliques; otherwise let $\mathcal{T}^{*}_{k}=\varnothing$. Then there exists a node $k$ such that the number of maximal cliques in the subgraph induced by the subtrees $\mathcal{T}_{A_{k}}\cup\{\mathcal{T}^{*}_{k}\}$ and $k$ itself does not exceed $\lceil\frac{r-1}{2}\rceil$.

Example 3.

As an example, the following figure shows the subtrees that are connected to $K$ only through node $1$, indexed by $A_{1}$ (blue). The largest subtree in $A\setminus A_{1}$ that is adjacent to node $1$ is denoted by $\mathcal{T}_{1}^{*}$ (undimmed in green).

Figure 12: An example of $\mathcal{T}_{A_{k}}$ and $\mathcal{T}_{k}^{*}$ for $k=1$.
Proof of Proposition 3.

Notice the following facts.

Fact 1: Let $\mathcal{T}$ be any subtree in $\{\mathcal{T}_{a}\}_{a\in A}$; then there must exist a node $i\in K$ such that there is no edge between $i$ and $\mathcal{T}$.

Proof of Fact 1: For any two nodes $i,i^{\prime}\in K$, because $\mathcal{C}$ is chordal and $\mathcal{T}$ is connected, either the neighbors of $i$ in $\mathcal{T}$ are a subset of those of $i^{\prime}$, or the neighbors of $i^{\prime}$ in $\mathcal{T}$ are a subset of those of $i$. Therefore we can order the nodes of $K$ such that the neighbors in $\mathcal{T}$ of each node are a subset of those of every node appearing after it. If the first node in this order has some neighbor $t\in\mathcal{T}$, then all nodes in $K$ have $t$ as a neighbor, contradicting $K$ being a maximal clique.

(Figure: (a) contradicting chordality of $\mathcal{C}$; (b) contradicting maximality of the clique $K$.)

Fact 2: Let $\bar{\mathcal{T}}$ be the collection of subtrees whose edges to $K$ all go through a single node $k\in K$. Then $\bar{\mathcal{T}}$ is the union of the disjoint sets $\mathcal{T}_{A_{k}}$, $k\in K$.

Proof of Fact 2: This follows directly from the definition of AkA_{k}.

Fact 3: Let $\mathcal{T}^{*}$ be the collection of non-empty $\mathcal{T}_{k}^{*}$, $k\in K$. Then $\mathcal{T}^{*}\cap\bar{\mathcal{T}}=\varnothing$. Furthermore, for any subtree in $\mathcal{T}^{*}$, there is a node $i\in K$ such that there is no edge between $i$ and this subtree.

Proof of Fact 3: This follows directly from the definition of $\mathcal{T}_{k}^{*}$ and Fact 1.

Now we prove Proposition 3. If $\mathcal{T}^{*}=\varnothing$, then since $K$ contains at least two nodes (otherwise $A=\varnothing$ and the proposition holds trivially) and the number of maximal cliques in $\bar{\mathcal{T}}$ does not exceed $r-1$, Fact 2 yields at least one $k\in K$ such that the number of maximal cliques in the subgraph induced by $\mathcal{T}_{A_{k}}\cup\{\mathcal{T}_{k}^{*}\}=\mathcal{T}_{A_{k}}$ and $k$ itself does not exceed $\lceil\frac{r-1}{2}\rceil$.

Now suppose $\mathcal{T}^{*}\neq\varnothing$. Let $\mathcal{T}^{*}_{k^{\prime}}$ be the subtree with the largest number of maximal cliques in $\mathcal{T}^{*}$, and let $k\in K$ be a node such that there is no edge between $k$ and the subtree $\mathcal{T}^{*}_{k^{\prime}}$ ($k$ exists by Fact 3). Suppose the proposition does not hold. Then the number of maximal cliques in the subgraph induced by $\mathcal{T}_{A_{k}}\cup\{\mathcal{T}_{k}^{*}\}$ and $k$ itself must exceed $\lceil\frac{r-1}{2}\rceil$, and likewise the number of maximal cliques in the subgraph induced by $\mathcal{T}_{A_{k^{\prime}}}\cup\{\mathcal{T}^{*}_{k^{\prime}}\}$ and $k^{\prime}$ itself exceeds $\lceil\frac{r-1}{2}\rceil$. Notice that $(\mathcal{T}_{A_{k}}\cup\{\mathcal{T}_{k}^{*}\})\cap(\mathcal{T}_{A_{k^{\prime}}}\cup\{\mathcal{T}_{k^{\prime}}^{*}\})=\varnothing$, so $\mathcal{T}_{A_{k}}\cup\{\mathcal{T}^{*}_{k}\}$ is connected to $\mathcal{T}_{A_{k^{\prime}}}\cup\{\mathcal{T}^{*}_{k^{\prime}}\}$ only through $K$. Hence the sum of the numbers of maximal cliques in $\mathcal{T}_{A_{k}}\cup\{\mathcal{T}^{*}_{k}\}\cup\{\{k\}\}$ and in $\mathcal{T}_{A_{k^{\prime}}}\cup\{\mathcal{T}^{*}_{k^{\prime}}\}\cup\{\{k^{\prime}\}\}$ does not exceed $r$, so they cannot both have more than $\lceil\frac{r-1}{2}\rceil$ maximal cliques. Therefore the proposition must hold. ∎

Proof of Lemma 6.

For CliqueTree, we prove this lemma for a "less-adaptive" version for the sake of a clearer discussion. In this "less-adaptive" version, instead of outputting one intervention with $S$ perturbation targets sampled from the central clique $K$ (when it has more than $S$ nodes) as in Algorithm 2, we directly output $\lceil\frac{|K|-1}{S}\rceil$ interventions with non-overlapping perturbation targets in $K$. Each of these interventions has no more than $S$ perturbation targets, and together they contain at least $|K|-1$ nodes of $K$. Furthermore, we pick these interventions such that, if they contain exactly $|K|-1$ nodes, the remaining node satisfies Proposition 3.

After these $\lceil\frac{|K|-1}{S}\rceil$ interventions, we obtain a partially directed $\mathcal{C}$, which is a chain graph; one of its chain components without incoming edges is the input to CliqueTree in the next iteration of the inner loop of Algorithm 1. Denote this chain component by $\mathcal{C}^{\prime}$. We show that $\mathcal{C}^{\prime}$ has no more than $\lceil\frac{r-1}{2}\rceil$ maximal cliques, each with no more than $m_{\mathcal{C}}$ nodes. If $\lceil\frac{r-1}{2}\rceil=0$, then $r=1$ and this holds trivially since the source of $\mathcal{C}$ must be identified. In the following, we assume $\lceil\frac{r-1}{2}\rceil>0$.

Size of maximal cliques: Every maximal clique of $\mathcal{C}^{\prime}$ is contained in a maximal clique of $\mathcal{C}$, and thus has no more than $m_{\mathcal{C}}$ nodes.

Number of maximal cliques: If the source node is identified, then $\mathcal{C}^{\prime}$ has only one node and the claim holds trivially. Now consider the case where the source node is not identified. We proceed in two cases.

Case I: If these $\lceil\frac{|K|-1}{S}\rceil$ interventions contain all nodes of $K$, then they break the clique tree $\mathcal{T}(\mathcal{C})$ into subtrees, each with no more than $\lceil\frac{r-1}{2}\rceil$ maximal cliques. $\mathcal{C}^{\prime}$ must belong to one of these subtrees and therefore has no more than $\lceil\frac{r-1}{2}\rceil$ maximal cliques.

Case II: If these $\lceil\frac{|K|-1}{S}\rceil$ interventions do not contain all nodes of $K$, then exactly one node of $K$ is not a perturbation target, and it satisfies Proposition 3. Denote this node by $k$ and the source node w.r.t. the intervened $|K|-1$ nodes by $i$. From Theorem 1, $i$ is identified and, for all $j\in K$ with $j\neq k$, the orientation of the edge $k-j$ is identified.

If $i\rightarrow k$, then $i$ is the source w.r.t. $K$. If $i$ is also the source w.r.t. $\mathcal{C}$, then $\mathcal{C}^{\prime}=\{i\}$ has no more than $\lceil\frac{r-1}{2}\rceil$ maximal cliques. Otherwise, there is a unique subtree of $\mathcal{T}(\mathcal{C})$ after removing $K$ that has an edge pointing to $i$ in $\mathcal{C}$ (it exists because $i$ is the source of $K$ but not the source of $\mathcal{C}$; it is unique because there are no edges between subtrees and no $v$-structure at $i$), and $\mathcal{C}^{\prime}$ must belong to this subtree, which has no more than $\lceil\frac{r-1}{2}\rceil$ maximal cliques.

(Figure: (a) the case $i\rightarrow k$; (b) the case $i\leftarrow k$, Fact 1.)

If $i\leftarrow k$, then $k$ is the source w.r.t. $K$. Consider all the subtrees of $\mathcal{T}(\mathcal{C})$ after removing $K$. We have the following two facts.

Fact 1: Let $\mathcal{T}^{\prime}$ be a subtree such that there is an edge between $\mathcal{T}^{\prime}$ and $K-\{k\}$ and all such edges point towards $\mathcal{T}^{\prime}$. Then all edges between $k$ and $t\in\mathcal{T}^{\prime}$ must be oriented as $k\rightarrow t$. Thus $\mathcal{C}^{\prime}\cap\mathcal{T}^{\prime}=\varnothing$.

Proof of Fact 1: Suppose, to the contrary, that $t\in\mathcal{T}^{\prime}$ and $t\rightarrow k$. Let $j\in K-\{k\}$ be such that there is an edge between $j$ and $\mathcal{T}^{\prime}$. Since $\mathcal{T}^{\prime}$ is connected, there is a path from $j$ to $t$ in $\mathcal{T}^{\prime}$; let $j=t_{0}-t_{1}-...-t_{l}-t_{l+1}=t$ be the shortest such path. By minimality, there is no edge between $t_{l^{\prime}}$ and $t_{l^{\prime\prime}}$ for $l^{\prime\prime}-l^{\prime}>1$. Since all edges between $\mathcal{T}^{\prime}$ and $K-\{k\}$ point towards $\mathcal{T}^{\prime}$, we have $j=t_{0}\rightarrow t_{1}$. Therefore, to avoid $v$-structures, the path must be oriented $j=t_{0}\rightarrow t_{1}\rightarrow...\rightarrow t_{l}\rightarrow t_{l+1}=t$. This creates a directed cycle $k\rightarrow j\rightarrow...\rightarrow t\rightarrow k$, a contradiction.

Fact 2: There can be at most one subtree $\mathcal{T}^{\prime}$ such that there is an edge pointing from $\mathcal{T}^{\prime}$ to $K-\{k\}$ and some $t\in\mathcal{T}^{\prime}$ such that $t\rightarrow k$ or $t-k$ is unidentified. Therefore at most one subtree $\mathcal{T}^{\prime}$ of this type can have $\mathcal{C}^{\prime}\cap\mathcal{T}^{\prime}\neq\varnothing$.

Proof of Fact 2: Suppose, to the contrary, that there are two different subtrees $\mathcal{T}^{\prime}_{1},\mathcal{T}^{\prime}_{2}$ such that $K-\{k\}\ni j_{1}\leftarrow t_{1}\in\mathcal{T}^{\prime}_{1}$ and $K-\{k\}\ni j_{2}\leftarrow t_{2}\in\mathcal{T}^{\prime}_{2}$. Since there is no edge $t_{1}-t_{2}$, we have $j_{1}\neq j_{2}$ (otherwise $t_{1}\rightarrow j_{1}\leftarrow t_{2}$ would form a $v$-structure within the chain component). Without loss of generality, suppose $j_{1}\rightarrow j_{2}$. Let $t$ be any node in $\mathcal{T}_{2}^{\prime}$ with an edge $t-k$; since $\mathcal{T}_{2}^{\prime}$ is connected, let $t=t_{0}^{\prime}-t_{1}^{\prime}-...-t_{l}^{\prime}-t_{l+1}^{\prime}=t_{2}$ be the shortest path between $t$ and $t_{2}$ in $\mathcal{T}_{2}^{\prime}$. Let $l^{\prime}$ be the maximum index in $0,1,...,l$ such that $t^{\prime}_{l^{\prime}}\leftarrow t^{\prime}_{l^{\prime}+1}$. If no such $l^{\prime}$ exists, then $t=t_{0}^{\prime}\rightarrow t_{1}^{\prime}\rightarrow...\rightarrow t_{l+1}^{\prime}=t_{2}$. Since $j_{1}\rightarrow j_{2}$ and there is no $v$-structure at $j_{2}$, there must be an identified edge $j_{1}-t_{l+1}^{\prime}=t_{2}$. Since there is no edge between $t_{2}$ and $t_{1}$ and $t_{1}\rightarrow j_{1}$, avoiding a $v$-structure forces $j_{1}\rightarrow t_{2}$. The same deduction yields identified edges down to $j_{1}\rightarrow t_{0}^{\prime}=t$. Since $k\rightarrow j_{1}$ and there are no directed cycles, the edge $k\rightarrow t$ must be identified. If $l^{\prime}$ exists, then since $t=t_{0}^{\prime}-t_{1}^{\prime}-...-t_{l+1}^{\prime}=t_{2}$ is a shortest path and there is no $v$-structure, we must have $t=t_{0}^{\prime}\leftarrow...\leftarrow t_{l^{\prime}+1}^{\prime}$, and since $l^{\prime}$ is maximal, $t_{l^{\prime}+1}^{\prime}\rightarrow...\rightarrow t_{l+1}^{\prime}=t_{2}$. By a deduction similar to the previous case, there is an identified edge $j_{1}\rightarrow t_{l^{\prime}+1}^{\prime}$. Therefore $k\rightarrow j_{1}\rightarrow t_{l^{\prime}+1}^{\prime}\rightarrow...\rightarrow t_{0}^{\prime}=t$, and to avoid directed cycles, $k\rightarrow t$ must be identified. In either case, all edges between $k$ and $\mathcal{T}_{2}^{\prime}$ are identified as pointing into $\mathcal{T}_{2}^{\prime}$.

(Figure: (a) the case where $l^{\prime}$ does not exist; (b) the case where $l^{\prime}$ exists.)

Using the above two facts, let $\mathcal{T}^{\prime}$ be the unique subtree in Fact 2 (if it exists). If there is no edge between $\mathcal{T}^{\prime}$ and $k$, then $\mathcal{C}^{\prime}$ must lie in the subgraph induced by $k$ itself and $\mathcal{T}_{A_{k}}$ from Proposition 3, which has no more than $\lceil\frac{r-1}{2}\rceil$ maximal cliques. If there is an edge between $\mathcal{T}^{\prime}$ and $k$, then $\mathcal{C}^{\prime}$ must lie in the union of $k$, $\mathcal{T}^{\prime}$ and $\mathcal{T}_{A_{k}}$. Since the number of maximal cliques in $\mathcal{T}^{\prime}$ is no more than that of $\mathcal{T}_{k}^{*}$ in Proposition 3, $\mathcal{C}^{\prime}$ has no more than $\lceil\frac{r-1}{2}\rceil$ maximal cliques.

Therefore, after $\lceil\frac{|K|-1}{S}\rceil\leq\lceil\frac{m_{\mathcal{C}}-1}{S}\rceil$ interventions, we reduce the number of maximal cliques to at most $\lceil\frac{r-1}{2}\rceil$ while keeping the size of the largest maximal clique at most $m_{\mathcal{C}}$. Applying this iteratively, we obtain that CliqueTree identifies the source node with at most $\lceil\log_{2}(r_{\mathcal{C}}+1)\rceil\cdot\lceil\frac{m_{\mathcal{C}}-1}{S}\rceil$ interventions.

For Supermodular, we do not analyze the gap between $\hat{g}_{i,j}$ and $g_{i,j}$ or how well SATURATE solves (5). In that case, it is never worse than CliqueTree in the worst case over the feasible orientations of $\mathcal{C}$, since it solves MinMaxC optimally without being constrained to maximal cliques. Therefore, it also takes no more than $\lceil\log_{2}(r_{\mathcal{C}}+1)\rceil\cdot\lceil\frac{m_{\mathcal{C}}-1}{S}\rceil$ interventions to identify the source node. ∎

E.3 Proof of Theorem 2

Proof of Theorem 2.

This result follows from Lemmas 5 and 6. Divide $I^{*}$ into $I_{1},...,I_{k}$ such that $I_{k^{\prime}}$ is the set of source nodes of $I^{*}-\cup_{l<k^{\prime}}I_{l}$. Since shifting $I_{k^{\prime}}$ affects the marginals of subsequent $I_{k^{\prime\prime}}$ with $k^{\prime\prime}>k^{\prime}$, any algorithm needs to identify $I_{1},...,I_{k}$ sequentially in order to identify the exact shift values.

Suppose $I_{1},...,I_{k^{\prime}-1}$ have been learned. For $I_{k^{\prime}}$, consider the chain components with no incoming edges of the subgraph of the shift-$\{\cup_{l<k^{\prime}}I_{l}\}$-EG induced by $T=\{i\,|\,i\in[p],\mathbb{E}_{{\mathrm{P}}^{\cup_{l<k^{\prime}}I_{l}}}(X_{i})\neq\mathbb{E}_{{\mathrm{Q}}}(X_{i})\}$. Applying Lemma 4 with $\mathcal{I}=\{\cup_{l<k^{\prime}}I_{l}\}$ and Observation 1 to this subgraph and $I_{k^{\prime}}$, we deduce that there are exactly $|I_{k^{\prime}}|$ such chain components and that $I_{k^{\prime}}$ has exactly one member in each of them. Let $m_{k^{\prime},1},...,m_{k^{\prime},|I_{k^{\prime}}|}$ be the sizes of the largest maximal cliques in these $|I_{k^{\prime}}|$ chain components. By Lemma 5, any algorithm needs at least $\sum_{i=1}^{|I_{k^{\prime}}|}\lceil\frac{m_{k^{\prime},i}-1}{S}\rceil$ interventions to identify $I_{k^{\prime}}$ in the worst case. On the other hand, since all these chain components contain no more than $r$ maximal cliques, by Lemma 6, our strategies need at most $\lceil\log_{2}(r+1)\rceil\cdot\sum_{i=1}^{|I_{k^{\prime}}|}\lceil\frac{m_{k^{\prime},i}-1}{S}\rceil$ interventions to identify $I_{k^{\prime}}$.

Applying this result for $k^{\prime}=1,...,k$, we obtain that our strategies for solving the causal mean matching problem require at most $\lceil\log_{2}(r+1)\rceil$ times more interventions than the optimal strategy, in the worst case over all feasible orientations. ∎

Appendix F Numerical Experiments

F.1 Experimental Setup

Graph Generation: We consider two random graph models: Erdös-Rényi graphs (Erdős and Rényi, 1960) and Barabási–Albert graphs (Albert and Barabási, 2002). The probability of edge creation in Erdös-Rényi graphs is set to $0.2$; the number of edges to attach from a new node to existing nodes in Barabási–Albert graphs is set to $2$. We also test on two types of structured chordal graphs: rooted trees, with the root sampled uniformly at random from all nodes in the tree, and moralized Erdös-Rényi graphs (Shanmugam et al., 2015) with the probability of edge creation set to $0.2$.
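A minimal sketch of these generators (ours, using NetworkX; the orientation step, which turns each undirected sample into a DAG via a random node permutation, is our assumption about one standard way to do this):

```python
import random
import networkx as nx

def random_dag(n, model="er", seed=0):
    # Undirected sample: Erdös-Rényi with p = 0.2, or Barabási–Albert with m = 2.
    if model == "er":
        G = nx.erdos_renyi_graph(n, p=0.2, seed=seed)
    else:
        G = nx.barabasi_albert_graph(n, m=2, seed=seed)
    # Orient each edge according to a random permutation of the nodes.
    rng = random.Random(seed)
    order = {v: r for r, v in enumerate(rng.sample(list(G.nodes), n))}
    D = nx.DiGraph()
    D.add_nodes_from(G.nodes)
    D.add_edges_from((u, v) if order[u] < order[v] else (v, u) for u, v in G.edges)
    return D
```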

Multiple Runs: For each instance in the setting of Barabási–Albert graphs with 100 nodes and $S=1$ in Figure 5(a), we ran the three non-deterministic strategies (UpstreamRand, CliqueTree, Supermodular) five times and observed little difference across all instances. We therefore excluded error bars when plotting the results, as they are visually negligible and the strategies are robust in these settings.

Implementation: We implemented our algorithms using the NetworkX package \citepappendix{hagberg2008exploring} and the CausalDAG package (https://github.com/uhlerlab/causaldag). All code is written in Python and was run on an AMD 2990WX CPU.

F.2 More Empirical Results

In the following, we present additional empirical results. The evaluations are the same as in Section 6. The following figures show behaviors similar to Figure 5 across the different settings.

Random graphs of size $\{10,50,100\}$: Barabási–Albert and Erdös-Rényi graphs with the number of nodes in $\{10,50,100\}$.

Figure 16: Barabási–Albert graphs with $50$ nodes. (a) and (b): $S=1$; (c): $|I^{*}|=25$.
Figure 17: Barabási–Albert graphs with $10$ nodes. (a) and (b): $S=1$; (c): $|I^{*}|=5$.
Figure 18: Erdös-Rényi graphs with $100$ nodes. (a) and (b): $S=1$; (c): $|I^{*}|=50$.
Figure 19: Erdös-Rényi graphs with $50$ nodes. (a) and (b): $S=1$; (c): $|I^{*}|=25$.
Figure 20: Erdös-Rényi graphs with $10$ nodes. (a) and (b): $S=1$; (c): $|I^{*}|=5$.

Larger Barabási–Albert graphs of size $1000$:

Figure 21: Larger Barabási–Albert graphs with $1000$ nodes (excluding Coloring, which takes more than $80$ extra interventions). (a) and (b): $S=1$; (c): $|I^{*}|=100$.

Two types of structured chordal graphs:

Figure 22: Structured chordal graphs. (a) and (b): rooted tree graphs with $50$ nodes and $S=1$; (c) and (d): moralized Erdös-Rényi graphs with $10$ nodes and $S=1$.

Appendix G Discussion of the Noisy Setting

In the noisy setting, an intervention can be repeated many times to obtain an estimated essential graph. Each intervention results in a posterior update over the true DAG $\mathcal{G}$ among all DAGs in the observational Markov equivalence class. For a tree graph $\mathcal{G}$, this corresponds to a probability distribution over all possible roots. To be able to learn the edges, \citetappendix{greenewald2019sample} proposed a bounded edge-strength condition on the noise for binary variables. Under this condition, they showed that the root node of a tree graph can be learned with high probability in a finite number of steps in expectation.

In our setting, to ensure that the source node w.r.t. an intervention can be learned, we need to repeat this intervention sufficiently many times so that the mean of each variable $X_{i}$ can be estimated accurately. Furthermore, to ensure that the edges of the (general) interventional essential graph can be learned, we need a condition similar to that of \citepappendix{greenewald2019sample} for general chordal graphs and continuous variables.
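As a rough illustration of the sample-size requirement (our own back-of-the-envelope bound, not part of the original analysis): if each $X_{i}$ is $\sigma$-sub-Gaussian and the smallest non-zero mean discrepancy to be detected is $\gamma$, then averaging $n$ i.i.d. samples per intervention and applying a standard sub-Gaussian tail bound together with a union bound over the $p$ nodes gives
\[
\mathbb{P}\Big(\max_{i\in[p]}\big|\bar{X}_{i}-\mathbb{E}(X_{i})\big|\geq\gamma/2\Big)\leq 2p\exp\left(-\frac{n\gamma^{2}}{8\sigma^{2}}\right),
\]
so $n=O\big(\frac{\sigma^{2}}{\gamma^{2}}\log(p/\delta)\big)$ repetitions suffice to decide all mean (in)equalities correctly with probability at least $1-\delta$.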

\bibliographystyleappendix{apalike}
\bibliographyappendix{main}