
Matching a Desired Causal State via Shift Interventions

Jiaqi Zhang
LIDS, EECS, and IDSS, MIT
Cambridge, MA 02139
[email protected]
Chandler Squires
LIDS, EECS, and IDSS, MIT
Cambridge, MA 02139
[email protected]
Caroline Uhler
LIDS, EECS, and IDSS, MIT
Cambridge, MA 02139
[email protected]
Abstract

Transforming a causal system from a given initial state to a desired target state is an important task permeating multiple fields including control theory, biology, and materials science. In causal models, such transformations can be achieved by performing a set of interventions. In this paper, we consider the problem of identifying a shift intervention that matches the desired mean of a system through active learning. We define the Markov equivalence class that is identifiable from shift interventions and propose two active learning strategies that are guaranteed to exactly match a desired mean. We then derive a worst-case lower bound for the number of interventions required and show that these strategies are optimal for certain classes of graphs. In particular, we show that our strategies may require exponentially fewer interventions than the previously considered approaches, which optimize for structure learning in the underlying causal graph. In line with our theoretical results, we also demonstrate experimentally that our proposed active learning strategies require fewer interventions compared to several baselines.

1 Introduction

Consider an experimental biologist attempting to turn cells from one type into another, e.g., from fibroblasts to neurons (Vierbuchen et al., 2010), by altering gene expression. This is known as cellular reprogramming and has shown great promise in recent years for regenerative medicine (Rackham et al., 2016). A common approach is to model gene expression of a cell, which is governed by an underlying gene regulatory network, using a structural causal model (Friedman et al., 2000; Badsha et al., 2019). Through a set of interventions, such as gene knockouts or over-expression of transcription factors (Dominguez et al., 2016), a biologist can infer the structure of the underlying regulatory network. After inferring enough about this structure, a biologist can identify the intervention needed to successfully reprogram a cell. More generally, transforming a causal system from an initial state to a desired state through interventions is an important task pervading multiple applications. Other examples include closed-loop control (Touchette and Lloyd, 2004) and pathway design of microstructures (Wodo et al., 2015).

With little prior knowledge of the underlying causal model, this task is intrinsically difficult. Previous works have addressed the problem of intervention design to achieve full identifiability of the causal model (Hauser and Bühlmann, 2014; Greenewald et al., 2019; Squires et al., 2020a). However, since interventional experiments tend to be expensive in practice, one wishes to minimize the number of trials and learn just enough information about the causal model to be able to identify the intervention that will transform it into the desired state. Furthermore, in many realistic cases, the set of interventions which can be performed is constrained. For instance, in CRISPR experiments, only a limited number of genes can be knocked out to keep the cell alive; or in robotics, a robot can only manipulate a certain number of arms at once.

Contributions. We take the first step towards the task of causal matching (formalized in Section 2), where an experimenter can perform a series of interventions in order to identify a matching intervention which transforms the system to a desired state. In particular, we consider the case where the goal is to match the mean of a distribution. We focus on a subclass of interventions called shift interventions, which can for example be used to model gene over-expression experiments (Triantafillou et al., 2017). These interventions directly increase or decrease the values of their perturbation targets, with their effect being propagated to variables which are downstream (in the underlying causal graph) of these targets. We show that there always exists a unique shift intervention (which may have multiple perturbation targets) that exactly transforms the mean of the variables into the desired mean (Lemma 1). We call this shift intervention the matching intervention.

To find the matching intervention, in Section 3 we characterize the Markov equivalence class of a causal graph induced by shift interventions, i.e., the edges in the causal graph that are identifiable from shift interventions; in particular, we show that the resulting Markov equivalence classes can be more refined than previous notions of interventional Markov equivalence classes. We then propose two active learning strategies in Section 4 based on this characterization, which are guaranteed to identify the matching intervention. These active strategies proceed in an adaptive manner, where each intervention is chosen based on all the information gathered so far.

In Section 5, we derive a worst-case lower bound on the number of interventions required to identify the matching intervention and show that the proposed strategies are optimal up to a logarithmic factor. Notably, the proposed strategies may use exponentially fewer interventions than previous active strategies for structure learning. Finally, in Section 6, we also demonstrate empirically that our proposed strategies outperform previous methods as well as other baselines in various settings.

1.1 Related Works

Experimental Design. Previous work on experimental design in causality has considered two closely related goals: learning the most structural information about the underlying DAG given a fixed budget of interventions (Ghassami et al., 2018), and fully identifying the underlying DAG while minimizing the total number or cost (Shanmugam et al., 2015; Kocaoglu et al., 2017) of interventions. These works can also be classified according to whether they consider a passive setting, i.e., the interventions are picked at a single point in time (Hyttinen et al., 2013; Shanmugam et al., 2015; Kocaoglu et al., 2017), or an active setting, i.e., interventions are decided based on the results of previous interventions (He and Geng, 2008; Agrawal et al., 2019; Greenewald et al., 2019; Squires et al., 2020a). The setting addressed in the current work is closest to the active, full-identification setting. The primary difference is that in order to match a desired mean, one does not require full identification; in fact, as we show in this work, we may require significantly fewer interventions.

Causal Bandits. Another related setting is the bandit problem in sequential decision making, where an agent aims to maximize the cumulative reward by selecting an arm at each time step. Previous works considered the setting where there are causal relations between rewards and arms (Lattimore et al., 2016; Lee and Bareinboim, 2018; Yabe et al., 2018). Using a known causal structure, these works were able to improve the dependence on the total number of arms compared to previous regret lower bounds (Bubeck and Cesa-Bianchi, 2012; Lattimore et al., 2016). These results were further extended to the case when the causal structure is unknown a priori (de Kroon et al., 2020). In all these works the variables are discrete, with arms given by do-interventions (i.e., setting variables to a given value), so that there are only a finite number of arms. In our work, we are concerned with the continuous setting and shift interventions, which correspond to an infinite (continuous) set of arms.

Correlation-based Approaches. There are also various correlation-based approaches for this task that do not make use of any causal information. For example, previous works have proposed score-based (Cahan et al., 2014), entropy-based (D’Alessio et al., 2015) and distance-based approaches (Rackham et al., 2016) for cellular reprogramming. However, as shown in bandit settings (Lattimore et al., 2016), when the system follows a causal structure, this structure can be exploited to learn the optimal intervention more efficiently. Therefore, we here focus on developing a causal approach.

2 Problem Setup

We now formally introduce the causal matching problem of identifying an intervention to match the desired state in a causal system under a given metric. Following Koller and Friedman (2009), a causal structural model is given by a directed acyclic graph (DAG) $\mathcal{G}$ with nodes $[p]=\{1,\dots,p\}$, and a set of random variables $X=\{X_1,\dots,X_p\}$ whose joint distribution $\mathrm{P}$ factorizes according to $\mathcal{G}$. Denote by $\operatorname{pa}_{\mathcal{G}}(i)=\{j\in[p] \mid j\to i\}$ the parents of node $i$ in $\mathcal{G}$. An intervention $I\subset[p]$ with multiple perturbation targets $i\in I$ either removes all incoming edges to $X_i$ (hard intervention) or modifies the conditional probability $\mathrm{P}(X_i|X_{\operatorname{pa}_{\mathcal{G}}(i)})$ (soft intervention) for all $i\in I$. This results in an interventional distribution $\mathrm{P}^I$. Given a desired joint distribution $\mathrm{Q}$ over $X$, the goal of causal matching is to find an optimal matching intervention $I$ such that $\mathrm{P}^I$ best matches $\mathrm{Q}$ under some metric. In this paper, we address a special case of the causal matching problem, which we call causal mean matching, where the distance metric between $\mathrm{P}^I$ and $\mathrm{Q}$ depends only on their expectations.

We focus on causal mean matching for a class of soft interventions, called shift interventions (Rothenhäusler et al., 2015). Formally, a shift intervention with perturbation targets $I\subset[p]$ and shift values $\{a_i\}_{i\in I}$ modifies the conditional distribution as $\mathrm{P}^I(X_i=x+a_i|X_{\operatorname{pa}_{\mathcal{G}}(i)})=\mathrm{P}(X_i=x|X_{\operatorname{pa}_{\mathcal{G}}(i)})$. Here, the shift values $\{a_i\}_{i\in I}$ are assumed to be deterministic. We aim to find $I\subset[p]$ and $\{a_i\}_{i\in I}\in\mathbb{R}^{|I|}$ such that the mean of $\mathrm{P}^I$ is closest to that of $\mathrm{Q}$, i.e., minimizes $d(\mathbb{E}_{\mathrm{P}^I}(X),\mathbb{E}_{\mathrm{Q}}(X))$ for some metric $d$. In fact, as we show in the following lemma, there always exists a unique shift intervention, which we call the matching intervention, that achieves exact mean matching. (To lighten notation, we use $I$ to denote both the perturbation targets and the shift values of this intervention.)
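To make these semantics concrete, the following is a minimal numerical sketch of a shift intervention in a hypothetical three-node linear-Gaussian model; the chain structure, edge weights, and sample size are illustrative assumptions, not part of the paper's setup. Consistent with Lemma 2 below, the mean of the (source) target moves by exactly its shift value, while downstream means move by propagated amounts.

import numpy as np

rng = np.random.default_rng(0)

def sample_means(shift=None, n=200_000):
    """Monte-Carlo means of a hypothetical linear-Gaussian SEM
    X1 -> X2 -> X3; `shift` maps a node to its shift value a_i,
    added on top of the node's structural equation."""
    shift = shift or {}
    x1 = rng.normal(1.0, 1.0, n) + shift.get(1, 0.0)
    x2 = 0.5 * x1 + rng.normal(0.0, 1.0, n) + shift.get(2, 0.0)
    x3 = 2.0 * x2 + rng.normal(0.0, 1.0, n) + shift.get(3, 0.0)
    return np.stack([x1, x2, x3], axis=1).mean(axis=0)

print(sample_means())              # approx (1.0, 0.5, 1.0)
print(sample_means({1: 2.0}))      # approx (3.0, 1.5, 3.0): a_1 = 2 propagates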

Lemma 1.

For any causal structural model and desired mean $\mathbb{E}_{\mathrm{Q}}(X)$, there exists a unique shift intervention $I^*$ such that $\mathbb{E}_{\mathrm{P}^{I^*}}(X)=\mathbb{E}_{\mathrm{Q}}(X)$.

We assume throughout that the underlying causal DAG $\mathcal{G}$ is unknown. However, we assume causal sufficiency (Spirtes et al., 2000), which excludes the existence of latent confounders, as well as access to enough observational data to determine the joint distribution $\mathrm{P}$ and thus the Markov equivalence class of $\mathcal{G}$ (Andersson et al., 1997). It is well known that with enough interventions, the causal DAG $\mathcal{G}$ becomes fully identifiable (Yang et al., 2018). Thus one strategy for causal mean matching is to first use interventions to fully identify the structure of $\mathcal{G}$, and then solve for the matching intervention given full knowledge of the graph. However, in general this strategy requires more interventions than needed. In fact, the number of interventions required by such a strategy can be exponentially larger than the number required by a strategy that directly attempts to identify the matching intervention, as illustrated in Figure 1 and proven in Theorem 2.

Figure 1: Completely identifying a DAG can require exponentially more interventions than identifying the matching intervention. Consider a graph constructed by joining $r$ size-4 cliques, where the matching intervention has the source node as the only perturbation target, as pictured in (a) with $r=2$ and the source node in purple; (b) shows the minimum-size intervention set (in purple) that completely identifies the DAG, which grows as $O(r)$ (Squires et al., 2020a). In Theorem 2, we show that the matching intervention can be identified in $O(\log r)$ single-node interventions.

In this work, we consider active intervention designs, where a series of interventions is chosen adaptively to learn the matching intervention. This means that the information obtained after performing each intervention is taken into account for future choices of interventions. We here focus on the noiseless setting, where for each intervention enough data is obtained to determine its effect. Direct implications for the noisy setting are discussed in Appendix G. To incorporate realistic cases in which the system cannot withstand an intervention with too many target variables, as in CRISPR experiments, where knocking out too many genes at once often kills the cell, we consider the setting where there is a sparsity constraint $S$ on the maximum number of perturbation targets in each intervention, i.e., we only allow interventions $I$ with $|I|\leq S$.

3 Identifiability

In this section, we characterize and provide a graphical representation of the shift interventional Markov equivalence class (shift-$\mathcal{I}$-MEC), i.e., the equivalence class of DAGs that is identifiable from shift interventions $\mathcal{I}$. We also introduce mean interventional faithfulness, an assumption that guarantees identifiability of the underlying DAG up to its shift-$\mathcal{I}$-MEC. Proofs are given in Appendix C.

3.1 Shift-interventional Markov Equivalence Class

For any DAG $\mathcal{G}$ with nodes $[p]$, a distribution $f$ is Markov with respect to $\mathcal{G}$ if it factorizes according to $f(X)=\prod_{i\in[p]}f(X_i|X_{\operatorname{pa}_{\mathcal{G}}(i)})$. Two DAGs are Markov equivalent, or in the same Markov equivalence class (MEC), if any positive distribution $f$ which is Markov with respect to (w.r.t.) one DAG is also Markov w.r.t. the other DAG. With observational data, a DAG is only identifiable up to its MEC (Andersson et al., 1997). However, identifiability improves to a smaller class of DAGs with interventions. For a set of interventions $\mathcal{I}$ (not necessarily shift interventions), the pair $(f,\{f^I\}_{I\in\mathcal{I}})$ is $\mathcal{I}$-Markov w.r.t. $\mathcal{G}$ if $f$ is Markov w.r.t. $\mathcal{G}$ and $f^I$ factorizes according to

f^I(X)=\prod_{i\notin I}f(X_i|X_{\operatorname{pa}_{\mathcal{G}}(i)})\prod_{i\in I}f^I(X_i|X_{\operatorname{pa}_{\mathcal{G}}(i)}),\quad\forall I\in\mathcal{I}.

Similarly, the interventional Markov equivalence class ($\mathcal{I}$-MEC) of a DAG can be defined, and Yang et al. (2018) provided a structural characterization of the $\mathcal{I}$-MEC for general interventions $\mathcal{I}$ (not necessarily shift interventions).

In the following, we show that if $\mathcal{I}$ consists of shift interventions, then the $\mathcal{I}$-MEC becomes smaller, i.e., identifiability of the causal DAG is improved. The proof utilizes Lemma 2 on the relationship between conditional probabilities. For this, denote by $\operatorname{an}_{\mathcal{G}}(i)$ the ancestors of node $i$, i.e., all nodes $j$ for which there is a directed path from $j$ to $i$ in $\mathcal{G}$. For a subset of nodes $I$, we say that $i\in I$ is a source w.r.t. $I$ if $\operatorname{an}_{\mathcal{G}}(i)\cap I=\varnothing$. A subset $I'\subset I$ is a source w.r.t. $I$ if every node in $I'$ is a source w.r.t. $I$.

Lemma 2.

For any distribution $f$ that factorizes according to $\mathcal{G}$, the interventional distribution $f^I$ for a shift intervention $I\subset[p]$ with shift values $\{a_i\}_{i\in I}$ satisfies

\mathbb{E}_{f^I}(X_i)=\mathbb{E}_f(X_i)+a_i,

for any source $i\in I$. Furthermore, if $i\in I$ is not a source w.r.t. $I$, then there exists a positive distribution $f$ such that $\mathbb{E}_{f^I}(X_i)\neq\mathbb{E}_f(X_i)+a_i$.

Hence, we can define the shift-$\mathcal{I}$-Markov property and the shift interventional Markov equivalence class (shift-$\mathcal{I}$-MEC) as follows.

Definition 1.

For a set of shift interventions $\mathcal{I}$, the pair $(f,\{f^I\}_{I\in\mathcal{I}})$ is shift-$\mathcal{I}$-Markov w.r.t. $\mathcal{G}$ if $(f,\{f^I\}_{I\in\mathcal{I}})$ is $\mathcal{I}$-Markov w.r.t. $\mathcal{G}$ and

\mathbb{E}_{f^I}(X_i)=\mathbb{E}_f(X_i)+a_i,\quad\forall\,i\in I\in\mathcal{I}\ \text{s.t.}\ \operatorname{an}_{\mathcal{G}}(i)\cap I=\varnothing.

Two DAGs are in the same shift-$\mathcal{I}$-MEC if any positive distribution that is shift-$\mathcal{I}$-Markov w.r.t. one DAG is also shift-$\mathcal{I}$-Markov w.r.t. the other DAG.

The following graphical characterizations are known: two DAGs are in the same MEC if and only if they share the same skeleton (adjacencies) and v-structures (induced subgraphs $i\rightarrow j\leftarrow k$); see Verma and Pearl (1991). For general interventions $\mathcal{I}$, two DAGs are in the same $\mathcal{I}$-MEC if and only if they are in the same MEC and they have the same directed edges $\{i\rightarrow j \mid i\in I, j\notin I, I\in\mathcal{I}, i-j\}$, where $i-j$ means that either $i\to j$ or $j\to i$ (Hauser and Bühlmann, 2012; Yang et al., 2018). In the following theorem, we provide a graphical criterion for two DAGs to be in the same shift-$\mathcal{I}$-MEC.

Theorem 1.

Let $\mathcal{I}$ be a set of shift interventions. Then two DAGs $\mathcal{G}_1$ and $\mathcal{G}_2$ belong to the same shift-$\mathcal{I}$-MEC if and only if they have the same skeleton, v-structures, directed edges $\{i\rightarrow j \mid i\in I, j\notin I, I\in\mathcal{I}, i-j\}$, as well as source nodes of $I$ for every $I\in\mathcal{I}$.

In other words, two DAGs are in the same shift-$\mathcal{I}$-MEC if and only if they are in the same $\mathcal{I}$-MEC and they have the same source perturbation targets. To represent an MEC, we use the essential graph (EG), which has the same skeleton as any DAG in this class and has a directed edge $i\rightarrow j$ if $i\rightarrow j$ in every DAG in this class. The essential graphs corresponding to the MEC, $\mathcal{I}$-MEC and shift-$\mathcal{I}$-MEC of a DAG $\mathcal{G}$ are referred to as the EG, $\mathcal{I}$-EG and shift-$\mathcal{I}$-EG of $\mathcal{G}$, respectively. They can be obtained from the aforementioned graphical criteria, along with a set of logical rules known as the Meek rules (Meek, 1995); see details in Appendix A. Figure 2 shows an example of the EG, $\mathcal{I}$-EG and shift-$\mathcal{I}$-EG of a four-node DAG.

Figure 2: Three types of essential graphs. (a) DAG $\mathcal{G}$; (b) EG of $\mathcal{G}$; (c) $\mathcal{I}$-EG of $\mathcal{G}$, where $\mathcal{I}$ contains one intervention with perturbation targets $X_1,X_2$ (purple); (d) shift-$\mathcal{I}$-EG of $\mathcal{G}$, which identifies an additional edge (red) compared to the $\mathcal{I}$-EG.

3.2 Mean Interventional Faithfulness

For the causal mean matching problem, the underlying $\mathcal{G}$ can be identified from shift interventions $\mathcal{I}$ up to its shift-$\mathcal{I}$-MEC. However, we may not need to identify the entire DAG to find the matching intervention $I^*$. Lemma 1 implies that if $i$ is neither in nor downstream of $I^*$, then the mean of $X_i$ already matches the desired state, i.e., $\mathbb{E}_{\mathrm{P}}(X_i)=\mathbb{E}_{\mathrm{Q}}(X_i)$; this suggests that these variables may be negligible when learning $I^*$. Unfortunately, the reverse is not true; one may design "degenerate" settings where a variable is in (or downstream of) $I^*$, but its marginal mean is also unchanged:

Example 1.

Let $X_3=X_1+2X_2$, with $\mathbb{E}_{\mathrm{P}}(X_1)=1$ and $\mathbb{E}_{\mathrm{P}}(X_2)=1$, so that $\mathbb{E}_{\mathrm{P}}(X_3)=3$. Suppose $I^*$ is a shift intervention with perturbation targets $\{X_1,X_2,X_3\}$ and shift values $a_1=1$, $a_2=-1$, and $a_3=1$. Then $\mathbb{E}_{\mathrm{P}^{I^*}}(X_3)=3$, i.e., the marginal mean of $X_3$ is unchanged under the intervention.
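The cancellation in Example 1 can be checked numerically; the sketch below simulates the structural equation and the shift intervention (the Gaussian noise is an arbitrary choice, since only the means matter).

import numpy as np

rng = np.random.default_rng(0)
n = 500_000
a1, a2, a3 = 1.0, -1.0, 1.0                  # shift values from Example 1

x1 = rng.normal(1.0, 1.0, n)                 # E[X1] = 1
x2 = rng.normal(1.0, 1.0, n)                 # E[X2] = 1
x3 = x1 + 2 * x2                             # E[X3] = 3

x1s, x2s = x1 + a1, x2 + a2                  # shifts enter each equation
x3s = x1s + 2 * x2s + a3

print(x3.mean(), x3s.mean())                 # both approx 3: the shifts cancel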

Such degenerate cases arise when the shift on a node $X_j$ (deemed 0 if $j$ is not shifted) exactly cancels out the contributions of the shifts on its ancestors. Formally, the following assumption rules out these cases.

Assumption 1 (Mean Interventional Faithfulness).

If $i\in[p]$ satisfies $\mathbb{E}_{\mathrm{P}}(X_i)=\mathbb{E}_{\mathrm{Q}}(X_i)$, then $i$ is neither a perturbation target nor downstream of any perturbation target, i.e., $i\notin I^*$ and $\operatorname{an}_{\mathcal{G}}(i)\cap I^*=\varnothing$.

This is a particularly weak form of faithfulness, which is implied by the interventional faithfulness assumptions in prior work (Yang et al., 2018; Squires et al., 2020b; Jaber et al., 2020).

Let $T$ be the collection of nodes $i\in[p]$ for which $\mathbb{E}_{\mathrm{P}}(X_i)\neq\mathbb{E}_{\mathrm{Q}}(X_i)$. The following lemma shows that under the mean interventional faithfulness assumption we can focus on the subgraph $\mathcal{G}_T$ induced by $T$, since $I^*\subset T$ and interventions on $X_T$ do not affect $X_{[p]\setminus T}$.

Lemma 3.

If Assumption 1 holds, then any edge $i-j$ with $j\in T$ and $i\notin T$ has orientation $j\leftarrow i$. Conversely, if Assumption 1 does not hold, then there exists some edge $i-j$ with $j\in T$, $i\notin T$ such that $j\to i$.

4 Algorithms

Having shown that shift interventions allow the identification of source perturbation targets and that the mean interventional faithfulness assumption allows reducing the problem to an induced subgraph, we now propose two algorithms to learn the matching intervention. The algorithms actively pick a shift intervention $I_t$ at time $t$ based on the current shift interventional essential graph (shift-$\mathcal{I}_t$-EG). Without loss of generality and for ease of discussion, we assume that the mean interventional faithfulness assumption holds, and we therefore only need to consider $\mathcal{G}_T$. In Appendix D, we show that faithfulness violations can be identified; thus Assumption 1 is not necessary for identifying the matching intervention, but additional interventions may be required.

Figure 3: Learning $I^*$ when the structure is known. Undimmed parts represent the current subgraph, with source nodes in purple. $I^*=\{1,2,4,5\}$ is solved in three steps; shift values are omitted. (a) $\mathcal{G}_T$ and $U_T$; (b) $\mathcal{G}_{T_1}$ and $U_{T_1}$; (c) $\mathcal{G}_{T_2}$ and $U_{T_2}$.

Warm-up: Upstream Search. Consider solving for the matching intervention $I^*$ when the structure of $\mathcal{G}_T$ is known, i.e., the current shift-$\mathcal{I}_t$-EG is fully directed. Let $U_T=\{i \mid i\in T, \operatorname{an}_{\mathcal{G}_T}(i)\cap T=\varnothing\}$ be the non-empty set of source nodes in $T$. We make the following key observation.

Observation 1.

$U_T\subset I^*$, and the shift values are $a_i=\mathbb{E}_{\mathrm{Q}}(X_i)-\mathbb{E}_{\mathrm{P}}(X_i)$ for each $i\in U_T$.

This follows since shifting other variables in $T$ cannot change the mean of nodes in $U_T$. Further, the shifted means of variables in $U_T$ match the desired mean (Lemma 2). Given the resulting intervention $U_T$, we obtain a new distribution $\mathrm{P}^{U_T}$. Assuming mean interventional faithfulness for this distribution, we may now remove those variables whose means in $\mathrm{P}^{U_T}$ already match $\mathrm{Q}$. We then repeat this process on the new set of unmatched nodes, $T_1$, to compute the corresponding shift intervention $U_{T_1}$. Repeating until we have matched the desired mean for all variables yields $I^*$. We illustrate this procedure in Figure 3.
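A minimal sketch of this upstream-search loop follows, assuming the DAG is known and represented as a networkx DiGraph, and assuming a hypothetical oracle mean_p that returns exact node means under a given shift intervention (our noiseless setting); both assumptions are ours, made for illustration.

import networkx as nx

def upstream_search(dag, mean_p, mean_q, tol=1e-9):
    """Sketch of upstream search on a *known* DAG (Figure 3).

    dag: nx.DiGraph; mean_q: dict node -> desired mean; mean_p(I): a
    hypothetical noiseless oracle returning the mean of every node
    under the shift intervention I (a dict node -> shift value).
    """
    I = {}                                   # matching intervention so far
    while True:
        mu = mean_p(I)
        T = {i for i in dag if abs(mu[i] - mean_q[i]) > tol}
        if not T:
            return I                         # all means matched: I = I*
        sub = dag.subgraph(T)
        sources = {i for i in T if sub.in_degree(i) == 0}
        for i in sources:                    # Observation 1: sources lie in I*
            I[i] = mean_q[i] - mu[i]
        # the next iteration recomputes T under P^I, as in Figure 3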

The idea of upstream search extends to shift-$\mathcal{I}_t$-EGs with partially directed or undirected $\mathcal{G}_T$. In this case, if one or more nodes of $\mathcal{G}_T$ are identified as sources, Observation 1 still holds. Hence, we solve a part of $I^*$ with these source nodes and then intervene on them to reduce the size of the unsolved graph.

Decomposition of Shift Interventional Essential Graphs: In order to find the source nodes, we decompose the current shift-$\mathcal{I}_t$-EG into undirected components. Hauser and Bühlmann (2014) showed that every interventional essential graph is a chain graph with chordal chain components, where the orientations in one chain component do not affect the orientations in other components. (The chain components of a chain graph are the undirected connected components after removing all its directed edges, and an undirected graph is chordal if all cycles of length greater than 3 contain a chord.) By a similar argument, we can obtain an analogous decomposition for shift interventional essential graphs and show that there is at least one chain component with no incoming edges. Let us separate out all of the chain components of the shift-$\mathcal{I}_t$-EG with no incoming edges. The following lemma proves that all sources are contained within these components.

Lemma 4.

For any shift-$\mathcal{I}$-EG of $\mathcal{G}$, each chain component has exactly one source node w.r.t. this component. This node is a source w.r.t. $\mathcal{G}$ if and only if there are no incoming edges to this component.

These results hold when replacing $\mathcal{G}$ with any induced subgraph of it. Thus, we can find the source nodes in $T$ by finding the source nodes in each of its chain components with no incoming edges.
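As a sketch of this step, assuming the essential graph is represented by a pair of graphs (its directed part and its undirected part; this two-graph representation is our choice, not a fixed API), the qualifying chain components can be extracted as follows.

import networkx as nx

def source_chain_components(directed, undirected):
    """Chain components with no incoming directed edges; by Lemma 4,
    the sources of the graph lie in exactly these components.

    directed: nx.DiGraph with the oriented edges of the essential graph;
    undirected: nx.Graph with its unoriented edges (same node set).
    """
    components = []
    for comp in nx.connected_components(undirected):
        has_incoming = any(v in comp and u not in comp
                           for u, v in directed.edges())
        if not has_incoming:
            components.append(comp)
    return components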

4.1 Two Approximate Strategies

Following the chain graph decomposition, we now focus on how to find the source node of an undirected connected chordal graph $\mathcal{C}$. If there is no sparsity constraint on the number of perturbation targets in each shift intervention, then directly intervening on all of the variables in $\mathcal{C}$ gives the source node, since by Theorem 1, all DAGs in the shift-$\mathcal{I}$-MEC share the same source node. However, when the maximum number of perturbation targets in an intervention is restricted to $S<|\mathcal{C}|$, multiple interventions may be necessary to find the source node.

After intervening on $S$ nodes, the remaining unoriented part can be decomposed into connected components. In the worst case, the source node of $\mathcal{C}$ is in the largest of these connected components. Therefore, we seek the set of nodes, within the sparsity constraint, whose removal minimizes the size of the largest remaining connected component. This is known as the MinMaxC problem (Lalou et al., 2018), which we show is NP-complete on chordal graphs (Appendix D). We propose two approximate strategies to solve this problem, one based on the clique tree representation of chordal graphs and the other based on robust supermodular optimization. The overall algorithm with these subroutines is summarized in Algorithm 1. We outline the subroutines here and give further details in Appendix D.

Input: joint distribution $\mathrm{P}$, desired joint distribution $\mathrm{Q}$, sparsity constraint $S$.
1: initialize $I^*=\varnothing$ and $\mathcal{I}=\{\varnothing\}$;
2: while $\mathbb{E}_{\mathrm{P}^{I^*}}(X)\neq\mathbb{E}_{\mathrm{Q}}(X)$ do
3:   let $T=\{i \mid i\in[p],\ \mathbb{E}_{\mathrm{P}^{I^*}}(X_i)\neq\mathbb{E}_{\mathrm{Q}}(X_i)\}$;
4:   let $\mathcal{G}$ be the subgraph of the shift-$\mathcal{I}$-EG induced by $T$;
5:   let $U_T$ be the identified source nodes in $T$;
6:   while $U_T=\varnothing$ do
7:     let $\mathcal{C}$ be a chain component of $\mathcal{G}$ with no incoming edges;
8:     select shift intervention $I$ by running CliqueTree($\mathcal{C}$, $S$) or Supermodular($\mathcal{C}$, $S$);
9:     perform $I$ and append it to $\mathcal{I}$;
10:    update $\mathcal{G}$ and $U_T$ as in the outer loop;
11:   end while
12:   set $a_i=\mathbb{E}_{\mathrm{Q}}(X_i)-\mathbb{E}_{\mathrm{P}^{I^*}}(X_i)$ for each $i\in U_T$;
13:   include perturbation targets $U_T$ and shift values $\{a_i\}_{i\in U_T}$ in $I^*$ and perform $I^*$;
14: end while
Output: matching intervention $I^*$
Algorithm 1: Active Learning for Causal Mean Matching

Clique Tree Strategy. Let $C(\mathcal{C})$ be the set of maximal cliques in the chordal graph $\mathcal{C}$. There exists a clique tree $\mathcal{T}(\mathcal{C})$ with nodes $C(\mathcal{C})$ and edges satisfying that for all $C_1,C_2\in C(\mathcal{C})$, their intersection $C_1\cap C_2$ is a subset of any clique on the unique path between $C_1$ and $C_2$ in $\mathcal{T}(\mathcal{C})$ (Blair and Peyton, 1993). Thus, deleting a clique which is not a leaf node in the clique tree breaks $\mathcal{C}$ into at least two connected components, each corresponding to a subtree of the clique tree. Inspired by the central node algorithm (Greenewald et al., 2019; Squires et al., 2020a), we find the $S$-constrained central clique of $\mathcal{T}(\mathcal{C})$ by iterating through $C(\mathcal{C})$ and returning the clique with no more than $S$ nodes that separates the graph most, i.e., solving MinMaxC when interventions are constrained to be maximal cliques. We denote this approach by CliqueTree.
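A minimal sketch of this selection step, assuming networkx: rather than building $\mathcal{T}(\mathcal{C})$ explicitly, it scores each maximal clique with at most $S$ nodes by the largest connected component left after its removal, which is the MinMaxC criterion restricted to maximal cliques (handling cliques larger than $S$ is deferred to Appendix D, as in the paper).

import networkx as nx

def clique_tree_intervention(C, S):
    """Sketch of CliqueTree: among maximal cliques of the chordal graph
    C with at most S nodes, return the one whose removal minimizes the
    largest remaining connected component."""
    best, best_score = None, float("inf")
    for clique in nx.find_cliques(C):          # maximal cliques of C
        if len(clique) > S:
            continue                           # subset handling: Appendix D
        rest = C.copy()
        rest.remove_nodes_from(clique)
        score = max((len(c) for c in nx.connected_components(rest)),
                    default=0)
        if score < best_score:
            best, best_score = set(clique), score
    return best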

Supermodular Strategy. Our second approach, denoted Supermodular, optimizes a lower bound of the objective of MinMaxC. Consider the following equivalent formulation of MinMaxC:

\min_{A\subset V_{\mathcal{C}}}\max_{i\in V_{\mathcal{C}}}f_i(A),\quad|A|\leq S, \qquad (1)

where $V_{\mathcal{C}}$ represents the nodes of $\mathcal{C}$ and, for all $i\in V_{\mathcal{C}}$, $f_i(A)=\sum_{j\in V_{\mathcal{C}}}g_{i,j}(A)$ with $g_{i,j}(A)=1$ if $i$ and $j$ are the same or connected after removing the nodes in $A$ from $\mathcal{C}$, and $g_{i,j}(A)=0$ otherwise.
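To ground this objective, the following is a brute-force reference implementation of (1) for small graphs (purely illustrative; the two strategies below exist precisely to avoid this exhaustive search).

import itertools
import networkx as nx

def minmaxc_bruteforce(C, S):
    """Exhaustively solve (1): the node set A, |A| <= S, minimizing the
    largest connected component remaining after removing A from C."""
    def largest_component(A):
        rest = C.copy()
        rest.remove_nodes_from(A)
        return max((len(c) for c in nx.connected_components(rest)),
                   default=0)
    return min(
        (set(A) for r in range(S + 1)
         for A in itertools.combinations(C.nodes(), r)),
        key=largest_component,
    )

# e.g. minmaxc_bruteforce(nx.path_graph(7), 1) returns {3}: removing the
# middle node splits the path into two components of size 3.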

MinMaxC (1) resembles the problem of robust supermodular optimization (Krause et al., 2008). Unfortunately, $f_i$ is not supermodular for chordal graphs (Appendix D). Therefore, we propose to optimize for a surrogate of $f_i$ defined as $\hat{f}_i(A)=\sum_{j\in V_{\mathcal{C}}}\hat{g}_{i,j}(A)$, where

\hat{g}_{i,j}(A)=\begin{cases}\frac{m_{i,j}(V_{\mathcal{C}}\setminus A)}{m_{i,j}(V_{\mathcal{C}})},& i\,\text{--}\,j\ \text{in}\ \mathcal{C},\\ 0,&\text{otherwise}.\end{cases} \qquad (2)

Here $m_{i,j}(V_{\mathcal{C}'})$ is the number of paths without cycles between $i$ and $j$ in $\mathcal{C}'$ (deemed 0 if $i$ or $j$ does not belong to $\mathcal{C}'$, and 1 if $i=j\in\mathcal{C}'$), and $i\,\text{--}\,j$ means that $i$ is either connected to or equal to $j$. Comparing $\hat{g}_{i,j}$ with $g_{i,j}$, we see that $\hat{f}_i(A)$ is a lower bound of $f_i(A)$ for MinMaxC, which is tight when $\mathcal{C}$ is a tree. We show that $\hat{f}_i$ is monotone supermodular for all $i$ (Appendix D). Therefore, we consider (1) with $f_i$ replaced by $\hat{f}_i$, which can be solved using the SATURATE algorithm (Krause et al., 2008). Notably, the results returned by Supermodular can be quite different from those returned by CliqueTree, since Supermodular is not constrained to pick a maximal clique; see Figure 4.
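A sketch of the surrogate (2) follows, assuming networkx; the exponential-time simple-path counting and the reading of $i\,\text{--}\,j$ as connectivity in $\mathcal{C}$ are illustrative assumptions, and the SATURATE search itself is not reproduced here.

import networkx as nx

def count_simple_paths(G, i, j):
    """m_{i,j}: number of cycle-free paths between i and j in G
    (0 if i or j is missing from G, 1 if i == j)."""
    if i not in G or j not in G:
        return 0
    if i == j:
        return 1
    return sum(1 for _ in nx.all_simple_paths(G, i, j))

def f_hat(C, i, A):
    """Surrogate f_hat_i(A) from (2): sums the surviving fraction of
    simple paths between i and every other node after removing A."""
    rest = C.subgraph(set(C.nodes()) - set(A))
    total = 0.0
    for j in C.nodes():
        m_full = count_simple_paths(C, i, j)
        if m_full:                     # i -- j holds in the connected C
            total += count_simple_paths(rest, i, j) / m_full
    return total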

Figure 4: Picking 2 nodes in an undirected connected chordal graph $\mathcal{C}$. CliqueTree picks $\{X_4,X_5\}$, while Supermodular picks the better $\{X_3,X_6\}$. (a) $\mathcal{C}$; (b) clique tree $\mathcal{T}(\mathcal{C})$.

5 Theoretical Results

In this section we derive a worst-case lower bound on the number of interventions for any algorithm to identify the source node in a chordal graph. Then we use this lower bound to show that our strategies are optimal up to a logarithmic factor. This contrasts with the structure learning strategy, which may require exponentially more interventions than our strategy (Figure 1).

The worst case is with respect to all feasible orientations of an essential graph (Hauser and Bühlmann, 2014; Shanmugam et al., 2015), i.e., orientations corresponding to DAGs in the equivalence class. Given a chordal chain component $\mathcal{C}$ of $\mathcal{G}$, let $r_{\mathcal{C}}$ be the number of maximal cliques in $\mathcal{C}$, and $m_{\mathcal{C}}$ the size of the largest maximal clique in $\mathcal{C}$. The following lemma provides a lower bound depending only on $m_{\mathcal{C}}$.

Lemma 5.

In the worst case over feasible orientations of $\mathcal{C}$, any algorithm requires at least $\lceil\frac{m_{\mathcal{C}}-1}{S}\rceil$ shift interventions to identify the source node, under the sparsity constraint $S$.

To give some intuition for this result, consider the case where the largest maximal clique is upstream of all other maximal cliques. Given such an ordering, in the worst case, each intervention rules out only $S$ nodes in this clique (namely, the most downstream ones). Now, we show that our two strategies need at most $\lceil\log_2(r_{\mathcal{C}}+1)\rceil\cdot\lceil\frac{m_{\mathcal{C}}-1}{S}\rceil$ shift interventions for the same task.

Lemma 6.

In the worst case over feasible orientations of $\mathcal{C}$, both CliqueTree and Supermodular require at most $\lceil\log_2(r_{\mathcal{C}}+1)\rceil\cdot\lceil\frac{m_{\mathcal{C}}-1}{S}\rceil$ shift interventions to identify the source node, under the sparsity constraint $S$.

By combining Lemma 5 and Lemma 6, which consider subproblems of the causal mean matching problem, we obtain a bound on the number of shift interventions needed to solve the full causal mean matching problem. Let $r$ be the largest $r_{\mathcal{C}}$ over all chain components $\mathcal{C}$ of $\mathcal{G}$:

Theorem 2.

Algorithm 1 requires at most $\lceil\log_2(r+1)\rceil$ times more shift interventions than the optimal strategy, in the worst case over feasible orientations of $\mathcal{G}$.

A direct application of this theorem is that, in terms of the number of interventions required to solve the causal mean matching problem, our algorithm is optimal in the worst case when $r=1$, i.e., when every chain component is a clique. All proofs are provided in Appendix E.

6 Experiments

We now evaluate our algorithms in several synthetic settings. (Code is publicly available at: https://github.com/uhlerlab/causal_mean_matching.) Each setting considers a particular graph type, number of nodes $p$ in the graph, and number of perturbation targets $|I^*|\leq p$ in the matching intervention. We generate 100 problem instances in each setting. Every problem instance contains a DAG with $p$ nodes generated according to the graph type and a randomly sampled subset of $|I^*|$ nodes denoting the perturbation targets of the matching intervention. We consider both random graphs, including Erdős–Rényi graphs (Erdős and Rényi, 1960) and Barabási–Albert graphs (Albert and Barabási, 2002), and structured chordal graphs, in particular rooted tree graphs and moralized Erdős–Rényi graphs (Shanmugam et al., 2015). The graph size $p$ in our simulations ranges from 10 to 1000, while the number of perturbation targets ranges from 1 to $\min\{p,100\}$.
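For concreteness, instances of this kind can be generated along the following lines (a sketch assuming networkx; the exact generation protocol, e.g., edge densities and seeds, is our assumption and may differ from the paper's).

import networkx as nx
import numpy as np

rng = np.random.default_rng(0)

def random_dag(p, kind="erdos_renyi"):
    """Orient a random undirected graph along the node order
    0 < 1 < ... < p-1 to obtain a DAG (a common construction)."""
    if kind == "erdos_renyi":
        ug = nx.erdos_renyi_graph(p, 3.0 / p, seed=0)    # density assumed
    else:
        ug = nx.barabasi_albert_graph(p, 2, seed=0)      # m = 2 assumed
    dag = nx.DiGraph((min(u, v), max(u, v)) for u, v in ug.edges())
    dag.add_nodes_from(range(p))
    return dag

dag = random_dag(100, kind="barabasi_albert")
targets = rng.choice(100, size=50, replace=False)        # nodes of I*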

We compare our two subroutines for Algorithm 1, CliqueTree and Supermodular, against three carefully constructed baselines. The UpstreamRand baseline follows Algorithm 1 with line 8 changed to selecting $I$ randomly from $\mathcal{C}$ without exceeding $S$; i.e., when there is no identified source node, it samples randomly from the chain component with no incoming edges. This strategy highlights how much benefit CliqueTree and Supermodular provide on top of upstream search. The Coloring baseline is modified from the coloring-based policy for structure learning (Shanmugam et al., 2015), previously shown to perform competitively on large graphs (Squires et al., 2020a). It first performs structure learning with the coloring-based policy and then uses upstream search with the known DAG. We also include an Oracle baseline, which performs upstream search with the known DAG.

In Figure 5 we present a subset of our results on Barabási–Albert graphs with 100 nodes; similar behavior is observed in all other settings, shown in Appendix F. In Figure 5(a), we consider problem instances with varying $|I^*|$. Each algorithm is run with sparsity constraint $S=1$. We plot the number of extra interventions compared to Oracle, averaged across the 100 problem instances. As expected, Coloring requires the largest number of extra interventions. This finding is consistent across different numbers of perturbation targets, since the same number of interventions is used to learn the structure regardless of $I^*$. As $|I^*|$ increases, CliqueTree and Supermodular outperform UpstreamRand. To further investigate this trend, we plot the rate of extra interventions used by CliqueTree and Supermodular relative to UpstreamRand in Figure 5(b). (The rate is calculated as (#Strategy − #UpstreamRand)/#UpstreamRand, where # denotes the number of extra interventions compared to Oracle and Strategy is CliqueTree, Supermodular or UpstreamRand.) This figure shows that CliqueTree and Supermodular improve upon upstream search by up to 25% as the number of perturbation targets increases. Finally, we consider the effect of the sparsity constraint $S$ in Figure 5(c) with $|I^*|=50$. In line with the discussion in Section 4.1, as $S$ increases, the task becomes easier for plain upstream search. However, when the number of perturbation targets is restricted, CliqueTree and Supermodular are superior, with Supermodular performing best in most cases.

Figure 5: Barabási–Albert graphs with 100 nodes. (a) Average (over 100 instances) number of extra interventions each algorithm (with sparsity constraint $S=1$) requires compared to Oracle, plotted against the number of perturbation targets in $I^*$; (b) rate of extra interventions required by CliqueTree and Supermodular ($S=1$) relative to UpstreamRand, plotted against the number of perturbation targets in $I^*$; (c) relative extra rate ($|I^*|=50$), plotted against the sparsity constraint $S$.

7 Discussion

In this work, we introduced the causal mean matching problem, which has important applications in medicine and engineering. We aimed to develop active learning approaches for identifying the matching intervention using shift interventions. Towards this end, we characterized the shift interventional Markov equivalence class and showed that it is in general more refined than previously defined equivalence classes. We proposed two strategies for learning the matching intervention based on this characterization, and showed that they are optimal up to a logarithmic factor. We reported experimental results on a range of settings to support these theoretical findings.

Limitations and Future Work. This work has various limitations that may be interesting to address in future work. First, we focus on the task of matching a desired mean, rather than an entire distribution. This is an inherent limitation of deterministic shift interventions: as noted by Hyttinen et al. (2012), in the linear Gaussian setting, these interventions can only modify the mean of the initial distribution. Thus, matching the entire distribution, or other relevant statistics, will require broader classes of interventions. Assumptions on the desired distribution are also required to rule out possibly non-realizable cases. Second, we have focused on causal DAG models, which assume acyclicity and the absence of latent confounders. In many realistic applications, this could be an overly optimistic assumption, requiring extensions of our results to the cyclic and/or causally insufficient setting. Finally, throughout the main text, we have focused on the noiseless setting; we briefly discuss the noisy setting in Appendix G, but there is much room for more extensive investigations.

Acknowledgements

C. Squires was partially supported by an NSF Graduate Fellowship. All authors were partially supported by NSF (DMS-1651995), ONR (N00014-17-1-2147 and N00014-18-1-2765), the MIT-IBM Watson AI Lab, and a Simons Investigator Award to C. Uhler.

References

  • Agrawal et al., (2019) Agrawal, R., Squires, C., Yang, K., Shanmugam, K., and Uhler, C. (2019). Abcd-strategy: Budgeted experimental design for targeted causal structure discovery. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 3400–3409. PMLR.
  • Albert and Barabási, (2002) Albert, R. and Barabási, A.-L. (2002). Statistical mechanics of complex networks. Reviews of modern physics, 74(1):47.
  • Andersson et al., (1997) Andersson, S. A., Madigan, D., Perlman, M. D., et al. (1997). A characterization of markov equivalence classes for acyclic digraphs. Annals of statistics, 25(2):505–541.
  • Badsha et al., (2019) Badsha, M., Fu, A. Q., et al. (2019). Learning causal biological networks with the principle of mendelian randomization. Frontiers in genetics, 10:460.
  • Blair and Peyton, (1993) Blair, J. R. and Peyton, B. (1993). An introduction to chordal graphs and clique trees. In Graph theory and sparse matrix computation, pages 1–29. Springer.
  • Bubeck and Cesa-Bianchi, (2012) Bubeck, S. and Cesa-Bianchi, N. (2012). Regret analysis of stochastic and nonstochastic multi-armed bandit problems. In Foundations and Trends® in Machine Learning, pages 1–122.
  • Cahan et al., (2014) Cahan, P., Li, H., Morris, S. A., Da Rocha, E. L., Daley, G. Q., and Collins, J. J. (2014). Cellnet: network biology applied to stem cell engineering. Cell, 158(4):903–915.
  • de Kroon et al., (2020) de Kroon, A. A., Belgrave, D., and Mooij, J. M. (2020). Causal discovery for causal bandits utilizing separating sets. arXiv preprint arXiv:2009.07916.
  • Dominguez et al., (2016) Dominguez, A. A., Lim, W. A., and Qi, L. S. (2016). Beyond editing: repurposing crispr–cas9 for precision genome regulation and interrogation. Nature reviews Molecular cell biology, 17(1):5.
  • D’Alessio et al., (2015) D’Alessio, A. C., Fan, Z. P., Wert, K. J., Baranov, P., Cohen, M. A., Saini, J. S., Cohick, E., Charniga, C., Dadon, D., Hannett, N. M., et al. (2015). A systematic approach to identify candidate transcription factors that control cell identity. Stem cell reports, 5(5):763–775.
  • Erdős and Rényi, (1960) Erdős, P. and Rényi, A. (1960). On the evolution of random graphs. Publ. Math. Inst. Hung. Acad. Sci, 5(1):17–60.
  • Friedman et al., (2000) Friedman, N., Linial, M., Nachman, I., and Pe’er, D. (2000). Using bayesian networks to analyze expression data. Journal of computational biology, 7(3-4):601–620.
  • Ghassami et al., (2018) Ghassami, A., Salehkaleybar, S., Kiyavash, N., and Bareinboim, E. (2018). Budgeted experiment design for causal structure learning. In International Conference on Machine Learning, pages 1724–1733. PMLR.
  • Greenewald et al., (2019) Greenewald, K., Katz, D., Shanmugam, K., Magliacane, S., Kocaoglu, M., Boix Adsera, E., and Bresler, G. (2019). Sample efficient active learning of causal trees. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc.
  • Hauser and Bühlmann, (2012) Hauser, A. and Bühlmann, P. (2012). Characterization and greedy learning of interventional markov equivalence classes of directed acyclic graphs. The Journal of Machine Learning Research, 13(1):2409–2464.
  • Hauser and Bühlmann, (2014) Hauser, A. and Bühlmann, P. (2014). Two optimal strategies for active learning of causal models from interventional data. International Journal of Approximate Reasoning, 55(4):926–939.
  • He and Geng, (2008) He, Y.-B. and Geng, Z. (2008). Active learning of causal networks with intervention experiments and optimal designs. Journal of Machine Learning Research, 9(Nov):2523–2547.
  • Hyttinen et al., (2012) Hyttinen, A., Eberhardt, F., and Hoyer, P. O. (2012). Learning linear cyclic causal models with latent variables. The Journal of Machine Learning Research, 13(1):3387–3439.
  • Hyttinen et al., (2013) Hyttinen, A., Eberhardt, F., and Hoyer, P. O. (2013). Experiment selection for causal discovery. Journal of Machine Learning Research, 14:3041–3071.
  • Jaber et al., (2020) Jaber, A., Kocaoglu, M., Shanmugam, K., and Bareinboim, E. (2020). Causal discovery from soft interventions with unknown targets: Characterization and learning. Advances in Neural Information Processing Systems, 33.
  • Kocaoglu et al., (2017) Kocaoglu, M., Dimakis, A., and Vishwanath, S. (2017). Cost-optimal learning of causal graphs. In International Conference on Machine Learning, pages 1875–1884. PMLR.
  • Koller and Friedman, (2009) Koller, D. and Friedman, N. (2009). Probabilistic graphical models: principles and techniques. MIT press.
  • Krause et al., (2008) Krause, A., McMahan, H. B., Guestrin, C., and Gupta, A. (2008). Robust submodular observation selection. Journal of Machine Learning Research, 9(12).
  • Lalou et al., (2018) Lalou, M., Tahraoui, M. A., and Kheddouci, H. (2018). The critical node detection problem in networks: A survey. Computer Science Review, 28:92–117.
  • Lattimore et al., (2016) Lattimore, F., Lattimore, T., and Reid, M. D. (2016). Causal bandits: Learning good interventions via causal inference. In Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc.
  • Lee and Bareinboim, (2018) Lee, S. and Bareinboim, E. (2018). Structural causal bandits: where to intervene? Advances in Neural Information Processing Systems, 31.
  • Meek, (1995) Meek, C. (1995). Causal inference and causal explanation with background knowledge. In Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, UAI’95, page 403–410, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.
  • Rackham et al., (2016) Rackham, O. J., Firas, J., Fang, H., Oates, M. E., Holmes, M. L., Knaupp, A. S., Suzuki, H., Nefzger, C. M., Daub, C. O., Shin, J. W., et al. (2016). A predictive computational framework for direct reprogramming between human cell types. Nature genetics, 48(3):331.
  • Rothenhäusler et al., (2015) Rothenhäusler, D., Heinze, C., Peters, J., and Meinshausen, N. (2015). Backshift: Learning causal cyclic graphs from unknown shift interventions. In Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc.
  • Shanmugam et al., (2015) Shanmugam, K., Kocaoglu, M., Dimakis, A. G., and Vishwanath, S. (2015). Learning causal graphs with small interventions. In Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc.
  • Spirtes et al., (2000) Spirtes, P., Glymour, C. N., Scheines, R., and Heckerman, D. (2000). Causation, prediction, and search. MIT press.
  • (32) Squires, C., Magliacane, S., Greenewald, K., Katz, D., Kocaoglu, M., and Shanmugam, K. (2020a). Active structure learning of causal dags via directed clique trees. In Advances in Neural Information Processing Systems, volume 33, pages 21500–21511. Curran Associates, Inc.
  • (33) Squires, C., Wang, Y., and Uhler, C. (2020b). Permutation-based causal structure learning with unknown intervention targets. In Conference on Uncertainty in Artificial Intelligence, pages 1039–1048. PMLR.
  • Touchette and Lloyd, (2004) Touchette, H. and Lloyd, S. (2004). Information-theoretic approach to the study of control systems. Physica A: Statistical Mechanics and its Applications, 331(1-2):140–172.
  • Triantafillou et al., (2017) Triantafillou, S., Lagani, V., Heinze-Deml, C., Schmidt, A., Tegner, J., and Tsamardinos, I. (2017). Predicting causal relationships from biological data: Applying automated causal discovery on mass cytometry data of human immune cells. Scientific reports, 7(1):1–11.
  • Verma and Pearl, (1991) Verma, T. and Pearl, J. (1991). Equivalence and synthesis of causal models. UCLA, Computer Science Department.
  • Vierbuchen et al., (2010) Vierbuchen, T., Ostermeier, A., Pang, Z. P., Kokubu, Y., Südhof, T. C., and Wernig, M. (2010). Direct conversion of fibroblasts to functional neurons by defined factors. Nature, 463(7284):1035–1041.
  • Wodo et al., (2015) Wodo, O., Zola, J., Pokuri, B. S. S., Du, P., and Ganapathysubramanian, B. (2015). Automated, high throughput exploration of process–structure–property relationships using the mapreduce paradigm. Materials discovery, 1:21–28.
  • Yabe et al., (2018) Yabe, A., Hatano, D., Sumita, H., Ito, S., Kakimura, N., Fukunaga, T., and Kawarabayashi, K.-i. (2018). Causal bandits with propagating inference. In International Conference on Machine Learning, pages 5512–5520. PMLR.
  • Yang et al., (2018) Yang, K., Katcoff, A., and Uhler, C. (2018). Characterizing and learning equivalence classes of causal dags under interventions. In International Conference on Machine Learning, pages 5541–5550. PMLR.

Appendix A Preliminaries

A.1 Meek Rules

Given any Markov equivalence class of DAGs with shared directed and undirected edges, the corresponding essential graph $\mathcal{E}$ can be obtained using a set of logical relations known as the Meek rules (Meek, 1995). The Meek rules are stated in the following proposition.

Proposition 1 (Meek Rules (Meek, 1995)).

We can infer all directed edges in $\mathcal{E}$ using the following four rules:

  1. If $i\rightarrow j-k$ and $i$ is not adjacent to $k$, then $j\rightarrow k$.

  2. If $i\rightarrow j\rightarrow k$ and $i-k$, then $i\rightarrow k$.

  3. If $i-j$, $i-k$, $i-l$, $j\rightarrow k$, $l\rightarrow k$ and $j$ is not adjacent to $l$, then $i\rightarrow k$.

  4. If $i-j$, $i-k$, $i-l$, $j\leftarrow k$, $l\rightarrow k$ and $j$ is not adjacent to $l$, then $i\rightarrow j$.

Figure 6 illustrates these four rules.

Figure 6: Meek rules R1–R4.
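As an illustration, the first rule can be implemented as a single orientation pass over a partially directed graph; the two-graph representation (directed edges in one graph, undirected edges in another) is an assumption made for this sketch.

def apply_meek_rule_1(directed, undirected):
    """One pass of Meek rule 1: for each directed edge i -> j and
    undirected edge j - k with i not adjacent to k, orient j -> k.
    Returns True if any edge was oriented. `directed` is a DiGraph-like
    object, `undirected` a Graph-like object on the same nodes."""
    changed = False
    for i, j in list(directed.edges()):
        for k in list(undirected.neighbors(j)):
            adjacent = (undirected.has_edge(i, k)
                        or directed.has_edge(i, k)
                        or directed.has_edge(k, i))
            if k != i and not adjacent:
                undirected.remove_edge(j, k)
                directed.add_edge(j, k)
                changed = True
    return changed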

Appendix B Proof of Exact Matching

Proof of Lemma 1.

Without loss of generality, assume $1,2,\dots,p$ is a topological order of the underlying DAG $\mathcal{G}$, i.e., $j\in\operatorname{pa}_{\mathcal{G}}(i)$ implies $j<i$. We will first construct $I^*$ such that $\mathbb{E}_{\mathrm{P}^{I^*}}(X)=\mathbb{E}_{\mathrm{Q}}(X)$, and then show that $I^*$ is unique.

Existence: Denote by $i_1$ the smallest $i\in[p]$ such that $\mathbb{E}_{\mathrm{P}}(X_i)\neq\mathbb{E}_{\mathrm{Q}}(X_i)$. Without loss of generality, we assume that $i_1$ exists (if $i_1$ does not exist, then $I^*=\varnothing$ suffices since $\mathbb{E}_{\mathrm{P}}(X)=\mathbb{E}_{\mathrm{Q}}(X)$).

Let $I_1$ be the shift intervention with perturbation target $i_1$ and shift value $a_{i_1}=\mathbb{E}_{\mathrm{Q}}(X_{i_1})-\mathbb{E}_{\mathrm{P}}(X_{i_1})$. Since $\mathrm{P}^{I_1}(X_{i_1}=x+a_{i_1}|X_{\operatorname{pa}_{\mathcal{G}}(i_1)})=\mathrm{P}(X_{i_1}=x|X_{\operatorname{pa}_{\mathcal{G}}(i_1)})$ and $\mathrm{P}^{I_1}(X_{\operatorname{pa}_{\mathcal{G}}(i_1)})=\mathrm{P}(X_{\operatorname{pa}_{\mathcal{G}}(i_1)})$ by definition, we have

\mathrm{P}^{I_1}(X_{i_1}=x+a_{i_1})=\mathrm{P}(X_{i_1}=x).

Thus $\mathbb{E}_{\mathrm{P}^{I_1}}(X_{i_1})=\mathbb{E}_{\mathrm{P}}(X_{i_1})+a_{i_1}=\mathbb{E}_{\mathrm{Q}}(X_{i_1})$. Also, $\mathbb{E}_{\mathrm{P}^{I_1}}(X_i)=\mathbb{E}_{\mathrm{Q}}(X_i)$ for $i<i_1$. Denote by $i_2$ the smallest $i\in[p]$ such that $\mathbb{E}_{\mathrm{P}^{I_1}}(X_i)\neq\mathbb{E}_{\mathrm{Q}}(X_i)$. If $i_2$ does not exist, then $I^*=I_1$ suffices. Otherwise, $i_2>i_1$.

Let $I_2$ be the shift intervention with perturbation targets $i_1,i_2$ and corresponding shift values $a_{i_1}$ and $a_{i_2}=\mathbb{E}_{\mathrm{Q}}(X_{i_2})-\mathbb{E}_{\mathrm{P}^{I_1}}(X_{i_2})$. We have $\mathrm{P}^{I_2}(X_{i_2}=x+a_{i_2}|X_{\operatorname{pa}_{\mathcal{G}}(i_2)})=\mathrm{P}(X_{i_2}=x|X_{\operatorname{pa}_{\mathcal{G}}(i_2)})=\mathrm{P}^{I_1}(X_{i_2}=x|X_{\operatorname{pa}_{\mathcal{G}}(i_2)})$ and $\mathrm{P}^{I_2}(X_{\operatorname{pa}_{\mathcal{G}}(i_2)})=\mathrm{P}^{I_1}(X_{\operatorname{pa}_{\mathcal{G}}(i_2)})$ by definition, the topological order, and $i_2>i_1$. Then

\mathrm{P}^{I_2}(X_{i_2}=x+a_{i_2})=\mathrm{P}^{I_1}(X_{i_2}=x).

Thus $\mathbb{E}_{\mathrm{P}^{I_2}}(X_{i_2})=\mathbb{E}_{\mathrm{P}^{I_1}}(X_{i_2})+a_{i_2}=\mathbb{E}_{\mathrm{Q}}(X_{i_2})$. Also, $\mathbb{E}_{\mathrm{P}^{I_2}}(X_i)=\mathbb{E}_{\mathrm{P}^{I_1}}(X_i)=\mathbb{E}_{\mathrm{Q}}(X_i)$ for $i<i_2$. By iterating this process, we reach $I_k$ for some $k\leq p$ such that there is no $i$ with $\mathbb{E}_{\mathrm{P}^{I_k}}(X_i)\neq\mathbb{E}_{\mathrm{Q}}(X_i)$. Taking $I^*=I_k$ suffices.

Uniqueness: Suppose there exist $I_1^*\neq I_2^*$ such that $\mathbb{E}_{\mathrm{P}^{I_1^*}}(X)=\mathbb{E}_{\mathrm{P}^{I_2^*}}(X)=\mathbb{E}_{\mathrm{Q}}(X)$. Let $i\in[p]$ be the smallest index such that either $i$ has different shift values in $I_1^*$ and $I_2^*$, or $i$ is a perturbation target of only one of the two interventions. In either case, we have $\mathrm{P}^{I_1^*}(X_{\operatorname{pa}_{\mathcal{G}}(i)})=\mathrm{P}^{I_2^*}(X_{\operatorname{pa}_{\mathcal{G}}(i)})$ by the topological order, and $\mathrm{P}^{I_1^*}(X_i=x|X_{\operatorname{pa}_{\mathcal{G}}(i)})=\mathrm{P}^{I_2^*}(X_i=x+a|X_{\operatorname{pa}_{\mathcal{G}}(i)})$ for some $a\neq 0$. Thus $\mathrm{P}^{I_1^*}(X_i=x)=\mathrm{P}^{I_2^*}(X_i=x+a)$, contradicting $\mathbb{E}_{\mathrm{P}^{I_1^*}}(X_i)=\mathbb{E}_{\mathrm{P}^{I_2^*}}(X_i)$. ∎

Appendix C Proof of Identifiability

C.1 Shift Interventional MEC

Proof of Lemma 2.

For any distribution $f$ that factorizes according to $\mathcal{G}$ and any shift intervention $I$, let $i\in I$ be a source w.r.t. $I$. By definition, $\operatorname{an}_{\mathcal{G}}(i)\cap I=\varnothing$. Thus $\operatorname{pa}_{\mathcal{G}}(i)$ contains neither a member nor a descendant of $I$, i.e., there do not exist $j\in\operatorname{pa}_{\mathcal{G}}(i)$ and $k\in I$ such that there is a directed path from $k$ to $j$ or $k=j$. Hence we have $f^I(X_{\operatorname{pa}_{\mathcal{G}}(i)})=f(X_{\operatorname{pa}_{\mathcal{G}}(i)})$, which gives

f^I(X_i=x+a_i)=f(X_i=x).

Therefore $\mathbb{E}_{f^I}(X_i)=\mathbb{E}_f(X_i)+a_i$.

On the other hand, if $i\in I$ is not a source w.r.t. $I$, consider the following linear Gaussian model:

X_j=\sum_{k\in\operatorname{pa}_{\mathcal{G}}(j)}\beta_{kj}X_k+\epsilon_j,\quad\forall j\in[p],

where the $\beta_{kj}$ are deterministic scalars and the $\epsilon_j\sim\mathcal{N}(0,1)$ are i.i.d. random variables.

Since $i$ is not a source w.r.t. $I$, there exists a source $i'$ w.r.t. $I$ such that there is a directed path $i'=i_0\to i_1\to\dots\to i_\ell=i$. From the above, $\mathbb{E}_{f^I}(X_{i'})=\mathbb{E}_f(X_{i'})+a_{i'}$ for $a_{i'}\neq 0$. Consider setting $\beta_{i_0,i_1}=2|a_i|/a_{i'}$, $\beta_{i_k,i_{k+1}}=1$ for $k=1,\dots,\ell-1$, and the remaining edge weights to $\epsilon>0$. For $\epsilon$ sufficiently small, we have that $\mathbb{E}_{f^I}(X_i)\geq\mathbb{E}_f(X_i)+1.5|a_i|$, i.e., we cannot have $\mathbb{E}_{f^I}(X_i)=\mathbb{E}_f(X_i)+a_i$. ∎

Proof of Theorem 1.

Denote $\mathcal{I}=\{I_{1},...,I_{m}\}$. For $k\in[m]=\{1,...,m\}$, let $\hat{I}_{k}$ and $\hat{I}_{k}^{\prime}$ be the collections of source nodes of $I_{k}$ in $\mathcal{G}_{1}$ and $\mathcal{G}_{2}$, respectively. From Definition 1, we know that $\mathcal{G}_{1}$ and $\mathcal{G}_{2}$ are in the same shift-$\mathcal{I}$-MEC if and only if they are in the same $\mathcal{I}$-MEC and, for any pair $(f,\{f^{I_{k}}\}_{k\in[m]})$ that is $\mathcal{I}$-Markov w.r.t. both $\mathcal{G}_{1}$ and $\mathcal{G}_{2}$, it satisfies

\[
\mathbb{E}_{f^{I_{k}}}(X_{i})=\mathbb{E}_{f}(X_{i})+a_{i},\quad\forall i\in\hat{I}_{k},\forall k\in[m], \tag{3}
\]

if and only if it also satisfies

\[
\mathbb{E}_{f^{I_{k}}}(X_{i})=\mathbb{E}_{f}(X_{i})+a_{i},\quad\forall i\in\hat{I}_{k}^{\prime},\forall k\in[m]. \tag{4}
\]

By Lemma 2, we know that $\hat{I}_{k}^{\prime}\subset\hat{I}_{k}$ for all $k\in[m]$: otherwise we could find a pair $(f,\{f^{I_{k}}\}_{k\in[m]})$ that violates (4) for some $i\in\hat{I}_{k}^{\prime}\setminus\hat{I}_{k}$. Similarly, $\hat{I}_{k}\subset\hat{I}_{k}^{\prime}$. Therefore $\hat{I}_{k}=\hat{I}_{k}^{\prime}$, in which case (3) is equivalent to (4).

Hence, $\mathcal{G}_{1}$ and $\mathcal{G}_{2}$ are in the same shift-$\mathcal{I}$-MEC if and only if they are in the same $\mathcal{I}$-MEC and they have the same source nodes of $I$ for every $I\in\mathcal{I}$. From Theorem 3.9 in \citetappendix{yang2018characterizing}, we know that $\mathcal{G}_{1}$ and $\mathcal{G}_{2}$ are in the same $\mathcal{I}$-MEC if and only if they share the same skeleton, $v$-structures, and directed edges $\{i\rightarrow j\,|\,i\in I,j\notin I,I\in\mathcal{I},i-j\}$. Therefore, $\mathcal{G}_{1}$ and $\mathcal{G}_{2}$ are in the same shift-$\mathcal{I}$-MEC if and only if they have the same skeleton, $v$-structures, directed edges $\{i\rightarrow j\,|\,i\in I,j\notin I,I\in\mathcal{I},i-j\}$, as well as the same source nodes of $I$ for every $I\in\mathcal{I}$. ∎

Let $\mathcal{D}$ be any DAG, suppose $\mathcal{I}=\{I_{1},...,I_{m}\}$, and let $\hat{I}_{k}$ be the collection of source nodes of $I_{k}$ in $\mathcal{D}$ for $k\in[m]$. Then, as a direct corollary of Theorem 1, we can represent a shift interventional Markov equivalence class by a (general) interventional Markov equivalence class.

Corollary 1.

Let $\hat{\mathcal{I}}=\mathcal{I}\cup\{\hat{I}_{k}\,|\,k\in[m]\}$; a DAG $\mathcal{D}^{\prime}$ is shift-$\mathcal{I}$-Markov equivalent to $\mathcal{D}$ if and only if $\mathcal{D}^{\prime}$ is $\hat{\mathcal{I}}$-Markov equivalent to $\mathcal{D}$.

Proof.

The proof follows as a direct application of Theorem 1, Theorem 3.9 in \citetappendix{yang2018characterizing}, and the fact that there are no edges between nodes in $\hat{I}_{k}$. ∎

C.2 Mean Interventional Faithfulness

Proof of Lemma 3.

If Assumption 1 holds, then for any $i\notin T$, since $\mathbb{E}_{{\mathrm{P}}}(X_{i})=\mathbb{E}_{{\mathrm{Q}}}(X_{i})$, we have $i\notin I^{*}$ and $\operatorname{an}_{\mathcal{G}}(i)\cap I^{*}=\varnothing$. Let $j\in T$ be such that there is an edge $i-j$ between $i$ and $j$. Since $\mathbb{E}_{{\mathrm{P}}}(X_{j})\neq\mathbb{E}_{{\mathrm{Q}}}(X_{j})$, either $j\in I^{*}$ or $\operatorname{an}_{\mathcal{G}}(j)\cap I^{*}\neq\varnothing$. Therefore, if $j\rightarrow i$, then $\operatorname{an}_{\mathcal{G}}(i)\cap I^{*}\neq\varnothing$, a contradiction. Thus $j\leftarrow i$.

Conversely, if Assumption 1 does not hold, then there exists $i\notin T$ (i.e., $\mathbb{E}_{{\mathrm{P}}}(X_{i})=\mathbb{E}_{{\mathrm{Q}}}(X_{i})$) such that either $i\in I^{*}$ or $\operatorname{an}_{\mathcal{G}}(i)\cap I^{*}\neq\varnothing$. If $i\in I^{*}$, then since $\mathbb{E}_{{\mathrm{P}}}(X_{i})=\mathbb{E}_{{\mathrm{Q}}}(X_{i})$, Lemma 2 implies that $i$ cannot be a source w.r.t. $I^{*}$, and hence $\operatorname{an}_{\mathcal{G}}(i)\cap I^{*}\neq\varnothing$ by the definition of a source. Therefore we only need to discuss the case where $i\notin T$ and $\operatorname{an}_{\mathcal{G}}(i)\cap I^{*}\neq\varnothing$.

Let $k$ be a source of $\operatorname{an}_{\mathcal{G}}(i)\cap I^{*}$; then $k$ must also be a source of $I^{*}$. Otherwise, there is a directed path from some $k^{\prime}\in I^{*}$ to $k$ with $k^{\prime}\neq k$. Since $k\in\operatorname{an}_{\mathcal{G}}(i)$, we also have $k^{\prime}\in\operatorname{an}_{\mathcal{G}}(i)$ by the definition of ancestors. Therefore $k^{\prime}\in\operatorname{an}_{\mathcal{G}}(i)\cap I^{*}$, which violates $k$ being a source of $\operatorname{an}_{\mathcal{G}}(i)\cap I^{*}$.

Since $k$ is a source of $I^{*}$, by Lemmas 1 and 2, we know that $\mathbb{E}_{{\mathrm{P}}}(X_{k})\neq\mathbb{E}_{{\mathrm{Q}}}(X_{k})$, i.e., $k\in T$. Notice that $k\in\operatorname{an}_{\mathcal{G}}(i)$, and thus there is a directed path from $k\in T$ to $i\notin T$. Thus, there exists an edge $i-j$ with $j\in T$, $i\notin T$ such that $j\rightarrow i$. ∎

Using Lemma 3, we can check whether Assumption 1 holds by looking at the orientations of the edges between $T$ and $[p]\setminus T$, which is achievable by any (general) intervention on $X_{T}$ (or $X_{[p]\setminus T}$).

Corollary 2.

Assumption 1 holds if and only if the $\{T\}$-essential graph (or $\{[p]\setminus T\}$-essential graph) of $\mathcal{G}$ has the edge $j\leftarrow i$ for every $i-j$ with $j\in T$, $i\notin T$.

Proof.

The proof follows as a direct application of the graphical characterization of the interventional equivalence class in Section 3.1 and the result of Lemma 3. ∎

Appendix D Details of Algorithms

D.1 Decomposition of Shift Interventional Essential Graphs

Chain Graph Decomposition: \citetappendix{hauser2014two} showed that every interventional essential graph is a chain graph with undirected connected chordal chain components, where the orientations in one chain component do not affect any other component. This decomposition also holds for shift interventional essential graphs, since every shift interventional essential graph is also an interventional essential graph (Corollary 1). Figure 7 shows an example of this decomposition.

Figure 7: Chain graph decomposition: (a) the essential graph; (b)–(d) its chain components 1–3.
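Concretely, the chain components are simply the connected components of the graph restricted to its undirected edges; a minimal sketch of this step (ours, using NetworkX as in our implementation) is:

```python
import networkx as nx

def chain_components(nodes, undirected_edges):
    # Chain components of a chain graph: connected components of the
    # subgraph consisting of the undirected edges only.
    U = nx.Graph()
    U.add_nodes_from(nodes)
    U.add_edges_from(undirected_edges)
    return list(nx.connected_components(U))

# Example: 1 - 2 undirected, node 3 attached only by directed edges.
print(chain_components([1, 2, 3], [(1, 2)]))  # [{1, 2}, {3}]
```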
Proof of Lemma 4.

Suppose an undirected connected chain component $\mathcal{C}$ of the essential graph has two source nodes $i$ and $j$ w.r.t. $\mathcal{C}$. Since $\mathcal{C}$ is connected, there is a path between $i$ and $j$ in $\mathcal{C}$; let $i-k_{1}-...-k_{r}-j$ be the shortest such path. Because $i$ and $j$ are sources of $\mathcal{C}$, we must have $i\rightarrow k_{1}$ and $k_{r}\leftarrow j$. Therefore, there exists $l\in\{1,...,r\}$ such that $k_{l-1}\rightarrow k_{l}\leftarrow k_{l+1}$ (where $k_{0}=i$ and $k_{r+1}=j$). By the choice of a shortest path, there is no edge between $k_{l-1}$ and $k_{l+1}$, so $k_{l-1}\rightarrow k_{l}\leftarrow k_{l+1}$ forms a $v$-structure in $\mathcal{C}$. Since all DAGs in the same shift interventional equivalence class share the same $v$-structures, $k_{l-1}\rightarrow k_{l}\leftarrow k_{l+1}$ must be oriented in the essential graph. This contradicts $k_{l-1},k_{l},k_{l+1}$ belonging to the same undirected chain component $\mathcal{C}$. Combining this with the fact that $\mathcal{C}$ must have at least one source node, we obtain that $\mathcal{C}$ has exactly one source node w.r.t. $\mathcal{C}$.

Next we show that the source node of a chain component is also a source of $\mathcal{G}$ if and only if there are no incoming edges into this component. Let $i$ be the source of the chain component $\mathcal{C}$. On the one hand, $i$ must be a source of $\mathcal{G}$ if there are no incoming edges into $\mathcal{C}$. On the other hand, if there is an incoming edge $j\rightarrow k$ for some $j\notin\mathcal{C}$ and $k\in\mathcal{C}$, then since the essential graph is closed under Meek rules R1 and R2 (Proposition 1), there must be an edge $j\rightarrow l$ for all neighbors $l$ of $k$. Repeating this argument and using the fact that $\mathcal{C}$ is connected, we obtain $j\rightarrow l$ for all $l\in\mathcal{C}$ (Figure 8). In particular $j\rightarrow i$, so $i$ cannot be a source of $\mathcal{G}$.

Figure 8: $j\rightarrow l$ for all $l\in\mathcal{C}$.

D.2 NP-completeness of MinMaxC

It was shown in \citetappendix{shen2012exact} and \citetappendix{lalou2018critical} that the MinMaxC problem is NP-complete for general graphs and for split graphs, respectively. Split graphs are a subclass of chordal graphs in which the vertices can be partitioned into a clique and an independent set (i.e., isolated nodes after removing the clique). Thus, MinMaxC is also NP-complete for chordal graphs.

D.3 Clique Tree Strategy

The clique tree strategy takes as input an undirected connected chordal graph $\mathcal{C}$ and the sparsity constraint $S$, and outputs a shift intervention with no more than $S$ perturbation targets. If $\mathcal{C}$ contains no more than $S$ nodes, it returns any shift intervention with perturbation targets in $\mathcal{C}$. If $\mathcal{C}$ contains more than $S$ nodes, it first constructs a clique tree $\mathcal{T}(\mathcal{C})$ of $\mathcal{C}$ by the maximum-weight spanning tree algorithm \citepappendix{koller2009probabilistic}. It then iterates through the nodes of $\mathcal{T}(\mathcal{C})$ (which are maximal cliques of $\mathcal{C}$) to find a maximal clique $\mathcal{K}$ that breaks $\mathcal{T}(\mathcal{C})$ into subtrees whose sizes are no more than half of the size of $\mathcal{T}(\mathcal{C})$. If $\mathcal{K}$ has no more than $S$ nodes, it returns any shift intervention with perturbation targets $\mathcal{K}$. Otherwise, it samples $S$ nodes from $\mathcal{K}$ and returns any shift intervention with these $S$ nodes as perturbation targets. The following subroutine summarizes this procedure.

Input: Chordal chain component $\mathcal{C}$, sparsity constraint $S$.
1  if $\mathcal{C}$ has no more than $S$ nodes then
2      set $I$ as any shift intervention on $\mathcal{C}$ with non-zero shift values;
4  else
5      let $C(\mathcal{C})$ be the set of maximal cliques of the chordal graph $\mathcal{C}$;
6      let $\mathcal{T}(\mathcal{C})$ be a maximum-weight spanning tree with $C(\mathcal{C})$ as nodes;
7      set $\mathcal{K}=\varnothing$;
8      for $K$ in $C(\mathcal{C})$ do
9          get the subtrees of $\mathcal{T}(\mathcal{C})$ after deleting node $K$;
10         if all subtrees have size $\leq\lceil(|C(\mathcal{C})|-1)/2\rceil$ then
11             set $\mathcal{K}=K$;
12             break;
14         end if
16     end for
17     if $|\mathcal{K}|>S$ then
18         set $\mathcal{K}$ as a random $S$-subset of $\mathcal{K}$;
19     end if
20     set $I$ as any shift intervention on $\mathcal{K}$ with non-zero shift values;
22 end if
Output: Shift intervention $I$
Algorithm 2 CliqueTree($\mathcal{C}$, $S$)

Complexity: Let $N$ denote the number of nodes in $\mathcal{C}$, i.e., $N=|\mathcal{C}|$. All maximal cliques of the chordal graph $\mathcal{C}$ can be found in $O(N^{2})$ time \citepappendix{galinier1995chordal}. We use Kruskal's algorithm for computing the maximum-weight spanning tree, which takes $O(N^{2}\log(N))$ time \citepappendix{kruskal1956shortest}. The remaining procedure of iterating through $C(\mathcal{C})$ takes no more than $O(N^{2})$, since a chordal graph with $N$ nodes has no more than $N$ maximal cliques \citepappendix{galinier1995chordal} and all subtree sizes can be obtained in $O(N)$. Therefore this subroutine runs in $O(N^{2}\log(N))$ time.
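For concreteness, a minimal sketch of this subroutine (ours; it assumes NetworkX, returns only the perturbation targets, and leaves the non-zero shift values to the caller) is:

```python
import itertools
import random
import networkx as nx

def clique_tree_targets(C, S):
    """Sketch of CliqueTree: pick <= S perturbation targets from a
    central maximal clique of the chordal chain component C (an nx.Graph)."""
    if C.number_of_nodes() <= S:
        return set(C.nodes)
    cliques = [set(c) for c in nx.chordal_graph_cliques(C)]
    r = len(cliques)
    # Clique tree: maximum-weight spanning tree of the clique graph,
    # with edge weights given by separator (intersection) sizes.
    T = nx.Graph()
    T.add_nodes_from(range(r))
    for a, b in itertools.combinations(range(r), 2):
        if cliques[a] & cliques[b]:
            T.add_edge(a, b, weight=len(cliques[a] & cliques[b]))
    T = nx.maximum_spanning_tree(T)
    # Central clique: deleting it leaves subtrees with <= ceil((r-1)/2)
    # cliques each; a tree centroid always satisfies this, so the loop breaks.
    K = set()
    for k in T.nodes:
        rest = T.subgraph(n for n in T.nodes if n != k)
        if all(len(cc) <= -(-(r - 1) // 2) for cc in nx.connected_components(rest)):
            K = cliques[k]
            break
    if len(K) > S:
        K = set(random.sample(sorted(K), S))  # random S-subset
    return K
```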

D.4 Supermodular Strategy

The supermodular procedure takes as input an undirected connected chordal graph $\mathcal{C}$ and the sparsity constraint $S$, and outputs a shift intervention whose perturbation targets are obtained by solving

\[
\min_{A\subset V_{\mathcal{C}}}\max_{i\in V_{\mathcal{C}}}\hat{f}_{i}(A),\quad|A|\leq S, \tag{5}
\]

with the SATURATE algorithm \citepappendix{krause2008robust}. Here $V_{\mathcal{C}}$ denotes the node set of $\mathcal{C}$ and $\hat{f}_{i}(A)=\sum_{j\in V_{\mathcal{C}}}\hat{g}_{i,j}(A)$ with $\hat{g}_{i,j}$ defined in (2). Algorithm 3 summarizes this subroutine.

Input: Chordal chain component $\mathcal{C}$, sparsity constraint $S$.
1  if $\mathcal{C}$ has no more than $S$ nodes then
2      set $I$ as any shift intervention on $\mathcal{C}$ with non-zero shift values;
4  else
5      let $A$ be the solution of (5) returned by SATURATE \citepappendix{krause2008robust};
6      set $I$ as any shift intervention on $A$ with non-zero shift values;
8  end if
Output: Shift intervention $I$
Algorithm 3 Supermodular($\mathcal{C}$, $S$)

Supermodularity: First we give an example showing that $f_{i}$ defined in (1) is not supermodular for chordal graphs, although it is clearly monotonic decreasing.

Example 2.

Consider the chordal graph in Figure 9; we have $f_{1}(\{2\})-f_{1}(\varnothing)=3-4=-1>-2=1-3=f_{1}(\{2,3\})-f_{1}(\{3\})$. Therefore $f_{1}$ is not supermodular for this graph.

Figure 9: $f_{i}$ is not supermodular.

Next we prove that $\hat{f}_{i}$ is supermodular and monotonic decreasing.

Proof.

Since $\hat{f}_{i}(A)=\sum_{j\in V_{\mathcal{C}}}\hat{g}_{i,j}(A)$, it suffices to show that every $\hat{g}_{i,j}$ is supermodular and monotonic decreasing. In the following, we refer to a path without cycles as a simple path.

For any $A\subset B\subset V_{\mathcal{C}}$, since $V_{\mathcal{C}}-B$ induces a subgraph of the one induced by $V_{\mathcal{C}}-A$, any simple path between $i$ and $j$ in $V_{\mathcal{C}}-B$ must also be present in $V_{\mathcal{C}}-A$. Hence $m_{i,j}(V_{\mathcal{C}}-B)\leq m_{i,j}(V_{\mathcal{C}}-A)$, which means that

\[
\hat{g}_{i,j}(A)\geq\hat{g}_{i,j}(B),
\]

i.e., $\hat{g}_{i,j}$ is monotonic decreasing.

For any $x\in V_{\mathcal{C}}\setminus B$, the difference $m_{i,j}(V_{\mathcal{C}}-B)-m_{i,j}(V_{\mathcal{C}}-(B\cup\{x\}))$ is the number of simple paths in $V_{\mathcal{C}}-B$ between $i$ and $j$ that pass through $x$. Each of these paths must also be present in $V_{\mathcal{C}}-A$, since $V_{\mathcal{C}}-B$ is a subgraph of $V_{\mathcal{C}}-A$. Therefore,

\[
m_{i,j}(V_{\mathcal{C}}-B)-m_{i,j}(V_{\mathcal{C}}-(B\cup\{x\}))\leq m_{i,j}(V_{\mathcal{C}}-A)-m_{i,j}(V_{\mathcal{C}}-(A\cup\{x\})),
\]

which means that

\[
\hat{g}_{i,j}(A\cup\{x\})-\hat{g}_{i,j}(A)\leq\hat{g}_{i,j}(B\cup\{x\})-\hat{g}_{i,j}(B),
\]

i.e., $\hat{g}_{i,j}$ is supermodular. ∎

SATURATE algorithm \citepappendix{krause2008robust}: Having shown that $\hat{f}_{i}$ is monotonic supermodular, we solve the robust supermodular optimization problem in (5) with the SATURATE algorithm of \citetappendix{krause2008robust}. SATURATE performs a binary search over potential objective values and uses a greedy partial cover algorithm to check the feasibility of each value; for a detailed description of the algorithm, see \citetappendix{krause2008robust}.
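A compact sketch of this scheme in our notation (ours, not the reference implementation; it is stated for maximizing $\min_i F_i$ with monotone submodular $F_i$, which is equivalent to (5) via $F_{i}(A)=\hat{f}_{i}(\varnothing)-\hat{f}_{i}(A)$, and it enforces the budget $|A|\leq S$ directly rather than the relaxed budget used for SATURATE's approximation guarantee):

```python
def saturate(F_list, V, S, eps=1e-6):
    """Sketch of SATURATE: binary search on the achievable value c, with a
    greedy partial cover step on the truncated objective mean_i min(F_i(A), c).
    Each F in F_list is a callable taking a set of nodes."""
    c_lo, c_hi = 0.0, min(F(set(V)) for F in F_list)
    best = set()
    while c_hi - c_lo > eps:
        c = (c_lo + c_hi) / 2.0
        trunc = lambda A: sum(min(F(A), c) for F in F_list) / len(F_list)
        A = set()
        while trunc(A) < c - 1e-12 and len(A) < S:
            gains = {v: trunc(A | {v}) - trunc(A) for v in V if v not in A}
            v_best = max(gains, key=gains.get)
            if gains[v_best] <= 0:
                break
            A.add(v_best)
        if trunc(A) >= c - 1e-12:   # value c achievable within budget S
            best, c_lo = A, c
        else:
            c_hi = c
    return best
```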

Complexity: Let $N$ denote the number of nodes in $\mathcal{C}$, i.e., $N=|\mathcal{C}|$. SATURATE uses at most $O(N^{2}S\log(N))$ evaluations of the supermodular functions $\hat{f}_{i}$ \citepappendix{krause2008robust}. Each evaluation of $\hat{f}_{i}$ requires all the simple paths between $i$ and every other $j$ in $\mathcal{C}$. A modified depth-first search is used to calculate these paths \citepappendix{sedgewick2001algorithms}, which results in $\mathcal{F}(N)$ complexity. For general graphs, this problem is #P-complete \citepappendix{valiant1979complexity}; however, the cost might be significantly lower for chordal graphs. We are unaware of specific complexity results for chordal graphs, which would be an interesting direction for future work. The total runtime of this subroutine is thus bounded by $O(N^{2}\mathcal{F}(N)S\log(N))$. (For a more efficient implementation, one could replace the undirected graph with a DAG in its MEC, which can be found in linear time using LBFS. All statements hold except that $\hat{f}_{i}$ is no longer necessarily tight for tree graphs. This replacement results in a total complexity of $O(N^{4}S\log(N))$ for the subroutine, since directed simple paths can be counted in $O(N^{2})$.)
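The path-counting primitive $m_{i,j}$ used above admits the following minimal sketch (ours; it enumerates simple paths by depth-first search via NetworkX and is therefore exponential in the worst case, matching the $\mathcal{F}(N)$ discussion above):

```python
import networkx as nx

def m(G, removed, i, j):
    """Number of simple paths between i and j in the subgraph of G induced
    by deleting the vertex set `removed`."""
    H = G.subgraph(set(G.nodes) - set(removed))
    if i not in H or j not in H:
        return 0
    if i == j:
        return 1  # the trivial single-node path
    return sum(1 for _ in nx.all_simple_paths(H, i, j))
```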

D.5 Violation of Faithfulness

From Corollary 2, we know that we can check whether Assumption 1 holds by any intervention on $X_{T}$ (or $X_{[p]\setminus T}$). However, we can run Algorithm 1 to obtain $I^{*}$ even without Assumption 1, because lines 2-14 of Algorithm 1 always return the correct $I^{*}$.

Let $I\subset I^{*}$ be the resolved part of $I^{*}$ in line 2, i.e., the shift intervention constructed by taking a subset of the perturbation targets of $I^{*}$ together with their corresponding shift values. Let $I^{*}-I$ be the remaining shift intervention constructed by deleting $I$ from $I^{*}$. Denote $T_{I}=\{i\,|\,i\in[p],\mathbb{E}_{{\mathrm{P}}^{I}}(X_{i})\neq\mathbb{E}_{{\mathrm{Q}}}(X_{i})\}$, which is returned by line 3. If $T_{I}=\varnothing$, then we have recovered $I^{*}$. Otherwise we have:

Lemma 7.

The source nodes w.r.t. $T_{I}$ must be perturbation targets of $I^{*}-I$, and their corresponding shift values are $\mathbb{E}_{{\mathrm{Q}}}(X_{i})-\mathbb{E}_{{\mathrm{P}}^{I}}(X_{i})$ (for source node $i$).

Proof.

Let $i$ be a source node w.r.t. $I^{*}-I$ and $a_{i}$ its corresponding shift value. Since intervening on the other nodes in $I^{*}-I$ does not change the marginal distribution of $X_{i}$, we must have $a_{i}=\mathbb{E}_{({\mathrm{P}}^{I})^{I^{*}-I}}(X_{i})-\mathbb{E}_{{\mathrm{P}}^{I}}(X_{i})$. Because $({\mathrm{P}}^{I})^{I^{*}-I}={\mathrm{P}}^{I^{*}}={\mathrm{Q}}$, we obtain

\[
a_{i}=\mathbb{E}_{{\mathrm{Q}}}(X_{i})-\mathbb{E}_{{\mathrm{P}}^{I}}(X_{i}).
\]

From this, we also have $\mathbb{E}_{{\mathrm{P}}^{I}}(X_{i})\neq\mathbb{E}_{{\mathrm{Q}}}(X_{i})$ since $a_{i}\neq 0$. Therefore, all source nodes $i$ w.r.t. $I^{*}-I$ are in $T_{I}$ and their corresponding shift values are $\mathbb{E}_{{\mathrm{Q}}}(X_{i})-\mathbb{E}_{{\mathrm{P}}^{I}}(X_{i})$.

Now let $i$ be a source w.r.t. $T_{I}$; we claim that $i$ must also be a source node w.r.t. $I^{*}-I$. Since $\mathbb{E}_{{\mathrm{P}}^{I}}(X_{i})\neq\mathbb{E}_{{\mathrm{Q}}}(X_{i})$, $i$ must either be a source node of $I^{*}-I$ or have a source node of $I^{*}-I$ as an ancestor. In the latter case, since all source nodes of $I^{*}-I$ are in $T_{I}$, $i$ cannot be a source w.r.t. $T_{I}$, a contradiction. Therefore a source w.r.t. $T_{I}$ must also be a source w.r.t. $I^{*}-I$. Combined with the result in the previous paragraph, all source nodes $i$ w.r.t. $T_{I}$ are perturbation targets of $I^{*}-I$ and their corresponding shift values are $\mathbb{E}_{{\mathrm{Q}}}(X_{i})-\mathbb{E}_{{\mathrm{P}}^{I}}(X_{i})$. ∎

This lemma shows that $U_{T}$ obtained in lines 5-11 of Algorithm 1 must consist of perturbation targets of $I^{*}-I$, and line 12 gives the correct shift values. Therefore Algorithm 1 must return the correct $I^{*}$.

However, to be able to obtain the shift-$\mathcal{I}$-EG of $\mathcal{G}$, we need mean interventional faithfulness to be satisfied by every $I\in\mathcal{I}$ (replacing $I^{*}$ with $I$ and ${\mathrm{Q}}$ with ${\mathrm{P}}^{I}$ in Assumption 1), as well as $\mathcal{I}$-faithfulness \citepappendix{squires2020permutation} to be satisfied by $({\mathrm{P}},\{{\mathrm{P}}^{I}\}_{I\in\mathcal{I}})$ with respect to $\mathcal{G}$.

Appendix E Proof of Worst-case Bounds

E.1 Proof of Lemma 5

To show Lemma 5, we need the following proposition, which states that any maximal clique of a chordal graph can be oriented to be most-upstream without creating cycles or $v$-structures, and that the orientation within this clique can be arbitrary. It was pointed out in \citepappendix{vandenberghe2015chordal}, using similar arguments, that any clique of a chordal graph can be made most-upstream. Here, we provide the complete proof.

Proposition 2.

Let $\mathcal{D}$ be any undirected chordal graph and $K$ any maximal clique of $\mathcal{D}$. For any permutation $\pi_{K}$ of the nodes in $K$, there exists a topological order $\pi$ of the nodes in $\mathcal{D}$ such that $\pi$ starts with $\pi_{K}$ and orienting $\mathcal{D}$ according to $\pi$ does not create any $v$-structures.

Proof.

A topological order $\pi$ of a chordal graph $\mathcal{D}$ such that orienting according to $\pi$ does not create $v$-structures corresponds to the reverse of a perfect elimination order \citepappendix{hauser2014two}. A perfect elimination order is an order of the nodes in $\mathcal{D}$ such that, for every node $i$, the neighbors of $i$ in $\mathcal{D}$ that appear after $i$ in this order constitute a clique in $\mathcal{D}$. Any chordal graph has at least one perfect elimination order \citepappendix{andersson1997characterization}. In the following, we use the reverse of a perfect elimination order to refer to a topological order that does not create $v$-structures.
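Perfect elimination orders can be computed efficiently; a minimal sketch (ours) via maximum cardinality search, whose visit order reversed is a perfect elimination order for chordal graphs, is:

```python
import networkx as nx

def perfect_elimination_order(G):
    """Maximum cardinality search on a chordal graph G (an nx.Graph):
    the reverse of the visit order is a perfect elimination order."""
    visited, weight = [], {v: 0 for v in G.nodes}
    while weight:
        v = max(weight, key=weight.get)  # node with most visited neighbors
        visited.append(v)
        del weight[v]
        for u in G[v]:
            if u in weight:
                weight[u] += 1
    return list(reversed(visited))
```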

To prove Proposition 2, we first prove the following statement: if $K\neq\mathcal{D}$, then there exists a perfect elimination order of the nodes in $\mathcal{D}$ that starts with a node not in $K$. To show this, by Proposition 6 in \citetappendix{hauser2014two}, we only need to prove that if $K\neq\mathcal{D}$, then there is a node not in $K$ whose neighbors in $\mathcal{D}$ constitute a clique.

We use induction on the number of nodes in $\mathcal{D}$. Consider $|\mathcal{D}|=1$: since $K$ is a maximal clique, $K=\mathcal{D}$, and the statement holds trivially. Suppose the statement is true for chordal graphs of size $n-1$ and consider $|\mathcal{D}|=n$. Since $\mathcal{D}$ is a chordal graph, it has a perfect elimination order; if this order starts with a node not in $K$, the statement holds immediately. If it starts with some $i\in K$, then there is no edge between $i$ and any node $j\notin K$: otherwise, since the order starts with $i$, all neighbors of $i$ (which include $j$ and $K\setminus\{i\}$) constitute a clique, so there are edges $j-k$ for all $k\in K$, contradicting $K$ being a maximal clique.


Consider the chordal graph $\mathcal{D}^{\prime}$ obtained by deleting $i$ from $\mathcal{D}$, so $|\mathcal{D}^{\prime}|=n-1$. Let $K^{\prime}$ be the maximal clique in $\mathcal{D}^{\prime}$ containing $K\setminus\{i\}$. If $K^{\prime}=\mathcal{D}^{\prime}$, let $j$ be any node in $\mathcal{D}\setminus K$. Since there is no edge $j-i$ and $\mathcal{D}^{\prime}$, which contains $j$, is a clique, $j$'s neighbors in $\mathcal{D}$ must constitute a clique. If $K^{\prime}\neq\mathcal{D}^{\prime}$, then by the induction hypothesis there exists $j\in\mathcal{D}^{\prime}\setminus K^{\prime}$ such that $j$'s neighbors in $\mathcal{D}^{\prime}$ constitute a clique. Since there is no edge $j-i$, $j$'s neighbors in $\mathcal{D}$ must also constitute a clique. Thus the statement holds for chordal graphs of size $n$, which completes the induction.


Now, we prove Proposition 2 by induction on the number of nodes in $\mathcal{D}$. Consider $|\mathcal{D}|=1$: since $K$ is a maximal clique, $K=\mathcal{D}$, and Proposition 2 holds trivially.

Suppose Proposition 2 holds for chordal graphs of size $n-1$ and consider $|\mathcal{D}|=n$. If $K=\mathcal{D}$, then Proposition 2 holds. If $K\neq\mathcal{D}$, then by the above statement there exists $j\in\mathcal{D}\setminus K$ such that some perfect elimination order of $\mathcal{D}$ starts with $j$. Let $\mathcal{D}^{\prime}$ be the chordal graph obtained by deleting $j$ from $\mathcal{D}$. By the induction hypothesis, there exists $\pi^{\prime}$, the reverse of a perfect elimination order of $\mathcal{D}^{\prime}$, that starts with $\pi_{K}$. Let $\pi=(\pi^{\prime},j)$; the reverse of $\pi$ is a perfect elimination order, since all neighbors of $j$ in $\mathcal{D}$ constitute a clique. Therefore $\pi$ gives the desired topological order and Proposition 2 holds for chordal graphs of size $n$. This completes the proof of Proposition 2. ∎

Proof of Lemma 5.

Given any algorithm $\mathcal{A}$, let $S_{1},...,S_{k}$ be the first $k$ shift interventions chosen by $\mathcal{A}$. By Proposition 2, there exists a feasible orientation of $\mathcal{C}$ such that the largest maximal clique $K$ of $\mathcal{C}$ is most-upstream and, for $k^{\prime}=1,...,k$, $S_{k^{\prime}}\cap K$ is most-downstream in $K-\cup_{l<k^{\prime}}S_{l}$. For example, in the figure below, suppose algorithm $\mathcal{A}$ chooses $S_{1}=\{3\}$ based on (a) and $S_{2}=\{2\}$ based on (b). There is a feasible orientation in (d) such that the largest clique $K=\{1,2,3\}$ is most-upstream and $S_{k^{\prime}}\cap K$ is most-downstream in $K$, for $k^{\prime}=1,2$.

(Figure: (a) $\mathcal{C}$; (b) $\mathcal{A}$ chooses $S_{1}=\{3\}$; (c) $\mathcal{A}$ chooses $S_{2}=\{2\}$; (d) orientation of $\mathcal{C}$.)

Since $|S_{k^{\prime}}|\leq S$ and $|K|=m_{\mathcal{C}}$, in this worst case at least $\lceil\frac{m_{\mathcal{C}}-1}{S}\rceil$ interventions are needed to identify the source of $K$, i.e., the source of $\mathcal{C}$ (minus $1$ because, if only one node of $K$ is left unintervened, it must be the source). ∎

E.2 Proof of Lemma 6

Let $K$ be the clique obtained by lines 7-13 in Algorithm 2; when $\mathcal{C}$ has more than $S$ nodes, we refer to $K$ as the central clique. To prove Lemma 6, we need the following proposition. It shows that, by looking at the undirected graph $\mathcal{C}$, we can find a node in the central clique $K$ satisfying certain properties, which will become useful in the proof of Lemma 6.

Proposition 3.

Let $\{\mathcal{T}_{a}\}_{a\in A}$ be the connected subtrees of $\mathcal{T}(\mathcal{C})$ after removing $K$. For a node $k\in K$, let $A_{k}\subset A$ be the set of indices $a\in A$ such that the subtree $\mathcal{T}_{a}$ is connected to $K$ only through the node $k$, and let $\mathcal{T}_{A_{k}}=\{\mathcal{T}_{a}\}_{a\in A_{k}}$ be the collection of all such subtrees. If there exists $a\in A\setminus A_{k}$ such that there is an edge between $\mathcal{T}_{a}$ and $k$, let $\mathcal{T}^{*}_{k}$ be the one with the largest number of maximal cliques; otherwise let $\mathcal{T}^{*}_{k}=\varnothing$. Then there exists a node $k$ such that the number of maximal cliques in the subgraph induced by the subtrees $\mathcal{T}_{A_{k}}\cup\{\mathcal{T}^{*}_{k}\}$ and $k$ itself does not exceed $\lceil\frac{r-1}{2}\rceil$.

Example 3.

As an example, the following figure shows the subtrees that are connected to $K$ only through node $1$, indexed by $A_{1}$ (blue). The largest subtree in $A\setminus A_{1}$ that is adjacent to node $1$ is denoted by $\mathcal{T}_{1}^{*}$ (undimmed in green).

Figure 12: An example of $\mathcal{T}_{A_{k}}$ and $\mathcal{T}_{k}^{*}$ for $k=1$.
Proof of Proposition 3.

Notice the following facts.

Fact 1: Let $\mathcal{T}$ be any subtree in $\{\mathcal{T}_{a}\}_{a\in A}$; then there must exist a node $i\in K$ such that there is no edge between $i$ and $\mathcal{T}$.

Proof of Fact 1: For any two nodes $i,i^{\prime}\in K$, because $\mathcal{C}$ is chordal and $\mathcal{T}$ is connected, either the neighbors of $i$ in $\mathcal{T}$ are a subset of those of $i^{\prime}$, or the neighbors of $i^{\prime}$ in $\mathcal{T}$ are a subset of those of $i$. Therefore we can order the nodes of $K$ such that the neighbors in $\mathcal{T}$ of each node are a subset of those of every node appearing after it. If the first node in this order has some neighbor $t\in\mathcal{T}$, then all nodes in $K$ have $t$ as a neighbor, contradicting $K$ being a maximal clique.

(Figure: (a) contradicting chordality of $\mathcal{C}$; (b) contradicting maximality of the clique $K$.)

Fact 2: Let $\bar{\mathcal{T}}$ be the collection of subtrees whose edges to $K$ all go through a single node $k\in K$. Then $\bar{\mathcal{T}}$ is the union of the disjoint sets $\mathcal{T}_{A_{k}}$, $k\in K$.

Proof of Fact 2: This follows directly from the definition of AkA_{k}.

Fact 3: Let $\mathcal{T}^{*}$ be the collection of non-empty $\mathcal{T}_{k}^{*}$, $k\in K$. Then $\mathcal{T}^{*}\cap\bar{\mathcal{T}}=\varnothing$. Furthermore, for any subtree in $\mathcal{T}^{*}$, there is a node $i\in K$ such that there is no edge between $i$ and this subtree.

Proof of Fact 3: This follows directly from the definition of $\mathcal{T}_{k}^{*}$ and Fact 1.

Now we prove Proposition 3. If $\mathcal{T}^{*}=\varnothing$, then since $K$ contains at least two nodes (otherwise $A=\varnothing$ and the proposition holds trivially) and the number of maximal cliques in $\bar{\mathcal{T}}$ does not exceed $r-1$, Fact 2 yields at least one $k\in K$ such that the number of maximal cliques in the subgraph induced by $\mathcal{T}_{A_{k}}\cup\{\mathcal{T}_{k}^{*}\}=\mathcal{T}_{A_{k}}$ and $k$ itself does not exceed $\lceil\frac{r-1}{2}\rceil$.

Now suppose $\mathcal{T}^{*}\neq\varnothing$. Let $\mathcal{T}^{*}_{k^{\prime}}$ be the subtree with the largest number of maximal cliques in $\mathcal{T}^{*}$, and let $k\in K$ be a node such that there is no edge between $k$ and the subtree $\mathcal{T}^{*}_{k^{\prime}}$ ($k$ exists by Fact 3). Suppose the proposition does not hold. Then the number of maximal cliques in the subgraph induced by $\mathcal{T}_{A_{k}}\cup\{\mathcal{T}_{k}^{*}\}$ and $k$ itself must exceed $\lceil\frac{r-1}{2}\rceil$, and likewise the number of maximal cliques in the subgraph induced by $\mathcal{T}_{A_{k^{\prime}}}\cup\{\mathcal{T}^{*}_{k^{\prime}}\}$ and $k^{\prime}$ itself exceeds $\lceil\frac{r-1}{2}\rceil$. Notice that $(\mathcal{T}_{A_{k}}\cup\{\mathcal{T}_{k}^{*}\})\cap(\mathcal{T}_{A_{k^{\prime}}}\cup\{\mathcal{T}_{k^{\prime}}^{*}\})=\varnothing$, so $\mathcal{T}_{A_{k}}\cup\{\mathcal{T}^{*}_{k}\}$ is connected to $\mathcal{T}_{A_{k^{\prime}}}\cup\{\mathcal{T}^{*}_{k^{\prime}}\}$ only through $K$. Hence the sum of the numbers of maximal cliques in $\mathcal{T}_{A_{k}}\cup\{\mathcal{T}^{*}_{k}\}\cup\{\{k\}\}$ and in $\mathcal{T}_{A_{k^{\prime}}}\cup\{\mathcal{T}^{*}_{k^{\prime}}\}\cup\{\{k^{\prime}\}\}$ does not exceed $r$, so they cannot both have more than $\lceil\frac{r-1}{2}\rceil$ maximal cliques. Therefore the proposition must hold. ∎

Proof of Lemma 6.

For CliqueTree, we prove this lemma for a "less-adaptive" version for the sake of a clearer discussion. In this "less-adaptive" version, instead of outputting one intervention with $S$ perturbation targets sampled from the central clique $K$ (when it has more than $S$ nodes) as in Algorithm 2, we directly output $\lceil\frac{|K|-1}{S}\rceil$ interventions with non-overlapping perturbation targets in $K$. Each of these interventions has no more than $S$ perturbation targets, and together they contain at least $|K|-1$ nodes of $K$. Furthermore, we pick these interventions such that, if they contain exactly $|K|-1$ nodes, the remaining node satisfies Proposition 3.

After these $\lceil\frac{|K|-1}{S}\rceil$ interventions, we obtain a partially directed $\mathcal{C}$, which is a chain graph; one of its chain components without incoming edges is the input to CliqueTree in the next iteration of the inner loop of Algorithm 1. Denote this chain component by $\mathcal{C}^{\prime}$. We show that $\mathcal{C}^{\prime}$ has no more than $\lceil\frac{r-1}{2}\rceil$ maximal cliques, each with no more than $m_{\mathcal{C}}$ nodes. If $\lceil\frac{r-1}{2}\rceil=0$, then $r=1$ and this holds trivially since the source of $\mathcal{C}$ must be identified. In the following, we assume $\lceil\frac{r-1}{2}\rceil>0$.

Size of maximal cliques: Every maximal clique of $\mathcal{C}^{\prime}$ is contained in a maximal clique of $\mathcal{C}$, and thus has no more than $m_{\mathcal{C}}$ nodes.

Number of maximal cliques: If the source node is identified, then $\mathcal{C}^{\prime}$ has only one node and the claim holds trivially. Now consider the case where the source node is not identified. We proceed in two cases.

Case I: If these $\lceil\frac{|K|-1}{S}\rceil$ interventions contain all nodes of $K$, then they break the clique tree $\mathcal{T}(\mathcal{C})$ into subtrees, each with no more than $\lceil\frac{r-1}{2}\rceil$ maximal cliques. $\mathcal{C}^{\prime}$ must belong to one of these subtrees and therefore has no more than $\lceil\frac{r-1}{2}\rceil$ maximal cliques.

Case II: If these $\lceil\frac{|K|-1}{S}\rceil$ interventions do not contain all nodes of $K$, then exactly one node of $K$ is not a perturbation target, and it satisfies Proposition 3. Denote this node by $k$ and the source node w.r.t. the intervened $|K|-1$ nodes by $i$. From Theorem 1, $i$ is identified and, for all $j\in K$ with $j\neq k$, the orientation of the edge $k-j$ is identified.

If $i\rightarrow k$, then $i$ is the source w.r.t. $K$. If $i$ is also the source w.r.t. $\mathcal{C}$, then $\mathcal{C}^{\prime}=\{i\}$ has no more than $\lceil\frac{r-1}{2}\rceil$ maximal cliques. Otherwise, there is a unique subtree of $\mathcal{T}(\mathcal{C})$ after removing $K$ that has an edge pointing to $i$ in $\mathcal{C}$ (it exists because $i$ is the source of $K$ but not the source of $\mathcal{C}$; it is unique because there are no edges between subtrees and no $v$-structure at $i$), and $\mathcal{C}^{\prime}$ must belong to this subtree, which has no more than $\lceil\frac{r-1}{2}\rceil$ maximal cliques.

(Figure: (a) the case $i\rightarrow k$; (b) the case $i\leftarrow k$, Fact 1.)

If $i\leftarrow k$, then $k$ is the source w.r.t. $K$. Consider all the subtrees of $\mathcal{T}(\mathcal{C})$ after removing $K$. We have the following two facts.

Fact 1: Let $\mathcal{T}^{\prime}$ be a subtree such that there is an edge between $\mathcal{T}^{\prime}$ and $K-\{k\}$ and all such edges point towards $\mathcal{T}^{\prime}$. Then all edges between $k$ and $t\in\mathcal{T}^{\prime}$ must be oriented as $k\rightarrow t$. Thus $\mathcal{C}^{\prime}\cap\mathcal{T}^{\prime}=\varnothing$.

Proof of Fact 1: Suppose, to the contrary, that $t\in\mathcal{T}^{\prime}$ and $t\rightarrow k$. Let $j\in K-\{k\}$ be such that there is an edge between $j$ and $\mathcal{T}^{\prime}$. Since $\mathcal{T}^{\prime}$ is connected, there is a path from $j$ to $t$ in $\mathcal{T}^{\prime}$; let $j=t_{0}-t_{1}-...-t_{l}-t_{l+1}=t$ be the shortest such path. By minimality, there is no edge between $t_{l^{\prime}}$ and $t_{l^{\prime\prime}}$ for $l^{\prime\prime}-l^{\prime}>1$. Since all edges between $\mathcal{T}^{\prime}$ and $K-\{k\}$ point towards $\mathcal{T}^{\prime}$, we have $j=t_{0}\rightarrow t_{1}$. Therefore, to avoid $v$-structures, the path must be oriented $j=t_{0}\rightarrow t_{1}\rightarrow...\rightarrow t_{l}\rightarrow t_{l+1}=t$. This creates a directed cycle $k\rightarrow j\rightarrow...\rightarrow t\rightarrow k$, a contradiction.

Fact 2: There can be at most one subtree $\mathcal{T}^{\prime}$ such that there is an edge pointing from $\mathcal{T}^{\prime}$ to $K-\{k\}$ and some $t\in\mathcal{T}^{\prime}$ such that $t\rightarrow k$ or $t-k$ is unidentified. Therefore at most one subtree $\mathcal{T}^{\prime}$ of this type can have $\mathcal{C}^{\prime}\cap\mathcal{T}^{\prime}\neq\varnothing$.

Proof of Fact 2: Suppose, to the contrary, that there are two different subtrees $\mathcal{T}^{\prime}_{1},\mathcal{T}^{\prime}_{2}$ such that $K-\{k\}\ni j_{1}\leftarrow t_{1}\in\mathcal{T}^{\prime}_{1}$ and $K-\{k\}\ni j_{2}\leftarrow t_{2}\in\mathcal{T}^{\prime}_{2}$. Since there is no edge $t_{1}-t_{2}$, we have $j_{1}\neq j_{2}$ (otherwise $t_{1}\rightarrow j_{1}\leftarrow t_{2}$ would form a $v$-structure within the chain component). Without loss of generality, suppose $j_{1}\rightarrow j_{2}$. Let $t$ be any node in $\mathcal{T}_{2}^{\prime}$ with an edge $t-k$; since $\mathcal{T}_{2}^{\prime}$ is connected, let $t=t_{0}^{\prime}-t_{1}^{\prime}-...-t_{l}^{\prime}-t_{l+1}^{\prime}=t_{2}$ be the shortest path between $t$ and $t_{2}$ in $\mathcal{T}_{2}^{\prime}$. Let $l^{\prime}$ be the maximum index in $0,1,...,l$ such that $t^{\prime}_{l^{\prime}}\leftarrow t^{\prime}_{l^{\prime}+1}$. If no such $l^{\prime}$ exists, then $t=t_{0}^{\prime}\rightarrow t_{1}^{\prime}\rightarrow...\rightarrow t_{l+1}^{\prime}=t_{2}$. Since $j_{1}\rightarrow j_{2}$ and there is no $v$-structure at $j_{2}$, there must be an identified edge $j_{1}-t_{l+1}^{\prime}=t_{2}$. Since there is no edge between $t_{2}$ and $t_{1}$ and $t_{1}\rightarrow j_{1}$, avoiding a $v$-structure forces $j_{1}\rightarrow t_{2}$. The same deduction yields identified edges down to $j_{1}\rightarrow t_{0}^{\prime}=t$. Since $k\rightarrow j_{1}$ and there are no directed cycles, the edge $k\rightarrow t$ must be identified. If $l^{\prime}$ exists, then since $t=t_{0}^{\prime}-t_{1}^{\prime}-...-t_{l+1}^{\prime}=t_{2}$ is a shortest path and there is no $v$-structure, we must have $t=t_{0}^{\prime}\leftarrow...\leftarrow t_{l^{\prime}+1}^{\prime}$, and since $l^{\prime}$ is maximal, $t_{l^{\prime}+1}^{\prime}\rightarrow...\rightarrow t_{l+1}^{\prime}=t_{2}$. By a deduction similar to the previous case, there is an identified edge $j_{1}\rightarrow t_{l^{\prime}+1}^{\prime}$. Therefore $k\rightarrow j_{1}\rightarrow t_{l^{\prime}+1}^{\prime}\rightarrow...\rightarrow t_{0}^{\prime}=t$, and to avoid directed cycles, $k\rightarrow t$ must be identified. In either case, all edges between $k$ and $\mathcal{T}_{2}^{\prime}$ are identified as pointing into $\mathcal{T}_{2}^{\prime}$.

(Figure: (a) the case where $l^{\prime}$ does not exist; (b) the case where $l^{\prime}$ exists.)

Using the above two facts, let $\mathcal{T}^{\prime}$ be the unique subtree in Fact 2 (if it exists). If there is no edge between $\mathcal{T}^{\prime}$ and $k$, then $\mathcal{C}^{\prime}$ must lie in the subgraph induced by $k$ itself and $\mathcal{T}_{A_{k}}$ from Proposition 3, which has no more than $\lceil\frac{r-1}{2}\rceil$ maximal cliques. If there is an edge between $\mathcal{T}^{\prime}$ and $k$, then $\mathcal{C}^{\prime}$ must lie in the union of $k$, $\mathcal{T}^{\prime}$ and $\mathcal{T}_{A_{k}}$. Since the number of maximal cliques in $\mathcal{T}^{\prime}$ is no more than that of $\mathcal{T}_{k}^{*}$ in Proposition 3, $\mathcal{C}^{\prime}$ has no more than $\lceil\frac{r-1}{2}\rceil$ maximal cliques.

Therefore, after $\lceil\frac{|K|-1}{S}\rceil\leq\lceil\frac{m_{\mathcal{C}}-1}{S}\rceil$ interventions, we reduce the number of maximal cliques to at most $\lceil\frac{r-1}{2}\rceil$ while keeping the size of the largest maximal clique at most $m_{\mathcal{C}}$. Applying this iteratively, we obtain that CliqueTree identifies the source node with at most $\lceil\log_{2}(r_{\mathcal{C}}+1)\rceil\cdot\lceil\frac{m_{\mathcal{C}}-1}{S}\rceil$ interventions.

For Supermodular, we do not analyze the gap between $\hat{g}_{i,j}$ and $g_{i,j}$ or how well SATURATE solves (5). In that case, it is never worse than CliqueTree in the worst case over the feasible orientations of $\mathcal{C}$, since it solves MinMaxC optimally without being constrained to maximal cliques. Therefore, it also takes no more than $\lceil\log_{2}(r_{\mathcal{C}}+1)\rceil\cdot\lceil\frac{m_{\mathcal{C}}-1}{S}\rceil$ interventions to identify the source node. ∎

E.3 Proof of Theorem 2

Proof of Theorem 2.

This result follows from Lemmas 5 and 6. Divide $I^{*}$ into $I_{1},...,I_{k}$ such that $I_{k^{\prime}}$ is the set of source nodes of $I^{*}-\cup_{l<k^{\prime}}I_{l}$. Since shifting $I_{k^{\prime}}$ affects the marginals of subsequent $I_{k^{\prime\prime}}$ with $k^{\prime\prime}>k^{\prime}$, any algorithm needs to identify $I_{1},...,I_{k}$ sequentially in order to identify the exact shift values.

Suppose $I_{1},...,I_{k^{\prime}-1}$ have been learned. For $I_{k^{\prime}}$, consider the chain components with no incoming edges of the subgraph of the shift-$\{\cup_{l<k^{\prime}}I_{l}\}$-EG induced by $T=\{i\,|\,i\in[p],\mathbb{E}_{{\mathrm{P}}^{\cup_{l<k^{\prime}}I_{l}}}(X_{i})\neq\mathbb{E}_{{\mathrm{Q}}}(X_{i})\}$. Applying Lemma 4 with $\mathcal{I}=\{\cup_{l<k^{\prime}}I_{l}\}$ and Observation 1 to this subgraph and $I_{k^{\prime}}$, we deduce that there are exactly $|I_{k^{\prime}}|$ such chain components and that $I_{k^{\prime}}$ has exactly one member in each of them. Let $m_{k^{\prime},1},...,m_{k^{\prime},|I_{k^{\prime}}|}$ be the sizes of the largest maximal cliques in these $|I_{k^{\prime}}|$ chain components. By Lemma 5, any algorithm needs at least $\sum_{i=1}^{|I_{k^{\prime}}|}\lceil\frac{m_{k^{\prime},i}-1}{S}\rceil$ interventions to identify $I_{k^{\prime}}$ in the worst case. On the other hand, since all these chain components contain no more than $r$ maximal cliques, by Lemma 6, our strategies need at most $\lceil\log_{2}(r+1)\rceil\cdot\sum_{i=1}^{|I_{k^{\prime}}|}\lceil\frac{m_{k^{\prime},i}-1}{S}\rceil$ interventions to identify $I_{k^{\prime}}$.

Applying this result for $k^{\prime}=1,...,k$, we obtain that our strategies for solving the causal mean matching problem require at most $\lceil\log_{2}(r+1)\rceil$ times more interventions than the optimal strategy, in the worst case over all feasible orientations. ∎

Appendix F Numerical Experiments

F.1 Experimental Setup

Graph Generation: We consider two random graph models: Erdös-Rényi graphs (Erdős and Rényi, 1960) and Barabási–Albert graphs (Albert and Barabási, 2002). The probability of edge creation in Erdös-Rényi graphs is set to $0.2$; the number of edges to attach from a new node to existing nodes in Barabási–Albert graphs is set to $2$. We also test on two types of structured chordal graphs: rooted trees, with the root sampled uniformly at random from all nodes in the tree, and moralized Erdös-Rényi graphs (Shanmugam et al., 2015) with the probability of edge creation set to $0.2$.
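A minimal sketch of these generators (ours, using NetworkX; the orientation step, which turns each undirected sample into a DAG via a random node permutation, is our assumption about one standard way to do this):

```python
import random
import networkx as nx

def random_dag(n, model="er", seed=0):
    # Undirected sample: Erdös-Rényi with p = 0.2, or Barabási–Albert with m = 2.
    if model == "er":
        G = nx.erdos_renyi_graph(n, p=0.2, seed=seed)
    else:
        G = nx.barabasi_albert_graph(n, m=2, seed=seed)
    # Orient each edge according to a random permutation of the nodes.
    rng = random.Random(seed)
    order = {v: r for r, v in enumerate(rng.sample(list(G.nodes), n))}
    D = nx.DiGraph()
    D.add_nodes_from(G.nodes)
    D.add_edges_from((u, v) if order[u] < order[v] else (v, u) for u, v in G.edges)
    return D
```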

Multiple Runs: For each instance in the setting of Barabási–Albert graphs with 100 nodes and $S=1$ in Figure 5(a), we ran the three non-deterministic strategies (UpstreamRand, CliqueTree, Supermodular) five times and observed little difference across all instances. We therefore excluded error bars when plotting the results, as they are visually negligible and the strategies are robust in these settings.

Implementation: We implemented our algorithms using the NetworkX package \citepappendix{hagberg2008exploring} and the CausalDAG package (https://github.com/uhlerlab/causaldag). All code is written in Python and was run on an AMD 2990WX CPU.

F.2 More Empirical Results

In the following, we present additional empirical results. The evaluations are the same as in Section 6. The following figures show behaviors similar to Figure 5 across the different settings.

Random graphs of size $\{10,50,100\}$: Barabási–Albert and Erdös-Rényi graphs with the number of nodes in $\{10,50,100\}$.

Figure 16: Barabási–Albert graphs with $50$ nodes. (a) and (b): $S=1$; (c): $|I^{*}|=25$.
Figure 17: Barabási–Albert graphs with $10$ nodes. (a) and (b): $S=1$; (c): $|I^{*}|=5$.
Figure 18: Erdös-Rényi graphs with $100$ nodes. (a) and (b): $S=1$; (c): $|I^{*}|=50$.
Figure 19: Erdös-Rényi graphs with $50$ nodes. (a) and (b): $S=1$; (c): $|I^{*}|=25$.
Figure 20: Erdös-Rényi graphs with $10$ nodes. (a) and (b): $S=1$; (c): $|I^{*}|=5$.

Larger Barabási–Albert graphs of size $1000$:

Figure 21: Larger Barabási–Albert graphs with $1000$ nodes (excluding Coloring, which takes more than $80$ extra interventions). (a) and (b): $S=1$; (c): $|I^{*}|=100$.

Two types of structured chordal graphs:

Figure 22: Structured chordal graphs. (a) and (b): rooted tree graphs with $50$ nodes and $S=1$; (c) and (d): moralized Erdös-Rényi graphs with $10$ nodes and $S=1$.

Appendix G Discussion of the Noisy Setting

In the noisy setting, an intervention can be repeated many times to obtain an estimated essential graph. Each intervention results in a posterior update over the true DAG $\mathcal{G}$ among all DAGs in the observational Markov equivalence class. For a tree graph $\mathcal{G}$, this corresponds to a probability distribution over all possible roots. To be able to learn the edges, \citetappendix{greenewald2019sample} proposed a bounded edge-strength condition on the noise for binary variables. Under this condition, they showed that the root node of a tree graph can be learned with high probability in a finite number of steps in expectation.

In our setting, to ensure that the source node w.r.t. an intervention can be learned, we need to repeat this intervention sufficiently many times so that the mean of each variable $X_{i}$ can be estimated accurately. Furthermore, to ensure that the edges of the (general) interventional essential graph can be learned, we need a condition similar to that of \citepappendix{greenewald2019sample} for general chordal graphs and continuous variables.
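As a rough illustration of the sample-size requirement (our own back-of-the-envelope bound, not part of the original analysis): if each $X_{i}$ is $\sigma$-sub-Gaussian and the smallest non-zero mean discrepancy to be detected is $\gamma$, then averaging $n$ i.i.d. samples per intervention and applying a standard sub-Gaussian tail bound together with a union bound over the $p$ nodes gives
\[
\mathbb{P}\Big(\max_{i\in[p]}\big|\bar{X}_{i}-\mathbb{E}(X_{i})\big|\geq\gamma/2\Big)\leq 2p\exp\left(-\frac{n\gamma^{2}}{8\sigma^{2}}\right),
\]
so $n=O\big(\frac{\sigma^{2}}{\gamma^{2}}\log(p/\delta)\big)$ repetitions suffice to decide all mean (in)equalities correctly with probability at least $1-\delta$.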

\bibliographystyleappendix{apalike}
\bibliographyappendix{main}