
Learning Linear Causal Representations from Interventions under General Nonlinear Mixing

Simon Buchholz (Max Planck Institute for Intelligent Systems, Tübingen, Germany), Goutham Rajendran (Carnegie Mellon University, Pittsburgh, USA; equal contribution), Elan Rosenfeld (Carnegie Mellon University, Pittsburgh, USA), Bryon Aragam (University of Chicago, Chicago, USA), Bernhard Schölkopf (Max Planck Institute for Intelligent Systems, Tübingen, Germany), Pradeep Ravikumar (Carnegie Mellon University, Pittsburgh, USA)
Abstract

We study the problem of learning causal representations from unknown, latent interventions in a general setting, where the latent distribution is Gaussian but the mixing function is completely general. We prove strong identifiability results given unknown single-node interventions, i.e., without having access to the intervention targets. This generalizes prior works which have focused on weaker classes, such as linear maps or paired counterfactual data. This is also the first instance of causal identifiability from non-paired interventions for deep neural network embeddings. Our proof relies on carefully uncovering the high-dimensional geometric structure present in the data distribution after a non-linear density transformation, which we capture by analyzing quadratic forms of precision matrices of the latent distributions. Finally, we propose a contrastive algorithm to identify the latent variables in practice and evaluate its performance on various tasks.

1 Introduction

Modern generative models such as GPT-4 [59] or DDPMs [26] achieve tremendous performance for a wide variety of tasks [7]. They do this by effectively learning high-level representations which map to raw data through complicated non-linear maps, such as transformers [89] or diffusion processes [79]. However, we are unable to reason about the specific representations they learn; in particular, they are not necessarily related to the true underlying data generating process. Besides their susceptibility to bias [20], they often fail to generalize to out-of-distribution settings [5], rendering them problematic in safety-critical domains. In order to gain a deeper understanding of what representations deep generative models learn, one line of work has pursued the goal of causal representation learning (CRL) [75]. CRL aims to learn high-level representations of data while simultaneously recovering the rich causal structure embedded in them, allowing us to reason about them and use them for higher-level cognitive tasks such as systematic generalization and planning. This has been used effectively in many application domains including genomics [52] and robotics [43, 94].

A crucial primitive in representation learning is the fundamental notion of identifiability [37, 70], i.e., the question of whether a unique (up to tolerable ambiguities) model can explain the observed data. It is well known that because of non-identifiability, CRL is impossible in general settings in the absence of inductive biases or additional information [29, 50]. In this work, we consider additional information in the form of interventional data [75]. It is common to have access to such data in many application domains such as the genomics and robotics applications mentioned above (e.g. [17, 58, 57]). Moreover, there is a pressing need in such safety-critical domains to build reliable and trustworthy systems, making identifiable CRL particularly important. Therefore, it is important to study whether, and how, we can perform identifiable CRL from raw observational and interventional data. Here, identifiability opens the possibility to provably recover the true representations with formal guarantees. Meaningfully learning such representations with causal structure enables better interpretability, allows us to reason about fairness, and helps with performing high-level tasks such as planning and reasoning.

In this work, we study precisely this problem of causal representation learning in the presence of interventions. While prior work has studied simpler settings of linear or polynomial mixing [84, 88, 2, 71, 11], we allow for general non-linear mixing, which means our identifiability results apply to complex real-world systems and datasets which are used in practice. With our results, we make progress on fundamental questions on interventional learning raised by [75].

Concretely, we consider a general model with latent variables $Z$ and observed data $X$ generated as $X=f(Z)$ where $f:\mathbb{R}^{d}\to\mathbb{R}^{d^{\prime}}$ is an arbitrary non-linear mixing function. We assume $Z$ satisfies a Gaussian structural equation model (SEM) which is unknown and unobserved. A Gaussian prior is commonly used in practice (which implies a linear SEM over $Z$) and further, having a simple model for $Z$ allows us to learn useful representations, generate data efficiently, and explore the latent causal relationships, while the non-linearity of $f$ ensures universal approximability of the model. We additionally assume access to interventional data $X^{(i)}=f(Z^{(i)})$ for $i\in I$ where $Z^{(i)}$ is the latent under an intervention on a single node $Z_{t_{i}}$. Notably, we allow for various kinds of interventions, we don’t require knowledge of the targets $t_{i}$, and we don’t require paired data (i.e., we don’t need counterfactual samples from the joint distribution $(X,X^{(i)})$, but only their marginals). Having targets or counterfactual data is unrealistic in many practical settings but many prior identifiability results require them, so eliminating this dependence is a crucial step towards CRL in the real world.

Contributions    Our main contribution is a general solution to the identifiability problem that captures practical settings with non-linear mixing functions (e.g. deep neural networks) and unknown interventions (since the targets are latent). We study both perfect and imperfect interventions and additionally allow for shift interventions; perfect interventions remove the dependence of the target variable on its parents, while imperfect interventions (also known as soft interventions) modify the dependencies. Below, we summarize our main contributions:

  1. We show identifiability of the full latent variable model, given non-paired interventional data with unknown intervention targets. In particular, we learn the mixing, the targets and the latent causal structure.

  2. Compared to prior works that have focused on linear/polynomial $f$, we allow non-linear $f$, which encompasses representations learned by e.g. deep neural networks and captures complex real-world datasets. Moreover, we study both imperfect (also called soft) and perfect interventions, and always allow shift interventions.

  3. We construct illustrative counterexamples to probe the tightness of our assumptions, which suggest directions for future work.

  4. We propose a novel algorithm based on contrastive learning to learn causal representations from interventional data, and we run experiments to validate our theory. Our experiments suggest that a contrastive approach, which so far has been unexplored in interventional CRL, is a promising technique going forward.

2 Related work

Causal representation learning [75, 74] has seen much recent progress and applications since it generalizes and connects the fields of Independent Component Analysis [13, 28, 30], causal inference [81, 61, 63, 64, 82] and latent variable modeling [40, 3, 78, 96, 32]. Fundamental to this approach is the notion of identifiability [36, 15, 93]. Due to non-identifiability in general settings without inductive biases [29, 50], prior works have approached this problem from various angles — using additional auxiliary labels [36, 27, 6, 76]; by imposing sparsity [75, 42, 54, 103]; or by restricting the function classes [41, 9, 104, 22]. See also the survey [32]. Moreover, a long line of work has proposed practical methods for CRL (which includes causal disentanglement as a special case), e.g. [19, 95, 16, 97, 44, 14], to name a few. It’s worth noting that most of these approaches are essentially variants of the Variational Autoencoder framework [39, 68].

Of particular relevance to this work is the setting when interventional data is available. We first remark that the much simpler case of fully observed variables (i.e., no latent variables) has been studied in e.g. [23, 62, 83, 34, 18] (see also the survey [82]). In this work, we consider the more difficult setting of structure learning over latent variables, which has been explored in [46, 42, 6, 104, 1, 84, 88, 2, 71, 11, 72]. Among these, [46] assumes that the intervention targets are known, [42] specifically consider instance-level pre- and post-interventions for time-series data, and [6, 104, 1] assume access to paired counterfactual data. In contrast, we assume unknown targets and work in general settings with non-paired interventional data, which is important in various real-world applications [85, 99]. The work [48] requires several graphical restrictions on the causal graph and also requires $2d$ interventions, while we make no graphical restrictions and only require $d$ interventions (which we also show cannot be improved under our assumptions). [71, 11, 84, 88] consider linear mixing functions $f$ whereas we study general non-linear $f$. Finally, [2] consider polynomial mixing in the more restricted class of deterministic do-interventions. Several concurrent works study related settings [100, 35, 65, 45, 92]. For a more detailed comparison to the related works, we refer to Appendix C.

Our proposed algorithm is based on contrastive learning. Contrastive learning has been used in other contexts in this domain [31, 27, 55, 104, 91] (either in the setting of time-series data or paired counterfactual data), however the application to non-paired interventional settings is new to the best of our knowledge.

Notation   In this work, we will almost always work with vectors and matrices in $d$ dimensions where $d$ is the latent dimension, and we will disambiguate when necessary. For a vector $v$, we denote by $v_{i}$ its $i$-th entry. Let $\mathrm{Id}$ denote the $d\times d$ identity matrix with columns given by the standard basis vectors $e_{1},\ldots,e_{d}$. We denote by $N(\mu,\Sigma)$ the multivariate normal distribution with mean $\mu$ and covariance $\Sigma$. For a set $C$, let $\mathcal{U}(C)$ denote the uniform distribution on $C$. For two random variables $X,X^{\prime}$, we write $X\overset{\mathcal{D}}{=}X^{\prime}$ if their distributions are the same. We denote the set $\{1,\ldots,d\}$ by $[d]$. For permutation matrices we use the convention that $(P_{\omega})_{ij}=\bm{1}_{j=\omega(i)}$ for $\omega\in S_{d}$. We use standard directed graph terminology, e.g. edges, parents. When we use the term “non-linear mixing” in this work, we also include linear mixing as a special case.

3 Setting: Interventional Causal Representation Learning

Figure 1: Illustration of an example latent variable model under interventions. (a) No interventions. (b) An imperfect intervention on node $t_{i}=3$; dashed edges indicate the weights could have potentially been modified. (c) A perfect intervention on node $t_{i}=3$.

We now formally introduce our main settings of interest and the required assumptions. We assume that there is a latent variable distribution $Z$ on $\mathbb{R}^{d}$ and an observational distribution $X$ on $\mathbb{R}^{d^{\prime}}$ given by $X=f(Z)$ where $d\leq d^{\prime}$ and $f$ is a non-linear mixing. This encapsulates the most general definition of a latent variable model. As the terminology suggests, we observe $X$ in real life, e.g. images of cats, while $Z$ encodes high-level latent information we wish to learn and model, e.g. $Z$ could indicate orientation and size.

Assumption 1 (Nonlinear $f$).

The non-linear mixing $f$ is injective, differentiable and embeds into $\mathbb{R}^{d^{\prime}}$.

Such an assumption is standard in the literature. Injectivity is needed in order to identify $Z$ from $X$, because otherwise we may have learning ambiguity if multiple $Z$s map to the same $X$. Differentiability is needed to transfer densities. Note that Assumption 1 allows for a large class of complicated nonlinearities, and we discuss in Appendix E why proving results for such a large class of mixing functions is important.

Next, we assume that the latent variables $Z$ encode causal structure which can be expressed through a Structural Causal Model (SCM) on a Directed Acyclic Graph (DAG). We focus on linear SCMs with an underlying causal graph $G$ on vertex set $[d]$, encoded by a matrix $A$ through its non-zero entries, i.e., there is an edge $i\to j$ in $G$ iff $A_{ji}\neq 0$.

Assumption 2 (Linear Gaussian SCM on Latent Variables).

The latent variables $Z$ follow a linear SCM with Gaussian noise, so

$Z = AZ + D^{1/2}\epsilon$    (1)

where $D$ is a diagonal matrix with positive entries, $A$ encodes a DAG $G$ and $\epsilon\sim N(0,\mathrm{Id})$.

For an illustration of the model, see Fig. 1(a). It is convenient to encode the coefficients of linear SCMs through the matrix $B=D^{-1/2}(\mathrm{Id}-A)$. Then $Z=B^{-1}\epsilon$ and both $D$ and $A$ can be recovered from $B$; indeed, the diagonal entries of $B$ agree with the entries of $D^{-1/2}$. Note that we can and will always assume that the entries of $D$ are positive since $\epsilon\overset{\mathcal{D}}{=}-\epsilon$.
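To make this parametrization concrete, here is a minimal NumPy sketch (the function name and sampling setup are our own illustration, not the paper's code) that draws latent samples from the linear Gaussian SCM (1) via $Z=B^{-1}\epsilon$:

```python
import numpy as np

def sample_latents(A, D, n, rng=np.random.default_rng(0)):
    """Draw n samples of Z = A Z + D^{1/2} eps with eps ~ N(0, Id),
    equivalently Z = B^{-1} eps with B = D^{-1/2}(Id - A)."""
    d = A.shape[0]
    B = np.diag(np.diag(D) ** -0.5) @ (np.eye(d) - A)   # B = D^{-1/2}(Id - A)
    eps = rng.standard_normal((n, d))
    Z = np.linalg.solve(B, eps.T).T                     # solve B Z^T = eps^T
    return Z, B
```

Since $A$ encodes a DAG, $\mathrm{Id}-A$ is invertible, so the linear solve above is always well defined.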

Assuming that the latent variables follow a linear Gaussian SCM is certainly restrictive but nevertheless a reasonable approximation of the underlying data generating process in many settings. The Gaussian prior assumption is also standard in latent variable modeling and enables efficient inference for downstream tasks, among other advantages. Importantly, under the above assumptions, our model has infinite modeling capacity [53, 86, 33], so it is able to model complex datasets such as images and there is no loss in representational power.

The goal of representation learning is to learn the non-linear mixing $f$ and the high-level latent distribution $Z$ from observational data $X$. In causal representation learning, we wish to go a step further and model $Z$ as well by learning the parameters $B$, which then lets us easily recover $A,D$ and the causal graph $G$. For general non-linear mixing, this model is not identifiable and hence we cannot hope to recover $Z$, but in this work we use additional interventional data, similar to the setups in recent works [2, 84, 88].

Assumption 3 (Single node interventions).

We consider interventional distributions $Z^{(i)}$ for $i\in I$ which are single node interventions of the latent distribution $Z$, i.e., for each intervention $i$ there is a target node $t_{i}$ and we change the assigning equation of $Z^{(i)}_{t_{i}}$ to

$Z^{(i)}_{t_{i}}=(A^{(i)}Z^{(i)})_{t_{i}}+(D^{(i)})^{1/2}_{t_{i},t_{i}}(\epsilon_{t_{i}}+\eta^{(i)})$    (2)

while leaving all other equations unchanged. We assume that $A^{(i)}_{t_{i},k}\neq 0$ only if $k\to t_{i}$ in $G$, i.e., no parents are added, and $\eta^{(i)}$ denotes a shift of the mean.

For an illustration, see Fig. 1(b). An intervention on node $t_{i}$ has no effect on the non-descendants of $t_{i}$, but will have a downstream effect on the descendants of $t_{i}$. In particular, $A$ is modified so that the weight of an existing edge $k\to t_{i}$ could potentially be changed, but no new incoming edges to $t_{i}$ may be added. The noise variable $\epsilon_{t_{i}}$ is also allowed to be modified via a scale intervention as well as a shift intervention. Note that [84, 2] do not allow shift interventions, whereas we do.

Again, this can be written concisely as $Z^{(i)}=(B^{(i)})^{-1}(\epsilon+\eta^{(i)}e_{t_{i}})$ where $B^{(i)}=(D^{(i)})^{-1/2}(\mathrm{Id}-A^{(i)})$. Let $r^{(i)}$ denote the row $t_{i}$ of $B^{(i)}$; then we can write $B^{(i)}=B-e_{t_{i}}(B^{\top}e_{t_{i}}-r^{(i)})^{\top}$. We assume that interventions are non-trivial, i.e., $Z^{(i)}\overset{\mathcal{D}}{\neq}Z$. In our model, we observe interventional distributions $X^{(i)}=f(Z^{(i)})$ for various interventions $i\in I$.
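Building on the latent sampler sketched above, a single-node intervention amounts to replacing row $t_{i}$ of $B$ and shifting the noise of node $t_{i}$. A minimal sketch (function and argument names are hypothetical):

```python
import numpy as np

def sample_intervened_latents(B, t, r_new, eta, n, rng=np.random.default_rng(0)):
    """Sample Z^{(i)} = (B^{(i)})^{-1}(eps + eta * e_t), where B^{(i)} agrees with
    B except that row t is replaced by r_new.  For a perfect intervention, r_new
    is zero off the diagonal; for a pure noise intervention, r_new is a
    rescaling of the original row t of B."""
    d = B.shape[0]
    B_i = B.copy()
    B_i[t] = r_new                       # modify only the assigning equation of node t
    eps = rng.standard_normal((n, d))
    eps[:, t] += eta                     # shift intervention on the noise of node t
    return np.linalg.solve(B_i, eps.T).T
```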

Definition 1 (Intervention Types).

We call an intervention perfect (or stochastic hard) if $A^{(i)}_{t_{i}\cdot}=0$, i.e., we remove all connections to former parents and potentially change variance and mean of the noise variable. Note that for nodes without parents, either $D^{(i)}_{t_{i}t_{i}}\neq D_{t_{i}t_{i}}$ or $\eta^{(i)}\neq 0$ so that the intervention is non-trivial. We call an intervention a pure noise intervention if $A^{(i)}_{t_{i}\cdot}=A_{t_{i}\cdot}$ and $A_{t_{i}\cdot}\neq 0$ (i.e., node $t_{i}$ has at least one parent). In other words, a pure noise intervention targets a node with parents by changing only the noise distribution through a change of variance (encoded in $D$) or a mean shift (encoded in $\eta$).

We call a single-node intervention imperfect if it is not perfect (some prior works call it a soft intervention). Note that perfect interventions are never of pure noise type because they necessarily change the relation to the parents. For an illustration of perfect and imperfect interventions, see Fig. 1(c), Fig. 1(b) respectively. Finally, we require an exhaustive set of interventions.

Assumption 4 (Coverage of Interventions).

All nodes are intervened upon by at least one intervention, i.e., $\{t_{i}:\,i\in I\}=[d]$.

This assumption was also made in the prior works [84, 2, 88]. When the interventions don’t cover all the nodes, we have non-identifiability, as described in Section 3 and Appendix D. In these sections, we also discuss extensively that none of our other assumptions can simply be dropped.

Remark 1.
  • Our theory also readily extends to noisy observations, i.e., $X=f(Z)+\nu$ where $\nu$ is independent noise. In this case, we first denoise via a standard deconvolution argument [36, 42] and then apply our theory.

  • Similar to the results in [84] we can assume completely unknown intervention targets, i.e., there might be multiple interventions targeting the same node, and we do not need to know the partition. We only require coverage of all nodes. In contrast to their work we assume that we know which dataset corresponds to the observational distribution, but we expect that this restriction can be removed.

To simplify the notation, it is convenient to use $B^{(0)}=B$, $\eta^{(0)}=0$ for the observational distribution. We also define $\bar{I}=I\cup\{0\}$. Then, all information about the latent variable distributions $Z^{(i)}$ and observed distributions $X^{(i)}$ is contained in $((B^{(i)},\eta^{(i)},t_{i})_{i\in\bar{I}},f)$.

4 Main Results

We can now state our main results for the setting introduced above.

Theorem 1.

Suppose we are given distributions $X^{(i)}$ generated using a model $((B^{(i)},\eta^{(i)},t_{i})_{i\in\bar{I}},f)$ such that Assumptions 1-4 hold and such that all interventions $i$ are perfect. Then the model is identifiable up to permutation and scaling, i.e., for any model $((\widetilde{B}^{(i)},\widetilde{\eta}^{(i)},\widetilde{t}_{i})_{i\in\bar{I}},\widetilde{f})$ that generates the same data, the latent dimension $d$ agrees and there is a permutation $\omega\in S_{d}$ (and associated permutation matrix $P_{\omega}$) and an invertible pointwise scaling matrix $\Lambda\in\mathrm{Diag}(d)$ such that

$\widetilde{t}_{i}=\omega(t_{i}),\quad\widetilde{B}^{(i)}=P_{\omega}^{\top}B^{(i)}\Lambda^{-1}P_{\omega},\quad\widetilde{f}=f\circ\Lambda^{-1}P_{\omega},\quad\widetilde{\eta}^{(i)}=\eta^{(i)}.$    (3)

This in particular implies that

$\widetilde{Z}^{(i)}\overset{\mathcal{D}}{=}P_{\omega}^{\top}\Lambda Z^{(i)}$    (4)

and we can identify the causal graph $G$ up to permutation of the labels.

This result says that for the interventional model as described in the previous section, we can identify the non-linear map $f$, the intervention targets $t_{i}$, and the parameter matrices $B,D,A$ up to permutations $P_{\omega}$ and diagonal scaling $\Lambda$. Moreover, we can recover the shifts $\eta^{(i)}$ exactly and also the underlying causal graph $G$ up to permutations.

Remark 2.

Identifiability and recovery up to permutation and scaling is the best possible for our setting. This is because the latent variables $Z$ are not actually observed, which means we cannot (and in fact don’t need to) resolve permutation and scaling ambiguity without further information about $Z$. See [84, Proposition 1] for the short proof.

When we drop the assumption that the interventions are perfect, we can still obtain a weaker identifiability result. Define $\prec_{G}$ to be the minimal partial order on $[d]$ such that $i\prec_{G}j$ if $(i,j)$ is an edge in $G$, i.e., $i\prec_{G}j$ if and only if $i$ is an ancestor of $j$ in $G$. Note that any topological ordering of $G$ is compatible with the partial order $\prec_{G}$. Then, our next result shows that under imperfect interventions, we can still recover the partial order $\prec_{G}$.

Theorem 2.

Suppose we are given the distributions $X^{(i)}$ generated using a model $((B^{(i)},\eta^{(i)},t_{i})_{i\in\bar{I}},f)$ with causal graph $G$ such that Assumptions 1-4 hold and none of the interventions is a pure noise intervention. Then for any other model $((\widetilde{B}^{(i)},\widetilde{\eta}^{(i)},\widetilde{t}_{i})_{i\in\bar{I}},\widetilde{f})$ with causal graph $\widetilde{G}$ generating the same observations, the latent dimension $d$ agrees and there is a permutation $\omega\in S_{d}$ such that $\widetilde{t}_{i}=\omega(t_{i})$ and $i\prec_{\widetilde{G}}j$ iff $\omega(i)\prec_{G}\omega(j)$, i.e., $\prec_{G}$ can be identified up to a permutation of the labels.

Remark 3.

We emphasize that neither the full graph nor the coefficients $B^{(i)}$ nor the latent variables $Z_{i}$ are identifiable in this setting, even for linear mixing functions.

Theorem 1 and Theorem 2 generalize the main results of [84], which assume linear $f$, while we allow for non-linear $f$. The key new ingredient of our work is the following theorem, which shows identifiability of $f$ up to linear maps.

Theorem 3.

Assume that $X^{(i)}$ is generated according to a model $((B^{(i)},\eta^{(i)},t_{i})_{i\in\bar{I}},f)$ such that Assumptions 1-4 hold. Then we have identifiability up to linear transformations, i.e., if $((\widetilde{B}^{(i)},\widetilde{\eta}^{(i)},\widetilde{t_{i}}),\widetilde{f})$ generates the same observed distributions $\widetilde{f}(\widetilde{Z}^{(i)})\overset{\mathcal{D}}{=}X^{(i)}$, then their latent dimensions $d$ agree and there is an invertible linear map $T$ such that

$\widetilde{f}=f\circ T^{-1},\quad\widetilde{Z}^{(i)}=TZ^{(i)}.$    (5)

The proof of this theorem is deferred to Appendix A. In Appendix B we then review the results of [84] and show how their results can be extended to obtain our main results, Theorem 1 and Theorem 2. Now we provide some intuition for the proofs of the main theorems.

Proof intuition   The recent work [84] studies the special case when $f$ is linear, and the proof is linear algebraic. In particular, they consider row spans of the precision matrices of $X^{(i)}$, project them to certain linear subspaces and use those subspaces to construct a generalized RQ decomposition of the pseudoinverse of the linear mixing matrix $f$. However, once we are in the setting of non-linear $f$, such an approach is not feasible because we can no longer reason about row spans of the precision matrices of $X$, which have been transformed non-linearly, thereby losing all linear algebraic structure. Instead, we take a statistical approach and look at the log densities of the $X^{(i)}$. By choosing a Gaussian prior, the log-odds $\ln p^{(i)}_{X}(x)-\ln p^{(0)}_{X}(x)$ of $X^{(i)}$ with respect to $X^{(0)}$ can be written as a quadratic form of the difference of precision matrices, evaluated at non-linear functions of $x$. For simplicity of this exposition, ignore terms arising from shift interventions and determinants of covariance matrices. Then, the log-odds looks like $\theta(x)=f^{-1}(x)^{\top}(\Theta^{(i)}-\Theta^{(0)})f^{-1}(x)$, where $\Theta^{(i)}$ is the precision matrix of $Z^{(i)}$.

At this stage, we again shift our viewpoint to a geometric one and observe that $\Theta^{(i)}-\Theta^{(0)}$ has a certain structure. In particular, for single-node interventions, it has rank at most $2$, and for source node targets, it has rank $1$. This implies that the level set manifolds of the quadratic forms $\theta(x)$ also have a certain geometric structure, i.e., the DAG leaves a geometric signature on the data likelihood. We exploit this carefully and proceed by induction on the topological ordering until we end up showing that $f$ can be identified up to a linear transformation, which is our main Theorem 3. Here, the presence of shift interventions adds additional complexities, and we have to generalize all of our intermediate lemmas to handle these. Once we identify $f$ up to a linear transformation, we can apply the results of [84] to conclude Theorems 1 and 2.
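The rank structure underlying this argument is easy to check numerically. Below is a small sketch (our own toy example, not from the paper) using the fact that for $Z=B^{-1}\epsilon$ the precision matrix is $\Theta=B^{\top}B$, so a single-node intervention changes $\Theta$ by the difference of two rank-one terms:

```python
import numpy as np

def precision(B):
    # Z = B^{-1} eps  =>  Cov(Z) = B^{-1} B^{-T}  =>  Theta = B^T B
    return B.T @ B

# toy chain 1 -> 2 -> 3 with unit noise variances
B = np.array([[ 1.0,  0.0, 0.0],
              [-0.8,  1.0, 0.0],
              [ 0.0, -0.5, 1.0]])

# perfect intervention on node 2: remove the edge from node 1
B_int = B.copy()
B_int[1] = [0.0, 1.0, 0.0]

diff = precision(B_int) - precision(B)
print(np.linalg.matrix_rank(diff))   # 2 here; at most 2 for any single-node intervention
```

Intervening on the source node 1 instead (rescaling its row) would change the precision matrix only by a rank-one term, which is the distinction the induction over the topological ordering exploits.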

On the optimality and limitations of the assumptions    We make a few brief remarks on our assumptions, deferring to Appendix D a full discussion and the technical construction of several illustrative counterexamples.

  1. Number of interventions: For our main results, Theorems 1 and 2, we assume that there are at least $d$ interventions (Assumption 4). This cannot be weakened even for linear mixing functions [84]. In addition, we also show in Appendix D that for the linear identifiability proved in Theorem 3, $d-2$ interventions are not sufficient. Thus, the required number of interventions is tight in Theorems 1 and 2, and tight up to at most one intervention in Theorem 3.

  2. Intervention type: In the setting of imperfect interventions, the weaker identifiability guarantees in Theorem 2 cannot be improved even when the mixing is linear [84, Appendix C]. We also show in Appendix D that if we drop the condition that interventions are not pure noise interventions, we have non-identifiability. Concretely, we show that when all interventions are of pure noise type, any causal graph is compatible with the observations. Finally, in contrast to the special case of linear mixing, we show in Lemma 8 that we need to exclude non-stochastic hard interventions (i.e., $\mathrm{do}(Z_{i}=z_{i})$) for identifiability up to linear maps.

  3. Distributional assumptions: We assume Gaussian latent variables, while allowing for very flexible mixing $f$. This model has universal approximation guarantees and is moreover used ubiquitously in practice. While the result can potentially be extended to more general latent distributions (e.g., exponential families), we additionally show in Lemma 9 that the result is not true when making no assumption on the distribution of $\epsilon$.

5 Experimental methodology

In this section, we explain our experimental methodology and the theoretical underpinning of our approach. Our main experiments for interventional causal representation learning focus on a method based on contrastive learning. We train a deep neural network to learn to distinguish observational samples $x\sim X^{(0)}$ from interventional samples $x\sim X^{(i)}$. Additionally, we design the last layer of the network to model the log-likelihood of a linear Gaussian SCM. Due to the representational flexibility of deep neural networks, we will in principle learn the Bayes optimal classifier after optimal training, which we show is related to the underlying causal model parameters. Accordingly, with careful design of the last layer's parametric form, we indirectly learn the parameters of the underlying causal model. Similar methods have been used for time-series data or multimodal data [27, 32], but to the best of our knowledge, the contrastive learning approach to interventional learning is novel.

Denote the probability density of $x\sim X^{(i)}$ (resp. $z\sim Z^{(i)}$) by $p_{X}^{(i)}(x)$ (resp. $p_{Z}^{(i)}(z)$). The next lemma describes the log-odds of a sample $x$ coming from the interventional distribution $X^{(i)}$ as opposed to the observational distribution $X^{(0)}$. We focus on the identifiable case (see Theorem 1) and therefore only consider perfect interventions. As per the notation in Section 3 and Section A.1, let $s^{(i)}$ denote the row $t_{i}$ of $B^{(0)}$ and let $\eta^{(i)},\lambda_{i}$ denote the magnitude of the shift and scale intervention respectively.

Lemma 1.

When we have perfect interventions, the log-odds of a sample $x$ coming from $X^{(i)}$ over coming from $X^{(0)}$ is given by

$\ln p_{X}^{(i)}(x)-\ln p_{X}^{(0)}(x)=c_{i}-\frac{1}{2}\lambda_{i}^{2}\big(((f^{-1}(x))_{t_{i}})^{2}+\eta^{(i)}\lambda_{i}\cdot(f^{-1}(x))_{t_{i}}\big)+\frac{1}{2}\langle f^{-1}(x),s^{(i)}\rangle^{2}$    (6)

for a constant $c_{i}$ independent of $x$.
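Because the Jacobian of $f$ cancels in the difference of log densities, this log-odds can be evaluated entirely in latent space from the two Gaussian distributions of $Z^{(0)}$ and $Z^{(i)}$. The following toy snippet (our own two-node example, not part of the paper's code) illustrates this fact:

```python
import numpy as np
from scipy.stats import multivariate_normal

# toy 2-node SCM 1 -> 2 and a perfect intervention on node 2 with a shift
B     = np.array([[1.0, 0.0], [-0.7, 1.0]])    # observational coefficients
B_int = np.array([[1.0, 0.0], [ 0.0, 2.0]])    # edge removed, noise rescaled
eta   = 0.5                                    # shift of the intervened noise

mean_obs, cov_obs = np.zeros(2), np.linalg.inv(B.T @ B)
mean_int = np.linalg.solve(B_int, eta * np.array([0.0, 1.0]))   # (B^{(i)})^{-1} eta e_2
cov_int  = np.linalg.inv(B_int.T @ B_int)

z = np.array([0.3, -1.2])                      # plays the role of f^{-1}(x)
log_odds = (multivariate_normal.logpdf(z, mean_int, cov_int)
            - multivariate_normal.logpdf(z, mean_obs, cov_obs))
print(log_odds)   # equals ln p_X^{(i)}(x) - ln p_X^{(0)}(x) at x = f(z)
```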

The proof is deferred to Appendix F. The form of the log-odds suggests considering the following functions

$g_{i}(x,\alpha_{i},\beta_{i},\gamma_{i},w^{(i)},\theta)=\alpha_{i}-\beta_{i}h^{2}_{t_{i}}(x,\theta)+\gamma_{i}h_{t_{i}}(x,\theta)+\langle h(x,\theta),w^{(i)}\rangle^{2}$    (7)

where $h(\cdot,\theta)$ denotes a neural net parametrized by $\theta$, the parameters $w^{(i)}$ are the rows of a matrix $W$, and $\alpha_{i}$, $\beta_{i}$, $\gamma_{i}$ are learnable parameters. Note that the ground truth parameters minimize the following cross entropy loss

$\mathcal{L}_{\mathrm{CE}}^{(i)}=\mathbb{E}_{j\sim\mathcal{U}(\{0,i\})}\mathbb{E}_{x\sim X^{(j)}}\mathrm{CE}(\bm{1}_{j=i},g_{i}(x))=-\mathbb{E}_{j\sim\mathcal{U}(\{0,i\})}\mathbb{E}_{x\sim X^{(j)}}\ln\left(\frac{e^{\bm{1}_{j=i}g_{i}(x)}}{e^{g_{i}(x)}+1}\right).$    (8)

Note that (compare (6) and (7)) $W$ should learn $B=D^{-1/2}(\mathrm{Id}-A)$ and thus its off-diagonal entries should form a DAG. To enforce this we add the NOTEARS regularizer [102] given by $\mathcal{R}_{NOTEARS}(W)=\text{tr}\exp(W_{0}\circ W_{0})-d$ (see Section F.3), where $W_{0}$ equals $W$ with the main diagonal zeroed out. We also promote sparsity by adding the $l_{1}$ regularization term $\mathcal{R}_{REG}(W)=\lVert W_{0}\rVert_{1}$. Thus, the total loss is given by

$\mathcal{L}(\alpha,\beta,\gamma,W,\theta)=\sum_{i\in I}\mathcal{L}_{\mathrm{CE}}^{(i)}+\tau_{1}\mathcal{R}_{NOTEARS}(W)+\tau_{2}\mathcal{R}_{REG}(W)$    (9)

for hyperparameters $\tau_{1}$ and $\tau_{2}$. Our identifiability result Theorem 1 implies that if the neural network has infinite capacity, the loss in (9) is minimized, $\tau_{1}$ is large and $\tau_{2}$ is small, and we learn Gaussian latent variables $h(X^{(i)},\theta)$, then we recover the ground truth latent variables up to the tolerable ambiguities of labeling and scale, i.e., $h$ recovers $f^{-1}$ and $W$ recovers $B$ up to permutation and scale. Thus, we estimate the latent variables using $\hat{Z}=h(X,\theta)$ and estimate the DAG using $W_{0}$. Full details of the practical implementation of our approach are given in Appendix H.
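For concreteness, the following PyTorch sketch shows how the pieces above fit together: a shared encoder $h(\cdot,\theta)$, per-environment heads of the form (7), the cross entropy loss (8), and the regularizers of (9). It assumes the setup of the experiments below where intervention $i$ targets node $i$ (so $W$ is square); all names, layer sizes and hyperparameter values are illustrative, not the exact implementation from Appendix H.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContrastiveCRL(nn.Module):
    def __init__(self, d_obs, d_latent, hidden=128):
        super().__init__()
        # encoder h(., theta); its output plays the role of the latent estimate
        self.h = nn.Sequential(
            nn.Linear(d_obs, hidden), nn.LeakyReLU(),
            nn.Linear(hidden, hidden), nn.LeakyReLU(),
            nn.Linear(hidden, d_latent))
        self.W = nn.Parameter(0.1 * torch.randn(d_latent, d_latent))  # rows w^{(i)}, should learn B
        self.alpha = nn.Parameter(torch.zeros(d_latent))
        self.beta = nn.Parameter(torch.ones(d_latent))
        self.gamma = nn.Parameter(torch.zeros(d_latent))

    def log_odds(self, x, i):
        """g_i(x) from (7), assuming intervention i targets latent coordinate i."""
        z = self.h(x)
        return (self.alpha[i] - self.beta[i] * z[:, i] ** 2
                + self.gamma[i] * z[:, i] + (z @ self.W[i]) ** 2)

def notears(W):
    """NOTEARS acyclicity penalty tr exp(W0 . W0) - d on the off-diagonal part of W."""
    W0 = W - torch.diag(torch.diag(W))
    return torch.trace(torch.matrix_exp(W0 * W0)) - W.shape[0]

def loss(model, x_obs, x_int, i, tau1=1.0, tau2=1e-3):
    """Cross entropy (8) for environment i plus the regularizers of (9)."""
    logits = torch.cat([model.log_odds(x_obs, i), model.log_odds(x_int, i)])
    labels = torch.cat([torch.zeros(x_obs.shape[0]), torch.ones(x_int.shape[0])])
    ce = F.binary_cross_entropy_with_logits(logits, labels)
    W0 = model.W - torch.diag(torch.diag(model.W))
    return ce + tau1 * notears(model.W) + tau2 * W0.abs().sum()
```

In a training loop one would sum this loss over environments $i\in I$, minimize it with a standard optimizer, and then read off $\hat{Z}=h(X,\theta)$ and the graph from the off-diagonal part of $W$.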

Our experimental setup is similar to [84, 2]: we consider $d$ interventions with different targets (our theory holds in full generality) and can therefore arbitrarily assign $t_{i}=i$ based on the intervention index $i$, which removes the permutation ambiguity. We focus on non-zero shifts because the cross entropy loss together with the quadratic expression for the log-odds results in a non-convex output layer, which makes it hard to find the global loss minimizer, as we describe in more detail in Section G.1. We emphasize that this is no contradiction to the theoretical results stated above: even when there is no shift intervention the latent variables are identifiable, but our algorithm often fails to find the global minimizer of the loss (9) due to the non-convex loss landscape. For the sake of exposition and to set the stage for future work, we also briefly describe an approach via Variational Autoencoders (VAEs). VAEs have been widely used in causal representation learning and, while feasible in interventional settings, they are accompanied by certain difficulties, which we detail and suggest how to overcome in Section F.4.

6 Experiments

In this section, we implement our approach on synthetic data and image data. Complete details (architectures, hyperparameters) are deferred to Appendix H. Additional experiments investigating the effect of varsortability [66] and the noise distribution can be found in Appendix G.

Figure 2: Dependence of the performance metrics on the dimension $d$ for $\mathrm{ER}(d,2)$ graphs with $d^{\prime}=100$ and nonlinear mixing $f$.

Data generation   For all our experiments we use Erdős-Rényi graphs, i.e., we add each undirected edge with equal probability $p$ to the graph and then orient the edges according to a random order of the nodes. We write $\mathrm{ER}(d,k)$ for the Erdős-Rényi graph distribution on $d$ nodes with $kd$ expected edges. For a given graph $G$ we then sample edge weights from $\mathcal{U}(\pm[0.25,1.0])$ and a scale matrix $D$. For simplicity we assume that we have $n$ samples from each environment $i\in\bar{I}$. We only consider the setting where each node is intervened upon once and thus the latent dimension is also known. We consider three types of mixing functions. First, we consider linear mixing functions where we sample all matrix entries i.i.d. from a Gaussian distribution. Then, we consider non-linear mixing functions that are parametrized by randomly initialized MLPs with three hidden layers and Leaky ReLU activations. Finally, we consider image data as described in [2]. Pairs of latent variables $(z_{2i+1},z_{2i+2})$ describe the coordinates of balls in an image and the non-linearity $f$ is the rendering of the image. The image generation is based on pygame [77]. A sample image can be found in Figure 3.
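A minimal sketch of this graph and SCM sampling step (function names are hypothetical; the full pipeline including the mixing functions is described in Appendix H) could look as follows:

```python
import numpy as np

def sample_er_sem(d, k, rng=np.random.default_rng(0)):
    """ER(d, k): keep each undirected edge with probability p = 2k/(d-1),
    giving kd expected edges, orient along a random node order; edge weights
    from U(+-[0.25, 1.0]), noise variances from U([1, 2]) as in Section 6."""
    p = 2.0 * k / (d - 1)
    order = rng.permutation(d)                  # random topological order
    A = np.zeros((d, d))
    for a in range(d):
        for b in range(a + 1, d):
            if rng.random() < p:
                i, j = order[a], order[b]       # orient as i -> j
                A[j, i] = rng.uniform(0.25, 1.0) * rng.choice([-1.0, 1.0])
    D = np.diag(rng.uniform(1.0, 2.0, size=d))  # scale matrix D
    return A, D
```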

Figure 3: Sample image with 3 balls.

Evaluation Metrics   We evaluate how well we can recover the ground truth latent variables and the underlying DAG. For the recovery of the ground truth latents we use the Mean Correlation Coefficient (MCC) [37, Appendix A.2]. For the evaluation of the learned graph structure we use the Structural Hamming Distance (SHD), where we follow the convention of counting the number of edge differences of the directed graphs (i.e., an edge with wrong orientation counts as two errors). Since the scale of the variables is not fixed, the choice of the edge selection threshold is slightly delicate. Thus, we use the heuristic of matching the number of selected edges to the expected number of edges. As a metric that is independent of this thresholding procedure, we also report the Area Under the Receiver Operating Characteristic curve (AUROC) for the edge selection.
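For reference, here is a short sketch of how these metrics can be computed with NumPy/SciPy (our own illustration; the exact implementation in our code may differ):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def mcc(z_true, z_hat):
    """Mean correlation coefficient: match estimated to ground truth latents
    by solving an assignment problem on absolute correlations."""
    d = z_true.shape[1]
    corr = np.abs(np.corrcoef(z_true.T, z_hat.T))[:d, d:]
    rows, cols = linear_sum_assignment(-corr)      # maximize matched correlation
    return corr[rows, cols].mean()

def shd(adj_true, adj_hat):
    """Structural Hamming distance between directed adjacency matrices;
    a wrongly oriented edge counts as two errors."""
    return int(np.sum(adj_true.astype(bool) != adj_hat.astype(bool)))
```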

Methods    We implement our contrastive algorithm as explained in Section F, using an MLP (with Leaky ReLU nonlinearities) for the function $h$ for the linear and nonlinear mixing functions and a very small convolutional network for the image dataset; these variants are termed “Contrastive”. For linear mixing functions we also consider a version of the contrastive algorithm where $h$ is a linear function, termed “Contrastive Linear”. As baselines, we consider a variational autoencoder with latent dimension $d$ and also the algorithm for linear disentanglement introduced in [84]. Since the variational autoencoder does not output a causal graph structure, we report the result for the empty graph, which serves as a baseline.

Results for linear $f$    First, for the sake of comparison, we replicate exactly the setting from [84], i.e., we consider initial noise variances sampled uniformly from $[2,4]$ and perfect interventions where the new variance is sampled uniformly from $[6,8]$. We set $d=5$, $d^{\prime}=10$, $k=3/2$, and $n=50000$. Results can be found in Table 1. The linear contrastive method identifies the ground truth latent variables up to scaling. The failure of the nonlinear method to recover the ground truth latents will be explained and further analyzed in Appendix G.1. Note that both the nonlinear contrastive and the linear contrastive method recover the underlying graph better than the baseline.

Table 1: Results for linear $f$ with $d=5$, $d^{\prime}=10$, $k=3/2$, $n=50000$.
Method             | SHD ↓      | AUROC ↑     | MCC ↑       | $R^{2}$ ↑
Contrastive        | 4.6 ± 0.5  | 0.84 ± 0.02 | 0.05 ± 0.02 | 0.02 ± 0.00
Contrastive Linear | 5.4 ± 1.6  | 0.80 ± 0.07 | 0.90 ± 0.03 | 1.00 ± 0.00
Linear baseline    | 7.0 ± 0.5  | 0.64 ± 0.05 | 0.83 ± 0.04 | 1.00 ± 0.00

Results for nonlinear $f$    We sample all variances from $\mathcal{U}([1,2])$ (initial variance and resampling for the perfect interventions) and the shift parameters $\eta^{(i)}$ of the interventions from $\mathcal{U}([1,2])$. Results can be found in Table 2. We find that our contrastive method can recover the ground truth latents and the causal structure almost perfectly, while the baseline for linear disentanglement cannot recover the graph or the latent variables (which is not surprising as the mixing is highly nonlinear). Also, training a vanilla VAE does not recover the latent variables up to a linear map, as indicated by the $R^{2}$ scores. In Figure 2 we illustrate the dependence of our algorithm on the dimension $d$ in this setting.

Table 2: Results for nonlinear synthetic data with $n=10000$.
Setting                   | Method          | SHD ↓       | AUROC ↑     | MCC ↑       | $R^{2}$ ↑
ER(5, 2), $d^{\prime}=20$   | Contrastive     | 1.8 ± 0.5   | 0.97 ± 0.01 | 0.97 ± 0.00 | 0.96 ± 0.00
                          | VAE             | 10.0 ± 0.0  | 0.50 ± 0.00 | 0.48 ± 0.03 | 0.57 ± 0.07
                          | Linear baseline | 10.6 ± 1.9  | 0.48 ± 0.11 | 0.32 ± 0.03 | 0.34 ± 0.06
ER(5, 2), $d^{\prime}=100$  | Contrastive     | 1.0 ± 0.0   | 1.00 ± 0.00 | 0.99 ± 0.00 | 0.98 ± 0.00
                          | VAE             | 10.0 ± 0.0  | 0.50 ± 0.00 | 0.59 ± 0.02 | 0.68 ± 0.04
                          | Linear baseline | 3.4 ± 1.2   | 0.85 ± 0.07 | 0.18 ± 0.04 | 0.11 ± 0.04
ER(10, 2), $d^{\prime}=20$  | Contrastive     | 3.6 ± 1.3   | 0.98 ± 0.01 | 0.93 ± 0.00 | 0.87 ± 0.01
                          | VAE             | 18.6 ± 0.9  | 0.50 ± 0.00 | 0.59 ± 0.02 | 0.72 ± 0.02
                          | Linear baseline | 29.6 ± 2.5  | 0.49 ± 0.02 | 0.44 ± 0.02 | 0.51 ± 0.02
ER(10, 2), $d^{\prime}=100$ | Contrastive     | 1.6 ± 0.5   | 1.00 ± 0.00 | 0.98 ± 0.00 | 0.97 ± 0.00
                          | VAE             | 18.6 ± 0.9  | 0.50 ± 0.00 | 0.62 ± 0.02 | 0.78 ± 0.01
                          | Linear baseline | 28.4 ± 2.1  | 0.51 ± 0.04 | 0.17 ± 0.03 | 0.13 ± 0.03

Results for image data   Finally, we report the results for image data. Here, we generate the graph as before and consider variances sampled from $\sigma^{2}\sim\mathcal{U}([0.01,0.02])$ and shifts $\eta^{(i)}$ from $\mathcal{U}([0.1,0.2])$ (i.e., the shifts are still of order $\sigma$). We exclude samples where one of the balls is not contained in the image, which generates a slight model misspecification compared to our theory. Again we find that we recover the latent graph and the latent variables, as detailed in Table 3.

Table 3: Results for image data with ER($d$, 2) graphs, $d=2\cdot\#\text{balls}$, $n_{\mathrm{int}}=25000$ (per environment) and $n_{\mathrm{obs}}=n_{\mathrm{int}}\cdot d$.
# Balls | Method               | SHD ↓       | AUROC ↑     | MCC ↑       | $R^{2}$ ↑
2       | Contrastive Learning | 1.4 ± 0.4   | 0.95 ± 0.03 | 0.87 ± 0.03 | 0.84 ± 0.03
        | VAE                  | 6.0 ± 0.0   | 0.50 ± 0.00 | 0.19 ± 0.06 | 0.16 ± 0.08
5       | Contrastive Learning | 2.0 ± 0.3   | 1.00 ± 0.00 | 0.94 ± 0.01 | 0.91 ± 0.01
        | VAE                  | 18.6 ± 0.9  | 0.50 ± 0.00 | 0.31 ± 0.02 | 0.36 ± 0.03
10      | Contrastive Learning | 11.0 ± 3.3  | 0.98 ± 0.02 | 0.89 ± 0.01 | 0.83 ± 0.01
        | VAE                  | 37.2 ± 3.1  | 0.50 ± 0.00 | 0.22 ± 0.01 | 0.33 ± 0.02

7 Conclusion

In this work, we extend several prior works and show identifiability for a widely used class of linear latent variable models with non-linear mixing, from interventional data with unknown targets. Counterexamples show that our assumptions are tight in this setting and could only potentially be relaxed under other additional assumptions. We leave it to future work to extend our results to other classes of priors, such as non-parametric distribution families, and to study the sample complexity and robustness of our results. We also proposed a contrastive approach to learn such models in practice and showed that it can recover the latent structure in various settings. Finally, we highlight that the results of our experiments are very promising and it would be interesting to scale up our algorithms to large-scale datasets such as [49].

Acknowledgments

We thank anonymous reviewers for useful comments and suggestions. We acknowledge the support of AFRL and DARPA via FA8750-23-2-1015, ONR via N00014-23-1-2368, and NSF via IIS-1909816, IIS-1955532. We also acknowledge the support of JPMorgan Chase & Co. AI Research. This work was also supported by the Tübingen AI Center.

References

  • Ahuja et al. [2022] K. Ahuja, J. S. Hartford, and Y. Bengio. Weakly supervised representation learning with sparse perturbations. Advances in Neural Information Processing Systems, 35:15516–15528, 2022.
  • Ahuja et al. [2023] K. Ahuja, D. Mahajan, Y. Wang, and Y. Bengio. Interventional causal representation learning. In Proceedings of the 40th International Conference on Machine Learning, ICML’23. JMLR.org, 2023.
  • Arora et al. [2013] S. Arora, R. Ge, Y. Halpern, D. M. Mimno, A. Moitra, D. A. Sontag, Y. Wu, and M. Zhu. A practical algorithm for topic modeling with provable guarantees. In Proceedings of the 30th International Conference on Machine Learning, ICML 2013, Atlanta, GA, USA, 16-21 June 2013, volume 28 of JMLR Workshop and Conference Proceedings, pages 280–288. JMLR.org, 2013. URL http://proceedings.mlr.press/v28/arora13.html.
  • Bello et al. [2022] K. Bello, B. Aragam, and P. K. Ravikumar. DAGMA: Learning DAGs via m-matrices and a log-determinant acyclicity characterization. In A. H. Oh, A. Agarwal, D. Belgrave, and K. Cho, editors, Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=8rZYMpFUgK.
  • Bommasani et al. [2021] R. Bommasani, D. A. Hudson, E. Adeli, R. Altman, S. Arora, S. von Arx, M. S. Bernstein, J. Bohg, A. Bosselut, E. Brunskill, et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021.
  • Brehmer et al. [2022] J. Brehmer, P. de Haan, P. Lippe, and T. S. Cohen. Weakly supervised causal representation learning. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 38319–38331. Curran Associates, Inc., 2022. URL https://proceedings.neurips.cc/paper_files/paper/2022/file/fa567e2b2c870f8f09a87b6e73370869-Paper-Conference.pdf.
  • Bubeck et al. [2023] S. Bubeck, V. Chandrasekaran, R. Eldan, J. Gehrke, E. Horvitz, E. Kamar, P. Lee, Y. T. Lee, Y. Li, S. Lundberg, et al. Sparks of artificial general intelligence: Early experiments with GPT-4. arXiv preprint arXiv:2303.12712, 2023.
  • Buchholz [2023] S. Buchholz. Some remarks on identifiability of independent component analysis in restricted function classes. Transactions on Machine Learning Research, 2023. ISSN 2835-8856. URL https://openreview.net/forum?id=REtKapdkyI.
  • Buchholz et al. [2022] S. Buchholz, M. Besserve, and B. Schölkopf. Function classes for identifiable nonlinear independent component analysis. In A. H. Oh, A. Agarwal, D. Belgrave, and K. Cho, editors, Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=DpKaP-PY8bK.
  • Castelletti and Peluso [2022] F. Castelletti and S. Peluso. Network structure learning under uncertain interventions. Journal of the American Statistical Association, pages 1–12, 2022.
  • Chen et al. [2022] Y. Chen, E. Rosenfeld, M. Sellke, T. Ma, and A. Risteski. Iterative feature matching: Toward provable domain generalization with logarithmic environments. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 1725–1736. Curran Associates, Inc., 2022. URL https://proceedings.neurips.cc/paper_files/paper/2022/file/0b5eb45a22ff33956c043dd271f244ea-Paper-Conference.pdf.
  • Chickering [2002] D. M. Chickering. Optimal structure identification with greedy search. J. Mach. Learn. Res., 3:507–554, 2002. URL http://jmlr.org/papers/v3/chickering02b.html.
  • Comon [1994] P. Comon. Independent component analysis, a new concept? Signal processing, 36(3):287–314, 1994.
  • [14] J. Cui, W. Huang, Y. Wang, and Y. Wang. Aggnce: Asymptotically identifiable contrastive learning.
  • D’Amour et al. [2022] A. D’Amour, K. Heller, D. Moldovan, B. Adlam, B. Alipanahi, A. Beutel, C. Chen, J. Deaton, J. Eisenstein, M. D. Hoffman, et al. Underspecification presents challenges for credibility in modern machine learning. The Journal of Machine Learning Research, 23(1):10237–10297, 2022.
  • Dilokthanakul et al. [2016] N. Dilokthanakul, P. A. Mediano, M. Garnelo, M. C. Lee, H. Salimbeni, K. Arulkumaran, and M. Shanahan. Deep unsupervised clustering with gaussian mixture variational autoencoders. arXiv preprint arXiv:1611.02648, 2016.
  • Dixit et al. [2016] A. Dixit, O. Parnas, B. Li, J. Chen, C. P. Fulco, L. Jerby-Arnon, N. D. Marjanovic, D. Dionne, T. Burks, R. Raychowdhury, B. Adamson, T. M. Norman, E. S. Lander, J. S. Weissman, N. Friedman, and A. Regev. Perturb-seq: Dissecting molecular circuits with scalable single-cell rna profiling of pooled genetic screens. Cell, 167(7):1853–1866, Dec 2016. ISSN 1097-4172 (Electronic); 0092-8674 (Print); 0092-8674 (Linking). doi: 10.1016/j.cell.2016.11.038.
  • Eberhardt et al. [2005] F. Eberhardt, C. Glymour, and R. Scheines. On the number of experiments sufficient and in the worst case necessary to identify all causal relations among N variables. In UAI ’05, Proceedings of the 21st Conference in Uncertainty in Artificial Intelligence, Edinburgh, Scotland, July 26-29, 2005, pages 178–184. AUAI Press, 2005. URL https://dslpitt.org/uai/displayArticleDetails.jsp?mmnu=1&smnu=2&article_id=1235&proceeding_id=21.
  • Falck et al. [2021] F. Falck, H. Zhang, M. Willetts, G. Nicholson, C. Yau, and C. C. Holmes. Multi-facet clustering variational autoencoders. Advances in Neural Information Processing Systems, 34, 2021.
  • Ferrara [2023] E. Ferrara. Should ChatGPT be biased? challenges and risks of bias in large language models. arXiv preprint arXiv:2304.03738, 2023.
  • Gamella et al. [2022] J. L. Gamella, A. Taeb, C. Heinze-Deml, and P. Bühlmann. Characterization and greedy learning of gaussian structural causal models under unknown interventions, 2022.
  • Gresele et al. [2021] L. Gresele, J. Von Kügelgen, V. Stimper, B. Schölkopf, and M. Besserve. Independent mechanism analysis, a new concept? Advances in Neural Information Processing Systems, 34, 2021.
  • Hauser and Bühlmann [2012] A. Hauser and P. Bühlmann. Characterization and greedy learning of interventional markov equivalence classes of directed acyclic graphs. The Journal of Machine Learning Research, 13(1):2409–2464, 2012.
  • He et al. [2016] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • Heinze-Deml et al. [2018] C. Heinze-Deml, J. Peters, and N. Meinshausen. Invariant causal prediction for nonlinear models. Journal of Causal Inference, 6(2), 2018.
  • Ho et al. [2020] J. Ho, A. Jain, and P. Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
  • Hyvärinen and Morioka [2016] A. Hyvärinen and H. Morioka. Unsupervised feature extraction by time-contrastive learning and nonlinear ICA. In D. D. Lee, M. Sugiyama, U. von Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pages 3765–3773, 2016. URL https://proceedings.neurips.cc/paper/2016/hash/d305281faf947ca7acade9ad5c8c818c-Abstract.html.
  • Hyvärinen and Oja [2000] A. Hyvärinen and E. Oja. Independent component analysis: algorithms and applications. Neural networks, 13(4-5):411–430, 2000.
  • Hyvärinen and Pajunen [1999] A. Hyvärinen and P. Pajunen. Nonlinear independent component analysis: Existence and uniqueness results. Neural networks, 12(3):429–439, 1999.
  • Hyvarinen et al. [2002] A. Hyvarinen, J. Karhunen, and E. Oja. Independent component analysis. Studies in informatics and control, 11(2):205–207, 2002.
  • Hyvarinen et al. [2019] A. Hyvarinen, H. Sasaki, and R. Turner. Nonlinear ica using auxiliary variables and generalized contrastive learning. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 859–868. PMLR, 2019.
  • Hyvärinen et al. [2023] A. Hyvärinen, I. Khemakhem, and R. Monti. Identifiability of latent-variable and structural-equation models: from linear to nonlinear. arXiv preprint arXiv:2302.02672, 2023.
  • Ishikawa et al. [2022] I. Ishikawa, T. Teshima, K. Tojo, K. Oono, M. Ikeda, and M. Sugiyama. Universal approximation property of invertible neural networks. arXiv preprint arXiv:2204.07415, 2022.
  • Jaber et al. [2020] A. Jaber, M. Kocaoglu, K. Shanmugam, and E. Bareinboim. Causal discovery from soft interventions with unknown targets: Characterization and learning. Advances in neural information processing systems, 33:9551–9561, 2020.
  • Jiang and Aragam [2023] Y. Jiang and B. Aragam. Learning latent causal graphs with unknown interventions. In Advances in Neural Information Processing Systems, 2023.
  • Khemakhem et al. [2020a] I. Khemakhem, D. Kingma, R. Monti, and A. Hyvarinen. Variational autoencoders and nonlinear ica: A unifying framework. In International Conference on Artificial Intelligence and Statistics, pages 2207–2217. PMLR, 2020a.
  • Khemakhem et al. [2020b] I. Khemakhem, R. P. Monti, D. P. Kingma, and A. Hyvärinen. Ice-beem: Identifiable conditional energy-based deep models based on nonlinear ICA. In H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020b. URL https://proceedings.neurips.cc/paper/2020/hash/962e56a8a0b0420d87272a682bfd1e53-Abstract.html.
  • Kingma and Ba [2015] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In Y. Bengio and Y. LeCun, editors, 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. URL http://arxiv.org/abs/1412.6980.
  • Kingma and Welling [2014] D. P. Kingma and M. Welling. Auto-encoding variational bayes. In Y. Bengio and Y. LeCun, editors, 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings, 2014. URL http://arxiv.org/abs/1312.6114.
  • Kivva et al. [2021] B. Kivva, G. Rajendran, P. Ravikumar, and B. Aragam. Learning latent causal graphs via mixture oracles. In M. Ranzato, A. Beygelzimer, Y. N. Dauphin, P. Liang, and J. W. Vaughan, editors, Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, pages 18087–18101, 2021. URL https://proceedings.neurips.cc/paper/2021/hash/966aad8981dcc75b5b8ab04427a833b2-Abstract.html.
  • Kivva et al. [2022] B. Kivva, G. Rajendran, P. Ravikumar, and B. Aragam. Identifiability of deep generative models without auxiliary information. In NeurIPS, 2022. URL http://papers.nips.cc/paper_files/paper/2022/hash/649f080d8891ab4d4b262cb9cd52e69a-Abstract-Conference.html.
  • Lachapelle et al. [2022] S. Lachapelle, P. Rodríguez, Y. Sharma, K. Everett, R. L. Priol, A. Lacoste, and S. Lacoste-Julien. Disentanglement via mechanism sparsity regularization: A new principle for nonlinear ICA. In B. Schölkopf, C. Uhler, and K. Zhang, editors, 1st Conference on Causal Learning and Reasoning, CLeaR 2022, Sequoia Conference Center, Eureka, CA, USA, 11-13 April, 2022, volume 177 of Proceedings of Machine Learning Research, pages 428–484. PMLR, 2022. URL https://proceedings.mlr.press/v177/lachapelle22a.html.
  • Lee et al. [2021] T. E. Lee, J. A. Zhao, A. S. Sawhney, S. Girdhar, and O. Kroemer. Causal reasoning in simulation for structure and transfer learning of robot manipulation policies. In 2021 IEEE International Conference on Robotics and Automation (ICRA), pages 4776–4782. IEEE, 2021.
  • Li et al. [2020] S. Li, B. Hooi, and G. H. Lee. Identifying through flows for recovering latent representations. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net, 2020. URL https://openreview.net/forum?id=SklOUpEYvB.
  • Liang et al. [2023] W. Liang, A. Kekić, J. von Kügelgen, S. Buchholz, M. Besserve, L. Gresele, and B. Schölkopf. Causal component analysis. In Advances in Neural Information Processing Systems, 2023.
  • Lippe et al. [2022] P. Lippe, S. Magliacane, S. Löwe, Y. M. Asano, T. Cohen, and S. Gavves. Citris: Causal identifiability from temporal intervened sequences. In International Conference on Machine Learning, pages 13557–13603. PMLR, 2022.
  • Liu et al. [2022a] C. Liu, L. Zhu, and M. Belkin. Loss landscapes and optimization in over-parameterized non-linear systems and neural networks. Applied and Computational Harmonic Analysis, 59:85–116, 2022a. ISSN 1063-5203. doi: https://doi.org/10.1016/j.acha.2021.12.009. URL https://www.sciencedirect.com/science/article/pii/S106352032100110X. Special Issue on Harmonic Analysis and Machine Learning.
  • Liu et al. [2022b] Y. Liu, Z. Zhang, D. Gong, M. Gong, B. Huang, A. v. d. Hengel, K. Zhang, and J. Q. Shi. Identifying weight-variant latent causal models. arXiv preprint arXiv:2208.14153, 2022b.
  • Liu et al. [2023] Y. Liu, A. Alahi, C. Russell, M. Horn, D. Zietlow, B. Schölkopf, and F. Locatello. Causal triplet: An open challenge for intervention-centric causal representation learning. In 2nd Conference on Causal Learning and Reasoning (CLeaR), 2023. arXiv:2301.05169.
  • Locatello et al. [2019] F. Locatello, S. Bauer, M. Lucic, G. Raetsch, S. Gelly, B. Schölkopf, and O. Bachem. Challenging common assumptions in the unsupervised learning of disentangled representations. In international conference on machine learning, pages 4114–4124. PMLR, 2019.
  • Loshchilov and Hutter [2017] I. Loshchilov and F. Hutter. SGDR: stochastic gradient descent with warm restarts. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net, 2017. URL https://openreview.net/forum?id=Skq89Scxx.
  • Lotfollahi et al. [2021] M. Lotfollahi, A. K. Susmelj, C. De Donno, Y. Ji, I. L. Ibarra, F. A. Wolf, N. Yakubova, F. J. Theis, and D. Lopez-Paz. Compositional perturbation autoencoder for single-cell response modeling. BioRxiv, 2021.
  • Lu and Lu [2020] Y. Lu and J. Lu. A universal approximation theorem of deep neural networks for expressing probability distributions. In H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020. URL https://proceedings.neurips.cc/paper/2020/hash/2000f6325dfc4fc3201fc45ed01c7a5d-Abstract.html.
  • Moran et al. [2022] G. E. Moran, D. Sridhar, Y. Wang, and D. Blei. Identifiable deep generative models via sparse decoding. Transactions on Machine Learning Research, 2022.
  • Morioka and Hyvarinen [2023] H. Morioka and A. Hyvarinen. Connectivity-contrastive learning: Combining causal discovery and representation learning for multimodal data. In International Conference on Artificial Intelligence and Statistics, pages 3399–3426. PMLR, 2023.
  • Moser [1965] J. Moser. On the volume elements on a manifold. Transactions of the American Mathematical Society, 120(2):286–294, 1965. ISSN 00029947. URL http://www.jstor.org/stable/1994022.
  • Nejatbakhsh et al. [2020] A. Nejatbakhsh, F. Fumarola, S. Esteki, T. Toyoizumi, R. Kiani, and L. Mazzucato. Predicting perturbation effects from resting activity using functional causal flow. bioRxiv, pages 2020–11, 2020.
  • Norman et al. [2019] T. M. Norman, M. A. Horlbeck, J. M. Replogle, A. Y. Ge, A. Xu, M. Jost, L. A. Gilbert, and J. S. Weissman. Exploring genetic interaction manifolds constructed from rich single-cell phenotypes. Science, 365(6455):786–793, 2019.
  • OpenAI [2023] OpenAI. GPT-4 technical report, 2023.
  • Paszke et al. [2019] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Köpf, E. Z. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala. Pytorch: An imperative style, high-performance deep learning library. In H. M. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. B. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages 8024–8035, 2019. URL https://proceedings.neurips.cc/paper/2019/hash/bdbca288fee7f92f2bfa9f7012727740-Abstract.html.
  • Pearl [2009] J. Pearl. Causality. Cambridge university press, 2009.
  • Peters et al. [2016] J. Peters, P. Bühlmann, and N. Meinshausen. Causal inference by using invariant prediction: identification and confidence intervals. Journal of the Royal Statistical Society. Series B (Statistical Methodology), 78(5):947–1012, 2016. ISSN 13697412, 14679868. URL http://www.jstor.org/stable/44682904.
  • Peters et al. [2017] J. Peters, D. Janzing, and B. Schölkopf. Elements of causal inference: foundations and learning algorithms. The MIT Press, 2017.
  • Rajendran et al. [2021] G. Rajendran, B. Kivva, M. Gao, and B. Aragam. Structure learning in polynomial time: Greedy algorithms, bregman information, and exponential families. Advances in Neural Information Processing Systems, 34:18660–18672, 2021.
  • Rajendran et al. [2023] G. Rajendran, P. Reizinger, W. Brendel, and P. Ravikumar. An interventional perspective on identifiability in gaussian lti systems with independent component analysis. arXiv preprint arXiv:2311.18048, 2023.
  • Reisach et al. [2021] A. G. Reisach, C. Seiler, and S. Weichwald. Beware of the simulated dag! causal discovery benchmarks may be easy to game. In M. Ranzato, A. Beygelzimer, Y. N. Dauphin, P. Liang, and J. W. Vaughan, editors, Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, pages 27772–27784, 2021. URL https://proceedings.neurips.cc/paper/2021/hash/e987eff4a7c7b7e580d659feb6f60c1a-Abstract.html.
  • Reisach et al. [2023] A. G. Reisach, M. Tami, C. Seiler, A. Chambaz, and S. Weichwald. Simple sorting criteria help find the causal order in additive noise models. In Advances in Neural Information Processing Systems, 2023.
  • Rezende et al. [2014] D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In International conference on machine learning, pages 1278–1286. PMLR, 2014.
  • Rissanen [1978] J. Rissanen. Modeling by shortest data description. Autom., 14(5):465–471, 1978. doi: 10.1016/0005-1098(78)90005-5. URL https://doi.org/10.1016/0005-1098(78)90005-5.
  • Roeder et al. [2021] G. Roeder, L. Metz, and D. Kingma. On linear identifiability of learned representations. In International Conference on Machine Learning, pages 9030–9039. PMLR, 2021.
  • Rosenfeld et al. [2021] E. Rosenfeld, P. K. Ravikumar, and A. Risteski. The risks of invariant risk minimization. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=BbNIbVPJ-42.
  • Rosenfeld et al. [2022] E. Rosenfeld, P. Ravikumar, and A. Risteski. Domain-adjusted regression or: Erm may already learn features sufficient for out-of-distribution generalization. arXiv preprint arXiv:2202.06856, 2022.
  • Rothenhäusler et al. [2021] D. Rothenhäusler, N. Meinshausen, P. Bühlmann, and J. Peters. Anchor regression: Heterogeneous data meet causality. Journal of the Royal Statistical Society Series B: Statistical Methodology, 83(2):215–246, 2021.
  • Schölkopf and von Kügelgen [2022] B. Schölkopf and J. von Kügelgen. From statistical to causal learning. arXiv preprint arXiv:2204.00607, 2022.
  • Schölkopf et al. [2021] B. Schölkopf, F. Locatello, S. Bauer, N. R. Ke, N. Kalchbrenner, A. Goyal, and Y. Bengio. Toward causal representation learning. Proceedings of the IEEE, 109(5):612–634, 2021. arXiv:2102.11107.
  • Shen et al. [2022] X. Shen, F. Liu, H. Dong, Q. Lian, Z. Chen, and T. Zhang. Weakly supervised disentangled generative causal representation learning. Journal of Machine Learning Research, 23:1–55, 2022.
  • Shinners [2011] P. Shinners. Pygame. http://pygame.org/, 2011.
  • Silva et al. [2006] R. Silva, R. Scheines, C. Glymour, P. Spirtes, and D. M. Chickering. Learning the structure of linear latent variable models. Journal of Machine Learning Research, 7(2), 2006.
  • Sohl-Dickstein et al. [2015] J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pages 2256–2265. PMLR, 2015.
  • Spirtes and Glymour [1991] P. Spirtes and C. Glymour. An algorithm for fast recovery of sparse causal graphs. Social science computer review, 9(1):62–72, 1991.
  • Spirtes et al. [2000] P. Spirtes, C. N. Glymour, and R. Scheines. Causation, prediction, and search. MIT press, 2000.
  • Squires and Uhler [2022] C. Squires and C. Uhler. Causal structure learning: a combinatorial perspective. Foundations of Computational Mathematics, pages 1–35, 2022.
  • Squires et al. [2020] C. Squires, Y. Wang, and C. Uhler. Permutation-based causal structure learning with unknown intervention targets. In Conference on Uncertainty in Artificial Intelligence, pages 1039–1048. PMLR, 2020.
  • Squires et al. [2023] C. Squires, A. Seigal, S. S. Bhate, and C. Uhler. Linear causal disentanglement via interventions. In A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, and J. Scarlett, editors, International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, volume 202 of Proceedings of Machine Learning Research, pages 32540–32560. PMLR, 2023. URL https://proceedings.mlr.press/v202/squires23a.html.
  • Stark et al. [2020] S. G. Stark, J. Ficek, F. Locatello, X. Bonilla, S. Chevrier, F. Singer, G. Rätsch, and K.-V. Lehmann. Scim: universal single-cell matching with unpaired feature sets. Bioinformatics, 36(Supplement_2):i919–i927, 2020.
  • Teshima et al. [2020] T. Teshima, I. Ishikawa, K. Tojo, K. Oono, M. Ikeda, and M. Sugiyama. Coupling-based invertible neural networks are universal diffeomorphism approximators. Advances in Neural Information Processing Systems, 33:3362–3373, 2020.
  • Varici et al. [2021] B. Varici, K. Shanmugam, P. Sattigeri, and A. Tajer. Scalable intervention target estimation in linear models. Advances in Neural Information Processing Systems, 34:1494–1505, 2021.
  • Varici et al. [2023] B. Varici, E. Acarturk, K. Shanmugam, A. Kumar, and A. Tajer. Score-based causal representation learning with interventions. arXiv preprint arXiv:2301.08230, 2023.
  • Vaswani et al. [2017] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
  • Villani [2008] C. Villani. Optimal Transport: Old and New. Grundlehren der mathematischen Wissenschaften. Springer Berlin Heidelberg, 2008. ISBN 9783540710509. URL https://books.google.de/books?id=hV8o5R7_5tkC.
  • Von Kügelgen et al. [2021] J. Von Kügelgen, Y. Sharma, L. Gresele, W. Brendel, B. Schölkopf, M. Besserve, and F. Locatello. Self-supervised learning with data augmentations provably isolates content from style. Advances in Neural Information Processing Systems, 34, 2021.
  • von Kügelgen et al. [2023] J. von Kügelgen, M. Besserve, W. Liang, L. Gresele, A. Kekić, E. Bareinboim, D. M. Blei, and B. Schölkopf. Nonparametric identifiability of causal representations from unknown interventions. In Advances in Neural Information Processing Systems, 2023.
  • Wang et al. [2021] Y. Wang, D. Blei, and J. P. Cunningham. Posterior collapse and latent variable non-identifiability. Advances in Neural Information Processing Systems, 34:5443–5455, 2021.
  • Weichwald et al. [2022] S. Weichwald, S. W. Mogensen, T. E. Lee, D. Baumann, O. Kroemer, I. Guyon, S. Trimpe, J. Peters, and N. Pfister. Learning by doing: Controlling a dynamical system using causality, control, and reinforcement learning. In NeurIPS 2021 Competitions and Demonstrations Track, pages 246–258. PMLR, 2022.
  • Willetts and Paige [2021] M. Willetts and B. Paige. I don't need \mathbf{u}: Identifiable non-linear ICA without side information. arXiv preprint arXiv:2106.05238, 2021.
  • Xie et al. [2020] F. Xie, R. Cai, B. Huang, C. Glymour, Z. Hao, and K. Zhang. Generalized independent noise condition for estimating latent variable causal graphs. Advances in neural information processing systems, 33:14891–14902, 2020.
  • Yang et al. [2021] M. Yang, F. Liu, Z. Chen, X. Shen, J. Hao, and J. Wang. Causalvae: Disentangled representation learning via neural structural causal models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9593–9602, June 2021.
  • Yu et al. [2019] Y. Yu, J. Chen, T. Gao, and M. Yu. Dag-gnn: Dag structure learning with graph neural networks. In International Conference on Machine Learning, pages 7154–7163. PMLR, 2019.
  • Zhang et al. [2021] J. Zhang, C. Squires, and C. Uhler. Matching a desired causal state via shift interventions. Advances in Neural Information Processing Systems, 34:19923–19934, 2021.
  • Zhang et al. [2023] J. Zhang, C. Squires, K. Greenewald, A. Srivastava, K. Shanmugam, and C. Uhler. Identifiability guarantees for causal disentanglement from soft interventions. arXiv preprint arXiv:2307.06250, 2023.
  • Zhao et al. [2019] S. Zhao, J. Song, and S. Ermon. Infovae: Balancing learning and inference in variational autoencoders. In The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, The Thirty-First Innovative Applications of Artificial Intelligence Conference, IAAI 2019, The Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019, Honolulu, Hawaii, USA, January 27 - February 1, 2019, pages 5885–5892. AAAI Press, 2019. doi: 10.1609/AAAI.V33I01.33015885. URL https://doi.org/10.1609/aaai.v33i01.33015885.
  • Zheng et al. [2018] X. Zheng, B. Aragam, P. K. Ravikumar, and E. P. Xing. Dags with no tears: Continuous optimization for structure learning. Advances in neural information processing systems, 31, 2018.
  • Zheng et al. [2022] Y. Zheng, I. Ng, and K. Zhang. On the identifiability of nonlinear ICA: sparsity and beyond. In NeurIPS, 2022. URL http://papers.nips.cc/paper_files/paper/2022/hash/6801fa3fd290229efc490ee0cf1c5687-Abstract-Conference.html.
  • Zimmermann et al. [2021] R. S. Zimmermann, Y. Sharma, S. Schneider, M. Bethge, and W. Brendel. Contrastive learning inverts the data generating process. In International Conference on Machine Learning, pages 12979–12990. PMLR, 2021.

Appendix A Proof of Identifiability up to Linear Maps

In this section we give a full proof of Theorem 3 and try to make it essentially self-contained. To make it easy to read, we first introduce and recall the required notation in Section A.1 and then prove two key auxiliary relations in Section A.2. Finally, we first prove a simpler version of Theorem 3 under stronger assumptions in Section A.3 and then give the full proof of Theorem 3 in Section A.4.

A.1 General Notation

Let us introduce some notation for our identifiability proofs. For a full list of assumptions we refer to the main text; here we just recall the relevant notation concisely. We will assume that there are two representations that generate the same observations, i.e., we assume that there are two sets of Gaussian SCMs

Z^{(i)} = A^{(i)} Z^{(i)} + (D^{(i)})^{\frac{1}{2}} (\epsilon + \eta^{(i)} e_{t_i})   (10)
\widetilde{Z}^{(i)} = \widetilde{A}^{(i)} \widetilde{Z}^{(i)} + (\widetilde{D}^{(i)})^{\frac{1}{2}} (\widetilde{\epsilon} + \widetilde{\eta}^{(i)} e_{\widetilde{t}_i})   (11)

where \epsilon, \widetilde{\epsilon} \sim N(0, \mathrm{Id}), i=0 corresponds to the observational distribution, and i>0 corresponds to interventional settings where we intervene on a single node t_i (or \widetilde{t}_i) by adding a shift and changing the relation to its parents (without adding new parents). Moreover, there are two functions f, \widetilde{f} such that X^{(i)} \overset{\mathcal{D}}{=} f(Z^{(i)}) and X^{(i)} \overset{\mathcal{D}}{=} \widetilde{f}(\widetilde{Z}^{(i)}). It is convenient to define

B^{(i)} = (D^{(i)})^{-\frac{1}{2}} (\mathrm{Id} - A^{(i)})   (12)
\mu^{(i)} = \eta^{(i)} (B^{(i)})^{-1} e_{t_i}   (13)

so that Z^{(i)} = (B^{(i)})^{-1} \epsilon + \mu^{(i)} with \epsilon \sim N(0, \mathrm{Id}). Note that all model information can be collected in ((B^{(i)}, \eta^{(i)}, t_i), f).

As in the main text, it is convenient to use the shorthand B = B^{(0)}, A = A^{(0)}.

We denote the row t_i of B^{(i)} by r^{(i)}, i.e.,

r^{(i)} = (B^{(i)})^{\top} e_{t_i}.   (14)

Note that for perfect interventions, where we cut the relation to all parents of node t_i, we have r^{(i)} = \lambda_i e_{t_i} for some \lambda_i \neq 0. Similarly, for the observational distribution we refer to the row t_i of B^{(0)} by s^{(i)}, i.e.,

s^{(i)} = (B^{(0)})^{\top} e_{t_i}.   (15)

We thus find

B^{(0)} - B^{(i)} = e_{t_i} (e_{t_i}^{\top} B^{(0)} - e_{t_i}^{\top} B^{(i)}) = e_{t_i} (s^{(i)} - r^{(i)})^{\top}.   (16)

We also consider the covariance matrix \Sigma^{(i)} and the precision matrix \Theta^{(i)} of Z^{(i)}, which are given by

\Theta^{(i)} = (B^{(i)})^{\top} B^{(i)}, \quad \Sigma^{(i)} = (\Theta^{(i)})^{-1}.   (17)

Indeed,

\Sigma^{(i)} = \mathbb{E}[(B^{(i)})^{-1} \epsilon \epsilon^{\top} (B^{(i)})^{-\top}] = (B^{(i)})^{-1} \mathbb{E}[\epsilon \epsilon^{\top}] (B^{(i)})^{-\top} = (B^{(i)})^{-1} (B^{(i)})^{-\top}.

Finally, the mean \mu^{(i)} of Z^{(i)} is given by

\mu^{(i)} = \mathbb{E} Z^{(i)} = \eta^{(i)} (B^{(i)})^{-1} e_{t_i}.   (18)

We also define the analogous quantities for \widetilde{Z}. To simplify the notation, we will frequently use the shorthand v \otimes w = v w^{\top} \in \mathbb{R}^{d \times d} for v, w \in \mathbb{R}^{d}.
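To make the notation concrete, the following small NumPy sketch (our own illustration, not part of the paper's experimental code; all variable names and parameter values are ours) instantiates a toy version of this model: it builds B^{(i)} as in (12), samples Z^{(i)} = (B^{(i)})^{-1}(\epsilon + \eta^{(i)} e_{t_i}), and compares the empirical mean and precision with (18) and (17).

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4

# Lower-triangular weighted adjacency A (zero diagonal) and diagonal noise variances D.
A = np.tril(rng.normal(size=(d, d)), k=-1)
D = rng.uniform(0.5, 2.0, size=d)

# Single-node intervention on node t: shift by eta and change row t of A (no new parents).
t, eta = 2, 1.5
A_i = A.copy()
A_i[t, :t] *= 0.3
B_i = np.diag(D ** -0.5) @ (np.eye(d) - A_i)        # eq. (12)

# Sample Z^{(i)} = (B^{(i)})^{-1} (eps + eta * e_t) with eps ~ N(0, Id).
n = 200_000
e_t = np.zeros(d); e_t[t] = 1.0
eps = rng.normal(size=(n, d))
Z = np.linalg.solve(B_i, (eps + eta * e_t).T).T

mu = eta * np.linalg.solve(B_i, e_t)                # eq. (18)
Theta = B_i.T @ B_i                                 # eq. (17)
print("max deviation of empirical mean from (18):",
      np.abs(Z.mean(axis=0) - mu).max())
print("max deviation of empirical precision from (17):",
      np.abs(np.linalg.inv(np.cov(Z.T)) - Theta).max())
```

Both deviations should be small for large n, reflecting that the model is fully parametrized by ((B^{(i)}, \eta^{(i)}, t_i), f).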

A.2 Two auxiliary lemmas

The key to our identifiability results is the relation shown in the following lemma.

Lemma 2.

Assume that there are two latent variable models ((B^{(i)}, \eta^{(i)}, t_i)_{i \in I}, f) and ((\widetilde{B}^{(i)}, \widetilde{\eta}^{(i)}, \widetilde{t}_i)_{i \in I}, \widetilde{f}) as defined above such that f(Z^{(i)}) \overset{\mathcal{D}}{=} \widetilde{f}(\widetilde{Z}^{(i)}). Then their latent dimensions d agree, and h = \widetilde{f}^{-1} \circ f exists and is a diffeomorphism, i.e., bijective and differentiable with differentiable inverse. Moreover, it satisfies the relation

\frac{1}{2} z^{\top} (\Theta^{(i)} - \Theta^{(0)}) z - \eta^{(i)} (r^{(i)})^{\top} z = \frac{1}{2} h(z)^{\top} (\widetilde{\Theta}^{(i)} - \widetilde{\Theta}^{(0)}) h(z) - \widetilde{\eta}^{(i)} (\widetilde{r}^{(i)})^{\top} h(z) + c^{(i)}   (19)

for all z and i and some constants c^{(i)}. If \eta^{(i)} = \widetilde{\eta}^{(i)} = 0 this simplifies to

\frac{1}{2} z^{\top} (\Theta^{(i)} - \Theta^{(0)}) z = \frac{1}{2} h(z)^{\top} (\widetilde{\Theta}^{(i)} - \widetilde{\Theta}^{(0)}) h(z) + c^{(i)}.   (20)
Proof.

Note first that X^{(0)} \overset{\mathcal{D}}{=} f(Z^{(0)}) implies, since f is an embedding by assumption, that the dimension of the support of X^{(0)} equals the latent dimension, and this dimension is identifiable from the observations, e.g., by considering the tangent space. As Z^{(0)} and \widetilde{Z}^{(0)} have full support and f and \widetilde{f} are by assumption embeddings, the equality f(Z^{(0)}) \overset{\mathcal{D}}{=} \widetilde{f}(\widetilde{Z}^{(0)}) implies that the two image manifolds agree and h = \widetilde{f}^{-1} \circ f is a differentiable map. Moreover, we find that h_{\ast} Z^{(i)} = \widetilde{Z}^{(i)}, where h_{\ast} denotes the pushforward map. The variables Z^{(i)} are Gaussian with distribution Z^{(i)} \sim N(\mu^{(i)}, \Sigma^{(i)}), so the density p^{(i)} can be written as

\ln(p^{(i)}(z)) = -\frac{n}{2} \ln(2\pi) - \frac{1}{2} \ln|\Sigma^{(i)}| - \frac{1}{2} (z - \mu^{(i)})^{\top} \Theta^{(i)} (z - \mu^{(i)}).   (21)

A similar relation holds for \widetilde{p}^{(i)}(z), the density of \widetilde{Z}^{(i)}. Finally, we have the relations

p^{(i)}(z) = \widetilde{p}^{(i)}(h(z)) |\det J_h(z)|   (22)

which are a consequence of h_{\ast} Z^{(i)} = \widetilde{Z}^{(i)}. Here J_h denotes the Jacobian matrix of partial derivatives. Thus we conclude that

-\frac{n}{2} \ln(2\pi) - \frac{1}{2} \ln|\Sigma^{(i)}| - \frac{1}{2} (z - \mu^{(i)})^{\top} \Theta^{(i)} (z - \mu^{(i)}) = -\frac{n}{2} \ln(2\pi) - \frac{1}{2} \ln|\widetilde{\Sigma}^{(i)}| - \frac{1}{2} (h(z) - \widetilde{\mu}^{(i)})^{\top} \widetilde{\Theta}^{(i)} (h(z) - \widetilde{\mu}^{(i)}) + \ln|\det J_h(z)|.   (23)

Calling this relation E^i, we consider the difference E^0 - E^i. Then the determinant term cancels, and we obtain for some constant c'^{(i)}

\frac{1}{2} (z - \mu^{(i)})^{\top} \Theta^{(i)} (z - \mu^{(i)}) - \frac{1}{2} z^{\top} \Theta^{(0)} z = \frac{1}{2} (h(z) - \widetilde{\mu}^{(i)})^{\top} \widetilde{\Theta}^{(i)} (h(z) - \widetilde{\mu}^{(i)}) - \frac{1}{2} h(z)^{\top} \widetilde{\Theta}^{(0)} h(z) + c'^{(i)}.   (24)

Expanding the quadratic expression on the left-hand side we obtain

\frac{1}{2} (z - \mu^{(i)})^{\top} \Theta^{(i)} (z - \mu^{(i)}) - \frac{1}{2} z^{\top} \Theta^{(0)} z = \frac{1}{2} z^{\top} (\Theta^{(i)} - \Theta^{(0)}) z - z^{\top} \Theta^{(i)} \mu^{(i)} + \frac{1}{2} (\mu^{(i)})^{\top} \Theta^{(i)} \mu^{(i)}.   (25)

We simplify the term \Theta^{(i)} \mu^{(i)} using (18),

\Theta^{(i)} \mu^{(i)} = \eta^{(i)} (B^{(i)})^{\top} B^{(i)} (B^{(i)})^{-1} e_{t_i} = \eta^{(i)} (B^{(i)})^{\top} e_{t_i} = \eta^{(i)} r^{(i)},   (26)

and apply the same simplification on the right-hand side. Finally, we collect the constant terms independent of z in c^{(i)} to obtain

\frac{1}{2} z^{\top} (\Theta^{(i)} - \Theta^{(0)}) z - \eta^{(i)} (r^{(i)})^{\top} z = \frac{1}{2} h(z)^{\top} (\widetilde{\Theta}^{(i)} - \widetilde{\Theta}^{(0)}) h(z) - \widetilde{\eta}^{(i)} (\widetilde{r}^{(i)})^{\top} h(z) + c^{(i)},   (27)

which is (19) as desired. ∎
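As a quick sanity check of (20) (our own illustration, not part of the proof), consider the special case of a linear reparametrization: if \widetilde{Z}^{(i)} = T^{-1} Z^{(i)} for an invertible matrix T, then h(z) = T^{-1} z and \widetilde{\Theta}^{(i)} = T^{\top} \Theta^{(i)} T, and (20) holds with c^{(i)} = 0.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4

def random_precision():
    # Precision of a linear Gaussian SCM: Theta = B^T B with B lower triangular.
    B = np.tril(rng.normal(size=(d, d)), k=-1) + np.diag(rng.uniform(0.5, 2.0, size=d))
    return B.T @ B

Theta0, Theta_i = random_precision(), random_precision()
T = rng.normal(size=(d, d))                 # a generic T is invertible almost surely
Theta0_t, Theta_i_t = T.T @ Theta0 @ T, T.T @ Theta_i @ T   # precisions of Z~ = T^{-1} Z

z = rng.normal(size=d)
h_z = np.linalg.solve(T, z)                 # h(z) = T^{-1} z
lhs = 0.5 * z @ (Theta_i - Theta0) @ z
rhs = 0.5 * h_z @ (Theta_i_t - Theta0_t) @ h_z
assert np.isclose(lhs, rhs)                 # relation (20) with c^{(i)} = 0
```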

A second key ingredient for our identifiability results is the specific structure of the interventions: since each intervention targets a single node, the difference of precision matrices \Theta^{(i)} - \Theta^{(0)} has rank at most 2. This is also the key ingredient used (in a very different manner) in the proof of the main result in [84]. We highlight this simple result here because we make use of it frequently.

Lemma 3.

The precision matrices satisfy the relation

\Theta^{(i)} - \Theta^{(0)} = r^{(i)} \otimes r^{(i)} - s^{(i)} \otimes s^{(i)}.   (28)
Proof.

Using (17) we find

\Theta^{(i)} - \Theta^{(0)} = (B^{(i)})^{\top} B^{(i)} - B^{\top} B
= (B^{(i)})^{\top} \left( \sum_{k=1}^{d} e_k e_k^{\top} \right) B^{(i)} - B^{\top} \left( \sum_{k=1}^{d} e_k e_k^{\top} \right) B
= \sum_{k=1}^{d} (B^{(i)})^{\top} e_k \, ((B^{(i)})^{\top} e_k)^{\top} - \sum_{k=1}^{d} B^{\top} e_k \, (B^{\top} e_k)^{\top}
= ((B^{(i)})^{\top} e_{t_i}) \otimes ((B^{(i)})^{\top} e_{t_i}) - (B^{\top} e_{t_i}) \otimes (B^{\top} e_{t_i})
= r^{(i)} \otimes r^{(i)} - s^{(i)} \otimes s^{(i)}.   (29)

Here we used the fact that all rows except row t_i of B^{(i)} and B^{(0)} agree, as the intervention targets a single node, i.e., (B^{(i)})^{\top} e_k = (B^{(0)})^{\top} e_k for k \neq t_i. ∎
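The rank-two structure in (28) is easy to verify numerically; the following NumPy snippet (our own sketch, with made-up parameter values) changes a single row of B and checks both the identity and the rank bound.

```python
import numpy as np

rng = np.random.default_rng(2)
d, t = 5, 3                                 # latent dimension and intervened node

B0 = np.tril(rng.normal(size=(d, d)), k=-1) + np.diag(rng.uniform(0.5, 2.0, size=d))
B_i = B0.copy()
B_i[t, :t] = rng.normal(size=t)             # the intervention only modifies row t
B_i[t, t] = rng.uniform(0.5, 2.0)

r, s = B_i[t], B0[t]                        # r^{(i)} = (B^{(i)})^T e_t, s^{(i)} = B^T e_t
diff = B_i.T @ B_i - B0.T @ B0
assert np.allclose(diff, np.outer(r, r) - np.outer(s, s))   # eq. (28)
assert np.linalg.matrix_rank(diff) <= 2                     # rank at most two
```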

A.3 Warm-up

To provide some intuition for how identifiability arises, we first prove a simpler result under much stronger assumptions on the set of intervention targets and the type of interventions. Importantly, however, we do not relax the assumption that the mixing f may be an arbitrary non-linear map.

Theorem 4.

Let f(Z^{(i)}) \overset{\mathcal{D}}{=} \widetilde{f}(\widetilde{Z}^{(i)}) for two sets of parameters ((B^{(i)}, \eta^{(i)}, t_i)_{i \in I}, f) and ((\widetilde{B}^{(i)}, \widetilde{\eta}^{(i)}, \widetilde{t}_i)_{i \in I}, \widetilde{f}) as in the above model. We assume that there are 2n pairwise different perfect interventions (i.e., Z^{(i)} \overset{\mathcal{D}}{\neq} Z^{(i')} for i \neq i') without shifts (i.e., \eta^{(i)} = \widetilde{\eta}^{(i)} = 0). Finally, we assume that there are two interventions for each node and that we know the intervention targets, i.e., t_i = \widetilde{t}_i for all i. Then Z^{(i)} and \widetilde{Z}^{(i)} agree up to scaling, i.e., Z^{(i)} = D \widetilde{Z}^{(i)} for a diagonal matrix D \in \mathrm{Diag}(n).

Remark 4.

Actually this result can be generalized to a setting where the latent SCM is non-parametric [92].

Proof.

We consider a node k and two interventions i_1 and i_2 which target node k = t_{i_1} = t_{i_2} (by assumption those interventions exist, and i_1 and i_2 agree for both representations). Since we assumed that the interventions are perfect, we have in our notation that

r^{(i_1)} = \lambda_{i_1} e_k, \quad r^{(i_2)} = \lambda_{i_2} e_k   (30)

for two constants \lambda_{i_1} \neq \lambda_{i_2} > 0, and similarly for \widetilde{r}^{(i_j)}. Indeed, assuming positivity is not a restriction since \epsilon follows a symmetric distribution (this is also contained in the parametrisation (D^{(i)})^{\frac{1}{2}}), and \lambda_{i_1} \neq \lambda_{i_2} is necessary to ensure that the interventional distributions differ. Combining (20) from Lemma 2 with Lemma 3 we obtain the relation

z^{\top} \left( r^{(i_j)} \otimes r^{(i_j)} - s^{(i_j)} \otimes s^{(i_j)} \right) z = z^{\top} (\Theta^{(i_j)} - \Theta^{(0)}) z
= h(z)^{\top} (\widetilde{\Theta}^{(i_j)} - \widetilde{\Theta}^{(0)}) h(z) + 2 c^{(i_j)}
= h(z)^{\top} \left( \widetilde{r}^{(i_j)} \otimes \widetilde{r}^{(i_j)} - \widetilde{s}^{(i_j)} \otimes \widetilde{s}^{(i_j)} \right) h(z) + 2 c^{(i_j)}   (31)

for j = 1, 2.

Note that since k = t_{i_1} = t_{i_2}, we have s^{(i_1)} = s^{(i_2)} = B^{\top} e_k by definition. Thus, we subtract the relation in the last display for i_2 from the relation for i_1 and find, using (30),

(\lambda_{i_1}^2 - \lambda_{i_2}^2) z_k^2 = (z^{\top} r^{(i_1)})^2 - (z^{\top} r^{(i_2)})^2
= (h(z)^{\top} \widetilde{r}^{(i_1)})^2 - (h(z)^{\top} \widetilde{r}^{(i_2)})^2 + 2(c^{(i_1)} - c^{(i_2)})
= (\widetilde{\lambda}_{i_1}^2 - \widetilde{\lambda}_{i_2}^2) h_k(z)^2 + 2(c^{(i_1)} - c^{(i_2)})   (32)

for all z. Since \widetilde{\lambda}_{i_1} \neq \widetilde{\lambda}_{i_2} > 0 we can divide and obtain

\frac{\lambda_{i_1}^2 - \lambda_{i_2}^2}{\widetilde{\lambda}_{i_1}^2 - \widetilde{\lambda}_{i_2}^2} z_k^2 = h_k(z)^2 + 2 \frac{c^{(i_1)} - c^{(i_2)}}{\widetilde{\lambda}_{i_1}^2 - \widetilde{\lambda}_{i_2}^2}.   (33)

Since h is surjective, h_k is also surjective, which implies that the range of the right-hand side is [2(c^{(i_1)} - c^{(i_2)}) / (\widetilde{\lambda}_{i_1}^2 - \widetilde{\lambda}_{i_2}^2), \infty). The range of the left-hand side depends on the sign of the prefactor and is either (-\infty, 0] or [0, \infty). Since the ranges have to agree, we conclude that c^{(i_1)} = c^{(i_2)} and

h_k(z) = \pm \sqrt{\frac{\lambda_{i_1}^2 - \lambda_{i_2}^2}{\widetilde{\lambda}_{i_1}^2 - \widetilde{\lambda}_{i_2}^2}} \, z_k.   (34)

Since k was arbitrary, this ends the proof. ∎
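The cancellation used in (32) can also be checked directly: for two perfect interventions on the same node k with different scalings, the difference of the two interventional precision matrices is supported on the single entry (k, k). A minimal NumPy sketch (our own illustration, with made-up parameter values):

```python
import numpy as np

rng = np.random.default_rng(3)
d, node = 5, 2

B0 = np.tril(rng.normal(size=(d, d)), k=-1) + np.diag(rng.uniform(0.5, 2.0, size=d))

def perfect_intervention(B0, node, lam):
    # Perfect intervention: cut all parents of `node` and rescale its noise, so row node = lam * e_node.
    B = B0.copy()
    B[node, :] = 0.0
    B[node, node] = lam
    return B

lam1, lam2 = 1.3, 0.7
B1, B2 = perfect_intervention(B0, node, lam1), perfect_intervention(B0, node, lam2)
e = np.eye(d)[node]
expected = (lam1**2 - lam2**2) * np.outer(e, e)
assert np.allclose(B1.T @ B1 - B2.T @ B2, expected)   # only the z_k^2 coefficient survives
```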

A.4 Proof of identifiability up to linear maps

The proof of our main result relies on two essentially geometric lemmas and a slight extension of Lemma 2, which we discuss now. The first lemma alone is sufficient to handle the case where an intervention changes the scaling matrix D, the second is required to handle interventions that change B, and the third allows us to consider pure shift interventions.

Lemma 4.

Let g: \mathbb{R}^d \to \mathbb{R} be a continuous and surjective function, Q a quadratic form on \mathbb{R}^d, and \alpha \in \mathbb{R} a constant such that

g(z)^2 = z \cdot Qz + \alpha.   (35)

Then Q has rank 1 and g is linear, i.e., g(z) = v \cdot z for some v \in \mathbb{R}^d.

Proof.

Assume that Q has the eigendecomposition Q = \sum_{i=1}^{r} \lambda_i v_i v_i^{\top}, where r is the rank of Q, \lambda_i \neq 0, and the v_i have unit norm. Clearly, g surjective implies r > 0. Since g is surjective, the range of the left-hand side is [0, \infty) and therefore z \cdot Qz \geq -\alpha. This implies that \lambda_i \geq 0 for all i, and therefore the range of the right-hand side is [\alpha, \infty), which implies \alpha = 0. Now we fix any c > 0 and consider E_c = \{ z \in \mathbb{R}^d \,|\, z \cdot Qz = c \}. It is easy to see that E_c is the product of an ellipsoid and \mathbb{R}^{d-r} and thus is connected if r \geq 2. We find by (35) that g(z) = \pm\sqrt{c} for z \in E_c, and if g(z) \in \{-\sqrt{c}, \sqrt{c}\} then z \in E_c. By continuity of g and since E_c is connected, this implies that g(z) = \sqrt{c} for all z \in E_c or g(z) = -\sqrt{c} for all z \in E_c; since for r \geq 2 the set \{z \,|\, z \cdot Qz > 0\} is itself connected, the sign is moreover the same for all c > 0, contradicting the surjectivity of g. We conclude that r = 1, i.e., Q = \lambda_1 v_1 v_1^{\top} = \tilde{v}_1 \tilde{v}_1^{\top} with \tilde{v}_1 = \sqrt{\lambda_1} v_1. Thus we find that

g(z)^2 = |\tilde{v}_1 \cdot z|^2.   (36)

Since g is continuous, this implies that g(z) = \pm \tilde{v}_1 \cdot z or g(z) = \pm |\tilde{v}_1 \cdot z|. As only the first two functions are surjective, we conclude that g(z) = \pm \tilde{v}_1 \cdot z and the claim follows with v = \pm \tilde{v}_1. ∎

To handle shift interventions we need a slightly stronger version of the lemma above which contains an additional linear term.

Corollary 1.

Let g: \mathbb{R}^d \to \mathbb{R} be a continuous and surjective function, Q a quadratic form on \mathbb{R}^d, w \in \mathbb{R}^d a vector, and \alpha \in \mathbb{R} a constant such that

g(z)^2 = z \cdot Qz + w \cdot z + \alpha.   (37)

Then Q has rank 1 and g is affine, i.e., g(z) = v \cdot z + c for some v \in \mathbb{R}^d and c \in \mathbb{R}.

Proof.

Assume as before that Q has the eigendecomposition Q = \sum_{i=1}^{r} \lambda_i v_i v_i^{\top}, where r is the rank of Q, \lambda_i \neq 0, and |v_i| = 1. First, we show that w \in \langle v_1, \ldots, v_r \rangle, where \langle \cdot \rangle denotes the span. Indeed, suppose that this is not the case. Then we could find z_0 such that z_0 \cdot w < 0 and z_0 \cdot v_i = 0 for 1 \leq i \leq r, and therefore z_0 \cdot Qz_0 = 0. Plugging z = \lambda z_0 for \lambda \to \infty into (37), the right-hand side tends to -\infty while the left-hand side is non-negative. This is a contradiction, and therefore w \in \langle v_1, \ldots, v_r \rangle holds. This implies that there is w' such that Qw' = w/2 and thus

z \cdot Qz + w \cdot z + \alpha = (z + w') \cdot Q(z + w') + \alpha - w' \cdot Qw'.   (38)

Then we consider \widetilde{g}(\widetilde{z}) = g(\widetilde{z} - w') and \widetilde{\alpha} = \alpha - w' \cdot Qw', which satisfy

\widetilde{g}^2(\widetilde{z}) = g^2(\widetilde{z} - w')   (39)
= (\widetilde{z} - w' + w') \cdot Q(\widetilde{z} - w' + w') + \alpha - w' \cdot Qw'   (40)
= \widetilde{z} \cdot Q\widetilde{z} + \widetilde{\alpha}.   (41)

Using Lemma 4 we conclude that Q has rank 1 and \widetilde{g}(\widetilde{z}) = v \cdot \widetilde{z} for some v. But then g(z) = \widetilde{g}(z + w') = v \cdot z + v \cdot w', which concludes the proof. ∎

The second lemma is similar, but slightly simpler.

Lemma 5.

Assume that there is a quadratic form Q, a vector 0 \neq w \in \mathbb{R}^d, w' \in \mathbb{R}^d, \alpha \in \mathbb{R}, and a continuous function g: \mathbb{R}^d \to \mathbb{R} such that

g(z)(w \cdot z) = z \cdot Qz + w' \cdot z + \alpha.   (42)

Then g(z) = v \cdot z + c for some v \in \mathbb{R}^d, c \in \mathbb{R}, i.e., g is an affine function.

Proof.

By rescaling we can assume that w has unit norm. Plugging in z = 0 we find that \alpha = 0. Denote by \Pi_w the orthogonal projection on w, i.e., \Pi_w v = (w \cdot v) w, and by \Pi_w^{\perp} the projection on the orthogonal complement of w, i.e., z = \Pi_w z + \Pi_w^{\perp} z for all z \in \mathbb{R}^d. Then we find for all z

0 = g(\Pi_w^{\perp} z) \, w \cdot \Pi_w^{\perp} z = (\Pi_w^{\perp} z) \cdot Q \Pi_w^{\perp} z + (\Pi_w^{\perp} z) \cdot w'.   (43)

By scaling z we find that both terms on the right-hand side vanish and thus for all z

(\Pi_w^{\perp} z) \cdot Q \Pi_w^{\perp} z = 0,   (44)
(\Pi_w^{\perp} z) \cdot w' = 0.   (45)

This implies (using also the symmetry of Q)

g(z) \, w \cdot \Pi_w z = g(z) \, w \cdot z
= z \cdot Qz + w' \cdot z
= (\Pi_w z + \Pi_w^{\perp} z) \cdot Q (\Pi_w z + \Pi_w^{\perp} z) + w' \cdot (\Pi_w z + \Pi_w^{\perp} z)
= (\Pi_w z) \cdot Q \Pi_w z + 2 (\Pi_w z) \cdot Q \Pi_w^{\perp} z + (\Pi_w^{\perp} z) \cdot Q \Pi_w^{\perp} z + \Pi_w z \cdot w'
= (\Pi_w z) \cdot \left( Q(\Pi_w z + 2 \Pi_w^{\perp} z) + w' \right).   (46)

Now we use \Pi_w z = (w \cdot \Pi_w z) w and find

g(z) \, w \cdot \Pi_w z = (w \cdot \Pi_w z) \, w \cdot \left( Q(\Pi_w z + 2 \Pi_w^{\perp} z) + w' \right).   (47)

This implies

g(z) = w \cdot Q(\Pi_w z + 2 \Pi_w^{\perp} z) + w \cdot w'   (48)

for all z such that \Pi_w z \neq 0. By continuity we conclude that this holds for all z, i.e., g is affine. ∎

Again we need a slight generalization of this lemma.

Corollary 2.

Assume that there is a quadratic form Q, a vector 0 \neq w \in \mathbb{R}^d, w' \in \mathbb{R}^d, \alpha, \beta \in \mathbb{R}, and a continuous function g: \mathbb{R}^d \to \mathbb{R} such that

g(z)(w \cdot z + \beta) = z \cdot Qz + w' \cdot z + \alpha.   (49)

Then g(z) = v \cdot z + c for some v \in \mathbb{R}^d, c \in \mathbb{R}, i.e., g is an affine function.

Proof.

Define \widetilde{g}(\widetilde{z}) = g(\widetilde{z} - \beta w |w|^{-2}). Then

\widetilde{g}(\widetilde{z})(w \cdot \widetilde{z}) = g(\widetilde{z} - \beta w |w|^{-2}) \left( w \cdot (\widetilde{z} - \beta w |w|^{-2}) + \beta \right)
= (\widetilde{z} - \beta w |w|^{-2}) \cdot Q (\widetilde{z} - \beta w |w|^{-2}) + w' \cdot (\widetilde{z} - \beta w |w|^{-2}) + \alpha
= \widetilde{z} \cdot Q\widetilde{z} + \widetilde{w} \cdot \widetilde{z} + \widetilde{\alpha}   (50)

for some \widetilde{w} and \widetilde{\alpha}. So, we conclude from Lemma 5 that \widetilde{g} is affine and thus the same is true for g. ∎

We now discuss a small extension of Lemma 2, showing that we can complete the square in (19), which then allows us to obtain a relation very similar to (20).

Lemma 6.

Assume the general setup as introduced above and, in addition, that \Theta^{(i)} \neq \Theta^{(0)}. Then there are a vector \mu'^{(i)} and a constant c such that

(z - \mu^{(i)})^{\top} \Theta^{(i)} (z - \mu^{(i)}) - z^{\top} \Theta^{(0)} z = (z - \mu'^{(i)})^{\top} (\Theta^{(i)} - \Theta^{(0)}) (z - \mu'^{(i)}) + c.   (51)
Proof.

Using the symmetry of the precision matrices and expanding all the terms, we can express the difference of the left-hand side and the right-hand side of (51) as

(z - \mu^{(i)})^{\top} \Theta^{(i)} (z - \mu^{(i)}) - z^{\top} \Theta^{(0)} z - (z - \mu'^{(i)})^{\top} (\Theta^{(i)} - \Theta^{(0)}) (z - \mu'^{(i)}) - c
= -2 z^{\top} \Theta^{(i)} \mu^{(i)} + 2 z^{\top} (\Theta^{(i)} - \Theta^{(0)}) \mu'^{(i)} + (\mu^{(i)})^{\top} \Theta^{(i)} \mu^{(i)} - (\mu'^{(i)})^{\top} (\Theta^{(i)} - \Theta^{(0)}) \mu'^{(i)} - c.   (52)

Hence, we can choose c at the end such that the constant terms cancel, and all we have to show is that there exists \mu'^{(i)} such that

(\Theta^{(i)} - \Theta^{(0)}) \mu'^{(i)} = \Theta^{(i)} \mu^{(i)}.   (53)

This is clearly not true for arbitrary \mu^{(i)} because \Theta^{(i)} - \Theta^{(0)} has rank at most 2. However, such a \mu'^{(i)} does exist for \mu^{(i)} = \eta^{(i)} (B^{(i)})^{-1} e_{t_i} (see equation (18)). Indeed, we can simplify using (17) and Lemma 3:

(\Theta^{(i)} - \Theta^{(0)}) \mu'^{(i)} = r^{(i)} ((r^{(i)})^{\top} \mu'^{(i)}) - s^{(i)} ((s^{(i)})^{\top} \mu'^{(i)}),   (54)
\Theta^{(i)} \mu^{(i)} = \eta^{(i)} (B^{(i)})^{\top} B^{(i)} (B^{(i)})^{-1} e_{t_i} = \eta^{(i)} (B^{(i)})^{\top} e_{t_i} = \eta^{(i)} r^{(i)}.   (55)

Now there are two cases. When r^{(i)} and s^{(i)} are collinear (but not identical, by assumption) we can choose \mu'^{(i)} = \lambda r^{(i)} for a suitable \lambda. Otherwise, we can pick any vector \mu'^{(i)} such that (s^{(i)})^{\top} \mu'^{(i)} = 0 and (r^{(i)})^{\top} \mu'^{(i)} = \eta^{(i)}. This is always possible for non-collinear vectors (project r^{(i)} onto the orthogonal complement of s^{(i)} and rescale). ∎
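The construction of \mu'^{(i)} in the non-collinear case is explicit; a small NumPy sketch (our own illustration, with made-up parameter values) projects r^{(i)} onto the orthogonal complement of s^{(i)}, rescales, and verifies (53).

```python
import numpy as np

rng = np.random.default_rng(4)
d, t, eta = 5, 3, 0.8

B0 = np.tril(rng.normal(size=(d, d)), k=-1) + np.diag(rng.uniform(0.5, 2.0, size=d))
B_i = B0.copy()
B_i[t, :t] = rng.normal(size=t)             # generic change of row t: r and s are not collinear
r, s = B_i[t], B0[t]

# Choose mu' with s . mu' = 0 and r . mu' = eta (projection of r onto span(s)^perp, rescaled).
p = r - (r @ s) / (s @ s) * s
mu_prime = eta * p / (r @ p)

Theta_i, Theta0 = B_i.T @ B_i, B0.T @ B0
mu = eta * np.linalg.solve(B_i, np.eye(d)[t])                    # eq. (18)
assert np.allclose((Theta_i - Theta0) @ mu_prime, Theta_i @ mu)  # eq. (53)
```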

Let us now first discuss a rough sketch of the proof of our main result. This allows us to present the main ideas without discussing all the technical details required for the full proof of the result. We restate the theorem for convenience.

See 3

Proof Sketch.

For simplicity of the sketch we ignore shift interventions here. We assume that we have two sets of parameters ((B^{(i)}, t_i)_{i \in \bar{I}}, f) and ((\widetilde{B}^{(i)}, \widetilde{t}_i)_{i \in \bar{I}}, \widetilde{f}) generating the same observations. We can relabel the interventions i and the variables \widetilde{Z}_k such that \widetilde{t}_i = i and such that 1, 2, \ldots is a valid causal order of the graph. The general idea is to show by induction that (h_1(z), \ldots, h_k(z)) is a linear function of z. Let us sketch how this is achieved. The key ingredients here are Lemma 2 and Lemma 3. Applying (20) from Lemma 2 in combination with Lemma 3 we obtain

\frac{1}{2} z^{\top} (\Theta^{(k+1)} - \Theta^{(0)}) z = \frac{1}{2} h(z)^{\top} (\widetilde{r}^{(k+1)} \otimes \widetilde{r}^{(k+1)} - \widetilde{s}^{(k+1)} \otimes \widetilde{s}^{(k+1)}) h(z) + c^{(k+1)}.   (56)

By our assumption that the variables \widetilde{Z}^{(i)} follow the causal order of the underlying graph, only the entries \widetilde{r}^{(k+1)}_r with r \leq k+1 can be non-zero, and similarly for \widetilde{s}^{(k+1)}. Therefore, the right-hand side is a quadratic form in (h_1(z), \ldots, h_{k+1}(z)). Now we apply our induction hypothesis, which states that (h_1(z), \ldots, h_k(z)) is a linear function of z. Expanding the quadratic form on the right-hand side, we then find that there are a matrix Q, a vector v, a constant \delta (which we here assume to be non-zero), and another quadratic form Q' such that

\frac{1}{2} z^{\top} Q' z = \frac{1}{2} z^{\top} Q z + (v \cdot z) h_{k+1}(z) + \delta h_{k+1}^2(z) + c^{(k+1)}
= \frac{1}{2} z^{\top} Q z + \delta \left( h_{k+1}(z) + \frac{v \cdot z}{2\delta} \right)^2 - \frac{(v \cdot z)^2}{4\delta} + c^{(k+1)}.   (57)

Now we can apply Corollary 1 to conclude that h_{k+1}(z) is an affine function of z. Adding the shifts and considering also the case \delta = 0 adds several technical difficulties, which we handle in the full proof. ∎
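The completing-the-square step used in (57) (and again in (65)) rests on an elementary identity; a one-line SymPy check (our own illustration), with x standing in for h_{k+1}(z) and a for v \cdot z:

```python
import sympy as sp

x, a, delta = sp.symbols('x a delta')
lhs = delta * x**2 + a * x
rhs = delta * (x + a / (2 * delta))**2 - a**2 / (4 * delta)
assert sp.simplify(lhs - rhs) == 0   # valid whenever delta != 0
```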

After all those preparations and presenting the proof sketch, we will finally prove our main theorem in full generality.

Full proof of Theorem 3.

As the proof has to consider different cases and is a bit technical we structure it in a series of steps.

Labeling assumptions

We can assume without loss of generality (by relabeling variables and interventions) that the variables \widetilde{Z}_i are ordered such that their natural order i = 1, 2, \ldots is a valid topological order of the underlying DAG and that intervention 1 \leq i \leq n targets node \widetilde{Z}_i, i.e., \widetilde{t}_i = i. We make no assumption on the ordering of the Z_i or on the targets t_i, which are not necessarily pairwise different. Recall that we defined (and showed the existence of) h = \widetilde{f}^{-1} \circ f in Lemma 2 (including the fact that the latent dimensions agree).

Induction claim

We now prove by induction that h_k(z) is linear for k \geq 0. The base case k = 0 is trivially true. For the induction step, assume that this is the case for j \leq k, i.e., h_j(z) = m_j \cdot z for some m_j \in \mathbb{R}^d. We define M_k = (m_1, \ldots, m_k)^{\top} so that we can write concisely

(h_1(z), \ldots, h_k(z))^{\top} = M_k z.   (58)

Note that since h is surjective, M_k has full rank. To prove the induction step, we need to show that there exists m_{k+1} \in \mathbb{R}^d such that h_{k+1}(z) = m_{k+1} \cdot z.

Precision matrix decomposition

For the DAG \widetilde{G} underlying \widetilde{Z}_i, let \mathrm{PA}_i denote the index set of the parents of \widetilde{Z}_i and let \overline{\mathrm{PA}}_i = \{i\} \cup \mathrm{PA}_i. Recall the notation \widetilde{s}^{(i)} = \widetilde{B}^{\top} e_i (using \widetilde{t}_i = i) and \widetilde{r}^{(i)} = (\widetilde{B}^{(i)})^{\top} e_i. Now \widetilde{s}^{(i)}_j \neq 0 only holds if j \in \overline{\mathrm{PA}}_i, and we ordered the variables such that \overline{\mathrm{PA}}_i \subset [i]. In particular, \widetilde{s}^{(i)}_j = 0 and \widetilde{r}^{(i)}_j = 0 for j > i (interventions do not create new parents). By Lemma 3 we have

\widetilde{\Theta}^{(i)} - \widetilde{\Theta}^{(0)} = \widetilde{r}^{(i)} \otimes \widetilde{r}^{(i)} - \widetilde{s}^{(i)} \otimes \widetilde{s}^{(i)}.   (59)

We find that

(\widetilde{\Theta}^{(k+1)} - \widetilde{\Theta}^{(0)})_{rs} = \widetilde{r}^{(k+1)}_r \widetilde{r}^{(k+1)}_s - \widetilde{s}^{(k+1)}_r \widetilde{s}^{(k+1)}_s = 0   (60)

if r > k+1 or s > k+1. In particular, there exists a symmetric A \in \mathbb{R}^{(k+1) \times (k+1)} such that

h(z)^{\top} (\widetilde{\Theta}^{(k+1)} - \widetilde{\Theta}^{(0)}) h(z) = \sum_{r,s=1}^{k+1} h_r(z) A_{rs} h_s(z).   (61)

Now we decompose A as

A = \begin{pmatrix} A' & b \\ b^{\top} & \delta \end{pmatrix}   (62)

for some A' \in \mathbb{R}^{k \times k}, b \in \mathbb{R}^k, and \delta \in \mathbb{R}. Using the induction hypothesis we find

h(z)^{\top} (\widetilde{\Theta}^{(k+1)} - \widetilde{\Theta}^{(0)}) h(z) = \sum_{r,s=1}^{k+1} h_r(z) A_{rs} h_s(z) = M_k z \cdot A' M_k z + 2 h_{k+1}(z) b^{\top} M_k z + \delta h_{k+1}(z)^2.   (63)

We use the notation v_{:r} \in \mathbb{R}^r for the vector (v_1, \ldots, v_r)^{\top}. Then we can write

\widetilde{\eta}^{(k+1)} \widetilde{r}^{(k+1)} \cdot h(z) = \widetilde{\eta}^{(k+1)} \widetilde{r}^{(k+1)}_{:k} \cdot M_k z + \widetilde{\eta}^{(k+1)} \widetilde{r}^{(k+1)}_{k+1} h_{k+1}(z).   (64)

Now we have to consider different cases.

Case \delta \neq 0

We assume first that \delta \neq 0. In this case we find, using the last two displays,

h(z)^{\top} (\widetilde{\Theta}^{(k+1)} - \widetilde{\Theta}^{(0)}) h(z) - 2 \widetilde{\eta}^{(k+1)} \widetilde{r}^{(k+1)} \cdot h(z) + c^{(k+1)}
= z \cdot M_k^{\top} A' M_k z + \delta \left( h_{k+1}(z) + \frac{b^{\top} M_k z - \widetilde{\eta}^{(k+1)} \widetilde{r}^{(k+1)}_{k+1}}{\delta} \right)^2 - \frac{1}{\delta} z \cdot M_k^{\top} b b^{\top} M_k z
\quad + \frac{2}{\delta} \widetilde{\eta}^{(k+1)} \widetilde{r}^{(k+1)}_{k+1} b^{\top} M_k z - \frac{1}{\delta} (\widetilde{\eta}^{(k+1)} \widetilde{r}^{(k+1)}_{k+1})^2 - 2 \widetilde{\eta}^{(k+1)} \widetilde{r}^{(k+1)}_{:k} \cdot M_k z + c^{(k+1)}.   (65)

Next we set

g(z) = h_{k+1}(z) + \delta^{-1} b \cdot M_k z - \delta^{-1} \widetilde{\eta}^{(k+1)} \widetilde{r}^{(k+1)}_{k+1}.   (66)

Looking carefully at (65), we find that there are a symmetric matrix \widetilde{Q}, a vector \widetilde{w}, and a constant \widetilde{c} such that

h(z)^{\top} (\widetilde{\Theta}^{(k+1)} - \widetilde{\Theta}^{(0)}) h(z) - 2 \widetilde{\eta}^{(k+1)} \widetilde{r}^{(k+1)} \cdot h(z) = \delta g(z)^2 + z \cdot \widetilde{Q} z + \widetilde{w} \cdot z + \widetilde{c}.   (67)

We claim that g is continuous and surjective. Continuity follows directly from the continuity of h. To show that g is surjective we can ignore the constant shift, and we note that \delta^{-1} b \cdot M_k z = \delta^{-1} \sum_{r=1}^{k} b_r h_r(z); thus surjectivity of h implies that g is surjective (for every c pick z such that h_{k+1}(z) = c and h_r(z) = 0 for r \leq k).

Using (19) from Lemma 2 we conclude that

\delta g(z)^2 = z^{\top} (\Theta^{(k+1)} - \Theta^{(0)}) z - 2 \eta^{(k+1)} r^{(k+1)} \cdot z - z \cdot \widetilde{Q} z - \widetilde{w} \cdot z - \widetilde{c}   (68)
= z \cdot Qz + w \cdot z + \alpha   (69)

for all z and some symmetric Q \in \mathbb{R}^{d \times d}, w \in \mathbb{R}^d, \alpha \in \mathbb{R}. Corollary 1 then implies that

h_{k+1}(z) + \delta^{-1} b \cdot M_k z - \delta^{-1} \widetilde{\eta}^{(k+1)} \widetilde{r}^{(k+1)}_{k+1} = g(z) = v \cdot z + c   (70)

for some v \in \mathbb{R}^d, c \in \mathbb{R}. Thus h_{k+1} is an affine function, i.e., h_{k+1}(z) = m_{k+1} \cdot z + \alpha. But

0 = \mathbb{E} \widetilde{Z}_{k+1} = \mathbb{E} Z \cdot m_{k+1} + \alpha = \alpha   (71)

and therefore h_{k+1} is linear.

Case \delta = 0, b \neq 0

Now we consider the case \delta = 0. Using (63), (64), and (19) we find, similarly to above,

M_k z \cdot A' M_k z + 2 h_{k+1}(z) b^{\top} M_k z - 2 \widetilde{\eta}^{(k+1)} \widetilde{r}^{(k+1)}_{:k} \cdot M_k z - 2 \widetilde{\eta}^{(k+1)} \widetilde{r}^{(k+1)}_{k+1} h_{k+1}(z) + \widetilde{c}^{(k+1)} = z^{\top} (\Theta^{(k+1)} - \Theta^{(0)}) z - 2 \eta^{(k+1)} r^{(k+1)} \cdot z.   (72)

Collecting all terms we find that h_{k+1} satisfies a relation of the type

h_{k+1}(z) (w \cdot z + \beta) = z \cdot Qz + w' \cdot z + \alpha   (73)

for suitable Q, \beta, w, w', and \alpha. Note in particular that w^{\top} = 2 b^{\top} M_k \neq 0 because b \neq 0 and M_k has full rank. Hence Corollary 2 implies that h_{k+1}(z) is affine and, as argued above, we conclude that h_{k+1} is linear.

Case \delta = 0, b = 0

It remains to address the case b = 0 and \delta = 0. Going back to the definition of b and \delta (see (61) and (62)) we find

(b^{\top}, \delta)^{\top} = \widetilde{r}^{(k+1)}_{:(k+1)} \widetilde{r}^{(k+1)}_{k+1} - \widetilde{s}^{(k+1)}_{:(k+1)} \widetilde{s}^{(k+1)}_{k+1}.   (74)

Now \delta = 0 implies \widetilde{s}^{(k+1)}_{k+1} = \widetilde{r}^{(k+1)}_{k+1} (because they are both positive). But then we conclude \widetilde{r}^{(k+1)} = \widetilde{s}^{(k+1)} from b = 0 and \widetilde{r}^{(k+1)}_r = \widetilde{s}^{(k+1)}_r = 0 for r > k+1. This implies

\widetilde{\Theta}^{(k+1)} - \widetilde{\Theta}^{(0)} = \widetilde{r}^{(k+1)} \otimes \widetilde{r}^{(k+1)} - \widetilde{s}^{(k+1)} \otimes \widetilde{s}^{(k+1)} = 0.   (75)

That is, intervention k+1 is a pure shift intervention in the \widetilde{Z} representation, i.e., \widetilde{B} = \widetilde{B}^{(k+1)} and \widetilde{\eta}^{(k+1)} \neq 0 (otherwise we do not actually intervene). In this case, we conclude from Lemma 2 and Lemma 6 (if \Theta^{(k+1)} \neq \Theta^{(0)})

\widetilde{\eta}^{(k+1)} \widetilde{r}^{(k+1)} \cdot h(z) + c^{(k+1)} = (z - \mu'^{(k+1)})^{\top} (\Theta^{(k+1)} - \Theta^{(0)}) (z - \mu'^{(k+1)}) + c.   (76)

But then \widetilde{\eta}^{(k+1)} \widetilde{r}^{(k+1)} \nabla h(\mu'^{(k+1)}) = 0, which implies that \nabla h is not invertible, a contradiction to h being a diffeomorphism. Thus, we conclude that also \Theta^{(k+1)} = \Theta^{(0)}, i.e., intervention k+1 is also a pure shift intervention for the Z variables. But then we find

\widetilde{r}^{(k+1)}_{k+1} h_{k+1}(z) + \widetilde{r}^{(k+1)}_{:k} \cdot M_k z = (\widetilde{\eta}^{(k+1)})^{-1} (\eta^{(k+1)} r^{(k+1)} \cdot z + c^{(k+1)}).   (77)

So we finally conclude that also in this case h_{k+1} is affine and, as in (71), in fact linear, which ends the proof. ∎

Appendix B From Linear Identifiability to Full Identifiability

In this section, we provide the proof of Theorem 1, which is a combination of our linear identifiability result, Theorem 3, proved in the previous section and the main result of [84], which shows identifiability for linear mixing f. However, we need to carefully address their normalization choices and, in addition, consider shifts.

We emphasize that the full identifiability results could also be derived from our proof of Theorem 3 by carefully considering ranks of quadratic forms and showing that we can inductively identify source nodes. We do not add these arguments as our reasoning is already quite complex and because for linear mixing functions this is not substantially different from the original proof in [84].

For the convenience of the reader, we will now restate the main results of [84]. For our work it is convenient to slightly deviate from their conventions, i.e., we define the causal order of a graph by i \prec j iff i \in \mathrm{an}(j); in particular, i \to j implies i \prec j.

We rephrase their results using our notation. Let S(G) denote the set of permutations of the vertices of G that preserve the causal ordering, i.e., all permutations \rho such that \rho(i) < \rho(j) if i \prec j. They consider linear mixing functions as specified next.

Assumption 5.

Assume f(z) = Lz for some matrix L \in \mathbb{R}^{d' \times d} with trivial kernel and pseudoinverse H = L^{+} such that in each row of H the entry with the largest absolute value equals one and the leftmost such entry is positive.

Then the following result holds.

Theorem 5 (Linear setting, Theorems 1 and 2 in [84]).

Suppose the latent variables Z^{(i)} are generated following Assumptions 2 and 3 with \eta^{(i)} = 0 (i.e., no shift), with DAG G, and with all A^{(i)} lower triangular. Suppose in addition that Assumption 4 holds and the mixing function satisfies Assumption 5. Then we can identify the intervention targets and the partial order \prec_G of the underlying DAG G up to a permutation of the node labels. If we in addition assume that all interventions are perfect, the problem is identifiable in the following sense. For every representation (\widetilde{B}^{(i)}, \widetilde{H}) such that all \widetilde{B}^{(i)} are lower triangular and \widetilde{L}(\widetilde{Z}^{(i)}) \overset{\mathcal{D}}{=} X^{(i)}, where \widetilde{L} = \widetilde{H}^{+}, there is a permutation \sigma \in S(G) such that

\widetilde{H} = P_{\sigma}^{\top} H, \quad \widetilde{B}^{(i)} = P_{\sigma}^{\top} B^{(i)} P_{\sigma}.   (78)
Remark 5.

A few remarks are in order.

  1. Compared to their statement in Theorem 2 we replaced P_{\sigma} by P_{\sigma}^{\top} because we use the transposed convention for the permutation matrices.

  2. Note that their result per se doesn't mention identifiability up to scaling, but that's because part of their assumptions involves normalizing f. Since we don't normalize the linear map f, we add scaling in the theorem statement.

  3. For the case of imperfect interventions, their result concerns identifiability of the transitive closure \overline{G} of the graph G, but this is the same as identifiability of the partial order \prec_G that we state here.

  4. We assume that interventions do not only act on the noise but also change the corresponding row of A. Note that this is equivalent to their genericity assumption, which states that

     r^{(i)} = (B^{(i)})^{\top} e_{t_i} = (\mathrm{Id} - A^{(i)})_{t_i \cdot} (D^{(i)})^{-\frac{1}{2}}_{t_i t_i}   (79)
     s^{(i)} = (B^{(0)})^{\top} e_{t_i} = (\mathrm{Id} - A^{(0)})_{t_i \cdot} (D^{(0)})^{-\frac{1}{2}}_{t_i t_i}   (80)

     are linearly independent. Indeed, this holds iff A^{(0)}_{t_i \cdot} \neq A^{(i)}_{t_i \cdot} (by acyclicity A^{(i)}_{t_i t_i} = 0).

Let us provide a brief intuition for why this result holds. The key ingredient is the same as in our proof of linear identifiability above, namely Lemma 3, which reads

\Theta^{(i)} - \Theta^{(0)} = r^{(i)} \otimes r^{(i)} - s^{(i)} \otimes s^{(i)}.   (81)

This expression has rank 2 if t_i is not a source node in the causal graph and the intervention is not a pure noise intervention, because s^{(i)} and r^{(i)} are then linearly independent, as explained in the remark above. On the other hand, this expression has rank one for source nodes because in this case s^{(i)} \propto e_{t_i}, and the same is true for r^{(i)}. This allows us to identify interventions on source nodes. Moreover, \Theta^{(i)} - \Theta^{(k)} has rank 2 if interventions i and k target different source nodes and rank 1 if they target the same source node. Thus we can also identify the set of interventions targeting the same source node. One can complete the proof by an induction argument based on removing source nodes from the graph and studying the effect on the precision matrix of the latent variables.
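The rank dichotomy driving this argument is easy to reproduce numerically. In the NumPy sketch below (our own illustration with made-up parameters), node 0 has no parents, so an intervention on it can only rescale its noise and gives a rank-one difference of precisions, while an intervention that changes the parent weights of a non-source node gives rank two.

```python
import numpy as np

rng = np.random.default_rng(5)
d = 5
B0 = np.tril(rng.normal(size=(d, d)), k=-1) + np.diag(rng.uniform(0.5, 2.0, size=d))
# Row 0 has no off-diagonal entries, so node 0 is a source node.

def rank_of_precision_difference(B0, t, new_row):
    B = B0.copy()
    B[t] = new_row
    return np.linalg.matrix_rank(B.T @ B - B0.T @ B0)

# Intervention on the source node 0: only the noise scale can change.
row_source = np.zeros(d); row_source[0] = 2.0
# Intervention on node 3 that changes its dependence on its parents.
row_nonsource = B0[3].copy(); row_nonsource[:3] = rng.normal(size=3)

print(rank_of_precision_difference(B0, 0, row_source))      # expected: 1
print(rank_of_precision_difference(B0, 3, row_nonsource))   # expected: 2
```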

We are now ready to prove our main theorems.

See 1

Proof of Theorem 1.

Suppose there are two representations ((B^{(i)}, \eta^{(i)}, t_i)_{i \in I}, f) and ((\widetilde{B}^{(i)}, \widetilde{\eta}^{(i)}, \widetilde{t}_i)_{i \in I}, \widetilde{f}) of the distributions X^{(i)}, i \geq 0. As explained in the proof of Theorem 3, we can assume by relabeling the nodes that the matrices \widetilde{B}^{(i)} are lower triangular and the corresponding DAG \widetilde{G} has the property that for every edge i \to j we have i < j. We now apply Theorem 3 and find that there is an invertible linear map T such that \widetilde{f} = f \circ T and T^{-1} Z^{(i)} \overset{\mathcal{D}}{=} \widetilde{Z}^{(i)}. One minor difference to the setting in [84] is that we in addition consider shift interventions. They are simpler to analyze than a change of variance, but we can avoid additional arguments with the following trick. If \Sigma^{(i)} = \Sigma^{(0)} for some i, we replace Z^{(i)} \sim N(\mu^{(i)}, \Sigma^{(i)}) by Z'^{(i)} \sim N(0, \Sigma^{(i)} + \mu^{(i)} (\mu^{(i)})^{\top}); otherwise we set Z'^{(i)} \sim N(0, \Sigma^{(i)}), and similarly for \widetilde{Z}. Then T^{-1} Z^{(i)} \overset{\mathcal{D}}{=} \widetilde{Z}^{(i)} implies T^{-1} Z'^{(i)} \overset{\mathcal{D}}{=} \widetilde{Z}'^{(i)}. Using (13) and (17) it can be shown that this distribution corresponds to a change of variance of the noise, i.e., a change of D^{(i)}_{t_i t_i}. This can also be seen directly from the relation Z'^{(i)} \overset{\mathcal{D}}{=} (B^{(i)})^{-1} (\epsilon + \eta^{(i)} \epsilon' e_{t_i}), where \epsilon' \sim N(0,1) is independent of \epsilon. We can now apply Theorem 5 to the primed distributions. We first find a unique invertible \Lambda \in \mathrm{Diag}(d) such that \Lambda T has largest absolute value entry 1 in every row (and in case of ties the leftmost such entry is 1). We can also find a permutation \rho \in S_d such that \bar{B}^{(i)} = P_{\rho} B^{(i)} \Lambda^{-1} P_{\rho}^{\top} is lower triangular for all i (note the distinction between \bar{B}^{(i)} and \widetilde{B}^{(i)}). Indeed, this is possible since the underlying DAG G is acyclic by assumption, interventions do not add edges, and \Lambda is diagonal. Then we find

(T1Λ1Pρ)(B¯(i))1ϵ\displaystyle(T^{-1}\Lambda^{-1}P_{\rho}^{\top})(\bar{B}^{(i)})^{-1}\epsilon =𝒟(T1Λ1Pρ)(B¯(i))1Pρϵ\displaystyle\overset{\mathcal{D}}{=}(T^{-1}\Lambda^{-1}P_{\rho}^{\top})(\bar{B}^{(i)})^{-1}P_{\rho}\epsilon (82)
=T1(B(i))1ϵ\displaystyle=T^{-1}(B^{(i)})^{-1}\epsilon (83)
=𝒟(B~(i))1ϵ.\displaystyle\overset{\mathcal{D}}{=}(\widetilde{B}^{(i)})^{-1}\epsilon. (84)

Here the last step follows from the fact that T1Z(i)=𝒟Z~(i)T^{-1}Z^{(i)}\overset{\mathcal{D}}{=}\widetilde{Z}^{(i)} which implies that the centered distributions agree.

Now we can apply Theorem 5 (with BB replaced by B~\widetilde{B} and identity mixing) and find that for the alternative representation in terms of B¯\bar{B}

PρΛT=PσId,B¯(i)=PσB~(i)Pσ\displaystyle P_{\rho}\Lambda T=P_{\sigma}^{\top}\mathrm{Id},\quad\bar{B}^{(i)}=P_{\sigma}^{\top}\widetilde{B}^{(i)}P_{\sigma} (85)

for some \sigma\in S(\widetilde{G}) because by assumption \bar{B}^{(i)} and \widetilde{B}^{(i)} are lower triangular. This implies that

T\displaystyle T =Λ1PρPσ,\displaystyle=\Lambda^{-1}P_{\rho}^{\top}P_{\sigma}^{\top}, (86)
B(i)\displaystyle B^{(i)} =PρB¯(i)PρΛ=PρPσB~(i)PσPρΛ.\displaystyle=P_{\rho}^{\top}\bar{B}^{(i)}P_{\rho}\Lambda=P_{\rho}^{\top}P_{\sigma}^{\top}\widetilde{B}^{(i)}P_{\sigma}P_{\rho}\Lambda. (87)

To summarize, there is a permutation ω=(ρσ)1\omega=(\rho\circ\sigma)^{-1} (note that PσPρ=PρσP_{\sigma}P_{\rho}=P_{\rho\circ\sigma}) and a diagonal matrix Λ\Lambda such that

B(i)=PωB~(i)PωΛ,T=Λ1Pω,Z(i)=TZ~(i)=Λ1PωZ~(i),\displaystyle B^{(i)}=P_{\omega}\widetilde{B}^{(i)}P_{\omega}^{\top}\Lambda,\quad T=\Lambda^{-1}P_{\omega},\quad Z^{(i)}=T\widetilde{Z}^{(i)}=\Lambda^{-1}P_{\omega}\widetilde{Z}^{(i)}, (88)

We also find the relation ω(ti)=t~i\omega(t_{i})=\widetilde{t}_{i} from here because Pωei=eω1(i)P_{\omega}e_{i}=e_{\omega^{-1}(i)}. Equating the means of T1Z(i)T^{-1}Z^{(i)} and Z~(i)\widetilde{Z}^{(i)} we find

η~(i)(B~(i))1et~i=η(i)T1(B(i))1eti.\displaystyle\widetilde{\eta}^{(i)}(\widetilde{B}^{(i)})^{-1}e_{\widetilde{t}_{i}}=\eta^{(i)}T^{-1}(B^{(i)})^{-1}e_{t_{i}}. (89)

Multiplying by B~(i)\widetilde{B}^{(i)} we find

η~(i)et~i=η(i)B~(i)T1(B(i))1eti=η(i)Pωeti\displaystyle\widetilde{\eta}^{(i)}e_{\widetilde{t}_{i}}=\eta^{(i)}\widetilde{B}^{(i)}T^{-1}(B^{(i)})^{-1}e_{t_{i}}=\eta^{(i)}P_{\omega}^{\top}e_{t_{i}} (90)

and we conclude that η~(i)=η(i)\widetilde{\eta}^{(i)}={\eta}^{(i)}.

See 2

Proof of Theorem 2.

Suppose there are two representations ((B^{(i)},\eta^{(i)},t_{i})_{i\in I},f) and ((\widetilde{B}^{(i)},\widetilde{\eta}^{(i)},\widetilde{t}_{i})_{i\in I},\widetilde{f}) of the observed distributions. As shown in Theorem 3, there is an invertible linear map T such that T^{-1}Z^{(i)}\overset{\mathcal{D}}{=}\widetilde{Z}^{(i)}. Then the same is true for the transformed variables as explained above. Viewing \widetilde{Z}^{(i)} as new observations obtained through the linear mixing T, we conclude from Theorem 5 that \prec_{G} and the intervention targets are identifiable up to relabeling. ∎

Appendix C A comparison to related works

In this section, we will discuss the relation of our results to the most relevant prior works and put them into context.

[84]

This work considers linear mixing functions ff, and we directly generalize their work to non-linear ff. In particular, their techniques are linear-algebraic and do not generalize to non-linear ff (see more technical details in proof intuition in Section 4); we handle this using a mix of statistical and geometric techniques. Notably, their finding that the linearly mixed latent variables are recoverable after observing one intervention per latent node is closely related to an earlier result in domain generalization, which showed that a number of environments equal to the number of “non-causal” latent dimensions can ensure recovery of the remaining latents [71], and later work showing that this requirement can be reduced further under linear mixing with additional modeling structure [11, 72]. This connection is not coincidental—that analysis was for invariant prediction with latent variables, which was originally presented as a method of causal inference using distributions resulting from unique interventions [62]. Later this approach was applied for the purpose of robust prediction in nonlinear settings [25], but the identifiability of the latent causal structure under general nonlinear mixing via interventional distributions remained an open question until this work.

[88]

This work also considers linear mixing f but allows for non-linearity in the latent variables. They use score functions to learn the model, and they raise as an open question the setting of non-linear mixing, which we study in this work. While the models are not directly comparable, our model is closer to real-world settings since high-level latent variables are unlikely to be linearly related to the observations; e.g., pixels in an image are not linearly related to high-level concepts. Moreover, Gaussian priors and deep neural networks are extensively used in practice, and our theory as well as our experimental methodology apply to them.

[48]

While they do not discuss interventions, this work considers the setting of multi-environment CRL with Gaussian priors, and their results can be applied to interventional learning. When applied to interventional data as in our setting (as also shown in [84, Appendix D]), they require additional restrictive assumptions such as d=d^{\prime}, a bijective mixing f, and e\leq d, where e is the number of edges in the underlying causal graph, whereas we make no such restrictions on d^{\prime}, f, and e (therefore, the maximum value of e can be as large as \approx d^{2}/2 in our setting). Moreover, when these assumptions are satisfied, they require 2d interventions, whereas we show that d interventions are sufficient as well as necessary.

[2]

This work also considers the setup of interventional causal representation learning. However, their setting is different in various ways and we now outline the key differences.

  1.

    Their main results assume that the mixing ff is an injective polynomial of finite degree pp (see Assumption 2 in their paper). The injectivity requires their coefficient matrix to be full rank, which in turn implies dr=0p(r+d1d1)=(p+dd)d^{\prime}\geq\sum_{r=0}^{p}\binom{r+d-1}{d-1}=\binom{p+d}{d} (see the discussion below Assumption 2 in their work). This means that we cannot set the degree of the decoder polynomial to be arbitrarily large for fixed output dimension dd^{\prime} (which is required for universal approximation). In contrast, we only require ddd^{\prime}\geq d. For a general discussion of the relation between approximability and identifiability we refer to Appendix E.

  2.

    All their results for general nonlinear mixing functions rely on deterministic do-interventions which is a type of hard intervention that assigns a constant value to the target. However, we focus on randomized soft interventions (and also allow shift interventions which have found applications [73, 58]), which as also pointed out by [84, 87] is less restricted. Moreover, they require multiple interventions for each latent variable (as they use an ϵ\epsilon-net argument) to show approximate identifiability, while the structural assumptions on the causal model allow us to show full identifiability with just one intervention for each latent variable. Thus, their setting is not directly comparable to ours and neither is strictly more general than the other. In particular, their proof techniques and experimental methodology don’t translate to our setting, and we use completely different ideas for our proofs and experiments.

Appendix D Counterexamples and Discussions

In this section, we first present several counterexamples to relaxed versions of our main results in Section D.1 and then provide the missing proofs in Section D.2.

D.1 Main counterexamples

In this section, we will discuss the optimality and limitations of various assumptions of our main results. In particular, we consider the number of interventions, the distribution of the latent variables, and the types of interventions separately.

Number of interventions

Our main results all require dd interventions, i.e., one intervention per node. This is necessary even in the simpler setting of linear ff as was shown in [84, Proposition 4], directly implying it is also necessary for the more general class of non-linear ff that we handle here. Therefore, the number of interventions in our main theorems 1, 2 is both necessary and sufficient. Going a step further, it is natural to ask whether Theorem 3 remains true for less than dd interventions. However, as we show next, d2d-2 interventions are not sufficient.

Fact 1.

Suppose we are given distributions X(i)X^{(i)} generated by ((B(i),η(i),ti)iI¯,f)((B^{(i)},\eta^{(i)},t_{i})_{i\in\bar{I}},f) satisfying Assumption 1-3. If the number of interventions satisfies |I|d2|I|\leq d-2 it is not possible to identify ff up to linear maps. Consider, e.g., d=2d=2 and Z=N(0,Id)Z=N(0,\mathrm{Id}) and no interventions. Then h(Z)=𝒟Zh(Z)\overset{\mathcal{D}}{=}Z for (nonlinear) radius dependent rotations as defined in [29, 9].

Let us next consider the case with a total of dd environments which corresponds to d1d-1 interventional distributions plus one observational distribution. We can provide a non-identifiability result for a general class of heterogeneous latent variable models that satisfy an algebraic property.

Lemma 7.

Assume that d=2, Z^{(0)}\sim N(0,\Sigma^{(0)}), Z^{(1)}\sim N(0,\Sigma^{(1)}), and \Sigma^{(1)}\succ\Sigma^{(0)} or \Sigma^{(1)}\prec\Sigma^{(0)} in the Loewner order (i.e., the difference is positive definite or negative definite, respectively). Then there is a non-linear map h such that h(Z^{(i)})\overset{\mathcal{D}}{=}Z^{(i)} for i=0,1.

We provide a proof of this lemma in Appendix D.2. Note that this non-identifiability lemma requires the assumption on Σ(i)\Sigma^{(i)}. In Lemma 3 in Appendix A we have seen that for the interventions considered in this work the relation Σ(i)>Σ(0)\Sigma^{(i)}>\Sigma^{(0)} or Σ(i)<Σ(0)\Sigma^{(i)}<\Sigma^{(0)} generally does not hold. This is some weak indication that for Theorem 3, d1d-1 interventions might be sufficient. We leave this for future work.

As an additional note, in the special case when ff is a permutation matrix, we are in the setting of traditional causal discovery and here, d1d-1 interventions are necessary and sufficient [10, 18, 84].

Type of intervention

Even when we restrict ourselves to linear mixing functions as considered in [84], we need perfect interventions in Theorem 1; otherwise, only the weaker identifiability guarantees in Theorem 2 can be obtained. This is shown in [84] for linear f. Therefore, in the more general setting of nonlinear f, we cannot hope to obtain Theorem 1 with imperfect interventions, showing that this assumption is needed.

Next, we clarify that for the identifiability result in Theorem 2, the condition that the interventions are not pure noise interventions is also necessary.

Fact 2.

Recall that we defined B(i)=(D(i))12(IdA(i))B^{(i)}=(D^{(i)})^{-\frac{1}{2}}(\mathrm{Id}-A^{(i)}). Suppose X(i)X^{(i)} is generated by ((B(i),η(i),ti)iI¯,f)((B^{(i)},\eta^{(i)},t_{i})_{i\in\bar{I}},f) satisfying Assumptions 1-4 where all interventions are pure noise interventions. By definition, this implies A(i)=AA^{(i)}=A for all ii. Consider any matrix A~\widetilde{A} such that the corresponding DAG is acyclic. Then ((B~(i),η(i),ti)iI¯,f~)((\widetilde{B}^{(i)},\eta^{(i)},t_{i})_{i\in\bar{I}},\widetilde{f}) with B~(i)=(D(i))12(IdA~)\widetilde{B}^{(i)}=(D^{(i)})^{-\frac{1}{2}}(\mathrm{Id}-\widetilde{A}) for all ii and

f~=f(B1B~)=f(IdA)1(D(0))12(D(0))12(IdA~)=f(IdA)1(IdA~)\displaystyle\begin{split}\widetilde{f}&=f\circ(B^{-1}\widetilde{B})=f\circ(\mathrm{Id}-A)^{-1}(D^{(0)})^{-\frac{1}{2}}(D^{(0)})^{\frac{1}{2}}(\mathrm{Id}-\widetilde{A})=f\circ(\mathrm{Id}-{A})^{-1}(\mathrm{Id}-\widetilde{A})\end{split} (91)

generates the same distributions X(i)X^{(i)}. Indeed,

f~(Z~(i))=f(B1B~)(B~(i))1(ϵ+η(i)eti)=f((IdA)1(IdA~)(IdA~)1(D(i))12(ϵ+η(i)eti))=f((IdA)1(D(i))12(ϵ+η(i)eti))=f((B(i))1(ϵ+η(i)eti))=f(Z(i)).\displaystyle\begin{split}\widetilde{f}(\widetilde{Z}^{(i)})&=f\circ(B^{-1}\widetilde{B})(\widetilde{B}^{(i)})^{-1}(\epsilon+\eta^{(i)}e_{t_{i}})\\ &=f((\mathrm{Id}-A)^{-1}(\mathrm{Id}-\widetilde{A})(\mathrm{Id}-\widetilde{A})^{-1}(D^{(i)})^{\frac{1}{2}}(\epsilon+\eta^{(i)}e_{t_{i}}))\\ &=f((\mathrm{Id}-A)^{-1}(D^{(i)})^{\frac{1}{2}}(\epsilon+\eta^{(i)}e_{t_{i}}))\\ &=f((B^{(i)})^{-1}(\epsilon+\eta^{(i)}e_{t_{i}}))\\ &=f(Z^{(i)}).\end{split} (92)

This implies that any underlying causal graph could generate the observations. In other words, for pure noise interventions we can identify the scale of the noise variables, but we obtain no information about the causal variables and structure (precisely because the interventions do not target the actual causal variables but only the noise variables).
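This can also be verified numerically: with the mixing adjusted as in (91), an arbitrary alternative DAG reproduces the covariance (and, by the same algebra, the mean) of every pure noise interventional distribution. A minimal sketch with illustrative matrices of our own choosing:

import numpy as np

d = 3
Id = np.eye(d)
A = np.zeros((d, d)); A[1, 0] = 0.7; A[2, 1] = -0.4       # true graph 1 -> 2 -> 3
A_alt = np.zeros((d, d)); A_alt[2, 0] = 1.3               # a completely different DAG
D_envs = [np.diag([1.0, 1.5, 2.0]), np.diag([3.0, 1.5, 2.0]),
          np.diag([1.0, 4.0, 2.0]), np.diag([1.0, 1.5, 5.0])]  # observational + pure noise interventions

M = np.linalg.inv(Id - A) @ (Id - A_alt)                  # the linear map absorbed into f, cf. (91)
for D in D_envs:
    cov = np.linalg.inv(Id - A) @ D @ np.linalg.inv(Id - A).T            # Cov(Z^(i)) under (A, D)
    cov_alt = np.linalg.inv(Id - A_alt) @ D @ np.linalg.inv(Id - A_alt).T
    assert np.allclose(M @ cov_alt @ M.T, cov)            # identical observed covariances
print("every pure noise intervention is explained by both graphs")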

Finally, we remark that in contrast to the case with linear mixing functions, we need to exclude non-stochastic hard interventions for linear identifiability.

Lemma 8.

Consider d=2d=2 and Z(0)=N(0,Id)Z^{(0)}=N(0,\mathrm{Id}) and the non-stochastic hard interventions do(Z1=0)\mathrm{do}(Z_{1}=0) and do(Z2=0)\mathrm{do}(Z_{2}=0) resulting in the degenerate Gaussian distributions Z(1)=N(0,e2e2)Z^{(1)}=N(0,e_{2}\otimes e_{2}) and Z(2)=N(0,e1e1)Z^{(2)}=N(0,e_{1}\otimes e_{1}). Then there is a nonlinear hh such that h(Z(i))=𝒟Z(i)h(Z^{(i)})\overset{\mathcal{D}}{=}Z^{(i)} for all ii.

Intuitively, to show this, we choose h to be the identity close to the supports \{Z_{1}=0\} and \{Z_{2}=0\} but some nonlinear measure-preserving deformation in the four quadrants. The full argument can be found in Appendix D.2. A similar counterexample was discussed in [2]. However, here we additionally constrain the distribution of the latent variables, which adds a bit of complexity.

Distribution of the Noise Variables

Our key additional assumption compared to the settings considered in [84, 2, 88] is that the latent variables are Gaussian, which allows us to put only very mild assumptions on the mixing function f. We do this in a manner that still preserves universal representation guarantees.

[88] consider arbitrary noise distributions but restrict to linear mixing functions. We believe that our result can be generalized to more general distributions of the noise variables, e.g., exponential families, but those generalizations will require different proof strategies, which we leave for future work. However, the result will not be true for arbitrary noise distributions, as the following lemma shows.

Lemma 9.

Consider Z(0)=(ϵ1,ϵ2)Z^{(0)}=(\epsilon_{1},\epsilon_{2})^{\top}, Z(1)=(ϵ1,ϵ2)Z^{(1)}=(\epsilon_{1}^{\prime},\epsilon_{2}), and Z(2)=(ϵ1,ϵ2)Z^{(2)}=(\epsilon_{1},\epsilon_{2}^{\prime}) where ϵ1,ϵ2𝒰([1,1])\epsilon_{1},\epsilon_{2}\sim\mathcal{U}([-1,1]) and ϵ1,ϵ2𝒰([3,3])\epsilon_{1}^{\prime},\epsilon_{2}^{\prime}\sim\mathcal{U}([-3,3]). Also define Z~(0)=(ϵ1,ϵ1+ϵ2)\widetilde{Z}^{(0)}=(\epsilon_{1},\epsilon_{1}+\epsilon_{2}), Z~(1)=(ϵ1,ϵ1+ϵ2)\widetilde{Z}^{(1)}=(\epsilon_{1}^{\prime},\epsilon_{1}^{\prime}+\epsilon_{2}), and Z~(2)=(ϵ1,ϵ2)\widetilde{Z}^{(2)}=(\epsilon_{1},\epsilon_{2}^{\prime}). Then there is hh such that

h(Z(i))=𝒟Z~(i).\displaystyle h(Z^{(i)})\overset{\mathcal{D}}{=}\widetilde{Z}^{(i)}. (93)

A proof of this lemma can be found in Appendix D.2. The result shows that even with dd perfect interventions and linear SCMs it is not possible to identify the underlying DAG (empty graph for ZZ and Z~1Z~2\widetilde{Z}_{1}\to\widetilde{Z}_{2}) and we also cannot identify ff up to linear maps. While our example relies on distributions with bounded support, we conjecture that full support is not sufficient for identifiability.

Structural Assumptions

A final assumption used in our results is the restriction to linear SCMs. This assumption is made in most works on causal representation learning (see Section 2), even when they restrict to linear mixing functions. Although this may be a nontrivial restriction, the advantage is that the simplicity of the latent space enables us to meaningfully probe the latent variables, intervene on them, and reason about them. Nevertheless, it is an interesting direction to study to what extent this can be relaxed. Indeed, even for the arguably simpler problem of causal structure learning (i.e., when Z^{(i)} is observed), there are many open questions when moving beyond additive noise models. We leave this for future work.

D.2 Technical constructions

In this section, we provide the proofs for the counterexamples in Section D.1. For the first two counterexamples we construct functions hh such that h(Z(i))=𝒟Z(i)h(Z^{(i)})\overset{\mathcal{D}}{=}Z^{(i)} by setting h=Φth=\Phi_{t} where Φt\Phi_{t} is defined as the flow of a suitable vectorfield XX, i.e.,

tΦt(z)=X(Φt(z)).\displaystyle\partial_{t}\Phi_{t}(z)=X(\Phi_{t}(z)). (94)

This is a common technique in differential geometry and was introduced in [9, 8] to construct counterexamples to the identifiability of ICA with volume-preserving functions. To find a suitable vectorfield X we rely on the fact that the density p_{t} of \mathbb{P}_{t}=(\Phi_{t})_{\ast}\mathbb{P} satisfies the continuity equation

tpt(z)+Div(pt(z)X(z))=0.\displaystyle\partial_{t}p_{t}(z)+\operatorname{Div}(p_{t}(z)X(z))=0. (95)

In particular t=\mathbb{P}_{t}=\mathbb{P} holds iff

Div(pt(z)X(z))=0.\displaystyle\operatorname{Div}(p_{t}(z)X(z))=0. (96)

Therefore it is sufficient to find a vectorfield XX such that Div(p(i)X)=0\operatorname{Div}(p^{(i)}X)=0 for all environments ii to construct a suitable h=Φ1h=\Phi_{1}.
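As a small numerical illustration of this technique in the simplest case of a single standard Gaussian in d=2 (which recovers the radius-dependent rotations mentioned in Fact 1 below), the following sketch integrates the flow of X(z)=f(|z|)X_{0}(z)/p_{0}(z) with X_{0}(z)=(-z_{2},z_{1}) and a compactly supported radial bump f, and checks empirically that the first two moments of N(0,\mathrm{Id}) are preserved by \Phi_{1} even though the map is nonlinear; the concrete choice of f and the integrator are ours.

import numpy as np

rng = np.random.default_rng(0)

def p0(z):                                    # standard Gaussian density in 2d
    return np.exp(-0.5 * (z ** 2).sum(-1)) / (2 * np.pi)

def f(r):                                     # smooth radial bump supported on 1 < r < 2
    out = np.zeros_like(r)
    m = (r > 1.0) & (r < 2.0)
    out[m] = np.exp(-1.0 / ((r[m] - 1.0) * (2.0 - r[m])))
    return out

def X(z):                                     # vectorfield with Div(p0 X) = 0
    r = np.linalg.norm(z, axis=-1)
    rot = np.stack([-z[..., 1], z[..., 0]], axis=-1)
    return (f(r) / p0(z))[..., None] * rot

z = rng.standard_normal((100_000, 2))
z0, dt = z.copy(), 0.02
for _ in range(50):                           # RK4 integration of the flow up to t = 1
    k1 = X(z); k2 = X(z + 0.5 * dt * k1)
    k3 = X(z + 0.5 * dt * k2); k4 = X(z + dt * k3)
    z = z + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

# h = Phi_1 is nonlinear (it only moves points with 1 < |z| < 2), yet the
# pushforward still looks standard Gaussian: mean ~ 0 and covariance ~ Id.
print("max displacement:", np.abs(z - z0).max())
print("mean after flow:", z.mean(0))
print("cov after flow:\n", np.cov(z.T))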

See 7

Proof of Lemma 7.

Let us first discuss the high-level idea of the proof. The general strategy is to construct a vectorfield XX whose flow preserves Z(i)Z^{(i)} for i=0,1i=0,1. This holds if and only if the densities p0p_{0}, p1p_{1} of Z(0)Z^{(0)}, Z(1)Z^{(1)} satisfy

Div(p0X)=Div(p1X)=0\displaystyle\operatorname{Div}(p_{0}X)=\operatorname{Div}(p_{1}X)=0 (97)

Using Div(fX)=fX+fDivX\operatorname{Div}(fX)=\nabla f\cdot X+f\operatorname{Div}X we find

X\cdot\nabla(\ln p_{0}(z)-\ln p_{1}(z))=\frac{X\cdot\nabla p_{0}(z)}{p_{0}(z)}-\frac{X\cdot\nabla p_{1}(z)}{p_{1}(z)}=-\frac{p_{0}(z)\operatorname{Div}X}{p_{0}(z)}+\frac{p_{1}(z)\operatorname{Div}X}{p_{1}(z)}=0. (98)

We conclude that any X satisfying (97) must be orthogonal to \nabla(\ln p_{0}(z)-\ln p_{1}(z)) and thus parallel to the level lines of \ln p_{0}(z)-\ln p_{1}(z). This already fixes the direction of X, and the magnitude can be inferred from (97). To simplify the calculations, we first linearly transform the data so that the directions of X, i.e., the level lines of \ln p_{0}(z)-\ln p_{1}(z), have a simple form. We assume without loss of generality that \Sigma_{1}\prec\Sigma_{0}. Clearly, it is sufficient to show the result for any linear transformation of the Gaussian distributions Z^{(i)}; note that generally for G\sim N(\mu,\Sigma) we have AG\sim N(A\mu,A\Sigma A^{\top}). Applying \Sigma_{0}^{-\frac{1}{2}} we can reduce the problem to \Sigma_{1}^{\prime}=\Sigma_{0}^{-\frac{1}{2}}\Sigma_{1}\Sigma_{0}^{-\frac{1}{2}}\prec\mathrm{Id}. Diagonalizing \Sigma_{1}^{\prime} by U\in\mathrm{SO}(d), it is sufficient to consider \Lambda=U\Sigma_{1}^{\prime}U^{\top}\prec\mathrm{Id} where \Lambda=\mathrm{Diag}(\lambda_{1},\lambda_{2}). Finally, we can rescale z_{1} and z_{2} such that the resulting covariance matrices \Lambda^{0}=\mathrm{Diag}(\lambda^{0}_{1},\lambda^{0}_{2}), \Lambda^{1}=\mathrm{Diag}(\lambda^{1}_{1},\lambda^{1}_{2}) satisfy

\frac{1}{\lambda^{1}_{1}}-\frac{1}{\lambda^{0}_{1}}=\frac{1}{\lambda^{1}_{2}}-\frac{1}{\lambda^{0}_{2}}=2. (99)

Thus, it is sufficient to show the claim for such covariances Λ0\Lambda^{0}, Λ1\Lambda^{1} and the result for arbitrary covariances follows by applying suitable linear transformation.

For such covariances we find for some constant cc

ln(p0(z))ln(p1(z))=z122λ10z222λ20+z122λ11+z222λ21+c=z12+z22+c.\displaystyle\ln(p_{0}(z))-\ln(p_{1}(z))=-\frac{z_{1}^{2}}{2\lambda^{0}_{1}}-\frac{z_{2}^{2}}{2\lambda^{0}_{2}}+\frac{z_{1}^{2}}{2\lambda^{1}_{1}}+\frac{z_{2}^{2}}{2\lambda^{1}_{2}}+c=z_{1}^{2}+z_{2}^{2}+c. (100)

The level lines are circles. Thus, we consider the vector field

X0(z)=(z2z1).\displaystyle X_{0}(z)=\begin{pmatrix}-z_{2}\\ z_{1}\end{pmatrix}. (101)

We see directly that DivX0=0\operatorname{Div}X_{0}=0. We consider the Ansatz X(z)=f(|z|)X0(z)/p0(z)X(z)=f(|z|)X_{0}(z)/p_{0}(z). We now fix the radial function f:+f:\mathbb{R}_{+}\to\mathbb{R} such that it has compact support away from 0. We find using DivX0=0\operatorname{Div}X_{0}=0

\operatorname{Div}(X(z)p_{0}(z))=\operatorname{Div}(f(|z|)X_{0}(z)) (102)
=X_{0}(z)\cdot\nabla f(|z|) (103)
=f^{\prime}(|z|)\,X_{0}(z)\cdot\frac{z}{|z|} (104)
=0. (105)

Note that close to 0, where |z| is not differentiable, the vector field vanishes by our construction of f. Similarly, using (100) we find

Div(X(z)p1(z))\displaystyle\operatorname{Div}(X(z)p_{1}(z)) =X0(z)(f(|z|)p1(z)p0(z))\displaystyle=X_{0}(z)\cdot\nabla\left(f(|z|)\frac{p_{1}(z)}{p_{0}(z)}\right) (106)
=X0(z)(f(|z|)e|z|2c)\displaystyle=X_{0}(z)\cdot\nabla\left(f(|z|)e^{-|z|^{2}-c}\right) (107)
=X0(z)f~(|z|)\displaystyle=X_{0}(z)\cdot\nabla\tilde{f}(|z|) (108)
=0.\displaystyle=0. (109)

Finally we remark that the flow Φt\Phi_{t} of the vectorfield XX generates a family of functions hh as desired, i.e., for all tt we have (Φt)Z(i)=Z(i)(\Phi_{t})_{\ast}Z^{(i)}=Z^{(i)} and Φt\Phi_{t} is clearly nonlinear as it is the identity close to 0 but not globally. ∎

The proof of the next lemma is similar but simpler.

See 8

Proof of Lemma 8.

Pick any smooth divergence free vectorfield X0X_{0} with compact support in [ϵ,)2[\epsilon,\infty)^{2} for some ϵ>0\epsilon>0. Then the vector field X=X0/p0X=X_{0}/p_{0} satisfies

\operatorname{Div}(Xp_{0})=\operatorname{Div}X_{0}=0. (110)

Thus the flow \Phi_{t} preserves the distribution of Z^{(0)}. But it also preserves the interventional distributions Z^{(i)} for i=1,2 because X(z)=0, and thus \Phi_{t}(z)=z, for z close to the supports of Z^{(i)}. ∎

Finally we prove Lemma 9 that shows that we cannot completely drop the assumption on the distribution of the noise variables ϵ\epsilon.

Refer to caption
Figure 4: Sketch of the setting in Lemma 9. The map hh agrees with (z1,z2)(z1,z1+z2)(z_{1},z_{2})\to(z_{1},z_{1}+z_{2}) on the red rectangle and maps RiR_{i} to QiQ_{i} such that the uniform measure is preserved.

See 9

Proof of Lemma 9.

A sketch of the setting can be found in Figure 4. We define h_{0}(z)=(z_{1},z_{1}+z_{2}). If h(z)=h_{0}(z) for -1\leq z_{2}\leq 1 we find h(Z^{(i)})\overset{\mathcal{D}}{=}\widetilde{Z}^{(i)} for i=0 and i=1. It remains to modify h_{0} such that h preserves the uniform measure on [-1,1]\times[-3,3]. We do this in two steps: first we construct a modification \widetilde{h} of the map h_{0} such that [-1,1]\times[-3,3] is mapped bijectively to itself but \widetilde{h}=h_{0} for |z_{2}|\leq 1 (so that \widetilde{h}(Z^{(i)})\overset{\mathcal{D}}{=}\widetilde{Z}^{(i)} for i=0 and i=1 holds), and then we apply a general result to make the map measure preserving. We start with the first step. Let \psi:\mathbb{R}\to[0,1] be differentiable such that \psi(t)=1 for -1\leq t\leq 1 and \psi(t)=0 for |t|\geq 5/2, with |\psi^{\prime}(t)|<1. Then

\tilde{h}(z)=\psi(z_{2})h_{0}(z)+(1-\psi(z_{2}))z=\begin{pmatrix}z_{1}\\ z_{2}+\psi(z_{2})z_{1}\end{pmatrix} (111)

is injective on [1,1]×[3,3][-1,1]\times[-3,3] (note that the second coordinate is increasing in z2z_{2}) and maps [1,1]×[3,3][-1,1]\times[-3,3] bijectively to itself and agrees with h0h_{0} for |z2|1|z_{2}|\leq 1. However, it is not necessarily volume preserving. But applying Moser’s theorem (see [56]) to the image of the rectangles R1=[1,1]×[3,1]R_{1}=[-1,1]\times[-3,-1] and R2=[1,1]×[1,3]R_{2}=[-1,1]\times[1,3] which are the quadrilaterals Q1=h~(R1)Q_{1}=\tilde{h}(R_{1}) with endpoints (1,3)(-1,-3), (1,3)(1,-3), (1,0)(1,0), (1,2)(-1,-2) and Q2=h~(R2)Q_{2}=\tilde{h}(R_{2}) with endpoints (1,3)(1,3), (1,3)(-1,3), (1,0)(-1,0), (1,2)(1,2) with the same size as R1R_{1} and R2R_{2} we infer that there is φ\varphi supported in Q1Q2Q_{1}\cup Q_{2} such that h=φh~h=\varphi\circ\tilde{h} satisfies h(Z(2))=𝒟Z~(2)=𝒟𝒰([1,1]×[3,3])h(Z^{(2)})\overset{\mathcal{D}}{=}\widetilde{Z}^{(2)}\overset{\mathcal{D}}{=}\mathcal{U}([-1,1]\times[-3,3]). ∎

Appendix E Identifiability and Approximation

In this section, we explain that approximation properties of function classes cannot be leveraged to obtain stronger identifiability results. The general setup is that we have two function classes 𝒢\mathcal{G}\subset\mathcal{F} and we assume that 𝒢\mathcal{G} is dense in \mathcal{F} (in the topology of uniform convergence on Ω\Omega), i.e., for ff\in\mathcal{F} and any ϵ>0\epsilon>0 there is g𝒢g\in\mathcal{G} such that

supzΩ|f(z)g(z)|ϵ.\displaystyle\sup_{z\in\Omega}|f(z)-g(z)|\leq\epsilon. (112)

We now investigate the relationship of identifiability results for mixings in either 𝒢\mathcal{G} or in \mathcal{F}.

While this point is completely general, we will discuss this for concreteness in the context of nonlinear ICA and considering polynomial mixing functions for 𝒢\mathcal{G}. We also assume that \mathcal{F} are all diffeomorphisms (onto their image). Suppose we have latents Z𝒰([1,1]d)Z\sim\mathcal{U}([-1,1]^{d}) and observe a mixture X=h(Z)X=h(Z). We use the shorthand Ω=[1,1]d\Omega=[-1,1]^{d} from now on. Let us suppose that h𝒢h\in\mathcal{G}. Then the following result holds.

Lemma 10.

Suppose h~(Z)=𝒟X=𝒟h(Z)\widetilde{h}(Z)\overset{\mathcal{D}}{=}X\overset{\mathcal{D}}{=}h(Z) where Z𝒰(Ω)Z\sim\mathcal{U}(\Omega), h,h~𝒢h,\widetilde{h}\in\mathcal{G} and hh satisfies an injectivity condition as defined in Assumption 2 in [2]. Then hh and h~\widetilde{h} agree up to permutation of the variables.

Proof.

This is a consequence of Theorem 4 in [2]. Essentially, they show in Theorem 1 that hh and h~\tilde{h} agree up to a linear map and then use independence of supports to conclude (in our simpler setting here, one could also resort to the identifiability of linear ICA). ∎

The injectivity condition is a technical condition that ensures that the embedding hh is sufficiently diverse, but this is not essential for the discussion here. Let us nevertheless mention here that it is not clear whether this condition is sufficiently loose to ensure identifiability in a dense subspace of \mathcal{F} because it requires an output dimension depending on the degree of the polynomials.

We now investigate the implications for \mathcal{F}. It is well known that ICA with general nonlinear mixing functions is not identifiable, i.e., for any f\in\mathcal{F} there is an \widetilde{f}\in\mathcal{F} such that f(Z)\overset{\mathcal{D}}{=}\widetilde{f}(Z). To find such an \widetilde{f} one can use, e.g., the Darmois construction, radius-dependent rotations, or, more generally, measure-preserving transformations m such that m(Z)\overset{\mathcal{D}}{=}Z. Then we have the following trivial lemma.

Lemma 11.

For every ϵ>0\epsilon>0 there are functions g,g~𝒢g,\widetilde{g}\in\mathcal{G} such that

supΩ|gf|<ϵ,supΩ|g~f~|<ϵ\displaystyle\sup_{\Omega}|g-f|<\epsilon,\quad\sup_{\Omega}|\widetilde{g}-\widetilde{f}|<\epsilon (113)

and then the Wasserstein distance satisfies W2(f(Z),g(Z))<ϵW_{2}(f(Z),g(Z))<\epsilon, W2(f~(Z),g~(Z))<ϵW_{2}(\widetilde{f}(Z),\widetilde{g}(Z))<\epsilon.

Remark 6.

For the definition and a discussion of the Wasserstein metric we refer to [90]. Alternatively we could add a small amount of noise, i.e. X=f(Z)+νX=f(Z)+\nu and then consider the total variation distance.

Proof.

The first part follows directly from the Stone-Weierstrass theorem, which shows that polynomials are dense in the continuous functions on compact domains. The second part follows from the coupling Z\to(f(Z),g(Z)) and (113). ∎

The main message of this Lemma is that whenever there are spurious solutions in the larger function class \mathcal{F} we can equally well approximate the ground truth mixing ff and the spurious solution f~\widetilde{f} by functions in 𝒢\mathcal{G}. In this sense, the identifiability of ICA in 𝒢\mathcal{G} does not provide any guidance to resolve the ambiguity of ICA in \mathcal{F}. So identifiability results in 𝒢\mathcal{G} are not sufficient when there is no reason to believe that the ground truth mixing is exactly in 𝒢\mathcal{G}.

Actually, one could also view the approximation capacity of \mathcal{G} as a slight warning sign, as we explain now. Suppose the ground truth mixing satisfies g\in\mathcal{G}. Since ICA in \mathcal{F} is not identifiable, we can find \widetilde{f}\in\mathcal{F} which can be arbitrarily different from g but such that g(Z)\overset{\mathcal{D}}{=}\widetilde{f}(Z). Then we can approximate \widetilde{f} by \widetilde{g}\in\mathcal{G} with arbitrarily small error. Thus, we find almost spurious solutions, i.e., \widetilde{g}\in\mathcal{G} such that W_{2}(\widetilde{g}(Z),g(Z))<\epsilon for an arbitrary \epsilon>0 although \widetilde{g} and g correspond to very different data representations; in this sense the identifiability result is not robust.

When only considering the smaller space 𝒢\mathcal{G} this problem can be resolved by considering norms or seminorms on 𝒢\mathcal{G}, for polynomials a natural choice is the degree of the polynomial. Indeed, suppose that we observe data generated using a mixing g𝒢g\in\mathcal{G}. Then there might be spurious solutions f~\widetilde{f}\in\mathcal{F} which can be approximated by g~𝒢\widetilde{g}\in\mathcal{G} but this approximation will generically have a larger norm (or degree for polynomials) than gg. Therefore, the minimum description length principle [69] favors gg over g~\widetilde{g}.

This reasoning can only be extended to the larger class \mathcal{F} if the ground truth mixing f (for the relevant applications) can be better approximated (i.e., with smaller norm) by functions in \mathcal{G} than all spurious solutions \widetilde{f}. This is hard to verify in practice and difficult to formalize theoretically. Nevertheless, it motivates looking for identifiability results for function classes that are known to be useful representation learners and are used in practice, in particular neural networks. We emphasize that, alternatively, one can also directly consider norms on the larger space \mathcal{F}, penalizing, e.g., derivative norms, to perform model selection.

Appendix F Additional details on experimental methodology

In this appendix, we give more details about our experimental methodology. First, we derive the log-odds and prove Lemma 1 in Section F.1. Then, we use it to design our model and theoretically justify our contrastive approach in Section F.2, by connecting it to the Bayes optimal classifier. Then, we describe the NOTEARS regularizer in Section F.3, describe the limitations of the contrastive approach in Section G.1 and finally, in Section F.4, we describe the ingredients needed for an approach via Variational Autoencoders, the difficulties involved and how to potentially bypass them.

F.1 Proof of Lemma 1

See 1

Proof.

We will write the log likelihood of X(i)=f(Z(i))X^{(i)}=f(Z^{(i)}) using standard change of variables. As we elaborate in Section A.1, we have

Z(i)=(B(i))1ϵ+μ(i) with B(i)=(D(i))1/2(IA(i)),μ(i)=η(i)(B(i))1eti\displaystyle Z^{(i)}=(B^{(i)})^{-1}\epsilon+\mu^{(i)}\text{ with }B^{(i)}=(D^{(i)})^{-1/2}(I-A^{(i)}),\mu^{(i)}=\eta^{(i)}(B^{(i)})^{-1}e_{t_{i}} (114)

Also, denote by Θ(i)=(B(i))B(i)\Theta^{(i)}=(B^{(i)})^{\top}B^{(i)} the precision matrix of Z(i)Z^{(i)} (see Section A.1 for full derivation). By change of variables,

lnpX(i)(x)\displaystyle\ln p_{X}^{(i)}(x) =ln|Jf1|+lnpZ(i)(f1(x))\displaystyle=\ln|J_{f^{-1}}|+\ln p_{Z}^{(i)}(f^{-1}(x))
\displaystyle=\ln|J_{f^{-1}}|-\frac{d}{2}\ln(2\pi)-\frac{1}{2}\ln|\Sigma^{(i)}|-\frac{1}{2}(f^{-1}(x)-\mu^{(i)})^{T}\Theta^{(i)}(f^{-1}(x)-\mu^{(i)})

where Jf1J_{f^{-1}} denotes the Jacobian matrix of partial derivatives. Let s(1),s(2),,s(d)s^{(1)},s^{(2)},\ldots,s^{(d)} denote the rows of B(0)B^{(0)}. We then compute the log-odds with respect to X(0)X^{(0)}, the base distribution without interventions (and where μ(0)=0\mu^{(0)}=0) to get

lnpX(i)(x)lnpX(0)(x)\displaystyle\ln p_{X}^{(i)}(x)-\ln p_{X}^{(0)}(x)
=(12ln|Σ(i)||Σ(0)|12(μ(i))Θ(i)μ(i))12(f1(x))(Θ(i)Θ(0))(f1(x))+(f1(x))Θ(i)μ(i)\displaystyle=\left(-\frac{1}{2}\ln\frac{|\Sigma^{(i)}|}{|\Sigma^{(0)}|}-\frac{1}{2}(\mu^{(i)})^{\top}\Theta^{(i)}\mu^{(i)}\right)-\frac{1}{2}(f^{-1}(x))^{\top}(\Theta^{(i)}-\Theta^{(0)})(f^{-1}(x))+(f^{-1}(x))^{\top}\Theta^{(i)}\mu^{(i)}
=ci12(f1(x))(r(i)r(i)s(i)s(i))(f1(x))+η(i)(f1(x))(B(i))eti\displaystyle=c_{i}-\frac{1}{2}(f^{-1}(x))^{\top}(r^{({i})}\otimes r^{({i})}-s^{({i})}\otimes s^{({i})})(f^{-1}(x))+\eta^{(i)}\cdot(f^{-1}(x))^{\top}(B^{(i)})^{\top}e_{t_{i}}
=ci12(f1(x))(λi2etieti(s(i))(s(i)))(f1(x))+η(i)λi(f1(x))eti\displaystyle=c_{i}-\frac{1}{2}(f^{-1}(x))^{\top}(\lambda_{i}^{2}e_{t_{i}}e_{t_{i}}^{\top}-(s^{({i})})(s^{({i})})^{\top})(f^{-1}(x))+\eta^{(i)}\lambda_{i}\cdot(f^{-1}(x))^{\top}e_{t_{i}}
=ci12(f1(x))(λi2etieti(s(i))(s(i)))(f1(x))+η(i)λi(f1(x))ti\displaystyle=c_{i}-\frac{1}{2}(f^{-1}(x))^{\top}(\lambda_{i}^{2}e_{t_{i}}e_{t_{i}}^{\top}-(s^{({i})})(s^{({i})})^{\top})(f^{-1}(x))+\eta^{(i)}\lambda_{i}\cdot(f^{-1}(x))_{t_{i}}
\displaystyle=c_{i}-\frac{1}{2}\lambda_{i}^{2}((f^{-1}(x))_{t_{i}})^{2}+\eta^{(i)}\lambda_{i}\cdot(f^{-1}(x))_{t_{i}}+\frac{1}{2}\langle f^{-1}(x),s^{({i})}\rangle^{2}

for a constant c_{i} independent of x. For the second equality, we used Lemma 3. ∎
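As a quick sanity check of this formula, the following sketch (with identity mixing and a two-node SCM whose parameters are our own illustrative choices) compares the log-odds computed directly from the Gaussian densities with the quadratic expression above; their difference is the constant c_{i}.

import numpy as np

rng = np.random.default_rng(1)

# two-node SCM Z1 -> Z2 with identity mixing f, perfect shift intervention on node 2
a, d1, d2 = 0.8, 1.0, 2.0
A0 = np.array([[0.0, 0.0], [a, 0.0]]); D0 = np.diag([d1, d2])
t, d_new, eta = 1, 3.0, 1.5
A1 = A0.copy(); A1[t, :] = 0.0; D1 = np.diag([d1, d_new])

def Bmat(A, D):
    return np.diag(np.diag(D) ** -0.5) @ (np.eye(2) - A)

B0, B1 = Bmat(A0, D0), Bmat(A1, D1)
Sig0, Sig1 = np.linalg.inv(B0.T @ B0), np.linalg.inv(B1.T @ B1)
mu1 = eta * np.linalg.solve(B1, np.eye(2)[:, t])

def log_gauss(x, mu, Sig):
    diff = x - mu
    return -np.log(2 * np.pi) - 0.5 * np.linalg.slogdet(Sig)[1] \
           - 0.5 * diff @ np.linalg.solve(Sig, diff)

lam, s = B1[t, t], B0[t, :]          # lambda_i and the t_i-th row of B^(0)
x = rng.standard_normal((5, 2))
lhs = np.array([log_gauss(xi, mu1, Sig1) - log_gauss(xi, np.zeros(2), Sig0) for xi in x])
rhs = -0.5 * lam ** 2 * x[:, t] ** 2 + eta * lam * x[:, t] + 0.5 * (x @ s) ** 2
print(lhs - rhs)                     # all entries equal: the constant c_i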

F.2 A theoretical justification for the log-odds model

In this section, we provide more details on our precise modeling choices for the contrastive approach, and show theoretically why it encourages the model to learn the true underlying parameters.

Recall that, motivated by Lemma 1, we model the log-odds as

gi(x,αi,βi,γi,w(i),θ)=αiβihti2(x,θ)+γihti(x,θ)+h(x,θ),w(i)2\displaystyle g_{i}(x,\alpha_{i},\beta_{i},\gamma_{i},w^{(i)},\theta)=\alpha_{i}-\beta_{i}h^{2}_{t_{i}}(x,\theta)+\gamma_{i}h_{t_{i}}(x,\theta)+\langle h(x,\theta),w^{(i)}\rangle^{2} (115)

where h(\cdot,\theta) denotes a neural net parametrized by \theta, the parameters w^{(i)} are the rows of a matrix W, and \alpha_{i}, \beta_{i}, \gamma_{i} are learnable parameters. Moreover, we have g_{0}(x)=0 as we compute the log-odds with respect to X^{(0)}. Therefore, the cross entropy loss is given by

CE(i)=𝔼j𝒰({0,i})𝔼xX(j)ln(e𝟏j=igi(x)egi(x)+1).\displaystyle\mathcal{L}_{\mathrm{CE}}^{(i)}=-\mathbb{E}_{j\sim\mathcal{U}(\{0,i\})}\mathbb{E}_{x\sim X^{(j)}}\ln\left(\frac{e^{\bm{1}_{j=i}g_{i}(x)}}{e^{g_{i}(x)}+1}\right). (116)

Let C(x){0,1}C(x)\in\{0,1\} denote the label of the datapoint xx indicating whether it was an observational datapoint or an interventional datapoint respectively. By using the cross entropy loss, we are treating the model outputs as logits, therefore we can write down the posterior probability distribution using a logistic regression model as

Pr[C(x)=1|x]=exp(gi(x,αi,βi,γi,w(i),θ))1+exp(gi(x,αi,βi,γi,w(i),θ))\displaystyle\text{Pr}[C(x)=1|x]=\frac{\exp(g_{i}(x,\alpha_{i},\beta_{i},\gamma_{i},w^{(i)},\theta))}{1+\exp(g_{i}(x,\alpha_{i},\beta_{i},\gamma_{i},w^{(i)},\theta))} (117)

Moreover, by using a sufficiently wide or deep neural network with universal approximation capacity and sufficiently many samples, our model will learn the Bayes optimal classifier, as given by Lemma 1. Therefore, we can equate the probabilities to arrive at

exp(gi(x,αi,βi,γi,w(i),θ))1+exp(gi(x,αi,βi,γi,w(i),θ))\displaystyle\frac{\exp(g_{i}(x,\alpha_{i},\beta_{i},\gamma_{i},w^{(i)},\theta))}{1+\exp(g_{i}(x,\alpha_{i},\beta_{i},\gamma_{i},w^{(i)},\theta))} =exp(lnpX(i)(x)lnpX(0)(x))1+exp(lnpX(i)(x)lnpX(0)(x))\displaystyle=\frac{\exp(\ln p_{X}^{(i)}(x)-\ln p_{X}^{(0)}(x))}{1+\exp(\ln p_{X}^{(i)}(x)-\ln p_{X}^{(0)}(x))} (118)

implying

αi\displaystyle\alpha_{i} βihti2(x,θ)+γihti(x,θ)+h(x,θ),w(i)2\displaystyle-\beta_{i}h^{2}_{t_{i}}(x,\theta)+\gamma_{i}h_{t_{i}}(x,\theta)+\langle h(x,\theta),w^{(i)}\rangle^{2} (119)
=ci12λi2(((f1(x))ti)2+η(i)λi(f1(x))ti)+12f1(x),s(i)2\displaystyle=c_{i}-\frac{1}{2}\lambda_{i}^{2}(((f^{-1}(x))_{t_{i}})^{2}+\eta^{(i)}\lambda_{i}\cdot(f^{-1}(x))_{t_{i}})+\frac{1}{2}\langle f^{-1}(x),s^{({i})}\rangle^{2} (120)

This suggests that in optimal settings, our model will learn (up to scaling)

αi=ci,βi=12λi2,γi=η(i)λi,w(i)=s(i),h(x,θ)=f1(x)\displaystyle\alpha_{i}=c_{i},\quad\beta_{i}=\frac{1}{2}\lambda_{i}^{2},\quad\gamma_{i}=\eta^{(i)}\lambda_{i},\quad w^{(i)}=s^{(i)},\quad h(x,\theta)=f^{-1}(x) (121)

thereby inverting the nonlinearity and learning the underlying parameters.
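For concreteness, here is a minimal PyTorch sketch of the parametric log-odds model (115) together with the cross entropy loss (116). The layer sizes are placeholders, and, since the intervention targets are unknown, the sketch simply ties environment i to the i-th coordinate of h, which is sufficient up to the relabeling allowed by our identifiability results.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ContrastiveCRL(nn.Module):
    """Log-odds heads g_i(x) of Eq. (115) on top of a shared embedding h(x, theta)."""
    def __init__(self, d_obs, d_latent, hidden=512):
        super().__init__()
        self.h = nn.Sequential(nn.Linear(d_obs, hidden), nn.LeakyReLU(0.2),
                               nn.Linear(hidden, d_latent))
        self.alpha = nn.Parameter(torch.zeros(d_latent))
        self.beta = nn.Parameter(torch.ones(d_latent))
        self.gamma = nn.Parameter(torch.zeros(d_latent))
        self.W = nn.Parameter(torch.eye(d_latent))       # rows w^(i)

    def forward(self, x, i):
        z = self.h(x)
        # environment i is tied to latent coordinate i in this sketch
        return (self.alpha[i] - self.beta[i] * z[:, i] ** 2
                + self.gamma[i] * z[:, i] + (z @ self.W[i]) ** 2)

def contrastive_loss(model, x_obs, x_int, i):
    """Binary cross entropy of Eq. (116): environment i vs. the observational data."""
    logits = torch.cat([model(x_obs, i), model(x_int, i)])
    labels = torch.cat([torch.zeros(len(x_obs)), torch.ones(len(x_int))])
    return F.binary_cross_entropy_with_logits(logits, labels)

# usage with random stand-in data for environment i = 2
model = ContrastiveCRL(d_obs=20, d_latent=5)
loss = contrastive_loss(model, torch.randn(64, 20), torch.randn(64, 20), i=2)
loss.backward()

In the idealized regime of (121), h recovers f^{-1} and the rows of W recover the rows of B^{(0)}, up to permutation and scaling.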

F.3 NOTEARS [102]

In our algorithm, using the parameters w^{(i)} as rows, we learn the matrix W, which we saw will equal B=D^{-1/2}(\mathrm{Id}-A) in optimal settings. Note that the off-diagonal entries of B encode a DAG (the support of A). Therefore, the graph encoded by W_{0}, which is defined to be W with the main diagonal zeroed out, must also be a DAG.

However, in our algorithm, we don’t explicitly enforce this acyclicity. There are two ways to get around this. One way is to assume a default causal ordering on the ZiZ_{i}s, which is feasible as we don’t directly observe ZZ. Then, we can simply enforce W0W_{0} to be triangular and train via standard gradient descent and W0W_{0} is guaranteed to be a DAG. However, this approach doesn’t work because the interventional datasets are given in arbitrary order and so we are not able to match the datasets to the vertices in the correct order. So this merely defers the issue.

Instead, the other approach that we take in this work is to regularize the learnt W0W_{0} to model a directed acyclic graph. Learning an underlying causal graph from data is a decades old problem that has been widely studied in the causal inference literature, see e.g. [12, 102, 64, 80, 4] and references therein. In particular, the work [102] proposed an analytic expression to measure the DAGness of the causal graph, thereby making the problem continuous and more efficient to optimize over.

Lemma 12 ([102]).

A matrix W\in\mathbb{R}^{d\times d} is the weighted adjacency matrix of a DAG if and only if

:=trexp(WW)d=0\displaystyle\mathcal{R}:=\operatorname{tr}\exp(W\circ W)-d=0 (122)

where \circ is the matrix Hadamard product and exp\exp is the matrix exponential. Moreover, \mathcal{R} has gradient

=exp(WW)2W\displaystyle\nabla\mathcal{R}=\exp(W\circ W)^{\top}\circ 2W (123)

Therefore, we add \mathcal{R}_{\mathrm{NOTEARS}}(W)=\operatorname{tr}\exp(W_{0}\circ W_{0})-d as a regularizer to our loss function. As in prior works, we could also consider the augmented Lagrangian formulation; however, it leads to additional training complexity. There have been several follow-up works to NOTEARS, such as [98] (see [4] for an overview), that suggest different regularization schemes. We leave exploring such alternatives to future work.
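A minimal PyTorch sketch of this regularizer (the penalty weights \tau_{1},\tau_{2} of Eq. (9) are not reproduced here):

import torch

def notears_penalty(W):
    """R = tr exp(W0 o W0) - d, with W0 equal to W with its diagonal zeroed out."""
    d = W.shape[0]
    W0 = W * (1.0 - torch.eye(d, device=W.device))
    return torch.matrix_exp(W0 * W0).diagonal().sum() - d

W = torch.randn(5, 5, requires_grad=True)
penalty = notears_penalty(W)
penalty.backward()          # gradients flow through the matrix exponential via autograd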

F.4 A Variational Autoencoder approach

In this section, we describe the technical details involved with adapting Variational Autoencoders (VAEs) [39, 68] for our setting. We highlight some difficulties that we will encounter and also suggest possible ways to work around them.

Indeed, most earlier approaches for causal representation learning have relied on maximum likelihood estimation via variational inference. In particular, they have relied on autoencoders or variational autoencoders [36, 2]. In our setting, VAEs are a viable approach. More concretely, we can encode the distribution X(0)X^{(0)} to ϵ\epsilon, then apply the B(i)B^{(i)} parameter matrix before finally decoding to X(i)X^{(i)}. To handle the fact that we don’t have paired interventional data as in [6], we could potentially modify the Evidence Lower Bound (ELBO) to include some divergence measure between the predicted and measured interventional data. Finally, this can be trained end-to-end as in traditional VAEs. We now describe this in more detail.

We will use VAEs to encode the observational distribution X(0)X^{(0)} to ϵ\epsilon, so the encoder will formally model B(0)(f1(.)μ(i))B^{(0)}\circ(f^{-1}(.)-\mu^{(i)}) where we use the notation from Section A.1. Note here that we cannot directly design the encoder to map XX to ZZ because we do not know the prior distribution on ZZ. Indeed, the objective is to learn it.

Let II denote the set of intervention targets. Assume the targets are chosen uniformly at random, which can be done in practice by subsampling each interventional distribution to have the same size. Suppose we had paired counterfactual data 𝒟=iI𝒟i\mathcal{D}=\cup_{i\in I}\mathcal{D}_{i} of paired interventional datasets, i.e. 𝒟i\mathcal{D}_{i} consists of pairs (X(0),X(i))(X^{(0)},X^{(i)}) with corresponding probability density denoted p(i)(x(0),x(i))p^{(i)}(x^{(0)},x^{(i)}). Following [39, 101], define an amortized inference distribution as q(ϵ|x(0))q(\epsilon|x^{(0)}). Then, we can bound the expected log-likelihood as

𝔼𝒟[lnp(i)(x(0),x(i))]\displaystyle\mathbb{E}_{\mathcal{D}}[\ln p^{(i)}(x^{(0)},x^{(i)})] =𝔼i𝒰(I)𝔼𝒟i[lnp(i)(x(0),x(i))]\displaystyle=\mathbb{E}_{i\sim\mathcal{U}(I)}\mathbb{E}_{\mathcal{D}_{i}}[\ln p^{(i)}(x^{(0)},x^{(i)})] (124)
=𝔼i𝒰(I)𝔼𝒟ilnϵp(i)(x(0),x(i),ϵ)𝑑ϵ\displaystyle=\mathbb{E}_{i\sim\mathcal{U}(I)}\mathbb{E}_{\mathcal{D}_{i}}\ln\int_{\epsilon}p^{(i)}(x^{(0)},x^{(i)},\epsilon)\,d\epsilon (125)
=𝔼i𝒰(I)𝔼𝒟ilnϵp(i)(x(0),x(i),ϵ)q(ϵ|x(0))q(ϵ|x(0))𝑑ϵ\displaystyle=\mathbb{E}_{i\sim\mathcal{U}(I)}\mathbb{E}_{\mathcal{D}_{i}}\ln\int_{\epsilon}\frac{p^{(i)}(x^{(0)},x^{(i)},\epsilon)q(\epsilon|x^{(0)})}{q(\epsilon|x^{(0)})}\,d\epsilon (126)
=𝔼i𝒰(I)𝔼𝒟iln𝔼ϵq(ϵ|x(0))p(i)(x(0),x(i),ϵ)q(ϵ|x(0))\displaystyle=\mathbb{E}_{i\sim\mathcal{U}(I)}\mathbb{E}_{\mathcal{D}_{i}}\ln\mathbb{E}_{\epsilon\sim q(\epsilon|x^{(0)})}\frac{p^{(i)}(x^{(0)},x^{(i)},\epsilon)}{q(\epsilon|x^{(0)})} (127)
𝔼i𝒰(I)𝔼𝒟i𝔼ϵq(x(0))lnp(i)(x(0),x(i),ϵ)q(ϵ|x(0))(Jensen’s inequality)\displaystyle\geq\mathbb{E}_{i\sim\mathcal{U}(I)}\mathbb{E}_{\mathcal{D}_{i}}\mathbb{E}_{\epsilon\sim q(x^{(0)})}\ln\frac{p^{(i)}(x^{(0)},x^{(i)},\epsilon)}{q(\epsilon|x^{(0)})}\qquad\qquad\text{(Jensen's inequality)} (128)
=𝔼i𝒰(I)𝔼𝒟i𝔼ϵq(x(0))lnp(i)(x(0),x(i)|ϵ)p(ϵ)q(ϵ|x(0))\displaystyle=\mathbb{E}_{i\sim\mathcal{U}(I)}\mathbb{E}_{\mathcal{D}_{i}}\mathbb{E}_{\epsilon\sim q(x^{(0)})}\ln\frac{p^{(i)}(x^{(0)},x^{(i)}|\epsilon)p(\epsilon)}{q(\epsilon|x^{(0)})} (129)
=𝔼i𝒰(I)𝔼𝒟i(𝔼ϵq(x(0))lnp(i)(x(0),x(i)|ϵ)𝔼ϵq(x(0))lnq(ϵ|x(0))p(ϵ))\displaystyle=\mathbb{E}_{i\sim\mathcal{U}(I)}\mathbb{E}_{\mathcal{D}_{i}}\left(\mathbb{E}_{\epsilon\sim q(x^{(0)})}\ln p^{(i)}(x^{(0)},x^{(i)}|\epsilon)-\mathbb{E}_{\epsilon\sim q(x^{(0)})}\ln\frac{q(\epsilon|x^{(0)})}{p(\epsilon)}\right) (130)
=𝔼i𝒰(I)𝔼(x(0),x(i))𝒟i[𝔼ϵq(ϵ|x(0))[lnp(i)(x(0),x(i)|ϵ)]\displaystyle=\mathbb{E}_{i\sim\mathcal{U}(I)}\mathbb{E}_{(x^{(0)},x^{(i)})\sim\mathcal{D}_{i}}\biggl{[}\mathbb{E}_{\epsilon\sim q(\epsilon|x^{(0)})}[\ln p^{(i)}(x^{(0)},x^{(i)}|\epsilon)] (131)
KL(q(ϵ|x(0))||N(0,I))]\displaystyle\hskip 180.00027pt-\textrm{KL}(q(\epsilon|x^{(0)})||N(0,I))\biggr{]} (132)

where the last term is the standard Evidence Lower Bound (ELBO). This can then be trained end-to-end via the reparameterization trick. A similar approach was taken in [6] who had access to paired counterfactual data.

However, we do not have access to such paired interventional data, but instead we only observe the marginal distributions X(i)X^{(i)}. To this end, conditioned on ϵ\epsilon, we can split the term lnp(i)(x(0),x(i)|ϵ)=lnp~(0)(x(0)|ϵ)+lnp~(i)(x(i)|ϵ)\ln p^{(i)}(x^{(0)},x^{(i)}|\epsilon)=\ln\widetilde{p}^{(0)}(x^{(0)}|\epsilon)+\ln\widetilde{p}^{(i)}(x^{(i)}|\epsilon) where p~(i)\widetilde{p}^{(i)} denotes the density of the intervened marginal of p(i)p^{(i)}, which corresponds to using the interventional decoder f((B(i))1(.)+μ(i))f\circ((B^{(i)})^{-1}(.)+\mu^{(i)}). Nevertheless, 𝔼ϵq(x(0))[lnp~(i)(x(i)|ϵ)]\mathbb{E}_{\epsilon\sim q(x^{(0)})}[\ln\widetilde{p}^{(i)}(x^{(i)}|\epsilon)] is still not tractable in unpaired settings. As a heuristic, we could consider a modified ELBO by modifying this term to measure some sort of divergence metric between the true interventional data X(i)X^{(i)} and the generated interventional data p~(i)(x(i)|ϵ)\widetilde{p}^{(i)}(x^{(i)}|\epsilon), similar to [101]. We leave it for future work to explore this direction.
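To make the bound above concrete, the following sketch implements the paired-data ELBO of (124)-(132) with a Gaussian decoder; it is only a sketch under the hypothetical assumption of paired samples (x^{(0)},x^{(i)}), and the unpaired modification discussed above is not implemented. All architecture choices are illustrative.

import torch
import torch.nn as nn
import torch.nn.functional as F

class InterventionalVAE(nn.Module):
    """Paired-data ELBO sketch: encode x0 to eps, decode both x0 and x(i)."""
    def __init__(self, d_obs, d_latent, n_env, hidden=128):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(d_obs, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2 * d_latent))        # q(eps | x0)
        self.dec = nn.Sequential(nn.Linear(d_latent, hidden), nn.ReLU(),
                                 nn.Linear(hidden, d_obs))               # shared mixing f
        # (B^(i))^{-1} and shifts mu^(i); index 0 is the observational environment
        self.B_inv = nn.Parameter(torch.eye(d_latent).repeat(n_env + 1, 1, 1))
        self.mu = nn.Parameter(torch.zeros(n_env + 1, d_latent))

    def neg_elbo(self, x0, xi, i):
        mean, logvar = self.enc(x0).chunk(2, dim=-1)
        eps = mean + torch.randn_like(mean) * (0.5 * logvar).exp()       # reparameterization
        # Gaussian reconstruction terms (up to constants) for both halves of the pair
        rec0 = F.mse_loss(self.dec(eps @ self.B_inv[0].T + self.mu[0]), x0)
        reci = F.mse_loss(self.dec(eps @ self.B_inv[i].T + self.mu[i]), xi)
        kl = -0.5 * (1 + logvar - mean ** 2 - logvar.exp()).sum(-1).mean()
        return rec0 + reci + kl

# usage with random stand-in data for environment i = 3
model = InterventionalVAE(d_obs=20, d_latent=5, n_env=5)
x0, xi = torch.randn(64, 20), torch.randn(64, 20)
loss = model.neg_elbo(x0, xi, i=3)
loss.backward()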

Appendix G Additional experimental results

In this section, we provide additional experimental results and discussions to explore the effect of shift strength of the interventions, data scale, and noise distribution.

G.1 Limitations of the contrastive approach

Refer to caption
Figure 5: Dependence of the performance metrics on the shift η\eta. Other settings are as for nonlinear mixing functions (see Table 8) with ER(10,2)\mathrm{ER}(10,2) graphs and d=100d^{\prime}=100.
Refer to caption
Figure 6: Cross entropy loss for a=1a=1, c=0c=0, z0=1z_{0}=1 and b=0b=0 (left), b=2b=2 (right) as a function of the estimated latent variable z^\hat{z}.

As mentioned in the main part of the paper, our contrastive algorithm struggles to recover the ground truth latent variables when considering interventions without shifts. We illustrate the dependence on the shift strength in Figure 5. In this section, we provide some evidence why this is the case. Apart from setting the stage for future works, this also enables practitioners to be aware of potential pitfalls when applying our techniques.

Refer to caption
Figure 7: Density of standardized ground truth latents and recovered latents for η=0\eta=0 (top left) and η=1\eta=1 (top right), scatter plot of the ground truth latent variables Z1Z_{1} and recovered latent variables Z^1\hat{Z}_{1} for η=0\eta=0 (bottom left) and η=1\eta=1 (bottom right). Results shown for ER(10,2)\mathrm{ER}(10,2) graphs and d=100d^{\prime}=100, all further parameters as in Table 8.

We suspect that the main reason for this behavior is the non-convexity of the parametric output layer. Note that this is very different from the well-known non-convexity of (overparametrized) neural networks. While neural networks parametrize non-convex functions, fitting their final layer is typically a convex optimization problem, e.g., a least squares regression or a logistic regression on the features. Moreover, it is well understood that convergence to a global minimizer using gradient descent is possible under suitable conditions, see, e.g., [47].

This is different for our quadratic log-odds expression, where gradient descent is not sufficient to find the global minimizer. To illustrate this, we consider the case of a single latent variable, i.e., d=1. Then the ground truth parametric form of the log-odds in (6) can be expressed as

\ln p_{X}^{(1)}(x)-\ln p_{X}^{(0)}(x)=g(x)=h(f^{-1}(x))=a(f^{-1}(x)-b)^{2}+c (133)

for some constants aa, bb, cc and b=0b=0 if η(1)=0\eta^{(1)}=0, i.e., there is no shift. We moreover assume, for the sake of argument, that the final layer implementing h(z)=a(zb)2+ch(z)=a(z-b)^{2}+c is fixed to the ground truth parametric form of the log-odds (then there is also no scaling ambiguity left). Now, let us fix a point x0x_{0} and define z0=f1(x0)z_{0}=f^{-1}(x_{0}) and

p=(i=1|X=x0)=pX(1)(x0)pX(1)(x0)+pX(0)(x0)=eg(x0)1+eg(x0).\displaystyle p=\mathbb{P}(i=1|X=x_{0})=\frac{p_{X}^{(1)}(x_{0})}{p_{X}^{(1)}(x_{0})+p_{X}^{(0)}(x_{0})}=\frac{e^{g(x_{0})}}{1+e^{g(x_{0})}}. (134)

Then the (sample conditional) cross entropy loss as a function of z^=f^1(x0)\hat{z}=\hat{f}^{-1}(x_{0}) for our learned function f^\hat{f} is given by

CE=pln(eh(z^)1+eh(z^))(1p)ln(11+eh(z^))=ph(z^)+ln(1+eh(z^))\displaystyle\mathcal{L}_{\mathrm{CE}}=-p\ln\left(\frac{e^{h(\hat{z})}}{1+e^{h(\hat{z})}}\right)-(1-p)\ln\left(\frac{1}{1+e^{h(\hat{z})}}\right)=-ph(\hat{z})+\ln(1+e^{h(\hat{z})}) (135)

We drop the intervention index here since we focus on a one-dimensional illustration. We plot two examples of such loss functions for b=0 and b\neq 0 in Figure 6.
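The quantities behind Figure 6 can be reproduced with a few lines; the grid and the detection tolerance are our own choices, and the constants follow the caption of Figure 6.

import numpy as np

a, c, z0 = 1.0, 0.0, 1.0

def ce_loss(z_hat, b):
    h = lambda z: a * (z - b) ** 2 + c                 # fixed parametric log-odds
    p = 1.0 / (1.0 + np.exp(-h(z0)))                   # P(i = 1 | X = x_0), Eq. (134)
    return -p * h(z_hat) + np.log1p(np.exp(h(z_hat)))  # Eq. (135)

grid = np.linspace(-4.0, 4.0, 2001)
for b in (0.0, 2.0):
    loss = ce_loss(grid, b)
    minimizers = grid[np.isclose(loss, loss.min(), rtol=0.0, atol=1e-6)]
    print(f"b = {b}: minimizers of the per-sample loss near {np.round(minimizers, 3)}")
# For b = 0 the two symmetric minimizers +/- z0 reflect the sign ambiguity behind
# the observation \hat{Z} = |Z|; the per-sample loss is non-convex in both cases.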

To verify that this is indeed the reason for our failure to recover the ground truth latent variables, we investigate the relation between estimated latent variables and ground truth latent variables. The result can be found in Figure 7 and, as we see, when η=0\eta=0, the distribution of the recovered latents is skewed. Moreover, we find that we essentially recover the latent variable up to a sign, i.e., Z^=|Z|\hat{Z}=|Z| (there is a small offset because we enforce Z^\hat{Z} to be centered). Note that this explains the tiny MCC scores in Table 1, since 𝔼(Z|Z|)=0\mathbb{E}(Z|Z|)=0 for symmetric distributions. This observation might also be a potential starting point to improve our algorithm. For non-zero shifts we recover the latent variables almost perfectly.

G.2 Varsortability and data scale

An important observation in causal discovery is that many benchmark settings exhibit specific correlation structures that can be exploited to infer causal directions. The most prominent example is varsortability [66], which measures how well the causal order agrees with the order induced by the magnitude of the variances. This measure lies between 0 and 1; for values close to 0 or 1 the variances contain a lot of information about the causal order, while for values close to 0.5 they do not. It was shown in [66] that typical parameter choices exhibit high varsortability \approx 1, which can be exploited by trivial algorithms. A related notion is R^{2}-varsortability [67], where the variance of a variable is replaced by its residual variance after standardizing the variables and regressing out all other variables, i.e., the unexplained variance.

Table 4: Results for nonlinear synthetic data with n=10000n=10000 and standardized latent variables ZZ (up to standardization the setting is the same as in Table 2 in the paper).
Setting Method SHD \downarrow AUROC \uparrow MCC \uparrow R2R^{2} \uparrow
ER(5, 2), d=20d^{\prime}=20 Contrastive Learning 2.6±0.42.6\pm 0.4 0.92±0.020.92\pm 0.02 0.96±0.010.96\pm 0.01 0.93±0.010.93\pm 0.01
VAE 10.0±0.010.0\pm 0.0 0.50±0.000.50\pm 0.00 0.11±0.010.11\pm 0.01 0.04±0.010.04\pm 0.01
ER(5, 2), d=100d^{\prime}=100 Contrastive Learning 1.4±0.41.4\pm 0.4 0.98±0.020.98\pm 0.02 0.98±0.000.98\pm 0.00 0.97±0.010.97\pm 0.01
VAE 10.0±0.010.0\pm 0.0 0.50±0.000.50\pm 0.00 0.14±0.030.14\pm 0.03 0.10±0.040.10\pm 0.04
ER(10, 2), d=20d^{\prime}=20 Contrastive Learning 9.6±2.19.6\pm 2.1 0.90±0.030.90\pm 0.03 0.90±0.010.90\pm 0.01 0.84±0.010.84\pm 0.01
VAE 18.6±0.918.6\pm 0.9 0.50±0.000.50\pm 0.00 0.27±0.040.27\pm 0.04 0.29±0.040.29\pm 0.04
ER(10, 2), d=100d^{\prime}=100 Contrastive Learning 2.8±1.42.8\pm 1.4 0.99±0.010.99\pm 0.01 0.98±0.000.98\pm 0.00 0.96±0.000.96\pm 0.00
VAE 18.6±0.918.6\pm 0.9 0.50±0.000.50\pm 0.00 0.14±0.030.14\pm 0.03 0.11±0.040.11\pm 0.04

In our setting the varsortability of the latent variables Z is quite high (see [66] for the full definition), e.g., for d=10 and k=2 the average varsortability is 0.92\pm 0.02. Note that it is not obvious how this can be exploited algorithmically. Indeed, we only observe a mixture X=f(Z) and the scale of Z is not identifiable. Nevertheless, to rule out that we implicitly use the scale of the variables, we ran the same experiments except that we standardized the latent variables Z of the observational distribution (and applied the same scaling to the interventional distributions). The results can be found in Table 4 and are roughly similar to the results in Table 2 for the same settings without standardization. Note that R^{2}-sortability is not substantially different from 0.5 (roughly 0.4) in our settings; thus R^{2}-sortability cannot be exploited to infer the causal structure, even when observing Z.
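For completeness, a short sketch of the varsortability computation as we read the definition in [66] (the fraction of directed paths along which the marginal variance increases, with ties counted as 1/2); the toy graph is illustrative.

import numpy as np

def varsortability(adj, variances):
    """Fraction of directed paths along which the marginal variance increases
    (ties counted as 1/2); our reading of the measure proposed in [66]."""
    d = adj.shape[0]
    A = (adj != 0).astype(float)
    A_k, increasing, total = A.copy(), 0.0, 0.0
    for _ in range(d - 1):
        src, dst = np.nonzero(A_k)
        increasing += np.sum(variances[src] < variances[dst]) \
                      + 0.5 * np.sum(variances[src] == variances[dst])
        total += len(src)
        A_k = A_k @ A
    return increasing / total

adj = np.array([[0, 1, 0], [0, 0, 1], [0, 0, 0]])          # chain 1 -> 2 -> 3
print(varsortability(adj, np.array([1.0, 2.0, 3.0])))      # 1.0: perfectly varsortable
print(varsortability(adj, np.array([3.0, 2.0, 1.0])))      # 0.0: anti-sorted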

G.3 Effect of noise distribution

Table 5: Noise dependence of contrastive algorithm for ER(5,2)\mathrm{ER}(5,2), d=20d^{\prime}=20, n=10000n=10000 and nonlinear mixing (same setting as first row of Table 2 in the paper).
Noise distribution SHD \downarrow AUROC \uparrow MCC \uparrow R2R^{2} \uparrow
Gaussian 1.8±0.51.8\pm 0.5 0.97±0.010.97\pm 0.01 0.97±0.000.97\pm 0.00 0.96±0.000.96\pm 0.00
Laplace 3.0±0.63.0\pm 0.6 0.89±0.030.89\pm 0.03 0.93±0.010.93\pm 0.01 0.89±0.010.89\pm 0.01
Gumbel 3.0±0.93.0\pm 0.9 0.92±0.020.92\pm 0.02 0.94±0.010.94\pm 0.01 0.91±0.010.91\pm 0.01
Uniform 3.4±0.43.4\pm 0.4 0.87±0.030.87\pm 0.03 0.96±0.010.96\pm 0.01 0.93±0.010.93\pm 0.01
Exponential 3.8±0.53.8\pm 0.5 0.86±0.020.86\pm 0.02 0.89±0.020.89\pm 0.02 0.86±0.020.86\pm 0.02

Our theoretical results only cover the setting of Gaussian SCMs, and we even show that identifiability does not necessarily hold for uniformly distributed noise (see Appendix D.2). Nevertheless, we can evaluate the performance of our algorithm for linear SCMs with non-Gaussian noise distributions. We rerun one of the settings in Table 2 for different noise distributions. The results can be found in Table 5; they show slightly worse but still reasonable performance compared to Gaussian data. This is in line with the observation that using a Gaussian likelihood for non-Gaussian data gives good results in, e.g., causal discovery.

Appendix H Hyperparameters, architectures and further additional details on experiments

In this section, we give additional details on the experiments. We use PyTorch [60] for all our experiments.

Data generation

To generate the latent DAG and sample the latent variables we use the sempler package [21]. We always sample \mathrm{ER}(d,k) graphs, and the non-zero weights are sampled from the distribution w_{ij}\sim\mathcal{U}(\pm[0.25,1.0]). To generate linear synthetic data, we sample all entries of the mixing matrix i.i.d. from a standard Gaussian distribution. For non-linear synthetic data, f is given by the architecture in Table 9 with weights sampled from \mathcal{U}([-\frac{1}{\sqrt{in_{feat}}},\frac{1}{\sqrt{in_{feat}}}]). For image data, similar to [2], the latents Z are the coordinates of balls in an image (but we sample them differently as per our setting), and a 64\times 64\times 3 RGB image is rendered using Pygame [77], which provides the nonlinearity. Other parameter choices such as dimensions, DAGs, variance parameters, and shifts are already outlined in Section 6, but for clarity we also list them in Tables 6, 7, and 8.
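For reference, the following self-contained numpy sketch mimics this generative process (the actual pipeline uses sempler and the exact values of Tables 6-8; the concrete numbers, the edge-probability convention for \mathrm{ER}(d,k), and the random MLP below are our own illustrative stand-ins):

import numpy as np

rng = np.random.default_rng(0)
d, d_obs, k, n = 5, 20, 2, 10_000

# ER(d, k) DAG on a fixed causal order with weights from U(+/-[0.25, 1.0])
mask = np.tril(rng.random((d, d)) < k / (d - 1), k=-1)
A = mask * rng.uniform(0.25, 1.0, (d, d)) * rng.choice([-1.0, 1.0], (d, d))
var_obs = rng.uniform(1.0, 2.0, d)                         # observational noise variances

# random MLP mixing in the spirit of Table 9: weights U(+/-1/sqrt(fan_in)), LeakyReLU(0.2)
dims = [d, 512, 512, 512, d_obs]
weights = [rng.uniform(-1.0, 1.0, (m, p)) / np.sqrt(m) for m, p in zip(dims[:-1], dims[1:])]

def mix(z):
    for W in weights[:-1]:
        pre = z @ W
        z = np.maximum(pre, 0.2 * pre)                     # LeakyReLU(0.2)
    return z @ weights[-1]

def sample(n, target=None):
    """Observational data (target=None) or a perfect shift intervention on `target`."""
    A_i, var = A.copy(), var_obs.copy()
    eps = rng.normal(0.0, 1.0, (n, d))
    if target is not None:
        A_i[target, :] = 0.0                               # cut all edges into the target
        var[target] = rng.uniform(1.0, 2.0)                # new noise variance
        eps[:, target] += rng.choice([-1.0, 1.0]) * rng.uniform(1.0, 2.0)   # shift eta^(i)
    Z = np.linalg.solve(np.eye(d) - A_i, (eps * np.sqrt(var)).T).T   # Z = (B^(i))^{-1}(eps + eta e_t)
    return mix(Z), Z

X_obs, Z_obs = sample(n)
interventional = [sample(n, target=i) for i in range(d)]   # one intervention per node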

Table 6: Parameters used for linear synthetic data.
dd 5
dd^{\prime} 10
kk 3/2
nn {2500,5000,10000,25000,50000}\{2500,5000,10000,25000,50000\}
σobs2\sigma_{\mathrm{obs}}^{2} 𝒰([2,4])\mathcal{U}([2,4])
σint2\sigma_{\mathrm{int}}^{2} 𝒰([6,8])\mathcal{U}([6,8])
η(i)\eta^{(i)} 0
runs 5
mixing linear
Table 7: Parameters used for nonlinear synthetic data.
dd {5,10}\{5,10\}
dd^{\prime} {20,100}\{20,100\}
kk {1,2}\{1,2\}
nn 1000010000
σobs2\sigma_{\mathrm{obs}}^{2} 𝒰([1,2])\mathcal{U}([1,2])
σint2\sigma_{\mathrm{int}}^{2} 𝒰([1,2])\mathcal{U}([1,2])
η(i)\eta^{(i)} 𝒰(±[1,2])\mathcal{U}(\pm[1,2])
runs 5
mixing 3 layer mlp
Table 8: Parameters used for image data.
dd {4,6}\{4,6\}
dd^{\prime} 36423\cdot 64^{2}
kk {1,2}\{1,2\}
nn 2500025000
σobs2\sigma_{\mathrm{obs}}^{2} 𝒰([0.01,0.01])\mathcal{U}([0.01,0.01])
σint2\sigma_{\mathrm{int}}^{2} 𝒰([0.01,0.02])\mathcal{U}([0.01,0.02])
η(i)\eta^{(i)} 𝒰(±[0.1,0.2])\mathcal{U}(\pm[0.1,0.2])
runs 5
mixing image rendering
Table 9: Synthetic data generation
Architecture Layer Sequence
MLP Embedding Input: zdz\in\mathbb{R}^{d}
FC 512512, LeakyReLU(0.20.2)
FC 512512, LeakyReLU(0.20.2)
FC 512512, LeakyReLU(0.20.2)
FC dd^{\prime} \longleftarrow Output xx

Models

For synthetic data, the model architecture for $h(x,\theta)$ is a one-hidden-layer MLP with 512 hidden units and LeakyReLU activation functions. For the parametric final layer, we decompose the matrix $W$ as $W=D(\mathrm{Id}-A)$, where $D$ is a diagonal matrix and $A$ is a matrix with a masked diagonal; both are learned. This factorization is intended to improve stability. We initialize $A=0$ and $D=\mathrm{Id}$. The sparsity penalty and the DAGness penalty are applied only to $A$. For the baseline VAE we use the same architecture for both encoder and decoder. The code for the linear baseline is from [84].
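As a concrete illustration, here is a minimal PyTorch sketch of this parametrization. The class and method names are our own, and the NOTEARS-style acyclicity term is shown only as one possible choice for the DAGness penalty; the exact penalties entering our objective (Eq. (9)) are not restated here.

```python
import torch
import torch.nn as nn

class ParametricFinalLayer(nn.Module):
    """Sketch of the W = D(Id - A) parametrization described above."""

    def __init__(self, d):
        super().__init__()
        self.d = d
        self.D_diag = nn.Parameter(torch.ones(d))         # D = Id at initialization
        self.A = nn.Parameter(torch.zeros(d, d))          # A = 0 at initialization
        self.register_buffer("mask", 1.0 - torch.eye(d))  # masks the diagonal of A

    def W(self):
        A = self.A * self.mask
        return torch.diag(self.D_diag) @ (torch.eye(self.d, device=A.device) - A)

    def penalties(self):
        # Sparsity and DAGness penalties act on A only; the acyclicity term is
        # the NOTEARS-style trace(exp(A*A)) - d, shown as one possible choice.
        A = self.A * self.mask
        sparsity = A.abs().sum()
        dagness = torch.trace(torch.matrix_exp(A * A)) - self.d
        return sparsity, dagness

    def forward(self, z_hat):
        return z_hat @ self.W().T
```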

For image data, the model architectures for $h(x,\theta)$ are small convolutional networks. The architecture for the contrastive approach can be found in Table 10. For the VAE model, we use the encoder architecture in Table 11 and the decoder architecture in Table 12.

Table 10: Contrastive model architecture for image data.
Architecture | Layer sequence
Conv Embedding | Input: $x\in\mathbb{R}^{64\times 64\times 3}$
 | Conv(3, 10, kernel size = 5, stride = 3), ReLU()
 | MaxPool(kernel size = 2, stride = 2)
 | FC 64, LeakyReLU()
 | FC $d$ $\longleftarrow$ Output $\hat{z}$
Table 11: VAE encoder architecture for image data.
Architecture | Layer sequence
Conv Encoder | Input: $x\in\mathbb{R}^{64\times 64\times 3}$
 | Conv(3, 16, kernel size = 4, stride = 2, padding = 1), LeakyReLU()
 | Conv(16, 32, kernel size = 4, stride = 2, padding = 1), LeakyReLU()
 | Conv(32, 32, kernel size = 4, stride = 2, padding = 1), LeakyReLU()
 | Conv(32, 64, kernel size = 4, stride = 2, padding = 1), LeakyReLU()
 | FC $2d$ $\longleftarrow$ Output (mean, logvar)
Table 12: VAE decoder architecture for image data.
Architecture | Layer sequence
Deconv Decoder | Input: $\hat{z}\in\mathbb{R}^{d}$
 | FC $(3\times 3)\times d$
 | DeConv($d$, 32, kernel size = 4, stride = 2, padding = 0), LeakyReLU()
 | DeConv(32, 16, kernel size = 4, stride = 2, padding = 1), LeakyReLU()
 | DeConv(16, 8, kernel size = 4, stride = 2, padding = 1), LeakyReLU()
 | DeConv(8, 3, kernel size = 4, stride = 2, padding = 1), LeakyReLU()
 | Sigmoid() $\longleftarrow$ Output $\hat{x}$
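For concreteness, one possible PyTorch realization of the contrastive embedding in Table 10 is sketched below; the function name and the inferred flattened dimension of 1000 are our own, while the remaining choices follow the table.

```python
import torch.nn as nn

def contrastive_image_encoder(d):
    """One possible realization of the Table 10 architecture for 3x64x64 inputs."""
    return nn.Sequential(
        nn.Conv2d(3, 10, kernel_size=5, stride=3),  # 3x64x64 -> 10x20x20
        nn.ReLU(),
        nn.MaxPool2d(kernel_size=2, stride=2),      # 10x20x20 -> 10x10x10
        nn.Flatten(),                               # -> 1000 features
        nn.Linear(10 * 10 * 10, 64),
        nn.LeakyReLU(),
        nn.Linear(64, d),                           # output z_hat
    )
```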

Training

For training we use the hyperparameters outlined in Table 13. We use an 80/20 split for the training and validation/test set, respectively. After subsampling each dataset to the same size, we copy the observational samples for each interventional dataset so that there is an equal number of observational and interventional samples, which can then be naturally paired during contrastive learning. We select the model with the smallest validation loss, where we only use the cross-entropy loss for validation. For the VAE baseline we use the standard VAE validation loss for model selection.
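The pairing step can be illustrated as follows; this is a minimal sketch with our own function name, where label 0 marks observational and label 1 marks interventional samples.

```python
import numpy as np

def build_contrastive_datasets(X_obs, X_ints, rng):
    """Balance and pair observational data with each interventional dataset.

    X_obs: (n_obs, d') array of observational samples.
    X_ints: list of (n_e, d') arrays, one per interventional environment.
    All datasets are subsampled to the size of the smallest one, and the
    observational samples are copied for every environment so that each
    environment yields a balanced binary classification problem.
    """
    m = min([len(X_obs)] + [len(X) for X in X_ints])
    X_obs_sub = X_obs[rng.permutation(len(X_obs))[:m]]
    datasets = []
    for X_int in X_ints:
        X_int_sub = X_int[rng.permutation(len(X_int))[:m]]
        X = np.concatenate([X_obs_sub, X_int_sub])
        y = np.concatenate([np.zeros(m), np.ones(m)])  # 0 = observational, 1 = interventional
        datasets.append((X, y))
    return datasets
```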

Table 13: Hyperparameters used for training.
$\tau_{1}$ (see Eq. (9))  $10^{-5}$
$\tau_{2}$ (see Eq. (9))  $10^{-4}$
learning rate  $5\cdot 10^{-4}$
batch size  512
epochs  200 (synthetic data), 100 (image data)
optimizer  Adam [38]
learning rate scheduler  CosineAnnealing [51]
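The optimizer and scheduler settings from Table 13 correspond to the standard PyTorch setup sketched below; the placeholder model and the omitted training step are our own.

```python
import torch

model = torch.nn.Linear(10, 5)   # placeholder for h(x, theta)
epochs = 200                     # 100 for image data

optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)

for epoch in range(epochs):
    # ... one pass over the training data with batch size 512,
    #     optimizing the contrastive objective ...
    scheduler.step()
```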

Compute and runtime

The experiments using synthetic data were run on 2 CPUs with 16 GB RAM on a compute cluster. For image data we added a GPU to speed up training. The runtime per epoch for the synthetic data with 10000 samples was roughly 10 s. The total runtime of the reported experiments was about 100 h (i.e., 200 CPU hours) plus 20 h on a GPU, but hyperparameter selection and preliminary experiments required about 10 times more CPU compute.

Mean Correlation Coefficient (MCC)

The Mean Correlation Coefficient (MCC) is a metric that has been used in prior works [37, 41] to quantify identifiability. MCC measures linear correlations up to permutation of the components. To compute the best permutation, a linear sum assignment problem is solved; the correlation coefficients of the matched components are then computed and averaged. A high MCC indicates that the true latents have been recovered. In this work, we compute the transformation using half of the samples and then report the MCC on the other half, i.e., on held-out samples.
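A minimal sketch of this computation is given below, using scipy.optimize.linear_sum_assignment; here the "transformation" fitted on the first half is taken to be the component permutation, which may be a simplification of the exact procedure.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def cross_corr(z, z_hat):
    """d x d matrix of Pearson correlations between true and estimated components."""
    d = z.shape[1]
    return np.corrcoef(z.T, z_hat.T)[:d, d:]

def out_of_sample_mcc(z_true, z_hat):
    """Fit the component permutation on the first half of the samples and report
    the mean absolute correlation of the matched pairs on the held-out half."""
    n = len(z_true) // 2
    row, col = linear_sum_assignment(-np.abs(cross_corr(z_true[:n], z_hat[:n])))
    corr_test = cross_corr(z_true[n:], z_hat[n:])
    return np.abs(corr_test[row, col]).mean()
```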

Further experimental attempts and dead ends

We briefly summarize a few additional settings that we tried in our experiments.

  • We fixed $D=\mathrm{Id}$ in the parametrization of $W$. This fixes the scaling of the latent variables in some non-trivial way and produced very similar results.

  • We tried to combine the contrastive algorithm with a VAE approach. Since the ground truth latent variables $Z$ are highly correlated, they are not suitable for an autoencoder that typically assumes a factorizing prior. Therefore, we map the estimated latent variables for the observational distribution to the noise distribution $\epsilon$, which factorizes. Note that for the ground truth latent variables the relation $\epsilon=B^{(0)}Z^{(0)}$ holds, so we consider $\hat{\epsilon}=W\hat{Z}$. We use $\hat{\epsilon}$ as the latent space of an autoencoder on which we then stack a decoder. We train using the observational samples and the ELBO for the VAE part, and with the contrastive loss (and interventional and observational samples) for the $\hat{Z}$ embedding. This resulted in a slightly worse recovery of the latent variables (as measured by the MCC score) but much worse recovery of the graph. We suspect that this originates from the fact that the matrix $W$ is used both in the parametrization of the log-odds and to map $\hat{Z}$ to $\epsilon$, which might make the learning more error-prone. Note that this is different from the suggestion in Section F.4.

  • Choosing a larger learning rate for the parametric part than for the non-parametric part seemed to help initially, but using the same sufficiently small learning rate for the entire model was more robust.

  • We tried larger architectures for both synthetic and image data. They turned out to be more difficult to train without offering any benefit.

  • For image data we often observed overfitting when using smaller sample sizes. The overfitting persisted even when we used a pre-trained ResNet [24] with frozen layers. We suspect that for image data the performance (in particular the sample complexity) can be further improved by carefully tuning regularization and model size.