
Interventional Causal Representation Learning

Kartik Ahuja    Divyat Mahajan    Yixin Wang    Yoshua Bengio
Abstract

Causal representation learning seeks to extract high-level latent factors from low-level sensory data. Most existing methods rely on observational data and structural assumptions (e.g., conditional independence) to identify the latent factors. However, interventional data is prevalent across applications. Can interventional data facilitate causal representation learning? We explore this question in this paper. The key observation is that interventional data often carries geometric signatures of the latent factors’ support (i.e., what values each latent can possibly take). For example, when the latent factors are causally connected, interventions can break the dependency between the intervened latents’ support and their ancestors’. Leveraging this fact, we prove that the latent causal factors can be identified up to permutation and scaling given data from perfect $do$ interventions. Moreover, we can achieve block affine identification, namely the estimated latent factors are only entangled with a few other latents, if we have access to data from imperfect interventions. These results highlight the unique power of interventional data in causal representation learning; they can enable provable identification of latent factors without any assumptions about their distributions or dependency structure.


1 Introduction

Modern deep learning models like GPT-3 (Brown et al., 2020) and CLIP (Radford et al., 2021) are remarkable representation learners (Bengio et al., 2013). Despite the successes, these models continue to be far from the human ability to adapt to new situations (distribution shifts) or carry out new tasks (Geirhos et al., 2020; Bommasani et al., 2021; Yamada et al., 2022). Humans encapsulate their causal knowledge of the world in a highly reusable and recomposable way (Goyal & Bengio, 2020), enabling them to adapt to new tasks in an ever-distribution-shifting world. How can we empower modern deep learning models with this type of causal understanding? This question is central to the emerging field of causal representation learning (Schölkopf et al., 2021).

A core task in causal representation learning is provable representation identification, i.e., developing representation learning algorithms that can provably identify natural latent factors (e.g., location, shape, and color of different objects in a scene). While provable representation identification is known to be impossible for an arbitrary data-generating process (DGP) (Hyvärinen & Pajunen, 1999; Locatello et al., 2019), real data often exhibits additional structure. For example, Hyvarinen et al. (2019); Khemakhem et al. (2022) consider the conditional independence between the latents given auxiliary information; Lachapelle et al. (2022) leverage the sparsity of the causal connections among the latents; Locatello et al. (2020); Klindt et al. (2020); Ahuja et al. (2022a) rely on the sparse variation in the latents over time.

Figure 1: (a) Observational data: the support of the child ($Z_2$) conditional on the parent ($Z_1$) varies with the value of the parent. (b), (c): the support of the child conditional on the parent under a $do$ intervention, a perfect intervention, and many imperfect interventions is independent of the parent. (d): an intervention on the child reduces the impact of the parent on it, which causes the support of the child conditional on the parent to take a larger set of values.

Most existing works rely on observational data and make assumptions on the dependency structure of the latents to achieve provable representation identification. However, in many applications, such as robotics and genomics, there is a wealth of interventional data available. For example, interventional data can be obtained from experiments such as genetic perturbations (Dixit et al., 2016) and electrical stimulations (Nejatbakhsh et al., 2021). Can interventional data help identify latent factors in causal representation learning? How can it help? We explore these questions in this work. The key observation is that interventional data often carries geometric signatures of the latent factors’ support (i.e., what values each latent can possibly take). Fig. 1 illustrates these geometric signatures: perfect interventions and many imperfect interventions can make the intervened latents’ support independent of their ancestors’ support. As we will show, these geometric signatures go a long way in facilitating provable representation identification in the absence of strong distributional assumptions.

Contributions. This work establishes representation identification guarantees without strong distributional assumptions on the latents in the following settings.

  • $do$ interventions. We first investigate scenarios where the true latent factors are mapped to high-dimensional observations through a finite-degree multivariate polynomial. When some latent dimension undergoes a hard $do$ intervention (Pearl, 2009), we are able to identify it up to shift and scaling. Even when the mapping is not a polynomial, approximate identification of the intervened latent is still achievable provided we have data from multiple $do$ interventional distributions on the same latent dimension.

  • Perfect & imperfect interventions. We achieve block affine identification under imperfect interventions (Peters et al., 2017) provided the support of the intervened latent is rendered independent of its ancestors under the intervention as shown in Figure 1c. This result covers all perfect interventions as a special case.

  • Observational data and independent support. The independence-of-support condition above can further facilitate representation identification with observational data. We show that, if the supports of the latents are already independent in observational data, then these latents can be identified up to permutation, shift, and scaling, without the need for any interventional data. This result extends the classical identifiability results from linear independent component analysis (ICA) (Comon, 1994) to allow for dependent latent variables. It also provides theoretical justification for recent proposals of performing unsupervised disentanglement through the independent support condition (Wang & Jordan, 2021; Roth et al., 2022).

We summarize our results in Table 1. Finally, we empirically demonstrate the practical utility of our theory. Across data-generating mechanisms ranging from polynomials to images produced by a rendering engine (Shinners, 2011), we show that interventional data helps identification.

Also, the code repository can be accessed at: github.com/facebookresearch/CausalRepID.

Table 1: Summary of results. Existing works such as iVAE (Khemakhem et al., 2022) use observational data and make assumptions on the graphical model of the latents to achieve identification. In contrast, we use interventional data and make no assumptions on the graph.
Input data | Assm. on $Z$ | Assm. on $g$ | Identification
Obs | $Z_r \perp Z_s \,|\, U$, $U$ aux. info. | Diffeomorphic | Perm & scale (Khemakhem et al., 2020)
Obs | Non-empty interior | Injective poly | Affine (Theorem 4.4)
Obs | Non-empty interior | $\approx$ Injective poly | $\approx$ Affine (Theorem A.8)
Obs | Independent support | Injective poly | Perm, shift, & scale (Theorem 6.3)
Obs + $do$ intervn | Non-empty interior | Injective poly | Perm, shift, & scale (Theorem 5.3)
Obs + $do$ intervn | Non-empty interior | Diffeomorphic | $\approx$ Perm & comp-wise (Theorem A.12)
Obs + Perfect intervn | Non-empty interior | Injective poly | Block affine (Theorem 5.8)
Obs + Imperfect intervn | Partially indep. support | Injective poly | Block affine (Theorem 5.8)
Counterfactual | Bijection w.r.t. noise | Diffeomorphic | Perm & comp-wise (Brehmer et al., 2022)

2 Related Work

Existing provable representation identification approaches often utilize structure in time-series data, as seen in initial works by Hyvarinen & Morioka (2016) and Hyvarinen & Morioka (2017). More recent studies have expanded on this approach, such as Hälvä & Hyvarinen (2020); Yao et al. (2021, 2022a, 2022b); Lippe et al. (2022b, a); Lachapelle et al. (2022). Other forms of weak supervision, such as data augmentations, can also be used in representation identification, as seen in works by Zimmermann et al. (2021); Von Kügelgen et al. (2021); Brehmer et al. (2022); Locatello et al. (2020); Ahuja et al. (2022a) that assume access to contrastive pairs of observations $(x, \tilde{x})$. A third approach, used in Khemakhem et al. (2022, 2020), involves using high-dimensional observations (e.g., an image) and auxiliary information (e.g., a label) to identify representations.

To understand the factual and counterfactual knowledge used by different works in representation identification, we can classify them according to Pearl’s ladder of causation (Bareinboim et al., 2022). In particular, our work operates with interventional data (level-two knowledge), while other studies leverage either observational data (level-one knowledge) or counterfactual data (level-three knowledge). Works such as Khemakhem et al. (2022, 2020); Ahuja et al. (2022b); Hyvarinen & Morioka (2016, 2017); Ahuja et al. (2021) use observational data and either make assumptions on the structure of the underlying causal graph of the latents or rely on auxiliary information. In contrast, works like Brehmer et al. (2022) use counterfactual knowledge to achieve identification for general DAG structures; Lippe et al. (2022b, a); Ahuja et al. (2022a); Lachapelle et al. (2022) use pre- and post-intervention observations to achieve provable representation identification. These latter studies use instance-level temporal interventions that carry much more information than interventional distributions alone. To summarize, these works require more information than is available with level-two data in the Pearlian ladder of causation.

Finally, concurrent work by Seigal et al. (2022) also studies identification of causal representations using interventional distributions. The authors focus on linear mixing of the latents and consider perfect interventions. In contrast, our results consider nonlinear mixing functions and imperfect interventions.

3 Setup: Causal Representation Learning

Causal representation learning aims to identify latent variables from high-dimensional observations. We begin with a data-generating process where high-dimensional observations $x \in \mathbb{R}^n$ are generated from latent variables $z \in \mathbb{R}^d$. We consider the task of identifying the latent $z$ assuming access to both observational and interventional datasets: the observational data is drawn from

$z \sim \mathbb{P}_Z; \qquad x \leftarrow g(z),$   (1)

where the latent $z$ is sampled from the distribution $\mathbb{P}_Z$ and $x$ is the observed data point rendered from the underlying latent $z$ via an injective decoder $g: \mathbb{R}^d \rightarrow \mathbb{R}^n$. The interventional data is drawn from a similar distribution except that the latent $z$ is drawn from $\mathbb{P}^{(i)}_Z$, namely the distribution of $z$ under an intervention on $z_i$:

$z \sim \mathbb{P}_Z^{(i)}; \qquad x \leftarrow g(z).$   (2)

We denote by $\mathcal{Z}$ and $\mathcal{Z}^{(i)}$ the supports of $\mathbb{P}_Z$ and $\mathbb{P}_Z^{(i)}$ respectively (the support is the set where the probability density is greater than zero). The support of $x$ is thus $\mathcal{X} = g(\mathcal{Z})$ in observational data and $\mathcal{X}^{(i)} = g\big(\mathcal{Z}^{(i)}\big)$ in interventional data. The goal of causal representation learning is provable representation identification, i.e., to learn an encoder function that takes the observation $x$ as input and provably outputs its underlying true latent $z$. In practice, such an encoder is often learned by solving a reconstruction identity,

$h \circ f(x) = x \qquad \forall x \in \mathcal{X} \cup \mathcal{X}^{(i)},$   (3)

where $f: \mathbb{R}^n \rightarrow \mathbb{R}^d$ and $h: \mathbb{R}^d \rightarrow \mathbb{R}^n$ are a pair of encoder and decoder, which need to jointly satisfy Eq. 3. The pair $(f, h)$ together is referred to as the autoencoder. Given the learned encoder $f$, the resulting representation is $\hat{z} \triangleq f(x)$, which holds the encoder’s estimate of the latents.

The reconstruction identity in Eq. 3 is highly underspecified and cannot in general identify the latents. There exist many pairs $(f, h)$ that jointly solve Eq. 3 but do not provide representations $\hat{z} \triangleq f(x)$ that coincide with the true latents $z$. For instance, applying an invertible map $b$ to any solution $(f, h)$ yields another valid solution $(b \circ f, \; h \circ b^{-1})$. In practical applications, however, exact identification of the latents is not necessary. For example, we may not be concerned with recovering the latent dimensions in the order they appear in $z$. Thus, in this work, we examine conditions under which the true latents can be identified up to certain transformations, such as affine transformations and coordinate permutations.
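To make the setup concrete, the following minimal sketch generates observational data as in Eq. 1 and $do$-interventional data as in Eq. 2. The dimensions, the latent distribution, and the toy linear decoder are hypothetical choices for illustration only (a degree-one instance of the injective decoders considered in the paper), not the authors' data-generating process.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 2, 10                                   # hypothetical latent / observation dims

# Injective decoder g: a random linear map for illustration; a full-column-rank G
# makes g injective (cf. Assumption 4.2 for the general polynomial case).
G = rng.standard_normal((n, d))
g = lambda z: z @ G.T

# Observational data (Eq. 1): z ~ P_Z with a dependent child latent.
z1 = rng.uniform(0, 1, size=5000)
z2 = z1 + rng.uniform(0, 1, size=5000)         # z2 depends on its parent z1
x_obs = g(np.stack([z1, z2], axis=1))

# Interventional data (Eq. 2): do(z1 = 0.5); the mechanism of z2 is unchanged.
z1_int = np.full(5000, 0.5)
z2_int = z1_int + rng.uniform(0, 1, size=5000)
x_int = g(np.stack([z1_int, z2_int], axis=1))

# An autoencoder (f, h) must satisfy h(f(x)) = x on both supports (Eq. 3); any
# invertible b gives another solution (b o f, h o b^{-1}), which is why Eq. 3
# alone cannot identify z.
```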

4 Stepping Stone: Affine Representation Identification with Polynomial Decoders

We first establish an affine identification result, which serves as a stepping stone towards stronger identification guarantees in the next section. We begin with a few assumptions.

Assumption 4.1.

The interior of the support of $z$, $\mathcal{Z} \cup \mathcal{Z}^{(i)}$, is a non-empty subset of $\mathbb{R}^d$.¹

¹ We work with $(\mathbb{R}^d, \|\cdot\|_2)$ as the metric space. A point is in the interior of a set if there exists an $\epsilon$-ball, for some $\epsilon > 0$, containing that point and contained in the set. The set of all such points defines the interior.

Assumption 4.2.

The decoder $g$ is a polynomial of finite degree $p$ whose corresponding coefficient matrix $G$ has full column rank. Specifically, the decoder $g$ is determined by the coefficient matrix $G$ as follows,

$g(z) = G\,[1, \; z, \; z \bar{\otimes} z, \; \cdots, \; \underbrace{z \bar{\otimes} \cdots \bar{\otimes} z}_{p \text{ times}}]^{\top} \qquad \forall z \in \mathbb{R}^d,$

where $\bar{\otimes}$ represents the Kronecker product with all distinct entries; for example, if $z = [z_1, z_2]$, then $z \bar{\otimes} z = [z_1^2, z_1 z_2, z_2^2]$.

The assumption that the matrix $G \in \mathbb{R}^{n \times q}$ has full column rank $q$ guarantees that the decoder $g$ is injective; see Lemma A.1 in Appendix A.1 for a proof. This injectivity condition on $g$ is common in identifiable representation learning. Without injectivity, the problem of identification becomes ill-defined; multiple different latent $z$'s can give rise to the same observation $x$. We note that the full-column-rank condition on $G$ in Assumption 4.2 imposes an implicit constraint on the dimensionality $n$ of the data; it requires that $n$ be at least the number of terms in the polynomial of degree $p$, namely $n \geq \sum_{r=0}^{p} \binom{r+d-1}{d-1}$. In the Appendix (Theorem A.5), we show that if our data is generated from sparse polynomials, i.e., $G$ is a sparse matrix, then $n$ is allowed to be much smaller.
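For concreteness, here is a small sketch of the polynomial decoder in Assumption 4.2: the feature vector $[1, z, z \bar{\otimes} z, \ldots]$ enumerates the distinct monomials up to degree $p$, and $g$ multiplies it by a full-column-rank matrix $G$. The random $G$ and the dimensions below are hypothetical and only illustrate the counting of terms; this is not the paper's code.

```python
import numpy as np
from itertools import combinations_with_replacement

def poly_features(z, p):
    """Return [1, z, z ⊗̄ z, ..., p-fold ⊗̄ z] with only distinct monomials,
    matching the feature vector in Assumption 4.2 / Constraint 4.3."""
    feats = [np.ones(1)]
    for r in range(1, p + 1):
        feats.append(np.array([np.prod([z[i] for i in idx])
                               for idx in combinations_with_replacement(range(len(z)), r)]))
    return np.concatenate(feats)

# Example: d = 2, p = 2 gives [1, z1, z2, z1^2, z1*z2, z2^2], i.e. q = 6 features,
# so injectivity of g(z) = G @ poly_features(z, p) needs n >= 6 rows in G with
# full column rank (a hypothetical random G is used here, full rank a.s.).
rng = np.random.default_rng(0)
d, p = 2, 2
q = len(poly_features(np.zeros(d), p))
G = rng.standard_normal((q + 4, q))
g = lambda z: G @ poly_features(z, p)
print(q, g(np.array([0.3, -1.2])).shape)       # 6 (10,)
```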

Under Assumptions 4.1 and 4.2, we perform causal representation learning with two constraints: polynomial decoder and non-collapsing encoder.

Constraint 4.3.

The learned decoder $h$ is a polynomial of degree $p$ and it is determined by its corresponding coefficient matrix $H$ as follows,

$h(z) = H\,[1, \; z, \; z \bar{\otimes} z, \; \cdots, \; \underbrace{z \bar{\otimes} \cdots \bar{\otimes} z}_{p \text{ times}}]^{\top} \qquad \forall z \in \mathbb{R}^d,$

where $\bar{\otimes}$ represents the Kronecker product with all distinct entries. The interior of the image of the encoder, $f(\mathcal{X} \cup \mathcal{X}^{(i)})$, is a non-empty subset of $\mathbb{R}^d$.

We now show that solving the reconstruction identity with these constraints can provably identify the true latent $z$ up to affine transformations.

Theorem 4.4.

Suppose the observational data and interventional data are generated from Eq. 1 and Eq. 2 respectively under Assumptions 4.1 and 4.2. The autoencoder that solves the reconstruction identity in Eq. 3 under Constraint 4.3 achieves affine identification, i.e., $\forall z \in \mathcal{Z} \cup \mathcal{Z}^{(i)}$, $\hat{z} = Az + c$, where $\hat{z}$ is the encoder $f$'s output, $z$ is the true latent, $A \in \mathbb{R}^{d \times d}$ is invertible, and $c \in \mathbb{R}^d$.

Theorem 4.4 drastically reduces the ambiguity in identifying the latent $z$ from arbitrary invertible transformations to invertible affine transformations. Moreover, Theorem 4.4 does not require any structural assumptions about the dependency between the latents. It only requires (i) a geometric assumption that the interior of the support is non-empty and (ii) that the map $g$ is a finite-degree polynomial.

The proof of Theorem 4.4 is in Appendix A.1. The idea is to write the representation $\hat{z} = f(x)$ as $\hat{z} = f \circ g(z) = a(z)$ with $a \triangleq f \circ g$, leveraging the relationship $x = g(z)$ in Eq. 1. We then show that the function $a$ must be an affine map. To give further intuition, we consider a toy example with a one-dimensional latent $z$, a three-dimensional observation $x$, and the true decoder $g$ and the learned decoder $h$ each being a degree-two polynomial. We first solve the reconstruction identity on all $x$, which gives $h(\hat{z}) = g(z)$, and equivalently $H[1, \hat{z}, \hat{z}^2]^{\top} = G[1, z, z^2]^{\top}$. This equality implies that both $\hat{z}$ and $\hat{z}^2$ must be at most degree-two polynomials of $z$. As a consequence, $\hat{z}$ must be a degree-one polynomial of $z$, which we next prove by contradiction. If $\hat{z}$ is a degree-two polynomial of $z$, then $\hat{z}^2$ is degree four; this contradicts the fact that $\hat{z}^2$ is at most degree two in $z$. Therefore, $\hat{z}$ must be a degree-one polynomial in $z$, i.e., a linear function of $z$.
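The degree-counting step of this toy argument can be checked symbolically. The sketch below (using SymPy, with hypothetical coefficients) confirms that a degree-two candidate for $\hat{z}$ makes $\hat{z}^2$ degree four, while a degree-one candidate keeps everything within degree two.

```python
import sympy as sp

z = sp.symbols('z')

# If ẑ were a genuine degree-two polynomial in z, then ẑ^2 would have degree four,
# contradicting H[1, ẑ, ẑ^2]^T = G[1, z, z^2]^T, which forces deg(ẑ^2) <= 2.
z_hat_quadratic = 1 + 2*z + 3*z**2
print(sp.degree(sp.expand(z_hat_quadratic**2), z))   # 4 -> contradiction

# A degree-one ẑ is consistent: both ẑ and ẑ^2 stay within degree two.
z_hat_affine = 5 - 2*z
print(sp.degree(sp.expand(z_hat_affine**2), z))      # 2 -> consistent
```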

Beyond polynomial map gg.

Theorem A.8 in the Appendix extends Theorem 4.4 to a class of maps $g(\cdot)$ that are $\epsilon$-approximable by a polynomial.

5 Provable Representation Identification with Interventional Data

In the previous section, we derived affine identification guarantees. Next, we strengthen these guarantees by leveraging geometric signals specific to many interventions.

5.1 Representation identification with $do$ interventions

We begin with a motivating example on images, where we are given data with $do$ interventions on the latents. Consider the two balls shown in Fig. 2a. Ball 1's coordinates are $(z_1^1, z_2^1)$ and Ball 2's coordinates are $(z_1^2, z_2^2)$. We write the latent as $z = [(z_1^1, z_2^1), (z_1^2, z_2^2)]$; this latent is rendered as the image $x$ shown in Fig. 2a. The latent $z$ in the observational data follows the directed acyclic graph (DAG) in Fig. 2b, where Ball 1's coordinates cause Ball 2's coordinates. Under a $do$ intervention on $z_2^2$, i.e., the second coordinate of Ball 2, the latent $z$ follows the DAG in Fig. 2c. Our goal is to learn an encoder from the images $x$ in the observational and interventional data that outputs the coordinates of the balls up to permutation and scaling.

Suppose $z$ is generated from a structural causal model with an underlying DAG (Pearl, 2009). Formally, a $do$ intervention on one latent dimension fixes it to some constant value. The distribution of the children of the intervened component is affected by the intervention, while the distribution of the remaining latents remains unaltered. Based on this property of $do$ interventions, we characterize the distribution $\mathbb{P}_Z^{(i)}$ in Eq. 2 as

$z_i = z^*; \qquad z_{-i} \sim \mathbb{P}_{Z_{-i}}^{(i)},$   (4)

where $z_i$ takes a fixed value $z^*$. The remaining variables in $z$, denoted $z_{-i}$, are sampled from $\mathbb{P}_{Z_{-i}}^{(i)}$.

The distribution $\mathbb{P}_{Z_{-i}}^{(i)}$ in Eq. 4 encompasses many settings in practice, including (i) $do$ interventions on causal DAGs (Pearl, 2009), i.e., $\mathbb{P}_{Z_{-i}}^{(i)} = \mathbb{P}_{Z_{-i} \,|\, do(z_i = z^*)}$, (ii) $do$ interventions on cyclic graphical models (Mooij & Heskes, 2013), and (iii) sampling $z_{-i}$ from its conditional in the observational data, $\mathbb{P}_{Z_{-i}}^{(i)} = \mathbb{P}_{Z_{-i} \,|\, z_i = z^*}$ (e.g., subsampling images in the observational data with a fixed background color).

Given interventional data from $do$ interventions, we perform causal representation learning by leveraging the geometric signature of the $do$ intervention in the search for the autoencoder. In particular, we enforce the following constraint while solving the reconstruction identity in Eq. 3.

Constraint 5.1.

The encoder's $k^{th}$ component $f_k(x)$, denoted $\hat{z}_k$, is required to take some fixed value $z^{\dagger}$ for all $x \in \mathcal{X}^{(i)}$. Formally, $f_k(x) = z^{\dagger}, \; \forall x \in \mathcal{X}^{(i)}$.

In Constraint 5.1, we do not need to know which component was intervened on or the value it was set to; that is, we allow $k \neq i$ and $z^{\dagger} \neq z^*$. We next show how this constraint helps identify the intervened latent $z_i$ under an additional assumption on the support of the unintervened latents, stated below.

Figure 2: Illustration of $do$ interventions on image-based data in (a). The DAG of dependencies under the observational distribution (b) and under a perfect intervention on $z_2^2$ (c).
Assumption 5.2.

The interior of the support of the distribution of the unintervened latents, $\mathbb{P}_{Z_{-i}}^{(i)}$, is a non-empty subset of $\mathbb{R}^{d-1}$.

Theorem 5.3.

Suppose the observational data and interventional data are generated from Eq. 1 and Eq. 2 respectively under Assumptions 4.1 and 4.2, where $\mathbb{P}_Z^{(i)}$ follows Eq. 4. The autoencoder that solves Eq. 3 under Constraints 4.3 and 5.1 identifies the intervened latent $z_i$ up to shift and scaling, i.e., $\hat{z}_k = e z_i + b$, where $e \in \mathbb{R}$ and $b \in \mathbb{R}$.

Theorem 5.3 immediately extends to settings where multiple interventional distributions are available, each corresponding to a hard $do$ intervention on a distinct latent variable. Under the same assumptions as Theorem 5.3, each of the intervened latents can be identified up to permutation, shift, and scaling. Notably, Theorem 5.3 does not rely on any distributional assumptions (e.g., parametric assumptions) on $z$; nor does it rely on the nature of the graphical model for $z$ (e.g., cyclic, acyclic). Theorem 5.3 makes only two key geometric assumptions: (i) the support of $z$ in the observational data has a non-empty interior, and (ii) the support of the unintervened latents $z_{-i}$ has a non-empty interior.

Theorem 5.3 combines the affine identification guarantee we derived in Theorem 4.4 with the geometric signature of $do$ interventions. For example, in Fig. 1b, the support of the true latents is axis-aligned (parallel to the x-axis). In this case, the interventional constraint also forces the support of $\hat{z}$ to be axis-aligned (parallel to the x-axis or y-axis). The proof of Theorem 5.3 is in Appendix A.2. We provide some intuition here. First, given Assumptions 4.1 and 4.2 and Constraint 4.3, Theorem 4.4 already guarantees affine identification. It implies $\hat{z}_k = a_{-i}^{\top} z_{-i} + e z_i + b$, where $z_{-i}$ includes all entries of $z$ other than $z_i$ and $a_{-i}$ is the vector of corresponding coefficients. Since both $\hat{z}_k$ and $z_i$ are set to fixed values, $a_{-i}^{\top} z_{-i}$ must also take a fixed value for all values of $z_{-i}$ in the support of $\mathbb{P}_{Z_{-i}}^{(i)}$. We argue that $a_{-i} = 0$ by contradiction. If $a_{-i} \neq 0$, then any change to $z_{-i}$ in the direction of $a_{-i}$ is reflected as a change in $\hat{z}_k$; this contradicts the fact that $\hat{z}_k$ takes a fixed value. Therefore, $a_{-i} = 0$ and $z_i$ is identified up to shift and scaling.
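A small numerical illustration of this argument, with hypothetical dimensions and coefficients: on a support with non-empty interior, $a_{-i}^{\top} z_{-i}$ has positive variance for any nonzero $a_{-i}$, so it cannot be constant.

```python
import numpy as np
rng = np.random.default_rng(0)

# Unintervened latents z_{-i} sampled from a support with non-empty interior
# (Assumption 5.2); the uniform distribution here is a hypothetical choice.
z_minus_i = rng.uniform(-1, 1, size=(10000, 3))

# For any nonzero coefficient vector a_{-i}, the term a_{-i}^T z_{-i} varies over
# such a support, so ẑ_k = a_{-i}^T z_{-i} + e z_i + b could not be constant there.
a_minus_i = np.array([0.5, -0.2, 0.0])
print(np.var(z_minus_i @ a_minus_i))   # clearly positive

# Hence the constancy imposed by Constraint 5.1 forces a_{-i} = 0, leaving
# ẑ_k = e z_i + b, i.e., the shift-and-scale identification of Theorem 5.3.
```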

Beyond polynomial map gg.

In Theorem 5.3, we assume that the map $g$ is a polynomial. In the Appendix (Theorem A.12), we show that even when $g$ is not a polynomial but a general diffeomorphism, the intervened latent can be approximately identified up to an invertible transform provided sufficiently many $do$ interventional distributions per latent are available. That said, one interventional distribution per latent no longer suffices, unlike in the polynomial $g$ case. Our experiments on images in § 8 further support this argument. We state Theorem A.12 informally below.

Theorem.

(Informal) Suppose the observational data is generated from Eq. 1 and suppose we gather multiple interventional datasets for latent $z_i$, where in each interventional dataset $z_i$ is set to a distinct fixed value under a $do$ intervention following Eq. 4. If the number of $do$ interventional datasets is sufficiently large and the support of the latents satisfies certain regularity conditions (detailed in Theorem A.12), then the autoencoder that solves Eq. 3 under multiple constraints of the form of Constraint 5.1 approximately identifies $z_i$ up to an invertible transform.

5.2 General perfect and imperfect interventions

In the discussion so far, we have focused on $do$ interventions. In this section, our goal is to establish identification guarantees under imperfect interventions. In the example that follows, we motivate the class of imperfect interventions we consider.

Motivating example of perfect & imperfect interventions on images.

First, we revisit perfect interventions in causal DAGs (Peters et al., 2017). Under a perfect intervention, the intervened latent is disconnected from its parents; $do$ interventions are a special case of perfect interventions. Consider the two balls shown in Fig. 2a. Suppose Ball 1 has a strong influence on Ball 2 in the observational DAG shown in Fig. 2b. As a result, the position of Ball 1 determines the region where Ball 2 can be located inside the box in Fig. 2a. Now imagine that a perfect intervention is carried out as shown in Fig. 2c. Under this intervention, the second coordinate of Ball 2 is not restricted by Ball 1 and takes all possible values in the box. Do we need perfect interventions to ensure that Ball 2 can be located anywhere in the box? Even an imperfect intervention that reduces the strength of the influence of Ball 1 on Ball 2 can suffice to ensure that Ball 2 takes all possible locations in the box. In this section, we consider such imperfect interventions, which guarantee that the range of values the intervened latent takes does not depend on its non-descendants. We formalize this below.

Definition 5.4.

(Wang & Jordan, 2021) Consider a random variable $V = [V_1, V_2]$ sampled from $\mathbb{P}_V$. $V_1$ and $V_2$ are said to have independent support if $\mathcal{V} = \mathcal{V}_1 \times \mathcal{V}_2$, where $\mathcal{V}$ is the support of $\mathbb{P}_V$, $\mathcal{V}_j$ is the support of the marginal distribution of $V_j$ for $j \in \{1, 2\}$, and $\times$ is the Cartesian product.

Observe that two random variables can be dependent but have independent support. Suppose $z$ is generated from a structural causal model with an underlying DAG and $z_i$ undergoes an imperfect intervention. We consider imperfect interventions such that each pair $(z_i, z_j)$ satisfies support independence (Definition 5.4), where $z_j$ is a non-descendant of $z_i$ in the underlying DAG. Below we characterize imperfect interventions that satisfy support independence.

Characterizing imperfect interventions that lead to support independence.

Suppose $z_i \leftarrow w(\mathsf{Pa}(z_i), u)$, where $\mathsf{Pa}(z_i)$ is the value of the set of parents of $z_i$, $u \in \mathcal{U}$ is a noise variable that is independent of the ancestors of $z_i$, and $w$ is the map that generates $z_i$. We carry out an imperfect intervention on $z_i$ and change the map $w$ to $v$. If the range of values assumed by $v$ is the same for any two values assumed by the parents, then the support of $z_i$ is independent of all its non-descendants. Formally, the condition is $v(\mathsf{Pa}(z_i), \mathcal{U}) = v(\mathsf{Pa}'(z_i), \mathcal{U})$, where $\mathsf{Pa}(z_i)$ and $\mathsf{Pa}'(z_i)$ are any two sets of values assumed by the parents.
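As a toy illustration of this characterization (the mechanisms and distributions below are hypothetical, not taken from the paper): under the original mechanism the child's conditional support moves with the parent, whereas under the modified mechanism the range of $v$ is the same for every parent value, so the child remains statistically dependent on the parent but has independent support.

```python
import numpy as np
rng = np.random.default_rng(0)
N = 100000
z1 = rng.uniform(0, 1, size=N)                 # parent
u  = rng.uniform(0, 1, size=N)                 # exogenous noise

# Observational mechanism w: the support of z2 given z1 shifts with z1 (Fig. 1a).
z2_obs = 0.8 * z1 + 0.2 * u

# Imperfect intervention: replace w by v(pa, u) = u ** (1 / (1 + pa)).  For every
# parent value, v ranges over (0, 1) as u varies, so the support of z2 no longer
# depends on z1, although its distribution still does (Fig. 1d).
z2_int = u ** (1.0 / (1.0 + z1))

for name, z2 in [("observational", z2_obs), ("imperfect intervention", z2_int)]:
    spans = [(z2[np.abs(z1 - c) < 0.05].min().round(2),
              z2[np.abs(z1 - c) < 0.05].max().round(2)) for c in (0.1, 0.5, 0.9)]
    print(name, spans)   # conditional supports move in the first case, not the second
```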

We are now ready to describe the geometric properties we require of the interventional distribution $\mathbb{P}_Z^{(i)}$ in Eq. 2. We first introduce some notation. Let $[d] \coloneqq \{1, \cdots, d\}$. For each $j \in [d]$, we define the supremum and infimum of the component $z_j$ under the interventional distribution: let $\alpha_{\sup}^j$ ($\alpha_{\inf}^j$) be the supremum (infimum) of the set $\mathcal{Z}_j^{(i)}$.

Assumption 5.5.

Consider $z$ sampled from the interventional distribution $\mathbb{P}_Z^{(i)}$ in Eq. 2. There exists $\mathcal{S} \subseteq [d]$ such that the support of $z_i$ is independent of $z_j$ for all $j \in \mathcal{S}$, i.e., for all $j \in \mathcal{S}$,

$\mathcal{Z}^{(i)}_{i,j} = \mathcal{Z}^{(i)}_i \times \mathcal{Z}^{(i)}_j.$   (5)

For all $j \in [d]$, $-\infty < \alpha_{\inf}^j \leq \alpha_{\sup}^j < \infty$. There exists $\zeta > 0$ such that $\Pi_{j \in [d]} \big( (\alpha_{\sup}^j - \zeta, \alpha_{\sup}^j) \cup (\alpha_{\inf}^j, \alpha_{\inf}^j + \zeta) \big) \subseteq \mathcal{Z}^{(i)}$, where $\Pi$ denotes the Cartesian product.

The distribution $\mathbb{P}_Z^{(i)}$ above is quite general: it encompasses (i) all perfect interventions, since they render the intervened latent independent of its non-descendants, and (ii) imperfect interventions that lead to independent support, as characterized above. The latter part of the above assumption is a regularity condition on the geometry of the support. It ensures that the support of $z$ has a $\zeta$-thick boundary for some $\zeta > 0$.

We now describe a constraint on the encoder that leverages the geometric signature of imperfect interventions in Assumption 5.5. Recall $\hat{z}_k = f_k(x)$. Let $\hat{\mathcal{Z}} = f(\mathcal{X})$ and $\hat{\mathcal{Z}}^{(i)} = f(\mathcal{X}^{(i)})$ denote the support of the encoder $f$'s output on observational data and interventional data respectively. $\hat{\mathcal{Z}}^{(i)}_{k,m}$ denotes the joint support of $(\hat{z}_k, \hat{z}_m)$ and $\hat{\mathcal{Z}}_k^{(i)}$ the support of $\hat{z}_k$ in interventional data. Similarly, we define $\hat{\mathcal{Z}}_{k,m}$ and $\hat{\mathcal{Z}}_k$ for observational data.

Constraint 5.6.

Given a set $\mathcal{S}'$, for each $m \in \mathcal{S}'$ the pair $(\hat{z}_k, \hat{z}_m)$ satisfies support independence on interventional data, i.e.,

$\hat{\mathcal{Z}}^{(i)}_{k,m} = \hat{\mathcal{Z}}^{(i)}_k \times \hat{\mathcal{Z}}^{(i)}_m, \quad \forall m \in \mathcal{S}'.$

In Constraint 5.6, the index $k$ and the set $\mathcal{S}'$ are not necessarily the same as $i$ and $\mathcal{S}$ from Assumption 5.5. In the theorem that follows, we require $|\mathcal{S}'| \leq |\mathcal{S}|$ to guarantee that a solution to Constraint 5.6 exists. In Appendix A.3, we explain how this requirement can be easily relaxed. Note that Constraint 5.6 bears similarity to Constraint 5.1 from the case of $do$ interventions: both constraints ensure that the support of the $k^{th}$ component is independent of all other components. In the theorem that follows, we show that Constraint 5.6 helps achieve block affine identification, which we formally define below.

Definition 5.7.

If $\hat{z} = \tilde{\Lambda} \Pi z + c$ for all $z \in \mathcal{Z} \cup \mathcal{Z}^{(i)}$, where $\Pi$ is a permutation matrix and $\tilde{\Lambda}$ is an invertible matrix with a submatrix that is zero, then $\hat{z}$ is said to block-affine identify $z$.

Theorem 5.8.

Suppose the observational data and interventional data are generated from Eq. 1 and Eq. 2 respectively under Assumptions 4.1, 4.2, and 5.5. The autoencoder that solves Eq. 3 under Constraints 4.3 and 5.6 (with $|\mathcal{S}'| \leq |\mathcal{S}|$) achieves block affine identification. More specifically, $\forall z \in \mathcal{Z} \cup \mathcal{Z}^{(i)}$,

$\hat{z}_k = a_k^{\top} z + c_k, \qquad \hat{z}_m = a_m^{\top} z + c_m, \quad \forall m \in \mathcal{S}',$

where $a_k$ contains at most $d - |\mathcal{S}'|$ non-zero elements and, for all $m \in \mathcal{S}'$, each component of $a_m$ is zero whenever the corresponding component of $a_k$ is non-zero.

First, from Theorem 4.4, $\hat{z} = Az + c$. From the above theorem, it follows that $\hat{z}_k$ depends linearly on at most $d - |\mathcal{S}'|$ latents rather than all the latents. Each $\hat{z}_m$ with $m \in \mathcal{S}'$ does not depend on any of the latents that $\hat{z}_k$ depends on. As a result, $|\mathcal{S}'| + 1$ rows of $A$ (from Theorem 4.4) are sparse. Observe that if $|\mathcal{S}'| = |\mathcal{S}| = d-1$, then as a result of the above theorem, $\hat{z}_k$ identifies some $z_j$ up to scale and shift. Further, the remaining components $\hat{z}_{-k}$ depend linearly on $z_{-j}$ and do not depend on $z_j$. The proof of Theorem 5.8 is in Appendix A.3.

6 Extensions to Identification with Observational Data & Independent Support

In the previous section, we showed that interventions induce geometric structure (independence of supports) in the support of the latents that helps achieve strong identification guarantees. In this section, we consider a special case where such geometric structure is already present in the support of the latents in the observational data. Since we only work with observational data in this section, we set the interventional supports $\mathcal{Z}^{(i)} = \mathcal{X}^{(i)} = \emptyset$, where $\emptyset$ is the empty set. For each $j \in [d]$, define $\beta_{\sup}^j$ to be the supremum of the support of $z_j$, i.e., of $\mathcal{Z}_j$. Similarly, for each $j \in [d]$, define $\beta_{\inf}^j$ to be the infimum of the set $\mathcal{Z}_j$.

Assumption 6.1.

The support of $\mathbb{P}_Z$ in Eq. 1 satisfies pairwise support independence between all pairs of latents. Formally,

$\mathcal{Z}_{r,s} = \mathcal{Z}_r \times \mathcal{Z}_s, \quad \forall r \neq s, \; r, s \in [d].$   (6)

For all $r \in [d]$, $-\infty < \beta_{\inf}^r \leq \beta_{\sup}^r < \infty$. There exists $\zeta > 0$ such that $\Pi_{r \in [d]} \big( (\beta_{\sup}^r - \zeta, \beta_{\sup}^r) \cup (\beta_{\inf}^r, \beta_{\inf}^r + \zeta) \big) \subseteq \mathcal{Z}$, where $\Pi$ denotes the Cartesian product.

Following the previous sections, we state a constraint through which the learner leverages the geometric structure of the support in Assumption 6.1 when searching for the autoencoder.

Constraint 6.2.

Each pair $(\hat{z}_k, \hat{z}_m)$, where $k, m \in [d]$ and $k \neq m$, satisfies support independence on observational data, i.e., $\hat{\mathcal{Z}}_{k,m} = \hat{\mathcal{Z}}_k \times \hat{\mathcal{Z}}_m$, where $\hat{\mathcal{Z}}_{k,m}$ is the joint support of $(\hat{z}_k, \hat{z}_m)$ and $\hat{\mathcal{Z}}_k$ is the support of $\hat{z}_k$.

Theorem 6.3.

Suppose the observational data is generated from Eq. 1 under Assumptions 4.1, 4.2, and 6.1. The autoencoder that solves Eq. 3 under Constraint 6.2 achieves identification up to permutation, shift, and scaling. Specifically, $\forall z \in \mathcal{Z}$, $\hat{z} = \Lambda \Pi z + c$, where $\hat{z}$ is the output of the encoder $f$, $z$ is the true latent, $\Pi$ is a permutation matrix, and $\Lambda$ is an invertible diagonal matrix.

The proof of Theorem 6.3 is in Appendix A.4. Theorem 6.3 says that independence of the latents' supports is sufficient to achieve identification up to permutation, shift, and scaling from observational data. Theorem 6.3 has important implications for the seminal work on linear ICA (Comon, 1994); consider the simple case of a linear $g$. Comon (1994) shows that, if the latent variables are independent and non-Gaussian, then the latent variables can be identified up to permutation and scaling. Theorem 6.3 states that, even if the latent variables are dependent, they can be identified up to permutation, shift, and scaling, as long as they are bounded (hence non-Gaussian) and satisfy pairwise support independence.

Finally, Theorem 6.3 provides a first general theoretical justification for recent proposals of unsupervised disentanglement via the independent support condition (Wang & Jordan, 2021; Roth et al., 2022).

7 Learning Representations from Geometric Signatures: Practical Considerations

In this section, we describe practical algorithms to solve the constrained representation learning problems in § 5 and 6.

To perform constrained representation learning with $do$-intervention data, we proceed in two steps. In the first step, we minimize the reconstruction objective $f^{\dagger}, h^{\dagger} = \operatorname{arg\,min}_{f,h} \mathbb{E}\big[\|h \circ f(X) - X\|^2\big]$, where $h$ is the decoder, $f$ is the encoder, and the expectation is taken over observational and interventional data. In the experiments, we restrict $h$ to be a polynomial and show that the learned $f^{\dagger}$ achieves affine identification, as proved in Theorem 4.4.

In the second step, we learn a linear map to transform the learned representations and enforce Constraint 5.1. For each interventional distribution $\mathbb{P}_X^{(i)}$, we learn a different linear map $\gamma_i$ that projects the representation such that it takes an arbitrary fixed value $z^{\dagger}_i$ on the support of $\mathbb{P}_X^{(i)}$. We write this objective as

$\min_{\{\gamma_i\}} \sum_i \mathbb{E}_{X \sim \mathbb{P}_X^{(i)}} \Big[ \big\| \gamma_i^{\top} f^{\dagger}(X) - z^{\dagger}_i \big\|^2 \Big].$   (7)

We construct a matrix $\Gamma$ with the different $\gamma_i^{\top}$ as its rows. The final output representation is $\Gamma f^{\dagger}(X)$. In the experiments, we show that this representation achieves permutation, shift, and scaling identification as predicted by Theorem 5.3. A few remarks are in order: (i) $z_i^{\dagger}$ is arbitrary, and the learner does not know the true $do$-intervention value; (ii) for ease of exposition, Eq. 7 assumes knowledge of the index of the intervened latent, which can be easily relaxed by multiplying $\Gamma$ with a permutation matrix.
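A minimal sketch of this second step is given below. Eq. 7 is a least-squares problem in each $\gamma_i$; the toy usage assumes an already affinely-identified representation $\hat{z} = Az + c$ with hypothetical $A$, $c$, and dimensions. This is an illustration under those assumptions, not the released implementation.

```python
import numpy as np

# Step 2 for do-intervention data (Eq. 7) via plain least squares.  Assumed input:
# reps_int[i] holds the step-1 representation f†(X) evaluated on the i-th
# do-interventional dataset (shape: num_samples x d).
def learn_gamma(reps_int, z_dagger=1.0):
    """Choose each row gamma_i so that gamma_i^T f†(x) is (approximately) the
    constant z_dagger on the i-th interventional support (Constraint 5.1)."""
    rows = []
    for R in reps_int:
        target = np.full(R.shape[0], z_dagger)
        gamma_i, *_ = np.linalg.lstsq(R, target, rcond=None)
        rows.append(gamma_i)
    return np.stack(rows)            # Gamma; the final representation is Gamma @ f†(x)

# Toy usage with an assumed affinely-identified representation ẑ = A z + c:
rng = np.random.default_rng(0)
d = 3
A, c = rng.standard_normal((d, d)), rng.standard_normal(d)
reps_int = []
for i in range(d):                   # one do-interventional dataset per latent
    z = rng.uniform(-1, 1, size=(2000, d))
    z[:, i] = 2.0                    # do(z_i = 2); the value is unknown to the learner
    reps_int.append(z @ A.T + c)
Gamma = learn_gamma(reps_int)
print(np.round(Gamma @ A, 2))        # close to a scaled permutation matrix (Thm. 5.3)
```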

We next describe an algorithm that learns representations to enforce independence of support (leveraged in Theorems 5.8 and 6.3). To measure the (non-)independence of the latents' support, we follow Wang & Jordan (2021); Roth et al. (2022) and measure the distance between the sets in terms of the Hausdorff distance: the Hausdorff distance $\mathsf{HD}$ between sets $\mathcal{S}_1, \mathcal{S}_2$ with $\mathcal{S}_1 \subseteq \mathcal{S}_2$ is $\mathsf{HD}(\mathcal{S}_1, \mathcal{S}_2) = \sup_{z \in \mathcal{S}_2} \big( \inf_{z' \in \mathcal{S}_1} \|z - z'\| \big)$.

To enforce the independent support constraint, we again follow a two-step algorithm. The first step remains the same, i.e., we minimize the reconstruction objective. In the second step, we transform the learned representations $f^{\dagger}(x)$ with an invertible map $\Gamma \in \mathbb{R}^{d \times d}$. The joint support obtained after the transformation is a function of the parameters $\Gamma$ and is denoted $\hat{\mathcal{Z}}(\Gamma)$. Following the notation introduced earlier, the joint support along dimensions $k, m$ is $\hat{\mathcal{Z}}_{k,m}(\Gamma)$ and the marginal support along $k$ is $\hat{\mathcal{Z}}_k(\Gamma)$. We translate Constraint 6.2 into the following objective: find a $\Gamma$ that minimizes

$\min_{\Gamma} \sum_{k \neq m} \mathsf{HD}\big(\hat{\mathcal{Z}}_{k,m}(\Gamma), \; \hat{\mathcal{Z}}_k(\Gamma) \times \hat{\mathcal{Z}}_m(\Gamma)\big).$   (8)

Constraint 5.6 can be translated similarly.
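A rough sketch of how the objective in Eq. 8 can be estimated from samples is given below. The grid-based estimate of the product of marginal supports is a simplification for illustration, not the paper's implementation; `reps` stands for the (assumed precomputed) step-1 representations $f^{\dagger}(x)$.

```python
import numpy as np

def hausdorff_to_product(z_hat, k, m, n_grid=30):
    """Directed Hausdorff distance from the product of empirical marginal supports
    Ẑ_k x Ẑ_m to the empirical joint support Ẑ_{k,m}, for one pair (k, m)."""
    zk, zm = z_hat[:, k], z_hat[:, m]
    joint = np.stack([zk, zm], axis=1)                    # samples of Ẑ_{k,m}
    gk = np.linspace(zk.min(), zk.max(), n_grid)          # grid over Ẑ_k x Ẑ_m
    gm = np.linspace(zm.min(), zm.max(), n_grid)
    prod = np.stack(np.meshgrid(gk, gm), axis=-1).reshape(-1, 2)
    # sup over product points of the distance to the nearest joint sample
    dists = np.linalg.norm(prod[:, None, :] - joint[None, :, :], axis=-1).min(axis=1)
    return dists.max()

def ios_objective(Gamma, reps):
    """Sum of pairwise support-independence penalties for ẑ = Gamma @ f†(x) (Eq. 8)."""
    z_hat = reps @ Gamma.T
    d = z_hat.shape[1]
    return sum(hausdorff_to_product(z_hat, k, m)
               for k in range(d) for m in range(d) if k != m)
```

In practice, $\Gamma$ is then optimized against this penalty, and restricting the sum to pairs with $m \in \mathcal{S}'$ and to interventional samples gives the analogous translation of Constraint 5.6.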

8 Empirical Findings

In this section, we analyze how the practical implementation of the theory holds up in settings ranging from data generated by polynomial decoders to images generated by the PyGame rendering engine (Shinners, 2011). The code to reproduce the experiments can be found at https://github.com/facebookresearch/CausalRepID.

Data generation process.

Polynomial decoder data: The latents for the observational data are sampled from $\mathbb{P}_Z$. $\mathbb{P}_Z$ can be (i) independent uniform, (ii) an SCM with sparse connectivity (SCM-S), or (iii) an SCM with dense connectivity (SCM-D) (Brouillard et al., 2020). The latent variables are then mapped to $x$ using a multivariate polynomial. We use an $n = 200$ dimensional $x$. We use two possible dimensions for the latents ($d$): six and ten. We use polynomials of degree ($p$) two and three. Each element of the matrix $G$ used to generate $x$ is sampled from a standard normal distribution.

Image data: For image-based experiments, we use the PyGame (Shinners, 2011) rendering engine. We generate $64 \times 64 \times 3$ pixel images of the form shown in Fig. 2 and consider a setting with two balls. We consider three distributions for the latents: (i) independent uniform, (ii) a linear SCM with the DAG in Fig. 2, and (iii) a non-linear SCM with the DAG in Fig. 2, where the coordinates of Ball 1 are at the top layer of the DAG and the coordinates of Ball 2 are at the bottom layer.

For both settings above, we carry out $do$ interventions on each latent dimension to generate interventional data.

Model parameters and evaluation metrics.

We follow the two-step training procedure described in § 7. For image-based experiments we use a ResNet-18 encoder (He et al., 2016), and for all other experiments we use an MLP with three hidden layers and two hundred units per layer. We learn a polynomial decoder $h$, as the theory prescribes a polynomial decoder (Constraint 4.3) when $g$ is a polynomial. In § B.3, we also present results with an MLP decoder. To check for affine identification (Theorem 4.4), we measure the $R^2$ score of a linear regression between the output representation and the true representation; a high score indicates affine identification. To verify permutation, shift, and scaling identification (Theorem 6.3), we compute the mean correlation coefficient (MCC) (Khemakhem et al., 2022). For further details on the data generation, models, hyperparameters, and supplementary experiments, refer to App. B.
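For concreteness, here is a minimal sketch of the MCC metric as it is commonly computed (match each learned latent to a distinct true latent by maximizing absolute correlation, then average). This is an illustration of the metric described by Khemakhem et al. (2022), not the authors' evaluation code.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def mcc(z_true, z_hat):
    """Mean correlation coefficient: assign learned latents to true latents by
    maximizing total |correlation|, then average the matched |correlations|."""
    d = z_true.shape[1]
    corr = np.corrcoef(z_true.T, z_hat.T)[:d, d:]       # d x d cross-correlations
    row, col = linear_sum_assignment(-np.abs(corr))     # maximize total |corr|
    return np.abs(corr[row, col]).mean()

# Example: a permuted and rescaled copy of the latents attains MCC = 1.
rng = np.random.default_rng(0)
z = rng.uniform(size=(1000, 3))
print(mcc(z, 2.0 * z[:, [2, 0, 1]] + 1.0))              # 1.0 (up to float error)
```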

$\mathbb{P}_Z$ | $d$ | $p$ | $R^2$ | MCC (IOS)
Uniform | 6 | 2 | 1.00 ± 0.00 | 99.3 ± 0.07
Uniform | 6 | 3 | 1.00 ± 0.00 | 99.4 ± 0.06
Uniform | 10 | 2 | 1.00 ± 0.00 | 90.7 ± 2.92
Uniform | 10 | 3 | 0.99 ± 0.00 | 94.6 ± 1.50
SCM-S | 6 | 2 | 0.96 ± 0.02 | 72.6 ± 1.48
SCM-S | 6 | 3 | 0.87 ± 0.07 | 70.6 ± 1.54
SCM-S | 10 | 2 | 0.99 ± 0.00 | 65.9 ± 1.32
SCM-S | 10 | 3 | 0.90 ± 0.05 | 58.8 ± 1.27
SCM-D | 6 | 2 | 0.97 ± 0.01 | 61.6 ± 4.36
SCM-D | 6 | 3 | 0.81 ± 0.11 | 65.2 ± 2.70
SCM-D | 10 | 2 | 0.83 ± 0.10 | 69.6 ± 3.09
SCM-D | 10 | 3 | 0.72 ± 0.15 | 60.1 ± 1.16
Table 2: Observational data with polynomial decoder $g$: Mean ± S.E. (5 random seeds). $R^2$ and MCC (IOS) (for uniform) have high values, as predicted by Theorem 4.4 and Theorem 6.3 respectively.
$\mathbb{P}_Z$ | $d$ | $p$ | MCC | MCC (IL)
Uniform | 6 | 2 | 69.1 ± 1.11 | 100.0 ± 0.00
Uniform | 6 | 3 | 73.4 ± 0.49 | 100.0 ± 0.00
Uniform | 10 | 2 | 59.9 ± 2.03 | 100.0 ± 0.00
Uniform | 10 | 3 | 65.9 ± 0.80 | 99.9 ± 0.03
SCM-S | 6 | 2 | 68.4 ± 0.90 | 99.5 ± 0.38
SCM-S | 6 | 3 | 74.1 ± 2.32 | 99.3 ± 0.34
SCM-S | 10 | 2 | 68.0 ± 2.36 | 99.9 ± 0.03
SCM-S | 10 | 3 | 66.8 ± 1.10 | 98.8 ± 0.13
SCM-D | 6 | 2 | 71.8 ± 3.77 | 99.6 ± 0.12
SCM-D | 6 | 3 | 79.5 ± 3.45 | 98.2 ± 1.07
SCM-D | 10 | 2 | 70.8 ± 1.89 | 95.3 ± 2.24
SCM-D | 10 | 3 | 70.1 ± 2.80 | 97.2 ± 0.88
Table 3: Interventional data with polynomial decoder $g$: Mean ± S.E. (5 random seeds). MCC (IL) is high, as shown in Theorem 5.3.
Results for polynomial decoder.

Observational data: We consider the setting where the true decoder $g$ is a polynomial and the learned decoder $h$ is also a polynomial. In Table 2, we report the $R^2$ between the representation learned after the first step, where we only minimize the reconstruction loss, and the true representation. The $R^2$ values are high, as predicted by Theorem 4.4. In the second step, we learn a map $\Gamma$ and enforce the independence-of-support constraint by minimizing the Hausdorff distance in Eq. 8. Among the distributions $\mathbb{P}_Z$, only the uniform distribution satisfies support independence (Assumption 6.1), and following Theorem 6.3 we expect MCC to be high only in this case. In Table 2, we report the MCC obtained by enforcing independence of support in the MCC (IOS) column. In § B.3, we also carry out experiments on correlated uniform distributions and observe high MCC (IOS).

Interventional data: We now consider the case where we also have access to $do$-intervention data in addition to observational data, with one $do$ intervention per latent dimension. We follow the two-step procedure described in § 7. In Table 3, we first show the MCC values of the representation obtained after the first step in the MCC column. In the second step, we learn $\Gamma$ by minimizing the interventional loss (IL) in Eq. 7. We report the MCC of the resulting representation in the MCC (IL) column of Table 3; the values are nearly perfect, as predicted by Theorem 5.3.

Results for image dataset.

We follow the two-step procedure described in § 7, except that in the second step we learn a non-linear map (an MLP) to minimize the interventional loss (IL) in Eq. 7. In Table 4, we show the MCC values achieved by the learned representation as we vary the number of $do$ interventional distributions per latent dimension. As shown in Theorem A.12, more interventional distributions per latent dimension improve the MCC.

#interv. dist. | Uniform | SCM linear | SCM non-linear
1 | 34.2 ± 0.24 | 12.8 ± 0.28 | 19.7 ± 0.31
3 | 73.9 ± 0.38 | 73.2 ± 0.33 | 59.7 ± 0.28
5 | 73.6 ± 0.21 | 83.4 ± 0.21 | 62.8 ± 0.2
7 | 72.5 ± 0.34 | 84.2 ± 0.25 | 69.3 ± 0.34
9 | 73.1 ± 0.47 | 86.2 ± 0.17 | 71.4 ± 0.26
Table 4: Interventional data in image-based experiments: Mean ± S.E. (5 random seeds). MCC increases with the number of $do$ interventional distributions per latent dimension (Theorem A.12).

9 Conclusions

In this work, we lay down the theoretical foundations for learning causal representations in the presence of interventional data. We show that geometric signatures such as support independence that are induced under many interventions are useful for provable representation identification. Looking forward, we believe that exploring representation learning with real interventional data (Lopez et al., 2022; Liu et al., 2023) is a fruitful avenue for future work.

Acknowledgments

Yixin Wang acknowledges grant support from the National Science Foundation and the Office of Naval Research. Yoshua Bengio acknowledges the support from CIFAR and IBM. We thank Anirban Das for insightful feedback that helped us correctly state the precise $\zeta$-thickness conditions.

References

  • Ahuja et al. (2021) Ahuja, K., Hartford, J., and Bengio, Y. Properties from mechanisms: an equivariance perspective on identifiable representation learning. arXiv preprint arXiv:2110.15796, 2021.
  • Ahuja et al. (2022a) Ahuja, K., Hartford, J., and Bengio, Y. Weakly supervised representation learning with sparse perturbations. arXiv preprint arXiv:2206.01101, 2022a.
  • Ahuja et al. (2022b) Ahuja, K., Mahajan, D., Syrgkanis, V., and Mitliagkas, I. Towards efficient representation identification in supervised learning. arXiv preprint arXiv:2204.04606, 2022b.
  • Ash et al. (2000) Ash, R. B. and Doléans-Dade, C. A. Probability and measure theory. Academic Press, 2000.
  • Bareinboim et al. (2022) Bareinboim, E., Correa, J. D., Ibeling, D., and Icard, T. On pearl’s hierarchy and the foundations of causal inference. In Probabilistic and Causal Inference: The Works of Judea Pearl, pp.  507–556. 2022.
  • Bengio et al. (2013) Bengio, Y., Courville, A., and Vincent, P. Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence, 35(8):1798–1828, 2013.
  • Bommasani et al. (2021) Bommasani, R., Hudson, D. A., Adeli, E., Altman, R., Arora, S., von Arx, S., Bernstein, M. S., Bohg, J., Bosselut, A., Brunskill, E., et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021.
  • Brehmer et al. (2022) Brehmer, J., De Haan, P., Lippe, P., and Cohen, T. Weakly supervised causal representation learning. arXiv preprint arXiv:2203.16437, 2022.
  • Brouillard et al. (2020) Brouillard, P., Lachapelle, S., Lacoste, A., Lacoste-Julien, S., and Drouin, A. Differentiable causal discovery from interventional data. Advances in Neural Information Processing Systems, 33:21865–21877, 2020.
  • Brown et al. (2020) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
  • Burgess et al. (2018) Burgess, C. P., Higgins, I., Pal, A., Matthey, L., Watters, N., Desjardins, G., and Lerchner, A. Understanding disentangling in beta-vae. arXiv preprint arXiv:1804.03599, 2018.
  • Comon (1994) Comon, P. Independent component analysis, a new concept? Signal processing, 36(3):287–314, 1994.
  • Dixit et al. (2016) Dixit, A., Parnas, O., Li, B., Chen, J., Fulco, C. P., Jerby-Arnon, L., Marjanovic, N. D., Dionne, D., Burks, T., Raychowdhury, R., et al. Perturb-seq: dissecting molecular circuits with scalable single-cell rna profiling of pooled genetic screens. cell, 167(7):1853–1866, 2016.
  • Geirhos et al. (2020) Geirhos, R., Jacobsen, J.-H., Michaelis, C., Zemel, R., Brendel, W., Bethge, M., and Wichmann, F. A. Shortcut learning in deep neural networks. Nature Machine Intelligence, 2(11):665–673, 2020.
  • Goyal & Bengio (2020) Goyal, A. and Bengio, Y. Inductive biases for deep learning of higher-level cognition. arXiv preprint arXiv:2011.15091, 2020.
  • Hälvä & Hyvarinen (2020) Hälvä, H. and Hyvarinen, A. Hidden markov nonlinear ica: Unsupervised learning from nonstationary time series. In Conference on Uncertainty in Artificial Intelligence, pp. 939–948. PMLR, 2020.
  • He et al. (2016) He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.  770–778, 2016.
  • Hyvarinen & Morioka (2016) Hyvarinen, A. and Morioka, H. Unsupervised feature extraction by time-contrastive learning and nonlinear ICA. Advances in neural information processing systems, 29, 2016.
  • Hyvarinen & Morioka (2017) Hyvarinen, A. and Morioka, H. Nonlinear ICA of temporally dependent stationary sources. In Artificial Intelligence and Statistics, pp.  460–469. PMLR, 2017.
  • Hyvärinen & Pajunen (1999) Hyvärinen, A. and Pajunen, P. Nonlinear independent component analysis: Existence and uniqueness results. Neural networks, 12(3):429–439, 1999.
  • Hyvarinen et al. (2019) Hyvarinen, A., Sasaki, H., and Turner, R. Nonlinear ica using auxiliary variables and generalized contrastive learning. In The 22nd International Conference on Artificial Intelligence and Statistics, pp.  859–868. PMLR, 2019.
  • Khemakhem et al. (2020) Khemakhem, I., Monti, R., Kingma, D., and Hyvarinen, A. Ice-beem: Identifiable conditional energy-based deep models based on nonlinear ICA. Advances in Neural Information Processing Systems, 33:12768–12778, 2020.
  • Khemakhem et al. (2022) Khemakhem, I., Kingma, D., Monti, R., and Hyvarinen, A. Variational autoencoders and nonlinear ICA: A unifying framework. In International Conference on Artificial Intelligence and Statistics, pp.  2207–2217. PMLR, 2022.
  • Klindt et al. (2020) Klindt, D., Schott, L., Sharma, Y., Ustyuzhaninov, I., Brendel, W., Bethge, M., and Paiton, D. Towards nonlinear disentanglement in natural data with temporal sparse coding. arXiv preprint arXiv:2007.10930, 2020.
  • Lachapelle et al. (2022) Lachapelle, S., Rodriguez, P., Sharma, Y., Everett, K. E., Le Priol, R., Lacoste, A., and Lacoste-Julien, S. Disentanglement via mechanism sparsity regularization: A new principle for nonlinear ICA. In Conference on Causal Learning and Reasoning, pp. 428–484. PMLR, 2022.
  • Lippe et al. (2022a) Lippe, P., Magliacane, S., Löwe, S., Asano, Y. M., Cohen, T., and Gavves, E. icitris: Causal representation learning for instantaneous temporal effects. arXiv preprint arXiv:2206.06169, 2022a.
  • Lippe et al. (2022b) Lippe, P., Magliacane, S., Löwe, S., Asano, Y. M., Cohen, T., and Gavves, S. Citris: Causal identifiability from temporal intervened sequences. In International Conference on Machine Learning, pp. 13557–13603. PMLR, 2022b.
  • Liu et al. (2023) Liu, Y., Alahi, A., Russell, C., Horn, M., Zietlow, D., Schölkopf, B., and Locatello, F. Causal triplet: An open challenge for intervention-centric causal representation learning. arXiv preprint arXiv:2301.05169, 2023.
  • Locatello et al. (2019) Locatello, F., Bauer, S., Lucic, M., Raetsch, G., Gelly, S., Schölkopf, B., and Bachem, O. Challenging common assumptions in the unsupervised learning of disentangled representations. In international conference on machine learning, pp. 4114–4124. PMLR, 2019.
  • Locatello et al. (2020) Locatello, F., Poole, B., Rätsch, G., Schölkopf, B., Bachem, O., and Tschannen, M. Weakly-supervised disentanglement without compromises. In International Conference on Machine Learning, pp. 6348–6359. PMLR, 2020.
  • Lopez et al. (2022) Lopez, R., Tagasovska, N., Ra, S., Cho, K., Pritchard, J. K., and Regev, A. Learning causal representations of single cells via sparse mechanism shift modeling. arXiv preprint arXiv:2211.03553, 2022.
  • Mityagin (2015) Mityagin, B. The zero set of a real analytic function. arXiv preprint arXiv:1512.07276, 2015.
  • Mooij & Heskes (2013) Mooij, J. and Heskes, T. Cyclic causal discovery from continuous equilibrium data. arXiv preprint arXiv:1309.6849, 2013.
  • Nejatbakhsh et al. (2021) Nejatbakhsh, A., Fumarola, F., Esteki, S., Toyoizumi, T., Kiani, R., and Mazzucato, L. Predicting perturbation effects from resting activity using functional causal flow. bioRxiv, pp.  2020–11, 2021.
  • Pearl (2009) Pearl, J. Causal inference in statistics: An overview. Statistics surveys, 3:96–146, 2009.
  • Pedregosa et al. (2011) Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
  • Peters et al. (2017) Peters, J., Janzing, D., and Schölkopf, B. Elements of causal inference: foundations and learning algorithms. The MIT Press, 2017.
  • Radford et al. (2021) Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pp. 8748–8763. PMLR, 2021.
  • Roth et al. (2022) Roth, K., Ibrahim, M., Akata, Z., Vincent, P., and Bouchacourt, D. Disentanglement of correlated factors via hausdorff factorized support. arXiv preprint arXiv:2210.07347, 2022.
  • Schölkopf et al. (2021) Schölkopf, B., Locatello, F., Bauer, S., Ke, N. R., Kalchbrenner, N., Goyal, A., and Bengio, Y. Towards causal representation learning. arXiv preprint arXiv:2102.11107, 2021.
  • Seigal et al. (2022) Seigal, A., Squires, C., and Uhler, C. Linear causal disentanglement via interventions. arXiv preprint arXiv:2211.16467, 2022.
  • Shinners (2011) Shinners, P. Pygame. http://pygame.org/, 2011.
  • Von Kügelgen et al. (2021) Von Kügelgen, J., Sharma, Y., Gresele, L., Brendel, W., Schölkopf, B., Besserve, M., and Locatello, F. Self-supervised learning with data augmentations provably isolates content from style. Advances in neural information processing systems, 34:16451–16467, 2021.
  • Wang & Jordan (2021) Wang, Y. and Jordan, M. I. Desiderata for representation learning: A causal perspective. arXiv preprint arXiv:2109.03795, 2021.
  • Yamada et al. (2022) Yamada, Y., Tang, T., and Ilker, Y. When are lemons purple? the concept association bias of clip. arXiv preprint arXiv:2212.12043, 2022.
  • Yao et al. (2021) Yao, W., Sun, Y., Ho, A., Sun, C., and Zhang, K. Learning temporally causal latent processes from general temporal data. arXiv preprint arXiv:2110.05428, 2021.
  • Yao et al. (2022a) Yao, W., Chen, G., and Zhang, K. Learning latent causal dynamics. arXiv preprint arXiv:2202.04828, 2022a.
  • Yao et al. (2022b) Yao, W., Sun, Y., Ho, A., Sun, C., and Zhang, K. Learning temporally causal latent processes from general temporal data. In International Conference on Learning Representations, 2022b. URL https://openreview.net/forum?id=RDlLMjLJXdq.
  • Zimmermann et al. (2021) Zimmermann, R. S., Sharma, Y., Schneider, S., Bethge, M., and Brendel, W. Contrastive learning inverts the data generating process. In International Conference on Machine Learning, pp. 12979–12990. PMLR, 2021.

 

Interventional Causal Representation Learning Appendices

 

Contents

We organize the Appendix as follows.

  • In App. A, we present the proofs for the theorems that were presented in the main body of the paper.

    • In § A.1, we derive the affine identification guarantees and their approximations in various settings. (Theorem 4.4)

    • In § A.2, we derive the do-intervention-based identification guarantees and their extensions. (Theorem 5.3)

    • In § A.3, we present representation identification guarantees for imperfect interventions. (Theorem 5.8)

    • In § A.4, we present representation identification guarantees for observational data with independent support. (Theorem 6.3)

  • In App. B, we present supplementary materials for the experiments.

    • In § B.1, we present the pseudocode for the method used to learn the representations.

    • In § B.2, we present the details of the setup used in the experiments with the polynomial decoder gg.

    • In § B.3, we present supplementary results for the setting with polynomial decoder gg.

    • In § B.4, we present the details of the setup used in the experiments with image data.

    • In § B.5, we present supplementary results for the setting with image data.

Appendix A Proofs and Technical Details

In this section, we provide the proofs for the theorems. We restate the theorems for convenience.

Preliminaries and notation.

We state the formal definition of the support of a random variable. In most of this work, we operate on the measure space (\mathbb{R}^{d},\mathcal{B},\lambda), where \mathcal{B} is the Borel sigma field over \mathbb{R}^{d} and \lambda is the Lebesgue measure over the completion of the Borel sets on \mathbb{R}^{d} (Ash et al., 2000). For a random variable X, the support is \mathcal{X}=\{x\in\mathbb{R}^{d}:d\mathbb{P}_{X}(x)>0\}, where d\mathbb{P}_{X}(x) is the Radon-Nikodym derivative of \mathbb{P}_{X} w.r.t. the Lebesgue measure over the completion of the Borel sets on \mathbb{R}^{d}. For the random variable Z, \mathcal{Z} is the support of Z in the observational data, and the support of the component Z_{j} of Z is \mathcal{Z}_{j}. Similarly, \mathcal{Z}^{(i)} is the support of Z when Z_{i} is intervened, and the support of the component Z_{j} of Z in the intervened data is \mathcal{Z}_{j}^{(i)}.

A.1 Affine Identification

Lemma A.1.

If the matrix G that defines the polynomial g has full column rank and p>0, then g is injective.

Proof.

Suppose this is not the case and g(z1)=g(z2)g(z_{1})=g(z_{2}) for some z1z2z_{1}\not=z_{2}. Thus

\begin{split}&G\begin{bmatrix}1\\ z_{1}\\ z_{1}\bar{\otimes}z_{1}\\ \vdots\\ \underbrace{z_{1}\bar{\otimes}\cdots\bar{\otimes}\;z_{1}}_{p\;\text{times}}\end{bmatrix}=G\begin{bmatrix}1\\ z_{2}\\ z_{2}\bar{\otimes}z_{2}\\ \vdots\\ \underbrace{z_{2}\bar{\otimes}\cdots\bar{\otimes}\;z_{2}}_{p\;\text{times}}\end{bmatrix}\\ \implies&G\begin{bmatrix}0\\ (z_{1}-z_{2})\\ z_{1}\bar{\otimes}z_{1}-z_{2}\bar{\otimes}z_{2}\\ \vdots\\ \underbrace{z_{1}\bar{\otimes}\cdots\bar{\otimes}\;z_{1}}_{p\;\text{times}}-\underbrace{z_{2}\bar{\otimes}\cdots\bar{\otimes}\;z_{2}}_{p\;\text{times}}\end{bmatrix}=0\end{split} (9)

Since z1z2z_{1}\not=z_{2} we find a non-zero vector in the null space of GG which contradicts the fact that GG has full column rank. Therefore, it cannot be the case that g(z1)=g(z2)g(z_{1})=g(z_{2}) for some z1z2z_{1}\not=z_{2}. Thus gg has to be injective. ∎
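As a quick numerical sanity check of Lemma A.1 (our illustration, not part of the proof), the snippet below builds the monomial basis, draws a random coefficient matrix G with full column rank, and verifies that two distinct latents map to distinct observations:

import numpy as np
from itertools import combinations_with_replacement

def poly_features(z, p):
    # [1, z, distinct entries of z otimes z, ..., degree-p terms]
    feats = [np.ones(1)]
    for deg in range(1, p + 1):
        feats.append(np.array([np.prod(z[list(idx)])
                               for idx in combinations_with_replacement(range(len(z)), deg)]))
    return np.concatenate(feats)

rng = np.random.default_rng(0)
d, p = 3, 2
q = len(poly_features(np.zeros(d), p))   # number of monomials of degree at most p
n = 50                                    # observation dimension, n >= q
G = rng.standard_normal((n, q))           # full column rank with probability one
assert np.linalg.matrix_rank(G) == q

g = lambda z: G @ poly_features(z, p)
z1, z2 = rng.uniform(-1, 1, d), rng.uniform(-1, 1, d)
print(np.allclose(g(z1), g(z2)))          # False: distinct latents give distinct observations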

Lemma A.2.

If v1v_{1} is a polynomial of degree k1k_{1} and v2v_{2} is a polynomial of degree k2k_{2}, then v1v2v_{1}v_{2} is a polynomial of degree k1+k2k_{1}+k_{2}.

Proof.

We separate vi(z)v_{i}(z) into two parts – the terms with degree kik_{i} (ui(z)u_{i}(z)) and the terms with degree less than kik_{i} (wi(z)w_{i}(z)) for i{1,2}i\in\{1,2\}. We obtain the following expression.

v_{1}(z)v_{2}(z)=(u_{1}(z)+w_{1}(z))(u_{2}(z)+w_{2}(z))=u_{1}(z)u_{2}(z)+u_{1}(z)w_{2}(z)+u_{2}(z)w_{1}(z)+w_{1}(z)w_{2}(z) (10)

The maximum degree achieved by u1(z)u2(z)u_{1}(z)u_{2}(z) is k1+k2k_{1}+k_{2}. For the other terms, the maximum is bounded above by k1+k21k_{1}+k_{2}-1. To prove the result, we need to show that u1(z)u2(z)u_{1}(z)u_{2}(z) has a degree k1+k2k_{1}+k_{2}.

We first start with a simple case. Suppose u1(z)u_{1}(z) and u2(z)u_{2}(z) do not share any component of zz that they both depend on. In such a case, if we take the leading degree term in u1u_{1} and u2u_{2} respectively and multiply them then we obtain distinct terms of degree k1+k2k_{1}+k_{2}.

Suppose u1u_{1} and u2u_{2} both depend on z1z_{1}. We write u1(z)u_{1}(z) as

u1(z)=idji=k1θji=1dzidji=idji=k1θjcj(z)u_{1}(z)=\sum_{\sum_{i}d_{ji}=k_{1}}\theta_{j}\prod_{i=1}^{d}z_{i}^{d_{ji}}=\sum_{\sum_{i}d_{ji}=k_{1}}\theta_{j}c_{j}(z)

where cj(z)=izidjic_{j}(z)=\prod_{i}z_{i}^{d_{ji}} is a degree k1k_{1} polynomial. Note that for each jj, cjc_{j} is a different polynomial, i.e. for jqj\not=q, cjcqc_{j}\not=c_{q}. We write u2(z)u_{2}(z) as

u2(z)=idji=k2βji=1dzidji=idji=k2βjcj(z)u_{2}(z)=\sum_{\sum_{i}d_{ji}=k_{2}}\beta_{j}\prod_{i=1}^{d}z_{i}^{d_{ji}}=\sum_{\sum_{i}d_{ji}=k_{2}}\beta_{j}c_{j}(z)

We collect all the terms in u1u_{1} that have the highest degree associated with z1z_{1} such that the coefficient θj\theta_{j} is non-zero. We denote the highest degree as rr and write these terms as

qθqz1ri=2dzidqi=qθqz1rωq(z)\sum_{q}\theta_{q}z_{1}^{r}\prod_{i=2}^{d}z_{i}^{d_{qi}}=\sum_{q}\theta_{q}z_{1}^{r}\omega_{q}(z)

where ωq(z)=i=2dzidqi\omega_{q}(z)=\prod_{i=2}^{d}z_{i}^{d_{qi}}, qlωqωlq\not=l\implies\omega_{q}\not=\omega_{l}, and r1r\geq 1

From u_{2}(z), we similarly collect the terms with the highest degree for z_{1} such that the coefficient \beta_{j} is non-zero. We denote this highest degree as s and write these terms as

tβtz1si=2dzidti=tβtz1sηt(z)\sum_{t}\beta_{t}z_{1}^{s}\prod_{i=2}^{d}z_{i}^{d_{ti}}=\sum_{t}\beta_{t}z_{1}^{s}\eta_{t}(z)

where ηt(z)=i=2dzidti\eta_{t}(z)=\prod_{i=2}^{d}z_{i}^{d_{ti}}, tlηtηlt\not=l\implies\eta_{t}\not=\eta_{l}, and s1s\geq 1.

As a result, u1(z)u2(z)u_{1}(z)u_{2}(z) will contain the term

z1r+sqθqωq(z)tβtηt(z)z_{1}^{r+s}\sum_{q}\theta_{q}\omega_{q}(z)\sum_{t}\beta_{t}\eta_{t}(z)
z1r+sδ1(z)δ2(z)z_{1}^{r+s}\delta_{1}(z)\delta_{2}(z)

where \delta_{1}(z)=\sum_{q}\theta_{q}\omega_{q}(z) and \delta_{2}(z)=\sum_{t}\beta_{t}\eta_{t}(z). We will use the principle of induction on the degree of the polynomial to prove the claim.

We first establish the base case for k_{1}=1 and k_{2}=1. Consider two polynomials \rho_{1}^{\top}z and \rho_{2}^{\top}z. We multiply the two to obtain \sum_{i,j}\rho_{1i}\rho_{2j}z_{i}z_{j}. Consider two cases. In the first case, the two polynomials have at least one non-zero coefficient for the same component z_{i}; the product then contains the term \rho_{1i}\rho_{2i}z_{i}^{2} with a non-zero coefficient, which establishes the base case. In the second case, the two polynomials share no component with non-zero coefficients. In such a case, each term with a non-zero coefficient is of the form \rho_{1i}\rho_{2j}z_{i}z_{j} with i\not=j, and these terms cannot cancel, which again establishes the base case. The other cases, with k_{1}=0 and k_{2}=1, or k_{2}=0 and k_{1}=1, or both k_{1}=0 and k_{2}=0, are trivially true. Thus we have established the base case for all polynomials (with arbitrary dimension for z) of degrees k_{1}\leq 1 and k_{2}\leq 1.

By the induction hypothesis, we can assume that the claim is true for all pairs of polynomials v_{1} with degree at most k_{1}-1 and v_{2} with degree at most k_{2}-1. Since \delta_{1}(z) has degree k_{1}-r and \delta_{2}(z) has degree k_{2}-s with r,s\geq 1, it follows that the degree of \delta_{1}(z)\delta_{2}(z) is k_{1}+k_{2}-r-s.

We can write δ1δ2\delta_{1}\delta_{2} in terms of the terms with degree equal to k1+k2rsk_{1}+k_{2}-r-s (δ(z)\delta^{{}^{\prime}}(z)) and terms that have a degree less than k1+k2rsk_{1}+k_{2}-r-s (δ(z)\delta^{*}(z)). As a result, we can simplify z1r+sδ1(z)δ2(z)z_{1}^{r+s}\delta_{1}(z)\delta_{2}(z) to obtain

z1r+s(δ(z)+δ(z))z_{1}^{r+s}(\delta^{{}^{\prime}}(z)+\delta^{*}(z)) (11)

The degree of z_{1}^{r+s}\delta^{*}(z) is at most k_{1}+k_{2}-1. The degree of z_{1}^{r+s}\delta^{\prime}(z) has to be k_{1}+k_{2}, since \delta^{\prime}(z) does not depend on z_{1} and is of degree k_{1}+k_{2}-r-s. Note that this is the only term in the entire polynomial u_{1}(z)u_{2}(z) that carries the highest degree for z_{1} (namely z_{1}^{r+s}), since the other terms have a smaller degree associated with z_{1}; thus the coefficient of this term cannot be cancelled to zero. Therefore, the degree of the polynomial u_{1}u_{2}, and hence the degree of v_{1}v_{2}, is k_{1}+k_{2}. ∎
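A quick symbolic check of Lemma A.2 on a toy example (illustrative only; sympy is used to expand the product and read off its total degree):

import sympy as sp

z1, z2 = sp.symbols('z1 z2')
v1 = 2*z1**2*z2 + z1 - 3        # degree k1 = 3
v2 = z1*z2 - z2**2 + 1          # degree k2 = 2
prod = sp.expand(v1 * v2)
# total degree of the product is k1 + k2 = 5, as Lemma A.2 states
print(sp.Poly(prod, z1, z2).total_degree())   # 5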

Recall z^f(x),afg\hat{z}\triangleq f(x),a\triangleq f\circ g. Since f(x)=fg(z)=a(z)z^=a(z),f(x)=f\circ g(z)=a(z)\implies\hat{z}=a(z), where a:𝒵𝒵(i)𝒵^𝒵^(i)a:\mathcal{Z}\cup\mathcal{Z}^{(i)}\rightarrow\hat{\mathcal{Z}}\cup\hat{\mathcal{Z}}^{(i)}, and 𝒵^=f(𝒳)\hat{\mathcal{Z}}=f(\mathcal{X}) and 𝒵^(i)=f(𝒳(i))\hat{\mathcal{Z}}^{(i)}=f(\mathcal{X}^{(i)}). We now show that aa is bijective.

Lemma A.3.

Suppose the observational data and interventional data are generated from Eq. 1 and Eq. 2 respectively. Let f be an encoder that solves the reconstruction identity in Eq. 3 and write \hat{z}=f(x). Then the mapping a that relates the true latent z to the output of the encoder, i.e., \hat{z}=a(z), is bijective.

Proof.

Observe that aa is surjective by construction. We now need to prove that aa is injective. Suppose aa is not injective. Therefore, there exists z1𝒵z_{1}\in\mathcal{Z} and z2𝒵z_{2}\in\mathcal{Z}, where z1z2z_{1}\not=z_{2} and z^1=a(z1)=z^2=a(z2)\hat{z}_{1}=a(z_{1})=\hat{z}_{2}=a(z_{2}). Note that a(z1)=f(x1)a(z_{1})=f(x_{1}), where x1=g(z1)x_{1}=g(z_{1}) and a(z2)=f(x2)a(z_{2})=f(x_{2}), where x2=g(z2)x_{2}=g(z_{2}). This implies that f(x1)=f(x2)f(x_{1})=f(x_{2}). We know that the decoder encoder pair satisfy reconstruction, which means hf(x1)=x1h\circ f(x_{1})=x_{1} and hf(x2)=x2h\circ f(x_{2})=x_{2}. Since f(x1)=f(x2)f(x_{1})=f(x_{2}), we obtain that x1=x2x_{1}=x_{2}, which implies that z1=z2z_{1}=z_{2} since gg is injective. This contradicts the fact that z1z2z_{1}\not=z_{2}. Therefore, z^=a(z)\hat{z}=a(z) is bijective. ∎

See 4.4

Proof.

We start by restating the reconstruction identity. For all x𝒳𝒳(i)x\in\mathcal{X}\cup\mathcal{X}^{(i)}

\begin{split}&h\circ f(x)=x\\ &h(\hat{z})=g(z)\\ &H\begin{bmatrix}1\\ \hat{z}\\ \hat{z}\bar{\otimes}\hat{z}\\ \vdots\\ \underbrace{\hat{z}\bar{\otimes}\cdots\bar{\otimes}\;\hat{z}}_{p\;\text{times}}\end{bmatrix}=G\begin{bmatrix}1\\ z\\ z\bar{\otimes}z\\ \vdots\\ \underbrace{z\bar{\otimes}\cdots\bar{\otimes}\;z}_{p\;\text{times}}\end{bmatrix}\\ \end{split} (12)

Following the assumptions, hh is restricted to be polynomial but ff bears no restriction. If H=GH=G and f=g1f=g^{-1}, we get the ideal solution z^=z\hat{z}=z, thus a solution to the above identity exists.

Since GG has full column rank, we can select qq rows of GG such that G~q×q\tilde{G}\in\mathbb{R}^{q\times q} and 𝗋𝖺𝗇𝗄(G~)=q\mathsf{rank}(\tilde{G})=q. Denote the corresponding matrix HH that select the same rows as H~\tilde{H}. We restate the identity in Eq. 12 in terms of H~\tilde{H} and G~\tilde{G} as follows. For all z𝒵𝒵(i)z\in\mathcal{Z}\cup\mathcal{Z}^{(i)}

\begin{split}&\tilde{H}\begin{bmatrix}1\\ \hat{z}\\ \hat{z}\bar{\otimes}\hat{z}\\ \vdots\\ \underbrace{\hat{z}\bar{\otimes}\cdots\bar{\otimes}\;\hat{z}}_{p\;\text{times}}\end{bmatrix}=\tilde{G}\begin{bmatrix}1\\ z\\ z\bar{\otimes}z\\ \vdots\\ \underbrace{z\bar{\otimes}\cdots\bar{\otimes}\;z}_{p\;\text{times}}\end{bmatrix}\\ &\tilde{G}^{-1}\tilde{H}\begin{bmatrix}1\\ \hat{z}\\ \hat{z}\bar{\otimes}\hat{z}\\ \vdots\\ \underbrace{\hat{z}\bar{\otimes}\cdots\bar{\otimes}\;\hat{z}}_{p\;\text{times}}\end{bmatrix}=\begin{bmatrix}1\\ z\\ z\bar{\otimes}z\\ \vdots\\ \underbrace{z\bar{\otimes}\cdots\bar{\otimes}\;z}_{p\;\text{times}}\end{bmatrix}\\ &z=\tilde{A}\begin{bmatrix}1\\ \hat{z}\\ \hat{z}\bar{\otimes}\hat{z}\\ \vdots\\ \underbrace{\hat{z}\bar{\otimes}\cdots\bar{\otimes}\;\hat{z}}_{p\;\text{times}}\end{bmatrix}\\ &z=\tilde{A}_{1}\hat{z}+\tilde{A}_{2}\;\hat{z}\bar{\otimes}\hat{z}+\cdots\tilde{A}_{p}\;\underbrace{\hat{z}\bar{\otimes}\cdots\bar{\otimes}\;\hat{z}}_{p\;\text{times}}+c,\end{split} (13)

where \tilde{A} is the submatrix of \tilde{G}^{-1}\tilde{H} that describes the relationship between z and the polynomial of \hat{z}, and \{\tilde{A}_{i}\}_{i=1}^{p} correspond to blocks of rows of \tilde{A}. Suppose at least one of \tilde{A}_{2},\cdots,\tilde{A}_{p} is non-zero. Among the matrices \tilde{A}_{2},\cdots,\tilde{A}_{p} which are non-zero, pick the matrix \tilde{A}_{k} with the largest index k. Suppose row i of \tilde{A}_{k} has some non-zero element. Now consider the element in the row of the RHS of equation 13 corresponding to z_{i}^{p}. Observe that z_{i}^{p} is a polynomial of \hat{z} of degree kp, where k\geq 2 (this follows from Lemma A.2). In the LHS, we have a polynomial of degree at most p. The equality between the LHS and the RHS is true for all \hat{z}\in f(\mathcal{X}\cup\mathcal{X}^{(i)}). The difference of the LHS and the RHS is an analytic function, and from 4.3, f(\mathcal{X}\cup\mathcal{X}^{(i)}) has measure greater than zero. Therefore, we leverage Mityagin (2015) to conclude that the LHS is equal to the RHS on the entire \mathbb{R}^{d}. If two polynomials are equal everywhere, then their respective coefficients have to be the same. Based on the supposition, the RHS has a non-zero coefficient for a term of degree kp while the LHS has zero coefficients for all terms of degree higher than p. This leads to a contradiction. As a result, none of \tilde{A}_{2},\cdots,\tilde{A}_{p} can be non-zero. Thus z=\tilde{A}_{1}\hat{z}+c. Next, we show that \tilde{A}_{1} is invertible, which immediately follows from Lemma A.3. ∎

A.1.1 Extensions to sparse polynomial g()g(\cdot)

Suppose g()g(\cdot) is a degree pp polynomial. Let us define the basis that generates gg as

u(z)=[1zz¯zz¯¯zptimes]u(z)=\begin{bmatrix}1\\ z\\ z\bar{\otimes}z\\ \vdots\\ \underbrace{z\bar{\otimes}\cdots\bar{\otimes}\;z}_{p\;\text{times}}\end{bmatrix}

Note that the number of terms in u(z)u(z) grows as q=r=0p(r+d1d1)q=\sum_{r=0}^{p}{r+d-1\choose d-1}. In the previous proof, we worked with

g(z)=G[1zz¯zz¯¯zptimes]=Gu(z)g(z)=G\begin{bmatrix}1\\ z\\ z\bar{\otimes}z\\ \vdots\\ \underbrace{z\bar{\otimes}\cdots\bar{\otimes}\;z}_{p\;\text{times}}\end{bmatrix}=Gu(z)

where G\in\mathbb{R}^{n\times q} was full rank. As a result, n has to be at least q and thus grow at least as \sum_{r=0}^{p}{r+d-1\choose d-1}. In real data, we can imagine that g(\cdot) has a high degree. However, g can exhibit some structure, for instance sparsity. We now show that our entire analysis continues to work even for sparse polynomials, thus significantly reducing the requirement on n: it only needs to grow as the number of non-zero basis terms in the sparse polynomial. We write the basis for the sparse polynomial of degree p as u^{\prime}(z), which consists of a subset of the terms in u(z). We write the sparse polynomial g(\cdot) as

g(z)=Gu(z)g(z)=Gu^{{}^{\prime}}(z)
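To get a sense of scale for the count above, the snippet below (an illustrative aside) computes the number of monomials q for the latent dimensions and degrees used later in the experiments of App. B.2; a sparse basis u^{\prime}(z) can be much shorter, which is what relaxes the requirement on n:

from math import comb

def num_monomials(d, p):
    # number of distinct monomials in z of degree at most p, i.e., the length of u(z)
    return sum(comb(r + d - 1, d - 1) for r in range(p + 1))

print(num_monomials(6, 2), num_monomials(6, 3), num_monomials(10, 3))   # 28 84 286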

We formally state the assumption on the decoder in this case as follows.

Assumption A.4.

The decoder gg is a polynomial of degree pp whose corresponding coefficient matrix GG (a.k.a. the weight matrix) has full column rank. Specifically, the decoder gg is determined by the coefficient matrix GG as follows,

g(z)=Gu(z)g(z)=Gu^{{}^{\prime}}(z) (14)

where u(z)u^{{}^{\prime}}(z) consists of a subset of terms in u(z)u(z). u(z)u^{{}^{\prime}}(z) consists of the degree one term, i.e., zz and at least one term of the form zioz_{i}^{o}, where op+12o\geq\frac{p+1}{2}

Theorem A.5.

Suppose the observational data and interventional data are generated from Eq. 1 and Eq. 2 respectively under Assumptions 4.1, A.4. The autoencoder that solves reconstruction identity in Eq. 3 under 4.3 achieves affine identification, i.e., z𝒵𝒵(i),z^=Az+c\forall z\in\mathcal{Z}\cup\mathcal{Z}^{(i)},\hat{z}=Az+c, where z^\hat{z} is the output of the encoder ff, zz is the true latent, AA is an invertible d×dd\times d matrix and cdc\in\mathbb{R}^{d}.

Proof.

We start by restating the reconstruction identity. For all x𝒳𝒳(i)x\in\mathcal{X}\cup\mathcal{X}^{(i)}

hf(x)=xh(z^)=g(z)H[1z^z^¯z^z^¯¯z^ptimes]=Gu(z)\begin{split}&h\circ f(x)=x\\ &h(\hat{z})=g(z)\\ &H\begin{bmatrix}1\\ \hat{z}\\ \hat{z}\bar{\otimes}\hat{z}\\ \vdots\\ \underbrace{\hat{z}\bar{\otimes}\cdots\bar{\otimes}\;\hat{z}}_{p\;\text{times}}\end{bmatrix}=Gu^{{}^{\prime}}(z)\\ \end{split} (15)

Following the assumptions, hh is restricted to be polynomial but ff bears no restriction. If HH is equal to the matrix GG for columns ii where ui=uju_{i}=u_{j}^{{}^{\prime}} for some jj and zero in other columns and f=g1f=g^{-1}, we get the ideal solution z^=z\hat{z}=z, thus a solution to the above identity exists. Since GG has full column rank, we can select qq rows of GG such that G~q×q\tilde{G}\in\mathbb{R}^{q\times q} and 𝗋𝖺𝗇𝗄(G~)=q\mathsf{rank}(\tilde{G})=q. Denote the corresponding matrix HH that select the same rows as H~\tilde{H}. We restate the identity in Eq. 15 in terms of H~\tilde{H} and G~\tilde{G} as follows. For all z𝒵𝒵(i)z\in\mathcal{Z}\cup\mathcal{Z}^{(i)}

H~[1z^z^¯z^z^¯¯z^ptimes]=G~u(z)G~1H~[1z^z^¯z^z^¯¯z^ptimes]=u(z)z=A~[1z^z^¯z^z^¯¯z^ptimes]z=A~1z^+A~2z^¯z^+A~pz^¯¯z^ptimes+c\begin{split}&\tilde{H}\begin{bmatrix}1\\ \hat{z}\\ \hat{z}\bar{\otimes}\hat{z}\\ \vdots\\ \underbrace{\hat{z}\bar{\otimes}\cdots\bar{\otimes}\;\hat{z}}_{p\;\text{times}}\end{bmatrix}=\tilde{G}u^{{}^{\prime}}(z)\\ &\tilde{G}^{-1}\tilde{H}\begin{bmatrix}1\\ \hat{z}\\ \hat{z}\bar{\otimes}\hat{z}\\ \vdots\\ \underbrace{\hat{z}\bar{\otimes}\cdots\bar{\otimes}\;\hat{z}}_{p\;\text{times}}\end{bmatrix}=u^{{}^{\prime}}(z)\\ &z=\tilde{A}\begin{bmatrix}1\\ \hat{z}\\ \hat{z}\bar{\otimes}\hat{z}\\ \vdots\\ \underbrace{\hat{z}\bar{\otimes}\cdots\bar{\otimes}\;\hat{z}}_{p\;\text{times}}\end{bmatrix}\\ &z=\tilde{A}_{1}\hat{z}+\tilde{A}_{2}\;\hat{z}\bar{\otimes}\hat{z}+\cdots\tilde{A}_{p}\;\underbrace{\hat{z}\bar{\otimes}\cdots\bar{\otimes}\;\hat{z}}_{p\;\text{times}}+c\end{split} (16)

In the simplification above, we rely on the fact that u(z)u^{\prime}(z) consists of the first degree term. Suppose at least one of A~2,,A~p\tilde{A}_{2},\cdots,\tilde{A}_{p} is non-zero. Among the matrices A~2,,A~p\tilde{A}_{2},\cdots,\tilde{A}_{p} which are non-zero, pick the matrix A~k\tilde{A}_{k} with largest index kk. Suppose row ii of A~k\tilde{A}_{k} has some non-zero element. Now consider the element in the row in the RHS of equation 16 corresponding to zioz_{i}^{o}. Observe that zioz_{i}^{o} is a polynomial of z^\hat{z} of degree koko, where k2k\geq 2. In the LHS, we have a polynomial of degree at most pp. The equality between LHS and RHS is true for all z^f(𝒳𝒳(i))\hat{z}\in f(\mathcal{X}\cup\mathcal{X}^{(i)}). The difference of LHS and RHS is an analytic function. From 4.3 f(𝒳𝒳(i))f(\mathcal{X}\cup\mathcal{X}^{(i)}) has a measure greater than zero. Therefore, we leverage Mityagin (2015) to conclude that the LHS is equal to RHS on entire d\mathbb{R}^{d}. If two polynomials are equal everywhere, then their respective coefficients have to be the same. Based on supposition, RHS has non zero coefficient for terms with degree p+1p+1 while LHS has zero coefficient for terms higher than degree pp. This leads to a contradiction. As a result, none of A~2,,A~p\tilde{A}_{2},\cdots,\tilde{A}_{p} can be non-zero. Thus z=A~1z^+cz=\tilde{A}_{1}\hat{z}+c. Next, we need to show that A~1\tilde{A}_{1} is invertible, which follows from Lemma A.3. ∎

A.1.2 Extensions to polynomial g()g(\cdot) with unknown degree

The learner starts with solving the reconstruction identity by setting the degree of h()h(\cdot) to be ss; here we assume HH has full rank (this implicitly requires that nn is greater than the number of terms in the polynomial of degree ss).

H[1z^z^¯z^z^¯¯z^stimes]=G[1zz¯zz¯¯zptimes]H\begin{bmatrix}1\\ \hat{z}\\ \hat{z}\bar{\otimes}\hat{z}\\ \vdots\\ \underbrace{\hat{z}\bar{\otimes}\cdots\bar{\otimes}\;\hat{z}}_{s\;\text{times}}\end{bmatrix}=G\begin{bmatrix}1\\ z\\ z\bar{\otimes}z\\ \vdots\\ \underbrace{z\bar{\otimes}\cdots\bar{\otimes}\;z}_{p\;\text{times}}\end{bmatrix}\\ (17)

We can restrict HH to rows such that it is a square invertible matrix H~\tilde{H}. Denote the corresponding restriction of GG as G~\tilde{G}. The equality is stated as follows.

[1z^z^¯z^z^¯¯z^stimes]=H~1G~[1zz¯zz¯¯zptimes]\begin{bmatrix}1\\ \hat{z}\\ \hat{z}\bar{\otimes}\hat{z}\\ \vdots\\ \underbrace{\hat{z}\bar{\otimes}\cdots\bar{\otimes}\;\hat{z}}_{s\;\text{times}}\end{bmatrix}=\tilde{H}^{-1}\tilde{G}\begin{bmatrix}1\\ z\\ z\bar{\otimes}z\\ \vdots\\ \underbrace{z\bar{\otimes}\cdots\bar{\otimes}\;z}_{p\;\text{times}}\end{bmatrix} (18)

If s>p, then \underbrace{\hat{z}\bar{\otimes}\cdots\bar{\otimes}\;\hat{z}}_{s\;\text{times}} is a polynomial of degree at least p+1 in z. Since the RHS contains a polynomial of degree at most p, the two sides cannot be equal over a set of values of z with positive Lebesgue measure in \mathbb{R}^{d}. Thus the reconstruction identity will only be satisfied when s=p, so we can start with an upper bound and reduce the degree of the polynomial on the LHS until the identity is satisfied.
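The last step above can be checked numerically: a monomial of degree p+1 cannot coincide with any polynomial of degree at most p on an interval, so the best degree-p least-squares fit leaves a residual bounded away from zero (our illustration, not part of the argument):

import numpy as np

p = 3
z = np.linspace(-1.0, 1.0, 2001)
target = z ** (p + 1)
coeffs = np.polynomial.polynomial.polyfit(z, target, deg=p)   # best degree-p fit on [-1, 1]
residual = np.max(np.abs(np.polynomial.polynomial.polyval(z, coeffs) - target))
print(residual)   # roughly 0.23 for p = 3, clearly bounded away from zero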

A.1.3 Extensions from polynomials to ϵ\epsilon-approximate polynomials

We now discuss how to extend Theorem 4.4 to settings beyond polynomial gg. Suppose gg is a function that can be ϵ\epsilon-approximated by a polynomial of degree pp on entire 𝒵𝒵(i)\mathcal{Z}\cup\mathcal{Z}^{(i)}. In this section, we assume that we continue to use polynomial decoders hh of degree pp (with full rank matrix HH) for reconstruction. We state this as follows.

Constraint A.6.

The learned decoder h is a polynomial of degree p and is determined by its coefficient matrix H as follows. For all z\in\mathbb{R}^{d},

h(z)=H[1,z,z¯z,,z¯¯zptimes]h(z)=H[1,z,z\bar{\otimes}z,\cdots,\underbrace{z\bar{\otimes}\cdots\bar{\otimes}\;z}_{p\;\text{times}}]^{\top} (19)

where ¯\bar{\otimes} represents the Kronecker product with all distinct entries. HH has a full column rank.

Since we use hh as a polynomial, then satisfying the exact reconstruction is not possible. Instead, we enforce approximate reconstruction as follows. For all x𝒳𝒳(i)x\in\mathcal{X}\cup\mathcal{X}^{(i)}, we want

hf(x)xϵ,\|h\circ f(x)-x\|\leq\epsilon, (20)

where ϵ\epsilon is the tolerance on reconstruction error. Recall z^=f(x)\hat{z}=f(x). We further simplify it as z^=fg(z)=a(z)\hat{z}=f\circ g(z)=a(z). We also assume that aa can be η\eta-approximated on entire 𝒵𝒵(i)\mathcal{Z}\cup\mathcal{Z}^{(i)} with a polynomial of sufficiently high degree say qq. We write this as follows. For all z𝒵𝒵(i)z\in\mathcal{Z}\cup\mathcal{Z}^{(i)},

z^Θ[zz¯zz¯¯zqtimes]η,z^Θ1zΘ2z¯zΘpz¯¯zqtimesη.\begin{split}\Bigg{\|}\hat{z}-\Theta\begin{bmatrix}z\\ z\bar{\otimes}z\\ \vdots\\ \underbrace{z\bar{\otimes}\cdots\bar{\otimes}\;z}_{q\;\text{times}}\end{bmatrix}\Bigg{\|}\leq\eta,\\ \Bigg{\|}\hat{z}-\Theta_{1}z-\Theta_{2}\;z\bar{\otimes}z-\cdots\Theta_{p}\;\underbrace{z\bar{\otimes}\cdots\bar{\otimes}\;z}_{q\;\text{times}}\Bigg{\|}\leq\eta.\end{split} (21)

We want to show that the norm of Θk\Theta_{k} for all k2k\geq 2 is sufficiently small. We state some assumptions needed in theorem below.

Assumption A.7.

Encoder ff does not take values near zero, i.e., fi(x)γηf_{i}(x)\geq\gamma\eta for all x𝒳𝒳(i)x\in\mathcal{X}\cup\mathcal{X}^{(i)} and for all i{1,,d}i\in\{1,\cdots,d\}, where γ>2\gamma>2. The absolute value of each element in H~1G~\tilde{H}^{-1}\tilde{G} is bounded by a fixed constant. Consider the absolute value of the singular values of H~\tilde{H}; we assume that the smallest absolute value is strictly positive and bounded below by ζ\zeta.

Theorem A.8.

Suppose the true decoder gg can be approximated by a polynomial of degree pp on entire 𝒵𝒵(i)\mathcal{Z}\cup\mathcal{Z}^{(i)} with approximation error ϵ2\frac{\epsilon}{2}. Suppose a=fga=f\circ g can be approximated by polynomials on entire 𝒵𝒵(i)\mathcal{Z}\cup\mathcal{Z}^{(i)} with η\eta error. If [z𝗆𝖺𝗑,z𝗆𝖺𝗑]d𝒵𝒵(i)[-z_{\mathsf{max}},z_{\mathsf{max}}]^{d}\subseteq\mathcal{Z}\cup\mathcal{Z}^{(i)}, where z𝗆𝖺𝗑z_{\mathsf{max}} is sufficiently large, and Assumption 4.1, Assumption A.7 hold, then the polynomial approximation of aa (recall z^=a(z)\hat{z}=a(z)) corresponding to solutions of approximate reconstruction identity in Eq. 20 under A.6 is approximately linear, i.e., the norms of the weights on higher order terms are sufficiently small. Specifically, the absolute value of the weight associated with term of degree kk decays as 1zmaxk1\frac{1}{z_{\max}^{k-1}}.

Proof.

We start by restating the approximate reconstruction identity. We use the fact that gg can be approximated with a polynomial of say degree pp to simplify the identity below. For all x𝒳𝒳(i)x\in\mathcal{X}\cup\mathcal{X}^{(i)}

hf(x)xϵH[z^z^¯z^z^¯¯z^ptimes]G[zz¯zz¯¯zptimes]G[zz¯zz¯¯zptimes]g(z)ϵ\begin{split}&\|h\circ f(x)-x\|\leq\epsilon\\ &\Bigg{\|}H\begin{bmatrix}\hat{z}\\ \hat{z}\bar{\otimes}\hat{z}\\ \vdots\\ \underbrace{\hat{z}\bar{\otimes}\cdots\bar{\otimes}\;\hat{z}}_{p\;\text{times}}\end{bmatrix}-G\begin{bmatrix}z\\ z\bar{\otimes}z\\ \vdots\\ \underbrace{z\bar{\otimes}\cdots\bar{\otimes}\;z}_{p\;\text{times}}\end{bmatrix}\Bigg{\|}-\Bigg{\|}G\begin{bmatrix}z\\ z\bar{\otimes}z\\ \vdots\\ \underbrace{z\bar{\otimes}\cdots\bar{\otimes}\;z}_{p\;\text{times}}\end{bmatrix}-g(z)\Bigg{\|}\leq\epsilon\end{split} (22)

To obtain the second step from the first, add and subtract G[z,z\bar{\otimes}z,\cdots,\underbrace{z\bar{\otimes}\cdots\bar{\otimes}\;z}_{p\;\text{times}}]^{\top} and use the reverse triangle inequality. Since H is full rank, we select rows of H such that \tilde{H} is square and invertible. The corresponding selection of rows of G is denoted as \tilde{G}. We write the identity in terms of these matrices as follows.

H~[z^z^¯z^z^¯¯z^ptimes]G~[zz¯zz¯¯zptimes]3ϵ2[z^z^¯z^z^¯¯z^ptimes]H~1G~[zz¯zz¯¯zptimes]3ϵ2|σ𝗆𝗂𝗇(H~)|\begin{split}&\Bigg{\|}\tilde{H}\begin{bmatrix}\hat{z}\\ \hat{z}\bar{\otimes}\hat{z}\\ \vdots\\ \underbrace{\hat{z}\bar{\otimes}\cdots\bar{\otimes}\;\hat{z}}_{p\;\text{times}}\end{bmatrix}-\tilde{G}\begin{bmatrix}z\\ z\bar{\otimes}z\\ \vdots\\ \underbrace{z\bar{\otimes}\cdots\bar{\otimes}\;z}_{p\;\text{times}}\end{bmatrix}\Bigg{\|}\leq\frac{3\epsilon}{2}\\ &\Bigg{\|}\begin{bmatrix}\hat{z}\\ \hat{z}\bar{\otimes}\hat{z}\\ \vdots\\ \underbrace{\hat{z}\bar{\otimes}\cdots\bar{\otimes}\;\hat{z}}_{p\;\text{times}}\end{bmatrix}-\tilde{H}^{-1}\tilde{G}\begin{bmatrix}z\\ z\bar{\otimes}z\\ \vdots\\ \underbrace{z\bar{\otimes}\cdots\bar{\otimes}\;z}_{p\;\text{times}}\end{bmatrix}\Bigg{\|}\leq\frac{3\epsilon}{2|\sigma_{\mathsf{min}}(\tilde{H})|}\end{split} (23)

where |σ𝗆𝗂𝗇(H~)||\sigma_{\mathsf{min}}(\tilde{H})| is the singular value with smallest absolute value corresponding to the matrix H~\tilde{H}. In the simplification above, we use the assumption that gg is ϵ2\frac{\epsilon}{2}-approximated by a polynomial with matrix GG and we also use the fact that |σ𝗆𝗂𝗇(H~)||\sigma_{\mathsf{min}}(\tilde{H})| is positive. Now we write that the polynomial that approximates z^i=ai(z)\hat{z}_{i}=a_{i}(z) as follows.

|z^iθ1zθ2z¯zθqz¯¯zqtimes|η|\hat{z}_{i}-\theta_{1}^{\top}z-\theta_{2}^{\top}z\bar{\otimes}z-\cdots\theta_{q}^{\top}\underbrace{z\bar{\otimes}\cdots\bar{\otimes}\;z}_{q\;\text{times}}|\leq\eta (24)
z^iθ1z+θ2z¯z+θqz¯¯zqtimesηz^iθ1z+θ2z¯z+θqz¯¯zqtimes+η\begin{split}&\hat{z}_{i}\geq\theta_{1}^{\top}z+\theta_{2}^{\top}z\bar{\otimes}z+\cdots\theta_{q}^{\top}\underbrace{z\bar{\otimes}\cdots\bar{\otimes}\;z}_{q\;\text{times}}-\eta\\ &\hat{z}_{i}\leq\theta_{1}^{\top}z+\theta_{2}^{\top}z\bar{\otimes}z+\cdots\theta_{q}^{\top}\underbrace{z\bar{\otimes}\cdots\bar{\otimes}\;z}_{q\;\text{times}}+\eta\end{split} (25)

From Assumption A.7 we know that z^iγη\hat{z}_{i}\geq\gamma\eta, where γ>2\gamma>2. It follows from the above equation that

θ1z+θ2z¯z+θqz¯¯zqtimes+ηγηθ1z+θ2z¯z++θqz¯¯zqtimes(γ1)η01γ1ηθ1z+θ2z¯z++θqz¯¯zqtimes\begin{split}&\theta_{1}^{\top}z+\theta_{2}^{\top}z\bar{\otimes}z+\cdots\theta_{q}^{\top}\underbrace{z\bar{\otimes}\cdots\bar{\otimes}\;z}_{q\;\text{times}}+\eta\geq\gamma\eta\\ &\implies\theta_{1}^{\top}z+\theta_{2}^{\top}z\bar{\otimes}z+\cdots+\theta_{q}^{\top}\underbrace{z\bar{\otimes}\cdots\bar{\otimes}\;z}_{q\;\text{times}}-(\gamma-1)\eta\geq 0\\ &\implies\frac{1}{\gamma-1}\geq\frac{\eta}{\theta_{1}^{\top}z+\theta_{2}^{\top}z\bar{\otimes}z+\cdots+\theta_{q}^{\top}\underbrace{z\bar{\otimes}\cdots\bar{\otimes}\;z}_{q\;\text{times}}}\end{split} (26)

For z^iγη\hat{z}_{i}\geq\gamma\eta, we track how z^ip\hat{z}_{i}^{p} grows below.

z^iθ1z+θ2z¯z+θqz¯¯zqtimesη(γ2)η0z^ip(θ1z+θ2z¯z+θqz¯¯zqtimesη)pz^ip(θ1z+θ2z¯z+θqz¯¯zqtimes)p(11γ1)p\begin{split}&\hat{z}_{i}\geq\theta_{1}^{\top}z+\theta_{2}^{\top}z\bar{\otimes}z+\cdots\theta_{q}^{\top}\underbrace{z\bar{\otimes}\cdots\bar{\otimes}\;z}_{q\;\text{times}}-\eta\geq(\gamma-2)\eta\geq 0\\ &\hat{z}_{i}^{p}\geq(\theta_{1}^{\top}z+\theta_{2}^{\top}z\bar{\otimes}z+\cdots\theta_{q}^{\top}\underbrace{z\bar{\otimes}\cdots\bar{\otimes}\;z}_{q\;\text{times}}-\eta)^{p}\\ &\hat{z}_{i}^{p}\geq(\theta_{1}^{\top}z+\theta_{2}^{\top}z\bar{\otimes}z+\cdots\theta_{q}^{\top}\underbrace{z\bar{\otimes}\cdots\bar{\otimes}\;z}_{q\;\text{times}})^{p}(1-\frac{1}{\gamma-1})^{p}\end{split} (27)

In the last step of the above simplification, we use the condition in Eq. 26. We consider z=[z_{\mathsf{max}},\cdots,z_{\mathsf{max}}]. Consider the terms \theta_{ij}z_{\mathsf{max}}^{k} inside the polynomial in the RHS above, and assume for now that all components of \theta are positive. Suppose \theta_{ij}\geq\frac{1}{z_{\mathsf{max}}^{k-\kappa-1}}, where \kappa\in(0,1); then the RHS in Eq. 27 grows at least as z_{\mathsf{max}}^{(1+\kappa)p}\big(\frac{\gamma-2}{\gamma-1}\big)^{p}. From Eq. 23, \hat{z}_{i}^{p} is very close to a degree-p polynomial in z. Under the assumption that the terms in \tilde{H}^{-1}\tilde{G} are bounded by a constant, this degree-p polynomial grows at most as z_{\mathsf{max}}^{p}. The difference in growth rates of the two sides of Eq. 23 is an increasing function of z_{\mathsf{max}} once z_{\mathsf{max}} is sufficiently large. Therefore, the reconstruction identity in Eq. 23 cannot be satisfied for points in a sufficiently small neighborhood of z=[z_{\mathsf{max}},\cdots,z_{\mathsf{max}}], and hence \theta_{ij}<\frac{1}{z_{\mathsf{max}}^{k-\kappa-1}}. We can consider the other vertices of the hypercube \mathcal{Z} and conclude that |\theta_{ij}|<\frac{1}{z_{\mathsf{max}}^{k-\kappa-1}}. ∎

A.2 Representation identification under do interventions

See 5.3

Proof.

First note that Assumptions 4.1-4.2 hold. Since we solve Eq. 3 under 4.3, we can continue to use the result from Theorem 4.4. From Theorem 4.4, it follows that the estimated latents z^\hat{z} are an affine function of the true zz. z^k=az+b,\hat{z}_{k}=a^{\top}z+b, z𝒵𝒵(i)\forall z\in\mathcal{Z}\cup\mathcal{Z}^{(i)}, where ad,ba\in\mathbb{R}^{d},b\in\mathbb{R}.

We consider a z𝒵(i)z\in\mathcal{Z}^{(i)} such that ziz_{-i} is in the interior of the support of Zi(i)\mathbb{P}_{Z_{-i}}^{(i)}. We write z𝒵(i)z\in\mathcal{Z}^{(i)} as [z,zi][z^{*},z_{-i}]. We can write z^k=aiz+aizi+b\hat{z}_{k}=a_{i}z^{*}+a_{-i}^{\top}z_{-i}+b, where aia_{-i} is the vector of the values of coefficients in aa other than the coefficient of ithi^{th} dimension, aia_{i} is ithi^{th} component of aa, ziz_{-i} is the vector of values in zz other than ziz_{i}. From the constraint in 5.1 it follows that for all z𝒵(i)z\in\mathcal{Z}^{(i)}, z^k=z\hat{z}_{k}=z^{\dagger}. We use these expressions to carry out the following simplification.

a_{-i}^{\top}z_{-i}=z^{\dagger}-a_{i}z^{*}-b (28)

Consider another data point z𝒵(i)z^{{}^{\prime}}\in\mathcal{Z}^{(i)} from the same interventional distribution such that zi=zi+θejz_{-i}^{{}^{\prime}}=z_{-i}+\theta e_{j} is in the interior of the support of Zi(i)\mathbb{P}_{Z_{-i}}^{(i)}, where eje_{j} is vector with one in jthj^{th} coordinate and zero everywhere else. From Assumption 5.2, we know that there exists a small enough θ\theta such that ziz_{-i}^{{}^{\prime}} is in the interior. Since the point is from the same interventional distribution zi=zz_{i}^{{}^{\prime}}=z^{*}. For ziz_{-i}^{{}^{\prime}} we have

a_{-i}^{\top}z_{-i}^{\prime}=z^{\dagger}-a_{i}z^{*}-b (29)

We take the difference of the two equations, Eq. 28 and Eq. 29, to get

a_{-i}^{\top}(z_{-i}-z_{-i}^{\prime})=\theta a_{-i}^{\top}e_{j}=0. (30)

From the above, we get that the jthj^{th} component of aia_{-i} is zero. We can repeat the above argument for all jj and get that ai=0a_{-i}=0. Therefore, z^k=aizi+b\hat{z}_{k}=a_{i}z_{i}+b for all possible values of ziz_{i} in 𝒵𝒵(i)\mathcal{Z}\cup\mathcal{Z}^{(i)}. ∎
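The argument in Eqs. 28-30 can also be checked numerically. In the toy sketch below (our illustration with an arbitrary setup), we look for an affine function of z that is constant on do-interventional data; the least-squares solve recovers such a function, and its weights on the non-intervened coordinates vanish, matching Eq. 30:

import numpy as np

rng = np.random.default_rng(0)
d, i, z_star, z_dagger = 4, 0, 1.5, 0.7
Z = rng.uniform(-1.0, 1.0, size=(500, d))
Z[:, i] = z_star                                     # do(Z_i = z_star)
X = np.hstack([Z, np.ones((500, 1))])                # columns: z_1, ..., z_d and the offset b
coef, *_ = np.linalg.lstsq(X, np.full(500, z_dagger), rcond=None)
a, b = coef[:d], coef[d]
print(np.abs(np.delete(a, i)).max())                 # ~1e-16: a_{-i} = 0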

A.2.1 Extension of do interventions beyond polynomials

In the main body of the paper, we studied the setting where g is a polynomial. Here we relax this constraint on g and consider settings with multiple do-interventional distributions per target latent.

We write the DGP for intervention j{1,,t}j\in\{1,\cdots,t\} on latent ii as

\begin{split}&z_{i}=z^{*,j}\\ &z_{-i}\sim\mathbb{P}_{Z_{-i}}^{(i,j)}\end{split} (31)

Let \mathcal{T}=\{z^{*,1},\cdots,z^{*,t}\} be the set of do-intervention target values. We extend the constrained representation learning problem from the main body, in which the learner leverages the geometric signature of a single do intervention per latent dimension, to the setting with multiple do-interventional distributions per latent dimension.

\begin{split}&h\circ f(x)=x,\qquad\forall x\in\mathcal{X}\cup\mathcal{X}^{(i,j)}\\ &f_{k}(x)=z^{\dagger,j},\qquad\;\;\;\forall x\in\mathcal{X}^{(i,j)},\forall j\in\{1,\cdots,t\}\end{split} (32)

Recall that \hat{z}=f(x)=f\circ g(z)=a(z). Consider the k^{th} component \hat{z}_{k}=a_{k}(z). If a_{k}(z) is invertible and only depends on z_{i}, then we can write it as a_{k}(z_{i}); in that case, z_{i} is identified up to an invertible transform. Another way to state this property is \nabla_{z_{-i}}a_{k}(z)=0 for all z_{-i}. In what follows, we show that it is possible to approximately achieve identification up to an invertible transform: if the number of interventions t is sufficiently large, then \|\nabla_{z_{-i}}a_{k}(z)\|\leq\epsilon for all z\in\mathcal{Z}.

Assumption A.9.

The interior of the support of zz in the observational data, i.e., 𝒵\mathcal{Z}, is non-empty. The interior of the support of ziz_{-i} in the interventional data, i.e., 𝒵i(i,j)\mathcal{Z}_{-i}^{(i,j)}, is equal to the support in observational data, i.e., 𝒵i\mathcal{Z}_{-i}, for all j{1,,t}j\in\{1,\cdots,t\}. Each intervention z,jz^{*,j} is sampled from a distribution \mathbb{Q}. The support of \mathbb{Q} is equal to the support of ziz_{i} in the observational data, i.e., 𝒵i\mathcal{Z}_{i}. The density of \mathbb{Q} is greater than ϱ\varrho (ϱ>0\varrho>0) on the entire support.

The above assumption states the restrictions on the support of the latents underlying the observational data and the latents underlying the interventional data.

Assumption A.10.

2a(z)zizj\|\frac{\partial^{2}a(z)}{\partial z_{i}\partial z_{j}}\| is bounded by L<L<\infty for all z𝒵z\in\mathcal{Z} and for all i,j{1,,d}i,j\in\{1,\cdots,d\}.

Lemma A.11.

If the number of interventions tlog(δϵ2(βsupi+βinfi))/log(1ϱϵ2)t\geq\log(\frac{\delta\epsilon}{2(\beta_{\sup}^{i}+\beta^{i}_{\inf})})/\log(1-\varrho\frac{\epsilon}{2}), then maxzi𝒵iminz,j𝒯ziz,jϵ\max_{z_{i}\in\mathcal{Z}_{i}}\min_{z^{*,j}\in\mathcal{T}}\|z_{i}-z^{*,j}\|\leq\epsilon with probability 1δ1-\delta.

Proof.

Consider the interval [βinfi,βsupi][-\beta^{i}_{\inf},\beta_{\sup}^{i}], where βinfi\beta^{i}_{\inf} and βsupi\beta^{i}_{\sup} are the infimum and supremum of 𝒵i\mathcal{Z}_{i}. Consider an ϵ2\frac{\epsilon}{2} covering of [βinfi,βsupi][-\beta^{i}_{\inf},\beta_{\sup}^{i}]. This covering consists of 2(βsupi+βinfi)ϵ\frac{2(\beta_{\sup}^{i}+\beta^{i}_{\inf})}{\epsilon} equally spaced points at a separation of ϵ/2\epsilon/2. Consider a point ziz_{i}, its nearest neighbor in the cover is denoted as zlz_{l}^{{}^{\prime}}, and the nearest neighbor of ziz_{i} in the set of interventions 𝒯\mathcal{T} is z,jz^{*,j}. The nearest neighbor of zlz_{l}^{{}^{\prime}} in the set of interventions is z,rz^{*,r}. Since ziz,jziz,q\|z_{i}-z^{*,j}\|\leq\|z_{i}-z^{*,q}\| for all q{1,,t}q\in\{1,\cdots,t\} we can write

\|z_{i}-z^{*,j}\|\leq\|z_{i}-z^{*,r}\|\leq\|z_{i}-z_{l}^{\prime}\|+\|z_{l}^{\prime}-z^{*,r}\|\leq\frac{\epsilon}{2}+\|z_{l}^{\prime}-z^{*,r}\| (33)

Observe that if zlz,r\|z_{l}^{{}^{\prime}}-z^{*,r}\| is less than ϵ2\frac{\epsilon}{2} for all zlz_{l}^{{}^{\prime}} in the cover, then for all ziz_{i} in 𝒵i\mathcal{Z}_{i}, ziz,j\|z_{i}-z^{*,j}\| is less than ϵ\epsilon. We now show that zlz,r\|z_{l}^{{}^{\prime}}-z^{*,r}\| is sufficiently small provided tt is sufficiently large. Observe that

\mathbb{P}(\|z_{l}^{\prime}-z^{*,r}\|>\frac{\epsilon}{2})\leq(1-\varrho\frac{\epsilon}{2})^{t}

We would like (1-\varrho\frac{\epsilon}{2})^{t}\leq\delta^{\prime}, which holds whenever t\geq\log(\delta^{\prime})/\log(1-\varrho\frac{\epsilon}{2}); in that case \|z_{l}^{\prime}-z^{*,r}\|\leq\frac{\epsilon}{2} with probability at least 1-\delta^{\prime}. Setting \delta^{\prime}=\frac{\delta\epsilon}{2(\beta_{\sup}^{i}+\beta^{i}_{\inf})} and taking a union bound over the points of the cover, we obtain that \|z_{l}^{\prime}-z^{*,r}\|\leq\frac{\epsilon}{2} holds for all l with probability at least 1-\delta. This yields the final requirement t\geq\log(\frac{\delta\epsilon}{2(\beta_{\sup}^{i}+\beta^{i}_{\inf})})/\log(1-\varrho\frac{\epsilon}{2}). ∎
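Plugging illustrative constants into the bound of Lemma A.11 (our example values, not taken from the paper):

import math

def num_interventions(eps, delta, density_lb, beta_inf, beta_sup):
    # t >= log(delta * eps / (2 (beta_sup + beta_inf))) / log(1 - density_lb * eps / 2)
    per_point_delta = delta * eps / (2 * (beta_sup + beta_inf))
    return math.ceil(math.log(per_point_delta) / math.log(1 - density_lb * eps / 2))

print(num_interventions(eps=0.1, delta=0.05, density_lb=0.5, beta_inf=1.0, beta_sup=1.0))
# 265 do-interventional distributions suffice for this particular choice of constants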

Theorem A.12.

Suppose the observational data and interventional data are generated from Eq. 1 and Eq. 31 respectively. If the number of interventions t is sufficiently large, i.e., t\geq\log(\frac{\delta\epsilon}{2L(\beta_{\sup}^{i}+\beta^{i}_{\inf})})/\log(1-\varrho\frac{\epsilon}{2L}), and Assumption A.9 and Assumption A.10 are satisfied, then the solution to Eq. 32 identifies the intervened latent z_{i} approximately up to an invertible transform, i.e., \|\nabla_{z_{-i}}a_{k}(z)\|_{\infty}\leq\epsilon for all z\in\mathcal{Z}.

Proof.

Recall \hat{z}=f(x)=f\circ g(z)=a(z), where a:\cup_{j}\mathcal{Z}^{(i,j)}\cup\mathcal{Z}\rightarrow\cup_{j}\hat{\mathcal{Z}}^{(i,j)}\cup\hat{\mathcal{Z}}. Consistent with the notation used earlier in the proof of Theorem 4.4, \hat{\mathcal{Z}}^{(i,j)}=f(\mathcal{X}^{(i,j)}). In Lemma A.3, we showed that a is bijective; the same recipe shows that a is bijective here as well.

Owing to the constraint in Eq. 32, we claim that ziak(z)=0\nabla_{z_{-i}}a_{k}(z)=0 for all ziz_{-i} in the interior of 𝒵i\mathcal{Z}_{-i} with zi=z,jz_{i}=z^{*,j}. Consider a ball around ziz_{-i} that is entirely contained in 𝒵i\mathcal{Z}_{-i}, denote it as z\mathcal{B}_{z}. From Eq. 32, it follows that fk(x)f_{k}(x) takes the same value on this neighborhood. As a result, ak(z)a_{k}(z) is equal to a constant on the ball z\mathcal{B}_{z}. Therefore, it follows that ziak(z)=0\nabla_{z_{-i}}a_{k}(z)=0 on the ball z\mathcal{B}_{z}. We can extend this argument to all the points in the interior of the support of ziz_{-i}. As a result, ziak(z)=0\nabla_{z_{-i}}a_{k}(z)=0 on the interior of the support of ziz_{-i}. Further, ziak(z)=0\nabla_{z_{-i}}a_{k}(z)=0 for all z=[z,j,zi]z=[z^{*,j},z_{-i}] in j𝒵(i,j)\cup_{j}\mathcal{Z}^{(i,j)}. Define (z)=ziak(z)\aleph(z)=\nabla_{z_{-i}}a_{k}(z). Consider the jthj^{th} component of (z)\aleph(z) denoted as j(z)\aleph_{j}(z). Consider a point z𝒵z\in\mathcal{Z} and find its nearest neighbor in j𝒵(i,j)\cup_{j}\mathcal{Z}^{(i,j)} and denote it as zz^{{}^{\prime}}. Following the assumptions, zi=ziz^{{}^{\prime}}_{-i}=z_{-i}. We expand j(z)\aleph_{j}(z) around zz^{{}^{\prime}} as follows

j(z)=j(z)+zj(z′′)(zz)\aleph_{j}(z)=\aleph_{j}(z^{{}^{\prime}})+\nabla_{z}\aleph_{j}(z^{{}^{\prime\prime}})^{\top}(z-z^{{}^{\prime}})
j(z)=j(z′′)zi(zizi)\aleph_{j}(z)=\frac{\partial\aleph_{j}(z^{{}^{\prime\prime}})}{\partial z_{i}}(z_{i}-z_{i}^{{}^{\prime}})

In the above, we use the fact that j(z)=0\aleph_{j}(z^{{}^{\prime}})=0.

|j(z)|=|j(z′′)zi(zizi)||j(z′′)zi|ϵLϵ|\aleph_{j}(z)|=\bigg{|}\frac{\partial\aleph_{j}(z^{{}^{\prime\prime}})}{\partial z_{i}}(z_{i}-z_{i}^{{}^{\prime}})\bigg{|}\leq\bigg{|}\frac{\partial\aleph_{j}(z^{{}^{\prime\prime}})}{\partial z_{i}}\bigg{|}\frac{\epsilon}{L}\leq\epsilon

To see the last inequality in the above, use Lemma A.11 with ϵ\epsilon as ϵ/L\epsilon/L and Assumption A.10. ∎

In the discussion above, we showed that multiple do-interventional distributions on a target latent dimension help achieve approximate identification of that latent up to an invertible transform. The argument extends to all latents provided we have data with multiple do-interventional distributions per latent. We end this section by giving some intuition as to why multiple interventions are necessary in the absence of much structure on g.

Necessitating multiple interventions

We consider the case with one do intervention. Consider the set of values achieved under the intervention, where z_{-i} is from the interior of \tilde{\mathcal{Z}}^{(i)}_{-i}. We call this set \tilde{\mathcal{Z}}^{(i)}. Suppose a is a bijection of the following form.

a=\begin{cases}\mathsf{I},\;\text{if}\;z\;\text{is in}\;\tilde{\mathcal{Z}}^{(i)}\\ \tilde{a}\;\;\text{otherwise}\end{cases} (34)

where \mathsf{I} is the identity function and \tilde{a} is an arbitrary bijection with bounded second-order derivatives (satisfying Assumption A.10). Define f=a\circ g^{-1} and h=g\circ a^{-1}. Observe that these f and h satisfy both constraints in the representation learning problem in 5.1. In the absence of any further assumptions on g or on the structure of the support of \mathcal{Z}, each intervention therefore only enforces local constraints on a, leaving a unconstrained away from the interventional support.
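The construction in Eq. 34 can be made concrete. The sketch below (our illustration; a linear g and a smooth shear for \tilde{a} are chosen purely for concreteness) builds such a bijection a for d=2 under the single intervention do(z_1=0); the induced f and h satisfy both the reconstruction and the do-intervention constraints, and yet the second learned latent still entangles z_1 and z_2:

import numpy as np

rng = np.random.default_rng(0)
G = np.array([[2.0, 1.0], [0.5, 1.5]])               # toy linear decoder g(z) = G z
g = lambda z: z @ G.T
g_inv = lambda x: x @ np.linalg.inv(G).T

# a is the identity on the interventional slice {z_1 = 0} and shears z_1 into z_2 elsewhere
a = lambda z: np.stack([z[:, 0], z[:, 1] + np.sin(np.pi * z[:, 0])], axis=1)
a_inv = lambda v: np.stack([v[:, 0], v[:, 1] - np.sin(np.pi * v[:, 0])], axis=1)
f = lambda x: a(g_inv(x))                            # alternative encoder
h = lambda v: g(a_inv(v))                            # alternative decoder

z_obs = rng.uniform(-1, 1, (1000, 2))
z_int = np.column_stack([np.zeros(1000), rng.uniform(-1, 1, 1000)])   # do(z_1 = 0)
for z in (z_obs, z_int):
    assert np.allclose(h(f(g(z))), g(z))             # reconstruction constraint holds
assert np.allclose(f(g(z_int))[:, 0], 0.0)           # do-intervention constraint holds
print(np.corrcoef(f(g(z_obs))[:, 1], z_obs[:, 0])[0, 1])   # clearly non-zero: z_1 leaks into the second latent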

A.3 Representation identification under general perfect and imperfect interventions

Before proving Theorem 5.8, we prove a simpler version of the theorem, which we leverage to prove Theorem 5.8. We start with the case when the set 𝒮\mathcal{S} has one element say 𝒮={j}\mathcal{S}=\{j\}.

Assumption A.13.

Consider the ZZ that follow the interventional distribution Z(i)\mathbb{P}_{Z}^{(i)}. The joint support of zi,zjz_{i},z_{j} satisfies factorization of support, i.e.,

𝒵i,j(i)=𝒵i(i)×𝒵j(i)\mathcal{Z}_{i,j}^{(i)}=\mathcal{Z}_{i}^{(i)}\times\mathcal{Z}_{j}^{(i)} (35)

For all j\in[d], -\infty<\alpha_{\inf}^{j}\leq\alpha_{\sup}^{j}<\infty. There exists a \zeta>0 such that all the points in \Pi_{j\in[d]}(\alpha_{\sup}^{j}-\zeta,\alpha_{\sup}^{j})\cup(\alpha_{\inf}^{j},\alpha_{\inf}^{j}+\zeta) are in \mathcal{Z}^{(i)}.

The above assumption only requires support independence for two random variables ZiZ_{i} and ZjZ_{j}.

We now describe a constraint, where the learner enforces support independence between z^i\hat{z}_{i} and z^j\hat{z}_{j}.

Constraint A.14.

The pair (z^i,z^j)(\hat{z}_{i},\hat{z}_{j}) satisfies support independence on interventional data, i.e.,

𝒵^i,j(i)=𝒵^i(i)×𝒵^j(i)\hat{\mathcal{Z}}^{(i)}_{i,j}=\hat{\mathcal{Z}}^{(i)}_{i}\times\hat{\mathcal{Z}}^{(i)}_{j}

In A.14 above, we use the same indices i and j as in Assumption A.13 for convenience; the arguments extend to the case where a different pair is used.

Theorem A.15.

Suppose the observational data and interventional data are generated from Eq. 1 and Eq. 2 respectively under Assumptions 4.1, 4.2, A.13. The autoencoder that solves Eq. 3 under Constraint 4.3, A.14 achieves block affine identification, i.e., z𝒵,z^=Az+c\forall z\in\mathcal{Z},\hat{z}=Az+c, where z^\hat{z} is the output of the encoder ff and zz is the true latent and AA is an invertible d×dd\times d matrix and cdc\in\mathbb{R}^{d}. Further, the matrix AA has a special structure, i.e., the row aia_{i} and aja_{j} do not have a non-zero entry in the same column. Also, each row aia_{i} and aja_{j} has at least one non-zero entry.

Proof.

Let us first verify that there exists a solution to Eq. 3 under Constraint 4.3, A.14. If Z^=Z\hat{Z}=Z and h=gh=g, then that suffices to guarantee that a solution exists.

First note that since Assumptions 4.1, 4.2 holds and we are solving Eq. 3 under 4.3, we can continue to use the result from Theorem 4.4. From Theorem 4.4, z𝒵𝒵(i),z^=Az+c\forall z\in\mathcal{Z}\cup\mathcal{Z}^{(i)},\hat{z}=Az+c, where z^\hat{z} is the output of the encoder ff and zz is the true latent and AA is an invertible d×dd\times d matrix and cdc\in\mathbb{R}^{d}.

From Assumption A.13 we know that each component z_{k}, k\in\{1,\cdots,d\}, of z is bounded above and below. Suppose the minimum value achieved by z_{k}\in\mathcal{Z}^{(i)}_{k} is \alpha^{k}_{\inf} and the maximum value achieved by z_{k}\in\mathcal{Z}^{(i)}_{k} is \alpha_{\sup}^{k}.

Define a new latent

z_{k}^{\prime}=2\bigg(\frac{z_{k}-\frac{\alpha^{k}_{\sup}+\alpha^{k}_{\inf}}{2}}{\alpha^{k}_{\sup}-\alpha^{k}_{\inf}}\bigg),\quad\forall k\in\{1,\cdots,d\}

Notice that after this linear operation, each component of the new latent takes a maximum value of 1 and a minimum value of -1.

We start with \hat{z}=Az^{\prime}+c, where z^{\prime} is the element-wise affine transformation of z defined above, which brings the maximum and minimum value of each component to 1 and -1 (the corresponding rescaling and shift are absorbed into A and c). Following this transformation, we define the leftmost interval for z_{i}^{\prime} as [-1,-1+\eta_{i}] and the rightmost interval as [1-\zeta_{i},1], where \eta_{i}>0 and \zeta_{i}>0. Such intervals exist owing to Assumption A.13.

Few remarks are in order. i) Here we define intervals to be closed from both ends. Our arguments also extend to the case if these intervals are open from both ends or one end, ii) We assume all the values in the interval [1,1+ηi][-1,-1+\eta_{i}] are in the support. The argument presented below extends to the case when all the values in [1,1+ηi][-1,-1+\eta_{i}] are assumed by ziz_{i}^{{}^{\prime}} except for a set of measure zero, iii) The assumption A.13 can be relaxed by replacing supremum and infimum with essential supremum and infimum.

For a sufficiently small \kappa, we claim that the marginal supports of \hat{z}_{i} and \hat{z}_{j} contain the sets defined below. Formally stated,

(-\|a_{i}\|_{1}+c_{i},-\|a_{i}\|_{1}+c_{i}+\kappa)\cup(\|a_{i}\|_{1}+c_{i}-\kappa,\|a_{i}\|_{1}+c_{i})\subseteq\hat{\mathcal{Z}}^{(i)}_{i} (36)
(-\|a_{j}\|_{1}+c_{j},-\|a_{j}\|_{1}+c_{j}+\kappa)\cup(\|a_{j}\|_{1}+c_{j}-\kappa,\|a_{j}\|_{1}+c_{j})\subseteq\hat{\mathcal{Z}}^{(i)}_{j} (37)

where aia_{i} and aja_{j} are ithi^{th} and jthj^{th} row in matrix AA. We justify the above claim next. Suppose all elements of aia_{i} are positive. We set κ\kappa sufficiently small such that κ|aik|dηk\frac{\kappa}{|a_{ik}|d}\leq\eta_{k} for all k{1,,d}k\in\{1,\cdots,d\}. Since κ\kappa is sufficiently small, [1,1+κ|aik|d][-1,-1+\frac{\kappa}{|a_{ik}|d}] in the support zkz^{{}^{\prime}}_{k}, this holds for all k{1,,d}k\in\{1,\cdots,d\}. As a result, (ai1+ci,ai1+ci+κ)(-\|a_{i}\|_{1}+c_{i},-\|a_{i}\|_{1}+c_{i}+\kappa) is in the support of z^i\hat{z}_{i}. We can repeat the same argument when the signs of aia_{i} are not all positive by adjusting the signs of the elements zz^{{}^{\prime}}. This establishes (ai1+ci,ai1+ci+κ)𝒵^i(i)(-\|a_{i}\|_{1}+c_{i},-\|a_{i}\|_{1}+c_{i}+\kappa)\subseteq\hat{\mathcal{Z}}^{(i)}_{i}. Similarly, we can also establish that (ai1+ciκ,ai1+ci)𝒵^i(i)(\|a_{i}\|_{1}+c_{i}-\kappa,\|a_{i}\|_{1}+c_{i})\subseteq\hat{\mathcal{Z}}^{(i)}_{i}.

Suppose the two rows a_{i} and a_{j} share at least q\geq 1 non-zero entries. Without loss of generality, assume that a_{i1} is non-zero and a_{j1} is non-zero. Pick an \epsilon with 0<\epsilon<\kappa.

  • Suppose ai1a_{i1} and aj1a_{j1} are both positive. In this case, if z^i<ai1+ci+ϵ\hat{z}_{i}<-\|a_{i}\|_{1}+c_{i}+\epsilon, then

    z1<1+2ϵ|ai1|z_{1}^{{}^{\prime}}<-1+\frac{2\epsilon}{|a_{i1}|}

    To see why this is the case, substitute z_{1}^{\prime}=-1+\frac{2\epsilon}{|a_{i1}|} and observe that \hat{z}_{i}>-\|a_{i}\|_{1}+c_{i}+\epsilon.

  • Suppose ai1a_{i1} and aj1a_{j1} are both positive. In this case, if z^j>aj1+cjϵ\hat{z}_{j}>\|a_{j}\|_{1}+c_{j}-\epsilon, then

    z1>12ϵ|aj1|z_{1}^{{}^{\prime}}>1-\frac{2\epsilon}{|a_{j1}|}

For sufficiently small \epsilon (\epsilon<\frac{1}{1/|a_{i1}|+1/|a_{j1}|}), the inequalities z_{1}^{\prime}<-1+\frac{2\epsilon}{|a_{i1}|} and z_{1}^{\prime}>1-\frac{2\epsilon}{|a_{j1}|} cannot both hold simultaneously.

Therefore, z^i<ai1+ci+ϵ\hat{z}_{i}<-\|a_{i}\|_{1}+c_{i}+\epsilon and z^j>aj1+cjϵ\hat{z}_{j}>\|a_{j}\|_{1}+c_{j}-\epsilon cannot be true simultaneously. Individually, z^i<ai1+ci+ϵ\hat{z}_{i}<-\|a_{i}\|_{1}+c_{i}+\epsilon occurs with a probability greater than zero; see Eq. 36. Similarly, z^j>aj1+cjϵ\hat{z}_{j}>\|a_{j}\|_{1}+c_{j}-\epsilon occurs with a probability greater than zero; see Eq. 37. This contradicts the support independence constraint. For completeness, we present the argument for other possible signs of aa.

  • Suppose ai1a_{i1} is positive and aj1a_{j1} is negative. In this case, if z^i<ai1+ci+ϵ\hat{z}_{i}<-\|a_{i}\|_{1}+c_{i}+\epsilon, then

    z1<1+2ϵ|ai1|z_{1}^{{}^{\prime}}<-1+\frac{2\epsilon}{|a_{i1}|}
  • Suppose ai1a_{i1} is positive and aj1a_{j1} is negative. In this case, if z^j<aj1+cj+ϵ\hat{z}_{j}<-\|a_{j}\|_{1}+c_{j}+\epsilon, then

    z1>12ϵ|aj1|z_{1}^{{}^{\prime}}>1-\frac{2\epsilon}{|a_{j1}|}

The rest of this case is the same as the previous one. We can apply the same argument to any shared non-zero component. Note that the row a_{i} can neither be all zeros nor all non-zeros: in the former case the matrix A is not invertible, and in the latter case a_{j} would have to be all zeros, again contradicting the invertibility of A. This completes the proof. ∎
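The contradiction above can also be seen numerically. In the toy sketch below (our illustration, not part of the proof), a mixing matrix whose two rows share non-zero columns produces a joint support of (\hat{z}_{i},\hat{z}_{j}) that misses the corners of the product of its marginals, while a mixing matrix with no shared columns does not:

import numpy as np

rng = np.random.default_rng(0)
z = rng.uniform(-1.0, 1.0, size=(200_000, 2))        # independent support for (z_i, z_j)
A_shared = np.array([[1.0, 0.5], [0.5, 1.0]])        # rows share non-zero columns
A_clean = np.array([[1.0, 0.0], [0.0, 2.0]])         # rows share no column

def corner_is_hit(zhat, eps=0.05):
    lo, hi = zhat.min(axis=0), zhat.max(axis=0)
    # does any sample land near the corner (min of zhat_i, max of zhat_j)?
    return bool(np.any((zhat[:, 0] < lo[0] + eps) & (zhat[:, 1] > hi[1] - eps)))

print(corner_is_hit(z @ A_shared.T))   # False: the joint support is not a product of marginals
print(corner_is_hit(z @ A_clean.T))    # True (with overwhelming probability): support independence holds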

We now use the result from Theorem A.15 to prove the Theorem 5.8.

See 5.8

Proof.

Let us first verify that there exists a solution to Eq. 3 under Constraint 4.3, 5.6 (with |𝒮||𝒮||\mathcal{S}^{{}^{\prime}}|\leq|\mathcal{S}|).

We write Z^=ΠZ\hat{Z}=\Pi Z, where Π\Pi is a permutation matrix such that Z^k=Zi\hat{Z}_{k}=Z_{i}. For each m𝒮m\in\mathcal{S}^{{}^{\prime}} there exists a unique j𝒮j\in\mathcal{S} such that Z^m=Zj\hat{Z}_{m}=Z_{j}. Suppose h=gΠ1h=g\circ\Pi^{-1}. Observe that this construction satisfies the constraints in 5.6.

To show the above claim, we leverage Theorem A.15. We apply Theorem A.15 to all the pairs in {(k,m),m𝒮}\{(k,m),\forall m\in\mathcal{S}^{{}^{\prime}}\}, we obtain the following. We write z^k=akz+ck\hat{z}_{k}=a_{k}^{\top}z+c_{k}. Without loss of generality, assume aka_{k} is non-zero in first ss elements. Now consider any z^m=amz+cm\hat{z}_{m}=a_{m}^{\top}z+c_{m}, where m𝒮m\in\mathcal{S}^{{}^{\prime}}. From Theorem A.15 it follows that am[1:s]=0a_{m}[1:s]=0. This holds true for all m𝒮m\in\mathcal{S}^{{}^{\prime}}. Suppose sd|𝒮|+1s\geq d-|\mathcal{S}^{{}^{\prime}}|+1. In this case, the first ss columns cannot be full rank. Consider the submatrix formed by the first ss columns. In this submatrix |𝒮||\mathcal{S}^{{}^{\prime}}| rows are zero. The maximum rank of this matrix is d|𝒮|d-|\mathcal{S}^{{}^{\prime}}|. If sd|𝒮|+1s\geq d-|\mathcal{S}^{{}^{\prime}}|+1, then this submatrix would not have a full column rank, which contradicts the fact that AA is invertible. Therefore, 1sd|𝒮|1\leq s\leq d-|\mathcal{S}^{{}^{\prime}}|. ∎

We can relax the assumption that |𝒮||𝒮||\mathcal{S}^{{}^{\prime}}|\leq|\mathcal{S}| in the above theorem. We follow an iterative procedure. We start by solving 5.6 with |𝒮|=d1|\mathcal{S}^{{}^{\prime}}|=d-1. If a solution exists, then we stop. If a solution does not exist, then we reduce the size of |𝒮||\mathcal{S}^{{}^{\prime}}| by one and repeat the procedure till we find a solution. As we reach |𝒮|=|𝒮||\mathcal{S}^{{}^{\prime}}|=|\mathcal{S}| a solution has to exist.

A.4 Representation identification with observational data under independent support

See 6.3

Proof.

We will leverage Theorem A.15 to show this claim. Consider z^i=ai𝖳z+ci\hat{z}_{i}=a_{i}^{\mathsf{T}}z+c_{i}. We know that the aia_{i} has at least one non-zero element. Suppose it has at least q2q\geq 2 non-zero elements. Without loss of generality assume that these correspond to the first qq components. We apply Theorem A.15 to each pair z^i,z^j\hat{z}_{i},\hat{z}_{j} for all jij\not=i. Note here ii is kept fixed and then Theorem A.15 is applied to every possible pair. From the theorem we get that aj[1:q]a_{j}[1:q] is zero for all jij\not=i. If q2q\geq 2, then the span of first qq columns will be one dimensional and as a result AA cannot be invertible. Therefore, only one element of row ii is non-zero. We apply the above argument to all i{1,,d}i\in\{1,\cdots,d\}. We write a function π:{1,,d}{1,,d}\pi:\{1,\cdots,d\}\rightarrow\{1,\cdots,d\}, where π(i)\pi(i) is the index of the element that is non-zero in row ii, i.e., z^i=aiπ(i)zπ(i)+ci\hat{z}_{i}=a_{i\pi(i)}z_{\pi(i)}+c_{i}. Note that π\pi is injective, if two indices map to the same element, then that creates shared non-zero coefficients, which violates Theorem A.15. This completes the proof. ∎

Appendix B Supplementary Materials for Empirical Findings

B.1 Method details

1:  {Step 1: Training autoencoder (f,hf,h)}
2:  Sample data: X𝒳𝒳IX\sim\mathcal{X}\cup\mathcal{X}^{I} where 𝒳I=i=1d𝒳(i)\mathcal{X}^{I}=\cup_{i=1}^{d}\mathcal{X}^{(i)}
3:  Minimize reconstruction loss: f,h=argminf,h𝔼[hf(X)X2]f^{\dagger},h^{\dagger}=\operatorname*{arg\,min}_{f,h}\mathbb{E}\big{[}\|h\circ f(X)-X\|^{2}\big{]}
4:  
5:  {Step 2: Learning transformation Γ\Gamma with Independence of Support (IOS) objective}
6:  Sample data: Z^f(𝒳)\hat{Z}\sim f^{\dagger}(\mathcal{X}) where ff^{\dagger} is the encoder learnt in Step 1
7:  Minimize reconstruction + Hausdorff loss: minΓ𝔼[ΓΓ(Z^)Z^2]+λ×km𝖧𝖣(𝒵^k,m(Γ),𝒵^k(Γ)×𝒵^m(Γ))\min_{\Gamma}\mathbb{E}\big{[}\|\Gamma^{{}^{\prime}}\circ\Gamma(\hat{Z})-\hat{Z}\|^{2}\big{]}+\lambda\times\sum_{k\not=m}\mathsf{HD}\big{(}\hat{\mathcal{Z}}_{k,m}(\Gamma),\hat{\mathcal{Z}}_{k}(\Gamma)\times\hat{\mathcal{Z}}_{m}(\Gamma)\big{)}
8:  Return transformed latents: Γ(Z^)\Gamma(\hat{Z})
9:  
10:  {Step 2: Learning transformation Γ=[γi]i=1:d\Gamma=[\gamma_{i}]_{i=1:d} using do-interventions}
11:  for ii in {1,,d}\{1,\cdots,d\} do
12:     Sample data: Z^f(𝒳(i))\hat{Z}\sim f^{\dagger}(\mathcal{X}^{(i)}) where ff^{\dagger} is the encoder learnt in Step 1
13:     Fix intervention targets at random Y^(i)Uniform(0,1)\hat{Y}^{(i)}\sim\text{Uniform}(0,1)
14:     Minimize MSE loss: minγi𝔼Z^[γi(Z^)Y^(i)2]\min_{\gamma_{i}}\mathbb{E}_{\hat{Z}}\big{[}\big{\|}\gamma_{i}(\hat{Z})-\hat{Y}^{(i)}\big{\|}^{2}\big{]}
15:  end for
16:  Return transformed latents: Γ(Z^)\Gamma(\hat{Z})
Algorithm 1 Summarizing our two-step approach for both the independence of support (IOS) and the interventional data settings.

We provide details about our training procedure in Algorithm 1. For learning with the independence of support (IOS) objective in Step 2, we need to ensure that the map Γ\Gamma is invertible; hence we minimize a combination of the reconstruction loss and the Hausdorff distance, i.e.,

minΓ𝔼[ΓΓ(Z^)Z^2]+λ×km𝖧𝖣(𝒵^k,m(Γ),𝒵^k(Γ)×𝒵^m(Γ))\min_{\Gamma}\mathbb{E}\big{[}\|\Gamma^{{}^{\prime}}\circ\Gamma(\hat{Z})-\hat{Z}\|^{2}\big{]}+\lambda\times\sum_{k\not=m}\mathsf{HD}\big{(}\hat{\mathcal{Z}}_{k,m}(\Gamma),\hat{\mathcal{Z}}_{k}(\Gamma)\times\hat{\mathcal{Z}}_{m}(\Gamma)\big{)} (38)

where Z^\hat{Z} denotes the output from the encoder learnt in Step 1, i.e., Z^=f(X)\hat{Z}=f^{\dagger}(X).

If we have data with multiple interventional distributions per latent dimension, then we sample a new target for each interventional distribution. In our polynomial decoder experiments, we use a linear γi\gamma_{i}. In our image-based experiments, we use a non-linear map γi\gamma_{i} in Step 2.
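For concreteness, the snippet below gives a minimal PyTorch sketch of the Step 2 IOS objective in Equation 38, where each Hausdorff term is approximated on a minibatch by comparing samples from the joint support with samples from the product of marginal supports (obtained by random re-pairing). The function names and the finite-sample estimator are our own simplifications and may differ from the exact implementation used in the experiments.

```python
import torch

def support_hausdorff(joint, product):
    # Sum of the two directed Hausdorff distances between the point clouds,
    # used as a finite-sample proxy for HD(Z_{k,m}, Z_k x Z_m).
    dist = torch.cdist(joint, product)
    return dist.min(dim=1).values.max() + dist.min(dim=0).values.max()

def ios_step2_loss(gamma, gamma_inv, z_hat, lam=10.0):
    # z_hat: (batch, d) latents produced by the Step 1 encoder f^dagger.
    z = gamma(z_hat)                                   # candidate disentangled latents
    recon = ((gamma_inv(z) - z_hat) ** 2).mean()       # keeps Gamma (approximately) invertible
    hd, d = 0.0, z.shape[1]
    for k in range(d):
        for m in range(d):
            if k == m:
                continue
            joint = z[:, [k, m]]                       # samples from the joint support
            perm = torch.randperm(z.shape[0])          # random re-pairing approximates
            product = torch.stack([z[:, k], z[perm, m]], dim=1)  # the product of marginals
            hd = hd + support_hausdorff(joint, product)
    return recon + lam * hd
```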

B.2 Experiment setup details: Polynomial decoder (gg)

Basic setup.

We sample data following the DGP described in Assumption 4.2 with the following details:

  • Latent dimension: d{6,10}d\in\{6,10\}

  • Degree of decoder polynomial (gg): p{2,3}p\in\{2,3\}

  • Data dimension: n=200n=200

  • Decoder polynomial coefficient matrix GG: each element of the matrix is sampled i.i.d. from a standard normal distribution.

Latent distributions.

Recall ziz_{i} is the ithi^{th} component of the latent vector zdz\in\mathbb{R}^{d}. The various latent distributions (Z\mathbb{P}_{Z}) we use in our experiments are as follows:

  • Uniform: Each latent component ziz_{i} is sampled from Uniform(-5, 5). All the latents (ziz_{i}) are independent and identically distributed.

  • Uniform-Correlated: Consider a pair of latent variables zi,zi+1z_{i},z_{i+1} and sample two confounder variables c1,c2c_{1},c_{2} s.t. c1Bernoulli(p=0.5)c_{1}\sim\text{Bernoulli}(p=0.5), and c2Bernoulli(p=0.9)c_{2}\sim\text{Bernoulli}(p=0.9). Now we sample zi,zi+1z_{i},z_{i+1} using c1,c2c_{1},c_{2} as follows:

    zi{Uniform(0.0,0.5)ifc1=1Uniform(0.5,0.0)ifc1=0,z_{i}\sim\begin{cases}\text{Uniform}(0.0,0.5)&\text{if}\;c_{1}=1\\ \text{Uniform}(-0.5,0.0)&\text{if}\;c_{1}=0\end{cases},
    zi+1{Uniform(0.0,0.3)ifc1c2=1Uniform(0.3,0.0)ifc1c2=0,z_{i+1}\sim\begin{cases}\text{Uniform}(0.0,0.3)&\text{if}\;c_{1}\oplus c_{2}=1\\ \text{Uniform}(-0.3,0.0)&\text{if}\;c_{1}\oplus c_{2}=0\end{cases},

    where \oplus is the xor operation. Hence, c1c_{1} acts as a confounder since it is involved in the generation of both zi,zi+1z_{i},z_{i+1}, which makes them correlated. Due to the xor operation, the two random variables still satisfy the independence of support condition. Finally, we generate the full latent vector zz by repeating this procedure over consecutive pairs (i{1,,d}i\in\{1,\cdots,d\} with step size 2); a sampling sketch of this procedure is provided after this list.

  • Gaussian-Mixture: Each ziz_{i} is sampled from a Gaussian mixture model with two components and equal probability of sampling from the components, as described below:

    zi{𝒩(0,1)with prob. 0.5𝒩(1,2)with prob. 0.5z_{i}\sim\begin{cases}\mathcal{N}(0,1)&\text{with prob.}\;0.5\\ \mathcal{N}(1,2)&\text{with prob.}\;0.5\end{cases}

    As in the Uniform case, all latents are independent and identically distributed, though each latent now follows a mixture distribution instead of a unimodal one.

  • SCM-S: The latent vector zz is sampled from a linear-Gaussian SCM whose DAG over dd nodes is generated using the Erdős–Rényi scheme (Brouillard et al., 2020; https://github.com/slachapelle/dcdi), with the expected density (expected number of edges per node) set to 0.5.

  • SCM-D: Same as SCM-S, i.e., the latent vector zz is sampled from a linear-Gaussian SCM whose DAG over dd nodes is generated using the Erdős–Rényi scheme (Brouillard et al., 2020), but with the expected density (expected number of edges per node) set to 1.0.
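As referenced in the Uniform-Correlated description above, the following is a minimal NumPy sketch of that sampling procedure; the function name sample_uniform_correlated is ours, and the exact implementation may differ in minor details.

```python
import numpy as np

def sample_uniform_correlated(n, d, rng=None):
    # Sample latents in consecutive pairs (z_i, z_{i+1}) driven by confounders c1, c2.
    rng = rng or np.random.default_rng(0)
    z = np.zeros((n, d))
    for i in range(0, d, 2):
        c1 = rng.binomial(1, 0.5, size=n)
        c2 = rng.binomial(1, 0.9, size=n)
        # z_i depends on c1; z_{i+1} depends on c1 XOR c2, so the pair is correlated
        # through c1 while their supports remain independent.
        z[:, i] = np.where(c1 == 1, rng.uniform(0.0, 0.5, n), rng.uniform(-0.5, 0.0, n))
        xor = np.logical_xor(c1, c2)
        z[:, i + 1] = np.where(xor, rng.uniform(0.0, 0.3, n), rng.uniform(-0.3, 0.0, n))
    return z
```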

Case Train Validation Test
Observational (𝒟\mathcal{D}) 10000 2500 20000
Interventional (𝒟(I)\mathcal{D}^{(I)}) 10000 2500 20000
Table 5: Statistics for the synthetic poly-DGP experiments
Further details on dataset and evaluation.

For the experiments in Table 2, we only use observational data (𝒟\mathcal{D}), while for the experiments in Table 3, we use both observational and interventional data (𝒟𝒟(I)\mathcal{D}\cup\mathcal{D}^{(I)}), with details regarding the train/val/test split described in Table 5.

We carry out dodo interventions on each latent, with 𝒟(i)\mathcal{D}^{(i)} corresponding to data from interventions on ziz_{i}. The union of data from interventions across all latent dimensions is denoted as 𝒟(I)=i=1:d𝒟(i)\mathcal{D}^{(I)}=\cup_{i=1:d}\mathcal{D}^{(i)}. The index of the variable to be intervened on is sampled from Uniform({1,,d})\text{Uniform}(\{1,\dots,d\}), and the selected latent variable is set to the value 2.02.0 under the intervention.
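The following is a minimal sketch of this interventional data generation step; sample_obs_latents and g stand in for the observational latent sampler and the true decoder, and are placeholders rather than names from our codebase.

```python
import numpy as np

def sample_do_interventional(sample_obs_latents, g, n, d, value=2.0, rng=None):
    # Pick an intervention target uniformly at random per sample, clamp it to
    # `value` (do(z_i = 2.0)), and render observations with the decoder g.
    rng = rng or np.random.default_rng(0)
    z = sample_obs_latents(n)                # (n, d) latents from the observational DGP
    targets = rng.integers(0, d, size=n)     # intervened latent index per sample
    z[np.arange(n), targets] = value
    return g(z), targets
```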

Further, note that for learning the linear transformation (γi\gamma_{i}) in Step 2 (Eq. 7), we only use the corresponding interventional data (𝒟(i)\mathcal{D}^{(i)}) from do-intervention on the latent variable ii. Also, all the metrics (R2R^{2}, MCC (IOS), MCC, MCC (IL)) are computed only on the test split of observational data (𝒟\mathcal{D}) (no interventional data used).

Model architecture.

We use the following architecture for the encoder ff across all the experiments with the polynomial decoder gg (Table 2, Table 3) to minimize the reconstruction loss:

  • Linear Layer (nn, hh); LeakyReLU(0.50.5),

  • Linear Layer (hh, hh); LeakyReLU(0.50.5),

  • Linear Layer (hh, dd),

where nn is the input data dimension and hh is the number of hidden units, with h=200h=200 in all the experiments. For the decoder (hh) in Table 2 and Table 3, we use the polynomial decoder (h(z)=H[1,z,z¯z,,z¯¯zptimes]h(z)=H[1,z,z\bar{\otimes}z,\cdots,\underbrace{z\bar{\otimes}\cdots\bar{\otimes}\;z}_{p\;\text{times}}]^{\top}), where pp is set to the degree of the true decoder polynomial (g(z)g(z)) and the coefficient matrix HH is modeled using a single fully connected layer.
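A minimal PyTorch sketch of this encoder and of the polynomial decoder is given below, where we interpret the operator ⊗̄ as the Kronecker product with duplicate monomials removed; the class names are ours and details such as initialization may differ from the actual implementation.

```python
import torch
import torch.nn as nn
from itertools import combinations_with_replacement

class MLPEncoder(nn.Module):
    # n -> h -> h -> d with LeakyReLU(0.5), matching the architecture listed above.
    def __init__(self, n, d, h=200):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n, h), nn.LeakyReLU(0.5),
            nn.Linear(h, h), nn.LeakyReLU(0.5),
            nn.Linear(h, d),
        )

    def forward(self, x):
        return self.net(x)

class PolynomialDecoder(nn.Module):
    # Maps z to all distinct monomials of degree <= p, then applies one linear layer H.
    def __init__(self, d, n, p):
        super().__init__()
        self.p = p
        n_features = sum(
            len(list(combinations_with_replacement(range(d), k))) for k in range(p + 1)
        )
        self.H = nn.Linear(n_features, n, bias=False)

    def forward(self, z):
        feats = [torch.ones(z.shape[0], 1, device=z.device)]   # constant term
        for k in range(1, self.p + 1):
            for idx in combinations_with_replacement(range(z.shape[1]), k):
                feats.append(z[:, list(idx)].prod(dim=1, keepdim=True))
        return self.H(torch.cat(feats, dim=1))
```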

For the independence of support (IOS) experiments in Table 2, we model both Γ,Γ\Gamma,\Gamma^{{}^{\prime}} using a single fully connected layer.

For the interventional data results (Table 3), we learn the mappings γi\gamma_{i} from the corresponding interventional data (X(i)\mathbb{P}_{X}^{(i)}) using the default linear regression class from scikit-learn (Pedregosa et al., 2011) with the intercept term turned off.
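A minimal sketch of this step for a single intervention value per latent is shown below; the helper name learn_gamma_i is hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def learn_gamma_i(z_hat_interv, rng=None):
    # z_hat_interv: Step 1 encoder outputs on data from do-interventions on latent i.
    # Regress a fixed, randomly drawn pseudo-target Y^(i) onto these encodings
    # without an intercept, as in Step 2 of Algorithm 1.
    rng = rng or np.random.default_rng(0)
    y = np.full(len(z_hat_interv), rng.uniform(0.0, 1.0))
    return LinearRegression(fit_intercept=False).fit(z_hat_interv, y)
```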

Finally, for the results with the NN decoder hh (Table 8, Table 9), we use the following architecture for the decoder, with the number of hidden nodes h=200h=200:

  • Linear layer (dd, hh); LeakyReLU(0.50.5)

  • Linear layer (hh, hh); LeakyReLU(0.50.5)

  • Linear layer (hh, nn)

Hyperparameters.

We use the Adam optimizer with the hyperparameters listed below. We also use an early stopping strategy: we halt training if the validation loss does not improve for 10 consecutive epochs (a minimal sketch of this rule is given after the list).

  • Batch size: 1616

  • Weight decay: 5×1045\times 10^{-4}

  • Total epochs: 200200

  • Learning rate: optimal value chosen from grid: {103,5×104,104}\{10^{-3},5\times 10^{-4},10^{-4}\}
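As mentioned above, the following is a minimal sketch of the early-stopping rule (patience of 10 epochs on the validation loss); train_epoch and validate are placeholders for the training and validation routines.

```python
def train_with_early_stopping(train_epoch, validate, max_epochs=200, patience=10):
    # Halt when the validation loss has not improved for `patience` consecutive epochs.
    best, wait = float("inf"), 0
    for epoch in range(max_epochs):
        train_epoch(epoch)
        val_loss = validate()
        if val_loss < best:
            best, wait = val_loss, 0
        else:
            wait += 1
            if wait >= patience:
                break
    return best
```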

For experiments with the independence of support (IOS) objective in Step 2 (Table 2), we train with λ=10\lambda=10 as the relative weight of the Hausdorff distance term in the objective (Equation 38).

B.3 Additional results: Polynomial decoder (gg)

Table 6 presents additional details about Table 2 in the main paper. We report additional metrics: the mean squared error of the autoencoder reconstruction (Recon-MSE) and the MCC computed using the representations from Step 1. Note that, on distributions that satisfy independence of support, training with the independence of support objective in Step 2 leads to better MCC scores than using the representations from Step 1 directly. Also, the Uniform-Correlated (Uniform-C) latent case can be interpreted as another sparse SCM with confounders between latent variables: the latent variables are not independent but their supports are still independent, and therefore we see an improvement in MCC with IOS training in Step 2. Similarly, Table 7 presents the extended results for the interventional case with the polynomial decoder (Table 3 in the main paper), with additional metrics: the reconstruction error (Recon-MSE) and the R2R^{2} score that tests for affine identification using the representations from Step 1. We observe the same pattern for all latent distributions: training on interventional data in Step 2 improves the MCC metric.

Further, we also experiment with a neural-network-based decoder to obtain a more standard autoencoder architecture in which we do not assume access to the specific polynomial structure or the degree of the polynomial. Table 8 presents the results with the NN decoder for the observational case, where we see a trend similar to the polynomial decoder case (Table 6): MCC increases with IOS training in Step 2 for the Uniform and Uniform-C latent distributions. Similarly, Table 9 presents the results with the NN decoder for the interventional case, where the trend is again similar to the polynomial decoder case (Table 7), though the MCC (IL) for the SCM sparse and SCM dense cases is lower than with the polynomial decoder.

Z\mathbb{P}_{Z} dd pp Recon-MSE R2R^{2} MCC MCC (IOS)
Uniform 66 22 1.59±0.401.59\pm 0.40 1.00±0.001.00\pm 0.00 66.91±2.4566.91\pm 2.45 99.31±0.0799.31\pm 0.07
Uniform 66 33 1.81±0.401.81\pm 0.40 1.00±0.001.00\pm 0.00 75.14±3.9375.14\pm 3.93 99.39±0.0699.39\pm 0.06
Uniform 1010 22 2.04±0.762.04\pm 0.76 1.00±0.001.00\pm 0.00 58.49±2.2658.49\pm 2.26 90.73±2.9290.73\pm 2.92
Uniform 1010 33 8.59±2.158.59\pm 2.15 0.99±0.000.99\pm 0.00 56.77±0.6056.77\pm 0.60 94.62±1.5094.62\pm 1.50
Uniform-C 66 22 0.36±0.070.36\pm 0.07 1.00±0.001.00\pm 0.00 71.19±2.2971.19\pm 2.29 96.81±0.1196.81\pm 0.11
Uniform-C 66 33 1.72±0.671.72\pm 0.67 1.00±0.001.00\pm 0.00 70.53±1.170.53\pm 1.1 96.29±0.0596.29\pm 0.05
Uniform-C 1010 22 0.86±0.270.86\pm 0.27 1.00±0.001.00\pm 0.00 64.58±1.8164.58\pm 1.81 85.31±2.3585.31\pm 2.35
Uniform-C 1010 33 2.42±0.472.42\pm 0.47 1.00±0.001.00\pm 0.00 62.69±0.9262.69\pm 0.92 87.20±1.7787.20\pm 1.77
Gaussian-Mixture 66 22 0.86±0.270.86\pm 0.27 1.0±0.01.0\pm 0.0 70.53±1.2570.53\pm 1.25 67.43±2.0167.43\pm 2.01
Gaussian-Mixture 66 33 0.86±0.320.86\pm 0.32 0.99±0.00.99\pm 0.0 66.19±1.3866.19\pm 1.38 67.94±1.4267.94\pm 1.42
Gaussian-Mixture 1010 22 1.38±0.511.38\pm 0.51 1.0±0.01.0\pm 0.0 59.5±2.2259.5\pm 2.22 58.3±0.6758.3\pm 0.67
Gaussian-Mixture 1010 33 4.12±1.704.12\pm 1.70 0.99±0.00.99\pm 0.0 57.15±0.4357.15\pm 0.43 59.08±1.1159.08\pm 1.11
SCM-S 66 22 1.52±0.701.52\pm 0.70 0.96±0.020.96\pm 0.02 71.77±1.4371.77\pm 1.43 72.61±1.4872.61\pm 1.48
SCM-S 66 33 2.25±0.512.25\pm 0.51 0.87±0.070.87\pm 0.07 73.14±3.4473.14\pm 3.44 70.56±1.5470.56\pm 1.54
SCM-S 1010 22 4.23±1.134.23\pm 1.13 0.99±0.00.99\pm 0.0 64.35±2.064.35\pm 2.0 65.86±1.3265.86\pm 1.32
SCM-S 1010 33 2.83±0.852.83\pm 0.85 0.90±0.050.90\pm 0.05 61.95±0.9861.95\pm 0.98 58.77±1.2758.77\pm 1.27
SCM-D 66 22 1.34±0.261.34\pm 0.26 0.97±0.010.97\pm 0.01 75.25±2.8575.25\pm 2.85 61.61±4.3661.61\pm 4.36
SCM-D 66 33 1.20±0.551.20\pm 0.55 0.81±0.110.81\pm 0.11 82.9±3.1182.9\pm 3.11 65.19±2.7065.19\pm 2.70
SCM-D 1010 22 2.89±0.792.89\pm 0.79 0.83±0.100.83\pm 0.10 67.49±2.3267.49\pm 2.32 69.64±3.0969.64\pm 3.09
SCM-D 1010 33 1.55±0.391.55\pm 0.39 0.72±0.150.72\pm 0.15 66.4±1.8666.4\pm 1.86 60.1±1.1660.1\pm 1.16
Table 6: Observational data with Polynomial Decoder: Mean ± S.E. (5 random seeds). R2R^{2} and MCC (IOS) achieve high values (for Uniform & Uniform-C) as predicted by Theorem 4.4 and Theorem 6.3 respectively.
Z\mathbb{P}_{Z} dd pp Recon-MSE R2R^{2} MCC MCC (IL)
Uniform 66 22 0.29±0.080.29\pm 0.08 1.0±0.01.0\pm 0.0 69.11±1.1169.11\pm 1.11 100.0±0.0100.0\pm 0.0
Uniform 66 33 0.97±0.360.97\pm 0.36 1.0±0.01.0\pm 0.0 73.42±0.4973.42\pm 0.49 100.0±0.0100.0\pm 0.0
Uniform 1010 22 2.29±0.852.29\pm 0.85 1.0±0.01.0\pm 0.0 59.96±2.0359.96\pm 2.03 100.0±0.0100.0\pm 0.0
Uniform 1010 33 2.74±0.362.74\pm 0.36 1.0±0.01.0\pm 0.0 65.94±0.8065.94\pm 0.80 99.85±0.0399.85\pm 0.03
Uniform-C 66 22 0.29±0.110.29\pm 0.11 1.0±0.01.0\pm 0.0 71.2±2.4671.2\pm 2.46 100.0±0.0100.0\pm 0.0
Uniform-C 66 33 1.50±0.621.50\pm 0.62 1.0±0.01.0\pm 0.0 70.21±1.9070.21\pm 1.90 99.97±0.0199.97\pm 0.01
Uniform-C 1010 22 0.79±0.240.79\pm 0.24 1.0±0.01.0\pm 0.0 61.02±1.0361.02\pm 1.03 100.0±0.0100.0\pm 0.0
Uniform-C 1010 33 1.72±0.451.72\pm 0.45 1.0±0.01.0\pm 0.0 61.16±1.5961.16\pm 1.59 99.91±0.0199.91\pm 0.01
Gaussian-Mixture 66 22 0.75±0.270.75\pm 0.27 1.0±0.01.0\pm 0.0 67.72±2.2067.72\pm 2.20 99.99±0.0199.99\pm 0.01
Gaussian-Mixture 66 33 0.57±0.200.57\pm 0.20 0.99±0.00.99\pm 0.0 70.21±2.7470.21\pm 2.74 99.39±0.0599.39\pm 0.05
Gaussian-Mixture 1010 22 0.61±0.160.61\pm 0.16 1.0±0.01.0\pm 0.0 60.77±1.6060.77\pm 1.60 99.98±0.0199.98\pm 0.01
Gaussian-Mixture 1010 33 2.29±0.722.29\pm 0.72 0.99±0.00.99\pm 0.0 57.81±1.1657.81\pm 1.16 99.46±0.0599.46\pm 0.05
SCM-S 66 22 0.21±0.040.21\pm 0.04 0.99±0.00.99\pm 0.0 68.41±0.9068.41\pm 0.90 99.53±0.3899.53\pm 0.38
SCM-S 66 33 0.93±0.180.93\pm 0.18 0.99±0.00.99\pm 0.0 74.12±2.3274.12\pm 2.32 99.25±0.3499.25\pm 0.34
SCM-S 1010 22 0.63±0.170.63\pm 0.17 1.0±0.01.0\pm 0.0 68.01±2.3668.01\pm 2.36 99.92±0.0399.92\pm 0.03
SCM-S 1010 33 1.29±0.311.29\pm 0.31 0.97±0.010.97\pm 0.01 66.81±1.1066.81\pm 1.10 98.8±0.1398.8\pm 0.13
SCM-D 66 22 0.81±0.050.81\pm 0.05 0.99±0.010.99\pm 0.01 71.8±3.7771.8\pm 3.77 99.64±0.1299.64\pm 0.12
SCM-D 66 33 0.75±0.260.75\pm 0.26 0.98±0.010.98\pm 0.01 79.48±3.4579.48\pm 3.45 98.22±1.0798.22\pm 1.07
SCM-D 1010 22 0.76±0.150.76\pm 0.15 0.98±0.010.98\pm 0.01 70.78±1.8970.78\pm 1.89 95.3±2.2495.3\pm 2.24
SCM-D 1010 33 0.96±0.220.96\pm 0.22 0.97±0.00.97\pm 0.0 70.08±2.8070.08\pm 2.80 97.24±0.8897.24\pm 0.88
Table 7: Interventional data with Polynomial Decoder: Mean ± S.E. (5 random seeds). MCC(IL) is high as predicted by Theorem 5.3.
Z\mathbb{P}_{Z} dd pp Recon-MSE R2R^{2} MCC MCC (IOS)
Uniform 66 22 1.22±0.191.22\pm 0.19 0.98±0.00.98\pm 0.0 73.75±2.8573.75\pm 2.85 99.05±0.0299.05\pm 0.02
Uniform 66 33 2.79±0.202.79\pm 0.20 0.92±0.00.92\pm 0.0 63.29±1.0663.29\pm 1.06 95.74±0.1295.74\pm 0.12
Uniform 1010 22 3.66±0.393.66\pm 0.39 0.99±0.00.99\pm 0.0 61.71±1.1661.71\pm 1.16 94.25±2.1394.25\pm 2.13
Uniform 1010 33 33.16±3.3433.16\pm 3.34 0.94±0.00.94\pm 0.0 59.27±1.0659.27\pm 1.06 91.24±4.9991.24\pm 4.99
Uniform-C 66 22 0.65±0.100.65\pm 0.10 0.96±0.020.96\pm 0.02 68.46±1.9468.46\pm 1.94 94.95±1.8394.95\pm 1.83
Uniform-C 66 33 1.39±0.301.39\pm 0.30 0.91±0.00.91\pm 0.0 68.09±1.5668.09\pm 1.56 89.14±2.3889.14\pm 2.38
Uniform-C 1010 22 1.78±0.091.78\pm 0.09 0.99±0.00.99\pm 0.0 62.63±2.0562.63\pm 2.05 88.88±3.2888.88\pm 3.28
Uniform-C 1010 33 12.0±1.5912.0\pm 1.59 0.91±0.010.91\pm 0.01 59.91±1.7559.91\pm 1.75 81.76±3.6781.76\pm 3.67
Gaussian-Mixture 66 22 0.49±0.120.49\pm 0.12 0.95±0.00.95\pm 0.0 72.59±2.0372.59\pm 2.03 65.33±1.1165.33\pm 1.11
Gaussian-Mixture 66 33 0.79±0.160.79\pm 0.16 0.84±0.010.84\pm 0.01 66.25±2.8666.25\pm 2.86 63.43±1.2763.43\pm 1.27
Gaussian-Mixture 1010 22 1.38±0.181.38\pm 0.18 0.95±0.00.95\pm 0.0 57.12±1.5257.12\pm 1.52 54.76±1.2654.76\pm 1.26
Gaussian-Mixture 1010 33 7.22±1.237.22\pm 1.23 0.83±0.010.83\pm 0.01 55.41±1.4055.41\pm 1.40 52.87±0.8652.87\pm 0.86
SCM-S 66 22 2.24±1.112.24\pm 1.11 0.59±0.180.59\pm 0.18 69.77±3.8769.77\pm 3.87 66.04±1.3466.04\pm 1.34
SCM-S 66 33 2.45±0.182.45\pm 0.18 0.74±0.050.74\pm 0.05 73.72±1.6373.72\pm 1.63 67.66±2.1867.66\pm 2.18
SCM-S 1010 22 6.41±1.716.41\pm 1.71 0.78±0.080.78\pm 0.08 65.99±1.1465.99\pm 1.14 63.52±1.1163.52\pm 1.11
SCM-S 1010 33 4.32±1.374.32\pm 1.37 0.11±0.430.11\pm 0.43 66.96±2.6066.96\pm 2.60 62.11±1.3662.11\pm 1.36
SCM-D 66 22 2.7±0.392.7\pm 0.39 0.63±0.220.63\pm 0.22 75.19±2.6275.19\pm 2.62 61.89±4.061.89\pm 4.0
SCM-D 66 33 1.89±0.731.89\pm 0.73 0.47±0.250.47\pm 0.25 77.83±3.4977.83\pm 3.49 65.85±1.5865.85\pm 1.58
SCM-D 1010 22 4.46±0.764.46\pm 0.76 0.46±0.110.46\pm 0.11 69.81±1.4369.81\pm 1.43 65.35±2.7265.35\pm 2.72
SCM-D 1010 33 3.53±0.693.53\pm 0.69 0.10±0.290.10\pm 0.29 65.89±2.5665.89\pm 2.56 61.92±1.9561.92\pm 1.95
Table 8: Observational data with Neural Network Decoder: Mean ± S.E. (5 random seeds). R2R^{2} achieves high values in many cases, while MCC (IOS) achieves high values only for Uniform & Uniform-C.
Z\mathbb{P}_{Z} dd pp Recon-MSE R2R^{2} MCC MCC (IL)
Uniform 66 22 0.35±0.080.35\pm 0.08 0.98±0.00.98\pm 0.0 68.39±1.2168.39\pm 1.21 99.09±0.0299.09\pm 0.02
Uniform 66 33 2.02±0.282.02\pm 0.28 0.91±0.00.91\pm 0.0 63.2±1.3363.2\pm 1.33 91.67±2.5091.67\pm 2.50
Uniform 1010 22 3.89±0.503.89\pm 0.50 0.99±0.00.99\pm 0.0 60.54±1.8160.54\pm 1.81 99.59±0.0499.59\pm 0.04
Uniform 1010 33 29.21±2.3329.21\pm 2.33 0.95±0.00.95\pm 0.0 61.0±1.4861.0\pm 1.48 93.73±0.4593.73\pm 0.45
Uniform-C 66 22 0.42±0.150.42\pm 0.15 0.94±0.020.94\pm 0.02 65.91±0.5365.91\pm 0.53 96.43±1.4796.43\pm 1.47
Uniform-C 66 33 1.05±0.191.05\pm 0.19 0.91±0.00.91\pm 0.0 67.92±3.4867.92\pm 3.48 94.8±0.2894.8\pm 0.28
Uniform-C 1010 22 1.32±0.091.32\pm 0.09 0.99±0.00.99\pm 0.0 60.02±1.8360.02\pm 1.83 99.42±0.0199.42\pm 0.01
Uniform-C 1010 33 10.46±1.2710.46\pm 1.27 0.92±0.00.92\pm 0.0 61.68±1.2061.68\pm 1.20 93.83±0.7893.83\pm 0.78
Gaussian-Mixture 66 22 0.45±0.130.45\pm 0.13 0.94±0.00.94\pm 0.0 70.64±3.8370.64\pm 3.83 96.87±0.1496.87\pm 0.14
Gaussian-Mixture 66 33 0.62±0.120.62\pm 0.12 0.83±0.010.83\pm 0.01 64.43±2.3664.43\pm 2.36 84.53±2.6084.53\pm 2.60
Gaussian-Mixture 1010 22 0.87±0.150.87\pm 0.15 0.94±0.00.94\pm 0.0 57.35±1.6257.35\pm 1.62 97.06±0.1697.06\pm 0.16
Gaussian-Mixture 1010 33 5.98±0.935.98\pm 0.93 0.83±0.00.83\pm 0.0 57.89±2.0657.89\pm 2.06 80.14±1.7780.14\pm 1.77
SCM-S 66 22 0.27±0.070.27\pm 0.07 0.94±0.020.94\pm 0.02 74.68±2.2874.68\pm 2.28 93.07±2.1693.07\pm 2.16
SCM-S 66 33 0.9±0.180.9\pm 0.18 0.89±0.020.89\pm 0.02 71.56±3.1871.56\pm 3.18 88.66±2.7188.66\pm 2.71
SCM-S 1010 22 0.93±0.230.93\pm 0.23 0.98±0.00.98\pm 0.0 66.08±1.0466.08\pm 1.04 94.14±0.3994.14\pm 0.39
SCM-S 1010 33 1.99±0.361.99\pm 0.36 0.88±0.010.88\pm 0.01 63.35±1.4463.35\pm 1.44 76.62±6.1576.62\pm 6.15
SCM-D 66 22 0.69±0.070.69\pm 0.07 0.95±0.020.95\pm 0.02 76.99±2.5376.99\pm 2.53 91.63±1.9091.63\pm 1.90
SCM-D 66 33 0.87±0.250.87\pm 0.25 0.88±0.010.88\pm 0.01 75.72±1.6975.72\pm 1.69 88.19±3.6388.19\pm 3.63
SCM-D 1010 22 1.05±0.291.05\pm 0.29 0.95±0.010.95\pm 0.01 68.71±2.1668.71\pm 2.16 90.14±4.3590.14\pm 4.35
SCM-D 1010 33 1.68±0.341.68\pm 0.34 0.86±0.010.86\pm 0.01 68.52±2.1168.52\pm 2.11 81.82±3.081.82\pm 3.0
Table 9: Interventional data with Neural Network Decoder: Mean ± S.E. (5 random seeds). MCC(IL) is high.

B.4 Experiment setup details: Synthetic image experiments

The latent variable comprises the (xx, yy) coordinates of two balls; hence we have a d=4d=4 dimensional latent variable. We use the PyGame (Shinners, 2011) rendering engine to render final images of dimension 64×64×364\times 64\times 3.

Latent Distributions.

We denote the (x,y)(x,y) coordinates of Ball 1 as (x1x_{1}, y1y_{1}) and those of Ball 2 as (x2x_{2}, y2y_{2}). We consider the following three latent distributions for the synthetic image experiments:

  • Uniform: Each coordinate of Ball 1 (x1x_{1}, y1y_{1}) and Ball 2 (x2x_{2}, y2y_{2}) is sampled from Uniform(0.1,0.9)\text{Uniform}(0.1,0.9).

  • SCM (linear): The coordinates of Ball 1 (x1x_{1}, y1y_{1}) are sampled from Uniform(0.1,0.9)\text{Uniform}(0.1,0.9) and are then used to sample the coordinates of Ball 2 as follows (a sampling sketch covering both SCM cases is given after this list):

    x2{Uniform(0.1,0.5)ifx1+y11.0Uniform(0.5,0.9)ifx1+y1<1.0x_{2}\sim\begin{cases}\text{Uniform}(0.1,0.5)&\text{if}\;x_{1}+y_{1}\geq 1.0\\ \text{Uniform}(0.5,0.9)&\text{if}\;x_{1}+y_{1}<1.0\end{cases}
    y2{Uniform(0.5,0.9)ifx1+y11.0Uniform(0.1,0.5)ifx1+y1<1.0y_{2}\sim\begin{cases}\text{Uniform}(0.5,0.9)&\text{if}\;x_{1}+y_{1}\geq 1.0\\ \text{Uniform}(0.1,0.5)&\text{if}\;x_{1}+y_{1}<1.0\end{cases}
  • SCM (non-linear): The coordinates of Ball 1 (x1x_{1}, y1y_{1}) are sampled from Uniform(0.1,0.9)\text{Uniform}(0.1,0.9), which are used to sample the coordinates of Ball 2 as follows:

    x2{Uniform(0.1,0.5)if 1.25×(x12+y12)1.0Uniform(0.5,0.9)if 1.25×(x12+y12)<1.0x_{2}\sim\begin{cases}\text{Uniform}(0.1,0.5)&\text{if}\;1.25\times(x_{1}^{2}+y_{1}^{2})\geq 1.0\\ \text{Uniform}(0.5,0.9)&\text{if}\;1.25\times(x_{1}^{2}+y_{1}^{2})<1.0\end{cases}
    y2{Uniform(0.5,0.9)if 1.25×(x12+y12)1.0Uniform(0.1,0.5)if 1.25×(x12+y12)<1.0y_{2}\sim\begin{cases}\text{Uniform}(0.5,0.9)&\text{if}\;1.25\times(x_{1}^{2}+y_{1}^{2})\geq 1.0\\ \text{Uniform}(0.1,0.5)&\text{if}\;1.25\times(x_{1}^{2}+y_{1}^{2})<1.0\end{cases}
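As referenced above, the following is a minimal NumPy sketch covering both SCM latent distributions; the function name sample_two_ball_latents is ours.

```python
import numpy as np

def sample_two_ball_latents(n, scm="linear", rng=None):
    # Ball 1 coordinates are uniform; Ball 2 coordinates depend on Ball 1 through a
    # linear (x1 + y1 >= 1) or non-linear (1.25 * (x1^2 + y1^2) >= 1) threshold.
    rng = rng or np.random.default_rng(0)
    x1, y1 = rng.uniform(0.1, 0.9, n), rng.uniform(0.1, 0.9, n)
    if scm == "linear":
        cond = (x1 + y1) >= 1.0
    else:
        cond = 1.25 * (x1 ** 2 + y1 ** 2) >= 1.0
    x2 = np.where(cond, rng.uniform(0.1, 0.5, n), rng.uniform(0.5, 0.9, n))
    y2 = np.where(cond, rng.uniform(0.5, 0.9, n), rng.uniform(0.1, 0.5, n))
    return np.stack([x1, y1, x2, y2], axis=1)
```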
Case Train Validation Test
Observational (𝒟\mathcal{D}) 20000 5000 20000
Interventional (𝒟(I)\mathcal{D}^{(I)}) 20000 5000 20000
Table 10: Statistics for the synthetic image experiments
Further details on dataset and evaluation.

For experiments in Table 4, the details regarding the train/val/test split are described in Table 10.

Note that the interventional data (𝒟(I)\mathcal{D}^{(I)}) is composed of do interventions on each latent variable (𝒟(I)=i=1:d𝒟i\mathcal{D}^{(I)}=\cup_{i=1:d}\mathcal{D}^{i}), where the latent variable to be intervened on is sampled from Uniform({1,,d})\text{Uniform}(\{1,\cdots,d\}). Hence, each latent variable is equally likely to be intervened on.

While performing do-interventions on any latent variable (𝒟(i)\mathcal{D}^{(i)}), we control the total number of distinct values the latent takes under intervention (#interv; each distinct value corresponds to sampling data from one interventional distribution). When #interv =1=1, we set the latent variable ii to the value 0.5. When #interv >1>1, we choose the values corresponding to the different do-interventions on latent variable ii as #interv equally spaced points in S=[0.25,0.75]S=[0.25,0.75]. For example, when #interv =3=3, the possible values after a do-intervention on latent variable ii are {0.25,0.50,0.75}\{0.25,0.50,0.75\}. For each interventional sample, the intervention value is drawn uniformly at random from this set of values.
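A small sketch of how the intervention values can be constructed is shown below; the helper name intervention_values is hypothetical.

```python
import numpy as np

def intervention_values(num_interv):
    # #interv equally spaced do-intervention values in [0.25, 0.75];
    # a single intervention uses the fixed value 0.5.
    if num_interv == 1:
        return np.array([0.5])
    return np.linspace(0.25, 0.75, num_interv)

# intervention_values(3) -> array([0.25, 0.5, 0.75])
```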

Note that we only use the observational data (𝒟\mathcal{D}) for training the autoencoder in Step 1, while the non-linear transformations γi\gamma_{i} in Step 2 (Eq. 7) are learnt using the corresponding interventional data (𝒟(i)\mathcal{D}^{(i)}). Further, the metrics (MCC, MCC (IL)) are computed only on the test split of the observational data (𝒟\mathcal{D}) (no interventional data used).

Model architecture.

We use the following architecture for the encoder ff across all experiments (Table 4) in Step 1 to minimize the reconstruction loss (a PyTorch sketch of this encoder is given after the architecture details below):

  • ResNet-18 Architecture (No Pre Training): Image (64×64×364\times 64\times 3) \rightarrow Penultimate Layer Output (512512 dimensional)

  • Linear Layer (512,128)(512,128); BatchNorm(); LeakyReLU()

  • Linear Layer (128,25)(128,25); BatchNorm()

We use the following architecture for the decoder hh across all experiments (Table 4) in Step 1 to minimize the reconstruction loss. Our decoder architecture is inspired by the implementation in widely used prior work (Locatello et al., 2019).

  • Linear Layer (25,128)(25,128); LeakyReLU()

  • Linear Layer (128,1024)(128,1024); LeakyReLU()

  • DeConvolution Layer (cinc_{in}: 6464, coutc_{out}: 6464, kernel: 44; stride: 22; padding: 11); LeakyReLU()

  • DeConvolution Layer (cinc_{in}: 6464, coutc_{out}: 3232, kernel: 44; stride: 22; padding: 11); LeakyReLU()

  • DeConvolution Layer (cinc_{in}: 3232, coutc_{out}: 3232, kernel: 44; stride: 22; padding: 11); LeakyReLU()

  • DeConvolution Layer (cinc_{in}: 3232, coutc_{out}: 33, kernel: 44; stride: 22; padding: 11); LeakyReLU()

Note: Here the latent dimension of the encoder (2525) is larger than the true latent dimension (d=4d=4), since matching the true dimension would lead to issues with training the autoencoder itself. This choice also better reflects practical scenarios where we do not know the latent dimension beforehand.
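As referenced above, a minimal PyTorch sketch of the encoder is given below; the class name ImageEncoder is ours, and the torchvision ResNet-18 is instantiated without pretrained weights.

```python
import torch.nn as nn
from torchvision.models import resnet18

class ImageEncoder(nn.Module):
    # ResNet-18 backbone (no pretraining) followed by the projection head listed above.
    def __init__(self, latent_dim=25):
        super().__init__()
        backbone = resnet18(weights=None)
        backbone.fc = nn.Identity()          # keep the 512-d penultimate features
        self.backbone = backbone
        self.head = nn.Sequential(
            nn.Linear(512, 128), nn.BatchNorm1d(128), nn.LeakyReLU(),
            nn.Linear(128, latent_dim), nn.BatchNorm1d(latent_dim),
        )

    def forward(self, x):                    # x: (batch, 3, 64, 64)
        return self.head(self.backbone(x))
```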

For learning the mappings γi\gamma_{i} from the corresponding interventional data (X(i)\mathbb{P}_{X}^{(i)}), we use the default MLP Regressor class from scikit-learn (Pedregosa et al., 2011) with a maximum of 1000 iterations for convergence.
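A minimal sketch of this fitting step is shown below; learn_gamma_i_nonlinear is a hypothetical helper name.

```python
from sklearn.neural_network import MLPRegressor

def learn_gamma_i_nonlinear(z_hat_interv, y_targets):
    # z_hat_interv: (N, 25) Step 1 encodings of images from do-interventions on latent i;
    # y_targets: the fixed pseudo-target assigned to each interventional distribution.
    return MLPRegressor(max_iter=1000).fit(z_hat_interv, y_targets)
```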

Hyperparameters.

We use the Adam optimizer with the hyperparameters listed below. We also use an early stopping strategy: we halt training if the validation loss does not improve for 100 consecutive epochs.

  • Batch size: 6464

  • Weight decay: 5×1045\times 10^{-4}

  • Total epochs: 10001000

  • Learning rate: 5×1045\times 10^{-4}

B.5 Additional Results: Synthetic Image Experiments

Z\mathbb{P}_{Z} #interv Recon-RMSE R2R^{2} MCC (IL)
Uniform 11 0.04±0.00.04\pm 0.0 0.51±0.00.51\pm 0.0 34.18±0.2434.18\pm 0.24
Uniform 33 0.04±0.00.04\pm 0.0 0.51±0.00.51\pm 0.0 73.94±0.3873.94\pm 0.38
Uniform 55 0.04±0.00.04\pm 0.0 0.51±0.00.51\pm 0.0 73.62±0.2173.62\pm 0.21
Uniform 77 0.04±0.00.04\pm 0.0 0.51±0.00.51\pm 0.0 72.54±0.3472.54\pm 0.34
Uniform 99 0.04±0.00.04\pm 0.0 0.51±0.00.51\pm 0.0 73.14±0.4773.14\pm 0.47
SCM (linear) 11 0.03±0.00.03\pm 0.0 0.8±0.00.8\pm 0.0 12.81±0.2812.81\pm 0.28
SCM (linear) 33 0.03±0.00.03\pm 0.0 0.8±0.00.8\pm 0.0 73.21±0.3373.21\pm 0.33
SCM (linear) 55 0.03±0.00.03\pm 0.0 0.8±0.00.8\pm 0.0 83.38±0.2183.38\pm 0.21
SCM (linear) 77 0.03±0.00.03\pm 0.0 0.8±0.00.8\pm 0.0 84.22±0.2584.22\pm 0.25
SCM (linear) 99 0.03±0.00.03\pm 0.0 0.8±0.00.8\pm 0.0 86.16±0.1786.16\pm 0.17
SCM (non-linear) 11 0.04±0.00.04\pm 0.0 0.69±0.00.69\pm 0.0 19.70±0.3119.70\pm 0.31
SCM (non-linear) 33 0.04±0.00.04\pm 0.0 0.69±0.00.69\pm 0.0 59.68±0.2859.68\pm 0.28
SCM (non-linear) 55 0.04±0.00.04\pm 0.0 0.69±0.00.69\pm 0.0 62.79±0.2062.79\pm 0.20
SCM (non-linear) 77 0.04±0.00.04\pm 0.0 0.69±0.00.69\pm 0.0 69.31±0.3469.31\pm 0.34
SCM (non-linear) 99 0.04±0.00.04\pm 0.0 0.69±0.00.69\pm 0.0 71.37±0.2671.37\pm 0.26
Table 11: Interventional data in image-based experiments: Mean ± S.E. (5 random seeds). MCCs increase with the number of interventions per latent dimension as predicted by Theorem A.12.

Table 11 presents more details about Table 4 in the main paper, with additional metrics: the reconstruction error of the autoencoder (Recon-RMSE) and the R2R^{2} score that tests for affine identification using the representations from Step 1. Note that Recon-RMSE and R2R^{2} are computed using the autoencoder trained in Step 1, hence these results are not affected by training with varying #interv per latent in Step 2. We obtain high R2R^{2} values across the different latent distributions, indicating that the higher-dimensional latents (d^=25\hat{d}=25) learned by the encoder are related to the lower-dimensional true latents (d=4d=4) by a linear function.
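As a rough illustration of this metric, the sketch below fits a linear map from the learned latents to the true latents and reports the resulting R2R^{2} score; the exact evaluation protocol in our code may differ.

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

def affine_identification_r2(z_hat, z_true):
    # z_hat: (N, 25) learned latents; z_true: (N, 4) true latents.
    # A high score indicates the learned latents determine the true latents
    # up to a linear (affine) map.
    reg = LinearRegression().fit(z_hat, z_true)
    return r2_score(z_true, reg.predict(z_hat))
```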

We also report a batch of reconstructed images from the trained autoencoder for the different latent distributions: Uniform (Figure 3), SCM (linear) (Figure 4), and SCM (non-linear) (Figure 5). In all cases, the position and color of both balls are accurately reconstructed.

Refer to caption
Figure 3: Reconstructed images (top row) for the corresponding real images (bottom row) for the uniform latent case.
Refer to caption
Figure 4: Reconstructed images (top row) for the corresponding real images (bottom row) for the SCM (linear) latent case.
Refer to caption
Figure 5: Reconstructed images (top row) for the corresponding real images (bottom row) for the SCM (non-linear) latent case.

B.6 Experiments with independence penalty from β\beta-VAE

In this section, we provide additional comparisons with models trained with the independence prior on the latents used in β\beta-VAEs (Burgess et al., 2018). We take a standard autoencoder trained with a reconstruction penalty and add the β\beta-VAE penalty to it. We carry out these comparisons for both the polynomial data-generation experiments and the image-based experiments. For the polynomial data-generation experiments, we use the same MLP-based encoder-decoder architecture that we used earlier for Table 8 and Table 9; Table 12 and Table 13 show the results for the autoencoder trained with the β\beta-VAE penalty in the same settings as Table 8 and Table 9, respectively. For the image-based experiments, we use the same ResNet-based encoder-decoder architecture that we used earlier for Table 4; Table 14 shows results in the same setting as Table 4, focusing on the case with nine interventions per latent.

Z\mathbb{P}_{Z} dd pp MCC (β=0.1\beta=0.1) MCC (β=1.0\beta=1.0) MCC (β=10.0\beta=10.0) MCC (IOS)
Uniform 66 22 67.35±2.767.35\pm 2.7 68.73±2.8868.73\pm 2.88 72.38±3.472.38\pm 3.4 99.05±0.0299.05\pm 0.02
Uniform 66 33 70.98±2.5770.98\pm 2.57 69.43±1.8269.43\pm 1.82 71.46±3.1371.46\pm 3.13 95.74±0.1295.74\pm 0.12
Uniform 1010 22 58.94±2.0458.94\pm 2.04 57.8±1.3557.8\pm 1.35 60.14±1.3360.14\pm 1.33 94.25±2.1394.25\pm 2.13
Uniform 1010 33 59.29±2.4559.29\pm 2.45 60.94±2.1760.94\pm 2.17 59.22±1.2459.22\pm 1.24 91.24±4.9991.24\pm 4.99
SCM-S 66 22 65.37±2.2865.37\pm 2.28 61.98±4.5261.98\pm 4.52 65.63±4.1565.63\pm 4.15 66.04±1.3466.04\pm 1.34
SCM-S 66 33 64.53±2.3864.53\pm 2.38 65.23±0.9965.23\pm 0.99 68.61±2.7468.61\pm 2.74 67.66±2.1867.66\pm 2.18
SCM-S 1010 22 62.54±1.3362.54\pm 1.33 62.23±1.6562.23\pm 1.65 63.43±0.8763.43\pm 0.87 63.52±1.1163.52\pm 1.11
SCM-S 1010 33 62.44±0.5662.44\pm 0.56 58.04±1.6458.04\pm 1.64 59.5±0.8959.5\pm 0.89 62.11±1.3662.11\pm 1.36
SCM-D 66 22 60.86±2.6360.86\pm 2.63 59.36±2.7159.36\pm 2.71 62.26±1.6662.26\pm 1.66 61.89±4.061.89\pm 4.0
SCM-D 66 33 66.32±1.3666.32\pm 1.36 65.49±1.9765.49\pm 1.97 66.43±1.466.43\pm 1.4 65.85±1.5865.85\pm 1.58
SCM-D 1010 22 61.13±1.4261.13\pm 1.42 62.23±1.3262.23\pm 1.32 61.38±2.2561.38\pm 2.25 65.35±2.7265.35\pm 2.72
SCM-D 1010 33 60.39±1.8360.39\pm 1.83 58.64±2.0158.64\pm 2.01 58.43±1.0858.43\pm 1.08 61.92±1.9561.92\pm 1.95
Table 12: Observational data with Neural Network Decoder: Mean ± S.E. (5 random seeds).
Z\mathbb{P}_{Z} dd pp MCC (β=0.1\beta=0.1) MCC (β=1.0\beta=1.0) MCC (β=10.0\beta=10.0) MCC (IL)
Uniform 66 22 69.79±1.8369.79\pm 1.83 68.62±2.9968.62\pm 2.99 69.59±1.7669.59\pm 1.76 99.09±0.0299.09\pm 0.02
Uniform 66 33 68.44±1.8868.44\pm 1.88 71.86±0.4271.86\pm 0.42 68.76±2.1368.76\pm 2.13 91.67±2.5091.67\pm 2.50
Uniform 1010 22 60.46±1.4660.46\pm 1.46 58.93±0.9158.93\pm 0.91 58.95±0.8958.95\pm 0.89 99.59±0.0499.59\pm 0.04
Uniform 1010 33 59.85±1.4659.85\pm 1.46 62.92±2.9862.92\pm 2.98 61.34±1.6661.34\pm 1.66 93.73±0.4593.73\pm 0.45
SCM-S 66 22 71.18±2.1171.18\pm 2.11 71.01±1.3271.01\pm 1.32 67.6±2.0767.6\pm 2.07 93.07±2.1693.07\pm 2.16
SCM-S 66 33 72.67±1.2972.67\pm 1.29 70.9±3.6370.9\pm 3.63 75.6±3.0475.6\pm 3.04 88.66±2.7188.66\pm 2.71
SCM-S 1010 22 65.04±1.4665.04\pm 1.46 64.47±1.4964.47\pm 1.49 65.84±2.165.84\pm 2.1 94.14±0.3994.14\pm 0.39
SCM-S 1010 33 63.2±1.8363.2\pm 1.83 62.31±1.162.31\pm 1.1 62.16±1.6862.16\pm 1.68 76.62±6.1576.62\pm 6.15
SCM-D 66 22 72.6±2.0172.6\pm 2.01 76.58±2.9576.58\pm 2.95 71.92±2.8571.92\pm 2.85 91.63±1.9091.63\pm 1.90
SCM-D 66 33 67.79±0.9767.79\pm 0.97 72.93±1.8172.93\pm 1.81 72.98±1.2772.98\pm 1.27 88.19±3.6388.19\pm 3.63
SCM-D 1010 22 69.78±3.8569.78\pm 3.85 69.19±2.7869.19\pm 2.78 66.53±1.2866.53\pm 1.28 90.14±4.3590.14\pm 4.35
SCM-D 1010 33 64.9±1.6864.9\pm 1.68 66.86±2.6166.86\pm 2.61 64.25±1.6164.25\pm 1.61 81.82±3.081.82\pm 3.0
Table 13: Interventional data with Neural Network Decoder: Mean ± S.E. (5 random seeds).
Z\mathbb{P}_{Z} MCC (β=0.1\beta=0.1) MCC (β=1.0\beta=1.0) MCC (β=10.0\beta=10.0) MCC (IL)
Uniform 42.6±4.2342.6\pm 4.23 36.5±2.4536.5\pm 2.45 38.3±2.2238.3\pm 2.22 73.1±0.4773.1\pm 0.47
SCM (linear) 60.8±2.5260.8\pm 2.52 59.5±2.4759.5\pm 2.47 61.6±1.0661.6\pm 1.06 86.2±0.1786.2\pm 0.17
SCM (non-linear) 62.5±1.8862.5\pm 1.88 60.7±2.4160.7\pm 2.41 59.7±1.2959.7\pm 1.29 71.4±0.2671.4\pm 0.26
Table 14: Interventional data in image-based experiments: Mean ± S.E. (5 random seeds).