Interventional Causal Representation Learning
Abstract
Causal representation learning seeks to extract high-level latent factors from low-level sensory data. Most existing methods rely on observational data and structural assumptions (e.g., conditional independence) to identify the latent factors. However, interventional data is prevalent across applications. Can interventional data facilitate causal representation learning? We explore this question in this paper. The key observation is that interventional data often carries geometric signatures of the latent factors’ support (i.e., what values each latent can possibly take). For example, when the latent factors are causally connected, interventions can break the dependency between the intervened latents’ support and their ancestors’. Leveraging this fact, we prove that the latent causal factors can be identified up to permutation and scaling given data from perfect interventions. Moreover, given data from imperfect interventions, we achieve block affine identification, namely the estimated latent factors are only entangled with a few other latents. These results highlight the unique power of interventional data in causal representation learning; it can enable provable identification of latent factors without any assumptions about their distributions or dependency structure.
1 Introduction
Modern deep learning models like GPT-3 (Brown et al., 2020) and CLIP (Radford et al., 2021) are remarkable representation learners (Bengio et al., 2013). Despite the successes, these models continue to be far from the human ability to adapt to new situations (distribution shifts) or carry out new tasks (Geirhos et al., 2020; Bommasani et al., 2021; Yamada et al., 2022). Humans encapsulate their causal knowledge of the world in a highly reusable and recomposable way (Goyal & Bengio, 2020), enabling them to adapt to new tasks in an ever-distribution-shifting world. How can we empower modern deep learning models with this type of causal understanding? This question is central to the emerging field of causal representation learning (Schölkopf et al., 2021).
A core task in causal representation learning is provable representation identification, i.e., developing representation learning algorithms that can provably identify natural latent factors (e.g., location, shape and color of different objects in a scene). While provable representation identification is known to be impossible for arbitrary data-generating processes (DGPs) (Hyvärinen & Pajunen, 1999; Locatello et al., 2019), real data often exhibits additional structure. For example, Hyvarinen et al. (2019); Khemakhem et al. (2022) consider conditional independence between the latents given auxiliary information; Lachapelle et al. (2022) leverage the sparsity of the causal connections among the latents; Locatello et al. (2020); Klindt et al. (2020); Ahuja et al. (2022a) rely on sparse variation in the latents over time.

Most existing works rely on observational data and make assumptions on the dependency structure of the latents to achieve provable representation identification. However, in many applications, such as robotics and genomics, there is a wealth of interventional data available. For example, interventional data can be obtained from experiments such as genetic perturbations (Dixit et al., 2016) and electrical stimulations (Nejatbakhsh et al., 2021). Can interventional data help identify latent factors in causal representation learning? How can it help? We explore these questions in this work. The key observation is that interventional data often carries geometric signatures of the latent factors’ support (i.e., what values each latent can possibly take). Fig. 1 illustrates these geometric signatures: perfect interventions and many imperfect interventions can make the intervened latents’ support independent of their ancestors’ support. As we will show, these geometric signatures go a long way in facilitating provable representation identification in the absence of strong distributional assumptions.
Contributions. This work establishes representation identification guarantees without strong distributional assumptions on the latents in the following settings.
• do interventions. We first investigate scenarios where the true latent factors are mapped to high-dimensional observations through a finite-degree multivariate polynomial. When some latent dimension undergoes a hard do intervention (Pearl, 2009), we are able to identify it up to shift and scaling. Even when the mapping is not a polynomial, approximate identification of the intervened latent is still achievable provided we have data from multiple interventional distributions on the same latent dimension.
• Perfect & imperfect interventions. We achieve block affine identification under imperfect interventions (Peters et al., 2017) provided the support of the intervened latent is rendered independent of its ancestors under the intervention, as shown in Figure 1c. This result covers all perfect interventions as a special case.
• Observational data and independent support. The independence-of-support condition above can further facilitate representation identification with observational data. We show that, if the supports of the latents are already independent in observational data, then these latents can be identified up to permutation, shift, and scaling, without the need for any interventional data. This result extends the classical identifiability results from linear independent component analysis (ICA) (Comon, 1994) to allow for dependent latent variables. It also provides theoretical justification for recent proposals of performing unsupervised disentanglement through the independent support condition (Wang & Jordan, 2021; Roth et al., 2022).
We summarize our results in Table 1. Finally, we empirically demonstrate the practical utility of our theory. On data-generation mechanisms ranging from polynomial decoders to images produced by a rendering engine (Shinners, 2011), we show that interventional data helps identification.
Also, the code repository can be accessed at: github.com/facebookresearch/CausalRepID.
Table 1: Summary of identification results.

Input data | Assm. on latents | Assm. on decoder | Identification
---|---|---|---
Obs | Cond. indep. given aux. info | Diffeomorphic | Perm & scale (Khemakhem et al., 2020)
Obs | Non-empty interior | Injective poly | Affine (Theorem 4.4)
Obs | Non-empty interior | Poly-approximable | Approx. affine (Theorem A.8)
Obs | Independent support | Injective poly | Perm, shift, & scale (Theorem 6.3)
Obs + do intervn | Non-empty interior | Injective poly | Perm, shift, & scale (Theorem 5.3)
Obs + do intervn | Non-empty interior | Diffeomorphic | Perm & comp-wise (Theorem A.12)
Obs + perfect intervn | Non-empty interior | Injective poly | Block affine (Theorem 5.8)
Obs + imperfect intervn | Partially indep. support | Injective poly | Block affine (Theorem 5.8)
Counterfactual | Bijection w.r.t. noise | Diffeomorphic | Perm & comp-wise (Brehmer et al., 2022)
2 Related Work
Existing provable representation identification approaches often utilize structure in time-series data, as seen in the initial works of Hyvarinen & Morioka (2016) and Hyvarinen & Morioka (2017). More recent studies have expanded on this approach, such as Hälvä & Hyvarinen (2020); Yao et al. (2021, 2022a, 2022b); Lippe et al. (2022b, a); Lachapelle et al. (2022). Other forms of weak supervision, such as data augmentations, can also be used in representation identification, as seen in works by Zimmermann et al. (2021); Von Kügelgen et al. (2021); Brehmer et al. (2022); Locatello et al. (2020); Ahuja et al. (2022a) that assume access to contrastive pairs of observations. A third approach, used in Khemakhem et al. (2022, 2020), involves using high-dimensional observations (e.g., an image) together with auxiliary information (e.g., a label) to identify representations.
To understand the factual and counterfactual knowledge used by different works in representation identification, we can classify them according to Pearl’s ladder of causation (Bareinboim et al., 2022). In particular, our work operates with interventional data (level-two knowledge), while other studies leverage either observational data (level-one knowledge) or counterfactual data (level-three knowledge). Works such as Khemakhem et al. (2022, 2020); Ahuja et al. (2022b); Hyvarinen & Morioka (2016, 2017); Ahuja et al. (2021) use observational data and either make assumptions on the structure of the underlying causal graph of the latents or rely on auxiliary information. In contrast, works like Brehmer et al. (2022) use counterfactual knowledge to achieve identification for general DAG structures; Lippe et al. (2022b, a); Ahuja et al. (2022a); Lachapelle et al. (2022) use pre- and post-intervention observations to achieve provable representation identification. These latter studies use instance-level temporal interventions that carry much more information than interventional distributions alone. To summarize, these works require more information than is available with level-two data in Pearl’s ladder of causation.
Finally, a concurrent work by Seigal et al. (2022) also studies identification of causal representations using interventional distributions. The authors focus on linear mixing of the latents and consider perfect interventions. In contrast, our results cover nonlinear mixing functions and imperfect interventions.
3 Setup: Causal Representation Learning
Causal representation learning aims to identify latent variables from high-dimensional observations. Begin with a data-generating process where high-dimensional observations x ∈ R^n are generated from latent variables z ∈ R^d. We consider the task of identifying the latent z assuming access to both observational and interventional datasets: the observational data is drawn from
z \sim \mathbb{P}_Z, \qquad x = g(z), \qquad (1)
where the latent z is sampled from the distribution \mathbb{P}_Z and x is the observed data point rendered from the underlying latent z via an injective decoder g. The interventional data is drawn from a similar distribution except the latent is drawn from \mathbb{P}_Z^{(i)}, namely the distribution of z under an intervention on the i-th latent z_i:
z \sim \mathbb{P}_Z^{(i)}, \qquad x = g(z). \qquad (2)
We denote \mathcal{Z} and \mathcal{Z}^{(i)} as the supports of \mathbb{P}_Z and \mathbb{P}_Z^{(i)} respectively (the support is the set where the probability density is greater than zero). The support of x is thus g(\mathcal{Z}) in observational data and g(\mathcal{Z}^{(i)}) in interventional data. The goal of causal representation learning is provable representation identification, i.e., to learn an encoder function that takes the observation x as input and provably outputs its underlying true latent z. In practice, such an encoder is often learned by solving a reconstruction identity,
h \circ f(x) = x, \qquad \forall\, x \in g(\mathcal{Z}) \cup g(\mathcal{Z}^{(i)}), \qquad (3)
where f and h are a pair of encoder and decoder that need to jointly satisfy Eq. 3. The pair (f, h) together is referred to as the autoencoder. Given the learned encoder f, the resulting representation is \hat{z} = f(x), which holds the encoder’s estimate of the latents.
The reconstruction identity Eq. 3 is highly underspecified and cannot in general identify the latents. There exist many pairs (f, h) that jointly solve Eq. 3 but do not provide representations that coincide with the true latents z. For instance, applying an invertible map b to any solution (f, h) results in another valid solution (b \circ f, h \circ b^{-1}). In practical applications, however, the exact identification of the latents is not necessary. For example, we may not be concerned with recovering the latent dimensions in the order they appear in z. Thus, in this work, we examine conditions on the data-generating process under which the true latents can be identified up to certain transformations, such as affine transformations and coordinate permutations.
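To make the learning problem concrete, the following is a minimal sketch (in PyTorch) of fitting an autoencoder (f, h) by minimizing the empirical reconstruction error, a relaxed form of the identity in Eq. 3. The architecture, widths, and optimizer settings are illustrative assumptions, not the configuration used in the experiments.

```python
# Minimal sketch: fit an autoencoder (f, h) by minimizing the empirical
# reconstruction error over pooled observational + interventional data.
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, x_dim, z_dim, hidden=200):
        super().__init__()
        self.f = nn.Sequential(                 # encoder: x -> z_hat
            nn.Linear(x_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, z_dim),
        )
        self.h = nn.Sequential(                 # decoder: z_hat -> x_hat
            nn.Linear(z_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, x_dim),
        )

    def forward(self, x):
        z_hat = self.f(x)
        return self.h(z_hat), z_hat

def train_autoencoder(x_all, z_dim, epochs=200, lr=1e-3):
    """x_all: tensor of pooled observational and interventional observations."""
    model = AutoEncoder(x_all.shape[1], z_dim)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        x_hat, _ = model(x_all)
        loss = ((x_hat - x_all) ** 2).mean()    # relaxed reconstruction identity
        opt.zero_grad(); loss.backward(); opt.step()
    return model
```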
4 Stepping Stone: Affine Representation Identification with Polynomial Decoders
We first establish an affine identification result, which serves as a stepping stone towards stronger identification guarantees in the next section. We begin with a few assumptions.
Assumption 4.1.
The interior of the support \mathcal{Z} of z is a non-empty subset of \mathbb{R}^d.¹ [¹ We work with (\mathbb{R}^d, \|\cdot\|_2) as the metric space. A point is in the interior of a set if there exists an \epsilon-ball, for some \epsilon > 0, around that point that is contained in the set. The set of all such points defines the interior.]
Assumption 4.2.
The decoder g is a polynomial of finite degree p whose corresponding coefficient matrix G has full column rank. Specifically, the decoder g is determined by the coefficient matrix G as follows,
g(z) = G\,[1, z, z^{\bar\otimes 2}, \dots, z^{\bar\otimes p}]^{\top},
where z^{\bar\otimes k} represents the k-fold Kronecker product z \otimes \cdots \otimes z with all distinct entries; for example, if z = [z_1, z_2]^{\top}, then z^{\bar\otimes 2} = [z_1^2, z_1 z_2, z_2^2]^{\top}.
The assumption that the matrix G has full column rank guarantees that the decoder g is injective; see Lemma A.1 in Appendix A.1 for a proof. This injectivity condition on g is common in identifiable representation learning. Without injectivity, the problem of identification becomes ill-defined; multiple different latents z can give rise to the same observation x. We note that the full-column-rank condition for G in Assumption 4.2 imposes an implicit constraint on the dimensionality of the data; it requires that the dimensionality n of x is at least the number of terms in the polynomial of degree p. In the Appendix (Theorem A.5), we show that if our data is generated from sparse polynomials, i.e., G is a sparse matrix, then n is allowed to be much smaller.
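For concreteness, here is a small sketch of the polynomial decoder in Assumption 4.2, using scikit-learn’s PolynomialFeatures to enumerate the distinct monomials [1, z, z^{⊗̄2}, ..., z^{⊗̄p}]. The random Gaussian coefficient matrix G and the dimensions are illustrative choices (a Gaussian G of this shape has full column rank with probability one when the observation dimension exceeds the number of monomials).

```python
# Sketch of the polynomial decoder x = G [1, z, z^{(2)}, ..., z^{(p)}] with distinct monomials.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

def polynomial_decoder(z, G, degree):
    """z: (num_samples, d) latents; G: (x_dim, num_monomials) coefficient matrix."""
    phi = PolynomialFeatures(degree=degree).fit_transform(z)   # [1, z_i, z_i z_j, ...]
    return phi @ G.T

d, degree, x_dim = 6, 2, 100                                   # illustrative sizes
rng = np.random.default_rng(0)
num_monomials = PolynomialFeatures(degree=degree).fit(np.zeros((1, d))).n_output_features_
G = rng.standard_normal((x_dim, num_monomials))                # full column rank w.h.p.
z = rng.uniform(0.0, 1.0, size=(1000, d))
x = polynomial_decoder(z, G, degree)                           # (1000, x_dim) observations
```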
Under Assumptions 4.1 and 4.2, we perform causal representation learning with two constraints: polynomial decoder and non-collapsing encoder.
Constraint 4.3.
The learned decoder h is a polynomial of degree p and it is determined by its corresponding coefficient matrix H as follows,
h(\hat z) = H\,[1, \hat z, \hat z^{\bar\otimes 2}, \dots, \hat z^{\bar\otimes p}]^{\top},
where \hat z^{\bar\otimes k} represents the k-fold Kronecker product with all distinct entries. The interior of the image of the encoder f is a non-empty subset of \mathbb{R}^d.
We now show that solving the reconstruction identity with these constraints can provably identify the true latent up to affine transformations.
Theorem 4.4.
Suppose the observational data and interventional data are generated from Eq. 1 and Eq. 2 respectively under Assumptions 4.1 and 4.2. The autoencoder that solves the reconstruction identity in Eq. 3 under Constraint 4.3 achieves affine identification, i.e., \hat z = A z + c, where \hat z is the encoder f’s output, z is the true latent, A \in \mathbb{R}^{d \times d} is invertible, and c \in \mathbb{R}^d.
Theorem 4.4 drastically reduces the ambiguities in identifying the latent z from arbitrary invertible transformations to only invertible affine transformations. Moreover, Theorem 4.4 does not require any structural assumptions about the dependency between the latents. It only requires (i) a geometric assumption that the interior of the support of z is non-empty and (ii) that the map g is a finite-degree polynomial.
The proof of Theorem 4.4 is in Appendix A.1. The idea is to write the representation as \hat z = a(z) with a = f \circ g, leveraging the relationship in Eq. 1. We then show the function a must be an affine map. To give further intuition, we consider a toy example with a one-dimensional latent z, a three-dimensional observation x, and the true decoder g and the learned decoder h each being a degree-two polynomial. We first solve the reconstruction identity on all x in the support, which gives h(f(x)) = x and, equivalently, h(a(z)) = g(z). This equality implies that both h(a(z)) and g(z) must be at most degree-two polynomials of z. As a consequence, a must be a degree-one polynomial of z, which we prove by contradiction: if a were a degree-two polynomial of z, then h(a(z)) would be degree four, contradicting the fact that h(a(z)) is at most degree two in z. Therefore, a must be a degree-one polynomial in z, i.e., an affine function of z.
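For concreteness, the degree-counting argument of the toy example can be written out as follows, in a simplified scalar rendering (taking a coordinate of h with a non-zero quadratic coefficient; the symbols g, h, and a = f ∘ g follow the proof sketch above):

```latex
\begin{align*}
  g(z) &= g_0 + g_1 z + g_2 z^2, \qquad
  h(\hat z) = h_0 + h_1 \hat z + h_2 \hat z^2 \;\;(h_2 \neq 0), \\
  a(z) &= a_0 + a_1 z + a_2 z^2
  \;\Longrightarrow\;
  h(a(z)) = h_2 a_2^2\, z^4 + \dots \quad (\text{degree } 4 \text{ if } a_2 \neq 0),
\end{align*}
% which contradicts h(a(z)) = g(z) (degree at most two) on a set of positive measure.
% Hence a_2 = 0 and a(z) = a_0 + a_1 z is affine.
```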
Beyond polynomial map g.
Theorem A.8 in the Appendix extends Theorem 4.4 to a class of maps g that are \epsilon-approximable by a polynomial.
5 Provable Representation Identification with Interventional Data
In the previous section, we derived affine identification guarantees. Next, we strengthen these guarantees by leveraging geometric signals specific to many interventions.
5.1 Representation identification with do interventions
We begin with a motivating example on images, where we are given data with do interventions on the latents. Consider the two balls shown in Fig. 2a. Ball 1’s coordinates and Ball 2’s coordinates together form the latent z, which is rendered in the form of the image shown in Fig. 2a. The latent in the observational data follows the directed acyclic graph (DAG) in Fig. 2b, where Ball 1’s coordinates cause Ball 2’s coordinates. The latent under a do intervention on the second coordinate of Ball 2 follows the DAG in Fig. 2c. Our goal is to learn an encoder using the images in the observational and interventional data that outputs the coordinates of the balls up to permutation and scaling.
Suppose z is generated from a structural causal model with an underlying DAG (Pearl, 2009). Formally, a do intervention on one latent dimension fixes it to some constant value. The distribution of the children of the intervened latent is affected by the intervention, while the distribution of the remaining latents remains unaltered. Based on this property of do interventions, we characterize the distribution \mathbb{P}_Z^{(i)} in Eq. 2 as
z_i = z^{*}, \qquad z_{-i} \sim \mathbb{P}^{(i)}_{Z_{-i}}, \qquad (4)
where z_i takes a fixed value z^{*}. The remaining variables in z, denoted z_{-i}, are sampled from \mathbb{P}^{(i)}_{Z_{-i}}.
The distribution in Eq. 4 encompasses many settings in practice, including (i) do interventions on causal DAGs (Pearl, 2009), i.e., do(z_i = z^{*}), (ii) do interventions on cyclic graphical models (Mooij & Heskes, 2013), and (iii) sampling z from its conditional distribution given z_i = z^{*} in the observational data (e.g., subsampling images in observational data with a fixed background color).
Given interventional data from do interventions, we perform causal representation learning by leveraging the geometric signature of the intervention in the search for the autoencoder. In particular, we enforce the following constraint while solving the reconstruction identity in Eq. 3.
Constraint 5.1.
One component of the encoder’s output, denoted \hat z_j = f_j(x), is required to take some fixed value \hat z^{*} for all x in the interventional data. Formally stated, f_j(x) = \hat z^{*} for all x \in g(\mathcal{Z}^{(i)}).
In Constraint 5.1, we do not need to know which latent component is intervened or the value it takes, i.e., i and z^{*} are unknown to the learner. We next show how this constraint helps identify the intervened latent under an additional assumption on the support of the unintervened latents, stated below.

Assumption 5.2.
The interior of the support of the distribution of the unintervened latents z_{-i} in Eq. 4 is a non-empty subset of \mathbb{R}^{d-1}.
Theorem 5.3. Suppose the observational data and interventional data are generated from Eq. 1 and Eq. 2 (with \mathbb{P}_Z^{(i)} as in Eq. 4) respectively under Assumptions 4.1, 4.2, and 5.2. The autoencoder that solves the reconstruction identity in Eq. 3 under Constraints 4.3 and 5.1 identifies the intervened latent up to shift and scaling, i.e., the constrained component of its output satisfies \hat z_j = e\, z_i + c for some e \neq 0 and c \in \mathbb{R}.
Theorem 5.3 immediately extends to settings where multiple interventional distributions are available, each corresponding to a hard intervention on a distinct latent variable. Under the same assumptions as Theorem 5.3, each of the intervened latents can be identified up to permutation, shift, and scaling. Notably, Theorem 5.3 does not rely on any distributional assumptions (e.g., parametric assumptions) on \mathbb{P}_Z; nor does it rely on the nature of the graphical model for z (e.g., cyclic, acyclic). Theorem 5.3 makes only these key geometric assumptions: (i) the support of z in observational data has a non-empty interior, and (ii) the support of the unintervened latents has a non-empty interior.
Theorem 5.3 combines the affine identification guarantee we derived in Theorem 4.4 with the geometric signature of interventions. For example, in Fig. 1b, the support of the true latents is axis-aligned (parallel to the x-axis). In this case, the interventional constraint also forces the support of \hat z to be axis-aligned (parallel to the x-axis or the y-axis). The proof of Theorem 5.3 is in Appendix A.2. We provide some intuition here. First, given Assumptions 4.1 and 4.2 and Constraint 4.3, Theorem 4.4 already guarantees affine identification. It implies \hat z_j = a_{ji} z_i + \mathbf{a}_{j,-i}^{\top} z_{-i} + c_j, where z_{-i} includes all entries of z other than z_i and \mathbf{a}_{j,-i} is the vector of the corresponding coefficients. As a result, \mathbf{a}_{j,-i}^{\top} z_{-i} must also take a fixed value for all z_{-i} in the interventional support, since both \hat z_j and z_i are set to fixed values. We argue by contradiction. If \mathbf{a}_{j,-i} \neq 0, then any change of z_{-i} in the direction of \mathbf{a}_{j,-i} will also reflect as a change in \hat z_j; this contradicts the fact that \hat z_j takes a fixed value. Therefore, \mathbf{a}_{j,-i} = 0 and z_i is identified up to shift and scaling.
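The argument can be summarized in one line using the affine form from Theorem 4.4 (indices as in the proof sketch above):

```latex
% On the interventional support, z_i = z^{*} is fixed and Constraint 5.1 fixes \hat z_j:
\begin{align*}
  \hat z_j
  = a_{ji}\, z_i + \mathbf{a}_{j,-i}^{\top} z_{-i} + c_j
  = a_{ji}\, z^{*} + \mathbf{a}_{j,-i}^{\top} z_{-i} + c_j .
\end{align*}
% If \mathbf{a}_{j,-i} \neq 0, moving z_{-i} along \mathbf{a}_{j,-i} (possible because the support of
% z_{-i} has non-empty interior) changes \hat z_j, contradicting the constraint. Hence
% \mathbf{a}_{j,-i} = 0 and \hat z_j = a_{ji} z_i + (a_{ji} z^{*} \text{ absorbed into the shift}),
% i.e., z_i is identified up to shift and scaling.
```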
Beyond polynomial map g.
In Theorem 5.3, we assume that the map g is a polynomial. In the Appendix (Theorem A.12) we show that, even when g is not a polynomial but a general diffeomorphism, the intervened latent can be approximately identified up to an invertible transform provided sufficiently many interventional distributions per latent are available. That said, one interventional distribution per latent no longer suffices, unlike in the polynomial case. Our experiments on images in § 8 further support this argument. We state Theorem A.12 informally below.
Theorem.
(Informal) Suppose the observational data is generated from Eq. 1 and suppose we gather multiple interventional datasets for latent z_i, where in each interventional dataset z_i is set to a distinct fixed value under a do intervention following Eq. 4. If the number of interventional datasets is sufficiently large and the supports of the latents satisfy certain regularity conditions (detailed in Theorem A.12), then the autoencoder that solves Eq. 3 under multiple constraints of the form of Constraint 5.1 approximately identifies z_i up to an invertible transform.
5.2 General perfect and imperfect interventions
In the discussion so far, we focused on do interventions. In this section, our goal is to build identification guarantees under imperfect interventions. In the example that follows, we motivate the class of imperfect interventions we consider.
Motivating example of perfect & imperfect interventions on images.
First, we revisit perfect interventions in causal DAGs (Peters et al., 2017). Under a perfect intervention, the intervened latent is disconnected from its parents; do interventions are a special case of perfect interventions. Consider the two balls shown in Fig. 2a. Suppose Ball 1 has a strong influence on Ball 2 in the observational DAG shown in Fig. 2b. As a result, the position of Ball 1 determines the region where Ball 2 can be located inside the box in Fig. 2a. Now imagine a perfect intervention is carried out as shown in Fig. 2c. Under this intervention, the second coordinate of Ball 2 is not restricted by Ball 1 and it takes all possible values in the box. Do we need perfect interventions to ensure that Ball 2 can be located anywhere in the box? Even an imperfect intervention that reduces the strength of the influence of Ball 1 on Ball 2 can suffice to ensure that Ball 2 takes all possible locations in the box. In this section, we consider such imperfect interventions, which guarantee that the range of values the intervened latent takes does not depend on its non-descendants. We formalize this below.
Definition 5.4.
(Wang & Jordan, 2021) Consider a random variable sampled from . are said to have independent support if where is the support of , are the supports of marginal distribution of for and is the Cartesian product.
Observe that two random variables can be dependent but have independent support. Suppose z is generated from a structural causal model with an underlying DAG and z_i undergoes an imperfect intervention. We consider imperfect interventions such that each pair (z_i, z_j) satisfies support independence (Definition 5.4), where z_j is a non-descendant of z_i in the underlying DAG. Below we characterize imperfect interventions that satisfy support independence.
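As a small numeric illustration of Definition 5.4 (a hypothetical distribution, not one used in the paper), the following sketch draws two latents that are strongly correlated yet whose joint support equals the Cartesian product of the marginal supports:

```python
# Two dependent latents with independent support (Definition 5.4).
import numpy as np

rng = np.random.default_rng(0)
z1 = rng.uniform(-1.0, 1.0, size=100_000)
# With prob. 0.8, z2 tracks z1 (dependence); with prob. 0.2, z2 is uniform on [-1, 1].
# The conditional support of z2 given any z1 is therefore the full interval [-1, 1],
# so supp(z1, z2) = [-1, 1] x [-1, 1] even though corr(z1, z2) != 0.
track = rng.uniform(0.0, 1.0, size=z1.shape) < 0.8
z2 = np.where(track,
              np.clip(z1 + 0.1 * rng.standard_normal(z1.shape), -1.0, 1.0),
              rng.uniform(-1.0, 1.0, size=z1.shape))

print("correlation:", np.corrcoef(z1, z2)[0, 1])                 # clearly non-zero
print("range of z2 when z1 > 0.9:", z2[z1 > 0.9].min(), z2[z1 > 0.9].max())  # ~(-1, 1)
```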
Characterizing imperfect interventions that lead to support independence.
Suppose z_i = q(z_{\mathrm{pa}(i)}, u_i), where z_{\mathrm{pa}(i)} is the value of the set of parents of z_i, u_i is a noise variable that is independent of the ancestors of z_i, and q is the map that generates z_i. We carry out an imperfect intervention on z_i and change the map from q to \tilde q. If the range of values assumed by \tilde q(z_{\mathrm{pa}(i)}, u_i) is the same for any two values assumed by the parents, then the support of z_i is independent of all its non-descendants. Formally stated, the condition is \{\tilde q(z^{1}_{\mathrm{pa}(i)}, u_i) : u_i \in \mathcal{U}_i\} = \{\tilde q(z^{2}_{\mathrm{pa}(i)}, u_i) : u_i \in \mathcal{U}_i\}, where z^{1}_{\mathrm{pa}(i)} and z^{2}_{\mathrm{pa}(i)} are any two sets of values assumed by the parents and \mathcal{U}_i is the support of u_i.
We are now ready to describe the geometric properties we require of the interventional distribution \mathbb{P}_Z^{(i)} in Eq. 2. We introduce some notation first. For each component k, define \bar z^{(i)}_k (\underline z^{(i)}_k) to be the supremum (infimum) of the set \mathcal{Z}^{(i)}_k, the support of z_k in the interventional distribution.
Assumption 5.5.
Consider z sampled from the interventional distribution \mathbb{P}_Z^{(i)} in Eq. 2. There exists a set \mathcal{S} \subseteq \{1, \dots, d\} with i \in \mathcal{S} such that the support of z_i is independent of z_j for all j \notin \mathcal{S}, i.e.,
\mathcal{Z}^{(i)}_{ij} = \mathcal{Z}^{(i)}_{i} \times \mathcal{Z}^{(i)}_{j}, \qquad \forall\, j \notin \mathcal{S}, \qquad (5)
where \mathcal{Z}^{(i)}_{ij} is the joint support of (z_i, z_j) and \times denotes the Cartesian product. Moreover, for all j \notin \mathcal{S}, there exists a \delta > 0 such that points within distance \delta of the extreme values \bar z^{(i)}_{i}, \underline z^{(i)}_{i}, \bar z^{(i)}_{j}, \underline z^{(i)}_{j} of \mathcal{Z}^{(i)}_{i} \times \mathcal{Z}^{(i)}_{j} remain in \mathcal{Z}^{(i)}_{ij}.
The distribution above is quite general in several ways, as it encompasses (i) all perfect interventions, since they render the intervened latent independent of its non-descendants, and (ii) imperfect interventions that lead to independent support as characterized above. The latter part of the above assumption is a regularity condition on the geometry of the support. It ensures that the support of z_i has a \delta-thick boundary for some \delta > 0.
We now describe a constraint on the encoder that leverages the geometric signature of imperfect interventions in Assumption 5.5. Recall \hat z = f(x). Let \hat{\mathcal{Z}} and \hat{\mathcal{Z}}^{(i)} represent the support of the encoder f’s output on observational data and interventional data respectively. \hat{\mathcal{Z}}^{(i)}_{jk} represents the joint support of (\hat z_j, \hat z_k) and \hat{\mathcal{Z}}^{(i)}_{j} is the support of \hat z_j in interventional data. Similarly, we define \hat{\mathcal{Z}}_{jk} and \hat{\mathcal{Z}}_{j} for observational data.
Constraint 5.6.
Given an index j and a set \mathcal{B} \subseteq \{1, \dots, d\} with j \in \mathcal{B}. For each k \notin \mathcal{B}, the pair (\hat z_j, \hat z_k) satisfies support independence on interventional data, i.e., \hat{\mathcal{Z}}^{(i)}_{jk} = \hat{\mathcal{Z}}^{(i)}_{j} \times \hat{\mathcal{Z}}^{(i)}_{k}.
In the above Constraint 5.6, the index j and the set \mathcal{B} are not necessarily the same as i and \mathcal{S} from Assumption 5.5. In the theorem that follows, we require |\mathcal{B}| = |\mathcal{S}| to guarantee that a solution to Constraint 5.6 exists. In Appendix A.3, we explain that this requirement can be easily relaxed. Note that Constraint 5.6 bears similarity to Constraint 5.1 from the case of do interventions. Both constraints ensure that the support of the constrained component of the encoder’s output is independent of that of the other components. In the theorem that follows, we show that Constraint 5.6 helps achieve block affine identification, which we formally define below.
Definition 5.7.
If \hat z = \Pi A z + c for all z \in \mathcal{Z}, where \Pi is a permutation matrix and A is an invertible matrix such that there is a submatrix of A which is zero, then \hat z is said to block-affine identify z.
Theorem 5.8.
Suppose the observational data and interventional data are generated from Eq. 1 and Eq. 2 respectively under Assumptions 4.1, 4.2, and 5.5. The autoencoder that solves Eq. 3 under Constraints 4.3 and 5.6 (with |\mathcal{B}| = |\mathcal{S}|) achieves block affine identification. More specifically,
\hat z_j = \mathbf{a}_j^{\top} z + c_j \quad \text{and} \quad \hat z_k = \mathbf{a}_k^{\top} z + c_k \;\; \text{for all } k \notin \mathcal{B},
where \mathbf{a}_j contains at most |\mathcal{S}| non-zero elements and each component of \mathbf{a}_k is zero whenever the corresponding component of \mathbf{a}_j is non-zero, for all k \notin \mathcal{B}.
Firstly, from Theorem 4.4, \hat z = A z + c. From the above theorem, it follows that \hat z_j linearly depends on at most |\mathcal{S}| latents and not on all the latents. Each \hat z_k with k \notin \mathcal{B} does not depend on any of the latents that \hat z_j depends on. As a result, the rows of A (from Theorem 4.4) are sparse. Observe that if |\mathcal{S}| = 1, then, as a result of the above theorem, \hat z_j identifies some z_i up to scale and shift. Further, the remaining components \hat z_k, k \notin \mathcal{B}, linearly depend on z_{-i} and do not depend on z_i. The proof of Theorem 5.8 is in Appendix A.3.
6 Extensions to Identification with Observational Data & Independent Support
In the previous section, we showed that interventions induce geometric structure (independence of supports) in the support of the latents that helps achieve strong identification guarantees. In this section, we consider a special case where such geometric structure is already present in the support of the latents in the observational data. Since we only work with observational data in this section, we set the interventional supports \mathcal{Z}^{(i)} = \emptyset, where \emptyset is the empty set. For each k, define \bar z_k to be the supremum of the support of z_k, i.e., \bar z_k = \sup \mathcal{Z}_k. Similarly, for each k, define \underline z_k to be the infimum of the set \mathcal{Z}_k.
Assumption 6.1.
The support of z in Eq. 1 satisfies pairwise support independence between all pairs of latents. Formally stated,
\mathcal{Z}_{jk} = \mathcal{Z}_{j} \times \mathcal{Z}_{k}, \qquad \forall\, j \neq k, \qquad (6)
where \mathcal{Z}_{jk} is the joint support of (z_j, z_k) and \times denotes the Cartesian product. Moreover, for all j \neq k, there exists a \delta > 0 such that points within distance \delta of the extreme values \bar z_{j}, \underline z_{j}, \bar z_{k}, \underline z_{k} of \mathcal{Z}_{j} \times \mathcal{Z}_{k} remain in \mathcal{Z}_{jk}.
Following previous sections, we state a constraint, where the learner leverages the geometric structure in the support in Assumption 6.1 to search for the autoencoder.
Constraint 6.2.
Each pair (\hat z_j, \hat z_k), where j \neq k, satisfies support independence on observational data, i.e., \hat{\mathcal{Z}}_{jk} = \hat{\mathcal{Z}}_{j} \times \hat{\mathcal{Z}}_{k}, where \hat{\mathcal{Z}}_{jk} is the joint support of (\hat z_j, \hat z_k) and \hat{\mathcal{Z}}_{j} is the support of \hat z_j.
Theorem 6.3.
Suppose the observational data is generated from Eq. 1 under Assumptions 4.1, 4.2, and 6.1. The autoencoder that solves Eq. 3 under Constraints 4.3 and 6.2 achieves permutation, shift, and scaling identification. Specifically, \hat z = \Pi \Lambda z + c, where \hat z is the output of the encoder f, z is the true latent, \Pi is a permutation matrix, \Lambda is an invertible diagonal matrix, and c \in \mathbb{R}^d.
The proof of Theorem 6.3 is in Appendix A.4. Theorem 6.3 says that independence between the latents’ supports is sufficient to achieve identification up to permutation, shift, and scaling from observational data. Theorem 6.3 has important implications for the seminal works on linear ICA (Comon, 1994); consider the simple case of a linear g. Comon (1994) shows that, if the latent variables are independent and non-Gaussian, then the latent variables can be identified up to permutation and scaling. In contrast, Theorem 6.3 states that, even if the latent variables are dependent, they can be identified up to permutation, shift, and scaling, as long as they are bounded (hence non-Gaussian) and satisfy pairwise support independence.
Finally, Theorem 6.3 provides a first general theoretical justification for recent proposals of unsupervised disentanglement via the independent support condition (Wang & Jordan, 2021; Roth et al., 2022).
7 Learning Representations from Geometric Signatures: Practical Considerations
In this section, we describe practical algorithms to solve the constrained representation learning problems in § 5 and 6.
To perform constrained representation learning with do-intervention data, we proceed in two steps. In the first step, we minimize the reconstruction objective \mathbb{E}\,\|x - h(f(x))\|^2, where h is the decoder, f is the encoder, and the expectation is taken over both observational and interventional data. In the experiments, we restrict h to be a polynomial and show that affine identification is achieved by the learned f, as proved in Theorem 4.4.
In the second step, we learn a linear map to transform the learned representations and enforce Constraint 5.1. For each interventional distribution i, we learn a different linear map \theta_i that projects the representation \hat z = f(x) such that it takes an arbitrary fixed value on the support of the corresponding interventional data. We write this objective as
\min_{\theta_i} \; \mathbb{E}_{x \sim \text{interventional dataset } i}\big[(\theta_i^{\top} f(x) - 1)^2\big]. \qquad (7)
Construct a matrix \Gamma with the different \theta_i as its rows. The final output representation is \Gamma f(x). In the experiments, we show that this representation achieves permutation, shift, and scaling identification, as predicted by Theorem 5.3. A few remarks are in order: (i) the target value 1 in Eq. 7 is arbitrary, and the learner does not know the true do-intervention value; (ii) for ease of exposition, Eq. 7 assumes knowledge of the index of the intervened latent, which can be easily relaxed by multiplying \Gamma with a permutation matrix.
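A minimal sketch of this second step for do interventions is given below; the least-squares solver, the arbitrary target value of 1, and the absence of a bias term (which rules out the trivial all-zeros map) are illustrative choices layered on top of Eq. 7.

```python
# Step 2 for do interventions: one linear map theta_i per interventional dataset,
# fit so that theta_i^T z_hat is (nearly) constant on that dataset (Eq. 7).
import numpy as np

def fit_intervention_directions(z_hat_interv_list):
    """z_hat_interv_list: one (n_i, d) array of step-1 representations per intervention."""
    thetas = []
    for z_hat in z_hat_interv_list:
        n = z_hat.shape[0]
        # Target value 1 is arbitrary; with no bias term, theta = 0 is not a minimizer.
        theta, *_ = np.linalg.lstsq(z_hat, np.ones(n), rcond=None)
        thetas.append(theta)
    return np.vstack(thetas)          # Gamma: row i is theta_i

# Final representation on any dataset: z_tilde = z_hat @ Gamma.T
```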
We next describe an algorithm that learns representations to enforce independence of support (leveraged in Theorems 5.8 and 6.3). To measure the (non-)independence of the latents’ support, we follow Wang & Jordan (2021); Roth et al. (2022) and measure the distance between the joint support and the Cartesian product of the marginal supports in terms of the Hausdorff distance: the Hausdorff distance between two sets S and T is d_H(S, T) = \max\{\sup_{s \in S} \inf_{t \in T} d(s, t), \; \sup_{t \in T} \inf_{s \in S} d(s, t)\}, where d is the underlying metric (e.g., the Euclidean distance).
To enforce the independent support constraint, we again follow a two-step algorithm. The first step remains the same, i.e., we minimize the reconstruction objective. In the second step, we transform the learned representations \hat z = f(x) with an invertible map \Gamma. The support obtained post transformation is a function of the parameters of \Gamma; following the notation introduced earlier, the joint support along dimensions (j, k) is \hat{\mathcal{Z}}_{jk}(\Gamma) and the marginal support along j is \hat{\mathcal{Z}}_{j}(\Gamma). We translate the problem in Constraint 6.2 as follows. We find a \Gamma that minimizes
\sum_{j \neq k} d_H\big(\hat{\mathcal{Z}}_{jk}(\Gamma), \; \hat{\mathcal{Z}}_{j}(\Gamma) \times \hat{\mathcal{Z}}_{k}(\Gamma)\big). \qquad (8)
Constraint 5.6 can be translated similarly.
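The following sketch estimates the pairwise Hausdorff distances in Eq. 8 from samples of the learned representation (a numpy array). The resampling scheme used to approximate the Cartesian product of the marginal supports and the subsample size are illustrative assumptions; a differentiable soft-min relaxation would be used when optimizing Γ by gradient descent.

```python
# Empirical support-independence penalty: Hausdorff distance between the joint
# support of (z_j, z_k) and the product of their marginal supports, from samples.
import numpy as np

def pairwise_hausdorff(z_hat, j, k, num_product=2000, rng=None):
    rng = np.random.default_rng(rng)
    joint = z_hat[:, [j, k]]                                     # samples of the joint support
    prod = np.stack([rng.choice(z_hat[:, j], num_product),       # independent resampling ->
                     rng.choice(z_hat[:, k], num_product)], 1)   # samples of the product support
    dists = np.linalg.norm(prod[:, None, :] - joint[None, :, :], axis=-1)
    d_prod_to_joint = dists.min(axis=1).max()
    d_joint_to_prod = dists.min(axis=0).max()
    return max(d_prod_to_joint, d_joint_to_prod)

def independence_of_support_loss(z_hat):
    d = z_hat.shape[1]
    return sum(pairwise_hausdorff(z_hat, j, k)
               for j in range(d) for k in range(j + 1, d))
```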
8 Empirical Findings
In this section, we analyze how the practical implementation of the theory holds up in settings ranging from data generated by polynomial decoders to images generated by the PyGame rendering engine (Shinners, 2011). The code to reproduce the experiments can be found at https://github.com/facebookresearch/CausalRepID.
Data generation process.
Polynomial decoder data: The latents for the observational data are sampled from . can be i) independent uniform, ii) an SCM with sparse connectivity (SCM-S), iii) an SCM with dense connectivity (SCM-D) (Brouillard et al., 2020). The latent variables are then mapped to using a multivariate polynomial. We use a dimensional . We use two possible dimensions for the latents () – six and ten. We use polynomials of degree () two and three. Each element in to generate is sampled from a standard normal distribution.
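A sketch of the interventional part of this data-generation protocol is shown below. The helper sample_under_do is a hypothetical interface: it must clamp z_i to the chosen value and re-generate its descendants from their mechanisms (for independent latents this amounts to overwriting coordinate i); the sample sizes and intervention values are illustrative.

```python
# One do-interventional dataset per latent dimension, rendered through the same decoder.
import numpy as np

def make_interventional_datasets(sample_under_do, decoder, d, n=5000, rng=None):
    """sample_under_do(i, value, n) -> (n, d) latents drawn under do(z_i = value)."""
    rng = np.random.default_rng(rng)
    datasets = []
    for i in range(d):
        value = rng.uniform(0.0, 1.0)    # fixed value; the learner is not told i or the value
        z = sample_under_do(i, value, n)
        datasets.append(decoder(z))      # only the rendered observations are given to the learner
    return datasets
```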
Image data: For image-based experiments, we use the PyGame (Shinners, 2011) rendering engine. We generate images of the form shown in Fig. 2 and consider a setting with two balls. We consider three distributions for the latents: (i) independent uniform, (ii) a linear SCM with the DAG in Fig. 2, and (iii) a non-linear SCM with the DAG in Fig. 2, where the coordinates of Ball 1 are at the top layer of the DAG and the coordinates of Ball 2 are at the bottom layer of the DAG.
For both settings above, we carry out do interventions on each latent dimension to generate interventional data.
Model parameters and evaluation metrics.
We follow the two-step training procedure described in § 7. For image-based experiments we use a ResNet-18 encoder (He et al., 2016), and for all other experiments we use an MLP with three hidden layers and two hundred units per layer. We learn a polynomial decoder, as the theory prescribes a polynomial decoder (Constraint 4.3) when g is a polynomial. In § B.3, we also present results when we use an MLP decoder. To check for affine identification (from Theorem 4.4), we measure the R² score of a linear regression between the output representation and the true representation. A high R² score indicates affine identification. To verify permutation, shift, and scaling identification (from Theorem 6.3), we compute the mean correlation coefficient (MCC) (Khemakhem et al., 2022). For further details on data generation, models, hyperparameters, and supplementary experiments, refer to App. B.
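For reference, the two metrics can be computed as in the sketch below. Using Pearson correlations with Hungarian matching for MCC, and scikit-learn’s multi-output R² for affine identification, are standard but assumed implementation choices.

```python
# Evaluation metrics: R^2 of a linear regression z_hat -> z (affine identification)
# and MCC (mean absolute correlation after optimal one-to-one matching).
import numpy as np
from sklearn.linear_model import LinearRegression
from scipy.optimize import linear_sum_assignment

def r2_affine(z_hat, z_true):
    return LinearRegression().fit(z_hat, z_true).score(z_hat, z_true)

def mcc(z_hat, z_true):
    d = z_true.shape[1]
    corr = np.corrcoef(z_hat.T, z_true.T)[:d, d:]      # cross-correlation block, (d, d)
    row, col = linear_sum_assignment(-np.abs(corr))    # best one-to-one matching
    return np.abs(corr[row, col]).mean()
```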
[Table 2: R² of the step-1 representation and MCC after enforcing independence of support, MCC (IOS), for the polynomial decoder experiments with Uniform, SCM-S, and SCM-D latents across settings of d and p.]
[Table 3: MCC of the step-1 representation and MCC after minimizing the interventional loss, MCC (IL), for the polynomial decoder experiments with do-interventional data and Uniform, SCM-S, and SCM-D latents across settings of d and p.]
Results for polynomial decoder.
Observational data: We consider the setting where the true decoder g is a polynomial and the learned decoder h is also a polynomial. In Table 2, we report the R² of the representation learned after the first step, where we only minimize the reconstruction loss. The R² values are high, as predicted by Theorem 4.4. In the second step, we learn a map \Gamma and enforce the independence-of-support constraint by minimizing the Hausdorff distance in Eq. 8. Among the latent distributions, only the uniform distribution satisfies support independence (Assumption 6.1), so following Theorem 6.3 we expect the MCC to be high only in this case. In Table 2, we report the MCC obtained by enforcing independence of support in the MCC (IOS) column. In § B.3, we also carry out experiments on correlated uniform distributions and observe high MCC (IOS).
Interventional data: We now consider the case where we also have access to interventional data in addition to observational data, with one intervention per latent dimension. We follow the two-step procedure described in § 7. In Table 3, we first show the MCC values of the representation obtained after the first step in the MCC column. In the second step, we learn \Gamma by minimizing the interventional loss (IL) in Eq. 7. We report the MCC of the resulting representation in the MCC (IL) column of Table 3; the values are close to one, as predicted by Theorem 5.3.
Results for image dataset.
We follow the two-step procedure described in § 7, except that in the second step we learn a non-linear map (using an MLP) to minimize the interventional loss (IL) in Eq. 7. In Table 4, we show the MCC values achieved by the learned representation as we vary the number of interventional distributions per latent dimension. As shown in Theorem A.12, more interventional distributions per latent dimension improve the MCC.
[Table 4: MCC on the image dataset as the number of interventional distributions per latent dimension (#interv dist.) varies, for Uniform, SCM linear, and SCM non-linear latents.]
9 Conclusions
In this work, we lay down the theoretical foundations for learning causal representations in the presence of interventional data. We show that geometric signatures such as support independence that are induced under many interventions are useful for provable representation identification. Looking forward, we believe that exploring representation learning with real interventional data (Lopez et al., 2022; Liu et al., 2023) is a fruitful avenue for future work.
Acknowledgments
Yixin Wang acknowledges grant support from the National Science Foundation and the Office of Naval Research. Yoshua Bengio acknowledges support from CIFAR and IBM. We thank Anirban Das for insightful feedback that helped us correctly state the precise \delta-thickness conditions.
References
- Ahuja et al. (2021) Ahuja, K., Hartford, J., and Bengio, Y. Properties from mechanisms: an equivariance perspective on identifiable representation learning. arXiv preprint arXiv:2110.15796, 2021.
- Ahuja et al. (2022a) Ahuja, K., Hartford, J., and Bengio, Y. Weakly supervised representation learning with sparse perturbations. arXiv preprint arXiv:2206.01101, 2022a.
- Ahuja et al. (2022b) Ahuja, K., Mahajan, D., Syrgkanis, V., and Mitliagkas, I. Towards efficient representation identification in supervised learning. arXiv preprint arXiv:2204.04606, 2022b.
- Ash et al. (2000) Ash, R. B., Robert, B., Doleans-Dade, C. A., and Catherine, A. Probability and measure theory. Academic press, 2000.
- Bareinboim et al. (2022) Bareinboim, E., Correa, J. D., Ibeling, D., and Icard, T. On pearl’s hierarchy and the foundations of causal inference. In Probabilistic and Causal Inference: The Works of Judea Pearl, pp. 507–556. 2022.
- Bengio et al. (2013) Bengio, Y., Courville, A., and Vincent, P. Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence, 35(8):1798–1828, 2013.
- Bommasani et al. (2021) Bommasani, R., Hudson, D. A., Adeli, E., Altman, R., Arora, S., von Arx, S., Bernstein, M. S., Bohg, J., Bosselut, A., Brunskill, E., et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021.
- Brehmer et al. (2022) Brehmer, J., De Haan, P., Lippe, P., and Cohen, T. Weakly supervised causal representation learning. arXiv preprint arXiv:2203.16437, 2022.
- Brouillard et al. (2020) Brouillard, P., Lachapelle, S., Lacoste, A., Lacoste-Julien, S., and Drouin, A. Differentiable causal discovery from interventional data. Advances in Neural Information Processing Systems, 33:21865–21877, 2020.
- Brown et al. (2020) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
- Burgess et al. (2018) Burgess, C. P., Higgins, I., Pal, A., Matthey, L., Watters, N., Desjardins, G., and Lerchner, A. Understanding disentangling in beta-vae. arXiv preprint arXiv:1804.03599, 2018.
- Comon (1994) Comon, P. Independent component analysis, a new concept? Signal processing, 36(3):287–314, 1994.
- Dixit et al. (2016) Dixit, A., Parnas, O., Li, B., Chen, J., Fulco, C. P., Jerby-Arnon, L., Marjanovic, N. D., Dionne, D., Burks, T., Raychowdhury, R., et al. Perturb-seq: dissecting molecular circuits with scalable single-cell rna profiling of pooled genetic screens. cell, 167(7):1853–1866, 2016.
- Geirhos et al. (2020) Geirhos, R., Jacobsen, J.-H., Michaelis, C., Zemel, R., Brendel, W., Bethge, M., and Wichmann, F. A. Shortcut learning in deep neural networks. Nature Machine Intelligence, 2(11):665–673, 2020.
- Goyal & Bengio (2020) Goyal, A. and Bengio, Y. Inductive biases for deep learning of higher-level cognition. arXiv preprint arXiv:2011.15091, 2020.
- Hälvä & Hyvarinen (2020) Hälvä, H. and Hyvarinen, A. Hidden markov nonlinear ica: Unsupervised learning from nonstationary time series. In Conference on Uncertainty in Artificial Intelligence, pp. 939–948. PMLR, 2020.
- He et al. (2016) He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.
- Hyvarinen & Morioka (2016) Hyvarinen, A. and Morioka, H. Unsupervised feature extraction by time-contrastive learning and nonlinear ICA. Advances in neural information processing systems, 29, 2016.
- Hyvarinen & Morioka (2017) Hyvarinen, A. and Morioka, H. Nonlinear ICA of temporally dependent stationary sources. In Artificial Intelligence and Statistics, pp. 460–469. PMLR, 2017.
- Hyvärinen & Pajunen (1999) Hyvärinen, A. and Pajunen, P. Nonlinear independent component analysis: Existence and uniqueness results. Neural networks, 12(3):429–439, 1999.
- Hyvarinen et al. (2019) Hyvarinen, A., Sasaki, H., and Turner, R. Nonlinear ica using auxiliary variables and generalized contrastive learning. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 859–868. PMLR, 2019.
- Khemakhem et al. (2020) Khemakhem, I., Monti, R., Kingma, D., and Hyvarinen, A. Ice-beem: Identifiable conditional energy-based deep models based on nonlinear ICA. Advances in Neural Information Processing Systems, 33:12768–12778, 2020.
- Khemakhem et al. (2022) Khemakhem, I., Kingma, D., Monti, R., and Hyvarinen, A. Variational autoencoders and nonlinear ICA: A unifying framework. In International Conference on Artificial Intelligence and Statistics, pp. 2207–2217. PMLR, 2022.
- Klindt et al. (2020) Klindt, D., Schott, L., Sharma, Y., Ustyuzhaninov, I., Brendel, W., Bethge, M., and Paiton, D. Towards nonlinear disentanglement in natural data with temporal sparse coding. arXiv preprint arXiv:2007.10930, 2020.
- Lachapelle et al. (2022) Lachapelle, S., Rodriguez, P., Sharma, Y., Everett, K. E., Le Priol, R., Lacoste, A., and Lacoste-Julien, S. Disentanglement via mechanism sparsity regularization: A new principle for nonlinear ICA. In Conference on Causal Learning and Reasoning, pp. 428–484. PMLR, 2022.
- Lippe et al. (2022a) Lippe, P., Magliacane, S., Löwe, S., Asano, Y. M., Cohen, T., and Gavves, E. icitris: Causal representation learning for instantaneous temporal effects. arXiv preprint arXiv:2206.06169, 2022a.
- Lippe et al. (2022b) Lippe, P., Magliacane, S., Löwe, S., Asano, Y. M., Cohen, T., and Gavves, S. Citris: Causal identifiability from temporal intervened sequences. In International Conference on Machine Learning, pp. 13557–13603. PMLR, 2022b.
- Liu et al. (2023) Liu, Y., Alahi, A., Russell, C., Horn, M., Zietlow, D., Schölkopf, B., and Locatello, F. Causal triplet: An open challenge for intervention-centric causal representation learning. arXiv preprint arXiv:2301.05169, 2023.
- Locatello et al. (2019) Locatello, F., Bauer, S., Lucic, M., Raetsch, G., Gelly, S., Schölkopf, B., and Bachem, O. Challenging common assumptions in the unsupervised learning of disentangled representations. In international conference on machine learning, pp. 4114–4124. PMLR, 2019.
- Locatello et al. (2020) Locatello, F., Poole, B., Rätsch, G., Schölkopf, B., Bachem, O., and Tschannen, M. Weakly-supervised disentanglement without compromises. In International Conference on Machine Learning, pp. 6348–6359. PMLR, 2020.
- Lopez et al. (2022) Lopez, R., Tagasovska, N., Ra, S., Cho, K., Pritchard, J. K., and Regev, A. Learning causal representations of single cells via sparse mechanism shift modeling. arXiv preprint arXiv:2211.03553, 2022.
- Mityagin (2015) Mityagin, B. The zero set of a real analytic function. arXiv preprint arXiv:1512.07276, 2015.
- Mooij & Heskes (2013) Mooij, J. and Heskes, T. Cyclic causal discovery from continuous equilibrium data. arXiv preprint arXiv:1309.6849, 2013.
- Nejatbakhsh et al. (2021) Nejatbakhsh, A., Fumarola, F., Esteki, S., Toyoizumi, T., Kiani, R., and Mazzucato, L. Predicting perturbation effects from resting activity using functional causal flow. bioRxiv, pp. 2020–11, 2021.
- Pearl (2009) Pearl, J. Causal inference in statistics: An overview. Statistics surveys, 3:96–146, 2009.
- Pedregosa et al. (2011) Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
- Peters et al. (2017) Peters, J., Janzing, D., and Schölkopf, B. Elements of causal inference: foundations and learning algorithms. The MIT Press, 2017.
- Radford et al. (2021) Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pp. 8748–8763. PMLR, 2021.
- Roth et al. (2022) Roth, K., Ibrahim, M., Akata, Z., Vincent, P., and Bouchacourt, D. Disentanglement of correlated factors via hausdorff factorized support. arXiv preprint arXiv:2210.07347, 2022.
- Schölkopf et al. (2021) Schölkopf, B., Locatello, F., Bauer, S., Ke, N. R., Kalchbrenner, N., Goyal, A., and Bengio, Y. Towards causal representation learning. arXiv preprint arXiv:2102.11107, 2021.
- Seigal et al. (2022) Seigal, A., Squires, C., and Uhler, C. Linear causal disentanglement via interventions. arXiv preprint arXiv:2211.16467, 2022.
- Shinners (2011) Shinners, P. Pygame. http://pygame.org/, 2011.
- Von Kügelgen et al. (2021) Von Kügelgen, J., Sharma, Y., Gresele, L., Brendel, W., Schölkopf, B., Besserve, M., and Locatello, F. Self-supervised learning with data augmentations provably isolates content from style. Advances in neural information processing systems, 34:16451–16467, 2021.
- Wang & Jordan (2021) Wang, Y. and Jordan, M. I. Desiderata for representation learning: A causal perspective. arXiv preprint arXiv:2109.03795, 2021.
- Yamada et al. (2022) Yamada, Y., Tang, T., and Ilker, Y. When are lemons purple? the concept association bias of clip. arXiv preprint arXiv:2212.12043, 2022.
- Yao et al. (2021) Yao, W., Sun, Y., Ho, A., Sun, C., and Zhang, K. Learning temporally causal latent processes from general temporal data. arXiv preprint arXiv:2110.05428, 2021.
- Yao et al. (2022a) Yao, W., Chen, G., and Zhang, K. Learning latent causal dynamics. arXiv preprint arXiv:2202.04828, 2022a.
- Yao et al. (2022b) Yao, W., Sun, Y., Ho, A., Sun, C., and Zhang, K. Learning temporally causal latent processes from general temporal data. In International Conference on Learning Representations, 2022b. URL https://openreview.net/forum?id=RDlLMjLJXdq.
- Zimmermann et al. (2021) Zimmermann, R. S., Sharma, Y., Schneider, S., Bethge, M., and Brendel, W. Contrastive learning inverts the data generating process. In International Conference on Machine Learning, pp. 12979–12990. PMLR, 2021.
Interventional Causal Representation Learning Appendices
Contents
We organize the Appendix as follows.
• In App. A, we present the proofs for the theorems that were presented in the main body of the paper.
• In App. B, we present supplementary materials for the experiments.
  – In § B.1, we present the pseudocode for the method used to learn the representations.
  – In § B.2, we present the details of the setup used in the experiments with the polynomial decoder g.
  – In § B.3, we present supplementary results for the setting with the polynomial decoder g.
  – In § B.4, we present the details of the setup used in the experiments with image data.
  – In § B.5, we present supplementary results for the setting with image data.
Appendix A Proofs and Technical Details
In this section, we provide the proofs for the theorems. We restate the theorems for convenience.
Preliminaries and notation.
We state the formal definition of the support of a random variable. In most of this work, we operate on the measure space (\mathbb{R}^d, \mathcal{B}, \mu), where \mathcal{B} is the Borel \sigma-field over \mathbb{R}^d and \mu is the Lebesgue measure over the completion of the Borel sets on \mathbb{R}^d (Ash et al., 2000). For a random variable W, the support is \{w : p_W(w) > 0\}, where p_W is the Radon–Nikodym derivative of the law of W w.r.t. the Lebesgue measure over the completion of the Borel sets. For the random variable z, \mathcal{Z} is its support in the observational data and \mathcal{Z}_k is the support of its k-th component. For the interventional data, \mathcal{Z}^{(i)} is the support of z when z_i is intervened and \mathcal{Z}^{(i)}_k is the support of the k-th component of z in the intervened data.
A.1 Affine Identification
Lemma A.1.
If the matrix G that defines the polynomial g has full column rank, then g is injective.
Proof.
Suppose this is not the case and g(z^1) = g(z^2) for some z^1 \neq z^2. Thus
G\,\big([1, z^1, \dots, (z^1)^{\bar\otimes p}]^{\top} - [1, z^2, \dots, (z^2)^{\bar\otimes p}]^{\top}\big) = 0. \qquad (9)
Since z^1 \neq z^2, we find a non-zero vector in the null space of G, which contradicts the fact that G has full column rank. Therefore, it cannot be the case that g(z^1) = g(z^2) for some z^1 \neq z^2. Thus g has to be injective. ∎
Lemma A.2.
If a(z) is a polynomial of degree r and b(z) is a polynomial of degree s, then a(z)\,b(z) is a polynomial of degree r + s.
Proof.
We separate into two parts – the terms with degree () and the terms with degree less than () for . We obtain the following expression.
(10) |
The maximum degree achieved by is . For the other terms, the maximum is bounded above by . To prove the result, we need to show that has a degree .
We first start with a simple case. Suppose and do not share any component of that they both depend on. In such a case, if we take the leading degree term in and respectively and multiply them then we obtain distinct terms of degree .
Suppose and both depend on . We write as
where is a degree polynomial. Note that for each , is a different polynomial, i.e. for , . We write as
We collect all the terms in that have the highest degree associated with such that the coefficient is non-zero. We denote the highest degree as and write these terms as
where , , and
From , collect the terms with the highest degree for such that the coefficient is non-zero to obtain. We denote the highest degree as and write these terms as
where , , and .
As a result, will contain the term
where and . We will use principle of induction on the degree of polynomial to prove the claim.
We first establish the base case for and . Consider two polynomials and . We multiply the two to obtain . Consider two cases. In case 1, the two polynomials have at least one non-zero coefficient for the same component . In that case, we obtain the only non-zero term with , which establishes the base case. In the second case, the two polynomials have no shared non-zero coefficients. In such a case, each term with a non-zero coefficient is of the form . This establishes the base case. The other cases with and or and or both , are trivially true. Thus we have established the base case for all polynomials (with arbitrary dimension for ) of degree less than and .
We can now assume that the claim is true for all polynomials with degree less than and all polynomials with degree less than . As a result, the degree of is .
We can write in terms of the terms with degree equal to () and terms that have a degree less than (). As a result, we can simplify to obtain
(11) |
The degree of is at most . The degree of has to be since does not depend on , is of degree . Note that this is the only term in the entire polynomial that is associated with the highest degree for () since other terms () have a smaller degree associated with thus the coefficient of this term cannot be cancelled to zero. Therefore, the degree of the polynomial and hence the degree of is .
∎
Recall a = f \circ g. Since \hat z = f(x), where x = g(z), we have \hat z = a(z); here a maps the support of the latents to the support of the learned representation. We now show that a is bijective.
Lemma A.3. The map a = f \circ g, restricted to the support of the latents, is a bijection onto its image.
Proof.
Observe that a is surjective onto its image by construction. We now need to prove that a is injective. Suppose a is not injective. Then there exist z^1 and z^2 with z^1 \neq z^2 and a(z^1) = a(z^2). Note that a(z^1) = f(x^1), where x^1 = g(z^1), and a(z^2) = f(x^2), where x^2 = g(z^2). This implies that f(x^1) = f(x^2). We know that the decoder-encoder pair satisfies reconstruction, which means h(f(x^1)) = x^1 and h(f(x^2)) = x^2. Since f(x^1) = f(x^2), we obtain that x^1 = x^2, which implies that z^1 = z^2 since g is injective. This contradicts the fact that z^1 \neq z^2. Therefore, a is bijective. ∎
Theorem 4.4 (restated).
Proof.
We start by restating the reconstruction identity. For all z in the support of the latents,
h \circ f \circ g(z) = g(z). \qquad (12)
Following the assumptions, h is restricted to be a polynomial while f bears no restriction. If h = g and f = g^{-1} on the image of g, we get the ideal solution \hat z = z; thus a solution to the above identity exists.
Since has full column rank, we can select rows of such that and . Denote the corresponding matrix that select the same rows as . We restate the identity in Eq. 12 in terms of and as follows. For all
(13) |
where is a submatrix of that describes the relationship between and the polynomial of , correspond to blocks of rows of . Suppose at least one of is non-zero. Among the matrices which are non-zero, pick the matrix with the largest index . Suppose row of has some non-zero element. Now consider the element in the row in the RHS of equation 13 corresponding to . Observe that is a polynomial of of degree , where (follows from Lemma A.2). In the LHS, we have a polynomial of degree at most . The equality between the LHS and the RHS is true for all . The difference of the LHS and the RHS is an analytic function. From Constraint 4.3, the set where the identity holds has measure greater than zero. Therefore, we leverage Mityagin (2015) to conclude that the LHS is equal to the RHS everywhere. If two polynomials are equal everywhere, then their respective coefficients have to be the same. Based on the supposition, the RHS has a non-zero coefficient for terms with degree while the LHS has zero coefficients for terms of degree higher than . This leads to a contradiction. As a result, none of can be non-zero. Thus . Next, we show that is invertible, which immediately follows from Lemma A.3.
∎
A.1.1 Extensions to sparse polynomial
Suppose g is a degree-p polynomial generated by the basis [1, z, z^{\bar\otimes 2}, \dots, z^{\bar\otimes p}]. Note that the number of terms in this basis grows as O(d^p). In the previous proof, we worked with g(z) = G\,[1, z, z^{\bar\otimes 2}, \dots, z^{\bar\otimes p}]^{\top}, where G was full column rank. As a result, n has to be greater than the number of basis terms and thus also grow at least as fast as O(d^p). In real data, we can imagine that g has a high degree. However, the polynomial can exhibit some structure, for instance sparsity (i.e., G is a sparse matrix). We now show that our entire analysis continues to work even for sparse polynomials, thus significantly reducing the requirement on n: it only needs to grow as the number of non-zero basis terms in the sparse polynomial. We write the basis for the sparse polynomial of degree p as \tilde{b}(z), which consists of a subset of the terms in [1, z, z^{\bar\otimes 2}, \dots, z^{\bar\otimes p}]. We write the sparse polynomial as g(z) = G\,\tilde{b}(z).
We formally state the assumption on the decoder in this case as follows.
Assumption A.4.
The decoder g is a polynomial of degree p whose corresponding coefficient matrix G (a.k.a. the weight matrix) has full column rank. Specifically, the decoder g is determined by the coefficient matrix G as follows,
g(z) = G\,\tilde{b}(z), \qquad (14)
where the sparse basis \tilde{b}(z) consists of a subset of the terms in [1, z, z^{\bar\otimes 2}, \dots, z^{\bar\otimes p}]. \tilde{b}(z) contains the degree-one term, i.e., z, and at least one term of higher degree.
Theorem A.5.
Suppose the observational data and interventional data are generated from Eq. 1 and Eq. 2 respectively under Assumptions 4.1 and A.4. The autoencoder that solves the reconstruction identity in Eq. 3 under Constraint 4.3 achieves affine identification, i.e., \hat z = A z + c, where \hat z is the output of the encoder f, z is the true latent, A is an invertible matrix, and c \in \mathbb{R}^d.
Proof.
We start by restating the reconstruction identity. For all z in the support of the latents,
h \circ f \circ g(z) = g(z). \qquad (15)
Following the assumptions, is restricted to be polynomial but bears no restriction. If is equal to the matrix for columns where for some and zero in other columns and , we get the ideal solution , thus a solution to the above identity exists. Since has full column rank, we can select rows of such that and . Denote the corresponding matrix that select the same rows as . We restate the identity in Eq. 15 in terms of and as follows. For all
(16) |
In the simplification above, we rely on the fact that the sparse basis contains the first-degree term. Suppose at least one of is non-zero. Among the matrices which are non-zero, pick the matrix with the largest index . Suppose row of has some non-zero element. Now consider the element in the row in the RHS of equation 16 corresponding to . Observe that is a polynomial of of degree , where . In the LHS, we have a polynomial of degree at most . The equality between the LHS and the RHS is true for all . The difference of the LHS and the RHS is an analytic function. From Constraint 4.3, the set where the identity holds has measure greater than zero. Therefore, we leverage Mityagin (2015) to conclude that the LHS is equal to the RHS everywhere. If two polynomials are equal everywhere, then their respective coefficients have to be the same. Based on the supposition, the RHS has a non-zero coefficient for terms with degree while the LHS has zero coefficients for terms of degree higher than . This leads to a contradiction. As a result, none of can be non-zero. Thus . Next, we need to show that is invertible, which follows from Lemma A.3. ∎
A.1.2 Extensions to polynomial with unknown degree
The learner starts with solving the reconstruction identity by setting the degree of to be ; here we assume has full rank (this implicitly requires that is greater than the number of terms in the polynomial of degree ).
(17) |
We can restrict to rows such that it is a square invertible matrix . Denote the corresponding restriction of as . The equality is stated as follows.
(18) |
If , then is a polynomial of degree at least . Since the RHS contains a polynomial of degree at most , the two sides cannot be equal over a set of values of with positive Lebesgue measure in . Thus the reconstruction identity is satisfied only when . Hence, we can start with the upper bound and reduce the degree of the polynomial on the LHS until the identity is satisfied.
A.1.3 Extensions from polynomials to -approximate polynomials
We now discuss how to extend Theorem 4.4 to settings beyond polynomial . Suppose is a function that can be -approximated by a polynomial of degree on entire . In this section, we assume that we continue to use polynomial decoders of degree (with full rank matrix ) for reconstruction. We state this as follows.
Constraint A.6.
The learned decoder is a polynomial of degree and its corresponding coefficient matrix is determined by as follows. For all
(19) |
where represents the Kronecker product with all distinct entries. has a full column rank.
Since we use a polynomial , satisfying exact reconstruction is not possible. Instead, we enforce approximate reconstruction as follows. For all , we want
(20) |
where is the tolerance on reconstruction error. Recall . We further simplify it as . We also assume that can be -approximated on entire with a polynomial of sufficiently high degree say . We write this as follows. For all ,
(21) |
We want to show that the norm of is sufficiently small for all . We state the assumptions needed in the theorem below.
Assumption A.7.
The encoder does not take values near zero, i.e., for all and for all , where . The absolute value of each element of is bounded by a fixed constant. Consider the absolute values of the singular values of ; we assume that the smallest of these is strictly positive and bounded below by .
Theorem A.8.
Suppose the true decoder can be approximated by a polynomial of degree on the entire domain with approximation error . Suppose can be approximated by polynomials on the entire domain with error. If , where is sufficiently large, and Assumption 4.1 and Assumption A.7 hold, then the polynomial approximation of (recall ) corresponding to solutions of the approximate reconstruction identity in Eq. 20 under Constraint A.6 is approximately linear, i.e., the norms of the weights on higher-order terms are sufficiently small. Specifically, the absolute value of the weight associated with a term of degree decays as .
Proof.
We start by restating the approximate reconstruction identity. We use the fact that can be approximated with a polynomial of say degree to simplify the identity below. For all
(22) |
To obtain the second step from the first, add and subtract and use the reverse triangle inequality. Since is full rank, we select rows of such that is square and invertible. The corresponding selection of is denoted . We write the identity in terms of these matrices as follows.
(23) |
where is the singular value with the smallest absolute value of the matrix . In the simplification above, we use the assumption that is -approximated by a polynomial with matrix , and we also use the fact that is positive. Now we write the polynomial that approximates as follows.
(24) |
(25) |
From Assumption A.7 we know that , where . It follows from the above equation that
(26) |
For , we track how grows below.
(27) |
In the last step of the above simplification, we use the condition in Eq. 26. We consider . Consider the terms inside the polynomial in the RHS above. We assume all components of are positive. Suppose , where ; then the RHS in Eq. 27 grows at least as . From Eq. 23, is very close to a degree polynomial in . Under the assumption that the terms in are bounded by a constant, the polynomial of degree grows at most as . The difference in growth rates in Eq. 23 is an increasing function of for ranges where is sufficiently large. Therefore, the reconstruction identity in Eq. 23 cannot be satisfied for points in a sufficiently small neighborhood of . Therefore, . We can consider the other vertices of the hypercube and conclude that .
∎
A.2 Representation identification under interventions
See 5.3
Proof.
First note that Assumptions 4.1-4.2 hold. Since we solve Eq. 3 under Constraint 4.3, we can continue to use the result from Theorem 4.4. From Theorem 4.4, it follows that the estimated latents are an affine function of the true latents: , where .
We consider a such that is in the interior of the support of . We write as . We can write , where is the vector of coefficients in other than the coefficient of the dimension, is the component of , and is the vector of values in other than . From the constraint in 5.1, it follows that for all , . We use these expressions to carry out the following simplification.
(28) |
Consider another data point from the same interventional distribution such that is in the interior of the support of , where is the vector with a one in the coordinate and zeros everywhere else. From Assumption 5.2, we know that there exists a small enough such that is in the interior. Since the point is from the same interventional distribution, . For we have
(29) |
We take the difference of Eq. 28 and Eq. 29 to get
(30) |
From the above, we get that the component of is zero. We can repeat the above argument for all and get that . Therefore, for all possible values of in . ∎
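To make the difference step above concrete, here is a worked sketch under the assumption (consistent with Theorem 4.4) that the estimated latents satisfy an affine relation, written here as $\hat{z} = Az + c$ with row $a_i^\top$ and offset $c_i$, and that the do-intervention fixes latent $i$; these symbols stand in for the elided quantities and are notational assumptions, not additions to the argument.

```latex
% Row i of the affine relation at two points of the same do-interventional support,
% z and z + \lambda e_j with j \neq i; the constraint in 5.1 forces \hat{z}_i to be constant:
\hat{z}_i(z) = a_i^\top z + c_i,
\qquad
\hat{z}_i(z + \lambda e_j) = a_i^\top (z + \lambda e_j) + c_i .
% Taking the difference of the two equations (the step in Eq. 30):
0 \;=\; a_i^\top (z + \lambda e_j) - a_i^\top z \;=\; \lambda\, a_{ij}
\quad\Longrightarrow\quad a_{ij} = 0 \ \text{ for all } j \neq i .
```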
A.2.1 Extension of interventions beyond polynomials
In the main body of the paper, we studied the setting where is a polynomial. We now relax this constraint on and consider settings with multiple interventional distributions per target latent.
We write the DGP for intervention on latent as
(31) |
Let be the set of intervention target values. We extend the constrained representation learning setting from the main body, where the learner leverages the geometric signature of a single intervention per latent dimension, to the setting with multiple interventional distributions per latent dimension.
(32) |
Recall that . Consider the component . Suppose is invertible and only depends on ; then we can write it as . If only depends on , i.e., , and is invertible, then is identified up to an invertible transform. Another way to state this property is that for all . In what follows, we show that it is possible to approximately achieve identification up to an invertible transform: if the number of interventions is sufficiently large, then for all .
Assumption A.9.
The interior of the support of in the observational data, i.e., , is non-empty. The interior of the support of in the interventional data, i.e., , is equal to the support in observational data, i.e., , for all . Each intervention is sampled from a distribution . The support of is equal to the support of in the observational data, i.e., . The density of is greater than () on the entire support.
The above assumption states the restrictions on the support of the latents underlying the observational data and the latents underlying the interventional data.
Assumption A.10.
is bounded by for all and for all .
Lemma A.11.
If the number of interventions , then with probability .
Proof.
Consider the interval , where and are the infimum and supremum of . Consider an covering of ; this covering consists of equally spaced points at a separation of . Consider a point ; its nearest neighbor in the cover is denoted , and the nearest neighbor of in the set of interventions is . The nearest neighbor of in the set of interventions is . Since for all , we can write
(33) |
Observe that if is less than for all in the cover, then for all in , is less than . We now show that is sufficiently small provided is sufficiently large. Observe that
We would like that , which implies . Therefore, if , then with probability at least . If we set , then we obtain that for all , with probability at least . This gives the final expression for . ∎
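As a quick numerical sanity check of this covering argument (separate from the proof), the sketch below samples interventions uniformly on an interval and verifies that every point of the interval has an intervention within the target radius; the interval endpoints, radius, failure probability, and the exact sample-size formula are illustrative assumptions in the spirit of the bound above.

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = -5.0, 5.0      # assumed support of the intervened latent
eps = 0.1             # target covering radius
delta = 0.01          # allowed failure probability

# Coupon-collector-style requirement: every cell of an (eps/2)-spaced cover
# should receive at least one of the uniformly sampled interventions.
n_cells = int(np.ceil((b - a) / (eps / 2.0)))
n_interv = int(np.ceil(n_cells * np.log(n_cells / delta)))

interventions = rng.uniform(a, b, size=n_interv)
grid = np.linspace(a, b, 2000)   # dense probe points standing in for the whole interval
dist = np.abs(grid[:, None] - interventions[None, :]).min(axis=1)
print(f"N = {n_interv}, max distance to nearest intervention = {dist.max():.4f} "
      f"(target eps = {eps})")
```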
Theorem A.12.
Suppose the observational data and interventional data are generated from Eq. 1 and Eq. 31 respectively. If the number of interventions is sufficiently large, i.e., , and Assumption A.9 and Assumption A.10 are satisfied, then the solution to Eq. 32 identifies the intervened latent approximately up to an invertible transform, i.e., for all .
Proof.
Recall , where . Consistent with the notation used earlier in the proof of Theorem 4.4, . In Lemma A.3, we showed that is bijective; we can use the same recipe here to show that is bijective.
Owing to the constraint in Eq. 32, we claim that for all in the interior of with . Consider a ball around that is entirely contained in ; denote it as . From Eq. 32, it follows that takes the same value on this neighborhood. As a result, is equal to a constant on the ball . Therefore, it follows that on the ball . We can extend this argument to all the points in the interior of the support of ; as a result, on the interior of the support of . Further, for all in . Define . Consider the component of , denoted . Consider a point , find its nearest neighbor in , and denote it as . Following the assumptions, . We expand around as follows
In the above, we use the fact that .
To see the last inequality in the above, use Lemma A.11 with as and Assumption A.10. ∎
In the discussion above, we showed that multiple interventional distributions on a target latent dimension help achieve approximate identification of that latent up to an invertible transform. The above argument extends to all latents provided we have data with multiple do-intervention distributions per latent. We end this section by giving some intuition as to why multiple interventions are necessary in the absence of much structure on .
Necessitating multiple interventions
We consider the case with one intervention. Consider the set of values achieved under the intervention, where is from the interior of . We call this set . Suppose is a bijection of the following form.
(34) |
where is the identity function and is an arbitrary bijection with a bounded second-order derivative (satisfying Assumption A.10). Define and . Observe that these and satisfy both constraints in the representation learning problem in 5.1. In the absence of any further assumptions on or on the structure of the support of , each intervention only enforces local constraints on .
A.3 Representation identification under general perfect and imperfect interventions
Before proving Theorem 5.8, we prove a simpler version of the theorem, which we then leverage to prove Theorem 5.8. We start with the case when the set has one element, say .
Assumption A.13.
Consider the that follow the interventional distribution . The joint support of satisfies factorization of support, i.e.,
(35) |
For all , . There exists a such that all the points in are in .
The above assumption only requires support independence for two random variables and .
We now describe a constraint, where the learner enforces support independence between and .
Constraint A.14.
The pair satisfies support independence on interventional data, i.e.,
In Constraint A.14 above, we use the same indices and as in Assumption A.13 for convenience; the arguments extend to the case where a different pair is used.
Theorem A.15.
Suppose the observational data and interventional data are generated from Eq. 1 and Eq. 2 respectively under Assumptions 4.1, 4.2, and A.13. The autoencoder that solves Eq. 3 under Constraints 4.3 and A.14 achieves block affine identification, i.e., , where is the output of the encoder, is the true latent, is an invertible matrix, and . Further, the matrix has a special structure: rows and do not have a non-zero entry in the same column. Also, each of rows and has at least one non-zero entry.
Proof.
Let us first verify that there exists a solution to Eq. 3 under Constraint 4.3, A.14. If and , then that suffices to guarantee that a solution exists.
First note that since Assumptions 4.1 and 4.2 hold and we are solving Eq. 3 under Constraint 4.3, we can continue to use the result from Theorem 4.4. From Theorem 4.4, , where is the output of the encoder, is the true latent, is an invertible matrix, and .
From Assumption A.13, we know that each component of is bounded above and below. Suppose the minimum and maximum value achieved by is and the maximum value achieved by is .
Define a new latent
Notice that after this linear operation, the new latent takes a maximum value of and a minimum value of .
We start with , where is an element-wise transformation of that brings the maximum and minimum value of each component to and . Following the above transformation, we define the leftmost interval for as and the rightmost interval as , where and . Such intervals exist owing to Assumption A.13.
A few remarks are in order. i) Here we define the intervals to be closed at both ends; our arguments also extend to the case where these intervals are open at both or one end. ii) We assume all the values in the interval are in the support; the argument presented below extends to the case when all the values in are attained by except for a set of measure zero. iii) Assumption A.13 can be relaxed by replacing the supremum and infimum with the essential supremum and infimum.
For a sufficiently small , we claim that the marginal distributions of and contain the sets defined below. Formally stated,
(36) |
(37) |
where and are the and rows of the matrix . We justify the above claim next. Suppose all elements of are positive. We set sufficiently small such that for all . Since is sufficiently small, is in the support ; this holds for all . As a result, is in the support of . We can repeat the same argument when the signs of are not all positive by adjusting the signs of the elements . This establishes . Similarly, we can also establish that .
Suppose the two rows and share at least non-zero entries. Without loss of generality, assume that is non-zero and is non-zero. Pick an
- Suppose and are both positive. In this case, if , then
To see why this is the case, substitute and observe that .
- Suppose and are both positive. In this case, if , then
For a sufficiently small (), both and cannot be true simultaneously. Therefore, and cannot hold simultaneously. Individually, occurs with probability greater than zero; see Eq. 36. Similarly, occurs with probability greater than zero; see Eq. 37. This contradicts the support independence constraint. For completeness, we present the argument for the other possible signs of .
- Suppose is positive and is negative. In this case, if , then
- Suppose is positive and is negative. In this case, if , then
The rest of this case is the same as the previous one. We can apply the same argument to any shared non-zero component. Note that a row cannot have all zeros or all non-zeros (then has all zeros); if that were the case, the matrix would not be invertible. This completes the proof. ∎
See 5.8
Proof.
We write , where is a permutation matrix such that . For each there exists a unique such that . Suppose . Observe that this construction satisfies the constraints in 5.6.
To show the above claim, we leverage Theorem A.15. Applying Theorem A.15 to all the pairs in , we obtain the following. We write . Without loss of generality, assume is non-zero in the first elements. Now consider any , where . From Theorem A.15 it follows that . This holds for all . Suppose . In this case, the first columns cannot have full rank: consider the submatrix formed by the first columns; in this submatrix, rows are zero, so its maximum rank is . If , then this submatrix would not have full column rank, which contradicts the fact that is invertible. Therefore, . ∎
We can relax the assumption that in the above theorem by following an iterative procedure. We start by solving 5.6 with . If a solution exists, then we stop. If a solution does not exist, then we reduce the size of by one and repeat the procedure until we find a solution. Once we reach , a solution has to exist.
A.4 Representation identification with observational data under independent support
See 6.3
Proof.
We will leverage Theorem A.15 to show this claim. Consider . We know that has at least one non-zero element. Suppose it has at least non-zero elements; without loss of generality, assume that these correspond to the first components. We apply Theorem A.15 to each pair for all . Note that here is kept fixed and Theorem A.15 is applied to every possible pair. From the theorem, we get that is zero for all . If , then the span of the first columns would be one-dimensional and, as a result, could not be invertible. Therefore, only one element of row is non-zero. We apply the above argument to all . We define a function , where is the index of the non-zero element in row , i.e., . Note that is injective: if two indices mapped to the same element, that would create shared non-zero coefficients, which violates Theorem A.15. This completes the proof. ∎
Appendix B Supplementary Materials for Empirical Findings
B.1 Method details
We provide details about our training procedure in Algorithm 1. For learning with the independence of support (IOS) objective in Step 2, we need to ensure that the map is invertible; hence we minimize a combination of the reconstruction loss and the Hausdorff distance, i.e.,
(38) |
where denotes the output from the encoder learnt in Step 1, i.e., .
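Below is a minimal PyTorch sketch of this Step-2 objective: a directed Hausdorff-style penalty that compares the transformed latents against a product-of-marginals proxy (obtained by permuting each coordinate independently across the batch), combined with a reconstruction-style term that keeps the learned map invertible. The auxiliary inverse map, the weighting coefficient, and the permutation trick are assumptions of this sketch rather than the exact implementation in Algorithm 1.

```python
import torch
import torch.nn as nn

def directed_hausdorff(A: torch.Tensor, B: torch.Tensor) -> torch.Tensor:
    # max over points in A of the distance to their nearest neighbor in B
    d = torch.cdist(A, B)
    return d.min(dim=1).values.max()

def ios_step2_loss(z_hat: torch.Tensor, gamma: nn.Module, gamma_inv: nn.Module,
                   lam: float = 1.0) -> torch.Tensor:
    """z_hat: Step-1 encoder outputs for a batch (n, d).
    gamma: the map learned in Step 2; gamma_inv: auxiliary inverse (invertibility surrogate)."""
    z = gamma(z_hat)
    # Proxy for the product of the marginal supports: shuffle each coordinate
    # independently across the batch.
    cols = [z[torch.randperm(z.shape[0]), j] for j in range(z.shape[1])]
    z_indep = torch.stack(cols, dim=1)
    support_penalty = directed_hausdorff(z_indep, z)
    recon = ((gamma_inv(z) - z_hat) ** 2).mean()
    return recon + lam * support_penalty
```

One natural way to use this loss is to optimize gamma and gamma_inv jointly by gradient descent while keeping the Step-1 autoencoder fixed.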
If we have data with multiple interventional distributions per latent dimension, then we sample a new target for each interventional distribution. In our polynomial decoder experiments, we use a linear . In our image-based experiments, in Step 2, we use a non-linear map .
B.2 Experiment setup details: Polynomial decoder ()
Basic setup.
We sample data following the DGP described in Assumption 4.2 with the following details:
- Latent dimension:
- Degree of decoder polynomial ():
- Data dimension:
- Decoder polynomial coefficient matrix : sample each element of the matrix i.i.d. from a standard normal distribution.
Latent distributions.
Recall that is the component of the latent vector . The various latent distributions () we use in our experiments are as follows (a sampling sketch is provided after the list):
- Uniform: Each latent component is sampled from Uniform(-5, 5). All the latents () are independent and identically distributed.
- Uniform-Correlated: Consider a pair of latent variables and sample two confounder variables s.t. , and . We then sample using as follows:
where is the xor operation. Hence, acts as a confounder, as it is involved in the generation process for both , which leads to correlation between them. Due to the xor operation, the two random variables satisfy the independence of support condition. Finally, we follow this generation process to generate the latent vector by iterating over different pairs (with step size 2).
- Gaussian-Mixture: Each is sampled from a Gaussian mixture model with two components and equal probability of sampling from each component, as described below:
All latents in this case are independent and identically distributed, as in the Uniform case, though each follows a mixture distribution instead of a single-mode distribution.
- SCM-S: The latent variable is sampled from a DAG with nodes generated using the Erdős–Rényi scheme with linear causal mechanisms and Gaussian noise (Brouillard et al., 2020; https://github.com/slachapelle/dcdi), with expected density (expected number of edges per node) 0.5.
- SCM-D: The latent variable is sampled from a DAG with nodes generated using the Erdős–Rényi scheme with linear causal mechanisms and Gaussian noise (Brouillard et al., 2020), with expected density (expected number of edges per node) 1.0.
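A minimal NumPy sketch of these samplers is given below; the latent dimension, sample size, mixture parameters, and SCM edge weights are illustrative assumptions rather than the exact values used in our experiments, and the Uniform-Correlated case is omitted because it depends on the specific confounder values described above.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 6, 10000   # illustrative latent dimension and sample size

# Uniform: i.i.d. components on [-5, 5].
z_uniform = rng.uniform(-5, 5, size=(n, d))

# Gaussian-Mixture: two equally likely components per latent (illustrative means/scales).
comp = rng.integers(0, 2, size=(n, d))
z_gmm = np.where(comp == 0, rng.normal(-2.0, 1.0, (n, d)), rng.normal(2.0, 1.0, (n, d)))

# SCM-S / SCM-D: Erdos-Renyi DAG with linear mechanisms and Gaussian noise.
def sample_linear_scm(n, d, exp_density, rng):
    # Upper-triangular adjacency guarantees acyclicity; the edge probability is set
    # so that the expected number of edges per node roughly matches exp_density.
    p_edge = min(1.0, 2.0 * exp_density / max(d - 1, 1))
    adj = np.triu(rng.random((d, d)) < p_edge, k=1).astype(float)
    weights = adj * rng.uniform(0.5, 1.5, size=(d, d))   # illustrative edge weights
    z = np.zeros((n, d))
    for j in range(d):   # ancestral (topological) sampling
        z[:, j] = z @ weights[:, j] + rng.normal(0, 1, n)
    return z

z_scm_sparse = sample_linear_scm(n, d, exp_density=0.5, rng=rng)
z_scm_dense = sample_linear_scm(n, d, exp_density=1.0, rng=rng)
```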
Case | Train | Validation | Test |
---|---|---|---|
Observational () | 10000 | 2500 | 20000 |
Interventional () | 10000 | 2500 | 20000 |
Further details on dataset and evaluation.
For experiments in Table 2, we only use observational data (); while for experiments in Table 3, we use both observational and interventional data (), with details regarding the train/val/test split described in Table 5.
We carry out interventions on each latent, with corresponding to data from interventions on . The union of the data from interventions across all latent dimensions is denoted . The index of the variable to be intervened on is sampled from , and the selected latent variable is set to the value . A sketch of this data-generating process is given below.
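For illustration, the following sketch generates observational and interventional data in the spirit of this setup, using a full Kronecker-power feature map as a stand-in for the polynomial basis and an arbitrary do() value; the dimensions, degree, coefficient matrix, and intervention value are all assumptions of the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_x, degree, n = 6, 200, 2, 10000   # latent dim, data dim, decoder degree, samples

def poly_features(z, degree):
    # [1, z, z (x) z, ...]: full Kronecker powers (duplicate monomials are harmless here)
    feats = [np.ones((z.shape[0], 1)), z]
    cur = z
    for _ in range(degree - 1):
        cur = np.einsum('ni,nj->nij', cur, z).reshape(z.shape[0], -1)
        feats.append(cur)
    return np.concatenate(feats, axis=1)

n_feat = poly_features(np.zeros((1, d)), degree).shape[1]
G = rng.normal(size=(n_x, n_feat))        # decoder coefficient matrix (standard normal)
decode = lambda z: poly_features(z, degree) @ G.T

# Observational data.
z_obs = rng.uniform(-5, 5, size=(n, d))
x_obs = decode(z_obs)

# Interventional data: choose a latent index per sample and apply do(z_i = value).
do_value = 2.0                            # illustrative intervention target
idx = rng.integers(0, d, size=n)
z_int = rng.uniform(-5, 5, size=(n, d))
z_int[np.arange(n), idx] = do_value
x_int = decode(z_int)
```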
Further, note that for learning the linear transformation () in Step 2 (Eq. 7), we only use the corresponding interventional data () from do-intervention on the latent variable . Also, all the metrics (, MCC (IOS), MCC, MCC (IL)) are computed only on the test split of observational data () (no interventional data used).
Model architecture.
We use the following architecture for the encoder across all the experiments with the polynomial decoder (Table 2, Table 3) to minimize the reconstruction loss:
- Linear Layer (, ); LeakyReLU()
- Linear Layer (, ); LeakyReLU()
- Linear Layer (, )
where is the input data dimension, is the number of hidden units, and in all the experiments. For the decoder () in Table 2 and Table 3, we use the polynomial decoder (), where is set to be the same as the degree of the true decoder polynomial () and the coefficient matrix is modeled using a single fully connected layer.
For the independence of support (IOS) experiments in Table 2, we model both using a single fully connected layer.
For the interventional data results (Table 3), we learn the mappings from the corresponding interventional data () using the default linear regression class from scikit-learn (Pedregosa et al., 2011) with the intercept term turned off.
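Concretely, this step amounts to a call like the one below; the arrays are placeholders for the Step-1 representations on the interventional data and the regression target defined by Eq. 7 in the main text.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

z_hat_interv = np.random.randn(10000, 6)   # placeholder: Step-1 outputs on interventional data
target = np.random.randn(10000, 6)         # placeholder: target defined by Eq. 7

reg = LinearRegression(fit_intercept=False).fit(z_hat_interv, target)
z_step2 = reg.predict(z_hat_interv)        # linearly transformed representation
```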
Finally, for the results with the NN decoder (Table 8, Table 9), we use the following architecture for the decoder, with hidden nodes (a PyTorch sketch of the MLP encoder and this decoder follows the list):
- Linear layer (, ); LeakyReLU()
- Linear layer (, ); LeakyReLU()
- Linear layer (, )
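A PyTorch sketch of the MLP encoder above and the NN decoder used for Table 8 and Table 9; the data dimension, hidden width, latent dimension, and LeakyReLU slope are illustrative assumptions standing in for the elided values.

```python
import torch.nn as nn

x_dim, hidden, z_dim, slope = 200, 512, 6, 0.2   # assumed sizes and slope

encoder = nn.Sequential(
    nn.Linear(x_dim, hidden), nn.LeakyReLU(slope),
    nn.Linear(hidden, hidden), nn.LeakyReLU(slope),
    nn.Linear(hidden, z_dim),
)

nn_decoder = nn.Sequential(   # replaces the polynomial decoder in the NN-decoder experiments
    nn.Linear(z_dim, hidden), nn.LeakyReLU(slope),
    nn.Linear(hidden, hidden), nn.LeakyReLU(slope),
    nn.Linear(hidden, x_dim),
)
```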
Hyperparameters.
We use the Adam optimizer with the hyperparameters defined below. We also use an early stopping strategy, where we halt training if the validation loss does not improve for 10 consecutive epochs. A training-loop sketch follows the list.
- Batch size:
- Weight decay:
- Total epochs:
- Learning rate: optimal value chosen from the grid:
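The training loop below sketches the optimization and early-stopping scheme described above; the default learning rate, weight decay, batch size, and epoch budget are placeholders for the grid values listed in the bullets.

```python
import copy
import torch
from torch.utils.data import DataLoader, TensorDataset

def train_autoencoder(encoder, decoder, x_train, x_val, lr=1e-3, weight_decay=1e-5,
                      batch_size=64, max_epochs=200, patience=10):
    params = list(encoder.parameters()) + list(decoder.parameters())
    opt = torch.optim.Adam(params, lr=lr, weight_decay=weight_decay)
    loader = DataLoader(TensorDataset(x_train), batch_size=batch_size, shuffle=True)
    best_val, best_state, wait = float("inf"), None, 0
    for _ in range(max_epochs):
        for (xb,) in loader:
            loss = ((decoder(encoder(xb)) - xb) ** 2).mean()   # reconstruction loss
            opt.zero_grad(); loss.backward(); opt.step()
        with torch.no_grad():
            val = ((decoder(encoder(x_val)) - x_val) ** 2).mean().item()
        if val < best_val:
            best_val, wait = val, 0
            best_state = copy.deepcopy((encoder.state_dict(), decoder.state_dict()))
        else:
            wait += 1
            if wait >= patience:   # early stopping after `patience` epochs without improvement
                break
    return best_state, best_val
```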
B.3 Additional results: Polynomial decoder ()
Table 6 presents additional details for Table 2 in the main paper. We report additional metrics such as the mean squared loss for the autoencoder reconstruction task (Recon-MSE) and the MCC computed using representations from Step 1. Note that training with the independence of support objective in Step 2 leads to better MCC scores than using the representations from Step 1 on distributions that satisfy independence of support. Also, the Uniform-Correlated (Uniform-C) latent case can be interpreted as another sparse SCM with confounders between latent variables. In this case, the latent variables are not independent, but their supports are still independent; therefore, we see an improvement in MCC with IOS training in Step 2. Similarly, Table 7 presents the extended results for the interventional case using the polynomial decoder (Table 3 in the main paper), with additional metrics such as Recon-MSE and to test for affine identification using representations from Step 1. We notice the same pattern for all latent distributions: training on interventional data in Step 2 improves the MCC metric.
Further, we also experiment with a neural network-based decoder to obtain a more standard autoencoder architecture, where we do not assume access to the specific polynomial structure or the degree of the polynomial. Table 8 presents the results with the NN decoder for the observational case, where we see a trend similar to the polynomial decoder case (Table 6): the MCC increases with IOS training in Step 2 for the Uniform and Uniform-C latent distributions. Similarly, Table 9 presents the results with the NN decoder for the interventional case, where the trend is similar to the polynomial decoder case (Table 7), though the MCC (IL) for the SCM-sparse and SCM-dense cases is lower than with the polynomial decoder.
Recon-MSE | MCC | MCC (IOS) | ||||
---|---|---|---|---|---|---|
Uniform | ||||||
Uniform | ||||||
Uniform | ||||||
Uniform | ||||||
Uniform-C | ||||||
Uniform-C | ||||||
Uniform-C | ||||||
Uniform-C | ||||||
Gaussian-Mixture | ||||||
Gaussian-Mixture | ||||||
Gaussian-Mixture | ||||||
Gaussian-Mixture | ||||||
SCM-S | ||||||
SCM-S | ||||||
SCM-S | ||||||
SCM-S | ||||||
SCM-D | ||||||
SCM-D | ||||||
SCM-D | ||||||
SCM-D |
Recon-MSE | MCC | MCC (IL) | ||||
---|---|---|---|---|---|---|
Uniform | ||||||
Uniform | ||||||
Uniform | ||||||
Uniform | ||||||
Uniform-C | ||||||
Uniform-C | ||||||
Uniform-C | ||||||
Uniform-C | ||||||
Gaussian-Mixture | ||||||
Gaussian-Mixture | ||||||
Gaussian-Mixture | ||||||
Gaussian-Mixture | ||||||
SCM-S | ||||||
SCM-S | ||||||
SCM-S | ||||||
SCM-S | ||||||
SCM-D | ||||||
SCM-D | ||||||
SCM-D | ||||||
SCM-D |
Recon-MSE | MCC | MCC (IOS) | ||||
---|---|---|---|---|---|---|
Uniform | ||||||
Uniform | ||||||
Uniform | ||||||
Uniform | ||||||
Uniform-C | ||||||
Uniform-C | ||||||
Uniform-C | ||||||
Uniform-C | ||||||
Gaussian-Mixture | ||||||
Gaussian-Mixture | ||||||
Gaussian-Mixture | ||||||
Gaussian-Mixture | ||||||
SCM-S | ||||||
SCM-S | ||||||
SCM-S | ||||||
SCM-S | ||||||
SCM-D | ||||||
SCM-D | ||||||
SCM-D | ||||||
SCM-D |
Recon-MSE | MCC | MCC (IL) | ||||
---|---|---|---|---|---|---|
Uniform | ||||||
Uniform | ||||||
Uniform | ||||||
Uniform | ||||||
Uniform-C | ||||||
Uniform-C | ||||||
Uniform-C | ||||||
Uniform-C | ||||||
Gaussian-Mixture | ||||||
Gaussian-Mixture | ||||||
Gaussian-Mixture | ||||||
Gaussian-Mixture | ||||||
SCM-S | ||||||
SCM-S | ||||||
SCM-S | ||||||
SCM-S | ||||||
SCM-D | ||||||
SCM-D | ||||||
SCM-D | ||||||
SCM-D |
B.4 Experiment setup details: Synthetic image experiments
The latent variable comprises the (, ) coordinates of two balls; hence we have a -dimensional latent variable. We use the PyGame (Shinners, 2011) rendering engine to produce final images of dimension .
Latent Distributions.
We denote the coordinates of Ball 1 as (, ) and of Ball 2 as (, ). We consider the following three cases for the latent distributions in the synthetic image experiments (a sampling and rendering sketch follows the list):
- Uniform: Each coordinate of Ball 1 (, ) and Ball 2 (, ) is sampled from .
- SCM (linear): The coordinates of Ball 1 (, ) are sampled from and are used to sample the coordinates of Ball 2 as follows:
- SCM (non-linear): The coordinates of Ball 1 (, ) are sampled from and are used to sample the coordinates of Ball 2 as follows:
Case | Train | Validation | Test |
---|---|---|---|
Observational () | 20000 | 5000 | 20000 |
Interventional () | 20000 | 5000 | 20000 |
Further details on dataset and evaluation.
For experiments in Table 4, the details regarding the train/val/test split are described in Table 10.
Note that the interventional data () is composed of do-interventions on each latent variable (), where the latent variable to be intervened on is sampled from . Hence, each latent variable is equally likely to be intervened on.
While performing do-interventions on any latent variable (), we control the total number of distinct values the latent takes under intervention (#interv; each distinct value corresponds to sampling data from one interventional distribution). When #interv , we set the latent variable to the value 0.5. When #interv , we choose the intervention values for latent variable as #interv equally spaced points from . E.g., when #interv , the possible values after a do-intervention on latent variable are . Note that the intervention value is sampled uniformly at random from this set of intervention values.
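For concreteness, the grid of intervention values can be built as in the sketch below; the interval endpoints are placeholders for the elided range.

```python
import numpy as np

low, high = 0.2, 0.8   # assumed range from which intervention values are drawn

def intervention_values(n_interv):
    # A single intervention uses the value 0.5; otherwise use equally spaced points.
    return np.array([0.5]) if n_interv == 1 else np.linspace(low, high, n_interv)

values = intervention_values(9)
do_value = np.random.default_rng(0).choice(values)   # sampled uniformly per example
```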
Note that we only use the observational data () for training the autoencoder in Step 1, while the non-linear transformations in Step 2 (Eq. 7) are learned using the corresponding interventional data (). Further, the metrics (MCC, MCC (IL)) are computed only on the test split of the observational data () (no interventional data used).
Model architecture.
We use the following architecture for the encoder across all experiments (Table 4) in Step 1 of minimizing the reconstruction loss:
- ResNet-18 architecture (no pre-training): Image () → penultimate layer output (-dimensional)
- Linear Layer ; BatchNorm(); LeakyReLU()
- Linear Layer ; BatchNorm()
We use the following architecture for the decoder across all experiments (Table 4) in Step 1 of minimizing the reconstruction loss. Our decoder architecture is inspired by the implementation in widely used prior work (Locatello et al., 2019).
- Linear Layer ; LeakyReLU()
- Linear Layer ; LeakyReLU()
- DeConvolution Layer (: , : , kernel: ; stride: ; padding: ); LeakyReLU()
- DeConvolution Layer (: , : , kernel: ; stride: ; padding: ); LeakyReLU()
- DeConvolution Layer (: , : , kernel: ; stride: ; padding: ); LeakyReLU()
- DeConvolution Layer (: , : , kernel: ; stride: ; padding: ); LeakyReLU()
Note: Here the latent dimension of the encoder () is not equal to the true latent dimension (), as the latter would lead to issues with training the autoencoder itself. Also, this choice better reflects practical scenarios where we do not know the latent dimension beforehand. A PyTorch sketch of the encoder and decoder is given below.
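A PyTorch sketch of this encoder-decoder pair follows; the encoder latent width, hidden sizes, channel widths, kernel/stride/padding values, and the final activation are assumptions standing in for the elided numbers in the lists above.

```python
import torch.nn as nn
from torchvision.models import resnet18

latent_dim, slope = 16, 0.2                 # assumed encoder output width and LeakyReLU slope

backbone = resnet18(weights=None)           # no pre-training
backbone.fc = nn.Identity()                 # expose the 512-d penultimate features
encoder = nn.Sequential(
    backbone,
    nn.Linear(512, 128), nn.BatchNorm1d(128), nn.LeakyReLU(slope),
    nn.Linear(128, latent_dim), nn.BatchNorm1d(latent_dim),
)

decoder = nn.Sequential(                    # deconvolution stack mapping latents to 3x64x64
    nn.Linear(latent_dim, 256), nn.LeakyReLU(slope),
    nn.Linear(256, 256 * 4 * 4), nn.LeakyReLU(slope),
    nn.Unflatten(1, (256, 4, 4)),
    nn.ConvTranspose2d(256, 128, kernel_size=4, stride=2, padding=1), nn.LeakyReLU(slope),
    nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1), nn.LeakyReLU(slope),
    nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1), nn.LeakyReLU(slope),
    nn.ConvTranspose2d(32, 3, kernel_size=4, stride=2, padding=1),
)
```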
For learning the mappings from the corresponding interventional data (), we use the default MLP Regressor class from scikit-learn (Pedregosa et al., 2011) with 1000 max iterations for convergence.
Hyperparameters.
We use the Adam optimizer with the hyperparameters defined below. We also use an early stopping strategy, where we halt training if the validation loss does not improve for 100 consecutive epochs.
- Batch size:
- Weight decay:
- Total epochs:
- Learning rate:
B.5 Additional Results: Synthetic Image Experiments
#interv | Recon-RMSE | MCC (IL) | ||
---|---|---|---|---|
Uniform | ||||
Uniform | ||||
Uniform | ||||
Uniform | ||||
Uniform | ||||
SCM (linear) | ||||
SCM (linear) | ||||
SCM (linear) | ||||
SCM (linear) | ||||
SCM (linear) | ||||
SCM (non-linear) | ||||
SCM (non-linear) | ||||
SCM (non-linear) | ||||
SCM (non-linear) | ||||
SCM (non-linear) |
Table 11 presents more details for Table 4 in the main paper, with additional metrics such as the mean squared loss for the autoencoder reconstruction task (Recon-MSE) and to test for affine identification using representations from Step 1. Note that Recon-RMSE and are computed using the autoencoder trained in Step 1; hence these results are not affected by training with varying #interv per latent in Step 2. We obtain high values across different latent distributions, indicating that the higher-dimensional latents () learned by the encoder are related to the lower-dimensional true latents () by a linear function.
We also report a batch of reconstructed images from the trained autoencoder for the different latent distributions: Uniform (Figure 3), SCM linear (Figure 4), and SCM non-linear (Figure 5). In all cases, the position and color of both balls are accurately reconstructed.
[Figure 3: Reconstructed images from the trained autoencoder, Uniform latent distribution.]
[Figure 4: Reconstructed images from the trained autoencoder, SCM linear latent distribution.]
[Figure 5: Reconstructed images from the trained autoencoder, SCM non-linear latent distribution.]
B.6 Experiments with independence penalty from -VAE
In this section, we provide additional comparisons with models trained with the independence prior on the latents used in -VAEs (Burgess et al., 2018). We take a standard autoencoder that uses a reconstruction penalty and add the -VAE penalty to it. We carry out the comparisons for both the polynomial data-generation experiments and the image-based experiments. For the polynomial data-generation experiments, we use the same MLP-based encoder-decoder architecture used earlier for Table 8 and Table 9. In Table 12 and Table 13, we show the results for the autoencoder trained with the -VAE penalty in the same settings as Table 8 and Table 9, respectively. For the image-based experiments, we use the same ResNet-based encoder-decoder architecture used earlier for Table 4. In Table 14, we show results for the image-based experiments using the same setting as Table 4, focusing on the case with nine interventions.
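For reference, the sketch below shows one way to add the β-VAE penalty (a KL term toward a standard normal prior) on top of a reconstruction-based autoencoder; the Gaussian posterior heads, the β value, and the module interfaces are assumptions of this sketch rather than the exact implementation we used.

```python
import torch
import torch.nn as nn

class BetaPenaltyAutoencoder(nn.Module):
    def __init__(self, encoder_body: nn.Module, decoder: nn.Module,
                 feat_dim: int, z_dim: int, beta: float = 4.0):
        super().__init__()
        self.encoder_body, self.decoder, self.beta = encoder_body, decoder, beta
        self.mu = nn.Linear(feat_dim, z_dim)        # Gaussian posterior mean head
        self.logvar = nn.Linear(feat_dim, z_dim)    # Gaussian posterior log-variance head

    def loss(self, x):
        h = self.encoder_body(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization
        recon = ((self.decoder(z) - x) ** 2).mean()
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return recon + self.beta * kl               # reconstruction + beta-VAE penalty
```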
MCC () | MCC () | MCC () | MCC (IOS) | |||
---|---|---|---|---|---|---|
Uniform | ||||||
Uniform | ||||||
Uniform | ||||||
Uniform | ||||||
SCM-S | ||||||
SCM-S | ||||||
SCM-S | ||||||
SCM-S | ||||||
SCM-D | ||||||
SCM-D | ||||||
SCM-D | ||||||
SCM-D |
MCC () | MCC () | MCC () | MCC (IL) | |||
---|---|---|---|---|---|---|
Uniform | ||||||
Uniform | ||||||
Uniform | ||||||
Uniform | ||||||
SCM-S | ||||||
SCM-S | ||||||
SCM-S | ||||||
SCM-S | ||||||
SCM-D | ||||||
SCM-D | ||||||
SCM-D | ||||||
SCM-D |
MCC () | MCC () | MCC () | MCC (IL) | |
---|---|---|---|---|
Uniform | ||||
SCM (linear) | ||||
SCM (non-linear) |