
Kartik Ahuja* ([email protected])
Mila - Quebec AI Institute, Université de Montréal

Divyat Mahajan* ([email protected])
Mila - Quebec AI Institute, Université de Montréal

Vasilis Syrgkanis ([email protected])
Microsoft Research, New England

Ioannis Mitliagkas ([email protected])
Mila - Quebec AI Institute, Université de Montréal

*Equal contribution

Towards efficient representation identification in supervised learning

Abstract

Humans have a remarkable ability to disentangle complex sensory inputs (e.g., image, text) into simple factors of variation (e.g., shape, color) without much supervision. This ability has inspired many works that attempt to answer the following question: how do we invert the data generation process to extract those factors with minimal or no supervision? Several works in the literature on non-linear independent component analysis have established the following negative result: without some knowledge of the data generation process or appropriate inductive biases, it is impossible to perform this inversion. In recent years, a lot of progress has been made on disentanglement under structural assumptions, e.g., when we have access to auxiliary information that makes the factors of variation conditionally independent. However, existing work requires a lot of auxiliary information, e.g., in supervised classification, it prescribes that the number of label classes should be at least equal to the total dimension of all factors of variation. In this work, we depart from these assumptions and ask: a) How can we get disentanglement when the auxiliary information does not provide conditional independence over the factors of variation? b) Can we reduce the amount of auxiliary information required for disentanglement? For a class of models where auxiliary information does not ensure conditional independence, we show theoretically and experimentally that disentanglement (to a large extent) is possible even when the auxiliary information dimension is much smaller than the dimension of the true latent representation.

keywords:
disentanglement, non-linear independent component analysis

1 Introduction

Representation learning (Bengio et al., 2013) aims to extract low dimensional representations from high dimensional complex datasets. The hope is that if these representations succinctly capture the factors of variation that describe the high dimensional data (e.g., features characterizing the shape of an object in an image), then they can be leveraged to achieve good performance on new downstream tasks with minimal supervision. Large scale pre-trained language models demonstrate the major successes of representation learning based approaches (Brown et al., 2020; Wei et al., 2021; Radford et al., 2021). However, we should look at these results with a dose of caution, as neural networks have also been shown to fail often at out-of-distribution generalization (Beery et al., 2018; Geirhos et al., 2020; Peyrard et al., 2021). To address out-of-distribution generalization failures, recent works (Schölkopf, 2019; Schölkopf et al., 2021; Wang and Jordan, 2021) have argued in favour of incorporating causal principles into standard training paradigms, both supervised (Arjovsky et al., 2019) and unsupervised (von Kügelgen et al., 2021). The issue is that the current deep learning paradigm does not incorporate and exploit key principles of causality (Pearl, 2009; Schölkopf, 2019): the invariance principle, the independent causal mechanisms principle, and causal factorization. This is because traditional causal inference requires access to structured random variables whose distributions can be decomposed using causal factorization, which is impossible with complex datasets such as images or text. Therefore, to leverage the power of deep learning and causal principles, we first need to disentangle raw datasets to obtain the causal representations that generated the data, and then exploit tools from causal structure learning to pin down the relationships between those representations (Ke et al., 2019; Brouillard et al., 2020).

It has been shown that the general process of disentanglement is impossible in the absence of side knowledge of the structure of the data generation process (Hyvärinen and Pajunen, 1999; Locatello et al., 2019). However, under additional structural assumptions on the data generation process, it is possible to invert the data generation process and recover the underlying factors of variation (Hyvarinen and Morioka, 2016). Recently, there have been works (Hyvarinen et al., 2019; Khemakhem et al., 2020a) which present a general framework that relies on auxiliary information (e.g., labels, timestamps) to disentangle the latents. While existing works (Hyvarinen et al., 2019; Khemakhem et al., 2020a) have made remarkable progress in the field of disentanglement, these works make certain key assumptions highlighted below that we significantly depart from.

\bullet Labels cause the latent variables. In supervised learning datasets, there are two ways to think about the data generation process: a) labels cause the latent variables, and b) latent variables cause the labels. Schölkopf et al. (2012) argue for the former view, i.e., labels generate the latents, while Arjovsky et al. (2019) argue for the latter view, i.e., latents generate the label (see Figure 1). The current non-linear ICA literature (Khemakhem et al., 2020a; Hyvarinen et al., 2019) assumes that knowledge of the label renders the latent factors of variation conditionally independent; hence it is compatible with the former perspective (Schölkopf et al., 2012). But the latter view might be more natural for settings where a human assigns labels based on the underlying latent factors. Our goal is to enable disentanglement for this case, when the latent variables cause the labels (Arjovsky et al., 2019).

\bullet Amount of auxiliary information. Existing works (Khemakhem et al. (2020a), Theorem 1; Khemakhem et al. (2020b)) require a lot of auxiliary information, e.g., the number of label classes should be twice the total dimension of the latent factors of variation to guarantee disentanglement. We seek to enable disentanglement with less auxiliary information.
Contributions. We consider the following data generation process: latent factors generate the observations (raw features) and the labels for multiple tasks, where the latent factors are mutually independent. We study a natural extension of the standard empirical risk minimization (ERM) (Vapnik, 1992) paradigm. The most natural heuristic for learning representations is to train a neural network using ERM and use the output of the representation layer before the final layer. In this work, we propose to add a constraint to ERM that facilitates disentanglement: all the components of the representation layer must be mutually independent. Our main findings for the representations learned by this constrained ERM are summarized below.

\bullet If the number of tasks is at least equal to the dimension of the latent variables, and the latent variables are not Gaussian, then we can recover the latent variables up to permutation and scaling.

\bullet If we only have a single task and the latent variables come from an exponential family whose log-density can be expressed as a polynomial, then under further constraints on both the learner’s inductive bias and the function being inverted, we can recover the latent variables up to permutation and scaling.

\bullet To implement constrained ERM, we propose a simple two-step approximation. In the first step, we train a standard ERM based model, and in the subsequent step we carry out linear ICA (Comon, 1994) on the representation extracted from ERM. We carry out experiments with the above procedure for regression and classification. Our experiments show that even with the approximate procedure, it is possible to recover the true latent variables up to permutation and scaling when the number of tasks is smaller than the latent dimension.

2 Related work

Non-linear ICA with auxiliary information. We first describe the works in non-linear ICA where the time index itself serves as auxiliary information. Hyvarinen and Morioka (2016) showed that if each component of the latent vector evolves independently and follows a non-stationary time series without temporal dependence, then identification is possible for non-linear ICA. In contrast, Hyvarinen and Morioka (2017) showed that identification is also possible if the latent variables are mutually independent, with each component evolving in time as a stationary time series with temporal dependence. Khemakhem et al. (2020a); Hyvarinen et al. (2019); Khemakhem et al. (2020b) further generalized these results: instead of using time, they require observation of auxiliary information. Note that Hyvarinen and Morioka (2017); Hyvarinen et al. (2019); Khemakhem et al. (2020a) share the limitation that the auxiliary information must render the latent variables conditionally independent. This assumption was relaxed to some extent in Khemakhem et al. (2020b); however, the model in that work is not compatible with the data generation perspective that we consider, i.e., latent variables cause the labels. Recently, Roeder et al. (2020) studied representation identification for classification and self-supervised learning models, showing that if there is a sufficient number of class labels (at least as many as the dimension of the latent variables), then the representations learned by neural networks with different initializations are related by a linear transformation. Their work does not focus on recovering the true latent variables and instead studies whether neural networks learn similar representations across different seeds.

Other works. In another line of work (Locatello et al., 2020; Shu et al., 2019), the authors study the role of weak supervision in assisting disentanglement. This line of work differs from ours, as these models do not consider labelled supervised learning datasets, but rather use different sources of supervision. For example, Locatello et al. (2020) use multiple views of the same object as a form of weak supervision.

3 Problem Setup

Figure 1: (a) Data generation process in Khemakhem et al. (2020b); (b) Data generation process studied in this work.

3.1 Data generation process

Before we describe the data generation process we use, we give an example of the data generation process that is compatible with the assumptions in (Khemakhem et al., 2020a).

Y\leftarrow\mathsf{Bernoulli}\big(\tfrac{1}{2}\big)\qquad Z\leftarrow\mathcal{N}(Y\mathbf{1},\mathsf{I})\qquad X\leftarrow g(Z) \quad (1)

where $\mathsf{Bernoulli}(\frac{1}{2})$ is the uniform Bernoulli distribution over $\{0,1\}$, $\mathcal{N}$ is the normal distribution, $\mathbf{1}\in\mathbb{R}^{d}$ is the vector that, together with the label $Y$, selects the mean of the latent $Z$, $\mathsf{I}$ is the $d$-dimensional identity matrix, and $g$ is a bijection that generates the observations $X$. In Figure 1 (a), we show the causal directed acyclic graph (DAG) corresponding to the above data generation process, where the labels cause the latent variables (Schölkopf et al., 2012). Most current non-linear ICA models are only compatible with this view of the data generation process. This may be valid for some settings, but it is not well suited for human-labelled datasets, where a label is assigned based on the underlying latent factors of variation. Hence, we focus on the opposite perspective (Arjovsky et al., 2019), where the latent variables generate the labels. The assumptions on the data generation process studied in our work are stated formally below.

Assumption 1

The data generation process for regression is described as

\begin{split}&Z\leftarrow h(N_{Z})\\ &X\leftarrow g(Z)\\ &Y\leftarrow\Gamma Z+N_{Y}\end{split} \quad (2)

where $N_{Z}\in\mathbb{R}^{d}$ is noise, $h:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d}$ generates $Z\in\mathbb{R}^{d}$ (the components of the latent variable $Z$ are mutually independent, non-Gaussian, and have finite second moments), $g:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d}$ is a bijection that generates the observations $X$, $\Gamma\in\mathbb{R}^{k\times d}$ is a matrix that generates the label $Y\in\mathbb{R}^{k}$, and $N_{Y}\in\mathbb{R}^{k}$ is the noise vector ($N_{Y}$ is independent of $Z$ and $\mathbb{E}[N_{Y}]=0$).

Note that $d$ is the dimension of the latent representation, and $k$ corresponds to the number of tasks. We adapt the data generation process to multi-task classification by changing the label generation process as follows.

Y\leftarrow\mathsf{Bernoulli}\Big(\sigma\big(\Gamma Z\big)\Big), \quad (3)

where $\sigma$ is the sigmoid function, applied elementwise to $\Gamma Z$ to output the task probabilities, and $\mathsf{Bernoulli}$ operates on these probabilities elementwise to generate the label vector $Y\in\{0,1\}^{k}$ for the $k$ tasks.
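For concreteness, the following is a minimal sketch of this data generation process. The specific latent distribution (continuous uniform), the invertible mixing function $g$ (a rotation followed by a leaky-ReLU), and all dimensions are illustrative assumptions, not the exact setup used in the experiments.

```python
# Sketch of the data generation process of Assumption 1 (regression) and its
# classification variant (3). The latent distribution, mixing function, and
# dimensions below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 6, 6, 5000                        # latent dim, number of tasks, samples

Z = rng.uniform(-1.0, 1.0, size=(n, d))     # mutually independent, non-Gaussian latents
A = np.linalg.qr(rng.normal(size=(d, d)))[0]  # orthogonal (hence invertible) mixing matrix
M = Z @ A
X = np.where(M > 0, M, 0.2 * M)             # bijective g: rotation followed by leaky-ReLU

Gamma = rng.normal(size=(k, d))             # task coefficient matrix
Y_reg = Z @ Gamma.T + rng.normal(size=(n, k))        # regression labels (Assumption 1)
p = 1.0 / (1.0 + np.exp(-Z @ Gamma.T))
Y_cls = rng.binomial(1, p)                            # classification labels (Eq. (3))
```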

We also contrast the DAGs of the two data generation processes in Figure 1. The nature of the auxiliary information (labels) in the data generation process we study (Assumption 1) is very different from the one in prior works (equation 1); conditioning on the auxiliary information (labels) in our data generation process does not make the latent variables independent. Our setting could be interpreted as multi-task supervised learning, where the downstream task labels serve as the auxiliary information generated from shared latent variables.
Remark. The classic non-identifiability results in Locatello et al. (2019) and Hyvärinen and Pajunen (1999) assumed that the latent factors of variation are all mutually independent. These results implied that without some extra knowledge, e.g., auxiliary information (labels), it is impossible to disentangle the latent variables. While Khemakhem et al. (2020a) showed that it is possible to succeed in the presence of auxiliary information, their data generation process assumes that the latent variables are not mutually independent and thus is not consistent with the assumptions in Locatello et al. (2019) and Hyvärinen and Pajunen (1999). In contrast, our work shows that auxiliary information helps in the case considered in Locatello et al. (2019) and Hyvärinen and Pajunen (1999), since the auxiliary information is introduced downstream of the latent variables.

3.2 Identifiability

Our objective is to learn the model $g^{-1}$ (or the generator $g$) from the observed data and label pairs $(X,Y)$, such that for new observations $X$ we can recover the latent variables $Z$ that generated $X$. We cannot always learn the exact latent variables $Z$ and may only identify them to some degree. Let us denote the learned model as $\tilde{g}^{-1}$ (with its inverse $\tilde{g}$). We now describe a general notion (commonly used in the literature) of identification for the learned map $\tilde{g}^{-1}$ with respect to the true map $g^{-1}$.

Definition 3.1.

Identifiability up to $\mathcal{A}$. If the learned map $\tilde{g}^{-1}$ and the true map $g^{-1}$ are related by some bijection $a\in\mathcal{A}$, such that $\tilde{g}^{-1}=a\circ g^{-1}$ (or equivalently $\tilde{g}=g\circ a^{-1}$), then $\tilde{g}^{-1}$ is said to learn $g^{-1}$ up to bijections in $\mathcal{A}$. We denote this $\tilde{g}^{-1}\sim_{\mathcal{A}}g^{-1}$.

In the above definition, if $\mathcal{A}$ is the set of all permutation maps $\mathcal{P}$, then $\tilde{g}^{-1}\sim_{\mathcal{P}}g^{-1}$ denotes identification up to permutations. Permutation identification guarantees that we recover the true latent components but not their indices, which is unimportant for many downstream tasks. In other words, identification up to permutations of the latent variables is the gold standard for identification. Our aim is to identify the latent variables $Z$ by inverting the data generation process (learning $g^{-1}$) up to permutations.

4 Identification via independence constrained ERM

The previous section established our objective of learning the model $g^{-1}$ (or the generator $g$) from the observed data $(X,Y)$. We first train a supervised learning model to predict the labels $Y$ from $X$. For the rest of the work, we will assume that the predictor we learn takes the form $\Theta\circ\Phi$, where $\Theta\in\mathbb{R}^{k\times d}$ is a linear predictor that operates on the representation $\Phi:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d}$. As a result, the hypothesis space of functions that the learner searches over has two parts: $\Theta\in\mathcal{H}_{\Theta}$, the hypothesis class of linear maps, and $\Phi\in\mathcal{H}_{\Phi}$, the hypothesis class of representations. We measure the performance of the predictor on an instance $(X,Y)$ using the loss $\ell\big(Y,\Theta\circ\Phi(X)\big)$ (mean square error for regression, cross-entropy loss for classification). We define the risk achieved by a predictor $\Theta\circ\Phi$ as $R\big(\Theta\circ\Phi\big)=\mathbb{E}\Big[\ell\big(Y,\Theta\circ\Phi(X)\big)\Big]$, where the expectation is taken with respect to the data $(X,Y)$.
Independence Constrained ERM (IC-ERM): The representations $\Phi$ learnt by ERM as described above have no guarantee of recovering the true latent variables up to permutations. Hence, we propose a new objective in which the learner carries out constrained empirical risk minimization, where the constraint on the representation layer is that all components of the representation are mutually independent. We provide the formal definition of mutual independence below for the convenience of the reader.

Definition 4.1.

Mutual independence. A random vector $V=[V_{1},\cdots,V_{d}]$ is said to be mutually independent if for each subset $\mathcal{M}\subseteq\{1,\cdots,d\}$ we have $P(\{V_{i}\}_{i\in\mathcal{M}})=\prod_{i\in\mathcal{M}}P(V_{i})$.

We state the proposed independence constrained ERM (IC-ERM) objective formally as follows:

\min_{\Theta\in\mathcal{H}_{\Theta},\,\Phi\in\mathcal{H}_{\Phi}}R(\Theta\circ\Phi)\;\;\text{s.t.}\;\;\Phi(X)\;\text{is mutually independent (Definition 4.1)} \quad (4)
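As an aside, one practical way to relax the hard constraint in (4) is to penalize statistical dependence between the components of $\Phi(X)$. The sketch below uses a pairwise HSIC penalty as an illustrative surrogate; this is our own assumption for illustration, not the procedure analyzed in this paper (Section 5 instead proposes a two-step ERM plus linear ICA approximation), and pairwise independence is weaker than the mutual independence required by (4).

```python
# Sketch: relaxing the IC-ERM constraint (4) into a penalty via pairwise HSIC.
# This surrogate is an illustrative assumption, not the paper's procedure.
import torch

def rbf_gram(x, sigma=1.0):
    # x: (n, 1) tensor holding one coordinate of the representation
    d2 = (x - x.T) ** 2
    return torch.exp(-d2 / (2 * sigma ** 2))

def hsic(x, y, sigma=1.0):
    # Biased HSIC estimator between two scalar random variables.
    n = x.shape[0]
    K, L = rbf_gram(x, sigma), rbf_gram(y, sigma)
    H = torch.eye(n) - torch.ones(n, n) / n          # centering matrix
    return torch.trace(K @ H @ L @ H) / (n - 1) ** 2

def ic_erm_surrogate_loss(y, y_hat, phi, lam=1.0):
    # ERM risk (square loss for regression) plus a pairwise-dependence penalty on phi.
    risk = torch.mean((y - y_hat) ** 2)
    d = phi.shape[1]
    penalty = sum(hsic(phi[:, [i]], phi[:, [j]])
                  for i in range(d) for j in range(i + 1, d))
    return risk + lam * penalty
```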

We now state theorems showing that the IC-ERM learning objective recovers the true latent variables up to permutations under certain assumptions. It is intuitive that more auxiliary information, i.e., a larger number of tasks $k$, should help us identify the latent variables, since they are shared across these different tasks. Hence, we first state identification guarantees for IC-ERM when we have sufficiently many tasks, and then discuss the more difficult case when we have few tasks.

4.1 Identification when the number of tasks is equal to the latent dimension

We first study the setting where the number of tasks $k$ is equal to the dimension of the latent variables $d$. Before we describe the theorem, we state assumptions on the hypothesis class of the representations ($\mathcal{H}_{\Phi}$) and the classifier ($\mathcal{H}_{\Theta}$).

Assumption 2

Assumption on $\mathcal{H}_{\Phi}$ and $\mathcal{H}_{\Theta}$. For the true solution ($g^{-1}$, $\Gamma$), we have $g^{-1}\in\mathcal{H}_{\Phi}$ and $\Gamma\in\mathcal{H}_{\Theta}$. For the case when $k=d$, the set $\mathcal{H}_{\Theta}$ corresponds to the set of all invertible matrices.

The above assumption ensures that the true solutions $g^{-1}$ and $\Gamma$ lie in the respective hypothesis classes that the learner searches over. Also, the invertibility assumption on the hypotheses in $\mathcal{H}_{\Theta}$ is needed to ensure that we do not have redundant tasks for the identification of the latent variables. Under the above assumption and the assumptions of our data generation process (Assumption 1), we state the following identification result for the case when $k=d$.

Theorem 1.

If Assumptions 1 and 2 hold and the number of tasks $k$ is equal to the dimension of the latent $d$, then the solution $\Theta^{\dagger}\circ\Phi^{\dagger}$ to IC-ERM (4), with $\ell$ the square loss for regression and the cross-entropy loss for classification, identifies the true $Z$ up to permutation and scaling.

The proof is provided in Appendix A. This implies that for the DAG in Figure 1 (b), it is possible to recover the true latents up to permutation and scaling. This result extends the current disentanglement guarantees (Khemakhem et al., 2020b), which exist for models where labels cause the latent variables (latent variables are conditionally independent), to settings where latent variables cause the label (latent variables are not conditionally independent). In the multi-task learning literature (Caruana, 1997; Zhang and Yang, 2017), it has been argued that learning across multiple tasks with shared layers leads to internal representations that transfer better. The above result states the conditions under which the ideal data generating representation shared across tasks can be recovered.

Remark. Since we use a linear model for the label generation process, one can ask what happens if we apply noisy linear ICA techniques (Davies, 2004; Hyvarinen, 1999) to the label itself (when $k=d$) to recover $Z$, followed by a regression to predict $Z$ from $X$. Noisy linear ICA requires the noise distribution to be Gaussian and would not work when $N_{Y}$ is not Gaussian. Since we do not make such distributional assumptions on $N_{Y}$, we cannot rely on noisy linear ICA on the labels.

4.2 Identification when the number of tasks is less than the dimension of the latent

In this section, we study the setting where the number of tasks $k$ is equal to one. Since this setting is extreme, we need to make stronger assumptions to show latent identification guarantees. Before we lay down the assumptions, we provide some notation. Since we only have a single task, instead of using the matrix $\Gamma\in\mathbb{R}^{k\times d}$ we use $\gamma\in\mathbb{R}^{d}$ to denote the coefficients that generate the label in the single-task setting. We assume each component of $\gamma$ is non-zero. In the single-task setting for regression problems, the label generation is written as $Y\leftarrow\gamma^{\mathsf{T}}Z+N_{Y}$, and the rest of the notation is the same as in the data generation process of Assumption 1. We rewrite the data generation process of Assumption 1 for the single-task case in terms of the normalized variable $U=Z\odot\gamma$.

Assumption 3

The data generation process for regression is described as

\begin{split}&Z\leftarrow h(N_{Z})\\ &Y\leftarrow\boldsymbol{1}^{\mathsf{T}}U+N_{Y},\\ &X\leftarrow g^{\prime}(U),\end{split} \quad (5)

where $g^{\prime}(U)=g(U\odot\frac{1}{\gamma})$, with $U\odot\frac{1}{\gamma}=[\frac{U_{1}}{\gamma_{1}},\cdots,\frac{U_{d}}{\gamma_{d}}]$. We assume that all the components of $U$ are mutually independent and identically distributed (i.i.d.).

Note that $g^{\prime}$ is invertible since $g$ is invertible and each element of $\gamma$ is non-zero. Hence, for simplicity, we can deal with the identification of $U$: if we identify $U$ up to permutation and scaling, then $Z$ is automatically identified up to permutation and scaling. The predictor we learn is a composition of a linear predictor $\theta$ and a representation $\Phi$, written as $\theta\circ\Phi(X)=\theta^{\mathsf{T}}\Phi(X)$. The learner searches for $\theta$ in the set $\mathcal{H}_{\theta}$, which consists of linear predictors with all non-zero components, and for $\Phi$ in the set $\mathcal{H}_{\Phi}$.

We can further simplify the predictor as follows: $\theta^{\mathsf{T}}\Phi(X)=\boldsymbol{1}^{\mathsf{T}}(\Phi(X)\odot\theta)$, where $\Phi(X)\odot\theta=[\Phi_{1}(X)\theta_{1},\cdots,\Phi_{d}(X)\theta_{d}]$ denotes componentwise multiplication. Therefore, instead of searching over $\mathcal{H}_{\theta}$ such that all components of $\theta$ are non-zero, we can fix $\mathcal{H}_{\theta}=\{\boldsymbol{1}\}$ and carry out the search over the representations $\mathcal{H}_{\Phi}$ only. For the rest of the section, without loss of generality, we assume the predictor is of the form $\boldsymbol{1}\circ\Phi(X)=\boldsymbol{1}^{\mathsf{T}}\Phi(X)$. We restate IC-ERM (4) with this parametrization and an additional constraint: all components are now required to be independent and identically distributed (i.i.d.). We provide a formal definition below for the convenience of the reader, where $\stackrel{d}{=}$ denotes equality in distribution.

Definition 3.

Independent & Identically Distributed (i.i.d.). A random vector $V=[V_{1},\cdots,V_{d}]$ is said to be i.i.d. if 1) $V_{i}\stackrel{d}{=}V_{j}\;\forall i,j\in\{1,\cdots,d\}$, and 2) $P(\{V_{i}\}_{i\in\mathcal{M}})=\prod_{i\in\mathcal{M}}P(V_{i})\;\forall\mathcal{M}\subseteq\{1,\cdots,d\}$.

The reparametrized version of the IC-ERM objective (4) is stated as follows.

\min_{\Phi\in\mathcal{H}_{\Phi}}R(\boldsymbol{1}\circ\Phi)\;\;\text{s.t.}\;\;\Phi(X)\;\text{is i.i.d. (Definition 3)} \quad (6)

Next, we state the assumptions on each component of $U$ (recall that the components of $U$ are i.i.d. by Assumption 3) and on $\mathcal{H}_{\Phi}$ under which the reparametrized IC-ERM objective (6) recovers the true latent variables $U$ up to permutations. We assume each component of $U$ is a continuous random variable with probability density function (PDF) $r$. Define the support of each component of $U$ as $\mathcal{S}=\{u\;|\;r(u)>0,u\in\mathbb{R}\}$, and define a ball of radius $\sqrt{2}p$ as $\mathcal{B}_{p}=\{u\;|\;|u|^{2}\leq 2p^{2},u\in\mathbb{R}\}$.

Assumption 4

Each component of $U$ is a continuous random variable from the exponential family with probability density $r$. $\log(r)$ is a polynomial of degree $p$ (where $p$ is odd), written as

\log\big(r(u)\big)=\sum_{k=0}^{p}a_{k}u^{k}

where the absolute values of the coefficients of the polynomial are bounded by $a_{\mathsf{max}}$, i.e., $|a_{k}|\leq a_{\mathsf{max}}$ for all $k\in\{1,\cdots,p\}$, and the absolute value of the coefficient of the highest degree term is at least $a_{\mathsf{min}}$, i.e., $|a_{p}|\geq a_{\mathsf{min}}>0$. The support of $r$ is sufficiently large that it contains $\mathcal{B}_{p}$, i.e., $\mathcal{B}_{p}\subseteq\mathcal{S}$. Also, the moment generating function of each component $i$ of $U$, $M_{U_{i}}(t)$, exists for all $t$.

Remark on the PDFs under the above assumption. The above assumption considers distributions in the exponential family whose log-PDF can be expressed as a polynomial. Note that as long as the support of the distribution is bounded, every polynomial with bounded coefficients leads to a valid PDF (i.e., one that integrates to one); we only need to set the value of $a_{0}$ appropriately.

We now state our assumptions on the hypothesis class $\mathcal{H}_{\Phi}$ that the learner searches over. Observe that $\Phi(X)$ can be written as $h(U)=\Phi\big(g^{\prime}(U)\big)$ (since $X=g^{\prime}(U)$). We write the set of all maps $h$ constructed by composing $\Phi\in\mathcal{H}_{\Phi}$ with $g^{\prime}$ as $\mathcal{H}_{\Phi}\circ g^{\prime}$. Define $w(u_{1},\cdots,u_{d})=\log\Big(\big|\mathsf{det}\big[J(h(u_{1},\cdots,u_{d}))\big]\big|\Big)$, where $\mathsf{det}$ is the determinant and $J(h(u_{1},\cdots,u_{d}))$ is the Jacobian of $h$ computed at $(u_{1},\cdots,u_{d})$. The set of all $w$'s obtained from all $h\in\mathcal{H}_{\Phi}\circ g^{\prime}$ is denoted $\mathcal{W}$.

Assumption 5

$\mathcal{H}_{\Phi}$ consists of analytic bijections. For each $\Phi\in\mathcal{H}_{\Phi}$, the moment generating function of each component $\Phi_{i}(X)$, $i\in\{1,\cdots,d\}$, denoted $M_{\Phi_{i}(X)}(t)$, exists for all $t$. Each $w\in\mathcal{W}$ is a finite degree polynomial with degree at most $q$, where the absolute values of the coefficients of the polynomial are bounded by $b_{\mathsf{max}}$.

Define $p_{\mathsf{min}}=\max\Big\{\kappa\log(8(d-1)),\;\frac{4a_{\mathsf{max}}(d-1)}{a_{\mathsf{min}}},\;\frac{1}{2}\log\Big(\frac{4b_{\mathsf{max}}\,n_{\mathsf{poly}}}{a_{\mathsf{min}}}\Big)+q\Big\}$, where $n_{\mathsf{poly}}$ is the maximum number of non-zero coefficients in any polynomial $w\in\mathcal{W}$ and $\kappa$ is a small constant (see Appendix C).

Theorem 2.

If Assumptions 3, 4, and 5 hold, $(g^{\prime})^{-1}\in\mathcal{H}_{\Phi}$, and $p$ is sufficiently large ($p\geq p_{\mathsf{min}}$), then the solution $\Phi^{\dagger}(X)$ of the reparametrized IC-ERM objective (6) recovers the true latent $U$ of the data generation process in Assumption 3 up to permutations.

Proof sketch. The complete proof is available in Appendix C; we provide an overview here. We use the optimality condition that the prediction made by the learned model, $\boldsymbol{1}^{\mathsf{T}}h(U)$, exactly matches the true mean, $\boldsymbol{1}^{\mathsf{T}}U$, along with the constraints that the components of $U$ are i.i.d. and the components of $h(U)$ are i.i.d., to derive that the distributions of $U$ and $h(U)$ are the same. We substitute this condition into the change of variables formula that relates the densities of $U$ and $h(U)$. Using Assumption 5, we can show that if the largest absolute value among the components of $U$ and the largest absolute value among the components of $h(U)$ are not equal, then the term corresponding to the larger of the two dominates, leading to a contradiction in the identity obtained from the change of variables formula. Based on this, we conclude that the largest absolute values of $U$ and $h(U)$ must be equal. We iteratively apply this argument to show that all the absolute values of $U$ and $h(U)$ are related by a permutation, and we can extend the argument from absolute values to actual values. Since $h$ is analytic, we argue that the permutation relationship holds in a neighborhood, and then use properties of analytic functions (Mityagin, 2015) to conclude that it holds on the entire space.
Why does the bound on $p$ grow linearly in $d$? We provide some geometric intuition into why the degree $p$ of the polynomial log-PDF $\log(r)$ needs to be large. If the dimension of the latent space $d$ is large, then the second term in $p_{\mathsf{min}}$ dominates, i.e., $p$ has to grow linearly in $d$. The simplification in the proof yields that the mapping $h$ must satisfy the following condition: $\|u\|_{k}=\|h(u)\|_{k}$ for all $k\in\{1,\cdots,p\}$, where $\|\cdot\|_{k}$ is the $k^{th}$ norm. Hence, $h$ is a bijection that preserves all norms up to the $p^{th}$ norm. If $U$ is $2$-dimensional, then a bijection $h$ that preserves both the $\ell_{1}$ norm and the $\ell_{2}$ norm is composed of permutations and sign-flips. In general, since $U$ is $d$-dimensional, we need at least $d$ constraints on $h$ of the form $\|u\|_{k}=\|h(u)\|_{k}$, and thus $p\geq d$, to ensure that the only maps satisfying these constraints are composed of permutations and sign-flips.
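To make the $d=2$ case concrete, here is a short worked version of this intuition (a sketch, written pointwise in $u$ with $h=(h_{1},h_{2})$). The two norm constraints read
\[
|u_{1}|+|u_{2}|=|h_{1}(u)|+|h_{2}(u)|,\qquad u_{1}^{2}+u_{2}^{2}=h_{1}(u)^{2}+h_{2}(u)^{2}.
\]
Squaring the first identity and subtracting the second gives $2|u_{1}||u_{2}|=2|h_{1}(u)||h_{2}(u)|$. The pairs $(|u_{1}|,|u_{2}|)$ and $(|h_{1}(u)|,|h_{2}(u)|)$ therefore have the same sum and the same product, so they are the two roots of the same quadratic $t^{2}-st+q=0$ and coincide as unordered pairs, i.e., $(|h_{1}(u)|,|h_{2}(u)|)$ is a permutation of $(|u_{1}|,|u_{2}|)$.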
Significance and remarks on Theorem 2. Theorem 2 shows that if we use the IC-ERM principle, i.e., constrain the representations to be independent, then we continue to recover the latents even if the number of tasks is small. We can show that the above theorem also extends to binary classification. We admit that strong assumptions were made to arrive at the above result (though some, such as the bound on $p$ growing linearly in $d$, seem necessary), but we remind the reader that we are operating in the extreme single-task regime. In the previous section, when the number of tasks was equal to the dimension of the latent (i.e., when we have sufficiently many tasks), we showed the success of the IC-ERM objective (4) in identifying the latent variables (Theorem 1) under much weaker assumptions. In contrast, Theorem 2 shows that with more assumptions on the distribution we can guarantee latent recovery with even one task. If we are in the middle, i.e., when the number of tasks is between one and the dimension of the latent, then Theorem 2 only requires the assumptions (Assumptions 3, 4, 5) to hold for at least one task.
Note on the case $k>d$: We did not discuss the case when the number of tasks is greater than the dimension of the latent variables. This is because we can select a subset $S^{\prime}$ of tasks such that $|S^{\prime}|=d$ and then proceed as in Theorem 1. This situation arises commonly in the linear ICA literature, and selecting a subset of tasks is standard practice.

5 ERM-ICA: Proposed implementation for independence constrained ERM

In the previous section, we showed identification guarantees for the IC-ERM objective. However, solving this objective is non-trivial, since we need to enforce independence on the learnt representations. We propose a simple two-step procedure as an approximate approach to solving the above problem. The learner first carries out standard ERM, stated as

\Theta^{\dagger},\Phi^{\dagger}\in\operatorname*{arg\,min}_{\Theta\in\mathcal{H}_{\Theta},\Phi\in\mathcal{H}_{\Phi}}R(\Theta\circ\Phi) \quad (7)

The learner then searches for a linear transformation $\Omega$ that, when applied to $\Phi^{\dagger}$, yields a new representation with mutually independent components. We state this as follows: find an invertible $\Omega\in\mathbb{R}^{d\times d}$ such that

Z^{\dagger}=\Omega\circ\Phi^{\dagger}(X)\;\text{where the components of}\;Z^{\dagger}\;\text{are mutually independent} \quad (8)

Note that a solution to equation (8) does not always exist. However, if we can find an $\Omega$ that satisfies (8), then the classifier $\Theta^{\dagger}\circ\Omega^{-1}$ and the representation $\Omega\circ\Phi^{\dagger}(X)$ together solve IC-ERM (4), assuming $\Theta^{\dagger}\circ\Omega^{-1}\in\mathcal{H}_{\Theta}$ and $\Omega\circ\Phi^{\dagger}\in\mathcal{H}_{\Phi}$. To find a solution to equation (8), we resort to the approach of linear ICA (Comon, 1994), which has two steps. We first whiten $\Phi^{\dagger}(X)$. (For simplicity, we assume $\Phi^{\dagger}(X)$ is zero mean, although our analysis extends to the non-zero mean case as well; we also assume $\Phi^{\dagger}(X)$ has finite second moments.) Define $V$ to be the covariance matrix of $\Phi^{\dagger}(X)$. If the covariance $V$ is invertible (if it is not, we first project $\Phi^{\dagger}(X)$ onto the range space of $V$), then its eigendecomposition is given as $V=U\Lambda^{2}U^{\mathsf{T}}$. We obtain the whitened data as $\Phi^{*}(X)=\Lambda^{-1}U^{\mathsf{T}}\Phi^{\dagger}(X)$. Consider a linear transformation of the whitened data, denoted $Z^{*}=\Omega\circ\Phi^{*}(X)$, and construct another vector $Z^{\prime}$ whose components are all independent and equal in distribution to the corresponding components of $Z^{*}$. Our goal is to find an $\Omega$ such that the mutual information between $Z^{*}$ and $Z^{\prime}$ is minimized. To make the dependence on $\Omega$ explicit, we denote this mutual information as $I(\Omega\circ\Phi^{*}(X))$. We state this as the following optimization:

\Omega^{\dagger}\in\operatorname*{arg\,min}_{\Omega,\;\Omega\;\text{invertible}}I(\Omega\circ\Phi^{*}(X)) \quad (9)
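The whitening step described above can be written in a few lines. The sketch below assumes the representation samples are stored row-wise in a NumPy array rep (a name we introduce for illustration).

```python
# Sketch of the whitening step: eigendecompose the covariance of the
# representation and rescale, i.e., Phi_star(X) = Lambda^{-1} U^T Phi_dagger(X).
import numpy as np

def whiten(rep, eps=1e-8):
    rep = rep - rep.mean(axis=0)              # center (zero-mean assumption)
    V = np.cov(rep, rowvar=False)             # covariance matrix V
    lam2, U = np.linalg.eigh(V)               # V = U Lambda^2 U^T
    inv_lam = 1.0 / np.sqrt(np.maximum(lam2, eps))
    return (rep @ U) * inv_lam                # rows are the whitened samples
```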

We denote the above two step approximation method as ERM-ICA and summarize it below:

  • ERM Phase: Learn $\Theta^{\dagger},\Phi^{\dagger}$ by solving the ERM objective (7).

  • ICA Phase: Learn $\Omega^{\dagger}$ by applying linear ICA (9) to the representation $\Phi^{\dagger}$ from the ERM phase.
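The following is a minimal sketch of the two phases. The network architecture, optimizer settings, and the use of scikit-learn's FastICA (whose contrast function approximates the mutual information criterion in (9)) are illustrative assumptions; the exact experimental setup is in the authors' code repository.

```python
# Sketch of ERM-ICA: (1) fit Theta o Phi by ERM, (2) linear ICA on Phi's output.
# Architecture and hyperparameters here are illustrative assumptions.
import torch
import torch.nn as nn
from sklearn.decomposition import FastICA

def erm_ica(X, Y, d, epochs=200, lr=1e-2):
    # X: (n, obs_dim) float tensor, Y: (n, k) float tensor of regression labels.
    phi = nn.Sequential(nn.Linear(X.shape[1], d), nn.LeakyReLU(), nn.Linear(d, d))
    theta = nn.Linear(d, Y.shape[1], bias=False)
    opt = torch.optim.Adam(list(phi.parameters()) + list(theta.parameters()), lr=lr)
    for _ in range(epochs):                      # Phase 1 (ERM), square loss
        opt.zero_grad()
        loss = ((Y - theta(phi(X))) ** 2).mean()
        loss.backward()
        opt.step()
    with torch.no_grad():                        # Phase 2 (ICA) on the representation
        rep = phi(X).numpy()
    ica = FastICA(n_components=d)                # whitening is performed internally
    z_hat = ica.fit_transform(rep)               # estimate of Z up to permutation/scaling
    return phi, theta, ica, z_hat
```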

The ERM-ICA procedure outputs a classifier $\Theta^{\dagger}\circ(\Omega^{\dagger})^{-1}$ and a representation $\Omega^{\dagger}\circ\Phi^{\dagger}$ that together form an approximate solution to the IC-ERM problem (4). While the proposed ERM-ICA procedure is a simple approximation, we are not aware of other works that have investigated this approach theoretically and experimentally for recovering the latents. Despite its simplicity, we can prove (similarly to Theorem 1) that when the number of tasks is equal to the dimension of the latent variable, ERM-ICA recovers the latent variables up to permutation and scaling.

Theorem 3.

If Assumptions 1 and 2 hold and the number of tasks $k$ is equal to the dimension of the latent $d$, then the solution $\Omega^{\dagger}\circ\Phi^{\dagger}$ to ERM-ICA ((7), (9)), with $\ell$ the square loss for regression and the cross-entropy loss for classification, identifies the true $Z$ up to permutation and scaling.

The proof of the above theorem can be found in Appendix B. We leave the theoretical analysis of the ERM-ICA approach for the single-task case to future work, but we do show its empirical performance in such scenarios in the evaluation section ahead. We believe that building better approximations that directly solve IC-ERM, without a two-step approach like ERM-ICA, is fruitful future work.

6 Evaluation

6.1 Experiment Setup

6.1.1 Data generation process

Regression. We use the data generation process described in Assumption 1. The components of $Z$ are i.i.d. and follow the discrete uniform distribution over $\{0,1\}$. Each element of the task coefficient matrix $\Gamma$ is i.i.d. and follows a standard normal distribution. The noise in the label generation is also standard normal. We use a 2-layer invertible MLP to model $g$, following the construction used in Zimmermann et al. (2021) (https://github.com/brendel-group/cl-ica/blob/master/main_mlp.py). We carry out comparisons for three settings, $d\in\{16,24,50\}$, and vary the number of tasks over $k\in\{\frac{d}{2},\frac{3d}{4},d\}$. The training and test sets each contain 5000 data points, along with a validation set of 1250 data points for hyperparameter tuning.
Classification. We use the data generation process described in Assumption 1, with the same parameters and dataset splits as regression, except that the labels are binary and sampled as $Y\leftarrow\mathsf{Bernoulli}(\sigma(\Gamma Z))$. Also, the standard deviation used when sampling $\Gamma$ is set to a higher value (10 times that of a standard normal), as otherwise the Bayes optimal accuracy is much smaller.

6.1.2 Methods, architecture, and other hyperparameters

We compare our method against two natural baselines. a) ERM: we carry out standard ERM (7) and use the representation learned at the layer before the output layer. b) ERM-PCA: we carry out standard ERM (7), extract the representation learned at the layer before the output layer, and transform it using principal component analysis (PCA); in other words, we analyze the representation in the eigenbasis of its covariance matrix. c) ERM-ICA: this is the main approach ((7), (9)) that approximates the IC-ERM objective (4); we take the representations learnt using ERM (7) and transform them using linear independent component analysis (ICA) (9).

We use mean square error and cross-entropy as the loss objectives for the regression and classification tasks respectively. In both settings, we use a two layer fully connected neural network and train the model using stochastic gradient descent. The architecture and the hyperparameter details for the different settings are provided in Appendix D. The code repository is available at https://github.com/divyat09/ood_identification.

Figure 2: Comparison of label and latent prediction performance (regression, $d=16$).
Figure 3: Comparison of label and latent prediction performance (regression, $d=50$).

6.1.3 Metrics

The models are evaluated on the test dataset using the following two metrics.

\bullet Label ($Y$) prediction performance. We use the average $R^{2}$ (coefficient of determination) across tasks for regression and the average accuracy across tasks for classification. For ERM-PCA and ERM-ICA, we take the final representations learnt by these methods and train a linear/logistic regression model to predict the label $Y$ for the regression/classification task. We check this metric to ensure that the representations do not lose any information about the label.

\bullet Latent ($Z$) prediction performance. We use the mean correlation coefficient (MCC), a standard metric for measuring permutation and scaling based identification (see Hyvarinen and Morioka (2017); Zimmermann et al. (2021) for further details). The metric is computed by first obtaining the correlation matrix $\rho(Z,\hat{Z})$ between the recovered latents $\hat{Z}$ and the true latents $Z$. Let $|\rho(Z,\hat{Z})|$ denote the matrix of absolute correlations. We then find a matching (assigning each row of $|\rho(Z,\hat{Z})|$ to a column) that maximizes the average absolute correlation, and we return this optimal average absolute correlation. Intuitively, we find the optimal way to match the components of the predicted latent representation $\hat{Z}$ with the components of the true representation $Z$. Notice that a perfect absolute correlation of one for each matched pair of components would imply identification up to permutations.
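The following is a minimal sketch of this metric (the variable names are ours); the matching is the standard linear assignment problem.

```python
# Sketch of the MCC metric: match predicted and true latent components so as
# to maximize the mean absolute correlation.
import numpy as np
from scipy.optimize import linear_sum_assignment

def mean_corr_coef(z_true, z_hat):
    d = z_true.shape[1]
    # absolute correlation between every (true, predicted) component pair
    abs_rho = np.abs(np.corrcoef(z_true.T, z_hat.T)[:d, d:])
    # matching that maximizes the average |correlation| (negate for a min-cost solver)
    row, col = linear_sum_assignment(-abs_rho)
    return abs_rho[row, col].mean()
```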

6.2 Results

Regression. Figures 2 and 3 compare the performance of the three approaches for $d=16$ and $d=50$ across varying numbers of tasks; the results for $d=24$ are in Appendix E.1 (due to space limits). In both cases, we find that ERM-ICA is significantly better than the other approaches at achieving permutation and scaling based identification, while all three approaches have similar label prediction performance. We observe a similar trend for the case of $d=24$ shown in Appendix E.1.
Classification. Figures 4 and 5 compare the performance of the three approaches for $d=16$ and $d=24$ across varying numbers of tasks. In both cases, we find that ERM-ICA is better than the other approaches at achieving permutation and scaling based identification, except in the case of 24 data dimensions with a total of 12 tasks. All three approaches have similar label prediction performance. For classification, unlike regression, the improvements are smaller and we do not see an improvement for $d=50$ (comparisons for $d=50$ are in Appendix E.2). We see worse performance in the classification setting because, unlike in regression, the ERM model does not learn the Bayes optimal predictor.
Discussion. We have shown that ERM-ICA achieves a significant improvement in latent recovery with far fewer tasks (down to $\frac{d}{2}$). In our theory, however, we proved that under certain assumptions solving the reparametrized IC-ERM objective (6) can achieve identification even with a single task. Note that ERM-ICA only approximates objective (6); if we build better approximations of the ideal approach (IC-ERM), then we may see gains with even fewer tasks. We believe that building such approximations is fruitful future work.

Figure 4: Comparison of label and latent prediction performance (classification, $d=16$).
Figure 5: Comparison of label and latent prediction performance (classification, $d=24$).

7 Conclusion

In this work, we analyzed the problem of disentanglement in a natural setting where the latent factors cause the labels, a setting not well studied in the ICA literature. We show that if ERM is constrained to learn independent representations, then the learnt representations can recover the latents even when the number of tasks is small. We propose a simple two-step approximate procedure (ERM-ICA) to solve the constrained ERM problem and show that it is effective in a variety of experiments. Our analysis highlights the importance of learning independent representations and motivates the development of further approaches to achieve this in practice.

Acknowledgments

We thank Dimitris Koukoulopoulos for discussions regarding the proof of identification under a single task. Kartik Ahuja acknowledges the support provided by the IVADO postdoctoral fellowship funding program. Ioannis Mitliagkas acknowledges support from an NSERC Discovery grant (RGPIN-2019-06512), a Samsung grant, a Canada CIFAR AI chair, and an MSR collaborative research grant.

References

  • Arjovsky et al. (2019) Martin Arjovsky, Léon Bottou, Ishaan Gulrajani, and David Lopez-Paz. Invariant risk minimization. arXiv preprint arXiv:1907.02893, 2019.
  • Beery et al. (2018) Sara Beery, Grant Van Horn, and Pietro Perona. Recognition in terra incognita. In Proceedings of the European conference on computer vision (ECCV), pages 456–473, 2018.
  • Bengio et al. (2013) Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence, 35(8):1798–1828, 2013.
  • Brouillard et al. (2020) Philippe Brouillard, Sébastien Lachapelle, Alexandre Lacoste, Simon Lacoste-Julien, and Alexandre Drouin. Differentiable causal discovery from interventional data. arXiv preprint arXiv:2007.01754, 2020.
  • Brown et al. (2020) Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020.
  • Caruana (1997) Rich Caruana. Multitask learning. Machine learning, 28(1):41–75, 1997.
  • Comon (1994) Pierre Comon. Independent component analysis, a new concept? Signal processing, 36(3):287–314, 1994.
  • Darmois (1953) George Darmois. General analysis of stochastic bonds: special study of linear factorial analysis. Journal of the International Statistical Institute, pages 2–8, 1953.
  • Davies (2004) Mike Davies. Identifiability issues in noisy ica. IEEE Signal processing letters, 11(5):470–473, 2004.
  • Feller (2008) William Feller. An introduction to probability theory and its applications, vol 2. John Wiley & Sons, 2008.
  • Geirhos et al. (2020) Robert Geirhos, Jörn-Henrik Jacobsen, Claudio Michaelis, Richard Zemel, Wieland Brendel, Matthias Bethge, and Felix A Wichmann. Shortcut learning in deep neural networks. Nature Machine Intelligence, 2(11):665–673, 2020.
  • Hyvarinen (1999) Aapo Hyvarinen. Gaussian moments for noisy independent component analysis. IEEE signal processing letters, 6(6):145–147, 1999.
  • Hyvarinen and Morioka (2016) Aapo Hyvarinen and Hiroshi Morioka. Unsupervised feature extraction by time-contrastive learning and nonlinear ica. Advances in Neural Information Processing Systems, 29:3765–3773, 2016.
  • Hyvarinen and Morioka (2017) Aapo Hyvarinen and Hiroshi Morioka. Nonlinear ica of temporally dependent stationary sources. In Artificial Intelligence and Statistics, pages 460–469. PMLR, 2017.
  • Hyvärinen and Pajunen (1999) Aapo Hyvärinen and Petteri Pajunen. Nonlinear independent component analysis: Existence and uniqueness results. Neural networks, 12(3):429–439, 1999.
  • Hyvarinen et al. (2019) Aapo Hyvarinen, Hiroaki Sasaki, and Richard Turner. Nonlinear ica using auxiliary variables and generalized contrastive learning. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 859–868. PMLR, 2019.
  • Ke et al. (2019) Nan Rosemary Ke, Olexa Bilaniuk, Anirudh Goyal, Stefan Bauer, Hugo Larochelle, Bernhard Schölkopf, Michael C Mozer, Chris Pal, and Yoshua Bengio. Learning neural causal models from unknown interventions. arXiv preprint arXiv:1910.01075, 2019.
  • Khemakhem et al. (2020a) Ilyes Khemakhem, Diederik Kingma, Ricardo Monti, and Aapo Hyvarinen. Variational autoencoders and nonlinear ica: A unifying framework. In International Conference on Artificial Intelligence and Statistics, pages 2207–2217. PMLR, 2020a.
  • Khemakhem et al. (2020b) Ilyes Khemakhem, Ricardo Pio Monti, Diederik P Kingma, and Aapo Hyvärinen. Ice-beem: Identifiable conditional energy-based deep models based on nonlinear ica. arXiv preprint arXiv:2002.11537, 2020b.
  • Locatello et al. (2019) Francesco Locatello, Stefan Bauer, Mario Lucic, Gunnar Raetsch, Sylvain Gelly, Bernhard Schölkopf, and Olivier Bachem. Challenging common assumptions in the unsupervised learning of disentangled representations. In international conference on machine learning, pages 4114–4124. PMLR, 2019.
  • Locatello et al. (2020) Francesco Locatello, Ben Poole, Gunnar Rätsch, Bernhard Schölkopf, Olivier Bachem, and Michael Tschannen. Weakly-supervised disentanglement without compromises. In International Conference on Machine Learning, pages 6348–6359. PMLR, 2020.
  • Mityagin (2015) Boris Mityagin. The zero set of a real analytic function. arXiv preprint arXiv:1512.07276, 2015.
  • Pearl (2009) Judea Pearl. Causality. Cambridge university press, 2009.
  • Peyrard et al. (2021) Maxime Peyrard, Sarvjeet Singh Ghotra, Martin Josifoski, Vidhan Agarwal, Barun Patra, Dean Carignan, Emre Kiciman, and Robert West. Invariant language modeling. arXiv preprint arXiv:2110.08413, 2021.
  • Radford et al. (2021) Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020, 2021.
  • Roeder et al. (2020) Geoffrey Roeder, Luke Metz, and Diederik P Kingma. On linear identifiability of learned representations. arXiv preprint arXiv:2007.00810, 2020.
  • Schölkopf (2019) Bernhard Schölkopf. Causality for machine learning. arXiv preprint arXiv:1911.10500, 2019.
  • Schölkopf et al. (2012) Bernhard Schölkopf, Dominik Janzing, Jonas Peters, Eleni Sgouritsa, Kun Zhang, and Joris Mooij. On causal and anticausal learning. arXiv preprint arXiv:1206.6471, 2012.
  • Schölkopf et al. (2021) Bernhard Schölkopf, Francesco Locatello, Stefan Bauer, Nan Rosemary Ke, Nal Kalchbrenner, Anirudh Goyal, and Yoshua Bengio. Toward causal representation learning. Proceedings of the IEEE, 109(5):612–634, 2021.
  • Shu et al. (2019) Rui Shu, Yining Chen, Abhishek Kumar, Stefano Ermon, and Ben Poole. Weakly supervised disentanglement with guarantees. arXiv preprint arXiv:1910.09772, 2019.
  • Vapnik (1992) Vladimir Vapnik. Principles of risk minimization for learning theory. In Advances in neural information processing systems, pages 831–838, 1992.
  • von Kügelgen et al. (2021) Julius von Kügelgen, Yash Sharma, Luigi Gresele, Wieland Brendel, Bernhard Schölkopf, Michel Besserve, and Francesco Locatello. Self-supervised learning with data augmentations provably isolates content from style. arXiv preprint arXiv:2106.04619, 2021.
  • Wang and Jordan (2021) Yixin Wang and Michael I Jordan. Desiderata for representation learning: A causal perspective. arXiv preprint arXiv:2109.03795, 2021.
  • Wei et al. (2021) Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652, 2021.
  • Zhang and Yang (2017) Yu Zhang and Qiang Yang. A survey on multi-task learning. arXiv preprint arXiv:1707.08114, 2017.
  • Zimmermann et al. (2021) Roland S Zimmermann, Yash Sharma, Steffen Schneider, Matthias Bethge, and Wieland Brendel. Contrastive learning inverts the data generating process. arXiv preprint arXiv:2102.08850, 2021.

Appendix A Proof of Theorem 1: IC-ERM for the case k=d

Theorem 1.

If Assumptions 1 and 2 hold and the number of tasks $k$ is equal to the dimension of the latent $d$, then the solution $\Theta^{\dagger}\circ\Phi^{\dagger}$ to IC-ERM (4), with $\ell$ the square loss for regression and the cross-entropy loss for classification, identifies the true $Z$ up to permutation and scaling.

Proof 1.

Consider the regression setting for the data generation process in Assumption 1. Using $X\leftarrow g(Z)$ and $Y\leftarrow\Gamma Z+N$ (writing $N$ for $N_{Y}$), the risk of a predictor $f$ can be written as follows:

\begin{split}R(f)&=\mathbb{E}\big[\|Y-f(X)\|^{2}\big]\\ &=\mathbb{E}\big[\|\Gamma Z+N-f\circ g(Z)\|^{2}\big]\\ &=\mathbb{E}\Big[\|\Gamma Z-f\circ g(Z)\|^{2}\Big]+\mathbb{E}\big[\|N\|^{2}\big]+2\,\mathbb{E}\big[(\Gamma Z-f\circ g(Z))^{\mathsf{T}}N\big]\\ &=\mathbb{E}\Big[\|\Gamma Z-f\circ g(Z)\|^{2}\Big]+\mathbb{E}\big[\|N\|^{2}\big]\;\;\;\;\;\;(\text{since}\;Z\perp N\;\text{and}\;\mathbb{E}[N]=0)\end{split} \quad (10)

From the above it is clear that $R(f)\geq\mathbb{E}\big[\|N\|^{2}\big]$ for all functions $f:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d}$. Since $g^{-1}\in\mathcal{H}_{\Phi}$, $\Gamma\in\mathcal{H}_{\Theta}$, and $g^{-1}(X)$ has all mutually independent components, $\Gamma\circ g^{-1}$ satisfies the constraints in IC-ERM (4) and also achieves the lowest error possible, i.e., $R(\Gamma\circ g^{-1})=\mathbb{E}[\|N\|^{2}]$. Consider any solution to the constrained ERM in (4). The solution must satisfy the following equality except over a set of measure zero.

\begin{split}&\Theta^{\dagger}\circ\Phi^{\dagger}(X)=\Gamma Z\\ \implies\;&\Phi^{\dagger}(X)=(\Theta^{\dagger})^{-1}\Gamma Z\end{split} \quad (11)

Let us write $\Phi^{\dagger}(X)=Z^{\dagger}$ and $A=(\Theta^{\dagger})^{-1}\Gamma$. The above equality then becomes $Z^{\dagger}=AZ$, where all the components of $Z^{\dagger}$ are independent (from (4)) and all the components of $Z$ are independent (Assumption 1). We now argue that the matrix $A$ can be written as a permutation matrix times a scaling matrix. We first show that each column of $A$ has exactly one non-zero element. Consider column $k$ of $A$, denoted $[A]_{k}$. Since $A$ is invertible, not all elements of the column can be zero. Now suppose at least two elements $i$ and $j$ of $[A]_{k}$ are non-zero, and consider the corresponding components of $Z^{\dagger}$. Since $Z^{\dagger}_{i}$ and $Z^{\dagger}_{j}$ are independent and since $[A]_{ik}$ and $[A]_{jk}$ are both non-zero, it follows from Darmois' theorem (Darmois, 1953) that $Z_{k}$ is a Gaussian random variable. This is a contradiction, as we assumed that none of the random variables in $Z$ follows a Gaussian distribution. Therefore, exactly one element in $[A]_{k}$ is non-zero, and the same holds for every column of $A$. No two columns can have their non-zero entry in the same row, or $A$ would not be invertible. Therefore, $A$ can be expressed as a permutation matrix times a scaling matrix, where the scaling accounts for the value of the non-zero entry and the permutation accounts for its position. This completes the proof.

Appendix B Proof of Theorem 3: ERM-ICA for the case k=d

Theorem 3.

If Assumptions 1 and 2 hold and the number of tasks $k$ is equal to the dimension of the latent $d$, then the solution $\Omega^{\dagger}\circ\Phi^{\dagger}$ to ERM-ICA ((7), (9)), with $\ell$ the square loss for regression and the cross-entropy loss for classification, identifies the true $Z$ up to permutation and scaling.

Proof 3.

Although the initial half of the proof is identical to the proof of Theorem 1, we repeat it for clarity. Consider the regression setting for the data generation process in Assumption 1. Using $X\leftarrow g(Z)$ and $Y\leftarrow\Gamma Z+N$ (writing $N$ for $N_{Y}$), the risk of a predictor $f$ can be written as follows:

\begin{split}
R(f)&=\mathbb{E}\big[\|Y-f(X)\|^{2}\big]\\
&=\mathbb{E}\big[\|\Gamma Z+N-f\circ g(Z)\|^{2}\big]\\
&=\mathbb{E}\big[\|\Gamma Z-f\circ g(Z)\|^{2}\big]+\mathbb{E}\big[\|N\|^{2}\big]+2\,\mathbb{E}\big[(\Gamma Z-f\circ g(Z))^{\mathsf{T}}N\big]\\
&=\mathbb{E}\big[\|\Gamma Z-f\circ g(Z)\|^{2}\big]+\mathbb{E}\big[\|N\|^{2}\big]\qquad(\text{since }Z\perp N\text{ and }\mathbb{E}[N]=0)
\end{split} (12)

From the above it is clear that $R(f)\geq\mathbb{E}\big[\|N\|^{2}\big]$ for all functions $f:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d}$. Since $g^{-1}\in\mathcal{H}_{\Phi}$ and $\Gamma\in\mathcal{H}_{\Theta}$, the composition $\Gamma\circ g^{-1}$ is a feasible solution for ERM (7) and achieves the lowest possible error, i.e., $R(\Gamma\circ g^{-1})=\mathbb{E}[\|N\|^{2}]$.
Consider any solution to ERM in (7). The solution must satisfy the following equality except over a set of measure zero.

\begin{split}
&\Theta^{\dagger}\circ\Phi^{\dagger}(X)=\Gamma Z\\
\implies&\Phi^{\dagger}(X)=(\Theta^{\dagger})^{-1}\Gamma Z
\end{split} (13)

Since $\Phi^{\dagger}(X)$ is a linear combination of independent latents, at least one of which is non-Gaussian, and the latents have a finite second moment, we can use the result from (Comon, 1994): the $\Omega^{\dagger}$ that solves equation (9) relates to $\Theta^{\dagger}$ and $\Gamma$, for some permutation matrix $P$ and invertible diagonal (scaling) matrix $\Lambda$, as follows

(\Omega^{\dagger})^{-1}=(\Theta^{\dagger})^{-1}\Gamma P\Lambda\implies\Omega^{\dagger}=\Lambda^{-1}P^{-1}\Gamma^{-1}\Theta^{\dagger} (14)

Substituting the above into (13), we get $\Omega^{\dagger}\Phi^{\dagger}(X)=\Lambda^{-1}P^{-1}\Gamma^{-1}\Theta^{\dagger}(\Theta^{\dagger})^{-1}\Gamma Z=\Lambda^{-1}P^{-1}Z$, i.e., $\Omega^{\dagger}\circ\Phi^{\dagger}$ recovers $Z$ up to permutation and scaling. This completes the proof.

We can also carry out the same proof for the multi-task classification case. In multi-task classification we can write the condition for optimality as

\sigma\Big(\Theta^{\dagger}\Phi^{\dagger}(X)\Big)=\sigma\Big(\Gamma Z\Big) (15)

where the sigmoid $\sigma$ is applied separately to each element. Since the sigmoid is strictly increasing, equality of the sigmoids implies equality of the individual elements. Therefore, $\Theta^{\dagger}\Phi^{\dagger}(X)=\Gamma Z$, and we can use the same analysis as in the regression case from this point on. Note that we can equate the sigmoids in the first place because the LHS corresponds to $\hat{P}(Y|X)$ and the RHS corresponds to the true $P(Y|X)$.

Appendix C Proof of Theorem 2: IC-ERM for the case k=1

Theorem 2.

If Assumptions 3, 4, 5 hold, $(g^{\prime})^{-1}\in\mathcal{H}_{\Phi}$, and $p$ is sufficiently large ($p\geq p_{\mathsf{min}}$), then the solution $\Phi^{\dagger}(X)$ of the reparametrized IC-ERM objective (6) recovers the true latent $U$ in the data generation process in Assumption 3 up to permutations.

Proof 2.

We write $\Phi^{\dagger}(X)=[\Phi^{\dagger}_{1}(X),\cdots,\Phi_{d}^{\dagger}(X)]$. We set $V_{i}=\Phi^{\dagger}_{i}(X)$ and define the vector $V=[V_{1},\cdots,V_{d}]=[\Phi^{\dagger}_{1}(X),\cdots,\Phi^{\dagger}_{d}(X)]$.
Observe that since $(g^{\prime})^{-1}\in\mathcal{H}_{\Phi}$, we can use the same argument as in the proof of Theorem 1 to conclude that $(g^{\prime})^{-1}$ is a valid solution of the reparametrized IC-ERM objective (Eq. (6)). Notice that the terms $\Theta^{\dagger}$ and $\Gamma$ that appear in the proof of Theorem 1 are equal to the identity here, due to Assumption 3 and IC-ERM (6). Therefore, following the argument of Theorem 1, any solution $\Phi^{\dagger}(X)$ of the reparametrized IC-ERM (6) satisfies the following:

\begin{split}
&\sum_{i}\Phi^{\dagger}_{i}(X)=\sum_{i}(g^{\prime})^{-1}_{i}(X)\\
&\sum_{i}V_{i}=\sum_{i}U_{i}
\end{split} (16)

We write the moment generating function of a random variable $U_{i}$ as $M_{U_{i}}(t)=\mathbb{E}[e^{tU_{i}}]$. Taking moment generating functions on both sides (and using the mutual independence of the components of $V$ and of $U$), we obtain the following identity.

\sum_{i}V_{i}=\sum_{i}U_{i}\implies\prod_{i}M_{V_{i}}(t)=\prod_{i}M_{U_{i}}(t) (17)

Since $U_{i}\,{\buildrel d\over{=}}\,U_{j}$ and $V_{i}\,{\buildrel d\over{=}}\,V_{j}$, we have $M_{U_{i}}(t)=M_{U_{j}}(t)$ and $M_{V_{i}}(t)=M_{V_{j}}(t)$ for all $t\in\mathbb{R}$, and we can simplify as follows:

\begin{split}
\Big(M_{V_{i}}(t)\Big)^{d}=\Big(M_{U_{i}}(t)\Big)^{d}&\implies M_{V_{i}}(t)=M_{U_{i}}(t)\\
&\implies V_{i}\,{\buildrel d\over{=}}\,U_{i},\;\forall i\in\{1,\cdots,d\}
\end{split} (18)

In the first implication above, we use the fact that the moment generating function is positive, so the $d$-th root is well defined. In the second implication, we use the fact that if moment generating functions exist and are equal, then the random variables are equal in distribution (Feller, 2008). Having established that the distributions are equal, we now show that the random variables are equal up to permutations.

Since the vector $V$ is an invertible transform $h$ of $U$, i.e., $V=h(U)$, we can write the pdf of $U$ in terms of that of $V$ as follows.

\prod_{i}p(u_{i})=\prod_{i}p(v_{i})\,\big|\mathsf{det}(\mathsf{J}(h(u)))\big|, (19)

where $p$ denotes the pdf of $U_{i}$ (recall that the pdfs of all components are the same). Taking logs on both sides of the above equation gives:

\sum_{i}\log(p(u_{i}))=\sum_{i}\log(p(v_{i}))+\log\Big(\big|\mathsf{det}(\mathsf{J}(h(u)))\big|\Big) (20)

From Assumption 4, the log-pdf is a polynomial, $\log(p(u))=\sum_{k=0}^{p}a_{k}u^{k}$. From Assumption 5, we can express $\log\big(\big|\mathsf{det}(\mathsf{J}(h(u)))\big|\big)=\sum_{m}b_{m}\prod_{i}u_{i}^{\theta_{m}(i)}$. Substituting these polynomials into the above equation yields:

\begin{split}
&\sum_{i}\log(p(u_{i}))-\sum_{i}\log(p(v_{i}))-\log\Big(\big|\mathsf{det}(\mathsf{J}(h(u)))\big|\Big)=0\\
\implies&\sum_{k=1}^{p}a_{k}\sum_{i=1}^{d}(u_{i}^{k}-v_{i}^{k})-\sum_{m}b_{m}\prod_{i}u_{i}^{\theta_{m}(i)}=0
\end{split} (21)

In the proof, we first compare the largest absolute value among the $u$'s with the largest absolute value among the $v$'s. Without loss of generality, assume that $|u_{j}|>|u_{i}|$ for all $i\not=j$ ($u_{j}$ has the largest absolute value among the $u$'s) and $|v_{r}|>|v_{i}|$ for all $i\not=r$ ($v_{r}$ has the largest absolute value among the $v$'s). Consider the setting when $|u_{j}|\geq p^{2}>1$ (such points exist in the support because of Assumption 4). We can write $|u_{j}|=\alpha p^{2}$, where $\alpha\geq 1$.
There are three cases to further consider:

  • a) $|v_{r}|>|u_{j}|$

  • b) $|v_{r}|<|u_{j}|$

  • c) $|v_{r}|=|u_{j}|$

We start by analyzing case b). Since all the values $|v_{i}|$ and $|u_{i}|$ (for $i\not=j$) are then strictly less than $|u_{j}|$, we have the following:

  • $\exists\;c<1$ such that $|u_{i}|<c\alpha p^{2}$ $\;\forall\;i\not=j$

  • $|v_{i}|<c\alpha p^{2}$ $\;\forall\;i\in\{1,\cdots,d\}$, for the same $c<1$

We divide the identity in equation (21) by $u_{j}^{p}$ and separate the terms so that only the term $a_{p}$ remains on the LHS and the rest of the terms are moved to the RHS. Further simplification of the identity is shown below.

\begin{split}
|a_{p}|&=\Big|a_{p}\sum_{i=1}^{d-1}\Big(\frac{u_{i}^{p}}{u_{j}^{p}}-\frac{v_{i}^{p}}{u_{j}^{p}}\Big)+\sum_{k=1}^{p-1}a_{k}\sum_{i=1}^{d}\Big(\frac{u_{i}^{k}}{u_{j}^{p}}-\frac{v_{i}^{k}}{u_{j}^{p}}\Big)-\sum_{m}b_{m}\frac{1}{u_{j}^{p}}\prod_{i}u_{i}^{\theta_{m}(i)}\Big|\\
\implies|a_{p}|&\leq\Big|a_{p}\sum_{i=1}^{d-1}\Big(\frac{u_{i}^{p}}{u_{j}^{p}}-\frac{v_{i}^{p}}{u_{j}^{p}}\Big)\Big|+\Big|\sum_{k=1}^{p-1}a_{k}\sum_{i=1}^{d}\Big(\frac{u_{i}^{k}}{u_{j}^{p}}-\frac{v_{i}^{k}}{u_{j}^{p}}\Big)\Big|+\Big|\sum_{m}b_{m}\frac{1}{u_{j}^{p}}\prod_{i}u_{i}^{\theta_{m}(i)}\Big|\\
\implies 1&\leq\frac{1}{|a_{p}|}\Bigg(\Big|a_{p}\sum_{i=1}^{d-1}\Big(\frac{u_{i}^{p}}{u_{j}^{p}}-\frac{v_{i}^{p}}{u_{j}^{p}}\Big)\Big|+\Big|\sum_{k=1}^{p-1}a_{k}\sum_{i=1}^{d}\Big(\frac{u_{i}^{k}}{u_{j}^{p}}-\frac{v_{i}^{k}}{u_{j}^{p}}\Big)\Big|+\Big|\sum_{m}b_{m}\frac{1}{u_{j}^{p}}\prod_{i}u_{i}^{\theta_{m}(i)}\Big|\Bigg)
\end{split} (22)

We analyze each of the terms in the RHS separately. The simplification of the first term yields the following expression.

\begin{split}
\frac{1}{|a_{p}|}\Big|a_{p}\sum_{i=1}^{d-1}\Big(\frac{u_{i}^{p}}{u_{j}^{p}}-\frac{v_{i}^{p}}{u_{j}^{p}}\Big)\Big|&\leq\frac{1}{|a_{p}|}|a_{p}|\sum_{i=1}^{d-1}\Big(\Big|\frac{u_{i}^{p}}{u_{j}^{p}}\Big|+\Big|\frac{v_{i}^{p}}{u_{j}^{p}}\Big|\Big)\\
&\leq 2c^{p}(d-1)
\end{split} (23)

The simplification of the second term in the RHS of the last equation in (22) yields the following expression.

\begin{split}
\frac{1}{|a_{p}|}\Big|\sum_{k=1}^{p-1}a_{k}\sum_{i=1}^{d}\Big(\frac{u_{i}^{k}}{u_{j}^{p}}-\frac{v_{i}^{k}}{u_{j}^{p}}\Big)\Big|&\leq\frac{1}{|a_{p}|}\sum_{k=1}^{p-1}|a_{k}|\sum_{i=1}^{d}\Big(\Big|\frac{u_{i}^{k}}{u_{j}^{p}}\Big|+\Big|\frac{v_{i}^{k}}{u_{j}^{p}}\Big|\Big)\\
&\leq\frac{1}{|a_{p}|}\sum_{k=1}^{p-1}2|a_{k}|\sum_{i=1}^{d}\Big|\frac{u_{j}^{k}}{u_{j}^{p}}\Big|\\
&\leq\frac{1}{|a_{p}|}\Big(\sum_{k=1}^{p-1}2|a_{k}|(d-1)\Big)\frac{1}{\alpha p^{2}}\\
&\leq\frac{2a_{\mathsf{max}}(d-1)}{a_{\mathsf{min}}p}
\end{split} (24)

The simplification of the third term in the RHS of the last equation in (22) yields the following expression.

\begin{split}
\frac{1}{|a_{p}|}\Big|\sum_{m}b_{m}\frac{1}{u_{j}^{p}}\prod_{i}u_{i}^{\theta_{m}(i)}\Big|&\leq\frac{1}{|a_{p}|}\sum_{m}|b_{m}|\frac{1}{|u_{j}|^{(p-q)}}\\
&\leq\frac{b_{\mathsf{max}}}{a_{\mathsf{min}}}\frac{n_{\mathsf{poly}}}{(\alpha p^{2})^{(p-q)}}
\end{split} (25)

where $n_{\mathsf{poly}}$ is the number of non-zero terms in the polynomial expansion of the log-determinant. In the above simplification, we used the facts that $\sum_{i}\theta_{m}(i)\leq q$ and $|u_{i}|<|u_{j}|$. Analyzing the RHS in equations (23)-(25), we see that if $p$ becomes sufficiently large, the RHS becomes less than $1$. This contradicts the relationship in equation (22). Therefore, the $|v_{i}|$ cannot all be strictly less than $|u_{j}|$, which rules out case b).
We now derive the bounds on the value of $p$. Assume $p\geq 2q$, i.e., the degree of the log-pdf of each component of $U$ is at least twice the degree of the log-determinant of the Jacobian of $h$.
From equation (23) we get the following bound on $p$:

\begin{split}
&2c^{p}(d-1)\leq\frac{1}{4}\\
\implies&8(d-1)\leq\frac{1}{c^{p}}\\
\implies&\log(8(d-1))\leq p\log\Big(\frac{1}{c}\Big)\\
\implies&p\geq\frac{\log(8(d-1))}{\log(\frac{1}{c})}
\end{split} (26)

From equation (24) we get the following bound on $p$:

\frac{a_{\mathsf{max}}(d-1)}{a_{\mathsf{min}}p}\leq\frac{1}{4}\implies p\geq\frac{4a_{\mathsf{max}}(d-1)}{a_{\mathsf{min}}} (27)

From equation (25) we get the following bound on $p$:

\begin{split}
&\frac{b_{\mathsf{max}}}{|a_{p}|}\frac{n_{\mathsf{poly}}}{(\alpha p^{2})^{(p-q)}}\leq\frac{1}{4}\\
\implies&\log\Big(\frac{4b_{\mathsf{max}}n_{\mathsf{poly}}}{|a_{p}|}\Big)\leq 2(p-q)\log p\\
\implies&p\geq\frac{\log\Big(\frac{4b_{\mathsf{max}}n_{\mathsf{poly}}}{a_{\mathsf{min}}}\Big)}{2}+q
\end{split} (28)

From the above equations (26), (27), and (28), we get that if

p\geq\max\Big\{\frac{\log(8(d-1))}{\log(\frac{1}{c})},\;\frac{4a_{\mathsf{max}}(d-1)}{a_{\mathsf{min}}},\;\frac{\log\Big(\frac{4b_{\mathsf{max}}n_{\mathsf{poly}}}{a_{\mathsf{min}}}\Big)}{2}+q\Big\} (29)

then the sum of the terms on the RHS of equation (22) is at most $\frac{3}{4}$, while the LHS of equation (22) equals $1$, which leads to a contradiction.

From the above expression, we gather that the second and third terms typically dominate in determining the lower bound on $p$. The second term shows that the lower bound increases linearly in the dimension of the latent, and the third term shows that $p$ must exceed $q$ by a margin that grows logarithmically in the number of terms in the polynomial of the log-determinant. A worked example is given below.
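To make the bound in (29) concrete, the following is an illustrative evaluation with hypothetical values of our choosing (not taken from the experiments): $d=10$, $c=\frac{1}{2}$, $a_{\mathsf{max}}=2$, $a_{\mathsf{min}}=1$, $b_{\mathsf{max}}=1$, $n_{\mathsf{poly}}=20$, $q=10$, using natural logarithms.

p\geq\max\Big\{\underbrace{\frac{\log(8\cdot 9)}{\log 2}}_{\approx 6.2},\;\underbrace{\frac{4\cdot 2\cdot 9}{1}}_{=72},\;\underbrace{\frac{\log(4\cdot 1\cdot 20)}{2}+10}_{\approx 12.2}\Big\}=72

In this illustration the linear-in-$d$ term dominates, and the requirement $p\geq 2q$ is also satisfied.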
Let us now consider case a) ($|v_{r}|>|u_{j}|$), which is similar to case b) analyzed above. Since the values $|u_{i}|$ are strictly less than $|u_{j}|$ and $|v_{r}|$, and since $|v_{r}|>|u_{j}|$, there exists a $c<1$ such that $|u_{i}|\leq c\alpha p^{2}$, $|v_{r}|\geq\frac{1}{c}\alpha p^{2}$, and $|v_{i}|\leq c|v_{r}|$ for $i\not=r$. We follow the same steps as in the analysis of case b): we divide equation (21) by $v_{r}^{p}$ and separate the terms so that only the term $a_{p}$ (obtained by dividing $v_{r}^{p}$ by $v_{r}^{p}$) is on the LHS and the rest are on the RHS.

\begin{split}
|a_{p}|&=\Big|a_{p}\sum_{i=1}^{d-1}\Big(\frac{u_{i}^{p}}{v_{r}^{p}}-\frac{v_{i}^{p}}{v_{r}^{p}}\Big)+\sum_{k=1}^{p-1}a_{k}\sum_{i=1}^{d}\Big(\frac{u_{i}^{k}}{v_{r}^{p}}-\frac{v_{i}^{k}}{v_{r}^{p}}\Big)-\sum_{m}b_{m}\frac{1}{v_{r}^{p}}\prod_{i}u_{i}^{\theta_{m}(i)}\Big|\\
\implies|a_{p}|&\leq\Big|a_{p}\sum_{i=1}^{d-1}\Big(\frac{u_{i}^{p}}{v_{r}^{p}}-\frac{v_{i}^{p}}{v_{r}^{p}}\Big)\Big|+\Big|\sum_{k=1}^{p-1}a_{k}\sum_{i=1}^{d}\Big(\frac{u_{i}^{k}}{v_{r}^{p}}-\frac{v_{i}^{k}}{v_{r}^{p}}\Big)\Big|+\Big|\sum_{m}b_{m}\frac{1}{v_{r}^{p}}\prod_{i}u_{i}^{\theta_{m}(i)}\Big|\\
\implies 1&\leq\frac{1}{|a_{p}|}\Bigg(\Big|a_{p}\sum_{i=1}^{d-1}\Big(\frac{u_{i}^{p}}{v_{r}^{p}}-\frac{v_{i}^{p}}{v_{r}^{p}}\Big)\Big|+\Big|\sum_{k=1}^{p-1}a_{k}\sum_{i=1}^{d}\Big(\frac{u_{i}^{k}}{v_{r}^{p}}-\frac{v_{i}^{k}}{v_{r}^{p}}\Big)\Big|+\Big|\sum_{m}b_{m}\frac{1}{v_{r}^{p}}\prod_{i}u_{i}^{\theta_{m}(i)}\Big|\Bigg)
\end{split} (30)

We analyze each of the terms in the RHS in equation (30) separately. The simplification of the first term in the RHS of the above yields the following upper bound.

\begin{split}
\frac{1}{|a_{p}|}\Big|a_{p}\sum_{i=1}^{d-1}\Big(\frac{u_{i}^{p}}{v_{r}^{p}}-\frac{v_{i}^{p}}{v_{r}^{p}}\Big)\Big|&\leq\frac{1}{|a_{p}|}|a_{p}|\sum_{i=1}^{d-1}\Big(\Big|\frac{u_{i}^{p}}{v_{r}^{p}}\Big|+\Big|\frac{v_{i}^{p}}{v_{r}^{p}}\Big|\Big)\\
&\leq 2c^{p}(d-1)
\end{split} (31)

The simplification of the second term in the RHS of equation (30) yields the following upper bound.

\begin{split}
\frac{1}{|a_{p}|}\Big|\sum_{k=1}^{p-1}a_{k}\sum_{i=1}^{d}\Big(\frac{u_{i}^{k}}{v_{r}^{p}}-\frac{v_{i}^{k}}{v_{r}^{p}}\Big)\Big|&\leq\frac{1}{|a_{p}|}\sum_{k=1}^{p-1}|a_{k}|\sum_{i=1}^{d}\Big(\Big|\frac{u_{i}^{k}}{v_{r}^{p}}\Big|+\Big|\frac{v_{i}^{k}}{v_{r}^{p}}\Big|\Big)\\
&\leq\frac{1}{|a_{p}|}\sum_{k=1}^{p-1}2|a_{k}|\sum_{i=1}^{d}\Big|\frac{u_{j}^{k}}{v_{r}^{p}}\Big|\\
&\leq\frac{1}{|a_{p}|}\Big(\sum_{k=1}^{p-1}2|a_{k}|(d-1)\Big)\frac{1}{\alpha p^{2}}\\
&\leq\frac{a_{\mathsf{max}}(d-1)c}{|a_{p}|\alpha p}\\
&\leq\frac{a_{\mathsf{max}}(d-1)}{a_{\mathsf{min}}p}
\end{split} (32)

The simplification of the third term in the RHS of equation (30) yields the following upper bound.

\begin{split}
\frac{1}{|a_{p}|}\Big|\sum_{m}b_{m}\frac{1}{v_{r}^{p}}\prod_{i}u_{i}^{\theta_{m}(i)}\Big|&\leq\frac{1}{|a_{p}|}\sum_{m}|b_{m}|\frac{1}{|v_{r}|^{(p-q)}}\\
&\leq\frac{b_{\mathsf{max}}}{|a_{p}|}\frac{n_{\mathsf{poly}}\,c^{p-q}}{(\alpha p^{2})^{(p-q)}}\\
&\leq\frac{b_{\mathsf{max}}}{|a_{p}|}\frac{n_{\mathsf{poly}}}{(\alpha p^{2})^{(p-q)}}
\end{split} (33)

Analyzing the RHS in equations (31)-(33), we see that if $p$ becomes sufficiently large, then the RHS becomes less than $1$. This contradicts the relationship in equation (30). Therefore, $|v_{r}|$ cannot be strictly larger than $|u_{j}|$, which rules out case a). In fact, from equations (31)-(33) we obtain the same bound on $p$ as in case b).

Thus, the only possibility is case c), i.e., $|v_{r}|=|u_{j}|$, which implies $v_{r}=u_{j}$ or $v_{r}=-u_{j}$. Consider the case when $p$ is odd; then $v_{r}=u_{j}$ is the only option that works (if $v_{r}=-u_{j}$, the leading term $a_{p}(u_{j}^{p}-v_{r}^{p})=2a_{p}u_{j}^{p}$ would again dominate equation (21) and lead to a contradiction). We substitute $v_{r}=u_{j}$ into equation (22) and repeat the same argument for the second highest absolute value, and so on. This leads to the conclusion that for each component of $u$ there is a component of $v$ equal to it. Hence, we have established the relationship $v_{r}=u_{j}$. For another sample in which index $j$ still corresponds to the highest absolute value and which lies in a neighbourhood of $u_{j}$, the relationship $v_{r}=u_{j}$ must continue to hold; if it did not, there would be another component $v_{q}=u_{j}$ with $q\not=r$, which would contradict the continuity of $h$.

Therefore, the relationship $v_{r}=u_{j}$ (the match between index $j$ for $u$ and index $r$ for $v$) holds for a neighbourhood of values of the vector $u$. Since each component of $h$ is analytic and the set of values of $u$ for which the relationship $v_{r}=u_{j}$ holds has positive measure, it follows from (Mityagin, 2015) that this relationship holds on the entire space. We can draw the same conclusion for all the components of $h$ and conclude that $h$ is a permutation map.

Appendix D Implementation Details

D.1 Model Architecture

  • Fully Connected Layer: (Data Dim, 100)

  • BatchNormalization(100)

  • LeakyReLU(0.5)

  • Fully Connected Layer: (100, Data Dim)

  • BatchNormalization(Data Dim)

  • LeakyReLU(0.5)

  • Fully Connected Layer: (Data Dim, Total Tasks)

We consider the part of the network before the final fully connected layer as the representation network, and we use the output of the representation network for training ICA/PCA after the ERM step; a sketch of this architecture is given below.
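The following is a minimal sketch of the architecture listed above, assuming a PyTorch implementation; the class and variable names are ours and not taken from the released code.

import torch.nn as nn

class RepresentationNetwork(nn.Module):
    """Representation network: all layers before the final task head."""
    def __init__(self, data_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(data_dim, 100),
            nn.BatchNorm1d(100),
            nn.LeakyReLU(0.5),
            nn.Linear(100, data_dim),
            nn.BatchNorm1d(data_dim),
            nn.LeakyReLU(0.5),
        )

    def forward(self, x):
        return self.net(x)

class MultiTaskModel(nn.Module):
    """Representation network followed by a linear head over all tasks."""
    def __init__(self, data_dim, total_tasks):
        super().__init__()
        self.representation = RepresentationNetwork(data_dim)
        self.head = nn.Linear(data_dim, total_tasks)

    def forward(self, x):
        phi = self.representation(x)  # features later passed to ICA/PCA
        return self.head(phi), phi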

D.2 Hyperparameters

We use SGD to train all the methods, with the task-specific hyperparameters listed below. In every case, we select the best model over the course of training based on the validation loss, and we use a learning rate scheduler that halves the current learning rate every 50 epochs. For ICA, we use the FastICA solver in sklearn with 30,000 maximum iterations and data whitening; a sketch of this training and ICA pipeline is given after the list below.

  • Regression: Learning Rate: 0.01, Batch Size: 512, Total Epochs: 1000, Weight Decay: 5e-4, Momentum: 0.9

  • Classification: Learning Rate: 0.05, Batch Size: 512, Total Epochs: 200, Weight Decay: 5e-4, Momentum: 0.9
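A minimal sketch of this training and ICA pipeline, shown with the regression hyperparameters and reusing the MultiTaskModel sketch above; data_dim, total_tasks, train_loader, and x_train are schematic placeholders, not names from the released code.

import torch
from torch.optim import SGD
from torch.optim.lr_scheduler import StepLR
from sklearn.decomposition import FastICA

model = MultiTaskModel(data_dim=24, total_tasks=24)  # dimensions are illustrative
optimizer = SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=5e-4)
scheduler = StepLR(optimizer, step_size=50, gamma=0.5)  # halve the LR every 50 epochs

for epoch in range(1000):
    for x, y in train_loader:  # train_loader: placeholder for the training data
        optimizer.zero_grad()
        pred, _ = model(x)
        loss = torch.nn.functional.mse_loss(pred, y)  # square loss for regression
        loss.backward()
        optimizer.step()
    scheduler.step()
    # selection of the best model on the validation loss is omitted for brevity

# ERM-ICA step: fit linear ICA on the learned representation
with torch.no_grad():
    _, phi = model(x_train)  # x_train: placeholder for the training inputs
ica = FastICA(max_iter=30000, whiten="unit-variance")  # data whitening enabled; use whiten=True on older sklearn
z_hat = ica.fit_transform(phi.cpu().numpy())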

Appendix E Additional Experiments

E.1 Task: Regression

Figure 6: Regression Task: Data Dimension 24

E.2 Task: Classification

Figure 7: Classification Task: Data Dimension 50