Identification of Latent Variables From Graphical Model Residuals

Boris Hayete , Fred Gruber ¹¹footnotemark: 1, Anna Decker, Raymond Yan equal contribution

Abstract

Graph-based causal discovery methods aim to capture conditional independencies consistent with the observed data and differentiate causal relationships from indirect or induced ones. Successful construction of graphical models of data depends on the assumption of causal sufficiency: that is, that all confounding variables are measured. When this assumption is not met, learned graphical structures may become arbitrarily incorrect and effects implied by such models may be wrongly attributed, carry the wrong magnitude, or mis-represent direction of correlation. Wide application of graphical models to increasingly less curated "big data" draws renewed attention to the unobserved confounder problem.

We present a novel method that aims to control for the latent space when estimating a DAG by iteratively deriving proxies for the latent space from the residuals of the inferred model. Under mild assumptions, our method improves structural inference of Gaussian graphical models and enhances identifiability of the causal effect. In addition, when the model is being used to predict outcomes, it un-confounds the coefficients on the parents of the outcomes and leads to improved predictive performance when out-of-sample regime is very different from the training data. We show that any improvement of prediction of an outcome is intrinsically capped and cannot rise beyond a certain limit as compared to the confounded model. We extend our methodology beyond GGMs to ordinal variables and nonlinear cases. Our R package provides both PCA and autoencoder implementations of the methodology, suitable for GGMs with some guarantees and for better performance in general cases but without such guarantees.

Introduction

Construction of graphical models (GMs) pursues two related objectives: accurate inference of conditional independencies (causal discovery), and construction of models for outcomes of interest with the purpose of estimating the average causal effect (ACE) with respect to interventions (Pearl 2000; Hernán and Robins 2006) (causal inference). These two distinct goals mandate that certain identifiability conditions for both types of tasks be met. Of particular interest to us is the condition of causal sufficiency ((Spirtes, Glymour, and Scheines 1993)) - namely, that all of the relevant confounders have been observed. When this condition is not met, accurate GMs and identifiability of the causal effects are hard to infer.

Intuitively, the presence of unobserved confounding leads to violations of conditional independence among the affected variables downstream from any latent confounder. Score-based methods for GM construction aim to minimize unexplained variance for all variables in the network by accounting for conditional independencies in the data (Pearl 2000; Friedman and Koller 2013). Unobservability of a causally important variable will induce dependencies among its descendants that are conditionally independent of each other given the latent variable. This gives rise to inferred connectivity that is excessive compared to the true network (Elidan et al. 2001). Further, since none of the observed descendants are perfect correlates of the unobserved ancestor, some of the information from the ancestor will "pollute" model residuals and give rise to correlation patterns in the residual space.

Hitherto, the causal inference literature has largely focused on addressing the subset of problems where the ACE can be estimated reliably in the absence of this guarantee, such as when conditional exchangeability holds (Hernán and Robins 2006). In causal discovery, certain constraint-based methods like the Fast Causal Algorithm (FCI) (Spirtes, Meek, and Richardson 1995) can deal with general latent structures but do not scale well to large problems. While faster variants of FCI like RFCI (Colombo et al. 2012) and FCI+ (Claassen, Mooij, and Heskes 2013) exist, score-based approaches tend to perform better than purely constraint-based methods. This has been demonstrated in problems without latent confounders (Nandy et al. 2018) and has been suggested that this is because constraint-based methods are sensitive to early mistakes which progagate into more errors later on while in score-based approaches the mistakes are localized and do not affect the score of graphs later on (Bernstein et al. 2020). It is likely that these limitations also affect FCI-type algorithms. Furthermore, while FCI can deal with general latent structures, in certain scenarios where a few latent confounders drive many observed variables most observed variables are conditionally dependent which implies dense maximal ancestral graphs where very few edges can be oriented (Frot, Nandy, and Maathuis 2017) (see supplementary Figure LABEL:sup-fig_rfci_ex1 for an illustration). Finally, another disadvantage of constraint based methods is that they generate a single model and do not provide a score summarizing how likely the network is given the data while score based methods can generate this type of scores by using Bayesian or ensemble approaches (Jabbari et al. 2017). Relatively computationally efficient score-based methods have been proposed for inferring latent variables affecting GMs by the means of expectation maximization (EM) (as far back as (Friedman 1997, 1998)). However, for a large enough network, local gradients do not provide a reliable guide, nor do they address the cardinality of the latent space. Methods for using near-cliques for detection of latent variables in directed acyclic graphs (DAGs) with Gaussian graphical models (GGMs) have been proposed that address both problems by analyzing near-cliques in DAGs (Elidan et al. 2001; Silva et al. 2006). A method related to ours has been proposed for calculating latent variables in a greedy fashion in linear and "invertible continuous" networks, and relating such estimates to observed data to speed up structure search and enable discovery of hidden variables (Elidan, Nachman, and Friedman 2007). However, with a clique-driven approach it is impossible to tell whether any cliques have shared parents and, importantly, whether any signal remained to be modeled, resulting in score-based testing rejecting proposed "ideal parents". Additionally, Wang and Blei (2019) introduced the deconfounder approach to detecting and adjusting for latent confounders of causal effects in the presence of multiple causes but in the context of a fixed DAG under relatively strict conditions (Wang and Blei 2019a, b).

We show that there exist circumstances when causal sufficiency can be asymptotically achieved and exchangeability ensured even when the causal drivers of outcome are confounded by a number of latent variables. This can be done when confounding is pleiotropic, that is, when the latent variable affects a "large enough" number of variables, some driving an outcome of interest and others not (we tie this to the expansion property defined in (Anandkumar et al. 2013)). Notably, this objective cannot be achieved when confounding affects only the variables of interest and their causal parents ((D’Amour 2019)). Our approach is provably optimal given a proposed graph. This leads to the iterative EM-like optimization approach over structure and latent space proposed below.

The outline of this paper is as follows: In the first two sections, we discuss the process of diagnosting latent confounding in the special case of GGMs where closed-form local solutions exist. In the third section, we derive the improvement cap for predictive performance of the graph thus learned. In the fourth section, we discuss implementation, and in the fifth demonstrate the approach on simulated and real data. Finally, we propose some extensions to the approach for more complex graphical model structures and functional forms.

Background And Notation

Consider a factorized joint probability distribution over a set of observed and latent variables Table 1 summarizes the notation to be used to distinguish variables, their relevant partitions , and parameters.

Symbol	Meaning	Indexing
$S$	samples	$S_{i},i\in\{1,\dots,s\}$
$V$	observed predictor variables	$V_{j},j\in\{1,\dots,v\}$
$U$	unobserved predictor variables	$U_{l},l\in\{1,\dots,u\}$
$O$	outcomes (sinks)	$O_{k},k\in\{1,\dots,o\}$
$D$	$\{V,O\}$ - observable data matrix
$\bar{D}$	estimate of variables in D
	from their parents
$D_{u}$	$\{V,O,U\}$ - implied data matrix
$\theta$	parameters	$\theta_{i},i\in\{1,\dots,t\}$
$Pa^{N}$	parents of variable $N$	$Pa^{N}_{i},i\in\{1,\dots,p\}$
$C^{N}$	children of variable $N$	$C^{N}_{i},i\in\{1,\dots,c\}$
$G$	graph over $D$
$G_{u}$	graph over $D_{u}$
$R$	residuals - matrix matching $D$

Table 1: Notation

Assume that the joint distribution $D_{u}$ over the full data or $D$ over the observed data is factorized as a directed acyclic graph, $G_{u}$ or $G$ . We will consider individual conditional probabilities describing nodes and their parents, $P(V|Pa^{V},\theta)$ , where $\theta$ refers to the parameters linking the parents of $V$ to $V$ . $\hat{\theta}$ will refer to an estimate of these parameters. We will further assume that $G$ is constructed subject to regularization and using unbiased estimators for $P(V|Pa^{V},\hat{\theta})$ . We will further assume that $D_{u}$ plus any given constraints are sufficient to infer the true graph up to Markov equivalence. For convenience, we’ll focus on the actual true graph’s parameters, so that, using unbiased estimators, $E[\hat{\theta_{m}}|D_{u}]=\theta_{m},\forall m$ .

Mirroring $D$ (or $D_{u}$ ), we will define a matrix $R$ of the same dimensions ( $s\times(v+o)$ ) that captures the residuals of modeling every variable $N\in\{V,O,(U)\}$ via $G$ (or $G_{u}$ ). Note that $R_{u}$ - the residuals matrix that contains residuals of $U$ - doesn’t make sense and isn’t defined. In the linear case, these would be regular linear model residuals, but more generally we will consider probability scale residuals (PSR, (Shepherd, Li, and Liu 2016)). That is, we define $R[i,j]=PSR(P(V_{j}|Pa^{V_{j}},\hat{\theta}_{j})|D[i,j])$ , the residuals of $V_{j}$ given its graph parents. The use of probability-scale residuals allows us to define $R$ for all ordinal variable types, up to rank-equivalence.

Diagnosing Latent Confounding

Here, we consider the special case of GGMs where our approach for determining the existence of latent confounders can be written down in a closed form.

Recall that, for some $V_{j}\in\{V,U,O\}$ , $P^{V_{j}}_{k}$ denotes the $k$ th parent of $V_{j}$ . For GGMs, we can write down a fragment of any DAG G as a linear equation where a child node is parameterized as a linear function of its parents:

	$\displaystyle V_{j}=\beta_{j0}+\beta_{j1}P^{V_{j}}_{1}+\dots+\beta_{jp}P^{V_{j}}_{p}+\xi_{j},$		(1)
	$\displaystyle\xi_{j}\sim\mathcal{N}(0,\sigma_{j}).$		(1)

\digraph

[scale=0.5]g graph[ranksep=0.1]; node [shape=circle, style="filled"]; U [fillcolor=red]; V1 [fillcolor=white, label=<V1>]; V2 [fillcolor=gray, label=<V2>]; V3 [fillcolor=gray, label=<V3>]; V4 [fillcolor=gray, label=<V4>]; V5 [fillcolor=gray, label=<V5>]; V6 [fillcolor=gray, label=<V6>]; V7 [fillcolor=white, label=<V7>]; O1 [fillcolor=green, label=<O1>]; O2 [fillcolor=green, label=<O2>]; V1 -> U; U -> V2; U -> V3; U-> V4; U-> V5; U -> V6; V1 -> O2; V2 -> O2; V4-> O1; V5 -> O1; V6 -> O1; U -> O2; U -> O1; V1 -> V2; V5 -> V6; V7 -> O1 \digraph[scale=0.5]g2 graph[ranksep=0.1]; node [shape=circle, style="filled"]; V1 [fillcolor=white, label=<V1>]; V2 [fillcolor=gray, label=<V2>]; V3 [fillcolor=gray, label=<V3>]; V4 [fillcolor=gray, label=<V4>]; V5 [fillcolor=gray, label=<V5>]; V6 [fillcolor=gray, label=<V6>]; V7 [fillcolor=white, label=<V7>]; O1 [fillcolor=green, label=<O1>]; O2 [fillcolor=green, label=<O2>]; V1 -> V3; V3 -> V2; V3 -> V4; V3 -> V6; V3 -> V5; V4 -> V6; V6 -> V5; V1 -> O2; V2 -> O2; V4-> O1; V5 -> O1; V6 -> O1; V7 -> O1; V3 -> O2; V4 -> V5; V5 -> O2; V1 -> V2; V2 -> V5; V3 -> O1

Figure 1: Graph

G_{u}

U

influences the outcomes

O

, and a number of predictors

V

, confounding many of the

V_{j}\rightarrow O_{k}

relationships. Gray nodes are affected by

U

Figure 2: Graph

G

. With

U

latent, the graph adjusts, introducing spurious edges.

For example, consider $O_{1}$ in Figure 2. We can write:

\begin{split}O_{1}=\beta_{0}+\beta_{6}V_{6}+\beta_{5}V_{5}+\beta_{4}V_{4}+\beta_{7}V_{7}+\beta_{u}U+\xi_{1},\\ \xi_{1}\sim\mathcal{N}(0,\sigma_{O_{1}}).\end{split}

(2)

For any variable N that has parents in $G_{u}$ , we can group variables in $P^{N}$ into three subsets: $X_{U}\in\{P^{U},C^{U}\}$ , $X_{\cancel{U}}\notin\{P^{U},C^{U}\}$ , and the set $U$ itself, and write down the following general form using matrix notation:

\begin{split}N=\beta_{N0}+B_{U}X_{U}+B_{\cancel{U}}X_{\cancel{U}}+\beta_{U}U+\xi_{N},\\ \xi_{N}\sim\mathcal{N}(0,\sigma_{N}).\end{split}

(3)

Explicit dependence of $N$ on $U$ happens when $\beta_{U}\neq 0$ .

Note that if we deleted $U$ and its edges from $G_{u}$ without relearning the graph, Equation 3 from $G_{u}$ would read:

N=\beta_{N0}+B_{U}X_{U}+B_{\cancel{U}}X_{\cancel{U}}+R_{N}+\xi_{N}

(4)

where the residual term $R_{N}$ is simply equal to the direct contribution of $U$ to $N$ .

Now consider $G$ , the graph learned over the variables $\{V,O\}$ excluding the latent space $U$ . The network $G$ would have to adjust to the missingness of $U$ (e.g., Figure 2 vs Figure 2). As a result, $R_{N}$ will be partially substituted by other variables in $\{P^{U},C^{U}\}$ . Still, unless $U$ is completely explained by $\{P^{U},C^{U}\}$ (as described in (D’Amour 2019)) and in the absence of regularization (when a high enough number of covariates may lead to such collinearity), $R_{N}$ will not fully disappear in $G$ . Hence, even after partially explaining the contribution of $U$ to $N$ by some of the parents of $N$ in $G$ ,

R_{N}=\beta_{0}+\beta_{1}U+\xi_{N}.

(5)

	Variables
Samples	$V_{11}$	$V_{12}$	…	$V_{1v}$
	…	…	…	…
	$V_{s1}$	$V_{s2}$	…	$V_{sv}$

	Residuals
Samples	$R_{11}$	$R_{12}$	…	$R_{1v}$
	…	…	…	…
	$R_{s1}$	$R_{s2}$	…	$R_{sv}$

Table 2: The training data frame (Variables) implies a matching data frame (Residuals) once the joint distribution of all variables is specified via a graph and its parameterization

Therefore, the columns in the residuals table corresponding to $G$ (Table 2) that represent the parents and children of $U$ will contain residuals collinear with $U$ :

\begin{split}R_{1}=\beta_{10}&+\beta_{11}U+\xi_{1}\\ &\ldots\\ R_{k}=\beta_{k0}&+\beta_{k1}U+\xi_{k}.\end{split}

(6)

Rearranging and combining,

U=\beta_{i}^{*}R_{1}+\dots+\beta_{k}^{*}R_{k}+\xi=BR+\xi.

(7)

Equation 7 tells us that, for graphical Gaussian models, components of $U$ are obtainable from linear combinations of residuals, or principal components (PCs) of the residual table $R$ . In other words, $U$ is identifiable by principal component analysis (PCA). Whether the residuals needed for this identification exist depends on the expansion property as defined in (Anandkumar et al. 2013).

Theorem 1 (Inferrability).

Posit that we know the true DAG $G$ over $D$ , but not $U=G_{u}\setminus G$ , the latent space’s edges to observed variables. Assume that $L$ is orthogonal, with no variable in $D$ a parent of any variable in $U$ . Define as $R_{C}=C^{U,G}-\bar{C}^{U,G}$ the residuals of all children of $U$ that also have parents in $G$ . If relationships described by $G$ are linear, we can infer $U$ from $R_{C}$ up to sign and scale.

Proof.

For proof, refer to supplement, theorem LABEL:sup-thm:inferrability. ∎

Theorem 1 does require G to be known, which will give rise to an iterative algorithm later on. However, from it we see that focusing on residuals of a graph permits identification of the latent space in some circumstances. Supplementary theorem LABEL:sup-thm:deconfounding relaxes the need for orthogonality when latent space’s structure is irrelevant but controlling for it is important.

Note that this approach is similar, in the linear case, to that in (Elidan, Nachman, and Friedman 2007) except insofar as we show the principal components of the entire residual matrix to be optimal for the discovery of the whole latent space of a given DAG, and thus couple structure inference and latent space discovery via EM (1).

Capping improvement

The most important utility of causal modeling is to identify drivers as well as drivers of drivers of outcomes of interest. These drivers of drivers may have practical importance. For example, in the development of a drug, direct predictors of an outcome (e.g. $V_{6}$ pointing to $O_{1}$ in Figure 2) may not be viable targets but upstream variables (e.g. $V_{5}$ ) may in fact be be promising targets and the direct parents of the outcome mediate the effect of these potential targets. In the presence of latent confounding, the identifiability of direct causal effects of outcomes is also jeopardized (Hernán and Robins 2006). For example, in Figure 2 misisng $U$ induces apparent correlation among $V_{3}$ , $V_{4}$ , $V_{5}$ , and $V_{6}$ even though some of these variables are conditionally independent given $U$ .

The extent of the problem of unmeasured confounding can be quantified in more detail. Suppose we model $O_{1}$ without controlling for $U$ (2):

O_{1}=\beta_{0}+\beta_{3}V_{3}+\beta_{4}V_{4}+\beta_{5}V_{5}+\beta_{6}V_{6}+\dots.

Setting the coefficient of determination for the model

V_{3}=\alpha_{0}+\alpha_{4}V_{4}+\alpha_{5}V_{5}+\alpha_{6}V_{6}+\dots.

equal to $\rho_{3}^{2}$ . Then the estimated variance of $\beta_{3}$ in the presence of collinearity can be related to the variance when collinearity is absent via the following formula (Rawlings, Pantula, and Dickey 1998):

var(\bar{\beta}_{3})=var(\beta_{3})\frac{1}{1-\rho_{3}^{2}}\propto\frac{1}{1-\rho_{3}^{2}}.

(8)

Formula 8 describes the variance inflation factor (VIF) of $\beta_{3}$ . Note that $\lim_{\rho\to 1}\frac{1}{1-\rho^{2}}=\infty$ , so even mild collinearity induced by latent variables can severely distort coefficient values and signs, and thus estimation of ACE. Our approach reduces the VIFs of coefficients related to outcomes and thus make all causal statements relating to outcomes, such as calculation of ACE, more reliable, by controlling for $\bar{U}$ - the estimate of $U$ - in the network,

\lim_{(U-\bar{U})\to 0}var(\bar{\beta}_{i})=var(\beta_{i}).

(9)

Consider an output $O_{j}$ . While it is difficult to describe the limit of error on the coefficients of the drivers of $O_{j}$ , it is straightforward to put a ceiling on the improvement in the likelihood obtainable from modeling $U$ and approximating $G_{u}$ with $G_{\bar{u}}$ . Suppose we eventually model $U$ as a linear combination of a set of variables $X\subset V$ , and denote by $X\setminus W$ the set difference: members of $X$ not in $W$ . Then for any outcome $O_{i}$ predicted by a set of variables $W$ in the graph $G$ and in truth predicted by the set $Z+U$ , we can contrast three expressions (from $G$ and $G_{\bar{U}}$ respectively):

\begin{split}O_{i}=\beta_{i0}+B_{W}W(a)+\xi_{i}\qquad(a)\\ O_{i}=\beta_{i0}+B_{W}W+B_{X\setminus W}(X\setminus W)+\xi_{i}\qquad(b)\\ O_{i}^{U}=\beta_{i0}^{U}+B_{Z}^{U}Z+B^{U}U+\xi_{i}\qquad(c).\end{split}

(10)

Model (a) is the model that was actually accepted, subject to regularization, in G. Model (b) is the "complete" model of $O_{i}$ that controls for $U$ non-parsimoniously by controlling for all variables affected by $U$ and not originally in the model. The third model, (c), is the ideal parsimonious model when U is known. We can compare the quality of these models via the Bayesian Score, and the full score, which can approximated by the Bayesian Information Criterion (BIC) in large samples (Koller and Friedman 2009). We assume that model (c) would have the lowest BIC (being the best model), and model (a) would be slightly better, since we know that the set of variables $X\setminus W$ didn’t make it into the first equation subject to regularization by BIC. Assuming $n$ samples,

$\displaystyle b_{a}$	$\displaystyle=BIC(O_{i}=\beta_{i0}+B_{W}W+\xi_{i})$	(11)
$\displaystyle b_{b}$	$\displaystyle=BIC(O_{i}=\beta_{i0}+B_{W}W$	(12)
	$\displaystyle+B_{X\setminus W}(X\setminus W)+\xi_{i})$
	$\displaystyle=b_{c}+\|X\setminus W\|log(n)$
$\displaystyle b_{c}$	$\displaystyle=BIC(O_{i}^{U}=\beta_{i0}^{U}+B_{Z}^{U}Z+B^{U}U+\xi_{i})$	(13)

The "complete model" - model (b) - includes all of the true predictors of $O_{j}$ . Therefore its score will be the same as that of the true model - model (c) - plus the BIC penalty, $log(n)$ , for each extra term, minus the cost of having $U$ in the true model (that is, the cardinality of the relevant part of the latent space). We know that the extra information carried by this model was not big enough to improve upon model (a), that is $b_{a}<b_{c}+k\log(n)$ for some $k$ . Rearranging:

b_{c}-b_{a}>-k\log(n).

(14)

Any improvement in $G_{\bar{U}}$ owing to modeling of $\bar{U}$ cannot, therefore, exceed $k\log(n)$ logs, where $k=|X\setminus W|-|U|$ : the information contained in the "complete" model is smaller than its cost.

Although the available improvement in predictive power is also capped in some way, it is still important to aim for that limit. The reason is, correct inference of causality, especially in the presence of latent variables, is the only way to ensure transportability of models in real-world (heterogeneous-data) applications (see, e.g., (Bareinboim and Pearl 2016)).

This approach is a generalization of work presented in (Anandkumar et al. 2013), where the authors show that, under some assumptions. the latent space can be learned exactly, which is also related to the deconfounder approach described by (Wang and Blei 2019a). However, our approach does not require that the observables be conditionally independent given the latent space and instead generate such independence by the use of causal network’s residuals, which are conditionally independent of each other given the graph and the latent space. However, since the network among the observables is undefined in the beginning, the structure of the observable network must be learned at the same time as the structure of the latent space, which leads us to the iterative/variational bayes approach presented in Algorithm 1. Lastly, the use of the entirety of the residual space is different from the work described in (Elidan, Nachman, and Friedman 2007), where local residuals are pursued with the goal to accelerate structure learning while simultaneously discovering the latent space.

Implementation

Algorithm 1 below describes our approach to learning the latent space and can be viewed as a type of an expectation-maximization algorithm.

How do we learn $\bar{U}=f(R_{\bar{U}})$ ? In the linear case, we can use PCA, as described above, and in the non-linear case, we can use non-linear PCA, autoencoders, or other methods, as alluded to above. However, the linear case provides a useful constraint on dimensionality, and this constraint can be derived quickly. A useful notion of the ceiling constraint on the linear latent space dimensionality can be found in (Gavish and Donoho 2014). From a practical standpoint, the dimensionality can be set even tighter, and we use a previously described heuristic approach(Buja and Eyuboglu 1992).

Algorithm 1 Learning

\bar{U}

from structure residuals via EM

Data: The set of observed variables

\{V,O\}

Result: Graph

G_{\bar{U}}(V,O,\bar{U})

Construct

G=G(V,O)

Compute

S_{0}=BIC_{G}

Estimate

\bar{U}=f(R)

Construct

G_{\bar{U}}=G(V,O,\bar{U})

Compute

S_{\bar{U}}=BIC(G_{\bar{U}})

while

S_{\bar{U}}-S_{0}>\epsilon

Set

S_{0}=S_{\bar{U}}

Calculate

R_{\bar{U}}

Set

\bar{U}

to arbitrary constant values

for all child node

C\in G_{\bar{U}},C\notin\bar{U}

Set parents to training data

\bar{C}=C|parents(C)

Set

R_{C}=PSR(\bar{C},C)

end for

Estimate

\bar{U}=f(R_{\bar{U}})

Construct

G_{\bar{U}}=G(V,O,\bar{U})

Compute

S_{\bar{U}}=BIC(G_{\bar{U}})

end while

Nonlinear Extension: Autoencoder

All results described above refer to Gaussian or at most rank-monotonic relationships, and perhaps extend to linear models with interactions, when interactions can be seen as "synthetic features". Real-world data, however, often does not behave this way requiring an approach that might generalize beyond monotonic relationships. Therefore, we pursued latent space discovery using autoencoders.

We first assess the cardinality of the latent space (number of nodes in the coding layer) using the linear approach (PCA), and take this number as a useful "ceiling" for the dimensionality of the non-linear latent space, on the assumption that non-linear features are more compact and require lower dimensionality if discovered. We then constructed an autoencoder with 4 hidden layers where the maximum number of nodes in the hidden layers was capped at $\min(100,\textrm{number of variables})$ , the cap being dictated by practical considerations.

The autoencoder was implemented using Keras with Tensorflow backend and called within R using the Reticulate package (Ushey 2020). The encoders and decoder were kept symmetric and in order improve the stability we used tied weights (Berlinkov 2018). In addition, the coding layer had additional properties borrowed from PCA including a kernel regularizer promoting orthogonality between weights and an activity regularizer to promote uncorrelated encoded features (see (Ranjan 2019) for details on implementation and justification). This last property is of particular interest in our application since ideally every dimension in our latent space should be associated with a different latent variable. The hidden layers used a sigmoidal activation except for the output layer which had a linear activation. All layers had batch normalization (Ioffe and Szegedy 2015) (see supplementary Figure LABEL:sup-fig_ae for a diagram of the architecture).

For more details see the implementation in the github repository for this paper ((github 2020)).

Numerical Demonstration

Synthetic Data

To illustrate the algorithms described in the previous sections we generated synthetic data from the network shown in Figure 5 (also supplementary Figure LABEL:sup-fig_network_true) where two variables $V_{1}$ and $V_{2}$ drive an outcome $Z$ . Two confounders $U_{1}$ and $U_{2}$ affect both the drivers and the outcome, as well as many additional variables that do not affect the outcome $Z$ . The coefficient values in the network were chosen so that faithfullness (also knows as stability (Pearl 2000)) was achieved allowing the structure and coefficients to be approximately recovered when all variables were observed (see supplement for additional examples).

The underlying network inference needed for the algorithm was implemented by bootstrapping the data and running the R package bnlearn ((Scutari 2010) also see code in (github 2020) for specific settings of the run.) on each bootstrap. The resulting ensemble of networks can be combined to obtain a consensus network where only the most confident edges are kept. Similarly, the coefficient estimates can be obtained by averaging them over bootstraps.

For this example, using the complete data (no missing variables) the consensus network created with edges with confidence larger than 40% recovers the true structure (see supplementary Figure LABEL:sup-fig_network_full), and the root mean square error (RMSE) in the coefficient estimates is 0.05 (not shown). This represents a lower bound on the error that we can expect to obtain under perfect reconstruction of the latent space.

When the confounders are unobserved, the reconstruction of the network introduces many false edges and results in a five-times-larger RMSE. Figure 5 (see also supplementary Figure LABEL:sup-fig_network_miss) shows the reconstructed network, where the red edges are the true edges between $V_{1}$ and $V_{2}$ and the outcome $Z$ .

We ran 1 for 20 iterations using PCA to reconstruct the latent space from the residuals with the assumption that the latent variables were source nodes. We then tracked the latent variable reconstruction in an out-of-sample test set as well as the error in the coefficient’s estimates. Figure 7 (supplementary Figure LABEL:sup-fig_latvar_r2) shows the adjusted $R^{2}$ between each of the true latent variables and the prediction obtained from the estimated latent space across iterations. The lines and error bands are calculated using locally estimated scatterplot smoothing (LOESS). The estimated latent space is predictive of both the latent variables, and the iterative procedure improves the $R^{2}$ with respect to $U_{1}$ from 0.49 at the first iterations to about 0.505 in about 3 iterations.

Figure 7 (also supplementary Figure LABEL:sup-fig_coef_rmse) shows the total error in the coefficients of all variables connected to the outcome $Z$ (RMSE) in the infered network as well as the error in the coefficients of the true drivers of $Z$ , $V_{1}$ and $V_{2}$ . The dashed lines show the error levels when all variables are observed.

Figure 5 (and supplementary Figure 7) shows the final inferred network at iteration 20. The number of edges arriving to the outcome was reduced considerably with respect to the network prior to inferring latent variables (Figure 5 and supplementary Figure LABEL:sup-fig_network_learn). In addition, the coefficients connecting V1 and V2 to the outcome are now closer to their true values (Figure 7). This represents an improvement in ACE estimation.

Refer to caption — Figure 3: True network.

Experimental Data

In order to assess the efficacy of our approach to latent variable inference on realistic data, we picked a dataset of a type that is typically hard to analyze. The dataset in question was obtained from the Cure Huntington’s Disease Initiative (CHDI) foundation (Langfelder et al. 2016). It captures molecular phenotyping and length of Huntingtin gene’s CAG repeat expansion in CAG knock-in mice and consists of striatum gene expression and proteomic data as well as other measurements. For this study, we extracted CAG, mouse weight and age, and a number of gene expression and proteomic measurements correlated with CAG, so that gene expression would also have matching proteomic data, for a total of 249 variables.over 153 samples.

Excessive CAG repeats cause Huntingont’s disease via a complex and not entirely understood mechanism, where disease age of onset and severity strongly correlate with repeat length. Furthermore, CAG influences a large number of gene expression and protein level markers, based on bayesian modeling - mostly indirectly. Therefore CAG is an ideal pleiotropic covariate of a type that our algorithm should be able to uncover if it were missing. However, the high biological noise, low sample size, and the presence of ties and near-ties in CAG values leave enough difficulty in the problem. Since we didin’t have an outcome with a lot of signal on which to test effects of latent confounding, we focused on inference of the latent variable as our measure of success in this test (Supplementary Algorithm LABEL:sup-alg_cag). We developed an autoencoder approach to inference for cases latent space may relate to observables in a very non-linear manner. In this case, we begin to see marginally better performance for autoencoder, though on simulated data linear approach works better, as expected (Figure 8). Given the small number of samples, a problem for deep learning, we believe that autoencoder is a promising approach for real-world datasets, especially biological data.

Generalizations of this approach

In this section, we discuss extensions to this approach for GGMs with interactions, nonlinear functional forms, and categorical variables.

Gaussian Graphical Models With Interactions

In the presence of interactions among variables in a GGM, equation 5 expressing the deviation of residuals from Gaussian noise may acquire higher-order terms due to interactions among the descendants of the latent space $U$ :

R_{N}=\beta_{0}+\beta_{1}U+\beta_{2}U^{2}+\beta_{2}U^{3}+\dots.

(15)

Assuming interactions up to $k$ th power are present in the system being modeled, residuals for each variable may have up to $k$ terms in the model matrix described by equation 15, and if interactions among variables in the latent space $U$ also exist, the cardinality of the principal components of the residuals may far exceed the cardinality of the underlying latent space. Nevertheless, it may be possible to reconstruct a parsimonious basis vector by application of regularization and nonlinear approaches to latent variable modeling, such as nonlinear PCA (e.g. using methods from (Karatzoglou et al. 2004)), or autoencoders (Louizos et al. 2017) as will be discussed below.

Generalization to nonlinear functions

We can show that linear PCA will suffice for a set of models broader than GGMs. In particular, we will focus on nonlinear but homeomorphic functions within the Generalized Additive Model (GAM) family. When talking about multiple inputs, we will require that the relationship of any output variable to any of the covariates in equation 5 is homeomorphic (invertible), and that equation 7 can be marginalized with respect to any right-hand-side variable as well as to the original left-hand side variable. For such class of transformations, mutual information between variables, such as between a single confounder $U$ and some downstream variable $N$ , is invariant (Kraskov, Stögbauer, and Grassberger 2004). Therefore, residuals of any variable $N$ will be rank-correlated to $rank(U)$ in a transformation-invariant way. Further, spearman rank-correlation, specifically, is defined as pearson correlation of ranks, and pearson correlation is a special case of mutual information for bivariate normal distribution. Therefore when talking about mutual information between ranks of arbitrarily distributed variables, we can use our results for the GGM case above.

Thus, equation 5 will apply here with some modifications:

rank(R_{N})=\beta_{0}+\beta_{1}rank(U)+\xi_{N}.

(16)

Since a method has been published recently describing how to capture rank-equivalent residuals (aka probability-scale residuals, or PSR) for any ordinal variable (Shepherd, Li, and Liu 2016), we can modify the equation 7 to reconstruct latent space up to rank-equivalence when interactions are absent from the network.

rank(U)=\frac{1}{\beta_{i}}rank(R_{i})+\frac{1}{\beta_{j}}rank(R_{j})+\dots+\xi.

(17)

When $U$ consists of multiple variables that are independent of each other, the relationship between $N$ and $U$ can be written down using the mutual information chain rule (MacKay 2003) and simplified taking advantage of mutual independence of the latent sources:

\begin{split}I(N;U)=I(N;U_{1},U_{2},\dots,U_{U})\\ =\sum_{i=1}^{u}{I(N;X_{i}|X_{i-1},\dots,X_{1})}=\sum_{i=1}^{u}{I(N;X_{i})}.\end{split}

(18)

If interactions among $U$ are present, it may still be possible to approximate the latent space with a suitably regularized nonlinear basis, but we do not, at present, know of specific conditions when this may or may not work. Novel methods for encoding basis sets, such as nonlinear PCA (implemented in the accompanying code), autoencoders, and others, may be brought to bear to collapse the linearly independent basis down to non-linearly independent (i.e. in the mutual information sense) components.

While approximate inference of latent variables for GMs built over invertible functions had been noted in (Elidan, Nachman, and Friedman 2007), the above method gives a direct rank-linear approach leveraging the recently-proposed PSRs. Similar ideas pertaining to the Ideal Parents approach and involving copula transforms have been outlined in (Tenzer and Elidan 2016).

Generalization to categorical variables

In principle, PSRs can be extended to the case of non-ordinal categorical variables by modeling binary in/out of class label, deviance being correct/false. These models would lack the smooth gradient allowed by ranks and would probably converge far worse and offer more local minima for EM to get stuck in.

Conclusions and Future Directions

In this work we present a method for describing the latent variable space which is locally optimal under linearity, up to rank-linearity. The method does not place a priori constraints on the number of latent variables, and will infer the upper bound on this dimensionality automatically.

In real-data situations, some assumptions of our linear approach may not hold - PCA and related methods may not suffice. For such cases, we provide a heuristic autoencoder implementation. Autoencoders have shown utility when the structure of the causal model is already known (Louizos et al. 2017). Learning structure over observed data as well as unconstrained latent space via autoencoders, as in our work, results in a hybrid "deep causal discovery" generalization of that approach.

Many domains can benefit from joint causal discovery and inference of latent space. For instance, in epidemiology neither the model structure nor the latent space are typically well-understood beyond simple examples. Multi-omic biological datasets are another area of applicability, since it’s impossible to collect all biological data modalities in practice. In clinical trials, epidemiological features of multi-omic datasets may be hard to fully account for.

Combining causal discovery and causal inference improves statements about ACE, and helps assess data set limitations. Further, we provide a relatively fast implementation suitable to real problems. Therefore, we hope that our work will open new practical applications of both paradigms.

References

Anandkumar et al. (2013) Anandkumar, A.; Hsu, D.; Javanmard, A.; and Kakade, S. 2013. Learning Linear Bayesian Networks with Latent Variables. In Dasgupta, S.; and McAllester, D., eds., Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, 249–257. Atlanta, Georgia, USA: PMLR. URL http://proceedings.mlr.press/v28/anandkumar13.html.
Bareinboim and Pearl (2016) Bareinboim, E.; and Pearl, J. 2016. Causal inference and the data-fusion problem. Proceedings of the National Academy of Sciences 113(27): 7345–7352. ISSN 0027-8424, 1091-6490. doi:10.1073/pnas.1510507113. URL http://www.pnas.org/lookup/doi/10.1073/pnas.1510507113.
Berlinkov (2018) Berlinkov, M. 2018. python - Tying Autoencoder Weights in a Dense Keras Layer. URL https://stackoverflow.com/questions/53751024/tying-autoencoder-weights-in-a-dense-keras-layer.
Bernstein et al. (2020) Bernstein, D.; Saeed, B.; Squires, C.; and Uhler, C. 2020. Ordering-Based Causal Structure Learning in the Presence of Latent Variables. In International Conference on Artificial Intelligence and Statistics, 4098–4108. PMLR.
Buja and Eyuboglu (1992) Buja, A.; and Eyuboglu, N. 1992. Remarks on Parallel Analysis. Multivariate Behavioral Research 27(4): 509–540. ISSN 0027-3171, 1532-7906. doi:10.1207/s15327906mbr2704_2. URL http://www.tandfonline.com/doi/abs/10.1207/s15327906mbr2704_2.
Claassen, Mooij, and Heskes (2013) Claassen, T.; Mooij, J.; and Heskes, T. 2013. Learning Sparse Causal Models Is Not NP-Hard. In Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence, 172–181. URL http://arxiv.org/abs/1309.6824.
Colombo et al. (2012) Colombo, D.; Maathuis, M. H.; Kalisch, M.; and Richardson, T. S. 2012. Learning High-Dimensional Directed Acyclic Graphs with Latent and Selection Variables. The Annals of Statistics 40(1): 294–321. ISSN 0090-5364. doi:10.1214/11-AOS940.
D’Amour (2019) D’Amour, A. 2019. On Multi-Cause Causal Inference with Unobserved Confounding: Counterexamples, Impossibility, and Alternatives. arXiv:1902.10286 [cs, stat] URL http://arxiv.org/abs/1902.10286. ArXiv: 1902.10286.
Elidan et al. (2001) Elidan, G.; Lotner, N.; Friedman, N.; and Koller, D. 2001. Discovering Hidden Variables: A Structure-Based Approach. In Leen, T. K.; Dietterich, T. G.; and Tresp, V., eds., Advances in Neural Information Processing Systems 13, 479–485. MIT Press. URL http://papers.nips.cc/paper/1940-discovering-hidden-variables-a-structure-based-approach.pdf.
Elidan, Nachman, and Friedman (2007) Elidan, G.; Nachman, I.; and Friedman, N. 2007. “Ideal Parent” Structure Learning for Continuous Variable Bayesian Networks. Journal of Machine Learning Research 8: 35.
Friedman (1997) Friedman, N. 1997. Learning belief networks in the presence of missing values and hidden variables. In ICML, volume 97, 125–133.
Friedman (1998) Friedman, N. 1998. The Bayesian structural EM algorithm. In Proceedings of the Fourteenth conference on Uncertainty in artificial intelligence, 129–138.
Friedman and Koller (2013) Friedman, N.; and Koller, D. 2013. Being Bayesian about Network Structure. arXiv:1301.3856 [cs, stat] URL http://arxiv.org/abs/1301.3856. ArXiv: 1301.3856.
Frot, Nandy, and Maathuis (2017) Frot, B.; Nandy, P.; and Maathuis, M. H. 2017. Robust Causal Structure Learning with Some Hidden Variables. arXiv:1708.01151 [stat] .
Gavish and Donoho (2014) Gavish, M.; and Donoho, D. L. 2014. The Optimal Hard Threshold for Singular Values is $4/\sqrt {3}$. IEEE Transactions on Information Theory 60(8): 5040–5053. ISSN 0018-9448, 1557-9654. doi:10.1109/TIT.2014.2323359. URL http://ieeexplore.ieee.org/document/6846297/.
github (2020) github. 2020. Latent Confounder. URL https://github.com/rimorob/netres. Original-date: 2020-02-05T03:35:01Z.
Hernán and Robins (2006) Hernán, M. A.; and Robins, J. M. 2006. Estimating causal effects from epidemiological data. Journal of Epidemiology and Community Health 60(7): 578–586. ISSN 0143-005X. doi:10.1136/jech.2004.029496.
Ioffe and Szegedy (2015) Ioffe, S.; and Szegedy, C. 2015. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.
Jabbari et al. (2017) Jabbari, F.; Ramsey, J.; Spirtes, P.; and Cooper, G. 2017. Discovery of Causal Models That Contain Latent Variables through Bayesian Scoring of Independence Constraints. Joint European Conference on Machine Learning and Knowledge Discovery in Databases 142–157. doi:10.1007/978-3-319-71246-8_9.
Karatzoglou et al. (2004) Karatzoglou, A.; Smola, A.; Hornik, K.; and Zeileis, A. 2004. kernlab - An S4 Package for Kernel Methods in R. Journal of Statistical Software 11(9). ISSN 1548-7660. doi:10.18637/jss.v011.i09. URL http://www.jstatsoft.org/v11/i09/.
Koller and Friedman (2009) Koller, D.; and Friedman, N. 2009. Probabilistic graphical models: principles and techniques. Adaptive computation and machine learning. Cambridge, MA: MIT Press. ISBN 978-0-262-01319-2.
Kraskov, Stögbauer, and Grassberger (2004) Kraskov, A.; Stögbauer, H.; and Grassberger, P. 2004. Estimating mutual information. Physical Review E 69(6). ISSN 1539-3755, 1550-2376. doi:10.1103/PhysRevE.69.066138. URL https://link.aps.org/doi/10.1103/PhysRevE.69.066138.
Langfelder et al. (2016) Langfelder, P.; Cantle, J. P.; Chatzopoulou, D.; Wang, N.; Gao, F.; Al-Ramahi, I.; Lu, X.-H.; Ramos, E. M.; El-Zein, K.; Zhao, Y.; Deverasetty, S.; Tebbe, A.; Schaab, C.; Lavery, D. J.; Howland, D.; Kwak, S.; Botas, J.; Aaronson, J. S.; Rosinski, J.; Coppola, G.; Horvath, S.; and Yang, X. W. 2016. Integrated genomics and proteomics to define huntingtin CAG length-dependent networks in HD Mice. Nature neuroscience 19(4): 623–633. ISSN 1097-6256. doi:10.1038/nn.4256. URL https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5984042/.
Louizos et al. (2017) Louizos, C.; Shalit, U.; Mooij, J.; Sontag, D.; Zemel, R.; and Welling, M. 2017. Causal Effect Inference with Deep Latent-Variable Models. arXiv:1705.08821 [cs, stat] URL http://arxiv.org/abs/1705.08821. ArXiv: 1705.08821.
MacKay (2003) MacKay, D. J. C. 2003. Information theory, inference, and learning algorithms. Cambridge, UK ; New York: Cambridge University Press. ISBN 978-0-521-64298-9.
Nandy et al. (2018) Nandy, P.; Hauser, A.; Maathuis, M. H.; et al. 2018. High-Dimensional Consistency in Score-Based and Hybrid Structure Learning. The Annals of Statistics 46(6A): 3151–3183.
Pearl (2000) Pearl, J. 2000. Causality: models, reasoning, and inference. Cambridge, U.K.; New York: Cambridge University Press. ISBN 978-1-139-64936-0 978-0-511-80316-1. URL http://dx.doi.org/10.1017/CBO9780511803161. OCLC: 834142635.
Ranjan (2019) Ranjan, C. 2019. Build the right Autoencoder — Tune and Optimize using PCA principles. Part II. URL https://towardsdatascience.com/build-the-right-autoencoder-tune-and-optimize-using-pca-principles-part-ii-24b9cca69bd6.
Rawlings, Pantula, and Dickey (1998) Rawlings, J. O.; Pantula, S. G.; and Dickey, D. A. 1998. Applied regression analysis: a research tool. Springer texts in statistics. New York: Springer, 2nd ed edition. ISBN 978-0-387-98454-4.
Scutari (2010) Scutari, M. 2010. Learning Bayesian Networks with the bnlearn R Package. Journal of Statistical Software 35(3). ISSN 1548-7660. doi:10.18637/jss.v035.i03. URL http://www.jstatsoft.org/v35/i03/.
Shepherd, Li, and Liu (2016) Shepherd, B. E.; Li, C.; and Liu, Q. 2016. Probability-scale residuals for continuous, discrete, and censored data. The Canadian journal of statistics = Revue canadienne de statistique 44(4): 463–479. ISSN 0319-5724. doi:10.1002/cjs.11302. URL https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5364820/.
Silva et al. (2006) Silva, R.; Scheines, R.; Glymour, C.; and Spirtes, P. 2006. Learning the Structure of Linear Latent Variable Models. J. Mach. Learn. Res. 7: 191–246.
Spirtes, Glymour, and Scheines (1993) Spirtes, P.; Glymour, C. N.; and Scheines, R. 1993. Causation, prediction, and search. Number 81 in Lecture notes in statistics. New York: Springer-Verlag. ISBN 978-0-387-97979-3.
Spirtes, Meek, and Richardson (1995) Spirtes, P. L.; Meek, C.; and Richardson, T. S. 1995. Causal Inference in the Presence of Latent Variables and Selection Bias. In Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, 499–506. URL http://arxiv.org/abs/1302.4983.
Tenzer and Elidan (2016) Tenzer, Y.; and Elidan, G. 2016. Generalized Ideal Parent (GIP): Discovering non-Gaussian Hidden Variables. In Gretton, A.; and Robert, C. C., eds., Proceedings of Machine Learning Research, volume 51, 222–230. Proceedings of Machine Learning Research: PMLR. URL http://proceedings.mlr.press.
Ushey (2020) Ushey, K. 2020. rstudio/reticulate. URL https://github.com/rstudio/reticulate. Original-date: 2017-02-06T18:59:46Z.
Wang and Blei (2019a) Wang, Y.; and Blei, D. M. 2019a. The Blessings of Multiple Causes. Journal of the American Statistical Association 114(528): 1574–1596.
Wang and Blei (2019b) Wang, Y.; and Blei, D. M. 2019b. Multiple Causes: A Causal Graphical View. arXiv:1905.12793 [cs, stat] .