
On Proximal Causal Learning with Many Hidden Confounders

Nikos Vlassis1 (Contact Author)    Phil Hebda1    Stephan McBride2    Athanasios Noulas3
(Stephan McBride and Athanasios Noulas were with Netflix when this work was concluded.)
1Netflix Research, CA, USA
2Amazon.com, CA, USA
3Traderepublic.com, Germany
{nvlassis, phebda}@netflix.com, [email protected], [email protected]
Abstract

We generalize the proximal g-formula of Miao, Geng, and Tchetgen Tchetgen (2018) for causal inference under unobserved confounding using proxy variables. Specifically, we show that the formula holds true for all causal models in a certain equivalence class, and this class contains models in which the total number of levels for the set of unobserved confounders can be arbitrarily larger than the number of levels of each proxy variable. Although straightforward to obtain, the result can be significant for applications. Simulations corroborate our formal arguments.

1 Introduction

Figure 1: Examples of causal graphs, panels (a)–(d). The red nodes are observed variables; the grey node U is unobserved. In (a) the variable U acts as an unobserved confounder between X and Y. In (b) the variable Z acts as an instrument, adding exogenous variation to the system. In (c) the variable Z can serve as a backdoor adjustment (i.e., admissible) variable to infer the causal effect of X on Y (see Section 2). In (d) the variables Z and W can serve as proxies to infer the causal effect of X on Y (see Section 3).

The gold standard in causal inference is randomized controlled trials. However, randomization can often be impractical, uneconomical, or even unethical, in which case the alternative is to infer the desired causal effects from observational data (Hernán and Robins, 2020; Pearl, 2009; Rubin, 2005). In this paper we address the problem of causal inference from observational data when unobserved confounders are present. This is a hard problem in general, but a very relevant one for many applications. In a seminal paper, Miao, Geng, and Tchetgen Tchetgen (Miao et al., 2018) demonstrated that the causal effect of a variable X on some other variable Y can be identified nonparametrically even under unobserved confounding, as long as a set of sufficient conditions is satisfied. Their approach, dubbed Proximal Causal Learning (Tchetgen Tchetgen et al., 2020), relies on the existence of two observed proxy variables that must be associated with the latent confounders in a certain way. Proximal Causal Learning can be particularly useful for industrial applications, where domain knowledge is often leveraged for reasoning about possible hidden confounders, and a large number of potential proxies are available (e.g., from detailed information on users or items).

As a motivating example from industry, consider users of an online service who perform a specific action, say interacting with a specific element in the service's UI. Let X be a (binary) random variable that models the event that a user undertook the specific action, and let Y denote the (binary) outcome that we are interested in studying, say whether the user subsequently makes a purchase within the service. A causal quantity of interest is the average treatment effect

P(Y=1\mid do(X=1)) - P(Y=1\mid do(X=0))\,,    (1)

where the do notation indicates that we want to simulate the effect on Y of fixing X to a specific value (Pearl, 2009). The estimation of the effect in (1) can be difficult when unobserved confounders are present. In our industry example, those confounders could be unobserved external sources of variation such as social buzz or other competing services, as well as unobserved user characteristics such as demographics or price consciousness. Unless we account for such confounders, statistical inference of the treatment effect (1) may be prone to bias.

One way to account for an unobserved confounder U is to employ two proxy variables, call them Z and W, that are known to be coupled to U in a certain way. In that case, and under certain (sufficient) conditions, the causal quantity P(y | do(x)) is nonparametrically identifiable (Miao et al., 2018; Shi et al., 2020; Tchetgen Tchetgen et al., 2020). An example of a causal graph for which identification is possible is shown in Fig. 1(d).

In the case where all variables are discrete, which we assume in this work, one of the conditions for identifiability in Miao et al. (2018) is that each of the two proxies Z and W must have the same number of levels (i.e., the same cardinality of their range) as the unobserved U. In this work we show that this condition can be significantly relaxed. As we elaborate in Section 4, the identification formula of Miao et al. (2018) holds true for all causal models in a certain equivalence class, and this class contains models in which the total number of levels for the set of unobserved confounders can be arbitrarily larger than the number of levels of Z or W. This result, although easy to obtain, can be very important for applications: it opens the way to applying Proximal Causal Learning to a much wider class of problems than previously thought possible.

2 Causal graphs and do-calculus

In this section we provide a short description of causal graphs and do-calculus, mainly following Bareinboim and Pearl (2016). A causal graph is defined by a set U of unobserved (latent) variables and a set V of observed variables, which are assumed to be coupled by local causal dependencies (deterministic functions) that give rise to a directed acyclic graph. For example, the causal graph of Fig. 1(a) involves two observed variables V = (X, Y) and a latent variable U, which are coupled by local functions as indicated by the arrows.

A causal graph allows one to predict the effect on a variable Y of intervening on some other variable X by setting X = x. In Pearl's notation this is written as P(y | do(x)) and it corresponds to the probability that Y takes the value y in a modified causal graph in which X has been fixed to the value x and all its incoming arrows in the original graph have been removed. The quantity P(y | do(x)) can also be expressed as P(y | do(x)) = Σ_u Y_x(u) P(U=u), where P(U) is the distribution of the latent variable U, and Y_x(u) is the potential outcome of unit u had u been assigned treatment X = x (Rubin, 2005).

A notable property of a causal graph is that, regardless of the form of the coupling functions among variables and regardless of the distribution P(U) of the latent variables, the distribution P(V) of the observed variables must obey certain conditional independence relations, which can be characterized by means of a graphical criterion known as d-separation: A set Z of nodes is said to block a path p if either (i) p contains at least one arrow-emitting node that is in Z or (ii) p contains at least one collision node that is outside Z and has no descendant in Z. If a set Z blocks all paths from a set X to a set Y, then we say that Z d-separates X and Y, in which case X and Y are independent given Z, which we write X ⊥⊥ Y | Z. For example, in Fig. 1(c) the variable Z d-separates the variables X and Y in the induced graph in which the outgoing arrow from X is removed.

Quantities such as P(y | do(x)) can be estimated nonparametrically from the observational distribution P(V) using an algebra known as do-calculus (Pearl, 2009). The latter is a set of rules that stipulate d-separation conditions in certain induced subgraphs of the original graph. An example of do-calculus is the 'backdoor' criterion for specifying admissible sets of variables: A set Z is admissible for estimating the causal effect of X on Y if (i) no element of Z is a descendant of X and (ii) the elements of Z block all backdoor paths from X to Y (those are paths that end with an arrow pointing to X). Using counterfactuals, these two conditions imply X ⊥⊥ Y_x | Z (see Bareinboim and Pearl, 2016), which is known as 'conditional ignorability' in the potential outcomes literature (Rosenbaum and Rubin, 1983). An admissible set allows expressing the interventional distribution P(y | do(x)) via observational quantities that are directly estimable from the data, using the g-formula (Hernán and Robins, 2020):

P(y\mid do(x)) = \sum_{z} P(y\mid x, Z=z)\,P(Z=z)\,.    (2)

As an example, in Fig. 1(c) the variable Z is admissible because (i) it is not a descendant of X and (ii) it blocks the single backdoor path from X to Y. Hence we can use (2) to estimate P(y | do(x)). Note that when all variables are discrete, the required quantities in (2) amount to simple histogram calculations.
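To make the histogram calculation concrete, the following is a minimal sketch of the backdoor g-formula (2) for discrete data. It assumes a pandas DataFrame with binary columns named "x", "y", and "z"; the column and function names are illustrative and not taken from the paper's code.

```python
# A minimal sketch of the backdoor g-formula (2) for discrete data.
import pandas as pd

def backdoor_effect(df: pd.DataFrame, x: int, y: int = 1) -> float:
    """Estimate P(Y=y | do(X=x)) = sum_z P(Y=y | X=x, Z=z) P(Z=z)."""
    total = 0.0
    for z, p_z in df["z"].value_counts(normalize=True).items():
        stratum = df[(df["x"] == x) & (df["z"] == z)]
        p_y_given_xz = (stratum["y"] == y).mean()  # histogram estimate
        total += p_y_given_xz * p_z
    return total

# Average treatment effect as in (1):
# ate = backdoor_effect(df, x=1) - backdoor_effect(df, x=0)
```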

The algebra of do-calculus has been shown to be complete, meaning that if the rules of do-calculus cannot establish a way to write P(y | do(x)) as a functional of P(V), then the effect P(y | do(x)) is not identifiable (Pearl, 2009).

Figure 2: Examples of causal graphs in the model equivalence class, panels (a)–(c). The variables U_*, U_1, U_2 can have arbitrarily many levels. See Section 4 for details.

3 Proximal Causal Learning

In practical applications, it may be hard to identify an admissible set of variables with which to use the backdoor g-formula (2), and reality may not be plausibly captured by a graph such as the one in Fig. 1(c) (for example, there might be omitted variables that cause U to directly affect Y; that would appear in Fig. 1(c) as an extra arrow from U to Y). In such cases, identification of P(y | do(x)) may nonetheless still be possible if a pair of proxy variables (also known as negative controls) is available, such as the variables Z and W in the graph of Fig. 1(d). Assuming all variables are discrete, the corresponding identifiability result, due to Miao, Geng, and Tchetgen Tchetgen (Miao et al., 2018), relies on the following assumptions:

1. Cardinalities: |Range(Z)| = |Range(W)| = |Range(U)| (the variables Z, W, U have an equal number of levels).

2. Backdoor: The unobserved variable U is admissible, blocking all backdoor paths from X to Y.

3. Conditional independence: The observed variables X, Y, Z, W and the unobserved U jointly satisfy the following two conditional independence relations:

(i) W ⊥⊥ (Z, X) | U
(ii) Z ⊥⊥ Y | (U, X)

4. Matrix rank: The matrix P(W | Z, x), whose entry in the w-th row and z-th column is P(w | z, x), is invertible for each value x.

Note that condition 4 is testable from the data, whereas conditions 1–3 involve the unobserved variable U and hence their validity must rely on domain knowledge about the area of study. If all conditions 1–4 are met, then the quantity P(y | do(x)) is identifiable by the proximal g-formula

P(y\mid do(x)) = P(y\mid x,Z)\,P(W\mid Z,x)^{-1}\,P(W)\,,    (3)

where P(y | x, Z) and P(W) are a row and a column vector whose entries are P(y | x, z) and P(w), respectively, for all values of z and w. For a proof sketch of (3), note that condition 2 allows us to write

P(y\mid do(x)) = \sum_{u} P(y\mid u,x)\,P(u) = P(y\mid U,x)\,P(U)    (4)

and the latter inner product can be re-expressed as

P(y\mid U,x)\ \underbrace{P(U\mid x,Z)\,P(W\mid Z,x)^{-1}\,P(W\mid U)}_{\text{identity matrix due to conditions 3, 4}}\ P(U)    (5)

which, by condition 3(ii), simplifies to the RHS of (3).

Note that the only difference between the proximal g-formula (3) and the g-formula (2) is that the P(Z) terms of (2) have been replaced with P(W | Z, x)^{-1} P(W). In analogy with (2), when all random variables are discrete, the required quantities in (3) can be computed by simple histogram calculations on the observed data. We refer to Miao et al. (2018), Shi et al. (2020), and Tchetgen Tchetgen et al. (2020) for more details and extensions.
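As an illustration of those histogram calculations, the following is a minimal sketch of a plug-in estimator of the proximal g-formula (3). It assumes binary Z and W and a pandas DataFrame with columns "x", "y", "z", "w"; the names are illustrative and not taken from the paper's code.

```python
# A minimal sketch of a histogram-based estimator of the proximal g-formula (3).
import numpy as np
import pandas as pd

def proximal_effect(df: pd.DataFrame, x: int, y: int = 1) -> float:
    """Estimate P(Y=y | do(X=x)) = P(y | x, Z) P(W | Z, x)^{-1} P(W)."""
    z_vals = sorted(df["z"].unique())
    w_vals = sorted(df["w"].unique())
    dx = df[df["x"] == x]

    # Row vector P(y | x, Z): one entry per value z.
    p_y_xz = np.array([(dx[dx["z"] == z]["y"] == y).mean() for z in z_vals])

    # Matrix P(W | Z, x): entry (w, z) is P(W=w | Z=z, X=x).
    p_w_zx = np.array([[(dx[dx["z"] == z]["w"] == w).mean() for z in z_vals]
                       for w in w_vals])

    # Column vector P(W).
    p_w = np.array([(df["w"] == w).mean() for w in w_vals])

    return float(p_y_xz @ np.linalg.inv(p_w_zx) @ p_w)

# ate = proximal_effect(df, x=1) - proximal_effect(df, x=0)
```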

4 An equivalence class of models

The full proof of (3) in Miao et al. (2018) reveals that condition 1 above (that Z, W and U have the same number of levels) is necessary for the proximal g-formula (3) to hold. This may at first seem too restrictive: The total number of levels of the unobserved U may be hard to know in practice, and even if this number were somehow available, finding two proxy variables Z and W that simultaneously satisfy conditions 1, 3, and 4 can be a daunting task. Condition 1 can be relaxed to |Range(Z)| ≥ |Range(U)|, and similarly for W, by replacing the matrix inverse in (3) with a pseudoinverse (Shi et al., 2020). However, by increasing |Range(Z)| and |Range(W)| (e.g., by introducing more proxy variables) we may start violating conditions 3 and 4. Fortunately, as we show next, condition 1 is not as restrictive as it may first appear.

Our key observation is that the proximal g-formula (3) must be true for all causal models in an equivalence class for which conditions 1–4 hold. This class contains more expressive graphs than the graph shown in Fig. 1(d). Critically, some of these graphs can involve additional latent nodes U_* (see, e.g., Fig. 2(b)), without violating conditions 1–4. This opens the avenue for applications in which |Range(U_total)|, the total number of levels of the set of all latent confounders, is arbitrarily higher than |Range(Z)| or |Range(W)|. We do not attempt to provide a complete characterization of this equivalence class in this work; that would require different graphical tools (Jaber et al., 2019). Here we only slightly modify the set of conditions 1–4, and show examples of causal graphs that are significantly more expressive than the graph in Fig. 1(d), and for which the proximal g-formula (3) still holds. To maintain consistency with the notation used in the previous sections, in the rest of the paper we will use U to denote a subset of all latent confounders such that |Range(Z)| = |Range(W)| = |Range(U)|, and we will use different notation (e.g., U_*) for additional latent confounders.

Our modified set of conditions is obtained by re-expressing the backdoor criterion 2 as an equivalent set of conditional independence relations, as follows (see Pearl, 2009, Section 11.3.3, for details). Let T stand for the set of all direct parents of X (observed and unobserved), excluding U if U is a direct parent of X (in our generalization the node U need not be a direct parent of X). The set T may contain Z and/or additional nodes not appearing in Fig. 1(d); see Fig. 2 for examples. Then the backdoor criterion 2 can be replaced by the following:

2. Backdoor-surrogate:

(i) T ⊥⊥ Y | (U, X)
(ii) X ⊥⊥ U | T    if U is not a direct parent of X

Note that the above two conditions are subsumed by the single conditional independence criterion 3(ii) when T = Z and U is a direct parent of X, as in Fig. 1(d).

Given the above, the equivalence class is defined by the set of models that satisfy conditions 1 and 4 from Section 3, together with the following set of graphical conditions (replacing conditions 2 and 3 from Section 3):

Equivalence class:

(i) W ⊥⊥ (Z, X) | U
(ii) Z ⊥⊥ Y | (U, X)
(iii) T ⊥⊥ Y | (U, X)
(iv) X ⊥⊥ U | T    if U is not a direct parent of X

In Fig. 2 we show a few examples of graphs in the equivalence class that can capture real-world dynamics and applications. The graph of Fig. 2(a) uses a post-treatment variable Z to perform inference. Post-treatment variables are known to bias regression estimators, but they can be used in the Proximal Causal Learning framework as long as the conditional independence relation Z ⊥⊥ Y | (U, X) holds (Shi et al., 2020; Tchetgen Tchetgen et al., 2020).

The graph of Fig. 2(b) has the potential to capture arbitrarily high-dimensional confounding in U_*. Returning to our industry example, this graph is particularly well suited for applications that involve interactions with each of the many elements within the UI: Engagement with each element can be studied independently by grouping the engagement with all other elements in U_*. That would be an example where we treat observed covariates as part of a latent U_* in order to satisfy the sufficient conditions of the equivalence class for the proximal g-formula (3) to hold; alternatively we can condition on those covariates and use a modified estimator (Tchetgen Tchetgen et al., 2020).

The graph of Fig. 2(c) allows for high-dimensional U_1 and U_2, as long as they do not simultaneously influence the low-dimensional U (i.e., U is not a collider). In our running industry example, U_2 can capture engagement with UI elements other than the one being studied, while U_1 can capture competition, other UI elements, marketing, or word-of-mouth effects that influence how a user interacts with the service. In this model, U can be viewed as a low-dimensional 'bottleneck' from U_1 to U_2.

5 Simulations

Real-world observational settings lack both ground-truth values of the average treatment effect in (1) and the ability to verify the cardinality and conditional independence assumptions. We therefore evaluate how well Proximal Causal Learning recovers causal effects using simulated data with known ground truth. Next we describe the structure of the simulations and how we leverage them to understand the properties of a simple histogram-based estimator of (3) when the assumptions in Section 3 are violated. Where relevant, we benchmark the results against a regression estimator.

5.1 Simulation setup and notation

For grounding, we first discuss the simulation of the graphical model of Fig. 1(d). We will leverage and extend this simulation in the following subsections to handle the graphs in Fig. 2, violations of conditional independence, and a high-dimensional U. All nodes are binary in all cases. We introduce difference parameters δ_{ij} for the connection from node i to node j, e.g., δ_{UW} = P(W | U) − P(W | ¬U). From there on, given a draw u ∼ U from a Bernoulli distribution with probability φ, the values of the nodes are populated according to:

W \sim \mathrm{Ber}(\phi + (u-\tfrac{1}{2})\,\delta_{UW})
Z \sim \mathrm{Ber}(\phi + (u-\tfrac{1}{2})\,\delta_{UZ})
X \sim \mathrm{Ber}(\phi + (u-\tfrac{1}{2})\,\delta_{UX} + (z-\tfrac{1}{2})\,\delta_{ZX})
Y \sim \mathrm{Ber}(\phi + (u-\tfrac{1}{2})\,\delta_{UY} + (w-\tfrac{1}{2})\,\delta_{WY} + (x-\tfrac{1}{2})\,\delta_{XY})

where the average treatment effect is encoded by the difference parameter δ_{XY}.

For additional details on the above structure, as well as on the extensions performed below, we refer to the code repository that produced the simulations and their results, located at https://github.com/hebda/proximal_causal_learning.
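For illustration, the following is a minimal sketch of the data-generating process above for the graph of Fig. 1(d); the parameter values and function name are illustrative, and the linked repository contains the actual simulation code.

```python
# A minimal sketch of the Fig. 1(d) data-generating process described above.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

def simulate(n, phi=0.5, d_uw=0.4, d_uz=0.4, d_ux=0.2, d_zx=0.2,
             d_uy=0.2, d_wy=0.1, d_xy=0.1):
    """Draw n samples from the Fig. 1(d) graph via the Bernoulli steps above."""
    u = rng.binomial(1, phi, n)
    w = rng.binomial(1, phi + (u - 0.5) * d_uw)
    z = rng.binomial(1, phi + (u - 0.5) * d_uz)
    x = rng.binomial(1, phi + (u - 0.5) * d_ux + (z - 0.5) * d_zx)
    y = rng.binomial(1, phi + (u - 0.5) * d_uy + (w - 0.5) * d_wy
                        + (x - 0.5) * d_xy)
    return pd.DataFrame(dict(u=u, w=w, z=z, x=x, y=y))

data = simulate(100_000)
# By construction, the true average treatment effect (1) equals d_xy.
```

The histogram-based estimator of (3) sketched in Section 3 can be applied directly to such a draw.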

5.2 Bias in the model equivalence class

Graph        Regression        Proximal Causal Learning
Fig. 1(d)    24.4% ± 0.4%      −0.1% ± 1.7%
Fig. 2(a)     7.3% ± 0.3%      −0.0% ± 1.2%
Fig. 2(b)    10.1% ± 0.4%       0.1% ± 1.5%
Fig. 2(c)    10.9% ± 0.5%      −0.2% ± 1.3%

Table 1: Relative biases of estimators of the average treatment effect, for simulations of the graphs of Fig. 1(d) and Fig. 2, using regression and Proximal Causal Learning. Regression leverages all observed nodes in the given graph. Proximal Causal Learning achieves the best performance (smallest-magnitude bias) in every row. We show the empirical mean and variance from 100 runs of the simulated graph.

Table 1 contains the results for the graphs of Fig. 2, including the base case of Fig. 1(d). We extend the simulation to include the additional nodes U_*, U_1, and U_2 where applicable. We compare the Proximal Causal Learning method against a baseline regression approach that uses the same covariates. The comparison is between each estimator's relative bias, defined as the relative difference between the estimated and the true average treatment effect in (1). Proximal Causal Learning gives an effectively unbiased result for the three graphs in Fig. 2, with the small, non-zero values coming from sampling variation. The regression benchmark is significantly biased in all cases; however, we note that regression is typically able to leverage many more covariates, which may result in a reduced, though non-zero, bias, depending on the application.

5.3 Implications of the invertibility of P(W | Z, x)

The invertibility of P(W | Z, x) (criterion 4 from Section 3) can be tested directly from the data by looking at the condition number of each matrix P(W | Z, x). For the two-dimensional (binary) case, we have two matrices to consider, one for each value of x, and we use the larger of the two condition numbers to describe the stability of the matrix inversion.

In Fig. 3 we can see that requiring a condition number below 30, which would lead to stable matrix inversion, corresponds to an implicit claim about the strength of certain connections, captured through the δ's in the structure of the simulation. Specifically, the connections U → W and U → Z cannot be arbitrarily weak, and the connections U → X and Z → X cannot be arbitrarily strong, whereas the connections U → Y, W → Y, and X → Y can be of arbitrary strength. In short, the unobserved confounder must be sufficiently captured by the proxies, without exerting too much influence on the treatment condition.

The requirement of a maximum condition number of 30 is somewhat arbitrary and context dependent. Ultimately, this quantity is related to the variance of the estimator: numerically stable matrix inversion yields low variance, while unstable inversion yields high variance.
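Since criterion 4 is testable, it can be checked before applying the estimator; the following is a minimal sketch of such a check, reusing the DataFrame conventions of the earlier sketches (the function name and cut-off handling are illustrative).

```python
# A minimal sketch of the testable criterion 4: condition number of P(W | Z, x).
import numpy as np

def max_condition_number(df) -> float:
    """Return the larger condition number of P(W | Z, x) over x in {0, 1}."""
    conds = []
    for x in (0, 1):
        dx = df[df["x"] == x]
        p_w_zx = np.array([[(dx[dx["z"] == z]["w"] == w).mean()
                            for z in (0, 1)] for w in (0, 1)])
        conds.append(np.linalg.cond(p_w_zx))
    return max(conds)

# e.g., proceed with the proximal estimator only if
# max_condition_number(data) <= 30  (the reference cut-off used here).
```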

Figure 3: Dependence of the condition number of P(W | Z, x) on the underlying conditional probabilities. The dashed line represents a reference cut-off of 30.

5.4 Violations of the independence assumptions

In practical applications, the selection of the variables W and Z is very important, since the independence assumptions are not testable. In temporal data, one could use certain future observations as proxies, as the proxies do not need to be pre-treatment (Shi et al., 2020). In industry settings, variables Z are generally easier to discover: Most interventions can only affect an outcome of interest through user interactions with the service. Variables W are generally harder to find, as they should affect the outcome without affecting interactions with the service.

In this subsection, we examine the bias in the average treatment effect that is induced in an estimator of (3) when the independence assumption W ⊥⊥ (Z, X) | U is violated, and we benchmark this bias against a regression estimator. We introduce this violation into the simulation by modifying the X-step to include the term (w − 1/2)δ_{WX}. In this way, X is conditionally dependent on W in addition to U and Z.

Fig. 4 shows the bias of the two estimators for varying δ_{WX}. We see that the bias for regression is roughly constant, though non-zero, across all values of δ_{WX}. For Proximal Causal Learning, the bias is strongly dependent on the strength of the W → X connection. Careful selection of the W variable will yield a variable for which δ_{WX} is not large. Additionally, the invertibility of P(W | Z, x) provides protection against arbitrarily high values of bias, since large values of δ_{WX} cause this testable criterion to fail. (Such points are omitted from the plot.)
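For reference, a sketch of the modified X-step is given below, extending the simulate() sketch of Section 5.1 with a direct W → X dependence; d_wx is an illustrative name for δ_{WX}.

```python
# A minimal sketch of the violated X-step: X now depends on W as well.
import numpy as np

rng = np.random.default_rng(0)

def draw_x_violated(u, z, w, phi=0.5, d_ux=0.2, d_zx=0.2, d_wx=0.1):
    """X-step with the violation of W independence: X depends on U, Z, and W."""
    return rng.binomial(1, phi + (u - 0.5) * d_ux + (z - 0.5) * d_zx
                           + (w - 0.5) * d_wx)
```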

Figure 4: Bias in the estimation of the average treatment effect as a function of δ_{WX}. The regression estimator is shown with 95% confidence intervals. The proximal g-formula estimate (3) is shown only when the condition number of P(W | Z, x) does not exceed 30; otherwise the estimate is omitted.

5.5 Bias when the dimensionality of U is large

Simulation setup                                                       Regression       Proximal Causal Learning
Fig. 2(b), binary X                                                    10.0% ± 1.9%      0.7%
Fig. 2(b), 5-dimensional X                                             10.5% ± 9.9%      1.0%
Fig. 1(d), δ_{UV,j} constant in j                                      42.8% ± 4.1%     −2.0%
Fig. 1(d), δ_{UV,j} decreases linearly in j                            18.7% ± 4.0%     −1.9%
Fig. 1(d), δ_{UV,j} decreases linearly in j, δ_{UY,j} constant in j    18.4% ± 4.1%     −0.2%
Fig. 1(d), δ_{UV,j} constant in j, δ_{UY,j} decreases linearly in j    42.9% ± 4.0%     −1.3%
Fig. 1(d), δ_{UV,j} = 0 for j ≥ 1, δ_{UY,j} constant in j               4.6% ± 3.9%     −0.3%
Fig. 1(d), δ_{UV,j} constant in j, δ_{UY,j} = 0 for j ≥ 1              40.4% ± 3.8%      2.4%

Table 2: Relative bias of the average treatment effect for graphs with a high-dimensional U (10 binary dimensions, a total of 1024 levels). Proximal Causal Learning achieves the best performance (smallest-magnitude bias) in every row. We present 95% confidence intervals with the regression result; at present, there is no analytical means to estimate the variance of the Proximal Causal Learning estimator.

In most applications, it is hard to have a good estimate of the dimensionality of the unobserved confounder U. In addition, the variables W and Z need to have the same number of levels as the total number of levels of U for the matrix P(W | Z, x) to be invertible. In this subsection, we examine the bias under a high-dimensional U that violates the (untestable) cardinality assumption 1 for the proximal g-formula (3). We focus on the graphs of Figures 1(d) and 2(b).

In this simulation we maintain binary W and Z and allow U to contain a total of 10 binary dimensions. This produces an unobserved confounder space with a total of 1024 levels. We encode difference parameters δ_{ij} for each binary dimension by treating the relevant variables as 10-dimensional vectors rather than scalars. Table 2 presents the results under different types of confounding effects. Comparing with Table 1, we see that when the dimensionality of U increases to 10 our estimates increase in bias, and that this bias does not depend on the dimensionality of X. Nevertheless, in all types of confounding we simulated, this bias is relatively small for Proximal Causal Learning, and it is consistently smaller than when using regression.
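As an illustration, a minimal sketch of this high-dimensional setting is given below, keeping W and Z binary while U has 10 binary dimensions; all parameter values and variable names are illustrative, not taken from the paper's code.

```python
# A minimal sketch of the high-dimensional confounder setting:
# U has 10 binary dimensions (1024 levels), while W and Z remain binary.
import numpy as np

rng = np.random.default_rng(0)
n, k, phi = 100_000, 10, 0.5
d_uw = np.full(k, 0.04)   # delta_{UW,j}, constant in j
d_uz = np.full(k, 0.04)   # delta_{UZ,j}
d_ux = np.full(k, 0.02)   # delta_{UX,j}
d_uy = np.full(k, 0.02)   # delta_{UY,j}
d_zx, d_wy, d_xy = 0.2, 0.1, 0.1

u = rng.binomial(1, phi, size=(n, k))            # 10 binary confounder dimensions
w = rng.binomial(1, phi + (u - 0.5) @ d_uw)      # binary proxy W
z = rng.binomial(1, phi + (u - 0.5) @ d_uz)      # binary proxy Z
x = rng.binomial(1, phi + (u - 0.5) @ d_ux + (z - 0.5) * d_zx)
y = rng.binomial(1, phi + (u - 0.5) @ d_uy + (w - 0.5) * d_wy
                    + (x - 0.5) * d_xy)
```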

6 Conclusions and discussion

We have shown that the identification result of Miao, Geng, and Tchetgen Tchetgen (Miao et al., 2018) for Proximal Causal Learning under unobserved confounding holds true for all causal models in a certain equivalence class. This class contains models in which the total number of levels for the set of unobserved confounders can be arbitrarily higher than the number of levels of the two proxy variables, an important result for industry applications.

In the simulations we have also studied a number of properties of a simple histogram-based estimator of (3), such as its sensitivity to misspecification or violation of some of the sufficient conditions. The simulations additionally show that, with low-dimensional proxies, the Proximal Causal Learning estimator yields less biased estimates than standard regression methods even in the presence of a high-dimensional U. Overall, our results provide further evidence that Proximal Causal Learning can be a very promising causal inference method in observational settings.

An interesting open question is to completely characterize the equivalence class of models for which identification holds (using tools such as ancestral graphs; Jaber et al., 2019), and to establish a necessary condition for identifiability. Another useful direction is to derive the population moments of an estimator of (3), both under a correctly specified and under a misspecified causal graph model, and to see how the risk of the estimator varies with the problem inputs (for instance, what is the precise dependence of the variance of the estimator on properties of P(W | Z, x), as alluded to above). Our simulations already provide some answers to those questions. Nonetheless, an analytical treatment would be valuable, as it would offer more intuition into the applicability of Proximal Causal Learning in practical causal inference problems, especially those often encountered in industry.

References

  • Bareinboim, E. and Pearl, J. (2016). Causal inference and the data-fusion problem. Proceedings of the National Academy of Sciences, 113(27):7345–7352.
  • Hernán, M. A. and Robins, J. M. (2020). Causal Inference: What If. Boca Raton: Chapman & Hall/CRC.
  • Jaber, A., Zhang, J., and Bareinboim, E. (2019). Causal identification under Markov equivalence: Completeness results. In International Conference on Machine Learning, pages 2981–2989.
  • Miao, W., Geng, Z., and Tchetgen Tchetgen, E. J. (2018). Identifying causal effects with proxy variables of an unmeasured confounder. Biometrika, 105(4):987–993.
  • Pearl, J. (2009). Causality. Cambridge University Press, 2nd edition.
  • Rosenbaum, P. R. and Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1):41–55.
  • Rubin, D. B. (2005). Causal inference using potential outcomes: Design, modeling, decisions. Journal of the American Statistical Association, 100(469):322–331.
  • Shi, X., Miao, W., Nelson, J. C., and Tchetgen Tchetgen, E. J. (2020). Multiply robust causal inference with double-negative control adjustment for categorical unmeasured confounding. Journal of the Royal Statistical Society: Series B (Statistical Methodology).
  • Tchetgen Tchetgen, E. J., Ying, A., Cui, Y., Shi, X., and Miao, W. (2020). An introduction to proximal causal learning. arXiv:2009.10982.