DeepMed: Semiparametric Causal Mediation Analysis with Debiased Deep Learning
Abstract
Causal mediation analysis can unpack the black box of causality and is therefore a powerful tool for disentangling causal pathways in biomedical and social sciences, and also for evaluating machine learning fairness. To reduce bias in estimating natural direct and indirect effects in mediation analysis, we propose a new method called DeepMed that uses deep neural networks (DNNs) to cross-fit the infinite-dimensional nuisance functions in the efficient influence functions. We obtain novel theoretical results showing that our method (1) can achieve the semiparametric efficiency bound without imposing sparsity constraints on the DNN architecture and (2) can adapt to certain low-dimensional structures of the nuisance functions, significantly advancing the existing literature on DNN-based semiparametric causal inference. Extensive synthetic experiments are conducted to support our findings and also expose the gap between theory and practice. As a proof of concept, we apply DeepMed to analyze two real datasets on machine learning fairness and reach conclusions consistent with previous findings.
1 Introduction
Tremendous progress has been made over the past decade on deploying deep neural networks (DNNs) in real-world problems (krizhevsky2012imagenet; wolf2019huggingface; jumper2021highly; brown2022deep). Causal inference is no exception. In semiparametric causal inference, a series of seminal works (chen2020causal; chernozhukov2020adversarial; farrell2021deep) initiated the investigation of statistical properties of causal effect estimators when the nuisance functions (the outcome regressions and propensity scores) are estimated by DNNs. However, there are a few limitations in the current literature that need to be addressed before the theoretical results can be used to guide practice:
(1) Most recent works mainly focus on the total effect (chen2020causal; farrell2021deep). In many settings, however, more intricate causal parameters are of greater interest. In biomedical and social sciences, one is often interested in "mediation analysis", which decomposes the total effect into direct and indirect effects to unpack the underlying black-box causal mechanism (baron1986moderator). More recently, mediation analysis has also percolated into machine learning fairness. For instance, in the context of predicting recidivism risk, nabi2018fair argued that, for a "fair" algorithm, sensitive features such as race should have no direct effect on the predicted recidivism risk. If such direct effects can be accurately estimated, one can detect the potential unfairness of a machine learning algorithm. We will revisit such applications in Section LABEL:sec:real and Appendix LABEL:app:real.
(2) Statistical properties of DNN-based causal estimators in recent works mostly follow from several recent results on the convergence rates of DNN-based nonparametric regression estimators (suzuki2019adaptivity; schmidt2020nonparametric; tsuji2021estimation), with the limitation of relying on sparse DNN architectures. The theoretical properties are in turn evaluated by relatively simple synthetic experiments that are not designed to generate nearly infinite-dimensional nuisance functions, even though this is the setting considered by almost all of the above related works.
The above limitations raise the tantalizing question of whether the available statistical guarantees for DNN-based causal inference have practical relevance. In this work, we partially fill these gaps by developing a new method called DeepMed for semiparametric mediation analysis with DNNs. We focus on the natural direct/indirect effects (NDE/NIE) (robins1992identifiability; pearl2001direct) (defined in Section 2.1), but our results can also be applied to more general settings; see Remark 2. The DeepMed estimators leverage the "multiply-robust" property of the efficient influence function (EIF) of NDE/NIE (tchetgen2012semiparametric; farbmacher2022causal) (see Proposition 1 in Section 2.2), together with the flexibility and superior predictive power of DNNs (see Section 3.1 and Algorithm 3.1). In particular, we also make the following novel contributions to deepen our understanding of DNN-based semiparametric causal inference:
• On the theoretical side, we obtain new results showing that our method can achieve the semiparametric efficiency bound without imposing sparsity constraints on the DNN architecture and can adapt to certain low-dimensional structures of the nuisance functions (see Section LABEL:sec:stat). Non-sparse DNN architectures are more commonly employed in practice (farrell2021deep), and low-dimensional structures of the nuisance functions can help avoid the curse of dimensionality. These two points, taken together, significantly advance our understanding of the statistical guarantees of DNN-based causal inference.
• More importantly, on the empirical side, in Section LABEL:sec:sim we design sophisticated synthetic experiments to simulate nearly infinite-dimensional functions, which are much more complex than those in previous related works (chen2020causal; farrell2021deep; adcock2021gap). We emphasize that these nontrivial experiments could be of independent interest to the theory of deep learning beyond causal inference, as they further expose the gap between deep learning theory and practice (adcock2021gap; gottschling2020troublesome); see Remark LABEL:beyond for an extended discussion. As a proof of concept, in Section LABEL:sec:real and Appendix LABEL:app:real, we also apply DeepMed to re-analyze two real-world datasets on algorithmic fairness and reach conclusions similar to related works.
• Finally, a user-friendly R package can be found at https://github.com/siqixu/DeepMed. Making such resources available helps enhance reproducibility, a widely recognized problem in all scientific disciplines, including (causal) machine learning (pineau2021improving; kaddour2022causal).
2 Definition, identification, and estimation of NDE and NIE
2.1 Definition of NDE and NIE
Throughout this paper, we denote $Y$ as the primary outcome of interest, $D$ as a binary treatment variable, $M$ as the mediator on the causal pathway from $D$ to $Y$, and $X \in [0,1]^p$ (or more generally, compactly supported in $\mathbb{R}^p$) as baseline covariates including all potential confounders. We denote the observed data vector as $O = (Y, D, M, X)$. Let $M(d)$ denote the potential outcome for the mediator when setting $D = d$ and $Y(d, m)$ be the potential outcome of $Y$ under $D = d$ and $M = m$, where $d \in \{0, 1\}$ and $m$ is in the support of $M$. We define the average total (treatment) effect as $\tau \equiv \mathbb{E}[Y(1, M(1)) - Y(0, M(0))]$, the average NDE of the treatment on the outcome when the mediator takes the natural potential outcome under $D = 0$ as $\mathrm{NDE} \equiv \mathbb{E}[Y(1, M(0)) - Y(0, M(0))]$, and the average NIE of the treatment on the outcome via the mediator as $\mathrm{NIE} \equiv \mathbb{E}[Y(1, M(1)) - Y(1, M(0))]$. We have the trivial decomposition $\tau = \mathrm{NDE} + \mathrm{NIE}$. In causal mediation analysis, the parameters of interest are NDE and NIE.
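To make the decomposition concrete, here is a minimal simulation sketch that generates the potential outcomes $M(d)$ and $Y(d, m)$ and checks $\tau = \mathrm{NDE} + \mathrm{NIE}$ numerically. The linear structural model below is our own illustrative choice, not one from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Hypothetical structural model (illustration only):
# X ~ N(0,1); M(d) = 0.5*d + X + eps_M; Y(d, m) = d + 0.8*m + X + eps_Y.
X = rng.normal(size=n)
eps_M = rng.normal(size=n)
eps_Y = rng.normal(size=n)

def M(d):          # potential mediator under treatment level d
    return 0.5 * d + X + eps_M

def Y(d, m):       # potential outcome under treatment d and mediator value m
    return d + 0.8 * m + X + eps_Y

tau = np.mean(Y(1, M(1)) - Y(0, M(0)))   # average total effect
nde = np.mean(Y(1, M(0)) - Y(0, M(0)))   # average natural direct effect
nie = np.mean(Y(1, M(1)) - Y(1, M(0)))   # average natural indirect effect

assert abs(tau - (nde + nie)) < 1e-10    # decomposition tau = NDE + NIE
```

The decomposition holds sample-by-sample because $Y(1,M(1)) - Y(0,M(0))$ telescopes into the NDE and NIE contrasts through the common term $Y(1, M(0))$.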
2.2 Semiparametric multiply-robust estimators of NDE/NIE
Estimating NDE and NIE can be reduced to estimating the counterfactual mean $\psi(d, d') \equiv \mathbb{E}[Y(d, M(d'))]$ for $d, d' \in \{0, 1\}$. We make the following standard identification assumptions:
i. Consistency: if $D = d$, then $M(d) = M$ for all $d \in \{0,1\}$; while if $D = d$ and $M = m$, then $Y(d, m) = Y$ for all $d \in \{0,1\}$ and all $m$ in the support of $M$.
ii. Ignorability: $Y(d, m) \perp\!\!\!\perp D \mid X$, $Y(d, m) \perp\!\!\!\perp M \mid (D, X)$, $M(d) \perp\!\!\!\perp D \mid X$, and $Y(d, m) \perp\!\!\!\perp M(d') \mid X$, almost surely for all $d, d' \in \{0,1\}$ and all $m$. The first three conditions are, respectively, no unmeasured treatment-outcome, mediator-outcome and treatment-mediator confounding, whereas the fourth condition is often referred to as the "cross-world" condition. We provide more detailed comments on these four conditions in Appendix LABEL:app:ignore.
iii. Positivity: The propensity score $p_d(X) \equiv \mathbb{P}(D = d \mid X)$ is strictly bounded between $[c_0, 1 - c_0]$ for some constant $c_0 > 0$, almost surely for all $d$; $f(m \mid d, X)$, the conditional density (mass function when $M$ is discrete) of $M$ given $D = d$ and $X$, is strictly bounded between $[c_1, c_2]$ for some constants $c_1, c_2 > 0$, almost surely for all $m$ in the support of $M$ and all $d$.
Under the above assumptions, the causal parameter $\psi(d, d')$ for $d, d' \in \{0,1\}$ can be identified as either of the following three observed-data functionals:

$$\psi(d, d') = \int\!\!\int \mu(d, m, x)\, f(m \mid d', x)\, q(x)\, \mathrm{d}m\, \mathrm{d}x = \mathbb{E}\!\left[\frac{\mathbb{1}\{D = d\}}{p_d(X)} \frac{f(M \mid d', X)}{f(M \mid d, X)}\, Y\right] = \mathbb{E}\!\left[\frac{\mathbb{1}\{D = d'\}}{p_{d'}(X)}\, \mu(d, M, X)\right], \tag{1}$$

where $\mathbb{1}\{\cdot\}$ denotes the indicator function, $q(x)$ denotes the marginal density of $X$, and $\mu(d, m, X) \equiv \mathbb{E}[Y \mid D = d, M = m, X]$ is the outcome regression model, for which we also make the following standard boundedness assumption:
iv. $\mu(d, m, X)$ is also strictly bounded between $[-c_3, c_3]$ for some constant $c_3 > 0$.
Following the convention in the semiparametric causal inference literature, we call $(p_d, f, \mu)$ "nuisance functions". tchetgen2012semiparametric derived the EIF of $\psi(d, d')$: $\psi^{\mathrm{eif}}_{d,d'}(O) = S_{d,d'}(O) - \psi(d, d')$, where

$$S_{d,d'}(O) \equiv \frac{\mathbb{1}\{D = d\}\, f(M \mid d', X)}{p_d(X)\, f(M \mid d, X)}\left(Y - \mu(d, M, X)\right) + \frac{\mathbb{1}\{D = d'\}}{p_{d'}(X)}\left(\mu(d, M, X) - \eta(d, d', X)\right) + \eta(d, d', X), \tag{2}$$

with $\eta(d, d', X) \equiv \int \mu(d, m, X)\, f(m \mid d', X)\, \mathrm{d}m$.
The nuisance functions $p_d$, $f$ and $\mu$ appearing in (2) are unknown and generally high-dimensional. But with a sample of the observed data, one can construct the following generic sample-splitting multiply-robust estimator of $\psi(d, d')$:

$$\hat\psi(d, d') = \frac{1}{|\mathcal{D}_k|} \sum_{i \in \mathcal{D}_k} \hat S_{d,d'}(O_i), \tag{3}$$

where $\mathcal{D}_k$ is a subset of all data, and $\hat S_{d,d'}$ replaces the unknown nuisance functions in (2) by some generic estimators $(\hat p_d, \hat f, \hat\mu)$ computed using the remaining nuisance sample data, denoted as $\mathcal{D}_k^c$. Cross-fitting is then needed to recover the information lost due to sample splitting; see Algorithm 3.1. It is clear from (2) that $\hat\psi(d, d')$ is a consistent estimator of $\psi(d, d')$ as long as any two of $(\hat p_d, \hat f, \hat\mu)$ are consistent estimators of the corresponding true nuisance functions, hence the name "multiply-robust". Throughout this paper, we also assume:
v. Any nuisance function estimators $\hat p_d$, $\hat f$ and $\hat\mu$ are strictly bounded within the respective lower and upper bounds in Assumptions iii and iv.
To further ease notation, we define, for any $d \in \{0, 1\}$, the $L_2$-estimation errors $\|\hat p_d - p_d\|_2 \equiv \mathbb{E}[(\hat p_d(X) - p_d(X))^2]^{1/2}$, and analogously $\|\hat f - f\|_2$ and $\|\hat\mu - \mu\|_2$, where $\hat p_d - p_d$, $\hat f - f$ and $\hat\mu - \mu$ are point-wise estimation errors of the estimated nuisance functions. In defining the above $L_2$-estimation errors, we choose to take expectation with respect to (w.r.t.) the law of $X$ only for convenience, with no loss of generality by Assumptions iii and v.
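For intuition, the multiply-robust estimator (3) with two-fold cross-fitting can be sketched on a toy fully binary example. The cell-mean nuisance estimators below stand in for the DNN estimators used by DeepMed, and the data-generating model is our own illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

# Simulated data with binary D, M, X (illustration only).
X = rng.binomial(1, 0.5, size=n)
D = rng.binomial(1, 0.3 + 0.4 * X)
M = rng.binomial(1, 0.2 + 0.3 * D + 0.2 * X)
Y = 1.0 * D + 0.8 * M + 0.5 * X + rng.normal(size=n)

def fit_nuisances(idx):
    """Cell-mean estimates of p_d(x), f(1|d,x), mu(d,m,x) on the nuisance fold."""
    Xi, Di, Mi, Yi = X[idx], D[idx], M[idx], Y[idx]
    p = np.array([Di[Xi == x].mean() for x in (0, 1)])                # P(D=1|X=x)
    f = np.array([[Mi[(Di == d) & (Xi == x)].mean()
                   for x in (0, 1)] for d in (0, 1)])                 # P(M=1|D=d,X=x)
    mu = np.array([[[Yi[(Di == d) & (Mi == m) & (Xi == x)].mean()
                     for x in (0, 1)] for m in (0, 1)] for d in (0, 1)])
    return p, f, mu

def psi_mr(idx, nuis, d, dp):
    """Average of the multiply-robust score (3) over the estimation fold."""
    p, f, mu = nuis
    Xi, Di, Mi, Yi = X[idx], D[idx], M[idx], Y[idx]
    p_d  = p[Xi] if d == 1 else 1 - p[Xi]
    p_dp = p[Xi] if dp == 1 else 1 - p[Xi]
    f_d  = np.where(Mi == 1, f[d, Xi], 1 - f[d, Xi])
    f_dp = np.where(Mi == 1, f[dp, Xi], 1 - f[dp, Xi])
    mu_dM = mu[d, Mi, Xi]
    eta = mu[d, 1, Xi] * f[dp, Xi] + mu[d, 0, Xi] * (1 - f[dp, Xi])
    s = ((Di == d) / p_d * f_dp / f_d * (Yi - mu_dM)
         + (Di == dp) / p_dp * (mu_dM - eta) + eta)
    return s.mean()

# Two-fold cross-fitting: estimate on one half, fit nuisances on the other.
half = np.arange(n) < n // 2
folds = [(np.where(half)[0], np.where(~half)[0]),
         (np.where(~half)[0], np.where(half)[0])]
psi10 = np.mean([psi_mr(est, fit_nuisances(nuis), 1, 0) for est, nuis in folds])
psi00 = np.mean([psi_mr(est, fit_nuisances(nuis), 0, 0) for est, nuis in folds])
nde = psi10 - psi00
print(nde)   # true NDE in this toy model is 1.0
```

Swapping the roles of the two halves and averaging is exactly the information-recovery step that cross-fitting provides.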
To show that the cross-fit version of $\hat\psi(d, d')$ is semiparametric efficient for $\psi(d, d')$, we shall demonstrate under what conditions it is asymptotically linear with influence function $\psi^{\mathrm{eif}}_{d,d'}$ (newey1990semiparametric). The following proposition on the statistical properties of $\hat\psi(d, d')$ is a key step towards this objective.
Proposition 1.
Denote $\mathrm{Bias}(\hat\psi(d, d')) \equiv \mathbb{E}[\hat\psi(d, d') \mid \mathcal{D}_k^c] - \psi(d, d')$ as the bias of $\hat\psi(d, d')$ conditional on the nuisance sample $\mathcal{D}_k^c$. Under Assumptions i - v, $\mathrm{Bias}(\hat\psi(d, d'))$ is of second-order, i.e., bounded up to constants by products of pairs of nuisance estimation errors:

$$\left|\mathrm{Bias}(\hat\psi(d, d'))\right| \lesssim \|\hat p_d - p_d\|_2 \|\hat\mu - \mu\|_2 + \|\hat p_{d'} - p_{d'}\|_2 \|\hat f - f\|_2 + \|\hat f - f\|_2 \|\hat\mu - \mu\|_2. \tag{4}$$

Furthermore, if the RHS of (4) is $o(n^{-1/2})$, then

$$\hat\psi(d, d') - \psi(d, d') = \frac{1}{|\mathcal{D}_k|} \sum_{i \in \mathcal{D}_k} \psi^{\mathrm{eif}}_{d,d'}(O_i) + o_P(n^{-1/2}). \tag{5}$$
Although the above result is a direct consequence of the form of the EIF $\psi^{\mathrm{eif}}_{d,d'}$, we prove Proposition 1 in Appendix LABEL:app:bias for completeness.
Remark 2.
The total effect $\tau$ can be viewed as a special case, for which $d = d'$. Then $\psi^{\mathrm{eif}}_{d,d}$ corresponds to the nonparametric EIF of $\mathbb{E}[Y(d)]$:

$$\frac{\mathbb{1}\{D = d\}}{p_d(X)}\left(Y - \bar\mu(d, X)\right) + \bar\mu(d, X) - \mathbb{E}[Y(d)],$$

where $\bar\mu(d, X) \equiv \mathbb{E}[Y \mid D = d, X]$. Hence all the theoretical results in this paper are applicable to total effect estimation. Our framework can also be applied to all statistical functionals that satisfy a so-called "mixed-bias" property, characterized recently in rotnitzky2021characterization. This class includes the quadratic functional, which is important for uncertainty quantification in machine learning.
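In this special case the score collapses to the familiar augmented inverse-probability-weighting (AIPW) form, which can be sketched as follows. The synthetic data are our own illustration, and the true nuisance functions are plugged in purely for clarity (in practice, DeepMed cross-fits DNN estimates):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000

# Illustrative data: confounded binary treatment, continuous outcome.
X = rng.normal(size=n)
pD = 1 / (1 + np.exp(-X))            # true propensity score P(D=1|X)
D = rng.binomial(1, pD)
Y = D + X + rng.normal(size=n)       # so E[Y(1)] = 1 and E[Y(0)] = 0

def aipw(d, p_hat, mu_hat):
    """One-step AIPW estimator of E[Y(d)]; the d = d' reduction of (3)."""
    p_d = p_hat if d == 1 else 1 - p_hat
    return np.mean((D == d) / p_d * (Y - mu_hat) + mu_hat)

# True nuisances for illustration: mu(d, X) = E[Y | D=d, X] = d + X.
ate = aipw(1, pD, 1 + X) - aipw(0, pD, X)
print(ate)   # true total effect is 1.0
```

The first term corrects the outcome-regression plug-in by inverse-probability-weighted residuals, which is what yields the second-order bias of Proposition 1.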
3 Estimation and inference of NDE/NIE using DeepMed
We now introduce DeepMed, a method for mediation analysis with nuisance functions estimated by DNNs. By leveraging the second-order bias property of the multiply-robust estimators of NDE/NIE (Proposition 1), we will derive the statistical properties of DeepMed in this section. The DNN-based nuisance function estimators are denoted as $\hat p_d$, $\hat f$ and $\hat\mu$.
3.1 Details on DeepMed
First, we introduce the fully-connected feed-forward neural network with the rectified linear unit (ReLU) as the activation function for the hidden-layer neurons (FNN-ReLU), which will be used to estimate the nuisance functions. Then, we introduce an estimation procedure using $K$-fold cross-fitting with sample-splitting to avoid the Donsker-type empirical-process assumption on the nuisance functions, which, in general, is violated in high-dimensional setups. Finally, we provide the asymptotic statistical properties of the DNN-based estimators of $\tau$, NDE and NIE.
We denote the ReLU activation function as $\sigma(x) \equiv \max\{x, 0\}$ for any $x \in \mathbb{R}$. Given vectors $v = (v_1, \ldots, v_r)^\top$ and $y = (y_1, \ldots, y_r)^\top$, we denote the shifted activation $\sigma_v(y) \equiv (\sigma(y_1 - v_1), \ldots, \sigma(y_r - v_r))^\top$, with $\sigma$ acting on the vector component-wise.

Let $\mathcal{F}_{\mathrm{nn}}$ denote the class of the FNN-ReLU functions

$$f(x) = W_L \sigma_{v_L} \circ W_{L-1} \sigma_{v_{L-1}} \circ \cdots \circ W_1 \sigma_{v_1} \circ W_0\, x,$$

where $\circ$ is the composition operator, $L$ is the number of layers (i.e., depth) of the network, and for $l = 0, \ldots, L$, $W_l$ is a $d_{l+1} \times d_l$-dimensional weight matrix with $d_l$ being the number of neurons in the $l$-th layer (i.e., width) of the network, with $d_0 = p$ and $d_{L+1} = 1$, and $v_l$ is a $d_l$-dimensional shift vector. To avoid notation clutter, we concatenate all the network parameters as $(L, W)$ and simply take $d_1 = \cdots = d_L = W$. We also assume $f \in \mathcal{F}_{\mathrm{nn}}$ to be bounded: $\|f\|_\infty \le F$ for some universal constant $F > 0$. We may make the dependence on $L$, $W$, $F$ explicit by writing $\mathcal{F}_{\mathrm{nn}}$ as $\mathcal{F}_{\mathrm{nn}}(L, W, F)$.
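A minimal NumPy sketch of evaluating such an FNN-ReLU function follows; the output is truncated to $[-F, F]$ to enforce the sup-norm bound, and the particular depth, width and parameter values below are arbitrary illustrative choices:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def fnn_relu(x, Ws, vs, F=10.0):
    """Evaluate f(x) = W_L sigma_{v_L}( ... W_1 sigma_{v_1}(W_0 x) ... ),
    where sigma_v(y) = max(y - v, 0) component-wise, truncated to [-F, F]."""
    h = Ws[0] @ x                      # W_0 x : input layer
    for W, v in zip(Ws[1:], vs):       # apply sigma_{v_l}, then W_l
        h = W @ relu(h - v)
    return np.clip(h, -F, F)           # enforce the bound ||f||_inf <= F

# Hypothetical network: input dim p = 3, depth L = 2, equal width W = 4.
rng = np.random.default_rng(4)
p, width = 3, 4
Ws = [rng.normal(size=(width, p)),     # W_0: R^3 -> R^4
      rng.normal(size=(width, width)), # W_1: R^4 -> R^4
      rng.normal(size=(1, width))]     # W_2: R^4 -> R
vs = [rng.normal(size=width), rng.normal(size=width)]
y = fnn_relu(rng.normal(size=p), Ws, vs)
```

For instance, with `Ws = [np.eye(2), np.ones((1, 2))]` and `vs = [np.zeros(2)]`, the network computes the sum of positive parts of its two inputs.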
DeepMed estimates $\psi(d, d')$ by (3), with the nuisance functions estimated over $\mathcal{F}_{\mathrm{nn}}$ using the $K$-fold cross-fitting strategy, summarized in Algorithm 3.1 below; see also farbmacher2022causal. DeepMed takes the observed data as input and outputs the estimated total effect $\hat\tau$, NDE and NIE, together with their variance estimators.
Algorithm 1 DeepMed with $K$-fold cross-fitting
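The cross-fitting strategy of Algorithm 1 can be sketched generically as follows. Here `fit_nuisance` and `score` are hypothetical stand-ins for DeepMed's DNN nuisance training and multiply-robust score evaluation, and the toy usage merely estimates a simple mean:

```python
import numpy as np

rng = np.random.default_rng(5)

def cross_fit(data, K, fit_nuisance, score):
    """K-fold cross-fitting skeleton: for each fold k, fit nuisances on the
    out-of-fold sample, evaluate the score on fold k, then pool all folds."""
    n = len(data)
    folds = np.array_split(rng.permutation(n), K)
    scores = []
    for k in range(K):
        nuis_idx = np.concatenate([folds[j] for j in range(K) if j != k])
        eta_hat = fit_nuisance(data, nuis_idx)          # e.g., DNN estimates
        scores.append(score(data, folds[k], eta_hat))   # held-out evaluation
    s = np.concatenate(scores)
    return s.mean(), s.std(ddof=1) / np.sqrt(n)         # estimate and its SE

# Toy usage: "estimate" E[Y] with a plug-in score; the stand-in components
# mimic the interface of the nuisance-fitting and scoring steps.
y = rng.normal(loc=2.0, scale=1.0, size=10_000)
fit_mean = lambda d, idx: d[idx].mean()
plug_in = lambda d, idx, eta: (d[idx] - eta) + eta
psi_hat, se_hat = cross_fit(y, K=5, fit_nuisance=fit_mean, score=plug_in)
```

Pooling the held-out scores across all $K$ folds recovers full-sample efficiency while keeping the nuisance estimates independent of the fold on which they are evaluated.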