
DeepMed: Semiparametric Causal Mediation Analysis with Debiased Deep Learning

Siqi Xu
Department of Statistics and Actuarial Sciences
University of Hong Kong
Hong Kong SAR, China
[email protected]

Lin Liu
Institute of Natural Sciences, MOE-LSC,
School of Mathematical Sciences, CMA-Shanghai,
and SJTU-Yale Joint Center for Biostatistics and Data Science
Shanghai Jiao Tong University and Shanghai Artificial Intelligence Laboratory
Shanghai, China
[email protected]

Zhonghua Liu
Department of Biostatistics
Columbia University
New York, NY, USA
[email protected]

Co-corresponding authors, alphabetical order
Abstract

Causal mediation analysis can unpack the black box of causality and is therefore a powerful tool for disentangling causal pathways in biomedical and social sciences, and also for evaluating machine learning fairness. To reduce bias in estimating Natural Direct and Indirect Effects in mediation analysis, we propose a new method called $\mathsf{DeepMed}$ that uses deep neural networks (DNNs) to cross-fit the infinite-dimensional nuisance functions in the efficient influence functions. We obtain novel theoretical results that our $\mathsf{DeepMed}$ method (1) can achieve the semiparametric efficiency bound without imposing sparsity constraints on the DNN architecture and (2) can adapt to certain low-dimensional structures of the nuisance functions, significantly advancing the existing literature on DNN-based semiparametric causal inference. Extensive synthetic experiments are conducted to support our findings and also expose the gap between theory and practice. As a proof of concept, we apply $\mathsf{DeepMed}$ to analyze two real datasets on machine learning fairness and reach conclusions consistent with previous findings.

1 Introduction

Tremendous progress has been made over the past decade in deploying deep neural networks (DNNs) in real-world problems (krizhevsky2012imagenet; wolf2019huggingface; jumper2021highly; brown2022deep). Causal inference is no exception. In semiparametric causal inference, a series of seminal works (chen2020causal; chernozhukov2020adversarial; farrell2021deep) initiated the investigation of the statistical properties of causal effect estimators when the nuisance functions (the outcome regressions and propensity scores) are estimated by DNNs. However, a few limitations in the current literature need to be addressed before the theoretical results can be used to guide practice:

(1) Most recent works focus mainly on the total effect (chen2020causal; farrell2021deep). In many settings, however, more intricate causal parameters are of greater interest. In biomedical and social sciences, one is often interested in "mediation analysis", which decomposes the total effect into direct and indirect effects to unpack the underlying black-box causal mechanism (baron1986moderator). More recently, mediation analysis has also percolated into machine learning fairness. For instance, in the context of predicting recidivism risk, nabi2018fair argued that, for a "fair" algorithm, sensitive features such as race should have no direct effect on the predicted recidivism risk. If such direct effects can be accurately estimated, one can detect the potential unfairness of a machine learning algorithm. We will revisit such applications in Section LABEL:sec:real and Appendix LABEL:app:real.

(2) The statistical properties of DNN-based causal estimators in recent works mostly follow from several (recent) results on the convergence rates of DNN-based nonparametric regression estimators (suzuki2019adaptivity; schmidt2020nonparametric; tsuji2021estimation), with the limitation of relying on sparse DNN architectures. These theoretical properties are in turn evaluated only by relatively simple synthetic experiments that are not designed to generate the nearly infinite-dimensional nuisance functions considered by almost all of the above related works.

The above limitations raise the tantalizing question of whether the available statistical guarantees for DNN-based causal inference have practical relevance. In this work, we partially fill these gaps by developing a new method called $\mathsf{DeepMed}$ for semiparametric mediation analysis with DNNs. We focus on the Natural Direct/Indirect Effects (NDE/NIE) (robins1992identifiability; pearl2001direct) (defined in Section 2.1), but our results can also be applied to more general settings; see Remark 2. The $\mathsf{DeepMed}$ estimators leverage the "multiply-robust" property of the efficient influence function (EIF) of the NDE/NIE (tchetgen2012semiparametric; farbmacher2022causal) (see Proposition 1 in Section 2.2), together with the flexibility and superior predictive power of DNNs (see Section 3.1 and Algorithm 3.1). In particular, we make the following novel contributions that deepen our understanding of DNN-based semiparametric causal inference:

• On the theoretical side, we obtain new results showing that our $\mathsf{DeepMed}$ method can achieve the semiparametric efficiency bound without imposing sparsity constraints on the DNN architecture and can adapt to certain low-dimensional structures of the nuisance functions (see Section LABEL:sec:stat). Non-sparse DNN architectures are more commonly employed in practice (farrell2021deep), and low-dimensional structures of the nuisance functions can help avoid the curse of dimensionality. These two points, taken together, significantly advance our understanding of the statistical guarantees of DNN-based causal inference.

• More importantly, on the empirical side, in Section LABEL:sec:sim, we design sophisticated synthetic experiments to simulate nearly infinite-dimensional functions, which are much more complex than those in previous related works (chen2020causal; farrell2021deep; adcock2021gap). We emphasize that these nontrivial experiments could be of independent interest to the theory of deep learning beyond causal inference, further exposing the gap between deep learning theory and practice (adcock2021gap; gottschling2020troublesome); see Remark LABEL:beyond for an extended discussion. As a proof of concept, in Section LABEL:sec:real and Appendix LABEL:app:real, we also apply $\mathsf{DeepMed}$ to re-analyze two real-world datasets on algorithmic fairness and reach conclusions similar to those of related works.

• Finally, a user-friendly R package can be found at https://github.com/siqixu/DeepMed. Making such resources available helps enhance reproducibility, a widely recognized problem in all scientific disciplines, including (causal) machine learning (pineau2021improving; kaddour2022causal).

2 Definition, identification, and estimation of NDE and NIE

2.1 Definition of NDE and NIE

Throughout this paper, we denote $Y$ as the primary outcome of interest, $D$ as a binary treatment variable, $M$ as the mediator on the causal pathway from $D$ to $Y$, and $X\in[0,1]^{p}$ (or more generally, compactly supported in $\mathbb{R}^{p}$) as baseline covariates including all potential confounders. We denote the observed data vector as $O\equiv(X,D,M,Y)$. Let $M(d)$ denote the potential outcome for the mediator when setting $D=d$ and $Y(d,m)$ be the potential outcome of $Y$ under $D=d$ and $M=m$, where $d\in\{0,1\}$ and $m$ is in the support $\mathcal{M}$ of $M$. We define the average total (treatment) effect as $\tau_{tot}\coloneqq\mathsf{E}[Y(1,M(1))-Y(0,M(0))]$, the average NDE of the treatment $D$ on the outcome $Y$ when the mediator takes its natural potential outcome under $D=d$ as $\tau_{\mathsf{NDE}}(d)\coloneqq\mathsf{E}[Y(1,M(d))-Y(0,M(d))]$, and the average NIE of the treatment $D$ on the outcome $Y$ via the mediator $M$ as $\tau_{\mathsf{NIE}}(d)\coloneqq\mathsf{E}[Y(d,M(1))-Y(d,M(0))]$. We have the trivial decomposition $\tau_{tot}\equiv\tau_{\mathsf{NDE}}(d)+\tau_{\mathsf{NIE}}(d^{\prime})$ for $d\neq d^{\prime}$. In causal mediation analysis, the parameters of interest are $\tau_{\mathsf{NDE}}(d)$ and $\tau_{\mathsf{NIE}}(d)$.
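To make these definitions concrete, the following minimal NumPy simulation (a hypothetical linear data-generating process of our own, not from the paper) computes $\tau_{tot}$, $\tau_{\mathsf{NDE}}(0)$ and $\tau_{\mathsf{NIE}}(1)$ directly from simulated potential outcomes and confirms the decomposition $\tau_{tot}=\tau_{\mathsf{NDE}}(0)+\tau_{\mathsf{NIE}}(1)$:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10**6
X = rng.uniform(size=n)                 # baseline covariate
U = rng.normal(size=n)                  # shared mediator noise
M0 = 0.5 * X + U                        # potential mediator M(0)
M1 = 0.5 * X + 1.0 + U                  # potential mediator M(1): treatment shifts M
Y = lambda d, m: 2.0 * d + 1.5 * m + X  # potential outcome Y(d, m)

tau_tot  = np.mean(Y(1, M1) - Y(0, M0))   # total effect
tau_nde0 = np.mean(Y(1, M0) - Y(0, M0))   # NDE(0): mediator held at M(0)
tau_nie1 = np.mean(Y(1, M1) - Y(1, M0))   # NIE(1): effect through the mediator only
print(tau_tot, tau_nde0 + tau_nie1)       # both equal 2 + 1.5 = 3.5
```

Here the direct effect is the coefficient 2.0 on $d$, the indirect effect is 1.5 times the unit mediator shift, and the two sides of the decomposition agree exactly.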

2.2 Semiparametric multiply-robust estimators of NDE/NIE

Estimating $\tau_{\mathsf{NDE}}(d)$ and $\tau_{\mathsf{NIE}}(d)$ can be reduced to estimating $\phi(d,d^{\prime})\coloneqq\mathsf{E}[Y(d,M(d^{\prime}))]$ for $d,d^{\prime}\in\{0,1\}$. We make the following standard identification assumptions:

i. Consistency: if $D=d$, then $M=M(d)$ for all $d\in\{0,1\}$; and if $D=d$ and $M=m$, then $Y=Y(d,m)$ for all $d\in\{0,1\}$ and all $m$ in the support of $M$.

ii. Ignorability: $Y(d,m)\perp D|X$, $Y(d,m)\perp M|X,D$, $M(d)\perp D|X$, and $Y(d,m)\perp M(d^{\prime})|X$, almost surely for all $d\in\{0,1\}$ and all $m\in\mathcal{M}$. The first three conditions are, respectively, no unmeasured treatment-outcome, mediator-outcome and treatment-mediator confounding, whereas the fourth condition is often referred to as the "cross-world" condition. We provide more detailed comments on these four conditions in Appendix LABEL:app:ignore.

iii. Positivity: the propensity score $a(d|X)\equiv\mathsf{Pr}(D=d|X)\in(c,C)$ for some constants $0<c\leq C<1$, almost surely for all $d\in\{0,1\}$; $f(m|X,d)$, the conditional density (mass) function of $M=m$ (when $M$ is discrete) given $X$ and $D=d$, is strictly bounded between $[\underline{\rho},\bar{\rho}]$ for some constants $0<\underline{\rho}\leq\bar{\rho}<\infty$, almost surely for all $m$ in $\mathcal{M}$ and all $d\in\{0,1\}$.

Under the above assumptions, the causal parameter $\phi(d,d^{\prime})$ for $d,d^{\prime}\in\{0,1\}$ can be identified as any of the following three observed-data functionals:

\phi(d,d^{\prime}) \equiv \mathsf{E}\left[\frac{\mathbbm{1}\{D=d\}\,f(M|X,d^{\prime})\,Y}{a(d|X)\,f(M|X,d)}\right] \equiv \mathsf{E}\left[\frac{\mathbbm{1}\{D=d^{\prime}\}}{a(d^{\prime}|X)}\,\mu(X,d,M)\right] \equiv \int \mu(x,d,m)\,f(m|x,d^{\prime})\,p(x)\,\mathrm{d}m\,\mathrm{d}x, \quad (1)

where πŸ™{β‹…}\mathbbm{1}\{\cdot\} denotes the indicator function, p(x)p(x) denotes the marginal density of XX, and ΞΌ(x,d,m)≔𝖀[Y|X=x,D=d,M=m]\mu(x,d,m)\coloneqq\mathsf{E}[Y|X=x,D=d,M=m] is the outcome regression model, for which we also make the following standard boundedness assumption:

iv. $\mu(x,d,m)$ is also strictly bounded between $[-R,R]$ for some constant $R>0$.
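As a sanity check on (1), the sketch below simulates a simple model with a binary mediator in which all three nuisance functions are known in closed form, and verifies that the three observed-data functionals agree. The data-generating process is hypothetical and chosen only so that Assumptions i–iv hold:

```python
import numpy as np

def expit(u):
    return 1.0 / (1.0 + np.exp(-u))

rng = np.random.default_rng(1)
n = 10**6
X = rng.uniform(size=n)
a1 = 0.3 + 0.4 * X                       # a(1|x) = Pr(D=1|x), bounded in (0.3, 0.7)
D = rng.binomial(1, a1)
pm = expit(X + D - 0.5)                  # Pr(M=1|x,d) for the binary mediator
M = rng.binomial(1, pm)
mu = lambda x, d, m: 1.0 + d + m + x     # true outcome regression mu(x,d,m)
Y = mu(X, D, M) + rng.normal(size=n)

d, dp = 1, 0                             # target phi(d, d') with d=1, d'=0
a_d  = a1 if d == 1 else 1 - a1          # a(d|X)
a_dp = a1 if dp == 1 else 1 - a1         # a(d'|X)
f = lambda m, dd: np.where(m == 1, expit(X + dd - 0.5), 1 - expit(X + dd - 0.5))

v1 = np.mean((D == d) * f(M, dp) * Y / (a_d * f(M, d)))          # pure weighting
v2 = np.mean((D == dp) / a_dp * mu(X, d, M))                     # regression + weighting
v3 = np.mean(mu(X, d, 1) * f(1, dp) + mu(X, d, 0) * f(0, dp))    # fully integrated
print(v1, v2, v3)                        # agree up to Monte Carlo error
```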

Following the convention in the semiparametric causal inference literature, we call $a,f,\mu$ the "nuisance functions". tchetgen2012semiparametric derived the EIF of $\phi(d,d^{\prime})$: $\mathsf{EIF}_{d,d^{\prime}}\equiv\psi_{d,d^{\prime}}(O)-\phi(d,d^{\prime})$, where

\psi_{d,d^{\prime}}(O) = \frac{\mathbbm{1}\{D=d\}\cdot f(M|X,d^{\prime})}{a(d|X)\cdot f(M|X,d)}\left(Y-\mu(X,d,M)\right) + \left(1-\frac{\mathbbm{1}\{D=d^{\prime}\}}{a(d^{\prime}|X)}\right)\int_{m\in\mathcal{M}}\mu(X,d,m)\,f(m|X,d^{\prime})\,\mathrm{d}m + \frac{\mathbbm{1}\{D=d^{\prime}\}}{a(d^{\prime}|X)}\,\mu(X,d,M). \quad (2)
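For concreteness, a direct transcription of (2) into code for a binary mediator (so the integral over $\mathcal{M}$ reduces to a two-point sum) might look as follows. The callables `a_hat`, `f_hat`, `mu_hat` are placeholders of ours for generic fitted nuisance estimators, not an interface defined in the paper:

```python
import numpy as np

def psi(Y, D, M, X, d, dp, a_hat, f_hat, mu_hat):
    """Uncentered EIF psi_{d,d'}(O) of (2) for a *binary* mediator.
    a_hat(d, X), f_hat(m, X, d), mu_hat(X, d, m) are generic fitted
    nuisance estimators (e.g. DNNs), vectorized over observations."""
    w_md = f_hat(M, X, dp) / (a_hat(d, X) * f_hat(M, X, d))
    term1 = (D == d) * w_md * (Y - mu_hat(X, d, M))
    # integral of mu(X,d,m) f(m|X,d') dm, here a two-point sum over m in {0,1}
    eta = mu_hat(X, d, 1) * f_hat(1, X, dp) + mu_hat(X, d, 0) * f_hat(0, X, dp)
    w_dp = (D == dp) / a_hat(dp, X)
    return term1 + (1 - w_dp) * eta + w_dp * mu_hat(X, d, M)
```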

The nuisance functions $\mu(x,d,m)$, $a(d|x)$ and $f(m|x,d)$ appearing in $\psi_{d,d^{\prime}}(o)$ are unknown and generally high-dimensional. But with a sample $\mathcal{D}\equiv\{O_{j}\}_{j=1}^{N}$ of the observed data, based on $\psi_{d,d^{\prime}}(o)$, one can construct the following generic sample-splitting multiply-robust estimator of $\phi(d,d^{\prime})$:

\widetilde{\phi}(d,d^{\prime}) = \frac{1}{n}\sum_{i\in\mathcal{D}_{n}}\widetilde{\psi}_{d,d^{\prime}}(O_{i}), \quad (3)

where π’Ÿn≑{Oi}i=1n\mathcal{D}_{n}\equiv\{O_{i}\}_{i=1}^{n} is a subset of all NN data, and ψ~d,dβ€²(o)\widetilde{\psi}_{d,d^{\prime}}(o) replaces the unknown nuisance functions a,f,ΞΌa,f,\mu in ψd,dβ€²(o)\psi_{d,d^{\prime}}(o) by some generic estimators a~,f~,ΞΌ~\widetilde{a},\widetilde{f},\widetilde{\mu} computed using the remaining Nβˆ’nN-n nuisance sample data, denoted as π’ŸΞ½\mathcal{D}_{\nu}. Cross-fit is then needed to recover the information lost due to sample splitting; see Algorithm 3.1. It is clear from (2) that Ο•~(d,dβ€²)\widetilde{\phi}(d,d^{\prime}) is a consistent estimator of Ο•(d,dβ€²)\phi(d,d^{\prime}) as long as any two of a~,f~,ΞΌ~\widetilde{a},\widetilde{f},\widetilde{\mu} are consistent estimators of the corresponding true nuisance functions, hence the name β€œmultiply-robust”. Throughout this paper, we take n≍Nβˆ’nn\asymp N-n and assume:

v. All nuisance function estimators are strictly bounded within the respective lower and upper bounds of $a,f,\mu$.
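A single-split version of (3), referenced above, could then be assembled as below. The parametric scikit-learn learners are hedged stand-ins for the DNN estimators used by $\mathsf{DeepMed}$, `X` is a one-dimensional covariate, `M` is binary, and `psi` is the sketch following (2); the helper name `split_estimate` is ours:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

def split_estimate(Y, D, M, X, d, dp, rng):
    """One split of (3): fit nuisances on the nuisance half D_nu,
    then average the uncentered EIF over the estimation half D_n."""
    n = len(Y)
    idx = rng.permutation(n)
    nu, est = idx[: n // 2], idx[n // 2:]
    a_fit = LogisticRegression().fit(X[nu, None], D[nu])
    f_fit = LogisticRegression().fit(np.c_[X[nu], D[nu]], M[nu])
    m_fit = LinearRegression().fit(np.c_[X[nu], D[nu], M[nu]], Y[nu])

    # wrap the fitted models as the nuisance callables expected by psi
    a_hat = lambda dd, x: a_fit.predict_proba(x[:, None])[:, 1 if dd == 1 else 0]
    f1 = lambda x, dd: f_fit.predict_proba(np.c_[x, np.full_like(x, dd)])[:, 1]
    f_hat = lambda m, x, dd: np.where(m == 1, f1(x, dd), 1 - f1(x, dd))
    mu_hat = lambda x, dd, m: m_fit.predict(
        np.c_[x, np.full_like(x, dd), np.broadcast_to(m, x.shape)])

    vals = psi(Y[est], D[est], M[est], X[est], d, dp, a_hat, f_hat, mu_hat)
    return vals.mean(), vals
```

For example, `phi_hat, vals = split_estimate(Y, D, M, X, 1, 0, np.random.default_rng(0))` estimates $\phi(1,0)$ on one split.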

To further ease notation, we define, for any $d\in\{0,1\}$: $r_{a,d}\coloneqq\left\{\int\delta_{a,d}(x)^{2}\,\mathrm{d}F(x)\right\}^{1/2}$, $r_{f,d}\coloneqq\left\{\int\delta_{f,d}(x,m)^{2}\,\mathrm{d}F(x,m|d=0)\right\}^{1/2}$, and $r_{\mu,d}\coloneqq\left\{\int\delta_{\mu,d}(x,m)^{2}\,\mathrm{d}F(x,m|d=0)\right\}^{1/2}$, where $\delta_{a,d}(x)\coloneqq\widetilde{a}(d|x)-a(d|x)$, $\delta_{f,d}(x,m)\coloneqq\widetilde{f}(m|x,d)-f(m|x,d)$ and $\delta_{\mu,d}(x,m)\coloneqq\widetilde{\mu}(x,d,m)-\mu(x,d,m)$ are the point-wise estimation errors of the estimated nuisance functions. In defining the above $L_{2}$-estimation errors, we choose to take the expectation with respect to (w.r.t.) the law $F(m,x|d=0)$ only for convenience, with no loss of generality by Assumptions iii and v.

To show that the cross-fit version of $\widetilde{\phi}(d,d^{\prime})$ is semiparametric efficient for $\phi(d,d^{\prime})$, we shall demonstrate under what conditions $\sqrt{n}(\widetilde{\phi}(d,d^{\prime})-\phi(d,d^{\prime}))\overset{\mathcal{L}}{\rightarrow}\mathcal{N}(0,\mathsf{E}[\mathsf{EIF}_{d,d^{\prime}}^{2}])$ (newey1990semiparametric). The following proposition on the statistical properties of $\widetilde{\phi}(d,d^{\prime})$ is a key step towards this objective.

Proposition 1.

Denote by $\mathrm{Bias}(\widetilde{\phi}(d,d^{\prime}))\coloneqq\mathsf{E}[\widetilde{\phi}(d,d^{\prime})-\phi(d,d^{\prime})|\mathcal{D}_{\nu}]$ the bias of $\widetilde{\phi}(d,d^{\prime})$ conditional on the nuisance sample $\mathcal{D}_{\nu}$. Under Assumptions i – v, $\mathrm{Bias}(\widetilde{\phi}(d,d^{\prime}))$ is of second order:

|\mathrm{Bias}(\widetilde{\phi}(d,d^{\prime}))| \lesssim \max\left\{r_{a,d}\cdot r_{f,d},\ \max_{d^{\prime\prime}\in\{0,1\}} r_{f,d^{\prime\prime}}\cdot r_{\mu,d},\ r_{a,d}\cdot r_{\mu,d}\right\}. \quad (4)

Furthermore, if the RHS of (4) is $o(n^{-1/2})$, then

\sqrt{n}\left(\widetilde{\phi}(d,d^{\prime})-\phi(d,d^{\prime})\right) = \frac{1}{\sqrt{n}}\sum_{i=1}^{n}\left(\psi_{d,d^{\prime}}(O_{i})-\phi(d,d^{\prime})\right) + o(1) \overset{d}{\rightarrow} \mathcal{N}\left(0,\mathsf{E}\left[\mathsf{EIF}_{d,d^{\prime}}^{2}\right]\right). \quad (5)

Although the above result is a direct consequence of the EIF $\psi_{d,d^{\prime}}(O)$, we prove Proposition 1 in Appendix LABEL:app:bias for completeness.
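In practice, (5) licenses a standard Wald confidence interval: estimate $\mathsf{E}[\mathsf{EIF}_{d,d^{\prime}}^{2}]$ by the sample variance of the $\psi$ evaluations and scale by $\sqrt{n}$. A minimal sketch (the helper name `wald_ci` is ours, taking the $\psi$ values produced, e.g., by the split sketch above):

```python
import numpy as np

def wald_ci(psi_vals, z=1.96):
    """Point estimate (3) and a Wald interval based on (5), estimating
    E[EIF^2] by the sample variance of the uncentered EIF evaluations."""
    n = len(psi_vals)
    phi_hat = psi_vals.mean()
    se = psi_vals.std(ddof=1) / np.sqrt(n)
    return phi_hat, (phi_hat - z * se, phi_hat + z * se)
```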

Remark 2.

The total effect $\tau_{tot}=\phi(1,1)-\phi(0,0)$ can be viewed as a special case with $d=d^{\prime}$ in $\phi(d,d^{\prime})$. Then $\mathsf{EIF}_{d,d}\equiv\mathsf{EIF}_{d}$ corresponds to the nonparametric EIF of $\phi(d,d)\equiv\phi(d)\equiv\mathsf{E}[Y(d,M(d))]$:

𝖀𝖨π–₯d=ψd(O)βˆ’Ο•(d)Β with ψd(O)=πŸ™{D=d}a(d|X)Y+(1βˆ’πŸ™{D=d}a(d|X))ΞΌ(X,d),\displaystyle\mathsf{EIF}_{d}=\psi_{d}(O)-\phi(d)\text{ with }\psi_{d}(O)=\frac{\mathbbm{1}\{D=d\}}{a(d|X)}Y+\left(1-\frac{\mathbbm{1}\{D=d\}}{a(d|X)}\right)\mu(X,d),

where $\mu(x,d)\coloneqq\mathsf{E}[Y|X=x,D=d]$. Hence all the theoretical results in this paper are applicable to total effect estimation. Our framework can also be applied to all statistical functionals that satisfy the so-called "mixed-bias" property, characterized recently in rotnitzky2021characterization. This class includes the quadratic functional, which is important for uncertainty quantification in machine learning.
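For comparison with the mediation functional, the uncentered EIF $\psi_{d}$ above is the familiar augmented inverse-probability-weighted (AIPW) score. A one-function sketch, with `a_hat` and `mu_hat` again placeholder fitted nuisances as in the earlier sketches:

```python
def psi_total(Y, D, X, d, a_hat, mu_hat):
    """Uncentered EIF psi_d(O) for phi(d) = E[Y(d, M(d))] (AIPW form),
    where mu_hat(x, d) estimates E[Y | X=x, D=d]."""
    w = (D == d) / a_hat(d, X)
    return w * Y + (1 - w) * mu_hat(X, d)
```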

3 Estimation and inference of NDE/NIE using DeepMed

We now introduce $\mathsf{DeepMed}$, a method for mediation analysis with nuisance functions estimated by DNNs. By leveraging the second-order bias property of the multiply-robust estimators of the NDE/NIE (Proposition 1), we derive the statistical properties of $\mathsf{DeepMed}$ in this section. The DNN-based nuisance function estimators are denoted as $\widehat{a},\widehat{f},\widehat{\mu}$.

3.1 Details on DeepMed

First, we introduce the fully-connected feed-forward neural network with the rectified linear unit (ReLU) activation function for the hidden-layer neurons (FNN-ReLU), which will be used to estimate the nuisance functions. Then, we introduce an estimation procedure using $V$-fold cross-fitting with sample splitting to avoid Donsker-type empirical-process assumptions on the nuisance functions, which are in general violated in high-dimensional setups. Finally, we provide the asymptotic statistical properties of the DNN-based estimators of $\tau_{tot}$, $\tau_{\mathsf{NDE}}(d)$ and $\tau_{\mathsf{NIE}}(d)$.

We denote the ReLU activation function as $\sigma(u)\coloneqq\max(u,0)$ for any $u\in\mathbb{R}$. Given vectors $x,b$, we denote $\sigma_{b}(x)\coloneqq\sigma(x-b)$, with $\sigma$ acting on the vector $x-b$ component-wise.

Let $\mathcal{F}_{\mathrm{nn}}$ denote the class of FNN-ReLU functions

\mathcal{F}_{\mathrm{nn}} \coloneqq \left\{f:\mathbb{R}^{p}\rightarrow\mathbb{R};\ f(x)=W^{(L)}\sigma_{b^{(L)}}\circ\cdots\circ W^{(1)}\sigma_{b^{(1)}}(x)\right\},

where $\circ$ is the composition operator, $L$ is the number of layers (i.e. depth) of the network, and, for $l=1,\cdots,L$, $W^{(l)}$ is a $K_{l+1}\times K_{l}$-dimensional weight matrix with $K_{l}$ being the number of neurons in the $l$-th layer (i.e. width) of the network, with $K_{1}=p$ and $K_{L+1}=1$, and $b^{(l)}$ is a $K_{l}$-dimensional vector. To avoid notational clutter, we concatenate all the network parameters as $\Theta=(W^{(l)},b^{(l)},l=1,\cdots,L)$ and simply take $K_{2}=\cdots=K_{L}=K$. We also assume $\Theta$ to be bounded: $\|\Theta\|_{\infty}\leq B$ for some universal constant $B>0$. We may make the dependence on $L$, $K$, $B$ explicit by writing $\mathcal{F}_{\mathrm{nn}}$ as $\mathcal{F}_{\mathrm{nn}}(L,K,B)$.
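A PyTorch rendering of $\mathcal{F}_{\mathrm{nn}}(L,K,B)$ might look as follows; this is our own sketch, implementing the shifted-ReLU-then-linear composition exactly as displayed and enforcing $\|\Theta\|_{\infty}\leq B$ by clamping after each optimizer step:

```python
import torch
import torch.nn as nn

class FNNReLU(nn.Module):
    """f(x) = W^(L) sigma_{b^(L)} o ... o W^(1) sigma_{b^(1)}(x), with
    sigma_b(u) = relu(u - b); K_1 = p, K_2 = ... = K_L = K, K_{L+1} = 1."""
    def __init__(self, p, L, K):
        super().__init__()
        dims = [p] + [K] * (L - 1)
        self.b = nn.ParameterList([nn.Parameter(torch.zeros(k)) for k in dims])
        self.W = nn.ParameterList([
            nn.Parameter(0.1 * torch.randn(k_out, k_in))
            for k_in, k_out in zip(dims, dims[1:] + [1])
        ])

    def forward(self, x):                    # x: (n, p)
        for b, W in zip(self.b, self.W):
            x = torch.relu(x - b) @ W.T      # apply sigma_{b^(l)}, then W^(l)
        return x.squeeze(-1)

    @torch.no_grad()
    def clamp_(self, B):
        """Project parameters back to ||Theta||_inf <= B after a step."""
        for theta in self.parameters():
            theta.clamp_(-B, B)
```

For instance, `net = FNNReLU(p=10, L=4, K=64)`, with `net.clamp_(B)` called after each `optimizer.step()`, keeps the fitted network inside the bounded class.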

$\mathsf{DeepMed}$ estimates $\tau_{tot},\tau_{\mathsf{NDE}}(d),\tau_{\mathsf{NIE}}(d)$ by (3), with the nuisance functions $a,f,\mu$ estimated over $\mathcal{F}_{\mathrm{nn}}$ using the $V$-fold cross-fitting strategy summarized in Algorithm 3.1 below; also see farbmacher2022causal. $\mathsf{DeepMed}$ inputs the observed data $\mathcal{D}\equiv\{O_{i}\}_{i=1}^{N}$ and outputs the estimated total effect $\widehat{\tau}_{tot}$, NDE $\widehat{\tau}_{\mathsf{NDE}}(d)$ and NIE $\widehat{\tau}_{\mathsf{NIE}}(d)$, together with their variance estimators $\widehat{\sigma}_{tot}^{2}$, $\widehat{\sigma}^{2}_{\mathsf{NDE}}(d)$ and $\widehat{\sigma}^{2}_{\mathsf{NIE}}(d)$.
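Before stating the algorithm, the cross-fitting loop it implements can be sketched schematically (our sketch, not the R package's code): each fold serves once as the estimation sample while the nuisances are fit on the remaining folds, and the fold-wise EIF evaluations are pooled into point and variance estimates. The callables `fit_nuisances` (returning `(a_hat, f_hat, mu_hat)`) and `psi` (as in the sketch following (2)) are assumed:

```python
import numpy as np

def cross_fit(Y, D, M, X, d, dp, V, fit_nuisances, psi):
    """V-fold cross-fitting: train nuisances on the other V-1 folds,
    evaluate the uncentered EIF on the held-out fold, then pool."""
    n = len(Y)
    folds = np.array_split(np.random.default_rng(0).permutation(n), V)
    psi_vals = np.empty(n)
    for est in folds:
        nu = np.setdiff1d(np.arange(n), est)      # nuisance sample for this fold
        a_hat, f_hat, mu_hat = fit_nuisances(Y[nu], D[nu], M[nu], X[nu])
        psi_vals[est] = psi(Y[est], D[est], M[est], X[est], d, dp,
                            a_hat, f_hat, mu_hat)
    phi_hat = psi_vals.mean()                     # cross-fitted estimate of phi(d, d')
    var_hat = psi_vals.var(ddof=1)                # plug-in estimate of E[EIF^2]
    return phi_hat, var_hat
```

Point estimates of $\tau_{\mathsf{NDE}}(d)$ and $\tau_{\mathsf{NIE}}(d)$ then follow by differencing the cross-fitted $\widehat{\phi}(d,d^{\prime})$ over the relevant $(d,d^{\prime})$ pairs.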

Algorithm 1: $\mathsf{DeepMed}$ with $V$-fold cross-fitting

1: Choose some integer $V$ (usually $V\in\{2,3,\cdots,$