

Doubly Robust Proximal Causal Inference under Confounded Outcome-Dependent Sampling

Kendrick Qijun Li [email protected] Department of Biostatistics, University of Michigan,
1415 Washington Heights, Ann Arbor, Michigan, U.S.A.
   Xu Shi [email protected] Department of Biostatistics, University of Michigan,
1415 Washington Heights, Ann Arbor, Michigan, U.S.A.
   Wang Miao [email protected] Department of Probability and Statistics, Peking University,
5 Yiheyuan Rd, Haidian District, Beijing, China
      Eric Tchetgen Tchetgen [email protected] Department of Statistics and Data Science, The Wharton School, University of Pennsylvania,
265 South 37th Street, Philadelphia, Pennsylvania, U.S.A.
Abstract

Unmeasured confounding and selection bias are often of concern in observational studies and may invalidate a causal analysis if not appropriately accounted for. Under outcome-dependent sampling, a latent factor that has causal effects on the treatment, outcome, and sample selection process may cause both unmeasured confounding and selection bias, rendering standard causal parameters unidentifiable without additional assumptions. Under an odds ratio model for the treatment effect, Li et al. (2022) established both proximal identification and estimation of causal effects by leveraging a pair of negative control variables as proxies of latent factors at the source of both confounding and selection bias. However, their approach relies exclusively on the existence and correct specification of a so-called treatment confounding bridge function, a model that restricts the treatment assignment mechanism. In this article, we propose estimation under the odds ratio model that is doubly robust with respect to two nuisance functions – a treatment confounding bridge function and an outcome confounding bridge function that restricts the outcome law – such that our estimator is consistent and asymptotically normal if either bridge function model is correctly specified, without knowing which. Thus, our proposed doubly robust estimator is potentially more robust than that of Li et al. (2022). Our simulations confirm that the proposed proximal estimators of an odds ratio causal effect can adequately account for both residual confounding and selection bias under stated conditions, with well-calibrated confidence intervals in a wide range of scenarios where standard methods generally fail to be consistent, and that the proposed doubly robust estimator remains consistent when at least one confounding bridge function is correctly specified.

keywords:
endogenous selection bias; kernel machine learning; proximal causal inference; semiparametric estimation.
journal: Biometrika

1 Introduction

Unmeasured confounding and selection bias are two ubiquitous challenges in observational studies, which may lead to biased causal effect estimates and misleading causal conclusions (Hernán & Robins, 2020). Lipsitch et al. (2010) reviewed and formalized the use of negative control variables as tools to detect the presence of unmeasured confounding in epidemiological research. Specifically, they identified two types of negative control variables: negative control exposures (NCEs), known a priori to have no causal effect on the outcome, and negative control outcomes (NCOs), known a priori not to be causally impacted by the treatment. Such variables may be viewed as valid proxies of unmeasured confounders to the extent that they are associated with the latter. Tchetgen Tchetgen (2014) and Flanders et al. (2017) proposed regression-type calibration for unmeasured confounding adjustment using NCOs and NCEs, respectively, under fairly strong parametric restrictions. More recently, a double negative control approach that uses NCOs and NCEs jointly to correct for unmeasured confounding has been developed, which achieves nonparametric identification of the average treatment effect (Miao et al., 2018a); see Shi et al. (2020b) for a recent review of negative control methods in epidemiology. Methods for identification, estimation and inference about causal parameters that leverage such proxies to address unmeasured confounding bias have been referred to as “proximal causal inference” (Tchetgen Tchetgen et al., 2020).

The literature on proximal causal inference is fast growing. Miao et al. (2018b) introduced an outcome confounding bridge function approach to identify and estimate the average treatment effect by leveraging both NCE and NCO variables, while Cui et al. (2020) and Deaner (2018) introduced identification via a treatment confounding bridge function. Cui et al. (2020) further developed a doubly robust approach which can identify and consistently estimate the average treatment effect if either a treatment or an outcome confounding bridge function exists and can be estimated consistently. Ghassami et al. (2021) and Kallus et al. (2021) concurrently proposed a cross-fitting minimax kernel learning approach for nonparametric estimation of the confounding bridge functions, where the efficient influence function has a mixed bias structure (Rotnitzky et al., 2021), such that $\sqrt{n}$-consistent estimation of the average treatment effect remains possible even when both confounding bridge functions are estimated at considerably slower than $\sqrt{n}$ rates.

Despite the rapid growth in methods to address confounding bias, few have considered situations in which confounding bias and selection bias coexist. Selection bias, also sometimes called endogenous selection bias or collider stratification bias (Greenland, 2003; Elwert & Winship, 2014; Cole et al., 2010; Hernán et al., 2004), arises when the analysis conditions on selection that may be induced by both the primary treatment and outcome variables. Selection bias naturally occurs in outcome-dependent sampling designs that are widely used in epidemiological and econometric research to reduce cost and effort, such as the case-control design (Breslow & Day, 1980; Pearce, 2018), case-cohort design (Prentice, 1986), choice-based design (Manski & McFadden, 1981), test-negative design (Jackson & Nelson, 2013; Sullivan et al., 2016), and retrospective cohort designs with electronic health records (EHR) data (Streeter et al., 2017). Similar to confounding bias, selection bias potentially induces spurious associations between the treatment and outcome variables of primary scientific interest, even under the null hypothesis of no causal effect. Existing methods to adjust for selection bias include explicit modeling of the sample selection mechanism (Heckman, 1979; Heckman et al., 1998; Cuddeback et al., 2004), matching (Heckman et al., 1996, 1998), difference-in-differences models (Heckman et al., 1998) and inverse probability weighting (Hernán et al., 2004; Mansournia & Altman, 2016). However, when a common latent factor causes the treatment, the outcome and the selection mechanism, these methods are not applicable and the causal effect cannot generally be identified. Gabriel et al. (2020) referred to such a sampling design as “confounded outcome-dependent sampling”. They derived causal bounds for the average treatment effect, but these bounds are often too wide to be useful in practice, which highlights the challenge of identification. Didelez et al. (2010), Bareinboim & Pearl (2012) and Bareinboim et al. (2014) studied graphical conditions for recoverability of causal effects on the additive and odds ratio scales under different forms of selection bias, including outcome-dependent sampling. They concluded that neither causal effect is identifiable if selection depends on both the outcome and unmeasured confounders.

Recently, Li et al. (2022) studied inference about an odds ratio model encoding a treatment’s association with an outcome of interest conditional on both measured and unmeasured confounders under confounded outcome-dependent sampling, leveraging a pair of negative control variables. The consistency of their estimator, which we refer to as the proximal inverse probability weighted (PIPW) estimator, requires consistently estimating a treatment confounding bridge function which restricts the treatment assignment mechanism. However, in practice, it may be difficult to posit a suitable parametric model for the treatment confounding bridge function, which may limit the applicability of their approach.

In this paper, we develop semiparametric inference for the conditional odds ratio estimand of Li et al. (2022). We present a new identification result and propose the proximal outcome regression (POR) estimator of the conditional odds ratio, which relies on the existence and correct specification of an outcome confounding bridge function that restricts the outcome law through the conditional odds of the outcome given the treatment and confounders. We further introduce a doubly robust closed-form expression for the conditional odds ratio, based on which we propose a proximal doubly robust (PDR) estimator that is consistent if either the treatment or the outcome confounding bridge function is consistently estimated, and thus has improved robustness against model misspecification of the nuisance functions compared with PIPW and POR. We demonstrate the performance of the proposed estimators through comprehensive simulations. Throughout, we relegate all proofs to Section S1 of the Supplementary Materials.

[Figure 1 near here: three directed acyclic graphs with nodes $A$ (treatment), $Y$ (outcome), $(U,X)$ (confounders), $S=1$ (selection), $Z$ (treatment proxy) and $W$ (outcome proxy); panels (a), (b) and (c).]

Figure 1: (a) The directed acyclic graph of the confounded outcome-dependent sampling in view. (b) and (c) show two scenarios where the odds ratio can be identified by leveraging additional treatment proxy ($Z$) and outcome proxy ($W$) variables.

2 Identification and estimation under a homogeneous odds ratio model

2.1 Notation and Setup

Suppose the data contain a sample of $n$ independently and identically distributed observations drawn from the population of interest, referred to as the “target population”. For each observation in the target population, let $A$ and $Y$ be the treatment and outcome of primary interest, respectively, assumed to be binary for the time being. Let $S$ be a binary indicator for selection into the study sample, such that the available data only include observations with $S=1$. Let $X$ be a vector of measured pre-treatment covariates that may be associated with $A$, $Y$ and $S$. Suppose that in addition to $X$, there exist pre-treatment latent factors, denoted $U$, that may cause $A$, $Y$ and $S$. Figure 1(a) shows the directed acyclic graph (DAG) for the causal relationships between the above variables. Similar to Li et al. (2022) Section 2.5, we assume the following model, where $\beta_0$, the conditional log odds ratio given $(U,X)$, encodes the treatment effect of primary interest.

Model 2.1 (Homogeneous odds ratio model; Li et al. (2022) Assumption 3').

\mathrm{logit}\,P(Y=1\mid A,U,X)=\beta_{0}A+\eta(U,X),

where $\mathrm{logit}(x)=\log\{x/(1-x)\}$ denotes the logit function and $\eta$ is an unknown real-valued function. Model 2.1 describes a semiparametric logistic regression model that assumes a homogeneous association between $A$ and $Y$ on the odds ratio scale, across strata of all measured and latent factors $(U,X)$. Model 2.1 encodes the structural model

\mathrm{logit}\,P\{Y(a)=1\mid U,X\}=\beta_{0}a+\eta(U,X),\qquad a=0,1,

where $Y(a)$ ($a=0,1$) denotes the potential outcome were the individual given treatment status $A=a$, under the standard identifiability assumptions of (a) consistency, i.e. $Y(a)=Y$ if $A=a$, which requires that there is only one version of treatment and that an individual's outcome is not affected by others' treatment status (Cole & Frangakis, 2009); (b) latent ignorability, i.e. $Y(a)\perp\!\!\!\perp A\mid U,X$, which essentially requires $(U,X)$ to contain all confounders of the $A$-$Y$ association; and (c) positivity, i.e. $0<P(A=1\mid U,X)<1$ almost surely (Hernán & Robins, 2020). Under these conditions, $\beta_0$ encodes the log conditional causal odds ratio effect in every $(U,X)$ stratum.
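Indeed, under the structural model above, the odds of $Y(a)$ in each $(U,X)$ stratum equal $\exp\{\beta_{0}a+\eta(U,X)\}$, so that

\dfrac{P\{Y(1)=1\mid U,X\}/P\{Y(1)=0\mid U,X\}}{P\{Y(0)=1\mid U,X\}/P\{Y(0)=0\mid U,X\}}=\dfrac{\exp\{\beta_{0}+\eta(U,X)\}}{\exp\{\eta(U,X)\}}=e^{\beta_{0}}.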

In their Assumption 3, Li et al. (2022) also considered a homogeneous risk ratio model, under which the marginal causal risk ratio can be identified. We focus on Model 2.1 for two reasons: first, proximal identification of $\beta_0$ under Model 2.1 permits treatment-induced selection under Assumption 2.1 below, while identification under the homogeneous risk ratio model appears to require that $A$ has no causal effect on selection other than through $Y$; second, Li et al. (2022) invoked a rare outcome assumption in the target population for proximal identification under the homogeneous risk ratio model, which restricts the applicability of the method. Such a rare outcome assumption is not necessary for proximal identification of $\beta_0$ under Model 2.1. As discussed in Section S2 of the Supplementary Materials, however, under the standard identifiability assumptions (a)-(c) above and a rare outcome assumption akin to that of Li et al. (2022), $\beta_0$ in Model 2.1 still approximates the log marginal causal risk ratio $\log\{P(Y(1)=1)/P(Y(0)=1)\}$. Alternatively, as discussed in Section S3 of the Supplementary Materials, in settings where control subjects are selected to be positive for an outcome that is known a priori to be unaffected by the treatment, $\beta_0$ may recover the marginal causal risk ratio without invoking the rare outcome condition. Such a setting occurs, for example, in test-negative design studies of vaccine effectiveness, where the control subjects are selected to have a pre-determined infection that is not affected by the vaccine (Jackson & Nelson, 2013; Sullivan et al., 2016; Chua et al., 2020; Schnitzer, 2022).

To simplify the presentation, results in the main text focus primarily on Model 2.1; Section S11 of the Supplementary Materials extends these results to the more general model

\mathrm{logit}\,P(Y=1\mid A,U,X)=\beta_{0}(X)A+\eta^{\prime}(U,X),

which explicitly accounts for effect modification of the odds ratio with respect to measured covariates $X$.

In general, the conditional log odds ratio $\beta_0$ is not identifiable: first, the latent factors $U$ defining the strata are not observed; second, the data only include selected subjects with $S=1$, whilst Model 2.1 is defined in the target population.

As in Li et al. (2022) Assumption 2', we make an important assumption regarding the selection mechanism: the treatment $A$ does not modify the conditional risk ratio of selection against the outcome in any $(U,X)$ stratum. Formally, we assume the following.

Assumption 2.1 (No effect modification by $A$ on the outcome-dependent selection). For $a=0,1$ and some positive-valued unknown function $\xi$,

\dfrac{P(S=1\mid Y=1,A=a,U,X)}{P(S=1\mid Y=0,A=a,U,X)}=\xi(U,X).

As mentioned above, a special case of Assumption 2.1 arises when $A$ has no causal effect on $S$ other than through $Y$, a stronger condition considered by Li et al. (2022) to identify the marginal causal risk ratio. As a result of Assumption 2.1, the conditional odds ratio given $(U,X,S=1)$ is identical to that given $(U,X)$, which leaves only the challenge of controlling for unmeasured confounding by $U$ in the selected sample.
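To see why, note that by Bayes' rule the selected-sample odds of the outcome equal the target-population odds multiplied by the selection risk ratio, which under Assumption 2.1 does not depend on $A$:

\dfrac{P(Y=1\mid A=a,U,X,S=1)}{P(Y=0\mid A=a,U,X,S=1)}=\xi(U,X)\times\dfrac{P(Y=1\mid A=a,U,X)}{P(Y=0\mid A=a,U,X)},\qquad a=0,1,

so the factor $\xi(U,X)$ cancels in the odds ratio comparing $a=1$ with $a=0$; the same cancellation appears in the proof of Lemma S1.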

2.2 Proximal Identification of $\beta_0$ in Li et al. (2022)

To detect and correct for unmeasured confounding bias, the proximal inference framework leverages a pair of proxy measurements of the unmeasured confounders (Miao et al., 2018a; Cui et al., 2020; Tchetgen Tchetgen et al., 2020). In this line of work, Li et al. (2022) developed proximal causal inference for $\beta_0$ under confounded outcome-dependent sampling as represented in Figure 1(a) (Jackson & Nelson, 2013; Chua et al., 2020). We similarly assume proxies of the latent factors are available.

Assumption 2.2 (Proximal independence conditions). There exist a pair of proxies of the latent factor $U$: a treatment proxy (which Li et al. (2022) refer to as a “negative control exposure”), denoted $Z$, and an outcome proxy (which Li et al. (2022) refer to as a “negative control outcome”), denoted $W$, that satisfy

\text{(a) } W\perp\!\!\!\perp A\mid U,X,Y,S;\qquad\text{(b) } Z\perp\!\!\!\perp(Y,W,S)\mid A,U,X.

Assumption 2.2 requires the treatment proxy $Z$ and the outcome proxy $W$ to satisfy certain conditional independence conditions in the target population: the treatment proxy $Z$ has no direct effect on the primary outcome $Y$, the selection indicator $S$, or the outcome proxy $W$; and the primary treatment $A$ has no direct effect on the outcome proxy $W$. It is crucial that $Z$ and $W$ are both $U$-relevant, that is, they carry information about and are therefore associated with $U$ (Shi et al., 2020b; Tchetgen Tchetgen et al., 2020); this is formalized by Assumptions 2.4(a) and 2.6(a) below. As such, in the selected sample, an observed $A$-$W$ association after adjusting for $(Y,X)$, or an observed $Z$-$Y$ or $Z$-$W$ association after adjusting for $(A,X)$, may indicate the presence of bias due to the latent factor $U$. Figure 1(b) and (c) show the DAGs of scenarios where Assumption 2.2 holds.

As in Li et al. (2022), we assume the existence of a treatment confounding bridge function that connects the treatment proxy to the latent factor.

Assumption 2.3 (Treatment confounding bridge function). There exists a treatment confounding bridge function $q(A,Z,X)$ that satisfies

E[q(A,Z,X)\mid A=a,U,X,Y=0,S=1]=1/P(A=a\mid U,X,Y=0,S=1) (1)

almost surely for $a=0,1$.

Equation (1) is a Fredholm integral equation of the first kind. To ensure the existence and identifiability of its solution, we make the following completeness assumptions, similar to those of Li et al. (2022) and Cui et al. (2020).

Assumption 2.4 (Completeness).

(a) For any square-integrable function $g$, if $E[g(U)\mid A,Z,X,Y=0,S=1]=0$ almost surely, then $g(U)=0$ almost surely;

(b) For any square-integrable function $g$, if $E[g(Z)\mid A,W,X,Y=0,S=1]=0$ almost surely, then $g(Z)=0$ almost surely.

The completeness condition has been used to achieve identification of statistical functionals in the econometrics and semiparametric statistics literature (Newey & Powell, 2003; D'Haultfoeuille, 2011). Essentially, Assumption 2.4(a) formalizes the $U$-relevance of $Z$ and requires that $Z$ has at least as much variability as $U$ in every $(A,X)$ stratum of the outcome-free subjects in the selected sample. When $U$ and $Z$ are both categorical variables, $Z$ necessarily has at least as many categories as $U$. Assumption 2.4(a), together with the regularity conditions discussed in Section S4 of the Supplementary Materials, constitutes a sufficient condition for Assumption 2.3, i.e. the existence of $q$.

Similarly, Assumption 2.4(b) roughly requires that the outcome proxy $W$ is at least as variable as the treatment proxy $Z$, and is a sufficient condition under which $q(A,Z,X)$ can be uniquely identified from the selected sample, as stated in the theorem below.

Theorem 2.1 (Identification of $q(A,Z,X)$; Li et al. (2022) Theorem 2').

Under Assumptions 2.2 and 2.3, any treatment confounding bridge function $q(A,Z,X)$ satisfies the moment condition

E\left\{q(a,Z,X)\mid A=a,W,X,Y=0,S=1\right\}=1/P(A=a\mid W,X,Y=0,S=1). (2)

If Assumption 2.4(b) holds in addition, then the function $q(\cdot)$ is unique and can be identified as the unique solution to Equation (2).
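For intuition, when all variables are binary and there is no $X$, identifying $q$ from Equation (2) amounts to solving a small linear system. The following sketch does so under a hypothetical stratum law for $(U,A,Z,W)$ among the outcome-free selected subjects (all probabilities below are illustrative assumptions, not taken from the paper), and then verifies that the solution also satisfies Equation (1):

```python
import numpy as np

# Hypothetical stratum law for (U, A, Z, W) among Y = 0, S = 1 subjects;
# all numbers are illustrative assumptions, not taken from the paper.
pU = np.array([0.6, 0.4])                    # P(U = u)
pA_U = np.array([[0.7, 0.3], [0.35, 0.65]])  # pA_U[u, a] = P(A = a | U = u)
pZ_U = np.array([[0.8, 0.2], [0.3, 0.7]])    # pZ_U[u, z] = P(Z = z | U = u)
pW_U = np.array([[0.75, 0.25], [0.2, 0.8]])  # pW_U[u, w] = P(W = w | U = u)

# Joint pmf p(u, a, z, w): A, Z, W mutually independent given U, as in Figure 1(b).
p = np.einsum('u,ua,uz,uw->uazw', pU, pA_U, pZ_U, pW_U)
p_azw = p.sum(axis=0)                        # marginalize out the latent U

q = np.zeros((2, 2))                         # q[a, z] = q(a, z)
for a in range(2):
    # Equation (2) gives one linear equation per level of W:
    #   sum_z P(Z = z | A = a, W = w) q(a, z) = 1 / P(A = a | W = w)
    M, b = np.zeros((2, 2)), np.zeros(2)
    for w in range(2):
        M[w] = p_azw[a, :, w] / p_azw[a, :, w].sum()        # P(Z = z | A = a, W = w)
        b[w] = p_azw[:, :, w].sum() / p_azw[a, :, w].sum()  # 1 / P(A = a | W = w)
    q[a] = np.linalg.solve(M, b)

# Check that the solution of Equation (2) also solves Equation (1) in every U-stratum:
for a in range(2):
    for u in range(2):
        assert np.isclose(pZ_U[u] @ q[a], 1 / pA_U[u, a])
```

The verification succeeds here because both completeness conditions hold: the $2\times 2$ matrices $P(Z=z\mid U=u)$ and $P(W=w\mid U=u)$ are invertible.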

Leveraging the proxies, Li et al. (2022) established the following identification result for $\beta_0$.

Theorem 2.2 (Identification of $\beta_0$; Li et al. (2022) Theorem 1').

Under Model 2.1, if Assumptions 2.1–2.3 hold, then the conditional log odds ratio parameter $\beta_0$ solves the moment equation

E\left\{(-1)^{1-A}c(X)Yq(A,Z,X)e^{-\beta A}\mid S=1\right\}=0 (3)

where $c(X)$ is an arbitrary square-integrable unidimensional function that satisfies

E\left\{c(X)I(A=a)Yq(A,Z,X)\mid S=1\right\}\neq 0,\quad a=0,1.

Equation (3) admits a closed-form solution for $\beta_0$:

\beta_{0}=\log\left(\dfrac{E\left\{c(X)I(A=1)Yq(A,Z,X)\mid S=1\right\}}{E\left\{c(X)I(A=0)Yq(A,Z,X)\mid S=1\right\}}\right). (4)

Unique identification of $q$ through Assumption 2.4(b) is necessary: although Equation (4) holds for any treatment confounding bridge function $q$ that satisfies Equation (1), $q$ can only be identified in the selected sample by solving Equation (2). If Equation (2) has multiple solutions, there is no guarantee that a given solution to Equation (2) is also a solution to Equation (1).

2.3 New Proximal Identification Results

From Theorem 2.2, it is clear that the treatment confounding bridge function $q(A,Z,X)$ plays a crucial role in identifying $\beta_0$. However, if Assumption 2.3 does not hold, or Assumption 2.3 holds but Assumption 2.4(b) does not, then the solution to Equation (2) in the selected sample may not be unique and may fail to satisfy Equation (1), which defines a treatment confounding bridge function. As a result, the parameter identified by Equation (4), where $q$ is a solution to Equation (2), may not have a causal interpretation. In this section, we establish a new identification result for $\beta_0$ that does not rely on $q(A,Z,X)$. We instead define an outcome confounding bridge function as below.

Assumption 2.5 (Outcome confounding bridge function). There exists an outcome confounding bridge function $h(A,W,X)$ that satisfies

E\{h(A,W,X)\mid A,U,X,Y=0,S=1\}=\frac{P(Y=1\mid A,U,X,S=1)}{P(Y=0\mid A,U,X,S=1)} (5)

almost surely. Miao et al. (2018b) and Cui et al. (2020) previously proposed proximal identification of the average treatment effect using an outcome confounding bridge function. Our definition of $h(A,W,X)$ in Equation (5) differs from those in previous works on proximal causal inference, in that the conditional expectation of our $h(A,W,X)$ is defined to equal the conditional odds of the outcome, whilst the conditional expectation of the outcome confounding bridge function in previous works was defined to equal the conditional mean outcome. Theorem 2.4 below suggests that the standard outcome bridge function of the prior literature would indeed fail to identify $\beta_0$, while our alternative definition yields the desired identification result.
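Equivalently, as verified at the start of the proof of Theorem 2.3 in Section S1.3 (Equation (26)), Equation (5) can be restated as a moment restriction in which the bridge-weighted controls balance the cases in every $(U,X)$ stratum:

E\{(1-Y)h(a,W,X)\mid A=a,U,X,S=1\}=E(Y\mid A=a,U,X,S=1),\qquad a=0,1.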

In contrast with Assumption 2.4, we make the following completeness assumption for the existence and identifiability of the outcome confounding bridge function $h$.

Assumption 2.6 (Completeness).

(a) For any square-integrable function $g$, if $E[g(U)\mid A,W,X,Y=0,S=1]=0$ almost surely, then $g(U)=0$ almost surely;

(b) For any square-integrable function $g$, if $E[g(W)\mid A,Z,X,Y=0,S=1]=0$ almost surely, then $g(W)=0$ almost surely.

Similar to Assumption 2.4(a), Assumption 2.6(a) essentially requires that $W$ has sufficient variability relative to $U$ in every $(A,X)$ stratum of the outcome-free subjects in the selected sample, and, together with the regularity conditions discussed in Section S4 of the Supplementary Materials, constitutes a sufficient condition for Assumption 2.5, i.e. the existence of $h$. Assumption 2.6(b) requires the opposite of Assumption 2.4(b), namely that the treatment proxy $Z$ is sufficiently variable relative to the outcome proxy $W$, and is a sufficient condition under which $h(A,W,X)$ can be uniquely identified from the selected sample, as stated in Theorem 2.3 below.

Theorem 2.3 (Identification of $h(A,W,X)$).

Under Assumptions 2.2 and 2.5, any outcome confounding bridge function $h(A,W,X)$ satisfies the moment condition

E\left\{h(A,W,X)\mid A,Z,X,Y=0,S=1\right\}=\dfrac{P(Y=1\mid A,Z,X,S=1)}{P(Y=0\mid A,Z,X,S=1)}. (6)

If Assumption 2.6(b) holds in addition, then the function $h(\cdot)$ is unique and can be identified as the unique solution to Equation (6).

We introduce the new identification result in Theorem 2.4 below:

Theorem 2.4 (Identification of $\beta_0$).

Under Model 2.1, if Assumptions 2.1, 2.2 and 2.5 hold, then the conditional log odds ratio parameter $\beta_0$ solves the moment equation

E\left[c(X)(1-Y)\left\{h(1,W,X)e^{-\beta_{0}}-h(0,W,X)\right\}\mid S=1\right]=0, (7)

where $c(X)$ is an arbitrary square-integrable unidimensional function that satisfies

E\left\{c(X)(1-Y)h(a,W,X)\mid S=1\right\}\neq 0,\quad a=0,1.

Equation (7) also admits a closed-form solution:

\beta_{0}=\log\left(\dfrac{E\left\{c(X)(1-Y)h(1,W,X)\mid S=1\right\}}{E\left\{c(X)(1-Y)h(0,W,X)\mid S=1\right\}}\right). (8)

2.4 A doubly robust closed-form expression for $\beta_0$

Until now, identification of $\beta_0$ has required the existence and identification of either a treatment confounding bridge function or an outcome confounding bridge function. In this section, we present a closed-form expression for $\beta_0$ with a desirable double robustness property, in the sense that it only requires one of the two confounding bridge functions to exist and be identified.

Theorem 2.5 (A doubly robust closed-form expression for $\beta_0$).

Under Model 2.1 and Assumptions 2.1 and 2.2, we have

\beta_{0}=\log\Big(E\big\{c(X)\big[I(A=1)q(A,Z,X)\{Y-(1-Y)h(A,W,X)\}+(1-Y)h(1,W,X)\big]\mid S=1\big\}\Big/
\qquad E\big\{c(X)\big[I(A=0)q(A,Z,X)\{Y-(1-Y)h(A,W,X)\}+(1-Y)h(0,W,X)\big]\mid S=1\big\}\Big), (9)

if either $q$ is a solution to Equation (1), or $h$ is a solution to Equation (5). Here $c(X)$ is an arbitrary square-integrable real-valued function satisfying

E\big\{c(X)\big[I(A=a)q(A,Z,X)\{Y-(1-Y)h(a,W,X)\}+(1-Y)h(a,W,X)\big]\mid S=1\big\}\neq 0

for $a=0,1$.

In Equation (9), setting $h(A,W,X)=0$ recovers the closed-form solution (4), while setting $q(A,Z,X)=0$ gives the closed-form solution (8), illustrating the double robustness of the closed-form expression (9).

In Section S5 of the Supplementary Materials, we compare our assumptions and results with those in Li et al. (2022), and summarise the additional conditions under which $\beta_0$ identifies the marginal causal risk ratio.

2.5 Estimation and large sample inference

Denote the data for the $i$th subject in the selected sample as $O_i=(A_i,Y_i,Z_i,W_i,X_i)$, $i=1,\dots,n$. As suggested by Equations (4), (8) or (9), consistent estimators of $q(A,Z,X)$ and $h(A,W,X)$ can be used to construct estimators of $\beta_0$ following the standard plug-in principle (Bickel & Ritov, 2003). It remains to estimate the two confounding bridge functions. Below, we present crucial moment conditions for the identification of the confounding bridge functions.

Theorem 2.6 (Moment conditions for $q(A,Z,X)$ and $h(A,W,X)$).

Under Assumptions 2.2, 2.3 and 2.5, the confounding bridge functions $q(\cdot)$ and $h(\cdot)$ satisfy

E\left[(1-Y)\{\kappa_{1}(A,W,X)q(A,Z,X)-\kappa_{1}(1,W,X)-\kappa_{1}(0,W,X)\}\mid S=1\right]=0 (10)
E\left[\kappa_{2}(A,Z,X)\{(1-Y)h(A,W,X)-Y\}\mid S=1\right]=0 (11)

where $\kappa_1(\cdot)$ and $\kappa_2(\cdot)$ are arbitrary square-integrable functions.

Equations (4), (8) and (9) and Theorem 2.6 together suggest a parametric approach to estimating $\beta_0$: one may postulate suitable parametric working models $q(A,Z,X;\tau)$ and $h(A,W,X;\psi)$, where $\tau$ and $\psi$ are unknown finite-dimensional parameters. One can then estimate the nuisance parameters $\tau$ and $\psi$ by solving the estimating equations

\dfrac{1}{n}\sum_{i=1}^{n}(1-Y_{i})\left\{\kappa_{1}(A_{i},W_{i},X_{i})q(A_{i},Z_{i},X_{i};\tau)-\kappa_{1}(1,W_{i},X_{i})-\kappa_{1}(0,W_{i},X_{i})\right\}=0, (12)
\dfrac{1}{n}\sum_{i=1}^{n}\kappa_{2}(A_{i},Z_{i},X_{i})\left\{(1-Y_{i})h(A_{i},W_{i},X_{i};\psi)-Y_{i}\right\}=0, (13)

with user-specified functions $\kappa_1$ and $\kappa_2$ of dimensions equal to those of $\tau$ and $\psi$, respectively. Closed-form parametric models for the confounding bridge functions can be derived in some settings, as described below.

Example 2.7.

If $U,Z,W$ are all binary variables and $X$ is a categorical variable with levels $\{x_1,\dots,x_L\}$, then suitable models are the saturated models

q(a,z,x;\tau)=\tau_{0x}+\tau_{1x}a+\tau_{2x}z+\tau_{3x}az (14)
\text{and}\quad h(a,w,x;\psi)=\psi_{0x}+\psi_{1x}a+\psi_{2x}w+\psi_{3x}aw (15)

for $x\in\{x_1,\dots,x_L\}$, where

\tau=(\tau_{0x_{1}},\tau_{1x_{1}},\tau_{2x_{1}},\tau_{3x_{1}},\dots,\tau_{0x_{L}},\tau_{1x_{L}},\tau_{2x_{L}},\tau_{3x_{L}})^{T},
\text{and}\quad\psi=(\psi_{0x_{1}},\psi_{1x_{1}},\psi_{2x_{1}},\psi_{3x_{1}},\dots,\psi_{0x_{L}},\psi_{1x_{L}},\psi_{2x_{L}},\psi_{3x_{L}})^{T}.

Example 2.8.

If $Z$ and $W$ are both multivariate Gaussian random variables conditional on $(A,U,X,Y)$, then under the setting given in Section S6 of the Supplementary Materials, suitable models for the confounding bridge functions are

q(a,z,x;\tau)=1+\exp\{(1-2a)(\tau_{0}+\tau_{A}a+\tau_{Z}^{T}z+\tau_{X}^{T}x)\} (16)
\text{and}\quad h(a,w,x;\psi)=\exp(\psi_{0}+\psi_{A}a+\psi_{W}^{T}w+\psi_{X}^{T}x), (17)

where $\tau=(\tau_{0},\tau_{A},\tau_{Z}^{T},\tau_{X}^{T})^{T}$ and $\psi=(\psi_{0},\psi_{A},\psi_{W}^{T},\psi_{X}^{T})^{T}$.
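As a concrete illustration of how Equations (12) and (13) can be solved under the working models (16) and (17), the following minimal Python sketch fits $\tau$ and $\psi$ with scipy's root finder, taking $\kappa_1(A,W,X)=(1,A,W^T,X^T)^T$ and $\kappa_2(A,Z,X)=(1,A,Z^T,X^T)^T$ so that the systems are exactly identified when $Z$ and $W$ have equal dimension. The toy data-generating step is purely illustrative (it is not the paper's design and involves no selection step), so the fitted values only demonstrate the mechanics:

```python
import numpy as np
from scipy.optimize import root

rng = np.random.default_rng(0)
n = 5000
# Purely illustrative data (not the paper's design): scalar X, U, Z, W.
X = rng.normal(size=(n, 1))
U = rng.normal(size=n)
Z = (U + rng.normal(size=n)).reshape(-1, 1)   # treatment proxy
W = (U + rng.normal(size=n)).reshape(-1, 1)   # outcome proxy
expit = lambda t: 1 / (1 + np.exp(-t))
A = rng.binomial(1, expit(0.5 * U + X[:, 0]))
Y = rng.binomial(1, expit(-1 + np.log(0.2) * A + U + 0.5 * X[:, 0]))

def q_model(a, z, x, tau):
    # Equation (16): q(a, z, x; tau) = 1 + exp{(1 - 2a)(tau0 + tauA a + tauZ'z + tauX'x)}
    return 1 + np.exp((1 - 2 * a) * (tau[0] + tau[1] * a + z @ tau[2:3] + x @ tau[3:4]))

def h_model(a, w, x, psi):
    # Equation (17): h(a, w, x; psi) = exp(psi0 + psiA a + psiW'w + psiX'x)
    return np.exp(psi[0] + psi[1] * a + w @ psi[2:3] + x @ psi[3:4])

def ee_tau(tau):
    # Sample analogue of Equation (12) with kappa1(A, W, X) = (1, A, W, X)'
    k1 = lambda a: np.column_stack([np.ones(n), np.full(n, a), W, X])
    lhs = np.column_stack([np.ones(n), A, W, X]) * q_model(A, Z, X, tau)[:, None]
    return ((1 - Y)[:, None] * (lhs - k1(1.0) - k1(0.0))).mean(axis=0)

def ee_psi(psi):
    # Sample analogue of Equation (13) with kappa2(A, Z, X) = (1, A, Z, X)'
    resid = (1 - Y) * h_model(A, W, X, psi) - Y
    return (np.column_stack([np.ones(n), A, Z, X]) * resid[:, None]).mean(axis=0)

tau_hat = root(ee_tau, x0=np.zeros(4)).x
psi_hat = root(ee_psi, x0=np.zeros(4)).x
```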

After obtaining the estimators $\hat{\tau}$ and $\hat{\psi}$ by solving Equations (12) and (13), $\beta_0$ can then be estimated with the following plug-in estimators:

\hat{\beta}_{c,\text{PIPW}}=\log\left(\dfrac{\frac{1}{n}\sum_{i=1}^{n}I(A_{i}=1)c(X_{i})q(1,Z_{i},X_{i};\hat{\tau})Y_{i}}{\frac{1}{n}\sum_{i=1}^{n}I(A_{i}=0)c(X_{i})q(0,Z_{i},X_{i};\hat{\tau})Y_{i}}\right); (18)

\hat{\beta}_{c,\text{POR}}=\log\left(\dfrac{\frac{1}{n}\sum_{i=1}^{n}c(X_{i})(1-Y_{i})h(1,W_{i},X_{i};\hat{\psi})}{\frac{1}{n}\sum_{i=1}^{n}c(X_{i})(1-Y_{i})h(0,W_{i},X_{i};\hat{\psi})}\right); (19)

\hat{\beta}_{c,\text{PDR}}=\log\Big(\frac{1}{n}\sum_{i=1}^{n}c(X_{i})\big[I(A_{i}=1)q(A_{i},Z_{i},X_{i};\hat{\tau})\{Y_{i}-(1-Y_{i})h(A_{i},W_{i},X_{i};\hat{\psi})\}+(1-Y_{i})h(1,W_{i},X_{i};\hat{\psi})\big]\Big/
\qquad \frac{1}{n}\sum_{i=1}^{n}c(X_{i})\big[I(A_{i}=0)q(A_{i},Z_{i},X_{i};\hat{\tau})\{Y_{i}-(1-Y_{i})h(A_{i},W_{i},X_{i};\hat{\psi})\}+(1-Y_{i})h(0,W_{i},X_{i};\hat{\psi})\big]\Big). (20)
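Continuing the sketch above with $c(X)=1$, the three plug-in estimators in Equations (18)-(20) are simple ratios of sample means:

```python
qhat = q_model(A, Z, X, tau_hat)                # q(A_i, Z_i, X_i; tau_hat)
h1 = h_model(1.0, W, X, psi_hat)                # h(1, W_i, X_i; psi_hat)
h0 = h_model(0.0, W, X, psi_hat)                # h(0, W_i, X_i; psi_hat)
ha = np.where(A == 1, h1, h0)                   # h(A_i, W_i, X_i; psi_hat)

beta_pipw = np.log(np.mean((A == 1) * qhat * Y) / np.mean((A == 0) * qhat * Y))  # (18)
beta_por = np.log(np.mean((1 - Y) * h1) / np.mean((1 - Y) * h0))                 # (19)
num = np.mean((A == 1) * qhat * (Y - (1 - Y) * ha) + (1 - Y) * h1)
den = np.mean((A == 0) * qhat * (Y - (1 - Y) * ha) + (1 - Y) * h0)
beta_pdr = np.log(num / den)                                                     # (20)
```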

Alternatively, one can jointly estimate $\beta_0$ and the nuisance parameters $\tau$ and $\psi$ by solving estimating equations with the following estimating functions:

G_{\text{PIPW}}(O_{i};\beta,\tau)=\left(\begin{array}{l}(1-Y_{i})\left\{\kappa_{1}(A_{i},W_{i},X_{i})q(A_{i},Z_{i},X_{i};\tau)-\kappa_{1}(1,W_{i},X_{i})-\kappa_{1}(0,W_{i},X_{i})\right\}\\ c(X_{i})(-1)^{1-A_{i}}q(A_{i},Z_{i},X_{i};\tau)Y_{i}e^{-\beta A_{i}}\end{array}\right),

G_{\text{POR}}(O_{i};\beta,\psi)=\left(\begin{array}{l}\kappa_{2}(A_{i},Z_{i},X_{i})\left\{(1-Y_{i})h(A_{i},W_{i},X_{i};\psi)-Y_{i}\right\}\\ c(X_{i})(1-Y_{i})\left\{h(1,W_{i},X_{i};\psi)e^{-\beta}-h(0,W_{i},X_{i};\psi)\right\}\end{array}\right),

and

G_{\text{PDR}}(O_{i};\beta,\tau,\psi)=\left(\begin{array}{l}(1-Y_{i})\left\{\kappa_{1}(A_{i},W_{i},X_{i})q(A_{i},Z_{i},X_{i};\tau)-\kappa_{1}(1,W_{i},X_{i})-\kappa_{1}(0,W_{i},X_{i})\right\}\\ \kappa_{2}(A_{i},Z_{i},X_{i})\left\{(1-Y_{i})h(A_{i},W_{i},X_{i};\psi)-Y_{i}\right\}\\ c(X_{i})\big[(-1)^{1-A_{i}}q(A_{i},Z_{i},X_{i};\tau)e^{-\beta A_{i}}\left\{Y_{i}-(1-Y_{i})h(A_{i},W_{i},X_{i};\psi)\right\}+(1-Y_{i})\left\{h(1,W_{i},X_{i};\psi)e^{-\beta}-h(0,W_{i},X_{i};\psi)\right\}\big]\end{array}\right).

The resulting estimators are consistent and asymptotically normal by standard estimating equation theory (Van der Vaart, 2000). Similar to Cui et al. (2020), we refer to $\hat{\beta}_{c,\text{PIPW}}$, $\hat{\beta}_{c,\text{POR}}$ and $\hat{\beta}_{c,\text{PDR}}$ as the proximal inverse probability weighted (PIPW) estimator, the proximal outcome regression (POR) estimator, and the proximal doubly robust (PDR) estimator, respectively. As discussed in Sections S2 and S3 of the Supplementary Materials, the PIPW estimator can estimate the marginal causal risk ratio assuming either (a) a homogeneous risk ratio model, a treatment-independent selection mechanism, and a rare outcome condition in the target population, or (b) a test-negative design where the controls are selected on an outcome known to be unaffected by the treatment. Because they estimate the same functional, the new POR and PDR estimators can be viewed as estimators of the marginal causal risk ratio under the same conditions.

As implied by Theorem 2.5, the estimator $\hat{\beta}_{c,\text{PDR}}$ also enjoys the desirable double robustness property: it is consistent for $\beta_0$ if either $q(A,Z,X)$ or $h(A,W,X)$ can be consistently estimated. We state this result formally in the theorem below.

Theorem 2.9.

Under Model 2.1 and Assumptions 2.1, 2.2, 2.3 and 2.5, we have the following results:

(a) $\hat{\beta}_{c,\text{PIPW}}$ is consistent for $\beta_0$ and asymptotically normal if the model $q(A,Z,X;\tau)$ is correctly specified and Assumption 2.4(b) holds;

(b) $\hat{\beta}_{c,\text{POR}}$ is consistent for $\beta_0$ and asymptotically normal if the model $h(A,W,X;\psi)$ is correctly specified and Assumption 2.6(b) holds;

(c) $\hat{\beta}_{c,\text{PDR}}$ is consistent for $\beta_0$ and asymptotically normal if either the conditions in (a) or those in (b) hold.

Theorem 2.9 indicates that the proposed PDR estimator has improved robustness to misspecification of the two confounding bridge functions relative to the PIPW and POR estimators. If the candidate proxies $W$ and $Z$ have different cardinalities, for example when they are categorical variables of unequal dimensions, then only one of the completeness Assumptions 2.4 and 2.6 may hold. Say $W$ is of higher dimension; then Assumption 2.6(b) cannot hold, the outcome bridge function $h(A,W,X)$ may not be uniquely identified, and $\hat{\beta}_{c,\text{POR}}$ may be biased for $\beta_0$. However, it may be possible to coarsen levels of $W$ to match those of $Z$ (Shi et al., 2020a). We highlight that the consistency of $\hat{\beta}_{c,\text{PDR}}$ only requires one of the two completeness assumptions to hold.

Our identification and estimation strategies involve user-specified functions $c(\cdot)$, $\kappa_1(\cdot)$ and $\kappa_2(\cdot)$. In principle, these functions can be chosen to construct potentially more efficient estimators of $\beta_0$, $q$ and $h$, respectively, but the optimal choices may require modeling complex features of the observed data distribution. As discussed in Cui et al. (2020) and Stephens et al. (2014), such features can be considerably difficult to model correctly, and unsuccessful attempts to do so may lead to a loss of efficiency, and thus are not always worthwhile. In our simulation and real data analysis, we found that generic choices of $c(\cdot)$, $\kappa_1(\cdot)$ and $\kappa_2(\cdot)$ deliver reasonable efficiency.

Until now, our estimation has required specifying parametric working models $q(A,Z,X;\tau)$ and/or $h(A,W,X;\psi)$. When $X$ is continuous or high-dimensional, however, neither working model may contain the true confounding bridge function. Ghassami et al. (2021) and Kallus et al. (2021) concurrently proposed a cross-fitting minimax kernel learning method for proximal learning of the average treatment effect. Their method allows nonparametric estimation of the nuisance functions and $\sqrt{n}$-consistent estimation of the average treatment effect, even when both estimated confounding bridge functions converge at slower than $\sqrt{n}$ rates. In Section S7 of the Supplementary Materials, we demonstrate that their method also applies to our setting, and describe a semiparametric kernel estimator of the odds ratio parameter with flexibly estimated confounding bridge functions, $\sqrt{n}$-consistency, and asymptotic normality.

3 Simulation

In this section, we perform simulation studies to evaluate the performance of our proposed estimators of $\beta_0$ under five different scenarios. All scenarios follow Model 2.1 with $\beta_0=\log(0.2)$. In Scenario I, the variables $(U,Z,W)$ are all univariate binary variables and there are no measured confounders $X$. Therefore, saturated models for the confounding bridge functions are appropriate, i.e.

q(A,Z;\tau)=\tau_{0}+\tau_{a}A+\tau_{z}Z+\tau_{az}AZ\quad\text{ and }\quad h(A,W;\psi)=\psi_{0}+\psi_{a}A+\psi_{w}W+\psi_{aw}AW.

The nuisance parameters $\tau$ and $\psi$ are estimated by solving Equations (12) and (13), respectively, with $\kappa_1(A,W)=(1,A,W,AW)^T$ and $\kappa_2(A,Z)=(1,A,Z,AZ)^T$. In Scenario II, the variables $(X,U,Z,W)$ are continuous, and the conditional distributions of $Z$ and $W$ given the other variables follow Example 2.8, so the parametric working models in Equations (16) and (17) are appropriate. We use these working models and estimate $\tau$ and $\psi$ by solving the estimating equations (12) and (13), setting $\kappa_1(A,W,X)$ to be a vector including $A$, $W$, $X$, all of their higher-order interactions, and an intercept term, and $\kappa_2(A,Z,X)$ similarly. We set $c(X)=1$ for all the estimators under comparison.
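To make the Scenario I pipeline concrete, the sketch below simulates one hypothetical confounded outcome-dependent sample with binary $(U,Z,W)$ (the exact data-generating process of Section S10 is not reproduced here; all coefficients are illustrative assumptions), fits the saturated bridge models by solving the linear systems implied by Equations (12) and (13), and evaluates the three estimators with $c(X)=1$:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 200_000
expit = lambda t: 1 / (1 + np.exp(-t))
# Hypothetical binary DGP consistent with Figure 1(b); coefficients are illustrative.
U = rng.binomial(1, 0.5, N)
Z = rng.binomial(1, 0.3 + 0.4 * U)          # treatment proxy, depends only on U
W = rng.binomial(1, 0.2 + 0.5 * U)          # outcome proxy, depends only on U
A = rng.binomial(1, 0.3 + 0.4 * U)
Y = rng.binomial(1, expit(-1 + np.log(0.2) * A + U))   # Model 2.1 with beta0 = log 0.2
S = rng.binomial(1, expit(-1 + 2 * Y + U))  # selection driven by (Y, U), not A
A, Y, Z, W = A[S == 1], Y[S == 1], Z[S == 1], W[S == 1]

basis = lambda a, v: np.column_stack([np.ones_like(a), a, v, a * v])  # (1, a, v, av)
one, zero = np.ones_like(A), np.zeros_like(A)

# Equation (12) is linear in tau under the saturated model: solve M tau = b.
kap1, qreg = basis(A, W), basis(A, Z)
M = np.einsum('i,ij,ik->jk', 1.0 - Y, kap1, qreg) / len(Y)
b = ((1 - Y)[:, None] * (basis(one, W) + basis(zero, W))).mean(axis=0)
tau = np.linalg.solve(M, b)
# Equation (13) is likewise linear in psi with kappa2(A, Z) = (1, A, Z, AZ)'.
kap2, hreg = basis(A, Z), basis(A, W)
M2 = np.einsum('i,ij,ik->jk', 1.0 - Y, kap2, hreg) / len(Y)
psi = np.linalg.solve(M2, (kap2 * Y[:, None]).mean(axis=0))

qhat, ha = qreg @ tau, hreg @ psi
h1, h0 = basis(one, W) @ psi, basis(zero, W) @ psi
pipw = np.log(np.mean((A == 1) * qhat * Y) / np.mean((A == 0) * qhat * Y))
por = np.log(np.mean((1 - Y) * h1) / np.mean((1 - Y) * h0))
pdr = np.log(np.mean((A == 1) * qhat * (Y - (1 - Y) * ha) + (1 - Y) * h1)
             / np.mean((A == 0) * qhat * (Y - (1 - Y) * ha) + (1 - Y) * h0))
print(pipw, por, pdr, np.log(0.2))   # all three should be close to log 0.2
```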

In Scenarios III-V, the data generating process is the same as in Scenario II, but we misspecify different components of the parametric working models. In Scenario III, we misspecify the treatment confounding bridge function $q(A,Z,X)$ by ignoring $X$, i.e., we set

q(a,z,x;\tau)=1+\exp\{(1-2a)(\tau_{0}+\tau_{A}a+\tau_{Z}z)\}.

In this scenario, we expect the POR and PDR estimators to be consistent and the PIPW estimator to be inconsistent. In Scenario IV, we similarly misspecify the outcome confounding bridge function $h(A,W,X)$ by ignoring $X$, i.e., we set

h(a,w,x;\psi)=\exp(\psi_{0}+\psi_{A}a+\psi_{w}w).

In this scenario, we expect the PIPW and PDR estimators to be consistent and the POR estimator to be inconsistent. Finally, in Scenario V, we misspecify both $q(A,Z,X)$ and $h(A,W,X)$ as above, and we expect all three estimators to be inconsistent.

We relegate the details of the data generating process to Section S10 of the Supplementary Materials. We report the bias, standard deviation, and coverage of 95% confidence intervals of the POR, PIPW, and PDR estimators for each scenario over 500 Monte Carlo samples.

Table 1 shows the simulation results. In Scenarios I and II, as the parametric working models for both confounding bridge functions are correctly specified, all three estimators have essentially identical performance, as expected. In Scenario III, the POR and PDR estimators are consistent and have well-calibrated 95% confidence intervals, while PIPW is inconsistent and has anti-conservative 95% confidence intervals. Conversely, in Scenario IV, the PIPW and PDR estimators are consistent and have well-calibrated 95% confidence intervals, but the POR estimator is not. In Scenario V, all three estimators are inconsistent and have anti-conservative 95% confidence intervals, but the PDR estimator has slightly smaller bias and better confidence interval coverage.

Table 1: Bias, standard deviation (SD), and coverage of 95% confidence intervals of the proximal outcome regression (POR), proximal inverse probability weighted (PIPW), and proximal doubly robust (PDR) estimators in five scenarios, summarized over 500 Monte Carlo samples. In Scenario I, the variables $(U,Z,W)$ are univariate binary; in Scenarios II-V, the variables $(X,U,Z,W)$ are univariate continuous.

Scenario        N              POR                      PIPW                     PDR
                          Bias   SD   Coverage     Bias   SD   Coverage     Bias   SD   Coverage
I             3,500       0.05  0.27   97.4%       0.05  0.05   97.4%       0.05  0.27   97.4%
I             5,000       0.00  0.23   96.4%       0.02  0.22   96.4%       0.00  0.22   96.4%
II     $1\times 10^{5}$  -0.03  0.24   95.0%      -0.03  0.23   94.2%      -0.02  0.23   94.2%
II     $2\times 10^{5}$  -0.02  0.16   95.0%      -0.02  0.15   94.8%      -0.02  0.15   94.2%
III    $1\times 10^{5}$  -0.03  0.24   94.8%      -0.11  0.22   91.2%      -0.02  0.23   94.2%
III    $2\times 10^{5}$  -0.02  0.16   95.0%      -0.10  0.15   90.0%      -0.02  0.15   95.0%
IV     $1\times 10^{5}$  -0.12  0.24   92.2%      -0.03  0.23   94.0%      -0.03  0.23   94.0%
IV     $2\times 10^{5}$  -0.11  0.16   90.8%      -0.02  0.15   94.6%      -0.02  0.15   94.2%
V      $1\times 10^{5}$  -0.12  0.24   92.2%      -0.11  0.22   90.8%      -0.09  0.22   92.2%
V      $2\times 10^{5}$  -0.11  0.16   90.8%      -0.10  0.15   89.8%      -0.08  0.15   91.6%

4 Extensions

We have focused on settings where the treatment and outcome are both binary, but the proposed framework extends to the association between a polytomous treatment and a polytomous outcome with simple modifications. These extensions are relegated to Section S12 of the Supplementary Materials.

Although we have primarily focused on the test-negative design as an example of outcome-dependent sampling, our method applies to other outcome-dependent sampling designs where unmeasured confounding is of concern and suitable proxy variables may be identified, such as a case-cohort study (Prentice, 1986) or a retrospective case-control or cohort study using data from electronic health records (Streeter et al., 2017), where subjects' treatment status may be associated with underlying frailty or risk of developing the outcome of interest.

References

  • Ai & Chen (2003) Ai, C. & Chen, X. (2003). Efficient estimation of models with conditional moment restrictions containing unknown functions. Econometrica 71, 1795–1843.
  • Bareinboim & Pearl (2012) Bareinboim, E. & Pearl, J. (2012). Controlling selection bias in causal inference. In Artificial Intelligence and Statistics. PMLR.
  • Bareinboim et al. (2014) Bareinboim, E., Tian, J. & Pearl, J. (2014). Recovering from selection bias in causal and statistical inference. In Proceedings of the AAAI Conference on Artificial Intelligence, vol. 28.
  • Bickel & Ritov (2003) Bickel, P. J. & Ritov, Y. (2003). Nonparametric estimators which can be “plugged-in”. The Annals of Statistics 31, 1033–1053.
  • Breslow & Day (1980) Breslow, N. & Day, N. (1980). Statistical Methods in Cancer Research. Volume I: The Analysis of Case-Control Studies. IARC Sci Publ 32, 5–338.
  • Chen (2007) Chen, X. (2007). Large sample sieve estimation of semi-nonparametric models. Handbook of econometrics 6, 5549–5632.
  • Chua et al. (2020) Chua, H., Feng, S., Lewnard, J. A., Sullivan, S. G., Blyth, C. C., Lipsitch, M. & Cowling, B. J. (2020). The use of test-negative controls to monitor vaccine effectiveness: a systematic review of methodology. Epidemiology (Cambridge, Mass.) 31, 43.
  • Cole & Frangakis (2009) Cole, S. R. & Frangakis, C. E. (2009). The consistency statement in causal inference: a definition or an assumption? Epidemiology 20, 3–5.
  • Cole et al. (2010) Cole, S. R., Platt, R. W., Schisterman, E. F., Chu, H., Westreich, D., Richardson, D. & Poole, C. (2010). Illustrating bias due to conditioning on a collider. International journal of epidemiology 39, 417–420.
  • Cuddeback et al. (2004) Cuddeback, G., Wilson, E., Orme, J. G. & Combs-Orme, T. (2004). Detecting and statistically correcting sample selection bias. Journal of Social Service Research 30, 19–33.
  • Cui et al. (2020) Cui, Y., Pu, H., Shi, X., Miao, W. & Tchetgen Tchetgen, E. (2020). Semiparametric proximal causal inference. arXiv preprint arXiv:2011.08411.
  • Deaner (2018) Deaner, B. (2018). Proxy controls and panel data. arXiv preprint arXiv:1810.00283.
  • Didelez et al. (2010) Didelez, V., Kreiner, S. & Keiding, N. (2010). Graphical models for inference under outcome-dependent sampling. Statistical Science 25, 368–387.
  • Dikkala et al. (2020) Dikkala, N., Lewis, G., Mackey, L. & Syrgkanis, V. (2020). Minimax estimation of conditional moment models. Advances in Neural Information Processing Systems 33, 12248–12262.
  • D’Haultfoeuille (2011) D’Haultfoeuille, X. (2011). On the completeness condition in nonparametric instrumental problems. Econometric Theory 27, 460–471.
  • Elwert & Winship (2014) Elwert, F. & Winship, C. (2014). Endogenous selection bias: The problem of conditioning on a collider variable. Annual review of sociology 40, 31–53.
  • Engel (1988) Engel, J. (1988). Polytomous logistic regression. Statistica Neerlandica 42, 233–252.
  • Flanders et al. (2017) Flanders, W. D., Strickland, M. J. & Klein, M. (2017). A new method for partial correction of residual confounding in time-series and other observational studies. American journal of epidemiology 185, 941–949.
  • Gabriel et al. (2020) Gabriel, E. E., Sachs, M. C. & Sjölander, A. (2020). Causal bounds for outcome-dependent sampling in observational studies. Journal of the American Statistical Association, 1–12.
  • Ghassami et al. (2021) Ghassami, A., Ying, A., Shpitser, I. & Tchetgen Tchetgen, E. (2021). Minimax kernel machine learning for a class of doubly robust functionals. Tech. rep.
  • Greenland (2003) Greenland, S. (2003). Quantifying biases in causal models: classical confounding vs collider-stratification bias. Epidemiology 14, 300–306.
  • Heckman (1979) Heckman, J. J. (1979). Sample selection bias as a specification error. Econometrica: Journal of the Econometric Society, 153–161.
  • Heckman et al. (1996) Heckman, J. J., Ichimura, H., Smith, J. & Todd, P. (1996). Sources of selection bias in evaluating social programs: An interpretation of conventional measures and evidence on the effectiveness of matching as a program evaluation method. Proceedings of the National Academy of Sciences 93, 13416–13420.
  • Heckman et al. (1998) Heckman, J. J., Ichimura, H., Smith, J. A. & Todd, P. E. (1998). Characterizing selection bias using experimental data. Econometrica 66, 1017–1098.
  • Hernán et al. (2004) Hernán, M. A., Hernández-Díaz, S. & Robins, J. M. (2004). A structural approach to selection bias. Epidemiology, 615–625.
  • Hernán & Robins (2020) Hernán, M. A. & Robins, J. M. (2020). Causal Inference: What If. Boca Raton: Chapman & Hall/CRC.
  • Jackson & Nelson (2013) Jackson, M. L. & Nelson, J. C. (2013). The test-negative design for estimating influenza vaccine effectiveness. Vaccine 31, 2165–2168.
  • Kallus et al. (2021) Kallus, N., Mao, X. & Uehara, M. (2021). Causal inference under unmeasured confounding with negative controls: A minimax learning approach. arXiv preprint arXiv:2103.14029.
  • Li et al. (2022) Li, K. Q., Shi, X., Miao, W. & Tchetgen Tchetgen, E. (2022). Double negative control inference in test-negative design studies of vaccine effectiveness. arXiv preprint arXiv:2203.12509.
  • Lipsitch et al. (2010) Lipsitch, M., Tchetgen Tchetgen, E. & Cohen, T. (2010). Negative controls: a tool for detecting confounding and bias in observational studies. Epidemiology (Cambridge, Mass.) 21, 383.
  • Manski & McFadden (1981) Manski, C. F. & McFadden, D. (1981). Alternative estimators and sample designs for discrete choice analysis. Structural analysis of discrete data with econometric applications 2, 2–50.
  • Mansournia & Altman (2016) Mansournia, M. A. & Altman, D. G. (2016). Inverse probability weighting. BMJ 352.
  • Miao et al. (2018a) Miao, W., Geng, Z. & Tchetgen Tchetgen, E. J. (2018a). Identifying causal effects with proxy variables of an unmeasured confounder. Biometrika 105, 987–993.
  • Miao et al. (2018b) Miao, W., Shi, X. & Tchetgen Tchetgen, E. (2018b). A confounding bridge approach for double negative control inference on causal effects. arXiv preprint arXiv:1808.04945.
  • Newey & Powell (2003) Newey, W. K. & Powell, J. L. (2003). Instrumental variable estimation of nonparametric models. Econometrica 71, 1565–1578.
  • Pearce (2018) Pearce, N. (2018). Bias in matched case–control studies: Dags are not enough. European journal of epidemiology 33, 1–4.
  • Prentice (1986) Prentice, R. L. (1986). A case-cohort design for epidemiologic cohort studies and disease prevention trials. Biometrika 73, 1–11.
  • Rotnitzky et al. (2021) Rotnitzky, A., Smucler, E. & Robins, J. M. (2021). Characterization of parameters with a mixed bias property. Biometrika 108, 231–238.
  • Schnitzer (2022) Schnitzer, M. E. (2022). Estimands and estimation of covid-19 vaccine effectiveness under the test-negative design: Connections to causal inference. Epidemiology (Cambridge, Mass.) 33, 325.
  • Shi et al. (2020a) Shi, X., Miao, W., Nelson, J. C. & Tchetgen Tchetgen, E. J. (2020a). Multiply robust causal inference with double-negative control adjustment for categorical unmeasured confounding. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 82, 521–540.
  • Shi et al. (2020b) Shi, X., Miao, W. & Tchetgen Tchetgen, E. (2020b). A selective review of negative control methods in epidemiology. Current Epidemiology Reports, 1–13.
  • Stephens et al. (2014) Stephens, A., Tchetgen, E. T. & De Gruttola, V. (2014). Locally efficient estimation of marginal treatment effects when outcomes are correlated: is the prize worth the chase? The international journal of biostatistics 10, 59–75.
  • Streeter et al. (2017) Streeter, A. J., Lin, N. X., Crathorne, L., Haasova, M., Hyde, C., Melzer, D. & Henley, W. E. (2017). Adjusting for unmeasured confounding in nonrandomized longitudinal studies: a methodological review. Journal of clinical epidemiology 87, 23–34.
  • Sullivan et al. (2016) Sullivan, S. G., Tchetgen Tchetgen, E. J. & Cowling, B. J. (2016). Theoretical basis of the test-negative study design for assessment of influenza vaccine effectiveness. American journal of epidemiology 184, 345–353.
  • Tchetgen Tchetgen (2014) Tchetgen Tchetgen, E. (2014). The control outcome calibration approach for causal inference with unobserved confounding. American journal of epidemiology 179, 633–640.
  • Tchetgen Tchetgen et al. (2020) Tchetgen Tchetgen, E. J., Ying, A., Cui, Y., Shi, X. & Miao, W. (2020). An introduction to proximal causal learning. arXiv preprint arXiv:2009.10982.
  • Van der Vaart (2000) Van der Vaart, A. (2000). Asymptotic statistics, vol. 3. Cambridge university press.

Supplementary Material to “Proximal learning of odds ratio under confounded outcome-dependent sampling designs”

S1 Proofs of Theorems.

S1.1 Proof of Theorem 2.1

To prove Equation (2), we have

E[q(A,Z,X)\mid A,W,X,Y=0,S=1]
=E\{E[q(A,Z,X)\mid A,U,W,X,Y=0,S=1]\mid A,W,X,Y=0,S=1\}
\stackrel{\text{A}2.2}{=}E\{E[q(A,Z,X)\mid A,U,X,Y=0,S=1]\mid A,W,X,Y=0,S=1\}
\stackrel{\text{A}2.2}{=}E\{E[q(A,Z,X)\mid A,U,X]\mid A,W,X,Y=0,S=1\}
\stackrel{\text{A}2.3}{=}E\left\{\dfrac{1}{P(A\mid U,X,Y=0,S=1)}\mid A,W,X,Y=0,S=1\right\}
=\int\dfrac{1}{P(A\mid U=u,X,Y=0,S=1)}\,f(u\mid A,W,X,Y=0,S=1)\,du
=\int\dfrac{1}{P(A\mid U=u,X,Y=0,S=1)}\times\dfrac{f(u\mid W,X,Y=0,S=1)\,P(A\mid U=u,W,X,Y=0,S=1)}{P(A\mid W,X,Y=0,S=1)}\,du (21)
\stackrel{\text{A}2.2}{=}\int\dfrac{1}{P(A\mid U=u,X,Y=0,S=1)}\times\dfrac{f(u\mid W,X,Y=0,S=1)\,P(A\mid U=u,X,Y=0,S=1)}{P(A\mid W,X,Y=0,S=1)}\,du (22)
=\dfrac{1}{P(A\mid W,X,Y=0,S=1)}\int f(u\mid W,X,Y=0,S=1)\,du
=\dfrac{1}{P(A\mid W,X,Y=0,S=1)}.

To prove the uniqueness of the solution to Equation (2) under Assumption 2.4(b), suppose there exists another function $q^{*}$ that solves Equation (2), that is,

E\{q^{*}(A,Z,X)\mid A,W,X,Y=0,S=1\}=1/P(A\mid W,X,Y=0,S=1).

Then

E\{q^{*}(A,Z,X)-q(A,Z,X)\mid A,W,X,Y=0,S=1\}=0.

By Assumption 2.4(b), $q^{*}(A,Z,X)=q(A,Z,X)$ almost surely.

Suppose there exists a function $q_{0}$ that solves Equation (1); then $q_{0}$ also solves Equation (2) by the derivation above. By the uniqueness of the solution to Equation (2), $q_{0}(A,Z,X)=q(A,Z,X)$ almost surely, which proves that $q$ is the unique solution to Equation (1).

S1.2 Proof of Theorem 2.2

We first state a useful result for identification:

Lemma S1.

Under Model 2.1 and a selection mechanism that satisfies Assumption 2.1, the log odds ratio satisfies

\beta_{0}=\log\left(\theta_{1c}/\theta_{0c}\right), (23)

where, for $a=0,1$,

\theta_{ac}=E\left\{c(X)\dfrac{P(A=a\mid U,X,Y=1,S=1)}{P(A=a\mid U,X,Y=0,S=1)}P(Y=1\mid U,X,S=1)\mid S=1\right\}. (24)

Proof S2.

Under Model 2.1 and Assumption 2.1, we have

\dfrac{P(Y=1\mid U,X,A=1,S=1)\,P(Y=0\mid U,X,A=0,S=1)}{P(Y=1\mid U,X,A=0,S=1)\,P(Y=0\mid U,X,A=1,S=1)}
=\exp(\beta_{0})\times\dfrac{P(S=1\mid Y=1,A=1,U,X)}{P(S=1\mid Y=0,A=1,U,X)}\times\dfrac{P(S=1\mid Y=0,A=0,U,X)}{P(S=1\mid Y=1,A=0,U,X)}
\stackrel{\text{A}2.1}{=}\exp(\beta_{0})\times\xi(U,X)\times\xi(U,X)^{-1}=\exp(\beta_{0}).

Therefore

\theta_{1c}=E\left[c(X)\dfrac{P(A=1\mid U,X,Y=1,S=1)}{P(A=1\mid U,X,Y=0,S=1)}P(Y=1\mid U,X,S=1)\mid S=1\right]
=E\left[c(X)\dfrac{P(A=0\mid U,X,Y=1,S=1)}{P(A=0\mid U,X,Y=0,S=1)}P(Y=1\mid U,X,S=1)\mid S=1\right]\exp(\beta_{0})
=\theta_{0c}\exp(\beta_{0}),

which proves Lemma S1.

By Lemma S1, it suffices to prove that \theta_{ac}=E[c(X)I(A=a)Yq(A,Z,X)\mid S=1]. Under the proximal independence and treatment confounding bridge assumptions, q(A,Z,X) solves Equation (1). We have

E[c(X)I(A=a)Yq(a,Z,X)S=1]\displaystyle E[c(X)I(A=a)Yq(a,Z,X)\mid S=1]
=\displaystyle= E[c(X)I(A=a)YE(q(a,Z,X)A=a,U,X,Y=1,S=1)S=1]\displaystyle E[c(X)I(A=a)YE\left(q(a,Z,X)\mid A=a,U,X,Y=1,S=1\right)\mid S=1]
\displaystyle\stackrel{{\scriptstyle A.\ref{assump:nc}}}{{=}} E[c(X)I(A=a)YE\left(q(a,Z,X)\mid A=a,U,X,Y=0,S=1\right)\mid S=1]
=A.2.2\displaystyle\stackrel{{\scriptstyle A.\ref{assump:trt-bridge}}}{{=}} E[c(X)I(A=a)Y1P(A=aU,X,Y=0,S=1)S=1]\displaystyle E[c(X)I(A=a)Y\dfrac{1}{P(A=a\mid U,X,Y=0,S=1)}\mid S=1]
=\displaystyle= E[c(X)E{I(A=a)U,X,Y=1,S=1}1P(A=aU,X,Y=0,S=1)YS=1]\displaystyle E[c(X)E\{I(A=a)\mid U,X,Y=1,S=1\}\dfrac{1}{P(A=a\mid U,X,Y=0,S=1)}Y\mid S=1]
=\displaystyle= E[c(X)P(A=aU,X,Y=1,S=1)P(A=aU,X,Y=0,S=1)YS=1]\displaystyle E[c(X)\dfrac{P(A=a\mid U,X,Y=1,S=1)}{P(A=a\mid U,X,Y=0,S=1)}Y\mid S=1]
=\displaystyle= E[c(X)P(A=aU,X,Y=1,S=1)P(A=aU,X,Y=0,S=1)P(Y=1U,X,S=1)S=1]\displaystyle E[c(X)\dfrac{P(A=a\mid U,X,Y=1,S=1)}{P(A=a\mid U,X,Y=0,S=1)}P(Y=1\mid U,X,S=1)\mid S=1]
=\displaystyle= θac.\displaystyle\theta_{ac}. (25)

S1.3 Proof of Theorem 2.3

First, we show that Equation (5) is equivalent to the moment equation

E{(1Y)h(a,W,X)A=a,U,X,S=1}=E(YA=a,U,X,S=1).E\{(1-Y)h(a,W,X)\mid A=a,U,X,S=1\}=E(Y\mid A=a,U,X,S=1). (26)

This is because

E{(1Y)h(a,W,X)A=a,U,X,S=1}\displaystyle E\{(1-Y)h(a,W,X)\mid A=a,U,X,S=1\}
=\displaystyle= E{h(a,W,X)A=a,U,X,Y=0,S=1}P(Y=0A=a,U,X,S=1)\displaystyle E\{h(a,W,X)\mid A=a,U,X,Y=0,S=1\}P(Y=0\mid A=a,U,X,S=1)
=A.2.3\displaystyle\stackrel{{\scriptstyle A.\ref{assump:outcome-bridge}}}{{=}} P(Y=1A=a,U,X,S=1)P(Y=0A=a,U,X,S=1)P(Y=0A=a,U,X,S=1)\displaystyle\dfrac{P(Y=1\mid A=a,U,X,S=1)}{P(Y=0\mid A=a,U,X,S=1)}P(Y=0\mid A=a,U,X,S=1)
=\displaystyle= P(Y=1A=a,U,X,S=1)\displaystyle P(Y=1\mid A=a,U,X,S=1)
=\displaystyle= E(YA=a,U,X,S=1).\displaystyle E(Y\mid A=a,U,X,S=1).

Next, we show that Assumption 2.2 and Equation (26) imply the moment equation

E{(1Y)h(a,W,X)A=a,Z,X,S=1}=E(YA=a,Z,X,S=1).E\{(1-Y)h(a,W,X)\mid A=a,Z,X,S=1\}=E(Y\mid A=a,Z,X,S=1). (27)

This is because

E{(1Y)h(A,W,X)YA,Z,X,S=1}\displaystyle E\{(1-Y)h(A,W,X)-Y\mid A,Z,X,S=1\}
=\displaystyle= E[E{(1Y)h(A,W,X)YA,U,Z,X,S=1}A,Z,X,S=1]\displaystyle E[E\{(1-Y)h(A,W,X)-Y\mid A,U,Z,X,S=1\}\mid A,Z,X,S=1]
=A.2.2\displaystyle\stackrel{{\scriptstyle A.\ref{assump:nc}}}{{=}} E[E{(1Y)h(A,W,X)YA,U,X,S=1}A,Z,X,S=1]\displaystyle E[E\{(1-Y)h(A,W,X)-Y\mid A,U,X,S=1\}\mid A,Z,X,S=1]
=Eq(26)\displaystyle\stackrel{{\scriptstyle Eq\eqref{eq:outcome-bridge-u}}}{{=}} 0.\displaystyle 0.

Finally, the left-hand side of Equation (27) equals

E{h(a,W,X)A=a,Z,X,Y=0,S=1}P(Y=0A=a,Z,X,S=1).E\{h(a,W,X)\mid A=a,Z,X,Y=0,S=1\}P(Y=0\mid A=a,Z,X,S=1).

Rearranging the terms, we obtain Equation (6).

To prove the uniqueness of a solution to Equation (6) under Assumption 2.3(b), suppose there exists another function h^{*} that solves Equation (6), that is,

E\{h^{*}(a,W,X)\mid A=a,Z,X,Y=0,S=1\}=\frac{P(Y=1\mid A=a,Z,X,S=1)}{P(Y=0\mid A=a,Z,X,S=1)}.

Then

E\{h^{*}(a,W,X)-h(a,W,X)\mid A=a,Z,X,Y=0,S=1\}=0.

By Assumption 2.3(b), h^{*}(A,W,X)=h(A,W,X) almost surely.

Suppose there exists a function h_{0} that solves Equation (5). By the argument above, h_{0} also solves Equation (6), and by the uniqueness of a solution to Equation (6), h_{0}(A,W,X)=h(A,W,X) almost surely. This proves that h is the unique solution to Equation (5).

S1.4 Proof of Theorem 2.4

Under Model 2.1 and Assumptions 2.1 and 2.2, the result in Lemma S1 holds. It suffices to prove that

θac=E{c(X)(1Y)h(a,W,X)S=1}\theta_{ac}=E\{c(X)(1-Y)h(a,W,X)\mid S=1\} (28)

for a=0,1a=0,1.

For a=0,1a=0,1, we have that

E[c(X)(1Y)h(a,W,X)S=1]\displaystyle E[c(X)(1-Y)h(a,W,X)\mid S=1]
=\displaystyle= E[c(X)(1Y)E{h(a,W,X)U,X,Y=0,S=1}S=1]\displaystyle E[c(X)(1-Y)E\{h(a,W,X)\mid U,X,Y=0,S=1\}\mid S=1]
=A.2.2\displaystyle\stackrel{{\scriptstyle A.\ref{assump:nc}}}{{=}} E[c(X)(1Y)E{h(a,W,X)A=a,U,X,Y=0,S=1}S=1]\displaystyle E[c(X)(1-Y)E\{h(a,W,X)\mid A=a,U,X,Y=0,S=1\}\mid S=1]
=A.2.3\displaystyle\stackrel{{\scriptstyle A.\ref{assump:outcome-bridge}}}{{=}} E[c(X)(1Y)P(Y=1A=a,U,X,S=1)P(Y=0A=a,U,X,S=1)S=1]\displaystyle E[c(X)(1-Y)\dfrac{P(Y=1\mid A=a,U,X,S=1)}{P(Y=0\mid A=a,U,X,S=1)}\mid S=1]
=\displaystyle= E[c(X)E(1YU,X,S=1)P(Y=1A=a,U,X,S=1)P(Y=0A=a,U,X,S=1)S=1]\displaystyle E[c(X)E(1-Y\mid U,X,S=1)\dfrac{P(Y=1\mid A=a,U,X,S=1)}{P(Y=0\mid A=a,U,X,S=1)}\mid S=1]
=\displaystyle= E[c(X)P(Y=0U,X,S=1)P(Y=1A=a,U,X,S=1)P(Y=0A=a,U,X,S=1)S=1]\displaystyle E[c(X)P(Y=0\mid U,X,S=1)\dfrac{P(Y=1\mid A=a,U,X,S=1)}{P(Y=0\mid A=a,U,X,S=1)}\mid S=1]
=\displaystyle= E[c(X)P(A=aU,X,Y=1,S=1)P(A=aU,X,Y=0,S=1)P(Y=1U,X,S=1)S=1]\displaystyle E[c(X)\dfrac{P(A=a\mid U,X,Y=1,S=1)}{P(A=a\mid U,X,Y=0,S=1)}P(Y=1\mid U,X,S=1)\mid S=1]
=\displaystyle= θac.\displaystyle\theta_{ac}.

S1.5 Proof of Theorem 2.5

To prove

\beta_{0}=\log\bigg(E\big\{c(X)\big[I(A=1)q(A,Z,X)\{Y-(1-Y)h(A,W,X)\}+(1-Y)h(1,W,X)\big]\mid S=1\big\}\Big/
\qquad E\big\{c(X)\big[I(A=0)q(A,Z,X)\{Y-(1-Y)h(A,W,X)\}+(1-Y)h(0,W,X)\big]\mid S=1\big\}\bigg),

by Lemma S1 it suffices to prove

θac\displaystyle\theta_{ac} =E{c(X)[(1Y)h(a,W,X)\displaystyle=E\big{\{}c(X)\big{[}(1-Y)h(a,W,X)-
I(A=a)q(a,Z,X){(1Y)h(A,W,X)Y}]S=1}.\displaystyle\qquad I(A=a)q(a,Z,X)\big{\{}(1-Y)h(A,W,X)-Y\big{\}}\big{]}\mid S=1\big{\}}. (29)

for a=0,1a=0,1.

  1. (a)

If Assumption 2.2 and Equation (1) hold, then q solves Equation (2) and the conclusion of Theorem 2.2 holds.

    The right-hand side of Equation (29) equals

    E{c(X)[(1Y)h(a,W,X)I(A=a)(1Y)q(a,Z,X)h(A,W,X)+\displaystyle E\bigg{\{}c(X)\big{[}(1-Y)h(a,W,X)-I(A=a)(1-Y)q(a,Z,X)h(A,W,X)+
    I(A=a)q(a,Z,X)Y]S=1}\displaystyle\qquad I(A=a)q(a,Z,X)Y\big{]}\mid S=1\bigg{\}}
    =Eq(25)\displaystyle\stackrel{{\scriptstyle Eq\eqref{eq:pf-tac-PIPW}}}{{=}} E{c(X)(1Y)h(a,W,X)[1I(A=a)q(a,Z,X)]S=1}+θac\displaystyle E\bigg{\{}c(X)(1-Y)h(a,W,X)\big{[}1-I(A=a)q(a,Z,X)\big{]}\mid S=1\bigg{\}}+\theta_{ac}
    =\displaystyle= E{c(X)(1Y)h(a,W,X)E[1\displaystyle E\bigg{\{}c(X)(1-Y)h(a,W,X)E\big{[}1-
    I(A=a)q(a,Z,X)A,W,X,Y=0,S=1]S=1}+θac\displaystyle\qquad I(A=a)q(a,Z,X)\mid A,W,X,Y=0,S=1\big{]}\mid S=1\bigg{\}}+\theta_{ac}
    =Eq(2)\displaystyle\stackrel{{\scriptstyle Eq\eqref{eq:trt-bridge}}}{{=}} E{c(X)(1Y)h(a,W,X)[1I(A=a)P(A=aW,X,Y=0,S=1)]S=1}+θac\displaystyle E\bigg{\{}c(X)(1-Y)h(a,W,X)\big{[}1-\dfrac{I(A=a)}{P(A=a\mid W,X,Y=0,S=1)}\big{]}\mid S=1\bigg{\}}+\theta_{ac}
    =\displaystyle= E{c(X)(1Y)[h(a,W,X)I(A=a)h(a,W,X)P(A=aW,X,Y=0,S=1)]S=1}+θac\displaystyle E\bigg{\{}c(X)(1-Y)\big{[}h(a,W,X)-\dfrac{I(A=a)h(a,W,X)}{P(A=a\mid W,X,Y=0,S=1)}\big{]}\mid S=1\bigg{\}}+\theta_{ac}
    =\displaystyle= E{c(X)(1Y)E[h(a,W,X)\displaystyle E\bigg{\{}c(X)(1-Y)E\big{[}h(a,W,X)-
    I(A=a)h(a,W,X)P(A=aW,X,Y=0,S=1)W,X,Y=0,S=1]S=1}+θac\displaystyle\qquad\dfrac{I(A=a)h(a,W,X)}{P(A=a\mid W,X,Y=0,S=1)}\mid W,X,Y=0,S=1\big{]}\mid S=1\bigg{\}}+\theta_{ac}
    =\displaystyle= E{c(X)(1Y)[h(a,W,X)h(a,W,X)]S=1}+θac\displaystyle E\bigg{\{}c(X)(1-Y)\big{[}h(a,W,X)-h(a,W,X)\big{]}\mid S=1\bigg{\}}+\theta_{ac}
    =\displaystyle= θac.\displaystyle\theta_{ac}.
  2. (b)

If Assumption 2.2 and Equation (5) hold, then h solves Equation (27) and the conclusion of Theorem 2.4 holds. The right-hand side of Equation (29) equals

    E{c(X)[(1Y)h(a,W,X)I(A=a)(1Y)q(a,Z,X)h(a,W,X)+\displaystyle E\bigg{\{}c(X)\big{[}(1-Y)h(a,W,X)-I(A=a)(1-Y)q(a,Z,X)h(a,W,X)+
    I(A=a)q(a,Z,X)Y]S=1}\displaystyle\qquad I(A=a)q(a,Z,X)Y\big{]}\mid S=1\bigg{\}}
    =\displaystyle= E{c(X)(1Y)h(a,W,X)S=1}\displaystyle E\bigg{\{}c(X)(1-Y)h(a,W,X)\mid S=1\bigg{\}}-
    E{c(X)I(A=a)q(a,Z,X)[(1Y)h(A,W,X)Y]S=1}\displaystyle\qquad E\bigg{\{}c(X)I(A=a)q(a,Z,X)[(1-Y)h(A,W,X)-Y]\mid S=1\bigg{\}}
    =Eq(28)\displaystyle\stackrel{{\scriptstyle Eq\eqref{eq:pf-POR}}}{{=}} θacE{c(X)I(A=a)q(a,Z,X)×\displaystyle\theta_{ac}-E\bigg{\{}c(X)I(A=a)q(a,Z,X)\times
    E[(1Y)h(A,W,X)YA=a,Z,X,S=1]\displaystyle\qquad E[(1-Y)h(A,W,X)-Y\mid A=a,Z,X,S=1]\mid
    S=1}\displaystyle\qquad S=1\bigg{\}}
    =Eq(27)\displaystyle\stackrel{{\scriptstyle Eq\eqref{eq:outcome-bridge}}}{{=}} θac0\displaystyle\theta_{ac}-0
    =\displaystyle= θac.\displaystyle\theta_{ac}.

S1.6 Proof of Theorem 2.6

To prove Equation (10), we have

E{(1Y)[κ1(A,W,X)q(A,Z,X)κ1(1,W,X)κ1(0,W,X)]S=1}\displaystyle E\{(1-Y)[\kappa_{1}(A,W,X)q(A,Z,X)-\kappa_{1}(1,W,X)-\kappa_{1}(0,W,X)]\mid S=1\}
=\displaystyle= E{(1Y)[κ1(A,W,X)E[q(A,Z,X)A,W,X,Y=0,S=1]κ1(1,W,X)\displaystyle E\{(1-Y)[\kappa_{1}(A,W,X)E[q(A,Z,X)\mid A,W,X,Y=0,S=1]-\kappa_{1}(1,W,X)-
κ1(0,W,X)]S=1}\displaystyle\qquad\kappa_{1}(0,W,X)]\mid S=1\}
=Eq(2)\displaystyle\stackrel{{\scriptstyle Eq\eqref{eq:trt-bridge}}}{{=}} E{(1Y)[κ1(A,W,X)P(AW,X,Y=0,S=1)κ1(1,W,X)κ1(0,W,X)]S=1}\displaystyle E\{(1-Y)[\dfrac{\kappa_{1}(A,W,X)}{P(A\mid W,X,Y=0,S=1)}-\kappa_{1}(1,W,X)-\kappa_{1}(0,W,X)]\mid S=1\}
=\displaystyle= E{(1Y)[E(κ1(A,W,X)P(AW,X,Y=0,S=1)W,X,Y=0,S=1)κ1(1,W,X)\displaystyle E\{(1-Y)[E\left(\dfrac{\kappa_{1}(A,W,X)}{P(A\mid W,X,Y=0,S=1)}\mid W,X,Y=0,S=1\right)-\kappa_{1}(1,W,X)-
κ1(0,W,X)]S=1}\displaystyle\qquad\kappa_{1}(0,W,X)]\mid S=1\}
=\displaystyle= E{(1Y)[κ1(1,W,X)+κ1(0,W,X)κ1(1,W,X)κ1(0,W,X)]S=1}\displaystyle E\{(1-Y)[\kappa_{1}(1,W,X)+\kappa_{1}(0,W,X)-\kappa_{1}(1,W,X)-\kappa_{1}(0,W,X)]\mid S=1\}
=\displaystyle= 0.\displaystyle 0.

Equation (11) follows directly from Equation (27).

S1.7 Proof of Theorem 2.9

Under Model 2.1 and Assumptions 2.1 and 2.2, Lemma S1 holds.

  1. (a)

If q(A,Z,X;\tau) is correctly specified and Assumption 2.2(b) holds, then Equation (10) has a unique solution \tau_{0}, which satisfies q(A,Z,X;\tau_{0})=q(A,Z,X), and q(A,Z,X;\hat{\tau}) is consistent for q(A,Z,X). By the law of large numbers and the continuous mapping theorem, we have

    β^c,PIPW\displaystyle\hat{\beta}_{c,\text{PIPW}} 𝑝log(E{c(X)I(A=1)Yq(A,Z,X)S=1}E{c(X)I(A=0)Yq(A,Z,X)S=1})\displaystyle\overset{p}{\rightarrow}\log\left(\dfrac{E\left\{c(X)I(A=1)Yq(A,Z,X)\mid S=1\right\}}{E\left\{c(X)I(A=0)Yq(A,Z,X)\mid S=1\right\}}\right)
    =Eq(25)log(θ1c/θ0c)\displaystyle\stackrel{{\scriptstyle Eq\eqref{eq:pf-tac-PIPW}}}{{=}}\log(\theta_{1c}/\theta_{0c})
    =β0.\displaystyle=\beta_{0}.
  2. (b)

If h(A,W,X;\psi) is correctly specified and Assumption 2.3(b) holds, then Equation (11) has a unique solution \psi_{0}, which satisfies h(A,W,X;\psi_{0})=h(A,W,X), and h(A,W,X;\hat{\psi}) is consistent for h(A,W,X). By the law of large numbers and the continuous mapping theorem, we have

    β^c,POR\displaystyle\hat{\beta}_{c,\text{POR}} 𝑝log(E{c(X)(1Y)h(1,W,X)S=1}E{c(X)(1Y)h(0,W,X)S=1})\displaystyle\overset{p}{\rightarrow}log\left(\dfrac{E\left\{c(X)(1-Y)h(1,W,X)\mid S=1\right\}}{E\left\{c(X)(1-Y)h(0,W,X)\mid S=1\right\}}\right)
    =Eq(28)log(θ1c/θ0c)\displaystyle\stackrel{{\scriptstyle Eq\eqref{eq:pf-POR}}}{{=}}\log(\theta_{1c}/\theta_{0c})
    =β0.\displaystyle=\beta_{0}.
  3. (c)

Write q^{*}(A,Z,X) and h^{*}(A,W,X) for the probability limits of q(A,Z,X;\hat{\tau}) and h(A,W,X;\hat{\psi}), respectively. The stated result follows from the law of large numbers, the continuous mapping theorem, and the proof of Theorem 2.5.

S2 Approximate equivalence of RR and OR for a rare outcome

Under Model 2.1, the conditional odds ratio is

\text{OR}(U,X)=\exp(\beta_{0})=\dfrac{P(Y=1\mid A=1,U,X)\,P(Y=0\mid A=0,U,X)}{P(Y=1\mid A=0,U,X)\,P(Y=0\mid A=1,U,X)}.

Consider the conditional risk ratio

RR(U,X):=P(Y=1A=1,U,X)P(Y=1A=0,U,X)=OR(U,X)×P(Y=0A=1,U,X)P(Y=0A=0,U,X).\text{RR}(U,X):=\dfrac{P(Y=1\mid A=1,U,X)}{P(Y=1\mid A=0,U,X)}=\text{OR}(U,X)\times\dfrac{P(Y=0\mid A=1,U,X)}{P(Y=0\mid A=0,U,X)}.

Under a rare infection assumption that

0P(Y=1A=a,U,X)δ0\leq P(Y=1\mid A=a,U,X)\leq\delta

almost surely for a=0,1 and a small \delta>0, OR and RR satisfy

1δRR(U,X)OR(U,X)=P(Y=0A=1,U,X)P(Y=0A=0,U,X)11δ.1-\delta\leq\dfrac{\text{RR}(U,X)}{\text{OR}(U,X)}=\dfrac{P(Y=0\mid A=1,U,X)}{P(Y=0\mid A=0,U,X)}\leq\dfrac{1}{1-\delta}. (30)

If δ0\delta\approx 0, then ORRR\text{OR}\approx\text{RR}.

Furthermore, let Y(a) denote the potential outcome had the individual received treatment A=a. We make the following standard identifiability assumptions (Hernán MA, 2020):

  1. (a)

    (Consistency) Y(a)=YY(a)=Y if A=aA=a;

  2. (b)

    (Confounded ignorability) Y(a)AU,XY(a)\perp\!\!\!\perp A\mid U,X for a=0,1a=0,1;

  3. (c)

    (Positivity) 0<P(A=1U,X)<10<P(A=1\mid U,X)<1 almost surely.

Under these assumptions, the marginal causal risk ratio is

cRR :=P{Y(1)=1}P{Y(0)=1}\displaystyle:=\dfrac{P\{Y(1)=1\}}{P\{Y(0)=1\}}
=P{Y(1)=1U=u,X=x}f(u,x)𝑑u𝑑xP{Y(0)=1U=u,X=x}f(u,x)𝑑u𝑑x\displaystyle=\dfrac{\int\int P\{Y(1)=1\mid U=u,X=x\}f(u,x)dudx}{\int\int P\{Y(0)=1\mid U=u,X=x\}f(u,x)dudx}
=(b)P{Y(1)=1A=1,U=u,X=x}f(u,x)𝑑u𝑑xP{Y(0)=1A=0,U=u,X=x}f(u,x)𝑑u𝑑x\displaystyle\stackrel{{\scriptstyle(b)}}{{=}}\dfrac{\int\int P\{Y(1)=1\mid A=1,U=u,X=x\}f(u,x)dudx}{\int\int P\{Y(0)=1\mid A=0,U=u,X=x\}f(u,x)dudx}
=(a)P{Y=1A=1,U=u,X=x}f(u,x)𝑑u𝑑xP{Y=1A=0,U=u,X=x}f(u,x)𝑑u𝑑x\displaystyle\stackrel{{\scriptstyle(a)}}{{=}}\dfrac{\int\int P\{Y=1\mid A=1,U=u,X=x\}f(u,x)dudx}{\int\int P\{Y=1\mid A=0,U=u,X=x\}f(u,x)dudx}
=RR(U,X)P{Y=1A=0,U=u,X=x}f(u,x)𝑑u𝑑xP{Y=1A=0,U=u,X=x}f(u,x)𝑑u𝑑x\displaystyle=\dfrac{\int\int\text{RR}(U,X)P\{Y=1\mid A=0,U=u,X=x\}f(u,x)dudx}{\int\int P\{Y=1\mid A=0,U=u,X=x\}f(u,x)dudx}
Eq(30)11δexp(β0)\displaystyle\stackrel{{\scriptstyle Eq\eqref{eq:or-rr}}}{{\leq}}\dfrac{1}{1-\delta}\exp(\beta_{0})

and

cRR(1δ)exp(β0).\text{cRR}\geq(1-\delta)\exp(\beta_{0}).

If δ0\delta\approx 0, then cRRexp(β0)\text{cRR}\approx\exp(\beta_{0}).

S3 Identification of marginal causal risk ratio in a test-negative design with diseased controls

Similar to Schnitzer (2022), we denote DD as a categorical variable indicating a subject’s outcome, where D=2D=2 indicates Y=1Y=1, i.e. the subject is positive for the infection of interest; D=1D=1 indicates that the subject has Y=0Y=0 but is positive for another outcome, referred to as the “control infection”, that is known a priori to not be affected by AA; and D=0D=0 indicates that neither outcome is positive. To formalize the assumption of no treatment effect on the control disease, we assume

P(D=1A=1,U,X)=P(D=1A=0,U,X).P(D=1\mid A=1,U,X)=P(D=1\mid A=0,U,X). (31)

As sample selection is restricted to subjects who are positive for at least one of the two infections, we have

P(D=0S=1,U,X,A)=0.P(D=0\mid S=1,U,X,A)=0. (32)

Rather than Assumption 2.1, we make the stronger assumption that

ASD,U,XandASY,U,X.A\perp\!\!\!\perp S\mid D,U,X\quad\text{and}\quad A\perp\!\!\!\perp S\mid Y,U,X. (33)

That is, a subject’s treatment status has no impact on the sample selection. This assumption holds in a test-negative study if, given a subject’s disease status and other traits included in (U,X)(U,X), their decision to seek care does not depend on their treatment status.

Under Model 2.1 and the above assumptions, we have

exp(β0)\displaystyle\exp(\beta_{0}) =P(A=1Y=1,U,X)P(A=0Y=0,U,X)P(A=1Y=0,U,X)P(A=0Y=1,U,X)\displaystyle=\dfrac{P(A=1\mid Y=1,U,X)P(A=0\mid Y=0,U,X)}{P(A=1\mid Y=0,U,X)P(A=0\mid Y=1,U,X)}
=Eq(33)P(A=1Y=1,U,X,S=1)P(A=0Y=0,U,X,S=1)P(A=1Y=0,U,X,S=1)P(A=0Y=1,U,X,S=1)\displaystyle\stackrel{{\scriptstyle Eq\eqref{eq:sel-s2}}}{{=}}\dfrac{P(A=1\mid Y=1,U,X,S=1)P(A=0\mid Y=0,U,X,S=1)}{P(A=1\mid Y=0,U,X,S=1)P(A=0\mid Y=1,U,X,S=1)}
=P(A=1D=2,U,X,S=1)P(A=0D2,U,X,S=1)P(A=1D2,U,X,S=1)P(A=0D=2,U,X,S=1)\displaystyle=\dfrac{P(A=1\mid D=2,U,X,S=1)P(A=0\mid D\neq 2,U,X,S=1)}{P(A=1\mid D\neq 2,U,X,S=1)P(A=0\mid D=2,U,X,S=1)}
=Eq(32)P(A=1D=2,U,X,S=1)P(A=0D=1,U,X,S=1)P(A=1D=1,U,X,S=1)P(A=0D=2,U,X,S=1)\displaystyle\stackrel{{\scriptstyle Eq\eqref{eq:sel-s1}}}{{=}}\dfrac{P(A=1\mid D=2,U,X,S=1)P(A=0\mid D=1,U,X,S=1)}{P(A=1\mid D=1,U,X,S=1)P(A=0\mid D=2,U,X,S=1)}
=Eq(33)P(A=1D=2,U,X)P(A=0D=1,U,X)P(A=1D=1,U,X)P(A=0D=2,U,X)\displaystyle\stackrel{{\scriptstyle Eq\eqref{eq:sel-s2}}}{{=}}\dfrac{P(A=1\mid D=2,U,X)P(A=0\mid D=1,U,X)}{P(A=1\mid D=1,U,X)P(A=0\mid D=2,U,X)}
=P(D=2A=1,U,X)P(D=1A=0,U,X)P(D=1A=1,U,X)P(D=2A=0,U,X)\displaystyle=\dfrac{P(D=2\mid A=1,U,X)P(D=1\mid A=0,U,X)}{P(D=1\mid A=1,U,X)P(D=2\mid A=0,U,X)}
\stackrel{{\scriptstyle Eq\eqref{eq:nco}}}{{=}}\dfrac{P(D=2\mid A=1,U,X)}{P(D=2\mid A=0,U,X)}
=\dfrac{P(Y=1\mid A=1,U,X)}{P(Y=1\mid A=0,U,X)}
\equiv\text{RR}(U,X).

Under the standard identifiability assumptions in Section S2, the marginal causal risk ratio is

cRR =RR(U,X)P{Y=1A=0,U=u,X=x}f(u,x)𝑑u𝑑xP{Y=1A=0,U=u,X=x}f(u,x)𝑑u𝑑x\displaystyle=\dfrac{\int\int\text{RR}(U,X)P\{Y=1\mid A=0,U=u,X=x\}f(u,x)dudx}{\int\int P\{Y=1\mid A=0,U=u,X=x\}f(u,x)dudx}
=exp(β0).\displaystyle=\exp(\beta_{0}).

In the above discussion, we do not invoke the rare outcome assumption.

S4 Regularity conditions for the existence of a solution to Equations (1) and (26)

The results in this section are adapted directly from Appendix B of Cui et al. (2020) and Section B of the Supplementary Materials of Li et al. (2022).

Let L^{2}\{F(t)\} denote the Hilbert space of all square-integrable functions of t with respect to distribution function F(t), equipped with inner product \langle g_{1},g_{2}\rangle=\int g_{1}(t)g_{2}(t)dF(t). Let T_{a,u,x}:L^{2}\{F(z\mid a,x,Y=0,S=1)\}\rightarrow L^{2}\{F(u\mid a,x,Y=0,S=1)\} denote the conditional expectation operator T_{a,u,x}h=E[h(a,Z,x)\mid A=a,U=u,X=x,Y=0,S=1] and let (\lambda_{a,u,x,n},\varphi_{a,u,x,n},\phi_{a,u,x,n}) denote a singular value decomposition of T_{a,u,x}. Similarly, let T^{\prime}_{a,u,x}:L^{2}\{F(w\mid a,x,Y=0,S=1)\}\rightarrow L^{2}\{F(u\mid a,x,Y=0,S=1)\} denote the conditional expectation operator T^{\prime}_{a,u,x}g=E[g(a,W,x)\mid A=a,U=u,X=x,Y=0,S=1] and let (\lambda^{\prime}_{a,u,x,n},\varphi^{\prime}_{a,u,x,n},\phi^{\prime}_{a,u,x,n}) denote a singular value decomposition of T^{\prime}_{a,u,x}. Consider the following regularity conditions:

  1. (1)

    f(z|a,u,x,Y=0,S=1)f(u|a,z,x,Y=0,S=1)𝑑z𝑑u<\int\int f(z|a,u,x,Y=0,S=1)f(u|a,z,x,Y=0,S=1)dzdu<\infty;

  2. (2)

    P2(A=a|U=u,X=x,Y=0,S=1)f(u|a,x,Y=0,S=1)du<\int P^{-2}(A=a|U=u,X=x,Y=0,S=1)f(u|a,x,Y=0,S=1)du<\infty;

  3. (3)

    n=1λa,u,x,n2|P1(A=a|U=u,X=x,Y=0,S=1),ϕa,u,x,n|2<\sum_{n=1}^{\infty}\lambda_{a,u,x,n}^{-2}|\langle P^{-1}(A=a|U=u,X=x,Y=0,S=1),\phi_{a,u,x,n}\rangle|^{2}<\infty;

  4. (4)

    n=1(λa,u,x,n)2|P1(A=a|U=u,X=x,Y=0,S=1),ϕa,u,x,n|2<\sum_{n=1}^{\infty}(\lambda_{a,u,x,n}^{\prime})^{-2}|\langle P^{-1}(A=a|U=u,X=x,Y=0,S=1),\phi_{a,u,x,n}^{\prime}\rangle|^{2}<\infty.

The solution to Equation (1) exists if Assumption 2.2(a) and regularity conditions (1), (2) and (3) hold. The solution to Equation (5) exists if Assumption 2.3(a) and regularity conditions (1), (2) and (4) hold.

S5 Summary of our assumptions and results and those in Li et al. (2022)

Table S1: Summary of models, assumptions and identification results, with comparison to those in Li et al. (2022) for identifying the conditional risk ratio.

Part 1: Li et al. (2022), Sections 2.2-2.4 (left column) versus PIPW identification, Li et al. (2022) Section 2.5 (right column).

Model: \log\{P(Y=1\mid A=1,U,X)/P(Y=1\mid A=0,U,X)\}=\beta_{0} | Model 2.1
Interpretation of \beta_{0}: conditional risk ratio | conditional odds ratio
Selection mechanism: A\perp\!\!\!\perp S\mid Y,U,X | Assumption 2.1
Proximal independence: Z\perp\!\!\!\perp(Y,S)\mid A,U,X; W\perp\!\!\!\perp A\mid U,X; W\perp\!\!\!\perp Z\mid A,U,X,Y; S\perp\!\!\!\perp Z\mid A,U,X,W,Y | Assumption 2.2
Definition of q: E\{q(a,Z,X)\mid A=a,U,X\}=1/P(A=a\mid U,X) | Equation (1)
Definition of h: / | /
Completeness: Assumption 2.2(b) | Assumption 2.2(b)
Identification of q: Equation (2) | Equation (2)
Identification of h: / | /
Identification of \beta_{0}: Equation (4) | Equation (4)
Note: (1) identification of q and \beta_{0} requires a rare outcome condition; (2) under the standard identifiability assumptions in Section S2, \beta_{0} identifies the marginal causal risk ratio; (3) the proximal independence conditions can be replaced by Assumption 2.2 | Under the standard identifiability assumptions in Section S2, if either the outcome is rare or under the setting of Section S3, \beta_{0} identifies the marginal causal risk ratio

Part 2: POR identification (left column) versus PDR closed-form expression (right column).

Model: Model 2.1 | Model 2.1
Interpretation of \beta_{0}: conditional odds ratio | conditional odds ratio
Selection mechanism: Assumption 2.1 | Assumption 2.1
Proximal independence: Assumption 2.2 | Assumption 2.2
Definition of q: / | Equation (1)
Definition of h: Equation (5) | Equation (5)
Completeness: Assumption 2.3(b) | Assumptions 2.2(b), 2.3(b)
Identification of q: / | Equation (2) + Assumption 2.2(b)
Identification of h: Equation (6) | Equation (6) + Assumption 2.3(b)
Identification of \beta_{0}: Equation (8) | Equation (9)
Note: under the standard identifiability assumptions in Section S2, if either the outcome is rare or under the setting of Section S3, \beta_{0} identifies the marginal causal risk ratio | (1) under the standard identifiability assumptions in Section S2, if either the outcome is rare or under the setting of Section S3, \beta_{0} identifies the marginal causal risk ratio; (2) Equation (9) holds if either q is a solution to Equation (1) or h is a solution to Equation (5)

S6 Detailed setting and derivation of q(A,Z,X)q(A,Z,X) and h(A,W,X)h(A,W,X) in Example 2.8

Suppose Z and W are both multivariate Gaussian random variables satisfying

ZA,U,XN(μ0Z+μAZA+μUZTU+μXZTX,ΣZ),\displaystyle Z\mid A,U,X\sim N(\mu_{0Z}+\mu_{AZ}A+\mu_{UZ}^{T}U+\mu_{XZ}^{T}X,\Sigma_{Z}),
WA,U,X,YN(μ0W+μUWTU+μXWTX+μYWY,ΣW).\displaystyle W\mid A,U,X,Y\sim N(\mu_{0W}+\mu_{UW}^{T}U+\mu_{XW}^{T}X+\mu_{YW}Y,\Sigma_{W}).

Assume that the treatment AA and outcome YY both follow a logistic regression model given (U,X)(U,X):

P(A=1U,X)\displaystyle P(A=1\mid U,X) =[1+exp{(μ0A+μUATU+μXATX)}]1,\displaystyle=[1+\exp\{-(\mu_{0A}+\mu_{UA}^{T}U+\mu_{XA}^{T}X)\}]^{-1},
P(Y=1\mid A,U,X) =[1+\exp\{-(\mu_{0Y}+\beta_{0}A+\mu_{UY}^{T}U+\mu_{XY}^{T}X)\}]^{-1}.

Further assume the selection indicator SS follows a log-linear model of the form:

P(S=1A,U,X,Y)=exp(μASA+μYSY+ξ(U,X)).P(S=1\mid A,U,X,Y)=\exp(\mu_{AS}A+\mu_{YS}Y+\xi(U,X)).

It is straightforward to verify that the data distribution satisfies Model 2.1 and Assumptions 2.1-2.2. Then a suitable model for hh is

h(a,w,x;ψ)=exp(ψ0+ψAa+ψWTw+ψXTx).h(a,w,x;\psi)=\exp(\psi_{0}+\psi_{A}a+\psi_{W}^{T}w+\psi_{X}^{T}x). (34)

While a closed-form parametric model for qq does not exist in this example, if the outcome YY is rare in the target population in the sense that

P(Y=1A=a,U=u,X=x)0almost surely,P(Y=1\mid A=a,U=u,X=x)\approx 0\qquad\text{almost surely},

then the parametric model below may serve as an approximation

q(a,z,x;\tau)=1+\exp\{(1-2a)(\tau_{0}+\tau_{A}a+\tau_{Z}^{T}z+\tau_{X}^{T}x)\}. (35)

We first derive the outcome confounding bridge function hh that satisfies Equation (5):

E{h(a,W,X)A=a,U,X,Y=0,S=1}=\displaystyle E\{h(a,W,X)\mid A=a,U,X,Y=0,S=1\}=
P(Y=1A=a,U,X,S=1)/P(Y=0A=a,U,X,S=1).\displaystyle\qquad P(Y=1\mid A=a,U,X,S=1)/P(Y=0\mid A=a,U,X,S=1).

We have

P(Y=1\mid A=a,U,X)P(S=1\mid A=a,Y=1,U,X)=
\qquad\dfrac{\exp(\mu_{0Y}+\beta_{0}a+\mu_{UY}^{T}U+\mu_{XY}^{T}X)}{1+\exp(\mu_{0Y}+\beta_{0}a+\mu_{UY}^{T}U+\mu_{XY}^{T}X)}\exp(\mu_{AS}a+\mu_{YS}+\xi(U,X)),
P(Y=0\mid A=a,U,X)P(S=1\mid A=a,Y=0,U,X)=
\qquad\dfrac{1}{1+\exp(\mu_{0Y}+\beta_{0}a+\mu_{UY}^{T}U+\mu_{XY}^{T}X)}\exp(\mu_{AS}a+\xi(U,X)).

Therefore

P(Y=1A=a,U,X,S=1)\displaystyle P(Y=1\mid A=a,U,X,S=1)
=\displaystyle= P(Y=1A=a,U,X)P(S=1A=a,Y=1,U,X)/{P(Y=1A=a,U,X)×\displaystyle P(Y=1\mid A=a,U,X)P(S=1\mid A=a,Y=1,U,X)/\{P(Y=1\mid A=a,U,X)\times
P(S=1A=a,Y=1,U,X)+P(Y=0A=a,U,X)P(S=1A=a,Y=0,U,X)}\displaystyle\qquad P(S=1\mid A=a,Y=1,U,X)+P(Y=0\mid A=a,U,X)P(S=1\mid A=a,Y=0,U,X)\}
=\displaystyle= [1+exp{(μ0Y+μYS+β0a+μUYU+μXYTX)}]1.\displaystyle[1+\exp\{-(\mu_{0Y}+\mu_{YS}+\beta_{0}a+\mu_{UY}U+\mu_{XY}^{T}X)\}]^{-1}.

and

P(Y=1A=a,U,X,S=1)P(Y=0A=a,U,X,S=1)=exp(μ0Y+μYS+β0a+μUYU+μXYTX).\displaystyle\dfrac{P(Y=1\mid A=a,U,X,S=1)}{P(Y=0\mid A=a,U,X,S=1)}=\exp(\mu_{0Y}+\mu_{YS}+\beta_{0}a+\mu_{UY}U+\mu_{XY}^{T}X).

We further have

W\mid A=a,U,X,Y=0,S=1\sim N(\mu_{0W}+\mu_{UW}^{T}U+\mu_{XW}^{T}X,\,\Sigma_{W}).

Now assume that h(a,w,x)=\exp(\psi_{0}+\psi_{A}a+\psi_{W}^{T}w+\psi_{X}^{T}x); then

E\{h(A,W,X)\mid A,U,X,Y=0,S=1\}
=\displaystyle= exp(ψ0+ψAA+ψXTX)E{exp(ψWW)A,U,X,Y=0,S=1}\displaystyle\exp(\psi_{0}+\psi_{A}A+\psi_{X}^{T}X)E\{\exp(\psi_{W}W)\mid A,U,X,Y=0,S=1\}
=\displaystyle= exp(ψ0+ψAA+ψXTX)exp(ψWT(μ0W+μUWTU+μXWTX)+12ψWTΣWψW)\displaystyle\exp(\psi_{0}+\psi_{A}A+\psi_{X}^{T}X)\exp(\psi_{W}^{T}(\mu_{0W}+\mu_{UW}^{T}U+\mu_{XW}^{T}X)+\frac{1}{2}\psi_{W}^{T}\Sigma_{W}\psi_{W})
=\displaystyle= exp{(ψ0+ψWTμ0W+12ψWTΣWψW)+ψAA+ψWTμUWTU+(ψXT+ψWTμXWT)X}\displaystyle\exp\{(\psi_{0}+\psi_{W}^{T}\mu_{0W}+\frac{1}{2}\psi_{W}^{T}\Sigma_{W}\psi_{W})+\psi_{A}A+\psi_{W}^{T}\mu_{UW}^{T}U+(\psi_{X}^{T}+\psi_{W}^{T}\mu_{XW}^{T})X\}

Hence ψ\psi satisfies

\psi_{A}=\beta_{0},\qquad\psi_{W}=\mu_{UW}^{+}\mu_{UY},\qquad\psi_{X}=\mu_{XY}-\mu_{XW}\psi_{W},
\psi_{0}=\mu_{0Y}+\mu_{YS}-\psi_{W}^{T}\mu_{0W}-\dfrac{1}{2}\psi_{W}^{T}\Sigma_{W}\psi_{W}.

Here A+A^{+} denotes the generalised inverse of the matrix AA.

To derive q(A,Z,X)q(A,Z,X), we have

P(A=1U,X)P(Y=0A=1,U,X)P(S=1A=1,Y=0,U,X)\displaystyle P(A=1\mid U,X)P(Y=0\mid A=1,U,X)P(S=1\mid A=1,Y=0,U,X)
=\displaystyle= exp(μ0A+μUATU+μXATX)1+exp(μ0A+μUATU+μXATX)11+exp(μ0Y+β0+μUYTU+μXYTX)×\displaystyle\dfrac{\exp(\mu_{0A}+\mu_{UA}^{T}U+\mu_{XA}^{T}X)}{1+\exp(\mu_{0A}+\mu_{UA}^{T}U+\mu_{XA}^{T}X)}\dfrac{1}{1+\exp(\mu_{0Y}+\beta_{0}+\mu_{UY}^{T}U+\mu_{XY}^{T}X)}\times
exp{μAS+ξ(U,X)},\displaystyle\qquad\exp\{\mu_{AS}+\xi(U,X)\},
and P(A=0U,X)P(Y=0A=0,U,X)P(S=1A=0,Y=0,U,X)\displaystyle P(A=0\mid U,X)P(Y=0\mid A=0,U,X)P(S=1\mid A=0,Y=0,U,X)
=\displaystyle= 11+exp(μ0A+μUATU+μXATX)11+exp(μ0Y+μUYTU+μXYTX)×\displaystyle\dfrac{1}{1+\exp(\mu_{0A}+\mu_{UA}^{T}U+\mu_{XA}^{T}X)}\dfrac{1}{1+\exp(\mu_{0Y}+\mu_{UY}^{T}U+\mu_{XY}^{T}X)}\times
exp{ξ(U,X)}.\displaystyle\qquad\exp\{\xi(U,X)\}.

Therefore

P(A=1\mid U,X,Y=0,S=1)
= P(A=1\mid U,X)P(Y=0\mid A=1,U,X)P(S=1\mid A=1,Y=0,U,X)\,/
\qquad\{P(A=1\mid U,X)P(Y=0\mid A=1,U,X)P(S=1\mid A=1,Y=0,U,X)+
\qquad P(A=0\mid U,X)P(Y=0\mid A=0,U,X)P(S=1\mid A=0,Y=0,U,X)\}
= \dfrac{\exp(\mu_{0A}+\mu_{AS}+\mu_{UA}^{T}U+\mu_{XA}^{T}X)}{\exp(\mu_{0A}+\mu_{AS}+\mu_{UA}^{T}U+\mu_{XA}^{T}X)+\dfrac{1+\exp(\mu_{0Y}+\beta_{0}+\mu_{UY}^{T}U+\mu_{XY}^{T}X)}{1+\exp(\mu_{0Y}+\mu_{UY}^{T}U+\mu_{XY}^{T}X)}}.

In the case where P(Y=1\mid A,U,X) is small almost surely, both \exp(\mu_{0Y}+\beta_{0}+\mu_{UY}^{T}U+\mu_{XY}^{T}X) and \exp(\mu_{0Y}+\mu_{UY}^{T}U+\mu_{XY}^{T}X) are necessarily small, so that

1+exp(μ0Y+β0+μUYTU+μXYTX)1+exp(μ0Y+μUYTU+μXYTX)1,\dfrac{1+\exp(\mu_{0Y}+\beta_{0}+\mu_{UY}^{T}U+\mu_{XY}^{T}X)}{1+\exp(\mu_{0Y}+\mu_{UY}^{T}U+\mu_{XY}^{T}X)}\approx 1,

and therefore

P(A=1|U,X,Y=0,S=1)[1+exp{(μ0A+μAS+μUATU+μXATX)}]1.P(A=1|U,X,Y=0,S=1)\approx[1+\exp\{-(\mu_{0A}+\mu_{AS}+\mu_{UA}^{T}U+\mu_{XA}^{T}X)\}]^{-1}.

We need to find a function q(a,z,x) that satisfies Equation (1):

E[q(1,Z,X)|A=1,U,X]=1+exp{(μ0A+μAS+μUATU+μXATX)}\displaystyle E[q(1,Z,X)|A=1,U,X]=1+\exp\{-(\mu_{0A}+\mu_{AS}+\mu_{UA}^{T}U+\mu_{XA}^{T}X)\}

and

E[q(0,Z,X)|A=0,U,X]=1+exp(μ0A+μAS+μUATU+μXATX).\displaystyle E[q(0,Z,X)|A=0,U,X]=1+\exp(\mu_{0A}+\mu_{AS}+\mu_{UA}^{T}U+\mu_{XA}^{T}X).

We consider q(A,Z,X)=1+exp{(12A)(τ0+τAA+τZTZ+τXTX)}q(A,Z,X)=1+\exp\{(1-2A)(\tau_{0}+\tau_{A}A+\tau_{Z}^{T}Z+\tau_{X}^{T}X)\}, then

E[q(1,Z,X)|A=1,U,X]\displaystyle E[q(1,Z,X)|A=1,U,X]
=\displaystyle= 1+exp(τ0τAτXTX)E[exp(τZTZ)|A=1,U,X]\displaystyle 1+\exp(-\tau_{0}-\tau_{A}-\tau_{X}^{T}X)E[\exp(-\tau_{Z}^{T}Z)|A=1,U,X]
=\displaystyle= 1+exp(τ0τAτXTX)exp{τZT(μ0Z+μAZ+μUZTU+μXZTX)+12τZTΣZτZ}\displaystyle 1+\exp(-\tau_{0}-\tau_{A}-\tau_{X}^{T}X)\exp\{-\tau_{Z}^{T}(\mu_{0Z}+\mu_{AZ}+\mu_{UZ}^{T}U+\mu_{XZ}^{T}X)+\frac{1}{2}\tau_{Z}^{T}\Sigma_{Z}\tau_{Z}\}
=\displaystyle= 1+exp[{(τ0+τA+τZTμ0Z+τZTμAZ12τZTΣZτZ)+τZTμUZTU+(τXT+τZTμXZT)X}]\displaystyle 1+\exp[-\{(\tau_{0}+\tau_{A}+\tau_{Z}^{T}\mu_{0Z}+\tau_{Z}^{T}\mu_{AZ}-\frac{1}{2}\tau_{Z}^{T}\Sigma_{Z}\tau_{Z})+\tau_{Z}^{T}\mu_{UZ}^{T}U+(\tau_{X}^{T}+\tau_{Z}^{T}\mu_{XZ}^{T})X\}]

and

E[q(0,Z,X)|A=0,U,X]\displaystyle E[q(0,Z,X)|A=0,U,X]
=\displaystyle= 1+exp(τ0+τXTX)E[exp(τZTZ)A=0,U,X]\displaystyle 1+\exp(\tau_{0}+\tau_{X}^{T}X)E[\exp(\tau_{Z}^{T}Z)\mid A=0,U,X]
=\displaystyle= 1+exp(τ0+τXTX)exp{τZT(μ0Z+μUZTU+μXZTX)+12τZTΣZτZ}\displaystyle 1+\exp(\tau_{0}+\tau_{X}^{T}X)\exp\{\tau_{Z}^{T}(\mu_{0Z}+\mu_{UZ}^{T}U+\mu_{XZ}^{T}X)+\frac{1}{2}\tau_{Z}^{T}\Sigma_{Z}\tau_{Z}\}
=\displaystyle= 1+exp{(τ0+τZTμ0Z+12τZTΣZτZ)+τZTμUZTU+(τXT+τZTμXZT)X}\displaystyle 1+\exp\{(\tau_{0}+\tau_{Z}^{T}\mu_{0Z}+\frac{1}{2}\tau_{Z}^{T}\Sigma_{Z}\tau_{Z})+\tau_{Z}^{T}\mu_{UZ}^{T}U+(\tau_{X}^{T}+\tau_{Z}^{T}\mu_{XZ}^{T})X\}

Hence τ\tau satisfies

\tau_{Z}=\mu_{UZ}^{+}\mu_{UA},\qquad\tau_{X}=\mu_{XA}-\mu_{XZ}\tau_{Z},
\tau_{A}=\tau_{Z}^{T}\Sigma_{Z}\tau_{Z}-\tau_{Z}^{T}\mu_{AZ},
\tau_{0}=\mu_{0A}+\mu_{AS}-\tau_{Z}^{T}\mu_{0Z}-\dfrac{1}{2}\tau_{Z}^{T}\Sigma_{Z}\tau_{Z}.

S7 Cross-fitting kernel learning estimator for nonparametric estimation of β0\beta_{0}.

Under Model 2.1 and Assumptions 2.1 and 2.2, by Lemma S1, estimation of \beta_{0} reduces to estimation of \theta_{ac}; write \zeta_{ac} for the corresponding observed-data functional given by the doubly robust representation in Theorem 2.5. We first derive an influence function of \zeta_{ac}.

Theorem S1.

An influence function for ζac\zeta_{ac} is

IFζac(O;q,h)\displaystyle IF_{\zeta_{ac}}(O;q,h) =SP(S=1)(1Y)c(X)h(a,W,X)\displaystyle=\dfrac{S}{P(S=1)}(1-Y)c(X)h(a,W,X)-
SP(S=1)I(A=a)q(a,Z,X)c(X)[(1Y)h(A,W,X)Y]\displaystyle\qquad\dfrac{S}{P(S=1)}I(A=a)q(a,Z,X)c(X)\left[(1-Y)h(A,W,X)-Y\right]-
\qquad\zeta_{ac}.

Proof S2.

We let \mu_{ac}=E[Sc(X)I(A=a)Yq(A,Z,X)]. Because \mu_{ac}=\zeta_{ac}P(S=1), their influence functions IF_{\mu_{ac}} and IF_{\zeta_{ac}} satisfy

IF_{\zeta_{ac}}(O;q,h)=IF_{\mu_{ac}}(O;q,h)/P(S=1).

We proceed to derive IFμacIF_{\mu_{ac}}.

Consider a regular one-dimensional parametric submodel indexed by t, where t=0 indexes the truth. We have

μ1ctt\displaystyle\dfrac{\partial\mu_{1ct}}{\partial t} =E[c(X)q˙(A,Z,X)AYS]+E[c(X)q(A,Z,X)AYSR(O)],\displaystyle=E[c(X)\dot{q}(A,Z,X)AYS]+E[c(X)q(A,Z,X)AYSR(O)],

where R(O) denotes the score function.

To calculate E[c(X)q˙(A,Z,X)AYS]E[c(X)\dot{q}(A,Z,X)AYS], we note that

0=Et[S(1Y)(qt(A,Z,X)1ft(AW,X,Y=0,S=1))n(A,W,X)].\displaystyle 0=E_{t}\left[S(1-Y)\left(q_{t}(A,Z,X)-\dfrac{1}{f_{t}(A\mid W,X,Y=0,S=1)}\right)n(A,W,X)\right].

Taking derivatives w.r.t tt on both sides and rearranging the terms, we have

E[S(1Y)q˙(A,Z,X)n(A,W,X)]\displaystyle E[S(1-Y)\dot{q}(A,Z,X)n(A,W,X)] (36)
=E{S(1Y)[1f(AW,X,Y=0,S=1)q(A,Z,X)]n(A,X,W)R(O)}\displaystyle\qquad=E\left\{S(1-Y)\left[\dfrac{1}{f(A\mid W,X,Y=0,S=1)}-q(A,Z,X)\right]n(A,X,W)R(O)\right\} (37)
E[S(1Y)R(AW,X,Y=0,S=1)f(AW,X,Y=0,S=1)n(A,W,X)]\displaystyle\qquad-E\left[S(1-Y)\dfrac{R(A\mid W,X,Y=0,S=1)}{f(A\mid W,X,Y=0,S=1)}n(A,W,X)\right] (38)

In Equation (36), taking n(A,W,X)=Ah(A,W,X)c(X)n(A,W,X)=Ah(A,W,X)c(X), we have

E[S(1Y)c(X)q˙(A,Z,X)Ah(A,W,X)]\displaystyle E[S(1-Y)c(X)\dot{q}(A,Z,X)Ah(A,W,X)]
=\displaystyle= E{ASc(X)q˙(A,Z,X)E[(1Y)h(A,W,X)A,Z,X,S]}\displaystyle E\left\{ASc(X)\dot{q}(A,Z,X)E\left[(1-Y)h(A,W,X)\mid A,Z,X,S\right]\right\}
=(11)\displaystyle\stackrel{{\scriptstyle\eqref{eq:h-identification}}}{{=}} E{ASc(X)q˙(A,Z,X)E[YA,Z,X,S=1]}\displaystyle E\left\{ASc(X)\dot{q}(A,Z,X)E\left[Y\mid A,Z,X,S=1\right]\right\}
=\displaystyle= E[c(X)q˙(A,Z,X)AYS]\displaystyle E[c(X)\dot{q}(A,Z,X)AYS]

In Equation (38), by taking n(A,W,X)=Ah(A,W,X)c(X)n(A,W,X)=Ah(A,W,X)c(X), we have

E[S(1Y)c(X)R(AW,X,Y=0,S=1)f(AW,X,Y=0,S=1)Ah(A,W,X)]\displaystyle E\left[S(1-Y)c(X)\dfrac{R(A\mid W,X,Y=0,S=1)}{f(A\mid W,X,Y=0,S=1)}Ah(A,W,X)\right]
=\displaystyle= E[S(1Y)c(X)R(O)f(AW,X,Y=0,S=1)Ah(A,W,X)]\displaystyle E\left[S(1-Y)c(X)\dfrac{R(O)}{f(A\mid W,X,Y=0,S=1)}Ah(A,W,X)\right]
E{S(1Y)c(X)R(W,X,Y,S)Ah(A,W,X)f(AW,X,Y=0,S=1)}\displaystyle\qquad-E\left\{S(1-Y)c(X)R(W,X,Y,S)\dfrac{Ah(A,W,X)}{f(A\mid W,X,Y=0,S=1)}\right\}
=\displaystyle= E[S(1Y)c(X)R(O)f(AW,X,Y=0,S=1)Ah(A,W,X)]\displaystyle E\left[S(1-Y)c(X)\dfrac{R(O)}{f(A\mid W,X,Y=0,S=1)}Ah(A,W,X)\right]
E{S(1Y)c(X)R(W,X,Y,S)E[Ah(A,W,X)f(AW,X,Y=0,S=1)W,X,\displaystyle\qquad-E\bigg{\{}S(1-Y)c(X)R(W,X,Y,S)E\bigg{[}\dfrac{Ah(A,W,X)}{f(A\mid W,X,Y=0,S=1)}\mid W,X,
Y=0,S=1]}\displaystyle\qquad Y=0,S=1\bigg{]}\bigg{\}}
=\displaystyle= E[S(1Y)c(X)R(O)f(AW,X,Y=0,S=1)Ah(A,W,X)]\displaystyle E\left[S(1-Y)c(X)\dfrac{R(O)}{f(A\mid W,X,Y=0,S=1)}Ah(A,W,X)\right]-
E{S(1Y)c(X)R(W,X,Y,S)h(1,W,X)}\displaystyle\qquad E\left\{S(1-Y)c(X)R(W,X,Y,S)h(1,W,X)\right\}
=\displaystyle= E{S(1Y)c(X)[Ah(A,W,X)f(AW,X,Y=0,S=1)h(1,W,X)]R(O)}\displaystyle E\left\{S(1-Y)c(X)\left[\dfrac{Ah(A,W,X)}{f(A\mid W,X,Y=0,S=1)}-h(1,W,X)\right]R(O)\right\}

We therefore have

E[c(X)q˙(A,Z,X)AYS]\displaystyle\quad E[c(X)\dot{q}(A,Z,X)AYS]
=E{S(1Y)c(X)[1f(AW,X,Y=0,S=1)q(A,Z,X)]Ah(A,W,X)R(O)}\displaystyle=E\left\{S(1-Y)c(X)\left[\dfrac{1}{f(A\mid W,X,Y=0,S=1)}-q(A,Z,X)\right]Ah(A,W,X)R(O)\right\}
E{S(1Y)c(X)[Ah(A,W,X)f(AW,X,Y=0,S=1)h(1,W,X)]R(O)}\displaystyle\qquad-E\left\{S(1-Y)c(X)\left[\dfrac{Ah(A,W,X)}{f(A\mid W,X,Y=0,S=1)}-h(1,W,X)\right]R(O)\right\}
=E{S(1Y)c(X)[h(1,W,X)Aq(A,Z,X)h(A,W,X)]R(O)}\displaystyle=E\left\{S(1-Y)c(X)\left[h(1,W,X)-Aq(A,Z,X)h(A,W,X)\right]R(O)\right\}

Hence

μ1ctt\displaystyle\dfrac{\partial\mu_{1ct}}{\partial t} =E{Sc(X)((1Y)[h(1,W,X)Aq(A,Z,X)h(A,W,X)]+\displaystyle=E\bigg{\{}Sc(X)\bigg{(}(1-Y)\bigg{[}h(1,W,X)-Aq(A,Z,X)h(A,W,X)\bigg{]}+
YAq(A,Z,X))R(O)}\displaystyle\qquad YAq(A,Z,X)\bigg{)}R(O)\bigg{\}}
=E{Sc(X)((1Y)[h(1,W,X)Aq(A,Z,X)h(A,W,X)]+YAq(A,Z,X)\displaystyle=E\bigg{\{}Sc(X)\bigg{(}(1-Y)\left[h(1,W,X)-Aq(A,Z,X)h(A,W,X)\right]+YAq(A,Z,X)-
μ1c)R(O)}\displaystyle\qquad\mu_{1c}\bigg{)}R(O)\bigg{\}}

and

IFμ1c()\displaystyle IF_{\mu_{1c}}(\cdot) =S(1Y)c(X)h(1,W,X)SAq(A,Z,X)c(X)[(1Y)h(A,W,X)Y]μ1c.\displaystyle=S(1-Y)c(X)h(1,W,X)-SAq(A,Z,X)c(X)\left[(1-Y)h(A,W,X)-Y\right]-\mu_{1c}.

Similarly,

IFμ0c()\displaystyle IF_{\mu_{0c}}(\cdot) =Sc(X){(1Y)[h(0,W,X)(1A)q(A,Z,X)h(A,W,X)]+\displaystyle=Sc(X)\bigg{\{}(1-Y)\bigg{[}h(0,W,X)-(1-A)q(A,Z,X)h(A,W,X)\bigg{]}+
Y(1A)q(A,Z,X)}\displaystyle\qquad Y(1-A)q(A,Z,X)\bigg{\}}
=S(1Y)c(X)h(0,W,X)S(1A)q(A,Z,X)c(X)[(1Y)h(A,W,X)\displaystyle=S(1-Y)c(X)h(0,W,X)-S(1-A)q(A,Z,X)c(X)\bigg{[}(1-Y)h(A,W,X)-
Y]μ0c.\displaystyle\qquad Y\bigg{]}-\mu_{0c}.

The influence function IFζac(O;q,h)IF_{\zeta_{ac}}(O;q,h) can be written as

IFζac(O;qa,ha)\displaystyle IF_{\zeta_{ac}}(O;q_{a},h_{a}) =SP(S=1){ga1(A,Y,X)qa(Z,X)ha(W,X)+\displaystyle=\dfrac{S}{P(S=1)}\big{\{}g_{a1}(A,Y,X)q_{a}(Z,X)h_{a}(W,X)+
ga2(Y,X)qa(Z,X)+ga3(Y,X)ha(W,X)}ζac,\displaystyle\qquad g_{a2}(Y,X)q_{a}(Z,X)+g_{a3}(Y,X)h_{a}(W,X)\big{\}}-\zeta_{ac}, (39)

where

g_{a1}(A,Y,X)=-c(X)I(A=a)(1-Y),\qquad g_{a2}(A,Y,X)=c(X)I(A=a)Y,
g_{a3}(Y,X)=c(X)(1-Y),\qquad h_{a}(W,X)=h(a,W,X),
\text{and}\qquad q_{a}(Z,X)=q(a,Z,X).

This influence function belongs to the class of doubly robust influence functions considered in Ghassami et al. (2021), and thus their method directly applies. By Theorem 1 of Rotnitzky et al. (2021), such an influence function also satisfies the so-called "mixed-bias" structure, and therefore \sqrt{n}-consistent doubly robust estimation is possible even when the nuisance functions are estimated nonparametrically. In this section, we describe a cross-fitting minimax kernel learning estimator for \zeta_{ac} and state the theoretical properties of the resulting estimator of \beta_{0}.

We randomly partition the data evenly into L subsamples. For each l\in\{1,\dots,L\}, let I_{l} and n_{l} be the indices and size of the data in the lth subsample, respectively. Let m_{l}=n-n_{l} be the size of the sample excluding I_{l}. Write q_{a}(Z,X)=q(a,Z,X) and h_{a}(W,X)=h(a,W,X). As proposed in Ghassami et al. (2021), we estimate the nuisance functions q_{a} and h_{a} using data from I_{l}^{c} by solving the following optimization problems:

\tilde{q}_{al}=\arg\min_{q_{a}\in\mathcal{Q}_{a}}\max_{\dot{h}\in\mathcal{H}_{a}^{*}}\dfrac{1}{m_{l}}\sum_{i\in I_{l}^{c}}\big[\dot{h}(W_{i},X_{i})\{q_{a}(Z_{i},X_{i})g_{a1}(A_{i},Y_{i},X_{i})+g_{a3}(Y_{i},X_{i})\}-\dot{h}^{2}(W_{i},X_{i})\big]-\lambda_{\mathcal{H}_{a}^{*}}^{q_{a}}\lVert\dot{h}\rVert_{\mathcal{H}_{a}^{*}}^{2}+\lambda_{\mathcal{Q}_{a}}^{q_{a}}\lVert q_{a}\rVert_{\mathcal{Q}_{a}}^{2}; (40)
\tilde{h}_{al}=\arg\min_{h_{a}\in\mathcal{H}_{a}}\max_{\dot{q}\in\mathcal{Q}_{a}^{*}}\dfrac{1}{m_{l}}\sum_{i\in I_{l}^{c}}\big[\dot{q}(Z_{i},X_{i})\{h_{a}(W_{i},X_{i})g_{a1}(A_{i},Y_{i},X_{i})+g_{a2}(A_{i},Y_{i},X_{i})\}-\dot{q}^{2}(Z_{i},X_{i})\big]-\lambda_{\mathcal{Q}_{a}^{*}}^{h_{a}}\lVert\dot{q}\rVert_{\mathcal{Q}_{a}^{*}}^{2}+\lambda_{\mathcal{H}_{a}}^{h_{a}}\lVert h_{a}\rVert_{\mathcal{H}_{a}}^{2}, (41)

where \mathcal{Q}_{a}, \mathcal{H}_{a}, \mathcal{Q}_{a}^{*} and \mathcal{H}_{a}^{*} are reproducing kernel Hilbert spaces (RKHS) generated by the kernel functions K_{\mathcal{Q}_{a}}, K_{\mathcal{H}_{a}}, K_{\mathcal{Q}_{a}^{*}} and K_{\mathcal{H}_{a}^{*}}, equipped with norms \lVert\cdot\rVert_{\mathcal{Q}_{a}}, \lVert\cdot\rVert_{\mathcal{H}_{a}}, \lVert\cdot\rVert_{\mathcal{Q}_{a}^{*}} and \lVert\cdot\rVert_{\mathcal{H}_{a}^{*}}, respectively, and \lambda_{\mathcal{Q}_{a}}^{q_{a}}, \lambda_{\mathcal{H}_{a}}^{h_{a}}, \lambda_{\mathcal{Q}_{a}^{*}}^{h_{a}} and \lambda_{\mathcal{H}_{a}^{*}}^{q_{a}} are non-negative regularizing parameters. The above optimization problems have closed-form solutions, which we present at the end of this section.

The objective function for \tilde{q}_{al} can be viewed as a regularised empirical evaluation, using data from I_{l}^{c}, of the perturbation of the influence function obtained by moving from the pair (q_{a},h_{a}) to (q_{a},h_{a}+\dot{h}), where the perturbation is defined as

prt_{\theta_{ac}}(q_{a},h_{a};0,\dot{h})=IF_{\theta_{ac}}(A,Y,Z,W,X;q_{a},h_{a}+\dot{h})-IF_{\theta_{ac}}(A,Y,Z,W,X;q_{a},h_{a})=\dfrac{S}{P(S=1)}\dot{h}(W,X)\{q_{a}(Z,X)g_{a1}(A,Y,X)+g_{a3}(Y,X)\}.

The objective function contains two forms of regularization: the term \dot{h}^{2} improves robustness to model misspecification as well as the stability and convergence rate of the estimator, while the Tikhonov-type penalties \lambda_{\mathcal{H}_{a}^{*}}^{q_{a}}\lVert\dot{h}\rVert_{\mathcal{H}_{a}^{*}}^{2} and \lambda_{\mathcal{Q}_{a}}^{q_{a}}\lVert q_{a}\rVert_{\mathcal{Q}_{a}}^{2} penalise the complexity of the classes of nuisance functions considered. The objective function for \tilde{h}_{al} can be interpreted similarly.

We estimate \zeta_{ac} using the lth subsample via the PDR estimator

ζ~ac,l\displaystyle\tilde{\zeta}_{ac,l} =1nliIlc(Xi)[(1Yi)h~al(Wi,Xi)\displaystyle=\dfrac{1}{n_{l}}\sum_{i\in I_{l}}c(X_{i})\big{[}(1-Y_{i})\tilde{h}_{al}(W_{i},X_{i})-
I(Ai=a)q~al(Zi,Xi){(1Yi)h~al(Wi,Xi)Yi}].\displaystyle\qquad I(A_{i}=a)\tilde{q}_{al}(Z_{i},X_{i})\big{\{}(1-Y_{i})\tilde{h}_{al}(W_{i},X_{i})-Y_{i}\big{\}}\big{]}. (42)

The final estimator for ζac\zeta_{ac} is ζ~ac=l=1Lζ~ac,l/L\tilde{\zeta}_{ac}=\sum_{l=1}^{L}\tilde{\zeta}_{ac,l}/L, which gives the cross-fitting minimax kernel learning estimator for β0\beta_{0} as β~0=log(ζ~1c/ζ~0c).\tilde{\beta}_{0}=\log(\tilde{\zeta}_{1c}/\tilde{\zeta}_{0c}). We summarise the estimation of ζac\zeta_{ac} in Algorithm S7.

{algo}

A cross-fitting minimax kernel learning algorithm to estimate β0\beta_{0}

Split the data into L even subsamples.
For a=0,1:
  For l=1,\dots,L:
    Solve for \tilde{q}_{al} and \tilde{h}_{al} in Equations (40) and (41).
    Obtain \tilde{\zeta}_{ac,l} according to Equation (42).
  Set \tilde{\zeta}_{ac}\leftarrow\sum_{l=1}^{L}\tilde{\zeta}_{ac,l}/L.
Output \tilde{\beta}_{0}=\log(\tilde{\zeta}_{1c}/\tilde{\zeta}_{0c}).

The estimator \tilde{\beta}_{0} is consistent and asymptotically normal, thus allowing valid statistical inference, as stated in Theorem S3 below.

Theorem S3.

Under Model 2.1, Assumptions 2.1-2.3, and suitable regularity conditions listed in Section S8, the cross-fitting estimator \tilde{\beta}_{0} in Algorithm S7 is a consistent and asymptotically normal estimator of \beta_{0}.

We prove Theorem S3 in Section S9 of the Supplementary Materials.

Implementation of this approach is challenging, as the performance of the method proves to be sensitive to the value of tuning parameters. We leave an efficient implementation of this approach to future research.

The closed-form solutions for \tilde{q}_{al} and \tilde{h}_{al} are

q~al(z,x)\displaystyle\tilde{q}_{al}(z,x) =iIlcγaliK𝒬a(Zi,Xi;z,x),\displaystyle=\sum_{i\in I_{l}^{c}}\gamma_{ali}K_{\mathcal{Q}_{a}}(Z_{i},X_{i};z,x),
andh~al(w,x)\displaystyle\text{and}\qquad\tilde{h}_{al}(w,x) =iIlcαaliKa(Wi,Xi;w,x),\displaystyle=\sum_{i\in I_{l}^{c}}\alpha_{ali}K_{\mathcal{H}_{a}}(W_{i},X_{i};w,x),

where

\alpha_{al}=-(K_{\mathcal{H}_{a},l}D_{al}\Gamma_{al}D_{al}K_{\mathcal{H}_{a},l}+m_{l}^{2}\lambda_{\mathcal{H}_{a}}^{h_{a}}K_{\mathcal{H}_{a},l})^{+}K_{\mathcal{H}_{a},l}D_{al}\Gamma_{al}g_{a2,l},
\gamma_{al}=-(K_{\mathcal{Q}_{a},l}D_{al}\Omega_{al}D_{al}K_{\mathcal{Q}_{a},l}+m_{l}^{2}\lambda_{\mathcal{Q}_{a}}^{q_{a}}K_{\mathcal{Q}_{a},l})^{+}K_{\mathcal{Q}_{a},l}D_{al}\Omega_{al}g_{a3,l},
\Gamma_{al}=\dfrac{1}{4}K_{\mathcal{Q}_{a}^{*},l}\left(\dfrac{1}{m_{l}}K_{\mathcal{Q}_{a}^{*},l}+\lambda_{\mathcal{Q}_{a}^{*}}^{h_{a}}I_{al}\right)^{-1},
\Omega_{al}=\dfrac{1}{4}K_{\mathcal{H}_{a}^{*},l}\left(\dfrac{1}{m_{l}}K_{\mathcal{H}_{a}^{*},l}+\lambda_{\mathcal{H}_{a}^{*}}^{q_{a}}I_{al}\right)^{-1},

D_{al} is the (n-n_{l})\times(n-n_{l}) diagonal matrix with elements \{g_{a1}(A_{i},Y_{i},X_{i}):i\in I_{l}^{c}\}, K_{\mathcal{H},l} and K_{\mathcal{Q},l} are respectively the (n-n_{l})\times(n-n_{l}) empirical kernel matrices evaluated on the data in I_{l}^{c}, g_{a2,l} is the (n-n_{l})-vector with elements \{g_{a2}(A_{i},Y_{i},X_{i}):i\in I_{l}^{c}\}, and g_{a3,l} is the (n-n_{l})-vector with elements \{g_{a3}(Y_{i},X_{i}):i\in I_{l}^{c}\}.

S8 Regularity conditions in Theorem S3

For a function class \mathcal{F}, let B:={f:f2B}\mathcal{F}_{B}:=\{f\in\mathcal{F}:\lVert f\rVert_{\mathcal{F}}^{2}\leq B\}.

Let {μhaj}j=1\{\mu^{h_{a}}_{j}\}_{j=1}^{\infty} and {φjha}j=1\{\varphi_{j}^{h_{a}}\}_{j=1}^{\infty} be the eigenvalues and eigenfunctions of the RKHS a\mathcal{H}_{a} and {μqaj}j=1\{\mu^{q_{a}}_{j}\}_{j=1}^{\infty} and {φjqa}j=1\{\varphi_{j}^{q_{a}}\}_{j=1}^{\infty} be the eigenvalues and eigenfunctions of the RKHS 𝒬a\mathcal{Q}_{a}. For any m+m\in\mathbb{N}_{+}, let VmaV_{m}^{\mathcal{H}_{a}} and Vm𝒬aV_{m}^{\mathcal{Q}_{a}} be the m×mm\times m matrices with entry (i,j)(i,j) defined respectively as

[V_{m}^{\mathcal{H}_{a}}]_{i,j}=E[E[\varphi_{i}^{\mathcal{H}_{a}}(W,X)\mid Z,X,S=1]\,E[\varphi_{j}^{\mathcal{H}_{a}}(W,X)\mid Z,X,S=1]]

and

[Vm𝒬a]i,j=E[E[φi𝒬a(Z,X)W,X,S=1]E[φj𝒬a(Z,X)W,X,S=1]].[V_{m}^{\mathcal{Q}_{a}}]_{i,j}=E[E[\varphi_{i}^{\mathcal{Q}_{a}}(Z,X)\mid W,X,S=1]E[\varphi_{j}^{\mathcal{Q}_{a}}(Z,X)\mid W,X,S=1]].

Let λma\lambda_{m}^{\mathcal{H}_{a}} and λm𝒬a\lambda_{m}^{\mathcal{Q}_{a}} be the minimum eigenvalues of VmaV_{m}^{\mathcal{H}_{a}} and Vm𝒬aV_{m}^{\mathcal{Q}_{a}}, respectively.

  1. R.1

    The functions ga1(A,Y,X)=c(X)I(A=a)(1Y)g_{a1}(A,Y,X)=c(X)I(A=a)(1-Y), ga2(A,Y,X)=c(X)I(A=a)Yg_{a2}(A,Y,X)=c(X)I(A=a)Y and ga3(Y,X)=c(X)(1Y)g_{a3}(Y,X)=c(X)(1-Y) are bounded.

  2. R.2

    There exists a constant σ1\sigma_{1} such that

\left|E[c(X)I(A=a)(1-Y)\mid Z,W,X,S=1]\right|>\sigma_{1}>0

    almost surely for a=0,1a=0,1.

  3. R.3

    The nuisance functions q(A,Z,X)q(A,Z,X) and h(A,W,X)h(A,W,X) have finite second moments.

  4. R.4

The nuisance functions q_{a}(Z,X) and h_{a}(W,X) and their estimators \tilde{q}_{al}(Z,X) and \tilde{h}_{al}(W,X) in each fold of the cross-fitting satisfy

\min\left\{\sup_{z,x}|\tilde{q}_{al}(z,x)|+\sup_{w,x}|h_{a}(w,x)|,\;\sup_{z,x}|q_{a}(z,x)|+\sup_{w,x}|\tilde{h}_{al}(w,x)|\right\}<\infty.
  5. R.5

    q~alqa2=op(1)\lVert\tilde{q}_{al}-q_{a}\rVert_{2}=o_{p}(1), h~alha2=op(1)\lVert\tilde{h}_{al}-h_{a}\rVert_{2}=o_{p}(1);

  6. R.6

    There exists a constant CC such that h~aaC\tilde{h}_{a}\in\mathcal{H}_{aC} and q~a𝒬aC\tilde{q}_{a}\in\mathcal{Q}_{aC} for a{0,1}a\in\{0,1\}.

  7. R.7
\left|E[E[\varphi_{i}^{\mathcal{H}_{a}}(W,X)\mid Z,X,S=1]\,E[\varphi_{j}^{\mathcal{H}_{a}}(W,X)\mid Z,X,S=1]]\right|\leq c^{\mathcal{H}_{a}}\lambda_{m}^{\mathcal{H}_{a}}

and

\left|E[E[\varphi_{i}^{\mathcal{Q}_{a}}(Z,X)\mid W,X,S=1]\,E[\varphi_{j}^{\mathcal{Q}_{a}}(Z,X)\mid W,X,S=1]]\right|\leq c^{\mathcal{Q}_{a}}\lambda_{m}^{\mathcal{Q}_{a}}

for some constants c^{\mathcal{H}_{a}} and c^{\mathcal{Q}_{a}}.

  8. R.8
\lVert E[I(A=a)c(X)(1-Y)\{\tilde{q}_{al}(Z,X)-q_{a}(Z,X)\}\mid W,X,S=1]\rVert_{2}=O(n^{-r_{q}})

and

\lVert E[I(A=a)c(X)(1-Y)\{\tilde{h}_{al}(W,X)-h_{a}(W,X)\}\mid Z,X,S=1]\rVert_{2}=O(n^{-r_{h}}).
  9. R.9

\mu_{m}^{\mathcal{H}_{a}}\sim m^{-s_{h}}, \mu_{m}^{\mathcal{Q}_{a}}\sim m^{-s_{q}}, \lambda_{m}^{\mathcal{H}_{a}}\sim m^{-t_{h}}, and \lambda_{m}^{\mathcal{Q}_{a}}\sim m^{-t_{q}}, where the decay rates s_{h}, s_{q}, t_{h} and t_{q} satisfy

    max{rh+shrqsh+th,rq+sqrhsq+tq}>12.\max\{r_{h}+\dfrac{s_{h}r_{q}}{s_{h}+t_{h}},r_{q}+\dfrac{s_{q}r_{h}}{s_{q}+t_{q}}\}>\dfrac{1}{2}.

S9 Proof of Theorem S3

Under Model 2.1 and Assumptions 2.1, 2.2 and 2.2, Theorem 2.2(b) states that \beta_{0}=\log(\zeta_{1c}/\zeta_{0c}). By the delta method, it suffices to prove that \tilde{\zeta}_{ac} is asymptotically linear with the influence function IF_{\zeta_{ac}}(O); the proof follows Theorem 2 and Lemma 1 of Ghassami et al. (2021), which we reproduce below.

For a=0,1a=0,1 and a given value δ>0\delta>0, we define aC|δ:={haaC:E[ha(W,X)Z,X,S=1]2δ}\mathcal{H}_{aC}^{|\delta}:=\{h_{a}\in\mathcal{H}_{aC}:\lVert E[h_{a}(W,X)\mid Z,X,S=1]\rVert_{2}\leq\delta\} and 𝒬aC|δ:={qa𝒬aC:E[qa(Z,X)W,X,S=1]2δ}\mathcal{Q}_{aC}^{|\delta}:=\{q_{a}\in\mathcal{Q}_{aC}:\lVert E[q_{a}(Z,X)\mid W,X,S=1]\rVert_{2}\leq\delta\}. We define the measures of ill-posedness τha(δ):=suphaaC|δha2\tau_{h_{a}}(\delta):=\sup_{h_{a}\in\mathcal{H}_{aC}^{|\delta}}\lVert h_{a}\rVert_{2} and τqa(δ):=supqa𝒬aC|δqa2\tau_{q_{a}}(\delta):=\sup_{q_{a}\in\mathcal{Q}_{aC}^{|\delta}}\lVert q_{a}\rVert_{2}.

Lemma S1.

(Ghassami et al. (2021) Lemma 1) Under regularity conditions R.6 and R.7, for any ϵ>0\epsilon>0 we have

τha(ϵ)2minm+{4ϵ2λma+[4(ca)2+1]Cμm+1a}\tau_{h_{a}}(\epsilon)^{2}\leq\min_{m\in\mathbb{N}_{+}}\left\{\dfrac{4\epsilon^{2}}{\lambda_{m}^{\mathcal{H}_{a}}}+[4(c^{\mathcal{H}_{a}})^{2}+1]C\mu_{m+1}^{\mathcal{H}^{a}}\right\}

and

τqa(ϵ)2minm+{4ϵ2λm𝒬a+[4(c𝒬a)2+1]Cμm+1𝒬a}.\tau_{q_{a}}(\epsilon)^{2}\leq\min_{m\in\mathbb{N}_{+}}\left\{\dfrac{4\epsilon^{2}}{\lambda_{m}^{\mathcal{Q}_{a}}}+[4(c^{\mathcal{Q}_{a}})^{2}+1]C\mu_{m+1}^{\mathcal{Q}^{a}}\right\}.

By Lemma S1 and R.9, we have that

τha(ϵ)=O(ϵshsh+th)andτqa(ϵ)=O(ϵsqsq+tq)\tau_{h_{a}}(\epsilon)=O\left(\epsilon^{\dfrac{s_{h}}{s_{h}+t_{h}}}\right)\qquad\text{and}\qquad\tau_{q_{a}}(\epsilon)=O\left(\epsilon^{\dfrac{s_{q}}{s_{q}+t_{q}}}\right)

The condition R.9 further implies that

\min\{n^{-r_{h}}\tau_{h_{a}}(n^{-r_{q}}),\,n^{-r_{q}}\tau_{q_{a}}(n^{-r_{h}})\}=O\left(\min\left\{n^{-\left(r_{h}+\frac{s_{h}r_{q}}{s_{h}+t_{h}}\right)},\,n^{-\left(r_{q}+\frac{s_{q}r_{h}}{s_{q}+t_{q}}\right)}\right\}\right)
=o(n1/2).\displaystyle=o(n^{-1/2}). (43)

The result follows by the theorem below:

Theorem S2.

(Ghassami et al. (2021) Theorem 2) Under regularity conditions R.1-R.6 and R.8, and if Equation (43) holds in addition, the estimator \tilde{\zeta}_{ac} is asymptotically linear with the influence function

IFζac(O).IF_{\zeta_{ac}}(O).

S10 Details of the simulation studies

We generate the data for a target population of NN observations according to the following five scenarios

Scenario I: Univariate binary UU, ZZ and WW

U\sim\text{Bernoulli}(0.5);
AU\displaystyle A\mid U Bernoulli(expit(0.2+0.4U))\displaystyle\sim\text{Bernoulli}(\mathrm{expit}(0.2+0.4U))
ZA,U\displaystyle Z\mid A,U Bernoulli(0.2+0.1A+0.4U+0.2AU);\displaystyle\sim\text{Bernoulli}(0.2+0.1A+0.4U+0.2AU);
WU\displaystyle W\mid U Bernoulli(0.2+0.4U);\displaystyle\sim\text{Bernoulli}(0.2+0.4U);
YA,U\displaystyle Y\mid A,U Bernoulli(expit(0.4051.609A0.7U))\displaystyle\sim\text{Bernoulli}(\mathrm{expit}(-0.405-1.609A-0.7U))
SU,A,Y\displaystyle S\mid U,A,Y Bernoulli(exp(1.7+0.2A+0.4Y+0.7U)).\displaystyle\sim\text{Bernoulli}(\exp(-1.7+0.2A+0.4Y+0.7U)).

Scenario II-V: Continuous UU, XX, ZZ, WW.

U\displaystyle U Uniform(0,1);\displaystyle\sim\text{Uniform}(0,1);
X\displaystyle X Uniform(0,1);\displaystyle\sim\text{Uniform}(0,1);
AU,X\displaystyle A\mid U,X Bernoulli(expit(1+U+X))\displaystyle\sim\text{Bernoulli}(\mathrm{expit}(-1+U+X))
ZA,U,X\displaystyle Z\mid A,U,X N(0.25A+0.25X+4U,0.252);\displaystyle\sim\text{N}(0.25A+0.25X+4U,0.25^{2});
WY,U,X\displaystyle W\mid Y,U,X N(0.25Y+0.25X+4U,0.252);\displaystyle\sim\text{N}(0.25Y+0.25X+4U,0.25^{2});
YA,U,X\displaystyle Y\mid A,U,X Bernoulli(expit(3.89+β0A2UX))\displaystyle\sim\text{Bernoulli}(\mathrm{expit}(-3.89+\beta_{0}A-2U-X))
S\mid U,X,A,Y\sim\text{Bernoulli}(\exp(-5+0.4A+4Y+0.3U+0.2X)).

S11 Estimating conditional odds ratio under effect modification by XX

In Section 2, we assumed that the odds ratio is constant across all (U,X) strata. In practice, however, the treatment effect may be heterogeneous. We therefore relax Model 2.1 and propose Model S11 below, which allows the odds ratio to vary across X strata: {model}[Heterogeneous odds ratio model by X]

logitP(Y=1A,U,X)=β0(X)A+η(U,X).\mathrm{logit}\,P(Y=1\mid A,U,X)=\beta_{0}(X)A+\eta(U,X). (44)

In Model S11, the odds ratio treatment effect \beta_{0}(X) is homogeneous with respect to the latent factors U but is allowed to vary with X. In such scenarios, if negative control variables are available, the odds ratio function \beta_{0}(X) can still be identified, similarly to Theorems 2.2, 2.4 and 2.5, as stated in the following theorem:

Theorem S1.

Under Model S11 and Assumptions 2.1, 2.2, 2.2 and 2.3, the odds ratio function \beta_{0}(X) satisfies the following moment equations:

E\left[(-1)^{1-A}c(X)Yq(A,Z,X)e^{-\beta_{0}(X)A}\mid S=1\right]=0;
E[c(X)(1Y){h(1,W,X)h(0,W,X)eβ0(X)}S=1]=0;\displaystyle E\left[c(X)(1-Y)\left\{h(1,W,X)-h(0,W,X)e^{\beta_{0}(X)}\right\}\mid S=1\right]=0;
E{c(X)[(1)1Aq(A,Z,X)eβ0(X)A{Y(1Y)h(A,W,X)}+\displaystyle E\bigg{\{}c(X)\bigg{[}(-1)^{1-A}q(A,Z,X)e^{-\beta_{0}(X)A}\left\{Y-(1-Y)h(A,W,X)\right\}+
(1Y){h(1,W,X)eβ0(X)h(0,W,X)}]S=1}=0.\displaystyle\qquad\qquad(1-Y)\left\{h(1,W,X)e^{-\beta_{0}(X)}-h(0,W,X)\right\}\bigg{]}\mid S=1\bigg{\}}=0.

In cases where a parametric model \beta_{0}(X;\alpha) for the odds ratio function is considered appropriate, with a finite-dimensional parameter \alpha, then similarly to the PIPW, POR and PDR estimation above, the low-dimensional parameter \alpha can be estimated by solving the estimating equations

\dfrac{1}{n}\sum_{i=1}^{n}(-1)^{1-A_{i}}c(X_{i})q(A_{i},Z_{i},X_{i};\hat{\tau})Y_{i}e^{-\beta_{0}(X_{i};\alpha)A_{i}}=0; (45)
1ni=1nc(Xi)(1Yi){h(1,Wi,Xi;ψ^)eβ0(Xi;α)h(0,Wi,Xi;ψ^)}=0;\displaystyle\dfrac{1}{n}\sum_{i=1}^{n}c(X_{i})(1-Y_{i})\left\{h(1,W_{i},X_{i};\hat{\psi})e^{-\beta_{0}(X_{i};\alpha)}-h(0,W_{i},X_{i};\hat{\psi})\right\}=0; (46)
or \dfrac{1}{n}\sum_{i=1}^{n}c(X_{i})\bigg[(-1)^{1-A_{i}}q(A_{i},Z_{i},X_{i};\hat{\tau})e^{-\beta_{0}(X_{i};\alpha)A_{i}}\left\{Y_{i}-(1-Y_{i})h(A_{i},W_{i},X_{i};\hat{\psi})\right\}+
\qquad\qquad(1-Y_{i})\left\{h(1,W_{i},X_{i};\hat{\psi})e^{-\beta_{0}(X_{i};\alpha)}-h(0,W_{i},X_{i};\hat{\psi})\right\}\bigg]=0, (47)

where τ^\hat{\tau} and ψ^\hat{\psi} are estimators for the parameters indexing the treatment and outcome confounding bridge functions, obtained by solving Equations (12) and (13), respectively, and c(X)c(X) is a user-specified function with dimension no smaller than that of α\alpha.

Alternatively, a similar kernel method to the one in Section S7 of the Supplementary Materials can be used to obtain a cross-fitting minimax learning estimator of α\alpha which solves Equation (47) and nonparametrically estimates the nuisance functions qq and hh. With an estimator α^\hat{\alpha} using either of the above methods, the conditional odds ratio at X=xX=x can then be estimated by β0(x;α^)\beta_{0}(x;\hat{\alpha}).

Nonparametric estimation of \beta_{0}(X) is possible, for example, by using the minimax learning approach (Dikkala et al., 2020) or the semi-nonparametric sieve generalised method of moments approach (Ai & Chen, 2003; Chen, 2007). Such a nonparametric estimator may be of interest, for example, to account more flexibly for the heterogeneity of treatment effects and to make flexible predictions of individual treatment effects.

S12 Extension to polytomous treatment and effect

Suppose we want to study the effect of a polytomous treatment $A$ on a polytomous outcome $Y$, where $A$ has levels $a_{0},a_{1},\dots,a_{J}$ and $Y$ has levels $y_{0},\dots,y_{K}$. Here $a_{0}$ and $y_{0}$ denote the reference treatment and outcome, respectively. We make the following homogeneity assumption: {model}[Homogeneous odds ratio model]

\log\dfrac{P(Y=y_{k}\mid A,U,X)}{P(Y=y_{0}\mid A,U,X)}=\sum_{j=1}^{J}\beta_{j}^{k}\,I(A=a_{j})+\eta^{k}(U,X),\qquad k=1,\dots,K,

for unknown real-valued functions $\eta^{1}(U,X),\dots,\eta^{K}(U,X)$.

Model S12 indicates that the log odds ratio of $Y=y_{k}$ relative to $Y=y_{0}$, comparing treatment $A=a_{j}$ with $A=a_{0}$, equals $\beta_{j}^{k}$ in every stratum of $(U,X)$, since $\eta^{k}(U,X)$ cancels in this contrast. Model S12 is a natural semiparametric extension of the polytomous logistic regression model (Engel, 1988). When $J=1$ and $K=1$, Model S12 reduces to Model 2.1. The goal is to estimate and draw inference about each $\beta_{j}^{k}$ in the presence of latent factors $U$ and confounded outcome-dependent sampling.
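To make the reduction explicit: when $J=K=1$ with $a_{0}=0$, $a_{1}=1$, $y_{0}=0$ and $y_{1}=1$, Model S12 reads

\mathrm{logit}\,P(Y=1\mid A,U,X)=\beta_{1}^{1}A+\eta^{1}(U,X),

which is Model 2.1 with $\beta_{0}=\beta_{1}^{1}$ and $\eta=\eta^{1}$.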

We make the following assumptions, corresponding to Assumptions 2.1, 2.2 and 2.3: {assumption}[No effect modification by $A$ on the outcome-dependent selection] For $a=a_{0},\dots,a_{J}$ and unknown real-valued functions $\xi^{1}(U,X),\dots,\xi^{K}(U,X)$, the sampling mechanism satisfies

\dfrac{P(S=1\mid Y=y_{k},A=a,U,X)}{P(S=1\mid Y=y_{0},A=a,U,X)}=\xi^{k}(U,X)

for $k=1,\dots,K$.

{assumption}

[Confounding bridge functions] There exist a treatment confounding bridge function $q(A,Z,X)$ and $K$ outcome confounding bridge functions $h_{k}(A,W,X)$ such that for $k=1,\dots,K$ and $a=a_{0},a_{1},\dots,a_{J}$,

E\{q(a,Z,X)\mid A=a,U,X\}=1/P(A=a\mid U,X,Y=y_{0},S=1); \qquad (48)

E\{I(Y=y_{0})h_{k}(a,W,X)\mid A=a,U,X,S=1\}=P(Y=y_{k}\mid A=a,U,X,S=1). \qquad (49)

We state the following results for inference about $\beta_{j}^{k}$. The proofs are similar to those in Section S1 and are therefore omitted.

Lemma S1.

Under Model S12 and Assumption S12, the log odds ratio $\beta_{j}^{k}$ satisfies

\beta_{j}^{k}=\log(\theta_{jc}^{k}/\theta_{0c}^{k}),

where

\theta_{jc}^{k}=E\left\{c(X)\,\dfrac{P(A=a_{j}\mid U,X,Y=y_{k},S=1)}{P(A=a_{j}\mid U,X,Y=y_{0},S=1)}\,P(Y=y_{k}\mid U,X,S=1)\;\middle|\;S=1\right\},

and $c(X)$ is an arbitrary real-valued function that satisfies $\theta_{jc}^{k}\neq 0$ for $j=0,1,\dots,J$.

Theorem S2 (Identification of $q(A,Z,X)$ and $h_{k}(A,W,X)$).

The functions $q(\cdot)$ and $h_{k}(\cdot)$, $k=1,\dots,K$, satisfy the moment conditions

E\left[I(Y=y_{0})\left\{\kappa^{*}_{1}(A,W,X)\,q(A,Z,X)-\sum_{j=0}^{J}\kappa^{*}_{1}(a_{j},W,X)\right\}\mid S=1\right]=0;

E\left[\kappa^{*}_{2k}(A,Z,X)\left\{I(Y=y_{0})h_{k}(A,W,X)-I(Y=y_{k})\right\}\mid S=1\right]=0,\qquad k=1,\dots,K,

where $\kappa^{*}_{1}(\cdot)$ and $\kappa^{*}_{2k}(\cdot)$ are arbitrary functions.

Theorem S3 (Identification of $\theta_{jc}^{k}$).

Assume Model S12 and Assumptions S12 and 2.2 hold. Then:

(a) if there exists a function $q$ that solves Equation (48), then

\theta_{jc}^{k}=E\left\{I(A=a_{j},Y=y_{k})\,c(X)\,q(a_{j},Z,X)\mid S=1\right\};
(b) if there exist functions $h_{k}$, $k=1,\dots,K$, that solve Equation (49), then

\theta_{jc}^{k}=E\left\{c(X)\,I(Y=y_{0})\,h_{k}(a_{j},W,X)\mid S=1\right\};
(c) if either (i) the function $q$ solves Equation (48), or (ii) the functions $h_{k}$, $k=1,\dots,K$, solve Equation (49), then

\theta_{jc}^{k}=E\big\{c(X)\big[I(Y=y_{0})h_{k}(a_{j},W,X)-I(A=a_{j})\,q(a_{j},Z,X)\left\{I(Y=y_{0})h_{k}(a_{j},W,X)-I(Y=y_{k})\right\}\big]\mid S=1\big\}.

By Theorems S2 and S3, to estimate $\beta_{j}^{k}$ one may proceed as follows:

1. Specify suitable parametric working models $q(A,Z,X;\tau)$ and $h_{k}(A,W,X;\psi_{k})$, $k=1,\dots,K$;

2. Estimate the nuisance parameters $\tau$ and $\psi_{k}$ by solving

\sum_{i=1}^{n}I(Y_{i}=y_{0})\left\{\kappa^{*}_{1}(A_{i},W_{i},X_{i})\,q(A_{i},Z_{i},X_{i};\tau)-\sum_{j=0}^{J}\kappa^{*}_{1}(a_{j},W_{i},X_{i})\right\}=0;

\sum_{i=1}^{n}\kappa^{*}_{2k}(A_{i},Z_{i},X_{i})\left\{I(Y_{i}=y_{0})\,h_{k}(A_{i},W_{i},X_{i};\psi_{k})-I(Y_{i}=y_{k})\right\}=0

for $k=1,2,\dots,K$, where $\kappa^{*}_{1}$ and $\kappa^{*}_{2k}$ are user-specified functions with dimensions at least as large as those of $\tau$ and $\psi_{k}$, respectively. Denote the resulting estimators by $\hat{\tau}$ and $\hat{\psi}_{k}$;

3. Estimate $\theta_{jc}^{k}$ by

\hat{\theta}_{jc,\text{PIPW}}^{k}=\dfrac{1}{n}\sum_{i=1}^{n}I(A_{i}=a_{j},Y_{i}=y_{k})\,c(X_{i})\,q(a_{j},Z_{i},X_{i};\hat{\tau});

\hat{\theta}_{jc,\text{POR}}^{k}=\dfrac{1}{n}\sum_{i=1}^{n}c(X_{i})\,I(Y_{i}=y_{0})\,h_{k}(a_{j},W_{i},X_{i};\hat{\psi}_{k});

\text{or}\quad\hat{\theta}_{jc,\text{PDR}}^{k}=\dfrac{1}{n}\sum_{i=1}^{n}c(X_{i})\Big[I(Y_{i}=y_{0})\,h_{k}(a_{j},W_{i},X_{i};\hat{\psi}_{k})-I(A_{i}=a_{j})\,q(a_{j},Z_{i},X_{i};\hat{\tau})\left\{I(Y_{i}=y_{0})\,h_{k}(a_{j},W_{i},X_{i};\hat{\psi}_{k})-I(Y_{i}=y_{k})\right\}\Big]

for $j=0,1,\dots,J$ and $k=1,\dots,K$, where $c(X)$ is a user-specified real-valued function (one may simply set $c(X)=1$);

4. Form the PIPW estimator $\hat{\beta}_{j,\text{PIPW}}^{k}=\log(\hat{\theta}_{jc,\text{PIPW}}^{k}/\hat{\theta}_{0c,\text{PIPW}}^{k})$, the POR estimator $\hat{\beta}_{j,\text{POR}}^{k}=\log(\hat{\theta}_{jc,\text{POR}}^{k}/\hat{\theta}_{0c,\text{POR}}^{k})$, and the PDR estimator $\hat{\beta}_{j,\text{PDR}}^{k}=\log(\hat{\theta}_{jc,\text{PDR}}^{k}/\hat{\theta}_{0c,\text{PDR}}^{k})$ of $\beta_{j}^{k}$.
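To fix ideas, the sketch below implements steps 3 and 4 for a single $(j,k)$ pair with $c(X)=1$, taking the fitted bridge functions from step 2 as inputs. It is a minimal illustration under those assumptions; the function names and data layout are hypothetical placeholders rather than the authors' implementation.

```python
# A minimal sketch of steps 3-4 above, assuming c(X) = 1 and bridge functions
# q_hat(a, Z, X) and h_hat[k](a, W, X) fitted in step 2; all names are
# illustrative placeholders.
import numpy as np

def theta_hats(j, k, a_levels, y_levels, A, Y, Z, W, X, q_hat, h_hat):
    """Sample analogues of theta_{jc}^k (step 3) with c(X) = 1.
    a_levels[0] and y_levels[0] are the reference levels a_0 and y_0."""
    Ij = (A == a_levels[j]).astype(float)   # I(A_i = a_j)
    Ik = (Y == y_levels[k]).astype(float)   # I(Y_i = y_k)
    I0 = (Y == y_levels[0]).astype(float)   # I(Y_i = y_0)
    q_j = q_hat(a_levels[j], Z, X)          # q(a_j, Z_i, X_i; tau_hat)
    h_jk = h_hat[k](a_levels[j], W, X)      # h_k(a_j, W_i, X_i; psi_hat_k)
    pipw = np.mean(Ij * Ik * q_j)
    por = np.mean(I0 * h_jk)
    pdr = np.mean(I0 * h_jk - Ij * q_j * (I0 * h_jk - Ik))
    return np.array([pipw, por, pdr])

# Hypothetical usage (step 4): log odds ratio estimates for cell (j, k).
# t_j = theta_hats(j, k, a_levels, y_levels, A, Y, Z, W, X, q_hat, h_hat)
# t_0 = theta_hats(0, k, a_levels, y_levels, A, Y, Z, W, X, q_hat, h_hat)
# beta_pipw, beta_por, beta_pdr = np.log(t_j / t_0)
```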

Under Model S12 and Assumptions S12 and 2.2, the PDR estimator is doubly robust, i.e., $\hat{\beta}_{j,\text{PDR}}^{k}\overset{p}{\rightarrow}\beta_{j}^{k}$ if either (i) the parametric working model $q(A,Z,X;\tau)$ is correctly specified and the completeness Assumption 2.2(b) holds, or (ii) the parametric working models $h_{k}(A,W,X;\psi_{k})$, $k=1,\dots,K$, are correctly specified and Assumption 2.3(b) holds.

The kernel learning approach in Section S7 can similarly be employed to obtain semiparametric estimators of $\beta_{j}^{k}$ with flexible modelling of $q$ and $h_{k}$.