\jyear

0000 \jvol000 \jnum0

Doubly Robust Proximal Causal Inference under Confounded Outcome-Dependent Sampling

Kendrick Qijun Li [email protected] Department of Biostatistics, University of Michigan,
1415 Washington Heights, Ann Arbor, Michigan, U.S.A. Xu Shi [email protected] Department of Biostatistics, University of Michigan,
1415 Washington Heights, Ann Arbor, Michigan, U.S.A. Wang Miao [email protected] Department of Probability and Statistics, Peking University,
5 Yiheyuan Rd, Haidian District, Beijing, China Eric Tchetgen Tchetgen [email protected] Department of Statistics and Data Science, The Wharton School, University of Pennsylvania,
265 South 37th Street, Philadelphia, Pennsylvania, U.S.A

Abstract

Unmeasured confounding and selection bias are often of concern in observational studies and may invalidate a causal analysis if not appropriately accounted for. Under outcome-dependent sampling, a latent factor that has causal effects on the treatment, outcome, and sample selection process may cause both unmeasured confounding and selection bias, rendering standard causal parameters unidentifiable without additional assumptions. Under an odds ratio model for the treatment effect, Li et al. (2022) established both proximal identification and estimation of causal effects by leveraging a pair of negative control variables as proxies of latent factors at the source of both confounding and selection bias. However, their approach relies exclusively on the existence and correct specification of a so-called treatment confounding bridge function, a model that restricts the treatment assignment mechanism. In this article, we propose doubly robust estimation under the odds ratio model with respect to two nuisance functions – a treatment confounding bridge function and an outcome confounding bridge function that restricts the outcome law, such that our estimator is consistent and asymptotically normal if either bridge function model is correctly specified, without knowing which one is. Thus, our proposed doubly robust estimator is potentially more robust than that of Li et al. (2022). Our simulations confirm that the proposed proximal estimators of an odds ratio causal effect can adequately account for both residual confounding and selection bias under stated conditions with well-calibrated confidence intervals in a wide range of scenarios, where standard methods generally fail to be consistent. In addition, the proposed doubly robust estimator is consistent if at least one confounding bridge function is correctly specified.

keywords:

endogenous selection bias; kernel machine learning; proximal causal inference; semiparametric estimation.

^†^†journal: Biometrika

1 Introduction

Unmeasured confounding and selection bias are two ubiquitous challenges in observational studies, which may lead to biased causal effect estimates and misleading causal conclusions (Hernán MA, 2020). Lipsitch et al. (2010) reviewed and formalized the use of negative control variables as tools to detect the presence of unmeasured confounding in epidemiological research. Specifically, they identify two types of negative control variables: negative control exposures (NCE) known a priori to have no causal effect on the outcome and negative control outcomes (NCO) known a priori not to be causally impacted by the treatment. Such variables may be viewed as valid proxies of unmeasured confounders to the extent that they are associated with the latter. Tchetgen Tchetgen (2014) and Flanders et al. (2017) proposed regression-type calibration for unmeasured confounding adjustment using NCOs and NCEs, respectively, under fairly strong parametric restrictions. More recently, a double negative control approach which uses NCOs and NCEs jointly to correct for unmeasured confounding has been developed, which achieves nonparametric identification of the average treatment effect (Miao et al., 2018a); also see (Shi et al., 2020b) for a recent review of negative control methods in Epidemiology. Methods for identification, estimation and inference about causal parameters by leveraging such proxies to address unmeasured confounding bias have been referred to as “proximal causal inference” (Tchetgen Tchetgen et al., 2020).

The literature on proximal causal inference is fast growing. Miao et al. (2018b) introduced an outcome confounding bridge function approach to identify and estimate the average treatment effect by leveraging both NCE and NCO variables, while Cui et al. (2020) and Deaner (2018) introduced identification via a treatment confounding bridge function. Cui et al. (2020) further developed a doubly robust approach which can identify and consistently estimate the average treatment effect if either a treatment or an outcome confounding bridge function exists and can be estimated consistently. Ghassami et al. (2021) and Kallus et al. (2021) concurrently proposed a cross-fitting minimax kernel learning approach for nonparametric estimation of the confounding bridge functions, where the efficient influence function has a mixed bias structure (Rotnitzky et al., 2021), such that $\sqrt{n}$ -consistent estimation of the average treatment effect remains possible even when both confounding bridge functions might be estimated at considerably slower than $\sqrt{n}$ rates.

Despite the rapid growth in methods to address confounding bias, few have considered situations in which confounding bias and selection bias might coexist. Selection bias, also sometimes called endogenous selection bias or collider stratification bias (Greenland, 2003; Elwert & Winship, 2014; Cole et al., 2010; Hernán et al., 2004), arises when the analysis conditions on selection that may be induced by both the primary treatment and outcome variables. Selection bias naturally occurs in outcome-dependent sampling designs that are widely used in epidemiological and econometric research to reduce cost and effort, such as case-control design (Breslow & Day, 1980; Pearce, 2018), case-cohort design (Prentice, 1986), choice-based design (Manski & McFadden, 1981), test-negative design (Jackson & Nelson, 2013; Sullivan et al., 2016), and retrospective cohort designs with electronic health records (EHR) data (Streeter et al., 2017). Similar to confounding bias, selection bias potentially induces spurious associations between treatment and outcome variables of primary scientific interest, even under the null hypothesis of no causal effect. Existing methods to adjust for selection bias include explicit modeling of sample selection mechanism (Heckman, 1979; Heckman et al., 1998; Cuddeback et al., 2004), matching (Heckman et al., 1996, 1998), difference-in-differences models (Heckman et al., 1998) and inverse probability weighting (Hernán et al., 2004; Mansournia & Altman, 2016). However, when a common latent factor causes the treatment, outcome and selection mechanisms, the above methods are not applicable and the causal effect cannot generally be identified. Gabriel et al. (2020) referred to such a sampling design as “confounded outcome-dependent sampling”. They derived causal bounds of the average treatment effect, but these bounds are often too wide to be useful in practice, which highlights the challenges of identification. Didelez et al. (2010), Bareinboim & Pearl (2012) and Bareinboim et al. (2014) studied graphical conditions for recoverability of the causal effect encoded on the additive scale and odds ratio scale under different forms of selection bias including outcome-dependent sampling. They concluded that neither causal effect is identifiable if selection is dependent on both outcome and unmeasured confounders.

Recently, Li et al. (2022) studied inference about an odds ratio model encoding a treatment’s association with an outcome of interest conditional on both measured and unmeasured confounders under confounded outcome-dependent sampling, leveraging a pair of negative control variables. The consistency of their estimator, which we refer to as the proximal inverse probability weighted (PIPW) estimator, requires consistently estimating a treatment confounding bridge function which restricts the treatment assignment mechanism. However, in practice, it may be difficult to posit a suitable parametric model for the treatment confounding bridge function, which may limit the applicability of their approach.

In this paper, we develop semiparametric inference for the conditional odds ratio estimand in Li et al. (2022). We present a new identification result and propose the proximal outcome regression (POR) estimator of conditional odds ratio that relies on the existence and correct specification of an outcome confounding bridge function which restricts the outcome law by the conditional odds of the outcome given the treatment and confounders. We further introduce a doubly-robust closed-form expression for the conditional odds ratio, based on which we propose a proximal doubly robust (PDR) estimator, which is consistent if either the treatment or the outcome confounding bridge function is consistently estimated and thus has improved robustness against model misspecification of the nuisance functions compared with PIPW and POR. We demonstrate the performance of the proposed estimators through comprehensive simulations. Throughout, we relegate all proofs to Section S1 of the Supplementary Materials.

(a)

(b)

(c)

Figure 1: (a) shows the directed acyclic graphs of a confounded outcome-dependent sampling in view. (b) and (c) shows two scenarios where the odds ratio can be identified leveraging additional treatment proxy (

Z

) and outcome proxy (

W

) variables.

2 Identification and estimation under a homogeneous odds ratio model

2.1 Notation and Setup

Suppose the data contain a sample of $n$ identically and independently distributed observations drawn from the population of interest, referred to as the “target population”. For each observation in the target population, let $A$ and $Y$ be the treatment and outcome of primary interest, respectively, assumed to be binary for the time being. Let $S$ be a binary indicator for selection into the study sample, such that the available data only include observations with $S=1$ . Let $X$ be a vector of measured pre-treatment covariates that may be associated with $A$ , $Y$ and $S$ . Suppose that in addition to $X$ , there exist pre-treatment latent factors, denoted as $U$ , that may cause $A$ , $Y$ and $S$ . Figure 1(a) shows the directed acyclic graph (DAG) for the causal relationships between the above variables. Similar to Li et al. (2022) Section 2.5, we assume the following model, where $\beta_{0}$ , the conditional log odds ratio given $(U,X)$ , encodes the treatment effect of primary interest: {model}[Homogeneous odds ratio model; Li et al. (2022) Assumption 3’]

\mathrm{logit}\,P(Y=1\mid A,U,X)=\beta_{0}A+\eta(U,X)

where $\mathrm{logit}(x)=\log\left\{x/(1-x)\right\}$ denotes the logit function and $\eta$ is an unknown real-valued function. Model 2.1 describes a semiparametric logistic regression model that assumes a homogeneous association between $A$ and $Y$ on the odds ratio scale, across strata of all measured and latent factors $(U,X)$ . Model 2.1 encodes the structural model

\mathrm{logit}\,P(Y(a)=1\mid U,X)=\beta_{0}a+\eta(U,X),\qquad{a=0,1},

where $Y(a)$ ( $a=0,1$ ) denotes the potential outcome were the individual given the treatment status $A=a$ , under standard identifiability assumptions of (a) consistency, i.e. $Y(a)=Y$ if $A=a$ , which requires there is only one version of treatment and an individual’s outcome is not affected by others’ treatment status (Cole & Frangakis, 2009); (b) latent ignorability, i.e. $Y(a)\perp\!\!\!\perp A\mid U,X$ , which essentially requires $(U,X)$ to contain all confounders of A-Y association; and (c) positivity, i.e. $0<P(A=1\mid U,X)<1$ almost surely (Hernán MA, 2020). Under these conditions, $\beta_{0}$ encodes the log conditional causal odds ratio effect in every $(U,X)$ stratum.

In Li et al. (2022) Assumption 3, they also considered a homogeneous risk ratio model, by which the marginal causal risk ratio can be identified. We focus on Model 2.1 for two reasons: first, proximal identification of $\beta_{0}$ under Model 2.1 permits treatment-induced selection under Assumption 2.1 below, while identification under the homogeneous risk ratio model appears to require that $A$ has no causal effect on selection other than through $Y$ ; second, Li et al. (2022) invoked a rare outcome assumption in the target population for proximal identification under the homogeneous risk ratio model, which restricts the applicability of the method. Such a rare outcome assumption is not necessary for proximal identification of $\beta_{0}$ under Model 2.1. As discussed in Section S2 of the Supplementary Materials, however, under standard identifiability assumptions (a)-(c) above and a rare outcome assumption akin to that assumed by Li et al. (2022), $\beta_{0}$ in Model 2.1 still approximates the log marginal causal risk ratio $P(Y(1)=1)/P(Y(0)=1)$ . Alternatively, as discussed in Section S3 of the Supplementary Materials, under settings where control subjects are selected to be positive for an outcome variable that is a priori known to be not affected by the treatment, $\beta_{0}$ may recover the marginal causal risk ratio without invoking the rare outcome condition. Such a setting may occur, for example, in test-negative design studies of vaccine effectiveness, where the control subjects are selected to have a pre-determined infection that is not affected by the vaccine (Jackson & Nelson, 2013; Sullivan et al., 2016; Chua et al., 2020; Schnitzer, 2022).

Although to simplify the presentation, results given in the main text focus primarily on Model 2.1, Section S11 of the Supplemental Materials extends these results to the more general model

\mathrm{logit}\,P(Y=1\mid A,U,X)=\beta_{0}(X)A+\eta^{\prime}(U,X),

which explicitly accounts for effect modification of odds ratio with respect to measured covariates $X$ .

In general, the conditional log odds ratio $\beta_{0}$ is not identifiable: first, the latent factors $U$ defining the strata are not observed; second, the data only include selected subjects with $S=1$ whilst Model 2.1 is defined in the target population.

As in Li et al. (2022) Assumption 2’, we make an important assumption regarding the selection mechanism that the treatment $A$ does not modify the conditional risk ratio of selection against the outcome in every $(U,X)$ stratum. Formally, we assume: {assumption}[No effect modification by $A$ on the outcome-dependent selection]For $a=0,1$ and some positive-valued unknown function $\xi$ ,

\dfrac{P(S=1\mid Y=1,A=a,U,X)}{P(S=1\mid Y=0,A=a,U,X)}=\xi(U,X).

As mentioned above, a special case of Assumption 2.1 is if $A$ has no causal effect on $S$ other than through $Y$ , a stronger assumption considered by Li et al. (2022) to identify the marginal causal risk ratio. As a result of Assumption 2.1, the conditional odds ratio given $(U,X,S=1)$ is identical to that given $(U,X)$ , which leaves only the challenge of controlling for unmeasured confounding by $U$ in the selected sample.

2.2 Proximal Identification of $\beta_{0}$ in Li et al. (2022)

To detect and correct for unmeasured confounding bias, the proximal inference framework proposes to leverage a pair of proxy measurements of unmeasured confounders (Miao et al., 2018a; Cui et al., 2020; Tchetgen Tchetgen et al., 2020). In this line of works, Li et al. (2022) developed proximal causal inference for $\beta_{0}$ under confounded outcome-dependent sampling as represented in Figure 1(a) (Jackson & Nelson, 2013; Chua et al., 2020). We similarly assume proxies of the latent factors are available. {assumption}[Proximal independence conditions] There exist a pair of proxies of the latent factor $U$ : a treatment proxy (which Li et al. (2022) refers to as a “negative control exposure”), denoted as $Z$ , and an outcome proxy (which Li et al. (2022) refers to as a “negative control outcome”), denoted as $W$ , that satisfy

\text{(a) }W\perp\!\!\!\perp A\mid U,X,Y,S;\qquad\text{(b) }Z\perp\!\!\!\perp(Y,W,S)\mid A,U,X.

Assumption 2.2 requires the treatment proxy $Z$ and outcome proxy $W$ to satisfy certain conditional independences in the target population. Namely, the treatment proxy $Z$ has no direct effect on the primary outcome $Y$ , selection indicator $S$ , and the outcome proxy $W$ ; and the primary treatment $A$ has no direct effect on the outcome proxy $W$ . It is crucial that both $Z$ and $W$ are both $U$ -relevant, that is they carry information and therefore are associated with $U$ (Shi et al., 2020b; Tchetgen Tchetgen et al., 2020); this is formalized by Assumptions 2.2(a) and 2.3(a) below. As such, in the selected sample, an observed $A$ - $W$ association after adjusting for $Y,X$ , or an observed $Z$ - $Y$ or $Z$ - $W$ association after adjusting for $A,X$ , may indicate the presence of bias due to the latent factor $U$ . Figure 1 (b) and (c) show the DAGs of scenarios where Assumption 2.2 holds.

As in Li et al. (2022), we assumed the existence of a treatment confounding bridge function that connects the treatment proxy to the latent factor.

{assumption}

[Treatment confounding bridge function] There exists a treatment confounding bridge function $q(A,Z,X)$ that satisfies

E[q(A,Z,X)\mid A=a,U,X,Y=0,S=1]=1/P(A=a\mid U,X,Y=0,S=1)

(1)

almost surely for $a=0,1$ .

Equation (1) formally defines a Fredholm integral equation of the first kind. To ensure the existence and identifiability of the solution to Equation (1), we make the following completeness assumptions, similar to those assumed by Li et al. (2022) and Cui et al. (2020):

{assumption}

[Completeness]

(a)

For any square-integrable function $g$ , if $E[g(U)|A,Z,X,Y=0,S=1]=0$ almost surely, then $g(U)=0$ almost surely;
(b)

For any square-integrable function $g$ , if $E[g(Z)|A,W,X,Y=0,S=1]=0$ almost surely, then $g(Z)=0$ almost surely.

The completeness condition has been used to achieve identification for statistical functionals in econometrics and semiparametric statistics literature (Newey & Powell, 2003; D’Haultfoeuille, 2011). Essentially, Assumption 2.2(a) formalizes the $U$ -relevance of $Z$ and requires that $Z$ has at least as much variability as $U$ in every $(A,X)$ stratum of the outcome-free subjects in the selected sample. When $U$ and $Z$ are both categorical variables, $Z$ necessarily has at least as many categories as $U$ . Assumption 2.2(a) and the regularity conditions discussed in Section S4 of the Supplementary Materials, together constitute a sufficient condition for Assumption 2.2, i.e. the existence of $q$ .

Similarly, Assumption 2.2(b) roughly requires that the outcome proxy $W$ is at least as variable as the treatment proxy $Z$ , and is a sufficient condition under which $q(A,Z,X)$ can be uniquely identified from the selected sample, as stated in the theorem below:

Theorem 2.1 (Identification of $q(A,Z,X)$ ; Li et al. (2022) Theorem 2’).

Under Assumptions 2.2 and 2.2, any treatment confounding bridge function $q(A,Z,X)$ satisfies the moment condition

E\left\{q(a,Z,X)\mid A=a,W,X,Y=0,S=1\right\}=1/P(A=a\mid W,X,Y=0,S=1).

(2)

If Assumption 2.2(b) holds in addition, then the function $q(\cdot)$ is unique and can be identified as the unique solution to Equation (2).

Leveraging the proxies, Li et al. (2022) established the following identification result for $\beta_{0}$ .

Theorem 2.2 (Identification of $\beta_{0}$ ; Li et al. (2022) Theorem 1’).

Under Model 2.1, if Assumptions 2.1-2.2 hold, then the conditional log odds ratio parameter $\beta_{0}$ solves the moment equation

E\left\{(-1)^{1-A}c(X)Yq(A,Z,X)e^{-\beta A}\mid S=1\right\}=0

(3)

where $c(X)$ is an arbitrary square-integrable unidimensional function that satisfies

E\left\{c(X)I(A=a)Yq(A,Z,X)\mid S=1\right\}\neq 0,\quad a=0,1.

Equation (3) admits a closed-form solution for $\beta_{0}$ :

\beta_{0}=\log\left(\dfrac{E\left\{c(X)I(A=1)Yq(A,Z,X)\mid S=1\right\}}{E\left\{c(X)I(A=0)Yq(A,Z,X)\mid S=1\right\}}\right).

(4)

Unique identification of $q$ by Assumption 2.2(b) is necessary. Although Equation (4) holds for any treatment confounding bridge function $q$ that satisfies Equation (1), $q$ can only be identified in the selected sample through solving Equation (2). If Equation (2) has multiple solutions, there is no guarantee that a solution to Equation (2) is also a solution to Equation (1).

2.3 New Proximal Identification Results

From Theorem 2.2(b), it is clear that the treatment confounding bridge function $q(A,Z,X)$ plays a crucial role for identifying $\beta_{0}$ . However, in case Assumption 2.2 does not hold, or Assumption 2.2 holds but Assumption 2.2(b) does not hold, then the solution to Equation (2) in the selected sample may not be unique and satisfy Equation (1) that defines a treatment confounding bridge function. As a result, the parameter $\beta_{0}$ identified by Equation (4), where $q$ is a solution to Equation (2), may not have a causal interpretation. In this section, we establish a new identification result for $\beta_{0}$ that does not rely on $q(A,Z,X)$ . We instead define an outcome confounding bridge function as below: {assumption}[Outcome confounding bridge function]There exists an outcome confounding bridge function $h(A,W,X)$ that satisfies

E\{h(A,W,X)\mid A,U,X,Y=0,S=1\}=\frac{P(Y=1\mid A,U,X,S=1)}{P(Y=0\mid A,U,X,S=1)}

(5)

almost surely. Miao et al. (2018b) and Cui et al. (2020) previously proposed proximal identification of average treatment effect using an outcome confounding bridge function. Our definition of $h(A,W,X)$ by Equation (5) is different from those in previous works of proximal causal inference, in that the conditional expectation of our $h(A,W,X)$ is defined to equal the conditional odds of the outcome, whilst the conditional expectation of the outcome confounding bridge function in previous works was defined to equal the conditional mean outcome. Theorem 2.4 below suggests that the standard outcome bridge function of prior literature would indeed fail to identify $\beta_{0}$ while our alternative definition yields the desired identification result.

In contrast with Assumption 2.2, we make the following completeness assumption for the existence and identifiability of the outcome confounding bridge function $h$ .

{assumption}

[Completeness]

(a)

For any square-integrable function $g$ , if $E[g(U)|A,W,X,Y=0,S=1]=0$ almost surely, then $g(U)=0$ almost surely;
(b)

For any square-integrable function $g$ , if $E[g(W)|A,Z,X,Y=0,S=1]=0$ almost surely, then $g(W)=0$ almost surely.

Similar to Assumption 2.2(a), Assumption 2.3(a) essentially requires that $W$ has sufficient variability relative to $U$ in every $(A,X)$ stratum of the outcome-free subjects in the selected sample, and constitutes a sufficient condition for Assumption 2.3, i.e. the existence of $h$ , together with other regularity conditions, as discussed in Section S4 of the Supplementary Materials. Assumption 2.3(b) requires the opposite of Assumption 2.2(b), that the treatment proxy $Z$ is sufficiently variable relative to the outcome proxy $W$ , and is a sufficient condition under which $h(A,W,X)$ can be uniquely identified from the selected sample, as stated in Theorem 2.3 below.

Theorem 2.3 (Identification of $h(A,W,X)$ ).

Under Assumptions 2.2 and 2.3, any outcome confounding bridge function $h(A,W,X)$ satisfies the moment condition

E\left\{h(A,W,X)\mid A,Z,X,Y=0,S=1\right\}=\dfrac{P(Y=1\mid A,Z,X,S=1)}{P(Y=0\mid A,Z,X,S=1)}.

(6)

If Assumption 2.3(b) holds in addition, then the function $h(\cdot)$ is unique and can be identified as the unique solution to Equation (6).

We introduce the new identification result in Theorem 2.4 below:

Theorem 2.4 (Identification of $\beta_{0}$ ).

Under Model 2.1, if Assumptions 2.1, 2.2 and 2.3 hold, then the conditional log odds ratio parameter $\beta_{0}$ solves the moment equation

E\left[c(X)(1-Y)\left\{h(1,W,X)e^{-\beta_{0}}-h(0,W,X)\right\}\mid S=1\right]=0;

(7)

where $c(X)$ is an arbitrary square-integrable unidimensional function that satisfies

E\left\{c(X)(1-Y)h(a,W,X)\mid S=1\right\}\neq 0,\quad a=0,1.

The integral equation (7) also admits a closed-form solution

\beta_{0}=\log\left(\dfrac{E\left\{c(X)(1-Y)h(1,W,X)\mid S=1\right\}}{E\left\{c(X)(1-Y)h(0,W,X)\mid S=1\right\}}\right).

(8)

2.4 A doubly robust closed-form expression for $\beta_{0}$

Until now, identification of $\beta_{0}$ has required the existence and identification of either a treatment confounding bridge function or an outcome confounding bridge function. In this section, we present a closed-form expression for $\beta_{0}$ , which has a desirable doubly-robustness property in the sense that it only requires one of the two confounding bridge functions to exist and be identified.

Theorem 2.5 (A doubly-robust closed-form expression for $\beta_{0}$ ).

Under Model 2.1 and Assumptions 2.1 and 2.2, we have

$\displaystyle\beta_{0}$	$\displaystyle=\log\bigg{(}E\bigg{\{}c(X)\bigg{[}I(A=1)q(A,Z,X)\left\{Y-(1-Y)h(A,W,X)\right\}+$
	$\displaystyle\qquad(1-Y)h(1,W,X)\bigg{]}\mid S=1\bigg{\}}/E\bigg{\{}c(X)\bigg{[}I(A=0)q(A,Z,X)\{Y-$
	$\displaystyle\qquad(1-Y)h(A,W,X)\}+(1-Y)h(0,W,X)\bigg{]}\mid S=1\bigg{\}}\bigg{)},$	(9)

if either $q$ is a solution to Equation (1), or $h$ is a solution to Equation (5). Here $c(X)$ is an arbitrary square-integrable real-valued function satisfying

	$\displaystyle E\bigg{\{}c(X)\bigg{[}I(A=a)q(A,Z,X)\{Y-(1-Y)h(a,W,X)\}+$
	$\displaystyle\qquad(1-Y)h(a,W,X)\bigg{]}\mid S=1\bigg{\}}\neq 0$

for $a=0,1$ .

In Equation (9), setting $h(A,Z,X)=0$ recovers the closed-form solution (4), while setting $q(A,Z,X)=0$ gives the closed-form solution (8), illustrating the double robustness property of the closed-form expression (9).

In Section S5 of the Supplementary Materials, we compare our assumptions and results with those in Li et al. (2022), and summarise the additional conditions under which $\beta_{0}$ identifies the marginal causal risk ratio.

2.5 Estimation and large sample inference

Denote data for the $i$ th subject in the selected sample as $O_{i}=(A_{i},Y_{i},Z_{i},W_{i},X_{i})$ , $i=1,\dots,n$ . As suggested by Equations (4), (8) or (9), consistent estimators for $q(A,Z,X)$ and $h(A,W,X)$ can be used to construct estimators for $\beta_{0}$ following the standard plug-in principle (Bickel & Ritov, 2003). It remains to estimate the two confounding bridge functions. Below, we present crucial moment conditions for the identification of the confounding bridge functions.

Theorem 2.6 (Moment conditions for $q(A,Z,X)$ and $h(A,W,X)$ ).

Under Assumptions 2.2, 2.2 and 2.3, the confounding bridge functions $q(\cdot)$ and $h(\cdot)$ satisfy

	$\displaystyle E\left[(1-Y)\{\kappa_{1}(A,W,X)q(A,Z,X)-\kappa_{1}(1,W,X)-\kappa_{1}(0,W,X))\}\mid S=1\right]=0$		(10)
	$\displaystyle E\left[\kappa_{2}(A,Z,X)\{(1-Y)h(A,W,X)-Y\}\mid S=1\right]=0$		(11)

where $\kappa_{1}(\cdot)$ and $\kappa_{2}(\cdot)$ are arbitrary square-integrable functions.

Equations (4), (8) and (9) and Theorem 2.6 together suggest a parametric approach to estimate $\beta_{0}$ : one may postulate suitable parametric working models $q(A,Z,X;\tau)$ and $h(A,W,X;\psi)$ where $\tau$ and $\psi$ are unknown parameters of finite dimensions. One can then estimate the nuisance parameters $\tau$ and $\psi$ by solving the estimating equations

	$\displaystyle\dfrac{1}{n}\sum_{i=1}^{n}(1-Y_{i})\left\{\kappa_{1}(A_{i},W_{i},X_{i})q(A_{i},Z_{i},X_{i};\tau)-\kappa_{1}(1,W,X)-\kappa_{1}(0,W,X)\right\}=0,$		(12)
	$\displaystyle\dfrac{1}{n}\sum_{i=1}^{n}\kappa_{2}(A_{i},Z_{i},X_{i})\left\{(1-Y_{i})h(A_{i},W_{i},X_{i};\psi)-Y_{i}\right\}=0.$		(13)

with user-specified functions $\kappa_{1}$ and $\kappa_{2}$ of dimensions equal to that of $\tau$ and $\psi$ respectively. Closed-form parametric models for the confounding bridge functions can be derived in some settings as described below.

Example 2.7.

If $U,Z,W$ are all binary variable and $X$ is a categorical variable with levels $\{x_{1},\dots,x_{L}\}$ , then a suitable model is the saturated model of the form

		$\displaystyle q(a,z,x;\tau)=\tau_{0x}+\tau_{1x}a+\tau_{2x}z+\tau_{3x}az$		(14)
	and	$\displaystyle h(a,w,x;\psi)=\psi_{0x}+\psi_{1x}a+\psi_{2x}w+\psi_{3x}aw$		(15)

for $x=\{x_{1},\dots,x_{L}\}$ , where

	$\displaystyle\tau$	$\displaystyle=(\tau_{0x_{1}},\tau_{1x_{1}},\tau_{2x_{1}},\tau_{3x_{1}},\dots,\tau_{0x_{L}},\tau_{1x_{L}},\tau_{2x_{L}},\tau_{3x_{L}})^{T},$
	$\displaystyle\text{and}\qquad\psi$	$\displaystyle=(\psi_{0x_{1}},\psi_{1x_{1}},\psi_{2x_{1}},\psi_{3x_{1}},\dots,\psi_{0x_{L}},\psi_{1x_{L}},\psi_{2x_{L}},\psi_{3x_{L}})^{T}.$

Example 2.8.

If $Z$ and $W$ are both multivariate Gaussian random variables conditioning on $(A,U,X,Y)$ , under the setting given in Section S6 of the Supplementary Materials, suitable models for the confounding bridge functions are

		$\displaystyle q(a,z,x;\tau)=1+\exp\{(1-2A)(\tau_{0}+\tau_{A}a+\tau_{Z}^{T}z+\tau_{X}^{T}x)\}$		(16)
	and	$\displaystyle h(a,w,x;\psi)=\exp(\psi_{0}+\psi_{A}a+\psi_{W}^{T}w+\psi_{X}^{T}x).$		(17)

where $\tau=(\tau_{0},\tau_{A},\tau_{Z}^{T},\tau_{X}^{T})^{T}$ and $\psi=(\psi_{0},\psi_{A},\psi_{Z}^{T},\psi_{X}^{T})^{T}$ .

After obtaining the estimators $\hat{\tau}$ and $\hat{\psi}$ by solving Equations (12) and (13), $\beta_{0}$ can then be estimated with the following plug-in estimators:

$\displaystyle\hat{\beta}_{c,\text{PIPW}}$	$\displaystyle=\log\left(\dfrac{\frac{1}{n}\sum_{i=1}^{n}I(A_{i}=1)c(X_{i})q(1,Z_{i},X_{i};\hat{\tau})Y_{i}}{\frac{1}{n}\sum_{i=1}^{n}I(A_{i}=0)c(X_{i})q(0,Z_{i},X_{i};\hat{\tau})Y_{i}}\right);$	(18)
$\displaystyle\hat{\beta}_{c,\text{POR}}$	$\displaystyle=\log\left(\dfrac{\frac{1}{n}\sum_{i=1}^{n}c(X_{i})(1-Y_{i})h(1,W_{i},X_{i};\hat{\psi})}{\frac{1}{n}\sum_{i=1}^{n}c(X_{i})(1-Y_{i})h(0,W_{i},X_{i};\hat{\psi})}\right);$	(19)
$\displaystyle\hat{\beta}_{c,\text{PDR}}$	$\displaystyle=\log\bigg{(}\dfrac{1}{n}\sum_{i=1}^{n}\bigg{\{}c(X_{i})\bigg{[}I(A_{i}=1)q(A_{i},Z_{i},X_{i})\left\{Y_{i}-(1-Y_{i})h(A_{i},W_{i},X_{i})\right\}+$
	$\displaystyle\qquad(1-Y_{i})h(1,W_{i},X_{i})\bigg{]}\bigg{\}}\bigg{/}\dfrac{1}{n}\sum_{i=1}^{n}\bigg{\{}c(X_{i})\bigg{[}I(A_{i}=0)q(A_{i},Z_{i},X_{i})\{Y_{i}-$
	$\displaystyle\qquad(1-Y_{i})h(A_{i},W_{i},X_{i})\}+(1-Y_{i})h(0,W_{i},X_{i})\bigg{]}\bigg{\}}\bigg{)}.$	(20)

Alternatively, one can jointly estimate $\beta_{0}$ and nuisance parameters $\tau$ and $\psi$ by solving estimating equations with the following estimating functions:

	$\displaystyle G_{\text{PIPW}}(O_{i};\beta,\tau)=$
	$\displaystyle\quad\left(\begin{array}[]{l}(1-Y_{i})\left\{\kappa_{1}(A_{i},W_{i},X_{i})q(A_{i},Z_{i},X_{i};\tau)-\kappa_{1}(1,W,X)-\kappa_{1}(0,W,X)\right\}\\ c(X_{i}){(-1)}^{1-A_{i}}q(A_{i},Z_{i},X_{i};\tau)Y_{i}e^{-\beta A_{i}}\end{array}\right),$
	$\displaystyle G_{\text{POR}}(O_{i};\beta,\psi)=$
	$\displaystyle\quad\left(\begin{array}[]{l}\kappa_{2}(A_{i},Z_{i},X_{i})\left\{(1-Y_{i})h(A_{i},W_{i},X_{i};\psi)-Y_{i}\right\}\\ c(X_{i})(1-Y_{i})\left\{h(1,W_{i},X_{i};\psi)e^{-\beta}-h(0,W_{i},X_{i};\psi)\right\}\end{array}\right),$
	and
	$\displaystyle G_{\text{PDR}}(O_{i};\beta,\tau,\psi)=$
	$\displaystyle\quad\left(\begin{array}[]{l}(1-Y_{i})\left\{\kappa_{1}(A_{i},W_{i},X_{i})q(A_{i},Z_{i},X_{i};\tau)-\kappa_{1}(1,W,X)-\kappa_{1}(0,W,X)\right\}\\ \kappa_{2}(A_{i},Z_{i},X_{i})\left\{(1-Y_{i})h(A_{i},W_{i},X_{i};\psi)-Y_{i}\right\}\\ c(X_{i})\bigg{[}(-1)^{1-A_{i}}q(A_{i},Z_{i},X_{i};\tau)e^{-\beta A}\left\{Y_{i}-(1-Y_{i})h(A_{i},W_{i},X_{i};\psi)\right\}+\\ \qquad\qquad(1-Y_{i})\left\{h(1,W_{i},X_{i};\psi)e^{-\beta}-h(0,W_{i},X_{i};\psi)\right\}\bigg{]}\end{array}\right).$

The resulting estimators are consistent and asymptotically normal, following standard estimating equation theory (Van der Vaart, 2000). Similar to Cui et al. (2020), we refer to $\hat{\beta}_{c,\text{PIPW}}$ , $\hat{\beta}_{c,\text{POR}}$ and $\hat{\beta}_{c,\text{PDR}}$ as the proximal inverse probability weighted (PIPW) estimator, the proximal outcome regression (POR) estimator, and the proximal doubly robust (PDR) estimator, respectively. As discussed in Sections S2 and S3 in the Supplementary Materials, the PIPW estimator can estimate the marginal causal risk ratio assuming either (a) a homogeneous risk model, a treatment-independent selection mechanism, and a rare outcome condition in the target population, or (b) a test-negative design where the controls are selected to have an irrelevant outcome to the treatment. Because they estimate the same functional, the new POR and PDR estimators can be viewed as estimators of the marginal causal risk ratio under the same conditions.

As implied by Theorem 2.5, the estimator $\hat{\beta}_{c,\text{PDR}}$ also enjoys the desirable double robustness property– it is consistent for $\beta_{0}$ if either $q(A,Z,X)$ or $h(A,W,X)$ can be consistently estimated. We stated this result formally in the theorem below.

Theorem 2.9.

Under Model 2.1 and Assumptions 2.1, 2.2, 2.2, and 2.3, we have the following results:

(a)

$\hat{\beta}_{c,\text{PIPW}}$ is consistent for $\beta_{0}$ and asymptotically normal if the model $q(A,Z,X;\tau)$ is correctly specified and Assumption 2.2(b) holds;
(b)

$\hat{\beta}_{c,\text{POR}}$ is consistent for $\beta_{0}$ and asymptotically normal if the model $h(A,W,X;\psi)$ is correctly specified and Assumption 2.3(b) holds;
(c)

$\hat{\beta}_{c,\text{PDR}}$ is consistent for $\beta_{0}$ and asymtotically normal if either the conditions in (a) or those in (b) hold.

Theorem 2.9 indicates that the proposed PDR estimator has improved robustness over the PIPW and POR estimator against model misspecification of the two confounding bridge functions. If candidate proxies $W$ and $Z$ have different cardinalities, for example, when they are categorical variables of unequal dimensions, then only one of the completeness Assumptions 2.2 and 2.3 may hold. Say $W$ is of higher dimension, then Assumption 2.3(b) cannot hold, the outcome bridge function $h(A,W,X)$ may not be not uniquely identified, and $\hat{\beta}_{c,\text{POR}}$ may be biased for $\beta_{0}$ . However, it may be possible to coarsen levels of $W$ to match those of $Z$ (Shi et al., 2020a). We highlight that the consistency of $\hat{\beta}_{c,\text{PDR}}$ only requires one of the two completeness assumptions to hold.

Our identification and estimation strategies involve user-specified functions $c(\cdot)$ , $\kappa_{1}(\cdot)$ and $\kappa_{2}(\cdot)$ . In principle, these nuisance functions can be chosen to construct a potentially more efficient estimator for $\beta_{0}$ , $q$ , and $h$ respectively, but the optimal choice of these functions may require modeling complex features of the observed data distribution. As discussed in Cui et al. (2020) and Stephens et al. (2014), such features can be considerably difficult to model correctly, and unsuccessful attempts to do so may lead to loss of efficiency, thus are not always worthwhile. In our simulation and real data analysis, we found that our generic choices for $c(\cdot)$ , $\kappa_{1}(\cdot)$ and $\kappa_{2}(\cdot)$ deliver reasonable efficiency.

Until now, our estimation has required specification of parametric working models for $q(A,Z,X;\tau)$ and/or $h(A,W,X;\psi)$ . When $X$ is continuous or high-dimensional, however, neither of these working models may contain the true confounding bridge functions. Ghassami et al. (2021) and Kallus et al. (2021) concurrently proposed a cross-fitting minimax kernel learning method for proximal learning of average treatment effect. Their method allows nonparametric estimation of the nuisance functions and $\sqrt{n}$ -consistent estimation for the average treatment effect, even when both estimated confounding bridge functions are consistent at slower than $\sqrt{n}$ rates. In Section S7 of the Supplementary Materials, we demonstrate that their method is also applicable to our setting, and describe a semiparametric kernel estimator for the odds ratio parameter with flexibly estimated confounding bridge functions, $\sqrt{n}$ -consistency, and asymptotic normality.

3 Simulation

In this section, we perform simulation studies to evaluate the performance of our proposed estimators of $\beta_{0}$ under five different scenarios. All scenarios follow Model 2.1 with $\beta_{0}=\log(0.2)$ . In Scenario I, the variables $(U,Z,W)$ are all univariate binary variables. We suppose there are no measured confounders $X$ . Therefore, saturated models for the confounding bridge functions are appropriate, i.e.

\displaystyle q(A,Z;\tau)=\tau_{0}+\tau_{a}A+\tau_{z}Z+\tau_{az}AZ\text{ and }h(A,W;\psi)=\psi_{0}+\psi_{a}A+\psi_{z}Z+\psi_{az}AZ.

The nuisance parameters $\tau$ and $\psi$ are estimated by solving Equations (12) and (13), respectively, with $\kappa_{1}(A,W)=(1,A,W,AW)^{T}$ and $\kappa_{2}(A,Z)=(1,A,Z,AZ)^{T}$ . In Scenario II, the variables $(X,U,Z,W)$ are continuous. The conditional distributions of $Z$ and $W$ given other variables follow Example 2.8. Therefore parametric working models in Equations (16) and (17) are appropriate. We use these parametric working models and estimate the nuisance parameters $\tau$ and $\psi$ by solving estimating equations (12) and (13), setting $\kappa_{1}(A,W,X)$ to be a vector including $A$ , $W$ , $X$ , all of their higher-order interactions, and an intercept term, and $\kappa_{2}(A,Z,X)$ similarly. We set $c(X)=1$ for all the estimators under comparison.

In Scenarios III-V, the data generating process is the same as in Scenario II but we misspecify different components of the parametric working models. In Scenario III, we misspecify the treatment confounding bridge function $q(A,Z,X)$ by ignoring $X$ , i.e., we set

q(a,z,x;\tau)=1+\exp\{(1-2A)(\tau_{0}+\tau_{A}a+\tau_{Z}z)\}.

In this scenario, we expect the POR and PDR estimators to be consistent and the PIPW estimator to be inconsistent. In Scenario IV, we similarly misspecify the outcome confounding bridge function $h(A,W,X)$ by ignoring $X$ , i.e. we set

h(a,w,x;\psi)=\exp(\psi_{0}+\psi_{A}a+\psi_{w}w).

In this scenario, we expect the PIPW and PDR estimators to be consistent and the POR estimator to be inconsistent. Finally, in Scenario V, we misspecify both $q(A,Z,X)$ and $h(A,W,X)$ as above. We expect all three estimators to be inconsistent.

We relegate the details of the data generating process to Section S10 of the Supplementary Materials. We report the bias, standard deviation, and coverage of 95% confidence intervals of the POR, PIPW, and PDR estimators for each scenario over 500 Monte Carlo samples.

Table 1 shows the results of various simulations. In Scenario I and II, as the parametric working models for both confounding bridge functions are correctly specified, all three estimators have essentially identical performances as expected. In Scenario III, the POR and PDR estimators are consistent and have calibrated 95% confidence intervals, while PIPW is inconsistent and has anti-conservative 95% confidence intervals. On the other hand, in Scenario IV, the PIPW and PDR estimators are consistent and have calibrated 95% confidence intervals but not the POR estimator. In Scenario V, all three estimators are inconsistent and have anti-conservative 95% confidence intervals, but the PDR estimator appears to have slightly smaller biases and better confidence interval coverage.

Table 1: Bias, standard deviation, and coverage of 95% confidence intervals of the proximal outcome regression (POR), proximal inverse probability weighted (PIPW), and proximal doubly robust (PDR) estimator for five different scenarios summarized over 500 Monte Carlo samples. In Scenario I, the variables

(U,Z,W)

are univariate binary; in Scenario II-V, the variables

(X,U,Z,W)

are univariate continuous variables.

Scenario	N	POR			PIPW			PDR
Scenario	N	Bias	SD	Coverage	Bias	SD	Coverage	Bias	SD	Coverage
I	3,500	0.05	0.27	97.4%	0.05	0.05	97.4%	0.05	0.27	97.4%
I	5,000	0.00	0.23	96.4%	0.02	0.22	96.4%	0.00	0.22	96.4%
II	$1\times 10^{5}$	-0.03	0.24	95.0%	-0.03	0.23	94.2%	-0.02	0.23	94.2%
II	$2\times 10^{5}$	-0.02	0.16	95.0%	-0.02	0.15	94.8%	-0.02	0.15	94.2%
III	$1\times 10^{5}$	-0.03	0.24	94.8%	-0.11	0.22	91.2%	-0.02	0.23	94.2%
III	$2\times 10^{5}$	-0.02	0.16	95.0%	-0.10	0.15	90.0%	-0.02	0.15	95.0%
IV	$1\times 10^{5}$	-0.12	0.24	92.2%	-0.03	0.23	94.0%	-0.03	0.23	94.0%
IV	$2\times 10^{5}$	-0.11	0.16	90.8%	-0.02	0.15	94.6%	-0.02	0.15	94.2%
V	$1\times 10^{5}$	-0.12	0.24	92.2%	-0.11	0.22	90.8%	-0.09	0.22	92.2%
V	$2\times 10^{5}$	-0.11	0.16	90.8%	-0.10	0.15	89.8%	-0.08	0.15	91.6%

4 Extensions

We have focused on the settings where the treatment and outcome are both binary, but the proposed framework can extend to study the association between a polytomous treatment and a polytomous outcome with simple modification. These extensions are relegated to Section S12 of the Supplementary Materials.

Although we have primarily focused on test-negative design as an example of outcome-dependent sampling, our method applies to other outcome-dependent sampling designs where unmeasured confounding is of concern and suitable proxy variables may be identified, such as a case-cohort study (Breslow & Day, 1980) or a retrospective case-control/cohort study using data from electronic health records (Streeter et al., 2017), where subjects’ treatment status may be associated with underlying frailty or risk of developing the outcome of interest.

References

Ai & Chen (2003) Ai, C. & Chen, X. (2003). Efficient estimation of models with conditional moment restrictions containing unknown functions. Econometrica 71, 1795–1843.
Bareinboim & Pearl (2012) Bareinboim, E. & Pearl, J. (2012). Controlling selection bias in causal inference. In Artificial Intelligence and Statistics. PMLR.
Bareinboim et al. (2014) Bareinboim, E., Tian, J. & Pearl, J. (2014). Recovering from selection bias in causal and statistical inference. In Proceedings of the AAAI Conference on Artificial Intelligence, vol. 28.
Bickel & Ritov (2003) Bickel, P. J. & Ritov, Y. (2003). Nonparametric estimators which can be” plugged-in”. The Annals of Statistics 31, 1033–1053.
Breslow & Day (1980) Breslow, N. & Day, N. (1980). Statistical methods in cancer research. volume i-the analysis of case-control studies.[accessed october 10, 2009]. IARC Sci Publ 32, 5–338.
Chen (2007) Chen, X. (2007). Large sample sieve estimation of semi-nonparametric models. Handbook of econometrics 6, 5549–5632.
Chua et al. (2020) Chua, H., Feng, S., Lewnard, J. A., Sullivan, S. G., Blyth, C. C., Lipsitch, M. & Cowling, B. J. (2020). The use of test-negative controls to monitor vaccine effectiveness: a systematic review of methodology. Epidemiology (Cambridge, Mass.) 31, 43.
Cole & Frangakis (2009) Cole, S. R. & Frangakis, C. E. (2009). The consistency statement in causal inference: a definition or an assumption? Epidemiology 20, 3–5.
Cole et al. (2010) Cole, S. R., Platt, R. W., Schisterman, E. F., Chu, H., Westreich, D., Richardson, D. & Poole, C. (2010). Illustrating bias due to conditioning on a collider. International journal of epidemiology 39, 417–420.
Cuddeback et al. (2004) Cuddeback, G., Wilson, E., Orme, J. G. & Combs-Orme, T. (2004). Detecting and statistically correcting sample selection bias. Journal of Social Service Research 30, 19–33.
Cui et al. (2020) Cui, Y., Pu, H., Shi, X., Miao, W. & Tchetgen Tchetgen, E. (2020). Semiparametric proximal causal inference. arXiv preprint arXiv:2011.08411 .
Deaner (2018) Deaner, B. (2018). Proxy controls and panel data. arXiv preprint arXiv:1810.00283 .
Didelez et al. (2010) Didelez, V., Kreiner, S. & Keiding, N. (2010). Graphical models for inference under outcome-dependent sampling. Statistical Science 25, 368–387.
Dikkala et al. (2020) Dikkala, N., Lewis, G., Mackey, L. & Syrgkanis, V. (2020). Minimax estimation of conditional moment models. Advances in Neural Information Processing Systems 33, 12248–12262.
D’Haultfoeuille (2011) D’Haultfoeuille, X. (2011). On the completeness condition in nonparametric instrumental problems. Econometric Theory 27, 460–471.
Elwert & Winship (2014) Elwert, F. & Winship, C. (2014). Endogenous selection bias: The problem of conditioning on a collider variable. Annual review of sociology 40, 31–53.
Engel (1988) Engel, J. (1988). Polytomous logistic regression. Statistica Neerlandica 42, 233–252.
Flanders et al. (2017) Flanders, W. D., Strickland, M. J. & Klein, M. (2017). A new method for partial correction of residual confounding in time-series and other observational studies. American journal of epidemiology 185, 941–949.
Gabriel et al. (2020) Gabriel, E. E., Sachs, M. C. & Sjölander, A. (2020). Causal bounds for outcome-dependent sampling in observational studies. Journal of the American Statistical Association , 1–12.
Ghassami et al. (2021) Ghassami, A., Ying, A., Shpitser, I., Tchetgen Tchetgen, E. et al. (2021). Minimax kernel machine learning for a class of doubly robust functionals. Tech. rep.
Greenland (2003) Greenland, S. (2003). Quantifying biases in causal models: classical confounding vs collider-stratification bias. Epidemiology 14, 300–306.
Heckman (1979) Heckman, J. J. (1979). Sample selection bias as a specification error. Econometrica: Journal of the econometric society , 153–161.
Heckman et al. (1996) Heckman, J. J., Ichimura, H., Smith, J. & Todd, P. (1996). Sources of selection bias in evaluating social programs: An interpretation of conventional measures and evidence on the effectiveness of matching as a program evaluation method. Proceedings of the National Academy of Sciences 93, 13416–13420.
Heckman et al. (1998) Heckman, J. J., Ichimura, H., Smith, J. A. & Todd, P. E. (1998). Characterizing selection bias using experimental data.
Hernán et al. (2004) Hernán, M. A., Hernández-Díaz, S. & Robins, J. M. (2004). A structural approach to selection bias. Epidemiology , 615–625.
Hernán MA (2020) Hernán MA, R. J. (2020). Causal Inference: What If. Boca Raton: Chapman & Hall/CRC.
Jackson & Nelson (2013) Jackson, M. L. & Nelson, J. C. (2013). The test-negative design for estimating influenza vaccine effectiveness. Vaccine 31, 2165–2168.
Kallus et al. (2021) Kallus, N., Mao, X. & Uehara, M. (2021). Causal inference under unmeasured confounding with negative controls: A minimax learning approach. arXiv preprint arXiv:2103.14029 .
Li et al. (2022) Li, K. Q., Shi, X., Miao, W. & Tchetgen, E. T. (2022). Double negative control inference in test-negative design studies of vaccine effectiveness. arXiv preprint arXiv:2203.12509 .
Lipsitch et al. (2010) Lipsitch, M., Tchetgen Tchetgen, E. & Cohen, T. (2010). Negative controls: a tool for detecting confounding and bias in observational studies. Epidemiology (Cambridge, Mass.) 21, 383.
Manski & McFadden (1981) Manski, C. F. & McFadden, D. (1981). Alternative estimators and sample designs for discrete choice analysis. Structural analysis of discrete data with econometric applications 2, 2–50.
Mansournia & Altman (2016) Mansournia, M. A. & Altman, D. G. (2016). Inverse probability weighting. Bmj 352.
Miao et al. (2018a) Miao, W., Geng, Z. & Tchetgen Tchetgen, E. J. (2018a). Identifying causal effects with proxy variables of an unmeasured confounder. Biometrika 105, 987–993.
Miao et al. (2018b) Miao, W., Shi, X. & Tchetgen Tchetgen, E. (2018b). A confounding bridge approach for double negative control inference on causal effects. arXiv e-prints , arXiv–1808.
Newey & Powell (2003) Newey, W. K. & Powell, J. L. (2003). Instrumental variable estimation of nonparametric models. Econometrica 71, 1565–1578.
Pearce (2018) Pearce, N. (2018). Bias in matched case–control studies: Dags are not enough. European journal of epidemiology 33, 1–4.
Prentice (1986) Prentice, R. L. (1986). A case-cohort design for epidemiologic cohort studies and disease prevention trials. Biometrika 73, 1–11.
Rotnitzky et al. (2021) Rotnitzky, A., Smucler, E. & Robins, J. M. (2021). Characterization of parameters with a mixed bias property. Biometrika 108, 231–238.
Schnitzer (2022) Schnitzer, M. E. (2022). Estimands and estimation of covid-19 vaccine effectiveness under the test-negative design: Connections to causal inference. Epidemiology (Cambridge, Mass.) 33, 325.
Shi et al. (2020a) Shi, X., Miao, W., Nelson, J. C. & Tchetgen Tchetgen, E. J. (2020a). Multiply robust causal inference with double-negative control adjustment for categorical unmeasured confounding. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 82, 521–540.
Shi et al. (2020b) Shi, X., Miao, W. & Tchetgen Tchetgen, E. (2020b). A selective review of negative control methods in epidemiology. Current Epidemiology Reports , 1–13.
Stephens et al. (2014) Stephens, A., Tchetgen, E. T. & De Gruttola, V. (2014). Locally efficient estimation of marginal treatment effects when outcomes are correlated: is the prize worth the chase? The international journal of biostatistics 10, 59–75.
Streeter et al. (2017) Streeter, A. J., Lin, N. X., Crathorne, L., Haasova, M., Hyde, C., Melzer, D. & Henley, W. E. (2017). Adjusting for unmeasured confounding in nonrandomized longitudinal studies: a methodological review. Journal of clinical epidemiology 87, 23–34.
Sullivan et al. (2016) Sullivan, S. G., Tchetgen Tchetgen, E. J. & Cowling, B. J. (2016). Theoretical basis of the test-negative study design for assessment of influenza vaccine effectiveness. American journal of epidemiology 184, 345–353.
Tchetgen Tchetgen (2014) Tchetgen Tchetgen, E. (2014). The control outcome calibration approach for causal inference with unobserved confounding. American journal of epidemiology 179, 633–640.
Tchetgen Tchetgen et al. (2020) Tchetgen Tchetgen, E. J., Ying, A., Cui, Y., Shi, X. & Miao, W. (2020). An introduction to proximal causal learning. arXiv preprint arXiv:2009.10982 .
Van der Vaart (2000) Van der Vaart, A. (2000). Asymptotic statistics, vol. 3. Cambridge university press.

Supplementary Material to “Proximal learning of odds ratio under confounded outcome-dependent sampling designs”

S1 Proofs of Theorems.

S1.1 Proof of Theorem 2.1

To prove Equation (2), we have

	$\displaystyle E[q(A,Z,X)\mid A,W,X,Y=0,S=1]$
$\displaystyle=$	$\displaystyle E\left\{E[q(A,Z,X)\mid A,U,W,X,Y=0,S=1]\mid A,W,X,Y=0,S=1\right\}$
$\displaystyle\stackrel{{\scriptstyle A.\ref{assump:nc}}}{{=}}$	$\displaystyle E\left\{E[q(A,Z,X)\mid A,U,X,Y=0,S=1]\mid A,W,X,Y=0,S=1\right\}$
$\displaystyle\stackrel{{\scriptstyle A.\ref{assump:nc}}}{{=}}$	$\displaystyle E\left\{E[q(A,Z,X)\mid A,U,X]\mid A,W,X,Y=0,S=1\right\}$
$\displaystyle\stackrel{{\scriptstyle A.\ref{assump:trt-bridge}}}{{=}}$	$\displaystyle E\left\{\dfrac{1}{P(A\mid U,X,Y=0,S=1)}\mid A,W,X,Y=0,S=1\right\}$
$\displaystyle=$	$\displaystyle\int\dfrac{1}{P(A\mid U=u,X,Y=0,S=1)}f(u\mid A,W,X,Y=0,S=1)du$
$\displaystyle=$	$\displaystyle\int\dfrac{1}{P(A\mid U=u,X,Y=0,S=1)}\times$	(21)
	$\displaystyle\qquad\dfrac{f(u\mid W,X,Y=0,S=1)P(A\mid U=u,W,X,Y=0,S=1)}{P(A\mid W,X,Y=0,S=1)}du$
$\displaystyle\stackrel{{\scriptstyle A.\ref{assump:nc}}}{{=}}$	$\displaystyle\int\dfrac{1}{P(A\mid U=u,X,Y=0,S=1)}\times$	(22)
	$\displaystyle\qquad\dfrac{f(u\mid W,X,Y=0,S=1)P(A\mid U=u,X,Y=0,S=1)}{P(A\mid W,X,Y=0,S=1)}du$
$\displaystyle=$	$\displaystyle\dfrac{1}{P(A\mid W,X,Y=0,S=1)}\int f(u\mid W,X,Y=0,S=1)du$
$\displaystyle=$	$\displaystyle\dfrac{1}{P(A\mid W,X,Y=0,S=1)}.$

To prove the uniqueness of a solution to Equation (2) under Assumption 2.2(b), we suppose there exists another function $q^{*}$ that solves Equation (2), that is,

E\{q^{*}(A,Z,X)\mid A,W,X,Y=0,S=1\}=1/P(A\mid W,X,Y=0,S=1).

Then

E\{q^{*}(A,Z,X)-q(A,Z,X)\mid A,W,X,Y=0,S=1\}=0.

By Assumption 2.2(b), $q^{*}(A,Z,X)=q(A,Z,X)$ almost surely.

Suppose there exists a function $q_{0}$ that solves Equation (1), then $q_{0}$ is also the solution to (2) by the proof above. By the uniqueness of a solution to Equation (2), $q_{0}(A,Z,X)=q(A,Z,X)$ almost surely, which proves that $q$ is the unique solution to Equation (1).

S1.2 Proof of Theorem 2.2

We first state a useful result for identification:

Lemma S1.

Under Model 2.1 and a selection mechanism that satisfies Assumption 2.1, the log odds ratio satisfies

\beta_{0}=\log\left(\theta_{1c}/\theta_{0c}\right),

(23)

where, for $a=0,1$ ,

\theta_{ac}=E\left\{c(X)\dfrac{P(A=a\mid U,X,Y=1,S=1)}{P(A=a\mid U,X,Y=0,S=1)}P(Y=1\mid U,X,S=1)\mid S=1\right\}.

(24)

Proof S2.

Under Model (2.1) and Assumption 2.1, we have

		$\displaystyle\dfrac{P(Y=1\mid U,X,A=1,S=1)P(Y=0\mid U,X,A=0,S=1)}{P(Y=1\mid U,X,A=0,S=1)P(Y=0\mid U,X,A=1,S=1)}$
	$\displaystyle=$	$\displaystyle\exp(\beta_{0})\times\dfrac{P(S=1\mid Y=1,A=1,U,X)}{P(S=1\mid Y=0,A=1,U,X)}\times\dfrac{P(S=1\mid Y=0,A=0,U,X)}{P(S=1\mid Y=1,A=0,U,X)}$
	$\displaystyle\stackrel{{\scriptstyle A.\ref{assump:selection}}}{{=}}$	$\displaystyle\exp(\beta_{0})\times\xi(U,X)\times\xi(U,X)^{-1}=\exp(\beta_{0}).$

Therefore

	$\displaystyle\theta_{1c}$	$\displaystyle=E\left[c(X)\dfrac{P(A=1\mid U,X,Y=1,S=1)}{P(A=1\mid U,X,Y=0,S=1)}P(Y=1\mid U,X,S=1)\,\mid\,S=1\right]$
		$\displaystyle=E\left[c(X)\dfrac{P(A=0\mid U,X,Y=1,S=1)}{P(A=0\mid U,X,Y=0,S=1)}P(Y=1\mid U,X,S=1)\,\mid\,S=1\right]\exp(\beta_{0})$
		$\displaystyle=\theta_{0c}\exp(\beta_{0}),$

which proves Lemma S1.

By Lemma S1, it suffices to prove that $\theta_{ac}=E[c(X)I(A=a)Yq(A,Z,X)\mid S=1]$ . Under Assumptions (2.2) and (2.2), $q(A,Z,X)$ solves Equation (1). We have

	$\displaystyle E[c(X)I(A=a)Yq(a,Z,X)\mid S=1]$
$\displaystyle=$	$\displaystyle E[c(X)I(A=a)YE\left(q(a,Z,X)\mid A=a,U,X,Y=1,S=1\right)\mid S=1]$
$\displaystyle\stackrel{{\scriptstyle A.\ref{assump:nc}}}{{=}}$	$\displaystyle E[c(X)I(A=a)YE\left(q(a,Z,X)\mid A=a,U,X,Y=0,S=1\right)Y\mid S=1]$
$\displaystyle\stackrel{{\scriptstyle A.\ref{assump:trt-bridge}}}{{=}}$	$\displaystyle E[c(X)I(A=a)Y\dfrac{1}{P(A=a\mid U,X,Y=0,S=1)}\mid S=1]$
$\displaystyle=$	$\displaystyle E[c(X)E\{I(A=a)\mid U,X,Y=1,S=1\}\dfrac{1}{P(A=a\mid U,X,Y=0,S=1)}Y\mid S=1]$
$\displaystyle=$	$\displaystyle E[c(X)\dfrac{P(A=a\mid U,X,Y=1,S=1)}{P(A=a\mid U,X,Y=0,S=1)}Y\mid S=1]$
$\displaystyle=$	$\displaystyle E[c(X)\dfrac{P(A=a\mid U,X,Y=1,S=1)}{P(A=a\mid U,X,Y=0,S=1)}P(Y=1\mid U,X,S=1)\mid S=1]$
$\displaystyle=$	$\displaystyle\theta_{ac}.$	(25)

S1.3 Proof of Theorem 2.3

First, We show that Equation (5) is equivalent to the moment equation

E\{(1-Y)h(a,W,X)\mid A=a,U,X,S=1\}=E(Y\mid A=a,U,X,S=1).

(26)

This is because

		$\displaystyle E\{(1-Y)h(a,W,X)\mid A=a,U,X,S=1\}$
	$\displaystyle=$	$\displaystyle E\{h(a,W,X)\mid A=a,U,X,Y=0,S=1\}P(Y=0\mid A=a,U,X,S=1)$
	$\displaystyle\stackrel{{\scriptstyle A.\ref{assump:outcome-bridge}}}{{=}}$	$\displaystyle\dfrac{P(Y=1\mid A=a,U,X,S=1)}{P(Y=0\mid A=a,U,X,S=1)}P(Y=0\mid A=a,U,X,S=1)$
	$\displaystyle=$	$\displaystyle P(Y=1\mid A=a,U,X,S=1)$
	$\displaystyle=$	$\displaystyle E(Y\mid A=a,U,X,S=1).$

Next, we show that Assumption 2.2 and Equation (26) implies the moment equation

E\{(1-Y)h(a,W,X)\mid A=a,Z,X,S=1\}=E(Y\mid A=a,Z,X,S=1).

(27)

This is because

		$\displaystyle E\{(1-Y)h(A,W,X)-Y\mid A,Z,X,S=1\}$
	$\displaystyle=$	$\displaystyle E[E\{(1-Y)h(A,W,X)-Y\mid A,U,Z,X,S=1\}\mid A,Z,X,S=1]$
	$\displaystyle\stackrel{{\scriptstyle A.\ref{assump:nc}}}{{=}}$	$\displaystyle E[E\{(1-Y)h(A,W,X)-Y\mid A,U,X,S=1\}\mid A,Z,X,S=1]$
	$\displaystyle\stackrel{{\scriptstyle Eq\eqref{eq:outcome-bridge-u}}}{{=}}$	$\displaystyle 0.$

Finally, the left-hand side of Equation (27) equals

E\{h(a,W,X)\mid A=a,Z,X,Y=0,S=1\}P(Y=0\mid A=a,Z,X,S=1).

Rearranging the terms, we obtain Equation (6).

To prove the uniqueness of a solution to Equation (6) under Assumption 2.3(b), we suppose there exists another function $h^{*}$ that solves Equation (6), that is,

E\{h^{*}(a,W,X)\mid A=a,Z,X,Y=0,S=1\}=\frac{P(Y=1\mid A=a,Z,X,S=1)}{P(Y=0\mid A=a,Z,X,S=1)}

Then

E\{h^{*}(a,W,X)-h(a,W,X)\mid A=a,Z,X,Y=0,S=1\}=0.

By Assumption 2.3(b), $h^{*}(A,Z,X)=h(A,Z,X)$ almost surely.

Suppose there exists a function $h_{0}$ that solves Equation (5), then $h_{0}$ is also the solution to (6) by the proof above. By the uniqueness of a solution to Equation (6), $h_{0}(A,Z,X)=h(A,Z,X)$ almost surely, which proves that $h$ is the unique solution to Equation (5).

S1.4 Proof of Theorem 2.4

Under Model 2.1 and Assumptions 2.1 and 2.2, the result in Lemma S1 holds. It suffices to prove that

\theta_{ac}=E\{c(X)(1-Y)h(a,W,X)\mid S=1\}

(28)

for $a=0,1$ .

For $a=0,1$ , we have that

		$\displaystyle E[c(X)(1-Y)h(a,W,X)\mid S=1]$
	$\displaystyle=$	$\displaystyle E[c(X)(1-Y)E\{h(a,W,X)\mid U,X,Y=0,S=1\}\mid S=1]$
	$\displaystyle\stackrel{{\scriptstyle A.\ref{assump:nc}}}{{=}}$	$\displaystyle E[c(X)(1-Y)E\{h(a,W,X)\mid A=a,U,X,Y=0,S=1\}\mid S=1]$
	$\displaystyle\stackrel{{\scriptstyle A.\ref{assump:outcome-bridge}}}{{=}}$	$\displaystyle E[c(X)(1-Y)\dfrac{P(Y=1\mid A=a,U,X,S=1)}{P(Y=0\mid A=a,U,X,S=1)}\mid S=1]$
	$\displaystyle=$	$\displaystyle E[c(X)E(1-Y\mid U,X,S=1)\dfrac{P(Y=1\mid A=a,U,X,S=1)}{P(Y=0\mid A=a,U,X,S=1)}\mid S=1]$
	$\displaystyle=$	$\displaystyle E[c(X)P(Y=0\mid U,X,S=1)\dfrac{P(Y=1\mid A=a,U,X,S=1)}{P(Y=0\mid A=a,U,X,S=1)}\mid S=1]$
	$\displaystyle=$	$\displaystyle E[c(X)\dfrac{P(A=a\mid U,X,Y=1,S=1)}{P(A=a\mid U,X,Y=0,S=1)}P(Y=1\mid U,X,S=1)\mid S=1]$
	$\displaystyle=$	$\displaystyle\theta_{ac}.$

S1.5 Proof of Theorem 2.5

To prove

	$\displaystyle\beta_{0}$	$\displaystyle=\log\bigg{(}E\bigg{\{}c(X)\bigg{[}I(A=1)q(A,Z,X)\left\{Y-(1-Y)h(A,W,X)\right\}+$
		$\displaystyle\qquad(1-Y)h(1,W,X)\bigg{]}\mid S=1\bigg{\}}/E\bigg{\{}c(X)\bigg{[}I(A=1)q(A,Z,X)\{Y-$
		$\displaystyle\qquad(1-Y)h(A,W,X)\}+(1-Y)h(0,W,X)\bigg{]}\mid S=1\bigg{\}}\bigg{)},$

by Lemma S1 it suffices to prove

	$\displaystyle\theta_{ac}$	$\displaystyle=E\big{\{}c(X)\big{[}(1-Y)h(a,W,X)-$
		$\displaystyle\qquad I(A=a)q(a,Z,X)\big{\{}(1-Y)h(A,W,X)-Y\big{\}}\big{]}\mid S=1\big{\}}.$		(29)

for $a=0,1$ .

(a)

If Assumptions 2.2 and Equation (1) hold, then $q$ solves Equation (2) and the conclusion in Theorem 2.2 holds.

The right-hand side of Equation (29) equals

		$\displaystyle E\bigg{\{}c(X)\big{[}(1-Y)h(a,W,X)-I(A=a)(1-Y)q(a,Z,X)h(A,W,X)+$
		$\displaystyle\qquad I(A=a)q(a,Z,X)Y\big{]}\mid S=1\bigg{\}}$
	$\displaystyle\stackrel{{\scriptstyle Eq\eqref{eq:pf-tac-PIPW}}}{{=}}$	$\displaystyle E\bigg{\{}c(X)(1-Y)h(a,W,X)\big{[}1-I(A=a)q(a,Z,X)\big{]}\mid S=1\bigg{\}}+\theta_{ac}$
	$\displaystyle=$	$\displaystyle E\bigg{\{}c(X)(1-Y)h(a,W,X)E\big{[}1-$
		$\displaystyle\qquad I(A=a)q(a,Z,X)\mid A,W,X,Y=0,S=1\big{]}\mid S=1\bigg{\}}+\theta_{ac}$
	$\displaystyle\stackrel{{\scriptstyle Eq\eqref{eq:trt-bridge}}}{{=}}$	$\displaystyle E\bigg{\{}c(X)(1-Y)h(a,W,X)\big{[}1-\dfrac{I(A=a)}{P(A=a\mid W,X,Y=0,S=1)}\big{]}\mid S=1\bigg{\}}+\theta_{ac}$
	$\displaystyle=$	$\displaystyle E\bigg{\{}c(X)(1-Y)\big{[}h(a,W,X)-\dfrac{I(A=a)h(a,W,X)}{P(A=a\mid W,X,Y=0,S=1)}\big{]}\mid S=1\bigg{\}}+\theta_{ac}$
	$\displaystyle=$	$\displaystyle E\bigg{\{}c(X)(1-Y)E\big{[}h(a,W,X)-$
		$\displaystyle\qquad\dfrac{I(A=a)h(a,W,X)}{P(A=a\mid W,X,Y=0,S=1)}\mid W,X,Y=0,S=1\big{]}\mid S=1\bigg{\}}+\theta_{ac}$
	$\displaystyle=$	$\displaystyle E\bigg{\{}c(X)(1-Y)\big{[}h(a,W,X)-h(a,W,X)\big{]}\mid S=1\bigg{\}}+\theta_{ac}$
	$\displaystyle=$	$\displaystyle\theta_{ac}.$

(b)

If Assumptions 2.2 and Equation (5) hold, then $h$ solves Equation (27) and the conclusion in Theorem 2.4 holds. The right-hand side of Equation (29) equals

		$\displaystyle E\bigg{\{}c(X)\big{[}(1-Y)h(a,W,X)-I(A=a)(1-Y)q(a,Z,X)h(a,W,X)+$
		$\displaystyle\qquad I(A=a)q(a,Z,X)Y\big{]}\mid S=1\bigg{\}}$
	$\displaystyle=$	$\displaystyle E\bigg{\{}c(X)(1-Y)h(a,W,X)\mid S=1\bigg{\}}-$
		$\displaystyle\qquad E\bigg{\{}c(X)I(A=a)q(a,Z,X)[(1-Y)h(A,W,X)-Y]\mid S=1\bigg{\}}$
	$\displaystyle\stackrel{{\scriptstyle Eq\eqref{eq:pf-POR}}}{{=}}$	$\displaystyle\theta_{ac}-E\bigg{\{}c(X)I(A=a)q(a,Z,X)\times$
		$\displaystyle\qquad E[(1-Y)h(A,W,X)-Y\mid A=a,Z,X,S=1]\mid$
		$\displaystyle\qquad S=1\bigg{\}}$
	$\displaystyle\stackrel{{\scriptstyle Eq\eqref{eq:outcome-bridge}}}{{=}}$	$\displaystyle\theta_{ac}-0$
	$\displaystyle=$	$\displaystyle\theta_{ac}.$

S1.6 Proof of Theorem 2.6

To prove Equation (10), we have

		$\displaystyle E\{(1-Y)[\kappa_{1}(A,W,X)q(A,Z,X)-\kappa_{1}(1,W,X)-\kappa_{1}(0,W,X)]\mid S=1\}$
	$\displaystyle=$	$\displaystyle E\{(1-Y)[\kappa_{1}(A,W,X)E[q(A,Z,X)\mid A,W,X,Y=0,S=1]-\kappa_{1}(1,W,X)-$
		$\displaystyle\qquad\kappa_{1}(0,W,X)]\mid S=1\}$
	$\displaystyle\stackrel{{\scriptstyle Eq\eqref{eq:trt-bridge}}}{{=}}$	$\displaystyle E\{(1-Y)[\dfrac{\kappa_{1}(A,W,X)}{P(A\mid W,X,Y=0,S=1)}-\kappa_{1}(1,W,X)-\kappa_{1}(0,W,X)]\mid S=1\}$
	$\displaystyle=$	$\displaystyle E\{(1-Y)[E\left(\dfrac{\kappa_{1}(A,W,X)}{P(A\mid W,X,Y=0,S=1)}\mid W,X,Y=0,S=1\right)-\kappa_{1}(1,W,X)-$
		$\displaystyle\qquad\kappa_{1}(0,W,X)]\mid S=1\}$
	$\displaystyle=$	$\displaystyle E\{(1-Y)[\kappa_{1}(1,W,X)+\kappa_{1}(0,W,X)-\kappa_{1}(1,W,X)-\kappa_{1}(0,W,X)]\mid S=1\}$
	$\displaystyle=$	$\displaystyle 0.$

Equation (11) directly follows Equation (27).

S1.7 Proof of Theorem 2.9

Under Model 2.1 and Assumptions 2.1 and 2.2, Lemma S1 holds.

(a)

If $q(A,Z,X;\tau)$ is correctly specified and Assumption 2.2(b) holds, then Equation (10) has a unique solution that is $\tau_{0}$ such that $q(A,Z,X;\tau_{0})=q(A,Z,X)$ , and $q(A,Z,X;\hat{\tau})$ is consistent for $q(A,Z,X)$ . By Law of Large Numbers and Continuous Mapping Theorem, we have

	$\displaystyle\hat{\beta}_{c,\text{PIPW}}$	$\displaystyle\overset{p}{\rightarrow}\log\left(\dfrac{E\left\{c(X)I(A=1)Yq(A,Z,X)\mid S=1\right\}}{E\left\{c(X)I(A=0)Yq(A,Z,X)\mid S=1\right\}}\right)$
		$\displaystyle\stackrel{{\scriptstyle Eq\eqref{eq:pf-tac-PIPW}}}{{=}}\log(\theta_{1c}/\theta_{0c})$
		$\displaystyle=\beta_{0}.$

(b)

If $h(A,W,X;\psi)$ is correctly specified and Assumption 2.3(b) holds, then Equation (11) has a unique solution that is $\psi_{0}$ such that $h(A,W,X;\psi_{0})=h(A,W,X)$ , and $h(A,W,X;\hat{\psi})$ is consistent for $h(A,W,X)$ . By Law of Large Numbers and Continuous Mapping Theorem, we have

	$\displaystyle\hat{\beta}_{c,\text{POR}}$	$\displaystyle\overset{p}{\rightarrow}log\left(\dfrac{E\left\{c(X)(1-Y)h(1,W,X)\mid S=1\right\}}{E\left\{c(X)(1-Y)h(0,W,X)\mid S=1\right\}}\right)$
		$\displaystyle\stackrel{{\scriptstyle Eq\eqref{eq:pf-POR}}}{{=}}\log(\theta_{1c}/\theta_{0c})$
		$\displaystyle=\beta_{0}.$

(c)

Write $q^{*}(A,Z,X)$ and $h^{*}(A,W,X)$ as the probability limit of $q(A,Z,X;\hat{\tau})$ and $h(A,W,X;\hat{\psi})$ , respectively. The stated result follows Law of Large Numbers, Continuous Mapping Theorem, and the proof of Theorem 2.5.

S2 Approximate equivalence of RR and OR for a rare outcome

Under Model 2.1, the conditional odds ratio is

\text{OR}(U,X)=\exp(\beta_{0})=\dfrac{P(Y=1\mid A=1,U,X)P(Y=0\mid A=0,U,X)}{P(Y=1\mid A=0,U,X)P(Y=1\mid A=1,U,X)}.

Consider the conditional risk ratio

\text{RR}(U,X):=\dfrac{P(Y=1\mid A=1,U,X)}{P(Y=1\mid A=0,U,X)}=\text{OR}(U,X)\times\dfrac{P(Y=0\mid A=1,U,X)}{P(Y=0\mid A=0,U,X)}.

Under a rare infection assumption that

0\leq P(Y=1\mid A=a,U,X)\leq\delta

almost surely for $a=0,1$ and a small $\delta>0$ , OR and RR satisfies

1-\delta\leq\dfrac{\text{RR}(U,X)}{\text{OR}(U,X)}=\dfrac{P(Y=0\mid A=1,U,X)}{P(Y=0\mid A=0,U,X)}\leq\dfrac{1}{1-\delta}.

(30)

If $\delta\approx 0$ , then $\text{OR}\approx\text{RR}$ .

Furthermore, denote $Y(a)$ as the potential outcome had the individual received the treatment $A=a$ . We make the following standard identifiability assumptions (Hernán MA, 2020):

(a)

(Consistency) $Y(a)=Y$ if $A=a$ ;
(b)

(Confounded ignorability) $Y(a)\perp\!\!\!\perp A\mid U,X$ for $a=0,1$ ;
(c)

(Positivity) $0<P(A=1\mid U,X)<1$ almost surely.

Under these assumptions, the marginal causal risk ratio is

	cRR	$\displaystyle:=\dfrac{P\{Y(1)=1\}}{P\{Y(0)=1\}}$
		$\displaystyle=\dfrac{\int\int P\{Y(1)=1\mid U=u,X=x\}f(u,x)dudx}{\int\int P\{Y(0)=1\mid U=u,X=x\}f(u,x)dudx}$
		$\displaystyle\stackrel{{\scriptstyle(b)}}{{=}}\dfrac{\int\int P\{Y(1)=1\mid A=1,U=u,X=x\}f(u,x)dudx}{\int\int P\{Y(0)=1\mid A=0,U=u,X=x\}f(u,x)dudx}$
		$\displaystyle\stackrel{{\scriptstyle(a)}}{{=}}\dfrac{\int\int P\{Y=1\mid A=1,U=u,X=x\}f(u,x)dudx}{\int\int P\{Y=1\mid A=0,U=u,X=x\}f(u,x)dudx}$
		$\displaystyle=\dfrac{\int\int\text{RR}(U,X)P\{Y=1\mid A=0,U=u,X=x\}f(u,x)dudx}{\int\int P\{Y=1\mid A=0,U=u,X=x\}f(u,x)dudx}$
		$\displaystyle\stackrel{{\scriptstyle Eq\eqref{eq:or-rr}}}{{\leq}}\dfrac{1}{1-\delta}\exp(\beta_{0})$

and

\text{cRR}\geq(1-\delta)\exp(\beta_{0}).

If $\delta\approx 0$ , then $\text{cRR}\approx\exp(\beta_{0})$ .

S3 Identification of marginal causal risk ratio in a test-negative design with diseased controls

Similar to Schnitzer (2022), we denote $D$ as a categorical variable indicating a subject’s outcome, where $D=2$ indicates $Y=1$ , i.e. the subject is positive for the infection of interest; $D=1$ indicates that the subject has $Y=0$ but is positive for another outcome, referred to as the “control infection”, that is known a priori to not be affected by $A$ ; and $D=0$ indicates that neither outcome is positive. To formalize the assumption of no treatment effect on the control disease, we assume

P(D=1\mid A=1,U,X)=P(D=1\mid A=0,U,X).

(31)

As sample selection is only restricted to subjects with infection, we have

P(D=0\mid S=1,U,X,A)=0.

(32)

Rather than Assumption 2.1, we make the stronger assumption that

A\perp\!\!\!\perp S\mid D,U,X\quad\text{and}\quad A\perp\!\!\!\perp S\mid Y,U,X.

(33)

That is, a subject’s treatment status has no impact on the sample selection. This assumption holds in a test-negative study if, given a subject’s disease status and other traits included in $(U,X)$ , their decision to seek care does not depend on their treatment status.

Under Model 2.1 and above assumptions, we have

	$\displaystyle\exp(\beta_{0})$	$\displaystyle=\dfrac{P(A=1\mid Y=1,U,X)P(A=0\mid Y=0,U,X)}{P(A=1\mid Y=0,U,X)P(A=0\mid Y=1,U,X)}$
		$\displaystyle\stackrel{{\scriptstyle Eq\eqref{eq:sel-s2}}}{{=}}\dfrac{P(A=1\mid Y=1,U,X,S=1)P(A=0\mid Y=0,U,X,S=1)}{P(A=1\mid Y=0,U,X,S=1)P(A=0\mid Y=1,U,X,S=1)}$
		$\displaystyle=\dfrac{P(A=1\mid D=2,U,X,S=1)P(A=0\mid D\neq 2,U,X,S=1)}{P(A=1\mid D\neq 2,U,X,S=1)P(A=0\mid D=2,U,X,S=1)}$
		$\displaystyle\stackrel{{\scriptstyle Eq\eqref{eq:sel-s1}}}{{=}}\dfrac{P(A=1\mid D=2,U,X,S=1)P(A=0\mid D=1,U,X,S=1)}{P(A=1\mid D=1,U,X,S=1)P(A=0\mid D=2,U,X,S=1)}$
		$\displaystyle\stackrel{{\scriptstyle Eq\eqref{eq:sel-s2}}}{{=}}\dfrac{P(A=1\mid D=2,U,X)P(A=0\mid D=1,U,X)}{P(A=1\mid D=1,U,X)P(A=0\mid D=2,U,X)}$
		$\displaystyle=\dfrac{P(D=2\mid A=1,U,X)P(D=1\mid A=0,U,X)}{P(D=1\mid A=1,U,X)P(D=2\mid A=0,U,X)}$
		$\displaystyle\stackrel{{\scriptstyle Eq\eqref{eq:nco}}}{{=}}\dfrac{P(D=2\mid A=1,U,X,)}{P(D=2\mid A=0,U,X)}$
		$\displaystyle=\dfrac{P(Y=1\mid A=1,U,X)}{P(Y=1\mid A=0,U,X)}$
		$\displaystyle\equiv\text{RR}(U,X)$

Under the standard identifiability assumptions in Section S2, the marginal causal risk ratio is

	cRR	$\displaystyle=\dfrac{\int\int\text{RR}(U,X)P\{Y=1\mid A=0,U=u,X=x\}f(u,x)dudx}{\int\int P\{Y=1\mid A=0,U=u,X=x\}f(u,x)dudx}$
		$\displaystyle=\exp(\beta_{0}).$

In the above discussion, we do not invoke the rare outcome assumption.

S4 Regularity conditions for the existence of a solution to Equations (1) and (26)

The results in this section directly adapted from Appendix B of Cui et al. (2020) and Section B of the Supplementary Materials of Li et al. (2022).

Let $L^{2}\{F(t)\}$ denote the Hilbert space of all square-integrable functions of $t$ with respect to distribution function $F(t)$ , equiped with inner product $\langle g_{1},g_{2}\rangle=\int g_{1}(t)g_{2}(t)dF(t)$ . Let $T_{a,u,x}:L^{2}\{F(z|a,x,Y=0,S=1)\}\rightarrow L^{2}\{F(u|a,x,Y=0,S=1)\}$ denote the conditional expectation operator $T_{a,u,x}h=E[h(a,Z,x)|A=a,U=u,X=x,Y=0,S=1]$ and let $(\lambda_{a,u,x,n},\varphi_{a,u,x,n},\phi_{a,u,x,n})$ denote a singular value decomposition of $T_{a,u,x}$ . Similarly, let $T^{\prime}_{a,u,x}:L^{2}\{F(w|a,x,Y=0,S=1)\}\rightarrow L^{2}\{F(u|a,x,Y=0,S=1)\}$ denote the conditional expectation operator $T^{\prime}_{a,u,x}g=E[g(a,W,x)|A=a,U=u,X=x,Y=0,S=1]$ and let $(\lambda^{\prime}_{a,u,x,n},\varphi^{\prime}_{a,u,x,n},\phi^{\prime}_{a,u,x,n})$ denote a singular value decomposition of $T^{\prime}_{a,u,x}$ . Consider the following regularity conditions:

(1)

$\int\int f(z|a,u,x,Y=0,S=1)f(u|a,z,x,Y=0,S=1)dzdu<\infty$ ;
(2)

$\int P^{-2}(A=a|U=u,X=x,Y=0,S=1)f(u|a,x,Y=0,S=1)du<\infty$ ;
(3)

$\sum_{n=1}^{\infty}\lambda_{a,u,x,n}^{-2}|\langle P^{-1}(A=a|U=u,X=x,Y=0,S=1),\phi_{a,u,x,n}\rangle|^{2}<\infty$ ;
(4)

$\sum_{n=1}^{\infty}(\lambda_{a,u,x,n}^{\prime})^{-2}|\langle P^{-1}(A=a|U=u,X=x,Y=0,S=1),\phi_{a,u,x,n}^{\prime}\rangle|^{2}<\infty$ .

The solution to Equation (1) exists if Assumption 2.2(a) and regularity conditions (1), (2) and (3) hold. The solution to Equation (5) exists if Assumption 2.3(a) and regularity conditions (1), (2) and (4) hold.

S5 Summary of our assumptions and results and those in Li et al. (2022)

Table S1: Summary of models, assumptions and identification results, with comparison to those in Li et al. (2022) for identifying conditional risk ratio

		Li et al. (2022) Sections 2.2-2.4	PIPW identification
			(Li et al. (2022) Section 2.5)
Model		$\log\left(\dfrac{P(Y=1\mid A=1,U,X)}{P(Y=1\mid A=0,U,X)}\right)=\beta_{0}$	Model 2.1
Interpretation of $\beta_{0}$		conditional risk ratio	conditional odds ratio
Selection mechanism		$A\perp\!\!\!\perp S\mid Y,U,X$	Assumption 2.1
Proximal independence		$Z\perp\!\!\!\perp(Y,S)\|A,U,X$	Assumption 2.2
		$W\perp\!\!\!\perp A\mid U,X$
		$W\perp\!\!\!\perp Z\mid A,U,X,Y$
		$S\perp\!\!\!\perp Z\mid A,U,X,W,Y$
$q$ definition		$\begin{array}[]{c}E\{q(a,Z,X\mid A=a,U,X)\}\\ \quad=1/P(A=a\mid U,X)\end{array}$	Equation (1)
$h$ definition		/	/
Completeness		Assumption 2.2(b)	Assumption 2.2(b)
Identification	$q$	Equation (2)	Equation (2)
	$h$	/	/
	$\beta_{0}$	Equation (4)	Equation (4)
Note		(1) Identification of $q$ and $\beta_{0}$ requires a	Under standard identifiability
		rare outcome condition;	assumptions in Section S2, if either
		(2) Under standard identifiability	the outcome is rare or under the
		assumptions in Section S2, $\beta_{0}$	setting of Section S3, $\beta_{0}$ identifies the
		identifies the marginal causal risk ratio.	marginal causal risk ratio.
		(3) The proximal independence conditions
		can be replaced by Assumption 2.2
		POR identification	PDR closed-form expression
Model		Model 2.1	Model 2.1
Interpretation of $\beta_{0}$		conditional odds ratio	conditional odds ratio
Selection mechanism		Assumption 2.1	Assumption 2.1
Proximal independence		Assumption 2.2	Assumption 2.2
$q$ definition		/	Equation (1)
$h$ definition		Equation (5)	Equation (5)
Completeness		Assumption 2.3(b)	Assumptions 2.2(b), 2.3(b)
Identification	$q$	/	Equation (2) + Assumption 2.2(b)
	$h$	Equation (6)	Equation (6) + Assumption 2.3(b)
	$\beta_{0}$	Equation (8)	Equation (9)
Note		Under standard identifiability a	(1) Under standard identifiability
		assumptions in Section S2, if either	assumptions in Section S2, if either
		the outcome is rare or under the	the outcome is rare or under the
		setting of Section S3, $\beta_{0}$ identifies the $\beta_{0}$	setting of Section S3, $\beta_{0}$ identifies the
		marginal causal risk ratio.	marginal causal risk ratio.
			(2) Equation (8) holds if either $q$ is a
			solution to Equation (1) or $h$ is a
			solution to Equation (5).

S6 Detailed setting and derivation of $q(A,Z,X)$ and $h(A,W,X)$ in Example 2.8

If $Z,W$ are both multivariate Gaussian random variables satisfying

	$\displaystyle Z\mid A,U,X\sim N(\mu_{0Z}+\mu_{AZ}A+\mu_{UZ}^{T}U+\mu_{XZ}^{T}X,\Sigma_{Z}),$
	$\displaystyle W\mid A,U,X,Y\sim N(\mu_{0W}+\mu_{UW}^{T}U+\mu_{XW}^{T}X+\mu_{YW}Y,\Sigma_{W}).$

Assume that the treatment $A$ and outcome $Y$ both follow a logistic regression model given $(U,X)$ :

	$\displaystyle P(A=1\mid U,X)$	$\displaystyle=[1+\exp\{-(\mu_{0A}+\mu_{UA}^{T}U+\mu_{XA}^{T}X)\}]^{-1},$
	$\displaystyle P(Y=1\mid A,U,X)$	$\displaystyle=\exp(\mu_{0Y}+\beta_{0}A+\mu_{UY}^{T}U+\mu_{XY}^{T}X).$

Further assume the selection indicator $S$ follows a log-linear model of the form:

P(S=1\mid A,U,X,Y)=\exp(\mu_{AS}A+\mu_{YS}Y+\xi(U,X)).

It is straightforward to verify that the data distribution satisfies Model 2.1 and Assumptions 2.1-2.2. Then a suitable model for $h$ is

h(a,w,x;\psi)=\exp(\psi_{0}+\psi_{A}a+\psi_{W}^{T}w+\psi_{X}^{T}x).

(34)

While a closed-form parametric model for $q$ does not exist in this example, if the outcome $Y$ is rare in the target population in the sense that

P(Y=1\mid A=a,U=u,X=x)\approx 0\qquad\text{almost surely},

then the parametric model below may serve as an approximation

q(a,z,x;\tau)=1+\exp\{(1-2A)(\tau_{0}+\tau_{A}a+\tau_{Z}^{T}z+\tau_{X}^{T}x)\}.

(35)

We first derive the outcome confounding bridge function $h$ that satisfies Equation (5):

	$\displaystyle E\{h(a,W,X)\mid A=a,U,X,Y=0,S=1\}=$
	$\displaystyle\qquad P(Y=1\mid A=a,U,X,S=1)/P(Y=0\mid A=a,U,X,S=1).$

We have

	$\displaystyle P(Y=1\mid A=a,Z,X)P(S=1\mid A=a,Y=1,Z,X)=$
	$\displaystyle\qquad\dfrac{\exp(\mu_{0Y}+\beta_{0}a+\mu_{UY}U+\mu_{XY}^{T}X)}{1+\exp(\mu_{0Y}+\beta_{0}a+\mu_{UY}U+\mu_{XY}^{T}X)}\exp(\mu_{0S}+\mu_{AS}a+\mu_{YS}+\xi(U,X))$
	$\displaystyle P(Y=0\mid A=a,U,X)P(S=1\mid A=a,Y=0,U=1,U,X)=$
	$\displaystyle\qquad\dfrac{1}{1+\exp(\mu_{0Y}+\beta_{0}a+\mu_{UY}U+\mu_{XY}^{T}X)}\exp(\mu_{0S}+\mu_{AS}a+\xi(U,X)).$

Therefore

		$\displaystyle P(Y=1\mid A=a,U,X,S=1)$
	$\displaystyle=$	$\displaystyle P(Y=1\mid A=a,U,X)P(S=1\mid A=a,Y=1,U,X)/\{P(Y=1\mid A=a,U,X)\times$
		$\displaystyle\qquad P(S=1\mid A=a,Y=1,U,X)+P(Y=0\mid A=a,U,X)P(S=1\mid A=a,Y=0,U,X)\}$
	$\displaystyle=$	$\displaystyle[1+\exp\{-(\mu_{0Y}+\mu_{YS}+\beta_{0}a+\mu_{UY}U+\mu_{XY}^{T}X)\}]^{-1}.$

and

\displaystyle\dfrac{P(Y=1\mid A=a,U,X,S=1)}{P(Y=0\mid A=a,U,X,S=1)}=\exp(\mu_{0Y}+\mu_{YS}+\beta_{0}a+\mu_{UY}U+\mu_{XY}^{T}X).

We further have

W\mid A=a,U,X,Y=0,S=1\sim N(\mu_{0W}+\mu_{UW}U+\mu_{XW}^{T}X).

Now assume that $h(a,w,x)=\exp(\psi_{0}+\psi_{A}a+\psi_{W}w+\psi_{X}^{T}x)$ , then

		$\displaystyle E\{h(A,W,X)\mid A,U,X,Y=0,X=1\}$
	$\displaystyle=$	$\displaystyle\exp(\psi_{0}+\psi_{A}A+\psi_{X}^{T}X)E\{\exp(\psi_{W}W)\mid A,U,X,Y=0,S=1\}$
	$\displaystyle=$	$\displaystyle\exp(\psi_{0}+\psi_{A}A+\psi_{X}^{T}X)\exp(\psi_{W}^{T}(\mu_{0W}+\mu_{UW}^{T}U+\mu_{XW}^{T}X)+\frac{1}{2}\psi_{W}^{T}\Sigma_{W}\psi_{W})$
	$\displaystyle=$	$\displaystyle\exp\{(\psi_{0}+\psi_{W}^{T}\mu_{0W}+\frac{1}{2}\psi_{W}^{T}\Sigma_{W}\psi_{W})+\psi_{A}A+\psi_{W}^{T}\mu_{UW}^{T}U+(\psi_{X}^{T}+\psi_{W}^{T}\mu_{XW}^{T})X\}$

Hence $\psi$ satisfies

	$\displaystyle\psi_{A}=\beta_{0},\qquad\psi_{W}=\mu_{UW}^{+}\mu_{UY},\qquad\psi_{X}=\mu_{XY}-\mu_{XW}\psi_{W}=\mu_{XY}-\mu_{UY}\mu_{UW}^{+}\mu_{XW},$
	$\displaystyle\psi_{0}=\mu_{0Y}+\mu_{YS}-\psi_{W}^{T}\mu_{0W}-\dfrac{1}{2}\psi_{W}^{T}\Sigma_{W}\psi_{W}$
	$\displaystyle\qquad=\mu_{0Y}+\mu_{YS}-\mu_{UW}^{+}\mu_{0W}\mu_{UY}-\dfrac{1}{2}\mu_{UY}^{T}(\mu_{UW}^{+})^{T}\Sigma_{W}\mu_{UW}^{+}\mu_{UY}.$

Here $A^{+}$ denotes the generalised inverse of the matrix $A$ .

To derive $q(A,Z,X)$ , we have

		$\displaystyle P(A=1\mid U,X)P(Y=0\mid A=1,U,X)P(S=1\mid A=1,Y=0,U,X)$
	$\displaystyle=$	$\displaystyle\dfrac{\exp(\mu_{0A}+\mu_{UA}^{T}U+\mu_{XA}^{T}X)}{1+\exp(\mu_{0A}+\mu_{UA}^{T}U+\mu_{XA}^{T}X)}\dfrac{1}{1+\exp(\mu_{0Y}+\beta_{0}+\mu_{UY}^{T}U+\mu_{XY}^{T}X)}\times$
		$\displaystyle\qquad\exp\{\mu_{AS}+\xi(U,X)\},$
	and	$\displaystyle P(A=0\mid U,X)P(Y=0\mid A=0,U,X)P(S=1\mid A=0,Y=0,U,X)$
	$\displaystyle=$	$\displaystyle\dfrac{1}{1+\exp(\mu_{0A}+\mu_{UA}^{T}U+\mu_{XA}^{T}X)}\dfrac{1}{1+\exp(\mu_{0Y}+\mu_{UY}^{T}U+\mu_{XY}^{T}X)}\times$
		$\displaystyle\qquad\exp\{\xi(U,X)\}.$

Therefore

		$\displaystyle P(A=1\mid U,X,Y=0,S=1)$
	$\displaystyle=$	$\displaystyle P(A=1\mid U,X)P(Y=0\mid A=1,U,X)P(S=1\mid Y=0,S=1,U,X)/$
		$\displaystyle\qquad\{P(A=1\mid U,X)P(Y=0\mid A=1,U,X)P(S=1\mid Y=0,S=1,U,X)+$
		$\displaystyle\qquad P(A=0\mid U,X)P(Y=0\mid A=0,U,X)P(S=1\mid A=0,Y=0,U,X)\}$
	$\displaystyle=$	$\displaystyle\dfrac{\exp(\mu_{0A}+\mu_{AS}+\mu_{UA}^{T}U+\mu_{XA}^{T}X)}{\exp(\mu_{0A}+\mu_{AS}+\mu_{UA}^{T}U+\mu_{XA}^{T}X)+\dfrac{1+\exp(\mu_{0Y}+\beta_{0}+\mu_{UY}^{T}U+\mu_{XY}^{T}X)}{1+\exp(\mu_{0Y}+\mu_{UY}^{T}U+\mu_{XY}^{T}X)}}$

In the case where $P(Y=1\mid A,U,X)$ are small almost surely, both $\exp(\mu_{0Y}+\beta_{0}+\mu_{UY}^{T}U+\mu_{XY}^{T}X)$ and $\exp(\mu_{0Y}+\mu_{UY}^{T}U+\mu_{XY}^{T}X)$ are necessarily small, then

\dfrac{1+\exp(\mu_{0Y}+\beta_{0}+\mu_{UY}^{T}U+\mu_{XY}^{T}X)}{1+\exp(\mu_{0Y}+\mu_{UY}^{T}U+\mu_{XY}^{T}X)}\approx 1,

and therefore

P(A=1|U,X,Y=0,S=1)\approx[1+\exp\{-(\mu_{0A}+\mu_{AS}+\mu_{UA}^{T}U+\mu_{XA}^{T}X)\}]^{-1}.

We need find a function $q(a,z,x)$ that satisfies Equation (1):

\displaystyle E[q(1,Z,X)|A=1,U,X]=1+\exp\{-(\mu_{0A}+\mu_{AS}+\mu_{UA}^{T}U+\mu_{XA}^{T}X)\}

and

\displaystyle E[q(0,Z,X)|A=0,U,X]=1+\exp(\mu_{0A}+\mu_{AS}+\mu_{UA}^{T}U+\mu_{XA}^{T}X).

We consider $q(A,Z,X)=1+\exp\{(1-2A)(\tau_{0}+\tau_{A}A+\tau_{Z}^{T}Z+\tau_{X}^{T}X)\}$ , then

		$\displaystyle E[q(1,Z,X)\|A=1,U,X]$
	$\displaystyle=$	$\displaystyle 1+\exp(-\tau_{0}-\tau_{A}-\tau_{X}^{T}X)E[\exp(-\tau_{Z}^{T}Z)\|A=1,U,X]$
	$\displaystyle=$	$\displaystyle 1+\exp(-\tau_{0}-\tau_{A}-\tau_{X}^{T}X)\exp\{-\tau_{Z}^{T}(\mu_{0Z}+\mu_{AZ}+\mu_{UZ}^{T}U+\mu_{XZ}^{T}X)+\frac{1}{2}\tau_{Z}^{T}\Sigma_{Z}\tau_{Z}\}$
	$\displaystyle=$	$\displaystyle 1+\exp[-\{(\tau_{0}+\tau_{A}+\tau_{Z}^{T}\mu_{0Z}+\tau_{Z}^{T}\mu_{AZ}-\frac{1}{2}\tau_{Z}^{T}\Sigma_{Z}\tau_{Z})+\tau_{Z}^{T}\mu_{UZ}^{T}U+(\tau_{X}^{T}+\tau_{Z}^{T}\mu_{XZ}^{T})X\}]$

and

		$\displaystyle E[q(0,Z,X)\|A=0,U,X]$
	$\displaystyle=$	$\displaystyle 1+\exp(\tau_{0}+\tau_{X}^{T}X)E[\exp(\tau_{Z}^{T}Z)\mid A=0,U,X]$
	$\displaystyle=$	$\displaystyle 1+\exp(\tau_{0}+\tau_{X}^{T}X)\exp\{\tau_{Z}^{T}(\mu_{0Z}+\mu_{UZ}^{T}U+\mu_{XZ}^{T}X)+\frac{1}{2}\tau_{Z}^{T}\Sigma_{Z}\tau_{Z}\}$
	$\displaystyle=$	$\displaystyle 1+\exp\{(\tau_{0}+\tau_{Z}^{T}\mu_{0Z}+\frac{1}{2}\tau_{Z}^{T}\Sigma_{Z}\tau_{Z})+\tau_{Z}^{T}\mu_{UZ}^{T}U+(\tau_{X}^{T}+\tau_{Z}^{T}\mu_{XZ}^{T})X\}$

Hence $\tau$ satisfies

	$\displaystyle\tau_{Z}=\mu_{UA}^{+}\mu_{UZ},\qquad\tau_{X}=\mu_{XA}-\mu_{XZ}\tau_{Z}=\mu_{XA}-\mu_{XZ}\mu_{UA}^{+}\mu_{UZ},$
	$\displaystyle\tau_{A}=\tau_{Z}^{T}\Sigma_{Z}\tau_{Z},$
	$\displaystyle\tau_{0}=\mu_{0A}+\mu_{AS}-\tau_{Z}^{T}\mu_{0Z}-\dfrac{1}{2}\tau_{Z}^{T}\Sigma_{Z}\tau_{Z}.$

S7 Cross-fitting kernel learning estimator for nonparametric estimation of $\beta_{0}$ .

Under Model 2.1 and Assumptions 2.1 and 2.2, by Lemma S1, estimation of $\beta_{0}$ reduces to estimation of $\theta_{ac}$ . We first derive an influence function of $\theta_{ac}$ .

Theorem S1.

An influence function for $\zeta_{ac}$ is

	$\displaystyle IF_{\zeta_{ac}}(O;q,h)$	$\displaystyle=\dfrac{S}{P(S=1)}(1-Y)c(X)h(a,W,X)-$
		$\displaystyle\qquad\dfrac{S}{P(S=1)}I(A=a)q(a,Z,X)c(X)\left[(1-Y)h(A,W,X)-Y\right]-$
		$\displaystyle\qquad\zeta_{ac},$

Proof S2.

We let $\mu_{ac}=E\left[Sc(X)I(A=a)Yq(A,Z,X)\mid S=1\right].$ Because $\mu_{ac}=\zeta_{ac}P(S=1)$ , their influence functions $IF_{\mu_{ac}}$ and $IF_{\zeta_{ac}}$ satisfy

IF_{\theta_{ac}}(O;q,h)=IF_{\mu_{ac}}(O;q,h)/P(S=1).

We proceed to derive $IF_{\mu_{ac}}$ .

Consider regular one-dimensional parametric submodel indexed by $t$ , where $t=0$ indicates the truth, we have

\displaystyle\dfrac{\partial\mu_{1ct}}{\partial t}

\displaystyle=E[c(X)\dot{q}(A,Z,X)AYS]+E[c(X)q(A,Z,X)AYSR(O)],

where $R$ denotes the score function.

To calculate $E[c(X)\dot{q}(A,Z,X)AYS]$ , we note that

\displaystyle 0=E_{t}\left[S(1-Y)\left(q_{t}(A,Z,X)-\dfrac{1}{f_{t}(A\mid W,X,Y=0,S=1)}\right)n(A,W,X)\right].

Taking derivatives w.r.t $t$ on both sides and rearranging the terms, we have

	$\displaystyle E[S(1-Y)\dot{q}(A,Z,X)n(A,W,X)]$		(36)
	$\displaystyle\qquad=E\left\{S(1-Y)\left[\dfrac{1}{f(A\mid W,X,Y=0,S=1)}-q(A,Z,X)\right]n(A,X,W)R(O)\right\}$		(37)
	$\displaystyle\qquad-E\left[S(1-Y)\dfrac{R(A\mid W,X,Y=0,S=1)}{f(A\mid W,X,Y=0,S=1)}n(A,W,X)\right]$		(38)

In Equation (36), taking $n(A,W,X)=Ah(A,W,X)c(X)$ , we have

		$\displaystyle E[S(1-Y)c(X)\dot{q}(A,Z,X)Ah(A,W,X)]$
	$\displaystyle=$	$\displaystyle E\left\{ASc(X)\dot{q}(A,Z,X)E\left[(1-Y)h(A,W,X)\mid A,Z,X,S\right]\right\}$
	$\displaystyle\stackrel{{\scriptstyle\eqref{eq:h-identification}}}{{=}}$	$\displaystyle E\left\{ASc(X)\dot{q}(A,Z,X)E\left[Y\mid A,Z,X,S=1\right]\right\}$
	$\displaystyle=$	$\displaystyle E[c(X)\dot{q}(A,Z,X)AYS]$

In Equation (38), by taking $n(A,W,X)=Ah(A,W,X)c(X)$ , we have

		$\displaystyle E\left[S(1-Y)c(X)\dfrac{R(A\mid W,X,Y=0,S=1)}{f(A\mid W,X,Y=0,S=1)}Ah(A,W,X)\right]$
	$\displaystyle=$	$\displaystyle E\left[S(1-Y)c(X)\dfrac{R(O)}{f(A\mid W,X,Y=0,S=1)}Ah(A,W,X)\right]$
		$\displaystyle\qquad-E\left\{S(1-Y)c(X)R(W,X,Y,S)\dfrac{Ah(A,W,X)}{f(A\mid W,X,Y=0,S=1)}\right\}$
	$\displaystyle=$	$\displaystyle E\left[S(1-Y)c(X)\dfrac{R(O)}{f(A\mid W,X,Y=0,S=1)}Ah(A,W,X)\right]$
		$\displaystyle\qquad-E\bigg{\{}S(1-Y)c(X)R(W,X,Y,S)E\bigg{[}\dfrac{Ah(A,W,X)}{f(A\mid W,X,Y=0,S=1)}\mid W,X,$
		$\displaystyle\qquad Y=0,S=1\bigg{]}\bigg{\}}$
	$\displaystyle=$	$\displaystyle E\left[S(1-Y)c(X)\dfrac{R(O)}{f(A\mid W,X,Y=0,S=1)}Ah(A,W,X)\right]-$
		$\displaystyle\qquad E\left\{S(1-Y)c(X)R(W,X,Y,S)h(1,W,X)\right\}$
	$\displaystyle=$	$\displaystyle E\left\{S(1-Y)c(X)\left[\dfrac{Ah(A,W,X)}{f(A\mid W,X,Y=0,S=1)}-h(1,W,X)\right]R(O)\right\}$

We therefore have

	$\displaystyle\quad E[c(X)\dot{q}(A,Z,X)AYS]$
	$\displaystyle=E\left\{S(1-Y)c(X)\left[\dfrac{1}{f(A\mid W,X,Y=0,S=1)}-q(A,Z,X)\right]Ah(A,W,X)R(O)\right\}$
	$\displaystyle\qquad-E\left\{S(1-Y)c(X)\left[\dfrac{Ah(A,W,X)}{f(A\mid W,X,Y=0,S=1)}-h(1,W,X)\right]R(O)\right\}$
	$\displaystyle=E\left\{S(1-Y)c(X)\left[h(1,W,X)-Aq(A,Z,X)h(A,W,X)\right]R(O)\right\}$

Hence

	$\displaystyle\dfrac{\partial\mu_{1ct}}{\partial t}$	$\displaystyle=E\bigg{\{}Sc(X)\bigg{(}(1-Y)\bigg{[}h(1,W,X)-Aq(A,Z,X)h(A,W,X)\bigg{]}+$
		$\displaystyle\qquad YAq(A,Z,X)\bigg{)}R(O)\bigg{\}}$
		$\displaystyle=E\bigg{\{}Sc(X)\bigg{(}(1-Y)\left[h(1,W,X)-Aq(A,Z,X)h(A,W,X)\right]+YAq(A,Z,X)-$
		$\displaystyle\qquad\mu_{1c}\bigg{)}R(O)\bigg{\}}$

and

\displaystyle IF_{\mu_{1c}}(\cdot)

\displaystyle=S(1-Y)c(X)h(1,W,X)-SAq(A,Z,X)c(X)\left[(1-Y)h(A,W,X)-Y\right]-\mu_{1c}.

Similarly,

	$\displaystyle IF_{\mu_{0c}}(\cdot)$	$\displaystyle=Sc(X)\bigg{\{}(1-Y)\bigg{[}h(0,W,X)-(1-A)q(A,Z,X)h(A,W,X)\bigg{]}+$
		$\displaystyle\qquad Y(1-A)q(A,Z,X)\bigg{\}}$
		$\displaystyle=S(1-Y)c(X)h(0,W,X)-S(1-A)q(A,Z,X)c(X)\bigg{[}(1-Y)h(A,W,X)-$
		$\displaystyle\qquad Y\bigg{]}-\mu_{0c}.$

The influence function $IF_{\zeta_{ac}}(O;q,h)$ can be written as

	$\displaystyle IF_{\zeta_{ac}}(O;q_{a},h_{a})$	$\displaystyle=\dfrac{S}{P(S=1)}\big{\{}g_{a1}(A,Y,X)q_{a}(Z,X)h_{a}(W,X)+$
		$\displaystyle\qquad g_{a2}(Y,X)q_{a}(Z,X)+g_{a3}(Y,X)h_{a}(W,X)\big{\}}-\zeta_{ac},$		(39)

where

		$\displaystyle g_{a1}(A,Y,X)=-c(X)I(A=a)(1-Y),\qquad g_{a2}(A,Y,X)=c(X)I(A=a)Y,$
		$\displaystyle g_{a3}(A,Y,X)=c(X)(1-Y),\qquad h_{a}(W,X)=h(a,W,X),$
	and	$\displaystyle q_{a}(Z,X)=q(a,Z,X).$

This influence function belongs to the class of doubly-robust influence functions considered in Ghassami et al. (2021), and thus their method directly applies. By Theorem 1 of Rotnitzky et al. (2021), such an influence function also satisfies the so-call ”mixed-bias” structure, and therefore $\sqrt{n}$ -consistent doubly robust estimation is possible even when the nuisance functions are estimated nonparametrically. In this section, we describe a cross-fitting minimax kernel learning estimator for $\zeta_{ac}$ and stated the theoretic property of the resulting estimator of $\beta_{0}$ .

We randomly partition the data evenly into $L$ subsamples. For each $l\in\{1,\dots,L\}$ , let $I_{l}$ and $n_{l}$ be the indices and size of data in the $l$ th sample, respectively. Let $m_{l}=n-n_{l}$ be the size of the sample excluding $I_{l}$ . Write $q_{a}(Z,X)=q(a,Z,X)$ and $h_{a}(W,X)=h(A,W,X)$ . As proposed in Ghassami et al. (2021), we estimate the nuisance functions $q_{a}$ and $h_{a}$ using data from $I_{l}^{c}$ by solving the following optimization problems:

$\displaystyle\tilde{q}_{al}$	$\displaystyle=\arg\underset{q_{a}\in\mathcal{Q}_{a}}{\min}\,\underset{\dot{h}\in\mathcal{H}_{a}^{*}}{\max}\,\dfrac{1}{m_{l}}\sum_{i\in I_{l}^{c}}\bigg{[}\dot{h}(W_{i},X_{i})\left\{q_{a}(Z_{i},X_{i})g_{a1}(A_{i},Y_{i},X_{i})+g_{3}(Y_{i},X_{i})\right\}$
	$\displaystyle\qquad-\dot{h}^{2}(W_{i},X_{i})\bigg{]}-\lambda_{\mathcal{H}_{a}^{}}^{q_{a}}\lVert\dot{h}\rVert_{\mathcal{H}_{a}^{}}^{2}+\lambda_{\mathcal{Q}_{a}}^{q_{a}}\lVert q_{a}\rVert_{\mathcal{Q}_{a}}^{2};$	(40)
$\displaystyle\tilde{h}_{al}$	$\displaystyle=\arg\underset{h_{a}\in\mathcal{H}_{a}}{\min}\,\underset{\dot{q}\in\mathcal{Q}_{a}^{*}}{\max}\,\dfrac{1}{m_{l}}\sum_{i\in I_{l}^{c}}\bigg{[}\dot{q}(Z_{i},X_{i})\left\{h_{a}(W_{i},X_{i})g_{a1}(A_{i},Y_{i},X_{i})+g_{2}(A_{i},Y_{i},X_{i})\right\}$
	$\displaystyle\qquad-\dot{q}^{2}(Z_{i},X_{i})\bigg{]}-\lambda_{\mathcal{Q}_{a}^{}}^{h_{a}}\lVert\dot{q}\rVert_{\mathcal{Q}_{a}^{}}^{2}+\lambda_{\mathcal{H}_{a}}^{h_{a}}\lVert h_{a}\rVert_{\mathcal{H}_{a}}^{2}.$	(41)

where $\mathcal{Q}_{a}$ , $\mathcal{H}_{a}$ , $\mathcal{Q}_{a}^{*}$ and $\mathcal{H}_{a}^{*}$ are reproducing kernel Hilbert spaces (RKHS) generated by the kernel functions $K_{\mathcal{Q}_{a}}$ , $K_{\mathcal{H}_{a}}$ , $K_{\mathcal{Q}_{a}^{*}}$ an $K_{\mathcal{H}_{a}^{*}}$ , equipped with norms $\lVert\cdot\rVert_{\mathcal{Q}_{a}}$ , $\lVert\cdot\rVert_{\mathcal{H}_{a}}$ , $\lVert\cdot\rVert_{\mathcal{Q}_{a}^{*}}$ and $\lVert\cdot\rVert_{\mathcal{H}_{a}^{*}}$ , respectively, and $\lambda_{\mathcal{Q}_{a}}^{q_{a}}$ , $\lambda_{\mathcal{H}_{a}}^{h_{a}}$ , $\lambda_{\mathcal{Q}_{a}^{*}}^{h_{a}}$ and $\lambda_{\mathcal{H}_{a}^{*}}^{q_{a}}$ are non-negative regularizing parameters. The above optimization problems have closed form solutions, which we leave to to the end of this section.

The objective function for $\tilde{q}_{al}$ can be viewed as a regularised version of empirical perturbation evaluated using data from $I_{l}^{c}$ , by moving the influence function from the pair $(q_{a},h_{a})$ to $(q_{a},h_{a}+\dot{h})$ , where the perturbation is defined as

	$\displaystyle prt_{\theta_{ac}}(q_{a},h_{a};0,\dot{h})$	$\displaystyle=IF_{\theta_{ac}}(A,Y,Z,W,X;q_{a},h_{a}+\dot{h})-IF_{\theta_{ac}}(A,Y,Z,W,X;q_{a},h_{a})$
		$\displaystyle=\dfrac{S}{P(S=1)}\dot{h}(W,X)\left\{q_{a}(Z,X)g_{a1}(A,Y,X)-g_{a3}(Y,X)\right\}.$

The objective function contains two forms of regularization: the term $\dot{h}^{2}$ improves the robustness against model misspecification, stability and the convergence rates, while the Tikhonov-type penalties $\lambda_{\mathcal{H}_{a}}^{q_{a}}\lVert\dot{h}\rVert_{\mathcal{H}_{a}}^{2}$ and $\lambda_{\mathcal{Q}_{a}}^{q_{a}}\lVert q_{a}\rVert_{\mathcal{Q}_{a}}^{2}$ penalise the complexity of the classes of nuisance functions considered. The objective function for $\tilde{h}_{al}$ can be interpreted similarly.

We estimate $\zeta_{ac}$ with the $l$ th subsample with the PDR estimator

	$\displaystyle\tilde{\zeta}_{ac,l}$	$\displaystyle=\dfrac{1}{n_{l}}\sum_{i\in I_{l}}c(X_{i})\big{[}(1-Y_{i})\tilde{h}_{al}(W_{i},X_{i})-$
		$\displaystyle\qquad I(A_{i}=a)\tilde{q}_{al}(Z_{i},X_{i})\big{\{}(1-Y_{i})\tilde{h}_{al}(W_{i},X_{i})-Y_{i}\big{\}}\big{]}.$		(42)

The final estimator for $\zeta_{ac}$ is $\tilde{\zeta}_{ac}=\sum_{l=1}^{L}\tilde{\zeta}_{ac,l}/L$ , which gives the cross-fitting minimax kernel learning estimator for $\beta_{0}$ as $\tilde{\beta}_{0}=\log(\tilde{\zeta}_{1c}/\tilde{\zeta}_{0c}).$ We summarise the estimation of $\zeta_{ac}$ in Algorithm S7.

{algo}

A cross-fitting minimax kernel learning algorithm to estimate $\beta_{0}$

Split the data into

L

even subsamples

For

a=0

a=1

For

l=1

l=L

Solve

\tilde{q}_{al}

and

\tilde{h}_{al}

in Equations (40) and (41)

Obtain

\tilde{\zeta}_{ac,l}

according to Equation (S7)

\tilde{\zeta}_{ac}\leftarrow\sum_{l=1}^{L}\tilde{\zeta}_{ac,l}/L

Output

\tilde{\beta}_{0}=\log(\tilde{\zeta}_{1c}/\tilde{\zeta}_{0c})

The estimator $\tilde{\beta}_{0c}$ is consistent and asymptotically normal, thus allowing valid statistical inference, as stated in Theorem S3 below.

Theorem S3.

Under Model 2.1 and Assumptions 2.1-2.3 and suitable regularity conditions listed in Section S8, the cross-fitting estimator $\tilde{\beta}_{0}$ in Algorithm S7 is consistent and asymptotically normal of $\beta_{0}$ .

We prove Theorem S3 in Section S9 of the Supplementary Materials.

Implementation of this approach is challenging, as the performance of the method proves to be sensitive to the value of tuning parameters. We leave an efficient implementation of this approach to future research.

The closed form solutions for $\tilde{q}_{al}$ and $\tilde{h}_{al}$ are

	$\displaystyle\tilde{q}_{al}(z,x)$	$\displaystyle=\sum_{i\in I_{l}^{c}}\gamma_{ali}K_{\mathcal{Q}_{a}}(Z_{i},X_{i};z,x),$
	$\displaystyle\text{and}\qquad\tilde{h}_{al}(w,x)$	$\displaystyle=\sum_{i\in I_{l}^{c}}\alpha_{ali}K_{\mathcal{H}_{a}}(W_{i},X_{i};w,x),$

where

	$\displaystyle\alpha_{al}$	$\displaystyle=-(K_{\mathcal{H}_{a},l}D_{al}\Gamma_{al}D_{al}K_{\mathcal{H}_{a},l}+m_{l}^{2}\lambda_{\mathcal{H}_{a}}^{h_{a}}K_{\mathcal{H}_{a},l})^{+}K_{\mathcal{H}_{a},l}D_{al}\Gamma_{al}g_{a2,l},$
	$\displaystyle\gamma_{al}$	$\displaystyle=-(K_{\mathcal{Q}_{a},l}D_{al}\Omega_{al}D_{al}K_{\mathcal{Q}_{a},l}+m_{l}^{2}\lambda_{\mathcal{Q}_{a}}^{q_{a}}K_{\mathcal{Q}_{a},l})^{+}K_{\mathcal{Q}_{a},l}D_{al}\Gamma_{al}g_{a3,l},$
	$\displaystyle\Gamma_{al}$	$\displaystyle=\dfrac{1}{4}K_{\mathcal{Q}_{a}^{},l}(\dfrac{1}{m_{l}}K_{\mathcal{Q}_{a}^{},l}+\lambda_{\mathcal{Q}_{a}^{*}}^{h_{a}}I_{al})^{-1},$
	$\displaystyle\Omega_{al}$	$\displaystyle=\dfrac{1}{4}K_{\mathcal{\mathcal{missing}}H_{a}^{},l}(\dfrac{1}{m_{l}}K_{\mathcal{H}_{a}^{},l}+\lambda_{\mathcal{H}_{a}^{*}}^{q_{a}}I_{al})^{-1},$

$D_{al}$ is a $(n-n_{l})\times(n-n_{l})$ diagonal matrix with elements $\{g_{a1}(A_{i},Y_{i},X_{i}):i\in I_{l}^{c}\}$ , $K_{\mathcal{H},l}$ and $K_{\mathcal{Q},l}$ are respectively the $(n-n_{l})\times(n-n_{l})$ empirical kernel matrices with data in the sample $I_{l}^{c}$ , $g_{a2,l}$ is the $(n-n_{l})$ -vector with elements $\{g_{a2}(Y_{i},X_{i}):i\in I_{l}^{c}\}$ , and $g_{a3,l}$ is the $(n-n_{l})$ -vector with elements $\{g_{a3}(A_{i},Y_{i},X_{i}):i\in I_{l}^{c}\}$ .

S8 Regularity conditions in Theorem S3

For a function class $\mathcal{F}$ , let $\mathcal{F}_{B}:=\{f\in\mathcal{F}:\lVert f\rVert_{\mathcal{F}}^{2}\leq B\}$ .

Let $\{\mu^{h_{a}}_{j}\}_{j=1}^{\infty}$ and $\{\varphi_{j}^{h_{a}}\}_{j=1}^{\infty}$ be the eigenvalues and eigenfunctions of the RKHS $\mathcal{H}_{a}$ and $\{\mu^{q_{a}}_{j}\}_{j=1}^{\infty}$ and $\{\varphi_{j}^{q_{a}}\}_{j=1}^{\infty}$ be the eigenvalues and eigenfunctions of the RKHS $\mathcal{Q}_{a}$ . For any $m\in\mathbb{N}_{+}$ , let $V_{m}^{\mathcal{H}_{a}}$ and $V_{m}^{\mathcal{Q}_{a}}$ be the $m\times m$ matrices with entry $(i,j)$ defined respectively as

[V_{m}^{\mathcal{H}_{a}}]_{i,j}=E[E[\varphi_{i}^{\mathcal{H}_{a}}(W,X)beta_{0}\mid Z,X,S=1]E[\varphi_{j}^{\mathcal{H}_{a}}(W,X)\mid Z,X,S=1]]

and

[V_{m}^{\mathcal{Q}_{a}}]_{i,j}=E[E[\varphi_{i}^{\mathcal{Q}_{a}}(Z,X)\mid W,X,S=1]E[\varphi_{j}^{\mathcal{Q}_{a}}(Z,X)\mid W,X,S=1]].

Let $\lambda_{m}^{\mathcal{H}_{a}}$ and $\lambda_{m}^{\mathcal{Q}_{a}}$ be the minimum eigenvalues of $V_{m}^{\mathcal{H}_{a}}$ and $V_{m}^{\mathcal{Q}_{a}}$ , respectively.

R.1

The functions $g_{a1}(A,Y,X)=c(X)I(A=a)(1-Y)$ , $g_{a2}(A,Y,X)=c(X)I(A=a)Y$ and $g_{a3}(Y,X)=c(X)(1-Y)$ are bounded.
R.2

There exists a constant $\sigma_{1}$ such that

$\mid[E[c(X)I(A=a)(1-Y)\mid Z,W,X,S=1]\mid>\sigma_{1}>0$

almost surely for $a=0,1$ .
R.3

The nuisance functions $q(A,Z,X)$ and $h(A,W,X)$ have finite second moments.

R.4

The nuisance functions $q_{a}(Z,X)$ and their estimators $\tilde{q}_{al}(Z,X)$ and $\tilde{h}_{al}(W,X)$ in each fold of cross-fitting satisfy

\min\left\{\sup_{z,x}\mid\tilde{q}_{al}(z,x)+\sup_{w,x}h_{a}(w,x)\mid,\sup_{z,x}\mid q_{a}(z,x)+\sup_{w,x}\tilde{h}_{al}(w,x)\mid\right\}<\infty.

R.5

$\lVert\tilde{q}_{al}-q_{a}\rVert_{2}=o_{p}(1)$ , $\lVert\tilde{h}_{al}-h_{a}\rVert_{2}=o_{p}(1)$ ;
R.6

There exists a constant $C$ such that $\tilde{h}_{a}\in\mathcal{H}_{aC}$ and $\tilde{q}_{a}\in\mathcal{Q}_{aC}$ for $a\in\{0,1\}$ .

R.7

\mid E[E[\varphi_{i}^{\mathcal{H}_{a}}(W,X)\mid Z,X,S=1]E[\varphi_{j}^{\mathcal{H}_{a}}(W,X)\mid Z,X,S=1]]\mid\leq c^{\mathcal{H}_{a}}\lambda_{m}^{\mathcal{H}_{a}}

and

\mid E[E[\varphi_{i}^{\mathcal{Q}_{a}}(Z,X)\mid W,X,S=1]E[\varphi_{j}^{\mathcal{Q}_{a}}(Z,X)\mid W,X,S=1]]\mid\leq c^{\mathcal{Q}_{a}}\lambda_{m}^{\mathcal{Q}_{a}}

for some constant $c^{\mathcal{H}_{a}}$ and $c^{\mathcal{Q}_{a}}$ .

R.8

$\lVert E[I(A=a)c(X)(1-Y)\{\tilde{q}_{al}(Z,X)-q_{a}(Z,X)\rVert W,X,S=1\}]\rVert_{2}=O(n^{-{r_{q}}})$

and

$\lVert E[I(A=a)c(X)(1-Y)\{\tilde{h}_{al}(W,X)-h_{a}(W,X)\rVert Z,X,S=1\}]\rVert_{2}=O(n^{-r_{h}}).$
R.9

$\mu_{m}^{\mathcal{H}_{a}}\sim m^{-s_{h}}$ , $\mu_{m}^{\mathcal{Q}_{a}}\sim m^{-s_{q}}$ , $\lambda_{m}^{\mathcal{H}_{a}}\sim m^{-t_{h}}$ , and $\lambda_{m}^{\mathcal{Q}_{a}}\sim m^{-t_{q}}$ , where the decay rates $s_{h}$ , $s_{q}$ , $t_{h}$ and $t_{q}$ satisfies

$\max\{r_{h}+\dfrac{s_{h}r_{q}}{s_{h}+t_{h}},r_{q}+\dfrac{s_{q}r_{h}}{s_{q}+t_{q}}\}>\dfrac{1}{2}.$

S9 Proof of Theorem S3

Under Model 2.1 and Assumptions 2.1, 2.2 and 2.2, Theorem 2.2(b) states that $\beta_{0}=\log(\zeta_{1c}/\zeta_{0c})$ . By delta-method, it suffices to prove that $\tilde{\zeta}_{ac}$ is asymptotically linear with the influence function $IF_{\zeta_{ac}}(O)$ , the proof of which follows Ghassami et al. (2021) Theorem 2 and Lemma 1, which we reproduced below.

For $a=0,1$ and a given value $\delta>0$ , we define $\mathcal{H}_{aC}^{|\delta}:=\{h_{a}\in\mathcal{H}_{aC}:\lVert E[h_{a}(W,X)\mid Z,X,S=1]\rVert_{2}\leq\delta\}$ and $\mathcal{Q}_{aC}^{|\delta}:=\{q_{a}\in\mathcal{Q}_{aC}:\lVert E[q_{a}(Z,X)\mid W,X,S=1]\rVert_{2}\leq\delta\}$ . We define the measures of ill-posedness $\tau_{h_{a}}(\delta):=\sup_{h_{a}\in\mathcal{H}_{aC}^{|\delta}}\lVert h_{a}\rVert_{2}$ and $\tau_{q_{a}}(\delta):=\sup_{q_{a}\in\mathcal{Q}_{aC}^{|\delta}}\lVert q_{a}\rVert_{2}$ .

Lemma S1.

(Ghassami et al. (2021) Lemma 1) Under regularity conditions R.6 and R.7, for any $\epsilon>0$ we have

\tau_{h_{a}}(\epsilon)^{2}\leq\min_{m\in\mathbb{N}_{+}}\left\{\dfrac{4\epsilon^{2}}{\lambda_{m}^{\mathcal{H}_{a}}}+[4(c^{\mathcal{H}_{a}})^{2}+1]C\mu_{m+1}^{\mathcal{H}^{a}}\right\}

and

\tau_{q_{a}}(\epsilon)^{2}\leq\min_{m\in\mathbb{N}_{+}}\left\{\dfrac{4\epsilon^{2}}{\lambda_{m}^{\mathcal{Q}_{a}}}+[4(c^{\mathcal{Q}_{a}})^{2}+1]C\mu_{m+1}^{\mathcal{Q}^{a}}\right\}.

By Lemma S1 and R.9, we have that

\tau_{h_{a}}(\epsilon)=O\left(\epsilon^{\dfrac{s_{h}}{s_{h}+t_{h}}}\right)\qquad\text{and}\qquad\tau_{q_{a}}(\epsilon)=O\left(\epsilon^{\dfrac{s_{q}}{s_{q}+t_{q}}}\right)

The condition R.9 further implies that

	$\displaystyle\min\{n^{-r_{h}}\tau_{h_{a}}(n^{-r_{q}}),n^{-r_{q}}\tau_{q_{a}}(n)\}$	$\displaystyle=O\left(\min\left\{n^{-\left(r_{h}+\dfrac{s_{h}r_{q}}{s_{h}+t_{h}}\right)},n^{-\left(r_{q}+\dfrac{s_{q}r_{h}}{s_{q}+t_{q}}\right)}\right\}\right)$
		$\displaystyle=o(n^{-1/2}).$		(43)

The result follows by the theorem below:

Theorem S2.

(Ghassami et al. (2021) Theorem 2) Under regularity conditions R.1-R.6 and R.8, if Equation (43) holds in addition, then the estimator $\tilde{\theta}_{ac}$ is asymptotically linear with the influence function

IF_{\zeta_{ac}}(O).

S10 Details of the simulation studies

We generate the data for a target population of $N$ observations according to the following five scenarios

Scenario I: Univariate binary $U$ , $Z$ and $W$

	$\displaystyle U$	$\displaystyle\sim\text{Bernoulli}(0,0.5);$
	$\displaystyle A\mid U$	$\displaystyle\sim\text{Bernoulli}(\mathrm{expit}(0.2+0.4U))$
	$\displaystyle Z\mid A,U$	$\displaystyle\sim\text{Bernoulli}(0.2+0.1A+0.4U+0.2AU);$
	$\displaystyle W\mid U$	$\displaystyle\sim\text{Bernoulli}(0.2+0.4U);$
	$\displaystyle Y\mid A,U$	$\displaystyle\sim\text{Bernoulli}(\mathrm{expit}(-0.405-1.609A-0.7U))$
	$\displaystyle S\mid U,A,Y$	$\displaystyle\sim\text{Bernoulli}(\exp(-1.7+0.2A+0.4Y+0.7U)).$

Scenario II-V: Continuous $U$ , $X$ , $Z$ , $W$ .

	$\displaystyle U$	$\displaystyle\sim\text{Uniform}(0,1);$
	$\displaystyle X$	$\displaystyle\sim\text{Uniform}(0,1);$
	$\displaystyle A\mid U,X$	$\displaystyle\sim\text{Bernoulli}(\mathrm{expit}(-1+U+X))$
	$\displaystyle Z\mid A,U,X$	$\displaystyle\sim\text{N}(0.25A+0.25X+4U,0.25^{2});$
	$\displaystyle W\mid Y,U,X$	$\displaystyle\sim\text{N}(0.25Y+0.25X+4U,0.25^{2});$
	$\displaystyle Y\mid A,U,X$	$\displaystyle\sim\text{Bernoulli}(\mathrm{expit}(-3.89+\beta_{0}A-2U-X))$
	$\displaystyle S\mid U,X,A,Y$	$\displaystyle\sim\text{Bernoulli}(\exp(-5+0.4A+4Y+0.3U+0.2X).$

S11 Estimating conditional odds ratio under effect modification by $X$

In Section 2, we assumed that the effect size of odds ratio is constant across every $(U,X)$ strata. However, in practical situation, the treatment effect may be heterogeneous. Therefore, we relax Model 2.1 and propose Model S11 below that allows heterogeneous odds ratio across $X$ strata: {model}[Heterogeneous odds ratio model by $X$ ]

\mathrm{logit}\,P(Y=1\mid A,U,X)=\beta_{0}(X)A+\eta(U,X).

(44)

In Model S11, the odds ratio treatment effect $\beta_{0}(X)$ is homogeneous with respect to the latent factors $U$ but is allowed to vary by $X$ . In such scenarios, if the negative control variables are available, the odds ratio function $\beta_{0}(X)$ can still be identified, similar to Theorems 2.2, 2.4 and 2.5, as stated in the following theorem:

Theorem S1.

Under Model S11 and Assumptions 2.1, 2.2, 2.2 and 2.3, then the odds ratio function $\beta_{0}(X)$ satisfies the following moment equations:

	$\displaystyle E\left[c(X)I(A=a)Yq(A,Z,X)e^{-\beta_{0}(X)}\mid S=1\right]=0;$
	$\displaystyle E\left[c(X)(1-Y)\left\{h(1,W,X)-h(0,W,X)e^{\beta_{0}(X)}\right\}\mid S=1\right]=0;$
	$\displaystyle E\bigg{\{}c(X)\bigg{[}(-1)^{1-A}q(A,Z,X)e^{-\beta_{0}(X)A}\left\{Y-(1-Y)h(A,W,X)\right\}+$
	$\displaystyle\qquad\qquad(1-Y)\left\{h(1,W,X)e^{-\beta_{0}(X)}-h(0,W,X)\right\}\bigg{]}\mid S=1\bigg{\}}=0.$

In cases where a parametric model for the odds ratio function $\beta_{0}(X;\alpha)$ is considered appropriate with a finite dimensional parameter $\alpha$ , similar to the previous PIPW, POR and PDR estimation, the low-dimension parameter $\alpha$ can be estimated by solving the estimating equations

	$\displaystyle\dfrac{1}{n}\sum_{i=1}^{n}{(-1)}^{1-A_{i}}c(X_{i})q(A_{i},Z_{i},X_{i};\tau)Y_{i}e^{-\beta_{0}(X;\alpha)A_{i}}=0;$	(45)
	$\displaystyle\dfrac{1}{n}\sum_{i=1}^{n}c(X_{i})(1-Y_{i})\left\{h(1,W_{i},X_{i};\hat{\psi})e^{-\beta_{0}(X_{i};\alpha)}-h(0,W_{i},X_{i};\hat{\psi})\right\}=0;$	(46)
or	$\displaystyle\dfrac{1}{n}\sum_{i=1}^{n}c(X_{i})\bigg{[}(-1)^{1-A_{i}}q(A_{i},Z_{i},X_{i})e^{-\beta_{0}(X;\alpha)A}\left\{Y_{i}-(1-Y_{i})h(A_{i},W_{i},X_{i})\right\}+$
	$\displaystyle\qquad\qquad(1-Y_{i})\left\{h(1,W_{i},X_{i})e^{-\beta_{0}(X;\alpha)}-h(0,W_{i},X_{i})\right\}\bigg{]}=0,$	(47)

where $\hat{\tau}$ and $\hat{\psi}$ are estimators for the parameters indexing the treatment and outcome confounding bridge functions, obtained by solving Equations (12) and (13), respectively, and $c(X)$ is a user-specified function with dimension no smaller than that of $\alpha$ .

Alternatively, a similar kernel method to the one in Section S7 of the Supplementary Materials can be used to obtain a cross-fitting minimax learning estimator of $\alpha$ which solves Equation (47) and nonparametrically estimates the nuisance functions $q$ and $h$ . With an estimator $\hat{\alpha}$ using either of the above methods, the conditional odds ratio at $X=x$ can then be estimated by $\beta_{0}(x;\hat{\alpha})$ .

Nonparametric estimation of $\beta_{0}(X)$ is possible, for example, by using the minimax learning approach (Dikkala et al., 2020) or the semi-nonparametric sieve generalised method of moments approach (Ai & Chen, 2003; Chen, 2007). Such a nonparametric estimator may be of interest, for example, to more flexibly account for the heterogeneity of treatment effects and making flexible prediction on the “individual treatment effect”.

S12 Extension to polytomous treatment and effect

Suppose we want to study the effect of a polytomous treatment $A$ on a polytomous outcome $Y$ , where $A$ have levels $a_{0},a_{1},\dots,a_{J}$ and $Y$ have levels $y_{0},\dots,y_{K}$ . Here $a_{0}$ and $y_{0}$ denote the reference treatment and outcome, respectively. We make the following homogeneity assumption: {model}[Homogeneous odds ratio model]

	$\displaystyle\log\left(\dfrac{P(Y\mid A,U,X)P(Y=y_{0}\mid A=a_{0},U,X)}{P(Y\mid A=a_{0},U,X)P(Y=y_{0}\mid A,U,X)}\right)=$
	$\displaystyle\qquad\qquad\sum_{\begin{subarray}{c}j\in\{1,\dots,J\}\\ k\in\{1,\dots,K\}\end{subarray}}\beta_{j}^{k}I(A=a_{j},Y=y_{k})+\eta^{k}(U,X).$

Model S12 indicates that the odds ratio of $Y=y_{k}$ relative to $Y=y_{0}$ against the treatment $A=a_{j}$ versus $A=a_{0}$ is $\beta_{j}^{k}$ across all strata of $(U,X)$ . Model S12 is a natural semiparametric extension to the polytomous logistic regression model (Engel, 1988). When $J=1$ and $K=1$ , Model S12 reduces to Model 2.1. The goal is to estimate and make inference on every $\beta_{j}^{k}$ in the presence of latent factors $U$ and confounded outcome-dependent sampling.

We make the following assumptions, corresponding to Assumption 2.1, 2.2 and 2.3: {assumption}[No effect modification by $A$ on the outcome-dependent selection]For $a=a_{0},\dots,a_{J}$ and real-valued unknown functions $\xi^{1}(U,X),\dots,\xi^{K}(U,X)$ , the sampling mechanism satisfies

\displaystyle\dfrac{P(S=1\mid Y=y_{k},A=a,U,X)}{P(S=1\mid Y=y_{0},A=a,U,X)}=\xi^{k}(U,X)

for $k=1,\dots,K$ .

{assumption}

[Confounding bridge functions] The exists a treatment confounding bridge function $q(A,Z,X)$ and $K$ outcome confounding bridge functions $h_{k}(A,W,X)$ such that for $k=1,\dots,K$ and $a=a_{0},a_{1},\dots,a_{J}$ ,

	$\displaystyle E\{q(a,Z,X)\mid A=a,U,X\}=1/P(A=a\mid U,X,Y=y_{0},S=1)$		(48)
	$\displaystyle E\{I(Y=y_{0})h_{k}(a,W,X)\mid A=a,U,X,S=1\}=P(Y=y_{k}\mid A=a,U,X,S=1).$		(49)

We state the following results for inference of $\beta_{j}^{k}$ . The proofs are similar to those in Section S1 and are therefore omitted.

Lemma S1.

Under Model S12 and Assumption S12, the log odds ratio $\beta_{j}^{k}$ satisfies

\beta_{j}^{k}=\log(\theta_{jc}^{k}/\theta_{0c}^{k}),

where

\theta_{jc}^{k}=E\left\{c(X)\dfrac{P(A=a_{j}\mid U,X,Y=y_{k},S=1)}{P(A=a_{j}\mid U,X,Y=y_{0},S=1)}P(Y=y_{k}\mid U,X,S=1)\mid S=1\right\}.

and $c(X)$ is an arbitrary real-valued function that satisfies $\theta_{jc}^{k}\neq 0$ for $j=0,1,\dots,K$ .

Theorem S2 (Identification of $q(A,Z,X)$ and $h_{k}(A,W,X)$ ).

The functions $q(\cdot)$ and $h_{k}(\cdot)$ , $k=1,\dots,K$ , satisfy the moment conditions

	$\displaystyle E\left[I(Y=y_{0})\left\{\kappa^{}_{1}(A,W,X)q(A,Z,X)-\sum_{j=0}^{J}\kappa^{}_{1}(a_{j},W,X)\right\}\mid S=1\right]=0$
	$\displaystyle E\left[\kappa^{*}_{2k}(A,Z,X)\left\{I(Y=y_{0})h_{k}(A,W,X)-I(Y=y_{k})\right\}\mid S=1\right]=0,\qquad\mbox{k=1,\ldots, K},$

where $\kappa^{*}_{1}(\cdot)$ and $\kappa_{2k}$ are arbitrary functions.

Theorem S3 (Identification of $\theta_{jc}^{k}$ ).

Assume Model S12 and Assumptions S12 and 2.2 hold, then

(a)

if there exists a function $q$ that solves Equation (48), then

$\theta_{jc}^{k}=E\left\{I(A=a_{j},Y=y_{k})c(X)q(a_{j},Z,X)\mid S=1\right\};$
(b)

if there exists functions $h^{k}$ , $k=1,\dots,K$ that solve Equation (49), then

$\theta_{jc}^{k}=E\left\{c(X)I(Y=y_{0})h_{k}(a_{j},W,X)\mid S=1\right\};$

(c)

if either (i) the function $q$ solves Equation (48), or (ii) the functions $h_{k}$ ’s solve Equation (49), then

	$\displaystyle\theta_{jc}^{k}$	$\displaystyle=E\big{\{}c(X)\big{[}I(Y=y_{0})h_{k}(a_{j},W,X)-$
		$\displaystyle\qquad I(A=a_{j})q(a_{j},Z,X)\big{\{}I(Y=y_{0})h_{k}(a_{j},W,X)-I(Y=y_{k})\big{\}}\big{]}\mid S=1\big{\}}.$

By Theorem S2 and S3, to estimate $\beta_{j}^{k}$ one may:

1.

Specify suitable parametric working models $q(A,Z,X;\tau)$ and $h_{k}(A,W,X;\psi_{k})$ ;

Estimate the nuisance parameters $\tau$ and $\psi$ by solving

	$\displaystyle\sum_{i=1}^{n}I(Y_{i}=y_{0})\left\{\kappa^{}_{1}(A_{i},W_{i},X_{i})q(A_{i},Z_{i},X_{i};\tau)-\sum_{j=0}^{J}\kappa^{}_{1}(a_{j},W_{i},X_{i})\right\}=0;$
	$\displaystyle\sum_{i=1}^{n}\kappa^{*}_{2k}(A_{i},Z_{i},X_{i})\{I(Y_{i}=y_{0})h_{k}(A_{i},W_{i},X_{i};\psi_{k})-I(Y_{i}=y_{k})\}=0$

for $k=1,2,\dots,K$ , where $\kappa^{*}_{1}$ and $\kappa^{*}_{2k}$ are user-specified functions that dimensions at least as large as that of $\tau$ and $\psi_{k}$ respectively. Denote the resulting estimator as $\hat{\tau}$ and $\hat{\psi}_{k}$ .

Estimate $\theta_{jc}^{k}$ by

		$\displaystyle\hat{\theta}_{jc,\text{PIPW}}^{k}=\dfrac{1}{n}\sum_{i=1}^{n}I(A=a_{j},Y=y_{k})c(X_{i})q(a_{j},Z_{i},X_{i};\hat{\tau});$
		$\displaystyle\hat{\theta}_{jc,\text{POR}}^{k}=\dfrac{1}{n}\sum_{i=1}^{n}c(X_{i})I(Y_{i}=y_{0})h_{k}(a_{j},W_{i},X_{i};\hat{\psi}_{k});$
	or	$\displaystyle\hat{\theta}_{jc,\text{PDR}}^{k}=\dfrac{1}{n}\sum_{i=1}^{n}c(X_{i})\bigg{[}I(Y_{i}=y_{0})h_{k}(a_{j},W_{i},X_{i};\hat{\psi}_{k})-$
		$\displaystyle\qquad I(A_{i}=a_{j})q(a_{j},Z_{i},X_{i};\hat{\tau})\left\{I(Y_{i}=y_{0})h_{k}(a_{j},W_{i},X_{i};\hat{\psi}_{k})-I(Y_{i}=y_{k})\right\}\bigg{]}$

for $j=0,1,\dots,J$ and $k=1,\dots,K$ and a user-specified real-valued function $c(X)$ (one may simply set $c(X)=1$ ).

4.

The resulting estimators for $\beta_{j}^{K}$ include the PIPW estimator $\hat{\beta}_{j,\text{PIPW}}^{k}=\log(\hat{\theta}_{jc,\text{PIPW}}^{k}/\hat{\theta}_{0c,\text{PIPW}}^{k})$ , the POR estimator $\hat{\beta}_{j,\text{POR}}^{k}=\log(\hat{\theta}_{jc,\text{POR}}^{k}/\hat{\theta}_{0c,\text{POR}}^{k})$ , and the PDR estimator $\hat{\beta}_{jc,\text{PDR}}^{k}=\log(\hat{\theta}_{jc,\text{PDR}}^{k}/\hat{\theta}_{0c,\text{PDR}}^{k}).$

Under Model S12 and Assumptions 2.2 and S12, the PDR estimator satisfies doubly robustness, i.e. $\hat{\beta}_{jc,\text{POR}}^{k}\overset{p}{\rightarrow}\beta_{j}^{k}$ if either (i) the parametric working model $q(A,Z,X;\tau)$ is correctly specified, and the completeness assumption 2.2(b) holds, or (ii) the parametric working model $h(A,W,X;\psi)$ is correctly specified, and Assumption 2.3(b) holds.

The kernel learning approach in Section S7 can similarly be employed to obtain semiparametric estimators for $\beta_{j}^{k}$ with flexible modeling for $q$ and $h_{k}$ .

Doubly Robust Proximal Causal Inference under Confounded Outcome-Dependent Sampling

Abstract

keywords:

1 Introduction

2 Identification and estimation under a homogeneous odds ratio model

2.1 Notation and Setup

2.2 Proximal Identification of β0\beta_{0} in Li et al. (2022)

Theorem 2.1 (Identification of q​(A,Z,X)q(A,Z,X); Li et al. (2022) Theorem 2’).

Theorem 2.2 (Identification of β0\beta_{0}; Li et al. (2022) Theorem 1’).

2.3 New Proximal Identification Results

Theorem 2.3 (Identification of h​(A,W,X)h(A,W,X)).

Theorem 2.4 (Identification of β0\beta_{0}).

2.4 A doubly robust closed-form expression for β0\beta_{0}

Theorem 2.5 (A doubly-robust closed-form expression for β0\beta_{0}).

2.5 Estimation and large sample inference

Theorem 2.6 (Moment conditions for q​(A,Z,X)q(A,Z,X) and h​(A,W,X)h(A,W,X)).

Example 2.7.

Example 2.8.

Theorem 2.9.

3 Simulation

4 Extensions

References

S1 Proofs of Theorems.

S1.1 Proof of Theorem 2.1

S1.2 Proof of Theorem 2.2

Lemma S1.

Proof S2.

S1.3 Proof of Theorem 2.3

S1.4 Proof of Theorem 2.4

S1.5 Proof of Theorem 2.5

S1.6 Proof of Theorem 2.6

S1.7 Proof of Theorem 2.9

S2 Approximate equivalence of RR and OR for a rare outcome

S3 Identification of marginal causal risk ratio in a test-negative design with diseased controls

S4 Regularity conditions for the existence of a solution to Equations (1) and (26)

S5 Summary of our assumptions and results and those in Li et al. (2022)

S6 Detailed setting and derivation of q​(A,Z,X)q(A,Z,X) and h​(A,W,X)h(A,W,X) in Example 2.8

S7 Cross-fitting kernel learning estimator for nonparametric estimation of β0\beta_{0}.

Theorem S1.

Proof S2.

Theorem S3.

S8 Regularity conditions in Theorem S3

S9 Proof of Theorem S3

Lemma S1.

Theorem S2.

S10 Details of the simulation studies

Scenario I: Univariate binary UU, ZZ and WW

Scenario II-V: Continuous UU, XX, ZZ, WW.

S11 Estimating conditional odds ratio under effect modification by XX

Theorem S1.

S12 Extension to polytomous treatment and effect

Lemma S1.

Theorem S2 (Identification of q(A,Z,X)q(A,Z,X) and hk(A,W,X)h_{k}(A,W,X)).

Theorem S3 (Identification of θjck\theta_{jc}^{k}).

2.2 Proximal Identification of $\beta_{0}$ in Li et al. (2022)

Theorem 2.1 (Identification of $q(A,Z,X)$ ; Li et al. (2022) Theorem 2’).

Theorem 2.2 (Identification of $\beta_{0}$ ; Li et al. (2022) Theorem 1’).

Theorem 2.3 (Identification of $h(A,W,X)$ ).

Theorem 2.4 (Identification of $\beta_{0}$ ).

2.4 A doubly robust closed-form expression for $\beta_{0}$

Theorem 2.5 (A doubly-robust closed-form expression for $\beta_{0}$ ).

Theorem 2.6 (Moment conditions for $q(A,Z,X)$ and $h(A,W,X)$ ).

S6 Detailed setting and derivation of $q(A,Z,X)$ and $h(A,W,X)$ in Example 2.8

S7 Cross-fitting kernel learning estimator for nonparametric estimation of $\beta_{0}$ .

Scenario I: Univariate binary $U$ , $Z$ and $W$

Scenario II-V: Continuous $U$ , $X$ , $Z$ , $W$ .

S11 Estimating conditional odds ratio under effect modification by $X$

Theorem S2 (Identification of $q(A,Z,X)$ and $h_{k}(A,W,X)$ ).

Theorem S3 (Identification of $\theta_{jc}^{k}$ ).