
Causal Effects in Twin Studies: the Role of Interference

Bonnie Smith1, Elizabeth L. Ogburn1, Matt McGue2,
Saonli Basu3, Daniel O. Scharfstein1
(1 Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD
2 Department of Psychology, University of Minnesota, Minneapolis, MN
3 Division of Biostatistics, University of Minnesota, Minneapolis, MN
[email protected])
Abstract

The use of twin designs to address causal questions is becoming increasingly popular. A standard assumption is that there is no interference between twins, that is, that no twin's exposure has a causal impact on their co-twin's outcome. However, there may be settings in which this assumption does not hold, and this would (1) impact the causal interpretation of parameters obtained by commonly used existing methods; (2) change which effects are of greatest interest; and (3) impact the conditions under which we may estimate these effects. We explore these issues, and we derive semi-parametric efficient estimators for causal effects in the presence of interference between twins. Using data from the Minnesota Twin Family Study, we apply our estimators to assess whether twins' consumption of alcohol in early adolescence may have a causal impact on their co-twins' substance use later in life.


Keywords: Co-twin control method; Semi-parametric efficiency; Spillover effect

1 Introduction

While researchers have predominantly used twin studies to assess the heritability of a given trait, they are also increasingly leveraging twin designs to learn about the causal relationship between an exposure and an outcome; see for example [15, 7, 8, 9, 10, 19] and references therein. A popular tool in this area of twin research is the co-twin control method, in which an unexposed twin serves as the control for their exposed co-twin. This method uses the fact that twins are naturally matched on many predictors of the exposure and the outcome, such as shared genetics and characteristics of their shared environment; this matching can make it possible to estimate the causal effect of the exposure on the outcome even if these shared predictors are unobserved [15]. In this paper, we draw attention to an important causal assumption that underlies this approach: the assumption that there is no interference between twins, meaning that one twin's exposure has no causal impact on their co-twin's outcome. This is a standard assumption in the twins literature [21, 10]; however, evidence from the literature on sibling influence suggests that it may not hold in some cases. A number of studies have identified variables, such as substance use, for which a subject's behavior may be causally impacted by that of their siblings [22, 11, 17]. Since there is evidence of interference between siblings in these settings, it seems plausible that interference could be present between twins in particular.

The possibility of interference between twins is important for a number of reasons. One is that the parameter estimated using the co-twin control method has a different causal interpretation when interference is present; incorrectly assuming that there is no interference may therefore lead one to draw misleading conclusions. Interference also affects which parameters are of most scientific interest, and the conditions under which these causal parameters can be identified from the observed data. If interference is present, key causal effects of interest include spillover effects, the changes in twins' outcomes that would result from intervening on their co-twins' exposures while holding their own exposures fixed; and main effects, the changes in twins' outcomes that would result from intervening on their own exposures while holding their co-twins' exposures fixed. Unfortunately, the co-twin control method does not provide a means of estimating spillover effects or main effects when there is interference between twins; rather, it targets an effect of more limited scientific and policy relevance.

In this paper, we clarify which causal effect is estimated by the co-twin control method when there is interference between twins, and argue that it is not a causal effect of prime interest. In order to identify effects that are of prime interest, such as main effects and spillover effects, we then proceed under the assumption that the measured baseline covariates control for all confounding of the effects of the exposures on the outcomes. In this framework, we derive estimators of average main effects and average spillover effects which are semi-parametric efficient; that is, they make the most efficient use of the data possible, given the assumptions of the statistical models that we use.

In Section 2, we define potential outcomes and causal effects for our setting of independent pairs of twins with possible between-twin interference, and we review the co-twin control method for the case where there is no interference. In Section 3, we consider the co-twin control method with interference, then introduce the causal models that we will use in the rest of the paper, which make the assumption of no unmeasured confounding. In Section 4 we present the semi-parametric efficient estimators of average spillover effects and average main effects in these models. In Section 5 we apply these estimators to data from the Minnesota Twin Family Study, and investigate whether twins’ exposure to alcohol in early adolescence may have a causal impact on their co-twins’ drinking behavior in adulthood. We demonstrate the finite-sample performance of our estimators in a simulation study in Section 6, and conclude with a discussion in Section 7.

2 Background

2.1 Causal effects for the setting of within-pairs interference

Consider a twin study with a binary exposure $A$, where $A=1$ if the subject is exposed and $A=0$ if the subject is unexposed, and an outcome $Y$, and suppose first that there is no interference between subjects. In this setting each subject is considered to have two potential outcomes: $Y^{1}$, the outcome that the subject would have if, possibly counter to fact, they were to have the exposure; and $Y^{0}$, the outcome that the subject would have if, possibly counter to fact, they were to be unexposed. The observed outcome $Y$ is assumed to be equal to the potential outcome $Y^{1}$ for subjects who have exposure $A=1$, and equal to $Y^{0}$ for subjects with $A=0$. While only one potential outcome is observed for each twin (and hence the causal effect $Y^{1}-Y^{0}$ for an individual is never known), under additional assumptions of positivity and exchangeability the population average of these effects can be estimated from the observed data. Common targets of inference in twin studies include the average causal effect $E\big[Y^{1}-Y^{0}\big]$, where the mean is over all twins in the target population, and the average causal effect in the subgroup of twins who are discordant with their co-twin for the exposure.

Now suppose instead that there may be interference between the two twins in each pair, but not between twins from different twin pairs, so that each twin's outcome may be impacted by their own exposure and their co-twin's exposure. In this case we cannot meaningfully speak of a twin's outcome if they were to be exposed, $Y^{1}$, since they might have one outcome if both they and their co-twin were to be exposed, and a different outcome if they were to be exposed but their co-twin were not. Thus, in this setting, each twin has four potential outcomes: $Y^{1,1}$, $Y^{1,0}$, $Y^{0,1}$, and $Y^{0,0}$, where we write $Y^{a,b}$ for the outcome that the twin would have if, possibly counter to fact, they were to have exposure $a$ and their co-twin were to have exposure $b$. For a twin who has exposure $A=a$ and whose co-twin has exposure $A=b$, the observed outcome $Y$ is assumed to be equal to the potential outcome $Y^{a,b}$, while the twin's other three potential outcomes are not observed. There are a number of different causal effects that we can consider in this setting: of particular interest are average main effects $E\big[Y^{0,0}-Y^{1,0}\big]$ and $E\big[Y^{0,1}-Y^{1,1}\big]$, in which the subject's own exposure is varied while their co-twin's exposure is held constant; and average spillover effects $E\big[Y^{0,0}-Y^{0,1}\big]$ and $E\big[Y^{1,0}-Y^{1,1}\big]$, in which the subject's own exposure is held constant while their co-twin's exposure is varied.
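To make these definitions concrete, the following minimal simulation sketch (ours, not from the paper) generates all four potential outcomes for each twin, so that average main and spillover effects can be computed directly; the linear data-generating process and the effect sizes 1.0 and 0.5 are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000  # number of twins (illustrative)

# Hypothetical data-generating process: each twin has four potential
# outcomes Y^{a,b}, indexed by own exposure a and co-twin's exposure b.
# Own exposure shifts the outcome by 1.0, co-twin's exposure by 0.5.
U = rng.normal(size=n)  # a shared background factor
Y = {(a, b): U + 1.0 * a + 0.5 * b + rng.normal(size=n)
     for a in (0, 1) for b in (0, 1)}

# Average main effect E[Y^{0,0} - Y^{1,0}]: vary own exposure while
# holding the co-twin's exposure fixed at 0 (close to -1.0 here).
main_effect = np.mean(Y[(0, 0)] - Y[(1, 0)])

# Average spillover effect E[Y^{0,0} - Y^{0,1}]: hold own exposure
# fixed at 0 while varying the co-twin's exposure (close to -0.5 here).
spillover_effect = np.mean(Y[(0, 0)] - Y[(0, 1)])
```

In a real study only one of the four potential outcomes per twin is observed, which is why the identification assumptions of Section 3 are needed.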

Methods for estimating average main effects and average spillover effects have been studied by several authors. One body of research considers the setting of partial interference, in which there are multiple distinct groups of subjects and interference does not operate across different groups. See for example [5, 6, 23, 12, 14] for estimation of average main effects and average spillover effects under partial interference, and see [13, 12, 14] for asymptotic properties of the estimators. A different strand of the interference literature considers the setting where there is only one group, such as a single connected social network where interference could occur between any of the subjects. See for example [1, 26, 18, 16, 24]. The context that we consider here, namely independent groups of size two, falls into the first of these two frameworks. To the best of our knowledge, no previous work on partial interference demonstrated semi-parametric efficiency of an estimator. Here we derive semi-parametric efficient estimators for two models that are tailored to the setting of independent pairs.

2.2 Notation and data structure

In describing the data for a twin pair, we distinguish between baseline covariates which are characteristics of the pair (and thus necessarily common to the two twins in a given pair), such as zygosity or parental characteristics, and baseline covariates which are characteristics of an individual twin and may or may not be the same for the two twins in a given pair. We refer to these respectively as shared covariates and individual covariates. In each twin pair, suppose that the twins have been randomly labeled as Twin 1 and Twin 2. The observed data for a given pair is $O=(C,X_{1},X_{2},A_{1},A_{2},Y_{1},Y_{2})$, where $C$ denotes the shared baseline covariates, $X_{j}$ denotes the individual baseline covariates for Twin $j$, and $A_{j}$ and $Y_{j}$ are the binary exposure and the outcome for Twin $j$. Throughout, $C$, $X_{1}$, and $X_{2}$ will refer to collections of measured baseline covariates. In some sections we will also consider factors which are unobserved; we will use notation such as $U$ or $U_{sh}$ when we refer to collections of variables at least some of which are unobserved. We assume that we observe data for $n$ twin pairs, and that the observed data $O_{1},\ldots,O_{n}$ for these $n$ pairs are independent and identically distributed.

We use the notation $Y_{j}^{a,b}$ for a potential outcome for Twin $j$: specifically, let $Y_{j}^{a,b}$ be the outcome that Twin $j$ would have if, possibly counter to fact, Twin $j$ were to have exposure $a$ and their co-twin were to have exposure $b$. Our primary targets of inference will be means of potential outcomes $E\big[Y^{a,b}\big]$ and contrasts of such parameters, where the mean is taken over all twins in the population. We can also write $E\big[Y^{a,b}\big]$ as $E\big[\frac{1}{2}\big(Y_{1}^{a,b}+Y_{2}^{a,b}\big)\big]$, where $\frac{1}{2}\big(Y_{1}^{a,b}+Y_{2}^{a,b}\big)$ is the average of the potential outcomes within a twin pair, and the last expectation is a mean taken over all pairs. We will often express $E\big[Y^{a,b}\big]$ this way, so that the units we work in are the independent twin pairs rather than the individual twins.

2.3 The co-twin control method under the assumption of no interference

Here we review the co-twin control method for the setting where there is no interference between twins, following Sjölander et al. [21] and McGue et al. [15]; in Section 3 we consider how the method behaves when interference is present.

The causal effect of the exposure $A_{j}$ on the outcome $Y_{j}$ may be confounded by any predictor $V$ that is a common cause of both $A_{j}$ and $Y_{j}$ (since there will be a non-causal source of association between $A_{j}$ and $Y_{j}$ via the path through $V$). Such a predictor could be a measured or unmeasured factor individual to Twin $j$, or a measured or unmeasured factor shared by both twins. While we can explicitly adjust for measured confounders, unobserved confounders would typically preclude identification of the causal effect. However, using the co-twin control method, all shared factors, whether measured or unmeasured, are naturally accounted for by the fact that the two twins are matched on these factors. Sjölander et al. [21] therefore showed that, in cases where there is no unmeasured confounding due to individual factors, a causal effect is identified by adjusting for the measured individual covariates $X_{1}$ and $X_{2}$.

Let $U_{sh}$ denote the set of all confounders, including both measured and unmeasured factors, which are shared between the two twins. Consider the causal diagram in Figure 1(a), where a directed arrow means that the first variable has a possible causal impact on the second, while a bidirected arc between two variables indicates that there may be unobserved variables that are common causes of the two. Assume that the relationship among the variables is given by this diagram. In particular, assume that there is no interference between the two twins, as signaled by the absence of directed arrows $A_{1}\to Y_{2}$ and $A_{2}\to Y_{1}$, and assume that there are no non-shared unmeasured variables that directly impact both $A_{j}$ and $Y_{j}$.

Consider the subgroup of twins who are discordant with their co-twins for the exposure, but who have the same level $x$ of individual covariates as their co-twins. Write $E\big[Y^{1}-Y^{0}\,|\,A_{1}\neq A_{2},X_{1}=X_{2}=x\big]$ for the average causal effect in this subgroup. For an exposed twin in this subgroup, their potential outcome $Y^{1}$ is observed; and while their potential outcome $Y^{0}$ is not observed, their unexposed co-twin's potential outcome $Y^{0}$ is observed. Sjölander et al. [21] show that, because of symmetry between the group of twins designated Twin 1 and the group designated Twin 2, we may use the unexposed co-twins' outcomes as proxies for the exposed twins' $Y^{0}$ potential outcomes, and that the causal parameter $E\big[Y^{1}-Y^{0}\,|\,A_{1}\neq A_{2},X_{1}=X_{2}=x\big]$ is equal to $E\big[Y_{j}\,|\,A_{j}=1,A_{3-j}=0,X_{1}=X_{2}=x\big]-E\big[Y_{j}\,|\,A_{j}=0,A_{3-j}=1,X_{1}=X_{2}=x\big]$, the mean difference between the observed outcome for the exposed twin and the observed outcome for the unexposed twin within this subgroup. The latter difference can then be estimated, for example by fitting a between-within regression model as described in Sjölander et al. [21]. Thus, when all individual confounders are fully observed, the within-pair coefficient in the between-within regression model has a causal interpretation: it is the causal effect for the subgroup of twins who are discordant with their co-twin for the exposure but have the same level of individual covariates as their co-twin.
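The estimation step can be sketched as follows (our illustration, not the authors' code). We simulate pairs with a shared unmeasured confounder $U$ and fit a between-within regression by ordinary least squares; including the pair-mean exposure absorbs the between-pair confounding, so the within-pair coefficient recovers the true effect (set to 2.0 as an assumption of this sketch) even though $U$ is never observed.

```python
import numpy as np

rng = np.random.default_rng(1)
n_pairs = 50_000

# Hypothetical generative model: shared unmeasured confounder U,
# no interference; the true causal effect of A_j on Y_j is 2.0.
U = rng.normal(size=n_pairs)              # shared, unobserved
X = rng.normal(size=(n_pairs, 2))         # individual covariates
p = 1 / (1 + np.exp(-(U[:, None] + X)))   # exposure depends on U and X
A = rng.binomial(1, p)
Y = 2.0 * A + U[:, None] + 0.5 * X + rng.normal(size=(n_pairs, 2))

# Between-within regression: Y_j ~ 1 + A_j + Abar + X_j + X_{3-j},
# stacking the two twins of each pair as separate rows.
Abar = A.mean(axis=1)
design = np.vstack([
    np.column_stack([np.ones(n_pairs), A[:, j], Abar, X[:, j], X[:, 1 - j]])
    for j in (0, 1)])
y = np.concatenate([Y[:, 0], Y[:, 1]])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
beta_W = coef[1]  # within-pair coefficient; close to 2.0 despite unobserved U
```

Including the pair means of the regressors is the Mundlak device, which makes the coefficient on $A_j$ equal to the within-pair (fixed-effects) estimator, so the shared confounder differences out.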

Figure 1: Causal diagrams for two settings where the co-twin control method can be used: (a) assuming no interference; (b) allowing for interference. Here $U_{sh}$ represents all confounders, both measured and unmeasured, which are shared by the two twins, while $X_{j}$ are measured individual covariates for Twin $j$, and $A_{j}$ and $Y_{j}$ are Twin $j$'s exposure and outcome. The blue arrows in (b) allow for possible interference between the two twins, i.e. one twin's exposure may have a causal impact on their co-twin's outcome.

3 Identification of causal effects when interference is present

Now we consider a setting where interference between twins may exist, that is, where it is possible for one twin's exposure to affect their co-twin's outcome. For example, this could occur because the two twins influence one another's behavior or share information related to the exposure with each other. Interference is likely to be present in many studies where the exposure and outcome are behavioral. Incorrectly assuming that there is no interference can lead to misleading conclusions, and it ignores spillover effects, which may themselves be of substantive interest.

3.1 The co-twin control effect when there is interference between twins

Here we modify the scenario described in Section 2.3 to allow for interference between the two twins in each pair. Suppose now that the relationship among the variables is as in Figure 1(b). In particular, there may be shared factors (both measured and unmeasured) which impact both the exposures and the outcomes, but we assume that there are no unmeasured non-shared factors that directly cause both $A_{j}$ and $Y_{j}$, or that directly cause both $A_{3-j}$ and $Y_{j}$. As before, the matching between the twins on shared factors allows for identification of one subgroup causal effect; importantly, this is the effect which is estimated using the co-twin control method when interference is present. However, we argue that this is not an effect of prime importance, and that a different approach is needed in order to identify more useful effects such as average main effects and average spillover effects.

Between-within regression model: $E\big[Y_{j}\,|\,A_{j},A_{3-j},X_{j},X_{3-j}\big]=\beta_{0}+\beta_{W}A_{j}+\beta_{B}\overline{A}+\gamma X_{j}+\delta X_{3-j}$

Statistical value of the regression coefficient $\beta_{W}$: $E\big[Y_{j}\,|\,A_{j}=1,A_{3-j}=0,X_{1}=X_{2}\big]-E\big[Y_{j}\,|\,A_{j}=0,A_{3-j}=1,X_{1}=X_{2}\big]$

Causal interpretation of $\beta_{W}$ with no interference between twins: $E\big[Y^{1}-Y^{0}\,|\,A_{1}\neq A_{2},X_{1}=X_{2}\big]$

Causal interpretation of $\beta_{W}$ with interference between twins: $E\big[Y^{1,0}-Y^{0,1}\,|\,A_{1}\neq A_{2},X_{1}=X_{2}\big]$

Table 1: Comparison of the causal interpretation of the within-pair coefficient $\beta_{W}$ in a between-within regression model, for the settings with and without interference between twins. In the no-interference case, $Y^{1}-Y^{0}$ is the difference in a twin's outcome under an intervention changing the twin from unexposed to exposed. In the interference case, $Y^{1,0}-Y^{0,1}$ is the difference in a twin's outcome under an intervention changing the twin from unexposed to exposed, while conversely changing their co-twin from exposed to unexposed.

As in Section 2.3, consider the subgroup of twins who are discordant with their co-twins for the exposure, but who have the same level $x$ of individual covariates as their co-twins. For an exposed twin in this subgroup, the potential outcome $Y^{1,0}$ is observed, and for an unexposed twin in this subgroup, the potential outcome $Y^{0,1}$ is observed. In Appendix A, we prove that, under the assumptions listed there, the mean of the contrast $Y^{1,0}-Y^{0,1}$ in this subgroup, $E\big[Y^{1,0}-Y^{0,1}\,|\,A_{1}\neq A_{2},X_{1}=X_{2}=x\big]$, is equal to $E\big[Y_{j}\,|\,A_{j}=1,A_{3-j}=0,X_{1}=X_{2}=x\big]-E\big[Y_{j}\,|\,A_{j}=0,A_{3-j}=1,X_{1}=X_{2}=x\big]$. Note that the latter difference is the same quantity that appears in Section 2.3 and is estimated using a between-within regression model, but that the causal interpretation of this parameter is different when there is interference between twins. The different interpretations for the two settings are highlighted in Table 1. The effect $E\big[Y^{1,0}-Y^{0,1}\,|\,A_{1}\neq A_{2},X_{1}=X_{2}=x\big]$ identified here is the average difference in twins' outcomes that we would see if we could intervene to change the twins from unexposed to exposed, while conversely intervening to change their co-twins from exposed to unexposed, in the subgroup described above. It is difficult to see why one would want to target this particular combination of interventions. The contrast $Y^{1,0}-Y^{0,1}$ is the difference of the spillover effect $Y^{0,0}-Y^{0,1}$ and the main effect $Y^{0,0}-Y^{1,0}$; however, we cannot tease apart these two effects, and the value of $Y^{1,0}-Y^{0,1}$ itself is not readily interpretable. For example, a zero value of $Y^{1,0}-Y^{0,1}$ could reflect the fact that $Y^{1,0}=Y^{0,0}=Y^{0,1}$, or it could reflect a qualitatively different scenario where $Y^{0,0}-Y^{0,1}$ and $Y^{0,0}-Y^{1,0}$ are each nonzero but cancel each other out. That is, a null value of $Y^{1,0}-Y^{0,1}$ is equally compatible with the exposure having no causal effect and with the presence of very strong interference.
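A toy numeric case (ours, with illustrative values only) makes the cancellation concrete:

```latex
\begin{aligned}
&\text{Hypothetical values: } Y^{0,0}=3,\quad Y^{1,0}=2,\quad Y^{0,1}=2.\\
&\text{Main effect: } Y^{0,0}-Y^{1,0}=1\neq 0,\qquad
 \text{spillover effect: } Y^{0,0}-Y^{0,1}=1\neq 0,\\
&\text{yet the co-twin control contrast is null: } Y^{1,0}-Y^{0,1}=2-2=0.
\end{aligned}
```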

The co-twin control method is a unique tool which allows for estimation of a causal effect even with shared unmeasured confounders, based on the use of discordant pairs. However, we cannot identify contrasts involving $Y^{1,1}$ or $Y^{0,0}$ from discordant pairs when interference is present, since these potential outcomes need not equal the observed outcome of either twin in a discordant pair. The presence of shared unmeasured confounders therefore poses a greater barrier than it does in settings with no interference, since the co-twin control method does not provide a means of estimating important causal effects for the interference setting. In order to identify average main effects and average spillover effects, throughout the rest of the paper we work under the assumption that there is no unmeasured confounding due to either individual or shared factors.

3.2 Identification under the assumption of no unmeasured confounding

Throughout the rest of the paper we make the following assumption of no unmeasured confounding: there are no unmeasured factors, whether shared or non-shared, which directly cause both $A_{j}$ and $Y_{j}$, or which directly cause both $A_{3-j}$ and $Y_{j}$. Specifically, we assume that any unmeasured factors $U$ which impact both an outcome and an exposure do so only through the measured baseline covariates $C$, $X_{1}$, and $X_{2}$. Under this assumption, adjusting for the measured baseline covariates $C,X_{1},X_{2}$ controls for all confounding of the effects of $A_{j}$ and $A_{3-j}$ on $Y_{j}$. This is an untestable assumption which is not automatically satisfied by design in an observational study; how reasonable it is in a given study will depend on factors specific to that study, including how rich a set of baseline covariates is measured.

Consider the four groups of twin pairs defined by the four exposure patterns $(A_{1}=1,A_{2}=1)$, $(A_{1}=1,A_{2}=0)$, $(A_{1}=0,A_{2}=1)$, and $(A_{1}=0,A_{2}=0)$, and consider a specific potential outcome $Y_{j}^{a,b}$. In general the potential outcomes $Y_{j}^{a,b}$ could be systematically higher in one of these four groups than in another. For example, they could be higher in the $(A_{1}=1,A_{2}=1)$ group than in the $(A_{1}=0,A_{2}=0)$ group if having some predictor $V$ causes both exposures and potential outcomes to be higher. However, our assumption of no unmeasured confounding implies that adjusting for the measured baseline covariates breaks any such association between $Y_{j}^{a,b}$ and the exposures, so that within levels of the measured baseline covariates $C,X_{1},X_{2}$, the four groups are exchangeable in terms of the potential outcomes $Y_{j}^{a,b}$. This assumption of no unmeasured confounding, or exchangeability, is given by:

A1: $Y_{j}^{a,b}\perp\!\!\!\perp(A_{1},A_{2})\ \big|\ (C,X_{1},X_{2})$ for each fixed $a,b=0,1$, $j=1,2$.   (exchangeability)

We additionally make the positivity assumption that, within each level of the baseline covariates, there is a nonzero probability of having each of the four exposure patterns, and the consistency assumption that a twin's observed outcome $Y_{j}$ is equal to the potential outcome corresponding to the exposures actually received by the twin and their co-twin:

A2: For all $c,x_{1},x_{2}$ in the support of $C,X_{1},X_{2}$: $P\big(A_{1}=a,A_{2}=b\,|\,C=c,X_{1}=x_{1},X_{2}=x_{2}\big)>0$ for all $a,b=0,1$.   (positivity)

A3: If $A_{j}=a$ and $A_{3-j}=b$, then $Y_{j}=Y_{j}^{a,b}$.   (consistency)

The assumptions of exchangeability, positivity, and consistency are sufficient for identification of the parameters $E\big[Y_{j}^{a,b}\big]$. That is, under these three assumptions, we can express the causal parameter $E\big[Y_{j}^{a,b}\big]$ as a function of the observed data, as we show for completeness below. Exchangeability and positivity allow us, within each level of the baseline covariates, to take the mean over just the group with the exposure pattern $(A_{j}=a,A_{3-j}=b)$ rather than over all twins labeled as Twin $j$ in the population; and consistency implies that, among the twins in this group, the potential outcome $Y_{j}^{a,b}$ is equal to the observed outcome $Y_{j}$.

$$\begin{aligned}
E\big[Y_{j}^{a,b}\big] &= \int_{y} y\, dF\big(Y_{j}^{a,b}=y\big)\\
&= \int_{c,x_{1},x_{2}}\int_{y} y\, dF\big(Y_{j}^{a,b}=y\,\big|\,c,x_{1},x_{2}\big)\, dF(c,x_{1},x_{2})\\
&= \int_{c,x_{1},x_{2}}\int_{y} y\, dF\big(Y_{j}^{a,b}=y\,\big|\,A_{j}=a,A_{3-j}=b,c,x_{1},x_{2}\big)\, dF(c,x_{1},x_{2}) && \text{by A1 and A2}\\
&= \int_{c,x_{1},x_{2}}\int_{y} y\, dF\big(Y_{j}=y\,\big|\,A_{j}=a,A_{3-j}=b,c,x_{1},x_{2}\big)\, dF(c,x_{1},x_{2}) && \text{by A3}\\
&= E\Big[E\big[Y_{j}\,\big|\,A_{j}=a,A_{3-j}=b,C,X_{1},X_{2}\big]\Big].
\end{aligned}$$
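This identification result suggests a plug-in (g-formula) estimator: fit the outcome regression and standardize it over the empirical covariate distribution, averaging over the two twins as in Section 2.2. The sketch below is ours; the linear outcome model and the effect sizes are assumptions of the illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 40_000

# Hypothetical data satisfying A1-A3: all confounding runs through the
# measured covariates (C, X1, X2).  True main effect 1.0, spillover 0.4.
C = rng.normal(size=n)
X = rng.normal(size=(n, 2))
A = rng.binomial(1, 1 / (1 + np.exp(-(C[:, None] + X))))
Y = (1.0 * A + 0.4 * A[:, ::-1] + C[:, None] + 0.5 * X
     + rng.normal(size=(n, 2)))

def g_formula(a, b):
    """Plug-in estimate of E[ E[Y_j | A_j=a, A_{3-j}=b, C, X1, X2] ],
    averaged over j = 1, 2, using a linear outcome regression."""
    est = 0.0
    for j in (0, 1):
        D = np.column_stack([np.ones(n), A[:, j], A[:, 1 - j],
                             C, X[:, j], X[:, 1 - j]])
        coef, *_ = np.linalg.lstsq(D, Y[:, j], rcond=None)
        D_ab = D.copy()
        D_ab[:, 1], D_ab[:, 2] = a, b   # set (A_j, A_{3-j}) = (a, b)
        est += (D_ab @ coef).mean() / 2
    return est

avg_main = g_formula(0, 0) - g_formula(1, 0)        # close to -1.0
avg_spillover = g_formula(0, 0) - g_formula(0, 1)   # close to -0.4
```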

We will consider two models for the data for each twin pair. The larger of these, Model 1, makes only the assumptions used for identification above, and corresponds to the diagram in Figure 2(a). Here $U$ represents any unmeasured factors (shared or not). The absence of arrows pointing directly from $U$ to the exposures and the outcomes corresponds to the no unmeasured confounding assumption. Model 2 is a smaller model in which we make three assumptions in addition to the identification assumptions, and corresponds to the diagram in Figure 2(b). These extra assumptions, A4-A6, are that Twin $j$'s individual covariates do not have a causal impact on their co-twin's exposure (the lack of a directed arrow $X_{j}\to A_{3-j}$) or on their co-twin's outcome (the lack of a directed arrow $X_{j}\to Y_{3-j}$), and that Twin 1's and Twin 2's exposures are conditionally independent given the measured shared and individual baseline covariates (the lack of a bidirected arc $A_{1}\leftrightarrow A_{2}$).

A4: $A_{j}\perp\!\!\!\perp X_{3-j}\ |\ (C,X_{j})$ for $j=1,2$.

A5: $Y_{j}^{a,b}\perp\!\!\!\perp X_{3-j}\ |\ (C,X_{j})$ for all $a,b=0,1$, $j=1,2$.   (Model 2 assumptions)

A6: $A_{1}\perp\!\!\!\perp A_{2}\ |\ (C,X_{1},X_{2})$.

The assumption that Twin jj’s individual-level covariates do not impact their co-twin’s exposure or outcome is one that is commonly used in the twin literature [21], and the distinction drawn in Model 2 between the individual and the shared covariates is a feature that distinguishes Model 2 from models considered elsewhere in the interference literature. While Model 1, being less restrictive, gives valid inference in a wider range of settings than Model 2, the advantage of Model 2 is that it allows for improved (asymptotic) efficiency: in settings which meet the assumptions posited for Model 2, we may leverage these additional assumptions to make a more efficient use of the data in estimating our causal parameters of interest, allowing for narrower confidence intervals in large samples. We demonstrate this improved efficiency in a simulation study in Section 6.

Figure 2: Causal diagrams for two models allowing between-twin interference but making the assumption of no unmeasured confounding: (a) Model 1; (b) Model 2. $C,X_{1},X_{2}$ are measured baseline covariates, while here $U$ represents unmeasured factors (shared and/or non-shared). In Model 2 we also assume that a twin's individual covariates do not have a causal impact on their co-twin's exposure or outcome, and that Twin 1's and Twin 2's exposures are conditionally independent given observed covariates.

4 Efficient estimation of causal effects when interference is present

4.1 Efficient estimators

Here we derive the efficient estimator of the parameter $\beta^{a,b}=E\big[Y^{a,b}\big]$ for each of the two models described in Section 3.2. That is, we derive the estimator $\widehat{\beta}_{1}^{a,b}$ with the smallest asymptotic variance among all estimators of $\beta^{a,b}$ which are regular and asymptotically linear (RAL) under every distribution in Model 1, and similarly for Model 2. Since Model 2 is a smaller model contained in Model 1, any RAL estimator for Model 1 may be used in Model 2, while the converse need not hold; and we will see that, in fact, there are RAL estimators for Model 2 that are more efficient than any Model 1 estimator.

RAL estimators are asymptotically normal with asymptotic variance determined by their influence function. In order to obtain the RAL estimator with the smallest possible asymptotic variance, we first derive the efficient influence function $\varphi_{k}(O;\beta^{a,b})$ for the parameter $\beta^{a,b}$ in each Model $k$. The efficient estimator $\widehat{\beta}_{k}^{a,b}$ based on data $O_{1},\ldots,O_{n}$ from $n$ independent twin pairs is then found as the solution to the estimating equation $\sum_{i=1}^{n}\varphi_{k}(O_{i};\beta^{a,b})=0$. See Tsiatis [25] for more on efficient influence functions.

The efficient influence function for $\beta^{a,b}=E\big[Y^{a,b}\big]$ in Model 1 is:

$$\begin{aligned}
\varphi_{1}\big(O;\beta^{a,b}\big)=\ &\frac{1}{2}\Bigg\{\frac{\mathds{1}(A_{1}=a,A_{2}=b)}{P(A_{1}=a,A_{2}=b\,|\,C,X_{1},X_{2})}\Big(Y_{1}-E\big[Y_{1}\,|\,A_{1}=a,A_{2}=b,C,X_{1},X_{2}\big]\Big)\\
&\quad+E\big[Y_{1}\,|\,A_{1}=a,A_{2}=b,C,X_{1},X_{2}\big]\Bigg\}\\
+\ &\frac{1}{2}\Bigg\{\frac{\mathds{1}(A_{2}=a,A_{1}=b)}{P(A_{2}=a,A_{1}=b\,|\,C,X_{1},X_{2})}\Big(Y_{2}-E\big[Y_{2}\,|\,A_{2}=a,A_{1}=b,C,X_{1},X_{2}\big]\Big)\\
&\quad+E\big[Y_{2}\,|\,A_{2}=a,A_{1}=b,C,X_{1},X_{2}\big]\Bigg\}-\beta^{a,b}.
\end{aligned}$$

Proofs are given in Appendix B. In order to use $\varphi_{1}(O;\beta^{a,b})$ to obtain an estimator of $\beta^{a,b}$, we need a model for the joint propensity score $P(A_{j}=a,A_{3-j}=b\mid C,X_{j},X_{3-j})$ and a model for the outcome regression $E[Y_{j}\mid A_{j},A_{3-j},C,X_{j},X_{3-j}]$. The resulting estimator of $\beta^{a,b}$ will be efficient provided that both of these models are correctly specified, and provided that the estimated values converge to the truth at fast enough rates. (See Section 4.2 for more discussion of rates.) Suppose this is the case. Let $\hat{\pi}_{a,b}(C_{i},X_{ij},X_{i,3-j})$ be the predicted joint propensity score for a twin pair with covariates $C_{i},X_{ij},X_{i,3-j}$, and let $\hat{\mu}_{j,a,b}(C_{i},X_{ij},X_{i,3-j})$ be the predicted outcome regression for Twin $j$ in a twin pair with covariates $C_{i},X_{ij},X_{i,3-j}$ if the twins' exposures were set to $A_{j}=a, A_{3-j}=b$. Then the efficient estimator for $\beta^{a,b}$ in Model 1, based on data $O_{1},\ldots,O_{n}$, is given by:

\begin{align*}
\widehat{\beta}^{a,b}_{1} = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{2}\Bigg\{ & \frac{\mathds{1}(A_{i1}=a,A_{i2}=b)}{\hat{\pi}_{a,b}(C_{i},X_{i1},X_{i2})}\Big(Y_{i1}-\hat{\mu}_{1,a,b}(C_{i},X_{i1},X_{i2})\Big)+\hat{\mu}_{1,a,b}(C_{i},X_{i1},X_{i2}) \\
& + \frac{\mathds{1}(A_{i1}=b,A_{i2}=a)}{\hat{\pi}_{a,b}(C_{i},X_{i2},X_{i1})}\Big(Y_{i2}-\hat{\mu}_{2,a,b}(C_{i},X_{i2},X_{i1})\Big)+\hat{\mu}_{2,a,b}(C_{i},X_{i2},X_{i1})\Bigg\}.
\end{align*}

The estimator $\widehat{\beta}_{1}^{a,b}$ is the augmented inverse probability weighted (AIPW) estimator for a bivariate exposure. It is a doubly robust estimator [20]: as long as at least one of the propensity score model and the outcome regression model is correctly specified, $\widehat{\beta}_{1}^{a,b}$ remains a consistent estimator of $\beta^{a,b}$, even if the other model is misspecified. Liu et al. [14] presented three doubly robust estimators of average main effects and average spillover effects in groups of arbitrary size, working in the analog of Model 1; our estimator $\widehat{\beta}_{1}^{a,b}$ corresponds to one of their estimators specialized to groups of size 2. Our contribution as regards Model 1 is to prove that this estimator is semiparametric efficient in Model 1, thus providing a partial answer to a question posed by Liu et al. [14]. To the best of our knowledge, Model 2 is distinct from other models that have been considered in the interference literature.
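To make the form of $\widehat{\beta}_{1}^{a,b}$ concrete, the sketch below computes it on simulated twin pairs, plugging in the true propensity score and outcome regression in place of fitted models. The data-generating mechanism (one shared covariate $C$, one individual covariate per twin, a main effect of 2 and a spillover effect of 0.5) is hypothetical and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000  # large n so Monte Carlo error is small

# Hypothetical mechanism: one shared covariate C, one individual covariate per twin.
C = rng.normal(size=n)
X1 = rng.normal(size=n)
X2 = rng.normal(size=n)

def expit(z):
    return 1.0 / (1.0 + np.exp(-z))

# Each twin's exposure depends on the shared and their own covariate.
p1 = expit(0.4 * C + 0.6 * X1)  # P(A1 = 1 | C, X1)
p2 = expit(0.4 * C + 0.6 * X2)  # P(A2 = 1 | C, X2)
A1 = rng.binomial(1, p1)
A2 = rng.binomial(1, p2)

def mu(a, b, C, X_own):
    # True outcome regression: main effect 2, spillover effect 0.5.
    return 1.0 + 2.0 * a + 0.5 * b + C + X_own

Y1 = mu(A1, A2, C, X1) + rng.normal(size=n)
Y2 = mu(A2, A1, C, X2) + rng.normal(size=n)

def beta1_hat(a, b):
    """AIPW estimator beta_hat_1^{a,b}: average, over both twin roles, of the
    inverse-probability-weighted residual plus the predicted outcome regression."""
    total = 0.0
    for A_own, A_co, Y, p_own, p_co, X_own in [
        (A1, A2, Y1, p1, p2, X1),
        (A2, A1, Y2, p2, p1, X2),
    ]:
        # Joint propensity P(A_own=a, A_co=b | C, X1, X2); here the
        # exposures are conditionally independent, so it factors.
        pi = (p_own if a else 1 - p_own) * (p_co if b else 1 - p_co)
        ind = (A_own == a) & (A_co == b)
        m = mu(a, b, C, X_own)
        total += 0.5 * np.mean(ind / pi * (Y - m) + m)
    return total

beta_00 = beta1_hat(0, 0)  # estimand beta^{0,0} = 1.0 in this mechanism
beta_01 = beta1_hat(0, 1)  # estimand beta^{0,1} = 1.5 in this mechanism
print(beta_00, beta_01, beta_00 - beta_01)  # last value: spillover, truth -0.5
```

Up to Monte Carlo error, the difference $\widehat{\beta}^{0,0}_{1}-\widehat{\beta}^{0,1}_{1}$ recovers the spillover effect of $-0.5$; in practice the nuisance functions would of course be replaced by fitted models, as in Section 5.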

If the true distribution lies inside the smaller Model 2, a more efficient estimator than $\widehat{\beta}_{1}^{a,b}$ is possible. The efficient influence function for $\beta^{a,b}$ in Model 2 is:

\begin{align*}
\varphi_{2}\big(O;\beta^{a,b}\big) =\ & \frac{1}{2}\Bigg\{\frac{\mathds{1}(A_{1}=a,A_{2}=b)}{P(A_{1}=a\mid C,X_{1})\,P(A_{2}=b\mid C,X_{1})}\Big(Y_{1}-E\big[Y_{1}\mid A_{1}=a,A_{2}=b,C,X_{1}\big]\Big) \\
& \qquad + E\big[Y_{1}\mid A_{1}=a,A_{2}=b,C,X_{1}\big]\Bigg\} \\
& + \frac{1}{2}\Bigg\{\frac{\mathds{1}(A_{2}=a,A_{1}=b)}{P(A_{2}=a\mid C,X_{2})\,P(A_{1}=b\mid C,X_{2})}\Big(Y_{2}-E\big[Y_{2}\mid A_{2}=a,A_{1}=b,C,X_{2}\big]\Big) \\
& \qquad + E\big[Y_{2}\mid A_{2}=a,A_{1}=b,C,X_{2}\big]\Bigg\} - \beta^{a,b}.
\end{align*}

In order to use $\varphi_{2}(O;\beta^{a,b})$ to obtain an estimator of $\beta^{a,b}$, we need a model for the outcome regression $E[Y_{j}\mid A_{j},A_{3-j},C,X_{j}]$; a model for the propensity score $P(A_{j}=1\mid C,X_{j})$, which relates Twin $j$'s exposure to their own covariates; and a model for the propensity score $P(A_{j}=1\mid C,X_{3-j})$, which relates Twin $j$'s exposure to their co-twin's covariates. The resulting estimator of $\beta^{a,b}$ will be efficient provided that all three of these models are correctly specified, and provided that the estimated values converge to the truth at fast enough rates. Suppose this is the case. Let $\hat{\pi}_{j,a}(C_{i},X_{ij})$ be the predicted value of the propensity score $P(A_{j}=a\mid C,X_{j})$ for a twin pair with covariates $C_{i},X_{ij}$; let $\hat{\theta}_{j,a}(C_{i},X_{i,3-j})$ be the predicted value of the propensity score $P(A_{j}=a\mid C,X_{3-j})$ for a twin pair with covariates $C_{i},X_{i,3-j}$; and let $\hat{\mu}_{j,a,b}(C_{i},X_{ij})$ be the predicted outcome regression for Twin $j$ in a twin pair with covariates $C_{i},X_{ij}$ if the twins' exposures were set to $A_{j}=a, A_{3-j}=b$. Then the efficient estimator for $\beta^{a,b}$ in Model 2, based on data $O_{1},\ldots,O_{n}$, is given by:

\begin{align*}
\widehat{\beta}^{a,b}_{2} = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{2}\Bigg\{ & \frac{\mathds{1}(A_{i1}=a,A_{i2}=b)}{\hat{\pi}_{1,a}(C_{i},X_{i1})\,\hat{\theta}_{2,b}(C_{i},X_{i1})}\Big(Y_{i1}-\hat{\mu}_{1,a,b}(C_{i},X_{i1})\Big)+\hat{\mu}_{1,a,b}(C_{i},X_{i1}) \\
& + \frac{\mathds{1}(A_{i1}=b,A_{i2}=a)}{\hat{\pi}_{2,a}(C_{i},X_{i2})\,\hat{\theta}_{1,b}(C_{i},X_{i2})}\Big(Y_{i2}-\hat{\mu}_{2,a,b}(C_{i},X_{i2})\Big)+\hat{\mu}_{2,a,b}(C_{i},X_{i2})\Bigg\}.
\end{align*}

The estimator $\widehat{\beta}_{2}^{a,b}$ is also doubly robust: it is consistent if (i) both propensity score models are correct, even if the outcome regression model is incorrect, or (ii) the outcome regression model is correct, even if one or both propensity score models are incorrect. We show the double robustness of both estimators in Appendix C, and we also illustrate this property in the simulation study in Section 6.
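As a sketch of $\widehat{\beta}_{2}^{a,b}$ and its double robustness, the code below simulates pairs from a hypothetical mechanism under which Model 2 holds: each exposure depends on $(C, X_j)$ only, so the co-twin propensity $\theta_{j,a}(C,X_{3-j})$ reduces to a function of $C$ alone, which we evaluate by Gauss-Hermite quadrature. The estimator is computed once with the true outcome regression and once with a deliberately wrong one ($\hat{\mu}\equiv 0$), illustrating case (i) of the double robustness; all functions and parameter values are illustrative assumptions, not the fitted models of Section 5.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

C = rng.normal(size=n)
X1 = rng.normal(size=n)
X2 = rng.normal(size=n)

def expit(z):
    return 1.0 / (1.0 + np.exp(-z))

p1 = expit(0.4 * C + 0.6 * X1)  # pi_{1,1}(C, X1) = P(A1 = 1 | C, X1)
p2 = expit(0.4 * C + 0.6 * X2)  # pi_{2,1}(C, X2) = P(A2 = 1 | C, X2)
A1 = rng.binomial(1, p1)
A2 = rng.binomial(1, p2)

def mu(a, b, C, X_own):
    # True outcome regression E[Y_j | A_j=a, A_{3-j}=b, C, X_j].
    return 1.0 + 2.0 * a + 0.5 * b + C + X_own

Y1 = mu(A1, A2, C, X1) + rng.normal(size=n)
Y2 = mu(A2, A1, C, X2) + rng.normal(size=n)

# theta_{j,1}(C, X_{3-j}) = P(A_j=1 | C, X_{3-j}) = E_X[expit(0.4 C + 0.6 X)]
# with X ~ N(0,1), since A_j is independent of the co-twin's covariate given C
# here.  Evaluate the normal integral by Gauss-Hermite quadrature.
nodes, wts = np.polynomial.hermite.hermgauss(40)
theta1 = expit(0.4 * C[:, None] + 0.6 * np.sqrt(2.0) * nodes[None, :]) @ wts / np.sqrt(np.pi)

def beta2_hat(a, b, use_true_mu=True):
    """Model 2 efficient estimator; use_true_mu=False misspecifies the outcome
    regression (mu_hat = 0) while keeping both propensity score models correct."""
    total = 0.0
    for A_own, A_co, Y, p_own, X_own in [
        (A1, A2, Y1, p1, X1),
        (A2, A1, Y2, p2, X2),
    ]:
        pi = p_own if a else 1 - p_own       # P(A_own = a | C, X_own)
        theta = theta1 if b else 1 - theta1  # P(A_co = b | C, X_own): depends on C only
        ind = (A_own == a) & (A_co == b)
        m = mu(a, b, C, X_own) if use_true_mu else np.zeros(n)
        total += 0.5 * np.mean(ind / (pi * theta) * (Y - m) + m)
    return total

sp_true_mu = beta2_hat(0, 0) - beta2_hat(0, 1)                  # truth: -0.5
sp_wrong_mu = beta2_hat(0, 0, False) - beta2_hat(0, 1, False)   # still near -0.5
print(sp_true_mu, sp_wrong_mu)
```

With the outcome regression set to zero the estimator collapses to pure inverse probability weighting, yet it remains close to the true spillover effect because both propensity score models are correct, which is exactly case (i) above.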

Efficient influence functions of average main effects and average spillover effects are obtained as differences of efficient influence functions of the $\beta^{a,b}$ parameters. For example, the efficient influence function of the spillover effect $\beta_{sp}=\beta^{0,0}-\beta^{0,1}$ in Model $k$ is $\varphi_{k}(O;\beta^{0,0})-\varphi_{k}(O;\beta^{0,1})$. Similarly, the efficient estimator of $\beta_{sp}$ in Model $k$ is $\widehat{\beta}^{0,0}_{k}-\widehat{\beta}^{0,1}_{k}$.

4.2 Confidence intervals

Here we consider confidence intervals, based on the efficient estimators presented above, for a parameter $\beta^{a,b}=E[Y^{a,b}]$ or for a linear combination of such parameters, such as an average spillover effect or average main effect.

Denote the parameter of interest by $\beta$. Let $\psi_{k}(O;\beta,\nu)$ denote the efficient influence function of $\beta$ in Model $k$, where $\nu$ collects the parameters of the propensity score models and outcome regression model, which must be estimated in order to obtain the efficient estimator of $\beta$. If the parameters $\nu$ are estimated consistently at fast enough rates, the resulting estimator of $\beta$ will be asymptotically normal at the root-$n$ rate, so we may use a Wald-type confidence interval for $\beta$. Estimating $\nu$ at rates faster than $n^{-1/4}$ (that is, using an estimator $\widehat{\nu}$ such that $n^{1/4+\epsilon}(\widehat{\nu}-\nu)$ is bounded in probability for some $\epsilon>0$) is sufficient. However, this can be relaxed somewhat because our estimator of $\beta$ is doubly robust: the bias of a doubly robust estimator is of the order of the product of the propensity score estimation error and the outcome regression estimation error, so it suffices to estimate the propensity scores and the outcome regression at rates whose product is $o(n^{-1/2})$. Suppose this is the case, and let $\widehat{\beta}_{k,eff}$ be the solution of the estimating equation $\sum_{i=1}^{n}\psi_{k}(O_{i};\beta,\widehat{\nu})=0$. Because of the assumption on the rate of convergence of $\widehat{\nu}$, the influence function of $\widehat{\beta}_{k,eff}$ is the efficient influence function $\psi_{k}(O;\beta,\nu)$ evaluated at the true value of the nuisance parameters $\nu$. Therefore $\widehat{\beta}_{k,eff}$ is the efficient estimator of $\beta$, and its asymptotic variance is the variance of $\psi_{k}(O;\beta,\nu)$.

Thus in large samples, the variance of $\widehat{\beta}_{k,eff}$ is approximately $\frac{1}{n}\mathrm{Var}\big(\psi_{k}(O;\beta,\nu)\big)=\frac{1}{n}E\big[\psi_{k}(O;\beta,\nu)^{2}\big]$. A consistent estimator of $\mathrm{Var}\big(\psi_{k}(O;\beta,\nu)\big)$ is $\frac{1}{n}\sum_{i=1}^{n}\psi_{k}\big(O_{i};\widehat{\beta}_{k,eff},\hat{\nu}\big)^{2}$. Therefore, a large-sample 95% Wald-type confidence interval for $\beta$ is given by:

\[
\widehat{\beta}_{k,eff}\ \pm\ 1.96\sqrt{\frac{\sum_{i=1}^{n}\psi_{k}\big(O_{i};\widehat{\beta}_{k,eff},\hat{\nu}\big)^{2}}{n^{2}}}\,.
\]

Alternatively, we may use a bootstrap-based confidence interval, which may have better coverage than the Wald-type confidence interval at moderate sample sizes.
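As a sketch, both intervals can be computed by evaluating the influence function at each pair. The code below does this for the spillover effect $\beta_{sp}$ under a hypothetical data-generating mechanism, with the true nuisance functions plugged in and the Model 1 influence function; the percentile bootstrap interval resamples whole pairs. Everything here is illustrative, not the analysis of Section 5.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2_000

# Hypothetical mechanism (true spillover effect: -0.5).
C = rng.normal(size=n)
X1 = rng.normal(size=n)
X2 = rng.normal(size=n)

def expit(z):
    return 1.0 / (1.0 + np.exp(-z))

p1 = expit(0.4 * C + 0.6 * X1)
p2 = expit(0.4 * C + 0.6 * X2)
A1 = rng.binomial(1, p1)
A2 = rng.binomial(1, p2)

def mu(a, b, C, X_own):
    return 1.0 + 2.0 * a + 0.5 * b + C + X_own

Y1 = mu(A1, A2, C, X1) + rng.normal(size=n)
Y2 = mu(A2, A1, C, X2) + rng.normal(size=n)

def pair_terms(a, b, idx):
    """Per-pair contribution g_i^{a,b} (influence function plus beta^{a,b})
    for the pairs indexed by idx, using the true nuisance functions."""
    g = np.zeros(len(idx))
    for A_own, A_co, Y, p_own, p_co, X_own in [
        (A1, A2, Y1, p1, p2, X1),
        (A2, A1, Y2, p2, p1, X2),
    ]:
        pi = (p_own[idx] if a else 1 - p_own[idx]) * (p_co[idx] if b else 1 - p_co[idx])
        ind = (A_own[idx] == a) & (A_co[idx] == b)
        m = mu(a, b, C[idx], X_own[idx])
        g += 0.5 * (ind / pi * (Y[idx] - m) + m)
    return g

idx = np.arange(n)
g_sp = pair_terms(0, 0, idx) - pair_terms(0, 1, idx)  # per-pair spillover terms
beta_sp = g_sp.mean()

# Wald interval: psi_i = g_i - beta_sp are the influence function values.
half = 1.96 * np.sqrt(np.sum((g_sp - beta_sp) ** 2) / n**2)
wald_ci = (beta_sp - half, beta_sp + half)

# Percentile bootstrap: resample whole twin pairs.
boot = np.array([
    np.mean(pair_terms(0, 0, s) - pair_terms(0, 1, s))
    for s in (rng.integers(0, n, size=n) for _ in range(1000))
])
boot_ci = (np.quantile(boot, 0.025), np.quantile(boot, 0.975))
print(beta_sp, wald_ci, boot_ci)
```

At this moderate sample size the two intervals are typically similar; with fitted rather than true nuisance models, the bootstrap should resample pairs and refit the nuisance models within each replicate.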

5 Application: spillover effects of alcohol in early adolescence

Early use of alcohol is often associated with substance use and dependence in adulthood [4, 3]. Recently, Irons et al. [7] used a twin design to assess whether this association is due in part to a causal relationship, and found evidence of a causal effect of twins' early alcohol use on their own adult substance use and antisocial behavior outcomes. Here we consider the same data from the Minnesota Twin Family Study; but rather than focusing on the effect of twins' exposures on their own outcomes, we investigate whether twins' early alcohol use has a causal impact on their co-twins' outcomes. That is, our primary question of interest is whether there is evidence of a nonzero spillover effect.

In the Minnesota Twin Family Study, pairs of same-sex twins were contacted for an intake assessment, at which baseline covariates were collected, at target age 11; exposure status was assessed at target age 14; and outcomes were measured at target age 24. The exposure we consider here is an indicator of whether the subject had ever consumed alcohol without their parents' permission by the time of the exposure interview. The outcome on which we focus is Drinking Index, a composite measure of adult drinking frequency and amount. The shared covariates we include are severity of parent alcohol abuse, severity of parent drug abuse, parents' occupation level, and the sex and zygosity of the twins. The individual covariates we include are a measure of academic motivation, number of externalizing disorder symptoms (such as symptoms of attention deficit/hyperactivity disorder or conduct disorder), amount of conflict with parents, and actual age at the time of the exposure assessment. Here we use only the complete-case pairs, which account for 511 of the 761 pairs of twins in the dataset; future work is needed to develop methods for handling missing data in settings where there is interference within pairs.

Our primary target of inference is the spillover effect $\beta_{sp}=E[Y^{0,0}-Y^{0,1}]$, which is the mean difference in twins' outcomes we would see if we could intervene to change all co-twins from exposed to unexposed, while keeping all twins unexposed. We also estimate the main effect $\beta_{main}=E[Y^{0,0}-Y^{1,0}]$, the mean difference in twins' outcomes we would see if we could intervene to change all twins from exposed to unexposed, while keeping all co-twins unexposed. We estimate $\beta_{sp}$ and $\beta_{main}$ using the efficient estimators that we derived in Section 4. For the Model 1 estimator, we use generalized additive models to model the outcome regression $E[Y_{j}\mid A_{j},A_{3-j},C,X_{j},X_{3-j}]$. In order to model the joint propensity score $P(A_{1},A_{2}\mid C,X_{1},X_{2})$, we first use generalized additive models to fit the marginal distributions $P(A_{1}\mid C,X_{1})$ and $P(A_{2}\mid C,X_{2})$, then model the association between the exposures via a Dale model [2], allowing the strength of the association to vary by sex and zygosity. For the Model 2 estimator, we use generalized additive models to model the outcome regression $E[Y_{j}\mid A_{j},A_{3-j},C,X_{j}]$, the propensity score $P(A_{j}\mid C,X_{j})$, and the propensity score $P(A_{j}\mid C,X_{3-j})$. Details are given in Appendix D.

Results are shown in Table 2. Using either of the two estimators, there is evidence of a nonzero spillover effect of twins' use of alcohol in early adolescence on their co-twins' adult drinking behavior; that is, there is evidence of interference between twins in this study. A spillover effect of $-1.5$ (the point estimate using the Model 1 estimator) would indicate that twins' Drinking Index outcomes would be an average of 1.5 points lower if all twins and all co-twins were unexposed to alcohol in early adolescence, compared to all twins being unexposed with their co-twins exposed. The Drinking Index outcome ranges from 0 to 20, with mean 10.7 and standard deviation 4.0. Larger values correspond to more frequent and/or greater amounts of drinking, so a decrease of 1.5 points would represent a moderate benefit. There is also evidence of a nonzero main effect. A main effect of $-2.1$ points would indicate that twins' Drinking Index outcomes would be an average of 2.1 points lower if all twins changed from exposed to unexposed, while their co-twins remained unexposed.

Our estimators for $\beta_{sp}$ and $\beta_{main}$ are based on the assumption of no unmeasured confounding due to shared or individual factors. This assumption is untestable: we have no means of determining that it holds in our study, and if it does not in fact hold, our results could be invalidated. However, we can check for one specific way in which the assumption may fail, through comparison of monozygotic (MZ) and dizygotic (DZ) twins. If there is no unmeasured confounding due to shared genetics or other unmeasured factors that are differential between MZ and DZ twins, then, since the distribution of measured covariates is well balanced across these two groups, the average spillover effect in the population of MZ twins, $\beta_{sp,MZ}$, should equal the average spillover effect in the population of DZ twins, $\beta_{sp,DZ}$. Therefore, evidence that the average spillover effects differ in these two populations would be evidence of unmeasured confounding due to shared genetics.

| | | Spillover effect $\beta_{sp}$ | Main effect $\beta_{main}$ | CTC effect $\beta_{W}$ |
|---|---|---|---|---|
| All twins (n=510) | Model 1 | -1.470 (-2.318, -0.716) | -2.093 (-3.251, -0.738) | 0.730 (0.064, 1.372) |
| | Model 2 | -1.599 (-2.318, -0.762) | -2.298 (-3.356, -0.823) | |
| MZ twins (n=324) | Model 1 | -1.671 (-2.796, -0.640) | -2.131 (-3.422, -1.094) | 0.577 (-0.224, 1.347) |
| | Model 2 | -1.931 (-2.821, -0.705) | -2.424 (-3.499, -1.123) | |
| DZ twins (n=186) | Model 1 | -1.039 (-2.523, 0.118) | -1.952 (-3.711, -0.856) | 1.016 (-0.106, 2.155) |
| | Model 2 | -0.957 (-2.245, 0.210) | -2.084 (-3.784, -0.890) | |

Table 2: Effect estimates for the MTFS data. The third and fourth columns show point estimates, and 95% confidence intervals based on 5,000 bootstrap samples, for the spillover effect $\beta_{sp}=E[Y^{0,0}-Y^{0,1}]$ and the main effect $\beta_{main}=E[Y^{0,0}-Y^{1,0}]$, using the efficient estimators for Model 1 and for Model 2. The fifth column shows point estimates, and 95% confidence intervals based on 5,000 bootstrap samples, for the co-twin control effect $\beta_{W}=E[Y^{0,1}-Y^{1,0}\mid A_{1}\neq A_{2},X_{1}=X_{2}]$, estimated by fitting a between-within regression model. All are shown using data on all twins, on MZ twins only, and on DZ twins only.

The point estimates are consistent with a somewhat stronger spillover effect among MZ twins than among DZ twins; however, a 95% confidence interval for the parameter $\beta_{sp,MZ}-\beta_{sp,DZ}$ is $(-2.25, 1.84)$, so the data do not provide evidence that $\beta_{sp,MZ}-\beta_{sp,DZ}$ is different from zero. Thus, comparison of the MZ and DZ subgroups does not suggest that there is unmeasured confounding due to shared genetics or other factors that are differential between MZ and DZ twins. However, this subgroup comparison does not provide a means of assessing whether other types of unmeasured confounding may be present. Unmeasured shared factors such as peer group or attributes of the twins' school or community, and unmeasured individual factors, could still invalidate our results if they are not captured by measured covariates.

We also fit a between-within regression model and report the within-pair coefficient $\beta_{W}$. The causal interpretation of $\beta_{W}$ is valid even if there are shared unmeasured confounders, provided that there is no unmeasured confounding due to individual factors. In Section 3.1 we saw that, in settings where there is interference between twins, the causal interpretation of $\beta_{W}$ is the spillover effect minus the main effect on a subset of the group of exposure-discordant twins. If the estimates of $\beta_{W}$ were not consistent with the estimates of $\beta_{sp}-\beta_{main}$, this could be an indication that the estimates of $\beta_{sp}$ and $\beta_{main}$ were invalid due to shared confounders (though it could also be an indication that the subgroup effect $\beta_{W}$ does not generalize to the whole population of twins). In our case the estimates are consistent with each other.

Irons et al. [7] estimated the effect of adolescent alcohol use on adult drinking behavior using two approaches, propensity score weighting and the co-twin control method, and found that the results using the co-twin control method were attenuated compared to the first method. They pointed out that one possible reason for this attenuation is that unmeasured shared confounders could be creating bias in the propensity score estimates. Our results suggest another possible explanation: interference. If there is interference between twins in this study, then the parameter estimated using the co-twin control method is not simply the effect of twins' exposure on their own outcome, but the difference between the spillover effect of their co-twins' exposure and this effect.

6 Simulation study

Here we evaluate the finite-sample performance of the estimators derived in Section 4, using simulated data with sample size $n=500$ designed to mimic the data from the Minnesota Twin Family Study (MTFS). We consider two data-generating mechanisms. Under the first, the larger Model 1 holds but the smaller Model 2 does not; under the second, both models hold. We generate the covariates in the simulated data by resampling covariates from the MTFS dataset. We then generate the exposures $A_{1},A_{2}$ from a joint distribution of exposures given covariates based on modeling of the MTFS data, and generate the outcomes $Y_{1},Y_{2}$ from a joint distribution of outcomes given exposures and covariates, also based on modeling of the MTFS data. Details of the data-generating mechanisms are given in Appendix D.

For each data-generating mechanism, we estimate the spillover effect $\beta_{sp}=E[Y^{0,0}-Y^{0,1}]$ using each of five estimators. Under the first data-generating mechanism, we estimate $\beta_{sp}$ using the efficient estimator for Model 1 with both the joint propensity score model and the outcome regression model correctly specified; the Model 1 estimator with one but not both of these correctly specified; the Model 1 estimator with both misspecified; and the Model 2 efficient estimator with the outcome regression and $P(A_{j}\mid C,X_{j})$ correctly specified. Table 3 shows the empirical bias of each estimator in $R=5{,}000$ simulations; the empirical variance of each estimator; the mean of the influence function-based variance estimates; the mean of the bootstrap variance estimates based on $M=1{,}000$ bootstrap replicates within each simulation; the coverage of the 95% Wald confidence intervals for $\beta_{sp}$ using the influence function-based variance estimates; and the coverage of the percentile bootstrap 95% confidence intervals. As expected, the bias of the correctly specified Model 1 estimator $\hat{\beta}_{1}$ is low. Also as expected, since $\hat{\beta}_{1}$ is a doubly robust estimator, bias remains low even when either the propensity score model or the outcome regression model is misspecified. Coverage probabilities of the percentile bootstrap confidence intervals are very close to the nominal 95% level, while the influence function-based variance estimates are slightly anti-conservative at this sample size, resulting in coverage probabilities that are somewhat lower.

| | Bias | Var | IF-Var Est | Wald Cov'g | Btstp-Var Est | Btstp Cov'g |
|---|---|---|---|---|---|---|
| $\hat{\beta}^{1}$ | -0.00658 | 0.11390 | 0.09886 | 92.96 | 0.12582 | 95.12 |
| $\hat{\beta}^{2}$ | 0.00320 | 0.11587 | 0.07195 | 87.50 | 0.13357 | 95.68 |
| $\hat{\beta}^{1}_{wr.prop.}$ | 0.00042 | 0.11076 | 0.09971 | 93.44 | 0.12522 | 95.10 |
| $\hat{\beta}^{1}_{wr.outc.}$ | -0.02243 | 0.11766 | 0.11348 | 94.12 | 0.11136 | 94.70 |
| $\hat{\beta}^{1}_{wr.both}$ | -0.04121 | 0.11736 | 0.11618 | 94.56 | 0.11400 | 94.60 |

Table 3: Simulation results for $R=5{,}000$ simulations at sample size $n=500$ under a data-generating mechanism where Model 1 is correctly specified but Model 2 is not. Shown are the empirical bias and empirical variance of each estimator; the mean of the 5,000 influence function-based variance estimates; the mean of the 5,000 bootstrap-based variance estimates, based on $M=1{,}000$ bootstraps for each simulation; and coverage probabilities for the Wald confidence interval using the IF-based variance estimate and for the percentile bootstrap confidence interval.
| | Bias | Var | IF-Var Est | Wald Cov'g | Btstp-Var Est | Btstp Cov'g |
|---|---|---|---|---|---|---|
| $\hat{\beta}^{1}$ | 0.00719 | 0.08526 | 0.07612 | 93.24 | 0.08967 | 95.30 |
| $\hat{\beta}^{2}$ | 0.00576 | 0.08273 | 0.07427 | 93.46 | 0.08567 | 95.14 |
| $\hat{\beta}^{2}_{wr.prop.}$ | -0.01473 | 0.08232 | 0.07171 | 92.88 | 0.08375 | 94.82 |
| $\hat{\beta}^{2}_{wr.outc.}$ | -0.00005 | 0.08710 | 0.08372 | 94.86 | 0.08263 | 95.00 |
| $\hat{\beta}^{2}_{wr.both}$ | -0.00453 | 0.08716 | 0.08253 | 94.20 | 0.08231 | 94.50 |

Table 4: Simulation results for $R=5{,}000$ simulations at sample size $n=500$ under a data-generating mechanism where both Model 1 and Model 2 are correctly specified. Shown are the empirical bias and empirical variance of each estimator; the mean of the 5,000 influence function-based variance estimates; the mean of the 5,000 bootstrap-based variance estimates, based on $M=1{,}000$ bootstraps for each simulation; and coverage probabilities for the Wald confidence interval using the IF-based variance estimate and for the percentile bootstrap confidence interval.

Under the second data-generating mechanism, we estimate $\beta_{sp}$ using the efficient estimator for Model 1 with both the joint propensity score model and the outcome regression model correctly specified; the Model 2 efficient estimator with both propensity score models and the outcome regression model all correctly specified; the Model 2 estimator with either the propensity score models or the outcome regression model misspecified; and the Model 2 estimator with all of these misspecified. Results are displayed in Table 4. We expect each of the first four of these estimators to have low bias, as they do. Coverage probabilities are close to the nominal level, especially using the bootstrap confidence intervals.

Additionally, the Model 2 efficient estimator has smaller variance than the Model 1 estimator under the second data-generating mechanism. The theory from Section 4 shows that the Model 2 estimator is asymptotically more efficient than the Model 1 estimator when both models are correct, and here we see improved precision at sample size $n=500$ in this simulation scenario.
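This qualitative comparison can be reproduced in miniature. The sketch below, under an assumed data-generating mechanism in which Model 2 holds and with the true nuisance functions plugged in, runs a small number of replications and compares the empirical spread of the two efficient estimators of the spillover effect; it is illustrative only, and with few replications the estimated variances are noisy, so the gap between the two estimators may appear modest.

```python
import numpy as np

rng = np.random.default_rng(3)
n, R = 1_000, 200  # pairs per replication, number of replications

def expit(z):
    return 1.0 / (1.0 + np.exp(-z))

# Gauss-Hermite setup for theta(C) = E_X[expit(0.4 C + 0.6 X)], X ~ N(0,1).
nodes, wts = np.polynomial.hermite.hermgauss(40)

def one_replication():
    C = rng.normal(size=n)
    X1 = rng.normal(size=n)
    X2 = rng.normal(size=n)
    p1 = expit(0.4 * C + 0.6 * X1)
    p2 = expit(0.4 * C + 0.6 * X2)
    A1 = rng.binomial(1, p1)
    A2 = rng.binomial(1, p2)
    mu = lambda a, b, X: 1.0 + 2.0 * a + 0.5 * b + C + X  # spillover: 0.5
    Y1 = mu(A1, A2, X1) + rng.normal(size=n)
    Y2 = mu(A2, A1, X2) + rng.normal(size=n)
    theta = expit(0.4 * C[:, None] + 0.6 * np.sqrt(2.0) * nodes[None, :]) @ wts / np.sqrt(np.pi)

    def b1(a, b):  # Model 1 efficient estimator, true nuisances
        out = 0.0
        for Ao, Ac, Y, po, pc, Xo in [(A1, A2, Y1, p1, p2, X1), (A2, A1, Y2, p2, p1, X2)]:
            pi = (po if a else 1 - po) * (pc if b else 1 - pc)
            m = mu(a, b, Xo)
            out += 0.5 * np.mean(((Ao == a) & (Ac == b)) / pi * (Y - m) + m)
        return out

    def b2(a, b):  # Model 2 efficient estimator, true nuisances
        out = 0.0
        for Ao, Ac, Y, po, Xo in [(A1, A2, Y1, p1, X1), (A2, A1, Y2, p2, X2)]:
            pi = (po if a else 1 - po) * (theta if b else 1 - theta)
            m = mu(a, b, Xo)
            out += 0.5 * np.mean(((Ao == a) & (Ac == b)) / pi * (Y - m) + m)
        return out

    return b1(0, 0) - b1(0, 1), b2(0, 0) - b2(0, 1)

draws = np.array([one_replication() for _ in range(R)])
mean1, mean2 = draws.mean(axis=0)  # both should be near the truth, -0.5
sd1, sd2 = draws.std(axis=0)       # empirical spread of the two estimators
print(mean1, sd1, mean2, sd2)
```

Both estimators are nearly unbiased in this toy setting; a much larger replication count, as in the paper's $R=5{,}000$, would be needed to pin down the efficiency gain reliably.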

7 Discussion

In this paper we have considered the setting of independent pairs of twins in which one twin's exposure may have a causal impact on their co-twin's outcome. Whether interference is present in a given study will depend on the nature of the exposure and the outcome: in some cases, researchers may be able to rule out between-twin interference at the outset based on their knowledge of the domain, while for many behavioral exposures and outcomes, interference is likely to be a possibility. In the latter settings, researchers may allow for possible interference and estimate the spillover effect of twins' exposures on their co-twins' outcomes using the estimators we have presented here; evidence of a nonzero spillover effect would be evidence of interference. We have also highlighted the impact that between-twin interference would have for researchers using the co-twin control method: when there is no interference, this method estimates the causal effect of twins' exposures on their outcomes. When there is interference, however, it estimates the difference of two effects: the spillover effect of the co-twins' exposures on the twins' outcomes, minus the main effect of twins' exposures on their own outcomes.

Under the assumption of no unmeasured confounding, we derived the semi-parametric efficient estimators of key causal effects for the interference setting, including average main effects and average spillover effects. We applied our estimators to data from the Minnesota Twin Family Study (MTFS), and found evidence that twins’ exposure to alcohol in early adolescence may have a spillover effect on their co-twins’ drinking behavior in adulthood. However, if there are genetic or other shared or individual unmeasured factors impacting both the choice to drink in early adolescence, and drinking behavior in adulthood (after controlling for measured covariates), this could invalidate our results. Comparing the causal effects among MZ twin pairs versus among DZ twin pairs did not yield evidence of unmeasured confounders due to shared genetics in the MTFS data. The development of sensitivity analyses, showing how a range of different strengths of unmeasured confounding would impact results in the interference setting, would be valuable future work.

Future work towards addressing missing data in this setting is also needed. As with the MTFS data, there may be missingness in baseline covariates, exposures, and outcomes, and the interference structure leads to some challenges in addressing missing data. An imputation approach, for example, should be designed in a way which is compatible with the different models to be fit for construction of the estimators; how best to do this for the Model 2 estimator under minimal modeling assumptions is an open research question.

A final direction of future research involves leveraging symmetry. Throughout, we randomly labeled the twins in each pair as Twin 1 and Twin 2. Because this labeling is random, it follows that there is symmetry between the population of twins who are labeled as Twin 1, and the population of twins who are labeled as Twin 2. In our models we did not explicitly make an assumption of symmetry, and our estimators therefore apply not only to twins data, but to any setting of independent pairs where there is possible interference between partners. However, in the twins setting, leveraging such a symmetry assumption could lead to additional efficiency gains, and in future work we plan to derive efficient estimators for models which do impose an assumption of symmetry between the twins.

Funding

This work was supported by [ONR grant N00014-18-1-2760 to E.O. and B.S.]; and [R37-AA009367 to M.M.].

Acknowledgments

Conflict of Interest: None declared.

References

  • [1] Peter M. Aronow and Cyrus Samii. Estimating average causal effects under general interference, with application to a social network experiment. The Annals of Applied Statistics, 11(4):1912–1947, 2017.
  • [2] Jocelyn R. Dale. Global cross-ratio models for bivariate, discrete, ordered responses. Biometrics, 42(4):909–917, 1986.
  • [3] Bridget F. Grant and Deborah A. Dawson. Age at onset of alcohol use and its association with DSM-IV alcohol abuse and dependence: results from the National Longitudinal Alcohol Epidemiologic Survey. Journal of Substance Abuse, 9:103–110, 1997.
  • [4] Julia D. Grant, Jeffrey F. Scherrer, Michael T. Lynskey, Michael J. Lyons, Seth A. Eisen, Ming T. Tsuang, William R. True, and Kathleen K. Bucholz. Adolescent alcohol use is a risk factor for adult alcohol and drug dependence: evidence from a twin design. Psychological Medicine, 36(1):109–118, 2006.
  • [5] Guanglei Hong and Stephen W Raudenbush. Evaluating kindergarten retention policy. Journal of the American Statistical Association, 101(475):901–910, 2006.
  • [6] Michael G Hudgens and M. Elizabeth Halloran. Toward causal inference with interference. Journal of the American Statistical Association, 103(482):832–842, 2008. PMID: 19081744.
  • [7] Daniel E. Irons, William G. Iacono, and Matt McGue. Tests of the effects of adolescent early alcohol exposures on adult outcomes. Addiction, 110(2):269–278, 2015.
  • [8] Wendy Johnson, Eric Turkheimer, Irving I. Gottesman, and Thomas J. Bouchard Jr. Beyond heritability: Twin studies in behavioral research. Current Directions in Psychological Science, 18(4):217–220, 2009.
  • [9] Kenneth S. Kendler and Charles O. Gardner. Dependent stressful life events and prior depressive episodes in the prediction of major depression: The problem of causal inference in psychiatric epidemiology. JAMA Psychiatry, 67(11):1120–1127, 2010.
  • [10] Benjamin B. Lahey and Brian M. D’Onofrio. All in the family: Comparing siblings to test causal hypotheses regarding environmental influences on behavior. Current Directions in Psychological Science, 19(5):319–323, 2010.
  • [11] Brett Laursen, Amy C. Hartl, Frank Vitaro, Mara Brendgen, Ginette Dionne, and Michel Boivin. The spread of substance use and delinquency between adolescent twins. Developmental Psychology, 53(2):329–339, 2017.
  • [12] L. Liu, M. G. Hudgens, and S. Becker-Dreps. On inverse probability-weighted estimators in the presence of interference. Biometrika, 103(4):829–842, 2016.
  • [13] Lan Liu and Michael G. Hudgens. Large sample randomization inference of causal effects in the presence of interference. Journal of the American Statistical Association, 109(505):288–301, 2014. PMID: 24659836.
  • [14] Lan Liu, Michael G. Hudgens, Bradley Saul, John D. Clemens, Mohammad Ali, and Michael E. Emch. Doubly robust estimation in observational studies with partial interference. Stat, 8(1):e214, 2019.
  • [15] Matt McGue, Merete Osler, and Kaare Christensen. Causal inference and observational research: The utility of twins. Perspectives on Psychological Science, 5(5):546–556, 2010.
  • [16] Caleb H. Miles, Maya Petersen, and Mark J. van der Laan. Causal inference when counterfactuals depend on the proportion of all subjects exposed. Biometrics, 75(3):768–777, 2019.
  • [17] Gerald S. Oettinger. Sibling similarity in high school graduation outcomes: Causal interdependency or unobserved heterogeneity? Southern Economic Journal, 66(3):631–648, 2000.
  • [18] Elizabeth L. Ogburn, Oleg Sofrygin, Ivan Diaz, and Mark J. van der Laan. Causal inference for social network data. arXiv e-prints, page arXiv:1705.08527, May 2017.
  • [19] Jonathan D. Schaefer, Terrie E. Moffitt, Louise Arseneault, Andrea Danese, Helen L. Fisher, Renate Houts, Margaret A. Sheridan, Jasmin Wertz, and Avshalom Caspi. Adolescent victimization and early-adult psychopathology: Approaching causal inference using a longitudinal twin study to rule out noncausal explanations. Clinical Psychological Science, 6(3):352–371, 2018. PMID: 29805917.
  • [20] Daniel O. Scharfstein, Andrea Rotnitzky, and James M. Robins. Adjusting for nonignorable drop-out using semiparametric nonresponse models. Journal of the American Statistical Association, 94(448):1096–1120, 1999.
  • [21] Arvid Sjölander, Thomas Frisell, and Sara Öberg. Causal interpretation of between-within models for twin research. Epidemiologic Methods, 1(1):217–237, 2012.
  • [22] Cheryl Slomkowski, Richard Rende, Scott Novak, Elizabeth Lloyd-Richardson, and Raymond Niaura. Sibling effects on smoking in adolescence: evidence for social influence from a genetically informative design. Addiction, 100(4):430–438, 2005.
  • [23] Eric J Tchetgen Tchetgen and Tyler J VanderWeele. On causal inference in the presence of interference. Statistical Methods in Medical Research, 21(1):55–75, 2012. PMID: 21068053.
  • [24] Eric J. Tchetgen Tchetgen, Isabel Fulcher, and Ilya Shpitser. Auto-G-Computation of Causal Effects on a Network. arXiv e-prints, page arXiv:1709.01577, September 2017.
  • [25] Anastasios A. Tsiatis. Semiparametric Theory and Missing Data. Springer, New York, 2006.
  • [26] Mark J. van der Laan. Causal inference for a population of causally connected units. Journal of Causal Inference, 2(1):13–74, 2014.

Appendix A Identification of the co-twin control effect under interference

Here we show identification of the subgroup causal effect $\beta_{ctc}:=E\big[Y^{0,1}-Y^{1,0}\mid A_{1}\neq A_{2},X_{1}=X_{2}=x\big]$ discussed in Section 3.1, under the following assumptions:

\begin{align*}
\textbf{C1}:&\ Y_{j}^{a,b}\perp\!\!\!\perp(A_{1},A_{2})\mid(U,X_{1},X_{2})\\
\textbf{C2}:&\ \mbox{If }A_{j}=a,A_{3-j}=b,\mbox{ then }Y_{j}=Y_{j}^{a,b}\\
\textbf{C3}:&\ \mbox{For all }u,x_{1},x_{2}\mbox{ in the support of }(U,X_{1},X_{2}),\\
&\ \ \ P\big(A_{1}=a,A_{2}=b\mid U=u,X_{1}=x_{1},X_{2}=x_{2}\big)>0\mbox{ for all }a,b=0,1\\
\textbf{C4}:&\ \big(U,X_{1},X_{2},Y_{1}^{a,b}\big)\mbox{ and }\big(U,X_{2},X_{1},Y_{2}^{a,b}\big)\mbox{ are identically distributed}
\end{align*}

C1–C3 are untestable assumptions of exchangeability within levels of the (possibly unobserved) shared factors $U$ and the observed individual covariates $X_{1},X_{2}$; consistency; and positivity. C4 is an assumption of symmetry between the population of twins labeled as Twin 1 and the population labeled as Twin 2, which should hold by design as the labeling is randomly assigned.

Let $D_{x}$ denote the subgroup of twins who are exposure-discordant with their co-twin, but who have the same value $x$ of all individual covariates as their co-twin. Thus $\beta_{ctc}$ is the mean of the contrast $Y^{0,1}-Y^{1,0}$ over all twins in the subgroup $D_{x}$. Note first that this is a weighted average of Twin 1's contrast and Twin 2's contrast in the subgroup of $D_{x}$ where Twin 1 is exposed and the subgroup where Twin 2 is exposed. Specifically, writing $\pi:=P(A_{1}=0,A_{2}=1\mid A_{1}\neq A_{2},X_{1}=X_{2}=x)$, we have:

\begin{align*}
\beta_{ctc}=\ &\frac{1}{2}\Big\{E\big[Y_{1}^{0,1}-Y_{1}^{1,0}\mid A_{1}\neq A_{2},X_{1}=X_{2}=x\big]+E\big[Y_{2}^{0,1}-Y_{2}^{1,0}\mid A_{1}\neq A_{2},X_{1}=X_{2}=x\big]\Big\}\\
=\ &\frac{1}{2}\Big\{\pi\times E\big[Y_{1}^{0,1}-Y_{1}^{1,0}\mid A_{1}=0,A_{2}=1,X_{1}=X_{2}=x\big]\\
&\quad+(1-\pi)\times E\big[Y_{1}^{0,1}-Y_{1}^{1,0}\mid A_{1}=1,A_{2}=0,X_{1}=X_{2}=x\big]\Big\}\\
&+\frac{1}{2}\Big\{\pi\times E\big[Y_{2}^{0,1}-Y_{2}^{1,0}\mid A_{1}=0,A_{2}=1,X_{1}=X_{2}=x\big]\\
&\quad+(1-\pi)\times E\big[Y_{2}^{0,1}-Y_{2}^{1,0}\mid A_{1}=1,A_{2}=0,X_{1}=X_{2}=x\big]\Big\}
\end{align*}

Now $E\big[Y_{j}^{0,1}\mid A_{j}=0,A_{3-j}=1,X_{1}=X_{2}=x\big]$ is equal to $E\big[Y_{j}\mid A_{j}=0,A_{3-j}=1,X_{1}=X_{2}=x\big]$ by consistency, while $E\big[Y_{j}^{0,1}\mid A_{j}=1,A_{3-j}=0,X_{1}=X_{2}=x\big]$ can be identified by leveraging the symmetry between the group of twins labeled as Twin 1 and the group labeled as Twin 2:

\begin{align*}
E\big[&Y_{j}^{0,1}\mid A_{j}=1,A_{3-j}=0,X_{1}=X_{2}=x\big]\\
=&\ \int_{y}y\ dF\big(Y_{j}^{0,1}=y\mid A_{j}=1,A_{3-j}=0,X_{1}=X_{2}=x\big)\\
=&\ \int_{u}\int_{y}y\ dF\big(Y_{j}^{0,1}=y\mid U=u,A_{j}=1,A_{3-j}=0,X_{1}=X_{2}=x\big)\times dF\big(u\mid A_{j}=1,A_{3-j}=0,X_{1}=X_{2}=x\big)\\
=&\ \int_{u}\int_{y}y\ dF\big(Y_{j}^{0,1}=y\mid U=u,X_{1}=X_{2}=x\big)\times dF\big(u\mid A_{j}=1,A_{3-j}=0,X_{1}=X_{2}=x\big) &&\mbox{by C1}\\
=&\ \int_{u}\int_{y}y\ dF\big(Y_{3-j}^{0,1}=y\mid U=u,X_{1}=X_{2}=x\big)\times dF\big(u\mid A_{j}=1,A_{3-j}=0,X_{1}=X_{2}=x\big) &&\mbox{by C4}\\
=&\ \int_{u}\int_{y}y\ dF\big(Y_{3-j}^{0,1}=y\mid U=u,A_{j}=1,A_{3-j}=0,X_{1}=X_{2}=x\big)\times dF\big(u\mid A_{j}=1,A_{3-j}=0,X_{1}=X_{2}=x\big) &&\mbox{by C1}\\
=&\ \int_{u}\int_{y}y\ dF\big(Y_{3-j}=y\mid U=u,A_{j}=1,A_{3-j}=0,X_{1}=X_{2}=x\big)\times dF\big(u\mid A_{j}=1,A_{3-j}=0,X_{1}=X_{2}=x\big) &&\mbox{by C2}\\
=&\ E\big[Y_{3-j}\mid A_{j}=1,A_{3-j}=0,X_{1}=X_{2}=x\big]
\end{align*}

Similarly, $E\big[Y_{j}^{1,0}\mid A_{j}=0,A_{3-j}=1,X_{1}=X_{2}=x\big]$ is identified as $E\big[Y_{3-j}\mid A_{j}=0,A_{3-j}=1,X_{1}=X_{2}=x\big]$. Therefore,

\begin{align*}
\beta_{ctc}=\ &\frac{1}{2}\Big\{\pi\times E\big[Y_{1}-Y_{2}\mid A_{1}=0,A_{2}=1,X_{1}=X_{2}=x\big]\\
&\quad+(1-\pi)\times E\big[Y_{2}-Y_{1}\mid A_{1}=1,A_{2}=0,X_{1}=X_{2}=x\big]\Big\}\\
&+\frac{1}{2}\Big\{\pi\times E\big[Y_{1}-Y_{2}\mid A_{1}=0,A_{2}=1,X_{1}=X_{2}=x\big]\\
&\quad+(1-\pi)\times E\big[Y_{2}-Y_{1}\mid A_{1}=1,A_{2}=0,X_{1}=X_{2}=x\big]\Big\}\\
=\ &E\big[Y_{j}\mid A_{j}=0,A_{3-j}=1,X_{1}=X_{2}=x\big]-E\big[Y_{j}\mid A_{j}=1,A_{3-j}=0,X_{1}=X_{2}=x\big],
\end{align*}
where the final equality uses C4 to pool the two discordant configurations: within each of the two brackets, both terms equal the observed mean outcome of unexposed twins minus that of exposed twins in $D_{x}$.
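As a sanity check on this identification argument, the following minimal simulation sketch uses a hypothetical additive data-generating process (not the paper's data): $Y_{j}^{a,b}=d\,a+s\,b+U+\varepsilon_{j}$ with a shared unmeasured confounder $U$, and individual covariates omitted, which corresponds to working within a single stratum $x$. Under this model $\beta_{ctc}=s-d$ exactly, and the observed-data contrast over exposure-discordant pairs recovers it because the shared $U$ cancels within pairs.

```python
import numpy as np

# Hypothetical data-generating process: direct effect d, spillover effect s,
# shared (unobserved) confounder U that drives both twins' exposures.
rng = np.random.default_rng(0)
n, d, s = 200_000, 1.0, 0.4

U = rng.normal(size=n)                      # shared confounder, one per pair
p = 1.0 / (1.0 + np.exp(-U))                # exposure probability depends on U
A1, A2 = rng.binomial(1, p), rng.binomial(1, p)
Y1 = d * A1 + s * A2 + U + rng.normal(size=n)
Y2 = d * A2 + s * A1 + U + rng.normal(size=n)

# beta_ctc = E[Y^{0,1} - Y^{1,0} | A1 != A2] = s - d under this additive model.
# Identified contrast: among discordant pairs, mean outcome of the unexposed
# twin minus mean outcome of the exposed twin; U cancels within each pair.
disc = A1 != A2
y_unexposed = np.where(A1 == 0, Y1, Y2)[disc]
y_exposed = np.where(A1 == 1, Y1, Y2)[disc]
beta_hat = y_unexposed.mean() - y_exposed.mean()
print(beta_hat)                             # close to s - d = -0.6
```

Note that the contrast is unbiased here even though $U$ is never used by the estimator; this is exactly the co-twin control logic formalized above.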

Appendix B Efficient influence functions

Here we derive the efficient influence function for the parameter $\beta_{j}^{a,b}:=E\big[Y_{j}^{a,b}\big]$, for each of the two models presented in Section 3.2. This also immediately gives the efficient influence function for $\beta^{a,b}=\frac{1}{2}E\big[Y_{1}^{a,b}+Y_{2}^{a,b}\big]$, and for sums and differences of such parameters. We use the theory, terminology, and notation of Tsiatis [25] and Scharfstein et al. [20].

We assume that we have $n$ independent pairs of twins, and that the data from these twin pairs constitute $n$ i.i.d. draws from some distribution. Let $O=\big(C,X_{1},X_{2},A_{1},A_{2},Y_{1},Y_{2}\big)$ be the observed data for each twin pair. Let $Z=\big(C,X_{1},X_{2},A_{1},A_{2},Y_{1}^{1,1},Y_{2}^{1,1},Y_{1}^{1,0},Y_{2}^{1,0},Y_{1}^{0,1},Y_{2}^{0,1},Y_{1}^{0,0},Y_{2}^{0,0}\big)$ be the full data for each twin pair. Under the consistency assumption, the observed data is a coarsening of the full data, since $Y_{j}=\sum_{a,b=0,1}\mathds{1}(A_{j}=a,A_{3-j}=b)\,Y_{j}^{a,b}$ for each $j=1,2$. We partition the full data $Z$ into $L=\big(C,X_{1},X_{2},Y_{1}^{1,1},Y_{2}^{1,1},Y_{1}^{1,0},Y_{2}^{1,0},Y_{1}^{0,1},Y_{2}^{0,1},Y_{1}^{0,0},Y_{2}^{0,0}\big)$ and $R=(A_{1},A_{2})$, where $R$ determines which components of $Z$ are observed.

Let $\mathcal{H}^{O}$ and $\mathcal{H}^{Z}$ denote the Hilbert spaces of mean-zero functions of the observed data and of the full data, respectively, with the covariance inner product. Full-data influence functions for $\beta_{j}^{a,b}$ are elements of the space $\Lambda^{\perp}$, the orthogonal complement of the full-data nuisance tangent space $\Lambda=\Lambda(F_{L})\oplus\Lambda(F_{R|L})$. Observed-data influence functions are elements of the space $\Lambda^{O,\perp}$, the orthogonal complement of the observed-data nuisance tangent space $\Lambda^{O}=\Lambda_{1}^{O}+\Lambda_{2}^{O}$, where, writing $\Lambda_{1}=\Lambda(F_{L})$ and $\Lambda_{2}=\Lambda(F_{R|L})$, we define $\Lambda_{j}^{O}=E[\Lambda_{j}|O]:=\big\{g(O):g(O)=E[h(Z)|O]\mbox{ for some }h(Z)\in\Lambda_{j}\big\}$. The efficient influence function $\varphi(O;\beta_{j}^{a,b})$ in a model is the unique influence function for $\beta_{j}^{a,b}$ which is an element of the observed-data tangent space $\mathcal{T}^{O}$ of that model; moreover, the efficient influence function is the projection of any influence function onto $\mathcal{T}^{O}$.

B.1 The efficient influence function in Model 1

Model 1 is the nonparametric model which places no restrictions on the observed data. This implies that there is just a single observed-data influence function $\varphi_{1}(O;\beta_{j}^{a,b})$ for $\beta_{j}^{a,b}$ in Model 1, which is therefore the efficient influence function. By theory in [20] and [25], the observed-data influence function may be found by using a full-data influence function for $\beta_{j}^{a,b}$, say $\varphi^{Z}\in\Lambda(F_{L})^{\perp}$, and the mapping $K:\mathcal{H}^{O}\to\mathcal{H}^{L}$ defined by $K(g)=E\big[g|L\big]$. An element $h(O)$ in the inverse image of $\varphi^{Z}$ under this mapping is in the space $\Lambda_{1}^{O,\perp}$. This can be seen using adjoint operators: if we consider the linear map $A:\Lambda_{1}\to\mathcal{H}^{O}$ defined by $A(h)=E[h|O]$, then $\Lambda_{1}^{O}$ is the range of $A$. Therefore, by properties of adjoint operators, $\Lambda_{1}^{O,\perp}$ is the null space of $A^{*}$, the adjoint of $A$. In this case $A^{*}(g)=E[g|L]-\Pi\big(E[g|L]\,\big|\,\Lambda_{1}^{\perp}\big)$, and therefore $\Lambda_{1}^{O,\perp}=\big\{g(O):E[g|L]\in\Lambda_{1}^{\perp}\big\}$. The residual $h(O)-\Pi\big(h(O)|\Lambda_{2}^{O}\big)$ from projecting $h(O)$ onto the space $\Lambda_{2}^{O}$ is in the space $\Pi(\Lambda_{1}^{O,\perp}|\Lambda_{2}^{O,\perp})$. In our case, the spaces $\Lambda_{1}^{O}$ and $\Lambda_{2}^{O}$ are orthogonal, which implies that $\Pi(\Lambda_{1}^{O,\perp}|\Lambda_{2}^{O,\perp})$ is equal to $\Lambda_{1}^{O,\perp}\cap\Lambda_{2}^{O,\perp}=\Lambda^{O,\perp}$. Therefore the residual is in $\Lambda^{O,\perp}$ and is thus the observed-data influence function for $\beta_{j}^{a,b}$ in Model 1.

A full-data influence function for $\beta_{j}^{a,b}$ is $\varphi^{Z}=Y_{j}^{a,b}-\beta_{j}^{a,b}$. To verify this, set $\epsilon:=Y_{j}^{a,b}-\beta_{j}^{a,b}$, and partition the parameters of $F_{L}$ into variation-independent components $\beta_{j}^{a,b}$ and $\eta$, where $\eta$ is the distribution of $\big(\tilde{L},\epsilon\big)$ and $\tilde{L}$ is formed by deleting $Y_{j}^{a,b}$ from the vector $L$. Then there is a one-to-one transformation from $dF(L)$ to $dF\big(\epsilon,\tilde{L}\big)$, which we factor as $dF(\epsilon)\times dF\big(\tilde{L}|\epsilon\big)$. Because the only constraint on these distributions is that $\epsilon$ is mean zero, we can show that the nuisance tangent space for $L$ is $\Lambda(F_{L})=\big\{h(\epsilon):E\big[h(\epsilon)\big]=E\big[\epsilon h(\epsilon)\big]=0\big\}\oplus\big\{h(L):E\big[h(L)\,|\,\epsilon\big]=0\big\}$. Therefore $\epsilon\in\Lambda(F_{L})^{\perp}$ as claimed, since $\epsilon$ is orthogonal to each of the direct summands.

An element of $\mathcal{H}^{O}$ which is in the inverse image of $\varphi^{Z}=Y_{j}^{a,b}-\beta_{j}^{a,b}$ under the mapping $K$, and hence in $\Lambda_{1}^{O,\perp}$, is:
\begin{align*}
g^{*}(O)=\frac{\mathds{1}(A_{j}=a,A_{3-j}=b)}{P(A_{j}=a,A_{3-j}=b\mid C,X_{1},X_{2})}\Big(Y_{j}-\beta_{j}^{a,b}\Big).
\end{align*}
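To illustrate, the following Monte Carlo sketch (a hypothetical simulated data set, with the shared factor $C$ fully observed and the joint propensity known by construction) checks that $g^{*}(O)$ has mean zero at the true $\beta_{j}^{a,b}$, so that solving its empirical moment equation yields an inverse-probability-weighted estimate. Here we take $(a,b)=(1,0)$ and an additive outcome model; all variable names and effect sizes are illustrative.

```python
import numpy as np

# Hypothetical DGP: binary shared covariate C, exposures independent given C,
# outcome Y1 = d*A1 + s*A2 + C + noise.  Then E[Y_1^{1,0}] = d + E[C].
rng = np.random.default_rng(1)
n, d, s = 200_000, 1.0, 0.4

C = rng.binomial(1, 0.5, size=n)
p = 0.3 + 0.4 * C                          # P(A_j = 1 | C), known here
A1, A2 = rng.binomial(1, p), rng.binomial(1, p)   # A1, A2 independent given C
Y1 = d * A1 + s * A2 + C + rng.normal(size=n)

beta_true = d + 0.5                        # E[Y_1^{1,0}] = d + E[C]
pi = p * (1 - p)                           # P(A1=1, A2=0 | C)
ind = (A1 == 1) & (A2 == 0)

# g*(O) has mean zero at the truth; solving the empirical moment equation
# sum_i g*(O_i) = 0 for beta gives the IPW estimate of beta_1^{1,0}.
g_star = ind / pi * (Y1 - beta_true)
beta_ipw = (ind / pi * Y1).mean()
print(g_star.mean(), beta_ipw)             # ≈ 0 and ≈ 1.5
```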

The space $\Lambda(F_{R|L})$ is $\big\{h_{2}(A_{1},A_{2},C,X_{1},X_{2}):E\big[h_{2}(A_{1},A_{2},C,X_{1},X_{2})\,|\,C,X_{1},X_{2}\big]=0\big\}$. Since elements of $\Lambda(F_{R|L})$ are functions of the observed data (due to the no unobserved confounding assumption), the space $\Lambda(F_{R|L})$ is equal to the space $\Lambda_{2}^{O}$. Projection onto $\Lambda_{2}^{O}$ is given by:

\begin{align*}
\Pi\big(\cdot\,|\,\Lambda_{2}^{O}\big)=E\big[\cdot\,|\,A_{1},A_{2},C,X_{1},X_{2}\big]-E\big[\cdot\,|\,C,X_{1},X_{2}\big],
\end{align*}

as this operation yields elements that are in $\Lambda_{2}^{O}$, and whose residuals are orthogonal to $\Lambda_{2}^{O}$. Below we write $\pi(C,X_{1},X_{2}):=P(A_{j}=a,A_{3-j}=b\mid C,X_{1},X_{2})$. Then:

\begin{align*}
\Pi\big(g^{*}(O)\,|\,\Lambda_{2}^{O}\big)=&\ \frac{\mathds{1}(A_{j}=a,A_{3-j}=b)}{\pi(C,X_{1},X_{2})}\ E\Big[Y_{j}^{a,b}-\beta_{j}^{a,b}\,\Big|\,A_{1},A_{2},C,X_{1},X_{2}\Big]\\
&\ -\frac{1}{\pi(C,X_{1},X_{2})}\ E\Big[\mathds{1}(A_{j}=a,A_{3-j}=b)\big(Y_{j}^{a,b}-\beta_{j}^{a,b}\big)\,\Big|\,C,X_{1},X_{2}\Big]\\
=&\ \frac{\mathds{1}(A_{j}=a,A_{3-j}=b)}{\pi(C,X_{1},X_{2})}\ E\Big[Y_{j}^{a,b}-\beta_{j}^{a,b}\,\Big|\,C,X_{1},X_{2}\Big]\\
&\ -\frac{1}{\pi(C,X_{1},X_{2})}\ E\Big[\mathds{1}(A_{j}=a,A_{3-j}=b)\,\Big|\,C,X_{1},X_{2}\Big]\,E\Big[Y_{j}^{a,b}-\beta_{j}^{a,b}\,\Big|\,C,X_{1},X_{2}\Big]\\
=&\ \frac{\mathds{1}(A_{j}=a,A_{3-j}=b)}{\pi(C,X_{1},X_{2})}\ E\Big[Y_{j}^{a,b}-\beta_{j}^{a,b}\,\Big|\,A_{j}=a,A_{3-j}=b,C,X_{1},X_{2}\Big]\\
&\ -E\Big[Y_{j}^{a,b}-\beta_{j}^{a,b}\,\Big|\,A_{j}=a,A_{3-j}=b,C,X_{1},X_{2}\Big].
\end{align*}

Therefore, taking the residual, the efficient influence function for $\beta_{j}^{a,b}$ in Model 1 is

\begin{align*}
\varphi_{1}(O;\beta_{j}^{a,b})=&\ \frac{\mathds{1}(A_{j}=a,A_{3-j}=b)}{P(A_{j}=a,A_{3-j}=b\mid C,X_{1},X_{2})}\Big(Y_{j}-E\big[Y_{j}\,|\,A_{j}=a,A_{3-j}=b,C,X_{1},X_{2}\big]\Big)\\
&\ +E\big[Y_{j}\,|\,A_{j}=a,A_{3-j}=b,C,X_{1},X_{2}\big]-\beta_{j}^{a,b}.
\end{align*}
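Setting the sample average of $\varphi_{1}$ to zero and solving for $\beta_{j}^{a,b}$ gives the familiar augmented IPW (one-step) estimator. The sketch below illustrates this for $(a,b)=(1,0)$ on a hypothetical simulated data set, with a known propensity and the outcome regression $E[Y_{1}\,|\,A_{1}=1,A_{2}=0,C]$ estimated by cell means within levels of a binary shared covariate $C$; the data-generating process and all names are illustrative assumptions, not the paper's application.

```python
import numpy as np

# Hypothetical DGP, as in the IPW check: Y1 = d*A1 + s*A2 + C + noise,
# exposures independent given the binary shared covariate C.
rng = np.random.default_rng(2)
n, d, s = 200_000, 1.0, 0.4

C = rng.binomial(1, 0.5, size=n)
p = 0.3 + 0.4 * C                         # known propensity P(A_j = 1 | C)
A1, A2 = rng.binomial(1, p), rng.binomial(1, p)
Y1 = d * A1 + s * A2 + C + rng.normal(size=n)

pi = p * (1 - p)                          # P(A1=1, A2=0 | C)
ind = (A1 == 1) & (A2 == 0)

# Outcome regression: mean of Y1 in the (A1=1, A2=0) cell, within each level of C.
m = np.empty(n)
for c in (0, 1):
    m[C == c] = Y1[ind & (C == c)].mean()

# Solving (1/n) sum_i phi_1(O_i; beta) = 0 for beta gives the augmented IPW estimate.
beta_aipw = (ind / pi * (Y1 - m) + m).mean()
print(beta_aipw)                          # ≈ E[Y_1^{1,0}] = d + 0.5 = 1.5
```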

B.2 The efficient influence function in Model 2

Model 2, by contrast, does impose constraints on the observed data. In a model that imposes restrictions on the observed law, there will be multiple observed-data influence functions. An approach for deriving the efficient influence function in such a setting is to (i) find one observed-data influence function, say $\varphi_{naive}$, and (ii) project $\varphi_{naive}$ onto the observed-data tangent space $\mathcal{T}^{O}$ for the model. Here we will actually start by considering a slightly smaller model, which we will call Model 3, in which we impose the additional assumption that $\big(Y_{1}^{1,1},\ldots,Y_{1}^{0,0}\big)\perp\!\!\!\perp\big(Y_{2}^{1,1},\ldots,Y_{2}^{0,0}\big)\,\big|\,\big(C,X_{1},X_{2}\big)$. The observed-data tangent space for Model 3 is more straightforward to compute than that for Model 2; therefore we will first derive the efficient influence function for $\beta_{j}^{a,b}$ in Model 3, then show that it is the same as the efficient influence function for $\beta_{j}^{a,b}$ in Model 2.

We start by recalling the assumptions placed on the full data in Model 3. Here we write $\tilde{Y}_{j}$ for the vector of potential outcomes $\big(Y_{j}^{1,1},Y_{j}^{1,0},Y_{j}^{0,1},Y_{j}^{0,0}\big)$:

\begin{align*}
\textbf{A1}:&\ \big(\tilde{Y}_{1},\tilde{Y}_{2}\big)\perp\!\!\!\perp(A_{1},A_{2})\,\big|\,(C,X_{1},X_{2}) &&\mbox{(exchangeability)}\\
\textbf{A2}:&\ \mbox{For all }c,x_{1},x_{2}\mbox{ in the support of }(C,X_{1},X_{2}), &&\mbox{(positivity)}\\
&\ \ \ P\big(A_{1}=a,A_{2}=b\,|\,C=c,X_{1}=x_{1},X_{2}=x_{2}\big)>0\mbox{ for all }a,b=0,1\\
\textbf{A3}:&\ \mbox{If }A_{1}=a,A_{2}=b,\mbox{ then }Y_{1}=Y_{1}^{a,b},\ Y_{2}=Y_{2}^{b,a} &&\mbox{(consistency)}\\
\textbf{A4}:&\ A_{j}\perp\!\!\!\perp X_{3-j}\,\big|\,(C,X_{j})\\
\textbf{A5}:&\ \tilde{Y}_{j}\perp\!\!\!\perp X_{3-j}\,\big|\,(C,X_{j})\\
\textbf{A6}:&\ A_{1}\perp\!\!\!\perp A_{2}\,\big|\,(C,X_{1},X_{2})\\
\textbf{A7}:&\ \tilde{Y}_{1}\perp\!\!\!\perp\tilde{Y}_{2}\,\big|\,(C,X_{1},X_{2})
\end{align*}

Finally, we also assume the property of composition, meaning that for any sets of variables $U,V,W,Z$, if $U\perp\!\!\!\perp V|Z$ and $U\perp\!\!\!\perp W|Z$, then it also holds that $U\perp\!\!\!\perp(V,W)|Z$, as with graphical $d$-separation. The following independencies follow from the assumptions listed above:

  • B1: $\tilde{Y}_{j}\perp\!\!\!\perp(A_{1},A_{2})\,|\,(C,X_{j})$.
\begin{align*}
dF(a_{1},a_{2}\mid\tilde{y}_{j},c,x_{j})=&\ \int_{x_{3-j}}dF(a_{1},a_{2}\mid\tilde{y}_{j},c,x_{j},x_{3-j})\,dF(x_{3-j}\mid\tilde{y}_{j},c,x_{j})\\
=&\ \int_{x_{3-j}}dF(a_{1},a_{2}\mid c,x_{j},x_{3-j})\,dF(x_{3-j}\mid\tilde{y}_{j},c,x_{j}) &&\mbox{A1}\\
=&\ \int_{x_{3-j}}dF(a_{1},a_{2}\mid c,x_{j},x_{3-j})\,dF(x_{3-j}\mid c,x_{j}) &&\mbox{A5}\\
=&\ dF(a_{1},a_{2}\mid c,x_{j}).
\end{align*}
  • B2: $\tilde{Y}_{j}\perp\!\!\!\perp X_{3-j}\,|\,(A_{1},A_{2},C,X_{j})$. By B1, A5, and composition, we have $\tilde{Y}_{j}\perp\!\!\!\perp(A_{1},A_{2},X_{3-j})\,|\,(C,X_{j})$. Now the result follows immediately from the weak union property of conditional independence, which states that if $U\perp\!\!\!\perp(V,W)|Z$, then $U\perp\!\!\!\perp V|(W,Z)$.

  • B3: $\tilde{Y}_{j}\perp\!\!\!\perp X_{3-j}\,|\,(Y_{j},A_{1},A_{2},C,X_{j})$. By A3, for each $a,b=0,1$, $dF\big(x_{3-j},\tilde{y}_{j}\mid A_{j}=a,A_{3-j}=b,c,x_{j},Y_{j}=y_{j}\big)=dF\big(x_{3-j},\tilde{y}_{j}\mid A_{j}=a,A_{3-j}=b,c,x_{j},Y_{j}^{a,b}=y_{j}\big)$. Therefore it suffices to show that $\tilde{Y}_{j}\perp\!\!\!\perp X_{3-j}\,|\,\big(Y_{j}^{a,b},A_{j}=a,A_{3-j}=b,C,X_{j}\big)$, and this follows immediately from B2 by weak union.

  • B4: $\tilde{Y}_{j}\perp\!\!\!\perp Y_{3-j}\,|\,(A_{1},A_{2},C,X_{1},X_{2})$. By A1, A7, and composition, $\tilde{Y}_{j}\perp\!\!\!\perp(\tilde{Y}_{3-j},A_{1},A_{2})\,|\,(C,X_{1},X_{2})$. Since the observed outcome $Y_{3-j}$ is a function of $\tilde{Y}_{3-j},A_{1},A_{2}$, this implies $\tilde{Y}_{j}\perp\!\!\!\perp(Y_{3-j},A_{1},A_{2})\,|\,(C,X_{1},X_{2})$. Now the result follows by weak union.

  • B5: $\tilde{Y}_{j}\perp\!\!\!\perp Y_{3-j}\,|\,(Y_{j},A_{1},A_{2},C,X_{1},X_{2})$. By B4 and weak union, $\tilde{Y}_{j}\perp\!\!\!\perp Y_{3-j}\,|\,(Y_{j}^{a,b},A_{j}=a,A_{3-j}=b,C,X_{1},X_{2})$. By A3, this implies $\tilde{Y}_{j}\perp\!\!\!\perp Y_{3-j}\,|\,(Y_{j},A_{j}=a,A_{3-j}=b,C,X_{1},X_{2})$.

  • B6: $\tilde{Y}_{j}\perp\!\!\!\perp(\tilde{Y}_{3-j},X_{3-j})\,|\,(A_{1},A_{2},C,X_{j})$. By B1, $\tilde{Y}_{j}\perp\!\!\!\perp(A_{1},A_{2})\,|\,(C,X_{j})$, and by A5, $\tilde{Y}_{j}\perp\!\!\!\perp X_{3-j}\,|\,(C,X_{j})$. Therefore by composition, $\tilde{Y}_{j}\perp\!\!\!\perp(A_{1},A_{2},X_{3-j})\,|\,(C,X_{j})$. We show that $\tilde{Y}_{j}\perp\!\!\!\perp\tilde{Y}_{3-j}\,|\,(C,X_{j})$:
\begin{align*}
dF(\tilde{y}_{3-j}\mid\tilde{y}_{j},c,x_{j})=&\ \int_{x_{3-j}}dF(\tilde{y}_{3-j}\mid\tilde{y}_{j},c,x_{j},x_{3-j})\,dF(x_{3-j}\mid\tilde{y}_{j},c,x_{j})\\
=&\ \int_{x_{3-j}}dF(\tilde{y}_{3-j}\mid c,x_{j},x_{3-j})\,dF(x_{3-j}\mid\tilde{y}_{j},c,x_{j}) &&\mbox{A7}\\
=&\ \int_{x_{3-j}}dF(\tilde{y}_{3-j}\mid c,x_{j},x_{3-j})\,dF(x_{3-j}\mid c,x_{j}) &&\mbox{A5}\\
=&\ dF(\tilde{y}_{3-j}\mid c,x_{j}).
\end{align*}
    Therefore by composition $\tilde{Y}_{j}\perp\!\!\!\perp(\tilde{Y}_{3-j},X_{3-j},A_{1},A_{2})\,|\,(C,X_{j})$, and this implies $\tilde{Y}_{j}\perp\!\!\!\perp(\tilde{Y}_{3-j},X_{3-j})\,|\,(A_{1},A_{2},C,X_{j})$ by weak union.

  • B7: $A_{1}\perp\!\!\!\perp A_{2}\,|\,(C,X_{j})$.
\begin{align*}
dF(a_{3-j}\mid a_{j},c,x_{j})=&\ \int_{x_{3-j}}dF(a_{3-j}\mid a_{j},c,x_{j},x_{3-j})\,dF(x_{3-j}\mid a_{j},c,x_{j})\\
=&\ \int_{x_{3-j}}dF(a_{3-j}\mid a_{j},c,x_{j},x_{3-j})\,dF(x_{3-j}\mid c,x_{j}) &&\mbox{A4}\\
=&\ \int_{x_{3-j}}dF(a_{3-j}\mid c,x_{j},x_{3-j})\,dF(x_{3-j}\mid c,x_{j}) &&\mbox{A6}\\
=&\ dF(a_{3-j}\mid c,x_{j}).
\end{align*}

We now derive the observed data tangent space for Model 3, by first computing the full data tangent space, then showing how to move from the full data to the observed data tangent space. In Model 3 the full data likelihood factors as:

\begin{align*}
dF(Z)=&\ dF(c,x_{1},x_{2})\,dF\big(\tilde{y}_{1},\tilde{y}_{2}\mid c,x_{1},x_{2}\big)\,dF\big(a_{1},a_{2}\mid\tilde{y}_{1},\tilde{y}_{2},c,x_{1},x_{2}\big)\\
=&\ dF(c,x_{1},x_{2})\,dF\big(\tilde{y}_{1},\tilde{y}_{2}\mid c,x_{1},x_{2}\big)\,dF\big(a_{1},a_{2}\mid c,x_{1},x_{2}\big) &&\mbox{A1}\\
=&\ dF(c,x_{1},x_{2})\,dF\big(\tilde{y}_{1},\tilde{y}_{2}\mid c,x_{1},x_{2}\big)\,dF(a_{1}\mid c,x_{1},x_{2})\,dF(a_{2}\mid c,x_{1},x_{2}) &&\mbox{A6}\\
=&\ dF(c,x_{1},x_{2})\,dF\big(\tilde{y}_{1},\tilde{y}_{2}\mid c,x_{1},x_{2}\big)\,dF(a_{1}\mid c,x_{1})\,dF(a_{2}\mid c,x_{2}) &&\mbox{A4}\\
=&\ dF(c,x_{1},x_{2})\,dF\big(\tilde{y}_{1}\mid c,x_{1},x_{2}\big)\,dF\big(\tilde{y}_{2}\mid c,x_{1},x_{2}\big)\,dF(a_{1}\mid c,x_{1})\,dF(a_{2}\mid c,x_{2}) &&\mbox{A7}\\
=&\ dF(c,x_{1},x_{2})\,dF\big(\tilde{y}_{1}\mid c,x_{1}\big)\,dF\big(\tilde{y}_{2}\mid c,x_{2}\big)\,dF(a_{1}\mid c,x_{1})\,dF(a_{2}\mid c,x_{2}). &&\mbox{A5}
\end{align*}

The full-data tangent space is $\mathcal{T}=\mathcal{T}_{1}\oplus\ldots\oplus\mathcal{T}_{5}$, where, writing $\pi_{j}:=P(A_{j}=1\mid C,X_{j})$:

\begin{align*}
\mathcal{T}_{1}=&\ \big\{h_{1}(C,X_{1},X_{2}):E\big[h_{1}(C,X_{1},X_{2})\big]=0\big\}\\
\mathcal{T}_{2}=&\ \big\{h_{2}(\tilde{Y}_{1},C,X_{1}):E\big[h_{2}(\tilde{Y}_{1},C,X_{1})\,|\,C,X_{1}\big]=0\big\}\\
\mathcal{T}_{3}=&\ \big\{h_{3}(\tilde{Y}_{2},C,X_{2}):E\big[h_{3}(\tilde{Y}_{2},C,X_{2})\,|\,C,X_{2}\big]=0\big\}\\
\mathcal{T}_{4}=&\ \big\{(A_{1}-\pi_{1})h_{4}(C,X_{1}):h_{4}(C,X_{1})\mbox{ any function}\big\}\\
\mathcal{T}_{5}=&\ \big\{(A_{2}-\pi_{2})h_{5}(C,X_{2}):h_{5}(C,X_{2})\mbox{ any function}\big\}.
\end{align*}

The observed-data tangent space is $\mathcal{T}^{O}=\mathcal{T}_{1}^{O}+\ldots+\mathcal{T}_{5}^{O}$, where $\mathcal{T}_{k}^{O}=E\big[\mathcal{T}_{k}|O\big]$. For $k=1,4,5$, $\mathcal{T}_{k}^{O}=\mathcal{T}_{k}$, while $\mathcal{T}_{2}^{O}=\big\{E\big[h_{2}(\tilde{Y}_{1},C,X_{1})|O\big]:E[h_{2}(\tilde{Y}_{1},C,X_{1})|C,X_{1}]=0\big\}$ and $\mathcal{T}_{3}^{O}=\big\{E\big[h_{3}(\tilde{Y}_{2},C,X_{2})|O\big]:E[h_{3}(\tilde{Y}_{2},C,X_{2})|C,X_{2}]=0\big\}$.

We claim that the five spaces $\mathcal{T}_{k}^{O},k=1,\ldots,5$ are mutually orthogonal. To show that $\mathcal{T}_{1}^{O}\perp\mathcal{T}_{2}^{O}$, let $g_{1}(C,X_{1},X_{2})\in\mathcal{T}_{1}^{O}$ and $g_{2}(O)=E[h_{2}(\tilde{Y}_{1},C,X_{1})|O]\in\mathcal{T}_{2}^{O}$, where $E\big[h_{2}(\tilde{Y}_{1},C,X_{1})|C,X_{1}\big]=0$. Then:

\begin{align*}
E\big[g_{1}(C,X_{1},X_{2})g_{2}(O)\big]=&\ E\Big[g_{1}(C,X_{1},X_{2})E\big[h_{2}(\tilde{Y}_{1},C,X_{1})\,|\,O\big]\Big]\\
=&\ E\Big[g_{1}(C,X_{1},X_{2})h_{2}(\tilde{Y}_{1},C,X_{1})\Big]\\
=&\ E\Big[g_{1}(C,X_{1},X_{2})E\big[\underbrace{h_{2}(\tilde{Y}_{1},C,X_{1})}_{\mbox{random in }\tilde{Y}_{1}\mbox{ only}}\,|\,C,X_{1},X_{2}\big]\Big]\\
=&\ E\Big[g_{1}(C,X_{1},X_{2})\underbrace{E\big[h_{2}(\tilde{Y}_{1},C,X_{1})\,|\,C,X_{1}\big]}_{=0}\Big]=0. &&\mbox{A5}
\end{align*}

All other parts of the claim follow similarly from the orthogonality of the full-data spaces $\mathcal{T}_{1},\ldots,\mathcal{T}_{5}$, except for showing that $\mathcal{T}_{2}^{O}$ and $\mathcal{T}_{3}^{O}$ are orthogonal. To show that $\mathcal{T}_{2}^{O}\perp\mathcal{T}_{3}^{O}$, let $g_{2}(O)=E\big[h_{2}(\tilde{Y}_{1},C,X_{1})|O\big]\in\mathcal{T}_{2}^{O}$ where $E\big[h_{2}(\tilde{Y}_{1},C,X_{1})|C,X_{1}\big]=0$, and let $g_{3}(O)=E\big[h_{3}(\tilde{Y}_{2},C,X_{2})|O\big]\in\mathcal{T}_{3}^{O}$, where $E\big[h_{3}(\tilde{Y}_{2},C,X_{2})|C,X_{2}\big]=0$. Then:

\begin{align*}
E&\big[g_{2}(O)g_{3}(O)\big]\\
=&\ E\Big[E\big[h_{2}(\tilde{Y}_{1},C,X_{1})\,|\,O\big]E\big[h_{3}(\tilde{Y}_{2},C,X_{2})\,|\,O\big]\Big]\\
=&\ E\Big[E\big[h_{2}(\tilde{Y}_{1},C,X_{1})\,|\,Y_{1},Y_{2},A_{1},A_{2},C,X_{1},X_{2}\big]E\big[h_{3}(\tilde{Y}_{2},C,X_{2})\,|\,O\big]\Big]\\
=&\ E\Big[E\big[h_{2}(\tilde{Y}_{1},C,X_{1})\,|\,Y_{1},A_{1},A_{2},C,X_{1},X_{2}\big]E\big[h_{3}(\tilde{Y}_{2},C,X_{2})\,|\,O\big]\Big] &&\mbox{B5}\\
=&\ E\Big[E\big[h_{2}(\tilde{Y}_{1},C,X_{1})\,|\,Y_{1},A_{1},A_{2},C,X_{1}\big]E\big[h_{3}(\tilde{Y}_{2},C,X_{2})\,|\,O\big]\Big] &&\mbox{B3}\\
=&\ E\Big[E\big[h_{2}(\tilde{Y}_{1},C,X_{1})\,|\,Y_{1},A_{1},A_{2},C,X_{1}\big]E\Big[E\big[h_{3}(\tilde{Y}_{2},C,X_{2})\,|\,O\big]\,\Big|\,Y_{1},A_{1},A_{2},C,X_{1},X_{2}\Big]\Big]\\
=&\ E\Big[E\big[h_{2}(\tilde{Y}_{1},C,X_{1})\,|\,Y_{1},A_{1},A_{2},C,X_{1}\big]E\Big[h_{3}(\tilde{Y}_{2},C,X_{2})\,\Big|\,Y_{1},A_{1},A_{2},C,X_{1},X_{2}\Big]\Big]\\
=&\ E\Big[E\big[h_{2}(\tilde{Y}_{1},C,X_{1})\,|\,Y_{1},A_{1},A_{2},C,X_{1}\big]E\Big[h_{3}(\tilde{Y}_{2},C,X_{2})\,\Big|\,A_{1},A_{2},C,X_{1},X_{2}\Big]\Big] &&\mbox{B4}\\
=&\ E\Big[E\big[h_{2}(\tilde{Y}_{1},C,X_{1})\,|\,Y_{1},A_{1},A_{2},C,X_{1}\big]E\Big[h_{3}(\tilde{Y}_{2},C,X_{2})\,\Big|\,A_{1},A_{2},C,X_{2}\Big]\Big] &&\mbox{B2}\\
=&\ E\Big[E\big[h_{2}(\tilde{Y}_{1},C,X_{1})\,|\,Y_{1},A_{1},A_{2},C,X_{1}\big]\underbrace{E\Big[h_{3}(\tilde{Y}_{2},C,X_{2})\,\Big|\,C,X_{2}\Big]}_{=0}\Big]=0. &&\mbox{B1}
\end{align*}

Since $\mathcal{T}^{O}$ is a sum of five mutually orthogonal spaces, projection onto $\mathcal{T}^{O}$ is given by the sum of the projections onto each of the five spaces. Projection onto $\mathcal{T}_{1}^{O}$, $\mathcal{T}_{4}^{O}$, and $\mathcal{T}_{5}^{O}$ is straightforward. For projection onto $\mathcal{T}_{2}^{O}$, we will use the following claim (and analogously for $\mathcal{T}_{3}^{O}$):

Claim: $\mathcal{T}_{2}^{O}$ is equal to the space $S_{2}$, where
\begin{align*}
S_{2}=\big\{h_{2}(Y_{1},A_{1},A_{2},C,X_{1}):E\big[h_{2}(Y_{1},A_{1},A_{2},C,X_{1})\,|\,A_{1},A_{2},C,X_{1}\big]=0\big\}.
\end{align*}

To show the claim, we first show that $\mathcal{T}_{2}^{O}\subseteq S_{2}$. Let $g_{2}(O)=E\big[h_{2}(\tilde{Y}_{1},C,X_{1})|O\big]\in\mathcal{T}_{2}^{O}$. As shown in the preceding proof, $g_{2}(O)=E\big[h_{2}(\tilde{Y}_{1},C,X_{1})\,|\,Y_{1},A_{1},A_{2},C,X_{1}\big]$ by B5 and B3, which shows that $g_{2}(O)$ is a function of $Y_{1},A_{1},A_{2},C,X_{1}$ only. We must also check that $E\big[g_{2}(O)\,|\,A_{1},A_{2},C,X_{1}\big]=0$:

E[g2(O)|A1,A2,C,X1]=\displaystyle E\big{[}g_{2}(O)|A_{1},A_{2},C,X_{1}\big{]}= E[E[h2(Y~1,C,X1)|O]|A1,A2,C,X1]\displaystyle\ E\Big{[}E\big{[}h_{2}(\tilde{Y}_{1},C,X_{1})|O\big{]}|A_{1},A_{2},C,X_{1}\big{]}
=\displaystyle= E[h2(Y~1,C,X1)|A1,A2,C,X1]\displaystyle\ E\big{[}h_{2}(\tilde{Y}_{1},C,X_{1})|A_{1},A_{2},C,X_{1}\big{]}
=\displaystyle= E[h2(Y~1,C,X1)|C,X1]=0.\displaystyle\ E\big{[}h_{2}(\tilde{Y}_{1},C,X_{1})|C,X_{1}\big{]}=0. B1

Therefore g2(O)S2g_{2}(O)\in S_{2}. To show the opposite containment, we will show that 𝒯2O,S2\mathcal{T}_{2}^{O,\perp}\subseteq S_{2}^{\perp}, which is equivalent to 𝒯2OS2\mathcal{T}_{2}^{O}\supseteq S_{2}. We begin by computing 𝒯2O,\mathcal{T}_{2}^{O,\perp}, using properties of adjoint operators.

Consider the linear map A:𝒯2OA:\mathcal{T}_{2}\to\mathcal{H}^{O} given by A(h)=E[h|O]A(h)=E[h|O]. The adjoint of AA is the map
A:O𝒯2A^{*}:\mathcal{H}^{O}\to\mathcal{T}_{2} such that E[A(h2)g]=E[h2A(g)]E\big{[}A(h_{2})\cdot g\big{]}=E\big{[}h_{2}\cdot A^{*}(g)\big{]} for all h2𝒯2h_{2}\in\mathcal{T}_{2} and all gOg\in\mathcal{H}^{O}. The range of AA is 𝒯2O\mathcal{T}_{2}^{O}. Therefore, by properties of adjoint operators, 𝒯2O,\mathcal{T}_{2}^{O,\perp} is the null space of AA^{*}. We show that AA^{*} is given by A(g)=E[g|Y~1,C,X1]E[g|C,X1]A^{*}(g)=E\big{[}g|\tilde{Y}_{1},C,X_{1}\big{]}-E\big{[}g|C,X_{1}\big{]}. First, note that for any gOg\in\mathcal{H}^{O}, E[g|Y~1,C,X1]E[g|C,X1]E\big{[}g|\tilde{Y}_{1},C,X_{1}\big{]}-E\big{[}g|C,X_{1}\big{]} is indeed an element in 𝒯2\mathcal{T}_{2}, since it is a function of Y~1,C,X1\tilde{Y}_{1},C,X_{1} that is mean zero given C,X1C,X_{1}. Now we check that E[A(h2)g]=E[h2A(g)]E\big{[}A(h_{2})g\big{]}=E\big{[}h_{2}A^{*}(g)\big{]}:

E[A(h2(Y~1,C,X1))g(O)]=\displaystyle E\Big{[}A\Big{(}h_{2}(\tilde{Y}_{1},C,X_{1})\Big{)}g(O)\Big{]}= E[E[h2(Y~1,C,X1)|O]g(O)]\displaystyle\ E\Big{[}E\big{[}h_{2}(\tilde{Y}_{1},C,X_{1})|O\big{]}g(O)\Big{]}
=\displaystyle= E[h2(Y~1,C,X1)g(O)]\displaystyle\ E\Big{[}h_{2}(\tilde{Y}_{1},C,X_{1})g(O)\Big{]}
=\displaystyle= E[h2(Y~1,C,X1)E[g(O)|Y~1,C,X1]]\displaystyle\ E\Big{[}h_{2}(\tilde{Y}_{1},C,X_{1})E\big{[}g(O)|\tilde{Y}_{1},C,X_{1}\big{]}\Big{]}
=\displaystyle= E[h2(Y~1,C,X1){E[g(O)|Y~1,C,X1]E[g(O)|C,X1]}]\displaystyle\ E\Big{[}h_{2}(\tilde{Y}_{1},C,X_{1})\Big{\{}E[g(O)|\tilde{Y}_{1},C,X_{1}]-E[g(O)|C,X_{1}]\Big{\}}\Big{]}

where the last equality follows since

E[h2(Y~1,C,X1)E[g(O)|C,X1]]=E[E[h2(Y~1,C,X1)|C,X1]=0E[g|C,X1]]=0.E\Big{[}h_{2}(\tilde{Y}_{1},C,X_{1})E[g(O)|C,X_{1}]\Big{]}=E\Big{[}\underbrace{E[h_{2}(\tilde{Y}_{1},C,X_{1})|C,X_{1}]}_{=0}E[g|C,X_{1}]\Big{]}=0.

Therefore,

𝒯2O,={g(O):E[g|Y~1,C,X1]=E[g|C,X1]}.\mathcal{T}_{2}^{O,\perp}=\Big{\{}g(O):E\big{[}g|\tilde{Y}_{1},C,X_{1}\big{]}=E\big{[}g|C,X_{1}\big{]}\Big{\}}.

Now let WW be the set

W={g(O):E[g(O)|Y~1,A1,A2,C,X1]=E[g(O)|A1,A2,C,X1]}.W=\Big{\{}g(O):E\big{[}g(O)|\tilde{Y}_{1},A_{1},A_{2},C,X_{1}\big{]}=E\big{[}g(O)|A_{1},A_{2},C,X_{1}\big{]}\Big{\}}.

We will show the chain of containments 𝒯2O,WS2\mathcal{T}_{2}^{O,\perp}\subseteq W\subseteq S_{2}^{\perp}.

We first show that, for any g(O)Og(O)\in\mathcal{H}^{O}, for each a,ba,b, E[g(O)|Y~1,A1=a,A2=b,C,X1]E\big{[}g(O)|\tilde{Y}_{1},A_{1}=a,A_{2}=b,C,X_{1}\big{]} is a function of Y1a,bY_{1}^{a,b}, CC, and X1X_{1} only. This follows from B6: since (Y~2,X2)Y~1|(A1,A2,C,X1)(\tilde{Y}_{2},X_{2})\perp\!\!\!\perp\tilde{Y}_{1}|(A_{1},A_{2},C,X_{1}), we can remove all but one of the potential outcomes in Y~1\tilde{Y}_{1} in the conditioning below.

E[g(O)|Y~1,\displaystyle E\big{[}g(O)|\tilde{Y}_{1}, A1=a,A2=b,C,X1]\displaystyle A_{1}=a,A_{2}=b,C,X_{1}\big{]}
=\displaystyle= E[g(Y1a,b,Y2a,b,a,b,C,X1,X2)random in Y2a,b,X2 only|Y~1,A1=a,A2=b,C,X1]\displaystyle\ E\big{[}\underbrace{g\big{(}Y_{1}^{a,b},Y_{2}^{a,b},a,b,C,X_{1},X_{2}\big{)}}_{\mbox{random in }Y_{2}^{a,b},X_{2}\mbox{ only}}|\tilde{Y}_{1},A_{1}=a,A_{2}=b,C,X_{1}\big{]}
=\displaystyle= E[g(Y1a,b,Y2a,b,a,b,C,X1,X2)|Y1a,b,A1=a,A2=b,C,X1]\displaystyle\ E\big{[}g\big{(}Y_{1}^{a,b},Y_{2}^{a,b},a,b,C,X_{1},X_{2}\big{)}|Y_{1}^{a,b},A_{1}=a,A_{2}=b,C,X_{1}\big{]} B6
=:\displaystyle=: hab(Y1a,b,C,X1)\displaystyle\ h_{ab}(Y_{1}^{a,b},C,X_{1})

Now by B1, P(A1=a,A2=b|C,X1,Y~1)=P(A1=a,A2=b|C,X1)P(A_{1}=a,A_{2}=b|C,X_{1},\tilde{Y}_{1})=P(A_{1}=a,A_{2}=b|C,X_{1}). Write πab=πab(C,X1)\pi_{ab}=\pi_{ab}(C,X_{1}) for this function. Then we have

E[g(O)|Y~1,C,X1]=\displaystyle E\big{[}g(O)|\tilde{Y}_{1},C,X_{1}\big{]}= E[E[g(O)|Y~1,A1,A2,C,X1]|Y~1,C,X1]\displaystyle\ E\Big{[}E\big{[}g(O)|\tilde{Y}_{1},A_{1},A_{2},C,X_{1}\big{]}|\tilde{Y}_{1},C,X_{1}\Big{]}
=\displaystyle= π11(C,X1)h11(Y11,1,C,X1)+π10(C,X1)h10(Y11,0,C,X1)+\displaystyle\ \pi_{11}(C,X_{1})h_{11}(Y_{1}^{1,1},C,X_{1})+\pi_{10}(C,X_{1})h_{10}(Y_{1}^{1,0},C,X_{1})+
π01(C,X1)h01(Y10,1,C,X1)+π00(C,X1)h00(Y10,0,C,X1).\displaystyle\hskip 21.68121pt\pi_{01}(C,X_{1})h_{01}(Y_{1}^{0,1},C,X_{1})+\pi_{00}(C,X_{1})h_{00}(Y_{1}^{0,0},C,X_{1}).

Now take g(O)𝒯2O,g(O)\in\mathcal{T}_{2}^{O,\perp}, so that E[g(O)|Y~1,C,X1]=E[g(O)|C,X1]E\big{[}g(O)|\tilde{Y}_{1},C,X_{1}\big{]}=E\big{[}g(O)|C,X_{1}\big{]}. We will show that
E[g(O)|Y~1,A1,A2,C,X1]=E[g(O)|A1,A2,C,X1]E\big{[}g(O)|\tilde{Y}_{1},A_{1},A_{2},C,X_{1}\big{]}=E\big{[}g(O)|A_{1},A_{2},C,X_{1}\big{]}. Isolating h11(Y11,1,C,X1)h_{11}(Y_{1}^{1,1},C,X_{1}) in the equation
E[g(O)|Y~1,C,X1]=E[g(O)|C,X1]E\big{[}g(O)|\tilde{Y}_{1},C,X_{1}\big{]}=E\big{[}g(O)|C,X_{1}\big{]}, we have

h11(Y11,1,C,X1)=E[g(O)|C,X1]π10h10(Y11,0,C,X1)π01h01(Y10,1,C,X1)π00h00(Y10,0,C,X1)π11.h_{11}(Y_{1}^{1,1},C,X_{1})=\frac{E\big{[}g(O)|C,X_{1}\big{]}-\pi_{10}h_{10}(Y_{1}^{1,0},C,X_{1})-\pi_{01}h_{01}(Y_{1}^{0,1},C,X_{1})-\pi_{00}h_{00}(Y_{1}^{0,0},C,X_{1})}{\pi_{11}}.

Since the right hand side does not depend on Y11,1Y_{1}^{1,1}, this shows that h11(Y11,1,C,X1)h_{11}(Y_{1}^{1,1},C,X_{1}) does not depend on Y11,1Y_{1}^{1,1}. Therefore h11(y11,1,C,X1):=E[g(O)|Y11,1=y11,1,A1=1,A2=1,C,X1]h_{11}(y_{1}^{1,1},C,X_{1}):=E\big{[}g(O)|Y_{1}^{1,1}=y_{1}^{1,1},A_{1}=1,A_{2}=1,C,X_{1}\big{]} is constant in y11,1y_{1}^{1,1}, and hence E[g(O)|Y~1,A1=1,A2=1,C,X1]=E[g(O)|A1=1,A2=1,C,X1]E\big{[}g(O)|\tilde{Y}_{1},A_{1}=1,A_{2}=1,C,X_{1}\big{]}=E\big{[}g(O)|A_{1}=1,A_{2}=1,C,X_{1}\big{]}.

Similarly, h10h_{10}, h01h_{01}, and h00h_{00} are functions of C,X1C,X_{1} only, and therefore E[g(O)|Y~1,A1=a,A2=b,C,X1]=E[g(O)|A1=a,A2=b,C,X1]E\big{[}g(O)|\tilde{Y}_{1},A_{1}=a,A_{2}=b,C,X_{1}\big{]}=E\big{[}g(O)|A_{1}=a,A_{2}=b,C,X_{1}\big{]} for each a,ba,b. Hence E[g(O)|Y~1,A1,A2,C,X1]=E[g(O)|A1,A2,C,X1]E\big{[}g(O)|\tilde{Y}_{1},A_{1},A_{2},C,X_{1}\big{]}=E\big{[}g(O)|A_{1},A_{2},C,X_{1}\big{]} as claimed, which shows that 𝒯2O,W\mathcal{T}_{2}^{O,\perp}\subseteq W.

Finally, we will show that WS2W\subseteq S_{2}^{\perp}. Let g1(O)Wg_{1}(O)\in W, so that E[g1(O)|Y~1,A1,A2,C,X1]=E[g1(O)|A1,A2,C,X1]E\big{[}g_{1}(O)|\tilde{Y}_{1},A_{1},A_{2},C,X_{1}\big{]}=E\big{[}g_{1}(O)|A_{1},A_{2},C,X_{1}\big{]}. We show that g1(O)g_{1}(O) is orthogonal to every g2(Y1,A1,A2,C,X1)S2g_{2}(Y_{1},A_{1},A_{2},C,X_{1})\in S_{2}:

E[g1(O)\displaystyle E\big{[}g_{1}(O) g2(Y1,A1,A2,C,X1)]\displaystyle g_{2}(Y_{1},A_{1},A_{2},C,X_{1})\big{]}
=\displaystyle= E[E[g1(O)|Y~1,A1,A2,C,X1]g2(Y1,A1,A2,C,X1)]\displaystyle\ E\Big{[}E\big{[}g_{1}(O)|\tilde{Y}_{1},A_{1},A_{2},C,X_{1}\big{]}g_{2}(Y_{1},A_{1},A_{2},C,X_{1})\Big{]}
=\displaystyle= E[E[g1(O)|A1,A2,C,X1]g2(Y1,A1,A2,C,X1)]\displaystyle\ E\Big{[}E\big{[}g_{1}(O)|A_{1},A_{2},C,X_{1}\big{]}g_{2}(Y_{1},A_{1},A_{2},C,X_{1})\Big{]}
=\displaystyle= E[E[g1(O)|A1,A2,C,X1]E[g2(Y1,A1,A2,C,X1)|A1,A2,C,X1]=0]=0.\displaystyle\ E\Big{[}E\big{[}g_{1}(O)|A_{1},A_{2},C,X_{1}\big{]}\underbrace{E\big{[}g_{2}(Y_{1},A_{1},A_{2},C,X_{1})|A_{1},A_{2},C,X_{1}\big{]}}_{=0}\Big{]}=0.

This completes the proof of the claim that 𝒯2O=S2\mathcal{T}_{2}^{O}=S_{2}. The analogous result holds for 𝒯3O\mathcal{T}_{3}^{O} by symmetry. Therefore the observed data tangent space for Model 3 is the direct sum of 5 mutually orthogonal spaces 𝒯O=𝒯1O𝒯5O\mathcal{T}^{O}=\mathcal{T}_{1}^{O}\oplus\ldots\oplus\mathcal{T}_{5}^{O}, where

𝒯1O=\displaystyle\mathcal{T}_{1}^{O}= {h1(C,X1,X2):E[h1(C,X1,X2)]=0}\displaystyle\ \Big{\{}h_{1}(C,X_{1},X_{2}):E\big{[}h_{1}(C,X_{1},X_{2})\big{]}=0\Big{\}}
𝒯2O=\displaystyle\mathcal{T}_{2}^{O}= {h2(Y1,A1,A2,C,X1):E[h2(Y1,A1,A2,C,X1)|A1,A2,C,X1]=0}\displaystyle\ \Big{\{}h_{2}(Y_{1},A_{1},A_{2},C,X_{1}):E\big{[}h_{2}(Y_{1},A_{1},A_{2},C,X_{1})|A_{1},A_{2},C,X_{1}\big{]}=0\Big{\}}
𝒯3O=\displaystyle\mathcal{T}_{3}^{O}= {h3(Y2,A1,A2,C,X2):E[h3(Y2,A1,A2,C,X2)|A1,A2,C,X2]=0}\displaystyle\ \Big{\{}h_{3}(Y_{2},A_{1},A_{2},C,X_{2}):E\big{[}h_{3}(Y_{2},A_{1},A_{2},C,X_{2})|A_{1},A_{2},C,X_{2}\big{]}=0\Big{\}}
𝒯4O=\displaystyle\mathcal{T}_{4}^{O}= {(A1π1)h4(C,X1):h4(C,X1) any function }\displaystyle\ \Big{\{}(A_{1}-\pi_{1})h_{4}(C,X_{1}):h_{4}(C,X_{1})\mbox{ any function }\Big{\}}
𝒯5O=\displaystyle\mathcal{T}_{5}^{O}= {(A2π2)h5(C,X2):h5(C,X2) any function }.\displaystyle\ \Big{\{}(A_{2}-\pi_{2})h_{5}(C,X_{2}):h_{5}(C,X_{2})\mbox{ any function }\Big{\}}.
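Since 𝒯2O\mathcal{T}_{2}^{O} and 𝒯3O\mathcal{T}_{3}^{O} each consist of mean-zero residuals of conditional expectations, projection onto them takes the standard residual form; we record it here for convenience, as it is used repeatedly in the calculations below:

```latex
\Pi\big(g\,\big|\,\mathcal{T}_{2}^{O}\big)
  = E\big[g\,|\,Y_{1},A_{1},A_{2},C,X_{1}\big]-E\big[g\,|\,A_{1},A_{2},C,X_{1}\big],
\qquad
\Pi\big(g\,\big|\,\mathcal{T}_{3}^{O}\big)
  = E\big[g\,|\,Y_{2},A_{1},A_{2},C,X_{2}\big]-E\big[g\,|\,A_{1},A_{2},C,X_{2}\big].
```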

The efficient influence function for βja,b\beta_{j}^{a,b} in Model 3 is the projection of any influence function for βja,b\beta_{j}^{a,b} in Model 3, say φ3naive\varphi_{3}^{naive}, onto 𝒯O\mathcal{T}^{O}, where by orthogonality, Π(|𝒯O)=k=15Π(|𝒯kO)\Pi\big{(}\cdot|\mathcal{T}^{O}\big{)}=\sum_{k=1}^{5}\Pi\big{(}\cdot|\mathcal{T}_{k}^{O}\big{)}.

Because Model 3 is a subset of Model 1, the full-data influence function φZ=Yja,bβja,b\varphi^{Z}=Y_{j}^{a,b}-\beta_{j}^{a,b} from Model 1 is also a full-data influence function for Model 3. Following the steps outlined in Section 1.1, the element

g(O):=𝟙(Aj=a,A3j=b)P(Aj=a,A3j=b|C,X1,X2)(Yjβja,b)g^{*}(O):=\frac{\mathds{1}(A_{j}=a,A_{3-j}=b)}{P(A_{j}=a,A_{3-j}=b|C,X_{1},X_{2})}\Big{(}Y_{j}-\beta_{j}^{a,b}\Big{)}

is in the space Λ1O\Lambda_{1}^{O} for Model 3. In Model 3, Λ2O=Λ2=𝒯4O𝒯5O\Lambda_{2}^{O}=\Lambda_{2}=\mathcal{T}_{4}^{O}\oplus\mathcal{T}_{5}^{O}, so one influence function for βja,b\beta_{j}^{a,b} in Model 3 is:

φ3naive:=g(O)Π(g(O)|Λ2O)=g(O)Π(g(O)|𝒯4O)Π(g(O)|𝒯5O).\varphi_{3}^{naive}:=g^{*}(O)-\Pi\big{(}g^{*}(O)|\Lambda_{2}^{O})=g^{*}(O)-\Pi\big{(}g^{*}(O)|\mathcal{T}_{4}^{O}\big{)}-\Pi\big{(}g^{*}(O)|\mathcal{T}_{5}^{O}\big{)}.

Therefore, the efficient influence function for βja,b\beta_{j}^{a,b} in Model 3 is:

φ3(O;βja,b)=\displaystyle\varphi_{3}(O;\beta_{j}^{a,b})= Π(φ3naive|𝒯O)\displaystyle\ \Pi\big{(}\varphi_{3}^{naive}|\mathcal{T}^{O}\big{)}
=\displaystyle= Π({g(O)Π(g(O)|𝒯4O)Π(g(O)|𝒯5O)}|𝒯O)\displaystyle\ \Pi\Big{(}\big{\{}g^{*}(O)-\Pi\big{(}g^{*}(O)|\mathcal{T}_{4}^{O}\big{)}-\Pi\big{(}g^{*}(O)|\mathcal{T}_{5}^{O}\big{)}\big{\}}\ \big{|}\mathcal{T}^{O}\Big{)}
=\displaystyle= k=15Π({g(O)Π(g(O)|𝒯4O)Π(g(O)|𝒯5O)}|𝒯kO)\displaystyle\ \sum_{k=1}^{5}\ \Pi\Big{(}\big{\{}g^{*}(O)-\Pi\big{(}g^{*}(O)|\mathcal{T}_{4}^{O}\big{)}-\Pi\big{(}g^{*}(O)|\mathcal{T}_{5}^{O}\big{)}\big{\}}\ \big{|}\mathcal{T}_{k}^{O}\Big{)}
=\displaystyle= Π(g(O)|𝒯1O)+Π(g(O)|𝒯2O)+Π(g(O)|𝒯3O),\displaystyle\ \Pi\big{(}g^{*}(O)|\mathcal{T}_{1}^{O}\big{)}+\Pi\big{(}g^{*}(O)|\mathcal{T}_{2}^{O}\big{)}+\Pi\big{(}g^{*}(O)|\mathcal{T}_{3}^{O}\big{)}, (1)

where the last line follows because Π(Π(g(O)|𝒯kO)|𝒯kO)=Π(g(O)|𝒯kO)\Pi\Big{(}\Pi\big{(}g^{*}(O)|\mathcal{T}_{k}^{O}\big{)}|\mathcal{T}_{k}^{O}\Big{)}=\Pi\big{(}g^{*}(O)|\mathcal{T}_{k}^{O}\big{)} for each kk, and because Π(Π(g(O)|𝒯kO)|𝒯kO)=0\Pi\Big{(}\Pi\big{(}g^{*}(O)|\mathcal{T}_{k}^{O}\big{)}|\mathcal{T}_{k^{\prime}}^{O}\Big{)}=0 for any kkk\neq k^{\prime}, by orthogonality.

For notational convenience we take a=1,b=1a=1,b=1 in the following calculations; the calculations for other values of a,b=0,1a,b=0,1 are entirely analogous. We write πj=πj(C,Xj):=P(Aj=1|C,Xj)\pi_{j}=\pi_{j}(C,X_{j}):=P(A_{j}=1|C,X_{j}), π3j=π3j(C,X3j):=P(A3j=1|C,X3j)\pi_{3-j}=\pi_{3-j}(C,X_{3-j}):=P(A_{3-j}=1|C,X_{3-j}). By A4 and A6, P(Aj=1,A3j=1|C,X1,X2)=π1π2P(A_{j}=1,A_{3-j}=1|C,X_{1},X_{2})=\pi_{1}\pi_{2}. Then:

g(O)=A1A2π1π2(Yjβj1,1)=A1A2π1π2(Yj1,1βj1,1).g^{*}(O)=\frac{A_{1}A_{2}}{\pi_{1}\pi_{2}}\Big{(}Y_{j}-\beta_{j}^{1,1}\Big{)}=\frac{A_{1}A_{2}}{\pi_{1}\pi_{2}}\Big{(}Y_{j}^{1,1}-\beta_{j}^{1,1}\Big{)}.

We compute each of the 3 terms in (1):

Π(g(O)|𝒯1O)\displaystyle\Pi\Big{(}g^{*}(O)|\ \mathcal{T}_{1}^{O}\Big{)} =E[g(O)|C,X1,X2]\displaystyle=E\big{[}g^{*}(O)|C,X_{1},X_{2}\big{]}
=\displaystyle= E[A1A2π1π2(Yj1,1βj1,1)|C,X1,X2]\displaystyle\ E\left[\frac{A_{1}A_{2}}{\pi_{1}\pi_{2}}\Big{(}Y_{j}^{1,1}-\beta_{j}^{1,1}\Big{)}\ \big{|}C,X_{1},X_{2}\right]
=1π1π2E[A1A2|C,X1,X2]E[Yj1,1βj1,1|C,X1,X2]\displaystyle=\frac{1}{\pi_{1}\pi_{2}}E\Big{[}A_{1}A_{2}\ |\ C,X_{1},X_{2}\Big{]}E\Big{[}Y_{j}^{1,1}-\beta_{j}^{1,1}\ |\ C,X_{1},X_{2}\Big{]} A1
=E[Yj1,1βj1,1|C,X1,X2]\displaystyle=E\Big{[}Y_{j}^{1,1}-\beta_{j}^{1,1}\ |\ C,X_{1},X_{2}\Big{]} A4,A6\displaystyle\textbf{A4},\textbf{A6}
=E[Yj1,1βj1,1|C,Xj]\displaystyle=E\Big{[}Y_{j}^{1,1}-\beta_{j}^{1,1}\ |\ C,X_{j}\Big{]} A5
=E[Yj1,1|A1=1,A2=1,C,Xj]βj1,1\displaystyle=E\Big{[}Y_{j}^{1,1}|A_{1}=1,A_{2}=1,C,X_{j}\Big{]}-\beta_{j}^{1,1} B1
=E[Yj|A1=1,A2=1,C,Xj]βj1,1.\displaystyle=E\Big{[}Y_{j}|A_{1}=1,A_{2}=1,C,X_{j}\Big{]}-\beta_{j}^{1,1}. A3

We relabel 𝒯2O\mathcal{T}_{2}^{O} and 𝒯3O\mathcal{T}_{3}^{O} as 𝒯kO\mathcal{T}_{k^{*}}^{O} and 𝒯kO\mathcal{T}_{k^{\dagger}}^{O}, according to whether j=1j=1 or j=2j=2 in the parameter βja,b\beta_{j}^{a,b}. Set k:=j+1k^{*}:=j+1, so that 𝒯kO={hk(Yj,A1,A2,C,Xj):E[hk(Yj,A1,A2,C,Xj)|A1,A2,C,Xj]=0}\mathcal{T}_{k*}^{O}=\Big{\{}h_{k^{*}}(Y_{j},A_{1},A_{2},C,X_{j}):E\big{[}h_{k^{*}}(Y_{j},A_{1},A_{2},C,X_{j})|A_{1},A_{2},C,X_{j}\big{]}=0\Big{\}}, and k:=4jk^{\dagger}:=4-j, so 𝒯kO={hk(Y3j,A1,A2,C,X3j):E[hk(Y3j,A1,A2,C,X3j)|A1,A2,C,X3j]=0}\mathcal{T}_{k^{\dagger}}^{O}=\Big{\{}h_{k^{\dagger}}(Y_{3-j},A_{1},A_{2},C,X_{3-j}):E\big{[}h_{k^{\dagger}}(Y_{3-j},A_{1},A_{2},C,X_{3-j})|A_{1},A_{2},C,X_{3-j}\big{]}=0\Big{\}}. Then:

Π(g(O)\displaystyle\Pi\Big{(}g^{*}(O) |𝒯kO)\displaystyle\big{|}\ \mathcal{T}_{k^{*}}^{O}\Big{)}
=\displaystyle= E[g(O)|Yj,A1,A2,C,Xj]E[g(O)|A1,A2,C,Xj]\displaystyle\ E\big{[}g^{*}(O)|Y_{j},A_{1},A_{2},C,X_{j}\big{]}-E\big{[}g^{*}(O)|A_{1},A_{2},C,X_{j}\big{]}
=\displaystyle= A1A2πj(Yjβj1,1)E[1π3j|Yj,A1A2=1,C,Xj]\displaystyle\ \frac{A_{1}A_{2}}{\pi_{j}}\Big{(}Y_{j}-\beta_{j}^{1,1}\Big{)}E\left[\frac{1}{\pi_{3-j}}\ \Big{|}\ Y_{j},A_{1}A_{2}=1,C,X_{j}\right]-
A1A2πjE[Yj1,1βj1,1π3j|A1A2=1,C,Xj]\displaystyle\hskip 36.135pt\frac{A_{1}A_{2}}{\pi_{j}}E\left[\frac{Y_{j}^{1,1}-\beta_{j}^{1,1}}{\pi_{3-j}}\ \Big{|}\ A_{1}A_{2}=1,C,X_{j}\right]
=\displaystyle= A1A2πj(Yjβj1,1)E[1π3j(C,X3j) random in X3j only|Yj1,1,A1A2=1,C,Xj]\displaystyle\ \frac{A_{1}A_{2}}{\pi_{j}}\Big{(}Y_{j}-\beta_{j}^{1,1}\Big{)}E\bigg{[}\underbrace{\frac{1}{\pi_{3-j}(C,X_{3-j})}}_{\mbox{ random in }X_{3-j}\mbox{ only}}\ \Big{|}\ Y_{j}^{1,1},A_{1}A_{2}=1,C,X_{j}\bigg{]} A3
A1A2πjE[Yj1,1βj1,1|A1A2=1,C,Xj]E[1π3j|A1A2=1,C,Xj]\displaystyle\hskip 21.68121pt-\frac{A_{1}A_{2}}{\pi_{j}}E\Big{[}Y_{j}^{1,1}-\beta_{j}^{1,1}\ |A_{1}A_{2}=1,C,X_{j}\Big{]}E\left[\frac{1}{\pi_{3-j}}\Big{|}A_{1}A_{2}=1,C,X_{j}\right] B2
=\displaystyle= A1A2πj(Yjβj1,1)E[1π3j(C,X3j)|A1A2=1,C,Xj]\displaystyle\ \frac{A_{1}A_{2}}{\pi_{j}}\Big{(}Y_{j}-\beta_{j}^{1,1}\Big{)}E\left[\frac{1}{\pi_{3-j}(C,X_{3-j})}\ \Big{|}\ A_{1}A_{2}=1,C,X_{j}\right] B2
A1A2πj(E[Yj|A1A2=1,C,Xj]βj1,1)E[1π3j|A1A2=1,C,Xj].\displaystyle\hskip 21.68121pt-\frac{A_{1}A_{2}}{\pi_{j}}\Big{(}E\big{[}Y_{j}|A_{1}A_{2}=1,C,X_{j}\big{]}-\beta_{j}^{1,1}\Big{)}E\left[\frac{1}{\pi_{3-j}}\Big{|}A_{1}A_{2}=1,C,X_{j}\right]. A3

Using the fact that for binary ZZ, E[ZW|V]=E[W|Z=1,V]P(Z=1|V)E\big{[}ZW|V\big{]}=E[W|Z=1,V]P(Z=1|V), we have:

E[1π3j\displaystyle E\bigg{[}\frac{1}{\pi_{3-j}} |A1A2=1,C,Xj]\displaystyle\ \Big{|}\ A_{1}A_{2}=1,C,X_{j}\bigg{]}
=\displaystyle= 1P(A1A2=1|C,Xj)E[A1A2π3j|C,Xj]\displaystyle\ \frac{1}{P\big{(}A_{1}A_{2}=1|C,X_{j}\big{)}}E\left[\frac{A_{1}A_{2}}{\pi_{3-j}}\ \big{|}\ C,X_{j}\right]
=\displaystyle= 1P(A1A2=1|C,Xj)E[1π3jE[A1A2|C,Xj,X3j]|C,Xj]\displaystyle\ \frac{1}{P\big{(}A_{1}A_{2}=1|C,X_{j}\big{)}}E\left[\frac{1}{\pi_{3-j}}E\left[A_{1}A_{2}\big{|}\ C,X_{j},X_{3-j}\right]\ \big{|}\ C,X_{j}\right]
=\displaystyle= 1P(A1=1,A2=1|C,Xj)E[1π3j(πjπ3j)|C,Xj]\displaystyle\ \frac{1}{P(A_{1}=1,A_{2}=1|C,X_{j})}E\left[\frac{1}{\pi_{3-j}}(\pi_{j}\pi_{3-j})\big{|}\ C,X_{j}\right] A4,A6\displaystyle\textbf{A4},\textbf{A6}
=\displaystyle= 1P(A1=1,A2=1|C,Xj)πj(C,Xj)\displaystyle\ \frac{1}{P(A_{1}=1,A_{2}=1|C,X_{j})}\pi_{j}(C,X_{j})
=\displaystyle= 1πjP(A3j=1|C,Xj)πj(C,Xj)\displaystyle\ \frac{1}{\pi_{j}P(A_{3-j}=1|C,X_{j})}\pi_{j}(C,X_{j}) B7
=\displaystyle= 1P(A3j=1|C,Xj).\displaystyle\ \frac{1}{P(A_{3-j}=1|C,X_{j})}.

Therefore,

Π(g(O)|𝒯kO)=A1A2πj(C,Xj)P(A3j=1|C,Xj)(YjE[Yj|A1A2=1,C,Xj]).\Pi\Big{(}g^{*}(O)\big{|}\ \mathcal{T}_{k^{*}}^{O}\Big{)}=\frac{A_{1}A_{2}}{\pi_{j}(C,X_{j})P(A_{3-j}=1|C,X_{j})}\Big{(}Y_{j}-E\big{[}Y_{j}|A_{1}A_{2}=1,C,X_{j}\big{]}\Big{)}.
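The identity for binary ZZ invoked above, E[ZW|V]=E[W|Z=1,V]P(Z=1|V)E[ZW|V]=E[W|Z=1,V]P(Z=1|V), can be checked numerically. A minimal Monte Carlo sketch in Python; the data-generating choices below are hypothetical and purely illustrative:

```python
import numpy as np

# Monte Carlo check of E[Z W | V] = E[W | Z=1, V] * P(Z=1 | V) for binary Z.
rng = np.random.default_rng(0)
n = 200_000
V = rng.integers(0, 2, n)                    # binary conditioning variable
Z = rng.binomial(1, 0.3 + 0.4 * V)           # binary Z, P(Z=1|V) depends on V
W = 1.0 + V + Z + rng.normal(size=n)         # W correlated with both V and Z

results = {}
for v in (0, 1):
    mask = V == v
    lhs = np.mean(Z[mask] * W[mask])                      # E[Z W | V=v]
    rhs = np.mean(W[mask & (Z == 1)]) * np.mean(Z[mask])  # E[W|Z=1,V=v] P(Z=1|V=v)
    results[v] = (lhs, rhs)
    print(v, round(lhs, 3), round(rhs, 3))
```

The two empirical quantities agree within Monte Carlo error in each stratum of VV.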

Finally,

Π(g(O)\displaystyle\Pi\Big{(}g^{*}(O) |𝒯kO)\displaystyle\big{|}\ \mathcal{T}_{k^{\dagger}}^{O}\Big{)}
=\displaystyle= E[g(O)|Y3j,A1,A2,C,X3j]E[g(O)|A1,A2,C,X3j]\displaystyle\ E\big{[}g^{*}(O)|Y_{3-j},A_{1},A_{2},C,X_{3-j}\big{]}-E\big{[}g^{*}(O)|A_{1},A_{2},C,X_{3-j}\big{]}
=\displaystyle= A1A2π3jE[Yj1,1βj1,1πj(C,Xj)random in Yj1,1,Xj only|Y3j1,1,A1A2=1,C,X3j]\displaystyle\ \frac{A_{1}A_{2}}{\pi_{3-j}}E\bigg{[}\underbrace{\ \frac{Y_{j}^{1,1}-\beta_{j}^{1,1}}{\pi_{j}(C,X_{j})}\ }_{\mbox{random in }Y_{j}^{1,1},X_{j}\mbox{ only}}\ \Big{|}\ Y_{3-j}^{1,1},A_{1}A_{2}=1,C,X_{3-j}\bigg{]}
A1A2π3jE[Yj1,1βj1,1πj|A1A2=1,C,X3j]\displaystyle\hskip 36.135pt-\frac{A_{1}A_{2}}{\pi_{3-j}}E\left[\frac{Y_{j}^{1,1}-\beta_{j}^{1,1}}{\pi_{j}}\ \Big{|}\ A_{1}A_{2}=1,C,X_{3-j}\right]
=\displaystyle= A1A2π3jE[Yj1,1βj1,1πj(C,Xj)|A1A2=1,C,X3j]\displaystyle\ \frac{A_{1}A_{2}}{\pi_{3-j}}E\bigg{[}\frac{Y_{j}^{1,1}-\beta_{j}^{1,1}}{\pi_{j}(C,X_{j})}\ \Big{|}\ A_{1}A_{2}=1,C,X_{3-j}\bigg{]} B6
A1A2π3jE[Yj1,1βj1,1πj|A1A2=1,C,X3j]\displaystyle\hskip 36.135pt-\frac{A_{1}A_{2}}{\pi_{3-j}}E\left[\frac{Y_{j}^{1,1}-\beta_{j}^{1,1}}{\pi_{j}}\ \Big{|}\ A_{1}A_{2}=1,C,X_{3-j}\right]
=\displaystyle= 0.\displaystyle\ 0.

Therefore, the efficient influence function for βj1,1\beta_{j}^{1,1} in Model 3 is:

φ3(O;βj1,1)=\displaystyle\varphi_{3}(O;\beta_{j}^{1,1})= Π(g(O)|𝒯1O)+Π(g(O)|𝒯kO)+Π(g(O)|𝒯kO)\displaystyle\ \Pi\big{(}g^{*}(O)|\mathcal{T}_{1}^{O}\big{)}+\Pi\big{(}g^{*}(O)|\mathcal{T}_{k^{*}}^{O}\big{)}+\Pi\big{(}g^{*}(O)|\mathcal{T}_{k^{\dagger}}^{O}\big{)}
=\displaystyle= A1A2πj(C,Xj)P(A3j=1|C,Xj)(YjE[Yj|A1A2=1,C,Xj])\displaystyle\ \frac{A_{1}A_{2}}{\pi_{j}(C,X_{j})P(A_{3-j}=1|C,X_{j})}\Big{(}Y_{j}-E\big{[}Y_{j}|A_{1}A_{2}=1,C,X_{j}\big{]}\Big{)}
+E[Yj|A1=1,A2=1,C,Xj]βj1,1.\displaystyle\hskip 21.68121pt+E\Big{[}Y_{j}|A_{1}=1,A_{2}=1,C,X_{j}\Big{]}-\beta_{j}^{1,1}.

By exactly the same arguments, the efficient influence function for βja,b\beta_{j}^{a,b} in Model 3 is:

φ3(O;βja,b)=\displaystyle\varphi_{3}(O;\beta_{j}^{a,b})= 𝟙(Aj=a,A3j=b)P(Aj=a|C,Xj)P(A3j=b|C,Xj)(YjE[Yj|Aj=a,A3j=b,C,Xj])\displaystyle\ \frac{\mathds{1}(A_{j}=a,A_{3-j}=b)}{P(A_{j}=a|C,X_{j})P(A_{3-j}=b|C,X_{j})}\Big{(}Y_{j}-E\big{[}Y_{j}|A_{j}=a,A_{3-j}=b,C,X_{j}\big{]}\Big{)}
+E[Yj|Aj=a,A3j=b,C,Xj]βja,b.\displaystyle\hskip 21.68121pt+E\Big{[}Y_{j}|A_{j}=a,A_{3-j}=b,C,X_{j}\Big{]}-\beta_{j}^{a,b}.
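As a numerical sanity check, an efficient influence function should have mean zero when evaluated at the true parameter value and true nuisance functions. A hypothetical simulation sketch for j=1j=1, a=b=1a=b=1, with an illustrative data-generating process chosen to satisfy the model assumptions (none of these distributional choices come from the paper):

```python
import numpy as np

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

# Illustrative DGP consistent with the model structure (all coefficients hypothetical).
rng = np.random.default_rng(1)
n = 200_000
C  = rng.normal(size=n)
X1 = rng.normal(size=n)
X2 = rng.integers(0, 2, n).astype(float)     # binary X2 so psi has a closed form
pi1 = expit(0.5 * C + 0.5 * X1)              # P(A1=1|C,X1)
A1 = rng.binomial(1, pi1)
A2 = rng.binomial(1, expit(0.5 * C + 0.5 * X2))
Y1 = 1.0 + 0.7 * A1 + 0.4 * A2 + 0.5 * C + 0.3 * X1 + rng.normal(size=n)

beta = 2.1                                           # E[Y1^{1,1}] = 1.0 + 0.7 + 0.4
mu11 = 2.1 + 0.5 * C + 0.3 * X1                      # E[Y1|A1=1,A2=1,C,X1]
psi = 0.5 * (expit(0.5 * C) + expit(0.5 * C + 0.5)) # P(A2=1|C,X1): X2 independent of (C,X1)

# Efficient influence function for beta_1^{1,1}, evaluated at the true nuisances:
phi = A1 * A2 / (pi1 * psi) * (Y1 - mu11) + mu11 - beta
print(round(phi.mean(), 4))                          # near 0 up to Monte Carlo error
```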

Next we will show that the function φ3(O;βja,b)\varphi_{3}(O;\beta_{j}^{a,b}) given above is also the efficient influence function for βja,b\beta_{j}^{a,b} in Model 2. It suffices to show that φ3(O;βja,b)\varphi_{3}(O;\beta_{j}^{a,b}) is an influence function for βja,b\beta_{j}^{a,b} in Model 2, for then the projection of φ3(O;βja,b)\varphi_{3}(O;\beta_{j}^{a,b}) onto 𝒯O(M2)\mathcal{T}^{O}(M2) is the efficient influence function for βja,b\beta_{j}^{a,b} in Model 2. But Model 3 \subseteq Model 2, which implies 𝒯O(M3)𝒯O(M2)\mathcal{T}^{O}(M3)\subseteq\mathcal{T}^{O}(M2). Since φ3(O;βja,b)\varphi_{3}(O;\beta_{j}^{a,b}) is the efficient influence function for Model 3, we know that φ3(O;βja,b)𝒯O(M3)\varphi_{3}(O;\beta_{j}^{a,b})\in\mathcal{T}^{O}(M3). Therefore φ3(O;βja,b)𝒯O(M2)\varphi_{3}(O;\beta_{j}^{a,b})\in\mathcal{T}^{O}(M2), and so is its own projection onto 𝒯O(M2)\mathcal{T}^{O}(M2). Thus, once we show that φ3(O;βja,b)\varphi_{3}(O;\beta_{j}^{a,b}) is an influence function for βja,b\beta_{j}^{a,b} in Model 2, it will automatically follow that φ3(O;βja,b)\varphi_{3}(O;\beta_{j}^{a,b}) is the efficient influence function for βja,b\beta_{j}^{a,b} in Model 2.

To show that φ3(O;βja,b)\varphi_{3}(O;\beta_{j}^{a,b}) is an influence function for βja,b\beta_{j}^{a,b} in Model 2, let 𝒫\mathcal{P} be any regular parametric submodel of Model 2, parametrized say by γΓ𝒫r\gamma\in\Gamma_{\mathcal{P}}\subseteq\mathbb{R}^{r}, in such a way that γ=0\gamma=0 corresponds to the true distribution P0𝒫P_{0}\in\mathcal{P}. Write 𝒫={f𝒫(O;γ):γΓ𝒫}\mathcal{P}=\Big{\{}f_{\mathcal{P}}(O;\gamma):\gamma\in\Gamma_{\mathcal{P}}\Big{\}}, and let Sγ(O;𝒫)=logf𝒫(O;γ)γ|γ=0\displaystyle{S_{\gamma}(O;\mathcal{P})=\frac{\partial\log f_{\mathcal{P}}(O;\gamma)}{\partial\gamma}\Big{|}_{\gamma=0}} be the score vector for 𝒫\mathcal{P} evaluated at the truth. We will show that:

E[φ3(O;βja,b)Sγ(O;𝒫)]=βja,b(𝒫;γ)γ|γ=0E\bigg{[}\varphi_{3}(O;\beta_{j}^{a,b})\ S_{\gamma}(O;\mathcal{P})\Big{]}\ =\ \frac{\partial\beta_{j}^{a,b}(\mathcal{P};\gamma)}{\partial\gamma}\Big{|}_{\gamma=0} (2)

where the right hand side is the derivative of the functional βja,b=βja,b(𝒫;γ)\beta_{j}^{a,b}=\beta_{j}^{a,b}(\mathcal{P};\gamma), evaluated at the truth. By [25], φ3(O;βja,b)\varphi_{3}(O;\beta_{j}^{a,b}) is an influence function for βja,b\beta_{j}^{a,b} in Model 2 if and only if equation (2) is satisfied for every regular parametric submodel of Model 2.

Factoring the observed data likelihood, 𝒫\mathcal{P} can be parametrized as

𝒫={dF(c,x1,x2,a1,a2;γ1)dF(yj|c,xj,a1,a2;γ2)\mathcal{P}=\Big{\{}dF(c,x_{1},x_{2},a_{1},a_{2}\ ;\gamma_{1})\cdot dF(y_{j}|c,x_{j},a_{1},a_{2}\ ;\gamma_{2})\cdot
dF(y3j|c,x1,x2,a1,a2,yj;γ3):γ=(γ1,γ2,γ3)Γ1×Γ2×Γ3=Γ𝒫}\hskip 72.26999ptdF(y_{3-j}|c,x_{1},x_{2},a_{1},a_{2},y_{j}\ ;\gamma_{3}):\gamma=(\gamma_{1},\gamma_{2},\gamma_{3})\in\Gamma_{1}\times\Gamma_{2}\times\Gamma_{3}=\Gamma_{\mathcal{P}}\Big{\}}

where γ1,γ2,γ3\gamma_{1},\gamma_{2},\gamma_{3} are variation independent. We can partition the set of equations (2) into the 3 sets of equations

E[φ3(O;βja,b)Sγk(O;𝒫)]=βja,b(𝒫;γ)γk|γk=0E\bigg{[}\varphi_{3}(O;\beta_{j}^{a,b})\ S_{\gamma_{k}}(O;\mathcal{P})\Big{]}\ =\ \frac{\partial\beta_{j}^{a,b}(\mathcal{P};\gamma)}{\partial\gamma_{k}}\Big{|}_{\gamma_{k}=0} (3)

for k=1,2,3k=1,2,3. We will show that, for k=3k=3, both sides of (3) are zero, since neither the functional βja,b(𝒫;γ)\beta_{j}^{a,b}(\mathcal{P};\gamma) nor the influence function φ3(O;βja,b)\varphi_{3}(O;\beta_{j}^{a,b}) involves Y3jY_{3-j}. We will then show that (3) holds for k=1,2k=1,2 because Model 2 and Model 3 place identical restrictions on the features governed by γ1\gamma_{1} and γ2\gamma_{2}.

The identifying functional for βja,b(𝒫;γ)\beta_{j}^{a,b}(\mathcal{P};\gamma) is:

βja,b(𝒫;γ)=c,x1,x2yjyj𝑑F(yj|c,xj,aj=a,a3j=b;γ2)𝑑F(c,x1,x2;γ1).\beta_{j}^{a,b}(\mathcal{P};\gamma)=\int_{c,x_{1},x_{2}}\int_{y_{j}}y_{j}\cdot dF\big{(}y_{j}|c,x_{j},a_{j}=a,a_{3-j}=b\ ;\gamma_{2}\big{)}dF\big{(}c,x_{1},x_{2}\ ;\gamma_{1}\big{)}. (4)

Note that (4) varies in γ1,γ2\gamma_{1},\gamma_{2} only, so βja,b(𝒫;γ)γ3=0\displaystyle{\frac{\partial\beta_{j}^{a,b}(\mathcal{P};\gamma)}{\partial\gamma_{3}}=0}. Since Sγ3(O;𝒫)S_{\gamma_{3}}(O;\mathcal{P}) is the score for
dF(y3j|c,x1,x2,a1,a2,yj)dF(y_{3-j}|c,x_{1},x_{2},a_{1},a_{2},y_{j}), E[Sγ3(O;𝒫)|C,X1,X2,A1,A2,Yj]=0E\Big{[}S_{\gamma_{3}}(O;\mathcal{P})|C,X_{1},X_{2},A_{1},A_{2},Y_{j}\Big{]}=0. Note also that φ3(O;βja,b)\varphi_{3}(O;\beta_{j}^{a,b}) is a function of Yj,A1,A2,C,XjY_{j},A_{1},A_{2},C,X_{j} only. Therefore:

E[φ3(O;βja,b)Sγ3(O;𝒫)]=E[φ3(O;βja,b)E[Sγ3(O;𝒫)|C,X1,X2,A1,A2,Yj]=0]=0E\bigg{[}\varphi_{3}(O;\beta_{j}^{a,b})\ S_{\gamma_{3}}(O;\mathcal{P})\Big{]}=E\bigg{[}\varphi_{3}(O;\beta_{j}^{a,b})\underbrace{E\Big{[}S_{\gamma_{3}}(O;\mathcal{P})|C,X_{1},X_{2},A_{1},A_{2},Y_{j}\Big{]}}_{=0}\bigg{]}=0

which shows that (3) is satisfied for k=3k=3.

Now consider the submodel 𝒫𝒫\mathcal{P}^{\prime}\subseteq\mathcal{P} given by

𝒫:={dF(c,x1,x2,a1,a2;γ1)dF(yj|c,xj,a1,a2;γ2)\mathcal{P}^{\prime}:=\Big{\{}dF(c,x_{1},x_{2},a_{1},a_{2};\gamma_{1})\cdot dF(y_{j}|c,x_{j},a_{1},a_{2};\gamma_{2})\cdot
dF0(y3j|c,x1,x2,a1,a2,yj;γ3=0):γ=(γ1,γ2,0)Γ1×Γ2×{0}}\hskip 72.26999ptdF_{0}(y_{3-j}|c,x_{1},x_{2},a_{1},a_{2},y_{j};\gamma_{3}=0):\gamma^{\prime}=(\gamma_{1},\gamma_{2},0)\in\Gamma_{1}\times\Gamma_{2}\times\{0\}\Big{\}}

Note that for k=1,2k=1,2, 𝒫\mathcal{P} and 𝒫\mathcal{P}^{\prime} have the same scores, i.e. Sγk(O;𝒫)=Sγk(O;𝒫)S_{\gamma_{k}}(O;\mathcal{P})=S_{\gamma_{k}}(O;\mathcal{P}^{\prime}), and that βja,b(𝒫;γ)=βja,b(𝒫;γ)\beta_{j}^{a,b}(\mathcal{P};\gamma)=\beta_{j}^{a,b}(\mathcal{P}^{\prime};\gamma^{\prime}). We claim that 𝒫\mathcal{P}^{\prime} is a parametric submodel of Model 3: Model 2 and Model 3 impose exactly the same restrictions on dF(c,x1,x2,a1,a2)dF(c,x_{1},x_{2},a_{1},a_{2}), and neither imposes any restriction on dF(yj|c,xj,a1,a2)dF(y_{j}|c,x_{j},a_{1},a_{2}); the only additional restriction Model 3 imposes concerns dF(y3j|c,x1,x2,a1,a2,yj)dF(y_{3-j}|c,x_{1},x_{2},a_{1},a_{2},y_{j}), which in 𝒫\mathcal{P}^{\prime} is fixed at the true conditional law dF0dF_{0}, and the true distribution lies in Model 3. Hence, for each (γ1,γ2)Γ1×Γ2(\gamma_{1},\gamma_{2})\in\Gamma_{1}\times\Gamma_{2}, dF(c,x1,x2,a1,a2;γ1)dF(yj|c,xj,a1,a2;γ2)dF0(y3j|c,x1,x2,a1,a2,yj;γ3=0)dF(c,x_{1},x_{2},a_{1},a_{2};\gamma_{1})\cdot dF(y_{j}|c,x_{j},a_{1},a_{2};\gamma_{2})\cdot dF_{0}(y_{3-j}|c,x_{1},x_{2},a_{1},a_{2},y_{j};\gamma_{3}=0) lies in Model 3. Furthermore, taking γ1=γ2=0\gamma_{1}=\gamma_{2}=0 shows that 𝒫\mathcal{P}^{\prime} contains the true distribution.

Therefore, since φ3(O;βja,b)\varphi_{3}(O;\beta_{j}^{a,b}) is an influence function for βja,b\beta_{j}^{a,b} in Model 3, and 𝒫\mathcal{P}^{\prime} is a regular parametric submodel of Model 3,

E[φ3(O;βja,b)Sγk(O;𝒫)]=βja,b(𝒫;γ)γk|γk=0.E\bigg{[}\varphi_{3}(O;\beta_{j}^{a,b})\ S_{\gamma_{k}}(O;\mathcal{P}^{\prime})\Big{]}\ =\ \frac{\partial\beta_{j}^{a,b}(\mathcal{P}^{\prime};\gamma)}{\partial\gamma_{k}}\Big{|}_{\gamma_{k}=0}.

Since Sγk(O;𝒫)=Sγk(O;𝒫)S_{\gamma_{k}}(O;\mathcal{P})=S_{\gamma_{k}}(O;\mathcal{P}^{\prime}) for k=1,2k=1,2 and βja,b(𝒫;γ)=βja,b(𝒫;γ)\beta_{j}^{a,b}(\mathcal{P};\gamma)=\beta_{j}^{a,b}(\mathcal{P}^{\prime};\gamma^{\prime}), this shows that equation (3) holds for k=1,2k=1,2 as claimed. Therefore, φ3(O;βja,b)\varphi_{3}(O;\beta_{j}^{a,b}) is an influence function for βja,b\beta_{j}^{a,b} in Model 2, and hence the efficient influence function for βja,b\beta_{j}^{a,b} in Model 2.

Appendix C Double Robustness

Here we show that the Model 1 and Model 2 efficient estimators for βja,b=E[Yja,b]\beta_{j}^{a,b}=E\big{[}Y_{j}^{a,b}\big{]} are doubly robust.

We consider the Model 1 estimator first. Fix jj, aa, and bb, and let π(C,Xj,X3j;α)\pi(C,X_{j},X_{3-j};\alpha) be a model for the propensity score P(Aj=a,A3j=b|C,Xj,X3j)P(A_{j}=a,A_{3-j}=b|C,X_{j},X_{3-j}), and let μ(Aj,A3j,C,Xj,X3j;γ)\mu(A_{j},A_{3-j},C,X_{j},X_{3-j};\gamma) be a model for E[Yj|Aj,A3j,C,Xj,X3j]E\big{[}Y_{j}|A_{j},A_{3-j},C,X_{j},X_{3-j}\big{]}. These models may or may not be correctly specified, but suppose that the parameter estimates converge to some limits, say α^𝑝α\hat{\alpha}\overset{p}{\to}\alpha^{*} and γ^𝑝γ\hat{\gamma}\overset{p}{\to}\gamma^{*}. In the case where the models are correctly specified, say with α0\alpha_{0} and γ0\gamma_{0} corresponding to the truth, we would have α=α0\alpha^{*}=\alpha_{0} and γ=γ0\gamma^{*}=\gamma_{0}. Let π(Ci,Xij,Xi,3j;α^)\pi(C_{i},X_{ij},X_{i,3-j};\hat{\alpha}) and μ(a,b,Ci,Xij,Xi,3j;γ^)\mu(a,b,C_{i},X_{ij},X_{i,3-j};\hat{\gamma}) denote predicted values under these models, and consider the corresponding Model 1 estimator

β^1=1ni=1n{\displaystyle\widehat{\beta}_{1}=\ \frac{1}{n}\sum_{i=1}^{n}\Bigg{\{} 𝟙(Aij=a,Ai,3j=b)π(Ci,Xij,Xi,3j;α^)[Yijμ(a,b,Ci,Xij,Xi,3j;γ^)]\displaystyle\frac{\mathds{1}(A_{ij}=a,A_{i,3-j}=b)}{\pi(C_{i},X_{ij},X_{i,3-j};\hat{\alpha})}\Big{[}Y_{ij}-\mu\big{(}a,b,C_{i},X_{ij},X_{i,3-j};\hat{\gamma}\big{)}\Big{]}
+μ(a,b,Ci,Xij,Xi,3j;γ^)}.\displaystyle\hskip 72.26999pt+\mu\big{(}a,b,C_{i},X_{ij},X_{i,3-j};\hat{\gamma}\big{)}\Bigg{\}}.
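Before turning to the proof, the double robustness of this estimator can be illustrated by simulation: below, the outcome regression is deliberately misspecified while the joint propensity score is correct, yet the estimator still recovers E[Y₁^{1,1}]. All data-generating choices are hypothetical and purely illustrative:

```python
import numpy as np

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical DGP (illustrative coefficients); j=1, a=b=1, so E[Y1^{1,1}] = 2.1.
rng = np.random.default_rng(2)
n = 200_000
C, X1, X2 = rng.normal(size=(3, n))
pi1 = expit(0.5 * C + 0.5 * X1)              # P(A1=1|C,X1)
pi2 = expit(0.5 * C + 0.5 * X2)              # P(A2=1|C,X2)
A1, A2 = rng.binomial(1, pi1), rng.binomial(1, pi2)
Y1 = 1.0 + 0.7 * A1 + 0.4 * A2 + 0.5 * C + 0.3 * X1 + rng.normal(size=n)

pi = pi1 * pi2                # correct joint propensity score
mu_wrong = 0.2 * C - 1.0      # deliberately misspecified outcome regression
beta_hat1 = np.mean(A1 * A2 / pi * (Y1 - mu_wrong) + mu_wrong)
print(round(beta_hat1, 3))    # close to the true value 2.1 despite the wrong mu
```

Swapping the roles (correct μ, misspecified π) leads to the same conclusion, which is the content of the double robustness property proved below.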

We will show that β^1\hat{\beta}_{1} is consistent even if either π\pi or μ\mu is misspecified, as long as the other is correctly specified. We claim that β^1𝑝β\hat{\beta}_{1}\overset{p}{\longrightarrow}\beta^{*}, where

β=E\displaystyle\beta^{*}=\ E [𝟙(Aj=a,A3j=b)π(C,Xj,X3j;α)(Yjμ(a,b,C,Xj,X3j;γ))+μ(a,b,C,Xj,X3j;γ)].\displaystyle\Bigg{[}\ \frac{\mathds{1}(A_{j}=a,A_{3-j}=b)}{\pi(C,X_{j},X_{3-j};\alpha^{*})}\Big{(}Y_{j}-\mu(a,b,C,X_{j},X_{3-j};\gamma^{*})\Big{)}+\mu(a,b,C,X_{j},X_{3-j};\gamma^{*})\Bigg{]}.

To see this, we can rewrite β^1\hat{\beta}_{1} as

1ni=1n{𝟙(Aij=a,Ai,3j=b)π(Ci,Xij,Xi,3j;α)(Yijμ(a,b,Ci,Xij,Xi,3j;γ))+μ(a,b,Ci,Xij,Xi,3j;γ)}\frac{1}{n}\sum_{i=1}^{n}\Bigg{\{}\frac{\mathds{1}(A_{ij}=a,A_{i,3-j}=b)}{\pi(C_{i},X_{ij},X_{i,3-j};\alpha^{*})}\Big{(}Y_{ij}-\mu(a,b,C_{i},X_{ij},X_{i,3-j};\gamma^{*})\Big{)}+\mu(a,b,C_{i},X_{ij},X_{i,3-j};\gamma^{*})\Bigg{\}} (5)
+1ni=1n{\displaystyle+\ \ \frac{1}{n}\sum_{i=1}^{n}\Bigg{\{} [π(Ci,Xij,Xi,3j;α)π(Ci,Xij,Xi,3j;α^)π(Ci,Xij,Xi,3j;α)π(Ci,Xij,Xi,3j;α^)]𝟙(Aij=a,Ai,3j=b)×\displaystyle\left[\frac{\pi(C_{i},X_{ij},X_{i,3-j};\alpha^{*})-\pi(C_{i},X_{ij},X_{i,3-j};\hat{\alpha})}{\pi(C_{i},X_{ij},X_{i,3-j};\alpha^{*})\pi(C_{i},X_{ij},X_{i,3-j};\hat{\alpha})}\right]\mathds{1}(A_{ij}=a,A_{i,3-j}=b)\times
[Yijμ(a,b,Ci,Xij,Xi,3j;γ^)]\displaystyle\hskip 72.26999pt\Big{[}Y_{ij}-\mu(a,b,C_{i},X_{ij},X_{i,3-j};\hat{\gamma})\Big{]} (6)
+\displaystyle+ [1𝟙(Aij=a,Ai,3j=b)π(Ci,Xij,Xi,3j;α)][μ(a,b,Ci,Xij,Xi,3j;γ^)μ(a,b,Ci,Xij,Xi,3j;γ)]}\displaystyle\ \left[1-\frac{\mathds{1}(A_{ij}=a,A_{i,3-j}=b)}{\pi(C_{i},X_{ij},X_{i,3-j};\alpha^{*})}\right]\Big{[}\mu(a,b,C_{i},X_{ij},X_{i,3-j};\hat{\gamma})-\mu(a,b,C_{i},X_{ij},X_{i,3-j};\gamma^{*})\Big{]}\Bigg{\}}

where (5) is a sample mean of i.i.d. terms, and hence converges in probability to the expected value of each term by the law of large numbers, while (6) converges in probability to 0 because α^𝑝α\hat{\alpha}\overset{p}{\to}\alpha^{*} and γ^𝑝γ\hat{\gamma}\overset{p}{\to}\gamma^{*}.

Now we show that, if either π\pi or μ\mu is correctly specified, then β=E[Yja,b]\beta^{*}=E\big{[}Y_{j}^{a,b}\big{]}. If μ\mu is correctly specified, then μ(a,b,C,Xj,X3j;γ)=E[Yja,b|C,Xj,X3j]\mu(a,b,C,X_{j},X_{3-j};\gamma^{*})=E\big{[}Y_{j}^{a,b}|C,X_{j},X_{3-j}\big{]}. Conditioning on Yja,b,C,Xj,X3jY_{j}^{a,b},C,X_{j},X_{3-j}, we have:

β=\displaystyle\beta^{*}= E[Yja,bE[Yja,b|C,Xj,X3j]π(C,Xj,X3j;α)E[𝟙(Aj=a,A3j=b)|Yja,b,C,Xj,X3j]\displaystyle\ E\Bigg{[}\frac{Y_{j}^{a,b}-E\big{[}Y_{j}^{a,b}|C,X_{j},X_{3-j}\big{]}}{\pi(C,X_{j},X_{3-j};\alpha^{*})}E\Big{[}\mathds{1}(A_{j}=a,A_{3-j}=b)\ \big{|}Y_{j}^{a,b},C,X_{j},X_{3-j}\Big{]}
+E[Yja,b|C,Xj,X3j]]\displaystyle\hskip 72.26999pt+E\big{[}Y_{j}^{a,b}|C,X_{j},X_{3-j}\big{]}\Bigg{]}
=\displaystyle= E[Yja,bE[Yja,b|C,Xj,X3j]π(C,Xj,X3j;α)E[𝟙(Aj=a,A3j=b)|C,Xj,X3j]\displaystyle\ E\Bigg{[}\frac{Y_{j}^{a,b}-E\big{[}Y_{j}^{a,b}|C,X_{j},X_{3-j}\big{]}}{\pi(C,X_{j},X_{3-j};\alpha^{*})}E\Big{[}\mathds{1}(A_{j}=a,A_{3-j}=b)\ \big{|}C,X_{j},X_{3-j}\Big{]} by A1
+E[Yja,b|C,Xj,X3j]]\displaystyle\hskip 72.26999pt+E\big{[}Y_{j}^{a,b}|C,X_{j},X_{3-j}\big{]}\Bigg{]}
=\displaystyle= E[P(Aj=a,A3j=b|C,Xj,X3j)π(C,Xj,X3j;α)E[(Yja,bE[Yja,b|C,Xj,X3j])|C,Xj,X3j]=0\displaystyle\ E\Bigg{[}\frac{P(A_{j}=a,A_{3-j}=b|C,X_{j},X_{3-j})}{\pi(C,X_{j},X_{3-j};\alpha^{*})}\underbrace{E\Big{[}\Big{(}Y_{j}^{a,b}-E\big{[}Y_{j}^{a,b}|C,X_{j},X_{3-j}\big{]}\Big{)}\big{|}C,X_{j},X_{3-j}\Big{]}}_{=0}
+E[Yja,b|C,X1,X2]]\displaystyle\hskip 72.26999pt+E\big{[}Y_{j}^{a,b}|C,X_{1},X_{2}\big{]}\Bigg{]}
=\displaystyle= 0+E[Yja,b].\displaystyle\ 0+E\big{[}Y_{j}^{a,b}\big{]}.

If π\pi is correctly specified, then we have:

\begin{align*}
\beta^{*}=\ & E\Bigg[\frac{Y_{j}^{a,b}-\mu(a,b,C,X_{j},X_{3-j};\gamma^{*})}{\pi(C,X_{j},X_{3-j};\alpha_{0})}\,E\Big[\mathds{1}(A_{j}=a,A_{3-j}=b)\,\Big|\,Y_{j}^{a,b},C,X_{j},X_{3-j}\Big]\\
&\hspace{1cm}+\mu(a,b,C,X_{j},X_{3-j};\gamma^{*})\Bigg]\\
=\ & E\Bigg[\frac{Y_{j}^{a,b}-\mu(a,b,C,X_{j},X_{3-j};\gamma^{*})}{\pi(C,X_{j},X_{3-j};\alpha_{0})}\,P(A_{j}=a,A_{3-j}=b\,|\,C,X_{j},X_{3-j}) && \text{by A1}\\
&\hspace{1cm}+\mu(a,b,C,X_{j},X_{3-j};\gamma^{*})\Bigg]\\
=\ & E\Big[Y_{j}^{a,b}-\mu(a,b,C,X_{j},X_{3-j};\gamma^{*})+\mu(a,b,C,X_{j},X_{3-j};\gamma^{*})\Big] && \text{since } \pi \text{ is correct}\\
=\ & E\big[Y_{j}^{a,b}\big].
\end{align*}

Next we consider the Model 2 estimator. Now let $\pi(C,X_{j};\alpha)$ be a model for the propensity score $P(A_{j}=a\,|\,C,X_{j})$, let $\psi(C,X_{j};\theta)$ be a model for the propensity score $P(A_{3-j}=b\,|\,C,X_{j})$, and let $\mu(A_{j},A_{3-j},C,X_{j};\gamma)$ be a model for $E\big[Y_{j}\,|\,A_{j},A_{3-j},C,X_{j}\big]$. Suppose that the estimated parameters converge to some values, say $\hat{\alpha}\overset{p}{\to}\alpha^{*}$, $\hat{\theta}\overset{p}{\to}\theta^{*}$, and $\hat{\gamma}\overset{p}{\to}\gamma^{*}$. In the case where the models are correctly specified, denote the truth by $\alpha_{0}$, $\theta_{0}$, and $\gamma_{0}$, respectively. Let $\pi(C_{i},X_{ij};\hat{\alpha})$, $\psi(C_{i},X_{ij};\hat{\theta})$, and $\mu(a,b,C_{i},X_{ij};\hat{\gamma})$ denote predicted values under these models, and consider the corresponding Model 2 estimator:

\[
\widehat{\beta}_{2}=\frac{1}{n}\sum_{i=1}^{n}\Bigg\{\frac{\mathds{1}(A_{ij}=a,A_{i,3-j}=b)}{\pi(C_{i},X_{ij};\hat{\alpha})\,\psi(C_{i},X_{ij};\hat{\theta})}\Big(Y_{ij}-\mu(a,b,C_{i},X_{ij};\hat{\gamma})\Big)+\mu(a,b,C_{i},X_{ij};\hat{\gamma})\Bigg\}.
\]
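As a concrete sketch (our illustration, not the authors' implementation), the estimator can be computed from the predicted values of the three fitted models; here `ind_ab` holds the indicators $\mathds{1}(A_{ij}=a,A_{i,3-j}=b)$, and the argument names are ours:

```python
def model2_estimator(ind_ab, pi_hat, psi_hat, mu_hat, y):
    """Doubly robust Model 2 estimate of E[Y_j^{a,b}].

    ind_ab  : indicator 1{A_ij = a, A_i,3-j = b} for each pair i
    pi_hat  : predicted P(A_j = a | C, X_j)
    psi_hat : predicted P(A_{3-j} = b | C, X_j)
    mu_hat  : predicted E[Y_j | A_j = a, A_{3-j} = b, C, X_j]
    y       : observed outcomes Y_ij
    """
    n = len(y)
    total = 0.0
    for i in range(n):
        # inverse-probability-weighted residual plus outcome-model prediction
        ipw_term = ind_ab[i] / (pi_hat[i] * psi_hat[i]) * (y[i] - mu_hat[i])
        total += ipw_term + mu_hat[i]
    return total / n
```

When the outcome model fits perfectly ($Y_{ij}=\hat{\mu}$), the weighted residual term vanishes and the estimate reduces to the mean of the outcome-model predictions, which is the mechanism behind the double robustness shown above.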

We show that $\widehat{\beta}_{2}$ is consistent under misspecification of one or both of the propensity score models, provided the outcome regression model is correctly specified; and that $\widehat{\beta}_{2}$ is consistent under misspecification of the outcome regression model, provided that both propensity score models are correctly specified.

We have $\widehat{\beta}_{2}\overset{p}{\longrightarrow}\beta^{*}$, where

\[
\beta^{*}=E\Bigg[\frac{\mathds{1}(A_{j}=a,A_{3-j}=b)}{\pi(C,X_{j};\alpha^{*})\,\psi(C,X_{j};\theta^{*})}\Big(Y_{j}-\mu(a,b,C,X_{j};\gamma^{*})\Big)+\mu(a,b,C,X_{j};\gamma^{*})\Bigg].
\]

Now suppose first that the outcome regression model is correctly specified. Then $\mu(a,b,C,X_{j};\gamma^{*})=E\big[Y_{j}^{a,b}\,|\,C,X_{j}\big]$. Conditioning on $Y_{j}^{a,b},C,X_{j}$, we have:

\begin{align*}
\beta^{*}=\ & E\Bigg[\frac{Y_{j}^{a,b}-E\big[Y_{j}^{a,b}\,|\,C,X_{j}\big]}{\pi(C,X_{j};\alpha^{*})\,\psi(C,X_{j};\theta^{*})}\,E\Big[\mathds{1}(A_{j}=a,A_{3-j}=b)\,\Big|\,Y_{j}^{a,b},C,X_{j}\Big]+E\big[Y_{j}^{a,b}\,|\,C,X_{j}\big]\Bigg]\\
=\ & E\Bigg[\frac{Y_{j}^{a,b}-E\big[Y_{j}^{a,b}\,|\,C,X_{j}\big]}{\pi(C,X_{j};\alpha^{*})\,\psi(C,X_{j};\theta^{*})}\,E\Big[\mathds{1}(A_{j}=a,A_{3-j}=b)\,\Big|\,C,X_{j}\Big]+E\big[Y_{j}^{a,b}\,|\,C,X_{j}\big]\Bigg] && \text{by B1}\\
=\ & E\Bigg[\frac{P(A_{j}=a,A_{3-j}=b\,|\,C,X_{j})}{\pi(C,X_{j};\alpha^{*})\,\psi(C,X_{j};\theta^{*})}\underbrace{E\Big[Y_{j}^{a,b}-E\big[Y_{j}^{a,b}\,|\,C,X_{j}\big]\,\Big|\,C,X_{j}\Big]}_{=0}+E\big[Y_{j}^{a,b}\,|\,C,X_{j}\big]\Bigg]\\
=\ & 0+E\big[Y_{j}^{a,b}\big].
\end{align*}

Finally, if both propensity score models are correctly specified, then conditioning on $Y_{j}^{a,b},C,X_{j}$, we have:

\begin{align*}
\beta^{*}=\ & E\Bigg[\frac{Y_{j}^{a,b}-\mu(a,b,C,X_{j};\gamma^{*})}{\pi(C,X_{j};\alpha_{0})\,\psi(C,X_{j};\theta_{0})}\,E\Big[\mathds{1}(A_{j}=a,A_{3-j}=b)\,\Big|\,Y_{j}^{a,b},C,X_{j}\Big]+\mu(a,b,C,X_{j};\gamma^{*})\Bigg]\\
=\ & E\Bigg[\frac{Y_{j}^{a,b}-\mu(a,b,C,X_{j};\gamma^{*})}{\pi(C,X_{j};\alpha_{0})\,\psi(C,X_{j};\theta_{0})}\,E\Big[\mathds{1}(A_{j}=a,A_{3-j}=b)\,\Big|\,C,X_{j}\Big]+\mu(a,b,C,X_{j};\gamma^{*})\Bigg] && \text{by B1}\\
=\ & E\Bigg[\frac{Y_{j}^{a,b}-\mu(a,b,C,X_{j};\gamma^{*})}{\pi(C,X_{j};\alpha_{0})\,\psi(C,X_{j};\theta_{0})}\,P(A_{j}=a\,|\,C,X_{j})\,P(A_{3-j}=b\,|\,C,X_{j})+\mu(a,b,C,X_{j};\gamma^{*})\Bigg] && \text{by B7}\\
=\ & E\Big[Y_{j}^{a,b}-\mu(a,b,C,X_{j};\gamma^{*})+\mu(a,b,C,X_{j};\gamma^{*})\Big]\\
=\ & E\big[Y_{j}^{a,b}\big].
\end{align*}

Appendix D Data analysis and Simulations

Here we describe the models that were used in the data analysis. The Model 1 estimator uses predicted values from an outcome regression model and from a joint propensity score model. We use generalized additive models to model $E[Y_{j}\,|\,A_{j},A_{3-j},C,X_{j},X_{3-j}]$, the conditional mean of the twin's Drinking Index outcome, with the following model formulation:

Drinking.index ~ s(Parent.alcohol.abuse) + s(Parent.drug.abuse) +
  ti(Parent.alcohol.abuse, Parent.drug.abuse) + Parent.occupation.level +
  Sex + Zygosity + Academic.motivation + Sex*Academic.motivation +
  s(Conflict.with.parents) + Age + Exposure + Parent.alcohol.abuse*Exposure +
  Sex*Exposure + Cotwins.exposure + Zygosity*Cotwins.exposure + Exposure*Cotwins.exposure

where s( ) indicates a smooth function of the variable, and ti( , ) indicates an interaction term consisting of a smooth function of the two variables. To enforce symmetry between the Twin 1s and the Twin 2s, we fit this model on a stacked dataset combining the data for the Twin 1s and the data for the Twin 2s, so that the same fitted values of the regression parameters are used for both.
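The stacking step can be sketched as follows (a minimal illustration with hypothetical field names, not the authors' code): each pair contributes two rows, one per twin, with the "own" and "co-twin" covariates swapped, so that a single symmetric model can be fit to all rows.

```python
def stack_twin_pairs(pairs):
    """Turn one record per twin pair into two rows, one per twin.

    Each input record is a dict with shared covariates under 'shared' and
    per-twin entries x1/x2 (individual covariates), a1/a2 (exposures),
    and y1/y2 (outcomes).  Row j lists twin j's own values under
    'x'/'exposure'/'y' and the co-twin's under the 'cotwins_' names.
    """
    rows = []
    for p in pairs:
        for j, k in ((1, 2), (2, 1)):
            rows.append({
                "shared": p["shared"],               # e.g. parental covariates
                "x": p[f"x{j}"],                     # own individual covariates
                "cotwins_x": p[f"x{k}"],             # co-twin's covariates
                "exposure": p[f"a{j}"],
                "cotwins_exposure": p[f"a{k}"],
                "y": p[f"y{j}"],
            })
    return rows
```

Fitting one model to the stacked rows forces the regression coefficients to be identical for Twin 1 and Twin 2, which is the symmetry described above.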

For the joint propensity score model, we first use generalized additive models to model $\text{logit}\,P(A_{j}=1\,|\,C,X_{j})$:

Exposure ~ Parent.alcohol.abuse + Parent.drug.abuse +
  ti(Parent.alcohol.abuse, Parent.drug.abuse) + Externalizing.disorder +
  Academic.motivation + Conflict.with.parents +
  ti(Parent.alcohol.abuse, Conflict.with.parents) + s(Age), family=binomial

As before, we fit this propensity score model on a stacked dataset to enforce symmetry between the Twin 1s and the Twin 2s. We then obtain predicted values for the joint distribution $F(a,b):=P(A_{1}=a,A_{2}=b\,|\,C,X_{1},X_{2})$ from predicted values for the margins using a Dale model [2]. Under the Dale model, the joint distribution is determined by the two margins $F_{1}(0):=P(A_{1}=0\,|\,C,X_{1})$ and $F_{2}(0):=P(A_{2}=0\,|\,C,X_{2})$ together with a parameter for the association, the cross ratio $\psi=\frac{F(0,0)\times F(1,1)}{F(0,1)\times F(1,0)}\in(0,\infty)$, where $\psi=1$ corresponds to the case that $A_{1}$ and $A_{2}$ are conditionally independent. Here we allow the strength of the association between $A_{1}$ and $A_{2}$ to depend on zygosity and sex, assuming $\log(\psi_{i})=\alpha_{0}+\alpha_{1}\,\text{Zygosity}_{i}+\alpha_{2}\,\text{Sex}_{i}$. We estimate $\alpha=(\alpha_{0},\alpha_{1},\alpha_{2})$ by maximum likelihood, treating the margins as fixed. Then $F(0,0)$ is determined by the margins $F_{1}(0)$ and $F_{2}(0)$ and the estimated $\widehat{\alpha}$:

\[
F(0,0)=\frac{1}{2(\psi-1)}\Bigg\{\Big[1+(F_{1}(0)+F_{2}(0))(\psi-1)\Big]-\sqrt{\Big[1+(F_{1}(0)+F_{2}(0))(\psi-1)\Big]^{2}+4\psi(1-\psi)F_{1}(0)F_{2}(0)}\Bigg\}
\]
if $\psi\neq 1$, and $F(0,0)=F_{1}(0)F_{2}(0)$ otherwise; taking the root with the minus sign yields the solution that is a valid probability.
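This inversion of the cross ratio can be computed numerically as follows (a minimal sketch; function and variable names are ours). Note that $\psi=1$ must be handled separately, since the closed-form expression divides by $\psi-1$.

```python
import math

def dale_joint_00(f1_0, f2_0, psi):
    """P(A1 = 0, A2 = 0) under the Dale model, given the margins
    F1(0), F2(0) and the cross ratio psi.

    For psi != 1, this is the root of the defining quadratic that lies
    in [0, 1]; for psi = 1 the exposures are conditionally independent.
    """
    if abs(psi - 1.0) < 1e-12:
        return f1_0 * f2_0
    s = 1.0 + (f1_0 + f2_0) * (psi - 1.0)
    disc = s * s + 4.0 * psi * (1.0 - psi) * f1_0 * f2_0
    # the minus root is the one giving a valid probability
    return (s - math.sqrt(disc)) / (2.0 * (psi - 1.0))
```

As a sanity check, plugging the resulting cell probabilities back into the cross ratio recovers $\psi$, and $\psi\to 1$ recovers the independence product $F_{1}(0)F_{2}(0)$.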

The Model 2 estimator uses predicted values from three models. We use the same conditional mean model $E[Y_{j}\,|\,A_{j},A_{3-j},C,X_{j}]$ and the same propensity score model $P(A_{j}\,|\,C,X_{j})$ as above. Finally, we model $\text{logit}\,P(A_{j}=1\,|\,C,X_{3-j})$, where $P(A_{j}=1\,|\,C,X_{3-j})$ is the probability that the twin is exposed given the shared covariates and their co-twin's individual covariates, as:

Exposure ~ Parent.alcohol.abuse + Parent.drug.abuse +
  ti(Parent.alcohol.abuse, Parent.drug.abuse) +
  Cotwins.externalizing.disorder + Cotwins.academic.motivation +
  Cotwins.conflict.with.parents +
  ti(Parent.alcohol.abuse, Cotwins.conflict.with.parents) + s(Age), family=binomial

and fit this model on the stacked dataset.

Next, we describe the data-generating mechanisms used in our simulation study. Under the first data-generating mechanism, Model 1 is correctly specified but Model 2 is not. For each simulated dataset, we resample $n=500$ twin pairs from the Minnesota Twin Family Study (MTFS) data to generate shared and individual baseline covariates. Let the $s$th twin pair in the simulated data have covariates $(C_{s},X_{s1},X_{s2})$. We then use the model for $P(A_{1},A_{2}\,|\,C,X_{1},X_{2})$ described above to generate exposures, so that twin pair $s$ has exposures $A_{s1}=a,A_{s2}=b$ with probability $P(A_{1}=a,A_{2}=b\,|\,C=C_{s},X_{1}=X_{s1},X_{2}=X_{s2})$, the predicted value from the model fit on the MTFS data. The Drinking Index outcome in the MTFS data is approximately normal, so we generate outcomes from a bivariate normal distribution whose mean corresponds to the predicted values of the outcome regression model fit on the MTFS data, and whose variance-covariance matrix is chosen to approximate the variance and covariance seen in the MTFS data. In particular, we take outcomes for Twin 1 and Twin 2 to be more strongly correlated among MZ twins than among DZ twins. Specifically, for twin pair $s$, we draw

\[
(Y_{s1},Y_{s2})\ \sim\ MVN_{2}\!\left(\begin{bmatrix}\mu_{1}\\ \mu_{2}\end{bmatrix},\ \begin{pmatrix}8&\sigma\\ \sigma&8\end{pmatrix}\right),
\]

where $\mu_{j}=E[Y_{j}\,|\,A_{j}=A_{sj},A_{3-j}=A_{s,3-j},C=C_{s},X_{j}=X_{sj},X_{3-j}=X_{s,3-j}]$ is the predicted value from the model fit on the MTFS data, and where $\sigma=3.5$ if twin pair $s$ is MZ and $\sigma=1$ if twin pair $s$ is DZ.
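This outcome-generation step can be sketched as follows (our illustration, not the authors' code): the bivariate normal draw is implemented via the Cholesky factor of the zygosity-specific covariance matrix, with the means supplied as hypothetical inputs.

```python
import random

def draw_outcomes(mu1, mu2, is_mz, rng=random):
    """Draw (Y1, Y2) from a bivariate normal with marginal variance 8
    and covariance 3.5 for MZ pairs or 1 for DZ pairs."""
    var = 8.0
    cov = 3.5 if is_mz else 1.0
    z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    # Cholesky factor of [[var, cov], [cov, var]]
    l11 = var ** 0.5
    l21 = cov / l11
    l22 = (var - l21 * l21) ** 0.5
    y1 = mu1 + l11 * z1
    y2 = mu2 + l21 * z1 + l22 * z2
    return y1, y2
```

By construction, $\mathrm{Var}(Y_1)=\mathrm{Var}(Y_2)=8$ and $\mathrm{Cov}(Y_1,Y_2)=\sigma$, so MZ pairs are more strongly correlated than DZ pairs, as specified.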

For the second data-generating mechanism, under which Model 1 and Model 2 are both correctly specified, we modify the step in which we generate exposures: we draw $A_{s1}\sim\text{Bern}(p_{s1})$, where $p_{s1}=P(A_{1}=1\,|\,C=C_{s},X_{1}=X_{s1})$, and independently draw $A_{s2}\sim\text{Bern}(p_{s2})$, where $p_{s2}=P(A_{2}=1\,|\,C=C_{s},X_{2}=X_{s2})$.