
Causal Effects in Twin Studies: the Role of Interference

Bonnie Smith1, Elizabeth L. Ogburn1, Matt McGue2,
Saonli Basu3, Daniel O. Scharfstein1
(1 Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD
2 Department of Psychology, University of Minnesota, Minneapolis, MN
3 Division of Biostatistics, University of Minnesota, Minneapolis, MN
[email protected])
Abstract

The use of twin designs to address causal questions is becoming increasingly popular. A standard assumption is that there is no interference between twins, that is, that no twin's exposure has a causal impact on their co-twin's outcome. However, there may be settings in which this assumption does not hold, and this would (1) impact the causal interpretation of parameters obtained by commonly used existing methods; (2) change which effects are of greatest interest; and (3) impact the conditions under which we may estimate these effects. We explore these issues, and we derive semi-parametric efficient estimators for causal effects in the presence of interference between twins. Using data from the Minnesota Twin Family Study, we apply our estimators to assess whether twins' consumption of alcohol in early adolescence may have a causal impact on their co-twins' substance use later in life.


Keywords: Co-twin control method; Semi-parametric efficiency; Spillover effect

1 Introduction

While researchers have predominantly used twin studies to assess the heritability of a given trait, they are also increasingly leveraging twin designs to learn about the causal relationship between an exposure and an outcome; see for example [15, 7, 8, 9, 10, 19] and references therein. A popular tool in this area of twin research is the co-twin control method, in which an unexposed twin serves as the control for their exposed co-twin. This method uses the fact that twins are naturally matched on many predictors of the exposure and the outcome, such as shared genetics and characteristics of their shared environment; this matching can make it possible to estimate the causal effect of the exposure on the outcome even if these shared predictors are unobserved [15]. In this paper, we draw attention to an important causal assumption that underlies this approach: the assumption that there is no interference between twins, meaning that one twin's exposure has no causal impact on their co-twin's outcome. This is a standard assumption in the twins literature [21, 10]; however, evidence from the literature on sibling influence suggests that it may not hold in some cases. A number of studies have identified variables, such as substance use, for which a subject's behavior may be causally impacted by that of their siblings [22, 11, 17]. Since there is evidence of interference between siblings in these settings, it seems plausible that interference could be present between twins in particular.

The possibility of interference between twins is important for a number of reasons. One is that the parameter estimated using the co-twin control method has a different causal interpretation when interference is present; incorrectly assuming that there is no interference may therefore lead one to draw misleading conclusions. Interference also affects which parameters are of most scientific interest, and the conditions under which these causal parameters can be identified from the observed data. If interference is present, key causal effects of interest include spillover effects, the changes in twins' outcomes that would result from intervening on their co-twins' exposures while holding their own exposures fixed; and main effects, the changes in twins' outcomes that would result from intervening on their own exposures while holding their co-twins' exposures fixed. Unfortunately, the co-twin control method does not provide a means of estimating spillover effects or main effects when there is interference between twins; rather, it targets an effect of more limited scientific and policy relevance.

In this paper, we clarify which causal effect is estimated by the co-twin control method when there is interference between twins, and argue that it is not a causal effect of prime interest. In order to identify effects that are of prime interest, such as main effects and spillover effects, we then proceed under the assumption that the measured baseline covariates control for all confounding of the effects of the exposures on the outcomes. In this framework, we derive estimators of average main effects and average spillover effects which are semi-parametric efficient; that is, they make the most efficient use of the data possible, given the assumptions of the statistical models that we use.

In Section 2, we define potential outcomes and causal effects for our setting of independent pairs of twins with possible between-twin interference, and we review the co-twin control method for the case where there is no interference. In Section 3, we consider the co-twin control method with interference, then introduce the causal models that we will use in the rest of the paper, which make the assumption of no unmeasured confounding. In Section 4 we present the semi-parametric efficient estimators of average spillover effects and average main effects in these models. In Section 5 we apply these estimators to data from the Minnesota Twin Family Study, and investigate whether twins’ exposure to alcohol in early adolescence may have a causal impact on their co-twins’ drinking behavior in adulthood. We demonstrate the finite-sample performance of our estimators in a simulation study in Section 6, and conclude with a discussion in Section 7.

2 Background

2.1 Causal effects for the setting of within-pairs interference

Consider a twin study with a binary exposure $A$, where $A=1$ if the subject is exposed and $A=0$ if the subject is unexposed, and an outcome $Y$, and suppose first that there is no interference between subjects. In this setting each subject is considered to have two potential outcomes: $Y^{1}$, the outcome that the subject would have if, possibly counter to fact, they were to have the exposure; and $Y^{0}$, the outcome that the subject would have if, possibly counter to fact, they were to be unexposed. The observed outcome $Y$ is assumed to be equal to the potential outcome $Y^{1}$ for subjects who have exposure $A=1$, and equal to $Y^{0}$ for subjects with $A=0$. While only one potential outcome is observed for each twin (and hence the causal effect $Y^{1}-Y^{0}$ for an individual is never known), under additional assumptions of positivity and exchangeability the population average of these effects can be estimated from the observed data. Common targets of inference in twin studies include the average causal effect $E\big[Y^{1}-Y^{0}\big]$, where the mean is over all twins in the target population, and the average causal effect in the subgroup of twins who are discordant with their co-twin for the exposure.

Now suppose instead that there may be interference between the two twins in each pair, but not between twins from different twin pairs, so that each twin's outcome may be impacted by their own exposure and their co-twin's exposure. In this case we cannot meaningfully speak of a twin's outcome if they were to be exposed, $Y^{1}$, since they might have one outcome if both they and their co-twin were to be exposed, and a different outcome if they were to be exposed but their co-twin were not. Thus, in this setting, each twin has four potential outcomes: $Y^{1,1}$, $Y^{1,0}$, $Y^{0,1}$, and $Y^{0,0}$, where we write $Y^{a,b}$ for the outcome that the twin would have if, possibly counter to fact, they were to have exposure $a$ and their co-twin were to have exposure $b$. For a twin who has exposure $A=a$ and whose co-twin has exposure $A=b$, the observed outcome $Y$ is assumed to be equal to the potential outcome $Y^{a,b}$, while the twin's other three potential outcomes are not observed. There are a number of different causal effects that we can consider in this setting: of particular interest are average main effects $E\big[Y^{0,0}-Y^{1,0}\big]$ and $E\big[Y^{0,1}-Y^{1,1}\big]$, in which the subject's own exposure is varied while their co-twin's exposure is held constant; and average spillover effects $E\big[Y^{0,0}-Y^{0,1}\big]$ and $E\big[Y^{1,0}-Y^{1,1}\big]$, in which the subject's own exposure is held constant while their co-twin's exposure is varied.
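To make these definitions concrete, the following minimal simulation sketch (ours, not from the paper) generates all four potential outcomes for each twin, so that average main and spillover effects can be computed directly; the linear data-generating process and the effect sizes 1.0 and 0.5 are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000  # number of twins (illustrative)

# Hypothetical data-generating process: each twin has four potential
# outcomes Y^{a,b}, indexed by own exposure a and co-twin's exposure b.
# Own exposure shifts the outcome by 1.0, co-twin's exposure by 0.5.
U = rng.normal(size=n)  # a shared background factor
Y = {(a, b): U + 1.0 * a + 0.5 * b + rng.normal(size=n)
     for a in (0, 1) for b in (0, 1)}

# Average main effect E[Y^{0,0} - Y^{1,0}]: vary own exposure while
# holding the co-twin's exposure fixed at 0 (close to -1.0 here).
main_effect = np.mean(Y[(0, 0)] - Y[(1, 0)])

# Average spillover effect E[Y^{0,0} - Y^{0,1}]: hold own exposure
# fixed at 0 while varying the co-twin's exposure (close to -0.5 here).
spillover_effect = np.mean(Y[(0, 0)] - Y[(0, 1)])
```

In a real study only one of the four potential outcomes per twin is observed, which is why the identification assumptions of Section 3 are needed.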

Methods for estimating average main effects and average spillover effects have been studied by several authors. One body of research considers the setting of partial interference, in which there are multiple distinct groups of subjects and interference does not operate across different groups. See for example [5, 6, 23, 12, 14] for estimation of average main effects and average spillover effects under partial interference, and see [13, 12, 14] for asymptotic properties of the estimators. A different strand of the interference literature considers the setting where there is only one group, such as a single connected social network where interference could occur between any of the subjects. See for example [1, 26, 18, 16, 24]. The context that we consider here, namely independent groups of size two, falls into the first of these two frameworks. To the best of our knowledge, no previous work on partial interference demonstrated semi-parametric efficiency of an estimator. Here we derive semi-parametric efficient estimators for two models that are tailored to the setting of independent pairs.

2.2 Notation and data structure

In describing the data for a twin pair, we distinguish between baseline covariates which are characteristics of the pair (and thus necessarily common to the two twins in a given pair), such as zygosity or parental characteristics, and baseline covariates which are characteristics of an individual twin and may or may not be the same for the two twins in a given pair. We refer to these respectively as shared covariates and individual covariates. In each twin pair, suppose that the twins have been randomly labeled as Twin 1 and Twin 2. The observed data for a given pair is $O=(C,X_{1},X_{2},A_{1},A_{2},Y_{1},Y_{2})$, where $C$ denotes the shared baseline covariates, $X_{j}$ denotes the individual baseline covariates for Twin $j$, and $A_{j}$ and $Y_{j}$ are the binary exposure and the outcome for Twin $j$. Throughout, $C$, $X_{1}$, and $X_{2}$ will refer to collections of measured baseline covariates. In some sections we will also consider factors which are unobserved; we will use notation such as $U$ or $U_{sh}$ when we refer to collections of variables at least some of which are unobserved. We assume that we observe data for $n$ twin pairs, and that the observed data $O_{1},\ldots,O_{n}$ for these $n$ pairs are independent and identically distributed.

We use the notation $Y_{j}^{a,b}$ for a potential outcome for Twin $j$: specifically, let $Y_{j}^{a,b}$ be the outcome that Twin $j$ would have if, possibly counter to fact, Twin $j$ were to have exposure $a$ and their co-twin were to have exposure $b$. Our primary targets of inference will be means of potential outcomes $E\big[Y^{a,b}\big]$ and contrasts of such parameters, where the mean is taken over all twins in the population. We can also write $E\big[Y^{a,b}\big]$ as $E\big[\frac{1}{2}\big(Y_{1}^{a,b}+Y_{2}^{a,b}\big)\big]$, where $\frac{1}{2}\big(Y_{1}^{a,b}+Y_{2}^{a,b}\big)$ is the average of the potential outcomes within a twin pair, and the last expectation is a mean taken over all pairs. We will often express $E\big[Y^{a,b}\big]$ this way, so that the units we work in are the independent twin pairs rather than the individual twins.

2.3 The co-twin control method under the assumption of no interference

Here we review the co-twin control method for the setting where there is no interference between twins, following Sjölander et al. [21] and McGue et al. [15]; in Section 3 we consider how the method behaves when interference is present.

The causal effect of the exposure $A_{j}$ on the outcome $Y_{j}$ may be confounded by any predictor $V$ that is a common cause of both $A_{j}$ and $Y_{j}$ (since there will be a non-causal source of association between $A_{j}$ and $Y_{j}$ via the path through $V$). Such a predictor could be a measured or unmeasured factor individual to Twin $j$, or a measured or unmeasured factor shared by both twins. While we can explicitly adjust for measured confounders, unobserved confounders would typically preclude identification of the causal effect. However, using the co-twin control method, all shared factors, whether measured or unmeasured, are naturally accounted for by the fact that the two twins are matched on these factors. Sjölander et al. [21] therefore showed that, in cases where there is no unmeasured confounding due to individual factors, a causal effect is identified by adjusting for the measured individual covariates $X_{1}$ and $X_{2}$.

Let $U_{sh}$ denote the set of all confounders, including both measured and unmeasured factors, which are shared between the two twins. Consider the causal diagram in Figure 1(a), where a directed arrow means that the first variable has a possible causal impact on the second, while a bidirected arc between two variables indicates that there may be unobserved variables that are common causes of the two. Assume that the relationship among the variables is given by this diagram. In particular, assume that there is no interference between the two twins, as signaled by the absence of directed arrows $A_{1}\to Y_{2}$ and $A_{2}\to Y_{1}$, and assume that there are no non-shared unmeasured variables that directly impact both $A_{j}$ and $Y_{j}$.

Consider the subgroup of twins who are discordant with their co-twins for the exposure, but who have the same level $x$ of individual covariates as their co-twins. Write $E\big[Y^{1}-Y^{0}\,|\,A_{1}\neq A_{2},X_{1}=X_{2}=x\big]$ for the average causal effect in this subgroup. For an exposed twin in this subgroup, their potential outcome $Y^{1}$ is observed; and while their potential outcome $Y^{0}$ is not observed, their unexposed co-twin's potential outcome $Y^{0}$ is observed. Sjölander et al. [21] show that, because of symmetry between the group of twins designated Twin 1 and the group designated Twin 2, we may use the unexposed co-twins' outcomes as proxies for the exposed twins' $Y^{0}$ potential outcomes, and that the causal parameter $E\big[Y^{1}-Y^{0}\,|\,A_{1}\neq A_{2},X_{1}=X_{2}=x\big]$ is equal to $E\big[Y_{j}\,|\,A_{j}=1,A_{3-j}=0,X_{1}=X_{2}=x\big]-E\big[Y_{j}\,|\,A_{j}=0,A_{3-j}=1,X_{1}=X_{2}=x\big]$, the mean difference between the observed outcome for the exposed twin and the observed outcome for the unexposed twin within this subgroup. The latter difference can then be estimated, for example by fitting a between-within regression model as described in Sjölander et al. [21]. Thus, when all individual confounders are fully observed, the within-pair coefficient in the between-within regression model has a causal interpretation: it is the causal effect for the subgroup of twins who are discordant with their co-twin for the exposure but have the same level of individual covariates as their co-twin.
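The estimation step can be sketched as follows (our illustration, not the authors' code). We simulate pairs with a shared unmeasured confounder $U$ and fit a between-within regression by ordinary least squares; including the pair-mean exposure absorbs the between-pair confounding, so the within-pair coefficient recovers the true effect (set to 2.0 as an assumption of this sketch) even though $U$ is never observed.

```python
import numpy as np

rng = np.random.default_rng(1)
n_pairs = 50_000

# Hypothetical generative model: shared unmeasured confounder U,
# no interference; the true causal effect of A_j on Y_j is 2.0.
U = rng.normal(size=n_pairs)              # shared, unobserved
X = rng.normal(size=(n_pairs, 2))         # individual covariates
p = 1 / (1 + np.exp(-(U[:, None] + X)))   # exposure depends on U and X
A = rng.binomial(1, p)
Y = 2.0 * A + U[:, None] + 0.5 * X + rng.normal(size=(n_pairs, 2))

# Between-within regression: Y_j ~ 1 + A_j + Abar + X_j + X_{3-j},
# stacking the two twins of each pair as separate rows.
Abar = A.mean(axis=1)
design = np.vstack([
    np.column_stack([np.ones(n_pairs), A[:, j], Abar, X[:, j], X[:, 1 - j]])
    for j in (0, 1)])
y = np.concatenate([Y[:, 0], Y[:, 1]])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
beta_W = coef[1]  # within-pair coefficient; close to 2.0 despite unobserved U
```

Including the pair means of the regressors is the Mundlak device, which makes the coefficient on $A_j$ equal to the within-pair (fixed-effects) estimator, so the shared confounder differences out.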

Figure 1: Causal diagrams for two settings where the co-twin control method can be used: (a) assuming no interference; (b) allowing for interference. Here $U_{sh}$ represents all confounders, both measured and unmeasured, which are shared by the two twins, while $X_{j}$ are measured individual covariates for Twin $j$, and $A_{j}$ and $Y_{j}$ are Twin $j$'s exposure and outcome. The blue arrows in (b) allow for possible interference between the two twins, i.e. one twin's exposure may have a causal impact on their co-twin's outcome.

3 Identification of causal effects when interference is present

Now we consider a setting where interference between twins may exist, that is, where it is possible for one twin's exposure to affect their co-twin's outcome. For example, this could occur because the two twins influence one another's behavior or share information related to the exposure with each other. Interference is likely to be present in many studies where the exposure and outcome are behavioral. Incorrectly assuming that there is no interference can lead to misleading conclusions, and it ignores spillover effects, which may themselves be of substantive interest.

3.1 The co-twin control effect when there is interference between twins

Here we modify the scenario described in Section 2.3 to allow for interference between the two twins in each pair. Suppose now that the relationship among the variables is as in Figure 1(b). In particular, there may be shared factors (both measured and unmeasured) which impact both the exposures and the outcomes, but we assume that there are no unmeasured non-shared factors that directly cause both $A_{j}$ and $Y_{j}$, or that directly cause both $A_{3-j}$ and $Y_{j}$. As before, the matching between the twins on shared factors allows for identification of one subgroup causal effect; importantly, this is the effect which is estimated using the co-twin control method when interference is present. However, we argue that this is not an effect of prime importance, and that a different approach is needed in order to identify more useful effects such as average main effects and average spillover effects.

Between-within regression model: $E\big[Y_{j}\,|\,A_{j},A_{3-j},X_{j},X_{3-j}\big]=\beta_{0}+\beta_{W}A_{j}+\beta_{B}\overline{A}+\gamma X_{j}+\delta X_{3-j}$

Statistical value of the regression coefficient $\beta_{W}$: $E\big[Y_{j}\,|\,A_{j}=1,A_{3-j}=0,X_{1}=X_{2}\big]-E\big[Y_{j}\,|\,A_{j}=0,A_{3-j}=1,X_{1}=X_{2}\big]$

Causal interpretation of $\beta_{W}$ with no interference between twins: $E\big[Y^{1}-Y^{0}\,|\,A_{1}\neq A_{2},X_{1}=X_{2}\big]$

Causal interpretation of $\beta_{W}$ with interference between twins: $E\big[Y^{1,0}-Y^{0,1}\,|\,A_{1}\neq A_{2},X_{1}=X_{2}\big]$

Table 1: Comparison of the causal interpretation of the within-pair coefficient $\beta_{W}$ in a between-within regression model, for the settings with and without interference between twins. In the no-interference case, $Y^{1}-Y^{0}$ is the difference in a twin's outcome under an intervention changing the twin from unexposed to exposed. In the interference case, $Y^{1,0}-Y^{0,1}$ is the difference in a twin's outcome under an intervention changing the twin from unexposed to exposed, while conversely changing their co-twin from exposed to unexposed.

As in Section 2.3, consider the subgroup of twins who are discordant with their co-twins for the exposure, but who have the same level $x$ of individual covariates as their co-twins. For an exposed twin in this subgroup, the potential outcome $Y^{1,0}$ is observed, and for an unexposed twin in this subgroup, the potential outcome $Y^{0,1}$ is observed. In Appendix A, we prove that, under the assumptions listed there, the mean of the contrast $Y^{1,0}-Y^{0,1}$ in this subgroup, $E\big[Y^{1,0}-Y^{0,1}\,|\,A_{1}\neq A_{2},X_{1}=X_{2}=x\big]$, is equal to $E\big[Y_{j}\,|\,A_{j}=1,A_{3-j}=0,X_{1}=X_{2}=x\big]-E\big[Y_{j}\,|\,A_{j}=0,A_{3-j}=1,X_{1}=X_{2}=x\big]$. Note that the latter difference is the same quantity that appears in Section 2.3 and is estimated using a between-within regression model, but that the causal interpretation of this parameter is different when there is interference between twins. The different interpretations for the two settings are highlighted in Table 1. The effect $E\big[Y^{1,0}-Y^{0,1}\,|\,A_{1}\neq A_{2},X_{1}=X_{2}=x\big]$ identified here is the average difference in twins' outcomes that we would see if we could intervene to change the twins from unexposed to exposed, while conversely intervening to change their co-twins from exposed to unexposed, in the subgroup described above. It is difficult to see why one would want to target this particular combination of interventions. The contrast $Y^{1,0}-Y^{0,1}$ is the difference of the spillover effect $Y^{0,0}-Y^{0,1}$ and the main effect $Y^{0,0}-Y^{1,0}$; however, we cannot tease apart these two effects, and the value of $Y^{1,0}-Y^{0,1}$ itself is not readily interpretable. For example, a zero value of $Y^{1,0}-Y^{0,1}$ could reflect the fact that $Y^{1,0}=Y^{0,0}=Y^{0,1}$, or it could reflect a qualitatively different scenario where $Y^{0,0}-Y^{0,1}$ and $Y^{0,0}-Y^{1,0}$ are each nonzero but cancel each other out. That is, a null value of $Y^{1,0}-Y^{0,1}$ is equally compatible with the exposure having no causal effect and with the presence of very strong interference.
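A toy numeric case (ours, with illustrative values only) makes the cancellation concrete:

```latex
\begin{aligned}
&\text{Hypothetical values: } Y^{0,0}=3,\quad Y^{1,0}=2,\quad Y^{0,1}=2.\\
&\text{Main effect: } Y^{0,0}-Y^{1,0}=1\neq 0,\qquad
 \text{spillover effect: } Y^{0,0}-Y^{0,1}=1\neq 0,\\
&\text{yet the co-twin control contrast is null: } Y^{1,0}-Y^{0,1}=2-2=0.
\end{aligned}
```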

The co-twin control method is a unique tool which allows for estimation of a causal effect even with shared unmeasured confounders, based on the use of discordant pairs. However, we cannot identify contrasts involving $Y^{1,1}$ or $Y^{0,0}$ from discordant pairs when interference is present, since these potential outcomes need not equal the observed outcome of either twin in a discordant pair. The presence of shared unmeasured confounders therefore poses a greater barrier than it does in settings with no interference, since the co-twin control method does not provide a means of estimating important causal effects for the interference setting. In order to identify average main effects and average spillover effects, throughout the rest of the paper we work under the assumption that there is no unmeasured confounding due to either individual or shared factors.

3.2 Identification under the assumption of no unmeasured confounding

Throughout the rest of the paper we make the following assumption of no unmeasured confounding: there are no unmeasured factors, whether shared or non-shared, which directly cause both $A_{j}$ and $Y_{j}$, or which directly cause both $A_{3-j}$ and $Y_{j}$. Specifically, we assume that any unmeasured factors $U$ which impact both an outcome and an exposure do so only through the measured baseline covariates $C$, $X_{1}$, and $X_{2}$. Under this assumption, adjusting for the measured baseline covariates $C,X_{1},X_{2}$ controls for all confounding of the effects of $A_{j}$ and $A_{3-j}$ on $Y_{j}$. This is an untestable assumption which is not automatically satisfied by design in an observational study; how reasonable it is in a given study will depend on factors specific to that study, including how rich a set of baseline covariates is measured.

Consider the four groups of twin pairs defined by the four exposure patterns $(A_{1}=1,A_{2}=1)$, $(A_{1}=1,A_{2}=0)$, $(A_{1}=0,A_{2}=1)$, and $(A_{1}=0,A_{2}=0)$, and consider a specific potential outcome $Y_{j}^{a,b}$. In general the potential outcomes $Y_{j}^{a,b}$ could be systematically higher in one of these four groups than in another. For example, they could be higher in the $(A_{1}=1,A_{2}=1)$ group than in the $(A_{1}=0,A_{2}=0)$ group if having some predictor $V$ causes both exposures and potential outcomes to be higher. However, our assumption of no unmeasured confounding implies that adjusting for the measured baseline covariates breaks any such association between $Y_{j}^{a,b}$ and the exposures, so that within levels of the measured baseline covariates $C,X_{1},X_{2}$, the four groups are exchangeable in terms of the potential outcomes $Y_{j}^{a,b}$. This assumption of no unmeasured confounding, or exchangeability, is given by:

A1: $Y_{j}^{a,b}\perp\!\!\!\perp(A_{1},A_{2})\ \big|\ (C,X_{1},X_{2})$ for each fixed $a,b=0,1$, $j=1,2$.   (exchangeability)

We additionally make the positivity assumption that, within each level of the baseline covariates, there is a nonzero probability of having each of the four exposure patterns, and the consistency assumption that a twin's observed outcome $Y_{j}$ is equal to the potential outcome corresponding to the exposures actually received by the twin and their co-twin:

A2: For all $c,x_{1},x_{2}$ in the support of $C,X_{1},X_{2}$: $P\big(A_{1}=a,A_{2}=b\,|\,C=c,X_{1}=x_{1},X_{2}=x_{2}\big)>0$ for all $a,b=0,1$.   (positivity)

A3: If $A_{j}=a$ and $A_{3-j}=b$, then $Y_{j}=Y_{j}^{a,b}$.   (consistency)

The assumptions of exchangeability, positivity, and consistency are sufficient for identification of the parameters $E\big[Y_{j}^{a,b}\big]$. That is, under these three assumptions, we can express the causal parameter $E\big[Y_{j}^{a,b}\big]$ as a function of the observed data, as we show for completeness below. Exchangeability and positivity allow us, within each level of the baseline covariates, to take the mean over just the group with the exposure pattern $(A_{j}=a,A_{3-j}=b)$ rather than over all twins labeled as Twin $j$ in the population; and consistency implies that, among the twins in this group, the potential outcome $Y_{j}^{a,b}$ is equal to the observed outcome $Y_{j}$.

$$\begin{aligned}
E\big[Y_{j}^{a,b}\big] &= \int_{y} y\, dF\big(Y_{j}^{a,b}=y\big)\\
&= \int_{c,x_{1},x_{2}}\int_{y} y\, dF\big(Y_{j}^{a,b}=y\,\big|\,c,x_{1},x_{2}\big)\, dF(c,x_{1},x_{2})\\
&= \int_{c,x_{1},x_{2}}\int_{y} y\, dF\big(Y_{j}^{a,b}=y\,\big|\,A_{j}=a,A_{3-j}=b,c,x_{1},x_{2}\big)\, dF(c,x_{1},x_{2}) && \text{by A1 and A2}\\
&= \int_{c,x_{1},x_{2}}\int_{y} y\, dF\big(Y_{j}=y\,\big|\,A_{j}=a,A_{3-j}=b,c,x_{1},x_{2}\big)\, dF(c,x_{1},x_{2}) && \text{by A3}\\
&= E\Big[E\big[Y_{j}\,\big|\,A_{j}=a,A_{3-j}=b,C,X_{1},X_{2}\big]\Big].
\end{aligned}$$
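This identification result suggests a plug-in (g-formula) estimator: fit the outcome regression and standardize it over the empirical covariate distribution, averaging over the two twins as in Section 2.2. The sketch below is ours; the linear outcome model and the effect sizes are assumptions of the illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 40_000

# Hypothetical data satisfying A1-A3: all confounding runs through the
# measured covariates (C, X1, X2).  True main effect 1.0, spillover 0.4.
C = rng.normal(size=n)
X = rng.normal(size=(n, 2))
A = rng.binomial(1, 1 / (1 + np.exp(-(C[:, None] + X))))
Y = (1.0 * A + 0.4 * A[:, ::-1] + C[:, None] + 0.5 * X
     + rng.normal(size=(n, 2)))

def g_formula(a, b):
    """Plug-in estimate of E[ E[Y_j | A_j=a, A_{3-j}=b, C, X1, X2] ],
    averaged over j = 1, 2, using a linear outcome regression."""
    est = 0.0
    for j in (0, 1):
        D = np.column_stack([np.ones(n), A[:, j], A[:, 1 - j],
                             C, X[:, j], X[:, 1 - j]])
        coef, *_ = np.linalg.lstsq(D, Y[:, j], rcond=None)
        D_ab = D.copy()
        D_ab[:, 1], D_ab[:, 2] = a, b   # set (A_j, A_{3-j}) = (a, b)
        est += (D_ab @ coef).mean() / 2
    return est

avg_main = g_formula(0, 0) - g_formula(1, 0)        # close to -1.0
avg_spillover = g_formula(0, 0) - g_formula(0, 1)   # close to -0.4
```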

We will consider two models for the data for each twin pair. The larger of these, Model 1, makes only the assumptions used for identification above, and corresponds to the diagram in Figure 2(a). Here $U$ represents any unmeasured factors (shared or not). The absence of arrows pointing directly from $U$ to the exposures and the outcomes corresponds to the no unmeasured confounding assumption. Model 2 is a smaller model in which we make three assumptions in addition to the identification assumptions, and corresponds to the diagram in Figure 2(b). These extra assumptions, A4-A6, are that Twin $j$'s individual covariates do not have a causal impact on their co-twin's exposure (the lack of a directed arrow $X_{j}\to A_{3-j}$) or on their co-twin's outcome (the lack of a directed arrow $X_{j}\to Y_{3-j}$), and that Twin 1's and Twin 2's exposures are conditionally independent given the measured shared and individual baseline covariates (the lack of a bidirected arc $A_{1}\leftrightarrow A_{2}$).

A4: $A_{j}\perp\!\!\!\perp X_{3-j}\ |\ (C,X_{j})$ for $j=1,2$.

A5: $Y_{j}^{a,b}\perp\!\!\!\perp X_{3-j}\ |\ (C,X_{j})$ for all $a,b=0,1$, $j=1,2$.   (Model 2 assumptions)

A6: $A_{1}\perp\!\!\!\perp A_{2}\ |\ (C,X_{1},X_{2})$.

The assumption that Twin jj’s individual-level covariates do not impact their co-twin’s exposure or outcome is one that is commonly used in the twin literature [21], and the distinction drawn in Model 2 between the individual and the shared covariates is a feature that distinguishes Model 2 from models considered elsewhere in the interference literature. While Model 1, being less restrictive, gives valid inference in a wider range of settings than Model 2, the advantage of Model 2 is that it allows for improved (asymptotic) efficiency: in settings which meet the assumptions posited for Model 2, we may leverage these additional assumptions to make a more efficient use of the data in estimating our causal parameters of interest, allowing for narrower confidence intervals in large samples. We demonstrate this improved efficiency in a simulation study in Section 6.

Figure 2: Causal diagrams for two models allowing between-twin interference but making the assumption of no unmeasured confounding: (a) Model 1; (b) Model 2. $C,X_{1},X_{2}$ are measured baseline covariates, while here $U$ represents unmeasured factors (shared and/or non-shared). In Model 2 we also assume that a twin's individual covariates do not have a causal impact on their co-twin's exposure or outcome, and that Twin 1's and Twin 2's exposures are conditionally independent given observed covariates.

4 Efficient estimation of causal effects when interference is present

4.1 Efficient estimators

Here we derive the efficient estimator of the parameter $\beta^{a,b}=E\big[Y^{a,b}\big]$ for each of the two models described in Section 3.2. That is, we derive the estimator $\widehat{\beta}_{1}^{a,b}$ with the smallest asymptotic variance among all estimators of $\beta^{a,b}$ which are regular and asymptotically linear (RAL) under every distribution in Model 1, and similarly for Model 2. Since Model 2 is a smaller model contained in Model 1, any RAL estimator for Model 1 may be used in Model 2, while the converse need not hold; and we will see that, in fact, there are RAL estimators for Model 2 that are more efficient than any Model 1 estimator.

RAL estimators are asymptotically normal with asymptotic variance determined by their influence function. In order to obtain the RAL estimator with the smallest possible asymptotic variance, we first derive the efficient influence function $\varphi_{k}(O;\beta^{a,b})$ for the parameter $\beta^{a,b}$ in each Model $k$. The efficient estimator $\widehat{\beta}_{k}^{a,b}$ based on data $O_{1},\ldots,O_{n}$ from $n$ independent twin pairs is then found as the solution to the estimating equation $\sum_{i=1}^{n}\varphi_{k}(O_{i};\beta^{a,b})=0$. See Tsiatis [25] for more on efficient influence functions.

The efficient influence function for $\beta^{a,b}=E\big[Y^{a,b}\big]$ in Model 1 is:

$$\begin{aligned}
\varphi_{1}\big(O;\beta^{a,b}\big)=\ &\frac{1}{2}\Bigg\{\frac{\mathds{1}(A_{1}=a,A_{2}=b)}{P(A_{1}=a,A_{2}=b\,|\,C,X_{1},X_{2})}\Big(Y_{1}-E\big[Y_{1}\,|\,A_{1}=a,A_{2}=b,C,X_{1},X_{2}\big]\Big)\\
&\quad+E\big[Y_{1}\,|\,A_{1}=a,A_{2}=b,C,X_{1},X_{2}\big]\Bigg\}\\
+\ &\frac{1}{2}\Bigg\{\frac{\mathds{1}(A_{2}=a,A_{1}=b)}{P(A_{2}=a,A_{1}=b\,|\,C,X_{1},X_{2})}\Big(Y_{2}-E\big[Y_{2}\,|\,A_{2}=a,A_{1}=b,C,X_{1},X_{2}\big]\Big)\\
&\quad+E\big[Y_{2}\,|\,A_{2}=a,A_{1}=b,C,X_{1},X_{2}\big]\Bigg\}-\beta^{a,b}.
\end{aligned}$$

Proofs are given in Appendix B. In order to use $\varphi_{1}(O;\beta^{a,b})$ to obtain an estimator of $\beta^{a,b}$, we need a model for the joint propensity score $P(A_{j}=a,A_{3-j}=b\mid C,X_{j},X_{3-j})$ and a model for the outcome regression $E[Y_{j}\mid A_{j},A_{3-j},C,X_{j},X_{3-j}]$. The resulting estimator of $\beta^{a,b}$ will be efficient provided that both of these models are correctly specified, and provided that the estimated values converge to the truth at fast enough rates. (See Section 4.2 for more discussion of rates.) Suppose this is the case. Let $\hat{\pi}_{a,b}(C_{i},X_{ij},X_{i,3-j})$ be the predicted joint propensity score for a twin pair with covariates $C_{i},X_{ij},X_{i,3-j}$, and let $\hat{\mu}_{j,a,b}(C_{i},X_{ij},X_{i,3-j})$ be the predicted outcome regression for Twin $j$ in a twin pair with covariates $C_{i},X_{ij},X_{i,3-j}$ if the twins' exposures were set to $A_{j}=a, A_{3-j}=b$. Then the efficient estimator for $\beta^{a,b}$ in Model 1, based on data $O_{1},\ldots,O_{n}$, is given by:

\begin{align*}
\widehat{\beta}^{a,b}_{1} = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{2}\Bigg\{ & \frac{\mathds{1}(A_{i1}=a,A_{i2}=b)}{\hat{\pi}_{a,b}(C_{i},X_{i1},X_{i2})}\Big(Y_{i1}-\hat{\mu}_{1,a,b}(C_{i},X_{i1},X_{i2})\Big)+\hat{\mu}_{1,a,b}(C_{i},X_{i1},X_{i2}) \\
& + \frac{\mathds{1}(A_{i1}=b,A_{i2}=a)}{\hat{\pi}_{a,b}(C_{i},X_{i2},X_{i1})}\Big(Y_{i2}-\hat{\mu}_{2,a,b}(C_{i},X_{i2},X_{i1})\Big)+\hat{\mu}_{2,a,b}(C_{i},X_{i2},X_{i1})\Bigg\}.
\end{align*}

The estimator $\widehat{\beta}_{1}^{a,b}$ is the augmented inverse probability weighted (AIPW) estimator for a bivariate exposure. It is a doubly robust estimator [20]: as long as at least one of the propensity score model and the outcome regression model is correctly specified, $\widehat{\beta}_{1}^{a,b}$ remains a consistent estimator of $\beta^{a,b}$, even if the other model is misspecified. Liu et al. [14] presented three doubly robust estimators of average main effects and average spillover effects in groups of arbitrary size, working in the analog of Model 1; our estimator $\widehat{\beta}_{1}^{a,b}$ corresponds to one of their estimators specialized to groups of size 2. Our contribution as regards Model 1 is to prove that this estimator is semiparametric efficient in Model 1, thus providing a partial answer to a question posed by Liu et al. [14]. To the best of our knowledge, Model 2 is distinct from other models that have been considered in the interference literature.
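To make the form of $\widehat{\beta}_{1}^{a,b}$ concrete, the sketch below computes it on simulated twin pairs, plugging in the true propensity score and outcome regression in place of fitted models. The data-generating mechanism (one shared covariate $C$, one individual covariate per twin, a main effect of 2 and a spillover effect of 0.5) is hypothetical and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000  # large n so Monte Carlo error is small

# Hypothetical mechanism: one shared covariate C, one individual covariate per twin.
C = rng.normal(size=n)
X1 = rng.normal(size=n)
X2 = rng.normal(size=n)

def expit(z):
    return 1.0 / (1.0 + np.exp(-z))

# Each twin's exposure depends on the shared and their own covariate.
p1 = expit(0.4 * C + 0.6 * X1)  # P(A1 = 1 | C, X1)
p2 = expit(0.4 * C + 0.6 * X2)  # P(A2 = 1 | C, X2)
A1 = rng.binomial(1, p1)
A2 = rng.binomial(1, p2)

def mu(a, b, C, X_own):
    # True outcome regression: main effect 2, spillover effect 0.5.
    return 1.0 + 2.0 * a + 0.5 * b + C + X_own

Y1 = mu(A1, A2, C, X1) + rng.normal(size=n)
Y2 = mu(A2, A1, C, X2) + rng.normal(size=n)

def beta1_hat(a, b):
    """AIPW estimator beta_hat_1^{a,b}: average, over both twin roles, of the
    inverse-probability-weighted residual plus the predicted outcome regression."""
    total = 0.0
    for A_own, A_co, Y, p_own, p_co, X_own in [
        (A1, A2, Y1, p1, p2, X1),
        (A2, A1, Y2, p2, p1, X2),
    ]:
        # Joint propensity P(A_own=a, A_co=b | C, X1, X2); here the
        # exposures are conditionally independent, so it factors.
        pi = (p_own if a else 1 - p_own) * (p_co if b else 1 - p_co)
        ind = (A_own == a) & (A_co == b)
        m = mu(a, b, C, X_own)
        total += 0.5 * np.mean(ind / pi * (Y - m) + m)
    return total

beta_00 = beta1_hat(0, 0)  # estimand beta^{0,0} = 1.0 in this mechanism
beta_01 = beta1_hat(0, 1)  # estimand beta^{0,1} = 1.5 in this mechanism
print(beta_00, beta_01, beta_00 - beta_01)  # last value: spillover, truth -0.5
```

Up to Monte Carlo error, the difference $\widehat{\beta}^{0,0}_{1}-\widehat{\beta}^{0,1}_{1}$ recovers the spillover effect of $-0.5$; in practice the nuisance functions would of course be replaced by fitted models, as in Section 5.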

If the true distribution lies inside the smaller Model 2, a more efficient estimator than $\widehat{\beta}_{1}^{a,b}$ is possible. The efficient influence function for $\beta^{a,b}$ in Model 2 is:

\begin{align*}
\varphi_{2}\big(O;\beta^{a,b}\big) =\ & \frac{1}{2}\Bigg\{\frac{\mathds{1}(A_{1}=a,A_{2}=b)}{P(A_{1}=a\mid C,X_{1})\,P(A_{2}=b\mid C,X_{1})}\Big(Y_{1}-E\big[Y_{1}\mid A_{1}=a,A_{2}=b,C,X_{1}\big]\Big) \\
& \qquad + E\big[Y_{1}\mid A_{1}=a,A_{2}=b,C,X_{1}\big]\Bigg\} \\
& + \frac{1}{2}\Bigg\{\frac{\mathds{1}(A_{2}=a,A_{1}=b)}{P(A_{2}=a\mid C,X_{2})\,P(A_{1}=b\mid C,X_{2})}\Big(Y_{2}-E\big[Y_{2}\mid A_{2}=a,A_{1}=b,C,X_{2}\big]\Big) \\
& \qquad + E\big[Y_{2}\mid A_{2}=a,A_{1}=b,C,X_{2}\big]\Bigg\} - \beta^{a,b}.
\end{align*}

In order to use $\varphi_{2}(O;\beta^{a,b})$ to obtain an estimator of $\beta^{a,b}$, we need a model for the outcome regression $E[Y_{j}\mid A_{j},A_{3-j},C,X_{j}]$; a model for the propensity score $P(A_{j}=1\mid C,X_{j})$, which relates Twin $j$'s exposure to their own covariates; and a model for the propensity score $P(A_{j}=1\mid C,X_{3-j})$, which relates Twin $j$'s exposure to their co-twin's covariates. The resulting estimator of $\beta^{a,b}$ will be efficient provided that all three of these models are correctly specified, and provided that the estimated values converge to the truth at fast enough rates. Suppose this is the case. Let $\hat{\pi}_{j,a}(C_{i},X_{ij})$ be the predicted value of the propensity score $P(A_{j}=a\mid C,X_{j})$ for a twin pair with covariates $C_{i},X_{ij}$; let $\hat{\theta}_{j,a}(C_{i},X_{i,3-j})$ be the predicted value of the propensity score $P(A_{j}=a\mid C,X_{3-j})$ for a twin pair with covariates $C_{i},X_{i,3-j}$; and let $\hat{\mu}_{j,a,b}(C_{i},X_{ij})$ be the predicted outcome regression for Twin $j$ in a twin pair with covariates $C_{i},X_{ij}$ if the twins' exposures were set to $A_{j}=a, A_{3-j}=b$. Then the efficient estimator for $\beta^{a,b}$ in Model 2, based on data $O_{1},\ldots,O_{n}$, is given by:

\begin{align*}
\widehat{\beta}^{a,b}_{2} = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{2}\Bigg\{ & \frac{\mathds{1}(A_{i1}=a,A_{i2}=b)}{\hat{\pi}_{1,a}(C_{i},X_{i1})\,\hat{\theta}_{2,b}(C_{i},X_{i1})}\Big(Y_{i1}-\hat{\mu}_{1,a,b}(C_{i},X_{i1})\Big)+\hat{\mu}_{1,a,b}(C_{i},X_{i1}) \\
& + \frac{\mathds{1}(A_{i1}=b,A_{i2}=a)}{\hat{\pi}_{2,a}(C_{i},X_{i2})\,\hat{\theta}_{1,b}(C_{i},X_{i2})}\Big(Y_{i2}-\hat{\mu}_{2,a,b}(C_{i},X_{i2})\Big)+\hat{\mu}_{2,a,b}(C_{i},X_{i2})\Bigg\}.
\end{align*}

The estimator $\widehat{\beta}_{2}^{a,b}$ is also doubly robust: it is consistent if (i) both propensity score models are correct, even if the outcome regression model is incorrect, or (ii) the outcome regression model is correct, even if one or both propensity score models are incorrect. We show the double robustness of both estimators in Appendix C, and we also illustrate this property in the simulation study in Section 6.
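As a sketch of $\widehat{\beta}_{2}^{a,b}$ and its double robustness, the code below simulates pairs from a hypothetical mechanism under which Model 2 holds: each exposure depends on $(C, X_j)$ only, so the co-twin propensity $\theta_{j,a}(C,X_{3-j})$ reduces to a function of $C$ alone, which we evaluate by Gauss-Hermite quadrature. The estimator is computed once with the true outcome regression and once with a deliberately wrong one ($\hat{\mu}\equiv 0$), illustrating case (i) of the double robustness; all functions and parameter values are illustrative assumptions, not the fitted models of Section 5.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

C = rng.normal(size=n)
X1 = rng.normal(size=n)
X2 = rng.normal(size=n)

def expit(z):
    return 1.0 / (1.0 + np.exp(-z))

p1 = expit(0.4 * C + 0.6 * X1)  # pi_{1,1}(C, X1) = P(A1 = 1 | C, X1)
p2 = expit(0.4 * C + 0.6 * X2)  # pi_{2,1}(C, X2) = P(A2 = 1 | C, X2)
A1 = rng.binomial(1, p1)
A2 = rng.binomial(1, p2)

def mu(a, b, C, X_own):
    # True outcome regression E[Y_j | A_j=a, A_{3-j}=b, C, X_j].
    return 1.0 + 2.0 * a + 0.5 * b + C + X_own

Y1 = mu(A1, A2, C, X1) + rng.normal(size=n)
Y2 = mu(A2, A1, C, X2) + rng.normal(size=n)

# theta_{j,1}(C, X_{3-j}) = P(A_j=1 | C, X_{3-j}) = E_X[expit(0.4 C + 0.6 X)]
# with X ~ N(0,1), since A_j is independent of the co-twin's covariate given C
# here.  Evaluate the normal integral by Gauss-Hermite quadrature.
nodes, wts = np.polynomial.hermite.hermgauss(40)
theta1 = expit(0.4 * C[:, None] + 0.6 * np.sqrt(2.0) * nodes[None, :]) @ wts / np.sqrt(np.pi)

def beta2_hat(a, b, use_true_mu=True):
    """Model 2 efficient estimator; use_true_mu=False misspecifies the outcome
    regression (mu_hat = 0) while keeping both propensity score models correct."""
    total = 0.0
    for A_own, A_co, Y, p_own, X_own in [
        (A1, A2, Y1, p1, X1),
        (A2, A1, Y2, p2, X2),
    ]:
        pi = p_own if a else 1 - p_own       # P(A_own = a | C, X_own)
        theta = theta1 if b else 1 - theta1  # P(A_co = b | C, X_own): depends on C only
        ind = (A_own == a) & (A_co == b)
        m = mu(a, b, C, X_own) if use_true_mu else np.zeros(n)
        total += 0.5 * np.mean(ind / (pi * theta) * (Y - m) + m)
    return total

sp_true_mu = beta2_hat(0, 0) - beta2_hat(0, 1)                  # truth: -0.5
sp_wrong_mu = beta2_hat(0, 0, False) - beta2_hat(0, 1, False)   # still near -0.5
print(sp_true_mu, sp_wrong_mu)
```

With the outcome regression set to zero the estimator collapses to pure inverse probability weighting, yet it remains close to the true spillover effect because both propensity score models are correct, which is exactly case (i) above.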

Efficient influence functions of average main effects and average spillover effects are obtained as differences of efficient influence functions of the $\beta^{a,b}$ parameters. For example, the efficient influence function of the spillover effect $\beta_{sp}=\beta^{0,0}-\beta^{0,1}$ in Model $k$ is $\varphi_{k}(O;\beta^{0,0})-\varphi_{k}(O;\beta^{0,1})$. Similarly, the efficient estimator of $\beta_{sp}$ in Model $k$ is $\widehat{\beta}^{0,0}_{k}-\widehat{\beta}^{0,1}_{k}$.

4.2 Confidence intervals

Here we consider confidence intervals, based on the efficient estimators presented above, for a parameter $\beta^{a,b}=E[Y^{a,b}]$ or for a linear combination of such parameters, such as an average spillover effect or average main effect.

Denote the parameter of interest by $\beta$. Let $\psi_{k}(O;\beta,\nu)$ denote the efficient influence function of $\beta$ in Model $k$, where $\nu$ collects the parameters of the propensity score models and outcome regression model, which must be estimated in order to obtain the efficient estimator of $\beta$. If the parameters $\nu$ are estimated consistently at fast enough rates, the resulting estimator of $\beta$ will be asymptotically normal at the root-$n$ rate, so we may use a Wald-type confidence interval for $\beta$. Estimating $\nu$ at rates faster than $n^{-1/4}$ (that is, using an estimator $\widehat{\nu}$ such that $n^{1/4+\epsilon}(\widehat{\nu}-\nu)$ is bounded in probability for some $\epsilon>0$) is sufficient. However, this can be relaxed somewhat because our estimator of $\beta$ is doubly robust: the bias of a doubly robust estimator is of the order of the product of the propensity score estimation error and the outcome regression estimation error, so it suffices to estimate the propensity scores and the outcome regression at rates whose product is $o(n^{-1/2})$. Suppose this is the case, and let $\widehat{\beta}_{k,eff}$ be the solution of the estimating equation $\sum_{i=1}^{n}\psi_{k}(O_{i};\beta,\widehat{\nu})=0$. Because of the assumption on the rate of convergence of $\widehat{\nu}$, the influence function of $\widehat{\beta}_{k,eff}$ is the efficient influence function $\psi_{k}(O;\beta,\nu)$ evaluated at the true value of the nuisance parameters $\nu$. Therefore $\widehat{\beta}_{k,eff}$ is the efficient estimator of $\beta$, and its asymptotic variance is the variance of $\psi_{k}(O;\beta,\nu)$.

Thus in large samples, the variance of $\widehat{\beta}_{k,eff}$ is approximately $\frac{1}{n}\mathrm{Var}\big(\psi_{k}(O;\beta,\nu)\big)=\frac{1}{n}E\big[\psi_{k}(O;\beta,\nu)^{2}\big]$. A consistent estimator of $\mathrm{Var}\big(\psi_{k}(O;\beta,\nu)\big)$ is $\frac{1}{n}\sum_{i=1}^{n}\psi_{k}\big(O_{i};\widehat{\beta}_{k,eff},\hat{\nu}\big)^{2}$. Therefore, a large-sample 95% Wald-type confidence interval for $\beta$ is given by:

\[
\widehat{\beta}_{k,eff}\ \pm\ 1.96\sqrt{\frac{\sum_{i=1}^{n}\psi_{k}\big(O_{i};\widehat{\beta}_{k,eff},\hat{\nu}\big)^{2}}{n^{2}}}\,.
\]

Alternatively, we may use a bootstrap-based confidence interval, which may have better coverage than the Wald-type confidence interval at moderate sample sizes.
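As a sketch, both intervals can be computed by evaluating the influence function at each pair. The code below does this for the spillover effect $\beta_{sp}$ under a hypothetical data-generating mechanism, with the true nuisance functions plugged in and the Model 1 influence function; the percentile bootstrap interval resamples whole pairs. Everything here is illustrative, not the analysis of Section 5.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2_000

# Hypothetical mechanism (true spillover effect: -0.5).
C = rng.normal(size=n)
X1 = rng.normal(size=n)
X2 = rng.normal(size=n)

def expit(z):
    return 1.0 / (1.0 + np.exp(-z))

p1 = expit(0.4 * C + 0.6 * X1)
p2 = expit(0.4 * C + 0.6 * X2)
A1 = rng.binomial(1, p1)
A2 = rng.binomial(1, p2)

def mu(a, b, C, X_own):
    return 1.0 + 2.0 * a + 0.5 * b + C + X_own

Y1 = mu(A1, A2, C, X1) + rng.normal(size=n)
Y2 = mu(A2, A1, C, X2) + rng.normal(size=n)

def pair_terms(a, b, idx):
    """Per-pair contribution g_i^{a,b} (influence function plus beta^{a,b})
    for the pairs indexed by idx, using the true nuisance functions."""
    g = np.zeros(len(idx))
    for A_own, A_co, Y, p_own, p_co, X_own in [
        (A1, A2, Y1, p1, p2, X1),
        (A2, A1, Y2, p2, p1, X2),
    ]:
        pi = (p_own[idx] if a else 1 - p_own[idx]) * (p_co[idx] if b else 1 - p_co[idx])
        ind = (A_own[idx] == a) & (A_co[idx] == b)
        m = mu(a, b, C[idx], X_own[idx])
        g += 0.5 * (ind / pi * (Y[idx] - m) + m)
    return g

idx = np.arange(n)
g_sp = pair_terms(0, 0, idx) - pair_terms(0, 1, idx)  # per-pair spillover terms
beta_sp = g_sp.mean()

# Wald interval: psi_i = g_i - beta_sp are the influence function values.
half = 1.96 * np.sqrt(np.sum((g_sp - beta_sp) ** 2) / n**2)
wald_ci = (beta_sp - half, beta_sp + half)

# Percentile bootstrap: resample whole twin pairs.
boot = np.array([
    np.mean(pair_terms(0, 0, s) - pair_terms(0, 1, s))
    for s in (rng.integers(0, n, size=n) for _ in range(1000))
])
boot_ci = (np.quantile(boot, 0.025), np.quantile(boot, 0.975))
print(beta_sp, wald_ci, boot_ci)
```

At this moderate sample size the two intervals are typically similar; with fitted rather than true nuisance models, the bootstrap should resample pairs and refit the nuisance models within each replicate.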

5 Application: spillover effects of alcohol in early adolescence

Early use of alcohol is often associated with substance use and dependence in adulthood [4, 3]. Recently, Irons et al. [7] used a twin design to assess whether this association is due in part to a causal relationship, and found evidence of a causal effect of twins' early alcohol use on their own adult substance use and antisocial behavior outcomes. Here we consider the same data from the Minnesota Twin Family Study; but rather than focusing on the effect of twins' exposures on their own outcomes, we investigate whether twins' early alcohol use has a causal impact on their co-twins' outcomes. That is, our primary question of interest is whether there is evidence of a nonzero spillover effect.

In the Minnesota Twin Family Study, pairs of same-sex twins were contacted for an intake assessment, at which baseline covariates were collected, at target age 11; exposure status was assessed at target age 14; and outcomes were measured at target age 24. The exposure we consider here is an indicator of whether the subject had ever consumed alcohol without their parents' permission by the time of the exposure interview. The outcome on which we focus is Drinking Index, a composite measure of adult drinking frequency and amount. The shared covariates we include are severity of parent alcohol abuse, severity of parent drug abuse, parents' occupation level, and the sex and zygosity of the twins. The individual covariates we include are a measure of academic motivation, number of externalizing disorder symptoms (such as symptoms of attention deficit/hyperactivity disorder or conduct disorder), amount of conflict with parents, and actual age at the time of the exposure assessment. Here we use only the complete-case pairs, which account for 511 of the 761 pairs of twins in the dataset; future work is needed to develop methods for handling missing data in settings where there is interference within pairs.

Our primary target of inference is the spillover effect $\beta_{sp}=E[Y^{0,0}-Y^{0,1}]$, which is the mean difference in twins' outcomes we would see if we could intervene to change all co-twins from exposed to unexposed, while keeping all twins unexposed. We also estimate the main effect $\beta_{main}=E[Y^{0,0}-Y^{1,0}]$, the mean difference in twins' outcomes we would see if we could intervene to change all twins from exposed to unexposed, while keeping all co-twins unexposed. We estimate $\beta_{sp}$ and $\beta_{main}$ using the efficient estimators that we derived in Section 4. For the Model 1 estimator, we use generalized additive models to model the outcome regression $E[Y_{j}\mid A_{j},A_{3-j},C,X_{j},X_{3-j}]$. In order to model the joint propensity score $P(A_{1},A_{2}\mid C,X_{1},X_{2})$, we first use generalized additive models to fit the marginal distributions $P(A_{1}\mid C,X_{1})$ and $P(A_{2}\mid C,X_{2})$, then model the association between the exposures via a Dale model [2], allowing the strength of the association to vary by sex and zygosity. For the Model 2 estimator, we use generalized additive models to model the outcome regression $E[Y_{j}\mid A_{j},A_{3-j},C,X_{j}]$, the propensity score $P(A_{j}\mid C,X_{j})$, and the propensity score $P(A_{j}\mid C,X_{3-j})$. Details are given in Appendix D.

Results are shown in Table 2. Using either of the two estimators, there is evidence of a nonzero spillover effect of twins' use of alcohol in early adolescence on their co-twins' adult drinking behavior; that is, there is evidence of interference between twins in this study. A spillover effect of $-1.5$ (the point estimate using the Model 1 estimator) would indicate that twins' Drinking Index outcomes would be an average of 1.5 points lower if all twins and all co-twins were unexposed to alcohol in early adolescence, compared to all twins being unexposed with their co-twins exposed. The Drinking Index outcome ranges from 0 to 20, with mean 10.7 and standard deviation 4.0. Larger values correspond to more frequent and/or greater amounts of drinking, so a decrease of 1.5 points would represent a moderate benefit. There is also evidence of a nonzero main effect. A main effect of $-2.1$ points would indicate that twins' Drinking Index outcomes would be an average of 2.1 points lower if all twins changed from exposed to unexposed, while their co-twins remained unexposed.

Our estimators for $\beta_{sp}$ and $\beta_{main}$ are based on the assumption of no unmeasured confounding due to shared or individual factors. This assumption is untestable: we have no means of determining that it holds in our study, and if it does not in fact hold, our results could be invalidated. However, we can check for one specific way in which the assumption may fail, through comparison of monozygotic (MZ) and dizygotic (DZ) twins. If there is no unmeasured confounding due to shared genetics or other unmeasured factors that are differential between MZ and DZ twins, then, since the distribution of measured covariates is well balanced across these two groups, the average spillover effect in the population of MZ twins, $\beta_{sp,MZ}$, should equal the average spillover effect in the population of DZ twins, $\beta_{sp,DZ}$. Therefore, evidence that the average spillover effects differ in these two populations would be evidence of unmeasured confounding due to shared genetics.

| | | Spillover effect $\beta_{sp}$ | Main effect $\beta_{main}$ | CTC effect $\beta_{W}$ |
|---|---|---|---|---|
| All twins (n=510) | Model 1 | -1.470 (-2.318, -0.716) | -2.093 (-3.251, -0.738) | 0.730 (0.064, 1.372) |
| | Model 2 | -1.599 (-2.318, -0.762) | -2.298 (-3.356, -0.823) | |
| MZ twins (n=324) | Model 1 | -1.671 (-2.796, -0.640) | -2.131 (-3.422, -1.094) | 0.577 (-0.224, 1.347) |
| | Model 2 | -1.931 (-2.821, -0.705) | -2.424 (-3.499, -1.123) | |
| DZ twins (n=186) | Model 1 | -1.039 (-2.523, 0.118) | -1.952 (-3.711, -0.856) | 1.016 (-0.106, 2.155) |
| | Model 2 | -0.957 (-2.245, 0.210) | -2.084 (-3.784, -0.890) | |

Table 2: Effect estimates for the MTFS data. The third and fourth columns show point estimates, and 95% confidence intervals based on 5,000 bootstrap samples, for the spillover effect $\beta_{sp}=E[Y^{0,0}-Y^{0,1}]$ and the main effect $\beta_{main}=E[Y^{0,0}-Y^{1,0}]$, using the efficient estimators for Model 1 and for Model 2. The fifth column shows point estimates, and 95% confidence intervals based on 5,000 bootstrap samples, for the co-twin control effect $\beta_{W}=E[Y^{0,1}-Y^{1,0}\mid A_{1}\neq A_{2},X_{1}=X_{2}]$, estimated by fitting a between-within regression model. All are shown using data on all twins, on MZ twins only, and on DZ twins only.

The point estimates are consistent with a somewhat stronger spillover effect among MZ twins than among DZ twins; however, a 95% confidence interval for the parameter $\beta_{sp,MZ}-\beta_{sp,DZ}$ is $(-2.25, 1.84)$, so the data do not provide evidence that $\beta_{sp,MZ}-\beta_{sp,DZ}$ is different from zero. Thus, comparison of the MZ and DZ subgroups does not suggest that there is unmeasured confounding due to shared genetics or other factors that are differential between MZ and DZ twins. However, this subgroup comparison does not provide a means of assessing whether other types of unmeasured confounding may be present. Unmeasured shared factors such as peer group or attributes of the twins' school or community, and unmeasured individual factors, could still invalidate our results if they are not captured by measured covariates.

We also fit a between-within regression model and report the within-pair coefficient $\beta_{W}$. The causal interpretation of $\beta_{W}$ is valid even if there are shared unmeasured confounders, provided that there is no unmeasured confounding due to individual factors. In Section 3.1 we saw that, in settings where there is interference between twins, the causal interpretation of $\beta_{W}$ is the spillover effect minus the main effect on a subset of the group of exposure-discordant twins. If the estimates of $\beta_{W}$ were not consistent with the estimates of $\beta_{sp}-\beta_{main}$, this could be an indication that the estimates of $\beta_{sp}$ and $\beta_{main}$ were invalid due to shared confounders (though it could also be an indication that the subgroup effect $\beta_{W}$ does not generalize to the whole population of twins). In our case the estimates are consistent with each other.

Irons et al. [7] estimated the effect of adolescent alcohol use on adult drinking behavior using two approaches, propensity score weighting and the co-twin control method, and found that the results using the co-twin control method were attenuated compared to the first method. They pointed out that one possible reason for this attenuation is that unmeasured shared confounders could be creating bias in the propensity score estimates. Our results suggest another possible explanation: interference. If there is interference between twins in this study, then the parameter estimated using the co-twin control method is not simply the effect of twins' exposure on their own outcome, but the difference between the spillover effect of their co-twins' exposure and this effect.

6 Simulation study

Here we evaluate the finite-sample performance of the estimators derived in Section 4, using simulated data with sample size $n=500$ designed to mimic the data from the Minnesota Twin Family Study (MTFS). We consider two data-generating mechanisms. Under the first, the larger Model 1 holds but the smaller Model 2 does not; under the second, both models hold. We generate the covariates in the simulated data by resampling covariates from the MTFS dataset. We then generate the exposures $A_{1},A_{2}$ from a joint distribution of exposures given covariates based on modeling of the MTFS data, and generate the outcomes $Y_{1},Y_{2}$ from a joint distribution of outcomes given exposures and covariates, also based on modeling of the MTFS data. Details of the data-generating mechanisms are given in Appendix D.

For each data-generating mechanism, we estimate the spillover effect $\beta_{sp}=E[Y^{0,0}-Y^{0,1}]$ using each of five estimators. Under the first data-generating mechanism, we estimate $\beta_{sp}$ using the efficient estimator for Model 1 with both the joint propensity score model and the outcome regression model correctly specified; the Model 1 estimator with one but not both of these correctly specified; the Model 1 estimator with both misspecified; and the Model 2 efficient estimator with the outcome regression and $P(A_{j}\mid C,X_{j})$ correctly specified. Table 3 shows the empirical bias of each estimator in $R=5{,}000$ simulations; the empirical variance of each estimator; the mean of the influence function-based variance estimates; the mean of the bootstrap variance estimates based on $M=1{,}000$ bootstrap replicates within each simulation; the coverage of the 95% Wald confidence intervals for $\beta_{sp}$ using the influence function-based variance estimates; and the coverage of the percentile bootstrap 95% confidence intervals. As expected, the bias of the correctly specified Model 1 estimator $\hat{\beta}_{1}$ is low. Also as expected, since $\hat{\beta}_{1}$ is a doubly robust estimator, bias remains low even when either the propensity score model or the outcome regression model is misspecified. Coverage probabilities of the percentile bootstrap confidence intervals are very close to the nominal 95% level, while the influence function-based variance estimates are slightly anti-conservative at this sample size, resulting in coverage probabilities that are somewhat lower.

| | Bias | Var | IF-Var Est | Wald Cov'g | Btstp-Var Est | Btstp Cov'g |
|---|---|---|---|---|---|---|
| $\hat{\beta}^{1}$ | -0.00658 | 0.11390 | 0.09886 | 92.96 | 0.12582 | 95.12 |
| $\hat{\beta}^{2}$ | 0.00320 | 0.11587 | 0.07195 | 87.50 | 0.13357 | 95.68 |
| $\hat{\beta}^{1}_{wr.prop.}$ | 0.00042 | 0.11076 | 0.09971 | 93.44 | 0.12522 | 95.10 |
| $\hat{\beta}^{1}_{wr.outc.}$ | -0.02243 | 0.11766 | 0.11348 | 94.12 | 0.11136 | 94.70 |
| $\hat{\beta}^{1}_{wr.both}$ | -0.04121 | 0.11736 | 0.11618 | 94.56 | 0.11400 | 94.60 |

Table 3: Simulation results for $R=5{,}000$ simulations at sample size $n=500$ under a data-generating mechanism where Model 1 is correctly specified but Model 2 is not. Shown are the empirical bias and empirical variance of each estimator; the mean of the 5,000 influence function-based variance estimates; the mean of the 5,000 bootstrap-based variance estimates, based on $M=1{,}000$ bootstraps for each simulation; and coverage probabilities for the Wald confidence interval using the IF-based variance estimate and for the percentile bootstrap confidence interval.
| | Bias | Var | IF-Var Est | Wald Cov'g | Btstp-Var Est | Btstp Cov'g |
|---|---|---|---|---|---|---|
| $\hat{\beta}^{1}$ | 0.00719 | 0.08526 | 0.07612 | 93.24 | 0.08967 | 95.30 |
| $\hat{\beta}^{2}$ | 0.00576 | 0.08273 | 0.07427 | 93.46 | 0.08567 | 95.14 |
| $\hat{\beta}^{2}_{wr.prop.}$ | -0.01473 | 0.08232 | 0.07171 | 92.88 | 0.08375 | 94.82 |
| $\hat{\beta}^{2}_{wr.outc.}$ | -0.00005 | 0.08710 | 0.08372 | 94.86 | 0.08263 | 95.00 |
| $\hat{\beta}^{2}_{wr.both}$ | -0.00453 | 0.08716 | 0.08253 | 94.20 | 0.08231 | 94.50 |

Table 4: Simulation results for $R=5{,}000$ simulations at sample size $n=500$ under a data-generating mechanism where both Model 1 and Model 2 are correctly specified. Shown are the empirical bias and empirical variance of each estimator; the mean of the 5,000 influence function-based variance estimates; the mean of the 5,000 bootstrap-based variance estimates, based on $M=1{,}000$ bootstraps for each simulation; and coverage probabilities for the Wald confidence interval using the IF-based variance estimate and for the percentile bootstrap confidence interval.

Under the second data-generating mechanism, we estimate $\beta_{sp}$ using the efficient estimator for Model 1 with both the joint propensity score model and the outcome regression model correctly specified; the Model 2 efficient estimator with both propensity score models and the outcome regression model all correctly specified; the Model 2 estimator with either the propensity score models or the outcome regression model misspecified; and the Model 2 estimator with all of these misspecified. Results are displayed in Table 4. We expect each of the first four of these estimators to have low bias, as they do. Coverage probabilities are close to the nominal level, especially using the bootstrap confidence intervals.

Additionally, the Model 2 efficient estimator has smaller variance than the Model 1 estimator under the second data-generating mechanism. The theory from Section 4 shows that the Model 2 estimator is asymptotically more efficient than the Model 1 estimator when both models are correct, and here we see improved precision at sample size $n=500$ in this simulation scenario.
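This qualitative comparison can be reproduced in miniature. The sketch below, under an assumed data-generating mechanism in which Model 2 holds and with the true nuisance functions plugged in, runs a small number of replications and compares the empirical spread of the two efficient estimators of the spillover effect; it is illustrative only, and with few replications the estimated variances are noisy, so the gap between the two estimators may appear modest.

```python
import numpy as np

rng = np.random.default_rng(3)
n, R = 1_000, 200  # pairs per replication, number of replications

def expit(z):
    return 1.0 / (1.0 + np.exp(-z))

# Gauss-Hermite setup for theta(C) = E_X[expit(0.4 C + 0.6 X)], X ~ N(0,1).
nodes, wts = np.polynomial.hermite.hermgauss(40)

def one_replication():
    C = rng.normal(size=n)
    X1 = rng.normal(size=n)
    X2 = rng.normal(size=n)
    p1 = expit(0.4 * C + 0.6 * X1)
    p2 = expit(0.4 * C + 0.6 * X2)
    A1 = rng.binomial(1, p1)
    A2 = rng.binomial(1, p2)
    mu = lambda a, b, X: 1.0 + 2.0 * a + 0.5 * b + C + X  # spillover: 0.5
    Y1 = mu(A1, A2, X1) + rng.normal(size=n)
    Y2 = mu(A2, A1, X2) + rng.normal(size=n)
    theta = expit(0.4 * C[:, None] + 0.6 * np.sqrt(2.0) * nodes[None, :]) @ wts / np.sqrt(np.pi)

    def b1(a, b):  # Model 1 efficient estimator, true nuisances
        out = 0.0
        for Ao, Ac, Y, po, pc, Xo in [(A1, A2, Y1, p1, p2, X1), (A2, A1, Y2, p2, p1, X2)]:
            pi = (po if a else 1 - po) * (pc if b else 1 - pc)
            m = mu(a, b, Xo)
            out += 0.5 * np.mean(((Ao == a) & (Ac == b)) / pi * (Y - m) + m)
        return out

    def b2(a, b):  # Model 2 efficient estimator, true nuisances
        out = 0.0
        for Ao, Ac, Y, po, Xo in [(A1, A2, Y1, p1, X1), (A2, A1, Y2, p2, X2)]:
            pi = (po if a else 1 - po) * (theta if b else 1 - theta)
            m = mu(a, b, Xo)
            out += 0.5 * np.mean(((Ao == a) & (Ac == b)) / pi * (Y - m) + m)
        return out

    return b1(0, 0) - b1(0, 1), b2(0, 0) - b2(0, 1)

draws = np.array([one_replication() for _ in range(R)])
mean1, mean2 = draws.mean(axis=0)  # both should be near the truth, -0.5
sd1, sd2 = draws.std(axis=0)       # empirical spread of the two estimators
print(mean1, sd1, mean2, sd2)
```

Both estimators are nearly unbiased in this toy setting; a much larger replication count, as in the paper's $R=5{,}000$, would be needed to pin down the efficiency gain reliably.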

7 Discussion

In this paper we have considered the setting of independent pairs of twins in which one twin's exposure may have a causal impact on their co-twin's outcome. Whether interference is present in a given study will depend on the nature of the exposure and the outcome: in some cases, researchers may be able to rule out between-twin interference at the outset based on their knowledge of the domain, while for many behavioral exposures and outcomes, interference is likely to be a possibility. In the latter settings, researchers may allow for possible interference and estimate the spillover effect of twins' exposures on their co-twins' outcomes using the estimators we have presented here; evidence of a nonzero spillover effect would be evidence of interference. We have also highlighted the impact that between-twin interference would have for researchers using the co-twin control method: when there is no interference, this method estimates the causal effect of twins' exposures on their outcomes. When there is interference, however, it estimates the difference of two effects: the spillover effect of the co-twins' exposures on the twins' outcomes, minus the main effect of twins' exposures on their own outcomes.

Under the assumption of no unmeasured confounding, we derived the semi-parametric efficient estimators of key causal effects for the interference setting, including average main effects and average spillover effects. We applied our estimators to data from the Minnesota Twin Family Study (MTFS), and found evidence that twins’ exposure to alcohol in early adolescence may have a spillover effect on their co-twins’ drinking behavior in adulthood. However, if there are genetic or other shared or individual unmeasured factors impacting both the choice to drink in early adolescence, and drinking behavior in adulthood (after controlling for measured covariates), this could invalidate our results. Comparing the causal effects among MZ twin pairs versus among DZ twin pairs did not yield evidence of unmeasured confounders due to shared genetics in the MTFS data. The development of sensitivity analyses, showing how a range of different strengths of unmeasured confounding would impact results in the interference setting, would be valuable future work.

Future work towards addressing missing data in this setting is also needed. As with the MTFS data, there may be missingness in baseline covariates, exposures, and outcomes, and the interference structure leads to some challenges in addressing missing data. An imputation approach, for example, should be designed in a way which is compatible with the different models to be fit for construction of the estimators; how best to do this for the Model 2 estimator under minimal modeling assumptions is an open research question.

A final direction of future research involves leveraging symmetry. Throughout, we randomly labeled the twins in each pair as Twin 1 and Twin 2. Because this labeling is random, it follows that there is symmetry between the population of twins who are labeled as Twin 1, and the population of twins who are labeled as Twin 2. In our models we did not explicitly make an assumption of symmetry, and our estimators therefore apply not only to twins data, but to any setting of independent pairs where there is possible interference between partners. However, in the twins setting, leveraging such a symmetry assumption could lead to additional efficiency gains, and in future work we plan to derive efficient estimators for models which do impose an assumption of symmetry between the twins.

Funding

This work was supported by [ONR grant N00014-18-1-2760 to E.O. and B.S.]; and [R37-AA009367 to M.M.].

Acknowledgments

Conflict of Interest: None declared.

References

  • [1] Peter M. Aronow and Cyrus Samii. Estimating average causal effects under general interference, with application to a social network experiment. The Annals of Applied Statistics, 11(4):1912–1947, 2017.
  • [2] Jocelyn R. Dale. Global cross-ratio models for bivariate, discrete, ordered responses. Biometrics, 42(4):909–917, 1986.
  • [3] Bridget F. Grant and Deborah A. Dawson. Age at onset of alcohol use and its association with DSM-IV alcohol abuse and dependence: results from the National Longitudinal Alcohol Epidemiologic Survey. Journal of Substance Abuse, 9:103–110, 1997.
  • [4] Julia D. Grant, Jeffrey F. Scherrer, Michael T. Lynskey, Michael J. Lyons, Seth A. Eisen, Ming T. Tsuang, William R. True, and Kathleen K. Bucholz. Adolescent alcohol use is a risk factor for adult alcohol and drug dependence: evidence from a twin design. Psychological Medicine, 36(1):109–118, 2006.
  • [5] Guanglei Hong and Stephen W Raudenbush. Evaluating kindergarten retention policy. Journal of the American Statistical Association, 101(475):901–910, 2006.
  • [6] Michael G Hudgens and M. Elizabeth Halloran. Toward causal inference with interference. Journal of the American Statistical Association, 103(482):832–842, 2008. PMID: 19081744.
  • [7] Daniel E. Irons, William G. Iacono, and Matt McGue. Tests of the effects of adolescent early alcohol exposures on adult outcomes. Addiction, 110(2):269–278, 2015.
  • [8] Wendy Johnson, Eric Turkheimer, Irving I. Gottesman, and Thomas J. Bouchard Jr. Beyond heritability: Twin studies in behavioral research. Current Directions in Psychological Science, 18(4):217–220, 2009.
  • [9] Kenneth S. Kendler and Charles O. Gardner. Dependent stressful life events and prior depressive episodes in the prediction of major depression: The problem of causal inference in psychiatric epidemiology. JAMA Psychiatry, 67(11):1120–1127, 2010.
  • [10] Benjamin B. Lahey and Brian M. D’Onofrio. All in the family: Comparing siblings to test causal hypotheses regarding environmental influences on behavior. Current Directions in Psychological Science, 19(5):319–323, 2010.
  • [11] Brett Laursen, Amy C. Hartl, Frank Vitaro, Mara Brendgen, Ginette Dionne, and Michel Boivin. The spread of substance use and delinquency between adolescent twins. Developmental Psychology, 53(2):329–339, 2017.
  • [12] L. Liu, M. G. Hudgens, and S. Becker-Dreps. On inverse probability-weighted estimators in the presence of interference. Biometrika, 103(4):829–842, 2016.
  • [13] Lan Liu and Michael G. Hudgens. Large sample randomization inference of causal effects in the presence of interference. Journal of the American Statistical Association, 109(505):288–301, 2014. PMID: 24659836.
  • [14] Lan Liu, Michael G. Hudgens, Bradley Saul, John D. Clemens, Mohammad Ali, and Michael E. Emch. Doubly robust estimation in observational studies with partial interference. Stat, 8(1):e214, 2019.
  • [15] Matt McGue, Merete Osler, and Kaare Christensen. Causal inference and observational research: The utility of twins. Perspectives on Psychological Science, 5(5):546–556, 2010.
  • [16] Caleb H. Miles, Maya Petersen, and Mark J. van der Laan. Causal inference when counterfactuals depend on the proportion of all subjects exposed. Biometrics, 75(3):768–777, 2019.
  • [17] Gerald S. Oettinger. Sibling similarity in high school graduation outcomes: Causal interdependency or unobserved heterogeneity? Southern Economic Journal, 66(3):631–648, 2000.
  • [18] Elizabeth L. Ogburn, Oleg Sofrygin, Ivan Diaz, and Mark J. van der Laan. Causal inference for social network data. arXiv e-prints, page arXiv:1705.08527, May 2017.
  • [19] Jonathan D. Schaefer, Terrie E. Moffitt, Louise Arseneault, Andrea Danese, Helen L. Fisher, Renate Houts, Margaret A. Sheridan, Jasmin Wertz, and Avshalom Caspi. Adolescent victimization and early-adult psychopathology: Approaching causal inference using a longitudinal twin study to rule out noncausal explanations. Clinical Psychological Science, 6(3):352–371, 2018. PMID: 29805917.
  • [20] Daniel O. Scharfstein, Andrea Rotnitzky, and James M. Robins. Adjusting for nonignorable drop-out using semiparametric nonresponse models. Journal of the American Statistical Association, 94(448):1096–1120, 1999.
  • [21] Arvid Sjölander, Thomas Frisell, and Sara Öberg. Causal interpretation of between-within models for twin research. Epidemiologic Methods, 1(1):217–237, 2012.
  • [22] Cheryl Slomkowski, Richard Rende, Scott Novak, Elizabeth Lloyd-Richardson, and Raymond Niaura. Sibling effects on smoking in adolescence: evidence for social influence from a genetically informative design. Addiction, 100(4):430–438, 2005.
  • [23] Eric J Tchetgen Tchetgen and Tyler J VanderWeele. On causal inference in the presence of interference. Statistical Methods in Medical Research, 21(1):55–75, 2012. PMID: 21068053.
  • [24] Eric J. Tchetgen Tchetgen, Isabel Fulcher, and Ilya Shpitser. Auto-G-Computation of Causal Effects on a Network. arXiv e-prints, page arXiv:1709.01577, September 2017.
  • [25] Anastasios A. Tsiatis. Semiparametric Theory and Missing Data. Springer, New York, 2006.
  • [26] Mark J. van der Laan. Causal inference for a population of causally connected units. Journal of Causal Inference, 2(1):13–74, 2014.

Appendix A Identification of the co-twin control effect under interference

Here we show identification of the subgroup causal effect $\beta_{ctc}:=E\big[Y^{0,1}-Y^{1,0}\mid A_{1}\neq A_{2},X_{1}=X_{2}=x\big]$ discussed in Section 3.1, under the following assumptions:

\begin{align*}
\textbf{C1}:&\ Y_{j}^{a,b}\perp\!\!\!\perp(A_{1},A_{2})\mid(U,X_{1},X_{2})\\
\textbf{C2}:&\ \mbox{If }A_{j}=a,A_{3-j}=b,\mbox{ then }Y_{j}=Y_{j}^{a,b}\\
\textbf{C3}:&\ \mbox{For all }u,x_{1},x_{2}\mbox{ in the support of }(U,X_{1},X_{2}),\\
&\ \ \ P\big(A_{1}=a,A_{2}=b\mid U=u,X_{1}=x_{1},X_{2}=x_{2}\big)>0\mbox{ for all }a,b=0,1\\
\textbf{C4}:&\ \big(U,X_{1},X_{2},Y_{1}^{a,b}\big)\mbox{ and }\big(U,X_{2},X_{1},Y_{2}^{a,b}\big)\mbox{ are identically distributed}
\end{align*}

C1–C3 are untestable assumptions of exchangeability within levels of the (possibly unobserved) shared factors $U$ and the observed individual covariates $X_{1},X_{2}$; consistency; and positivity. C4 is an assumption of symmetry between the population of twins labeled as Twin 1 and the population labeled as Twin 2, which should hold by design as the labeling is randomly assigned.

Let $D_{x}$ denote the subgroup of twins who are exposure-discordant with their co-twin, but who have the same value $x$ of all individual covariates as their co-twin. Thus $\beta_{ctc}$ is the mean of the contrast $Y^{0,1}-Y^{1,0}$ over all twins in the subgroup $D_{x}$. Note first that this is a weighted average of Twin 1's contrast and Twin 2's contrast in the subgroup of $D_{x}$ where Twin 1 is exposed and the subgroup where Twin 2 is exposed. Specifically, writing $\pi:=P(A_{1}=0,A_{2}=1\mid A_{1}\neq A_{2},X_{1}=X_{2}=x)$, we have:

\begin{align*}
\beta_{ctc}=\ &\frac{1}{2}\Big\{E\big[Y_{1}^{0,1}-Y_{1}^{1,0}\mid A_{1}\neq A_{2},X_{1}=X_{2}=x\big]+E\big[Y_{2}^{0,1}-Y_{2}^{1,0}\mid A_{1}\neq A_{2},X_{1}=X_{2}=x\big]\Big\}\\
=\ &\frac{1}{2}\Big\{\pi\times E\big[Y_{1}^{0,1}-Y_{1}^{1,0}\mid A_{1}=0,A_{2}=1,X_{1}=X_{2}=x\big]\\
&\quad+(1-\pi)\times E\big[Y_{1}^{0,1}-Y_{1}^{1,0}\mid A_{1}=1,A_{2}=0,X_{1}=X_{2}=x\big]\Big\}\\
&+\frac{1}{2}\Big\{\pi\times E\big[Y_{2}^{0,1}-Y_{2}^{1,0}\mid A_{1}=0,A_{2}=1,X_{1}=X_{2}=x\big]\\
&\quad+(1-\pi)\times E\big[Y_{2}^{0,1}-Y_{2}^{1,0}\mid A_{1}=1,A_{2}=0,X_{1}=X_{2}=x\big]\Big\}
\end{align*}

Now $E\big[Y_{j}^{0,1}\mid A_{j}=0,A_{3-j}=1,X_{1}=X_{2}=x\big]$ is equal to $E\big[Y_{j}\mid A_{j}=0,A_{3-j}=1,X_{1}=X_{2}=x\big]$ by consistency, while $E\big[Y_{j}^{0,1}\mid A_{j}=1,A_{3-j}=0,X_{1}=X_{2}=x\big]$ can be identified by leveraging the symmetry between the group of twins labeled as Twin 1 and the group labeled as Twin 2:

\begin{align*}
E\big[&Y_{j}^{0,1}\mid A_{j}=1,A_{3-j}=0,X_{1}=X_{2}=x\big]\\
=&\ \int_{y}y\ dF\big(Y_{j}^{0,1}=y\mid A_{j}=1,A_{3-j}=0,X_{1}=X_{2}=x\big)\\
=&\ \int_{u}\int_{y}y\ dF\big(Y_{j}^{0,1}=y\mid U=u,A_{j}=1,A_{3-j}=0,X_{1}=X_{2}=x\big)\times dF\big(u\mid A_{j}=1,A_{3-j}=0,X_{1}=X_{2}=x\big)\\
=&\ \int_{u}\int_{y}y\ dF\big(Y_{j}^{0,1}=y\mid U=u,X_{1}=X_{2}=x\big)\times dF\big(u\mid A_{j}=1,A_{3-j}=0,X_{1}=X_{2}=x\big) &&\mbox{by C1}\\
=&\ \int_{u}\int_{y}y\ dF\big(Y_{3-j}^{0,1}=y\mid U=u,X_{1}=X_{2}=x\big)\times dF\big(u\mid A_{j}=1,A_{3-j}=0,X_{1}=X_{2}=x\big) &&\mbox{by C4}\\
=&\ \int_{u}\int_{y}y\ dF\big(Y_{3-j}^{0,1}=y\mid U=u,A_{j}=1,A_{3-j}=0,X_{1}=X_{2}=x\big)\times dF\big(u\mid A_{j}=1,A_{3-j}=0,X_{1}=X_{2}=x\big) &&\mbox{by C1}\\
=&\ \int_{u}\int_{y}y\ dF\big(Y_{3-j}=y\mid U=u,A_{j}=1,A_{3-j}=0,X_{1}=X_{2}=x\big)\times dF\big(u\mid A_{j}=1,A_{3-j}=0,X_{1}=X_{2}=x\big) &&\mbox{by C2}\\
=&\ E\big[Y_{3-j}\mid A_{j}=1,A_{3-j}=0,X_{1}=X_{2}=x\big]
\end{align*}

Similarly, $E\big[Y_{j}^{1,0}\mid A_{j}=0,A_{3-j}=1,X_{1}=X_{2}=x\big]$ is identified as $E\big[Y_{3-j}\mid A_{j}=0,A_{3-j}=1,X_{1}=X_{2}=x\big]$. Therefore,

\begin{align*}
\beta_{ctc}=\ &\frac{1}{2}\Big\{\pi\times E\big[Y_{1}-Y_{2}\mid A_{1}=0,A_{2}=1,X_{1}=X_{2}=x\big]\\
&\quad+(1-\pi)\times E\big[Y_{2}-Y_{1}\mid A_{1}=1,A_{2}=0,X_{1}=X_{2}=x\big]\Big\}\\
&+\frac{1}{2}\Big\{\pi\times E\big[Y_{1}-Y_{2}\mid A_{1}=0,A_{2}=1,X_{1}=X_{2}=x\big]\\
&\quad+(1-\pi)\times E\big[Y_{2}-Y_{1}\mid A_{1}=1,A_{2}=0,X_{1}=X_{2}=x\big]\Big\}\\
=\ &E\big[Y_{j}\mid A_{j}=0,A_{3-j}=1,X_{1}=X_{2}=x\big]-E\big[Y_{j}\mid A_{j}=1,A_{3-j}=0,X_{1}=X_{2}=x\big],
\end{align*}
where the final equality uses C4 to pool the two discordant configurations: within each of the two brackets, both terms equal the observed mean outcome of unexposed twins minus that of exposed twins in $D_{x}$.
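As a sanity check on this identification argument, the following minimal simulation sketch uses a hypothetical additive data-generating process (not the paper's data): $Y_{j}^{a,b}=d\,a+s\,b+U+\varepsilon_{j}$ with a shared unmeasured confounder $U$, and individual covariates omitted, which corresponds to working within a single stratum $x$. Under this model $\beta_{ctc}=s-d$ exactly, and the observed-data contrast over exposure-discordant pairs recovers it because the shared $U$ cancels within pairs.

```python
import numpy as np

# Hypothetical data-generating process: direct effect d, spillover effect s,
# shared (unobserved) confounder U that drives both twins' exposures.
rng = np.random.default_rng(0)
n, d, s = 200_000, 1.0, 0.4

U = rng.normal(size=n)                      # shared confounder, one per pair
p = 1.0 / (1.0 + np.exp(-U))                # exposure probability depends on U
A1, A2 = rng.binomial(1, p), rng.binomial(1, p)
Y1 = d * A1 + s * A2 + U + rng.normal(size=n)
Y2 = d * A2 + s * A1 + U + rng.normal(size=n)

# beta_ctc = E[Y^{0,1} - Y^{1,0} | A1 != A2] = s - d under this additive model.
# Identified contrast: among discordant pairs, mean outcome of the unexposed
# twin minus mean outcome of the exposed twin; U cancels within each pair.
disc = A1 != A2
y_unexposed = np.where(A1 == 0, Y1, Y2)[disc]
y_exposed = np.where(A1 == 1, Y1, Y2)[disc]
beta_hat = y_unexposed.mean() - y_exposed.mean()
print(beta_hat)                             # close to s - d = -0.6
```

Note that the contrast is unbiased here even though $U$ is never used by the estimator; this is exactly the co-twin control logic formalized above.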

Appendix B Efficient influence functions

Here we derive the efficient influence function for the parameter $\beta_{j}^{a,b}:=E\big[Y_{j}^{a,b}\big]$, for each of the two models presented in Section 3.2. This also immediately gives the efficient influence function for $\beta^{a,b}=\frac{1}{2}E\big[Y_{1}^{a,b}+Y_{2}^{a,b}\big]$, and for sums and differences of such parameters. We use the theory, terminology, and notation of Tsiatis [25] and Scharfstein et al. [20].

We assume that we have $n$ independent pairs of twins, and that the data from these twin pairs constitute $n$ i.i.d. draws from some distribution. Let $O=\big(C,X_{1},X_{2},A_{1},A_{2},Y_{1},Y_{2}\big)$ be the observed data for each twin pair. Let $Z=\big(C,X_{1},X_{2},A_{1},A_{2},Y_{1}^{1,1},Y_{2}^{1,1},Y_{1}^{1,0},Y_{2}^{1,0},Y_{1}^{0,1},Y_{2}^{0,1},Y_{1}^{0,0},Y_{2}^{0,0}\big)$ be the full data for each twin pair. Under the consistency assumption, the observed data is a coarsening of the full data, since $Y_{j}=\sum_{a,b=0,1}\mathds{1}(A_{j}=a,A_{3-j}=b)\,Y_{j}^{a,b}$ for each $j=1,2$. We partition the full data $Z$ into $L=\big(C,X_{1},X_{2},Y_{1}^{1,1},Y_{2}^{1,1},Y_{1}^{1,0},Y_{2}^{1,0},Y_{1}^{0,1},Y_{2}^{0,1},Y_{1}^{0,0},Y_{2}^{0,0}\big)$ and $R=(A_{1},A_{2})$, where $R$ determines which components of $Z$ are observed.

Let $\mathcal{H}^{O}$ and $\mathcal{H}^{Z}$ denote the Hilbert spaces of mean-zero functions of the observed data and of the full data, respectively, with the covariance inner product. Full-data influence functions for $\beta_{j}^{a,b}$ are elements of the space $\Lambda^{\perp}$, the orthogonal complement of the full-data nuisance tangent space $\Lambda=\Lambda(F_{L})\oplus\Lambda(F_{R|L})$. Observed-data influence functions are elements of the space $\Lambda^{O,\perp}$, the orthogonal complement of the observed-data nuisance tangent space $\Lambda^{O}=\Lambda_{1}^{O}+\Lambda_{2}^{O}$, where, writing $\Lambda_{1}=\Lambda(F_{L})$ and $\Lambda_{2}=\Lambda(F_{R|L})$, we define $\Lambda_{j}^{O}=E[\Lambda_{j}|O]:=\big\{g(O):g(O)=E[h(Z)|O]\mbox{ for some }h(Z)\in\Lambda_{j}\big\}$. The efficient influence function $\varphi(O;\beta_{j}^{a,b})$ in a model is the unique influence function for $\beta_{j}^{a,b}$ which is an element of the observed-data tangent space $\mathcal{T}^{O}$ of that model; moreover, the efficient influence function is the projection of any influence function onto $\mathcal{T}^{O}$.

B.1 The efficient influence function in Model 1

Model 1 is the nonparametric model which places no restrictions on the observed data. This implies that there is just a single observed-data influence function $\varphi_{1}(O;\beta_{j}^{a,b})$ for $\beta_{j}^{a,b}$ in Model 1, which is therefore the efficient influence function. By theory in [20] and [25], the observed-data influence function may be found by using a full-data influence function for $\beta_{j}^{a,b}$, say $\varphi^{Z}\in\Lambda(F_{L})^{\perp}$, and the mapping $K:\mathcal{H}^{O}\to\mathcal{H}^{L}$ defined by $K(g)=E\big[g|L\big]$. An element $h(O)$ in the inverse image of $\varphi^{Z}$ under this mapping is in the space $\Lambda_{1}^{O,\perp}$. This can be seen using adjoint operators: if we consider the linear map $A:\Lambda_{1}\to\mathcal{H}^{O}$ defined by $A(h)=E[h|O]$, then $\Lambda_{1}^{O}$ is the range of $A$. Therefore, by properties of adjoint operators, $\Lambda_{1}^{O,\perp}$ is the null space of $A^{*}$, the adjoint of $A$. In this case $A^{*}(g)=E[g|L]-\Pi\big(E[g|L]\,\big|\,\Lambda_{1}^{\perp}\big)$, and therefore $\Lambda_{1}^{O,\perp}=\big\{g(O):E[g|L]\in\Lambda_{1}^{\perp}\big\}$. The residual $h(O)-\Pi\big(h(O)|\Lambda_{2}^{O}\big)$ from projecting $h(O)$ onto the space $\Lambda_{2}^{O}$ is in the space $\Pi(\Lambda_{1}^{O,\perp}|\Lambda_{2}^{O,\perp})$. In our case, the spaces $\Lambda_{1}^{O}$ and $\Lambda_{2}^{O}$ are orthogonal, which implies that $\Pi(\Lambda_{1}^{O,\perp}|\Lambda_{2}^{O,\perp})$ is equal to $\Lambda_{1}^{O,\perp}\cap\Lambda_{2}^{O,\perp}=\Lambda^{O,\perp}$. Therefore the residual is in $\Lambda^{O,\perp}$ and is thus the observed-data influence function for $\beta_{j}^{a,b}$ in Model 1.

A full-data influence function for $\beta_{j}^{a,b}$ is $\varphi^{Z}=Y_{j}^{a,b}-\beta_{j}^{a,b}$. To verify this, set $\epsilon:=Y_{j}^{a,b}-\beta_{j}^{a,b}$, and partition the parameters of $F_{L}$ into variation-independent components $\beta_{j}^{a,b}$ and $\eta$, where $\eta$ is the distribution of $\big(\tilde{L},\epsilon\big)$ and $\tilde{L}$ is formed by deleting $Y_{j}^{a,b}$ from the vector $L$. Then there is a one-to-one transformation from $dF(L)$ to $dF\big(\epsilon,\tilde{L}\big)$, which we factor as $dF(\epsilon)\times dF\big(\tilde{L}|\epsilon\big)$. Because the only constraint on these distributions is that $\epsilon$ is mean zero, we can show that the nuisance tangent space for $L$ is $\Lambda(F_{L})=\big\{h(\epsilon):E\big[h(\epsilon)\big]=E\big[\epsilon h(\epsilon)\big]=0\big\}\oplus\big\{h(L):E\big[h(L)\,|\,\epsilon\big]=0\big\}$. Therefore $\epsilon\in\Lambda(F_{L})^{\perp}$ as claimed, since $\epsilon$ is orthogonal to each of the direct summands.

An element of $\mathcal{H}^{O}$ which is in the inverse image of $\varphi^{Z}=Y_{j}^{a,b}-\beta_{j}^{a,b}$ under the mapping $K$, and hence in $\Lambda_{1}^{O,\perp}$, is:
\begin{align*}
g^{*}(O)=\frac{\mathds{1}(A_{j}=a,A_{3-j}=b)}{P(A_{j}=a,A_{3-j}=b\mid C,X_{1},X_{2})}\Big(Y_{j}-\beta_{j}^{a,b}\Big).
\end{align*}
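To illustrate, the following Monte Carlo sketch (a hypothetical simulated data set, with the shared factor $C$ fully observed and the joint propensity known by construction) checks that $g^{*}(O)$ has mean zero at the true $\beta_{j}^{a,b}$, so that solving its empirical moment equation yields an inverse-probability-weighted estimate. Here we take $(a,b)=(1,0)$ and an additive outcome model; all variable names and effect sizes are illustrative.

```python
import numpy as np

# Hypothetical DGP: binary shared covariate C, exposures independent given C,
# outcome Y1 = d*A1 + s*A2 + C + noise.  Then E[Y_1^{1,0}] = d + E[C].
rng = np.random.default_rng(1)
n, d, s = 200_000, 1.0, 0.4

C = rng.binomial(1, 0.5, size=n)
p = 0.3 + 0.4 * C                          # P(A_j = 1 | C), known here
A1, A2 = rng.binomial(1, p), rng.binomial(1, p)   # A1, A2 independent given C
Y1 = d * A1 + s * A2 + C + rng.normal(size=n)

beta_true = d + 0.5                        # E[Y_1^{1,0}] = d + E[C]
pi = p * (1 - p)                           # P(A1=1, A2=0 | C)
ind = (A1 == 1) & (A2 == 0)

# g*(O) has mean zero at the truth; solving the empirical moment equation
# sum_i g*(O_i) = 0 for beta gives the IPW estimate of beta_1^{1,0}.
g_star = ind / pi * (Y1 - beta_true)
beta_ipw = (ind / pi * Y1).mean()
print(g_star.mean(), beta_ipw)             # ≈ 0 and ≈ 1.5
```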

The space $\Lambda(F_{R|L})$ is $\big\{h_{2}(A_{1},A_{2},C,X_{1},X_{2}):E\big[h_{2}(A_{1},A_{2},C,X_{1},X_{2})\,|\,C,X_{1},X_{2}\big]=0\big\}$. Since elements of $\Lambda(F_{R|L})$ are functions of the observed data (due to the no unobserved confounding assumption), the space $\Lambda(F_{R|L})$ is equal to the space $\Lambda_{2}^{O}$. Projection onto $\Lambda_{2}^{O}$ is given by:

\begin{align*}
\Pi\big(\cdot\,|\,\Lambda_{2}^{O}\big)=E\big[\cdot\,|\,A_{1},A_{2},C,X_{1},X_{2}\big]-E\big[\cdot\,|\,C,X_{1},X_{2}\big],
\end{align*}

as this operation yields elements that are in $\Lambda_{2}^{O}$, and whose residuals are orthogonal to $\Lambda_{2}^{O}$. Below we write $\pi(C,X_{1},X_{2}):=P(A_{j}=a,A_{3-j}=b\mid C,X_{1},X_{2})$. Then:

\begin{align*}
\Pi\big(g^{*}(O)\,|\,\Lambda_{2}^{O}\big)=&\ \frac{\mathds{1}(A_{j}=a,A_{3-j}=b)}{\pi(C,X_{1},X_{2})}\ E\Big[Y_{j}^{a,b}-\beta_{j}^{a,b}\,\Big|\,A_{1},A_{2},C,X_{1},X_{2}\Big]\\
&\ -\frac{1}{\pi(C,X_{1},X_{2})}\ E\Big[\mathds{1}(A_{j}=a,A_{3-j}=b)\big(Y_{j}^{a,b}-\beta_{j}^{a,b}\big)\,\Big|\,C,X_{1},X_{2}\Big]\\
=&\ \frac{\mathds{1}(A_{j}=a,A_{3-j}=b)}{\pi(C,X_{1},X_{2})}\ E\Big[Y_{j}^{a,b}-\beta_{j}^{a,b}\,\Big|\,C,X_{1},X_{2}\Big]\\
&\ -\frac{1}{\pi(C,X_{1},X_{2})}\ E\Big[\mathds{1}(A_{j}=a,A_{3-j}=b)\,\Big|\,C,X_{1},X_{2}\Big]\,E\Big[Y_{j}^{a,b}-\beta_{j}^{a,b}\,\Big|\,C,X_{1},X_{2}\Big]\\
=&\ \frac{\mathds{1}(A_{j}=a,A_{3-j}=b)}{\pi(C,X_{1},X_{2})}\ E\Big[Y_{j}^{a,b}-\beta_{j}^{a,b}\,\Big|\,A_{j}=a,A_{3-j}=b,C,X_{1},X_{2}\Big]\\
&\ -E\Big[Y_{j}^{a,b}-\beta_{j}^{a,b}\,\Big|\,A_{j}=a,A_{3-j}=b,C,X_{1},X_{2}\Big].
\end{align*}

Therefore, taking the residual, the efficient influence function for $\beta_{j}^{a,b}$ in Model 1 is

\begin{align*}
\varphi_{1}(O;\beta_{j}^{a,b})=&\ \frac{\mathds{1}(A_{j}=a,A_{3-j}=b)}{P(A_{j}=a,A_{3-j}=b\mid C,X_{1},X_{2})}\Big(Y_{j}-E\big[Y_{j}\,|\,A_{j}=a,A_{3-j}=b,C,X_{1},X_{2}\big]\Big)\\
&\ +E\big[Y_{j}\,|\,A_{j}=a,A_{3-j}=b,C,X_{1},X_{2}\big]-\beta_{j}^{a,b}.
\end{align*}
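Setting the sample average of $\varphi_{1}$ to zero and solving for $\beta_{j}^{a,b}$ gives the familiar augmented IPW (one-step) estimator. The sketch below illustrates this for $(a,b)=(1,0)$ on a hypothetical simulated data set, with a known propensity and the outcome regression $E[Y_{1}\,|\,A_{1}=1,A_{2}=0,C]$ estimated by cell means within levels of a binary shared covariate $C$; the data-generating process and all names are illustrative assumptions, not the paper's application.

```python
import numpy as np

# Hypothetical DGP, as in the IPW check: Y1 = d*A1 + s*A2 + C + noise,
# exposures independent given the binary shared covariate C.
rng = np.random.default_rng(2)
n, d, s = 200_000, 1.0, 0.4

C = rng.binomial(1, 0.5, size=n)
p = 0.3 + 0.4 * C                         # known propensity P(A_j = 1 | C)
A1, A2 = rng.binomial(1, p), rng.binomial(1, p)
Y1 = d * A1 + s * A2 + C + rng.normal(size=n)

pi = p * (1 - p)                          # P(A1=1, A2=0 | C)
ind = (A1 == 1) & (A2 == 0)

# Outcome regression: mean of Y1 in the (A1=1, A2=0) cell, within each level of C.
m = np.empty(n)
for c in (0, 1):
    m[C == c] = Y1[ind & (C == c)].mean()

# Solving (1/n) sum_i phi_1(O_i; beta) = 0 for beta gives the augmented IPW estimate.
beta_aipw = (ind / pi * (Y1 - m) + m).mean()
print(beta_aipw)                          # ≈ E[Y_1^{1,0}] = d + 0.5 = 1.5
```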

B.2 The efficient influence function in Model 2

Model 2, by contrast, does impose constraints on the observed data. In a model that imposes restrictions on the observed law, there will be multiple observed-data influence functions. An approach for deriving the efficient influence function in such a setting is to (i) find one observed-data influence function, say $\varphi_{naive}$, and (ii) project $\varphi_{naive}$ onto the observed-data tangent space $\mathcal{T}^{O}$ for the model. Here we will actually start by considering a slightly smaller model, which we will call Model 3, in which we impose the additional assumption that $\big(Y_{1}^{1,1},\ldots,Y_{1}^{0,0}\big)\perp\!\!\!\perp\big(Y_{2}^{1,1},\ldots,Y_{2}^{0,0}\big)\,\big|\,\big(C,X_{1},X_{2}\big)$. The observed-data tangent space for Model 3 is more straightforward to compute than that for Model 2; therefore we will first derive the efficient influence function for $\beta_{j}^{a,b}$ in Model 3, then show that it is the same as the efficient influence function for $\beta_{j}^{a,b}$ in Model 2.

We start by recalling the assumptions placed on the full data in Model 3. Here we write $\tilde{Y}_{j}$ for the vector of potential outcomes $\big(Y_{j}^{1,1},Y_{j}^{1,0},Y_{j}^{0,1},Y_{j}^{0,0}\big)$:

\begin{align*}
\textbf{A1}:&\ \big(\tilde{Y}_{1},\tilde{Y}_{2}\big)\perp\!\!\!\perp(A_{1},A_{2})\,\big|\,(C,X_{1},X_{2}) &&\mbox{(exchangeability)}\\
\textbf{A2}:&\ \mbox{For all }c,x_{1},x_{2}\mbox{ in the support of }(C,X_{1},X_{2}), &&\mbox{(positivity)}\\
&\ \ \ P\big(A_{1}=a,A_{2}=b\,|\,C=c,X_{1}=x_{1},X_{2}=x_{2}\big)>0\mbox{ for all }a,b=0,1\\
\textbf{A3}:&\ \mbox{If }A_{1}=a,A_{2}=b,\mbox{ then }Y_{1}=Y_{1}^{a,b},\ Y_{2}=Y_{2}^{b,a} &&\mbox{(consistency)}\\
\textbf{A4}:&\ A_{j}\perp\!\!\!\perp X_{3-j}\,\big|\,(C,X_{j})\\
\textbf{A5}:&\ \tilde{Y}_{j}\perp\!\!\!\perp X_{3-j}\,\big|\,(C,X_{j})\\
\textbf{A6}:&\ A_{1}\perp\!\!\!\perp A_{2}\,\big|\,(C,X_{1},X_{2})\\
\textbf{A7}:&\ \tilde{Y}_{1}\perp\!\!\!\perp\tilde{Y}_{2}\,\big|\,(C,X_{1},X_{2})
\end{align*}

Finally, we also assume the property of composition, meaning that for any sets of variables $U,V,W,Z$, if $U\perp\!\!\!\perp V|Z$ and $U\perp\!\!\!\perp W|Z$, then it also holds that $U\perp\!\!\!\perp(V,W)|Z$, as with graphical $d$-separation. The following independencies follow from the assumptions listed above:

  • B1: $\tilde{Y}_{j}\perp\!\!\!\perp(A_{1},A_{2})\,|\,(C,X_{j})$.
\begin{align*}
dF(a_{1},a_{2}\mid\tilde{y}_{j},c,x_{j})=&\ \int_{x_{3-j}}dF(a_{1},a_{2}\mid\tilde{y}_{j},c,x_{j},x_{3-j})\,dF(x_{3-j}\mid\tilde{y}_{j},c,x_{j})\\
=&\ \int_{x_{3-j}}dF(a_{1},a_{2}\mid c,x_{j},x_{3-j})\,dF(x_{3-j}\mid\tilde{y}_{j},c,x_{j}) &&\mbox{A1}\\
=&\ \int_{x_{3-j}}dF(a_{1},a_{2}\mid c,x_{j},x_{3-j})\,dF(x_{3-j}\mid c,x_{j}) &&\mbox{A5}\\
=&\ dF(a_{1},a_{2}\mid c,x_{j}).
\end{align*}
  • B2: $\tilde{Y}_{j}\perp\!\!\!\perp X_{3-j}\,|\,(A_{1},A_{2},C,X_{j})$. By B1, A5, and composition, we have $\tilde{Y}_{j}\perp\!\!\!\perp(A_{1},A_{2},X_{3-j})\,|\,(C,X_{j})$. Now the result follows immediately from the weak union property of conditional independence, which states that if $U\perp\!\!\!\perp(V,W)|Z$, then $U\perp\!\!\!\perp V|(W,Z)$.

  • B3: $\tilde{Y}_{j}\perp\!\!\!\perp X_{3-j}\,|\,(Y_{j},A_{1},A_{2},C,X_{j})$. By A3, for each $a,b=0,1$, $dF\big(x_{3-j},\tilde{y}_{j}\mid A_{j}=a,A_{3-j}=b,c,x_{j},Y_{j}=y_{j}\big)=dF\big(x_{3-j},\tilde{y}_{j}\mid A_{j}=a,A_{3-j}=b,c,x_{j},Y_{j}^{a,b}=y_{j}\big)$. Therefore it suffices to show that $\tilde{Y}_{j}\perp\!\!\!\perp X_{3-j}\,|\,\big(Y_{j}^{a,b},A_{j}=a,A_{3-j}=b,C,X_{j}\big)$, and this follows immediately from B2 by weak union.

  • B4: $\tilde{Y}_{j}\perp\!\!\!\perp Y_{3-j}\,|\,(A_{1},A_{2},C,X_{1},X_{2})$. By A1, A7, and composition, $\tilde{Y}_{j}\perp\!\!\!\perp(\tilde{Y}_{3-j},A_{1},A_{2})\,|\,(C,X_{1},X_{2})$. Since the observed outcome $Y_{3-j}$ is a function of $\tilde{Y}_{3-j},A_{1},A_{2}$, this implies $\tilde{Y}_{j}\perp\!\!\!\perp(Y_{3-j},A_{1},A_{2})\,|\,(C,X_{1},X_{2})$. Now the result follows by weak union.

  • B5: $\tilde{Y}_{j}\perp\!\!\!\perp Y_{3-j}\,|\,(Y_{j},A_{1},A_{2},C,X_{1},X_{2})$. By B4 and weak union, $\tilde{Y}_{j}\perp\!\!\!\perp Y_{3-j}\,|\,(Y_{j}^{a,b},A_{j}=a,A_{3-j}=b,C,X_{1},X_{2})$. By A3, this implies $\tilde{Y}_{j}\perp\!\!\!\perp Y_{3-j}\,|\,(Y_{j},A_{j}=a,A_{3-j}=b,C,X_{1},X_{2})$.

  • B6: $\tilde{Y}_{j}\perp\!\!\!\perp(\tilde{Y}_{3-j},X_{3-j})\,|\,(A_{1},A_{2},C,X_{j})$. By B1, $\tilde{Y}_{j}\perp\!\!\!\perp(A_{1},A_{2})\,|\,(C,X_{j})$, and by A5, $\tilde{Y}_{j}\perp\!\!\!\perp X_{3-j}\,|\,(C,X_{j})$. Therefore by composition, $\tilde{Y}_{j}\perp\!\!\!\perp(A_{1},A_{2},X_{3-j})\,|\,(C,X_{j})$. We show that $\tilde{Y}_{j}\perp\!\!\!\perp\tilde{Y}_{3-j}\,|\,(C,X_{j})$:
\begin{align*}
dF(\tilde{y}_{3-j}\mid\tilde{y}_{j},c,x_{j})=&\ \int_{x_{3-j}}dF(\tilde{y}_{3-j}\mid\tilde{y}_{j},c,x_{j},x_{3-j})\,dF(x_{3-j}\mid\tilde{y}_{j},c,x_{j})\\
=&\ \int_{x_{3-j}}dF(\tilde{y}_{3-j}\mid c,x_{j},x_{3-j})\,dF(x_{3-j}\mid\tilde{y}_{j},c,x_{j}) &&\mbox{A7}\\
=&\ \int_{x_{3-j}}dF(\tilde{y}_{3-j}\mid c,x_{j},x_{3-j})\,dF(x_{3-j}\mid c,x_{j}) &&\mbox{A5}\\
=&\ dF(\tilde{y}_{3-j}\mid c,x_{j}).
\end{align*}
    Therefore by composition $\tilde{Y}_{j}\perp\!\!\!\perp(\tilde{Y}_{3-j},X_{3-j},A_{1},A_{2})\,|\,(C,X_{j})$, and this implies $\tilde{Y}_{j}\perp\!\!\!\perp(\tilde{Y}_{3-j},X_{3-j})\,|\,(A_{1},A_{2},C,X_{j})$ by weak union.

  • B7: $A_{1}\perp\!\!\!\perp A_{2}\,|\,(C,X_{j})$.
\begin{align*}
dF(a_{3-j}\mid a_{j},c,x_{j})=&\ \int_{x_{3-j}}dF(a_{3-j}\mid a_{j},c,x_{j},x_{3-j})\,dF(x_{3-j}\mid a_{j},c,x_{j})\\
=&\ \int_{x_{3-j}}dF(a_{3-j}\mid a_{j},c,x_{j},x_{3-j})\,dF(x_{3-j}\mid c,x_{j}) &&\mbox{A4}\\
=&\ \int_{x_{3-j}}dF(a_{3-j}\mid c,x_{j},x_{3-j})\,dF(x_{3-j}\mid c,x_{j}) &&\mbox{A6}\\
=&\ dF(a_{3-j}\mid c,x_{j}).
\end{align*}

We now derive the observed data tangent space for Model 3, by first computing the full data tangent space, then showing how to move from the full data to the observed data tangent space. In Model 3 the full data likelihood factors as:

\begin{align*}
dF(Z)=&\ dF(c,x_{1},x_{2})\,dF\big(\tilde{y}_{1},\tilde{y}_{2}\mid c,x_{1},x_{2}\big)\,dF\big(a_{1},a_{2}\mid\tilde{y}_{1},\tilde{y}_{2},c,x_{1},x_{2}\big)\\
=&\ dF(c,x_{1},x_{2})\,dF\big(\tilde{y}_{1},\tilde{y}_{2}\mid c,x_{1},x_{2}\big)\,dF\big(a_{1},a_{2}\mid c,x_{1},x_{2}\big) &&\mbox{A1}\\
=&\ dF(c,x_{1},x_{2})\,dF\big(\tilde{y}_{1},\tilde{y}_{2}\mid c,x_{1},x_{2}\big)\,dF(a_{1}\mid c,x_{1},x_{2})\,dF(a_{2}\mid c,x_{1},x_{2}) &&\mbox{A6}\\
=&\ dF(c,x_{1},x_{2})\,dF\big(\tilde{y}_{1},\tilde{y}_{2}\mid c,x_{1},x_{2}\big)\,dF(a_{1}\mid c,x_{1})\,dF(a_{2}\mid c,x_{2}) &&\mbox{A4}\\
=&\ dF(c,x_{1},x_{2})\,dF\big(\tilde{y}_{1}\mid c,x_{1},x_{2}\big)\,dF\big(\tilde{y}_{2}\mid c,x_{1},x_{2}\big)\,dF(a_{1}\mid c,x_{1})\,dF(a_{2}\mid c,x_{2}) &&\mbox{A7}\\
=&\ dF(c,x_{1},x_{2})\,dF\big(\tilde{y}_{1}\mid c,x_{1}\big)\,dF\big(\tilde{y}_{2}\mid c,x_{2}\big)\,dF(a_{1}\mid c,x_{1})\,dF(a_{2}\mid c,x_{2}). &&\mbox{A5}
\end{align*}

The full-data tangent space is $\mathcal{T}=\mathcal{T}_{1}\oplus\ldots\oplus\mathcal{T}_{5}$, where, writing $\pi_{j}:=P(A_{j}=1\mid C,X_{j})$:

\begin{align*}
\mathcal{T}_{1}=&\ \big\{h_{1}(C,X_{1},X_{2}):E\big[h_{1}(C,X_{1},X_{2})\big]=0\big\}\\
\mathcal{T}_{2}=&\ \big\{h_{2}(\tilde{Y}_{1},C,X_{1}):E\big[h_{2}(\tilde{Y}_{1},C,X_{1})\,|\,C,X_{1}\big]=0\big\}\\
\mathcal{T}_{3}=&\ \big\{h_{3}(\tilde{Y}_{2},C,X_{2}):E\big[h_{3}(\tilde{Y}_{2},C,X_{2})\,|\,C,X_{2}\big]=0\big\}\\
\mathcal{T}_{4}=&\ \big\{(A_{1}-\pi_{1})h_{4}(C,X_{1}):h_{4}(C,X_{1})\mbox{ any function}\big\}\\
\mathcal{T}_{5}=&\ \big\{(A_{2}-\pi_{2})h_{5}(C,X_{2}):h_{5}(C,X_{2})\mbox{ any function}\big\}.
\end{align*}

The observed-data tangent space is $\mathcal{T}^{O}=\mathcal{T}_{1}^{O}+\ldots+\mathcal{T}_{5}^{O}$, where $\mathcal{T}_{k}^{O}=E\big[\mathcal{T}_{k}|O\big]$. For $k=1,4,5$, $\mathcal{T}_{k}^{O}=\mathcal{T}_{k}$, while $\mathcal{T}_{2}^{O}=\big\{E\big[h_{2}(\tilde{Y}_{1},C,X_{1})|O\big]:E[h_{2}(\tilde{Y}_{1},C,X_{1})|C,X_{1}]=0\big\}$ and $\mathcal{T}_{3}^{O}=\big\{E\big[h_{3}(\tilde{Y}_{2},C,X_{2})|O\big]:E[h_{3}(\tilde{Y}_{2},C,X_{2})|C,X_{2}]=0\big\}$.

We claim that the five spaces $\mathcal{T}_{k}^{O},k=1,\ldots,5$ are mutually orthogonal. To show that $\mathcal{T}_{1}^{O}\perp\mathcal{T}_{2}^{O}$, let $g_{1}(C,X_{1},X_{2})\in\mathcal{T}_{1}^{O}$ and $g_{2}(O)=E[h_{2}(\tilde{Y}_{1},C,X_{1})|O]\in\mathcal{T}_{2}^{O}$, where $E\big[h_{2}(\tilde{Y}_{1},C,X_{1})|C,X_{1}\big]=0$. Then:

\begin{align*}
E\big[g_{1}(C,X_{1},X_{2})g_{2}(O)\big]=&\ E\Big[g_{1}(C,X_{1},X_{2})E\big[h_{2}(\tilde{Y}_{1},C,X_{1})\,|\,O\big]\Big]\\
=&\ E\Big[g_{1}(C,X_{1},X_{2})h_{2}(\tilde{Y}_{1},C,X_{1})\Big]\\
=&\ E\Big[g_{1}(C,X_{1},X_{2})E\big[\underbrace{h_{2}(\tilde{Y}_{1},C,X_{1})}_{\mbox{random in }\tilde{Y}_{1}\mbox{ only}}\,|\,C,X_{1},X_{2}\big]\Big]\\
=&\ E\Big[g_{1}(C,X_{1},X_{2})\underbrace{E\big[h_{2}(\tilde{Y}_{1},C,X_{1})\,|\,C,X_{1}\big]}_{=0}\Big]=0. &&\mbox{A5}
\end{align*}

All other parts of the claim follow similarly from the orthogonality of the full-data spaces $\mathcal{T}_{1},\ldots,\mathcal{T}_{5}$, except for showing that $\mathcal{T}_{2}^{O}$ and $\mathcal{T}_{3}^{O}$ are orthogonal. To show that $\mathcal{T}_{2}^{O}\perp\mathcal{T}_{3}^{O}$, let $g_{2}(O)=E\big[h_{2}(\tilde{Y}_{1},C,X_{1})|O\big]\in\mathcal{T}_{2}^{O}$ where $E\big[h_{2}(\tilde{Y}_{1},C,X_{1})|C,X_{1}\big]=0$, and let $g_{3}(O)=E\big[h_{3}(\tilde{Y}_{2},C,X_{2})|O\big]\in\mathcal{T}_{3}^{O}$, where $E\big[h_{3}(\tilde{Y}_{2},C,X_{2})|C,X_{2}\big]=0$. Then:

\begin{align*}
E&\big[g_{2}(O)g_{3}(O)\big]\\
=&\ E\Big[E\big[h_{2}(\tilde{Y}_{1},C,X_{1})\,|\,O\big]E\big[h_{3}(\tilde{Y}_{2},C,X_{2})\,|\,O\big]\Big]\\
=&\ E\Big[E\big[h_{2}(\tilde{Y}_{1},C,X_{1})\,|\,Y_{1},Y_{2},A_{1},A_{2},C,X_{1},X_{2}\big]E\big[h_{3}(\tilde{Y}_{2},C,X_{2})\,|\,O\big]\Big]\\
=&\ E\Big[E\big[h_{2}(\tilde{Y}_{1},C,X_{1})\,|\,Y_{1},A_{1},A_{2},C,X_{1},X_{2}\big]E\big[h_{3}(\tilde{Y}_{2},C,X_{2})\,|\,O\big]\Big] &&\mbox{B5}\\
=&\ E\Big[E\big[h_{2}(\tilde{Y}_{1},C,X_{1})\,|\,Y_{1},A_{1},A_{2},C,X_{1}\big]E\big[h_{3}(\tilde{Y}_{2},C,X_{2})\,|\,O\big]\Big] &&\mbox{B3}\\
=&\ E\Big[E\big[h_{2}(\tilde{Y}_{1},C,X_{1})\,|\,Y_{1},A_{1},A_{2},C,X_{1}\big]E\Big[E\big[h_{3}(\tilde{Y}_{2},C,X_{2})\,|\,O\big]\,\Big|\,Y_{1},A_{1},A_{2},C,X_{1},X_{2}\Big]\Big]\\
=&\ E\Big[E\big[h_{2}(\tilde{Y}_{1},C,X_{1})\,|\,Y_{1},A_{1},A_{2},C,X_{1}\big]E\Big[h_{3}(\tilde{Y}_{2},C,X_{2})\,\Big|\,Y_{1},A_{1},A_{2},C,X_{1},X_{2}\Big]\Big]\\
=&\ E\Big[E\big[h_{2}(\tilde{Y}_{1},C,X_{1})\,|\,Y_{1},A_{1},A_{2},C,X_{1}\big]E\Big[h_{3}(\tilde{Y}_{2},C,X_{2})\,\Big|\,A_{1},A_{2},C,X_{1},X_{2}\Big]\Big] &&\mbox{B4}\\
=&\ E\Big[E\big[h_{2}(\tilde{Y}_{1},C,X_{1})\,|\,Y_{1},A_{1},A_{2},C,X_{1}\big]E\Big[h_{3}(\tilde{Y}_{2},C,X_{2})\,\Big|\,A_{1},A_{2},C,X_{2}\Big]\Big] &&\mbox{B2}\\
=&\ E\Big[E\big[h_{2}(\tilde{Y}_{1},C,X_{1})\,|\,Y_{1},A_{1},A_{2},C,X_{1}\big]\underbrace{E\Big[h_{3}(\tilde{Y}_{2},C,X_{2})\,\Big|\,C,X_{2}\Big]}_{=0}\Big]=0. &&\mbox{B1}
\end{align*}

Since $\mathcal{T}^{O}$ is a sum of five mutually orthogonal spaces, projection onto $\mathcal{T}^{O}$ is given by the sum of the projections onto each of the five spaces. Projection onto $\mathcal{T}_{1}^{O}$, $\mathcal{T}_{4}^{O}$, and $\mathcal{T}_{5}^{O}$ is straightforward. For projection onto $\mathcal{T}_{2}^{O}$, we will use the following claim (and analogously for $\mathcal{T}_{3}^{O}$):

Claim: $\mathcal{T}_{2}^{O}$ is equal to the space $S_{2}$, where
\begin{align*}
S_{2}=\big\{h_{2}(Y_{1},A_{1},A_{2},C,X_{1}):E\big[h_{2}(Y_{1},A_{1},A_{2},C,X_{1})\,|\,A_{1},A_{2},C,X_{1}\big]=0\big\}.
\end{align*}

To show the claim, we first show that $\mathcal{T}_{2}^{O}\subseteq S_{2}$. Let $g_{2}(O)=E\big[h_{2}(\tilde{Y}_{1},C,X_{1})|O\big]\in\mathcal{T}_{2}^{O}$. As shown in the preceding proof, $g_{2}(O)=E\big[h_{2}(\tilde{Y}_{1},C,X_{1})\,|\,Y_{1},A_{1},A_{2},C,X_{1}\big]$ by B5 and B3, which shows that $g_{2}(O)$ is a function of $Y_{1},A_{1},A_{2},C,X_{1}$ only. We must also check that $E\big[g_{2}(O)\,|\,A_{1},A_{2},C,X_{1}\big]=0$:

E[g2(O)|A1,A2,C,X1]=\displaystyle E\big{[}g_{2}(O)|A_{1},A_{2},C,X_{1}\big{]}= E[E[h2(Y~1,C,X1)|O]|A1,A2,C,X1]\displaystyle\ E\Big{[}E\big{[}h_{2}(\tilde{Y}_{1},C,X_{1})|O\big{]}|A_{1},A_{2},C,X_{1}\big{]}
=\displaystyle= E[h2(Y~1,C,X1)|A1,A2,C,X1]\displaystyle\ E\big{[}h_{2}(\tilde{Y}_{1},C,X_{1})|A_{1},A_{2},C,X_{1}\big{]}
=\displaystyle= E[h2(Y~1,C,X1)|C,X1]=0.\displaystyle\ E\big{[}h_{2}(\tilde{Y}_{1},C,X_{1})|C,X_{1}\big{]}=0. B1

Therefore g2(O)S2g_{2}(O)\in S_{2}. To show the opposite containment, we will show that 𝒯2O,S2\mathcal{T}_{2}^{O,\perp}\subseteq S_{2}^{\perp}, which is equivalent to 𝒯2OS2\mathcal{T}_{2}^{O}\supseteq S_{2}. We begin by computing 𝒯2O,\mathcal{T}_{2}^{O,\perp}, using properties of adjoint operators.

Consider the linear map A:𝒯2OA:\mathcal{T}_{2}\to\mathcal{H}^{O} given by A(h)=E[h|O]A(h)=E[h|O]. The adjoint of AA is the map
A:O𝒯2A^{*}:\mathcal{H}^{O}\to\mathcal{T}_{2} such that E[A(h2)g]=E[h2A(g)]E\big{[}A(h_{2})\cdot g\big{]}=E\big{[}h_{2}\cdot A^{*}(g)\big{]} for all h2𝒯2h_{2}\in\mathcal{T}_{2} and all gOg\in\mathcal{H}^{O}. The range of AA is 𝒯2O\mathcal{T}_{2}^{O}. Therefore, by properties of adjoint operators, 𝒯2O,\mathcal{T}_{2}^{O,\perp} is the null space of AA^{*}. We show that AA^{*} is given by A(g)=E[g|Y~1,C,X1]E[g|C,X1]A^{*}(g)=E\big{[}g|\tilde{Y}_{1},C,X_{1}\big{]}-E\big{[}g|C,X_{1}\big{]}. First, note that for any gOg\in\mathcal{H}^{O}, E[g|Y~1,C,X1]E[g|C,X1]E\big{[}g|\tilde{Y}_{1},C,X_{1}\big{]}-E\big{[}g|C,X_{1}\big{]} is indeed an element in 𝒯2\mathcal{T}_{2}, since it is a function of Y~1,C,X1\tilde{Y}_{1},C,X_{1} that is mean zero given C,X1C,X_{1}. Now we check that E[A(h2)g]=E[h2A(g)]E\big{[}A(h_{2})g\big{]}=E\big{[}h_{2}A^{*}(g)\big{]}:

E[A(h2(Y~1,C,X1))g(O)]=\displaystyle E\Big{[}A\Big{(}h_{2}(\tilde{Y}_{1},C,X_{1})\Big{)}g(O)\Big{]}= E[E[h2(Y~1,C,X1)|O]g(O)]\displaystyle\ E\Big{[}E\big{[}h_{2}(\tilde{Y}_{1},C,X_{1})|O\big{]}g(O)\Big{]}
=\displaystyle= E[h2(Y~1,C,X1)g(O)]\displaystyle\ E\Big{[}h_{2}(\tilde{Y}_{1},C,X_{1})g(O)\Big{]}
=\displaystyle= E[h2(Y~1,C,X1)E[g(O)|Y~1,C,X1]]\displaystyle\ E\Big{[}h_{2}(\tilde{Y}_{1},C,X_{1})E\big{[}g(O)|\tilde{Y}_{1},C,X_{1}\big{]}\Big{]}
=\displaystyle= E[h2(Y~1,C,X1){E[g(O)|Y~1,C,X1]E[g(O)|C,X1]}]\displaystyle\ E\Big{[}h_{2}(\tilde{Y}_{1},C,X_{1})\Big{\{}E[g(O)|\tilde{Y}_{1},C,X_{1}]-E[g(O)|C,X_{1}]\Big{\}}\Big{]}

where the last equality follows since

E[h2(Y~1,C,X1)E[g(O)|C,X1]]=E[E[h2(Y~1,C,X1)|C,X1]=0E[g|C,X1]]=0.E\Big{[}h_{2}(\tilde{Y}_{1},C,X_{1})E[g(O)|C,X_{1}]\Big{]}=E\Big{[}\underbrace{E[h_{2}(\tilde{Y}_{1},C,X_{1})|C,X_{1}]}_{=0}E[g|C,X_{1}]\Big{]}=0.

Therefore,

𝒯2O,={g(O):E[g|Y~1,C,X1]=E[g|C,X1]}.\mathcal{T}_{2}^{O,\perp}=\Big{\{}g(O):E\big{[}g|\tilde{Y}_{1},C,X_{1}\big{]}=E\big{[}g|C,X_{1}\big{]}\Big{\}}.

Now let WW be the set

W={g(O):E[g(O)|Y~1,A1,A2,C,X1]=E[g(O)|A1,A2,C,X1]}.W=\Big{\{}g(O):E\big{[}g(O)|\tilde{Y}_{1},A_{1},A_{2},C,X_{1}\big{]}=E\big{[}g(O)|A_{1},A_{2},C,X_{1}\big{]}\Big{\}}.

We will show the chain of containments 𝒯2O,WS2\mathcal{T}_{2}^{O,\perp}\subseteq W\subseteq S_{2}^{\perp}.

We first show that, for any g(O)Og(O)\in\mathcal{H}^{O}, for each a,ba,b, E[g(O)|Y~1,A1=a,A2=b,C,X1]E\big{[}g(O)|\tilde{Y}_{1},A_{1}=a,A_{2}=b,C,X_{1}\big{]} is a function of Y1a,bY_{1}^{a,b}, CC, and X1X_{1} only. This follows from B6: since (Y~2,X2)Y~1|(A1,A2,C,X1)(\tilde{Y}_{2},X_{2})\perp\!\!\!\perp\tilde{Y}_{1}|(A_{1},A_{2},C,X_{1}), we can remove all but one of the potential outcomes in Y~1\tilde{Y}_{1} in the conditioning below.

E[g(O)|Y~1,\displaystyle E\big{[}g(O)|\tilde{Y}_{1}, A1=a,A2=b,C,X1]\displaystyle A_{1}=a,A_{2}=b,C,X_{1}\big{]}
=\displaystyle= E[g(Y1a,b,Y2a,b,a,b,C,X1,X2)random in Y2a,b,X2 only|Y~1,A1=a,A2=b,C,X1]\displaystyle\ E\big{[}\underbrace{g\big{(}Y_{1}^{a,b},Y_{2}^{a,b},a,b,C,X_{1},X_{2}\big{)}}_{\mbox{random in }Y_{2}^{a,b},X_{2}\mbox{ only}}|\tilde{Y}_{1},A_{1}=a,A_{2}=b,C,X_{1}\big{]}
=\displaystyle= E[g(Y1a,b,Y2a,b,a,b,C,X1,X2)|Y1a,b,A1=a,A2=b,C,X1]\displaystyle\ E\big{[}g\big{(}Y_{1}^{a,b},Y_{2}^{a,b},a,b,C,X_{1},X_{2}\big{)}|Y_{1}^{a,b},A_{1}=a,A_{2}=b,C,X_{1}\big{]} B6
=:\displaystyle=: hab(Y1a,b,C,X1)\displaystyle\ h_{ab}(Y_{1}^{a,b},C,X_{1})

Now by B1, P(A1=a,A2=b|C,X1,Y~1)=P(A1=a,A2=b|C,X1)P(A_{1}=a,A_{2}=b|C,X_{1},\tilde{Y}_{1})=P(A_{1}=a,A_{2}=b|C,X_{1}). Write πab=πab(C,X1)\pi_{ab}=\pi_{ab}(C,X_{1}) for this function. Then we have

E[g(O)|Y~1,C,X1]=\displaystyle E\big{[}g(O)|\tilde{Y}_{1},C,X_{1}\big{]}= E[E[g(O)|Y~1,A1,A2,C,X1]|Y~1,C,X1]\displaystyle\ E\Big{[}E\big{[}g(O)|\tilde{Y}_{1},A_{1},A_{2},C,X_{1}\big{]}|\tilde{Y}_{1},C,X_{1}\Big{]}
=\displaystyle= π11(C,X1)h11(Y11,1,C,X1)+π10(C,X1)h10(Y11,0,C,X1)+\displaystyle\ \pi_{11}(C,X_{1})h_{11}(Y_{1}^{1,1},C,X_{1})+\pi_{10}(C,X_{1})h_{10}(Y_{1}^{1,0},C,X_{1})+
π01(C,X1)h01(Y10,1,C,X1)+π00(C,X1)h00(Y10,0,C,X1).\displaystyle\hskip 21.68121pt\pi_{01}(C,X_{1})h_{01}(Y_{1}^{0,1},C,X_{1})+\pi_{00}(C,X_{1})h_{00}(Y_{1}^{0,0},C,X_{1}).

Now take g(O)𝒯2O,g(O)\in\mathcal{T}_{2}^{O,\perp}, so that E[g(O)|Y~1,C,X1]=E[g(O)|C,X1]E\big{[}g(O)|\tilde{Y}_{1},C,X_{1}\big{]}=E\big{[}g(O)|C,X_{1}\big{]}. We will show that
E[g(O)|Y~1,A1,A2,C,X1]=E[g(O)|A1,A2,C,X1]E\big{[}g(O)|\tilde{Y}_{1},A_{1},A_{2},C,X_{1}\big{]}=E\big{[}g(O)|A_{1},A_{2},C,X_{1}\big{]}. Isolating h11(Y11,1,C,X1)h_{11}(Y_{1}^{1,1},C,X_{1}) in the equation
E[g(O)|Y~1,C,X1]=E[g(O)|C,X1]E\big{[}g(O)|\tilde{Y}_{1},C,X_{1}\big{]}=E\big{[}g(O)|C,X_{1}\big{]}, we have

h11(Y11,1,C,X1)=E[g(O)|C,X1]π10h10(Y11,0,C,X1)π01h01(Y10,1,C,X1)π00h00(Y10,0,C,X1)π11.h_{11}(Y_{1}^{1,1},C,X_{1})=\frac{E\big{[}g(O)|C,X_{1}\big{]}-\pi_{10}h_{10}(Y_{1}^{1,0},C,X_{1})-\pi_{01}h_{01}(Y_{1}^{0,1},C,X_{1})-\pi_{00}h_{00}(Y_{1}^{0,0},C,X_{1})}{\pi_{11}}.

Since the right hand side does not depend on Y11,1Y_{1}^{1,1}, this shows that h11(Y11,1,C,X1)h_{11}(Y_{1}^{1,1},C,X_{1}) does not depend on Y11,1Y_{1}^{1,1}. Therefore h11(y11,1,C,X1):=E[g(O)|Y11,1=y11,1,A1=1,A2=1,C,X1]h_{11}(y_{1}^{1,1},C,X_{1}):=E\big{[}g(O)|Y_{1}^{1,1}=y_{1}^{1,1},A_{1}=1,A_{2}=1,C,X_{1}\big{]} is constant in y11,1y_{1}^{1,1}, and hence E[g(O)|Y~1,A1=1,A2=1,C,X1]=E[g(O)|A1=1,A2=1,C,X1]E\big{[}g(O)|\tilde{Y}_{1},A_{1}=1,A_{2}=1,C,X_{1}\big{]}=E\big{[}g(O)|A_{1}=1,A_{2}=1,C,X_{1}\big{]}.

Similarly, h10h_{10}, h01h_{01}, and h00h_{00} are functions of C,X1C,X_{1} only, and therefore E[g(O)|Y~1,A1=a,A2=b,C,X1]=E[g(O)|A1=a,A2=b,C,X1]E\big{[}g(O)|\tilde{Y}_{1},A_{1}=a,A_{2}=b,C,X_{1}\big{]}=E\big{[}g(O)|A_{1}=a,A_{2}=b,C,X_{1}\big{]} for each a,ba,b. Hence E[g(O)|Y~1,A1,A2,C,X1]=E[g(O)|A1,A2,C,X1]E\big{[}g(O)|\tilde{Y}_{1},A_{1},A_{2},C,X_{1}\big{]}=E\big{[}g(O)|A_{1},A_{2},C,X_{1}\big{]} as claimed, which shows that 𝒯2O,W\mathcal{T}_{2}^{O,\perp}\subseteq W.

Finally, we will show that WS2W\subseteq S_{2}^{\perp}. Let g1(O)Wg_{1}(O)\in W, so that E[g1(O)|Y~1,A1,A2,C,X1]=E[g1(O)|A1,A2,C,X1]E\big{[}g_{1}(O)|\tilde{Y}_{1},A_{1},A_{2},C,X_{1}\big{]}=E\big{[}g_{1}(O)|A_{1},A_{2},C,X_{1}\big{]}. We show that g1(O)g_{1}(O) is orthogonal to every g2(Y1,A1,A2,C,X1)S2g_{2}(Y_{1},A_{1},A_{2},C,X_{1})\in S_{2}:

E[g1(O)\displaystyle E\big{[}g_{1}(O) g2(Y1,A1,A2,C,X1)]\displaystyle g_{2}(Y_{1},A_{1},A_{2},C,X_{1})\big{]}
=\displaystyle= E[E[g1(O)|Y~1,A1,A2,C,X1]g2(Y1,A1,A2,C,X1)]\displaystyle\ E\Big{[}E\big{[}g_{1}(O)|\tilde{Y}_{1},A_{1},A_{2},C,X_{1}\big{]}g_{2}(Y_{1},A_{1},A_{2},C,X_{1})\Big{]}
=\displaystyle= E[E[g1(O)|A1,A2,C,X1]g2(Y1,A1,A2,C,X1)]\displaystyle\ E\Big{[}E\big{[}g_{1}(O)|A_{1},A_{2},C,X_{1}\big{]}g_{2}(Y_{1},A_{1},A_{2},C,X_{1})\Big{]}
=\displaystyle= E[E[g1(O)|A1,A2,C,X1]E[g2(Y1,A1,A2,C,X1)|A1,A2,C,X1]=0]=0.\displaystyle\ E\Big{[}E\big{[}g_{1}(O)|A_{1},A_{2},C,X_{1}\big{]}\underbrace{E\big{[}g_{2}(Y_{1},A_{1},A_{2},C,X_{1})|A_{1},A_{2},C,X_{1}\big{]}}_{=0}\Big{]}=0.

This completes the proof of the claim that 𝒯2O=S2\mathcal{T}_{2}^{O}=S_{2}. The analogous result holds for 𝒯3O\mathcal{T}_{3}^{O} by symmetry. Therefore the observed data tangent space for Model 3 is the direct sum of 5 mutually orthogonal spaces 𝒯O=𝒯1O𝒯5O\mathcal{T}^{O}=\mathcal{T}_{1}^{O}\oplus\ldots\oplus\mathcal{T}_{5}^{O}, where

𝒯1O=\displaystyle\mathcal{T}_{1}^{O}= {h1(C,X1,X2):E[h1(C,X1,X2)]=0}\displaystyle\ \Big{\{}h_{1}(C,X_{1},X_{2}):E\big{[}h_{1}(C,X_{1},X_{2})\big{]}=0\Big{\}}
𝒯2O=\displaystyle\mathcal{T}_{2}^{O}= {h2(Y1,A1,A2,C,X1):E[h2(Y1,A1,A2,C,X1)|A1,A2,C,X1]=0}\displaystyle\ \Big{\{}h_{2}(Y_{1},A_{1},A_{2},C,X_{1}):E\big{[}h_{2}(Y_{1},A_{1},A_{2},C,X_{1})|A_{1},A_{2},C,X_{1}\big{]}=0\Big{\}}
𝒯3O=\displaystyle\mathcal{T}_{3}^{O}= {h3(Y2,A1,A2,C,X2):E[h3(Y2,A1,A2,C,X2)|A1,A2,C,X2]=0}\displaystyle\ \Big{\{}h_{3}(Y_{2},A_{1},A_{2},C,X_{2}):E\big{[}h_{3}(Y_{2},A_{1},A_{2},C,X_{2})|A_{1},A_{2},C,X_{2}\big{]}=0\Big{\}}
𝒯4O=\displaystyle\mathcal{T}_{4}^{O}= {(A1π1)h4(C,X1):h4(C,X1) any function }\displaystyle\ \Big{\{}(A_{1}-\pi_{1})h_{4}(C,X_{1}):h_{4}(C,X_{1})\mbox{ any function }\Big{\}}
𝒯5O=\displaystyle\mathcal{T}_{5}^{O}= {(A2π2)h5(C,X2):h5(C,X2) any function }.\displaystyle\ \Big{\{}(A_{2}-\pi_{2})h_{5}(C,X_{2}):h_{5}(C,X_{2})\mbox{ any function }\Big{\}}.
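Since 𝒯2O\mathcal{T}_{2}^{O} and 𝒯3O\mathcal{T}_{3}^{O} each consist of mean-zero residuals of conditional expectations, projection onto them takes the standard residual form; we record it here for convenience, as it is used repeatedly in the calculations below:

```latex
\Pi\big(g\,\big|\,\mathcal{T}_{2}^{O}\big)
  = E\big[g\,|\,Y_{1},A_{1},A_{2},C,X_{1}\big]-E\big[g\,|\,A_{1},A_{2},C,X_{1}\big],
\qquad
\Pi\big(g\,\big|\,\mathcal{T}_{3}^{O}\big)
  = E\big[g\,|\,Y_{2},A_{1},A_{2},C,X_{2}\big]-E\big[g\,|\,A_{1},A_{2},C,X_{2}\big].
```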

The efficient influence function for βja,b\beta_{j}^{a,b} in Model 3 is the projection of any influence function for βja,b\beta_{j}^{a,b} in Model 3, say φ3naive\varphi_{3}^{naive}, onto 𝒯O\mathcal{T}^{O}, where by orthogonality, Π(|𝒯O)=k=15Π(|𝒯kO)\Pi\big{(}\cdot|\mathcal{T}^{O}\big{)}=\sum_{k=1}^{5}\Pi\big{(}\cdot|\mathcal{T}_{k}^{O}\big{)}.

Because Model 3 is a subset of Model 1, the full-data influence function φZ=Yja,bβja,b\varphi^{Z}=Y_{j}^{a,b}-\beta_{j}^{a,b} from Model 1 is also a full-data influence function for Model 3. Following the steps outlined in Section 1.1, the element

g(O):=𝟙(Aj=a,A3j=b)P(Aj=a,A3j=b|C,X1,X2)(Yjβja,b)g^{*}(O):=\frac{\mathds{1}(A_{j}=a,A_{3-j}=b)}{P(A_{j}=a,A_{3-j}=b|C,X_{1},X_{2})}\Big{(}Y_{j}-\beta_{j}^{a,b}\Big{)}

is in the space Λ1O\Lambda_{1}^{O} for Model 3. In Model 3, Λ2O=Λ2=𝒯4O𝒯5O\Lambda_{2}^{O}=\Lambda_{2}=\mathcal{T}_{4}^{O}\oplus\mathcal{T}_{5}^{O}, so one influence function for βja,b\beta_{j}^{a,b} in Model 3 is:

φ3naive:=g(O)Π(g(O)|Λ2O)=g(O)Π(g(O)|𝒯4O)Π(g(O)|𝒯5O).\varphi_{3}^{naive}:=g^{*}(O)-\Pi\big{(}g^{*}(O)|\Lambda_{2}^{O})=g^{*}(O)-\Pi\big{(}g^{*}(O)|\mathcal{T}_{4}^{O}\big{)}-\Pi\big{(}g^{*}(O)|\mathcal{T}_{5}^{O}\big{)}.

Therefore, the efficient influence function for βja,b\beta_{j}^{a,b} in Model 3 is:

φ3(O;βja,b)=\displaystyle\varphi_{3}(O;\beta_{j}^{a,b})= Π(φ3naive|𝒯O)\displaystyle\ \Pi\big{(}\varphi_{3}^{naive}|\mathcal{T}^{O}\big{)}
=\displaystyle= Π({g(O)Π(g(O)|𝒯4O)Π(g(O)|𝒯5O)}|𝒯O)\displaystyle\ \Pi\Big{(}\big{\{}g^{*}(O)-\Pi\big{(}g^{*}(O)|\mathcal{T}_{4}^{O}\big{)}-\Pi\big{(}g^{*}(O)|\mathcal{T}_{5}^{O}\big{)}\big{\}}\ \big{|}\mathcal{T}^{O}\Big{)}
=\displaystyle= k=15Π({g(O)Π(g(O)|𝒯4O)Π(g(O)|𝒯5O)}|𝒯kO)\displaystyle\ \sum_{k=1}^{5}\ \Pi\Big{(}\big{\{}g^{*}(O)-\Pi\big{(}g^{*}(O)|\mathcal{T}_{4}^{O}\big{)}-\Pi\big{(}g^{*}(O)|\mathcal{T}_{5}^{O}\big{)}\big{\}}\ \big{|}\mathcal{T}_{k}^{O}\Big{)}
=\displaystyle= Π(g(O)|𝒯1O)+Π(g(O)|𝒯2O)+Π(g(O)|𝒯3O),\displaystyle\ \Pi\big{(}g^{*}(O)|\mathcal{T}_{1}^{O}\big{)}+\Pi\big{(}g^{*}(O)|\mathcal{T}_{2}^{O}\big{)}+\Pi\big{(}g^{*}(O)|\mathcal{T}_{3}^{O}\big{)}, (1)

where the last line follows because Π(Π(g(O)|𝒯kO)|𝒯kO)=Π(g(O)|𝒯kO)\Pi\Big{(}\Pi\big{(}g^{*}(O)|\mathcal{T}_{k}^{O}\big{)}|\mathcal{T}_{k}^{O}\Big{)}=\Pi\big{(}g^{*}(O)|\mathcal{T}_{k}^{O}\big{)} for each kk, and because Π(Π(g(O)|𝒯kO)|𝒯kO)=0\Pi\Big{(}\Pi\big{(}g^{*}(O)|\mathcal{T}_{k}^{O}\big{)}|\mathcal{T}_{k^{\prime}}^{O}\Big{)}=0 for any kkk\neq k^{\prime}, by orthogonality.

For notational convenience we take a=1,b=1a=1,b=1 in the following calculations; the calculations for other values of a,b=0,1a,b=0,1 are entirely analogous. We write πj=πj(C,Xj):=P(Aj=1|C,Xj)\pi_{j}=\pi_{j}(C,X_{j}):=P(A_{j}=1|C,X_{j}), π3j=π3j(C,X3j):=P(A3j=1|C,X3j)\pi_{3-j}=\pi_{3-j}(C,X_{3-j}):=P(A_{3-j}=1|C,X_{3-j}). By A4 and A6, P(Aj=1,A3j=1|C,X1,X2)=π1π2P(A_{j}=1,A_{3-j}=1|C,X_{1},X_{2})=\pi_{1}\pi_{2}. Then:

g(O)=A1A2π1π2(Yjβj1,1)=A1A2π1π2(Yj1,1βj1,1).g^{*}(O)=\frac{A_{1}A_{2}}{\pi_{1}\pi_{2}}\Big{(}Y_{j}-\beta_{j}^{1,1}\Big{)}=\frac{A_{1}A_{2}}{\pi_{1}\pi_{2}}\Big{(}Y_{j}^{1,1}-\beta_{j}^{1,1}\Big{)}.

We compute each of the 3 terms in (1):

Π(g(O)|𝒯1O)\displaystyle\Pi\Big{(}g^{*}(O)|\ \mathcal{T}_{1}^{O}\Big{)} =E[g(O)|C,X1,X2]\displaystyle=E\big{[}g^{*}(O)|C,X_{1},X_{2}\big{]}
=\displaystyle= E[A1A2π1π2(Yj1,1βj1,1)|C,X1,X2]\displaystyle\ E\left[\frac{A_{1}A_{2}}{\pi_{1}\pi_{2}}\Big{(}Y_{j}^{1,1}-\beta_{j}^{1,1}\Big{)}\ \big{|}C,X_{1},X_{2}\right]
=1π1π2E[A1A2|C,X1,X2]E[Yj1,1βj1,1|C,X1,X2]\displaystyle=\frac{1}{\pi_{1}\pi_{2}}E\Big{[}A_{1}A_{2}\ |\ C,X_{1},X_{2}\Big{]}E\Big{[}Y_{j}^{1,1}-\beta_{j}^{1,1}\ |\ C,X_{1},X_{2}\Big{]} A1
=E[Yj1,1βj1,1|C,X1,X2]\displaystyle=E\Big{[}Y_{j}^{1,1}-\beta_{j}^{1,1}\ |\ C,X_{1},X_{2}\Big{]} A4,A6\displaystyle\textbf{A4},\textbf{A6}
=E[Yj1,1βj1,1|C,Xj]\displaystyle=E\Big{[}Y_{j}^{1,1}-\beta_{j}^{1,1}\ |\ C,X_{j}\Big{]} A5
=E[Yj1,1|A1=1,A2=1,C,Xj]βj1,1\displaystyle=E\Big{[}Y_{j}^{1,1}|A_{1}=1,A_{2}=1,C,X_{j}\Big{]}-\beta_{j}^{1,1} B1
=E[Yj|A1=1,A2=1,C,Xj]βj1,1.\displaystyle=E\Big{[}Y_{j}|A_{1}=1,A_{2}=1,C,X_{j}\Big{]}-\beta_{j}^{1,1}. A3

We relabel 𝒯2O\mathcal{T}_{2}^{O} and 𝒯3O\mathcal{T}_{3}^{O} as 𝒯kO\mathcal{T}_{k^{*}}^{O} and 𝒯kO\mathcal{T}_{k^{\dagger}}^{O}, according to whether j=1j=1 or j=2j=2 in the parameter βja,b\beta_{j}^{a,b}. Set k:=j+1k^{*}:=j+1, so that 𝒯kO={hk(Yj,A1,A2,C,Xj):E[hk(Yj,A1,A2,C,Xj)|A1,A2,C,Xj]=0}\mathcal{T}_{k*}^{O}=\Big{\{}h_{k^{*}}(Y_{j},A_{1},A_{2},C,X_{j}):E\big{[}h_{k^{*}}(Y_{j},A_{1},A_{2},C,X_{j})|A_{1},A_{2},C,X_{j}\big{]}=0\Big{\}}, and k:=4jk^{\dagger}:=4-j, so 𝒯kO={hk(Y3j,A1,A2,C,X3j):E[hk(Y3j,A1,A2,C,X3j)|A1,A2,C,X3j]=0}\mathcal{T}_{k^{\dagger}}^{O}=\Big{\{}h_{k^{\dagger}}(Y_{3-j},A_{1},A_{2},C,X_{3-j}):E\big{[}h_{k^{\dagger}}(Y_{3-j},A_{1},A_{2},C,X_{3-j})|A_{1},A_{2},C,X_{3-j}\big{]}=0\Big{\}}. Then:

Π(g(O)\displaystyle\Pi\Big{(}g^{*}(O) |𝒯kO)\displaystyle\big{|}\ \mathcal{T}_{k^{*}}^{O}\Big{)}
=\displaystyle= E[g(O)|Yj,A1,A2,C,Xj]E[g(O)|A1,A2,C,Xj]\displaystyle\ E\big{[}g^{*}(O)|Y_{j},A_{1},A_{2},C,X_{j}\big{]}-E\big{[}g^{*}(O)|A_{1},A_{2},C,X_{j}\big{]}
=\displaystyle= A1A2πj(Yjβj1,1)E[1π3j|Yj,A1A2=1,C,Xj]\displaystyle\ \frac{A_{1}A_{2}}{\pi_{j}}\Big{(}Y_{j}-\beta_{j}^{1,1}\Big{)}E\left[\frac{1}{\pi_{3-j}}\ \Big{|}\ Y_{j},A_{1}A_{2}=1,C,X_{j}\right]-
A1A2πjE[Yj1,1βj1,1π3j|A1A2=1,C,Xj]\displaystyle\hskip 36.135pt\frac{A_{1}A_{2}}{\pi_{j}}E\left[\frac{Y_{j}^{1,1}-\beta_{j}^{1,1}}{\pi_{3-j}}\ \Big{|}\ A_{1}A_{2}=1,C,X_{j}\right]
=\displaystyle= A1A2πj(Yjβj1,1)E[1π3j(C,X3j) random in X3j only|Yj1,1,A1A2=1,C,Xj]\displaystyle\ \frac{A_{1}A_{2}}{\pi_{j}}\Big{(}Y_{j}-\beta_{j}^{1,1}\Big{)}E\bigg{[}\underbrace{\frac{1}{\pi_{3-j}(C,X_{3-j})}}_{\mbox{ random in }X_{3-j}\mbox{ only}}\ \Big{|}\ Y_{j}^{1,1},A_{1}A_{2}=1,C,X_{j}\bigg{]} A3
A1A2πjE[Yj1,1βj1,1|A1A2=1,C,Xj]E[1π3j|A1A2=1,C,Xj]\displaystyle\hskip 21.68121pt-\frac{A_{1}A_{2}}{\pi_{j}}E\Big{[}Y_{j}^{1,1}-\beta_{j}^{1,1}\ |A_{1}A_{2}=1,C,X_{j}\Big{]}E\left[\frac{1}{\pi_{3-j}}\Big{|}A_{1}A_{2}=1,C,X_{j}\right] B2
=\displaystyle= A1A2πj(Yjβj1,1)E[1π3j(C,X3j)|A1A2=1,C,Xj]\displaystyle\ \frac{A_{1}A_{2}}{\pi_{j}}\Big{(}Y_{j}-\beta_{j}^{1,1}\Big{)}E\left[\frac{1}{\pi_{3-j}(C,X_{3-j})}\ \Big{|}\ A_{1}A_{2}=1,C,X_{j}\right] B2
A1A2πj(E[Yj|A1A2=1,C,Xj]βj1,1)E[1π3j|A1A2=1,C,Xj].\displaystyle\hskip 21.68121pt-\frac{A_{1}A_{2}}{\pi_{j}}\Big{(}E\big{[}Y_{j}|A_{1}A_{2}=1,C,X_{j}\big{]}-\beta_{j}^{1,1}\Big{)}E\left[\frac{1}{\pi_{3-j}}\Big{|}A_{1}A_{2}=1,C,X_{j}\right]. A3

Using the fact that for binary ZZ, E[ZW|V]=E[W|Z=1,V]P(Z=1|V)E\big{[}ZW|V\big{]}=E[W|Z=1,V]P(Z=1|V), we have:

E[1π3j\displaystyle E\bigg{[}\frac{1}{\pi_{3-j}} |A1A2=1,C,Xj]\displaystyle\ \Big{|}\ A_{1}A_{2}=1,C,X_{j}\bigg{]}
=\displaystyle= 1P(A1A2=1|C,Xj)E[A1A2π3j|C,Xj]\displaystyle\ \frac{1}{P\big{(}A_{1}A_{2}=1|C,X_{j}\big{)}}E\left[\frac{A_{1}A_{2}}{\pi_{3-j}}\ \big{|}\ C,X_{j}\right]
=\displaystyle= 1P(A1A2=1|C,Xj)E[1π3jE[A1A2|C,Xj,X3j]|C,Xj]\displaystyle\ \frac{1}{P\big{(}A_{1}A_{2}=1|C,X_{j}\big{)}}E\left[\frac{1}{\pi_{3-j}}E\left[A_{1}A_{2}\big{|}\ C,X_{j},X_{3-j}\right]\ \big{|}\ C,X_{j}\right]
=\displaystyle= 1P(A1=1,A2=1|C,Xj)E[1π3j(πjπ3j)|C,Xj]\displaystyle\ \frac{1}{P(A_{1}=1,A_{2}=1|C,X_{j})}E\left[\frac{1}{\pi_{3-j}}(\pi_{j}\pi_{3-j})\big{|}\ C,X_{j}\right] A4,A6\displaystyle\textbf{A4},\textbf{A6}
=\displaystyle= 1P(A1=1,A2=1|C,Xj)πj(C,Xj)\displaystyle\ \frac{1}{P(A_{1}=1,A_{2}=1|C,X_{j})}\pi_{j}(C,X_{j})
=\displaystyle= 1πjP(A3j=1|C,Xj)πj(C,Xj)\displaystyle\ \frac{1}{\pi_{j}P(A_{3-j}=1|C,X_{j})}\pi_{j}(C,X_{j}) B7
=\displaystyle= 1P(A3j=1|C,Xj).\displaystyle\ \frac{1}{P(A_{3-j}=1|C,X_{j})}.

Therefore,

Π(g(O)|𝒯kO)=A1A2πj(C,Xj)P(A3j=1|C,Xj)(YjE[Yj|A1A2=1,C,Xj]).\Pi\Big{(}g^{*}(O)\big{|}\ \mathcal{T}_{k^{*}}^{O}\Big{)}=\frac{A_{1}A_{2}}{\pi_{j}(C,X_{j})P(A_{3-j}=1|C,X_{j})}\Big{(}Y_{j}-E\big{[}Y_{j}|A_{1}A_{2}=1,C,X_{j}\big{]}\Big{)}.
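The identity for binary ZZ invoked above, E[ZW|V]=E[W|Z=1,V]P(Z=1|V)E[ZW|V]=E[W|Z=1,V]P(Z=1|V), can be checked numerically. A minimal Monte Carlo sketch in Python; the data-generating choices below are hypothetical and purely illustrative:

```python
import numpy as np

# Monte Carlo check of E[Z W | V] = E[W | Z=1, V] * P(Z=1 | V) for binary Z.
rng = np.random.default_rng(0)
n = 200_000
V = rng.integers(0, 2, n)                    # binary conditioning variable
Z = rng.binomial(1, 0.3 + 0.4 * V)           # binary Z, P(Z=1|V) depends on V
W = 1.0 + V + Z + rng.normal(size=n)         # W correlated with both V and Z

results = {}
for v in (0, 1):
    mask = V == v
    lhs = np.mean(Z[mask] * W[mask])                      # E[Z W | V=v]
    rhs = np.mean(W[mask & (Z == 1)]) * np.mean(Z[mask])  # E[W|Z=1,V=v] P(Z=1|V=v)
    results[v] = (lhs, rhs)
    print(v, round(lhs, 3), round(rhs, 3))
```

The two empirical quantities agree within Monte Carlo error in each stratum of VV.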

Finally,

Π(g(O)\displaystyle\Pi\Big{(}g^{*}(O) |𝒯kO)\displaystyle\big{|}\ \mathcal{T}_{k^{\dagger}}^{O}\Big{)}
=\displaystyle= E[g(O)|Y3j,A1,A2,C,X3j]E[g(O)|A1,A2,C,X3j]\displaystyle\ E\big{[}g^{*}(O)|Y_{3-j},A_{1},A_{2},C,X_{3-j}\big{]}-E\big{[}g^{*}(O)|A_{1},A_{2},C,X_{3-j}\big{]}
=\displaystyle= A1A2π3jE[Yj1,1βj1,1πj(C,Xj)random in Yj1,1,Xj only|Y3j1,1,A1A2=1,C,X3j]\displaystyle\ \frac{A_{1}A_{2}}{\pi_{3-j}}E\bigg{[}\underbrace{\ \frac{Y_{j}^{1,1}-\beta_{j}^{1,1}}{\pi_{j}(C,X_{j})}\ }_{\mbox{random in }Y_{j}^{1,1},X_{j}\mbox{ only}}\ \Big{|}\ Y_{3-j}^{1,1},A_{1}A_{2}=1,C,X_{3-j}\bigg{]}
A1A2π3jE[Yj1,1βj1,1πj|A1A2=1,C,X3j]\displaystyle\hskip 36.135pt-\frac{A_{1}A_{2}}{\pi_{3-j}}E\left[\frac{Y_{j}^{1,1}-\beta_{j}^{1,1}}{\pi_{j}}\ \Big{|}\ A_{1}A_{2}=1,C,X_{3-j}\right]
=\displaystyle= A1A2π3jE[Yj1,1βj1,1πj(C,Xj)|A1A2=1,C,X3j]\displaystyle\ \frac{A_{1}A_{2}}{\pi_{3-j}}E\bigg{[}\frac{Y_{j}^{1,1}-\beta_{j}^{1,1}}{\pi_{j}(C,X_{j})}\ \Big{|}\ A_{1}A_{2}=1,C,X_{3-j}\bigg{]} B6
A1A2π3jE[Yj1,1βj1,1πj|A1A2=1,C,X3j]\displaystyle\hskip 36.135pt-\frac{A_{1}A_{2}}{\pi_{3-j}}E\left[\frac{Y_{j}^{1,1}-\beta_{j}^{1,1}}{\pi_{j}}\ \Big{|}\ A_{1}A_{2}=1,C,X_{3-j}\right]
=\displaystyle= 0.\displaystyle\ 0.

Therefore, the efficient influence function for βj1,1\beta_{j}^{1,1} in Model 3 is:

φ3(O;βj1,1)=\displaystyle\varphi_{3}(O;\beta_{j}^{1,1})= Π(g(O)|𝒯1O)+Π(g(O)|𝒯kO)+Π(g(O)|𝒯kO)\displaystyle\ \Pi\big{(}g^{*}(O)|\mathcal{T}_{1}^{O}\big{)}+\Pi\big{(}g^{*}(O)|\mathcal{T}_{k^{*}}^{O}\big{)}+\Pi\big{(}g^{*}(O)|\mathcal{T}_{k^{\dagger}}^{O}\big{)}
=\displaystyle= A1A2πj(C,Xj)P(A3j=1|C,Xj)(YjE[Yj|A1A2=1,C,Xj])\displaystyle\ \frac{A_{1}A_{2}}{\pi_{j}(C,X_{j})P(A_{3-j}=1|C,X_{j})}\Big{(}Y_{j}-E\big{[}Y_{j}|A_{1}A_{2}=1,C,X_{j}\big{]}\Big{)}
+E[Yj|A1=1,A2=1,C,Xj]βj1,1.\displaystyle\hskip 21.68121pt+E\Big{[}Y_{j}|A_{1}=1,A_{2}=1,C,X_{j}\Big{]}-\beta_{j}^{1,1}.

By exactly the same arguments, the efficient influence function for βja,b\beta_{j}^{a,b} in Model 3 is:

φ3(O;βja,b)=\displaystyle\varphi_{3}(O;\beta_{j}^{a,b})= 𝟙(Aj=a,A3j=b)P(Aj=a|C,Xj)P(A3j=b|C,Xj)(YjE[Yj|Aj=a,A3j=b,C,Xj])\displaystyle\ \frac{\mathds{1}(A_{j}=a,A_{3-j}=b)}{P(A_{j}=a|C,X_{j})P(A_{3-j}=b|C,X_{j})}\Big{(}Y_{j}-E\big{[}Y_{j}|A_{j}=a,A_{3-j}=b,C,X_{j}\big{]}\Big{)}
+E[Yj|Aj=a,A3j=b,C,Xj]βja,b.\displaystyle\hskip 21.68121pt+E\Big{[}Y_{j}|A_{j}=a,A_{3-j}=b,C,X_{j}\Big{]}-\beta_{j}^{a,b}.
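As a numerical sanity check, an efficient influence function should have mean zero when evaluated at the true parameter value and true nuisance functions. A hypothetical simulation sketch for j=1j=1, a=b=1a=b=1, with an illustrative data-generating process chosen to satisfy the model assumptions (none of these distributional choices come from the paper):

```python
import numpy as np

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

# Illustrative DGP consistent with the model structure (all coefficients hypothetical).
rng = np.random.default_rng(1)
n = 200_000
C  = rng.normal(size=n)
X1 = rng.normal(size=n)
X2 = rng.integers(0, 2, n).astype(float)     # binary X2 so psi has a closed form
pi1 = expit(0.5 * C + 0.5 * X1)              # P(A1=1|C,X1)
A1 = rng.binomial(1, pi1)
A2 = rng.binomial(1, expit(0.5 * C + 0.5 * X2))
Y1 = 1.0 + 0.7 * A1 + 0.4 * A2 + 0.5 * C + 0.3 * X1 + rng.normal(size=n)

beta = 2.1                                           # E[Y1^{1,1}] = 1.0 + 0.7 + 0.4
mu11 = 2.1 + 0.5 * C + 0.3 * X1                      # E[Y1|A1=1,A2=1,C,X1]
psi = 0.5 * (expit(0.5 * C) + expit(0.5 * C + 0.5)) # P(A2=1|C,X1): X2 independent of (C,X1)

# Efficient influence function for beta_1^{1,1}, evaluated at the true nuisances:
phi = A1 * A2 / (pi1 * psi) * (Y1 - mu11) + mu11 - beta
print(round(phi.mean(), 4))                          # near 0 up to Monte Carlo error
```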

Next we will show that the function φ3(O;βja,b)\varphi_{3}(O;\beta_{j}^{a,b}) given above is also the efficient influence function for βja,b\beta_{j}^{a,b} in Model 2. It suffices to show that φ3(O;βja,b)\varphi_{3}(O;\beta_{j}^{a,b}) is an influence function for βja,b\beta_{j}^{a,b} in Model 2, for then the projection of φ3(O;βja,b)\varphi_{3}(O;\beta_{j}^{a,b}) onto 𝒯O(M2)\mathcal{T}^{O}(M2) is the efficient influence function for βja,b\beta_{j}^{a,b} in Model 2. But Model 3 \subseteq Model 2, which implies 𝒯O(M3)𝒯O(M2)\mathcal{T}^{O}(M3)\subseteq\mathcal{T}^{O}(M2). Since φ3(O;βja,b)\varphi_{3}(O;\beta_{j}^{a,b}) is the efficient influence function for Model 3, we know that φ3(O;βja,b)𝒯O(M3)\varphi_{3}(O;\beta_{j}^{a,b})\in\mathcal{T}^{O}(M3). Therefore φ3(O;βja,b)𝒯O(M2)\varphi_{3}(O;\beta_{j}^{a,b})\in\mathcal{T}^{O}(M2), and so is its own projection onto 𝒯O(M2)\mathcal{T}^{O}(M2). Thus, once we show that φ3(O;βja,b)\varphi_{3}(O;\beta_{j}^{a,b}) is an influence function for βja,b\beta_{j}^{a,b} in Model 2, it will automatically follow that φ3(O;βja,b)\varphi_{3}(O;\beta_{j}^{a,b}) is the efficient influence function for βja,b\beta_{j}^{a,b} in Model 2.

To show that φ3(O;βja,b)\varphi_{3}(O;\beta_{j}^{a,b}) is an influence function for βja,b\beta_{j}^{a,b} in Model 2, let 𝒫\mathcal{P} be any regular parametric submodel of Model 2, parametrized say by γΓ𝒫r\gamma\in\Gamma_{\mathcal{P}}\subseteq\mathbb{R}^{r}, in such a way that γ=0\gamma=0 corresponds to the true distribution P0𝒫P_{0}\in\mathcal{P}. Write 𝒫={f𝒫(O;γ):γΓ𝒫}\mathcal{P}=\Big{\{}f_{\mathcal{P}}(O;\gamma):\gamma\in\Gamma_{\mathcal{P}}\Big{\}}, and let Sγ(O;𝒫)=logf𝒫(O;γ)γ|γ=0\displaystyle{S_{\gamma}(O;\mathcal{P})=\frac{\partial\log f_{\mathcal{P}}(O;\gamma)}{\partial\gamma}\Big{|}_{\gamma=0}} be the score vector for 𝒫\mathcal{P} evaluated at the truth. We will show that:

E[φ3(O;βja,b)Sγ(O;𝒫)]=βja,b(𝒫;γ)γ|γ=0E\bigg{[}\varphi_{3}(O;\beta_{j}^{a,b})\ S_{\gamma}(O;\mathcal{P})\Big{]}\ =\ \frac{\partial\beta_{j}^{a,b}(\mathcal{P};\gamma)}{\partial\gamma}\Big{|}_{\gamma=0} (2)

where the right hand side is the derivative of the functional βja,b=βja,b(𝒫;γ)\beta_{j}^{a,b}=\beta_{j}^{a,b}(\mathcal{P};\gamma), evaluated at the truth. By [25], φ3(O;βja,b)\varphi_{3}(O;\beta_{j}^{a,b}) is an influence function for βja,b\beta_{j}^{a,b} in Model 2 if and only if equation (2) is satisfied for every regular parametric submodel of Model 2.

Factoring the observed data likelihood, 𝒫\mathcal{P} can be parametrized as

𝒫={dF(c,x1,x2,a1,a2;γ1)dF(yj|c,xj,a1,a2;γ2)\mathcal{P}=\Big{\{}dF(c,x_{1},x_{2},a_{1},a_{2}\ ;\gamma_{1})\cdot dF(y_{j}|c,x_{j},a_{1},a_{2}\ ;\gamma_{2})\cdot
dF(y3j|c,x1,x2,a1,a2,yj;γ3):γ=(γ1,γ2,γ3)Γ1×Γ2×Γ3=Γ𝒫}\hskip 72.26999ptdF(y_{3-j}|c,x_{1},x_{2},a_{1},a_{2},y_{j}\ ;\gamma_{3}):\gamma=(\gamma_{1},\gamma_{2},\gamma_{3})\in\Gamma_{1}\times\Gamma_{2}\times\Gamma_{3}=\Gamma_{\mathcal{P}}\Big{\}}

where γ1,γ2,γ3\gamma_{1},\gamma_{2},\gamma_{3} are variation independent. We can partition the set of equations (2) into the 3 sets of equations

E[φ3(O;βja,b)Sγk(O;𝒫)]=βja,b(𝒫;γ)γk|γk=0E\bigg{[}\varphi_{3}(O;\beta_{j}^{a,b})\ S_{\gamma_{k}}(O;\mathcal{P})\Big{]}\ =\ \frac{\partial\beta_{j}^{a,b}(\mathcal{P};\gamma)}{\partial\gamma_{k}}\Big{|}_{\gamma_{k}=0} (3)

for k=1,2,3k=1,2,3. We will show that, for k=3k=3, both sides of (3) are zero, since neither the functional βja,b(𝒫;γ)\beta_{j}^{a,b}(\mathcal{P};\gamma) nor the influence function φ3(O;βja,b)\varphi_{3}(O;\beta_{j}^{a,b}) involves Y3jY_{3-j}. We will then show that (3) holds for k=1,2k=1,2 because Model 2 and Model 3 place identical restrictions on the features governed by γ1\gamma_{1} and γ2\gamma_{2}.

The identifying functional for βja,b(𝒫;γ)\beta_{j}^{a,b}(\mathcal{P};\gamma) is:

βja,b(𝒫;γ)=c,x1,x2yjyj𝑑F(yj|c,xj,aj=a,a3j=b;γ2)𝑑F(c,x1,x2;γ1).\beta_{j}^{a,b}(\mathcal{P};\gamma)=\int_{c,x_{1},x_{2}}\int_{y_{j}}y_{j}\cdot dF\big{(}y_{j}|c,x_{j},a_{j}=a,a_{3-j}=b\ ;\gamma_{2}\big{)}dF\big{(}c,x_{1},x_{2}\ ;\gamma_{1}\big{)}. (4)

Note that (4) varies in γ1,γ2\gamma_{1},\gamma_{2} only, so βja,b(𝒫;γ)γ3=0\displaystyle{\frac{\partial\beta_{j}^{a,b}(\mathcal{P};\gamma)}{\partial\gamma_{3}}=0}. Since Sγ3(O;𝒫)S_{\gamma_{3}}(O;\mathcal{P}) is the score for
dF(y3j|c,x1,x2,a1,a2,yj)dF(y_{3-j}|c,x_{1},x_{2},a_{1},a_{2},y_{j}), E[Sγ3(O;𝒫)|C,X1,X2,A1,A2,Yj]=0E\Big{[}S_{\gamma_{3}}(O;\mathcal{P})|C,X_{1},X_{2},A_{1},A_{2},Y_{j}\Big{]}=0. Note also that φ3(O;βja,b)\varphi_{3}(O;\beta_{j}^{a,b}) is a function of Yj,A1,A2,C,XjY_{j},A_{1},A_{2},C,X_{j} only. Therefore:

E[φ3(O;βja,b)Sγ3(O;𝒫)]=E[φ3(O;βja,b)E[Sγ3(O;𝒫)|C,X1,X2,A1,A2,Yj]=0]=0E\bigg{[}\varphi_{3}(O;\beta_{j}^{a,b})\ S_{\gamma_{3}}(O;\mathcal{P})\Big{]}=E\bigg{[}\varphi_{3}(O;\beta_{j}^{a,b})\underbrace{E\Big{[}S_{\gamma_{3}}(O;\mathcal{P})|C,X_{1},X_{2},A_{1},A_{2},Y_{j}\Big{]}}_{=0}\bigg{]}=0

which shows that (3) is satisfied for k=3k=3.

Now consider the submodel 𝒫𝒫\mathcal{P}^{\prime}\subseteq\mathcal{P} given by

𝒫:={dF(c,x1,x2,a1,a2;γ1)dF(yj|c,xj,a1,a2;γ2)\mathcal{P}^{\prime}:=\Big{\{}dF(c,x_{1},x_{2},a_{1},a_{2};\gamma_{1})\cdot dF(y_{j}|c,x_{j},a_{1},a_{2};\gamma_{2})\cdot
dF0(y3j|c,x1,x2,a1,a2,yj;γ3=0):γ=(γ1,γ2,0)Γ1×Γ2×{0}}\hskip 72.26999ptdF_{0}(y_{3-j}|c,x_{1},x_{2},a_{1},a_{2},y_{j};\gamma_{3}=0):\gamma^{\prime}=(\gamma_{1},\gamma_{2},0)\in\Gamma_{1}\times\Gamma_{2}\times\{0\}\Big{\}}

Note that for k=1,2k=1,2, 𝒫\mathcal{P} and 𝒫\mathcal{P}^{\prime} have the same scores, i.e. Sγk(O;𝒫)=Sγk(O;𝒫)S_{\gamma_{k}}(O;\mathcal{P})=S_{\gamma_{k}}(O;\mathcal{P}^{\prime}), and that βja,b(𝒫;γ)=βja,b(𝒫;γ)\beta_{j}^{a,b}(\mathcal{P};\gamma)=\beta_{j}^{a,b}(\mathcal{P}^{\prime};\gamma^{\prime}). We claim that 𝒫\mathcal{P}^{\prime} is a parametric submodel of Model 3: Model 2 and Model 3 impose exactly the same restrictions on dF(c,x1,x2,a1,a2)dF(c,x_{1},x_{2},a_{1},a_{2}), and neither imposes any restriction on dF(yj|c,xj,a1,a2)dF(y_{j}|c,x_{j},a_{1},a_{2}); the only additional restriction Model 3 imposes concerns dF(y3j|c,x1,x2,a1,a2,yj)dF(y_{3-j}|c,x_{1},x_{2},a_{1},a_{2},y_{j}), which in 𝒫\mathcal{P}^{\prime} is fixed at the true conditional law dF0dF_{0}, and the true distribution lies in Model 3. Hence, for each (γ1,γ2)Γ1×Γ2(\gamma_{1},\gamma_{2})\in\Gamma_{1}\times\Gamma_{2}, dF(c,x1,x2,a1,a2;γ1)dF(yj|c,xj,a1,a2;γ2)dF0(y3j|c,x1,x2,a1,a2,yj;γ3=0)dF(c,x_{1},x_{2},a_{1},a_{2};\gamma_{1})\cdot dF(y_{j}|c,x_{j},a_{1},a_{2};\gamma_{2})\cdot dF_{0}(y_{3-j}|c,x_{1},x_{2},a_{1},a_{2},y_{j};\gamma_{3}=0) lies in Model 3. Furthermore, taking γ1=γ2=0\gamma_{1}=\gamma_{2}=0 shows that 𝒫\mathcal{P}^{\prime} contains the true distribution.

Therefore, since φ3(O;βja,b)\varphi_{3}(O;\beta_{j}^{a,b}) is an influence function for βja,b\beta_{j}^{a,b} in Model 3, and 𝒫\mathcal{P}^{\prime} is a regular parametric submodel of Model 3,

E[φ3(O;βja,b)Sγk(O;𝒫)]=βja,b(𝒫;γ)γk|γk=0.E\bigg{[}\varphi_{3}(O;\beta_{j}^{a,b})\ S_{\gamma_{k}}(O;\mathcal{P}^{\prime})\Big{]}\ =\ \frac{\partial\beta_{j}^{a,b}(\mathcal{P}^{\prime};\gamma)}{\partial\gamma_{k}}\Big{|}_{\gamma_{k}=0}.

Since Sγk(O;𝒫)=Sγk(O;𝒫)S_{\gamma_{k}}(O;\mathcal{P})=S_{\gamma_{k}}(O;\mathcal{P}^{\prime}) for k=1,2k=1,2 and βja,b(𝒫;γ)=βja,b(𝒫;γ)\beta_{j}^{a,b}(\mathcal{P};\gamma)=\beta_{j}^{a,b}(\mathcal{P}^{\prime};\gamma^{\prime}), this shows that equation (3) holds for k=1,2k=1,2 as claimed. Therefore, φ3(O;βja,b)\varphi_{3}(O;\beta_{j}^{a,b}) is an influence function for βja,b\beta_{j}^{a,b} in Model 2, and hence the efficient influence function for βja,b\beta_{j}^{a,b} in Model 2.

Appendix C Double Robustness

Here we show that the Model 1 and Model 2 efficient estimators for βja,b=E[Yja,b]\beta_{j}^{a,b}=E\big{[}Y_{j}^{a,b}\big{]} are doubly robust.

We consider the Model 1 estimator first. Fix jj, aa, and bb, and let π(C,Xj,X3j;α)\pi(C,X_{j},X_{3-j};\alpha) be a model for the propensity score P(Aj=a,A3j=b|C,Xj,X3j)P(A_{j}=a,A_{3-j}=b|C,X_{j},X_{3-j}), and let μ(Aj,A3j,C,Xj,X3j;γ)\mu(A_{j},A_{3-j},C,X_{j},X_{3-j};\gamma) be a model for E[Yj|Aj,A3j,C,Xj,X3j]E\big{[}Y_{j}|A_{j},A_{3-j},C,X_{j},X_{3-j}\big{]}. These models may or may not be correctly specified, but suppose that the parameter estimates converge to some limits, say α^𝑝α\hat{\alpha}\overset{p}{\to}\alpha^{*} and γ^𝑝γ\hat{\gamma}\overset{p}{\to}\gamma^{*}. In the case where the models are correctly specified, say with α0\alpha_{0} and γ0\gamma_{0} corresponding to the truth, we would have α=α0\alpha^{*}=\alpha_{0} and γ=γ0\gamma^{*}=\gamma_{0}. Let π(Ci,Xij,Xi,3j;α^)\pi(C_{i},X_{ij},X_{i,3-j};\hat{\alpha}) and μ(a,b,Ci,Xij,Xi,3j;γ^)\mu(a,b,C_{i},X_{ij},X_{i,3-j};\hat{\gamma}) denote predicted values under these models, and consider the corresponding Model 1 estimator

β^1=1ni=1n{\displaystyle\widehat{\beta}_{1}=\ \frac{1}{n}\sum_{i=1}^{n}\Bigg{\{} 𝟙(Aij=a,Ai,3j=b)π(Ci,Xij,Xi,3j;α^)[Yijμ(a,b,Ci,Xij,Xi,3j;γ^)]\displaystyle\frac{\mathds{1}(A_{ij}=a,A_{i,3-j}=b)}{\pi(C_{i},X_{ij},X_{i,3-j};\hat{\alpha})}\Big{[}Y_{ij}-\mu\big{(}a,b,C_{i},X_{ij},X_{i,3-j};\hat{\gamma}\big{)}\Big{]}
+μ(a,b,Ci,Xij,Xi,3j;γ^)}.\displaystyle\hskip 72.26999pt+\mu\big{(}a,b,C_{i},X_{ij},X_{i,3-j};\hat{\gamma}\big{)}\Bigg{\}}.
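Before turning to the proof, the double robustness of this estimator can be illustrated by simulation: below, the outcome regression is deliberately misspecified while the joint propensity score is correct, yet the estimator still recovers E[Y₁^{1,1}]. All data-generating choices are hypothetical and purely illustrative:

```python
import numpy as np

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical DGP (illustrative coefficients); j=1, a=b=1, so E[Y1^{1,1}] = 2.1.
rng = np.random.default_rng(2)
n = 200_000
C, X1, X2 = rng.normal(size=(3, n))
pi1 = expit(0.5 * C + 0.5 * X1)              # P(A1=1|C,X1)
pi2 = expit(0.5 * C + 0.5 * X2)              # P(A2=1|C,X2)
A1, A2 = rng.binomial(1, pi1), rng.binomial(1, pi2)
Y1 = 1.0 + 0.7 * A1 + 0.4 * A2 + 0.5 * C + 0.3 * X1 + rng.normal(size=n)

pi = pi1 * pi2                # correct joint propensity score
mu_wrong = 0.2 * C - 1.0      # deliberately misspecified outcome regression
beta_hat1 = np.mean(A1 * A2 / pi * (Y1 - mu_wrong) + mu_wrong)
print(round(beta_hat1, 3))    # close to the true value 2.1 despite the wrong mu
```

Swapping the roles (correct μ, misspecified π) leads to the same conclusion, which is the content of the double robustness property proved below.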

We will show that β^1\hat{\beta}_{1} is consistent even if either π\pi or μ\mu is misspecified, as long as the other is correctly specified. We claim that β^1𝑝β\hat{\beta}_{1}\overset{p}{\longrightarrow}\beta^{*}, where

β=E\displaystyle\beta^{*}=\ E [𝟙(Aj=a,A3j=b)π(C,Xj,X3j;α)(Yjμ(a,b,C,Xj,X3j;γ))+μ(a,b,C,Xj,X3j;γ)].\displaystyle\Bigg{[}\ \frac{\mathds{1}(A_{j}=a,A_{3-j}=b)}{\pi(C,X_{j},X_{3-j};\alpha^{*})}\Big{(}Y_{j}-\mu(a,b,C,X_{j},X_{3-j};\gamma^{*})\Big{)}+\mu(a,b,C,X_{j},X_{3-j};\gamma^{*})\Bigg{]}.

To see this, we can rewrite β^1\hat{\beta}_{1} as

1ni=1n{𝟙(Aij=a,Ai,3j=b)π(Ci,Xij,Xi,3j;α)(Yijμ(a,b,Ci,Xij,Xi,3j;γ))+μ(a,b,Ci,Xij,Xi,3j;γ)}\frac{1}{n}\sum_{i=1}^{n}\Bigg{\{}\frac{\mathds{1}(A_{ij}=a,A_{i,3-j}=b)}{\pi(C_{i},X_{ij},X_{i,3-j};\alpha^{*})}\Big{(}Y_{ij}-\mu(a,b,C_{i},X_{ij},X_{i,3-j};\gamma^{*})\Big{)}+\mu(a,b,C_{i},X_{ij},X_{i,3-j};\gamma^{*})\Bigg{\}} (5)
+1ni=1n{\displaystyle+\ \ \frac{1}{n}\sum_{i=1}^{n}\Bigg{\{} [π(Ci,Xij,Xi,3j;α)π(Ci,Xij,Xi,3j;α^)π(Ci,Xij,Xi,3j;α)π(Ci,Xij,Xi,3j;α^)]𝟙(Aij=a,Ai,3j=b)×\displaystyle\left[\frac{\pi(C_{i},X_{ij},X_{i,3-j};\alpha^{*})-\pi(C_{i},X_{ij},X_{i,3-j};\hat{\alpha})}{\pi(C_{i},X_{ij},X_{i,3-j};\alpha^{*})\pi(C_{i},X_{ij},X_{i,3-j};\hat{\alpha})}\right]\mathds{1}(A_{ij}=a,A_{i,3-j}=b)\times
[Yijμ(a,b,Ci,Xij,Xi,3j;γ^)]\displaystyle\hskip 72.26999pt\Big{[}Y_{ij}-\mu(a,b,C_{i},X_{ij},X_{i,3-j};\hat{\gamma})\Big{]} (6)
+\displaystyle+ [1𝟙(Aij=a,Ai,3j=b)π(Ci,Xij,Xi,3j;α)][μ(a,b,Ci,Xij,Xi,3j;γ^)μ(a,b,Ci,Xij,Xi,3j;γ)]}\displaystyle\ \left[1-\frac{\mathds{1}(A_{ij}=a,A_{i,3-j}=b)}{\pi(C_{i},X_{ij},X_{i,3-j};\alpha^{*})}\right]\Big{[}\mu(a,b,C_{i},X_{ij},X_{i,3-j};\hat{\gamma})-\mu(a,b,C_{i},X_{ij},X_{i,3-j};\gamma^{*})\Big{]}\Bigg{\}}

where (5) is a sample mean of i.i.d. terms, and hence converges in probability to the expected value of each term by the law of large numbers, while (6) converges in probability to 0 because α^𝑝α\hat{\alpha}\overset{p}{\to}\alpha^{*} and γ^𝑝γ\hat{\gamma}\overset{p}{\to}\gamma^{*}.

Now we show that, if either π\pi or μ\mu is correctly specified, then β=E[Yja,b]\beta^{*}=E\big{[}Y_{j}^{a,b}\big{]}. If μ\mu is correctly specified, then μ(a,b,C,Xj,X3j;γ)=E[Yja,b|C,Xj,X3j]\mu(a,b,C,X_{j},X_{3-j};\gamma^{*})=E\big{[}Y_{j}^{a,b}|C,X_{j},X_{3-j}\big{]}. Conditioning on Yja,b,C,Xj,X3jY_{j}^{a,b},C,X_{j},X_{3-j}, we have:

β=\displaystyle\beta^{*}= E[Yja,bE[Yja,b|C,Xj,X3j]π(C,Xj,X3j;α)E[𝟙(Aj=a,A3j=b)|Yja,b,C,Xj,X3j]\displaystyle\ E\Bigg{[}\frac{Y_{j}^{a,b}-E\big{[}Y_{j}^{a,b}|C,X_{j},X_{3-j}\big{]}}{\pi(C,X_{j},X_{3-j};\alpha^{*})}E\Big{[}\mathds{1}(A_{j}=a,A_{3-j}=b)\ \big{|}Y_{j}^{a,b},C,X_{j},X_{3-j}\Big{]}
+E[Yja,b|C,Xj,X3j]]\displaystyle\hskip 72.26999pt+E\big{[}Y_{j}^{a,b}|C,X_{j},X_{3-j}\big{]}\Bigg{]}
=\displaystyle= E[Yja,bE[Yja,b|C,Xj,X3j]π(C,Xj,X3j;α)E[𝟙(Aj=a,A3j=b)|C,Xj,X3j]\displaystyle\ E\Bigg{[}\frac{Y_{j}^{a,b}-E\big{[}Y_{j}^{a,b}|C,X_{j},X_{3-j}\big{]}}{\pi(C,X_{j},X_{3-j};\alpha^{*})}E\Big{[}\mathds{1}(A_{j}=a,A_{3-j}=b)\ \big{|}C,X_{j},X_{3-j}\Big{]} by A1
+E[Yja,b|C,Xj,X3j]]\displaystyle\hskip 72.26999pt+E\big{[}Y_{j}^{a,b}|C,X_{j},X_{3-j}\big{]}\Bigg{]}
=\displaystyle= E[P(Aj=a,A3j=b|C,Xj,X3j)π(C,Xj,X3j;α)E[(Yja,bE[Yja,b|C,Xj,X3j])|C,Xj,X3j]=0\displaystyle\ E\Bigg{[}\frac{P(A_{j}=a,A_{3-j}=b|C,X_{j},X_{3-j})}{\pi(C,X_{j},X_{3-j};\alpha^{*})}\underbrace{E\Big{[}\Big{(}Y_{j}^{a,b}-E\big{[}Y_{j}^{a,b}|C,X_{j},X_{3-j}\big{]}\Big{)}\big{|}C,X_{j},X_{3-j}\Big{]}}_{=0}
+E[Yja,b|C,X1,X2]]\displaystyle\hskip 72.26999pt+E\big{[}Y_{j}^{a,b}|C,X_{1},X_{2}\big{]}\Bigg{]}
=\displaystyle= 0+E[Yja,b].\displaystyle\ 0+E\big{[}Y_{j}^{a,b}\big{]}.

If π\pi is correctly specified, then we have:

\begin{align*}
\beta^{*}=\ & E\Bigg[\frac{Y_{j}^{a,b}-\mu(a,b,C,X_{j},X_{3-j};\gamma^{*})}{\pi(C,X_{j},X_{3-j};\alpha_{0})}\,E\Big[\mathds{1}(A_{j}=a,A_{3-j}=b)\,\Big|\,Y_{j}^{a,b},C,X_{j},X_{3-j}\Big]\\
&\hspace{1cm}+\mu(a,b,C,X_{j},X_{3-j};\gamma^{*})\Bigg]\\
=\ & E\Bigg[\frac{Y_{j}^{a,b}-\mu(a,b,C,X_{j},X_{3-j};\gamma^{*})}{\pi(C,X_{j},X_{3-j};\alpha_{0})}\,P(A_{j}=a,A_{3-j}=b\,|\,C,X_{j},X_{3-j}) && \text{by A1}\\
&\hspace{1cm}+\mu(a,b,C,X_{j},X_{3-j};\gamma^{*})\Bigg]\\
=\ & E\Big[Y_{j}^{a,b}-\mu(a,b,C,X_{j},X_{3-j};\gamma^{*})+\mu(a,b,C,X_{j},X_{3-j};\gamma^{*})\Big] && \text{since } \pi \text{ is correct}\\
=\ & E\big[Y_{j}^{a,b}\big].
\end{align*}

Next we consider the Model 2 estimator. Now let $\pi(C,X_{j};\alpha)$ be a model for the propensity score $P(A_{j}=a\,|\,C,X_{j})$, let $\psi(C,X_{j};\theta)$ be a model for the propensity score $P(A_{3-j}=b\,|\,C,X_{j})$, and let $\mu(A_{j},A_{3-j},C,X_{j};\gamma)$ be a model for $E\big[Y_{j}\,|\,A_{j},A_{3-j},C,X_{j}\big]$. Suppose that the estimated parameters converge to some values, say $\hat{\alpha}\overset{p}{\to}\alpha^{*}$, $\hat{\theta}\overset{p}{\to}\theta^{*}$, and $\hat{\gamma}\overset{p}{\to}\gamma^{*}$. In the case where the models are correctly specified, denote the truth by $\alpha_{0}$, $\theta_{0}$, and $\gamma_{0}$, respectively. Let $\pi(C_{i},X_{ij};\hat{\alpha})$, $\psi(C_{i},X_{ij};\hat{\theta})$, and $\mu(a,b,C_{i},X_{ij};\hat{\gamma})$ denote predicted values under these models, and consider the corresponding Model 2 estimator:

\[
\widehat{\beta}_{2}=\frac{1}{n}\sum_{i=1}^{n}\Bigg\{\frac{\mathds{1}(A_{ij}=a,A_{i,3-j}=b)}{\pi(C_{i},X_{ij};\hat{\alpha})\,\psi(C_{i},X_{ij};\hat{\theta})}\Big(Y_{ij}-\mu(a,b,C_{i},X_{ij};\hat{\gamma})\Big)+\mu(a,b,C_{i},X_{ij};\hat{\gamma})\Bigg\}.
\]
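As a concrete sketch (our illustration, not the authors' implementation), the estimator can be computed from the predicted values of the three fitted models; here `ind_ab` holds the indicators $\mathds{1}(A_{ij}=a,A_{i,3-j}=b)$, and the argument names are ours:

```python
def model2_estimator(ind_ab, pi_hat, psi_hat, mu_hat, y):
    """Doubly robust Model 2 estimate of E[Y_j^{a,b}].

    ind_ab  : indicator 1{A_ij = a, A_i,3-j = b} for each pair i
    pi_hat  : predicted P(A_j = a | C, X_j)
    psi_hat : predicted P(A_{3-j} = b | C, X_j)
    mu_hat  : predicted E[Y_j | A_j = a, A_{3-j} = b, C, X_j]
    y       : observed outcomes Y_ij
    """
    n = len(y)
    total = 0.0
    for i in range(n):
        # inverse-probability-weighted residual plus outcome-model prediction
        ipw_term = ind_ab[i] / (pi_hat[i] * psi_hat[i]) * (y[i] - mu_hat[i])
        total += ipw_term + mu_hat[i]
    return total / n
```

When the outcome model fits perfectly ($Y_{ij}=\hat{\mu}$), the weighted residual term vanishes and the estimate reduces to the mean of the outcome-model predictions, which is the mechanism behind the double robustness shown above.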

We show that $\widehat{\beta}_{2}$ is consistent under misspecification of one or both of the propensity score models, provided the outcome regression model is correctly specified; and that $\widehat{\beta}_{2}$ is consistent under misspecification of the outcome regression model, provided that both propensity score models are correctly specified.

We have $\widehat{\beta}_{2}\overset{p}{\longrightarrow}\beta^{*}$, where

\[
\beta^{*}=E\Bigg[\frac{\mathds{1}(A_{j}=a,A_{3-j}=b)}{\pi(C,X_{j};\alpha^{*})\,\psi(C,X_{j};\theta^{*})}\Big(Y_{j}-\mu(a,b,C,X_{j};\gamma^{*})\Big)+\mu(a,b,C,X_{j};\gamma^{*})\Bigg].
\]

Now suppose first that the outcome regression model is correctly specified. Then $\mu(a,b,C,X_{j};\gamma^{*})=E\big[Y_{j}^{a,b}\,|\,C,X_{j}\big]$. Conditioning on $Y_{j}^{a,b},C,X_{j}$, we have:

\begin{align*}
\beta^{*}=\ & E\Bigg[\frac{Y_{j}^{a,b}-E\big[Y_{j}^{a,b}\,|\,C,X_{j}\big]}{\pi(C,X_{j};\alpha^{*})\,\psi(C,X_{j};\theta^{*})}\,E\Big[\mathds{1}(A_{j}=a,A_{3-j}=b)\,\Big|\,Y_{j}^{a,b},C,X_{j}\Big]+E\big[Y_{j}^{a,b}\,|\,C,X_{j}\big]\Bigg]\\
=\ & E\Bigg[\frac{Y_{j}^{a,b}-E\big[Y_{j}^{a,b}\,|\,C,X_{j}\big]}{\pi(C,X_{j};\alpha^{*})\,\psi(C,X_{j};\theta^{*})}\,E\Big[\mathds{1}(A_{j}=a,A_{3-j}=b)\,\Big|\,C,X_{j}\Big]+E\big[Y_{j}^{a,b}\,|\,C,X_{j}\big]\Bigg] && \text{by B1}\\
=\ & E\Bigg[\frac{P(A_{j}=a,A_{3-j}=b\,|\,C,X_{j})}{\pi(C,X_{j};\alpha^{*})\,\psi(C,X_{j};\theta^{*})}\underbrace{E\Big[Y_{j}^{a,b}-E\big[Y_{j}^{a,b}\,|\,C,X_{j}\big]\,\Big|\,C,X_{j}\Big]}_{=0}+E\big[Y_{j}^{a,b}\,|\,C,X_{j}\big]\Bigg]\\
=\ & 0+E\big[Y_{j}^{a,b}\big].
\end{align*}

Finally, if both propensity score models are correctly specified, then conditioning on $Y_{j}^{a,b},C,X_{j}$, we have:

\begin{align*}
\beta^{*}=\ & E\Bigg[\frac{Y_{j}^{a,b}-\mu(a,b,C,X_{j};\gamma^{*})}{\pi(C,X_{j};\alpha_{0})\,\psi(C,X_{j};\theta_{0})}\,E\Big[\mathds{1}(A_{j}=a,A_{3-j}=b)\,\Big|\,Y_{j}^{a,b},C,X_{j}\Big]+\mu(a,b,C,X_{j};\gamma^{*})\Bigg]\\
=\ & E\Bigg[\frac{Y_{j}^{a,b}-\mu(a,b,C,X_{j};\gamma^{*})}{\pi(C,X_{j};\alpha_{0})\,\psi(C,X_{j};\theta_{0})}\,E\Big[\mathds{1}(A_{j}=a,A_{3-j}=b)\,\Big|\,C,X_{j}\Big]+\mu(a,b,C,X_{j};\gamma^{*})\Bigg] && \text{by B1}\\
=\ & E\Bigg[\frac{Y_{j}^{a,b}-\mu(a,b,C,X_{j};\gamma^{*})}{\pi(C,X_{j};\alpha_{0})\,\psi(C,X_{j};\theta_{0})}\,P(A_{j}=a\,|\,C,X_{j})\,P(A_{3-j}=b\,|\,C,X_{j})+\mu(a,b,C,X_{j};\gamma^{*})\Bigg] && \text{by B7}\\
=\ & E\Big[Y_{j}^{a,b}-\mu(a,b,C,X_{j};\gamma^{*})+\mu(a,b,C,X_{j};\gamma^{*})\Big]\\
=\ & E\big[Y_{j}^{a,b}\big].
\end{align*}

Appendix D Data analysis and Simulations

Here we describe the models that were used in the data analysis. The Model 1 estimator uses predicted values from an outcome regression model and from a joint propensity score model. We use generalized additive models to model $E[Y_{j}\,|\,A_{j},A_{3-j},C,X_{j},X_{3-j}]$, the conditional mean of the twin's Drinking Index outcome, with the following model formulation:

Drinking.index ~ s(Parent.alcohol.abuse) + s(Parent.drug.abuse) +
  ti(Parent.alcohol.abuse, Parent.drug.abuse) + Parent.occupation.level +
  Sex + Zygosity + Academic.motivation + Sex*Academic.motivation +
  s(Conflict.with.parents) + Age + Exposure + Parent.alcohol.abuse*Exposure +
  Sex*Exposure + Cotwins.exposure + Zygosity*Cotwins.exposure + Exposure*Cotwins.exposure

where s( ) indicates a smooth function of the variable, and ti( , ) indicates an interaction term consisting of a smooth function of the two variables. To enforce symmetry between the Twin 1s and the Twin 2s, we fit this model on a stacked dataset combining the data for the Twin 1s and the data for the Twin 2s, so that the same fitted values of the regression parameters are used for both.
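The stacking step can be sketched as follows (a minimal illustration with hypothetical field names, not the authors' code): each pair contributes two rows, one per twin, with the "own" and "co-twin" covariates swapped, so that a single symmetric model can be fit to all rows.

```python
def stack_twin_pairs(pairs):
    """Turn one record per twin pair into two rows, one per twin.

    Each input record is a dict with shared covariates under 'shared' and
    per-twin entries x1/x2 (individual covariates), a1/a2 (exposures),
    and y1/y2 (outcomes).  Row j lists twin j's own values under
    'x'/'exposure'/'y' and the co-twin's under the 'cotwins_' names.
    """
    rows = []
    for p in pairs:
        for j, k in ((1, 2), (2, 1)):
            rows.append({
                "shared": p["shared"],               # e.g. parental covariates
                "x": p[f"x{j}"],                     # own individual covariates
                "cotwins_x": p[f"x{k}"],             # co-twin's covariates
                "exposure": p[f"a{j}"],
                "cotwins_exposure": p[f"a{k}"],
                "y": p[f"y{j}"],
            })
    return rows
```

Fitting one model to the stacked rows forces the regression coefficients to be identical for Twin 1 and Twin 2, which is the symmetry described above.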

For the joint propensity score model, we first use generalized additive models to model $\text{logit}\,P(A_{j}=1\,|\,C,X_{j})$:

Exposure ~ Parent.alcohol.abuse + Parent.drug.abuse +
  ti(Parent.alcohol.abuse, Parent.drug.abuse) + Externalizing.disorder +
  Academic.motivation + Conflict.with.parents +
  ti(Parent.alcohol.abuse, Conflict.with.parents) + s(Age), family=binomial

As before, we fit this propensity score model on a stacked dataset to enforce symmetry between the Twin 1s and the Twin 2s. We then obtain predicted values for the joint distribution $F(a,b):=P(A_{1}=a,A_{2}=b\,|\,C,X_{1},X_{2})$ from predicted values for the margins using a Dale model [2]. Under the Dale model, the joint distribution is determined by the two margins $F_{1}(0):=P(A_{1}=0\,|\,C,X_{1})$ and $F_{2}(0):=P(A_{2}=0\,|\,C,X_{2})$ together with a parameter for the association, the cross ratio $\psi=\frac{F(0,0)\times F(1,1)}{F(0,1)\times F(1,0)}\in(0,\infty)$, where $\psi=1$ corresponds to the case that $A_{1}$ and $A_{2}$ are conditionally independent. Here we allow the strength of the association between $A_{1}$ and $A_{2}$ to depend on zygosity and sex, assuming $\log(\psi_{i})=\alpha_{0}+\alpha_{1}\,\text{Zygosity}_{i}+\alpha_{2}\,\text{Sex}_{i}$. We estimate $\alpha=(\alpha_{0},\alpha_{1},\alpha_{2})$ by maximum likelihood, treating the margins as fixed. Then $F(0,0)$ is determined by the margins $F_{1}(0)$ and $F_{2}(0)$ and the estimated $\widehat{\alpha}$:

\[
F(0,0)=\frac{1}{2(\psi-1)}\Bigg\{\Big[1+(F_{1}(0)+F_{2}(0))(\psi-1)\Big]-\sqrt{\Big[1+(F_{1}(0)+F_{2}(0))(\psi-1)\Big]^{2}+4\psi(1-\psi)F_{1}(0)F_{2}(0)}\Bigg\}
\]
if $\psi\neq 1$, and $F(0,0)=F_{1}(0)F_{2}(0)$ otherwise; taking the root with the minus sign yields the solution that is a valid probability.
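This inversion of the cross ratio can be computed numerically as follows (a minimal sketch; function and variable names are ours). Note that $\psi=1$ must be handled separately, since the closed-form expression divides by $\psi-1$.

```python
import math

def dale_joint_00(f1_0, f2_0, psi):
    """P(A1 = 0, A2 = 0) under the Dale model, given the margins
    F1(0), F2(0) and the cross ratio psi.

    For psi != 1, this is the root of the defining quadratic that lies
    in [0, 1]; for psi = 1 the exposures are conditionally independent.
    """
    if abs(psi - 1.0) < 1e-12:
        return f1_0 * f2_0
    s = 1.0 + (f1_0 + f2_0) * (psi - 1.0)
    disc = s * s + 4.0 * psi * (1.0 - psi) * f1_0 * f2_0
    # the minus root is the one giving a valid probability
    return (s - math.sqrt(disc)) / (2.0 * (psi - 1.0))
```

As a sanity check, plugging the resulting cell probabilities back into the cross ratio recovers $\psi$, and $\psi\to 1$ recovers the independence product $F_{1}(0)F_{2}(0)$.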

The Model 2 estimator uses predicted values from three models. We use the same conditional mean model $E[Y_{j}\,|\,A_{j},A_{3-j},C,X_{j}]$ and the same propensity score model $P(A_{j}\,|\,C,X_{j})$ as above. Finally, we model $\text{logit}\,P(A_{j}=1\,|\,C,X_{3-j})$, where $P(A_{j}=1\,|\,C,X_{3-j})$ is the probability that the twin is exposed given the shared covariates and their co-twin's individual covariates, as:

Exposure ~ Parent.alcohol.abuse + Parent.drug.abuse +
  ti(Parent.alcohol.abuse, Parent.drug.abuse) +
  Cotwins.externalizing.disorder + Cotwins.academic.motivation +
  Cotwins.conflict.with.parents +
  ti(Parent.alcohol.abuse, Cotwins.conflict.with.parents) + s(Age), family=binomial

and fit this model on the stacked dataset.

Next, we describe the data-generating mechanisms used in our simulation study. Under the first data-generating mechanism, Model 1 is correctly specified but Model 2 is not. For each simulated dataset, we resample $n=500$ twin pairs from the Minnesota Twin Family Study (MTFS) data to generate shared and individual baseline covariates. Let the $s$th twin pair in the simulated data have covariates $(C_{s},X_{s1},X_{s2})$. We then use the model for $P(A_{1},A_{2}\,|\,C,X_{1},X_{2})$ described above to generate exposures, so that twin pair $s$ has exposures $A_{s1}=a,A_{s2}=b$ with probability $P(A_{1}=a,A_{2}=b\,|\,C=C_{s},X_{1}=X_{s1},X_{2}=X_{s2})$, the predicted value from the model fit on the MTFS data. The Drinking Index outcome in the MTFS data is approximately normal, so we generate outcomes from a bivariate normal distribution whose mean corresponds to the predicted values of the outcome regression model fit on the MTFS data, and whose variance-covariance matrix is chosen to approximate the variance and covariance seen in the MTFS data. In particular, we take outcomes for Twin 1 and Twin 2 to be more strongly correlated among MZ twins than among DZ twins. Specifically, for twin pair $s$, we draw

\[
(Y_{s1},Y_{s2})\ \sim\ MVN_{2}\!\left(\begin{bmatrix}\mu_{1}\\ \mu_{2}\end{bmatrix},\ \begin{pmatrix}8&\sigma\\ \sigma&8\end{pmatrix}\right),
\]

where $\mu_{j}=E[Y_{j}\,|\,A_{j}=A_{sj},A_{3-j}=A_{s,3-j},C=C_{s},X_{j}=X_{sj},X_{3-j}=X_{s,3-j}]$ is the predicted value from the model fit on the MTFS data, and where $\sigma=3.5$ if twin pair $s$ is MZ and $\sigma=1$ if twin pair $s$ is DZ.
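This outcome-generation step can be sketched as follows (our illustration, not the authors' code): the bivariate normal draw is implemented via the Cholesky factor of the zygosity-specific covariance matrix, with the means supplied as hypothetical inputs.

```python
import random

def draw_outcomes(mu1, mu2, is_mz, rng=random):
    """Draw (Y1, Y2) from a bivariate normal with marginal variance 8
    and covariance 3.5 for MZ pairs or 1 for DZ pairs."""
    var = 8.0
    cov = 3.5 if is_mz else 1.0
    z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    # Cholesky factor of [[var, cov], [cov, var]]
    l11 = var ** 0.5
    l21 = cov / l11
    l22 = (var - l21 * l21) ** 0.5
    y1 = mu1 + l11 * z1
    y2 = mu2 + l21 * z1 + l22 * z2
    return y1, y2
```

By construction, $\mathrm{Var}(Y_1)=\mathrm{Var}(Y_2)=8$ and $\mathrm{Cov}(Y_1,Y_2)=\sigma$, so MZ pairs are more strongly correlated than DZ pairs, as specified.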

For the second data-generating mechanism, under which Model 1 and Model 2 are both correctly specified, we modify the step in which we generate exposures: we draw $A_{s1}\sim\text{Bern}(p_{s1})$, where $p_{s1}=P(A_{1}=1\,|\,C=C_{s},X_{1}=X_{s1})$, and independently draw $A_{s2}\sim\text{Bern}(p_{s2})$, where $p_{s2}=P(A_{2}=1\,|\,C=C_{s},X_{2}=X_{s2})$.