Two-way fixed effects instrumental variable regressions in staggered DID-IV designs
Acknowledgments: I am grateful to Daiji Kawaguchi and Ryo Okui for their continued guidance and support. All errors are my own.
Many studies run two-way fixed effects instrumental variable (TWFEIV) regressions, leveraging variation in the timing of policy adoption across units as an instrument for treatment. This paper studies the properties of the TWFEIV estimator in staggered instrumented difference-in-differences (DID-IV) designs. We show that in settings with the staggered adoption of the instrument across units, the TWFEIV estimator can be decomposed into a weighted average of all possible two-group/two-period Wald-DID estimators. Under staggered DID-IV designs, a causal interpretation of the TWFEIV estimand hinges on the stable effects of the instrument on the treatment and the outcome over time. We illustrate the use of our decomposition theorem for the TWFEIV estimator through an empirical application.
1 Introduction
Instrumented difference-in-differences (DID-IV) is a method to estimate the effect of a treatment on an outcome, exploiting variation in the timing of policy adoption across units as an instrument for the treatment. In a simple setting with two groups and two periods, some units become exposed to the policy shock in the second period (exposed group), whereas others remain unexposed in both periods (unexposed group). The estimator is constructed by running the following IV regression with the group and post-time dummies as included instruments and the interaction of the two as the excluded instrument (e.g., Duflo (2001), Field (2007)):
The resulting IV estimand scales the DID estimand of the outcome by the DID estimand of the treatment and is called the Wald-DID estimand (de Chaisemartin and D’Haultfœuille (2018), Miyaji (2024)). In this two-group/two-period (2×2) setting, DID-IV designs mainly consist of a monotonicity assumption and parallel trends assumptions in the treatment and the outcome between the two groups, and allow the Wald-DID estimand to capture the local average treatment effect on the treated (LATET) (de Chaisemartin (2010), Hudson et al. (2017), and Miyaji (2024)). DID-IV designs have gained popularity over DID designs in practice when there is no control group or the treatment adoption is potentially endogenous over time (Miyaji (2024)).
In reality, however, most DID-IV applications go beyond the canonical DID-IV setup and leverage variation in the timing of policy adoption across units in more than two periods, instrumenting for the treatment with this natural variation. The instrument is constructed, for instance, from the staggered adoption of school reforms across countries or municipalities (e.g., Oreopoulos (2006), Lundborg et al. (2014), Meghir et al. (2018)), the phased introduction of Head Start across states (e.g., Johnson and Jackson (2019)), or the gradual adoption of broadband internet programs (e.g., Akerman et al. (2015), Bhuller et al. (2013)). These policy changes can be viewed as natural experiments, but they are not randomized.
Recently, Miyaji (2024) formalizes the underlying identification strategy as a staggered DID-IV design. In this design, the treatment adoption is allowed to be endogenous over time, while the instrument is required to be uncorrelated with time-varying unobservables in the treatment and the outcome; the assignment of the treatment can be non-staggered across units, while the assignment of the instrument is staggered across units: they are partitioned into mutually exclusive and exhaustive cohorts by the initial adoption date of the instrument. The target parameter is the cohort specific local average treatment effect on the treated (CLATT); this parameter measures the treatment effects among the units who belong to cohort and are induced to the treatment by instrument in a given relative period after the initial adoption of the instrument. The identifying assumptions are the natural generalization of those in DID-IV designs.
In practice, empirical researchers commonly implement this design via linear instrumental variable regressions with time and unit fixed effects, the so-called two-way fixed effects instrumental variable (TWFEIV) regressions (e.g., Black et al. (2005), Lundborg et al. (2017), Johnson and Jackson (2019)):
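$$Y_{i,t} = \phi_{i} + \lambda_{t} + \beta_{IV} D_{i,t} + v_{i,t}, \tag{1}$$
$$D_{i,t} = \gamma_{i} + \zeta_{t} + \pi Z_{i,t} + \eta_{i,t}. \tag{2}$$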
In contrast to the canonical DID-IV setup, however, the validity of running TWFEIV regressions is less clear under staggered DID-IV designs. The IV estimate is commonly interpreted as measuring the local average treatment effect in the presence of heterogeneous treatment effects as in Imbens and Angrist (1994), but the target parameter is not stated formally. We know little about how the IV estimator is constructed by comparing the evolution of the treatment and the outcome across units and over time. Finally, we have no tools to illustrate the identifying variations in the IV estimate in a given application.
In this paper, we study the properties of two-way fixed effects instrumental variable estimators under staggered DID-IV designs. Specifically, we present the decomposition result for the TWFEIV estimator, and study the causal interpretation of the TWFEIV estimand under staggered DID-IV designs.
First, we derive the decomposition theorem for the TWFEIV estimator with settings of the staggered adoption of the instrument across units. We show that the TWFEIV estimator is equal to a weighted average of all possible Wald-DID estimators arising from the three types of the DID-IV design. First, in an Unexposed/Exposed design, some units are never exposed to the instrument during the sample period (unexposed group), whereas some units start exposed at a particular date and remain exposed (exposed group). Second, in an Exposed/Not Yet Exposed design, some units start exposed earlier, whereas some units are not yet exposed during the design period (not yet exposed group). Finally, in an Exposed/Exposed Shift design, some units are already exposed, whereas some units start exposed later at a particular point during the design period (exposed shift group). The weight assigned to each Wald-DID estimator reflects all the identifying variations in each DID-IV design: the sample share, the variance of the instrument, and the DID estimator of the treatment between the two groups.
Building on the decomposition result, we next uncover the shortcomings of running TWFEIV regressions under staggered DID-IV designs. We show that the TWFEIV estimand potentially fails to summarize the treatment effects under staggered DID-IV designs due to negative weights. Specifically, we show that this estimand is equal to a weighted average of all possible cohort specific local average treatment effect on the treated (CLATT) parameters, but some weights can be negative. The negative weight problem potentially arises due to the "bad comparisons" (cf. Goodman-Bacon (2021)) performed by TWFEIV regressions: the already exposed units play the role of controls in the Exposed/Exposed Shift design in the first stage and reduced form regressions. Given this negative result, we also investigate the sufficient conditions for the TWFEIV estimand to attain its causal interpretation. We show that this estimand can be interpreted as causal only if the effects of the instrument on the treatment and the outcome are stable over time.
We extend our decomposition result in several directions. We first consider non-binary, ordered treatment. We also derive the decomposition result for the TWFEIV estimand in unbalanced panel settings. Lastly, we consider the case when the adoption date of the instrument is randomized across units. In all cases, we show that the TWFEIV estimand potentially fails to summarize the treatment effects under staggered DID-IV designs due to negative weights.
We illustrate our findings with the setting of Miller and Segal (2019), who estimate the effect of female police officers’ share on the intimate partner homicide rate, leveraging the timing variation of AA (affirmative action) plans across U.S. counties. In this application, we first assess the plausibility of the staggered DID-IV design implicitly imposed by Miller and Segal (2019) and confirm its validity. We then estimate TWFEIV regressions, slightly modifying the authors’ setting, and apply our DID-IV decomposition theorem to the IV estimate. We find that the estimate assigns more weight to the Unexposed/Exposed design and less weight to the other two types of the DID-IV design. Despite the small weight on the Exposed/Exposed Shift design, we also find that the IV estimate suffers from a substantial downward bias arising from the bad comparisons in the Exposed/Exposed Shift design.
Finally, we develop simple tools to examine how different specifications change TWFEIV estimates, and illustrate these by revisiting Miller and Segal (2019). In many empirical settings, researchers diverge from a simple TWFEIV regression as in equation (1) and estimate various specifications, for example by weighting or by including time-varying covariates. We follow Goodman-Bacon (2021) and decompose the difference between the two specifications into the changes in Wald-DID estimates, the changes in weights, and the interaction of the two. This decomposition result enables researchers to quantify the contribution of the changes in each term to the difference in the overall estimates. In addition, plotting the pairs of Wald-DID estimates and associated weights obtained from the two specifications allows researchers to investigate which components have a significant impact on these contributions.
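As an informal illustration of this three-way split, the following Python sketch writes the difference between two weighted averages as the sum of a term due to changed estimates, a term due to changed weights, and their interaction. The estimates and weights are hypothetical numbers chosen for illustration, and the function name `decompose_difference` is ours; the identity is the generic algebraic one and is not claimed to reproduce the exact notation used later in the paper.

```python
import numpy as np

def decompose_difference(est_a, w_a, est_b, w_b):
    """Split sum(w_b * est_b) - sum(w_a * est_a) into three parts."""
    est_a, w_a, est_b, w_b = (np.asarray(x, dtype=float)
                              for x in (est_a, w_a, est_b, w_b))
    due_to_estimates = np.sum(w_a * (est_b - est_a))        # weights held at spec A
    due_to_weights = np.sum((w_b - w_a) * est_a)            # estimates held at spec A
    interaction = np.sum((w_b - w_a) * (est_b - est_a))     # both change together
    return due_to_estimates, due_to_weights, interaction

# Hypothetical Wald-DID estimates and weights under two specifications.
est_a, w_a = [0.5, 1.0, -0.2], [0.6, 0.3, 0.1]
est_b, w_b = [0.6, 0.9, 0.1], [0.5, 0.3, 0.2]

parts = decompose_difference(est_a, w_a, est_b, w_b)
total = np.dot(w_b, est_b) - np.dot(w_a, est_a)
print(parts, total)
```

The three parts add up to the total difference by construction, so each can be read as a share of the overall change in the estimate.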
Overall, this paper establishes a negative result for TWFEIV estimators under staggered DID-IV designs with more than two periods, and provides tools to illustrate how serious that concern is in a given application. Specifically, our decomposition result for the TWFEIV estimator enables researchers to quantify the bias term arising from the bad comparisons in Exposed/Exposed Shift designs in the data. Fortunately, Miyaji (2024) recently proposes an alternative estimation method in staggered DID-IV designs that is robust to treatment effect heterogeneity. Using such an estimation method allows practitioners to avoid the issue of TWFEIV estimators in practice, and strengthens the credibility of their empirical findings.
The rest of the paper is organized as follows. The next subsection discusses the related literature. Section 2 presents our decomposition theorem for the TWFEIV estimator. Section 3 formally introduces staggered instrumented difference-in-differences designs. Section 4 presents the pitfalls of running TWFEIV regressions under staggered DID-IV designs, and explores the sufficient conditions for the TWFEIV estimand to attain its causal interpretation. Section 5 describes some of the extensions. Section 6 presents our empirical application. Section 7 explains how different specifications affect the difference in estimates, and Section 8 concludes. All proofs are given in the Appendix.
1.1 Related literature
Our paper is related to the recent DID-IV literature (de Chaisemartin (2010); Hudson et al. (2017); de Chaisemartin and D’Haultfœuille (2018); Miyaji (2024)). In this literature, de Chaisemartin (2010) first formalizes DID-IV designs and shows that a Wald-DID estimand identifies the local average treatment effect on the treated (LATET) if the parallel trends assumptions in the treatment and the outcome, and a monotonicity assumption, are satisfied. Hudson et al. (2017) also consider DID-IV designs with non-binary, ordered treatment settings. Building on the work in de Chaisemartin (2010), however, de Chaisemartin and D’Haultfœuille (2018) formalize DID-IV designs differently, and call them Fuzzy DID. Miyaji (2024) compares DID-IV to Fuzzy DID designs, points out the issues embedded in Fuzzy DID designs, and extends the DID-IV design to multiple period settings with the staggered adoption of the instrument across units, which the author calls staggered DID-IV designs. Miyaji (2024) also provides a reliable estimation method in staggered DID-IV designs that is robust to treatment effect heterogeneity.
In this paper, we contribute to the literature by showing the properties of two-way fixed effects instrumental variable estimators in staggered DID-IV designs. In reality, when empirical researchers implicitly rely on the staggered DID-IV design, they commonly implement this design via TWFEIV regressions (e.g., Black et al. (2005), Lundborg et al. (2014), Meghir et al. (2018)). This paper presents the issues of the conventional approach, and provides the sufficient conditions for the TWFEIV estimand to attain its causal interpretation.
Our paper is also related to a recent DID literature on the causal interpretation of two-way fixed effects (TWFE) regressions and their dynamic specifications under heterogeneous treatment effects (Athey and Imbens (2022); Borusyak et al. (2021); de Chaisemartin and D’Haultfœuille (2020); Goodman-Bacon (2021); Imai and Kim (2021); Sun and Abraham (2021)).
Specifically, this paper is closely connected to Goodman-Bacon (2021), who derives the decomposition theorem for the TWFE estimator with settings of the staggered adoption of the treatment across units. In this paper, we establish the decomposition theorem for the TWFEIV estimator with settings of the staggered adoption of the instrument across units, which is a natural generalization of Theorem 1 in Goodman-Bacon (2021).
This paper is also closely connected to de Chaisemartin and D’Haultfœuille (2020), who decompose the TWFE estimand and present the issue of using this estimand under DID designs: some weights assigned to the causal parameters in this estimand can be potentially negative. In their appendix, the authors also decompose the TWFEIV estimand and refer to the negative weight problem in this estimand. Specifically, they apply the decomposition theorem for the TWFE estimand to the numerator and denominator in the TWFEIV estimand respectively, and conclude that this estimand identifies the LATE as in Imbens and Angrist (1994) only if the effects of the instrument on the treatment and outcome are constant across groups and over time. However, their decomposition result for the TWFEIV estimand has some drawbacks. First, they do not formally state the target parameter and identifying assumptions in DID-IV designs. Second, their decomposition result is not based on the target parameter in DID-IV designs. Finally, the sufficient conditions for this estimand to be interpretable as a causal parameter are not well investigated.
In this paper, we investigate the causal interpretation of the TWFEIV estimand more clearly than that of de Chaisemartin and D’Haultfœuille (2020). Specifically, we first decompose the TWFEIV estimator into all possible Wald-DID estimators. We then formally introduce the target parameter and identifying assumptions in staggered DID-IV designs, built on the recent work in Miyaji (2024). This allows us to decompose the TWFEIV estimand into a weighted average of the target parameter in staggered DID-IV designs. Finally, we assess the causal interpretation of the TWFEIV estimand under a variety of restrictions on the effects of the instrument on the treatment and outcome, which clarifies the sufficient conditions for this estimand to attain its causal interpretation.
We note that this paper is distinct from the recent IV literature on the causal interpretation of two-stage least squares (TSLS) estimators with covariates under heterogeneous treatment effects (Słoczyński (2020), Blandhol et al. (2022)). These recent studies investigate the causal interpretation of the TSLS estimand with covariates under the random variation of the instrument conditional on covariates, and cast doubt on the LATE (or LATEs) interpretation of this estimand. In this literature, the identifying variations come from the assignment process of the instrument. In this paper, however, we investigate the causal interpretation of the TWFEIV estimand (where time and unit dummies can be viewed as covariates) under staggered DID-IV designs: our identifying variations mainly come from the parallel trends assumptions in the treatment and the outcome over time.
2 Instrumented difference-in-differences decomposition
In this section, we present a decomposition result for the two-way fixed effects instrumental variable (TWFEIV) estimator in multiple time period settings with the staggered adoption of the instrument across units.
2.1 Set up
We introduce the notation we use throughout this article. We consider a panel data setting with periods and units. For each and , let denote the outcome and denote the treatment status, and denote the instrument status. Let and denote the path of the treatment and the path of the instrument for unit , respectively. Throughout this article, we assume that are independent and identically distributed (i.i.d).
We make the following assumption about the assignment process of the instrument.
Assumption 1 (Staggered adoption for ).
For , where .
Assumption 1 requires that once units start exposed to the instrument, they remain exposed to that instrument afterward. In the DID literature, several recent papers impose this assumption on the adoption process of the treatment and sometimes call it the "staggered treatment adoption", see, e.g., Athey and Imbens (2022), Callaway and Sant’Anna (2021) and Sun and Abraham (2021).
Given Assumption 1, we can uniquely characterize the instrument path by the time period when unit is first exposed to the instrument, denoted by . If unit is not exposed to the instrument for all time periods, we define . Based on the initial exposure period , we can uniquely partition units into mutually exclusive and exhaustive cohorts for : all the units in cohort are first exposed to the instrument at time . Hereafter, to ease the notation, we assume that the data contain cohorts where , and define as the never exposed cohort .
Let be the relative sample share for cohort and let be the time share of the exposure to the instrument for cohort :
We also define to be the relative sample share between cohort and .
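As an informal illustration of this cohort structure, the following Python sketch partitions units by their first exposure period and computes each cohort's sample share and time share of exposure. The instrument matrix and the function name `cohort_summary` are hypothetical; the sketch assumes a balanced panel with a binary instrument stored as an (N, T) array and marks never exposed units with `np.inf`.

```python
import numpy as np

def cohort_summary(Z):
    """Summarize cohorts from an (N, T) binary instrument matrix Z."""
    Z = np.asarray(Z)
    # Staggered adoption: once exposed, a unit stays exposed.
    if np.any(np.diff(Z, axis=1) < 0):
        raise ValueError("instrument switches off; adoption is not staggered")
    # First exposure period (1-indexed); np.inf marks never exposed units.
    first = np.where(Z.any(axis=1), Z.argmax(axis=1) + 1, np.inf)
    summary = {}
    for e in np.unique(first):
        members = first == e
        summary[float(e)] = {
            "sample_share": float(members.mean()),   # share of units in the cohort
            "time_share": float(Z[members].mean()),  # share of periods exposed
        }
    return first, summary

# Hypothetical panel: 6 units observed over 4 periods.
Z = np.array([
    [0, 1, 1, 1],
    [0, 1, 1, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
])
first, summary = cohort_summary(Z)
for e, stats in summary.items():
    print(e, stats)
```

In this hypothetical panel, the cohort first exposed in period 2 is exposed in three of four periods, so its time share is 0.75, while the never exposed cohort has a time share of zero.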
In contrast to the staggered adoption of the instrument across units, we allow the general adoption process for the treatment: the treatment can potentially turn on/off repeatedly over time. de Chaisemartin and D’Haultfœuille (2020) and Imai and Kim (2021) consider the same setting in the recent DID literature.
The notations , , and represent the corresponding time window, respectively: , , and . Let be the sample mean of the random variable in cohort during the time window :
We define and analogously, representing the sample mean of the random variable in cohort during the time window and respectively.
2.2 Decomposing the TWFEIV estimator
We consider a TWFEIV regression in multiple time period settings with the staggered adoption of the instrument across units:
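$$Y_{i,t} = \phi_{i} + \lambda_{t} + \beta_{IV} D_{i,t} + v_{i,t}, \tag{3}$$
$$D_{i,t} = \gamma_{i} + \zeta_{t} + \pi Z_{i,t} + \eta_{i,t}. \tag{4}$$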
By substituting the first stage regression (4) into the structural equation (3), we obtain the reduced form regression:
$$Y_{i,t} = \psi_{i} + \theta_{t} + \alpha Z_{i,t} + u_{i,t}. \tag{5}$$
The ratio of the reduced form coefficient to the first stage coefficient yields the TWFEIV estimator. By the Frisch-Waugh-Lovell theorem, the IV estimator is equal to the ratio between the coefficient from regressing the outcome on the double-demeaning instrument variable and the coefficient from regressing the treatment on the same variable:
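$$\hat{\beta}_{IV} = \frac{\frac{1}{NT}\sum_{i}\sum_{t} \tilde{Z}_{i,t} Y_{i,t}}{\frac{1}{NT}\sum_{i}\sum_{t} \tilde{Z}_{i,t} D_{i,t}}, \tag{6}$$
where $\tilde{Z}_{i,t}$ is the double demeaning variable defined below:
$$\tilde{Z}_{i,t} = (Z_{i,t} - \bar{Z}_{i}) - (\bar{Z}_{t} - \bar{\bar{Z}}), \qquad \bar{Z}_{i} = \frac{1}{T}\sum_{t} Z_{i,t}, \quad \bar{Z}_{t} = \frac{1}{N}\sum_{i} Z_{i,t}, \quad \bar{\bar{Z}} = \frac{1}{NT}\sum_{i}\sum_{t} Z_{i,t}.$$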
Note that the TWFEIV regression runs the two-way fixed effects (TWFE) regression twice, as can be seen in equations (4) and (5). Because we assume the staggered assignment of the instrument across units, if we focus on the TWFE coefficient on in the first stage or the reduced form regression, we can show that it is equal to a weighted average of all possible DID estimators of the treatment or the outcome from the decomposition result for the TWFE estimator shown by Goodman-Bacon (2021).
Consider the simple setting where we have only two periods and two cohorts: one cohort is not exposed to the instrument during the two periods (), whereas the other cohort starts exposed to the instrument in the second period (). In this setting, the TWFEIV estimator takes the following form, the so-called Wald-DID estimator (de Chaisemartin and D’Haultfœuille (2018), Miyaji (2024)):
where is the sample mean of the random variable for cohort in time . This estimator scales the DID estimator of the outcome by the DID estimator of the treatment between cohort and .
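As an informal numerical check of this two-period case, the following Python sketch computes the TWFEIV estimate through the double-demeaned instrument and compares it with the Wald-DID ratio. The data-generating process, the sample size, and the function names `twfeiv` and `wald_did` are hypothetical and chosen only for illustration; the sketch assumes a balanced panel stored as (N, T) arrays.

```python
import numpy as np

def twfeiv(Y, D, Z):
    """TWFEIV estimate from (N, T) arrays via the double-demeaned instrument."""
    Z = np.asarray(Z, dtype=float)
    Z_dd = (Z - Z.mean(axis=1, keepdims=True)
              - Z.mean(axis=0, keepdims=True) + Z.mean())
    return np.sum(Z_dd * Y) / np.sum(Z_dd * D)

def wald_did(Y, D, exposed):
    """Two-period Wald-DID: DID of the outcome over DID of the treatment."""
    def did(X):
        change = X[:, 1] - X[:, 0]
        return change[exposed].mean() - change[~exposed].mean()
    return did(Y) / did(D)

rng = np.random.default_rng(0)
N = 200
exposed = np.arange(N) < 80          # hypothetical: first 80 units exposed in period 2
Z = np.column_stack([np.zeros(N), exposed.astype(float)])

# Hypothetical data-generating process, for illustration only.
D = (rng.random((N, 2)) < 0.2 + 0.5 * Z).astype(float)
Y = 1.0 + 2.0 * D + rng.normal(size=(N, 2))

print(twfeiv(Y, D, Z), wald_did(Y, D, exposed))
```

The two printed numbers agree up to floating-point error, because with two periods the double-demeaned instrument is proportional to the deviation of the exposure dummy from its mean, applied to first differences.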
The above observations give us intuition for how to decompose the TWFEIV estimator in settings with the staggered adoption of the instrument across units: we expect that the TWFEIV estimator can be decomposed into a weighted average of all possible Wald-DID estimators (instead of DID estimators).
To clarify this intuition, assume for now that we have only three cohorts, an early exposed cohort , a middle exposed cohort , and a never exposed cohort . Figure 1 plots the simulated data for the time trends of the average treatment (first stage) and the average outcome (reduced form) in three cohorts.


From the data structure, we can construct the Wald-DID estimator in three ways. First, we can compare the evolution of the treatment and the outcome between exposed cohort and never exposed cohort , exploiting the time window and , which we call an Unexposed/Exposed design:
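$$\hat{\beta}^{2\times 2}_{IV,kU} = \frac{\left(\bar{Y}_{k}^{POST(k)} - \bar{Y}_{k}^{PRE(k)}\right) - \left(\bar{Y}_{U}^{POST(k)} - \bar{Y}_{U}^{PRE(k)}\right)}{\left(\bar{D}_{k}^{POST(k)} - \bar{D}_{k}^{PRE(k)}\right) - \left(\bar{D}_{U}^{POST(k)} - \bar{D}_{U}^{PRE(k)}\right)}. \tag{7}$$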
Second, we can construct the Wald-DID estimator, leveraging variation in the timing of the initial exposure to the instrument between exposed cohorts. Consider an early exposed cohort and a middle exposed cohort . Before period , the early exposed cohort is already exposed to the instrument, while the middle exposed cohort is not yet exposed to the instrument. In this setting, we can view that the middle exposed cohort plays the role of the control group in both the first stage and the reduced form. From this observation, we can compare the evolution of the treatment and the outcome between the early exposed cohort and middle exposed cohort , exploiting the time window and , which we call an Exposed/Not Yet Exposed design:
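$$\hat{\beta}^{2\times 2,k}_{IV,kl} = \frac{\left(\bar{Y}_{k}^{MID(k,l)} - \bar{Y}_{k}^{PRE(k)}\right) - \left(\bar{Y}_{l}^{MID(k,l)} - \bar{Y}_{l}^{PRE(k)}\right)}{\left(\bar{D}_{k}^{MID(k,l)} - \bar{D}_{k}^{PRE(k)}\right) - \left(\bar{D}_{l}^{MID(k,l)} - \bar{D}_{l}^{PRE(k)}\right)}. \tag{8}$$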
Finally, if we focus on the middle exposed cohort , which changes the exposure status from being unexposed to being exposed at time , we can regard the early exposed cohort as the control group after time because this cohort is already exposed to the instrument at time . We can compare the evolution of the treatment and the outcome between early exposed cohort and middle exposed cohort , exploiting the time window and , which we call an Exposed/Exposed Shift design:
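$$\hat{\beta}^{2\times 2,l}_{IV,kl} = \frac{\left(\bar{Y}_{l}^{POST(l)} - \bar{Y}_{l}^{MID(k,l)}\right) - \left(\bar{Y}_{k}^{POST(l)} - \bar{Y}_{k}^{MID(k,l)}\right)}{\left(\bar{D}_{l}^{POST(l)} - \bar{D}_{l}^{MID(k,l)}\right) - \left(\bar{D}_{k}^{POST(l)} - \bar{D}_{k}^{MID(k,l)}\right)}. \tag{9}$$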
In each type of the DID-IV design, we have three sources of variation. First, each design exploits the subsample from all observations. The Unexposed/Exposed DID-IV design in (7) uses two cohorts and all time periods, indicating that the relative sample share is . The Exposed/Not Yet Exposed DID-IV design in (8) uses two cohorts but exploits only the time periods before period , so the relative sample share is . The Exposed/Exposed Shift DID-IV design in (9) uses two cohorts but exploits only the time periods after period , so the relative sample share is .
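As an informal illustration of the three designs, the following Python sketch builds noise-free cohort-level paths for an early exposed, a middle exposed, and a never exposed cohort, and computes the Wald-DID estimate in each design from window means. All numbers (exposure dates, effect sizes, the common trend) and the labels `k`, `l`, `U` are hypothetical and are not those behind Figure 1; the effects of the instrument are assumed stable over time.

```python
import numpy as np

T = 6
t = np.arange(1, T + 1)
k_date, l_date = 3, 5                       # hypothetical first exposure periods
Z_k = (t >= k_date).astype(float)
Z_l = (t >= l_date).astype(float)
trend = 0.05 * t                            # common trend in the outcome

# Cohort-average treatment and outcome paths (hypothetical, noise-free).
D = {"k": 0.1 + 0.6 * Z_k, "l": 0.1 + 0.3 * Z_l, "U": np.full(T, 0.1)}
Y = {"k": 1.0 + trend + 1.2 * Z_k, "l": 1.0 + trend + 0.9 * Z_l, "U": 1.0 + trend}

def change(path, pre, post):
    """Mean over the `post` periods minus mean over the `pre` periods (1-indexed)."""
    return path[np.array(post) - 1].mean() - path[np.array(pre) - 1].mean()

def wald_did(exposed, control, pre, post):
    """DID of the outcome divided by DID of the treatment for two cohorts."""
    did_y = change(Y[exposed], pre, post) - change(Y[control], pre, post)
    did_d = change(D[exposed], pre, post) - change(D[control], pre, post)
    return did_y / did_d

PRE_k, MID_kl, POST_l = [1, 2], [3, 4], [5, 6]
designs = {
    "Unexposed/Exposed (k vs U)": wald_did("k", "U", PRE_k, MID_kl + POST_l),
    "Unexposed/Exposed (l vs U)": wald_did("l", "U", PRE_k + MID_kl, POST_l),
    "Exposed/Not Yet Exposed (k vs l)": wald_did("k", "l", PRE_k, MID_kl),
    "Exposed/Exposed Shift (l vs k)": wald_did("l", "k", MID_kl, POST_l),
}
for name, estimate in designs.items():
    print(f"{name}: {estimate:.3f}")
```

Because the hypothetical effects are stable over time, the already exposed cohort shows no change between the middle and late windows, so the Exposed/Exposed Shift comparison recovers the ratio for the later cohort. With time-varying effects this comparison would no longer be clean.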
Second, the variation in each type of the DID-IV design partly comes from the variation of the instrument in its subsample. It is equal to the variance of the double demeaning variable in each design:
(10) | ||||
(11) | ||||
(12) |
where the , and represent the variance of the double demeaning variable in Unexposed/Exposed, Exposed/Not Yet Exposed, and Exposed/Exposed Shift DID-IV designs, respectively. In the staggered DID set up, Goodman-Bacon (2021) also describes the two variations, that is, the relative sample share and the variance of the double demeaning treatment variable in each type of the DID designs.
Unlike the staggered DID set up, however, each DID-IV design has an additional source of the variation; the effect of the instrument on the treatment in the first stage. This comes from the fact that each DID-IV design allows the noncompliance of receiving the treatment when units are exposed to the instrument. The amount of this variation is equal to the DID estimator of the treatment in each DID-IV design:
Note that the denominator of the TWFEIV estimator in (6), which we denote hereafter, measures the covariance between the instrument and the treatment in whole samples. By some calculations (see the proof of Theorem 1 below), one can show that is equal to a weighted average of all possible DID estimators of the treatment in each DID-IV design:
where the weights are:
Hereafter, we refer to , , and as the first stage weights. This decomposition result for is almost identical to that of Goodman-Bacon (2021) for the TWFE estimator under staggered DID designs, but the slight difference here is that each weight is not scaled by the variance of the double demeaning variable in whole samples.
We now present the decomposition theorem for the TWFEIV estimator under the staggered assignment of the instrument across units. Theorem 1 below is a generalization of the decomposition result for the TWFE estimator with settings of the staggered assignment of the treatment across units in Goodman-Bacon (2021).
Theorem 1 (Instrumented Difference-in-Differences Decomposition Theorem).
Suppose that there exist cohorts, . The data may also contain a never exposed cohort . Then, the two-way fixed effects instrumental variable estimator in (6) is a weighted average of all possible Wald-DID estimators.
The Wald-DID estimators are:
The weights are:
and sum to one, that is, we have .
Proof.
See Appendix A. ∎
Theorem 1 shows that when the assignment of the instrument is staggered across units, the TWFEIV estimator is a weighted average of all possible Wald-DID estimators. If there exist cohorts in the data, we have Wald-DID estimators, which come from either Exposed/Not Yet Exposed designs as in (8) or Exposed/Exposed Shift designs as in (9). If the data contain a never exposed cohort , we additionally have Wald-DID estimators, which come from Unexposed/Exposed designs as in (7). If both situations occur, the TWFEIV estimator equals a weighted average of Wald-DID estimators.
The weight assigned to each Wald-DID estimator consists of three parts: the relative sample share squared, the variance of the double demeaning variable , and the DID estimator of the treatment in each DID-IV design. The first part depends on the sample share of two cohorts and the timing of the initial exposure date. The second part reflects the variation of the instrument in the subsample, represented by (10)-(12), and depends on the relative sample share between two cohorts and the timing of the initial exposure date. Finally, the remaining part reflects variation in the evolution of the treatment between the two cohorts. Note that the weight is not guaranteed to be non-negative in finite sample settings: although the first and second parts are always non-negative, the DID estimator of the treatment can be potentially negative in the data.
Theorem 1 also shows that if we subset the data containing only two cohorts (cohorts and ), the TWFEIV estimator in the subsample can be written as:
The TWFEIV estimator is a weighted average of the Wald-DID estimators which come from either Exposed/Not Yet Exposed design or Exposed/Exposed Shift design, and the weight assigned to each Wald-DID estimator reflects the first stage weight and the DID estimator of the treatment in each DID-IV design.
To make the DID-IV decomposition theorem concrete, we provide a simple numerical example. Suppose we have three cohorts with equal sample size, as shown in Figure 1. In this figure, we set an early exposed period and a middle exposed period such that and . We assume that the effect of the instrument on the treatment is in cohort and in cohort over time. This means that the units in cohort are more induced to the treatment by the instrument than those in cohort and the effects are stable in both cohorts. The DID estimates of the treatment are . We also assume that the effect of the instrument on the outcome through treatment is in cohort and in cohort over time. The DID estimates of the outcome are . Dividing the DID estimate of the outcome by the DID estimate of the treatment yields the Wald-DID estimate: . The Wald-DID estimate is larger in cohort than in cohort , even though, as already noted, the effect of the instrument on the treatment is larger in cohort than in cohort .
The DID estimates of the treatment and the exposure timing determine the amount of the weight assigned to each Wald-DID estimate, holding the sample size equal across cohorts. In the above setting, the resulting weights are . In Unexposed/Exposed designs, we have for two reasons. First, the DID estimate of the treatment is larger in cohort than in cohort , that is, we have . Second, the time period is closer to the middle of the whole period than the time period , that is, we have , which implies in the first stage weight. By a similar argument, we have between Exposed/Not Yet Exposed and Exposed/Exposed Shift designs: we have and in the first stage weight. If the DID estimates of the treatment are equal between the two designs, the exposure timing matters: we have and . The DID estimates are the same in each comparison, that is, we have and . However, the different initial exposure dates yield different weights in the first stage, that is, we have and , which drives the difference between the two comparisons.
In this numerical example, the simple average of the Wald-DID estimates is and the weighted average is where the weight assigned to the Wald-DID estimate reflects the relative amount of the DID estimate of the treatment. The TWFEIV estimate, however, is because it assigns more weight to the smaller Wald-DID estimate.
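As an informal companion to this kind of example, the following Python sketch computes the TWFEIV estimate on a noise-free three-cohort panel and compares it with the simple average of the two cohort-level Wald ratios. The exposure dates, effect sizes, and cohort sizes are hypothetical and are not the values used in Figure 1; the treatment is stored as a cohort-level share rather than as binary draws so that the example stays noise-free.

```python
import numpy as np

T, n_per = 6, 10                                          # hypothetical panel dimensions
first_exposure = np.repeat([3.0, 5.0, np.inf], n_per)     # early, middle, never exposed
first_stage = np.repeat([0.6, 0.3, 0.0], n_per)           # effect on the treatment
reduced_form = np.repeat([1.2, 0.9, 0.0], n_per)          # effect on the outcome

t = np.arange(1, T + 1)
Z = (t[None, :] >= first_exposure[:, None]).astype(float)
D = 0.1 + first_stage[:, None] * Z
Y = 1.0 + 0.05 * t[None, :] + reduced_form[:, None] * Z

# Double-demeaned instrument: remove unit means and period means, add back the grand mean.
Z_dd = Z - Z.mean(axis=1, keepdims=True) - Z.mean(axis=0, keepdims=True) + Z.mean()
twfeiv = np.sum(Z_dd * Y) / np.sum(Z_dd * D)

wald_early, wald_middle = 1.2 / 0.6, 0.9 / 0.3            # cohort-level Wald ratios
simple_average = (wald_early + wald_middle) / 2
print(twfeiv, simple_average)
```

In this hypothetical setting the TWFEIV estimate lies between the two Wald ratios but below their simple average, since the cohort with the larger first stage effect, which here has the smaller Wald ratio, receives more weight.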
Theorem 1 is a decomposition result for the TWFEIV estimator and not for the estimand. Related to the work in this paper, de Chaisemartin and D’Haultfœuille (2020) decompose the TWFE estimand and present the issue regarding the use of this estimand under DID designs: some weights assigned to the causal parameters in this estimand can be potentially negative. In their appendix, the authors also decompose the TWFEIV estimand, and refer to the negative weight problem in this estimand. Specifically, they apply their decomposition theorem for the TWFE estimand to the numerator and the denominator of the TWFEIV estimand respectively, and conclude that this estimand identifies the local average treatment effect as in Imbens and Angrist (1994) only if the effects of the instrument on the treatment and the outcome are homogeneous across groups and over time. Indeed, the population coefficients on the instrument in the first stage and the reduced form regressions take the form of the TWFE estimand, so their decomposition theorem for the TWFE estimand is also applicable to the analysis of the TWFEIV estimand. However, their approach to decomposing the TWFEIV estimand has some drawbacks. First, they do not formally state the target parameter and identifying assumptions in DID-IV designs. Second, their decomposition for the TWFEIV estimand is not based on the target parameter in DID-IV designs. Finally, the sufficient conditions for this estimand to have a causal interpretation are not well explored.
In the following section, we explore the causal interpretation of the TWFEIV estimand under staggered DID-IV designs. In section 3, we first define the target parameter and identifying assumptions in staggered DID-IV designs. In section 4, based on the decomposition theorem for the TWFEIV estimator, we then provide the causal interpretation of the TWFEIV estimand under staggered DID-IV designs. Finally, we investigate the sufficient conditions for this estimand to attain its causal interpretation under staggered DID-IV designs.
3 Staggered instrumented difference-in-differences
In this section, we formalize staggered instrumented difference-in-differences (DID-IV) designs, building on the recent work in Miyaji (2024). We first introduce additional notation. We then define the target parameter and identifying assumptions in staggered DID-IV designs.
3.1 Notation
First, we introduce the potential outcomes framework. Let $Y_{i,t}(d, z)$ denote the potential outcome in period $t$ when unit $i$ receives the treatment path $d$ and the instrument path $z$. Similarly, let $D_{i,t}(z)$ denote the potential treatment status in period $t$ when unit $i$ receives the instrument path $z$.
Assumption 1 allows us to rewrite $D_{i,t}(z)$ by the initial adoption date $E_i$. Let $D_{i,t}^{e}$ denote the potential treatment status in period $t$ if unit $i$ is first exposed to the instrument in period $e$. Let $D_{i,t}^{\infty}$ denote the potential treatment status in period $t$ if unit $i$ is never exposed to the instrument. Hereafter, we call $D_{i,t}^{\infty}$ the "never exposed treatment". Since the adoption date of the instrument uniquely pins down one’s instrument path, we can write the observed treatment status $D_{i,t}$ for unit $i$ at time $t$ as
$$D_{i,t} = D_{i,t}^{\infty} + \sum_{1 \leq e \leq T} \left(D_{i,t}^{e} - D_{i,t}^{\infty}\right) \cdot \mathbf{1}\{E_i = e\}.$$
We define $D_{i,t} - D_{i,t}^{\infty}$ to be the effect of the instrument on the treatment for unit $i$ at time $t$, which is the difference between the observed treatment status $D_{i,t}$ and the never exposed treatment status $D_{i,t}^{\infty}$. Hereafter, we refer to $D_{i,t} - D_{i,t}^{\infty}$ as the individual exposed effect in the first stage. In the DID literature, Callaway and Sant’Anna (2021) and Sun and Abraham (2021) define the effect of the treatment on the outcome in the same fashion.
Next, we introduce the group variable, which describes the type of unit $i$ at time $t$ based on how its potential treatment choices at time $t$ respond to the instrument path $z$. Let $G_{i,e,t} \equiv (D_{i,t}^{\infty}, D_{i,t}^{e})$ be the group variable at time $t$ for unit $i$ and the initial exposure date $e$. Specifically, the first element $D_{i,t}^{\infty}$ represents the treatment status at time $t$ if unit $i$ is never exposed to the instrument, and the second element $D_{i,t}^{e}$ represents the treatment status at time $t$ if unit $i$ is first exposed to the instrument at $e$. Following the terminology in Imbens and Angrist (1994), we define $G_{i,e,t} = (0,0) \equiv NT_{e,t}$ to be the never-takers, $G_{i,e,t} = (1,1) \equiv AT_{e,t}$ to be the always-takers, $G_{i,e,t} = (0,1) \equiv CM_{e,t}$ to be the compliers, and $G_{i,e,t} = (1,0) \equiv DF_{e,t}$ to be the defiers at time $t$ and the initial exposure date $e$.
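The classification above can be sketched in a few lines of Python. This is only an illustration under the assumption of a binary treatment; the function name `compliance_type` and the five hypothetical units are not part of the paper.

```python
from collections import Counter


def compliance_type(d_never: int, d_exposed: int) -> str:
    """Map the pair (never exposed treatment, exposed treatment) to a type label."""
    labels = {(0, 0): "NT", (1, 1): "AT", (0, 1): "CM", (1, 0): "DF"}
    return labels[(d_never, d_exposed)]


# Hypothetical potential treatments for five units in one cohort at one period.
pairs = [(0, 0), (0, 1), (1, 1), (0, 1), (0, 0)]

# Share of each type in this cohort-period cell.
counts = Counter(compliance_type(d_never, d_exposed) for d_never, d_exposed in pairs)
shares = {label: count / len(pairs) for label, count in counts.items()}
```

In this toy cell there are no defiers, so the monotonicity assumption introduced later would hold for it.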
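Finally, we make a no carryover assumption on potential outcomes $Y_{i,t}(d, z)$.
Assumption 2 (No carryover assumption).
$$Y_{i,t}(d, z) = Y_{i,t}(d_t, z) \quad \text{for all } d, z, \text{ and } t,$$
where $d_t$ is the generic element of the treatment path $d$.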
This assumption requires that potential outcomes $Y_{i,t}(d, z)$ depend only on the current treatment status $d_t$ and the instrument path $z$. In the DID literature, several recent papers impose this assumption in settings with a non-staggered treatment; see, e.g., de Chaisemartin and D’Haultfœuille (2020) and Imai and Kim (2021). Although it may be possible to weaken this assumption by letting the potential outcomes $Y_{i,t}(d, z)$ depend on the full treatment path $d$, doing so requires cumbersome notation and complicates the definition of our target parameter, and is thus beyond the scope of this paper.
3.2 Target parameter in staggered DID-IV designs
Our target parameter in staggered DID-IV designs is the cohort specific local average treatment effect on the treated (CLATT) defined below.
Def.
The cohort specific local average treatment effect on the treated ($CLATT_{e,l}$) at a given relative period $l$ from the initial adoption of the instrument is
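Under the notation assumed here ($E_i$ for the initial adoption date, $Y_{i,t}(1)$ and $Y_{i,t}(0)$ for the treated and untreated potential outcomes, and $CM_{e,e+l}$ for the compliers in cohort $e$ at period $e+l$), one way to write this parameter that is consistent with the description in this section is:

$$CLATT_{e,l} = E\left[\,Y_{i,e+l}(1) - Y_{i,e+l}(0) \;\middle|\; E_i = e,\ CM_{e,e+l}\,\right].$$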
This parameter measures the treatment effects at a given relative period $l$ from the initial instrument adoption date $E_i = e$, for those who belong to cohort $e$ and are the compliers $CM_{e,e+l}$, that is, those who are induced to treatment by the instrument at time $e+l$. Each $CLATT_{e,l}$ can potentially vary across cohorts and over time, as it depends on cohort $e$, relative period $l$, and the compliers $CM_{e,e+l}$.
3.3 Identifying assumptions in staggered DID-IV designs
In this section, we state the identifying assumptions in staggered DID-IV designs based on Miyaji (2024).
Assumption 3 (Exclusion Restriction in multiple time periods).
Assumption 3 requires that the path of the instrument does not directly affect the potential outcome in any time period and that its effects operate only through the treatment. Given Assumption 2 and Assumption 3, we can write the potential outcome $Y_{i,t}(d, z)$ as $Y_{i,t}(d_t)$.
Here, we introduce the potential outcomes at time $t$ if unit $i$ is assigned to the instrument path $z$:
$$Y_{i,t}(D_{i,t}(z)) \equiv D_{i,t}(z)\, Y_{i,t}(1) + \left(1 - D_{i,t}(z)\right) Y_{i,t}(0).$$
Since the exposure timing completely determines the path of the instrument, we can write the potential outcomes for cohort $e$ and cohort $\infty$ as $Y_{i,t}(D_{i,t}^{e})$ and $Y_{i,t}(D_{i,t}^{\infty})$, respectively. The potential outcome $Y_{i,t}(D_{i,t}^{e})$ represents the outcome at time $t$ if unit $i$ is first exposed to the instrument at time $e$, and the potential outcome $Y_{i,t}(D_{i,t}^{\infty})$ represents the outcome at time $t$ if unit $i$ is never exposed to the instrument. Hereafter, we refer to $Y_{i,t}(D_{i,t}^{\infty})$ as the "never exposed outcome".
Assumption 4 (Monotonicity Assumption in multiple time periods).
This assumption requires that the instrument path affects the treatment adoption behavior in a monotone way for all relative periods after the initial exposure. Recall that we define $D_{i,t} - D_{i,t}^{\infty}$ to be the effect of the instrument on the treatment for unit $i$ at time $t$. Assumption 4 requires that the individual exposed effect in the first stage is non-negative (or non-positive) for all $i$ and all time periods after the initial exposure. This assumption implies that the group variable $G_{i,e,t}$ can take three values with non-zero probability for all $e$ and all $t \geq e$. Hereafter, we consider the type of monotonicity assumption that rules out the existence of the defiers $DF_{e,t}$ for all $t \geq e$ in any cohort $e$.
Assumption 5 (No anticipation in the first stage).
Assumption 5 requires that the potential treatment choice in any period before the initial exposure to the instrument is equal to the never exposed treatment. This assumption rules out anticipatory behavior before the initial exposure in the first stage.
Assumption 6 (Parallel Trends Assumption in the treatment in multiple time periods).
Assumption 6 is a parallel trends assumption in the treatment with multiple periods and multiple cohorts. This assumption requires that the trends of the treatment would have followed the same path across cohorts, on average, if there had been no exposure to the instrument. Assumption 6 is analogous to the assumptions of Callaway and Sant’Anna (2021) and Sun and Abraham (2021) in DID designs: both papers impose the same type of parallel trends assumption on untreated outcomes in settings with multiple periods and multiple cohorts.
Assumption 7 (Parallel Trends Assumption in the outcome in multiple time periods).
Assumption 7 is a parallel trends assumption in the outcome with multiple periods and multiple cohorts. This assumption requires that the expectation of the never exposed outcome would have followed the same evolution across cohorts if the assignment of the instrument had not occurred. Following the discussion in Miyaji (2024), this assumption can be interpreted as requiring the same expected time gain across cohorts and over time: the effects of time on the outcome through the treatment are the same on average across cohorts and over time.
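For reference, the verbal descriptions of Assumptions 3 to 7 can be collected in symbols. The display below is a sketch of one formalization consistent with those descriptions, not a quotation of the formal statements; it assumes the notation $E_i$ for the initial adoption date, $D_{i,t}^{e}$ and $D_{i,t}^{\infty}$ for the exposed and never exposed potential treatments, and $Y_{i,t}(\cdot)$ for potential outcomes, and it states monotonicity in the direction that rules out defiers.

$$
\begin{aligned}
&\text{Assumption 3:} && Y_{i,t}(d, z) = Y_{i,t}(d) \quad \text{for all } d,\ z,\ t, \\
&\text{Assumption 4:} && \Pr\left(D_{i,e+l}^{e} \geq D_{i,e+l}^{\infty}\right) = 1 \quad \text{for all } e \text{ and all } l \geq 0, \\
&\text{Assumption 5:} && D_{i,e+l}^{e} = D_{i,e+l}^{\infty} \quad \text{for all } e \text{ and all } l < 0, \\
&\text{Assumption 6:} && E\left[D_{i,t}^{\infty} - D_{i,s}^{\infty} \mid E_i = e\right] \text{ is the same for all } e, \text{ for all } s < t, \\
&\text{Assumption 7:} && E\left[Y_{i,t}(D_{i,t}^{\infty}) - Y_{i,s}(D_{i,s}^{\infty}) \mid E_i = e\right] \text{ is the same for all } e, \text{ for all } s < t.
\end{aligned}
$$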
4 Causal interpretation of the TWFEIV estimand
In this section, we explore the causal interpretation of the TWFEIV estimand under staggered DID-IV designs. In section 4.1, we define the main building block parameter in the first stage and reduced form regressions, respectively. In section 4.2, we interpret the TWFEIV estimand under staggered DID-IV designs and show that this estimand potentially fails to summarize the treatment effects. In section 4.3, given this negative result, we describe various restrictions on the main building block parameter in each stage regression. In section 4.4, as a preparation, we describe the causal interpretation of the denominator in the TWFEIV estimand under these restrictions. In section 4.5, we investigate the sufficient conditions for the TWFEIV estimand to attain a causal interpretation.
4.1 Main building block parameter in each stage regression
As mentioned in section 2, the TWFEIV regression runs a TWFE regression twice: once in the first stage and once in the reduced form. In this section, we define the main building block parameter in each stage regression.
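To make the two-regression structure concrete, the sketch below computes a TWFEIV estimate in a balanced panel as the ratio of the reduced form TWFE coefficient to the first stage TWFE coefficient, each obtained by double-demeaning the instrument (the Frisch-Waugh argument for a balanced panel). The function names and the toy data are hypothetical, and the treatment is taken to be a continuous intensity purely so that the arithmetic is transparent.

```python
def double_demean(x):
    """Remove unit and period means from a balanced panel x[i][t], adding back the grand mean."""
    n_units, n_periods = len(x), len(x[0])
    unit_mean = [sum(row) / n_periods for row in x]
    period_mean = [sum(x[i][t] for i in range(n_units)) / n_units for t in range(n_periods)]
    grand_mean = sum(unit_mean) / n_units
    return [
        [x[i][t] - unit_mean[i] - period_mean[t] + grand_mean for t in range(n_periods)]
        for i in range(n_units)
    ]


def twfe_coefficient(outcome, z):
    """Coefficient on z in a regression of outcome on z, unit dummies, and period dummies."""
    z_tilde = double_demean(z)
    cross = sum(zt * o for z_row, o_row in zip(z_tilde, outcome) for zt, o in zip(z_row, o_row))
    z_var = sum(zt * zt for z_row in z_tilde for zt in z_row)
    return cross / z_var


def twfeiv(y, d, z):
    """TWFEIV estimate: reduced form TWFE coefficient over first stage TWFE coefficient."""
    return twfe_coefficient(y, z) / twfe_coefficient(d, z)


# Hypothetical balanced panel: units first exposed in periods 1 and 2, plus one never exposed unit.
adoption = [1, 2, None]
periods = range(4)
z = [[1.0 if e is not None and t >= e else 0.0 for t in periods] for e in adoption]
# Illustrative treatment intensity with a constant first stage effect of 0.5.
d = [[0.1 * i + 0.05 * t + 0.5 * z[i][t] for t in periods] for i in range(len(adoption))]
# Outcome with a constant treatment effect of 2.0 plus unit and period effects.
y = [[1.0 * i + 0.3 * t + 2.0 * d[i][t] for t in periods] for i in range(len(adoption))]

estimate = twfeiv(y, d, z)
```

Because the effects in this toy panel are constant across units and periods, the ratio recovers the treatment effect; the remainder of section 4 concerns what happens when they are not.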
In the first stage regression, our building block parameter is the average of the individual exposed effect at a given relative period $l$ from the initial exposure to the instrument in cohort $e$. We call this the cohort specific average exposed effect on the treated in the first stage ($CAET^{1}_{e,l}$) defined below.
Def.
The cohort specific average exposed effect on the treated in the first stage ($CAET^{1}_{e,l}$) at a given relative period $l$ from the initial adoption of the instrument is
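Assuming the notation $E_i$ for the initial adoption date and $D_{i,t}^{\infty}$ for the never exposed treatment, a sketch of this definition as the cohort average of the individual exposed effect is:

$$CAET^{1}_{e,l} = E\left[\,D_{i,e+l} - D_{i,e+l}^{\infty} \;\middle|\; E_i = e\,\right].$$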
We use the superscript $1$ to make it clear that we define this parameter for the first stage regression. In the recent DID literature, Sun and Abraham (2021) define their main building block parameter in staggered DID designs in a similar fashion and call it the cohort specific average treatment effect on the treated. Callaway and Sant’Anna (2021) call the same parameter the group-time average treatment effect.
If the treatment is binary and the monotonicity assumption (Assumption 4) holds, $CAET^{1}_{e,l}$ is equal to the share of the compliers $CM_{e,e+l}$ in cohort $e$ at period $e+l$:
$$CAET^{1}_{e,l} = \Pr\left(CM_{e,e+l} \mid E_i = e\right).$$
In the reduced form regression, our building block parameter is the average of the individual effect of the instrument on the outcome through the treatment at a given relative period $l$ from the initial exposure to the instrument in cohort $e$. We call this the cohort specific average intention to exposed effect on the treated in the reduced form ($CAIET_{e,l}$) defined below.
Def.
The cohort specific average intention to exposed effect on the treated in the reduced form ($CAIET_{e,l}$) at a given relative period $l$ from the initial adoption of the instrument is
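Assuming the notation $Y_{i,t}(D_{i,t}^{e})$ and $Y_{i,t}(D_{i,t}^{\infty})$ for the exposed and never exposed outcomes and $E_i$ for the initial adoption date, a sketch of this definition is:

$$CAIET_{e,l} = E\left[\,Y_{i,e+l}(D_{i,e+l}^{e}) - Y_{i,e+l}(D_{i,e+l}^{\infty}) \;\middle|\; E_i = e\,\right].$$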
If we assume the identifying assumptions in staggered DID-IV designs (Assumptions 1 to 7), this parameter is equal to the product of $CLATT_{e,l}$ and $CAET^{1}_{e,l}$:
$$CAIET_{e,l} = CLATT_{e,l} \cdot CAET^{1}_{e,l}. \tag{13}$$
In other words, if we scale $CAIET_{e,l}$ in the reduced form by $CAET^{1}_{e,l}$ in the first stage, we obtain $CLATT_{e,l}$, which is the reason why we call this the cohort specific average "intention to exposed effect" on the treated in the reduced form.
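The product relationship can be checked numerically for a single cohort at a single relative period. The sketch below assumes a binary treatment and no defiers; all type shares and outcome distributions are hypothetical, and the variable names `caet`, `caiet`, and `clatt` stand for sample analogues of the three parameters.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

units = []
for _ in range(1000):
    u = random.random()
    d_never = 1 if u < 0.2 else 0      # treated even without exposure (always-takers)
    d_exposed = 1 if u < 0.6 else 0    # treated under exposure (always-takers and compliers)
    y0 = random.gauss(0.0, 1.0)        # untreated potential outcome
    y1 = y0 + random.gauss(1.5, 0.5)   # treated potential outcome
    units.append((d_never, d_exposed, y0, y1))

n = len(units)
compliers = [(y0, y1) for d_never, d_exposed, y0, y1 in units if d_exposed > d_never]

# First stage: average exposed effect, equal to the complier share when there are no defiers.
caet = sum(d_exposed - d_never for d_never, d_exposed, _, _ in units) / n
# Reduced form: average effect of exposure on the outcome through the treatment.
caiet = sum((d_exposed - d_never) * (y1 - y0) for d_never, d_exposed, y0, y1 in units) / n
# Target: average treatment effect among the compliers.
clatt = sum(y1 - y0 for y0, y1 in compliers) / len(compliers)
```

With a binary treatment, the individual reduced form effect equals the change in treatment times the individual treatment effect, so `caiet` and `clatt * caet` agree up to floating point error.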
4.2 Interpreting the TWFEIV estimand under staggered DID-IV designs
We now interpret the TWFEIV estimand under staggered DID-IV designs based on the DID-IV decomposition theorem derived in section 2 and the main building block parameters defined in the previous section. This section presumes the monotonicity assumption (Assumption 4) to clarify the interpretation of each notation defined below.
First, we introduce the additional notation. Let CLATT denote a weighted average of each CLATTk,t in the time window (with periods) where the weight reflects the relative amount of the exposed effect in the first stage in cohort at period :
The first equality holds because we have a binary treatment and assume the monotonicity assumption (Assumption 4). Each weight assigned to each reflects the relative share of the compliers at period in cohort during the time window . We call this the compliers weighted scheme. This would be one of the reasonable weighting schemes for two reasons. First, the weight is designed to be larger in the period when the proportion of the compliers is higher in cohort . Second, the sum of the weight is one by construction: the proportion of the compliers in each period in cohort is divided by the total amount of the compliers in the time window in cohort .
We also define the similar notation , in which the proportion of the compliers in cohort at period is divided by the time length :
We call this the time-corrected weighting scheme. In contrast to , the weight assigned to each can be inappropriate: each weight does not reflect the relative share of the compliers in cohort at period . In addition, the sum of each weight is not equal to one in general.
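The contrast between the two schemes can be seen in a small numerical sketch. It assumes that the compliers weighted scheme divides each period's complier share by the total complier share in the window, and that the time-corrected scheme divides each share by the number of periods in the window; the function names and the numbers are hypothetical.

```python
def compliers_weighted(clatt, cm_share):
    """Weight each period's effect by its complier share relative to the window total."""
    total = sum(cm_share)
    return sum(share / total * effect for share, effect in zip(cm_share, clatt))


def time_corrected(clatt, cm_share):
    """Weight each period's effect by its complier share divided by the window length."""
    window_length = len(cm_share)
    return sum(share / window_length * effect for share, effect in zip(cm_share, clatt))


# Hypothetical effects and complier shares for one cohort over a three-period window.
clatt = [1.0, 2.0, 3.0]
cm_share = [0.1, 0.2, 0.3]

cm_average = compliers_weighted(clatt, cm_share)
tc_average = time_corrected(clatt, cm_share)

# Sum of the weights under each scheme.
cm_weight_sum = sum(share / sum(cm_share) for share in cm_share)
tc_weight_sum = sum(share / len(cm_share) for share in cm_share)
```

In this example the compliers weighted average lies between the smallest and largest effect, while the time-corrected quantity falls below every individual effect because its weights sum to less than one.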
Theorem 2 below shows the probability limit of the TWFEIV estimator under staggered DID-IV designs (Assumptions 1-7).
Theorem 2.
Suppose Assumptions 1-7 hold. Then, the TWFEIV estimand consists of two terms:
where we define:
The weights and are the probability limit of and , respectively. The weight is the probability limit of . The specific expressions for each weight are shown in equations (49), (50), and (51) in Appendix B.
Proof.
See Appendix B. ∎
Theorem 2 shows that the TWFEIV estimand consists of two terms ( and ) and potentially fails to aggregate the treatment effects under staggered DID-IV designs.
The first term is a positively weighted average of each for the post-exposed period in cohort . We call this a weighted average cohort specific local average treatment effect on the treated () parameter. The first and the second terms in the use the compliers weighted scheme, but the third term in uses the time-corrected one.
Although can be a causal parameter, its magnitude may be difficult to interpret in practice for two reasons. First, the weight assigned to reflects only the sample share and the variation of the instrument, and does not reflect the variation of the treatment in the first stage. Because the other weights, and , reflect all the variation in each DID-IV design, this asymmetry can obscure what the magnitude of this parameter means in a given application. Second, the in the third term is a weighted average of for the post-exposed periods in cohort , but the weight assigned to each does not seem reasonable: it does not reflect the relative share of the compliers in period in cohort , and the weights do not sum to one.
The problem of the is due to the "bad comparisons" in the first stage TWFE regression: when we compare the evolution of the treatment in Exposed/Exposed Shift designs, we use already exposed cohorts as controls. In these comparisons, we should offset the DID estimator of the treatment in each weight in Exposed/Exposed Shift designs by the one appearing in the denominator of the corresponding Wald-DID estimator, which produces the weight and in the third term.
The second term is a weighted sum of the differences in the positively weighted average of each from the exposed period to before period and after period in the already exposed cohort . This term fails to properly aggregate the treatment effects because the is canceled out by the in each cohort . This problem arises due to the "bad comparisons" in the reduced form TWFE regression: when we compare the evolution of the outcome in Exposed/Exposed Shift designs, we use already exposed cohorts as controls. In these comparisons, we subtract their expected trends of unexposed potential outcomes and average intention exposed effects, which yields the .
Overall, this section shows that the TWFEIV estimand potentially fails to summarize the treatment effects under staggered DID-IV designs. In the next section, we first describe various restrictions on the main building block parameters in the first stage and reduced form regressions. Given these restrictions on exposed effect heterogeneity, we then explore the sufficient conditions for the TWFEIV estimand to be a causally interpretable parameter.
4.3 Restrictions on exposed effect heterogeneity
First, we describe the restrictions on the CAET in the first stage regression.
Assumption 8 (Exposed effect homogeneity across cohorts in the first stage).
For each relative period $l$, $CAET^{1}_{e,l}$ does not depend on cohort $e$ and is equal to $CAET^{1}_{l}$.
Assumption 8 requires that the exposed effects in the first stage depend only on the relative time period $l$ after the initial exposure to the instrument and do not depend on the cohort $e$. This assumption does not exclude dynamic effects of the instrument on the treatment, but requires that the exposed effects are the same across cohorts for all relative periods.
Assumption 9 (Stable exposed effect over time within cohort in the first stage).
For each cohort $e$, $CAET^{1}_{e,l}$ does not depend on the relative time period $l$ and is equal to $CAET^{1}_{e}$.
Assumption 9 rules out dynamic effects of the instrument on the treatment within cohort $e$ in the first stage regression. Assumption 9 permits heterogeneous exposed effects across cohorts, but requires homogeneous exposed effects over time after the initial adoption of the instrument within cohort $e$.
The recent DID literature imposes restrictions on treatment effects similar to Assumption 8 and Assumption 9. Sun and Abraham (2021) assume that "each cohort experiences the same path of treatment effects", which is in line with Assumption 8. Goodman-Bacon (2021) requires heterogeneous treatment effects to either be "constant over time but vary across units" or "vary over time but not across units". The former corresponds to Assumption 9 and the latter corresponds to Assumption 8.
Next, we describe the restrictions on $CAIET_{e,l}$ in the reduced form regression. In parallel with Assumption 8 and Assumption 9 on $CAET^{1}_{e,l}$, we consider Assumption 10 and Assumption 11 below.
Assumption 10 (Exposed effect homogeneity across cohorts in the reduced form).
For each relative period $l$, $CAIET_{e,l}$ does not depend on cohort $e$ and is equal to $CAIET_{l}$.
Assumption 11 (Stable exposed effect over time within cohort in the reduced form).
For each cohort $e$, $CAIET_{e,l}$ does not depend on the relative time period $l$ and is equal to $CAIET_{e}$.
Assumption 10 requires that the evolution of the average intention to exposed effect after the initial exposure is the same across cohorts. Assumption 11 requires that the average intention to exposed effects are stable over time in all relative periods within cohort $e$.
Note that given Assumption 8 and Assumption 10, we have the following restriction on $CLATT_{e,l}$, which follows from equation (13) in section 4.1.
Assumption 12 (Treatment effect homogeneity across cohorts for $CLATT_{e,l}$).
For each relative period $l$, $CLATT_{e,l}$ does not depend on cohort $e$ and is equal to $CLATT_{l}$.
Assumption 13 (Stable treatment effect over time within cohort for $CLATT_{e,l}$).
For each cohort $e$, $CLATT_{e,l}$ does not depend on the relative time period $l$ and is equal to $CLATT_{e}$.
4.4 The denominator in the TWFEIV estimand
In this section, we first interpret the denominator in the TWFEIV estimand under various restrictions considered in section 4.3. This section is a preparation for the next section, in which we analyze the TWFEIV estimand itself.
As already noted, the denominator of the TWFEIV estimator (see equation (6)) can be decomposed into a weighted average of all possible DID estimators of the treatment. In the following discussion, we show that, without additional restrictions, this estimand can fail to aggregate the effects of the instrument on the treatment in the first stage regression. We then briefly describe the interpretation of this estimand under Assumption 8 or Assumption 9, and state the implications.
First, we introduce the additional notation. Let CAET denote an equally weighted average of the CAET in the time window (with period length):
If we assume Assumption 4 (monotonicity assumption), CAET is an equally weighted average of the fraction of the compliers in cohort in the time window . For instance, the CAET is an equally weighted average of the CAET during the periods after the initial exposure date and rewritten as
Lemma 1 below shows the probability limit of the denominator under staggered DID-IV designs. This lemma is mainly based on the result of Goodman-Bacon (2021), who shows the probability limit of the two-way fixed effects estimator under staggered DID designs. The slight difference here is that each weight assigned to each CAET in is not divided by the probability limit of the grand mean .
Lemma 1.
Suppose Assumptions 1-7 hold. Then, the probability limit of the denominator of the TWFEIV estimator, consists of two terms:
where we define:
The weights , and are the probability limit of and defined in section 3 respectively, and are non-negative. The specific expressions in each weight are shown in equations (35)-(37) in Appendix B.
Proof.
See Appendix B. ∎
Lemma 1 shows that we can decompose into two terms. The first term is a positively weighted average of each CAET during the periods after the initial exposure in exposed cohorts, allowing for its causal interpretation. Following the terminology in Goodman-Bacon (2021), we call this a weighted average cohort specific exposed effect on the treated (WCAET) parameter.
The second term CAET1 is equal to the sum of the difference in the positively weighted average of exposed effect CAET from the exposed period to before period () and after period in the already exposed cohort . This term fails to properly aggregate the causal parameter in the first stage because some exposed effects are canceled out by other exposed effects.
Lemma 1 implies that if we assume only Assumptions 1-7, the probability limit of the denominator in the TWFEIV estimand, generally fails to properly summarize the exposed effects in the first stage due to the second term CAET1. This problem arises from the "bad comparisons" performed by the TWFE regression in the first stage: we treat the already exposed cohorts as control groups in the Exposed/Exposed Shift designs. In these comparisons, we subtract their expected trends of unexposed potential treatment choices and their expected exposed effects, which yields the second term CAET1. In the DID literature, Borusyak et al. (2021), de Chaisemartin and D’Haultfœuille (2020), and Goodman-Bacon (2021) point out the same issue for the TWFE estimand in staggered DID designs.
Based on the negative result shown in Lemma 1, we consider the restrictions on exposed effect heterogeneity in the first stage regression. The conclusion here is that properly aggregates each only if Assumption 9 holds, that is, the exposed effects are stable over time within cohort . Because Goodman-Bacon (2021) has already made the same point for the TWFE estimand, we only briefly summarize the interpretation of under Assumption 8 or Assumption 9 in the following. For a more detailed discussion, see section 3.1 in Goodman-Bacon (2021).
Interpreting under Assumption 8 only
Even when Assumption 8 holds, that is, the exposed effects are the same across cohorts but vary over time in the first stage, we have CAET in general. This implies that if we impose only Assumption 8, we cannot generally interpret the as measuring the positively weighted average of exposed effects in the first stage.
Interpreting under Assumption 9 only
If Assumption 9 holds, that is, the exposed effects are stable over time within cohort in the first stage, we have . This implies that the second term CAET1 is equal to zero:
Thus, simplifies to:
weights each CAET positively across cohorts under Assumption 9 only. We note, however, that each weight assigned to each CAET, is not equal to the sample share in cohort , but is a function of the sample share and the timing of the initial exposure date.
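The cancellation under Assumption 9 can be illustrated numerically. The sketch below assumes, following the timing windows in Goodman-Bacon (2021), that the second term compares an already exposed cohort's equally weighted average exposed effect after a later cohort's adoption with its average before that adoption. The function names and the effect paths are hypothetical.

```python
def window_mean(effects, start, stop):
    """Equally weighted average of effects over relative periods start, ..., stop - 1."""
    window = effects[start:stop]
    return sum(window) / len(window)


def bad_comparison_gap(effects, switch):
    """Average exposed effect of an early cohort after a later cohort adopts, minus before.

    effects[l] is the early cohort's exposed effect l periods after its own adoption, and
    switch is the number of periods between the two cohorts' adoption dates.
    """
    after = window_mean(effects, switch, len(effects))
    before = window_mean(effects, 0, switch)
    return after - before


# Stable exposed effects within the cohort (the case covered by Assumption 9).
stable_path = [0.5] * 6
# Exposed effects that grow with time since adoption (dynamic effects).
dynamic_path = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]

stable_gap = bad_comparison_gap(stable_path, switch=2)
dynamic_gap = bad_comparison_gap(dynamic_path, switch=2)
```

With a stable path the two window averages coincide and the gap is zero; with a growing path the gap is positive, which is the kind of quantity that contaminates the denominator when already exposed cohorts serve as controls.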
In this section, we have considered whether the denominator in the TWFEIV estimand properly aggregates the exposed effects in the first stage. Two implications follow. First, if we do not impose Assumption 9, the weight assigned to each Wald-DID in the TWFEIV estimand may not be properly normalized because the numerator in each weight is divided by , and the denominator potentially fails to aggregate the exposed effects in the first stage. Second, if we do not impose Assumption 9, some weights assigned to Wald-DID estimands can be negative. This is because the DID estimand of the treatment forms part of each weight and can be negative due to the "bad comparisons" in the first stage regression.
Given the discussion so far, we hereafter impose Assumption 9 whenever we consider restrictions on exposed effect heterogeneity in the first stage.
4.5 Interpreting the TWFEIV estimand under additional restrictions
We now describe the interpretation of the TWFEIV estimand under additional restrictions.
Interpretation under Assumption 9 only
First, we consider imposing Assumption 9 only, that is, we assume only the stable exposed effect over time in the first stage. If Assumption 9 holds, the simplifies to an equally weighted average of :
The weights each equally in the time window and the weights sum to one by construction. We call this an equal weighting scheme.
Lemma 2 presents the interpretation of the TWFEIV estimand under staggered DID-IV designs and Assumption 9.
Lemma 2.
Suppose Assumptions 1-7 hold. If Assumption 9 holds additionally, the TWFEIV estimand consists of two terms:
where we define:
The weights , and are the probability limit of , and respectively, and are non-negative. The specific expressions for these weights are shown in equations (54), (55), and (59) in Appendix B. The weight is already defined in Theorem 2.
Proof.
See Appendix B. ∎
Lemma 2 shows that Assumption 9 is not sufficient for the TWFEIV estimand to attain its causal interpretation. If the exposed effects in the first stage are stable over time, we can interpret the first term causally and its interpretation seems clear: this parameter is a positively weighted average of each and each weight assigned to each reflects all the variations in each DID-IV design. However, the second term still remains, which contaminates the causal interpretation of the TWFEIV estimand.
Interpretation under Assumption 9 and Assumption 10
Interpretation under Assumption 9 and Assumption 11
As we already noted in section 4.3, if we additionally impose Assumption 9 and Assumption 11, we have Assumption 13, that is, holds. Then, we obtain the following lemma.
Lemma 3.
Proof.
See Appendix B. ∎
If Assumption 9 and Assumption 11 are satisfied, the TWFEIV estimand is a positively weighted average of each across exposed cohorts, which implies that we can interpret this estimand causally. At the same time, we note that the weight assigned to each is not determined solely by the cohort share and the fraction of the compliers: it is a function of the cohort share, the fraction of the compliers, and the timing of the initial exposure to the instrument.
5 Extensions
This section briefly describes extensions of the results in section 4. We consider a non-binary, ordered treatment and unbalanced panel settings. We also consider the case in which the adoption date of the instrument is randomized across units. For the proofs and detailed discussions, see Appendix C.
Non-binary, ordered treatment
Up to now, we have considered only the case of a binary treatment. When treatment takes a finite number of ordered values, , our target parameter in staggered DID-IV designs is the cohort specific average causal response on the treated (CACRT) defined below.
Def.
The cohort specific average causal response on the treated (CACRT) at a given relative period from the initial adoption of the instrument is
where the weights are:
The CACRT is a weighted average of the effect of a unit increase in treatment on the outcome, for those who are in cohort $e$ and are induced to increase treatment by the instrument at a relative period $l$ after the initial exposure. This parameter is similar to the average causal response (ACR) considered in Angrist and Imbens (1995), but the difference here is that there exist dynamic effects in the first stage, and each weight and the associated causal parameters in CACRT are conditioned on $E_i = e$.
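To illustrate how such weights behave, the sketch below assumes that they take the Angrist and Imbens (1995) form within a cohort and relative period: the weight on treatment level $j$ is the share of units whose exposed treatment is at least $j$ while their never exposed treatment is below $j$, normalized to sum to one. This form is an assumption made here for illustration, and the function name and data are hypothetical.

```python
def acr_weights(pairs, max_level):
    """ACR-style weights over treatment levels 1, ..., max_level.

    pairs holds (never exposed treatment, exposed treatment) for units in one cohort
    at one relative period, with exposed >= never exposed for every unit.
    """
    n = len(pairs)
    crossing = [
        sum(1 for d_never, d_exposed in pairs if d_exposed >= j > d_never) / n
        for j in range(1, max_level + 1)
    ]
    total = sum(crossing)
    return [share / total for share in crossing]


# Hypothetical ordered treatments taking values 0, 1, or 2.
pairs = [(0, 1), (0, 2), (1, 2), (1, 1), (2, 2)]
weights = acr_weights(pairs, max_level=2)
```

Units whose treatment does not respond to the instrument, such as the last two in this example, receive no weight at any level.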
Unbalanced panel case
Throughout sections 2 to 4, we have considered a balanced panel setting. If we assume an unbalanced panel (or repeated cross section) setting, we obtain the following theorem.
Theorem 3.
Suppose Assumptions 1-7 hold. If we assume a binary treatment and an unbalanced panel setting, the population regression coefficient is a weighted average of each in all relative periods after the initial exposure across cohorts with potentially some negative weights:
where the weight is:
where is the population residuals from regression on unit and time fixed effects in cohort and is the population share for cohort at time . The weights sum to one.
Proof.
See Appendix C. ∎
Theorem 3 shows that the population regression coefficient is a weighted average of all possible across cohorts, but some weights can be negative. Theorem 3 is related to de Chaisemartin and D’Haultfœuille (2020), who show the decomposition theorem for the TWFEIV estimand when the assignment of the instrument is non-staggered and a no carryover assumption is satisfied in the first stage regression. Theorem 3 instead considers the case when the assignment of the instrument is staggered and there exist dynamic effects in the first stage. Theorem 3 assumes a binary treatment, but the extension to a non-binary, ordered treatment is straightforward: one can obtain the analogous theorem, which replaces with .
If one wants to check the validity of the TWFEIV estimator in a given application, one can estimate each weight by constructing a consistent estimator for . If there is no never exposed cohort, however, it is not feasible to obtain a consistent estimator for in the last exposed cohort . In Appendix C, we provide another representation of the decomposition theorem, in which we can estimate each weight consistently and quantify the bias term arising from the bad comparisons performed by TWFEIV regressions.
Random assignment of the adoption date
In practice, researchers may use the TWFEIV regression when the adoption date of the instrument is randomized across units (e.g., in a randomized controlled trial). In Appendix C, we consider the causal interpretation of the TWFEIV estimand under the random assignment assumption. In the DID literature, a similar issue is analyzed in Athey and Imbens (2022): they investigate the causal interpretation of the TWFE estimand when the adoption date of the treatment is randomized across units.
First, we define the random assignment assumption of the adoption date .
Assumption 14 (Random assignment assumption of adoption date ).
For all and all , is independent of potential outcomes:
When the assignment of the adoption date is completely randomized, our target parameter is the local average treatment effect (LATE) defined below.
Def.
The local average treatment effect (LATE) at a given relative period from the initial adoption of the instrument is
Unlike the CLATT, this parameter is not conditioned on the adoption date due to the independence assumption. The causal parameter in the first stage, , is also simplified to the average exposed effect () defined below:
If Assumptions 1-7 and Assumption 14 hold, one can obtain the theorem and lemmas in section 4, which replace and with and , respectively. This implies that even when the adoption date of the instrument is randomized, we cannot interpret the TWFEIV estimand causally in general, and the causal interpretation requires the stable exposed effect assumptions in both the first stage and reduced form regressions.
6 Application
In this section, we illustrate our DID-IV decomposition theorem in the setting of Miller and Segal (2019). We first explain our dataset. We then assess the plausibility of the staggered DID-IV identification strategy implicitly imposed by Miller and Segal (2019). Finally, we present the DID-IV decomposition result and state the implication.
Miller and Segal (2019) study the effect of an increase in the share of female police officers on intimate partner homicide (IPH) rates among women in the United States between 1977 and 1991. The increase was in line with a shift in gender norms during this period, and there was growing interest in whether female integration improved police quality in addressing violence against women.
To establish the causal relationship, Miller and Segal (2019) first regress the IPH rates on the lagged female officers’ share with county and year fixed effects. In the second part of their analysis, Miller and Segal (2019) exploit "plausibly exogenous variation in female integration from externally imposed AA (affirmative action) following employment discrimination cases against particular departments in different years" across counties. Specifically, Miller and Segal (2019) use the two-way fixed effects instrumental variable regression, instrumenting the lagged female officers’ share with the exposure years of AA plans.
Miller and Segal (2019) implicitly rely on staggered DID-IV designs to estimate the causal effects: they are concerned that "AA itself might have occurred following increasing trends" in the share of female officers or the IPH rates. To address this concern, they check the trends of these variables before AA introduction using event study regressions in the first stage and reduced form.
In this application, we slightly modify the authors’ setting for simplicity. Specifically, unlike Miller and Segal (2019), we use the staggered adoption of AA plans as our instrument instead of the exposure years. In the authors’ setting, AA plans were terminated in some counties during the sample period, which is probably the reason why Miller and Segal (2019) use the exposure years of AA plans as their instrument. We instead drop such counties from our sample and make the instrument assignment staggered. Although it reduces our sample size, it allows us to have a clearer staggered DID-IV identification strategy. In addition, it enables us to apply our DID-IV decomposition theorem to the TWFEIV estimate in the authors’ setting.
Data
Start year of AA plans | Number of counties |
---|---|
Unexposed counties | 159 |
1976 | 6 |
1977 | 3 |
1978 | 3 |
1979 | 4 |
1980 | 3 |
1981 | 5 |
1982 | 4 |
1983 | 3 |
1984 | 2 |
1985 | 1 |
1986 | 1 |
1987 | 3 |
1988 | 1 |
1990 | 1 |
Notes: This table presents the initial exposure year of AA plans and the number of counties in each year in our final sample.
Variable | Period | All counties | Unexposed counties | Exposed counties |
---|---|---|---|---|
IPH per population | 1977-91 | 0.544 | 0.521 | 0.638 |
 | 1977 | 0.549 | 0.526 | 0.641 |
 | 1991 | 0.489 | 0.461 | 0.599 |
Lagged female officer share | 1977-91 | 0.053 | 0.050 | 0.066 |
 | 1977 | 0.033 | 0.033 | 0.032 |
 | 1991 | 0.077 | 0.071 | 0.101 |
Counties | | 199 | 159 | 40 |
Observations | | 2985 | 2385 | 600 |
Notes: This table presents summary statistics on our final sample from 1977 to 1991. The sample consists of 199 counties.
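The counts in the two tables above can be cross-checked against each other. The following is a small sketch that uses only the reported numbers; the variable names are illustrative, and the 15-year window is an assumption read off the 1977-91 period label.

```python
# Cohort sizes by AA start year, copied from the timing table.
cohort_sizes = {1976: 6, 1977: 3, 1978: 3, 1979: 4, 1980: 3, 1981: 5, 1982: 4,
                1983: 3, 1984: 2, 1985: 1, 1986: 1, 1987: 3, 1988: 1, 1990: 1}
n_unexposed = 159
n_years = 1991 - 1977 + 1  # assumed balanced panel over 1977-91, inclusive

n_exposed = sum(cohort_sizes.values())
n_counties = n_unexposed + n_exposed

# In a balanced panel, observations = counties x years.
observations = {
    "all": n_counties * n_years,
    "unexposed": n_unexposed * n_years,
    "exposed": n_exposed * n_years,
}
print(n_exposed, n_counties, observations)
```

Under these assumptions the cohort sizes add up to the 40 exposed counties, and the county counts times 15 years reproduce the reported observation counts.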
The data come from Miller and Segal (2019). Our final sample differs from their main analysis sample in two ways. First, unlike Miller and Segal (2019), we include only the counties whose variables are observed in all sample periods. This restriction excludes counties and allows us to create a balanced panel data set. Second, as already noted, we construct an instrument that takes the value one after the AA introduction. Miller and Segal (2019) use data on AA plans from Miller and Segal (2012) and define the instrument as the difference between the current year and the start year of AA introduction (footnote 1); see Miller and Segal (2012), Miller and Segal (2019) for details. We identify the initial year of AA plans in each county, and discard the counties whose AA plans ended between and ( counties dropped) and whose AA plans were already implemented before ( counties dropped). Table 1 shows the timing of AA adoption across counties between and .
Footnote 1: As one can see in this construction, Miller and Segal (2019) create the lagged instrument in line with the lagged female officers’ share. Therefore, we construct the lagged staggered instrument instead of the current one.
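One plausible reading of the lagged staggered instrument is an indicator that switches on once an AA plan has started and never switches off. The sketch below is only an illustration under that reading: the function name, the one-year lag, and the example county are assumptions, not the authors' code.

```python
def lagged_staggered_instrument(year, start_year, lag=1):
    """Return 1 if an AA plan had started by `year - lag`, else 0.

    `start_year=None` marks a never-exposed county (hypothetical convention).
    """
    if start_year is None:
        return 0
    return int(year - lag >= start_year)


# Hypothetical county whose plan starts in 1980, observed 1978-1983.
path = [lagged_staggered_instrument(t, 1980) for t in range(1978, 1984)]
print(path)  # [0, 0, 0, 1, 1, 1]
```

Because the indicator is non-decreasing in time for every county, the resulting assignment is staggered by construction.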
Summary statistics for county characteristics are reported in Table 2. Our sample is smaller than, but otherwise similar to, that of Miller and Segal (2019). Counties are separated into exposed and unexposed counties based on whether the county experienced AA introduction. In both types of counties, the lagged female officers’ share increased over time. However, it increased more in counties that were exposed to AA plans during the sample period. The IPH rates trended downward in all counties, and there appear to be no systematic differences in the trends between exposed and unexposed counties.
Assessing the identifying assumptions in staggered DID-IV design
In this section, we discuss the validity of the staggered DID-IV identification strategy implicitly imposed by Miller and Segal (2019). Note that in the authors’ setting, our target parameter is the cohort specific average causal response on the treated (CACRT) as female officer share is a non-binary, ordered treatment. We therefore expect that we can identify each CACRT if the underlying staggered DID-IV identification strategy seems plausible, which we will check below. Here, we presume the no carry over assumption (Assumption 2).
Exclusion restriction (Assumption 3).
This assumption is plausible if the AA plans (instrument) affected IPH rates only by increasing the female officers’ share. It may be violated, for instance, if the AA plans increased both the black and female officer shares and changes in IPH rates reflect both effects. Miller and Segal (2019) conduct a robustness check and confirm that this is not the case; see footnote in Miller and Segal (2019) for details.
Monotonicity assumption (Assumption 4).
It would be automatically satisfied in the authors’ setting: the AA plans (instrument) were imposed on departments with the intent to increase the share of female police officers. This ensures that the dynamic effects of the instrument on female police officers should be non-negative after the AA introduction.
No anticipation in the first stage (Assumption 5).
This assumption requires that the treatment status before the AA plans, i.e., the female officers’ share, equals the one that would have been observed in the absence of the AA introduction. It may be violated if the police departments in some counties had private knowledge about the probability of the AA introduction and manipulated their treatment status before the implementation.
Figure 2. Panel (a): First stage DID. Panel (b): Reduced form DID.
Next, we assess the plausibility of the parallel trends assumptions in the treatment and the outcome. To do so, we apply the method proposed by Callaway and Sant’Anna (2021) to the first stage and reduced form, respectively (footnote 2). Specifically, we estimate the weighted average of the effects of the instrument on the treatment and outcome in each relative period where the weight reflects the cohort size. We depict the results in Figure 2. The plots report estimates for the effects before and after AA plans with a simultaneous confidence interval in each stage. The confidence intervals account for clustering at the county level.
Footnote 2: Unfortunately, in the presence of heterogeneous treatment effects, the coefficients in event study regressions suffer from the contamination bias shown by Sun and Abraham (2021).
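The cohort-size weighting mentioned above can be sketched as follows. This is a simplified illustration of the aggregation step only, not the Callaway and Sant’Anna (2021) estimator itself; the function name and the cohort-specific effects are hypothetical, while the two cohort sizes are taken from the timing table.

```python
def event_time_average(effects, sizes):
    """Average cohort-specific effects at each relative period.

    effects: {(cohort, relative_period): estimate}
    sizes:   {cohort: number of units}; weights are proportional to size.
    """
    averages = {}
    for e in sorted({rel for (_, rel) in effects}):
        cohorts = [g for (g, rel) in effects if rel == e]
        total = sum(sizes[g] for g in cohorts)
        averages[e] = sum(sizes[g] * effects[(g, e)] for g in cohorts) / total
    return averages


sizes = {1979: 4, 1980: 3}  # cohort sizes from the timing table
# Hypothetical first stage effects by (cohort, relative period).
effects = {(1979, 0): 0.010, (1980, 0): 0.003, (1979, 1): 0.020}
avg = event_time_average(effects, sizes)
print(avg)
```

Only cohorts observed at a given relative period enter its average, so later relative periods are based on fewer cohorts.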
Parallel trends assumption in the treatment (Assumption 6).
It requires that if the AA plans had not occurred, the average time trends of the female officers share would have been the same across counties and over time. The pre-exposed estimates in Panel (a) in Figure 2 seem consistent with the parallel trends assumption in the treatment: the pre-exposed estimates around AA plans are not significantly different from zero.
Parallel trends assumption in the outcome (Assumption 7).
It requires that, if the AA plans had not been implemented, the average time trends of the IPH rates would have been the same across counties and over time. Panel (b) in Figure 2 shows that the pre-exposed estimates around AA introduction are not significantly different from zero, which indicates that the parallel trends assumption in the outcome is also plausible.
Figure 2 also sheds light on the dynamic effects of the AA plans on the female officer share and IPH rates during the post-exposed periods. The figure indicates that the effect of the AA plans on the female officer share increases over time, whereas the effect on IPH rates through the female officer share has downward trends during the post-exposed periods. We note that the estimated effects in the reduced form are not scaled by the ones in the first stage, i.e., these estimates do not capture each CACRT after the AA shock.
Illustrating the weights in TWFEIV regression
First, we estimate the two-way fixed effects instrumental variable regression in the authors’ setting. To clearly illustrate the shortcomings of the TWFEIV regression, we modify the authors’ specification in two ways: Miller and Segal (2019) include some covariates and weight their regression with county population, whereas we exclude such covariates and do not apply their weights to our regression.
The result is shown in Table 3. The two-way fixed effects instrumental variable estimate is -0.646 and it is not significantly different from zero (footnote 3). However, as we already noted in section 4, we cannot generally interpret the IV estimate as measuring a properly weighted average of each CACRT if the effect of the AA introduction on female officer share or IPH rates is not stable over time.
Footnote 3: Although Miller and Segal (2019) do not report the TWFEIV estimate without weights and covariates, when we run such a TWFEIV regression in their final analysis sample, the IV estimate is and is not significantly different from zero. This implies that we reach the same conclusion as in Miller and Segal (2019) in our data.
Our DID-IV decomposition theorem (Theorem 1) allows us to visualize the source of variations in the three types of the DID-IV design: Unexposed/Exposed, Exposed/Not Yet Exposed, and Exposed/Exposed Shift designs. Panel (a) in Figure 3 plots the weights and the corresponding Wald-DID estimates for all designs and Panels (b), (c), and (d) in Figure 3 plot them for each type of the DID-IV design, respectively. Table 4 reports the total weight, total Wald-DID estimate, and weighted average of Wald-DID estimates in each type of the DID-IV design. The total weight and total Wald-DID estimate are calculated by summing the weights and Wald-DID estimates respectively, and the weighted average of Wald-DID estimates is calculated by summing the products of the weight and the associated Wald-DID estimate. Summing the weighted averages of Wald-DID estimates across the three types yields the two-way fixed effects instrumental variable estimate (-0.646).
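The adding-up property can be verified directly from the figures reported in the decomposition table. The snippet below is a small check using only those reported values; the dictionary layout and key names are illustrative.

```python
# Total weight and weighted Wald-DID estimate by design type, as reported.
decomposition = {
    "Unexposed/Exposed": {"total_weight": 1.026, "weighted_wdd": -0.399},
    "Exposed/Not Yet Exposed": {"total_weight": 0.010, "weighted_wdd": -0.154},
    "Exposed/Exposed Shift": {"total_weight": -0.036, "weighted_wdd": -0.093},
}

total_weight = sum(row["total_weight"] for row in decomposition.values())
twfeiv = sum(row["weighted_wdd"] for row in decomposition.values())
print(round(total_weight, 3), round(twfeiv, 3))  # 1.0 -0.646
```

The weights sum to one and the weighted Wald-DID estimates sum to the TSLS estimate with fixed effects, as the decomposition theorem implies.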
Panel (a) in Figure 3 shows that the weights are heavily assigned to the Wald-DID estimates in Unexposed/Exposed designs. This is due to the large sample size of the unexposed cohort in the authors’ setting. Panels (b), (c), and (d) in Figure 3 highlight that some weights in each type can be negative: out of weights are negative in Unexposed/Exposed designs, out of weights are negative in Exposed/Not Yet Exposed designs and out of weights are negative in Exposed/Exposed Shift designs. The negative weights arise because some DID estimates of the treatment in the first stage are negative in each type of the DID-IV design.
The TWFEIV estimate suffers from a downward bias due to the bad comparisons arising from the Exposed/Exposed Shift designs. As we already mentioned in section 4, the TWFEIV estimand potentially fails to summarize the causal effects if the effect of the instrument on the treatment or the outcome evolves over time. Table 4 indicates that the estimated bias arising from the Exposed/Exposed Shift designs is quantitatively not negligible: the weighted average of the Wald-DID estimates in the Exposed/Exposed Shift designs is -0.093, which accounts for about one-seventh of our IV estimate.
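The one-seventh share follows from the reported figures, taking the weighted Wald-DID estimate of the Exposed/Exposed Shift designs relative to the TWFEIV estimate:

```latex
\frac{-0.093}{-0.646} \approx 0.144 \approx \frac{1}{7}.
```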

Specification | Estimate | Standard Error | 95% CI |
---|---|---|---|
TSLS with fixed effects | -0.646 | 3.284 | [-7.594, 6.301] |
Notes: Sample consists of counties. Confidence intervals account for clustering at the county level.
DID-IV design | Total weight | Total WDD estimate | Weighted WDD estimate |
---|---|---|---|
Unexposed/Exposed | 1.026 | -233.198 | -0.399 |
Exposed/Not Yet Exposed | 0.010 | -490.665 | -0.154 |
Exposed/Exposed Shift | -0.036 | -569.426 | -0.093 |
Notes: This table presents the total weight, total Wald-DID estimate, and weighted average of Wald-DID estimates in each type of the DID-IV design: Unexposed/Exposed, Exposed/Not Yet Exposed, and Exposed/Exposed Shift designs.
7 Alternative specifications
So far, we have considered simple TWFEIV regressions as in equation (1). However, many studies routinely estimate various specifications, such as weighting or introducing covariates, to check the robustness of their findings. In this section, we extend our DID-IV decomposition theorem to the settings with weighting and covariates, and provide simple tools to examine how different specifications affect differences in estimates. We illustrate these by revisiting Miller and Segal (2019).
The tools we provide here are based on Goodman-Bacon (2021). Recall that our DID-IV decomposition theorem shows that the TWFEIV estimator can be written as the product of a vector of Wald-DID estimators () and a vector of weights (), that is, . When a TWFEIV estimator generated from a different specification () can also be written as the product of a vector of Wald-DID estimators () and a vector of their associated weights (), one can decompose the difference between the two specifications as
It takes the form of an Oaxaca-Blinder-Kitagawa decomposition (Oaxaca (1973), Blinder (1973), Kitagawa (1955)) and indicates that the difference comes from changes in Wald-DID estimators, changes in weights, and the interaction of the two. Dividing both sides by , one can measure the proportional contribution of each term to the difference. Plotting each pair in ( and , one can also examine which elements in each term have a significant impact on the difference.
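A minimal numerical sketch of this three-way split is given below. The notation is an assumption for illustration: `w` and `b` stand for the baseline weights and Wald-DID estimators, `w_alt` and `b_alt` for those of the alternative specification, and all numbers are hypothetical. The split follows the form used by Goodman-Bacon (2021) for TWFE estimators.

```python
import numpy as np


def obk_decomposition(w, b, w_alt, b_alt):
    """Split w_alt'b_alt - w'b into three additive terms.

    Returns (due to Wald-DIDs, due to weights, interaction).
    """
    w, b, w_alt, b_alt = (np.asarray(v, dtype=float) for v in (w, b, w_alt, b_alt))
    due_to_wald_dids = w @ (b_alt - b)        # change estimates, hold weights fixed
    due_to_weights = (w_alt - w) @ b          # change weights, hold estimates fixed
    interaction = (w_alt - w) @ (b_alt - b)   # joint change
    return due_to_wald_dids, due_to_weights, interaction


# Hypothetical weights and Wald-DID estimates for three comparisons.
w, b = [0.7, 0.2, 0.1], [-0.5, 1.0, 2.0]
w_alt, b_alt = [0.6, 0.3, 0.1], [-0.4, 0.5, 2.5]

parts = obk_decomposition(w, b, w_alt, b_alt)
difference = np.dot(w_alt, b_alt) - np.dot(w, b)
print(parts, difference)
```

The three terms add up to the difference between the two estimates exactly, so dividing each by the difference gives its proportional contribution.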
7.1 Weighted TWFEIV regression
When researchers use a weighted TWFEIV regression instead of an unweighted one, it potentially changes the influence of Wald-DID estimators () by replacing the DIDs of the treatment and the outcome with the weighted ones. It also potentially changes the influence of weights () by replacing the sample share with the relative amount of the specified weight and the DIDs of the treatment with the weighted ones. Table 5 shows the result of our TWFEIV regression weighted by county population in Miller and Segal (2019): the estimate changes from -0.646 to -0.386. The decomposition result indicates that the contribution of the changes in Wald-DIDs is negative, whereas the contributions of the changes in weights and the interaction are positive.
Figure 4 plots the Wald-DIDs and the associated weights in WLS against those in OLS. Panel (a) shows that most comparisons of the Wald-DID between OLS and WLS lie on the 45-degree line, but some comparisons generated from Exposed/Not Yet Exposed and Exposed/Exposed Shift designs are away from the 45-degree line. In addition, this figure indicates that the Wald-DID generated from the comparison between and counties ( counties are the controls) is much more negative in WLS than in OLS, which drives the overall negative impact of the changes in Wald-DIDs on the difference between the two specifications. Panel (b) shows that most comparisons of the decomposition weight between OLS and WLS are near the 45-degree line and the origin, but some comparisons generated from Unexposed/Exposed designs are away from the 45-degree line and the origin. This figure also indicates that the decomposition weight generated from the comparison between and unexposed counties is much more positive in WLS than in OLS, which causes the overall positive impact of the changes in weights on the difference between the two specifications.
Statistic | Baseline | WLS | Covariates |
---|---|---|---|
Estimate | -0.646 | -0.386 | -0.868 |
Standard Error | 3.284 | 2.452 | 3.968 |
Difference from baseline | | 0.260 | -0.222 |
Difference comes from: | | | |
Wald-DIDs | | -4.048 | 0.370 |
Weights | | 2.341 | 16.503 |
Interaction | | 1.966 | -17.107 |
Within term | | 0 | 0.012 |
Notes: This table presents TWFEIV estimates in the setting of Miller and Segal (2019). The Baseline column is a simple TWFEIV estimate from Eq. (1). The WLS column is a TWFEIV estimate weighted by county population in . The Covariates column is a TWFEIV estimate with time-varying covariates which include the lagged local area controls, the county’s non-IPH rate, and the state-level crack cocaine index. All the standard errors are clustered at the county level.
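As a check on the table above, the reported components should add up to the reported difference from baseline in each column, up to rounding. The snippet below uses only the reported values; the dictionary keys are illustrative names.

```python
baseline = -0.646
columns = {
    "WLS": {"estimate": -0.386, "wald_dids": -4.048, "weights": 2.341,
            "interaction": 1.966, "within": 0.0},
    "Covariates": {"estimate": -0.868, "wald_dids": 0.370, "weights": 16.503,
                   "interaction": -17.107, "within": 0.012},
}

gaps = {}
for name, c in columns.items():
    difference = c["estimate"] - baseline
    components = c["wald_dids"] + c["weights"] + c["interaction"] + c["within"]
    gaps[name] = difference - components  # should be within rounding error
    print(name, round(difference, 3), round(components, 3))
```

In both columns the components match the difference to within 0.001, which is consistent with rounding to three decimals. For the covariate column, the three between components sum to about -0.234, so the between term dominates the 0.012 within term.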
7.2 TWFEIV regression with time-varying covariates
In most applications of the DID-IV method, researchers typically estimate TWFEIV models that include time-varying covariates, in addition to the simple ones, based on the belief that doing so enhances the validity of the parallel trends assumptions in the first stage and reduced form regressions:
(14) | ||||
(15) |
In this section, we derive a DID-IV decomposition result for the case when we introduce the time-varying covariates into TWFEIV regressions. Our decomposition result in this section is based on Goodman-Bacon (2021), who decomposes TWFE estimators with time-varying covariates. Appendix D further considers the causal interpretation of the covariate-adjusted TWFEIV estimand under additional conditions.
First, consider the coefficient on instrument () in the reduced form regression:
(16) |
Let and denote the double demeaning variables of and respectively, obtained from regressing and on time and unit fixed effects. Let denote the residuals obtained from regressing on :
Here, we define the linear projection as . The specific expression for is:
By the FWL theorem, we then obtain the following expression for :
where is the variance of . By symmetry, we can also express the first stage coefficient on instrument as follows:
Because the IV estimator is the ratio of the reduced form coefficient to the first stage coefficient , we obtain the following expression for :
(17) |
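The ratio form can be checked numerically. The following is a minimal sketch on simulated data: the data-generating process, parameter values, and variable names are all assumptions for illustration and have no connection to the Miller and Segal (2019) data. It compares the ratio of the two FWL coefficients with a direct just-identified IV regression on double-demeaned variables.

```python
import numpy as np

rng = np.random.default_rng(0)
n_units, n_periods = 60, 8
start = rng.choice([3, 5, 99], size=n_units)        # 99 = never exposed (hypothetical)
t = np.arange(n_periods)
z = (t[None, :] >= start[:, None]).astype(float)    # staggered instrument
x = rng.normal(size=z.shape)                        # time-varying covariate
d = 0.5 * z + 0.3 * x + rng.normal(size=z.shape)    # treatment
y = 1.0 * d + 0.2 * x + rng.normal(size=z.shape)    # outcome


def double_demean(a):
    """Remove unit and time means from a balanced (unit x period) array."""
    return a - a.mean(axis=1, keepdims=True) - a.mean(axis=0, keepdims=True) + a.mean()


zt, xt, dt, yt = (double_demean(a).ravel() for a in (z, x, d, y))

# Residual from projecting the demeaned instrument on the demeaned covariate.
r = zt - xt * (xt @ zt) / (xt @ xt)
reduced_form = (r @ yt) / (r @ r)   # coefficient on the instrument, outcome equation
first_stage = (r @ dt) / (r @ r)    # coefficient on the instrument, treatment equation
iv_ratio = reduced_form / first_stage

# Direct just-identified IV: regressors [d, x], instruments [z, x].
W = np.column_stack([dt, xt])
Q = np.column_stack([zt, xt])
iv_direct = np.linalg.solve(Q.T @ W, Q.T @ yt)[0]
print(iv_ratio, iv_direct)
```

The two numbers agree up to floating-point error, since the variance of the residualized instrument cancels in the ratio.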
In contrast to the unconditional TWFEIV estimator , the covariate-adjusted TWFEIV estimator exploits the variation in both and . varies at cohort and time level, but varies at unit and time level because varies at unit and time level.
To decompose the covariate-adjusted TWFEIV estimator , we first partition into "within" and "between" terms as in Goodman-Bacon (2021). Let be the average of in cohort . By adding and subtracting , we can decompose into two terms:
(18) |
The first term measures the deviation of from the average in cohort , which we call the within term of . The second term measures the deviation of from the average in whole sample, which we call the between term of . The within term varies at unit and time level because of , whereas the between term varies at cohort and time level.
By substituting (18) into (17), we obtain
(19) | ||||
(20) |
We use the subscript to denote within components and the subscript to denote between components. and are the variances of and , respectively. is the covariance between and , the within term of . is the covariance between and , the between term of . The weight measures the relative amount of the within covariance .
measures the relationship between and . Similarly, measures the relationship between and . We call these the within coefficients in the first stage and reduced form regressions. scales the within coefficient in the reduced form regression by the one in the first stage regression. We call this the within IV coefficient444One can obtain this coefficient by running an IV regression of the outcome on the treatment with as the excluded instrument.. This IV coefficient arises because varies at unit and time level. Similar to what Goodman-Bacon (2021) points out for the covariate-adjusted TWFE estimator, time-varying covariates bring a new source of identifying variation in the TWFEIV estimator, within variation of in each cohort.
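The within/between split can be illustrated on simulated data. In the sketch below, `r` plays the role of the residualized instrument, `within` and `between` are its two parts, and `omega` is the share of the within covariance; these names, the data-generating process, and all parameter values are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
n_units, n_periods = 90, 6
cohort = np.repeat([2, 4, 99], n_units // 3)          # hypothetical start periods
t = np.arange(n_periods)
z = (t[None, :] >= cohort[:, None]).astype(float)     # staggered instrument
x = rng.normal(size=z.shape) + 0.5 * z                # covariate correlated with z
d = 0.6 * z + 0.4 * x + rng.normal(size=z.shape)
y = 1.0 * d + 0.3 * x + rng.normal(size=z.shape)


def double_demean(a):
    return a - a.mean(axis=1, keepdims=True) - a.mean(axis=0, keepdims=True) + a.mean()


def cov(a, b):
    # All inputs are mean zero here, so the mean product is the covariance.
    return (a * b).mean()


zt, xt, dt, yt = (double_demean(a) for a in (z, x, d, y))
r = zt - xt * (xt * zt).sum() / (xt * xt).sum()       # residualized instrument

# Between term: cohort-by-period mean of r. Within term: deviation from it.
between = np.empty_like(r)
for k in np.unique(cohort):
    rows = cohort == k
    between[rows] = r[rows].mean(axis=0, keepdims=True)
within = r - between

omega = cov(dt, within) / cov(dt, r)                  # share of within covariance
beta_within = cov(yt, within) / cov(dt, within)       # within IV coefficient
beta_between = cov(yt, between) / cov(dt, between)    # between IV coefficient
beta_iv = cov(yt, r) / cov(dt, r)                     # covariate-adjusted IV
combined = omega * beta_within + (1 - omega) * beta_between
print(beta_iv, combined)
```

The covariate-adjusted IV coefficient equals the weighted combination of the within and between IV coefficients, because covariances with `r` split additively into their within and between parts.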
measures the relationship between and . Similarly measures the relationship between and . We call these the between coefficients in the first stage and reduced form regressions. divides the between coefficient in the reduced form regression by the one in the first stage regression, and has the following specific expression:
(21) |
and are already defined in section 3. is the covariance between and (the between term of ). is the estimator, obtained from an IV regression of on with as the excluded instrument. We call the between IV coefficient, which exploits the cohort and time level variation in . This IV coefficient is not equal to the unconditional TWFEIV coefficient : subtracts the influence of from the unconditional IV estimator . This indicates that time-varying covariates change the identifying variation at the cohort and time level through , the between term of the linear projection .
We can further decompose the between IV coefficient as follows:
(22) |
The proof is given in Appendix D. Each notation is similarly defined in cell subsamples. and are the between IV coefficient and the corresponding weight in cell subsamples. Equation (22) indicates that time-varying covariates affect the between IV coefficient by changing both the between IV coefficient and the associated weight in each cell.
To sum up, combining (22) with (19), we can decompose the covariate-adjusted TWFEIV estimator as
The weight is assigned to the within IV coefficient and the weight is assigned to the between IV coefficient , which is equal to a weighted average of all possible between IV coefficients as in Theorem 1.
Table 5 presents the result of our TWFEIV regression with time-varying covariates in Miller and Segal (2019). We follow Miller and Segal (2019) and include the lagged local area controls, the county’s non-IPH rate, and the state-level crack cocaine index; see Miller and Segal (2019) for details. The estimate changes from -0.646 to -0.868. The decomposition result shows that the contribution of the within term is positive but negligible, whereas the contribution of the between term is negative and substantial. Specifically, in the between term, the contributions of the changes in Wald-DIDs and weights are positive, but they are more than offset by the negative contribution of the interaction. This result indicates that in Miller and Segal (2019), the time-varying covariates affect the IV estimate mainly through the identifying variation at the cohort and time level, that is, the between term of the linear projection .
8 Conclusion
Many studies run two-way fixed effects instrumental variable (TWFEIV) regressions, leveraging variation occurring from the different timing of policy adoption across units as an instrument for the treatment. In this paper, we study the causal interpretation of the TWFEIV estimator in staggered DID-IV designs. We first show that in settings with the staggered adoption of the instrument across units, the TWFEIV estimator is equal to a weighted average of all possible Wald-DID estimators arising from the three types of the DID-IV design: Unexposed/Exposed, Exposed/Not Yet Exposed, and Exposed/Exposed Shift designs. The weight assigned to each Wald-DID estimator is a function of the sample share, the variance of the instrument, and the DID estimator of the treatment in each DID-IV design.
Based on the decomposition result, we then show that in staggered DID-IV designs, the TWFEIV estimand is equal to a weighted average of all possible cohort specific local average treatment effect on the treated parameters, but some weights can be negative. The negative weight problem arises from the bad comparisons in the first stage and reduced form regressions: we use the already exposed units as controls. The TWFEIV estimand attains its causal interpretation if the effects of the instrument on the treatment and outcome are stable over time. The resulting causal parameter is a positively weighted average of cohort specific local average treatment effect on the treated parameters.
Finally, we illustrate our findings with the setting of Miller and Segal (2019), who estimate the effect of female officers’ share on the IPH rate, exploiting the timing variation of AA introduction across U.S. counties. We first assess the underlying staggered DID-IV identification strategy implicitly imposed by Miller and Segal (2019) and find it plausible. We then apply our DID-IV decomposition theorem to the TWFEIV estimate, and find that the estimate suffers from a substantial downward bias arising from the bad comparisons in Exposed/Exposed Shift DID-IV designs. We also decompose the difference between the two specifications and illustrate how different specifications affect the overall estimates in Miller and Segal (2019).
Overall, this paper shows a negative result for TWFEIV estimators in the presence of heterogeneous treatment effects in staggered DID-IV designs with more than two periods. This paper provides simple tools to evaluate how serious that concern is in a given application. Specifically, we demonstrate that the TWFEIV estimator is not robust to time-varying exposed effects in the first stage and reduced form regressions. Our DID-IV decomposition theorem allows empirical researchers to assess the impact of the bias term arising from the bad comparisons on their TWFEIV estimate. Recently, Miyaji (2024) developed an alternative estimation method that is robust to treatment effect heterogeneity and proposed a weighting scheme to construct various summary measures in staggered DID-IV designs. Further developing alternative approaches and diagnostic tools will be a promising area for future work, strengthening the credibility of DID-IV designs in practice.
References
- Akerman et al. (2015) Akerman, A., Gaarder, I., Mogstad, M., 2015. The Skill Complementarity of Broadband Internet. Q. J. Econ. 130 (4), 1781–1824.
- Angrist and Imbens (1995) Angrist, J. D., Imbens, G. W., 1995. Two-Stage Least Squares Estimation of Average Causal Effects in Models with Variable Treatment Intensity. J. Am. Stat. Assoc. 90 (430), 431–442.
- Athey and Imbens (2022) Athey, S., Imbens, G. W., 2022. Design-based analysis in Difference-In-Differences settings with staggered adoption. J. Econom. 226 (1), 62–79.
- Bhuller et al. (2013) Bhuller, M., Havnes, T., Leuven, E., Mogstad, M., 2013. Broadband Internet: An Information Superhighway to Sex Crime? Rev. Econ. Stud. 80 (4), 1237–1266.
- Black et al. (2005) Black, S. E., Devereux, P. J., Salvanes, K. G., 2005. Why the Apple Doesn’t Fall Far: Understanding Intergenerational Transmission of Human Capital. Am. Econ. Rev. 95 (1), 437–449.
- Blandhol et al. (2022) Blandhol, C., Bonney, J., Mogstad, M., Torgovitsky, A., 2022. When is TSLS Actually LATE? February.
- Blinder (1973) Blinder, A. S., 1973. Wage Discrimination: Reduced Form and Structural Estimates. J. Hum. Resour. 8 (4), 436–455.
- Borusyak et al. (2021) Borusyak, K., Jaravel, X., Spiess, J., 2021. Revisiting Event Study Designs: Robust and Efficient Estimation.
- Callaway and Sant’Anna (2021) Callaway, B., Sant’Anna, P. H. C., 2021. Difference-in-Differences with multiple time periods. J. Econom. 225 (2), 200–230.
- de Chaisemartin (2010) de Chaisemartin, C., 2010. A note on instrumented difference in differences. Unpublished Manuscript.
- de Chaisemartin and D’Haultfœuille (2018) de Chaisemartin, C., D’Haultfœuille, X., 2018. Fuzzy Differences-in-Differences. Rev. Econ. Stud. 85 (2 (303)), 999–1028.
- de Chaisemartin and D’Haultfœuille (2020) de Chaisemartin, C., D’Haultfœuille, X., 2020. Two-Way Fixed Effects Estimators with Heterogeneous Treatment Effects. Am. Econ. Rev. 110 (9), 2964–2996.
- Duflo (2001) Duflo, E., 2001. Schooling and Labor Market Consequences of School Construction in Indonesia: Evidence from an Unusual Policy Experiment. Am. Econ. Rev. 91 (4), 795–813.
- Field (2007) Field, E., 2007. Entitled to Work: Urban Property Rights and Labor Supply in Peru. Q. J. Econ. 122 (4), 1561–1602.
- Goodman-Bacon (2021) Goodman-Bacon, A., 2021. Difference-in-differences with variation in treatment timing. J. Econom. 225 (2), 254–277.
- Hudson et al. (2017) Hudson, S., Hull, P., Liebersohn, J., 2017. Interpreting Instrumented Difference-in-Differences. Available at http://www.mit.edu/~liebers/DDIV.pdf.
- Imai and Kim (2021) Imai, K., Kim, I. S., 2021. On the Use of Two-Way Fixed Effects Regression Models for Causal Inference with Panel Data. Polit. Anal. 29 (3), 405–415.
- Imbens and Angrist (1994) Imbens, G. W., Angrist, J. D., 1994. Identification and Estimation of Local Average Treatment Effects. Econometrica. 62 (2), 467–475.
- Johnson and Jackson (2019) Johnson, R. C., Jackson, C. K., 2019. Reducing Inequality through Dynamic Complementarity: Evidence from Head Start and Public School Spending. Am. Econ. J. Econ. Policy. 11 (4), 310–349.
- Kitagawa (1955) Kitagawa, E. M., 1955. Components of a Difference Between Two Rates. J. Am. Stat. Assoc. 50 (272), 1168–1194.
- Lundborg et al. (2014) Lundborg, P., Nilsson, A., Rooth, D.-O., 2014. Parental education and offspring outcomes: Evidence from the Swedish compulsory school reform. Am. Econ. J. Appl. Econ. 6 (1), 253–278.
- Lundborg et al. (2017) Lundborg, P., Plug, E., Rasmussen, A. W., 2017. Can Women Have Children and a Career? IV Evidence from IVF Treatments. Am. Econ. Rev. 107 (6), 1611–1637.
- Meghir et al. (2018) Meghir, C., Palme, M., Simeonova, E., 2018. Education and mortality: Evidence from a social experiment. Am. Econ. J. Appl. Econ. 10 (2), 234–256.
- Miller and Segal (2012) Miller, A. R., Segal, C., 2012. Does temporary affirmative action produce persistent effects? A study of black and female employment in law enforcement. Rev. Econ. Stat. 94 (4), 1107–1125.
- Miller and Segal (2019) Miller, A. R., Segal, C., 2019. Do female officers improve law enforcement quality? Effects on crime reporting and domestic violence. Rev. Econ. Stud. 86 (5), 2220–2247.
- Miyaji (2024) Miyaji, S., 2024. Instrumented Difference-in-Differences with heterogeneous treatment effects. Unpublished Manuscript.
- Oaxaca (1973) Oaxaca, R., 1973. Male-Female Wage Differentials in Urban Labor Markets. Int. Econ. Rev. 14 (3), 693–709.
- Oreopoulos (2006) Oreopoulos, P., 2006. Estimating Average and Local Average Treatment Effects of Education when Compulsory Schooling Laws Really Matter. Am. Econ. Rev. 96 (1), 152–175.
- Słoczyński (2020) Słoczyński, T., 2020. When Should We (Not) Interpret Linear IV Estimands as LATE?
- Sun and Abraham (2021) Sun, L., Abraham, S., 2021. Estimating dynamic treatment effects in event studies with heterogeneous treatment effects. J. Econom. 225 (2), 175–199.
Appendix A Proof of the theorem in section 2
Before we proceed to the proof of Theorem 1, we provide Lemma 4 below. This lemma is shown by Goodman-Bacon (2021).
Lemma 4 (Lemma in Goodman-Bacon (2021)).
The sample covariance between a cohort and time specific variable and a double demeaning variable is equal to a sum over every pair of observations of the period-by-period products of differences between cohorts in and .
(23) |
Proof.
See the proof of Lemma in Goodman-Bacon (2021). ∎
A.1 Proof of Theorem 1
Proof.
From the FWL theorem, the TWFEIV estimator is:
(24) |
where is a double-demeaning variable.
First, we consider the numerator of (24). In the following, we use to express that unit belongs to cohort . We define to be the sample mean of the random variable in cohort at time , and define to be the average of over time:
For the numerator of (24), by adding and subtracting , we obtain
where the third equality follows from the fact that and because all the units in cohort have the same assignment of the instrument. The fourth equality follows because the expression only depends on cohort and time .
To further develop the expression, we use Lemma 4:
(25) |
Next, we consider all possible expressions of (25). When , that is, cohort is the never exposed cohort, we have . From this observation, for the pair , we have:
By a similar argument, for the pair where , we obtain
To sum up, for the numerator of (24), we have
(26) |
Next, we consider the denominator of (24). We note that the structure of the denominator is exactly the same as that of the numerator in (24). Therefore, by exactly the same calculations, we obtain
(27) |
Combining (26) with (27), we obtain
where the weights are:
We note that is the sum of the numerator in each weight as one can see in (27). This implies that the weights sum to one:
This completes the proof. ∎
Appendix B Proofs of the theorem and lemma in section 4.
In this section, we first prove Lemma 1 as a preparation.
B.1 Proof of Lemma 1
Proof.
As one can see in the proof of Theorem 1, we have the following expression for the numerator of the TWFEIV estimand:
We fix and consider . We first derive the probability limit of . By definition, we can rewrite as follows:
By the law of large numbers (LLN), as , we obtain
(28) |
The second equality follows from simple algebra and Assumption 5 (No anticipation in the first stage). The third equality follows from Assumption 6 (Parallel trends assumption in the treatment).
Next, we consider the probability limit of . By the LLN and Slutsky’s theorem, we obtain
(29) |
Combining (28) with (29), by Slutsky’s theorem, we have
(30) |
where the weight is:
By exactly the same calculations, we also have
(31) |
where the weight is:
Next, we consider the probability limit of . By definition, we have
By the law of large numbers (LLN), as , we have
(32) |
The second equality follows from simple algebra and Assumption 5. The third equality follows from Assumption.
Note that the LLN and Slutsky’s theorem imply
(33) |
Combining (32) with (33), we obtain
(34) |
where the weight is:
To sum up, by combining (30) and (31) with (34), we obtain
where we define
and the weights are:
(35) | ||||
(36) | ||||
(37) |
Completing the proof. ∎
B.2 Preparation for the proof of Theorem 2.
Before we present the proof of Theorem 2, we show the following lemma.
B.3 Proof of Theorem 2.
Proof.
We fix and consider . We first derive the probability limit of . We note that is written as follows:
We first consider the probability limit of . Recall that we have already derived the probability limit of the denominator in the proof of Lemma 1:
(38) |
We consider the numerator . By the law of large numbers (LLN), as , we obtain
(39) |
The first equality follows from simple manipulation, Assumption 3 (Exclusion restriction in multiple time periods) and Assumption 5. The second equality follows from Assumption 7 (Parallel trend assumption in the outcome). The final equality follows from Lemma 5.
Combining the result (38) with (39), we obtain
(40) |
Next, we consider the probability limit of . By the LLN and Slutsky’s theorem, we obtain
(41) |
Here is the probability limit of and its specific expression is already derived in Lemma 1.
Combining (40) with (41), by Slutsky’s theorem, we have
(42) |
where the weight is:
By exactly the same argument, we also obtain
(43) |
where the weight is:
Next, we derive the probability limit of . Recall that is:
First note that in the proof of Lemma 1, we have already derived the probability limit of :
(44) | ||||
Here, to ease the notation, we define to be the probability limit of .
Next, we consider the probability limit of .
By the law of large numbers (LLN), as , we have
(45) |
The first equality follows from simple algebra, Assumption 3, and Assumption 5. The second equality follows from Assumption 7. The final equality follows from Lemma 5.
Note that the LLN and Slutsky’s theorem yield
(46) |
Combining (44) and (45) with (46), we obtain
(47) |
where the weight is:
(48) |
To sum up, by combining (42) and (43) with (47), we obtain
where we define
and the weights are:
(49) | ||||
(50) | ||||
(51) |
Completing the proof. ∎
B.4 Proof of Lemma 2
Proof.
We first simplify and (defined in (44)) under Assumption 9. Under Assumption 9, is:
In addition, is:
because we have .
We then rewrite the probability limits of , and respectively. First, the probability limits of and simplify to:
(52) |
(53) |
where the weights , are:
(54) | ||||
(55) |
Next, we reconsider the probability limit of .
First, we note that the probability limit of simplifies to:
(56) |
Here the second equality follows from .
Combining (57) with (56), by Slutsky’s theorem, we have
(58) |
where the weight is:
(59) |
and is already defined in (48).
Completing the proof. ∎
B.5 Proof of Lemma 3
Appendix C Extensions in section 5
C.1 Non-binary, ordered treatment
This subsection considers a non-binary, ordered treatment. We show Lemma 6 below, which is analogous to Lemma 5 for a binary treatment. If we use Lemma 6 instead of Lemma 5 in the proofs of Theorem 2 and Lemmas 2-3, we obtain the theorem and the lemmas which replace with .
Lemma 6.
C.2 Unbalanced panel case
In this section, we consider an unbalanced setting. We use the notation for a panel data setting, but the discussion and the results are the same for an unbalanced repeated cross section setting.
Proof of Theorem 3
Let be the sample size for cohort at time and be the total number of observations. We consider the following two-way fixed effects instrumental variable regression:
We define to be the residuals from the regression of on the time and individual fixed effects.
From the FWL theorem, the TWFEIV estimator is:
where the third equality follows from the fact that varies only at the cohort and time level.
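In an unbalanced panel, the residualized instrument is no longer given by simple double-demeaning and has to be obtained from the fixed effects regression itself. The sketch below illustrates this under hypothetical assumptions: the missingness pattern, the cohort assignment, and the data-generating process are invented for the example. It computes the residuals by least squares on unit and time dummies and checks the FWL ratio against an explicit IV regression.

```python
import numpy as np

rng = np.random.default_rng(1)
N, T = 40, 5
unit = np.repeat(np.arange(N), T)
time = np.tile(np.arange(T), N)

# Hypothetical deterministic missingness: each unit loses at most one period.
keep = (unit + 2 * time) % 7 != 0
unit, time = unit[keep], time[keep]

first_exposed = np.repeat([1, 3, T], [15, 15, 10])   # T = never exposed
Z = (time >= first_exposed[unit]).astype(float)
D = (rng.random(Z.size) < 0.2 + 0.5 * Z).astype(float)
alpha, lam = rng.normal(size=N), rng.normal(size=T)  # unit and time effects
Y = 1.0 + 2.0 * D + alpha[unit] + lam[time] + rng.normal(size=Z.size)

# Intercept, N-1 unit dummies and T-1 time dummies on the observed rows only.
W = np.column_stack([
    np.ones(Z.size),
    (unit[:, None] == np.arange(1, N)).astype(float),
    (time[:, None] == np.arange(1, T)).astype(float),
])

# Residuals from regressing the instrument on the fixed effects.
Z_res = Z - W @ np.linalg.lstsq(W, Z, rcond=None)[0]
beta_fwl = (Z_res @ Y) / (Z_res @ D)

# Explicit just-identified IV with the fixed effects included.
X = np.column_stack([D, W])
Zm = np.column_stack([Z, W])
beta_iv = np.linalg.solve(Zm.T @ X, Zm.T @ Y)[0]

print(beta_fwl, beta_iv)
```

The two estimates should coincide up to floating-point error, which is the FWL step used at the start of the proof.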
We note that by the definition of , we have
(62) | ||||
(63) |
To ease the notation, we define the sample mean for a random variable in cohort at time as follows:
Here, we note that we can express as follows:
(64) |
where the second equality follows from Assumptions 1-3 and Assumption 5.
First, we consider the probability limit of the numerator in the TWFEIV estimator. By using (62) and (63), we obtain
(65) |
To further develop the expression, we use (64):
(66) |
Substituting (66) into (65), we obtain:
(67) |
From (67), as , we obtain
(68) |
where is population share and in cohort at time . The first equality follows from Assumption 1 and Assumption 7. The second equality follows from Assumption 5.
Next, we consider the probability limit of the denominator. We note that the structure of the denominator is the same as that of the numerator. Therefore, by the same argument, we have:
(69) |
Combining the result (69) with (68), we obtain
where the weight is:
Completing the proof.
Supplementary of Theorem 3
We provide another representation of Theorem 3. We assume that there does not exist a never exposed cohort, that is, we have , and define to be the last exposed cohort.
Lemma 7.
We note that when there is no never exposed cohort, we can only identify each before the time period for cohort , exploiting the time trends of the unexposed treatment and outcome for cohort . This implies that in equation (70), each is a bias term arising from the bad comparisons performed by TWFEIV regressions. In a given application, we can estimate , , and the associated weights , by constructing consistent estimators using (76) and (80) below.
Proof.
We consider the case where there is no never exposed cohort, i.e., we have . In this case, by using the last exposed cohort , we obtain
where we define
By the law of large numbers and the same argument as in the proof of Theorem 2, we have
(76) |
Similarly, we obtain
(80) |
Combining (76) with (80), by Slutsky’s theorem, we obtain
Completing the proof. ∎
C.3 Random assignment of the instrument adoption date
First, we set up additional notation. We define and analogously to and in section 4:
where we replace and with and respectively in and .
Theorem 4 below presents the TWFEIV estimand under Assumptions 1-5 and Assumption 14 (Random assignment assumption of adoption date ).
Theorem 4.
Theorem 4 is analogous to Theorem 2, but and are replaced by and respectively because we assume a random assignment of the adoption date. If we impose the restrictions on the effects of the instrument on the treatment and outcome as in section 4.2, arguments similar to those for Theorem 2, Lemma 2 and Lemma 3 hold.
Appendix D Proofs and discussions in section 7
In this appendix, we first derive equation (22). We then discuss the causal interpretation of the covariate-adjusted TWFEIV estimand under staggered DID-IV designs, imposing the additional assumptions.
D.1 Decomposing the between IV coefficient
Let denote the covariance between and , the between term of . The between IV coefficient is:
(81) |
To derive equation (22), we decompose the covariance between and . To do so, we first split the between term into the between term of and the between term of :
Then, we have
(82) |
is an estimator obtained from an IV regression of on with as the excluded instrument in cell subsample. is the covariance between and in cell subsample. is an estimator obtained from an IV regression of on with as the excluded instrument in cell subsample. is the covariance between and in cell subsample.
D.2 Causal interpretation of the covariate-adjusted TWFEIV estimand
This section considers the causal interpretation of the covariate-adjusted TWFEIV estimand . To simplify the analysis, we first make the following assumptions. Goodman-Bacon (2021) also makes similar assumptions to investigate the causal interpretation of the covariate-adjusted TWFE estimand in Appendix B.
(i) Time-varying covariates are not affected by the instrument (policy shock).
(ii) Time-varying covariates do not vary within cohorts.
(iii) The coefficients obtained from regressing on in cell subsample are the same regardless of the pair .
Because Assumption (ii) implies that the within term is equal to zero, the covariate-adjusted TWFEIV estimator simplifies to
Assumption (iii) guarantees that is equal to the between coefficient obtained from estimating equation (14) in subsample, which we denote hereafter. To see this formally, let denote the linear projection obtained from regressing on in subsample and let denote the between term of in cohort . We note that (the between term of ) holds in general because is estimated using the whole sample. Then, we have
where is an estimator obtained from an IV regression of on with the difference as the excluded instrument. Because Assumption (iii) () implies , we obtain .
Hereafter, we assume the identifying assumptions in staggered DID-IV designs and Assumption (i)-(iii). We focus on the between coefficient as it clarifies how covariates affect the interpretation of the TWFEIV estimand:
Then, by calculations similar to those in the proof of Theorem 2, we obtain
(83) |
and
(84) |
where and are the probability limits of and , respectively. Equations (83) and (84) indicate that covariates affect the causal interpretation of in two ways. First, they introduce the covariance between the difference in unexposed outcomes and the difference in the variation of the linear projection for cohorts and (the first term in equation (83)). Second, they introduce the covariance between the and the difference in the variation of the linear projection for cohorts and (the second term in equation (84)).