
Two-way fixed effects instrumental variable regressions in staggered DID-IV designs

Acknowledgments: I am grateful to Daiji Kawaguchi and Ryo Okui for their continued guidance and support. All errors are my own.

Miyaji Sho, Graduate School of Economics, The University of Tokyo, 7-3-1 Hongo, Bunkyoku, Tokyo 113-0033, Japan; Email: [email protected].

Many studies run two-way fixed effects instrumental variable (TWFEIV) regressions, leveraging variation in the timing of policy adoption across units as an instrument for treatment. This paper studies the properties of the TWFEIV estimator in staggered instrumented difference-in-differences (DID-IV) designs. We show that in settings with the staggered adoption of the instrument across units, the TWFEIV estimator can be decomposed into a weighted average of all possible two-group/two-period Wald-DID estimators. Under staggered DID-IV designs, a causal interpretation of the TWFEIV estimand hinges on the stable effects of the instrument on the treatment and the outcome over time. We illustrate the use of our decomposition theorem for the TWFEIV estimator through an empirical application.

1   Introduction

Instrumented difference-in-differences (DID-IV) is a method to estimate the effect of a treatment on an outcome, exploiting variation in the timing of policy adoption across units as an instrument for the treatment. In a simple setting with two groups and two periods, some units become exposed to the policy shock in the second period (exposed group), whereas the others remain unexposed in both periods (unexposed group). The estimator is constructed by running the following IV regression with the group and post-time dummies as included instruments and the interaction of the two as the excluded instrument (e.g., Duflo (2001), Field (2007)):

$$Y_{i,t}=\beta_{0}+\beta_{i,.}\text{Exposed}_{i}+\beta_{,.t}\text{POST}_{t}+\beta_{IV}D_{i,t}+\epsilon_{i,t}.$$

The resulting IV estimand $\beta_{IV}$ scales the DID estimand of the outcome by the DID estimand of the treatment, the so-called Wald-DID estimand (de Chaisemartin and D’Haultfœuille (2018), Miyaji (2024)). In this two-group/two-period ($2\times 2$) setting, DID-IV designs mainly consist of a monotonicity assumption and parallel trends assumptions in the treatment and the outcome between the two groups, and allow the Wald-DID estimand to capture the local average treatment effect on the treated (LATET) (de Chaisemartin (2010), Hudson et al. (2017), and Miyaji (2024)). DID-IV designs have gained popularity over DID designs in practice when there is no control group or the treatment adoption is potentially endogenous over time (Miyaji (2024)).

In reality, however, most DID-IV applications go beyond the canonical DID-IV set up and leverage variation in the timing of policy adoption across units over more than two periods, instrumenting for the treatment with this natural variation. The instrument is constructed, for instance, from the staggered adoption of school reforms across countries or municipalities (e.g. Oreopoulos (2006), Lundborg et al. (2014), Meghir et al. (2018)), the phased introduction of Head Start across states (e.g. Johnson and Jackson (2019)), or the gradual adoption of broadband internet programs (e.g. Akerman et al. (2015), Bhuller et al. (2013)). These policy changes can be viewed as natural experiments, but they are not randomized in reality.

Recently, Miyaji (2024) formalizes the underlying identification strategy as a staggered DID-IV design. In this design, the treatment adoption is allowed to be endogenous over time, while the instrument is required to be uncorrelated with time-varying unobservables in the treatment and the outcome. The assignment of the treatment can be non-staggered across units, while the assignment of the instrument is staggered across units: they are partitioned into mutually exclusive and exhaustive cohorts by the initial adoption date of the instrument. The target parameter is the cohort specific local average treatment effect on the treated (CLATT); this parameter measures the treatment effects among the units who belong to cohort $e$ and are induced into the treatment by the instrument in a given relative period $l$ after the initial adoption of the instrument. The identifying assumptions are the natural generalization of those in $2\times 2$ DID-IV designs.

In practice, empirical researchers commonly implement this design via linear instrumental variable regressions with time and unit fixed effects, the so-called two-way fixed effects instrumental variable (TWFEIV) regressions (e.g., Black et al. (2005), Lundborg et al. (2017), Johnson and Jackson (2019)):

$$Y_{i,t}=\phi_{i.}+\lambda_{t.}+\beta_{IV}D_{i,t}+v_{i,t}, \tag{1}$$
$$D_{i,t}=\gamma_{i.}+\zeta_{t.}+\pi Z_{i,t}+\eta_{i,t}. \tag{2}$$

In contrast to the canonical DID-IV set up, however, the validity of running TWFEIV regressions seems less clear under staggered DID-IV designs. The IV estimate is commonly interpreted as measuring the local average treatment effect in the presence of heterogeneous treatment effects as in Imbens and Angrist (1994), whereas the target parameter is not stated formally. We know little about how the IV estimator is constructed by comparing the evolution of the treatment and the outcome across units and over time. Finally, we have no tools to illustrate the identifying variations in the IV estimate in a given application.

In this paper, we study the properties of two-way fixed effects instrumental variable estimators under staggered DID-IV designs. Specifically, we present the decomposition result for the TWFEIV estimator, and study the causal interpretation of the TWFEIV estimand under staggered DID-IV designs.

First, we derive the decomposition theorem for the TWFEIV estimator in settings with the staggered adoption of the instrument across units. We show that the TWFEIV estimator is equal to a weighted average of all possible $2\times 2$ Wald-DID estimators arising from three types of DID-IV design. In the first, an Unexposed/Exposed design, some units are never exposed to the instrument during the sample period (unexposed group), whereas some units start exposed at a particular date and remain exposed (exposed group). In the second, an Exposed/Not Yet Exposed design, some units start exposed earlier, whereas some units are not yet exposed during the design period (not yet exposed group). In the third, an Exposed/Exposed Shift design, some units are already exposed, whereas some units start exposed later at a particular point during the design period (exposed shift group). The weight assigned to each Wald-DID estimator reflects all the identifying variations in each DID-IV design: the sample share, the variance of the instrument, and the DID estimator of the treatment between the two groups.

Building on the decomposition result, we next uncover the shortcomings of running TWFEIV regressions under staggered DID-IV designs. We show that the TWFEIV estimand potentially fails to summarize the treatment effects under staggered DID-IV designs due to negative weights. Specifically, we show that this estimand is equal to a weighted average of all possible cohort specific local average treatment effect on the treated (CLATT) parameters, but some weights can be negative. The negative weight problem potentially arises due to the "bad comparisons" (cf. Goodman-Bacon (2021)) performed by TWFEIV regressions: the already exposed units play the role of controls in the Exposed/Exposed Shift design in the first stage and reduced form regressions. Given this negative result for the TWFEIV estimand under staggered DID-IV designs, we also investigate the sufficient conditions for this estimand to attain its causal interpretation. We show that this estimand can be interpreted as causal only if the effects of the instrument on the treatment and the outcome are stable over time.

We extend our decomposition result in several directions. We first consider non-binary, ordered treatment. We also derive the decomposition result for the TWFEIV estimand in unbalanced panel settings. Lastly, we consider the case when the adoption date of the instrument is randomized across units. In all cases, we show that the TWFEIV estimand potentially fails to summarize the treatment effects under staggered DID-IV designs due to negative weights.

We illustrate our findings with the setting of Miller and Segal (2019), who estimate the effect of female police officers’ share on the intimate partner homicide rate, leveraging the timing variation of AA (affirmative action) plans across U.S. counties. In this application, we first assess the plausibility of the staggered DID-IV design implicitly imposed by Miller and Segal (2019) and confirm its validity. We then estimate TWFEIV regressions, slightly modifying the authors’ setting, and apply our DID-IV decomposition theorem to the IV estimate. We find that the estimate assigns more weight to the Unexposed/Exposed design and less weight to the other two types of the DID-IV design. Despite the small weight on the Exposed/Exposed Shift design, we also find that the IV estimate suffers from a substantial downward bias arising from the bad comparisons in the Exposed/Exposed Shift design.

Finally, we develop simple tools to examine how different specifications change TWFEIV estimates, and illustrate these by revisiting Miller and Segal (2019). In many empirical settings, researchers diverge from a simple TWFEIV regression as in equation (1) and estimate various specifications, such as weighting or including time-varying covariates. We follow Goodman-Bacon (2021) and decompose the difference between the two specifications into the changes in Wald-DID estimates, the changes in weights, and the interaction of the two. This decomposition result enables researchers to quantify the contribution of the changes in each term to the difference in the overall estimates. In addition, plotting the pairs of Wald-DID estimates and associated weights obtained from the two specifications allows researchers to investigate which components have a significant impact on these contributions.

Overall, this paper shows the negative result of using TWFEIV estimators under staggered DID-IV designs in more than two periods, and provides tools to illustrate how serious that concern is in a given application. Specifically, our decomposition result for the TWFEIV estimator enables researchers to quantify the bias term arising from the bad comparisons in Exposed/Exposed Shift designs in the data. Fortunately, Miyaji (2024) recently proposes an alternative estimation method in staggered DID-IV designs that is robust to treatment effect heterogeneity. Using such an estimation method allows practitioners to avoid the issue of TWFEIV estimators in practice, and strengthens the credibility of their empirical findings.

The rest of the paper is organized as follows. The next subsection discusses the related literature. Section 2 presents our decomposition theorem for the TWFEIV estimator. Section 3 formally introduces staggered instrumented difference-in-differences designs. Section 4 presents the pitfalls of running TWFEIV regressions under staggered DID-IV designs, and explores the sufficient conditions for the TWFEIV estimand to attain its causal interpretation. Section 5 describes some of the extensions. Section 6 presents our empirical application. Section 7 explains how different specifications affect the difference in estimates, and Section 8 concludes. All proofs are given in the Appendix.

1.1   Related literature

Our paper is related to the recent DID-IV literature (de Chaisemartin (2010); Hudson et al. (2017); de Chaisemartin and D’Haultfœuille (2018); Miyaji (2024)). In this literature, de Chaisemartin (2010) first formalizes $2\times 2$ DID-IV designs and shows that a Wald-DID estimand identifies the local average treatment effect on the treated (LATET) if the parallel trends assumptions in the treatment and the outcome, and a monotonicity assumption, are satisfied. Hudson et al. (2017) also consider $2\times 2$ DID-IV designs with non-binary, ordered treatment settings. Building on the work in de Chaisemartin (2010), de Chaisemartin and D’Haultfœuille (2018), however, formalize $2\times 2$ DID-IV designs differently, and call them Fuzzy DID. Miyaji (2024) compares $2\times 2$ DID-IV to Fuzzy DID designs, points out the issues embedded in Fuzzy DID designs, and extends the $2\times 2$ DID-IV design to multiple period settings with the staggered adoption of the instrument across units, which the author calls staggered DID-IV designs. Miyaji (2024) also provides a reliable estimation method in staggered DID-IV designs that is robust to treatment effect heterogeneity.

In this paper, we contribute to the literature by showing the properties of two-way fixed effects instrumental variable estimators in staggered DID-IV designs. In reality, when empirical researchers implicitly rely on the staggered DID-IV design, they commonly implement this design via TWFEIV regressions (e.g. Black et al. (2005), Lundborg et al. (2014), Meghir et al. (2018)). This paper presents the issues of the conventional approach, and provides the sufficient conditions for the TWFEIV estimand to attain its causal interpretation.

Our paper is also related to a recent DID literature on the causal interpretation of two-way fixed effects (TWFE) regressions and its dynamic specifications under heterogeneous treatment effects (Athey and Imbens (2022); Borusyak et al. (2021); de Chaisemartin and D’Haultfœuille (2020); Goodman-Bacon (2021); Imai and Kim (2021); Sun and Abraham (2021)).

Specifically, this paper is closely connected to Goodman-Bacon (2021), who derives the decomposition theorem for the TWFE estimator in settings with the staggered adoption of the treatment across units. In this paper, we establish the decomposition theorem for the TWFEIV estimator in settings with the staggered adoption of the instrument across units, which is a natural generalization of their Theorem 1.

This paper is also closely connected to de Chaisemartin and D’Haultfœuille (2020), who decompose the TWFE estimand and present the issue of using this estimand under DID designs: some weights assigned to the causal parameters in this estimand can be negative. In their appendix, the authors also decompose the TWFEIV estimand and refer to the negative weight problem in this estimand. Specifically, they apply the decomposition theorem for the TWFE estimand to the numerator and denominator in the TWFEIV estimand respectively, and conclude that this estimand identifies the LATE as in Imbens and Angrist (1994) only if the effects of the instrument on the treatment and outcome are constant across groups and over time. However, their decomposition result for the TWFEIV estimand has some drawbacks. First, they do not formally state the target parameter and identifying assumptions in DID-IV designs. Second, their decomposition result is not based on the target parameter in DID-IV designs. Finally, the sufficient conditions for this estimand to be interpretable as a causal parameter are not well investigated.

In this paper, we investigate the causal interpretation of the TWFEIV estimand more clearly than that of de Chaisemartin and D’Haultfœuille (2020). Specifically, we first decompose the TWFEIV estimator into all possible $2\times 2$ Wald-DID estimators. We then formally introduce the target parameter and identifying assumptions in staggered DID-IV designs, built on the recent work in Miyaji (2024). This allows us to decompose the TWFEIV estimand into a weighted average of the target parameter in staggered DID-IV designs. Finally, we assess the causal interpretation of the TWFEIV estimand under a variety of restrictions on the effects of the instrument on the treatment and outcome, which clarifies the sufficient conditions for this estimand to attain its causal interpretation.

We note that this paper is distinct from the recent IV literature on the causal interpretation of two-stage least squares (TSLS) estimators with covariates under heterogeneous treatment effects (Słoczyński (2020), Blandhol et al. (2022)). These recent studies investigate the causal interpretation of the TSLS estimand with covariates under the random variation of the instrument conditional on covariates, and cast doubt on the LATE (or LATEs) interpretation of this estimand. In this literature, the identifying variations come from the assignment process of the instrument. In this paper, however, we investigate the causal interpretation of the TWFEIV estimand (where time and unit dummies can be viewed as covariates) under staggered DID-IV designs: our identifying variations mainly come from the parallel trends assumptions in the treatment and the outcome over time.

2   Instrumented difference-in-differences decomposition

In this section, we present a decomposition result for the two-way fixed effects instrumental variable (TWFEIV) estimator in multiple time period settings with the staggered adoption of the instrument across units.

2.1   Set up

We introduce the notation we use throughout this article. We consider a panel data setting with $T$ periods and $N$ units. For each $i\in\{1,\dots,N\}$ and $t\in\{1,\dots,T\}$, let $Y_{i,t}$ denote the outcome, $D_{i,t}\in\{0,1\}$ denote the treatment status, and $Z_{i,t}\in\{0,1\}$ denote the instrument status. Let $D_{i}=(D_{i,1},\dots,D_{i,T})$ and $Z_{i}=(Z_{i,1},\dots,Z_{i,T})$ denote the path of the treatment and the path of the instrument for unit $i$, respectively. Throughout this article, we assume that $\{Y_{i,t},D_{i,t},Z_{i,t}\}_{t=1}^{T}$ are independent and identically distributed (i.i.d.).

We make the following assumption about the assignment process of the instrument.

Assumption 1 (Staggered adoption for $Z_{i,t}$).

For $s<t$, $Z_{i,s}\leq Z_{i,t}$ where $s,t\in\{1,\dots,T\}$.

Assumption 1 requires that once units start exposed to the instrument, they remain exposed to that instrument afterward. In the DID literature, several recent papers impose this assumption on the adoption process of the treatment and sometimes call it the "staggered treatment adoption", see, e.g., Athey and Imbens (2022), Callaway and Sant’Anna (2021) and Sun and Abraham (2021).

Given Assumption 1, we can uniquely characterize the instrument path by the time period when unit $i$ is first exposed to the instrument, denoted by $E_{i}=\min\{t:Z_{i,t}=1\}$. If unit $i$ is not exposed to the instrument in any time period, we define $E_{i}=\infty$. Based on the initial exposure period $E_{i}$, we can uniquely partition units into mutually exclusive and exhaustive cohorts $e$ for $e\in\{1,2,\dots,T,\infty\}$: all the units in cohort $e$ are first exposed to the instrument at time $E_{i}=e$. Hereafter, to ease the notation, we assume that the data contain $K$ cohorts $(K\leq T)$ where $e\in\{1,\dots,k,\dots,K\}$, and define $U$ as the never exposed cohort $E_{i}=\infty$.

Let $n_{e}$ be the relative sample share for cohort $e$ and let $\bar{Z}_{e}$ be the time share of the exposure to the instrument for cohort $e$:

$$n_{e}\equiv\frac{\sum_{i}\mathbf{1}\{E_{i}=e\}}{N},\qquad\bar{Z}_{e}\equiv\frac{\sum_{t}\mathbf{1}\{t\geq e\}}{T}.$$

We also define $n_{ab}\equiv\frac{n_{a}}{n_{a}+n_{b}}$ to be the relative sample share between cohort $a$ and $b$.
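As a minimal sketch of these definitions (not the author's code: the function names `first_exposure` and `cohort_shares`, and the list-of-lists layout of the instrument paths, are assumptions made for illustration), the following pure-Python snippet recovers $E_{i}$, $n_{e}$, and $\bar{Z}_{e}$ from a balanced panel and checks Assumption 1 along the way.

```python
import math


def first_exposure(z_path):
    """Return E_i = min{t : Z_{i,t} = 1} (periods are 1-indexed), or inf if never exposed."""
    for t, z in enumerate(z_path, start=1):
        if z == 1:
            return t
    return math.inf


def cohort_shares(z_paths):
    """Map each cohort e found in the data to the pair (n_e, Zbar_e).

    z_paths is a hypothetical input layout: one list of 0/1 values per unit,
    all of the same length T (balanced panel).
    """
    N = len(z_paths)
    T = len(z_paths[0])
    counts = {}
    for path in z_paths:
        # Assumption 1: once exposed, a unit stays exposed (non-decreasing path).
        if any(a > b for a, b in zip(path, path[1:])):
            raise ValueError("instrument path violates staggered adoption")
        e = first_exposure(path)
        counts[e] = counts.get(e, 0) + 1
    shares = {}
    for e, count in counts.items():
        # Zbar_e is the share of periods with t >= e; it is 0 for the never exposed cohort.
        zbar = 0.0 if e == math.inf else (T - e + 1) / T
        shares[e] = (count / N, zbar)
    return shares


# Four units, T = 4: two first exposed in period 3, one in period 2, one never exposed.
paths = [[0, 0, 1, 1], [0, 0, 1, 1], [0, 1, 1, 1], [0, 0, 0, 0]]
print(cohort_shares(paths))
```

The never exposed cohort $U$ is keyed by `math.inf` here, mirroring the convention $E_{i}=\infty$.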

In contrast to the staggered adoption of the instrument across units, we allow a general adoption process for the treatment: the treatment can potentially turn on/off repeatedly over time. de Chaisemartin and D’Haultfœuille (2020) and Imai and Kim (2021) consider the same setting in the recent DID literature.

The notations $PRE(a)$, $MID(a,b)$, and $POST(a)$ represent the following time windows: $PRE(a)\equiv[1,a)$, $MID(a,b)\equiv[a,b)$, and $POST(a)\equiv[a,T]$. Let $\bar{R}_{e}^{POST(a)}$ be the sample mean of the random variable $R_{i,t}$ in cohort $e$ during the time window $POST(a)$:

$$\bar{R}_{e}^{POST(a)}\equiv\frac{1}{T-(a-1)}\sum_{t=a}^{T}\left[\frac{\sum_{i}R_{i,t}\mathbf{1}\{E_{i}=e\}}{\sum_{i}\mathbf{1}\{E_{i}=e\}}\right].$$

We define $\bar{R}_{e}^{PRE(a)}$ and $\bar{R}_{e}^{MID(a,b)}$ analogously, representing the sample mean of the random variable $R_{i,t}$ in cohort $e$ during the time windows $PRE(a)$ and $MID(a,b)$, respectively.
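The window means can be sketched in a few lines of pure Python. This is an illustrative sketch only: the helper names `window_periods` and `cohort_window_mean`, and the convention that `R[i][t - 1]` stores $R_{i,t}$ while `E[i]` stores $E_{i}$, are assumptions introduced here.

```python
def window_periods(kind, a, b=None, T=None):
    """Return the 1-indexed periods in PRE(a) = [1, a), MID(a, b) = [a, b) or POST(a) = [a, T]."""
    if kind == "PRE":
        return list(range(1, a))
    if kind == "MID":
        return list(range(a, b))
    if kind == "POST":
        return list(range(a, T + 1))
    raise ValueError(f"unknown window: {kind}")


def cohort_window_mean(R, E, e, periods):
    """Average over `periods` of the cross-sectional mean of R among units with E[i] == e."""
    members = [i for i, e_i in enumerate(E) if e_i == e]
    if not members or not periods:
        raise ValueError("empty cohort or empty time window")
    per_period = [sum(R[i][t - 1] for i in members) / len(members) for t in periods]
    return sum(per_period) / len(per_period)


# Hypothetical data: two units in cohort 3, one never exposed unit, T = 4.
R = [[1, 2, 3, 4], [3, 4, 5, 6], [0, 0, 0, 0]]
E = [3, 3, float("inf")]
post = cohort_window_mean(R, E, 3, window_periods("POST", 3, T=4))
pre = cohort_window_mean(R, E, 3, window_periods("PRE", 3))
print(post - pre)  # change in the cohort-3 mean between PRE(3) and POST(3)
```

In a balanced panel, averaging the per-period cohort means is the same as pooling all cohort observations in the window, so either implementation would match the definition.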

2.2   Decomposing the TWFEIV estimator

We consider a TWFEIV regression in multiple time period settings with the staggered adoption of the instrument across units:

$$Y_{i,t}=\phi_{i.}+\lambda_{t.}+\beta_{IV}D_{i,t}+v_{i,t}, \tag{3}$$
$$D_{i,t}=\gamma_{i.}+\zeta_{t.}+\pi Z_{i,t}+\eta_{i,t}. \tag{4}$$

By substituting the first stage regression (4) into the structural equation (3), we obtain the reduced form regression:

$$Y_{i,t}=\phi_{i.}+\lambda_{t.}+\alpha Z_{i,t}+v_{i,t}. \tag{5}$$
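To make the substitution explicit, the following worked step spells out the composite terms. It is a sketch that keeps the composite fixed effects and error written out in full, whereas the reduced form above reuses the symbols $\phi_{i.}$, $\lambda_{t.}$, and $v_{i,t}$ for them:

$$\begin{aligned}Y_{i,t}&=\phi_{i.}+\lambda_{t.}+\beta_{IV}\left(\gamma_{i.}+\zeta_{t.}+\pi Z_{i,t}+\eta_{i,t}\right)+v_{i,t}\\&=(\phi_{i.}+\beta_{IV}\gamma_{i.})+(\lambda_{t.}+\beta_{IV}\zeta_{t.})+\underbrace{\beta_{IV}\pi}_{\alpha}Z_{i,t}+(v_{i,t}+\beta_{IV}\eta_{i,t}).\end{aligned}$$

Hence $\alpha=\beta_{IV}\pi$, so that $\beta_{IV}=\alpha/\pi$ whenever $\pi\neq 0$.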

The ratio of the reduced form coefficient $\hat{\alpha}$ to the first stage coefficient $\hat{\pi}$ yields the TWFEIV estimator $\hat{\beta}_{IV}$. By the Frisch-Waugh-Lovell theorem, the IV estimator $\hat{\beta}_{IV}$ is equal to the ratio between the coefficient from regressing $Y_{i,t}$ on the double demeaning variable $\tilde{Z}_{i,t}$ and the coefficient from regressing $D_{i,t}$ on the same variable:

$$\hat{\beta}_{IV}=\frac{\frac{1}{NT}\sum_{i}\sum_{t}\tilde{Z}_{i,t}Y_{i,t}}{\frac{1}{NT}\sum_{i}\sum_{t}\tilde{Z}_{i,t}D_{i,t}}, \tag{6}$$

where $\tilde{Z}_{i,t}$ is the double demeaning variable defined below:

$$\begin{aligned}\tilde{Z}_{i,t}&=Z_{i,t}-\frac{1}{T}\sum_{t=1}^{T}Z_{i,t}-\frac{1}{N}\sum_{i=1}^{N}Z_{i,t}+\frac{1}{NT}\sum_{t=1}^{T}\sum_{i=1}^{N}Z_{i,t}\\&\equiv(Z_{i,t}-\bar{Z}_{i})-(\bar{Z}_{t}-\bar{\bar{Z}}).\end{aligned}$$
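Equation (6) can be computed directly without running any regression. The snippet below is a minimal sketch for a balanced panel stored as nested lists; the function names `double_demean` and `twfeiv`, and the data in the example, are illustrative assumptions rather than anything taken from the paper.

```python
def double_demean(Z):
    """Return Z_{i,t} minus its unit mean and period mean, plus the grand mean."""
    N, T = len(Z), len(Z[0])
    unit_mean = [sum(row) / T for row in Z]
    time_mean = [sum(Z[i][t] for i in range(N)) / N for t in range(T)]
    grand_mean = sum(unit_mean) / N
    return [
        [Z[i][t] - unit_mean[i] - time_mean[t] + grand_mean for t in range(T)]
        for i in range(N)
    ]


def twfeiv(Y, D, Z):
    """TWFEIV estimator as the ratio in equation (6): reduced form over first stage."""
    N, T = len(Z), len(Z[0])
    Z_tilde = double_demean(Z)
    numerator = sum(Z_tilde[i][t] * Y[i][t] for i in range(N) for t in range(T)) / (N * T)
    denominator = sum(Z_tilde[i][t] * D[i][t] for i in range(N) for t in range(T)) / (N * T)
    if denominator == 0:
        raise ZeroDivisionError("the instrument has no first stage in this sample")
    return numerator / denominator


# Hypothetical two-unit, two-period panel: unit 0 becomes exposed in period 2.
Y = [[1.0, 5.0], [2.0, 3.0]]
D = [[0, 1], [0, 0]]
Z = [[0, 1], [0, 0]]
print(twfeiv(Y, D, Z))
```

In this example the DID of the outcome is $(5-1)-(3-2)=3$ and the DID of the treatment is $(1-0)-(0-0)=1$, so the ratio is $3$.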

Note that the TWFEIV regression runs the two-way fixed effects (TWFE) regression twice, as can be seen in equations (4) and (5). Because we assume the staggered assignment of the instrument across units, the decomposition result for the TWFE estimator shown by Goodman-Bacon (2021) implies that the TWFE coefficient on $Z_{i,t}$ in the first stage or the reduced form regression is equal to a weighted average of all possible $2\times 2$ DID estimators of the treatment or the outcome.

Consider the simple setting where we have only two periods and two cohorts: one cohort is not exposed to the instrument during the two periods ($E_{i}=U$), whereas the other cohort starts exposed to the instrument in the second period ($E_{i}=2$). In this setting, the TWFEIV estimator takes the following form, the so-called Wald-DID estimator (de Chaisemartin and D’Haultfœuille (2018), Miyaji (2024)):

$$\hat{\beta}_{IV}=\frac{\bar{Y}_{2,2}-\bar{Y}_{2,1}-(\bar{Y}_{U,2}-\bar{Y}_{U,1})}{\bar{D}_{2,2}-\bar{D}_{2,1}-(\bar{D}_{U,2}-\bar{D}_{U,1})},$$

where $\bar{R}_{a,t}$ is the sample mean of the random variable $R_{i,t}$ for cohort $E_{i}=a$ in time $t$. This estimator scales the DID estimator of the outcome by the DID estimator of the treatment between cohort $E_{i}=U$ and $E_{i}=2$.
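For concreteness, the two-cohort, two-period Wald-DID estimator can be sketched as follows. The function name `wald_did`, the dictionary layout keyed by `(cohort, period)`, and the cell means in the example are all hypothetical choices made for illustration.

```python
def wald_did(y_mean, d_mean):
    """Wald-DID for cohorts 2 and "U" over periods 1 and 2.

    y_mean and d_mean map (cohort, period) to the sample mean of the
    outcome and of the treatment in that cell.
    """
    did_outcome = (y_mean[(2, 2)] - y_mean[(2, 1)]) - (y_mean[("U", 2)] - y_mean[("U", 1)])
    did_treatment = (d_mean[(2, 2)] - d_mean[(2, 1)]) - (d_mean[("U", 2)] - d_mean[("U", 1)])
    if did_treatment == 0:
        raise ZeroDivisionError("the DID of the treatment is zero, so the ratio is undefined")
    return did_outcome / did_treatment


# Hypothetical cell means: the exposed cohort's treatment rate rises by 0.5.
y_mean = {(2, 1): 10.0, (2, 2): 13.0, ("U", 1): 9.0, ("U", 2): 10.0}
d_mean = {(2, 1): 0.25, (2, 2): 0.75, ("U", 1): 0.25, ("U", 2): 0.25}
print(wald_did(y_mean, d_mean))
```

Here the DID of the outcome is $3-1=2$ and the DID of the treatment is $0.5-0=0.5$, so the Wald-DID is $2/0.5=4$.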

The above observations give us the intuition for how to decompose the TWFEIV estimator in settings with the staggered adoption of the instrument across units: we expect that the TWFEIV estimator can be decomposed into a weighted average of all possible $2\times 2$ Wald-DID estimators (instead of DID estimators).

To clarify this intuition, assume for now that we have only three cohorts: an early exposed cohort $k$, a middle exposed cohort $l$ $(k<l)$, and a never exposed cohort $U$ $(E_{i}=\infty)$. Figure 1 plots the simulated data for the time trends of the average treatment (first stage) and the average outcome (reduced form) in three cohorts.

Figure 1: Instrumented difference-in-differences with three cohorts. Notes: This figure plots the simulated data for the time trends of the average treatment (first stage) and the average outcome (reduced form) with time length $T=100$ in three cohorts: an early exposed cohort $k$, which is exposed to the instrument at $k=\frac{34}{100}T$; a middle exposed cohort $l$, which is exposed to the instrument at $l=\frac{80}{100}T$; and a never exposed cohort, $U$. The $x$-axis consists of three time windows: the pre-exposed period for cohort $k$, $[1,k-1]$, denoted by $PRE(k)$; the middle exposed period when cohort $k$ is already exposed but cohort $l$ is not yet exposed, $[k,l-1]$, denoted by $MID(k,l)$; and the post-exposed period when cohort $l$ is already exposed, $[l,T]$, denoted by $POST(l)$. The effects of the instrument on the treatment and the outcome are $0.15$ and $9$ in cohort $k$, respectively, and $0.1$ and $10$ in cohort $l$, respectively.

From the data structure, we can construct the Wald-DID estimator in three ways. First, we can compare the evolution of the treatment and the outcome between exposed cohort $j=k,l$ and never exposed cohort $U$, exploiting the time windows $POST(j)$ and $PRE(j)$, which we call an Unexposed/Exposed design:

$$\begin{aligned}\hat{\beta}_{IV,jU}^{2\times 2}&\equiv\frac{\left(\bar{y}_{j}^{POST(j)}-\bar{y}_{j}^{PRE(j)}\right)-\left(\bar{y}_{U}^{POST(j)}-\bar{y}_{U}^{PRE(j)}\right)}{\left(\bar{D}_{j}^{POST(j)}-\bar{D}_{j}^{PRE(j)}\right)-\left(\bar{D}_{U}^{POST(j)}-\bar{D}_{U}^{PRE(j)}\right)},\qquad j=k,l,\\&\equiv\frac{\hat{\beta}_{jU}^{2\times 2}}{\hat{D}_{jU}^{2\times 2}},\qquad j=k,l.\end{aligned}\tag{7}$$

Second, we can construct the Wald-DID estimator, leveraging variation in the timing of the initial exposure to the instrument between exposed cohorts. Consider an early exposed cohort $k$ and a middle exposed cohort $l$. Before period $l$, the early exposed cohort $k$ is already exposed to the instrument, while the middle exposed cohort $l$ is not yet exposed to the instrument. In this setting, the middle exposed cohort $l$ plays the role of the control group in both the first stage and the reduced form. From this observation, we can compare the evolution of the treatment and the outcome between the early exposed cohort $k$ and the middle exposed cohort $l$, exploiting the time windows $MID(k,l)$ and $PRE(k)$, which we call an Exposed/Not Yet Exposed design:

$$\begin{aligned}\hat{\beta}_{IV,kl}^{2\times 2,k}&\equiv\frac{\left(\bar{y}_{k}^{MID(k,l)}-\bar{y}_{k}^{PRE(k)}\right)-\left(\bar{y}_{l}^{MID(k,l)}-\bar{y}_{l}^{PRE(k)}\right)}{\left(\bar{D}_{k}^{MID(k,l)}-\bar{D}_{k}^{PRE(k)}\right)-\left(\bar{D}_{l}^{MID(k,l)}-\bar{D}_{l}^{PRE(k)}\right)}\\&\equiv\frac{\hat{\beta}_{kl}^{2\times 2,k}}{\hat{D}_{kl}^{2\times 2,k}}.\end{aligned}\tag{8}$$

Finally, if we focus on the middle exposed cohort $l$, which changes the exposure status from being unexposed to being exposed at time $l$, we can regard the early exposed cohort $k$ as the control group after time $l$ because this cohort is already exposed to the instrument at time $l$. We can compare the evolution of the treatment and the outcome between the early exposed cohort $k$ and the middle exposed cohort $l$, exploiting the time windows $MID(k,l)$ and $POST(l)$, which we call an Exposed/Exposed Shift design:

$$\begin{aligned}\hat{\beta}_{IV,kl}^{2\times 2,l}&\equiv\frac{\left(\bar{y}_{l}^{POST(l)}-\bar{y}_{l}^{MID(k,l)}\right)-\left(\bar{y}_{k}^{POST(l)}-\bar{y}_{k}^{MID(k,l)}\right)}{\left(\bar{D}_{l}^{POST(l)}-\bar{D}_{l}^{MID(k,l)}\right)-\left(\bar{D}_{k}^{POST(l)}-\bar{D}_{k}^{MID(k,l)}\right)}\\&\equiv\frac{\hat{\beta}_{kl}^{2\times 2,l}}{\hat{D}_{kl}^{2\times 2,l}}.\end{aligned}\tag{9}$$

In each type of the DID-IV design, we have three sources of variation. First, each design exploits a subsample of all $NT$ observations. The Unexposed/Exposed DID-IV design in (7) uses two cohorts and all time periods, so the relative sample share is $n_{k}+n_{u}$. The Exposed/Not Yet Exposed DID-IV design in (8) uses two cohorts but exploits only the time periods before period $l$, so the relative sample share is $(1-\bar{Z}_{l})(n_{k}+n_{l})$. The Exposed/Exposed Shift DID-IV design in (9) uses two cohorts but exploits only the time periods from period $k$ onward, so the relative sample share is $\bar{Z}_{k}(n_{k}+n_{l})$.

Second, the variation in each type of the DID-IV design partly comes from the variation of the instrument in its subsample. It is equal to the variance of the double demeaning variable $\tilde{Z}_{i,t}$ in each design:

$$\hat{V}_{jU}^{Z}\equiv n_{jU}(1-n_{jU})\bar{Z}_{j}(1-\bar{Z}_{j}),\qquad j=k,l, \tag{10}$$
$$\hat{V}_{kl}^{Z,k}\equiv n_{kl}(1-n_{kl})\left(\frac{\bar{Z}_{k}-\bar{Z}_{l}}{1-\bar{Z}_{l}}\right)\left(\frac{1-\bar{Z}_{k}}{1-\bar{Z}_{l}}\right), \tag{11}$$
$$\hat{V}_{kl}^{Z,l}\equiv n_{kl}(1-n_{kl})\left(\frac{\bar{Z}_{l}}{\bar{Z}_{k}}\right)\left(\frac{\bar{Z}_{k}-\bar{Z}_{l}}{\bar{Z}_{k}}\right), \tag{12}$$

where $\hat{V}_{jU}^{Z}$, $\hat{V}_{kl}^{Z,k}$, and $\hat{V}_{kl}^{Z,l}$ represent the variance of the double demeaning variable $\tilde{Z}_{i,t}$ in Unexposed/Exposed, Exposed/Not Yet Exposed, and Exposed/Exposed Shift DID-IV designs, respectively. In the staggered DID set up, Goodman-Bacon (2021) also describes these two sources of variation, that is, the relative sample share and the variance of the double demeaning treatment variable in each type of the DID designs.
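The three variance terms in (10)-(12) are simple functions of the cohort shares and exposure-time shares. The sketch below writes them out; the function and argument names are assumptions made for illustration, and the inputs are taken to satisfy $0<\bar{Z}_{l}<\bar{Z}_{k}<1$ so that no denominator is zero.

```python
def v_unexposed_exposed(n_j, n_u, zbar_j):
    """Equation (10): instrument variance in the Unexposed/Exposed design."""
    n_ju = n_j / (n_j + n_u)
    return n_ju * (1 - n_ju) * zbar_j * (1 - zbar_j)


def v_exposed_not_yet(n_k, n_l, zbar_k, zbar_l):
    """Equation (11): instrument variance in the Exposed/Not Yet Exposed design."""
    n_kl = n_k / (n_k + n_l)
    return (
        n_kl * (1 - n_kl)
        * ((zbar_k - zbar_l) / (1 - zbar_l))
        * ((1 - zbar_k) / (1 - zbar_l))
    )


def v_exposed_shift(n_k, n_l, zbar_k, zbar_l):
    """Equation (12): instrument variance in the Exposed/Exposed Shift design."""
    n_kl = n_k / (n_k + n_l)
    return n_kl * (1 - n_kl) * (zbar_l / zbar_k) * ((zbar_k - zbar_l) / zbar_k)


# Hypothetical inputs: equal cohort sizes, cohort k exposed half the time, cohort l a quarter.
print(v_unexposed_exposed(0.25, 0.25, 0.5))
print(v_exposed_not_yet(0.25, 0.25, 0.5, 0.25))
print(v_exposed_shift(0.25, 0.25, 0.5, 0.25))
```

Each term is largest when the two cohorts are of equal size and the exposure switch happens in the middle of the design's own time window, which is the same intuition as in the staggered DID case.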

Unlike the staggered DID set up, however, each DID-IV design has an additional source of variation: the effect of the instrument on the treatment in the first stage. This comes from the fact that each DID-IV design allows noncompliance with the treatment when units are exposed to the instrument. The amount of this variation is equal to the $2\times 2$ DID estimator of the treatment in each DID-IV design:

$$\hat{D}_{jU}^{2\times 2}\equiv\left(\bar{D}_{j}^{POST(j)}-\bar{D}_{j}^{PRE(j)}\right)-\left(\bar{D}_{U}^{POST(j)}-\bar{D}_{U}^{PRE(j)}\right),\qquad j=k,l,$$
$$\hat{D}_{kl}^{2\times 2,k}\equiv\left(\bar{D}_{k}^{MID(k,l)}-\bar{D}_{k}^{PRE(k)}\right)-\left(\bar{D}_{l}^{MID(k,l)}-\bar{D}_{l}^{PRE(k)}\right),$$
$$\hat{D}_{kl}^{2\times 2,l}\equiv\left(\bar{D}_{l}^{POST(l)}-\bar{D}_{l}^{MID(k,l)}\right)-\left(\bar{D}_{k}^{POST(l)}-\bar{D}_{k}^{MID(k,l)}\right).$$

Note that the denominator of the TWFEIV estimator β^IV\hat{\beta}_{IV} in (6), which we denote C^D,Z\hat{C}^{D,Z} hereafter, measures the covariance between the instrument Zi,tZ_{i,t} and the treatment Di,tD_{i,t} in whole samples. By some calculations (see the proof of Theorem 1 below), one can show that C^D,Z\hat{C}^{D,Z} is equal to a weighted average of all possible 2×22\times 2 DID estimators of the treatment in each DID-IV design:

$$
\hat{C}^{D,Z}=\sum_{k\neq U}\hat{w}_{kU}\hat{D}_{kU}^{2\times 2}+\sum_{k\neq U}\sum_{l>k}\left[\hat{w}_{kl}^{k}\hat{D}_{kl}^{2\times 2,k}+\hat{w}_{kl}^{l}\hat{D}_{kl}^{2\times 2,l}\right],
$$

where the weights are:

$$
\begin{aligned}
\hat{w}_{kU} &= (n_{k}+n_{u})^{2}\,\hat{V}_{kU}^{Z},\\
\hat{w}_{kl}^{k} &= \left((n_{k}+n_{l})(1-\bar{Z}_{l})\right)^{2}\hat{V}_{kl}^{Z,k},\\
\hat{w}_{kl}^{l} &= \left((n_{k}+n_{l})\bar{Z}_{k}\right)^{2}\hat{V}_{kl}^{Z,l}.
\end{aligned}
$$

Hereafter, we refer to $\hat{w}_{kU}$, $\hat{w}_{kl}^{k}$, and $\hat{w}_{kl}^{l}$ as the first stage weights. This decomposition result for $\hat{C}^{D,Z}$ is almost identical to that of Goodman-Bacon (2021) for the TWFE estimator under staggered DID designs. The slight difference is that each weight here is not scaled by the variance of the double demeaning variable $\tilde{Z}_{i,t}$ in the whole sample.

We now present the decomposition theorem for the TWFEIV estimator under the staggered assignment of the instrument across units. Theorem 1 below generalizes the decomposition result of Goodman-Bacon (2021) for the TWFE estimator under the staggered assignment of the treatment across units.

Theorem 1 (Instrumented Difference-in-Differences Decomposition Theorem).

Suppose that there exist $K$ cohorts, $e=1,\dots,k,\dots,K$. The data may also contain a never exposed cohort $U$. Then, the two-way fixed effects instrumental variable estimator $\hat{\beta}_{IV}$ in (6) is a weighted average of all possible $2\times 2$ Wald-DID estimators:

$$
\hat{\beta}_{IV}=\sum_{k\neq U}\hat{w}_{IV,kU}\hat{\beta}_{IV,kU}^{2\times 2}+\sum_{k\neq U}\sum_{l>k}\left[\hat{w}_{IV,kl}^{k}\hat{\beta}_{IV,kl}^{2\times 2,k}+\hat{w}_{IV,kl}^{l}\hat{\beta}_{IV,kl}^{2\times 2,l}\right].
$$

The $2\times 2$ Wald-DID estimators are:

$$
\begin{aligned}
\hat{\beta}_{IV,kU}^{2\times 2} &\equiv \frac{\left(\bar{y}_{k}^{POST(k)}-\bar{y}_{k}^{PRE(k)}\right)-\left(\bar{y}_{U}^{POST(k)}-\bar{y}_{U}^{PRE(k)}\right)}{\left(\bar{D}_{k}^{POST(k)}-\bar{D}_{k}^{PRE(k)}\right)-\left(\bar{D}_{U}^{POST(k)}-\bar{D}_{U}^{PRE(k)}\right)},\\
\hat{\beta}_{IV,kl}^{2\times 2,k} &\equiv \frac{\left(\bar{y}_{k}^{MID(k,l)}-\bar{y}_{k}^{PRE(k)}\right)-\left(\bar{y}_{l}^{MID(k,l)}-\bar{y}_{l}^{PRE(k)}\right)}{\left(\bar{D}_{k}^{MID(k,l)}-\bar{D}_{k}^{PRE(k)}\right)-\left(\bar{D}_{l}^{MID(k,l)}-\bar{D}_{l}^{PRE(k)}\right)},\\
\hat{\beta}_{IV,kl}^{2\times 2,l} &\equiv \frac{\left(\bar{y}_{l}^{POST(l)}-\bar{y}_{l}^{MID(k,l)}\right)-\left(\bar{y}_{k}^{POST(l)}-\bar{y}_{k}^{MID(k,l)}\right)}{\left(\bar{D}_{l}^{POST(l)}-\bar{D}_{l}^{MID(k,l)}\right)-\left(\bar{D}_{k}^{POST(l)}-\bar{D}_{k}^{MID(k,l)}\right)}.
\end{aligned}
$$

The weights are:

$$
\begin{aligned}
\hat{w}_{IV,kU} &= \frac{\hat{w}_{kU}\hat{D}_{kU}^{2\times 2}}{\hat{C}^{D,Z}},\\
\hat{w}_{IV,kl}^{k} &= \frac{\hat{w}_{kl}^{k}\hat{D}_{kl}^{2\times 2,k}}{\hat{C}^{D,Z}},\\
\hat{w}_{IV,kl}^{l} &= \frac{\hat{w}_{kl}^{l}\hat{D}_{kl}^{2\times 2,l}}{\hat{C}^{D,Z}},
\end{aligned}
$$

and sum to one, that is, we have $\sum_{k\neq U}\hat{w}_{IV,kU}+\sum_{k\neq U}\sum_{l>k}[\hat{w}_{IV,kl}^{k}+\hat{w}_{IV,kl}^{l}]=1$.

Proof.

See Appendix A. ∎

Theorem 1 shows that when the assignment of the instrument is staggered across units, the TWFEIV estimator is a weighted average of all possible $2\times 2$ Wald-DID estimators. If there exist $K$ cohorts in the data, we have $K^{2}-K$ Wald-DID estimators, which come from either Exposed/Not Yet Exposed designs as in (8) or Exposed/Exposed Shift designs as in (9). If the data contain a never exposed cohort $U$, we have $K$ additional Wald-DID estimators, which come from Unexposed/Exposed designs as in (7). If both situations occur, the TWFEIV estimator equals a weighted average of $K^{2}$ Wald-DID estimators.
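The counting argument above can be checked by enumeration. The following is a minimal sketch, not part of the original analysis; the function name and the tuple layout are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def enumerate_wald_did_designs(cohorts, has_never_exposed):
    """List the 2x2 comparisons behind each Wald-DID estimator.

    cohorts: exposure dates of the K ever-exposed cohorts (hypothetical input).
    has_never_exposed: whether a never exposed cohort U is in the data.
    """
    designs = []
    if has_never_exposed:
        # One Unexposed/Exposed comparison per ever-exposed cohort: K in total.
        designs += [(k, "U", "Unexposed/Exposed") for k in cohorts]
    # Each pair k < l yields two comparisons: K * (K - 1) in total.
    for k, l in combinations(sorted(cohorts), 2):
        designs.append((k, l, "Exposed/Not Yet Exposed"))
        designs.append((k, l, "Exposed/Exposed Shift"))
    return designs


K = 4
without_u = enumerate_wald_did_designs(range(1, K + 1), has_never_exposed=False)
with_u = enumerate_wald_did_designs(range(1, K + 1), has_never_exposed=True)
print(len(without_u), len(with_u))  # 12 16
```

With $K=4$, the enumeration returns $K^{2}-K=12$ comparisons without a never exposed cohort and $K^{2}=16$ with one.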

The weight assigned to each Wald-DID estimator consists of three parts: the relative sample share squared, the variance of the double demeaning variable $\tilde{Z}_{i,t}$, and the DID estimator of the treatment in each DID-IV design. The first part depends on the sample share of the two cohorts and the timing of the initial exposure date. The second part reflects the variation of the instrument in the subsample, represented by (10)-(12), and depends on the relative sample share between the two cohorts and the timing of the initial exposure date. Finally, the remaining part reflects variation in the evolution of the treatment between the two cohorts. Note that the weight is not guaranteed to be non-negative in finite samples: although the first and second parts are always non-negative, the DID estimator of the treatment can be negative in the data.
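To illustrate how the three parts combine, the sketch below builds the weights $\hat{w}\hat{D}^{2\times 2}/\hat{C}^{D,Z}$ and the Wald-DID ratios from given 2x2 components. It is a sketch under stated assumptions: the function name is hypothetical, and the input numbers are made up (they are not from the paper) so that one treatment DID is negative and produces a negative weight.

```python
import math


def twfeiv_from_components(first_stage_weights, did_treatment, did_outcome):
    """Combine 2x2 components into IV weights, Wald-DIDs, and their weighted average."""
    # C is the weighted sum of the 2x2 DID estimates of the treatment.
    c = sum(w * d for w, d in zip(first_stage_weights, did_treatment))
    # Each weight is the first stage weight times the treatment DID, scaled by C.
    iv_weights = [w * d / c for w, d in zip(first_stage_weights, did_treatment)]
    # Each Wald-DID is the outcome DID divided by the treatment DID.
    wald = [y / d for y, d in zip(did_outcome, did_treatment)]
    beta = sum(s * b for s, b in zip(iv_weights, wald))
    return iv_weights, wald, beta


# Hypothetical inputs: the second comparison has a negative treatment DID.
w = [0.04, 0.03, 0.02, 0.01]
d = [0.20, -0.05, 0.10, 0.30]
y = [1.0, 0.5, 0.8, 1.5]

iv_weights, wald, beta = twfeiv_from_components(w, d, y)
reduced_form_over_first_stage = sum(a * b for a, b in zip(w, y)) / sum(
    a * b for a, b in zip(w, d)
)
print(iv_weights, beta)
```

In this made-up example the weights still sum to one, the second weight is negative, and the weighted average of Wald-DIDs coincides with the ratio of the weighted outcome DIDs to the weighted treatment DIDs, which follows algebraically from the form of the weights.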

Theorem 1 also shows that if we restrict the data to only two cohorts (cohorts $k$ and $l$), the TWFEIV estimator $\hat{\beta}_{IV,kl}^{2\times 2}$ in the subsample can be written as:

$$
\hat{\beta}_{IV,kl}^{2\times 2}=\frac{\hat{w}_{kl}^{k}\hat{D}_{kl}^{2\times 2,k}}{\hat{w}_{kl}^{k}\hat{D}_{kl}^{2\times 2,k}+\hat{w}_{kl}^{l}\hat{D}_{kl}^{2\times 2,l}}\hat{\beta}_{IV,kl}^{2\times 2,k}+\frac{\hat{w}_{kl}^{l}\hat{D}_{kl}^{2\times 2,l}}{\hat{w}_{kl}^{k}\hat{D}_{kl}^{2\times 2,k}+\hat{w}_{kl}^{l}\hat{D}_{kl}^{2\times 2,l}}\hat{\beta}_{IV,kl}^{2\times 2,l}.
$$

The TWFEIV estimator $\hat{\beta}_{IV,kl}^{2\times 2}$ is a weighted average of the Wald-DID estimators from the Exposed/Not Yet Exposed design and the Exposed/Exposed Shift design, and the weight assigned to each Wald-DID estimator reflects the first stage weight and the DID estimator of the treatment in each DID-IV design.

To make the DID-IV decomposition theorem concrete, we provide a simple numerical example. Suppose we have three cohorts of equal sample size, as shown in Figure 1. In this figure, we set an early exposed period $k$ and a middle exposed period $l$ such that $\bar{Z}_{k}=0.67$ and $\bar{Z}_{l}=0.21$. We assume that the effect of the instrument on the treatment is $0.15$ in cohort $k$ and $0.1$ in cohort $l$ over time. This means that the instrument induces more units into treatment in cohort $k$ than in cohort $l$, and that the effects are stable in both cohorts. The DID estimates of the treatment are $\{\hat{D}_{kU}^{2\times 2},\hat{D}_{lU}^{2\times 2},\hat{D}_{kl}^{2\times 2,k},\hat{D}_{kl}^{2\times 2,l}\}=\{0.15,0.1,0.15,0.1\}$. We also assume that the effect of the instrument on the outcome through treatment is $9$ in cohort $k$ and $10$ in cohort $l$ over time. The DID estimates of the outcome are $\{\hat{Y}_{kU}^{2\times 2},\hat{Y}_{lU}^{2\times 2},\hat{Y}_{kl}^{2\times 2,k},\hat{Y}_{kl}^{2\times 2,l}\}=\{9,10,9,10\}$. Dividing the DID estimate of the outcome by the DID estimate of the treatment yields the Wald-DID estimates: $\{\hat{\beta}_{kU}^{2\times 2},\hat{\beta}_{lU}^{2\times 2},\hat{\beta}_{kl}^{2\times 2,k},\hat{\beta}_{kl}^{2\times 2,l}\}=\{60,100,60,100\}$. The Wald-DID estimate is larger in cohort $l$ than in cohort $k$, even though, as already noted, the effect of the instrument on the treatment is larger in cohort $k$ than in cohort $l$.

The DID estimates of the treatment and the exposure timing determine the weight assigned to each Wald-DID estimate, holding the sample size equal across cohorts. In the above setting, the resulting weights are $\{\hat{w}_{IV,kU},\hat{w}_{IV,lU},\hat{w}_{IV,kl}^{k},\hat{w}_{IV,kl}^{l}\}=\{0.28,0.12,0.40,0.20\}$. In Unexposed/Exposed designs, we have $\hat{w}_{IV,kU}>\hat{w}_{IV,lU}$ for two reasons. First, the DID estimate of the treatment is larger in cohort $k$ than in cohort $l$, that is, $\hat{D}_{kU}^{2\times 2}=0.15>0.1=\hat{D}_{lU}^{2\times 2}$. Second, the time period $k$ is closer to the middle of the whole period than the time period $l$, that is, $\bar{Z}_{k}(1-\bar{Z}_{k})=0.22>0.17=\bar{Z}_{l}(1-\bar{Z}_{l})$, which implies $\hat{w}_{kU}>\hat{w}_{lU}$ in the first stage weights. By a similar argument, we have $\hat{w}_{IV,kl}^{k}>\hat{w}_{IV,kl}^{l}$ between the Exposed/Not Yet Exposed and Exposed/Exposed Shift designs: we have $\hat{D}_{kl}^{2\times 2,k}=0.15>0.1=\hat{D}_{kl}^{2\times 2,l}$ and $\hat{w}_{kl}^{k}>\hat{w}_{kl}^{l}$ in the first stage weights. If the DID estimates of the treatment are equal between two designs, the exposure timing matters: we have $\hat{w}_{IV,kU}<\hat{w}_{IV,kl}^{k}$ and $\hat{w}_{IV,lU}<\hat{w}_{IV,kl}^{l}$. The DID estimates of the treatment are the same in each comparison, that is, $\hat{D}_{kU}^{2\times 2}=\hat{D}_{kl}^{2\times 2,k}$ and $\hat{D}_{lU}^{2\times 2}=\hat{D}_{kl}^{2\times 2,l}$. However, the different initial exposure dates yield different first stage weights, that is, $\hat{w}_{kU}<\hat{w}_{kl}^{k}$ and $\hat{w}_{lU}<\hat{w}_{kl}^{l}$, which accounts for the difference within each of the two comparisons.

In this numerical example, the simple average of the Wald-DID estimates is $80$, and the weighted average is $60\times\frac{3}{5}+100\times\frac{2}{5}=76$, where the weight assigned to each Wald-DID estimate reflects the relative size of the DID estimate of the treatment. The TWFEIV estimate, however, is $\hat{\beta}_{IV}=60\times(0.28+0.40)+100\times(0.12+0.20)=72.8$, because it assigns more weight to the smaller Wald-DID estimate.
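The arithmetic of this example can be reproduced in a few lines. This is a minimal sketch that takes the DID estimates and the weights stated in the text as given inputs; the dictionary keys are hypothetical labels for the four comparisons, and the weights themselves are not recomputed here.

```python
import math

# Inputs copied from the numerical example in the text.
did_treatment = {"kU": 0.15, "lU": 0.10, "kl_k": 0.15, "kl_l": 0.10}
did_outcome = {"kU": 9.0, "lU": 10.0, "kl_k": 9.0, "kl_l": 10.0}
iv_weight = {"kU": 0.28, "lU": 0.12, "kl_k": 0.40, "kl_l": 0.20}

# Each Wald-DID estimate is the outcome DID over the treatment DID.
wald = {name: did_outcome[name] / did_treatment[name] for name in did_treatment}

# Unweighted mean of the four Wald-DID estimates.
simple_average = sum(wald.values()) / len(wald)

# TWFEIV estimate: Wald-DID estimates weighted by the stated weights.
twfeiv = sum(iv_weight[name] * wald[name] for name in wald)

print(wald, simple_average, twfeiv)
```

Up to floating-point rounding, the Wald-DID estimates are 60 and 100, their simple average is 80, and the weighted combination is 72.8, matching the values in the text.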

Theorem 1 is a decomposition result for the TWFEIV estimator, not for the estimand. In related work, de Chaisemartin and D’Haultfœuille (2020) decompose the TWFE estimand and highlight a problem with its use under DID designs: some weights assigned to the causal parameters in this estimand can be negative. In their appendix, the authors also decompose the TWFEIV estimand and point to the negative weight problem in this estimand. Specifically, they apply their decomposition theorem for the TWFE estimand to the numerator and the denominator of the TWFEIV estimand separately, and conclude that this estimand identifies the local average treatment effect of Imbens and Angrist (1994) only if the effects of the instrument on the treatment and the outcome are homogeneous across groups and over time. Indeed, the population coefficients on the instrument in the first stage and reduced form regressions take the form of the TWFE estimand, so their decomposition theorem for the TWFE estimand also applies to the analysis of the TWFEIV estimand. However, their decomposition of the TWFEIV estimand has some drawbacks. First, they do not formally state the target parameter and identifying assumptions in DID-IV designs. Second, their decomposition of the TWFEIV estimand is not based on the target parameter in DID-IV designs. Finally, the sufficient conditions for this estimand to have a causal interpretation are not well explored.

In the following section, we explore the causal interpretation of the TWFEIV estimand under staggered DID-IV designs. In section 3, we first define the target parameter and identifying assumptions in staggered DID-IV designs. In section 4, based on the decomposition theorem for the TWFEIV estimator, we then provide the causal interpretation of the TWFEIV estimand under staggered DID-IV designs. Finally, we investigate the sufficient conditions for this estimand to attain its causal interpretation under staggered DID-IV designs.

3   Staggered instrumented difference-in-differences

In this section, we formalize the staggered instrumented difference-in-differences (DID-IV) design, building on the recent work in Miyaji (2024). We first introduce additional notation. We then define the target parameter and identifying assumptions in staggered DID-IV designs.

3.1   Notation

First, we introduce the potential outcomes framework. Let $Y_{i,t}(d,z)$ denote the potential outcome in period $t$ when unit $i$ receives the treatment path $d\in\mathcal{S}(D)$ and the instrument path $z\in\mathcal{S}(Z)$. Similarly, let $D_{i,t}(z)$ denote the potential treatment status in period $t$ when unit $i$ receives the instrument path $z\in\mathcal{S}(Z)$.

Assumption 1 allows us to rewrite $D_{i,t}(z)$ in terms of the initial adoption date $E_{i}=e$. Let $D_{i,t}^{e}$ denote the potential treatment status in period $t$ if unit $i$ is first exposed to the instrument in period $e$. Let $D_{i,t}^{\infty}$ denote the potential treatment status in period $t$ if unit $i$ is never exposed to the instrument. Hereafter, we call $D_{i,t}^{\infty}$ the "never exposed treatment". Since the adoption date of the instrument uniquely pins down a unit’s instrument path, we can write the observed treatment status $D_{i,t}$ for unit $i$ at time $t$ as

$$
D_{i,t}=D_{i,t}^{\infty}+\sum_{1\leq e\leq T}(D_{i,t}^{e}-D_{i,t}^{\infty})\cdot\mathbf{1}\{E_{i}=e\}.
$$

We define $D_{i,t}-D_{i,t}^{\infty}$ to be the effect of the instrument on the treatment for unit $i$ at time $t$, which is the difference between the observed treatment status $D_{i,t}$ and the never exposed treatment status $D_{i,t}^{\infty}$. Hereafter, we refer to $D_{i,t}-D_{i,t}^{\infty}$ as the individual exposed effect in the first stage. In the DID literature, Callaway and Sant’Anna (2021) and Sun and Abraham (2021) define the effect of the treatment on the outcome in the same fashion.
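The switching equation for the observed treatment and the individual exposed effect can be sketched as follows. This is an illustration under stated assumptions: the function name, the array layout (units in rows, periods in columns), the use of `np.inf` to code never exposed units, and the toy data are all hypothetical.

```python
import numpy as np


def observed_treatment(d_never, d_exposed, cohort):
    """Observed treatment built from potential treatment paths.

    d_never: (N, T) array of never exposed treatment statuses.
    d_exposed: dict mapping exposure date e to an (N, T) array of statuses under e.
    cohort: length-N array of initial exposure dates (np.inf if never exposed).
    """
    observed = d_never.copy()
    for e, d_e in d_exposed.items():
        # Indicator 1{E_i = e}, broadcast across periods.
        in_cohort = (cohort == e)[:, None]
        observed = observed + (d_e - d_never) * in_cohort
    return observed


# Toy data: three units, three periods (t = 1, 2, 3).
d_never = np.array([[0, 0, 0], [1, 1, 1], [0, 0, 0]])
d_exposed = {
    2: np.array([[0, 1, 1], [1, 1, 1], [0, 1, 1]]),
    3: np.array([[0, 0, 1], [1, 1, 1], [0, 0, 0]]),
}
cohort = np.array([2, 3, np.inf])

d_obs = observed_treatment(d_never, d_exposed, cohort)
# Individual exposed effect in the first stage: observed minus never exposed status.
exposed_effect = d_obs - d_never
print(d_obs)
print(exposed_effect)
```

In this toy data, only the first unit (cohort 2) changes its treatment status in response to exposure, so its exposed effect is one from period 2 onward and zero elsewhere.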

Next, we introduce the group variable that describes the type of unit $i$ at time $t$, based on how its potential treatment choice at time $t$ responds to the instrument path $z$. Let $G_{i,e,t}\equiv(D_{i,t}^{\infty},D_{i,t}^{e})$ $(t\geq e)$ be the group variable at time $t$ for unit $i$ and the initial exposure date $e$. Specifically, the first element $D_{i,t}^{\infty}$ represents the treatment status at time $t$ if unit $i$ is never exposed to the instrument ($E_{i}=\infty$), and the second element $D_{i,t}^{e}$ represents the treatment status at time $t$ if unit $i$ is first exposed to the instrument at $E_{i}=e$. Following the terminology in Imbens and Angrist (1994), we define $G_{i,e,t}=(0,0)\equiv NT_{e,t}$ to be the never-takers, $G_{i,e,t}=(1,1)\equiv AT_{e,t}$ to be the always-takers, $G_{i,e,t}=(0,1)\equiv CM_{e,t}$ to be the compliers, and $G_{i,e,t}=(1,0)\equiv DF_{e,t}$ to be the defiers at time $t$ and the initial exposure date $e$.
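The mapping from the pair of potential treatment statuses to the four types can be written as a lookup. This is a minimal sketch; the function name and the string labels are hypothetical shorthand for the groups defined above.

```python
def compliance_type(d_never, d_exposed):
    """Return the type implied by (never exposed status, status under exposure date e)."""
    types = {
        (0, 0): "NT",  # never-taker
        (1, 1): "AT",  # always-taker
        (0, 1): "CM",  # complier
        (1, 0): "DF",  # defier
    }
    return types[(d_never, d_exposed)]


# Classify a few hypothetical units at a given period t >= e.
pairs = [(0, 0), (0, 1), (1, 1), (1, 0)]
labels = [compliance_type(a, b) for a, b in pairs]
print(labels)  # ['NT', 'CM', 'AT', 'DF']
```

Because the pair is indexed by both $e$ and $t$, the same unit may be classified differently in different periods.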

Finally, we make a no carryover assumption on the potential outcomes $Y_{i,t}(d,z)$.

Assumption 2 (No carryover assumption).
$$
\forall z\in\mathcal{S}(Z),\ \forall d\in\mathcal{S}(D),\ \forall t\in\{1,\dots,T\},\quad Y_{i,t}(d,z)=Y_{i,t}(d_{t},z),
$$

where $d=(d_{1},\dots,d_{T})$ is the generic element of the treatment path $D_{i}$.

This assumption requires that the potential outcomes $Y_{i,t}(d,z)$ depend only on the current treatment status $d_{t}$ and the instrument path $z$. In the DID literature, several recent papers impose this assumption in settings with a non-staggered treatment; see, e.g., de Chaisemartin and D’Haultfœuille (2020) and Imai and Kim (2021). Although it may be possible to weaken this assumption by keeping the full treatment path $d$ in the potential outcomes $Y_{i,t}(d,z)$, doing so requires cumbersome notation and complicates the definition of our target parameter, and is therefore beyond the scope of this paper.

Henceforth, we maintain Assumptions 1 and 2. In the next section, we define the target parameter in staggered DID-IV designs.

3.2   Target parameter in staggered DID-IV designs

Our target parameter in staggered DID-IV designs is the cohort specific local average treatment effect on the treated (CLATT) defined below.

Def.

The cohort specific local average treatment effect on the treated (CLATT) at a given relative period $l$ from the initial adoption of the instrument is

$$
\begin{aligned}
CLATT_{e,l} &= E[Y_{i,e+l}(1)-Y_{i,e+l}(0)\mid E_{i}=e,\ D_{i,e+l}^{e}>D_{i,e+l}^{\infty}]\\
&= E[Y_{i,e+l}(1)-Y_{i,e+l}(0)\mid E_{i}=e,\ CM_{e,e+l}].
\end{aligned}
$$

This parameter measures the treatment effects at a given relative period $l$ from the initial instrument adoption date $E_{i}=e$ for those who belong to cohort $e$ and are compliers $CM_{e,e+l}$, that is, who are induced into treatment by the instrument at time $e+l$. Each $CLATT_{e,l}$ can vary across cohorts and over time, as it depends on the cohort $e$, the relative period $l$, and the compliers $CM_{e,e+l}$.

3.3   Identifying assumptions in staggered DID-IV designs

In this section, we state the identifying assumptions in staggered DID-IV designs based on Miyaji (2024).

Assumption 3 (Exclusion Restriction in multiple time periods).
$$
\forall z\in\mathcal{S}(Z),\ \forall d_{t}\in\mathcal{S}(D_{t}),\ \forall t\in\{1,\dots,T\},\quad Y_{i,t}(d,z)=Y_{i,t}(d)\quad a.s.
$$

Assumption 3 requires that the path of the instrument does not directly affect the potential outcome in any time period and that its effects operate only through treatment. Given Assumption 2 and Assumption 3, we can write the potential outcome $Y_{i,t}(d,z)$ as $Y_{i,t}(d_{t})=D_{i,t}Y_{i,t}(1)+(1-D_{i,t})Y_{i,t}(0)$.

Here, we introduce the potential outcomes at time $t$ if unit $i$ is assigned to the instrument path $z\in\mathcal{S}(Z)$:

$$
Y_{i,t}(D_{i,t}(z))\equiv D_{i,t}(z)Y_{i,t}(1)+(1-D_{i,t}(z))Y_{i,t}(0).
$$

Since the exposure timing $E_{i}$ completely determines the path of the instrument, we can write the potential outcomes for cohort $e$ and cohort $\infty$ as $Y_{i,t}(D_{i,t}^{e})$ and $Y_{i,t}(D_{i,t}^{\infty})$, respectively. The potential outcome $Y_{i,t}(D_{i,t}^{e})$ represents the outcome at time $t$ if unit $i$ is first exposed to the instrument at time $e$, and the potential outcome $Y_{i,t}(D_{i,t}^{\infty})$ represents the outcome at time $t$ if unit $i$ is never exposed to the instrument. Hereafter, we refer to $Y_{i,t}(D_{i,t}^{\infty})$ as the "never exposed outcome".

Assumption 4 (Monotonicity Assumption in multiple time periods).
$$
Pr(D_{i,e+l}^{e}\geq D_{i,e+l}^{\infty})=1\ \text{ or }\ Pr(D_{i,e+l}^{e}\leq D_{i,e+l}^{\infty})=1\quad\text{for all } e\in\mathcal{S}(E_{i})\ \text{and for all } l\geq 0.
$$

This assumption requires that the instrument path affects treatment adoption in a monotone way in all relative periods after the initial exposure. Recall that we define $D_{i,t}-D_{i,t}^{\infty}$ to be the effect of the instrument on the treatment for unit $i$ at time $t$. Assumption 4 requires that the individual exposed effect in the first stage is non-negative (or non-positive) for all $i$ and all time periods after the initial exposure. This assumption implies that the group variable $G_{i,e,t}\equiv(D_{i,t}^{\infty},D_{i,t}^{e})$ can take three values with non-zero probability for all $e$ and all $t\geq e$. Hereafter, we consider the type of monotonicity assumption that rules out the existence of the defiers $DF_{e,t}$ for all $t\geq e$ in any cohort $e$.

Assumption 5 (No anticipation in the first stage).
$$
D_{i,e+l}^{e}=D_{i,e+l}^{\infty}\quad a.s.\quad\text{for all units } i,\ \text{for all } e\in\mathcal{S}(E_{i})\ \text{and for all } l<0.
$$

Assumption 5 requires that the potential treatment choice in any period before the initial exposure to the instrument equals the never exposed treatment. This assumption rules out anticipatory behavior in the first stage before the initial exposure.

Assumption 6 (Parallel Trends Assumption in the treatment in multiple time periods).
$$
\text{For all } s\neq t,\quad E[D_{i,t}^{\infty}-D_{i,s}^{\infty}\mid E_{i}=e]\ \text{ is the same for all } e\in\mathcal{S}(E_{i}).
$$

Assumption 6 is a parallel trends assumption in the treatment with multiple periods and multiple cohorts. This assumption requires that the treatment would have followed the same trend across cohorts, on average, in the absence of exposure to the instrument. Assumption 6 is analogous to that of Callaway and Sant’Anna (2021) and Sun and Abraham (2021) in DID designs: both papers impose the same type of parallel trends assumption on untreated outcomes in settings with multiple periods and multiple cohorts.

Assumption 7 (Parallel Trends Assumption in the outcome in multiple time periods).
$$
\text{For all } s<t,\quad E[Y_{i,t}(D_{i,t}^{\infty})-Y_{i,s}(D_{i,s}^{\infty})\mid E_{i}=e]\ \text{ is the same for all } e\in\mathcal{S}(E_{i}).
$$

Assumption 7 is a parallel trends assumption in the outcome with multiple periods and multiple cohorts. This assumption requires that the expectation of the never exposed outcome would have followed the same evolution across cohorts if the assignment of the instrument had not occurred. Following the discussion in Miyaji (2024), we can interpret this assumption as requiring the same expected time gain across cohorts and over time: the effects of time on the outcome through treatment are the same on average across cohorts and over time.

4   Causal interpretation of the TWFEIV estimand

In this section, we explore the causal interpretation of the TWFEIV estimand under staggered DID-IV designs. In section 4.1, we first define the main building block parameters in the first stage and reduced form regressions, respectively. In section 4.2, we then interpret the TWFEIV estimand under staggered DID-IV designs, and show that this estimand potentially fails to summarize the treatment effects. In section 4.3, given this negative result, we describe various restrictions on the main building block parameter in each stage regression. In section 4.4, as a preparation, we describe the causal interpretation of the denominator in the TWFEIV estimand under these restrictions. In section 4.5, we finally investigate the sufficient conditions for the TWFEIV estimand to attain a causal interpretation.

4.1   Main building block parameter in each stage regression

As we already mentioned in section 2, the TWFEIV regression runs the TWFE regression twice, once in the first stage and once in the reduced form. In this section, we define the main building block parameter in each stage regression.

In the first stage regression, our building block parameter is the average of the individual exposed effect at a given relative period $l$ from the initial exposure to the instrument in cohort $e$. We call this the cohort specific average exposed effect on the treated in the first stage ($CAET^{1}_{e,l}$) defined below.

Def.

The cohort specific average exposed effect on the treated in the first stage ($CAET^{1}$) at a given relative period $l$ from the initial adoption of the instrument is

$$
CAET_{e,l}^{1}=E[D_{i,e+l}-D_{i,e+l}^{\infty}\mid E_{i}=e].
$$

We use the superscript $1$ to make it clear that we define this parameter for the first stage regression. In the recent DID literature, Sun and Abraham (2021) define their main building block parameter in staggered DID designs in a similar fashion and call it the cohort specific average treatment effect on the treated. Callaway and Sant’Anna (2021) call the same parameter the group-time average treatment effect.

If the treatment is binary and the monotonicity assumption (Assumption 4) holds, $CAET^{1}_{e,l}$ is equal to the share of the compliers $CM_{e,e+l}$ in cohort $e$ at period $e+l$:

$$
\begin{aligned}
CAET^{1}_{e,l} &= E[D_{i,e+l}^{e}-D_{i,e+l}^{\infty}\mid E_{i}=e]\\
&= Pr(CM_{e,e+l}\mid E_{i}=e).
\end{aligned}
$$
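The equality between the average exposed effect and the complier share can be checked on toy data. The following is a sketch under stated assumptions: the arrays are hypothetical potential treatment statuses for four units of one cohort over three post-exposure relative periods, constructed so that there are no defiers.

```python
import numpy as np

# Rows: units in cohort e; columns: relative periods l = 0, 1, 2 (hypothetical data).
d_never = np.array([[0, 0, 0], [0, 0, 0], [1, 1, 1], [0, 0, 1]])
d_exposed = np.array([[0, 1, 1], [1, 1, 1], [1, 1, 1], [0, 0, 1]])

# Monotonicity without defiers: exposure never lowers treatment status.
assert np.all(d_exposed >= d_never)

# Average exposed effect in the first stage, by relative period.
caet1 = (d_exposed - d_never).mean(axis=0)

# Share of compliers (never exposed status 0, exposed status 1), by relative period.
complier_share = ((d_never == 0) & (d_exposed == 1)).mean(axis=0)

print(caet1, complier_share)
```

With a binary treatment and no defiers, the difference in potential treatment statuses is an indicator for being a complier, so the two vectors coincide period by period.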

In the reduced form regression, our building block parameter is the average of the individual effect of the instrument on the outcome through treatment at a given relative period $l$ from the initial exposure to the instrument in cohort $e$. We call this the cohort specific average intention to exposed effect on the treated in the reduced form ($CAIET_{e,l}$) defined below.

Def.

The cohort specific average intention to exposed effect on the treated in the reduced form (CAIET) at a given relative period $l$ from the initial adoption of the instrument is

$$
CAIET_{e,l}=E[Y_{i,e+l}(D_{i,e+l})-Y_{i,e+l}(D_{i,e+l}^{\infty})\mid E_{i}=e].
$$

Under the identifying assumptions in staggered DID-IV designs (Assumptions 1 to 7), this parameter is equal to the product of $CLATT_{e,l}$ and $CAET^{1}_{e,l}$:

$$
\begin{aligned}
CAIET_{e,l} &= E[Y_{i,e+l}(D^{e}_{i,e+l})-Y_{i,e+l}(D_{i,e+l}^{\infty})\mid E_{i}=e]\\
&= E[(D^{e}_{i,e+l}-D_{i,e+l}^{\infty})(Y_{i,e+l}(1)-Y_{i,e+l}(0))\mid E_{i}=e]\\
&= E[Y_{i,e+l}(1)-Y_{i,e+l}(0)\mid E_{i}=e,\ CM_{e,e+l}]\cdot Pr(CM_{e,e+l}\mid E_{i}=e)\\
&= CLATT_{e,l}\cdot CAET_{e,l}^{1}.
\end{aligned}
\tag{13}
$$

In other words, if we scale $CAIET_{e,l}$ in the reduced form by $CAET^{1}_{e,l}$ in the first stage, we obtain $CLATT_{e,l}$, which is the reason why we call this the cohort specific average "intention to exposed effect" on the treated in the reduced form.
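The product relationship can be verified numerically on a small example. This is a minimal sketch with hypothetical potential outcomes and treatment statuses for eight units of a single cohort at one post-exposure period, constructed with no defiers; the helper name is an assumption made for illustration.

```python
import math

import numpy as np

# Hypothetical potential treatment statuses (no defiers) and potential outcomes.
d_never = np.array([0, 0, 0, 1, 1, 0, 0, 1])
d_exposed = np.array([0, 1, 1, 1, 1, 0, 1, 1])
y0 = np.array([1.0, 2.0, 0.5, 3.0, 1.5, 2.5, 1.0, 0.0])
y1 = np.array([2.0, 5.0, 1.5, 3.5, 4.0, 2.0, 3.0, 1.0])


def outcome_given(d):
    """Outcome implied by treatment status d: d * Y(1) + (1 - d) * Y(0)."""
    return d * y1 + (1 - d) * y0


# Reduced form building block: exposed outcome minus never exposed outcome.
caiet = np.mean(outcome_given(d_exposed) - outcome_given(d_never))

# First stage building block and the treatment effect among compliers.
complier = (d_never == 0) & (d_exposed == 1)
caet1 = np.mean(d_exposed - d_never)
clatt = np.mean((y1 - y0)[complier])

print(caiet, clatt * caet1)
```

Only compliers contribute a non-zero difference to the reduced form, which is why the average intention to exposed effect factors into the complier share times the average effect among compliers.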

4.2   Interpreting the TWFEIV estimand under staggered DID-IV designs

We now interpret the TWFEIV estimand under staggered DID-IV designs based on the DID-IV decomposition theorem derived in section 2 and the main building block parameters defined in the previous section. This section presumes the monotonicity assumption (Assumption 4) to clarify the interpretation of each notation defined below.

First, we introduce additional notation. Let $CLATT^{CM}_{k}(W)$ denote a weighted average of each $CLATT_{k,t}$ in the time window $W$ (with $T_{W}$ periods), where the weight reflects the relative size of the exposed effect in the first stage in cohort $k$ at period $t$:

$$
\begin{aligned}
CLATT^{CM}_{k}(W) &\equiv \sum_{t\in W}\frac{CAET^{1}_{k,t}}{\sum_{t\in W}CAET^{1}_{k,t}}\,CLATT_{k,t}\\
&= \sum_{t\in W}\frac{Pr(CM_{k,t}\mid E_{i}=k)}{\sum_{t\in W}Pr(CM_{k,t}\mid E_{i}=k)}\,CLATT_{k,t}.
\end{aligned}
$$

The second equality holds because we have a binary treatment and impose the monotonicity assumption (Assumption 4). The weight assigned to each $CLATT_{k,t}$ reflects the relative share of the compliers at period $t$ in cohort $k$ during the time window $W$. We call this the compliers weighted scheme. It is a reasonable weighting scheme for two reasons. First, the weight is larger in periods when the proportion of the compliers is higher in cohort $k$. Second, the weights sum to one by construction: the proportion of the compliers in each period in cohort $k$ is divided by the total proportion of the compliers over the time window $W$ in cohort $k$.

We also define the similar notation $CLATT_{k}(W)$, in which the proportion of the compliers in cohort $k$ at period $t$ is divided by the time length $T_{W}$:

$$
\begin{aligned}
CLATT_{k}(W) &\equiv \frac{1}{T_{W}}\sum_{t\in W}CAET^{1}_{k,t}\,CLATT_{k,t}\\
&= \frac{1}{T_{W}}\sum_{t\in W}Pr(CM_{k,t}\mid E_{i}=k)\,CLATT_{k,t}.
\end{aligned}
$$

We call this the time-corrected weighting scheme. In contrast to $CLATT^{CM}_{k}(W)$, the weight assigned to each $CLATT_{k,t}$ can be inappropriate: each weight does not reflect the relative share of the compliers in cohort $k$ at period $t$. In addition, the weights do not sum to one in general.
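The contrast between the two weighting schemes can be seen in a short computation. This is a minimal sketch; the function names and the per-period values of $CAET^{1}_{k,t}$ and $CLATT_{k,t}$ are hypothetical and chosen only for illustration.

```python
import math


def clatt_compliers_weighted(caet1, clatt):
    """Compliers weighted scheme: weights are complier shares normalized over the window."""
    total = sum(caet1)
    weights = [c / total for c in caet1]
    return sum(w * b for w, b in zip(weights, clatt)), weights


def clatt_time_corrected(caet1, clatt):
    """Time-corrected scheme: complier shares divided by the number of periods T_W."""
    t_w = len(caet1)
    weights = [c / t_w for c in caet1]
    return sum(w * b for w, b in zip(weights, clatt)), weights


# Hypothetical complier shares and effects over a three-period window W.
caet1 = [0.1, 0.2, 0.3]
clatt = [1.0, 2.0, 4.0]

cm_value, cm_weights = clatt_compliers_weighted(caet1, clatt)
tc_value, tc_weights = clatt_time_corrected(caet1, clatt)
print(cm_value, sum(cm_weights))
print(tc_value, sum(tc_weights))
```

In this example the compliers weighted weights sum to one, whereas the time-corrected weights sum to the average complier share over the window (here 0.2), so the time-corrected quantity equals the compliers weighted one scaled by that average share.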

Theorem 2 below shows the probability limit of the TWFEIV estimator $\hat{\beta}_{IV}$ under staggered DID-IV designs (Assumptions 1-7).

Theorem 2.

Suppose Assumptions 1-7 hold. Then, the TWFEIV estimand $\beta_{IV}$ consists of two terms:

$$
\begin{aligned}
\hat{\beta}_{IV} &= \sum_{k\neq U}\hat{w}_{IV,kU}\hat{\beta}_{IV,kU}^{2\times 2}+\sum_{k\neq U}\sum_{l>k}\left[\hat{w}_{IV,kl}^{k}\hat{\beta}_{IV,kl}^{2\times 2,k}+\hat{w}_{IV,kl}^{l}\hat{\beta}_{IV,kl}^{2\times 2,l}\right]\\
&\xrightarrow{p} WCLATT-\Delta CLATT,
\end{aligned}
$$

where we define:

$$
\begin{aligned}
WCLATT &\equiv \sum_{k\neq U}w_{IV,kU}\,CLATT^{CM}_{k}(POST(k))+\sum_{k\neq U}\sum_{l>k}w_{IV,kl}^{k}\,CLATT^{CM}_{k}(MID(k,l))\\
&\quad+\sum_{k\neq U}\sum_{l>k}\sigma_{IV,kl}^{l}\cdot CLATT_{l}(POST(l)),\\
\Delta CLATT &\equiv \sum_{k\neq U}\sum_{l>k}\sigma_{IV,kl}^{l}\cdot\left[CLATT_{k}(POST(l))-CLATT_{k}(MID(k,l))\right].
\end{aligned}
$$

The weights $w_{IV,kU}$ and $w_{IV,kl}^{k}$ are the probability limits of $\hat{w}_{IV,kU}$ and $\hat{w}_{IV,kl}^{k}$, respectively. The weight $\sigma_{IV,kl}^{l}$ is the probability limit of $\hat{w}_{kl}^{l}/\hat{C}^{D,Z}$, which differs from $\hat{w}_{IV,kl}^{l}=\hat{w}_{kl}^{l}\hat{D}_{kl}^{2\times 2,l}/\hat{C}^{D,Z}$. The specific expressions for each weight are shown in equations (49), (50), and (51) in Appendix B.

Proof.

See Appendix B. ∎

Theorem 2 shows that the TWFEIV estimand $\beta_{IV}$ consists of two terms ($WCLATT$ and $\Delta CLATT$) and potentially fails to aggregate the treatment effects under staggered DID-IV designs.

The first term $WCLATT$ is a positively weighted average of each $CLATT_{k,t}$ for the post-exposed period in cohort $k$. We call this the weighted average cohort specific local average treatment effect on the treated ($WCLATT$) parameter. The first and second terms in $WCLATT$ use the compliers weighted scheme, but the third term in $WCLATT$ uses the time-corrected one.

Although $WCLATT$ can be a causal parameter, its magnitude may be difficult to interpret in practice for two reasons. First, the weight $\sigma_{IV,kl}^{l}$ assigned to $CLATT_{l}(POST(l))$ reflects only the sample share and the variation of the instrument, and does not reflect the variation of the treatment $D_{kl}^{2\times 2,l}$ in the first stage. Because the other weights, $w_{IV,kU}$ and $w_{IV,kl}^{k}$, reflect all the sources of variation in each DID-IV design, this asymmetry can obscure what the magnitude of this parameter means in a given application. Second, $CLATT_{l}(POST(l))$ in the third term is a weighted average of $CLATT_{l,t}$ over the post-exposed periods in cohort $l$, but the weight assigned to each $CLATT_{l,t}$ does not seem reasonable: it does not reflect the relative share of the compliers in period $t$ in cohort $l$, and the weights do not sum to one.

The problem with $WCLATT$ is due to the "bad comparisons" in the first stage TWFE regression: when we compare the evolution of the treatment in Exposed/Exposed Shift designs, we use already exposed cohorts as controls. In these comparisons, we have to offset the DID estimator of the treatment in each weight in Exposed/Exposed Shift designs by the one appearing in the denominator of the corresponding Wald-DID estimator, which produces the weight $\sigma_{IV,kl}^{l}$ and $CLATT_{l}(POST(l))$ in the third term.

The second term $\Delta CLATT$ is a weighted sum of the differences between two positively weighted averages of $CLATT_{k,t}$ in the already exposed cohort $k$: the one over the periods from the exposure date $k$ up to just before period $l$ ($k<l$), and the one over the periods from $l$ onward. This term fails to properly aggregate the treatment effects because $CLATT_{k}(POST(l))$ is canceled out by $CLATT_{k}(MID(k,l))$ in each cohort $k$. This problem arises due to the "bad comparisons" in the reduced form TWFE regression: when we compare the evolution of the outcome in Exposed/Exposed Shift designs, we use already exposed cohorts as controls. In these comparisons, we subtract their expected trends of unexposed potential outcomes and their average intention to exposed effects, which yields $\Delta CLATT$.

Overall, this section shows that the TWFEIV estimand potentially fails to summarize the treatment effects under staggered DID-IV designs. In the next section, we first describe various restrictions on the main building block parameters in the first stage and the reduced form regressions. Given these restrictions on exposed effect heterogeneity, we then explore sufficient conditions for the TWFEIV estimand to be a causally interpretable parameter.

4.3   Restrictions on exposed effect heterogeneity

First, we describe the restrictions on $CAET^{1}_{e,l}$ in the first stage regression.

Assumption 8 (Exposed effect homogeneity across cohorts in the first stage).

For each relative period $l$, $CAET^{1}_{e,l}$ does not depend on cohort $e$ and is equal to $AET_{l}^{1}$.

Assumption 8 requires that the exposed effects in the first stage depend only on the relative period $l$ after the initial exposure to the instrument and not on the cohort $e$. This assumption does not exclude dynamic effects of the instrument on the treatment, but requires that the exposed effects are the same across cohorts in every relative period.

Assumption 9 (Stable exposed effect over time within cohort in the first stage).

For each cohort $e$, $CAET^{1}_{e,l}$ does not depend on the relative period $l$ and is equal to $CAET^{1}_{e}$.

Assumption 9 rules out dynamic effects of the instrument on the treatment within cohort $e$ in the first stage regression. It permits heterogeneous exposed effects across cohorts, but requires the exposed effects to be homogeneous over time after the initial adoption of the instrument within each cohort $e$.
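As a rough illustration of what Assumptions 8 and 9 each restrict, the sketch below stores hypothetical first-stage effects $CAET^{1}_{e,l}$ in a dictionary keyed by (cohort, relative period) and checks each condition. The function names and all numbers are illustrative assumptions, not quantities from the paper.

```python
# Hypothetical first-stage effects CAET^1_{e,l}, keyed by (cohort e, relative period l).
# All values are made up for illustration.

def homogeneous_across_cohorts(caet, tol=1e-12):
    """Assumption 8: for each relative period l, the effect is equal across cohorts e."""
    by_period = {}
    for (e, l), value in caet.items():
        by_period.setdefault(l, []).append(value)
    return all(max(v) - min(v) <= tol for v in by_period.values())


def stable_over_time(caet, tol=1e-12):
    """Assumption 9: within each cohort e, the effect is equal across relative periods l."""
    by_cohort = {}
    for (e, l), value in caet.items():
        by_cohort.setdefault(e, []).append(value)
    return all(max(v) - min(v) <= tol for v in by_cohort.values())


# Common dynamic path: satisfies Assumption 8 but not Assumption 9.
dynamic_common = {(2, 0): 0.1, (2, 1): 0.2, (3, 0): 0.1, (3, 1): 0.2}
# Cohort-specific constant effects: satisfies Assumption 9 but not Assumption 8.
static_heterogeneous = {(2, 0): 0.1, (2, 1): 0.1, (3, 0): 0.3, (3, 1): 0.3}
```

The two example dictionaries show that neither assumption implies the other: a common dynamic path passes only the first check, and cohort-specific constant effects pass only the second.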

The recent DID literature imposes restrictions on treatment effects similar to Assumption 8 and Assumption 9. Sun and Abraham (2021) assume that "each cohort experiences the same path of treatment effects", which is in line with Assumption 8. Goodman-Bacon (2021) requires heterogeneous treatment effects to either be "constant over time but vary across units" or "vary over time but not across units". The former corresponds to Assumption 9 and the latter corresponds to Assumption 8.

Next, we describe the restrictions on $CAIET_{e,l}$ in the reduced form regression. Following Assumption 8 and Assumption 9 on $CAET^{1}_{e,l}$, we consider Assumption 10 and Assumption 11 below.

Assumption 10 (Exposed effect homogeneity across cohorts in the reduced form).

For each relative period $l$, $CAIET_{e,l}$ does not depend on cohort $e$ and is equal to $AIET_{l}$.

Assumption 11 (Stable exposed effect over time within cohort in the reduced form).

For each cohort $e$, $CAIET_{e,l}$ does not depend on the relative period $l$ and is equal to $CAIET_{e}$.

Assumption 10 requires that the evolution of the average intention to exposed effect after the initial exposure is the same across cohorts. Assumption 11 requires that the average intention to exposed effects are stable over time in all relative periods within cohort $e$.

Note that given Assumption 8 and Assumption 10, we have the following restriction on $CLATT_{e,l}$, which follows from equation (13) in section 4.1.

Assumption 12 (Treatment effect homogeneity across cohorts for $CLATT_{e,l}$).

For each relative period $l$, $CLATT_{e,l}$ does not depend on cohort $e$ and is equal to $LATT_{l}$.

Similarly, given Assumption 9 and Assumption 11, we have the following restriction on $CLATT_{e,l}$.

Assumption 13 (Stable treatment effect over time within cohort for $CLATT_{e,l}$).

For each cohort $e$, $CLATT_{e,l}$ does not depend on the relative period $l$ and is equal to $CLATT_{e}$.
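To see why stability in both stages carries over to the treatment effect parameter, the following sketch assumes that $CLATT_{e,l}$ is the ratio of the reduced form effect $CAIET_{e,l}$ to the first stage effect $CAET^{1}_{e,l}$. This ratio form is our reading of equation (13), which is not reproduced in this section, and all numbers are hypothetical.

```python
def clatt(caiet, caet1):
    """Cell-by-cell ratio of reduced-form to first-stage effects.

    Assumed form of equation (13): CLATT_{e,l} = CAIET_{e,l} / CAET^1_{e,l}.
    """
    return {cell: caiet[cell] / caet1[cell] for cell in caiet}


# Hypothetical effects that are stable within cohort in both stages
# (Assumptions 9 and 11), keyed by (cohort e, relative period l).
caet1 = {(2, 0): 0.2, (2, 1): 0.2, (3, 0): 0.4, (3, 1): 0.4}
caiet = {(2, 0): -0.1, (2, 1): -0.1, (3, 0): -0.1, (3, 1): -0.1}

ratios = clatt(caiet, caet1)
# Within each cohort, the ratio does not vary with the relative period (Assumption 13).
stable_within_cohort = all(ratios[(e, 0)] == ratios[(e, 1)] for e in (2, 3))
```

In this example the ratio still differs across cohorts, so Assumption 13 holds while Assumption 12 does not.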

4.4   The denominator in the TWFEIV estimand

In this section, we first interpret the denominator in the TWFEIV estimand under various restrictions considered in section 4.3. This section is a preparation for the next section, in which we analyze the TWFEIV estimand itself.

As we already noted, the denominator of the TWFEIV estimator (see equation (6)), $\hat{C}^{D,Z}$, can be decomposed into a weighted average of all possible $2\times 2$ DID estimators of the treatment. In the following discussion, we show that, without additional restrictions, this estimand can fail to aggregate the effects of the instrument on the treatment in the first stage regression. We then briefly describe the interpretation of this estimand under Assumption 8 or Assumption 9, and state the implications.

First, we introduce additional notation. Let $CAET^{1}_{k}(W)$ denote an equally weighted average of $CAET^{1}_{k,t}$ over the time window $W$ (with period length $T_{W}$):

\[
CAET^{1}_{k}(W) \equiv \frac{1}{T_{W}}\sum_{t\in W}CAET^{1}_{k,t}.
\]

Under Assumption 4 (monotonicity assumption), $CAET^{1}_{k}(W)$ is an equally weighted average of the fraction of the compliers in cohort $k$ over the time window $W$. For instance, $CAET^{1}_{k}(POST(k))$ is an equally weighted average of $CAET^{1}_{k,t}$ during the periods after the initial exposure date $k$ and can be rewritten as

\begin{align*}
CAET^{1}_{k}(POST(k)) &= \frac{1}{T-(k-1)}\sum_{t=k}^{T}CAET_{k,t}^{1} \\
&= \frac{1}{T-(k-1)}\sum_{t=k}^{T}Pr(CM_{k,t}|E_{i}=k).
\end{align*}
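The window averages above can be sketched in a few lines. The helper names are hypothetical, the effect values are made up, and $MID(k,l)$ is taken to be the periods $k,\dots,l-1$, which is how the text describes it ("from the exposed period $k$ to before period $l$").

```python
def window_average(caet_kt, window):
    """Equally weighted average of CAET^1_{k,t} over the calendar periods in `window`."""
    return sum(caet_kt[t] for t in window) / len(window)


def post(k, T):
    """POST(k): periods k, ..., T, i.e. T - (k - 1) periods."""
    return list(range(k, T + 1))


def mid(k, l):
    """MID(k, l): periods k, ..., l - 1 for k < l (assumed definition)."""
    return list(range(k, l))


# Hypothetical effects for cohort k = 2 in a panel with T = 5 periods.
caet_2 = {2: 0.1, 3: 0.2, 4: 0.3, 5: 0.4}
post_avg = window_average(caet_2, post(2, 5))  # average over t = 2, ..., 5
mid_avg = window_average(caet_2, mid(2, 4))    # average over t = 2, 3
```

With growing effects, the average over $POST(2)$ exceeds the average over $MID(2,4)$, which is the kind of gap that drives the $\Delta$ terms discussed in this section.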

Lemma 1 below shows the probability limit of the denominator $\hat{C}^{D,Z}$ under staggered DID-IV designs. This lemma is mainly based on the result of Goodman-Bacon (2021), who shows the probability limit of the two-way fixed effects estimator under staggered DID designs. The slight difference here is that each weight assigned to each $CAET^{1}_{k}(W)$ in $C^{D,Z}$ is not divided by the probability limit of the grand mean $\frac{1}{NT}\sum_{i}\sum_{t}\tilde{Z}_{it}$.

Lemma 1.

Suppose Assumptions 1-7 hold. Then, the probability limit of the denominator of the TWFEIV estimator, $C^{D,Z}$, consists of two terms:

\begin{align*}
\hat{C}^{D,Z} &= \sum_{k\neq U}\hat{w}_{kU}\hat{D}_{kU}^{2\times 2}+\sum_{k\neq U}\sum_{l>k}[\hat{w}_{kl}^{k}\hat{D}_{kl}^{2\times 2,k}+\hat{w}_{kl}^{l}\hat{D}_{kl}^{2\times 2,l}] \\
&\xrightarrow{p} WCAET-\Delta CAET^{1},
\end{align*}

where we define:

\begin{align*}
WCAET &\equiv \sum_{k\neq U}w_{kU}CAET_{k}^{1}(POST(k))+\sum_{k\neq U}\sum_{l>k}w_{kl}^{k}CAET_{k}^{1}(MID(k,l))+w_{kl}^{l}CAET_{l}^{1}(POST(l)), \\
\Delta CAET^{1} &\equiv \sum_{k\neq U}\sum_{l>k}w_{kl}^{l}\left[CAET_{k}^{1}(POST(l))-CAET_{k}^{1}(MID(k,l))\right].
\end{align*}

The weights $w_{kU}$, $w_{kl}^{k}$ and $w_{kl}^{l}$ are the probability limits of $\hat{w}_{kU}$, $\hat{w}_{kl}^{k}$ and $\hat{w}_{kl}^{l}$ defined in section 3, respectively, and are non-negative. The specific expressions for each weight are shown in equations (35)-(37) in Appendix B.

Proof.

See Appendix B. ∎

Lemma 1 shows that we can decompose $C^{D,Z}$ into two terms. The first term is a positively weighted average of each $CAET_{k,t}^{1}$ during the periods after the initial exposure in exposed cohorts, which allows a causal interpretation. Following the terminology in Goodman-Bacon (2021), we call this the weighted average cohort specific exposed effect on the treated ($WCAET$) parameter.

The second term $\Delta CAET^{1}$ is a weighted sum of the differences between two positively weighted averages of the exposed effect $CAET_{k,t}^{1}$ in the already exposed cohort $k$: the one over the periods from the exposure date $k$ up to just before period $l$ ($k<l$), and the one over the periods from $l$ onward. This term fails to properly aggregate the causal parameter in the first stage because some exposed effects are canceled out by other exposed effects.

Lemma 1 implies that if we assume only Assumptions 1-7, the probability limit of the denominator in the TWFEIV estimand, $C^{D,Z}$, generally fails to properly summarize the exposed effects in the first stage due to the second term $\Delta CAET^{1}$. This problem arises from the "bad comparisons" performed by the TWFE regression in the first stage: we treat the already exposed cohorts as control groups in the Exposed/Exposed Shift designs. In these comparisons, we subtract their expected trends of unexposed potential treatment choices and their expected exposed effects, which yields the second term $\Delta CAET^{1}$. In the DID literature, Borusyak et al. (2021), de Chaisemartin and D’Haultfœuille (2020), and Goodman-Bacon (2021) point out the same issue for the TWFE estimand in staggered DID designs.

Based on the negative result shown in Lemma 1, we consider the restrictions on exposed effect heterogeneity in the first stage regression. The conclusion here is that $C^{D,Z}$ properly aggregates each $CAET_{k,t}^{1}$ only if Assumption 9 holds, that is, the exposed effects are stable over time within cohort $e$. Because Goodman-Bacon (2021) has already made the same point for the TWFE estimand, we only briefly summarize the interpretation of $C^{D,Z}$ under Assumption 8 or Assumption 9 in the following. For a more detailed discussion, see section 3.1 in Goodman-Bacon (2021).

Interpreting $C^{D,Z}$ under Assumption 8 only

Even when Assumption 8 holds, that is, the exposed effects are the same across cohorts but vary over time in the first stage, we have $\Delta CAET^{1}\neq 0$ in general. This implies that if we impose only Assumption 8, we cannot generally interpret $C^{D,Z}$ as measuring a positively weighted average of exposed effects in the first stage.
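A small numerical sketch makes the point concrete. It evaluates a single bracket of $\Delta CAET^{1}$, namely $CAET_{k}^{1}(POST(l))-CAET_{k}^{1}(MID(k,l))$, for one early cohort $k$ and one later cohort $l$, when the effect depends only on relative time. The effect paths, the cohort dates, and the function names are hypothetical, and the weights $w_{kl}^{l}$ are left out.

```python
def avg_effect(path, cohort, periods):
    """Equally weighted average of CAET^1_{cohort,t} over `periods`, where the
    effect depends only on relative time t - cohort through `path`."""
    return sum(path(t - cohort) for t in periods) / len(periods)


def delta_bracket(path, k, l, T):
    """CAET^1_k(POST(l)) - CAET^1_k(MID(k,l)) for cohorts k < l in a T-period panel."""
    post_l = range(l, T + 1)   # periods l, ..., T
    mid_kl = range(k, l)       # periods k, ..., l - 1
    return avg_effect(path, k, post_l) - avg_effect(path, k, mid_kl)


def growing(rel):
    """Common dynamic path: Assumption 8 holds, Assumption 9 fails."""
    return 0.1 * (rel + 1)


def constant(rel):
    """Stable path: Assumption 9 holds."""
    return 0.3


gap_dynamic = delta_bracket(growing, k=2, l=4, T=6)
gap_stable = delta_bracket(constant, k=2, l=4, T=6)
```

Here `gap_dynamic` is about 0.25, so the bracket does not vanish even though every cohort follows the same path, while `gap_stable` is zero up to floating-point rounding.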

Interpreting $C^{D,Z}$ under Assumption 9 only

If Assumption 9 holds, that is, the exposed effects are stable over time within cohort $e$ in the first stage, we have $CAET_{k}^{1}(W)=CAET_{k}^{1}$. This implies that the second term $\Delta CAET^{1}$ is equal to zero:

\begin{align*}
\Delta CAET^{1} &= \sum_{k\neq U}\sum_{l>k}w_{kl}^{l}\left[CAET_{k}^{1}-CAET_{k}^{1}\right] \\
&= 0.
\end{align*}

Thus, $C^{D,Z}$ simplifies to:

\begin{align*}
C^{D,Z} &= WCAET \\
&= \sum_{k\neq U}CAET^{1}_{k}\underbrace{\left[w_{kU}+\sum_{j=1}^{k-1}w_{jk}^{k}+\sum_{j=k+1}^{K}w_{kj}^{k}\right]}_{\equiv w_{k}}.
\end{align*}

Under Assumption 9 only, $C^{D,Z}$ weights each $CAET^{1}_{k}$ positively across cohorts. We note, however, that the weight $w_{k}$ assigned to each $CAET^{1}_{k}$ is not equal to the sample share of cohort $k$, but is a function of the sample share and the timing of the initial exposure date.
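The collapse of the pairwise weights into a single cohort weight $w_{k}$ can be sketched as follows. The pairwise weights here are hypothetical placeholders (their actual expressions are in Appendix B and are not reproduced), and the dictionary layout is an assumption made for illustration.

```python
def cohort_weight(k, w_kU, w_early, w_late):
    """w_k = w_{kU} + sum_{j<k} w_{jk}^k + sum_{j>k} w_{kj}^k.

    w_kU:    dict cohort -> weight of the cohort-vs-unexposed comparison
    w_early: dict (k, l) with k < l -> w_{kl}^k (earlier cohort k is the exposed group)
    w_late:  dict (k, l) with k < l -> w_{kl}^l (later cohort l is the exposed group)
    """
    total = w_kU.get(k, 0.0)
    total += sum(w for (j, l), w in w_late.items() if l == k)   # pairs (j, k), j < k
    total += sum(w for (j, l), w in w_early.items() if j == k)  # pairs (k, l), l > k
    return total


# Hypothetical pairwise weights for two exposed cohorts, 2 and 3.
w_kU = {2: 0.04, 3: 0.03}
w_early = {(2, 3): 0.01}
w_late = {(2, 3): 0.02}
caet1 = {2: 0.2, 3: 0.4}   # stable first-stage effects CAET^1_k (Assumption 9)

w = {k: cohort_weight(k, w_kU, w_early, w_late) for k in caet1}
c_dz = sum(caet1[k] * w[k] for k in caet1)
```

Because every pairwise weight is non-negative, each $w_{k}$ is non-negative as well, which is what makes the aggregate a positively weighted sum under Assumption 9.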

In this section, we have considered whether the denominator in the TWFEIV estimand properly aggregates the exposed effects in the first stage. There are two implications. First, if we do not impose Assumption 9, the weight assigned to each $2\times 2$ Wald-DID in the TWFEIV estimand may not be properly normalized because the numerator in each weight is divided by $C^{D,Z}$, and this denominator potentially fails to aggregate the exposed effects in the first stage. Second, if we do not impose Assumption 9, some weights assigned to the $2\times 2$ Wald-DID estimands can be negative. This is because the DID estimand of the treatment forms part of each weight and can be negative due to the "bad comparisons" in the first stage regression.

From the discussion so far, hereafter, we impose Assumption 9 when we consider the restrictions on exposed effect heterogeneity in the first stage.

4.5   Interpreting the TWFEIV estimand under additional restrictions

We now describe the interpretation of the TWFEIV estimand under additional restrictions.

Interpretation under Assumption 9 only

First, we consider imposing Assumption 9 only, that is, we assume only the stable exposed effect over time in the first stage. If Assumption 9 holds, $CLATT^{CM}_{k}(W)$ simplifies to an equally weighted average of $CLATT_{k,t}$:

\begin{align*}
CLATT^{CM}_{k}(W) &= \sum_{t\in W}\frac{Pr(CM_{k,t}|E_{i}=k)}{\sum_{t\in W}Pr(CM_{k,t}|E_{i}=k)}CLATT_{k,t} \\
&= \frac{1}{T_{W}}\sum_{t\in W}CLATT_{k,t} \\
&\equiv CLATT^{eq}_{k}(W).
\end{align*}

$CLATT^{eq}_{k}(W)$ weights each $CLATT_{k,t}$ equally in the time window $W$, and the weights sum to one by construction. We call this an equal weighting scheme.
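The step from the compliers weighted scheme to the equal weighting scheme can be checked numerically. In the sketch below, the complier shares $Pr(CM_{k,t}|E_{i}=k)$ and the $CLATT_{k,t}$ values are hypothetical, and the function names are ours.

```python
def clatt_cm(complier_share, clatt, window):
    """Compliers-weighted average of CLATT_{k,t} over the window W."""
    total = sum(complier_share[t] for t in window)
    return sum(complier_share[t] / total * clatt[t] for t in window)


def clatt_eq(clatt, window):
    """Equally weighted average of CLATT_{k,t} over the window W."""
    return sum(clatt[t] for t in window) / len(window)


window = [2, 3, 4]
clatt = {2: 1.0, 3: 2.0, 4: 3.0}           # hypothetical CLATT_{k,t}
stable_share = {2: 0.2, 3: 0.2, 4: 0.2}    # complier share constant over time (Assumption 9)
growing_share = {2: 0.1, 3: 0.2, 4: 0.3}   # complier share changing over time

cm_stable = clatt_cm(stable_share, clatt, window)
cm_growing = clatt_cm(growing_share, clatt, window)
eq = clatt_eq(clatt, window)
```

With a constant complier share the two schemes coincide, whereas a growing share tilts the compliers weighted average toward later periods.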

Lemma 2 presents the interpretation of the TWFEIV estimand under staggered DID-IV designs and Assumption 9.

Lemma 2.

Suppose Assumptions 1-7 hold. If Assumption 9 holds additionally, the TWFEIV estimand $\beta_{IV}$ consists of two terms:

\[
\beta_{IV}=WCLATT-\Delta CLATT,
\]

where we define:

\begin{align*}
WCLATT &\equiv \sum_{k\neq U}w_{IV,kU}CLATT^{eq}_{k}(POST(k))+\sum_{k\neq U}\sum_{l>k}w_{IV,kl}^{k}CLATT^{eq}_{k}(MID(k,l)) \\
&\quad+\sum_{k\neq U}\sum_{l>k}w_{IV,kl}^{l}CLATT^{eq}_{l}(POST(l)), \\
\Delta CLATT &\equiv \sum_{k\neq U}\sum_{l>k}\sigma_{IV,kl}^{l}\cdot\left[CLATT_{k}(POST(l))-CLATT_{k}(MID(k,l))\right].
\end{align*}

The weights $w_{IV,kU}$, $w_{IV,kl}^{k}$ and $w_{IV,kl}^{l}$ are the probability limits of $\hat{w}_{IV,kU}$, $\hat{w}_{IV,kl}^{k}$ and $\hat{w}_{IV,kl}^{l}$, respectively, and are non-negative. The specific expressions for these weights are shown in equations (54), (55), and (59) in Appendix B. The weight $\sigma_{IV,kl}^{l}$ is already defined in Theorem 2.

Proof.

See Appendix B. ∎

Lemma 2 shows that Assumption 9 is not sufficient for the TWFEIV estimand to attain a causal interpretation. If the exposed effects in the first stage are stable over time, we can interpret the first term $WCLATT$ causally and its interpretation is clear: this parameter is a positively weighted average of each $CLATT^{eq}_{k}(W)$, and the weight assigned to each $CLATT^{eq}_{k}(W)$ reflects all the variation in each DID-IV design. However, the second term $\Delta CLATT$ still remains, which contaminates the causal interpretation of the TWFEIV estimand.

Interpretation under Assumption 9 and Assumption 10

Next, we additionally impose Assumption 9 and Assumption 10. Even in this case, we still have $\Delta CLATT\neq 0$ in general. This implies that the TWFEIV estimand identifies $WCLATT-\Delta CLATT$, that is, this estimand does not generally attain a causal interpretation.

Interpretation under Assumption 9 and Assumption 11

As we already noted in section 4.3, if we additionally impose Assumption 9 and Assumption 11, Assumption 13 follows, that is, $CLATT_{e,t}=CLATT_{e}$ holds. Then, we obtain the following lemma.

Lemma 3.

Suppose Assumptions 1-7 hold. In addition, if Assumption 9 and Assumption 11 hold, the TWFEIV estimand $\beta_{IV}$ is:

\[
\beta_{IV}=\sum_{k\neq U}CLATT_{k}\underbrace{\Bigg[w_{IV,kU}+\sum_{j=1}^{k-1}w_{IV,jk}^{k}+\sum_{j=k+1}^{K}w_{IV,kj}^{k}\Bigg]}_{\equiv w_{k,IV}},
\]

where the weights $w_{IV,kU}$, $w_{IV,kj}^{k}$ and $w_{IV,jk}^{k}$ are the probability limits of $\hat{w}_{IV,kU}$, $\hat{w}_{IV,kj}^{k}$ and $\hat{w}_{IV,jk}^{k}$, respectively.

Proof.

See Appendix B. ∎

If Assumption 9 and Assumption 11 are satisfied, the TWFEIV estimand is a positively weighted average of each $CLATT_{k}$ across exposed cohorts, which implies that we can interpret this estimand causally. At the same time, we note that the weight $w_{k,IV}$ assigned to each $CLATT_{k}$ does not reflect only the cohort share and the fraction of the compliers, but is a function of the cohort share, the fraction of the compliers, and the timing of the initial exposure to the instrument.

5   Extensions

This section briefly describes extensions of the results in section 4. We consider a non-binary, ordered treatment and unbalanced panel settings. We also consider the case in which the adoption date of the instrument is randomized across units. For the proofs and the detailed discussions, see Appendix C.

Non-binary, ordered treatment

Up to now, we have considered only the case of a binary treatment. When the treatment takes a finite number of ordered values, $D_{i,t}\in\{0,1,\dots,J\}$, our target parameter in staggered DID-IV designs is the cohort specific average causal response on the treated (CACRT) defined below.

Definition.

The cohort specific average causal response on the treated (CACRT) at a given relative period $l$ from the initial adoption of the instrument is

\[
CACRT_{e,l}\equiv\sum_{j=1}^{J}w^{e}_{e+l,j}\cdot E[Y_{i,e+l}(j)-Y_{i,e+l}(j-1)|E_{i}=e,D_{i,e+l}^{e}\geq j>D_{i,e+l}^{\infty}]
\]

where the weights $w^{e}_{e+l,j}$ are:

\[
w^{e}_{e+l,j}=\frac{Pr(D_{i,e+l}^{e}\geq j>D_{i,e+l}^{\infty}|E_{i}=e)}{\sum_{j=1}^{J}Pr(D_{i,e+l}^{e}\geq j>D_{i,e+l}^{\infty}|E_{i}=e)}.
\]

The CACRT is a weighted average of the effects of a unit increase in the treatment on the outcome for those who are in cohort $e$ and are induced to increase the treatment by the instrument at relative period $l$ after the initial exposure. This parameter is similar to the average causal response (ACR) considered in Angrist and Imbens (1995); the difference is that there exist dynamic effects in the first stage, and each weight $w^{e}_{e+l,j}$ and the associated causal parameters in the CACRT are conditioned on $E_{i}=e$.
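The weighting in the CACRT can be sketched as follows for a single cohort and relative period. The crossing probabilities $Pr(D^{e}\geq j>D^{\infty}|E_{i}=e)$ and the unit-increase effects are hypothetical inputs, and the function name is ours.

```python
def cacrt(prob_cross, effect):
    """Weighted average of unit-increase effects for one (cohort, relative period) cell.

    prob_cross[j] = Pr(D^e >= j > D^inf | E = e)
    effect[j]     = E[Y(j) - Y(j-1) | E = e, D^e >= j > D^inf]
    """
    total = sum(prob_cross.values())
    weights = {j: p / total for j, p in prob_cross.items()}
    value = sum(weights[j] * effect[j] for j in prob_cross)
    return weights, value


# Hypothetical inputs for a treatment taking values in {0, 1, 2, 3}.
prob_cross = {1: 0.3, 2: 0.2, 3: 0.1}
effect = {1: -1.0, 2: -0.5, 3: 0.0}

weights, value = cacrt(prob_cross, effect)
```

The weights are non-negative and sum to one by construction, so the CACRT is a convex combination of the unit-increase effects, with more weight on treatment levels that the instrument moves more units across.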

With a non-binary, ordered treatment, one can show that Theorem 2 and Lemmas 2-3 in section 4 continue to hold with $CLATT_{e,k}$ replaced by $CACRT_{e,k}$. Note that our decomposition result for the TWFEIV estimator is unchanged under non-binary, ordered treatment settings.

Unbalanced panel case

Throughout sections 2 to 4, we have considered a balanced panel setting. If we assume an unbalanced panel (or repeated cross section) setting, we obtain the following theorem.

Theorem 3.

Suppose Assumptions 1-7 hold. If we assume a binary treatment and an unbalanced panel setting, the population regression coefficient $\beta_{IV}$ is a weighted average of each $CLATT_{e,t}$ in all relative periods after the initial exposure across cohorts, with potentially some negative weights:

\[
\beta_{IV}=\sum_{e}\sum_{t\geq e}w_{e,t}\cdot CLATT_{e,t},
\]

where the weight $w_{e,t}$ is:

\[
w_{e,t}=\frac{E[\hat{Z}_{i,t}|E_{i}=e]\cdot n_{e,t}\cdot CAET^{1}_{e,t}}{\sum_{e}\sum_{t\geq e}E[\hat{Z}_{i,t}|E_{i}=e]\cdot n_{e,t}\cdot CAET^{1}_{e,t}},
\]

where $E[\hat{Z}_{i,t}|E_{i}=e]$ is the expected population residual in cohort $e$ from regressing $Z_{i,t}$ on unit and time fixed effects, and $n_{e,t}$ is the population share of cohort $e$ at time $t$. The weights sum to one.

Proof.

See Appendix C. ∎

Theorem 3 shows that the population regression coefficient $\beta_{IV}$ is a weighted average of all possible $CLATT_{e,t}$ across cohorts, but some weights can be negative. Theorem 3 is related to de Chaisemartin and D’Haultfœuille (2020), who show the decomposition theorem for the TWFEIV estimand when the assignment of the instrument is non-staggered and a no carry over assumption is satisfied in the first stage regression. Theorem 3 instead considers the case in which the assignment of the instrument is staggered and there exist dynamic effects in the first stage. Theorem 3 assumes a binary treatment, but it extends easily to a non-binary, ordered treatment: one obtains the same theorem with $CLATT_{e,t}$ replaced by $CACRT_{e,t}$.

If one wants to check the validity of the TWFEIV estimator in a given application, one can estimate each weight by constructing a consistent estimator for $CAET^{1}_{e,t}$. If there does not exist a never exposed cohort, however, it is not feasible to obtain a consistent estimator for $CAET^{1}_{l,t}$ in the last exposed cohort $l=\max\{E_{i}\}$. In Appendix C, we provide another representation of the decomposition theorem, in which we can estimate each weight consistently and quantify the bias term arising from the bad comparisons performed by TWFEIV regressions.
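As a minimal sketch of such a check, the function below computes the Theorem 3 weights from three cell-level inputs and flags negative ones. All inputs are hypothetical: in practice the residualized instrument means, the cell shares, and the first-stage effects would have to be estimated, and this is not the paper's estimation procedure.

```python
def theorem3_weights(resid_mean, share, caet1):
    """w_{e,t} proportional to E[Z_hat_{i,t} | E_i = e] * n_{e,t} * CAET^1_{e,t}.

    Each argument is a dict keyed by (cohort e, period t) with t >= e.
    """
    raw = {cell: resid_mean[cell] * share[cell] * caet1[cell] for cell in caet1}
    denom = sum(raw.values())
    return {cell: value / denom for cell, value in raw.items()}


# Hypothetical inputs: the early cohort's residualized instrument turns negative
# in the later period, which produces a negative weight for that cell.
resid_mean = {(2, 2): 0.5, (2, 3): -0.1, (3, 3): 0.4}
share = {(2, 2): 0.2, (2, 3): 0.2, (3, 3): 0.2}
caet1 = {(2, 2): 0.3, (2, 3): 0.3, (3, 3): 0.3}

weights = theorem3_weights(resid_mean, share, caet1)
negative_cells = [cell for cell, w in weights.items() if w < 0]
```

The weights sum to one by construction, so any negative weight is offset by weights above their "fair" share elsewhere. Listing `negative_cells` is one simple way to see which cohort-period cells are affected.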

Random assignment of the adoption date

In practice, researchers may use the TWFEIV regression when the adoption date of the instrument is randomized across units (e.g., in a randomized controlled trial). In Appendix C, we consider the causal interpretation of the TWFEIV estimand under the random assignment assumption. In the DID literature, a similar issue is analyzed in Athey and Imbens (2022): they investigate the causal interpretation of the TWFE estimand when the adoption date of the treatment is randomized across units.

First, we define the random assignment assumption of the adoption date $E_{i}$.

Assumption 14 (Random assignment assumption of adoption date $E_{i}$).

For all $t\in\{1,\dots,T\}$ and all $z\in\mathcal{S}(Z)$, $E_{i}$ is independent of potential outcomes:

\[
(Y_{i,t}(1),Y_{i,t}(0),D_{i,t}(z))\mathop{\perp\!\!\!\!\perp}E_{i}.
\]

When the assignment of the adoption date is totally randomized, our target parameter is the local average treatment effect (LATE) defined below.

Definition.

The local average treatment effect (LATE) at a given relative period $l$ from the initial adoption of the instrument is

\[
LATE_{e,l}=E[Y_{i,e+l}(1)-Y_{i,e+l}(0)|CM_{e,e+l}].
\]

Unlike the CLATT, this parameter is not conditioned on the adoption date $E_{i}$ due to the independence assumption. The causal parameter in the first stage, $CAET^{1}_{e,l}$, also simplifies to the average exposed effect ($AE^{1}_{e,l}$) defined below:

\begin{align*}
CAET^{1}_{e,l} &= E[D_{i,e+l}-D_{i,e+l}^{\infty}] \\
&\equiv AE^{1}_{e,l}.
\end{align*}

If Assumptions 1-7 and Assumption 14 hold, one can obtain the theorem and lemmas in section 4 with $CAET^{1}_{k,t}$ and $CLATT_{k,t}$ replaced by $AE^{1}_{k,t}$ and $LATE_{k,t}$, respectively. This implies that even when the adoption date of the instrument is randomized, we cannot interpret the TWFEIV estimand causally in general, and the causal interpretation requires the stable exposed effect assumptions in both the first stage and reduced form regressions.

6   Application

In this section, we illustrate our DID-IV decomposition theorem in the setting of Miller and Segal (2019). We first explain our dataset. We then assess the plausibility of the staggered DID-IV identification strategy implicitly imposed by Miller and Segal (2019). Finally, we present the DID-IV decomposition result and state the implication.

Miller and Segal (2019) study the effect of an increase in the share of female police officers on intimate partner homicide (IPH) rates among women in the United States between 1977 and 1991. The increase was in line with a shift in gender norms during this period, and there was growing interest in whether female integration improved police quality in addressing violence against women.

To establish the causal relationship, Miller and Segal (2019) first regress the IPH rates on the lagged female officers’ share with county and year fixed effects. In the second part of their analysis, Miller and Segal (2019) exploit "plausibly exogenous variation in female integration from externally imposed AA (affirmative action) following employment discrimination cases against particular departments in different years" across 255 counties. Specifically, Miller and Segal (2019) use the two-way fixed effects instrumental variable regression, instrumenting the lagged female officers’ share with the exposure years of AA plans.

Miller and Segal (2019) implicitly rely on staggered DID-IV designs to estimate the causal effects: they are concerned that "AA itself might have occurred following increasing trends" in the share of female officers or the IPH rates. To address this concern, Miller and Segal (2019) check the trends of these variables before AA introduction using event study regressions in the first stage and reduced form.

In this application, we slightly modify the authors’ setting for simplicity. Specifically, unlike Miller and Segal (2019), we use the staggered adoption of AA plans as our instrument instead of the exposure years. In the authors’ setting, AA plans were terminated in some counties during the sample period, which is probably the reason why Miller and Segal (2019) use the exposure years of AA plans as their instrument. We instead drop such counties from our sample and make the instrument assignment staggered. Although it reduces our sample size, it allows us to have a clearer staggered DID-IV identification strategy. In addition, it enables us to apply our DID-IV decomposition theorem to the TWFEIV estimate in the authors’ setting.

Data

Table 1: The staggered AA introduction: exposure year, cohort sizes.

| Start year of AA plans | Number of counties |
|---|---|
| Unexposed counties | 159 |
| 1976 | 6 |
| 1977 | 3 |
| 1978 | 3 |
| 1979 | 4 |
| 1980 | 3 |
| 1981 | 5 |
| 1982 | 4 |
| 1983 | 3 |
| 1984 | 2 |
| 1985 | 1 |
| 1986 | 1 |
| 1987 | 3 |
| 1988 | 1 |
| 1990 | 1 |

  • Notes: This table presents the initial exposure year of AA plans and the number of counties in each year in our final sample.

Table 2: Summary statistics

| | All counties | Unexposed counties | Exposed counties |
|---|---|---|---|
| IPH per 100000 population | | | |
| 1977-91 | 0.544 | 0.521 | 0.638 |
| 1977 | 0.549 | 0.526 | 0.641 |
| 1991 | 0.489 | 0.461 | 0.599 |
| Lagged female officer share | | | |
| 1977-91 | 0.053 | 0.050 | 0.066 |
| 1977 | 0.033 | 0.033 | 0.032 |
| 1991 | 0.077 | 0.071 | 0.101 |
| Counties | 199 | 159 | 40 |
| Observations | 2985 | 2385 | 600 |

  • Notes: This table presents summary statistics on our final sample from 1977 to 1991. The sample consists of 199 counties.

The data come from Miller and Segal (2019). Our final sample differs from their main analysis sample in two ways. First, unlike Miller and Segal (2019), we only include the counties whose variables are observed in all sample periods. This restriction excludes 20 counties and allows us to create a balanced panel data set. Second, as we already noted, we construct an instrument that takes the value one after the AA introduction. Miller and Segal (2019) use data on AA plans from Miller and Segal (2012) and define the instrument as the difference between the current year and the start year of AA introduction; see Miller and Segal (2012) and Miller and Segal (2019) for details. (Footnote 1: As one can see from this construction, Miller and Segal (2019) create the lagged instrument in line with the lagged female officers’ share. Therefore, we construct the lagged staggered instrument instead of the current one.) We identify the initial year of AA plans in each county, and discard the counties whose AA plans ended between 1976 and 1990 (8 counties dropped) and those whose AA plans were already implemented before 1976 (23 counties dropped). Table 1 shows the timing of AA adoption across 199 counties between 1976 and 1990.
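The sample rule and the instrument construction described above can be sketched as follows. The function names and arguments are hypothetical, the one-year lag is our assumption about how "lagged" is implemented (the text does not state the lag length), and this is not the authors' code.

```python
def keep_county(start_year, end_year, first=1976, last=1990):
    """Sample rule sketched from the text: keep never-exposed counties and counties
    whose AA plan did not start before `first` and did not end within [first, last]."""
    if start_year is None:
        return True                      # unexposed county
    if start_year < first:
        return False                     # plan already implemented before the window
    if end_year is not None and first <= end_year <= last:
        return False                     # plan terminated within the window
    return True


def lagged_instrument(start_year, year):
    """Staggered instrument lagged by one year (assumed lag length):
    equals 1 if the AA plan had started by year - 1, and 0 otherwise."""
    return int(start_year is not None and start_year <= year - 1)


# A county whose plan starts in 1980 switches on in 1981 and stays on.
path_1980 = [lagged_instrument(1980, year) for year in range(1977, 1992)]
```

Once switched on, the instrument never switches off, which is what makes the assignment staggered after the terminated plans are dropped.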

Summary statistics for county characteristics are reported in Table 2. Our sample is smaller than, but otherwise similar to, that of Miller and Segal (2019). Counties are separated into exposed and unexposed counties based on whether the county experienced AA introduction. In both types of counties, the lagged female officers’ share increased over time, but it increased more in counties that were exposed to AA plans during the sample period. The IPH rates trended downward in all counties, and there seem to be no systematic differences in the trends between exposed and unexposed counties.

Assessing the identifying assumptions in staggered DID-IV design

In this section, we discuss the validity of the staggered DID-IV identification strategy implicitly imposed by Miller and Segal (2019). Note that in the authors’ setting, our target parameter is the cohort specific average causal response on the treated (CACRT) as female officer share is a non-binary, ordered treatment. We therefore expect that we can identify each CACRT if the underlying staggered DID-IV identification strategy seems plausible, which we will check below. Here, we presume the no carry over assumption (Assumption 2).

Exclusion restriction (Assumption 3).

This assumption is plausible provided that the AA plans (instrument) did not affect IPH rates other than by increasing the female officers’ share. It may be violated, for instance, if the AA plans increased both the black and female officer shares and changes in IPH rates reflect both effects. Miller and Segal (2019) conduct a robustness check and confirm that this is not the case; see footnote 42 in Miller and Segal (2019) for details.

Monotonicity assumption (Assumption 4).

This assumption is plausibly satisfied in the authors’ setting: the AA plans (instrument) were imposed on departments with the intent to increase the share of female police officers. This suggests that the dynamic effects of the instrument on the share of female police officers should be non-negative after the AA introduction.

No anticipation in the first stage (Assumption 5).

This assumption requires that the treatment status before the AA plans, i.e., the female officers’ share, is equal to the one in the absence of the AA introduction across counties. It is plausible if there is no anticipatory behavior, and may be violated if the police departments in some counties had private knowledge about the probability of the AA introduction and manipulated their treatment status before the implementation.

(a) First stage DID

(b) Reduced form DID

Figure 2: Weighted average of the effects of AA introduction on female officer share and IPH rates in Miller and Segal (2019). Notes: The results for the effects of AA plans on the female officers’ share (Panel (a)) and on the IPH rates (Panel (b)) under the staggered DID-IV identification strategy. The red line represents the weighted average of the estimates with simultaneous 95% confidence intervals for pre-exposed periods in both panels, where the weight reflects cohort size in each period. The control group is a never exposed cohort and the reference period is $t-1$ for the period $t$ estimate. These should be equal to zero under the null hypothesis that the parallel trends assumptions in the treatment and the outcome hold. The blue line represents the weighted average of the estimates with simultaneous 95% confidence intervals for post-exposed periods in both panels, where the weight reflects cohort size in each period. The control group is a never exposed cohort and the reference period is $t=-1$ for all post-exposed period estimates. All the standard errors are clustered at the county level.

Next, we assess the plausibility of the parallel trends assumptions in the treatment and the outcome. To do so, we apply the method proposed by Callaway and Sant’Anna (2021) to the first stage and reduced form, respectively.[2] Specifically, we estimate the weighted average of the effects of the instrument on the treatment and outcome in each relative period where the weight reflects the cohort size. We depict the results in Figure 2. The plots report estimates for the effects before and after AA plans with a simultaneous 95% confidence interval in each stage. The confidence intervals account for clustering at the county level.

[2] Unfortunately, in the presence of heterogeneous treatment effects, the coefficients on event study regression face a contamination bias shown by Sun and Abraham (2021).

Parallel trends assumption in the treatment (Assumption 6).

It requires that if the AA plans had not occurred, the average time trends of the female officers share would have been the same across counties and over time. The pre-exposed estimates in Panel (a) in Figure 2 seem consistent with the parallel trends assumption in the treatment: the pre-exposed estimates around AA plans are not significantly different from zero.

Parallel trends assumption in the outcome (Assumption 7).

It requires that if the AA plans had not been implemented, the average time trends of the IPH rates would have been the same across counties and over time. Panel (b) in Figure 2 shows that the pre-exposed estimates around AA introduction are not significantly different from zero, which indicates that the parallel trends assumption in the outcome is also plausible.


Figure 2 also sheds light on the dynamic effects of the AA plans on the female officer share and IPH rates during the post-exposed periods. The figure indicates that the effect of the AA plans on the female officer share increases over time, whereas the effect on IPH rates through the female officer share has downward trends during the post-exposed periods. We note that the estimated effects in the reduced form are not scaled by the ones in the first stage, i.e., these estimates do not capture each CACRT after the AA shock.

Illustrating the weights in TWFEIV regression

First, we estimate the two-way fixed effects instrumental variable regression in the authors’ setting. To clearly illustrate the shortcomings of the TWFEIV regression, we modify the authors’ specification in two ways: Miller and Segal (2019) include some covariates and weight their regression with county population, whereas we exclude such covariates and do not apply their weights to our regression.

The result is shown in Table 3. The two-way fixed effects instrumental variable estimate is $-0.646$ and it is not significantly different from zero.[3] However, as we already noted in section 4, we cannot generally interpret the IV estimate as measuring a properly weighted average of each CACRT if the effect of the AA introduction on female officer share or IPH rates is not stable over time.

[3] Although Miller and Segal (2019) do not report the TWFEIV estimate without weights and covariates, when we run such a TWFEIV regression in their final analysis sample, the IV estimate is $-1.445$ and is not significantly different from zero. This implies that we reach the same conclusion as in Miller and Segal (2019) in our data.

Our DID-IV decomposition theorem (Theorem 1) allows us to visualize the source of variations in the three types of the DID-IV design: Unexposed/Exposed, Exposed/Not Yet Exposed, and Exposed/Exposed Shift designs. Panel (a) in Figure 3 plots the weights and the corresponding Wald-DID estimates for all designs and Panels (b), (c), and (d) in Figure 3 plot them for each type of the DID-IV design, respectively. Table 4 reports the total weight, total Wald-DID estimate, and weighted average of Wald-DID estimates in each type of the DID-IV design. The total weight and total Wald-DID estimate are calculated by summing the weights and Wald-DID estimates respectively, and the weighted average of Wald-DID estimates is calculated by summing the products of the weight and the associated Wald-DID estimate. Summing the weighted averages of Wald-DID estimates across the three types yields the two-way fixed effects instrumental variable estimate ($-0.646$).

Panel (a) in Figure 3 shows that the weights are heavily assigned to the Wald-DID estimates in Unexposed/Exposed designs. This is due to the large sample size of the unexposed cohort in the authors’ setting. Panels (b), (c), and (d) in Figure 3 highlight that some weights in each type can be negative: 2 out of 14 weights are negative in Unexposed/Exposed designs, 29 out of 91 weights are negative in Exposed/Not Yet Exposed designs and 50 out of 91 weights are negative in Exposed/Exposed Shift designs. The negative weights arise because some DID estimates of the treatment in the first stage are negative in each type of the DID-IV design.
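The design counts above follow from simple combinatorics. As a minimal sketch, assuming 14 exposed cohorts plus one never exposed cohort (the cohort count is not stated here and is inferred from the 14 Unexposed/Exposed weights), the snippet below reproduces the number of designs of each type and computes the share of negative weights from the counts reported in the text. The variable names are illustrative.

```python
from math import comb

# Assumption: 14 exposed cohorts, inferred from the 14 Unexposed/Exposed weights.
n_exposed_cohorts = 14

# One Unexposed/Exposed design per exposed cohort; each pair of exposed cohorts
# yields one Exposed/Not Yet Exposed and one Exposed/Exposed Shift design.
n_designs = {
    "Unexposed/Exposed": n_exposed_cohorts,
    "Exposed/Not Yet Exposed": comb(n_exposed_cohorts, 2),
    "Exposed/Exposed Shift": comb(n_exposed_cohorts, 2),
}

# Counts of negative weights as reported in the text.
n_negative = {
    "Unexposed/Exposed": 2,
    "Exposed/Not Yet Exposed": 29,
    "Exposed/Exposed Shift": 50,
}

negative_share = {k: n_negative[k] / n_designs[k] for k in n_designs}
for design, share in negative_share.items():
    print(f"{design}: {n_negative[design]}/{n_designs[design]} = {share:.2f}")
```

Under this reading, more than half of the weights in the Exposed/Exposed Shift designs are negative.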

The TWFEIV estimate suffers from a downward bias due to the bad comparisons arising from the Exposed/Exposed Shift designs. As we already mentioned in section 4, the TWFEIV estimand potentially fails to summarize the causal effects if the effect of the instrument on the treatment or the outcome evolves over time. Table 4 indicates that the estimated bias occurring from the Exposed/Exposed Shift designs is quantitatively not negligible: the weighted average of the Wald-DID estimates in the Exposed/Exposed Shift designs is $-0.093$, which accounts for one-seventh of our IV estimate.
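The arithmetic behind these statements can be checked directly from Table 4. The following sketch copies the rounded values from the table (so results hold only up to rounding) and confirms that the weights sum to one, that the weighted Wald-DID estimates sum to the TWFEIV estimate, and that the Exposed/Exposed Shift designs contribute roughly one-seventh of it. The dictionary layout is only illustrative.

```python
# Values copied from Table 4 (rounded to three decimals in the source).
table4 = {
    "Unexposed/Exposed": {"total_weight": 1.026, "weighted_wdd": -0.399},
    "Exposed/Not Yet Exposed": {"total_weight": 0.010, "weighted_wdd": -0.154},
    "Exposed/Exposed Shift": {"total_weight": -0.036, "weighted_wdd": -0.093},
}

# The decomposition weights sum to one across all designs.
total_weight = sum(row["total_weight"] for row in table4.values())

# The weighted Wald-DID estimates sum to the TWFEIV estimate.
twfeiv_estimate = sum(row["weighted_wdd"] for row in table4.values())

# Share of the estimate attributable to Exposed/Exposed Shift designs.
shift_share = table4["Exposed/Exposed Shift"]["weighted_wdd"] / twfeiv_estimate

print(round(total_weight, 3), round(twfeiv_estimate, 3), round(shift_share, 3))
```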

Figure 3: Instrumented difference-in-differences decomposition result in the setting of Miller and Segal (2019). Notes: Panel (a) plots the weights and the corresponding Wald-DID estimates for all DID-IV designs and Panels (b), (c), and (d) plot them for each type of the DID-IV design, respectively. Unexposed/Exposed designs yield blue circles, Exposed/Not Yet Exposed designs yield yellow triangles and Exposed/Exposed Shift designs yield red squares.
Table 3: Estimate for the effect of female officers share on IPH rates.

| | Estimate | Standard Error | 95% CI |
|---|---|---|---|
| TSLS with fixed effects | -0.646 | 3.284 | [-7.594, 6.301] |

  • Notes: Sample consists of 199 counties. Confidence intervals account for clustering at the county level.

Table 4: Total weight, Total and Weighted WDD estimates in each type of the DID-IV design.

| | Total weight | Total WDD estimate | Weighted WDD estimate |
|---|---|---|---|
| Unexposed/Exposed | 1.026 | -233.198 | -0.399 |
| Exposed/Not Yet Exposed | 0.010 | -490.665 | -0.154 |
| Exposed/Exposed Shift | -0.036 | -569.426 | -0.093 |

  • Notes: This table presents the total weight, total Wald-DID estimate, and weighted average of Wald-DID estimates in each type of the DID-IV design: Unexposed/Exposed, Exposed/Not Yet Exposed, and Exposed/Exposed Shift designs.

7   Alternative specifications

So far, we have considered simple TWFEIV regressions as in equation (1). However, many studies routinely estimate various specifications, such as weighting or introducing covariates, to check the robustness of their findings. In this section, we extend our DID-IV decomposition theorem to the settings with weighting and covariates, and provide simple tools to examine how different specifications affect differences in estimates. We illustrate these by revisiting Miller and Segal (2019).

The tools we provide here are based on Goodman-Bacon (2021). Recall that our DID-IV decomposition theorem shows that the TWFEIV estimator can be written as the product of a vector of $2\times 2$ Wald-DID estimators ($\hat{\bm{\beta}}_{IV}^{2\times 2}$) and a vector of weights ($\bm{s}$), that is, $\hat{\beta}_{IV}=\bm{s}^{\prime}\hat{\bm{\beta}}_{IV}^{2\times 2}$. When a TWFEIV estimator generated from a different specification ($\hat{\beta}_{IV,alt}$) can also be written as the product of a vector of $2\times 2$ Wald-DID estimators ($\hat{\bm{\beta}}_{IV,alt}^{2\times 2}$) and a vector of their associated weights ($\bm{s}_{alt}$), one can decompose the difference between the two specifications as

\begin{align*}
\hat{\beta}_{IV,alt}-\hat{\beta}_{IV}=\underbrace{\bm{s}^{\prime}(\hat{\bm{\beta}}_{IV,alt}^{2\times 2}-\hat{\bm{\beta}}_{IV}^{2\times 2})}_{\text{Due to } 2\times 2 \text{ Wald-DIDs}}+\underbrace{(\bm{s}_{alt}^{\prime}-\bm{s}^{\prime})\hat{\bm{\beta}}_{IV}^{2\times 2}}_{\text{Due to } 2\times 2 \text{ weights}}+\underbrace{(\bm{s}_{alt}^{\prime}-\bm{s}^{\prime})(\hat{\bm{\beta}}_{IV,alt}^{2\times 2}-\hat{\bm{\beta}}_{IV}^{2\times 2})}_{\text{Due to the interaction of the two}}.
\end{align*}

It takes the form of an Oaxaca-Blinder-Kitagawa decomposition (Oaxaca (1973), Blinder (1973), Kitagawa (1955)) and indicates that the difference comes from changes in $2\times 2$ Wald-DID estimators, changes in weights, and the interaction of the two. Dividing both sides by $\hat{\beta}_{IV,alt}-\hat{\beta}_{IV}$, one can measure the proportional contribution of each term to the difference. Plotting each pair in $(\hat{\bm{\beta}}_{IV,alt}^{2\times 2},\hat{\bm{\beta}}_{IV}^{2\times 2})$ and $(\bm{s}_{alt}^{\prime},\bm{s}^{\prime})$, one can also examine which elements in each term have a significant impact on the difference.
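The decomposition can be sketched numerically as follows. The function name `decompose_difference` and the input vectors are hypothetical and serve only to illustrate the three terms; they are not taken from the application.

```python
import numpy as np

def decompose_difference(s, b, s_alt, b_alt):
    """Split s_alt'b_alt - s'b into Wald-DID, weight, and interaction terms."""
    s, b, s_alt, b_alt = (np.asarray(v, dtype=float) for v in (s, b, s_alt, b_alt))
    parts = {
        "wald_dids": s @ (b_alt - b),              # due to 2x2 Wald-DIDs
        "weights": (s_alt - s) @ b,                # due to 2x2 weights
        "interaction": (s_alt - s) @ (b_alt - b),  # due to the interaction
    }
    total = s_alt @ b_alt - s @ b
    # Proportional contribution of each term to the difference.
    shares = {name: value / total for name, value in parts.items()}
    return total, parts, shares

# Hypothetical weights and Wald-DID estimates for three 2x2 comparisons.
s, b = [0.6, 0.3, 0.1], [-1.0, 0.5, 2.0]
s_alt, b_alt = [0.5, 0.3, 0.2], [-1.2, 0.4, 2.5]

total, parts, shares = decompose_difference(s, b, s_alt, b_alt)
print(total, parts, shares)
```

By construction the three terms add up to the difference between the two estimates, so the shares sum to one whenever the difference is nonzero.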

7.1   Weighted TWFEIV regression

When researchers use a weighted TWFEIV regression instead of an unweighted one, it potentially changes the influence of the Wald-DID estimators ($\hat{\bm{\beta}}_{IV,WLS}^{2\times 2}$) by replacing the DIDs of the treatment and the outcome with the weighted ones. It also potentially changes the influence of the weights ($\bm{s}_{WLS}^{\prime}$) by replacing the sample share with the relative amount of the specified weight and the DIDs of the treatment with the weighted ones. Table 5 shows the result of our TWFEIV regression weighted by county population in Miller and Segal (2019): the estimate changes from $-0.646$ to $-0.386$. The decomposition result indicates that the contribution of the changes in $2\times 2$ Wald-DIDs is negative, whereas the contributions of the changes in weights and the interaction are positive.
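To make the first channel concrete, the sketch below computes a single $2\times 2$ Wald-DID with and without unit weights. The function `wald_did` and the toy data (four units, two periods, hypothetical population weights) are assumptions introduced for illustration; the weighted version simply replaces each group mean in the DIDs of the outcome and the treatment with a weighted mean.

```python
import numpy as np

def wald_did(y, d, exposed, w=None):
    """2x2 Wald-DID: DID of the outcome divided by DID of the treatment.

    y, d: arrays of shape (units, 2) holding pre and post values.
    exposed: boolean array marking units in the exposed group.
    w: optional unit weights (e.g. population); equal weights if omitted.
    """
    w = np.ones(len(exposed)) if w is None else np.asarray(w, dtype=float)

    def did(v):
        change = v[:, 1] - v[:, 0]
        return (np.average(change[exposed], weights=w[exposed])
                - np.average(change[~exposed], weights=w[~exposed]))

    return did(y) / did(d)

# Hypothetical data: two exposed and two unexposed units.
exposed = np.array([True, True, False, False])
d = np.array([[0.1, 0.3], [0.2, 0.6], [0.1, 0.1], [0.2, 0.3]])
y = np.array([[5.0, 4.6], [6.0, 5.0], [5.0, 5.0], [6.0, 5.9]])
weights = np.array([1.0, 3.0, 1.0, 1.0])

wdd_unweighted = wald_did(y, d, exposed)
wdd_weighted = wald_did(y, d, exposed, weights)
print(wdd_unweighted, wdd_weighted)
```

In this toy example the weighted Wald-DID differs from the unweighted one because the heavily weighted exposed unit has larger changes in both the treatment and the outcome.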

Figure 4 plots the $2\times 2$ Wald-DIDs and the associated weights in WLS against those in OLS. Panel (a) shows that most comparisons of the Wald-DID between OLS and WLS are located at the 45-degree line, but some comparisons generated from Exposed/Not Yet Exposed and Exposed/Exposed Shift designs are away from the 45-degree line. In addition, this figure indicates that the Wald-DID generated from the comparison between 1978 and 1991 counties (1991 counties are the controls) is much more negative in WLS than in OLS, which drives the overall negative impact of the changes in $2\times 2$ Wald-DIDs on the difference between the two specifications. Panel (b) shows that most comparisons of the decomposition weight between OLS and WLS are near the 45-degree line and the origin, but some comparisons generated from Unexposed/Exposed designs are away from the 45-degree line and the origin. This figure also indicates that the decomposition weight generated from the comparison between 1982 and unexposed counties is much more positive in WLS than in OLS, which causes the overall positive impact of the changes in weights on the difference between the two specifications.

Table 5: Estimate for the effect of female officers share on IPH rates.

| | (1) Baseline | (2) WLS | (3) Covariates |
|---|---|---|---|
| Estimate | -0.646 | -0.386 | -0.868 |
| Standard Error | 3.284 | 2.452 | 3.968 |
| Difference from baseline | | 0.260 | -0.222 |
| Difference comes from: | | | |
| $2\times 2$ Wald-DIDs | | -4.048 | 0.370 |
| Weights | | 2.341 | 16.503 |
| Interaction | | 1.966 | -17.107 |
| Within term | | 0 | 0.012 |

  • Notes: This table presents TWFEIV estimates in the setting of Miller and Segal (2019). Column (1) is a simple TWFEIV estimate from Eq. (1). Column (2) is a TWFEIV estimate weighted by county population in 1977. Column (3) is a TWFEIV estimate with time-varying covariates which include the lagged local area controls, the county’s non-IPH rate, and the state-level crack cocaine index. All the standard errors are clustered at county level.

7.2   TWFEIV regression with time-varying covariates

In most applications of the DID-IV method, researchers estimate TWFEIV models that include time-varying covariates, in addition to the simple ones, based on the belief that it enhances the validity of the parallel trends assumptions in the first stage and reduced form regressions:

\begin{align*}
Y_{i,t}&=\phi_{i.}+\lambda_{t.}+\beta_{IV}^{X}D_{i,t}+\psi X_{i,t}+v_{i,t}, \tag{14}\\
D_{i,t}&=\gamma_{i.}+\zeta_{t.}+\pi^{X}Z_{i,t}+\tilde{\psi}X_{i,t}+\eta_{i,t}. \tag{15}
\end{align*}

In this section, we derive a DID-IV decomposition result for the case when we introduce the time-varying covariates into TWFEIV regressions. Our decomposition result in this section is based on Goodman-Bacon (2021), who decomposes TWFE estimators with time-varying covariates. Appendix D further considers the causal interpretation of the covariate-adjusted TWFEIV estimand under additional conditions.

First, consider the coefficient on the instrument ($\alpha^{X}$) in the reduced form regression:

\begin{align*}
Y_{i,t}=\phi_{i.}+\lambda_{t.}+\alpha^{X}Z_{i,t}+\xi X_{i,t}+v_{i,t}. \tag{16}
\end{align*}

Let $\tilde{Z}_{i,t}$ and $\tilde{X}_{i,t}$ denote the double-demeaned variables of $Z_{i,t}$ and $X_{i,t}$ respectively, obtained as the residuals from regressing $Z_{i,t}$ and $X_{i,t}$ on time and unit fixed effects. Let $\tilde{z}_{i,t}$ denote the residuals obtained from regressing $\tilde{Z}_{i,t}$ on $\tilde{X}_{i,t}$:

\begin{align*}
\tilde{Z}_{i,t}=\overbrace{\hat{\Gamma}\tilde{X}_{i,t}}^{\tilde{p}_{i,t}}+\tilde{z}_{i,t}.
\end{align*}

Here, we define the linear projection as $\tilde{p}_{i,t}\equiv\hat{\Gamma}\tilde{X}_{i,t}$. The specific expression for $\tilde{z}_{i,t}$ is:

\begin{align*}
\tilde{z}_{i,t}&=[(Z_{i,t}-\bar{Z}_{i})-(\hat{\Gamma}X_{i,t}-\hat{\Gamma}\bar{X}_{i})]-[(\bar{Z}_{t}-\bar{\bar{Z}})-(\hat{\Gamma}\bar{X}_{t}-\hat{\Gamma}\bar{\bar{X}})]\\
&\equiv(z_{i,t}-\bar{z}_{i})-(\bar{z}_{t}-\bar{\bar{z}}).
\end{align*}

By the FWL theorem, we then obtain the following expression for $\hat{\alpha}^{X}$:

\begin{align*}
\hat{\alpha}^{X}=\frac{\hat{C}(Y_{i,t},\tilde{z}_{i,t})}{\hat{V}^{\tilde{z}}}=\frac{\hat{C}(Y_{i,t},\tilde{Z}_{i,t}-\tilde{p}_{i,t})}{\hat{V}^{\tilde{z}}},
\end{align*}

where $\hat{V}^{\tilde{z}}$ is the variance of $\tilde{z}_{i,t}$. By symmetry, we can also express the first stage coefficient on the instrument $\hat{\pi}^{X}$ as follows:

\begin{align*}
\hat{\pi}^{X}=\frac{\hat{C}(D_{i,t},\tilde{z}_{i,t})}{\hat{V}^{\tilde{z}}}=\frac{\hat{C}(D_{i,t},\tilde{Z}_{i,t}-\tilde{p}_{i,t})}{\hat{V}^{\tilde{z}}}.
\end{align*}

Because the IV estimator $\hat{\beta}_{IV}^{X}$ is the ratio of the reduced form coefficient $\hat{\alpha}^{X}$ to the first stage coefficient $\hat{\pi}^{X}$, we obtain the following expression for $\hat{\beta}_{IV}^{X}$:

\begin{align*}
\hat{\beta}_{IV}^{X}=\frac{\hat{C}(Y_{i,t},\tilde{z}_{i,t})}{\hat{C}(D_{i,t},\tilde{z}_{i,t})}=\frac{\hat{C}(Y_{i,t},\tilde{Z}_{i,t}-\tilde{p}_{i,t})}{\hat{C}(D_{i,t},\tilde{Z}_{i,t}-\tilde{p}_{i,t})}. \tag{17}
\end{align*}
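Equation (17) can be checked numerically. The sketch below is a minimal illustration on simulated data, assuming a balanced panel with three cohorts and a single time-varying covariate; the data-generating process, sample sizes, and variable names are hypothetical. It compares the ratio of covariances in (17) with a just-identified 2SLS estimate that includes unit and time dummies explicitly.

```python
import numpy as np

rng = np.random.default_rng(0)
n_units, n_periods = 60, 8

# Hypothetical staggered design: cohorts first exposed at t = 3, t = 5, or never.
start = np.repeat([3, 5, n_periods], n_units // 3)
unit = np.repeat(np.arange(n_units), n_periods)
time = np.tile(np.arange(n_periods), n_units)

Z = (time >= start[unit]).astype(float)             # instrument
X = rng.normal(size=unit.size) + 0.3 * Z            # time-varying covariate
D = 0.5 * Z + 0.2 * X + rng.normal(size=unit.size)  # treatment
Y = 1.5 * D + 0.4 * X + rng.normal(size=unit.size)  # outcome

def double_demean(v):
    """Residual from regressing v on unit and time fixed effects (balanced panel)."""
    m = v.reshape(n_units, n_periods)
    r = m - m.mean(axis=1, keepdims=True) - m.mean(axis=0, keepdims=True) + m.mean()
    return r.ravel()

# Residualize the double-demeaned instrument on the double-demeaned covariate.
Z_t, X_t = double_demean(Z), double_demean(X)
gamma_hat = (X_t @ Z_t) / (X_t @ X_t)
z_resid = Z_t - gamma_hat * X_t

# Equation (17); z_resid has mean zero, so inner products are proportional
# to sample covariances and the proportionality constant cancels.
beta_fwl = (Y @ z_resid) / (D @ z_resid)

# Direct just-identified IV with explicit unit and time dummies.
unit_d = (unit[:, None] == np.arange(n_units)).astype(float)
time_d = (time[:, None] == np.arange(1, n_periods)).astype(float)
W = np.column_stack([D, X, unit_d, time_d])   # regressors
Q = np.column_stack([Z, X, unit_d, time_d])   # instruments
beta_2sls = np.linalg.solve(Q.T @ W, Q.T @ Y)[0]

print(beta_fwl, beta_2sls)
```

The two numbers agree up to floating-point error, which is the content of the FWL argument above.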

In contrast to the unconditional TWFEIV estimator $\hat{\beta}_{IV}$, the covariate-adjusted TWFEIV estimator exploits the variation in both $\tilde{Z}_{i,t}$ and $\tilde{p}_{i,t}$. $\tilde{Z}_{i,t}$ varies at cohort and time level, but $\tilde{p}_{i,t}$ varies at unit and time level because $X_{i,t}$ varies at unit and time level.

To decompose the covariate-adjusted TWFEIV estimator $\hat{\beta}_{IV}^{X}$, we first partition $\tilde{z}_{i,t}$ into "within" and "between" terms as in Goodman-Bacon (2021). Let $\bar{z}_{k,t}-\bar{z}_{k}=(\bar{Z}_{k,t}-\bar{Z}_{k})-(\hat{\Gamma}\bar{X}_{k,t}-\hat{\Gamma}\bar{X}_{k})$ be the average of $z_{i,t}-\bar{z}_{i}$ in cohort $k$. By adding and subtracting $\bar{z}_{k,t}-\bar{z}_{k}$, we can decompose $\tilde{z}_{i,t}$ into two terms:

\begin{align*}
\tilde{z}_{i,t}=\underbrace{[(z_{i,t}-\bar{z}_{i})-(\bar{z}_{k,t}-\bar{z}_{k})]}_{\tilde{z}_{i(k),t}}+\underbrace{[(\bar{z}_{k,t}-\bar{z}_{k})-(\bar{z}_{t}-\bar{\bar{z}})]}_{\tilde{z}_{k,t}}. \tag{18}
\end{align*}

The first term $\tilde{z}_{i(k),t}$ measures the deviation of $z_{i,t}-\bar{z}_{i}$ from the average $\bar{z}_{k,t}-\bar{z}_{k}$ in cohort $k$, which we call the within term of $\tilde{z}_{i,t}$. The second term $\tilde{z}_{k,t}$ measures the deviation of $\bar{z}_{k,t}-\bar{z}_{k}$ from the average $\bar{z}_{t}-\bar{\bar{z}}$ in the whole sample, which we call the between term of $\tilde{z}_{i,t}$. The within term $\tilde{z}_{i(k),t}$ varies at unit and time level because of $\tilde{p}_{i,t}$, whereas the between term $\tilde{z}_{k,t}$ varies at cohort and time level.

By substituting (18) into (17), we obtain

\begin{align*}
\hat{\beta}_{IV}^{X}=\frac{\hat{C}(Y_{i,t},\tilde{z}_{i(k),t})+\hat{C}(Y_{i,t},\tilde{z}_{k,t})}{\hat{C}(D_{i,t},\tilde{z}_{i(k),t})+\hat{C}(D_{i,t},\tilde{z}_{k,t})}&=\frac{\hat{V}_{w}^{z}\hat{\beta}_{w}^{p,y}+\hat{V}_{b}^{z}\hat{\beta}_{b}^{z,y}}{\hat{V}_{w}^{z}\hat{\beta}_{w}^{p,d}+\hat{V}_{b}^{z}\hat{\beta}_{b}^{z,d}} \tag{19}\\
&=\underbrace{\frac{\hat{C}_{w}^{D,\tilde{z}}}{\hat{C}_{w}^{D,\tilde{z}}+\hat{C}_{b}^{D,\tilde{z}}}}_{\Omega}\cdot\underbrace{\frac{\hat{\beta}_{w}^{p,y}}{\hat{\beta}_{w}^{p,d}}}_{\hat{\beta}_{w,IV}^{p}}+\underbrace{\frac{\hat{C}_{b}^{D,\tilde{z}}}{\hat{C}_{w}^{D,\tilde{z}}+\hat{C}_{b}^{D,\tilde{z}}}}_{1-\Omega}\cdot\underbrace{\frac{\hat{\beta}_{b}^{z,y}}{\hat{\beta}_{b}^{z,d}}}_{\hat{\beta}_{b,IV}^{z}}. \tag{20}
\end{align*}

We use the subscript $w$ to denote within components and the subscript $b$ to denote between components. $\hat{V}_{w}^{z}$ and $\hat{V}_{b}^{z}$ are the variances of $\tilde{z}_{i(k),t}$ and $\tilde{z}_{k,t}$, respectively. $\hat{C}_{w}^{D,\tilde{z}}$ is the covariance between $D_{i,t}$ and $\tilde{z}_{i(k),t}$, the within term of $\tilde{z}_{i,t}$. $\hat{C}_{b}^{D,\tilde{z}}$ is the covariance between $D_{i,t}$ and $\tilde{z}_{k,t}$, the between term of $\tilde{z}_{i,t}$. The weight $\Omega=\frac{\hat{C}_{w}^{D,\tilde{z}}}{\hat{C}_{w}^{D,\tilde{z}}+\hat{C}_{b}^{D,\tilde{z}}}$ measures the relative amount of the within covariance $\hat{C}_{w}^{D,\tilde{z}}$.
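The within/between split in equations (18) to (20) can be illustrated on simulated data. The following sketch assumes a balanced panel with three equally sized cohorts and one time-varying covariate; the data-generating process and all variable names are hypothetical. It builds the within and between terms, computes $\Omega$ and the two IV coefficients, and checks that their weighted sum reproduces the covariate-adjusted estimator.

```python
import numpy as np

rng = np.random.default_rng(1)
n_units, n_periods, n_cohorts = 90, 8, 3

# Hypothetical cohorts first exposed at t = 3, t = 5, or never (n_periods).
cohort = np.repeat(np.arange(n_cohorts), n_units // n_cohorts)
start = np.array([3, 5, n_periods])[cohort]
t = np.arange(n_periods)

# All arrays have shape (units, periods).
Z = (t[None, :] >= start[:, None]).astype(float)
X = rng.normal(size=Z.shape) + 0.3 * Z
D = 0.5 * Z + 0.2 * X + rng.normal(size=Z.shape)
Y = 1.5 * D + 0.4 * X + rng.normal(size=Z.shape)

def double_demean(m):
    return m - m.mean(axis=1, keepdims=True) - m.mean(axis=0, keepdims=True) + m.mean()

Z_t, X_t = double_demean(Z), double_demean(X)
gamma_hat = (X_t * Z_t).sum() / (X_t ** 2).sum()

z = Z - gamma_hat * X                       # z_{i,t}
a = z - z.mean(axis=1, keepdims=True)       # z_{i,t} - zbar_i

# Cohort-by-period averages of a, expanded back to unit level.
a_k = np.vstack([a[cohort == k].mean(axis=0) for k in range(n_cohorts)])[cohort]

within = a - a_k                                   # within term in (18)
between = a_k - a.mean(axis=0, keepdims=True)      # between term in (18)
z_tilde = within + between

# Both terms have mean zero, so sums of products are proportional to covariances.
C_w, C_b = (D * within).sum(), (D * between).sum()
omega = C_w / (C_w + C_b)
beta_w_iv = (Y * within).sum() / C_w      # within IV coefficient
beta_b_iv = (Y * between).sum() / C_b     # between IV coefficient
beta_x = (Y * z_tilde).sum() / (D * z_tilde).sum()   # equation (17)

print(omega, beta_w_iv, beta_b_iv, beta_x)
```

Because the within covariance can be small, the within IV coefficient may be noisy in a given sample even though the identity in (20) holds exactly.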

$\hat{\beta}_{w}^{p,y}\equiv\frac{\hat{C}(Y_{i,t},\tilde{z}_{i(k),t})}{\hat{V}_{w}^{z}}$ measures the relationship between $Y_{i,t}$ and $\tilde{z}_{i(k),t}$. Similarly, $\hat{\beta}_{w}^{p,d}\equiv\frac{\hat{C}(D_{i,t},\tilde{z}_{i(k),t})}{\hat{V}_{w}^{z}}$ measures the relationship between $D_{i,t}$ and $\tilde{z}_{i(k),t}$. We call these the within coefficients in the reduced form and first stage regressions, respectively. $\hat{\beta}_{w,IV}^{p}\equiv\frac{\hat{\beta}_{w}^{p,y}}{\hat{\beta}_{w}^{p,d}}$ scales the within coefficient in the reduced form regression by the one in the first stage regression. We call this the within IV coefficient.[4] This IV coefficient arises because $\tilde{z}_{i(k),t}$ varies at unit and time level. Similar to what Goodman-Bacon (2021) points out for the covariate-adjusted TWFE estimator, time-varying covariates bring a new source of identifying variation in the TWFEIV estimator: the within variation of $X_{i,t}$ in each cohort.

[4] One can obtain this coefficient by running an IV regression of the outcome on the treatment with $\tilde{z}_{i(k),t}$ as the excluded instrument.

$\hat{\beta}_{b}^{z,y}\equiv\frac{\hat{C}(Y_{i,t},\tilde{z}_{k,t})}{\hat{V}_{b}^{z}}$ measures the relationship between $Y_{i,t}$ and $\tilde{z}_{k,t}$. Similarly, $\hat{\beta}_{b}^{z,d}\equiv\frac{\hat{C}(D_{i,t},\tilde{z}_{k,t})}{\hat{V}_{b}^{z}}$ measures the relationship between $D_{i,t}$ and $\tilde{z}_{k,t}$. We call these the between coefficients in the reduced form and first stage regressions, respectively. $\hat{\beta}_{b,IV}^{z}\equiv\frac{\hat{\beta}_{b}^{z,y}}{\hat{\beta}_{b}^{z,d}}$ divides the between coefficient in the reduced form regression by the one in the first stage regression, and has the following specific expression:

\begin{align*}
\hat{\beta}_{b,IV}^{z}=\frac{\hat{C}^{D,Z}\hat{\beta}_{IV}-\hat{C}^{p}_{b}\hat{\beta}_{b,IV}^{p}}{\hat{C}_{b}^{D,\tilde{z}}}. \tag{21}
\end{align*}

$\hat{C}^{D,Z}$ and $\hat{\beta}_{IV}$ are already defined in section 3. $\hat{C}^{p}_{b}$ is the covariance between $D_{i,t}$ and $\tilde{p}_{k,t}$ (the between term of $\tilde{p}_{i,t}$). $\hat{\beta}_{b,IV}^{p}$ is the estimator obtained from an IV regression of $Y_{i,t}$ on $D_{i,t}$ with $\tilde{p}_{k,t}$ as the excluded instrument. We call $\hat{\beta}_{b,IV}^{z}$ the between IV coefficient, which exploits the cohort and time level variation in $\tilde{z}_{k,t}$. This IV coefficient is not equal to the unconditional TWFEIV coefficient $\hat{\beta}_{IV}$: $\hat{\beta}_{b,IV}^{z}$ subtracts the influence of $\hat{\beta}_{b,IV}^{p}$ from the unconditional IV estimator $\hat{\beta}_{IV}$. This indicates that the time-varying covariates $X_{i,t}$ change the identifying variation at cohort and time level through $\tilde{p}_{k,t}$, the between term of the linear projection $\tilde{p}_{i,t}$.

We can further decompose the between IV coefficient as follows:

\begin{align*}
\hat{\beta}_{b,IV}^{z}=\sum_{k}\sum_{l>k}\underbrace{(n_{k}+n_{l})^{2}\frac{\hat{C}_{b,kl}^{D,\tilde{z}}}{\hat{C}_{b}^{D,\tilde{z}}}}_{s_{b,kl}}\underbrace{\left[\frac{\hat{C}^{D,Z}_{kl}\hat{\beta}_{IV,kl}^{2\times 2}-\hat{C}^{p}_{b,kl}\hat{\beta}_{b,IV,kl}^{p}}{\hat{C}_{b,kl}^{D,\tilde{z}}}\right]}_{\hat{\beta}_{b,IV,kl}^{z}}. \tag{22}
\end{align*}

The proof is given in Appendix D. Each notation is similarly defined in $(k,l)$ cell subsamples. $\hat{\beta}_{b,IV,kl}^{z}$ and $s_{b,kl}$ are the between IV coefficient and the corresponding weight in $(k,l)$ cell subsamples. Equation (22) indicates that the time-varying covariates $X_{i,t}$ affect the between IV coefficient $\hat{\beta}_{b,IV}^{z}$ by changing both the $2\times 2$ between IV coefficient and the associated weight in each $(k,l)$ cell.

To sum up, combining (22) with (19), we can decompose the covariate-adjusted TWFEIV estimator $\hat{\beta}_{IV}^{X}$ as

\begin{align*}
\hat{\beta}_{IV}^{X}=\Omega\hat{\beta}_{w,IV}^{p}+(1-\Omega)\underbrace{\sum_{k}\sum_{l>k}s_{b,kl}\hat{\beta}_{b,IV,kl}^{z}}_{\hat{\beta}_{b,IV}^{z}}.
\end{align*}

The weight $\Omega$ is assigned to the within IV coefficient $\hat{\beta}_{w,IV}^{p}$ and the weight $1-\Omega$ is assigned to the between IV coefficient $\hat{\beta}_{b,IV}^{z}$, which is equal to a weighted average of all possible $2\times 2$ between IV coefficients $\hat{\beta}_{b,IV,kl}^{z}$ as in Theorem 1.

Table 5 presents the result of our TWFEIV regression with time-varying covariates in Miller and Segal (2019). We follow Miller and Segal (2019) and include the lagged local area controls, the county’s non-IPH rate, and the state-level crack cocaine index; see Miller and Segal (2019) for details. The estimate changes from $-0.646$ to $-0.868$. The decomposition result shows that the contribution of the within term is positive but negligible, whereas the contribution of the between term is negative and substantial. Specifically, in the between term, the contributions of the changes in $2\times 2$ Wald-DIDs and weights are positive, but these are offset by the negative contribution of the interaction. This result indicates that in Miller and Segal (2019), the time-varying covariates affect the IV estimate mainly through the identifying variation at cohort and time level, that is, the between term of the linear projection $\tilde{p}_{i,t}$.

Figure 4: Comparisons of $2\times 2$ Wald-DIDs and weights between OLS and WLS in the setting of Miller and Segal (2019). Notes: Panel (a) plots the $2\times 2$ Wald-DIDs in WLS against those in OLS for all DID-IV designs. Panel (b) plots the decomposition weights in WLS against those in OLS for all DID-IV designs. In Panel (a), the size of each point is proportional to the corresponding weight in OLS. In Panel (b), the size of each point is proportional to the corresponding Wald-DID estimate in OLS. In both panels, the dotted lines represent 45-degree lines. In both panels, Unexposed/Exposed designs yield blue circles, Exposed/Not Yet Exposed designs yield yellow triangles, and Exposed/Exposed Shift designs yield red squares.

8   Conclusion

Many studies run two-way fixed effects instrumental variable (TWFEIV) regressions, leveraging variation occurring from the different timing of policy adoption across units as an instrument for the treatment. In this paper, we study the causal interpretation of the TWFEIV estimator in staggered DID-IV designs. We first show that in settings with the staggered adoption of the instrument across units, the TWFEIV estimator is equal to a weighted average of all possible $2\times 2$ Wald-DID estimators arising from the three types of the DID-IV design: Unexposed/Exposed, Exposed/Not Yet Exposed, and Exposed/Exposed Shift designs. The weight assigned to each Wald-DID estimator is a function of the sample share, the variance of the instrument, and the DID estimator of the treatment in each DID-IV design.

Based on the decomposition result, we then show that in staggered DID-IV designs, the TWFEIV estimand is equal to a weighted average of all possible cohort specific local average treatment effect on the treated parameters, but some weights can be negative. The negative weight problem arises due to the bad comparisons in the first stage and reduced form regressions: we use the already exposed units as controls. The TWFEIV estimand attains its causal interpretation if the effects of the instrument on the treatment and outcome are stable over time. The resulting causal parameter is a positively weighted average of cohort specific local average treatment effect on the treated parameters.

Finally, we illustrate our findings with the setting of Miller and Segal (2019), who estimate the effect of female officers’ share on the IPH rate, exploiting the timing variation of AA introduction across U.S. counties. We first assess the underlying staggered DID-IV identification strategy implicitly imposed by Miller and Segal (2019) and confirm its validity. We then apply our DID-IV decomposition theorem to the TWFEIV estimate, and find that the estimate suffers from a substantial downward bias arising from the bad comparisons in Exposed/Exposed Shift DID-IV designs. We also decompose the difference between the two specifications and illustrate how different specifications affect the overall estimates in Miller and Segal (2019).

Overall, this paper shows a negative result on the use of TWFEIV estimators in the presence of heterogeneous treatment effects in staggered DID-IV designs with more than two periods. This paper provides simple tools to evaluate how serious that concern is in a given application. Specifically, we demonstrate that the TWFEIV estimator is not robust to the time-varying exposed effects in the first stage and reduced form regressions. Our DID-IV decomposition theorem allows empirical researchers to assess the impact of the bias term arising from the bad comparisons on their TWFEIV estimate. Recently, Miyaji (2024) developed an alternative estimation method that is robust to treatment effect heterogeneity and proposed a weighting scheme to construct various summary measures in staggered DID-IV designs. Further developing alternative approaches and diagnostic tools will be a promising area for future work, enhancing the credibility of the DID-IV design in practice.

References

  • Akerman et al. (2015) Akerman, A., Gaarder, I., Mogstad, M., 2015. The Skill Complementarity of Broadband Internet. Q. J. Econ. 130 (4), 1781–1824.
  • Angrist and Imbens (1995) Angrist, J. D., Imbens, G. W., 1995. Two-Stage Least Squares Estimation of Average Causal Effects in Models with Variable Treatment Intensity. J. Am. Stat. Assoc. 90 (430), 431–442.
  • Athey and Imbens (2022) Athey, S., Imbens, G. W., 2022. Design-based analysis in Difference-In-Differences settings with staggered adoption. J. Econom. 226 (1), 62–79.
  • Bhuller et al. (2013) Bhuller, M., Havnes, T., Leuven, E., Mogstad, M., 2013. Broadband Internet: An Information Superhighway to Sex Crime? Rev. Econ. Stud. 80 (4), 1237–1266.
  • Black et al. (2005) Black, S. E., Devereux, P. J., Salvanes, K. G., 2005. Why the Apple Doesn’t Fall Far: Understanding Intergenerational Transmission of Human Capital. Am. Econ. Rev. 95 (1), 437–449.
  • Blandhol et al. (2022) Blandhol, C., Bonney, J., Mogstad, M., Torgovitsky, A., 2022. When is TSLS Actually LATE? February.
  • Blinder (1973) Blinder, A. S., 1973. Wage Discrimination: Reduced Form and Structural Estimates. J. Hum. Resour. 8 (4), 436–455.
  • Borusyak et al. (2021) Borusyak, K., Jaravel, X., Spiess, J., 2021. Revisiting Event Study Designs: Robust and Efficient Estimation.
  • Callaway and Sant’Anna (2021) Callaway, B., Sant’Anna, P. H. C., 2021. Difference-in-Differences with multiple time periods. J. Econom. 225 (2), 200–230.
  • de Chaisemartin (2010) de Chaisemartin, 2010. A note on instrumented difference in differences. Unpublished Manuscript.
  • de Chaisemartin and D’Haultfœuille (2018) de Chaisemartin, C., D’Haultfœuille, X., 2018. Fuzzy Differences-in-Differences. Rev. Econ. Stud. 85 (2 (303)), 999–1028.
  • de Chaisemartin and D’Haultfœuille (2020) de Chaisemartin, C., D’Haultfœuille, X., 2020. Two-Way Fixed Effects Estimators with Heterogeneous Treatment Effects. Am. Econ. Rev. 110 (9), 2964–2996.
  • Duflo (2001) Duflo, E., 2001. Schooling and Labor Market Consequences of School Construction in Indonesia: Evidence from an Unusual Policy Experiment. Am. Econ. Rev. 91 (4), 795–813.
  • Field (2007) Field, E., 2007. Entitled to Work: Urban Property Rights and Labor Supply in Peru. Q. J. Econ. 122 (4), 1561–1602.
  • Goodman-Bacon (2021) Goodman-Bacon, A., 2021. Difference-in-differences with variation in treatment timing. J. Econom. 225 (2), 254–277.
  • Hudson et al. (2017) Hudson, S., Hull, P., Liebersohn, J., 2017. Interpreting Instrumented Difference-in-Differences. Available at http://www.mit.edu/~liebers/DDIV.pdf.
  • Imai and Kim (2021) Imai, K., Kim, I. S., 2021. On the Use of Two-Way Fixed Effects Regression Models for Causal Inference with Panel Data. Polit. Anal. 29 (3), 405–415.
  • Imbens and Angrist (1994) Imbens, G. W., Angrist, J. D., 1994. Identification and Estimation of Local Average Treatment Effects. Econometrica. 62 (2), 467–475.
  • Johnson and Jackson (2019) Johnson, R. C., Jackson, C. K., 2019. Reducing Inequality through Dynamic Complementarity: Evidence from Head Start and Public School Spending. Am. Econ. J. Econ. Policy. 11 (4), 310–349.
  • Kitagawa (1955) Kitagawa, E. M., 1955. Components of a Difference Between Two Rates. J. Am. Stat. Assoc. 50 (272), 1168–1194.
  • Lundborg et al. (2014) Lundborg, P., Nilsson, A., Rooth, D.-O., 2014. Parental education and offspring outcomes: Evidence from the Swedish compulsory school reform. Am. Econ. J. Appl. Econ. 6 (1), 253–278.
  • Lundborg et al. (2017) Lundborg, P., Plug, E., Rasmussen, A. W., 2017. Can Women Have Children and a Career? IV Evidence from IVF Treatments. Am. Econ. Rev. 107 (6), 1611–1637.
  • Meghir et al. (2018) Meghir, C., Palme, M., Simeonova, E., 2018. Education and mortality: Evidence from a social experiment. Am. Econ. J. Appl. Econ. 10 (2), 234–256.
  • Miller and Segal (2012) Miller, A. R., Segal, C., 2012. Does temporary affirmative action produce persistent effects? A study of black and female employment in law enforcement. Rev. Econ. Stat. 94 (4), 1107–1125.
  • Miller and Segal (2019) Miller, A. R., Segal, C., 2019. Do female officers improve law enforcement quality? Effects on crime reporting and domestic violence. Rev. Econ. Stud. 86 (5), 2220–2247.
  • Miyaji (2024) Miyaji, S., 2024. Instrumented Difference-in-Differences with heterogeneous treatment effects. Unpublished Manuscript.
  • Oaxaca (1973) Oaxaca, R., 1973. Male-Female Wage Differentials in Urban Labor Markets. Int. Econ. Rev. 14 (3), 693–709.
  • Oreopoulos (2006) Oreopoulos, P., 2006. Estimating Average and Local Average Treatment Effects of Education when Compulsory Schooling Laws Really Matter. Am. Econ. Rev. 96 (1), 152–175.
  • Słoczyński (2020) Słoczyński, T., 2020. When Should We (Not) Interpret Linear IV Estimands as LATE?
  • Sun and Abraham (2021) Sun, L., Abraham, S., 2021. Estimating dynamic treatment effects in event studies with heterogeneous treatment effects. J. Econom. 225 (2), 175–199.

Appendix A Proof of the theorem in section 2

Before we proceed to the proof of Theorem 1, we state Lemma 4 below. This lemma is due to Goodman-Bacon (2021).

Lemma 4 (Lemma 11 in Goodman-Bacon (2021)).

The sample covariance between a cohort- and time-specific variable $z_{kt}$ and a double-demeaned variable $\tilde{x}_{kt}=(x_{kt}-\bar{x}_{k})-(\bar{x}_{t}-\bar{\bar{x}})$ is equal to a sum over every pair of cohorts of the period-by-period products of differences between cohorts in $z_{kt}$ and $\tilde{x}_{kt}$:

\begin{align*}
&\sum_{k}n_{k}\frac{1}{T}\sum_{t}z_{kt}\left[(x_{kt}-\bar{x}_{k})-(\bar{x}_{t}-\bar{\bar{x}})\right]\\
&\quad=\sum_{k}\sum_{l>k}n_{l}n_{k}\frac{1}{T}\sum_{t}(z_{kt}-z_{lt})\left[(x_{kt}-\bar{x}_{k})-(x_{lt}-\bar{x}_{l})\right]. \tag{23}
\end{align*}
Proof.

See the proof of Lemma 11 in Goodman-Bacon (2021). ∎
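The identity (23) can be checked numerically. The following Python sketch is only an illustration: the dimensions `K` and `T`, the random draws, and the convention that the cohort shares `n` sum to one and that $\bar{x}_{t}$ is the share-weighted mean across cohorts are our assumptions, not part of the original statement.

```python
import numpy as np

rng = np.random.default_rng(0)
K, T = 4, 6                      # hypothetical number of cohorts and periods
n = rng.random(K)
n /= n.sum()                     # cohort shares n_k (assumed to sum to one)
z = rng.normal(size=(K, T))      # cohort-by-time variable z_{kt}
x = rng.normal(size=(K, T))      # cohort-by-time variable x_{kt}

x_k = x.mean(axis=1, keepdims=True)   # \bar{x}_k: time average within cohort
x_t = n @ x                           # \bar{x}_t: share-weighted mean across cohorts
x_bar = x_t.mean()                    # \bar{\bar{x}}: grand mean

# Left-hand side of (23): covariance with the double-demeaned variable.
x_tilde = (x - x_k) - (x_t - x_bar)
lhs = float(np.sum(n[:, None] * z * x_tilde) / T)

# Right-hand side of (23): sum over all cohort pairs (k, l) with l > k.
a = x - x_k
rhs = 0.0
for k in range(K):
    for l in range(k + 1, K):
        rhs += n[k] * n[l] * float(np.sum((z[k] - z[l]) * (a[k] - a[l]))) / T

print(abs(lhs - rhs) < 1e-12)
```

The two sides agree up to floating-point error because $\bar{x}_{t}-\bar{\bar{x}}=\sum_{l}n_{l}(x_{lt}-\bar{x}_{l})$ when the shares sum to one.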

A.1   Proof of Theorem 1

Proof.

From the Frisch-Waugh-Lovell (FWL) theorem, the TWFEIV estimator $\hat{\beta}_{IV}$ is:

\begin{align*}
\hat{\beta}_{IV}=\frac{\frac{1}{NT}\sum_{i}\sum_{t}\tilde{Z}_{i,t}Y_{i,t}}{\frac{1}{NT}\sum_{i}\sum_{t}\tilde{Z}_{i,t}D_{i,t}}, \tag{24}
\end{align*}

where $\tilde{Z}_{i,t}=(Z_{i,t}-\bar{Z}_{i})-(\bar{Z}_{t}-\bar{\bar{Z}})$ is the double-demeaned instrument.
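As an illustration of (24), the following Python sketch compares the ratio of double-demeaned sums with a just-identified IV regression that includes explicit unit and time dummies. The balanced panel, the sample sizes, and the data generating process are hypothetical choices made only for this check.

```python
import numpy as np

rng = np.random.default_rng(1)
N, T = 30, 5                                   # hypothetical balanced panel
start = rng.integers(2, T + 2, size=N)         # first exposed period; T + 1 means never exposed
periods = np.arange(1, T + 1)
Z = (periods[None, :] >= start[:, None]).astype(float)      # staggered instrument
D = (rng.random((N, T)) < 0.2 + 0.6 * Z).astype(float)      # binary treatment
Y = (rng.normal(size=(N, 1)) + rng.normal(size=(1, T))
     + 1.5 * D + rng.normal(size=(N, T)))

def double_demean(A):
    """Remove unit means and time means, add back the grand mean."""
    return A - A.mean(axis=1, keepdims=True) - A.mean(axis=0, keepdims=True) + A.mean()

# Ratio form (24).
Z_tilde = double_demean(Z)
beta_ratio = np.sum(Z_tilde * Y) / np.sum(Z_tilde * D)

# Just-identified IV with explicit dummies (first period dropped to avoid collinearity).
unit_d = np.kron(np.eye(N), np.ones((T, 1)))
time_d = np.kron(np.ones((N, 1)), np.eye(T))[:, 1:]
X = np.column_stack([D.ravel(), unit_d, time_d])    # regressors
W = np.column_stack([Z.ravel(), unit_d, time_d])    # instruments
beta_full = np.linalg.solve(W.T @ X, W.T @ Y.ravel())[0]

print(abs(beta_ratio - beta_full) < 1e-8)
```

The agreement relies on the panel being balanced, so that partialling out the two sets of dummies is the same as double demeaning.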

First, we consider the numerator of (24). In the following, we write $k(i)$ to express that unit $i$ belongs to cohort $k$. We define $\bar{R}_{k(i),t}$ to be the sample mean of the random variable $R_{i,t}$ in cohort $k$ at time $t$, and define $\bar{R}_{k(i)}$ to be the average of $\bar{R}_{k(i),t}$ over time:

\begin{align*}
\bar{R}_{k(i),t}=\frac{\sum_{i}R_{i,t}\mathbf{1}\{E_{i}=k\}}{\sum_{i}\mathbf{1}\{E_{i}=k\}}\quad\text{and}\quad\bar{R}_{k(i)}=\frac{1}{T}\sum_{t=1}^{T}\bar{R}_{k(i),t}.
\end{align*}

For the numerator of (24), by adding and subtracting $(\bar{Z}_{k(i),t}-\bar{Z}_{k(i)})$, we obtain

\begin{align*}
&\frac{1}{NT}\sum_{i}\sum_{t}\tilde{Z}_{i,t}Y_{i,t}\\
&=\frac{1}{NT}\sum_{i}\sum_{t}Y_{i,t}\left[(Z_{i,t}-\bar{Z}_{i})-(\bar{Z}_{t}-\bar{\bar{Z}})\right]\\
&=\frac{1}{NT}\sum_{i}\sum_{t}Y_{i,t}\left[\underbrace{(Z_{i,t}-\bar{Z}_{i})-(\bar{Z}_{k(i),t}-\bar{Z}_{k(i)})}_{=0}+(\bar{Z}_{k(i),t}-\bar{Z}_{k(i)})-(\bar{Z}_{t}-\bar{\bar{Z}})\right]\\
&=\frac{1}{NT}\sum_{i}\sum_{t}Y_{i,t}\left[(\bar{Z}_{k(i),t}-\bar{Z}_{k(i)})-(\bar{Z}_{t}-\bar{\bar{Z}})\right]\\
&=\sum_{k}n_{k}\frac{1}{T}\sum_{t}\bar{Y}_{k,t}\left[(\bar{Z}_{k,t}-\bar{Z}_{k})-(\bar{Z}_{t}-\bar{\bar{Z}})\right],
\end{align*}

where the third equality follows from the fact that $Z_{i,t}=\bar{Z}_{k(i),t}$ and $\bar{Z}_{i}=\bar{Z}_{k(i)}$ because all the units in cohort $k$ have the same assignment of the instrument. The fourth equality follows because the bracketed expression depends only on cohort $k$ and time $t$, so the sum over units collapses to a sum over cohorts weighted by the sample share $n_{k}$ of cohort $k$.
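The collapse from the unit level to the cohort level can be confirmed on simulated data. In the Python sketch below, the cohort dates in `cohorts`, the sample size, and the outcome equation are hypothetical; `n` holds the sample shares of the cohorts.

```python
import numpy as np

rng = np.random.default_rng(2)
N, T = 200, 6
cohorts = np.array([2, 4, 5, T + 1])        # hypothetical first exposure dates; T + 1 = never exposed
E = rng.choice(cohorts, size=N)
E[:len(cohorts)] = cohorts                  # make sure every cohort is non-empty
periods = np.arange(1, T + 1)
Z = (periods[None, :] >= E[:, None]).astype(float)
Y = rng.normal(size=(N, T)) + 0.8 * Z

# Unit-level numerator of (24).
Z_tilde = Z - Z.mean(axis=1, keepdims=True) - Z.mean(axis=0, keepdims=True) + Z.mean()
unit_level = np.sum(Z_tilde * Y) / (N * T)

# Cohort-level expression on the last line of the derivation.
n = np.array([(E == c).mean() for c in cohorts])             # sample shares n_k
Y_kt = np.array([Y[E == c].mean(axis=0) for c in cohorts])   # cohort-by-period outcome means
Z_kt = np.array([Z[E == c][0] for c in cohorts])             # instrument path, identical within cohort
Z_k = Z_kt.mean(axis=1, keepdims=True)
Z_t = Z.mean(axis=0)
cohort_level = np.sum(n[:, None] * Y_kt * ((Z_kt - Z_k) - (Z_t - Z.mean()))) / T

print(abs(unit_level - cohort_level) < 1e-12)
```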

To further develop the expression, we apply Lemma 4 with $z_{kt}=\bar{Y}_{kt}$ and $x_{kt}=\bar{Z}_{kt}$:

\begin{align*}
&\sum_{k}n_{k}\frac{1}{T}\sum_{t}\bar{Y}_{kt}\left[(\bar{Z}_{kt}-\bar{Z}_{k})-(\bar{Z}_{t}-\bar{\bar{Z}})\right]\\
&\quad=\sum_{k}\sum_{l>k}n_{l}n_{k}\frac{1}{T}\sum_{t}(\bar{Y}_{kt}-\bar{Y}_{lt})\left[(\bar{Z}_{kt}-\bar{Z}_{k})-(\bar{Z}_{lt}-\bar{Z}_{l})\right]. \tag{25}
\end{align*}

Next, we consider each type of cohort pair appearing in (25). For the never exposed cohort $U$, we have $\bar{Z}_{Ut}-\bar{Z}_{U}=0$ in every period. From this observation, for the pair $(k,U)$, we have:

\begin{align*}
&\frac{1}{T}\sum_{t}(\bar{Y}_{kt}-\bar{Y}_{Ut})\left[(\bar{Z}_{kt}-\bar{Z}_{k})-(\bar{Z}_{Ut}-\bar{Z}_{U})\right]\\
&=-\frac{1}{T}\sum_{t<k}(\bar{Y}_{kt}-\bar{Y}_{Ut})\bar{Z}_{k}+\frac{1}{T}\sum_{t\geq k}(\bar{Y}_{kt}-\bar{Y}_{Ut})(1-\bar{Z}_{k})\\
&=\left[(\bar{Y}_{kt}^{POST(k)}-\bar{Y}_{kt}^{PRE(k)})-(\bar{Y}_{Ut}^{POST(k)}-\bar{Y}_{Ut}^{PRE(k)})\right]\bar{Z}_{k}(1-\bar{Z}_{k}).
\end{align*}

By a similar argument, for the pair $(k,l)$ where $k<l<T$, we obtain

\begin{align*}
&\frac{1}{T}\sum_{t}(\bar{Y}_{kt}-\bar{Y}_{lt})\left[(\bar{Z}_{kt}-\bar{Z}_{k})-(\bar{Z}_{lt}-\bar{Z}_{l})\right]\\
&=-\frac{1}{T}\sum_{t<k}(\bar{Y}_{kt}-\bar{Y}_{lt})(\bar{Z}_{k}-\bar{Z}_{l})+\frac{1}{T}\sum_{t\in[k,l)}(\bar{Y}_{kt}-\bar{Y}_{lt})(1-\bar{Z}_{k}+\bar{Z}_{l})-\frac{1}{T}\sum_{t\geq l}(\bar{Y}_{kt}-\bar{Y}_{lt})(\bar{Z}_{k}-\bar{Z}_{l})\\
&=-\left[(\bar{Y}_{kt}^{PRE(k)}-\bar{Y}_{lt}^{PRE(k)})\right](\bar{Z}_{k}-\bar{Z}_{l})(1-\bar{Z}_{k})+\left[(\bar{Y}_{kt}^{MID(k,l)}-\bar{Y}_{lt}^{MID(k,l)})\right](\bar{Z}_{k}-\bar{Z}_{l})(1-\bar{Z}_{k}+\bar{Z}_{l})\\
&\quad-\left[(\bar{Y}_{kt}^{POST(l)}-\bar{Y}_{lt}^{POST(l)})\right](\bar{Z}_{k}-\bar{Z}_{l})\bar{Z}_{l}\\
&=\left[(\bar{Y}_{kt}^{MID(k,l)}-\bar{Y}_{kt}^{PRE(k)})-(\bar{Y}_{lt}^{MID(k,l)}-\bar{Y}_{lt}^{PRE(k)})\right](\bar{Z}_{k}-\bar{Z}_{l})(1-\bar{Z}_{k})\\
&\quad+\left[(\bar{Y}_{lt}^{POST(l)}-\bar{Y}_{lt}^{MID(k,l)})-(\bar{Y}_{kt}^{POST(l)}-\bar{Y}_{kt}^{MID(k,l)})\right](\bar{Z}_{k}-\bar{Z}_{l})\bar{Z}_{l}.
\end{align*}
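The three-window algebra for a pair of exposed cohorts can be verified directly. The Python sketch below uses arbitrary cohort mean paths `Yk` and `Yl`; the horizon `T` and the exposure dates `k` and `l` are hypothetical values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
T, k, l = 8, 3, 6                       # hypothetical horizon and first exposure dates (k < l)
t = np.arange(1, T + 1)
Yk, Yl = rng.normal(size=T), rng.normal(size=T)    # arbitrary cohort mean paths
Zk, Zl = (t >= k).astype(float), (t >= l).astype(float)
Zk_bar, Zl_bar = Zk.mean(), Zl.mean()

# Pairwise term of (25).
lhs = np.mean((Yk - Yl) * ((Zk - Zk_bar) - (Zl - Zl_bar)))

# Two DID comparisons: k versus l around date k, and l versus k around date l.
pre, mid, post = t < k, (t >= k) & (t < l), t >= l
did_k = (Yk[mid].mean() - Yk[pre].mean()) - (Yl[mid].mean() - Yl[pre].mean())
did_l = (Yl[post].mean() - Yl[mid].mean()) - (Yk[post].mean() - Yk[mid].mean())
rhs = (did_k * (Zk_bar - Zl_bar) * (1 - Zk_bar)
       + did_l * (Zk_bar - Zl_bar) * Zl_bar)

print(abs(lhs - rhs) < 1e-12)
```

The check uses only the facts that the share of periods before $k$ is $1-\bar{Z}_{k}$, the share between $k$ and $l$ is $\bar{Z}_{k}-\bar{Z}_{l}$, and the share from $l$ onward is $\bar{Z}_{l}$.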

To sum up, for the numerator of (24), we have

\begin{align*}
&\frac{1}{NT}\sum_{i}\sum_{t}\tilde{Z}_{it}Y_{it}\\
&=\sum_{k\neq U}(n_{k}+n_{u})^{2}n_{kU}(1-n_{kU})\bar{Z}_{k}(1-\bar{Z}_{k})\hat{\beta}_{kU}^{2\times 2}\\
&\quad+\sum_{k\neq U}\sum_{l>k}\Bigg[((n_{k}+n_{l})(1-\bar{Z}_{l}))^{2}n_{kl}(1-n_{kl})\left(\frac{\bar{Z}_{k}-\bar{Z}_{l}}{1-\bar{Z}_{l}}\right)\left(\frac{1-\bar{Z}_{k}}{1-\bar{Z}_{l}}\right)\hat{\beta}_{kl}^{2\times 2,k}\\
&\qquad+((n_{k}+n_{l})\bar{Z}_{k})^{2}n_{kl}(1-n_{kl})\left(\frac{\bar{Z}_{l}}{\bar{Z}_{k}}\right)\left(\frac{\bar{Z}_{k}-\bar{Z}_{l}}{\bar{Z}_{k}}\right)\hat{\beta}_{kl}^{2\times 2,l}\Bigg]\\
&=\sum_{k\neq U}(n_{k}+n_{u})^{2}\hat{V}_{kU}^{Z}\hat{\beta}_{kU}^{2\times 2}+\sum_{k\neq U}\sum_{l>k}\left[((n_{k}+n_{l})(1-\bar{Z}_{l}))^{2}\hat{V}_{kl}^{Z,k}\hat{\beta}_{kl}^{2\times 2,k}+((n_{k}+n_{l})\bar{Z}_{k})^{2}\hat{V}_{kl}^{Z,l}\hat{\beta}_{kl}^{2\times 2,l}\right]. \tag{26}
\end{align*}

Next, we consider the denominator of (24). Its structure is identical to that of the numerator, with $D_{it}$ in place of $Y_{it}$. Therefore, by the same calculations, we obtain

\begin{align*}
\frac{1}{NT}\sum_{i}\sum_{t}\tilde{Z}_{it}D_{it}&\equiv\hat{C}^{D,Z}\\
&=\sum_{k\neq U}(n_{k}+n_{u})^{2}\hat{V}_{kU}^{Z}\hat{D}_{kU}^{2\times 2}+\sum_{k\neq U}\sum_{l>k}\left[((n_{k}+n_{l})(1-\bar{Z}_{l}))^{2}\hat{V}_{kl}^{Z,k}\hat{D}_{kl}^{2\times 2,k}+((n_{k}+n_{l})\bar{Z}_{k})^{2}\hat{V}_{kl}^{Z,l}\hat{D}_{kl}^{2\times 2,l}\right]. \tag{27}
\end{align*}

Combining (26) with (27), we obtain

\begin{align*}
\hat{\beta}_{IV}
&=\frac{\sum_{k\neq U}(n_{k}+n_{u})^{2}\hat{V}_{kU}^{Z}\hat{\beta}_{kU}^{2\times 2}+\sum_{k\neq U}\sum_{l>k}\left[((n_{k}+n_{l})(1-\bar{Z}_{l}))^{2}\hat{V}_{kl}^{Z,k}\hat{\beta}_{kl}^{2\times 2,k}+((n_{k}+n_{l})\bar{Z}_{k})^{2}\hat{V}_{kl}^{Z,l}\hat{\beta}_{kl}^{2\times 2,l}\right]}{\sum_{k\neq U}(n_{k}+n_{u})^{2}\hat{V}_{kU}^{Z}\hat{D}_{kU}^{2\times 2}+\sum_{k\neq U}\sum_{l>k}\left[((n_{k}+n_{l})(1-\bar{Z}_{l}))^{2}\hat{V}_{kl}^{Z,k}\hat{D}_{kl}^{2\times 2,k}+((n_{k}+n_{l})\bar{Z}_{k})^{2}\hat{V}_{kl}^{Z,l}\hat{D}_{kl}^{2\times 2,l}\right]}\\
&=\frac{\sum_{k\neq U}(n_{k}+n_{u})^{2}\hat{V}_{kU}^{Z}\hat{\beta}_{kU}^{2\times 2}+\sum_{k\neq U}\sum_{l>k}\left[((n_{k}+n_{l})(1-\bar{Z}_{l}))^{2}\hat{V}_{kl}^{Z,k}\hat{\beta}_{kl}^{2\times 2,k}+((n_{k}+n_{l})\bar{Z}_{k})^{2}\hat{V}_{kl}^{Z,l}\hat{\beta}_{kl}^{2\times 2,l}\right]}{\hat{C}^{D,Z}}\\
&=\sum_{k\neq U}\hat{w}_{IV,kU}\hat{\beta}_{IV,kU}^{2\times 2}+\sum_{k\neq U}\sum_{l>k}\left[\hat{w}_{IV,kl}^{k}\hat{\beta}_{IV,kl}^{2\times 2,k}+\hat{w}_{IV,kl}^{l}\hat{\beta}_{IV,kl}^{2\times 2,l}\right],
\end{align*}

where each Wald-DID estimator is the ratio of the corresponding DID estimators, for example $\hat{\beta}_{IV,kU}^{2\times 2}=\hat{\beta}_{kU}^{2\times 2}/\hat{D}_{kU}^{2\times 2}$, and the weights are:

\begin{align*}
\hat{w}_{IV,kU}&=\frac{(n_{k}+n_{u})^{2}\hat{V}_{kU}^{Z}\hat{D}_{kU}^{2\times 2}}{\hat{C}^{D,Z}},\\
\hat{w}_{IV,kl}^{k}&=\frac{((n_{k}+n_{l})(1-\bar{Z}_{l}))^{2}\hat{V}_{kl}^{Z,k}\hat{D}_{kl}^{2\times 2,k}}{\hat{C}^{D,Z}},\\
\hat{w}_{IV,kl}^{l}&=\frac{((n_{k}+n_{l})\bar{Z}_{k})^{2}\hat{V}_{kl}^{Z,l}\hat{D}_{kl}^{2\times 2,l}}{\hat{C}^{D,Z}}.
\end{align*}

We note that $\hat{C}^{D,Z}$ is the sum of the numerators of the weights, as one can see in (27). This implies that the weights sum to one:

\begin{align*}
\sum_{k\neq U}\hat{w}_{IV,kU}+\sum_{k\neq U}\sum_{l>k}\left[\hat{w}_{IV,kl}^{k}+\hat{w}_{IV,kl}^{l}\right]=1.
\end{align*}

This completes the proof. ∎
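The decomposition can be reproduced on simulated data. The Python sketch below is a minimal illustration under assumed inputs: the cohort dates, the sample size, and the data generating process are hypothetical, and each pairwise term is built from the simplified scale factors $n_{k}n_{l}(\bar{Z}_{k}-\bar{Z}_{l})(1-\bar{Z}_{k})$ and $n_{k}n_{l}(\bar{Z}_{k}-\bar{Z}_{l})\bar{Z}_{l}$ that appear in the proof, with $\bar{Z}_{U}=0$ for the never exposed cohort.

```python
import numpy as np

rng = np.random.default_rng(4)
N, T = 300, 6
NEVER = T + 1                                # code for the never exposed cohort U
cohorts = [2, 4, 5, NEVER]                   # hypothetical first exposure dates
E = rng.choice(cohorts, size=N)
E[:len(cohorts)] = cohorts                   # make sure every cohort is non-empty
t = np.arange(1, T + 1)
Z = (t[None, :] >= E[:, None]).astype(float)
D = (rng.random((N, T)) < 0.1 + 0.7 * Z).astype(float)
Y = rng.normal(size=(N, 1)) + 0.3 * t[None, :] + 2.0 * D + rng.normal(size=(N, T))

def dd(A):
    """Double demeaning for a balanced panel."""
    return A - A.mean(axis=1, keepdims=True) - A.mean(axis=0, keepdims=True) + A.mean()

beta_twfeiv = np.sum(dd(Z) * Y) / np.sum(dd(Z) * D)

share = {c: np.mean(E == c) for c in cohorts}
Ybar = {c: Y[E == c].mean(axis=0) for c in cohorts}
Dbar = {c: D[E == c].mean(axis=0) for c in cohorts}
zbar = {c: np.mean(t >= c) for c in cohorts}

def did(M, a, b, after, before):
    """DID of cohort a relative to cohort b between two time windows."""
    return ((M[a][after].mean() - M[a][before].mean())
            - (M[b][after].mean() - M[b][before].mean()))

terms = []                                   # (scale, DID in D, DID in Y) per 2x2 comparison
for i, k in enumerate(cohorts):
    if k == NEVER:
        continue
    for l in cohorts[i + 1:]:
        pre, mid, post = t < k, (t >= k) & (t < l), t >= l
        s = share[k] * share[l] * (zbar[k] - zbar[l])
        terms.append((s * (1 - zbar[k]), did(Dbar, k, l, mid, pre), did(Ybar, k, l, mid, pre)))
        if l != NEVER:                       # later cohort as exposed group, earlier as control
            terms.append((s * zbar[l], did(Dbar, l, k, post, mid), did(Ybar, l, k, post, mid)))

C_hat = sum(s * d for s, d, _ in terms)
weights = np.array([s * d / C_hat for s, d, _ in terms])
wald = np.array([y / d for _, d, y in terms])          # Wald-DID estimators
beta_decomp = float(weights @ wald)

print(abs(beta_decomp - beta_twfeiv) < 1e-8, abs(weights.sum() - 1.0) < 1e-10)
```

Individual weights may be negative whenever a sample first-stage DID is negative; the sketch does not rule this out.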

Appendix B Proofs of the theorem and lemma in section 4.

In this section, we first prove Lemma 1 as a preparation.

B.1   Proof of Lemma 1

Proof.

As shown in the proof of Theorem 1, the denominator of the TWFEIV estimator can be written as:

\begin{align*}
\hat{C}^{D,Z}=\sum_{k\neq U}(n_{k}+n_{u})^{2}\hat{V}_{kU}^{Z}\hat{D}_{kU}^{2\times 2}+\sum_{k\neq U}\sum_{l>k}\left[((n_{k}+n_{l})(1-\bar{Z}_{l}))^{2}\hat{V}_{kl}^{Z,k}\hat{D}_{kl}^{2\times 2,k}+((n_{k}+n_{l})\bar{Z}_{k})^{2}\hat{V}_{kl}^{Z,l}\hat{D}_{kl}^{2\times 2,l}\right].
\end{align*}

We fix $T$ and consider $N\rightarrow\infty$. We first derive the probability limit of $(n_{k}+n_{u})^{2}\hat{V}_{kU}^{Z}\hat{D}_{kU}^{2\times 2}$. By definition, we can rewrite this term as follows:

\begin{align*}
(n_{k}+n_{u})^{2}\hat{V}_{kU}^{Z}\hat{D}_{kU}^{2\times 2}=n_{k}n_{u}\bar{Z}_{k}(1-\bar{Z}_{k})\cdot\left[\left(\bar{D}_{k}^{POST(k)}-\bar{D}_{k}^{PRE(k)}\right)-\left(\bar{D}_{U}^{POST(k)}-\bar{D}_{U}^{PRE(k)}\right)\right].
\end{align*}

By the law of large numbers (LLN), as $N\rightarrow\infty$, we obtain

\begin{align*}
&\left(\bar{D}_{k}^{POST(k)}-\bar{D}_{k}^{PRE(k)}\right)-\left(\bar{D}_{U}^{POST(k)}-\bar{D}_{U}^{PRE(k)}\right)\\
&\xrightarrow{p}\frac{1}{T-(k-1)}\left[\sum_{t=k}^{T}E[D_{it}|E_{i}=k]-E[D_{it}|E_{i}=U]\right]-\frac{1}{k-1}\left[\sum_{t=1}^{k-1}E[D_{it}|E_{i}=k]-E[D_{it}|E_{i}=U]\right]\\
&=\frac{1}{T-(k-1)}\left[\sum_{t=k}^{T}E[D_{it}^{k}-D_{it}^{\infty}|E_{i}=k]\right]\\
&\quad+\frac{1}{T-(k-1)}\left[\sum_{t=k}^{T}E[D_{it}^{\infty}|E_{i}=k]-\sum_{t=k}^{T}E[D_{it}^{\infty}|E_{i}=U]\right]\\
&\quad-\frac{1}{k-1}\left[\sum_{t=1}^{k-1}E[D_{it}^{\infty}|E_{i}=k]-E[D_{it}^{\infty}|E_{i}=U]\right]\\
&=\frac{1}{T-(k-1)}\left[\sum_{t=k}^{T}E[D_{it}^{k}-D_{it}^{\infty}|E_{i}=k]\right]\\
&=CAET_{k}^{1}(POST(k)). \tag{28}
\end{align*}

The second equality follows from simple algebra and Assumption 5 (No anticipation for the first stage). The third equality follows from Assumption 6 (Parallel trend assumption in the treatment).

Next, we consider the probability limit of $n_{k}n_{u}\bar{Z}_{k}(1-\bar{Z}_{k})$. By the LLN and Slutsky's theorem, we obtain

\begin{align*}
n_{k}n_{u}\bar{Z}_{k}(1-\bar{Z}_{k})\xrightarrow{p}Pr(E_{i}=k)Pr(E_{i}=U)\frac{T-(k-1)}{T}\frac{k-1}{T}. \tag{29}
\end{align*}

Combining (28) with (29), by Slutsky's theorem, we have

\begin{align*}
(n_{k}+n_{u})^{2}\hat{V}_{kU}^{Z}\hat{D}_{kU}^{2\times 2}&\xrightarrow{p}Pr(E_{i}=k)Pr(E_{i}=U)\frac{T-(k-1)}{T}\frac{k-1}{T}CAET_{k}^{1}(POST(k))\\
&=w_{kU}CAET_{k}^{1}(POST(k)), \tag{30}
\end{align*}

where the weight $w_{kU}$ is:

\begin{align*}
w_{kU}=Pr(E_{i}=k)Pr(E_{i}=U)\frac{T-(k-1)}{T}\frac{k-1}{T}.
\end{align*}

By the same calculations, we also have

\begin{align*}
((n_{k}+n_{l})(1-\bar{Z}_{l}))^{2}\hat{V}_{kl}^{Z,k}\hat{D}_{kl}^{2\times 2,k}\xrightarrow{p}w_{kl}^{k}CAET_{k}^{1}(MID(k,l)), \tag{31}
\end{align*}

where the weight $w_{kl}^{k}$ is:

\begin{align*}
w_{kl}^{k}=Pr(E_{i}=k)Pr(E_{i}=l)\frac{k-1}{T}\frac{l-k}{T}.
\end{align*}

Next, we consider the probability limit of $((n_{k}+n_{l})\bar{Z}_{k})^{2}\hat{V}_{kl}^{Z,l}\hat{D}_{kl}^{2\times 2,l}$. By definition, we have

\begin{align*}
((n_{k}+n_{l})\bar{Z}_{k})^{2}\hat{V}_{kl}^{Z,l}\hat{D}_{kl}^{2\times 2,l}=n_{k}n_{l}(\bar{Z}_{k}-\bar{Z}_{l})\bar{Z}_{l}\cdot\left[\left(\bar{D}_{l}^{POST(l)}-\bar{D}_{l}^{MID(k,l)}\right)-\left(\bar{D}_{k}^{POST(l)}-\bar{D}_{k}^{MID(k,l)}\right)\right].
\end{align*}

By the law of large numbers (LLN), as $N\rightarrow\infty$, we have

\begin{align*}
&\left(\bar{D}_{l}^{POST(l)}-\bar{D}_{l}^{MID(k,l)}\right)-\left(\bar{D}_{k}^{POST(l)}-\bar{D}_{k}^{MID(k,l)}\right)\\
&\xrightarrow{p}\frac{1}{T-(l-1)}\left[\sum_{t=l}^{T}E[D_{it}|E_{i}=l]-E[D_{it}|E_{i}=k]\right]-\frac{1}{l-k}\left[\sum_{t=k}^{l-1}E[D_{it}|E_{i}=l]-E[D_{it}|E_{i}=k]\right]\\
&=\frac{1}{T-(l-1)}\left[\sum_{t=l}^{T}E[D_{it}^{l}-D_{it}^{\infty}|E_{i}=l]\right]-\frac{1}{T-(l-1)}\left[\sum_{t=l}^{T}E[D_{it}^{k}-D_{it}^{\infty}|E_{i}=k]\right]\\
&\quad+\frac{1}{T-(l-1)}\left[\sum_{t=l}^{T}E[D_{it}^{\infty}|E_{i}=l]-\sum_{t=l}^{T}E[D_{it}^{\infty}|E_{i}=k]\right]\\
&\quad+\frac{1}{l-k}\left[\sum_{t=k}^{l-1}E[D_{it}^{k}|E_{i}=k]-E[D_{it}^{\infty}|E_{i}=k]\right]\\
&\quad+\frac{1}{l-k}\left[\sum_{t=k}^{l-1}E[D_{it}^{\infty}|E_{i}=k]-E[D_{it}^{\infty}|E_{i}=l]\right]\\
&=\frac{1}{T-(l-1)}\left[\sum_{t=l}^{T}E[D_{it}^{l}-D_{it}^{\infty}|E_{i}=l]\right]-\frac{1}{T-(l-1)}\left[\sum_{t=l}^{T}E[D_{it}^{k}-D_{it}^{\infty}|E_{i}=k]\right]\\
&\quad+\frac{1}{l-k}\left[\sum_{t=k}^{l-1}E[D_{it}^{k}|E_{i}=k]-E[D_{it}^{\infty}|E_{i}=k]\right]\\
&=CAET_{l}^{1}(POST(l))-CAET_{k}^{1}(POST(l))+CAET_{k}^{1}(MID(k,l)). \tag{32}
\end{align*}

The second equality follows from simple algebra and Assumption 5. The third equality follows from Assumption.

Note that the LLN and Slutsky's theorem imply

\begin{align*}
n_{k}n_{l}(\bar{Z}_{k}-\bar{Z}_{l})\bar{Z}_{l}\xrightarrow{p}Pr(E_{i}=k)Pr(E_{i}=l)\frac{T-(l-1)}{T}\frac{l-k}{T}. \tag{33}
\end{align*}

Combining (32) with (33), we obtain

\begin{align*}
&((n_{k}+n_{l})\bar{Z}_{k})^{2}\hat{V}_{kl}^{Z,l}\hat{D}_{kl}^{2\times 2,l}\\
&\xrightarrow{p}w_{kl}^{l}\left[CAET_{l}^{1}(POST(l))-CAET_{k}^{1}(POST(l))+CAET_{k}^{1}(MID(k,l))\right], \tag{34}
\end{align*}

where the weight $w_{kl}^{l}$ is:

\begin{align*}
w_{kl}^{l}=Pr(E_{i}=k)Pr(E_{i}=l)\frac{T-(l-1)}{T}\frac{l-k}{T}.
\end{align*}

To sum up, by combining (30), (31), and (34), we obtain

\begin{align*}
\hat{C}^{D,Z}&=\sum_{k\neq U}(n_{k}+n_{u})^{2}\hat{V}_{kU}^{Z}\hat{D}_{kU}^{2\times 2}+\sum_{k\neq U}\sum_{l>k}\left[((n_{k}+n_{l})(1-\bar{Z}_{l}))^{2}\hat{V}_{kl}^{Z,k}\hat{D}_{kl}^{2\times 2,k}+((n_{k}+n_{l})\bar{Z}_{k})^{2}\hat{V}_{kl}^{Z,l}\hat{D}_{kl}^{2\times 2,l}\right]\\
&\xrightarrow{p}WCAET-\Delta CAET^{1},
\end{align*}

where we define

\begin{align*}
WCAET&\equiv\sum_{k\neq U}w_{kU}CAET_{k}^{1}(POST(k))+\sum_{k\neq U}\sum_{l>k}\left[w_{kl}^{k}CAET_{k}^{1}(MID(k,l))+w_{kl}^{l}CAET_{l}^{1}(POST(l))\right],\\
\Delta CAET^{1}&\equiv\sum_{k\neq U}\sum_{l>k}w_{kl}^{l}\left[CAET_{k}^{1}(POST(l))-CAET_{k}^{1}(MID(k,l))\right],
\end{align*}

and the weights are:

\begin{align*}
w_{kU}&=Pr(E_{i}=k)Pr(E_{i}=U)\frac{T-(k-1)}{T}\frac{k-1}{T}, \tag{35}\\
w_{kl}^{k}&=Pr(E_{i}=k)Pr(E_{i}=l)\frac{k-1}{T}\frac{l-k}{T}, \tag{36}\\
w_{kl}^{l}&=Pr(E_{i}=k)Pr(E_{i}=l)\frac{T-(l-1)}{T}\frac{l-k}{T}. \tag{37}
\end{align*}

This completes the proof. ∎
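For concreteness, the weights (35)-(37) can be evaluated for a small design. The Python sketch below uses exact rational arithmetic; the horizon `T = 6` and the equal cohort probabilities in `prob` are hypothetical values, and `z_bar` is a helper introduced here to recover the share of exposed periods $\bar{Z}_{k}=(T-(k-1))/T$.

```python
from fractions import Fraction

T = 6
# Hypothetical cohort probabilities Pr(E_i = e); "U" is the never exposed cohort.
prob = {2: Fraction(1, 4), 4: Fraction(1, 4), 5: Fraction(1, 4), "U": Fraction(1, 4)}

def w_kU(k):
    """Weight (35) for the comparison of cohort k with the never exposed cohort."""
    return prob[k] * prob["U"] * Fraction(T - (k - 1), T) * Fraction(k - 1, T)

def w_kl_k(k, l):
    """Weight (36): cohort k as exposed group, cohort l as control, before date l."""
    return prob[k] * prob[l] * Fraction(k - 1, T) * Fraction(l - k, T)

def w_kl_l(k, l):
    """Weight (37): cohort l as exposed group, already exposed cohort k as control."""
    return prob[k] * prob[l] * Fraction(T - (l - 1), T) * Fraction(l - k, T)

def z_bar(k):
    """Share of periods in which a cohort first exposed at date k is exposed."""
    return Fraction(sum(1 for t in range(1, T + 1) if t >= k), T)

print(w_kU(2), w_kl_k(2, 4), w_kl_l(2, 4))   # 5/576 1/288 1/96
```

The timing factors are the limits of $\bar{Z}_{k}(1-\bar{Z}_{k})$, $(\bar{Z}_{k}-\bar{Z}_{l})(1-\bar{Z}_{k})$, and $(\bar{Z}_{k}-\bar{Z}_{l})\bar{Z}_{l}$, so comparisons involving cohorts exposed near the middle of the panel receive more weight.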

B.2   Preparation for the proof of Theorem 2.

Before we present the proof of Theorem 2, we show the following lemma.

Lemma 5.

Suppose Assumptions 1-7 hold. If the treatment is binary, for all $k,l\in\{1,\dots,T\}$ with $k\leq l$, we have

\begin{align*}
\left[\sum_{t=k}^{l}E[Y_{i,t}(D_{i,t}^{k})-Y_{i,t}(D_{i,t}^{\infty})|E_{i}=k]\right]&=\sum_{t=k}^{l}E[D_{i,t}^{k}-D_{i,t}^{\infty}|E_{i}=k]\cdot E[Y_{i,t}(1)-Y_{i,t}(0)|E_{i}=k,CM_{k,t}]\\
&\equiv\sum_{t=k}^{l}CAET^{1}_{k,t}\cdot CLATT_{k,t}.
\end{align*}
Proof.

Because we assume a binary treatment, we have

\begin{align*}
\left[\sum_{t=k}^{l}E[Y_{i,t}(D_{i,t}^{k})-Y_{i,t}(D_{i,t}^{\infty})|E_{i}=k]\right]&=\left[\sum_{t=k}^{l}E[(D_{i,t}^{k}-D_{i,t}^{\infty})\cdot(Y_{i,t}(1)-Y_{i,t}(0))|E_{i}=k]\right]\\
&=\sum_{t=k}^{l}E[D_{i,t}^{k}-D_{i,t}^{\infty}|E_{i}=k]\cdot E[Y_{i,t}(1)-Y_{i,t}(0)|E_{i}=k,CM_{k,t}],
\end{align*}

where the first equality holds from Assumptions 1-3 and the second equality holds from Assumption 4. ∎
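The factorization in Lemma 5 has an exact sample analogue for a single period. The Python sketch below assumes that exposure can only move units into treatment, that is $D^{k}\geq D^{\infty}$ for every unit; this monotonicity-type restriction, the sample size, and the distributions of the potential outcomes are our illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 1000
D_inf = (rng.random(n) < 0.3).astype(int)                    # treatment when never exposed
D_k = np.maximum(D_inf, (rng.random(n) < 0.5).astype(int))   # treatment when exposed, D_k >= D_inf
Y0 = rng.normal(size=n)                                      # untreated potential outcome
Y1 = Y0 + rng.normal(loc=1.0, size=n)                        # treated outcome, heterogeneous effects

# Mean effect of exposure on the outcome, E[Y(D^k) - Y(D^inf)].
lhs = np.mean((D_k * Y1 + (1 - D_k) * Y0) - (D_inf * Y1 + (1 - D_inf) * Y0))

# First-stage effect times the mean treatment effect among units whose treatment changes.
compliers = D_k > D_inf
rhs = np.mean(D_k - D_inf) * np.mean((Y1 - Y0)[compliers])

print(abs(lhs - rhs) < 1e-12)
```

Without the restriction $D^{k}\geq D^{\infty}$, units moving out of treatment would enter with a negative sign and the product form would no longer hold.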

B.3   Proof of Theorem 2.

Proof.

We fix $T$ and consider $N\rightarrow\infty$. We first derive the probability limit of $\hat{w}_{IV,kU}\hat{\beta}_{IV,kU}^{2\times 2}$. We note that $\hat{w}_{IV,kU}\hat{\beta}_{IV,kU}^{2\times 2}$ is written as follows:

\begin{align*}
\hat{w}_{IV,kU}\hat{\beta}_{IV,kU}^{2\times 2}=\frac{(n_{k}+n_{u})^{2}\hat{V}_{kU}^{Z}\hat{D}_{kU}^{2\times 2}}{\hat{C}^{D,Z}}\cdot\frac{\hat{\beta}_{kU}^{2\times 2}}{\hat{D}_{kU}^{2\times 2}}.
\end{align*}

We first consider the probability limit of $\hat{\beta}_{kU}^{2\times 2}/\hat{D}_{kU}^{2\times 2}$. Recall that we have already derived the probability limit of the denominator in the proof of Lemma 1:

\begin{align*}
\hat{D}_{kU}^{2\times 2}\xrightarrow{p}\frac{1}{T-(k-1)}\sum_{t=k}^{T}CAET^{1}_{k,t}. \tag{38}
\end{align*}

We consider the numerator $\hat{\beta}_{kU}^{2\times 2}$. By the law of large numbers (LLN), as $N\rightarrow\infty$, we obtain

\begin{align*}
&\left(\bar{Y}_{k}^{POST(k)}-\bar{Y}_{k}^{PRE(k)}\right)-\left(\bar{Y}_{U}^{POST(k)}-\bar{Y}_{U}^{PRE(k)}\right)\\
&\xrightarrow{p}\frac{1}{T-(k-1)}\left[\sum_{t=k}^{T}E[Y_{i,t}|E_{i}=k]-E[Y_{i,t}|E_{i}=U]\right]-\frac{1}{k-1}\left[\sum_{t=1}^{k-1}E[Y_{i,t}|E_{i}=k]-E[Y_{i,t}|E_{i}=U]\right]\\
&=\frac{1}{T-(k-1)}\left[\sum_{t=k}^{T}E[Y_{i,t}(D_{i,t}^{k})-Y_{i,t}(D_{i,t}^{\infty})|E_{i}=k]\right]\\
&\quad+\frac{1}{T-(k-1)}\left[\sum_{t=k}^{T}E[Y_{i,t}(D_{i,t}^{\infty})|E_{i}=k]-\sum_{t=k}^{T}E[Y_{i,t}(D_{i,t}^{\infty})|E_{i}=U]\right]\\
&\quad-\frac{1}{k-1}\left[\sum_{t=1}^{k-1}E[Y_{i,t}(D_{i,t}^{\infty})|E_{i}=k]-E[Y_{i,t}(D_{i,t}^{\infty})|E_{i}=U]\right]\\
&=\frac{1}{T-(k-1)}\left[\sum_{t=k}^{T}E[Y_{i,t}(D_{i,t}^{k})-Y_{i,t}(D_{i,t}^{\infty})|E_{i}=k]\right]\\
&=\frac{1}{T-(k-1)}\sum_{t=k}^{T}CAET^{1}_{k,t}\cdot CLATT_{k,t}. \tag{39}
\end{align*}

The first equality follows from simple manipulation, Assumption 3 (Exclusion restriction in multiple time periods), and Assumption 5. The second equality follows from Assumption 7 (Parallel trend assumption in the outcome). The final equality follows from Lemma 5.

Combining (38) with (39), we obtain

\begin{align*}
\frac{\hat{\beta}_{kU}^{2\times 2}}{\hat{D}_{kU}^{2\times 2}}&\xrightarrow{p}\sum_{t=k}^{T}\frac{CAET^{1}_{k,t}}{\sum_{t=k}^{T}CAET^{1}_{k,t}}\cdot CLATT_{k,t}\\
&=CLATT^{CM}_{k}(POST(k)). \tag{40}
\end{align*}
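The limit in (40) is a weighted average of the period-specific $CLATT_{k,t}$ with weights proportional to $CAET^{1}_{k,t}$. The Python sketch below illustrates this with three post-exposure periods; the numerical values in `caet` and `clatt` are hypothetical.

```python
import numpy as np

caet = np.array([0.2, 0.3, 0.5])     # hypothetical CAET^1_{k,t} for t in POST(k)
clatt = np.array([1.0, 2.0, 4.0])    # hypothetical CLATT_{k,t} for the same periods

numerator = np.mean(caet * clatt)    # limit of the outcome DID, as in (39)
denominator = np.mean(caet)          # limit of the treatment DID, as in (38)
ratio = numerator / denominator      # limit of the Wald-DID estimator

weighted = np.sum(caet / caet.sum() * clatt)   # CAET-weighted average in (40)
equal_weighted = clatt.mean()                  # simple average over periods

print(round(ratio, 6), round(weighted, 6), round(equal_weighted, 6))
```

Here the ratio and the weighted average both equal 2.8, whereas the simple average over periods is about 2.33: periods with a larger first-stage effect count more.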

Next, we consider the probability limit of $(n_{k}+n_{u})^{2}\hat{V}_{kU}^{Z}\hat{D}_{kU}^{2\times 2}/\hat{C}^{D,Z}$. By the LLN and Slutsky's theorem, we obtain

\begin{align*}
\frac{(n_{k}+n_{u})^{2}\hat{V}_{kU}^{Z}\hat{D}_{kU}^{2\times 2}}{\hat{C}^{D,Z}}\xrightarrow{p}\frac{Pr(E_{i}=k)Pr(E_{i}=U)}{C^{D,Z}}\frac{T-(k-1)}{T}\frac{k-1}{T}CAET^{1}_{k}(POST(k)). \tag{41}
\end{align*}

Here $C^{D,Z}$ is the probability limit of $\hat{C}^{D,Z}$, whose explicit expression is derived in Lemma 1.

Combining (40) with (41), by Slutsky's theorem, we have

\begin{align*}
\hat{w}_{IV,kU}\hat{\beta}_{IV,kU}^{2\times 2}\xrightarrow{p}w_{IV,kU}CLATT^{CM}_{k}(POST(k)), \tag{42}
\end{align*}

where the weight $w_{IV,kU}$ is:

\begin{align*}
w_{IV,kU}=\frac{Pr(E_{i}=k)Pr(E_{i}=U)}{C^{D,Z}}\frac{T-(k-1)}{T}\frac{k-1}{T}\cdot CAET^{1}_{k}(POST(k)).
\end{align*}

By the same argument, we also obtain

\begin{align*}
\hat{w}_{IV,kl}^{k}\hat{\beta}_{IV,kl}^{2\times 2,k}\xrightarrow{p}w_{IV,kl}^{k}CLATT^{CM}_{k}(MID(k,l)), \tag{43}
\end{align*}

where the weight $w_{IV,kl}^{k}$ is:

\begin{align*}
w_{IV,kl}^{k}=\frac{Pr(E_{i}=k)Pr(E_{i}=l)}{C^{D,Z}}\frac{k-1}{T}\frac{l-k}{T}\cdot CAET_{k}^{1}(MID(k,l)).
\end{align*}

Next, we derive the probability limit of $\hat{w}_{IV,kl}^{l}\hat{\beta}_{IV,kl}^{2\times 2,l}$. Recall that $\hat{w}_{IV,kl}^{l}\hat{\beta}_{IV,kl}^{2\times 2,l}$ is:

\begin{align*}
\hat{w}_{IV,kl}^{l}\hat{\beta}_{IV,kl}^{2\times 2,l}=\frac{((n_{k}+n_{l})\bar{Z}_{k})^{2}\hat{V}_{kl}^{Z,l}\hat{D}_{kl}^{2\times 2,l}}{\hat{C}^{D,Z}}\cdot\frac{\hat{\beta}_{kl}^{2\times 2,l}}{\hat{D}_{kl}^{2\times 2,l}}.
\end{align*}

First note that in the proof of Lemma 1, we have already derived the probability limit of $\hat{D}_{kl}^{2\times 2,l}$:

\begin{align*}
\hat{D}_{kl}^{2\times 2,l}&\xrightarrow{p}CAET_{l}^{1}(POST(l))-\left[CAET_{k}^{1}(POST(l))-CAET_{k}^{1}(MID(k,l))\right] \tag{44}\\
&\equiv D_{kl}^{2\times 2,l}.
\end{align*}

Here, to ease the notation, we define $D_{kl}^{2\times 2,l}$ to be the probability limit of $\hat{D}_{kl}^{2\times 2,l}$.

Next, we consider the probability limit of $\hat{\beta}_{kl}^{2\times 2,l}$.

By the law of large numbers (LLN), as $N\rightarrow\infty$, we have

\begin{align*}
\hat{\beta}_{kl}^{2\times 2,l}&=\left(\bar{Y}_{l}^{POST(l)}-\bar{Y}_{l}^{MID(k,l)}\right)-\left(\bar{Y}_{k}^{POST(l)}-\bar{Y}_{k}^{MID(k,l)}\right)\\
&\xrightarrow{p}\frac{1}{T-(l-1)}\left[\sum_{t=l}^{T}E[Y_{i,t}|E_{i}=l]-E[Y_{i,t}|E_{i}=k]\right]-\frac{1}{l-k}\left[\sum_{t=k}^{l-1}E[Y_{i,t}|E_{i}=l]-E[Y_{i,t}|E_{i}=k]\right]\\
&=\frac{1}{T-(l-1)}\left[\sum_{t=l}^{T}E[Y_{i,t}(D_{i,t}^{l})-Y_{i,t}(D_{i,t}^{\infty})|E_{i}=l]\right]-\frac{1}{T-(l-1)}\left[\sum_{t=l}^{T}E[Y_{i,t}(D_{i,t}^{k})-Y_{i,t}(D_{i,t}^{\infty})|E_{i}=k]\right]\\
&\quad+\frac{1}{T-(l-1)}\left[\sum_{t=l}^{T}E[Y_{i,t}(D_{i,t}^{\infty})|E_{i}=l]-\sum_{t=l}^{T}E[Y_{i,t}(D_{i,t}^{\infty})|E_{i}=k]\right]\\
&\quad+\frac{1}{l-k}\left[\sum_{t=k}^{l-1}E[Y_{i,t}(D_{i,t}^{k})|E_{i}=k]-E[Y_{i,t}(D_{i,t}^{\infty})|E_{i}=k]\right]\\
&\quad+\frac{1}{l-k}\left[\sum_{t=k}^{l-1}E[Y_{i,t}(D_{i,t}^{\infty})|E_{i}=k]-E[Y_{i,t}(D_{i,t}^{\infty})|E_{i}=l]\right]\\
&=\frac{1}{T-(l-1)}\left[\sum_{t=l}^{T}E[Y_{i,t}(D_{i,t}^{l})-Y_{i,t}(D_{i,t}^{\infty})|E_{i}=l]\right]-\frac{1}{T-(l-1)}\left[\sum_{t=l}^{T}E[Y_{i,t}(D_{i,t}^{k})-Y_{i,t}(D_{i,t}^{\infty})|E_{i}=k]\right]\\
&\quad+\frac{1}{l-k}\left[\sum_{t=k}^{l-1}E[Y_{i,t}(D_{i,t}^{k})|E_{i}=k]-E[Y_{i,t}(D_{i,t}^{\infty})|E_{i}=k]\right]\\
&=\frac{1}{T-(l-1)}\sum_{t=l}^{T}CAET_{l,t}^{1}\cdot CLATT_{l,t}\\
&\quad-\left[\frac{1}{T-(l-1)}\sum_{t=l}^{T}CAET_{k,t}^{1}\cdot CLATT_{k,t}-\frac{1}{l-k}\sum_{t=k}^{l-1}CAET_{k,t}^{1}\cdot CLATT_{k,t}\right]. \tag{45}
\end{align*}

The first equality follows from simple algebra, Assumption 3, and Assumption 5. The second equality follows from Assumption 7. The final equality follows from Lemma 5.

Note that the LLN and Slutsky's theorem yield

\begin{align*}
\frac{((n_{k}+n_{l})\bar{Z}_{k})^{2}\hat{V}_{kl}^{Z,l}\hat{D}_{kl}^{2\times 2,l}}{\hat{C}^{D,Z}}\xrightarrow{p}\frac{Pr(E_{i}=k)Pr(E_{i}=l)}{C^{D,Z}}\frac{T-(l-1)}{T}\frac{l-k}{T}\cdot D_{kl}^{2\times 2,l}. \tag{46}
\end{align*}

Combining (44), (45), and (46), we obtain

\begin{align*}
\hat{w}_{IV,kl}^{l}\hat{\beta}_{IV,kl}^{2\times 2,l}&\xrightarrow{p}\sigma_{IV,kl}^{l}\cdot\Bigg\{\frac{1}{T-(l-1)}\sum_{t=l}^{T}CAET_{l,t}^{1}\cdot CLATT_{l,t}\\
&\quad-\left[\frac{1}{T-(l-1)}\sum_{t=l}^{T}CAET_{k,t}^{1}\cdot CLATT_{k,t}-\frac{1}{l-k}\sum_{t=k}^{l-1}CAET_{k,t}^{1}\cdot CLATT_{k,t}\right]\Bigg\}\\
&=\sigma_{IV,kl}^{l}\cdot\Bigg\{CLATT_{l}(POST(l))\\
&\quad-\left[CLATT_{k}(POST(l))-CLATT_{k}(MID(k,l))\right]\Bigg\}, \tag{47}
\end{align*}

where the weight $\sigma_{IV,kl}^{l}$ is:

\begin{align*}
\sigma_{IV,kl}^{l}=\frac{Pr(E_{i}=k)Pr(E_{i}=l)}{C^{D,Z}}\frac{T-(l-1)}{T}\frac{l-k}{T}. \tag{48}
\end{align*}

To sum up, by combining (42), (43), and (47), we obtain

\begin{align*}
\hat{\beta}_{IV}&=\sum_{k\neq U}\hat{w}_{IV,kU}\hat{\beta}_{IV,kU}^{2\times 2}+\sum_{k\neq U}\sum_{l>k}\left[\hat{w}_{IV,kl}^{k}\hat{\beta}_{IV,kl}^{2\times 2,k}+\hat{w}_{IV,kl}^{l}\hat{\beta}_{IV,kl}^{2\times 2,l}\right]\\
&\xrightarrow{p}WCLATT-\Delta CLATT,
\end{align*}

where we define

\begin{align*}
WCLATT&\equiv\sum_{k\neq U}w_{IV,kU}CLATT^{CM}_{k}(POST(k))+\sum_{k\neq U}\sum_{l>k}w_{IV,kl}^{k}CLATT^{CM}_{k}(MID(k,l))\\
&\quad+\sum_{k\neq U}\sum_{l>k}\sigma_{IV,kl}^{l}\cdot CLATT_{l}(POST(l)),\\
\Delta CLATT&\equiv\sum_{k\neq U}\sum_{l>k}\sigma_{IV,kl}^{l}\cdot\left[CLATT_{k}(POST(l))-CLATT_{k}(MID(k,l))\right],
\end{align*}

and the weights are:

\begin{align*}
w_{IV,kU}&=\frac{Pr(E_{i}=k)Pr(E_{i}=U)}{C^{D,Z}}\frac{T-(k-1)}{T}\frac{k-1}{T}\cdot CAET^{1}_{k}(POST(k)), \tag{49}\\
w_{IV,kl}^{k}&=\frac{Pr(E_{i}=k)Pr(E_{i}=l)}{C^{D,Z}}\frac{k-1}{T}\frac{l-k}{T}\cdot CAET_{k}^{1}(MID(k,l)), \tag{50}\\
\sigma_{IV,kl}^{l}&=\frac{Pr(E_{i}=k)Pr(E_{i}=l)}{C^{D,Z}}\frac{T-(l-1)}{T}\frac{l-k}{T}. \tag{51}
\end{align*}

This completes the proof. ∎

B.4   Proof of Lemma 2

Proof.

We first simplify $CLATT_{k}^{CM}(W)$ and $D_{kl}^{2\times 2,l}$ (defined in (44)) under Assumption 9. Under this assumption, $CLATT_{k}^{CM}(W)$ becomes:

\begin{align*}
CLATT^{CM}_{k}(W)&=\sum_{t\in W}\frac{Pr(CM_{k,t}|E_{i}=k)}{\sum_{t\in W}Pr(CM_{k,t}|E_{i}=k)}CLATT_{k,t}\\
&=\frac{1}{T_{W}}\sum_{t\in W}CLATT_{k,t}\\
&\equiv CLATT^{eq}_{k}(W).
\end{align*}

In addition, $D_{kl}^{2\times 2,l}$ becomes:

\begin{align*}
D_{kl}^{2\times 2,l}&=CAET_{l}^{1}(POST(l))-\left[CAET_{k}^{1}(POST(l))-CAET_{k}^{1}(MID(k,l))\right]\\
&=CAET_{l}^{1},
\end{align*}

because we have $CAET_{k}^{1}(W)=CAET_{k}^{1}$.

We then rewrite the probability limits of $\hat{w}_{IV,kU}\hat{\beta}_{IV,kU}^{2\times 2}$, $\hat{w}_{IV,kl}^{k}\hat{\beta}_{IV,kl}^{2\times 2,k}$ and $\hat{w}_{IV,kl}^{l}\hat{\beta}_{IV,kl}^{2\times 2,l}$ in turn. First, the probability limits of $\hat{w}_{IV,kU}\hat{\beta}_{IV,kU}^{2\times 2}$ and $\hat{w}_{IV,kl}^{k}\hat{\beta}_{IV,kl}^{2\times 2,k}$ simplify to:

\begin{align*}
\hat{w}_{IV,kU}\hat{\beta}_{IV,kU}^{2\times 2} &\xrightarrow{p}w_{IV,kU}CLATT^{CM}_{k}(POST(k))\\
&=w_{IV,kU}CLATT^{eq}_{k}(POST(k)), \tag{52}\\
\hat{w}_{IV,kl}^{k}\hat{\beta}_{IV,kl}^{2\times 2,k} &\xrightarrow{p}w_{IV,kl}^{k}CLATT^{CM}_{k}(MID(k,l))\\
&=w_{IV,kl}^{k}CLATT^{eq}_{k}(MID(k,l)), \tag{53}
\end{align*}

where the weights $w_{IV,kU}$ and $w_{IV,kl}^{k}$ are:

\begin{align*}
w_{IV,kU}&=\frac{Pr(E_{i}=k)Pr(E_{i}=U)}{C^{D,Z}}\frac{T-(k-1)}{T}\frac{k-1}{T}\cdot CAET^{1}_{k}, \tag{54}\\
w_{IV,kl}^{k}&=\frac{Pr(E_{i}=k)Pr(E_{i}=l)}{C^{D,Z}}\frac{k-1}{T}\frac{l-k}{T}\cdot CAET^{1}_{k}. \tag{55}
\end{align*}

Next, we reconsider the probability limit of $\hat{w}_{IV,kl}^{l}\hat{\beta}_{IV,kl}^{2\times 2,l}$.

First, we note that the probability limit of $\hat{w}_{IV,kl}^{l}$ simplifies to:

\begin{align*}
\hat{w}_{IV,kl}^{l}=\frac{((n_{k}+n_{l})\bar{Z}_{k})^{2}\hat{V}_{kl}^{Z,l}\hat{D}_{kl}^{2\times 2,l}}{\hat{C}^{D,Z}} &\xrightarrow{p}\frac{Pr(E_{i}=k)Pr(E_{i}=l)}{C^{D,Z}}\frac{T-(l-1)}{T}\frac{l-k}{T}\cdot D_{kl}^{2\times 2,l}\\
&=\frac{Pr(E_{i}=k)Pr(E_{i}=l)}{C^{D,Z}}\frac{T-(l-1)}{T}\frac{l-k}{T}\cdot CAET_{l}^{1}. \tag{56}
\end{align*}

Here the second equality follows from $D_{kl}^{2\times 2,l}=CAET_{l}^{1}$.

Second, the probability limit of $\hat{\beta}_{IV,kl}^{2\times 2,l}$ reduces to:

\begin{align*}
\hat{\beta}_{IV,kl}^{2\times 2,l} &=\frac{\hat{\beta}_{kl}^{2\times 2,l}}{\hat{D}_{kl}^{2\times 2,l}}\\
&\xrightarrow{p}\frac{1}{CAET_{l}^{1}}\cdot\frac{1}{T-(l-1)}\sum_{t=l}^{T}CAET_{l}^{1}\cdot CLATT_{l,t}\\
&-\frac{1}{CAET_{l}^{1}}\cdot\left[\frac{1}{T-(l-1)}\sum_{t=l}^{T}CAET_{k}^{1}\cdot CLATT_{k,t}-\frac{1}{l-k}\sum_{t=k}^{l-1}CAET_{k}^{1}\cdot CLATT_{k,t}\right]\\
&=CLATT^{eq}_{l}(POST(l))\\
&-\frac{1}{CAET_{l}^{1}}\cdot\left[CLATT_{k}(POST(l))-CLATT_{k}(MID(k,l))\right], \tag{57}
\end{align*}

where we use (45) and $D_{kl}^{2\times 2,l}=CAET_{l}^{1}$.

Combining (57) with (56), by Slutsky's theorem, we have

\begin{align*}
\hat{w}_{IV,kl}^{l}\hat{\beta}_{IV,kl}^{2\times 2,l} &\xrightarrow{p}w_{IV,kl}^{l}CLATT^{eq}_{l}(POST(l))-\sigma_{IV,kl}^{l}\cdot\left[CLATT_{k}(POST(l))-CLATT_{k}(MID(k,l))\right], \tag{58}
\end{align*}

where the weight $w_{IV,kl}^{l}$ is:

\begin{align*}
w_{IV,kl}^{l}=\frac{Pr(E_{i}=k)Pr(E_{i}=l)}{C^{D,Z}}\frac{T-(l-1)}{T}\frac{l-k}{T}\cdot CAET_{l}^{1}, \tag{59}
\end{align*}

and $\sigma_{IV,kl}^{l}$ is already defined in (48).

Finally, from the results (52) and (53) together with (58), we obtain

\begin{align*}
\hat{\beta}_{IV} &=\bigg[\sum_{k\neq U}\hat{w}_{IV,kU}\hat{\beta}_{IV,kU}^{2\times 2}+\sum_{k\neq U}\sum_{l>k}\hat{w}_{IV,kl}^{k}\hat{\beta}_{IV,kl}^{2\times 2,k}+\hat{w}_{IV,kl}^{l}\hat{\beta}_{IV,kl}^{2\times 2,l}\bigg]\\
&\xrightarrow{p}WCLATT-\Delta CLATT,
\end{align*}

where we define:

\begin{align*}
WCLATT &\equiv\sum_{k\neq U}w_{IV,kU}CLATT^{eq}_{k}(POST(k))+\sum_{k\neq U}\sum_{l>k}w_{IV,kl}^{k}CLATT^{eq}_{k}(MID(k,l))\\
&+\sum_{k\neq U}\sum_{l>k}w_{IV,kl}^{l}CLATT^{eq}_{l}(POST(l)),\\
\Delta CLATT &\equiv\sum_{k\neq U}\sum_{l>k}\sigma_{IV,kl}^{l}\cdot\left[CLATT_{k}(POST(l))-CLATT_{k}(MID(k,l))\right].
\end{align*}

This completes the proof. ∎

B.5   Proof of Lemma 3

Proof.

First, we show $\Delta CLATT=0$. Under Assumption 9 and Assumption 11, we have $CLATT_{e,t}=CLATT_{e}$. This implies:

\begin{align*}
\Delta CLATT &=\sum_{k\neq U}\sum_{l>k}\sigma_{IV,kl}^{l}\cdot\left[CLATT_{k}(POST(l))-CLATT_{k}(MID(k,l))\right]\\
&=0,
\end{align*}

because we have $CLATT_{k}(POST(l))-CLATT_{k}(MID(k,l))=0$.

Next, we consider $WCLATT$. Since we have $CLATT^{eq}_{k}(W)=CLATT_{k}$, we obtain:

\begin{align*}
WCLATT &=\sum_{k\neq U}w_{IV,kU}CLATT_{k}+\sum_{k\neq U}\sum_{l>k}w_{IV,kl}^{k}CLATT_{k}+\sum_{k\neq U}\sum_{l>k}w_{IV,kl}^{l}CLATT_{l}\\
&=\sum_{k\neq U}CLATT_{k}\Bigg[w_{IV,kU}+\sum_{j=1}^{k-1}w_{IV,jk}^{k}+\sum_{j=k+1}^{K}w_{IV,kj}^{k}\Bigg].
\end{align*}

This completes the proof. ∎
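
The bracketed cohort weights in the last display can be tabulated numerically. The following is a minimal sketch, not the paper's implementation: the function name `lemma3_cohort_weights`, the example shares, and the example values of $CAET_{k}^{1}$ are hypothetical, and the normalizing constant $C^{D,Z}$ (defined outside this section) is assumed to equal the sum of the unnormalized weights, so that the weights sum to one.

```python
from itertools import combinations


def lemma3_cohort_weights(shares, share_unexposed, T, caet):
    """Total weight on each CLATT_k under stable effects (sketch).

    shares: dict mapping adoption date k to Pr(E_i = k) for exposed cohorts
    share_unexposed: Pr(E_i = U)
    caet: dict mapping k to a time-invariant CAET_k^1
    """
    raw = {k: 0.0 for k in shares}
    for k, pk in shares.items():
        # exposed cohort k vs. unexposed cohort U: form of (54) without 1/C^{D,Z}
        raw[k] += pk * share_unexposed * (T - (k - 1)) / T * (k - 1) / T * caet[k]
    for k, l in combinations(sorted(shares), 2):  # pairs with k < l
        common = shares[k] * shares[l] * (l - k) / T
        raw[k] += common * (k - 1) / T * caet[k]        # form of (55)
        raw[l] += common * (T - (l - 1)) / T * caet[l]  # form of (59)
    total = sum(raw.values())  # ASSUMPTION: stands in for C^{D,Z}
    return {k: v / total for k, v in raw.items()}


# Hypothetical design: T = 5, cohorts adopting at t = 2 and t = 4, plus unexposed units.
weights = lemma3_cohort_weights({2: 0.3, 4: 0.3}, 0.4, 5, {2: 0.5, 4: 0.25})
clatt = {2: 1.0, 4: 1.0}  # identical cohort effects, for illustration only
beta_iv = sum(weights[k] * clatt[k] for k in weights)
```

With identical cohort effects, `beta_iv` reproduces that common value, which is a quick sanity check on the normalization.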

Appendix C Extensions in section 5

C.1   Non-binary, ordered treatment

This subsection considers a non-binary, ordered treatment. We show Lemma 6 below, which is analogous to Lemma 5 for a binary treatment. If we use Lemma 6 instead of Lemma 5 in the proofs of Theorem 2 and Lemmas 2-3, we obtain the corresponding theorem and lemmas with $CLATT_{e,k}$ replaced by $CACRT_{e,k}$.

Lemma 6.

Suppose Assumptions 1-7 hold. If the treatment is non-binary and ordered, then for all $k,l\in\{1,\dots,T\}$ $(k\leq l)$, we have

\begin{align*}
&\left[\sum_{t=k}^{l}E[Y_{i,t}(D_{i,t}^{k})-Y_{i,t}(D_{i,t}^{\infty})|E_{i}=k]\right]\\
&=\sum_{t=k}^{l}E[D_{i,t}^{k}-D_{i,t}^{\infty}|E_{i}=k]\cdot\sum_{j=1}^{J}w^{k}_{t,j}\cdot E[Y_{i,t}(j)-Y_{i,t}(j-1)|E_{i}=k,D_{i,t}^{k}\geq j>D_{i,t}^{\infty}]\\
&\equiv\sum_{t=k}^{l}CATT^{1}_{k,t}\cdot CACRT_{k,t},
\end{align*}

where the weight $w^{k}_{t,j}$ is:

\begin{align*}
w^{k}_{t,j}=\frac{Pr(D_{i,t}^{k}\geq j>D_{i,t}^{\infty}|E_{i}=k)}{\sum_{j=1}^{J}Pr(D_{i,t}^{k}\geq j>D_{i,t}^{\infty}|E_{i}=k)}.
\end{align*}

Proof.

By an argument similar to that in the proof of Lemma 5, one can show that

\begin{align*}
&\left[\sum_{t=k}^{l}E[Y_{i,t}(D_{i,t}^{k})-Y_{i,t}(D_{i,t}^{\infty})|E_{i}=k]\right]\\
&=\left[\sum_{t=k}^{l}\sum_{j=1}^{J}Pr(D_{i,t}^{k}\geq j>D_{i,t}^{\infty}|E_{i}=k)\cdot E[Y_{i,t}(j)-Y_{i,t}(j-1)|E_{i}=k,D_{i,t}^{k}\geq j>D_{i,t}^{\infty}]\right], \tag{60}\\
&E[D_{i,t}^{k}-D_{i,t}^{\infty}|E_{i}=k]\\
&=\sum_{j=1}^{J}Pr(D_{i,t}^{k}\geq j>D_{i,t}^{\infty}|E_{i}=k). \tag{61}
\end{align*}

Combining (60) with (61), we obtain the desired result. ∎
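
The two identities (60) and (61) can be checked on simulated data for a single cohort and a single period. The sketch below is illustrative only: the data-generating process, the variable names, and the choice $J=3$ are hypothetical, and monotonicity ($D_{i,t}^{k}\geq D_{i,t}^{\infty}$) is imposed by construction.

```python
import numpy as np

rng = np.random.default_rng(0)
n, J = 10_000, 3

# Potential treatments in {0, ..., J}; monotonicity D^k >= D^infty holds by construction.
d_inf = rng.integers(0, J + 1, size=n)
d_k = np.minimum(J, d_inf + rng.integers(0, 3, size=n))

# Potential outcomes Y(0), ..., Y(J) for each unit (arbitrary, heterogeneous increments).
y = np.cumsum(rng.normal(1.0, 1.0, size=(n, J + 1)), axis=1)

idx = np.arange(n)
lhs_outcome = np.mean(y[idx, d_k] - y[idx, d_inf])  # E[Y(D^k) - Y(D^infty)]
lhs_treat = np.mean(d_k - d_inf)                    # E[D^k - D^infty]

rhs_outcome, rhs_treat = 0.0, 0.0
for j in range(1, J + 1):
    mask = (d_k >= j) & (j > d_inf)   # units moved across level j
    p_j = mask.mean()                 # Pr(D^k >= j > D^infty)
    rhs_treat += p_j
    if p_j > 0:
        rhs_outcome += p_j * np.mean(y[mask, j] - y[mask, j - 1])
```

Both sides agree up to floating-point error because each unit's outcome change telescopes into unit steps of the treatment.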

C.2   Unbalanced panel case

In this section, we consider an unbalanced setting. We use the notation for a panel data setting, but the discussions and the results are the same if we consider an unbalanced repeated cross section setting.

Proof of Theorem 3

Let $N_{e,t}$ be the sample size for cohort $e$ at time $t$ and $N=\sum_{e}\sum_{t}N_{e,t}$ be the total number of observations. We consider the following two-way fixed effects instrumental variable regression:

\begin{align*}
Y_{i,t}&=\mu_{i.}+\delta_{t.}+\alpha Z_{i,t}+\epsilon_{i,t},\\
D_{i,t}&=\gamma_{i.}+\zeta_{t.}+\pi Z_{i,t}+\eta_{i,t}.
\end{align*}

We define $\hat{Z}_{i,t}$ to be the residuals from regressing $Z_{i,t}$ on the time and individual fixed effects.

From the FWL theorem, the TWFEIV estimator $\hat{\beta}_{IV}$ is:

\begin{align*}
\hat{\beta}_{IV} &=\frac{\sum_{i}\sum_{t}\hat{Z}_{i,t}Y_{i,t}}{\sum_{i}\sum_{t}\hat{Z}_{i,t}D_{i,t}}\\
&=\frac{\sum_{e}\sum_{t}N_{e,t}\frac{1}{N_{e,t}}\sum_{i}^{N_{e,t}}\hat{Z}_{e(i),t}Y_{e(i),t}}{\sum_{e}\sum_{t}N_{e,t}\frac{1}{N_{e,t}}\sum_{i}^{N_{e,t}}\hat{Z}_{e(i),t}D_{e(i),t}}\\
&=\frac{\sum_{e}\sum_{t}N_{e,t}\hat{Z}_{e,t}\frac{1}{N_{e,t}}\sum_{i}^{N_{e,t}}Y_{e(i),t}}{\sum_{e}\sum_{t}N_{e,t}\hat{Z}_{e,t}\frac{1}{N_{e,t}}\sum_{i}^{N_{e,t}}D_{e(i),t}},
\end{align*}

where the third equality follows from the fact that $\hat{Z}_{i,t}$ varies only at the cohort and time level.
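
The FWL ratio form can be verified numerically. The sketch below is an illustration under simplifying assumptions, not the estimator used in the paper's application: it uses a balanced simulated panel (so that two-way demeaning gives the fixed-effects residuals exactly), hypothetical adoption dates, and a hypothetical data-generating process. It compares the ratio $\sum\hat{Z}Y/\sum\hat{Z}D$ with the ratio of the reduced-form and first-stage coefficients from dummy-variable regressions.

```python
import numpy as np

rng = np.random.default_rng(1)
n_units, T = 60, 6

# Hypothetical staggered adoption dates; 99 codes a never-exposed unit.
adopt = rng.choice([2, 4, 99], size=n_units)
t = np.arange(1, T + 1)
Z = (t[None, :] >= adopt[:, None]).astype(float)  # instrument Z_{i,t}
D = (rng.random((n_units, T)) < 0.2 + 0.5 * Z).astype(float)
Y = (2.0 * D + rng.normal(size=(n_units, 1)) + 0.3 * t[None, :]
     + rng.normal(size=(n_units, T)))


def two_way_demean(A):
    """Residuals from unit and time fixed effects (exact in a balanced panel)."""
    return A - A.mean(axis=1, keepdims=True) - A.mean(axis=0, keepdims=True) + A.mean()


Z_hat = two_way_demean(Z)
beta_fwl = (Z_hat * Y).sum() / (Z_hat * D).sum()

# Same estimator via explicit unit and time dummies (reduced form / first stage).
unit_d = np.kron(np.eye(n_units), np.ones((T, 1)))
time_d = np.kron(np.ones((n_units, 1)), np.eye(T))
X = np.column_stack([Z.ravel(), unit_d, time_d])
alpha = np.linalg.lstsq(X, Y.ravel(), rcond=None)[0][0]
pi = np.linalg.lstsq(X, D.ravel(), rcond=None)[0][0]
beta_dummy = alpha / pi
```

As a by-product, `Z_hat` sums to zero within each unit and within each period, which is the property the remainder of the proof relies on.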

We note that by the definition of $\hat{Z}_{e,t}$, we have

\begin{align*}
\sum_{t}N_{e,t}\hat{Z}_{e,t}&=0\quad\text{for all}\quad e\in\mathcal{S}(E_{i}), \tag{62}\\
\sum_{e}N_{e,t}\hat{Z}_{e,t}&=0\quad\text{for all}\quad t\in\{1,\dots,T\}. \tag{63}
\end{align*}

To ease the notation, we define the sample mean of a random variable $R_{i,t}$ in cohort $e$ at time $t$ as follows:

\begin{align*}
R_{e,t}\equiv\frac{1}{N_{e,t}}\sum_{i}^{N_{e,t}}R_{e(i),t}.
\end{align*}

Here, we note that we can express $Y_{e,t}$ as follows:

\begin{align*}
Y_{e,t} &=\frac{1}{N_{e,t}}\sum_{i}^{N_{e,t}}Y_{e(i),t}\\
&=\frac{1}{N_{e,t}}\sum_{i}^{N_{e,t}}[Y_{e(i),t}(D_{i,t}^{\infty})+Z_{e,t}\cdot(Y_{e(i),t}(D_{i,t}^{e})-Y_{e(i),t}(D_{i,t}^{\infty}))]\\
&=Y_{e,t}(D_{i,t}^{\infty})+Z_{e,t}\cdot(Y_{e,t}(D_{i,t}^{e})-Y_{e,t}(D_{i,t}^{\infty})), \tag{64}
\end{align*}

where the second equality follows from Assumptions 1-3 and Assumption 5.

First, we consider the probability limit of the numerator in the TWFEIV estimator. By using (62) and (63), we obtain

\begin{align*}
\sum_{e}\sum_{t}N_{e,t}\hat{Z}_{e,t}\frac{1}{N_{e,t}}\sum_{i}^{N_{e,t}}Y_{e(i),t} &=\sum_{e}\sum_{t}N_{e,t}\hat{Z}_{e,t}Y_{e,t}\\
&=\sum_{e}\sum_{t}N_{e,t}\hat{Z}_{e,t}\left[Y_{e,t}-Y_{e,1}-(Y_{1,t}-Y_{1,1})\right]. \tag{65}
\end{align*}

To further develop the expression, we use (64):

\begin{align*}
Y_{e,t}-Y_{e,1}-(Y_{1,t}-Y_{1,1}) &=Y_{e,t}(D_{i,t}^{\infty})-Y_{e,1}(D_{i,1}^{\infty})-[Y_{1,t}(D_{i,t}^{\infty})-Y_{1,1}(D_{i,1}^{\infty})]\\
&+Z_{e,t}\cdot(Y_{e,t}(D_{i,t}^{e})-Y_{e,t}(D_{i,t}^{\infty}))-Z_{e,1}\cdot(Y_{e,1}(D_{i,1}^{e})-Y_{e,1}(D_{i,1}^{\infty}))\\
&-[Z_{1,t}\cdot(Y_{1,t}(D_{i,t}^{1})-Y_{1,t}(D_{i,t}^{\infty}))-Z_{1,1}\cdot(Y_{1,1}(D_{i,1}^{1})-Y_{1,1}(D_{i,1}^{\infty}))]. \tag{66}
\end{align*}

Substituting (66) into (65), we obtain:

\begin{align*}
&\sum_{e}\sum_{t}N_{e,t}\hat{Z}_{e,t}\left[Y_{e,t}-Y_{e,1}-(Y_{1,t}-Y_{1,1})\right]\\
&=\sum_{e}\sum_{t}N_{e,t}\hat{Z}_{e,t}\left[Y_{e,t}(D_{i,t}^{\infty})-Y_{e,1}(D_{i,1}^{\infty})-[Y_{1,t}(D_{i,t}^{\infty})-Y_{1,1}(D_{i,1}^{\infty})]\right]\\
&+\sum_{e}\sum_{t}N_{e,t}\hat{Z}_{e,t}Z_{e,t}\cdot(Y_{e,t}(D_{i,t}^{e})-Y_{e,t}(D_{i,t}^{\infty})), \tag{67}
\end{align*}

where the equality follows from (62) and (63).

From (67), as $N\rightarrow\infty$, we obtain

\begin{align*}
&\sum_{e}\sum_{t}N_{e,t}\hat{Z}_{e,t}\left[Y_{e,t}-Y_{e,1}-(Y_{1,t}-Y_{1,1})\right]\\
&\xrightarrow{p}\sum_{e}\sum_{t}E[\hat{Z}_{i,t}|E_{i}=e]\cdot n_{e,t}\cdot\Bigg\{E[Y_{e,t}(D_{i,t}^{\infty})|E_{i}=e]-E[Y_{e,1}(D_{i,1}^{\infty})|E_{i}=e]\\
&-\left(E[Y_{1,t}(D_{i,t}^{\infty})|E_{i}=1]-E[Y_{1,1}(D_{i,1}^{\infty})|E_{i}=1]\right)\Bigg\}\\
&+\sum_{e}\sum_{t}E[\hat{Z}_{i,t}|E_{i}=e]\cdot n_{e,t}\cdot E[Z_{e,t}\cdot(Y_{e,t}(D_{i,t}^{e})-Y_{e,t}(D_{i,t}^{\infty}))|E_{i}=e]\\
&=\sum_{e}\sum_{t\geq e}E[\hat{Z}_{i,t}|E_{i}=e]\cdot n_{e,t}\cdot E[(Y_{e,t}(D_{i,t}^{e})-Y_{e,t}(D_{i,t}^{\infty}))|E_{i}=e]\\
&=\sum_{e}\sum_{t\geq e}E[\hat{Z}_{i,t}|E_{i}=e]\cdot n_{e,t}\cdot CATT^{1}_{e,t}\cdot CLATT_{e,t}, \tag{68}
\end{align*}

where $n_{e,t}$ and $E[\hat{Z}_{i,t}|E_{i}=e]$ are, respectively, the population share and the expected residualized instrument for cohort $e$ at time $t$. The first equality follows from Assumption 1 and Assumption 7. The second equality follows from Assumption 5.

Next, we consider the probability limit of the denominator. We note that the structure of the denominator is the same as that of the numerator. Therefore, by the same argument, we have:

\begin{align*}
&\sum_{e}\sum_{t}N_{e,t}\hat{Z}_{e,t}\frac{1}{N_{e,t}}\sum_{i}^{N_{e,t}}D_{e(i),t}\\
&\xrightarrow{p}\sum_{e}\sum_{t\geq e}E[\hat{Z}_{i,t}|E_{i}=e]\cdot n_{e,t}\cdot CATT^{1}_{e,t}. \tag{69}
\end{align*}

Combining (69) with (68), we obtain

\begin{align*}
\hat{\beta}_{IV} &\xrightarrow{p}\beta_{IV}\\
&=\frac{\sum_{e}\sum_{t\geq e}E[\hat{Z}_{i,t}|E_{i}=e]\cdot n_{e,t}\cdot CATT^{1}_{e,t}\cdot CLATT_{e,t}}{\sum_{e}\sum_{t\geq e}E[\hat{Z}_{i,t}|E_{i}=e]\cdot n_{e,t}\cdot CATT^{1}_{e,t}}\\
&=\sum_{e}\sum_{t\geq e}w_{e,t}\cdot CLATT_{e,t},
\end{align*}

where the weight $w_{e,t}$ is:

\begin{align*}
w_{e,t}=\frac{E[\hat{Z}_{i,t}|E_{i}=e]\cdot n_{e,t}\cdot CATT^{1}_{e,t}}{\sum_{e}\sum_{t\geq e}E[\hat{Z}_{i,t}|E_{i}=e]\cdot n_{e,t}\cdot CATT^{1}_{e,t}}.
\end{align*}

This completes the proof.
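
To see how the weights $w_{e,t}$ behave, they can be tabulated for a stylized design. The sketch below is a hedged illustration: it treats the balanced panel as a special case with $n_{e,t}$ equal to the cohort share divided by $T$, it takes the first-stage effect $CATT^{1}_{e,t}$ as a known constant, and the function name `cohort_time_weights` and the example design are hypothetical.

```python
import numpy as np


def cohort_time_weights(adopt_dates, shares, T, catt):
    """Weights w_{e,t} on CLATT_{e,t} for t >= e (rows: cohorts, columns: periods)."""
    adopt = np.asarray(adopt_dates, dtype=float)
    s = np.asarray(shares, dtype=float)
    t = np.arange(1, T + 1)
    Z = (t[None, :] >= adopt[:, None]).astype(float)       # Z_{e,t} = 1{t >= e}
    z_bar_e = Z.mean(axis=1, keepdims=True)                 # cohort mean over time
    z_bar_t = (s[:, None] * Z).sum(axis=0, keepdims=True)   # period mean over cohorts
    z_bar = (s[:, None] * Z).sum() / T                      # grand mean
    Z_hat = Z - z_bar_e - z_bar_t + z_bar                   # residualized instrument
    n_et = np.repeat(s[:, None] / T, T, axis=1)             # ASSUMPTION: balanced panel
    raw = Z_hat * n_et * catt * Z                           # Z zeroes out periods t < e
    return raw / raw.sum()


# Hypothetical design: two equally sized cohorts adopting at t = 2 and t = 4, T = 5.
w = cohort_time_weights([2, 4], [0.5, 0.5], T=5, catt=0.5)
```

In this example the early cohort receives a negative weight in the periods after the late cohort adopts (for instance `w[0, 3]` equals $-1/3$), even though the weights sum to one.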

Supplementary of Theorem 3

We provide another representation of Theorem 3. We assume that there does not exist a never exposed cohort, that is, $\infty\notin\mathcal{S}(E_{i})$, and define $l=\max\{E_{i}\}$ to be the last exposed cohort.

Lemma 7.

Suppose Assumptions 1-7 hold. Assume a binary treatment and an unbalanced panel setting. If there does not exist a never exposed cohort, i.e., $\infty\notin\mathcal{S}(E_{i})$, the population regression coefficient $\beta_{IV}$ is:

\begin{align*}
\beta_{IV}=\sum\limits_{e}\sum\limits_{l-1\geq t\geq e}w_{e,t}^{1}\cdot CLATT_{e,t}+\sum\limits_{e}\sum\limits_{t\geq l}w_{e,t}^{2}\cdot\Delta_{e,t}, \tag{70}
\end{align*}

where $\Delta_{e,t}$ is:

\begin{align*}
\Delta_{e,t}\equiv\frac{CAET^{1}_{e,t}\cdot CLATT_{e,t}-CAET^{1}_{l,t}\cdot CLATT_{l,t}}{CAET^{1}_{e,t}-CAET^{1}_{l,t}},
\end{align*}

and the weights $w^{1}_{e,t}$ and $w^{2}_{e,t}$ are:

\begin{align*}
w_{e,t}^{1}&=\frac{E[\hat{Z}_{i,t}|E_{i}=e]\cdot n_{e,t}\cdot CAET^{1}_{e,t}}{\sum\limits_{e}\left(\sum\limits_{l-1\geq t\geq e}E[\hat{Z}_{i,t}|E_{i}=e]\cdot n_{e,t}\cdot CAET^{1}_{e,t}+\sum\limits_{t\geq l}E[\hat{Z}_{i,t}|E_{i}=e]\cdot n_{e,t}\cdot(CAET^{1}_{e,t}-CAET^{1}_{l,t})\right)}, \tag{71}\\
w_{e,t}^{2}&=\frac{E[\hat{Z}_{i,t}|E_{i}=e]\cdot n_{e,t}\cdot(CAET^{1}_{e,t}-CAET^{1}_{l,t})}{\sum\limits_{e}\left(\sum\limits_{l-1\geq t\geq e}E[\hat{Z}_{i,t}|E_{i}=e]\cdot n_{e,t}\cdot CAET^{1}_{e,t}+\sum\limits_{t\geq l}E[\hat{Z}_{i,t}|E_{i}=e]\cdot n_{e,t}\cdot(CAET^{1}_{e,t}-CAET^{1}_{l,t})\right)}. \tag{72}
\end{align*}

We note that when there is no never exposed cohort, we can only identify each $CLATT_{e,t}$ before the time period $l=\max\{E_{i}\}$ for cohort $e\neq l$, exploiting the time trends of the unexposed treatment and outcome for cohort $l$. This implies that in equation (70), each $\Delta_{e,t}$ is the bias term arising from the bad comparisons performed by TWFEIV regressions. In a given application, we can estimate $CLATT_{e,t}$, $\Delta_{e,t}$, and the associated weights $w^{1}_{e,t}$, $w^{2}_{e,t}$ by constructing consistent estimators, using (76) and (80) below.

Proof.

We consider the case where there is no never exposed cohort, i.e., $\infty\notin\mathcal{S}(E_{i})$. In this case, by using the last exposed cohort $l=\max\{E_{i}\}$, we obtain

\begin{align*}
\hat{\beta}_{IV} &=\frac{\sum_{e}\sum_{t}N_{e,t}\hat{Z}_{e,t}\left[Y_{e,t}-Y_{e,1}-(Y_{l,t}-Y_{l,1})\right]}{\sum_{e}\sum_{t}N_{e,t}\hat{Z}_{e,t}\left[D_{e,t}-D_{e,1}-(D_{l,t}-D_{l,1})\right]}\\
&=\frac{\sum_{e}\sum_{t}N_{e,t}\hat{Z}_{e,t}\left[D_{e,t}-D_{e,1}-(D_{l,t}-D_{l,1})\right]\cdot\widehat{WDID}_{e,t}}{\sum_{e}\sum_{t}N_{e,t}\hat{Z}_{e,t}\left[D_{e,t}-D_{e,1}-(D_{l,t}-D_{l,1})\right]},
\end{align*}

where we define

\begin{align*}
\widehat{WDID}_{e,t}\equiv\frac{\left[Y_{e,t}-Y_{e,1}-(Y_{l,t}-Y_{l,1})\right]}{\left[D_{e,t}-D_{e,1}-(D_{l,t}-D_{l,1})\right]}.
\end{align*}

From the Law of Large Numbers and the same argument as in the proof of Theorem 2, we have

\begin{align*}
\widehat{WDID}_{e,t}\xrightarrow{p}\begin{cases}0&(t<e)\\ CLATT_{e,t}&(l-1\geq t\geq e)\\ \dfrac{CAET^{1}_{e,t}\cdot CLATT_{e,t}-CAET^{1}_{l,t}\cdot CLATT_{l,t}}{CAET^{1}_{e,t}-CAET^{1}_{l,t}}&(t\geq l)\end{cases} \tag{76}
\end{align*}

Similarly, we obtain

\begin{align*}
\left[D_{e,t}-D_{e,1}-(D_{l,t}-D_{l,1})\right]\xrightarrow{p}\begin{cases}0&(t<e)\\ CAET^{1}_{e,t}&(l-1\geq t\geq e)\\ CAET^{1}_{e,t}-CAET^{1}_{l,t}&(t\geq l)\end{cases} \tag{80}
\end{align*}

Combining (76) with (80), by Slutsky's theorem, we obtain

\begin{align*}
\beta_{IV}=\sum\limits_{e}\sum\limits_{l-1\geq t\geq e}w_{e,t}^{1}\cdot CLATT_{e,t}+\sum\limits_{e}\Bigg[\sum\limits_{t\geq l}w_{e,t}^{2}\cdot\frac{CAET^{1}_{e,t}\cdot CLATT_{e,t}-CAET^{1}_{l,t}\cdot CLATT_{l,t}}{CAET^{1}_{e,t}-CAET^{1}_{l,t}}\Bigg].
\end{align*}

This completes the proof. ∎
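
The building block $\widehat{WDID}_{e,t}$ is a ratio of two difference-in-differences of cohort-time means. A minimal sketch of that computation follows; the function name `wald_did`, the array layout (rows are cohorts, columns are periods, column 0 is the first period), and the toy numbers are assumptions made for illustration.

```python
import numpy as np


def wald_did(y_bar, d_bar, e, l, t):
    """Wald-DID for cohort row e at period column t.

    Cohort row l (the last exposed cohort) is the comparison group and
    column 0 (the first period) is the baseline.
    """
    num = (y_bar[e, t] - y_bar[e, 0]) - (y_bar[l, t] - y_bar[l, 0])
    den = (d_bar[e, t] - d_bar[e, 0]) - (d_bar[l, t] - d_bar[l, 0])
    return num / den


# Toy cohort-time means: cohort 0 is exposed in the second period, cohort 1 is not yet exposed.
y_bar = np.array([[1.0, 3.0], [1.0, 2.0]])
d_bar = np.array([[0.0, 0.5], [0.0, 0.0]])
example = wald_did(y_bar, d_bar, e=0, l=1, t=1)
```

Here the outcome DID is 1.0 and the treatment DID is 0.5, so `example` equals 2.0. For periods in which the comparison cohort is itself exposed, the same ratio converges to the contaminated term in the third row of (76).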

C.3   Random assignment of the instrument adoption date

First, we set up additional notation. We define $LATE_{k}^{CM}(W)$ and $LATE_{k}(W)$ analogously to $CLATT_{k}^{CM}(W)$ and $CLATT_{k}(W)$ in section 4:

\begin{align*}
LATE^{CM}_{k}(W)&\equiv\sum_{t\in W}\frac{AE^{1}_{k,t}}{\sum_{t\in W}AE^{1}_{k,t}}LATE_{k,t},\\
LATE_{k}(W)&\equiv\frac{1}{T_{W}}\sum_{t\in W}AE^{1}_{k,t}LATE_{k,t},
\end{align*}

where we replace $CAET^{1}_{k,t}$ and $CLATT_{k,t}$ with $AE^{1}_{k,t}$ and $LATE_{k,t}$, respectively, in $CLATT_{k}^{CM}(W)$ and $CLATT_{k}(W)$.

Theorem 4 below presents the TWFEIV estimand under Assumptions 1-5 and Assumption 14 (random assignment of the adoption date $E_{i}$).

Theorem 4.

Suppose Assumptions 1-5 and 14 hold. Then, the population regression coefficient $\beta_{IV}$ consists of two terms:

\begin{align*}
\beta_{IV}=WLATE-\Delta LATE,
\end{align*}

where we define:

\begin{align*}
WLATE &\equiv\sum_{k\neq U}w_{IV,kU}LATE^{CM}_{k}(POST(k))+\sum_{k\neq U}\sum_{l>k}w_{IV,kl}^{k}LATE^{CM}_{k}(MID(k,l))\\
&+\sum_{k\neq U}\sum_{l>k}\sigma_{IV,kl}^{l}\cdot LATE_{l}(POST(l)),\\
\Delta LATE &\equiv\sum_{k\neq U}\sum_{l>k}\sigma_{IV,kl}^{l}\cdot\left[LATE_{k}(POST(l))-LATE_{k}(MID(k,l))\right].
\end{align*}

The weights $w_{IV,kU}$, $w_{IV,kl}^{k}$ and $\sigma_{IV,kl}^{l}$ are the same as in the proof of Theorem 2.

Theorem 4 is analogous to Theorem 2, but $CAET^{1}_{e,l}$ and $CLATT_{e,l}$ are replaced by $AE^{1}_{e,l}$ and $LATE_{e,l}$, respectively, because we assume random assignment of the adoption date. If we impose the restrictions on the effects of the instrument on the treatment and outcome as in section 4.2, arguments similar to those in Theorem 2, Lemma 2 and Lemma 3 hold.

Proof.

First, we note that Assumption 14 implies Assumption 6 and Assumption 7. Therefore, we obtain the result in Theorem 2 under Assumptions 1-5 and 14.

By noticing that we have $CAET^{1}_{e,l}=AE^{1}_{e,l}$ and $CLATT_{e,l}=LATE_{e,l}$ under Assumption 14, we obtain:

\begin{align*}
CLATT_{k}^{CM}(W)&=LATE^{CM}_{k}(W),\\
CLATT_{k}(W)&=LATE_{k}(W).
\end{align*}

By replacing $CLATT_{k}^{CM}(W)$ and $CLATT_{k}(W)$ with $LATE^{CM}_{k}(W)$ and $LATE_{k}(W)$ in Theorem 2, we obtain the desired result. ∎

Appendix D Proofs and discussions in section 7

In this appendix, we first derive equation (22). We then discuss the causal interpretation of the covariate-adjusted TWFEIV estimand under staggered DID-IV designs, imposing the additional assumptions.

D.1   Decomposing the between IV coefficient

Let $\hat{C}^{D,\tilde{z}}_{b}$ denote the covariance between $D_{i,t}$ and $\tilde{z}_{k,t}$, the between term of $\tilde{z}_{i,t}$. The between IV coefficient $\hat{\beta}_{b,IV}^{z}$ is:

\begin{align*}
\hat{\beta}_{b,IV}^{z}=\frac{\hat{C}(Y_{i,t},\tilde{z}_{k,t})}{\hat{C}(D_{i,t},\tilde{z}_{k,t})}=\frac{\hat{C}(Y_{i,t},\tilde{z}_{k,t})}{\hat{C}^{D,\tilde{z}}_{b}}. \tag{81}
\end{align*}

To derive equation (22), we decompose the covariance between $Y_{i,t}$ and $\tilde{z}_{k,t}$. To do so, we first split the between term $\tilde{z}_{k,t}$ into the between term of $Z_{i,t}$ and the between term of $\tilde{p}_{i,t}$:

\begin{align*}
\tilde{z}_{k,t} &=[(\bar{Z}_{k,t}-\bar{Z}_{k})-(\bar{Z}_{t}-\bar{\bar{Z}})]-[(\bar{p}_{k,t}-\bar{p}_{k})-(\bar{p}_{t}-\bar{\bar{p}})]\\
&\equiv\tilde{Z}_{k,t}-\tilde{p}_{k,t}.
\end{align*}

Then, we have

\begin{align*}
\hat{C}(Y_{i,t},\tilde{z}_{k,t}) &=\frac{1}{NT}\sum_{i}\sum_{t}Y_{i,t}[(\bar{z}_{k,t}-\bar{z}_{k})-(\bar{z}_{t}-\bar{\bar{z}})]\\
&=\frac{1}{T}\sum_{k}n_{k}\sum_{t}\bar{Y}_{k,t}[(\bar{z}_{k,t}-\bar{z}_{k})-(\bar{z}_{t}-\bar{\bar{z}})]\\
&=\sum_{k}\sum_{l}n_{k}n_{l}\frac{1}{T}\sum_{t}(\bar{Y}_{k,t}-\bar{Y}_{l,t})\tilde{Z}_{k,t}-\sum_{k}\sum_{l}n_{k}n_{l}\frac{1}{T}\sum_{t}(\bar{Y}_{k,t}-\bar{Y}_{l,t})\tilde{p}_{k,t}\\
&=\sum_{k}\sum_{l>k}(n_{k}+n_{l})^{2}[\hat{C}_{kl}^{D,Z}\hat{\beta}_{IV,kl}^{2\times 2}-\hat{C}_{b,kl}^{p}\hat{\beta}_{b,IV,kl}^{p}]. \tag{82}
\end{align*}

Here, $\hat{\beta}_{IV,kl}^{2\times 2}$ is the estimator obtained from an IV regression of $Y_{i,t}$ on $D_{i,t}$ with $\tilde{Z}_{k,t}$ as the excluded instrument in the $(k,l)$ cell subsample, and $\hat{C}_{kl}^{D,Z}$ is the covariance between $D_{i,t}$ and $\tilde{Z}_{k,t}$ in the same subsample. Similarly, $\hat{\beta}_{b,IV,kl}^{p}$ is the estimator obtained from an IV regression of $Y_{i,t}$ on $D_{i,t}$ with $\tilde{p}_{k,t}$ as the excluded instrument in the $(k,l)$ cell subsample, and $\hat{C}_{b,kl}^{p}$ is the covariance between $D_{i,t}$ and $\tilde{p}_{k,t}$ in that subsample.

By combining (81) with (82), we obtain equation (22).

D.2   Causal interpretation of the covariate-adjusted TWFEIV estimand

This section considers the causal interpretation of the covariate-adjusted TWFEIV estimand $\beta_{IV}^{X}$. To simplify the analysis, we first make the following assumptions. Goodman-Bacon (2021) makes similar assumptions in Appendix B to investigate the causal interpretation of the covariate-adjusted TWFE estimand.

  • (i) Time-varying covariates $X_{i,t}$ are not affected by the instrument (policy shock).

  • (ii) Time-varying covariates $X_{i,t}$ do not vary within cohorts.

  • (iii) The coefficients obtained from regressing $\tilde{Z}_{i,t}$ on $\tilde{X}_{i,t}$ in the $(k,l)$ cell subsample are the same regardless of the pair $(k,l)$.

Because Assumption (ii) implies that the within term is equal to zero, the covariate-adjusted TWFEIV estimator $\hat{\beta}_{IV}^{X}$ simplifies to

\begin{align*}
\hat{\beta}_{IV}^{X}=\sum_{k}\sum_{l>k}s_{b,kl}\hat{\beta}_{b,IV,kl}^{z}.
\end{align*}

Assumption (iii) guarantees that $\hat{\beta}_{b,IV,kl}^{z}$ is equal to the between coefficient obtained from estimating equation (14) in the $(k,l)$ subsample, which we denote $\hat{\beta}_{b,IV,kl}^{z,X}$ hereafter. To see this formally, let $\tilde{p}_{i,t}^{kl}\equiv\hat{\Gamma}_{k,l}\tilde{X}_{i,t}$ denote the linear projection obtained from regressing $\tilde{Z}_{i,t}$ on $\tilde{X}_{i,t}$ in the $(k,l)$ subsample and let $\tilde{p}_{j,t}^{kl}$ denote the between term of $\tilde{p}_{i,t}^{kl}$ in cohort $j$. We note that $\tilde{p}_{j,t}^{kl}\neq\tilde{p}_{j,t}$ (the between term of $\tilde{p}_{i,t}$) holds in general because $\tilde{p}_{i,t}=\hat{\Gamma}\tilde{X}_{i,t}$ is estimated using the whole sample. Then, we have

\begin{align*}
\hat{\beta}_{b,IV,kl}^{z}=\frac{\hat{C}(Y_{i,t},\tilde{z}_{j,t})}{\hat{C}(D_{i,t},\tilde{z}_{j,t})} &=\frac{\hat{C}(Y_{i,t},\tilde{Z}_{j,t}-\tilde{p}_{j,t}^{kl})+\hat{C}(Y_{i,t},\tilde{p}_{j,t}^{kl}-\tilde{p}_{j,t})}{\hat{C}(D_{i,t},\tilde{Z}_{j,t}-\tilde{p}_{j,t}^{kl})+\hat{C}(D_{i,t},\tilde{p}_{j,t}^{kl}-\tilde{p}_{j,t})}\\
&=\frac{\hat{C}(D_{i,t},\tilde{Z}_{j,t}-\tilde{p}_{j,t}^{kl})\hat{\beta}_{b,IV,kl}^{z,X}+\hat{C}(D_{i,t},\tilde{p}_{j,t}^{kl}-\tilde{p}_{j,t})\hat{\beta}_{b,IV,kl}^{dif}}{\hat{C}(D_{i,t},\tilde{Z}_{j,t}-\tilde{p}_{j,t}^{kl})+\hat{C}(D_{i,t},\tilde{p}_{j,t}^{kl}-\tilde{p}_{j,t})},\quad j=k,l,
\end{align*}

where $\hat{\beta}_{b,IV,kl}^{dif}$ is an estimator obtained from an IV regression of $Y_{i,t}$ on $D_{i,t}$ with the difference $\tilde{p}_{j,t}^{kl}-\tilde{p}_{j,t}$ as the excluded instrument. Because Assumption (iii) ($\tilde{p}_{j,t}^{kl}=\tilde{p}_{j,t}$) implies $\hat{C}(D_{i,t},\tilde{p}_{j,t}^{kl}-\tilde{p}_{j,t})=0$, we obtain $\hat{\beta}_{b,IV,kl}^{z}=\hat{\beta}_{b,IV,kl}^{z,X}$.

Hereafter, we assume the identifying assumptions in staggered DID-IV designs and Assumptions (i)-(iii). We focus on the between coefficient $\hat{\beta}_{b,IV,kU}^{z}=\hat{\beta}_{b,IV,kU}^{z,X}$ as it clarifies how covariates affect the interpretation of the TWFEIV estimand:

\begin{align*}
\hat{\beta}_{b,IV,kU}^{z}=\frac{\hat{C}(Y_{i,t},\tilde{Z}_{j,t})-\hat{C}(Y_{i,t},\tilde{p}_{j,t}^{kl})}{\hat{C}(D_{i,t},\tilde{Z}_{j,t})-\hat{C}(D_{i,t},\tilde{p}_{j,t}^{kl})},\quad j=k,U.
\end{align*}

Then, by calculations similar to those in the proof of Theorem 2, we obtain

\begin{align*}
\hat{C}(Y_{i,t},\tilde{Z}_{j,t}) &=\hat{V}_{kU}^{z}\hat{D}_{kU}^{2\times 2}\beta_{kU,IV}^{2\times 2}\\
&\xrightarrow{p}V_{kU}^{z}\cdot CAET_{k}^{1}(POST(k))\cdot CLATT_{k}^{CM}(POST(k)), \tag{83}
\end{align*}

and

\begin{align*}
\hat{C}(Y_{i,t},\tilde{p}_{j,t}^{kl}) &=\frac{n_{kU}(1-n_{kU})}{T}\sum_{t}(\bar{Y}_{kt}-\bar{Y}_{Ut})\cdot[(\bar{p}_{k,t}^{kU}-\bar{p}_{k}^{kU})-(\bar{p}_{U,t}^{kU}-\bar{p}_{U}^{kU})]\\
&\xrightarrow{p}\frac{N_{kU}(1-N_{kU})}{T}\sum_{t}\left\{E[Y_{i,t}(D_{i,t}^{\infty})|E_{i}=k]-E[Y_{i,t}(D_{i,t}^{\infty})|E_{i}=U]\right\}[(p_{k,t}^{kU}-p_{k}^{kU})-(p_{U,t}^{kU}-p_{U}^{kU})]\\
&+\frac{N_{kU}(1-N_{kU})}{T-(k-1)}\sum_{t\geq k}\underbrace{CAET_{k}\cdot CLATT_{k,t}}_{CAIET_{k,t}}\cdot[(p_{k,t}^{kU}-p_{k}^{kU})-(p_{U,t}^{kU}-p_{U}^{kU})], \tag{84}
\end{align*}

where $N_{kU}$ and $[(p_{k,t}^{kU}-p_{k}^{kU})-(p_{U,t}^{kU}-p_{U}^{kU})]$ are the probability limits of $n_{kU}$ and $[(\bar{p}_{k,t}^{kU}-\bar{p}_{k}^{kU})-(\bar{p}_{U,t}^{kU}-\bar{p}_{U}^{kU})]$, respectively. Equations (83) and (84) indicate that covariates affect the causal interpretation of $\hat{\beta}_{b,IV,kU}^{z}$ in two ways. First, they introduce the covariance between the difference in unexposed outcomes and the difference in the variation of the linear projection for cohorts $k$ and $U$ (the first term in equation (84)). Second, they introduce the covariance between $CAIET_{k,t}$ and the difference in the variation of the linear projection for cohorts $k$ and $U$ (the second term in equation (84)).