
Standard Errors When a Regressor is Randomly Assigned

Denis Chetverikov
UCLA
Department of Economics, UCLA, Los Angeles, CA 90095-1477 USA. Email: [email protected]
   Jinyong Hahn
UCLA
Department of Economics, UCLA, Los Angeles, CA 90095-1477 USA. Email: [email protected]
   Zhipeng Liao
UCLA
Department of Economics, UCLA, Los Angeles, CA 90095-1477 USA. Email: [email protected]
   Andres Santos
UCLA
Department of Economics, UCLA, Los Angeles, CA 90095-1477 USA. Email: [email protected]
Abstract

We examine asymptotic properties of the OLS estimator when the values of the regressor of interest are assigned randomly and independently of other regressors. We find that the OLS variance formula in this case is often simplified, sometimes substantially. In particular, when the regressor of interest is independent not only of other regressors but also of the error term, the textbook homoscedastic variance formula is valid even if the error term and auxiliary regressors exhibit a general dependence structure. In the context of randomized controlled trials, this conclusion holds in completely randomized experiments with constant treatment effects. When the error term is heteroscedastic with respect to the regressor of interest, the variance formula has to be adjusted not only for heteroscedasticity but also for the correlation structure of the error term. However, even in the latter case, some simplifications are possible as only a part of the correlation structure of the error term should be taken into account. In the context of randomized controlled trials, this implies that the textbook homoscedastic variance formula is typically not valid if treatment effects are heterogeneous, but heteroscedasticity-robust variance formulas are valid if treatment effects are independent across units, even if the error term exhibits a general dependence structure. In addition, we extend the results to the case when the regressor of interest is assigned randomly at a group level, such as in randomized controlled trials with treatment assignment determined at a group (e.g., school/village) level.


JEL Classification: C14, C31, C32

Keywords: Cluster Robust Inference; Randomized Control Trial

1 Introduction

Textbook discussion of linear regression usually begins with a standard model of the form 𝐘=𝐗θ+ϵ\mathbf{Y}=\mathbf{X}\theta+\bm{\epsilon}, where it is assumed that 𝐗\mathbf{X} is a nonstochastic matrix (with full column rank) of regressors and the error vector ϵ\bm{\epsilon} has mean zero and variance matrix proportional to an identity matrix. As is well known, such an assumption justifies the formula s2(𝐗𝐗)1s^{2}\left(\mathbf{X}^{\top}\mathbf{X}\right)^{-1} as an estimator of the variance of the OLS estimator, where s2s^{2} is equal to the sum of squares of the estimated residuals divided either by the sample size or the degrees of freedom. This formula is easy to use but, as is typically taught, may not be valid if the variance matrix Ω\Omega of the error vector ϵ\bm{\epsilon} is not proportional to an identity matrix. In such cases, the variance of the OLS estimator should be based on the formula (𝐗𝐗)1𝐗Ω𝐗(𝐗𝐗)1\left(\mathbf{X}^{\top}\mathbf{X}\right)^{-1}\mathbf{X}^{\top}\Omega\mathbf{X}\left(\mathbf{X}^{\top}\mathbf{X}\right)^{-1} to reflect the heteroscedasticity and dependence structure of the error vector. An important practical challenge in implementing such an approach is that the matrix Ω\Omega may be hard to estimate if the dependence structure of the error vector ϵ\bm{\epsilon} is unknown. In this paper, we study the implications for the variance of OLS estimators of having a regressor of interest whose values are i.i.d. across units/time periods and are independent of values of other regressors. The primary motivating applications for our analysis are randomized controlled trials in which units are independently assigned to some treatment level without a connection to observable characteristics. The main finding of the paper is that variance estimation in this case is often simplified, sometimes substantially.

Let \mathbf{D} be the column of \mathbf{X} corresponding to the regressor of interest and let \mathbf{W} be the remaining columns of \mathbf{X}; i.e. columns corresponding to controls. Our first main result shows that when the vector \mathbf{D} has i.i.d. components and is strongly exogenous in the sense of being independent not only of \mathbf{W} but also of \bm{\epsilon}, the OLS estimator is asymptotically normally distributed and the formula s^{2}\left(\mathbf{X}^{\top}\mathbf{X}\right)^{-1} actually yields a valid variance estimator for the coefficient of \mathbf{D} even if \Omega is not proportional to the identity matrix. This result, which superficially contradicts the lessons of elementary linear regression analysis, is due to the randomness of the \mathbf{X} matrix in our analysis. While the textbook analysis assumes away the randomness by conditioning on \mathbf{X}, we instead obtain our result by recognizing that the randomness of the \mathbf{X} matrix delivers a suitable martingale structure. (Footnote 1: The validity of the formula s^{2}\left(\mathbf{X}^{\top}\mathbf{X}\right)^{-1} does not mean that the variance estimators based on the formula \left(\mathbf{X}^{\top}\mathbf{X}\right)^{-1}\mathbf{X}^{\top}\Omega\mathbf{X}\left(\mathbf{X}^{\top}\mathbf{X}\right)^{-1} are invalid; see Lemma 1 in the Appendix for the asymptotic equivalence of these estimators in our setting.) We recognize that a version of this result in some simple contexts is well understood in the profession in the sense that many can anticipate such a result when the entire matrix of regressors is strongly exogenous; see references below. We go one step further, however, and establish our result for the case when (i) only one regressor is strongly exogenous (e.g., treatment in a randomized controlled trial); and/or (ii) the error vector is subject to some generalized dependence more complicated than what is commonly understood to be the cluster structure, e.g. temporal/spatial autocorrelation or a network structure. This result is important because it facilitates inference on the coefficient of the regressor of interest even if the researcher does not know the dependence structure of the error vector \bm{\epsilon}, which is useful from the pragmatic point of view. We emphasize, however, that while our conclusions hold for the coefficient corresponding to a strongly exogenous regressor, they need not hold for other coefficients in the regression.

Our second main result shows that when the vector \mathbf{D} has i.i.d. components and is independent of \mathbf{W} but \bm{\epsilon} is conditionally heteroscedastic with respect to \mathbf{D}, the formula s^{2}\left(\mathbf{X}^{\top}\mathbf{X}\right)^{-1} is actually not valid and has to be adjusted not only for heteroscedasticity but also, rather surprisingly, for the dependence structure of the vector \bm{\epsilon}. Nevertheless, a simplified variance formula can often be used in this case as well. For example, conditional heteroscedasticity arises when the regression model \mathbf{Y}=\mathbf{X}\theta+\bm{\epsilon} is taken from the potential outcomes framework with heterogeneous treatment effects. If the researcher is concerned about clustering in this case, our results imply that it is sufficient to adjust the variance formula for clustering of the treatment effects only. (Footnote 2: When the regression model \mathbf{Y}=\mathbf{X}\theta+\bm{\epsilon} is taken from the potential outcomes framework with heterogeneous treatment effects, the case of a strongly exogenous regressor corresponds to the assumption of constant treatment effects.) In contrast, for example, there is no need to adjust the variance formula for clustering of the potential outcomes in any given treatment arm. We also note that neither of our results restricts or excludes conditional heteroscedasticity of \bm{\epsilon} with respect to \mathbf{W}.

In addition, we extend both results to the case when the values of the regressor of interest are independent only across groups of units/time periods, such as is the case in randomized controlled trials in which treatment assignment is determined at a group (e.g., school/village) level. We show that in the strongly exogenous case, it suffices to take into account only the within-group correlation of the error vector ϵ\bm{\epsilon}. In other words, it suffices to use variance estimators that are clustered at the level at which treatment is assigned. In the conditional heteroscedasticity case, the variance formula still requires some adjustments for both heteroscedasticity and dependence but often allows for some simplifications relative to the standard textbook formula mentioned above.

Our first main result and its extension to the group-level assignment are related to but different from those in Barrios, Diamond, Imbens, and Kolesar (2012), who came to the same conclusions for regressions without controls and in which a fixed fraction of units/clusters is randomly assigned to be treated. To the best of our knowledge, however, there are no results in the literature related to our second main result. Our analysis is also related to Abadie, Athey, Imbens, and Wooldridge (2017), who presented a new clustering framework that emphasizes a finite population perspective as well as interactions between the sampling and assignment parts of the data-generating process. They established in particular that there is no need to cluster when estimating the variance if the randomization is conducted at the individual level and there is no heterogeneity in the treatment effects. Our first main result echoes and complements their findings in the following aspects. First, unlike them, we do not impose a particular structure on the sampling process, which allows us to cover general forms of time series or even network dependence in addition to the cluster-type dependence. Second, our analysis goes beyond the binary treatment framework and accommodates a general strongly exogenous regressor as well as the inclusion of additional controls in the regression. In particular, we allow for general dependence structures in the control variables, which makes it ex-ante unclear at what level one should cluster. Third, we rely on a traditional asymptotic framework, which may make our analysis more familiar to the reader. We recognize, however, that the third aspect may be a weakness in the sense that our framework is unable to address the finite population adjustment that plays an important role in Abadie, Athey, Imbens, and Wooldridge (2017). Finally, we note Bloom (2005), who considered a random effects type cluster structure and produced a variance estimator for the simple difference-of-means estimator that is quoted in Duflo, Glennerster, and Kremer (2007). His equation (4.3), which is presented without proof, is in fact a special case of the s^{2}\left(\mathbf{X}^{\top}\mathbf{X}\right)^{-1} formula. The cluster structure that he analyzed has a built-in correlation among observations, and as such, it would be tempting to think that variance estimation would require Moulton (1986)’s clustering formula – a conclusion that can be motivated if inference is to be conditioned on the \mathbf{X} matrix. Hence, in our view, his equation (4.3) can only be motivated by explicitly recognizing the randomness of the \mathbf{X} matrix.

Our results are not particularly difficult to derive. On the other hand, we are unaware of any systematic discussion of results along this line in the literature besides Barrios, Diamond, Imbens, and Kolesar (2012) and Abadie, Athey, Imbens, and Wooldridge (2017), especially in models where the control variables and treatment variables are both present. Our results have convenient pragmatic implications, which we hope are helpful to some empirical researchers.

Outline. We present the basic intuition underlying our results in Section 2. The intuitive discussion there suggests that asymptotic normality and the formula s^{2}\left(\mathbf{X}^{\top}\mathbf{X}\right)^{-1} are valid under a fairly general dependence structure in \bm{\epsilon} provided that the randomness of \mathbf{X} generates a suitable martingale structure in \mathbf{X}^{\top}\bm{\epsilon}. We also explain how conditional heteroscedasticity breaks down this martingale structure. We formalize our discussion in Section 3, where our main restriction on the dependence of \bm{\epsilon} is that it be weak enough for its sample average to converge in probability to zero – a condition that further emphasizes that our analysis is driven by the randomness in \mathbf{X} and not \bm{\epsilon}. We provide an extension to the case of group-level random assignment in Section 4.

Notation. We use KK to denote a generic strictly positive constant that may change from place to place but is independent of the sample size nn. For any positive integer kk, let 𝐈k\mathbf{I}_{k}, 𝟏k×1\mathbf{1}_{k\times 1}, and 𝟎k×1\mathbf{0}_{k\times 1} denote the k×kk\times k identity matrix, k×1k\times 1 vector of ones, and k×1k\times 1 vector of zeros. For any real square matrix AA, λmin(A)\lambda_{\min}(A) and λmax(A)\lambda_{\max}(A) denote its smallest and largest eigenvalues. We use ABA\equiv B to denote that AA is defined as BB, and (Aj)jJ(A_{j})_{j\leq J} to denote the matrix composed by sequentially stacking matrices A1,,AJA_{1},\ldots,A_{J} with equal number of columns.

2 Intuition

In this section, we provide intuition for the validity of the formula s^{2}\left(\mathbf{X}^{\top}\mathbf{X}\right)^{-1} in the case of strongly exogenous regressors. We first consider the case of a simple univariate regression model and then extend the result to the case of a multivariate regression model. At the end of this section, we explain the complications arising from conditional heteroscedasticity and how they break the validity of the formula s^{2}\left(\mathbf{X}^{\top}\mathbf{X}\right)^{-1}.

2.1 Case of Strong Exogeneity

We start by considering a simple univariate linear regression time series model in which we have

y_{i}=d_{i}\beta+\varepsilon_{i},\quad i=1,\ldots,n,   (1)

where (εi)in(\varepsilon_{i})_{i\leq n} is a second-order, possibly autocorrelated, stationary time series – we employ the index ii, rather than tt, to emphasize our analysis is not confined to the time series context. We depart from the textbook time series model by assuming that the regressors (di)in(d_{i})_{i\leq n} are: (i) Independent and identically distributed (i.i.d.) with mean zero, and (ii) Strongly exogenous in the sense that (di)in(d_{i})_{i\leq n} is independent of the time series process (εi)in(\varepsilon_{i})_{i\leq n}.

As is well-known, the least squares estimator β^\hat{\beta} of β\beta in model (1) satisfies the equality

\sqrt{n}(\hat{\beta}-\beta)=\frac{n^{-1/2}\sum_{i=1}^{n}d_{i}\varepsilon_{i}}{n^{-1}\sum_{i=1}^{n}d_{i}^{2}}.   (2)

In many standard time series textbooks, the asymptotic distribution of \hat{\beta} is thus derived by imposing sufficiently strong conditions to ensure that the score n^{-1/2}\sum_{i=1}^{n}d_{i}\varepsilon_{i} is asymptotically normal and the Hessian n^{-1}\sum_{i=1}^{n}d_{i}^{2} converges in probability to a nonstochastic limit. In order to derive the standard error for \hat{\beta}, we therefore only need a consistent estimator of the long-run variance of the score; i.e., a heteroscedasticity and autocorrelation consistent (HAC) variance estimator, such as those proposed by Newey and West (1987) and Andrews (1991).
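For concreteness, the following is a minimal sketch of this textbook route, a Bartlett-kernel (Newey-West-type) estimate of the long-run variance of the score; the function name, the bandwidth choice, and the usage line are ours and purely illustrative.

import numpy as np

def bartlett_long_run_variance(score, bandwidth):
    """Bartlett-kernel (Newey-West-type) estimate of the long-run variance of the
    score d_i * eps_i; `score` is a 1-D array, `bandwidth` a nonnegative integer."""
    n = len(score)
    s = score - score.mean()
    lrv = s @ s / n                                   # lag-0 autocovariance
    for lag in range(1, bandwidth + 1):
        gamma = s[lag:] @ s[:-lag] / n                # lag-k autocovariance
        lrv += 2.0 * (1.0 - lag / (bandwidth + 1)) * gamma
    return lrv

# HAC standard error for beta_hat in model (1), given arrays d and resid of length n:
# se = np.sqrt(bartlett_long_run_variance(d * resid, bandwidth=5) / n) / (d @ d / n)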

On the other hand, if the regressors (di)in(d_{i})_{i\leq n} are i.i.d. and strongly exogenous with mean zero, the independence of (di)in(d_{i})_{i\leq n} and (εi)in(\varepsilon_{i})_{i\leq n} implies that for any 1i1,i2n1\leq i_{1},i_{2}\leq n with i1i2i_{1}\neq i_{2}, we must have

\mathbb{E}\left[d_{i_{1}}\varepsilon_{i_{1}}d_{i_{2}}\varepsilon_{i_{2}}\right]=\mathbb{E}\left[d_{i_{1}}\right]\mathbb{E}\left[d_{i_{2}}\right]\mathbb{E}\left[\varepsilon_{i_{1}}\varepsilon_{i_{2}}\right]=0,

and also 𝔼[(diεi)2]=𝔼[di2]𝔼[εi2]\mathbb{E}[(d_{i}\varepsilon_{i})^{2}]=\mathbb{E}[d_{i}^{2}]\mathbb{E}[\varepsilon_{i}^{2}] for all 1in1\leq i\leq n. Hence, as long as some version of the central limit theorem is applicable to the score n1/2i=1ndiεin^{-1/2}\sum_{i=1}^{n}d_{i}\varepsilon_{i} and a law of large numbers is applicable to the Hessian n1i=1ndi2n^{-1}\sum_{i=1}^{n}d_{i}^{2}, we can conclude that the asymptotic distribution of β^\hat{\beta} is given by

\sqrt{n}(\hat{\beta}-\beta)\rightarrow_{d}N\left(0,\ \frac{\mathbb{E}\left[\varepsilon_{i}^{2}\right]}{\mathbb{E}\left[d_{i}^{2}\right]}\right).

In particular, statistical inference on β\beta can be conducted as if it were not a time series model, i.e. using the formula s2(𝐗𝐗)1s^{2}\left(\mathbf{X}^{\top}\mathbf{X}\right)^{-1}, where 𝐗(di)in\mathbf{X}\equiv(d_{i})_{i\leq n} in this case.

The main takeaway of the preceding example is that the strong exogeneity and i.i.d. nature of the regressors (di)in(d_{i})_{i\leq n} imply that the sequence (diεi)in(d_{i}\varepsilon_{i})_{i\leq n} is homoscedastic even if the errors (εi)in(\varepsilon_{i})_{i\leq n} are arbitrarily autocorrelated. Is this simplification confined to the time series model? Our preceding discussion suggests that this is not the case. Indeed, provided the regressors (di)in(d_{i})_{i\leq n} are i.i.d., mean zero, and strongly exogenous, the score n1/2i=1ndiεin^{-1/2}\sum_{i=1}^{n}d_{i}\varepsilon_{i} has a built-in martingale structure vis-à-vis the filtration i\mathcal{F}_{i} generated by (dj,εj)ji(d_{j},\varepsilon_{j})_{j\leq i}^{\top} because:

\mathbb{E}\left[\left.d_{i}\varepsilon_{i}\right|\mathcal{F}_{i-1}\right]=\mathbb{E}\left[\left.\mathbb{E}\left[\left.d_{i}\right|\mathcal{F}_{i-1},\varepsilon_{i}\right]\varepsilon_{i}\right|\mathcal{F}_{i-1}\right]=\mathbb{E}[d_{i}]\mathbb{E}[\varepsilon_{i}|\mathcal{F}_{i-1}]=0.   (3)

Therefore, assuming that the random pairs (d_{i},\varepsilon_{i}) satisfy certain moment conditions, the martingale central limit theorem will be applicable regardless of the dependence structure of (\varepsilon_{i})_{i\leq n} and the long-run variance of the score will reduce to \mathbb{E}[d_{i}^{2}]\mathbb{E}[\varepsilon_{i}^{2}]. In particular, the variance formula s^{2}\left(\mathbf{X}^{\top}\mathbf{X}\right)^{-1} will remain valid despite the dependence present in the variables (\varepsilon_{i})_{i\leq n}. Thus, spatial correlation, network dependence, and/or a cluster structure in the variables (\varepsilon_{i})_{i\leq n} are all accommodated by the standard homoscedastic standard errors. Moreover, we note that a quick inspection of the preceding argument reveals that the assumptions we have imposed so far are stronger than necessary for the desired conclusion to hold.
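As a concrete illustration, the following small simulation sketch (a hypothetical AR(1) error process and a randomly assigned, demeaned binary regressor; all names and parameter values are ours, not from the paper) compares the empirical spread of \hat{\beta} with the average textbook homoscedastic standard error.

import numpy as np

rng = np.random.default_rng(0)
n, reps, rho, beta = 500, 2000, 0.8, 1.0
beta_hats, classic_ses = [], []

for _ in range(reps):
    # Strongly autocorrelated AR(1) errors: a "general dependence structure".
    eps = rng.normal(size=n)
    for i in range(1, n):
        eps[i] += rho * eps[i - 1]
    # i.i.d., mean-zero, strongly exogenous regressor (demeaned random assignment).
    d = rng.choice([-0.5, 0.5], size=n)
    y = d * beta + eps
    beta_hat = (d @ y) / (d @ d)
    resid = y - d * beta_hat
    s2 = resid @ resid / n
    beta_hats.append(beta_hat)
    classic_ses.append(np.sqrt(s2 / (d @ d)))   # textbook homoscedastic SE

print("empirical sd of beta_hat:", np.std(beta_hats))
print("average homoscedastic SE:", np.mean(classic_ses))   # close, despite autocorrelation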

We next build on our preceding discussion by considering the multivariate linear regression model

y_{i}=d_{i}\beta+w_{i}^{\top}\gamma+\varepsilon_{i},\quad i=1,\ldots,n,   (4)

where did_{i} is a scalar regressor of interest, wiw_{i} is a dwd_{w}-vector of controls, and (wi,εi)in(w_{i},\varepsilon_{i})_{i\leq n} is a second-order, possibly autocorrelated, stationary time series satisfying 𝔼[wiεi]=𝟎dw×1\mathbb{E}[w_{i}\varepsilon_{i}]=\mathbf{0}_{d_{w}\times 1} for all 1in1\leq i\leq n. We continue to assume that the regressors (di)in(d_{i})_{i\leq n} are i.i.d. with mean zero, and strongly exogenous in the sense that (di)in(d_{i})_{i\leq n} is independent of the time series process (εi,wi)in(\varepsilon_{i},w_{i}^{\top})_{i\leq n}. The parameter of interest continues to be β\beta.

For this model, the Frisch-Waugh-Lovell theorem implies the least squares estimator β^\hat{\beta} satisfies

\sqrt{n}\left(\hat{\beta}-\beta\right)=\frac{n^{-1/2}\sum_{i=1}^{n}\left(d_{i}-w_{i}^{\top}\hat{\alpha}\right)\varepsilon_{i}}{n^{-1}\sum_{i=1}^{n}\left(d_{i}-w_{i}^{\top}\hat{\alpha}\right)^{2}},\qquad\hat{\alpha}\equiv\left(\sum_{i=1}^{n}w_{i}w_{i}^{\top}\right)^{-1}\sum_{i=1}^{n}w_{i}d_{i}.   (5)

Hence, under appropriate regularity conditions the estimator β^\hat{\beta} admits the asymptotic expansion

\sqrt{n}\left(\hat{\beta}-\beta\right)=\frac{n^{-1/2}\sum_{i=1}^{n}\left(d_{i}-w_{i}^{\top}\alpha\right)\varepsilon_{i}}{n^{-1}\sum_{i=1}^{n}\left(d_{i}-w_{i}^{\top}\alpha\right)^{2}}+o_{p}(1),\qquad\alpha\equiv(\mathbb{E}[w_{i}w_{i}^{\top}])^{-1}\mathbb{E}[w_{i}d_{i}].   (6)

In particular, if (d_{i})_{i\leq n} is mean zero and independent of (w_{i}^{\top})_{i\leq n}, then \alpha=0 and the asymptotic expansion reduces to the univariate setting – i.e. the right-hand side of (6) is (asymptotically) equivalent to the right-hand side of (2). It therefore follows that the end result is the same as in the univariate case: The variance formula s^{2}\left(\mathbf{X}^{\top}\mathbf{X}\right)^{-1}, where \mathbf{X}\equiv(d_{i},w_{i}^{\top})_{i\leq n} in this case, remains valid for \hat{\beta}. We emphasize, however, that this formula is not necessarily justified for conducting inference on the coefficient \gamma.
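To illustrate the multivariate case, a minimal sketch follows (with a hypothetical data-generating process of our own: w_{i} and \varepsilon_{i} are independent AR(1) series, d_{i} is randomly assigned); it computes the textbook formula s^{2}\left(\mathbf{X}^{\top}\mathbf{X}\right)^{-1}, whose reported standard error is valid for the coefficient on d_{i} but not necessarily for the coefficient on w_{i}.

import numpy as np

def ols_classic(X, y):
    """OLS coefficients and the textbook standard errors from s^2 (X'X)^{-1}."""
    XtX_inv = np.linalg.inv(X.T @ X)
    coef = XtX_inv @ (X.T @ y)
    resid = y - X @ coef
    s2 = resid @ resid / len(y)
    return coef, np.sqrt(s2 * np.diag(XtX_inv))

def ar1(n, rho, rng):
    x = rng.normal(size=n)
    for i in range(1, n):
        x[i] += rho * x[i - 1]
    return x

rng = np.random.default_rng(1)
n = 1000
w = ar1(n, 0.8, rng)                       # autocorrelated control
eps = ar1(n, 0.8, rng)                     # autocorrelated error, independent of w
d = rng.binomial(1, 0.5, size=n)           # randomly assigned regressor of interest
y = 1.0 * d + 0.7 * w + eps
X = np.column_stack([d, np.ones(n), w])    # d first, then intercept and control
coef, se = ols_classic(X, y)
print("beta_hat:", coef[0], "  classic SE for beta:", se[0])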

As a preview of results in the next section, we again note that the conditions we have imposed so far are stronger than required. For instance, suppose that instead of demanding that the regressors (di)in(d_{i})_{i\leq n} and (wi)in(w_{i}^{\top})_{i\leq n} be fully independent, we assume that they are related according to the model

d_{i}=w_{i}^{\top}\alpha+\eta_{i},

for (\eta_{i})_{i\leq n} i.i.d. and independent of (\varepsilon_{i},w_{i}^{\top})_{i\leq n}. The asymptotic expansion in (6), combined with the same arguments employed in the univariate case, then continues to imply that

\sqrt{n}(\hat{\beta}-\beta)\rightarrow_{d}N\left(0,\ \frac{\mathbb{E}\left[\varepsilon_{i}^{2}\right]}{\mathbb{E}\left[\eta_{i}^{2}\right]}\right)

under mild moment conditions – i.e. when computing standard errors for β^\hat{\beta} we can continue to pretend that the time series process fits the textbook homoscedastic model. Setting wiw_{i} to be a constant, for instance, reveals that the mean zero assumption on (di)in(d_{i})_{i\leq n} is superfluous.

2.2 Case of Conditional Heteroscedasticity

The preceding martingale argument relies crucially on two key assumptions: (i) The regressors (d_{i})_{i\leq n} are independent of each other, and (ii) The regressors (d_{i})_{i\leq n} are independent of the errors (\varepsilon_{i})_{i\leq n}. A challenge to our martingale argument arises when (d_{i})_{i\leq n} and (\varepsilon_{i})_{i\leq n} are not independent. Within the potential outcome framework, for instance, this full independence requirement is violated in the presence of heterogeneous treatment effects. More precisely, heterogeneous treatment effects render \varepsilon_{i} conditionally heteroscedastic with respect to the treatment status d_{i}. Motivated by this observation, we also study a model in which \varepsilon_{i}=\sum_{l=1}^{L}\sigma_{l}(d_{i})\varepsilon_{l,i}^{*} with e_{i}\equiv(\varepsilon_{l,i}^{*})_{l\leq L} possibly correlated across i and l, but (e_{i})_{i\leq n} fully independent of (d_{i})_{i\leq n}. In the potential outcome framework, with d_{i}\in\{0,1\} indicating treatment status, we would have

\varepsilon_{i}=(y_{i}(0)-\mathbb{E}[y_{i}(0)])+d_{i}((y_{i}(1)-y_{i}(0))-\mathbb{E}[y_{i}(1)-y_{i}(0)]),   (7)

where yi(d)y_{i}(d) denotes the potential outcome for unit ii under treatment status d{0,1}d\in\{0,1\}.

To see the problem for the martingale structure in this model, observe that for any 1\leq i_{1},i_{2}\leq n with i_{1}\neq i_{2}, we now have

\mathbb{E}[d_{i_{1}}\varepsilon_{i_{1}}d_{i_{2}}\varepsilon_{i_{2}}]=\mathbb{E}\left[d_{i_{1}}\sum_{l_{1}=1}^{L}\sigma_{l_{1}}(d_{i_{1}})\varepsilon_{l_{1},i_{1}}^{*}\,d_{i_{2}}\sum_{l_{2}=1}^{L}\sigma_{l_{2}}(d_{i_{2}})\varepsilon_{l_{2},i_{2}}^{*}\right]=\sum_{l_{1}=1}^{L}\sum_{l_{2}=1}^{L}\mathbb{E}\left[d_{i_{1}}\sigma_{l_{1}}(d_{i_{1}})\right]\mathbb{E}\left[d_{i_{2}}\sigma_{l_{2}}(d_{i_{2}})\right]\mathbb{E}[\varepsilon_{l_{1},i_{1}}^{*}\varepsilon_{l_{2},i_{2}}^{*}],

which is not necessarily zero even if the d_{i}'s are mean zero. The variance formula for the sum \sum_{i\leq n}d_{i}\varepsilon_{i} therefore must include interaction terms as long as the random vectors (\varepsilon_{l,i_{1}}^{*})_{l\leq L} and (\varepsilon_{l,i_{2}}^{*})_{l\leq L} are correlated.

We will present a detailed analysis of the conditional heteroscedasticity case in Section 3 but the main takeaways from our results are: (i) The independence of (di)in(d_{i})_{i\leq n} from controls (wi)in(w_{i}^{\top})_{i\leq n} still simplifies the asymptotic variance for β^\hat{\beta}; and (ii) Conditional heteroscedasticity yields a break in the martingale structure that requires us to adjust standard errors not only for heteroscedasticity, but also for correlation of the errors across units. For example, in the context of (7), our analysis implies the asymptotic variance of n(β^β)\sqrt{n}(\hat{\beta}-\beta) equals (the probability limit of)

\frac{1}{n\sigma_{d}^{4}}\sum_{i=1}^{n}\mathbb{E}[(d_{i}-\mathbb{E}[d_{i}])^{2}\varepsilon_{i}^{2}]+\operatorname{Var}\left(n^{-1/2}\sum_{i=1}^{n}(y_{i}(1)-y_{i}(0))\right)-n^{-1}\sum_{i=1}^{n}\operatorname{Var}\left(y_{i}(1)-y_{i}(0)\right).   (8)

In particular, we note that standard errors may need to be adjusted for correlation if we are concerned that the treatment effects are correlated. On the other hand, the correlation between components of the vector (y_{i}(0))_{i\leq n} plays no role for the standard errors, and neither does correlation between components of the vector (y_{i}(1))_{i\leq n}.
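To make expression (8) concrete, the sketch below evaluates its three terms in a hypothetical potential-outcomes design of our own in which both y_{i}(0) and the treatment effects share cluster-level shocks; the treatment effects appear directly only because this is a simulation of the population expression, not a feasible estimator.

import numpy as np

rng = np.random.default_rng(2)
n_clusters, m = 200, 5
n = n_clusters * m
cluster = np.repeat(np.arange(n_clusters), m)

y0 = rng.normal(size=n_clusters)[cluster] + rng.normal(size=n)          # clustered y_i(0)
tau = 1.0 + rng.normal(size=n_clusters)[cluster] + rng.normal(size=n)   # clustered effects
d = rng.binomial(1, 0.5, size=n)                                        # individual assignment
p, sigma_d2 = 0.5, 0.25

eps = (y0 - y0.mean()) + d * (tau - tau.mean())           # error term as in (7), sample-centered
term1 = np.mean((d - p) ** 2 * eps ** 2) / sigma_d2 ** 2  # heteroscedasticity part of (8)
tau_c = (tau - tau.mean()).reshape(n_clusters, m)
term2 = (tau_c.sum(axis=1) ** 2).sum() / n                # plug-in for Var(n^{-1/2} sum of effects)
term3 = np.var(tau)                                       # plug-in for average Var(y_i(1) - y_i(0))
print("variance ignoring treatment-effect correlation:", term1)
print("adjustment for correlated treatment effects:", term2 - term3)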

3 Main Theory

We now present a rigorous theory showing that the variance formula s2(𝐗𝐗)1s^{2}\left(\mathbf{X}^{\top}\mathbf{X}\right)^{-1} for the OLS estimator is valid in the strongly exogenous case even if the errors in the regression model are correlated. We also derive the variance formula in the case of conditional heteroscedasticity. In what follows, we suppose an outcome yiy_{i} satisfies

y_{i}=x_{i}^{\top}\theta+\varepsilon_{i},\quad i=1,\ldots,n,   (9)

where xi=(di,wi)x_{i}=(d_{i},w_{i}^{\top})^{\top} is a vector of regressors, with did_{i} being a key regressor and wiw_{i} being a vector of controls, θ=(β,γ)\theta=(\beta,\gamma^{\top})^{\top} is a vector of parameters, with β\beta being a parameter of interest and γ\gamma being a vector of nuisance parameters, and εi\varepsilon_{i} is an error term with mean zero. More explicitly, the regression model (9) can be rewritten as

y_{i}=d_{i}\beta+w_{i}^{\top}\gamma+\varepsilon_{i},\quad i=1,\ldots,n.   (10)

We assume that the first component of the vector wiw_{i} is a (non-zero) constant, meaning that the regression model (10) contains the intercept term. For notational simplicity, we set 𝐘(yi)in\mathbf{Y}\equiv(y_{i})_{i\leq n}, 𝐃(di)in\mathbf{D}\equiv(d_{i})_{i\leq n}, 𝐖(wi)in\mathbf{W}\equiv(w_{i}^{\top})_{i\leq n}, 𝐗(xi)in\mathbf{X}\equiv(x_{i}^{\top})_{i\leq n}, and ϵ(εi)in\bm{\epsilon}\equiv(\varepsilon_{i})_{i\leq n}.

Under a suitable exogeneity assumption on (di,wi)(d_{i},w_{i}^{\top})^{\top}, the unknown parameter θ(β,γ)\theta\equiv(\beta,\gamma^{\top})^{\top} can be estimated by OLS:

\hat{\theta}\equiv(\hat{\beta},\hat{\gamma}^{\top})^{\top}\equiv(\mathbf{X}^{\top}\mathbf{X})^{-1}(\mathbf{X}^{\top}\mathbf{Y}).

Standard estimators of the asymptotic variance of \hat{\theta} rely on the asymptotic variance of the “score” n^{-1/2}\sum_{i=1}^{n}(d_{i},w_{i}^{\top})^{\top}\varepsilon_{i}, which may take a complicated form due to possible dependence between observations. This relationship makes standard error estimation and statistical inference challenging in practice. For instance, cluster-robust standard errors, such as those proposed by Moulton (1986), Liang and Zeger (1986), and Arellano (1987), are predicated on knowledge of the relevant group structure at which to cluster; see Hansen (2007), Ibragimov and Müller (2010), and Ibragimov and Müller (2016) for related discussion. Similarly, spatial standard errors, such as those proposed by Conley (1999), often require knowledge of a measure of “economic distance” that relates to the degree of dependence across observations. However, we next show that when \mathbf{D} is independent of \mathbf{W}, such as in randomized controlled trials, the asymptotic variance for \hat{\beta} simplifies significantly. As a result, estimation of standard errors and asymptotically valid inference simplify as well.

3.1 Case of Strong Exogeneity

We first study the case in which \mathbf{D} is independent of \bm{\epsilon}, and hence \bm{\epsilon} is conditionally homoscedastic with respect to \mathbf{D}. Together with the independence of \mathbf{D} from \mathbf{W}, this means that \mathbf{D} is strongly exogenous. The conditional heteroscedasticity case is discussed later in this section. In the assumptions that follow, recall that K should be interpreted to denote a sufficiently large constant that can change from place to place but is independent of the sample size n.

Assumption 1.

(i) The random variables d_{i}, 1\leq i\leq n, are i.i.d. with mean \mu_{d} and variance \sigma_{d}^{2}; (ii) (d_{i})_{i\leq n} is independent of (w_{i}^{\top})_{i\leq n}; (iii) there exists a constant \delta_{1}>0 such that \max_{i\leq n}\mathbb{E}[|d_{i}|^{2+\delta_{1}}]\leq K; (iv) \sigma_{d}^{2}\geq K^{-1}; (v) (d_{i})_{i\leq n} is independent of (\varepsilon_{i})_{i\leq n}.

Assumption 2.

(i) \lambda_{\min}(n^{-1}\sum_{i=1}^{n}w_{i}w_{i}^{\top})\geq K^{-1}+o_{p}(1); (ii) \max_{i\leq n}\mathbb{E}[\|w_{i}\|^{2}]\leq K; (iii) the first component of the vectors w_{i}, 1\leq i\leq n, is a non-zero constant.

Assumption 3.

(i) n^{-1}\sum_{i=1}^{n}w_{i}\varepsilon_{i}=o_{p}(1); (ii) n^{-1}\sum_{i=1}^{n}\varepsilon_{i}^{2}=n^{-1}\sum_{i=1}^{n}\mathbb{E}[\varepsilon_{i}^{2}]+o_{p}(1); (iii) there exists a constant \delta_{2}>0 such that \max_{i\leq n}\mathbb{E}[|\varepsilon_{i}|^{2+\delta_{2}}]\leq K; (iv) n^{-1}\sum_{i=1}^{n}\mathbb{E}[\varepsilon_{i}^{2}]\geq K^{-1}.

Assumption 1 contains our main requirements for the regressor of interest. In particular, Assumptions 1(i) and 1(ii) are our key conditions that are satisfied in many randomized control trials. Assumptions 1(iii) and 1(iv) are mild moment conditions. Assumption 1(v) means that (di)in(d_{i})_{i\leq n} is strongly exogenous. Assumption 2 contains our main requirements for the controls. In particular, Assumption 2(i) means that there is no multicollinearity among controls. Assumption 2(ii) is a mild moment condition. Assumption 2(iii) means that we study regressions with an intercept. Assumption 3 contains our main requirements for the regression error. Assumption 3(i) holds if εi\varepsilon_{i}’s are uncorrelated with wiw_{i}’s and a law of large numbers applies to the product wiεiw_{i}\varepsilon_{i}. Assumption 3(ii) is essentially a law of large numbers for εi2\varepsilon_{i}^{2}’s. Assumptions 3(iii) and 3(iv) are mild moment conditions. We highlight that our assumptions allow for a wide array of dependence structures in the matrix (εi,wi)in(\varepsilon_{i},w_{i}^{\top})_{i\leq n}, with the main condition in this regard intuitively being that dependence be “weak” enough for the law of large numbers imposed in Assumptions 3(i) and 3(ii) to apply.

For all 1\leq i\leq n, denote d_{i}^{*}\equiv d_{i}-\mu_{d}. The following theorem derives the asymptotic distribution of the OLS estimator \hat{\beta} in the strongly exogenous case.

Theorem 1.

Let Assumptions 1, 2 and 3 hold. Then

\frac{\sqrt{n}(\hat{\beta}-\beta)}{\sigma_{\varepsilon}/\sigma_{d}}=\frac{n^{-1/2}\sum_{i=1}^{n}d_{i}^{*}\varepsilon_{i}}{\sigma_{d}\sigma_{\varepsilon}}+o_{p}(1)\rightarrow_{d}N(0,1),

where \sigma_{\varepsilon}^{2}\equiv n^{-1}\sum_{i=1}^{n}\operatorname{Var}(\varepsilon_{i}).

This theorem establishes two key facts. First, it shows that, given the strong exogeneity of (di)in(d_{i})_{i\leq n}, our mild requirements on the dependence structure of (εi,wi)in(\varepsilon_{i},w_{i}^{\top})_{i\leq n} suffice for establishing asymptotic normality of β^\hat{\beta}. To establish such a conclusion, we rely on a martingale construction that generalizes our discussion in Section 2. Second, Theorem 1 establishes that the asymptotic variance of β^\hat{\beta} is not affected by the possible correlation across vectors wiεiw_{i}\varepsilon_{i} since it only depends on the variance of did_{i} and the averaged variance of the error terms (εi)in(\varepsilon_{i})_{i\leq n}. We emphasize that neither of these conclusions need hold for the estimator γ^\hat{\gamma} of the coefficient γ\gamma corresponding to the vectors of controls wiw_{i}.

In addition, Theorem 1 suggests that we can estimate the variance of \hat{\beta} by s^{2}(\mathbf{\breve{D}}^{\top}\mathbf{\breve{D}})^{-1}, where

s^{2}\equiv n^{-1}(\mathbf{Y}-\mathbf{D}\hat{\beta}-\mathbf{W}\hat{\gamma})^{\top}(\mathbf{Y}-\mathbf{D}\hat{\beta}-\mathbf{W}\hat{\gamma}),   (11)

\mathbf{\breve{D}}\equiv\mathbf{M}_{W}\mathbf{D} and \mathbf{M}_{W}\equiv\mathbf{I}_{n}-\mathbf{W}(\mathbf{W}^{\top}\mathbf{W})^{-1}\mathbf{W}^{\top}. The following corollary confirms this conjecture.

Corollary 1.

Let Assumptions 1, 2 and 3 hold. Then

s^{2}(n^{-1}\mathbf{\breve{D}}^{\top}\mathbf{\breve{D}})^{-1}=\sigma_{\varepsilon}^{2}/\sigma_{d}^{2}+o_{p}(1)\quad\text{and}\quad\frac{\hat{\beta}-\beta}{\sqrt{s^{2}(\mathbf{\breve{D}}^{\top}\mathbf{\breve{D}})^{-1}}}\rightarrow_{d}N(0,1).   (12)

Together with Theorem 1, this corollary is our first main result. Indeed, it is well-known that the top left element of the matrix (\mathbf{X}^{\top}\mathbf{X})^{-1} coincides with (\mathbf{\breve{D}}^{\top}\mathbf{\breve{D}})^{-1}. Therefore, this corollary justifies using the classic standard error formula s^{2}\left(\mathbf{X}^{\top}\mathbf{X}\right)^{-1} for inference on \beta, even though we allow for general dependence structures in the errors (\varepsilon_{i})_{i\leq n} and controls (w_{i}^{\top})_{i\leq n}.
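In code, the estimator from Corollary 1 amounts to residualizing \mathbf{D} on \mathbf{W} and scaling the usual s^{2}; a minimal sketch (function and variable names are ours) is:

import numpy as np

def classic_se_for_beta(D, W, Y):
    """Estimate beta and its standard error via s^2 (D_breve' D_breve)^{-1},
    where D_breve = M_W D residualizes D on the controls W (W contains a constant)."""
    n = len(Y)
    D_breve = D - W @ np.linalg.solve(W.T @ W, W.T @ D)   # M_W D
    X = np.column_stack([D, W])
    theta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
    resid = Y - X @ theta_hat
    s2 = resid @ resid / n                                # as in (11)
    return theta_hat[0], np.sqrt(s2 / (D_breve @ D_breve))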


Remark. Theorem 1 and Corollary 1 can be extended to allow for dependence between d_{i} and w_{i}. Indeed, suppose that d_{i} depends linearly on w_{i}:

d_{i}=w_{i}^{\top}\alpha+\eta_{i},

where the \eta_{i}'s satisfy the conditions of Assumption 1 imposed on d_{i}. By arguments that are similar to those in the proof of Theorem 1 and Corollary 1, it is then straightforward to show that

\frac{\sqrt{n}(\hat{\beta}-\beta)}{\sigma_{\varepsilon}/\sigma_{\eta}}\rightarrow_{d}N(0,1),

where \sigma_{\eta}^{2} denotes the variance of the \eta_{i}'s, and that the convergence results (12) still hold. ∎


Remark. Theorem 1 and Corollary 1 can also be extended to instrumental variable (IV) estimators. Indeed, suppose that

d_{i}=v_{i}\rho+w_{i}^{\top}\alpha+\eta_{i},   (13)

where v_{i} is an instrumental variable satisfying the conditions of Assumption 1 imposed on d_{i} and \eta_{i} is a (first-stage) regression error satisfying the conditions of Assumption 3 imposed on \varepsilon_{i}. In addition, denote \mathbf{V}\equiv(v_{i})_{i\leq n} and define \mathbf{M}_{V,W} and \mathbf{M}_{V} the same way as \mathbf{M}_{W} with \mathbf{W} replaced by (\mathbf{V},\mathbf{W}) and \mathbf{V}, respectively. The two-stage least squares (2SLS) estimator of \beta then satisfies

\hat{\beta}_{2sls}=\frac{\mathbf{D}^{\top}(\mathbf{M}_{W}-\mathbf{M}_{V,W})\mathbf{Y}}{\mathbf{D}^{\top}(\mathbf{M}_{W}-\mathbf{M}_{V,W})\mathbf{D}}.

Using Assumptions 2 and 3, it is then possible to show that

\sqrt{n}(\hat{\beta}_{2sls}-\beta)=\frac{n^{-1/2}\sum_{i=1}^{n}(v_{i}-\mu_{v})\varepsilon_{i}}{\rho\sigma_{v}^{2}}+o_{p}(1),

where \mu_{v} is the mean of the v_{i}'s and \sigma_{v}^{2} is the variance of the v_{i}'s. Therefore,

\frac{\sqrt{n}(\hat{\beta}_{2sls}-\beta)}{\sigma_{\varepsilon}/(\rho\sigma_{v})}\rightarrow_{d}N(0,1).

It is clear that \sigma_{\varepsilon}^{2} can be estimated by the same s^{2} as that in (11) with \hat{\beta} replaced by \hat{\beta}_{2sls}, \rho can be estimated by OLS on the (first-stage) regression (13), and \sigma_{v}^{2} can be estimated by n^{-1}\mathbf{\tilde{V}}^{\top}\mathbf{\tilde{V}}, where \mathbf{\tilde{V}}=\mathbf{V}-\mathbf{1}_{n\times 1}(n^{-1}\mathbf{V}^{\top}\mathbf{1}_{n\times 1}). ∎
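A sketch of the corresponding computation follows (illustrative only; in particular, the residual used for s^{2} is formed by regressing y_{i}-d_{i}\hat{\beta}_{2sls} on w_{i}, which we assume is an acceptable plug-in).

import numpy as np

def tsls_beta_and_se(D, V, W, Y):
    """2SLS estimate of beta from the remark above, with the simplified standard
    error sigma_eps / (rho * sigma_v * sqrt(n)); V is the randomly assigned
    instrument and W contains the controls, including a constant."""
    n = len(Y)
    def annihilator(Z):
        return np.eye(n) - Z @ np.linalg.solve(Z.T @ Z, Z.T)
    VW = np.column_stack([V, W])
    A = annihilator(W) - annihilator(VW)                         # M_W - M_{V,W}
    beta_2sls = (D @ A @ Y) / (D @ A @ D)
    rho_hat = np.linalg.solve(VW.T @ VW, VW.T @ D)[0]            # first-stage OLS coefficient on V
    gamma_hat = np.linalg.solve(W.T @ W, W.T @ (Y - D * beta_2sls))
    eps_hat = Y - D * beta_2sls - W @ gamma_hat                  # residual used for s^2
    s2 = eps_hat @ eps_hat / n
    return beta_2sls, np.sqrt(s2 / (rho_hat ** 2 * np.var(V) * n))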

3.2 Case of Conditional Heteroscedasticity

Next, we derive the asymptotic variance for \hat{\beta} in the case of conditional heteroscedasticity, i.e. when (\varepsilon_{i})_{i\leq n} is conditionally heteroscedastic with respect to (d_{i})_{i\leq n}. Following the notation introduced in Section 2, we focus on the case in which \varepsilon_{i}\equiv\sum_{l\leq L}\sigma_{l}(d_{i})\varepsilon_{l,i}^{\ast}, where e_{i}\equiv(\varepsilon_{l,i}^{\ast})_{l\leq L} is a vector with mean zero, for all 1\leq i\leq n.

Let A_{i}\equiv(d_{i}-\mu_{d})(\sigma_{l}(d_{i}))_{l\leq L} for all 1\leq i\leq n and observe that under Assumption 1(i), the random vectors A_{i} are i.i.d. Denote their common mean vector by \mu_{A}. In addition, denote \sigma_{e,1}^{2}\equiv n^{-1}\sum_{i=1}^{n}\mathbb{E}[((A_{i}-\mu_{A})^{\top}e_{i})^{2}] and \sigma_{e,2}^{2}\equiv\mathbb{E}[(n^{-1/2}\sum_{i=1}^{n}\mu_{A}^{\top}e_{i})^{2}]. Within this context, we impose the following assumptions.

Assumption 4.

(i) (d_{i})_{i\leq n} is independent of (e_{i})_{i\leq n}; (ii) the functions \sigma_{l}, 1\leq l\leq L, are bounded; (iii) \sigma_{e,1}^{2}\geq K^{-1}.

Assumption 5.

(i) \|n^{-1}\sum_{i=1}^{n}e_{i}e_{i}^{\top}-n^{-1}\sum_{i=1}^{n}\mathbb{E}[e_{i}e_{i}^{\top}]\|=o_{p}(1); (ii) there exists a constant \delta_{3}>0 such that n^{-1}\sum_{i=1}^{n}\|e_{i}\|^{2+\delta_{3}}=O_{p}(1); (iii) \sigma_{e,2}^{-1}n^{-1/2}\sum_{i=1}^{n}\mu_{A}^{\top}e_{i}\to_{d}N(0,1).

Assumption 4 is mainly used to replace Assumption 1(v) and accounts for the conditional heteroscedasticity of (\varepsilon_{i})_{i\leq n} with respect to (d_{i})_{i\leq n}. Assumption 4(i) requires that (d_{i})_{i\leq n} be strongly exogenous with respect to the “scaled” error vectors (e_{i})_{i\leq n} – a requirement that, as discussed in Section 2, maps well into a potential outcome framework with heterogeneous treatment effects. Assumption 4(ii) imposes upper bounds on the functions \sigma_{l}(\cdot), which we view as a mild regularity condition. Assumption 4(iii) is a mild moment condition. Assumption 5 contains further restrictions on the vectors e_{i}. Assumption 5(i) is essentially a law of large numbers for the e_{i}e_{i}^{\top}'s. Assumption 5(ii) is essentially a moment condition for the random variables \|e_{i}\|. Assumption 5(iii) limits the amount of dependence among the vectors e_{i} to ensure convergence in distribution.

The following theorem, which is our second main result, derives the asymptotic distribution of the OLS estimator β^\hat{\beta} in the case of conditional heteroscedasticity.

Theorem 2.

Let Assumptions 1(i)-(iv), 2, 3, 4, and 5 hold. Then

\frac{\sqrt{n}(\hat{\beta}-\beta)}{\sigma_{d\varepsilon}/\sigma_{d}^{2}}=\frac{n^{-1/2}\sum_{i=1}^{n}d_{i}^{*}\varepsilon_{i}}{\sigma_{d\varepsilon}}+o_{p}(1)\to_{d}N(0,1),   (14)

where \sigma_{d\varepsilon}^{2}\equiv\sigma_{e,1}^{2}+\sigma_{e,2}^{2}.

To establish this theorem, we decompose the score n^{-1/2}\sum_{i=1}^{n}d_{i}^{*}\varepsilon_{i} into the sum of two uncorrelated terms,

n^{-1/2}\sum_{i=1}^{n}d_{i}^{\ast}\varepsilon_{i}=n^{-1/2}\sum_{i=1}^{n}\left(A_{i}-\mu_{A}\right)^{\top}e_{i}+n^{-1/2}\sum_{i=1}^{n}\mu_{A}^{\top}e_{i},

and observe that the first term on the right-hand side here forms a martingale difference sequence while the second term is asymptotically normal by assumption. Combining these facts, we are able to obtain asymptotic normality of the sum; see the detailed proof in the Appendix.

Like Theorem 1, this theorem establishes two key facts as well. To see both of them, observe that the term \sigma_{d\varepsilon}^{2} appearing in the convergence result (14) can be more explicitly rewritten as

\sigma_{d\varepsilon}^{2}=n^{-1}\sum_{i=1}^{n}\mathbb{E}\left[(d_{i}^{*})^{2}\varepsilon_{i}^{2}\right]+\mu_{A}^{\top}\left[\operatorname{Var}\left(n^{-1/2}\sum_{i=1}^{n}e_{i}\right)-n^{-1}\sum_{i=1}^{n}\operatorname{Var}\left(e_{i}\right)\right]\mu_{A}.   (15)
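For the reader's convenience, the equality between \sigma_{e,1}^{2}+\sigma_{e,2}^{2} and the right-hand side of (15) can be verified directly from the definitions, using only the independence of A_{i} and e_{i} and the identity d_{i}^{*}\varepsilon_{i}=A_{i}^{\top}e_{i}:

\sigma_{e,1}^{2}=n^{-1}\sum_{i=1}^{n}\mathbb{E}\left[((A_{i}-\mu_{A})^{\top}e_{i})^{2}\right]=n^{-1}\sum_{i=1}^{n}\mathbb{E}\left[(A_{i}^{\top}e_{i})^{2}\right]-\mu_{A}^{\top}\left(n^{-1}\sum_{i=1}^{n}\operatorname{Var}(e_{i})\right)\mu_{A}=n^{-1}\sum_{i=1}^{n}\mathbb{E}\left[(d_{i}^{*})^{2}\varepsilon_{i}^{2}\right]-\mu_{A}^{\top}\left(n^{-1}\sum_{i=1}^{n}\operatorname{Var}(e_{i})\right)\mu_{A},

where the second equality uses \mathbb{E}[e_{i}e_{i}^{\top}A_{i}]=\mathbb{E}[e_{i}e_{i}^{\top}]\mu_{A}=\operatorname{Var}(e_{i})\mu_{A} (independence of A_{i} and e_{i}, and e_{i} having mean zero), while \sigma_{e,2}^{2}=\mu_{A}^{\top}\operatorname{Var}(n^{-1/2}\sum_{i=1}^{n}e_{i})\mu_{A}; adding the two expressions gives (15).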

This expression in turn implies that the asymptotic variance of the OLS estimator now depends on the correlation across the vectors e_{i}, which means that heteroscedasticity of the regression errors forces OLS variance estimators to be adjusted for correlation across the errors. On the other hand, the expression (15) also demonstrates that it suffices to adjust the variance estimators only for correlation across the random variables \mu_{A}^{\top}e_{i}, instead of correlation across the full vectors e_{i}. The latter point might seem like a minor technicality, but it in fact plays an interesting role in models with heterogeneous treatment effects. Indeed, suppose that d_{i}\in\{0,1\} represents the treatment assignment status, (y_{i}(1),y_{i}(0)) represents the pair of potential outcomes with and without treatment, so that y_{i}=y_{i}(0)+d_{i}(y_{i}(1)-y_{i}(0)), and w_{i} consists of a non-zero constant only. It then follows that \varepsilon_{i} takes the form (7), which matches the setting here with e_{i}=(y_{i}(0)-\mathbb{E}[y_{i}(0)],y_{i}(1)-y_{i}(0)-\mathbb{E}[y_{i}(1)-y_{i}(0)])^{\top} and (\sigma_{l}(d_{i}))_{l\leq L}=(1,d_{i})^{\top}. Hence, \mu_{A}=(0,\sigma_{d}^{2})^{\top}, and so the expression for the asymptotic variance of \sqrt{n}(\hat{\beta}-\beta) reduces to the probability limit of

\frac{1}{n\sigma_{d}^{4}}\sum_{i=1}^{n}\mathbb{E}[(d_{i}-\mu_{d})^{2}\varepsilon_{i}^{2}]+\operatorname{Var}\left(n^{-1/2}\sum_{i=1}^{n}(y_{i}(1)-y_{i}(0))\right)-n^{-1}\sum_{i=1}^{n}\operatorname{Var}\left(y_{i}(1)-y_{i}(0)\right),

as previewed in the previous section; see expression (8) there. In turn, this expression means that it suffices to adjust OLS variance estimation for correlation across treatment effects, and there is no need to worry about correlation of potential outcomes within any given treatment arm. For example, whenever the treatment effects y_{i}(1)-y_{i}(0) are independent across i, it suffices to use the usual heteroscedasticity-robust variance formulas, even if the regression errors \varepsilon_{i} are correlated. Note, however, that it is necessary to use heteroscedasticity-robust variance formulas even if the treatment effects are i.i.d. (and not just independent), as the formula s^{2}\left(\mathbf{X}^{\top}\mathbf{X}\right)^{-1} is not valid in this case. Moreover, the same results apply even if the w_{i}'s include non-constant controls as well, as the second component of the e_{i}'s remains the same in this case.

Finally, we note that as long as the form of correlation across the random variables \mu_{A}^{\top}e_{i} is known, estimation of the asymptotic variance based on Theorem 2 is conceptually straightforward. For example, if treatment effects are clustered, it suffices to use Moulton (1986)'s or Liang and Zeger (1986)'s formulas assuming that the regression errors are clustered at the same level (even though they could be clustered at a different level because of the clustering of potential outcomes without treatment, for example). For brevity, we do not provide formal statements of such results, as they are case-specific and depend on the form of the correlation structure, e.g. time series versus cluster versus spatial dependence.
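As one illustration of such a case-specific adjustment, the following is a minimal Liang-Zeger-type sketch of ours (not a statement of the paper's formal results), clustering the score at the level at which the treatment effects are believed to be correlated.

import numpy as np

def cluster_robust_se_for_beta(D, W, Y, groups):
    """Cluster-robust standard error for the coefficient on D, clustering the
    products D_breve_i * resid_i at the level given by `groups` (e.g. the level
    at which the treatment effects are believed to be correlated)."""
    D_breve = D - W @ np.linalg.solve(W.T @ W, W.T @ D)    # M_W D
    X = np.column_stack([D, W])
    theta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
    resid = Y - X @ theta_hat
    score = D_breve * resid
    meat = sum(score[groups == g].sum() ** 2 for g in np.unique(groups))
    return theta_hat[0], np.sqrt(meat) / (D_breve @ D_breve)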

4 Group-Level Randomization

A key assumption behind the results of Section 3 is that the regressor of interest is independent and identically distributed across units. In randomized controlled trials, such an assumption is satisfied for completely randomized assignments but can fail under other randomization protocols. For instance, in randomized controlled trials the i.i.d. assumption on the treatment fails when treatment is assigned at a group level. Motivated by this challenge, we suppose in this section that

y_{i,j}=d_{j}\beta+w_{i,j}^{\top}\gamma+\varepsilon_{i,j},   (16)

where the index j=1,\ldots,n_{g} denotes the group membership, the index i=1,\ldots,n_{j} denotes units within group j, n_{g} is the number of groups, n_{j} is the number of units within group j, d_{j} denotes the regressor of interest, which is invariant within group j, w_{i,j} denotes the vector of controls, and \varepsilon_{i,j} is a mean-zero regression error. We continue to assume that the first component of each w_{i,j} is a non-zero constant and also continue to employ n as the total number of observations, n=\sum_{j\leq n_{g}}n_{j}.

We emphasize at this point that the index j distinguishes the level at which the regressor d_{j} varies, but has no special significance for other variables. In particular, we do not insist that any (potential) clustering structure of the errors ((\varepsilon_{i,j})_{i\leq n_{j}})_{j\leq n_{g}} be the same as the group structure specified by the regressor d_{j}. In other words, ((\varepsilon_{i,j})_{i\leq n_{j}})_{j\leq n_{g}} may have a dependence structure completely different from the group structure determined by j – e.g., \varepsilon_{i_{1},j_{1}} may be correlated with \varepsilon_{i_{2},j_{2}} when j_{1}\neq j_{2}. This discussion will be formalized in the nature of the regularity conditions below, which do not include, for example, the random effects type specification as in Moulton (1986). Instead, as in the previous section, we will rely on a martingale structure based on (d_{j})_{j\leq n_{g}} that will be crucial for understanding the asymptotic distribution of \hat{\beta}.

Following our discussion in the previous section, we consider cases of strong exogeneity and conditional heteroscedasticity separately. We first consider the case of strong exogeneity. In order to study the asymptotic properties of \hat{\beta}, we first need to revise Assumptions 1 and 3 to account for the group structure and a possible within-group correlation. To this end, we let \kappa_{n}\equiv(\sum_{j\leq n_{g}}n_{j}^{2})^{1/2} and S_{\varepsilon}^{2}\equiv n^{-1}\sum_{j\leq n_{g}}\mathbb{E}[(\sum_{i\leq n_{j}}\varepsilon_{i,j})^{2}] and impose the following assumptions.

Assumption 6.

(i) The random variables d_{j}, 1\leq j\leq n_{g}, are i.i.d. with mean \mu_{d} and variance \sigma_{d}^{2}; (ii) (d_{j})_{j\leq n_{g}} is independent of ((w_{i,j}^{\top})_{i\leq n_{j}})_{j\leq n_{g}}; (iii) \max_{j\leq n_{g}}\mathbb{E}[|d_{j}|^{4}]\leq K; (iv) \sigma_{d}^{2}\geq K^{-1}; (v) (d_{j})_{j\leq n_{g}} is independent of ((\varepsilon_{i,j})_{i\leq n_{j}})_{j\leq n_{g}}.

Assumption 7.

(i) n^{-1}\sum_{j=1}^{n_{g}}\sum_{i=1}^{n_{j}}w_{i,j}\varepsilon_{i,j}=o_{p}(S_{\varepsilon}n^{1/2}/\kappa_{n}); (ii) S_{\varepsilon}^{-2}n^{-1}\sum_{j=1}^{n_{g}}(\sum_{i=1}^{n_{j}}\varepsilon_{i,j})^{2}\to_{p}1; (iii) there exists a constant \delta_{4}\in(0,2] such that n^{-1-\delta_{4}/2}\sum_{j=1}^{n_{g}}|\sum_{i=1}^{n_{j}}\varepsilon_{i,j}|^{2+\delta_{4}}=o_{p}(S_{\varepsilon}^{2+\delta_{4}}); (iv) \kappa_{n}/n=o(1).

Assumption 6 requires mild moment restrictions and that the regressor of interest d_{j} be strongly exogenous in the sense that it be independent of the errors and other regressors. To make sense of Assumption 7, assume that each group j has the same size n_{j}, fixed independently of n. In this case, \kappa_{n} is of order \sqrt{n} and S_{\varepsilon}^{2} is typically of order one. In turn, the latter implies that Assumption 7(i) reduces to n^{-1}\sum_{j\leq n_{g}}\sum_{i\leq n_{j}}w_{i,j}\varepsilon_{i,j}=o_{p}(1), which is similar to Assumption 3(i), and Assumption 7(iii) reduces to n^{-1-\delta_{4}/2}\sum_{j=1}^{n_{g}}\sum_{i=1}^{n_{j}}|\varepsilon_{i,j}|^{2+\delta_{4}}=o_{p}(1), which is satisfied as long as \max_{1\leq j\leq n_{g}}\max_{1\leq i\leq n_{j}}|\varepsilon_{i,j}|^{2+\delta_{4}}\leq K. In addition, Assumption 7(iv) reduces to n^{-1/2}=o(1) and is satisfied automatically, and Assumption 7(ii) can be regarded as a law of large numbers. Thus, Assumption 7 in general requires that the size of the groups does not increase too fast. Note also that, as previously claimed, this assumption does not require ((w_{i,j}^{\top},\varepsilon_{i,j})_{i\leq n_{j}})_{j\leq n_{g}} to be independent across j.

We are now ready to derive the asymptotic distribution of the OLS estimator β^\hat{\beta} in the strongly exogenous case with a group-level assignment. In the statement of the result, we impose Assumption 2, which should be interpreted to hold with j=1ngi=1nj\sum_{j=1}^{n_{g}}\sum_{i=1}^{n_{j}} in place of i=1n\sum_{i=1}^{n} in Assumption 2(i) and maxjngmaxinj\max_{j\leq n_{g}}\max_{i\leq n_{j}} in place of maxin\max_{i\leq n} in Assumption 2(ii). Also, we denote djdjμdd_{j}^{*}\equiv d_{j}-\mu_{d} for all 1jng1\leq j\leq n_{g}.

Theorem 3.

Let Assumptions 2, 6 and 7 hold. Then

\frac{\sqrt{n}(\hat{\beta}-\beta)}{S_{\varepsilon}/\sigma_{d}}=\frac{n^{-1/2}\sum_{j=1}^{n_{g}}d_{j}^{*}\sum_{i=1}^{n_{j}}\varepsilon_{i,j}}{\sigma_{d}S_{\varepsilon}}+o_{p}(1)\to_{d}N(0,1).   (17)

The implications of this theorem are as follows. First, the OLS estimator \hat{\beta} is asymptotically normally distributed under mild moment conditions and under fairly general assumptions on the dependence structure of ((w_{i,j}^{\top},\varepsilon_{i,j})_{i\leq n_{j}})_{j\leq n_{g}}. Second, the asymptotic variance of \hat{\beta} is given by S_{\varepsilon}^{2}/\sigma_{d}^{2}. In particular, since S_{\varepsilon}^{2}=n^{-1}\sum_{j\leq n_{g}}\operatorname{Var}(\sum_{i\leq n_{j}}\varepsilon_{i,j}), when computing standard errors for \hat{\beta} we need only account for possible within-group correlation even if ((\varepsilon_{i,j})_{i\leq n_{j}})_{j\leq n_{g}} is dependent across j. Importantly, the group structure is determined solely by the variables (d_{j})_{j\leq n_{g}} and hence is known, considerably simplifying estimation. For instance, in a randomized controlled trial with constant effects, Theorem 3 implies we may, for example, employ Moulton (1986)'s or Liang and Zeger (1986)'s formulas clustered at the level at which treatment was assigned. We again emphasize, however, that similar conclusions do not apply for \hat{\gamma}, whose standard errors and asymptotic normality may depend on the dependence structure of ((w_{i,j}^{\top},\varepsilon_{i,j})_{i\leq n_{j}})_{j\leq n_{g}} across j.
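In practice, this can be done with off-the-shelf cluster-robust routines. The sketch below uses a hypothetical group-randomized design of our own; we believe statsmodels' OLS supports the cov_type="cluster" call as shown, but the exact interface should be checked against the installed version.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n_groups, m = 100, 20
g = np.repeat(np.arange(n_groups), m)            # group (e.g. village) labels
d = rng.binomial(1, 0.5, size=n_groups)[g]       # treatment assigned at the group level
w = rng.normal(size=n_groups * m)
eps = rng.normal(size=n_groups)[g] + rng.normal(size=n_groups * m)   # within-group correlation
y = 1.0 * d + 0.5 * w + eps

X = sm.add_constant(np.column_stack([d, w]))     # columns: const, d, w
fit = sm.OLS(y, X).fit(cov_type="cluster", cov_kwds={"groups": g})
print(fit.params[1], fit.bse[1])                 # beta_hat and SE clustered at the assignment level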

Next, we consider the case of conditional heteroscedasticity. Denote S_{e,1}^{2}\equiv n^{-1}\sum_{j=1}^{n_{g}}\mathbb{E}[((A_{j}-\mu_{A})^{\top}\sum_{i=1}^{n_{j}}e_{i,j})^{2}] and S_{e,2}^{2}\equiv\mathbb{E}[(n^{-1/2}\sum_{i=1}^{n}\mu_{A}^{\top}e_{i})^{2}]. Note that S_{e,2} here actually coincides with \sigma_{e,2} in the previous section. Within this context, we impose the following assumptions.

Assumption 8.

(i) (d_{j})_{j\leq n_{g}} is independent of ((e_{i,j}^{\top})_{i\leq n_{j}})_{j\leq n_{g}}; (ii) the functions \sigma_{l}, 1\leq l\leq L, are bounded; (iii) S_{e,1}/S_{\varepsilon}\geq K^{-1}.

Assumption 9.

(i) \|n^{-1}\sum_{j=1}^{n_{g}}(\sum_{i=1}^{n_{j}}e_{i,j})(\sum_{i=1}^{n_{j}}e_{i,j})^{\top}-n^{-1}\sum_{j=1}^{n_{g}}\mathbb{E}[(\sum_{i=1}^{n_{j}}e_{i,j})(\sum_{i=1}^{n_{j}}e_{i,j})^{\top}]\|=o_{p}(S_{e,1}^{2}); (ii) n^{-2}\sum_{j=1}^{n_{g}}\|\sum_{i=1}^{n_{j}}e_{i,j}\|^{4}=o_{p}(S_{e,1}^{4}); (iii) S_{e,2}^{-1}n^{-1/2}\sum_{i=1}^{n}\mu_{A}^{\top}e_{i}\to_{d}N(0,1).

These assumptions naturally extend Assumptions 4 and 5 in the previous section to allow for group-level assignments.

The next theorem derives the asymptotic distribution of the OLS estimator β^\hat{\beta} in the case of conditional heteroscedasticity with a group-level assignment.

Theorem 4.

Let Assumptions 2, 6(i)-(iv), 7, 8, and 9 hold. Then

\frac{\sqrt{n}(\hat{\beta}-\beta)}{S_{d\varepsilon}/\sigma_{d}^{2}}=\frac{n^{-1/2}\sum_{j=1}^{n_{g}}d_{j}^{*}\sum_{i=1}^{n_{j}}\varepsilon_{i,j}}{S_{d\varepsilon}}+o_{p}(1)\to_{d}N(0,1),   (18)

where S_{d\varepsilon}^{2}\equiv S_{e,1}^{2}+S_{e,2}^{2}.

This theorem relates to Theorem 3 in the same way as Theorem 2 relates to Theorem 1. In particular, noting that the term Sdε2S_{d\varepsilon}^{2} appearing in this theorem can be more explicitly rewritten as

S_{d\varepsilon}^{2}\equiv n^{-1}\sum_{j=1}^{n_{g}}\mathbb{E}\left[\left(d_{j}^{*}\sum_{i=1}^{n_{j}}\varepsilon_{i,j}\right)^{2}\right]+\mu_{A}^{\top}\left[\operatorname{Var}\left(n^{-1/2}\sum_{j=1}^{n_{g}}\sum_{i=1}^{n_{j}}e_{i,j}\right)-n^{-1}\sum_{j=1}^{n_{g}}\operatorname{Var}\left(\sum_{i=1}^{n_{j}}e_{i,j}\right)\right]\mu_{A},

we conclude that because of conditional heteroscedasticity, variance estimators that are clustered at the group level at which the regressor d_{j} is assigned may not be valid if the errors e_{i,j} are correlated across groups j. On the other hand, in the context of estimation with heterogeneous treatment effects, such estimators are valid if the treatment effects are uncorrelated across these groups.


References

  • Abadie, Athey, Imbens, and Wooldridge (2017) Abadie, A., S. Athey, G. W. Imbens, and J. Wooldridge (2017): “When should you adjust standard errors for clustering?,” Discussion paper, National Bureau of Economic Research.
  • Andrews (1991) Andrews, D. W. K. (1991): “Heteroskedasticity and Autocorrelation Consistent Covariance Matrix Estimation,” Econometrica, 59(3), 817–858.
  • Arellano (1987) Arellano, M. (1987): “Computing robust standard errors for within-groups estimators,” Oxford bulletin of Economics and Statistics, 49(4), 431–434.
  • Barrios, Diamond, Imbens, and Kolesar (2012) Barrios, T., R. Diamond, G. Imbens, and M. Kolesar (2012): “Clustering, spatial correlations, and randomization inference,” Journal of the American Statistical Association, 107, 578–591.
  • Bloom (2005) Bloom, H. S. (2005): Learning more from social experiments: Evolving analytic approaches. Russell Sage Foundation.
  • Conley (1999) Conley, T. G. (1999): “GMM estimation with cross sectional dependence,” Journal of econometrics, 92(1), 1–45.
  • DasGupta (2008) DasGupta, A. (2008): Asymptotic Theory of Statistics and Probability. Springer.
  • Duflo, Glennerster, and Kremer (2007) Duflo, E., R. Glennerster, and M. Kremer (2007): “Using randomization in development economics research: A toolkit,” Handbook of development economics, 4, 3895–3962.
  • Hall and Heyde (1980) Hall, P., and C. C. Heyde (1980): Martingale Limit Theory and Its Application. Academic Press, Inc. New York.
  • Hansen (2007) Hansen, C. B. (2007): “Asymptotic properties of a robust variance matrix estimator for panel data when T is large,” Journal of Econometrics, 141(2), 597–620.
  • Ibragimov and Müller (2010) Ibragimov, R., and U. K. Müller (2010): “t-Statistic based correlation and heterogeneity robust inference,” Journal of Business & Economic Statistics, 28(4), 453–468.
  • Ibragimov and Müller (2016)    (2016): “Inference with few heterogeneous clusters,” Review of Economics and Statistics, 98(1), 83–96.
  • Liang and Zeger (1986) Liang, K.-Y., and S. Zeger (1986): “Longitudinal Data Analysis Using Generalized Linear Models,” Biometrica, 73, 13–22.
  • Moulton (1986) Moulton, B. R. (1986): “Random group effects and the precision of regression estimates,” Journal of econometrics, 32(3), 385–397.
  • Newey and West (1987) Newey, W. K., and K. D. West (1987): “A Simple, Positive Semi-Definite, Heteroskedasticity and Autocorrelation Consistent Covariance Matrix,” Econometrica, 55(3), 703–708.
  • White (2014) White, H. (2014): Asymptotic theory for econometricians. Academic press.

Appendix

Appendix A Proofs of the main results

Proof of Theorem 1. Define 𝐃~𝐃𝟏n×1(n1𝐃𝟏n×1)\mathbf{\tilde{D}\equiv D}-\mathbf{1}_{n\times 1}(n^{-1}\mathbf{D}^{\top}\mathbf{1}_{n\times 1}) and 𝐌W𝐈n𝐖(𝐖𝐖)1𝐖\mathbf{M}_{W}\equiv\mathbf{I}_{n}-\mathbf{W}(\mathbf{W}^{\top}\mathbf{W})^{-1}\mathbf{W}^{\top}. Then, given that the matrix 𝐖\mathbf{W} includes a non-zero constant column by Assumption 2(iii), it follows that 𝟏n×1𝐌W=0\mathbf{1}_{n\times 1}^{\top}\mathbf{M}_{W}=0. Therefore, by the Frisch-Waugh-Lovell theorem,

β^β=(𝐃𝐌W𝐃)1(𝐃𝐌Wϵ)=(𝐃~𝐌W𝐃~)1(𝐃~𝐌Wϵ).\hat{\beta}-\beta=(\mathbf{D}^{\top}\mathbf{M}_{W}\mathbf{D})^{-1}(\mathbf{D}^{\top}\mathbf{M}_{W}\bm{\epsilon})=(\mathbf{\tilde{D}}^{\top}\mathbf{M}_{W}\mathbf{\tilde{D}})^{-1}(\mathbf{\tilde{D}}^{\top}\mathbf{M}_{W}\bm{\epsilon}). (19)

Also, denoting d¯nn1i=1ndi\overline{d}_{n}\equiv n^{-1}\sum_{i=1}^{n}d_{i}, we have

𝔼[|d¯nμd|2]=σd2/nK/n=o(1)\mathbb{E}[|\overline{d}_{n}-\mu_{d}|^{2}]=\sigma_{d}^{2}/n\leq K/n=o(1)

by Assumptions 1(i,iii), and so d¯nμd=op(1)\overline{d}_{n}-\mu_{d}=o_{p}(1) by Markov’s inequality. Hence,

d¯n2μd2=(d¯nμd)(d¯n+μd)=op(1)\overline{d}_{n}^{2}-\mu_{d}^{2}=(\overline{d}_{n}-\mu_{d})(\overline{d}_{n}+\mu_{d})=o_{p}(1)

by Assumption 1(iii) again. In addition,

𝔼[|n1i=1n(di2𝔼[di2])|1+δ1/2]2i=1n𝔼[|di2𝔼[di2]|1+δ1/2]/n1+δ1/2=o(1)\mathbb{E}\left[\left|n^{-1}\sum_{i=1}^{n}(d_{i}^{2}-\mathbb{E}[d_{i}^{2}])\right|^{1+\delta_{1}/2}\right]\leq 2\sum_{i=1}^{n}\mathbb{E}\left[|d_{i}^{2}-\mathbb{E}[d_{i}^{2}]|^{1+\delta_{1}/2}\right]/n^{1+\delta_{1}/2}=o(1)

by the von Bahr-Esseen Inequality (see Section 35.1.5 in DasGupta (2008)) and Assumptions 1(i,iii). Hence, by Markov’s inequality,

n1i=1ndi2=n1i=1n𝔼[di2]+op(1),n^{-1}\sum_{i=1}^{n}d_{i}^{2}=n^{-1}\sum_{i=1}^{n}\mathbb{E}[d_{i}^{2}]+o_{p}(1),

and so

n1𝐃~𝐃~=n1i=1n𝔼[di2]μd2+op(1)=σd2+op(1).n^{-1}\mathbf{\tilde{D}}^{\top}\mathbf{\tilde{D}}=n^{-1}\sum_{i=1}^{n}\mathbb{E}[d_{i}^{2}]-\mu_{d}^{2}+o_{p}(1)=\sigma_{d}^{2}+o_{p}(1). (20)

Further,

n1𝐃~𝐖=n1i=1ndiwi(n1i=1ndi)(n1i=1nwi).n^{-1}\mathbf{\tilde{D}}^{\top}\mathbf{W}=n^{-1}\sum_{i=1}^{n}d_{i}^{\ast}w_{i}-\left(n^{-1}\sum_{i=1}^{n}d_{i}^{\ast}\right)\left(n^{-1}\sum_{i=1}^{n}w_{i}\right). (21)

By Assumptions 1(i, ii) and 2(iii), we further obtain

𝔼[n1/2i=1ndiwi2]\displaystyle\mathbb{E}\left[\left\|n^{-1/2}\sum_{i=1}^{n}d_{i}^{\ast}w_{i}\right\|^{2}\right] =n1i=1n𝔼[diwi2]n1i=1n𝔼[di2]𝔼[wi2]K.\displaystyle=n^{-1}\sum_{i=1}^{n}\mathbb{E}\left[||d_{i}^{\ast}w_{i}||^{2}\right]\leq n^{-1}\sum_{i=1}^{n}\mathbb{E}[d_{i}^{2}]\mathbb{E}\left[||w_{i}||^{2}\right]\leq K.

Hence, by Markov’s inequality,

n1i=1ndiwi=Op(n1/2).n^{-1}\sum_{i=1}^{n}d_{i}^{\ast}w_{i}=O_{p}(n^{-1/2}). (22)

Similarly, we can use Assumptions 1(i, iii) and 2(ii) to show

(n1i=1ndi)(n1i=1nwi)=Op(n1/2),\left(n^{-1}\sum_{i=1}^{n}d_{i}^{\ast}\right)\left(n^{-1}\sum_{i=1}^{n}w_{i}\right)=O_{p}(n^{-1/2}), (23)

which together with (21) and (22) implies that

n1𝐃~𝐖=Op(n1/2).n^{-1}\mathbf{\tilde{D}}^{\top}\mathbf{W}=O_{p}(n^{-1/2}). (24)

Combining this result with (20) and using Assumption 2(i), we then have

n1𝐃~𝐌W𝐃~=n1𝐃~𝐃~n1𝐃~𝐖(n1𝐖𝐖)1n1𝐖𝐃~=σd2+op(1).n^{-1}\mathbf{\tilde{D}}^{\top}\mathbf{M}_{W}\mathbf{\tilde{D}}=n^{-1}\mathbf{\tilde{D}}^{\top}\mathbf{\tilde{D}}-n^{-1}\mathbf{\tilde{D}}^{\top}\mathbf{W}(n^{-1}\mathbf{W}^{\top}\mathbf{W})^{-1}n^{-1}\mathbf{W}^{\top}\mathbf{\tilde{D}}=\sigma_{d}^{2}+o_{p}(1). (25)

For the term 𝐃~𝐌Wϵ\mathbf{\tilde{D}}^{\top}\mathbf{M}_{W}\bm{\epsilon} in the numerator of β^β\hat{\beta}-\beta in (19), we have

n1/2𝐃~𝐌Wϵ\displaystyle n^{-1/2}\mathbf{\tilde{D}}^{\top}\mathbf{M}_{W}\bm{\epsilon} =n1/2𝐃~ϵn1/2𝐃~𝐖(𝐖𝐖)1𝐖ϵ\displaystyle=n^{-1/2}\mathbf{\tilde{D}}^{\top}\bm{\epsilon}-n^{-1/2}\mathbf{\tilde{D}}^{\top}\mathbf{W}(\mathbf{W}^{\top}\mathbf{W})^{-1}\mathbf{W}^{\top}\bm{\epsilon}
=n1/2i=1ndiεi(n1/2i=1ndi)(n1i=1nεi)+op(1)\displaystyle=n^{-1/2}\sum_{i=1}^{n}d_{i}^{\ast}\varepsilon_{i}-\left(n^{-1/2}\sum_{i=1}^{n}d_{i}^{\ast}\right)\left(n^{-1}\sum_{i=1}^{n}\varepsilon_{i}\right)+o_{p}(1)
=n1/2i=1ndiεi+op(1),\displaystyle=n^{-1/2}\sum_{i=1}^{n}d_{i}^{\ast}\varepsilon_{i}+o_{p}(1), (26)

where the second equality follows by the definition of 𝐃~\mathbf{\tilde{D}}, (24), and Assumptions 2(i) and 3(i), and the third equality follows by Assumptions 1(i, iii) and 3(iii) and Markov’s inequality. Therefore,

n(β^β)=n1/2i=1ndiεiσd2+op(1)\sqrt{n}(\hat{\beta}-\beta)=\frac{n^{-1/2}\sum_{i=1}^{n}d_{i}^{*}\varepsilon_{i}}{\sigma_{d}^{2}}+o_{p}(1) (27)

by Assumption 1(iv), and so

n(β^β)σε/σd=n1/2i=1ndiεiσdσε+op(1)\frac{\sqrt{n}(\hat{\beta}-\beta)}{\sigma_{\varepsilon}/\sigma_{d}}=\frac{n^{-1/2}\sum_{i=1}^{n}d_{i}^{*}\varepsilon_{i}}{\sigma_{d}\sigma_{\varepsilon}}+o_{p}(1) (28)

by Assumptions 1(iii) and 3(iv).

We next derive the asymptotic distribution of n1/2i=1ndiεi/(σdσε)n^{-1/2}\sum_{i=1}^{n}d_{i}^{\ast}\varepsilon_{i}/(\sigma_{d}\sigma_{\varepsilon}). Let i,n\mathcal{F}_{i,n} denote the filtration generated by (ϵ,((dj)ji))(\bm{\epsilon}^{\top},((d_{j})_{j\leq i})^{\top}). Then by Assumptions 1(iii,iv) and 3(iii, iv), n1/2diεi/(σdσε)n^{-1/2}d_{i}^{\ast}\varepsilon_{i}/(\sigma_{d}\sigma_{\varepsilon}) has finite second moment and

𝔼[n1/2diεiσdσε|i1,n]=n1/2εiσdσε𝔼[di|i1,n]=n1/2εiσdσε𝔼[di]=0\mathbb{E}\left[\frac{n^{-1/2}d_{i}^{\ast}\varepsilon_{i}}{\sigma_{d}\sigma_{\varepsilon}}|\mathcal{F}_{i-1,n}\right]=\frac{n^{-1/2}\varepsilon_{i}}{\sigma_{d}\sigma_{\varepsilon}}\mathbb{E}\left[\left.d_{i}^{\ast}\right|\mathcal{F}_{i-1,n}\right]=\frac{n^{-1/2}\varepsilon_{i}}{\sigma_{d}\sigma_{\varepsilon}}\mathbb{E}\left[d_{i}^{\ast}\right]=0 (29)

almost surely, which implies that n1/2diεi/(σdσε)n^{-1/2}d_{i}^{\ast}\varepsilon_{i}/(\sigma_{d}\sigma_{\varepsilon}) is a martingale difference array with respect to i,n\mathcal{F}_{i,n}. Next, observe that Assumptions 1(i,v) and 3(ii,iv) yield

i=1n𝔼[(n1/2diεiσdσε)2|i1,n]=σε2n1i=1nεi2p1.\sum_{i=1}^{n}\mathbb{E}\left[\left(\frac{n^{-1/2}d_{i}^{\ast}\varepsilon_{i}}{\sigma_{d}\sigma_{\varepsilon}}\right)^{2}|\mathcal{F}_{i-1,n}\right]=\sigma_{\varepsilon}^{-2}n^{-1}\sum_{i=1}^{n}\varepsilon_{i}^{2}\rightarrow_{p}1. (30)

Moreover for any η>0\eta>0, Assumptions 1(iii,iv,v) and 3(iii,iv) allow us to conclude that for δmin(δ1,δ2)\delta\equiv\min(\delta_{1},\delta_{2}),

i=1n𝔼[(n1/2diεiσdσε)21{|n1/2diεiσdσε|>η}|i1,n]\displaystyle\sum_{i=1}^{n}\mathbb{E}\left[\allowbreak\left.\left(\frac{n^{-1/2}d_{i}^{\ast}\varepsilon_{i}}{\sigma_{d}\sigma_{\varepsilon}}\right)^{2}1\left\{\left|\frac{n^{-1/2}d_{i}^{\ast}\varepsilon_{i}}{\sigma_{d}\sigma_{\varepsilon}}\right|>\eta\right\}\right|\mathcal{F}_{i-1,n}\right]
1ηδi=1n𝔼[|n1/2diεiσdσε|2+δ|i1,n]\displaystyle\leq\frac{1}{\eta^{\delta}}\sum_{i=1}^{n}\mathbb{E}\left[\left.\left|\frac{n^{-1/2}d_{i}^{\ast}\varepsilon_{i}}{\sigma_{d}\sigma_{\varepsilon}}\right|^{2+\delta}\right|\mathcal{F}_{i-1,n}\right]
=1ηδ(σdσε)2+δn1+δ/2i=1n𝔼[|di|2+δ]|εi|2+δKηδnδ/2n1i=1n|εi|2+δ=op(1).\displaystyle=\frac{1}{\eta^{\delta}(\sigma_{d}\sigma_{\varepsilon})^{2+\delta}n^{1+\delta/2}}\sum_{i=1}^{n}\mathbb{E}\left[|d_{i}^{\ast}|^{2+\delta}\right]|\varepsilon_{i}\allowbreak|^{2+\delta}\leq\frac{K}{\eta^{\delta}n^{\delta/2}}n^{-1}\sum_{i=1}^{n}|\varepsilon_{i}\allowbreak|^{2+\delta}=o_{p}(1). (31)

In view of (30) and (31), we can invoke the martingale central limit theorem (see, e.g., Corollary 3.1 in Hall and Heyde (1980)) to conclude that

n1/2i=1ndiεiσdσεdN(0,1).\frac{n^{-1/2}\sum_{i=1}^{n}d_{i}^{\ast}\varepsilon_{i}}{\sigma_{d}\sigma_{\varepsilon}}\rightarrow_{d}N(0,1). (32)

The claim of the theorem follows from combining this result with (28). Q.E.D.


Proof of Corollary 1. We first prove that \lambda_{\min}(n^{-1}\mathbf{X}^{\top}\mathbf{X})\geq K^{-1}+o_{p}(1). To do so, observe that

n1𝐃𝐃=μd2+σd2+op(1)n^{-1}\mathbf{D}^{\top}\mathbf{D}=\mu_{d}^{2}+\sigma_{d}^{2}+o_{p}(1)

by Assumptions 1(i,iii) and the law of large numbers. Also, denoting di=diμdd_{i}^{*}=d_{i}-\mu_{d} for all 1in1\leq i\leq n, we have

n1𝐃𝐖=n1i=1ndiwi=μdn1i=1nwi+n1i=1ndiwi=μdn1i=1nwi+op(1),n^{-1}\mathbf{D}^{\top}\mathbf{W}=n^{-1}\sum_{i=1}^{n}d_{i}w_{i}=\mu_{d}n^{-1}\sum_{i=1}^{n}w_{i}+n^{-1}\sum_{i=1}^{n}d_{i}^{*}w_{i}=\mu_{d}n^{-1}\sum_{i=1}^{n}w_{i}+o_{p}(1),

where the last equality follows from (22) in the proof of Theorem 1. Hence,

n1𝐗𝐗=n1(𝐃,𝐖)(𝐃,𝐖)=n1i=1n(μd,wi)(μd,wi)+(σd,𝟎dw×1)(σd,𝟎dw×1)+op(1),n^{-1}\mathbf{X}^{\top}\mathbf{X}=n^{-1}(\mathbf{D},\mathbf{W})^{\top}(\mathbf{D},\mathbf{W})=n^{-1}\sum_{i=1}^{n}(\mu_{d},w_{i}^{\top})^{\top}(\mu_{d},w_{i}^{\top})+(\sigma_{d},\mathbf{0}_{d_{w}\times 1}^{\top})^{\top}(\sigma_{d},\mathbf{0}_{d_{w}\times 1}^{\top})+o_{p}(1),

where dwd_{w} is the dimension of the vectors wiw_{i}. Now, denote

Rnmin(3λmin(n1i=1nwiwi)8|μd|,12)R_{n}\equiv\min\left(\frac{\sqrt{3\lambda_{\min}(n^{-1}\sum_{i=1}^{n}w_{i}w_{i}^{\top})}}{8|\mu_{d}|},\frac{1}{2}\right)

and fix any a1a_{1}\in\mathbb{R} and a2dwa_{2}\in\mathbb{R}^{d_{w}} such that a12+a22=1a_{1}^{2}+\|a_{2}\|^{2}=1. If |a1|>Rn|a_{1}|>R_{n}, then

(a1,a2)(n1i=1n(μd,wi)(μd,wi)+(σd,𝟎dw×1)(σd,𝟎dw×1))(a1,a2)a12σd2Rn2σd2K1(a_{1},a_{2}^{\top})\left(n^{-1}\sum_{i=1}^{n}(\mu_{d},w_{i}^{\top})^{\top}(\mu_{d},w_{i}^{\top})+(\sigma_{d},\mathbf{0}_{d_{w}\times 1}^{\top})^{\top}(\sigma_{d},\mathbf{0}_{d_{w}\times 1}^{\top})\right)(a_{1},a_{2}^{\top})^{\top}\geq a_{1}^{2}\sigma_{d}^{2}\geq R_{n}^{2}\sigma_{d}^{2}\geq K^{-1}

by Assumptions 1(iii,iv) and 2(i). If, on the other hand, |a1|Rn|a_{1}|\leq R_{n}, then a22=1a1211/4=3/4\|a_{2}\|^{2}=1-a_{1}^{2}\geq 1-1/4=3/4, and so

|a1|a2λmin(n1i=1nwiwi)4|μd|.|a_{1}|\leq\frac{\|a_{2}\|\sqrt{\lambda_{\min}(n^{-1}\sum_{i=1}^{n}w_{i}w_{i}^{\top})}}{4|\mu_{d}|}.

The latter in turn implies via Jensen’s inequality that

(a1,a2)(n1i=1n(μd,wi)(μd,wi)+(σd,𝟎dw×1)(σd,𝟎dw×1))(a1,a2)\displaystyle(a_{1},a_{2}^{\top})\left(n^{-1}\sum_{i=1}^{n}(\mu_{d},w_{i}^{\top})^{\top}(\mu_{d},w_{i}^{\top})+(\sigma_{d},\mathbf{0}_{d_{w}\times 1}^{\top})^{\top}(\sigma_{d},\mathbf{0}_{d_{w}\times 1}^{\top})\right)(a_{1},a_{2}^{\top})^{\top}
n1i=1n(a1μd+a2wi)2n1i=1n(a2wi)22|a1μd|n1i=1n|a2wi|\displaystyle\qquad\geq n^{-1}\sum_{i=1}^{n}(a_{1}\mu_{d}+a_{2}^{\top}w_{i})^{2}\geq n^{-1}\sum_{i=1}^{n}(a_{2}^{\top}w_{i})^{2}-2|a_{1}\mu_{d}|n^{-1}\sum_{i=1}^{n}|a_{2}^{\top}w_{i}|
n1i=1n(a2wi)22|a1μd|n1i=1n(a2wi)2\displaystyle\qquad\geq n^{-1}\sum_{i=1}^{n}(a_{2}^{\top}w_{i})^{2}-2|a_{1}\mu_{d}|\sqrt{n^{-1}\sum_{i=1}^{n}(a_{2}^{\top}w_{i})^{2}}
=n1i=1n(a2wi)2(n1i=1n(a2wi)22|a1μd|)\displaystyle\qquad=\sqrt{n^{-1}\sum_{i=1}^{n}(a_{2}^{\top}w_{i})^{2}}\left(\sqrt{n^{-1}\sum_{i=1}^{n}(a_{2}^{\top}w_{i})^{2}}-2|a_{1}\mu_{d}|\right)
a2λmin(n1i=1nwiwi)(a2λmin(n1i=1nwiwi)2|a1μd|)\displaystyle\qquad\geq\|a_{2}\|\sqrt{\lambda_{\min}\left(n^{-1}\sum_{i=1}^{n}w_{i}w_{i}^{\top}\right)}\left(\|a_{2}\|\sqrt{\lambda_{\min}\left(n^{-1}\sum_{i=1}^{n}w_{i}w_{i}^{\top}\right)}-2|a_{1}\mu_{d}|\right)
21a22λmin(n1i=1nwiwi)(3/8)λmin(n1i=1nwiwi)K1\displaystyle\qquad\geq 2^{-1}\|a_{2}\|^{2}\lambda_{\min}\left(n^{-1}\sum_{i=1}^{n}w_{i}w_{i}^{\top}\right)\geq(3/8)\lambda_{\min}\left(n^{-1}\sum_{i=1}^{n}w_{i}w_{i}^{\top}\right)\geq K^{-1}

by Assumption 2(i). Hence, it follows that

λmin(n1i=1n(μd,wi)(μd,wi)+(σd,𝟎dw×1)(σd,𝟎dw×1))K1,\lambda_{\min}\left(n^{-1}\sum_{i=1}^{n}(\mu_{d},w_{i}^{\top})^{\top}(\mu_{d},w_{i}^{\top})+(\sigma_{d},\mathbf{0}_{d_{w}\times 1}^{\top})^{\top}(\sigma_{d},\mathbf{0}_{d_{w}\times 1}^{\top})\right)\geq K^{-1},

and so

λmin(n1𝐗𝐗)K1+op(1),\lambda_{\min}(n^{-1}\mathbf{X}^{\top}\mathbf{X})\geq K^{-1}+o_{p}(1), (33)

as required.

Next, we prove that s2s^{2} is consistent for σε2\sigma_{\varepsilon}^{2}. To this end, note that by Assumptions 1(i, iii, v) and 3(iii),

𝔼[(n1i=1ndiεi)2]=n2i=1n𝔼[di2εi2]Kn1.\mathbb{E}\left[\left(n^{-1}\sum_{i=1}^{n}d_{i}^{\ast}\varepsilon_{i}\right)^{2}\right]=n^{-2}\sum_{i=1}^{n}\mathbb{E}\left[d_{i}^{\ast 2}\varepsilon_{i}^{2}\right]\leq Kn^{-1}. (34)

Hence, by Assumptions 2(iii) and 3(i) and Markov’s inequality

n1𝐃ϵ=n1i=1ndiεi+𝔼[di]n1iεi=op(1),n^{-1}\mathbf{D}^{\top}\bm{\epsilon}=n^{-1}\sum_{i=1}^{n}d_{i}^{\ast}\varepsilon_{i}+\mathbb{E}[d_{i}]n^{-1}\sum_{i}\varepsilon_{i}=o_{p}(1),

which combined with Assumption 3(i) further implies that

n1𝐗ϵ=n1(𝐃,𝐖)ϵ=op(1).n^{-1}\mathbf{X}^{\top}\bm{\epsilon}=n^{-1}(\mathbf{D},\mathbf{W})^{\top}\bm{\epsilon}=o_{p}(1). (35)

Combining this bound with (33) gives

θ^θ=(n1𝐗𝐗)1(n1𝐗ϵ)=op(1).\hat{\theta}-\theta=(n^{-1}\mathbf{X}^{\top}\mathbf{X})^{-1}(n^{-1}\mathbf{X}^{\top}\bm{\epsilon}\mathbf{)}=o_{p}(1). (36)

In addition, 𝐘𝐗θ^=ϵ𝐗(θ^θ)\mathbf{Y}-\mathbf{X}\hat{\theta}=\bm{\epsilon}-\mathbf{X}(\hat{\theta}-\theta). Therefore,

s2\displaystyle s^{2} =n1(𝐘𝐗θ^)(𝐘𝐗θ^)=n1(ϵ𝐗(θ^θ))(ϵ𝐗(θ^θ))\displaystyle=n^{-1}(\mathbf{Y}-\mathbf{X}\hat{\theta})^{\top}(\mathbf{Y}-\mathbf{X}\hat{\theta})=n^{-1}(\bm{\epsilon}-\mathbf{X}(\hat{\theta}-\theta))^{\top}(\bm{\epsilon}-\mathbf{X}(\hat{\theta}-\theta))
=n1ϵϵ2n1(θ^θ)𝐗ϵ+n1(θ^θ)𝐗𝐗(θ^θ)\displaystyle=n^{-1}\bm{\epsilon}^{\top}\bm{\epsilon}-2n^{-1}(\hat{\theta}-\theta)^{\top}\mathbf{X}^{\top}\bm{\epsilon}+n^{-1}(\hat{\theta}-\theta)^{\top}\mathbf{X}^{\top}\mathbf{X}(\hat{\theta}-\theta)
=σε2+op(1),\displaystyle=\sigma_{\varepsilon}^{2}+o_{p}(1), (37)

where the last equality follows from (35) and (36) and Assumptions 1(iii), 2(ii) and 3(ii). This finishes the proof of consistency of s2s^{2}.

Now, defining 𝐃~𝐃𝟏n×1(n1𝐃𝟏n×1)\mathbf{\tilde{D}\equiv D}-\mathbf{1}_{n\times 1}(n^{-1}\mathbf{D}^{\top}\mathbf{1}_{n\times 1}) to match the proof of Theorem 1 and recalling that 𝐃˘=𝐌W𝐃\mathbf{\breve{D}}=\mathbf{M}_{W}\mathbf{D}, we have

n1𝐃˘𝐃˘\displaystyle n^{-1}\mathbf{\breve{D}}^{\top}\mathbf{\breve{D}} =n1𝐃~𝐌W𝐃~=n1𝐃~𝐃~n1(𝐃~𝐖)(𝐖𝐖)1(𝐖𝐃~)\displaystyle=n^{-1}\mathbf{\tilde{D}}^{\top}\mathbf{M}_{W}\mathbf{\tilde{D}}=n^{-1}\mathbf{\tilde{D}}^{\top}\mathbf{\tilde{D}}-n^{-1}(\mathbf{\tilde{D}}^{\top}\mathbf{W})(\mathbf{W}^{\top}\mathbf{W})^{-1}(\mathbf{W}^{\top}\mathbf{\tilde{D}})
=n1𝐃~𝐃~+op(1)=σd2+op(1),\displaystyle=n^{-1}\mathbf{\tilde{D}}^{\top}\mathbf{\tilde{D}}+o_{p}(1)=\sigma_{d}^{2}+o_{p}(1),

where the third equality follows from (24) and Assumption 2(i), and the fourth from (20). Together with (37), this gives the first convergence result in (12) since \sigma_{d}^{2}\geq K^{-1} by Assumption 1(iii). The second convergence result follows from the first one, Theorem 1, and Slutsky’s theorem. Q.E.D.
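
As an informal numerical companion to Theorem 1 and Corollary 1 (and not part of the formal argument), the following Monte Carlo sketch checks that the conventional interval based on s^{2} has roughly nominal coverage when d_{i} is randomly assigned, the coefficient on d_{i} is constant, and the errors are serially correlated but generated independently of d_{i}. The sample size, the AR(1) coefficient, and all other settings below are illustrative choices of ours.

import numpy as np

rng = np.random.default_rng(0)

def textbook_coverage(n=400, reps=2000, beta=1.0, rho=0.8):
    # Errors follow an AR(1) process (dependent across i) but are generated
    # independently of the randomly assigned regressor d_i, and the
    # coefficient on d_i is constant, as in the setting of Theorem 1.
    hits = 0
    for _ in range(reps):
        shocks = rng.normal(size=n)
        eps = np.empty(n)
        eps[0] = shocks[0]
        for i in range(1, n):
            eps[i] = rho * eps[i - 1] + shocks[i]
        d = rng.binomial(1, 0.5, size=n).astype(float)   # random assignment
        y = beta * d + eps
        X = np.column_stack([np.ones(n), d])
        XtX_inv = np.linalg.inv(X.T @ X)
        theta = XtX_inv @ (X.T @ y)
        resid = y - X @ theta
        s2 = resid @ resid / n                           # residual variance
        se = np.sqrt(s2 * XtX_inv[1, 1])                 # textbook formula
        hits += abs(theta[1] - beta) <= 1.96 * se
    return hits / reps

print(textbook_coverage())   # close to 0.95 under these illustrative settings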


Proof of Theorem 2.  As in the proof of Theorem 1, we have

n(β^β)=n1/2i=1ndiεiσd2+op(1);\sqrt{n}(\hat{\beta}-\beta)=\frac{n^{-1/2}\sum_{i=1}^{n}d_{i}^{*}\varepsilon_{i}}{\sigma_{d}^{2}}+o_{p}(1); (38)

see Equation (27) there and note that the derivation of (27) did not rely on Assumption 1(v), which we are not imposing here. Since σdε2σe,12K1\sigma_{d\varepsilon}^{2}\geq\sigma_{e,1}^{2}\geq K^{-1} by Assumption 4(iii) and σd2K\sigma_{d}^{2}\leq K by Assumption 1(iii), (38) implies that

n(β^β)σdε/σd2=n1/2i=1ndiεiσdε+op(1),\frac{\sqrt{n}(\hat{\beta}-\beta)}{\sigma_{d\varepsilon}/\sigma_{d}^{2}}=\frac{n^{-1/2}\sum_{i=1}^{n}d_{i}^{*}\varepsilon_{i}}{\sigma_{d\varepsilon}}+o_{p}(1),

which yields the equality in (14).

We next derive the convergence result in (14), i.e. we show that σdε1n1/2i=1ndiεidN(0,1)\sigma_{d\varepsilon}^{-1}n^{-1/2}\sum_{i=1}^{n}d_{i}^{*}\varepsilon_{i}\to_{d}N(0,1). To do so, we write

n1/2i=1ndiεi=n1/2i=1nAiei=n1/2i=1n(AiμA)ei+n1/2i=1nμAein^{-1/2}\sum_{i=1}^{n}d_{i}^{*}\varepsilon_{i}=n^{-1/2}\sum_{i=1}^{n}A_{i}^{\top}e_{i}=n^{-1/2}\sum_{i=1}^{n}(A_{i}-\mu_{A})^{\top}e_{i}+n^{-1/2}\sum_{i=1}^{n}\mu_{A}^{\top}e_{i}

and denote the first and the second terms on the right-hand side by M1M_{1} and M2M_{2}, respectively. Also, we denote A~iAiμA\tilde{A}_{i}\equiv A_{i}-\mu_{A} for all 1in1\leq i\leq n. In addition, denote σ^e,12n1i=1n𝔼[(A~iei)2e]\hat{\sigma}_{e,1}^{2}\equiv n^{-1}\sum_{i=1}^{n}\mathbb{E}[(\tilde{A}_{i}^{\top}e_{i})^{2}\mid\mathcal{F}_{e}], where e\mathcal{F}_{e} is the filtration generated by (ei)in(e_{i}^{\top})_{i\leq n}. Then

σ^e,12\displaystyle\hat{\sigma}_{e,1}^{2} =n1i=1nei𝔼[A~iA~i]ei=n1i=1ntr(𝔼[A~iA~i]eiei)=tr(n1i=1n𝔼[A~iA~i]eiei)\displaystyle=n^{-1}\sum_{i=1}^{n}e_{i}^{\top}\mathbb{E}[\tilde{A}_{i}\tilde{A}_{i}^{\top}]e_{i}=n^{-1}\sum_{i=1}^{n}\text{tr}(\mathbb{E}[\tilde{A}_{i}\tilde{A}_{i}^{\top}]e_{i}e_{i}^{\top})=\text{tr}\left(n^{-1}\sum_{i=1}^{n}\mathbb{E}[\tilde{A}_{i}\tilde{A}_{i}^{\top}]e_{i}e_{i}^{\top}\right)
=tr(n1i=1n𝔼[A~iA~i]𝔼[eiei])+op(1)=n1i=1n𝔼[eiA~iA~iei]+op(1)=σe,12+op(1),\displaystyle=\text{tr}\left(n^{-1}\sum_{i=1}^{n}\mathbb{E}[\tilde{A}_{i}\tilde{A}_{i}^{\top}]\mathbb{E}[e_{i}e_{i}^{\top}]\right)+o_{p}(1)=n^{-1}\sum_{i=1}^{n}\mathbb{E}[e_{i}^{\top}\tilde{A}_{i}\tilde{A}_{i}^{\top}e_{i}]+o_{p}(1)=\sigma_{e,1}^{2}+o_{p}(1), (39)

where the first equality follows from Assumption 4(i), the second and the third from properties of the trace operator tr()\text{tr}(\cdot), the fourth from Assumptions 1(i,iii), 4(ii), and 5(i), the fifth from Assumption 4(i) and properties of the trace operator tr()\text{tr}(\cdot), and the sixth from the definition of σe,12\sigma_{e,1}^{2}.

Next, let (Z1,Z2)(Z_{1},Z_{2}) be a pair of independent standard normal random variables that is independent of everything else. Then for δ=min(δ1,δ3)\delta=\min(\delta_{1},\delta_{3}),

\displaystyle\sup_{t\in\mathbb{R}}\left|\mathbb{P}\left(\hat{\sigma}_{e,1}^{-1}n^{-1/2}\sum_{i=1}^{n}\tilde{A}_{i}^{\top}e_{i}\leq t\mid\mathcal{F}_{e}\right)-\mathbb{P}(Z_{1}\leq t)\right|
\qquad\qquad\leq\frac{n^{-1-\delta/2}\sum_{i=1}^{n}\mathbb{E}[|\tilde{A}_{i}^{\top}e_{i}|^{2+\delta}\mid\mathcal{F}_{e}]}{\hat{\sigma}_{e,1}^{2+\delta}}\leq\frac{Kn^{-1-\delta/2}\sum_{i=1}^{n}\|e_{i}\|^{2+\delta}}{\hat{\sigma}_{e,1}^{2+\delta}}=o_{p}(1), (40)

where the first inequality follows from a version of the Berry-Esseen theorem (see Section 35.1.9 in DasGupta (2008)), the second inequality from the Cauchy-Schwarz inequality and Assumptions 1(iii) and 4(ii), and the last bound from (39) and Assumptions 4(iii) and 5(ii). Therefore, for any tt\in\mathbb{R},

(σdε1(M1+M2)t)\displaystyle\mathbb{P}(\sigma_{d\varepsilon}^{-1}(M_{1}+M_{2})\leq t)
=𝔼[(σ^e,11M1σ^e,11(σdεtM2)e)]=𝔼[(Z1σ^e,11(σdεtM2)e)]+o(1)\displaystyle\qquad=\mathbb{E}[\mathbb{P}(\hat{\sigma}_{e,1}^{-1}M_{1}\leq\hat{\sigma}_{e,1}^{-1}(\sigma_{d\varepsilon}t-M_{2})\mid\mathcal{F}_{e})]=\mathbb{E}[\mathbb{P}(Z_{1}\leq\hat{\sigma}_{e,1}^{-1}(\sigma_{d\varepsilon}t-M_{2})\mid\mathcal{F}_{e})]+o(1)
=(σ^e,1Z1+M2σdεt)+o(1)=(σe,1Z1+M2σdεt)+o(1)\displaystyle\qquad=\mathbb{P}(\hat{\sigma}_{e,1}Z_{1}+M_{2}\leq\sigma_{d\varepsilon}t)+o(1)=\mathbb{P}(\sigma_{e,1}Z_{1}+M_{2}\leq\sigma_{d\varepsilon}t)+o(1)
=𝔼[(σe,21M2σe,21(σdεtσe,1Z1)Z1)]+o(1)=𝔼[(Z2σe,21(σdεtσe,1Z1)Z1)]+o(1)\displaystyle\qquad=\mathbb{E}[\mathbb{P}(\sigma_{e,2}^{-1}M_{2}\leq\sigma_{e,2}^{-1}(\sigma_{d\varepsilon}t-\sigma_{e,1}Z_{1})\mid Z_{1})]+o(1)=\mathbb{E}[\mathbb{P}(Z_{2}\leq\sigma_{e,2}^{-1}(\sigma_{d\varepsilon}t-\sigma_{e,1}Z_{1})\mid Z_{1})]+o(1)
=(σdε1(σe,1Z1+σe,2Z2)t)+o(1)=(Z1t)+o(1),\displaystyle\qquad=\mathbb{P}(\sigma_{d\varepsilon}^{-1}(\sigma_{e,1}Z_{1}+\sigma_{e,2}Z_{2})\leq t)+o(1)=\mathbb{P}(Z_{1}\leq t)+o(1), (41)

where the first equality follows from the law of iterated expectations (LIE), the second from (40), noting that the difference of two probabilities is bounded by one in absolute value, so that the o_{p}(1) term in (40) satisfies \mathbb{E}[o_{p}(1)]=o(1), the third from the LIE, the fourth from (39) and Assumption 4(iii), the fifth from the LIE, the sixth from Assumption 5(iii), the seventh from the LIE, and the eighth from noting that \sigma_{e,1}Z_{1}+\sigma_{e,2}Z_{2} is a normal random variable with mean zero and variance \sigma_{d\varepsilon}^{2}=\sigma_{e,1}^{2}+\sigma_{e,2}^{2}. This gives \sigma_{d\varepsilon}^{-1}n^{-1/2}\sum_{i=1}^{n}d_{i}^{*}\varepsilon_{i}\to_{d}N(0,1) and completes the proof of the theorem. Q.E.D.


Proof of Theorem 3. For this proof, it will be convenient to denote 𝐖((wi,j)inj)jng\mathbf{W}\equiv((w_{i,j}^{\top})_{i\leq n_{j}})_{j\leq n_{g}}, 𝐘((yi,j)inj)jng\mathbf{Y}\equiv((y_{i,j})_{i\leq n_{j}})_{j\leq n_{g}}, and ϵ((εi,j)inj)jng\bm{\epsilon}\equiv((\varepsilon_{i,j})_{i\leq n_{j}})_{j\leq n_{g}}. In addition, denote 𝐃(dj𝟏nj×1)jng\mathbf{D}\equiv(d_{j}\mathbf{1}_{n_{j}\times 1})_{j\leq n_{g}} and 𝐃~𝐃(n1𝐃𝟏n×1)\tilde{\mathbf{D}}\equiv\mathbf{D}-(n^{-1}\mathbf{D}^{\top}\mathbf{1}_{n\times 1}). Moreover, denote d¯nn1j=1ngnjdj\overline{d}_{n}\equiv n^{-1}\sum_{j=1}^{n_{g}}n_{j}d_{j}.

As in the proof of Theorem 1, given that the matrix 𝐖\mathbf{W} includes a non-zero constant column by Assumption 2(iii), it follows that 𝟏n×1𝐌W=0\mathbf{1}_{n\times 1}^{\top}\mathbf{M}_{W}=0. Therefore, by the Frisch-Waugh-Lovell theorem,

β^β=(𝐃𝐌W𝐃)1(𝐃𝐌Wϵ)=(𝐃~𝐌W𝐃~)1(𝐃~𝐌Wϵ).\hat{\beta}-\beta=(\mathbf{D}^{\top}\mathbf{M}_{W}\mathbf{D})^{-1}(\mathbf{D}^{\top}\mathbf{M}_{W}\bm{\epsilon})=(\mathbf{\tilde{D}}^{\top}\mathbf{M}_{W}\mathbf{\tilde{D}})^{-1}(\mathbf{\tilde{D}}^{\top}\mathbf{M}_{W}\bm{\epsilon}). (42)

We first consider the denominator 𝐃~𝐌W𝐃~\mathbf{\tilde{D}}^{\top}\mathbf{M}_{W}\mathbf{\tilde{D}}. Since n=jngnjn=\sum_{j\leq n_{g}}n_{j}, under Assumptions 6(i,iii) we have

𝔼[|d¯nμd|2]=n2j=1ngnj2𝔼[(dj)2]Kn2j=1ngnj2=Kκn2/n2.\mathbb{E}\left[|\overline{d}_{n}-\mu_{d}|^{2}\right]=n^{-2}\sum_{j=1}^{n_{g}}n_{j}^{2}\mathbb{E}\left[(d_{j}^{\ast})^{2}\right]\leq Kn^{-2}\sum_{j=1}^{n_{g}}n_{j}^{2}=K\kappa_{n}^{2}/n^{2}. (43)

We therefore obtain from Markov’s inequality that

d¯nμd=Op(κn/n),\overline{d}_{n}-\mu_{d}=O_{p}(\kappa_{n}/n), (44)

and so

d¯n2μd2=(d¯nμd)(d¯n+μd)=Op(κn/n).\overline{d}_{n}^{2}-\mu_{d}^{2}=(\overline{d}_{n}-\mu_{d})(\overline{d}_{n}+\mu_{d})=O_{p}(\kappa_{n}/n). (45)

by Assumption 6(iii). In addition, again under Assumptions 6(i,iii) we have

𝔼[|n1j=1ngnj(dj2𝔼[dj2])|2]=n2j=1ngnj2𝔼[|dj2𝔼[dj2]|2]Kn2j=1ngnj2=Kκn2/n2,\mathbb{E}\left[\left|n^{-1}\sum_{j=1}^{n_{g}}n_{j}(d_{j}^{2}-\mathbb{E}[d_{j}^{2}])\right|^{2}\right]=n^{-2}\sum_{j=1}^{n_{g}}n_{j}^{2}\mathbb{E}\left[|d_{j}^{2}-\mathbb{E}[d_{j}^{2}]|^{2}\right]\leq Kn^{-2}\sum_{j=1}^{n_{g}}n_{j}^{2}=K\kappa_{n}^{2}/n^{2},

and so

n1j=1ngnj(dj2𝔼[dj2])=Op(κn/n)n^{-1}\sum_{j=1}^{n_{g}}n_{j}(d_{j}^{2}-\mathbb{E}[d_{j}^{2}])=O_{p}(\kappa_{n}/n) (46)

by Markov’s inequality. Combining results (45) and (46) then yields

n1𝐃~𝐃~=n1j=1ngnj(djd¯n)2=n1j=1ngnjdj2(d¯n)2=σd2+Op(κn/n).n^{-1}\mathbf{\tilde{D}}^{\top}\mathbf{\tilde{D}}=n^{-1}\sum_{j=1}^{n_{g}}n_{j}(d_{j}-\overline{d}_{n})^{2}=n^{-1}\sum_{j=1}^{n_{g}}n_{j}d_{j}^{2}-(\overline{d}_{n})^{2}=\sigma_{d}^{2}+O_{p}(\kappa_{n}/n). (47)

Further, since djd_{j} does not depend on ii, we have

n1𝐃~𝐖\displaystyle n^{-1}\mathbf{\tilde{D}}^{\top}\mathbf{W} =n1j=1ngdji=1njwi,j(n1j=1ngnjdj)(n1j=1ngi=1njwi,j)\displaystyle=n^{-1}\sum_{j=1}^{n_{g}}d_{j}^{\ast}\sum_{i=1}^{n_{j}}w_{i,j}-\left(n^{-1}\sum_{j=1}^{n_{g}}n_{j}d_{j}^{\ast}\right)\left(n^{-1}\sum_{j=1}^{n_{g}}\sum_{i=1}^{n_{j}}w_{i,j}\right)
=n1j=1ngdji=1njwi,j+Op(κn/n),\displaystyle=n^{-1}\sum_{j=1}^{n_{g}}d_{j}^{\ast}\sum_{i=1}^{n_{j}}w_{i,j}+O_{p}(\kappa_{n}/n), (48)

where the second equality is by Assumption 2(ii) and (44). Moreover, by Assumptions 2(ii) and 6(i, ii, iii),

𝔼[n1j=1ngdji=1njwi,j2]\displaystyle\mathbb{E}\left[\left\|n^{-1}\sum_{j=1}^{n_{g}}d_{j}^{\ast}\sum_{i=1}^{n_{j}}w_{i,j}\right\|^{2}\right] =σd2n2j=1ng𝔼[i=1njwi,j2]Kn2j=1ngnj2=Kκn2/n2.\displaystyle=\sigma_{d}^{2}n^{-2}\sum_{j=1}^{n_{g}}\mathbb{E}\left[\left\|\sum_{i=1}^{n_{j}}w_{i,j}\right\|^{2}\right]\leq Kn^{-2}\sum_{j=1}^{n_{g}}n_{j}^{2}=K\kappa_{n}^{2}/n^{2}.

Therefore, by Markov’s inequality we obtain

n1j=1ngdji=1njwi,j=Op(κn/n),n^{-1}\sum_{j=1}^{n_{g}}d_{j}^{\ast}\sum_{i=1}^{n_{j}}w_{i,j}=O_{p}(\kappa_{n}/n),

which together with (48) further shows that

n1𝐃~𝐖=Op(κn/n).n^{-1}\mathbf{\tilde{D}}^{\top}\mathbf{W}=O_{p}(\kappa_{n}/n). (49)

Collecting the results (47) and (49) and using Assumptions 2(i) and 7(iv), we get

n1𝐃~𝐌W𝐃~=n1𝐃~𝐃~n1𝐃~𝐖(n1𝐖𝐖)1n1𝐖𝐃~=σd2+op(1).n^{-1}\mathbf{\tilde{D}}^{\top}\mathbf{M}_{W}\mathbf{\tilde{D}}=n^{-1}\mathbf{\tilde{D}}^{\top}\mathbf{\tilde{D}}-n^{-1}\mathbf{\tilde{D}}^{\top}\mathbf{W}(n^{-1}\mathbf{W}^{\top}\mathbf{W})^{-1}n^{-1}\mathbf{W}^{\top}\mathbf{\tilde{D}}=\sigma_{d}^{2}+o_{p}(1). (50)

Next, we consider the numerator 𝐃~𝐌Wϵ\mathbf{\tilde{D}}^{\top}\mathbf{M}_{W}\bm{\epsilon} in (42). By (44) and Assumptions 7(i) and 2(iii),

d¯nμdSεn1/2j=1ngi=1njεi,j=Op(κn/n)Sεn1/2j=1ngi=1njεi,j=op(1).\frac{\overline{d}_{n}-\mu_{d}}{S_{\varepsilon}}n^{-1/2}\sum_{j=1}^{n_{g}}\sum_{i=1}^{n_{j}}\varepsilon_{i,j}=\frac{O_{p}(\kappa_{n}/n)}{S_{\varepsilon}}n^{-1/2}\sum_{j=1}^{n_{g}}\sum_{i=1}^{n_{j}}\varepsilon_{i,j}=o_{p}(1).

Therefore,

n1/2𝐃~ϵSε\displaystyle\frac{n^{-1/2}\mathbf{\tilde{D}}^{\top}\bm{\epsilon}}{S_{\varepsilon}} =n1/2j=1ngdji=1njεi,jSε(d¯nμd)Sεn1/2j=1ngi=1njεi,j\displaystyle=\frac{n^{-1/2}\sum_{j=1}^{n_{g}}d_{j}^{*}\sum_{i=1}^{n_{j}}\varepsilon_{i,j}}{S_{\varepsilon}}-\frac{(\overline{d}_{n}-\mu_{d})}{S_{\varepsilon}}n^{-1/2}\sum_{j=1}^{n_{g}}\sum_{i=1}^{n_{j}}\varepsilon_{i,j}
=n1/2j=1ngdji=1njεi,jSε+op(1).\displaystyle=\frac{n^{-1/2}\sum_{j=1}^{n_{g}}d_{j}^{*}\sum_{i=1}^{n_{j}}\varepsilon_{i,j}}{S_{\varepsilon}}+o_{p}(1). (51)

In addition,

n3/2κn𝐖ϵSε=op(1)\frac{n^{-3/2}\kappa_{n}\mathbf{W}^{\top}\bm{\epsilon}}{S_{\varepsilon}}=o_{p}(1)

by Assumption 7(i), which together with (49) and (51) and Assumption 2(i) shows

n1/2𝐃~𝐌WϵSε\displaystyle\frac{n^{-1/2}\mathbf{\tilde{D}}^{\top}\mathbf{M}_{W}\bm{\epsilon}}{S_{\varepsilon}} =n1/2𝐃~ϵSεn1/2𝐃~𝐖(𝐖𝐖)1𝐖ϵSε\displaystyle=\frac{n^{-1/2}\mathbf{\tilde{D}}^{\top}\bm{\epsilon}}{S_{\varepsilon}}-\frac{n^{-1/2}\mathbf{\tilde{D}}^{\top}\mathbf{W}(\mathbf{W}^{\top}\mathbf{W})^{-1}\mathbf{W}^{\top}\bm{\epsilon}}{S_{\varepsilon}}
=n1/2j=1ngdji=1njεi,jSε+op(1).\displaystyle=\frac{n^{-1/2}\sum_{j=1}^{n_{g}}d_{j}^{*}\sum_{i=1}^{n_{j}}\varepsilon_{i,j}}{S_{\varepsilon}}+o_{p}(1). (52)

Given that Sε1n1/2j=1ng(djμd)i=1njεi,j=Op(1)S_{\varepsilon}^{-1}n^{-1/2}\sum_{j=1}^{n_{g}}(d_{j}-\mu_{d})\sum_{i=1}^{n_{j}}\varepsilon_{i,j}=O_{p}(1) by Assumptions 6(i,iii) and 7(ii) and Markov’s inequality and that σd2K1\sigma_{d}^{2}\geq K^{-1} by Assumption 6(iv), we obtain from (42), (50), and (52) that

n(β^β)Sε=n1/2j=1ngdji=1njεi,jσd2Sε+op(1)\frac{\sqrt{n}(\hat{\beta}-\beta)}{S_{\varepsilon}}=\frac{n^{-1/2}\sum_{j=1}^{n_{g}}d_{j}^{*}\sum_{i=1}^{n_{j}}\varepsilon_{i,j}}{\sigma_{d}^{2}S_{\varepsilon}}+o_{p}(1)

and so

n(β^β)Sε/σd=n1/2j=1ngdji=1njεi,jσdSε+op(1)\frac{\sqrt{n}(\hat{\beta}-\beta)}{S_{\varepsilon}/\sigma_{d}}=\frac{n^{-1/2}\sum_{j=1}^{n_{g}}d_{j}^{*}\sum_{i=1}^{n_{j}}\varepsilon_{i,j}}{\sigma_{d}S_{\varepsilon}}+o_{p}(1) (53)

by Assumption 6(iii).

We next derive the asymptotic distribution of n1/2j=1ngdji=1njεi,j/(σdSε)n^{-1/2}\sum_{j=1}^{n_{g}}d_{j}^{*}\sum_{i=1}^{n_{j}}\varepsilon_{i,j}/(\sigma_{d}S_{\varepsilon}). Let j,n\mathcal{F}_{j,n} denote the filtration generated by (ϵ,((dm)mj))(\bm{\epsilon}^{\top},((d_{m})_{m\leq j})^{\top}). Then by Assumptions 6(i,v),

𝔼[n1/2dji=1njεi,jσdSε|j1,n]=n1/2i=1njεi,jσdSε𝔼[dj|j1,n]=n1/2i=1njεi,jσdSε𝔼[dj]=0\mathbb{E}\left[\frac{n^{-1/2}d_{j}^{\ast}\sum_{i=1}^{n_{j}}\varepsilon_{i,j}}{\sigma_{d}S_{\varepsilon}}|\mathcal{F}_{j-1,n}\right]=\frac{n^{-1/2}\sum_{i=1}^{n_{j}}\varepsilon_{i,j}}{\sigma_{d}S_{\varepsilon}}\mathbb{E}\left[\left.d_{j}^{\ast}\right|\mathcal{F}_{j-1,n}\right]=\frac{n^{-1/2}\sum_{i=1}^{n_{j}}\varepsilon_{i,j}}{\sigma_{d}S_{\varepsilon}}\mathbb{E}\left[d_{j}^{\ast}\right]=0

almost surely, which implies that n1/2dji=1njεi,j/(σdSε)n^{-1/2}d_{j}^{\ast}\sum_{i=1}^{n_{j}}\varepsilon_{i,j}/(\sigma_{d}S_{\varepsilon}) is a martingale difference array with respect to j,n\mathcal{F}_{j,n}. Next, observe that Assumptions 6(i,v) and 7(ii) yield

j=1ng𝔼[(n1/2dji=1njεi,jσdSε)2|j1,n]=Sε2n1j=1ng(i=1njεi,j)2p1.\sum_{j=1}^{n_{g}}\mathbb{E}\left[\left(\frac{n^{-1/2}d_{j}^{\ast}\sum_{i=1}^{n_{j}}\varepsilon_{i,j}}{\sigma_{d}S_{\varepsilon}}\right)^{2}|\mathcal{F}_{j-1,n}\right]=S_{\varepsilon}^{-2}n^{-1}\sum_{j=1}^{n_{g}}\left(\sum_{i=1}^{n_{j}}\varepsilon_{i,j}\right)^{2}\rightarrow_{p}1. (54)

Moreover for any η>0\eta>0, Assumptions 6(iii,iv,v) and 7(iii) allow us to conclude that

j=1ng𝔼[(n1/2dji=1njεi,jσdSε)21{|n1/2dji=1njεi,jσdSε|>η}|j1,n]\displaystyle\sum_{j=1}^{n_{g}}\mathbb{E}\left[\allowbreak\left.\left(\frac{n^{-1/2}d_{j}^{\ast}\sum_{i=1}^{n_{j}}\varepsilon_{i,j}}{\sigma_{d}S_{\varepsilon}}\right)^{2}1\left\{\left|\frac{n^{-1/2}d_{j}^{\ast}\sum_{i=1}^{n_{j}}\varepsilon_{i,j}}{\sigma_{d}S_{\varepsilon}}\right|>\eta\right\}\right|\mathcal{F}_{j-1,n}\right]
1ηδ4j=1ng𝔼[|n1/2dji=1njεi,jσdSε|2+δ4|j1,n]\displaystyle\leq\frac{1}{\eta^{\delta_{4}}}\sum_{j=1}^{n_{g}}\mathbb{E}\left[\left.\left|\frac{n^{-1/2}d_{j}^{\ast}\sum_{i=1}^{n_{j}}\varepsilon_{i,j}}{\sigma_{d}S_{\varepsilon}}\right|^{2+\delta_{4}}\right|\mathcal{F}_{j-1,n}\right]
\displaystyle=\frac{1}{\eta^{\delta_{4}}(\sigma_{d}S_{\varepsilon})^{2+\delta_{4}}n^{1+\delta_{4}/2}}\sum_{j=1}^{n_{g}}\mathbb{E}\left[|d_{j}^{\ast}|^{2+\delta_{4}}\right]\left|\sum_{i=1}^{n_{j}}\varepsilon_{i,j}\right|^{2+\delta_{4}}\leq\frac{K}{\eta^{\delta_{4}}S_{\varepsilon}^{2+\delta_{4}}n^{1+\delta_{4}/2}}\sum_{j=1}^{n_{g}}\left|\sum_{i=1}^{n_{j}}\varepsilon_{i,j}\right|^{2+\delta_{4}}=o_{p}(1).

Combining this bound with (54), we can invoke the martingale central limit theorem (see, e.g., Corollary 3.1 in Hall and Heyde (1980)) to conclude that

n1/2j=1ngdji=1njεi,jσdSεdN(0,1).\frac{n^{-1/2}\sum_{j=1}^{n_{g}}d_{j}^{\ast}\sum_{i=1}^{n_{j}}\varepsilon_{i,j}}{\sigma_{d}S_{\varepsilon}}\rightarrow_{d}N(0,1). (55)

The claim of the theorem follows from combining this result with (53). Q.E.D.
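
An analogous informal check is available for Theorem 3 (again, not part of the formal argument): with the regressor assigned at the group level and a constant coefficient, a variance estimate that clusters residual sums at the level of assignment should deliver roughly nominal coverage even when the errors are correlated both within and across groups. The sketch below uses the standard group-clustered sandwich as a convenient plug-in; this choice and all parameter values are illustrative assumptions of ours.

import numpy as np

rng = np.random.default_rng(1)

def group_coverage(n_g=100, m=10, reps=1000, beta=1.0, rho=0.8):
    # n_g groups of m units each; the errors follow an AR(1) process over the
    # whole sample, so they are correlated within and across groups. The
    # regressor d_j is assigned i.i.d. at the group level, independently of
    # the errors, and the coefficient on it is constant.
    n = n_g * m
    hits = 0
    for _ in range(reps):
        shocks = rng.normal(size=n)
        eps = np.empty(n)
        eps[0] = shocks[0]
        for i in range(1, n):
            eps[i] = rho * eps[i - 1] + shocks[i]
        d_group = rng.binomial(1, 0.5, size=n_g).astype(float)
        d = np.repeat(d_group, m)                        # group-level assignment
        y = beta * d + eps
        X = np.column_stack([np.ones(n), d])
        XtX_inv = np.linalg.inv(X.T @ X)
        theta = XtX_inv @ (X.T @ y)
        resid = y - X @ theta
        rs = resid.reshape(n_g, m).sum(axis=1)           # within-group residual sums
        scores = np.column_stack([rs, d_group * rs])     # group scores for (constant, d)
        meat = scores.T @ scores
        se = np.sqrt((XtX_inv @ meat @ XtX_inv)[1, 1])   # clustered at assignment level
        hits += abs(theta[1] - beta) <= 1.96 * se
    return hits / reps

print(group_coverage())   # close to 0.95 despite cross-group error correlation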


Proof of Theorem 4. As in the proof of Theorem 3, we have

n(β^β)Sε/σd=n1/2j=1ngdji=1njεi,jσdSε+op(1);\frac{\sqrt{n}(\hat{\beta}-\beta)}{S_{\varepsilon}/\sigma_{d}}=\frac{n^{-1/2}\sum_{j=1}^{n_{g}}d_{j}^{*}\sum_{i=1}^{n_{j}}\varepsilon_{i,j}}{\sigma_{d}S_{\varepsilon}}+o_{p}(1); (56)

see Equation (53) there and note that the derivation of (53) did not rely on Assumption 6(v), which we are not imposing here. Since SdεSe,1S_{d\varepsilon}\geq S_{e,1} and Se,1SεK1S_{e,1}\geq S_{\varepsilon}K^{-1} by Assumption 8(iii) and σdK\sigma_{d}\leq K by Assumption 6(iii), (56) implies that

n(β^β)Sdε/σd2=n1/2j=1ngdji=1njεi,jSdε+op(1),\frac{\sqrt{n}(\hat{\beta}-\beta)}{S_{d\varepsilon}/\sigma_{d}^{2}}=\frac{n^{-1/2}\sum_{j=1}^{n_{g}}d_{j}^{*}\sum_{i=1}^{n_{j}}\varepsilon_{i,j}}{S_{d\varepsilon}}+o_{p}(1),

which yields the equality in (18).

We next derive the convergence result in (18), i.e. we show that Sdε1n1/2j=1ngdji=1njεi,jdN(0,1)S_{d\varepsilon}^{-1}n^{-1/2}\sum_{j=1}^{n_{g}}d_{j}^{*}\sum_{i=1}^{n_{j}}\varepsilon_{i,j}\to_{d}N(0,1). To do so, we write

n^{-1/2}\sum_{j=1}^{n_{g}}d_{j}^{*}\sum_{i=1}^{n_{j}}\varepsilon_{i,j}=n^{-1/2}\sum_{j=1}^{n_{g}}A_{j}^{\top}\sum_{i=1}^{n_{j}}e_{i,j}=n^{-1/2}\sum_{j=1}^{n_{g}}(A_{j}-\mu_{A})^{\top}\sum_{i=1}^{n_{j}}e_{i,j}+n^{-1/2}\sum_{j=1}^{n_{g}}\mu_{A}^{\top}\sum_{i=1}^{n_{j}}e_{i,j}

and denote the first and the second terms on the right-hand side by M_{1} and M_{2}, respectively. Also, we denote \tilde{A}_{j}\equiv A_{j}-\mu_{A} for all 1\leq j\leq n_{g}. In addition, denote \hat{S}_{e,1}^{2}\equiv n^{-1}\sum_{j=1}^{n_{g}}\mathbb{E}[(\tilde{A}_{j}^{\top}\sum_{i=1}^{n_{j}}e_{i,j})^{2}\mid\mathcal{F}_{e}], where \mathcal{F}_{e} is the filtration generated by ((e_{i,j}^{\top})_{i\leq n_{j}})_{j\leq n_{g}}.

Then \hat{S}_{e,1}^{2}-S_{e,1}^{2}=o_{p}(S_{e,1}^{2}) by the same argument as that used to derive (39) in the proof of Theorem 2 with Assumption 9(i) replacing Assumption 5. In addition, letting Z be a standard normal random variable that is independent of everything else, we have

\sup_{t\in\mathbb{R}}\left|\mathbb{P}\left(\hat{S}_{e,1}^{-1}n^{-1/2}\sum_{j=1}^{n_{g}}\tilde{A}_{j}^{\top}\sum_{i=1}^{n_{j}}e_{i,j}\leq t\mid\mathcal{F}_{e}\right)-\mathbb{P}(Z\leq t)\right|=o_{p}(1)

by the same argument as that used to derive (40) in the proof of Theorem 2 with δ=2\delta=2 and Assumption 9(ii) replacing Assumption 5(ii). Finally,

(Sdε1(M1+M2)t)=(Zt)+o(1)\mathbb{P}(S_{d\varepsilon}^{-1}(M_{1}+M_{2})\leq t)=\mathbb{P}(Z\leq t)+o(1)

for all t\in\mathbb{R} by the same argument as that used to derive (41) in the proof of Theorem 2 with Assumption 9(iii) replacing Assumption 5(iii). This gives S_{d\varepsilon}^{-1}n^{-1/2}\sum_{j=1}^{n_{g}}d_{j}^{*}\sum_{i=1}^{n_{j}}\varepsilon_{i,j}\to_{d}N(0,1) and completes the proof of the theorem. Q.E.D.


Appendix B Asymptotic equivalence of two variance formulas

Lemma 1.

Consider the linear regression model in (1). Suppose: (i) (di)in(d_{i})_{i\leq n} are i.i.d. with zero mean, finite and nonzero variance σd2\sigma_{d}^{2}; (ii) (εi)in(\varepsilon_{i})_{i\leq n} is covariance stationary with auto-covariance function Γε()\Gamma_{\varepsilon}(\cdot) satisfying Γε(0)>0\Gamma_{\varepsilon}(0)>0 and j=1Γε(j)2<\sum_{j=1}^{\infty}\Gamma_{\varepsilon}(j)^{2}<\infty. For 𝐃(di)in\mathbf{D}\equiv(d_{i})_{i\leq n}, we then have

𝐃Ω𝐃Γε(0)𝐃𝐃1 almost surely as n,\frac{\mathbf{D}^{\top}\Omega\mathbf{D}}{\Gamma_{\varepsilon}(0)\mathbf{D}^{\top}\mathbf{D}}\rightarrow 1\text{ almost surely as }n\rightarrow\infty, (57)

where Ω\Omega is the covariance matrix of (εi)in(\varepsilon_{i})_{i\leq n}.

Proof of Lemma 1. Note that

n1𝐃Ω𝐃=n1i1=1ni2=1ndi1di2Γε(i1i2)=Γε(0)n1𝐃𝐃+2n1i=2nUin^{-1}\mathbf{D}^{\top}\Omega\mathbf{D}=n^{-1}\sum_{i_{1}=1}^{n}\sum_{i_{2}=1}^{n}d_{i_{1}}d_{i_{2}}\Gamma_{\varepsilon}(i_{1}-i_{2})=\Gamma_{\varepsilon}(0)n^{-1}\mathbf{D}^{\top}\mathbf{D}+2n^{-1}\sum_{i=2}^{n}U_{i} (58)

where U_{i}\equiv\sum_{i^{\prime}=1}^{i-1}d_{i}d_{i^{\prime}}\Gamma_{\varepsilon}(i-i^{\prime}). Let \mathcal{\tilde{F}}_{i} denote the natural filtration generated by (d_{j})_{j\leq i}. Then, under the assumption that (d_{i})_{i\leq n} is i.i.d., it follows that \{U_{i},\tilde{\mathcal{F}}_{i}\} is a martingale difference sequence with variance \mathbb{E}[U_{i}^{2}]=\sigma_{d}^{4}\sum_{j=1}^{i-1}\Gamma_{\varepsilon}(j)^{2}. Therefore we obtain that

i=2ni2𝔼[Ui2]=σd4i=2ni2j=1i1Γε(j)2=σd4j=1n1Γε(j)2m=j+1nm2Kσd4j=1Γε(j)2<\sum_{i=2}^{n}i^{-2}\mathbb{E}\left[U_{i}^{2}\right]=\sigma_{d}^{4}\sum_{i=2}^{n}i^{-2}\sum_{j=1}^{i-1}\Gamma_{\varepsilon}(j)^{2}=\sigma_{d}^{4}\sum_{j=1}^{n-1}\Gamma_{\varepsilon}(j)^{2}\sum_{m=j+1}^{n}m^{-2}\leq K\sigma_{d}^{4}\sum_{j=1}^{\infty}\Gamma_{\varepsilon}(j)^{2}<\infty (59)

where the first inequality follows from m=1m2<K\sum_{m=1}^{\infty}m^{-2}<K and the last inequality is due to σd2<\sigma_{d}^{2}<\infty and j=1Γε(j)2<\sum_{j=1}^{\infty}\Gamma_{\varepsilon}(j)^{2}<\infty by assumption. Hence, (59) establishes that i=2i2𝔼[Ui2]<\sum_{i=2}^{\infty}i^{-2}\mathbb{E}\left[U_{i}^{2}\right]<\infty. By the martingale strong law of large numbers (see, e.g., Theorem 3.76 in White (2014)) we can therefore deduce that n1i=2nUi0n^{-1}\sum_{i=2}^{n}U_{i}\rightarrow 0 almost surely as nn\rightarrow\infty, which together with (58) yields

n1𝐃Ω𝐃Γε(0)n1𝐃𝐃0 almost surely as n.n^{-1}\mathbf{D}^{\top}\Omega\mathbf{D}-\Gamma_{\varepsilon}(0)n^{-1}\mathbf{D}^{\top}\mathbf{D}\rightarrow 0\text{ almost surely as }n\rightarrow\infty. (60)

Moreover, by condition (i) of the lemma and Kolmogorov’s strong law of large numbers (see, e.g., Theorem 3.1 in White (2014)), n^{-1}\mathbf{D}^{\top}\mathbf{D}\rightarrow\sigma_{d}^{2} almost surely as n\rightarrow\infty. Since \sigma_{d}^{2}>0 and \Gamma_{\varepsilon}(0)>0, the claim of the lemma then follows from (60). Q.E.D.
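
A quick numerical check of (57) may also be instructive: for i.i.d. d_{i} and a square-summable autocovariance such as \Gamma_{\varepsilon}(j)=\rho^{j} with \Gamma_{\varepsilon}(0)=1, the ratio \mathbf{D}^{\top}\Omega\mathbf{D}/(\Gamma_{\varepsilon}(0)\mathbf{D}^{\top}\mathbf{D}) should drift toward one as n grows. The sketch below evaluates the ratio directly; the distribution of d_{i} and the value of \rho are illustrative choices of ours.

import numpy as np

rng = np.random.default_rng(0)

def lemma1_ratio(n, rho=0.7):
    # Evaluate D' Omega D / (Gamma_eps(0) D' D) for i.i.d. standard normal d_i
    # and the autocovariance Gamma_eps(j) = rho**j, which has Gamma_eps(0) = 1
    # and satisfies the square-summability condition of Lemma 1.
    d = rng.normal(size=n)
    quad = d @ d                                               # Gamma_eps(0) * D'D
    cross = sum(rho ** j * np.dot(d[:-j], d[j:]) for j in range(1, n))
    return (quad + 2.0 * cross) / quad

for n in (100, 1000, 10000):
    print(n, round(lemma1_ratio(n), 3))   # approaches 1 as n grows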