Standard Errors When a Regressor is Randomly Assigned
Abstract
We examine asymptotic properties of the OLS estimator when the values of the regressor of interest are assigned randomly and independently of other regressors. We find that the OLS variance formula in this case is often simplified, sometimes substantially. In particular, when the regressor of interest is independent not only of other regressors but also of the error term, the textbook homoscedastic variance formula is valid even if the error term and auxiliary regressors exhibit a general dependence structure. In the context of randomized controlled trials, this conclusion holds in completely randomized experiments with constant treatment effects. When the error term is heteroscedastic with respect to the regressor of interest, the variance formula has to be adjusted not only for heteroscedasticity but also for the correlation structure of the error term. Even in the latter case, however, some simplifications are possible, as only a part of the correlation structure of the error term has to be taken into account. In the context of randomized controlled trials, this implies that the textbook homoscedastic variance formula is typically not valid if treatment effects are heterogeneous, but heteroscedasticity-robust variance formulas are valid if treatment effects are independent across units, even if the error term exhibits a general dependence structure. In addition, we extend the results to the case when the regressor of interest is assigned randomly at a group level, such as in randomized controlled trials with treatment assignment determined at the group (e.g., school/village) level.
JEL Classification: C14, C31, C32
Keywords: Cluster-Robust Inference; Randomized Controlled Trial
1 Introduction
Textbook discussion of linear regression usually begins with a standard model of the form $Y = X\beta + \varepsilon$, where it is assumed that $X$ is a nonstochastic $n \times k$ matrix (with full column rank) of regressors and the error vector $\varepsilon$ has mean zero and variance matrix proportional to an identity matrix. As is well known, such an assumption justifies the formula $\hat\sigma^2(X'X)^{-1}$ as an estimator of the variance of the OLS estimator, where $\hat\sigma^2$ is equal to the sum of squares of the estimated residuals divided either by the sample size $n$ or the degrees of freedom $n - k$. This formula is easy to use but, as is typically taught, may not be valid if the variance matrix of the error vector is not proportional to an identity matrix. In such cases, the variance of the OLS estimator should instead be based on the sandwich formula $(X'X)^{-1}X'\Omega X(X'X)^{-1}$, where $\Omega$ denotes the variance matrix of the error vector, to reflect the heteroscedasticity and dependence structure of the error vector. An important practical challenge in implementing such an approach is that the matrix $\Omega$ may be hard to estimate if the dependence structure of the error vector is unknown. In this paper, we study the implications for the variance of OLS estimators of having a regressor of interest whose values are i.i.d. across units/time periods and are independent of the values of other regressors. The primary motivating applications for our analysis are randomized controlled trials in which units are independently assigned to some treatment level without a connection to observable characteristics. The main finding of the paper is that variance estimation in this case is often simplified, sometimes substantially.
Let $X_1$ be the column of $X$ corresponding to the regressor of interest and let $X_2$ be the remaining columns of $X$; i.e., the columns corresponding to controls. Our first main result shows that when the vector $X_1$ has i.i.d. components and is strongly exogenous in the sense of being independent not only of $X_2$ but also of $\varepsilon$, the OLS estimator is asymptotically normally distributed and the formula $\hat\sigma^2(X'X)^{-1}$ actually yields a valid variance estimator for the coefficient of $X_1$ even if $\Omega$ is not proportional to the identity matrix. This result, which superficially contradicts the lessons of elementary linear regression analysis, is due to the randomness of the matrix $X$ in our analysis. While the textbook analysis assumes away the randomness by conditioning on $X$, we instead obtain our result by recognizing that the randomness of the matrix $X$ delivers a suitable martingale structure. (The validity of the $\hat\sigma^2(X'X)^{-1}$ formula does not mean that variance estimators based on the sandwich formula are invalid; see Lemma 1 in the Appendix for the asymptotic equivalence of these estimators in our setting.) We recognize that a version of this result in some simple contexts is well understood in the profession in the sense that many can anticipate such a result when the entire matrix of regressors is strongly exogenous; see references below. We go one step further, however, and establish our result for the case in which (i) only one regressor is strongly exogenous (e.g., treatment in a randomized controlled trial); and/or (ii) the error vector is subject to some generalized dependence more complicated than what is commonly understood to be a cluster structure, e.g., temporal/spatial autocorrelation or a network structure. This result is important because it facilitates inference on the coefficient of the regressor of interest even if the researcher does not know the dependence structure of the error vector $\varepsilon$, which is useful from a pragmatic point of view. We emphasize, however, that while our conclusions hold for the coefficient corresponding to a strongly exogenous regressor, they need not hold for other coefficients in the regression.
Our second main result shows that when the vector $X_1$ has i.i.d. components and is independent of $X_2$ but the error term is conditionally heteroscedastic with respect to $X_1$, the formula $\hat\sigma^2(X'X)^{-1}$ is actually not valid and has to be adjusted not only for heteroscedasticity but also, rather surprisingly, for the dependence structure of the vector $\varepsilon$. Nevertheless, a simplified variance formula can often be used in this case as well. For example, conditional heteroscedasticity arises when the regression model is taken from the potential outcomes framework with heterogeneous treatment effects. (In that framework, the case of a strongly exogenous regressor corresponds to the assumption of constant treatment effects.) If the researcher is concerned about clustering in this case, our results imply that it is sufficient to adjust the variance formula for clustering of the treatment effects only. In contrast, for example, there is no need to adjust the variance formula for clustering of the potential outcomes in any given treatment arm. We also note that neither of our results restricts or excludes conditional heteroscedasticity of $\varepsilon$ with respect to $X_2$.
In addition, we extend both results to the case when the values of the regressor of interest are independent only across groups of units/time periods, as is the case in randomized controlled trials in which treatment assignment is determined at a group (e.g., school/village) level. We show that in the strongly exogenous case, it suffices to take into account only the within-group correlation of the error vector $\varepsilon$. In other words, it suffices to use variance estimators that are clustered at the level at which treatment is assigned. In the conditional heteroscedasticity case, the variance formula still requires adjustments for both heteroscedasticity and dependence but often allows for some simplifications relative to the standard textbook formula mentioned above.
Our first main result and its extension to the group-level assignment are related to but different from those in Barrios, Diamond, Imbens, and Kolesar (2012), who came to the same conclusions for regressions without controls in which a fixed fraction of units/clusters is randomly assigned to be treated. To the best of our knowledge, however, there are no results in the literature related to our second main result. Our analysis is also related to Abadie, Athey, Imbens, and Wooldridge (2017), who presented a new clustering framework that emphasizes a finite population perspective as well as interactions between the sampling and assignment parts of the data-generating process. They established in particular that there is no need to cluster when estimating the variance if the randomization is conducted at the individual level and there is no heterogeneity in the treatment effects. Our first main result echoes and complements their findings in the following respects. First, unlike them, we do not impose a particular structure on the sampling process, which allows us to cover general forms of time series or even network dependence in addition to cluster-type dependence. Second, our analysis goes beyond the binary treatment framework and accommodates a general strongly exogenous regressor as well as the inclusion of additional controls in the regression. In particular, we allow for general dependence structures in the control variables, which makes it ex-ante unclear at what level one should cluster. Third, we rely on a traditional asymptotic framework, which may make our analysis more familiar to the reader. We recognize, however, that the third aspect may be a weakness in the sense that our framework is unable to address the finite population adjustment that plays an important role in Abadie, Athey, Imbens, and Wooldridge (2017). Finally, we note Bloom (2005), who considered a random-effects-type cluster structure and produced a variance estimator for the simple difference-of-means estimator that is quoted in Duflo, Glennerster, and Kremer (2007). His equation (4.3), which is presented without proof, is in fact a special case of the $\hat\sigma^2(X'X)^{-1}$ formula. The cluster structure that he analyzed has a built-in correlation among observations, and as such, it would be tempting to think that variance estimation would require Moulton (1986)'s clustering formula – a conclusion that can be motivated if inference is to be conditioned on the matrix $X$. Hence, in our view, his equation (4.3) can only be motivated by explicitly recognizing the randomness of the matrix $X$.
Our results are not particularly difficult to derive. On the other hand, we are unaware of any systematic discussion of results along this line in the literature besides Barrios, Diamond, Imbens, and Kolesar (2012) and Abadie, Athey, Imbens, and Wooldridge (2017), especially in models where the control variables and treatment variables are both present. Our results have convenient pragmatic implications, which we hope are helpful to some empirical researchers.
Outline. We present the basic intuition underlying our results in Section 2. The intuitive discussion there suggests that asymptotic normality and the formula $\hat\sigma^2(X'X)^{-1}$ are valid under a fairly general dependence structure in $\varepsilon$ provided that the randomness of $X_1$ generates a suitable martingale structure in the score. We also explain how conditional heteroscedasticity breaks down this martingale structure. We formalize our discussion in Section 3, where our main restriction on the dependence of $\varepsilon$ is that it be weak enough for its sample average to converge in probability to zero – a condition that further emphasizes that our analysis is driven by the randomness in $X_1$ and not in $\varepsilon$. We provide an extension to the case of group-level random assignment in Section 4.
Notation. We use $C$ to denote a generic strictly positive constant that may change from place to place but is independent of the sample size $n$. For any positive integer $k$, let $I_k$, $1_k$, and $0_k$ denote the $k \times k$ identity matrix, the $k$-vector of ones, and the $k$-vector of zeros. For any real square matrix $A$, $\lambda_{\min}(A)$ and $\lambda_{\max}(A)$ denote its smallest and largest eigenvalues. We use $a := b$ to denote that $a$ is defined as $b$, and $(A_1; \ldots; A_m)$ to denote the matrix composed by sequentially stacking matrices $A_1, \ldots, A_m$ with an equal number of columns.
2 Intuition
In this section, we provide intuition for the validity of the formula $\hat\sigma^2(X'X)^{-1}$ in the case of strongly exogenous regressors. We first consider the case of a simple univariate regression model and then extend the result to the case of a multivariate regression model. At the end of this section, we explain the complications arising from conditional heteroscedasticity and how they break the validity of the formula $\hat\sigma^2(X'X)^{-1}$.
2.1 Case of Strong Exogeneity
We start by considering a simple univariate linear regression time series model in which we have
$$y_i = x_i\beta + \varepsilon_i, \qquad i = 1, \ldots, n, \qquad (1)$$
where $\{\varepsilon_i\}$ is a second-order stationary, possibly autocorrelated, time series – we employ the index $i$, rather than $t$, to emphasize that our analysis is not confined to the time series context. We depart from the textbook time series model by assuming that the regressors $\{x_i\}$ are: (i) independent and identically distributed (i.i.d.) with mean zero, and (ii) strongly exogenous in the sense that $\{x_i\}$ is independent of the time series process $\{\varepsilon_i\}$.
As is well known, the least squares estimator $\hat\beta$ of $\beta$ in model (1) satisfies the equality
$$\hat\beta = \beta + \Big(\sum_{i=1}^n x_i^2\Big)^{-1}\sum_{i=1}^n x_i\varepsilon_i. \qquad (2)$$
In many standard time series textbooks, the asymptotic distribution of $\hat\beta$ is thus derived by imposing sufficiently strong conditions to ensure that the score $n^{-1/2}\sum_{i=1}^n x_i\varepsilon_i$ is asymptotically normal and the Hessian $n^{-1}\sum_{i=1}^n x_i^2$ converges in probability to a non-stochastic limit. In order to derive the standard error for $\hat\beta$, we therefore only need a consistent estimator of the long-run variance of the score; i.e., a heteroscedasticity and autocorrelation consistent (HAC) variance estimator, such as those proposed by Newey and West (1987) and Andrews (1991).
On the other hand, if the regressors are i.i.d. and strongly exogenous with mean zero, the independence of $\{x_i\}$ and $\{\varepsilon_i\}$ implies that for any $i, j$ with $i \neq j$, we must have
$$E[x_i\varepsilon_i \cdot x_j\varepsilon_j] = E[x_i]\,E[x_j]\,E[\varepsilon_i\varepsilon_j] = 0,$$
and also $E[(x_i\varepsilon_i)^2] = E[x_i^2]E[\varepsilon_i^2] = \sigma_x^2\sigma_\varepsilon^2$ for all $i$, where $\sigma_x^2 := E[x_i^2]$ and $\sigma_\varepsilon^2 := E[\varepsilon_i^2]$. Hence, as long as some version of the central limit theorem is applicable to the score $n^{-1/2}\sum_{i=1}^n x_i\varepsilon_i$ and a law of large numbers is applicable to the Hessian $n^{-1}\sum_{i=1}^n x_i^2$, we can conclude that the asymptotic distribution of $\hat\beta$ is given by
$$\sqrt{n}(\hat\beta - \beta) \to_d N(0, \sigma_\varepsilon^2/\sigma_x^2).$$
In particular, statistical inference on $\beta$ can be conducted as if it were not a time series model, i.e., using the formula $\hat\sigma^2(X'X)^{-1}$, where $X = (x_1, \ldots, x_n)'$ in this case.
The main takeaway of the preceding example is that the strong exogeneity and i.i.d. nature of the regressors imply that the sequence $\{x_i\varepsilon_i\}$ is homoscedastic and serially uncorrelated even if the errors $\{\varepsilon_i\}$ are arbitrarily autocorrelated. Is this simplification confined to the time series model? Our preceding discussion suggests that this is not the case. Indeed, provided the regressors are i.i.d., mean zero, and strongly exogenous, the score has a built-in martingale structure vis-à-vis the filtration $\mathcal{F}_i := \sigma(\{\varepsilon_j\}_{j=1}^n, \{x_j\}_{j \le i})$ because:
$$E[x_i\varepsilon_i \mid \mathcal{F}_{i-1}] = \varepsilon_i\,E[x_i] = 0. \qquad (3)$$
Therefore, assuming that the random pairs $(x_i, \varepsilon_i)$ satisfy certain moment conditions, the martingale central limit theorem will be applicable regardless of the dependence structure of $\{\varepsilon_i\}$, and the long-run variance of the score will reduce to $\sigma_x^2\sigma_\varepsilon^2$. In particular, the variance formula $\hat\sigma^2(X'X)^{-1}$ will remain valid despite the dependence present in the variables $\{\varepsilon_i\}$. Thus, spatial correlation, network dependence, and/or a cluster structure in the variables $\{\varepsilon_i\}$ are all accommodated by the standard homoscedastic standard errors. Moreover, we note that a quick inspection of the preceding argument reveals that the assumptions we have imposed so far are stronger than necessary for the desired conclusion to hold.
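To make the preceding takeaway concrete, here is a minimal Monte Carlo sketch (ours, not from the paper; the AR(1) coefficient, sample size, and all names are our own choices). The regressor is i.i.d. and drawn independently of the entire error path, and the textbook homoscedastic standard error closely tracks the Monte Carlo standard deviation of $\hat\beta$ despite the strong autocorrelation in the errors:

```python
# Hedged illustration: i.i.d. strongly exogenous regressor, AR(1) errors.
import numpy as np

rng = np.random.default_rng(0)
n, reps, rho, beta = 500, 2000, 0.8, 1.0

def ar1(n, rho, rng):
    e = np.empty(n)
    e[0] = rng.normal(scale=1.0 / np.sqrt(1 - rho**2))  # stationary start
    for t in range(1, n):
        e[t] = rho * e[t - 1] + rng.normal()
    return e

bhat, se_homo = np.empty(reps), np.empty(reps)
for r in range(reps):
    x = rng.normal(size=n)     # i.i.d., mean zero, independent of the errors
    eps = ar1(n, rho, rng)     # strongly autocorrelated errors
    y = beta * x + eps
    b = (x @ y) / (x @ x)
    resid = y - b * x
    s2 = resid @ resid / (n - 1)
    bhat[r] = b
    se_homo[r] = np.sqrt(s2 / (x @ x))  # textbook sigma^2 (X'X)^{-1}

print("MC sd of beta-hat:     %.4f" % bhat.std())
print("mean homoskedastic SE: %.4f" % se_homo.mean())
```

Replacing the line drawing `x` with, say, a lagged function of the errors would break the strong exogeneity and, with it, the agreement between the two numbers.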
We next build on our preceding discussion by considering the multivariate linear regression model
$$y_i = x_i\beta + w_i'\gamma + \varepsilon_i, \qquad i = 1, \ldots, n, \qquad (4)$$
where $x_i$ is a scalar regressor of interest, $w_i$ is a $d$-vector of controls, and $\{\varepsilon_i\}$ is a second-order stationary, possibly autocorrelated, time series satisfying $E[w_i\varepsilon_i] = 0$ for all $i$. We continue to assume that the regressors $\{x_i\}$ are i.i.d. with mean zero and strongly exogenous in the sense that $\{x_i\}$ is independent of the process $\{(w_i', \varepsilon_i)\}$. The parameter of interest continues to be $\beta$.
For this model, the Frisch-Waugh-Lovell theorem implies that the least squares estimator $\hat\beta$ satisfies
$$\hat\beta = \beta + \Big(\sum_{i=1}^n \hat x_i^2\Big)^{-1}\sum_{i=1}^n \hat x_i\varepsilon_i, \qquad (5)$$
where $\hat x_i$ denotes the residual from the sample least squares regression of $x_i$ on $w_i$.
Hence, under appropriate regularity conditions, the estimator admits the asymptotic expansion
$$\hat\beta = \beta + \Big(\sum_{i=1}^n \tilde x_i^2\Big)^{-1}\sum_{i=1}^n \tilde x_i\varepsilon_i + o_p(n^{-1/2}), \qquad (6)$$
where $\tilde x_i := x_i - w_i'\pi$ and $\pi := (E[w_iw_i'])^{-1}E[w_ix_i]$ is the coefficient of the population regression of $x_i$ on $w_i$.
In particular, if $x_i$ is mean zero and independent of $w_i$, then $E[w_ix_i] = 0$, so $\tilde x_i = x_i$ and the asymptotic expansion reduces to the univariate setting – i.e., the right-hand side of (6) is (asymptotically) equivalent to the right-hand side of (2). It therefore follows that the end result is the same as in the univariate case: the variance formula $\hat\sigma^2(Z'Z)^{-1}$, where $Z$ is the $n \times (1 + d)$ matrix with rows $(x_i, w_i')$ in this case, remains valid for $\hat\beta$. We emphasize, however, that this formula is not necessarily justified for conducting inference on the coefficient $\gamma$ of the controls.
As a preview of the results in the next section, we again note that the conditions we have imposed so far are stronger than required. For instance, suppose that instead of demanding that the regressors $x_i$ and $w_i$ be fully independent, we assume that they are related according to the model
$$x_i = w_i'\pi + u_i$$
for $\{u_i\}$ i.i.d. and independent of $\{(w_i', \varepsilon_i)\}$. The asymptotic expansion in (6), combined with the same arguments employed in the univariate case, then continues to imply that
$$\sqrt{n}(\hat\beta - \beta) \to_d N(0, \sigma_\varepsilon^2/\sigma_u^2)$$
under mild moment conditions, where $\sigma_u^2 := \mathrm{Var}(u_i)$ – i.e., when computing standard errors for $\hat\beta$ we can continue to pretend that the time series process $\{\varepsilon_i\}$ fits the textbook homoscedastic model. Setting $w_i$ to be a constant, for instance, reveals that the mean zero assumption on $x_i$ is superfluous.
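As a hedged numerical companion to this remark (again our own construction, with made-up parameter values, not code from the paper), the sketch below adds a serially dependent control and generates the regressor of interest as $x_i = \pi w_i + u_i$ with i.i.d. $u_i$; the classic homoscedastic standard error for $\hat\beta$ remains accurate:

```python
# Hedged illustration: dependent control w, x = pi*w + u with i.i.d. u.
import numpy as np

rng = np.random.default_rng(1)
n, reps, beta = 500, 2000, 1.0

def ar1(n, rho, rng):
    e = np.empty(n)
    e[0] = rng.normal(scale=1.0 / np.sqrt(1 - rho**2))
    for t in range(1, n):
        e[t] = rho * e[t - 1] + rng.normal()
    return e

bhat, se = np.empty(reps), np.empty(reps)
for r in range(reps):
    w = ar1(n, 0.9, rng)           # serially dependent control
    eps = ar1(n, 0.8, rng)         # autocorrelated error, independent of w
    u = rng.normal(size=n)         # i.i.d. "experimental" variation in x
    x = 0.5 * w + u                # x_i = pi * w_i + u_i
    y = beta * x + 2.0 + 0.5 * w + eps
    Z = np.column_stack([x, np.ones(n), w])
    theta = np.linalg.solve(Z.T @ Z, Z.T @ y)
    resid = y - Z @ theta
    s2 = resid @ resid / (n - Z.shape[1])
    V = s2 * np.linalg.inv(Z.T @ Z)   # classic homoskedastic variance matrix
    bhat[r], se[r] = theta[0], np.sqrt(V[0, 0])

print("MC sd of beta-hat:     %.4f" % bhat.std())
print("mean homoskedastic SE: %.4f" % se.mean())
```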
2.2 Case of Conditional Heteroscedasticity
The preceding martingale argument relies crucially on two key assumptions: (i) the regressors $\{x_i\}$ are independent of the controls $\{w_i\}$, and (ii) the regressors $\{x_i\}$ are independent of the errors $\{\varepsilon_i\}$. A challenge to our martingale argument arises when $\{x_i\}$ and $\{\varepsilon_i\}$ are not independent. Within the potential outcome framework, for instance, this full independence requirement is violated in the presence of heterogeneous treatment effects. More precisely, heterogeneous treatment effects render $\varepsilon_i$ conditionally heteroscedastic with respect to the treatment status $x_i$. Motivated by this observation, we also study a model in which $\varepsilon_i = \sigma(x_i)'e_i$ with the mean-zero random vectors $e_i$ possibly correlated across $i$, but fully independent of $\{x_i\}$. In the potential outcome framework, with $x_i \in \{0, 1\}$ indicating treatment status, we would have
$$y_i = y_i(0) + x_i\big(y_i(1) - y_i(0)\big) = \alpha + x_i\beta + \underbrace{x_i(\tau_i - \beta) + \big(y_i(0) - \alpha\big)}_{\varepsilon_i}, \qquad (7)$$
where $y_i(d)$ denotes the potential outcome for unit $i$ under treatment status $d \in \{0, 1\}$, $\tau_i := y_i(1) - y_i(0)$ is the unit-level treatment effect, $\alpha := E[y_i(0)]$, and $\beta := E[\tau_i]$, so that $\varepsilon_i = \sigma(x_i)'e_i$ with $\sigma(x) = (x, 1)'$ and $e_i = (\tau_i - \beta,\, y_i(0) - \alpha)'$.
To see the problem for the martingale structure in this model, observe that for any $i \neq j$, we now have
$$E[x_i\varepsilon_i \cdot x_j\varepsilon_j] = E[x_i\sigma(x_i)']\,E[e_ie_j']\,E[\sigma(x_j)x_j],$$
which is not necessarily zero even if the $e_i$'s are mean zero. The variance formula for the sum $\sum_{i=1}^n x_i\varepsilon_i$ therefore must include interaction terms as long as the random vectors $e_i$ and $e_j$ are correlated.
We will present a detailed analysis of the conditional heteroscedasticity case in Section 3, but the main takeaways from our results are: (i) the independence of $\{x_i\}$ from the controls $\{w_i\}$ still simplifies the asymptotic variance for $\hat\beta$; and (ii) conditional heteroscedasticity yields a break in the martingale structure that requires us to adjust standard errors not only for heteroscedasticity, but also for correlation of the errors across units. For example, in the context of (7), our analysis implies the asymptotic variance of $\hat\beta$ equals (the probability limit of)
$$\frac{1}{n}\sum_{i=1}^n \frac{E[\tilde x_i^2\varepsilon_i^2]}{\sigma_x^4} + \frac{1}{n}\sum_{i\neq j} E[(\tau_i - \beta)(\tau_j - \beta)], \qquad (8)$$
where $\tilde x_i := x_i - E[x_i]$ and $\sigma_x^2 := \mathrm{Var}(x_i)$. In particular, we note that standard errors may need to be adjusted for correlation if we are concerned that the treatment effects are correlated. On the other hand, the correlation between components of the vector $(y_1(0), \ldots, y_n(0))'$ plays no role for the standard errors, and neither does correlation between components of the vector $(y_1(1), \ldots, y_n(1))'$.
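The following simulation sketch illustrates these takeaways under assumptions of our own choosing (unit-level Bernoulli assignment, cluster-level components in the untreated outcomes and, optionally, in the treatment effects); it is an illustration, not the paper's procedure. When treatment effects are independent across units, the heteroscedasticity-robust standard error lines up with the Monte Carlo variability even though $y_i(0)$ is clustered; when the treatment effects themselves are clustered, only the cluster-robust version does:

```python
# Hedged illustration: only correlation of the TREATMENT EFFECTS matters.
import numpy as np

rng = np.random.default_rng(2)
G, m = 100, 10                    # 100 clusters of 10 units each
n, reps, p = G * m, 2000, 0.5
cl = np.repeat(np.arange(G), m)   # cluster labels

def one_draw(cluster_tau):
    y0 = np.repeat(rng.normal(size=G), m) + rng.normal(size=n)  # clustered y_i(0)
    if cluster_tau:
        tau = 1.0 + np.repeat(rng.normal(size=G), m)  # clustered effects
    else:
        tau = 1.0 + rng.normal(size=n)                # independent effects
    x = (rng.random(n) < p).astype(float)             # unit-level assignment
    y = y0 + x * tau
    Z = np.column_stack([x, np.ones(n)])
    theta = np.linalg.solve(Z.T @ Z, Z.T @ y)
    u = y - Z @ theta
    ZZinv = np.linalg.inv(Z.T @ Z)
    Zu = Z * u[:, None]
    hc = Zu.T @ Zu                                    # HC "meat"
    S = np.zeros((2, 2))                              # cluster "meat"
    for g in range(G):
        zg = Zu[cl == g].sum(axis=0)
        S += np.outer(zg, zg)
    se_hc = np.sqrt((ZZinv @ hc @ ZZinv)[0, 0])
    se_cl = np.sqrt((ZZinv @ S @ ZZinv)[0, 0])
    return theta[0], se_hc, se_cl

for cluster_tau in (False, True):
    out = np.array([one_draw(cluster_tau) for _ in range(reps)])
    print("clustered effects:", cluster_tau,
          "| MC sd %.4f  HC SE %.4f  cluster SE %.4f"
          % (out[:, 0].std(), out[:, 1].mean(), out[:, 2].mean()))
```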
3 Main Theory
We now present a rigorous theory showing that the variance formula $\hat\sigma^2(Z'Z)^{-1}$ for the OLS estimator is valid in the strongly exogenous case even if the errors in the regression model are correlated. We also derive the variance formula in the case of conditional heteroscedasticity. In what follows, we suppose an outcome $y_i$ satisfies
$$y_i = z_i'\theta + \varepsilon_i, \qquad i = 1, \ldots, n, \qquad (9)$$
where $z_i := (x_i, w_i')'$ is a vector of regressors, with $x_i$ being a key regressor and $w_i$ being a vector of controls, $\theta := (\beta, \gamma')'$ is a vector of parameters, with $\beta$ being a parameter of interest and $\gamma$ being a vector of nuisance parameters, and $\varepsilon_i$ is an error term with mean zero. More explicitly, the regression model (9) can be rewritten as
$$y_i = x_i\beta + w_i'\gamma + \varepsilon_i, \qquad i = 1, \ldots, n. \qquad (10)$$
We assume that the first component of the vector $w_i$ is a (non-zero) constant, meaning that the regression model (10) contains an intercept term. For notational simplicity, we set $Y := (y_1, \ldots, y_n)'$, $X := (x_1, \ldots, x_n)'$, $W := (w_1, \ldots, w_n)'$, $\varepsilon := (\varepsilon_1, \ldots, \varepsilon_n)'$, and $Z := (X, W)$.
Under a suitable exogeneity assumption on $\varepsilon$, the unknown parameter $\theta$ can be estimated by OLS:
$$\hat\theta := (\hat\beta, \hat\gamma')' = (Z'Z)^{-1}Z'Y.$$
Standard estimators of the asymptotic variance of $\hat\theta$ rely on the asymptotic variance of the "score" $n^{-1/2}\sum_{i=1}^n z_i\varepsilon_i$, which may take a complicated form due to possible dependence between observations. This makes standard error estimation and statistical inference challenging in practice. For instance, cluster-robust standard errors, such as those proposed by Moulton (1986), Liang and Zeger (1986), and Arellano (1987), are predicated on knowledge of the relevant group structure at which to cluster; see Hansen (2007), Ibragimov and Müller (2010), and Ibragimov and Müller (2016) for related discussion. Similarly, spatial standard errors, such as those proposed by Conley (1999), often require knowledge of a measure of "economic distance" that relates to the degree of dependence across observations. However, we next show that when $X$ is independent of $(W, \varepsilon)$, such as in randomized controlled trials, the asymptotic variance for $\hat\beta$ simplifies significantly. As a result, estimation of standard errors and asymptotically valid inference simplify as well.
3.1 Case of Strong Exogeneity
We first study the case in which $X$ is independent of $\varepsilon$, and hence $\varepsilon_i$ is conditionally homoscedastic with respect to $x_i$. Together with independence of $X$ from $W$, this means that $X$ is strongly exogenous. The conditional heteroscedasticity case is discussed later in this section. In the assumptions that follow, recall that $C$ denotes a sufficiently large constant that can change from place to place but is independent of the sample size $n$.
Assumption 1.
(i) The random variables $x_i$, $i = 1, \ldots, n$, are i.i.d. with mean $\mu_x$ and variance $\sigma_x^2$; (ii) $X$ is independent of $W$; (iii) there exists a constant $C$ such that $E[x_i^4] \leq C$; (iv) $\sigma_x^2 > 0$; (v) $X$ is independent of $\varepsilon$.
Assumption 2.
(i) there exists a constant $c > 0$ such that $\lambda_{\min}(n^{-1}\sum_{i=1}^n w_iw_i') \geq c$ with probability approaching one; (ii) $\max_{1\leq i\leq n} E[\|w_i\|^4] \leq C$; (iii) the first component of the vectors $w_i$, $i = 1, \ldots, n$, is a non-zero constant.
Assumption 3.
(i) $n^{-1}\sum_{i=1}^n w_i\varepsilon_i \to_p 0$; (ii) $n^{-1}\sum_{i=1}^n \varepsilon_i^2 \to_p \bar\sigma_\varepsilon^2$ for some constant $\bar\sigma_\varepsilon^2 > 0$; (iii) there exists a constant $C$ such that $\max_{1\leq i\leq n} E[\varepsilon_i^4] \leq C$; (iv) $\max_{1\leq i\leq n} E[\|w_i\varepsilon_i\|^2] \leq C$.
Assumption 1 contains our main requirements for the regressor of interest. In particular, Assumptions 1(i) and 1(ii) are our key conditions and are satisfied in many randomized controlled trials. Assumptions 1(iii) and 1(iv) are mild moment conditions. Assumption 1(v) means that $X$ is strongly exogenous. Assumption 2 contains our main requirements for the controls. In particular, Assumption 2(i) means that there is no multicollinearity among the controls. Assumption 2(ii) is a mild moment condition. Assumption 2(iii) means that we study regressions with an intercept. Assumption 3 contains our main requirements for the regression error. Assumption 3(i) holds if the $\varepsilon_i$'s are uncorrelated with the $w_i$'s and a law of large numbers applies to the products $w_i\varepsilon_i$. Assumption 3(ii) is essentially a law of large numbers for the $\varepsilon_i^2$'s. Assumptions 3(iii) and 3(iv) are mild moment conditions. We highlight that our assumptions allow for a wide array of dependence structures in the matrix $(W, \varepsilon)$, with the main condition in this regard intuitively being that the dependence be "weak" enough for the laws of large numbers imposed in Assumptions 3(i) and 3(ii) to apply.
For all $i = 1, \ldots, n$, denote $\tilde x_i := x_i - \mu_x$. The following theorem derives the asymptotic distribution of the OLS estimator $\hat\beta$ in the strongly exogenous case.

Theorem 1. Suppose Assumptions 1, 2, and 3 hold. Then $\sqrt{n}(\hat\beta - \beta) \to_d N(0, \bar\sigma_\varepsilon^2/\sigma_x^2)$, where $\bar\sigma_\varepsilon^2$ is the probability limit in Assumption 3(ii).
This theorem establishes two key facts. First, it shows that, given the strong exogeneity of $X$, our mild requirements on the dependence structure of $(W, \varepsilon)$ suffice for establishing asymptotic normality of $\hat\beta$. To establish such a conclusion, we rely on a martingale construction that generalizes our discussion in Section 2. Second, Theorem 1 establishes that the asymptotic variance of $\hat\beta$ is not affected by the possible correlation across the vectors $(w_i', \varepsilon_i)$, since it only depends on the variance of $x_i$ and the averaged variance of the error terms $\varepsilon_i$. We emphasize that neither of these conclusions need hold for the estimator $\hat\gamma$ of the coefficients corresponding to the vector of controls $w_i$.
In addition, Theorem 1 suggests that we can estimate the variance of $\hat\beta$ by the top left element of $\widehat V/n$, where
$$\widehat V := \hat\sigma^2\Big(\frac{1}{n}\sum_{i=1}^n z_iz_i'\Big)^{-1} \qquad (11)$$
and $\hat\sigma^2 := n^{-1}\sum_{i=1}^n \hat\varepsilon_i^2$ with $\hat\varepsilon_i := y_i - x_i\hat\beta - w_i'\hat\gamma$. The following corollary confirms this conjecture.

Corollary 1. Suppose Assumptions 1, 2, and 3 hold. Then
$$\widehat V_{11} \to_p \bar\sigma_\varepsilon^2/\sigma_x^2 \quad\text{and}\quad \frac{\sqrt{n}(\hat\beta - \beta)}{\widehat V_{11}^{1/2}} \to_d N(0, 1). \qquad (12)$$
Together with Theorem 1, this corollary is our first main result. Indeed, it is well known that the top left element of the matrix $\hat\sigma^2(Z'Z)^{-1}$ coincides with $\widehat V_{11}/n$. Therefore, this corollary justifies using the classic standard error formula for inference on $\beta$, even though we allow for general dependence structures in the errors $\varepsilon$ and controls $W$.
Remark. Theorem 1 and Corollary 1 can be extended to allow for dependence between $X$ and $W$. Indeed, suppose that $x_i$ depends linearly on $w_i$:
$$x_i = w_i'\pi + u_i,$$
where the $u_i$'s satisfy the conditions of Assumption 1 imposed on the $x_i$'s. By arguments that are similar to those in the proofs of Theorem 1 and Corollary 1, it is then straightforward to show that
$$\sqrt{n}(\hat\beta - \beta) \to_d N(0, \bar\sigma_\varepsilon^2/\sigma_u^2),$$
where $\sigma_u^2$ denotes the variance of the $u_i$'s, and that the convergence results (12) still hold. ∎
Remark. Theorem 1 and Corollary 1 can also be extended to instrumental variable (IV) estimators. Indeed, suppose that
$$x_i = \pi m_i + v_i, \qquad (13)$$
where $m_i$ is an instrumental variable satisfying the conditions of Assumption 1 imposed on $x_i$ and $v_i$ is a (first-stage) estimation error satisfying the conditions of Assumption 3 imposed on $\varepsilon_i$. In addition, denote $M := (m_1, \ldots, m_n)'$ and define $\tilde Z := (M, W)$ the same way as $Z$ with $X$ replaced by $M$. The two-stage least squares (2SLS) estimator $\hat\theta_{2SLS} := (\hat\beta_{2SLS}, \hat\gamma_{2SLS}')'$ of $\theta$ then satisfies
$$\hat\theta_{2SLS} = (\tilde Z'Z)^{-1}\tilde Z'Y.$$
Using Assumptions 2 and 3, it is then possible to show that
$$\sqrt{n}(\hat\beta_{2SLS} - \beta) \to_d N\big(0,\ \bar\sigma_\varepsilon^2/(\pi^2\sigma_m^2)\big),$$
where $\mu_m$ is the mean of the $m_i$'s and $\sigma_m^2$ is the variance of the $m_i$'s. Therefore, the asymptotic variance once again depends on the dependence structure of the errors only through the averaged variance $\bar\sigma_\varepsilon^2$. It is clear that $\bar\sigma_\varepsilon^2/\sigma_m^2$ can be estimated by the same $\widehat V_{11}$ as that in (11) with $X$ replaced by $M$, $\pi$ can be estimated by OLS on the (first-stage) regression (13), and the asymptotic variance can then be estimated by $\widehat V_{11}/\hat\pi^2$, where $\hat\pi$ denotes the first-stage OLS estimator of $\pi$. ∎
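To illustrate this remark numerically (a hedged sketch; the instrument design, coefficients, and error processes are our own assumptions), the code below runs just-identified 2SLS with an i.i.d. instrument and autocorrelated structural and first-stage errors; the conventional homoscedastic 2SLS standard error tracks the Monte Carlo standard deviation of the 2SLS estimate of $\beta$:

```python
# Hedged illustration: i.i.d. instrument, autocorrelated errors, 2SLS.
import numpy as np

rng = np.random.default_rng(3)
n, reps, beta, pi = 500, 2000, 1.0, 0.8

def ar1(n, rho, rng):
    e = np.empty(n)
    e[0] = rng.normal(scale=1.0 / np.sqrt(1 - rho**2))
    for t in range(1, n):
        e[t] = rho * e[t - 1] + rng.normal()
    return e

bhat, se = np.empty(reps), np.empty(reps)
for r in range(reps):
    m = rng.normal(size=n)            # i.i.d. instrument, independent of errors
    v = ar1(n, 0.7, rng)              # autocorrelated first-stage error
    eps = ar1(n, 0.8, rng) + 0.5 * v  # structural error correlated with v
    x = pi * m + v                    # x is endogenous
    y = beta * x + eps
    Z = np.column_stack([m, np.ones(n)])   # instrument + intercept
    X = np.column_stack([x, np.ones(n)])
    theta = np.linalg.solve(Z.T @ X, Z.T @ y)  # just-identified 2SLS
    resid = y - X @ theta
    s2 = resid @ resid / n
    A = np.linalg.inv(Z.T @ X)
    V = s2 * A @ (Z.T @ Z) @ A.T      # conventional homoskedastic IV variance
    bhat[r], se[r] = theta[0], np.sqrt(V[0, 0])

print("MC sd %.4f | mean 2SLS homoskedastic SE %.4f" % (bhat.std(), se.mean()))
```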
3.2 Case of Conditional Heteroscedasticity
Next, we derive the asymptotic variance of $\hat\beta$ in the case of conditional heteroscedasticity, i.e., when $\varepsilon_i$ is conditionally heteroscedastic with respect to $x_i$. Following the notation introduced in Section 2, we focus on the case in which $\varepsilon_i = \sigma(x_i)'e_i$, where $\sigma(\cdot)$ is a $k$-vector of functions and $e_i$ is a $k$-vector with mean zero, $E[e_i] = 0_k$, for all $i$.

Let $s_i := \tilde x_i\sigma(x_i)$ for all $i = 1, \ldots, n$ and observe that under Assumption 1(i), the random vectors $s_i$ are i.i.d. Denote their common mean vector by $\mu_s := E[s_i]$. In addition, denote $\xi_i := \mu_s'e_i$ for all $i = 1, \ldots, n$. Within this context, we impose the following assumptions.
Assumption 4.
(i) $X$ is independent of $(W, e_1, \ldots, e_n)$; (ii) the functions $\sigma_j(\cdot)$, $j = 1, \ldots, k$, are bounded; (iii) $\max_{1\leq i\leq n} E[\|e_i\|^4] \leq C$.
Assumption 5.
(i) $n^{-1}\sum_{i=1}^n e_ie_i' \to_p \Sigma_e$ for some matrix $\Sigma_e$; (ii) there exists a constant $C$ such that $\max_{1\leq i\leq n} E[\xi_i^4] \leq C$; (iii) $n^{-1/2}\sum_{i=1}^n \xi_i \to_d N(0, \omega^2)$ for some constant $\omega^2 \geq 0$.
Assumption 4 is mainly used to replace Assumption 1(v) and accounts for the conditional heteroscedasticity of $\varepsilon_i$ with respect to $x_i$. Assumption 4(i) requires that the regressors be strongly exogenous with respect to the "scaled" error vectors $e_i$ – a requirement that, as discussed in Section 2, maps well into a potential outcome framework with heterogeneous treatment effects. Assumption 4(ii) imposes upper bounds on the functions $\sigma_j(\cdot)$, which we view as a mild regularity condition. Assumption 4(iii) is a mild moment condition. Assumption 5 contains further restrictions on the vectors $e_i$. Assumption 5(i) is essentially a law of large numbers for the $e_ie_i'$'s. Assumption 5(ii) is essentially a moment condition for the random variables $\xi_i$. Assumption 5(iii) limits the amount of dependence among the vectors $e_i$ to ensure convergence in distribution.
The following theorem, which is our second main result, derives the asymptotic distribution of the OLS estimator $\hat\beta$ in the case of conditional heteroscedasticity.

Theorem 2. Suppose Assumptions 1(i)-(iv), 2, 4, and 5 hold. Then
$$\sqrt{n}(\hat\beta - \beta) \to_d N(0, V), \qquad V := \lim_{n\to\infty}\frac{1}{\sigma_x^4}\,\mathrm{Var}\Big(\frac{1}{\sqrt{n}}\sum_{i=1}^n \tilde x_i\varepsilon_i\Big). \qquad (14)$$
To establish this theorem, we decompose the score into the sum of two uncorrelated terms,
$$\frac{1}{\sqrt{n}}\sum_{i=1}^n \tilde x_i\varepsilon_i = \frac{1}{\sqrt{n}}\sum_{i=1}^n (s_i - \mu_s)'e_i + \frac{1}{\sqrt{n}}\sum_{i=1}^n \mu_s'e_i,$$
and observe that the first term on the right-hand side forms a martingale difference sequence while the second term is asymptotically normal by assumption. Combining these facts, we are able to obtain asymptotic normality of the sum; see the detailed proof in the Appendix.
Like Theorem 1, this theorem establishes two key facts. To see both of them, observe that the term $V$ appearing in the convergence result (14) can be more explicitly rewritten as
$$V = \lim_{n\to\infty}\frac{1}{\sigma_x^4}\Big(\frac{1}{n}\sum_{i=1}^n E[\tilde x_i^2\varepsilon_i^2] + \frac{1}{n}\sum_{i\neq j} E[(\mu_s'e_i)(\mu_s'e_j)]\Big). \qquad (15)$$
This expression in turn implies that the asymptotic variance of the OLS estimator now depends on the correlation across the vectors $e_i$, which means that heteroscedasticity of the regression errors forces OLS variance estimators to be adjusted for correlation across the errors. On the other hand, expression (15) also demonstrates that it suffices to adjust the variance estimators only for correlation across the random variables $\xi_i = \mu_s'e_i$, instead of correlation across the full vectors $e_i$. The latter point might seem like a minor technicality, but it in fact plays an interesting role in models with heterogeneous treatment effects. Indeed, when $x_i$ represents the treatment assignment status, $(y_i(0), y_i(1))$ represents the pair of potential outcomes without and with treatment, so that $y_i = y_i(0) + x_i(y_i(1) - y_i(0))$, and $w_i$ consists of a non-zero constant only, it follows that $y_i$ takes the form (7), which matches the setting here with $\sigma(x) = (x, 1)'$ and $e_i = (\tau_i - \beta,\, y_i(0) - \alpha)'$. Hence, $\mu_s = (\sigma_x^2, 0)'$, and so the expression for the asymptotic variance of $\hat\beta$ reduces to the probability limit of
$$\frac{1}{n}\sum_{i=1}^n \frac{E[\tilde x_i^2\varepsilon_i^2]}{\sigma_x^4} + \frac{1}{n}\sum_{i\neq j} E[(\tau_i - \beta)(\tau_j - \beta)],$$
as previewed in the previous section; see expression (8) there. In turn, this expression means that it suffices to adjust OLS variance estimation for correlation across treatment effects, and there is no need to worry about correlation of potential outcomes within any given treatment arm. For example, whenever treatment effects are independent across units, it suffices to use the usual heteroscedasticity-robust variance formulas, even if the regression errors are correlated. Note, however, that it is necessary to use heteroscedasticity-robust variance formulas even if the treatment effects are i.i.d. (and not just independent), as the formula $\hat\sigma^2(Z'Z)^{-1}$ is not valid in this case. Moreover, the same results apply even if the $w_i$'s include non-constant controls as well, as the component of the $e_i$'s corresponding to the treatment effects remains the same in this case.
Finally, we note that as long as the form of the correlation across the random variables $\xi_i = \mu_s'e_i$ is known, estimation of the asymptotic variance based on Theorem 2 is conceptually straightforward. For example, if treatment effects are clustered, it suffices to use Moulton (1986)'s or Liang and Zeger (1986)'s formulas assuming that the regression errors are clustered at the same level (even though they could be clustered at a different level because of the clustering of potential outcomes without treatment, for example). For brevity, we do not provide formal statements of such results, as they are case-specific and depend on the form of the correlation structure, e.g., time series versus cluster versus spatial dependence.
4 Group-Level Randomization
A key assumption behind the results of Section 3 is that the regressor of interest is independent and identically distributed across units. In randomized controlled trials, such an assumption is satisfied for completely randomized assignments but can fail under other randomization protocols. For instance, the i.i.d. assumption on the treatment fails when treatment is assigned at a group level. Motivated by this challenge, we suppose in this section that
$$y_{gi} = x_g\beta + w_{gi}'\gamma + \varepsilon_{gi}, \qquad i = 1, \ldots, n_g,\ g = 1, \ldots, G, \qquad (16)$$
where the index $g$ denotes the group membership, the index $i$ denotes units within group $g$, $G$ is the number of groups, $n_g$ is the number of units within group $g$, $x_g$ denotes the regressor of interest, which is invariant within group $g$, $w_{gi}$ denotes the vector of controls, and $\varepsilon_{gi}$ is a mean-zero regression error. We continue to assume that the first component of each $w_{gi}$ is a non-zero constant and also continue to employ $n$ as the total number of observations, $n = \sum_{g=1}^G n_g$.
We emphasize at this point that the index $g$ distinguishes the level at which the regressor of interest varies, but it has no special significance for other variables. In particular, we do not insist that any (potential) clustering structure of the errors be the same as the group structure specified by the regressor $x_g$. In other words, the errors $\varepsilon_{gi}$ may have a dependence structure completely different from the group structure determined by $x_g$ – e.g., $\varepsilon_{gi}$ may be correlated with $\varepsilon_{g'i'}$ even when $g \neq g'$. This discussion will be formalized in the nature of the regularity conditions below, which do not include, for example, the random effects type specification as in Moulton (1986). Instead, as in the previous section, we will rely on a martingale structure based on $\{x_g\}$ that will be crucial for understanding the asymptotic distribution of $\hat\beta$.
Following our discussion in the previous section, we consider the cases of strong exogeneity and conditional heteroscedasticity separately. We first consider the case of strong exogeneity. In order to study the asymptotic properties of $\hat\beta$, we first need to revise Assumptions 1 and 3 to account for the group structure and a possible within-group correlation. To this end, we let $\varepsilon_g^+ := \sum_{i=1}^{n_g}\varepsilon_{gi}$ and $\bar n := \max_{1\leq g\leq G} n_g$ and impose the following assumptions.
Assumption 6.
(i) The random variables $x_g$, $g = 1, \ldots, G$, are i.i.d. with mean $\mu_x$ and variance $\sigma_x^2$; (ii) $\{x_g\}$ is independent of $\{w_{gi}\}$; (iii) $E[x_g^4] \leq C$; (iv) $\sigma_x^2 > 0$; (v) $\{x_g\}$ is independent of $\{\varepsilon_{gi}\}$.
Assumption 7.
(i) $n^{-1}\sum_{g=1}^G\sum_{i=1}^{n_g} w_{gi}\varepsilon_{gi} \to_p 0$; (ii) $n^{-1}\sum_{g=1}^G (\varepsilon_g^+)^2 \to_p \bar\omega^2$ for some constant $\bar\omega^2 > 0$; (iii) there exists a constant $C$ such that $\max_{1\leq g\leq G} E[(\varepsilon_g^+)^4]/n_g^2 \leq C$; (iv) $\bar n^2/n \to 0$.
Assumption 6 requires mild moment restrictions and that the regressor of interest be strongly exogenous in the sense that it be independent of the errors and the other regressors. To make sense of Assumption 7, assume that each group has the same size $\bar n$ that is independent of $n$. In this case, $G$ is of order $n$ and $(\varepsilon_g^+)^2$ is typically of order one. In turn, the latter implies that Assumption 7(i) reduces to a requirement similar to Assumption 3(i), and Assumption 7(iii) reduces to a bound on the fourth moments of the within-group error sums, which is satisfied as long as the errors have bounded fourth moments. In addition, Assumption 7(iv) reduces to $\bar n^2/n \to 0$ and is satisfied automatically, and Assumption 7(ii) can be regarded as a law of large numbers. Thus, Assumption 7 in general requires that the size of the groups not increase too fast. Note also that, as previously claimed, this assumption does not require $\varepsilon_{gi}$ to be independent across groups $g$.
We are now ready to derive the asymptotic distribution of the OLS estimator $\hat\beta$ in the strongly exogenous case with a group-level assignment. In the statement of the result, we impose Assumption 2, which should be interpreted to hold with $w_{gi}$ in place of $w_i$ in Assumptions 2(i) and 2(ii). Also, we denote $\tilde x_g := x_g - \mu_x$ for all $g = 1, \ldots, G$.

Theorem 3. Suppose Assumptions 2, 6, and 7 hold. Then
$$\sqrt{n}(\hat\beta - \beta) \to_d N(0, \bar\omega^2/\sigma_x^2). \qquad (17)$$
The implications of this theorem are the following. First, the OLS estimator $\hat\beta$ is asymptotically normally distributed under mild moment conditions and under fairly general assumptions on the dependence structure of $(w_{gi}, \varepsilon_{gi})$. Second, the asymptotic variance of $\hat\beta$ is given by $\bar\omega^2/\sigma_x^2$. In particular, since $\bar\omega^2$ is the limit of $n^{-1}\sum_{g=1}^G (\varepsilon_g^+)^2$ and thus involves only within-group products of the errors, when computing standard errors for $\hat\beta$ we need only account for possible within-group correlation even if $\varepsilon_{gi}$ is dependent across groups. Importantly, the group structure is determined solely by the variables $x_g$ and hence is known, considerably simplifying estimation. For instance, in a randomized controlled trial with constant treatment effects, Theorem 3 implies we may, for example, employ Moulton (1986)'s or Liang and Zeger (1986)'s formulas clustered at the level at which treatment was assigned. We again emphasize, however, that similar conclusions do not apply for $\hat\gamma$, whose standard errors and asymptotic normality may depend on the dependence structure of $(w_{gi}, \varepsilon_{gi})$ across groups.
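A hedged numerical sketch of this implication (all data-generating choices are our own): treatment is assigned at the village level with a constant effect, while the errors carry region-level shocks that correlate them across villages; standard errors clustered at the assignment (village) level nevertheless match the Monte Carlo variability of $\hat\beta$:

```python
# Hedged illustration: village-level assignment, errors correlated ACROSS
# villages via region shocks; clustering at the assignment level suffices.
import numpy as np

rng = np.random.default_rng(4)
R, Gr, m = 10, 10, 10                  # 10 regions x 10 villages x 10 units
G, n, reps, beta = R * Gr, R * Gr * m, 2000, 1.0
village = np.repeat(np.arange(G), m)

bhat, se_cl = np.empty(reps), np.empty(reps)
for r in range(reps):
    eps = (np.repeat(rng.normal(size=R), Gr * m)   # region shock
           + np.repeat(rng.normal(size=G), m)      # village shock
           + rng.normal(size=n))                   # idiosyncratic noise
    x = np.repeat((rng.random(G) < 0.5).astype(float), m)  # village assignment
    y = beta * x + eps
    Z = np.column_stack([x, np.ones(n)])
    theta = np.linalg.solve(Z.T @ Z, Z.T @ y)
    u = y - Z @ theta
    ZZinv = np.linalg.inv(Z.T @ Z)
    S = np.zeros((2, 2))
    for g in range(G):                             # cluster at assignment level
        zg = (Z[village == g] * u[village == g, None]).sum(axis=0)
        S += np.outer(zg, zg)
    bhat[r] = theta[0]
    se_cl[r] = np.sqrt((ZZinv @ S @ ZZinv)[0, 0])

print("MC sd of beta-hat %.4f | mean village-clustered SE %.4f"
      % (bhat.std(), se_cl.mean()))
```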
Next, we consider the case of conditional heteroscedasticity, i.e., $\varepsilon_{gi} = \sigma(x_g)'e_{gi}$ with mean-zero vectors $e_{gi}$. Denote $s_g := \tilde x_g\sigma(x_g)$ and $\mu_s := E[s_g]$. Note that $\mu_s$ here actually coincides with $\mu_s$ in the previous section. Within this context, we impose the following assumptions.
Assumption 8.
(i) $\{x_g\}$ is independent of $\{(w_{gi}', e_{gi}')\}$; (ii) the functions $\sigma_j(\cdot)$, $j = 1, \ldots, k$, are bounded; (iii) $\max_{g,i} E[\|e_{gi}\|^4] \leq C$.
Assumption 9.
(i) $n^{-1}\sum_{g=1}^G e_g^+(e_g^+)' \to_p \Sigma_e$ for some matrix $\Sigma_e$, where $e_g^+ := \sum_{i=1}^{n_g} e_{gi}$; (ii) $\max_{1\leq g\leq G} E[(\mu_s'e_g^+)^4]/n_g^2 \leq C$; (iii) $n^{-1/2}\sum_{g=1}^G \mu_s'e_g^+ \to_d N(0, \omega^2)$ for some constant $\omega^2 \geq 0$.
These assumptions naturally extend Assumptions 4 and 5 in the previous section to allow for group-level assignments.
The next theorem derives the asymptotic distribution of the OLS estimator $\hat\beta$ in the case of conditional heteroscedasticity with a group-level assignment.

Theorem 4. Suppose Assumptions 2, 6(i)-(iv), 8, and 9 hold. Then
$$\sqrt{n}(\hat\beta - \beta) \to_d N(0, V), \qquad V := \lim_{n\to\infty}\frac{1}{\sigma_x^4}\,\mathrm{Var}\Big(\frac{1}{\sqrt{n}}\sum_{g=1}^G \tilde x_g\varepsilon_g^+\Big). \qquad (18)$$
This theorem relates to Theorem 3 in the same way as Theorem 2 relates to Theorem 1. In particular, noting that the term $V$ appearing in this theorem can be more explicitly rewritten as
$$V = \lim_{n\to\infty}\frac{1}{\sigma_x^4}\Big(\frac{1}{n}\sum_{g=1}^G E[\tilde x_g^2(\varepsilon_g^+)^2] + \frac{1}{n}\sum_{g\neq g'} E[(\mu_s'e_g^+)(\mu_s'e_{g'}^+)]\Big),$$
we conclude that, because of conditional heteroscedasticity, variance estimators that are clustered at the group level at which the regressor of interest is assigned may not be valid if the errors are correlated across groups. On the other hand, in the context of estimation with heterogeneous treatment effects, such estimators are valid if the treatment effects are uncorrelated across these groups.
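Continuing the earlier village-level sketch under heterogeneous treatment effects (again our own assumptions, for illustration only): when the treatment effects share a region-level component, village-clustered standard errors understate the variability of $\hat\beta$, while clustering at the region level, where the treatment effects are correlated, restores agreement:

```python
# Hedged illustration: region-correlated treatment effects break clustering
# at the (village) assignment level; region-level clustering is adequate.
import numpy as np

rng = np.random.default_rng(5)
R, Gr, m = 50, 4, 5                    # 50 regions x 4 villages x 5 units
G, n, reps = R * Gr, R * Gr * m, 1000
village = np.repeat(np.arange(G), m)
region = np.repeat(np.arange(R), Gr * m)

def cluster_se(Z, u, ids):
    # cluster-robust variance, clustered on the labels in `ids`
    ZZinv = np.linalg.inv(Z.T @ Z)
    S = np.zeros((Z.shape[1], Z.shape[1]))
    for c in np.unique(ids):
        zc = (Z[ids == c] * u[ids == c, None]).sum(axis=0)
        S += np.outer(zc, zc)
    return np.sqrt((ZZinv @ S @ ZZinv)[0, 0])

out = np.empty((reps, 3))
for r in range(reps):
    y0 = np.repeat(rng.normal(size=G), m) + rng.normal(size=n)  # clustered y(0)
    tau = 1.0 + np.repeat(rng.normal(size=R), Gr * m)  # region-correlated effects
    x = np.repeat((rng.random(G) < 0.5).astype(float), m)  # village assignment
    y = y0 + x * tau
    Z = np.column_stack([x, np.ones(n)])
    theta = np.linalg.solve(Z.T @ Z, Z.T @ y)
    u = y - Z @ theta
    out[r] = theta[0], cluster_se(Z, u, village), cluster_se(Z, u, region)

print("MC sd %.4f | village-clustered SE %.4f | region-clustered SE %.4f"
      % (out[:, 0].std(), out[:, 1].mean(), out[:, 2].mean()))
```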
References
- Abadie, A., S. Athey, G. W. Imbens, and J. Wooldridge (2017): “When Should You Adjust Standard Errors for Clustering?,” Discussion paper, National Bureau of Economic Research.
- Andrews, D. W. K. (1991): “Heteroskedasticity and Autocorrelation Consistent Covariance Matrix Estimation,” Econometrica, 59(3), 817–858.
- Arellano, M. (1987): “Computing Robust Standard Errors for Within-Groups Estimators,” Oxford Bulletin of Economics and Statistics, 49(4), 431–434.
- Barrios, T., R. Diamond, G. Imbens, and M. Kolesar (2012): “Clustering, Spatial Correlations, and Randomization Inference,” Journal of the American Statistical Association, 107, 578–591.
- Bloom, H. S. (2005): Learning More from Social Experiments: Evolving Analytic Approaches. Russell Sage Foundation.
- Conley, T. G. (1999): “GMM Estimation with Cross Sectional Dependence,” Journal of Econometrics, 92(1), 1–45.
- DasGupta, A. (2008): Asymptotic Theory of Statistics and Probability. Springer.
- Duflo, E., R. Glennerster, and M. Kremer (2007): “Using Randomization in Development Economics Research: A Toolkit,” Handbook of Development Economics, 4, 3895–3962.
- Hall, P., and C. C. Heyde (1980): Martingale Limit Theory and Its Application. Academic Press, New York.
- Hansen, C. B. (2007): “Asymptotic Properties of a Robust Variance Matrix Estimator for Panel Data When T Is Large,” Journal of Econometrics, 141(2), 597–620.
- Ibragimov, R., and U. K. Müller (2010): “t-Statistic Based Correlation and Heterogeneity Robust Inference,” Journal of Business & Economic Statistics, 28(4), 453–468.
- Ibragimov, R., and U. K. Müller (2016): “Inference with Few Heterogeneous Clusters,” Review of Economics and Statistics, 98(1), 83–96.
- Liang, K.-Y., and S. Zeger (1986): “Longitudinal Data Analysis Using Generalized Linear Models,” Biometrika, 73, 13–22.
- Moulton, B. R. (1986): “Random Group Effects and the Precision of Regression Estimates,” Journal of Econometrics, 32(3), 385–397.
- Newey, W. K., and K. D. West (1987): “A Simple, Positive Semi-Definite, Heteroskedasticity and Autocorrelation Consistent Covariance Matrix,” Econometrica, 55(3), 703–708.
- White, H. (2014): Asymptotic Theory for Econometricians. Academic Press.
Appendix
Appendix A Proof of the main results
Proof of Theorem 1. Define $\tilde x_i := x_i - \mu_x$ for all $i$ and $\tilde X := (\tilde x_1, \ldots, \tilde x_n)'$. Then, given that the matrix $W$ includes a non-zero constant column by Assumption 2(iii), it follows that the residuals from the least squares regression of $X$ on $W$ coincide with those from the regression of $\tilde X$ on $W$. Therefore, by the Frisch-Waugh-Lovell theorem,
(19) |
Also, denoting , we have
by Assumptions 1(i,iii), and so by Markov’s inequality. Hence,
by Assumption 1(iii) again. In addition,
by the von Bahr-Esseen Inequality (see Section 35.1.5 in DasGupta (2008)) and Assumptions 1(i,iii). Hence, by Markov’s inequality,
and so
(20) |
Further,
(21) |
By Assumptions 1(i, ii) and 2(iii), we further obtain
Hence, by Markov’s inequality,
(22) |
Similarly, we can use Assumptions 1(i, iii) and 2(ii) to show
(23) |
which together with (21) and (22) implies that
(24) |
Combining this result with (20) and using Assumptions 2(i), we then have
(25) |
For the term in the numerator of in (19), we have
(26) |
where the second equality follows by the definition of , (24), and Assumptions 2(i) and 3(i), and the third equality follows by Assumptions 1(i, iii) and 3(iii) and Markov’s inequality. Therefore,
(27) |
by Assumption 1(iv), and so
(28) |
We next derive the asymptotic distribution of the score. Let $\mathcal{F}_i$ denote the filtration generated by $(\varepsilon, W, x_1, \ldots, x_i)$. Then, by Assumptions 1(iii,iv) and 3(iii,iv), $\tilde x_i\varepsilon_i$ has finite second moment and
(29) |
almost surely, which implies that is a martingale difference array with respect to . Next, observe that Assumptions 1(i,v) and 3(ii,iv) yield
(30) |
Moreover for any , Assumptions 1(iii,iv,v) and 3(iii,iv) allow us to conclude that for ,
(31) |
In view of (30) and (31), we can invoke the martingale central limit theorem (see, e.g., Corollary 3.1 in Hall and Heyde (1980)) to conclude that
(32) |
The claim of the theorem follows from combining this result with (28).
Proof of Corollary 1. We first prove that . To do so, observe that
by Assumptions 1(i,iii) and the law of large numbers. Also, denoting for all , we have
where the last equality follows from (22) in the proof of Theorem 1. Hence,
where is the dimension of the vectors . Now, denote
and fix any and such that . If , then
by Assumptions 1(iii,iv) and 2(i). If, on the other hand, , then , and so
The latter in turn implies via Jensen’s inequality that
by Assumption 2(i). Hence, it follows that
and so
(33) |
as required.
Next, we prove that $\hat\sigma^2$ is consistent for $\bar\sigma_\varepsilon^2$. To this end, note that by Assumptions 1(i, iii, v) and 3(iii),
(34) |
Hence, by Assumptions 2(iii) and 3(i) and Markov’s inequality
which combined with Assumption 3(i) further implies that
(35) |
Combining this bound with (33) gives
(36) |
In addition, . Therefore,
(37) |
where the last equality follows from (35) and (36) and Assumptions 1(iii), 2(ii), and 3(ii). This finishes the proof of consistency of $\hat\sigma^2$.
Now, defining to match the proof of Theorem 1 and recalling that , we have
where the third equality follows from (24) and Assumption 2(i), and the fourth from (20). Together with (37), this gives the first convergence result in (12) since by Assumption 1(iii). The second convergence result follows from the first one, Theorem 1, and Slutsky’s theorem.
Proof of Theorem 2. As in the proof of Theorem 1, we have
(38) |
see Equation (27) there and note that the derivation of (27) did not rely on Assumption 1(v), which we are not imposing here. Since by Assumption 4(iii) and by Assumption 1(iii), (38) implies that
which yields the equality in (14).
We next derive the convergence result in (14), i.e. we show that . To do so, we write
and denote the first and the second terms on the right-hand side by and , respectively. Also, we denote for all . In addition, denote , where is the filtration generated by . Then
(39) |
where the first equality follows from Assumption 4(i), the second and the third from properties of the trace operator , the fourth from Assumptions 1(i,iii), 4(ii), and 5(i), the fifth from Assumption 4(i) and properties of the trace operator , and the sixth from the definition of .
Next, let be a pair of independent standard normal random variables that is independent of everything else. Then for ,
(40) |
where the first inequality follows from a version of the Berry-Esseen theorem (see Section 35.1.9 in DasGupta (2008)), the second inequality from the Cauchy-Schwarz inequality and Assumptions 1(iii) and 4(ii), and the last bound from (39) and Assumptions 4(iii) and 5(ii). Therefore, for any ,
(41) |
where the first equality follows from the law of iterated expectations (LIE), the second from (40) by noting that the difference of two probabilities is always a number between zero and one to conclude that in (40) satisfies , the third from the LIE, the fourth from (39) and Assumption 4(iii), the fifth from the LIE, the sixth from Assumption 5(iii), the seventh from the LIE, and the eighth from noting that is a normal random variable with mean zero and variance . This gives and completes the proof of the theorem.
Proof of Theorem 3. For this proof, it will be convenient to denote , , and . In addition, denote and . Moreover, denote .
As in the proof of Theorem 1, given that the matrix includes a non-zero constant column by Assumption 2(iii), it follows that . Therefore, by the Frisch-Waugh-Lovell theorem,
(42) |
We first consider the denominator . Since , under Assumptions 6(i,iii) we have
(43) |
We therefore obtain from Markov’s inequality that
(44) |
and so
(45) |
by Assumption 6(iii). In addition, again under Assumptions 6(i,iii) we have
and so
(46) |
by Markov’s inequality. Combining results (45) and (46) then yields
(47) |
Further, since does not depend on , we have
(48) |
where the second equality is by Assumption 2(ii) and (44). Moreover, by Assumptions 2(ii) and 6(i, ii, iii),
Therefore, by Markov’s inequality we obtain
which together with (48) further shows that
(49) |
Collecting the results (47), and (49) and using Assumptions 2(i) and 7(iv), we get
(50) |
Next, we consider the numerator in (42). By (44) and Assumptions 7(i) and 2(iii),
Therefore,
(51) |
In addition,
by Assumption 7(i), which together with (49) and (51) and Assumption 2(i) shows
(52) |
Given that by Assumptions 6(i,iii) and 7(ii) and Markov’s inequality and that by Assumption 6(iv), we obtain from (42), (50), and (52) that
and so
(53) |
by Assumption 6(iii).
We next derive the asymptotic distribution of . Let denote the filtration generated by . Then by Assumptions 6(i,v),
almost surely, which implies that is a martingale difference array with respect to . Next, observe that Assumptions 6(i,v) and 7(ii) yield
(54) |
Moreover for any , Assumptions 6(iii,iv,v) and 7(iii) allow us to conclude that
Combining this bound with (54), we can invoke the martingale central limit theorem (see, e.g., Corollary 3.1 in Hall and Heyde (1980)) to conclude that
(55) |
The claim of the theorem follows from combining this result with (53).
Proof of Theorem 4. As in the proof of Theorem 3, we have
(56) |
see Equation (53) there and note that the derivation of (53) did not rely on Assumption 6(v), which we are not imposing here. Since and by Assumption 8(iii) and by Assumption 6(iii), (56) implies that
which yields the equality in (18).
We next derive the convergence result in (18), i.e. we show that . To do so, we write
and denote the first and the second terms on the right-hand side by and , respectively. Also, we denote for all . In addition, denote , where is the filtration generated by .
Then by the same argument as that used to derive (39) in the proof of Theorem 2, with Assumption 9(i) replacing Assumption 5. In addition, letting be a standard normal random variable that is independent of everything else, we have
by the same argument as that used to derive (40) in the proof of Theorem 2 with and Assumption 9(ii) replacing Assumption 5(ii). Finally,
for all by the same argument as that used to derive (41) in the proof of Theorem 2 with Assumption 9(iii) replacing Assumption 5(iii). This gives and completes the proof of the theorem.
Appendix B Asymptotic equivalence of two variance formulas
Lemma 1.
Consider the linear regression model in (1). Suppose: (i) the regressors $\{x_i\}$ are i.i.d. with zero mean and finite and nonzero variance $\sigma_x^2$; (ii) the errors $\{\varepsilon_i\}$ are covariance stationary with auto-covariance function $\gamma(\cdot)$ satisfying $\gamma(0) > 0$ and $\sum_{j=0}^\infty |\gamma(j)| < \infty$. For $X := (x_1, \ldots, x_n)'$, we then have
$$n\,\hat\sigma^2(X'X)^{-1} - n\,(X'X)^{-1}X'\Omega X(X'X)^{-1} \to 0 \quad\text{almost surely}, \qquad (57)$$
where $\Omega$ is the covariance matrix of $\varepsilon := (\varepsilon_1, \ldots, \varepsilon_n)'$.
Proof of Lemma 1. Note that
(58) |
where . Let denote the natural filtration generated by . Then, under the assumption that is i.i.d., it follows that is a martingale difference sequence with variance . Therefore we obtain that
(59) |
where the first inequality follows from and the last inequality is due to and by assumption. Hence, (59) establishes that . By the martingale strong law of large numbers (see, e.g., Theorem 3.76 in White (2014)) we can therefore deduce that almost surely as , which together with (58) yields
(60) |
Moreover, by condition (i) of the lemma and Kolmogorov's strong law of large numbers (see, e.g., Theorem 3.1 in White (2014)), $n^{-1}\sum_{i=1}^n x_i^2 \to \sigma_x^2$ almost surely as $n \to \infty$. Since $\sigma_x^2 > 0$ and $\gamma(0) > 0$, the claim of the lemma then follows from (60).