General Covariance-Based Conditions for Central Limit Theorems with Dependent Triangular Arrays
Abstract.
We present a general central limit theorem with simple, easy-to-check covariance-based sufficient conditions for triangular arrays of random vectors when all variables could be interdependent. The result is constructed from Stein’s method, but the conditions are distinct from those of related work. Existing approaches require checking bespoke conditions that in many contexts are difficult to verify scientifically and either (i) impose rigid structure, often difficult to interpret and lacking microfoundations (e.g., strong mixing in random fields) or (ii) allow more flexibility but limit the extent of correlations (e.g., sparsity restrictions in dependency graphs). Our approach, in contrast, permits researchers to work with high-level but intuitive conditions based on overall correlation.
We show that these covariance conditions nest standard assumptions studied in the literature such as m-dependence, mixing random fields, non-mixing autoregressive processes, and dependency graphs, which themselves need not imply each other.
We apply our result to practical settings previously not covered in the literature, such as treatment effects with spillovers in more settings than previously admitted, covariance matrices, processes with global dependencies such as epidemic spread and information diffusion, and spatial processes with Matérn dependencies.
Keywords: Central limit theorem, dependent data, Stein’s method
1. Introduction
In many contexts researchers use an interdependent set of random vectors to develop estimators and need to establish whether the estimators are asymptotically normal. In existing results, dependency is modeled in idiosyncratic ways, with perhaps unintuitive if not unappealing conditions describing the assumptions on correlation. Further, in many settings—such as treatment effects with spillovers, epidemic spread, information diffusion, general equilibrium effects, and so on—the correlation is non-zero across all random vectors for any finite set of observations, which is not allowed for in the literature.
We present simple covariance-based sufficient conditions for a central limit theorem to be applied to a triangular array of dependent random vectors. We use Stein’s method (Stein, 1986) to derive three high-level, easy-to-interpret conditions. Stein’s method is widely used in theoretical arguments (in fact, a special case of the argument here first appeared in Chandrasekhar and Jackson (2024)), and our goal is not to advance a new technique for proving limit theorems. Instead, we present a new approach to proving asymptotic normality that is extremely general, nesting many existing approaches. More importantly, it consists of easily interpretable conditions that are also easy to check. Researchers can check our conditions by thinking directly about correlations, using calculations that we demonstrate are straightforward in several consequential examples. This feature makes the result accessible to a wider range of researchers without having to derive bespoke limit theorems in every setting. Our approach also eases restrictions required by existing methods, again doing so in a way that is not idiosyncratically imposed by a specific setting.
We follow a literature that uses Stein’s method to prove asymptotic normality. Our goal is not to derive a new proof, but rather to provide much more general conditions that allow for some amount of dependence between all observations, but at the same time are minimal in terms of parametric/shape restrictions, making them flexible and easy to check.
To review, Stein’s method observes that E[f′(W) − W f(W)] = 0 for all continuously differentiable functions f (with E|f′(W)| finite) if and only if W has a standard normal distribution. So, when considering normalized sums taking the role of W, it is enough to show that this equality holds for all such functions asymptotically. Rinott and Rotar (2000) and Ross (2011), among others, provide a detailed view of the method. Proving normality using Stein’s method typically amounts to checking dependence conditions. Bolthausen (1982), for example, works with conditions on how quickly the difference between the probability of a joint set of events and its value under independence decays in temporal distance. So, as the mixing coefficients decay fast enough in distance, the proof proceeds by checking that the Stein argument follows under the mixing structure. Distance in time can be generalized to space and, further, to random fields. Existing literature is organized around a number of such conditions that apply in particular data settings.
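Stein’s characterization lends itself to a quick numerical illustration (our own sketch, not part of the paper’s argument): estimate E[f′(W) − W f(W)] by Monte Carlo with f(x) = sin(x), once for a standard normal W and once for a centered, unit-variance uniform variable.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stein's identity: E[f'(W) - W f(W)] = 0 for all nice f iff W ~ N(0, 1).
f, f_prime = np.sin, np.cos

w = rng.standard_normal(1_000_000)
stein_gap_normal = np.mean(f_prime(w) - w * f(w))

# A mean-zero, variance-one uniform variable does not satisfy the identity.
u = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), 1_000_000)
stein_gap_uniform = np.mean(f_prime(u) - u * f(u))

print(stein_gap_normal)   # close to zero
print(stein_gap_uniform)  # bounded away from zero (about -0.16)
```

A single test function failing to satisfy the identity is enough to rule out normality, which is why the method works with the whole class of smooth f.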
A peculiar consequence of this organization, however, is that these conditions generally lack a scientific basis for the assumed dependence structure. For example, spatial standard errors are often used—e.g., in Froot (1989); Conley (1999); Driscoll and Kraay (1998)—when conducting M-estimation (or GMM). However, in actual applications, for instance agricultural shocks such as rainfall, pests, or soil, it is not clear that the shocks follow a specific form of interdependence satisfying mixing conditions with the decay rates invoked in Conley (1999) (cross-sectionally) or Driscoll and Kraay (1998) (temporally and cross-sectionally). Surely shocks correlate over space, but it is hard to argue empirically that their correlation matches the required mixing conditions.
To take a different example, some models of social network formation orient themselves by embedding individuals in a random field to deliver central limit theorems. Distance in the metric space is, then, inversely proportional to the likelihood of a connection. Given the structure of the field, researchers can toggle dependence as a function of distance, reducing the size of the covariance sum. However, the structure of the field has implications for consequential properties of the graph, such as clustering patterns (Hoff et al., 2002; Lubold et al., 2023), that then often fail to match the empirical observations. As in the previous example, the researcher probably has some intuition as to whether graph properties (e.g., clustering) are likely or not, but assessing the reasonableness of specific mixing random field assumptions is all but impossible.
As an alternative to embedding variables in a metric space and toggling dependency by distance, a literature on dependency graphs emerged (Baldi and Rinott, 1989; Goldstein and Rinott, 1996; Ross, 2011) in part to be more flexible on the correlation structure. There, the cost is extreme sparsity: observations have indices in a graph, where those that are not edge-adjacent are independent. This provides a different strategy to apply Stein’s method, by creating for each observation a dependency neighborhood. Sufficient sparsity in the graph structure allows a central limit theorem to apply, despite not forcing a time- or space-like structure. Examples include Ross (2011); Goldstein and Rinott (1996) and a more general treatment in Chen and Shao (2004).
Embedding indices in a metric space and using a more unstructured dependency structure are similar in the sense that both constrain the total amount of correlation between the random variables. In principle there are on the order of n² components in this sum, but, via mixing conditions or sparsity conditions, the sum is assumed to be of lower order. However, in many settings—such as treatment effects with spillovers, epidemic spread, information diffusion, or general equilibrium effects—we see dependency structures that are neither random fields nor admit any conditional independencies. The correlation is non-zero for any finite set of observations across all random vectors of interest. Our approach allows for this general dependence.
In the remainder of this section, we provide a brief overview of our approach, which we formalize in the next section. We consider a triangular array of random vectors, which are neither necessarily independent nor identically distributed. We study conditions under which their appropriately normalized sample mean is asymptotically normally distributed. In principle, at any sample size, any two of the random vectors can be correlated. The proof follows the well-known Stein’s method, though we develop and apply specific bounds for our purpose. To apply Stein’s method, we first associate each random variable with a set of other random variables with which it could have a higher level of correlation, keeping in mind that it could in principle have some non-zero correlation with all variables.
We call these affinity sets, capturing the other random variables with which a given variable may have high correlations in a given dimension. We use the term affinity set rather than “dependency neighborhood” to emphasize the possibility of high and low non-zero covariance structures with arbitrary groupings that satisfy summation conditions. We provide sufficient conditions for asymptotic normality in terms of the total amount of covariance within an affinity set and the total amount of covariance across affinity sets. As long as, in the limit, the bulk of the interdependence in the overall mean comes from the covariances within affinity sets, asymptotic normality follows.
This yields substantially weaker conditions than in the previous literature. In Arratia et al. (1989), the authors present Chen’s method (Chen, 1975) for Poisson approximation rather than normality, which has a similar approach to ours in collecting random variables into dependency neighborhoods. While this results in nice finite-sample bounds, those bounds consist of three almost separate pieces, making them less amenable to analysis alongside growing sums of covariance of the samples. In fact, all five of the examples studied in Arratia et al. (1989) in which Chen’s method succeeds are limited to cases where at least one of these pieces is identically zero. Many empirically relevant examples do not have such zeros, and so our approach substantially expands the set of relevant applications.
To preview our conditions, presented formally below, take each variable to be scalar and centered. Consider the total covariance within the affinity sets and, for each variable, the sum of the random variables outside of its affinity set. Informally, our conditions are the following.
(1) Within affinity set covariance control: The average covariance between two random variables in an affinity set, when weighted by the realized magnitude of the reference variable, and the covariance between the averages given the magnitude of the reference variable, is small relative to the comparably adjusted covariance between the reference variables and their affinity sets.

(2) Cross affinity set covariance control: The average covariance across members of two different affinity sets (weighted by the reference variables) is sufficiently small compared to the squared covariance within affinity sets.

(3) Outside affinity set covariance control: The average absolute conditional covariance outside of affinity sets is small compared to the covariance within affinity sets.
These conditions are general and easy to check, as we show. They also further simplify in a number of cases such as with positive correlations (as in diffusion models and auto-regressive processes) and with binary variables.
The advantages of this approach, which we see as most salient for empirical scientists, are several-fold. First, we do not require a sparse dependency structure at any sample size. That is, there can be non-zero correlation between any pair of variables. Much of the dependency graph literature leverages an independence structure in constructing its bounds and, therefore, the bounds we build are different.
Second, because of this possibility of non-zero covariance across all random vectors, we organize our bounds through covariance conditions. We are reminded of a discussion in Chen (1975) in the context of Poisson approximation. Covariance conditions are easy to interpret and check and, from an applied perspective, often easier to justify via microfoundations.
Third, our result is for random vectors, and while the application of the Cramér-Wold device is simple in our setting—by the nature of how indexing works—it is useful to have and instructive for a practitioner.
Fourth, our setup nests many of the previous literature’s examples, most of which do not nest each other. We illustrate the utility of our central limit theorem through several distinct applications. We begin with an example of m-dependence in a stochastic process, noting that this implies many other commonly used types of mixing (Bradley, 2005). We then move to random fields, where we show an example with α- and φ-mixing. These mixing approaches require constructing idiosyncratic notions of dependence based on the underlying probability distribution that happen to imply bounds on the covariance function (see, for example, Rio (1993) or Rio (2017)), which, as noted above, are not derived from, and may not match, micro-economic foundations or scientific principles. Our covariance-based arguments are compact and direct, placing restrictions on the covariance explicitly and, thus, in a manner that is salient in the scientific context. They are also general rather than based on a specific model or type of dependence. We also show that our framework is applicable outside the context of mixing, giving examples with non-mixing autoregressive processes and dependency graphs, among others.
Fifth, we show that our generalizations permit a wider and more practical set of analyses that were otherwise ruled out or limited in the literature. This includes treatment effect models with spillovers, covariance matrices, and epidemic and diffusion models. Specifically, we extend the treatment-effects-with-spillovers analysis, as in Aronow and Samii (2017), to allow every individual’s exposure to treatment to possibly be increasing in every other node’s treatment assignment; nonetheless, the relevant estimator is still asymptotically normally distributed. This case is ubiquitous in practice: in diffusion, epidemic, and financial flow models, in principle a shock anywhere could theoretically impact any other node’s outcomes (albeit with potentially small odds). But this is assumed away in applied work because conventional central limit theorems do not cover such a case. We also show how a researcher can model covariance matrices without forcing a random field structure as in Conley (1999) or Driscoll and Kraay (1998). This allows applied researchers to proceed with greater generality, and permits structure across units that do not have a natural ordering, such as race, ethnicity, caste, and occupation.
The next two examples concern diffusion. First, we look at a sub-critical epidemic process with a number of periods longer than the graph’s diameter. So, whether an individual is infected is correlated with the infection status of any other individual (assuming a connected, unweighted graph). Again, this practical situation is excluded by the previous central limit theorems in the literature. Second, we look at diffusion in stochastic block models to show that our conditions characterize when asymptotic normality holds and when it does not.
Lastly, we turn to the setting of Zhan and Datta (2023): the estimation of neural network models with irregular spatial dependence, e.g., Matérn covariance functions. The authors provide the first proof of consistent estimation of the neural network model in this dependent setting. We show that the covariance structure of the residuals, on which the asymptotic distribution of the estimator depends, satisfies our main assumptions and our CLT.
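As a concrete illustration of the kind of covariance decay such spatial settings involve (the smoothness and range values below are our own illustrative assumptions, not those of Zhan and Datta (2023)), the Matérn covariance has simple closed forms at smoothness 1/2 and 3/2, and its mass over distance is summable:

```python
import numpy as np

# Matern covariance at smoothness nu = 1/2 and nu = 3/2, where it has simple
# closed forms (the exponential and once-differentiable cases).  The range
# parameter rho = 1 is an illustrative assumption.
def matern(d, nu, rho=1.0):
    if nu == 0.5:
        return np.exp(-d / rho)
    if nu == 1.5:
        s = np.sqrt(3.0) * d / rho
        return (1.0 + s) * np.exp(-s)
    raise ValueError("only nu in {0.5, 1.5} implemented here")

d = np.arange(0.0, 20.0)
for nu in (0.5, 1.5):
    c = matern(d, nu)
    # Geometric decay: the total covariance mass over distance is finite,
    # which is what makes distance-based affinity sets workable.
    print(nu, round(float(c.sum()), 3))
```

The geometric tail is what allows covariance outside a distance-based affinity set to be controlled in total, even though it is never exactly zero.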
The remainder of the paper is organized as follows. Section 2 proves our main result. Section 3 shows that the conditions we provide nest several commonly used characterizations of dependence: m-dependence, non-mixing autoregressive processes, random fields, and dependency graphs. Section 4 discusses the process for checking our conditions in several applications (peer effects models, socio-demographic distances, sub-critical diffusion models, stochastic block models, and irregularly observed spatial processes), while also yielding new results. Section 5 provides a discussion.
2. The Theorem
We consider a triangular array of random variables, each entry of which has finite variance (possibly varying with n). We work with the corresponding de-meaned variables and their sum. We suppress the dependency on n for clarity unless otherwise needed.
2.1. Affinity Sets
Each real-valued random variable has an affinity set, which can depend on n. We require each variable to belong to its own affinity set. Heuristically, the affinity set includes the indices for which the covariance with the reference variable is relatively high in magnitude, but not those for which the covariance is low.
There is no independence requirement at any and, in fact, our sufficient conditions for the central limit theorem bound the total sums of covariances within and across affinity sets. The precise construction of affinity sets is flexible, as long as these bounds on the respective total sums are respected.
2.2. The Central Limit Theorem
Let be a matrix which houses the bulk of covariance across observations and dimensions, summing across variables all the covariances of each variable and the others in its affinity set:
This is distinct from the total variance-covariance matrix, which includes terms outside of the affinity sets.
In what follows, we maintain the assumption that the Frobenius norm of this matrix diverges. We also presume that the variances are uniformly bounded above. Define, for each variable, the sum of the random variables outside of its affinity set.
Our first assumption is that the total mass of the variance-covariance is not driven by the covariance between members of a given affinity set, neither of which is the reference random variable itself. That is, given a reference variable, the covariance of two other variables in its affinity set is relatively small in total, across all such triples of variables, compared to the variance coming from the reference variables and their affinity sets.
Assumption 1 (Bound on total weighted-covariance within affinity sets).
The second assumption is that the total mass of the variance-covariance is not driven by random variables across affinity sets relative to two distinct reference variables. That is, given two random variables and , the aggregate amount of weighted covariance between two other random variables—each within one of the reference variables’ affinity sets—is small compared to the (squared) variance coming from the reference variable and its affinity sets.
Assumption 2 (Bound on total weighted-covariance across affinity sets).
The third assumption is that the total mass of variance-covariance is not driven by reference random variables and the variables outside of their affinity sets, again compared to the variance coming from the reference variable and its affinity sets.
Assumption 3 (Bound on total weighted-covariance from outside of affinity sets).
These three assumptions imply a central limit theorem.
The proof is provided in the Appendix. The argument follows by applying the Cramér-Wold device to the arguments following Stein’s method, as Chandrasekhar and Jackson (2024) argued for the univariate case. Since the Cramér-Wold device requires that, for every fixed vector of weights, the weighted sum satisfies a univariate central limit theorem (Biscio et al., 2018), we can consider a problem of scalar random variables with affinity sets. Then, by checking Assumptions 1-3 for that case, the result follows.
An important special case is where each affinity set is the variable itself. In that case, the conditions simplify to a total bound on the overall sum of covariances across variables (the univariate case is in Chandrasekhar and Jackson (2024)). It nests many cases in practice, and we provide an illustration in our second application.
Corollary 1.
If the affinity set of each variable is the variable itself, for every n, and
(i) , and
(ii) ,
then the conclusion of Theorem 1 holds.
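As an illustration of this singleton special case (the AR(1)-style covariance and the decay rate below are our own assumptions for the sketch, not the corollary’s exact conditions), one can tabulate how the diagonal and off-diagonal covariance masses grow with n; when the off-diagonal mass grows no faster than the diagonal mass, the spirit of the corollary’s covariance bound is met:

```python
import numpy as np

# With singleton affinity sets, the conditions amount to bounding the total
# covariance across variables.  Here we tabulate, for an AR(1)-style
# covariance Cov(X_i, X_j) = rho**|i - j| (an illustrative assumption), the
# diagonal mass (sum of variances) and the off-diagonal mass (total absolute
# covariance outside the singleton affinity sets).
def covariance_mass(n, rho):
    idx = np.arange(n)
    cov = rho ** np.abs(idx[:, None] - idx[None, :])
    diag_mass = float(np.trace(cov))               # equals n here
    offdiag_mass = float(np.abs(cov).sum()) - diag_mass
    return diag_mass, offdiag_mass

for n in (100, 500, 2000):
    d, o = covariance_mass(n, rho=0.5)
    # Both masses grow linearly in n, so off-diagonal covariance never
    # swamps the variance -- the heuristic behind the corollary.
    print(n, d, round(o / d, 3))
```

With geometric decay the off-diagonal-to-diagonal ratio stabilizes (here near 2) rather than exploding, which is exactly the behavior the singleton-affinity-set conditions are designed to capture.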
3. Models of Dependence
We first present four applications from the literature that prove asymptotic normality: (i) m-dependence, (ii) non-mixing autoregressive processes, (iii) mixing random fields, and (iv) dependency graphs. These examples do not necessarily nest each other, though we do comment on relations between the dependence types in terms of mixing, where relevant. We can construct affinity sets that meet our conditions in each case. A key distinction in our work is that the conditions we provide are general, rather than specific to a particular model class or dependency type. We provide a sketch of the core assumptions made in the relevant papers so that the reader has a self-contained account. We show how these assumptions imply our covariance restrictions, and note the relative complexity of these setups.
We then present five common applications that are not covered by the previous literature but are covered by our model: (i) peer effects, (ii) covariance estimation with socio-demographic characteristics, (iii) subcritical diffusion processes, (iv) diffusion in stochastic block models, and (v) spatial dependence via a Matérn covariance matrix.
In these examples, we maintain consistent use of notation defined in the previous sections. The remaining notation, however, is kept consistent only within each subsection.
3.1. m-dependence
3.1.1. Environment
We consider Theorem 2.1 of Romano and Wolf (2000). In this application there are real-valued time series data, so the variables are scalar (and we drop the dimension index). Under Romano and Wolf’s setup, two variables are independent whenever their indices differ by more than m. Here, the variables are mean zero. For the convenience of the reader, we include the assumptions made in their paper: Suppose we have an m-dependent sequence of random variables for some m and,
(1) for all
(2) for all and
(3)
(4)
(5)
(6)
3.1.2. Application of Theorem 1
We consider as affinity sets the m-balls of indices around each variable. We drop the dimension subscript for convenience. In this case, by independence, the covariance vanishes for all pairs of indices more than m apart, so Assumption 3 is satisfied. Under bounded third and fourth moments, we check the remaining assumptions. Assumption 1 is easily verified:
following their Assumption 6. Our Assumption 2 is satisfied similarly following their Assumption 6:
This is due to the fact that if two variables are not within distance m of each other, then it is impossible for them to induce any correlation.
Following the hierarchy established in Bradley (2007), m-dependence implies several commonly used forms of mixing, such as α-mixing (Bradley, 2005). It also implies ρ-mixing and φ-mixing in the time series context. We also give an example using α- and φ-mixing in the context of random fields below. Characterizing mixing in the context of random fields requires imposing restrictions on the dependence between σ-algebras as the number of points in those sets increases; see Jenish and Prucha (2009) or Bradley (2005).
Note that our conditions entail bounded fourth moment requirements, which is not a condition invoked in every analysis of m-dependent processes in the literature; some analyses have slightly lower moment requirements. Nonetheless, our results are not intended to provide the tightest bounds, but rather general conditions, spanning various types of dependence, that are easily checkable in most applied settings.
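To make the construction concrete, here is a small simulation sketch (our own illustration, not part of Romano and Wolf’s argument): a moving average of iid innovations over a window of m + 1 lags is m-dependent, and its partial sums look Gaussian even with skewed innovations.

```python
import numpy as np

rng = np.random.default_rng(1)

# A moving average of window m+1 over iid innovations is m-dependent:
# X_t and X_s are independent whenever |t - s| > m.  Skewed (centered
# exponential) innovations make the normal limit non-trivial.
def m_dependent_series(n, m):
    eps = rng.exponential(1.0, n + m) - 1.0
    return np.convolve(eps, np.ones(m + 1), mode="valid")  # length n

m, n, reps = 3, 2000, 2000
sums = np.array([m_dependent_series(n, m).sum() for _ in range(reps)])
z = (sums - sums.mean()) / sums.std()

skew = np.mean(z**3)             # ~0 under a normal limit
excess_kurt = np.mean(z**4) - 3  # ~0 under a normal limit
print(round(skew, 2), round(excess_kurt, 2))
```

The near-zero skewness and excess kurtosis of the standardized sums are crude diagnostics consistent with the normal limit delivered by the theorem.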
3.2. Andrews’ Non-Mixing Autoregressive Processes
3.2.1. Environment
This application is from Andrews (1984), which allows interdependence in a time series, but one that does not satisfy strong (α-) mixing, in order to clarify the distinction between dependence and mixing. Again, the variables are scalar. We take an autoregressive process whose innovations come from a Bernoulli distribution. We define the variables as normalized and mean zero. Assume, without loss of generality, that
where . For a constant depending only on , we show asymptotic normality of :
3.2.2. Application of Theorem 1
We begin by verifying our conditions for truncated versions of our random variables and then show that the full result applies. To define what we mean by “truncation”, for each , let
for some Let also . We then define the affinity sets to be .
We begin by verifying Assumption 1:
Next, we verify Assumption 2:
Verifying Assumption 3 is trivial; we have . Therefore, the truncated variables satisfy the central limit theorem, and we would like to show the result in full generality. To do that, we begin by writing
(3.1)
where . We will show that , where , and use this to control the second term on the right-hand side of the decomposition given in expression (3.1) above, via Chebyshev’s inequality. Then, we will also show that , which will result in the weak convergence of the first term on the right-hand side of (3.1) to the standard normal distribution. We start with
The final line comes from observing that from , the denominator , i.e. there exists some constant such that . This is because for some constant . Additionally, for any fixed , there exists a such that the numerator
Together with Chebyshev’s inequality, we have that
(3.2)
and this gives us convergence in probability of the second term in (3.1) to zero.
Next, we show that . We begin by considering
(3.3)
We now consider the second term in expression (3.3). The lower bound holds (i.e., there exists some constant such that ). For a fixed , there exists a such that for all . Therefore, we have that . Hence,
Next, we consider the third term in (3.3):
Once again, we have the lower bound (i.e. there exists some constant such that ), and for a fixed , there exists a such that for all . Therefore, we have that , and hence,
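A simulation sketch of this kind of autoregression with Bernoulli innovations (the coefficient of one half and success probability of one half are our own illustrative choices, a classic non-strong-mixing configuration) is consistent with the asymptotic normality shown above:

```python
import numpy as np

rng = np.random.default_rng(2)

# Truncated MA(infinity) representation of X_t = 0.5 * X_{t-1} + eps_t with
# eps_t ~ Bernoulli(1/2).  Such a process is known to fail strong mixing,
# yet its partial sums are still asymptotically normal.
def bernoulli_ar1(n, rho=0.5, lags=50):
    eps = rng.integers(0, 2, n + lags - 1).astype(float)
    weights = rho ** np.arange(lags)       # rho**k coefficient on eps_{t-k}
    return np.convolve(eps, weights, mode="valid")  # length n

n, reps = 1000, 2000
means = np.array([bernoulli_ar1(n).mean() for _ in range(reps)])
z = (means - means.mean()) / means.std()
skew, excess_kurt = np.mean(z**3), np.mean(z**4) - 3
print(round(skew, 2), round(excess_kurt, 2))
```

The truncation at 50 lags mirrors the truncation device used in the argument above: the discarded tail has geometrically small weight, so it does not affect the limit.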
3.3. Random Fields
3.3.1. Environment
This example nests many time series and spatial mixing models. Take the setting of Jenish and Prucha (2009), Theorem 1. Their setting has either α- or φ-mixing in random fields, allowing for non-stationarity and asymptotically unbounded second moments. They treat real, mean-zero random field arrays in which each pair of elements is separated by some minimum distance. At each point on the lattice there is a real-valued random variable. The authors assume (their Assumptions 2 and 5) a version of uniform integrability that allows for asymptotically unbounded second moments, while maintaining that no single variance summand dominates, by scaling so that the relevant family is uniformly integrable. They also assume conditions on the inverse function of the mixing coefficients (their Assumption 3 for the α-mixing case and Assumption 4 for the φ-mixing case) together with the tail quantile functions, requiring trade-off conditions between the two, such that the mixing coefficients decay at a suitable rate and the relevant products tend to zero in the limit of upper quantiles. Restating the assumptions made in their paper:
• Assumption 2: for
• Assumption 3: The following conditions must be satisfied by the α-mixing coefficients:
(1)
(2) for where
(3)
• Assumption 4: The following conditions must be satisfied by the φ-mixing coefficients:
(1)
(2) for
(3) for some
• Assumption 5:
3.3.2. Application of Theorem 1
In the following, we assume that the variables have bounded second moments (otherwise, we can replace them with their scaled versions, as above, and the results go through under bounded third and fourth moments). Here, we take the affinity set of each variable to consist of the variables within a chosen distance of it, with the covariance decay governed by a non-increasing function. That is, we pick this distance to be large enough, and it can be chosen by understanding the cumulative distribution function of the random variables.
By Assumption 3 (and Proposition B.10), and Lemma B.1 (a) in Jenish and Prucha (2009), we know that for any such that , we have that for some constant ,
(3.4)
In particular, we note that we can pick by observing that above satisfies for
Taking this distance appropriately allows control of the size of the affinity sets. Indeed, via a packing-number calculation, we see that while this allows the affinity sets to grow with n, they grow more slowly than n. Specifically, taking the threshold as above, together with their Assumption 3, we have,
Now, we verify that our key conditions are satisfied in this setting. We write , and first, we check Assumption 1:
since . The second inequality holds by the arithmetic-geometric mean inequality. The remaining argument relies on rearranging the summations and using the growth rate of the affinity sets, as defined above, in the third equality.
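The packing-number logic can be made concrete with a small counting sketch (the exponents here are our own illustrative choices): on a two-dimensional integer lattice with unit minimum separation, the number of sites within distance d grows like d², so an affinity radius growing like n^(1/4) yields affinity sets of size on the order of √n, which is o(n).

```python
import numpy as np

# On a 2D integer lattice (minimum separation one), the number of sites
# within distance d of a given site grows like d**2.  Choosing the affinity
# radius d_n = n**0.25 (an illustrative exponent) gives affinity sets of
# size on the order of sqrt(n), which is o(n).
def affinity_set_size(d):
    r = int(np.ceil(d))
    xs, ys = np.meshgrid(np.arange(-r, r + 1), np.arange(-r, r + 1))
    return int(((xs**2 + ys**2) <= d**2).sum())

for n in (10**2, 10**4, 10**6):
    d_n = n**0.25
    size = affinity_set_size(d_n)
    print(n, round(d_n, 2), size, size / n)  # the ratio size/n shrinks
```

Any radius growing strictly more slowly than √n would do in two dimensions; the trade-off is between affinity sets large enough to absorb the mixing decay and small enough to keep the within-set covariance sums of lower order.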
3.4. Dependency Graphs and Chen and Shao (2004)
Next, we consider dependency graphs. There is an undirected, unweighted graph with dependency neighborhoods such that each variable is independent of all variables outside its neighborhood (Baldi and Rinott, 1989; Chen and Shao, 2004; Ross, 2011). Let the affinity sets be these dependency neighborhoods, and denote their maximum cardinality by D. From Ross (2011) (see Theorem 3.6), together with a bounded fourth moment assumption, we see that the conditions there imply the conditions here. Indeed, we see that
and for , we need in Ross (2011) (Theorem 3.6), and hence Assumption 1 is satisfied. Similarly,
and for , we need in Ross (2011) (Theorem 3.6), so Assumption 2 is satisfied. Assumption 3 holds by definition of the dependency neighborhoods.
Now, we consider Chen and Shao (2004). In particular, we consider their weakest assumption, LD1: given the index set, for each variable there exists a subset of indices outside of which the remaining variables are independent of it. The affinity sets can be defined by these sets, which is similar to the dependency graphs setting. The goals of their paper are different. They develop finite-sample Berry-Esseen bounds, with bounded p-th moments for some p greater than two. This is different from our approach. In our paper, we focus on covariance conditions in the asymptotics, and collect relatively more dependent sets along the triangular array.
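As a toy instance of a dependency graph (our own illustration, not an example from these papers), take products of iid node effects along the edges of a ring: two such edge variables are independent unless the edges share a node, so the dependency graph has maximal degree two, and the standardized sum looks normal.

```python
import numpy as np

rng = np.random.default_rng(3)

# Edge variables on a ring: x_i = u_i * u_{i+1} with iid standard normal u.
# x_i and x_j are independent unless the edges {i, i+1} and {j, j+1} share
# a node, so the dependency graph has maximal degree two.
n, reps = 500, 4000
sums = np.empty(reps)
for r in range(reps):
    u = rng.standard_normal(n)
    sums[r] = (u * np.roll(u, -1)).sum()

z = (sums - sums.mean()) / sums.std()
skew, excess_kurt = np.mean(z**3), np.mean(z**4) - 3
print(round(skew, 2), round(excess_kurt, 2))
```

Choosing the affinity set of each edge variable to be its (at most two) adjacent edges makes Assumption 3 hold exactly, since all covariance outside the neighborhood vanishes.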
4. Applications
4.1. Peer Effects Models
4.1.1. Environment
We now turn to an example of treatment effects with spillovers. Consider a setting in which units in a network are assigned a treatment status, as in Aronow and Samii (2017). The network is a graph consisting of individuals (nodes) and connections (edges). For now, we consider the case in which treatment assignments are independent across nodes. However, there are spillovers in treatment effects determined by the topology of the network, where treatment status within one’s network neighborhood may influence one’s own outcome; for instance, whether a friend is vaccinated affects a person’s chance of being exposed to a disease.
Rather than being arbitrary, Aronow and Samii (2017) consider an exposure function that takes on one of finitely many values. An estimand of the average causal effect is of the form
where the two exposures are those induced under the respective treatment vectors. The Horvitz–Thompson estimator (Horvitz and Thompson, 1952) gives:
where the weight is the probability that the node receives the given exposure over all treatment assignments.
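To fix ideas, here is a hedged simulation sketch of the estimator on a ring network (the exposure coding, assignment probabilities, and outcome model are our illustrative assumptions, not the paper’s specification):

```python
import numpy as np

rng = np.random.default_rng(4)

# Ring network: each unit has two neighbors; treatment is iid Bernoulli(1/2).
n = 10_000
treat = rng.random(n) < 0.5
left, right = np.roll(treat, 1), np.roll(treat, -1)

own = treat
nbr = left | right                    # any neighbor treated
exposure = own.astype(int) * 2 + nbr  # 0: none, 1: nbr only, 2: own only, 3: both

# Exact exposure probabilities: P(any of two neighbors treated) = 3/4.
pi = np.array([0.5 * 0.25, 0.5 * 0.75, 0.5 * 0.25, 0.5 * 0.75])

# Outcomes with a direct effect (0.5) and a spillover effect (0.3).
y = 1.0 + 0.5 * own + 0.3 * nbr + 0.1 * rng.standard_normal(n)

def ht_mean(k):
    # (1/n) * sum_i 1{exposure_i = k} * y_i / P(exposure_i = k)
    return float(np.mean((exposure == k) * y / pi[k]))

tau_hat = ht_mean(3) - ht_mean(0)  # "both" vs "no exposure"; truth is 0.8
print(round(tau_hat, 2))
```

Note that the realized exposures of neighboring units are dependent even though treatments are iid; that dependence across units is exactly what the central limit theory must accommodate.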
The challenge is that the treatment effects are not independent across subjects. Let a dummy vector record which nodes are in a given node’s neighborhood, with the convention that each node is in its own neighborhood. Aronow and Samii (2017) consider an empirical study with exposures that are: (i) only the node itself is treated in its neighborhood, (ii) the node is treated and at least one neighbor is also treated, (iii) the node is not treated but some member of the neighborhood is, and (iv) neither the node nor any neighbor is treated. We show that our result allows for a more generalized setting.
4.1.2. Application of Theorem 1
To obtain consistency and asymptotic normality, Aronow and Samii (2017) assume a covariance restriction of local dependence (their Condition 5) and apply Chen and Shao (2004) to prove the result. Namely, their restriction is that there is a dependency graph whose degree is uniformly bounded by some integer independent of n. This setting is much more restrictive than our conditions, especially as there can exist indirect correlation in choices as effects propagate or diffuse through the graph. We can work with a larger set of real exposure values, and with settings concentrating the mass of influence in a neighborhood while allowing for spillovers from everywhere. This is important in centrality-based diffusion models, SIR models, and financial flow networks, since the spillovers in these settings are less restricted than the sparse dependency graph in their Condition 5.
Indeed, we can even allow the dependency graph to be a complete graph, as long as the correlations between the nodes in this dependency graph satisfy our Assumptions 1-3. That is, we can handle cases where, for a given treatment assignment, each node has real exposure conditions such that the exposure conditions of the whole graph can be well approximated by simple functions: small perturbations to any node in large regions of small correlations do not substantially perturb the outcomes in those regions (i.e., across affinity sets), while perturbations of the same size in any region of larger correlations (i.e., within affinity sets) can cause significant changes in the outcomes in that region. One can think of the “shorter” monotonic regions of the simple function as lying over affinity sets, and the “longer” monotonic regions as spanning different affinity sets. Monotonically non-decreasing functions arise, for instance, in epidemic spread settings where any increase in the “treatment” cannot decrease the number of infected nodes.
To take an example, let the true exposure of be given by . Then consider a case where , where indicates an increase in any element from . This structure arises naturally in settings with diffusion. The potential outcome for given treatment assignment is assumed to be .
In practice, for parsimony and ease, exposures are often binned. So consider the problem where the possible exposures can be approximated by well-separated “effective” exposures where for any and some , and for any , , we have, if and only if and we have smooth in its argument for every .
Then, following the above, the researcher’s target estimand is the average causal effect switching between two exposure bins,
The estimator for this estimand cannot directly be shown to be asymptotically normally distributed using the prior literature. It is ruled out by Condition 5 in Aronow and Samii (2017), which uses Chen and Shao (2004). However, it is straightforward to apply our result.
4.2. Covariance Estimation using Socio-economic Distances
4.2.1. Environment
One application of mixing random fields is to use them to develop covariance matrices for estimators (e.g., Driscoll and Kraay (1998); Bester et al. (2011); Barrios et al. (2012); Cressie (2015)). Here we consider the example of Conley and Topa (2002), which builds on Conley (1999). Essentially, their approach is to parameterize the characteristics (observable or unobservable) of units that drive correlation in shocks by the Euclidean metric, as we further describe below. This, however, rules out examples that are common in practice, including (discrete) characteristics with no intrinsic ordering driving degrees of correlation. For instance, correlational structures across race, ethnicity, caste, occupation, and so on are not readily accommodated in the framework. For a concrete example, correlations between ethnicities and for units that are parametrized by in an unstructured manner are ruled out. Indeed, many such examples admit only partial orderings, if that. Yet these are important, practical considerations in applied work. Our Theorem 1 allows an intuitive treatment of such cases. Our discussion below also applies to combinations of temporal (and possibly cross-sectional) dependence as in Driscoll and Kraay (1998). Our conditions also provide consistent estimators for covariance matrices of moment conditions for parameters of interest in the GMM setting, under full-rank conditions on expected derivatives (Conley (1999)), since the author uses the CLT from Bolthausen (1982) under stationary random fields, which is generalized above.
In Conley (1999), the model is that the population lives in a Euclidean space (taken to be for the purposes of exposition), with each individual at location . Each location has an associated random field . Conley (1999) obtains the limiting distribution of parameter estimates of , where is a compact subset and is the unique solution to for a moment function . Conley (1999) lists sufficient conditions on the moment function to imply consistent estimation of the expected derivatives and having full rank:
(a) for all , the derivative with respect to is measurable, continuous on for all , and first-moment continuous.
(b) and is of full-rank.
(c) corresponding to sampled locations is a non-singular matrix.
In addition to the sufficient conditions on the expected derivatives above, we list the remaining sufficient conditions on the random field itself used by Conley (1999) to obtain the limiting distribution of parameter estimates through the GMM; and these are nested in the conditions from Jenish and Prucha (2009), with the addition of bounded -moment of :
(1) for
(2)
(3) for some , and .
In Conley and Topa (2002), the authors develop consistent covariance estimators, using these conditions, combining different distance metrics including physical distance as well as ethnicity (or occupation, for another example) distance in at an aggregate level (using census-tract data). In particular, the authors use indicator vectors to encode ethnicities (or occupations) and take the Euclidean distance between aggregated (at the census tract level) indicator vectors as a measure of these ethnic/occupational distances. They use the Euclidean metric to write a “race and ethnicity” distance between census tracts and ,
where the sum is taken over nine ethnicities/races, indexed by , defined by the authors.
The use of indicator vectors and Euclidean distance results in people of different race/ethnicity groups being in orthogonal groups (with a fixed additional Euclidean distance between any pair of different race/ethnicity groups). To apply this in practice, one would often need to allow for varying degrees of pairwise correlation for each pair of race/ethnicity groups. Additionally, even if the correlation induced by physical distance vanishes, it may be of interest to maintain correlation arising from interactions within and between ethnic groups, where belonging to similar ethnic groups may induce a nontrivial transfer of information between people despite their being physically located large distances apart. It is not difficult to see that the indicator vector formulation above does not allow for this: in the case of a pair of distinct ethnicities with high correlation, the formulation in Conley and Topa (2002) would require a correlation of zero.
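To illustrate the limitation, the snippet below is entirely our own construction (the function names and the similarity matrix S are hypothetical, not from Conley and Topa (2002)). It contrasts the Euclidean distance between tract-level ethnicity share vectors with a distance built from a researcher-chosen pairwise group-similarity matrix, which can encode nonzero correlation between distinct groups:

```python
import numpy as np

def euclidean_ethnic_distance(shares_s, shares_t):
    """Conley-Topa style distance: Euclidean distance between two census
    tracts' ethnicity share vectors (one entry per ethnicity/race group)."""
    return np.sqrt(np.sum((shares_s - shares_t) ** 2))

def affinity_distance(shares_s, shares_t, S):
    """A hypothetical generalization: S[a, b] is a researcher-chosen
    similarity between groups a and b. Higher cross-group similarity
    lowers the effective distance between tracts."""
    sim = shares_s @ S @ shares_t  # weighted overlap of group compositions
    return 1.0 - sim  # one simple monotone transform to a distance
```

With S equal to the identity, the affinity-based distance treats distinct groups as orthogonal, mirroring the indicator-vector formulation; off-diagonal entries in S relax exactly that restriction.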
4.2.2. Application of Theorem 1
Consider random variables and , whose correlation we can decompose into components of physical distance and racial/ethnic distance (just as in Conley and Topa (2002)). It is straightforward to see how our work above takes care of the physical distance component, and so we turn to the remaining distance component. For this, one could consider the pairwise interaction probabilities characterizing the correlation between the ethnicity of and the ethnicity of . Our affinity set structure then allows us to incorporate this correlation structure. That is, one can construct an affinity set with defined just as in Subsection 3.3. Following our previous section, our generalization holds.
Attempts to (non-parametrically) estimate covariance in the cross-section often leverage a time or distance structure. For example, Driscoll and Kraay (1998) assume a mixing condition on a random field such that the correlation between shocks and tends to zero as . This allows for reasonably agnostic cross-sectional correlational structures but requires them to be temporally invariant, and studies asymptotics. Although such an assumption applies in certain contexts, there are many socio-economic contexts in which it does not apply and yet our theorem can be applied. We provide two such examples.
For instance, in simple models of migration with migration cost, there is often persistence in how shocks to incentives to migrate in some areas affect populations in other areas. Nonetheless, there are very particular correlation patterns because, as an example, ethnic groups migrate to specific places based on existing populations, and so affinity sets are driven by the places to which a given ethnic group might consider moving and our central limit theorem can then be applied, provided our conditions on affinity sets apply.
Another example comes from social interaction. Individuals interact with others in small groups that experience correlated shocks, which correlate grouped individuals’ behaviors and beliefs. Each group involves only a tiny portion of the population, and any given person interacts in a series of groups over time. Thus individuals’ behaviors or beliefs are correlated with those of others with whom they have interacted, but without any natural temporal or spatial structure. Each person has an affinity set composed of the groups (classes, teams, and so on) that they have been part of. People may also have their own idiosyncratic shocks to behaviors and beliefs. In this example again, our central limit theorem applies despite the lack of any spatial or temporal structure. One does not need to know the affinity sets, only that each person’s affinity set is appropriately small relative to the population.
4.3. Sub-Critical Diffusion Models
4.3.1. Environment
A finite-time SIR diffusion process occurs on a sequence of unweighted and undirected graphs . A first-infected set, or seed set , of size , of random nodes with treatment indicated , are seeded (set to have ) at and in period each infects each of its network neighbors, i.i.d. with probability . The seeds then are no longer active in infecting others. In period each of the nodes infected at period infects each of its network neighbors who were never previously infected i.i.d. with probability . The process continues for periods. Let be a binary indicator of whether was ever infected throughout the process.
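The finite-time SIR process just described can be sketched directly. The following code is our own illustrative implementation (not from any cited paper), with the rule that infectors are active for only one period made explicit:

```python
import numpy as np

def simulate_sir(A, seeds, p, T, rng=None):
    """Finite-time SIR on adjacency matrix A: seeds are infectious at t=0;
    in each period, every currently active node infects each never-infected
    neighbor i.i.d. with probability p, then becomes inactive forever.
    Returns the ever-infected 0/1 indicator vector."""
    rng = np.random.default_rng(rng)
    n = A.shape[0]
    infected = np.zeros(n, dtype=bool)
    infected[seeds] = True
    active = list(seeds)
    for _ in range(T):
        newly = []
        for i in active:
            for j in np.flatnonzero(A[i]):
                if not infected[j] and rng.random() < p:
                    infected[j] = True
                    newly.append(j)
        active = newly  # only the newly infected transmit next period
        if not active:
            break
    return infected.astype(int)
```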
Assume that the sequence of SIR models under study have (with ), , and (with at each ), and are such that the process is sub-critical. Since the number of periods is at least as large as the diameter, this guarantees that, for a connected , for each .
The statistician may be interested in a number of quantities. For instance, the unknown parameter may be of interest. Suppose is a (scalar) moment condition satisfied only at the true parameter given known seeding . The -estimator (or GMM) derives from the empirical analog, setting
By a standard expansion argument
To study the asymptotic normality of the estimator, we need to study
which involves developing affinity sets for each . For simplicity we consider the case where the estimator may directly work with . Letting be the de-meaned outcome and we want to show that
Under sub-criticality, a vanishing share of nodes are infected from a single seed. Without sub-criticality, most of the graph can have a nontrivial probability of being infected and accurate inference cannot be made with a single network. Let us define
Then is the set of nodes for which, if is the only seed, the probability of being infected in the process is at least . As noted above, in a sub-critical process for every , not necessarily uniformly in . For simplicity assume that there is a sequence such that this holds uniformly (otherwise, one can simply consider sums). Let which tends to zero, such that .
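This construction can be approximated by simulation. The sketch below is our own illustrative code (not from the text): for each potential seed j, it Monte Carlo estimates the probability that each node is ever infected when j is the only seed, and collects the nodes for which this probability is at least a threshold eps, playing the role of the cutoff above:

```python
import numpy as np

def infected_from_seed(A, j, p, T, rng):
    """One finite-time SIR run with single seed j; returns the ever-infected
    0/1 vector (active nodes transmit for one period, then stop)."""
    n = A.shape[0]
    infected = np.zeros(n, dtype=bool)
    infected[j] = True
    active = [j]
    for _ in range(T):
        newly = []
        for i in active:
            for k in np.flatnonzero(A[i]):
                if not infected[k] and rng.random() < p:
                    infected[k] = True
                    newly.append(k)
        active = newly
        if not active:
            break
    return infected.astype(float)

def affinity_sets(A, p, T, eps, n_mc=200, seed=0):
    """For each seed j, Monte Carlo estimate of P(node i ever infected |
    seed set = {j}), thresholded at eps, giving j's affinity set."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    sets = []
    for j in range(n):
        prob = np.zeros(n)
        for _ in range(n_mc):
            prob += infected_from_seed(A, j, p, T, rng)
        sets.append(set(np.flatnonzero(prob / n_mc >= eps)))
    return sets
```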
Next, we assume that the rate at which infections happen within the affinity set is higher than outside of it, and that the share of seeds is sufficiently high and affinity sets are large enough to lead to many small infection outbursts, but not so large as to infect the whole network. That is, there exists some such that and with , such that , and . This would apply, for instance, to targeted advertisements or promotions that lead to local spread of information about a product but that does not go viral. None of the prior examples, such as random fields and dependency graphs, cover this case since all are correlated. We now show that Theorem 1 applies to this case.
4.3.2. Application of Theorem 1
Let us define the affinity sets . Next, consider a random seed . Let denote the event that none of the nodes are in each other’s affinity sets. It is clear that , since for and seeds are uniformly randomly chosen.
If we look at an affinity set, it is sufficient to look at the variance components and check that they are of a higher order of magnitude
Now to check Assumption 1, we compute:
Thus, we have
which is satisfied. The probability that no seed is in any other affinity set
This puts an intuitive restriction on the number of seeds and percolation size as a function of . Next, we verify Assumption 2. We have,
Therefore, , is satisfied. We next verify Assumption 3. Given the event , we can bound the conditional covariance, , by bounding the probabilities of two contagions. So then
for some constant fixed in . Keeping orders we have
Since , we have, and is also satisfied.
4.4. Diffusion in Stochastic Block Models
4.4.1. Environment
A SIR diffusion process occurs on a sequence of unweighted and undirected networks, as in the previous example, except that the network has a block structure as generated by a standard stochastic block model (Holland et al. (1983)). The nodes are partitioned into blocks, where block sizes are equal, or within one of each other. With probability links are formed inside blocks, and with probability they are formed across blocks, independently. Let be the probability of infection, as in the last example. Inside link probabilities are large enough for percolation within blocks: . Across link probabilities are small enough for vanishing probabilities of contagion across blocks, even if all other blocks are infected: . The infections are seeded with seeds. With probability going to 1, all nodes in the blocks with the seeds will be infected and no others. There is a correlation going to 1 of infection status of nodes within the blocks, which are the affinity sets; and there is correlation of infection status going to 0 across blocks, but it is always positive. If is bounded, then a central limit theorem fails. If grows without bound (while allowing so that blocks are large), then the central limit theorem holds.
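A minimal sketch of this block structure (our own illustrative code, not from Holland et al. (1983)): nodes are split into K near-equal blocks, links form within blocks with probability p_in and across blocks with probability p_out, and each block serves as the affinity set of its members.

```python
import numpy as np

def sbm(n, K, p_in, p_out, rng=None):
    """Symmetric stochastic block model: n nodes assigned to K blocks
    (round-robin, so sizes are within one of each other); links form
    independently, w.p. p_in within blocks and p_out across blocks.
    Returns the adjacency matrix and the block labels."""
    rng = np.random.default_rng(rng)
    block = np.arange(n) % K
    same = block[:, None] == block[None, :]
    P = np.where(same, p_in, p_out)          # pairwise link probabilities
    U = rng.random((n, n))
    A = (np.triu(U, 1) < np.triu(P, 1)).astype(int)  # upper triangle only
    return A + A.T, block                    # symmetrize, zero diagonal
```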
4.4.2. Application of Theorem 1
Given the previous examples, we simply sketch the application. The affinity set of node , , is the block in which it resides. Letting denote , it follows that . Then the first assumption is satisfied noting that if , then
Verification of the second assumption comes from noting that if then
Next we check the third assumption. Let for in different blocks (ignoring the approximation due to the fact that blocks may be of slightly different sizes). Note that is on the order of contagion across blocks, which is . Both blocks could also be infected by some other nodes, which happens with probability of order at most , which is also of order . If grows without bound
In this example, not only do the assumptions fail if is bounded, but the conclusion of the theorem fails to hold as well; the conditions are thus tight in that sense.
4.5. Spatial Process with Irregular Observations and Matérn Covariance
4.5.1. Environment
Finally, we turn to an example of neural network models for geospatial data. Specifically, we look at the environment of Zhan and Datta (2023). The authors propose a neural network generalized least squares process (NN-GLS) with the dependency in the residuals modeled by a Matérn covariance function, described below. Their paper is the first to demonstrate consistency for the NN-GLS estimator in this setting.
Consider a spatial process model, , where is a vector of characteristics and the residuals correspond to observations at locations in . Let be a continuous function and define as a Gaussian Process with covariance function for some and is the indicator function. Here , where
is the Matérn covariance function, with modified Bessel function of the second kind . We consider the setting in Zhan and Datta (2023) (Proposition 1) where for some .
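For reference, the Matérn family can be evaluated directly from its Bessel-function form. The sketch below is our own, using a common (variance, smoothness, range) parametrization; for smoothness nu = 1/2 it reduces to the exponential covariance exp(-d/rho).

```python
import numpy as np
from scipy.special import kv, gamma  # kv: modified Bessel fn of 2nd kind

def matern(d, sigma2=1.0, nu=1.5, rho=1.0):
    """Matern covariance at distances d (vectorized), with variance sigma2,
    smoothness nu, and range rho, in the sqrt(2*nu)*d/rho parametrization."""
    d = np.atleast_1d(np.asarray(d, dtype=float))
    scaled = np.sqrt(2 * nu) * d / rho
    out = np.full_like(scaled, sigma2)  # limit at d = 0 is sigma2
    pos = scaled > 0
    s = scaled[pos]
    out[pos] = sigma2 * (2 ** (1 - nu) / gamma(nu)) * s ** nu * kv(nu, s)
    return out
```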
The NN-GLS fits a system of multi-layered perceptrons via the loss function, and the authors prove consistency under some assumptions including, in particular, restrictions on the spectral radius of a sparse approximation of the covariance function. This is ensured by the assumption of a minimum separation distance between locations above, where . Previous work characterizes the asymptotic properties, including asymptotic normality, of neural network estimators in the case of independent and identically distributed shocks (Shen et al., 2019). Zhan and Datta (2023) extend this result by modeling dependency using the Matérn covariance function.
4.5.2. Application of Theorem 1
We create affinity sets using the same restrictions as presented in Zhan and Datta (2023). Reflecting the duality of spatial distance and dependence, we construct affinity sets such that the maximal separation in the affinity set has implications for the maximum covariance between random variables. Specifically, take the affinity sets to be defined as where for . Using Zhan and Datta (2023)’s restriction on the amount of dependence associated with the distance in space, namely for some , we can solve for the appropriate . Specifically, if we know that the covariance in Zhan and Datta (2023) is asymptotically bounded by , we can set this equal to and solve for the appropriate distance. After the resulting algebra we take .
So far, we have defined distances that give affinity sets containing the bulk of the dependence. We also need to ensure that, under the setup in Zhan and Datta (2023), these sets are not so large that they violate our assumptions. To do this, we take for and as the minimum separation distance, defined above. Using a packing number calculation, we see that while this allows to grow with , it grows more slowly than . Specifically, we have,
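The distance-threshold construction of affinity sets can be sketched as follows (our own illustrative code; h plays the role of the cutoff distance derived above, and the packing bound on set sizes comes from the minimum separation between locations):

```python
import numpy as np

def distance_affinity_sets(locs, h):
    """Affinity set of each location: all locations within distance h of it
    (including itself). With a minimum separation delta between points, a
    packing argument bounds each set's size by O((h / delta) ** 2) in 2-D."""
    diff = locs[:, None, :] - locs[None, :, :]
    D = np.sqrt((diff ** 2).sum(-1))        # pairwise Euclidean distances
    return [np.flatnonzero(D[i] <= h) for i in range(len(locs))]
```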
This logic generalizes to dimensions , taking for . Using this construction and assuming bounded third and fourth moments, we check that our Assumptions 1-3 apply.
Letting , we show Assumption 1 holds since
The second inequality holds by the arithmetic-geometric mean inequality. The remaining argument relies on rearranging the summations and using the growth rate of , i.e. in the fourth equality, as defined above based on the conditions required by Zhan and Datta (2023). The last equality follows since
We check that Assumption 2 is satisfied using similar arguments and relying on an assumption of finite fourth moment:
The first equality comes from the construction of affinity sets such that the covariance terms within the affinity sets dominate those outside the affinity sets (additional details in Assumption 3). The remaining equalities use arguments similar to those for the assumption above.
Assumption 3 follows from taking where for arbitrarily small . Indeed, taking as such, we have , and thus,
where the first line above is obtained by observing that for jointly Gaussian random variables , we can write , where due to the eigenvalues of the covariance matrix being uniformly bounded in .
5. Discussion
We have provided an organizing principle for modeling dependency and obtaining a central limit theorem: affinity sets. It allows for non-zero correlation across all random vectors in the triangular array and places focus on correlations within and across sets. These conditions are intuitive and we illustrate their use through some practical applications for applied research. In some cases, as in several of our applied examples, our result is needed as previous conditions do not apply.
We now reflect on settings that our theorem does not cover. For example, the martingale central limit theorem (e.g., Billingsley (1961); Ibragimov (1963); Hall and Heyde (2014), among others) is not covered by our theorem without modification. It admits nontrivial unconditional correlation between all variables, but relies on other structural properties to deduce the result. In fact, proofs of the martingale central limit theorem did not appeal to Stein’s method until Röllin (2018). By combining Stein and Lindeberg methods, Röllin (2018) developed a shorter proof but did not find a direct proof using the Stein technique alone. Some biased processes that do not fall under the martingale umbrella can still generate a central limit theorem if they satisfy the covariance structure that we have provided.
References
- Andrews [1984] D. W. Andrews. Non-strong mixing autoregressive processes. Journal of Applied Probability, 21(4):930–934, 1984.
- Aronow and Samii [2017] P. M. Aronow and C. Samii. Estimating average causal effects under general interference, with application to a social network experiment. Annals of Applied Statistics, 11(4):1912–1947, 2017. doi: 10.1214/16-AOAS1005.
- Arratia et al. [1989] R. Arratia, L. Goldstein, and L. Gordon. Two moments suffice for Poisson approximations: The Chen-Stein method. The Annals of Probability, pages 9–25, 1989.
- Baldi and Rinott [1989] P. Baldi and Y. Rinott. On normal approximations of distributions in terms of dependency graphs. The Annals of Probability, pages 1646–1650, 1989.
- Barrios et al. [2012] T. Barrios, R. Diamond, G. W. Imbens, and M. Kolesár. Clustering, spatial correlations, and randomization inference. Journal of the American Statistical Association, 107(498):578–591, 2012.
- Bester et al. [2011] C. A. Bester, T. G. Conley, and C. B. Hansen. Inference with dependent data using cluster covariance estimators. Journal of Econometrics, 165(2):137–151, 2011.
- Billingsley [1961] P. Billingsley. The Lindeberg-Lévy theorem for martingales. Proceedings of the American Mathematical Society, 12(5):788–792, 1961.
- Biscio et al. [2018] C. A. N. Biscio, A. Poinas, and R. Waagepetersen. A note on gaps in proofs of central limit theorems. Statistics & Probability Letters, 135:7–10, 2018.
- Bolthausen [1982] E. Bolthausen. On the central limit theorem for stationary mixing random fields. Annals of Probability, 10:1047–1050, 1982.
- Bradley [2005] R. C. Bradley. Basic properties of strong mixing conditions. a survey and some open questions. Probability Surveys, 2:107–144, 2005. doi: 10.1214/154957805100000104.
- Bradley [2007] R. C. Bradley. Introduction to strong mixing conditions, volume 1. Kendrick Press, 2007.
- Chandrasekhar and Jackson [2024] A. G. Chandrasekhar and M. O. Jackson. A network formation model based on subgraphs. Review of Economic Studies, forthcoming, 2024.
- Chen [1975] L. H. Chen. Poisson approximation for dependent trials. The Annals of Probability, 3(3):534–545, 1975.
- Chen and Shao [2004] L. H. Chen and Q.-M. Shao. Normal approximation under local dependence. The Annals of Probability, 32(3):1985–2028, 2004.
- Conley [1999] T. G. Conley. GMM estimation with cross sectional dependence. Journal of Econometrics, 92(1):1–45, 1999.
- Conley and Topa [2002] T. G. Conley and G. Topa. Socio-economic distance and spatial patterns in unemployment. Journal of Applied Econometrics, 17(4):303–327, 2002.
- Cramér and Wold [1936] H. Cramér and H. Wold. Some theorems on distribution functions. Journal of the London Mathematical Society, 1(4):290–294, 1936.
- Cressie [2015] N. Cressie. Statistics for spatial data. John Wiley & Sons, 2015.
- Driscoll and Kraay [1998] J. C. Driscoll and A. C. Kraay. Consistent covariance matrix estimation with spatially dependent panel data. Review of Economics and Statistics, 80(4):549–560, 1998.
- Froot [1989] K. A. Froot. Consistent covariance matrix estimation with cross-sectional dependence and heteroskedasticity in financial data. Journal of Financial and Quantitative Analysis, 24(3):333–355, 1989.
- Goldstein and Rinott [1996] L. Goldstein and Y. Rinott. Multivariate normal approximations by Stein's method and size bias couplings. Journal of Applied Probability, pages 1–17, 1996.
- Hall and Heyde [2014] P. Hall and C. C. Heyde. Martingale limit theory and its application. Academic press, 2014.
- Hoff et al. [2002] P. D. Hoff, A. E. Raftery, and M. S. Handcock. Latent space approaches to social network analysis. Journal of the American Statistical Association, 97(460):1090–1098, 2002.
- Holland et al. [1983] P. W. Holland, K. B. Laskey, and S. Leinhardt. Stochastic blockmodels: First steps. Social Networks, 5(2):109–137, 1983.
- Horvitz and Thompson [1952] D. G. Horvitz and D. J. Thompson. A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association, 47(260):663–685, 1952.
- Ibragimov [1963] I. Ibragimov. A central limit theorem for a class of dependent random variables. Theory of Probability & Its Applications, 8(1):83–89, 1963.
- Jenish and Prucha [2009] N. Jenish and I. R. Prucha. Central limit theorems and uniform laws of large numbers for arrays of random fields. Journal of Econometrics, 150(1):86–98, 2009.
- Lubold et al. [2023] S. Lubold, A. G. Chandrasekhar, and T. H. McCormick. Identifying the latent space geometry of network models through analysis of curvature. Journal of the Royal Statistical Society Series B: Statistical Methodology, 85(2):240–292, 2023.
- Rinott and Rotar [2000] Y. Rinott and V. Rotar. Normal approximations by Stein's method. Decisions in Economics and Finance, 23:15–29, 2000.
- Rio [1993] E. Rio. Covariance inequalities for strongly mixing processes. Annales de l’I.H.P. Probabilités et statistiques, 29(4):587–597, 1993. doi: 10.1016/S0246-0203(93)80028-X.
- Rio [2013] E. Rio. Inequalities and limit theorems for weakly dependent sequences. 2013.
- Rio [2017] E. Rio. Weakly dependent sequences: theory and applications. Springer, 2017. doi: 10.1007/978-3-319-54235-7.
- Röllin [2018] A. Röllin. On quantitative bounds in the mean martingale central limit theorem. Statistics & Probability Letters, 138:171–176, 2018.
- Romano and Wolf [2000] J. P. Romano and M. Wolf. A more general central limit theorem for m-dependent random variables with unbounded m. Statistics & probability letters, 47(2):115–124, 2000.
- Ross [2011] N. Ross. Fundamentals of Stein's method. Probability Surveys, 8:210–293, 2011.
- Shen et al. [2019] X. Shen, C. Jiang, L. Sakhanenko, and Q. Lu. Asymptotic properties of neural network sieve estimators. arXiv preprint arXiv:1906.00875, 2019.
- Stein [1986] C. Stein. Approximate computation of expectations. Lecture Notes-Monograph Series, 7:i–164, 1986.
- Zhan and Datta [2023] W. Zhan and A. Datta. Neural networks for geospatial data. arXiv preprint arXiv:2304.09157, 2023.
Appendix
Proof of Central Limit Theorem 1 and Corollary 1
Recall that is a matrix with entries
We start with the case , reproducing elements of the proof to Theorem 2 in Chandrasekhar and Jackson [2024]. Let
The proof uses Stein’s lemma from Stein [1986].
Lemma .1 (Stein [1986], Ross [2011]).
If is a random variable and has the standard normal distribution, then
Further
By this lemma, if we show that a normalized sum of random variables satisfies
then , and so it must be asymptotically normally distributed.
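For reference, a standard statement of the lemma underlying this step (our reconstruction of the usual formulation, cf. Stein [1986] and Ross [2011]; the notation here is generic rather than verbatim from the text) is:

```latex
% Stein's lemma, standard formulation (cf. Stein [1986]; Ross [2011]).
\begin{lemma}
If $Z \sim \mathcal{N}(0,1)$, then
\[
  \mathbb{E}\left[ f'(Z) - Z f(Z) \right] = 0
\]
for every absolutely continuous $f$ with $\mathbb{E}|f'(Z)| < \infty$.
Conversely, if a sequence of random variables $(W_n)$ satisfies
\[
  \mathbb{E}\left[ f_h'(W_n) - W_n f_h(W_n) \right] \longrightarrow 0
\]
for every solution $f_h$ of the Stein equation
$f_h'(w) - w f_h(w) = h(w) - \mathbb{E}[h(Z)]$ with $h$ bounded and
continuous, then $W_n \Rightarrow \mathcal{N}(0,1)$.
\end{lemma}
```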
The following lemmas are useful in the proof.
Lemma .2 (Chandrasekhar and Jackson [2024], Lemma B.2).
A solution to (where is measurable) is where we break ties, setting when .
Lemma .3 (Chandrasekhar and Jackson [2024], Lemma B.3).
when is measurable and bounded by satisfies
Lemma .4 (Cramér-Wold Device, Cramér and Wold [1936]).
The sequence of random vectors of dimension weakly converges to the random vector , as , if and only if for any ,
as
Proof of Theorem 1. The case is Theorem 2 in Chandrasekhar and Jackson [2024]. So we reproduce a sketch to provide intuition. By Lemma .1, it is sufficient to show that the appropriate sequence of random variables satisfies
Let
Let
We consider
(.1)
for all such that . Observe that
(.2)
We first consider the second term above. By a first-order Taylor approximation of about , and observing and the triangle inequality, we get an upper bound:
where is an intermediate term between 0 and . The first term is zero since . We upper bound the second term by applying Lemma .3.
By Assumption 3, we have that the upper bound is . This kind of bound is generated by the fact that, in principle, any two random variables can be correlated for a given .
Therefore, (.2) is simply
Now, by plugging this expression into (.1), we obtain an upper bound using the triangle inequality (following reasoning similar to that in Ross [2011]). This is followed by a second-order Taylor series approximation, a bound on the derivatives of , and the Cauchy-Schwarz inequality, together with Assumptions 1 and 2. In both of these pieces, rather than relying on conditional independence and applying the arithmetic-geometric mean inequality to write conditions in terms of moment restrictions (which cannot be used with the general dependency structure), we collect covariance terms.
Therefore, we have shown the convergence in distribution in each dimension. Now, we consider the multidimensional setting, and let be a mean-zero normally distributed random vector with covariance the -dimensional identity matrix. By the Cramér-Wold device (Lemma .4), it is sufficient to show that
(.3)
for all .
But, from the proof above, we see that for each ,
It immediately follows that (.3) is satisfied, so we have shown .