
Rate-Optimal Cluster-Randomized Designs for Spatial Interference

Michael P. Leung, Department of Economics, UC Santa Cruz
Abstract

We consider a potential outcomes model in which interference may be present between any two units but the extent of interference diminishes with spatial distance. The causal estimand is the global average treatment effect, which compares outcomes under the counterfactuals that all or no units are treated. We study a class of designs in which space is partitioned into clusters that are randomized into treatment and control. For each design, we estimate the treatment effect using a Horvitz-Thompson estimator that compares the average outcomes of units with all or no neighbors treated, where the neighborhood radius is of the same order as the cluster size dictated by the design. We derive the estimator’s rate of convergence as a function of the design and degree of interference and use this to obtain estimator-design pairs that achieve near-optimal rates of convergence under relatively minimal assumptions on interference. We prove that the estimators are asymptotically normal and provide a variance estimator. For practical implementation of the designs, we suggest partitioning space using clustering algorithms.

Keywords: causal inference, interference, experimental design, spatial dependence

1 Introduction

Consider a population of $n$ experimental units. Denote by $Y_{i}(\bm{d})$ the potential outcome of unit $i$ under the counterfactual that the population is assigned treatments according to the vector $\bm{d}=(d_{i})_{i=1}^{n}\in\{0,1\}^{n}$, where $d_{i}=1$ ($d_{i}=0$) implies unit $i$ is assigned to treatment (control). Treatments assigned to alters can influence the ego since $Y_{i}(\bm{d})$ is a function of $d_{j}$ for $j\neq i$, which is known as interference.

An important estimand of practical interest is the global average treatment effect

\theta_{n}=\frac{1}{n}\sum_{i\in\mathcal{N}_{n}}\big(Y_{i}(\bm{1}_{n})-Y_{i}(\bm{0}_{n})\big),

where $\bm{1}_{n}$ ($\bm{0}_{n}$) is the $n$-dimensional vector of ones (zeros). This compares average outcomes under the counterfactuals that all or no units are treated. Each average can only be directly observed under an extreme design that assigns all units to the same treatment arm, which would necessarily preclude observation of the other counterfactual. Common designs used in the literature, including those studied here, assign different units to different treatment arms, so neither average is directly observed in the data. Nonetheless, we show that asymptotic inference on $\theta_{n}$ is possible for a class of cluster-randomized designs under spatial interference in which the degree of interference diminishes with distance.
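To fix ideas, the estimand can be computed directly in settings where all potential outcomes are known, as in a simulation. A minimal Python sketch, using a hypothetical additive outcome model (purely for illustration, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100

# Hypothetical potential-outcome model for illustration only:
# unit i's outcome is a baseline plus a direct effect of its own treatment.
baseline = rng.normal(size=n)
direct_effect = 2.0

def potential_outcome(i, d):
    """Y_i(d) for a counterfactual assignment vector d in {0,1}^n."""
    return baseline[i] + direct_effect * d[i]

ones, zeros = np.ones(n, dtype=int), np.zeros(n, dtype=int)

# Global average treatment effect: mean of Y_i(1_n) - Y_i(0_n).
theta_n = np.mean([potential_outcome(i, ones) - potential_outcome(i, zeros)
                   for i in range(n)])
print(theta_n)  # equals direct_effect = 2.0 in this no-interference toy model
```

Under interference, $Y_i(\bm{d})$ would depend on the full vector $\bm{d}$, and neither counterfactual average is observable from a single experiment, which is the central difficulty the paper addresses.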

Many phenomena diffuse primarily through physical interaction. The government of a large city may wish to compare the effect of two different policing strategies on crime, but more intensive policing in one neighborhood may displace crime to adjacent neighborhoods [8, 49]. A rideshare company may wish to compare the performance of two different pricing algorithms, but these may induce behavior that generates spatial externalities, such as congestion. Other phenomena exhibiting spatial interference include infectious diseases in animal [14] and human [35] populations, pollution [18], and environmental conservation programs [36].

Much of the existing literature assumes that interference is summarized by a low-dimensional exposure mapping and that units are individually randomized into treatment or control either via Bernoulli or complete randomization [e.g. 3, 7, 16, 34, 45]. Jagadeesan et al. [23] and Ugander et al. [46] also utilize exposure mappings but depart from unit-level randomization. They propose new designs that introduce cluster dependence in unit-level assignments in order to improve estimator precision. We build on this literature by (1) studying rate-optimal choices of both cluster-randomized designs and Horvitz-Thompson estimators, (2) avoiding exposure mapping restrictions on interference, which can be quite strong [15], and (3) developing a distributional theory for the estimator and a variance estimator.

Regarding (2), most exposure mappings used in the literature imply that only units within a small, known distance from the ego can interfere with the ego’s outcome. We instead study a weaker restriction on interference, similar to [31], which states that the degree of interference decreases with distance but does not necessarily zero out at any given distance. This is analogous to the widespread use of mixing-type conditions in the time series and spatial literatures in place of $m$-dependence, since the latter rules out interesting forms of autocorrelation, including models as basic as the AR(1).

Regarding (1), we study cluster-randomized designs in which units are partitioned into spatial clusters, clusters are independently randomized into treatment and control, and the assignment of a unit is dictated by the assignment of its cluster. By introducing correlation in assignments, such designs can avoid overlap problems common under Bernoulli randomization, which improves the rate of convergence. For analytical tractability, we focus on designs in which clusters are equally sized squares, each design distinguished by the number of such squares. We pair each design with a Horvitz-Thompson estimator that compares the average outcomes of units with all or no treated neighbors, where the neighborhood radius is of the same order as the cluster size dictated by the design. See Figure 1 for a depiction of a hypothetical design and neighborhoods used to construct the estimator.

Our results inform how the analyst should choose the number of clusters (and hence the cluster size and neighborhood radius of the estimator) to obtain the best possible rate of convergence of the estimator. Notably, existing work on cluster randomization with interference utilizes small clusters (those with asymptotically bounded size). We show that such designs are generally asymptotically biased under the weaker restriction on interference we impose, which motivates the large-cluster designs we study.

Figure 1: White and black dots depict units. White squares correspond to clusters and gray squares to neighborhoods of black units used to construct the estimator.

Finally, regarding (3), we show that the estimator is asymptotically normal and provide a variance estimator. These results appear to be novel, as no existing central limit theorems seem to apply to our setup in which treatments exhibit cluster dependence, clusters can be large, and units in different clusters are spatially dependent due to interference. As usual, the variance estimator is biased due to heterogeneity in unit-level treatment effects. However, we show that, in a superpopulation setting in which potential outcomes are weakly spatially dependent, the bias is asymptotically negligible.

Based on our theory, we provide practical recommendations for implementing cluster-randomized designs in § 3.3. Of course, rate-optimality results do not determine the choice of nonasymptotic constants that are often important in practice under smaller sample sizes. Still, they constitute an important first step toward designing practical procedures. Due to the generality of the setting, which imposes quite minimal assumptions on interference, it seems reasonable to first study rate-optimality, as finite-sample optimality appears to require substantial additional structure on the problem. We note that existing results on graph cluster randomization, which require stronger restrictions on interference than this paper, are nonetheless limited to rates, and how “best” to construct clusters in practice has been an open question.

1.1 Related Literature

Most of the literature supposes interference is mediated by a network. Studying optimal design in this setting is difficult because network clusters can be highly heterogeneous in topology, and their graph-theoretic properties can closely depend on the generative model of the network [32]. We study spatial interference, and to make the optimal design problem analytically tractable, we focus on a class of designs that partitions space into equally sized squares while exploring in simulations the performance of more realistic designs that partition using clustering algorithms. We discuss in § 6 the (pessimistic) prospects of extending our approach to network interference.

There is relatively little work on optimal experimental design under interference. Viviano [50] proposes variance-minimizing two-wave experiments under network interference. Baird et al. [5] study the power of randomized saturation designs under partial interference.

A recent literature studies designs for interference that depart from unit-level randomization. A key paper motivating our work is [46], who propose graph cluster randomization designs under network interference. Ugander and Yin [47] study a new variant of these designs, and [19] consider related designs for bipartite experiments. These papers assume interference is summarized by exposure mappings, which enables the construction of unbiased estimators and use of designs in which clusters are small. Under our weaker restriction on interference, we show that large clusters are required to reduce bias, which creates a bias-variance trade-off.

Eckles et al. [15] show that graph cluster randomization can reduce the bias of common estimators for $\theta_{n}$ in the absence of correctly specified exposure mappings. Pouget-Abadie et al. [40] propose two-stage cluster-randomized designs to minimize bias under a monotonicity restriction on interference. Several papers [6, 23, 44] study linear potential outcome models and propose designs targeting the direct average treatment effect, rather than $\theta_{n}$. Under a normal-sum model, [6] compute the mean-squared error of the difference-in-means estimator, which they use to suggest model-assisted designs.

The aforementioned papers on cluster randomization target global effects such as $\theta_{n}$ [also see 9, 10]. Much of the literature on interference considers fundamentally different estimands defined by exposure mappings. When these mappings are misspecified, the estimands become functions of assignment probabilities, in which case their interpretations can be specific to the experiment run [42, 43]. Hu et al. [21] (§5) view this as “largely unavoidable” in nonparametric settings with interference. Our results show that inference on $\theta_{n}$, which avoids this issue, is possible under restrictions on interference weaker than those typically used in the literature. Additionally, papers in the literature impose an overlap assumption, which implicitly restricts the estimand [31]. We study cluster-randomized designs that directly satisfy overlap.

There is a large literature on cluster-randomized trials [e.g. 20, 37]. This literature predominantly studies partial interference, meaning that units are divided into clusters such that those in distinct clusters do not interfere. That is, the clusters themselves impose restrictions on interference. In our setting, clusters are determined by the design and do not restrict interference.

Finally, [4], [39], and [51] study spatial interference in a different “bipartite” setting in which treatments are assigned to units or locations that are distinct from the units whose outcomes are of interest. This shares some similarities with spatial cluster randomization, where different spatial regions are randomized into treatment, so some of the ideas here may be applicable to optimal design there.

1.2 Outline

The next section defines the model of spatial interference and the class of designs and estimators studied. In § 3, we derive the estimator’s rate of convergence, discuss rate-optimal designs, and provide practical design recommendations. In § 4, we prove that the estimator is asymptotically normal, propose a variance estimator, and characterize its asymptotic properties. We report results from a simulation study in § 5, exploring the use of spectral clustering to implement the designs. Finally, § 6 concludes.

2 Setup

Let $\mathcal{N}_{n}$ be a set of $n$ units. We study experiments in which units are cluster-randomized into treatment and control, postponing to § 2.2 the specifics of the design. For each $i\in\mathcal{N}_{n}$, let $D_{i}$ be a binary random variable where $D_{i}=1$ indicates that $i$ is assigned to treatment and $D_{i}=0$ indicates assignment to control. Let $\bm{D}=(D_{i})_{i\in\mathcal{N}_{n}}$ be the vector of realized treatments, and let $\bm{d}=(d_{i})_{i\in\mathcal{N}_{n}}\in\{0,1\}^{n}$ denote a non-random vector of counterfactual treatments. Recall from § 1 that $Y_{i}(\bm{d})$ is the potential outcome of unit $i$ under the counterfactual that units are assigned treatments according to $\bm{d}$. Formally, for each $n\in\mathbb{N}$ and $i\in\mathcal{N}_{n}$, $Y_{i}(\cdot)$ is a non-random function from $\{0,1\}^{n}$ to $\mathbb{R}$. We denote $i$’s factual, or observed, outcome by $Y_{i}=Y_{i}(\bm{D})$ and maintain the standard assumption that potential outcomes are uniformly bounded.

Assumption 1 (Bounded Outcomes).

$\sup_{n\in\mathbb{N}}\max_{i\in\mathcal{N}_{n}}\max_{\bm{d}\in\{0,1\}^{n}}\lvert Y_{i}(\bm{d})\rvert<\infty$.

2.1 Spatial Interference

Thus far, the model allows for unrestricted interference in the sense that $Y_{i}(\bm{d})$ may vary essentially arbitrarily in any component of $\bm{d}$. To obtain a positive result on asymptotic inference, it is necessary to restrict interference so as to establish some form of weak dependence across unit outcomes. The existing literature primarily focuses on restrictions captured by $K$-neighborhood exposure mappings, which imply that $D_{j}$ can only interfere with $Y_{i}(\bm{D})$ if the distance between $i$ and $j$ is at most $K$. We discuss below why this assumption is potentially restrictive and establish results under weaker conditions.

We assume each unit is located in $\mathbb{R}^{2}$. Label each unit by its location, so that $\mathcal{N}_{n}\subset\mathbb{R}^{2}$, and equip this space with the sup metric $\rho(i,j)=\max_{t=1,2}\lvert i_{t}-j_{t}\rvert$ for $i=(i_{1},i_{2})$ and $j=(j_{1},j_{2})$ in $\mathbb{R}^{2}$. Let $Q(i,r)=\{j\in\mathbb{R}^{2}\colon\rho(i,j)\leq r\}$, the ball of radius $r$ centered at $i$. Under the sup metric, balls are squares, and the radius is half the side length of the square. Letting $\bm{0}$ denote the origin, we consider a sequence of population regions $\{Q(\bm{0},R_{n})\}_{n\in\mathbb{N}}$ such that

\mathcal{N}_{n}\subset Q(\bm{0},R_{n}),\quad\text{where}\quad R_{n}n^{-1/2}\rightarrow c\in(0,\infty).

That is, units are located in the square $Q(\bm{0},R_{n})$ with growing radius $R_{n}$. Combined with the next increasing-domain assumption, the number of units in the region is $O(n)$, but throughout, we will simply assume the number is exactly $n$.

Assumption 2 (Increasing Domain).

There exists $\rho_{0}>0$ such that, for any $n\in\mathbb{N}$ and distinct $i,j\in\mathcal{N}_{n}$, $\rho(i,j)\geq\rho_{0}$.

This allows units to be arbitrarily irregularly spaced, subject to being minimally separated by some distance $\rho_{0}$, a widely used sampling framework in the spatial literature [e.g. 25]. In contrast, “infill” asymptotic approaches, which do not require minimal separation and instead assume increasingly dense sampling from a fixed region, can yield nonstandard limiting behavior [28]. For some applications, the spatial distribution of units may exhibit “hotspots” with unusually high densities, perhaps making the infill approach more plausible. Some work adopts a hybrid of the two approaches [29, 30], and it may be possible to extend our results to this framework.

Let $\mathbb{R}_{+}$ denote the set of non-negative reals, and let

\mathcal{N}(i,K)=Q(i,K)\cap\mathcal{N}_{n}

denote the $K$-neighborhood of $i$. We study the following model of interference, similar to that proposed by [31].

Assumption 3 (Interference).

There exists a non-increasing function $\psi\colon\mathbb{R}_{+}\rightarrow\mathbb{R}_{+}$ such that $\psi(0)\in(0,\infty)$, $\sum_{s=1}^{\infty}s\,\psi(s)<\infty$, and, for all $s\in\mathbb{R}_{+}$,

\sup_{n\in\mathbb{N}}\max_{i\in\mathcal{N}_{n}}\max\big\{\lvert Y_{i}(\bm{d})-Y_{i}(\bm{d}^{\prime})\rvert\colon\bm{d},\bm{d}^{\prime}\in\{0,1\}^{n},\,d_{j}=d_{j}^{\prime}\,\,\forall j\in\mathcal{N}(i,s)\big\}\leq\psi(s).

To interpret this, observe that the inner maximum ranges over pairs of treatment assignment vectors that fix the assignments of units in $i$’s $s$-neighborhood but allow assignments to vary freely outside of this neighborhood. It therefore measures the degree of spatial interference in terms of the maximum change to $i$’s potential outcome caused by manipulating treatments assigned to units $k$ “distant” from $i$ in the sense that $\rho(i,k)>s$. The assumption requires the degree of interference to vanish with the neighborhood radius $s$, so that treatments assigned to more distant alters interfere less with the ego. The rate at which interference vanishes is controlled by $\psi(s)$, which is required to decay faster than $s^{-2}$.

Remark 1 (Necessity of rate condition).

Assumption 3(b) of [25] and Assumption 4(c) of [27] impose the same minimum rate of decay as Assumption 3 on various measures of spatial dependence (mixing or near-epoch dependence coefficients) to establish central limit theorems. If the rate is slower, then the variance can be infinite. For example, consider a spatial process $\{Z_{i}\}_{i\in\mathcal{N}_{n}}$ such that units are positioned on the integer lattice $\mathbb{Z}^{2}$ and $\text{Cov}(Z_{i},Z_{j})=f(\rho(i,j))$ for some function $f(\cdot)$ and any $i,j$. Then

\text{Var}\left(\frac{1}{\sqrt{n}}\sum_{i=1}^{n}Z_{i}\right)=\frac{1}{n}\sum_{i=1}^{n}\sum_{s=0}^{\infty}\sum_{j\colon\rho(i,j)=s}\text{Cov}(Z_{i},Z_{j})=\sum_{s=0}^{\infty}f(s)\frac{1}{n}\sum_{i=1}^{n}\lvert\{j\colon\rho(i,j)=s\}\rvert.

Note that $\lvert\{j\colon\rho(i,j)=s\}\rvert\leq 8s$ for $s\geq 1$, with equality achieved for all $i$ not near the boundary of the population region. Thus, for large $n$, a finite variance requires that $f(s)$ decay with $s$ faster than $s^{-2}$.
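The lattice count and the resulting convergence threshold can be checked numerically. The sketch below verifies that exactly $8s$ lattice points lie at sup-distance $s$ from the origin, and that the variance sum $\sum_s 8s\,f(s)$ with $f(s)=s^{-\gamma}$ converges only when $\gamma>2$:

```python
import math

# Lattice points at sup-distance exactly s from the origin: the boundary of a
# (2s+1) x (2s+1) square has (2s+1)^2 - (2s-1)^2 = 8s points.
def boundary_count(s):
    return sum(1 for x in range(-s, s + 1) for y in range(-s, s + 1)
               if max(abs(x), abs(y)) == s)

for s in [1, 2, 5]:
    assert boundary_count(s) == 8 * s

# Partial sums of 8s * f(s) with f(s) = s^(-gamma).
def partial(gamma, S):
    return sum(8 * s * s ** (-gamma) for s in range(1, S))

print(partial(3.0, 10_000))  # converges: approaches 8 * zeta(2) ≈ 13.16
print(partial(2.0, 10_000))  # grows like 8 * log(S): diverges as S -> infinity
```

This mirrors the calculation in the remark: the number of terms at distance $s$ grows linearly, so a decay rate of exactly $s^{-2}$ leaves a logarithmically divergent sum.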

We next discuss two models of interference satisfying Assumption 3. The first is the standard approach of specifying a $K$-neighborhood exposure mapping. Such a mapping is given by $T_{i}=T(i,\bm{D},\mathcal{N}_{n})$ with the crucial property that its dimension does not depend on $n$, unlike that of $\bm{D}$. The approach is to assume that the low-dimensional $T_{i}$ summarizes interference by reparameterizing potential outcomes as

Y_{i}(\bm{D})=\tilde{Y}_{i}(T_{i}). (1)

That is, once we fix $i$’s exposure mapping $T_{i}$, its potential outcome is fully determined. No less important, it is also typically assumed that exposure mappings are restricted to a unit’s $K$-neighborhood, where $K$ is small, meaning fixed with respect to $n$. Formally, $T(i,\bm{d},\mathcal{N}_{n})=T(i,\bm{d}^{\prime},\mathcal{N}_{n})$ for any $\bm{d},\bm{d}^{\prime}$ such that $d_{j}=d_{j}^{\prime}$ for all $j\in\mathcal{N}(i,K)$, which implies that the treatment assigned to a unit $j$ only interferes with $i$ if $\rho(i,j)\leq K$. In practice, choices with $K=1$ are most common, for example $T_{i}=(D_{i},S_{i})$ for $S_{i}=\bm{1}\{\sum_{j\in\mathcal{N}_{n}}G_{ij}D_{j}>0\}$ or $S_{i}=\sum_{j\in\mathcal{N}_{n}}G_{ij}D_{j}$, where $G_{ij}=\bm{1}\{\rho(i,j)\leq 1\}$. In these examples, $D_{i}$ captures the direct effect of the treatment, and $S_{i}$ captures interference from units near $i$.

Crucially, $T_{i}$ and $K$ must be known to the analyst in this approach, which is often a strong requirement. In contrast, Assumption 3 enables the analyst to impose (1) while requiring neither to be known. Indeed, if there exists a $K$-neighborhood exposure mapping satisfying (1), then Assumption 3 holds with $\psi(s)=c\,\bm{1}\{s\leq K\}$ for some sufficiently large $c$.

Furthermore, Assumption 3 allows for more complex forms of interference, ruled out by (1), in which interference decays smoothly with distance rather than being truncated at some distance $K$. The former is analogous to mixing conditions, which are widespread in the time series and spatial literatures, while the latter is analogous to $m$-dependence, which rules out interesting forms of autocorrelation, including models as basic as the AR(1).

In the spatial context, our assumption accommodates, for example, the Cliff-Ord autoregressive model [11, 12], which is a workhorse model of spatial autocorrelation used in a variety of fields, including geography [17], ecology [48], and economics [2]. A typical formulation of the model is

Y_{i}=\alpha+\lambda\sum_{j\in\mathcal{N}_{n}}W_{ij}Y_{j}+D_{i}\beta+\varepsilon_{i},

where we assume $\varepsilon_{i}$ is uniformly bounded to satisfy Assumption 1. Let $\bm{W}$ be the $n\times n$ spatial weight matrix whose $ij$th entry is $W_{ij}$. These weights typically decay with distance $\rho(i,j)$ in a sense made precise below. While this model is highly stylized, the important aspect it captures is autocorrelation through the spatial autoregressive parameter $\lambda$. If this is nonzero, then there is no $K$-neighborhood exposure mapping for which (1) holds, a point previously noted by [15] in the context of network interference.

To see this, first note that coherency of the model requires nonsingularity of $\bm{I}-\lambda\bm{W}$, where $\bm{I}$ is the $n\times n$ identity matrix. Let $\bm{V}$ be the inverse of this matrix and $V_{ij}$ its entry corresponding to units $(i,j)$. Then the reduced form of the model is

Y_{i}(\bm{D})\equiv Y_{i}=\sum_{j\in\mathcal{N}_{n}}V_{ij}(\alpha+D_{j}\beta+\varepsilon_{j}), (2)

a spatial “moving average” model with spatial weight matrix $\bm{V}$. (See § 5 for some examples of $\bm{V}$.) Noticeably, $Y_{i}(\bm{D})$ can potentially depend on $D_{j}$ for any $j\in\mathcal{N}_{n}$, which is ruled out if one imposes a $K$-neighborhood exposure mapping.
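The reduced form (2) is straightforward to simulate. The sketch below uses a hypothetical inverse-distance, row-normalized weight matrix (one concrete choice for illustration, not the paper's) and shows that flipping a single unit's treatment moves every outcome, so no $K$-neighborhood exposure mapping can hold:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 40
locations = rng.uniform(0, 20, size=(n, 2))
rho = np.abs(locations[:, None, :] - locations[None, :, :]).max(axis=2)

# Hypothetical row-normalized inverse-distance weights, zero on the diagonal.
W = 1.0 / (1.0 + rho)
np.fill_diagonal(W, 0.0)
W /= W.sum(axis=1, keepdims=True)

alpha, beta, lam = 1.0, 2.0, 0.3   # |lam| < 1 keeps I - lam*W nonsingular here
eps = rng.uniform(-1, 1, size=n)   # bounded errors, in line with Assumption 1

# Reduced form: Y(D) = V (alpha + beta*D + eps) with V = (I - lam*W)^{-1}.
V = np.linalg.inv(np.eye(n) - lam * W)

def Y(D):
    return V @ (alpha + beta * np.asarray(D) + eps)

# Flip one unit's treatment and observe that every outcome changes,
# because every off-diagonal entry of V is nonzero in this example.
D0 = np.zeros(n)
D1 = D0.copy()
D1[0] = 1.0
print(np.abs(Y(D1) - Y(D0)).min())  # strictly positive: global interference
```

Because $\bm{V}=\sum_{k\geq 0}\lambda^{k}\bm{W}^{k}$ has all positive entries here, manipulating any $D_j$ perturbs every $Y_i$, exactly the behavior that rules out truncated exposure mappings.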

Outcomes satisfying (2) are near-epoch dependent, a notion of weak spatial dependence, when the weights decay with spatial distance in the following sense:

\sup_{n\in\mathbb{N}}\max_{i\in\mathcal{N}_{n}}\sum_{j\in\mathcal{N}_{n}}\lvert V_{ij}\rvert\rho(i,j)^{\gamma}<\infty (3)

for some $\gamma>0$ [see Proposition 5 and eq. (13) of 26]. The next result shows that this condition is sufficient for verifying Assumption 3 if $\gamma>2$.

Proposition 1.

Suppose potential outcomes are given by (2) and spatial weights satisfy (3) for some $\gamma>2$. Then Assumption 3 holds with $\psi(s)=c\,\min\{s^{-\gamma},1\}$ for some $c\in(0,\infty)$ that does not depend on $s$.

Proof.

Fix $s\geq 0$ and any $\bm{d},\bm{d}^{\prime}\in\{0,1\}^{n}$ such that $d_{j}=d_{j}^{\prime}$ for all $j\in\mathcal{N}(i,s)$. For $s<1$, $\lvert Y_{i}(\bm{d})-Y_{i}(\bm{d}^{\prime})\rvert\leq c_{1}\equiv 2\sup_{n}\max_{i}\max_{\bm{d}}\lvert Y_{i}(\bm{d})\rvert$. For $s\geq 1$,

\lvert Y_{i}(\bm{d})-Y_{i}(\bm{d}^{\prime})\rvert\leq\lvert\beta\rvert\sum_{j\in\mathcal{N}_{n}}\lvert V_{ij}\rvert\lvert d_{j}-d_{j}^{\prime}\rvert\bm{1}\{\rho(i,j)>s\}\leq s^{-\gamma}\lvert\beta\rvert\sum_{j\in\mathcal{N}_{n}}\lvert V_{ij}\rvert\rho(i,j)^{\gamma}.

Defining $c_{2}=\sup_{n}\max_{i}\sum_{j\in\mathcal{N}_{n}}\lvert\beta\rvert\lvert V_{ij}\rvert\rho(i,j)^{\gamma}$ and $c=\max\{c_{1},c_{2}\}+1>0$, the inequality in Assumption 3 holds with $\psi(s)=c\,\min\{s^{-\gamma},1\}$ by construction. Furthermore, $c<\infty$ by (3) and uniform boundedness of $\{\varepsilon_{i}\}_{i\in\mathcal{N}_{n}}$, and $\psi(s)$ satisfies $\sum_{s=1}^{\infty}s\,\psi(s)<\infty$ because $\gamma>2$. ∎

This result shows that, unlike the standard approach of imposing a $K$-neighborhood exposure mapping, Assumption 3 can allow for richer forms of interference in which alters arbitrarily distant from the ego can interfere with the ego’s response.

Remark 2 (Literature).

Assumption 3 and Proposition 1 are spatial analogs of Assumptions 4 and 6 and Proposition 1 of [31], who studies interference mediated by an unweighted network, Bernoulli designs, and a different class of estimands defined by exposure mappings satisfying overlap. We study the global average treatment effect and cluster-randomized designs that induce overlap by introducing dependence in treatment assignments, and we further derive rate-optimal designs. These differences require an entirely distinct asymptotic theory.

2.2 Design and Estimator

Much of the literature on interference considers designs in which units are individually randomized into treatment and control, either via Bernoulli or complete randomization. A common problem faced by such designs is limited overlap, meaning that some realizations of the exposure mapping occur with low probability. For example, suppose that (1) holds with exposure mapping $T_{i}=\sum_{j\in\mathcal{N}_{n}}\bm{1}\{\rho(i,j)\leq K\}D_{j}$, the number of treated units in $i$’s $K$-neighborhood. Then in a Bernoulli design, for large values of $K$, ${\bf P}(T_{i}=0)$ is small, tending to zero with $K$ at an exponential rate. This is problematic for a Horvitz-Thompson estimator such as $n^{-1}\sum_{i\in\mathcal{N}_{n}}(p_{i}(t)^{-1}\bm{1}\{T_{i}=t\}-p_{i}(t^{\prime})^{-1}\bm{1}\{T_{i}=t^{\prime}\})Y_{i}$, where $p_{i}(t)={\bf P}(T_{i}=t)$, since its variance grows rapidly with $K$ if either $t$ or $t^{\prime}$ is zero. Ugander et al. [46] propose cluster-randomized designs, which mitigate this problem by deliberately introducing dependence in treatment assignments across certain units.

We consider the following class of such designs. We assign units to mutually exclusive clusters by partitioning the population region $Q(\bm{0},R_{n})$ into $m_{n}\leq n$ equally sized squares, assuming for simplicity that $m_{n}\in\{4^{s}\colon s\in\mathbb{N}\}$. That is, to obtain increasingly more clusters, we first divide the population region into four squares, then divide each of these squares into four squares, and so on, as in Figure 2. Label the $m_{n}$ squares $Q_{1},\dots,Q_{m_{n}}$, and call

C_{k}=Q_{k}\cap\mathcal{N}_{n}

cluster $k$. Then the number of units in each cluster is uniformly $O(n/m_{n})$ under Assumption 2, and the radius of each cluster is

r_{n}\equiv R_{n}/\sqrt{m_{n}}=\sqrt{n/m_{n}}+o(1),

which we assume is greater than 1. We also assume there are no units on the common boundaries of different squares, so the squares partition $\mathcal{N}_{n}$.
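The square partition and cluster assignment can be sketched as follows, with hypothetical uniform unit locations; each unit inherits the label of the square containing it, and each square is randomized independently:

```python
import numpy as np

rng = np.random.default_rng(3)
n, R = 200, 10.0          # population region Q(0, R) is the square [-R, R]^2
locations = rng.uniform(-R, R, size=(n, 2))

s = 2
m = 4 ** s                # number of clusters, m_n in {4^s : s in N}
per_side = 2 ** s         # region splits into per_side x per_side squares
r = R / np.sqrt(m)        # cluster radius r_n = R_n / sqrt(m_n)

# Map each unit to the square (cluster) containing it; each square has
# side length 2r, and min() guards units on the upper boundary.
def cluster_of(loc):
    col = min(int((loc[0] + R) / (2 * r)), per_side - 1)
    row = min(int((loc[1] + R) / (2 * r)), per_side - 1)
    return row * per_side + col

labels = np.array([cluster_of(loc) for loc in locations])

# Cluster-randomized design: each cluster is treated independently w.p. p,
# and every unit inherits its cluster's assignment.
p = 0.5
cluster_treat = rng.binomial(1, p, size=m)
D = cluster_treat[labels]
```

The paper's § 3.3 recommends clustering algorithms for irregular regions; the equal-square partition here matches the analytically tractable class studied in the theory.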

Figure 2: Sequence of population regions and clusters as $n\rightarrow\infty$.

A cluster-randomized design first independently assigns each cluster to treatment or control with some probability $p\in(0,1)$ that is fixed with respect to $n$. Then within a treated (control) cluster $C_{k}$, all $i\in C_{k}$ are assigned $D_{i}=1$ ($D_{i}=0$). To emphasize that we use this design in later theorems, we state it as a separate assumption.

Assumption 4 (Design).

For any $n$, $\bm{D}$ is realized according to a cluster-randomized design with $m_{n}$ clusters constructed as above.

Note that $m_{n}$ will be required to diverge with $n$, since a large number of clusters is needed for the estimator to concentrate. If $m_{n}$ is of order $n$, then $r_{n}=O(1)$, so clusters are asymptotically bounded in size, the usual case studied in the literature, which includes unit-level Bernoulli randomization as a special case. If $m_{n}$ is of smaller order, then cluster sizes grow with $n$.

To construct the estimator, define the neighborhood exposure indicator

T_{ti}=\prod_{j\in\mathcal{N}(i,\kappa_{n})}\bm{1}\{D_{j}=t\}\quad\text{for}\quad t\in\{0,1\},\quad\kappa_{n}=r_{n}/2.

This is an indicator for whether $i$’s $\kappa_{n}$-neighborhood is entirely treated ($t=1$) or entirely untreated ($t=0$). Unlike $K$-neighborhood exposure mappings, the radius $\kappa_{n}$ will be allowed to diverge. Let $p_{ti}={\bf E}[T_{ti}]$. We study the Horvitz-Thompson estimator

\hat{\theta}=\frac{1}{n}\sum_{i\in\mathcal{N}_{n}}Z_{i}\quad\text{for}\quad Z_{i}=\left(\frac{T_{1i}}{p_{1i}}-\frac{T_{0i}}{p_{0i}}\right)Y_{i}.

Intuitively, $n^{-1}\sum_{i\in\mathcal{N}_{n}}Y_{i}T_{1i}p_{1i}^{-1}$ estimates $n^{-1}\sum_{i\in\mathcal{N}_{n}}Y_{i}(\bm{1}_{n})$ using the outcomes of units whose neighbors within radius $\kappa_{n}$ are all treated. Since the radius depends on $m_{n}$ through $r_{n}$, $\hat{\theta}$ is a function of the number of clusters dictated by the design. Figure 1 depicts the relationship between the clusters and the $\kappa_{n}$-neighborhoods that determine exposure.
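Putting the design and estimator together, a compact simulation sketch follows. The outcome model is hypothetical, and the cluster-intersection count $k$ from Remark 3 is approximated here by the clusters of sampled neighbors (both are assumptions of this sketch, not the paper's construction):

```python
import numpy as np

rng = np.random.default_rng(4)
n, R, s = 300, 10.0, 1
m, per_side = 4 ** s, 2 ** s
r = R / np.sqrt(m)          # cluster radius r_n
kappa = r / 2               # neighborhood exposure radius kappa_n = r_n / 2
p = 0.5

locations = rng.uniform(-R, R, size=(n, 2))
rho = np.abs(locations[:, None, :] - locations[None, :, :]).max(axis=2)

# Assign each unit to the square cluster containing it.
col = np.minimum(((locations[:, 0] + R) / (2 * r)).astype(int), per_side - 1)
row = np.minimum(((locations[:, 1] + R) / (2 * r)).astype(int), per_side - 1)
labels = row * per_side + col
cluster_treat = rng.binomial(1, p, size=m)
D = cluster_treat[labels]

# Hypothetical outcomes: direct effect plus spillover from kappa-neighbors.
neighbors = rho <= kappa                      # includes the ego itself
Y = 1.0 + 2.0 * D + 0.5 * (neighbors @ D) / neighbors.sum(axis=1)

# Exposure indicators: is i's kappa-neighborhood entirely treated/untreated?
T1 = np.array([D[neighbors[i]].all() for i in range(n)], dtype=float)
T0 = np.array([(1 - D[neighbors[i]]).all() for i in range(n)], dtype=float)

# Exposure probabilities p^k and (1-p)^k, with k approximated by the number
# of distinct clusters among sampled units in i's kappa-neighborhood.
k = np.array([len(set(labels[neighbors[i]].tolist())) for i in range(n)])
p1, p0 = p ** k, (1 - p) ** k

# Horvitz-Thompson estimator of the global average treatment effect.
theta_hat = np.mean((T1 / p1 - T0 / p0) * Y)
```

In this toy model the estimand is $2.5$ (direct effect plus full spillover); a single draw of $\hat{\theta}$ is noisy, which is exactly the bias-variance trade-off in the choice of $m_n$ discussed next.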

Since nontrivial designs include both treated and untreated units, $\hat{\theta}$ is biased for the global average treatment effect. The choice of design trades off the size of the bias against that of the variance: small choices of $m_{n}$ (few clusters, large radii) induce lower bias and higher variance. In § 3, we discuss nearly rate-optimal choices of $m_{n}$ for which the bias is asymptotically negligible.

Remark 3 (Overlap).

Under Bernoulli randomization, overlap needs to be imposed as a separate assumption for asymptotic inference. By overlap we mean that the probability weights $p_{1i}$ and $p_{0i}$ are uniformly bounded away from zero and one, which imposes potentially strong restrictions on the types of exposure mappings the analyst can use, as previously illustrated. In our setup, however, overlap is directly satisfied because $p_{1i}=p^{k}$ and $p_{0i}=(1-p)^{k}$, where $k$ is the number of clusters that intersect $i$’s $\kappa_{n}$-neighborhood. Our choice of $\kappa_{n}$ implies $k\in[1,4]$ for all $i$, so overlap holds.

Remark 4 (Neighborhood radius).

Let $\bm{c}(C_{k})\in\mathbb{R}^{2}$ be the centroid of cluster $C_{k}$. The choice of $\kappa_{n}=r_{n}/2$ ensures that, for any unit $i$ in the “interior” of a cluster in the sense that $i\in Q(\bm{c}(C_{k}),r_{n}/2)$, $i$’s $\kappa_{n}$-neighborhood also lies within that cluster, in which case the exposure probabilities are simply given by the cluster assignment probability: $p_{1i}=p$ and $p_{0i}=1-p$. If we had instead chosen, say, $\kappa_{n}=r_{n}$, then this would be true only for the centroid, while for the remaining units, $p_{1i}$ and $p_{0i}$ could be as small as $p^{4}$, which means less overlap and a more variable estimate. For the purposes of the asymptotic theory, the main requirement is that $\kappa_{n}$ has the same asymptotic order as $r_{n}$. If $\kappa_{n}$ were of smaller order, then results in § 3 show that the bias of $\hat{\theta}$ could be non-negligible, whereas if $\kappa_{n}$ were of larger order, then $k$ in Remark 3 would grow with $n$, overlap would be limited, and $\text{Var}(\hat{\theta})$ could be large.

3 Rate-Optimal Designs

We next derive the rate of convergence of θ^\hat{\theta} as a function of nn, mnm_{n}, and ψ()\psi(\cdot), which we use to obtain rate-optimal choices of mnm_{n}. Recall that designs are parameterized by mnm_{n}, which determines the number and sizes of clusters, and also that θ^\hat{\theta} depends on mnm_{n} through the neighborhood exposure radius κn\kappa_{n}, so we will be optimizing over both the design and the radius that determines the estimator.

3.1 Rate of Convergence

We first provide asymptotic upper bounds on the bias and variance of θ^\hat{\theta}. For two sequences {an}n\{a_{n}\}_{n\in\mathbb{N}} and {bn}n\{b_{n}\}_{n\in\mathbb{N}}, we write anbna_{n}\lesssim b_{n} to mean an/bn=O(1)a_{n}/b_{n}=O(1) and anbna_{n}\gtrsim b_{n} to mean bn/an=O(1)b_{n}/a_{n}=O(1).

Theorem 1.

Under Assumptions 14, |𝐄[θ^]θn|ψ(rn/2)\lvert{\bf E}[\hat{\theta}]-\theta_{n}\rvert\lesssim\psi(r_{n}/2) and Var(θ^)mn1\text{Var}(\hat{\theta})\lesssim m_{n}^{-1}.

Proof.

First, we bound the bias. If T1i=1T_{1i}=1, then all units in Q(i,κn)Q(i,\kappa_{n}) are treated, so by Assumption 3,

|𝐄[YiT1ip1i1]Yi(𝟏n)|=|𝐄[Yi(𝑫)T1i=1]Yi(𝟏n)|ψ(κn).\lvert{\bf E}[Y_{i}T_{1i}p_{1i}^{-1}]-Y_{i}(\bm{1}_{n})\rvert=\lvert{\bf E}[Y_{i}(\bm{D})\mid T_{1i}=1]-Y_{i}(\bm{1}_{n})\rvert\leq\psi(\kappa_{n}).

The same argument applies to |𝐄[YiT0ip0i1]Yi(𝟎n)|\lvert{\bf E}[Y_{i}T_{0i}p_{0i}^{-1}]-Y_{i}(\bm{0}_{n})\rvert, so combining these results, we obtain the rate for the bias.

Next, we bound the variance. The following is an important object in our analysis and also for later constructing the variance estimator:

Λi={j𝒩n:k s.t. Ck𝒩(i,κn),Ck𝒩(j,κn)}.\Lambda_{i}=\{j\in\mathcal{N}_{n}\colon\exists\,k\,\text{ s.t. }\,C_{k}\cap\mathcal{N}(i,\kappa_{n})\neq\varnothing,\,C_{k}\cap\mathcal{N}(j,\kappa_{n})\neq\varnothing\}. (4)

This is the set of units jj whose κn\kappa_{n}-neighborhoods intersect a cluster CkC_{k} that also intersects ii’s κn\kappa_{n}-neighborhood. We have

Var(1ni𝒩nZi)=1n2i𝒩njΛiCov(Zi,Zj)+1n2i𝒩njΛiCov(Zi,Zj)[A]+[B].\text{Var}\left(\frac{1}{n}\sum_{i\in\mathcal{N}_{n}}Z_{i}\right)=\frac{1}{n^{2}}\sum_{i\in\mathcal{N}_{n}}\sum_{j\in\Lambda_{i}}\text{Cov}(Z_{i},Z_{j})+\frac{1}{n^{2}}\sum_{i\in\mathcal{N}_{n}}\sum_{j\not\in\Lambda_{i}}\text{Cov}(Z_{i},Z_{j})\equiv[A]+[B]. (5)

Note that Λi\Lambda_{i} contains units from at most 16 clusters (the worst case is when Q(i,κn)Q(i,\kappa_{n}) intersects four clusters), and clusters contain uniformly O(n/mn)O(n/m_{n}) units by Lemma A.1 of [25]. By Assumption 1 and Remark 3, {Zi}i𝒩n\{Z_{i}\}_{i\in\mathcal{N}_{n}} is uniformly bounded, so

[A]1n2nnmn=1mn.[A]\lesssim\frac{1}{n^{2}}\cdot n\cdot\frac{n}{m_{n}}=\frac{1}{m_{n}}.

The difficult part of the argument is obtaining a tight enough rate for [B][B], which requires control over the dependence between outcomes of units i,ji,j that are “distant” in the sense that jΛij\not\in\Lambda_{i}. Lemma B.1 in § B proves that [B](nmn)1/2[B]\lesssim(nm_{n})^{-1/2}. Therefore, Var(θ^)mn1+(nmn)1/2\text{Var}(\hat{\theta})\lesssim m_{n}^{-1}+(nm_{n})^{-1/2}, which is at most mn1m_{n}^{-1} since mnnm_{n}\leq n. ∎
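The sets Λ_i of (4), equivalently the indicators A_ij used later in the variance estimator (9), can be computed directly from locations and cluster labels. The sketch below is our own; it proxies "cluster C_k intersects the κ_n-neighborhood of i" by "some unit of C_k lies within κ_n of i", a finite-sample stand-in for the spatial intersection in the paper.

```python
import numpy as np

def dependency_matrix(pos, labels, kappa_n):
    """A[i, j] = 1 iff j is in Lambda_i, i.e. some cluster intersects both
    units' kappa_n-neighborhoods (intersection proxied at the unit level)."""
    n = len(pos)
    dist = np.linalg.norm(pos[:, None, :] - pos[None, :, :], axis=2)
    m = int(labels.max()) + 1
    touch = np.zeros((n, m), dtype=bool)  # touch[i, k]: C_k meets i's neighborhood
    for k in range(m):
        touch[:, k] = (dist[:, labels == k] <= kappa_n).any(axis=1)
    A = (touch.astype(int) @ touch.astype(int).T) > 0
    return A.astype(int)
```

By construction A is symmetric with unit diagonal, and it dominates the lower bound 1{ρ(i,j) ≤ κ_n} appearing in (A.1).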

Our second result provides asymptotic lower bounds.

Theorem 2.

Suppose ψ(s)=min{sγ,0.5}\psi(s)=\min\{s^{-\gamma},0.5\} for some γ>2\gamma>2. Under Assumption 4, there exists a sequence of units and potential outcomes {{Yi()}i𝒩n:n}\{\{Y_{i}(\cdot)\}_{i\in\mathcal{N}_{n}}\colon n\in\mathbb{N}\} satisfying Assumptions 13 such that |𝐄[θ^]θn|ψ(1.5rn)\lvert{\bf E}[\hat{\theta}]-\theta_{n}\rvert\gtrsim\psi(1.5r_{n}) and Var(θ^)mn1\text{Var}(\hat{\theta})\gtrsim m_{n}^{-1}.

Proof.

See supplemental material [33]. ∎

The result shows that we can construct potential outcomes satisfying the assumptions of Theorem 1 such that the bias is at least order ψ(1.5rn)\psi(1.5r_{n}) and the variance at least mn1m_{n}^{-1}. As discussed in § 1, existing work on cluster randomization under interference assumes clusters have asymptotically bounded size, which, in our setting, implies rn=O(1)r_{n}=O(1). Theorem 2 implies that the bias of the Horvitz-Thompson estimator can then be bounded away from zero, showing that existing results strongly rely on the exposure mapping assumption to obtain unbiased estimates. In the absence of this assumption, it is necessary to consider designs in which cluster sizes are large to ensure the bias vanishes with nn.

3.2 Design Examples

Theorem 1 implies the mean-squared error of θ^\hat{\theta} is at most of order ψ(rn/2)2+mn1\psi(r_{n}/2)^{2}+m_{n}^{-1}, and Theorem 2 provides a similar asymptotic lower bound. Under either bound, the bias increases with mnm_{n} while the variance decreases, so there exists a bias-variance trade-off in the choice of design. We next derive rates for mnm_{n} that minimize or nearly minimize the upper bound under different assumptions on ψ()\psi(\cdot). Based on these results, we make recommendations for practical implementation in the next subsection.

Oracle design. Suppose ψ(s)\psi(s) is known to decay with ss at some rate ϕ(s)\phi(s). Then by definition of rnr_{n}, a rate-optimal design chooses mnm_{n} to minimize ϕ(0.5Rnmn1/2)2+mn1\phi(0.5R_{n}m_{n}^{-1/2})^{2}+m_{n}^{-1}.

Exposure mappings. If we assume (1) holds for some KK-neighborhood exposure mapping, then ψ(s)=0\psi(s)=0 for all s>Ks>K. If KK is known, then by choosing mn=Rn2(2K)2m_{n}=R_{n}^{2}(2K)^{-2}, we have κn=K\kappa_{n}=K and zero bias. In this case, clusters are asymptotically bounded in size, the estimator converges at rate n1/2n^{-1/2}, and both the design and estimator qualitatively coincide with those of [46].

On the other hand, if KK is unknown, then for a nearly rate-optimal design, we can choose κn\kappa_{n} to grow at a slow rate so that it eventually exceeds any fixed KK. This may be achieved by choosing mnm_{n} to grow slightly slower than nn, say n/log(n)n/\log(n). Then for large enough nn, the bias is zero, and the rate of convergence is log(n)/n\sqrt{\log(n)/n}.

Exponential decay. Common specifications of the spatial weight matrix 𝑾\bm{W} in the Cliff-Ord model imply that ψ(s)\psi(s) decays exponentially with ss, for example, the row-normalized matrix

Wij=𝟏{ρ(i,j)1}k𝒩n𝟏{ρ(i,k)1}.W_{ij}=\frac{\bm{1}\{\rho(i,j)\leq 1\}}{\sum_{k\in\mathcal{N}_{n}}\bm{1}\{\rho(i,k)\leq 1\}}. (6)

If ψ(s)\psi(s) is known to decay at some exponential rate but the exponent is unknown, then we may choose mn=n1ϵm_{n}=n^{1-\epsilon} for any small ϵ>0\epsilon>0 for a nearly rate-optimal design, which yields a rate of convergence of n0.5(1ϵ)n^{-0.5(1-\epsilon)}, close to an n1/2n^{-1/2}-rate. This shows that rates close to n1/2n^{-1/2} are attainable in the absence of exposure mapping assumptions, despite targeting the global average treatment effect.

Worst-case decay. In practice, we may have little prior knowledge about ψ(s)\psi(s). Recall that Assumption 3 requires the rate of decay to be no slower than s2(1+ϵ)s^{-2(1+\epsilon)} for ϵ>0\epsilon>0. As discussed in Remark 1, this is the slowest rate for spatial dependence that ensures a finite variance. For this rate, since RnR_{n} is order n\sqrt{n}, the bias is order (n/mn)(1+ϵ)(n/m_{n})^{-(1+\epsilon)}. Without knowledge of ϵ\epsilon, we can settle for a nearly rate-optimal design by setting ϵ=0\epsilon=0 and choosing mnm_{n} to minimize (n/mn)2+mn1(n/m_{n})^{-2}+m_{n}^{-1}, which yields mn=n2/3m_{n}=n^{2/3} and an n1/3n^{-1/3}-rate of convergence. Under this design, cluster sizes grow at the rate rn2=n1/3r_{n}^{2}=n^{1/3}.
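The worst-case trade-off can be checked numerically. This small sketch (ours) minimizes the order bound (n/m)^{-2} + m^{-1} over integer m and compares the result with the closed-form minimizer m* = (n²/2)^{1/3}, which has the n^{2/3} order stated above.

```python
import numpy as np

def mse_bound(m, n):
    # bias^2 + variance orders under the worst-case decay, up to constants:
    # bias ~ (n/m)^{-1}, so bias^2 = (m/n)^2, while the variance is ~ 1/m.
    return (m / n) ** 2 + 1.0 / m

n = 10_000
grid = np.arange(1, n + 1, dtype=float)
m_star = grid[np.argmin(mse_bound(grid, n))]  # numerical minimizer
```

Setting the derivative 2m/n² − 1/m² to zero gives m* = (n²/2)^{1/3} ≈ 368.4 for n = 10,000, against n^{2/3} ≈ 464.2: the same order, a constant apart.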

In the last three designs, the bias is o(mn1/2)o(m_{n}^{-1/2}), which is of smaller order than the standard deviation. This renders the bias negligible from an asymptotic standpoint, although it would still be useful to develop bias-reduction methods. We also reiterate that, while this analysis only provides rates of convergence, this appears to be unavoidable at this level of generality: a finite-sample optimal design seems to require substantially more knowledge of the functional form of ψ()\psi(\cdot).

3.3 Practical Recommendations

The designs in the previous section rely on varying degrees of knowledge of ψ()\psi(\cdot), the rate at which interference vanishes with distance. In practice, this is likely unknown, so we recommend operating under the worst-case rate of decay discussed in the previous subsection. The default conservative choice we recommend using in practice is the near-optimal rate described there, namely

mn=n2/3.m_{n}=n^{2/3}. (7)

To construct the clusters, we recommend partitioning space into mnm_{n} clusters using a clustering algorithm, such as spectral clustering. A confidence interval (CI) for θn\theta_{n} is given in (11). In § 5, we explore in simulations the performance of the CI when clusters are constructed according to these recommendations.

Our large-sample theory assumes space is subdivided into evenly sized squares in order to avoid the difficult problem of optimizing over arbitrary shapes. However, since units are typically irregularly distributed in practice, division into equally sized squares may be inefficient, which is why we recommend the use of clustering algorithms. We suggest spectral clustering because it recovers, under weak conditions, low-conductance clusters [38], and low conductance is the key property of clusters utilized in our proofs, as discussed in § 6.
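A minimal implementation of these recommendations is sketched below (our own code, not the paper's; `cluster_randomize` and its arguments are our names). It uses scikit-learn's `SpectralClustering` with the Gaussian affinity from § 5 and the default choice (7) for the number of clusters.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def cluster_randomize(pos, p=0.5, seed=0):
    """Partition units into m_n = round(n^{2/3}) clusters by spectral
    clustering on the Gaussian affinity exp{-rho(i, j)^2}, then assign
    each cluster to treatment independently with probability p."""
    rng = np.random.default_rng(seed)
    n = len(pos)
    m_n = int(round(n ** (2 / 3)))  # the default conservative choice (7)
    dist2 = ((pos[:, None, :] - pos[None, :, :]) ** 2).sum(axis=2)
    labels = SpectralClustering(
        n_clusters=m_n, affinity="precomputed", random_state=seed
    ).fit_predict(np.exp(-dist2))
    cluster_treat = rng.binomial(1, p, size=m_n)  # cluster-level coin flips
    return labels, cluster_treat[labels]  # cluster labels, unit treatments
```

The returned treatment vector assigns every unit its cluster's Bernoulli(p) draw, which is the randomization scheme assumed throughout.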

4 Inference

We next state results for asymptotic inference on θn\theta_{n}. Define σn2=Var(mnθ^)\sigma_{n}^{2}=\text{Var}(\sqrt{m_{n}}\hat{\theta}).

Assumption 5 (Non-degeneracy).

lim infnσn2>0\liminf_{n\rightarrow\infty}\sigma_{n}^{2}>0.

This is a standard condition and reasonable to impose in light of the lower bound on the variance derived in Theorem 2.

Theorem 3.

Suppose mnm_{n}\rightarrow\infty and mn=o(n)m_{n}=o(n). Under Assumptions 15,

σn1mn(θ^𝐄[θ^])d𝒩(0,1).\sigma_{n}^{-1}\sqrt{m_{n}}(\hat{\theta}-{\bf E}[\hat{\theta}])\stackrel{{\scriptstyle d}}{{\longrightarrow}}\mathcal{N}(0,1).
Proof.

See § B. ∎

The result centers θ^\hat{\theta} at its expectation, not the estimand θn\theta_{n}. However, designs discussed in § 3.2 result in small bias, meaning |𝐄[θ^]θn|=o(mn1/2)\lvert{\bf E}[\hat{\theta}]-\theta_{n}\rvert=o(m_{n}^{-1/2}), so we can replace 𝐄[θ^]{\bf E}[\hat{\theta}] with θn\theta_{n} on the left-hand side. Also note that the assumption mn=o(n)m_{n}=o(n) implies that cluster sizes grow with nn. If instead mnm_{n} were of order nn, then rn=O(1)r_{n}=O(1), so by Theorem 2, we would additionally need to assume that there exists a KK-neighborhood exposure mapping in the sense of (1) in order to guarantee that the bias vanishes at all. In this case, it is straightforward to establish a normal approximation using existing results.

4.1 Proof Sketch

To our knowledge, there is no off-the-shelf central limit theorem that we can apply to θ^\hat{\theta}. Under Assumption 3, the outcomes appear to be near-epoch dependent on the input process {Di}i𝒩n\{D_{i}\}_{i\in\mathcal{N}_{n}}, but the treatments are cluster-dependent with growing cluster sizes, rather than α\alpha-mixing, as required by [27]. To prove a central limit theorem, they split the average into two parts: its expectation conditional on the dependent input process {Di}i𝒩n\{D_{i}\}_{i\in\mathcal{N}_{n}}, and a remainder that they show is small. Rather than conditioning on all treatments, we find that the following unit-specific conditioning event is more useful for proving our result.

Let 𝒞i\mathcal{C}_{i} be the cluster containing unit ii, and i={Dj:j𝒞i𝒩(i,κn)}\mathcal{F}_{i}=\{D_{j}\colon j\in\mathcal{C}_{i}\cup\mathcal{N}(i,\kappa_{n})\}. Rewrite the estimator as

θ^=1ni𝒩nZi=1ni𝒩n𝐄[Zii]+1ni𝒩n(Zi𝐄[Zii]).\hat{\theta}=\frac{1}{n}\sum_{i\in\mathcal{N}_{n}}Z_{i}=\frac{1}{n}\sum_{i\in\mathcal{N}_{n}}{\bf E}[Z_{i}\mid\mathcal{F}_{i}]+\frac{1}{n}\sum_{i\in\mathcal{N}_{n}}(Z_{i}-{\bf E}[Z_{i}\mid\mathcal{F}_{i}]). (8)

We first show that the last term is relatively small, op(mn1/2)o_{p}(m_{n}^{-1/2}) to be precise, which means that, on average, ZiZ_{i} is primarily determined by i\mathcal{F}_{i}. The proof of this claim is somewhat complicated [and different from that of 27], but it is similar to the argument showing [B](nmn)1/2[B]\lesssim(nm_{n})^{-1/2} in (5). To then establish a central limit theorem for n1i𝒩n𝐄[Zii]n^{-1}\sum_{i\in\mathcal{N}_{n}}{\bf E}[Z_{i}\mid\mathcal{F}_{i}], we observe that the dependence between “observations” {𝐄[Zii]}i𝒩n\{{\bf E}[Z_{i}\mid\mathcal{F}_{i}]\}_{i\in\mathcal{N}_{n}} is characterized by the following dependency graph 𝑨\bm{A}, which, roughly speaking, links two units only if they are dependent. Recalling the definition of Λi\Lambda_{i} from (4), we connect units i,ji,j in 𝑨\bm{A} if and only if jΛij\in\Lambda_{i} (or equivalently iΛji\in\Lambda_{j}). Then 𝑨\bm{A} is indeed a dependency graph because, under Assumption 4, jΛij\not\in\Lambda_{i} implies that the treatment assignments that determine i\mathcal{F}_{i} are independent of those that determine j\mathcal{F}_{j}. The result follows from a central limit theorem for dependency graphs.

The proof highlights two sources of dependence. The first-order source is the first term on the right-hand side of (8). Dependence in this average is due to cluster randomization, which induces correlation in the treatments determining i\mathcal{F}_{i} across ii. The second-order source is the second term on the right-hand side of (8). Dependence in this average is due to interference, which decays with distance due to Assumption 3. Sävje [42] derives a similar decomposition in a different context with misspecified exposure mappings. The previous arguments show that the second-order source of dependence is small relative to the first-order source because, with large clusters, dependence induced by cluster randomization dominates dependence induced by interference. This is generally untrue with small clusters.

4.2 Variance Estimator

The proof sketch suggests that, to estimate σn2\sigma_{n}^{2}, it suffices to account for dependence induced by cluster randomization. Define Aij=𝟏{jΛi}A_{ij}=\bm{1}\{j\in\Lambda_{i}\}, where Λi\Lambda_{i} is defined in (4), and note that Aii=1A_{ii}=1 and Aij=AjiA_{ij}=A_{ji}. Let Z¯=n1i𝒩nZi\bar{Z}=n^{-1}\sum_{i\in\mathcal{N}_{n}}Z_{i}, which is equivalent to θ^\hat{\theta}. Our proposed variance estimator is

σ^2=mnn2i𝒩nj𝒩n(ZiZ¯)(ZjZ¯)Aij.\hat{\sigma}^{2}=\frac{m_{n}}{n^{2}}\sum_{i\in\mathcal{N}_{n}}\sum_{j\in\mathcal{N}_{n}}(Z_{i}-\bar{Z})(Z_{j}-\bar{Z})A_{ij}. (9)
Theorem 4.

Suppose mnm_{n}\rightarrow\infty and mn=o(n)m_{n}=o(n). Under Assumptions 15, (σ^2n)/σn2p1(\hat{\sigma}^{2}-\mathcal{R}_{n})/\sigma_{n}^{2}\stackrel{{\scriptstyle p}}{{\longrightarrow}}1, where

n=mnnσ~^2andσ~^2=1ni𝒩nj𝒩n(𝐄[Zi]𝐄[Z¯])(𝐄[Zj]𝐄[Z¯])Aij.\mathcal{R}_{n}=\frac{m_{n}}{n}\hat{\tilde{\sigma}}^{2}\quad\text{and}\quad\hat{\tilde{\sigma}}^{2}=\frac{1}{n}\sum_{i\in\mathcal{N}_{n}}\sum_{j\in\mathcal{N}_{n}}({\bf E}[Z_{i}]-{\bf E}[\bar{Z}])({\bf E}[Z_{j}]-{\bf E}[\bar{Z}])A_{ij}.
Proof.

See supplemental material [33]. ∎

The bias term n\mathcal{R}_{n} is typically nonzero due to the unit-level heterogeneity. That is, |𝐄[Zi]𝐄[Z¯]|\lvert{\bf E}[Z_{i}]-{\bf E}[\bar{Z}]\rvert does not approach zero asymptotically, except in the special case of homogeneous treatment effects where Yi(𝟏n)Yi(𝟎n)Y_{i}(\bm{1}_{n})-Y_{i}(\bm{0}_{n}) does not vary across ii. In the no-interference setting, it is well-known that the variance of the difference-in-means estimator is biased for the same reason and that consistent estimation of the variance is impossible. However, due to the term mn/n=o(1)m_{n}/n=o(1) in n\mathcal{R}_{n}, we will argue that typically n=op(1)\mathcal{R}_{n}=o_{p}(1), meaning that σ^2\hat{\sigma}^{2} is asymptotically exact.

Let us first compare n\mathcal{R}_{n} to its formulation under no interference. In this case, Yi(𝑫)=Yi(Di)Y_{i}(\bm{D})=Y_{i}(D_{i}), and we replace T1iT_{1i} with DiD_{i} and T0iT_{0i} with 1Di1-D_{i} to estimate the usual average treatment effect. Furthermore, we set Aij=0A_{ij}=0 for all iji\neq j because units are independent and set mn=nm_{n}=n since there is no longer a need to cluster units. With these changes, Zi=(Di/p(1Di)/(1p))YiZ_{i}=(D_{i}/p-(1-D_{i})/(1-p))Y_{i}, and

n=1ni𝒩n(τiτ¯)2\mathcal{R}_{n}=\frac{1}{n}\sum_{i\in\mathcal{N}_{n}}(\tau_{i}-\bar{\tau})^{2} (10)

for τi=Yi(1)Yi(0)\tau_{i}=Y_{i}(1)-Y_{i}(0) and τ¯=n1i𝒩nτi\bar{\tau}=n^{-1}\sum_{i\in\mathcal{N}_{n}}\tau_{i}. This is the well-known expression for the bias in the absence of interference [e.g. 22, Theorem 6.2].

In our setting, we have additional “covariance” terms included in n\mathcal{R}_{n} due to the non-zero off-diagonals of the dependency graph AijA_{ij}. These would be problematic if they were negative and larger in magnitude than the main variance terms since that would make σ^2\hat{\sigma}^{2} anti-conservative. We show that this occurs with small probability, and in fact, that n\mathcal{R}_{n} is op(1)o_{p}(1). Observe that mn/n=o(1)m_{n}/n=o(1) and σ~^2\hat{\tilde{\sigma}}^{2} has the form of a HAC (heteroskedasticity and autocorrelation consistent) variance estimator [1, 13]. Hence, under conventional regularity conditions, σ~^2\hat{\tilde{\sigma}}^{2} is consistent for a variance term σ~20\tilde{\sigma}^{2}\geq 0, in which case n\mathcal{R}_{n} is non-negative in large samples, and furthermore, op(1)o_{p}(1). To formalize this intuition, we need to specify conditions on the superpopulation from which potential outcomes are drawn. In § A, we show that, if potential outcomes are α\alpha-mixing, then σ~^2\hat{\tilde{\sigma}}^{2} is asymptotically unbiased for σ~2=Var(n1/2i=1n𝐄[Zi{Yi(𝒅)}𝒅{0,1}n])\tilde{\sigma}^{2}=\text{Var}(n^{-1/2}\sum_{i=1}^{n}{\bf E}[Z_{i}\mid\{Y_{i}(\bm{d})\}_{\bm{d}\in\{0,1\}^{n}}]), and furthermore, Var(σ~^2)=O(n2/mn3)\text{Var}(\hat{\tilde{\sigma}}^{2})=O(n^{2}/m_{n}^{3}). Consequently, Var(n)=O(mn1)\text{Var}(\mathcal{R}_{n})=O(m_{n}^{-1}) due to the mn/nm_{n}/n term in its expression.

Remark 5 (Confidence interval).

As previously discussed, the bias of θ^\hat{\theta} is o(mn1/2)o(m_{n}^{-1/2}) for the near-optimal designs discussed at the end of § 3. Thus, for such designs, the preceding discussion justifies the use of

θ^±1.96σ^mn1/2\hat{\theta}\pm 1.96\cdot\hat{\sigma}m_{n}^{-1/2} (11)

as an asymptotic 95-percent CI for θn\theta_{n}, where σ^2\hat{\sigma}^{2} is defined in (9).
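Combining (9) and (11), the variance estimator and confidence interval reduce to a few lines. The sketch below is ours; it takes the Horvitz-Thompson terms Z_i, the indicators A_ij = 1{j ∈ Λ_i}, and m_n as given inputs.

```python
import numpy as np

def variance_estimate(Z, A, m_n):
    """sigma_hat^2 of (9): (m_n / n^2) * sum_ij (Z_i - Zbar)(Z_j - Zbar) A_ij.
    Can be negative in finite samples since A need not be PSD; see § A."""
    n = len(Z)
    Zc = Z - Z.mean()
    return (m_n / n ** 2) * (Zc @ A @ Zc)

def confidence_interval(Z, A, m_n):
    """Asymptotic 95-percent CI (11): theta_hat +/- 1.96 * sigma_hat * m_n^{-1/2}."""
    theta_hat = Z.mean()
    se = np.sqrt(variance_estimate(Z, A, m_n) / m_n)
    return theta_hat - 1.96 * se, theta_hat + 1.96 * se
```

With A equal to the identity and m_n = n, the estimator collapses to the usual (biased) sample variance of the Z_i, matching the no-interference comparison around (10).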

Remark 6 (Literature).

Leung [31] proves a result similar to Theorem 4 but for a different variance estimator under a different design and variety of interference. Due to the lack of an analogous mn/nm_{n}/n term, in his setting, weak dependence conditions would only ensure npσ~20\mathcal{R}_{n}\stackrel{{\scriptstyle p}}{{\longrightarrow}}\tilde{\sigma}^{2}\geq 0, in which case the estimator would be asymptotically conservative, whereas ours is asymptotically exact. He does not provide a formal result on the limit of n\mathcal{R}_{n}.

5 Monte Carlo

We next present results from a simulation study illustrating the quality of the normal approximation in Theorem 3 and coverage of the CI (11) when constructing clusters using spectral clustering. To generate spatial locations, let {ρ~i}i𝒩n\{\tilde{\rho}_{i}\}_{i\in\mathcal{N}_{n}} be i.i.d. draws from 𝒰([1,1]2)\mathcal{U}([-1,1]^{2}). Unit locations in 2\mathbb{R}^{2} are given by {ρi}i𝒩n\{\rho_{i}\}_{i\in\mathcal{N}_{n}} for ρi=Rnρ~i\rho_{i}=R_{n}\tilde{\rho}_{i} with Rn=nR_{n}=\sqrt{n}. We let ρ(i,j)=ρiρj\rho(i,j)=\lVert\rho_{i}-\rho_{j}\rVert where \lVert\cdot\rVert is the Euclidean norm.

We set the number of clusters according to (7), rounded to the nearest integer, which corresponds to the near-optimal design under the worst-case decay discussed in § 3.2. To construct clusters, we apply spectral clustering to {ρi}i𝒩n\{\rho_{i}\}_{i\in\mathcal{N}_{n}} with the standard Gaussian affinity matrix whose ijijth entry is exp{ρ(i,j)2}\text{exp}\{-\rho(i,j)^{2}\}. Clusters are randomized into treatment with probability p=0.5p=0.5. Figure 3 displays the clusters and treatment assignments for a typical simulation draw.

Figure 3: The left figure colors units by cluster memberships obtained from spectral clustering (some colors are reused for different clusters). The right figure colors units by treatment assignment.

We generate outcomes from three different models. Let {εi}i𝒩niid𝒩(0,1)\{\varepsilon_{i}\}_{i\in\mathcal{N}_{n}}\stackrel{{\scriptstyle iid}}{{\sim}}\mathcal{N}(0,1) be drawn independently of the other primitives. The first model is Cliff-Ord:

Yi=α+λj𝒩nWijYj+δj𝒩nWijDj+Diβ+εiY_{i}=\alpha+\lambda\sum_{j\in\mathcal{N}_{n}}W_{ij}Y_{j}+\delta\sum_{j\in\mathcal{N}_{n}}W_{ij}D_{j}+D_{i}\beta+\varepsilon_{i}

with (α,λ,δ,β)=(1,0.8,1,1)(\alpha,\lambda,\delta,\beta)=(-1,0.8,1,1) and spatial weight matrix given by the row-normalized adjacency matrix (6). As discussed in § 3, this model features exponentially decaying ψ(s)\psi(s), in fact of order λs\lambda^{s} [31, Proposition 1].
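To simulate from this model, one solves the simultaneous system, Y = (I − λW)^{-1}(α + δWD + βD + ε). The sketch below is ours, not the paper's code; reading (6) literally, the j = i term (ρ(i,i) = 0 ≤ 1) is included, which makes W row-stochastic, and we flag that convention as an assumption.

```python
import numpy as np

def cliff_ord_outcomes(pos, D, alpha=-1.0, lam=0.8, delta=1.0, beta=1.0, seed=0):
    """Simulate Y = (I - lam*W)^{-1}(alpha + delta*W@D + beta*D + eps),
    with W the row-normalized indicator of rho(i, j) <= 1 as in (6).
    Assumption (ours): the j = i term is included, so rows sum to one."""
    rng = np.random.default_rng(seed)
    n = len(pos)
    dist = np.linalg.norm(pos[:, None, :] - pos[None, :, :], axis=2)
    W = (dist <= 1.0).astype(float)
    W /= W.sum(axis=1, keepdims=True)  # row-normalize
    eps = rng.standard_normal(n)
    rhs = alpha + delta * (W @ D) + beta * D + eps
    return np.linalg.solve(np.eye(n) - lam * W, rhs)
```

Under this row-stochastic convention, (I − λW)^{-1} applied to the all-ones vector gives 1/(1 − λ) times that vector, so each unit's global effect Y_i(1_n) − Y_i(0_n) equals exactly (δ + β)/(1 − λ); this provides a quick sanity check on the solver.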

We construct the second and third models to explore how our methods break down when Assumption 3 is violated or close to violated. For this purpose, we use the “moving average” model (2) with (α,β)=(1,1)(\alpha,\beta)=(-1,1) and Vij=ρ(i,j)ηV_{ij}=\rho(i,j)^{-\eta} for η=4,5\eta=4,5 for the two respective models, so that ψ(s)\psi(s) decays at a polynomial rate. Notably, the choice of η=4\eta=4 implies that the rate of decay is slow enough that Assumption 3 can fail to hold. This is because

(3)=s=1j𝒩n𝟏{ρ(i,j)[s1,s)}ρ(i,j)γ4cs=1sγ3\eqref{coani}=\sum_{s=1}^{\infty}\sum_{j\in\mathcal{N}_{n}}\bm{1}\{\rho(i,j)\in[s-1,s)\}\rho(i,j)^{\gamma-4}\leq c\sum_{s=1}^{\infty}s^{\gamma-3}

for some c>0c>0 by Lemma A.1(iii) of [25]. The right-hand side does not converge for some γ>2\gamma>2, as required by Proposition 1. On the other hand, the choice of η=5\eta=5 is large enough for Assumption 3 to be satisfied since we now replace the 3 on the right-hand side of the previous display with 4. However, in smaller samples, η=4\eta=4 or 5 may not be substantially different, so our methods may still break down from the assumption being “close to” violated.

Table 1 displays the results of 5000 simulation draws. Row “Bias(θ^)(\hat{\theta})” displays |𝐄[θ^θn]|\lvert{\bf E}[\hat{\theta}-\theta_{n}]\rvert, estimated by taking the average over the draws, while “Var(θ^)(\hat{\theta})” is the variance of θ^\hat{\theta} across the draws. The next rows display the coverage of three different confidence intervals. “Our CI” corresponds to the empirical coverage of (11). “Naive CI” corresponds to (11) but replaces σ^mn1/2\hat{\sigma}m_{n}^{-1/2} with the i.i.d. standard error, so the extent to which its coverage deviates from 95 percent illustrates the degree of spatial dependence. “Oracle CI” corresponds to (11) but replaces σ^mn1/2\hat{\sigma}m_{n}^{-1/2} with the “oracle” SE, which is the standard deviation of θ^\hat{\theta} across the draws. Note that the squared oracle SE approximates (σn2+n)/mn(\sigma_{n}^{2}+\mathcal{R}_{n})/m_{n} because the variance is taken over the randomness of the design as well as of the potential outcomes. Lastly, “SE” displays our standard error σ^mn1/2\hat{\sigma}m_{n}^{-1/2}.

Table 1: Simulation Results
Moving Avg, η=4\eta=4 Moving Avg, η=5\eta=5 Cliff-Ord
nn 250 500 1000 250 500 1000 250 500 1000
mnm_{n} 40 63 100 40 63 100 40 63 100
Our CI 0.909 0.925 0.924 0.922 0.937 0.940 0.979 0.983 0.982
Naive CI 0.530 0.489 0.447 0.575 0.539 0.494 0.918 0.913 0.916
Oracle CI 0.943 0.943 0.936 0.950 0.952 0.949 0.983 0.982 0.982
Bias(θ^)(\hat{\theta}) 0.108 0.093 0.083 0.033 0.027 0.024 0.009 0.004 0.003
Var(θ^)(\hat{\theta}) 0.143 0.087 0.054 0.108 0.065 0.039 0.341 0.177 0.087
SE 0.364 0.292 0.232 0.317 0.252 0.199 0.601 0.432 0.309
θ^\hat{\theta} 1.432 1.467 1.492 1.289 1.307 1.319 5.804 5.822 5.851
  • 5k simulations. The “CI” rows show the empirical coverage of 95% CIs. “Naive” and “Oracle” respectively correspond to i.i.d. and true standard errors.

There are at most 100 clusters in all designs, and the rate of convergence is quite slow at n1/3n^{-1/3} for our choice of mnm_{n}. Nonetheless, across all designs, the coverage of the oracle CI is close to 95 percent or above, which illustrates the quality of the normal approximation. For the Cliff-Ord model, our CI attains at least 95 percent coverage even for small sample sizes, despite mnm_{n} being chosen for the worst-case decay, which is suboptimal for this model’s exponentially decaying interference. For the moving average model with η=5\eta=5, we see some under-coverage in smaller samples due to the larger bias, which is unsurprising from the above discussion, but coverage is close to the nominal level for larger nn. The results for η=4\eta=4 are, as expected, worse, since that model is deliberately constructed to violate our main assumption. Once again, our CI exhibits under-coverage due to the larger bias, but coverage improves and bias decreases as nn grows.

6 Conclusion

This paper studies the design of cluster-randomized experiments targeting the global average treatment effect under spatial interference. Each design is characterized by a parameter mnm_{n} that determines the number and sizes of clusters. We propose a Horvitz-Thompson estimator that compares units with different neighborhood exposures to treatment, where the neighborhood radius is of the same order as clusters’ sizes given by the design. We asymptotically bound the estimator’s bias and variance as a function of mnm_{n} and the degree of interference and derive rate-optimal choices of mnm_{n}. Our lower asymptotic bound shows that designs using small clusters (those with asymptotically bounded sizes) generally result in a non-negligible asymptotic bias. On the other hand, constructing large clusters reduces the total number of clusters, resulting in a bias-variance trade-off that we seek to optimize in terms of rates through the choice of design.

In the worst case where the degree of interference is substantial, the estimator has an n1/3n^{-1/3}-rate of convergence under a nearly rate-optimal design, whereas in the best case where interference is characterized by a KK-neighborhood exposure mapping, the rate is n1/2n^{-1/2} under a rate-optimal design. We derive the asymptotic distribution of the estimator and provide an estimate of the variance.

Important areas for future research include data-driven choices of mnm_{n} and κn\kappa_{n} and methods to reduce the bias of the estimator. However, a rigorous theory appears to require more substantive restrictions on interference than what we impose.

Our results focus on the canonical case of spatial data in 2\mathbb{R}^{2}. We conjecture that they can be extended to d\mathbb{R}^{d} for d>2d>2 because our proofs fundamentally rely on the following key property of Euclidean space, which is true for any dimension: it is always possible to construct many clusters with low conductance, or boundary-to-volume ratio, for example by partitioning space into hypercubes or by spectral clustering [32]. This appears in our proofs through the use of Lemma A.1 of [25], which, together with Assumption 3, is crucial to establish that spatially distant units have small covariance, despite dependence induced by cluster randomization and interference. In this sense, the technical idea behind this paper is to exploit a useful property of Euclidean space – the existence of many low-conductance clusters – to show that cluster-randomized designs may be fruitfully applied to the problem of spatial interference.

The story for network interference appears to be different. Existing cluster-randomized designs have theoretical guarantees under exposure mapping assumptions, but it is an open question whether such designs work under weaker restrictions on interference such as Assumption 3. In order to directly apply our idea in the previous paragraph, the network must possess many low-conductance clusters across which we can randomize. Unfortunately, this is a strong requirement in practice because, as discussed in [32], not only do some networks not possess multiple low-conductance clusters, but, of those that do, some apparently possess only a small number of such clusters. Because network “space” differs from Euclidean space in this fundamental aspect, under network interference, clusters can be strongly dependent in the absence of exposure mapping assumptions.

Appendix A Bias of the Variance Estimator

Characterizing the asymptotic behavior of n\mathcal{R}_{n} requires conditions on the superpopulation from which units are drawn. In this section, we assume potential outcomes are random and constitute a weakly dependent spatial process (independent of 𝑫\bm{D}). Accordingly, we rewrite 𝐄[Zi]{\bf E}[Z_{i}] in Theorem 4 as

Z~i𝐄[Zi{Yi(𝒅)}𝒅{0,1}n].\tilde{Z}_{i}\equiv{\bf E}[Z_{i}\mid\{Y_{i}(\bm{d})\}_{\bm{d}\in\{0,1\}^{n}}].

We require the spatial process {Z~i}i𝒩n\{\tilde{Z}_{i}\}_{i\in\mathcal{N}_{n}} to be α\alpha-mixing, which is a standard concept of weak spatial dependence. The results we use in fact apply to the weaker concept of near-epoch dependence, but we focus on mixing since it requires less exposition.

Definition A.1.

Let (Ω,,𝐏)(\Omega,\mathcal{F},{\bf P}) be the probability space, 𝒜,\mathcal{A},\mathcal{B} be sub-σ\sigma-algebras of \mathcal{F}, and

α(𝒜,)=sup{|𝐏(AB)𝐏(A)𝐏(B)|;A𝒜,B}.\alpha(\mathcal{A},\mathcal{B})=\sup\{\lvert{\bf P}(A\cap B)-{\bf P}(A){\bf P}(B)\rvert;A\in\mathcal{A},B\in\mathcal{B}\}.

For U,V𝒩nU,V\subseteq\mathcal{N}_{n}, let αn(U,V)=α(σn(U),σn(V))\alpha_{n}(U,V)=\alpha(\sigma_{n}(U),\sigma_{n}(V)), where σn(U)\sigma_{n}(U) is the σ\sigma-algebra generated by {Z~i}iU\{\tilde{Z}_{i}\}_{i\in U}. The α\alpha-mixing coefficient of {Z~i}i𝒩n\{\tilde{Z}_{i}\}_{i\in\mathcal{N}_{n}} is

α¯(u,v,r)=supnsupU,V{αn(U,V);|U|u,|V|v,ρ(U,V)r}\bar{\alpha}(u,v,r)=\sup_{n}\sup_{U,V}\{\alpha_{n}(U,V);\lvert U\rvert\leq u,\lvert V\rvert\leq v,\rho(U,V)\geq r\}

for u,vu,v\in\mathbb{N}, r+r\in\mathbb{R}_{+}, and ρ(U,V)=min{ρ(i,j):iU,jV}\rho(U,V)=\min\{\rho(i,j)\colon i\in U,j\in V\}.

That is, for any two sets of units U,V𝒩nU,V\subseteq\mathcal{N}_{n} with respective sizes u,vu,v such that the minimum distance between U,VU,V is at least rr, α¯(u,v,r)\bar{\alpha}(u,v,r) bounds their dependence with respect to observations {Z~i}i𝒩n\{\tilde{Z}_{i}\}_{i\in\mathcal{N}_{n}}.

Example A.1.

Suppose that, for any nn and i𝒩ni\in\mathcal{N}_{n}, there exists a function f()f(\cdot) such that Yi(𝒅)=f(εi,𝒅)Y_{i}(\bm{d})=f(\varepsilon_{i},\bm{d}). If the unobserved heterogeneity {εi}i𝒩n\{\varepsilon_{i}\}_{i\in\mathcal{N}_{n}} is α\alpha-mixing and Z~i\tilde{Z}_{i} is a Borel-measurable function of εi\varepsilon_{i} (a mild requirement since treatments are independent of potential outcomes), then {Z~i}i𝒩n\{\tilde{Z}_{i}\}_{i\in\mathcal{N}_{n}} is α\alpha-mixing.

Example A.2.

Generalizing the previous example, suppose Yi(𝒅)=f(di,di,εi,εi)Y_{i}(\bm{d})=f(d_{i},d_{-i},\varepsilon_{i},\varepsilon_{-i}), where 𝒅=(di,di)\bm{d}=(d_{i},d_{-i}) and εi\varepsilon_{-i} is similarly defined. Under some conditions on f()f(\cdot), one can ensure that {Z~i}i𝒩n\{\tilde{Z}_{i}\}_{i\in\mathcal{N}_{n}} is near-epoch dependent on the input {εi}i𝒩n\{\varepsilon_{i}\}_{i\in\mathcal{N}_{n}} [e.g. 27, Proposition 1]. However, we only focus on mixing.

We next discuss the intuition behind our main result. Let Z~¯=n1i=1nZ~i\bar{\tilde{Z}}=n^{-1}\sum_{i=1}^{n}\tilde{Z}_{i}, so that

n=mnnσ~^2,whereσ~^2=1ni𝒩nj𝒩n(Z~i𝐄[Z~¯])(Z~j𝐄[Z~¯])Aij.\mathcal{R}_{n}=\frac{m_{n}}{n}\hat{\tilde{\sigma}}^{2},\quad\text{where}\quad\hat{\tilde{\sigma}}^{2}=\frac{1}{n}\sum_{i\in\mathcal{N}_{n}}\sum_{j\in\mathcal{N}_{n}}(\tilde{Z}_{i}-{\bf E}[\bar{\tilde{Z}}])(\tilde{Z}_{j}-{\bf E}[\bar{\tilde{Z}}])A_{ij}.

Observe that mn/n=o(1)m_{n}/n=o(1), and σ~^2\hat{\tilde{\sigma}}^{2} is essentially a HAC variance estimator with “kernel” AijA_{ij}. More precisely, AijA_{ij} is sandwiched between two uniform kernels:

𝟏{ρ(i,j)κn}Aij𝟏{ρ(i,j)2rn+κn}.\bm{1}\{\rho(i,j)\leq\kappa_{n}\}\leq A_{ij}\leq\bm{1}\{\rho(i,j)\leq 2r_{n}+\kappa_{n}\}. (A.1)

This is a consequence of the construction of clusters. The lower bound holds because Λi\Lambda_{i} must include ii’s κn\kappa_{n}-neighborhood. The upper bound is achieved if ii is located at a corner shared by four clusters, and each such cluster intersects some κn\kappa_{n}-neighborhood that is maximally distant from the cluster. Notably, the bandwidths of the two kernels are of the same asymptotic order (recalling that κn\kappa_{n} has the same order as rnr_{n}). Hence, σ~^2\hat{\tilde{\sigma}}^{2} should behave as a HAC variance estimator. This has two implications.

  1. σ~^2\hat{\tilde{\sigma}}^{2} should be consistent for a non-negative variance term under standard regularity conditions, so n=op(1)\mathcal{R}_{n}=o_{p}(1).

  2. If, in the formula for σ^2\hat{\sigma}^{2}, we replace AijA_{ij} with the uniform kernel in the upper bound (A.1), then we would have a positive semidefinite HAC estimator, implying n0\mathcal{R}_{n}\geq 0 a.s. [1, 13, §3.3.1]. Then in the finite-population setting of our main results, σ^2\hat{\sigma}^{2} would be at worst asymptotically conservative.

     We choose not to use a spatial HAC for σ^2\hat{\sigma}^{2} because such estimators have a reputation for being anti-conservative in smaller samples [see references in e.g. 32]. In our estimator, AijA_{ij} functions as a heterogeneous bandwidth determined by the design, allowing different units to have different neighborhood radii in the variance estimator, whereas HAC kernels impose homogeneous radii determined by the bandwidth. The hope is that heterogeneous radii translate to better finite-sample properties since they directly capture the first-order dependency neighborhood.
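As a numerical illustration of the sandwich (A.1) and of AijA_{ij} acting as a heterogeneous bandwidth, the following sketch places units on an integer lattice partitioned into square clusters. All lattice, cluster, and radius choices here are hypothetical and chosen only for illustration; the sample mean stands in for 𝐄[Z~¯]{\bf E}[\bar{\tilde{Z}}], and the sup metric stands in for ρ\rho.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical design: units on an L x L integer lattice, partitioned into
# square clusters of side 2*rn; kappa_n is taken equal to rn for simplicity.
L, rn, kn = 24, 3, 3
pts = np.array([(x, y) for x in range(L) for y in range(L)], dtype=float)
n = len(pts)
side = 2 * rn
cluster = (pts[:, 0] // side) * (L // side) + (pts[:, 1] // side)

# sup-metric distances; any metric of the same order works for the sketch
rho = np.abs(pts[:, None, :] - pts[None, :, :]).max(axis=2)

# Lambda_i = union of clusters intersecting i's kappa_n-neighborhood,
# and A_ij = 1{j in Lambda_i}
near = rho <= kn
A = np.zeros((n, n), dtype=bool)
for i in range(n):
    touched = np.unique(cluster[near[i]])  # clusters meeting the neighborhood
    A[i] = np.isin(cluster, touched)

# the sandwich (A.1): 1{rho <= kn} <= A_ij <= 1{rho <= 2*rn + kn}
assert np.all(near <= A) and np.all(A <= (rho <= 2 * rn + kn))

# sigma-tilde-hat^2 with A_ij as heterogeneous bandwidth, next to the
# uniform-kernel analogue at the upper-bound bandwidth 2*rn + kn
Z = rng.normal(size=n)
Zc = Z - Z.mean()  # sample mean in place of E[Z-bar]
sigma2_het = Zc @ A.astype(float) @ Zc / n
sigma2_hac = Zc @ (rho <= 2 * rn + kn).astype(float) @ Zc / n
print(sigma2_het, sigma2_hac)
```

Both quantities are random; the sketch only verifies the kernel-sandwich relationship, not any ordering of the two realized estimates.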

We next state regularity conditions taken from [24], which we use to apply her Theorem 4 on consistency of HAC estimators.

Assumption A.1.

The mixing coefficient satisfies α¯(u,v,r)(u+v)ςα^(r)\bar{\alpha}(u,v,r)\leq(u+v)^{\varsigma}\hat{\alpha}(r) for ς0\varsigma\geq 0 and r=1r2(ς+1)1α^(r)<\sum_{r=1}^{\infty}r^{2(\varsigma+1)-1}\hat{\alpha}(r)<\infty.

This is Assumption 2 of [24]. The substance of the condition is the requirement that the mixing coefficient decays at a sufficiently fast rate with distance rr. For ς>0\varsigma>0, the rate requirement is stronger than what we require of ψ()\psi(\cdot) in Assumption 3.

Assumption A.2.

(a) 𝐄[Z~i]=𝐄[Z~j]{\bf E}[\tilde{Z}_{i}]={\bf E}[\tilde{Z}_{j}] for all i,ji,j. (b) Var(n1/2i=1nZ~i)σ~2<\text{Var}(n^{-1/2}\sum_{i=1}^{n}\tilde{Z}_{i})\rightarrow\tilde{\sigma}^{2}<\infty.

This is Assumption 7(a) of [24]. Part (a) is a standard mean-homogeneity condition required for the HAC estimator to be asymptotically unbiased. Such a requirement is untenable in the finite-population setting of Theorem 4 because Z~i\tilde{Z}_{i} is a function of ii’s potential outcomes, which are generally heterogeneous across units. This heterogeneity is responsible for the appearance of the bias n\mathcal{R}_{n}. However, in the superpopulation setting of this section, the assumption is much more tenable since we integrate over the randomness of the potential outcomes. The condition then requires that the mean be invariant to unit labels. The finiteness requirement in part (b) can be proven as a consequence of Assumption A.1 and moment conditions, so the only substance of the assumption is the existence of a limit.

Theorem A.1.

Under Assumptions 1, 2, A.1, and A.2, if mnm_{n}\rightarrow\infty, then (a) 𝐄[σ~^2]σ~2{\bf E}[\hat{\tilde{\sigma}}^{2}]\rightarrow\tilde{\sigma}^{2}, and (b) Var(σ~^2)=O(n2/mn3)\text{Var}(\hat{\tilde{\sigma}}^{2})=O(n^{2}/m_{n}^{3}).

We next discuss the implications of the theorem for n\mathcal{R}_{n} (also see the discussion in § 4.2) and then conclude with the proof. The designs discussed at the end of § 3, other than under the worst-case decay, choose mnm_{n} to be of substantially higher order than n2/3n^{2/3}. In this case, Theorem A.1 yields

n=mnno(1)σ~^2pσ~20p0.\mathcal{R}_{n}=\underbrace{\frac{m_{n}}{n}}_{o(1)}\underbrace{\hat{\tilde{\sigma}}^{2}}_{\stackrel{{\scriptstyle p}}{{\longrightarrow}}\tilde{\sigma}^{2}\geq 0}\stackrel{{\scriptstyle p}}{{\longrightarrow}}0.

For the worst-case decay, mn=n2/3m_{n}=n^{2/3}, in which case σ~^2\hat{\tilde{\sigma}}^{2} remains asymptotically unbiased for σ~20\tilde{\sigma}^{2}\geq 0, and we still have Var(n)=O(mn1)\text{Var}(\mathcal{R}_{n})=O(m_{n}^{-1}), so again n=op(1)\mathcal{R}_{n}=o_{p}(1).
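The order claims in the preceding two paragraphs reduce to power-counting; a symbolic sketch (sympy, with m standing in for mnm_{n} and all constants dropped):

```python
import sympy as sp

n, m = sp.symbols("n m", positive=True)  # m plays the role of m_n

var_sigma2 = n**2 / m**3             # Theorem A.1(b): Var of the HAC-type estimator
var_R = (m / n) ** 2 * var_sigma2    # R_n = (m_n / n) * sigma-tilde-hat^2

# Var(R_n) = O(1/m_n), so R_n = o_p(1) whenever m_n -> infinity
assert sp.simplify(var_R - 1 / m) == 0

# worst-case design m_n = n^(2/3): Var(R_n) = n^(-2/3) -> 0
worst = var_R.subs(m, n ** sp.Rational(2, 3))
assert sp.simplify(worst - n ** sp.Rational(-2, 3)) == 0
```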

Proof of Theorem A.1.

We apply Theorem 4 of [24]. Note that our setting is a simple mean-estimation problem, which is a special case of her semiparametric model Y1in=h(Y2in,θ0)+g(Xin)+UinY_{1in}=h(Y_{2in},\theta_{0})+g(X_{in})+U_{in} with h(Y2in,θ0)+g(Xin)=0h(Y_{2in},\theta_{0})+g(X_{in})=0 and UinU_{in} equal to our Z~i𝐄[Z~¯]=Z~i𝐄[Z~i]\tilde{Z}_{i}-{\bf E}[\bar{\tilde{Z}}]=\tilde{Z}_{i}-{\bf E}[\tilde{Z}_{i}] (by Assumption A.2(a)). The moments mεn(Win,θ^,τ^)m_{\varepsilon_{n}}(W_{in},\hat{\theta},\hat{\tau}) in her formula (13) for the HAC estimator correspond to our Z~i𝐄[Z~i]\tilde{Z}_{i}-{\bf E}[\tilde{Z}_{i}]. Other than Assumption 10, her assumptions are either satisfied (increasing domain corresponds to our Assumption 2, and the moment conditions hold by our Assumption 1 and Remark 3) or are irrelevant in our setting.

Assumption 10 concerns the properties of the kernel function. For context, note that if, hypothetically, we replaced AijA_{ij} in σ~^2\hat{\tilde{\sigma}}^{2} with its upper bound in (A.1), then the kernel K()K(\cdot) and bandwidth βn\beta_{n} in [24] would correspond in our setting to the uniform kernel 𝟏{1}\bm{1}\{\cdot\leq 1\} and 2rn+κn2r_{n}+\kappa_{n}, respectively, so that K(x/βn)K(x/\beta_{n}) in Jenish’s notation would correspond to our 𝟏{x2rn+κn}\bm{1}\{x\leq 2r_{n}+\kappa_{n}\}.

Now, because AijA_{ij} is only bounded by kernel functions but cannot be written as one, we cannot directly verify Assumption 10. However, inspection of the proof reveals that the assumption is used as follows. First, to derive bounds on the variance of the HAC estimator (Step 1 of the proof), uniform boundedness of the kernel is used, but this is trivially satisfied by Aij{0,1}A_{ij}\in\{0,1\}. Second, to derive bounds on the bias (Step 2 of the proof), Assumption 10 is used to show that the term ar,n=maxrxr+1|K(x/βn)1|0a_{r,n}=\max_{r\leq x\leq r+1}\lvert K(x/\beta_{n})-1\rvert\rightarrow 0 as nn\rightarrow\infty for any r>0r>0. In our case, ar,na_{r,n} corresponds to maxi,j𝒩n:ρ(i,j)[r,r+1]|Aij1|\max_{i,j\in\mathcal{N}_{n}\colon\rho(i,j)\in[r,r+1]}\lvert A_{ij}-1\rvert. But this has the desired property; it is in fact exactly zero for nn sufficiently large due to (A.1).

Hence, the conclusions of the proof of Jenish’s Theorem 4 apply to σ~^2\hat{\tilde{\sigma}}^{2}, which we now apply to prove our claims. Part (a) of our theorem follows from Step 2 of her proof. Next, in Step 1 of her proof, H2n,H3n,H4n=0H_{2n},H_{3n},H_{4n}=0 in our setting because the data is α\alpha-mixing rather than near-epoch dependent. Accordingly, the variance bound on H1nH_{1n} in that step implies that the variance of the HAC estimator is O(n1βn3d)O(n^{-1}\beta_{n}^{3d}) where βn\beta_{n} is the bandwidth and dd is the dimension of the space. In our case, by (A.1), the bandwidth corresponds to βn=2rn+κn=O(n/mn)\beta_{n}=2r_{n}+\kappa_{n}=O(\sqrt{n/m_{n}}), so n1βn3d=O(n2/mn3)n^{-1}\beta_{n}^{3d}=O(n^{2}/m_{n}^{3}), and part (b) of our theorem follows. ∎
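The final order computation can be checked the same way (a sketch with constants dropped, taking βn=(n/mn)1/2\beta_{n}=(n/m_{n})^{1/2} exactly and d=2d=2):

```python
import sympy as sp

n, m = sp.symbols("n m", positive=True)
beta = sp.sqrt(n / m)  # bandwidth 2*r_n + kappa_n = O(sqrt(n / m_n))
d = 2                  # spatial dimension

# variance bound O(n^{-1} * beta^{3d}) equals O(n^2 / m^3), part (b) of Theorem A.1
assert sp.simplify(beta ** (3 * d) / n - n**2 / m**3) == 0
```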

Appendix B Proofs

The proofs use the following definitions. Let 𝒄(Ck)2\bm{c}(C_{k})\in\mathbb{R}^{2} be the centroid of cluster CkC_{k}. For srn1s\leq r_{n}-1, define

J(s,Ck)={jCk:ρ(𝒄(Ck),j)[rns1,rns)}.J(s,C_{k})=\big{\{}j\in C_{k}\colon\rho(\bm{c}(C_{k}),j)\in[r_{n}-s-1,r_{n}-s)\big{\}}. (B.1)

For s=0s=0, this is the “boundary” of CkC_{k}, and as we increase ss, J(s,Ck)J(s,C_{k}) moves through contour sets within CkC_{k} that lie progressively farther from the boundary. Also, for any two sets S,T2S,T\subset\mathbb{R}^{2}, let ρ(S,T)=min{ρ(i,j):iS,jT}\rho(S,T)=\min\{\rho(i,j)\colon i\in S,j\in T\}.

The proofs make use of the following facts, which are a consequence of Lemma A.1 of [25] and use Assumption 2. Given that CkC_{k} has radius rnr_{n}, |Ck|crn2\lvert C_{k}\rvert\leq c^{\prime}r_{n}^{2} and |J(0,Ck)|crn\lvert J(0,C_{k})\rvert\leq c^{\prime}r_{n} for some universal constant c>0c^{\prime}>0 that does not depend on nn or kk. Also, since J(s,Ck)J(s,C_{k}) is the boundary of a ball of radius rnsr_{n}-s, |J(s,Ck)|c(rns)\lvert J(s,C_{k})\rvert\leq c^{\prime}(r_{n}-s).
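These counting facts can be visualized on the integer lattice; in the sketch below the constant c = 10 is ad hoc, chosen only to exhibit the quadratic and linear growth (it is not the c′c^{\prime} of the lemma):

```python
import numpy as np

rn = 20
xs, ys = np.meshgrid(np.arange(-rn, rn + 1), np.arange(-rn, rn + 1))
dist = np.sqrt(xs**2 + ys**2).ravel()
dist = dist[dist <= rn]  # lattice units within a ball of radius rn

c = 10.0  # ad hoc constant for the sketch
assert len(dist) <= c * rn**2  # |C_k| <= c' * rn^2

# annuli J(s, C_k) as in (B.1) thin out linearly toward the centroid
for s in range(rn - 1):
    J = np.sum((dist >= rn - s - 1) & (dist < rn - s))
    assert J <= c * (rn - s)  # |J(s, C_k)| <= c' * (rn - s)
```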

Lemma B.1.

Recall the definition of [B][B] from (5). Under the assumptions of Theorem 1, [B](nmn)1/2[B]\lesssim(nm_{n})^{-1/2}.

Proof.

Step 1. We first establish covariance bounds. Fix i,ji,j such that jΛij\not\in\Lambda_{i}, the latter defined in (4), and set s=ρ(i,j)s=\rho(i,j). Trivially,

|Cov(Zi,Zj)|=|Cov(Zi,Zj)|𝟏{s4rn}+|Cov(Zi,Zj)|𝟏{s>4rn}.\lvert\text{Cov}(Z_{i},Z_{j})\rvert=\lvert\text{Cov}(Z_{i},Z_{j})\rvert\bm{1}\{s\leq 4r_{n}\}+\lvert\text{Cov}(Z_{i},Z_{j})\rvert\bm{1}\{s>4r_{n}\}.

First consider the case s4rns\leq 4r_{n}. Let 𝒞i\mathcal{C}_{i} be the cluster containing unit ii, i(r)={Dj:ρ(i,j)r or j𝒞i}\mathcal{F}_{i}(r)=\{D_{j}\colon\rho(i,j)\leq r\text{ or }j\in\mathcal{C}_{i}\}, and Xir=𝐄[Zii(r)]X_{i}^{r}={\bf E}[Z_{i}\mid\mathcal{F}_{i}(r)]. As a preliminary result, we bound the discrepancy between ZiZ_{i} and XisX_{i}^{s}.

Let t=ρ({i},J(0,𝒞i))t=\rho(\{i\},J(0,\mathcal{C}_{i})), the distance between ii and the nearest unit in the boundary of 𝒞i\mathcal{C}_{i}. By Assumption 3, for any q>0q>0,

𝐄[|ZiXis|qT1i=1]=p1iq𝐄[|Yi(𝑫)𝐄[Yi(𝑫)i(s)]|qT1i=1]p1iqψ(max{t,s})q.{\bf E}[\lvert Z_{i}-X_{i}^{s}\rvert^{q}\mid T_{1i}=1]=p_{1i}^{-q}{\bf E}[\lvert Y_{i}(\bm{D})-{\bf E}[Y_{i}(\bm{D})\mid\mathcal{F}_{i}(s)]\rvert^{q}\mid T_{1i}=1]\\ \leq p_{1i}^{-q}\psi(\max\{t,s\})^{q}.

The equality holds because i(s)\mathcal{F}_{i}(s) conditions on DiD_{i}, and by Assumption 4, T1i=1T_{1i}=1 implies Di=1D_{i}=1, which implies T0i=0T_{0i}=0. The inequality holds because i(s)\mathcal{F}_{i}(s) fixes {Dj:jQ(i,t)Q(i,s)}\{D_{j}\colon j\in Q(i,t)\cup Q(i,s)\} at their realized values. Similarly, 𝐄[|ZiXis|qT0i=1]p0iqψ(max{t,s})q{\bf E}[\lvert Z_{i}-X_{i}^{s}\rvert^{q}\mid T_{0i}=1]\leq p_{0i}^{-q}\psi(\max\{t,s\})^{q}, so by the law of total probability and Remark 3, for some universal constant c′′>0c^{\prime\prime}>0,

𝐄[|ZiXis|q]1/qc′′ψ(max{t,s}).{\bf E}[\lvert Z_{i}-X_{i}^{s}\rvert^{q}]^{1/q}\leq c^{\prime\prime}\psi(\max\{t,s\}). (B.2)

Define Ri=ZiXiκnR_{i}=Z_{i}-X_{i}^{\kappa_{n}}. Since jΛij\not\in\Lambda_{i}, |Cov(Xiκn,Xjκn)|=0\lvert\text{Cov}(X_{i}^{\kappa_{n}},X_{j}^{\kappa_{n}})\rvert=0 by Assumption 4. Applying the Cauchy-Schwarz and Jensen’s inequalities and (B.2) for q=2q=2,

|Cov(Zi,Zj)|\displaystyle\lvert\text{Cov}(Z_{i},Z_{j})\rvert |Cov(Xiκn,Xjκn)|+|Cov(Xiκn,Rj)|+|Cov(Ri,Xjκn)|+|Cov(Ri,Rj)|\displaystyle\leq\lvert\text{Cov}(X_{i}^{\kappa_{n}},X_{j}^{\kappa_{n}})\rvert+\lvert\text{Cov}(X_{i}^{\kappa_{n}},R_{j})\rvert+\lvert\text{Cov}(R_{i},X_{j}^{\kappa_{n}})\rvert+\lvert\text{Cov}(R_{i},R_{j})\rvert
2c′′(Zi2ψ(max{t,κn})+Zj2ψ(max{t,κn})+ψ(max{t,κn})2)\displaystyle\leq 2c^{\prime\prime}(\lVert Z_{i}\rVert_{2}\psi(\max\{t,\kappa_{n}\})+\lVert Z_{j}\rVert_{2}\psi(\max\{t,\kappa_{n}\})+\psi(\max\{t,\kappa_{n}\})^{2})
cψ(max{t,κn})\displaystyle\leq c\,\psi(\max\{t,\kappa_{n}\})

for some universal c>0c>0.

Next consider the case s>4rns>4r_{n}. Abbreviate Xi=Xis/2rnX_{i}=X_{i}^{s/2-r_{n}}, and redefine Ri=ZiXiR_{i}=Z_{i}-X_{i}. Note that ρ(Q(i,s/2rn),Q(j,s/2rn))>2rn\rho(Q(i,s/2-r_{n}),Q(j,s/2-r_{n}))>2r_{n} and 2rn2r_{n} is the diameter of a cluster, so XiXjX_{i}\perp\!\!\!\perp X_{j}. Consequently, following the previous argument,

|Cov(Zi,Zj)||Cov(Xi,Xj)|+|Cov(Xi,Rj)|+|Cov(Ri,Xj)|+|Cov(Ri,Rj)|cψ(max{t,s/2rn}).\lvert\text{Cov}(Z_{i},Z_{j})\rvert\leq\lvert\text{Cov}(X_{i},X_{j})\rvert+\lvert\text{Cov}(X_{i},R_{j})\rvert+\lvert\text{Cov}(R_{i},X_{j})\rvert+\lvert\text{Cov}(R_{i},R_{j})\rvert\\ \leq c\,\psi(\max\{t,s/2-r_{n}\}). (B.3)

Step 2. For any cc\in\mathbb{R}, let c\lfloor c\rfloor denote cc rounded down to the nearest integer. Using the covariance bounds derived in step 1,

1n2i𝒩njΛi|Cov(Zi,Zj)|cn2k=1mns=12Rnt=0min{s,rn1}iJ(t,Ck)jΛi𝟏{ρ(i,j)[s1,s)}×(ψ(max{t,rn/2})𝟏{s4rn}+ψ(max{t,s/2rn})𝟏{s>4rn})[B1]+[B2],\frac{1}{n^{2}}\sum_{i\in\mathcal{N}_{n}}\sum_{j\not\in\Lambda_{i}}\lvert\text{Cov}(Z_{i},Z_{j})\rvert\leq\frac{c}{n^{2}}\sum_{k=1}^{m_{n}}\sum_{s=1}^{\lfloor 2R_{n}\rfloor}\sum_{t=0}^{\lfloor\min\{s,r_{n}-1\}\rfloor}\sum_{i\in J(t,C_{k})}\sum_{j\not\in\Lambda_{i}}\bm{1}\{\rho(i,j)\in[s-1,s)\}\\ \times\big{(}\psi(\max\{t,r_{n}/2\})\bm{1}\{s\leq 4r_{n}\}+\psi(\max\{t,s/2-r_{n}\})\bm{1}\{s>4r_{n}\}\big{)}\\ \equiv[B1]+[B2], (B.4)

where [B1][B1] takes the part involving s4rns\leq 4r_{n} and [B2][B2] the remainder. Note that tt can be at most rn1r_{n}-1 because J(rn1,Ck)J(r_{n}-1,C_{k}) is the 1-ball centered at the centroid of CkC_{k}, and it can be at most ss because ρ(i,j)[s1,s)\rho(i,j)\in[s-1,s) and jCkj\not\in C_{k} since jΛij\not\in\Lambda_{i}.

As discussed at the start of this section, jΛi𝟏{ρ(i,j)[s1,s)}j𝒩n𝟏{ρ(i,j)[s1,s)}cs\sum_{j\not\in\Lambda_{i}}\bm{1}\{\rho(i,j)\in[s-1,s)\}\leq\sum_{j\in\mathcal{N}_{n}}\bm{1}\{\rho(i,j)\in[s-1,s)\}\leq c^{\prime}\,s and |J(t,Ck)|c(rnt)\lvert J(t,C_{k})\rvert\leq c^{\prime}(r_{n}-t) for some universal c>0c^{\prime}>0 by Assumption 2. Then by Assumption 3,

[B1]\displaystyle[B1] cmnn2s=14rncst=0min{s,rn1}c(rnt)ψ(t)\displaystyle\leq c\,\frac{m_{n}}{n^{2}}\sum_{s=1}^{\lfloor 4r_{n}\rfloor}c^{\prime}s\sum_{t=0}^{\lfloor\min\{s,r_{n}-1\}\rfloor}c^{\prime}(r_{n}-t)\psi(t)
mnn2s=14rns(rnt=0ψ(t)+t=0tψ(t))mnn2rn2rn1nmn.\displaystyle\lesssim\frac{m_{n}}{n^{2}}\sum_{s=1}^{\lfloor 4r_{n}\rfloor}s\left(r_{n}\sum_{t=0}^{\infty}\psi(t)+\sum_{t=0}^{\infty}t\,\psi(t)\right)\lesssim\frac{m_{n}}{n^{2}}r_{n}^{2}r_{n}\lesssim\frac{1}{\sqrt{nm_{n}}}. (B.5)

Finally,

[B2]\displaystyle[B2] cmnn2s=4rn2Rncst=0min{s,rn1}c(rnt)ψ(s/2rn)\displaystyle\leq c\,\frac{m_{n}}{n^{2}}\sum_{s=\lfloor 4r_{n}\rfloor}^{\lfloor 2R_{n}\rfloor}c^{\prime}s\sum_{t=0}^{\lfloor\min\{s,r_{n}-1\}\rfloor}c^{\prime}(r_{n}-t)\psi(s/2-r_{n})
mnn2t=0rn1(rnt)s=4rn2Rnsψ(s/2rn)mnn2rn2s=4rn2Rnsψ(s/2rn)\displaystyle\lesssim\frac{m_{n}}{n^{2}}\sum_{t=0}^{\lfloor r_{n}-1\rfloor}(r_{n}-t)\sum_{s=\lfloor 4r_{n}\rfloor}^{\lfloor 2R_{n}\rfloor}s\,\psi(s/2-r_{n})\lesssim\frac{m_{n}}{n^{2}}r_{n}^{2}\sum_{s=\lfloor 4r_{n}\rfloor}^{\lfloor 2R_{n}\rfloor}s\,\psi(s/2-r_{n})
mnn2rn2rn1nmn.\displaystyle\lesssim\frac{m_{n}}{n^{2}}r_{n}^{2}r_{n}\lesssim\frac{1}{\sqrt{nm_{n}}}. (B.6)

∎
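As a numerical sanity check on (B.5), the sketch below evaluates the bounding sum for the hypothetical choices ψ(t)=2t\psi(t)=2^{-t}, mn=n2/3m_{n}=n^{2/3}, rn=(n/mn)1/2r_{n}=(n/m_{n})^{1/2}, and confirms that its ratio to the claimed rate (nmn)1/2(nm_{n})^{-1/2} stays bounded as nn grows:

```python
import numpy as np

def B1_bound(n, psi=lambda t: 0.5**t):
    """(m_n / n^2) * sum_s s * sum_t (r_n - t) * psi(t), as in (B.5)."""
    mn = round(n ** (2 / 3))
    rn = (n / mn) ** 0.5
    total = 0.0
    for s in range(1, int(4 * rn) + 1):
        t = np.arange(0, min(s, int(rn - 1)) + 1)
        total += s * np.sum((rn - t) * psi(t))
    return mn / n**2 * total

# ratio to 1/sqrt(n * m_n) is roughly constant across a 16-fold range of n
ratios = [B1_bound(n) * (n * round(n ** (2 / 3))) ** 0.5
          for n in (10_000, 40_000, 160_000)]
assert max(ratios) < 5 * min(ratios)
```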

Proof of Theorem 3.

Recall that i={Dj:j𝒞i𝒩(i,κn)}\mathcal{F}_{i}=\{D_{j}\colon j\in\mathcal{C}_{i}\cup\mathcal{N}(i,\kappa_{n})\}, and define Ri=Zi𝐄[Zii]R_{i}=Z_{i}-{\bf E}[Z_{i}\mid\mathcal{F}_{i}].

Step 1. We show that

𝐄[(mn1ni𝒩nRi)2]=o(1).{\bf E}\left[\left(\sqrt{m_{n}}\frac{1}{n}\sum_{i\in\mathcal{N}_{n}}R_{i}\right)^{2}\right]=o(1).

Recalling (4), the left-hand side equals

mnn2i𝒩njΛi𝐄[RiRj]+mnn2i𝒩njΛi𝐄[RiRj][C]+[D].\frac{m_{n}}{n^{2}}\sum_{i\in\mathcal{N}_{n}}\sum_{j\in\Lambda_{i}}{\bf E}[R_{i}R_{j}]+\frac{m_{n}}{n^{2}}\sum_{i\in\mathcal{N}_{n}}\sum_{j\not\in\Lambda_{i}}{\bf E}[R_{i}R_{j}]\equiv[C]+[D].

If jΛij\in\Lambda_{i}, then ρ(i,j)3rn\rho(i,j)\leq 3r_{n}. Using this fact and (B.2) with s=rn/2s=r_{n}/2,

|[C]|\displaystyle\lvert[C]\rvert mnn2i𝒩n𝐄[Ri2]+mnn2i𝒩ns=13rnji(c′′ψ(rn/2))2𝟏{ρ(i,j)[s1,s)}\displaystyle\leq\frac{m_{n}}{n^{2}}\sum_{i\in\mathcal{N}_{n}}{\bf E}[R_{i}^{2}]+\frac{m_{n}}{n^{2}}\sum_{i\in\mathcal{N}_{n}}\sum_{s=1}^{\lfloor 3r_{n}\rfloor}\sum_{j\neq i}(c^{\prime\prime}\psi(r_{n}/2))^{2}\bm{1}\{\rho(i,j)\in[s-1,s)\}
mnn+mnns=13rnsψ(rn/2)2mnn+ψ(rn/2)2=o(1).\displaystyle\lesssim\frac{m_{n}}{n}+\frac{m_{n}}{n}\sum_{s=1}^{\lfloor 3r_{n}\rfloor}s\,\psi(r_{n}/2)^{2}\lesssim\frac{m_{n}}{n}+\psi(r_{n}/2)^{2}=o(1).

For |[D]|\lvert[D]\rvert, we first establish some covariance bounds. Fix i,ji,j such that jΛij\not\in\Lambda_{i}, and let s=ρ(i,j)s=\rho(i,j). Let t=ρ({i},J(0,𝒞i))t=\rho(\{i\},J(0,\mathcal{C}_{i})), the distance between ii and the nearest unit in the boundary of 𝒞i\mathcal{C}_{i}. By Assumption 3,

𝐄[|Ri|2T1i=1]=p1i2𝐄[|Yi(𝑫)𝐄[Yi(𝑫)i]|2T1i=1]p1i2ψ(t)2.{\bf E}[\lvert R_{i}\rvert^{2}\mid T_{1i}=1]=p_{1i}^{-2}{\bf E}\big{[}\lvert Y_{i}(\bm{D})-{\bf E}[Y_{i}(\bm{D})\mid\mathcal{F}_{i}]\rvert^{2}\mid T_{1i}=1\big{]}\leq p_{1i}^{-2}\psi(t)^{2}.

Similarly, 𝐄[|Ri|2T0i=1]p0i2ψ(t)2{\bf E}[\lvert R_{i}\rvert^{2}\mid T_{0i}=1]\leq p_{0i}^{-2}\psi(t)^{2}, so by the law of total probability and Remark 3, for some universal constant c>0c>0,

𝐄[|Ri|2]cψ(t)2and|Cov(Ri,Rj)|cψ(t).{\bf E}[\lvert R_{i}\rvert^{2}]\leq c\,\psi(t)^{2}\quad\text{and}\quad\lvert\text{Cov}(R_{i},R_{j})\rvert\leq c\,\psi(t).

We derive an alternate bound for the case s>4rns>4r_{n}. Let i(r)={Dj:ρ(i,j)r or j𝒞i}\mathcal{F}_{i}(r)=\{D_{j}\colon\rho(i,j)\leq r\text{ or }j\in\mathcal{C}_{i}\} and Xi=𝐄[Zii(s/2rn)]X_{i}={\bf E}[Z_{i}\mid\mathcal{F}_{i}(s/2-r_{n})]. Then

SiRi𝐄[Rii(s/2rn)]=R~iZiXiS_{i}\equiv R_{i}-{\bf E}[R_{i}\mid\mathcal{F}_{i}(s/2-r_{n})]=\tilde{R}_{i}\equiv Z_{i}-X_{i}

since 𝒩(i,s/2rn)𝒩(i,κn)\mathcal{N}(i,s/2-r_{n})\supseteq\mathcal{N}(i,\kappa_{n}). Moreover, XiXjX_{i}\perp\!\!\!\perp X_{j} since ρ(Q(i,s/2rn),Q(j,s/2rn))>2rn\rho(Q(i,s/2-r_{n}),Q(j,s/2-r_{n}))>2r_{n} and 2rn2r_{n} is the diameter of a cluster. Consequently, by (B.3),

|Cov(Ri,Rj)|\displaystyle\lvert\text{Cov}(R_{i},R_{j})\rvert |Cov(Xi,Xj)|+|Cov(Xi,Sj)|+|Cov(Si,Xj)|+|Cov(Si,Sj)|\displaystyle\leq\lvert\text{Cov}(X_{i},X_{j})\rvert+\lvert\text{Cov}(X_{i},S_{j})\rvert+\lvert\text{Cov}(S_{i},X_{j})\rvert+\lvert\text{Cov}(S_{i},S_{j})\rvert
=|Cov(Xi,R~j)|+|Cov(R~i,Xj)|+|Cov(R~i,R~j)|cψ(s/2rn).\displaystyle=\lvert\text{Cov}(X_{i},\tilde{R}_{j})\rvert+\lvert\text{Cov}(\tilde{R}_{i},X_{j})\rvert+\lvert\text{Cov}(\tilde{R}_{i},\tilde{R}_{j})\rvert\leq c\,\psi(s/2-r_{n}).

Applying the covariance bounds,

|[D]|mnn2i𝒩njΛi|𝐄[RiRj]|cmnn2k=1mns=12Rnt=0min{s,rn1}iJ(t,Ck)jΛi𝟏{ρ(i,j)[s1,s)}×(ψ(t)𝟏{s4rn}+ψ(s/2rn)𝟏{s>4rn}),\lvert[D]\rvert\leq\frac{m_{n}}{n^{2}}\sum_{i\in\mathcal{N}_{n}}\sum_{j\not\in\Lambda_{i}}\lvert{\bf E}[R_{i}R_{j}]\rvert\\ \leq c\frac{m_{n}}{n^{2}}\sum_{k=1}^{m_{n}}\sum_{s=1}^{\lfloor 2R_{n}\rfloor}\sum_{t=0}^{\lfloor\min\{s,r_{n}-1\}\rfloor}\sum_{i\in J(t,C_{k})}\sum_{j\not\in\Lambda_{i}}\bm{1}\{\rho(i,j)\in[s-1,s)\}\\ \times\big{(}\psi(t)\bm{1}\{s\leq 4r_{n}\}+\psi(s/2-r_{n})\bm{1}\{s>4r_{n}\}\big{)},

which is order (mn/n)1/2=o(1)(m_{n}/n)^{1/2}=o(1) by an argument similar to (B.5) and (B.6).

Step 2. We show that

σn1mn1ni𝒩n(𝐄[Zii]𝐄[Zi])d𝒩(0,1).\sigma_{n}^{-1}\sqrt{m_{n}}\frac{1}{n}\sum_{i\in\mathcal{N}_{n}}\big{(}{\bf E}[Z_{i}\mid\mathcal{F}_{i}]-{\bf E}[Z_{i}]\big{)}\stackrel{{\scriptstyle d}}{{\longrightarrow}}\mathcal{N}(0,1).

First, let σ~n2=Var(mnn1i𝒩n𝐄[Zii])\tilde{\sigma}_{n}^{2}=\text{Var}(\sqrt{m_{n}}n^{-1}\sum_{i\in\mathcal{N}_{n}}{\bf E}[Z_{i}\mid\mathcal{F}_{i}]). By Minkowski’s inequality,

|σnσ~n|Var(mn1ni𝒩n(Zi𝐄[Zii]))1/2,\lvert\sigma_{n}-\tilde{\sigma}_{n}\rvert\leq\text{Var}\left(\sqrt{m_{n}}\frac{1}{n}\sum_{i\in\mathcal{N}_{n}}(Z_{i}-{\bf E}[Z_{i}\mid\mathcal{F}_{i}])\right)^{1/2}, (B.7)

which is o(1)o(1) by step 1. By Assumption 5, it then suffices to show

σ~n1mn1ni𝒩n(𝐄[Zii]𝐄[Zi])d𝒩(0,1).\tilde{\sigma}_{n}^{-1}\sqrt{m_{n}}\frac{1}{n}\sum_{i\in\mathcal{N}_{n}}\big{(}{\bf E}[Z_{i}\mid\mathcal{F}_{i}]-{\bf E}[Z_{i}]\big{)}\stackrel{{\scriptstyle d}}{{\longrightarrow}}\mathcal{N}(0,1). (B.8)

We apply Lemma B.2 with Xi=n1mn1/2(𝐄[Zii]𝐄[Zi])X_{i}=n^{-1}m_{n}^{1/2}({\bf E}[Z_{i}\mid\mathcal{F}_{i}]-{\bf E}[Z_{i}]) and dependency graph 𝑨\bm{A} defined after (8). The maximum degree of 𝑨\bm{A} is at most 16maxk|Ck|16\max_{k}\lvert C_{k}\rvert, and maxk|Ck|=O(n/mn)\max_{k}\lvert C_{k}\rvert=O(n/m_{n}). Therefore, by Assumptions 1 and 5,

(B.9)(nmn)2n(mnn)3+(nmn)3/2n(mnn)4mn1/2.\eqref{wass}\lesssim\left(\frac{n}{m_{n}}\right)^{2}n\left(\frac{\sqrt{m_{n}}}{n}\right)^{3}+\left(\frac{n}{m_{n}}\right)^{3/2}\sqrt{n\left(\frac{\sqrt{m_{n}}}{n}\right)^{4}}\lesssim m_{n}^{-1/2}.

Since mnm_{n}\rightarrow\infty, (B.8) follows. ∎

Lemma B.2 ([41], Theorem 3.6).

Let {Xi}i=1n\{X_{i}\}_{i=1}^{n} be random variables with dependency graph 𝐀\bm{A} such that 𝐄[Xi4]<{\bf E}[X_{i}^{4}]<\infty and 𝐄[Xi]=0{\bf E}[X_{i}]=0. Define σ2Var(i=1nXi)\sigma^{2}\equiv\text{Var}(\sum_{i=1}^{n}X_{i}), 𝒲=i=1nXi/σ\mathcal{W}=\sum_{i=1}^{n}X_{i}/\sigma, and Ψ=maxi=1,,nj=1nAij\Psi=\max_{i=1,\dots,n}\sum_{j=1}^{n}A_{ij}. For 𝒵𝒩(0,1)\mathcal{Z}\sim\mathcal{N}(0,1),

d(𝒲,𝒵)Ψ2σ3i=1n𝐄[|Xi|3]+28Ψ3/2πσ2(i=1n𝐄[Xi4])1/2,d(\mathcal{W},\mathcal{Z})\leq\frac{\Psi^{2}}{\sigma^{3}}\sum_{i=1}^{n}{\bf E}[\lvert X_{i}\rvert^{3}]+\frac{\sqrt{28}\Psi^{3/2}}{\sqrt{\pi}\sigma^{2}}\left(\sum_{i=1}^{n}{\bf E}[X_{i}^{4}]\right)^{1/2}, (B.9)

where d(,)d(\cdot,\cdot) is the Wasserstein distance.
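To see the bound in action, the sketch below evaluates the right-hand side of (B.9) in closed form for a toy 1-dependent Gaussian sequence Xi=(ei+ei+1)/2nX_{i}=(e_{i}+e_{i+1})/\sqrt{2n}, a construction chosen purely for illustration (Ψ = 3 counts ii itself and its two neighbors, a conservative choice). The bound decays at the n1/2n^{-1/2} rate, paralleling the mn1/2m_{n}^{-1/2} rate obtained in the proof of Theorem 3:

```python
import numpy as np

def stein_bound(n, Psi=3):
    """Right-hand side of (B.9) for X_i = (e_i + e_{i+1}) / sqrt(2n), e_j iid N(0,1)."""
    v = 1.0 / n                               # Var(X_i)
    E_abs3 = v**1.5 * 2 * np.sqrt(2 / np.pi)  # E|N(0, v)|^3
    E_4 = 3 * v**2                            # E[N(0, v)^4]
    sigma2 = (4 * n - 2) / (2 * n)            # Var(sum_i X_i)
    term1 = Psi**2 / sigma2**1.5 * n * E_abs3
    term2 = np.sqrt(28) * Psi**1.5 / (np.sqrt(np.pi) * sigma2) * np.sqrt(n * E_4)
    return term1 + term2

bounds = [stein_bound(n) for n in (100, 400, 1600)]
assert all(b2 < b1 for b1, b2 in zip(bounds, bounds[1:]))  # bound shrinks in n
assert abs(bounds[0] / bounds[1] - 2) < 0.2                # ~n^{-1/2} decay
```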

Acknowledgments

The author thanks the referees and associate editor for helpful comments that improved the exposition of the paper.

Supplementary Material

supplement.zip: This zip file contains a PDF with proofs omitted in this text and code used to produce the simulation results in § 5.

References

  • [1] Andrews, D. (1991). Heteroskedasticity and Autocorrelation Consistent Covariance Matrix Estimation. Econometrica, 817–858.
  • [2] Anselin, L. (2001). Spatial Econometrics. In A Companion to Theoretical Econometrics (B. Baltagi, ed.), Chapter 14. Blackwell Publishing Ltd.
  • [3] Aronow, P. and Samii, C. (2017). Estimating Average Causal Effects Under General Interference, with Application to a Social Network Experiment. The Annals of Applied Statistics 11, 1912–1947.
  • [4] Aronow, P., Samii, C. and Wang, Y. (2020). Design-Based Inference for Spatial Experiments with Interference. arXiv preprint arXiv:2010.13599.
  • [5] Baird, S., Bohren, J., McIntosh, C. and Özler, B. (2018). Optimal Design of Experiments in the Presence of Interference. Review of Economics and Statistics 100, 844–860.
  • [6] Basse, G. and Airoldi, E. (2018). Model-Assisted Design of Experiments in the Presence of Network-Correlated Outcomes. Biometrika 105, 849–858.
  • [7] Basse, G., Feller, A. and Toulis, P. (2019). Randomization Tests of Causal Effects Under Interference. Biometrika 106, 487–494.
  • [8] Blattman, C., Green, D., Ortega, D. and Tobón, S. (2021). Place-Based Interventions at Scale: The Direct and Spillover Effects of Policing and City Services on Crime. Journal of the European Economic Association 19, 2022–2051.
  • [9] Chin, A. (2019). Regression Adjustments for Estimating the Global Treatment Effect in Experiments with Interference. Journal of Causal Inference 7.
  • [10] Choi, D. (2017). Estimation of Monotone Treatment Effects in Network Experiments. Journal of the American Statistical Association 112, 1147–1155.
  • [11] Cliff, A. and Ord, J. (1973). Spatial Autocorrelation. London: Pion.
  • [12] Cliff, A. and Ord, J. (1981). Spatial Processes: Models and Applications. London: Pion.
  • [13] Conley, T. (1999). GMM Estimation with Cross Sectional Dependence. Journal of Econometrics 92, 1–45.
  • [14] Donnelly, C., Woodroffe, R., Cox, D., Bourne, J., Gettinby, G., Le Fevre, A., McInerney, J. and Morrison, I. (2003). Impact of Localized Badger Culling on Tuberculosis Incidence in British Cattle. Nature 426, 834–837.
  • [15] Eckles, D., Karrer, B. and Ugander, J. (2017). Design and Analysis of Experiments in Networks: Reducing Bias from Interference. Journal of Causal Inference 5.
  • [16] Forastiere, L., Airoldi, E. and Mealli, F. (2021). Identification and Estimation of Treatment and Interference Effects in Observational Studies on Networks. Journal of the American Statistical Association 116, 901–918.
  • [17] Getis, A. (2008). A History of the Concept of Spatial Autocorrelation: A Geographer’s Perspective. Geographical Analysis 40, 297–309.
  • [18] Giffin, A., Reich, B., Yang, S. and Rappold, A. (2020). Generalized Propensity Score Approach to Causal Inference with Spatial Interference. arXiv preprint arXiv:2007.00106.
  • [19] Harshaw, C., Sävje, F., Eisenstat, D., Mirrokni, V. and Pouget-Abadie, J. (2021). Design and Analysis of Bipartite Experiments Under a Linear Exposure-Response Model. arXiv preprint arXiv:2103.06392.
  • [20] Hayes, R. and Moulton, L. (2017). Cluster Randomised Trials. Chapman and Hall/CRC.
  • [21] Hu, Y., Li, S. and Wager, S. (2022). Average Direct and Indirect Causal Effects Under Interference. Biometrika (forthcoming).
  • [22] Imbens, G. and Rubin, D. (2015). Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction. Cambridge University Press.
  • [23] Jagadeesan, R., Pillai, N. and Volfovsky, A. (2020). Designs for Estimating the Treatment Effect in Networks with Interference. The Annals of Statistics 48, 679–712.
  • [24] Jenish, N. (2016). Spatial Semiparametric Model with Endogenous Regressors. Econometric Theory 32, 714–739.
  • [25] Jenish, N. and Prucha, I. (2009). Central Limit Theorems and Uniform Laws of Large Numbers for Arrays of Random Fields. Journal of Econometrics 150, 86–98.
  • [26] Jenish, N. and Prucha, I. (2011). On Spatial Processes and Asymptotic Inference Under Near-Epoch Dependence. U. Maryland working paper.
  • [27] Jenish, N. and Prucha, I. (2012). On Spatial Processes and Asymptotic Inference Under Near-Epoch Dependence. Journal of Econometrics 170, 178–190.
  • [28] Lahiri, S. (1996). On Inconsistency of Estimators Based on Spatial Data Under Infill Asymptotics. Sankhyā: The Indian Journal of Statistics, Series A, 403–417.
  • [29] Lahiri, S. (2003). Central Limit Theorems for Weighted Sums of a Spatial Process Under a Class of Stochastic and Fixed Designs. Sankhyā: The Indian Journal of Statistics, 356–388.
  • [30] Lahiri, S. and Zhu, J. (2006). Resampling Methods for Spatial Regression Models Under a Class of Stochastic Designs. The Annals of Statistics 34, 1774–1813.
  • [31] Leung, M. (2022). Causal Inference Under Approximate Neighborhood Interference. Econometrica 90, 267–293.
  • [32] Leung, M. (2022). Network Cluster-Robust Inference. arXiv preprint arXiv:2103.01470.
  • [33] Leung, M. (2022). Supplement to “Rate-Optimal Cluster-Randomized Designs for Spatial Interference”. DOI: 10.1214/22-AOS2224SUPP.
  • [34] Manski, C. (2013). Identification of Treatment Response with Social Interactions. The Econometrics Journal 16, S1–S23.
  • [35] Miguel, E. and Kremer, M. (2004). Worms: Identifying Impacts on Education and Health in the Presence of Treatment Externalities. Econometrica 72, 159–217.
  • [36] Paler, L., Samii, C., Lisiecki, M. and Morel, A. (2015). Social and Environmental Impact of the Community Rangers Program in Aceh. Technical Report, World Bank, Washington, DC.
  • [37] Park, C. and Kang, H. (2021). Assumption-Lean Analysis of Cluster Randomized Trials in Infectious Diseases for Intent-to-Treat Effects and Spillover Effects Among a Vulnerable Subpopulation. Journal of the American Statistical Association (forthcoming).
  • [38] Peng, R., Sun, H. and Zanetti, L. (2017). Partitioning Well-Clustered Graphs: Spectral Clustering Works! SIAM Journal on Computing 46, 710–743.
  • [39] Pollmann, M. (2020). Causal Inference for Spatial Treatments. arXiv preprint arXiv:2011.00373.
  • [40] Pouget-Abadie, J., Mirrokni, V., Parkes, D. and Airoldi, E. (2018). Optimizing Cluster-Based Randomized Experiments Under Monotonicity. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2090–2099.
  • [41] Ross, N. (2011). Fundamentals of Stein’s Method. Probability Surveys 8, 210–293.
  • [42] Sävje, F. (2021). Causal Inference with Misspecified Exposure Mappings. arXiv preprint arXiv:2103.06471.
  • [43] Sävje, F., Aronow, P. and Hudgens, M. (2021). Average Treatment Effects in the Presence of Unknown Interference. The Annals of Statistics 49, 673–701.
  • [44] Sussman, D. and Airoldi, E. (2017). Elements of Estimation Theory for Causal Effects in the Presence of Network Interference. arXiv preprint arXiv:1702.03578.
  • [45] Toulis, P. and Kao, E. (2013). Estimation of Causal Peer Influence Effects. In International Conference on Machine Learning, 1489–1497.
  • [46] Ugander, J., Karrer, B., Backstrom, L. and Kleinberg, J. (2013). Graph Cluster Randomization: Network Exposure to Multiple Universes. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 329–337.
  • [47] Ugander, J. and Yin, H. (2020). Randomized Graph Cluster Randomization. arXiv preprint arXiv:2009.02297.
  • [48] Valcu, M. and Kempenaers, B. (2010). Spatial Autocorrelation: An Overlooked Concept in Behavioral Ecology. Behavioral Ecology 21, 902–905.
  • [49] Verbitsky-Savitz, N. and Raudenbush, S. (2012). Causal Inference Under Interference in Spatial Settings: A Case Study Evaluating Community Policing Program in Chicago. Epidemiologic Methods 1, 107–130.
  • [49] {barticle}[author] \bauthor\bsnmVerbitsky-Savitz, \bfnmN.\binitsN. and \bauthor\bsnmRaudenbush, \bfnmS.\binitsS. (\byear2012). \btitleCausal Inference Under Interference in Spatial Settings: A Case Study Evaluating Community Policing Program in Chicago. \bjournalEpidemiologic Methods \bvolume1 \bpages107–130. \endbibitem
  • [50] {barticle}[author] \bauthor\bsnmViviano, \bfnmD.\binitsD. (\byear2020). \btitleExperimental Design Under Network Interference. \bjournalarXiv preprint arXiv:2003.08421. \endbibitem
  • [51] {barticle}[author] \bauthor\bsnmZigler, \bfnmC.\binitsC. and \bauthor\bsnmPapadogeorgou, \bfnmG.\binitsG. (\byear2021). \btitleBipartite Causal Inference with Interference. \bjournalStatistical Science \bvolume36 \bpages109. \endbibitem
