

Clustered Switchback Designs for
Experimentation Under Spatio-temporal Interference

Su Jia, Nathan Kallus, Christina Lee Yu
Cornell University
E-mails: {sj693,kallus,cleeyu}@cornell.edu
Abstract

We consider experimentation in the presence of non-stationarity, inter-unit (spatial) interference, and carry-over effects (temporal interference), where we wish to estimate the global average treatment effect (GATE), the difference between average outcomes having exposed all units at all times to treatment or to control. We suppose spatial interference is described by a graph, where a unit's outcome depends on its neighborhood's treatments, and that temporal interference is described by an MDP, where the transition kernel under either treatment (action) satisfies a rapid mixing condition. We propose a clustered switchback design, where units are grouped into clusters and time steps are grouped into blocks, and each whole cluster-block combination is assigned a single random treatment. Under this design, we show that for graphs that admit good clustering, a truncated Horvitz-Thompson estimator achieves a \tilde{O}(1/NT) mean squared error (MSE), matching the lower bound up to logarithmic terms for sparse graphs. Our results simultaneously generalize results from Hu and Wager (2022), Ugander et al. (2013), and Leung (2022). Simulation studies validate the favorable performance of our approach.

1 Introduction

Prior Work | #individuals | #rounds | Interference Graph | Clustering | MSE
Hu and Wager (2022) | 1 | T | singleton | n/a | t_{\rm mix}/T
Ugander and Yin (2023) | N | 1 | \kappa-restricted growth graphs | 3-net | d^{2}\kappa^{4}/N
Leung (2022) | N | 1 | Intersection graph of balls | uniform | h^{2}/N
Table 1: Known Results: Our main theorem recovers several known results for "pure" switchback and "pure" A/B testing under interference. Here, t_{\rm mix} is a parameter that measures how fast the system "stabilizes" (more precisely, the mixing time of the transition kernels, which we will define later); d is the maximum degree of the interference graph; \kappa is the restricted growth parameter of Ugander et al. (2013), which constrains the growth rate of the neighborhood in the number of hops; h is the radius of the balls in the intersection graph.

Randomized experimentation, or A/B testing, is widely used to estimate causal effects on online platforms. Basic strategies involve randomly partitioning the experimental units (e.g., individuals or time periods) into two groups, and assigning one group to treatment and the other to control. A key challenge in modern A/B testing is interference: from two-sided markets to social networks, interference between individuals complicates experimentation and makes it difficult to estimate the true effect of a treatment.

The spillover effect in experimentation has been extensively studied (Manski, 2013; Aronow et al., 2017; Li et al., 2021; Ugander et al., 2013; Sussman and Airoldi, 2017; Toulis and Kao, 2013; Basse and Airoldi, 2018; Cai et al., 2015; Gui et al., 2015; Eckles et al., 2017a; Chin, 2019). Most of these works assume neighborhood interference, where the spillover effect is constrained to the direct neighborhood of an individual as given by an interference graph. Under this assumption, Ugander et al. (2013) proposed a clustering-based design, and showed that if the growth rate \kappa of neighborhoods is bounded, then the Horvitz-Thompson (HT) estimator achieves a mean squared error (MSE) of O(d)\cdot 2^{O(\kappa)}/N, where d is the maximum degree. Later, Ugander and Yin (2023) showed that by introducing randomness into the clustering, the dependence on \kappa can be improved to polynomial. As there are many settings in which interference extends beyond direct neighbors, Leung (2022) considers a relaxed assumption in which interference is not restricted to direct neighbors, but decays as a function of the spatial distance between individuals with respect to an embedding of the individuals in Euclidean space.

Orthogonal to the spillover effect, the carryover effect (or temporal interference), where past treatments may affect future outcomes, has also been extensively studied. Bojinov et al. (2023) considers a simple model in which the temporal interference is bounded by a fixed window length. Other works model temporal interference that arises from the Markovian evolution of states, which allows for interference effects that can persist across long time horizons (Glynn et al., 2020; Farias et al., 2022; Hu and Wager, 2022; Johari et al., 2022; Shi et al., 2023). A commonly used approach in practice is to deploy switchback experiments: the exposure of the entire system (viewed as a single experimental unit) alternates randomly between treatment and control for sufficiently long contiguous blocks of time, such that the temporal interference around the switching points does not dominate. Under a switchback design, Hu and Wager (2022) showed that a \tilde{O}(t_{\rm mix}/T) MSE rate can be achieved, assuming that the Markov chains have mixing time t_{\rm mix}.

While prior studies have focused on either network interference or temporal interference alone, there are many practical settings in which both types of interference are present, such as online platforms, healthcare systems, and ride-sharing networks. In these environments, an individual's outcome may depend not only on who else is treated nearby but also on how the individual's "state" has evolved over time, making it essential to develop methodologies that can handle both dimensions jointly.

Figure 1: Clustered Switchback Experiments. The image illustrates clustered switchback on DoorDash (Sneider and Tang, 2019). Time and geographical locations are grouped into blocks. Each spatio-temporal cluster (i.e., product set) is independently assigned treatment/control. The goal is to estimate the difference in the average (counterfactual) "outcomes" (e.g., revenues) between the all-treatment and all-control policies.

To address this, clustered switchback experiments have become widely adopted in industry. The idea is to partition both space (e.g., by geography or a social network) and time into discrete blocks, and then randomize at the level of space-time clusters (i.e., product sets of spatial and temporal blocks). For example, DoorDash randomizes promotions at the region-hour level (see Fig. 1). This allows practitioners to mitigate interference within clusters while preserving statistical power and operational feasibility.

Despite its practical popularity, the theoretical foundations of this approach remain underexplored. On the surface, handling spatio-temporal interference seems straightforward, considering that (i) time can be regarded as an additional "dimension", and (ii) these two types of interference have been well explored separately. However, in most work on network interference, the potential outcomes conditioned on the treatment assignments are assumed to be independent (e.g., Ugander et al. (2013); Leung (2022)). In a Markovian setting, this assumption breaks down, since past outcomes are correlated with future outcomes, even conditioned on the treatments, due to state evolution.

We consider experimentation with spatio-temporal interference on a model that encapsulates both (i) the network interference between individuals, via a given interference graph, and (ii) the temporal interference that arises from Markovian state evolutions. We assume that the outcome and state evolution of each individual depend solely on the treatments of their immediate neighborhood (including themselves), and that the state evolutions are independent across individuals conditioned on the treatments.

1.1 Our Contributions

Our main theorem states that a truncated HT estimator achieves an MSE of 1/NT times a graph clustering-dependent quantity, which is O(1) for low-degree graphs with good clusterings, e.g., growth-restricted graphs (Ugander et al., 2013) or spatially derived graphs (Leung, 2022); see Theorem 1. This result bridges the literature on experimentation with spatial/network interference and temporal interference by extending the following results:

  1.

    Pure switchback experiments. Hu and Wager (2022) independently obtained a \tilde{O}(t_{\rm mix}/T) MSE rate for N=1. Our Theorem 1 generalizes this result to the N-individual setting, using a different class of estimators. We discuss the comparison with their work in Section 3.

  2.

    Network interference. Assuming that the interference graph satisfies the \kappa-restricted growth condition (defined in Section 4), Ugander et al. (2013) showed that the HT estimator achieves an MSE of \tilde{O}(2^{\kappa^{6}}d/N) for T=1 with a suitable partition (graph clustering), where d is the maximum degree. Moreover, by introducing randomness into the clustering, Ugander and Yin (2023) improved the exponential dependence on \kappa to polynomial, achieving a \tilde{O}(d^{2}\kappa^{4}/N) MSE. Our Corollary 4 generalizes this to \tilde{O}(d^{2}\kappa^{4}t_{\rm mix}/NT) in the presence of Markovian temporal interference.

Interference Graph | (Spatio-) Clustering | MSE | Reference
No edges (i.e., no interference) | one node per cluster | t_{\rm mix}/NT | Corollary 1
Singleton (i.e., pure switchback) | n/a | t_{\rm mix}/T | Corollary 2
Maximum degree d | arbitrary | 2^{4d}t_{\rm mix}/NT | Corollary 3
\kappa-restricted growth | 1-hop-random | d^{2}\kappa^{4}t_{\rm mix}/NT | Corollary 4
Intersection graph of radius-h balls | uniform | h^{2}t_{\rm mix}/NT | Corollary 5
Table 2: Implications of Our Main Result: Our main theorem implies the following MSE bounds in several fundamental special cases. We omit polylogarithmic terms in N,T in these MSE bounds.

We state our results under the \delta-fractional neighborhood exposure (\delta-FNE) assumption as introduced in Ugander et al. (2013); Eckles et al. (2017b), generalizing beyond the "exact" (i.e., \delta=0) neighborhood interference assumption. We summarize our results (with \delta=0 for simplicity) in Table 2.

We emphasize that our setting, even for N=1, cannot be reduced to that of Leung (2022). Essentially, this is because their independence assumptions no longer hold here. We provide more details in a dedicated discussion in Section 3.2.

1.2 Related Work

Violation of SUTVA. Experimentation is a broadly deployed learning tool in e-commerce that is simple to execute (Kohavi and Thomke, 2017; Thomke, 2020; Larsen et al., 2023). As a key challenge, the violation of the so-called Stable Unit Treatment Value Assumption (SUTVA) has been viewed as problematic in online platforms (Blake and Coey, 2014).

Many existing works that tackle this problem assume that interference is summarized by a low-dimensional exposure mapping and that individuals are individually randomized into treatment or control by Bernoulli randomization (Manski, 2013; Toulis and Kao, 2013; Aronow et al., 2017; Basse et al., 2019; Forastiere et al., 2021). Some works departed from unit-level randomization and introduced cluster dependence in unit-level assignments in order to improve estimator precision, including Ugander et al. (2013); Jagadeesan et al. (2020); Leung (2022, 2023), to name a few.

There is another line of work that considers temporal interference (or the carryover effect). Some works consider a fixed bound on the persistence of temporal interference (e.g., Bojinov et al. (2023)), while others consider temporal interference arising from the Markovian evolution of states (Glynn et al., 2020; Farias et al., 2022; Johari et al., 2022; Shi et al., 2023; Hu and Wager, 2022; Li and Wager, 2022; Li et al., 2023). Apart from being limited to the single-individual setting, many of these works differ from ours either by (i) focusing on alternative objectives, such as the stationary outcome (Glynn et al., 2020), or (ii) imposing additional assumptions, such as observability of the states (Farias et al., 2022).

Spatio-temporal Interference. Although extensively studied separately and recognized for its practical significance, experimentation under spatio-temporal interference has received relatively limited attention previously. Recently, Ni et al. (2023) attempted to address this problem, but their carryover effect is confined to one period. Another closely related work is Li and Wager (2022). Similar to our work, they specify spatial interference using an interference graph and model temporal interference by assigning an MDP to each individual. In our model, the transition probability depends on the states of all neighbors (in the interference graph); in contrast, their evolution depends on the sum of the outcomes of direct neighbors. Moreover, our work focuses on ATE estimation for a fixed, unknown environment, whereas they focus on large-sample asymptotics and mean-field properties. Wang (2021) studied direct treatment effects for panel data under spatio-temporal interference, but focused on asymptotic properties instead of finite-sample bounds.

Off-Policy Evaluation (OPE). Since we model temporal interference using an MDP, our work is naturally related to reinforcement learning (RL). In fact, our result on ATE estimation can be rephrased as OPE (Jiang and Li, 2016; Thomas and Brunskill, 2016) in a multi-agent MDP: given a behavioral policy from which the data is generated, we aim to evaluate the mean reward of a target policy. The ATE in our work is essentially the difference in mean reward between two target policies (the all-1 and all-0 policies), and the behavioral policy is given by clustered randomization. However, these works usually require certain states to be observable, which is not needed in our work. Moreover, these works usually impose assumptions on the non-stationarity, which we allow to be completely arbitrary. Finally, these works focus on rather general data-generating policies (beyond fixed-treatment policies) and estimands (beyond the ATE), which compromises the strength of their results.

Figure 2: Positioning of this work. By and large, a model for experimentation involves a subset of four key features, as illustrated above. Prior work has addressed spatial and temporal interference separately, often incorporating heterogeneity across individuals as well. However, in real-world applications, all four features often arise simultaneously. This work aims to lay the theoretical foundation for clustered switchback experiments, which are increasingly used in practice to navigate such complex interference.

2 Model Setup and Experiment Design

2.1 Formulation

Consider a horizon with T rounds and N individuals, where each individual is randomly assigned to treatment ("1") or control ("0") at each time. We model the interference between individuals using an interference graph G=(U,E), where |U|=N and each node represents an individual. Formally, the treatment assignment is given by a binary matrix W\in\{0,1\}^{N\times T}. We focus on non-adaptive designs, where W is drawn at the beginning and hence is independent of all other variables, including the individuals' states and outcomes.

To model temporal interference, we assign each individual i\in U a state S_{it}\in\mathcal{S} at time t\in[T]:=\{1,2,\dots,T\} that evolves independently in a Markovian fashion. The transition kernel is a function of the treatments of i and its direct neighbors in G at time t, which we refer to as the interference neighborhood of i, denoted {\cal N}(i):=\{i\}\cup\{j\in U\mid(i,j)\in E\}. The state at time t+1, S_{i,t+1}, is drawn from a distribution P_{it}^{W_{{\cal N}(i),t}}(S_{i,t+1}\in\cdot\mid S_{it}). We allow P_{it}^{w} to vary arbitrarily across different combinations of i,t and w\in\{0,1\}^{{\cal N}(i)}.

An observed outcome Y_{it}\in\mathbb{R} is generated as a function of (i) the unit's state and (ii) the treatments of itself and its neighbors, according to

Y_{it}=\mu_{it}(S_{it},W_{{\cal N}(i),t})+\epsilon_{it},

where \epsilon_{it} has mean zero. The conditional mean outcome \mathbb{E}[Y_{it}\mid W] is determined by \mu_{it}:\mathcal{S}\times\{0,1\}^{{\cal N}(i)}\to[0,1], which we call the outcome function. The model dynamics are thus specified by the sequence

W,\{(S_{i1},Y_{i1})\}_{i\in U},\dots,\{(S_{iT},Y_{iT})\}_{i\in U}.

We emphasize that we do not assume observation of the state variables.
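To make the dynamics concrete, the following is a minimal sketch, not the paper's code, of one such data-generating process: binary states, a hypothetical kernel in which a fully treated (or fully control) neighborhood determines the probability of transitioning to state 1, and outcomes equal to a mean outcome plus Gaussian noise. The function name simulate, the parameter p_up, and the 0.5 coefficients are all illustrative assumptions.

```python
import random

def simulate(W, neighbors, T, rng, p_up=(0.3, 0.7)):
    """Sketch of the model: W[i][t] in {0,1}; neighbors[i] lists the
    interference neighborhood N(i) (including i itself). Returns the
    observed outcomes Y as an N x T list of lists. p_up[a] is a
    hypothetical probability of moving to state 1 when the whole
    neighborhood is assigned a; mixed neighborhoods use 0.5."""
    N = len(W)
    S = [0] * N                        # initial states S_{i0} = 0
    Y = [[0.0] * T for _ in range(N)]
    for t in range(T):
        for i in range(N):
            w = [W[j][t] for j in neighbors[i]]
            if all(w):                 # fully treated neighborhood
                p1 = p_up[1]
            elif not any(w):           # fully control neighborhood
                p1 = p_up[0]
            else:                      # mixed exposure
                p1 = 0.5
            S[i] = 1 if rng.random() < p1 else 0   # Markovian state update
            # observed outcome: mean outcome mu_it(S, w) plus mean-zero noise
            Y[i][t] = 0.5 * S[i] + 0.5 * sum(w) / len(w) + rng.gauss(0, 0.1)
    return Y
```

Here the kernel resamples the state every round, so the chain mixes in one step; a slower-mixing kernel would let the transition probability depend on the current state as well.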

Given the observations (consisting solely of W,Y), our objective is to estimate the difference between the counterfactual outcomes under continuous deployment of treatment 1 and treatment 0, averaged over all individuals and rounds, referred to as the global average treatment effect.

Definition 1 (Global Average Treatment Effect).

Let {\cal D} be any distribution over {\cal S}^{N}, and denote \mathbb{E}_{\cal D}[\cdot]=\mathbb{E}[\cdot\mid{\bf S_{0}}\sim{\cal D}]. The global average treatment effect (GATE) is

\Delta=\Delta_{\cal D}:=\frac{1}{NT}\sum_{(i,t)\in U\times[T]}\Delta_{it},\quad{\rm where}\quad\Delta_{it}=\mathbb{E}_{\cal D}[Y_{it}\mid W={\bf 1}]-\mathbb{E}_{\cal D}[Y_{it}\mid W={\bf 0}].

If the Markov chains are rapid-mixing (defined below), then {\cal D} "matters" only up to a lower-order term compared to the MSE (see Proposition 1), and we will thus suppress {\cal D}. We also point out that our results still hold if "{\bf 1}" and "{\bf 0}" are replaced with an arbitrary pair of fixed treatment sequences.

2.2 Assumptions

A key assumption, introduced in Hu and Wager (2022), that allows for estimation despite temporal interference is rapid mixing:

Assumption 1 (Rapid Mixing).

There exists a constant t_{\rm mix}>0 such that for any i\in U, t\in[T], w\in\{0,1\}^{{\cal N}(i)} and distributions f,f^{\prime} over \mathcal{S}, we have

d_{\rm TV}(fP^{w}_{it},f^{\prime}P^{w}_{it})\leq e^{-1/t_{\rm mix}}\cdot d_{\rm TV}(f,f^{\prime}).
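For intuition, the contraction in Assumption 1 can be verified in closed form for a two-state chain: the kernel P=[[1-p,p],[q,1-q]] shrinks total variation by exactly |1-p-q| per step, so e^{-1/t_{\rm mix}}=|1-p-q|. A small numerical check (illustrative; the kernel and values below are our assumptions, not from the paper):

```python
def tv(f, g):
    """Total variation distance between two distributions on a finite set."""
    return 0.5 * sum(abs(a - b) for a, b in zip(f, g))

def step(f, P):
    """One Markov step: the row vector f times the transition matrix P."""
    n = len(P)
    return [sum(f[i] * P[i][j] for i in range(n)) for j in range(n)]

p, q = 0.3, 0.5
P = [[1 - p, p], [q, 1 - q]]
f, g = [1.0, 0.0], [0.0, 1.0]        # the two extreme point masses
ratio = tv(step(f, P), step(g, P)) / tv(f, g)
assert abs(ratio - abs(1 - p - q)) < 1e-12   # one-step contraction factor 0.2
```

For these values the contraction factor is 0.2, so t_{\rm mix}=1/\log(1/0.2)\approx 0.62.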

As a convenient consequence, the initial state distribution does not matter much.

Proposition 1 (Initial State Doesn’t Matter).

For any distributions {\cal D},{\cal D}^{\prime} over {\cal S}^{N}, we have

|\mathbb{E}_{\cal D}[Y_{it}]-\mathbb{E}_{{\cal D}^{\prime}}[Y_{it}]|=O\left(e^{-t/t_{\rm mix}}\right),\quad\forall i,t. (1)

Consequently,

|\Delta_{\cal D}-\Delta_{{\cal D}^{\prime}}|=O\left(\frac{t_{\rm mix}\log(NT)}{NT}\right).

The above implies that the error caused by misspecifying the initial distribution is \tilde{O}(1/NT), and thus it contributes only a \tilde{O}(1/(NT)^{2}) term to the MSE. This is of lower order compared to our MSE bound, which scales as 1/NT, as we will soon see.

We assume that the mean-zero noise terms \epsilon_{it} have zero cross-correlation and bounded variance:

Assumption 2 (Uncorrelated Noise).

Write S=(S_{it}). There is a constant \sigma s.t. for all i,i^{\prime}\in U and t,t^{\prime}\in[T], we have

\mathbb{E}[\epsilon_{it}\mid S,W]=0\quad{\rm and}\quad\mathbb{E}[\epsilon_{it}\cdot\epsilon_{i^{\prime}t^{\prime}}\mid S,W]\leq\sigma^{2}\cdot\mathbbm{1}(i=i^{\prime}\ {\rm and}\ t^{\prime}=t).

We state our results under the \delta-fractional neighborhood exposure (\delta-FNE) mapping as introduced in Ugander et al. (2013); Eckles et al. (2017a); the neighborhood interference assumption is the special case \delta=0. For concreteness, the reader may assume \delta=0 without losing sight of the main ideas.

Assumption 3 (δ\delta-FNE).

For any a\in\{0,1\} and w\in\{0,1\}^{{\cal N}(i)} s.t. \|w-a{\bf 1}\|_{1}\leq\delta|{\cal N}(i)|, we have

\mu_{it}^{w}\equiv\mu_{it}^{a{\bf 1}}\quad{\rm and}\quad P_{it}^{w}\equiv P_{it}^{a{\bf 1}}.
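For illustration, the condition in Assumption 3 is easy to check for a given pattern; the helper below (the name fne_equivalent is ours, not the paper's) tests whether w is within a \delta fraction of the constant pattern a{\bf 1}:

```python
def fne_equivalent(w, a, delta):
    """True iff ||w - a*1||_1 <= delta * |N(i)| for the 0/1 pattern w on
    N(i); under delta-FNE, such w share the outcome function and transition
    kernel of the constant pattern a*1."""
    return sum(abs(x - a) for x in w) <= delta * len(w)
```

For example, with \delta=0.2 a neighborhood of size 5 may have at most one deviating treatment.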

2.3 Design and Estimator

We focus on clustered switchback designs, which specify a distribution for sampling the treatment matrix W given a fixed clustering over the network.

Definition 2 (Clusters).

A family \Pi of subsets of U is a clustering (or partition) if C\cap C^{\prime}=\emptyset for any distinct C,C^{\prime}\in\Pi and \cup_{C\in\Pi}C=U. Each set C\in\Pi is called a cluster.
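The two conditions of Definition 2 can be checked mechanically; a small helper (ours, for illustration):

```python
def is_clustering(Pi, U):
    """True iff the family Pi is a partition of U: clusters are pairwise
    disjoint and their union is all of U."""
    seen = set()
    for C in Pi:
        if seen & set(C):
            return False          # two clusters overlap
        seen |= set(C)
    return seen == set(U)
```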

We independently assign treatments to the cluster-timeblock product sets uniformly.

Definition 3 (Clustered Switchback Design).

Let \Pi be a clustering of U. Partition [T] into consecutive timeblocks of length \ell>0 (except possibly the last one). For each block B\subseteq[T] and C\in\Pi, draw A_{CB}\sim{\rm Ber}(1/2) independently. Set W_{it}=A_{CB} for (i,t)\in C\times B.
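Definition 3 translates directly into code; a minimal sketch (function and variable names are ours):

```python
import random

def clustered_switchback(clusters, N, T, ell, rng):
    """Sample W in {0,1}^{N x T}: partition [T] into consecutive blocks of
    length ell (the last block may be shorter) and draw one Ber(1/2)
    treatment A_{CB} per (cluster, block) pair."""
    blocks = [range(s, min(s + ell, T)) for s in range(0, T, ell)]
    W = [[0] * T for _ in range(N)]
    for C in clusters:
        for B in blocks:
            a = rng.randint(0, 1)     # A_{CB} ~ Ber(1/2), independent
            for i in C:
                for t in B:
                    W[i][t] = a       # W_{it} = A_{CB}
    return W
```

By construction, W is constant on each cluster-block product set.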

We consider a class of Horvitz-Thompson (HT) style estimators (Horvitz and Thompson, 1952) under a misspecified radius-r exposure mapping, similar to that of Aronow et al. (2017); Leung (2022); Sävje (2024).

Definition 4 (Radius-rr Truncated Horvitz-Thompson (HT) Estimator).

For any radius r\geq 0, define the radius-r truncated exposure mapping as

X_{ita}^{r}(W):=\prod_{t^{\prime}=t-r}^{t}\mathbbm{1}\left(\frac{\sum_{i^{\prime}\in{\cal N}(i)}{\bf 1}(W_{i^{\prime}t^{\prime}}=a)}{|{\cal N}(i)|}\geq 1-\delta\right)

for any i\in U, t\in[T], a\in\{0,1\} and W\in\{0,1\}^{N\times T}. Define the exposure probability as

p_{ita}^{r}=\mathbb{P}[X_{ita}^{r}=1].

Denote

\widehat{Y}_{ita}^{r}=\frac{X_{ita}^{r}}{p_{ita}^{r}}Y_{it}\quad{\rm and}\quad\widehat{\Delta}_{it}^{r}=\widehat{Y}_{it1}^{r}-\widehat{Y}_{it0}^{r}

for i\in U, t\in[T] and a\in\{0,1\}. The radius-r truncated Horvitz-Thompson estimator is given by

\widehat{\Delta}^{r}=\frac{1}{NT}\sum_{(i,t)\in U\times[T]}\widehat{\Delta}_{it}^{r}.

Note that, as in previous literature, Y_{it} and X_{ita}^{r} are not independent, as they both depend on the treatments in the r rounds before t. The truncated HT estimator was proposed in the spatial interference setting (Leung, 2022), and utilizes the framework of misspecified exposure mappings introduced by Aronow et al. (2017); Sävje (2023).
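The estimator of Definition 4 can be sketched as follows. The exposure probabilities p_{ita}^{r} depend on the clustering and block structure; purely as an illustrative device (our implementation choice, not the paper's procedure), they are approximated below by Monte Carlo over fresh draws of the design:

```python
import random

def exposure(W, i, t, a, r, neighbors, delta=0.0):
    """Radius-r truncated exposure X_{ita}^r(W); the window is clipped at
    time 0 for early t."""
    for tp in range(max(0, t - r), t + 1):
        frac = sum(1 for j in neighbors[i] if W[j][tp] == a) / len(neighbors[i])
        if frac < 1 - delta:
            return 0
    return 1

def ht_estimate(W, Y, r, neighbors, sample_W, n_mc=2000):
    """Truncated Horvitz-Thompson estimate of the GATE. Exposure
    probabilities are estimated from n_mc independent redraws of the
    design via sample_W(); cells whose estimated probability is 0 are
    skipped (a crude guard in this sketch)."""
    N, T = len(Y), len(Y[0])
    draws = [sample_W() for _ in range(n_mc)]
    total = 0.0
    for i in range(N):
        for t in range(T):
            for a, sign in ((1, 1.0), (0, -1.0)):
                p = sum(exposure(V, i, t, a, r, neighbors) for V in draws) / n_mc
                if p > 0:
                    total += sign * exposure(W, i, t, a, r, neighbors) * Y[i][t] / p
    return total / (N * T)
```

In practice one would pass the clustered switchback sampler of Definition 3 as sample_W, or compute p_{ita}^{r} in closed form when \delta=0.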

Remark 1.

The radius-r truncated exposure mapping is misspecified in the time dimension, since the treatments from t^{\prime}<t-r could still impact the outcome at time t through the correlation of the state distributions. The "true exposure mapping" is instead

X^{\rm True}_{ita}(W):=\prod_{t^{\prime}=1}^{t}\mathbbm{1}\left(\frac{\sum_{i^{\prime}\in{\cal N}(i)}{\bf 1}(W_{i^{\prime}t^{\prime}}=a)}{|{\cal N}(i)|}\geq 1-\delta\right).

However, the associated true exposure probability is exponentially small in r/\ell, and thus by truncating the neighborhood in the time dimension, the misspecified exposure mapping enjoys a much higher exposure probability. This leads to a natural bias-variance tradeoff in the performance of the truncated Horvitz-Thompson estimator with respect to the choice of r. Moreover, it serves as a good approximation of X^{\rm True}_{ita}, as the rapid-mixing property implies that the correlation across long time scales is weak, and thus limits the impact that treatments from long ago can have on the current outcome.∎

2.4 Dependence Graph

The following will be useful when stating our results in the general form.

Definition 5 (Dependence Graph).

Given a partition \Pi of U, the dependence graph is G_{\Pi}:=(U,E_{\Pi}), where for any i,i^{\prime}\in U (possibly identical), we include an edge (i,i^{\prime}) in E_{\Pi} (and write i\not\perp\!\!\!\perp i^{\prime}) if there is a cluster C\in\Pi s.t. C\cap{\cal N}(i)\neq\emptyset and C\cap{\cal N}(i^{\prime})\neq\emptyset.
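Both the dependence graph and the cluster degrees d_\Pi(i) (Definition 6 below) are computable directly from \Pi and the neighborhoods; a minimal sketch (ours, not the paper's code):

```python
def dependence_graph(clusters, neighbors):
    """Return (E_Pi, d_Pi): the edge set of the dependence graph (self-loops
    included, since i = i' is allowed) and, for each i, the number of
    clusters intersecting N(i)."""
    n = len(neighbors)
    # indices of the clusters whose intersection with N(i) is nonempty
    touching = [set(c for c, C in enumerate(clusters) if set(C) & set(neighbors[i]))
                for i in range(n)]
    E = {(i, j) for i in range(n) for j in range(n) if touching[i] & touching[j]}
    return E, [len(t) for t in touching]
```

For instance, on the path 0-1-2-3 with clusters {0,1} and {2,3}, units 0 and 3 touch disjoint sets of clusters, so (0,3) is not an edge and their estimators are independent (Lemma 1).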

Figure 3: Dependence Graph: The regions correspond to the clusters in a partition \Pi. The neighborhoods of units i,j intersect a common cluster C, so (i,j)\in E_{\Pi} (or i\not\perp\!\!\!\perp j).

The reader should not confuse the dependence graph with the interference graph. The former is, in fact, always a supergraph of the latter. For example, if each cluster in \Pi is a singleton, then the dependence graph is the second power of the interference graph.

The dependence graph has the following useful property: if (i,i^{\prime})\notin E_{\Pi}, then the neighborhoods of i,i^{\prime} do not intersect any common cluster, and hence their outcomes and exposure mappings are independent.

Lemma 1 (Independence for Far-apart Individuals).

Fix a partition \Pi and r\geq 0. Suppose i,i^{\prime}\in U and (i,i^{\prime})\notin E_{\Pi}. Then, for any t,t^{\prime}, we have \widehat{\Delta}^{r}_{it}\perp\!\!\!\perp\widehat{\Delta}_{i^{\prime}t^{\prime}}^{r}.

This result is a consequence of the following facts: (1) \widehat{\Delta}_{it}^{r} depends only on the treatments of the clusters that intersect {\cal N}(i); (2) i\perp\!\!\!\perp i^{\prime} implies {\cal C}_{i}\cap{\cal C}_{i^{\prime}}=\emptyset, where {\cal C}_{i} denotes the collection of clusters that {\cal N}(i) intersects; and (3) the treatments are independently assigned to each cluster.

3 Main Results

3.1 MSE Upper Bound

Proposition 2 (Bias of the HT estimator).

For any r\geq 0, we have

|\mathbb{E}[\widehat{\Delta}^{r}]-\Delta|\leq 2e^{-r/t_{\rm mix}}.

This is reminiscent of the decaying interference assumption in Leung (2022) (albeit on distributions rather than realizations), which inspires us to consider a truncated HT estimator as considered therein. However, their analysis is not readily applicable to our Markovian setting, since their potential outcomes are deterministic.

Definition 6 (Cluster Degree).

Given a clustering \Pi of U, for each i\in U we define

d_{\Pi}(i):=|\{C\in\Pi:C\cap{\cal N}(i)\neq\emptyset\}|.

Recall that i\not\perp\!\!\!\perp i^{\prime} if (i,i^{\prime})\in E_{\Pi}, where E_{\Pi} is the edge set of the dependence graph induced by \Pi.

Proposition 3 (Variance of the HT estimator).

Fix a clustering \Pi of U and any r,\ell\geq 0. Denote p^{\rm min}_{i}=\min_{t,a}\{p_{ita}^{r}\}. Then,

{\rm Var}(\widehat{\Delta}^{r})\lesssim\frac{1+\sigma^{2}}{N^{2}T}\left(t_{\rm mix}e^{-\frac{\ell+r}{t_{\rm mix}}}\sum_{i\in U}d_{\Pi}(i)+\sum_{i\not\perp\!\!\!\perp i^{\prime}}\frac{r+\ell}{p_{i}^{\rm min}p_{i^{\prime}}^{\rm min}}\right).

In Section 4 we will further simplify the above by considering natural classes of graphs.

Taken together, we deduce that for any fixed Π\Pi, we have:

Theorem 1 (MSE Upper Bound).

Suppose \ell=r=t_{\rm mix}\log(NT). Then

{\rm MSE}(\widehat{\Delta}^{r})\lesssim\frac{(1+\sigma^{2})\cdot t_{\rm mix}\log(NT)}{N^{2}T}\sum_{i\not\perp\!\!\!\perp i^{\prime}}\frac{1}{p_{i}^{\rm min}p_{i^{\prime}}^{\rm min}}.

We will soon see that for "sparse" graphs or geometric graphs, the summation becomes O(N), and thus the MSE becomes \tilde{O}(1/NT).

3.2 Discussions

We address some questions that the readers may have at this point.

1. Can we reduce to Leung (2022)? We emphasize that our Theorem 1 is not implied by Leung (2022). While the rapid mixing property implies that the temporal interference decays exponentially across time, which seems to align with Assumption 3 of Leung (2022), they critically assume that

Y_{it}\perp\!\!\!\perp Y_{i^{\prime}t^{\prime}}\mid W\quad\forall i,i^{\prime},t,t^{\prime},

which does not hold in our setting: the outcomes in our Markovian setting are not independent (albeit having weak covariance) over time, even conditioned on the treatment assignment.

2. Comparison with Hu and Wager (2022). Independently, Hu and Wager (2022) obtained a \tilde{O}(t_{\rm mix}/T) MSE for N=1, using a bias-corrected estimator which is similar to our HT estimator with r=\ell. Our analysis is more general, as it handles cases where \ell\neq r. This is a significant distinction since, in practice, the block length \ell is "externally" chosen, say, by an online platform, government, or nature, e.g., \ell=\Theta(T) or O(1).

Proposition 3 provides insights on how to select the best r for a given \ell. For example, consider N=1 and \ell=1. Then, Propositions 2 and 3 combined lead to an MSE of T^{-t_{\rm mix}/(t_{\rm mix}+O(1))} if

r=\frac{t_{\rm mix}\log T}{t_{\rm mix}+O(1)}.

3.3 Practical Concern: Small Exposure Probability

So far, we have stated our Theorem 1 in terms of the minimal exposure probabilities p_{i}^{\rm min}. Intuitively, smaller values of these probabilities lead to higher variance and worse MSE bounds. We next present lower bounds on these probabilities in terms of \delta (as in the \delta-FNE, Assumption 3).

Proposition 4 (Lower Bound of Exposure Probabilities).

Denote the entropy function H(\delta):=\delta\log\frac{1}{\delta}+(1-\delta)\log\frac{1}{1-\delta}. Then,

p_{i}^{\rm min}\geq\left(\frac{2^{-(1-H(\delta))d_{\Pi}(i)}}{\sqrt{2\pi\delta(1-\delta)}}\right)^{1+\lceil\frac{r}{\ell}\rceil}.

To see the intuition, consider T=1, d_{\Pi}(i)=5 and \delta=0.2. For simplicity, assume that all clusters intersecting {\cal N}(i) have the same cardinality. Then, the exposure mapping X_{ita}^{r} equals 1 if at least (1-0.2)\times 5=4 of these clusters are assigned a. Thus, the exposure probability is

p_{ita}^{r}=\left({5\choose 0}+{5\choose 1}\right)\times\left(\frac{1}{2}\right)^{5}=0.1875.

Therefore, there is a 18.75\%\times 2=37.5\% probability (multiplying by two to account for both a=0 and a=1) that we will keep each data point. More generally, by Stirling's formula, for any \delta\geq 0 s.t. \delta d is an integer,

S(d,\delta)=\sum_{j=0}^{\delta d}{d\choose j}\geq{d\choose\delta d}\geq\frac{2^{dH(\delta)}}{\sqrt{2\pi\delta(1-\delta)}}.

It follows that

p_{i}^{\rm min}\geq\frac{S(d,\delta)}{2^{d}}\geq\frac{2^{-(1-H(\delta))d}}{\sqrt{8d\delta(1-\delta)}},\quad{\rm where}\quad d=d_{\Pi}(i)\cdot\left(1+\left\lceil\frac{r}{\ell}\right\rceil\right).
Remark 2.

It is straightforward to verify that with \delta=0 (i.e., the "exact" neighborhood condition), we have

p^{\rm min}_{i}=2^{-d_{\Pi}(i)(1+\lceil r/\ell\rceil)}

for each i\in U. In particular, if \ell=r, the above becomes 2^{-2d_{\Pi}(i)}. However, this probability may be too low to be practical: if d_{\Pi}(i)=5 for most units i, we keep only about 2\times 2^{-10}\approx 0.2\% of the data. Proposition 4 improves this exponential dependence (for \delta>0) via the H(\delta) term in the exponent.
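The exposure probabilities above are cheap to verify numerically. The following standalone Python sketch (ours, not the paper's code) computes the exact exposure probability in the equal-cardinality case and the entropy-type lower bound; the constant \sqrt{8d\delta(1-\delta)} in the denominator is the rigorous Cover–Thomas-style choice, an assumption on our part.

```python
import math

def exposure_prob(d, delta):
    # Exact P[at most floor(delta*d) of d fair-coin clusters deviate from a],
    # i.e. the chance that a unit is delta-fractionally exposed to treatment a
    # when all d clusters touching its neighborhood have equal cardinality.
    k = math.floor(delta * d)
    return sum(math.comb(d, j) for j in range(k + 1)) / 2 ** d

def entropy_bound(d, delta):
    # 2^{-(1 - H(delta)) d} / sqrt(8 d delta (1 - delta)), with H the binary
    # entropy in bits; the sqrt(8 d ...) constant is our (rigorous) choice.
    H = delta * math.log2(1 / delta) + (1 - delta) * math.log2(1 / (1 - delta))
    return 2 ** (-(1 - H) * d) / math.sqrt(8 * d * delta * (1 - delta))

# Worked example from the text: d = 5 clusters, delta = 0.2.
p = exposure_prob(5, 0.2)  # (C(5,0) + C(5,1)) / 2^5 = 6/32 = 0.1875
```

For d=5 and \delta=0.2 the bound evaluates to roughly 0.15, below the exact value 0.1875, and the gap narrows as d grows.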

4 Implications on Special Clusterings

Let us simplify Theorem 1 for specific clusterings. Unless stated otherwise, we take \delta=0 and \sigma=1 to highlight the key parameters.

Corollary 1 (No Interference).

Suppose the interference graph has no edges. Then, for the clustering \Pi_{\rm sgtn}, where each cluster is a singleton set, and \ell=r=t_{\rm mix}\log(NT), we have

{\rm MSE}(\widehat{\Delta}^{r})\lesssim\frac{t_{\rm mix}\log(NT)}{NT}.

The following holds for any interference graph.

Corollary 2 (Pure Switchback).

Consider the clustering \Pi_{\rm whole} where all individuals are in one cluster. For \ell=r=t_{\rm mix}\log T, we have

{\rm MSE}(\widehat{\Delta}^{r})\lesssim\frac{t_{\rm mix}\log T}{T}.
Remark 3.
As we recall from Definition 3, when N=1 our block-based randomized design coincides with the switchback design of Hu and Wager (2022): partition the time horizon into blocks and assign a random treatment independently to each block. They focus on a class of difference-in-means (DIM) estimators, which compute the difference in average outcomes between blocks assigned to treatment vs. control, ignoring data from time points that are too close to the block boundary (the burn-in period). While they show that vanilla DIM estimators are limited to an MSE of T^{-2/3}, our results show that the truncated Horvitz-Thompson estimator attains the optimal MSE, matching the improved rate of their bias-corrected estimator. ∎

Now, we consider graphs with bounded degree.

Corollary 3 (Bounded-degree Graphs).

Let d be the maximum degree of G. Then, for the partition \Pi=\Pi_{\rm sgtn} and \ell=r=t_{\rm mix}\log(NT),

{\rm MSE}(\widehat{\Delta}^{r})\lesssim(1+\sigma^{2})t_{\rm mix}2^{4d}(NT)^{-1}\log(NT).

The above bound has an unfavorable exponential dependence on d. This motivated Ugander et al. (2013) to introduce the following condition, which requires that the number of r-hop neighbors of each node grows at most geometrically with common ratio \kappa. Denote by d_{\rm hop}(\cdot,\cdot) the hop distance.

Definition 7 (Restricted Growth Coefficient).

A graph G has a restricted growth coefficient (RGC) of \kappa\geq 1 if

|{\cal N}_{r+1}(i)|\leq\kappa\cdot|{\cal N}_{r}(i)|,\quad\forall r\geq 1,\ i\in U,\quad{\rm where}\quad{\cal N}_{r}(i)=\{j\in U:d_{\rm hop}(i,j)\leq r\}.

Example. A d-spider graph consists of a root node attached to d paths, each of length n; this graph has an RGC of \kappa=2. Another example is a social network that is globally sparse but locally dense. ∎
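The spider example can be checked directly by counting r-hop neighborhoods via BFS. A minimal sketch (our illustration; `spider` and `hop_neighborhood_sizes` are our hypothetical helpers, not from the paper):

```python
from collections import deque

def hop_neighborhood_sizes(adj, i, rmax):
    # |N_r(i)| for r = 0..rmax, where N_r(i) = nodes within hop distance r,
    # computed by breadth-first search over the adjacency-list graph `adj`.
    dist = {i: 0}
    q = deque([i])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return [sum(1 for x in dist.values() if x <= r) for r in range(rmax + 1)]

def spider(d, n):
    # d paths of length n attached to a root node 0; nodes are 0..d*n.
    adj = {0: []}
    node = 1
    for _ in range(d):
        prev = 0
        for _ in range(n):
            adj.setdefault(prev, []).append(node)
            adj[node] = [prev]
            prev = node
            node += 1
    return adj
```

From the root of a d-spider, |N_r| = 1 + d·r for r ≤ n, so |N_{r+1}|/|N_r| ≤ 2 for every r ≥ 1, consistent with \kappa=2 in Definition 7.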

Ugander and Yin (2023) showed that in a \kappa-RGC graph, under their randomized graph cluster randomization (RGCR), the exposure probability of each unit is at least \frac{1}{2(d+1)\kappa} for T=1 (see their Theorem 4.2). As a result, the MSE of the HT estimator is polynomial in d and \kappa. By considering the product of their pure-spatial design and a uniform partition of the time horizon, it is straightforward to generalize their result as follows.

Theorem 2 (Ugander and Yin (2023), Generalized).

Under the 1-hop-max random clustering on a \kappa-RGC graph, for any i\in U and r,\ell>0, we have

p^{\rm min}_{i}\geq\left(\frac{1}{2(1+d)\kappa}\right)^{1+\lceil r/\ell\rceil}.

Combining with Theorem 1, we obtain the following.

Corollary 4 (Restricted-Growth Graphs).

Suppose G satisfies the \kappa-RGC condition and has maximum degree d. Then, using the 1-hop-max random clustering in Ugander and Yin (2023), with r=\ell=t_{\rm mix}\log(NT), we have

{\rm MSE}(\widehat{\Delta}^{r})\lesssim d^{2}\kappa^{4}\cdot\frac{t_{\rm mix}\log(NT)}{NT}.
Remark 4.

When T=1, this matches Theorem 4.7 of Ugander and Yin (2023). Moreover, the above is stronger than Corollary 3 if \kappa\ll d. For example, the spider graph has \kappa=2, so the MSE improves exponentially in d. ∎

Now we consider spatially derived graphs. Suppose that the units are embedded into a \sqrt{N}\times\sqrt{N} lattice. We assume that the transitions and outcomes at a node can interfere with nodes within hop distance h. In other words, we include an edge (i,j) in the interference graph G if d_{\rm hop}(i,j)\leq h.

We achieve an \tilde{O}(h^{2}/NT) MSE as follows. Consider the following natural clustering: for any s>0, denote by \Pi_{s} the uniform partition of the \sqrt{N}\times\sqrt{N} lattice into square-shaped clusters of size s\times s. Then:

Corollary 5 (hh-neighborhood Interference).

For \Pi=\Pi_{2h}, we have 1/p_{i}^{\rm min}=2^{O(\lceil\frac{r}{\ell}\rceil)} for any i\in U. Consequently, with s=2h and \ell=r=t_{\rm mix}\log(NT),

{\rm MSE}(\widehat{\Delta}^{r})\lesssim(1+\sigma^{2})h^{2}t_{\rm mix}\cdot(NT)^{-1}\log(NT).
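The clustering \Pi_{s} is simple to realize in code. A small sketch (ours; we take hop distance on the lattice to be Manhattan distance, the graph distance of the 4-neighbor lattice): with s=2h, the (2h+1)-wide neighborhood of a unit spans at most two clusters per axis, hence at most four clusters in total, which is why d_{\Pi}(i) stays bounded.

```python
def square_cluster(x, y, s):
    # Cluster id of lattice point (x, y) under the uniform s-by-s partition Pi_s.
    return (x // s, y // s)

def clusters_touching_neighborhood(x, y, h, s, side):
    # Clusters of Pi_s intersecting the h-hop neighborhood of (x, y) on a
    # side-by-side lattice; hop distance is Manhattan distance |dx| + |dy|.
    cs = set()
    for dx in range(-h, h + 1):
        for dy in range(-h, h + 1):
            nx, ny = x + dx, y + dy
            if abs(dx) + abs(dy) <= h and 0 <= nx < side and 0 <= ny < side:
                cs.add(square_cluster(nx, ny, s))
    return cs
```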

To complement the above, we want to point out that it is not hard to show the following lower bound:

Theorem 3 (MSE Lower Bound).

For any N,T\geq 1, if the interference graph has no edges, then {\rm MSE}(\widehat{\Delta})=\Omega(1/NT) for any estimator \widehat{\Delta} under any (possibly adaptive) design.

While this shows that the dependence on N and T is optimal, the lower bound does not indicate the optimal dependence on problem-dependent parameters. It would be valuable for future work to investigate whether tighter lower bounds can capture the optimal dependence on the spatial and temporal interference structure.

5 Simulation Study

5.1 Single-unit Setting (N=1N=1)

Our Theorem 1 states that the optimal MSE rate is achieved when the block length and HT radius are both O(t_{\rm mix}\log T). We next demonstrate the efficacy of this design-estimator combination through experiments.

Figure 4: (a) MSE, Stationary
Figure 5: (b) MSE, Non-stationary
Figure 6: (c) Bias, Stationary
Figure 7: (d) Bias, Non-stationary
Figure 8: Comparison of Designs and Estimators. In a stationary environment, both DIMBI and HT-large exhibit performance similar to HT-opt (see a, c). However, this is no longer the case when the environment is non-stationary (see b), as both methods suffer high bias (see d): DIMBI cannot detect signals at the beginning of each block, while HT-large can misinterpret underlying non-stationarity as a treatment effect.

Our MDP. The state evolves according to a clipped random walk with a stationary transition kernel. Specifically, the states are integers with absolute value at most m=30. Under treatment 1, we flip a coin with heads probability 0.9 and move up one unit on heads and down one unit on tails, except at the "boundary states" \pm m, where we stay put if the coin toss would move us outside. The reward function is non-stationary over time and depends only on the state. Specifically, letting (\alpha_{t})_{t\in[T]},(\beta_{t})_{t\in[T]} be two sequences of real numbers, we define \mu_{t}(s,a)=\alpha_{t}+\beta_{t}\frac{s}{m} for each s\in\{-m,\dots,m\}, a\in\{0,1\} and t\in[T].
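One transition of this chain can be sketched as follows (a minimal illustration; the heads probability 0.1 under treatment 0 is our symmetric assumption, as the text spells out only the treatment-1 case):

```python
import random

M = 30  # boundary states +/- M

def step(s, a, rng):
    # Heads probability 0.9 under treatment 1; 0.1 under treatment 0 is our
    # symmetric assumption (the text specifies only the treatment-1 case).
    p_up = 0.9 if a == 1 else 0.1
    move = 1 if rng.random() < p_up else -1
    return min(M, max(-M, s + move))  # stay put if the move would exit [-M, M]

def reward(s, t, alpha, beta):
    # mu_t(s, a) = alpha_t + beta_t * s / M, independent of the action a.
    return alpha[t] + beta[t] * s / M

rng = random.Random(0)
s = 0
for _ in range(500):
    s = step(s, 1, rng)  # under treatment 1 the walk drifts toward +M
```

Under all-1 treatment the chain drifts to the upper boundary, which is what makes short blocks (relative to the mixing time) uninformative.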

Figure 9: (a) T=N case
Figure 10: (b) N=\sqrt{T} case
Figure 11: (c) T=\sqrt{N} case
Figure 12: Clustered Switchback Has Faster Rates. We compare clustered switchback experiments with "pure" A/B tests (i.e., randomize only over space) and "pure" switchback (i.e., randomize only over time). Across different scalings of N and T, the clustered switchback design consistently outperforms both benchmarks.

The DIMBI estimator. We compare with the Difference-In-Means with Burn-In (DIMBI) estimator of Hu and Wager (2022), which discards the first b ("burn-in") observations in each block and computes the difference in mean outcomes over the remaining observations. Formally, let \ell be the block length, W be a treatment vector, and \tau(t):=t-\ell\lfloor\frac{t-1}{\ell}\rfloor\in\{1,\dots,\ell\} be the position of t within its block. For each b\in(0,\ell), we define

\Delta_{\rm DIMBI}^{b}=\frac{\sum_{t=1}^{T}Y_{t}\cdot{\bf 1}(W_{t}=1)\cdot{\bf 1}(\tau(t)>b)}{\sum_{t=1}^{T}{\bf 1}(W_{t}=1)\cdot{\bf 1}(\tau(t)>b)}-\frac{\sum_{t=1}^{T}Y_{t}\cdot{\bf 1}(W_{t}=0)\cdot{\bf 1}(\tau(t)>b)}{\sum_{t=1}^{T}{\bf 1}(W_{t}=0)\cdot{\bf 1}(\tau(t)>b)}.
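DIMBI takes only a few lines to implement. A standalone sketch (ours, not the authors' code; we take the within-block position of time t to be t-\ell\lfloor(t-1)/\ell\rfloor\in\{1,\dots,\ell\}, our reading of the burn-in condition):

```python
def dimbi(Y, W, ell, b):
    # Difference-in-means with burn-in: drop the first b observations of each
    # length-ell block, then difference the treated and control means.
    kept_1, kept_0 = [], []
    for t in range(1, len(Y) + 1):
        pos = t - ell * ((t - 1) // ell)  # within-block position, in {1,...,ell}
        if pos > b:
            (kept_1 if W[t - 1] == 1 else kept_0).append(Y[t - 1])
    return sum(kept_1) / len(kept_1) - sum(kept_0) / len(kept_0)
```

With b=0 this reduces to the plain difference-in-means (DIM) estimator.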

5.1.1 Benchmarks

We compare the MSE and bias of the following five design-estimator combinations. Note that our MDP has t_{\rm mix}=\Theta(m), so we choose \ell_{\rm OPT}=r_{\rm OPT}=30\log T. We compare:
(1) HT-opt: HT estimator with block length \ell_{\rm OPT} and radius r_{\rm OPT};
(2) DIM: difference-in-means estimator (i.e., DIMBI with burn-in b=0) with block length \ell_{\rm OPT};
(3) DIMBI: burn-in b=\frac{1}{2}\ell_{\rm OPT}, block length \ell_{\rm OPT};
(4) HT-small: HT estimator with block length \ell_{\rm small}=8 and radius r=3\ell_{\rm small}. Here we do not choose r\sim t_{\rm mix}\log T since the resulting exposure probability, of order 2^{-r/\ell_{\rm small}}, would be too small for the estimator to produce meaningful results;
(5) HT-large: HT estimator with radius r_{\rm OPT} under the large block length \ell_{\rm large}=T/8.

Randomly Generated Instances. For \ell=\ell_{\rm OPT},\ell_{\rm small},\ell_{\rm large}, we randomly generate 100 pairs of sequences (\alpha_t),(\beta_t). We consider both stationary and non-stationary settings:
(a) Stationary (Fig. 8 (a), (c)): Set \alpha_t=0 and \beta_t=1+0.2\epsilon_t, where \epsilon_t\sim U(0,1) i.i.d.
(b) Non-stationary: We introduce both large-scale and small-scale non-stationarity. We first generate a piecewise constant drift: partition [T] uniformly into 8 pieces and draw the value on each piece independently from U(0,1). Then, to generate local non-stationarity, we partition [T] uniformly into pieces of length \ell_{\rm OPT} and set \beta_t=0 whenever t lies in the final \rho fraction of its piece.
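The non-stationary instance generator can be sketched as follows (our illustration; the function names are ours):

```python
import random

def drift(T, pieces, rng):
    # Piecewise-constant drift: `pieces` equal intervals, i.i.d. U(0,1) levels.
    levels = [rng.random() for _ in range(pieces)]
    return [levels[min(t * pieces // T, pieces - 1)] for t in range(T)]

def nonstationary_instance(T, ell_opt, rho, rng):
    # alpha: 8-piece drift; beta: 1 + 0.2 U(0,1), zeroed on the final rho
    # fraction of each length-ell_opt piece (local non-stationarity).
    alpha = drift(T, 8, rng)
    beta = [1.0 + 0.2 * rng.random() for _ in range(T)]
    for t in range(T):
        if (t % ell_opt) >= (1.0 - rho) * ell_opt:
            beta[t] = 0.0
    return alpha, beta
```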

5.1.2 Results and Insights

For each instance and block length (\ell_{\rm OPT}, \ell_{\rm small}, \ell_{\rm large}), we draw 100 treatment vectors and visualize the MSE and bias; error bars for bias are 95% confidence intervals. We observe the following.
(a) MSE Rates. HT-opt has the lowest MSE in both the stationary and non-stationary settings. Moreover, its MSE curve in the log-log plots has a slope close to -1, which validates the theoretical 1/T MSE bound. In contrast, HT-large and DIMBI both perform well in the stationary setting (with a slope close to -1) but fail in the non-stationary setting.
(b) DIM(BI) Has Large Bias. DIMBI suffers large bias for both small and large b. For small b, DIMBI uses data collected before the chain has mixed sufficiently (even in a stationary environment) and therefore suffers large bias. A large b performs decently when the environment is stationary, but suffers high bias in the presence of non-stationarity: it discards data blindly and may miss out on useful signals at the beginning of each block.
(c) Large \ell Leads to High Variance. With a large block length, the Markov chain can mix sufficiently and provide reliable data points. However, the estimator may mistake external non-stationarity (\alpha_t) for a treatment effect. For example, consider \alpha_t={\bf 1}(t\leq T/2) and \beta_t\equiv 0, so the ATE is 0. If there are only two blocks and each is assigned a distinct treatment (which occurs w.p. 1/2), then the estimated ATE is non-zero.

5.2 Multi-unit Setting (General NN)

Next, we show that in the presence of both spatial and temporal interference, clustered switchback outperforms both “pure” switchback (i.e., only partition time) and “pure” A/B test (i.e., only partition space).

MDP. Suppose that the N units lie on an unweighted line graph. Each unit's state follows the random walk capped at \pm m=\pm 30, as in the single-unit setting. To generate spatial interference, we assume that the move-up probability p_{\rm up}(i,t) of unit i at time t is

p_{\rm up}(i,t)=0.1+0.8\cdot\frac{1}{2h+1}\sum_{j:d_{\rm hop}(i,j)\leq h}{\bf 1}(W_{jt}=1).

In particular, if all h-hop neighbors are assigned treatment 0 (resp. 1), then p_{\rm up}=0.1 (resp. 0.9). In this setting, the exposure mapping for treatment a equals 1 if and only if all h-hop neighbors are assigned a in the previous r rounds. As suggested by Corollary 5, we choose r=30\log(NT).
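A direct transcription of this transition probability (our sketch; note the divisor 2h+1 is kept even for boundary units with fewer neighbors, as in the displayed formula):

```python
def p_up(W_t, i, h):
    # Move-up probability of unit i on a line graph given this round's
    # treatment vector W_t: 0.1 + 0.8/(2h+1) * (# treated h-hop neighbors,
    # including i itself, since d_hop(i, i) = 0 <= h).
    N = len(W_t)
    treated = sum(W_t[j] for j in range(max(0, i - h), min(N, i + h + 1)))
    return 0.1 + 0.8 * treated / (2 * h + 1)
```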

Reward Function. As in the single-unit setting, we choose \mu_{it}(s,a)=\alpha_{it}+\beta_{it}\frac{s}{m}, where \alpha_{it} captures large-scale heterogeneity and \beta_{it} models user features. To generate the \alpha_{it}'s, we partition [N]\times[T] uniformly into pieces of size N/8\times T/8 and draw the value on each piece independently from U(0,1). We also set \beta_{it}=1+0.2\epsilon_{it}, where \epsilon_{it}\sim U(0,1) i.i.d.

Benchmarks. We partition the space-time grid [N]\times[T] into "boxes" of spatial width w and temporal length \ell. We compare the performance of the HT estimator under the following designs.
(1) Pure Switchback Test: w=N, \ell=30\log T (rate-optimal block length for switchback);
(2) Pure A/B Test: w=h (rate-optimal width for a pure A/B test, see Corollary 5), \ell=T;
(3) Clustered Switchback Test: \ell=30\log T, w=h (rate-optimal width and length).
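All three designs are instances of one box-randomization routine that differs only in the choice of (w, \ell). A minimal sketch (ours, not the paper's code):

```python
import random

def clustered_switchback(N, T, w, ell, rng):
    # One i.i.d. fair-coin treatment per (cluster, block) box of spatial
    # width w and temporal length ell; returns the N x T treatment matrix.
    n_clusters = -(-N // w)   # ceil(N / w)
    n_blocks = -(-T // ell)   # ceil(T / ell)
    Z = [[rng.randint(0, 1) for _ in range(n_blocks)] for _ in range(n_clusters)]
    return [[Z[i // w][t // ell] for t in range(T)] for i in range(N)]
```

Pure switchback corresponds to w=N, a pure A/B test to \ell=T.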

Discussion. For each T, we randomly generate 100 instances and 200 treatment vectors. When N=T, the MSE of clustered switchback decreases most rapidly: the slope of its curve in the log-log plot is -1.89, close to the theoretical value -2. Clustered switchback also outperforms the other two designs in the other two scalings.

Finally, let us compare pure A/B with pure switchback. Theoretically, they have MSE rates of 1/N and 1/T, respectively. Consistent with this, our empirical study shows that the MSE of the pure A/B test decreases more slowly than that of pure switchback when N=\sqrt{T}, and faster when N=T^{2}.

Figure 13: MSE for m=10
Figure 14: MSE for m=100

We repeat the five-curve comparison for different choices of m; recall that the main body uses m=30. We now take m=10 and m=100. As a key observation, the performance of HT-small depends heavily on m: it works well for small m but not for large m. This is because the mixing time is roughly linear in m, so for large m the chain needs more time to mix, which a small block length cannot provide. Indeed, the exposure probability is O(2^{-t_{\rm mix}/\ell}); for instance, with m=300 and \ell=8 we would have t_{\rm mix}/\ell>m/\ell=37.5, so the exposure mapping discards almost all data points, leading to high variance.

References

  • Aronow et al. [2017] P. M. Aronow, C. Samii, et al. Estimating average causal effects under general interference, with application to a social network experiment. The Annals of Applied Statistics, 11(4):1912–1947, 2017.
  • Basse and Airoldi [2018] G. W. Basse and E. M. Airoldi. Model-assisted design of experiments in the presence of network-correlated outcomes. Biometrika, 105(4):849–858, 2018.
  • Basse et al. [2019] G. W. Basse, A. Feller, and P. Toulis. Randomization tests of causal effects under interference. Biometrika, 106(2):487–494, 2019.
  • Blake and Coey [2014] T. Blake and D. Coey. Why marketplace experimentation is harder than it seems: The role of test-control interference. In Proceedings of the fifteenth ACM conference on Economics and computation, pages 567–582, 2014.
  • Bojinov et al. [2023] I. Bojinov, D. Simchi-Levi, and J. Zhao. Design and analysis of switchback experiments. Management Science, 69(7):3759–3777, 2023.
  • Cai et al. [2015] J. Cai, A. De Janvry, and E. Sadoulet. Social networks and the decision to insure. American Economic Journal: Applied Economics, 7(2):81–108, 2015.
  • Chin [2019] A. Chin. Regression adjustments for estimating the global treatment effect in experiments with interference. Journal of Causal Inference, 7(2), 2019.
  • Eckles et al. [2017] D. Eckles, B. Karrer, and J. Ugander. Design and analysis of experiments in networks: Reducing bias from interference. Journal of Causal Inference, 5(1):20150021, 2017.
  • Farias et al. [2022] V. Farias, A. Li, T. Peng, and A. Zheng. Markovian interference in experiments. Advances in Neural Information Processing Systems, 35:535–549, 2022.
  • Forastiere et al. [2021] L. Forastiere, E. M. Airoldi, and F. Mealli. Identification and estimation of treatment and interference effects in observational studies on networks. Journal of the American Statistical Association, 116(534):901–918, 2021.
  • Glynn et al. [2020] P. W. Glynn, R. Johari, and M. Rasouli. Adaptive experimental design with temporal interference: A maximum likelihood approach. Advances in Neural Information Processing Systems, 33:15054–15064, 2020.
  • Gui et al. [2015] H. Gui, Y. Xu, A. Bhasin, and J. Han. Network a/b testing: From sampling to estimation. In Proceedings of the 24th International Conference on World Wide Web, pages 399–409. International World Wide Web Conferences Steering Committee, 2015.
  • Horvitz and Thompson [1952] D. G. Horvitz and D. J. Thompson. A generalization of sampling without replacement from a finite universe. Journal of the American statistical Association, 47(260):663–685, 1952.
  • Hu and Wager [2022] Y. Hu and S. Wager. Switchback experiments under geometric mixing. arXiv preprint arXiv:2209.00197, 2022.
  • Jagadeesan et al. [2020] R. Jagadeesan, N. S. Pillai, and A. Volfovsky. Designs for estimating the treatment effect in networks with interference. 2020.
  • Jiang and Li [2016] N. Jiang and L. Li. Doubly robust off-policy value evaluation for reinforcement learning. In International Conference on Machine Learning, pages 652–661. PMLR, 2016.
  • Johari et al. [2022] R. Johari, H. Li, I. Liskovich, and G. Y. Weintraub. Experimental design in two-sided platforms: An analysis of bias. Management Science, 68(10):7069–7089, 2022.
  • Kohavi and Thomke [2017] R. Kohavi and S. Thomke. The surprising power of online experiments. Harvard business review, 95(5):74–82, 2017.
  • Larsen et al. [2023] N. Larsen, J. Stallrich, S. Sengupta, A. Deng, R. Kohavi, and N. T. Stevens. Statistical challenges in online controlled experiments: A review of a/b testing methodology. The American Statistician, pages 1–15, 2023.
  • Leung [2022] M. P. Leung. Rate-optimal cluster-randomized designs for spatial interference. The Annals of Statistics, 50(5):3064–3087, 2022.
  • Leung [2023] M. P. Leung. Network cluster-robust inference. Econometrica, 91(2):641–667, 2023.
  • Li and Wager [2022] S. Li and S. Wager. Network interference in micro-randomized trials. arXiv preprint arXiv:2202.05356, 2022.
  • Li et al. [2023] S. Li, R. Johari, X. Kuang, and S. Wager. Experimenting under stochastic congestion. arXiv preprint arXiv:2302.12093, 2023.
  • Li et al. [2021] W. Li, D. L. Sussman, and E. D. Kolaczyk. Causal inference under network interference with noise. arXiv preprint arXiv:2105.04518, 2021.
  • Manski [2013] C. F. Manski. Identification of treatment response with social interactions. The Econometrics Journal, 16(1):S1–S23, 2013.
  • Ni et al. [2023] T. Ni, I. Bojinov, and J. Zhao. Design of panel experiments with spatial and temporal interference. Available at SSRN 4466598, 2023.
  • Sävje [2023] F. Sävje. Causal inference with misspecified exposure mappings: separating definitions and assumptions. Biometrika, page asad019, 2023.
  • Sävje [2024] F. Sävje. Causal inference with misspecified exposure mappings: separating definitions and assumptions. Biometrika, 111(1):1–15, 2024.
  • Shi et al. [2023] C. Shi, X. Wang, S. Luo, H. Zhu, J. Ye, and R. Song. Dynamic causal effects evaluation in a/b testing with a reinforcement learning framework. Journal of the American Statistical Association, 118(543):2059–2071, 2023.
  • Sneider and Tang [2019] C. S. Sneider and Y. Tang. Experiment rigor for switchback experiment analysis. 2019. URL https://careersatdoordash.com/blog/experiment-rigor-for-switchback-experiment-analysis/.
  • Sussman and Airoldi [2017] D. L. Sussman and E. M. Airoldi. Elements of estimation theory for causal effects in the presence of network interference. arXiv preprint arXiv:1702.03578, 2017.
  • Thomas and Brunskill [2016] P. Thomas and E. Brunskill. Data-efficient off-policy policy evaluation for reinforcement learning. In International Conference on Machine Learning, pages 2139–2148. PMLR, 2016.
  • Thomke [2020] S. Thomke. Building a culture of experimentation. Harvard Business Review, 98(2):40–47, 2020.
  • Toulis and Kao [2013] P. Toulis and E. Kao. Estimation of causal peer influence effects. In International conference on machine learning, pages 1489–1497. PMLR, 2013.
  • Ugander and Yin [2023] J. Ugander and H. Yin. Randomized graph cluster randomization. Journal of Causal Inference, 11(1):20220014, 2023.
  • Ugander et al. [2013] J. Ugander, B. Karrer, L. Backstrom, and J. Kleinberg. Graph cluster randomization: Network exposure to multiple universes. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 329–337, 2013.
  • Wang [2021] Y. Wang. Causal inference with panel data under temporal and spatial interference. arXiv preprint arXiv:2106.15074, 2021.

Appendix A Omitted Proofs in Section 3

A.1 Proof of Theorem 3: MSE Lower Bound

Consider the following two instances {\cal I},{\cal I}^{\prime}. Suppose the interference graph has no edges (and hence there is no interference between units). We also assume there is only one state and suppress s in the notation. The outcomes follow Bernoulli distributions. In {\cal I}, we have \mu_{it}(a)=\frac{1}{2} for any i,t and treatment a. In {\cal I}^{\prime}, for any i,t, the reward functions are given by

\mu_{it}(0)=\frac{1}{2}\quad\text{and}\quad\mu_{it}(1)=\frac{1}{2}+\epsilon.

The ATEs in these two instances are 0 and \epsilon, respectively.

Now fix any design W (i.e., a random vector taking values in \{0,1\}^{N\times T}). Let \mathbb{P},\mathbb{P}^{\prime} be the probability measures induced by the two instances, and \mathbb{P}_{it},\mathbb{P}_{it}^{\prime} the corresponding marginals. Note that the outcome distributions are Bernoulli, so D_{\rm KL}(\mathbb{P}_{it},\mathbb{P}_{it}^{\prime})\leq 2\epsilon^{2} for each (i,t). Therefore,

D_{\rm KL}(\mathbb{P},\mathbb{P}^{\prime})\leq 2\epsilon^{2}NT. \qquad (2)

In particular, for \epsilon=1/(4\sqrt{NT}), the right-hand side of (2) is at most \frac{1}{8}.

To conclude, consider any estimator \widehat{\Delta} and the event E=\{\widehat{\Delta}>\frac{\epsilon}{2}\}. By Pinsker's inequality,

|\mathbb{P}(E)-\mathbb{P}^{\prime}(E)|\leq\sqrt{2D_{\rm KL}(\mathbb{P},\mathbb{P}^{\prime})}\leq\frac{1}{2}.

Therefore, either \mathbb{P}[E]>\frac{1}{4} or \mathbb{P}^{\prime}[E^{c}]>\frac{1}{4}. Hence,

\max\left\{\mathbb{E}[(\widehat{\Delta}-\Delta)^{2}],\ \mathbb{E}^{\prime}[(\widehat{\Delta}-\Delta)^{2}]\right\}\geq\frac{1}{4}\left(\frac{\epsilon}{2}\right)^{2}\geq\frac{1}{256NT}.

A.2 Proof of Corollary 1

Observe that for the singleton partition, i\not\perp\!\!\!\perp i^{\prime} if and only if i=i^{\prime}. Moreover, since \ell=r, the interval [t-r,t] intersects at most two time blocks, so p_{i}^{\rm min}\geq\frac{1}{4} for any i\in U. Thus,

\sum_{(i,i^{\prime}):i\not\perp\!\!\!\perp i^{\prime}}\frac{1}{p_{i}^{\rm min}\cdot p_{i^{\prime}}^{\rm min}}=\sum_{(i,i^{\prime}):i=i^{\prime}}\frac{1}{p_{i}^{\rm min}\cdot p_{i^{\prime}}^{\rm min}}\leq\sum_{i}4\times 4=16N.

Therefore, by Theorem 1,

MSE(Δ^r)tmixN2T16NtmixNT.{\rm MSE}(\widehat{\Delta}^{r})\lesssim\frac{t_{\rm mix}}{N^{2}T}\cdot 16N\lesssim\frac{t_{\rm mix}}{NT}.

A.3 Proof of Corollary 2

Since \ell=r, the interval [t-r,t] intersects at most two time blocks. Thus, p_{i}^{\rm min}\geq\frac{1}{4} for any i\in U. Therefore,

{\rm MSE}(\widehat{\Delta}^{r})\lesssim\frac{t_{\rm mix}}{N^{2}T}\sum_{i,i^{\prime}}4\times 4\leq\frac{16t_{\rm mix}}{N^{2}T}{N\choose 2}\lesssim\frac{t_{\rm mix}}{T}.

A.4 Proof of Corollary 3

Note that for \Pi_{\rm sgtn}, the dependence graph is the second power of the interference graph (i.e., two nodes are adjacent if their hop distance is at most 2). Since every node of G has degree at most d, each node has at most d^{2} neighbors in the dependence graph. Moreover, since \ell=r, the interval [t-r,t] intersects at most two time blocks, so each X_{ita}^{r} depends on at most 2(d+1) cluster-blocks, and hence p_{i}^{\rm min}\gtrsim 2^{-2d} for any i\in U. Therefore,

{\rm MSE}(\widehat{\Delta}^{r})\lesssim\frac{t_{\rm mix}}{N^{2}T}\sum_{i\not\perp\!\!\!\perp i^{\prime}}(2^{2d})^{2}\lesssim\frac{t_{\rm mix}}{N^{2}T}\cdot d^{2}N\cdot(2^{2d})^{2}\lesssim\frac{t_{\rm mix}}{NT}d^{2}2^{4d}.

A.5 Proof of Corollary 5

To find d_{\Pi}(i), consider a unit i^{\prime} with ii^{\prime}\in E_{\Pi}. Then, there exists C\in\Pi such that {\cal N}(i)\cap C\neq\emptyset and {\cal N}(i^{\prime})\cap C\neq\emptyset, and hence i^{\prime} lies in either C or one of the eight clusters neighboring C. Therefore,

{\rm MSE}(\widehat{\Delta}^{r})\lesssim\frac{t_{\rm mix}}{N^{2}T}\cdot\sum_{i\not\perp\!\!\!\perp i^{\prime}}2^{O(1)}\lesssim\frac{t_{\rm mix}}{N^{2}T}\cdot N\cdot O(h^{2})\lesssim\frac{t_{\rm mix}h^{2}}{NT}.

A.6 Proof of Proposition 4

Suppose {\cal N}(i) intersects clusters C_{1},\dots,C_{d}, with |{\cal N}(i)\cap C_{j}|=m_{j}. Then,

\sum_{j=1}^{d}m_{j}=|{\cal N}(i)|=:m.

Denote by Z_{j}\in\{0,1\} the treatment assigned to C_{j}. Recall that the \delta-FNE requires that

\|w-a\cdot{\bf 1}_{{\cal N}(i)}\|_{1}\leq\delta|{\cal N}(i)| \qquad (3)

for either a=0 or a=1. Suppose a=1; the proof for a=0 is identical by symmetry. Then, Eq. (3) can be rewritten as

\sum_{j=1}^{d}m_{j}Z_{j}\geq(1-\delta)m.

Since the Z_{j}'s are i.i.d. Bernoulli(1/2), we have

\mathbb{P}\left[\sum_{j=1}^{d}m_{j}Z_{j}\geq(1-\delta)m\right] =\mathbb{P}\left[\sum_{j=1}^{d}\frac{m_{j}}{m}Z_{j}\geq 1-\delta\right]
\geq\mathbb{P}\left[\sum_{j=1}^{d}\frac{1}{d}\cdot Z_{j}\geq 1-\delta\right]
=\mathbb{P}\left[\sum_{j=1}^{d}Z_{j}\geq(1-\delta)d\right]
\geq 2^{-d}\sum_{j=0}^{\lfloor\delta d\rfloor}{d\choose j}.

Since the interval [t-r,t] intersects at most 1+\lceil\frac{r}{\ell}\rceil time blocks, the claim follows by taking

d=d_{\Pi}(i)\cdot\left(1+\left\lceil\frac{r}{\ell}\right\rceil\right).
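The final counting step can be sanity-checked by brute-force enumeration in the equal-cardinality case (a standalone sketch of ours; `ms` holds the overlap sizes m_j):

```python
import math
from itertools import product

def exact_exposure(ms, delta):
    # P[sum_j m_j Z_j >= (1 - delta) m] for i.i.d. fair coins Z_j, computed by
    # enumerating all 2^d cluster assignments (ms = overlap sizes m_1..m_d).
    m = sum(ms)
    hits = sum(1 for z in product((0, 1), repeat=len(ms))
               if sum(mj * zj for mj, zj in zip(ms, z)) >= (1 - delta) * m)
    return hits / 2 ** len(ms)

def binomial_tail(d, delta):
    # 2^{-d} * sum_{j <= floor(delta d)} C(d, j): the final bound in the proof.
    return sum(math.comb(d, j) for j in range(math.floor(delta * d) + 1)) / 2 ** d
```

When all m_j are equal, the enumeration and the binomial tail coincide exactly.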

Appendix B Bias Analysis: Proof of Proposition 2

For any \mathcal{F}_{W}-measurable event A, denote by \mathbb{P}_{A}, \mathbb{E}_{A}, {\rm Var}_{A}, and {\rm Cov}_{A} the probability, expectation, variance, and covariance conditioned on A.

Next, we show that Assumption 1 implies a bound on how the law of Y_{it} under \mathbb{P}_{w} can vary as we vary w. When A=\{w\} is a singleton set, we use the subscript w instead of \{w\}.

Lemma 2 (Decaying Temporal Interference).

Consider any t,m\in[T] with 1\leq m<t\leq T and i\in U. Suppose w,w^{\prime}\in\{0,1\}^{N\times T} are identical on {\cal N}(i)\times[t-m,t]. Then,

d_{\rm TV}\left(\mathbb{P}_{w}[Y_{it}\in\cdot],\ \mathbb{P}_{w^{\prime}}[Y_{it}\in\cdot]\right)\leq e^{-m/t_{\rm mix}}.
Proof.

For any i,t, we denote f_{it}=\mathbb{P}_{w}[S_{it}\in\cdot] and f^{\prime}_{it}=\mathbb{P}_{w^{\prime}}[S_{it}\in\cdot]. Then, for any s\in[T], by the Chapman–Kolmogorov equation,

f_{i,s+1}=f_{is}P^{w_{{\cal N}(i)},s}_{is}\quad\text{and}\quad f^{\prime}_{i,s+1}=f^{\prime}_{is}P^{w^{\prime}_{{\cal N}(i)},s}_{is}.

Thus, if t-m\leq s\leq t,

d_{\rm TV}(f_{i,s+1},\ f^{\prime}_{i,s+1}) =d_{\rm TV}\left(f_{is}P^{w_{{\cal N}(i)},s}_{is},\ f^{\prime}_{is}P^{w^{\prime}_{{\cal N}(i)},s}_{is}\right)
=d_{\rm TV}\left(f_{is}P^{w_{{\cal N}(i)},s}_{is},\ f^{\prime}_{is}P^{w_{{\cal N}(i)},s}_{is}\right)
\leq e^{-1/t_{\rm mix}}\cdot d_{\rm TV}\left(f_{is},f^{\prime}_{is}\right),

where we used w^{\prime}_{{\cal N}(i),s}=w_{{\cal N}(i),s} in the second equality and Assumption 1 in the inequality. Applying the above for all s=t-m,\dots,t, we conclude that

d_{\rm TV}\left(\mathbb{P}_{w}[Y_{it}\in\cdot],\ \mathbb{P}_{w^{\prime}}[Y_{it}\in\cdot]\right)\leq d_{\rm TV}\left(f_{it},f^{\prime}_{it}\right)\leq e^{-m/t_{\rm mix}}\cdot d_{\rm TV}(f_{i,t-m},f^{\prime}_{i,t-m})\leq e^{-m/t_{\rm mix}},

where the first inequality is because \mu_{it}\in[0,1], and the last is because the TV distance is at most 1. ∎

Based on Lemma 2, we can establish the following bound.

Lemma 3 (Per-unit Bias).

For any a\in\{0,1\}, r>0, i\in U and t\in[T], we have

\left|\mathbb{E}\left[\widehat{Y}_{ita}^{r}\right]-\mathbb{E}\left[Y_{it}\mid W=a\cdot{\bf 1}\right]\right|\leq e^{-r/t_{\rm mix}}.
Proof.

For any a{0,1}a\in\{0,1\}, iUi\in U and t[T]t\in[T], we have

𝔼[Y^itar]\displaystyle\mathbb{E}\left[\widehat{Y}_{ita}^{r}\right] =𝔼[XitarpitarYit|Xitar=1][Xitar=1]+𝔼[XitarpitarYit|Xitar=0][Xitar=0]\displaystyle=\mathbb{E}\left[\frac{X_{ita}^{r}}{p_{ita}^{r}}Y_{it}\middle|X_{ita}^{r}=1\right]\mathbb{P}\left[X_{ita}^{r}=1\right]+\mathbb{E}\left[\frac{X_{ita}^{r}}{p_{ita}^{r}}Y_{it}\middle|X_{ita}^{r}=0\right]\mathbb{P}\left[X_{ita}^{r}=0\right]
=𝔼[XitarpitarYit|Xitar=1]pitar+0\displaystyle=\mathbb{E}\left[\frac{X_{ita}^{r}}{p_{ita}^{r}}Y_{it}\middle|X_{ita}^{r}=1\right]p_{ita}^{r}+0
=𝔼[YitXitar=1].\displaystyle=\mathbb{E}\left[Y_{it}\mid X_{ita}^{r}=1\right].

Note that Xitar=1X_{ita}^{r}=1 implies that w=a𝟏w=a\cdot{\bf 1} on 𝒩(i)×[tr,t]{\cal N}(i)\times[t-r,t]. Therefore, by Lemma 2 with m=rm=r, we obtain

dTV(w[Yit],a𝟏[Yit])er/tmix,d_{\rm TV}\left(\mathbb{P}_{w}[Y_{it}\in\cdot],\ \mathbb{P}_{a\cdot\bf 1}[Y_{it}\in\cdot]\right)\leq e^{-r/t_{\rm mix}},

and hence

|𝔼[Y^itar]𝔼[YitW=a𝟏]|=|𝔼[YitXitar=1]𝔼[YitW=a𝟏]|er/tmix.\left|\mathbb{E}\left[\widehat{Y}_{ita}^{r}\right]-\mathbb{E}\left[Y_{it}\mid W=a\cdot{\bf 1}\right]\right|=\left|\mathbb{E}\left[Y_{it}\mid X_{ita}^{r}=1\right]-\mathbb{E}\left[Y_{it}\mid W=a\cdot{\bf 1}\right]\right|\leq e^{-r/t_{\rm mix}}.

Now we are prepared to prove Proposition 2. Recall that Δ=1NTi,tΔit\Delta=\frac{1}{NT}\sum_{i,t}\Delta_{it} and Δ^r=1NTi,tΔ^itr\widehat{\Delta}^{r}=\frac{1}{NT}\sum_{i,t}\widehat{\Delta}_{it}^{r}.

Proof of Proposition 2.

By Lemma 3,

|𝔼[Δ^r]Δ|\displaystyle\left|\mathbb{E}\left[\widehat{\Delta}^{r}\right]-\Delta\right| 1NTi,t|Δit𝔼[Δ^itr]|\displaystyle\leq\frac{1}{NT}\sum_{i,t}\left|\Delta_{it}-\mathbb{E}\left[\widehat{\Delta}_{it}^{r}\right]\right|
1NTa{0,1}i,t|𝔼[Y^itar]𝔼[YitW=a𝟏]|\displaystyle\leq\frac{1}{NT}\sum_{a\in\{0,1\}}\sum_{i,t}\left|\mathbb{E}\left[\widehat{Y}_{ita}^{r}\right]-\mathbb{E}\left[Y_{it}\mid W=a\cdot{\bf 1}\right]\right|
2er/tmix.\displaystyle\leq 2e^{-r/t_{\rm mix}}.

Appendix C Variance Analysis: Proof of Proposition 3

We start with a bound that holds for all pairs of units. Note that pit0r=pit1rp_{it0}^{r}=p_{it1}^{r} for any i,t,ri,t,r, since treatment and control are assigned with equal probabilities. We will thus suppress aa in the notation.
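As a sanity check on these exposure propensities, the following minimal Python sketch (single unit, hypothetical small parameters T, ell, r, t; fair-coin block randomization as in the switchback design) verifies by exhaustive enumeration that the propensity equals (1/2) raised to the number of blocks meeting the window [t-r, t]:

```python
import itertools

# Hypothetical small instance; names follow the paper's notation loosely.
T, ell, r, a = 8, 2, 3, 1

def blocks(t0, t1):
    """Indices of length-ell blocks meeting the window [t0, t1]."""
    return {s // ell for s in range(max(t0, 0), t1 + 1)}

def exposure(w, t):
    """X^r_{ta}(w): 1 iff the path w equals a on all of [t-r, t]."""
    return int(all(w[s] == a for s in range(max(t - r, 0), t + 1)))

t = 5
# Enumerate every assignment of the T // ell blocks to compute the
# exact propensity P[X^r_{ta} = 1] under fair-coin block randomization.
n_blocks = T // ell
hits = 0
for bits in itertools.product([0, 1], repeat=n_blocks):
    w = [bits[s // ell] for s in range(T)]
    hits += exposure(w, t)
exact_p = hits / 2 ** n_blocks

# Closed form: one factor 1/2 per block touching the window [t-r, t].
formula_p = 0.5 ** len(blocks(t - r, t))
print(exact_p, formula_p)
```

With these parameters the window [2, 5] meets two blocks, so both quantities equal 1/4; with spatial clustering one would additionally pick up a factor 1/2 per cluster meeting 𝒩(i).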

Lemma 4 (Covariance Bound).

For any r,0r,\ell\geq 0, i,iUi,i^{\prime}\in U and t,t[T]t,t^{\prime}\in[T], we have

Cov(Δ^itr,Δ^itr)4(1+σ2)pitrpitr.{\rm Cov}\left(\widehat{\Delta}_{it}^{r},\widehat{\Delta}_{i^{\prime}t^{\prime}}^{r}\right)\leq\frac{4(1+\sigma^{2})}{p_{it}^{r}\cdot p_{i^{\prime}t^{\prime}}^{r}}.
Proof.

Expanding the definition of Δ^itr\widehat{\Delta}_{it}^{r}, we have

Cov(Δ^itr,Δ^itr)\displaystyle{\rm Cov}\left(\widehat{\Delta}_{it}^{r},\widehat{\Delta}_{i^{\prime}t^{\prime}}^{r}\right) =Cov(Xit1rpit1rYitXit0rpit0rYit,Xit1rpit1rYitXit0rpit0rYit)\displaystyle={\rm Cov}\left(\frac{X_{it1}^{r}}{p_{it1}^{r}}Y_{it}-\frac{X_{it0}^{r}}{p_{it0}^{r}}Y_{it},\ \frac{X_{it1}^{r}}{p_{i^{\prime}t^{\prime}1}^{r}}Y_{i^{\prime}t^{\prime}}-\frac{X_{i^{\prime}t^{\prime}0}^{r}}{p_{i^{\prime}t^{\prime}0}^{r}}Y_{i^{\prime}t^{\prime}}\right)
a,a{0,1}|Cov(XitarpitarYit,XitarpitarYit)|\displaystyle\leq\sum_{a,a^{\prime}\in\{0,1\}}\left|{\rm Cov}\left(\frac{X_{ita}^{r}}{p_{ita}^{r}}Y_{it},\ \frac{X_{i^{\prime}t^{\prime}a^{\prime}}^{r}}{p_{i^{\prime}t^{\prime}a^{\prime}}^{r}}Y_{i^{\prime}t^{\prime}}\right)\right|
a,a{0,1}1pitar1pitar|Cov(XitarYit,XitarYit)|\displaystyle\leq\sum_{a,a^{\prime}\in\{0,1\}}\frac{1}{p_{ita}^{r}}\frac{1}{p_{i^{\prime}t^{\prime}a^{\prime}}^{r}}\left|{\rm Cov}\left(X_{ita}^{r}Y_{it},\ X_{i^{\prime}t^{\prime}a^{\prime}}^{r}Y_{i^{\prime}t^{\prime}}\right)\right|
a,a{0,1}1pitar1pitar𝔼[(XitarYit)2]𝔼[(XitarYit)2],\displaystyle\leq\sum_{a,a^{\prime}\in\{0,1\}}\frac{1}{p_{ita}^{r}}\frac{1}{p_{i^{\prime}t^{\prime}a^{\prime}}^{r}}\sqrt{\mathbb{E}[(X_{ita}^{r}Y_{it})^{2}]}\sqrt{\mathbb{E}[(X_{i^{\prime}t^{\prime}a^{\prime}}^{r}Y_{i^{\prime}t^{\prime}})^{2}]}, (4)

where the last inequality is by the Cauchy–Schwarz inequality. Since XitarX_{ita}^{r} is binary and

𝔼[(Yit)2]=𝔼[Yit]2+Var(Yit)1+σ2,\mathbb{E}[(Y_{it})^{2}]=\mathbb{E}[Y_{it}]^{2}+{\rm Var}(Y_{it})\leq 1+\sigma^{2},

we have

(C)(1pit0r+1pit1r)(1pit0r+1pit1r)(1+σ2)=4pitrpitr(1+σ2).\displaystyle\eqref{eqn:121723}\leq\left(\frac{1}{p_{it0}^{r}}+\frac{1}{p_{it1}^{r}}\right)\left(\frac{1}{p_{i^{\prime}t^{\prime}0}^{r}}+\frac{1}{p_{i^{\prime}t^{\prime}1}^{r}}\right)(1+\sigma^{2})=\frac{4}{p_{it}^{r}\cdot p_{i^{\prime}t^{\prime}}^{r}}(1+\sigma^{2}).

The above bound alone is not sufficient for our analysis, as it does not take advantage of the rapid mixing property. In the rest of this section, we show that for pairs of observations that are far apart in time, the covariance of their HT terms decays exponentially in their temporal distance.

C.1 Covariance of Outcomes

We first show that if the realization of one random variable has little impact on the (conditional) distribution of another random variable, then they have low covariance.

Lemma 5 (Low Interference in Conditional Distribution Implies Low Covariance).

Let U,VU,V be two random variables and g,hg,h be real-valued functions defined on their respective realization spaces. If for some δ>0\delta>0, we have

dTV([UV],[U])δV-almost surely ,d_{\rm TV}(\mathbb{P}[U\in\cdot\mid V],\ \mathbb{P}[U\in\cdot])\leq\delta\quad V\text{-almost surely },

then,

Cov(g(U),h(V))δh(V)1g(U).{\rm Cov}(g(U),h(V))\leq\delta\cdot\|h(V)\|_{1}\cdot\|g(U)\|_{\infty}.
Proof.

Denote by μU,V,μU,μV,μUV=v\mu_{U,V},\mu_{U},\mu_{V},\mu_{U\mid V=v} the probability measures of (U,V)(U,V), UU, VV, and UU conditioned on V=vV=v, respectively. We then have

|Cov(g(U),h(V))|\displaystyle|{\rm Cov}(g(U),h(V))| =|𝔼[g(U)h(V)]𝔼[g(U)]𝔼[h(V)]|\displaystyle=|\mathbb{E}\left[g(U)h(V)\right]-\mathbb{E}[g(U)]\mathbb{E}[h(V)]|
=|vh(v)(ug(u)(μUV=v(du)μU(du)))μV(dv)|\displaystyle=\left|\int_{v}h(v)\left(\int_{u}g(u)\left(\mu_{U\mid V=v}(du)-\mu_{U}(du)\right)\right)\mu_{V}(dv)\right|
v|h(v)|g(U)dTV((UV=v),(U))μV(dv)\displaystyle\leq\int_{v}|h(v)|\cdot\|g(U)\|_{\infty}\cdot d_{\rm TV}(\mathbb{P}(U\in\cdot\mid V=v),\mathbb{P}(U\in\cdot))\mu_{V}(dv)
h(V)1g(U)δ.\displaystyle\leq\|h(V)\|_{1}\cdot\|g(U)\|_{\infty}\cdot\delta.
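Lemma 5 can be checked exactly on a small discrete example. The sketch below (Python, with a hypothetical joint law for (U, V) and hypothetical functions g, h) computes the worst-case TV gap δ, the covariance, and the bound directly:

```python
import numpy as np

# Hypothetical joint law: joint[u, v] = P[U=u, V=v].
joint = np.array([[0.30, 0.10],
                  [0.20, 0.40]])
pU, pV = joint.sum(axis=1), joint.sum(axis=0)
g = np.array([0.0, 1.0])   # g(U), with sup-norm 1
h = np.array([-1.0, 2.0])  # h(V)

# delta: worst-case TV distance between P[U | V=v] and P[U].
cond = joint / pV            # cond[:, v] = P[U | V=v]
delta = max(0.5 * np.abs(cond[:, v] - pU).sum() for v in range(2))

# Cov(g(U), h(V)) and the Lemma 5 bound delta * ||h(V)||_1 * ||g(U)||_inf.
cov = (g[:, None] * h[None, :] * joint).sum() - (g @ pU) * (h @ pV)
bound = delta * (np.abs(h) @ pV) * np.abs(g).max()
print(cov, bound)
```

For this particular joint law the bound happens to hold with equality, illustrating that the lemma is tight in general.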

Viewing V,UV,U as outcomes in different rounds, we use the above to bound the covariance of the outcomes in terms of their temporal distance.

Lemma 6 (Covariance of Outcomes).

For any A{0,1}N×TA\subseteq\{0,1\}^{N\times T}, i,iUi,i^{\prime}\in U and t,t[T]t,t^{\prime}\in[T], we have

CovA(Yit,Yit)e|tt|/tmix.{\rm Cov}_{A}(Y_{it},Y_{i^{\prime}t^{\prime}})\leq e^{-{|t-t^{\prime}|/{t_{{\rm mix}}}}}.
Proof.

Without loss of generality, assume t<tt^{\prime}<t. Recall that ϵit=Yitμit(Sit,W𝒩(i),t)\epsilon_{it}=Y_{it}-\mu_{it}(S_{it},W_{{\cal N}(i),t}), so

CovA(Yit,Yit)\displaystyle{\rm Cov}_{A}(Y_{i^{\prime}t^{\prime}},Y_{it}) =CovA(μit(Sit,Wit),μit(Sit,Wit))+CovA(μit(Sit,Wit),ϵit)\displaystyle={\rm Cov}_{A}(\mu_{i^{\prime}t^{\prime}}(S_{i^{\prime}t^{\prime}},W_{i^{\prime}t^{\prime}}),\mu_{it}(S_{it},W_{it}))+{\rm Cov}_{A}(\mu_{i^{\prime}t^{\prime}}(S_{i^{\prime}t^{\prime}},W_{i^{\prime}t^{\prime}}),\epsilon_{it})
+CovA(ϵit,μit(Sit,Wit))+CovA(ϵit,ϵit).\displaystyle\quad+{\rm Cov}_{A}(\epsilon_{i^{\prime}t^{\prime}},\mu_{it}(S_{it},W_{it}))+{\rm Cov}_{A}(\epsilon_{i^{\prime}t^{\prime}},\epsilon_{it}).

The latter three terms are zero by the exogenous noise assumption (the noise terms are mean-zero and uncorrelated with the other factors). By Lemma 2 and the triangle inequality for dTVd_{\rm TV}, for any s𝒮s\in\mathcal{S}, we have

dTV(A[(Sit,W𝒩(i),t)],A[(Sit,W𝒩(i),t)Sit=s,W𝒩(i),t=w])\displaystyle\quad d_{\rm TV}\left(\mathbb{P}_{A}\left[(S_{it},\ W_{{\cal N}(i),t})\in\cdot\right],\ \mathbb{P}_{A}\left[(S_{it},\ W_{{\cal N}(i),t})\in\cdot\mid S_{i^{\prime}t^{\prime}}=s,\ W_{{\cal N}(i^{\prime}),t^{\prime}}=w\right]\right)
=dTV(A[Sit],A[SitSit=s,W𝒩(i),t=w])\displaystyle=d_{\rm TV}\left(\mathbb{P}_{A}\left[S_{it}\in\cdot\right],\ \mathbb{P}_{A}\left[S_{it}\in\cdot\mid S_{i^{\prime}t^{\prime}}=s,\ W_{{\cal N}(i^{\prime}),t^{\prime}}=w\right]\right)
e(tt)/tmixdTV(A[Sit],𝐞s)\displaystyle\leq e^{-{(t-t^{\prime})/{t_{{\rm mix}}}}}\cdot d_{\rm TV}(\mathbb{P}_{A}[S_{i^{\prime}t^{\prime}}\in\cdot],{\bf e}_{s})
e(tt)/tmix,\displaystyle\leq e^{-{(t-t^{\prime})/{t_{{\rm mix}}}}}, (5)

where 𝐞x{\bf e}_{x} denotes the Dirac distribution at xx, and the last inequality follows since the TV distance between any two distributions is at most 11.

Now, apply Lemma 5 with (Sit,W𝒩(i),t)(S_{i^{\prime}t^{\prime}},W_{{\cal N}(i^{\prime}),t^{\prime}}) in the role of VV, with (Sit,W𝒩(i),t)(S_{it},W_{{\cal N}(i),t}) in the role of UU, with μit\mu_{it} in the role of gg, with μit\mu_{i^{\prime}t^{\prime}} in the role of hh, and with e(tt)/tmixe^{-(t-t^{\prime})/t_{\rm mix}} in the role of δ\delta. Noting that g,h1\|g\|_{\infty},\|h\|_{\infty}\leq 1 and combining with Eq. 5, we conclude the statement. ∎
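The geometric decay of outcome covariances in Lemma 6 can be observed exactly on a small stationary chain. Below is a sketch (hypothetical two-state kernel, with the outcome taken to be the state itself) computing Cov(S_0, S_k) in closed form and checking that consecutive lags shrink by a constant factor:

```python
import numpy as np

# Hypothetical two-state kernel; mirrors the mechanism behind Lemma 6.
P = np.array([[0.7, 0.3],
              [0.4, 0.6]])
pi = np.array([4 / 7, 3 / 7])  # stationary distribution: pi @ P == pi

def cov_lag(k):
    """Cov(S_0, S_k) for states in {0,1}, with S_0 drawn from pi."""
    Pk = np.linalg.matrix_power(P, k)
    joint_11 = pi[1] * Pk[1, 1]   # P[S_0 = 1, S_k = 1]
    return joint_11 - pi[1] ** 2

# Each successive lag multiplies the covariance by the second
# eigenvalue of P (here 0.7 + 0.6 - 1 = 0.3), i.e. e^{-1/t_mix}.
ratios = [cov_lag(k + 1) / cov_lag(k) for k in range(1, 6)]
print(ratios)
```

In the lemma's notation, the common ratio plays the role of e1/tmixe^{-1/t_{\rm mix}}, so the covariance at temporal distance |tt||t-t^{\prime}| is exponentially small.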

C.2 Covariance of HT terms

So far we have shown that the outcomes have low covariance if they are far apart in time. However, this does not immediately imply that the covariance between the HT terms Δ^itr\widehat{\Delta}^{r}_{it} is also low, since each HT term is a product of the outcome and the exposure mapping. To proceed, we need the following.

Lemma 7 (Bounding Covariance Using Conditional Covariance).

Let U,VU,V be independent Bernoulli random variables with means p,q[0,1]p,q\in[0,1]. Suppose X,YX,Y are random variables s.t.

XVUandYUV.{\color[rgb]{0,0,0}X\perp\!\!\!\perp V\mid U\quad{\rm and}\quad Y\perp\!\!\!\perp U\mid V.}

Then,

Cov(UX,VY)=pqCov(X,YU=V=1).{\rm Cov}(UX,VY)=pq\cdot{\rm Cov}(X,Y\mid U=V=1).
Proof.

Since U,VU,V are Bernoulli, we have

Cov(UX,VY)\displaystyle{\rm Cov}(UX,VY) =𝔼[UXVY]𝔼[UX]𝔼[VY]\displaystyle=\mathbb{E}[UXVY]-\mathbb{E}[UX]\mathbb{E}[VY]
=𝔼[UXVYU=V=1]pq𝔼[UXU=1]p𝔼[VYV=1]q\displaystyle=\mathbb{E}[UXVY\mid U=V=1]pq-\mathbb{E}[UX\mid U=1]p\mathbb{E}[VY\mid V=1]q
=pq(𝔼[XYU=V=1]𝔼[XU=1]𝔼[YV=1]).\displaystyle=pq\left(\mathbb{E}[XY\mid U=V=1]-\mathbb{E}[X\mid U=1]\mathbb{E}[Y\mid V=1]\right). (6)

Note that XVUX\perp\!\!\!\perp V\mid U and YUVY\perp\!\!\!\perp U\mid V, so

𝔼[XU=1]=𝔼[XU=V=1]and𝔼[YV=1]=𝔼[YU=V=1],\mathbb{E}[X\mid U=1]=\mathbb{E}[X\mid U=V=1]\quad\text{and}\quad\mathbb{E}[Y\mid V=1]=\mathbb{E}[Y\mid U=V=1],

and therefore

(C.2)=pqCov(X,YU=V=1).\eqref{eqn:121323}=pq\cdot{\rm Cov}(X,Y\mid U=V=1).
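The identity in Lemma 7 can be verified exactly on a tiny discrete example. In the hypothetical construction below, U, V, Z are independent Bernoulli variables and X = U + Z, Y = V + Z, so the conditional independences X ⟂ V | U and Y ⟂ U | V hold while Cov(X, Y | U = V = 1) = Var(Z) is nonzero:

```python
import itertools

# Hypothetical Bernoulli parameters for U, V, Z respectively.
p, q, s = 0.5, 0.25, 0.4

# Exact expectations by enumerating the 8 outcomes of (U, V, Z).
E_uxvy = E_ux = E_vy = 0.0
for u, v, z in itertools.product([0, 1], repeat=3):
    w = (p if u else 1 - p) * (q if v else 1 - q) * (s if z else 1 - s)
    x, y = u + z, v + z
    E_uxvy += w * (u * x) * (v * y)
    E_ux += w * u * x
    E_vy += w * v * y

lhs = E_uxvy - E_ux * E_vy   # Cov(UX, VY)
rhs = p * q * s * (1 - s)    # pq * Cov(X, Y | U=V=1) = pq * Var(Z)
print(lhs, rhs)
```

The two sides agree exactly, matching the lemma's factorization of the covariance through the event U = V = 1.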

We obtain the following bound by applying the above to the outcomes and exposure mappings in two rounds that lie in different blocks and are more than 2r2r apart in time.

Lemma 8 (Covariance of Far-apart HT terms).

Suppose i,iUi,i^{\prime}\in U and t,t[T]t,t^{\prime}\in[T] satisfy \lceil t^{\prime}/\ell\rceil\neq\lceil t/\ell\rceil and t+r<trt^{\prime}+r<t-r; then

Cov(Δ^itr,Δ^itr)4e|tt|/tmix.{\rm Cov}(\widehat{\Delta}_{it}^{r},\widehat{\Delta}_{i^{\prime}t^{\prime}}^{r})\leq 4e^{-|t^{\prime}-t|/t_{\rm mix}}.
Proof.

Observe that for any (possibly identical) a,a{0,1}a,a^{\prime}\in\{0,1\}, since t,tt,t^{\prime} lie in distinct blocks and are more than 2r2r apart, the indicators XitarX_{ita}^{r} and XitarX_{i^{\prime}t^{\prime}a^{\prime}}^{r} are independent. Thus, by Lemma 7, we have

|Cov(XitarpitarYit,XitarpitarYit)|\displaystyle\left|{\rm Cov}\left(\frac{X_{ita}^{r}}{p_{ita}^{r}}Y_{it},\ \frac{X_{i^{\prime}t^{\prime}a^{\prime}}^{r}}{p_{i^{\prime}t^{\prime}a^{\prime}}^{r}}Y_{i^{\prime}t^{\prime}}\right)\right| =1pitar1pitar|Cov(XitarYit,XitarYit)|\displaystyle=\frac{1}{p_{ita}^{r}}\frac{1}{p_{i^{\prime}t^{\prime}a^{\prime}}^{r}}\left|{\rm Cov}\left(X_{ita}^{r}Y_{it},\ X_{i^{\prime}t^{\prime}a^{\prime}}^{r}Y_{i^{\prime}t^{\prime}}\right)\right|
=|Cov(Yit,YitXitar=Xitar=1)|.\displaystyle=\left|{\rm Cov}\left(Y_{it},Y_{i^{\prime}t^{\prime}}\mid X_{i^{\prime}t^{\prime}a^{\prime}}^{r}=X_{ita}^{r}=1\right)\right|. (7)

To bound the above, consider the event

A={w{0,1}U×[T]:Xitar(w)=Xitar(w)=1},A=\left\{w\in\{0,1\}^{U\times[T]}:X_{i^{\prime}t^{\prime}a^{\prime}}^{r}(w)=X_{ita}^{r}(w)=1\right\},

so that by Lemma 6,

(C.2)=|CovA(Yit,Yit)|e|tt|/tmix.\eqref{eqn:121423}=|{\rm Cov}_{A}(Y_{i^{\prime}t^{\prime}},Y_{it})|\leq e^{-|t-t^{\prime}|/t_{\rm mix}}.

The conclusion follows by summing over all four combinations (a,a){0,1}2(a,a^{\prime})\in\{0,1\}^{2}. ∎

Remark 5.

Both restrictions, that t,tt,t^{\prime} are more than 2r2r apart in time and that they lie in distinct blocks, are necessary for the above exponential covariance bound. As an example, fix a unit ii and consider t,t[T]t,t^{\prime}\in[T] in the same block, each at distance at least rr from the left boundary of this block. Then the exposure mappings XitarX_{ita}^{r} and XitarX_{it^{\prime}a}^{r} are the same, which we denote by UU. Then,

Cov(UYit,UYit)=pCov(Yit,YitU=1)+p(1p)𝔼[YitU=1]𝔼[YitU=1],\displaystyle{\rm Cov}(UY_{i^{\prime}t^{\prime}},\ UY_{it})=p\cdot{\rm Cov}(Y_{i^{\prime}t^{\prime}},Y_{it}\mid U=1)+p(1-p)\cdot\mathbb{E}[Y_{i^{\prime}t^{\prime}}\mid U=1]\cdot\mathbb{E}[Y_{it}\mid U=1],

where p=[U=1]p=\mathbb{P}[U=1]. Therefore, we can choose the mean outcome functions μit,μit\mu_{i^{\prime}t^{\prime}},\mu_{it} to be large so that the above does not decay exponentially in |tt||t^{\prime}-t|.

C.3 Proof of Proposition 3

We are now ready to bound the variance. Write

Var(Δ^r)=Var(1NT(i,t)U×[T]Δ^itr)=1N2T2i,ti,tCov(Δ^itr,Δ^itr).\displaystyle{\rm Var}\left(\widehat{\Delta}^{r}\right)={\rm Var}\left(\frac{1}{NT}\sum_{(i,t)\in U\times[T]}\widehat{\Delta}_{it}^{r}\right)=\frac{1}{N^{2}T^{2}}\sum_{i,t}\sum_{i^{\prime},t^{\prime}}{\rm Cov}\left(\widehat{\Delta}_{i^{\prime}t^{\prime}}^{r},\widehat{\Delta}_{it}^{r}\right). (8)

We need two observations to decompose the above. First, by the definition of the dependence graph, if i⟂⟂ii\perp\!\!\!\perp i^{\prime}, then Cov(Δ^itr,Δ^itr)=0{\rm Cov}(\widehat{\Delta}_{i^{\prime}t^{\prime}}^{r},\widehat{\Delta}_{it}^{r})=0. Thus, for each ii, we may consider only the units ii^{\prime} with i⟂̸ii\not\perp\!\!\!\perp i^{\prime}. Second, observe that if |tt|>r+|t-t^{\prime}|>r+\ell, then we can apply Lemma 8 to bound the covariance term exponentially in |tt||t-t^{\prime}|. Combining, we can decompose Eq. 8 into close-by (“C”) and far-apart (“F”) pairs as

(8)=1N2T2(i,t)(Cit+Fit)\displaystyle\eqref{eqn:062324}=\frac{1}{N^{2}T^{2}}\sum_{(i,t)}\left(C_{it}+F_{it}\right) (9)

where

Cit:=t:|tt|+ri:i⟂̸iCov(Δ^itr,Δ^itr)C_{it}:=\sum_{t^{\prime}:|t^{\prime}-t|\leq\ell+r}\sum_{i^{\prime}:i\not\perp\!\!\!\perp i^{\prime}}{\rm Cov}\left(\widehat{\Delta}_{it}^{r},\widehat{\Delta}_{i^{\prime}t^{\prime}}^{r}\right)

and

Fit:=t:|tt|>+ri:i⟂̸iCov(Δ^itr,Δ^itr).F_{it}:=\sum_{t^{\prime}:|t^{\prime}-t|>\ell+r}\sum_{i^{\prime}:i\not\perp\!\!\!\perp i^{\prime}}{\rm Cov}\left(\widehat{\Delta}_{it}^{r},\widehat{\Delta}_{i^{\prime}t^{\prime}}^{r}\right).

To further analyze the above, fix any (i,t)U×[T](i,t)\in U\times[T].

Part I: Bounding CitC_{it}. By Lemma 4,

Cit\displaystyle C_{it} t:|tt|+ri:i⟂̸i4(1+σ2)1pitrpitr\displaystyle\leq\sum_{t^{\prime}:|t^{\prime}-t|\leq\ell+r}\sum_{i^{\prime}:i\not\perp\!\!\!\perp i^{\prime}}4(1+\sigma^{2}){\color[rgb]{0,0,0}\frac{1}{p_{it}^{r}\cdot p_{i^{\prime}t^{\prime}}^{r}}}
(+r)4(1+σ2)i:i⟂̸i1piminpimin.\displaystyle\leq(\ell+r)\cdot 4(1+\sigma^{2})\sum_{i^{\prime}:i\not\perp\!\!\!\perp i^{\prime}}{\color[rgb]{0,0,0}\frac{1}{p_{i}^{\rm min}\cdot p_{i^{\prime}}^{\rm min}}}.

Summing over all (i,t)(i,t), we have

i,tCit\displaystyle\sum_{i,t}C_{it} i,t4(1+σ2)(r+)i:i⟂̸i1piminpimin\displaystyle\leq\sum_{i,t}4(1+\sigma^{2})(r+\ell)\sum_{i^{\prime}:i\not\perp\!\!\!\perp i^{\prime}}{\color[rgb]{0,0,0}\frac{1}{p_{i}^{\rm min}\cdot p_{i^{\prime}}^{\rm min}}}
4T(1+σ2)(r+)(i,i):i⟂̸i1piminpimin.\displaystyle\leq 4T(1+\sigma^{2})(r+\ell)\sum_{(i,i^{\prime}):i\not\perp\!\!\!\perp i^{\prime}}{\color[rgb]{0,0,0}\frac{1}{p_{i}^{\rm min}\cdot p_{i^{\prime}}^{\rm min}}}. (10)

Part II: Bounding FitF_{it}. By Lemma 8,

Fit\displaystyle F_{it} =i:i⟂̸it:|tt|>+rCov(Δ^itr,Δ^itr)\displaystyle=\sum_{i^{\prime}:i\not\perp\!\!\!\perp i^{\prime}}\sum_{t^{\prime}:|t^{\prime}-t|>\ell+r}{\rm Cov}\left(\widehat{\Delta}_{it}^{r},\widehat{\Delta}_{i^{\prime}t^{\prime}}^{r}\right)
i:i⟂̸i2+r4ez/tmix𝑑z\displaystyle\leq\sum_{i^{\prime}:i\not\perp\!\!\!\perp i^{\prime}}2\int_{\ell+r}^{\infty}4e^{-z/t_{\rm mix}}\ dz
=8tmixe(r+)/tmixdΠ(i),\displaystyle=8t_{\rm mix}e^{-(r+\ell)/t_{\rm mix}}\cdot d_{\Pi}(i), (11)

where the factor 22 in the inequality arises since tt^{\prime} may be either greater or smaller than tt.

Combining Eq. 10 and Eq. 11, we conclude that

Var(Δ^r)\displaystyle{\rm Var}\left(\widehat{\Delta}^{r}\right) 1N2T2(i,tCit+i,tFit)\displaystyle\leq\frac{1}{N^{2}T^{2}}\left(\sum_{i,t}C_{it}+\sum_{i,t}F_{it}\right)
1N2T2(4(1+σ2)T(r+)(i,i):i⟂̸i1piminpimin+8Ttmixe(r+)tmixiUdΠ(i))\displaystyle\leq\frac{1}{N^{2}T^{2}}\left(4(1+\sigma^{2})T(r+\ell)\sum_{(i,i^{\prime}):i\not\perp\!\!\!\perp i^{\prime}}\frac{1}{p_{i}^{\rm min}p_{i^{\prime}}^{\rm min}}+8Tt_{\rm mix}e^{-\frac{(r+\ell)}{t_{\rm mix}}}\sum_{i\in U}d_{\Pi}(i)\right)
8N2T((1+σ2)(r+)(i,i):i⟂̸i1piminpimin+tmixe(r+)tmixiUdΠ(i)).\displaystyle\leq\frac{8}{N^{2}T}\left((1+\sigma^{2})(r+\ell)\sum_{(i,i^{\prime}):i\not\perp\!\!\!\perp i^{\prime}}\frac{1}{p_{i}^{\rm min}p_{i^{\prime}}^{\rm min}}+t_{\rm mix}e^{-\frac{(r+\ell)}{t_{\rm mix}}}\sum_{i\in U}d_{\Pi}(i)\right).
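To see how the pieces fit together, here is a small numeric illustration (hypothetical constants N, T, t_mix, not from the paper): choosing the truncation radius r on the order of t_mix * log(NT), as suggested by the exponential terms above, drives the squared bias from Proposition 2 well below the 1/(NT) scale of the variance bound:

```python
import math

# Hypothetical problem sizes.
N, T, t_mix = 100, 1000, 5.0

# Truncation radius on the scale suggested by the exponential terms.
r = t_mix * math.log(N * T)

# Squared bias from Proposition 2: (2 e^{-r/t_mix})^2 = 4 / (NT)^2.
bias_sq = (2 * math.exp(-r / t_mix)) ** 2
print(bias_sq, 1.0 / (N * T))
```

With r (and similarly the block length) logarithmic in NT, the bias-squared term is negligible and the overall MSE is driven by the O((r+ℓ)/NT) variance term, giving the claimed O~(1/NT) rate up to logarithmic factors.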