
Heterogeneous Treatment Effects in Panel Data

Retsef Levi
Sloan School of Management
Massachusetts Institute of Technology
[email protected]

Elisabeth Paulson
Harvard Business School
[email protected]

Georgia Perakis
Sloan School of Management
Massachusetts Institute of Technology
[email protected]

Emily Zhang
Operations Research Center
Massachusetts Institute of Technology
[email protected]
Abstract

We address a core problem in causal inference: estimating heterogeneous treatment effects using panel data with general treatment patterns. Many existing methods either do not utilize the potential underlying structure in panel data or have limitations in the allowable treatment patterns. In this work, we propose and evaluate a new method that first partitions observations into disjoint clusters with similar treatment effects using a regression tree, and then leverages the (assumed) low-rank structure of the panel data to estimate the average treatment effect for each cluster. Our theoretical results establish the convergence of the resulting estimates to the true treatment effects. Computational experiments with semi-synthetic data show that our method achieves superior accuracy compared to alternative approaches, using a regression tree with no more than 40 leaves. Hence, our method provides more accurate and interpretable estimates than alternative methods.

1 Introduction

Suppose we observe an outcome of interest across $n$ distinct units over $T$ time periods; such data is commonly referred to as panel data. Each unit may have been subject to an intervention during certain time periods that influences the outcome. As a concrete example, a unit could be a geographic region affected by a new economic policy, or an individual consumer or store influenced by a marketing promotion. Our goal is to understand the impact of this intervention on the outcome. Since the effect of the intervention might vary across individual units and time periods as a function of the unit-level and time-varying covariates, we aim to estimate the heterogeneous treatment effects. This is a key problem in econometrics and causal inference, enabling policymakers or business owners to make more informed decisions about which units to target for future interventions.

We develop a new method, the Panel Clustering Estimator (PaCE), for estimating heterogeneous treatment effects in panel data under general treatment patterns. The causal effects are modeled as a non-parametric function of the covariates of the units, which may vary over time. To estimate heterogeneous effects, PaCE splits the observations into disjoint clusters using a regression tree and estimates the average treatment effect of each cluster. In addition to the development of this novel method, we make the following theoretical and empirical contributions.

Regression tree with bounded bias (Theorem 3.2)

We show that, subject to mild assumptions on the regression tree and a density assumption on the covariates, the bias introduced by approximating the treatment effect function with the piece-wise constant function produced by the regression tree decreases polynomially as the number of leaves in the tree increases.

Optimal rate for average treatment effect estimation (Theorem 3.5)

We obtain guarantees for the recovery of average treatment effects for each cluster of observations that match the optimal rate, up to logarithmic factors. This extends the convergence guarantee of [17] to a setting with multiple treatments. Similar to [17], we allow for general intervention patterns.

Enhanced empirical accuracy with simple estimator

Using semi-synthetic data, we demonstrate that PaCE achieves empirical performance competitive with, and often superior to, alternative methods for heterogeneous treatment effect estimation, including double machine learning, causal forest, and matrix completion-based methods. Furthermore, PaCE offers a much simpler and more interpretable estimate, as we achieve this enhanced performance using a regression tree with at most 40 leaves.

1.1 Related literature

Causal inference using panel data is often approached through the synthetic control framework, where treated units are compared to a synthetic control group constructed from untreated units [1, 2]. However, these methods restrict the treatment pattern to a block structure. Matrix completion approaches offer a more flexible alternative, allowing for general treatment patterns [4, 7, 35]. The treated entries are viewed as missing entries, and low-rank matrix completion is utilized to estimate counterfactual outcomes for the treated entries. In most of the matrix completion literature, the set of missing or treated entries is assumed to be generated randomly [12]. However, this assumption is not realistic in our panel data setting. Other works explore matrix completion under deterministic sampling [10, 17, 19, 22, 24], and our work extends the methodology from [17].

Several methods for estimating heterogeneous treatment effects use supervised learning techniques to fit models for the outcome or treatment assignment mechanisms [14, 15, 18, 23, 27]. Although these methods are intended for cases where observations are drawn i.i.d. from one or multiple predefined distributions, they could still be applied to panel data. In contrast, our approach is specifically designed for panel data with underlying structure. By avoiding complex machine learning methods, our approach not only provides accurate results for panel data but also offers interpretable estimates.

The first step of our method involves grouping data into clusters that vary in their treatment effects using a regression tree. As in [8], we greedily search for good splits, iteratively improving the tree’s representation of the true heterogeneous treatment effects. However, unlike supervised learning where the splitting criterion is directly computable, treatment effects are not readily observed. Therefore, we estimate the splitting criterion, an approach used in [5, 6, 29, 33]. While methods like causal forest only use information within a given leaf to estimate the treatment effect for observations in that leaf [33], our method leverages global information from the panel structure to determine the average treatment effect of each cluster, enabling us to achieve high accuracy with just one tree.

The second step involves estimating the average treatment effect in each cluster. For this step, we utilize the de-biased convex estimator introduced by [17]. This estimator is applicable to panel data with general treatment patterns but is specifically designed for estimating average, rather than heterogeneous, treatment effects. It uses a convex estimator, similar to those used in [20, 34], and introduces a novel de-biasing technique that improves upon the initial estimates. Additionally, [17] builds on previous works [11, 13] to provide theoretical guarantees for recovering the average treatment effect with provably minimal assumptions on the intervention patterns. The results of [17] are specific to the case where there is only one treatment. We build on their proof techniques to show convergence when the number of treatments is $O(\log n)$, and also nest this technique into the first step of our algorithm to estimate heterogeneous effects.

1.2 General notation

For any matrix $A$, $\lVert A\rVert$ denotes the spectral norm, $\lVert A\rVert_{\mathrm{F}}$ the Frobenius norm, $\lVert A\rVert_{\star}$ the nuclear norm, and $\lVert A\rVert_{\text{sum}}$ the sum of the absolute values of the entries of $A$. For any vector $a$, $\lVert a\rVert$ denotes its Euclidean norm, and $\lVert a\rVert_{\infty}$ denotes the maximum absolute value of its components. We use $\circ$ to denote element-wise matrix multiplication. For a matrix subspace $\mathbf{T}$, let $P_{\mathbf{T}^{\perp}}(\cdot)$ denote the projection operator onto $\mathbf{T}^{\perp}$, the orthogonal complement of $\mathbf{T}$. We define the sum of two subspaces $\mathbf{T}_{1}$ and $\mathbf{T}_{2}$ as $\mathbf{T}_{1}+\mathbf{T}_{2}=\{T_{1}+T_{2}\mid T_{1}\in\mathbf{T}_{1},T_{2}\in\mathbf{T}_{2}\}$. We write $a\lesssim b$ whenever $a\leq c\cdot b$ for some fixed constant $c$; $a\gtrsim b$ is defined similarly.

2 Model and algorithm

PaCE is designed for panel data, where an outcome of interest is observed across $n$ distinct units for $T$ time periods. Let $O\in\mathbb{R}^{n\times T}$ be the matrix of these observed outcomes. Our objective is to discern how these outcomes were influenced by the various treatments under consideration.

We assume that, in a hypothetical scenario where the treatments were not applied, the expected outcomes would be represented by an unknown low-rank matrix $M^{\star}\in\mathbb{R}^{n\times T}$. To incorporate the influence of external factors, we introduce the covariate tensor $\mathbf{X}:=[X_{1},\dots,X_{p}]\in\mathbb{R}^{p\times n\times T}$, where an element $X_{izt}$ signifies the $i$-th covariate of unit $z$ at time $t$. For convenience, we write $X^{zt}:=[X_{1zt},\dots,X_{pzt}]$ for the vector of covariates of unit $z$ at time $t$. We consider $q$ distinct treatments, represented by binary matrices $W_{1},\dots,W_{q}$, where $W_{i}$ encodes which observations are subjected to treatment $i$. For each treatment $i\in[q]$ applied to unit $z$ at time period $t$, the treatment effect is modeled as a non-parametric Lipschitz function of the covariates, $\mathcal{T}_{i}^{\star}(X^{zt})$. We consider the treatment effects of distinct treatments to be additive; this is a flexible framework because, as we allow for up to $\log(n)$ treatments, interactions of individual treatments can be modeled as additional treatments. Combining all of these elements, the observed outcome matrix $O$ can be expressed as the sum of $M^{\star}$, the combined effects of the treatments, and a noise matrix $E$. Mathematically,

O=M^{\star}+\sum_{i=1}^{q}\mathcal{T}_{i}^{\star}(\mathbf{X})\circ W_{i}+E,

where $\mathcal{T}_{i}^{\star}(\mathbf{X})$ denotes the matrix whose element in the $z$-th row and $t$-th column is $\mathcal{T}_{i}^{\star}(X^{zt})$. We note that our model implicitly assumes unconfoundedness, which is common in the literature [21].
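
To make the model concrete, the following small numpy sketch simulates data from the equation above; the rank, effect functions, noise level, and all variable names are illustrative assumptions rather than the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)
n, T, p, q, rank = 100, 60, 3, 2, 4

M_star = rng.normal(size=(n, rank)) @ rng.normal(size=(rank, T))  # low-rank baseline M*
X = rng.uniform(size=(p, n, T))                                   # covariate tensor X
W = (rng.uniform(size=(q, n, T)) < 0.2).astype(float)             # binary treatment patterns W_i

# Lipschitz treatment-effect functions of the covariate vector X^{zt}
effect_fns = [lambda X: 2.0 * X[0] + X[1],      # T_1*(X^{zt})
              lambda X: np.sin(X[1] + X[2])]    # T_2*(X^{zt})

E = 0.1 * rng.normal(size=(n, T))                                  # noise matrix
O = M_star + sum(effect_fns[i](X) * W[i] for i in range(q)) + E    # observed outcomes
```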

Our goal is to estimate the treatment effect function $\mathcal{T}^{\star}_{i}$ for each $i\in[q]$, having observed $O$, $\mathbf{X}$, and $W_{1},\ldots,W_{q}$. Our methodology PaCE consists of two main steps. First, we build a specialized tree for each treatment $i\in[q]$, designed to partition the observations into disjoint clusters that differ in their treatment effects. Next, we estimate the average treatment effect of each cluster, thus obtaining estimates for the treatment effect functions that are piece-wise constant functions of the covariates. The following subsections delve into the specifics of these two steps.

2.1 Clustering the observations

Initially, the entire dataset with $n\times T$ entries is in one single cluster. This cluster serves as the starting point for the iterative partitioning process, where we split the data into $\ell$ more granular clusters for each treatment that better represent the true heterogeneous treatment effects. We present our method for clustering observations in Algorithm 1. Below, we discuss its two main steps.

Algorithm 1 Clustering the observations
1: Input: outcome matrix $O$, covariate matrices $X_{1},\dots,X_{p}$, treatment matrices $W_{1},\dots,W_{q}$, maximum number of leaves $\ell$, regularization parameter $\lambda$.
2: For each treatment $i\in[q]$, initialize the $j$-th cluster matrix $C^{i}_{j}$ such that all observations are in one single cluster: $C_{1}^{i}\leftarrow\mathbf{1}_{n\times T}$ and $C_{j}^{i}\leftarrow\mathbf{0}_{n\times T}$ for $j>1$.
3: for $l=1$ to $\ell-1$ do
4:     Step 1: Estimate the average treatment effect for each cluster:

\left(\hat{M},\hat{\tau},\hat{m}\right)\leftarrow\operatorname*{argmin}_{\begin{subarray}{c}M\in\mathbb{R}^{n\times T},\\ \tau\in\mathbb{R}^{q\times l},\,m\in\mathbb{R}^{n}\end{subarray}}\frac{1}{2}\left\lVert O-M-m\mathbf{1}^{\top}-\sum_{i=1}^{q}\sum_{j=1}^{l}\tau_{i,j}W_{i}\circ C^{i}_{j}\right\rVert^{2}_{\mathrm{F}}+\lambda\left\lVert M\right\rVert_{\star}. (1)

5:     Step 2: The following is performed for each treatment $q^{\prime}\in[q]$ to select the valid single-variable split that minimizes the estimated mean squared error:
  • For $i\in[q]$ and $j\in[l]$, let the function $\mathcal{C}_{j}^{i}(l^{\prime},p^{\prime},x)$ encode the new cluster $C^{i}_{j}$ after splitting the $l^{\prime}$-th cluster for treatment $q^{\prime}$ based on the comparison of the $p^{\prime}$-th covariate to $x$:

    \mathcal{C}_{j}^{i}(l^{\prime},p^{\prime},x):=\begin{cases}C_{l^{\prime}}^{q^{\prime}}\circ I_{\{X_{p^{\prime}}\leq x\}},&\text{for }j=l^{\prime}\text{ and }i=q^{\prime},\\ C_{l^{\prime}}^{q^{\prime}}\circ I_{\{X_{p^{\prime}}>x\}},&\text{for }j=l+1\text{ and }i=q^{\prime},\\ C_{j}^{i},&\text{otherwise.}\end{cases}

  • Identify the split that minimizes the estimated mean squared error: $(l^{\star},p^{\star},x^{\star})\leftarrow\operatorname*{argmin}_{l^{\prime},p^{\prime},x}\widehat{MSE}(l^{\prime},p^{\prime},x)$, where

    \widehat{MSE}(l^{\prime},p^{\prime},x):=\min_{\tau\in\mathbb{R}^{q\times(l+1)}}\left\lVert O-\hat{M}-\hat{m}\mathbf{1}^{\top}-\sum_{i=1}^{q}\sum_{j=1}^{l+1}\tau_{i,j}W_{i}\circ\mathcal{C}^{i}_{j}(l^{\prime},p^{\prime},x)\right\rVert_{\mathrm{F}}^{2}.

  • For $j\in\{l^{\star},l+1\}$, update the cluster $C_{j}^{q^{\prime}}\leftarrow\mathcal{C}_{j}^{q^{\prime}}\left(l^{\star},p^{\star},x^{\star}\right)$.
6: end for
Step 1

We solve (the convex) Problem (1) to obtain a rough estimate of the counterfactual matrix, represented by $\hat{M}+\hat{m}\mathbf{1}^{\top}$, for a given clustering of observations. The regularization term $\lVert M\rVert_{\star}$ penalizes the rank of $M$. The term $m\mathbf{1}^{\top}$ is introduced to center each row of the matrix $\hat{M}$ without penalization from the nuclear norm term. The inclusion of this term deviates from the convex problem in [17], and we find that it improves empirical performance. The choice of $\lambda$ is discussed in our theoretical results.
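
For concreteness, below is a minimal sketch of Step 1 using an off-the-shelf convex solver (CVXPY); the function name and data layout are illustrative assumptions rather than the authors' implementation.

```python
import cvxpy as cp
import numpy as np

def estimate_cluster_effects(O, W_list, C_list, lam):
    """Solve the nuclear-norm-regularized problem (1).

    O      : (n, T) outcome matrix
    W_list : list of q binary treatment matrices, each (n, T)
    C_list : C_list[i][j] is the (n, T) indicator of cluster j for treatment i
    lam    : regularization weight lambda
    """
    n, T = O.shape
    M = cp.Variable((n, T))
    m = cp.Variable((n, 1))
    tau = [[cp.Variable() for _ in C_list[i]] for i in range(len(W_list))]

    effect = sum(tau[i][j] * (W_list[i] * C_list[i][j])   # W_i o C_j^i (numpy elementwise)
                 for i in range(len(W_list))
                 for j in range(len(C_list[i])))
    residual = O - M - m @ np.ones((1, T)) - effect
    objective = 0.5 * cp.sum_squares(residual) + lam * cp.normNuc(M)
    cp.Problem(cp.Minimize(objective)).solve()
    tau_hat = np.array([[t.value for t in row] for row in tau])
    return M.value, tau_hat, m.value
```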

Step 2

In the second step, we greedily choose a valid split to achieve a better approximation of the true heterogeneous treatment effects. A valid split is one that complies with predefined constraints, which will be discussed in our theoretical guarantees. While the provided algorithm selects splits that minimize the mean squared error (MSE), alternative objectives such as mean absolute error (MAE) may be used instead.
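
The sketch below illustrates the greedy split search on a single cluster of a single treatment; it is a simplified, illustrative version (the full criterion in Algorithm 1 re-fits the constants of all clusters and treatments jointly), and all names are hypothetical.

```python
import numpy as np

def split_mse(R, masks):
    """Least-squares fit of constant effects on the given treated-entry masks."""
    A = np.stack([m.ravel() for m in masks], axis=1)      # (n*T, number of masks)
    tau, *_ = np.linalg.lstsq(A, R.ravel(), rcond=None)
    return np.sum((R.ravel() - A @ tau) ** 2)

def best_split(R, W, C, covariates):
    """Search single-covariate splits of cluster C (binary mask) for treatment W.

    R          : residual matrix O - M_hat - m_hat 1^T, shape (n, T)
    covariates : list of (n, T) covariate matrices
    """
    best_score, best_rule = np.inf, None
    for p_idx, X in enumerate(covariates):
        for x in np.unique(X[C.astype(bool)]):            # candidate thresholds
            left, right = C * (X <= x), C * (X > x)
            if (W * left).sum() == 0 or (W * right).sum() == 0:
                continue                                   # keep a treated obs. on each side
            score = split_mse(R, [W * left, W * right])
            if score < best_score:
                best_score, best_rule = score, (p_idx, x)
    return best_rule, best_score
```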

2.2 Estimating the average treatment effect within each cluster

After partitioning all observations into $\ell$ clusters for each of the $q$ treatments, the next step in our algorithm is to estimate the average treatment effect within each cluster. This is done using an extension of the de-biased convex estimator introduced in [17]. We first obtain an estimate of the average treatment effects by solving a convex optimization problem. This initial estimate is then refined through a de-biasing step.

For simplicity, we define separate treatment matrices for each cluster. Letting $k:=q\times\ell$, we define $\tilde{Z}_{1},\dots,\tilde{Z}_{k}$ as follows: $\tilde{Z}_{(i-1)\ell+j}=W_{i}\circ C_{j}^{i}$ for $i\in[q]$, $j\in[\ell]$. These are binary matrices that identify observations that are in a given cluster and receive a given treatment. We denote the true average treatment effect of the treated in each cluster $i\in[k]$ as $\tilde{\tau}^{\star}_{i}$, and let $\delta_{i}$ be the associated residual matrix, which represents the extent to which individual treatment effects differ from the average treatment effect. More precisely, for $i\in[k]$,

\tilde{\tau}^{\star}_{i}:=\big\langle\mathcal{T}^{\star}_{\lceil i/\ell\rceil}(\mathbf{X}),\tilde{Z}_{i}\big\rangle\big/\big\|\tilde{Z}_{i}\big\|_{\text{sum}}\qquad\text{and}\qquad\delta_{i}:=\mathcal{T}^{\star}_{\lceil i/\ell\rceil}(\mathbf{X})\circ\tilde{Z}_{i}-\tilde{\tau}^{\star}_{i}\tilde{Z}_{i}.

Finally, we define $\delta$ as the combined residual error, $\delta:=\sum_{i=1}^{k}\delta_{i}$, and $\hat{E}:=E+\delta$ as the total error.

We normalize all of these treatment matrices to have Frobenius norm 1. This standardization allows the subsequent de-biasing step to be less sensitive to the relative magnitudes of the treatment matrices. Define $Z_{i}:=\tilde{Z}_{i}/\|\tilde{Z}_{i}\|_{\mathrm{F}}$. By scaling down the treatment matrices in this way, our estimates are proportionally scaled up. Similarly, we define a rescaled version of $\tilde{\tau}^{\star}$, given by $\tau^{\star}_{i}:=\tilde{\tau}^{\star}_{i}\|\tilde{Z}_{i}\|_{\mathrm{F}}$.

First, we solve the following convex optimization problem with the normalized treatment matrices to obtain an estimate of $M^{\star}$ and $\tau^{\star}$:

\left(\hat{M},\hat{\tau},\hat{m}\right)\in\operatorname*{argmin}_{M\in\mathbb{R}^{n\times T},\,\tau\in\mathbb{R}^{k},\,m\in\mathbb{R}^{n}}\frac{1}{2}\left\lVert O-M-m\mathbf{1}^{\top}-\sum_{i=1}^{k}\tau_{i}Z_{i}\right\rVert^{2}_{\mathrm{F}}+\lambda\left\lVert M\right\rVert_{\star}. (2)

Because we introduced the $m\mathbf{1}^{\top}$ term in the convex formulation, which centers the rows of $\hat{M}$ without penalization from the nuclear norm term, $\hat{M}$ serves as our estimate of $M^{\star}$ projected to have zero-mean rows, while $\hat{m}$ serves as our estimate of the row means of $M^{\star}$. For notational convenience, let $P_{\mathbf{1}}$ denote the projection onto $\{\alpha\mathbf{1}^{\top}\mid\alpha\in\mathbb{R}^{n}\}$. That is, for any matrix $A\in\mathbb{R}^{n\times T}$,

P_{\mathbf{1}}(A)=A\frac{\mathbf{1}\mathbf{1}^{\top}}{T}\qquad\text{and}\qquad P_{\mathbf{1}^{\perp}}(A)=A\left(I-\frac{\mathbf{1}\mathbf{1}^{\top}}{T}\right).

Hence $P_{\mathbf{1}^{\perp}}(M^{\star})$ is the projection of $M^{\star}$ with zero-mean rows. Let $m^{\star}:=\frac{M^{\star}\mathbf{1}}{T}$ be the vector of row means of $M^{\star}$. Let $r$ denote the rank of $P_{\mathbf{1}^{\perp}}(M^{\star})$ (note that $r\leq\mathrm{rank}(M^{\star})+1$, so $P_{\mathbf{1}^{\perp}}(M^{\star})$ is a low-rank matrix as well), and let $P_{\mathbf{1}^{\perp}}(M^{\star})=U^{\star}\Sigma^{\star}V^{\star\top}$ be its SVD, with $U^{\star}\in\mathbb{R}^{n\times r}$, $\Sigma^{\star}\in\mathbb{R}^{r\times r}$, $V^{\star}\in\mathbb{R}^{T\times r}$. We denote by $\mathbf{T}^{\star}$ the span of the space $\{\alpha\mathbf{1}^{\top}\mid\alpha\in\mathbb{R}^{n}\}$ and the tangent space of $P_{\mathbf{1}^{\perp}}(M^{\star})$ in the manifold of matrices with rank $r$. That is,

\mathbf{T}^{\star}=\left\{U^{\star}A^{\top}+BV^{\star\top}+a\mathbf{1}^{\top}\mid A\in\mathbb{R}^{T\times r},B\in\mathbb{R}^{n\times r},a\in\mathbb{R}^{n}\right\}.

The closed-form expression of the projection $P_{\mathbf{T}^{\star\perp}}(\cdot)$ is given in Lemma B.1.

The following lemma gives a decomposition of $\hat{\tau}-\tau^{\star}$ that suggests a method for de-biasing $\hat{\tau}$. The difference between our error decomposition and the decomposition presented in [17] stems from the fact that we introduced the $m\mathbf{1}^{\top}$ term in (2).

Lemma 2.1 (Error decomposition).

Suppose $(\hat{M},\hat{m},\hat{\tau})$ is a minimizer of (2). Let $\hat{M}=\hat{U}\hat{\Sigma}\hat{V}^{\top}$ be the SVD of $\hat{M}$, and let $\hat{\mathbf{T}}$ denote the span of the tangent space of $\hat{M}$ and $\{\alpha\mathbf{1}^{\top}\mid\alpha\in\mathbb{R}^{n}\}$. Then,

D\left(\hat{\tau}-\tau^{\star}\right)=\Delta^{1}+\Delta^{2}+\Delta^{3}, (3)

where $D\in\mathbb{R}^{k\times k}$ is the matrix with entries $D_{ij}=\langle P_{\hat{\mathbf{T}}^{\perp}}(Z_{i}),P_{\hat{\mathbf{T}}^{\perp}}(Z_{j})\rangle$ and $\Delta^{1},\Delta^{2},\Delta^{3}\in\mathbb{R}^{k}$ are vectors with components

\Delta^{1}_{i}=\lambda\left\langle Z_{i},\hat{U}\hat{V}^{\top}\right\rangle,\quad\Delta^{2}_{i}=\left\langle Z_{i},P_{\hat{\mathbf{T}}^{\perp}}(\hat{E})\right\rangle,\quad\Delta^{3}_{i}=\left\langle Z_{i},P_{\hat{\mathbf{T}}^{\perp}}(M^{\star})\right\rangle.

As noted in [17], since $D^{-1}\Delta^{1}$ depends solely on observed quantities, it can be subtracted from our estimate. Thus, we define our de-biased estimator $\tau^{d}$ as $\tau^{d}_{i}=(\hat{\tau}-D^{-1}\Delta^{1})_{i}/\|\tilde{Z}_{i}\|_{\mathrm{F}}$ for each $i\in[k]$. The rescaling by $\|\tilde{Z}_{i}\|_{\mathrm{F}}$ adjusts our estimate to approximate $\tilde{\tau}^{\star}$, rather than $\tau^{\star}$.
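
A numpy sketch of this de-biasing step is given below. The projection $P_{\hat{\mathbf{T}}^{\perp}}$ is implemented under the assumption that it removes the column space of $\hat{U}$ and the row space spanned by $\hat{V}$ and $\mathbf{1}$ (the paper's closed form is in its Lemma B.1, not reproduced here), and all function and variable names are illustrative.

```python
import numpy as np

def debiased_estimates(tau_hat, U_hat, V_hat, Z_tilde_list, lam):
    """De-bias tau_hat as tau_hat - D^{-1} Delta^1 and rescale by ||Z_tilde_i||_F."""
    T = V_hat.shape[0]
    ones = np.ones((T, 1)) / np.sqrt(T)
    W, _ = np.linalg.qr(np.hstack([V_hat, ones]))          # orthonormal basis of span(V_hat, 1)

    def proj_perp(Z):                                      # assumed form of P_{T_hat^perp}
        return (Z - U_hat @ (U_hat.T @ Z)) @ (np.eye(T) - W @ W.T)

    Z_list = [Zt / np.linalg.norm(Zt) for Zt in Z_tilde_list]   # Frobenius norm 1
    PZ = [proj_perp(Z) for Z in Z_list]
    k = len(Z_list)
    D = np.array([[np.sum(PZ[i] * PZ[j]) for j in range(k)] for i in range(k)])
    Delta1 = lam * np.array([np.sum(Z * (U_hat @ V_hat.T)) for Z in Z_list])
    scales = np.array([np.linalg.norm(Zt) for Zt in Z_tilde_list])
    return (tau_hat - np.linalg.solve(D, Delta1)) / scales
```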

The de-biased $\tau^{d}$ serves as our final estimate. Specifically, if an observation's covariates $X^{zt}$ belong to the $j$-th cluster, then for each treatment $i\in[q]$, our estimate of $\mathcal{T}^{\star}_{i}(X^{zt})$ is given by $\hat{\mathcal{T}}_{i}(X^{zt})=\tau^{d}_{(i-1)\ell+j}$, which is the average treatment effect of treatment $i$ associated with the cluster that $X^{zt}$ belongs to.

3 Theoretical guarantees

Our analysis determines the conditions for the convergence of our algorithm's estimates to the true treatment effect functions $\mathcal{T}_{i}^{\star}(\mathbf{X})$ for each $i\in[q]$. Our proof incorporates ideas from [17, 33].

The final treatment effect estimate given by PaCE has two sources of error. First, there is the approximation error, which stems from the approximation of the non-parametric functions $\mathcal{T}^{\star}_{i}$ by piece-wise constant functions. The second source of error, the estimation error, arises from the estimation of the average treatment effect of each cluster. Note that the deviation between our estimate and the true treatment effect is, at most, the sum of these two errors. We analyze and bound these two sources of error separately to demonstrate the convergence of our method.

3.1 Approximation error

We start by bounding the bias introduced by approximating $\mathcal{T}^{\star}_{i}$ by a piece-wise constant function. To guarantee consistency, we establish constraints on valid splits. We adopt the $\alpha$-regularity condition from Definition 4b of [33]. This condition requires that each split retains at least a fraction $\alpha\in(0,\frac{1}{2})$ of the available training examples and at least one treated observation on each side of the split. We further require that the depth of each leaf be on the order of $\log\ell$ and that the trees are fair-split trees. That is, during the tree construction procedure in Algorithm 1, for any given node, if a covariate $j$ has not been used in the splits of its last $\pi p$ parent nodes, for some $\pi>1$, the next split must utilize covariate $j$. If multiple covariates meet this criterion, the choice between them can be made arbitrarily.

Additionally, we assume that the proportion of treated observations with covariates within a given hyper-rectangle is approximately proportional to the volume of the hyper-rectangle. This is a 'coverage condition' that allows us to accurately estimate the heterogeneous treatment effect on the whole domain using the available observations. For the purposes of this analysis, we assume that all covariates are bounded and normalized to $[0,1]^{p}$.

Assumption 3.1.

Suppose that all covariates belong to $[0,1]^{p}$. Let $x^{(1)}\leq x^{(2)}\in[0,1]^{p}$ be the lower and upper corners of any hyper-rectangle. We assume that the proportion of observations with covariates inside this hyper-rectangle is proportional to the volume of the hyper-rectangle $V:=\prod_{p^{\prime}\in[p]}\left(x_{p^{\prime}}^{(2)}-x_{p^{\prime}}^{(1)}\right)$, with a margin of error $M:=\sqrt{\frac{\ln(nT)(p+1)}{\min(n,T)}}$. More precisely,

\underaccent{\bar}{c}V-c_{m}M\leq\frac{\#\left\{(z,t):x^{(1)}\leq X^{zt}\leq x^{(2)}\right\}}{nT}\leq\bar{c}V+c_{m}M,

for some fixed constants $\bar{c}\geq\underaccent{\bar}{c}>0$ and some $c_{m}>0$.

Assumption 3.1 is a broader condition than the requirement that the covariates be i.i.d. across observations, as assumed in [33]. Our assumption only requires that the covariates maintain an even density. Indeed, Lemma A.2 in Appendix A illustrates that while i.i.d. covariates satisfy the assumption with high probability, the assumption additionally accommodates (with high probability) scenarios where covariates are either constant over time and only vary across different units, or constant across different units and only vary over time.
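
As an illustration, one can probe whether a given covariate tensor roughly satisfies this condition by comparing empirical frequencies with the volumes of random hyper-rectangles, as in the following sketch; the sampling scheme and constants are illustrative choices.

```python
import numpy as np

def worst_coverage_gap(X, n_rect=1000, seed=0):
    """X: covariate tensor of shape (p, n, T), with covariates scaled to [0, 1]."""
    rng = np.random.default_rng(seed)
    p, n, T = X.shape
    margin = np.sqrt(np.log(n * T) * (p + 1) / min(n, T))    # the margin M above
    worst = 0.0
    for _ in range(n_rect):
        a, b = np.sort(rng.uniform(size=(2, p)), axis=0)      # corners x(1) <= x(2)
        inside = np.all((X >= a[:, None, None]) & (X <= b[:, None, None]), axis=0)
        gap = abs(inside.mean() - np.prod(b - a))             # |empirical fraction - volume|
        worst = max(worst, gap / margin)
    return worst  # stays bounded over rectangles when the assumption holds with constants near 1
```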

In the following result, we demonstrate that, as the number of observations grows, a regression tree, subject to some regularity constraints, contains leaves that become increasingly homogeneous in terms of treatment effect.

Theorem 3.2.

Let $\mathbb{T}$ be an $\alpha$-regular, fair-split tree, split into $\ell$ leaves, each of which has a depth of at least $c\log\ell$ for some constant $c$. Suppose that $\ell\leq\left(\frac{\alpha}{2c_{m}M}\right)^{\frac{1}{c\log 1/\alpha}}$ and that Assumption 3.1 is satisfied. Then, for each treatment $i\in[q]$ and every leaf of $\mathbb{T}$, the maximum difference in treatment effects between any two observations within the cluster $C$ of observations in that leaf is upper bounded as follows:

\max_{X_{1},X_{2}\in C}\left\lvert\mathcal{T}_{i}^{\star}(X_{1})-\mathcal{T}_{i}^{\star}(X_{2})\right\rvert\leq\frac{2L_{i}\sqrt{p}}{\ell^{s}},

where $s:=\frac{c}{(\pi+1)p}\frac{\alpha\underaccent{\bar}{c}}{4\bar{c}}$ and $L_{i}$ is the Lipschitz constant of the treatment effect function $\mathcal{T}_{i}^{\star}$.

Using the above theorem, we can make the following observation. For any $i\in[q]$, denote by $\tilde{\tau}^{\star,i}$ the regression tree function that is piece-wise constant on each cluster $C_{j}^{i}$ for $j\in[\ell]$, with the value $\tilde{\tau}^{\star}_{(i-1)\ell+j}$ assigned to cluster $j$. Then, for all covariate vectors $X$, we have $\lvert\mathcal{T}_{i}^{\star}(X)-\tilde{\tau}^{\star,i}(X)\rvert\leq\frac{2L_{i}\sqrt{p}}{\ell^{s}}$ because the value assigned to each cluster by $\tilde{\tau}^{\star,i}$ is an average of treatment effects within that cluster. This demonstrates that it is possible to approximate the true treatment effect functions with a regression tree function to arbitrary precision. As the number of leaves $\ell$ in the tree $\mathbb{T}$ increases, the approximation error of the tree decreases polynomially. The theorem requires an upper bound on $\ell$, which grows with $\min(n,T)$ as the dataset expands, due to the definition of the margin of error $M$. For PaCE, we constrain $\ell$ to be on the order of $\log n$ (we require $k=O(\log n)$ in Theorem 3.5), thereby complying with the upper bound requirement.

3.2 Estimation error

We now bound the estimation error, showing that it converges as $\max(n,T)$ increases. For simplicity, we will assume that $n\gtrsim T$ so that we can write all necessary assumptions and statements in terms of $n$ directly. We stress, however, that even without $n\gtrsim T$, the statement of Theorem 3.5 still holds after replacing all occurrences of $n$ in Assumptions 3.3 and 3.4 and Theorem 3.5 with $\max(n,T)$. Proving this only requires changing $n$ to $\max(n,T)$ in the proofs accordingly.

Our main technical contribution is extending the convergence guarantee from [17] to a setting with multiple treatment matrices and a convex formulation that centers the rows of $\hat{M}$ without penalization. We require a set of conditions that extend the assumptions made in [17] to multiple treatment matrices.

Throughout the paper, we define the subspace $\mathbf{Z}:=\mathrm{span}\{Z_{i},i\in[k]\}+\mathrm{span}\{\alpha\mathbf{1}^{\top}\mid\alpha\in\mathbb{R}^{n}\}$. We next define $D^{\star}\in\mathbb{R}^{k\times k}$ and $\Delta^{\star 1}\in\mathbb{R}^{k}$ similarly to $D$ and $\Delta^{1}$ in Lemma 2.1, except that we replace all quantities associated with $\hat{M}$ with the corresponding quantities for $M^{\star}$. That is, $D^{\star}_{ij}=\langle P_{\mathbf{T}^{\star\perp}}(Z_{i}),P_{\mathbf{T}^{\star\perp}}(Z_{j})\rangle$ and $\Delta^{\star 1}_{i}=\langle Z_{i},U^{\star}V^{\star\top}\rangle$ for $i,j\in[k]$. Now, we are ready to introduce the assumptions that we need to bound the estimation error.

Assumption 3.3 (Identification).

There exist positive constants $c_{r_{1}},c_{r_{2}},c_{s}$ such that

(a) $\left\lVert ZV^{\star}\right\rVert_{\mathrm{F}}^{2}+\left\lVert Z^{\top}U^{\star}\right\rVert_{\mathrm{F}}^{2}\leq\left(1-c_{r_{1}}/\log(n)\right)\left\lVert Z\right\rVert_{\mathrm{F}}^{2}$ for all $Z\in\mathbf{Z}$,

(b) $\left\lVert D^{\star-1}\right\rVert\left\lVert\Delta^{\star 1}\right\rVert\sum_{i=1}^{k}\left\lVert P_{\mathbf{T}^{\star\perp}}(Z_{i})\right\rVert\leq 1-c_{r_{2}}/\log n$,

(c) $\sigma_{\min}(D^{\star})\geq c_{s}/\log n$.

Items 3.3(a) and 3.3(b) are direct generalizations of Assumptions 3(a) and 3(b) from [17] to multiple treatment matrices. Proposition 2 of [17] proves that their Assumptions 3(a) and 3(b) are nearly necessary for identifying $\tau^{\star}$. This is because it is impossible to recover $\tau^{\star}$ if $M^{\star}+\gamma Z$ were also a matrix of rank $r$ for some $\gamma\neq 0$. More generally, if $Z$ lies close to the tangent space of $M^{\star}$ in the space of rank-$r$ matrices, recovering $M^{\star}$ is significantly harder (the signal becomes second order and may be lost in the noise). In our setting with multiple treatment matrices, we further require that there is no collinearity among the treatment matrices; otherwise, we would not be able to distinguish the treatment effect associated with each treatment matrix. Mathematically, this is captured by Item 3.3(c).

Assumption 3.4 (Noise assumptions).

We make the following assumptions about $E$ and $\delta$.

(a) $\left\lVert E\right\rVert,\left\lVert\delta\right\rVert=O(\sigma\sqrt{n})$,

(b) $\left\lvert\left\langle Z_{i},E\right\rangle\right\rvert=O(\sigma\sqrt{n})$ for all $i\in[k]$,

(c) $\left\lvert\left\langle Z_{i},\delta\right\rangle\right\rvert=O(\sigma\sqrt{n})$ for all $i\in[k]$.

The assumptions involving $E$ are mild: they would be satisfied with probability at least $1-e^{-n}$ if $E$ has independent, zero-mean, sub-Gaussian entries with sub-Gaussian norm bounded by $\sigma$ (see Lemma D.10), an assumption that is canonical in the matrix completion literature [17]. Both Item 3.4(a) and Item 3.4(b) were also used in [17]. Because $\delta$ is a zero-mean matrix that is zero outside the support of $\sum_{i\in[k]}Z_{i}$, the assumption $\lVert\delta\rVert=O(\sigma\sqrt{n})$ is mild. In fact, it is satisfied with high probability if the entries of $\delta$ are sub-Gaussian with certain correlation patterns [26]. Furthermore, we note that if there is only one treatment matrix ($k=1$) or if all treatment matrices are disjoint, then Item 3.4(c) is guaranteed to be satisfied. This is because, for each $i\in[k]$, $\delta_{i}$ is a zero-mean matrix that is zero outside of the support of $Z_{i}$. Consequently, in these scenarios, we simply have $\lvert\langle Z_{i},\delta\rangle\rvert=\lvert\langle Z_{i},\delta_{i}\rangle\rvert=0$.

We denote by $\sigma_{\max}$ and $\sigma_{\min}$ the largest and smallest singular values of $P_{\mathbf{1}^{\perp}}(M^{\star})$, respectively, and we let $\kappa:=\sigma_{\max}/\sigma_{\min}$ be the condition number of $P_{\mathbf{1}^{\perp}}(M^{\star})$.

Theorem 3.5.

Suppose that Assumptions 3.3 and 3.4 are satisfied, $k=O(\log n)$, and $\frac{\sigma\sqrt{n}}{\sigma_{\min}}\lesssim\frac{1}{\kappa^{2}r^{2}\log^{12.5}(n)}$. Take $\lambda=\Theta\left(\sigma\sqrt{nr}\log^{4.5}(n)\right)$. Then for sufficiently large $n$, we have

\left\lvert\tau^{d}_{i}-\tilde{\tau}^{\star}_{i}\right\rvert\lesssim\frac{\sigma^{2}r^{2}\kappa n\log^{13.5}(n)}{\sigma_{\min}\|\tilde{Z}_{i}\|_{\mathrm{F}}}+\frac{\log^{1.5}(n)}{\|\tilde{Z}_{i}\|_{\mathrm{F}}}\cdot\max_{j\in[k]}\frac{\left\lvert\left\langle P_{\mathbf{T}^{\star\perp}}(\tilde{Z}_{j}),P_{\mathbf{T}^{\star\perp}}(E+\delta)\right\rangle\right\rvert}{\|\tilde{Z}_{j}\|_{\mathrm{F}}}.

Note that the bound becomes weaker as $\|\tilde{Z}_{i}\|_{\mathrm{F}}$ shrinks. In other words, more treated entries within a cluster result in a more accurate estimate for that cluster, as expected. Our rate in Theorem 3.5 matches, up to $\log n$ and $r$ factors, the optimal error rate obtained in Theorem 1 of [17], which bounds the error in the setting with one treatment matrix. Our bound accumulates extra factors of $\log n$ because we allow $k=O(\log n)$ treatment matrices, and we have an extra factor of $r$ because we assumed $\lVert\delta\rVert\lesssim\sigma\sqrt{n}$ instead of $\lVert\delta\rVert\lesssim\sigma\sqrt{n/r}$. Proposition 1 from [17] proves that their convergence rate is optimal (up to $\log n$ factors) in the "typical scenario" where $\sigma,\kappa,r=O(1)$ and $\sigma_{\min}=\Theta(n)$. We can conclude the same for our rate in Theorem 3.5.

The convergence rate of PaCE is primarily determined by the slower asymptotic convergence rate of the approximation error. However, our result in Theorem 3.5 is of independent interest, as it extends previous results on low-rank matrix recovery with a deterministic pattern of treated entries to allow for multiple treatment matrices.

4 Empirical evaluation

To assess the accuracy of PaCE, we demonstrate its performance on semi-synthetic data. Using publicly available panel data as the baseline $M^{\star}+E$, we introduced an artificial treatment and added a synthetic treatment effect to treated entries to generate the outcome matrix $O$. Since the true heterogeneous treatment effects $\mathcal{T}^{\star}(\mathbf{X})$ are known, we can verify the accuracy of various methods. Our results show that PaCE often surpasses alternative methods in accuracy, while using a tree with no more than 40 leaves to cluster observations. Thus, PaCE not only offers superior accuracy, but also provides a simple, interpretable solution.

Data

To demonstrate the effectiveness of our methodology, we use two publicly available U.S. economic datasets. The first source comprises monthly Supplemental Nutrition Assistance Program (SNAP) user counts by zip code in Massachusetts, spanning January 2017 to March 2023, obtained from the Department of Transitional Assistance (DTA) public records [25]. SNAP is a federal aid program that provides food-purchasing assistance to low-income families [30]. The second data source comprises nine annual demographic and economic data fields for Massachusetts zip codes, U.S. counties, and U.S. states from 2005 to 2022, provided by the U.S. Census Bureau [31].

We use this data to conduct experiments on panels of varying sizes. In the first set of experiments, we use the SNAP user count as the baseline, $M^{\star}+E$, with the zip code-level census data for Massachusetts as covariates. The panel size is 517 zip codes over 70 months. The second set of experiments uses the population below the poverty line (from the U.S. census data) in each state and U.S. territory as the baseline and the remaining census data fields as covariates, with a panel size of 52 states and territories over 18 years. The third set of experiments is similar to the second, but the unit is changed to counties, resulting in a panel size of 770 counties over 18 years. The time period is provided as an additional covariate for the methods that we benchmark against.

Treatment generation

To generate the treatment pattern, we varied two parameters: 1) the proportion $\alpha$ of units receiving the treatment, $\alpha\in\{0.05,0.25,0.5,0.75,1.0\}$, and 2) the functional form of the treatment, either adaptive or non-adaptive. In the non-adaptive case, a random $\alpha$ proportion of units receive the treatment for a random consecutive period of time. In the adaptive case, for each time period, we apply the treatment to the $\alpha/2$ proportion of units that show the largest absolute percentage change in outcome in the previous time period. The purpose of the adaptive treatment is to mimic public policies that target either low-performing or high-performing units.
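
The two pattern generators can be sketched as follows; window lengths, tie-breaking, and the exact timing convention for the adaptive rule are illustrative choices rather than the paper's code.

```python
import numpy as np

def non_adaptive_pattern(n, T, alpha, rng):
    """Treat a random alpha fraction of units for a random consecutive window."""
    W = np.zeros((n, T), dtype=int)
    for z in rng.choice(n, size=int(alpha * n), replace=False):
        start = rng.integers(0, T)
        end = rng.integers(start + 1, T + 1)
        W[z, start:end] = 1
    return W

def adaptive_pattern(baseline, alpha):
    """Treat, in each period, the alpha/2 fraction of units with the largest
    absolute percentage change in outcome observed in the previous period."""
    n, T = baseline.shape
    W = np.zeros((n, T), dtype=int)
    k = max(1, int(round(alpha / 2 * n)))
    for t in range(2, T):
        prev, prev2 = baseline[:, t - 1], baseline[:, t - 2]
        pct_change = np.abs(prev - prev2) / np.maximum(np.abs(prev2), 1e-8)
        W[np.argsort(-pct_change)[:k], t] = 1
    return W
```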

The treatment effect is generated by randomly sampling two covariates and either adding or multiplying them, then normalizing the magnitude of the effects so that the average treatment effect is 20% of the average outcome. For the SNAP experiments, the treatment effect is added to the outcome to simulate an intervention increasing SNAP usage. For the state and county experiments, the treatment effect is subtracted from the outcome to represent an intervention that reduces poverty.
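
A corresponding sketch of the effect generator, with illustrative names and a simple rescaling to the stated 20% target, is:

```python
import numpy as np

def synthetic_effect(X, baseline, W, rng, multiplicative=False):
    """X: (p, n, T) covariates; baseline: (n, T) outcomes; W: (n, T) binary pattern."""
    i, j = rng.choice(X.shape[0], size=2, replace=False)   # two randomly sampled covariates
    raw = X[i] * X[j] if multiplicative else X[i] + X[j]
    target = 0.2 * np.abs(baseline).mean()                 # 20% of the average outcome
    scale = target / max(np.abs(raw[W == 1]).mean(), 1e-12)
    return raw * scale                                     # heterogeneous effect matrix
```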

Each set of parameters is tested on 200 instances, with the treatment pattern and treatment effect being randomly regenerated for each instance.

Benchmark methods

We benchmark PaCE against the following methods: double machine learning (DML) [14, 15], doubly robust (DR) learner [18], XLearner [23], and multi-arm causal forest [6]. (Additional methods are shown in Appendix E; in the main text, we include the five best methods.) We implemented PaCE to split the observations into at most 40 clusters. The hyper-parameter $\lambda$ used in the convex programs (1) and (2) was tuned so that $\hat{M}$ had a rank of $r=6$, which was chosen arbitrarily and fixed across all experiments. During the splitting procedure in Algorithm 1, we utilize binary search on each covariate to find the best split. See Appendix E for implementation details.

Table 1: Proportion of instances in which each method results in the lowest nMAE.

                          Adaptive        Proportion treated                     Effect
                 All      No     Yes      0.05   0.25   0.5    0.75   1.0        Add.   Mult.
SNAP
  PaCE           0.59     0.58   0.60     0.26   0.54   0.76   0.75   0.65       0.64   0.54
  Causal Forest  0.16     0.17   0.16     0.14   0.17   0.16   0.14   0.21       0.02   0.30
  DML            0.11     0.21   0.00     0.12   0.13   0.08   0.09   0.11       0.17   0.04
  LinearDML      0.14     0.04   0.24     0.48   0.15   0.01   0.02   0.04       0.17   0.11
  XLearner       0.00     0.00   0.00     0.00   0.00   0.00   0.00   0.00       0.00   0.00
State
  PaCE           0.41     0.39   0.42     0.23   0.32   0.39   0.54   0.56       0.46   0.36
  Causal Forest  0.05     0.09   0.01     0.06   0.04   0.04   0.05   0.04       0.04   0.05
  DML            0.27     0.29   0.26     0.41   0.36   0.27   0.17   0.16       0.31   0.24
  LinearDML      0.02     0.01   0.04     0.02   0.06   0.01   0.01   0.02       0.02   0.02
  XLearner       0.17     0.13   0.20     0.25   0.14   0.19   0.15   0.11       0.10   0.23
County
  PaCE           0.06     0.10   0.02     0.16   0.08   0.04   0.02   0.01       0.07   0.06
  Causal Forest  0.29     0.38   0.20     0.25   0.28   0.31   0.31   0.30       0.23   0.34
  DML            0.35     0.35   0.35     0.34   0.37   0.34   0.39   0.31       0.47   0.24
  LinearDML      0.20     0.04   0.35     0.22   0.22   0.21   0.16   0.17       0.19   0.20
  XLearner       0.05     0.07   0.04     0.03   0.03   0.06   0.07   0.07       0.01   0.10
Evaluation

We represent the estimated treatment effects by $\hat{\mathcal{T}}(\mathbf{X})$, a matrix whose element in the $z$-th row and $t$-th column is the estimate $\hat{\mathcal{T}}(X^{zt})$. Accuracy of the estimates is measured using the normalized mean absolute error (nMAE), calculated relative to the true treatment effect $\mathcal{T}^{\star}(\mathbf{X})$:

\text{nMAE}\left(\hat{\mathcal{T}}(\mathbf{X}),\mathcal{T}^{\star}(\mathbf{X})\right)=\|\mathcal{T}^{\star}(\mathbf{X})-\hat{\mathcal{T}}(\mathbf{X})\|_{\text{sum}}/\|\mathcal{T}^{\star}(\mathbf{X})\|_{\text{sum}}.
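
A direct numpy transcription of this metric (for illustration) is:

```python
import numpy as np

def nmae(T_hat, T_star):
    """Normalized mean absolute error between estimated and true effect matrices."""
    return np.abs(T_star - T_hat).sum() / np.abs(T_star).sum()
```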

In Table 1, we present the proportion of instances where each method results in the lowest nMAE. We observe that PaCE performs best in the SNAP and state experiments, achieving the highest proportion of instances with the lowest nMAE. However, it often performs worse than DML and Causal Forest in the county experiments. We hypothesize that this is because the county census data, having many units relative to the number of time periods, ‘most resembles i.i.d. data’ and has less underlying panel structure. However, the average nMAE achieved by PaCE on this data set is competitive with that of DML and Causal Forest, as shown in Table 3. Given that Causal Forest and DML use hundreds of trees, PaCE, which uses only one regression tree, is still a great option if interpretability is desired.

In Figure 1, we show the average nMAE as we vary the proportion of units treated. We only show the results for the SNAP experiments with a non-adaptive treatment pattern, and we refer to Table 3 and Table 4 in Appendix E for the results of additional experiments and for the standard deviations. We observe that while the performance of all methods improves as the proportion of treated entries increases, PaCE consistently achieves the best accuracy on this set of experiments.

Figure 1: nMAE for SNAP experiments as the proportion of treated units increases.

5 Discussion and Future Work

This work introduces PaCE, a method designed for estimating heterogeneous treatment effects within panel data. Our technical contributions extend previous results on low-rank matrix recovery with a deterministic pattern of treated entries to allow for multiple treatments. We validate the ability of PaCE to obtain strong accuracy with a simple estimator on semi-synthetic datasets.

One interesting direction for future research is to investigate whether the estimates produced by PaCE are asymptotically normal around the true estimates, which would allow for the derivation of confidence intervals. Further computational testing could also be beneficial. Specifically, assessing the estimator’s performance across a broader array of treatment patterns and datasets would provide deeper insights into its adaptability and limitations. Expanding these tests to include multiple treatments will help in understanding the estimator’s effectiveness in more complex scenarios.

Acknowledgments and Disclosure of Funding

The authors are grateful to Tianyi Peng for valuable and insightful discussions. We acknowledge the MIT SuperCloud [28] and Lincoln Laboratory Supercomputing Center for providing HPC resources that have contributed to the research results reported within this paper. This work was supported by the National Science Foundation Graduate Research Fellowship under Grant No. 2141064. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

References

  • Abadie et al., [2010] Abadie, A., Diamond, A., and Hainmueller, J. (2010). Synthetic control methods for comparative case studies: Estimating the effect of California's tobacco control program. Journal of the American Statistical Association, 105(490):493–505.
  • Abadie and Gardeazabal, [2003] Abadie, A. and Gardeazabal, J. (2003). The economic costs of conflict: A case study of the Basque Country. American Economic Review, 93(1):113–132.
  • Abbe et al., [2020] Abbe, E., Fan, J., Wang, K., and Zhong, Y. (2020). Entrywise eigenvector analysis of random matrices with low expected rank. Annals of statistics, 48(3):1452.
  • Athey et al., [2021] Athey, S., Bayati, M., Doudchenko, N., Imbens, G., and Khosravi, K. (2021). Matrix completion methods for causal panel data models. Journal of the American Statistical Association, 116(536):1716–1730.
  • Athey and Imbens, [2016] Athey, S. and Imbens, G. (2016). Recursive partitioning for heterogeneous causal effects. Proceedings of the National Academy of Sciences, 113(27):7353–7360.
  • Athey et al., [2019] Athey, S., Tibshirani, J., and Wager, S. (2019). Generalized random forests. The Annals of Statistics, 47(3):1148–1178.
  • Bai and Ng, [2021] Bai, J. and Ng, S. (2021). Matrix completion, counterfactuals, and factor analysis of missing data. Journal of the American Statistical Association, 116(536):1746–1763.
  • Breiman, [2001] Breiman, L. (2001). Random forests. Machine learning, 45:5–32.
  • Cai et al., [2010] Cai, J.-F., Candès, E. J., and Shen, Z. (2010). A singular value thresholding algorithm for matrix completion. SIAM Journal on optimization, 20(4):1956–1982.
  • Chatterjee, [2020] Chatterjee, S. (2020). A deterministic theory of low rank matrix completion. IEEE Transactions on Information Theory, 66(12):8046–8055.
  • Chen et al., [2020] Chen, Y., Chi, Y., Fan, J., Ma, C., and Yan, Y. (2020). Noisy matrix completion: Understanding statistical guarantees for convex relaxation via nonconvex optimization. SIAM journal on optimization, 30(4):3098–3121.
  • Chen et al., [2019] Chen, Y., Fan, J., Ma, C., and Yan, Y. (2019). Inference and uncertainty quantification for noisy matrix completion. Proceedings of the National Academy of Sciences, 116(46):22931–22937.
  • Chen et al., [2021] Chen, Y., Fan, J., Ma, C., and Yan, Y. (2021). Bridging convex and nonconvex optimization in robust pca: Noise, outliers, and missing data. Annals of statistics, 49(5):2948.
  • [14] Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., and Robins, J. (2017a). Double/debiased machine learning for treatment and causal parameters. Technical report.
  • [15] Chernozhukov, V., Goldman, M., Semenova, V., and Taddy, M. (2017b). Orthogonal machine learning for demand estimation: High dimensional causal inference in dynamic panels. arXiv, pages arXiv–1712.
  • Devroye, [1977] Devroye, L. P. (1977). A uniform bound for the deviation of empirical distribution functions. Journal of Multivariate Analysis, 7(4):594–597.
  • Farias et al., [2021] Farias, V., Li, A., and Peng, T. (2021). Learning treatment effects in panels with general intervention patterns. Advances in Neural Information Processing Systems, 34:14001–14013.
  • Foster and Syrgkanis, [2023] Foster, D. J. and Syrgkanis, V. (2023). Orthogonal statistical learning. The Annals of Statistics, 51(3):879–908.
  • Foucart et al., [2020] Foucart, S., Needell, D., Pathak, R., Plan, Y., and Wootters, M. (2020). Weighted matrix completion from non-random, non-uniform sampling patterns. IEEE Transactions on Information Theory, 67(2):1264–1290.
  • Gobillon and Magnac, [2016] Gobillon, L. and Magnac, T. (2016). Regional policy evaluation: Interactive fixed effects and synthetic controls. Review of Economics and Statistics, 98(3):535–551.
  • Imbens and Rubin, [2015] Imbens, G. W. and Rubin, D. B. (2015). Causal inference in statistics, social, and biomedical sciences. Cambridge university press.
  • Klopp et al., [2017] Klopp, O., Lounici, K., and Tsybakov, A. B. (2017). Robust matrix completion. Probability Theory and Related Fields, 169:523–564.
  • Künzel et al., [2019] Künzel, S. R., Sekhon, J. S., Bickel, P. J., and Yu, B. (2019). Metalearners for estimating heterogeneous treatment effects using machine learning. Proceedings of the national academy of sciences, 116(10):4156–4165.
  • Liu et al., [2017] Liu, G., Liu, Q., and Yuan, X. (2017). A new theory for matrix completion. Advances in Neural Information Processing Systems, 30.
  • Massachusetts Government, [2024] Massachusetts Government (2024). Department of transitional assistance caseload by zip code reports. https://www.mass.gov/lists/department-of-transitional-assistance-caseload-by-zip-code-reports. Accessed: 2024-05-09.
  • Moon and Weidner, [2016] Moon, H. and Weidner, M. (2016). Linear regression for panel with unknown number of factors as interactive effects. Econometrica, forthcoming.
  • Nie and Wager, [2021] Nie, X. and Wager, S. (2021). Quasi-oracle estimation of heterogeneous treatment effects. Biometrika, 108(2):299–319.
  • Reuther et al., [2018] Reuther, A., Kepner, J., Byun, C., Samsi, S., Arcand, W., Bestor, D., Bergeron, B., Gadepally, V., Houle, M., Hubbell, M., Jones, M., Klein, A., Milechin, L., Mullen, J., Prout, A., Rosa, A., Yee, C., and Michaleas, P. (2018). Interactive supercomputing on 40,000 cores for machine learning and data analysis. In 2018 IEEE High Performance extreme Computing Conference (HPEC), pages 1–6. IEEE.
  • Su et al., [2009] Su, X., Tsai, C.-L., Wang, H., Nickerson, D. M., and Li, B. (2009). Subgroup analysis via recursive partitioning. Journal of Machine Learning Research, 10(2).
  • United States Department of Agriculture, [2024] United States Department of Agriculture (2024). Supplemental Nutrition Assistance Program (SNAP). https://www.fns.usda.gov/snap/supplemental-nutrition-assistance-program. Accessed: May 15, 2024.
  • U.S. Census Bureau, [2023] U.S. Census Bureau (2023). American Community Survey (ACS) 1-Year Data. https://www.census.gov/data/developers/data-sets/acs-1year.html. Accessed data from multiple years between 2005 and 2022. Accessed: 2024-05-09.
  • Vershynin, [2018] Vershynin, R. (2018). High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge university press.
  • Wager and Athey, [2018] Wager, S. and Athey, S. (2018). Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association, 113(523):1228–1242.
  • Xiong et al., [2023] Xiong, R., Athey, S., Bayati, M., and Imbens, G. (2023). Optimal experimental design for staggered rollouts. Management Science.
  • Xiong and Pelger, [2023] Xiong, R. and Pelger, M. (2023). Large dimensional latent factor modeling with missing observations and applications to causal inference. Journal of Econometrics, 233(1):271–301.
  • Yu et al., [2015] Yu, Y., Wang, T., and Samworth, R. J. (2015). A useful variant of the davis–kahan theorem for statisticians. Biometrika, 102(2):315–323.

Appendix A Proofs for the approximation error

Proof of Theorem 3.2.

We define the diameter of a given set $C$ of points as the maximum distance between two points in the set $C$:

\text{Diam}(C)=\sup_{(z_{1},t_{1}),(z_{2},t_{2})\in C}\|X^{z_{1}t_{1}}-X^{z_{2}t_{2}}\|.

Because the treatment effect function $\mathcal{T}^{\star}_{i}$ is Lipschitz continuous for every $i\in[q]$, with Lipschitz constant $L_{i}$, it suffices to show that, for any leaf $x$ in $\mathbb{T}$, the set $C_{x}$ of points within that leaf has a diameter bounded above as follows:

\text{Diam}(C_{x})\leq\frac{2\sqrt{p}}{\ell^{s}}.

In this proof we assume, without loss of generality, that all leaves are at depth exactly $\lfloor c\log\ell\rfloor$. If a leaf were instead even deeper in the tree, the diameter of the associated cluster would be even smaller. We begin by establishing a lower bound for the number of observations $n_{a}$ in any node $a$ of the tree $\mathbb{T}$. This is done by leveraging the fact that every node is at depth at most $c\log\ell$, the upper bound on the number of leaves $\ell$, and the property of $\alpha$-regularity. Specifically, due to $\alpha$-regularity, we have $n_{a}\geq nT\cdot\alpha^{c\log\ell}$. Hence,

n_{a}\geq nT\alpha^{c\log\ell}=nT\ell^{c\log\alpha}\geq nT\left(\frac{\alpha}{2c_{m}M}\right)^{\frac{-1}{c\log\alpha}\cdot c\log\alpha}=\frac{2nT}{\alpha}c_{m}M.

Rearranging, we have the following for any node $a$ of the tree $\mathbb{T}$:

c_{m}M\leq\frac{\alpha n_{a}}{2nT}. (4)

Next, for every covariate $j\in[p]$ and any leaf $x$, we aim to establish an upper bound for the length of $C_{x}$ along the $j$-th coordinate, denoted $\text{Diam}_{j}(C_{x})$. This is defined as follows:

\text{Diam}_{j}(C_{x}):=\sup_{(z_{1},t_{1}),(z_{2},t_{2})\in C_{x}}\left\lvert X^{z_{1}t_{1}}_{j}-X^{z_{2}t_{2}}_{j}\right\rvert.

Utilizing the lower bound in Assumption 3.1, the volume $V_{a}$ of the rectangular hull enclosing the points in a given node $a$ can be upper bounded as follows:

V_{a}\leq\frac{1}{\underaccent{\bar}{c}}\left(\frac{n_{a}}{nT}+c_{m}M\right)\leq\frac{2}{\underaccent{\bar}{c}}\left(\frac{n_{a}}{nT}\right), (5)

where the last inequality made use of (4). Given the $\alpha$-regularity of splits, we know that a child of node $a$ has at most $n_{a}(1-\alpha)$ observations. This implies that the volume $V_{r}$ of the rectangular hull enclosing the points removed by the split on node $a$ satisfies:

V_{r}\geq\frac{1}{\bar{c}}\left(\frac{\alpha n_{a}}{nT}-c_{m}M\right)\geq\frac{1}{2\bar{c}}\left(\frac{\alpha n_{a}}{nT}\right). (6)

Putting (5) and (6) together, we have $V_{r}\geq\left(\frac{\alpha\underaccent{\bar}{c}}{4\bar{c}}\right)V_{a}$, which means that a split along a given covariate removes at least an $\frac{\alpha\underaccent{\bar}{c}}{4\bar{c}}$ fraction of the length along that coordinate.

Because the tree is a fair-split tree, for every $(\pi+1)p$ splits, at least one split is made on every covariate. Hence, the number of splits leading to node $x$ along each coordinate is at least $\left\lfloor\frac{\lfloor c\log\ell\rfloor}{(\pi+1)p}\right\rfloor\geq\frac{c\log\ell}{(\pi+1)p}-2$.

As a result, for every $j\in[p]$, we have $\text{Diam}_{j}(C_{x})\leq\left(1-\frac{\alpha\underaccent{\bar}{c}}{4\bar{c}}\right)^{\frac{c\log\ell}{(\pi+1)p}-2}$, which implies

\text{Diam}(C_{x})\leq\sqrt{p}\left(1-\frac{\alpha\underaccent{\bar}{c}}{4\bar{c}}\right)^{\frac{c}{(\pi+1)p}\log\ell-2}\leq\sqrt{p}\left(1-\frac{\alpha\underaccent{\bar}{c}}{4\bar{c}}\right)^{-2}\ell^{\frac{c}{(\pi+1)p}\log\left(1-\frac{\alpha\underaccent{\bar}{c}}{4\bar{c}}\right)}\leq\sqrt{p}\left(\frac{3}{4}\right)^{-2}\ell^{-\frac{c}{(\pi+1)p}\frac{\alpha\underaccent{\bar}{c}}{4\bar{c}}},

which proves the desired bound. ∎

The following lemma is a generalization of the multivariate case of the Dvoretzky–Kiefer–Wolfowitz inequality. Specifically, it provides a bound on the maximum deviation of the empirical fraction of observations within a hyper-rectangle from its expected value. It will be used in the proof of Lemma A.2.

Lemma A.1.

Let $X$ be a $d$-dimensional random vector with pdf $f_{X}$. Let $X_{1},X_{2},\dots,X_{n}$ be an i.i.d. sequence of $d$-dimensional vectors drawn from this distribution. For any $x_{1},x_{2}\in\mathbb{R}^{d}$, define

G(x_{1},x_{2}):=\Pr[x_{1}\leq X\leq x_{2}]

and

G_{n}(x_{1},x_{2}):=\frac{1}{n}\sum_{i=1}^{n}I_{\{X_{i}\in[x_{1},x_{2}]\}}.

Then, for any $\epsilon>0$, we have both of the following bounds:

\Pr\left[\sup_{x_{1},x_{2}\in\mathbb{R}^{d}}\left\lvert G_{n}(x_{1},x_{2})-G(x_{1},x_{2})\right\rvert>\epsilon\right]\leq 2n^{2d}\exp\left(-2n\epsilon^{2}+16\epsilon d\right),
\Pr\left[\sup_{x_{1},x_{2}\in\mathbb{R}^{d}}\left\lvert G_{n}(x_{1},x_{2})-G(x_{1},x_{2})\right\rvert>\frac{4d}{n}+\epsilon\right]\leq 2n^{2d}\exp\left(-2n\epsilon^{2}\right).
Proof.

We use ideas from the proof of the main theorem in [16] and their notation. Because the i.i.d. samples X1,,XnX_{1},\ldots,X_{n} are drawn from a distribution with a density function, on an almost sure event ={Xi1jXi2j,i1i2[n],j[d]}\mathcal{E}=\{X_{i_{1}j}\neq X_{i_{2}j},i_{1}\neq i_{2}\in[n],j\in[d]\}, they do not share any component in common. For i[n]i\in[n], let Xi=(Xi1,,Xid)X_{i}=\left(X_{i1},\dots,X_{id}\right), and let 𝒴\mathcal{Y} denote the set of random vectors Y1,,YnddY_{1},\dots,Y_{n^{d}}\in\mathbb{R}^{d} that are obtained by considering all ndn^{d} vectors of the form (Xi11,,Xidd)\left(X_{i_{1}1},\dots,X_{i_{d}d}\right), where (i1,,id){1,,n}d\left(i_{1},\dots,i_{d}\right)\in\left\{1,\dots,n\right\}^{d}.

Because GnG_{n} is a staircase function with flat levels everywhere except at points in 𝒴\mathcal{Y}, and GG is monotonic, |Gn(x1,x2)G(x1,x2)|\left\lvert G_{n}(x_{1},x_{2})-G(x_{1},x_{2})\right\rvert is maximized when x1x_{1} and x2x_{2} approach vectors that are in 𝒴\mathcal{Y}. Under the event \mathcal{E}, we have

supx1,x2d|Gn(x1,x2)G(x1,x2)|supi,j|Gn(Yi,Yj)G(Yi,Yj)|+2dn,\displaystyle\sup_{x_{1},x_{2}\in\mathbb{R}^{d}}\left\lvert G_{n}(x_{1},x_{2})-G(x_{1},x_{2})\right\rvert\leq\sup_{i,j}\left\lvert G_{n}(Y_{i},Y_{j})-G(Y_{i},Y_{j})\right\rvert+\frac{2d}{n},

where the 2dn\frac{2d}{n} is due to the fact that, for any i,ji,j there may be up to 2d2d different points in X1,,XnX_{1},\dots,X_{n} that share a component with YiY_{i} or YjY_{j} under the event \mathcal{E}. These 2d2d points lie on the perimeter of the hyper-rectangle between YiY_{i} and YjY_{j} and could be included or excluded from the count for GnG_{n} depending on whether x1x_{1} approaches YiY_{i} (and x2x_{2} approaches YjY_{j}) from the inside or outside of the hyper-rectangle.

Now, it remains to upper bound the right hand side of the above inequality. For any pair of indices i,jndi,j\leq n^{d}, there exists a subset of indices [n]\mathcal{I}\subseteq[n], corresponding to samples of XX that have at least one component in common with either YiY_{i} or YjY_{j}. Hence, the samples indexed by \mathcal{I} are not independent from both YiY_{i} and YjY_{j}. In order to apply Hoeffding’s inequality to the samples, we sample d:=||d^{\prime}:=\left\lvert\mathcal{I}\right\rvert additional i.i.d. random vectors Xn+1,,Xn+dX_{n+1},\dots,X_{n+d^{\prime}} from the same distribution. These serve to ‘substitute’ those samples XlX_{l} for which ll\in\mathcal{I}. Under \mathcal{E}, the number of indices in \mathcal{I} satisfies d2dd^{\prime}\leq 2d, and we have

supi,j|Gn(Yi,Yj)G(Yi,Yj)|+2dn\displaystyle\sup_{i,j}\left\lvert G_{n}(Y_{i},Y_{j})-G(Y_{i},Y_{j})\right\rvert+\frac{2d}{n}
=supi,j|1nk=1nI{Xk[Yi,Yj]}G(Yi,Yj)|+2dn\displaystyle=\sup_{i,j}\left\lvert\frac{1}{n}\sum_{k=1}^{n}I_{\left\{X_{k}\in[Y_{i},Y_{j}]\right\}}-G(Y_{i},Y_{j})\right\rvert+\frac{2d}{n}
supi,j|1nk[n+d]I{Xk[Yi,Yj]}G(Yi,Yj)|+1n|kIXk[Yi,Yj]k=n+1n+dIXk[Yi,Yj]|+2dn\displaystyle\leq\sup_{i,j}\left\lvert\frac{1}{n}\sum_{k\in[n+d^{\prime}]\setminus\mathcal{I}}I_{\left\{X_{k}\in[Y_{i},Y_{j}]\right\}}-G(Y_{i},Y_{j})\right\rvert+\frac{1}{n}\left\lvert\sum_{k\in\mathcal{I}}I_{X_{k}\in[Y_{i},Y_{j}]}-\sum_{k=n+1}^{n+d^{\prime}}I_{X_{k}\in[Y_{i},Y_{j}]}\right\rvert+\frac{2d}{n}
supi,j|1nk[n+d]I{Xk[Yi,Yj]}G(Yi,Yj)|+4dn.\displaystyle\leq\sup_{i,j}\left\lvert\frac{1}{n}\sum_{k\in[n+d^{\prime}]\setminus\mathcal{I}}I_{\left\{X_{k}\in[Y_{i},Y_{j}]\right\}}-G(Y_{i},Y_{j})\right\rvert+\frac{4d}{n}.

The first inequality is due to the triangle inequality, and the second inequality follows because ||=d2d\left\lvert\mathcal{I}\right\rvert=d^{\prime}\leq 2d.

Because X_{k} is independent of both Y_{i} and Y_{j} for every k\in[n+d^{\prime}]\setminus\mathcal{I}, we can apply Hoeffding’s inequality, conditioned on the realizations of Y_{i} and Y_{j}, to establish the following upper bound:

Pr[|1nk[n+d]I{Xk[Yi,Yj]}G(Yi,Yj)|+4dnϵ]\displaystyle\Pr\left[\left\lvert\frac{1}{n}\sum_{k\in[n+d^{\prime}]\setminus\mathcal{I}}I_{\left\{X_{k}\in[Y_{i},Y_{j}]\right\}}-G(Y_{i},Y_{j})\right\rvert+\frac{4d}{n}\geq\epsilon\right] 2exp(2n(ϵ4dn)2).\displaystyle\leq 2\exp\left(-2n\left(\epsilon-\frac{4d}{n}\right)^{2}\right).

Applying a union bound over all pairs i,j, we obtain

Pr[supx1,x2d|Gn(x1,x2)G(x1,x2)|ϵ]\displaystyle\Pr\left[\sup_{x_{1},x_{2}\in\mathbb{R}^{d}}\left\lvert G_{n}(x_{1},x_{2})-G(x_{1},x_{2})\right\rvert\geq\epsilon\right] 2n2dexp(2n(ϵ4dn)2)\displaystyle\leq 2n^{2d}\exp\left(-2n\left(\epsilon-\frac{4d}{n}\right)^{2}\right)
2n2dexp(2nϵ2+16ϵd).\displaystyle\leq 2n^{2d}\exp\left(-2n\epsilon^{2}+16\epsilon d\right).

This proves our first bound. To show the second bound, we note that the first line of the above inequality can be rewritten as

\Pr\left[\sup_{x_{1},x_{2}\in\mathbb{R}^{d}}\left\lvert G_{n}(x_{1},x_{2})-G(x_{1},x_{2})\right\rvert\geq\frac{4d}{n}+\epsilon\right]\leq 2n^{2d}e^{-2n\epsilon^{2}}. ∎
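The following short simulation (our own illustration, not part of the proof) checks the flavor of Lemma A.1 for i.i.d. Uniform[0,1]^d samples: the empirical fraction of points inside hyper-rectangles stays within roughly 4d/n+\epsilon of the true probability mass. The sample size, the choice of \epsilon, and the use of randomly drawn rectangles (instead of all \sim n^{2d} pairs) are assumptions of the sketch.

```python
import numpy as np

# Illustration only (not part of the proof): check the flavor of Lemma A.1 for
# i.i.d. Uniform[0,1]^d samples by comparing the empirical fraction of points
# inside randomly drawn hyper-rectangles with the true probability mass.
rng = np.random.default_rng(0)
n, d = 2000, 2
X = rng.random((n, d))

eps = np.sqrt((d + 1) * np.log(n) / n)        # one convenient choice of eps
worst = 0.0
for _ in range(5000):                         # random rectangles instead of all ~n^{2d} pairs
    lo, hi = np.sort(rng.random((2, d)), axis=0)
    G = np.prod(hi - lo)                      # mass of [lo, hi] under Uniform[0,1]^d
    Gn = np.mean(np.all((X >= lo) & (X <= hi), axis=1))
    worst = max(worst, abs(Gn - G))

print(f"largest deviation over sampled rectangles: {worst:.4f}")
print(f"threshold 4d/n + eps:                      {4 * d / n + eps:.4f}")
```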

Lemma A.2.

For large enough n and T, the covariates X^{zt}\in\mathbb{R}^{p} for z\in[n] and t\in[T] satisfy 3.1 with probability at least 1-\frac{6}{(nT)^{2}} if, for some nonnegative integers x, y, and z that add up to p, every X^{zt} is the concatenation of vectors w^{zt}\in\mathbb{R}^{x},u^{zt}\in\mathbb{R}^{y},v^{zt}\in\mathbb{R}^{z}, which are independently generated as follows:

  1. 1.

    Covariates varying by unit and time: Each wztw^{zt} is i.i.d., sampled from a distribution WW on [0,1]x[0,1]^{x} with density uniformly lower bounded by c¯1/3\underaccent{\bar}{c}^{1/3} and upper bounded by c¯1/3\bar{c}^{1/3}.

  2. 2.

    Covariates varying only across units: Draw nn i.i.d. samples from a distribution UU on [0,1]y[0,1]^{y}, bounded in density between c¯1/3\underaccent{\bar}{c}^{1/3} and c¯1/3\bar{c}^{1/3}. These make up uz1u^{z1} for z[n]z\in[n], with arbitrary ordering. For every zz and t[T]t\in[T], we have uzt=uz1u^{zt}=u^{z1}.

  3. 3.

    Covariates varying only over time: Draw TT i.i.d. samples from a distribution VV on [0,1]z[0,1]^{z} with the same density bounds. These form v1tv^{1t} for t[T]t\in[T], with arbitrary ordering. For every tt and z[n]z\in[n], we have vzt=v1tv^{zt}=v^{1t}.

Proof.

The proof of the lemma proceeds as follows. We first formally define an event \mathcal{E}, which intuitively states that the maximum deviation between the fraction of samples falling within a specified interval and its expected value is bounded. Our second step is to show that the event occurs with high probability, and our final step is to show that under the event \mathcal{E}, 3.1 is satisfied.

We begin by defining our notation. For any random vector DD, we define G^D(a,b)\hat{G}^{D}(a,b) as the empirical fraction of observations that fall within a specified interval (a,b)(a,b), with aba\leq b having the same dimension as DD. The expected value of this quantity is denoted GD(a,b):=Pr[aDb]G^{D}(a,b):=\Pr[a\leq D\leq b]. Note that we have

G^U(a,b)=1nzI{uz1[a,b]}G^V(a,b)=1TtI{v1t[a,b]}G^X(a,b)=1nTz,tI{Xzt[a,b]}.\displaystyle\hat{G}^{U}(a,b)=\frac{1}{n}\sum_{z}I_{\left\{u^{z1}\in[a,b]\right\}}\quad\hat{G}^{V}(a,b)=\frac{1}{T}\sum_{t}I_{\left\{v^{1t}\in[a,b]\right\}}\quad\hat{G}^{X}(a,b)=\frac{1}{nT}\sum_{z,t}I_{\left\{X^{zt}\in[a,b]\right\}}.

Defining event \mathcal{E}. Now we will formally define \mathcal{E}, which is the intersection of three events, \mathcal{E}=\mathcal{E_{A}}\cap\mathcal{E_{B}}\cap\mathcal{E_{C}}. The event \mathcal{E_{A}} (resp. \mathcal{E_{B}}) is the event that the maximum deviation between the fraction of samples from the random variable U (resp. V) falling within an interval and its expected value is bounded. To simplify our notation, we let \epsilon:=\sqrt{\frac{(p+1)\ln(nT)}{\min(n,T)}}. Mathematically,

𝒜:={supu(1)u(2)|G^U(u(1),u(2))GU(u(1),u(2))|ϵ+4yn},\displaystyle\mathcal{E_{A}}:=\left\{\sup_{u^{(1)}\leq u^{(2)}}\left\lvert\hat{G}^{U}(u^{(1)},u^{(2)})-G^{U}(u^{(1)},u^{(2)})\right\rvert\leq\epsilon+\frac{4y}{n}\right\},
:={supv(1)v(2)|G^V(v(1),v(2))GV(v(1),v(2))|ϵ+4zT}.\displaystyle\mathcal{E_{B}}:=\left\{\sup_{v^{(1)}\leq v^{(2)}}\left\lvert\hat{G}^{V}(v^{(1)},v^{(2)})-G^{V}(v^{(1)},v^{(2)})\right\rvert\leq\epsilon+\frac{4z}{T}\right\}.

We now turn to defining event \mathcal{E_{C}}, which intuitively bounds the deviation for the W samples. In the event \mathcal{E_{C}}, rather than considering all of the samples of W, we only consider the samples of W whose corresponding U and V samples satisfy given constraints. That is, \mathcal{E_{C}} is the event that, for any constraints on U and V, the deviation between the following two quantities is bounded: 1) among the observations whose U and V samples satisfy the constraints, the proportion whose W sample falls within a specified interval, and 2) the expected fraction of W samples falling within that interval. We define the event \mathcal{E_{C}} formally below.

To define event 𝒞\mathcal{E_{C}}, we begin by defining SUS_{U} (resp. SVS_{V}), which is the set of all possible subsets of units (resp. time periods) selected by some constraint on UU (resp. VV). Given the random variables uz1u^{z1} for z[n]z\in[n], we denote by u~1,,u~ny\tilde{u}_{1},\ldots,\tilde{u}_{n^{y}} all nyn^{y} vectors formed by selecting its ii-th coordinate (for each i[y]i\in[y]) from this set of observed values: {uiz1,z[n]}\{u^{z1}_{i},z\in[n]\}. We similarly define v~1,,v~Tz\tilde{v}_{1},\ldots,\tilde{v}_{T^{z}} using the variables v1tv^{1t} for t[T]t\in[T]. Next, for any choice of indices i1,i2[ny]i_{1},i_{2}\in[n^{y}] such that u~i1u~i2\tilde{u}_{i_{1}}\leq\tilde{u}_{i_{2}}, we denote by 𝒵(i1,i2)\mathcal{Z}(i_{1},i_{2}) the subset of units that lie in the hyper-rectangle defined by u~i1\tilde{u}_{i_{1}} and u~i2\tilde{u}_{i_{2}}, that is

𝒵(i1,i2):={z[n],u~i1uz1u~i2}.\mathcal{Z}(i_{1},i_{2}):=\left\{z\in[n],\tilde{u}_{i_{1}}\leq u^{z1}\leq\tilde{u}_{i_{2}}\right\}.

We define 𝒯(j1,j2)\mathcal{T}(j_{1},j_{2}) similarly, for any j1,j2[Tz]j_{1},j_{2}\in[T^{z}] with v~j1v~j2\tilde{v}_{j_{1}}\leq\tilde{v}_{j_{2}}. Finally, SU:={𝒵(i1,i2):i1,i2[ny],u~i1u~i2}S_{U}:=\{\mathcal{Z}(i_{1},i_{2}):i_{1},i_{2}\in[n^{y}],\tilde{u}_{i_{1}}\leq\tilde{u}_{i_{2}}\} and SVS_{V} is similarly defined. We are now ready to define the last event:

𝒞:={supw(1)w(2)|1|𝒵||𝒯|z𝒵,t𝒯Iwzt[w(1),w(2)]GW(w(1),w(2))|nTϵ+4x|𝒵||𝒯|\displaystyle\mathcal{E_{C}}:=\left\{\sup_{w^{(1)}\leq w^{(2)}}\left\lvert\frac{1}{|\mathcal{Z}||\mathcal{T}|}\sum_{z\in\mathcal{Z},t\in\mathcal{T}}I_{w^{zt}\in[w^{(1)},w^{(2)}]}-G^{W}(w^{(1)},w^{(2)})\right\rvert\leq\frac{nT\epsilon+4x}{|\mathcal{Z}||\mathcal{T}|}\right.
𝒵SU,𝒯SV}.\displaystyle\left.\forall\mathcal{Z}\in S_{U},\mathcal{T}\in S_{V}\right\}.

Bounding the probability of event \mathcal{E}. By Lemma A.1, we directly have

Pr(𝒜)\displaystyle\Pr(\mathcal{E_{A}}) 12n2ye2nϵ2\displaystyle\geq 1-2n^{2y}e^{-2n\epsilon^{2}}
Pr()\displaystyle\Pr(\mathcal{E_{B}}) 12T2ze2Tϵ2.\displaystyle\geq 1-2T^{2z}e^{-2T\epsilon^{2}}.

We now show that 𝒞\mathcal{E_{C}} occurs with high probability. Note that conditioned on the realizations of uz1u^{z1} and v1tv^{1t} for z[n]z\in[n] and t[T]t\in[T], and for any choice of unit and time period subsets 𝒵SU\mathcal{Z}\in S_{U} and 𝒯SV\mathcal{T}\in S_{V}, all the random variables wztw^{zt} for z𝒵,t𝒯z\in\mathcal{Z},t\in\mathcal{T} are still i.i.d. according to the distribution WW since they are sampled independently from samples of UU and VV. As a result, we can apply Lemma A.1 to obtain

supw(1)w(2)|1|𝒵||𝒯|z𝒵,t𝒯Iwzt[w(1),w(2)]GW(w(1),w(2))|nTϵ+4x|𝒵||𝒯|,\sup_{w^{(1)}\leq w^{(2)}}\left\lvert\frac{1}{|\mathcal{Z}||\mathcal{T}|}\sum_{z\in\mathcal{Z},t\in\mathcal{T}}I_{w^{zt}\in[w^{(1)},w^{(2)}]}-G^{W}(w^{(1)},w^{(2)})\right\rvert\leq\frac{nT\epsilon+4x}{|\mathcal{Z}||\mathcal{T}|},

with probability at least 12(|𝒵||𝒯|)2xe2(nTϵ)2/(|𝒵||𝒯|)12(nT)2xe2nTϵ21-2(|\mathcal{Z}||\mathcal{T}|)^{2x}e^{-2(nT\epsilon)^{2}/(|\mathcal{Z}||\mathcal{T}|)}\geq 1-2(nT)^{2x}e^{-2nT\epsilon^{2}} for any given 𝒵\mathcal{Z} and 𝒯\mathcal{T} and for any realization of the random variables uz1u^{z1} for z[n]z\in[n] and v1tv^{1t} for t[T]t\in[T]. Taking the union bound, over all choices of indices i1,i2[ny]i_{1},i_{2}\in[n^{y}] and j1,j2[Tz]j_{1},j_{2}\in[T^{z}] then shows that

Pr(𝒞uzt,vzt;z[n],t[T])12n2yT2z(nT)2xe2nTϵ2.\Pr(\mathcal{E_{C}}\mid u^{zt},v^{zt};\forall z\in[n],t\in[T])\geq 1-2n^{2y}T^{2z}(nT)^{2x}e^{-2nT\epsilon^{2}}.

In particular, the same lower bound still holds for Pr(𝒞)\Pr(\mathcal{E_{C}}) after taking the expectation over the samples from UU and VV. Putting all the probabilistic bounds together, we obtain

Pr[𝒜𝒞]16n2(x+y)T2(x+z)e2min(n,T)ϵ2.\Pr[\mathcal{E_{A}}\cap\mathcal{E_{B}}\cap\mathcal{E_{C}}]\geq 1-6n^{2(x+y)}T^{2(x+z)}e^{-2\min(n,T)\epsilon^{2}}.

Plugging in ϵ=(p+1)ln(nT)min(n,T)\epsilon=\sqrt{\frac{(p+1)\ln(nT)}{\min(n,T)}}, this event has probability at least 16(nT)21-\frac{6}{(nT)^{2}}.

3.1 is satisfied under event \mathcal{E}. We suppose that event \mathcal{E} holds for the rest of this proof. Consider an arbitrary hyper-rectangle in [0,1]p[0,1]^{p} with lower and upper corners x(1)x^{(1)} and x(2)x^{(2)}, where x(1)x(2)x^{(1)}\leq x^{(2)}. For convenience, we decompose x(1)x^{(1)} into three distinct vectors: w(1)xw^{(1)}\in\mathbb{R}^{x}, u(1)yu^{(1)}\in\mathbb{R}^{y}, and v(1)zv^{(1)}\in\mathbb{R}^{z}, such that the concatenation of these three vectors forms x(1)x^{(1)}. Similarly, x(2)x^{(2)} is decomposed into vectors w(2)w^{(2)}, u(2)u^{(2)}, and v(2)v^{(2)}. Our goal is to bound the number of observations that simultaneously satisfy the constraints:

w(1)wztw(2)u(1)uztu(2)v(1)vztv(2).\displaystyle w^{(1)}\leq w^{zt}\leq w^{(2)}\qquad u^{(1)}\leq u^{zt}\leq u^{(2)}\qquad v^{(1)}\leq v^{zt}\leq v^{(2)}.

First, the set \mathcal{Z} of units satisfying the constraint u^{(1)}\leq u^{z1}\leq u^{(2)} is one of the sets in S_{U}. Similarly, the set \mathcal{T} of time periods satisfying the constraint v^{(1)}\leq v^{1t}\leq v^{(2)} is one of the sets in S_{V}. As a result, because \mathcal{E_{C}} is satisfied, we have

\left\lvert\frac{nT}{|\mathcal{Z}||\mathcal{T}|}\hat{G}^{X}(x^{(1)},x^{(2)})-G^{W}(w^{(1)},w^{(2)})\right\rvert\leq\frac{nT\epsilon+4x}{|\mathcal{Z}||\mathcal{T}|}.

Multiplying both sides by |𝒵||𝒯|nT\frac{|\mathcal{Z}||\mathcal{T}|}{nT}, we have

|G^X(x(1),x(2))|𝒵||𝒯|nTGW(w(1),w(2))|ϵ+4xnT.\left\lvert\hat{G}^{X}(x^{(1)},x^{(2)})-\frac{|\mathcal{Z}||\mathcal{T}|}{nT}G^{W}(w^{(1)},w^{(2)})\right\rvert\leq\epsilon+\frac{4x}{nT}. (7)

We then use the events 𝒜\mathcal{E_{A}} and \mathcal{E_{B}} to bound the number of units and times satisfying their respective constraints:

||𝒵|nGU(u(1),u(2))|ϵ+4yn,||𝒯|TGV(v(1),v(2))|ϵ+4zT.\displaystyle\left\lvert\frac{|\mathcal{Z}|}{n}-G^{U}(u^{(1)},u^{(2)})\right\rvert\leq\epsilon+\frac{4y}{n},\qquad\left\lvert\frac{|\mathcal{T}|}{T}-G^{V}(v^{(1)},v^{(2)})\right\rvert\leq\epsilon+\frac{4z}{T}. (8)

Now let P:=G^{U}(u^{(1)},u^{(2)})G^{V}(v^{(1)},v^{(2)})G^{W}(w^{(1)},w^{(2)}). Provided that n and T are large enough such that \frac{4x}{nT},\frac{4y}{n},\frac{4z}{T}\leq\epsilon\leq\frac{1}{4}, inequality (8) implies \left\lvert\frac{|\mathcal{Z}||\mathcal{T}|}{nT}-G^{U}(u^{(1)},u^{(2)})G^{V}(v^{(1)},v^{(2)})\right\rvert\leq 4\epsilon, which, combined with (7) and the fact that G^{U},G^{V},G^{W}\in[0,1], implies

|G^X(x(1),x(2))P|7ϵ.\left\lvert\hat{G}^{X}(x^{(1)},x^{(2)})-P\right\rvert\leq 7\epsilon. (9)

Because of the assumed density bounds on UU, VV, and WW, we have

c¯i=1p(xi(2)xi(1))Pc¯i=1p(xi(2)xi(1)).\displaystyle\underaccent{\bar}{c}\prod_{i=1}^{p}\left(x_{i}^{(2)}-x_{i}^{(1)}\right)\leq P\leq\bar{c}\prod_{i=1}^{p}\left(x_{i}^{(2)}-x_{i}^{(1)}\right).

Combining this with (9) completes the proof. ∎
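As an illustration of the covariate-generating process assumed in Lemma A.2 (our own sketch; the dimensions and the uniform distributions are chosen for simplicity and satisfy the required density bounds), the three covariate types can be generated as follows.

```python
import numpy as np

# Sketch (not from the paper) of the covariate-generating process in Lemma A.2:
# X^{zt} concatenates covariates that vary by unit and time (w), by unit only (u),
# and by time only (v); here all are Uniform[0,1], so the density bounds hold.
rng = np.random.default_rng(0)
n, T = 50, 40
x_dim, y_dim, z_dim = 2, 1, 1                 # dims of w, u, v; p = x_dim + y_dim + z_dim

w = rng.random((n, T, x_dim))                 # varies with both unit z and time t
u = rng.random((n, 1, y_dim))                 # varies only across units
v = rng.random((1, T, z_dim))                 # varies only over time

X = np.concatenate(
    [w, np.broadcast_to(u, (n, T, y_dim)), np.broadcast_to(v, (n, T, z_dim))],
    axis=-1,
)
print(X.shape)                                # (50, 40, 4): X[z, t] is X^{zt} in R^p
```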

Appendix B Proof of Lemma 2.1

Lemma B.1.

Define the matrix subspace 𝐓\mathbf{T} as follows:

𝐓\displaystyle\mathbf{T} ={UA+BV+a𝟏AT×r,Bn×r,an}.\displaystyle=\left\{UA^{\top}+BV^{\top}+a\mathbf{1}^{\top}\mid A\in\mathbb{R}^{T\times r},B\in\mathbb{R}^{n\times r},a\in\mathbb{R}^{n}\right\}.

The projection onto its orthogonal space is

P𝐓(X)\displaystyle P_{{\mathbf{T}}^{\perp}}(X) =(IUU)X(Irrr2)(IVV),where r=(IVV)𝟏\displaystyle=(I-UU^{\top})X\left(I-\frac{rr^{\top}}{\left\lVert r\right\rVert^{2}}\right)(I-VV^{\top}),\quad\text{where }r=(I-VV^{\top})\mathbf{1}
Proof.

For a given matrix XX, we define P:=(IUU)X(Irrr2)(IVV)P:=(I-UU^{\top})X\left(I-\frac{rr^{\top}}{\left\lVert r\right\rVert^{2}}\right)(I-VV^{\top}). To show that P=P𝐓(X)P=P_{{\mathbf{T}}^{\perp}}(X), it suffices to show that P𝐓P\in\mathbf{T}^{\perp} and XP𝐓X-P\in\mathbf{T}.

First, note that UP=0U^{\top}P=0 and PV=0PV=0 because of the construction of PP. This implies that for any matrices AT×rA\in\mathbb{R}^{T\times r} and Bn×rB\in\mathbb{R}^{n\times r}, we have P,UA=P,BV=0\langle P,UA^{\top}\rangle=\langle P,BV^{\top}\rangle=0. Next, we note that

P𝟏=(IUU)X(Irrr2)r=0.P\mathbf{1}=(I-UU^{\top})X\left(I-\frac{rr^{\top}}{\left\lVert r\right\rVert^{2}}\right)r=0.

As a result, we also have P,a𝟏=0\langle P,a\mathbf{1}^{\top}\rangle=0 for all ana\in\mathbb{R}^{n}. This ends the proof that P𝐓P\in\mathbf{T}^{\perp}.

Next, we compute

XP\displaystyle X-P =UUX(Irrr2)(IVV)A+X(Irrr2)VBV+Xrrr2\displaystyle=U\underbrace{U^{\top}X\left(I-\frac{rr^{\top}}{\left\lVert r\right\rVert^{2}}\right)(I-VV^{\top})}_{A^{{}^{\prime}\top}}+\underbrace{X\left(I-\frac{rr^{\top}}{\left\lVert r\right\rVert^{2}}\right)V}_{B^{\prime}}V^{\top}+X\frac{rr^{\top}}{\left\lVert r\right\rVert^{2}}
=UA+(BXr𝟏Vr2)V+Xrr2𝟏\displaystyle=UA^{{}^{\prime}\top}+\left(B^{\prime}-X\frac{r\mathbf{1}^{\top}V}{\left\lVert r\right\rVert^{2}}\right)V^{\top}+X\frac{r}{\left\lVert r\right\rVert^{2}}\mathbf{1}^{\top}

Hence, XP𝐓X-P\in\mathbf{T}, which concludes the proof of the lemma. ∎
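A quick numerical sanity check of the projection formula in Lemma B.1 (illustration only; the dimensions and random matrices below are arbitrary): the matrix P computed from the closed form should be orthogonal to every matrix of the form UA^{\top}, BV^{\top}, or a\mathbf{1}^{\top}.

```python
import numpy as np

# Numerical check (illustration only) of the projection formula in Lemma B.1.
rng = np.random.default_rng(1)
n, T, r_rank = 8, 6, 2
U, _ = np.linalg.qr(rng.standard_normal((n, r_rank)))   # orthonormal columns
V, _ = np.linalg.qr(rng.standard_normal((T, r_rank)))
X = rng.standard_normal((n, T))
ones = np.ones((T, 1))

r = (np.eye(T) - V @ V.T) @ ones
P = (np.eye(n) - U @ U.T) @ X @ (np.eye(T) - r @ r.T / (r.T @ r)) @ (np.eye(T) - V @ V.T)

print(np.allclose(U.T @ P, 0))    # <P, U A^T> = 0 for every A
print(np.allclose(P @ V, 0))      # <P, B V^T> = 0 for every B
print(np.allclose(P @ ones, 0))   # <P, a 1^T> = 0 for every a
```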

Proof of Lemma 2.1.

It is known [9] that the set of subgradients of the nuclear norm of an arbitrary matrix Xn1×n2X\in\mathbb{R}^{n_{1}\times n_{2}} with SVD X=UΣVX=U\Sigma V^{\top} is

X={UV+W:Wn1×n2,UW=0,WV=0,W1}.\displaystyle\partial\left\lVert X\right\rVert_{\star}=\left\{UV^{\top}+W:W\in\mathbb{R}^{n_{1}\times n_{2}},U^{\top}W=0,WV=0,\left\lVert W\right\rVert\leq 1\right\}.

Thus, the first-order optimality conditions of (2) are

Zj,OM^m^𝟏i=1kτ^iZi\displaystyle\left\langle Z_{j},O-\hat{M}-\hat{m}\mathbf{1}^{\top}-\sum_{i=1}^{k}\hat{\tau}_{i}Z_{i}\right\rangle =0for j=1,2,,k\displaystyle=0\qquad\text{for }j=1,2,\dots,k (10a)
OM^m^𝟏i=1kτ^iZi\displaystyle O-\hat{M}-\hat{m}\mathbf{1}^{\top}-\sum_{i=1}^{k}\hat{\tau}_{i}Z_{i} =λ(U^V^+W)\displaystyle=\lambda\left(\hat{U}\hat{V}^{\top}+W\right) (10b)
U^W\displaystyle\hat{U}^{\top}W =0\displaystyle=0 (10c)
WV^\displaystyle W\hat{V} =0\displaystyle=0 (10d)
W\displaystyle\left\lVert W\right\rVert 1\displaystyle\leq 1 (10e)
m^\displaystyle\hat{m} =1T(OM^i=1kτ^iZi)𝟏.\displaystyle=\frac{1}{T}\left(O-\hat{M}-\sum_{i=1}^{k}\hat{\tau}_{i}Z_{i}\right)\mathbf{1}. (10f)

Combining equations (10a) and (10b), we have

Zj,λ(U^V^+W)\displaystyle\left\langle Z_{j},\lambda\left(\hat{U}\hat{V}^{\top}+W\right)\right\rangle =0for j=1,2,,k\displaystyle=0\qquad\text{for }j=1,2,\dots,k (11)

By the definition of our model, we have O=M+i=1kτiZi+E^.O=M^{\star}+\sum_{i=1}^{k}\tau^{\star}_{i}Z_{i}+\hat{E}. We plug this into equation (10b):

MM^m^𝟏+i=1k(τiτ^i)Zi+E^\displaystyle M^{\star}-\hat{M}-\hat{m}\mathbf{1}^{\top}+\sum_{i=1}^{k}\left(\tau^{\star}_{i}-\hat{\tau}_{i}\right)Z_{i}+\hat{E} =λ(U^V^+W).\displaystyle=\lambda\left(\hat{U}\hat{V}^{\top}+W\right).

We apply P𝐓^()P_{\hat{\mathbf{T}}^{\perp}}(\cdot) to both sides, where the formula for the projection is given in Lemma B.1:

P𝐓^(M+E^)+i=1k(τiτ^i)P𝐓^(Zi)B\displaystyle\underbrace{P_{\hat{\mathbf{T}}^{\perp}}(M^{\star}+\hat{E})+\sum_{i=1}^{k}\left(\tau^{\star}_{i}-\hat{\tau}_{i}\right)P_{\hat{\mathbf{T}}^{\perp}}(Z_{i})}_{B}
=λP𝐓^(W)\displaystyle=\lambda P_{\hat{\mathbf{T}}^{\perp}}\left(W\right)
=λ(IU^U^)W(I(IV^V^)𝟏𝟏(IV^V^)(IV^V^)𝟏2)(IV^V^)\displaystyle=\lambda(I-\hat{U}\hat{U}^{\top})W\left(I-\frac{(I-\hat{V}\hat{V}^{\top})\mathbf{1}\mathbf{1}^{\top}(I-\hat{V}\hat{V}^{\top})}{\left\lVert(I-\hat{V}\hat{V}^{\top})\mathbf{1}\right\rVert^{2}}\right)(I-\hat{V}\hat{V}^{\top})
=λW(I𝟏𝟏(IV^V^)(IV^V^)𝟏2)\displaystyle=\lambda W\left(I-\frac{\mathbf{1}\mathbf{1}^{\top}(I-\hat{V}\hat{V}^{\top})}{\left\lVert(I-\hat{V}\hat{V}^{\top})\mathbf{1}\right\rVert^{2}}\right)
=λWλW(𝟏𝟏T).\displaystyle=\lambda W-\lambda W\left(\frac{\mathbf{1}\mathbf{1}^{\top}}{T}\right).

The last line makes use of the following claim:

Claim B.2.

V^𝟏=0\hat{V}^{\top}\mathbf{1}=0.

To compute λW(𝟏𝟏T)\lambda W\left(\frac{\mathbf{1}\mathbf{1}^{\top}}{T}\right), we right-multiply (10b) by (𝟏𝟏T)\left(\frac{\mathbf{1}\mathbf{1}^{\top}}{T}\right). We note that by (10f), we have (OM^m^𝟏i=1kτ^iZi)𝟏=0\left(O-\hat{M}-\hat{m}\mathbf{1}^{\top}-\sum_{i=1}^{k}\hat{\tau}_{i}Z_{i}\right)\mathbf{1}=0. Hence, we find

λW(𝟏𝟏T)=λU^V^(𝟏𝟏T)=(i)0,\displaystyle\lambda W\left(\frac{\mathbf{1}\mathbf{1}^{\top}}{T}\right)=-\lambda\hat{U}\hat{V}^{\top}\left(\frac{\mathbf{1}\mathbf{1}^{\top}}{T}\right)\overset{(i)}{=}0,

where (i)(i) is due to B.2. As a result, we can conclude that B=λWB=\lambda W.

Substituting λW=B\lambda W=B into (11), we obtain for all j=1,2,,kj=1,2,\dots,k:

Zj,λU^V^+P𝐓^(M+E^)+i=1k(τiτ^i)P𝐓^(Zi)=0\displaystyle\left\langle Z_{j},\lambda\hat{U}\hat{V}^{\top}+P_{\hat{\mathbf{T}}^{\perp}}(M^{\star}+\hat{E})+\sum_{i=1}^{k}\left(\tau^{\star}_{i}-\hat{\tau}_{i}\right)P_{\hat{\mathbf{T}}^{\perp}}(Z_{i})\right\rangle=0 (12)

Rearranging, we have

[P𝐓^(Zi),P𝐓^(Zj)]i,j[k](τ^τ)=[Zi,λU^V^+P𝐓^(M+E^)]i[k]\displaystyle\left[\left\langle P_{\hat{\mathbf{T}}^{\perp}}(Z_{i}),P_{\hat{\mathbf{T}}^{\perp}}(Z_{j})\right\rangle\right]_{i,j\in[k]}\left(\hat{\tau}-\tau^{\star}\right)=\left[\left\langle Z_{i},\lambda\hat{U}\hat{V}^{\top}+P_{\hat{\mathbf{T}}^{\perp}}(M^{\star}+\hat{E})\right\rangle\right]_{i\in[k]}

This is equivalent to the error decomposition in the statement of the lemma, completing the proof. ∎

Proof of B.2.

Suppose for contradiction that \hat{V}^{\top}\mathbf{1}\neq 0. Then, there is a solution for which the objective is lower, contradicting the optimality of \hat{M},\hat{\tau},\hat{m}. That solution can be constructed as follows. Define V^{\prime} to be the matrix resulting from projecting each column of \hat{V} onto the space orthogonal to the vector \mathbf{1}. That is, the i-th column of V^{\prime} is V^{\prime}_{i}=\hat{V}_{i}-\mathbf{1}\left(\frac{\hat{V}_{i}^{\top}\mathbf{1}}{\mathbf{1}^{\top}\mathbf{1}}\right). Let m^{\prime} be the vector such that m^{\prime}\mathbf{1}^{\top}=\hat{U}\hat{\Sigma}\left(\hat{V}-V^{\prime}\right)^{\top}+\hat{m}\mathbf{1}^{\top}. The solution \left(\hat{U}\hat{\Sigma}V^{\prime\top},\hat{\tau},m^{\prime}\right) achieves a lower objective value because the value of the expression inside the Frobenius norm remains unchanged, but \left\lVert\hat{U}\hat{\Sigma}V^{\prime\top}\right\rVert_{\star}<\left\lVert\hat{U}\hat{\Sigma}\hat{V}^{\top}\right\rVert_{\star}=\left\lVert\hat{M}\right\rVert_{\star}. The inequality is due to the fact that the matrix V^{\prime} is still an orthogonal matrix, but its columns have norm at most 1, and at least one of its columns has norm strictly less than 1. ∎
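As an aside, the convex program (2) can be solved by a simple block coordinate descent scheme. The sketch below is our own illustration: it assumes the objective \frac{1}{2}\left\lVert O-M-m\mathbf{1}^{\top}-\sum_{i}\tau_{i}Z_{i}\right\rVert_{\mathrm{F}}^{2}+\lambda\left\lVert M\right\rVert_{\star}, i.e., the form consistent with the optimality conditions (10a)–(10f) above, and it is not meant as a reference implementation of PaCE.

```python
import numpy as np

def svt(A, lam):
    # Singular value thresholding: the proximal operator of lam * ||.||_* at A.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U @ np.diag(np.maximum(s - lam, 0.0)) @ Vt

def fit(O, Zs, lam, iters=500):
    # Exact block updates: m via row means (cf. (10f)), tau via a small
    # least-squares problem, and M via singular value thresholding.
    n, T = O.shape
    M, m, tau = np.zeros((n, T)), np.zeros(n), np.zeros(len(Zs))
    A = np.stack([Z.reshape(-1) for Z in Zs], axis=1)      # nT-by-k design for tau
    for _ in range(iters):
        m = (O - M - sum(t * Z for t, Z in zip(tau, Zs))).mean(axis=1)
        R = O - M - np.outer(m, np.ones(T))
        tau, *_ = np.linalg.lstsq(A, R.reshape(-1), rcond=None)
        M = svt(O - np.outer(m, np.ones(T)) - sum(t * Z for t, Z in zip(tau, Zs)), lam)
    return M, tau, m
```

Each block update is exact: the m-step mirrors (10f), the \tau-step is a small least-squares problem, and the M-step is the proximal operator of the nuclear norm (singular value thresholding).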

Appendix C Proof of Theorem 3.5

We aim to bound |τidτ~i|\left\lvert\tau^{d}_{i}-\tilde{\tau}^{\star}_{i}\right\rvert. Throughout this section, we assume that the assumptions made in the statement of Theorem 3.5 are satisfied. To simplify notation, we define ηi:=Z~iF\eta_{i}:=\|\tilde{Z}_{i}\|_{\mathrm{F}}. We have

|τidτ~i|τdτ~=1ηiηiτdτ=(i)1ηiD1(Δ2+Δ3)1σmin(D)ηi(Δ2+Δ3),\displaystyle\begin{split}\left\lvert\tau^{d}_{i}-\tilde{\tau}^{\star}_{i}\right\rvert&\leq\left\lVert\tau^{d}-\tilde{\tau}^{\star}\right\rVert\\ &=\frac{1}{\eta_{i}}\left\lVert\eta_{i}\tau^{d}-\tau^{\star}\right\rVert\\ &\overset{(i)}{=}\frac{1}{\eta_{i}}\left\lVert D^{-1}(\Delta^{2}+\Delta^{3})\right\rVert\\ &\leq\frac{1}{\sigma_{\min}(D)\eta_{i}}\left(\left\lVert\Delta^{2}\right\rVert+\left\lVert\Delta^{3}\right\rVert\right),\end{split} (13)

where (i)(i) is due to the error decomposition in Lemma 2.1 and the definition of τd\tau^{d}.

Hence, we aim to upper bound \left\lVert\Delta^{2}\right\rVert and \left\lVert\Delta^{3}\right\rVert and to lower bound \sigma_{\min}(D). The main challenge lies in upper bounding \left\lVert\Delta^{3}\right\rVert. In fact, the desired bounds for \sigma_{\min}(D) and \left\lVert\Delta^{2}\right\rVert (presented in the following lemma) are obtained in the process of bounding \left\lVert\Delta^{3}\right\rVert.

Lemma C.1.

Let DD and Δ2\Delta^{2} be defined as in Lemma 2.1. We have σmin(D)cs2logn\sigma_{\min}(D)\geq\frac{c_{s}}{2\log n}. Furthermore, we have the following for sufficiently large nn:

Δ2σ2r1.5κnlog6.5(n)σmin+log0.5(n)maxi[k]|P𝐓(Z~i),P𝐓(E+δ)|Z~iF.\displaystyle\left\lVert\Delta^{2}\right\rVert\lesssim\frac{\sigma^{2}r^{1.5}\kappa n\log^{6.5}(n)}{\sigma_{\min}}+\log^{0.5}(n)\cdot\max_{i\in[k]}\frac{\left\lvert\left\langle P_{{\mathbf{T}}^{\star\perp}}(\tilde{Z}_{i}),P_{{\mathbf{T}}^{\star\perp}}(E+\delta)\right\rangle\right\rvert}{\|\tilde{Z}_{i}\|_{\mathrm{F}}}.

The proof of the above lemma is provided in Section D.6.

We now discuss the strategy for bounding Δ3\left\lVert\Delta^{3}\right\rVert. In order to control P𝐓^(M)P_{\hat{\mathbf{T}}^{\perp}}(M^{\star}), we aim to show that the true counterfactual matrix MM^{\star} has tangent space close to that of M^\hat{M}. Because we introduced m^\hat{m} in the convex formulation, which centers the rows of M^\hat{M} without penalization from the nuclear norm term, we will focus on the projection of MM^{\star} that has rows with mean zero, P𝟏(M)P_{{\mathbf{1}}^{\perp}}(M^{\star}). Recall that P𝟏(M)=UΣVP_{{\mathbf{1}}^{\perp}}(M^{\star})=U^{\star}\Sigma^{\star}V^{\star\top} is its SVD, and define X:=UΣ1/2X^{\star}:=U^{\star}\Sigma^{\star 1/2} and Y:=VΣ1/2Y^{\star}:=V^{\star}\Sigma^{\star 1/2}. We similarly define these quantities for M^\hat{M}: X^:=U^Σ^1/2\hat{X}:=\hat{U}\hat{\Sigma}^{1/2} and Y^:=V^Σ^1/2\hat{Y}:=\hat{V}\hat{\Sigma}^{1/2}. We aim to show that we have (X^,Y^)(X,Y)(\hat{X},\hat{Y})\approx(X^{\star},Y^{\star}). This is stated formally in Lemma C.2, which requires introducing a few additional notations first.

Let g(M,τ,m)g(M,\tau,m) denote the convex function that we are optimizing in (2). Recall that (M^,τ^,m^)(\hat{M},\hat{\tau},\hat{m}) is a global optimum of function gg. Following the main proof idea in [17], we define a non-convex proxy function ff, which is similar to gg, except MM is split into two variables and expressed as XYXY^{\top}.

f(X,Y)\displaystyle f(X,Y) =minmn,τk12OXYm𝟏i=1kτiZiF2+12λX,X+12λY,Y.\displaystyle=\min_{m\in\mathbb{R}^{n},\tau\in\mathbb{R}^{k}}\frac{1}{2}\left\lVert O-XY^{\top}-m\mathbf{1}^{\top}-\sum_{i=1}^{k}\tau_{i}Z_{i}\right\rVert_{\mathrm{F}}^{2}+\frac{1}{2}\lambda\left\langle X,X\right\rangle+\frac{1}{2}\lambda\left\langle Y,Y\right\rangle.

Our analysis will relate (X^,Y^)(\hat{X},\hat{Y}) to a specific local optimum of ff, which we will show is close to (X,Y)(X^{\star},Y^{\star}). The local optimum of ff considered is the limit of the gradient flow of ff, initiated at (X,Y)(X^{\star},Y^{\star}), formally defined by the following differential equation:

{(X˙t,Y˙t)=f(Xt,Yt)(X0,Y0)=(X,Y).\displaystyle\begin{cases}(\dot{X}^{t},\dot{Y}^{t})=-\nabla f(X^{t},Y^{t})\\ (X^{0},Y^{0})=(X^{\star},Y^{\star}).\end{cases} (14)

We define HX,YH_{X,Y} as the rotation matrix that optimally aligns (X,Y)(X,Y) with (X,Y)(X^{\star},Y^{\star}). That is, letting 𝒪r×r\mathcal{O}^{r\times r} denote the set of r×rr\times r orthogonal matrices, HX,YH_{X,Y} is formally defined as follows:

HX,Y:=argminR𝒪r×rXRXF2+YRYF2.H_{X,Y}:=\operatorname*{argmin}_{R\in\mathcal{O}^{r\times r}}\left\lVert XR-X^{\star}\right\rVert^{2}_{\mathrm{F}}+\left\lVert YR-Y^{\star}\right\rVert^{2}_{\mathrm{F}}.
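This is an orthogonal Procrustes problem; as a side remark (not stated in the paper), the minimizer admits the closed form H_{X,Y}=AB^{\top}, where A\Sigma B^{\top} is an SVD of X^{\top}X^{\star}+Y^{\top}Y^{\star}. The following sketch, on synthetic matrices, illustrates the alignment.

```python
import numpy as np

# Illustration (our own, not from the paper): recover the aligning matrix via
# orthogonal Procrustes, H = A B^T with A S B^T an SVD of X^T X* + Y^T Y*.
rng = np.random.default_rng(2)
n, T, r = 10, 8, 3
X_star = rng.standard_normal((n, r))
Y_star = rng.standard_normal((T, r))

Q, _ = np.linalg.qr(rng.standard_normal((r, r)))   # rotate/reflect the truth
X, Y = X_star @ Q, Y_star @ Q

A, _, Bt = np.linalg.svd(X.T @ X_star + Y.T @ Y_star)
H = A @ Bt
err = np.linalg.norm(X @ H - X_star) ** 2 + np.linalg.norm(Y @ H - Y_star) ** 2
print(f"alignment error after applying H: {err:.2e}")   # ~0: (X, Y) is an exact rotation of (X*, Y*)
```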

Define FF^{\star} to be the vertical concatenation of matrices XX^{\star} and YY^{\star}. That is, F=[(X),(Y)]F^{\star}=[(X^{\star})^{\top},(Y^{\star})^{\top}]^{\top}. We are now ready to present Lemma C.2.

Lemma C.2.

For sufficiently large nn, the gradient flow of ff, starting from (X,Y)(X^{\star},Y^{\star}), converges to (X,Y)(X,Y) such that X=X^RX=\hat{X}R and Y=Y^RY=\hat{Y}R for some rotation matrix R𝒪r×rR\in\mathcal{O}^{r\times r}. Furthermore, (X,Y)(X,Y)\in\mathcal{B}, where

\displaystyle\mathcal{B} :={(X,Y)XHX,YXF2+YHX,YYF2ρ2},ρ:=σnrlog6(n)σminFF.\displaystyle:=\left\{(X,Y)\mid\left\lVert XH_{X,Y}-X^{\star}\right\rVert_{\mathrm{F}}^{2}+\left\lVert YH_{X,Y}-Y^{\star}\right\rVert_{\mathrm{F}}^{2}\leq\rho^{2}\right\},\quad\rho:=\frac{\sigma\sqrt{nr}\log^{6}(n)}{\sigma_{\min}}\left\lVert F^{\star}\right\rVert_{\mathrm{F}}.

Let us derive an upper bound on ρ\rho. Because XFXr=σmaxr\left\lVert X^{\star}\right\rVert_{\mathrm{F}}\leq\left\lVert X^{\star}\right\rVert\sqrt{r}=\sqrt{\sigma_{\max}r} and YFσmaxr\left\lVert Y^{\star}\right\rVert_{\mathrm{F}}\leq\sqrt{\sigma_{\max}r}, we can conclude that FFσmaxr\left\lVert F^{\star}\right\rVert_{\mathrm{F}}\lesssim\sqrt{\sigma_{\max}r}. Thus, we can upper bound ρ\rho as follows:

ρσnrlog6(n)σminσmaxr=σnrlog6(n)σminσminκr=σrκ0.5nlog6(n)σmin.\displaystyle\rho\lesssim\frac{\sigma\sqrt{nr}\log^{6}(n)}{\sigma_{\min}}\sqrt{\sigma_{\max}r}=\frac{\sigma\sqrt{nr}\log^{6}(n)}{\sigma_{\min}}\sqrt{\sigma_{\min}\kappa r}=\frac{\sigma r\kappa^{0.5}\sqrt{n}\log^{6}(n)}{\sqrt{\sigma_{\min}}}. (15)

Using (15) and the assumption σnσmin1κ2r2log12.5(n)\frac{\sigma\sqrt{n}}{\sigma_{\min}}\lesssim\frac{1}{\kappa^{2}r^{2}\log^{12.5}(n)} provided in the theorem, we can further upper bound ρ\rho as follows:

ρσminlog6.5(n)κ.\displaystyle\rho\lesssim\frac{\sqrt{\sigma_{\min}}}{\log^{6.5}(n)\kappa}. (16)

With Lemma C.2 and the bound on ρ\rho, we can complete the proof of the theorem as follows.

Δ3\displaystyle\left\lVert\Delta^{3}\right\rVert kΔ3\displaystyle\leq\sqrt{k}\left\lVert\Delta^{3}\right\rVert_{\infty}
kP𝐓^(M)F\displaystyle\leq\sqrt{k}\left\lVert P_{\hat{\mathbf{T}}^{\perp}}(M^{\star})\right\rVert_{\mathrm{F}}
=k(IU^U^)(XY+m𝟏)(Ir^r^r^r^)(IV^V^)F,\displaystyle=\sqrt{k}\left\lVert(I-\hat{U}\hat{U}^{\top})(X^{\star}Y^{\star\top}+m^{\star}\mathbf{1}^{\top})\left(I-\frac{\hat{r}\hat{r}^{\top}}{\hat{r}^{\top}\hat{r}}\right)(I-\hat{V}\hat{V}^{\top})\right\rVert_{\mathrm{F}},

where r^=(IV^V^)𝟏\hat{r}=(I-\hat{V}\hat{V}^{\top})\mathbf{1}, and the last line comes from the closed form of the projection derived in Lemma B.1. We can use the fact that V^𝟏=0\hat{V}^{\top}\mathbf{1}=0 from B.2 to simplify the expression. Note that with V^𝟏=0\hat{V}^{\top}\mathbf{1}=0, r^\hat{r} simply evaluates to 𝟏\mathbf{1}.

kP𝐓^(M)F\displaystyle\sqrt{k}\left\lVert P_{\hat{\mathbf{T}}^{\perp}}(M^{\star})\right\rVert_{\mathrm{F}} =k(IU^U^)XY(I𝟏𝟏T)(IV^V^)F\displaystyle=\sqrt{k}\left\lVert(I-\hat{U}\hat{U}^{\top})X^{\star}Y^{\star\top}\left(I-\frac{\mathbf{1}\mathbf{1}^{\top}}{T}\right)(I-\hat{V}\hat{V}^{\top})\right\rVert_{\mathrm{F}}
=k(IU^U^)XY(IV^V^)F,\displaystyle=\sqrt{k}\left\lVert(I-\hat{U}\hat{U}^{\top})X^{\star}Y^{\star\top}(I-\hat{V}\hat{V}^{\top})\right\rVert_{\mathrm{F}},

where the last step is due to the fact that XY𝟏=0X^{\star}Y^{\star\top}\mathbf{1}=0 because XY=P𝟏(M)X^{\star}Y^{\star\top}=P_{{\mathbf{1}}^{\perp}}(M^{\star}).

Finally, let (X,Y)(X,Y) be the limit of the gradient flow of ff starting from (X,Y)(X^{\star},Y^{\star}). By Lemma C.2, we have X=X^RX=\hat{X}R and Y=Y^RY=\hat{Y}R for some rotation matrix R𝒪r×rR\in\mathcal{O}^{r\times r}. This, combined with the definition of 𝐓^\hat{\mathbf{T}}, implies P𝐓^(XA)=P𝐓^(BY)=0P_{\hat{\mathbf{T}}^{\perp}}(XA^{\top})=P_{\hat{\mathbf{T}}^{\perp}}(BY^{\top})=0 for any AT×rA\in\mathbb{R}^{T\times r} and Bn×rB\in\mathbb{R}^{n\times r}. Hence,

Δ3k(IU^U^)(XHX,YX)(YHX,YY)(IV^V^)FkXHX,YXFYHX,YYFkρ2σ2r2κnlog12.5(n)σmin,\displaystyle\begin{split}\left\lVert\Delta^{3}\right\rVert&\leq\sqrt{k}\left\lVert(I-\hat{U}\hat{U}^{\top})(XH_{X,Y}-X^{\star})(YH_{X,Y}-Y^{\star})^{\top}(I-\hat{V}\hat{V}^{\top})\right\rVert_{\mathrm{F}}\\ &\leq\sqrt{k}\left\lVert XH_{X,Y}-X^{\star}\right\rVert_{\mathrm{F}}\left\lVert YH_{X,Y}-Y^{\star}\right\rVert_{\mathrm{F}}\\ &\lesssim\sqrt{k}\rho^{2}\\ &\lesssim\frac{\sigma^{2}r^{2}\kappa n\log^{12.5}(n)}{\sigma_{\min}},\end{split} (17)

where the last step is due to the fact that k=O(logn).k=O(\log n).

Now that we have bounded σmin(D)\sigma_{\min}(D), Δ2\left\lVert\Delta^{2}\right\rVert, and Δ3\left\lVert\Delta^{3}\right\rVert, we can plug these bounds into (13) to complete the proof of the theorem.

Appendix D Proof of Lemma C.2

Throughout this section, we assume that the assumptions made in the statement of Theorem 3.5 are satisfied. The proof of Lemma C.2 is completed by combining the results of the following two lemmas.

Lemma D.1.

Any point (X,Y)(X,Y) on the gradient flow of ff starting from (X,Y)(X^{\star},Y^{\star}) satisfies (X,Y)(X,Y)\in\mathcal{B}.

Lemma D.2.

The limit (X,Y)(X,Y) of the gradient flow of ff starting from (X,Y)(X^{\star},Y^{\star}) satisfies X=X^RX=\hat{X}R and Y=Y^RY=\hat{Y}R for some rotation matrix R𝒪r×rR\in\mathcal{O}^{r\times r}.

We note our methodology differs from that of [17]. Instead of analyzing a gradient descent algorithm, we analyze the gradient flow of the function ff. This allows us to simplify the analysis by avoiding error terms due to the discretization of gradient descent.

In this section, we start by deriving the gradient of the function ff and examining some properties of its gradient flow in Section D.1. Then, we prove some technical lemmas. We extend 3.3 to a broader subset of matrices within the set \mathcal{B} in Section D.2, and establish bounds on the noise in Section D.3. Finally, we complete the proofs of Lemma D.1 in Section D.4 and Lemma D.2 in Section D.5.

D.1 The gradient of the function ff

Before we prove Lemma D.1, we need to derive the gradient of the function ff.

Define P𝐙P_{\mathbf{Z}} to be the projection operator that projects onto the subspace 𝐙\mathbf{Z}. Note that in the definition of f(X,Y)f(X,Y), the quantities mm and τ\tau are chosen to minimize the distance between OXYO-XY^{\top} and this subspace, measured in terms of Euclidean norm. Hence, we can view mm and τ\tau as coordinates of the projection of OXYO-XY^{\top} onto this subspace. This observation gives us the following equivalent definition of f(X,Y)f(X,Y):

f(X,Y)\displaystyle f(X,Y) =12P𝐙(OXY)F2+12λX,X+12λY,Y.\displaystyle=\frac{1}{2}\left\lVert P_{\mathbf{Z}^{\perp}}\left(O-XY^{\top}\right)\right\rVert_{\mathrm{F}}^{2}+\frac{1}{2}\lambda\left\langle X,X\right\rangle+\frac{1}{2}\lambda\left\langle Y,Y\right\rangle.

Consider a single entry of the matrix XX: XijX_{ij}. Note that the expression inside the Frobenius norm is linear in XijX_{ij}. Hence, we can rewrite f(X,Y)f(X,Y) as follows:

f(X,Y)\displaystyle f(X,Y) =12XijA+BF2+12λX,X+12λY,Y,\displaystyle=\frac{1}{2}\left\lVert X_{ij}A+B\right\rVert_{\mathrm{F}}^{2}+\frac{1}{2}\lambda\left\langle X,X\right\rangle+\frac{1}{2}\lambda\left\langle Y,Y\right\rangle,

for some matrices AA and BB that do not depend on XijX_{ij} (but may depend on other entries of XX, and YY).

Now we take the partial derivative of ff with respect to XijX_{ij}. Let EijE_{ij} be the matrix where all elements are zero except for the element in the ii-th row and jj-th column, which is 1. Then,

fXij\displaystyle\frac{\partial f}{\partial X_{ij}} =XijA+B,A+λXij\displaystyle=\left\langle X_{ij}A+B,A\right\rangle+\lambda X_{ij}
=(i)P𝐙(OXY),P𝐙(EijY)+λXij\displaystyle\overset{(i)}{=}\left\langle P_{\mathbf{Z}^{\perp}}\left(O-XY^{\top}\right),-P_{\mathbf{Z}^{\perp}}\left(E_{ij}Y^{\top}\right)\right\rangle+\lambda X_{ij}
=P𝐙(OXY),EijY+λXij,\displaystyle=-\left\langle P_{\mathbf{Z}^{\perp}}\left(O-XY^{\top}\right),E_{ij}Y^{\top}\right\rangle+\lambda X_{ij},

where (i)(i) is due to the fact that AA and BB were defined such that XijA+B=P𝐙(OXY).X_{ij}A+B=P_{\mathbf{Z}^{\perp}}\left(O-XY^{\top}\right). Hence,

fX\displaystyle\frac{\partial f}{\partial X} =P𝐙(OXY)Y+λX\displaystyle=-P_{\mathbf{Z}^{\perp}}\left(O-XY^{\top}\right)Y+\lambda X
=(OXYm(X,Y)𝟏i=1kτi(X,Y)Zi)Y+λX,\displaystyle=-\left(O-XY^{\top}-m(X,Y)\mathbf{1}^{\top}-\sum_{i=1}^{k}\tau_{i}(X,Y)Z_{i}\right)Y+\lambda X,

where m(X,Y),τ(X,Y):=argminm,τOXYm𝟏i=1kτiZiF2m(X,Y),\tau(X,Y):=\operatorname*{argmin}_{m,\tau}\left\lVert O-XY^{\top}-m\mathbf{1}^{\top}-\sum_{i=1}^{k}\tau_{i}Z_{i}\right\rVert^{2}_{\mathrm{F}}. We will write mm and τ\tau to represent m(X,Y)m(X,Y) and τ(X,Y)\tau(X,Y) below for notational simplicity.

Using O=XY+m𝟏+i=1kτiZi+E^O=X^{\star}Y^{\star\top}+m^{\star}\mathbf{1}^{\top}+\sum_{i=1}^{k}\tau^{\star}_{i}Z_{i}+\hat{E}, we can simplify the gradient as follows:

fX=(XYXY+(mm)𝟏+i=1k(τiτi)ZiE^)Y+λX.\displaystyle\frac{\partial f}{\partial X}=\left(XY^{\top}-X^{\star}Y^{\star\top}+(m-m^{\star})\mathbf{1}^{\top}+\sum_{i=1}^{k}(\tau_{i}-\tau_{i}^{\star})Z_{i}-\hat{E}\right)Y+\lambda X.

Because mm and τ\tau are coordinates of the projection of OXYO-XY^{\top} onto the subspace 𝐙\mathbf{Z}, we have

m𝟏+i=1kτiZi=P𝐙(OXY).m\mathbf{1}^{\top}+\sum_{i=1}^{k}\tau_{i}Z_{i}=P_{\mathbf{Z}}\left(O-XY^{\top}\right).

Additionally, because m𝟏+i=1kτiZi=OXYE^m^{\star}\mathbf{1}^{\top}+\sum_{i=1}^{k}\tau_{i}^{\star}Z_{i}=O-X^{\star}Y^{\star\top}-\hat{E}, the right hand side quantity is also in the subspace 𝐙\mathbf{Z}, so we have

m𝟏+i=1kτiZi=P𝐙(OXYE^).\displaystyle m^{\star}\mathbf{1}^{\top}+\sum_{i=1}^{k}\tau_{i}^{\star}Z_{i}=P_{\mathbf{Z}}\left(O-X^{\star}Y^{\star\top}-\hat{E}\right).

Subtracting the two equations above, we have

(mm)𝟏+i=1k(τiτi)Zi=P𝐙(XYXY+E^).\displaystyle(m-m^{\star})\mathbf{1}^{\top}+\sum_{i=1}^{k}(\tau_{i}-\tau_{i}^{\star})Z_{i}=P_{\mathbf{Z}}\left(X^{\star}Y^{\star\top}-XY^{\top}+\hat{E}\right). (18)

Using (18), we can rewrite

fX=(P𝐙(XYXYE^))Y+λX.\displaystyle\frac{\partial f}{\partial X}=\left(P_{\mathbf{Z}^{\perp}}\left(XY^{\top}-X^{\star}Y^{\star\top}-\hat{E}\right)\right)Y+\lambda X.

Similarly, we can derive

fY\displaystyle\frac{\partial f}{\partial Y} =(P𝐙(XYXYE^))X+λY.\displaystyle=\left(P_{\mathbf{Z}^{\perp}}\left(XY^{\top}-X^{\star}Y^{\star\top}-\hat{E}\right)\right)^{\top}X+\lambda Y.

D.1.1 Properties of the gradient flow

We now prove some properties of the gradient flow of the function ff starting from the point (X,Y)(X^{\star},Y^{\star}). For simplicity of notation, we define

D:=P𝐙(XYXYE^),\displaystyle D:=P_{\mathbf{Z}^{\perp}}\left(XY^{\top}-X^{\star}Y^{\star\top}-\hat{E}\right),

so we can simplify our gradients as follows:

fX=DY+λXandfY=DX+λY.\displaystyle\frac{\partial f}{\partial X}=DY+\lambda X\qquad\text{and}\qquad\frac{\partial f}{\partial Y}=D^{\top}X+\lambda Y. (19)
Lemma D.3.

Every point on the gradient flow of function f(X,Y)f(X,Y) starting from the point (X,Y)\left(X^{\star},Y^{\star}\right) satisfies XX=YY.X^{\top}X=Y^{\top}Y.

Proof.

Note that at the starting point (X,Y)\left(X^{\star},Y^{\star}\right), we have that XX=Σ=YY{X^{\star}}^{\top}X^{\star}=\Sigma^{\star}={Y^{\star}}^{\top}Y^{\star} as desired.

Next, we examine the change in the value ϕ(t):=XtXtYtYt\phi(t):={X^{t}}^{\top}X^{t}-{Y^{t}}^{\top}Y^{t} as we move along the gradient flow defined by (14). For convenience, we omit the superscript tt in XtX^{t} and YtY^{t}. Using (14), we have

(X˙Y˙)=f(XY)=(DY+λXDX+λY).\displaystyle{\begin{pmatrix}\dot{X}\\ \dot{Y}\end{pmatrix}}=-\nabla f\begin{pmatrix}X\\ Y\end{pmatrix}=-\begin{pmatrix}DY+\lambda X\\ D^{\top}X+\lambda Y\end{pmatrix}.

The derivative of ϕ\phi is

ϕ(t)\displaystyle\phi^{\prime}(t) =X˙X+XX˙Y˙YYY˙\displaystyle=\dot{X}^{\top}X+{X}^{\top}\dot{X}-\dot{Y}^{\top}Y-Y^{\top}\dot{Y}
=(DY+λX)XX(DY+λX)+(DX+λY)Y+Y(DX+λY)\displaystyle=-\left(DY+\lambda X\right)^{\top}X-X^{\top}(DY+\lambda X)+\left(D^{\top}X+\lambda Y\right)^{\top}Y+Y^{\top}(D^{\top}X+\lambda Y)
=2λ(XXYY),\displaystyle=-2\lambda\left(X^{\top}X-Y^{\top}Y\right),

showing that ϕ(t)=2λϕ(t)\phi^{\prime}(t)=-2\lambda\phi(t). As a result, recalling that ϕ(0)=0\phi(0)=0, we can solve the differential equation to obtain ϕ(t)=e2λtϕ(0)=0\phi(t)=e^{-2\lambda t}\phi(0)=0 for all tt. ∎

Lemma D.4.

Every point on the gradient flow of function f(X,Y)f(X,Y) starting from the point (X,Y)\left(X^{\star},Y^{\star}\right) satisfies Y𝟏=0.Y^{\top}\mathbf{1}=0. Furthermore, if UΣVU\Sigma V^{\top} is the SVD of XYXY^{\top}, then V𝟏=0.V^{\top}\mathbf{1}=0.

Proof.

Using XY𝟏=0X^{\star}Y^{\star\top}\mathbf{1}=0 (because XY=P𝟏(M)X^{\star}Y^{\star\top}=P_{{\mathbf{1}}^{\perp}}(M^{\star})), we have that

(XY𝟏)XY𝟏=𝟏VΣ2V𝟏=0,\displaystyle\left(X^{\star}Y^{\star\top}\mathbf{1}\right)^{\top}X^{\star}Y^{\star\top}\mathbf{1}=\mathbf{1}^{\top}V^{\star}\Sigma^{\star 2}V^{\star\top}\mathbf{1}=0,

which can only happen if V𝟏=0V^{\star\top}\mathbf{1}=0. This implies that Y𝟏=0Y^{\star\top}\mathbf{1}=0.

Using the same approach as the proof of Lemma D.3, we examine the change in the value of ϕ(t):=Yt𝟏\phi(t):={Y^{t}}^{\top}\mathbf{1} as we move along the gradient flow defined by (14). We omit the superscript tt in YtY^{t} for notational convenience. Using (14), we have Y˙=DXλY\dot{Y}=-D^{\top}X-\lambda Y. Now we compute the derivative of ϕ(t)\phi(t):

ϕ(t)\displaystyle\phi^{\prime}(t) =Y˙𝟏\displaystyle=\dot{Y}^{\top}\mathbf{1}
=XD𝟏λY𝟏\displaystyle=-X^{\top}D\mathbf{1}-\lambda Y^{\top}\mathbf{1}
=(i)λY𝟏\displaystyle\overset{(i)}{=}-\lambda Y^{\top}\mathbf{1}
=λϕ(t).\displaystyle=-\lambda\phi(t).

The equality (i) follows because D is the projection of a matrix onto \mathbf{Z}^{\perp}, a subspace orthogonal to \left\{\alpha\mathbf{1}^{\top}\mid\alpha\in\mathbb{R}^{n}\right\}; this effectively centers its rows. Because D has zero-mean rows, D\mathbf{1}=0.

Note that \phi(0)=Y^{\star\top}\mathbf{1}=0. Solving the differential equation, we have \phi(t)=e^{-\lambda t}\phi(0)=0 for all t. Using the same logic we used to show V^{\star\top}\mathbf{1}=0, we have that Y^{\top}\mathbf{1}=0 implies V^{\top}\mathbf{1}=0. ∎
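As a numerical illustration of Lemmas D.3 and D.4 (our own sketch, with an assumed toy setup: no treatment matrices, so that P_{\mathbf{Z}^{\perp}} simply centers the rows), a crude Euler discretization of the gradient flow (14) approximately preserves the balance X^{\top}X=Y^{\top}Y and keeps Y^{\top}\mathbf{1}=0 along the path.

```python
import numpy as np

# Toy setup (assumed): k = 0 treatments, so P_{Z^perp} just centers the rows.
rng = np.random.default_rng(3)
n, T, r, lam, dt = 30, 20, 2, 0.5, 1e-3

M0 = rng.standard_normal((n, T))
M0 -= M0.mean(axis=1, keepdims=True)                  # row-centered, so M0 @ 1 = 0
U, s, Vt = np.linalg.svd(M0, full_matrices=False)
Xs = U[:, :r] * np.sqrt(s[:r])                        # X* = U* (Sigma*)^{1/2}
Ys = Vt[:r].T * np.sqrt(s[:r])                        # Y* = V* (Sigma*)^{1/2}
E = 0.1 * rng.standard_normal((n, T))

center = lambda A: A - A.mean(axis=1, keepdims=True)  # P_{Z^perp} when k = 0
X, Y = Xs.copy(), Ys.copy()
for _ in range(2000):                                 # explicit Euler steps on (14)
    D = center(X @ Y.T - Xs @ Ys.T - E)
    X, Y = X - dt * (D @ Y + lam * X), Y - dt * (D.T @ X + lam * Y)

print(np.abs(X.T @ X - Y.T @ Y).max())   # small: balance holds up to discretization error
print(np.abs(Y.T @ np.ones(T)).max())    # ~0: Y^T 1 = 0 is preserved (Lemma D.4)
```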

D.2 Extending assumptions to \mathcal{B}

We prove a lemma that extends Assumptions 3.3(a), 3.3(b), and 3.3(c) to a broader subset of matrices within the set \mathcal{B}, thereby expanding the applicability of the original assumptions.

We begin by proving a useful lemma that bounds the singular values of matrices in the set \mathcal{B}. We show these values are within a constant factor of the singular value range of X^{\star} and Y^{\star}, which spans from \sqrt{\sigma_{\min}} to \sqrt{\sigma_{\max}}.

Lemma D.5.

For large enough nn and (X,Y)(X,Y)\in\mathcal{B}, we have the following for any i[r]i\in[r]:

σi(X),σi(Y)[σmin2,2σmax].\displaystyle\sigma_{i}(X),\sigma_{i}(Y)\in\left[\frac{\sqrt{\sigma_{\min}}}{2},2\sqrt{\sigma_{\max}}\right].
Proof.

The singular values of a matrix are not changed by right-multiplying by an orthogonal matrix. Hence, without loss of generality, we suppose that (X,Y)(X,Y) are rotated such that they are optimally aligned with (X,Y)(X^{\star},Y^{\star}); in other words, HX,Y=IH_{X,Y}=I. Then, (X,Y)(X,Y)\in\mathcal{B} gives us

XXFρσminlog6.5(n)κ,\displaystyle\left\lVert X-X^{\star}\right\rVert_{\mathrm{F}}\leq\rho\lesssim\frac{\sqrt{\sigma_{\min}}}{\log^{6.5}(n)\kappa},

where the last step is due to (16).

Then by Weyl’s inequality for singular values, for any i[r]i\in[r], we have

σi(X)σi(X)+XXσ1(X)+XXF2σmax.\displaystyle\sigma_{i}(X)\leq\sigma_{i}(X^{\star})+\left\lVert X-X^{\star}\right\rVert\leq\sigma_{1}(X^{\star})+\left\lVert X-X^{\star}\right\rVert_{\mathrm{F}}\leq 2\sqrt{\sigma_{\max}}.

We also have

σi(X)σi(X)XXσr(X)XXFσmin2.\displaystyle\sigma_{i}(X)\geq\sigma_{i}(X^{\star})-\left\lVert X-X^{\star}\right\rVert\geq\sigma_{r}(X^{\star})-\left\lVert X-X^{\star}\right\rVert_{\mathrm{F}}\geq\frac{\sqrt{\sigma_{\min}}}{2}.

The proof for σi(Y)\sigma_{i}(Y) is identical. ∎

Now, we present the lemma which extends 3.3 to a broader subset of matrices in \mathcal{B}.

Lemma D.6.

Suppose that (X,Y)(X,Y)\in\mathcal{B} and that there exists a rotation matrix H𝒪r×rH\in\mathcal{O}^{r\times r} such that (XH,YH)(XH,YH) is along the gradient flow of function ff starting from the point (X,Y)(X^{\star},Y^{\star}). Let m,τm,\tau denote the values that minimize f(X,Y)f(X,Y). Let XY=UΣVXY^{\top}=U\Sigma V^{\top} be the SVD of XYXY^{\top}. Let 𝐓0\mathbf{T}_{0} be the span of the tangent space of XYXY^{\top}, and denote by 𝐓\mathbf{T} the span of 𝐓0\mathbf{T}_{0} and {α𝟏αn}\left\{\alpha\mathbf{1}^{\top}\mid\alpha\in\mathbb{R}^{n}\right\}. Define Δ~1k\tilde{\Delta}^{1}\in\mathbb{R}^{k} as the vector with components Δ~i1=Zi,UV.\tilde{\Delta}^{1}_{i}=\left\langle Z_{i},UV^{\top}\right\rangle. Define D~\tilde{D} to be the matrix with entries D~ij=P𝐓(Zi),P𝐓(Zj)\tilde{D}_{ij}=\left\langle P_{{\mathbf{T}}^{\perp}}(Z_{i}),P_{{\mathbf{T}}^{\perp}}(Z_{j})\right\rangle.

If Item 3.3(a) holds, then for large enough n,

ZVF2+ZUF2\displaystyle\left\lVert ZV\right\rVert_{\mathrm{F}}^{2}+\left\lVert Z^{\top}U\right\rVert_{\mathrm{F}}^{2} 1cr12log(n)\displaystyle\leq 1-\frac{c_{r_{1}}}{2\log(n)} (20)
P𝐓0(Z)F2\displaystyle\left\lVert P_{{\mathbf{T}_{0}}^{\perp}}(Z)\right\rVert_{\mathrm{F}}^{2} cr12log(n)\displaystyle\geq\frac{c_{r_{1}}}{2\log(n)} (21)
P𝐓(Z)P𝐓(Z)\displaystyle\left\lVert P_{{\mathbf{T}}^{\perp}}(Z)-P_{{\mathbf{T}}^{\star\perp}}(Z)\right\rVert_{\star} ρκrσmin\displaystyle\lesssim\frac{\rho\sqrt{\kappa r}}{\sqrt{\sigma_{\min}}} (22)

for any Z𝐙Z\in\mathbf{Z} that has ZF=1\left\lVert Z\right\rVert_{\mathrm{F}}=1.

If Item 3.3(b) holds, then for large enough n,

D~1Δ~1i=1kP𝐓(Zi)1cr22logn.\displaystyle\|\tilde{D}^{-1}\|\left\lVert\tilde{\Delta}^{1}\right\rVert\sum_{i=1}^{k}\left\lVert P_{{\mathbf{T}}^{\perp}}(Z_{i})\right\rVert\leq 1-\frac{c_{r_{2}}}{2\log n}. (23)

If Item 3.3(c) holds, then for large enough n,

σmin(D~)cs2logn.\displaystyle\sigma_{\min}(\tilde{D})\geq\frac{c_{s}}{2\log n}. (24)
Proof of Lemma D.6.

We will make use of the following claim throughout this proof:

Claim D.7.
UUUUF,VVVVF,UVUVFρκσmin1log6.5(n).\displaystyle\left\lVert U^{\star}U^{\star\top}-UU^{\top}\right\rVert_{\mathrm{F}},\left\lVert V^{\star}V^{\star\top}-VV^{\top}\right\rVert_{\mathrm{F}},\left\lVert U^{\star}V^{\star\top}-UV^{\top}\right\rVert_{\mathrm{F}}\lesssim\frac{\rho\sqrt{\kappa}}{\sqrt{\sigma_{\min}}}\lesssim\frac{1}{\log^{6.5}(n)}.

The proofs of (20) and (21) follow the exact same logic as the proofs of (68) and (69) in the proof of Lemma 13 in [17].

Proof of (22). We use Lemma B.1 to compute the projections of the treatment matrices. Let r=(IVV)𝟏r=(I-VV^{\top})\mathbf{1} and let r=(IVV)𝟏r^{\star}=(I-V^{\star}V^{\star\top})\mathbf{1}. By Lemma D.4, we have V𝟏=V𝟏=0V^{\top}\mathbf{1}=V^{\star\top}\mathbf{1}=0. Hence, r=r=𝟏r=r^{\star}=\mathbf{1}, which allows us to simplify as follows:

P𝐓(Z)P𝐓(Z)\displaystyle\left\lVert P_{{\mathbf{T}}^{\perp}}(Z)-P_{{\mathbf{T}}^{\star\perp}}(Z)\right\rVert_{\star}
=(IUU)P𝟏(Z)(IVV)(IUU)P𝟏(Z)(IVV)\displaystyle=\left\lVert(I-UU^{\top})P_{{\mathbf{1}}^{\perp}}(Z)(I-VV^{\top})-(I-U^{\star}U^{\star\top})P_{{\mathbf{1}}^{\perp}}(Z)(I-V^{\star}V^{\star\top})\right\rVert_{\star}
(UUUU)P𝟏(Z)(IVV)+(IUU)P𝟏(Z)(VVVV)\displaystyle\leq\left\lVert(U^{\star}U^{\star\top}-UU^{\top})P_{{\mathbf{1}}^{\perp}}(Z)(I-VV^{\top})\right\rVert_{\star}+\left\lVert(I-U^{\star}U^{\star\top})P_{{\mathbf{1}}^{\perp}}(Z)(VV^{\top}-V^{\star}V^{\star\top})\right\rVert_{\star}
(i)UUUUFP𝟏(Z)Fr+P𝟏(Z)FVVVVFr\displaystyle\overset{(i)}{\lesssim}\left\lVert U^{\star}U^{\star\top}-UU^{\top}\right\rVert_{\mathrm{F}}\left\lVert P_{{\mathbf{1}}^{\perp}}(Z)\right\rVert_{\mathrm{F}}\sqrt{r}+\left\lVert P_{{\mathbf{1}}^{\perp}}(Z)\right\rVert_{\mathrm{F}}\left\lVert VV^{\top}-V^{\star}V^{\star\top}\right\rVert_{\mathrm{F}}\sqrt{r}
UUUUFr+VVVVFr\displaystyle\leq\left\lVert U^{\star}U^{\star\top}-UU^{\top}\right\rVert_{\mathrm{F}}\sqrt{r}+\left\lVert VV^{\top}-V^{\star}V^{\star\top}\right\rVert_{\mathrm{F}}\sqrt{r}
ρκrσmin,\displaystyle\lesssim\frac{\rho\sqrt{\kappa r}}{\sqrt{\sigma_{\min}}},

where (i) is due to \left\lVert A\right\rVert_{\star}\leq\left\lVert A\right\rVert_{\mathrm{F}}\sqrt{\mathrm{rank}(A)} and the fact that the projection matrices (I-VV^{\top}) and (I-U^{\star}U^{\star\top}) have spectral norm at most 1. In particular, for step (i) we note that the rank of the matrices inside the nuclear norm is O(r) because U^{\star}U^{\star\top}, UU^{\top}, V^{\star}V^{\star\top}, and VV^{\top} are all rank-r matrices, and we have that \mathrm{rank}(A+B)\leq\mathrm{rank}(A)+\mathrm{rank}(B) and \mathrm{rank}(AB)\leq\min(\mathrm{rank}(A),\mathrm{rank}(B)).

Proof of (24). Define ΔD:=D~D\Delta^{D}:=\tilde{D}-D^{\star} such that ΔijD=P𝐓(Zi),P𝐓(Zj)P𝐓(Zi),P𝐓(Zj)\Delta^{D}_{ij}=\left\langle P_{{\mathbf{T}}^{\perp}}(Z_{i}),P_{{\mathbf{T}}^{\perp}}(Z_{j})\right\rangle-\left\langle P_{{\mathbf{T}}^{\star\perp}}(Z_{i}),P_{{\mathbf{T}}^{\star\perp}}(Z_{j})\right\rangle for i,j[k]i,j\in[k]. We upper bound the magnitude of the entries of ΔD\Delta^{D}. For any i[k]i\in[k] and j[k]j\in[k], we have

|ΔijD|=|P𝐓(Zi),P𝐓(Zj)P𝐓(Zi),P𝐓(Zj)|=(i)|P𝐓(Zi),ZjP𝐓(Zi),Zj|=|P𝐓(Zi)P𝐓(Zi),Zj|(ii)P𝐓(Zi)P𝐓(Zi)F=(IUU)P𝟏(Zi)(IVV)(IUU)P𝟏(Zi)(IVV)F=(UUUU)P𝟏(Zi)(IVV)(IUU)P𝟏(Zi)(VVVV)FUUUUF+VVVVF(iii)1log6.5(n),\displaystyle\begin{split}\left\lvert\Delta^{D}_{ij}\right\rvert&=\left\lvert\left\langle P_{{\mathbf{T}}^{\perp}}(Z_{i}),P_{{\mathbf{T}}^{\perp}}(Z_{j})\right\rangle-\left\langle P_{{\mathbf{T}}^{\star\perp}}(Z_{i}),P_{{\mathbf{T}}^{\star\perp}}(Z_{j})\right\rangle\right\rvert\\ &\overset{(i)}{=}\left\lvert\left\langle P_{{\mathbf{T}}^{\perp}}(Z_{i}),Z_{j}\right\rangle-\left\langle P_{{\mathbf{T}}^{\star\perp}}(Z_{i}),Z_{j}\right\rangle\right\rvert\\ &=\left\lvert\left\langle P_{{\mathbf{T}}^{\perp}}(Z_{i})-P_{{\mathbf{T}}^{\star\perp}}(Z_{i}),Z_{j}\right\rangle\right\rvert\\ &\overset{(ii)}{\leq}\left\lVert P_{{\mathbf{T}}^{\perp}}(Z_{i})-P_{{\mathbf{T}}^{\star\perp}}(Z_{i})\right\rVert_{\mathrm{F}}\\ &=\left\lVert(I-UU^{\top})P_{{\mathbf{1}}^{\perp}}(Z_{i})(I-VV^{\top})-(I-U^{\star}U^{\star\top})P_{{\mathbf{1}}^{\perp}}(Z_{i})(I-V^{\star}V^{\star\top})\right\rVert_{\mathrm{F}}\\ &=\left\lVert(U^{\star}U^{\star\top}-UU^{\top})P_{{\mathbf{1}}^{\perp}}(Z_{i})(I-VV^{\top})-(I-U^{\star}U^{\star\top})P_{{\mathbf{1}}^{\perp}}(Z_{i})(VV^{\top}-V^{\star}V^{\star\top})\right\rVert_{\mathrm{F}}\\ &\leq\left\lVert U^{\star}U^{\star\top}-UU^{\top}\right\rVert_{\mathrm{F}}+\left\lVert V^{\star}V^{\star\top}-VV^{\top}\right\rVert_{\mathrm{F}}\\ &\overset{(iii)}{\lesssim}\frac{1}{\log^{6.5}(n)},\end{split} (25)

The equality (i) is due to the fact that for any projection operator P and matrices A and B, we have \left\langle P(A),B\right\rangle=\left\langle P(A),P(B)\right\rangle=\left\langle A,P(B)\right\rangle; (ii) is due to the Cauchy–Schwarz inequality and the fact that \left\lVert Z_{j}\right\rVert_{\mathrm{F}}=1; and (iii) follows from D.7.

By Weyl’s inequality, we have σmin(D~)σmin(D)σmax(ΔD).\sigma_{\min}(\tilde{D})\geq\sigma_{\min}(D^{\star})-\sigma_{\max}(\Delta^{D}). Now we upper bound the second term as follows:

σmax(ΔD)=ΔDkmaxl,m|Δl,mD|=O(1log5.5(n)).\sigma_{\max}(\Delta^{D})=\left\lVert\Delta^{D}\right\rVert\leq k\max_{l,m}\left\lvert\Delta^{D}_{l,m}\right\rvert=O\left(\frac{1}{\log^{5.5}(n)}\right). (26)

Putting everything together, for large enough nn,

σmin(D~)\displaystyle\sigma_{\min}(\tilde{D}) cslognσmax(ΔD)cs2logn.\displaystyle\geq\frac{c_{s}}{\log n}-\sigma_{\max}(\Delta^{D})\geq\frac{c_{s}}{2\log n}.

Proof of (23). We wish to bound A:=D~1Δ~1i=1kP𝐓(Zi).A:=\left\lVert\tilde{D}^{-1}\right\rVert\left\lVert\tilde{\Delta}^{1}\right\rVert\sum_{i=1}^{k}\left\lVert P_{{\mathbf{T}}^{\perp}}(Z_{i})\right\rVert. This quantity can be restructured as

A=(D1Δ1+E1)(i=1kP𝐓(Zi)+E2),\displaystyle A=\left(\left\lVert D^{\star-1}\right\rVert\left\lVert\Delta^{\star 1}\right\rVert+E_{1}\right)\left(\sum_{i=1}^{k}\left\lVert P_{{\mathbf{T}}^{\star\perp}}(Z_{i})\right\rVert+E_{2}\right), (27)

where DD^{\star} and Δ1\Delta^{\star 1} are defined just above 3.3, and E1E_{1} and E2E_{2} are defined as follows:

E1\displaystyle E_{1} :=D~1Δ~1D1Δ1\displaystyle:=\left\lVert\tilde{D}^{-1}\right\rVert\left\lVert\tilde{\Delta}^{1}\right\rVert-\left\lVert D^{\star-1}\right\rVert\left\lVert\Delta^{\star 1}\right\rVert
E2\displaystyle E_{2} :=i=1k(P𝐓(Zi)P𝐓(Zi)).\displaystyle:=\sum_{i=1}^{k}(\left\lVert P_{{\mathbf{T}}^{\perp}}(Z_{i})\right\rVert-\left\lVert P_{{\mathbf{T}}^{\star\perp}}(Z_{i})\right\rVert).

We bound these two quantities as follows.

Claim D.8.

We have that |E1|1log2.5n\left\lvert E_{1}\right\rvert\lesssim\frac{1}{\log^{2.5}n} and |E2|1log5.5n\left\lvert E_{2}\right\rvert\lesssim\frac{1}{\log^{5.5}n}.

To bound AA, we consider the following additional observations. Firstly, we have D1Δ1log2n\left\lVert D^{\star-1}\right\rVert\left\lVert\Delta^{\star 1}\right\rVert\lesssim\log^{2}n. This follows from the fact that σmin(D)1logn\sigma_{\min}(D^{\star})\gtrsim\frac{1}{\log n}, and Δ1\Delta^{\star 1} is a vector of length kk (where klognk\lesssim\log n) with entries that do not exceed 1. Moreover, we have i=1kP𝐓(Zi)klogn\sum_{i=1}^{k}\left\lVert P_{{\mathbf{T}}^{\star\perp}}(Z_{i})\right\rVert\leq k\lesssim\log n because for every i[k]i\in[k], P𝐓(Zi)P_{{\mathbf{T}}^{\star\perp}}(Z_{i}) has Frobenius norm (and hence spectral norm) at most 1. Now, we are ready to plug all of this into (27) to bound AA as follows.

A\displaystyle A D1Δ1i=1kP𝐓(Zi)+O(1log1.5n)\displaystyle\leq\left\lVert D^{\star-1}\right\rVert\left\lVert\Delta^{\star 1}\right\rVert\sum_{i=1}^{k}\left\lVert P_{{\mathbf{T}}^{\star\perp}}(Z_{i})\right\rVert+O\left(\frac{1}{\log^{1.5}n}\right)
(i)1cr2logn+O(1log1.5n)\displaystyle\overset{(i)}{\leq}1-\frac{c_{r_{2}}}{\log n}+O\left(\frac{1}{\log^{1.5}n}\right)
1cr22logn,\displaystyle\leq 1-\frac{c_{r_{2}}}{2\log n},

where (i)(i) is due to Item 3.3(b). ∎

Proof of D.7.

First, we note that we have

XYXYF\displaystyle\left\lVert XY^{\top}-X^{\star}Y^{\star\top}\right\rVert_{\mathrm{F}} =XHX,Y(YHX,Y)XYF\displaystyle=\left\lVert XH_{X,Y}(YH_{X,Y})^{\top}-X^{\star}Y^{\star\top}\right\rVert_{\mathrm{F}}
XHX,YXFY+XYHX,YYF\displaystyle\leq\left\lVert XH_{X,Y}-X^{\star}\right\rVert_{\mathrm{F}}\left\lVert Y\right\rVert+\left\lVert X^{\star}\right\rVert\left\lVert YH_{X,Y}-Y^{\star}\right\rVert_{\mathrm{F}}
ρσmax,\displaystyle\lesssim\rho\sqrt{\sigma_{\max}},

where Y\left\lVert Y\right\rVert is bounded by Lemma D.5. Hence, we can make use of Theorem 2 from [36], combined with the symmetric dilation trick from section C.3.2 of [3], to obtain the following. There exists an orthogonal matrix Or×rO\in\mathbb{R}^{r\times r} such that

UOUF+VOVFXYXYFσr(XY)σr+1(XY)ρσmaxσmin=ρκσmin.\displaystyle\left\lVert UO-U^{\star}\right\rVert_{\mathrm{F}}+\left\lVert VO-V^{\star}\right\rVert_{\mathrm{F}}\lesssim\frac{\left\lVert XY^{\top}-X^{\star}Y^{\star\top}\right\rVert_{\mathrm{F}}}{\sigma_{r}(X^{\star}Y^{\star\top})-\sigma_{r+1}(X^{\star}Y^{\star\top})}\lesssim\frac{\rho\sqrt{\sigma_{\max}}}{\sigma_{\min}}=\frac{\rho\sqrt{\kappa}}{\sqrt{\sigma_{\min}}}.

Now we can obtain the desired bound as follows:

UVUVF\displaystyle\left\lVert UV^{\top}-U^{\star}V^{\star\top}\right\rVert_{\mathrm{F}} =UO(VO)UVF\displaystyle=\left\lVert UO(VO)^{\top}-U^{\star}V^{\star\top}\right\rVert_{\mathrm{F}}
UOUFVO+VOVFU\displaystyle\leq\left\lVert UO-U^{\star}\right\rVert_{\mathrm{F}}\left\lVert VO\right\rVert+\left\lVert VO-V^{\star}\right\rVert_{\mathrm{F}}\left\lVert U^{\star}\right\rVert
UOUF+VOVF\displaystyle\leq\left\lVert UO-U^{\star}\right\rVert_{\mathrm{F}}+\left\lVert VO-V^{\star}\right\rVert_{\mathrm{F}}
ρκσmin\displaystyle\lesssim\frac{\rho\sqrt{\kappa}}{\sqrt{\sigma_{\min}}}
1log6.5nusing (16).\displaystyle\lesssim\frac{1}{\log^{6.5}n}\qquad\text{using \eqref{eq: rho bound}}.

We can similarly bound both UUUUF\left\lVert UU^{\top}-U^{\star}U^{\star\top}\right\rVert_{\mathrm{F}} and VVVVF\left\lVert VV^{\top}-V^{\star}V^{\star\top}\right\rVert_{\mathrm{F}}. ∎

Proof of D.8.

|E2|\left\lvert E_{2}\right\rvert can be simply bounded as follows:

|E2|\displaystyle\left\lvert E_{2}\right\rvert i=1kP𝐓(Zi)P𝐓(Zi)Fklog6.5n,\displaystyle\leq\sum_{i=1}^{k}\left\lVert P_{{\mathbf{T}}^{\perp}}(Z_{i})-P_{{\mathbf{T}}^{\star\perp}}(Z_{i})\right\rVert_{\mathrm{F}}\lesssim\frac{k}{\log^{6.5}n},

where the last bound was shown in (25). Because k=O(\log n), we have the desired bound for E_{2}.

It remains to bound |E1|\left\lvert E_{1}\right\rvert:

|E1|\displaystyle\left\lvert E_{1}\right\rvert =|D1Δ1D~1Δ~1|\displaystyle=\left|\|D^{\star-1}\|\|\Delta^{\star 1}\|-\|\tilde{D}^{-1}\|\|\tilde{\Delta}^{1}\|\right|
D1|Δ1Δ~1|+|D1D~1|Δ~1\displaystyle\leq\|D^{\star-1}\|\left|\|\Delta^{\star 1}\|-\|\tilde{\Delta}^{1}\|\right|+\left|\|D^{\star-1}\|-\|\tilde{D}^{-1}\|\right|\|\tilde{\Delta}^{1}\|
D1Δ1Δ~1A1+D1D~1Δ~1A2.\displaystyle\leq\underbrace{\|D^{\star-1}\|\|\Delta^{\star 1}-\tilde{\Delta}^{1}\|}_{A_{1}}+\underbrace{\|D^{\star-1}-\tilde{D}^{-1}\|\|\tilde{\Delta}^{1}\|}_{A_{2}}.

Bounding A1A_{1}. Applying Assumption 3.3(c), we can bound A1A_{1} as follows:

A1\displaystyle A_{1} log(n)Δ1Δ~1\displaystyle\lesssim\log(n)\left\lVert\Delta^{\star 1}-\tilde{\Delta}^{1}\right\rVert
log(n)kΔ1Δ~1\displaystyle\leq\log(n)k\left\lVert\Delta^{\star 1}-\tilde{\Delta}^{1}\right\rVert_{\infty}
=log(n)kmaxi[k]|Zi,UVUV|\displaystyle=\log(n)k\max_{i\in[k]}\left\lvert\left\langle Z_{i},U^{\star}V^{\star\top}-UV^{\top}\right\rangle\right\rvert
log(n)kUVUVF\displaystyle\leq\log(n)k\left\lVert U^{\star}V^{\star\top}-UV^{\top}\right\rVert_{\mathrm{F}}
1log4.5(n),\displaystyle\lesssim\frac{1}{\log^{4.5}(n)},

where the last step makes use of D.7.

Bounding A2A_{2}. Because D~=D+ΔD=(I+ΔDD1)D\tilde{D}=D^{\star}+\Delta^{D}=(I+\Delta^{D}D^{\star-1})D^{\star}, we can write

\displaystyle\tilde{D}^{-1}=D^{\star-1}\sum_{j=0}^{\infty}(-\Delta^{D}D^{\star-1})^{j}.

Note that this quantity is well-defined because

ΔDD1σmax(ΔD)σmin(D)1log4.5n<1,\displaystyle\left\lVert\Delta^{D}D^{\star-1}\right\rVert\leq\frac{\sigma_{\max}(\Delta^{D})}{\sigma_{\min}(D^{\star})}\lesssim\frac{1}{\log^{4.5}n}<1,

where the last bound is due to (26) and Item 3.3(c). Now, we have

\displaystyle D^{\star-1}-\tilde{D}^{-1}=D^{\star-1}\Delta^{D}D^{\star-1}\sum_{j=0}^{\infty}(-\Delta^{D}D^{\star-1})^{j}.

We use this to bound A2A_{2}. Using (26) and Item 3.3(c), and the fact that Δ~1k\|\tilde{\Delta}^{1}\|\leq k because it is a vector of length kk with all components less than or equal to 1, we have

A2σmax(ΔD)σmin(D)211σmax(ΔD)σmin(D)Δ~11log2.5n.\displaystyle A_{2}\leq\frac{\sigma_{\max}(\Delta^{D})}{\sigma_{\min}(D^{\star})^{2}}\cdot\frac{1}{1-\frac{\sigma_{\max}(\Delta^{D})}{\sigma_{\min}(D^{\star})}}\cdot\|\tilde{\Delta}^{1}\|\lesssim\frac{1}{\log^{2.5}n}.

D.3 Discussion of E^\hat{E}

Recall that the matrix E^\hat{E} consists of two components E^=E+δ\hat{E}=E+\delta. EE is a noise matrix, and δ\delta is the matrix of the approximation error. Refer to 3.4 for our assumptions on EE and δ\delta.

We need to ensure that the error E^\hat{E} does not significantly interfere with the recovery of the treatment effects. That is, we need to ensure that E^\hat{E} is not confounded with the treatment matrices ZiZ_{i} for i[k]i\in[k]. This condition is formalized and established in the following lemma.

Lemma D.9.

Let (X,Y)(X,Y)\in\mathcal{B} be along the gradient flow of function ff starting from the point (X,Y)(X^{\star},Y^{\star}), and let m,τm,\tau denote the values that minimize f(X,Y)f(X,Y). Let 𝐓\mathbf{T} be the span of the tangent space of XYXY^{\top} and {α𝟏αn}\left\{\alpha\mathbf{1}^{\top}\mid\alpha\in\mathbb{R}^{n}\right\}. Then, we have

maxi[k]|P𝐓(Zi),E^|σnr.\displaystyle\max_{i\in[k]}\left\lvert\left\langle P_{{\mathbf{T}}^{\perp}}(Z_{i}),\hat{E}\right\rangle\right\rvert\lesssim\sigma\sqrt{nr}.
Proof of Lemma D.9.

Fix any i\in[k]. We recall that \hat{E}=E+\delta. We note that rank(P_{\mathbf{T}}(A))\leq 2r+1 for any matrix A, since every element of \mathbf{T} is the sum of an element of the tangent space of XY^{\top} (which has rank at most 2r) and a matrix of the form \alpha\mathbf{1}^{\top} (which has rank at most 1). Then,

|P𝐓(Zi),E^|\displaystyle\left\lvert\left\langle P_{{\mathbf{T}}^{\perp}}(Z_{i}),\hat{E}\right\rangle\right\rvert |Zi,E|+|Zi,δ|+|P𝐓(Zi),E^|\displaystyle\leq\left\lvert\left\langle Z_{i},E\right\rangle\right\rvert+\left\lvert\left\langle Z_{i},\delta\right\rangle\right\rvert+\left\lvert\left\langle P_{\mathbf{T}}(Z_{i}),\hat{E}\right\rangle\right\rvert
(i)σn+|P𝐓(Zi),E^|\displaystyle\overset{(i)}{\lesssim}\sigma\sqrt{n}+\left\lvert\left\langle P_{\mathbf{T}}(Z_{i}),\hat{E}\right\rangle\right\rvert
(ii)σn+E^P𝐓(Zi)\displaystyle\overset{(ii)}{\leq}\sigma\sqrt{n}+\|\hat{E}\|\left\lVert P_{\mathbf{T}}(Z_{i})\right\rVert_{\star}
(iii)σn+(E+δ)2r+1P𝐓(Zi)F\displaystyle\overset{(iii)}{\leq}\sigma\sqrt{n}+(\left\lVert E\right\rVert+\left\lVert\delta\right\rVert)\sqrt{2r+1}\left\lVert P_{\mathbf{T}}(Z_{i})\right\rVert_{\mathrm{F}}
(iv)σnr.\displaystyle\overset{(iv)}{\lesssim}\sigma\sqrt{nr}.

Here, (i) is due to 3.4 |Zi,E|,|Zi,δ|σn\left\lvert\left\langle Z_{i},E\right\rangle\right\rvert,\left\lvert\left\langle Z_{i},\delta\right\rangle\right\rvert\lesssim\sigma\sqrt{n}; (ii) is due to Von Neumann’s trace inequality; (iii) is due to the inequality Arank(A)AF\left\lVert A\right\rVert_{\star}\leq\sqrt{rank(A)}\left\lVert A\right\rVert_{\mathrm{F}}; and (iv) is due to the fact that ZiF=1\left\lVert Z_{i}\right\rVert_{\mathrm{F}}=1, and our assumed upper bound on E\left\lVert E\right\rVert and δ\left\lVert\delta\right\rVert from 3.4. ∎

We now argue that the assumptions on the noise matrix EE from 3.4 are mild. The following lemma shows that under standard sub-Gaussianity assumptions, these are satisfied with very high probability.

Lemma D.10.

Suppose that the entries of E are independent, zero-mean, sub-Gaussian random variables with sub-Gaussian norm bounded by \sigma, and that E is independent of Z_{i} for i\in[k]. Then, for n sufficiently large, with probability at least 1-e^{-n}, we have

Eσnand|Zi,E|σn,i[k].\left\lVert E\right\rVert\lesssim\sigma\sqrt{n}\quad\text{and}\quad\left\lvert\left\langle Z_{i},E\right\rangle\right\rvert\lesssim\sigma\sqrt{n},\quad\forall i\in[k].
Proof of Lemma D.10.

We start by proving that with probability at least 12en1-2e^{-n}, we have Eσn\left\lVert E\right\rVert\lesssim\sigma\sqrt{n}. We use the following result bounding the norm of matrices with sub-Gaussian entries.

Theorem D.11 (Theorem 4.4.5, [32]).

Let AA be an m×nm\times n random matrix whose entries AijA_{ij} are independent, mean zero, sub-Gaussian random variables with sub-Gaussian norm bounded by σ\sigma. Then, for any t>0t>0, we have

Aσ(m+n+t)\displaystyle\left\lVert A\right\rVert\lesssim\sigma(\sqrt{m}+\sqrt{n}+t)

with probability at least 12exp(t2)1-2\exp(-t^{2}).

As a result of the above theorem, recalling that E is an n\times T matrix with T\lesssim n and using t=\sqrt{2n}, we have \left\lVert E\right\rVert\lesssim\sigma\sqrt{n} with probability at least 1-2\exp(-2n).

We next turn to the second claim. The general version of Hoeffding’s inequality states that: for zero-mean independent sub-Gaussian random variables X1,,XnX_{1},\dots,X_{n}, we have

Pr[|i=1nXi|t]2exp(ct2i=1nXiψ22),\displaystyle\Pr\left[\left\lvert\sum_{i=1}^{n}X_{i}\right\rvert\geq t\right]\leq 2\exp\left(-\frac{ct^{2}}{\sum_{i=1}^{n}\left\lVert X_{i}\right\rVert_{\psi_{2}}^{2}}\right),

for some absolute constant c>0c>0. Hence, for any i[k]i\in[k]:

Pr[|Zi,E|σ2n/c]2exp(2σ2nσ2ZiF2)2e2n.\displaystyle\Pr\left[\left\lvert\left\langle Z_{i},E\right\rangle\right\rvert\geq\sigma\sqrt{2n/c}\right]\leq 2\exp\left(-\frac{2\sigma^{2}n}{\sigma^{2}\left\lVert Z_{i}\right\rVert_{\mathrm{F}}^{2}}\right)\leq 2e^{-2n}.

Applying the Union Bound over all i[k]i\in[k]:

\displaystyle\Pr\left[\max_{i\in[k]}\left\lvert\left\langle Z_{i},E\right\rangle\right\rvert\geq\sigma\sqrt{2n/c}\right]\leq 2ke^{-2n}\lesssim 2\log(n)e^{-2n}.

Applying the Union Bound shows that the two desired claims hold with probability at least 12e2n(1+logn)1en1-2e^{-2n}(1+\log n)\geq 1-e^{-n} for sufficiently large nn. ∎

D.4 Proof of Lemma D.1

If the gradient flow of f started at (X^{\star},Y^{\star}) never intersects the boundary of \mathcal{B}, then we are done. For the rest of the proof, we suppose that such an intersection exists. Fix any point (X,Y) at the intersection of the boundary of \mathcal{B} and the gradient flow of f. We aim to show that, at this boundary point, the inner product of the outward-pointing normal vector to the region and the gradient of f is positive. This property implies that the gradient flow of f, initiated from any point inside \mathcal{B}, cannot exit \mathcal{B}, and establishing it is therefore sufficient to prove Lemma D.1.

We denote by HH the rotation matrix HX,YH_{X,Y} that optimally aligns the fixed boundary point (X,Y)(X,Y) to (X,Y)(X^{\star},Y^{\star}). Consider

H={(X,Y)XHXF2+YHYF2ρ2}.\displaystyle\mathcal{B}_{H}=\left\{(X^{\prime},Y^{\prime})\mid\left\lVert X^{\prime}H-X^{\star}\right\rVert_{\mathrm{F}}^{2}+\left\lVert Y^{\prime}H-Y^{\star}\right\rVert_{\mathrm{F}}^{2}\leq\rho^{2}\right\}.

Note that \mathcal{B}_{H} differs from \mathcal{B} because H is fixed to be the rotation matrix associated with the given point (X,Y). \mathcal{B}_{H} and \mathcal{B} are tangent to each other at (X,Y), so the normal vectors to \mathcal{B}_{H} and \mathcal{B} are collinear at (X,Y). Thus, it suffices to show that the normal vector to \mathcal{B}_{H} at (X,Y) and the gradient of f at (X,Y) have a positive inner product.

To simplify our notation, we define

F:=[XY],F:=[XY],andfF:=[fXfY].F:=\left[\begin{array}[]{c}X\\ Y\end{array}\right],\qquad{F}^{\star}:=\left[\begin{array}[]{c}X^{\star}\\ Y^{\star}\end{array}\right],\qquad\text{and}\qquad\frac{\partial f}{\partial F}:=\left[\begin{array}[]{c}\frac{\partial f}{\partial X}\\ \frac{\partial f}{\partial Y}\end{array}\right].

Additionally, we define ΔX:=XHX\Delta_{X}:=XH-X^{\star}, ΔY:=YHY\Delta_{Y}:=YH-Y^{\star}, and ΔF:=FHF\Delta_{\mathrm{F}}:=FH-F^{\star}.

The normal vector to \mathcal{B}_{H} at (X,Y) is simply the gradient of \left\lVert XH-X^{\star}\right\rVert_{\mathrm{F}}^{2}+\left\lVert YH-Y^{\star}\right\rVert_{\mathrm{F}}^{2} at (X,Y), which is 2(FH-F^{\star})H^{\top}. Hence, we aim to show that the following inner product is positive:

Γ:=(FHF)H,fF=\displaystyle\Gamma:=\left\langle(FH-F^{\star})H^{\top},\frac{\partial f}{\partial F}\right\rangle= ΔX,P𝐙(XYXYE^)YH+λXH\displaystyle\left\langle\Delta_{X},P_{\mathbf{Z}^{\perp}}\left(XY^{\top}-X^{\star}Y^{\star\top}-\hat{E}\right)YH+\lambda XH\right\rangle
+\displaystyle+ ΔY,P𝐙(XYXYE^)XH+λYH,\displaystyle\left\langle\Delta_{Y},P_{\mathbf{Z}^{\perp}}\left(XY^{\top}-X^{\star}Y^{\star\top}-\hat{E}\right)^{\top}XH+\lambda YH\right\rangle,

where we used the formula for the gradient of ff from Eq (19). In the remainder of this subsection, we will use the following shorthand notations for simplicity: we denote XHXH, YHYH, and FHFH as XX, YY, and FF respectively. The above inner product Γ\Gamma becomes:

Γ=ΔXY+XΔY,P𝐙(XYXY)A3ΔX,P𝐙(E^)YΔY,P𝐙(E^)XA1\displaystyle\Gamma=\underbrace{\left\langle\Delta_{X}Y^{\top}+X\Delta_{Y}^{\top},P_{\mathbf{Z}^{\perp}}\left(XY^{\top}-X^{\star}Y^{\star\top}\right)\right\rangle}_{A_{3}}-\underbrace{\left\langle\Delta_{X},P_{\mathbf{Z}^{\perp}}(\hat{E})Y\right\rangle-\left\langle\Delta_{Y},P_{\mathbf{Z}^{\perp}}(\hat{E})^{\top}X\right\rangle}_{A_{1}}
+ΔX,λX+ΔY,λYA2.\displaystyle+\underbrace{\left\langle\Delta_{X},\lambda X\right\rangle+\left\langle\Delta_{Y},\lambda Y\right\rangle}_{A_{2}}.

By Item 3.4(a), we have E^E+δσn\|\hat{E}\|\leq\|E\|+\|\delta\|\lesssim\sigma\sqrt{n}. In particular,

|A1|\displaystyle\left\lvert A_{1}\right\rvert ΔXFE^YF+ΔYFE^XF\displaystyle\lesssim\left\lVert\Delta_{X}\right\rVert_{\mathrm{F}}\|\hat{E}\|\left\lVert Y\right\rVert_{\mathrm{F}}+\left\lVert\Delta_{Y}\right\rVert_{\mathrm{F}}\|\hat{E}\|\left\lVert X\right\rVert_{\mathrm{F}}
E^ΔFFFF\displaystyle\lesssim\|\hat{E}\|\left\lVert\Delta_{\mathrm{F}}\right\rVert_{\mathrm{F}}\left\lVert F\right\rVert_{\mathrm{F}}
(i)σnΔFFFF,\displaystyle\overset{(i)}{\lesssim}\sigma\sqrt{n}\left\lVert\Delta_{\mathrm{F}}\right\rVert_{\mathrm{F}}\left\lVert F^{\star}\right\rVert_{\mathrm{F}},

where (i) is because FFFFΔFF\left\lVert F\right\rVert_{\mathrm{F}}-\left\lVert F^{\star}\right\rVert_{\mathrm{F}}\leq\left\lVert\Delta_{\mathrm{F}}\right\rVert_{\mathrm{F}} by the Triangle Inequality, and (X,Y)(X,Y)\in\mathcal{B} gives ΔFFρFF\left\lVert\Delta_{\mathrm{F}}\right\rVert_{\mathrm{F}}\leq\rho\lesssim\left\lVert F^{\star}\right\rVert_{\mathrm{F}}, where the last step is due to the assumed bound on σnσmin\frac{\sigma\sqrt{n}}{\sigma_{\min}}.

Recalling λ=Θ(σnrlog4.5(n))\lambda=\Theta\left(\sigma\sqrt{nr}\log^{4.5}(n)\right), we bound |A2|\left\lvert A_{2}\right\rvert:

|A2|\displaystyle|A_{2}| λΔXFXF+λΔYFYF\displaystyle\leq\lambda\left\lVert\Delta_{X}\right\rVert_{\mathrm{F}}\left\lVert X\right\rVert_{\mathrm{F}}+\lambda\left\lVert\Delta_{Y}\right\rVert_{\mathrm{F}}\left\lVert Y\right\rVert_{\mathrm{F}}
2λΔFFFF\displaystyle\leq 2\lambda\left\lVert\Delta_{\mathrm{F}}\right\rVert_{\mathrm{F}}\left\lVert F\right\rVert_{\mathrm{F}}
σnrlog4.5(n)ΔFFFF.\displaystyle\lesssim\sigma\sqrt{nr}\log^{4.5}(n)\left\lVert\Delta_{\mathrm{F}}\right\rVert_{\mathrm{F}}\left\lVert F^{\star}\right\rVert_{\mathrm{F}}.

Finally, we will lower bound A_{3}. Let Z\in\mathbf{Z} with \left\lVert Z\right\rVert_{\mathrm{F}}=1 be the normalized projection of \Delta_{X}Y^{\top}+X\Delta_{Y}^{\top} onto \mathbf{Z} (if this projection is zero, take any unit-norm element of \mathbf{Z}), so that \left\lVert P_{\mathbf{Z}}(\Delta_{X}Y^{\top}+X\Delta_{Y}^{\top})\right\rVert_{\mathrm{F}}^{2}=\left\langle Z,\Delta_{X}Y^{\top}+X\Delta_{Y}^{\top}\right\rangle^{2}. Then, we have the following:

A3\displaystyle A_{3} =ΔXY+XΔY,P𝐙(XYXY)\displaystyle=\left\langle\Delta_{X}Y^{\top}+X\Delta_{Y}^{\top},P_{\mathbf{Z}^{\perp}}\left(XY^{\top}-X^{\star}Y^{\star\top}\right)\right\rangle
=ΔXY+XΔY,P𝐙(ΔXY+XΔYΔXΔY)\displaystyle=\left\langle\Delta_{X}Y^{\top}+X\Delta_{Y}^{\top},P_{\mathbf{Z}^{\perp}}\left(\Delta_{X}Y^{\top}+X\Delta_{Y}^{\top}-\Delta_{X}\Delta_{Y}^{\top}\right)\right\rangle
=P𝐙(ΔXY+XΔY)F2ΔXY+XΔY,P𝐙(ΔXΔY)B0\displaystyle=\left\lVert P_{\mathbf{Z}^{\perp}}\left(\Delta_{X}Y^{\top}+X\Delta_{Y}^{\top}\right)\right\rVert_{\mathrm{F}}^{2}-\underbrace{\left\langle\Delta_{X}Y^{\top}+X\Delta_{Y}^{\top},P_{\mathbf{Z}^{\perp}}\left(\Delta_{X}\Delta_{Y}^{\top}\right)\right\rangle}_{B_{0}}
=ΔXY+XΔYF2P𝐙(ΔXY+XΔY)F2B0\displaystyle=\left\lVert\Delta_{X}Y^{\top}+X\Delta_{Y}^{\top}\right\rVert_{\mathrm{F}}^{2}-\left\lVert P_{\mathbf{Z}}\left(\Delta_{X}Y^{\top}+X\Delta_{Y}^{\top}\right)\right\rVert_{\mathrm{F}}^{2}-B_{0}
=ΔXY+XΔYF2Z,ΔXY+XΔY2B0\displaystyle=\left\lVert\Delta_{X}Y^{\top}+X\Delta_{Y}^{\top}\right\rVert_{\mathrm{F}}^{2}-\left\langle Z,\Delta_{X}Y^{\top}+X\Delta_{Y}^{\top}\right\rangle^{2}-B_{0}
=(i)ZF2ΔXY+XΔYF2P𝐓0(Z),ΔXY+XΔY2B0\displaystyle\overset{(i)}{=}\left\lVert Z\right\rVert_{\mathrm{F}}^{2}\left\lVert\Delta_{X}Y^{\top}+X\Delta_{Y}^{\top}\right\rVert_{\mathrm{F}}^{2}-\left\langle P_{\mathbf{T}_{0}}(Z),\Delta_{X}Y^{\top}+X\Delta_{Y}^{\top}\right\rangle^{2}-B_{0}
ZF2ΔXY+XΔYF2P𝐓0(Z)F2ΔXY+XΔYF2B0\displaystyle\geq\left\lVert Z\right\rVert_{\mathrm{F}}^{2}\left\lVert\Delta_{X}Y^{\top}+X\Delta_{Y}^{\top}\right\rVert_{\mathrm{F}}^{2}-\left\lVert P_{\mathbf{T}_{0}}(Z)\right\rVert_{\mathrm{F}}^{2}\left\lVert\Delta_{X}Y^{\top}+X\Delta_{Y}^{\top}\right\rVert_{\mathrm{F}}^{2}-B_{0}
=P𝐓0(Z)F2ΔXY+XΔYF2B0\displaystyle=\left\lVert P_{{\mathbf{T}_{0}}^{\perp}}(Z)\right\rVert_{\mathrm{F}}^{2}\left\lVert\Delta_{X}Y^{\top}+X\Delta_{Y}^{\top}\right\rVert_{\mathrm{F}}^{2}-B_{0}
(ii)cr12log(n)ΔXY+XΔYF2B0,\displaystyle\overset{(ii)}{\geq}\frac{c_{r_{1}}}{2\log(n)}\left\lVert\Delta_{X}Y^{\top}+X\Delta_{Y}^{\top}\right\rVert_{\mathrm{F}}^{2}-B_{0},

where (i)(i) is due to the fact that ZF=1\left\lVert Z\right\rVert_{\mathrm{F}}=1 and ΔXY+XΔY\Delta_{X}Y^{\top}+X\Delta_{Y}^{\top} is already in the subspace 𝐓0\mathbf{T}_{0} by the definition of 𝐓0={UA+BVAT×r,Bn×r}\mathbf{T}_{0}=\left\{UA^{\top}+BV^{\top}\mid A\in\mathbb{R}^{T\times r},B\in\mathbb{R}^{n\times r}\right\} as the tangent space of XYXY^{\top}; (ii)(ii) is due to Lemma D.6. Note that Lemma D.6 applies because 𝐓0\mathbf{T}_{0} is the tangent space of XYXY^{\top}, and (X,Y)(X,Y) was fixed to be on the gradient flow. Furthermore, we can bound |B0|\left\lvert B_{0}\right\rvert as follows:

|B0|\displaystyle\left\lvert B_{0}\right\rvert ΔXFΔYF(ΔXYF+XΔYF)\displaystyle\leq\left\lVert\Delta_{X}\right\rVert_{\mathrm{F}}\left\lVert\Delta_{Y}\right\rVert_{\mathrm{F}}\left(\left\lVert\Delta_{X}Y^{\top}\right\rVert_{\mathrm{F}}+\left\lVert X\Delta_{Y}^{\top}\right\rVert_{\mathrm{F}}\right)
ΔXFΔYF(ΔXFY+ΔYFX)\displaystyle\leq\left\lVert\Delta_{X}\right\rVert_{\mathrm{F}}\left\lVert\Delta_{Y}\right\rVert_{\mathrm{F}}\left(\left\lVert\Delta_{X}\right\rVert_{\mathrm{F}}\left\lVert Y\right\rVert+\left\lVert\Delta_{Y}\right\rVert_{\mathrm{F}}\left\lVert X\right\rVert\right)
(i)ΔFF3F\displaystyle\overset{(i)}{\lesssim}\left\lVert\Delta_{\mathrm{F}}\right\rVert_{\mathrm{F}}^{3}\left\lVert F^{\star}\right\rVert
ρ2ΔFFF\displaystyle\leq\rho^{2}\left\lVert\Delta_{\mathrm{F}}\right\rVert_{\mathrm{F}}\left\lVert F^{\star}\right\rVert
(ii)σ2r2κnlog12(n)σminΔFFF\displaystyle\overset{(ii)}{\lesssim}\frac{\sigma^{2}r^{2}\kappa n\log^{12}(n)}{\sigma_{\min}}\left\lVert\Delta_{\mathrm{F}}\right\rVert_{\mathrm{F}}\left\lVert F^{\star}\right\rVert
σnΔFFF,\displaystyle\lesssim\sigma\sqrt{n}\left\lVert\Delta_{\mathrm{F}}\right\rVert_{\mathrm{F}}\left\lVert F^{\star}\right\rVert,

where (i)(i) makes use of Lemma D.5, and (ii)(ii) follows from the bound on ρ\rho from (15), and the last step follows from the bound on σnσmin\frac{\sigma\sqrt{n}}{\sigma_{\min}}.

Now, we are ready to put everything together to bound the inner product:

\displaystyle\Gamma=A_{3}-A_{1}+A_{2}\geq A_{3}-\left\lvert A_{1}\right\rvert-\left\lvert A_{2}\right\rvert
\displaystyle\geq\frac{c_{r_{1}}}{2\log(n)}\left\lVert\Delta_{X}Y^{\top}+X\Delta_{Y}^{\top}\right\rVert_{\mathrm{F}}^{2}-\left\lvert B_{0}\right\rvert-\left\lvert A_{1}\right\rvert-\left\lvert A_{2}\right\rvert.

Putting previous bounds together, we have

|A1|+|A2|+|B0|σnrlog4.5(n)ΔFFFF.\displaystyle\left\lvert A_{1}\right\rvert+\left\lvert A_{2}\right\rvert+\left\lvert B_{0}\right\rvert\lesssim\sigma\sqrt{nr}\log^{4.5}(n)\left\lVert\Delta_{\mathrm{F}}\right\rVert_{\mathrm{F}}\left\lVert F^{\star}\right\rVert_{\mathrm{F}}.

On the other hand, we can lower bound the positive term of Γ\Gamma using the following lemma.

Lemma D.12.

Consider a point (X,Y)(X,Y) on the gradient flow of function ff starting from the point (X,Y)\left(X^{\star},Y^{\star}\right), such that (X,Y)(X,Y)\in\mathcal{B}. Then,

ΔXY+XΔYF2σmin4ΔFF2.\displaystyle\left\lVert\Delta_{X}Y^{\top}+X\Delta_{Y}^{\top}\right\rVert_{\mathrm{F}}^{2}\geq\frac{\sigma_{\min}}{4}\left\lVert\Delta_{\mathrm{F}}\right\rVert^{2}_{\mathrm{F}}.

By Lemma D.12 we have

cr12log(n)ΔXY+XΔYF2cr12log(n)σmin4ΔFF2\displaystyle\frac{c_{r_{1}}}{2\log(n)}\left\lVert\Delta_{X}Y^{\top}+X\Delta_{Y}^{\top}\right\rVert_{\mathrm{F}}^{2}\geq\frac{c_{r_{1}}}{2\log(n)}\cdot\frac{\sigma_{\min}}{4}\left\lVert\Delta_{\mathrm{F}}\right\rVert^{2}_{\mathrm{F}}

We substitute \left\lVert\Delta_{\mathrm{F}}\right\rVert_{\mathrm{F}}=\rho because (X,Y) was fixed to lie on the boundary of the region \mathcal{B}. Hence,

cr12log(n)ΔXY+XΔYF2\displaystyle\frac{c_{r_{1}}}{2\log(n)}\left\lVert\Delta_{X}Y^{\top}+X\Delta_{Y}^{\top}\right\rVert_{\mathrm{F}}^{2} σminlog(n)ΔFFρ\displaystyle\gtrsim\frac{\sigma_{\min}}{\log(n)}\left\lVert\Delta_{\mathrm{F}}\right\rVert_{\mathrm{F}}\cdot\rho
=σnrlog5(n)ΔFFFF.\displaystyle=\sigma\sqrt{nr}\log^{5}(n)\left\lVert\Delta_{\mathrm{F}}\right\rVert_{\mathrm{F}}\left\lVert F^{\star}\right\rVert_{\mathrm{F}}.

Finally, for large enough nn,

\displaystyle\Gamma\gtrsim\sigma\sqrt{nr}\log^{5}(n)\left\lVert\Delta_{\mathrm{F}}\right\rVert_{\mathrm{F}}\left\lVert F^{\star}\right\rVert_{\mathrm{F}}-\sigma\sqrt{nr}\log^{4.5}(n)\left\lVert\Delta_{\mathrm{F}}\right\rVert_{\mathrm{F}}\left\lVert F^{\star}\right\rVert_{\mathrm{F}}>0. ∎
Proof of Lemma D.12.

Claim 11 in [17] states

ΔXY+XΔYF2σmin4ΔFF2σ2n13.\displaystyle\left\lVert\Delta_{X}Y^{\top}+X\Delta_{Y}^{\top}\right\rVert_{\mathrm{F}}^{2}\geq\frac{\sigma_{\min}}{4}\left\lVert\Delta_{\mathrm{F}}\right\rVert_{\mathrm{F}}^{2}-\frac{\sigma^{2}}{n^{13}}. (28)

We can follow the steps of their proof to prove the desired statement. In particular, we note that the term σ2/n13-\sigma^{2}/n^{13} in the right-hand side of Eq (28) comes from the fact that they use the bound

XXYYFσκn15.\left\lVert X^{\top}X-Y^{\top}Y\right\rVert_{\mathrm{F}}\lesssim\frac{\sigma}{\kappa n^{15}}.

Because, by Lemma D.3, we have X^{\top}X=Y^{\top}Y exactly, we do not incur the term -\sigma^{2}/n^{13} in our lower bound. ∎

D.5 Proof of Lemma D.2

Let g(M,τ,m)g(M,\tau,m) denote the convex function that we are optimizing in (2). Recall that (M^,τ^,m^)(\hat{M},\hat{\tau},\hat{m}) is a global optimum of function gg. In this subsection, we prove Lemma D.2, which relates (M^,τ^,m^)(\hat{M},\hat{\tau},\hat{m}) to a local optimum of ff.

We begin by proving the following useful lemma, which is similar to Lemma 20 from [11], but allows X,YX,Y to have different dimensions.

Lemma D.13.

Consider matrices XX and YY such that XX=YYX^{\top}X=Y^{\top}Y. There is an SVD of XYXY^{\top} denoted by UΣVU\Sigma V^{\top} such that X=UΣ1/2RX=U\Sigma^{1/2}R and Y=VΣ1/2RY=V\Sigma^{1/2}R for some rotation matrix R𝒪r×rR\in\mathcal{O}^{r\times r}.

Proof.

Let X=U_{X}\Sigma_{X}V_{X}^{\top} and Y=U_{Y}\Sigma_{Y}V_{Y}^{\top} be their respective SVDs, with the diagonal entries of \Sigma_{X} and \Sigma_{Y} in decreasing order. Then, X^{\top}X=Y^{\top}Y implies \Sigma_{X}=\Sigma_{Y} and that the singular subspaces of V_{X} and V_{Y} coincide. Hence, there exists an SVD of Y=\tilde{U}_{Y}\Sigma_{Y}\tilde{V}_{Y}^{\top} such that V_{X}=\tilde{V}_{Y}. Then, XY^{\top}=U_{X}\Sigma_{X}^{2}\tilde{U}_{Y}^{\top}. This is an SVD of XY^{\top}, with U=U_{X}, \Sigma=\Sigma_{X}^{2}, and V=\tilde{U}_{Y}. Substituting these quantities into the SVDs of X and Y, we complete the proof that X=U\Sigma^{1/2}R and Y=V\Sigma^{1/2}R, where R=V_{X}^{\top}=\tilde{V}_{Y}^{\top}\in\mathcal{O}^{r\times r}. ∎

Let (X,Y)(X,Y) represent the limit of the gradient flow of ff from the initial point (X,Y)(X^{\star},Y^{\star}). Let mm and τ\tau be the values that minimize f(X,Y)f(X,Y). Furthermore, let the SVD of XYXY^{\top} be denoted by UΣVU\Sigma V^{\top}.

By Lemma D.3, we have that XX=YYX^{\top}X=Y^{\top}Y. Then, Lemma D.13 gives us

(IUU)X=0and(IVV)Y=0.\displaystyle(I-UU^{\top})X=0\qquad\text{and}\qquad(I-VV^{\top})Y=0. (29)

We claim that to prove Lemma D.2, it suffices to prove that XY^{\top}=\hat{M}. Indeed, if XY^{\top}=\hat{M}, then Lemma D.13 gives X=\hat{X}R and Y=\hat{Y}R for some rotation matrix R\in\mathcal{O}^{r\times r}, which proves Lemma D.2.

The proof that XY=M^XY^{\top}=\hat{M} consists of two parts. We will first establish that (XY,τ,m)(XY^{\top},\tau,m) is also an optimal point of gg by verifying the first order conditions of gg are satisfied. We will then show that gg has a unique optimal solution (M^,τ^,m^)(\hat{M},\hat{\tau},\hat{m}). Putting these two parts together establishes that XY=M^XY^{\top}=\hat{M}.

D.5.1 First order conditions of gg are satisfied

We will first show that the following first order conditions of gg are satisfied at (XY,τ,m)(XY^{\top},\tau,m).

Zl,OXYm𝟏i=1kτiZi\displaystyle\left\langle Z_{l},O-XY^{\top}-m\mathbf{1}^{\top}-\sum_{i=1}^{k}{\tau}_{i}Z_{i}\right\rangle =0for l=1,2,,k\displaystyle=0\qquad\text{for }l=1,2,\dots,k (30a)
OXYm𝟏i=1kτiZi\displaystyle O-XY^{\top}-m\mathbf{1}^{\top}-\sum_{i=1}^{k}\tau_{i}Z_{i} =λ(UV+W)\displaystyle=\lambda\left(UV^{\top}+W\right) (30b)
UW\displaystyle U^{\top}W =0\displaystyle=0 (30c)
WV\displaystyle WV =0\displaystyle=0 (30d)
W\displaystyle\left\lVert W\right\rVert 1\displaystyle\leq 1 (30e)
m\displaystyle m =1T(OXYi=1kτiZi)𝟏.\displaystyle=\frac{1}{T}\left(O-XY^{\top}-\sum_{i=1}^{k}\tau_{i}Z_{i}\right)\mathbf{1}. (30f)

We select W:=1λ(OXYm𝟏i=1kτiZi)UVW:=\frac{1}{\lambda}\left(O-XY^{\top}-m\mathbf{1}^{\top}-\sum_{i=1}^{k}\tau_{i}Z_{i}\right)-UV^{\top}. Note that (30a) and (30f) are automatically satisfied given the definition of τ\tau and mm, and (30b) is automatically satisfied by our choice of WW.

To show (30c) and (30d), we use the fact that fX=fY=0\frac{\partial f}{\partial X}=\frac{\partial f}{\partial Y}=0:

(OXYm𝟏i=1kτiZi)Y=λXand(OXYm𝟏i=1kτiZi)X=λY\displaystyle\left(O-XY^{\top}-m\mathbf{1}^{\top}-\sum_{i=1}^{k}\tau_{i}Z_{i}\right)Y=\lambda X\quad\text{and}\quad\left(O-XY^{\top}-m\mathbf{1}^{\top}-\sum_{i=1}^{k}\tau_{i}Z_{i}\right)^{\top}X=\lambda Y

Now, by Lemma D.13, we can decompose X=UΣ1/2RX=U\Sigma^{1/2}R and Y=VΣ1/2RY=V\Sigma^{1/2}R where RR is a rotation matrix. Right-multiplying the above equations by R1Σ1/2R^{-1}\Sigma^{-1/2} gives

(OXYm𝟏i=1kτiZi)V=λUand(OXYm𝟏i=1kτiZi)U=λV.\displaystyle\left(O-XY^{\top}-m\mathbf{1}^{\top}-\sum_{i=1}^{k}\tau_{i}Z_{i}\right)V=\lambda U\quad\text{and}\quad\left(O-XY^{\top}-m\mathbf{1}^{\top}-\sum_{i=1}^{k}\tau_{i}Z_{i}\right)^{\top}U=\lambda V.

Rearranging, the first equation shows that WV=0WV=0 and the second shows UW=0U^{\top}W=0.

The last step of the proof is to verify (30e). Using (30c) and (30d), we have

W\displaystyle W =(IUU)W(IVV)\displaystyle=(I-UU^{\top})W(I-VV^{\top})
=1λ(IUU)(Om𝟏i=1kτiZi)(IVV),\displaystyle=\frac{1}{\lambda}(I-UU^{\top})\left(O-m\mathbf{1}^{\top}-\sum_{i=1}^{k}\tau_{i}Z_{i}\right)(I-VV^{\top}),

where the last line is obtained by plugging in our chosen value of WW and using (29) to get rid of the XYXY^{\top} term. Plugging in O=XY+m𝟏+i=1kτiZi+E^O=X^{\star}Y^{\star\top}+m^{\star}\mathbf{1}^{\top}+\sum_{i=1}^{k}\tau^{\star}_{i}Z_{i}+\hat{E}, we have

W\displaystyle W =1λ(IUU)(XY+(mm)𝟏+i=1k(τiτi)Zi+E^)(IVV).\displaystyle=\frac{1}{\lambda}(I-UU^{\top})\left(X^{\star}Y^{\star\top}+(m^{\star}-m)\mathbf{1}^{\top}+\sum_{i=1}^{k}(\tau^{\star}_{i}-\tau_{i})Z_{i}+\hat{E}\right)(I-VV^{\top}).

We will use substitution to get rid of the (mm)𝟏(m^{\star}-m)\mathbf{1}^{\top} term. Because we have m=M𝟏Tm^{\star}=\frac{M^{\star}\mathbf{1}}{T} and

m\displaystyle m =1T(OXYi=1kτiZi)𝟏=1T(MXY+i=1k(τiτi)Zi+E^)𝟏,\displaystyle=\frac{1}{T}\left(O-XY^{\top}-\sum_{i=1}^{k}\tau_{i}Z_{i}\right)\mathbf{1}=\frac{1}{T}\left(M^{\star}-XY^{\top}+\sum_{i=1}^{k}(\tau^{\star}_{i}-\tau_{i})Z_{i}+\hat{E}\right)\mathbf{1},

this implies that

(mm)𝟏\displaystyle(m^{\star}-m)\mathbf{1}^{\top} =(XYi=1k(τiτi)ZiE^)𝟏𝟏T=P𝟏(XYi=1k(τiτi)ZiE^).\displaystyle=\left(XY^{\top}-\sum_{i=1}^{k}(\tau^{\star}_{i}-\tau_{i})Z_{i}-\hat{E}\right)\frac{\mathbf{1}\mathbf{1}^{\top}}{T}=P_{\mathbf{1}}\left(XY^{\top}-\sum_{i=1}^{k}(\tau^{\star}_{i}-\tau_{i})Z_{i}-\hat{E}\right).

Substituting this expression into our expression for WW, we have

W=1λ(IUU)(XY+i=1k(τiτi)P𝟏(Zi)+P𝟏(E^))(IVV)\displaystyle W=\frac{1}{\lambda}(I-UU^{\top})\left(X^{\star}Y^{\star\top}+\sum_{i=1}^{k}(\tau^{\star}_{i}-\tau_{i})P_{{\mathbf{1}}^{\perp}}(Z_{i})+P_{{\mathbf{1}}^{\perp}}(\hat{E})\right)(I-VV^{\top})

where the P_{\mathbf{1}}(XY^{\top}) term vanishes because (I-UU^{\top})X=0 by (29).

By Lemma D.4, we have V𝟏=0V^{\top}\mathbf{1}=0. This allows us to simplify the closed form expression for the projection given in Lemma B.1: P𝐓(A)=(IUU)P𝟏(A)(IVV)P_{{\mathbf{T}}^{\perp}}(A)=(I-UU^{\top})P_{{\mathbf{1}}^{\perp}}(A)(I-VV^{\top}). Hence,

W=1λ(IUU)XY(IVV)+1λP𝐓(E^)+1λi=1k(τiτi)P𝐓(Zi).\displaystyle W=\frac{1}{\lambda}(I-UU^{\top})\ X^{\star}Y^{\star\top}(I-VV^{\top})+\frac{1}{\lambda}P_{{\mathbf{T}}^{\perp}}(\hat{E})+\frac{1}{\lambda}\sum_{i=1}^{k}(\tau^{\star}_{i}-\tau_{i})P_{{\mathbf{T}}^{\perp}}(Z_{i}).

We can upper bound its spectral norm as follows:

λW\displaystyle\lambda\left\lVert W\right\rVert (IUU)XY(IVV)A1+E^A2+ττi=1kP𝐓(Zi)A3.\displaystyle\leq\underbrace{\left\lVert(I-UU^{\top})X^{\star}Y^{\star\top}(I-VV^{\top})\right\rVert}_{A_{1}}+\underbrace{\left\lVert\hat{E}\right\rVert}_{A_{2}}+\underbrace{\left\lVert\tau^{\star}-\tau\right\rVert\sum_{i=1}^{k}\left\lVert P_{{\mathbf{T}}^{\perp}}(Z_{i})\right\rVert}_{A_{3}}.

Now, we bound each of these terms separately.

Bounding A1A_{1}. By (29), we have (IUU)X=0(I-UU^{\top})X=0 and (IVV)Y=0(I-VV^{\top})Y=0. Hence,

\displaystyle\left\lVert(I-UU^{\top})X^{\star}Y^{\star\top}(I-VV^{\top})\right\rVert=\left\lVert(I-UU^{\top})(XH_{X,Y}-X^{\star})(YH_{X,Y}-Y^{\star})^{\top}(I-VV^{\top})\right\rVert
XHX,YXFYHX,YYF\displaystyle\leq\left\lVert XH_{X,Y}-X^{\star}\right\rVert_{\mathrm{F}}\left\lVert YH_{X,Y}-Y^{\star}\right\rVert_{\mathrm{F}}
ρ2\displaystyle\lesssim\rho^{2}
(i)σ2r2κnlog12(n)σmin\displaystyle\overset{(i)}{\lesssim}\frac{\sigma^{2}r^{2}\kappa n\log^{12}(n)}{\sigma_{\min}}
σn,\displaystyle\lesssim\sigma\sqrt{n},

where (i)(i) follows from the bound on ρ\rho from (15), and the last step follows from the bound on σnσmin\frac{\sigma\sqrt{n}}{\sigma_{\min}}.

Bounding A2A_{2}. A2A_{2} is bounded by Item 3.4(a) which gives E^E+δσn\|\hat{E}\|\leq\left\lVert E\right\rVert+\left\lVert\delta\right\rVert\lesssim\sigma\sqrt{n}.

Bounding A_{3}. Since (XY^{\top},\tau,m) satisfies (30a)–(30d) and (30f), the following decomposition holds by the same argument as in the proof of Lemma 2.1.

D~(ττ)=λΔ~1+Δ~2+Δ~3,\displaystyle\tilde{D}\left(\tau-\tau^{\star}\right)=\lambda\tilde{\Delta}^{1}+\tilde{\Delta}^{2}+\tilde{\Delta}^{3},

where D~k×k\tilde{D}\in\mathbb{R}^{k\times k} is the matrix with entries D~ij=P𝐓(Zi),P𝐓(Zj)\tilde{D}_{ij}=\langle P_{{\mathbf{T}}^{\perp}}(Z_{i}),P_{{\mathbf{T}}^{\perp}}(Z_{j})\rangle and Δ~1,Δ~2,Δ~3k\tilde{\Delta}^{1},\tilde{\Delta}^{2},\tilde{\Delta}^{3}\in\mathbb{R}^{k} are vectors with components

Δ~i1=Zi,UV,Δ~i2=Zi,P𝐓(E^),Δ~i3=Zi,P𝐓(M).\displaystyle\tilde{\Delta}^{1}_{i}=\left\langle Z_{i},UV^{\top}\right\rangle,\quad\tilde{\Delta}^{2}_{i}=\left\langle Z_{i},P_{{\mathbf{T}}^{\perp}}(\hat{E})\right\rangle,\quad\tilde{\Delta}^{3}_{i}=\left\langle Z_{i},P_{{\mathbf{T}}^{\perp}}(M^{\star})\right\rangle.

This gives:

A3\displaystyle A_{3} =D~1(λΔ~1+Δ~2+Δ~3)i=1kP𝐓(Zi)\displaystyle=\left\lVert\tilde{D}^{-1}(\lambda\tilde{\Delta}^{1}+\tilde{\Delta}^{2}+\tilde{\Delta}^{3})\right\rVert\sum_{i=1}^{k}\left\lVert P_{{\mathbf{T}}^{\perp}}(Z_{i})\right\rVert
λD~1Δ~1i=1kP𝐓(Zi)+(D~1Δ~2+D~1Δ~3)i=1kP𝐓(Zi)\displaystyle\leq\lambda\left\lVert\tilde{D}^{-1}\tilde{\Delta}^{1}\right\rVert\sum_{i=1}^{k}\left\lVert P_{{\mathbf{T}}^{\perp}}(Z_{i})\right\rVert+\left(\left\lVert\tilde{D}^{-1}\tilde{\Delta}^{2}\right\rVert+\left\lVert\tilde{D}^{-1}\tilde{\Delta}^{3}\right\rVert\right)\sum_{i=1}^{k}\left\lVert P_{{\mathbf{T}}^{\perp}}(Z_{i})\right\rVert
λ(1cr22logn)+2log2ncs(Δ~2+Δ~3),\displaystyle\leq\lambda\left(1-\frac{c_{r_{2}}}{2\log n}\right)+\frac{2\log^{2}n}{c_{s}}\left(\left\lVert\tilde{\Delta}^{2}\right\rVert+\left\lVert\tilde{\Delta}^{3}\right\rVert\right),

where the last inequality made use of Lemma D.6.

Claim D.14.

We have

Δ~2,Δ~3σnrlogn.\displaystyle\left\lVert\tilde{\Delta}^{2}\right\rVert,\left\lVert\tilde{\Delta}^{3}\right\rVert\lesssim\sigma\sqrt{nr}\log n.

With the above claim, we are ready to bound W\left\lVert W\right\rVert.

W\displaystyle\left\lVert W\right\rVert 1λ(A1+A2+A3)\displaystyle\leq\frac{1}{\lambda}\left(A_{1}+A_{2}+A_{3}\right)
1λ(A1+A2+λ(1cr22logn)+2log2ncs(Δ~2+Δ~3)).\displaystyle\leq\frac{1}{\lambda}\left(A_{1}+A_{2}+\lambda\left(1-\frac{c_{r_{2}}}{2\log n}\right)+\frac{2\log^{2}n}{c_{s}}\left(\left\lVert\tilde{\Delta}^{2}\right\rVert+\left\lVert\tilde{\Delta}^{3}\right\rVert\right)\right).

Hence, we conclude that

W\displaystyle\left\lVert W\right\rVert 1cr22logn+O(σnrlog3(n)λ)1\displaystyle\lesssim 1-\frac{c_{r_{2}}}{2\log n}+O\left(\frac{\sigma\sqrt{nr}\log^{3}(n)}{\lambda}\right)\leq 1

for large enough nn and λ=Θ(σnrlog4.5(n))\lambda=\Theta\left(\sigma\sqrt{nr}\log^{4.5}(n)\right).

Proof of D.14.

Bounding Δ~2\left\lVert\tilde{\Delta}^{2}\right\rVert. Using Lemma D.9, we have

Δ~2kmaxi[k]|P𝐓(Zi),E^|σnrlogn.\displaystyle\left\lVert\tilde{\Delta}^{2}\right\rVert\leq k\max_{i\in[k]}\left\lvert\left\langle P_{{\mathbf{T}}^{\perp}}(Z_{i}),\hat{E}\right\rangle\right\rvert\lesssim\sigma\sqrt{nr}\log n.

Bounding Δ~3\left\lVert\tilde{\Delta}^{3}\right\rVert. Δ~3\left\lVert\tilde{\Delta}^{3}\right\rVert can be bounded by following the exact same steps as the bound of Δ3\left\lVert\Delta^{3}\right\rVert (17), replacing 𝐓^\hat{\mathbf{T}}, U^\hat{U}, and V^\hat{V} with 𝐓\mathbf{T}, UU, and VV, respectively. In particular, note that Lemma D.4 gives that V𝟏=0V^{\top}\mathbf{1}=0 along the gradient flow that ends at X,YX,Y, which allows the same simplifications.

Δ~3σ2r2κnlog12.5(n)σminσn,\displaystyle\left\lVert\tilde{\Delta}^{3}\right\rVert\lesssim\frac{\sigma^{2}r^{2}\kappa n\log^{12.5}(n)}{\sigma_{\min}}\lesssim\sigma\sqrt{n},

where the last step used the assumed bound on σnσmin\frac{\sigma\sqrt{n}}{\sigma_{\min}}. ∎

D.5.2 Function gg has a unique minimizer

In this subsection, we aim to show that the convex function g(M,\tau,m) has a unique minimizer. Throughout the proof, we fix a global minimum (\hat{M},\hat{\tau},\hat{m}) of g such that \hat{M}=XY^{\top}, where (X,Y) is the limit of the gradient flow of f starting from (X^{\star},Y^{\star}). We have already shown in Section D.5.1 that (XY^{\top},\tau,m) satisfies all of the first order conditions of g; hence \hat{M}=XY^{\top} is indeed a minimizer of g. First, note that up to a bijective change of variables (namely, N:=M+\sum_{i}\tau_{i}Z_{i}+m\mathbf{1}^{\top}), minimizing g is equivalent to minimizing the following function

g~(N,τ,m)=12ONF2+λNiτiZim𝟏.\tilde{g}(N,\tau,m)=\frac{1}{2}\|O-N\|_{\mathrm{F}}^{2}+\lambda\left\|N-\sum_{i}\tau_{i}Z_{i}-m\mathbf{1}^{\top}\right\|_{\star}.

Hence, it suffices to show that g~\tilde{g} has a unique minimizer. This function is strictly convex in NN, because of the term ONF2\|O-N\|_{\mathrm{F}}^{2}. We can then fix N^\hat{N} to be the unique value of NN that minimizes g~\tilde{g}. In particular, note that N^=M^+iτ^iZi+m^𝟏\hat{N}=\hat{M}+\sum_{i}\hat{\tau}_{i}Z_{i}+\hat{m}\mathbf{1}^{\top}. It only remains to show that the following convex optimization problem

minτ,mN^iτiZim𝟏=minτ,mM^iτiZim𝟏\min_{\tau,m}\left\|\hat{N}-\sum_{i}\tau_{i}Z_{i}-m\mathbf{1}^{\top}\right\|_{\star}=\min_{\tau,m}\left\|\hat{M}-\sum_{i}\tau_{i}Z_{i}-m\mathbf{1}^{\top}\right\|_{\star} (31)

has a unique minimizer. By the definition of \hat{M}=XY^{\top}, a minimum of the right-hand side of (31) is attained at \tau=0 and m=0 (otherwise, \hat{M} would not be an optimum of g). Now consider any other optimal solution to the problem in (31), that is, a matrix Z\in\mathbf{Z} such that \|XY^{\top}\|_{\star}=\|XY^{\top}+Z\|_{\star}. Write the SVD XY^{\top}=U\Sigma V^{\top}. Recall that the subgradients of the nuclear norm are \left\{UV^{\top}+W:U^{\top}W=0,WV=0,\left\lVert W\right\rVert\leq 1\right\}. By the convexity of the nuclear norm, we have

XY+ZXYZ,UV+W,\|XY^{\top}+Z\|_{\star}-\|XY^{\top}\|_{\star}\geq\left\langle Z,UV^{\top}+W\right\rangle, (32)

for any matrix WW with UW=0U^{\top}W=0, WV=0WV=0 and W1\|W\|\leq 1. Recall that 𝐓0\mathbf{T}_{0} is defined to be the tangent space of XYXY^{\top}. Defining the SVD P𝐓0(Z)=U~Σ~V~P_{\mathbf{T}_{0}^{\perp}}(Z)=\tilde{U}\tilde{\Sigma}\tilde{V}^{\top}, we can take W=U~V~W=\tilde{U}\tilde{V}^{\top}, which gives

XY+ZXYZ,UV+P𝐓0(Z).\|XY^{\top}+Z\|_{\star}-\|XY^{\top}\|_{\star}\geq\left\langle Z,UV^{\top}\right\rangle+\|P_{\mathbf{T}_{0}^{\perp}}(Z)\|_{\star}. (33)

We now use the following claim.

Claim D.15.

Under the same assumptions as in Lemma D.6, we have

P𝐓0(Z)>|Z,UV|Z𝐙{0}.\|P_{\mathbf{T}_{0}^{\perp}}(Z)\|_{\star}>|\left\langle Z,UV^{\top}\right\rangle|\qquad\forall Z\in\mathbf{Z}\setminus\{0\}.

With this result, we have that if Z\neq 0, then \|XY^{\top}+Z\|_{\star}>\|XY^{\top}\|_{\star}, which contradicts the definition of Z. Hence, Z=0, which completes the proof that g has a unique minimizer.

Proof of D.15.

Up to changing ZZ into Z-Z, it suffices to show that for all non-zero Z𝐙Z\in\mathbf{Z}, we have P𝐓0(Z)>Z,UV\|P_{\mathbf{T}_{0}^{\perp}}(Z)\|_{\star}>\left\langle Z,UV^{\top}\right\rangle. First, using similar arguments as in (33) we show that for any matrices AA and BB such that A,B=0\left\langle A,B\right\rangle=0 with SVD A=UAΣAVAA=U_{A}\Sigma_{A}V_{A}^{\top}, we have

A+BAB,UAVA=0.\|A+B\|_{\star}-\|A\|_{\star}\geq\left\langle B,U_{A}V_{A}^{\top}\right\rangle=0.

Hence, A+BA\|A+B\|_{\star}\geq\|A\|_{\star}. In particular, we can take A=P𝐓(Z)A=P_{\mathbf{T}^{\perp}}(Z) and B=P𝐓0(Z)P𝐓(Z)B=P_{\mathbf{T}_{0}^{\perp}}(Z)-P_{\mathbf{T}^{\perp}}(Z) since they are orthogonal to each other due to the fact that 𝐓𝐓0\mathbf{T}^{\perp}\subset\mathbf{T}_{0}^{\perp}. Then, we have

P𝐓0(Z)P𝐓(Z).\|P_{\mathbf{T}_{0}^{\perp}}(Z)\|_{\star}\geq\|P_{\mathbf{T}^{\perp}}(Z)\|_{\star}.

Next, by Lemma D.4 we have V𝟏=0V^{\top}\mathbf{1}=0, so that Z,UV=Z,P𝟏(UV)=P𝟏(Z),P𝟏(UV)\left\langle Z,UV^{\top}\right\rangle=\left\langle Z,P_{\mathbf{1}^{\perp}}(UV^{\top})\right\rangle=\left\langle P_{\mathbf{1}^{\perp}}(Z),P_{\mathbf{1}^{\perp}}(UV^{\top})\right\rangle. The two previous steps essentially show that we can ignore the terms of the form α𝟏\alpha\mathbf{1}^{\top} within ZZ. Formally, it suffices to show that for any non-zero Zspan(Zi,i[k])Z\in span(Z_{i},i\in[k]), we have P𝐓(Z)>Z,P𝟏(UV)\|P_{\mathbf{T}^{\perp}}(Z)\|_{\star}>\left\langle Z,P_{\mathbf{1}^{\perp}}(UV^{\top})\right\rangle. We decompose such a matrix as Z=i[k]αiZiZ=\sum_{i\in[k]}\alpha_{i}Z_{i}. First, by (21) of Lemma D.6, we have that P𝐓(Z)F2cr12log(n)ZF2>0\|P_{\mathbf{T}^{\perp}}(Z)\|_{\mathrm{F}}^{2}\geq\frac{c_{r_{1}}}{2\log(n)}\|Z\|_{F}^{2}>0, which implies P𝐓(Z)>0\|P_{\mathbf{T}^{\perp}}(Z)\|>0. Then, with α:=(α1,,αk)0\alpha:=(\alpha_{1},\ldots,\alpha_{k})\neq 0, we have

P𝐓(Z)Z,P𝟏(UV)\displaystyle\|P_{\mathbf{T}^{\perp}}(Z)\|_{\star}-\left\langle Z,P_{\mathbf{1}^{\perp}}(UV^{\top})\right\rangle (i)P𝐓(Z)F2P𝐓(Z)Z,P𝟏(UV)\displaystyle\overset{(i)}{\geq}\frac{\|P_{\mathbf{T}^{\perp}}(Z)\|_{\mathrm{F}}^{2}}{\|P_{\mathbf{T}^{\perp}}(Z)\|}-\left\langle Z,P_{\mathbf{1}^{\perp}}(UV^{\top})\right\rangle
=(ii)1P𝐓(Z)(αD~ααΔ~1P𝐓(Z))\displaystyle\overset{(ii)}{=}\frac{1}{\|P_{\mathbf{T}^{\perp}}(Z)\|}\left(\alpha^{\top}\tilde{D}\alpha-\alpha^{\top}\tilde{\Delta}^{1}\cdot\|P_{\mathbf{T}^{\perp}}(Z)\|\right)
=1P𝐓(Z)(αD~ααΔ~1i[k]αiP𝐓(Zi))\displaystyle=\frac{1}{\|P_{\mathbf{T}^{\perp}}(Z)\|}\left(\alpha^{\top}\tilde{D}\alpha-\alpha^{\top}\tilde{\Delta}^{1}\cdot\left\lVert\sum_{i\in[k]}\alpha_{i}P_{\mathbf{T}^{\perp}}(Z_{i})\right\rVert\right)
1P𝐓(Z)(αD~ααΔ~1αi[k]P𝐓(Zi))a,\displaystyle\geq\frac{1}{\|P_{\mathbf{T}^{\perp}}(Z)\|}\underbrace{\left(\alpha^{\top}\tilde{D}\alpha-\alpha^{\top}\tilde{\Delta}^{1}\cdot\|\alpha\|\sum_{i\in[k]}\|P_{\mathbf{T}^{\perp}}(Z_{i})\|\right)}_{a},

where (i) is due to the fact that \|P_{\mathbf{T}^{\perp}}(Z)\|>0 together with the inequality \|A\|_{\mathrm{F}}^{2}\leq\|A\|\cdot\|A\|_{\star}, which holds because the squared Frobenius norm of A is the sum of the squares of its singular values, each of which is at most \|A\| times the corresponding singular value; (ii) uses the identity \|P_{\mathbf{T}^{\perp}}(Z)\|_{\mathrm{F}}^{2}=\left\langle\sum_{i\in[k]}\alpha_{i}P_{\mathbf{T}^{\perp}}(Z_{i}),\sum_{j\in[k]}\alpha_{j}P_{\mathbf{T}^{\perp}}(Z_{j})\right\rangle=\alpha^{\top}\tilde{D}\alpha.

By (24) of Lemma D.6, the matrix D~\tilde{D} is invertible. It is also symmetric by construction. We next define β=D~1/2α\beta=\tilde{D}^{1/2}\alpha. We obtain

a\displaystyle a =β2βD~1/2Δ~1D~1/2βi[k]P𝐓(Zi)\displaystyle=\|\beta\|^{2}-\beta^{\top}\tilde{D}^{-1/2}\tilde{\Delta}^{1}\cdot\|\tilde{D}^{-1/2}\beta\|\sum_{i\in[k]}\|P_{\mathbf{T}^{\perp}}(Z_{i})\|
β2(1D~1/2Δ~1D~1/2i[k]P𝐓(Zi))\displaystyle\geq\|\beta\|^{2}\left(1-\|\tilde{D}^{-1/2}\|\cdot\|\tilde{\Delta}^{1}\|\cdot\|\tilde{D}^{-1/2}\|\sum_{i\in[k]}\|P_{\mathbf{T}^{\perp}}(Z_{i})\|\right)
>(i)0,\displaystyle\overset{(i)}{>}0,

where in (i)(i) we used (23) of Lemma D.6 together with the fact that β0\beta\neq 0. Combining the two previous inequalities shows that

P𝐓(Z)>Z,P𝟏(UV),Zspan(Zi,i[k]).\|P_{\mathbf{T}^{\perp}}(Z)\|_{\star}>\left\langle Z,P_{\mathbf{1}^{\perp}}(UV^{\top})\right\rangle,\quad\forall Z\in span(Z_{i},i\in[k]).

This ends the proof of the claim. ∎

D.6 Proof of Lemma C.1

See C.1

Proof.

By Lemma C.2, for sufficiently large nn, we can write M^=XY\hat{M}=XY^{\top} where (X,Y)(X,Y) is the limit of the gradient flow of ff started at (X,Y)(X^{\star},Y^{\star}), and (X,Y)(X,Y)\in\mathcal{B}. Hence, the conditions for applying both Lemma D.6 and Lemma D.9 with 𝐓^\hat{\mathbf{T}}, the tangent space of M^\hat{M}, are satisfied. By (24) from Lemma D.6, we have the desired bound on σmin(D)\sigma_{\min}(D). Furthermore, we have

Δ2kΔ2lognΔ2.\displaystyle\left\lVert\Delta^{2}\right\rVert\leq\sqrt{k}\left\lVert\Delta^{2}\right\rVert_{\infty}\lesssim\sqrt{\log n}\left\lVert\Delta^{2}\right\rVert_{\infty}.

We aim to bound Δ2=maxi[k]|P𝐓^(Zi),E^|\left\lVert\Delta^{2}\right\rVert_{\infty}=\max_{i\in[k]}\left\lvert\left\langle P_{\hat{\mathbf{T}}^{\perp}}(Z_{i}),\hat{E}\right\rangle\right\rvert, where E^=E+δ\hat{E}=E+\delta. We have

|P𝐓^(Zi),E^|\displaystyle\left\lvert\left\langle P_{\hat{\mathbf{T}}^{\perp}}(Z_{i}),\hat{E}\right\rangle\right\rvert |P𝐓^(Zi)P𝐓(Zi),E^|+|P𝐓(Zi),E^|\displaystyle\leq\left\lvert\left\langle P_{\hat{\mathbf{T}}^{\perp}}(Z_{i})-P_{{\mathbf{T}}^{\star\perp}}(Z_{i}),\hat{E}\right\rangle\right\rvert+\left\lvert\left\langle P_{{\mathbf{T}}^{\star\perp}}(Z_{i}),\hat{E}\right\rangle\right\rvert
(i)P𝐓^(Zi)P𝐓(Zi)(E+δ)+|P𝐓(Zi),E^|\displaystyle\overset{(i)}{\leq}\left\lVert P_{\hat{\mathbf{T}}^{\perp}}(Z_{i})-P_{{\mathbf{T}}^{\star\perp}}(Z_{i})\right\rVert_{\star}(\left\lVert E\right\rVert+\left\lVert\delta\right\rVert)+\left\lvert\left\langle P_{{\mathbf{T}}^{\star\perp}}(Z_{i}),\hat{E}\right\rangle\right\rvert
(ii)ρκrσminσn+|P𝐓(Zi),E^|\displaystyle\overset{(ii)}{\lesssim}\frac{\rho\sqrt{\kappa r}}{\sqrt{\sigma_{\min}}}\sigma\sqrt{n}+\left\lvert\left\langle P_{{\mathbf{T}}^{\star\perp}}(Z_{i}),\hat{E}\right\rangle\right\rvert
(iii)σ2r1.5κnlog6(n)σmin+|P𝐓(Zi),P𝐓(E^)|\displaystyle\overset{(iii)}{\lesssim}\frac{\sigma^{2}r^{1.5}\kappa n\log^{6}(n)}{\sigma_{\min}}+\left\lvert\left\langle P_{{\mathbf{T}}^{\star\perp}}(Z_{i}),P_{{\mathbf{T}}^{\star\perp}}(\hat{E})\right\rangle\right\rvert

where (i)(i) is due to Von Neumann’s trace inequality, (ii)(ii) is due to (22) and Item 3.4(a), and (iii)(iii) is due to (15). Hence,

Δ2\displaystyle\left\lVert\Delta^{2}\right\rVert σ2r1.5κnlog6.5(n)σmin+log0.5(n)maxi[k]|P𝐓(Zi),P𝐓(E^)|\displaystyle\lesssim\frac{\sigma^{2}r^{1.5}\kappa n\log^{6.5}(n)}{\sigma_{\min}}+\log^{0.5}(n)\cdot\max_{i\in[k]}\left\lvert\left\langle P_{{\mathbf{T}}^{\star\perp}}(Z_{i}),P_{{\mathbf{T}}^{\star\perp}}(\hat{E})\right\rangle\right\rvert
=σ2r1.5κnlog6.5(n)σmin+log0.5(n)maxi[k]|P𝐓(Z~i),P𝐓(E+δ)|Z~iF.\displaystyle=\frac{\sigma^{2}r^{1.5}\kappa n\log^{6.5}(n)}{\sigma_{\min}}+\log^{0.5}(n)\cdot\max_{i\in[k]}\frac{\left\lvert\left\langle P_{{\mathbf{T}}^{\star\perp}}(\tilde{Z}_{i}),P_{{\mathbf{T}}^{\star\perp}}(E+\delta)\right\rangle\right\rvert}{\|\tilde{Z}_{i}\|_{\mathrm{F}}}.

Appendix E Details of Computational Experiments

In this section, we present more details and results for Section 4. Our code and data are available at this GitHub repository.

Compute resources: Our experiments utilized a high-performance computing cluster, leveraging CPU-based nodes for all experimental runs. Our home directory within the cluster provides 10 TB of storage. Experiments were conducted on nodes equipped with Intel Xeon CPUs, configured with 5 CPUs and 20 GB of RAM. We utilized approximately 40 nodes operating in parallel to conduct 200 iterations of experiments for each parameter set. While a single experiment completes within minutes on a personal laptop, executing the full suite of experiments on the cluster required approximately one day.

Method implementations: Similar to the methodology outlined in the supplemental materials of [17], we implemented PaCE using an alternating minimization technique to solve the convex optimization problem. To tune the hyperparameter λ\lambda, we initially set it to a large value and gradually reduced it until the rank of M^\hat{M} reached a pre-defined rank rr. In all of our experiments, we fixed rr to 6. We note that we did not tune this pre-defined rank, and our method might achieve better performance with a different rank.
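To make the alternating-minimization and λ-tuning procedure above concrete, the following sketch illustrates one way it could be organized, assuming the optimization problem (2) takes the form ½‖O − M − Σᵢ τᵢZᵢ − m𝟏^⊤‖_F² + λ‖M‖_* (consistent with the first-order conditions in Appendix D.5.1). The function names, update order, shrink factor, and tolerances are illustrative assumptions, not the exact settings of our implementation.

```python
import numpy as np

def svt(A, lam):
    # Singular-value soft-thresholding: the prox operator of lam * nuclear norm.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U * np.maximum(s - lam, 0.0)) @ Vt

def fit_pace(O, Zs, lam, n_iter=200):
    # Block-coordinate (alternating) minimization for the assumed objective
    #   0.5 * ||O - M - sum_i tau_i Z_i - m 1^T||_F^2 + lam * ||M||_*.
    # O: n x T outcome matrix; Zs: list of k treatment-pattern matrices (n x T).
    n, T = O.shape
    Zmat = np.column_stack([Z.ravel() for Z in Zs])   # (n*T) x k design for tau
    M, m = np.zeros((n, T)), np.zeros(n)
    tau = np.zeros(len(Zs))
    for _ in range(n_iter):
        # tau-step: least squares on the residual, holding M and m fixed.
        resid = O - M - m[:, None]
        tau = np.linalg.lstsq(Zmat, resid.ravel(), rcond=None)[0]
        treat = sum(t * Z for t, Z in zip(tau, Zs))
        # m-step: closed form (row means of the residual), holding M and tau fixed.
        m = (O - M - treat).mean(axis=1)
        # M-step: soft-threshold the singular values of the remaining residual.
        M = svt(O - treat - m[:, None], lam)
    return M, tau, m

def tune_lambda(O, Zs, target_rank=6, shrink=0.8, max_steps=100):
    # Start lambda large and shrink it until rank(M_hat) reaches target_rank.
    lam = np.linalg.norm(O, 2)                        # spectral norm of O
    for _ in range(max_steps):
        M, tau, m = fit_pace(O, Zs, lam)
        if np.linalg.matrix_rank(M, tol=1e-8) >= target_rank:
            break
        lam *= shrink
    return M, tau, m, lam
```

Each pass performs exact block updates: a least-squares step for τ, a row-mean step for m, and a singular-value soft-thresholding step for M.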

The following models from the Python econml package were implemented; the supervised learning models from sklearn listed below were used for the corresponding parameters in our implementation (see the configuration sketch after the list):

  • DML and CausalForestDML:

    • model_y: GradientBoostingRegressor

    • model_t: GradientBoostingClassifier

    • model_final: LassoCV (for DML)

  • LinearDML:

    • model_y: RandomForestRegressor

    • model_t: RandomForestClassifier

  • XLearner and ForestDRLearner:

    • models/model_regression: GradientBoostingRegressor

    • propensity_model/model_propensity: GradientBoostingClassifier

Additionally, we implemented DRLearner from the Python econml package using the default parameters. We also implemented multi_arm_causal_forest from the grf package in R.
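For concreteness, the sketch below shows how the econml estimators above might be instantiated with the listed nuisance models. The constructor arguments follow the econml documentation, but exact argument names can vary across econml versions, so this should be read as an illustration rather than our exact experiment code.

```python
from sklearn.ensemble import (GradientBoostingRegressor, GradientBoostingClassifier,
                              RandomForestRegressor, RandomForestClassifier)
from sklearn.linear_model import LassoCV
from econml.dml import DML, LinearDML, CausalForestDML
from econml.dr import DRLearner, ForestDRLearner
from econml.metalearners import XLearner

# discrete_treatment=True is an assumption here, reflecting a binary treatment.
estimators = {
    "DML": DML(model_y=GradientBoostingRegressor(),
               model_t=GradientBoostingClassifier(),
               model_final=LassoCV(), discrete_treatment=True),
    "CausalForestDML": CausalForestDML(model_y=GradientBoostingRegressor(),
                                       model_t=GradientBoostingClassifier(),
                                       discrete_treatment=True),
    "LinearDML": LinearDML(model_y=RandomForestRegressor(),
                           model_t=RandomForestClassifier(),
                           discrete_treatment=True),
    "XLearner": XLearner(models=GradientBoostingRegressor(),
                         propensity_model=GradientBoostingClassifier()),
    "ForestDRLearner": ForestDRLearner(model_regression=GradientBoostingRegressor(),
                                       model_propensity=GradientBoostingClassifier()),
    "DRLearner": DRLearner(),  # default parameters
}

# Typical usage (outcome Y, treatment T, covariates X):
#   est.fit(Y, T, X=X)
#   cate_hat = est.effect(X)
```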

Finally, we implemented matrix completion with nuclear norm minimization (MCNNM), where the matrix of observed entries is completed using nuclear norm penalization. The regularization parameter \lambda for the nuclear norm term is chosen in the same way as in our implementation of PaCE.
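As a rough illustration of MCNNM, the sketch below completes the matrix over the control (untreated) entries via a standard soft-impute iteration, i.e., repeated singular-value soft-thresholding with the treated entries filled in by the current estimate. The iteration cap and tolerance are illustrative assumptions, and our actual implementation may differ in these details.

```python
import numpy as np

def mcnnm(O, treated_mask, lam, n_iter=500, tol=1e-6):
    # Soft-impute iteration for
    #   min_L 0.5 * ||P_Omega(O - L)||_F^2 + lam * ||L||_*,
    # where Omega is the set of control (untreated) entries.
    obs = ~treated_mask                        # observed-as-control entries
    L = np.zeros_like(O, dtype=float)
    for _ in range(n_iter):
        filled = np.where(obs, O, L)           # fill treated entries with current L
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        L_new = (U * np.maximum(s - lam, 0.0)) @ Vt
        if np.linalg.norm(L_new - L) <= tol * max(np.linalg.norm(L), 1.0):
            L = L_new
            break
        L = L_new
    return L  # estimated effects on treated entries: (O - L)[treated_mask]
```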

We note that LLMs facilitated our implementation of all these methods, the experiment and visualization code, and our data processing code.

Notes about the data: We imputed missing data values for 2020 by calculating the average values from 2019 and 2021. Furthermore, in the supplemental code and data provided [available at this GitHub repository], we redacted a covariate column ‘Snap_Percentage_Vulnerable’ that was used in our experiments because it was obtained using proprietary information.
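For instance, assuming the outcome panel is stored in a wide pandas DataFrame with one (hypothetical) column per year, the 2020 imputation amounts to the following:

```python
import numpy as np
import pandas as pd

# Hypothetical wide layout: one row per region, one column per year of the outcome.
df = pd.DataFrame({"y2019": [1.0, 2.0], "y2020": [np.nan, np.nan], "y2021": [3.0, 4.0]})

# Impute the missing 2020 values as the average of the 2019 and 2021 values.
df["y2020"] = df["y2020"].fillna((df["y2019"] + df["y2021"]) / 2)
```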

Additional results: Table 2 presents the proportion of instances in which each method achieved the lowest normalized Mean Absolute Error (nMAE) across all of our experiments. It encompasses every method we benchmarked.

Table 3 and Table 4 detail the complete results for the average nMAE of each method, along with the corresponding standard deviations. Table 3 reports the nMAE values for estimating heterogeneous treatment effects across all observations. In contrast, Table 4 focuses on the nMAE values for treated observations only. Note that MCNNM only appears in Table 4, as this method does not directly give estimates of heterogeneous treatment effects for untreated observations.

In the following tables, the "All" column presents the result for all instances across all parameter settings. These settings include proportions of zip codes treated at 0.05, 0.25, 0.5, 0.75, and 1.0; either with adaptive or non-adaptive treatment application approaches; and under additive or multiplicative treatment effects. Each parameter set is tested in 200 instances, totaling 4000 instances reflected in each value within the "All" column. Subsequent columns detail outcomes when one parameter is held constant. Specifically, the "Adaptive" and "Effect" columns each summarize results from 2000 instances, while each result under the "Proportion treated" column corresponds to 800 instances. The standard deviations, shown in parentheses within Table 3 and Table 4, are calculated for each result by analyzing the variability among the instances that contribute to that particular data point.

Table 2: Proportion of instances where each method results in the lowest nMAE, across all methods considered.
Adaptive Proportion treated Effect
All N Y 0.05 0.25 0.5 0.75 1.0 Add. Mult.
SNAP
PaCE 0.59 0.58 0.60 0.26 0.54 0.76 0.75 0.65 0.64 0.54
Causal Forest 0.16 0.17 0.16 0.14 0.17 0.16 0.14 0.21 0.02 0.30
CausalForestDML 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
DML 0.11 0.21 0.00 0.12 0.13 0.08 0.09 0.11 0.17 0.04
DRLearner 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
ForestDRLearner 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
LinearDML 0.14 0.04 0.24 0.48 0.15 0.01 0.02 0.04 0.17 0.11
XLearner 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
State
PaCE 0.41 0.39 0.42 0.23 0.32 0.39 0.54 0.56 0.46 0.36
Causal Forest 0.05 0.09 0.01 0.06 0.04 0.04 0.05 0.04 0.04 0.05
CausalForestDML 0.07 0.07 0.07 0.03 0.07 0.10 0.07 0.08 0.06 0.09
DML 0.27 0.29 0.26 0.41 0.36 0.27 0.17 0.16 0.31 0.24
DRLearner 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
ForestDRLearner 0.01 0.01 0.01 0.00 0.00 0.00 0.01 0.02 0.00 0.01
LinearDML 0.02 0.01 0.04 0.02 0.06 0.01 0.01 0.02 0.02 0.02
XLearner 0.17 0.13 0.20 0.25 0.14 0.19 0.15 0.11 0.10 0.23
County
PaCE 0.06 0.10 0.02 0.16 0.08 0.04 0.02 0.01 0.07 0.06
Causal Forest 0.29 0.38 0.20 0.25 0.28 0.31 0.31 0.30 0.23 0.34
CausalForestDML 0.04 0.05 0.03 0.00 0.01 0.03 0.05 0.11 0.03 0.05
DML 0.35 0.35 0.35 0.34 0.37 0.34 0.39 0.31 0.47 0.24
DRLearner 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
ForestDRLearner 0.01 0.00 0.01 0.00 0.00 0.00 0.00 0.02 0.00 0.01
LinearDML 0.20 0.04 0.35 0.22 0.22 0.21 0.16 0.17 0.19 0.20
XLearner 0.05 0.07 0.04 0.03 0.03 0.06 0.07 0.07 0.01 0.10
Table 3: Average nMAE across methods. Standard deviations shown in parentheses.
Adaptive Proportion treated Treatment effect
All N Y 0.05 0.25 0.5 0.75 1.0 Add. Mult.
SNAP
PaCE 0.18 0.17 0.18 0.4 0.17 0.12 0.1 0.09 0.13 0.22
(0.16) (0.11) (0.2) (0.21) (0.1) (0.07) (0.06) (0.05) (0.13) (0.18)
Causal Forest (R) 0.21 0.21 0.21 0.42 0.2 0.17 0.13 0.12 0.18 0.23
(0.15) (0.07) (0.19) (0.19) (0.06) (0.04) (0.04) (0.06) (0.12) (0.17)
CausalForestDML 40.81 5.01 76.6 195.95 6.85 0.8 0.26 0.17 40.93 40.68
(115.45) (10.86) (154.86) (191.08) (6.37) (0.42) (0.05) (0.06) (115.69) (115.24)
DML 0.38 0.31 0.45 0.72 0.35 0.32 0.27 0.26 0.28 0.48
(0.29) (0.21) (0.33) (0.41) (0.17) (0.15) (0.14) (0.16) (0.27) (0.27)
DRLearner 1186.44 371.84 2001.03 107.4 1016.66 2277.78 2445.56 84.78 1095.4 1277.47
(1764.1) (462.49) (2164.23) (229.15) (815.45) (2036.92) (2339.59) (211.55) (1597.24) (1912.51)
ForestDRLearner 0.94 0.84 1.04 2.08 0.9 0.71 0.57 0.45 0.99 0.89
(7.08) (1.53) (9.9) (15.78) (0.1) (0.13) (0.17) (0.21) (9.49) (3.2)
LinearDML 0.25 0.28 0.21 0.34 0.25 0.25 0.2 0.19 0.15 0.34
(0.16) (0.16) (0.16) (0.23) (0.13) (0.12) (0.13) (0.14) (0.1) (0.16)
XLearner 0.51 0.46 0.56 0.89 0.55 0.47 0.36 0.27 0.5 0.52
(0.25) (0.17) (0.3) (0.29) (0.05) (0.04) (0.02) (0.02) (0.23) (0.27)
State
PaCE 0.28 0.34 0.22 0.57 0.32 0.21 0.17 0.14 0.25 0.32
(0.24) (0.26) (0.2) (0.3) (0.18) (0.12) (0.1) (0.08) (0.21) (0.25)
Causal Forest (R) 0.33 0.39 0.28 0.6 0.35 0.29 0.23 0.19 0.3 0.36
(0.22) (0.24) (0.18) (0.27) (0.16) (0.11) (0.08) (0.08) (0.19) (0.24)
CausalForestDML 0.75 1.02 0.48 2.76 0.36 0.25 0.21 0.19 0.76 0.74
(3.22) (4.41) (1.1) (6.85) (0.25) (0.11) (0.1) (0.08) (3.31) (3.14)
DML 0.35 0.42 0.28 0.59 0.34 0.29 0.27 0.26 0.31 0.39
(0.33) (0.41) (0.19) (0.6) (0.21) (0.17) (0.14) (0.14) (0.37) (0.28)
DRLearner 929.38 520.49 1338.07 1466.8 2905.93 159.67 68.36 47.3 891.42 967.37
(4601.05) (3268.39) (5597.16) (5776.54) (8056.84) (898.43) (348.15) (677.07) (4089.69) (5062.03)
ForestDRLearner 60.29 119.37 1.24 299.72 0.87 0.4 0.28 0.26 52.15 68.43
(629.42) (886.43) (2.42) (1382.62) (1.11) (0.21) (0.15) (0.14) (451.79) (767.07)
LinearDML 0.86 1.34 0.39 2.72 0.59 0.4 0.33 0.29 0.85 0.88
(1.56) (2.06) (0.36) (2.74) (0.41) (0.22) (0.16) (0.13) (1.63) (1.48)
XLearner 0.29 0.37 0.22 0.53 0.31 0.23 0.2 0.19 0.27 0.31
(0.19) (0.2) (0.14) (0.25) (0.13) (0.1) (0.08) (0.06) (0.16) (0.2)
County
PaCE 0.34 0.35 0.33 0.49 0.37 0.29 0.28 0.26 0.28 0.39
(0.2) (0.19) (0.22) (0.29) (0.19) (0.14) (0.12) (0.11) (0.13) (0.24)
Causal Forest (R) 0.26 0.26 0.27 0.43 0.3 0.22 0.2 0.18 0.22 0.31
(0.17) (0.13) (0.21) (0.23) (0.17) (0.12) (0.1) (0.08) (0.12) (0.21)
CausalForestDML 2.1 0.47 3.72 8.83 0.95 0.28 0.23 0.2 2.05 2.15
(5.24) (0.68) (7.01) (8.92) (0.78) (0.11) (0.09) (0.07) (5.17) (5.31)
DML 0.3 0.34 0.25 0.46 0.28 0.26 0.23 0.26 0.21 0.38
(0.25) (0.25) (0.24) (0.27) (0.2) (0.23) (0.22) (0.26) (0.17) (0.28)
DRLearner 717.31 260.77 1173.85 560.3 153.83 72.65 428.02 2371.74 396.35 1038.27
(5046.04) (834.88) (7058.49) (1091.51) (664.27) (462.94) (827.06) (11014.13) (1118.57) (7034.13)
ForestDRLearner 0.75 0.51 1 2.57 0.4 0.3 0.26 0.23 0.57 0.93
(5.99) (1.35) (8.35) (13.24) (0.15) (0.12) (0.1) (0.06) (2.93) (7.94)
LinearDML 0.36 0.47 0.24 0.61 0.34 0.29 0.26 0.28 0.28 0.44
(0.32) (0.36) (0.24) (0.42) (0.23) (0.27) (0.25) (0.27) (0.25) (0.37)
XLearner 0.32 0.33 0.32 0.5 0.36 0.29 0.25 0.23 0.29 0.36
(0.14) (0.13) (0.16) (0.17) (0.11) (0.09) (0.07) (0.04) (0.09) (0.18)
Table 4: Average nMAE among treated units. Standard deviations shown in parentheses.
Adaptive Proportion treated Treatment effect
All N Y 0.05 0.25 0.5 0.75 1.0 Add. Mult.
SNAP
PaCE 0.1 0.11 0.09 0.15 0.1 0.09 0.09 0.09 0.07 0.13
(0.08) (0.04) (0.1) (0.13) (0.06) (0.05) (0.05) (0.05) (0.03) (0.09)
Causal Forest (R) 0.13 0.18 0.07 0.18 0.13 0.12 0.11 0.1 0.11 0.14
(0.08) (0.04) (0.07) (0.11) (0.07) (0.07) (0.06) (0.06) (0.07) (0.09)
CausalForestDML 0.22 0.27 0.17 0.41 0.2 0.17 0.16 0.15 0.18 0.26
(0.24) (0.11) (0.31) (0.45) (0.09) (0.08) (0.07) (0.07) (0.12) (0.31)
DML 0.44 0.29 0.6 0.89 0.41 0.37 0.29 0.27 0.27 0.62
(0.72) (0.19) (0.98) (1.44) (0.37) (0.28) (0.17) (0.18) (0.22) (0.97)
DRLearner 2074.25 376.92 3771.58 170.11 2703.71 4134.22 3279.81 83.39 1405.57 2742.93
(4283.12) (461.16) (5542.67) (263) (5417.88) (5543.25) (4232.75) (203.48) (2139.72) (5587.94)
ForestDRLearner 4.86 1.19 8.54 20.94 1.39 0.88 0.63 0.47 4.08 5.64
(98.58) (1.92) (139.32) (219.79) (0.89) (0.45) (0.31) (0.26) (103.28) (93.67)
LinearDML 0.26 0.25 0.26 0.31 0.28 0.27 0.22 0.22 0.12 0.39
(0.26) (0.12) (0.35) (0.3) (0.29) (0.28) (0.21) (0.2) (0.09) (0.3)
MCNNM 0.19 0.24 0.14 0.27 0.18 0.17 0.16 0.16 0.16 0.21
(0.18) (0.05) (0.24) (0.34) (0.1) (0.09) (0.08) (0.08) (0.09) (0.23)
XLearner 0.92 0.44 1.41 2.67 0.75 0.54 0.38 0.28 0.58 1.27
(2.77) (0.19) (3.85) (5.81) (0.62) (0.35) (0.15) (0.07) (0.49) (3.85)
State
PaCE 0.21 0.23 0.19 0.36 0.24 0.17 0.15 0.13 0.18 0.24
(0.17) (0.19) (0.15) (0.27) (0.14) (0.09) (0.08) (0.07) (0.13) (0.2)
Causal Forest (R) 0.3 0.32 0.28 0.46 0.33 0.29 0.23 0.19 0.27 0.33
(0.22) (0.24) (0.19) (0.37) (0.17) (0.13) (0.09) (0.08) (0.15) (0.26)
CausalForestDML 0.26 0.33 0.2 0.43 0.27 0.22 0.21 0.19 0.24 0.28
(0.24) (0.31) (0.12) (0.46) (0.13) (0.1) (0.09) (0.08) (0.16) (0.3)
DML 0.34 0.38 0.29 0.51 0.34 0.29 0.27 0.26 0.3 0.37
(0.23) (0.25) (0.2) (0.34) (0.21) (0.17) (0.14) (0.14) (0.2) (0.25)
DRLearner 1704.15 791.94 2615.89 4387.86 3853.02 165.89 68.68 47.43 1413.19 1995.26
(11821.06) (5032.31) (15890.26) (23128.14) (11970.8) (1085.93) (353.98) (662.49) (6457.8) (15418.33)
ForestDRLearner 427.34 849.26 5.63 2134.78 1.45 0.46 0.29 0.26 329.6 525.13
(4126.02) (5805.96) (17.91) (9032.24) (2.32) (0.33) (0.17) (0.16) (2885.66) (5071.02)
LinearDML 0.43 0.52 0.35 0.81 0.42 0.35 0.31 0.28 0.39 0.48
(0.53) (0.65) (0.36) (1.06) (0.24) (0.17) (0.14) (0.12) (0.27) (0.7)
MCNNM 0.23 0.29 0.18 0.25 0.21 0.21 0.24 0.25 0.22 0.24
(0.17) (0.21) (0.08) (0.3) (0.11) (0.1) (0.09) (0.11) (0.11) (0.21)
XLearner 0.28 0.33 0.23 0.5 0.29 0.22 0.2 0.18 0.26 0.3
(0.31) (0.38) (0.21) (0.61) (0.12) (0.08) (0.07) (0.05) (0.15) (0.41)
County
PaCE 0.31 0.29 0.33 0.44 0.34 0.27 0.26 0.26 0.23 0.39
(0.34) (0.13) (0.46) (0.6) (0.36) (0.19) (0.13) (0.11) (0.09) (0.46)
Causal Forest (R) 0.2 0.25 0.15 0.3 0.21 0.18 0.16 0.16 0.16 0.24
(0.16) (0.12) (0.18) (0.26) (0.15) (0.11) (0.08) (0.07) (0.1) (0.19)
CausalForestDML 0.28 0.3 0.26 0.51 0.29 0.22 0.19 0.18 0.22 0.34
(0.35) (0.18) (0.46) (0.69) (0.21) (0.12) (0.09) (0.07) (0.11) (0.48)
DML 0.3 0.33 0.27 0.41 0.27 0.26 0.24 0.31 0.19 0.41
(0.34) (0.24) (0.41) (0.39) (0.26) (0.26) (0.24) (0.45) (0.15) (0.42)
DRLearner 1183.28 278.38 2088.17 1010.99 160.94 69.94 564.55 4109.95 443.32 1923.23
(10927.48) (1008.75) (15369.32) (2371.12) (956.66) (448.59) (1563.04) (24023.26) (1265.5) (15367.99)
ForestDRLearner 1.57 1.14 1.99 6.58 0.54 0.3 0.22 0.19 1.06 2.08
(12.1) (2.9) (16.86) (26.48) (0.41) (0.17) (0.1) (0.08) (5.01) (16.35)
LinearDML 0.33 0.41 0.25 0.48 0.31 0.28 0.26 0.32 0.22 0.44
(0.39) (0.29) (0.46) (0.56) (0.27) (0.28) (0.27) (0.44) (0.17) (0.5)
MCNNM 0.56 0.49 0.62 0.73 0.58 0.51 0.49 0.48 0.45 0.67
(0.63) (0.12) (0.88) (1.21) (0.57) (0.33) (0.19) (0.14) (0.11) (0.87)
XLearner 0.36 0.32 0.4 0.67 0.39 0.29 0.24 0.22 0.28 0.44
(0.52) (0.14) (0.72) (1.04) (0.34) (0.17) (0.09) (0.06) (0.12) (0.72)