
Heterogeneous Treatment Effects in Panel Data

Retsef Levi
Sloan School of Management
Massachusetts Institute of Technology
[email protected]

Elisabeth Paulson
Harvard Business School
[email protected]

Georgia Perakis
Sloan School of Management
Massachusetts Institute of Technology
[email protected]

Emily Zhang
Operations Research Center
Massachusetts Institute of Technology
[email protected]
Abstract

We address a core problem in causal inference: estimating heterogeneous treatment effects using panel data with general treatment patterns. Many existing methods either do not utilize the potential underlying structure in panel data or have limitations in the allowable treatment patterns. In this work, we propose and evaluate a new method that first partitions observations into disjoint clusters with similar treatment effects using a regression tree, and then leverages the (assumed) low-rank structure of the panel data to estimate the average treatment effect for each cluster. Our theoretical results establish the convergence of the resulting estimates to the true treatment effects. Computational experiments with semi-synthetic data show that our method achieves superior accuracy compared to alternative approaches, using a regression tree with no more than 40 leaves. Hence, our method provides more accurate and interpretable estimates than alternative methods.

1 Introduction

Suppose we observe an outcome of interest across $n$ distinct units over $T$ time periods; such data is commonly referred to as panel data. Each unit may have been subject to an intervention during certain time periods that influences the outcome. As a concrete example, a unit could be a geographic region affected by a new economic policy, or an individual consumer or store influenced by a marketing promotion. Our goal is to understand the impact of this intervention on the outcome. Since the effect of the intervention might vary across individual units and time periods as a function of the unit-level and time-varying covariates, we aim to estimate the heterogeneous treatment effects. This is a key problem in econometrics and causal inference, enabling policymakers or business owners to make more informed decisions about which units to target for future interventions.

We develop a new method, the Panel Clustering Estimator (PaCE), for estimating heterogeneous treatment effects in panel data under general treatment patterns. The causal effects are modeled as a non-parametric function of the covariates of the units, which may vary over time. To estimate heterogeneous effects, PaCE splits the observations into disjoint clusters using a regression tree and estimates the average treatment effect of each cluster. In addition to the development of this novel method, we make the following theoretical and empirical contributions.

Regression tree with bounded bias (Theorem 3.2)

We show that, subject to mild assumptions on the regression tree and a density assumption on the covariates, the bias introduced by approximating the treatment effect function with the piece-wise constant function produced by the regression tree decreases polynomially as the number of leaves in the tree increases.

Optimal rate for average treatment effect estimation (Theorem 3.5)

We obtain guarantees for the recovery of average treatment effects for each cluster of observations that match the optimal rate, up to logarithmic factors. This extends the convergence guarantee of [17] to a setting with multiple treatments. Similar to [17], we allow for general intervention patterns.

Enhanced empirical accuracy with simple estimator

Using semi-synthetic data, we demonstrate that PaCE achieves empirical performance competitive with, and often superior to, alternative methods for heterogeneous treatment effect estimation, including double machine learning, causal forest, and matrix completion-based methods. Furthermore, PaCE offers a much simpler and more interpretable estimate, as we achieve this enhanced performance using a regression tree with at most 40 leaves.

1.1 Related literature

Causal inference using panel data is often approached through the synthetic control framework, where treated units are compared to a synthetic control group constructed from untreated units [1, 2]. However, these methods restrict the treatment pattern to a block structure. Matrix completion approaches offer a more flexible alternative, allowing for general treatment patterns [4, 7, 35]. The treated entries are viewed as missing entries, and low-rank matrix completion is utilized to estimate counterfactual outcomes for the treated entries. In most of the matrix completion literature, the set of missing or treated entries is assumed to be generated randomly [12]. However, this assumption is not realistic in our panel data setting. Other works explore matrix completion under deterministic sampling [10, 17, 19, 22, 24], and our work extends the methodology from [17].

Several methods for estimating heterogeneous treatment effects use supervised learning techniques to fit models for the outcome or treatment assignment mechanisms [14, 15, 18, 23, 27]. Although these methods are intended for cases where observations are drawn i.i.d. from one or multiple predefined distributions, they could still be applied to panel data. In contrast, our approach is specifically designed for panel data with underlying structure. By avoiding complex machine learning methods, our approach not only provides accurate results for panel data but also offers interpretable estimates.

The first step of our method involves grouping data into clusters that vary in their treatment effects using a regression tree. As in [8], we greedily search for good splits, iteratively improving the tree’s representation of the true heterogeneous treatment effects. However, unlike supervised learning where the splitting criterion is directly computable, treatment effects are not readily observed. Therefore, we estimate the splitting criterion, an approach used in [5, 6, 29, 33]. While methods like causal forest only use information within a given leaf to estimate the treatment effect for observations in that leaf [33], our method leverages global information from the panel structure to determine the average treatment effect of each cluster, enabling us to achieve high accuracy with just one tree.

The second step involves estimating the average treatment effect in each cluster. For this step, we utilize the de-biased convex estimator introduced by [17]. This estimator is applicable to panel data with general treatment patterns but is specifically designed for estimating average, rather than heterogeneous, treatment effects. It uses a convex estimator, similar to those used in [20, 34], and introduces a novel de-biasing technique that improves upon the initial estimates. Additionally, [17] builds on previous works [11, 13] to provide theoretical guarantees for recovering the average treatment effect with provably minimal assumptions on the intervention patterns. The results of [17] are specific to the case where there is only one treatment. We build on their proof techniques to show convergence when the number of treatments is $O(\log n)$, and also nest this technique into the first step of our algorithm to estimate heterogeneous effects.

1.2 General notation

For any matrix $A$, $\lVert A\rVert$ denotes the spectral norm, $\lVert A\rVert_{\mathrm{F}}$ the Frobenius norm, $\lVert A\rVert_{\star}$ the nuclear norm, and $\lVert A\rVert_{\text{sum}}$ the sum of the absolute values of the entries of $A$. For any vector $a$, $\lVert a\rVert$ denotes its Euclidean norm, and $\lVert a\rVert_{\infty}$ denotes the maximum absolute value of its components. We use $\circ$ to denote element-wise matrix multiplication. For a matrix subspace $\mathbf{T}$, let $P_{\mathbf{T}^{\perp}}(\cdot)$ denote the projection operator onto $\mathbf{T}^{\perp}$, the orthogonal complement of $\mathbf{T}$. We define the sum of two subspaces $\mathbf{T}_{1}$ and $\mathbf{T}_{2}$ as $\mathbf{T}_{1}+\mathbf{T}_{2}=\{T_{1}+T_{2}\mid T_{1}\in\mathbf{T}_{1},T_{2}\in\mathbf{T}_{2}\}$. We write $a\lesssim b$ whenever $a\leq c\cdot b$ for some fixed constant $c$; $a\gtrsim b$ is defined similarly.

2 Model and algorithm

PaCE is designed for panel data, where an outcome of interest is observed across $n$ distinct units for $T$ time periods. Let $O\in\mathbb{R}^{n\times T}$ be the matrix of these observed outcomes. Our objective is to discern how these outcomes were influenced by the various treatments under consideration.

We assume that, in a hypothetical scenario where the treatments were not applied, the expected outcomes would be represented by an unknown low-rank matrix $M^{\star}\in\mathbb{R}^{n\times T}$. To incorporate the influence of external factors, we introduce the covariate tensor $\mathbf{X}:=[X_{1},\dots,X_{p}]\in\mathbb{R}^{p\times n\times T}$, where an element $X_{izt}$ signifies the $i$-th covariate of unit $z$ at time $t$. For convenience, we write $X^{zt}:=[X_{1zt},\dots,X_{pzt}]$ for the vector of covariates of unit $z$ at time $t$. We consider $q$ distinct treatments, represented by binary matrices $W_{1},\dots,W_{q}$, where $W_{i}$ encodes which observations are subjected to treatment $i$. For each treatment $i\in[q]$ applied to unit $z$ at time period $t$, the treatment effect is modeled as a non-parametric Lipschitz function of the covariates, $\mathcal{T}_{i}^{\star}(X^{zt})$. We consider the treatment effects of distinct treatments to be additive; this is a flexible framework because, as we allow for up to $\log(n)$ treatments, interactions of individual treatments can be modeled as additional treatments. Combining all of these elements, the observed outcome matrix $O$ can be expressed as the sum of $M^{\star}$, the combined effects of the treatments, and a noise matrix $E$. Mathematically,

O=M^{\star}+\sum_{i=1}^{q}\mathcal{T}_{i}^{\star}(\mathbf{X})\circ W_{i}+E,

where $\mathcal{T}_{i}^{\star}(\mathbf{X})$ denotes the matrix whose element in the $z$-th row and $t$-th column is $\mathcal{T}_{i}^{\star}(X^{zt})$. We note that our model implicitly assumes unconfoundedness, which is common in the literature [21].
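
To make the model concrete, the following small numpy sketch simulates data from the equation above; the rank, effect functions, noise level, and all variable names are illustrative assumptions rather than the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)
n, T, p, q, rank = 100, 60, 3, 2, 4

M_star = rng.normal(size=(n, rank)) @ rng.normal(size=(rank, T))  # low-rank baseline M*
X = rng.uniform(size=(p, n, T))                                   # covariate tensor X
W = (rng.uniform(size=(q, n, T)) < 0.2).astype(float)             # binary treatment patterns W_i

# Lipschitz treatment-effect functions of the covariate vector X^{zt}
effect_fns = [lambda X: 2.0 * X[0] + X[1],      # T_1*(X^{zt})
              lambda X: np.sin(X[1] + X[2])]    # T_2*(X^{zt})

E = 0.1 * rng.normal(size=(n, T))                                  # noise matrix
O = M_star + sum(effect_fns[i](X) * W[i] for i in range(q)) + E    # observed outcomes
```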

Our goal is to estimate the treatment effect function $\mathcal{T}^{\star}_{i}$ for each $i\in[q]$, having observed $O$, $\mathbf{X}$, and $W_{1},\ldots,W_{q}$. Our methodology PaCE consists of two main steps. First, we build a specialized tree for each treatment $i\in[q]$, designed to partition the observations into disjoint clusters that differ in their treatment effects. Next, we estimate the average treatment effect of each cluster, thus obtaining estimates for the treatment effect functions that are piece-wise constant functions of the covariates. The following subsections delve into the specifics of these two steps.

2.1 Clustering the observations

Initially, the entire dataset with $n\times T$ entries is in one single cluster. This cluster serves as the starting point for the iterative partitioning process, where we split the data into $\ell$ more granular clusters for each treatment that better represent the true heterogeneous treatment effects. We present our method for clustering observations in Algorithm 1. Below, we discuss its two main steps.

Algorithm 1 Clustering the observations
1: Input: outcome matrix $O$, covariate matrices $X_{1},\dots,X_{p}$, treatment matrices $W_{1},\dots,W_{q}$, maximum number of leaves $\ell$, regularization parameter $\lambda$.
2: For each treatment $i\in[q]$, initialize the $j$-th cluster matrix $C^{i}_{j}$ such that all observations are in one single cluster: $C_{1}^{i}\leftarrow\mathbf{1}_{n\times T}$ and $C_{j}^{i}\leftarrow\mathbf{0}_{n\times T}$ for $j>1$.
3: for $l=1$ to $\ell-1$ do
4:     Step 1: Estimate the average treatment effect for each cluster:

\left(\hat{M},\hat{\tau},\hat{m}\right)\leftarrow\operatorname*{argmin}_{\begin{subarray}{c}M\in\mathbb{R}^{n\times T},\\ \tau\in\mathbb{R}^{q\times l},\,m\in\mathbb{R}^{n}\end{subarray}}\frac{1}{2}\left\lVert O-M-m\mathbf{1}^{\top}-\sum_{i=1}^{q}\sum_{j=1}^{l}\tau_{i,j}W_{i}\circ C^{i}_{j}\right\rVert^{2}_{\mathrm{F}}+\lambda\left\lVert M\right\rVert_{\star}. (1)

5:     Step 2: The following is performed for each treatment $q^{\prime}\in[q]$ to select the valid single-variable split that minimizes the estimated mean squared error:
  • For $i\in[q]$ and $j\in[l]$, let the function $\mathcal{C}_{j}^{i}(l^{\prime},p^{\prime},x)$ encode the new cluster $C^{i}_{j}$ after splitting the $l^{\prime}$-th cluster for treatment $q^{\prime}$ based on the comparison of the $p^{\prime}$-th covariate to $x$:

    \mathcal{C}_{j}^{i}(l^{\prime},p^{\prime},x):=\begin{cases}C_{l^{\prime}}^{q^{\prime}}\circ I_{\{X_{p^{\prime}}\leq x\}},&\text{for }j=l^{\prime}\text{ and }i=q^{\prime},\\ C_{l^{\prime}}^{q^{\prime}}\circ I_{\{X_{p^{\prime}}>x\}},&\text{for }j=l+1\text{ and }i=q^{\prime},\\ C_{j}^{i},&\text{otherwise.}\end{cases}

  • Identify the split that minimizes the estimated mean squared error: $(l^{\star},p^{\star},x^{\star})\leftarrow\operatorname*{argmin}_{l^{\prime},p^{\prime},x}\widehat{MSE}(l^{\prime},p^{\prime},x)$, where

    \widehat{MSE}(l^{\prime},p^{\prime},x):=\min_{\tau\in\mathbb{R}^{q\times(l+1)}}\left\lVert O-\hat{M}-\hat{m}\mathbf{1}^{\top}-\sum_{i=1}^{q}\sum_{j=1}^{l+1}\tau_{i,j}W_{i}\circ\mathcal{C}^{i}_{j}(l^{\prime},p^{\prime},x)\right\rVert_{\mathrm{F}}^{2}.

  • For $j\in\{l^{\star},l+1\}$, update the cluster $C_{j}^{q^{\prime}}\leftarrow\mathcal{C}_{j}^{q^{\prime}}\left(l^{\star},p^{\star},x^{\star}\right)$.
6: end for
Step 1

We solve (the convex) Problem (1) to obtain a rough estimate of the counterfactual matrix, represented by $\hat{M}+\hat{m}\mathbf{1}^{\top}$, for a given clustering of observations. The regularization term $\lVert M\rVert_{\star}$ penalizes the rank of $M$. The term $m\mathbf{1}^{\top}$ is introduced to center each row of the matrix $\hat{M}$ without penalization from the nuclear norm term. The inclusion of this term deviates from the convex problem in [17], and we find that it improves empirical performance. The choice of $\lambda$ is discussed in our theoretical results.
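
For concreteness, below is a minimal sketch of Step 1 using an off-the-shelf convex solver (CVXPY); the function name and data layout are illustrative assumptions rather than the authors' implementation.

```python
import cvxpy as cp
import numpy as np

def estimate_cluster_effects(O, W_list, C_list, lam):
    """Solve the nuclear-norm-regularized problem (1).

    O      : (n, T) outcome matrix
    W_list : list of q binary treatment matrices, each (n, T)
    C_list : C_list[i][j] is the (n, T) indicator of cluster j for treatment i
    lam    : regularization weight lambda
    """
    n, T = O.shape
    M = cp.Variable((n, T))
    m = cp.Variable((n, 1))
    tau = [[cp.Variable() for _ in C_list[i]] for i in range(len(W_list))]

    effect = sum(tau[i][j] * (W_list[i] * C_list[i][j])   # W_i o C_j^i (numpy elementwise)
                 for i in range(len(W_list))
                 for j in range(len(C_list[i])))
    residual = O - M - m @ np.ones((1, T)) - effect
    objective = 0.5 * cp.sum_squares(residual) + lam * cp.normNuc(M)
    cp.Problem(cp.Minimize(objective)).solve()
    tau_hat = np.array([[t.value for t in row] for row in tau])
    return M.value, tau_hat, m.value
```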

Step 2

In the second step, we greedily choose a valid split to achieve a better approximation of the true heterogeneous treatment effects. A valid split is one that complies with predefined constraints, which will be discussed in our theoretical guarantees. While the provided algorithm selects splits that minimize the mean squared error (MSE), alternative objectives such as mean absolute error (MAE) may be used instead.
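
The sketch below illustrates the greedy split search on a single cluster of a single treatment; it is a simplified, illustrative version (the full criterion in Algorithm 1 re-fits the constants of all clusters and treatments jointly), and all names are hypothetical.

```python
import numpy as np

def split_mse(R, masks):
    """Least-squares fit of constant effects on the given treated-entry masks."""
    A = np.stack([m.ravel() for m in masks], axis=1)      # (n*T, number of masks)
    tau, *_ = np.linalg.lstsq(A, R.ravel(), rcond=None)
    return np.sum((R.ravel() - A @ tau) ** 2)

def best_split(R, W, C, covariates):
    """Search single-covariate splits of cluster C (binary mask) for treatment W.

    R          : residual matrix O - M_hat - m_hat 1^T, shape (n, T)
    covariates : list of (n, T) covariate matrices
    """
    best_score, best_rule = np.inf, None
    for p_idx, X in enumerate(covariates):
        for x in np.unique(X[C.astype(bool)]):            # candidate thresholds
            left, right = C * (X <= x), C * (X > x)
            if (W * left).sum() == 0 or (W * right).sum() == 0:
                continue                                   # keep a treated obs. on each side
            score = split_mse(R, [W * left, W * right])
            if score < best_score:
                best_score, best_rule = score, (p_idx, x)
    return best_rule, best_score
```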

2.2 Estimating the average treatment effect within each cluster

After partitioning all observations into $\ell$ clusters for each of the $q$ treatments, the next step in our algorithm is to estimate the average treatment effect within each cluster. This is done using an extension of the de-biased convex estimator introduced in [17]. We first obtain an estimate of the average treatment effects by solving a convex optimization problem. This initial estimate is then refined through a de-biasing step.

For simplicity, we define separate treatment matrices for each cluster. Letting $k:=q\times\ell$, we define $\tilde{Z}_{1},\dots,\tilde{Z}_{k}$ as follows: $\tilde{Z}_{(i-1)\ell+j}=W_{i}\circ C_{j}^{i}$ for $i\in[q]$, $j\in[\ell]$. These are binary matrices that identify observations that are in a given cluster and receive a given treatment. We denote the true average treatment effect of the treated in each cluster $i\in[k]$ as $\tilde{\tau}^{\star}_{i}$, and let $\delta_{i}$ be the associated residual matrix, which represents the extent to which individual treatment effects differ from the average treatment effect. More precisely, for $i\in[k]$,

\tilde{\tau}^{\star}_{i}:=\big\langle\mathcal{T}^{\star}_{\lceil i/\ell\rceil}(\mathbf{X}),\tilde{Z}_{i}\big\rangle\big/\big\|\tilde{Z}_{i}\big\|_{\text{sum}}\qquad\text{and}\qquad\delta_{i}:=\mathcal{T}^{\star}_{\lceil i/\ell\rceil}(\mathbf{X})\circ\tilde{Z}_{i}-\tilde{\tau}^{\star}_{i}\tilde{Z}_{i}.

Finally, we define $\delta$ as the combined residual error, $\delta:=\sum_{i=1}^{k}\delta_{i}$, and $\hat{E}:=E+\delta$ as the total error.

We normalize all of these treatment matrices to have Frobenius norm 1. This standardization allows the subsequent de-biasing step to be less sensitive to the relative magnitudes of the treatment matrices. Define $Z_{i}:=\tilde{Z}_{i}/\|\tilde{Z}_{i}\|_{\mathrm{F}}$. By scaling down the treatment matrices in this way, our estimates are proportionally scaled up. Similarly, we define a rescaled version of $\tilde{\tau}^{\star}$, given by $\tau^{\star}_{i}:=\tilde{\tau}^{\star}_{i}\|\tilde{Z}_{i}\|_{\mathrm{F}}$.

First, we solve the following convex optimization problem with the normalized treatment matrices to obtain an estimate of $M^{\star}$ and $\tau^{\star}$:

\left(\hat{M},\hat{\tau},\hat{m}\right)\in\operatorname*{argmin}_{M\in\mathbb{R}^{n\times T},\,\tau\in\mathbb{R}^{k},\,m\in\mathbb{R}^{n}}\frac{1}{2}\left\lVert O-M-m\mathbf{1}^{\top}-\sum_{i=1}^{k}\tau_{i}Z_{i}\right\rVert^{2}_{\mathrm{F}}+\lambda\left\lVert M\right\rVert_{\star}. (2)

Because we introduced the $m\mathbf{1}^{\top}$ term in the convex formulation, which centers the rows of $\hat{M}$ without penalization from the nuclear norm term, $\hat{M}$ serves as our estimate of $M^{\star}$ projected to have zero-mean rows, while $\hat{m}$ serves as our estimate of the row means of $M^{\star}$. For notational convenience, let $P_{\mathbf{1}}$ denote the projection onto $\{\alpha\mathbf{1}^{\top}\mid\alpha\in\mathbb{R}^{n}\}$. That is, for any matrix $A\in\mathbb{R}^{n\times T}$,

P_{\mathbf{1}}(A)=A\frac{\mathbf{1}\mathbf{1}^{\top}}{T}\qquad\text{and}\qquad P_{\mathbf{1}^{\perp}}(A)=A\left(I-\frac{\mathbf{1}\mathbf{1}^{\top}}{T}\right).

Hence $P_{\mathbf{1}^{\perp}}(M^{\star})$ is the projection of $M^{\star}$ with zero-mean rows. Let $m^{\star}:=\frac{M^{\star}\mathbf{1}}{T}$ be the vector of row means of $M^{\star}$. Let $r$ denote the rank of $P_{\mathbf{1}^{\perp}}(M^{\star})$ (note that $r\leq\mathrm{rank}(M^{\star})+1$, so $P_{\mathbf{1}^{\perp}}(M^{\star})$ is a low-rank matrix as well), and let $P_{\mathbf{1}^{\perp}}(M^{\star})=U^{\star}\Sigma^{\star}V^{\star\top}$ be its SVD, with $U^{\star}\in\mathbb{R}^{n\times r}$, $\Sigma^{\star}\in\mathbb{R}^{r\times r}$, $V^{\star}\in\mathbb{R}^{T\times r}$. We denote by $\mathbf{T}^{\star}$ the span of the space $\{\alpha\mathbf{1}^{\top}\mid\alpha\in\mathbb{R}^{n}\}$ and the tangent space of $P_{\mathbf{1}^{\perp}}(M^{\star})$ in the manifold of matrices with rank $r$. That is,

\mathbf{T}^{\star}=\left\{U^{\star}A^{\top}+BV^{\star\top}+a\mathbf{1}^{\top}\mid A\in\mathbb{R}^{T\times r},B\in\mathbb{R}^{n\times r},a\in\mathbb{R}^{n}\right\}.

The closed-form expression of the projection $P_{\mathbf{T}^{\star\perp}}(\cdot)$ is given in Lemma B.1.

The following lemma gives a decomposition of $\hat{\tau}-\tau^{\star}$ that suggests a method for de-biasing $\hat{\tau}$. The difference between our error decomposition and the decomposition presented in [17] stems from the fact that we introduced the $m\mathbf{1}^{\top}$ term in (2).

Lemma 2.1 (Error decomposition).

Suppose $(\hat{M},\hat{m},\hat{\tau})$ is a minimizer of (2). Let $\hat{M}=\hat{U}\hat{\Sigma}\hat{V}^{\top}$ be the SVD of $\hat{M}$, and let $\hat{\mathbf{T}}$ denote the span of the tangent space of $\hat{M}$ and $\{\alpha\mathbf{1}^{\top}\mid\alpha\in\mathbb{R}^{n}\}$. Then,

D\left(\hat{\tau}-\tau^{\star}\right)=\Delta^{1}+\Delta^{2}+\Delta^{3}, (3)

where $D\in\mathbb{R}^{k\times k}$ is the matrix with entries $D_{ij}=\langle P_{\hat{\mathbf{T}}^{\perp}}(Z_{i}),P_{\hat{\mathbf{T}}^{\perp}}(Z_{j})\rangle$ and $\Delta^{1},\Delta^{2},\Delta^{3}\in\mathbb{R}^{k}$ are vectors with components

\Delta^{1}_{i}=\lambda\left\langle Z_{i},\hat{U}\hat{V}^{\top}\right\rangle,\quad\Delta^{2}_{i}=\left\langle Z_{i},P_{\hat{\mathbf{T}}^{\perp}}(\hat{E})\right\rangle,\quad\Delta^{3}_{i}=\left\langle Z_{i},P_{\hat{\mathbf{T}}^{\perp}}(M^{\star})\right\rangle.

As noted in [17], since $D^{-1}\Delta^{1}$ depends solely on observed quantities, it can be subtracted from our estimate. Thus, we define our de-biased estimator $\tau^{d}$ as $\tau^{d}_{i}=(\hat{\tau}-D^{-1}\Delta^{1})_{i}/\|\tilde{Z}_{i}\|_{\mathrm{F}}$ for each $i\in[k]$. The rescaling by $\|\tilde{Z}_{i}\|_{\mathrm{F}}$ adjusts our estimate to approximate $\tilde{\tau}^{\star}$, rather than $\tau^{\star}$.
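
A numpy sketch of this de-biasing step is given below. The projection $P_{\hat{\mathbf{T}}^{\perp}}$ is implemented under the assumption that it removes the column space of $\hat{U}$ and the row space spanned by $\hat{V}$ and $\mathbf{1}$ (the paper's closed form is in its Lemma B.1, not reproduced here), and all function and variable names are illustrative.

```python
import numpy as np

def debiased_estimates(tau_hat, U_hat, V_hat, Z_tilde_list, lam):
    """De-bias tau_hat as tau_hat - D^{-1} Delta^1 and rescale by ||Z_tilde_i||_F."""
    T = V_hat.shape[0]
    ones = np.ones((T, 1)) / np.sqrt(T)
    W, _ = np.linalg.qr(np.hstack([V_hat, ones]))          # orthonormal basis of span(V_hat, 1)

    def proj_perp(Z):                                      # assumed form of P_{T_hat^perp}
        return (Z - U_hat @ (U_hat.T @ Z)) @ (np.eye(T) - W @ W.T)

    Z_list = [Zt / np.linalg.norm(Zt) for Zt in Z_tilde_list]   # Frobenius norm 1
    PZ = [proj_perp(Z) for Z in Z_list]
    k = len(Z_list)
    D = np.array([[np.sum(PZ[i] * PZ[j]) for j in range(k)] for i in range(k)])
    Delta1 = lam * np.array([np.sum(Z * (U_hat @ V_hat.T)) for Z in Z_list])
    scales = np.array([np.linalg.norm(Zt) for Zt in Z_tilde_list])
    return (tau_hat - np.linalg.solve(D, Delta1)) / scales
```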

The de-biased $\tau^{d}$ serves as our final estimate. Specifically, if an observation's covariates $X^{zt}$ belong to the $j$-th cluster, then for each treatment $i\in[q]$, our estimate of $\mathcal{T}^{\star}_{i}(X^{zt})$ is given by $\hat{\mathcal{T}}_{i}(X^{zt})=\tau^{d}_{(i-1)\ell+j}$, which is the average treatment effect of treatment $i$ associated with the cluster that $X^{zt}$ belongs to.

3 Theoretical guarantees

Our analysis determines the conditions for the convergence of our algorithm's estimates to the true treatment effect functions $\mathcal{T}_{i}^{\star}(\mathbf{X})$ for each $i\in[q]$. Our proof incorporates ideas from [17, 33].

The final treatment effect estimate given by PaCE has two sources of error. First, there is the approximation error, which stems from the approximation of the non-parametric functions $\mathcal{T}^{\star}_{i}$ by piece-wise constant functions. The second source of error, the estimation error, arises from the estimation of the average treatment effect of each cluster. Note that the deviation between our estimate and the true treatment effect is, at most, the sum of these two errors. We analyze and bound these two sources of error separately to demonstrate the convergence of our method.

3.1 Approximation error

We start by bounding the bias introduced by approximating $\mathcal{T}^{\star}_{i}$ by a piece-wise constant function. To guarantee consistency, we establish constraints on valid splits. We adopt the $\alpha$-regularity condition from Definition 4b of [33]. This condition requires that each split retains at least a fraction $\alpha\in(0,\frac{1}{2})$ of the available training examples and at least one treated observation on each side of the split. We further require that the depth of each leaf be on the order of $\log\ell$ and that the trees are fair-split trees. That is, during the tree construction procedure in Algorithm 1, for any given node, if a covariate $j$ has not been used in the splits of its last $\pi p$ parent nodes, for some $\pi>1$, the next split must utilize covariate $j$. If multiple covariates meet this criterion, the choice between them can be made arbitrarily.

Additionally, we assume that the proportion of treated observations with covariates within a given hyper-rectangle is approximately proportional to the volume of the hyper-rectangle. This is a 'coverage condition' that allows us to accurately estimate the heterogeneous treatment effect on the whole domain using the available observations. For the purposes of this analysis, we assume that all covariates are bounded and normalized to $[0,1]^{p}$.

Assumption 3.1.

Suppose that all covariates belong to $[0,1]^{p}$. Let $x^{(1)}\leq x^{(2)}\in[0,1]^{p}$ be the lower and upper corners of any hyper-rectangle. We assume that the proportion of observations with covariates inside this hyper-rectangle is proportional to the volume of the hyper-rectangle $V:=\prod_{p^{\prime}\in[p]}\left(x_{p^{\prime}}^{(2)}-x_{p^{\prime}}^{(1)}\right)$, with a margin of error $M:=\sqrt{\frac{\ln(nT)(p+1)}{\min(n,T)}}$. More precisely,

\underaccent{\bar}{c}V-c_{m}M\leq\frac{\#\left\{(z,t):x^{(1)}\leq X^{zt}\leq x^{(2)}\right\}}{nT}\leq\bar{c}V+c_{m}M,

for some fixed constants $\bar{c}\geq\underaccent{\bar}{c}>0$ and some $c_{m}>0$.

Assumption 3.1 is a broader condition than the requirement that the covariates be i.i.d. across observations, as assumed in [33]. Our assumption only requires that the covariates maintain an even density. Indeed, Lemma A.2 in Appendix A illustrates that while i.i.d. covariates satisfy the assumption with high probability, the assumption additionally accommodates (with high probability) scenarios where covariates are either constant over time and only vary across different units, or constant across different units and only vary over time.
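
As an illustration, one can probe whether a given covariate tensor roughly satisfies this condition by comparing empirical frequencies with the volumes of random hyper-rectangles, as in the following sketch; the sampling scheme and constants are illustrative choices.

```python
import numpy as np

def worst_coverage_gap(X, n_rect=1000, seed=0):
    """X: covariate tensor of shape (p, n, T), with covariates scaled to [0, 1]."""
    rng = np.random.default_rng(seed)
    p, n, T = X.shape
    margin = np.sqrt(np.log(n * T) * (p + 1) / min(n, T))    # the margin M above
    worst = 0.0
    for _ in range(n_rect):
        a, b = np.sort(rng.uniform(size=(2, p)), axis=0)      # corners x(1) <= x(2)
        inside = np.all((X >= a[:, None, None]) & (X <= b[:, None, None]), axis=0)
        gap = abs(inside.mean() - np.prod(b - a))             # |empirical fraction - volume|
        worst = max(worst, gap / margin)
    return worst  # stays bounded over rectangles when the assumption holds with constants near 1
```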

In the following result, we demonstrate that, as the number of observations grows, a regression tree, subject to some regularity constraints, contains leaves that become increasingly homogeneous in terms of treatment effect.

Theorem 3.2.

Let $\mathbb{T}$ be an $\alpha$-regular, fair-split tree, split into $\ell$ leaves, each of which has a depth of at least $c\log\ell$ for some constant $c$. Suppose that $\ell\leq\left(\frac{\alpha}{2c_{m}M}\right)^{\frac{1}{c\log 1/\alpha}}$ and that Assumption 3.1 is satisfied. Then, for each treatment $i\in[q]$ and every leaf of $\mathbb{T}$, the maximum difference in treatment effects between any two observations within the cluster $C$ of observations in that leaf is upper bounded as follows:

\max_{X_{1},X_{2}\in C}\left\lvert\mathcal{T}_{i}^{\star}(X_{1})-\mathcal{T}_{i}^{\star}(X_{2})\right\rvert\leq\frac{2L_{i}\sqrt{p}}{\ell^{s}},

where $s:=\frac{c}{(\pi+1)p}\frac{\alpha\underaccent{\bar}{c}}{4\bar{c}}$ and $L_{i}$ is the Lipschitz constant of the treatment effect function $\mathcal{T}_{i}^{\star}$.

Using the above theorem, we can make the following observation. For any $i\in[q]$, denote by $\tilde{\tau}^{\star,i}$ the regression tree function that is piece-wise constant on each cluster $C_{j}^{i}$ for $j\in[\ell]$, with the value $\tilde{\tau}^{\star}_{(i-1)\ell+j}$ assigned to cluster $j$. Then, for all covariate vectors $X$, we have $\lvert\mathcal{T}_{i}^{\star}(X)-\tilde{\tau}^{\star,i}(X)\rvert\leq\frac{2L_{i}\sqrt{p}}{\ell^{s}}$ because the value assigned to each cluster by $\tilde{\tau}^{\star,i}$ is an average of treatment effects within that cluster. This demonstrates that it is possible to approximate the true treatment effect functions with a regression tree function to arbitrary precision. As the number of leaves $\ell$ in the tree $\mathbb{T}$ increases, the approximation error of the tree decreases polynomially. The theorem requires an upper bound on $\ell$, which grows with $\min(n,T)$ as the dataset expands, due to the definition of the margin of error $M$. For PaCE, we constrain $\ell$ to be on the order of $\log n$ (we require $k=O(\log n)$ in Theorem 3.5), thereby complying with the upper bound requirement.

3.2 Estimation error

We now bound the estimation error, showing that it converges as $\max(n,T)$ increases. For simplicity, we will assume that $n\gtrsim T$ so that we can write all necessary assumptions and statements in terms of $n$ directly. We stress, however, that even without $n\gtrsim T$, the statement of Theorem 3.5 still holds after replacing all occurrences of $n$ in Assumptions 3.3 and 3.4 and Theorem 3.5 with $\max(n,T)$. Proving this only requires changing $n$ to $\max(n,T)$ in the proofs accordingly.

Our main technical contribution is extending the convergence guarantee from [17] to a setting with multiple treatment matrices and a convex formulation that centers the rows of $\hat{M}$ without penalization. We require a set of conditions that extend the assumptions made in [17] to multiple treatment matrices.

Throughout the paper, we define the subspace $\mathbf{Z}:=\mathrm{span}\{Z_{i},i\in[k]\}+\mathrm{span}\{\alpha\mathbf{1}^{\top}\mid\alpha\in\mathbb{R}^{n}\}$. We next define $D^{\star}\in\mathbb{R}^{k\times k}$ and $\Delta^{\star 1}\in\mathbb{R}^{k}$ similarly to $D$ and $\Delta^{1}$ in Lemma 2.1, except that we replace all quantities associated with $\hat{M}$ with the corresponding quantities for $M^{\star}$. That is, $D^{\star}_{ij}=\langle P_{\mathbf{T}^{\star\perp}}(Z_{i}),P_{\mathbf{T}^{\star\perp}}(Z_{j})\rangle$ and $\Delta^{\star 1}_{i}=\langle Z_{i},U^{\star}V^{\star\top}\rangle$ for $i,j\in[k]$. Now, we are ready to introduce the assumptions that we need to bound the estimation error.

Assumption 3.3 (Identification).

There exist positive constants $c_{r_{1}},c_{r_{2}},c_{s}$ such that

(a) $\left\lVert ZV^{\star}\right\rVert_{\mathrm{F}}^{2}+\left\lVert Z^{\top}U^{\star}\right\rVert_{\mathrm{F}}^{2}\leq\left(1-c_{r_{1}}/\log(n)\right)\left\lVert Z\right\rVert_{\mathrm{F}}^{2}$ for all $Z\in\mathbf{Z}$,

(b) $\left\lVert D^{\star-1}\right\rVert\left\lVert\Delta^{\star 1}\right\rVert\sum_{i=1}^{k}\left\lVert P_{\mathbf{T}^{\star\perp}}(Z_{i})\right\rVert\leq 1-c_{r_{2}}/\log n$,

(c) $\sigma_{\min}(D^{\star})\geq c_{s}/\log n$.

Items 3.3(a) and 3.3(b) are direct generalizations of Assumptions 3(a) and 3(b) from [17] to multiple treatment matrices. Proposition 2 of [17] proves that their Assumptions 3(a) and 3(b) are nearly necessary for identifying $\tau^{\star}$. This is because it is impossible to recover $\tau^{\star}$ if $M^{\star}+\gamma Z$ were also a matrix of rank $r$ for some $\gamma\neq 0$. More generally, if $Z$ lies close to the tangent space of $M^{\star}$ in the space of rank-$r$ matrices, recovering $M^{\star}$ is significantly harder (the signal becomes second order and may be lost in the noise). In our setting with multiple treatment matrices, we further require that there is no collinearity among the treatment matrices; otherwise, we would not be able to distinguish the treatment effect associated with each treatment matrix. Mathematically, this is captured by Item 3.3(c).

Assumption 3.4 (Noise assumptions).

We make the following assumptions about $E$ and $\delta$.

(a) $\left\lVert E\right\rVert,\left\lVert\delta\right\rVert=O(\sigma\sqrt{n})$,

(b) $\left\lvert\left\langle Z_{i},E\right\rangle\right\rvert=O(\sigma\sqrt{n})$ for all $i\in[k]$,

(c) $\left\lvert\left\langle Z_{i},\delta\right\rangle\right\rvert=O(\sigma\sqrt{n})$ for all $i\in[k]$.

The assumptions involving $E$ are mild: they would be satisfied with probability at least $1-e^{-n}$ if $E$ has independent, zero-mean, sub-Gaussian entries with sub-Gaussian norm bounded by $\sigma$ (see Lemma D.10), an assumption that is canonical in the matrix completion literature [17]. Both Item 3.4(a) and Item 3.4(b) were also used in [17]. Because $\delta$ is a zero-mean matrix that is zero outside the support of $\sum_{i\in[k]}Z_{i}$, the assumption $\lVert\delta\rVert=O(\sigma\sqrt{n})$ is mild. In fact, it is satisfied with high probability if the entries of $\delta$ are sub-Gaussian with certain correlation patterns [26]. Furthermore, we note that if there is only one treatment matrix ($k=1$) or if all treatment matrices are disjoint, then Item 3.4(c) is guaranteed to be satisfied. This is because, for each $i\in[k]$, $\delta_{i}$ is a zero-mean matrix that is zero outside of the support of $Z_{i}$. Consequently, in these scenarios, we simply have $\lvert\langle Z_{i},\delta\rangle\rvert=\lvert\langle Z_{i},\delta_{i}\rangle\rvert=0$.

We denote by $\sigma_{\max}$ and $\sigma_{\min}$ the largest and smallest singular values of $P_{\mathbf{1}^{\perp}}(M^{\star})$, respectively, and we let $\kappa:=\sigma_{\max}/\sigma_{\min}$ be the condition number of $P_{\mathbf{1}^{\perp}}(M^{\star})$.

Theorem 3.5.

Suppose that Assumptions 3.3 and 3.4 are satisfied, $k=O(\log n)$, and $\frac{\sigma\sqrt{n}}{\sigma_{\min}}\lesssim\frac{1}{\kappa^{2}r^{2}\log^{12.5}(n)}$. Take $\lambda=\Theta\left(\sigma\sqrt{nr}\log^{4.5}(n)\right)$. Then for sufficiently large $n$, we have

\left\lvert\tau^{d}_{i}-\tilde{\tau}^{\star}_{i}\right\rvert\lesssim\frac{\sigma^{2}r^{2}\kappa n\log^{13.5}(n)}{\sigma_{\min}\|\tilde{Z}_{i}\|_{\mathrm{F}}}+\frac{\log^{1.5}(n)}{\|\tilde{Z}_{i}\|_{\mathrm{F}}}\cdot\max_{j\in[k]}\frac{\left\lvert\left\langle P_{\mathbf{T}^{\star\perp}}(\tilde{Z}_{j}),P_{\mathbf{T}^{\star\perp}}(E+\delta)\right\rangle\right\rvert}{\|\tilde{Z}_{j}\|_{\mathrm{F}}}.

Note that the bound becomes weaker as $\|\tilde{Z}_{i}\|_{\mathrm{F}}$ shrinks. In other words, more treated entries within a cluster result in a more accurate estimate for that cluster, as expected. Our rate in Theorem 3.5 matches, up to $\log n$ and $r$ factors, the optimal error rate obtained in Theorem 1 of [17], which bounds the error in the setting with one treatment matrix. Our bound accumulates extra factors of $\log n$ because we allow $k=O(\log n)$ treatment matrices, and we have an extra factor of $r$ because we assumed $\lVert\delta\rVert\lesssim\sigma\sqrt{n}$ instead of $\lVert\delta\rVert\lesssim\sigma\sqrt{n/r}$. Proposition 1 from [17] proves that their convergence rate is optimal (up to $\log n$ factors) in the "typical scenario" where $\sigma,\kappa,r=O(1)$ and $\sigma_{\min}=\Theta(n)$. We can conclude the same for our rate in Theorem 3.5.

The convergence rate of PaCE is primarily determined by the slower asymptotic convergence rate of the approximation error. However, our result in Theorem 3.5 is of independent interest, as it extends previous results on low-rank matrix recovery with a deterministic pattern of treated entries to allow for multiple treatment matrices.

4 Empirical evaluation

To assess the accuracy of PaCE, we demonstrate its performance on semi-synthetic data. Using publicly available panel data as the baseline $M^{\star}+E$, we introduced an artificial treatment and added a synthetic treatment effect to treated entries to generate the outcome matrix $O$. Since the true heterogeneous treatment effects $\mathcal{T}^{\star}(\mathbf{X})$ are known, we can verify the accuracy of various methods. Our results show that PaCE often surpasses alternative methods in accuracy, while using a tree with no more than 40 leaves to cluster observations. Thus, PaCE not only offers superior accuracy, but also provides a simple, interpretable solution.

Data

To demonstrate the effectiveness of our methodology, we use two publicly available U.S. economic datasets. The first source comprises monthly Supplemental Nutrition Assistance Program (SNAP) user counts by zip code in Massachusetts, spanning January 2017 to March 2023, obtained from the Department of Transitional Assistance (DTA) public records [25]. SNAP is a federal aid program that provides food-purchasing assistance to low-income families [30]. The second data source comprises nine annual demographic and economic data fields for Massachusetts zip codes, U.S. counties, and U.S. states from 2005 to 2022, provided by the U.S. Census Bureau [31].

We use this data to conduct experiments on panels of varying sizes. In the first set of experiments, we use the SNAP user count as the baseline, $M^{\star}+E$, with the zip code-level census data for Massachusetts as covariates. The panel size is 517 zip codes over 70 months. The second set of experiments uses the population below the poverty line (from the U.S. census data) in each state and U.S. territory as the baseline and the remaining census data fields as covariates, with a panel size of 52 states and territories over 18 years. The third set of experiments is similar to the second, but the unit is changed to counties, resulting in a panel size of 770 counties over 18 years. The time period is provided as an additional covariate for the methods that we benchmark against.

Treatment generation

To generate the treatment pattern, we varied two parameters: 1) the proportion $\alpha$ of units receiving the treatment, $\alpha\in\{0.05,0.25,0.5,0.75,1.0\}$, and 2) the functional form of the treatment, either adaptive or non-adaptive. In the non-adaptive case, a random $\alpha$ proportion of units receive the treatment for a random consecutive period of time. In the adaptive case, for each time period, we apply the treatment to the $\alpha/2$ proportion of units that show the largest absolute percentage change in outcome in the previous time period. The purpose of the adaptive treatment is to mimic public policies that target either low-performing or high-performing units.
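
The two pattern generators can be sketched as follows; window lengths, tie-breaking, and the exact timing convention for the adaptive rule are illustrative choices rather than the paper's code.

```python
import numpy as np

def non_adaptive_pattern(n, T, alpha, rng):
    """Treat a random alpha fraction of units for a random consecutive window."""
    W = np.zeros((n, T), dtype=int)
    for z in rng.choice(n, size=int(alpha * n), replace=False):
        start = rng.integers(0, T)
        end = rng.integers(start + 1, T + 1)
        W[z, start:end] = 1
    return W

def adaptive_pattern(baseline, alpha):
    """Treat, in each period, the alpha/2 fraction of units with the largest
    absolute percentage change in outcome observed in the previous period."""
    n, T = baseline.shape
    W = np.zeros((n, T), dtype=int)
    k = max(1, int(round(alpha / 2 * n)))
    for t in range(2, T):
        prev, prev2 = baseline[:, t - 1], baseline[:, t - 2]
        pct_change = np.abs(prev - prev2) / np.maximum(np.abs(prev2), 1e-8)
        W[np.argsort(-pct_change)[:k], t] = 1
    return W
```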

The treatment effect is generated by randomly sampling two covariates and either adding or multiplying them, then normalizing the magnitude of the effects so that the average treatment effect is 20% of the average outcome. For the SNAP experiments, the treatment effect is added to the outcome to simulate an intervention increasing SNAP usage. For the state and county experiments, the treatment effect is subtracted from the outcome to represent an intervention that reduces poverty.
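
A corresponding sketch of the effect generator, with illustrative names and a simple rescaling to the stated 20% target, is:

```python
import numpy as np

def synthetic_effect(X, baseline, W, rng, multiplicative=False):
    """X: (p, n, T) covariates; baseline: (n, T) outcomes; W: (n, T) binary pattern."""
    i, j = rng.choice(X.shape[0], size=2, replace=False)   # two randomly sampled covariates
    raw = X[i] * X[j] if multiplicative else X[i] + X[j]
    target = 0.2 * np.abs(baseline).mean()                 # 20% of the average outcome
    scale = target / max(np.abs(raw[W == 1]).mean(), 1e-12)
    return raw * scale                                     # heterogeneous effect matrix
```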

Each set of parameters is tested on 200 instances, with the treatment pattern and treatment effect being randomly regenerated for each instance.

Benchmark methods

We benchmark PaCE against the following methods: double machine learning (DML) [14, 15], doubly robust (DR) learner [18], XLearner [23], and multi-arm causal forest [6]. (Additional methods are shown in Appendix E; in the main text, we include the five best methods.) We implemented PaCE to split the observations into at most 40 clusters. The hyper-parameter $\lambda$ used in the convex programs (1) and (2) was tuned so that $\hat{M}$ had a rank of $r=6$, which was chosen arbitrarily and fixed across all experiments. During the splitting procedure in Algorithm 1, we utilize binary search on each covariate to find the best split. See Appendix E for implementation details.

Table 1: Proportion of instances in which each method results in the lowest nMAE.

                          Adaptive        Proportion treated                     Effect
                 All      No     Yes      0.05   0.25   0.5    0.75   1.0        Add.   Mult.
SNAP
  PaCE           0.59     0.58   0.60     0.26   0.54   0.76   0.75   0.65       0.64   0.54
  Causal Forest  0.16     0.17   0.16     0.14   0.17   0.16   0.14   0.21       0.02   0.30
  DML            0.11     0.21   0.00     0.12   0.13   0.08   0.09   0.11       0.17   0.04
  LinearDML      0.14     0.04   0.24     0.48   0.15   0.01   0.02   0.04       0.17   0.11
  XLearner       0.00     0.00   0.00     0.00   0.00   0.00   0.00   0.00       0.00   0.00
State
  PaCE           0.41     0.39   0.42     0.23   0.32   0.39   0.54   0.56       0.46   0.36
  Causal Forest  0.05     0.09   0.01     0.06   0.04   0.04   0.05   0.04       0.04   0.05
  DML            0.27     0.29   0.26     0.41   0.36   0.27   0.17   0.16       0.31   0.24
  LinearDML      0.02     0.01   0.04     0.02   0.06   0.01   0.01   0.02       0.02   0.02
  XLearner       0.17     0.13   0.20     0.25   0.14   0.19   0.15   0.11       0.10   0.23
County
  PaCE           0.06     0.10   0.02     0.16   0.08   0.04   0.02   0.01       0.07   0.06
  Causal Forest  0.29     0.38   0.20     0.25   0.28   0.31   0.31   0.30       0.23   0.34
  DML            0.35     0.35   0.35     0.34   0.37   0.34   0.39   0.31       0.47   0.24
  LinearDML      0.20     0.04   0.35     0.22   0.22   0.21   0.16   0.17       0.19   0.20
  XLearner       0.05     0.07   0.04     0.03   0.03   0.06   0.07   0.07       0.01   0.10
Evaluation

We represent the estimated treatment effects by $\hat{\mathcal{T}}(\mathbf{X})$, a matrix whose element in the $z$-th row and $t$-th column is the estimate $\hat{\mathcal{T}}(X^{zt})$. Accuracy of the estimates is measured using the normalized mean absolute error (nMAE), calculated relative to the true treatment effect $\mathcal{T}^{\star}(\mathbf{X})$:

\text{nMAE}\left(\hat{\mathcal{T}}(\mathbf{X}),\mathcal{T}^{\star}(\mathbf{X})\right)=\|\mathcal{T}^{\star}(\mathbf{X})-\hat{\mathcal{T}}(\mathbf{X})\|_{\text{sum}}/\|\mathcal{T}^{\star}(\mathbf{X})\|_{\text{sum}}.
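
A direct numpy transcription of this metric (for illustration) is:

```python
import numpy as np

def nmae(T_hat, T_star):
    """Normalized mean absolute error between estimated and true effect matrices."""
    return np.abs(T_star - T_hat).sum() / np.abs(T_star).sum()
```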

In Table 1, we present the proportion of instances where each method results in the lowest nMAE. We observe that PaCE performs best in the SNAP and state experiments, achieving the highest proportion of instances with the lowest nMAE. However, it often performs worse than DML and Causal Forest in the county experiments. We hypothesize that this is because the county census data, having many units relative to the number of time periods, ‘most resembles i.i.d. data’ and has less underlying panel structure. However, the average nMAE achieved by PaCE on this data set is competitive with that of DML and Causal Forest, as shown in Table 3. Given that Causal Forest and DML use hundreds of trees, PaCE, which uses only one regression tree, is still a great option if interpretability is desired.

In Figure 1, we show the average nMAE as we vary the proportion of units treated. We only show the results for the SNAP experiments with a non-adaptive treatment pattern, and we refer to Table 3 and Table 4 in Appendix E for the results of additional experiments and for the standard deviations. We observe that while the performance of all methods improves as the proportion of treated entries increases, PaCE consistently achieves the best accuracy on this set of experiments.

Figure 1: nMAE for SNAP experiments as the proportion of treated units increases.

5 Discussion and Future Work

This work introduces PaCE, a method designed for estimating heterogeneous treatment effects within panel data. Our technical contributions extend previous results on low-rank matrix recovery with a deterministic pattern of treated entries to allow for multiple treatments. We validate the ability of PaCE to obtain strong accuracy with a simple estimator on semi-synthetic datasets.

One interesting direction for future research is to investigate whether the estimates produced by PaCE are asymptotically normal around the true estimates, which would allow for the derivation of confidence intervals. Further computational testing could also be beneficial. Specifically, assessing the estimator’s performance across a broader array of treatment patterns and datasets would provide deeper insights into its adaptability and limitations. Expanding these tests to include multiple treatments will help in understanding the estimator’s effectiveness in more complex scenarios.

Acknowledgments and Disclosure of Funding

The authors are grateful to Tianyi Peng for valuable and insightful discussions. We acknowledge the MIT SuperCloud [28] and Lincoln Laboratory Supercomputing Center for providing HPC resources that have contributed to the research results reported within this paper. This work was supported by the National Science Foundation Graduate Research Fellowship under Grant No. 2141064. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

References

  • Abadie et al., [2010] Abadie, A., Diamond, A., and Hainmueller, J. (2010). Synthetic control methods for comparative case studies: Estimating the effect of California's tobacco control program. Journal of the American Statistical Association, 105(490):493–505.
  • Abadie and Gardeazabal, [2003] Abadie, A. and Gardeazabal, J. (2003). The economic costs of conflict: A case study of the Basque Country. American Economic Review, 93(1):113–132.
  • Abbe et al., [2020] Abbe, E., Fan, J., Wang, K., and Zhong, Y. (2020). Entrywise eigenvector analysis of random matrices with low expected rank. Annals of statistics, 48(3):1452.
  • Athey et al., [2021] Athey, S., Bayati, M., Doudchenko, N., Imbens, G., and Khosravi, K. (2021). Matrix completion methods for causal panel data models. Journal of the American Statistical Association, 116(536):1716–1730.
  • Athey and Imbens, [2016] Athey, S. and Imbens, G. (2016). Recursive partitioning for heterogeneous causal effects. Proceedings of the National Academy of Sciences, 113(27):7353–7360.
  • Athey et al., [2019] Athey, S., Tibshirani, J., and Wager, S. (2019). Generalized random forests. The Annals of Statistics, 47(3):1148–1178.
  • Bai and Ng, [2021] Bai, J. and Ng, S. (2021). Matrix completion, counterfactuals, and factor analysis of missing data. Journal of the American Statistical Association, 116(536):1746–1763.
  • Breiman, [2001] Breiman, L. (2001). Random forests. Machine learning, 45:5–32.
  • Cai et al., [2010] Cai, J.-F., Candès, E. J., and Shen, Z. (2010). A singular value thresholding algorithm for matrix completion. SIAM Journal on optimization, 20(4):1956–1982.
  • Chatterjee, [2020] Chatterjee, S. (2020). A deterministic theory of low rank matrix completion. IEEE Transactions on Information Theory, 66(12):8046–8055.
  • Chen et al., [2020] Chen, Y., Chi, Y., Fan, J., Ma, C., and Yan, Y. (2020). Noisy matrix completion: Understanding statistical guarantees for convex relaxation via nonconvex optimization. SIAM journal on optimization, 30(4):3098–3121.
  • Chen et al., [2019] Chen, Y., Fan, J., Ma, C., and Yan, Y. (2019). Inference and uncertainty quantification for noisy matrix completion. Proceedings of the National Academy of Sciences, 116(46):22931–22937.
  • Chen et al., [2021] Chen, Y., Fan, J., Ma, C., and Yan, Y. (2021). Bridging convex and nonconvex optimization in robust pca: Noise, outliers, and missing data. Annals of statistics, 49(5):2948.
  • [14] Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., and Robins, J. (2017a). Double/debiased machine learning for treatment and causal parameters. Technical report.
  • [15] Chernozhukov, V., Goldman, M., Semenova, V., and Taddy, M. (2017b). Orthogonal machine learning for demand estimation: High dimensional causal inference in dynamic panels. arXiv, pages arXiv–1712.
  • Devroye, [1977] Devroye, L. P. (1977). A uniform bound for the deviation of empirical distribution functions. Journal of Multivariate Analysis, 7(4):594–597.
  • Farias et al., [2021] Farias, V., Li, A., and Peng, T. (2021). Learning treatment effects in panels with general intervention patterns. Advances in Neural Information Processing Systems, 34:14001–14013.
  • Foster and Syrgkanis, [2023] Foster, D. J. and Syrgkanis, V. (2023). Orthogonal statistical learning. The Annals of Statistics, 51(3):879–908.
  • Foucart et al., [2020] Foucart, S., Needell, D., Pathak, R., Plan, Y., and Wootters, M. (2020). Weighted matrix completion from non-random, non-uniform sampling patterns. IEEE Transactions on Information Theory, 67(2):1264–1290.
  • Gobillon and Magnac, [2016] Gobillon, L. and Magnac, T. (2016). Regional policy evaluation: Interactive fixed effects and synthetic controls. Review of Economics and Statistics, 98(3):535–551.
  • Imbens and Rubin, [2015] Imbens, G. W. and Rubin, D. B. (2015). Causal inference in statistics, social, and biomedical sciences. Cambridge university press.
  • Klopp et al., [2017] Klopp, O., Lounici, K., and Tsybakov, A. B. (2017). Robust matrix completion. Probability Theory and Related Fields, 169:523–564.
  • Künzel et al., [2019] Künzel, S. R., Sekhon, J. S., Bickel, P. J., and Yu, B. (2019). Metalearners for estimating heterogeneous treatment effects using machine learning. Proceedings of the national academy of sciences, 116(10):4156–4165.
  • Liu et al., [2017] Liu, G., Liu, Q., and Yuan, X. (2017). A new theory for matrix completion. Advances in Neural Information Processing Systems, 30.
  • Massachusetts Government, [2024] Massachusetts Government (2024). Department of transitional assistance caseload by zip code reports. https://www.mass.gov/lists/department-of-transitional-assistance-caseload-by-zip-code-reports. Accessed: 2024-05-09.
  • Moon and Weidner, [2016] Moon, H. and Weidner, M. (2016). Linear regression for panel with unknown number of factors as interactive effects. Econometrica, forthcoming.
  • Nie and Wager, [2021] Nie, X. and Wager, S. (2021). Quasi-oracle estimation of heterogeneous treatment effects. Biometrika, 108(2):299–319.
  • Reuther et al., [2018] Reuther, A., Kepner, J., Byun, C., Samsi, S., Arcand, W., Bestor, D., Bergeron, B., Gadepally, V., Houle, M., Hubbell, M., Jones, M., Klein, A., Milechin, L., Mullen, J., Prout, A., Rosa, A., Yee, C., and Michaleas, P. (2018). Interactive supercomputing on 40,000 cores for machine learning and data analysis. In 2018 IEEE High Performance extreme Computing Conference (HPEC), pages 1–6. IEEE.
  • Su et al., [2009] Su, X., Tsai, C.-L., Wang, H., Nickerson, D. M., and Li, B. (2009). Subgroup analysis via recursive partitioning. Journal of Machine Learning Research, 10(2).
  • United States Department of Agriculture, [2024] United States Department of Agriculture (2024). Supplemental Nutrition Assistance Program (SNAP). https://www.fns.usda.gov/snap/supplemental-nutrition-assistance-program. Accessed: May 15, 2024.
  • U.S. Census Bureau, [2023] U.S. Census Bureau (2023). American Community Survey (ACS) 1-Year Data. https://www.census.gov/data/developers/data-sets/acs-1year.html. Accessed data from multiple years between 2005 and 2022. Accessed: 2024-05-09.
  • Vershynin, [2018] Vershynin, R. (2018). High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge university press.
  • Wager and Athey, [2018] Wager, S. and Athey, S. (2018). Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association, 113(523):1228–1242.
  • Xiong et al., [2023] Xiong, R., Athey, S., Bayati, M., and Imbens, G. (2023). Optimal experimental design for staggered rollouts. Management Science.
  • Xiong and Pelger, [2023] Xiong, R. and Pelger, M. (2023). Large dimensional latent factor modeling with missing observations and applications to causal inference. Journal of Econometrics, 233(1):271–301.
  • Yu et al., [2015] Yu, Y., Wang, T., and Samworth, R. J. (2015). A useful variant of the davis–kahan theorem for statisticians. Biometrika, 102(2):315–323.

Appendix A Proofs for the approximation error

Proof of Theorem 3.2.

We define the diameter of a given set $C$ of points as the maximum distance between two points in the set $C$:

\text{Diam}(C)=\sup_{(z_{1},t_{1}),(z_{2},t_{2})\in C}\|X^{z_{1}t_{1}}-X^{z_{2}t_{2}}\|.

Because the treatment effect function $\mathcal{T}^{\star}_{i}$ is Lipschitz continuous for every $i\in[q]$, with Lipschitz constant $L_{i}$, it suffices to show that, for any leaf $x$ in $\mathbb{T}$, the set $C_{x}$ of points within that leaf has a diameter bounded above as follows:

\text{Diam}(C_{x})\leq\frac{2\sqrt{p}}{\ell^{s}}.

In this proof we assume, without loss of generality, that all leaves are at depth exactly $\lfloor c\log\ell\rfloor$. If a leaf were instead even deeper in the tree, the diameter of the associated cluster would be even smaller. We begin by establishing a lower bound for the number of observations $n_{a}$ in any node $a$ of the tree $\mathbb{T}$. This is done by leveraging the fact that every node is at depth at most $c\log\ell$, the upper bound on the number of leaves $\ell$, and the property of $\alpha$-regularity. Specifically, due to $\alpha$-regularity, we have $n_{a}\geq nT\cdot\alpha^{c\log\ell}$. Hence,

n_{a}\geq nT\alpha^{c\log\ell}=nT\ell^{c\log\alpha}\geq nT\left(\frac{\alpha}{2c_{m}M}\right)^{\frac{-1}{c\log\alpha}\cdot c\log\alpha}=\frac{2nT}{\alpha}c_{m}M.

Rearranging, we have the following for any node $a$ of the tree $\mathbb{T}$:

c_{m}M\leq\frac{\alpha n_{a}}{2nT}. (4)

Next, for every covariate $j\in[p]$ and any leaf $x$, we aim to establish an upper bound for the length of $C_{x}$ along the $j$-th coordinate, denoted $\text{Diam}_{j}(C_{x})$. This is defined as follows:

\text{Diam}_{j}(C_{x}):=\sup_{(z_{1},t_{1}),(z_{2},t_{2})\in C_{x}}\left\lvert X^{z_{1}t_{1}}_{j}-X^{z_{2}t_{2}}_{j}\right\rvert.

Utilizing the lower bound in Assumption 3.1, the volume $V_{a}$ of the rectangular hull enclosing the points in a given node $a$ can be upper bounded as follows:

V_{a}\leq\frac{1}{\underaccent{\bar}{c}}\left(\frac{n_{a}}{nT}+c_{m}M\right)\leq\frac{2}{\underaccent{\bar}{c}}\left(\frac{n_{a}}{nT}\right), (5)

where the last inequality made use of (4). Given the $\alpha$-regularity of splits, we know that a child of node $a$ has at most $n_{a}(1-\alpha)$ observations. This implies that the volume $V_{r}$ of the rectangular hull enclosing the points removed by the split on node $a$ satisfies:

V_{r}\geq\frac{1}{\bar{c}}\left(\frac{\alpha n_{a}}{nT}-c_{m}M\right)\geq\frac{1}{2\bar{c}}\left(\frac{\alpha n_{a}}{nT}\right). (6)

Putting (5) and (6) together, we have $V_{r}\geq\left(\frac{\alpha\underaccent{\bar}{c}}{4\bar{c}}\right)V_{a}$, which means that a split along a given covariate removes at least an $\frac{\alpha\underaccent{\bar}{c}}{4\bar{c}}$ fraction of the length along that coordinate.

Because the tree is a fair-split tree, for every $(\pi+1)p$ splits, at least one split is made on every covariate. Hence, the number of splits leading to node $x$ along each coordinate is at least $\left\lfloor\frac{\lfloor c\log\ell\rfloor}{(\pi+1)p}\right\rfloor\geq\frac{c\log\ell}{(\pi+1)p}-2$.

As a result, for every $j\in[p]$, we have $\text{Diam}_{j}(C_{x})\leq\left(1-\frac{\alpha\underaccent{\bar}{c}}{4\bar{c}}\right)^{\frac{c\log\ell}{(\pi+1)p}-2}$, which implies

\text{Diam}(C_{x})\leq\sqrt{p}\left(1-\frac{\alpha\underaccent{\bar}{c}}{4\bar{c}}\right)^{\frac{c}{(\pi+1)p}\log\ell-2}\leq\sqrt{p}\left(1-\frac{\alpha\underaccent{\bar}{c}}{4\bar{c}}\right)^{-2}\ell^{\frac{c}{(\pi+1)p}\log\left(1-\frac{\alpha\underaccent{\bar}{c}}{4\bar{c}}\right)}\leq\sqrt{p}\left(\frac{3}{4}\right)^{-2}\ell^{-\frac{c}{(\pi+1)p}\frac{\alpha\underaccent{\bar}{c}}{4\bar{c}}},

which proves the desired bound. ∎

The following lemma is a generalization of the multivariate case of the Dvoretzky–Kiefer–Wolfowitz inequality. Specifically, it provides a bound on the maximum deviation of the empirical fraction of observations within a hyper-rectangle from its expected value. It will be used in the proof of Lemma A.2.

Lemma A.1.

Let $X$ be a $d$-dimensional random vector with pdf $f_{X}$. Let $X_{1},X_{2},\dots,X_{n}$ be an i.i.d. sequence of $d$-dimensional vectors drawn from this distribution. For any $x_{1},x_{2}\in\mathbb{R}^{d}$, define

G(x_{1},x_{2}):=\Pr[x_{1}\leq X\leq x_{2}]

and

G_{n}(x_{1},x_{2}):=\frac{1}{n}\sum_{i=1}^{n}I_{\{X_{i}\in[x_{1},x_{2}]\}}.

Then, for any $\epsilon>0$, we have both of the following bounds:

\Pr\left[\sup_{x_{1},x_{2}\in\mathbb{R}^{d}}\left\lvert G_{n}(x_{1},x_{2})-G(x_{1},x_{2})\right\rvert>\epsilon\right]\leq 2n^{2d}\exp\left(-2n\epsilon^{2}+16\epsilon d\right),
\Pr\left[\sup_{x_{1},x_{2}\in\mathbb{R}^{d}}\left\lvert G_{n}(x_{1},x_{2})-G(x_{1},x_{2})\right\rvert>\frac{4d}{n}+\epsilon\right]\leq 2n^{2d}\exp\left(-2n\epsilon^{2}\right).
Proof.

We use ideas from the proof of the main theorem in [16] and their notation. Because the i.i.d. samples X1,,XnX_{1},\ldots,X_{n} are drawn from a distribution with a density function, on an almost sure event ={Xi1jXi2j,i1i2[n],j[d]}\mathcal{E}=\{X_{i_{1}j}\neq X_{i_{2}j},i_{1}\neq i_{2}\in[n],j\in[d]\}, they do not share any component in common. For i[n]i\in[n], let Xi=(Xi1,,Xid)X_{i}=\left(X_{i1},\dots,X_{id}\right), and let 𝒴\mathcal{Y} denote the set of random vectors Y1,,YnddY_{1},\dots,Y_{n^{d}}\in\mathbb{R}^{d} that are obtained by considering all ndn^{d} vectors of the form (Xi11,,Xidd)\left(X_{i_{1}1},\dots,X_{i_{d}d}\right), where (i1,,id){1,,n}d\left(i_{1},\dots,i_{d}\right)\in\left\{1,\dots,n\right\}^{d}.

Because GnG_{n} is a staircase function with flat levels everywhere except at points in 𝒴\mathcal{Y}, and GG is monotonic, |Gn(x1,x2)G(x1,x2)|\left\lvert G_{n}(x_{1},x_{2})-G(x_{1},x_{2})\right\rvert is maximized when x1x_{1} and x2x_{2} approach vectors that are in 𝒴\mathcal{Y}. Under the event \mathcal{E}, we have

supx1,x2d|Gn(x1,x2)G(x1,x2)|supi,j|Gn(Yi,Yj)G(Yi,Yj)|+2dn,\displaystyle\sup_{x_{1},x_{2}\in\mathbb{R}^{d}}\left\lvert G_{n}(x_{1},x_{2})-G(x_{1},x_{2})\right\rvert\leq\sup_{i,j}\left\lvert G_{n}(Y_{i},Y_{j})-G(Y_{i},Y_{j})\right\rvert+\frac{2d}{n},

where the 2dn\frac{2d}{n} is due to the fact that, for any i,ji,j there may be up to 2d2d different points in X1,,XnX_{1},\dots,X_{n} that share a component with YiY_{i} or YjY_{j} under the event \mathcal{E}. These 2d2d points lie on the perimeter of the hyper-rectangle between YiY_{i} and YjY_{j} and could be included or excluded from the count for GnG_{n} depending on whether x1x_{1} approaches YiY_{i} (and x2x_{2} approaches YjY_{j}) from the inside or outside of the hyper-rectangle.

Now, it remains to upper bound the right hand side of the above inequality. For any pair of indices i,jndi,j\leq n^{d}, there exists a subset of indices [n]\mathcal{I}\subseteq[n], corresponding to samples of XX that have at least one component in common with either YiY_{i} or YjY_{j}. Hence, the samples indexed by \mathcal{I} are not independent from both YiY_{i} and YjY_{j}. In order to apply Hoeffding’s inequality to the samples, we sample d:=||d^{\prime}:=\left\lvert\mathcal{I}\right\rvert additional i.i.d. random vectors Xn+1,,Xn+dX_{n+1},\dots,X_{n+d^{\prime}} from the same distribution. These serve to ‘substitute’ those samples XlX_{l} for which ll\in\mathcal{I}. Under \mathcal{E}, the number of indices in \mathcal{I} satisfies d2dd^{\prime}\leq 2d, and we have

supi,j|Gn(Yi,Yj)G(Yi,Yj)|+2dn\displaystyle\sup_{i,j}\left\lvert G_{n}(Y_{i},Y_{j})-G(Y_{i},Y_{j})\right\rvert+\frac{2d}{n}
=supi,j|1nk=1nI{Xk[Yi,Yj]}G(Yi,Yj)|+2dn\displaystyle=\sup_{i,j}\left\lvert\frac{1}{n}\sum_{k=1}^{n}I_{\left\{X_{k}\in[Y_{i},Y_{j}]\right\}}-G(Y_{i},Y_{j})\right\rvert+\frac{2d}{n}
supi,j|1nk[n+d]I{Xk[Yi,Yj]}G(Yi,Yj)|+1n|kIXk[Yi,Yj]k=n+1n+dIXk[Yi,Yj]|+2dn\displaystyle\leq\sup_{i,j}\left\lvert\frac{1}{n}\sum_{k\in[n+d^{\prime}]\setminus\mathcal{I}}I_{\left\{X_{k}\in[Y_{i},Y_{j}]\right\}}-G(Y_{i},Y_{j})\right\rvert+\frac{1}{n}\left\lvert\sum_{k\in\mathcal{I}}I_{X_{k}\in[Y_{i},Y_{j}]}-\sum_{k=n+1}^{n+d^{\prime}}I_{X_{k}\in[Y_{i},Y_{j}]}\right\rvert+\frac{2d}{n}
supi,j|1nk[n+d]I{Xk[Yi,Yj]}G(Yi,Yj)|+4dn.\displaystyle\leq\sup_{i,j}\left\lvert\frac{1}{n}\sum_{k\in[n+d^{\prime}]\setminus\mathcal{I}}I_{\left\{X_{k}\in[Y_{i},Y_{j}]\right\}}-G(Y_{i},Y_{j})\right\rvert+\frac{4d}{n}.

The first inequality is due to the triangle inequality, and the second inequality follows because ||=d2d\left\lvert\mathcal{I}\right\rvert=d^{\prime}\leq 2d.

Because X_{k} is independent of both Y_{i} and Y_{j} for every k\in[n+d^{\prime}]\setminus\mathcal{I}, we can apply Hoeffding’s inequality, conditioned on the realizations of Y_{i} and Y_{j}, to establish the following upper bound:

Pr[|1nk[n+d]I{Xk[Yi,Yj]}G(Yi,Yj)|+4dnϵ]\displaystyle\Pr\left[\left\lvert\frac{1}{n}\sum_{k\in[n+d^{\prime}]\setminus\mathcal{I}}I_{\left\{X_{k}\in[Y_{i},Y_{j}]\right\}}-G(Y_{i},Y_{j})\right\rvert+\frac{4d}{n}\geq\epsilon\right] 2exp(2n(ϵ4dn)2).\displaystyle\leq 2\exp\left(-2n\left(\epsilon-\frac{4d}{n}\right)^{2}\right).

Applying a union bound over all pairs i,j, we obtain

Pr[supx1,x2d|Gn(x1,x2)G(x1,x2)|ϵ]\displaystyle\Pr\left[\sup_{x_{1},x_{2}\in\mathbb{R}^{d}}\left\lvert G_{n}(x_{1},x_{2})-G(x_{1},x_{2})\right\rvert\geq\epsilon\right] 2n2dexp(2n(ϵ4dn)2)\displaystyle\leq 2n^{2d}\exp\left(-2n\left(\epsilon-\frac{4d}{n}\right)^{2}\right)
2n2dexp(2nϵ2+16ϵd).\displaystyle\leq 2n^{2d}\exp\left(-2n\epsilon^{2}+16\epsilon d\right).

This proves our first bound. To show the second bound, we note that the first line of the above inequality can be rewritten as

\Pr\left[\sup_{x_{1},x_{2}\in\mathbb{R}^{d}}\left\lvert G_{n}(x_{1},x_{2})-G(x_{1},x_{2})\right\rvert\geq\frac{4d}{n}+\epsilon\right]\leq 2n^{2d}e^{-2n\epsilon^{2}}. ∎
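The following short simulation (our own illustration, not part of the proof) checks the flavor of Lemma A.1 for i.i.d. Uniform[0,1]^d samples: the empirical fraction of points inside hyper-rectangles stays within roughly 4d/n+\epsilon of the true probability mass. The sample size, the choice of \epsilon, and the use of randomly drawn rectangles (instead of all \sim n^{2d} pairs) are assumptions of the sketch.

```python
import numpy as np

# Illustration only (not part of the proof): check the flavor of Lemma A.1 for
# i.i.d. Uniform[0,1]^d samples by comparing the empirical fraction of points
# inside randomly drawn hyper-rectangles with the true probability mass.
rng = np.random.default_rng(0)
n, d = 2000, 2
X = rng.random((n, d))

eps = np.sqrt((d + 1) * np.log(n) / n)        # one convenient choice of eps
worst = 0.0
for _ in range(5000):                         # random rectangles instead of all ~n^{2d} pairs
    lo, hi = np.sort(rng.random((2, d)), axis=0)
    G = np.prod(hi - lo)                      # mass of [lo, hi] under Uniform[0,1]^d
    Gn = np.mean(np.all((X >= lo) & (X <= hi), axis=1))
    worst = max(worst, abs(Gn - G))

print(f"largest deviation over sampled rectangles: {worst:.4f}")
print(f"threshold 4d/n + eps:                      {4 * d / n + eps:.4f}")
```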

Lemma A.2.

For large enough n and T, the covariates X^{zt}\in\mathbb{R}^{p} for z\in[n] and t\in[T] satisfy 3.1 with probability at least 1-\frac{6}{(nT)^{2}} if, for some nonnegative integers x, y, and z that add up to p, every X^{zt} is the concatenation of vectors w^{zt}\in\mathbb{R}^{x},u^{zt}\in\mathbb{R}^{y},v^{zt}\in\mathbb{R}^{z}, which are independently generated as follows:

  1. 1.

    Covariates varying by unit and time: Each wztw^{zt} is i.i.d., sampled from a distribution WW on [0,1]x[0,1]^{x} with density uniformly lower bounded by c¯1/3\underaccent{\bar}{c}^{1/3} and upper bounded by c¯1/3\bar{c}^{1/3}.

  2. 2.

    Covariates varying only across units: Draw nn i.i.d. samples from a distribution UU on [0,1]y[0,1]^{y}, bounded in density between c¯1/3\underaccent{\bar}{c}^{1/3} and c¯1/3\bar{c}^{1/3}. These make up uz1u^{z1} for z[n]z\in[n], with arbitrary ordering. For every zz and t[T]t\in[T], we have uzt=uz1u^{zt}=u^{z1}.

  3. 3.

    Covariates varying only over time: Draw TT i.i.d. samples from a distribution VV on [0,1]z[0,1]^{z} with the same density bounds. These form v1tv^{1t} for t[T]t\in[T], with arbitrary ordering. For every tt and z[n]z\in[n], we have vzt=v1tv^{zt}=v^{1t}.

Proof.

The proof of the lemma proceeds as follows. We first formally define an event \mathcal{E}, which intuitively states that the maximum deviation between the fraction of samples falling within a specified interval and its expected value is bounded. Our second step is to show that the event occurs with high probability, and our final step is to show that under the event \mathcal{E}, 3.1 is satisfied.

We begin by defining our notation. For any random vector DD, we define G^D(a,b)\hat{G}^{D}(a,b) as the empirical fraction of observations that fall within a specified interval (a,b)(a,b), with aba\leq b having the same dimension as DD. The expected value of this quantity is denoted GD(a,b):=Pr[aDb]G^{D}(a,b):=\Pr[a\leq D\leq b]. Note that we have

G^U(a,b)=1nzI{uz1[a,b]}G^V(a,b)=1TtI{v1t[a,b]}G^X(a,b)=1nTz,tI{Xzt[a,b]}.\displaystyle\hat{G}^{U}(a,b)=\frac{1}{n}\sum_{z}I_{\left\{u^{z1}\in[a,b]\right\}}\quad\hat{G}^{V}(a,b)=\frac{1}{T}\sum_{t}I_{\left\{v^{1t}\in[a,b]\right\}}\quad\hat{G}^{X}(a,b)=\frac{1}{nT}\sum_{z,t}I_{\left\{X^{zt}\in[a,b]\right\}}.

Defining event \mathcal{E}. Now we will formally define \mathcal{E}, which is the intersection of three events, \mathcal{E}=\mathcal{E_{A}}\cap\mathcal{E_{B}}\cap\mathcal{E_{C}}. The event \mathcal{E_{A}} (resp. \mathcal{E_{B}}) is the event that the maximum deviation between the fraction of samples from the random variable U (resp. V) falling within an interval and its expected value is bounded. To simplify our notation, we let \epsilon:=\sqrt{\frac{(p+1)\ln(nT)}{\min(n,T)}}. Mathematically,

𝒜:={supu(1)u(2)|G^U(u(1),u(2))GU(u(1),u(2))|ϵ+4yn},\displaystyle\mathcal{E_{A}}:=\left\{\sup_{u^{(1)}\leq u^{(2)}}\left\lvert\hat{G}^{U}(u^{(1)},u^{(2)})-G^{U}(u^{(1)},u^{(2)})\right\rvert\leq\epsilon+\frac{4y}{n}\right\},
:={supv(1)v(2)|G^V(v(1),v(2))GV(v(1),v(2))|ϵ+4zT}.\displaystyle\mathcal{E_{B}}:=\left\{\sup_{v^{(1)}\leq v^{(2)}}\left\lvert\hat{G}^{V}(v^{(1)},v^{(2)})-G^{V}(v^{(1)},v^{(2)})\right\rvert\leq\epsilon+\frac{4z}{T}\right\}.

We now turn to defining event \mathcal{E_{C}}, which intuitively bounds the deviation for the W samples. In the event \mathcal{E_{C}}, rather than considering all of the samples of W, we only consider the samples of W whose corresponding U and V samples satisfy given constraints. That is, \mathcal{E_{C}} is the event that, for any constraints on U and V, the deviation between the following two quantities is bounded: 1) among the observations whose U and V samples satisfy the constraints, the proportion whose W sample falls within a specified interval, and 2) the expected fraction of W samples falling within that interval. We define the event \mathcal{E_{C}} formally below.

To define event 𝒞\mathcal{E_{C}}, we begin by defining SUS_{U} (resp. SVS_{V}), which is the set of all possible subsets of units (resp. time periods) selected by some constraint on UU (resp. VV). Given the random variables uz1u^{z1} for z[n]z\in[n], we denote by u~1,,u~ny\tilde{u}_{1},\ldots,\tilde{u}_{n^{y}} all nyn^{y} vectors formed by selecting its ii-th coordinate (for each i[y]i\in[y]) from this set of observed values: {uiz1,z[n]}\{u^{z1}_{i},z\in[n]\}. We similarly define v~1,,v~Tz\tilde{v}_{1},\ldots,\tilde{v}_{T^{z}} using the variables v1tv^{1t} for t[T]t\in[T]. Next, for any choice of indices i1,i2[ny]i_{1},i_{2}\in[n^{y}] such that u~i1u~i2\tilde{u}_{i_{1}}\leq\tilde{u}_{i_{2}}, we denote by 𝒵(i1,i2)\mathcal{Z}(i_{1},i_{2}) the subset of units that lie in the hyper-rectangle defined by u~i1\tilde{u}_{i_{1}} and u~i2\tilde{u}_{i_{2}}, that is

𝒵(i1,i2):={z[n],u~i1uz1u~i2}.\mathcal{Z}(i_{1},i_{2}):=\left\{z\in[n],\tilde{u}_{i_{1}}\leq u^{z1}\leq\tilde{u}_{i_{2}}\right\}.

We define 𝒯(j1,j2)\mathcal{T}(j_{1},j_{2}) similarly, for any j1,j2[Tz]j_{1},j_{2}\in[T^{z}] with v~j1v~j2\tilde{v}_{j_{1}}\leq\tilde{v}_{j_{2}}. Finally, SU:={𝒵(i1,i2):i1,i2[ny],u~i1u~i2}S_{U}:=\{\mathcal{Z}(i_{1},i_{2}):i_{1},i_{2}\in[n^{y}],\tilde{u}_{i_{1}}\leq\tilde{u}_{i_{2}}\} and SVS_{V} is similarly defined. We are now ready to define the last event:

𝒞:={supw(1)w(2)|1|𝒵||𝒯|z𝒵,t𝒯Iwzt[w(1),w(2)]GW(w(1),w(2))|nTϵ+4x|𝒵||𝒯|\displaystyle\mathcal{E_{C}}:=\left\{\sup_{w^{(1)}\leq w^{(2)}}\left\lvert\frac{1}{|\mathcal{Z}||\mathcal{T}|}\sum_{z\in\mathcal{Z},t\in\mathcal{T}}I_{w^{zt}\in[w^{(1)},w^{(2)}]}-G^{W}(w^{(1)},w^{(2)})\right\rvert\leq\frac{nT\epsilon+4x}{|\mathcal{Z}||\mathcal{T}|}\right.
𝒵SU,𝒯SV}.\displaystyle\left.\forall\mathcal{Z}\in S_{U},\mathcal{T}\in S_{V}\right\}.

Bounding the probability of event \mathcal{E}. By Lemma A.1, we directly have

Pr(𝒜)\displaystyle\Pr(\mathcal{E_{A}}) 12n2ye2nϵ2\displaystyle\geq 1-2n^{2y}e^{-2n\epsilon^{2}}
Pr()\displaystyle\Pr(\mathcal{E_{B}}) 12T2ze2Tϵ2.\displaystyle\geq 1-2T^{2z}e^{-2T\epsilon^{2}}.

We now show that 𝒞\mathcal{E_{C}} occurs with high probability. Note that conditioned on the realizations of uz1u^{z1} and v1tv^{1t} for z[n]z\in[n] and t[T]t\in[T], and for any choice of unit and time period subsets 𝒵SU\mathcal{Z}\in S_{U} and 𝒯SV\mathcal{T}\in S_{V}, all the random variables wztw^{zt} for z𝒵,t𝒯z\in\mathcal{Z},t\in\mathcal{T} are still i.i.d. according to the distribution WW since they are sampled independently from samples of UU and VV. As a result, we can apply Lemma A.1 to obtain

supw(1)w(2)|1|𝒵||𝒯|z𝒵,t𝒯Iwzt[w(1),w(2)]GW(w(1),w(2))|nTϵ+4x|𝒵||𝒯|,\sup_{w^{(1)}\leq w^{(2)}}\left\lvert\frac{1}{|\mathcal{Z}||\mathcal{T}|}\sum_{z\in\mathcal{Z},t\in\mathcal{T}}I_{w^{zt}\in[w^{(1)},w^{(2)}]}-G^{W}(w^{(1)},w^{(2)})\right\rvert\leq\frac{nT\epsilon+4x}{|\mathcal{Z}||\mathcal{T}|},

with probability at least 12(|𝒵||𝒯|)2xe2(nTϵ)2/(|𝒵||𝒯|)12(nT)2xe2nTϵ21-2(|\mathcal{Z}||\mathcal{T}|)^{2x}e^{-2(nT\epsilon)^{2}/(|\mathcal{Z}||\mathcal{T}|)}\geq 1-2(nT)^{2x}e^{-2nT\epsilon^{2}} for any given 𝒵\mathcal{Z} and 𝒯\mathcal{T} and for any realization of the random variables uz1u^{z1} for z[n]z\in[n] and v1tv^{1t} for t[T]t\in[T]. Taking the union bound, over all choices of indices i1,i2[ny]i_{1},i_{2}\in[n^{y}] and j1,j2[Tz]j_{1},j_{2}\in[T^{z}] then shows that

Pr(𝒞uzt,vzt;z[n],t[T])12n2yT2z(nT)2xe2nTϵ2.\Pr(\mathcal{E_{C}}\mid u^{zt},v^{zt};\forall z\in[n],t\in[T])\geq 1-2n^{2y}T^{2z}(nT)^{2x}e^{-2nT\epsilon^{2}}.

In particular, the same lower bound still holds for Pr(𝒞)\Pr(\mathcal{E_{C}}) after taking the expectation over the samples from UU and VV. Putting all the probabilistic bounds together, we obtain

Pr[𝒜𝒞]16n2(x+y)T2(x+z)e2min(n,T)ϵ2.\Pr[\mathcal{E_{A}}\cap\mathcal{E_{B}}\cap\mathcal{E_{C}}]\geq 1-6n^{2(x+y)}T^{2(x+z)}e^{-2\min(n,T)\epsilon^{2}}.

Plugging in ϵ=(p+1)ln(nT)min(n,T)\epsilon=\sqrt{\frac{(p+1)\ln(nT)}{\min(n,T)}}, this event has probability at least 16(nT)21-\frac{6}{(nT)^{2}}.

3.1 is satisfied under event \mathcal{E}. We suppose that event \mathcal{E} holds for the rest of this proof. Consider an arbitrary hyper-rectangle in [0,1]p[0,1]^{p} with lower and upper corners x(1)x^{(1)} and x(2)x^{(2)}, where x(1)x(2)x^{(1)}\leq x^{(2)}. For convenience, we decompose x(1)x^{(1)} into three distinct vectors: w(1)xw^{(1)}\in\mathbb{R}^{x}, u(1)yu^{(1)}\in\mathbb{R}^{y}, and v(1)zv^{(1)}\in\mathbb{R}^{z}, such that the concatenation of these three vectors forms x(1)x^{(1)}. Similarly, x(2)x^{(2)} is decomposed into vectors w(2)w^{(2)}, u(2)u^{(2)}, and v(2)v^{(2)}. Our goal is to bound the number of observations that simultaneously satisfy the constraints:

w(1)wztw(2)u(1)uztu(2)v(1)vztv(2).\displaystyle w^{(1)}\leq w^{zt}\leq w^{(2)}\qquad u^{(1)}\leq u^{zt}\leq u^{(2)}\qquad v^{(1)}\leq v^{zt}\leq v^{(2)}.

First, the set \mathcal{Z} of units satisfying the constraint u^{(1)}\leq u^{z1}\leq u^{(2)} is one of the sets in S_{U}. Similarly, the set \mathcal{T} of time periods satisfying the constraint v^{(1)}\leq v^{1t}\leq v^{(2)} is one of the sets in S_{V}. As a result, because \mathcal{E_{C}} is satisfied, we have

\left\lvert\frac{nT}{|\mathcal{Z}||\mathcal{T}|}\hat{G}^{X}(x^{(1)},x^{(2)})-G^{W}(w^{(1)},w^{(2)})\right\rvert\leq\frac{nT\epsilon+4x}{|\mathcal{Z}||\mathcal{T}|}.

Multiplying both sides by |𝒵||𝒯|nT\frac{|\mathcal{Z}||\mathcal{T}|}{nT}, we have

|G^X(x(1),x(2))|𝒵||𝒯|nTGW(w(1),w(2))|ϵ+4xnT.\left\lvert\hat{G}^{X}(x^{(1)},x^{(2)})-\frac{|\mathcal{Z}||\mathcal{T}|}{nT}G^{W}(w^{(1)},w^{(2)})\right\rvert\leq\epsilon+\frac{4x}{nT}. (7)

We then use the events 𝒜\mathcal{E_{A}} and \mathcal{E_{B}} to bound the number of units and times satisfying their respective constraints:

||𝒵|nGU(u(1),u(2))|ϵ+4yn,||𝒯|TGV(v(1),v(2))|ϵ+4zT.\displaystyle\left\lvert\frac{|\mathcal{Z}|}{n}-G^{U}(u^{(1)},u^{(2)})\right\rvert\leq\epsilon+\frac{4y}{n},\qquad\left\lvert\frac{|\mathcal{T}|}{T}-G^{V}(v^{(1)},v^{(2)})\right\rvert\leq\epsilon+\frac{4z}{T}. (8)

Now let P:=G^{U}(u^{(1)},u^{(2)})G^{V}(v^{(1)},v^{(2)})G^{W}(w^{(1)},w^{(2)}). Provided that n and T are large enough such that \frac{4x}{nT},\frac{4y}{n},\frac{4z}{T}\leq\epsilon\leq\frac{1}{4}, inequality (8) implies \left\lvert\frac{|\mathcal{Z}||\mathcal{T}|}{nT}-G^{U}(u^{(1)},u^{(2)})G^{V}(v^{(1)},v^{(2)})\right\rvert\leq 4\epsilon, which, combined with (7) and the fact that G^{U},G^{V},G^{W}\in[0,1], implies

|G^X(x(1),x(2))P|7ϵ.\left\lvert\hat{G}^{X}(x^{(1)},x^{(2)})-P\right\rvert\leq 7\epsilon. (9)

Because of the assumed density bounds on UU, VV, and WW, we have

c¯i=1p(xi(2)xi(1))Pc¯i=1p(xi(2)xi(1)).\displaystyle\underaccent{\bar}{c}\prod_{i=1}^{p}\left(x_{i}^{(2)}-x_{i}^{(1)}\right)\leq P\leq\bar{c}\prod_{i=1}^{p}\left(x_{i}^{(2)}-x_{i}^{(1)}\right).

Combining this with (9) completes the proof. ∎
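As an illustration of the covariate-generating process assumed in Lemma A.2 (our own sketch; the dimensions and the uniform distributions are chosen for simplicity and satisfy the required density bounds), the three covariate types can be generated as follows.

```python
import numpy as np

# Sketch (not from the paper) of the covariate-generating process in Lemma A.2:
# X^{zt} concatenates covariates that vary by unit and time (w), by unit only (u),
# and by time only (v); here all are Uniform[0,1], so the density bounds hold.
rng = np.random.default_rng(0)
n, T = 50, 40
x_dim, y_dim, z_dim = 2, 1, 1                 # dims of w, u, v; p = x_dim + y_dim + z_dim

w = rng.random((n, T, x_dim))                 # varies with both unit z and time t
u = rng.random((n, 1, y_dim))                 # varies only across units
v = rng.random((1, T, z_dim))                 # varies only over time

X = np.concatenate(
    [w, np.broadcast_to(u, (n, T, y_dim)), np.broadcast_to(v, (n, T, z_dim))],
    axis=-1,
)
print(X.shape)                                # (50, 40, 4): X[z, t] is X^{zt} in R^p
```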

Appendix B Proof of Lemma 2.1

Lemma B.1.

Define the matrix subspace 𝐓\mathbf{T} as follows:

𝐓\displaystyle\mathbf{T} ={UA+BV+a𝟏AT×r,Bn×r,an}.\displaystyle=\left\{UA^{\top}+BV^{\top}+a\mathbf{1}^{\top}\mid A\in\mathbb{R}^{T\times r},B\in\mathbb{R}^{n\times r},a\in\mathbb{R}^{n}\right\}.

The projection onto its orthogonal space is

P𝐓(X)\displaystyle P_{{\mathbf{T}}^{\perp}}(X) =(IUU)X(Irrr2)(IVV),where r=(IVV)𝟏\displaystyle=(I-UU^{\top})X\left(I-\frac{rr^{\top}}{\left\lVert r\right\rVert^{2}}\right)(I-VV^{\top}),\quad\text{where }r=(I-VV^{\top})\mathbf{1}
Proof.

For a given matrix XX, we define P:=(IUU)X(Irrr2)(IVV)P:=(I-UU^{\top})X\left(I-\frac{rr^{\top}}{\left\lVert r\right\rVert^{2}}\right)(I-VV^{\top}). To show that P=P𝐓(X)P=P_{{\mathbf{T}}^{\perp}}(X), it suffices to show that P𝐓P\in\mathbf{T}^{\perp} and XP𝐓X-P\in\mathbf{T}.

First, note that UP=0U^{\top}P=0 and PV=0PV=0 because of the construction of PP. This implies that for any matrices AT×rA\in\mathbb{R}^{T\times r} and Bn×rB\in\mathbb{R}^{n\times r}, we have P,UA=P,BV=0\langle P,UA^{\top}\rangle=\langle P,BV^{\top}\rangle=0. Next, we note that

P𝟏=(IUU)X(Irrr2)r=0.P\mathbf{1}=(I-UU^{\top})X\left(I-\frac{rr^{\top}}{\left\lVert r\right\rVert^{2}}\right)r=0.

As a result, we also have P,a𝟏=0\langle P,a\mathbf{1}^{\top}\rangle=0 for all ana\in\mathbb{R}^{n}. This ends the proof that P𝐓P\in\mathbf{T}^{\perp}.

Next, we compute

XP\displaystyle X-P =UUX(Irrr2)(IVV)A+X(Irrr2)VBV+Xrrr2\displaystyle=U\underbrace{U^{\top}X\left(I-\frac{rr^{\top}}{\left\lVert r\right\rVert^{2}}\right)(I-VV^{\top})}_{A^{{}^{\prime}\top}}+\underbrace{X\left(I-\frac{rr^{\top}}{\left\lVert r\right\rVert^{2}}\right)V}_{B^{\prime}}V^{\top}+X\frac{rr^{\top}}{\left\lVert r\right\rVert^{2}}
=UA+(BXr𝟏Vr2)V+Xrr2𝟏\displaystyle=UA^{{}^{\prime}\top}+\left(B^{\prime}-X\frac{r\mathbf{1}^{\top}V}{\left\lVert r\right\rVert^{2}}\right)V^{\top}+X\frac{r}{\left\lVert r\right\rVert^{2}}\mathbf{1}^{\top}

Hence, XP𝐓X-P\in\mathbf{T}, which concludes the proof of the lemma. ∎
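A quick numerical sanity check of the projection formula in Lemma B.1 (illustration only; the dimensions and random matrices below are arbitrary): the matrix P computed from the closed form should be orthogonal to every matrix of the form UA^{\top}, BV^{\top}, or a\mathbf{1}^{\top}.

```python
import numpy as np

# Numerical check (illustration only) of the projection formula in Lemma B.1.
rng = np.random.default_rng(1)
n, T, r_rank = 8, 6, 2
U, _ = np.linalg.qr(rng.standard_normal((n, r_rank)))   # orthonormal columns
V, _ = np.linalg.qr(rng.standard_normal((T, r_rank)))
X = rng.standard_normal((n, T))
ones = np.ones((T, 1))

r = (np.eye(T) - V @ V.T) @ ones
P = (np.eye(n) - U @ U.T) @ X @ (np.eye(T) - r @ r.T / (r.T @ r)) @ (np.eye(T) - V @ V.T)

print(np.allclose(U.T @ P, 0))    # <P, U A^T> = 0 for every A
print(np.allclose(P @ V, 0))      # <P, B V^T> = 0 for every B
print(np.allclose(P @ ones, 0))   # <P, a 1^T> = 0 for every a
```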

Proof of Lemma 2.1.

It is known [9] that the set of subgradients of the nuclear norm of an arbitrary matrix Xn1×n2X\in\mathbb{R}^{n_{1}\times n_{2}} with SVD X=UΣVX=U\Sigma V^{\top} is

X={UV+W:Wn1×n2,UW=0,WV=0,W1}.\displaystyle\partial\left\lVert X\right\rVert_{\star}=\left\{UV^{\top}+W:W\in\mathbb{R}^{n_{1}\times n_{2}},U^{\top}W=0,WV=0,\left\lVert W\right\rVert\leq 1\right\}.

Thus, the first-order optimality conditions of (2) are

Zj,OM^m^𝟏i=1kτ^iZi\displaystyle\left\langle Z_{j},O-\hat{M}-\hat{m}\mathbf{1}^{\top}-\sum_{i=1}^{k}\hat{\tau}_{i}Z_{i}\right\rangle =0for j=1,2,,k\displaystyle=0\qquad\text{for }j=1,2,\dots,k (10a)
OM^m^𝟏i=1kτ^iZi\displaystyle O-\hat{M}-\hat{m}\mathbf{1}^{\top}-\sum_{i=1}^{k}\hat{\tau}_{i}Z_{i} =λ(U^V^+W)\displaystyle=\lambda\left(\hat{U}\hat{V}^{\top}+W\right) (10b)
U^W\displaystyle\hat{U}^{\top}W =0\displaystyle=0 (10c)
WV^\displaystyle W\hat{V} =0\displaystyle=0 (10d)
W\displaystyle\left\lVert W\right\rVert 1\displaystyle\leq 1 (10e)
m^\displaystyle\hat{m} =1T(OM^i=1kτ^iZi)𝟏.\displaystyle=\frac{1}{T}\left(O-\hat{M}-\sum_{i=1}^{k}\hat{\tau}_{i}Z_{i}\right)\mathbf{1}. (10f)

Combining equations (10a) and (10b), we have

Zj,λ(U^V^+W)\displaystyle\left\langle Z_{j},\lambda\left(\hat{U}\hat{V}^{\top}+W\right)\right\rangle =0for j=1,2,,k\displaystyle=0\qquad\text{for }j=1,2,\dots,k (11)

By the definition of our model, we have O=M+i=1kτiZi+E^.O=M^{\star}+\sum_{i=1}^{k}\tau^{\star}_{i}Z_{i}+\hat{E}. We plug this into equation (10b):

MM^m^𝟏+i=1k(τiτ^i)Zi+E^\displaystyle M^{\star}-\hat{M}-\hat{m}\mathbf{1}^{\top}+\sum_{i=1}^{k}\left(\tau^{\star}_{i}-\hat{\tau}_{i}\right)Z_{i}+\hat{E} =λ(U^V^+W).\displaystyle=\lambda\left(\hat{U}\hat{V}^{\top}+W\right).

We apply P𝐓^()P_{\hat{\mathbf{T}}^{\perp}}(\cdot) to both sides, where the formula for the projection is given in Lemma B.1:

P𝐓^(M+E^)+i=1k(τiτ^i)P𝐓^(Zi)B\displaystyle\underbrace{P_{\hat{\mathbf{T}}^{\perp}}(M^{\star}+\hat{E})+\sum_{i=1}^{k}\left(\tau^{\star}_{i}-\hat{\tau}_{i}\right)P_{\hat{\mathbf{T}}^{\perp}}(Z_{i})}_{B}
=λP𝐓^(W)\displaystyle=\lambda P_{\hat{\mathbf{T}}^{\perp}}\left(W\right)
=λ(IU^U^)W(I(IV^V^)𝟏𝟏(IV^V^)(IV^V^)𝟏2)(IV^V^)\displaystyle=\lambda(I-\hat{U}\hat{U}^{\top})W\left(I-\frac{(I-\hat{V}\hat{V}^{\top})\mathbf{1}\mathbf{1}^{\top}(I-\hat{V}\hat{V}^{\top})}{\left\lVert(I-\hat{V}\hat{V}^{\top})\mathbf{1}\right\rVert^{2}}\right)(I-\hat{V}\hat{V}^{\top})
=λW(I𝟏𝟏(IV^V^)(IV^V^)𝟏2)\displaystyle=\lambda W\left(I-\frac{\mathbf{1}\mathbf{1}^{\top}(I-\hat{V}\hat{V}^{\top})}{\left\lVert(I-\hat{V}\hat{V}^{\top})\mathbf{1}\right\rVert^{2}}\right)
=λWλW(𝟏𝟏T).\displaystyle=\lambda W-\lambda W\left(\frac{\mathbf{1}\mathbf{1}^{\top}}{T}\right).

The last line makes use of the following claim:

Claim B.2.

V^𝟏=0\hat{V}^{\top}\mathbf{1}=0.

To compute λW(𝟏𝟏T)\lambda W\left(\frac{\mathbf{1}\mathbf{1}^{\top}}{T}\right), we right-multiply (10b) by (𝟏𝟏T)\left(\frac{\mathbf{1}\mathbf{1}^{\top}}{T}\right). We note that by (10f), we have (OM^m^𝟏i=1kτ^iZi)𝟏=0\left(O-\hat{M}-\hat{m}\mathbf{1}^{\top}-\sum_{i=1}^{k}\hat{\tau}_{i}Z_{i}\right)\mathbf{1}=0. Hence, we find

λW(𝟏𝟏T)=λU^V^(𝟏𝟏T)=(i)0,\displaystyle\lambda W\left(\frac{\mathbf{1}\mathbf{1}^{\top}}{T}\right)=-\lambda\hat{U}\hat{V}^{\top}\left(\frac{\mathbf{1}\mathbf{1}^{\top}}{T}\right)\overset{(i)}{=}0,

where (i)(i) is due to B.2. As a result, we can conclude that B=λWB=\lambda W.

Substituting λW=B\lambda W=B into (11), we obtain for all j=1,2,,kj=1,2,\dots,k:

Zj,λU^V^+P𝐓^(M+E^)+i=1k(τiτ^i)P𝐓^(Zi)=0\displaystyle\left\langle Z_{j},\lambda\hat{U}\hat{V}^{\top}+P_{\hat{\mathbf{T}}^{\perp}}(M^{\star}+\hat{E})+\sum_{i=1}^{k}\left(\tau^{\star}_{i}-\hat{\tau}_{i}\right)P_{\hat{\mathbf{T}}^{\perp}}(Z_{i})\right\rangle=0 (12)

Rearranging, we have

[P𝐓^(Zi),P𝐓^(Zj)]i,j[k](τ^τ)=[Zi,λU^V^+P𝐓^(M+E^)]i[k]\displaystyle\left[\left\langle P_{\hat{\mathbf{T}}^{\perp}}(Z_{i}),P_{\hat{\mathbf{T}}^{\perp}}(Z_{j})\right\rangle\right]_{i,j\in[k]}\left(\hat{\tau}-\tau^{\star}\right)=\left[\left\langle Z_{i},\lambda\hat{U}\hat{V}^{\top}+P_{\hat{\mathbf{T}}^{\perp}}(M^{\star}+\hat{E})\right\rangle\right]_{i\in[k]}

This is equivalent to the error decomposition in the statement of the lemma, completing the proof. ∎

Proof of B.2.

Suppose for contradiction that \hat{V}^{\top}\mathbf{1}\neq 0. Then, there is a solution for which the objective is lower, contradicting the optimality of \hat{M},\hat{\tau},\hat{m}. That solution can be constructed as follows. Define V^{\prime} to be the matrix resulting from projecting each column of \hat{V} onto the space orthogonal to the vector \mathbf{1}. That is, the i-th column of V^{\prime} is V^{\prime}_{i}=\hat{V}_{i}-\mathbf{1}\left(\frac{\hat{V}_{i}^{\top}\mathbf{1}}{\mathbf{1}^{\top}\mathbf{1}}\right). Let m^{\prime} be the vector such that m^{\prime}\mathbf{1}^{\top}=\hat{U}\hat{\Sigma}\left(\hat{V}-V^{\prime}\right)^{\top}+\hat{m}\mathbf{1}^{\top}. The solution \left(\hat{U}\hat{\Sigma}V^{\prime\top},\hat{\tau},m^{\prime}\right) achieves a lower objective value because the value of the expression inside the Frobenius norm remains unchanged, but \left\lVert\hat{U}\hat{\Sigma}V^{\prime\top}\right\rVert_{\star}<\left\lVert\hat{U}\hat{\Sigma}\hat{V}^{\top}\right\rVert_{\star}=\left\lVert\hat{M}\right\rVert_{\star}. The inequality is due to the fact that the matrix V^{\prime} is still an orthogonal matrix, but its columns have norm at most 1, and at least one of its columns has norm strictly less than 1. ∎
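As an aside, the convex program (2) can be solved by a simple block coordinate descent scheme. The sketch below is our own illustration: it assumes the objective \frac{1}{2}\left\lVert O-M-m\mathbf{1}^{\top}-\sum_{i}\tau_{i}Z_{i}\right\rVert_{\mathrm{F}}^{2}+\lambda\left\lVert M\right\rVert_{\star}, i.e., the form consistent with the optimality conditions (10a)–(10f) above, and it is not meant as a reference implementation of PaCE.

```python
import numpy as np

def svt(A, lam):
    # Singular value thresholding: the proximal operator of lam * ||.||_* at A.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U @ np.diag(np.maximum(s - lam, 0.0)) @ Vt

def fit(O, Zs, lam, iters=500):
    # Exact block updates: m via row means (cf. (10f)), tau via a small
    # least-squares problem, and M via singular value thresholding.
    n, T = O.shape
    M, m, tau = np.zeros((n, T)), np.zeros(n), np.zeros(len(Zs))
    A = np.stack([Z.reshape(-1) for Z in Zs], axis=1)      # nT-by-k design for tau
    for _ in range(iters):
        m = (O - M - sum(t * Z for t, Z in zip(tau, Zs))).mean(axis=1)
        R = O - M - np.outer(m, np.ones(T))
        tau, *_ = np.linalg.lstsq(A, R.reshape(-1), rcond=None)
        M = svt(O - np.outer(m, np.ones(T)) - sum(t * Z for t, Z in zip(tau, Zs)), lam)
    return M, tau, m
```

Each block update is exact: the m-step mirrors (10f), the \tau-step is a small least-squares problem, and the M-step is the proximal operator of the nuclear norm (singular value thresholding).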

Appendix C Proof of Theorem 3.5

We aim to bound |τidτ~i|\left\lvert\tau^{d}_{i}-\tilde{\tau}^{\star}_{i}\right\rvert. Throughout this section, we assume that the assumptions made in the statement of Theorem 3.5 are satisfied. To simplify notation, we define ηi:=Z~iF\eta_{i}:=\|\tilde{Z}_{i}\|_{\mathrm{F}}. We have

|τidτ~i|τdτ~=1ηiηiτdτ=(i)1ηiD1(Δ2+Δ3)1σmin(D)ηi(Δ2+Δ3),\displaystyle\begin{split}\left\lvert\tau^{d}_{i}-\tilde{\tau}^{\star}_{i}\right\rvert&\leq\left\lVert\tau^{d}-\tilde{\tau}^{\star}\right\rVert\\ &=\frac{1}{\eta_{i}}\left\lVert\eta_{i}\tau^{d}-\tau^{\star}\right\rVert\\ &\overset{(i)}{=}\frac{1}{\eta_{i}}\left\lVert D^{-1}(\Delta^{2}+\Delta^{3})\right\rVert\\ &\leq\frac{1}{\sigma_{\min}(D)\eta_{i}}\left(\left\lVert\Delta^{2}\right\rVert+\left\lVert\Delta^{3}\right\rVert\right),\end{split} (13)

where (i)(i) is due to the error decomposition in Lemma 2.1 and the definition of τd\tau^{d}.

Hence, we aim to upper bound \left\lVert\Delta^{2}\right\rVert and \left\lVert\Delta^{3}\right\rVert and to lower bound \sigma_{\min}(D). The main challenge lies in upper bounding \left\lVert\Delta^{3}\right\rVert. In fact, the desired bounds for \sigma_{\min}(D) and \left\lVert\Delta^{2}\right\rVert (presented in the following lemma) are obtained in the process of bounding \left\lVert\Delta^{3}\right\rVert.

Lemma C.1.

Let DD and Δ2\Delta^{2} be defined as in Lemma 2.1. We have σmin(D)cs2logn\sigma_{\min}(D)\geq\frac{c_{s}}{2\log n}. Furthermore, we have the following for sufficiently large nn:

Δ2σ2r1.5κnlog6.5(n)σmin+log0.5(n)maxi[k]|P𝐓(Z~i),P𝐓(E+δ)|Z~iF.\displaystyle\left\lVert\Delta^{2}\right\rVert\lesssim\frac{\sigma^{2}r^{1.5}\kappa n\log^{6.5}(n)}{\sigma_{\min}}+\log^{0.5}(n)\cdot\max_{i\in[k]}\frac{\left\lvert\left\langle P_{{\mathbf{T}}^{\star\perp}}(\tilde{Z}_{i}),P_{{\mathbf{T}}^{\star\perp}}(E+\delta)\right\rangle\right\rvert}{\|\tilde{Z}_{i}\|_{\mathrm{F}}}.

The proof of the above lemma is provided in Section D.6.

We now discuss the strategy for bounding Δ3\left\lVert\Delta^{3}\right\rVert. In order to control P𝐓^(M)P_{\hat{\mathbf{T}}^{\perp}}(M^{\star}), we aim to show that the true counterfactual matrix MM^{\star} has tangent space close to that of M^\hat{M}. Because we introduced m^\hat{m} in the convex formulation, which centers the rows of M^\hat{M} without penalization from the nuclear norm term, we will focus on the projection of MM^{\star} that has rows with mean zero, P𝟏(M)P_{{\mathbf{1}}^{\perp}}(M^{\star}). Recall that P𝟏(M)=UΣVP_{{\mathbf{1}}^{\perp}}(M^{\star})=U^{\star}\Sigma^{\star}V^{\star\top} is its SVD, and define X:=UΣ1/2X^{\star}:=U^{\star}\Sigma^{\star 1/2} and Y:=VΣ1/2Y^{\star}:=V^{\star}\Sigma^{\star 1/2}. We similarly define these quantities for M^\hat{M}: X^:=U^Σ^1/2\hat{X}:=\hat{U}\hat{\Sigma}^{1/2} and Y^:=V^Σ^1/2\hat{Y}:=\hat{V}\hat{\Sigma}^{1/2}. We aim to show that we have (X^,Y^)(X,Y)(\hat{X},\hat{Y})\approx(X^{\star},Y^{\star}). This is stated formally in Lemma C.2, which requires introducing a few additional notations first.

Let g(M,τ,m)g(M,\tau,m) denote the convex function that we are optimizing in (2). Recall that (M^,τ^,m^)(\hat{M},\hat{\tau},\hat{m}) is a global optimum of function gg. Following the main proof idea in [17], we define a non-convex proxy function ff, which is similar to gg, except MM is split into two variables and expressed as XYXY^{\top}.

f(X,Y)\displaystyle f(X,Y) =minmn,τk12OXYm𝟏i=1kτiZiF2+12λX,X+12λY,Y.\displaystyle=\min_{m\in\mathbb{R}^{n},\tau\in\mathbb{R}^{k}}\frac{1}{2}\left\lVert O-XY^{\top}-m\mathbf{1}^{\top}-\sum_{i=1}^{k}\tau_{i}Z_{i}\right\rVert_{\mathrm{F}}^{2}+\frac{1}{2}\lambda\left\langle X,X\right\rangle+\frac{1}{2}\lambda\left\langle Y,Y\right\rangle.

Our analysis will relate (X^,Y^)(\hat{X},\hat{Y}) to a specific local optimum of ff, which we will show is close to (X,Y)(X^{\star},Y^{\star}). The local optimum of ff considered is the limit of the gradient flow of ff, initiated at (X,Y)(X^{\star},Y^{\star}), formally defined by the following differential equation:

{(X˙t,Y˙t)=f(Xt,Yt)(X0,Y0)=(X,Y).\displaystyle\begin{cases}(\dot{X}^{t},\dot{Y}^{t})=-\nabla f(X^{t},Y^{t})\\ (X^{0},Y^{0})=(X^{\star},Y^{\star}).\end{cases} (14)

We define HX,YH_{X,Y} as the rotation matrix that optimally aligns (X,Y)(X,Y) with (X,Y)(X^{\star},Y^{\star}). That is, letting 𝒪r×r\mathcal{O}^{r\times r} denote the set of r×rr\times r orthogonal matrices, HX,YH_{X,Y} is formally defined as follows:

HX,Y:=argminR𝒪r×rXRXF2+YRYF2.H_{X,Y}:=\operatorname*{argmin}_{R\in\mathcal{O}^{r\times r}}\left\lVert XR-X^{\star}\right\rVert^{2}_{\mathrm{F}}+\left\lVert YR-Y^{\star}\right\rVert^{2}_{\mathrm{F}}.
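This is an orthogonal Procrustes problem; as a side remark (not stated in the paper), the minimizer admits the closed form H_{X,Y}=AB^{\top}, where A\Sigma B^{\top} is an SVD of X^{\top}X^{\star}+Y^{\top}Y^{\star}. The following sketch, on synthetic matrices, illustrates the alignment.

```python
import numpy as np

# Illustration (our own, not from the paper): recover the aligning matrix via
# orthogonal Procrustes, H = A B^T with A S B^T an SVD of X^T X* + Y^T Y*.
rng = np.random.default_rng(2)
n, T, r = 10, 8, 3
X_star = rng.standard_normal((n, r))
Y_star = rng.standard_normal((T, r))

Q, _ = np.linalg.qr(rng.standard_normal((r, r)))   # rotate/reflect the truth
X, Y = X_star @ Q, Y_star @ Q

A, _, Bt = np.linalg.svd(X.T @ X_star + Y.T @ Y_star)
H = A @ Bt
err = np.linalg.norm(X @ H - X_star) ** 2 + np.linalg.norm(Y @ H - Y_star) ** 2
print(f"alignment error after applying H: {err:.2e}")   # ~0: (X, Y) is an exact rotation of (X*, Y*)
```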

Define FF^{\star} to be the vertical concatenation of matrices XX^{\star} and YY^{\star}. That is, F=[(X),(Y)]F^{\star}=[(X^{\star})^{\top},(Y^{\star})^{\top}]^{\top}. We are now ready to present Lemma C.2.

Lemma C.2.

For sufficiently large nn, the gradient flow of ff, starting from (X,Y)(X^{\star},Y^{\star}), converges to (X,Y)(X,Y) such that X=X^RX=\hat{X}R and Y=Y^RY=\hat{Y}R for some rotation matrix R𝒪r×rR\in\mathcal{O}^{r\times r}. Furthermore, (X,Y)(X,Y)\in\mathcal{B}, where

\displaystyle\mathcal{B} :={(X,Y)XHX,YXF2+YHX,YYF2ρ2},ρ:=σnrlog6(n)σminFF.\displaystyle:=\left\{(X,Y)\mid\left\lVert XH_{X,Y}-X^{\star}\right\rVert_{\mathrm{F}}^{2}+\left\lVert YH_{X,Y}-Y^{\star}\right\rVert_{\mathrm{F}}^{2}\leq\rho^{2}\right\},\quad\rho:=\frac{\sigma\sqrt{nr}\log^{6}(n)}{\sigma_{\min}}\left\lVert F^{\star}\right\rVert_{\mathrm{F}}.

Let us derive an upper bound on ρ\rho. Because XFXr=σmaxr\left\lVert X^{\star}\right\rVert_{\mathrm{F}}\leq\left\lVert X^{\star}\right\rVert\sqrt{r}=\sqrt{\sigma_{\max}r} and YFσmaxr\left\lVert Y^{\star}\right\rVert_{\mathrm{F}}\leq\sqrt{\sigma_{\max}r}, we can conclude that FFσmaxr\left\lVert F^{\star}\right\rVert_{\mathrm{F}}\lesssim\sqrt{\sigma_{\max}r}. Thus, we can upper bound ρ\rho as follows:

ρσnrlog6(n)σminσmaxr=σnrlog6(n)σminσminκr=σrκ0.5nlog6(n)σmin.\displaystyle\rho\lesssim\frac{\sigma\sqrt{nr}\log^{6}(n)}{\sigma_{\min}}\sqrt{\sigma_{\max}r}=\frac{\sigma\sqrt{nr}\log^{6}(n)}{\sigma_{\min}}\sqrt{\sigma_{\min}\kappa r}=\frac{\sigma r\kappa^{0.5}\sqrt{n}\log^{6}(n)}{\sqrt{\sigma_{\min}}}. (15)

Using (15) and the assumption σnσmin1κ2r2log12.5(n)\frac{\sigma\sqrt{n}}{\sigma_{\min}}\lesssim\frac{1}{\kappa^{2}r^{2}\log^{12.5}(n)} provided in the theorem, we can further upper bound ρ\rho as follows:

ρσminlog6.5(n)κ.\displaystyle\rho\lesssim\frac{\sqrt{\sigma_{\min}}}{\log^{6.5}(n)\kappa}. (16)

With Lemma C.2 and the bound on ρ\rho, we can complete the proof of the theorem as follows.

Δ3\displaystyle\left\lVert\Delta^{3}\right\rVert kΔ3\displaystyle\leq\sqrt{k}\left\lVert\Delta^{3}\right\rVert_{\infty}
kP𝐓^(M)F\displaystyle\leq\sqrt{k}\left\lVert P_{\hat{\mathbf{T}}^{\perp}}(M^{\star})\right\rVert_{\mathrm{F}}
=k(IU^U^)(XY+m𝟏)(Ir^r^r^r^)(IV^V^)F,\displaystyle=\sqrt{k}\left\lVert(I-\hat{U}\hat{U}^{\top})(X^{\star}Y^{\star\top}+m^{\star}\mathbf{1}^{\top})\left(I-\frac{\hat{r}\hat{r}^{\top}}{\hat{r}^{\top}\hat{r}}\right)(I-\hat{V}\hat{V}^{\top})\right\rVert_{\mathrm{F}},

where r^=(IV^V^)𝟏\hat{r}=(I-\hat{V}\hat{V}^{\top})\mathbf{1}, and the last line comes from the closed form of the projection derived in Lemma B.1. We can use the fact that V^𝟏=0\hat{V}^{\top}\mathbf{1}=0 from B.2 to simplify the expression. Note that with V^𝟏=0\hat{V}^{\top}\mathbf{1}=0, r^\hat{r} simply evaluates to 𝟏\mathbf{1}.

kP𝐓^(M)F\displaystyle\sqrt{k}\left\lVert P_{\hat{\mathbf{T}}^{\perp}}(M^{\star})\right\rVert_{\mathrm{F}} =k(IU^U^)XY(I𝟏𝟏T)(IV^V^)F\displaystyle=\sqrt{k}\left\lVert(I-\hat{U}\hat{U}^{\top})X^{\star}Y^{\star\top}\left(I-\frac{\mathbf{1}\mathbf{1}^{\top}}{T}\right)(I-\hat{V}\hat{V}^{\top})\right\rVert_{\mathrm{F}}
=k(IU^U^)XY(IV^V^)F,\displaystyle=\sqrt{k}\left\lVert(I-\hat{U}\hat{U}^{\top})X^{\star}Y^{\star\top}(I-\hat{V}\hat{V}^{\top})\right\rVert_{\mathrm{F}},

where the last step is due to the fact that XY𝟏=0X^{\star}Y^{\star\top}\mathbf{1}=0 because XY=P𝟏(M)X^{\star}Y^{\star\top}=P_{{\mathbf{1}}^{\perp}}(M^{\star}).

Finally, let (X,Y)(X,Y) be the limit of the gradient flow of ff starting from (X,Y)(X^{\star},Y^{\star}). By Lemma C.2, we have X=X^RX=\hat{X}R and Y=Y^RY=\hat{Y}R for some rotation matrix R𝒪r×rR\in\mathcal{O}^{r\times r}. This, combined with the definition of 𝐓^\hat{\mathbf{T}}, implies P𝐓^(XA)=P𝐓^(BY)=0P_{\hat{\mathbf{T}}^{\perp}}(XA^{\top})=P_{\hat{\mathbf{T}}^{\perp}}(BY^{\top})=0 for any AT×rA\in\mathbb{R}^{T\times r} and Bn×rB\in\mathbb{R}^{n\times r}. Hence,

Δ3k(IU^U^)(XHX,YX)(YHX,YY)(IV^V^)FkXHX,YXFYHX,YYFkρ2σ2r2κnlog12.5(n)σmin,\displaystyle\begin{split}\left\lVert\Delta^{3}\right\rVert&\leq\sqrt{k}\left\lVert(I-\hat{U}\hat{U}^{\top})(XH_{X,Y}-X^{\star})(YH_{X,Y}-Y^{\star})^{\top}(I-\hat{V}\hat{V}^{\top})\right\rVert_{\mathrm{F}}\\ &\leq\sqrt{k}\left\lVert XH_{X,Y}-X^{\star}\right\rVert_{\mathrm{F}}\left\lVert YH_{X,Y}-Y^{\star}\right\rVert_{\mathrm{F}}\\ &\lesssim\sqrt{k}\rho^{2}\\ &\lesssim\frac{\sigma^{2}r^{2}\kappa n\log^{12.5}(n)}{\sigma_{\min}},\end{split} (17)

where the last step is due to the fact that k=O(logn).k=O(\log n).

Now that we have bounded σmin(D)\sigma_{\min}(D), Δ2\left\lVert\Delta^{2}\right\rVert, and Δ3\left\lVert\Delta^{3}\right\rVert, we can plug these bounds into (13) to complete the proof of the theorem.

Appendix D Proof of Lemma C.2

Throughout this section, we assume that the assumptions made in the statement of Theorem 3.5 are satisfied. The proof of Lemma C.2 is completed by combining the results of the following two lemmas.

Lemma D.1.

Any point (X,Y)(X,Y) on the gradient flow of ff starting from (X,Y)(X^{\star},Y^{\star}) satisfies (X,Y)(X,Y)\in\mathcal{B}.

Lemma D.2.

The limit (X,Y)(X,Y) of the gradient flow of ff starting from (X,Y)(X^{\star},Y^{\star}) satisfies X=X^RX=\hat{X}R and Y=Y^RY=\hat{Y}R for some rotation matrix R𝒪r×rR\in\mathcal{O}^{r\times r}.

We note our methodology differs from that of [17]. Instead of analyzing a gradient descent algorithm, we analyze the gradient flow of the function ff. This allows us to simplify the analysis by avoiding error terms due to the discretization of gradient descent.

In this section, we start by deriving the gradient of the function ff and examining some properties of its gradient flow in Section D.1. Then, we prove some technical lemmas. We extend 3.3 to a broader subset of matrices within the set \mathcal{B} in Section D.2, and establish bounds on the noise in Section D.3. Finally, we complete the proofs of Lemma D.1 in Section D.4 and Lemma D.2 in Section D.5.

D.1 The gradient of the function ff

Before we prove Lemma D.1, we need to derive the gradient of the function ff.

Define P𝐙P_{\mathbf{Z}} to be the projection operator that projects onto the subspace 𝐙\mathbf{Z}. Note that in the definition of f(X,Y)f(X,Y), the quantities mm and τ\tau are chosen to minimize the distance between OXYO-XY^{\top} and this subspace, measured in terms of Euclidean norm. Hence, we can view mm and τ\tau as coordinates of the projection of OXYO-XY^{\top} onto this subspace. This observation gives us the following equivalent definition of f(X,Y)f(X,Y):

f(X,Y)\displaystyle f(X,Y) =12P𝐙(OXY)F2+12λX,X+12λY,Y.\displaystyle=\frac{1}{2}\left\lVert P_{\mathbf{Z}^{\perp}}\left(O-XY^{\top}\right)\right\rVert_{\mathrm{F}}^{2}+\frac{1}{2}\lambda\left\langle X,X\right\rangle+\frac{1}{2}\lambda\left\langle Y,Y\right\rangle.

Consider a single entry of the matrix XX: XijX_{ij}. Note that the expression inside the Frobenius norm is linear in XijX_{ij}. Hence, we can rewrite f(X,Y)f(X,Y) as follows:

f(X,Y)\displaystyle f(X,Y) =12XijA+BF2+12λX,X+12λY,Y,\displaystyle=\frac{1}{2}\left\lVert X_{ij}A+B\right\rVert_{\mathrm{F}}^{2}+\frac{1}{2}\lambda\left\langle X,X\right\rangle+\frac{1}{2}\lambda\left\langle Y,Y\right\rangle,

for some matrices AA and BB that do not depend on XijX_{ij} (but may depend on other entries of XX, and YY).

Now we take the partial derivative of ff with respect to XijX_{ij}. Let EijE_{ij} be the matrix where all elements are zero except for the element in the ii-th row and jj-th column, which is 1. Then,

fXij\displaystyle\frac{\partial f}{\partial X_{ij}} =XijA+B,A+λXij\displaystyle=\left\langle X_{ij}A+B,A\right\rangle+\lambda X_{ij}
=(i)P𝐙(OXY),P𝐙(EijY)+λXij\displaystyle\overset{(i)}{=}\left\langle P_{\mathbf{Z}^{\perp}}\left(O-XY^{\top}\right),-P_{\mathbf{Z}^{\perp}}\left(E_{ij}Y^{\top}\right)\right\rangle+\lambda X_{ij}
=P𝐙(OXY),EijY+λXij,\displaystyle=-\left\langle P_{\mathbf{Z}^{\perp}}\left(O-XY^{\top}\right),E_{ij}Y^{\top}\right\rangle+\lambda X_{ij},

where (i)(i) is due to the fact that AA and BB were defined such that XijA+B=P𝐙(OXY).X_{ij}A+B=P_{\mathbf{Z}^{\perp}}\left(O-XY^{\top}\right). Hence,

fX\displaystyle\frac{\partial f}{\partial X} =P𝐙(OXY)Y+λX\displaystyle=-P_{\mathbf{Z}^{\perp}}\left(O-XY^{\top}\right)Y+\lambda X
=(OXYm(X,Y)𝟏i=1kτi(X,Y)Zi)Y+λX,\displaystyle=-\left(O-XY^{\top}-m(X,Y)\mathbf{1}^{\top}-\sum_{i=1}^{k}\tau_{i}(X,Y)Z_{i}\right)Y+\lambda X,

where m(X,Y),τ(X,Y):=argminm,τOXYm𝟏i=1kτiZiF2m(X,Y),\tau(X,Y):=\operatorname*{argmin}_{m,\tau}\left\lVert O-XY^{\top}-m\mathbf{1}^{\top}-\sum_{i=1}^{k}\tau_{i}Z_{i}\right\rVert^{2}_{\mathrm{F}}. We will write mm and τ\tau to represent m(X,Y)m(X,Y) and τ(X,Y)\tau(X,Y) below for notational simplicity.

Using O=XY+m𝟏+i=1kτiZi+E^O=X^{\star}Y^{\star\top}+m^{\star}\mathbf{1}^{\top}+\sum_{i=1}^{k}\tau^{\star}_{i}Z_{i}+\hat{E}, we can simplify the gradient as follows:

fX=(XYXY+(mm)𝟏+i=1k(τiτi)ZiE^)Y+λX.\displaystyle\frac{\partial f}{\partial X}=\left(XY^{\top}-X^{\star}Y^{\star\top}+(m-m^{\star})\mathbf{1}^{\top}+\sum_{i=1}^{k}(\tau_{i}-\tau_{i}^{\star})Z_{i}-\hat{E}\right)Y+\lambda X.

Because mm and τ\tau are coordinates of the projection of OXYO-XY^{\top} onto the subspace 𝐙\mathbf{Z}, we have

m𝟏+i=1kτiZi=P𝐙(OXY).m\mathbf{1}^{\top}+\sum_{i=1}^{k}\tau_{i}Z_{i}=P_{\mathbf{Z}}\left(O-XY^{\top}\right).

Additionally, because m𝟏+i=1kτiZi=OXYE^m^{\star}\mathbf{1}^{\top}+\sum_{i=1}^{k}\tau_{i}^{\star}Z_{i}=O-X^{\star}Y^{\star\top}-\hat{E}, the right hand side quantity is also in the subspace 𝐙\mathbf{Z}, so we have

m𝟏+i=1kτiZi=P𝐙(OXYE^).\displaystyle m^{\star}\mathbf{1}^{\top}+\sum_{i=1}^{k}\tau_{i}^{\star}Z_{i}=P_{\mathbf{Z}}\left(O-X^{\star}Y^{\star\top}-\hat{E}\right).

Subtracting the two equations above, we have

(mm)𝟏+i=1k(τiτi)Zi=P𝐙(XYXY+E^).\displaystyle(m-m^{\star})\mathbf{1}^{\top}+\sum_{i=1}^{k}(\tau_{i}-\tau_{i}^{\star})Z_{i}=P_{\mathbf{Z}}\left(X^{\star}Y^{\star\top}-XY^{\top}+\hat{E}\right). (18)

Using (18), we can rewrite

fX=(P𝐙(XYXYE^))Y+λX.\displaystyle\frac{\partial f}{\partial X}=\left(P_{\mathbf{Z}^{\perp}}\left(XY^{\top}-X^{\star}Y^{\star\top}-\hat{E}\right)\right)Y+\lambda X.

Similarly, we can derive

fY\displaystyle\frac{\partial f}{\partial Y} =(P𝐙(XYXYE^))X+λY.\displaystyle=\left(P_{\mathbf{Z}^{\perp}}\left(XY^{\top}-X^{\star}Y^{\star\top}-\hat{E}\right)\right)^{\top}X+\lambda Y.

D.1.1 Properties of the gradient flow

We now prove some properties of the gradient flow of the function ff starting from the point (X,Y)(X^{\star},Y^{\star}). For simplicity of notation, we define

D:=P𝐙(XYXYE^),\displaystyle D:=P_{\mathbf{Z}^{\perp}}\left(XY^{\top}-X^{\star}Y^{\star\top}-\hat{E}\right),

so we can simplify our gradients as follows:

fX=DY+λXandfY=DX+λY.\displaystyle\frac{\partial f}{\partial X}=DY+\lambda X\qquad\text{and}\qquad\frac{\partial f}{\partial Y}=D^{\top}X+\lambda Y. (19)
Lemma D.3.

Every point on the gradient flow of function f(X,Y)f(X,Y) starting from the point (X,Y)\left(X^{\star},Y^{\star}\right) satisfies XX=YY.X^{\top}X=Y^{\top}Y.

Proof.

Note that at the starting point (X,Y)\left(X^{\star},Y^{\star}\right), we have that XX=Σ=YY{X^{\star}}^{\top}X^{\star}=\Sigma^{\star}={Y^{\star}}^{\top}Y^{\star} as desired.

Next, we examine the change in the value ϕ(t):=XtXtYtYt\phi(t):={X^{t}}^{\top}X^{t}-{Y^{t}}^{\top}Y^{t} as we move along the gradient flow defined by (14). For convenience, we omit the superscript tt in XtX^{t} and YtY^{t}. Using (14), we have

(X˙Y˙)=f(XY)=(DY+λXDX+λY).\displaystyle{\begin{pmatrix}\dot{X}\\ \dot{Y}\end{pmatrix}}=-\nabla f\begin{pmatrix}X\\ Y\end{pmatrix}=-\begin{pmatrix}DY+\lambda X\\ D^{\top}X+\lambda Y\end{pmatrix}.

The derivative of ϕ\phi is

ϕ(t)\displaystyle\phi^{\prime}(t) =X˙X+XX˙Y˙YYY˙\displaystyle=\dot{X}^{\top}X+{X}^{\top}\dot{X}-\dot{Y}^{\top}Y-Y^{\top}\dot{Y}
=(DY+λX)XX(DY+λX)+(DX+λY)Y+Y(DX+λY)\displaystyle=-\left(DY+\lambda X\right)^{\top}X-X^{\top}(DY+\lambda X)+\left(D^{\top}X+\lambda Y\right)^{\top}Y+Y^{\top}(D^{\top}X+\lambda Y)
=2λ(XXYY),\displaystyle=-2\lambda\left(X^{\top}X-Y^{\top}Y\right),

showing that ϕ(t)=2λϕ(t)\phi^{\prime}(t)=-2\lambda\phi(t). As a result, recalling that ϕ(0)=0\phi(0)=0, we can solve the differential equation to obtain ϕ(t)=e2λtϕ(0)=0\phi(t)=e^{-2\lambda t}\phi(0)=0 for all tt. ∎

Lemma D.4.

Every point on the gradient flow of function f(X,Y)f(X,Y) starting from the point (X,Y)\left(X^{\star},Y^{\star}\right) satisfies Y𝟏=0.Y^{\top}\mathbf{1}=0. Furthermore, if UΣVU\Sigma V^{\top} is the SVD of XYXY^{\top}, then V𝟏=0.V^{\top}\mathbf{1}=0.

Proof.

Using XY𝟏=0X^{\star}Y^{\star\top}\mathbf{1}=0 (because XY=P𝟏(M)X^{\star}Y^{\star\top}=P_{{\mathbf{1}}^{\perp}}(M^{\star})), we have that

(XY𝟏)XY𝟏=𝟏VΣ2V𝟏=0,\displaystyle\left(X^{\star}Y^{\star\top}\mathbf{1}\right)^{\top}X^{\star}Y^{\star\top}\mathbf{1}=\mathbf{1}^{\top}V^{\star}\Sigma^{\star 2}V^{\star\top}\mathbf{1}=0,

which can only happen if V𝟏=0V^{\star\top}\mathbf{1}=0. This implies that Y𝟏=0Y^{\star\top}\mathbf{1}=0.

Using the same approach as the proof of Lemma D.3, we examine the change in the value of ϕ(t):=Yt𝟏\phi(t):={Y^{t}}^{\top}\mathbf{1} as we move along the gradient flow defined by (14). We omit the superscript tt in YtY^{t} for notational convenience. Using (14), we have Y˙=DXλY\dot{Y}=-D^{\top}X-\lambda Y. Now we compute the derivative of ϕ(t)\phi(t):

ϕ(t)\displaystyle\phi^{\prime}(t) =Y˙𝟏\displaystyle=\dot{Y}^{\top}\mathbf{1}
=XD𝟏λY𝟏\displaystyle=-X^{\top}D\mathbf{1}-\lambda Y^{\top}\mathbf{1}
=(i)λY𝟏\displaystyle\overset{(i)}{=}-\lambda Y^{\top}\mathbf{1}
=λϕ(t).\displaystyle=-\lambda\phi(t).

The equality (i) follows because D is the projection of a matrix onto \mathbf{Z}^{\perp}, a subspace orthogonal to \left\{\alpha\mathbf{1}^{\top}\mid\alpha\in\mathbb{R}^{n}\right\}; this effectively centers its rows. Because D has zero-mean rows, D\mathbf{1}=0.

Note that \phi(0)=Y^{\star\top}\mathbf{1}=0. Solving the differential equation, we have \phi(t)=e^{-\lambda t}\phi(0)=0 for all t. Using the same logic we used to show V^{\star\top}\mathbf{1}=0, we have that Y^{\top}\mathbf{1}=0 implies V^{\top}\mathbf{1}=0. ∎
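As a numerical illustration of Lemmas D.3 and D.4 (our own sketch, with an assumed toy setup: no treatment matrices, so that P_{\mathbf{Z}^{\perp}} simply centers the rows), a crude Euler discretization of the gradient flow (14) approximately preserves the balance X^{\top}X=Y^{\top}Y and keeps Y^{\top}\mathbf{1}=0 along the path.

```python
import numpy as np

# Toy setup (assumed): k = 0 treatments, so P_{Z^perp} just centers the rows.
rng = np.random.default_rng(3)
n, T, r, lam, dt = 30, 20, 2, 0.5, 1e-3

M0 = rng.standard_normal((n, T))
M0 -= M0.mean(axis=1, keepdims=True)                  # row-centered, so M0 @ 1 = 0
U, s, Vt = np.linalg.svd(M0, full_matrices=False)
Xs = U[:, :r] * np.sqrt(s[:r])                        # X* = U* (Sigma*)^{1/2}
Ys = Vt[:r].T * np.sqrt(s[:r])                        # Y* = V* (Sigma*)^{1/2}
E = 0.1 * rng.standard_normal((n, T))

center = lambda A: A - A.mean(axis=1, keepdims=True)  # P_{Z^perp} when k = 0
X, Y = Xs.copy(), Ys.copy()
for _ in range(2000):                                 # explicit Euler steps on (14)
    D = center(X @ Y.T - Xs @ Ys.T - E)
    X, Y = X - dt * (D @ Y + lam * X), Y - dt * (D.T @ X + lam * Y)

print(np.abs(X.T @ X - Y.T @ Y).max())   # small: balance holds up to discretization error
print(np.abs(Y.T @ np.ones(T)).max())    # ~0: Y^T 1 = 0 is preserved (Lemma D.4)
```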

D.2 Extending assumptions to \mathcal{B}

We prove a lemma that extends Assumptions 3.3(a), 3.3(b), and 3.3(c) to a broader subset of matrices within the set \mathcal{B}, thereby expanding the applicability of the original assumptions.

We begin by proving a useful lemma that bounds the singular values of matrices in the set \mathcal{B}. We show these values are within a constant factor of the singular value range of X^{\star} and Y^{\star}, which spans from \sqrt{\sigma_{\min}} to \sqrt{\sigma_{\max}}.

Lemma D.5.

For large enough nn and (X,Y)(X,Y)\in\mathcal{B}, we have the following for any i[r]i\in[r]:

σi(X),σi(Y)[σmin2,2σmax].\displaystyle\sigma_{i}(X),\sigma_{i}(Y)\in\left[\frac{\sqrt{\sigma_{\min}}}{2},2\sqrt{\sigma_{\max}}\right].
Proof.

The singular values of a matrix are not changed by right-multiplying by an orthogonal matrix. Hence, without loss of generality, we suppose that (X,Y)(X,Y) are rotated such that they are optimally aligned with (X,Y)(X^{\star},Y^{\star}); in other words, HX,Y=IH_{X,Y}=I. Then, (X,Y)(X,Y)\in\mathcal{B} gives us

XXFρσminlog6.5(n)κ,\displaystyle\left\lVert X-X^{\star}\right\rVert_{\mathrm{F}}\leq\rho\lesssim\frac{\sqrt{\sigma_{\min}}}{\log^{6.5}(n)\kappa},

where the last step is due to (16).

Then by Weyl’s inequality for singular values, for any i[r]i\in[r], we have

σi(X)σi(X)+XXσ1(X)+XXF2σmax.\displaystyle\sigma_{i}(X)\leq\sigma_{i}(X^{\star})+\left\lVert X-X^{\star}\right\rVert\leq\sigma_{1}(X^{\star})+\left\lVert X-X^{\star}\right\rVert_{\mathrm{F}}\leq 2\sqrt{\sigma_{\max}}.

We also have

σi(X)σi(X)XXσr(X)XXFσmin2.\displaystyle\sigma_{i}(X)\geq\sigma_{i}(X^{\star})-\left\lVert X-X^{\star}\right\rVert\geq\sigma_{r}(X^{\star})-\left\lVert X-X^{\star}\right\rVert_{\mathrm{F}}\geq\frac{\sqrt{\sigma_{\min}}}{2}.

The proof for σi(Y)\sigma_{i}(Y) is identical. ∎

Now, we present the lemma which extends 3.3 to a broader subset of matrices in \mathcal{B}.

Lemma D.6.

Suppose that (X,Y)(X,Y)\in\mathcal{B} and that there exists a rotation matrix H𝒪r×rH\in\mathcal{O}^{r\times r} such that (XH,YH)(XH,YH) is along the gradient flow of function ff starting from the point (X,Y)(X^{\star},Y^{\star}). Let m,τm,\tau denote the values that minimize f(X,Y)f(X,Y). Let XY=UΣVXY^{\top}=U\Sigma V^{\top} be the SVD of XYXY^{\top}. Let 𝐓0\mathbf{T}_{0} be the span of the tangent space of XYXY^{\top}, and denote by 𝐓\mathbf{T} the span of 𝐓0\mathbf{T}_{0} and {α𝟏αn}\left\{\alpha\mathbf{1}^{\top}\mid\alpha\in\mathbb{R}^{n}\right\}. Define Δ~1k\tilde{\Delta}^{1}\in\mathbb{R}^{k} as the vector with components Δ~i1=Zi,UV.\tilde{\Delta}^{1}_{i}=\left\langle Z_{i},UV^{\top}\right\rangle. Define D~\tilde{D} to be the matrix with entries D~ij=P𝐓(Zi),P𝐓(Zj)\tilde{D}_{ij}=\left\langle P_{{\mathbf{T}}^{\perp}}(Z_{i}),P_{{\mathbf{T}}^{\perp}}(Z_{j})\right\rangle.

If Item 3.3(a) holds, then for large enough n,

ZVF2+ZUF2\displaystyle\left\lVert ZV\right\rVert_{\mathrm{F}}^{2}+\left\lVert Z^{\top}U\right\rVert_{\mathrm{F}}^{2} 1cr12log(n)\displaystyle\leq 1-\frac{c_{r_{1}}}{2\log(n)} (20)
P𝐓0(Z)F2\displaystyle\left\lVert P_{{\mathbf{T}_{0}}^{\perp}}(Z)\right\rVert_{\mathrm{F}}^{2} cr12log(n)\displaystyle\geq\frac{c_{r_{1}}}{2\log(n)} (21)
P𝐓(Z)P𝐓(Z)\displaystyle\left\lVert P_{{\mathbf{T}}^{\perp}}(Z)-P_{{\mathbf{T}}^{\star\perp}}(Z)\right\rVert_{\star} ρκrσmin\displaystyle\lesssim\frac{\rho\sqrt{\kappa r}}{\sqrt{\sigma_{\min}}} (22)

for any Z𝐙Z\in\mathbf{Z} that has ZF=1\left\lVert Z\right\rVert_{\mathrm{F}}=1.

If Item 3.3(b) holds, then for large enough n,

D~1Δ~1i=1kP𝐓(Zi)1cr22logn.\displaystyle\|\tilde{D}^{-1}\|\left\lVert\tilde{\Delta}^{1}\right\rVert\sum_{i=1}^{k}\left\lVert P_{{\mathbf{T}}^{\perp}}(Z_{i})\right\rVert\leq 1-\frac{c_{r_{2}}}{2\log n}. (23)

If Item 3.3(c) holds, then for large enough n,

σmin(D~)cs2logn.\displaystyle\sigma_{\min}(\tilde{D})\geq\frac{c_{s}}{2\log n}. (24)
Proof of Lemma D.6.

We will make use of the following claim throughout this proof:

Claim D.7.
UUUUF,VVVVF,UVUVFρκσmin1log6.5(n).\displaystyle\left\lVert U^{\star}U^{\star\top}-UU^{\top}\right\rVert_{\mathrm{F}},\left\lVert V^{\star}V^{\star\top}-VV^{\top}\right\rVert_{\mathrm{F}},\left\lVert U^{\star}V^{\star\top}-UV^{\top}\right\rVert_{\mathrm{F}}\lesssim\frac{\rho\sqrt{\kappa}}{\sqrt{\sigma_{\min}}}\lesssim\frac{1}{\log^{6.5}(n)}.

The proofs of (20) and (21) follow the exact same logic as the proofs of (68) and (69) in the proof of Lemma 13 in [17].

Proof of (22). We use Lemma B.1 to compute the projections of the treatment matrices. Let r=(IVV)𝟏r=(I-VV^{\top})\mathbf{1} and let r=(IVV)𝟏r^{\star}=(I-V^{\star}V^{\star\top})\mathbf{1}. By Lemma D.4, we have V𝟏=V𝟏=0V^{\top}\mathbf{1}=V^{\star\top}\mathbf{1}=0. Hence, r=r=𝟏r=r^{\star}=\mathbf{1}, which allows us to simplify as follows:

P𝐓(Z)P𝐓(Z)\displaystyle\left\lVert P_{{\mathbf{T}}^{\perp}}(Z)-P_{{\mathbf{T}}^{\star\perp}}(Z)\right\rVert_{\star}
=(IUU)P𝟏(Z)(IVV)(IUU)P𝟏(Z)(IVV)\displaystyle=\left\lVert(I-UU^{\top})P_{{\mathbf{1}}^{\perp}}(Z)(I-VV^{\top})-(I-U^{\star}U^{\star\top})P_{{\mathbf{1}}^{\perp}}(Z)(I-V^{\star}V^{\star\top})\right\rVert_{\star}
(UUUU)P𝟏(Z)(IVV)+(IUU)P𝟏(Z)(VVVV)\displaystyle\leq\left\lVert(U^{\star}U^{\star\top}-UU^{\top})P_{{\mathbf{1}}^{\perp}}(Z)(I-VV^{\top})\right\rVert_{\star}+\left\lVert(I-U^{\star}U^{\star\top})P_{{\mathbf{1}}^{\perp}}(Z)(VV^{\top}-V^{\star}V^{\star\top})\right\rVert_{\star}
(i)UUUUFP𝟏(Z)Fr+P𝟏(Z)FVVVVFr\displaystyle\overset{(i)}{\lesssim}\left\lVert U^{\star}U^{\star\top}-UU^{\top}\right\rVert_{\mathrm{F}}\left\lVert P_{{\mathbf{1}}^{\perp}}(Z)\right\rVert_{\mathrm{F}}\sqrt{r}+\left\lVert P_{{\mathbf{1}}^{\perp}}(Z)\right\rVert_{\mathrm{F}}\left\lVert VV^{\top}-V^{\star}V^{\star\top}\right\rVert_{\mathrm{F}}\sqrt{r}
UUUUFr+VVVVFr\displaystyle\leq\left\lVert U^{\star}U^{\star\top}-UU^{\top}\right\rVert_{\mathrm{F}}\sqrt{r}+\left\lVert VV^{\top}-V^{\star}V^{\star\top}\right\rVert_{\mathrm{F}}\sqrt{r}
ρκrσmin,\displaystyle\lesssim\frac{\rho\sqrt{\kappa r}}{\sqrt{\sigma_{\min}}},

where (i) is due to \left\lVert A\right\rVert_{\star}\leq\left\lVert A\right\rVert_{\mathrm{F}}\sqrt{\mathrm{rank}(A)} and the fact that the projection matrices (I-VV^{\top}) and (I-U^{\star}U^{\star\top}) have spectral norm at most 1. In particular, for step (i) we note that the rank of the matrices inside the nuclear norm is O(r) because U^{\star}U^{\star\top}, UU^{\top}, V^{\star}V^{\star\top}, and VV^{\top} are all rank-r matrices, and we have that \mathrm{rank}(A+B)\leq\mathrm{rank}(A)+\mathrm{rank}(B) and \mathrm{rank}(AB)\leq\min(\mathrm{rank}(A),\mathrm{rank}(B)).

Proof of (24). Define ΔD:=D~D\Delta^{D}:=\tilde{D}-D^{\star} such that ΔijD=P𝐓(Zi),P𝐓(Zj)P𝐓(Zi),P𝐓(Zj)\Delta^{D}_{ij}=\left\langle P_{{\mathbf{T}}^{\perp}}(Z_{i}),P_{{\mathbf{T}}^{\perp}}(Z_{j})\right\rangle-\left\langle P_{{\mathbf{T}}^{\star\perp}}(Z_{i}),P_{{\mathbf{T}}^{\star\perp}}(Z_{j})\right\rangle for i,j[k]i,j\in[k]. We upper bound the magnitude of the entries of ΔD\Delta^{D}. For any i[k]i\in[k] and j[k]j\in[k], we have

|ΔijD|=|P𝐓(Zi),P𝐓(Zj)P𝐓(Zi),P𝐓(Zj)|=(i)|P𝐓(Zi),ZjP𝐓(Zi),Zj|=|P𝐓(Zi)P𝐓(Zi),Zj|(ii)P𝐓(Zi)P𝐓(Zi)F=(IUU)P𝟏(Zi)(IVV)(IUU)P𝟏(Zi)(IVV)F=(UUUU)P𝟏(Zi)(IVV)(IUU)P𝟏(Zi)(VVVV)FUUUUF+VVVVF(iii)1log6.5(n),\displaystyle\begin{split}\left\lvert\Delta^{D}_{ij}\right\rvert&=\left\lvert\left\langle P_{{\mathbf{T}}^{\perp}}(Z_{i}),P_{{\mathbf{T}}^{\perp}}(Z_{j})\right\rangle-\left\langle P_{{\mathbf{T}}^{\star\perp}}(Z_{i}),P_{{\mathbf{T}}^{\star\perp}}(Z_{j})\right\rangle\right\rvert\\ &\overset{(i)}{=}\left\lvert\left\langle P_{{\mathbf{T}}^{\perp}}(Z_{i}),Z_{j}\right\rangle-\left\langle P_{{\mathbf{T}}^{\star\perp}}(Z_{i}),Z_{j}\right\rangle\right\rvert\\ &=\left\lvert\left\langle P_{{\mathbf{T}}^{\perp}}(Z_{i})-P_{{\mathbf{T}}^{\star\perp}}(Z_{i}),Z_{j}\right\rangle\right\rvert\\ &\overset{(ii)}{\leq}\left\lVert P_{{\mathbf{T}}^{\perp}}(Z_{i})-P_{{\mathbf{T}}^{\star\perp}}(Z_{i})\right\rVert_{\mathrm{F}}\\ &=\left\lVert(I-UU^{\top})P_{{\mathbf{1}}^{\perp}}(Z_{i})(I-VV^{\top})-(I-U^{\star}U^{\star\top})P_{{\mathbf{1}}^{\perp}}(Z_{i})(I-V^{\star}V^{\star\top})\right\rVert_{\mathrm{F}}\\ &=\left\lVert(U^{\star}U^{\star\top}-UU^{\top})P_{{\mathbf{1}}^{\perp}}(Z_{i})(I-VV^{\top})-(I-U^{\star}U^{\star\top})P_{{\mathbf{1}}^{\perp}}(Z_{i})(VV^{\top}-V^{\star}V^{\star\top})\right\rVert_{\mathrm{F}}\\ &\leq\left\lVert U^{\star}U^{\star\top}-UU^{\top}\right\rVert_{\mathrm{F}}+\left\lVert V^{\star}V^{\star\top}-VV^{\top}\right\rVert_{\mathrm{F}}\\ &\overset{(iii)}{\lesssim}\frac{1}{\log^{6.5}(n)},\end{split} (25)

The equality (i) is due to the fact that for any projection operator P and matrices A and B, we have \left\langle P(A),B\right\rangle=\left\langle P(A),P(B)\right\rangle=\left\langle A,P(B)\right\rangle; (ii) is due to the Cauchy–Schwarz inequality and the fact that \left\lVert Z_{j}\right\rVert_{\mathrm{F}}=1; and (iii) follows from D.7.

By Weyl’s inequality, we have σmin(D~)σmin(D)σmax(ΔD).\sigma_{\min}(\tilde{D})\geq\sigma_{\min}(D^{\star})-\sigma_{\max}(\Delta^{D}). Now we upper bound the second term as follows:

σmax(ΔD)=ΔDkmaxl,m|Δl,mD|=O(1log5.5(n)).\sigma_{\max}(\Delta^{D})=\left\lVert\Delta^{D}\right\rVert\leq k\max_{l,m}\left\lvert\Delta^{D}_{l,m}\right\rvert=O\left(\frac{1}{\log^{5.5}(n)}\right). (26)

Putting everything together, for large enough nn,

σmin(D~)\displaystyle\sigma_{\min}(\tilde{D}) cslognσmax(ΔD)cs2logn.\displaystyle\geq\frac{c_{s}}{\log n}-\sigma_{\max}(\Delta^{D})\geq\frac{c_{s}}{2\log n}.

Proof of (23). We wish to bound A:=D~1Δ~1i=1kP𝐓(Zi).A:=\left\lVert\tilde{D}^{-1}\right\rVert\left\lVert\tilde{\Delta}^{1}\right\rVert\sum_{i=1}^{k}\left\lVert P_{{\mathbf{T}}^{\perp}}(Z_{i})\right\rVert. This quantity can be restructured as

A=(D1Δ1+E1)(i=1kP𝐓(Zi)+E2),\displaystyle A=\left(\left\lVert D^{\star-1}\right\rVert\left\lVert\Delta^{\star 1}\right\rVert+E_{1}\right)\left(\sum_{i=1}^{k}\left\lVert P_{{\mathbf{T}}^{\star\perp}}(Z_{i})\right\rVert+E_{2}\right), (27)

where DD^{\star} and Δ1\Delta^{\star 1} are defined just above 3.3, and E1E_{1} and E2E_{2} are defined as follows:

E1\displaystyle E_{1} :=D~1Δ~1D1Δ1\displaystyle:=\left\lVert\tilde{D}^{-1}\right\rVert\left\lVert\tilde{\Delta}^{1}\right\rVert-\left\lVert D^{\star-1}\right\rVert\left\lVert\Delta^{\star 1}\right\rVert
E2\displaystyle E_{2} :=i=1k(P𝐓(Zi)P𝐓(Zi)).\displaystyle:=\sum_{i=1}^{k}(\left\lVert P_{{\mathbf{T}}^{\perp}}(Z_{i})\right\rVert-\left\lVert P_{{\mathbf{T}}^{\star\perp}}(Z_{i})\right\rVert).

We bound these two quantities as follows.

Claim D.8.

We have that |E1|1log2.5n\left\lvert E_{1}\right\rvert\lesssim\frac{1}{\log^{2.5}n} and |E2|1log5.5n\left\lvert E_{2}\right\rvert\lesssim\frac{1}{\log^{5.5}n}.

To bound AA, we consider the following additional observations. Firstly, we have D1Δ1log2n\left\lVert D^{\star-1}\right\rVert\left\lVert\Delta^{\star 1}\right\rVert\lesssim\log^{2}n. This follows from the fact that σmin(D)1logn\sigma_{\min}(D^{\star})\gtrsim\frac{1}{\log n}, and Δ1\Delta^{\star 1} is a vector of length kk (where klognk\lesssim\log n) with entries that do not exceed 1. Moreover, we have i=1kP𝐓(Zi)klogn\sum_{i=1}^{k}\left\lVert P_{{\mathbf{T}}^{\star\perp}}(Z_{i})\right\rVert\leq k\lesssim\log n because for every i[k]i\in[k], P𝐓(Zi)P_{{\mathbf{T}}^{\star\perp}}(Z_{i}) has Frobenius norm (and hence spectral norm) at most 1. Now, we are ready to plug all of this into (27) to bound AA as follows.

A\displaystyle A D1Δ1i=1kP𝐓(Zi)+O(1log1.5n)\displaystyle\leq\left\lVert D^{\star-1}\right\rVert\left\lVert\Delta^{\star 1}\right\rVert\sum_{i=1}^{k}\left\lVert P_{{\mathbf{T}}^{\star\perp}}(Z_{i})\right\rVert+O\left(\frac{1}{\log^{1.5}n}\right)
(i)1cr2logn+O(1log1.5n)\displaystyle\overset{(i)}{\leq}1-\frac{c_{r_{2}}}{\log n}+O\left(\frac{1}{\log^{1.5}n}\right)
1cr22logn,\displaystyle\leq 1-\frac{c_{r_{2}}}{2\log n},

where (i)(i) is due to Item 3.3(b). ∎

Proof of D.7.

First, we note that we have

XYXYF\displaystyle\left\lVert XY^{\top}-X^{\star}Y^{\star\top}\right\rVert_{\mathrm{F}} =XHX,Y(YHX,Y)XYF\displaystyle=\left\lVert XH_{X,Y}(YH_{X,Y})^{\top}-X^{\star}Y^{\star\top}\right\rVert_{\mathrm{F}}
XHX,YXFY+XYHX,YYF\displaystyle\leq\left\lVert XH_{X,Y}-X^{\star}\right\rVert_{\mathrm{F}}\left\lVert Y\right\rVert+\left\lVert X^{\star}\right\rVert\left\lVert YH_{X,Y}-Y^{\star}\right\rVert_{\mathrm{F}}
ρσmax,\displaystyle\lesssim\rho\sqrt{\sigma_{\max}},

where Y\left\lVert Y\right\rVert is bounded by Lemma D.5. Hence, we can make use of Theorem 2 from [36], combined with the symmetric dilation trick from section C.3.2 of [3], to obtain the following. There exists an orthogonal matrix Or×rO\in\mathbb{R}^{r\times r} such that

UOUF+VOVFXYXYFσr(XY)σr+1(XY)ρσmaxσmin=ρκσmin.\displaystyle\left\lVert UO-U^{\star}\right\rVert_{\mathrm{F}}+\left\lVert VO-V^{\star}\right\rVert_{\mathrm{F}}\lesssim\frac{\left\lVert XY^{\top}-X^{\star}Y^{\star\top}\right\rVert_{\mathrm{F}}}{\sigma_{r}(X^{\star}Y^{\star\top})-\sigma_{r+1}(X^{\star}Y^{\star\top})}\lesssim\frac{\rho\sqrt{\sigma_{\max}}}{\sigma_{\min}}=\frac{\rho\sqrt{\kappa}}{\sqrt{\sigma_{\min}}}.

Now we can obtain the desired bound as follows:

UVUVF\displaystyle\left\lVert UV^{\top}-U^{\star}V^{\star\top}\right\rVert_{\mathrm{F}} =UO(VO)UVF\displaystyle=\left\lVert UO(VO)^{\top}-U^{\star}V^{\star\top}\right\rVert_{\mathrm{F}}
UOUFVO+VOVFU\displaystyle\leq\left\lVert UO-U^{\star}\right\rVert_{\mathrm{F}}\left\lVert VO\right\rVert+\left\lVert VO-V^{\star}\right\rVert_{\mathrm{F}}\left\lVert U^{\star}\right\rVert
UOUF+VOVF\displaystyle\leq\left\lVert UO-U^{\star}\right\rVert_{\mathrm{F}}+\left\lVert VO-V^{\star}\right\rVert_{\mathrm{F}}
ρκσmin\displaystyle\lesssim\frac{\rho\sqrt{\kappa}}{\sqrt{\sigma_{\min}}}
1log6.5nusing (16).\displaystyle\lesssim\frac{1}{\log^{6.5}n}\qquad\text{using \eqref{eq: rho bound}}.

We can similarly bound both UUUUF\left\lVert UU^{\top}-U^{\star}U^{\star\top}\right\rVert_{\mathrm{F}} and VVVVF\left\lVert VV^{\top}-V^{\star}V^{\star\top}\right\rVert_{\mathrm{F}}. ∎

Proof of D.8.

|E2|\left\lvert E_{2}\right\rvert can be simply bounded as follows:

|E2|\displaystyle\left\lvert E_{2}\right\rvert i=1kP𝐓(Zi)P𝐓(Zi)Fklog6.5n,\displaystyle\leq\sum_{i=1}^{k}\left\lVert P_{{\mathbf{T}}^{\perp}}(Z_{i})-P_{{\mathbf{T}}^{\star\perp}}(Z_{i})\right\rVert_{\mathrm{F}}\lesssim\frac{k}{\log^{6.5}n},

where the last bound was shown in (25). Because k=O(\log n), we have the desired bound for E_{2}.

It remains to bound |E1|\left\lvert E_{1}\right\rvert:

|E1|\displaystyle\left\lvert E_{1}\right\rvert =|D1Δ1D~1Δ~1|\displaystyle=\left|\|D^{\star-1}\|\|\Delta^{\star 1}\|-\|\tilde{D}^{-1}\|\|\tilde{\Delta}^{1}\|\right|
D1|Δ1Δ~1|+|D1D~1|Δ~1\displaystyle\leq\|D^{\star-1}\|\left|\|\Delta^{\star 1}\|-\|\tilde{\Delta}^{1}\|\right|+\left|\|D^{\star-1}\|-\|\tilde{D}^{-1}\|\right|\|\tilde{\Delta}^{1}\|
D1Δ1Δ~1A1+D1D~1Δ~1A2.\displaystyle\leq\underbrace{\|D^{\star-1}\|\|\Delta^{\star 1}-\tilde{\Delta}^{1}\|}_{A_{1}}+\underbrace{\|D^{\star-1}-\tilde{D}^{-1}\|\|\tilde{\Delta}^{1}\|}_{A_{2}}.

Bounding A1A_{1}. Applying Assumption 3.3(c), we can bound A1A_{1} as follows:

A1\displaystyle A_{1} log(n)Δ1Δ~1\displaystyle\lesssim\log(n)\left\lVert\Delta^{\star 1}-\tilde{\Delta}^{1}\right\rVert
log(n)kΔ1Δ~1\displaystyle\leq\log(n)k\left\lVert\Delta^{\star 1}-\tilde{\Delta}^{1}\right\rVert_{\infty}
=log(n)kmaxi[k]|Zi,UVUV|\displaystyle=\log(n)k\max_{i\in[k]}\left\lvert\left\langle Z_{i},U^{\star}V^{\star\top}-UV^{\top}\right\rangle\right\rvert
log(n)kUVUVF\displaystyle\leq\log(n)k\left\lVert U^{\star}V^{\star\top}-UV^{\top}\right\rVert_{\mathrm{F}}
1log4.5(n),\displaystyle\lesssim\frac{1}{\log^{4.5}(n)},

where the last step makes use of D.7.

Bounding A2A_{2}. Because D~=D+ΔD=(I+ΔDD1)D\tilde{D}=D^{\star}+\Delta^{D}=(I+\Delta^{D}D^{\star-1})D^{\star}, we can write

\displaystyle\tilde{D}^{-1}=D^{\star-1}\sum_{j=0}^{\infty}(-\Delta^{D}D^{\star-1})^{j}.

Note that this quantity is well-defined because

ΔDD1σmax(ΔD)σmin(D)1log4.5n<1,\displaystyle\left\lVert\Delta^{D}D^{\star-1}\right\rVert\leq\frac{\sigma_{\max}(\Delta^{D})}{\sigma_{\min}(D^{\star})}\lesssim\frac{1}{\log^{4.5}n}<1,

where the last bound is due to (26) and Item 3.3(c). Now, we have

\displaystyle D^{\star-1}-\tilde{D}^{-1}=D^{\star-1}\Delta^{D}D^{\star-1}\sum_{j=0}^{\infty}(-\Delta^{D}D^{\star-1})^{j}.

We use this to bound A2A_{2}. Using (26) and Item 3.3(c), and the fact that Δ~1k\|\tilde{\Delta}^{1}\|\leq k because it is a vector of length kk with all components less than or equal to 1, we have

A2σmax(ΔD)σmin(D)211σmax(ΔD)σmin(D)Δ~11log2.5n.\displaystyle A_{2}\leq\frac{\sigma_{\max}(\Delta^{D})}{\sigma_{\min}(D^{\star})^{2}}\cdot\frac{1}{1-\frac{\sigma_{\max}(\Delta^{D})}{\sigma_{\min}(D^{\star})}}\cdot\|\tilde{\Delta}^{1}\|\lesssim\frac{1}{\log^{2.5}n}.

D.3 Discussion of E^\hat{E}

Recall that the matrix E^\hat{E} consists of two components E^=E+δ\hat{E}=E+\delta. EE is a noise matrix, and δ\delta is the matrix of the approximation error. Refer to 3.4 for our assumptions on EE and δ\delta.

We need to ensure that the error E^\hat{E} does not significantly interfere with the recovery of the treatment effects. That is, we need to ensure that E^\hat{E} is not confounded with the treatment matrices ZiZ_{i} for i[k]i\in[k]. This condition is formalized and established in the following lemma.

Lemma D.9.

Let (X,Y)(X,Y)\in\mathcal{B} be along the gradient flow of function ff starting from the point (X,Y)(X^{\star},Y^{\star}), and let m,τm,\tau denote the values that minimize f(X,Y)f(X,Y). Let 𝐓\mathbf{T} be the span of the tangent space of XYXY^{\top} and {α𝟏αn}\left\{\alpha\mathbf{1}^{\top}\mid\alpha\in\mathbb{R}^{n}\right\}. Then, we have

maxi[k]|P𝐓(Zi),E^|σnr.\displaystyle\max_{i\in[k]}\left\lvert\left\langle P_{{\mathbf{T}}^{\perp}}(Z_{i}),\hat{E}\right\rangle\right\rvert\lesssim\sigma\sqrt{nr}.
Proof of Lemma D.9.

Fix any i\in[k]. We recall that \hat{E}=E+\delta. We note that rank(P_{\mathbf{T}}(A))\leq 2r+1 for any matrix A, since every element of \mathbf{T} is the sum of an element of the tangent space of XY^{\top} (which has rank at most 2r) and a matrix of the form \alpha\mathbf{1}^{\top} (which has rank at most 1). Then,

|P𝐓(Zi),E^|\displaystyle\left\lvert\left\langle P_{{\mathbf{T}}^{\perp}}(Z_{i}),\hat{E}\right\rangle\right\rvert |Zi,E|+|Zi,δ|+|P𝐓(Zi),E^|\displaystyle\leq\left\lvert\left\langle Z_{i},E\right\rangle\right\rvert+\left\lvert\left\langle Z_{i},\delta\right\rangle\right\rvert+\left\lvert\left\langle P_{\mathbf{T}}(Z_{i}),\hat{E}\right\rangle\right\rvert
(i)σn+|P𝐓(Zi),E^|\displaystyle\overset{(i)}{\lesssim}\sigma\sqrt{n}+\left\lvert\left\langle P_{\mathbf{T}}(Z_{i}),\hat{E}\right\rangle\right\rvert
(ii)σn+E^P𝐓(Zi)\displaystyle\overset{(ii)}{\leq}\sigma\sqrt{n}+\|\hat{E}\|\left\lVert P_{\mathbf{T}}(Z_{i})\right\rVert_{\star}
(iii)σn+(E+δ)2r+1P𝐓(Zi)F\displaystyle\overset{(iii)}{\leq}\sigma\sqrt{n}+(\left\lVert E\right\rVert+\left\lVert\delta\right\rVert)\sqrt{2r+1}\left\lVert P_{\mathbf{T}}(Z_{i})\right\rVert_{\mathrm{F}}
(iv)σnr.\displaystyle\overset{(iv)}{\lesssim}\sigma\sqrt{nr}.

Here, (i) is due to 3.4 |Zi,E|,|Zi,δ|σn\left\lvert\left\langle Z_{i},E\right\rangle\right\rvert,\left\lvert\left\langle Z_{i},\delta\right\rangle\right\rvert\lesssim\sigma\sqrt{n}; (ii) is due to Von Neumann’s trace inequality; (iii) is due to the inequality Arank(A)AF\left\lVert A\right\rVert_{\star}\leq\sqrt{rank(A)}\left\lVert A\right\rVert_{\mathrm{F}}; and (iv) is due to the fact that ZiF=1\left\lVert Z_{i}\right\rVert_{\mathrm{F}}=1, and our assumed upper bound on E\left\lVert E\right\rVert and δ\left\lVert\delta\right\rVert from 3.4. ∎

We now argue that the assumptions on the noise matrix EE from 3.4 are mild. The following lemma shows that under standard sub-Gaussianity assumptions, these are satisfied with very high probability.

Lemma D.10.

Suppose that the entries of E are independent, zero-mean, sub-Gaussian random variables with sub-Gaussian norm bounded by \sigma, and that E is independent of Z_{i} for i\in[k]. Then, for n sufficiently large, with probability at least 1-e^{-n}, we have

Eσnand|Zi,E|σn,i[k].\left\lVert E\right\rVert\lesssim\sigma\sqrt{n}\quad\text{and}\quad\left\lvert\left\langle Z_{i},E\right\rangle\right\rvert\lesssim\sigma\sqrt{n},\quad\forall i\in[k].
Proof of Lemma D.10.

We start by proving that with probability at least 12en1-2e^{-n}, we have Eσn\left\lVert E\right\rVert\lesssim\sigma\sqrt{n}. We use the following result bounding the norm of matrices with sub-Gaussian entries.

Theorem D.11 (Theorem 4.4.5, [32]).

Let AA be an m×nm\times n random matrix whose entries AijA_{ij} are independent, mean zero, sub-Gaussian random variables with sub-Gaussian norm bounded by σ\sigma. Then, for any t>0t>0, we have

Aσ(m+n+t)\displaystyle\left\lVert A\right\rVert\lesssim\sigma(\sqrt{m}+\sqrt{n}+t)

with probability at least 12exp(t2)1-2\exp(-t^{2}).

As a result of the above theorem, recalling that E is an n\times T matrix with T\lesssim n and using t=\sqrt{2n}, we have \left\lVert E\right\rVert\lesssim\sigma\sqrt{n} with probability at least 1-2\exp(-2n).

We next turn to the second claim. The general version of Hoeffding’s inequality states that: for zero-mean independent sub-Gaussian random variables X1,,XnX_{1},\dots,X_{n}, we have

Pr[|i=1nXi|t]2exp(ct2i=1nXiψ22),\displaystyle\Pr\left[\left\lvert\sum_{i=1}^{n}X_{i}\right\rvert\geq t\right]\leq 2\exp\left(-\frac{ct^{2}}{\sum_{i=1}^{n}\left\lVert X_{i}\right\rVert_{\psi_{2}}^{2}}\right),

for some absolute constant c>0c>0. Hence, for any i[k]i\in[k]:

Pr[|Zi,E|σ2n/c]2exp(2σ2nσ2ZiF2)2e2n.\displaystyle\Pr\left[\left\lvert\left\langle Z_{i},E\right\rangle\right\rvert\geq\sigma\sqrt{2n/c}\right]\leq 2\exp\left(-\frac{2\sigma^{2}n}{\sigma^{2}\left\lVert Z_{i}\right\rVert_{\mathrm{F}}^{2}}\right)\leq 2e^{-2n}.

Applying the Union Bound over all i[k]i\in[k]:

\displaystyle\Pr\left[\max_{i\in[k]}\left\lvert\left\langle Z_{i},E\right\rangle\right\rvert\geq\sigma\sqrt{2n/c}\right]\leq 2ke^{-2n}\lesssim 2\log(n)e^{-2n}.

Applying the Union Bound shows that the two desired claims hold with probability at least 12e2n(1+logn)1en1-2e^{-2n}(1+\log n)\geq 1-e^{-n} for sufficiently large nn. ∎

D.4 Proof of Lemma D.1

If the gradient flow of f started at (X^{\star},Y^{\star}) never intersects the boundary of \mathcal{B}, then we are done. For the rest of the proof, we suppose that such an intersection exists. Fix any point (X,Y) at the intersection of the boundary of \mathcal{B} and the gradient flow of f. We aim to show that, at this boundary point, the inner product of the outward-pointing normal vector to the region and the gradient of f is positive. This property implies that the gradient flow of f, initiated from any point inside \mathcal{B}, cannot exit \mathcal{B}, and establishing it is therefore sufficient to prove Lemma D.1.

We denote by HH the rotation matrix HX,YH_{X,Y} that optimally aligns the fixed boundary point (X,Y)(X,Y) to (X,Y)(X^{\star},Y^{\star}). Consider

H={(X,Y)XHXF2+YHYF2ρ2}.\displaystyle\mathcal{B}_{H}=\left\{(X^{\prime},Y^{\prime})\mid\left\lVert X^{\prime}H-X^{\star}\right\rVert_{\mathrm{F}}^{2}+\left\lVert Y^{\prime}H-Y^{\star}\right\rVert_{\mathrm{F}}^{2}\leq\rho^{2}\right\}.

Note that \mathcal{B}_{H} differs from \mathcal{B} because H is fixed to be the rotation matrix associated with the given point (X,Y). \mathcal{B}_{H} and \mathcal{B} are tangent to each other at (X,Y), so the normal vectors to \mathcal{B}_{H} and \mathcal{B} are collinear at (X,Y). Thus, it suffices to show that the normal vector to \mathcal{B}_{H} at (X,Y) and the gradient of f at (X,Y) have a positive inner product.

To simplify our notation, we define

F:=[XY],F:=[XY],andfF:=[fXfY].F:=\left[\begin{array}[]{c}X\\ Y\end{array}\right],\qquad{F}^{\star}:=\left[\begin{array}[]{c}X^{\star}\\ Y^{\star}\end{array}\right],\qquad\text{and}\qquad\frac{\partial f}{\partial F}:=\left[\begin{array}[]{c}\frac{\partial f}{\partial X}\\ \frac{\partial f}{\partial Y}\end{array}\right].

Additionally, we define ΔX:=XHX\Delta_{X}:=XH-X^{\star}, ΔY:=YHY\Delta_{Y}:=YH-Y^{\star}, and ΔF:=FHF\Delta_{\mathrm{F}}:=FH-F^{\star}.

The normal vector to \mathcal{B}_{H} at (X,Y) is simply the gradient of \left\lVert XH-X^{\star}\right\rVert_{\mathrm{F}}^{2}+\left\lVert YH-Y^{\star}\right\rVert_{\mathrm{F}}^{2} at (X,Y), which is 2(FH-F^{\star})H^{\top}. Hence, we aim to show that the following inner product is positive:

Γ:=(FHF)H,fF=\displaystyle\Gamma:=\left\langle(FH-F^{\star})H^{\top},\frac{\partial f}{\partial F}\right\rangle= ΔX,P𝐙(XYXYE^)YH+λXH\displaystyle\left\langle\Delta_{X},P_{\mathbf{Z}^{\perp}}\left(XY^{\top}-X^{\star}Y^{\star\top}-\hat{E}\right)YH+\lambda XH\right\rangle
+\displaystyle+ ΔY,P𝐙(XYXYE^)XH+λYH,\displaystyle\left\langle\Delta_{Y},P_{\mathbf{Z}^{\perp}}\left(XY^{\top}-X^{\star}Y^{\star\top}-\hat{E}\right)^{\top}XH+\lambda YH\right\rangle,

where we used the formula for the gradient of ff from Eq (19). In the remainder of this subsection, we will use the following shorthand notations for simplicity: we denote XHXH, YHYH, and FHFH as XX, YY, and FF respectively. The above inner product Γ\Gamma becomes:

Γ=ΔXY+XΔY,P𝐙(XYXY)A3ΔX,P𝐙(E^)YΔY,P𝐙(E^)XA1\displaystyle\Gamma=\underbrace{\left\langle\Delta_{X}Y^{\top}+X\Delta_{Y}^{\top},P_{\mathbf{Z}^{\perp}}\left(XY^{\top}-X^{\star}Y^{\star\top}\right)\right\rangle}_{A_{3}}-\underbrace{\left\langle\Delta_{X},P_{\mathbf{Z}^{\perp}}(\hat{E})Y\right\rangle-\left\langle\Delta_{Y},P_{\mathbf{Z}^{\perp}}(\hat{E})^{\top}X\right\rangle}_{A_{1}}
+ΔX,λX+ΔY,λYA2.\displaystyle+\underbrace{\left\langle\Delta_{X},\lambda X\right\rangle+\left\langle\Delta_{Y},\lambda Y\right\rangle}_{A_{2}}.

By Item 3.4(a), we have E^E+δσn\|\hat{E}\|\leq\|E\|+\|\delta\|\lesssim\sigma\sqrt{n}. In particular,

|A1|\displaystyle\left\lvert A_{1}\right\rvert ΔXFE^YF+ΔYFE^XF\displaystyle\lesssim\left\lVert\Delta_{X}\right\rVert_{\mathrm{F}}\|\hat{E}\|\left\lVert Y\right\rVert_{\mathrm{F}}+\left\lVert\Delta_{Y}\right\rVert_{\mathrm{F}}\|\hat{E}\|\left\lVert X\right\rVert_{\mathrm{F}}
E^ΔFFFF\displaystyle\lesssim\|\hat{E}\|\left\lVert\Delta_{\mathrm{F}}\right\rVert_{\mathrm{F}}\left\lVert F\right\rVert_{\mathrm{F}}
(i)σnΔFFFF,\displaystyle\overset{(i)}{\lesssim}\sigma\sqrt{n}\left\lVert\Delta_{\mathrm{F}}\right\rVert_{\mathrm{F}}\left\lVert F^{\star}\right\rVert_{\mathrm{F}},

where (i) is because FFFFΔFF\left\lVert F\right\rVert_{\mathrm{F}}-\left\lVert F^{\star}\right\rVert_{\mathrm{F}}\leq\left\lVert\Delta_{\mathrm{F}}\right\rVert_{\mathrm{F}} by the Triangle Inequality, and (X,Y)(X,Y)\in\mathcal{B} gives ΔFFρFF\left\lVert\Delta_{\mathrm{F}}\right\rVert_{\mathrm{F}}\leq\rho\lesssim\left\lVert F^{\star}\right\rVert_{\mathrm{F}}, where the last step is due to the assumed bound on σnσmin\frac{\sigma\sqrt{n}}{\sigma_{\min}}.

Recalling λ=Θ(σnrlog4.5(n))\lambda=\Theta\left(\sigma\sqrt{nr}\log^{4.5}(n)\right), we bound |A2|\left\lvert A_{2}\right\rvert:

|A2|\displaystyle|A_{2}| λΔXFXF+λΔYFYF\displaystyle\leq\lambda\left\lVert\Delta_{X}\right\rVert_{\mathrm{F}}\left\lVert X\right\rVert_{\mathrm{F}}+\lambda\left\lVert\Delta_{Y}\right\rVert_{\mathrm{F}}\left\lVert Y\right\rVert_{\mathrm{F}}
2λΔFFFF\displaystyle\leq 2\lambda\left\lVert\Delta_{\mathrm{F}}\right\rVert_{\mathrm{F}}\left\lVert F\right\rVert_{\mathrm{F}}
σnrlog4.5(n)ΔFFFF.\displaystyle\lesssim\sigma\sqrt{nr}\log^{4.5}(n)\left\lVert\Delta_{\mathrm{F}}\right\rVert_{\mathrm{F}}\left\lVert F^{\star}\right\rVert_{\mathrm{F}}.

Finally, we will lower bound A_{3}. Let Z\in\mathbf{Z} with \left\lVert Z\right\rVert_{\mathrm{F}}=1 be the normalized projection of \Delta_{X}Y^{\top}+X\Delta_{Y}^{\top} onto \mathbf{Z} (if this projection is zero, take any unit-norm element of \mathbf{Z}), so that \left\lVert P_{\mathbf{Z}}(\Delta_{X}Y^{\top}+X\Delta_{Y}^{\top})\right\rVert_{\mathrm{F}}^{2}=\left\langle Z,\Delta_{X}Y^{\top}+X\Delta_{Y}^{\top}\right\rangle^{2}. Then, we have the following:

A3\displaystyle A_{3} =ΔXY+XΔY,P𝐙(XYXY)\displaystyle=\left\langle\Delta_{X}Y^{\top}+X\Delta_{Y}^{\top},P_{\mathbf{Z}^{\perp}}\left(XY^{\top}-X^{\star}Y^{\star\top}\right)\right\rangle
=ΔXY+XΔY,P𝐙(ΔXY+XΔYΔXΔY)\displaystyle=\left\langle\Delta_{X}Y^{\top}+X\Delta_{Y}^{\top},P_{\mathbf{Z}^{\perp}}\left(\Delta_{X}Y^{\top}+X\Delta_{Y}^{\top}-\Delta_{X}\Delta_{Y}^{\top}\right)\right\rangle
=P𝐙(ΔXY+XΔY)F2ΔXY+XΔY,P𝐙(ΔXΔY)B0\displaystyle=\left\lVert P_{\mathbf{Z}^{\perp}}\left(\Delta_{X}Y^{\top}+X\Delta_{Y}^{\top}\right)\right\rVert_{\mathrm{F}}^{2}-\underbrace{\left\langle\Delta_{X}Y^{\top}+X\Delta_{Y}^{\top},P_{\mathbf{Z}^{\perp}}\left(\Delta_{X}\Delta_{Y}^{\top}\right)\right\rangle}_{B_{0}}
=ΔXY+XΔYF2P𝐙(ΔXY+XΔY)F2B0\displaystyle=\left\lVert\Delta_{X}Y^{\top}+X\Delta_{Y}^{\top}\right\rVert_{\mathrm{F}}^{2}-\left\lVert P_{\mathbf{Z}}\left(\Delta_{X}Y^{\top}+X\Delta_{Y}^{\top}\right)\right\rVert_{\mathrm{F}}^{2}-B_{0}
=ΔXY+XΔYF2Z,ΔXY+XΔY2B0\displaystyle=\left\lVert\Delta_{X}Y^{\top}+X\Delta_{Y}^{\top}\right\rVert_{\mathrm{F}}^{2}-\left\langle Z,\Delta_{X}Y^{\top}+X\Delta_{Y}^{\top}\right\rangle^{2}-B_{0}
=(i)ZF2ΔXY+XΔYF2P𝐓0(Z),ΔXY+XΔY2B0\displaystyle\overset{(i)}{=}\left\lVert Z\right\rVert_{\mathrm{F}}^{2}\left\lVert\Delta_{X}Y^{\top}+X\Delta_{Y}^{\top}\right\rVert_{\mathrm{F}}^{2}-\left\langle P_{\mathbf{T}_{0}}(Z),\Delta_{X}Y^{\top}+X\Delta_{Y}^{\top}\right\rangle^{2}-B_{0}
ZF2ΔXY+XΔYF2P𝐓0(Z)F2ΔXY+XΔYF2B0\displaystyle\geq\left\lVert Z\right\rVert_{\mathrm{F}}^{2}\left\lVert\Delta_{X}Y^{\top}+X\Delta_{Y}^{\top}\right\rVert_{\mathrm{F}}^{2}-\left\lVert P_{\mathbf{T}_{0}}(Z)\right\rVert_{\mathrm{F}}^{2}\left\lVert\Delta_{X}Y^{\top}+X\Delta_{Y}^{\top}\right\rVert_{\mathrm{F}}^{2}-B_{0}
=P𝐓0(Z)F2ΔXY+XΔYF2B0\displaystyle=\left\lVert P_{{\mathbf{T}_{0}}^{\perp}}(Z)\right\rVert_{\mathrm{F}}^{2}\left\lVert\Delta_{X}Y^{\top}+X\Delta_{Y}^{\top}\right\rVert_{\mathrm{F}}^{2}-B_{0}
(ii)cr12log(n)ΔXY+XΔYF2B0,\displaystyle\overset{(ii)}{\geq}\frac{c_{r_{1}}}{2\log(n)}\left\lVert\Delta_{X}Y^{\top}+X\Delta_{Y}^{\top}\right\rVert_{\mathrm{F}}^{2}-B_{0},

where (i)(i) is due to the fact that ZF=1\left\lVert Z\right\rVert_{\mathrm{F}}=1 and ΔXY+XΔY\Delta_{X}Y^{\top}+X\Delta_{Y}^{\top} is already in the subspace 𝐓0\mathbf{T}_{0} by the definition of 𝐓0={UA+BVAT×r,Bn×r}\mathbf{T}_{0}=\left\{UA^{\top}+BV^{\top}\mid A\in\mathbb{R}^{T\times r},B\in\mathbb{R}^{n\times r}\right\} as the tangent space of XYXY^{\top}; (ii)(ii) is due to Lemma D.6. Note that Lemma D.6 applies because 𝐓0\mathbf{T}_{0} is the tangent space of XYXY^{\top}, and (X,Y)(X,Y) was fixed to be on the gradient flow. Furthermore, we can bound |B0|\left\lvert B_{0}\right\rvert as follows:

|B0|\displaystyle\left\lvert B_{0}\right\rvert ΔXFΔYF(ΔXYF+XΔYF)\displaystyle\leq\left\lVert\Delta_{X}\right\rVert_{\mathrm{F}}\left\lVert\Delta_{Y}\right\rVert_{\mathrm{F}}\left(\left\lVert\Delta_{X}Y^{\top}\right\rVert_{\mathrm{F}}+\left\lVert X\Delta_{Y}^{\top}\right\rVert_{\mathrm{F}}\right)
ΔXFΔYF(ΔXFY+ΔYFX)\displaystyle\leq\left\lVert\Delta_{X}\right\rVert_{\mathrm{F}}\left\lVert\Delta_{Y}\right\rVert_{\mathrm{F}}\left(\left\lVert\Delta_{X}\right\rVert_{\mathrm{F}}\left\lVert Y\right\rVert+\left\lVert\Delta_{Y}\right\rVert_{\mathrm{F}}\left\lVert X\right\rVert\right)
(i)ΔFF3F\displaystyle\overset{(i)}{\lesssim}\left\lVert\Delta_{\mathrm{F}}\right\rVert_{\mathrm{F}}^{3}\left\lVert F^{\star}\right\rVert
ρ2ΔFFF\displaystyle\leq\rho^{2}\left\lVert\Delta_{\mathrm{F}}\right\rVert_{\mathrm{F}}\left\lVert F^{\star}\right\rVert
(ii)σ2r2κnlog12(n)σminΔFFF\displaystyle\overset{(ii)}{\lesssim}\frac{\sigma^{2}r^{2}\kappa n\log^{12}(n)}{\sigma_{\min}}\left\lVert\Delta_{\mathrm{F}}\right\rVert_{\mathrm{F}}\left\lVert F^{\star}\right\rVert
σnΔFFF,\displaystyle\lesssim\sigma\sqrt{n}\left\lVert\Delta_{\mathrm{F}}\right\rVert_{\mathrm{F}}\left\lVert F^{\star}\right\rVert,

where (i)(i) makes use of Lemma D.5, and (ii)(ii) follows from the bound on ρ\rho from (15), and the last step follows from the bound on σnσmin\frac{\sigma\sqrt{n}}{\sigma_{\min}}.

Now, we are ready to put everything together to bound the inner product:

\displaystyle\Gamma=A_{3}-A_{1}+A_{2}\geq A_{3}-\left\lvert A_{1}\right\rvert-\left\lvert A_{2}\right\rvert
\displaystyle\geq\frac{c_{r_{1}}}{2\log(n)}\left\lVert\Delta_{X}Y^{\top}+X\Delta_{Y}^{\top}\right\rVert_{\mathrm{F}}^{2}-\left\lvert B_{0}\right\rvert-\left\lvert A_{1}\right\rvert-\left\lvert A_{2}\right\rvert.

Putting previous bounds together, we have

|A1|+|A2|+|B0|σnrlog4.5(n)ΔFFFF.\displaystyle\left\lvert A_{1}\right\rvert+\left\lvert A_{2}\right\rvert+\left\lvert B_{0}\right\rvert\lesssim\sigma\sqrt{nr}\log^{4.5}(n)\left\lVert\Delta_{\mathrm{F}}\right\rVert_{\mathrm{F}}\left\lVert F^{\star}\right\rVert_{\mathrm{F}}.

On the other hand, we can lower bound the positive term of Γ\Gamma using the following lemma.

Lemma D.12.

Consider a point (X,Y)(X,Y) on the gradient flow of function ff starting from the point (X,Y)\left(X^{\star},Y^{\star}\right), such that (X,Y)(X,Y)\in\mathcal{B}. Then,

ΔXY+XΔYF2σmin4ΔFF2.\displaystyle\left\lVert\Delta_{X}Y^{\top}+X\Delta_{Y}^{\top}\right\rVert_{\mathrm{F}}^{2}\geq\frac{\sigma_{\min}}{4}\left\lVert\Delta_{\mathrm{F}}\right\rVert^{2}_{\mathrm{F}}.

By Lemma D.12 we have

cr12log(n)ΔXY+XΔYF2cr12log(n)σmin4ΔFF2\displaystyle\frac{c_{r_{1}}}{2\log(n)}\left\lVert\Delta_{X}Y^{\top}+X\Delta_{Y}^{\top}\right\rVert_{\mathrm{F}}^{2}\geq\frac{c_{r_{1}}}{2\log(n)}\cdot\frac{\sigma_{\min}}{4}\left\lVert\Delta_{\mathrm{F}}\right\rVert^{2}_{\mathrm{F}}

We substitute \left\lVert\Delta_{\mathrm{F}}\right\rVert_{\mathrm{F}}=\rho because (X,Y) was fixed to lie on the boundary of the region \mathcal{B}. Hence,

cr12log(n)ΔXY+XΔYF2\displaystyle\frac{c_{r_{1}}}{2\log(n)}\left\lVert\Delta_{X}Y^{\top}+X\Delta_{Y}^{\top}\right\rVert_{\mathrm{F}}^{2} σminlog(n)ΔFFρ\displaystyle\gtrsim\frac{\sigma_{\min}}{\log(n)}\left\lVert\Delta_{\mathrm{F}}\right\rVert_{\mathrm{F}}\cdot\rho
=σnrlog5(n)ΔFFFF.\displaystyle=\sigma\sqrt{nr}\log^{5}(n)\left\lVert\Delta_{\mathrm{F}}\right\rVert_{\mathrm{F}}\left\lVert F^{\star}\right\rVert_{\mathrm{F}}.

Finally, for large enough nn,

\displaystyle\Gamma\gtrsim\sigma\sqrt{nr}\log^{5}(n)\left\lVert\Delta_{\mathrm{F}}\right\rVert_{\mathrm{F}}\left\lVert F^{\star}\right\rVert_{\mathrm{F}}-\sigma\sqrt{nr}\log^{4.5}(n)\left\lVert\Delta_{\mathrm{F}}\right\rVert_{\mathrm{F}}\left\lVert F^{\star}\right\rVert_{\mathrm{F}}>0. ∎
Proof of Lemma D.12.

Claim 11 in [17] states

ΔXY+XΔYF2σmin4ΔFF2σ2n13.\displaystyle\left\lVert\Delta_{X}Y^{\top}+X\Delta_{Y}^{\top}\right\rVert_{\mathrm{F}}^{2}\geq\frac{\sigma_{\min}}{4}\left\lVert\Delta_{\mathrm{F}}\right\rVert_{\mathrm{F}}^{2}-\frac{\sigma^{2}}{n^{13}}. (28)

We can follow the steps of their proof to prove the desired statement. In particular, we note that the term σ2/n13-\sigma^{2}/n^{13} in the right-hand side of Eq (28) comes from the fact that they use the bound

XXYYFσκn15.\left\lVert X^{\top}X-Y^{\top}Y\right\rVert_{\mathrm{F}}\lesssim\frac{\sigma}{\kappa n^{15}}.

Because, by Lemma D.3, we have X^{\top}X=Y^{\top}Y exactly, we do not incur the term -\sigma^{2}/n^{13} in our lower bound. ∎

D.5 Proof of Lemma D.2

Let g(M,τ,m)g(M,\tau,m) denote the convex function that we are optimizing in (2). Recall that (M^,τ^,m^)(\hat{M},\hat{\tau},\hat{m}) is a global optimum of function gg. In this subsection, we prove Lemma D.2, which relates (M^,τ^,m^)(\hat{M},\hat{\tau},\hat{m}) to a local optimum of ff.

We begin by proving the following useful lemma, which is similar to Lemma 20 from [11], but allows X,YX,Y to have different dimensions.

Lemma D.13.

Consider matrices XX and YY such that XX=YYX^{\top}X=Y^{\top}Y. There is an SVD of XYXY^{\top} denoted by UΣVU\Sigma V^{\top} such that X=UΣ1/2RX=U\Sigma^{1/2}R and Y=VΣ1/2RY=V\Sigma^{1/2}R for some rotation matrix R𝒪r×rR\in\mathcal{O}^{r\times r}.

Proof.

Let X=U_{X}\Sigma_{X}V_{X}^{\top} and Y=U_{Y}\Sigma_{Y}V_{Y}^{\top} be their respective SVDs, with the diagonal entries of \Sigma_{X} and \Sigma_{Y} in decreasing order. Then, X^{\top}X=Y^{\top}Y implies \Sigma_{X}=\Sigma_{Y} and that the singular subspaces of V_{X} and V_{Y} coincide. Hence, there exists an SVD of Y=\tilde{U}_{Y}\Sigma_{Y}\tilde{V}_{Y}^{\top} such that V_{X}=\tilde{V}_{Y}. Then, XY^{\top}=U_{X}\Sigma_{X}^{2}\tilde{U}_{Y}^{\top}. This is an SVD of XY^{\top}, with U=U_{X}, \Sigma=\Sigma_{X}^{2}, and V=\tilde{U}_{Y}. Substituting these quantities into the SVDs of X and Y, we complete the proof that X=U\Sigma^{1/2}R and Y=V\Sigma^{1/2}R, where R=V_{X}^{\top}=\tilde{V}_{Y}^{\top}\in\mathcal{O}^{r\times r}. ∎

Let (X,Y)(X,Y) represent the limit of the gradient flow of ff from the initial point (X,Y)(X^{\star},Y^{\star}). Let mm and τ\tau be the values that minimize f(X,Y)f(X,Y). Furthermore, let the SVD of XYXY^{\top} be denoted by UΣVU\Sigma V^{\top}.

By Lemma D.3, we have that XX=YYX^{\top}X=Y^{\top}Y. Then, Lemma D.13 gives us

(IUU)X=0and(IVV)Y=0.\displaystyle(I-UU^{\top})X=0\qquad\text{and}\qquad(I-VV^{\top})Y=0. (29)

We claim that to prove Lemma D.2, it suffices to prove that XY^{\top}=\hat{M}. Indeed, if XY^{\top}=\hat{M}, then Lemma D.13 gives X=\hat{X}R and Y=\hat{Y}R for some rotation matrix R\in\mathcal{O}^{r\times r}, which proves Lemma D.2.

The proof that XY=M^XY^{\top}=\hat{M} consists of two parts. We will first establish that (XY,τ,m)(XY^{\top},\tau,m) is also an optimal point of gg by verifying the first order conditions of gg are satisfied. We will then show that gg has a unique optimal solution (M^,τ^,m^)(\hat{M},\hat{\tau},\hat{m}). Putting these two parts together establishes that XY=M^XY^{\top}=\hat{M}.

D.5.1 First order conditions of gg are satisfied

We will first show that the following first order conditions of gg are satisfied at (XY,τ,m)(XY^{\top},\tau,m).

Zl,OXYm𝟏i=1kτiZi\displaystyle\left\langle Z_{l},O-XY^{\top}-m\mathbf{1}^{\top}-\sum_{i=1}^{k}{\tau}_{i}Z_{i}\right\rangle =0for l=1,2,,k\displaystyle=0\qquad\text{for }l=1,2,\dots,k (30a)
OXYm𝟏i=1kτiZi\displaystyle O-XY^{\top}-m\mathbf{1}^{\top}-\sum_{i=1}^{k}\tau_{i}Z_{i} =λ(UV+W)\displaystyle=\lambda\left(UV^{\top}+W\right) (30b)
UW\displaystyle U^{\top}W =0\displaystyle=0 (30c)
WV\displaystyle WV =0\displaystyle=0 (30d)
W\displaystyle\left\lVert W\right\rVert 1\displaystyle\leq 1 (30e)
m\displaystyle m =1T(OXYi=1kτiZi)𝟏.\displaystyle=\frac{1}{T}\left(O-XY^{\top}-\sum_{i=1}^{k}\tau_{i}Z_{i}\right)\mathbf{1}. (30f)

We select W:=1λ(OXYm𝟏i=1kτiZi)UVW:=\frac{1}{\lambda}\left(O-XY^{\top}-m\mathbf{1}^{\top}-\sum_{i=1}^{k}\tau_{i}Z_{i}\right)-UV^{\top}. Note that (30a) and (30f) are automatically satisfied given the definition of τ\tau and mm, and (30b) is automatically satisfied by our choice of WW.

To show (30c) and (30d), we use the fact that fX=fY=0\frac{\partial f}{\partial X}=\frac{\partial f}{\partial Y}=0:

(OXYm𝟏i=1kτiZi)Y=λXand(OXYm𝟏i=1kτiZi)X=λY\displaystyle\left(O-XY^{\top}-m\mathbf{1}^{\top}-\sum_{i=1}^{k}\tau_{i}Z_{i}\right)Y=\lambda X\quad\text{and}\quad\left(O-XY^{\top}-m\mathbf{1}^{\top}-\sum_{i=1}^{k}\tau_{i}Z_{i}\right)^{\top}X=\lambda Y

Now, by Lemma D.13, we can decompose X=UΣ1/2RX=U\Sigma^{1/2}R and Y=VΣ1/2RY=V\Sigma^{1/2}R where RR is a rotation matrix. Right-multiplying the above equations by R1Σ1/2R^{-1}\Sigma^{-1/2} gives

(OXYm𝟏i=1kτiZi)V=λUand(OXYm𝟏i=1kτiZi)U=λV.\displaystyle\left(O-XY^{\top}-m\mathbf{1}^{\top}-\sum_{i=1}^{k}\tau_{i}Z_{i}\right)V=\lambda U\quad\text{and}\quad\left(O-XY^{\top}-m\mathbf{1}^{\top}-\sum_{i=1}^{k}\tau_{i}Z_{i}\right)^{\top}U=\lambda V.

Rearranging, the first equation shows that WV=0WV=0 and the second shows UW=0U^{\top}W=0.

The last step of the proof is to verify (30e). Using (30c) and (30d), we have

W\displaystyle W =(IUU)W(IVV)\displaystyle=(I-UU^{\top})W(I-VV^{\top})
=1λ(IUU)(Om𝟏i=1kτiZi)(IVV),\displaystyle=\frac{1}{\lambda}(I-UU^{\top})\left(O-m\mathbf{1}^{\top}-\sum_{i=1}^{k}\tau_{i}Z_{i}\right)(I-VV^{\top}),

where the last line is obtained by plugging in our chosen value of WW and using (29) to get rid of the XYXY^{\top} term. Plugging in O=XY+m𝟏+i=1kτiZi+E^O=X^{\star}Y^{\star\top}+m^{\star}\mathbf{1}^{\top}+\sum_{i=1}^{k}\tau^{\star}_{i}Z_{i}+\hat{E}, we have

W\displaystyle W =1λ(IUU)(XY+(mm)𝟏+i=1k(τiτi)Zi+E^)(IVV).\displaystyle=\frac{1}{\lambda}(I-UU^{\top})\left(X^{\star}Y^{\star\top}+(m^{\star}-m)\mathbf{1}^{\top}+\sum_{i=1}^{k}(\tau^{\star}_{i}-\tau_{i})Z_{i}+\hat{E}\right)(I-VV^{\top}).

We will use substitution to get rid of the (mm)𝟏(m^{\star}-m)\mathbf{1}^{\top} term. Because we have m=M𝟏Tm^{\star}=\frac{M^{\star}\mathbf{1}}{T} and

m\displaystyle m =1T(OXYi=1kτiZi)𝟏=1T(MXY+i=1k(τiτi)Zi+E^)𝟏,\displaystyle=\frac{1}{T}\left(O-XY^{\top}-\sum_{i=1}^{k}\tau_{i}Z_{i}\right)\mathbf{1}=\frac{1}{T}\left(M^{\star}-XY^{\top}+\sum_{i=1}^{k}(\tau^{\star}_{i}-\tau_{i})Z_{i}+\hat{E}\right)\mathbf{1},

this implies that

(mm)𝟏\displaystyle(m^{\star}-m)\mathbf{1}^{\top} =(XYi=1k(τiτi)ZiE^)𝟏𝟏T=P𝟏(XYi=1k(τiτi)ZiE^).\displaystyle=\left(XY^{\top}-\sum_{i=1}^{k}(\tau^{\star}_{i}-\tau_{i})Z_{i}-\hat{E}\right)\frac{\mathbf{1}\mathbf{1}^{\top}}{T}=P_{\mathbf{1}}\left(XY^{\top}-\sum_{i=1}^{k}(\tau^{\star}_{i}-\tau_{i})Z_{i}-\hat{E}\right).

Substituting this expression into our expression for WW, we have

W=1λ(IUU)(XY+i=1k(τiτi)P𝟏(Zi)+P𝟏(E^))(IVV)\displaystyle W=\frac{1}{\lambda}(I-UU^{\top})\left(X^{\star}Y^{\star\top}+\sum_{i=1}^{k}(\tau^{\star}_{i}-\tau_{i})P_{{\mathbf{1}}^{\perp}}(Z_{i})+P_{{\mathbf{1}}^{\perp}}(\hat{E})\right)(I-VV^{\top})

where the P_{\mathbf{1}}(XY^{\top}) term vanishes because (I-UU^{\top})X=0 by (29).

By Lemma D.4, we have V𝟏=0V^{\top}\mathbf{1}=0. This allows us to simplify the closed form expression for the projection given in Lemma B.1: P𝐓(A)=(IUU)P𝟏(A)(IVV)P_{{\mathbf{T}}^{\perp}}(A)=(I-UU^{\top})P_{{\mathbf{1}}^{\perp}}(A)(I-VV^{\top}). Hence,

W=1λ(IUU)XY(IVV)+1λP𝐓(E^)+1λi=1k(τiτi)P𝐓(Zi).\displaystyle W=\frac{1}{\lambda}(I-UU^{\top})\ X^{\star}Y^{\star\top}(I-VV^{\top})+\frac{1}{\lambda}P_{{\mathbf{T}}^{\perp}}(\hat{E})+\frac{1}{\lambda}\sum_{i=1}^{k}(\tau^{\star}_{i}-\tau_{i})P_{{\mathbf{T}}^{\perp}}(Z_{i}).

We can upper bound its spectral norm as follows:

λW\displaystyle\lambda\left\lVert W\right\rVert (IUU)XY(IVV)A1+E^A2+ττi=1kP𝐓(Zi)A3.\displaystyle\leq\underbrace{\left\lVert(I-UU^{\top})X^{\star}Y^{\star\top}(I-VV^{\top})\right\rVert}_{A_{1}}+\underbrace{\left\lVert\hat{E}\right\rVert}_{A_{2}}+\underbrace{\left\lVert\tau^{\star}-\tau\right\rVert\sum_{i=1}^{k}\left\lVert P_{{\mathbf{T}}^{\perp}}(Z_{i})\right\rVert}_{A_{3}}.

Now, we bound each of these terms separately.

Bounding A1A_{1}. By (29), we have (IUU)X=0(I-UU^{\top})X=0 and (IVV)Y=0(I-VV^{\top})Y=0. Hence,

\displaystyle\left\lVert(I-UU^{\top})X^{\star}Y^{\star\top}(I-VV^{\top})\right\rVert=\left\lVert(I-UU^{\top})(XH_{X,Y}-X^{\star})(YH_{X,Y}-Y^{\star})^{\top}(I-VV^{\top})\right\rVert
XHX,YXFYHX,YYF\displaystyle\leq\left\lVert XH_{X,Y}-X^{\star}\right\rVert_{\mathrm{F}}\left\lVert YH_{X,Y}-Y^{\star}\right\rVert_{\mathrm{F}}
ρ2\displaystyle\lesssim\rho^{2}
(i)σ2r2κnlog12(n)σmin\displaystyle\overset{(i)}{\lesssim}\frac{\sigma^{2}r^{2}\kappa n\log^{12}(n)}{\sigma_{\min}}
σn,\displaystyle\lesssim\sigma\sqrt{n},

where (i)(i) follows from the bound on ρ\rho from (15), and the last step follows from the bound on σnσmin\frac{\sigma\sqrt{n}}{\sigma_{\min}}.

Bounding A2A_{2}. A2A_{2} is bounded by Item 3.4(a) which gives E^E+δσn\|\hat{E}\|\leq\left\lVert E\right\rVert+\left\lVert\delta\right\rVert\lesssim\sigma\sqrt{n}.

Bounding A_{3}. Since (XY^{\top},\tau,m) satisfies (30a)–(30d) and (30f), the following decomposition holds by the same argument as in the proof of Lemma 2.1.

D~(ττ)=λΔ~1+Δ~2+Δ~3,\displaystyle\tilde{D}\left(\tau-\tau^{\star}\right)=\lambda\tilde{\Delta}^{1}+\tilde{\Delta}^{2}+\tilde{\Delta}^{3},

where D~k×k\tilde{D}\in\mathbb{R}^{k\times k} is the matrix with entries D~ij=P𝐓(Zi),P𝐓(Zj)\tilde{D}_{ij}=\langle P_{{\mathbf{T}}^{\perp}}(Z_{i}),P_{{\mathbf{T}}^{\perp}}(Z_{j})\rangle and Δ~1,Δ~2,Δ~3k\tilde{\Delta}^{1},\tilde{\Delta}^{2},\tilde{\Delta}^{3}\in\mathbb{R}^{k} are vectors with components

Δ~i1=Zi,UV,Δ~i2=Zi,P𝐓(E^),Δ~i3=Zi,P𝐓(M).\displaystyle\tilde{\Delta}^{1}_{i}=\left\langle Z_{i},UV^{\top}\right\rangle,\quad\tilde{\Delta}^{2}_{i}=\left\langle Z_{i},P_{{\mathbf{T}}^{\perp}}(\hat{E})\right\rangle,\quad\tilde{\Delta}^{3}_{i}=\left\langle Z_{i},P_{{\mathbf{T}}^{\perp}}(M^{\star})\right\rangle.

This gives:

A3\displaystyle A_{3} =D~1(λΔ~1+Δ~2+Δ~3)i=1kP𝐓(Zi)\displaystyle=\left\lVert\tilde{D}^{-1}(\lambda\tilde{\Delta}^{1}+\tilde{\Delta}^{2}+\tilde{\Delta}^{3})\right\rVert\sum_{i=1}^{k}\left\lVert P_{{\mathbf{T}}^{\perp}}(Z_{i})\right\rVert
λD~1Δ~1i=1kP𝐓(Zi)+(D~1Δ~2+D~1Δ~3)i=1kP𝐓(Zi)\displaystyle\leq\lambda\left\lVert\tilde{D}^{-1}\tilde{\Delta}^{1}\right\rVert\sum_{i=1}^{k}\left\lVert P_{{\mathbf{T}}^{\perp}}(Z_{i})\right\rVert+\left(\left\lVert\tilde{D}^{-1}\tilde{\Delta}^{2}\right\rVert+\left\lVert\tilde{D}^{-1}\tilde{\Delta}^{3}\right\rVert\right)\sum_{i=1}^{k}\left\lVert P_{{\mathbf{T}}^{\perp}}(Z_{i})\right\rVert
λ(1cr22logn)+2log2ncs(Δ~2+Δ~3),\displaystyle\leq\lambda\left(1-\frac{c_{r_{2}}}{2\log n}\right)+\frac{2\log^{2}n}{c_{s}}\left(\left\lVert\tilde{\Delta}^{2}\right\rVert+\left\lVert\tilde{\Delta}^{3}\right\rVert\right),

where the last inequality made use of Lemma D.6.

Claim D.14.

We have

Δ~2,Δ~3σnrlogn.\displaystyle\left\lVert\tilde{\Delta}^{2}\right\rVert,\left\lVert\tilde{\Delta}^{3}\right\rVert\lesssim\sigma\sqrt{nr}\log n.

With the above claim, we are ready to bound W\left\lVert W\right\rVert.

W\displaystyle\left\lVert W\right\rVert 1λ(A1+A2+A3)\displaystyle\leq\frac{1}{\lambda}\left(A_{1}+A_{2}+A_{3}\right)
1λ(A1+A2+λ(1cr22logn)+2log2ncs(Δ~2+Δ~3)).\displaystyle\leq\frac{1}{\lambda}\left(A_{1}+A_{2}+\lambda\left(1-\frac{c_{r_{2}}}{2\log n}\right)+\frac{2\log^{2}n}{c_{s}}\left(\left\lVert\tilde{\Delta}^{2}\right\rVert+\left\lVert\tilde{\Delta}^{3}\right\rVert\right)\right).

Hence, we conclude that

W\displaystyle\left\lVert W\right\rVert 1cr22logn+O(σnrlog3(n)λ)1\displaystyle\lesssim 1-\frac{c_{r_{2}}}{2\log n}+O\left(\frac{\sigma\sqrt{nr}\log^{3}(n)}{\lambda}\right)\leq 1

for large enough nn and λ=Θ(σnrlog4.5(n))\lambda=\Theta\left(\sigma\sqrt{nr}\log^{4.5}(n)\right).

Proof of D.14.

Bounding Δ~2\left\lVert\tilde{\Delta}^{2}\right\rVert. Using Lemma D.9, we have

Δ~2kmaxi[k]|P𝐓(Zi),E^|σnrlogn.\displaystyle\left\lVert\tilde{\Delta}^{2}\right\rVert\leq k\max_{i\in[k]}\left\lvert\left\langle P_{{\mathbf{T}}^{\perp}}(Z_{i}),\hat{E}\right\rangle\right\rvert\lesssim\sigma\sqrt{nr}\log n.

Bounding Δ~3\left\lVert\tilde{\Delta}^{3}\right\rVert. Δ~3\left\lVert\tilde{\Delta}^{3}\right\rVert can be bounded by following the exact same steps as the bound of Δ3\left\lVert\Delta^{3}\right\rVert (17), replacing 𝐓^\hat{\mathbf{T}}, U^\hat{U}, and V^\hat{V} with 𝐓\mathbf{T}, UU, and VV, respectively. In particular, note that Lemma D.4 gives that V𝟏=0V^{\top}\mathbf{1}=0 along the gradient flow that ends at X,YX,Y, which allows the same simplifications.

Δ~3σ2r2κnlog12.5(n)σminσn,\displaystyle\left\lVert\tilde{\Delta}^{3}\right\rVert\lesssim\frac{\sigma^{2}r^{2}\kappa n\log^{12.5}(n)}{\sigma_{\min}}\lesssim\sigma\sqrt{n},

where the last step used the assumed bound on σnσmin\frac{\sigma\sqrt{n}}{\sigma_{\min}}. ∎

D.5.2 Function gg has a unique minimizer

In this subsection, we aim to show that the convex function g(M,\tau,m) has a unique minimizer. Throughout the proof, we fix a global minimum (\hat{M},\hat{\tau},\hat{m}) of g such that \hat{M}=XY^{\top}, where (X,Y) is the limit of the gradient flow of f starting from (X^{\star},Y^{\star}). We have already shown in Section D.5.1 that (XY^{\top},\tau,m) satisfies all of the first order conditions of g; hence \hat{M}=XY^{\top} is indeed a minimizer of g. First, note that up to a bijective change of variables (namely, N:=M+\sum_{i}\tau_{i}Z_{i}+m\mathbf{1}^{\top}), minimizing g is equivalent to minimizing the following function

g~(N,τ,m)=12ONF2+λNiτiZim𝟏.\tilde{g}(N,\tau,m)=\frac{1}{2}\|O-N\|_{\mathrm{F}}^{2}+\lambda\left\|N-\sum_{i}\tau_{i}Z_{i}-m\mathbf{1}^{\top}\right\|_{\star}.

Hence, it suffices to show that g~\tilde{g} has a unique minimizer. This function is strictly convex in NN, because of the term ONF2\|O-N\|_{\mathrm{F}}^{2}. We can then fix N^\hat{N} to be the unique value of NN that minimizes g~\tilde{g}. In particular, note that N^=M^+iτ^iZi+m^𝟏\hat{N}=\hat{M}+\sum_{i}\hat{\tau}_{i}Z_{i}+\hat{m}\mathbf{1}^{\top}. It only remains to show that the following convex optimization problem

minτ,mN^iτiZim𝟏=minτ,mM^iτiZim𝟏\min_{\tau,m}\left\|\hat{N}-\sum_{i}\tau_{i}Z_{i}-m\mathbf{1}^{\top}\right\|_{\star}=\min_{\tau,m}\left\|\hat{M}-\sum_{i}\tau_{i}Z_{i}-m\mathbf{1}^{\top}\right\|_{\star} (31)

has a unique minimizer. By the definition of \hat{M}=XY^{\top}, a minimum of the right-hand side of (31) is attained at \tau=0 and m=0 (otherwise, \hat{M} would not be an optimum of g). Now consider any other optimal solution to the problem in (31), that is, a matrix Z\in\mathbf{Z} such that \|XY^{\top}\|_{\star}=\|XY^{\top}+Z\|_{\star}. Write the SVD XY^{\top}=U\Sigma V^{\top}. Recall that the subgradients of the nuclear norm are \left\{UV^{\top}+W:U^{\top}W=0,WV=0,\left\lVert W\right\rVert\leq 1\right\}. By the convexity of the nuclear norm, we have

XY+ZXYZ,UV+W,\|XY^{\top}+Z\|_{\star}-\|XY^{\top}\|_{\star}\geq\left\langle Z,UV^{\top}+W\right\rangle, (32)

for any matrix WW with UW=0U^{\top}W=0, WV=0WV=0 and W1\|W\|\leq 1. Recall that 𝐓0\mathbf{T}_{0} is defined to be the tangent space of XYXY^{\top}. Defining the SVD P𝐓0(Z)=U~Σ~V~P_{\mathbf{T}_{0}^{\perp}}(Z)=\tilde{U}\tilde{\Sigma}\tilde{V}^{\top}, we can take W=U~V~W=\tilde{U}\tilde{V}^{\top}, which gives

XY+ZXYZ,UV+P𝐓0(Z).\|XY^{\top}+Z\|_{\star}-\|XY^{\top}\|_{\star}\geq\left\langle Z,UV^{\top}\right\rangle+\|P_{\mathbf{T}_{0}^{\perp}}(Z)\|_{\star}. (33)

We now use the following claim.

Claim D.15.

Under the same assumptions as in Lemma D.6, we have

P𝐓0(Z)>|Z,UV|Z𝐙{0}.\|P_{\mathbf{T}_{0}^{\perp}}(Z)\|_{\star}>|\left\langle Z,UV^{\top}\right\rangle|\qquad\forall Z\in\mathbf{Z}\setminus\{0\}.

With this result, we have that if Z\neq 0, then \|XY^{\top}+Z\|_{\star}>\|XY^{\top}\|_{\star}, which contradicts the definition of Z. Hence, Z=0, which completes the proof that g has a unique minimizer.

Proof of D.15.

Up to changing ZZ into Z-Z, it suffices to show that for all non-zero Z𝐙Z\in\mathbf{Z}, we have P𝐓0(Z)>Z,UV\|P_{\mathbf{T}_{0}^{\perp}}(Z)\|_{\star}>\left\langle Z,UV^{\top}\right\rangle. First, using similar arguments as in (33) we show that for any matrices AA and BB such that A,B=0\left\langle A,B\right\rangle=0 with SVD A=UAΣAVAA=U_{A}\Sigma_{A}V_{A}^{\top}, we have

A+BAB,UAVA=0.\|A+B\|_{\star}-\|A\|_{\star}\geq\left\langle B,U_{A}V_{A}^{\top}\right\rangle=0.

Hence, A+BA\|A+B\|_{\star}\geq\|A\|_{\star}. In particular, we can take A=P𝐓(Z)A=P_{\mathbf{T}^{\perp}}(Z) and B=P𝐓0(Z)P𝐓(Z)B=P_{\mathbf{T}_{0}^{\perp}}(Z)-P_{\mathbf{T}^{\perp}}(Z) since they are orthogonal to each other due to the fact that 𝐓𝐓0\mathbf{T}^{\perp}\subset\mathbf{T}_{0}^{\perp}. Then, we have

P𝐓0(Z)P𝐓(Z).\|P_{\mathbf{T}_{0}^{\perp}}(Z)\|_{\star}\geq\|P_{\mathbf{T}^{\perp}}(Z)\|_{\star}.

Next, by Lemma D.4 we have V𝟏=0V^{\top}\mathbf{1}=0, so that Z,UV=Z,P𝟏(UV)=P𝟏(Z),P𝟏(UV)\left\langle Z,UV^{\top}\right\rangle=\left\langle Z,P_{\mathbf{1}^{\perp}}(UV^{\top})\right\rangle=\left\langle P_{\mathbf{1}^{\perp}}(Z),P_{\mathbf{1}^{\perp}}(UV^{\top})\right\rangle. The two previous steps essentially show that we can ignore the terms of the form α𝟏\alpha\mathbf{1}^{\top} within ZZ. Formally, it suffices to show that for any non-zero Zspan(Zi,i[k])Z\in span(Z_{i},i\in[k]), we have P𝐓(Z)>Z,P𝟏(UV)\|P_{\mathbf{T}^{\perp}}(Z)\|_{\star}>\left\langle Z,P_{\mathbf{1}^{\perp}}(UV^{\top})\right\rangle. We decompose such a matrix as Z=i[k]αiZiZ=\sum_{i\in[k]}\alpha_{i}Z_{i}. First, by (21) of Lemma D.6, we have that P𝐓(Z)F2cr12log(n)ZF2>0\|P_{\mathbf{T}^{\perp}}(Z)\|_{\mathrm{F}}^{2}\geq\frac{c_{r_{1}}}{2\log(n)}\|Z\|_{F}^{2}>0, which implies P𝐓(Z)>0\|P_{\mathbf{T}^{\perp}}(Z)\|>0. Then, with α:=(α1,,αk)0\alpha:=(\alpha_{1},\ldots,\alpha_{k})\neq 0, we have

P𝐓(Z)Z,P𝟏(UV)\displaystyle\|P_{\mathbf{T}^{\perp}}(Z)\|_{\star}-\left\langle Z,P_{\mathbf{1}^{\perp}}(UV^{\top})\right\rangle (i)P𝐓(Z)F2P𝐓(Z)Z,P𝟏(UV)\displaystyle\overset{(i)}{\geq}\frac{\|P_{\mathbf{T}^{\perp}}(Z)\|_{\mathrm{F}}^{2}}{\|P_{\mathbf{T}^{\perp}}(Z)\|}-\left\langle Z,P_{\mathbf{1}^{\perp}}(UV^{\top})\right\rangle
=(ii)1P𝐓(Z)(αD~ααΔ~1P𝐓(Z))\displaystyle\overset{(ii)}{=}\frac{1}{\|P_{\mathbf{T}^{\perp}}(Z)\|}\left(\alpha^{\top}\tilde{D}\alpha-\alpha^{\top}\tilde{\Delta}^{1}\cdot\|P_{\mathbf{T}^{\perp}}(Z)\|\right)
=1P𝐓(Z)(αD~ααΔ~1i[k]αiP𝐓(Zi))\displaystyle=\frac{1}{\|P_{\mathbf{T}^{\perp}}(Z)\|}\left(\alpha^{\top}\tilde{D}\alpha-\alpha^{\top}\tilde{\Delta}^{1}\cdot\left\lVert\sum_{i\in[k]}\alpha_{i}P_{\mathbf{T}^{\perp}}(Z_{i})\right\rVert\right)
1P𝐓(Z)(αD~ααΔ~1αi[k]P𝐓(Zi))a,\displaystyle\geq\frac{1}{\|P_{\mathbf{T}^{\perp}}(Z)\|}\underbrace{\left(\alpha^{\top}\tilde{D}\alpha-\alpha^{\top}\tilde{\Delta}^{1}\cdot\|\alpha\|\sum_{i\in[k]}\|P_{\mathbf{T}^{\perp}}(Z_{i})\|\right)}_{a},

where (i) is due to the fact that \|P_{\mathbf{T}^{\perp}}(Z)\|>0 together with the inequality \|A\|_{\mathrm{F}}^{2}\leq\|A\|\cdot\|A\|_{\star}, which holds because the squared Frobenius norm of A is the sum of the squares of its singular values, each of which is at most \|A\| times the corresponding singular value; (ii) uses the identity \|P_{\mathbf{T}^{\perp}}(Z)\|_{\mathrm{F}}^{2}=\left\langle\sum_{i\in[k]}\alpha_{i}P_{\mathbf{T}^{\perp}}(Z_{i}),\sum_{j\in[k]}\alpha_{j}P_{\mathbf{T}^{\perp}}(Z_{j})\right\rangle=\alpha^{\top}\tilde{D}\alpha.

By (24) of Lemma D.6, the matrix D~\tilde{D} is invertible. It is also symmetric by construction. We next define β=D~1/2α\beta=\tilde{D}^{1/2}\alpha. We obtain

a\displaystyle a =β2βD~1/2Δ~1D~1/2βi[k]P𝐓(Zi)\displaystyle=\|\beta\|^{2}-\beta^{\top}\tilde{D}^{-1/2}\tilde{\Delta}^{1}\cdot\|\tilde{D}^{-1/2}\beta\|\sum_{i\in[k]}\|P_{\mathbf{T}^{\perp}}(Z_{i})\|
β2(1D~1/2Δ~1D~1/2i[k]P𝐓(Zi))\displaystyle\geq\|\beta\|^{2}\left(1-\|\tilde{D}^{-1/2}\|\cdot\|\tilde{\Delta}^{1}\|\cdot\|\tilde{D}^{-1/2}\|\sum_{i\in[k]}\|P_{\mathbf{T}^{\perp}}(Z_{i})\|\right)
>(i)0,\displaystyle\overset{(i)}{>}0,

where in (i)(i) we used (23) of Lemma D.6 together with the fact that β0\beta\neq 0. Combining the two previous inequalities shows that

P𝐓(Z)>Z,P𝟏(UV),Zspan(Zi,i[k]).\|P_{\mathbf{T}^{\perp}}(Z)\|_{\star}>\left\langle Z,P_{\mathbf{1}^{\perp}}(UV^{\top})\right\rangle,\quad\forall Z\in span(Z_{i},i\in[k]).

This ends the proof of the claim. ∎

D.6 Proof of Lemma C.1

See C.1

Proof.

By Lemma C.2, for sufficiently large nn, we can write M^=XY\hat{M}=XY^{\top} where (X,Y)(X,Y) is the limit of the gradient flow of ff started at (X,Y)(X^{\star},Y^{\star}), and (X,Y)(X,Y)\in\mathcal{B}. Hence, the conditions for applying both Lemma D.6 and Lemma D.9 with 𝐓^\hat{\mathbf{T}}, the tangent space of M^\hat{M}, are satisfied. By (24) from Lemma D.6, we have the desired bound on σmin(D)\sigma_{\min}(D). Furthermore, we have

Δ2kΔ2lognΔ2.\displaystyle\left\lVert\Delta^{2}\right\rVert\leq\sqrt{k}\left\lVert\Delta^{2}\right\rVert_{\infty}\lesssim\sqrt{\log n}\left\lVert\Delta^{2}\right\rVert_{\infty}.

We aim to bound Δ2=maxi[k]|P𝐓^(Zi),E^|\left\lVert\Delta^{2}\right\rVert_{\infty}=\max_{i\in[k]}\left\lvert\left\langle P_{\hat{\mathbf{T}}^{\perp}}(Z_{i}),\hat{E}\right\rangle\right\rvert, where E^=E+δ\hat{E}=E+\delta. We have

|P𝐓^(Zi),E^|\displaystyle\left\lvert\left\langle P_{\hat{\mathbf{T}}^{\perp}}(Z_{i}),\hat{E}\right\rangle\right\rvert |P𝐓^(Zi)P𝐓(Zi),E^|+|P𝐓(Zi),E^|\displaystyle\leq\left\lvert\left\langle P_{\hat{\mathbf{T}}^{\perp}}(Z_{i})-P_{{\mathbf{T}}^{\star\perp}}(Z_{i}),\hat{E}\right\rangle\right\rvert+\left\lvert\left\langle P_{{\mathbf{T}}^{\star\perp}}(Z_{i}),\hat{E}\right\rangle\right\rvert
(i)P𝐓^(Zi)P𝐓(Zi)(E+δ)+|P𝐓(Zi),E^|\displaystyle\overset{(i)}{\leq}\left\lVert P_{\hat{\mathbf{T}}^{\perp}}(Z_{i})-P_{{\mathbf{T}}^{\star\perp}}(Z_{i})\right\rVert_{\star}(\left\lVert E\right\rVert+\left\lVert\delta\right\rVert)+\left\lvert\left\langle P_{{\mathbf{T}}^{\star\perp}}(Z_{i}),\hat{E}\right\rangle\right\rvert
(ii)ρκrσminσn+|P𝐓(Zi),E^|\displaystyle\overset{(ii)}{\lesssim}\frac{\rho\sqrt{\kappa r}}{\sqrt{\sigma_{\min}}}\sigma\sqrt{n}+\left\lvert\left\langle P_{{\mathbf{T}}^{\star\perp}}(Z_{i}),\hat{E}\right\rangle\right\rvert
(iii)σ2r1.5κnlog6(n)σmin+|P𝐓(Zi),P𝐓(E^)|\displaystyle\overset{(iii)}{\lesssim}\frac{\sigma^{2}r^{1.5}\kappa n\log^{6}(n)}{\sigma_{\min}}+\left\lvert\left\langle P_{{\mathbf{T}}^{\star\perp}}(Z_{i}),P_{{\mathbf{T}}^{\star\perp}}(\hat{E})\right\rangle\right\rvert

where (i)(i) is due to Von Neumann’s trace inequality, (ii)(ii) is due to (22) and Item 3.4(a), and (iii)(iii) is due to (15). Hence,

Δ2\displaystyle\left\lVert\Delta^{2}\right\rVert σ2r1.5κnlog6.5(n)σmin+log0.5(n)maxi[k]|P𝐓(Zi),P𝐓(E^)|\displaystyle\lesssim\frac{\sigma^{2}r^{1.5}\kappa n\log^{6.5}(n)}{\sigma_{\min}}+\log^{0.5}(n)\cdot\max_{i\in[k]}\left\lvert\left\langle P_{{\mathbf{T}}^{\star\perp}}(Z_{i}),P_{{\mathbf{T}}^{\star\perp}}(\hat{E})\right\rangle\right\rvert
=σ2r1.5κnlog6.5(n)σmin+log0.5(n)maxi[k]|P𝐓(Z~i),P𝐓(E+δ)|Z~iF.\displaystyle=\frac{\sigma^{2}r^{1.5}\kappa n\log^{6.5}(n)}{\sigma_{\min}}+\log^{0.5}(n)\cdot\max_{i\in[k]}\frac{\left\lvert\left\langle P_{{\mathbf{T}}^{\star\perp}}(\tilde{Z}_{i}),P_{{\mathbf{T}}^{\star\perp}}(E+\delta)\right\rangle\right\rvert}{\|\tilde{Z}_{i}\|_{\mathrm{F}}}.

Appendix E Details of Computational Experiments

In this section, we present more details and results for Section 4. Our code and data are available at this GitHub repository.

Compute resources: Our experiments utilized a high-performance computing cluster, leveraging CPU-based nodes for all experimental runs. Our home directory within the cluster provides 10 TB of storage. Experiments were conducted on nodes equipped with Intel Xeon CPUs, configured with 5 CPUs and 20 GB of RAM. We utilized approximately 40 nodes operating in parallel to conduct 200 iterations of experiments for each parameter set. While a single experiment completes within minutes on a personal laptop, executing the full suite of experiments on the cluster required approximately one day.

Method implementations: Similar to the methodology outlined in the supplemental materials of [17], we implemented PaCE using an alternating minimization technique to solve the convex optimization problem. To tune the hyperparameter λ\lambda, we initially set it to a large value and gradually reduced it until the rank of M^\hat{M} reached a pre-defined rank rr. In all of our experiments, we fixed rr to 6. We note that we did not tune this pre-defined rank, and our method might achieve better performance with a different rank.
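To make the alternating-minimization and λ-tuning procedure above concrete, the following sketch illustrates one way it could be organized, assuming the optimization problem (2) takes the form ½‖O − M − Σᵢ τᵢZᵢ − m𝟏^⊤‖_F² + λ‖M‖_* (consistent with the first-order conditions in Appendix D.5.1). The function names, update order, shrink factor, and tolerances are illustrative assumptions, not the exact settings of our implementation.

```python
import numpy as np

def svt(A, lam):
    # Singular-value soft-thresholding: the prox operator of lam * nuclear norm.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U * np.maximum(s - lam, 0.0)) @ Vt

def fit_pace(O, Zs, lam, n_iter=200):
    # Block-coordinate (alternating) minimization for the assumed objective
    #   0.5 * ||O - M - sum_i tau_i Z_i - m 1^T||_F^2 + lam * ||M||_*.
    # O: n x T outcome matrix; Zs: list of k treatment-pattern matrices (n x T).
    n, T = O.shape
    Zmat = np.column_stack([Z.ravel() for Z in Zs])   # (n*T) x k design for tau
    M, m = np.zeros((n, T)), np.zeros(n)
    tau = np.zeros(len(Zs))
    for _ in range(n_iter):
        # tau-step: least squares on the residual, holding M and m fixed.
        resid = O - M - m[:, None]
        tau = np.linalg.lstsq(Zmat, resid.ravel(), rcond=None)[0]
        treat = sum(t * Z for t, Z in zip(tau, Zs))
        # m-step: closed form (row means of the residual), holding M and tau fixed.
        m = (O - M - treat).mean(axis=1)
        # M-step: soft-threshold the singular values of the remaining residual.
        M = svt(O - treat - m[:, None], lam)
    return M, tau, m

def tune_lambda(O, Zs, target_rank=6, shrink=0.8, max_steps=100):
    # Start lambda large and shrink it until rank(M_hat) reaches target_rank.
    lam = np.linalg.norm(O, 2)                        # spectral norm of O
    for _ in range(max_steps):
        M, tau, m = fit_pace(O, Zs, lam)
        if np.linalg.matrix_rank(M, tol=1e-8) >= target_rank:
            break
        lam *= shrink
    return M, tau, m, lam
```

Each pass performs exact block updates: a least-squares step for τ, a row-mean step for m, and a singular-value soft-thresholding step for M.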

The following models from the Python econml package were implemented; the supervised learning models from sklearn listed below were used for the corresponding parameters in our implementation (see the configuration sketch after the list):

  • DML and CausalForestDML:

    • model_y: GradientBoostingRegressor

    • model_t: GradientBoostingClassifier

    • model_final: LassoCV (for DML)

  • LinearDML:

    • model_y: RandomForestRegressor

    • model_t: RandomForestClassifier

  • XLearner and ForestDRLearner:

    • models/model_regression: GradientBoostingRegressor

    • propensity_model/model_propensity: GradientBoostingClassifier

Additionally, we implemented DRLearner from the Python econml package using the default parameters. We also implemented multi_arm_causal_forest from the grf package in R.
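For concreteness, the sketch below shows how the econml estimators above might be instantiated with the listed nuisance models. The constructor arguments follow the econml documentation, but exact argument names can vary across econml versions, so this should be read as an illustration rather than our exact experiment code.

```python
from sklearn.ensemble import (GradientBoostingRegressor, GradientBoostingClassifier,
                              RandomForestRegressor, RandomForestClassifier)
from sklearn.linear_model import LassoCV
from econml.dml import DML, LinearDML, CausalForestDML
from econml.dr import DRLearner, ForestDRLearner
from econml.metalearners import XLearner

# discrete_treatment=True is an assumption here, reflecting a binary treatment.
estimators = {
    "DML": DML(model_y=GradientBoostingRegressor(),
               model_t=GradientBoostingClassifier(),
               model_final=LassoCV(), discrete_treatment=True),
    "CausalForestDML": CausalForestDML(model_y=GradientBoostingRegressor(),
                                       model_t=GradientBoostingClassifier(),
                                       discrete_treatment=True),
    "LinearDML": LinearDML(model_y=RandomForestRegressor(),
                           model_t=RandomForestClassifier(),
                           discrete_treatment=True),
    "XLearner": XLearner(models=GradientBoostingRegressor(),
                         propensity_model=GradientBoostingClassifier()),
    "ForestDRLearner": ForestDRLearner(model_regression=GradientBoostingRegressor(),
                                       model_propensity=GradientBoostingClassifier()),
    "DRLearner": DRLearner(),  # default parameters
}

# Typical usage (outcome Y, treatment T, covariates X):
#   est.fit(Y, T, X=X)
#   cate_hat = est.effect(X)
```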

Finally, we implemented matrix completion with nuclear norm minimization (MCNNM), where the matrix of observed entries is completed using nuclear norm penalization. The regularization parameter \lambda for the nuclear norm term is chosen in the same way as in our implementation of PaCE.
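As a rough illustration of MCNNM, the sketch below completes the matrix over the control (untreated) entries via a standard soft-impute iteration, i.e., repeated singular-value soft-thresholding with the treated entries filled in by the current estimate. The iteration cap and tolerance are illustrative assumptions, and our actual implementation may differ in these details.

```python
import numpy as np

def mcnnm(O, treated_mask, lam, n_iter=500, tol=1e-6):
    # Soft-impute iteration for
    #   min_L 0.5 * ||P_Omega(O - L)||_F^2 + lam * ||L||_*,
    # where Omega is the set of control (untreated) entries.
    obs = ~treated_mask                        # observed-as-control entries
    L = np.zeros_like(O, dtype=float)
    for _ in range(n_iter):
        filled = np.where(obs, O, L)           # fill treated entries with current L
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        L_new = (U * np.maximum(s - lam, 0.0)) @ Vt
        if np.linalg.norm(L_new - L) <= tol * max(np.linalg.norm(L), 1.0):
            L = L_new
            break
        L = L_new
    return L  # estimated effects on treated entries: (O - L)[treated_mask]
```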

We note that LLMs facilitated our implementation of all these methods, the experiment and visualization code, and our data processing code.

Notes about the data: We imputed missing data values for 2020 by calculating the average values from 2019 and 2021. Furthermore, in the supplemental code and data provided [available at this GitHub repository], we redacted a covariate column ‘Snap_Percentage_Vulnerable’ that was used in our experiments because it was obtained using proprietary information.
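For instance, assuming the outcome panel is stored in a wide pandas DataFrame with one (hypothetical) column per year, the 2020 imputation amounts to the following:

```python
import numpy as np
import pandas as pd

# Hypothetical wide layout: one row per region, one column per year of the outcome.
df = pd.DataFrame({"y2019": [1.0, 2.0], "y2020": [np.nan, np.nan], "y2021": [3.0, 4.0]})

# Impute the missing 2020 values as the average of the 2019 and 2021 values.
df["y2020"] = df["y2020"].fillna((df["y2019"] + df["y2021"]) / 2)
```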

Additional results: Table 2 presents the proportion of instances in which each method achieved the lowest normalized Mean Absolute Error (nMAE) across all of our experiments. It encompasses every method we benchmarked.

Table 3 and Table 4 detail the complete results for the average nMAE of each method, along with the corresponding standard deviations. Table 3 reports the nMAE values for estimating heterogeneous treatment effects across all observations. In contrast, Table 4 focuses on the nMAE values for treated observations only. Note that MCNNM only appears in Table 4, as this method does not directly give estimates of heterogeneous treatment effects for untreated observations.

In the following tables, the "All" column presents the result for all instances across all parameter settings. These settings include proportions of zip codes treated at 0.05, 0.25, 0.5, 0.75, and 1.0; either with adaptive or non-adaptive treatment application approaches; and under additive or multiplicative treatment effects. Each parameter set is tested in 200 instances, totaling 4000 instances reflected in each value within the "All" column. Subsequent columns detail outcomes when one parameter is held constant. Specifically, the "Adaptive" and "Effect" columns each summarize results from 2000 instances, while each result under the "Proportion treated" column corresponds to 800 instances. The standard deviations, shown in parentheses within Table 3 and Table 4, are calculated for each result by analyzing the variability among the instances that contribute to that particular data point.

Table 2: Proportion of instances where each method results in the lowest nMAE, across all methods considered.
Adaptive Proportion treated Effect
All N Y 0.05 0.25 0.5 0.75 1.0 Add. Mult.
SNAP
PaCE 0.59 0.58 0.60 0.26 0.54 0.76 0.75 0.65 0.64 0.54
Causal Forest 0.16 0.17 0.16 0.14 0.17 0.16 0.14 0.21 0.02 0.30
CausalForestDML 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
DML 0.11 0.21 0.00 0.12 0.13 0.08 0.09 0.11 0.17 0.04
DRLearner 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
ForestDRLearner 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
LinearDML 0.14 0.04 0.24 0.48 0.15 0.01 0.02 0.04 0.17 0.11
XLearner 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
State
PaCE 0.41 0.39 0.42 0.23 0.32 0.39 0.54 0.56 0.46 0.36
Causal Forest 0.05 0.09 0.01 0.06 0.04 0.04 0.05 0.04 0.04 0.05
CausalForestDML 0.07 0.07 0.07 0.03 0.07 0.10 0.07 0.08 0.06 0.09
DML 0.27 0.29 0.26 0.41 0.36 0.27 0.17 0.16 0.31 0.24
DRLearner 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
ForestDRLearner 0.01 0.01 0.01 0.00 0.00 0.00 0.01 0.02 0.00 0.01
LinearDML 0.02 0.01 0.04 0.02 0.06 0.01 0.01 0.02 0.02 0.02
XLearner 0.17 0.13 0.20 0.25 0.14 0.19 0.15 0.11 0.10 0.23
County
PaCE 0.06 0.10 0.02 0.16 0.08 0.04 0.02 0.01 0.07 0.06
Causal Forest 0.29 0.38 0.20 0.25 0.28 0.31 0.31 0.30 0.23 0.34
CausalForestDML 0.04 0.05 0.03 0.00 0.01 0.03 0.05 0.11 0.03 0.05
DML 0.35 0.35 0.35 0.34 0.37 0.34 0.39 0.31 0.47 0.24
DRLearner 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
ForestDRLearner 0.01 0.00 0.01 0.00 0.00 0.00 0.00 0.02 0.00 0.01
LinearDML 0.20 0.04 0.35 0.22 0.22 0.21 0.16 0.17 0.19 0.20
XLearner 0.05 0.07 0.04 0.03 0.03 0.06 0.07 0.07 0.01 0.10
Table 3: Average nMAE across methods. Standard deviations shown in parentheses.
Adaptive Proportion treated Treatment effect
All N Y 0.05 0.25 0.5 0.75 1.0 Add. Mult.
SNAP
PaCE 0.18 0.17 0.18 0.4 0.17 0.12 0.1 0.09 0.13 0.22
(0.16) (0.11) (0.2) (0.21) (0.1) (0.07) (0.06) (0.05) (0.13) (0.18)
Causal Forest (R) 0.21 0.21 0.21 0.42 0.2 0.17 0.13 0.12 0.18 0.23
(0.15) (0.07) (0.19) (0.19) (0.06) (0.04) (0.04) (0.06) (0.12) (0.17)
CausalForestDML 40.81 5.01 76.6 195.95 6.85 0.8 0.26 0.17 40.93 40.68
(115.45) (10.86) (154.86) (191.08) (6.37) (0.42) (0.05) (0.06) (115.69) (115.24)
DML 0.38 0.31 0.45 0.72 0.35 0.32 0.27 0.26 0.28 0.48
(0.29) (0.21) (0.33) (0.41) (0.17) (0.15) (0.14) (0.16) (0.27) (0.27)
DRLearner 1186.44 371.84 2001.03 107.4 1016.66 2277.78 2445.56 84.78 1095.4 1277.47
(1764.1) (462.49) (2164.23) (229.15) (815.45) (2036.92) (2339.59) (211.55) (1597.24) (1912.51)
ForestDRLearner 0.94 0.84 1.04 2.08 0.9 0.71 0.57 0.45 0.99 0.89
(7.08) (1.53) (9.9) (15.78) (0.1) (0.13) (0.17) (0.21) (9.49) (3.2)
LinearDML 0.25 0.28 0.21 0.34 0.25 0.25 0.2 0.19 0.15 0.34
(0.16) (0.16) (0.16) (0.23) (0.13) (0.12) (0.13) (0.14) (0.1) (0.16)
XLearner 0.51 0.46 0.56 0.89 0.55 0.47 0.36 0.27 0.5 0.52
(0.25) (0.17) (0.3) (0.29) (0.05) (0.04) (0.02) (0.02) (0.23) (0.27)
State
PaCE 0.28 0.34 0.22 0.57 0.32 0.21 0.17 0.14 0.25 0.32
(0.24) (0.26) (0.2) (0.3) (0.18) (0.12) (0.1) (0.08) (0.21) (0.25)
Causal Forest (R) 0.33 0.39 0.28 0.6 0.35 0.29 0.23 0.19 0.3 0.36
(0.22) (0.24) (0.18) (0.27) (0.16) (0.11) (0.08) (0.08) (0.19) (0.24)
CausalForestDML 0.75 1.02 0.48 2.76 0.36 0.25 0.21 0.19 0.76 0.74
(3.22) (4.41) (1.1) (6.85) (0.25) (0.11) (0.1) (0.08) (3.31) (3.14)
DML 0.35 0.42 0.28 0.59 0.34 0.29 0.27 0.26 0.31 0.39
(0.33) (0.41) (0.19) (0.6) (0.21) (0.17) (0.14) (0.14) (0.37) (0.28)
DRLearner 929.38 520.49 1338.07 1466.8 2905.93 159.67 68.36 47.3 891.42 967.37
(4601.05) (3268.39) (5597.16) (5776.54) (8056.84) (898.43) (348.15) (677.07) (4089.69) (5062.03)
ForestDRLearner 60.29 119.37 1.24 299.72 0.87 0.4 0.28 0.26 52.15 68.43
(629.42) (886.43) (2.42) (1382.62) (1.11) (0.21) (0.15) (0.14) (451.79) (767.07)
LinearDML 0.86 1.34 0.39 2.72 0.59 0.4 0.33 0.29 0.85 0.88
(1.56) (2.06) (0.36) (2.74) (0.41) (0.22) (0.16) (0.13) (1.63) (1.48)
XLearner 0.29 0.37 0.22 0.53 0.31 0.23 0.2 0.19 0.27 0.31
(0.19) (0.2) (0.14) (0.25) (0.13) (0.1) (0.08) (0.06) (0.16) (0.2)
County
PaCE 0.34 0.35 0.33 0.49 0.37 0.29 0.28 0.26 0.28 0.39
(0.2) (0.19) (0.22) (0.29) (0.19) (0.14) (0.12) (0.11) (0.13) (0.24)
Causal Forest (R) 0.26 0.26 0.27 0.43 0.3 0.22 0.2 0.18 0.22 0.31
(0.17) (0.13) (0.21) (0.23) (0.17) (0.12) (0.1) (0.08) (0.12) (0.21)
CausalForestDML 2.1 0.47 3.72 8.83 0.95 0.28 0.23 0.2 2.05 2.15
(5.24) (0.68) (7.01) (8.92) (0.78) (0.11) (0.09) (0.07) (5.17) (5.31)
DML 0.3 0.34 0.25 0.46 0.28 0.26 0.23 0.26 0.21 0.38
(0.25) (0.25) (0.24) (0.27) (0.2) (0.23) (0.22) (0.26) (0.17) (0.28)
DRLearner 717.31 260.77 1173.85 560.3 153.83 72.65 428.02 2371.74 396.35 1038.27
(5046.04) (834.88) (7058.49) (1091.51) (664.27) (462.94) (827.06) (11014.13) (1118.57) (7034.13)
ForestDRLearner 0.75 0.51 1 2.57 0.4 0.3 0.26 0.23 0.57 0.93
(5.99) (1.35) (8.35) (13.24) (0.15) (0.12) (0.1) (0.06) (2.93) (7.94)
LinearDML 0.36 0.47 0.24 0.61 0.34 0.29 0.26 0.28 0.28 0.44
(0.32) (0.36) (0.24) (0.42) (0.23) (0.27) (0.25) (0.27) (0.25) (0.37)
XLearner 0.32 0.33 0.32 0.5 0.36 0.29 0.25 0.23 0.29 0.36
(0.14) (0.13) (0.16) (0.17) (0.11) (0.09) (0.07) (0.04) (0.09) (0.18)
Table 4: Average nMAE among treated units. Standard deviations shown in parentheses.
Adaptive Proportion treated Treatment effect
All N Y 0.05 0.25 0.5 0.75 1.0 Add. Mult.
SNAP
PaCE 0.1 0.11 0.09 0.15 0.1 0.09 0.09 0.09 0.07 0.13
(0.08) (0.04) (0.1) (0.13) (0.06) (0.05) (0.05) (0.05) (0.03) (0.09)
Causal Forest (R) 0.13 0.18 0.07 0.18 0.13 0.12 0.11 0.1 0.11 0.14
(0.08) (0.04) (0.07) (0.11) (0.07) (0.07) (0.06) (0.06) (0.07) (0.09)
CausalForestDML 0.22 0.27 0.17 0.41 0.2 0.17 0.16 0.15 0.18 0.26
(0.24) (0.11) (0.31) (0.45) (0.09) (0.08) (0.07) (0.07) (0.12) (0.31)
DML 0.44 0.29 0.6 0.89 0.41 0.37 0.29 0.27 0.27 0.62
(0.72) (0.19) (0.98) (1.44) (0.37) (0.28) (0.17) (0.18) (0.22) (0.97)
DRLearner 2074.25 376.92 3771.58 170.11 2703.71 4134.22 3279.81 83.39 1405.57 2742.93
(4283.12) (461.16) (5542.67) (263) (5417.88) (5543.25) (4232.75) (203.48) (2139.72) (5587.94)
ForestDRLearner 4.86 1.19 8.54 20.94 1.39 0.88 0.63 0.47 4.08 5.64
(98.58) (1.92) (139.32) (219.79) (0.89) (0.45) (0.31) (0.26) (103.28) (93.67)
LinearDML 0.26 0.25 0.26 0.31 0.28 0.27 0.22 0.22 0.12 0.39
(0.26) (0.12) (0.35) (0.3) (0.29) (0.28) (0.21) (0.2) (0.09) (0.3)
MCNNM 0.19 0.24 0.14 0.27 0.18 0.17 0.16 0.16 0.16 0.21
(0.18) (0.05) (0.24) (0.34) (0.1) (0.09) (0.08) (0.08) (0.09) (0.23)
XLearner 0.92 0.44 1.41 2.67 0.75 0.54 0.38 0.28 0.58 1.27
(2.77) (0.19) (3.85) (5.81) (0.62) (0.35) (0.15) (0.07) (0.49) (3.85)
State
PaCE 0.21 0.23 0.19 0.36 0.24 0.17 0.15 0.13 0.18 0.24
(0.17) (0.19) (0.15) (0.27) (0.14) (0.09) (0.08) (0.07) (0.13) (0.2)
Causal Forest (R) 0.3 0.32 0.28 0.46 0.33 0.29 0.23 0.19 0.27 0.33
(0.22) (0.24) (0.19) (0.37) (0.17) (0.13) (0.09) (0.08) (0.15) (0.26)
CausalForestDML 0.26 0.33 0.2 0.43 0.27 0.22 0.21 0.19 0.24 0.28
(0.24) (0.31) (0.12) (0.46) (0.13) (0.1) (0.09) (0.08) (0.16) (0.3)
DML 0.34 0.38 0.29 0.51 0.34 0.29 0.27 0.26 0.3 0.37
(0.23) (0.25) (0.2) (0.34) (0.21) (0.17) (0.14) (0.14) (0.2) (0.25)
DRLearner 1704.15 791.94 2615.89 4387.86 3853.02 165.89 68.68 47.43 1413.19 1995.26
(11821.06) (5032.31) (15890.26) (23128.14) (11970.8) (1085.93) (353.98) (662.49) (6457.8) (15418.33)
ForestDRLearner 427.34 849.26 5.63 2134.78 1.45 0.46 0.29 0.26 329.6 525.13
(4126.02) (5805.96) (17.91) (9032.24) (2.32) (0.33) (0.17) (0.16) (2885.66) (5071.02)
LinearDML 0.43 0.52 0.35 0.81 0.42 0.35 0.31 0.28 0.39 0.48
(0.53) (0.65) (0.36) (1.06) (0.24) (0.17) (0.14) (0.12) (0.27) (0.7)
MCNNM 0.23 0.29 0.18 0.25 0.21 0.21 0.24 0.25 0.22 0.24
(0.17) (0.21) (0.08) (0.3) (0.11) (0.1) (0.09) (0.11) (0.11) (0.21)
XLearner 0.28 0.33 0.23 0.5 0.29 0.22 0.2 0.18 0.26 0.3
(0.31) (0.38) (0.21) (0.61) (0.12) (0.08) (0.07) (0.05) (0.15) (0.41)
County
PaCE 0.31 0.29 0.33 0.44 0.34 0.27 0.26 0.26 0.23 0.39
(0.34) (0.13) (0.46) (0.6) (0.36) (0.19) (0.13) (0.11) (0.09) (0.46)
Causal Forest (R) 0.2 0.25 0.15 0.3 0.21 0.18 0.16 0.16 0.16 0.24
(0.16) (0.12) (0.18) (0.26) (0.15) (0.11) (0.08) (0.07) (0.1) (0.19)
CausalForestDML 0.28 0.3 0.26 0.51 0.29 0.22 0.19 0.18 0.22 0.34
(0.35) (0.18) (0.46) (0.69) (0.21) (0.12) (0.09) (0.07) (0.11) (0.48)
DML 0.3 0.33 0.27 0.41 0.27 0.26 0.24 0.31 0.19 0.41
(0.34) (0.24) (0.41) (0.39) (0.26) (0.26) (0.24) (0.45) (0.15) (0.42)
DRLearner 1183.28 278.38 2088.17 1010.99 160.94 69.94 564.55 4109.95 443.32 1923.23
(10927.48) (1008.75) (15369.32) (2371.12) (956.66) (448.59) (1563.04) (24023.26) (1265.5) (15367.99)
ForestDRLearner 1.57 1.14 1.99 6.58 0.54 0.3 0.22 0.19 1.06 2.08
(12.1) (2.9) (16.86) (26.48) (0.41) (0.17) (0.1) (0.08) (5.01) (16.35)
LinearDML 0.33 0.41 0.25 0.48 0.31 0.28 0.26 0.32 0.22 0.44
(0.39) (0.29) (0.46) (0.56) (0.27) (0.28) (0.27) (0.44) (0.17) (0.5)
MCNNM 0.56 0.49 0.62 0.73 0.58 0.51 0.49 0.48 0.45 0.67
(0.63) (0.12) (0.88) (1.21) (0.57) (0.33) (0.19) (0.14) (0.11) (0.87)
XLearner 0.36 0.32 0.4 0.67 0.39 0.29 0.24 0.22 0.28 0.44
(0.52) (0.14) (0.72) (1.04) (0.34) (0.17) (0.09) (0.06) (0.12) (0.72)