
Minimax rates for heterogeneous causal effect estimation


Edward H. Kennedy1, Sivaraman Balakrishnan1,2, James M. Robins3, Larry Wasserman1,2

1Department of Statistics & Data Science, Carnegie Mellon University
2Machine Learning Department, Carnegie Mellon University
3Departments of Biostatistics and Epidemiology, Harvard University

{edward, siva, larry} @ stat.cmu.edu, [email protected]
Abstract

Estimation of heterogeneous causal effects – i.e., how effects of policies and treatments vary across subjects – is a fundamental task in causal inference. Many methods for estimating conditional average treatment effects (CATEs) have been proposed in recent years, but questions surrounding optimality have remained largely unanswered. In particular, a minimax theory of optimality has yet to be developed, with the minimax rate of convergence and construction of rate-optimal estimators remaining open problems. In this paper we derive the minimax rate for CATE estimation, in a Hölder-smooth nonparametric model, and present a new local polynomial estimator, giving high-level conditions under which it is minimax optimal. Our minimax lower bound is derived via a localized version of the method of fuzzy hypotheses, combining lower bound constructions for nonparametric regression and functional estimation. Our proposed estimator can be viewed as a local polynomial R-Learner, based on a localized modification of higher-order influence function methods. The minimax rate we find exhibits several interesting features, including a non-standard elbow phenomenon and an unusual interpolation between nonparametric regression and functional estimation rates. The latter quantifies how the CATE, as an estimand, can be viewed as a regression/functional hybrid.


Keywords: causal inference, functional estimation, higher order influence functions, nonparametric regression, optimal rates of convergence.

1 Introduction

In this paper we consider estimating the difference in regression functions

\tau(x)=\mathbb{E}(Y\mid X=x,A=1)-\mathbb{E}(Y\mid X=x,A=0)   (1)

from an iid sample of observations of Z=(X,A,Y). Let Y^{a} denote the counterfactual outcome that would have been observed under treatment level A=a. Then, under the assumptions of consistency (i.e., Y=Y^{a} if A=a), positivity (i.e., \epsilon\leq\mathbb{P}(A=1\mid X)\leq 1-\epsilon with probability one, for some \epsilon>0), and no unmeasured confounding (i.e., A\perp\!\!\!\perp Y^{a}\mid X), the quantity \tau(x) also equals the conditional average treatment effect (CATE)

\mathbb{E}(Y^{1}-Y^{0}\mid X=x).

The CATE \tau(x) gives a more individualized picture of treatment effects compared to the overall average treatment effect (ATE) \mathbb{E}(Y^{1}-Y^{0}), and plays a crucial role in many fundamental tasks in causal inference, including assessing effect heterogeneity, constructing optimal treatment policies, generalizing treatment effects to new populations, finding subgroups with enhanced effects, and more. Further, these tasks have far-reaching implications across the sciences, from personalizing medicine to optimizing voter turnout. We refer to Hernán and Robins (2020) (chapter 4), Nie and Wager (2021), Kennedy (2023), and citations therein, for general discussion and review.

The simplest approach to CATE estimation would be to assume a low-dimensional parametric model for the outcome regression \mathbb{E}(Y\mid X,A); then maximum likelihood estimates could be easily constructed, and under regularity conditions the resulting plug-in estimator would be minimax optimal. However, when X has continuous components, it is typically difficult to specify a correct parametric model, and under misspecification the previously described approach could lead to substantial bias. This suggests the need for more flexible methods. Early work in flexible CATE estimation employed semiparametric models, for example partially linear models assuming \tau(x) to be constant, or structural nested models in which \tau(x) followed some known parametric form, but leaving other parts of the distribution unspecified (Robinson, 1988; Robins et al., 1992; Robins, 1994; van der Laan and Robins, 2003; van der Laan, 2005; Vansteelandt and Joffe, 2014). An important theme in this work is that the CATE can be much more structured and simple than the rest of the data-generating process. Specifically, the individual regression functions \mu_{a}(x)=\mathbb{E}(Y\mid X=x,A=a) for each a=0,1 may be very complex (e.g., non-smooth or non-sparse), even when the difference \tau(x)=\mu_{1}(x)-\mu_{0}(x) is very smooth or sparse, or even constant or zero. We refer to Kennedy (2023) for some recent discussion of this point.

More recently there has been increased emphasis on incorporating nonparametrics and machine learning tools for CATE estimation. We briefly detail two especially relevant streams of this recent literature, based on so-called DR-Learner and R-Learner methods, both of which rely on doubly robust-style estimation. The DR-Learner is a model-free meta-algorithm first proposed by van der Laan (2005) (Section 4.2), which essentially takes the components of the classic doubly robust estimator of the ATE, and rather than averaging, instead regresses on covariates. It has since been specialized to particular methods, e.g., cross-validated ensembles (Luedtke and van der Laan, 2016), kernel (Lee et al., 2017; Fan et al., 2019; Zimmert and Lechner, 2019) and series methods (Semenova and Chernozhukov, 2017), empirical risk minimization (Foster and Syrgkanis, 2019), and linear smoothers (Kennedy, 2023). On the other hand, the R-Learner is a flexible adaptation of the double-residual regression method originally built for partially linear models (Robinson, 1988), with the first nonparametric version proposed by Robins et al. (2008) (Section 5.2) using series methods. The R-Learner has since been adapted to RKHS regression (Nie and Wager, 2021), lasso (Zhao et al., 2017; Chernozhukov et al., 2017), and local polynomials (Kennedy, 2023). Many flexible non-doubly robust methods have also been proposed in recent years, often based on inverse-weighting or direct regression estimation (Foster et al., 2011; Imai and Ratkovic, 2013; Athey and Imbens, 2016; Shalit et al., 2017; Wager and Athey, 2018; Künzel et al., 2019; Hahn et al., 2020).

Despite the wide variety of methods available for flexible CATE estimation, questions of optimality have remained mostly unsolved. Gao and Han (2020) studied minimax optimality, but in a specialized model where the propensity score has zero smoothness, and covariates are non-random; this model does not reflect the kinds of assumptions typically used in practice, e.g., in the papers cited in the previous paragraph. Some but not all of these papers derive upper bounds on the error of their proposed CATE estimators; in the best case, these take the form of an oracle error rate (which would remain even if the potential outcomes (Y^{1}-Y^{0}) were observed and regressed on covariates), plus some contribution coming from having to estimate nuisance functions (i.e., outcome regressions and propensity scores). The fastest rates we are aware of come from Foster and Syrgkanis (2019) and Kennedy (2023). Foster and Syrgkanis (2019) studied global error rates, obtaining an oracle error plus sums of squared L_{4} errors in all nuisance components. Kennedy (2023) studied pointwise error rates, giving two main results; in the first, they obtain the oracle error plus a product of nuisance errors, while in the second, they obtain a faster rate via undersmoothing (described in more detail in Section 3.3). However, since these are all upper bounds on the errors of particular procedures, it is unknown whether these rates are optimal in any sense, and if they are not, how they might be improved upon. In this paper we resolve these questions (via the minimax framework, in a nonparametric model that allows components of the data-generating process to be infinite-dimensional, yet smooth in the Hölder sense).

More specifically, in Section 3 we derive a lower bound on the minimax rate of CATE estimation, indicating the best possible (worst-case) performance of any estimator, in a model where the CATE, regression function, and propensity score are Hölder-smooth functions. Our derivation uses an adaptation of the method of fuzzy hypotheses, which is specially localized compared to the constructions previously used for obtaining lower bounds in functional estimation and hypothesis testing (Ibragimov et al., 1987; Birgé and Massart, 1995; Nemirovski, 2000; Ingster et al., 2003; Tsybakov, 2009; Robins et al., 2009b). In Section 4, we confirm that our minimax lower bound is tight (under some conditions), by proposing and analyzing a new local polynomial R-Learner, using localized adaptations of higher order influence function methodology (Robins et al., 2008, 2009a, 2017). In addition to giving a new estimator that is provably optimal (under some conditions, e.g., on how well the covariate density is estimated), our results also confirm that previously proposed estimators were not generally optimal in this smooth nonparametric model. Our minimax rate also sheds light on the nature of the CATE as a statistical quantity, showing how it acts as a regression/functional hybrid: for example, the rate interpolates between nonparametric regression and functional estimation, depending on the relative smoothness of the CATE and nuisance functions (outcome regression and propensity score).

2 Setup & Notation

We consider an iid sample of n observations of Z=(X,A,Y) from distribution \mathbb{P}, where X\in[0,1]^{d} denotes covariates, A\in\{0,1\} a treatment or policy indicator, and Y\in\mathbb{R} an outcome of interest. We let F denote the distribution function of the covariate X (with density f as needed), and let

\pi(x) = \mathbb{P}(A=1\mid X=x)
\eta(x) = \mathbb{E}(Y\mid X=x)
\mu_{a}(x) = \mathbb{E}(Y\mid X=x,A=a)

denote the propensity score, marginal, and treatment-specific outcome regressions, respectively. We sometimes omit arguments from functions to ease notation, e.g., note that \tau=(\eta-\mu_{0})/\pi. We also index quantities by a distribution P when needed, e.g., \tau(x) under a particular distribution P is written \tau_{P}(x); depending on context, no indexing means the quantity is evaluated at the true \mathbb{P}, e.g., \tau(x)=\tau_{\mathbb{P}}(x).

Our goal is to study estimation of the CATE \tau(x)=\mu_{1}(x)-\mu_{0}(x) at a point x_{0}\in(0,1)^{d}, with error quantified by the mean absolute error

\mathbb{E}\left|\widehat{\tau}(x_{0})-\tau(x_{0})\right|.

As detailed in subsequent sections, we work in a nonparametric model \mathcal{P} whose components are infinite-dimensional functions but with some smoothness. We say a function is s-smooth if it belongs to a Hölder class with index s; this essentially means it has s-1 bounded derivatives, and the highest order derivative is continuous. To be more precise, let \lfloor s\rfloor denote the largest integer strictly smaller than s, and let D^{\alpha}=\frac{\partial^{\alpha}}{\partial x_{1}^{\alpha_{1}}\dots\partial x_{d}^{\alpha_{d}}} denote the partial derivative operator. Then the Hölder class with index s contains all functions g:\mathcal{X}\rightarrow\mathbb{R} that are \lfloor s\rfloor times continuously differentiable, with derivatives up to order \lfloor s\rfloor bounded, i.e.,

\left|D^{\alpha}g(x)\right|\leq C<\infty

for all \alpha=(\alpha_{1},\dots,\alpha_{d}) with \sum_{j}\alpha_{j}\leq\lfloor s\rfloor and for all x\in\mathcal{X}, and with \lfloor s\rfloor-order derivatives satisfying the Lipschitz condition

\left|D^{\beta}g(x)-D^{\beta}g(x^{\prime})\right|\leq C\|x-x^{\prime}\|^{s-\lfloor s\rfloor}

for some C<\infty, for all \beta=(\beta_{1},\dots,\beta_{d}) with \sum_{j}\beta_{j}=\lfloor s\rfloor, and for all x,x^{\prime}\in\mathcal{X}, where for a vector v\in\mathbb{R}^{d} we let \|v\| denote the Euclidean norm. Sometimes Hölder classes are referenced by both the smoothness s and constant C, but we focus our discussion on the smoothness s and omit the constant, which is assumed finite and independent of n.
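For concreteness, here is a small worked instance of this definition (an illustration only, not an additional assumption): take d=1 and s=3/2, so that \lfloor s\rfloor=1. Then g belongs to the Hölder class with index s if |g(x)|\leq C and |g^{\prime}(x)|\leq C for all x\in\mathcal{X}, and the first derivative satisfies

|g^{\prime}(x)-g^{\prime}(x^{\prime})|\leq C|x-x^{\prime}|^{1/2}

for all x,x^{\prime}\in\mathcal{X}.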

We write the squared L_{2}(Q) norm of a function as \|g\|_{Q}^{2}=\int g(z)^{2}\ dQ(z). The sup-norm is denoted by \|f\|_{\infty}=\sup_{z\in\mathcal{Z}}|f(z)|. For a matrix A we let \|A\| and \|A\|_{2} denote the operator/spectral and Frobenius norms, respectively, and let \lambda_{\min}(A) and \lambda_{\max}(A) denote the minimum and maximum eigenvalues of A, respectively. We write a_{n}\lesssim b_{n} if a_{n}\leq Cb_{n} for C a positive constant independent of n, and a_{n}\asymp b_{n} if a_{n}\leq Cb_{n} and b_{n}\leq Ca_{n} (i.e., if a_{n}\lesssim b_{n} and b_{n}\lesssim a_{n}). We write a_{n}\sim b_{n} to mean that a_{n} and b_{n} are proportional, i.e., a_{n}=Cb_{n} for some C. We also use a\vee b=\max(a,b) and a\wedge b=\min(a,b). We use the shorthand \mathbb{P}_{n}(f)=\mathbb{P}_{n}\{f(Z)\}=\frac{1}{n}\sum_{i=1}^{n}f(Z_{i}) to write sample averages, and similarly \mathbb{U}_{n}(f)=\mathbb{U}_{n}\{f(Z_{1},Z_{2})\}=\frac{1}{n(n-1)}\sum_{i\neq j}f(Z_{i},Z_{j}) for the U-statistic measure.

3 Fundamental Limits

In this section we derive a lower bound on the minimax rate for CATE estimation. This result has several crucial implications, both practical and theoretical. First, it gives a benchmark for the best possible performance of any CATE estimator in the nonparametric model defined in Theorem 1. In particular, if an estimator is shown to attain this benchmark, then one can safely conclude the estimator cannot be improved, at least in terms of worst-case rates, without adding assumptions; conversely, if the benchmark is not shown to be attained, then one should continue searching for better estimators (or better lower or upper risk bounds). Second, a tight minimax lower bound is important in its own right as a measure of the fundamental limits of CATE estimation, illustrating precisely how difficult CATE estimation is in a statistical sense. The main result of this section is given in Theorem 1 below; its proof is completed, and the resulting rate discussed in detail, in Section 3.3.

Theorem 1.

For x_{0}\in(0,1)^{d}, let \mathcal{P} denote the model where:

  1. f(x) is bounded above by a constant,

  2. \pi(x) is \alpha-smooth,

  3. \mu_{0}(x) is \beta-smooth, and

  4. \tau(x) is \gamma-smooth.

Let s\equiv(\alpha+\beta)/2. Then for n larger than a constant depending on (\alpha,\beta,\gamma,d), the minimax rate is lower bounded as

\inf_{\widehat{\tau}}\sup_{P\in\mathcal{P}}\mathbb{E}_{P}|\widehat{\tau}(x_{0})-\tau_{P}(x_{0})|\gtrsim\begin{cases}n^{-1/\left(1+\frac{d}{2\gamma}+\frac{d}{4s}\right)}&\text{ if }s<\frac{d/4}{1+d/2\gamma}\\ n^{-1/\left(2+\frac{d}{\gamma}\right)}&\text{ otherwise}.\end{cases}

Remark 1.

In Appendix A we also give results (both lower and upper bounds) for the model that puts smoothness assumptions on the marginal regression \eta(x)=\mathbb{E}(Y\mid X=x), instead of the control regression \mu_{0}(x)=\mathbb{E}(Y\mid X=x,A=0). Interestingly, the minimax rates differ in these two models, but only in the regime where the regression function is more smooth than the propensity score (i.e., \beta>\alpha). Specifically, when \eta is \beta-smooth, the minimax rate from Theorem 1 holds but with s=(\alpha+\beta)/2 replaced by \min(\alpha,s).

Crucially, Condition 4 allows the CATE \tau(x) to have its own smoothness \gamma, which is necessarily at least the regression smoothness \beta, but can also be much larger, as described in the Introduction. We defer discussion of the details of the overall minimax rate of Theorem 1 to Section 3.3, moving first to a proof of the result.

Remark 2.

For simplicity, the lower bound result in Theorem 1 is given for a large model in which the covariate density is only bounded. However, as discussed in detail later in Section 4 and Remark 11, for the stated rate to be attainable more conditions on the covariate density are required. In Section B.8 of the Appendix, we give a particular submodel of \mathcal{P} under which upper and lower bounds on the minimax rate match up to constants; it will be important in future work to further elucidate the role of the covariate density in CATE estimation.

The primary strategy in deriving minimax lower bounds is to construct distributions that are similar enough that they are statistically indistinguishable, but for which the parameter of interest is maximally separated; this implies no estimator can have error uniformly smaller than this separation. More specifically, we derive our lower bound using a localized version of the method of fuzzy hypotheses (Ibragimov et al., 1987; Birgé and Massart, 1995; Nemirovski, 2000; Ingster et al., 2003; Tsybakov, 2009; Robins et al., 2009b). In the classic Le Cam two-point method, which can be used to derive minimax lower bounds for nonparametric regression at a point (Tsybakov, 2009), it suffices to consider a pair of distributions that differ locally; however, for nonlinear functional estimation, such pairs give bounds that are too loose. One instead needs to construct pairs of mixture distributions, which can be viewed via a prior over distributions in the model (Birgé and Massart, 1995; Tsybakov, 2009; Robins et al., 2009b). Our construction combines these two approaches via a localized mixture, as will be described in detail in the next subsection.

Remark 3.

In what follows we focus on the lower bound in the low smoothness regime where s<\frac{d/4}{1+d/2\gamma}. The n^{-1/(2+d/\gamma)} lower bound for the high smoothness regime matches the classic smooth nonparametric regression rate, and follows from a standard two-point argument, using the same construction as in Section 2.5 of Tsybakov (2009).

The following lemma, adapted from Section 2.7.4 of Tsybakov (2009), provides the foundation for the minimax lower bound result of this section.

Lemma 1 (Tsybakov (2009)).

Let P_{\lambda} and Q_{\lambda} denote distributions in \mathcal{P} indexed by a vector \lambda=(\lambda_{1},\dots,\lambda_{k}), with n-fold products denoted by P_{\lambda}^{n} and Q_{\lambda}^{n}, respectively. Let \varpi denote a prior distribution over \lambda. If

H^{2}\left(\int P_{\lambda}^{n}\ d\varpi(\lambda),\int Q_{\lambda}^{n}\ d\varpi(\lambda)\right)\leq\alpha<2

and

\psi(P_{\lambda})-\psi(Q_{\lambda^{\prime}})\geq s>0

for a functional \psi:\mathcal{P}\mapsto\mathbb{R} and for all \lambda,\lambda^{\prime}, then

\inf_{\widehat{\psi}}\sup_{P\in\mathcal{P}}\mathbb{E}_{P}\left\{\ell\left(\left|\widehat{\psi}-\psi(P)\right|\right)\right\}\geq\ell(s/2)\left(\frac{1-\sqrt{\alpha(1-\alpha/4)}}{2}\right)

for any monotonic non-negative loss function \ell.


Lemma 1 illuminates the three ingredients for deriving a minimax lower bound, and shows how they interact. The ingredients are: (i) a pair of mixture distributions, (ii) the distance between their n-fold products, which is ideally small, and (iii) the separation of the parameter of interest under the mixtures, which is ideally large. Finding the right minimax lower bound requires balancing these three ingredients appropriately: with too much distance or not enough separation, the lower bound will be too loose. In the following subsections we describe these three ingredients in detail.

3.1 Construction

In this subsection we detail the distributions P_{\lambda} and Q_{\lambda} used to construct the minimax lower bound. The main idea is to mix constructions for nonparametric regression and functional estimation, by perturbing the CATE with a bump at the point x_{0}, and to also use a mixture of perturbations of the propensity score and regression functions \pi and \mu_{0}, locally near x_{0}.

For our lower bound results, we work in the setting where Y is binary; this is mostly to ease notation and calculations. Note however that this still yields a valid lower bound in the general continuous Y case, since a lower bound in the strict submodel where Y is binary is also a lower bound across the larger model \mathcal{P}. Importantly, when Y is binary, the density p of an observation Z can be indexed via either the quadruple (f,\pi,\mu_{0},\mu_{1}) or (f,\pi,\mu_{0},\tau); here we make use of the latter parametrization (and in the appendix we consider the (f,\pi,\eta,\tau) parametrization). We first give the construction for the \alpha\geq\beta case in the definition below, and then go on to discuss details (and in Appendix A we give constructions for all other regimes).

Definition 1 (Distributions P_{\lambda} and Q_{\lambda}).

Let:

  1. B:\mathbb{R}^{d}\rightarrow\mathbb{R} denote a C^{\infty} function with B(x)=1 for x\in[-1/4,1/4]^{d}, and B(x)=0 for x\notin[-1/2,1/2]^{d},

  2. \mathcal{C}_{h}(x_{0}) denote the cube centered at x_{0}\in(0,1)^{d} with sides of length h\leq 1/4,

  3. (\mathcal{X}_{1},\dots,\mathcal{X}_{k}) denote a partition of \mathcal{C}_{h}(x_{0}) into k cubes of equal size (for k an integer raised to the power d), with midpoints (m_{1},\dots,m_{k}), so each cube \mathcal{X}_{j}=\mathcal{C}_{h/k^{1/d}}(m_{j}) has side length h/k^{1/d}.

Then for \lambda_{j}\in\{-1,1\} define the functions

\tau_{h}(x) = h^{\gamma}B\left(\frac{x-x_{0}}{2h}\right)
\mu_{0\lambda}(x) = \frac{1}{2}+\left(h/k^{1/d}\right)^{\beta}\sum_{j=1}^{k}\lambda_{j}B\left(\frac{x-m_{j}}{h/k^{1/d}}\right)
\pi_{\lambda}(x) = \frac{1}{2}+\left(h/k^{1/d}\right)^{\alpha}\sum_{j=1}^{k}\lambda_{j}B\left(\frac{x-m_{j}}{h/k^{1/d}}\right)
f(x) = \mathbbold{1}(x\in\mathcal{S}_{hk})\Big{/}\left\{1-\left(\frac{4^{d}-1}{2^{d}}\right)h^{d}\right\}

where \mathcal{S}_{hk}=\left\{\bigcup_{j=1}^{k}\mathcal{C}_{h/2k^{1/d}}(m_{j})\right\}\bigcup\left\{[0,1]^{d}\setminus\mathcal{C}_{2h}(x_{0})\right\}. Finally let the distributions P_{\lambda} and Q_{\lambda} be defined via the densities

p_{\lambda} = (f,1/2,\mu_{0\lambda}-\tau_{h}/2,\tau_{h})
q_{\lambda} = (f,\pi_{\lambda},\mu_{0\lambda},0).

Remark 4.

Under the (f,\pi,\eta,\tau) parametrization, since \eta=\pi\tau+\mu_{0}, the above densities can equivalently be written as p_{\lambda}=(f,1/2,\mu_{0\lambda},\tau_{h}) and q_{\lambda}=(f,\pi_{\lambda},\mu_{0\lambda},0).

Figure 1 shows an illustration of our construction in the d=1 case. As mentioned above, the CATE is perturbed with a bump at x_{0} and the nuisance functions \pi and \mu_{0} with bumps locally near x_{0}. The regression function \mu_{0} is perturbed under both P_{\lambda} and Q_{\lambda}, since it is less smooth than the propensity score in the \alpha\geq\beta case. The choices of the CATE mimic those in the two-point proof of the lower bound for nonparametric regression at a point (see, e.g., Section 2.5 of Tsybakov (2009)), albeit with a particular flat-top bump function, while the choices of nuisance functions \pi and \mu_{0} are more similar to those in the lower bound for functionals such as the expected conditional covariance (cf. Section 4 of Robins et al. (2009b)). In this sense our construction can be viewed as combining those for nonparametric regression and functional estimation, similar to Shen et al. (2020). In what follows we remark on some important details.

Remark 5.

Section 3.2 of Shen et al. (2020) used a similar construction for conditional variance estimation. Some important distinctions are: (i) they focused on the univariate and low smoothness setting; (ii) in that problem there is only one nuisance function, so the null can be a point rather than a mixture distribution; and (iii) they use a different, arguably more complicated, approach to bound the distance between distributions. Our work can thus be used to generalize such variance estimation results to arbitrary dimension and smoothness.

Figure 1: Minimax lower bound construction in the d=1 case. An example null density p_{\lambda} is displayed in panel (a) and an alternative density q_{\lambda} in panel (b). The black, red, and blue lines denote the CATE, marginal outcome regression \eta=\pi\tau+\mu_{0}, and propensity score functions, respectively, and the gray line denotes the support of the covariate density.

First we remark on the choice of CATE in the construction. As mentioned above, the bump construction resembles that of the standard Le Cam lower bound for nonparametric regression at a point, but differs in that we use a specialized bump function with a flat top. Crucially, this choice ensures the CATE is constant and equal to h^{\gamma} for all x in the cube \mathcal{C}_{h}(x_{0}) centered at x_{0} with sides of length h, and that it is equal to zero for all x\notin\mathcal{C}_{2h}(x_{0}), i.e., outside the cube centered at x_{0} with side length 2h. The fact that the CATE is constant across the top of the bump (which will be the only place where observations appear near x_{0}) eases Hellinger distance calculations substantially. It is straightforward to check that the CATE \tau_{h}(x) is \gamma-smooth in this construction (see page 93 of Tsybakov (2009)).

Remark 6.

One example of a bump function B satisfying the conditions above is

B(x)=\left[1+\frac{g\{(4x)^{2}-1\}}{g\{2-(4x)^{2}\}}\right]^{-1}

for g(t)=\exp(-1/t)\mathbbold{1}(t>0).
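The following short snippet (a numerical illustration, not part of the original material) evaluates this bump function and checks its flat-top property, namely that B=1 on [-1/4,1/4] and B=0 outside [-1/2,1/2]:

import numpy as np

def bump(x):
    # Flat-top bump from Remark 6: B(x) = [1 + g((4x)^2 - 1)/g(2 - (4x)^2)]^{-1},
    # with g(t) = exp(-1/t) 1(t > 0); written as g2/(g1 + g2) to avoid dividing by zero.
    x = np.asarray(x, dtype=float)
    u = (4 * x) ** 2
    g = lambda t: np.where(t > 0, np.exp(-1.0 / np.where(t > 0, t, 1.0)), 0.0)
    g1, g2 = g(u - 1.0), g(2.0 - u)
    return np.where(g2 > 0, g2 / (g1 + g2), 0.0)

xs = np.linspace(-1, 1, 2001)
print(np.all(bump(xs[np.abs(xs) <= 0.25]) == 1.0))   # flat top: B = 1 on [-1/4, 1/4]
print(np.all(bump(xs[np.abs(xs) >= 0.5]) == 0.0))    # support:  B = 0 outside [-1/2, 1/2]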

For the propensity score and regression functions, we similarly have

B\left(\frac{x-m_{j}}{h/k^{1/d}}\right)=\begin{cases}1&\text{ for }x\in\mathcal{C}_{h/2k^{1/d}}(m_{j})\\ 0&\text{ for }x\notin\mathcal{C}_{h/k^{1/d}}(m_{j})\end{cases}

i.e., each bump equals one on the half-width cube \mathcal{C}_{h/2k^{1/d}}(m_{j}) around m_{j}, and is identically zero outside the larger cube \mathcal{C}_{h/k^{1/d}}(m_{j}) around m_{j}. It is again straightforward to check that \pi_{\lambda}(x) and \mu_{0\lambda}(x) are \alpha- and \beta-smooth, respectively.

The covariate density is chosen to be uniform, but on the set \mathcal{S}_{hk} that captures the middle of all the nuisance bumps \left\{\bigcup_{j=1}^{k}\mathcal{C}_{h/2k^{1/d}}(m_{j})\right\}, together with the space \left\{[0,1]^{d}\setminus\mathcal{C}_{2h}(x_{0})\right\} away from x_{0}. Importantly, this choice ensures there is only mass where the nuisance bumps B\left(\frac{x-m_{j}}{h/k^{1/d}}\right) are constant and non-zero (and where \tau_{h}(x)=h^{\gamma}), or else far away from x_{0}, where the densities are the same under P_{\lambda} and Q_{\lambda}. Note that, as h\rightarrow 0, the Lebesgue measure of the set \mathcal{S}_{hk} tends to one, and the covariate density tends towards a standard uniform distribution.
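The normalizing constant in the definition of f above comes from a direct volume calculation (recorded here for convenience): the k small cubes have total Lebesgue measure k(h/2k^{1/d})^{d}=(h/2)^{d}, while removing \mathcal{C}_{2h}(x_{0}) deletes measure (2h)^{d}, so

\mathrm{Leb}(\mathcal{S}_{hk}) = 1-(2h)^{d}+(h/2)^{d} = 1-\left(\frac{4^{d}-1}{2^{d}}\right)h^{d},

which is exactly the denominator making f a probability density, and which tends to one as h\rightarrow 0.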

3.2 Hellinger Distance

As mentioned previously, deriving a tight minimax lower bound requires carefully balancing the distance between distributions in our construction. To this end, in this subsection we bound the Hellinger distance between the n-fold product mixtures \int P_{\lambda}^{n}\ d\varpi(\lambda) and \int Q_{\lambda}^{n}\ d\varpi(\lambda), for \varpi a uniform prior distribution, so that (\lambda_{1},\dots,\lambda_{k}) are iid Rademachers.

In general these product densities can be complicated, making direct distance calculations difficult. However, Theorem 2.1 from Robins et al. (2009b) can be used to relate the distance between the n-fold products to those of simpler posteriors over a single observation. In the following lemma we adapt this result to localized constructions like those in Definition 1.

Lemma 2.

Let P_{\lambda} and Q_{\lambda} denote distributions indexed by a vector \lambda=(\lambda_{1},\dots,\lambda_{k}), and let \mathcal{Z}=\cup_{j=1}^{k}\mathcal{Z}_{j} denote a partition of the sample space. Assume:

  1. P_{\lambda}(\mathcal{Z}_{j})=Q_{\lambda}(\mathcal{Z}_{j})=p_{j} for all \lambda, and

  2. the conditional distributions \mathbbold{1}_{\mathcal{Z}_{j}}dP_{\lambda}/p_{j} and \mathbbold{1}_{\mathcal{Z}_{j}}dQ_{\lambda}/p_{j} (given an observation is in \mathcal{Z}_{j}) do not depend on \lambda_{\ell} for \ell\neq j, and only differ on partitions j\in S\subseteq\{1,\dots,k\}.

For a prior distribution \varpi over \lambda, let \overline{p}=\int p_{\lambda}\ d\varpi(\lambda) and \overline{q}=\int q_{\lambda}\ d\varpi(\lambda), and define

\delta_{1} = \max_{j\in S}\sup_{\lambda}\int_{\mathcal{Z}_{j}}\frac{(p_{\lambda}-\overline{p})^{2}}{p_{\lambda}p_{j}}\ d\nu
\delta_{2} = \max_{j\in S}\sup_{\lambda}\int_{\mathcal{Z}_{j}}\frac{(q_{\lambda}-p_{\lambda})^{2}}{p_{\lambda}p_{j}}\ d\nu
\delta_{3} = \max_{j\in S}\sup_{\lambda}\int_{\mathcal{Z}_{j}}\frac{(\overline{q}-\overline{p})^{2}}{p_{\lambda}p_{j}}\ d\nu

for a dominating measure \nu. If \overline{p}/p_{\lambda}\leq b<\infty and np_{j}\max(1,\delta_{1},\delta_{2})\leq b for all j, then H^{2}\left(\int P_{\lambda}^{n}\ d\varpi(\lambda),\int Q_{\lambda}^{n}\ d\varpi(\lambda)\right) is bounded above by

Cn\left(\sum_{j\in S}p_{j}\right)\left\{n\left(\max_{j\in S}p_{j}\right)\Big{(}\delta_{1}\delta_{2}+\delta_{2}^{2}\Big{)}+\delta_{3}\right\}

for a constant C only depending on b.


In the next proposition, we bound the quantities from Lemma 2 and put the results together to obtain a bound on the desired Hellinger distance between product mixtures.

Proposition 1.

Assume h\leq 1/4 and h^{\gamma}+2(h/k^{1/d})^{\beta}\leq 1-4\varepsilon for some \varepsilon\in(0,1/4), and take h^{\gamma}=4(h/k^{1/d})^{2s} for s\equiv(\alpha+\beta)/2. Then for the distributions P_{\lambda} and Q_{\lambda} from Definition 1, with \varpi the uniform distribution over \{-1,1\}^{k}, the quantities \delta_{j} from Lemma 2 satisfy

\delta_{1}\leq\left(\frac{2^{d+1}\|B\|_{2}^{2}}{\varepsilon}\right)\left(h/k^{1/d}\right)^{2\beta},\ \ \ \delta_{2}\leq\left(\frac{2^{d+1}\|B\|_{2}^{2}}{\varepsilon}\right)\left(h/k^{1/d}\right)^{2\alpha},\ \ \ \delta_{3}=0,

and p_{j}\leq 2(h/2)^{d}/k. Further H^{2}\left(\int P_{\lambda}^{n}\ d\varpi(\lambda),\int Q_{\lambda}^{n}\ d\varpi(\lambda)\right) is bounded above by

4C\left(\frac{2\|B\|_{2}^{2}}{\varepsilon}\right)^{2}\left(\frac{n^{2}h^{2d}}{k}\right)\left\{\left(h/k^{1/d}\right)^{4s}+\left(h/k^{1/d}\right)^{4\alpha}\right\}

for C a constant only depending on \varepsilon.


Before moving to the proof of Proposition 1, we briefly discuss and give some remarks. Compared to the Hellinger distance arising in the average treatment effect or expected conditional covariance lower bounds (Robins et al., 2009b), there is an extra h^{d} factor in the numerator. Of course, one cannot simply repeat those calculations with k/h^{d} bins, since then for example the k^{-4s/d} term would also be inflated to (k/h^{d})^{-4s/d}; our carefully localized construction is crucial to obtain the right rate in this case. We also note that the choice h^{\gamma}=4(h/k^{1/d})^{2s} is required for ensuring that the averaged densities \overline{p}(z) and \overline{q}(z) are equal (implying that \delta_{3}=0); specifically this equalizes the CATE bump under P_{\lambda} with the squared nuisance bumps under Q_{\lambda}.

Proof.

Here the relevant partition of the sample space \mathcal{X}\times\mathcal{A}\times\mathcal{Y}=[0,1]^{d}\times\{0,1\}\times\{0,1\} is \mathcal{Z}_{j}=\mathcal{C}_{h/2k^{1/d}}(m_{j})\times\{0,1\}\times\{0,1\}, j=1,\dots,k, along with \mathcal{Z}_{j}^{\prime}, which partitions the space [0,1]^{d}\setminus\mathcal{C}_{2h}(x_{0}) away from x_{0} into disjoint cubes with side lengths h/2k^{1/d}. Thus we have

P_{\lambda}(\mathcal{Z}_{j})=P_{\lambda}(\mathcal{Z}_{j}^{\prime})=Q_{\lambda}(\mathcal{Z}_{j})=Q_{\lambda}(\mathcal{Z}_{j}^{\prime})=p_{j}

where p_{j}=\int\mathbbold{1}\{x\in\mathcal{C}_{h/2k^{1/d}}(m_{j})\}f(x)\ dx. In Appendix Section B.1 we show that (h/2)^{d}/k\leq p_{j}\leq 2(h/2)^{d}/k when h\leq 1/4, and so p_{j} is proportional to the volume of a cube with side lengths h/2k^{1/d}. Further the conditional distributions \mathbbold{1}_{\mathcal{Z}_{j}}dP_{\lambda}/p_{j} and \mathbbold{1}_{\mathcal{Z}_{j}}dQ_{\lambda}/p_{j} do not depend on \lambda_{\ell} for \ell\neq j, since \lambda_{j} only changes the density in \mathcal{Z}_{j}. In what follows we focus on the partitions \mathcal{Z}_{j} and not \mathcal{Z}_{j}^{\prime}, since the distributions are exactly equal on the latter (in the language of Lemma 2, the set S indexes only the partitions \mathcal{Z}_{j}, and note that for this set we have \sum_{j\in S}p_{j}=kp_{j}\leq 2(h/2)^{d}). Note when (\lambda_{1},\dots,\lambda_{k}) are iid Rademacher random variables the marginalized densities from Lemma 2 are

\overline{p}(z) = f(x)\left\{\frac{1}{4}+(2a-1)(2y-1)\frac{h^{\gamma}}{4}B\left(\frac{x-x_{0}}{2h}\right)\right\}
\overline{q}(z) = f(x)\left\{\frac{1}{4}+(2a-1)(2y-1)\left(h/k^{1/d}\right)^{2s}\sum_{j=1}^{k}B\left(\frac{x-m_{j}}{h/k^{1/d}}\right)^{2}\right\}.

The first step is to show that relevant densities and density ratios are appropriately bounded. We give these details in Appendix Section B.1. Next it remains to bound the quantities \delta_{1}, \delta_{2}, and \delta_{3}.

We begin with \delta_{3}, the distance between the marginalized densities \overline{p} and \overline{q}, which is tackled somewhat differently from \delta_{1} and \delta_{2}. Because we take (h/k^{1/d})^{2s}=h^{\gamma}/4 it follows that \overline{q}(z)-\overline{p}(z) equals

(2a-1)(2y-1)f(x)\left\{\left(h/k^{1/d}\right)^{2s}\sum_{j=1}^{k}B\left(\frac{x-m_{j}}{h/k^{1/d}}\right)^{2}-\frac{h^{\gamma}}{4}B\left(\frac{x-x_{0}}{2h}\right)\right\}=0,   (2)

since f(x)=0 for x\notin\mathcal{S}_{hk} and

B\left(\frac{x-m_{j}}{h/k^{1/d}}\right)=B\left(\frac{x-x_{0}}{2h}\right)=0 \ \text{ for } x\in\left\{[0,1]^{d}\setminus\mathcal{C}_{2h}(x_{0})\right\}
B\left(\frac{x-m_{j}}{h/k^{1/d}}\right)=B\left(\frac{x-x_{0}}{2h}\right)=1 \ \text{ for } x\in\bigcup_{j=1}^{k}\mathcal{C}_{h/2k^{1/d}}(m_{j}).

We note that this result requires a carefully selected relationship between h and k, which guarantees that the squared nuisance bumps under Q_{\lambda} equal the CATE bumps under P_{\lambda}. This also exploits the flat-top bump functions we use, together with a covariate density that only puts mass at these tops, so that the squared terms are constant and no observations occur elsewhere where the bumps are not equal. Without these choices of bump function and covariate density, the expression in (2) would only be bounded by h^{\gamma}, and so \delta_{3} would only be bounded by h^{2\gamma}; in that case, the \delta_{3} term would dominate the Hellinger bound in Lemma 2, and the resulting minimax lower bound would reduce to the oracle rate n^{-1/(2+d/\gamma)}, which is uninformative in the low-smoothness regimes we are considering.
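The identity \overline{p}=\overline{q} on the support of f can also be verified numerically. The snippet below is a sketch for the d=1 construction of Definition 1, with illustrative smoothness values chosen so that all probabilities lie in (0,1); the common covariate-density factor f(x) cancels and is omitted. It enumerates all \lambda\in\{-1,1\}^{k} for a small k and compares the \lambda-averaged densities at the bump tops.

import itertools
import numpy as np

def bump(x):
    # Flat-top bump from Remark 6 (equals 1 on [-1/4, 1/4], 0 outside [-1/2, 1/2]).
    x = np.asarray(x, dtype=float)
    u = (4 * x) ** 2
    g = lambda t: np.where(t > 0, np.exp(-1.0 / np.where(t > 0, t, 1.0)), 0.0)
    g1, g2 = g(u - 1.0), g(2.0 - u)
    return np.where(g2 > 0, g2 / (g1 + g2), 0.0)

x0, h, k = 0.5, 0.2, 2                      # d = 1, h <= 1/4, k bins
alpha, beta = 2.0, 1.0                      # illustrative smoothness values
s = (alpha + beta) / 2
w = h / k                                   # bin width h / k^{1/d}
mids = x0 - h / 2 + (np.arange(k) + 0.5) * w
cate_height = 4 * w ** (2 * s)              # plays the role of h^gamma = 4 (h/k^{1/d})^{2s}

def dens(x, a, y, pi, mu0, tau):
    # Density of (A, Y) given X = x for binary Y, under the (pi, mu0, tau) parametrization.
    mu_a = mu0 + a * tau
    return pi ** a * (1 - pi) ** (1 - a) * mu_a ** y * (1 - mu_a) ** (1 - y)

xs = mids                                   # bump tops, where the covariate density puts local mass
p_bar = np.zeros((len(xs), 2, 2))
q_bar = np.zeros((len(xs), 2, 2))
for lam in itertools.product([-1, 1], repeat=k):
    bumps = sum(l * bump((xs - m) / w) for l, m in zip(lam, mids))
    mu0_lam = 0.5 + w ** beta * bumps
    pi_lam = 0.5 + w ** alpha * bumps
    tau_h = cate_height * bump((xs - x0) / (2 * h))
    for a in (0, 1):
        for y in (0, 1):
            p_bar[:, a, y] += dens(xs, a, y, 0.5, mu0_lam - tau_h / 2, tau_h)   # p_lambda
            q_bar[:, a, y] += dens(xs, a, y, pi_lam, mu0_lam, 0.0)              # q_lambda
p_bar /= 2 ** k
q_bar /= 2 ** k
print("max |p_bar - q_bar| at the bump tops:", np.abs(p_bar - q_bar).max())     # ~ 0 up to floating point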

Now we move to the distance \delta_{1}, which does not end up depending on h and is somewhat easier to handle. For it we have

\delta_{1} \leq\left(\frac{2^{d}k}{h^{d}}\right)\max_{\ell}\sup_{\lambda}\int_{\mathcal{X}_{\ell}}\sum_{a,y}\frac{f(x)^{2}}{4p_{\lambda}(z)}\left(h/k^{1/d}\right)^{2\beta}\sum_{j=1}^{k}B\left(\frac{x-m_{j}}{h/k^{1/d}}\right)^{2}\ dx
\leq\left(\frac{2^{d}k}{h^{d}}\right)\left(\frac{2}{\varepsilon}\right)\left(h/k^{1/d}\right)^{2\beta}\max_{\ell}\int_{\mathcal{X}_{\ell}}\sum_{j=1}^{k}B\left(\frac{x-m_{j}}{h/k^{1/d}}\right)^{2}\ dx
\leq\left(\frac{2^{d}\|B\|_{2}^{2}}{\varepsilon/2}\right)\left(h/k^{1/d}\right)^{2\beta}

where the first line follows by definition, since p_{\ell}\geq(h/2)^{d}/k, and since B\left(\frac{x-m_{j}}{h/k^{1/d}}\right)=0 outside of the cube \mathcal{C}_{h/k^{1/d}}(m_{j}), which implies that

\left\{\sum_{j}\lambda_{j}B\left(\frac{x-m_{j}}{h/k^{1/d}}\right)\right\}^{2}=\sum_{j}\lambda_{j}^{2}B\left(\frac{x-m_{j}}{h/k^{1/d}}\right)^{2},

the inequality in the second line since p_{\lambda}(z)/f(x)\geq\varepsilon and f(x)\leq 2 as in (23) and (22), and the last inequality since

\int_{\mathcal{X}_{\ell}}\sum_{j=1}^{k}B\left(\frac{x-m_{j}}{h/k^{1/d}}\right)^{2}\ dx = \int_{\mathcal{X}_{\ell}}B\left(\frac{x-m_{\ell}}{h/k^{1/d}}\right)^{2}\ dx\leq\frac{h^{d}}{k}\int B(u)^{2}\ du

by a change of variables.

For \delta_{2} we use a mix of the above logic for \delta_{3} and \delta_{1}. Note that ({q}_{\lambda}-p_{\lambda})^{2} equals

f(x)^{2}\Bigg{[}(a-1/2)\left(h/k^{1/d}\right)^{\alpha}\sum_{j=1}^{k}\lambda_{j}B\left(\frac{x-m_{j}}{h/k^{1/d}}\right)+(2a-1)(2y-1)\left\{\left(h/k^{1/d}\right)^{2s}\sum_{j=1}^{k}B\left(\frac{x-m_{j}}{h/k^{1/d}}\right)^{2}-\frac{h^{\gamma}}{4}B\left(\frac{x-x_{0}}{2h}\right)\right\}\Bigg{]}^{2}
\leq(1/2)f(x)^{2}\left(h/k^{1/d}\right)^{2\alpha}\sum_{j=1}^{k}B\left(\frac{x-m_{j}}{h/k^{1/d}}\right)^{2}

where we used the fact that (a+b)^{2}\leq 2(a^{2}+b^{2}) and \left\{\sum_{j}\lambda_{j}B\left(\frac{x-m_{j}}{h/k^{1/d}}\right)\right\}^{2}=\sum_{j}B\left(\frac{x-m_{j}}{h/k^{1/d}}\right)^{2}, along with the same logic as above with \delta_{3} (ensuring the second term in the square equals zero). Now we have

\delta_{2} \leq\left(\frac{2^{d}k}{h^{d}}\right)\left(\frac{2}{\varepsilon}\right)\left(h/k^{1/d}\right)^{2\alpha}\max_{\ell}\int_{\mathcal{X}_{\ell}}\sum_{j=1}^{k}B\left(\frac{x-m_{j}}{h/k^{1/d}}\right)^{2}\ dx\leq\left(\frac{2^{d}\|B\|_{2}^{2}}{\varepsilon/2}\right)\left(h/k^{1/d}\right)^{2\alpha}

using the exact same logic as for \delta_{1}. Plugging these bounds on (\delta_{1},\delta_{2},\delta_{3}) into Lemma 2, together with the fact that p_{j}\leq 2(h/2)^{d}/k and \sum_{j\in S}p_{j}\leq 2(h/2)^{d}, yields the result. ∎


3.3 Choice of Parameters & Final Rate

Finally we detail how the parameters h and k can be chosen to ensure the Hellinger distance from Proposition 1 remains bounded, and use the result to finalize the proof of Theorem 1.

Proposition 2.

Let

\frac{h}{k^{1/d}}=\left(\frac{h^{\gamma}}{4}\right)^{1/2s}=\left(\frac{1}{C^{*}n^{2}}\right)^{\frac{1}{d+4s+2sd/\gamma}}

for C^{*}=2^{2d/\gamma+5}C(\|B\|_{2}^{2}/\varepsilon)^{2} and C the constant from Proposition 1. Then under the assumptions of Proposition 1 we have

H^{2}\left(\int P_{\lambda}^{n}\ d\varpi(\lambda),\int Q_{\lambda}^{n}\ d\varpi(\lambda)\right)\ \leq\ 1

and h^{\gamma}=4(\sqrt{C^{*}}n)^{-1/\left(1+\frac{d}{2\gamma}+\frac{d}{4s}\right)}.
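As a concrete numerical illustration (a sketch, not taken from the paper; the constant C^{*} is set to one since it only affects constants), the following snippet solves these equations for h and k at given (\alpha,\beta,\gamma,d,n) and reports the implied separation h^{\gamma}:

def lower_bound_tuning(alpha, beta, gamma, d, n, C_star=1.0):
    # Tuning from Proposition 2: h / k^{1/d} = (h^gamma / 4)^{1/(2s)} = (C* n^2)^{-1/(d + 4s + 2sd/gamma)}.
    s = (alpha + beta) / 2
    ratio = (1.0 / (C_star * n ** 2)) ** (1.0 / (d + 4 * s + 2 * s * d / gamma))  # h / k^{1/d}
    h = (4 * ratio ** (2 * s)) ** (1.0 / gamma)   # from h^gamma = 4 (h/k^{1/d})^{2s}
    k = (h / ratio) ** d                          # number of bins (before rounding to an integer^d)
    return h, k, h ** gamma                       # separation h^gamma at x0

# Low-smoothness example: d = 2, s = 0.25 < (d/4)/(1 + d/(2*gamma)) = 1/3, n large enough that h <= 1/4.
h, k, sep = lower_bound_tuning(alpha=0.25, beta=0.25, gamma=2.0, d=2, n=10**7)
print(h, k, sep)                                  # h ~ 0.2, k ~ 4e6 bins, h^gamma ~ 0.04
print(4 * (10**7) ** (-1 / (1 + 2 / (2 * 2.0) + 2 / (4 * 0.25))))   # matches 4 n^{-1/(1 + d/2gamma + d/4s)}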


The proof of Proposition 2 follows directly from Proposition 1, after plugging in the selected values of h and k. Importantly, it (together with the alternative construction for the \beta>\alpha case given in Appendix A) also settles the proof of Theorem 1 via Lemma 1. This follows since, with the proposed choices of h and k, the Hellinger distance is appropriately bounded so that the term (1-\sqrt{\alpha(1-\alpha/4)})/2=(1-\sqrt{3/4})/2\approx 0.067 in Lemma 1 is a constant (greater than 1/20, for example), while the separation in the CATE at x_{0}, which equals h^{\gamma}, is proportional to n^{-1/(1+d/2\gamma+d/4s)} under all P_{\lambda} and Q_{\lambda}. Therefore this separation is indeed the minimax rate in the low smoothness regime where s<(d/4)/(1+d/2\gamma). Note again that, as discussed in Remark 3, when s>(d/4)/(1+d/2\gamma) the rate n^{-1/(1+d/2\gamma+d/4s)} is faster than the usual nonparametric regression rate n^{-1/(2+d/\gamma)}, and so the standard lower bound construction as in Section 2.5 of Tsybakov (2009) indicates that the slower rate n^{-1/(2+d/\gamma)} is the tighter lower bound in that regime.

Figure 2 illustrates the minimax rate from Theorem 1, as a function of the average nuisance smoothness s/d (scaled by dimension) and the CATE smoothness scaled by dimension, \gamma/d. A number of important features about the rate are worth highlighting.

Figure 2: The minimax rate for CATE estimation, as a function of average nuisance smoothness s and CATE smoothness \gamma, each scaled by covariate dimension d. The black dotted line denotes a threshold on the nuisance smoothness s/d, below which the oracle nonparametric regression rate n^{-1/(2+d/\gamma)} is unachievable (the “elbow” phenomenon).

First, of course, the rate never slows with higher nuisance smoothness s/d, for any CATE smoothness \gamma/d, and vice versa. However, there is an important elbow phenomenon, akin to that found in functional estimation problems (Bickel and Ritov, 1988; Birgé and Massart, 1995; Tsybakov, 2009; Robins et al., 2009b). In particular, the minimax lower bound shows that when the average nuisance smoothness is low enough that s<\frac{d/4}{1+d/2\gamma}, the oracle rate n^{-1/(2+d/\gamma)} (which could be achieved if one actually observed the potential outcomes) is in fact unachievable. This verifies a conjecture in Kennedy (2023).

Notably, though, the elbow phenomenon we find in the problem of CATE estimation differs quite substantially from that for classic pathwise differentiable functionals. For the latter, the rate is parametric (i.e., n^{-1/2}) above some threshold, and nonparametric (n^{-1/(1+d/4s)}) below. In contrast, in our setting the rate matches that of nonparametric regression above the threshold, and otherwise is a combination of nonparametric regression and functional estimation rates. Thus in this problem there are many elbows, with the threshold depending on the CATE smoothness \gamma. In particular, our minimax rate below the threshold,

n^{-1/\left(1+\frac{d}{2\gamma}+\frac{d}{4s}\right)},

is a mixture of the nonparametric regression rate n^{-1/(1+d/2\gamma)} (on the squared scale) and the classic functional estimation rate n^{-1/(1+d/4s)}. This means, for example, that in regimes where the CATE is very smooth, e.g., \gamma\rightarrow\infty, the CATE estimation problem begins to resemble that of pathwise-differentiable functional estimation, where the elbow occurs at s>d/4, with rates approaching the parametric rate n^{-1/2} above, and the functional estimation rate n^{-1/(1+d/4s)} below. At the other extreme, where the CATE does not have any extra smoothness, so that \gamma\rightarrow\beta (note we must have \gamma\geq\beta), the elbow threshold condition holds for any \alpha\geq 0. Thus, at this other extreme, there is no elbow phenomenon, and the CATE estimation problem resembles that of smooth nonparametric regression, with optimal rate n^{-1/(2+d/\beta)}. For the arguably more realistic setting, where the CATE smoothness \gamma may take intermediate values between \beta and \infty, the minimax rate is a mixture, interpolating between the two extremes. All of this quantifies the sense in which the CATE can be viewed as a regression/functional hybrid.
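As a quick numerical illustration of this interpolation (a helper sketch, not part of the paper), the exponent r in the minimax rate n^{-r} of Theorem 1 can be evaluated directly:

def cate_rate_exponent(s, gamma, d):
    # Exponent r such that the minimax rate of Theorem 1 is n^{-r}.
    if s < (d / 4) / (1 + d / (2 * gamma)):        # low-smoothness regime (below the elbow threshold)
        return 1 / (1 + d / (2 * gamma) + d / (4 * s))
    return 1 / (2 + d / gamma)                     # oracle nonparametric regression regime

# Fix d = 5 and nuisance smoothness s = 1; increase the CATE smoothness gamma.
for gamma in [1.0, 2.0, 5.0, 50.0]:
    print(gamma, round(cate_rate_exponent(s=1.0, gamma=gamma, d=5), 3))
# gamma = 1, 2, 5 fall in the oracle regime (r = 1/(2 + d/gamma) = 0.143, 0.222, 0.333); at gamma = 50
# the threshold is crossed, and r = 0.435 falls short of the oracle exponent 1/(2 + d/gamma) ~ 0.476,
# illustrating the elbow.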

It is also worth mentioning that no estimator previously proposed in the literature (that we know of) attains the minimax rate in Theorem 1 in full generality. Some estimators have been shown to attain the oracle rate n^{-1/(2+d/\gamma)}, but only under stronger assumptions than the minimal condition we find here, i.e., that s>\frac{d/4}{1+d/2\gamma}. One exception is the undersmoothed R-Learner estimator analyzed in Kennedy (2023), which did achieve the rate n^{-1/(2+d/\gamma)} whenever s>(d/4)/(1+d/2\gamma), under some conditions (including that \alpha\geq\beta). However, in the low-smoothness regime where s<(d/4)/(1+d/2\gamma), that estimator’s rate was n^{-2s/d}, which is slower than the minimax rate we find here. This motivates our work in the following section, where we propose and analyze a new estimator, whose error matches the minimax rate in much greater generality (under some conditions, e.g., on how well the covariate density is estimated).

Remark 7.

A slightly modified version of our construction also reveals that, when the CATE \tau(x)=\tau is constant, the classic functional estimation rate n^{-1/\left(1+\frac{d}{4s}\right)} acts as a minimax lower bound. To the best of our knowledge, this result has not been noted elsewhere.

4 Attainability

In this section we show that the minimax lower bound of Theorem 1 is actually attainable, via a new local polynomial version of the R-Learner (Nie and Wager, 2021; Kennedy, 2023), based on an adaptation of higher-order influence functions (Robins et al., 2008, 2009a, 2017).

4.1 Proposed Estimator & Decomposition

In this subsection we first describe our proposed estimator, and then give a preliminary error bound, which motivates the specific bias and variance calculations in following subsections. In short, the estimator is a higher-order influence function-based version of the local polynomial R-learner analyzed in Kennedy (2023). At its core, the R-Learner essentially regresses outcome residuals on treatment residuals to estimate a weighted average of the CATE. Early versions for a constant or otherwise parametric CATE were studied by Chamberlain (1987); Robinson (1988), and Robins (1994), with more flexible series, RKHS, and lasso versions studied more recently by Robins et al. (2008), Nie and Wager (2021), and Chernozhukov et al. (2017), respectively. None of this previous work obtained the minimax optimal rates in Theorem 2.

Definition 2 (Higher-Order Local Polynomial R-Learner).

Let K_{h}(x)=\frac{1}{h^{d}}\mathbbold{1}(\|x-x_{0}\|\leq h/2). For each covariate x_{j}, j=1,\dots,d, define \rho(x_{j})=\{\rho_{0}(x_{j}),\rho_{1}(x_{j}),\dots,\rho_{\lfloor\gamma\rfloor}(x_{j})\}^{\mathrm{\scriptscriptstyle T}} as the first (\lfloor\gamma\rfloor+1) terms of the Legendre polynomial series (shifted to be orthonormal on [0,1]),

\rho_{m}(x_{j})=\sum_{\ell=0}^{m}(-1)^{\ell+m}\sqrt{2m+1}{m\choose\ell}{m+\ell\choose\ell}x_{j}^{\ell}.

Define \rho(x) to be the corresponding tensor product of all interactions of \rho(x_{1}),\dots,\rho(x_{d}) up to order \lfloor\gamma\rfloor, which has length q={{d+\lfloor\gamma\rfloor}\choose{\lfloor\gamma\rfloor}} and is orthonormal on [0,1]^{d}, and finally define \rho_{h}(x)=\rho\left(1/2+(x-x_{0})/h\right). The proposed estimator is then defined as

\widehat{\tau}(x_{0})=\rho_{h}(x_{0})^{\mathrm{\scriptscriptstyle T}}\widehat{Q}^{-1}\widehat{R}   (3)

where \widehat{Q} is a q\times q matrix and \widehat{R} a q-vector given by

\widehat{Q} = \mathbb{P}_{n}\Big{\{}\rho_{h}(X)K_{h}(X)\widehat{\varphi}_{a1}(Z)\rho_{h}(X)^{\mathrm{\scriptscriptstyle T}}\Big{\}}+\mathbb{U}_{n}\Big{\{}\rho_{h}(X_{1})K_{h}(X_{1})\widehat{\varphi}_{a2}(Z_{1},Z_{2})K_{h}(X_{2})\rho_{h}(X_{1})^{\mathrm{\scriptscriptstyle T}}\Big{\}}
\widehat{R} = \mathbb{P}_{n}\Big{\{}\rho_{h}(X_{1})K_{h}(X_{1})\widehat{\varphi}_{y1}(Z_{1})\Big{\}}+\mathbb{U}_{n}\Big{\{}\rho_{h}(X_{1})K_{h}(X_{1})\widehat{\varphi}_{y2}(Z_{1},Z_{2})K_{h}(X_{2})\Big{\}},

respectively, and

\widehat{\varphi}_{a1}(Z) = A\{A-\widehat{\pi}(X)\}
\widehat{\varphi}_{y1}(Z) = \{Y-\widehat{\mu}_{0}(X)\}\{A-\widehat{\pi}(X)\}
\widehat{\varphi}_{a2}(Z_{1},Z_{2}) = -\{A_{1}-\widehat{\pi}(X_{1})\}b_{hk}(X_{1})^{\mathrm{\scriptscriptstyle T}}\widehat{\Omega}^{-1}b_{hk}(X_{2})A_{2}
\widehat{\varphi}_{y2}(Z_{1},Z_{2}) = -\{A_{1}-\widehat{\pi}(X_{1})\}b_{hk}(X_{1})^{\mathrm{\scriptscriptstyle T}}\widehat{\Omega}^{-1}b_{hk}(X_{2})\{Y_{2}-\widehat{\mu}_{0}(X_{2})\}
b_{hk}(x) = b\{1/2+(x-x_{0})/h\}\mathbbold{1}(\|x-x_{0}\|\leq h/2)
\widehat{\Omega} = \int_{v\in[0,1]^{d}}b(v)b(v)^{\mathrm{\scriptscriptstyle T}}\ d\widehat{F}(x_{0}+h(v-1/2))

for b:\mathbb{R}^{d}\mapsto\mathbb{R}^{k} a basis of dimension k. The nuisance estimators (\widehat{F},\widehat{\pi},\widehat{\mu}_{0}) are constructed from a separate training sample D^{n}, independent of that on which \mathbb{U}_{n} operates.


The estimator in Definition 2 can be viewed as a localized higher-order estimator, and depends on two main tuning parameters: the bandwidth h, which controls how locally one averages near x_{0}, and the dimension k of the basis b, which controls how bias and variance are balanced in the second-order U-statistic terms in \widehat{Q} and \widehat{R}. Specific properties of the basis b are discussed shortly, e.g., just prior to Remark 7 and in (6). We also note that while the basis b will have dimension k growing with sample size, the dimension of the basis \rho is fixed depending on the smoothness of the CATE; for the latter we use the Legendre series for concreteness, but expect other bases to work as well.
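To make this concrete, here is a minimal numerical sketch of Definition 2 in the d=1 case (an illustration under simplifying assumptions, not the authors’ implementation): the nuisance estimates \widehat{\pi} and \widehat{\mu}_{0} are passed in as plug-in functions (in practice they would be fit on the separate training sample D^{n}), \widehat{F} is taken to be uniform near x_{0} so that \widehat{\Omega}=hI_{k}, and the localized basis b is taken to be the same shifted Legendre family as \rho.

import numpy as np
from numpy.polynomial import legendre

def shifted_legendre(m, u):
    # Orthonormal Legendre polynomial of degree m on [0, 1]: sqrt(2m+1) P_m(2u - 1).
    c = np.zeros(m + 1); c[m] = 1.0
    return np.sqrt(2 * m + 1) * legendre.legval(2 * np.asarray(u) - 1, c)

def basis(u, dim):
    # Stack the first `dim` orthonormal shifted Legendre polynomials, evaluated at u.
    return np.column_stack([shifted_legendre(m, u) for m in range(dim)])

def hol_rlearner(X, A, Y, x0, h, gamma_floor, k, pi_hat, mu0_hat):
    # Sketch of tau_hat(x0) in eq. (3): first-order terms plus second-order U-statistic corrections.
    n = len(X)
    in_win = (np.abs(X - x0) <= h / 2).astype(float)
    Kh = in_win / h                                    # K_h(X_i), d = 1
    U = 0.5 + (X - x0) / h                             # local rescaling of the window to [0, 1]
    rho = basis(U, gamma_floor + 1) * in_win[:, None]  # rho_h(X_i), zero outside the window
    rho_x0 = basis([0.5], gamma_floor + 1)[0]          # rho_h(x0) = rho(1/2)
    res_a, res_y = A - pi_hat(X), Y - mu0_hat(X)

    # First-order terms: P_n{rho_h K_h phi_a1 rho_h^T} and P_n{rho_h K_h phi_y1}.
    Q1 = (rho * (Kh * A * res_a)[:, None]).T @ rho / n
    R1 = rho.T @ (Kh * res_a * res_y) / n

    # Localized basis b_hk and Omega_hat = h * I_k under a uniform F_hat near x0.
    b = basis(U, k) * in_win[:, None]
    G = (b @ b.T) / h                                  # b_hk(X_i)^T Omega_hat^{-1} b_hk(X_j)
    np.fill_diagonal(G, 0.0)                           # U-statistic excludes i = j

    # Second-order U-statistic corrections (phi_a2 and phi_y2 carry a minus sign).
    Q2 = -(rho * (Kh * res_a * (G @ (Kh * A)))[:, None]).T @ rho / (n * (n - 1))
    R2 = -rho.T @ (Kh * res_a * (G @ (Kh * res_y))) / (n * (n - 1))

    Qhat, Rhat = Q1 + Q2, R1 + R2
    return rho_x0 @ np.linalg.solve(Qhat, Rhat)

# Toy usage with oracle plug-in nuisances (purely illustrative).
rng = np.random.default_rng(0)
n = 2000
X = rng.uniform(size=n)
pi = 0.5 + 0.2 * np.sin(4 * np.pi * X)
A = rng.binomial(1, pi).astype(float)
Y = A * (1 + X) + np.sin(8 * np.pi * X) + rng.normal(size=n)   # true CATE tau(x) = 1 + x
print(hol_rlearner(X, A, Y, x0=0.5, h=0.3, gamma_floor=1, k=4,
                   pi_hat=lambda x: 0.5 + 0.2 * np.sin(4 * np.pi * x),
                   mu0_hat=lambda x: np.sin(8 * np.pi * x)))    # should be roughly tau(0.5) = 1.5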

The U-statistic terms are important for debiasing the first-order sample average terms. In addition, our proposed estimator can be viewed as estimating a locally weighted projection parameter \tau_{h}(x_{0})=\rho_{h}(x_{0})^{\mathrm{\scriptscriptstyle T}}\theta, with coefficients given by

\operatorname*{arg\,min}_{\beta}\mathbb{E}\left[K_{h}(x)\pi(x)\{1-\pi(x)\}\Big{\{}\tau(x)-\beta^{\mathrm{\scriptscriptstyle T}}\rho_{h}(x)\Big{\}}^{2}\right]=Q^{-1}R   (4)

for

Q = \int\rho_{h}(x)K_{h}(x)\pi(x)\{1-\pi(x)\}\rho_{h}(x)^{\mathrm{\scriptscriptstyle T}}\ dF(x)
R = \int\rho_{h}(x)K_{h}(x)\pi(x)\{1-\pi(x)\}\tau(x)\ dF(x).

In other words, this projection parameter \tau_{h}(x_{0}) is a K_{h}(x)\pi(x)\{1-\pi(x)\}-weighted least squares projection of the CATE \tau(x) on the scaled Legendre polynomials \rho_{h}(x). Crucially, since \rho_{h}(x) includes polynomials in x up to order \lfloor\gamma\rfloor, the projection parameter is within h^{\gamma} of the target CATE; this is formalized in the following proposition.

Proposition 3.

Let \tau_{h}(x)=\rho_{h}(x)^{\mathrm{\scriptscriptstyle T}}Q^{-1}R denote the x_{0}-specific projection parameter from (4), and assume:

  1. \tau(x) is \gamma-smooth,

  2. the eigenvalues of Q are bounded below away from zero, and

  3. \int\mathbbold{1}\{\|x-x_{0}\|\leq h/2\}\ dF(x)\lesssim h^{d}.

Then for any x with \|x-x_{0}\|\leq Ch we have

|\tau_{h}(x)-\tau(x)|\lesssim h^{\gamma}.

Proof.

This proof follows from a higher-order kernel argument (e.g., Proposition 1.13 of Tsybakov (2009), Proposition 4.1.5 of Giné and Nickl (2021)), after noting that we can treat K_{h}(x)\pi(x)\{1-\pi(x)\} itself as a kernel. A similar result was also proved in Kennedy (2023). A detailed proof is given in Appendix B.3. ∎


In Proposition S5 in the Appendix we give simple sufficient conditions under which the eigenvalues of Q are bounded. In short, this holds under standard boundedness conditions on the propensity score and covariate density.

As mentioned above, our estimator (3) can be viewed as a modified higher-order estimator. More specifically, \widehat{S}=\widehat{R}-\widehat{Q}\theta is closely related to a second-order estimator of the moment condition

\mathbb{E}\Big{[}\rho_{h}(X)K_{h}(X)\Big{\{}A-\pi(X)\Big{\}}\Big{\{}Y-\mu_{0}(X)-A\tau(X)\Big{\}}\Big{]}=R-Q\theta=0   (5)

under the assumption that \tau(x)=\rho_{h}(x)^{\mathrm{\scriptscriptstyle T}}\theta (this is not exactly true in our case, but it is enough that it is approximately true locally near x_{0}, as will be proved shortly). Indeed, letting \widehat{Q}_{1}=\mathbb{P}_{n}\{\rho_{h}(X)K_{h}(X)\widehat{\varphi}_{a1}(Z)\rho_{h}(X)^{\mathrm{\scriptscriptstyle T}}\} and \widehat{R}_{1}=\mathbb{P}_{n}\{\rho_{h}(X)K_{h}(X)\widehat{\varphi}_{y1}(Z)\} denote the first terms in \widehat{Q}=\widehat{Q}_{1}+\widehat{Q}_{2} and \widehat{R}=\widehat{R}_{1}+\widehat{R}_{2}, respectively, we see that \widehat{S}_{1}=\widehat{R}_{1}-\widehat{Q}_{1}\theta is a usual first-order influence function-based estimator of the moment condition (5). Similarly, the second terms \widehat{Q}_{2} and \widehat{R}_{2} are akin to the second-order U-statistic corrections that would be added using the higher-order influence function methodology developed by Robins et al. (2008, 2009a, 2017). However, these terms differ in two important ways, both relating to localization near x_{0}. First, the U-statistic is localized with respect to both X_{1} and X_{2}, i.e., the product K_{h}(X_{1})K_{h}(X_{2}) is included, whereas only K_{h}(X_{1}) would arise if the goal were purely to estimate the moment condition (5) for fixed h. Second, the basis functions

bhk(x)=b(1/2+xx0h)1(xx0h/2)b_{hk}(x)=b\left(1/2+\frac{x-x_{0}}{h}\right)\mathbbold{1}(\|x-x_{0}\|\leq h/2)

appearing in φ^a2\widehat{\varphi}_{a2}, φ^y2\widehat{\varphi}_{y2}, and Ω^\widehat{\Omega} are localized; they only operate on XXs near x0x_{0}, stretching them out so as to map the cube [x0h/2,x0+h/2]d[x_{0}-h/2,x_{0}+h/2]^{d} around x0x_{0} to the whole space [0,1]d[0,1]^{d} (e.g., bhk(x0h/2)=b(0)b_{hk}(x_{0}-h/2)=b(0), bhk(x0)=b(1/2)b_{hk}(x_{0})=b(1/2), etc.). This is the same localization that is used with the Legendre basis ρ(x)\rho(x). In this sense, these localized basis terms spend all their approximation power locally rather than globally away from x0x_{0}. (Specific approximating properties we require of bb will be detailed shortly, in (6)). These somewhat subtle distinctions play a crucial role in appropriately controlling bias, as will be described in more detail shortly.
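As a small illustration of this localization (our sketch, with d = 1 and a simple monomial basis standing in for b), the snippet below builds b_hk and shows that it maps the edges and center of the window [x0 − h/2, x0 + h/2] to b(0), b(1/2), and b(1), and vanishes outside the window.

```python
import numpy as np

def b(v, k):
    # simple monomial basis on [0, 1] standing in for b; the paper allows splines,
    # CDV wavelets, local polynomial partition series, etc.
    return np.vstack([v ** j for j in range(k)])

def b_hk(x, x0, h, k):
    v = 0.5 + (x - x0) / h                       # stretch the local window onto [0, 1]
    inside = np.abs(x - x0) <= h / 2
    return b(np.clip(v, 0, 1), k) * inside       # zero outside the window around x0

x0, h, k = 0.5, 0.2, 4
pts = np.array([x0 - h / 2, x0, x0 + h / 2, 0.9])
print(b_hk(pts, x0, h, k))
# columns at x0 - h/2, x0, x0 + h/2 equal b(0), b(1/2), b(1); the column at 0.9 is zero
```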

Remark 8.

Note again that, as with other higher-order estimators, the estimator (3) depends on an initial estimate of the covariate distribution FF (near x0x_{0}), through Ω^\widehat{\Omega}. Importantly, we do not take this estimator F^\widehat{F} to be the empirical distribution, in general, since then our optimal choices of the tuning parameter kk would yield Ω^\widehat{\Omega} non-invertible; this occurs with higher-order estimators of pathwise differentiable functionals as well (Mukherjee et al., 2017). As discussed in Remark 11, and in more detail shortly, we do give conditions under which the estimation error in Ω^\widehat{\Omega} or F^\widehat{F} does not impact the overall rate of τ^(x0)\widehat{\tau}(x_{0}).

Crucially, Proposition 3 allows us to focus on understanding the estimation error in τ^(x0)\widehat{\tau}(x_{0}) with respect to the projection parameter τh(x0)\tau_{h}(x_{0}), treating hγh^{\gamma} as a separate approximation bias. The next result gives a finite-sample bound on this error, showing how it is controlled by the error in estimating the components of QQ and RR.

Proposition 4.

Let S=RQ(Q1R)S=R-Q(Q^{-1}R) and S^=R^Q^(Q1R)\widehat{S}=\widehat{R}-\widehat{Q}(Q^{-1}R). The estimator (3) satisfies

|τ^(x0)τh(x0)|ρ(1/2)(Q1+Q^1Q1)S^S,\left|\widehat{\tau}(x_{0})-\tau_{h}(x_{0})\right|\ \leq\ \|\rho(1/2)\|\left(\|Q^{-1}\|+\|\widehat{Q}^{-1}-{Q}^{-1}\|\right)\left\|\widehat{S}-S\right\|,

and if Q1\|Q^{-1}\| and Q^1Q1\|\widehat{Q}^{-1}-{Q}^{-1}\| are bounded above, then

𝔼|τ^(x0)τh(x0)|\displaystyle\mathbb{E}\left|\widehat{\tau}(x_{0})-\tau_{h}(x_{0})\right|\ maxj𝔼{𝔼(S^jSjDn)2+var(S^jDn)}.\displaystyle\lesssim\ \max_{j}\sqrt{\mathbb{E}\left\{\mathbb{E}(\widehat{S}_{j}-S_{j}\mid D^{n})^{2}+\text{var}(\widehat{S}_{j}\mid D^{n})\right\}}.

for DnD^{n} a separate independent training sample on which (F^,π^,μ^0)(\widehat{F},\widehat{\pi},\widehat{\mu}_{0}) are estimated.


Thus Proposition 4 tells us that bounding the conditional bias and variance of S^=R^Q^(Q1R)\widehat{S}=\widehat{R}-\widehat{Q}(Q^{-1}R) will also yield finite-sample bounds on the error in τ^(x0)\widehat{\tau}(x_{0}), relative to the projection parameter τh(x0)\tau_{h}(x_{0}). These bias and variance bounds will be derived in the following two subsections.

4.2 Bias

In this subsection we derive bounds on the conditional bias of the estimator S^=R^Q^(Q1R)\widehat{S}=\widehat{R}-\widehat{Q}(Q^{-1}R), relative to the components of the projection parameter (4), given the training sample DnD^{n}. The main ideas behind the approach are to use localized versions of higher-order influence function arguments, along with a specialized localized basis construction, which yields smaller bias because the basis only needs to be used in a shrinking window around x0x_{0}.

Here we rely on the basis b(x)b(x) having optimal Hölder approximation properties. In particular, we assume that the approximation error of projections in the L2L_{2} norm satisfies

(IΠb)gFks/d for any s-smooth function g\|(I-\Pi_{b})g\|_{F^{*}}\ \lesssim\ k^{-s/d}\ \text{ for any $s$-smooth function $g$} (6)

where Πbg=argmin=θTb(g)2𝑑F\Pi_{b}g=\operatorname*{arg\,min}_{\ell=\theta^{\mathrm{\scriptscriptstyle T}}b}\int(g-\ell)^{2}\ dF^{*} is the usual linear projection of gg on bb, for dF(v)=dF(x0+h(v1/2))dF^{*}(v)=dF(x_{0}+h(v-1/2)) the distribution in h(x0)\mathcal{B}_{h}(x_{0}), the hh-ball around x0x_{0}, mapped to [0,1]d[0,1]^{d}. In a slight abuse of notation (to ease exposition), we omit the dependence of Πbg\Pi_{b}g on FF^{*}. The approximating condition (6) holds for numerous bases, including spline, CDV wavelet, and local polynomial partition series (and polynomial and Fourier series, up to log factors); it is used often in the literature. We refer to Belloni et al. (2015) for more discussion and specific examples (see their Condition A.3 and the subsequent discussion, for example in their Section 3.2).
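As a numerical illustration of condition (6) (ours, not from the paper; d = 1, F* uniform, a piecewise-constant partition series as the basis, and a Weierstrass-type s-Hölder function chosen so that the rate is actually attained rather than beaten), the projection error below decays at the k^{-s} rate.

```python
import numpy as np

# g is a Weierstrass-type sum, an s-Holder function whose projection error
# saturates the k^{-s} rate (a function with a single kink would decay faster).
s = 0.6
j = np.arange(1, 15)[:, None]
g = lambda x: (2.0 ** (-s * j) * np.cos(2.0 ** j * np.pi * x)).sum(axis=0)

x  = np.linspace(0, 1, 200001)
gx = g(x)

for k in [4, 16, 64, 256]:
    cell  = np.minimum((x * k).astype(int), k - 1)      # k-cell partition of [0, 1]
    means = np.bincount(cell, weights=gx, minlength=k) / np.bincount(cell, minlength=k)
    err   = np.sqrt(np.mean((gx - means[cell]) ** 2))   # L2 projection error onto the partition basis
    print(k, err, err * k ** s)                         # err * k^s stays roughly constant
```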

Proposition 5.

Assume:

  1. 1.

    λmax(Ω)\lambda_{\max}(\Omega) is bounded above,

  2. 2.

    the basis bb satisfies approximating condition (6),

  3. 3.

    π^(x)π(x)\widehat{\pi}(x)-\pi(x) is α\alpha-smooth,

  4. 4.

    μ^0(x)μ0(x)\widehat{\mu}_{0}(x)-\mu_{0}(x) is β\beta-smooth.

Then

|𝔼(S^jSjDn)|\displaystyle|\mathbb{E}(\widehat{S}_{j}-S_{j}\mid D^{n})| (h/k1/d)2s+hγ(h/k1/d)α\displaystyle\ \lesssim\ \left(h/k^{1/d}\right)^{2s}+h^{\gamma}\left(h/k^{1/d}\right)^{\alpha}
+(hγ+μ^0μ0F)(π^πFΩ^1Ω1).\displaystyle\hskip 36.135pt+\Big{(}h^{\gamma}+\|\widehat{\mu}_{0}-\mu_{0}\|_{F^{*}}\Big{)}\left(\|\widehat{\pi}-\pi\|_{F^{*}}\|\widehat{\Omega}^{-1}-\Omega^{-1}\|\right).

Before delving into the proof, we give some brief discussion. The bias consists of three terms; the first two are the main bias terms that would result even if the covariate distribution FF were known, and the third is essentially the contribution from having to estimate FF. (In Lemma S2 of the Appendix we show how the operator norm error of Ω^\widehat{\Omega} is bounded above by estimation error of the distribution FF itself). We note that the second of the main bias terms hγ(h/k1/d)αh^{\gamma}(h/k^{1/d})^{\alpha} will be of smaller order in all regimes we consider. Compared to the main bias term in a usual higher-order influence function analysis, which is k2s/dk^{-2s/d} (e.g., for the average treatment effect), our bias term is smaller; this is a result of using the localized basis bhk(x)b_{hk}(x) defined in (3), which only has to be utilized locally near x0x_{0} (this smaller bias will be partially offset by a larger variance, as discussed in the next subsection). As mentioned in Remark 11, the contribution from having to estimate FF is only a third-order term, since the estimation error of Ω^\widehat{\Omega} (in terms of operator norm) is multiplied by a product of propensity score errors (in L2(F)L_{2}(F^{*}) norm) with the sum of regression errors and smoothing bias hγh^{\gamma}, which is typically of smaller order. In Proposition 6, given after the following proof of Proposition 5, we show how the bias simplifies when FF is estimated accurately enough.

Proof.

By iterated expectation, the conditional mean of the first-order term in R^\widehat{R}, i.e., the quantity 𝔼{ρh(X1)Kh(X1)φ^y1(Z)Dn}\mathbb{E}\{\rho_{h}(X_{1})K_{h}(X_{1})\widehat{\varphi}_{y1}(Z)\mid D^{n}\} is equal to

R\displaystyle R +ρh(x)Kh(x){π(x)π^(x)}{π(x)τ(x)+μ0(x)μ^0(x)}𝑑F(x)\displaystyle+\int\rho_{h}(x)K_{h}(x)\{\pi(x)-\widehat{\pi}(x)\}\{\pi(x)\tau(x)+\mu_{0}(x)-\widehat{\mu}_{0}(x)\}\ dF(x)
=R+ρ(v){π(v)π^(v)}{π(v)τ(v)+μ0(v)μ^0(v)}𝑑F(v)\displaystyle=R+\int\rho(v)\{\pi^{*}(v)-\widehat{\pi}^{*}(v)\}\{\pi^{*}(v)\tau^{*}(v)+\mu_{0}^{*}(v)-\widehat{\mu}_{0}^{*}(v)\}\ dF^{*}(v)

where we use the change of variable v=12+xx0hv=\frac{1}{2}+\frac{x-x_{0}}{h} and again define for any function g:dg:\mathbb{R}^{d}\mapsto\mathbb{R} its corresponding stretched version as g(v)=g(x0+h(v1/2))g^{*}(v)=g(x_{0}+h(v-1/2)). To ease notation it is left implicit that any integral over vv is only over {v:v1/21/2}\{v:\|v-1/2\|\leq 1/2\}. Similarly for Q^\widehat{Q} we have that 𝔼{ρh(X1)Kh(X1)φ^a1(Z)ρh(X1)TDn}\mathbb{E}\{\rho_{h}(X_{1})K_{h}(X_{1})\widehat{\varphi}_{a1}(Z)\rho_{h}(X_{1})^{\mathrm{\scriptscriptstyle T}}\mid D^{n}\} equals

Q\displaystyle Q +ρh(x)Kh(x){π(x)π^(x)}π(x)ρh(x)T𝑑F(x)\displaystyle+\int\rho_{h}(x)K_{h}(x)\{\pi(x)-\widehat{\pi}(x)\}\pi(x)\rho_{h}(x)^{\mathrm{\scriptscriptstyle T}}\ dF(x)
=Q+ρ(v){π(v)π^(v)}π(v)ρ(v)𝑑F(v)\displaystyle=Q+\int\rho(v)\{\pi^{*}(v)-\widehat{\pi}^{*}(v)\}\pi^{*}(v)\rho(v)\ dF^{*}(v)

so that for the first-order term in S^\widehat{S} (denoted R^1Q^1θ\widehat{R}_{1}-\widehat{Q}_{1}\theta in discussion of the moment condition (5)) we have

𝔼{ρh(X1)\displaystyle\mathbb{E}\{\rho_{h}(X_{1}) Kh(X1)φ^y1(Z)Dn}𝔼{ρh(X1)Kh(X1)φ^a1(Z)ρh(X1)Dn}θ\displaystyle K_{h}(X_{1})\widehat{\varphi}_{y1}(Z)\mid D^{n}\}-\mathbb{E}\{\rho_{h}(X_{1})K_{h}(X_{1})\widehat{\varphi}_{a1}(Z)\rho_{h}(X_{1})\mid D^{n}\}\theta
=RQθ+ρ(v){π(v)π^(v)}{μ0(v)μ^0(v)}𝑑F(v)\displaystyle=R-Q\theta+\int\rho(v)\{\pi^{*}(v)-\widehat{\pi}^{*}(v)\}\{\mu_{0}^{*}(v)-\widehat{\mu}_{0}^{*}(v)\}\ dF^{*}(v)
+ρ(v)π(v){π(v)π^(v)}{τ(v)τh(v)}𝑑F(v).\displaystyle\hskip 36.135pt+\int\rho(v)\pi^{*}(v)\{\pi^{*}(v)-\widehat{\pi}^{*}(v)\}\{\tau^{*}(v)-\tau_{h}^{*}(v)\}\ dF^{*}(v). (7)

The conditional mean of the second-order influence function term in R^\widehat{R} is

𝔼{ρh(X1)Kh(X1)φ^y2(Z1,Z2)Kh(X2)Dn}\displaystyle\mathbb{E}\{\rho_{h}(X_{1})K_{h}(X_{1})\widehat{\varphi}_{y2}(Z_{1},Z_{2})K_{h}(X_{2})\mid D^{n}\} (8)
=ρ(v1){π(v1)π^(v1)}Π^b(πτ+μ0μ^0)(v1)𝑑F(v1)\displaystyle=-\int\rho(v_{1})\{\pi^{*}(v_{1})-\widehat{\pi}^{*}(v_{1})\}\widehat{\Pi}_{b}(\pi^{*}\tau^{*}+\mu_{0}^{*}-\widehat{\mu}_{0}^{*})(v_{1})\ dF^{*}(v_{1})

where we define

Πbg(u)\displaystyle\Pi_{b}g^{*}(u) =b(u)TΩ1b(v)g(v)𝑑F(v)\displaystyle=b(u)^{\mathrm{\scriptscriptstyle T}}\Omega^{-1}\int{b}(v)g^{*}(v)\ dF^{*}(v)

as the FF^{*}-weighted linear projection of gg^{*} on the basis bb, and Π^bg(u)\widehat{\Pi}_{b}g^{*}(u) as the estimated version, which simply replaces Ω\Omega with Ω^\widehat{\Omega}. Similarly for Q^θ\widehat{Q}\theta we have

𝔼{ρh(X1)Kh(X1)φ^a2(Z1,Z2)Kh(X2)ρh(X2)TθDn}\displaystyle\mathbb{E}\{\rho_{h}(X_{1})K_{h}(X_{1})\widehat{\varphi}_{a2}(Z_{1},Z_{2})K_{h}(X_{2})\rho_{h}(X_{2})^{\mathrm{\scriptscriptstyle T}}\theta\mid D^{n}\} (9)
=ρ(v1){π(v1)π^(v1)}Π^b(πτh)(v1)𝑑F(v1)\displaystyle=-\int\rho(v_{1})\{\pi^{*}(v_{1})-\widehat{\pi}^{*}(v_{1})\}\widehat{\Pi}_{b}(\pi^{*}\tau_{h}^{*})(v_{1})\ dF^{*}(v_{1})

so that the conditional mean (8) minus the conditional mean (9) equals

ρ(v1){π(v1)π^(v1)}Π^b{π(ττh)+(μ0μ^0)}(v1)𝑑F(v1).\displaystyle-\int\rho(v_{1})\{\pi^{*}(v_{1})-\widehat{\pi}^{*}(v_{1})\}\widehat{\Pi}_{b}\{\pi^{*}(\tau^{*}-\tau_{h}^{*})+(\mu_{0}^{*}-\widehat{\mu}_{0}^{*})\}(v_{1})\ dF^{*}(v_{1}). (10)

Therefore adding the first- and second-order expected values in (7) and (10), the overall bias relative to SS is

ρ(v){π(v)π^(v)}(IΠ^b){π(ττh)+(μ0μ^0)}(v)𝑑F(v)\displaystyle\int\rho(v)\{\pi^{*}(v)-\widehat{\pi}^{*}(v)\}(I-\widehat{\Pi}_{b})\{\pi^{*}(\tau^{*}-\tau_{h}^{*})+(\mu_{0}^{*}-\widehat{\mu}_{0}^{*})\}(v)\ dF^{*}(v)
=(IΠb){ρ(ππ^)}(v)(IΠb){π(ττh)+(μ0μ^0)}(v)𝑑F(v)\displaystyle=\int(I-\Pi_{b})\{\rho(\pi^{*}-\widehat{\pi}^{*})\}(v)(I-\Pi_{b})\{\pi^{*}(\tau^{*}-\tau_{h}^{*})+(\mu_{0}^{*}-\widehat{\mu}_{0}^{*})\}(v)\ dF^{*}(v) (11)
+ρ(v){π(v)π^(v)}(ΠbΠ^b){π(ττh)+(μ0μ^0)}(v)𝑑F(v)\displaystyle\hskip 14.45377pt+\int\rho(v)\{\pi^{*}(v)-\widehat{\pi}^{*}(v)\}(\Pi_{b}-\widehat{\Pi}_{b})\{\pi^{*}(\tau^{*}-\tau_{h}^{*})+(\mu_{0}^{*}-\widehat{\mu}_{0}^{*})\}(v)\ dF^{*}(v) (12)

by the orthogonality of a projection with its residuals (Lemma S1(i)).

Now we analyze the bias terms (11) and (12) separately; the first is the main bias term, which would arise even if the covariate density were known, and the second is the contribution coming from having to estimate the covariate density.

Crucially, by virtue of using the localized basis bhkb_{hk}, the projections in these bias terms are of stretched versions of the nuisance functions (ππ^)(\pi^{*}-\widehat{\pi}^{*}) and (μ0μ^0)(\mu_{0}^{*}-\widehat{\mu}_{0}^{*}), on the standard non-localized basis bb, with weights equal to the stretched density dFdF^{*}. This is important because stretching a function increases its smoothness; in particular, the stretched and scaled function g(v)/hαg^{*}(v)/h^{\alpha} is α\alpha-smooth whenever gg is α\alpha-smooth. This follows since |Dαg(v)Dαg(v)||D^{\lfloor\alpha\rfloor}g^{*}(v)-D^{\lfloor\alpha\rfloor}g^{*}(v^{\prime})| equals

|Dαg(x0+h(v1/2))Dαg(x0+h(v1/2))|\displaystyle\left|D^{\lfloor\alpha\rfloor}g(x_{0}+h(v-1/2))-D^{\lfloor\alpha\rfloor}g(x_{0}+h(v^{\prime}-1/2))\right|
=hα|g(α)(x0+h(v1/2))g(α)(x0+h(v1/2))|\displaystyle\hskip 28.90755pt=h^{\lfloor\alpha\rfloor}\left|g^{(\lfloor\alpha\rfloor)}(x_{0}+h(v-1/2))-g^{(\lfloor\alpha\rfloor)}(x_{0}+h(v^{\prime}-1/2))\right|
hα|vv|\displaystyle\hskip 28.90755pt\lesssim h^{\alpha}|v-v^{\prime}|

where the equality in the second line follows by the chain rule, and the inequality in the last line since gg is α\alpha-smooth. Thus the above implies hα|Dαg(v)Dαg(v)||vv|,h^{-\alpha}\left|D^{\lfloor\alpha\rfloor}g^{*}(v)-D^{\lfloor\alpha\rfloor}g^{*}(v^{\prime})\right|\lesssim|v-v^{\prime}|, i.e., that g(v)/hαg^{*}(v)/h^{\alpha} is α\alpha-smooth.
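As a concrete example, if g(x)=c\|x-x_{0}\|^{\alpha} for some \alpha\leq 1, then g^{*}(v)=ch^{\alpha}\|v-1/2\|^{\alpha}, so g^{*}(v)/h^{\alpha} is \alpha-smooth with exactly the same Hölder constant as g; the stretching is precisely what produces the extra factor h^{\alpha} in the approximation bound (13) below.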

Therefore if gg is α\alpha-smooth, then (IΠb)g/hαFkα/d\|(I-\Pi_{b})g^{*}/h^{\alpha}\|_{F^{*}}\lesssim k^{-\alpha/d} by the Hölder approximation properties (6) of the basis bb, and so it follows that

(IΠb)gFhαkα/d=(h/k1/d)α\|(I-\Pi_{b})g^{*}\|_{F^{*}}\ \lesssim\ h^{\alpha}k^{-\alpha/d}=\left(h/k^{1/d}\right)^{\alpha} (13)

for any α\alpha-smooth function gg.

Now consider the bias term (11). This term satisfies

[\displaystyle\int\Big{[} (IΠb){ρ(ππ^)}(v)][(IΠb){π(ττh)+(μ0μ^0)}(v)]dF(v)\displaystyle(I-\Pi_{b})\{\rho(\pi^{*}-\widehat{\pi}^{*})\}(v)\Big{]}\Big{[}(I-\Pi_{b})\{\pi^{*}(\tau^{*}-\tau_{h}^{*})+(\mu_{0}^{*}-\widehat{\mu}_{0}^{*})\}(v)\Big{]}\ dF^{*}(v)
(IΠb){ρ(ππ^)}F[(IΠb){π(ττh)}F\displaystyle\leq\|(I-\Pi_{b})\{\rho(\pi^{*}-\widehat{\pi}^{*})\}\|_{F^{*}}\Big{[}\|(I-\Pi_{b})\{\pi^{*}(\tau^{*}-\tau_{h}^{*})\}\|_{F^{*}}
+(IΠb)(μ0μ^0)F]\displaystyle\hskip 28.90755pt+\|(I-\Pi_{b})(\mu_{0}^{*}-\widehat{\mu}_{0}^{*})\|_{F^{*}}\Big{]}
(h/k1/d)α{hγ+(h/k1/d)β}=(h/k1/d)2s+hγ(h/k1/d)α\displaystyle\lesssim\ \left(h/k^{1/d}\right)^{\alpha}\Big{\{}h^{\gamma}+\left(h/k^{1/d}\right)^{\beta}\Big{\}}=\left(h/k^{1/d}\right)^{2s}+h^{\gamma}\left(h/k^{1/d}\right)^{\alpha}

where the second line follows by Cauchy-Schwarz, and the third by (13), since (ππ^)(\pi-\widehat{\pi}) and (μ0μ^0)(\mu_{0}-\widehat{\mu}_{0}) are assumed α\alpha- and β\beta-smooth, respectively (note ρ(v)\rho(v) is a polynomial, so the smoothness of ρ(ππ^)\rho(\pi^{*}-\widehat{\pi}^{*}) is the same as (ππ^)(\pi^{*}-\widehat{\pi}^{*})), along with the fact that

(IΠb)π(τ\displaystyle\|(I-\Pi_{b})\pi^{*}(\tau^{*} τh)F2π(ττh)F2\displaystyle-\tau_{h}^{*})\|_{F^{*}}^{2}\leq\|\pi^{*}(\tau^{*}-\tau_{h}^{*})\|_{F^{*}}^{2}
Kh(x){τ(x)τh(x)}2𝑑F(x)h2γ\displaystyle\leq\int K_{h}(x)\Big{\{}\tau(x)-\tau_{h}(x)\Big{\}}^{2}dF(x)\lesssim h^{2\gamma}

where the first inequality follows by Lemma S1(ii), the second by definition of FF^{*} and since π(x)1\pi(x)\leq 1, and the last by Proposition 3.

Now for the term in (12), let θb,g=Ω1bg𝑑F\theta_{b,g}=\Omega^{-1}\int bg\ dF^{*} denote the coefficients of the projection Πbg\Pi_{b}g, and note for any functions g1,g2g_{1},g_{2} we have

g1(ΠbΠ^b)(g2)𝑑F\displaystyle\int g_{1}(\Pi_{b}-\widehat{\Pi}_{b})(g_{2})\ dF^{*} =(Ω1/2θb,g1)TΩ1/2(Ω1Ω^1)Ω1/2(Ω1/2θb,g2)\displaystyle=\left(\Omega^{1/2}\theta_{b,g_{1}}\right)^{\mathrm{\scriptscriptstyle T}}\Omega^{1/2}(\Omega^{-1}-\widehat{\Omega}^{-1})\Omega^{1/2}\left(\Omega^{1/2}\theta_{b,g_{2}}\right)
g1FΩ1/2(Ω1Ω^1)Ω1/2g2F\displaystyle\leq\|g_{1}\|_{F^{*}}\|\Omega^{1/2}(\Omega^{-1}-\widehat{\Omega}^{-1})\Omega^{1/2}\|\|g_{2}\|_{F^{*}}
g1Fg2FΩΩ^1Ω1\displaystyle\leq\|g_{1}\|_{F^{*}}\|g_{2}\|_{F^{*}}\|\Omega\|\|\widehat{\Omega}^{-1}-\Omega^{-1}\|

where the first equality follows by definition, the second line since the L2L_{2} norm of the coefficients of a (weighted) projection is no more than the weighted L2(F)L_{2}(F^{*}) norm of the function itself (Lemma S1(iii)), and the last by the sub-multiplicative property of the operator norm, along with the fact that Ω1/22=Ω\|\Omega^{1/2}\|^{2}=\|\Omega\|. ∎


Several of our results require the eigenvalues of Ω\Omega to be bounded above and below away from zero. Proposition S6 in the Appendix gives simple sufficient conditions for this to hold (similar to Proposition S5 for the matrix QQ, which was mentioned earlier after Proposition 3).

The next result is a refined version of Proposition 5, giving high-level conditions under which estimation of FF itself (rather than the matrix Ω1\Omega^{-1}) does not impact the bias. We refer to Remark 11 for more detailed discussion of these conditions, and note that the result follows from Proposition 5 together with Lemma S2 in the Appendix.

Proposition 6.

Under the assumptions of Proposition 5, and if

  1. 1.

    λmin(Ω)\lambda_{\min}(\Omega) is bounded below away from zero,

  2. 2.

    dF^/dF\|d\widehat{F}^{*}/dF^{*}\|_{\infty} is bounded above and below away from zero,

  3. 3.

    (dF^/dF)1(h/k1/d)2sπ^πF(μ^0μ0F+hγ)\|(d\widehat{F}^{*}/dF^{*})-1\|_{\infty}\lesssim\frac{(h/k^{1/d})^{2s}}{\|\widehat{\pi}-\pi\|_{F^{*}}(\|\widehat{\mu}_{0}-\mu_{0}\|_{F^{*}}+h^{\gamma})},

then, when hγ(h/k1/d)βh^{\gamma}\lesssim(h/k^{1/d})^{\beta}, the bias satisfies |𝔼(S^jSjDn)|(h/k1/d)2s.|\mathbb{E}(\widehat{S}_{j}-S_{j}\mid D^{n})|\lesssim\left(h/k^{1/d}\right)^{2s}.


4.3 Variance

In this subsection we derive bounds on the conditional variance of the estimators R^j\widehat{R}_{j} and Q^j\widehat{Q}_{j\ell}, given the training sample DnD^{n}. The main tool used here is a localized version of second-order U-statistic variance arguments, recognizing that our higher-order estimator is, conditionally, a second-order U-statistic over nhdnh^{d} observations.

Proposition 7.

Assume:

  1. 1.

    y2y^{2}, π^2\widehat{\pi}^{2}, μ^02\widehat{\mu}_{0}^{2}, and μ^0μ0F\|\widehat{\mu}_{0}-\mu_{0}\|_{F^{*}} are all bounded above, and

  2. 2.

    λmax(Ω)\lambda_{\max}(\Omega) is bounded above.

Then

var(S^jDn)\displaystyle\text{var}(\widehat{S}_{j}\mid D^{n}) 1nhd(1+knhd(1+Ω^1Ω12)).\displaystyle\ \lesssim\ \frac{1}{nh^{d}}\left(1+\frac{k}{nh^{d}}\left(1+\|\widehat{\Omega}^{-1}-\Omega^{-1}\|^{2}\right)\right).

We give the proof of Proposition 7 in Appendix B.6, and so just make some comments here. First, the variance here is analogous to that of a higher-order (quadratic) influence function estimator (cf. Theorem 1 of Robins et al. (2009a)), except with sample size nn deflated to nhdnh^{d}. This is to be expected given the double localization in our proposed estimator. Another important note is that the contribution to the variance from having to estimate FF is relatively minimal, compared to the bias, as detailed in Proposition 5. For the bias, non-trivial rate conditions are needed to ensure estimation of FF does not play a role, whereas for the variance one only needs the operator norm of Ω^1Ω1\widehat{\Omega}^{-1}-\Omega^{-1} to be bounded (under regularity conditions, this amounts to the estimator F^\widehat{F} only having bounded errors, in a relative sense, as noted in the following remark).

Remark 9.

By Lemma S2, under the assumptions of Proposition 6, it follows that

Ω^1Ω1(dF^/dF)1,\|\widehat{\Omega}^{-1}-\Omega^{-1}\|\ \lesssim\ \|(d\widehat{F}^{*}/dF^{*})-1\|_{\infty},

so estimation of FF will not affect the conditional variances as long as the error of F^\widehat{F} is bounded in uniform norm.

4.4 Overall Rate

Combining the approximation bias in Proposition 3 with the decomposition in Proposition 4, and the bias and variance bounds from Proposition 6 and Proposition 7, respectively, shows that

𝔼P|τ^(x0)τP(x0)|hγ+(h/k1/d)2s+1nhd(1+knhd)\mathbb{E}_{P}|\widehat{\tau}(x_{0})-\tau_{P}(x_{0})|\ \lesssim\ h^{\gamma}+\left(h/k^{1/d}\right)^{2s}+\sqrt{\frac{1}{nh^{d}}\left(1+\frac{k}{nh^{d}}\right)} (14)

under all the combined assumptions of these results, which are compiled in the statement of Theorem 2 below.

The first two terms in (14) are the bias, with hγh^{\gamma} an oracle bias that would remain even if one had direct access to the potential outcomes (Y1Y0)(Y^{1}-Y^{0}) (or equivalently, samples of τ(X)+ϵ\tau(X)+\epsilon for some ϵ\epsilon with conditional mean zero), and (h/k1/d)2s(h/k^{1/d})^{2s} analogous to a squared nuisance bias term, but shrunken due to the stretching induced by the localized basis bhkb_{hk}. Similarly, 1/(nhd)1/(nh^{d}) is an oracle variance that would remain even if given access to the potential outcomes, whereas the k/(nhd)k/(nh^{d}) factor is a contribution from nuisance estimation (akin to the variance of a series regression on kk basis terms with nhdnh^{d} samples).

Balancing bias and variance in (14) by taking the tuning parameters to satisfy

hn(1/γ)/(1+d2γ+d4s)h\sim n^{-(1/\gamma)/\left(1+\frac{d}{2\gamma}+\frac{d}{4s}\right)}

and

kn(d2sdγ)/(1+d2γ+d4s)k\sim n^{\left(\frac{d}{2s}-\frac{d}{\gamma}\right)/\left(1+\frac{d}{2\gamma}+\frac{d}{4s}\right)}

ensures the rate matches the minimax lower bound from Theorem 1 (in the low-smoothness regime), proving that the lower bound is in fact tight. (In the high-smoothness regime where s>(d/4)/(1+d/2γ)s>(d/4)/(1+d/2\gamma), one can simply take knhdk\sim nh^{d} and hn1/(2γ+d)h\sim n^{-1/(2\gamma+d)}.) This is formalized in the following theorem.
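As a quick check on the exponent algebra (ours, not part of the proof), the sympy sketch below substitutes these choices of h and k into the terms of (14) and confirms that both bias terms and the nuisance variance term are all of order n^{-1/(1 + d/2γ + d/4s)}; the oracle variance 1/(nh^d) is of no larger order in the low-smoothness regime where these choices apply.

```python
import sympy as sp

d, gamma, s = sp.symbols('d gamma s', positive=True)
r = 1 / (1 + d / (2 * gamma) + d / (4 * s))   # target rate: error ~ n^{-r}
a = r / gamma                                  # h = n^{-a}
b = (d / (2 * s) - d / gamma) * r              # k = n^{b}

exponents = {
    'oracle bias h^gamma'              : -gamma * a,
    'nuisance bias (h/k^{1/d})^{2s}'   : -2 * s * (a + b / d),
    'nuisance variance sqrt(k)/(n h^d)': b / 2 - 1 + d * a,
}
for name, expo in exponents.items():
    print(name, sp.simplify(expo + r))   # each prints 0, i.e. all three terms are n^{-r}
```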

Theorem 2.

Assume the regularity conditions:

  1. A.

    The eigenvalues of QQ and Ω\Omega are bounded above and below away from zero.

  2. B.

    π^(x)π(x)\widehat{\pi}(x)-\pi(x) is α\alpha-smooth and μ^0(x)μ0(x)\widehat{\mu}_{0}(x)-\mu_{0}(x) is β\beta-smooth.

  3. C.

    The quantities y2y^{2}, (π^2,μ^02)(\widehat{\pi}^{2},\widehat{\mu}_{0}^{2}), μ^0μ0F\|\widehat{\mu}_{0}-\mu_{0}\|_{F^{*}}, and Q^1Q1\|\widehat{Q}^{-1}-{Q}^{-1}\| are all bounded above, and dF^/dF\|d\widehat{F}^{*}/dF^{*}\|_{\infty} is bounded above and below away from zero.

Also assume the basis bb satisfies Hölder approximating condition (6), and:

  1. 1.

    (dF^/dF)1n1/(1+d2γ+d4s(1+d2γ))π^πF(μ^0μ0F+hγ)\|(d\widehat{F}^{*}/dF^{*})-1\|_{\infty}\lesssim\frac{n^{-1/\left(1+\frac{d}{2\gamma}+\frac{d}{4s}\vee\left(1+\frac{d}{2\gamma}\right)\right)}}{\|\widehat{\pi}-\pi\|_{F^{*}}(\|\widehat{\mu}_{0}-\mu_{0}\|_{F^{*}}+h^{\gamma})},

  2. 2.

    π(x)\pi(x) is α\alpha-smooth, and ϵπ(x)1ϵ\epsilon\leq\pi(x)\leq 1-\epsilon for some ϵ>0\epsilon>0,

  3. 3.

    μ0(x)\mu_{0}(x) is β\beta-smooth,

  4. 4.

    τ(x)\tau(x) is γ\gamma-smooth.

Finally let the tuning parameters satisfy

hn(1/γ)/(1+d2γ+d4s) and kn(d2sdγ)/(1+d2γ+d4s)h\sim n^{-(1/\gamma)/\left(1+\frac{d}{2\gamma}+\frac{d}{4s}\right)}\ \text{ and }\ k\sim n^{\left(\frac{d}{2s}-\frac{d}{\gamma}\right)/\left(1+\frac{d}{2\gamma}+\frac{d}{4s}\right)}

if s<d/41+d/2γs<\frac{d/4}{1+d/2\gamma}, or hn12γ+dh\sim n^{-\frac{1}{2\gamma+d}} and knhdk\sim nh^{d} otherwise. Then the estimator τ^\widehat{\tau} from Definition 2 has error upper bounded as

𝔼P|τ^(x0)τP(x0)|{n1/(1+d2γ+d4s) if s<d/41+d/2γn1/(2+dγ) otherwise.\displaystyle\mathbb{E}_{P}|\widehat{\tau}(x_{0})-\tau_{P}(x_{0})|\lesssim\begin{cases}n^{-1/\left(1+\frac{d}{2\gamma}+\frac{d}{4s}\right)}&\text{ if }s<\frac{d/4}{1+d/2\gamma}\\ n^{-1/\left(2+\frac{d}{\gamma}\right)}&\text{ otherwise.}\end{cases}

We refer to Section 3.3 for more detailed discussion and visualization of the rate from Theorem 2. Here we give two remarks discussing the regularity conditions A–C and Condition 1 (which ensures the covariate distribution is estimated accurately enough).

Remark 10.

Condition A is a standard collinearity restriction used with least squares estimators; simple sufficient conditions are given in Propositions S5 and S6 in the Appendix. In Lemma S3 in the Appendix we also prove that this condition holds for a class of densities contained in the model 𝒫\mathcal{P} in Theorem 1, ensuring that the upper bound holds over the same submodel. A sufficient condition for Condition B to hold is that the estimators π^(x)\widehat{\pi}(x) and μ^0(x)\widehat{\mu}_{0}(x) match the (known) smoothness of π(x)\pi(x) and μ0(x)\mu_{0}(x); this would be the case for standard minimax optimal estimators based on series or local polynomial methods. Condition C is a mild boundedness condition on the outcome YY (which could be weakened at the expense of adding some complexity in the analysis), as well as on the nuisance estimators (F^,π^,μ^0)(\widehat{F}^{*},\widehat{\pi},\widehat{\mu}_{0}) and, even weaker, on the errors μ^0μ0F\|\widehat{\mu}_{0}-\mu_{0}\|_{F^{*}} and Q^1Q1\|\widehat{Q}^{-1}-{Q}^{-1}\| (which would typically not only be bounded but also converge to zero).

Remark 11.

First, Condition 1 of Theorem 2 will of course hold if the covariate distribution is estimated at a rate faster than that of the CATE (i.e., the numerator of the rate in Condition 1); however, it also holds under substantially weaker conditions, depending on how accurately π\pi and μ0\mu_{0} are estimated. This is because the condition really amounts to a third-order term (the covariate distribution error multiplied by a product of nuisance errors) being of smaller order than the CATE rate. Specifically, the result of Theorem 2 can also be written as

𝔼P|τ^(x0)τP(x0)|n1/(1+d2γ+d4s(1+d2γ))+R3,n,\mathbb{E}_{P}|\widehat{\tau}(x_{0})-\tau_{P}(x_{0})|\ \lesssim\ n^{-1/\left(1+\frac{d}{2\gamma}+\frac{d}{4s}\vee\left(1+\frac{d}{2\gamma}\right)\right)}+R_{3,n}, (15)

for the third-order error term

R3,n=(dF^/dF)1π^πF(μ^0μ0F+hγ),R_{3,n}=\|(d\widehat{F}^{*}/dF^{*})-1\|_{\infty}\|\widehat{\pi}-\pi\|_{F^{*}}\Big{(}\|\widehat{\mu}_{0}-\mu_{0}\|_{F^{*}}+h^{\gamma}\Big{)},

so that Condition 1 simply requires this third-order term to be of smaller order than the first minimax optimal rate in (15) (note in the above that hγh^{\gamma} matches the overall CATE estimation error, under our tuning parameter choices, which would typically be of smaller order than the regression error μ^0μ0F\|\widehat{\mu}_{0}-\mu_{0}\|_{F^{*}}). Second, we note that we leave the condition in terms of the L2(F)L_{2}(F^{*}) errors π^πF\|\widehat{\pi}-\pi\|_{F^{*}} and μ^0μ0F\|\widehat{\mu}_{0}-\mu_{0}\|_{F^{*}} because, although we assume π\pi and μ0\mu_{0} are α\alpha- and β\beta-smooth, technically, they do not need to be estimated at particular rates for any of the other results we prove to hold. Of course, under these smoothness assumptions, there are available minimax optimal estimators for which

π^πFn1/(2+d/α) and μ^0μ0Fn1/(2+d/β).\|\widehat{\pi}-\pi\|_{F^{*}}\asymp n^{-1/(2+d/\alpha)}\ \text{ and }\ \|\widehat{\mu}_{0}-\mu_{0}\|_{F^{*}}\asymp n^{-1/(2+d/\beta)}.

If in addition there exists some ζ\zeta for which (dF^/dF)1n1/(2+d/ζ)\|(d\widehat{F}^{*}/dF^{*})-1\|_{\infty}\asymp n^{-1/(2+d/\zeta)} (e.g., if FF has a density that is ζ\zeta-smooth), then Condition 1 reduces to ζ>d/(1/Mα,β,γ,d2)\zeta>{d}/{(1/M_{\alpha,\beta,\gamma,d}-2)}, for

Mα,β,γ,d11+d/2γ+d/4s(1+d/2γ)12+d/α12+d/β.M_{\alpha,\beta,\gamma,d}\equiv\frac{1}{1+d/2\gamma+d/4s\vee(1+d/2\gamma)}-\frac{1}{2+d/\alpha}-\frac{1}{2+d/\beta}.

Exploring CATE estimation under weaker conditions on the covariate distribution is an interesting avenue for future work; we suspect the minimax rate changes depending on what is assumed about this distribution, as is the case for average effects (e.g., page 338 of Robins et al. (2008)) and conditional variance estimation (Wang et al., 2008; Shen et al., 2020).
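For concreteness, the reduction of Condition 1 just described can be evaluated numerically; the helper below is purely illustrative (the smoothness values in the example call are hypothetical, chosen only so that the constant is positive) and simply evaluates M_{α,β,γ,d} and the implied threshold for ζ.

```python
def zeta_threshold(alpha, beta, gamma, d):
    """Evaluate M_{alpha,beta,gamma,d} and the threshold zeta > d / (1/M - 2)."""
    s = (alpha + beta) / 2
    cate_exp = 1 / max(1 + d / (2 * gamma) + d / (4 * s), 2 + d / gamma)  # CATE rate exponent
    M = cate_exp - 1 / (2 + d / alpha) - 1 / (2 + d / beta)
    if M <= 0:
        return M, float('inf')   # in this regime no finite zeta satisfies the reduction
    return M, d / (1 / M - 2)

# hypothetical smoothness values; returns roughly (0.11, 0.58)
print(zeta_threshold(alpha=0.5, beta=0.5, gamma=10.0, d=4))
```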

5 Discussion

In this paper we have characterized the minimax rate for estimating heterogeneous causal effects in a smooth nonparametric model. We derived a lower bound on the minimax rate using a localized version of the method of fuzzy hypotheses, and a matching upper bound via a new local polynomial R-Learner estimator based on higher-order influence functions. We also characterized how the minimax rate changes depending on whether the propensity score or the regression function is smoother, under models that parametrize either the control regression or the marginal regression function. The minimax rate has several important features. First, it exhibits a so-called elbow phenomenon: when the nuisance functions (regression and propensity scores) are smooth enough, the rate matches that of standard smooth nonparametric regression (the same that would be obtained if potential outcomes were actually observed). On the other hand, when the average nuisance smoothness is below the relevant threshold, the rate obtained is slower. This leads to a second important feature: in the latter low-smoothness regime, the minimax rate is a mixture of the minimax rates for nonparametric regression and functional estimation. This quantifies how the CATE can be viewed as a regression/functional hybrid.

There are numerous important avenues left for future work. We detail a few briefly here, based on: different error metrics, inference and testing, adaptivity, and practical implementation. First, we have focused on estimation error at a point, but one could also consider global rates in L2L_{2} or LL_{\infty} norm, for example. We expect L2L_{2} rates to be the same, and LL_{\infty} rates to be the same up to log factors, but verifying this would be useful. In addition, it would also be very important to study the distribution of the proposed estimator, beyond just bias and variance, e.g., for purposes of inference (LL_{\infty} rates could also be useful in this respect). Relatedly, one could consider minimax rates for testing whether the CATE is zero, for example, versus ϵ\epsilon-separated in some distance. The goal of the present work is mostly to further our theoretical understanding of the fundamental limits of CATE estimation, so there remains lots to do to make the optimal rates obtained here achievable in practice. For example, although we have specified particular values of the tuning parameters hh and kk to confirm attainability of our minimax lower bound, it would be practically useful to have more data-driven approaches for selection. In particular, the optimal tuning values depend on underlying smoothness, and since in practice this is often unknown, a natural next step is to study adaptivity. For example one could study whether approaches based on Lepski’s method could be used, as in Mukherjee et al. (2015) and Liu et al. (2021). There are also potential computational challenges associated with constructing the tensor products in ρ(x)\rho(x) when dimension dd is not small, as well as evaluating the U-statistic terms of our estimator, and inverting the matrices Q^\widehat{Q} and Ω^\widehat{\Omega}. Finally, in this work we have assumed the nuisance functions are Hölder-smooth, a classic infinite-dimensional function class from which important insights can be drawn. However, it will be important to explore minimax rates in other function classes as well.

Acknowledgements

EK gratefully acknowledges support from NSF DMS Grant 1810979, NSF CAREER Award 2047444, and NIH R01 Grant LM013361-01A1, and SB and LW from NSF Grant DMS1713003. EK also thanks Matteo Bonvini and Tiger Zeng for very helpful discussions.

References

  • Athey and Imbens [2016] S. Athey and G. Imbens. Recursive partitioning for heterogeneous causal effects. Proceedings of the National Academy of Sciences, 113(27):7353–7360, 2016.
  • Belloni et al. [2015] A. Belloni, V. Chernozhukov, D. Chetverikov, and K. Kato. Some new asymptotic theory for least squares series: pointwise and uniform results. Journal of Econometrics, 186(2):345–366, 2015.
  • Bickel and Ritov [1988] P. J. Bickel and Y. Ritov. Estimating integrated squared density derivatives: sharp best order of convergence estimates. Sankhyā, pages 381–393, 1988.
  • Birgé and Massart [1995] L. Birgé and P. Massart. Estimation of integral functionals of a density. The Annals of Statistics, 23(1):11–29, 1995.
  • Carbery and Wright [2001] A. Carbery and J. Wright. Distributional and lql^{q} norm inequalities for polynomials over convex bodies in rnr^{n}. Mathematical Research Letters, 8(3):233–248, 2001.
  • Chamberlain [1987] G. Chamberlain. Asymptotic efficiency in estimation with conditional moment restrictions. Journal of Econometrics, 34(3):305–334, 1987.
  • Chernozhukov et al. [2017] V. Chernozhukov, M. Goldman, V. Semenova, and M. Taddy. Orthogonal machine learning for demand estimation: High dimensional causal inference in dynamic panels. arXiv preprint arXiv:1712.09988, 2017.
  • Fan et al. [2019] Q. Fan, Y.-C. Hsu, R. P. Lieli, and Y. Zhang. Estimation of conditional average treatment effects with high-dimensional data. arXiv preprint arXiv:1908.02399, 2019.
  • Foster and Syrgkanis [2019] D. J. Foster and V. Syrgkanis. Orthogonal statistical learning. arXiv preprint arXiv:1901.09036, 2019.
  • Foster et al. [2011] J. C. Foster, J. M. Taylor, and S. J. Ruberg. Subgroup identification from randomized clinical trial data. Statistics in Medicine, 30(24):2867–2880, 2011.
  • Gao and Han [2020] Z. Gao and Y. Han. Minimax optimal nonparametric estimation of heterogeneous treatment effects. arXiv preprint arXiv:2002.06471, 2020.
  • Giné and Nickl [2021] E. Giné and R. Nickl. Mathematical foundations of infinite-dimensional statistical models. Cambridge University Press, 2021.
  • Hahn et al. [2020] P. R. Hahn, J. S. Murray, and C. M. Carvalho. Bayesian regression tree models for causal inference: regularization, confounding, and heterogeneous effects. Bayesian Analysis, 2020.
  • Hernán and Robins [2020] M. A. Hernán and J. M. Robins. Causal inference: What if. CRC Boca Raton, FL, 2020.
  • Ibragimov et al. [1987] I. A. Ibragimov, A. S. Nemirovskii, and R. Khas’minskii. Some problems on nonparametric estimation in gaussian white noise. Theory of Probability & Its Applications, 31(3):391–406, 1987.
  • Imai and Ratkovic [2013] K. Imai and M. Ratkovic. Estimating treatment effect heterogeneity in randomized program evaluation. The Annals of Applied Statistics, 7(1):443–470, 2013.
  • Ingster et al. [2003] Y. Ingster, J. I. Ingster, and I. Suslina. Nonparametric goodness-of-fit testing under Gaussian models, volume 169. Springer Science & Business Media, 2003.
  • Kennedy [2023] E. H. Kennedy. Towards optimal doubly robust estimation of heterogeneous causal effects. Electronic Journal of Statistics, 17(2):3008–3049, 2023.
  • Künzel et al. [2019] S. R. Künzel, J. S. Sekhon, P. J. Bickel, and B. Yu. Metalearners for estimating heterogeneous treatment effects using machine learning. Proceedings of the National Academy of Sciences, 116(10):4156–4165, 2019.
  • Lee et al. [2017] S. Lee, R. Okui, and Y.-J. Whang. Doubly robust uniform confidence band for the conditional average treatment effect function. Journal of Applied Econometrics, 32(7):1207–1225, 2017.
  • Liu et al. [2021] L. Liu, R. Mukherjee, J. M. Robins, and E. J. Tchetgen Tchetgen. Adaptive estimation of nonparametric functionals. Journal of Machine Learning Research, 22(99):1–66, 2021.
  • Luedtke and van der Laan [2016] A. R. Luedtke and M. J. van der Laan. Super-learning of an optimal dynamic treatment rule. The International Journal of Biostatistics, 12(1):305–332, 2016.
  • Mukherjee et al. [2015] R. Mukherjee, E. J. Tchetgen Tchetgen, and J. M. Robins. Lepski’s method and adaptive estimation of nonlinear integral functionals of density. arXiv preprint arXiv:1508.00249, 2015.
  • Mukherjee et al. [2017] R. Mukherjee, W. K. Newey, and J. M. Robins. Semiparametric efficient empirical higher order influence function estimators. arXiv preprint arXiv:1705.07577, 2017.
  • Nemirovski [2000] A. Nemirovski. Topics in non-parametric statistics. Ecole d’Eté de Probabilités de Saint-Flour, 28:85, 2000.
  • Nie and Wager [2021] X. Nie and S. Wager. Quasi-oracle estimation of heterogeneous treatment effects. Biometrika, 108(2):299–319, 2021.
  • Robins [1994] J. M. Robins. Correcting for non-compliance in randomized trials using structural nested mean models. Communications in Statistics - Theory and Methods, 23(8):2379–2412, 1994.
  • Robins et al. [1992] J. M. Robins, S. D. Mark, and W. K. Newey. Estimating exposure effects by modelling the expectation of exposure conditional on confounders. Biometrics, 48(2):479–495, 1992.
  • Robins et al. [2008] J. M. Robins, L. Li, E. J. Tchetgen Tchetgen, and A. W. van der Vaart. Higher order influence functions and minimax estimation of nonlinear functionals. Probability and Statistics: Essays in Honor of David A. Freedman, pages 335–421, 2008.
  • Robins et al. [2009a] J. M. Robins, L. Li, E. J. Tchetgen Tchetgen, and A. W. van der Vaart. Quadratic semiparametric von mises calculus. Metrika, 69(2-3):227–247, 2009a.
  • Robins et al. [2009b] J. M. Robins, E. J. Tchetgen Tchetgen, L. Li, and A. W. van der Vaart. Semiparametric minimax rates. Electronic Journal of Statistics, 3:1305–1321, 2009b.
  • Robins et al. [2017] J. M. Robins, L. Li, R. Mukherjee, E. J. Tchetgen Tchetgen, and A. W. van der Vaart. Minimax estimation of a functional on a structured high dimensional model. The Annals of Statistics, 45(5):1951–1987, 2017.
  • Robinson [1988] P. M. Robinson. Root-n-consistent semiparametric regression. Econometrica, 56(4):931–954, 1988.
  • Semenova and Chernozhukov [2017] V. Semenova and V. Chernozhukov. Estimation and inference about conditional average treatment effect and other structural functions. arXiv, pages arXiv–1702, 2017.
  • Shalit et al. [2017] U. Shalit, F. D. Johansson, and D. Sontag. Estimating individual treatment effect: generalization bounds and algorithms. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 3076–3085. JMLR.org, 2017.
  • Shen et al. [2020] Y. Shen, C. Gao, D. Witten, and F. Han. Optimal estimation of variance in nonparametric regression with random design. The Annals of Statistics, 48(6):3589–3618, 2020.
  • Tsybakov [2009] A. B. Tsybakov. Introduction to Nonparametric Estimation. New York: Springer, 2009.
  • van der Laan [2005] M. J. van der Laan. Statistical inference for variable importance. The International Journal of Biostatistics, 188, 2005.
  • van der Laan and Robins [2003] M. J. van der Laan and J. M. Robins. Unified Methods for Censored Longitudinal Data and Causality. New York: Springer, 2003.
  • Vansteelandt and Joffe [2014] S. Vansteelandt and M. M. Joffe. Structural nested models and g-estimation: the partially realized promise. Statistical Science, 29(4):707–731, 2014.
  • Wager and Athey [2018] S. Wager and S. Athey. Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association, 113(523):1228–1242, 2018.
  • Wang et al. [2008] L. Wang, L. D. Brown, T. T. Cai, and M. Levine. Effect of mean on variance function estimation in nonparametric regression. The Annals of Statistics, 36(2):646–664, 2008.
  • Zhao et al. [2017] Q. Zhao, D. S. Small, and A. Ertefaie. Selective inference for effect modification via the lasso. arXiv preprint arXiv:1705.08020, 2017.
  • Zimmert and Lechner [2019] M. Zimmert and M. Lechner. Nonparametric estimation of causal heterogeneity under high-dimensional confounding. arXiv preprint arXiv:1908.08779, 2019.
Appendices

Roadmap of Main Results

The following figures illustrate how Theorems 1 and 2 follow from supporting propositions and lemmas (with arrows indicating that a result plays a role in implying another).

[Diagram: Lemma 2 (generic Hellinger bound), Proposition S3 (density expression), Proposition S4 (density under P_λ, Q_λ), and Proposition 1 (our Hellinger bound), together with Proposition 2 (choice of h, k) and Lemma 1 (fuzzy hypotheses), feed into Theorem 1.]

Figure 3: Roadmap for lower bound result in Theorem 1.

[Diagram: Lemma S1 (projection properties) and Lemma S2 (Ω-hat error bounded by F-hat error) feed into Proposition 5 (bias of S-hat_j) and Proposition 6 (bias of S-hat_j with accurate F-hat); these, together with Proposition 7 (variance of S-hat_j), Proposition 4 (τ-hat error bounded by RMSE of S-hat_j), and Proposition 3 (smoothing bias), feed into Theorem 2.]

Figure 4: Roadmap for upper bound result in Theorem 2.

Propositions S1 and S2 are not included in the above figures since they support the alternative parametrization detailed in Theorem 3; however they play roles analogous to Propositions 4 and 5, respectively. Propositions S5 and S6 and Lemma S3 are not included since they are stand-alone results, giving conditions under which bounded eigenvalue assumptions hold.

Appendix A Alternative Parametrization

In the main text we focused on the model that puts smoothness assumptions on (π,μ0,τ)(\pi,\mu_{0},\tau), and gave a detailed proof of the minimax lower bound in the αβ\alpha\geq\beta case. In this section we give lower bound constructions for the βα\beta\geq\alpha case, as well as minimax lower bounds for a different model that puts smoothness assumptions on (π,η,τ)(\pi,\eta,\tau), i.e., the marginal regression η(x)\eta(x) rather than the control regression μ0(x)\mu_{0}(x).

In the main text we also gave an estimator that was minimax optimal for all (α,β)(\alpha,\beta) in the (π,μ0,τ)(\pi,\mu_{0},\tau) model (under some conditions on the covariate density). This estimator is also optimal in the (π,η,τ)(\pi,\eta,\tau) model, as described shortly in Section A.2. Here we also give matching upper bounds for the (π,η,τ)(\pi,\eta,\tau) model using a different but closely related higher-order estimator (which itself is also minimax optimal in the αβ\alpha\geq\beta case in the (π,μ0,τ)(\pi,\mu_{0},\tau) model).

A.1 Lower Bounds

The following table summarizes the constructions used to derive all minimax lower bounds stated in this paper.

Construction
Model Regime pλp_{\lambda} qλq_{\lambda}
μ0\mu_{0} is β\beta-smooth αβ\alpha\geq\beta (1/2,μ0λτh/2,τh)(1/2,\mu_{0\lambda}-\tau_{h}/2,\tau_{h}) (πλ,μ0λ,0)(\pi_{\lambda},\mu_{0\lambda},0)
βα\beta\geq\alpha (πλ,1/2τh/2,τh)(\pi_{\lambda},1/2-\tau_{h}/2,\tau_{h}) (πλ,μ0λ,0)(\pi_{\lambda},\mu_{0\lambda},0)
η\eta is β\beta-smooth αβ\alpha\geq\beta (1/2,ηλ,τh)(1/2,\eta_{\lambda},\tau_{h}) (πλ,ηλ,0)(\pi_{\lambda},\eta_{\lambda},0)
βα\beta\geq\alpha (1/2,1/2,1τh)(1/2,1/2,1-\tau_{h}) (πλ,1/2,1)(\pi_{\lambda},1/2,1)
Table 1: The constructions used for our lower bound arguments. For the μ0\mu_{0} parametrization the density components are (π,μ0,τ)(\pi,\mu_{0},\tau) and for the η\eta parametrization the components are (π,η,τ)(\pi,\eta,\tau). The first row is the construction detailed in the main text. The constructions in the first and third rows are equivalent, since η=πτ+μ0\eta=\pi\tau+\mu_{0}, so that η=μ0λ\eta=\mu_{0\lambda} in the first row. These first/third rows are the construction used in the first preprint of this paper.

Theorem 3 below states the minimax rate for the (π,η,τ)(\pi,\eta,\tau) model that puts smoothness assumptions on the marginal regression function η\eta, rather than the control regression μ0\mu_{0}.

Theorem 3.

For x0(0,1)x_{0}\in(0,1), let 𝒫\mathcal{P} denote the model where:

  1. 1.

    f(x)f(x) is bounded above by a constant,

  2. 2.

    π(x)\pi(x) is α\alpha-smooth,

  3. 3.

    η(x)\eta(x) is β\beta-smooth, and

  4. 4.

    τ(x)\tau(x) is γ\gamma-smooth.

Let s(α+β)/2s\equiv(\alpha+\beta)/2. Then for nn larger than a constant depending on (α,β,γ,d)(\alpha,\beta,\gamma,d), the minimax rate is lower bounded as

infτ^supP𝒫𝔼P|τ^(x0)τP(x0)|{n1/(1+d2γ+d4min(α,s)) if min(α,s)<d/41+d/2γn1/(2+dγ) otherwise.\displaystyle\inf_{\widehat{\tau}}\sup_{P\in\mathcal{P}}\mathbb{E}_{P}|\widehat{\tau}(x_{0})-\tau_{P}(x_{0})|\gtrsim\begin{cases}n^{-1/\left(1+\frac{d}{2\gamma}+\frac{d}{4\min(\alpha,s)}\right)}&\text{ if }\min(\alpha,s)<\frac{d/4}{1+d/2\gamma}\\ n^{-1/\left(2+\frac{d}{\gamma}\right)}&\text{ otherwise}.\end{cases}

In the following subsections we give proofs of Theorems 1 and 3 for the βα\beta\geq\alpha regimes not detailed in the main text. Given the constructions listed in Table 1, the arguments are all very similar to those presented in the main text for the αβ\alpha\geq\beta case.

A.1.1 Setting 1: αβ\alpha\geq\beta

When αβ\alpha\geq\beta, the same construction can be used for both the μ0\mu_{0} and η\eta parametrizations. This is the construction in the first and third rows of Table 1, and detailed in the main text. Thus the proofs there settle this case.

A.1.2 Setting 2: βα\beta\geq\alpha and η\eta is β\beta-smooth

For the construction in the 4th row of Table 1, η(x)=1/2\eta(x)=1/2 is constant and so certainly β\beta-smooth, and π(x)\pi(x) is α\alpha-smooth as in the main text. Further we have by Proposition S3 that

pλ(z)\displaystyle p_{\lambda}(z) =f(x)[14+(2a1)(2y1){14hγ4B(xx02h)}]\displaystyle=f(x)\left[\frac{1}{4}+(2a-1)(2y-1)\left\{\frac{1}{4}-\frac{h^{\gamma}}{4}B\left(\frac{x-x_{0}}{2h}\right)\right\}\right]
qλ(z)\displaystyle q_{\lambda}(z) =f(x)[14+(a1/2)Δπ(x)+(2a1)(2y1){14Δπ(x)2}]\displaystyle=f(x)\left[\frac{1}{4}+(a-1/2)\Delta_{\pi}(x)+(2a-1)(2y-1)\left\{\frac{1}{4}-\Delta_{\pi}(x)^{2}\right\}\right]

with Δπ(x)=(h/k1/d)αj=1kλjB(xmjh/k1/d)\Delta_{\pi}(x)=\left(h/k^{1/d}\right)^{\alpha}\sum_{j=1}^{k}\lambda_{j}B\left(\frac{x-m_{j}}{h/k^{1/d}}\right), so that p¯(z)=pλ(z)\overline{p}(z)=p_{\lambda}(z) and

q¯(z)\displaystyle\overline{q}(z) =f(x)[14+(2a1)(2y1){14(h/k1/d)2αj=1kB(xmjh/k1/d)2}].\displaystyle=f(x)\left[\frac{1}{4}+(2a-1)(2y-1)\left\{\frac{1}{4}-\left(h/k^{1/d}\right)^{2\alpha}\sum_{j=1}^{k}B\left(\frac{x-m_{j}}{h/k^{1/d}}\right)^{2}\right\}\right].

Now we detail the quantities (δ1,δ2,δ3)(\delta_{1},\delta_{2},\delta_{3}) from Lemma 2, which yield a bound on the Hellinger distance between mixtures of nn-fold products from PλP_{\lambda} and QλQ_{\lambda}.

First, since p¯(z)=pλ(z)\overline{p}(z)=p_{\lambda}(z) it immediately follows that δ1=0\delta_{1}=0. Similarly, taking

hγ4=(h/k1/d)2α\frac{h^{\gamma}}{4}=\left(h/k^{1/d}\right)^{2\alpha}

ensures that p¯(z)=q¯(z)\overline{p}(z)=\overline{q}(z) and thus δ3=0\delta_{3}=0, just as in (2) in the main text. For δ2\delta_{2}, note that qλ(z)pλ(z)=f(x)(a1/2)Δπ(x)q_{\lambda}(z)-p_{\lambda}(z)=f(x)(a-1/2)\Delta_{\pi}(x) (with the above choices of hh and kk) so that

δ2(2dB22ϵ/2)(h/k1/d)2α\delta_{2}\leq\left(\frac{2^{d}\|B\|_{2}^{2}}{\epsilon/2}\right)\left(h/k^{1/d}\right)^{2\alpha}

by the same arguments as for δ1\delta_{1} in the main text. Now by Lemma 2 the Hellinger distance is bounded above as

H2(Pλn𝑑ϖ(λ),Qλn𝑑ϖ(λ))(n2h2dk)(h/k1/d)4αH^{2}\left(\int P_{\lambda}^{n}\ d\varpi(\lambda),\int Q_{\lambda}^{n}\ d\varpi(\lambda)\right)\ \lesssim\ \left(\frac{n^{2}h^{2d}}{k}\right)\left(h/k^{1/d}\right)^{4\alpha}

so taking

hk1/d=(hγ4)1/2α(1n2)1d+4α+2αd/γ\frac{h}{k^{1/d}}=\left(\frac{h^{\gamma}}{4}\right)^{1/2\alpha}\sim\left(\frac{1}{n^{2}}\right)^{\frac{1}{d+4\alpha+2\alpha d/\gamma}}

ensures the Hellinger distance is bounded above by one, and yields the promised minimax rate, since the separation in τ(x0)\tau(x_{0}) under this construction is

hγ4=(hk1/d)2αn1/(1+d2γ+d4α).\frac{h^{\gamma}}{4}=\left(\frac{h}{k^{1/d}}\right)^{2\alpha}\sim n^{-1/\left(1+\frac{d}{2\gamma}+\frac{d}{4\alpha}\right)}.
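The calibration above can also be checked symbolically; the following sympy sketch (an illustration, not part of the argument) works on the exponent scale, writing eps = h/k^{1/d} = n^{-e}, and recovers both the choice of e in the display above and the stated separation rate.

```python
import sympy as sp

d, alpha, gamma, e = sp.symbols('d alpha gamma e', positive=True)

h_exp   = -2 * alpha * e / gamma            # h = n^{h_exp}, from h^gamma = eps^{2 alpha}
k_exp   = d * (h_exp + e)                   # k = (h / eps)^d = n^{k_exp}
hel_exp = 2 + 2 * d * h_exp - 4 * alpha * e - k_exp   # exponent of n in n^2 h^{2d} eps^{4 alpha} / k

e_star = sp.solve(sp.Eq(hel_exp, 0), e)[0]  # makes the Hellinger bound of constant order
sep    = sp.simplify(2 * alpha * e_star)    # separation h^gamma = eps^{2 alpha} = n^{-2 alpha e}
print(sp.simplify(sep - 1 / (1 + d / (2 * gamma) + d / (4 * alpha))))   # prints 0
```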

A.1.3 Setting 3: βα\beta\geq\alpha and μ0\mu_{0} is β\beta-smooth

For the construction in the 2nd row of Table 1 it is again clear that π(x)\pi(x) and μ0(x)\mu_{0}(x) are α\alpha- and β\beta-smooth, respectively. Note for the latter that, under pλp_{\lambda}, the control regression μ0(x)=1/2τh(x)/2\mu_{0}(x)=1/2-\tau_{h}(x)/2 is γ\gamma-smooth and γβ\gamma\geq\beta by definition. By Proposition S3 we have

qλ(z)=f(x){14+(a1/2)Δπ(x)+(y1/2)Δμ(x)+(2a1)(2y1)Δπ(x)Δμ(x)}q_{\lambda}(z)=f(x)\left\{\frac{1}{4}+(a-1/2)\Delta_{\pi}(x)+(y-1/2)\Delta_{\mu}(x)+(2a-1)(2y-1)\Delta_{\pi}(x)\Delta_{\mu}(x)\right\}

with Δπ(x)=(h/k1/d)αj=1kλjB(xmjh/k1/d)\Delta_{\pi}(x)=\left(h/k^{1/d}\right)^{\alpha}\sum_{j=1}^{k}\lambda_{j}B\left(\frac{x-m_{j}}{h/k^{1/d}}\right) and Δμ(x)=(h/k1/d)βj=1kλjB(xmjh/k1/d)\Delta_{\mu}(x)=\left(h/k^{1/d}\right)^{\beta}\sum_{j=1}^{k}\lambda_{j}B\left(\frac{x-m_{j}}{h/k^{1/d}}\right), as in the main text, and

pλ(z)\displaystyle p_{\lambda}(z) =f(x)(14+(a1/2)Δπ(x)(y1/2)τh(x)/2\displaystyle=f(x)\bigg{(}\frac{1}{4}+(a-1/2)\Delta_{\pi}(x)-(y-1/2)\tau_{h}(x)/2
+(2a1)(2y1)[Δπ(x)τh(x)/2+aτh(x){12+Δπ(x)}])\displaystyle\hskip 36.135pt+(2a-1)(2y-1)\left[-\Delta_{\pi}(x)\tau_{h}(x)/2+a\tau_{h}(x)\left\{\frac{1}{2}+\Delta_{\pi}(x)\right\}\right]\bigg{)}
=f(x)[14+(a1/2)Δπ(x)+(2a1)(2y1){τh(x)4+(a1/2)Δπ(x)τh(x)}]\displaystyle=f(x)\left[\frac{1}{4}+(a-1/2)\Delta_{\pi}(x)+(2a-1)(2y-1)\left\{\frac{\tau_{h}(x)}{4}+(a-1/2)\Delta_{\pi}(x)\tau_{h}(x)\right\}\right]

where the second equality follows since (2a1)2=1(2a-1)^{2}=1. Therefore

p¯(z)\displaystyle\overline{p}(z) =f(x){14+(2a1)(2y1)hγ4B(xx02h)}\displaystyle=f(x)\left\{\frac{1}{4}+(2a-1)(2y-1)\frac{h^{\gamma}}{4}B\left(\frac{x-x_{0}}{2h}\right)\right\}
q¯(z)\displaystyle\overline{q}(z) =f(x){14+(2a1)(2y1)(h/k1/d)2sj=1kB(xmjh/k1/d)2}\displaystyle=f(x)\left\{\frac{1}{4}+(2a-1)(2y-1)\left(h/k^{1/d}\right)^{2s}\sum_{j=1}^{k}B\left(\frac{x-m_{j}}{h/k^{1/d}}\right)^{2}\right\}

as in the main text. Now we again detail the quantities (δ1,δ2,δ3)(\delta_{1},\delta_{2},\delta_{3}) from Lemma 2.

Since the marginalized densities p¯(z)\overline{p}(z) and q¯(z)\overline{q}(z) here take the same form as those in the main text, the same arguments imply δ3=0\delta_{3}=0 as long as hγ/4=(h/k1/d)2sh^{\gamma}/4=(h/k^{1/d})^{2s}. Since

{pλ(z)p¯(z)}2\displaystyle\{p_{\lambda}(z)-\overline{p}(z)\}^{2} =[f(x){(a1/2)+(y1/2)τh(x)}Δπ(x)]2f(x)2Δπ(x)2\displaystyle=\left[f(x)\Big{\{}(a-1/2)+(y-1/2)\tau_{h}(x)\Big{\}}\Delta_{\pi}(x)\right]^{2}\leq f(x)^{2}\Delta_{\pi}(x)^{2}

as long as τh(x)1\tau_{h}(x)\leq 1, we have that

δ1(h/k1/d)2α\delta_{1}\lesssim\left(h/k^{1/d}\right)^{2\alpha}

by the same arguments as for δ1\delta_{1} in the main text (and δ2\delta_{2} in the previous subsection). Similarly, with the above choices of hh and kk ensuring δ3=0\delta_{3}=0, we have

{qλ(z)pλ(z)}2\displaystyle\{q_{\lambda}(z)-p_{\lambda}(z)\}^{2} =[f(x)(y1/2){Δμ(x)τh(x)Δπ(x)}]2\displaystyle=\left[f(x)(y-1/2)\Big{\{}\Delta_{\mu}(x)-\tau_{h}(x)\Delta_{\pi}(x)\Big{\}}\right]^{2}
=[f(x)(y1/2)Δμ(x){14Δπ(x)2}]2\displaystyle=\left[f(x)(y-1/2)\Delta_{\mu}(x)\Big{\{}1-4\Delta_{\pi}(x)^{2}\Big{\}}\right]^{2}
f(x)2Δμ(x)2\displaystyle\lesssim f(x)^{2}\Delta_{\mu}(x)^{2}

where the second equality follows since τh(x)/4=Δπ(x)Δμ(x)\tau_{h}(x)/4=\Delta_{\pi}(x)\Delta_{\mu}(x), and the last inequality since Δπ(x)2\Delta_{\pi}(x)^{2} is bounded. Therefore

δ2(h/k1/d)2β\delta_{2}\lesssim\left(h/k^{1/d}\right)^{2\beta}

by the same arguments as for δ1\delta_{1} and δ2\delta_{2} in the main text. This implies by Lemma 2 that the Hellinger distance is bounded above by

(n2h2dk){(h/k1/d)4s+(h/k1/d)4β}(n2h2dk)(h/k1/d)4s\left(\frac{n^{2}h^{2d}}{k}\right)\left\{\left(h/k^{1/d}\right)^{4s}+\left(h/k^{1/d}\right)^{4\beta}\right\}\ \lesssim\ \left(\frac{n^{2}h^{2d}}{k}\right)\left(h/k^{1/d}\right)^{4s}

where the inequality follows since βα\beta\geq\alpha in this setting. Now, as in the main text, taking

hk1/d=(hγ4)1/2s(1n2)1d+4s+2sd/γ\frac{h}{k^{1/d}}=\left(\frac{h^{\gamma}}{4}\right)^{1/2s}\sim\left(\frac{1}{n^{2}}\right)^{\frac{1}{d+4s+2sd/\gamma}}

ensures the Hellinger distance is bounded above by one, and yields the promised minimax rate, since the separation in τ(x0)\tau(x_{0}) under this construction is

hγ4=(hk1/d)2sn1/(1+d2γ+d4s).\frac{h^{\gamma}}{4}=\left(\frac{h}{k^{1/d}}\right)^{2s}\sim n^{-1/\left(1+\frac{d}{2\gamma}+\frac{d}{4s}\right)}.

A.2 Upper Bounds

In the model where η\eta instead of μ0\mu_{0} is assumed to be β\beta-smooth, the estimator described in the main text is also minimax optimal (since if η\eta is β\beta-smooth, then μ0=ηπτ\mu_{0}=\eta-\pi\tau is min(α,β)\min(\alpha,\beta)-smooth, and so in the α<β\alpha<\beta case one gets the same rates as in the main text with ss replaced by min(s,α)\min(s,\alpha)). However, one can also use the following alternative estimator, which we detail for completeness.

Definition 3 (Higher-Order Local Polynomial R-Learner).

Let Kh(x)K_{h}(x) and ρh(x)\rho_{h}(x) be defined exactly as in Definition 2. The proposed estimator is then defined as

τ^(x0)=ρh(x0)TQ^1R^\widehat{\tau}(x_{0})=\rho_{h}(x_{0})^{\mathrm{\scriptscriptstyle T}}\widehat{Q}^{-1}\widehat{R} (16)

where Q^\widehat{Q} is a q×qq\times q matrix and R^\widehat{R} a qq-vector now given by

Q^\displaystyle\widehat{Q} =n{ρh(X)Kh(X)φ^a1(Z)ρh(X)T}+𝕌n{ρh(X1)Kh(X1)φ^a2(Z1,Z2)Kh(X2)ρh(X1)T}\displaystyle=\mathbb{P}_{n}\Big{\{}\rho_{h}(X)K_{h}(X)\widehat{\varphi}_{a1}(Z)\rho_{h}(X)^{\mathrm{\scriptscriptstyle T}}\Big{\}}+\mathbb{U}_{n}\Big{\{}\rho_{h}(X_{1})K_{h}(X_{1})\widehat{\varphi}_{a2}(Z_{1},Z_{2})K_{h}(X_{2})\rho_{h}(X_{1})^{\mathrm{\scriptscriptstyle T}}\Big{\}}
R^\displaystyle\widehat{R} =n{ρh(X1)Kh(X1)φ^y1(Z1)}+𝕌n{ρh(X1)Kh(X1)φ^y2(Z1,Z2)Kh(X2)},\displaystyle=\mathbb{P}_{n}\Big{\{}\rho_{h}(X_{1})K_{h}(X_{1})\widehat{\varphi}_{y1}(Z_{1})\Big{\}}+\mathbb{U}_{n}\Big{\{}\rho_{h}(X_{1})K_{h}(X_{1})\widehat{\varphi}_{y2}(Z_{1},Z_{2})K_{h}(X_{2})\Big{\}},

respectively, with

φ^a1(Z)\displaystyle\widehat{\varphi}_{a1}(Z) ={Aπ^(X)}2\displaystyle=\{A-\widehat{\pi}(X)\}^{2}
φ^y1(Z)\displaystyle\widehat{\varphi}_{y1}(Z) ={Yη^(X)}{Aπ^(X)}\displaystyle=\{Y-\widehat{\eta}(X)\}\{A-\widehat{\pi}(X)\}
φ^a2(Z1,Z2)\displaystyle\widehat{\varphi}_{a2}(Z_{1},Z_{2}) ={A1π^(X1)}bhk(X1)TΩ^1bhk(X2){A2π^(X2)}\displaystyle=-\{A_{1}-\widehat{\pi}(X_{1})\}b_{hk}(X_{1})^{\mathrm{\scriptscriptstyle T}}\widehat{\Omega}^{-1}b_{hk}(X_{2})\{A_{2}-\widehat{\pi}(X_{2})\}
φ^y2(Z1,Z2)\displaystyle\widehat{\varphi}_{y2}(Z_{1},Z_{2}) ={A1π^(X1)}bhk(X1)TΩ^1bhk(X2){Y2η^(X2)}\displaystyle=-\{A_{1}-\widehat{\pi}(X_{1})\}b_{hk}(X_{1})^{\mathrm{\scriptscriptstyle T}}\widehat{\Omega}^{-1}b_{hk}(X_{2})\{Y_{2}-\widehat{\eta}(X_{2})\}
bhk(x)\displaystyle b_{hk}(x) =b{1/2+(xx0)/h}1(xx0h/2)\displaystyle=b\{1/2+(x-x_{0})/h\}\mathbbold{1}(\|x-x_{0}\|\leq h/2)
Ω^\displaystyle\widehat{\Omega} =v[0,1]db(v)b(v)T𝑑F^(x0+h(v1/2))\displaystyle=\int_{v\in[0,1]^{d}}b(v)b(v)^{\mathrm{\scriptscriptstyle T}}\ d\widehat{F}(x_{0}+h(v-1/2))

for b:dkb:\mathbb{R}^{d}\mapsto\mathbb{R}^{k} a basis of dimension kk. The nuisance estimators (F^,π^,η^)(\widehat{F},\widehat{\pi},\widehat{\eta}) are constructed from a separate training sample DnD^{n}, independent of that on which 𝕌n\mathbb{U}_{n} operates.
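To fix ideas, the snippet below sketches only the first-order part of the estimator (16) in d = 1, with a local-linear basis, a boxcar kernel, and oracle values plugged in for π and η; the U-statistic correction, the localized basis b_hk, and the estimation of F are all omitted. It is therefore an illustration (with data-generating choices that are ours) of how the pieces fit together, not an implementation of the full procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
n, x0, h = 5000, 0.5, 0.25

X   = rng.uniform(0, 1, n)
pi  = 0.4 + 0.2 * X                      # hypothetical propensity score
A   = rng.binomial(1, pi)
tau = np.sin(3 * X)                      # hypothetical CATE
mu0 = X ** 2
Y   = mu0 + A * tau + rng.normal(0, 0.1, n)
eta = mu0 + pi * tau                     # marginal regression E(Y | X), used as an oracle eta-hat

rho = np.vstack([np.ones(n), (X - x0) / h])      # local-linear basis rho_h(X), d = 1
K   = (np.abs(X - x0) <= h / 2) / h              # boxcar kernel K_h(X)

phi_a1 = (A - pi) ** 2                           # first-order terms of (16)
phi_y1 = (Y - eta) * (A - pi)
Q1 = (rho * (K * phi_a1)) @ rho.T / n
R1 = (rho * (K * phi_y1)).sum(axis=1) / n

tau_hat = np.array([1.0, 0.0]) @ np.linalg.solve(Q1, R1)   # rho_h(x0) = (1, 0)
print(tau_hat, np.sin(3 * x0))                   # estimate is reasonably close to tau(x0)
```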

Like the estimator (3) in the main text, the estimator (16) can also be viewed as a modified second-order estimator of the projection parameter. However, instead of estimating the moment condition (5) under a linear effect-type assumption, the estimator (16) instead estimates the numerator and denominator terms RR and QQ separately. For example, the first term in R^\widehat{R}, i.e., n{ρh(X)Kh(X)φ^y1(Z)},\mathbb{P}_{n}\{\rho_{h}(X)K_{h}(X)\widehat{\varphi}_{y1}(Z)\}, is the usual first-order influence function-based estimator of RR. Kennedy [2023] analyzed an undersmoothed version of this estimator (where the nuisance estimates π^\widehat{\pi} and μ^\widehat{\mu} themselves are undersmoothed linear smoothers), calling it the local polynomial R-learner. The second term

𝕌n{ρh(X1)Kh(X1)φ^y2(Z1,Z2)Kh(X2)}\mathbb{U}_{n}\Big{\{}\rho_{h}(X_{1})K_{h}(X_{1})\widehat{\varphi}_{y2}(Z_{1},Z_{2})K_{h}(X_{2})\Big{\}}

is similar to the second-order U-statistic correction that would be added using the higher-order influence function methodology developed by Robins et al. [2008, 2009a, 2017]. However, this term differs in its localization just as described for the estimator in the main text.

Proposition S1.

The estimator (16) satisfies

|τ^(x0)τh(x0)|ρ(1/2)(Q1+Q^1Q1)(R^R+QQ^2Q1R),\left|\widehat{\tau}(x_{0})-\tau_{h}(x_{0})\right|\ \leq\ \|\rho(1/2)\|\left(\|Q^{-1}\|+\|\widehat{Q}^{-1}-{Q}^{-1}\|\right)\left(\left\|\widehat{R}-R\right\|+\left\|Q-\widehat{Q}\right\|_{2}\|Q^{-1}R\|\right),

and further if Q1\|Q^{-1}\|, Q^1Q1\|\widehat{Q}^{-1}-{Q}^{-1}\|, and Q1R\|Q^{-1}R\| are all bounded above, then

𝔼|τ^(x0)τh(x0)|\displaystyle\mathbb{E}\left|\widehat{\tau}(x_{0})-\tau_{h}(x_{0})\right|\ maxj𝔼{𝔼(R^jRjDn)2+var(R^jDn)}\displaystyle\lesssim\ \max_{j}\sqrt{\mathbb{E}\left\{\mathbb{E}(\widehat{R}_{j}-R_{j}\mid D^{n})^{2}+\text{var}(\widehat{R}_{j}\mid D^{n})\right\}}
+maxj,𝔼{𝔼(Q^jQjDn)2+var(Q^jDn)}.\displaystyle\hskip 36.135pt+\max_{j,\ell}\sqrt{\mathbb{E}\left\{\mathbb{E}(\widehat{Q}_{j\ell}-Q_{j\ell}\mid D^{n})^{2}+\text{var}(\widehat{Q}_{j\ell}\mid D^{n})\right\}}.

for DnD^{n} a separate independent training sample on which (F^,π^,η^)(\widehat{F},\widehat{\pi},\widehat{\eta}) are estimated.


Proof.

We have

|τ^(x0)τh(x0)|\displaystyle|\widehat{\tau}(x_{0})-\tau_{h}(x_{0})| =|ρh(x0)TQ^1{(R^R)+(QQ^)Q1R}|\displaystyle=\left|\rho_{h}(x_{0})^{\mathrm{\scriptscriptstyle T}}\widehat{Q}^{-1}\left\{\left(\widehat{R}-R\right)+\left(Q-\widehat{Q}\right)Q^{-1}R\right\}\right|
ρ(1/2)Q^1(R^R+QQ^Q1R)\displaystyle\leq\|\rho(1/2)\|\left\|\widehat{Q}^{-1}\right\|\left(\left\|\widehat{R}-R\right\|+\left\|Q-\widehat{Q}\right\|\left\|Q^{-1}R\right\|\right)
ρ(1/2)(Q1+Q^1Q1)(R^R+QQ^2Q1R)\displaystyle\leq\|\rho(1/2)\|\left(\left\|Q^{-1}\right\|+\left\|\widehat{Q}^{-1}-Q^{-1}\right\|\right)\left(\left\|\widehat{R}-R\right\|+\left\|Q-\widehat{Q}\right\|_{2}\left\|Q^{-1}R\right\|\right)

by the sub-multiplicative and triangle inequalities of the operator norm, along with the fact that AA2\|A\|\leq\|A\|_{2}; this is exactly the first inequality. For the second inequality, first note that ρ(x)Cq\|\rho(x)\|\leq Cq, as described in the proof of Proposition 3, so that together with the bounds on Q1\|Q^{-1}\|, Q^1Q1\|\widehat{Q}^{-1}-Q^{-1}\|, and Q1R\|Q^{-1}R\|, it suffices to bound 𝔼R^R\mathbb{E}\|\widehat{R}-R\| and 𝔼QQ^2\mathbb{E}\|Q-\widehat{Q}\|_{2}. The second inequality now follows since

𝔼R^R\displaystyle\mathbb{E}\|\widehat{R}-R\| 𝔼R^R2=j𝔼[𝔼{(R^jRj)2Dn}]\displaystyle\leq\sqrt{\mathbb{E}\|\widehat{R}-R\|^{2}}=\sqrt{\sum_{j}\mathbb{E}\left[\mathbb{E}\Big{\{}(\widehat{R}_{j}-R_{j})^{2}\mid D^{n}\Big{\}}\right]}
=j𝔼{bias(R^jDn)2+var(R^jDn)}\displaystyle=\sqrt{\sum_{j}\mathbb{E}\left\{\text{bias}(\widehat{R}_{j}\mid D^{n})^{2}+\text{var}(\widehat{R}_{j}\mid D^{n})\right\}}
(d+γγ)maxj𝔼{bias(R^jDn)2+var(R^jDn)}\displaystyle\leq\sqrt{{{d+\lfloor\gamma\rfloor}\choose{\lfloor\gamma\rfloor}}}\max_{j}\sqrt{\mathbb{E}\left\{\text{bias}(\widehat{R}_{j}\mid D^{n})^{2}+\text{var}(\widehat{R}_{j}\mid D^{n})\right\}}

using Jensen’s inequality and iterated expectation. The last line follows since the length of RR is (d+γγ){{d+\lfloor\gamma\rfloor}\choose{\lfloor\gamma\rfloor}}. The logic is the same for 𝔼Q^Q2=𝔼j,(Q^jQj)2\mathbb{E}\|\widehat{Q}-Q\|_{2}=\mathbb{E}\sqrt{\sum_{j,\ell}(\widehat{Q}_{j\ell}-Q_{j\ell})^{2}}. ∎


Proposition S2.

Assume:

  1. λmax(Ω)\lambda_{\max}(\Omega) is bounded above,

  2. the basis bb satisfies approximating condition (6),

  3. π^(x)π(x)\widehat{\pi}(x)-\pi(x) is α\alpha-smooth,

  4. η^(x)η(x)\widehat{\eta}(x)-\eta(x) is β\beta-smooth.

Then

|𝔼(R^jRjDn)|\displaystyle|\mathbb{E}(\widehat{R}_{j}-R_{j}\mid D^{n})| (h/k1/d)2s+π^πFη^ηFΩ^1Ω1\displaystyle\ \lesssim\ \left(h/k^{1/d}\right)^{2s}+\|\widehat{\pi}-\pi\|_{F^{*}}\|\widehat{\eta}-\eta\|_{F^{*}}\|\widehat{\Omega}^{-1}-\Omega^{-1}\|
|𝔼(Q^jQjDn)|\displaystyle|\mathbb{E}(\widehat{Q}_{j\ell}-Q_{j\ell}\mid D^{n})| (h/k1/d)2α+π^πF2Ω^1Ω1.\displaystyle\ \lesssim\ \left(h/k^{1/d}\right)^{2\alpha}+\|\widehat{\pi}-\pi\|_{F^{*}}^{2}\|\widehat{\Omega}^{-1}-\Omega^{-1}\|.

Proof.

We only prove the result for R^j\widehat{R}_{j}, since the logic is the same for Q^j\widehat{Q}_{j\ell}. By iterated expectation, the conditional mean of the first-order term is

𝔼{ρh(X1)Kh(X1)\displaystyle\mathbb{E}\{\rho_{h}(X_{1})K_{h}(X_{1}) φ^y1(Z)Dn}=R+ρh(x)Kh(x){π(x)π^(x)}{η(x)η^(x)}dF(x)\displaystyle\widehat{\varphi}_{y1}(Z)\mid D^{n}\}=R+\int\rho_{h}(x)K_{h}(x)\{\pi(x)-\widehat{\pi}(x)\}\{\eta(x)-\widehat{\eta}(x)\}\ dF(x)
=R+ρ(v){π(v)π^(v)}{η(v)η^(v)}𝑑F(v)\displaystyle=R+\int\rho(v)\{\pi^{*}(v)-\widehat{\pi}^{*}(v)\}\{\eta^{*}(v)-\widehat{\eta}^{*}(v)\}\ dF^{*}(v) (17)

where we use the change of variable v=12+xx0hv=\frac{1}{2}+\frac{x-x_{0}}{h} and again define for any function g:dg:\mathbb{R}^{d}\mapsto\mathbb{R} its corresponding stretched version as g(v)=g(x0+h(v1/2))g^{*}(v)=g(x_{0}+h(v-1/2)). To ease notation it is left implicit that any integral over vv is only over {v:v1/21/2}\{v:\|v-1/2\|\leq 1/2\}. Similarly, (minus) the conditional mean of the second-order influence function term is

𝔼{ρh(X1)Kh(X1)φ^y2(Z1,Z2)Kh(X2)Dn}\displaystyle-\mathbb{E}\{\rho_{h}(X_{1})K_{h}(X_{1})\widehat{\varphi}_{y2}(Z_{1},Z_{2})K_{h}(X_{2})\mid D^{n}\}
=ρh(x1)Kh(x1){π(x1)π^(x1)}bhk(x1)TΩ^1bhk(x2){η(x2)η^(x2)}Kh(x2)𝑑F(x2)𝑑F(x1)\displaystyle=\iint\rho_{h}(x_{1})K_{h}(x_{1})\{\pi(x_{1})-\widehat{\pi}(x_{1})\}b_{hk}(x_{1})^{\mathrm{\scriptscriptstyle T}}\widehat{\Omega}^{-1}b_{hk}(x_{2})\{\eta(x_{2})-\widehat{\eta}(x_{2})\}K_{h}(x_{2})\ dF(x_{2})\ dF(x_{1})
=ρ(v1){π(v1)π^(v1)}b(v1)TΩ^1b(v2){η(v2)η^(v2)}𝑑F(v2)𝑑F(v1)\displaystyle=\iint\rho(v_{1})\{\pi^{*}(v_{1})-\widehat{\pi}^{*}(v_{1})\}b(v_{1})^{\mathrm{\scriptscriptstyle T}}\widehat{\Omega}^{-1}b(v_{2})\{\eta^{*}(v_{2})-\widehat{\eta}^{*}(v_{2})\}\ dF^{*}(v_{2})\ dF^{*}(v_{1})
=ρ(v1){π(v1)π^(v1)}Π^b(ηη^)(v1)𝑑F(v1)\displaystyle=\int\rho(v_{1})\{\pi^{*}(v_{1})-\widehat{\pi}^{*}(v_{1})\}\widehat{\Pi}_{b}(\eta^{*}-\widehat{\eta}^{*})(v_{1})\ dF^{*}(v_{1}) (18)

where we define

Πbg(u)\displaystyle\Pi_{b}g^{*}(u) =b(u)TΩ1b(v)g(v)𝑑F(v)\displaystyle=b(u)^{\mathrm{\scriptscriptstyle T}}\Omega^{-1}\int{b}(v)g^{*}(v)\ dF^{*}(v)

as the FF^{*}-weighted linear projection of gg^{*} on the basis bb, and Π^bg(u)\widehat{\Pi}_{b}g^{*}(u) as the estimated version, which simply replaces Ω\Omega with Ω^\widehat{\Omega}. Therefore adding the first- and second-order expected values in (17) and (18), the overall bias relative to RR is

ρ(v)\displaystyle\int\rho(v) {π(v)π^(v)}(IΠ^b)(ηη^)(v)dF(v)\displaystyle\{\pi^{*}(v)-\widehat{\pi}^{*}(v)\}(I-\widehat{\Pi}_{b})(\eta^{*}-\widehat{\eta}^{*})(v)\ dF^{*}(v)
=(IΠb){ρ(ππ^)}(v)(IΠb)(ηη^)(v)𝑑F(v)\displaystyle=\int(I-\Pi_{b})\{\rho(\pi^{*}-\widehat{\pi}^{*})\}(v)(I-\Pi_{b})(\eta^{*}-\widehat{\eta}^{*})(v)\ dF^{*}(v) (19)
+ρ(v){π(v)π^(v)}(ΠbΠ^b)(ηη^)(v)𝑑F(v)\displaystyle\hskip 43.36243pt+\int\rho(v)\{\pi^{*}(v)-\widehat{\pi}^{*}(v)\}(\Pi_{b}-\widehat{\Pi}_{b})(\eta^{*}-\widehat{\eta}^{*})(v)\ dF^{*}(v) (20)

where the last line follows from orthogonality of a projection with its residuals (Lemma S1(i)).

Now we analyze the bias terms (19) and (20) separately; the first is the main bias term, which would arise even if the covariate density were known, and the second is the contribution coming from having to estimate the covariate density.

Recall from (13) in the main text that if gg is α\alpha-smooth, then (IΠb)g/hαFkα/d\|(I-\Pi_{b})g^{*}/h^{\alpha}\|_{F^{*}}\lesssim k^{-\alpha/d} by the Hölder approximation properties (6) of the basis bb, and so it follows that

(IΠb)gFhαkα/d=(k/hd)α/d\|(I-\Pi_{b})g^{*}\|_{F^{*}}\ \lesssim\ h^{\alpha}k^{-\alpha/d}=(k/h^{d})^{-\alpha/d}

for any α\alpha-smooth function gg. Therefore now consider the bias term (19). This term satisfies

[\displaystyle\int\Big{[} (IΠb){ρ(ππ^)}(v)]{(IΠb)(ηη^)(v)}dF(v)\displaystyle(I-\Pi_{b})\{\rho(\pi^{*}-\widehat{\pi}^{*})\}(v)\Big{]}\Big{\{}(I-\Pi_{b})(\eta^{*}-\widehat{\eta}^{*})(v)\Big{\}}\ dF^{*}(v)
(IΠb){ρ(ππ^)}F(IΠb)(ηη^)F\displaystyle\leq\|(I-\Pi_{b})\{\rho(\pi^{*}-\widehat{\pi}^{*})\}\|_{F^{*}}\|(I-\Pi_{b})(\eta^{*}-\widehat{\eta}^{*})\|_{F^{*}}
(k/hd)2s/d\displaystyle\lesssim\ (k/h^{d})^{-2s/d}

where the second line follows by Cauchy-Schwarz, and the third by (13), since (ππ^)(\pi-\widehat{\pi}) and (ηη^)(\eta-\widehat{\eta}) are assumed α\alpha- and β\beta-smooth, respectively (note ρ(v)\rho(v) is a polynomial, so the smoothness of ρ(ππ^)\rho(\pi^{*}-\widehat{\pi}^{*}) is the same as (ππ^)(\pi^{*}-\widehat{\pi}^{*})).

Now for the term in (20), let θb,g=Ω1bg𝑑F\theta_{b,g}=\Omega^{-1}\int bg\ dF^{*} denote the coefficients of the projection Πbg\Pi_{b}g, and note for any functions g1,g2g_{1},g_{2} we have

g1(ΠbΠ^b)(g2)𝑑F\displaystyle\int g_{1}(\Pi_{b}-\widehat{\Pi}_{b})(g_{2})\ dF^{*} =(Ω1/2θb,g1)TΩ1/2(Ω1Ω^1)Ω1/2(Ω1/2θb,g2)\displaystyle=\left(\Omega^{1/2}\theta_{b,g_{1}}\right)^{\mathrm{\scriptscriptstyle T}}\Omega^{1/2}(\Omega^{-1}-\widehat{\Omega}^{-1})\Omega^{1/2}\left(\Omega^{1/2}\theta_{b,g_{2}}\right)
g1FΩ1/2(Ω1Ω^1)Ω1/2g2F\displaystyle\leq\|g_{1}\|_{F^{*}}\|\Omega^{1/2}(\Omega^{-1}-\widehat{\Omega}^{-1})\Omega^{1/2}\|\|g_{2}\|_{F^{*}}
g1Fg2FΩΩ^1Ω1\displaystyle\leq\|g_{1}\|_{F^{*}}\|g_{2}\|_{F^{*}}\|\Omega\|\|\widehat{\Omega}^{-1}-\Omega^{-1}\|

where the first equality follows by definition, the second line since the L2L_{2} norm of the coefficients of a (weighted) projection is no more than the weighted L2()L_{2}(\mathbb{P}) norm of the function itself (Lemma S1(iii)), and the last by the sub-multiplicative property of the operator norm, along with the fact that Ω1/22=Ω\|\Omega^{1/2}\|^{2}=\|\Omega\|. ∎


We omit details on variance calculations because the logic is exactly the same as for the estimator (3) studied in Section 4.3 of the main text.

Combining the approximation bias from Proposition 3 with the error decomposition in Proposition S1, the bias bounds in Proposition S2, the variance bounds discussed above, and sufficient conditions on estimation of the covariate density, we obtain that the estimator (16) has error satisfying

𝔼P|τ^(x0)τP(x0)|hγ+(h/k1/d)2min(s,α)+1nhd(1+knhd)\mathbb{E}_{P}|\widehat{\tau}(x_{0})-\tau_{P}(x_{0})|\ \lesssim\ h^{\gamma}+\left(h/k^{1/d}\right)^{2\min(s,\alpha)}+\sqrt{\frac{1}{nh^{d}}\left(1+\frac{k}{nh^{d}}\right)} (21)

Balancing terms as in Section 4.4 yields the minimax rate from Theorem 3.
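
For intuition, the balancing in (21) can be sketched as follows; this is only a heuristic, and the precise tuning choices are given in Section 4.4 of the main text. Write \bar{s}\equiv\min(s,\alpha). Taking k\asymp nh^{d} makes the variance term of order (nh^{d})^{-1/2} and the second bias term of order (h/k^{1/d})^{2\bar{s}}=n^{-2\bar{s}/d}, so that balancing h^{\gamma}\asymp(nh^{d})^{-1/2} gives

h\asymp n^{-1/(2\gamma+d)},\qquad \mathbb{E}_{P}|\widehat{\tau}(x_{0})-\tau_{P}(x_{0})|\ \lesssim\ n^{-\gamma/(2\gamma+d)}+n^{-2\bar{s}/d}.

When the term n^{-2\bar{s}/d} dominates, one can instead take k larger than nh^{d} and balance all three terms, h^{\gamma}\asymp(h/k^{1/d})^{2\bar{s}}\asymp\sqrt{k}/(nh^{d}), i.e., k\asymp n^{2}h^{2d+2\gamma}, which gives

\mathbb{E}_{P}|\widehat{\tau}(x_{0})-\tau_{P}(x_{0})|\ \lesssim\ h^{\gamma}\asymp n^{-1/\{1+d/(2\gamma)+d/(4\bar{s})\}}.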

Appendix B Additional Results

B.1 Density Bounds in Proposition 1

Here we show that the relevant densities and density ratios from Proposition 1 are appropriately bounded.

Proof.

When h1/4h\leq 1/4, it follows that on 𝒮hk\mathcal{S}_{hk} we have

1f(x)={1(4d12d)hd}12.1\leq f(x)=\left\{1-\left(\frac{4^{d}-1}{2^{d}}\right)h^{d}\right\}^{-1}\leq 2. (22)

Further, since B(x)1(x[1,1]d)B(x)\leq\mathbbold{1}(x\in[-1,1]^{d}), a,y{0,1}a,y\in\{0,1\}, and λj{1,1}\lambda_{j}\in\{-1,1\} for each jj, we have on 𝒮hk\mathcal{S}_{hk} that

1412(h/k1/d)βhγ4pλ(z)f(x)14+12(h/k1/d)β+hγ4,\frac{1}{4}-\frac{1}{2}\left(h/k^{1/d}\right)^{\beta}-\frac{h^{\gamma}}{4}\ \leq\frac{p_{\lambda}(z)}{f(x)}\leq\ \frac{1}{4}+\frac{1}{2}\left(h/k^{1/d}\right)^{\beta}+\frac{h^{\gamma}}{4},

regardless of the values of h,k0h,k\geq 0. Therefore when hγ+2(h/k1/d)β14ϵh^{\gamma}+2(h/k^{1/d})^{\beta}\leq 1-4\epsilon, the above bound implies

pλ(z)f(x)1412(h/k1/d)βhγ4ϵ.\frac{p_{\lambda}(z)}{f(x)}\geq\frac{1}{4}-\frac{1}{2}\left(h/k^{1/d}\right)^{\beta}-\frac{h^{\gamma}}{4}\geq\epsilon. (23)

Similarly, when hγ+2(h/k1/d)β14ϵh^{\gamma}+2(h/k^{1/d})^{\beta}\leq 1-4\epsilon (which implies hγ1h^{\gamma}\leq 1) we also have

p¯(z)pλ(z)14+hγ41412(h/k1/d)βhγ41/2ϵ.\frac{\overline{p}(z)}{p_{\lambda}(z)}\leq\frac{\frac{1}{4}+\frac{h^{\gamma}}{4}}{\frac{1}{4}-\frac{1}{2}\left(h/k^{1/d}\right)^{\beta}-\frac{h^{\gamma}}{4}}\leq\frac{1/2}{\epsilon}.

Note that, although Robins et al. [2009b] assume pλ(z)p_{\lambda}(z) is uniformly bounded away from zero in their version of Lemma 2, they only use that assumption through a bound on p¯/pλ\overline{p}/p_{\lambda}, which ensures their quantity cc is bounded (see page 1319). Since such a bound was established above, their argument also applies in our setting. ∎


B.2 Density Expressions

Proposition S3.

For binary AA and YY, the density of Z=(X,A,Y)Z=(X,A,Y) at a point z=(x,a,y) can be written as

f(x)\displaystyle f(x) (14+(a1/2)Δπ(x)+(y1/2)Δμ(x)\displaystyle\bigg{(}\frac{1}{4}+(a-1/2)\Delta_{\pi}(x)+(y-1/2)\Delta_{\mu}(x)
+(2a1)(2y1)[Δπ(x)Δμ(x)+aτ(x){12+Δπ(x)}])\displaystyle\hskip 28.90755pt+(2a-1)(2y-1)\left[\Delta_{\pi}(x)\Delta_{\mu}(x)+a\tau(x)\left\{\frac{1}{2}+\Delta_{\pi}(x)\right\}\right]\bigg{)}

for Δπ(x)=π(x)1/2\Delta_{\pi}(x)=\pi(x)-1/2 and Δμ(x)=μ0(x)1/2\Delta_{\mu}(x)=\mu_{0}(x)-1/2, or equivalently as

f(x)\displaystyle f(x) (14+(a1/2)Δπ(x)+(y1/2)Δη(x)\displaystyle\bigg{(}\frac{1}{4}+(a-1/2)\Delta_{\pi}(x)+(y-1/2)\Delta_{\eta}(x)
+(2a1)(2y1)[Δπ(x)Δη(x)+τ(x){14Δπ(x)2}])\displaystyle\hskip 28.90755pt+(2a-1)(2y-1)\left[\Delta_{\pi}(x)\Delta_{\eta}(x)+\tau(x)\left\{\frac{1}{4}-\Delta_{\pi}(x)^{2}\right\}\right]\bigg{)}

for Δη(x)=η(x)1/2\Delta_{\eta}(x)=\eta(x)-1/2.
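
Proposition S3 follows from a direct calculation; the following is a short verification sketch. Writing \Delta_{\mu_{a}}(x)=\mu_{a}(x)-1/2=\Delta_{\mu}(x)+a\tau(x), the density factors (with respect to the product of FF and counting measure on \{0,1\}^{2}) as

p(z)=f(x)\,\mathbb{P}(A=a\mid X=x)\,\mathbb{P}(Y=y\mid X=x,A=a)=f(x)\left\{\frac{1}{2}+(2a-1)\Delta_{\pi}(x)\right\}\left\{\frac{1}{2}+(2y-1)\Delta_{\mu_{a}}(x)\right\}.

Expanding the product and using that a(2a-1)=a for a\in\{0,1\} gives the first display; substituting \Delta_{\mu}(x)=\Delta_{\eta}(x)-\tau(x)\{1/2+\Delta_{\pi}(x)\} (using that \eta(x)=\mathbb{E}(Y\mid X=x), so that \eta=\mu_{0}+\pi\tau) and simplifying gives the second.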


Proposition S4.

The densities under PλP_{\lambda} and QλQ_{\lambda} from Definition 1 are given by

pλ(z)\displaystyle p_{\lambda}(z) =f(x)[14+(y1/2)(h/k1/d)βj=1kλjB(xmjh/k1/d)+(2a1)(2y1)hγ4B(xx0h)]\displaystyle=f(x)\Bigg{[}\frac{1}{4}+(y-1/2)\left(h/k^{1/d}\right)^{\beta}\sum_{j=1}^{k}\lambda_{j}B\left(\frac{x-m_{j}}{h/k^{1/d}}\right)+(2a-1)(2y-1)\frac{h^{\gamma}}{4}B\left(\frac{x-x_{0}}{h}\right)\Bigg{]}
qλ(z)\displaystyle q_{\lambda}(z) =f(x)[14+{(a1/2)(h/k1/d)α+(y1/2)(h/k1/d)β}j=1kλjB(xmjh/k1/d)\displaystyle=f(x)\Bigg{[}\frac{1}{4}+\left\{(a-1/2)\left(h/k^{1/d}\right)^{\alpha}+(y-1/2)\left(h/k^{1/d}\right)^{\beta}\right\}\sum_{j=1}^{k}\lambda_{j}B\left(\frac{x-m_{j}}{h/k^{1/d}}\right)
+(2a1)(2y1)(h/k1/d)2sj=1kB(xmjh/k1/d)2]\displaystyle\hskip 50.58878pt+(2a-1)(2y-1)\left(h/k^{1/d}\right)^{2s}\sum_{j=1}^{k}B\left(\frac{x-m_{j}}{h/k^{1/d}}\right)^{2}\Bigg{]}

where s(α+β)/2s\equiv(\alpha+\beta)/2.


We note that both densities reduce to f(x)/4 for all x𝒞2h(x0)x\notin\mathcal{C}_{2h}(x_{0}), since B(xmjh/k1/d)=0B\left(\frac{x-m_{j}}{h/k^{1/d}}\right)=0 for x𝒞h/k1/d(mj)x\notin\mathcal{C}_{h/k^{1/d}}(m_{j}) and 𝒞h/k1/d(mj)𝒞h(x0)𝒞2h(x0)\mathcal{C}_{h/k^{1/d}}(m_{j})\subseteq\mathcal{C}_{h}(x_{0})\subseteq\mathcal{C}_{2h}(x_{0}), and since B(xx0h)=0B\left(\frac{x-x_{0}}{h}\right)=0 for x𝒞2h(x0)x\notin\mathcal{C}_{2h}(x_{0}).

B.3 Proof of Proposition 3

Proof.

To ease notation we prove the result in the d=1d=1 case but the logic is the same when d>1d>1. First note that the local polynomial projection operator Lg(x)g(x)wh(x;x)𝑑xLg(x^{\prime})\equiv\int g(x)w_{h}(x;x^{\prime})\ dx for

wh(x;x)ρh(x)TQ1ρh(x)Kh(x)π(x){1π(x)}f(x)w_{h}(x;x^{\prime})\equiv\rho_{h}(x^{\prime})^{\mathrm{\scriptscriptstyle T}}Q^{-1}\rho_{h}(x)K_{h}(x)\pi(x)\{1-\pi(x)\}f(x)

reproduces polynomials, in the sense that, for any polynomial of the form g(x)=aTρh(x)g(x)=a^{\mathrm{\scriptscriptstyle T}}\rho_{h}(x), aqa\in\mathbb{R}^{q}, we have

Lg(x)\displaystyle Lg(x^{\prime}) ={aTρh(x)}wh(x;x)𝑑x\displaystyle=\int\Big{\{}a^{\mathrm{\scriptscriptstyle T}}\rho_{h}(x)\Big{\}}w_{h}(x;x^{\prime})\ dx
=ρh(x)TQ1ρh(x)Kh(x)π(x){1π(x)}ρh(x)Tf(x)𝑑xa\displaystyle=\rho_{h}(x^{\prime})^{\mathrm{\scriptscriptstyle T}}Q^{-1}\int\rho_{h}(x)K_{h}(x)\pi(x)\{1-\pi(x)\}\rho_{h}(x)^{\mathrm{\scriptscriptstyle T}}f(x)\ dx\ a
=ρh(x)TQ1Qa=aTρh(x)=g(x).\displaystyle=\rho_{h}(x^{\prime})^{\mathrm{\scriptscriptstyle T}}Q^{-1}Qa=a^{\mathrm{\scriptscriptstyle T}}\rho_{h}(x^{\prime})=g(x^{\prime}).

Then for xx0C2h\|x^{\prime}-x_{0}\|\leq C_{2}h we have that τh(x)τ(x)=wh(x;x){τ(x)τ(x)}𝑑x\tau_{h}(x^{\prime})-\tau(x^{\prime})=\int w_{h}(x;x^{\prime})\Big{\{}\tau(x)-\tau(x^{\prime})\Big{\}}dx equals

wh(x;x)\displaystyle\int w_{h}(x;x^{\prime}) {j=1γ1Djτ(x)j!(xx)j+Dγτ(x)γ!(xx)γ}dx\displaystyle\left\{\sum_{j=1}^{\lfloor\gamma\rfloor-1}\frac{D^{j}\tau(x^{\prime})}{j!}(x-x^{\prime})^{j}+\frac{D^{\lfloor\gamma\rfloor}\tau(x^{*})}{\lfloor\gamma\rfloor!}(x-x^{\prime})^{\lfloor\gamma\rfloor}\right\}dx
=0+wh(x;x){Dγτ(x+ϵ(xx))Dγτ(x)γ!}(xx)γ𝑑x\displaystyle=0+\int w_{h}(x;x^{\prime})\left\{\frac{D^{\lfloor\gamma\rfloor}\tau(x^{\prime}+\epsilon(x-x^{\prime}))-D^{\lfloor\gamma\rfloor}\tau(x^{\prime})}{{\lfloor\gamma\rfloor}!}\right\}(x-x^{\prime})^{\lfloor\gamma\rfloor}\ dx
|wh(x;x)|C1xxγγγ!xxγ𝑑x\displaystyle\leq\int\left|w_{h}(x;x^{\prime})\right|\frac{C_{1}\|x-x^{\prime}\|^{\gamma-{\lfloor\gamma\rfloor}}}{{\lfloor\gamma\rfloor}!}\|x-x^{\prime}\|^{{\lfloor\gamma\rfloor}}\ dx
C1γ!|wh(x;x)|(xx0+C2h)γ𝑑x\displaystyle\leq\frac{C_{1}}{{\lfloor\gamma\rfloor}!}\int\left|w_{h}(x;x^{\prime})\right|\left(\|x-x_{0}\|+C_{2}h\right)^{\gamma}dx
C1hγγ!{hd|wh(x0+hu;x)|(u+C2)γ𝑑u}\displaystyle\leq\frac{C_{1}h^{\gamma}}{{\lfloor\gamma\rfloor}!}\left\{h^{d}\int\left|w_{h}(x_{0}+hu;x^{\prime})\right|(\|u\|+C_{2})^{\gamma}\ du\right\} (24)

where the first line follows by Taylor expansion (with x=x+ϵ(xx)x^{*}=x^{\prime}+\epsilon(x-x^{\prime}) for some ϵ[0,1]\epsilon\in[0,1]), the second by the polynomial reproducing property, the third since τ\tau is γ\gamma-smooth, the fourth by the triangle inequality with x0xC2h\|x_{0}-x^{\prime}\|\leq C_{2}h, and the last by a change of variable u=(xx0)/hu=(x-x_{0})/h (so that dx=hddudx=h^{d}\ du).

Now the result follows since we show the term on the right in (24) (in brackets) is bounded under the stated assumptions. Specifically it equals

(u+C2)γ|ρh(x)TQ1ρh(x0+hu)|1(u1/2)\displaystyle\int(\|u\|+C_{2})^{\gamma}\left|\rho_{h}(x^{\prime})^{\mathrm{\scriptscriptstyle T}}Q^{-1}\rho_{h}(x_{0}+hu)\right|\mathbbold{1}(\|u\|\leq 1/2)
×π(x0+hu){1π(x0+hu)}f(x0+hu)du\displaystyle\hskip 36.135pt\times\pi(x_{0}+hu)\{1-\pi(x_{0}+hu)\}f(x_{0}+hu)\ du
(1/4)(1/2+C2)γρ(1/2)Q11/21/2ρ(1/2+u)f(x0+hu)𝑑u\displaystyle\leq(1/4)(1/2+C_{2})^{\gamma}\|\rho(1/2)\|\left\|Q^{-1}\right\|\int_{-1/2}^{1/2}\|\rho(1/2+u)\|f(x_{0}+hu)\ du
=(C3q2/4)(1/2+C2)γQ1hd1(xx0h/2)𝑑F(x)1\displaystyle=(C_{3}q^{2}/4)(1/2+C_{2})^{\gamma}\left\|Q^{-1}\right\|h^{-d}\int\mathbbold{1}(\|x-x_{0}\|\leq h/2)\ dF(x)\lesssim 1

where the second line follows from the submultiplicative property of the operator norm and since π(1π)1/4\pi(1-\pi)\leq 1/4, and the last from Assumptions 2 and 3 and since

ρ(x)2\displaystyle\|\rho(x)\|^{2} (d+γγ)(2γ+1)C3q2\displaystyle\leq{{d+\lfloor\gamma\rfloor}\choose{\lfloor\gamma\rfloor}}(2\lfloor\gamma\rfloor+1)\leq C_{3}q^{2}

for all xx (Belloni et al. [2015], Example 3.1), since each Legendre term satisfies |ρm(xj)|2m+12γ+1|\rho_{m}(x_{j})|\leq\sqrt{2m+1}\leq\sqrt{2\lfloor\gamma\rfloor+1} for mγm\leq\lfloor\gamma\rfloor, and the length of ρ\rho is q=(d+γγ)q={{d+\lfloor\gamma\rfloor}\choose{\lfloor\gamma\rfloor}}, i.e., the maximum number of monomials in a polynomial in dd variables with degree up to γ\lfloor\gamma\rfloor. ∎


B.4 Proof of Proposition 4

Proof.

For the first inequality, we have

|τ^(x0)τh(x0)|\displaystyle|\widehat{\tau}(x_{0})-\tau_{h}(x_{0})| =|ρh(x0)TQ^1{(R^R)+(QQ^)Q1R}|\displaystyle=\left|\rho_{h}(x_{0})^{\mathrm{\scriptscriptstyle T}}\widehat{Q}^{-1}\left\{\left(\widehat{R}-R\right)+\left(Q-\widehat{Q}\right)Q^{-1}R\right\}\right|
ρ(1/2)(Q1+Q^1Q1)((R^R)(Q^Q)(Q1R))\displaystyle\leq\|\rho(1/2)\|\left(\left\|Q^{-1}\right\|+\left\|\widehat{Q}^{-1}-Q^{-1}\right\|\right)\left(\left\|(\widehat{R}-R)-(\widehat{Q}-Q)(Q^{-1}R)\right\|\right)

by the sub-multiplicative and triangle inequalities of the operator norm. For the second inequality, first note ρ(x)Cq\|\rho(x)\|\leq Cq, as described in the proof of Proposition 3. The result now follows from the bounds on Q1\|Q^{-1}\| and Q^1Q1\|\widehat{Q}^{-1}-{Q}^{-1}\| and since

𝔼S^S\displaystyle\mathbb{E}\|\widehat{S}-S\| 𝔼S^S2=j𝔼[𝔼{(S^jSj)2Dn}]\displaystyle\leq\sqrt{\mathbb{E}\|\widehat{S}-S\|^{2}}=\sqrt{\sum_{j}\mathbb{E}\left[\mathbb{E}\Big{\{}(\widehat{S}_{j}-S_{j})^{2}\mid D^{n}\Big{\}}\right]}
=j𝔼{bias(S^jDn)2+var(S^jDn)}\displaystyle=\sqrt{\sum_{j}\mathbb{E}\left\{\text{bias}(\widehat{S}_{j}\mid D^{n})^{2}+\text{var}(\widehat{S}_{j}\mid D^{n})\right\}}
(d+γγ)maxj𝔼{bias(S^jDn)2+var(S^jDn)}\displaystyle\leq\sqrt{{{d+\lfloor\gamma\rfloor}\choose{\lfloor\gamma\rfloor}}}\max_{j}\sqrt{\mathbb{E}\left\{\text{bias}(\widehat{S}_{j}\mid D^{n})^{2}+\text{var}(\widehat{S}_{j}\mid D^{n})\right\}}

by Jensen’s inequality and iterated expectation. The last line follows since the length of SS is (d+γγ){{d+\lfloor\gamma\rfloor}\choose{\lfloor\gamma\rfloor}}. ∎


B.5 Eigenvalues of QQ and Ω\Omega

Proposition S5.

If (i) ϵπ(x)1ϵ\epsilon\leq\pi(x)\leq 1-\epsilon and (ii) the density dF(x)dF(x) is bounded above and below away from zero on {x:xx0h/2}\{x:\|x-x_{0}\|\leq h/2\}, then the eigenvalues of QQ are bounded above and below away from zero.


Proof.

Define the stretched function g(v)=g(x0+h(v1/2))g^{*}(v)=g(x_{0}+h(v-1/2)) for any g:dg:\mathbb{R}^{d}\mapsto\mathbb{R}. This maps values of gg in the small cube [x0h/2,x0+h/2]d[x_{0}-h/2,x_{0}+h/2]^{d} around x0x_{0} to the unit cube [0,1]d[0,1]^{d}. Then note that, with the change of variables v=12+xx0hv=\frac{1}{2}+\frac{x-x_{0}}{h},

Q\displaystyle Q =ρh(x)Kh(x)π(x){1π(x)}ρh(x)T𝑑F(x)\displaystyle=\int\rho_{h}(x)K_{h}(x)\pi(x)\{1-\pi(x)\}\rho_{h}(x)^{\mathrm{\scriptscriptstyle T}}\ dF(x)
=v1/21/2ρ(v)ρ(v)Tπ(v){1π(v)}𝑑F(v).\displaystyle=\int_{\|v-1/2\|\leq 1/2}\rho(v)\rho(v)^{\mathrm{\scriptscriptstyle T}}\pi^{*}(v)\{1-\pi^{*}(v)\}\ dF^{*}(v).

Next note that ϵ(1ϵ)π(1π)1/4\epsilon(1-\epsilon)\leq\pi(1-\pi)\leq 1/4, so the eigenvalues of QQ will be bounded if those of the matrix

ρ(v)ρ(v)T𝑑F(x0+h(v1/2))\int\rho(v)\rho(v)^{\mathrm{\scriptscriptstyle T}}\ dF(x_{0}+h(v-1/2))

are. But ρ(x)ρ(x)T𝑑x=I\int\rho(x)\rho(x)^{\mathrm{\scriptscriptstyle T}}\ dx=I by orthonormality of the Legendre polynomials on [0,1]d[0,1]^{d}, and the local boundedness of dFdF ensures dF/dμdF^{*}/d\mu is bounded above and below away from zero, for μ\mu the uniform measure. Therefore Proposition 2.1 of Belloni et al. [2015] yields the result. ∎


Proposition S6.

If (i) the basis b(v)b(v) is orthonormal with respect to the uniform measure, and (ii) the density dF(x)dF(x) is bounded above and below away from zero on {x:xx0h/2}\{x:\|x-x_{0}\|\leq h/2\}, then the eigenvalues of Ω\Omega are bounded above and below away from zero.


Proof.

The proof is similar to that of Proposition S5. First note that

Ω=bhk(x)Kh(x)bhk(x)T𝑑F(x)=b(v)b(v)T𝑑F(v)\displaystyle\Omega=\int b_{hk}(x)K_{h}(x)b_{hk}(x)^{\mathrm{\scriptscriptstyle T}}dF(x)=\int b(v)b(v)^{\mathrm{\scriptscriptstyle T}}\ dF^{*}(v)

by the change of variables v=12+xx0hv=\frac{1}{2}+\frac{x-x_{0}}{h}, and where dF(v)=dF(x0+h(v1/2))dF^{*}(v)=dF(x_{0}+h(v-1/2)) as before. Further b(v)b(v)T𝑑v=I\int b(v)b(v)^{\mathrm{\scriptscriptstyle T}}\ dv=I by the assumed orthonormality, and the local boundedness of dFdF ensures dF/dμdF^{*}/d\mu is bounded above and below away from zero, for μ\mu the uniform measure. Therefore Proposition 2.1 of Belloni et al. [2015] yields the result. ∎


B.6 Proof of Proposition 7

Proof.

Given the training data DnD^{n}, S^j=[R^Q^θ]j\widehat{S}_{j}=[\widehat{R}-\widehat{Q}\theta]_{j} is a second-order U-statistic with kernel

ξ(Z1,Z2)=ρh(X1)\displaystyle\xi(Z_{1},Z_{2})=\rho_{h}(X_{1}) Kh(X1)[{φ^y1(Z1)φ^a1(Z1)ρh(X1)Tθ}\displaystyle K_{h}(X_{1})\Big{[}\Big{\{}\widehat{\varphi}_{y1}(Z_{1})-\widehat{\varphi}_{a1}(Z_{1})\rho_{h}(X_{1})^{\mathrm{\scriptscriptstyle T}}\theta\Big{\}}
+{φ^y2(Z1,Z2)φ^a2(Z1,Z2)ρh(X1)Tθ}Kh(X2)]\displaystyle+\Big{\{}\widehat{\varphi}_{y2}(Z_{1},Z_{2})-\widehat{\varphi}_{a2}(Z_{1},Z_{2})\rho_{h}(X_{1})^{\mathrm{\scriptscriptstyle T}}\theta\Big{\}}K_{h}(X_{2})\Big{]}

Thus its conditional variance is given by the usual variance of a second-order U-statistic (e.g., Lemma 6 of Robins et al. [2009a]), which is

var{𝕌n(ξ)Dn}\displaystyle\text{var}\left\{\mathbb{U}_{n}(\xi)\mid D^{n}\right\} =(4(n2)n(n1))var{ξ1(Z1)Dn}+(2n(n1))var{ξ(Z1,Z2)Dn}\displaystyle=\left(\frac{4(n-2)}{n(n-1)}\right)\text{var}\{\xi_{1}(Z_{1})\mid D^{n}\}+\left(\frac{2}{n(n-1)}\right)\text{var}\{\xi(Z_{1},Z_{2})\mid D^{n}\}
4𝔼{ξ1(Z1)2Dn}n+4𝔼{ξ(Z1,Z2)2Dn}n2\displaystyle\leq\frac{4\mathbb{E}\{\xi_{1}(Z_{1})^{2}\mid D^{n}\}}{n}+\frac{4\mathbb{E}\{\xi(Z_{1},Z_{2})^{2}\mid D^{n}\}}{n^{2}} (25)

for ξ1(z1)=ξ(z1,z2)𝑑(z2)\xi_{1}(z_{1})=\int\xi(z_{1},z_{2})\ d\mathbb{P}(z_{2}), if n2n\geq 2. In our case ξ1\xi_{1} equals

ρh(X1)Kh(X1)\displaystyle\rho_{h}(X_{1})K_{h}(X_{1}) [φ^y1(Z1)φ^a1(Z1)ρh(X1)Tθ\displaystyle\Big{[}\widehat{\varphi}_{y1}(Z_{1})-\widehat{\varphi}_{a1}(Z_{1})\rho_{h}(X_{1})^{\mathrm{\scriptscriptstyle T}}\theta
+{A1π^(X1)}{Π^b(μ0μ^0)(X1)Π^bπ^(X1)ρh(X1)Tθ}].\displaystyle\hskip 28.90755pt+\{A_{1}-\widehat{\pi}(X_{1})\}\Big{\{}\widehat{\Pi}_{b}(\mu_{0}^{*}-\widehat{\mu}_{0}^{*})(X_{1})-\widehat{\Pi}_{b}\widehat{\pi}^{*}(X_{1})\rho_{h}(X_{1})^{\mathrm{\scriptscriptstyle T}}\theta\Big{\}}\Big{]}.

Therefore for the first term in (25) we have

ξ12𝑑\displaystyle\int\xi_{1}^{2}\ d\mathbb{P} 2(ρh2Kh2(φ^y1φ^a1ρhTθ)2d\displaystyle\leq 2\bigg{(}\int\rho_{h}^{2}K_{h}^{2}(\widehat{\varphi}_{y1}-\widehat{\varphi}_{a1}\rho_{h}^{\mathrm{\scriptscriptstyle T}}\theta)^{2}\ d\mathbb{P}
+ρh2Kh2{π(1π)+(ππ^)2}{Π^b(μ0μ^0)+Π^bπρhTθ}2dF)\displaystyle\hskip 28.90755pt+\int\rho_{h}^{2}K_{h}^{2}\left\{\pi(1-\pi)+(\pi-\widehat{\pi})^{2}\right\}\left\{\widehat{\Pi}_{b}(\mu_{0}^{*}-\widehat{\mu}_{0}^{*})+\widehat{\Pi}_{b}\pi^{*}\rho_{h}^{\mathrm{\scriptscriptstyle T}}\theta\right\}^{2}dF\bigg{)}
1hd(ρ(v)2𝑑F(v)+ρ(v)2[{Π^b(μ0μ^0)(v)}2+{Π^bπ(v)ρ(v)Tθ}2]𝑑F(v))\displaystyle\lesssim\frac{1}{h^{d}}\left(\int\rho(v)^{2}\ dF^{*}(v)+\int\rho(v)^{2}\left[\left\{\widehat{\Pi}_{b}(\mu_{0}^{*}-\widehat{\mu}_{0}^{*})(v)\right\}^{2}+\left\{\widehat{\Pi}_{b}\pi^{*}(v)\rho(v)^{\mathrm{\scriptscriptstyle T}}\theta\right\}^{2}\right]\ dF^{*}(v)\right)
1hd(1+μ^0μ0F2)\displaystyle\lesssim\frac{1}{h^{d}}\left(1+\|\widehat{\mu}_{0}-\mu_{0}\|_{F^{*}}^{2}\right)

where the second inequality follows by the change of variables v=12+xx0hv=\frac{1}{2}+\frac{x-x_{0}}{h}, and since all of (φ^y1,φ^a1,π,π^)(\widehat{\varphi}_{y1},\widehat{\varphi}_{a1},\pi,\widehat{\pi}) are uniformly bounded, 0Kh(x)hd0\leq K_{h}(x)\lesssim h^{-d}, and supvρ(v)q\sup_{v}\|\rho(v)\|\lesssim q (Belloni et al. [2015], Example 3.1); and the third follows from Lemma S1(ii). Now, consider the second term in (25); we only detail the term in ξ\xi involving φ^y2\widehat{\varphi}_{y2} since the logic is the same for the φ^a2\widehat{\varphi}_{a2} term, and the terms involving (φ^y1,φ^a1)(\widehat{\varphi}_{y1},\widehat{\varphi}_{a1}) are bounded by standard arguments. Letting b¯j(x)Ω1/2bj(x)\overline{b}_{j}(x)\equiv\Omega^{-1/2}b_{j}(x) and MΩ1/2Ω^1Ω1/2M\equiv\Omega^{1/2}\widehat{\Omega}^{-1}\Omega^{1/2}, we have 𝔼{ξ(Z1,Z2)2Dn}\mathbb{E}\{\xi(Z_{1},Z_{2})^{2}\mid D^{n}\} upper bounded by a multiple of

[ρh(x1)\displaystyle\int\bigg{[}\rho_{h}(x_{1}) {a1π^(x1)}{bh(x1)TΩ^1bh(x2)}{y2μ^0(x2)}Kh(x1)Kh(x2)]2d(z1)d(z2)\displaystyle\{a_{1}-\widehat{\pi}(x_{1})\}\Big{\{}{b}_{h}(x_{1})^{\mathrm{\scriptscriptstyle T}}\widehat{\Omega}^{-1}{b}_{h}(x_{2})\Big{\}}\{y_{2}-\widehat{\mu}_{0}(x_{2})\}K_{h}(x_{1})K_{h}(x_{2})\bigg{]}^{2}\ d\mathbb{P}(z_{1})\ d\mathbb{P}(z_{2})
1h2dρh(x1)2{bhk(x1)TΩ^1bhk(x2)}2Kh(x1)Kh(x2)𝑑F(x1)𝑑F(x2)\displaystyle\lesssim\frac{1}{h^{2d}}\int\rho_{h}(x_{1})^{2}\Big{\{}b_{hk}(x_{1})^{\mathrm{\scriptscriptstyle T}}\widehat{\Omega}^{-1}b_{hk}(x_{2})\Big{\}}^{2}K_{h}(x_{1})K_{h}(x_{2})\ dF(x_{1})\ dF(x_{2})
1h2d{b¯(v1)TMb¯(v2)}2𝑑F(v1)𝑑F(v2)\displaystyle\lesssim\frac{1}{h^{2d}}\int\Big{\{}\overline{b}(v_{1})^{\mathrm{\scriptscriptstyle T}}M\overline{b}(v_{2})\Big{\}}^{2}\ dF^{*}(v_{1})\ dF^{*}(v_{2})
=M22h2d=(1h2d)Ω1/2Ω1Ω1/2+Ω1/2(Ω^1Ω1)Ω1/222\displaystyle=\frac{\|M\|_{2}^{2}}{h^{2d}}=\left(\frac{1}{h^{2d}}\right)\|\Omega^{1/2}\Omega^{-1}\Omega^{1/2}+\Omega^{1/2}(\widehat{\Omega}^{-1}-\Omega^{-1})\Omega^{1/2}\|_{2}^{2}
2(1h2d)(I22+Ω1/2(Ω^1Ω1)Ω1/222)\displaystyle\leq 2\left(\frac{1}{h^{2d}}\right)\left(\|I\|_{2}^{2}+\|\Omega^{1/2}(\widehat{\Omega}^{-1}-\Omega^{-1})\Omega^{1/2}\|_{2}^{2}\right)
2(1h2d)k(1+Ω2Ω^1Ω12)\displaystyle\leq 2\left(\frac{1}{h^{2d}}\right)k\left(1+\|\Omega\|^{2}\|\widehat{\Omega}^{-1}-\Omega^{-1}\|^{2}\right)

where the first line follows by definition, the second since (aπ^)(a-\widehat{\pi}) and (yμ^0)(y-\widehat{\mu}_{0}) are uniformly bounded and 0Kh(x)hd0\leq K_{h}(x)\lesssim h^{-d}, the third by a change of variables v=12+xx0hv=\frac{1}{2}+\frac{x-x_{0}}{h}, by definition of b¯\overline{b} and MM, and since ρ(v)\rho(v) is bounded, the fourth by definition since b¯b¯T𝑑F=I\int\overline{b}\overline{b}^{\mathrm{\scriptscriptstyle T}}\ dF^{*}=I, and the last two by basic properties of the Frobenius norm (e.g., the triangle inequality and A2kA\|A\|_{2}\leq\sqrt{k}\|A\|) together with (a+b)22(a2+b2)(a+b)^{2}\leq 2(a^{2}+b^{2}). ∎
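
As a small sanity check on the classical U-statistic variance formula invoked in the display above (stated there conditionally on D^{n}), the following simulation compares the empirical variance of a second-order U-statistic to that formula, in the simplest case of a symmetric kernel and iid standard normal data; the kernel and distribution are arbitrary illustrative choices.

import numpy as np

# Monte Carlo check of the second-order U-statistic variance formula
#   var(U_n) = [4(n-2)/(n(n-1))] var(xi_1(Z1)) + [2/(n(n-1))] var(xi(Z1,Z2)),
# for a simple symmetric kernel xi(z1,z2) = z1 z2 + z1 + z2 and Z ~ N(0,1),
# where xi_1(z) = E[xi(z, Z2)] = z, so var(xi_1) = 1 and var(xi) = 3.

rng = np.random.default_rng(1)
n, reps = 30, 20000

def u_stat(z):                               # U_n = average of xi over ordered pairs i != j
    M = z[:, None] * z[None, :] + z[:, None] + z[None, :]
    return (M.sum() - np.trace(M)) / (n * (n - 1))

u_vals = np.array([u_stat(rng.standard_normal(n)) for _ in range(reps)])
theory = 4 * (n - 2) / (n * (n - 1)) * 1.0 + 2 / (n * (n - 1)) * 3.0
print("empirical var:", u_vals.var(), "   formula:", theory)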


B.7 Technical Lemmas

Lemma S1.

Let Πwf(t)=b(t)TΩ1b(x)w(x)f(x)𝑑(x)\Pi_{w}f(t)=b(t)^{\mathrm{\scriptscriptstyle T}}\Omega^{-1}\int b(x)w(x)f(x)\ d\mathbb{P}(x) denote a ww-weighted projection with Ω=bwbT𝑑\Omega=\int bwb^{T}d\mathbb{P} for w0w\geq 0. And let θw,f=Ω1b(x)w(x)f(x)𝑑(x)\theta_{w,f}=\Omega^{-1}\int b(x)w(x)f(x)\ d\mathbb{P}(x) denote the coefficients of the projection. Then:

  (i) projections are orthogonal to residuals, i.e.,

    (Πwf)(IΠw)gw𝑑=0,\int(\Pi_{w}f)(I-\Pi_{w})g\ wd\mathbb{P}=0,

  (ii) the L2(w)L_{2}(w\mathbb{P}) norm of a projection is no more than that of the function, i.e.,

    (Πwf)2w𝑑f2w𝑑,\int(\Pi_{w}f)^{2}w\ d\mathbb{P}\leq\int f^{2}\ wd\mathbb{P},

    and the L2(w)L_{2}(w\mathbb{P}) norm of the approximation error is no more than that of the function, i.e.,

    {(IΠw)f}2w𝑑f2w𝑑,\int\{(I-\Pi_{w})f\}^{2}w\ d\mathbb{P}\leq\int f^{2}\ wd\mathbb{P},

  (iii) the L2L_{2} norm of the scaled coefficients is no more than the L2(w)L_{2}(w\mathbb{P}) norm of the function, i.e.,

    Ω1/2θw,f2f2w𝑑.\|\Omega^{1/2}\theta_{w,f}\|^{2}\leq\int f^{2}\ wd\mathbb{P}.
Proof.

For (i) let b(x)=Ω1/2b(x)b^{*}(x)=\Omega^{-1/2}b(x) so that bw(b)T𝑑=I\int b^{*}w(b^{*})^{\mathrm{\scriptscriptstyle T}}d\mathbb{P}=I and note that

(Πwf)(IΠw)\displaystyle\int(\Pi_{w}f)(I-\Pi_{w}) gwd=b(x)TΩ1b(t)f(t)g(x)w𝑑(t)w𝑑(x)\displaystyle g\ wd\mathbb{P}=\int\int b(x)^{\mathrm{\scriptscriptstyle T}}\Omega^{-1}b(t)f(t)g(x)\ wd\mathbb{P}(t)\ wd\mathbb{P}(x)
b(x)TΩ1b(t)f(t)b(x)TΩ1b(u)g(u)w𝑑(t)w𝑑(x)w𝑑(u)\displaystyle\hskip 65.04256pt-\int b(x)^{\mathrm{\scriptscriptstyle T}}\Omega^{-1}b(t)f(t)b(x)^{\mathrm{\scriptscriptstyle T}}\Omega^{-1}b(u)g(u)\ wd\mathbb{P}(t)\ wd\mathbb{P}(x)\ wd\mathbb{P}(u)
=b(x)Tb(t)f(t)g(x)w𝑑(t)w𝑑(x)\displaystyle=\int b^{*}(x)^{\mathrm{\scriptscriptstyle T}}b^{*}(t)f(t)g(x)\ wd\mathbb{P}(t)\ wd\mathbb{P}(x)
f(t)b(t)Tb(x)b(x)Tb(u)g(u)w𝑑(t)w𝑑(x)w𝑑(u)\displaystyle\hskip 65.04256pt-\int f(t)b^{*}(t)^{\mathrm{\scriptscriptstyle T}}b^{*}(x)b^{*}(x)^{\mathrm{\scriptscriptstyle T}}b^{*}(u)g(u)\ wd\mathbb{P}(t)\ wd\mathbb{P}(x)\ wd\mathbb{P}(u)
=b(x)Tb(t)f(t)g(x)w𝑑(t)w𝑑(x)\displaystyle=\int b^{*}(x)^{\mathrm{\scriptscriptstyle T}}b^{*}(t)f(t)g(x)\ wd\mathbb{P}(t)\ wd\mathbb{P}(x)
f(t)b(t)Tb(u)g(u)w𝑑(t)w𝑑(u)=0\displaystyle\hskip 65.04256pt-\int f(t)b^{*}(t)^{\mathrm{\scriptscriptstyle T}}b^{*}(u)g(u)\ wd\mathbb{P}(t)\ wd\mathbb{P}(u)=0

where the last line follows since b(x)b(x)Tw𝑑(x)=I\int b^{*}(x)b^{*}(x)^{\mathrm{\scriptscriptstyle T}}\ wd\mathbb{P}(x)=I.

For (ii) we have by definition that {Πwf(x)}2w(x)𝑑F(x)\int\{\Pi_{w}f(x)\}^{2}w(x)\ dF(x) equals

\displaystyle\int {f(u)w(u)b(u)T𝑑(u)Ω1b(x)}{b(x)TΩ1b(t)w(t)f(t)𝑑(t)}w(x)d(x)\displaystyle\left\{\int f(u)w(u)b(u)^{\mathrm{\scriptscriptstyle T}}\ d\mathbb{P}(u)\Omega^{-1}b(x)\right\}\left\{b(x)^{\mathrm{\scriptscriptstyle T}}\Omega^{-1}\int b(t)w(t)f(t)\ d\mathbb{P}(t)\right\}w(x)d\mathbb{P}(x)
={f(u)w(u)b(u)T𝑑(u)Ω1}{b(t)w(t)f(t)𝑑(t)}\displaystyle=\left\{\int f(u)w(u)b(u)^{\mathrm{\scriptscriptstyle T}}\ d\mathbb{P}(u)\Omega^{-1}\right\}\left\{\int b(t)w(t)f(t)\ d\mathbb{P}(t)\right\} (26)
=f(t)Πwf(t)w(t)𝑑(t)\displaystyle=\int f(t)\Pi_{w}f(t)w(t)\ d\mathbb{P}(t)

and so

(fΠwf)2w𝑑\displaystyle\int(f-\Pi_{w}f)^{2}w\ d\mathbb{P} =f2w𝑑2fΠwfw𝑑+(Πwf)2w𝑑\displaystyle=\int f^{2}w\ d\mathbb{P}-2\int f\Pi_{w}fw\ d\mathbb{P}+\int(\Pi_{w}f)^{2}w\ d\mathbb{P}
=f2w𝑑(Πwf)2w𝑑\displaystyle=\int f^{2}w\ d\mathbb{P}-\int(\Pi_{w}f)^{2}w\ d\mathbb{P}

which implies the results, as long as w0w\geq 0, so that the far left and right sides are non-negative.

For (iii) we have

Ω1/2θw,f2\displaystyle\|\Omega^{1/2}\theta_{w,f}\|^{2} =(Ω1/2θw,f)T(Ω1/2θw,f)=θw,fTΩθw,f\displaystyle=(\Omega^{1/2}\theta_{w,f})^{\mathrm{\scriptscriptstyle T}}(\Omega^{1/2}\theta_{w,f})=\theta_{w,f}^{\mathrm{\scriptscriptstyle T}}\Omega\theta_{w,f}
=(bTwf𝑑)Ω1(bwf𝑑)\displaystyle=\left(\int b^{\mathrm{\scriptscriptstyle T}}wf\ d\mathbb{P}\right)\Omega^{-1}\left(\int bwf\ d\mathbb{P}\right)
={Πwf(x)}2w(x)𝑑(x)\displaystyle=\int\{\Pi_{w}f(x)\}^{2}w(x)\ d\mathbb{P}(x)

where the last equality holds by (26), and so the result follows from Lemma S1(ii). ∎
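
As a concrete finite-dimensional illustration of Lemma S1 (not used in any proof), one can take \mathbb{P} to be an empirical measure, so that \Pi_{w} becomes a weighted least-squares projection; the basis, weight function, and test functions below are arbitrary choices.

import numpy as np

# Numerical illustration of Lemma S1 with P the empirical measure on x_1,...,x_n:
#   Omega = (1/n) sum_i b(x_i) w(x_i) b(x_i)',  Pi_w f = B theta_{w,f},
#   theta_{w,f} = Omega^{-1} (1/n) sum_i b(x_i) w(x_i) f(x_i).

rng = np.random.default_rng(2)
n = 500
x = rng.uniform(0, 1, n)
w = 0.5 + x                                   # arbitrary nonnegative weight
B = np.vander(x, 4, increasing=True)          # basis b(x) = (1, x, x^2, x^3)
f, g = np.sin(3 * x), np.exp(x)               # arbitrary test functions

Omega = B.T @ (B * (w / n)[:, None])
theta_f = np.linalg.solve(Omega, B.T @ (w * f) / n)
theta_g = np.linalg.solve(Omega, B.T @ (w * g) / n)
Pf, Pg = B @ theta_f, B @ theta_g
wnorm2 = lambda u: np.mean(w * u ** 2)        # squared L2(w P_n) norm

print("(i)   projection orthogonal to residual:", np.mean(w * Pf * (g - Pg)))   # ~ 0
print("(ii)  ||Pi_w f||^2 <= ||f||^2:", wnorm2(Pf) <= wnorm2(f))
print("      ||(I - Pi_w) f||^2 <= ||f||^2:", wnorm2(f - Pf) <= wnorm2(f))
print("(iii) ||Omega^{1/2} theta_{w,f}||^2 <= ||f||^2:",
      np.isclose(theta_f @ Omega @ theta_f, wnorm2(Pf)) and theta_f @ Omega @ theta_f <= wnorm2(f))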


Lemma S2.

Assume:

  1. 0<bλmin(Ω)λmax(Ω)B<0<b\leq\lambda_{\min}(\Omega)\leq\lambda_{\max}(\Omega)\leq B<\infty,

  2. 0<cdF^/dFC<0<c\leq d\widehat{F}^{*}/dF^{*}\leq C<\infty pointwise, so that in particular dF^/dFC\|d\widehat{F}^{*}/dF^{*}\|_{\infty}\leq C and dF/dF^1/c\|dF^{*}/d\widehat{F}^{*}\|_{\infty}\leq 1/c.

Then

bcλmin(Ω^)λmax(Ω^)BCbc\leq\lambda_{\min}(\widehat{\Omega})\leq\lambda_{\max}(\widehat{\Omega})\leq BC

and

Ω^1Ω1(Bb2c)(dF^/dF)1.\|\widehat{\Omega}^{-1}-\Omega^{-1}\|\leq\left(\frac{B}{b^{2}c}\right)\|(d\widehat{F}^{*}/dF^{*})-1\|_{\infty}.

Proof.

The logic mirrors that of the proof of Proposition 2.1 in Belloni et al. [2015]. Note that

aTΩ^a\displaystyle a^{\mathrm{\scriptscriptstyle T}}\widehat{\Omega}a ={aTb(v)}2𝑑F^(v)\displaystyle=\int\Big{\{}a^{\mathrm{\scriptscriptstyle T}}b(v)\Big{\}}^{2}\ d\widehat{F}^{*}(v)
dF^/dF{aTb(v)}2𝑑F(v)\displaystyle\leq\|d\widehat{F}^{*}/dF^{*}\|_{\infty}\int\Big{\{}a^{\mathrm{\scriptscriptstyle T}}b(v)\Big{\}}^{2}\ d{F}^{*}(v)
=dF^/dFaTΩa\displaystyle=\|d\widehat{F}^{*}/dF^{*}\|_{\infty}\ a^{\mathrm{\scriptscriptstyle T}}\Omega a

and by the same logic, aTΩadF/dF^aTΩ^aa^{\mathrm{\scriptscriptstyle T}}\Omega a\leq\|dF^{*}/d\widehat{F}^{*}\|_{\infty}\ a^{\mathrm{\scriptscriptstyle T}}\widehat{\Omega}a. Therefore

λmax(Ω^)\displaystyle\lambda_{\max}(\widehat{\Omega}) =maxa=1aTΩ^adF^/dFλmax(Ω)\displaystyle=\max_{\|a\|=1}a^{\mathrm{\scriptscriptstyle T}}\widehat{\Omega}a\leq\|d\widehat{F}^{*}/dF^{*}\|_{\infty}\lambda_{\max}(\Omega)
λmin(Ω^)\displaystyle\lambda_{\min}(\widehat{\Omega}) =mina=1aTΩ^adF/dF^1λmin(Ω)\displaystyle=\min_{\|a\|=1}a^{\mathrm{\scriptscriptstyle T}}\widehat{\Omega}a\geq\|dF^{*}/d\widehat{F}^{*}\|_{\infty}^{-1}\lambda_{\min}(\Omega)

by the min-max theorem, which gives the first inequality. For the second, note

Ω^1Ω1\displaystyle\|\widehat{\Omega}^{-1}-\Omega^{-1}\| =Ω^1(ΩΩ^)Ω1\displaystyle=\|\widehat{\Omega}^{-1}(\Omega-\widehat{\Omega})\Omega^{-1}\|
Ω^1ΩΩ^Ω1\displaystyle\leq\|\widehat{\Omega}^{-1}\|\|\Omega-\widehat{\Omega}\|\|\Omega^{-1}\|

by the sub-multiplicative property of the operator norm, and then

ΩΩ^(dF^/dF)1Ω\|\Omega-\widehat{\Omega}\|\leq\|(d\widehat{F}^{*}/dF^{*})-1\|_{\infty}\ \|\Omega\|

by the same logic as above. ∎


B.8 Covariate Density

Although the lower bounds in Theorems 1 and 3 are given for a simple model in which the covariate density is merely bounded, the rates are only attainable when the covariate density is known or can be estimated accurately enough, as discussed in Remark 11. More specifically, the lower and upper bounds match for a model in which the covariate density dF(x)dF(x) is known, satisfies 1{xx0h/2}𝑑F(x)hd\int\mathbbold{1}\{\|x-x_{0}\|\leq h/2\}\ dF(x)\asymp h^{d}, and has local support {xd:dF(x)>0,xx0h/2}\{x\in\mathbb{R}^{d}:dF(x)>0,\|x-x_{0}\|\leq h/2\} equal to a union of no more than kk disjoint cubes, all of proportional volume, for hh and kk as defined in Proposition 2. The lower bounds follow from Theorems 1 and 3, since the density used in our construction satisfies these conditions; the matching upper bound is implied by the results in Section 4 together with the following lemma, in which we prove that Condition A holds for this class of densities.

Lemma S3.

Assume:

  1. dF(x)dF(x) satisfies 1{xx0h/2}𝑑F(x)hd\int\mathbbold{1}\{\|x-x_{0}\|\leq h/2\}\ dF(x)\asymp h^{d},

  2. dF(x)dF(x) is bounded above and below away from zero on its local support {xd:dF(x)>0,xx0h/2}\{x\in\mathbb{R}^{d}:dF(x)>0,\|x-x_{0}\|\leq h/2\}, which is a union of no more than kk disjoint cubes, all of proportional volume, for hh and kk as defined in Proposition 2.

Let dF(v)=dF(x0+h(v1/2))dF^{*}(v)=dF(x_{0}+h(v-1/2)) denote the distribution in h(x0)\mathcal{B}_{h}(x_{0}), the hh-ball around x0x_{0}, mapped to [0,1]d[0,1]^{d}, and similarly for π\pi^{*}. Then the eigenvalues of

Q=ρ(v)ρ(v)Tπ(v){1π(v)}𝑑F(v)Q=\int\rho(v)\rho(v)^{\mathrm{\scriptscriptstyle T}}\pi^{*}(v)\{1-\pi^{*}(v)\}\ dF^{*}(v)

are bounded above and below away from zero, and there exists a basis bb with Hölder approximation property (6), for which the eigenvalues of

Ω=b(v)b(v)T𝑑F(v)\Omega=\int b(v)b(v)^{\mathrm{\scriptscriptstyle T}}\ dF^{*}(v)

are bounded above and below away from zero.


Proof.

Note that since ϵ(1ϵ)π(1π)1/4\epsilon(1-\epsilon)\leq\pi(1-\pi)\leq 1/4 and since dF(x)dF(x) is bounded above and below on its support, the eigenvalues of QQ will be bounded if those of the matrix

j=1kρ(v)ρ(v)T1(vMj)𝑑v\sum_{j=1}^{k}\int\rho(v)\rho(v)^{\mathrm{\scriptscriptstyle T}}\mathbbold{1}(v\in M_{j})\ dv

are, where MjM_{j} indicates the jjth disjoint cube making up the local support of FF. By the min-max theorem, the eigenvalues are bounded by the min and max of

aTρ(v)ρ(v)Tj1(vMj)dva={aTρ(v)}2j1(vMj)dva^{\mathrm{\scriptscriptstyle T}}\int\rho(v)\rho(v)^{\mathrm{\scriptscriptstyle T}}\sum_{j}\mathbbold{1}(v\in M_{j})\ dv\ a=\int\Big{\{}a^{\mathrm{\scriptscriptstyle T}}\rho(v)\Big{\}}^{2}\sum_{j}\mathbbold{1}(v\in M_{j})\ dv

over all aqa\in\mathbb{R}^{q} with a=1\|a\|=1. First consider lower bounding the eigenvalues. Note g(v)=aTρ(v)g(v)=a^{\mathrm{\scriptscriptstyle T}}\rho(v) is a polynomial of degree at most qq. Therefore

jg(v)21(vMj)𝑑v\displaystyle\sum_{j}\int g(v)^{2}\mathbbold{1}(v\in M_{j})\ dv g(v)21{g(v)2ϵ}j1(vMj)dv\displaystyle\geq\int g(v)^{2}\mathbbold{1}\{g(v)^{2}\geq\epsilon\}\sum_{j}\mathbbold{1}(v\in M_{j})\ dv
ϵ1{g(v)2ϵ}j1(vMj)dv\displaystyle\geq\epsilon\int\mathbbold{1}\{g(v)^{2}\geq\epsilon\}\sum_{j}\mathbbold{1}(v\in M_{j})\ dv
ϵ{j1(vMj)𝑑v1{g(v)2<ϵ}𝑑v}\displaystyle\geq\epsilon\left\{\sum_{j}\int\mathbbold{1}(v\in M_{j})\ dv-\int\mathbbold{1}\{g(v)^{2}<\epsilon\}\ dv\right\}
\displaystyle\geq\epsilon\left(C^{*}-C\epsilon^{1/2q}\right)\geq\frac{C^{*}}{2}\left(\frac{C^{*}}{2C}\right)^{2q}>0

where the last line follows since

j1(vMj)dv\displaystyle\int\sum_{j}\mathbbold{1}(v\in M_{j})\ dv 𝑑F(v)=𝑑F(x0+h(v1/2))\displaystyle\asymp\int dF^{*}(v)=\int dF(x_{0}+h(v-1/2))
=hd1{xx0h/2}𝑑F(x)C\displaystyle=h^{-d}\int\mathbbold{1}\{\|x-x_{0}\|\leq h/2\}\ dF(x)\geq C^{*}

from a change of variables x=x0+h(v1/2)x=x_{0}+h(v-1/2), so that v=1/2+(xx0)/hv=1/2+(x-x_{0})/h and hddv=dxh^{d}\ dv=dx, together with Assumption 1 of the lemma, and by Theorem 4 of Carbery and Wright [2001]. The last inequality follows by choosing ϵ=(C/2C)2q\epsilon=\left(C^{*}/2C\right)^{2q}, so that CCϵ1/2q=C/2C^{*}-C\epsilon^{1/2q}=C^{*}/2. To upper-bound the eigenvalues, note that

{aTρ(v)}2j1(vMj)dv{aTρ(v)}2𝑑v=a2=1\displaystyle\int\Big{\{}a^{\mathrm{\scriptscriptstyle T}}\rho(v)\Big{\}}^{2}\sum_{j}\mathbbold{1}(v\in M_{j})\ dv\leq\int\Big{\{}a^{\mathrm{\scriptscriptstyle T}}\rho(v)\Big{\}}^{2}\ dv=\|a\|^{2}=1

since ρ(v)ρ(v)T𝑑v=I\int\rho(v)\rho(v)^{\mathrm{\scriptscriptstyle T}}\ dv=I by the orthonormality of ρ\rho.

Now consider the eigenvalues of Ω\Omega. We will construct a basis of order kk for which the eigenvalues of Ω\Omega are bounded and for which the Hölder approximation property (6) holds.

Remark 12.

For simplicity we only consider the case where there are exactly kk cubes in the local support of FF; if there are finitely many cubes, say M<M<\infty, the arguments are more straightforward, orthonormalizing a standard series MM times, once per cube, and taking k/Mk/M terms from each. We omit details in the intermediate case where the number of cubes scales at a rate slower than kk, but we expect similar arguments to those below can be used.

First, let ρ(x)\rho(x) denote the tensor products of a Legendre basis of order ss, orthonormal on [1,1][-1,1], i.e., tensor products of

ρj(x)=j+1/22jj!djdxj(x21)j\rho_{j}(x_{\ell})=\frac{\sqrt{j+1/2}}{2^{j}j!}\frac{d^{j}}{dx_{\ell}^{j}}(x_{\ell}^{2}-1)^{j}

for j{1,,s}j\in\{1,\dots,s\} and {1,,d}\ell\in\{1,\dots,d\}. This basis satisfies

ρ(x)ρ(x)T𝑑x=Is.\int\rho(x)\rho(x)^{\mathrm{\scriptscriptstyle T}}\ dx=I_{s}.

Now shift and rescale so that the basis is orthonormal on MjM_{j} with

bj(x)=2dkρ(4k1/d(x12mjx0h))b_{j}(x)=2^{d}\sqrt{k}\rho\left(4k^{1/d}\left(x-\frac{1}{2}-\frac{m_{j}-x_{0}}{h}\right)\right)

and so satisfies

\displaystyle\int b_{j}(x)b_{j}(x)^{\mathrm{\scriptscriptstyle T}}\mathbbold{1}(x\in M_{j})\ dx =4^{d}k\int\rho\left(4k^{1/d}\left(x-\frac{1}{2}-\frac{m_{j}-x_{0}}{h}\right)\right)\rho\left(4k^{1/d}\left(x-\frac{1}{2}-\frac{m_{j}-x_{0}}{h}\right)\right)^{\mathrm{\scriptscriptstyle T}}\mathbbold{1}(x\in M_{j})\ dx
\displaystyle=\int_{[-1,1]^{d}}\rho(v)\rho(v)^{\mathrm{\scriptscriptstyle T}}\ dv=I_{s}

where we used the change of variable v=4k1/d(x12mjx0h)v=4k^{1/d}\left(x-\frac{1}{2}-\frac{m_{j}-x_{0}}{h}\right) so that dv=4dkdxdv=4^{d}k\ dx and x=v/4k1/d+1/2+(mjx0)/hMjx=v/4k^{1/d}+1/2+(m_{j}-x_{0})/h\in M_{j} when v[1,1]dv\in[-1,1]^{d}. Finally define the basis

b(v)={b1(v)T1(vM1),,bk(v)T1(vMk)}Tb(v)=\{b_{1}(v)^{\mathrm{\scriptscriptstyle T}}\mathbbold{1}(v\in M_{1}),\dots,b_{k}(v)^{\mathrm{\scriptscriptstyle T}}\mathbbold{1}(v\in M_{k})\}^{\mathrm{\scriptscriptstyle T}}

of length sksk. Then

b(v)b(v)T=(b1(v)b1(v)T1(vM1)000b2(v)b2(v)T1(vM2)000bk(v)bk(v)T1(vMk))\displaystyle b(v)b(v)^{\mathrm{\scriptscriptstyle T}}=\begin{pmatrix}b_{1}(v)b_{1}(v)^{\mathrm{\scriptscriptstyle T}}\mathbbold{1}(v\in M_{1})&0&\cdots&0\\ 0&b_{2}(v)b_{2}(v)^{\mathrm{\scriptscriptstyle T}}\mathbbold{1}(v\in M_{2})&\cdots&0\\ \vdots&\vdots&\ddots&\vdots\\ 0&0&\cdots&b_{k}(v)b_{k}(v)^{\mathrm{\scriptscriptstyle T}}\mathbbold{1}(v\in M_{k})\\ \end{pmatrix}

Therefore, since dF(v)dF^{*}(v) is bounded above and below on its support, the jj-th diagonal block of Ω\Omega is bounded above and below (in the positive semidefinite order) by constant multiples of

bj(v)bj(v)T1(vMj)𝑑v\displaystyle\int b_{j}(v)b_{j}(v)^{\mathrm{\scriptscriptstyle T}}\mathbbold{1}(v\in M_{j})\ dv =Is.\displaystyle=I_{s}.

Therefore the eigenvalues of Ω\Omega are bounded above and below away from zero, as desired. Further, by the same higher-order kernel arguments for local polynomials as in Proposition 3, the basis satisfies the Hölder approximation property (6). ∎
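
The construction in this proof can also be illustrated numerically. The sketch below (d=1, with hypothetical interval centers and widths) builds the shifted and rescaled Legendre blocks b_{j} and checks that the matrix \int b(v)b(v)^{\mathrm{\scriptscriptstyle T}}\ dv over the union of the intervals is approximately the identity, so that the eigenvalues of \Omega are bounded whenever dF^{*} is bounded above and below there.

import numpy as np
from numpy.polynomial import legendre

# Numerical check (d = 1) of the block basis construction in Lemma S3: shifted and
# rescaled Legendre polynomials supported on k disjoint intervals M_j, normalized so
# that int b(v) b(v)' dv over the union of the M_j is the identity. The interval
# centers and widths below are illustrative assumptions.

k, s = 5, 3                                    # number of intervals, polynomials per interval
centers = (np.arange(k) + 0.5) / k             # hypothetical interval centers in [0, 1]
half = 1 / (4 * k)                             # M_j = [c_j - 1/(4k), c_j + 1/(4k)]

def rho(u):                                    # Legendre basis orthonormal on [-1, 1]
    return np.stack([np.sqrt(j + 0.5) * legendre.Legendre.basis(j)(u) for j in range(s)])

def b(v):                                      # stacked local basis of length s * k
    out = np.zeros((s * k, v.size))
    for j, c in enumerate(centers):
        inside = np.abs(v - c) <= half
        out[j * s:(j + 1) * s, :] = 2 * np.sqrt(k) * rho(4 * k * (v - c)) * inside
    return out

v = np.linspace(0, 1, 200001)                  # quadrature grid on [0, 1]
Bv = b(v)
Omega = Bv @ Bv.T / v.size                     # approximates int b(v) b(v)' dv
print("max |Omega - I| =", np.abs(Omega - np.eye(s * k)).max())   # small, up to quadrature error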