
Minimax rates for heterogeneous causal effect estimation


Edward H. Kennedy1, Sivaraman Balakrishnan1,2, James M. Robins3, Larry Wasserman1,2

1Department of Statistics & Data Science, Carnegie Mellon University
2Machine Learning Department, Carnegie Mellon University
3Departments of Biostatistics and Epidemiology, Harvard University

{edward, siva, larry} @ stat.cmu.edu, [email protected]
Abstract

Estimation of heterogeneous causal effects – i.e., how effects of policies and treatments vary across subjects – is a fundamental task in causal inference. Many methods for estimating conditional average treatment effects (CATEs) have been proposed in recent years, but questions surrounding optimality have remained largely unanswered. In particular, a minimax theory of optimality has yet to be developed, with the minimax rate of convergence and construction of rate-optimal estimators remaining open problems. In this paper we derive the minimax rate for CATE estimation, in a Hölder-smooth nonparametric model, and present a new local polynomial estimator, giving high-level conditions under which it is minimax optimal. Our minimax lower bound is derived via a localized version of the method of fuzzy hypotheses, combining lower bound constructions for nonparametric regression and functional estimation. Our proposed estimator can be viewed as a local polynomial R-Learner, based on a localized modification of higher-order influence function methods. The minimax rate we find exhibits several interesting features, including a non-standard elbow phenomenon and an unusual interpolation between nonparametric regression and functional estimation rates. The latter quantifies how the CATE, as an estimand, can be viewed as a regression/functional hybrid.


Keywords: causal inference, functional estimation, higher order influence functions, nonparametric regression, optimal rates of convergence.

1 Introduction

In this paper we consider estimating the difference in regression functions

\tau(x)=\mathbb{E}(Y\mid X=x,A=1)-\mathbb{E}(Y\mid X=x,A=0)   (1)

from an iid sample of observations of Z=(X,A,Y). Let Y^{a} denote the counterfactual outcome that would have been observed under treatment level A=a. Then, under the assumptions of consistency (i.e., Y=Y^{a} if A=a), positivity (i.e., \epsilon\leq\mathbb{P}(A=1\mid X)\leq 1-\epsilon with probability one, for some \epsilon>0), and no unmeasured confounding (i.e., A\perp\!\!\!\perp Y^{a}\mid X), the quantity \tau(x) also equals the conditional average treatment effect (CATE)

\mathbb{E}(Y^{1}-Y^{0}\mid X=x).

The CATE \tau(x) gives a more individualized picture of treatment effects compared to the overall average treatment effect (ATE) \mathbb{E}(Y^{1}-Y^{0}), and plays a crucial role in many fundamental tasks in causal inference, including assessing effect heterogeneity, constructing optimal treatment policies, generalizing treatment effects to new populations, finding subgroups with enhanced effects, and more. Further, these tasks have far-reaching implications across the sciences, from personalizing medicine to optimizing voter turnout. We refer to Hernán and Robins (2020) (chapter 4), Nie and Wager (2021), Kennedy (2023), and citations therein, for general discussion and review.

The simplest approach to CATE estimation would be to assume a low-dimensional parametric model for the outcome regression \mathbb{E}(Y\mid X,A); then maximum likelihood estimates could be easily constructed, and under regularity conditions the resulting plug-in estimator would be minimax optimal. However, when X has continuous components, it is typically difficult to specify a correct parametric model, and under misspecification the previously described approach could lead to substantial bias. This suggests the need for more flexible methods. Early work in flexible CATE estimation employed semiparametric models, for example partially linear models assuming \tau(x) to be constant, or structural nested models in which \tau(x) followed some known parametric form, but leaving other parts of the distribution unspecified (Robinson, 1988; Robins et al., 1992; Robins, 1994; van der Laan and Robins, 2003; van der Laan, 2005; Vansteelandt and Joffe, 2014). An important theme in this work is that the CATE can be much more structured and simple than the rest of the data-generating process. Specifically, the individual regression functions \mu_{a}(x)=\mathbb{E}(Y\mid X=x,A=a) for each a=0,1 may be very complex (e.g., non-smooth or non-sparse), even when the difference \tau(x)=\mu_{1}(x)-\mu_{0}(x) is very smooth or sparse, or even constant or zero. We refer to Kennedy (2023) for some recent discussion of this point.

More recently there has been increased emphasis on incorporating nonparametrics and machine learning tools for CATE estimation. We briefly detail two especially relevant streams of this recent literature, based on so-called DR-Learner and R-Learner methods, both of which rely on doubly robust-style estimation. The DR-Learner is a model-free meta-algorithm first proposed by van der Laan (2005) (Section 4.2), which essentially takes the components of the classic doubly robust estimator of the ATE, and rather than averaging, instead regresses on covariates. It has since been specialized to particular methods, e.g., cross-validated ensembles (Luedtke and van der Laan, 2016), kernel (Lee et al., 2017; Fan et al., 2019; Zimmert and Lechner, 2019) and series methods (Semenova and Chernozhukov, 2017), empirical risk minimization (Foster and Syrgkanis, 2019), and linear smoothers (Kennedy, 2023). On the other hand, the R-Learner is a flexible adaptation of the double-residual regression method originally built for partially linear models (Robinson, 1988), with the first nonparametric version proposed by Robins et al. (2008) (Section 5.2) using series methods. The R-Learner has since been adapted to RKHS regression (Nie and Wager, 2021), lasso (Zhao et al., 2017; Chernozhukov et al., 2017), and local polynomials (Kennedy, 2023). Many flexible non-doubly robust methods have also been proposed in recent years, often based on inverse-weighting or direct regression estimation (Foster et al., 2011; Imai and Ratkovic, 2013; Athey and Imbens, 2016; Shalit et al., 2017; Wager and Athey, 2018; Künzel et al., 2019; Hahn et al., 2020).

Despite the wide variety of methods available for flexible CATE estimation, questions of optimality have remained mostly unsolved. Gao and Han (2020) studied minimax optimality, but in a specialized model where the propensity score has zero smoothness, and covariates are non-random; this model does not reflect the kinds of assumptions typically used in practice, e.g., in the papers cited in the previous paragraph. Some but not all of these papers derive upper bounds on the error of their proposed CATE estimators; in the best case, these take the form of an oracle error rate (which would remain even if the potential outcomes (Y^{1}-Y^{0}) were observed and regressed on covariates), plus some contribution coming from having to estimate nuisance functions (i.e., outcome regressions and propensity scores). The fastest rates we are aware of come from Foster and Syrgkanis (2019) and Kennedy (2023). Foster and Syrgkanis (2019) studied global error rates, obtaining an oracle error plus sums of squared L_{4} errors in all nuisance components. Kennedy (2023) studied pointwise error rates, giving two main results; in the first, they obtain the oracle error plus a product of nuisance errors, while in the second, they obtain a faster rate via undersmoothing (described in more detail in Section 3.3). However, since these are all upper bounds on the errors of particular procedures, it is unknown whether these rates are optimal in any sense, and if they are not, how they might be improved upon. In this paper we resolve these questions (via the minimax framework, in a nonparametric model that allows components of the data-generating process to be infinite-dimensional, yet smooth in the Hölder sense).

More specifically, in Section 3 we derive a lower bound on the minimax rate of CATE estimation, indicating the best possible (worst-case) performance of any estimator, in a model where the CATE, regression function, and propensity score are Hölder-smooth functions. Our derivation uses an adaptation of the method of fuzzy hypotheses, which is specially localized compared to the constructions previously used for obtaining lower bounds in functional estimation and hypothesis testing (Ibragimov et al., 1987; Birgé and Massart, 1995; Nemirovski, 2000; Ingster et al., 2003; Tsybakov, 2009; Robins et al., 2009b). In Section 4, we confirm that our minimax lower bound is tight (under some conditions), by proposing and analyzing a new local polynomial R-Learner, using localized adaptations of higher order influence function methodology (Robins et al., 2008, 2009a, 2017). In addition to giving a new estimator that is provably optimal (under some conditions, e.g., on how well the covariate density is estimated), our results also confirm that previously proposed estimators were not generally optimal in this smooth nonparametric model. Our minimax rate also sheds light on the nature of the CATE as a statistical quantity, showing how it acts as a regression/functional hybrid: for example, the rate interpolates between nonparametric regression and functional estimation, depending on the relative smoothness of the CATE and nuisance functions (outcome regression and propensity score).

2 Setup & Notation

We consider an iid sample of n observations of Z=(X,A,Y) from distribution \mathbb{P}, where X\in[0,1]^{d} denotes covariates, A\in\{0,1\} a treatment or policy indicator, and Y\in\mathbb{R} an outcome of interest. We let F denote the distribution function of the covariate X (with density f as needed), and let

\pi(x) = \mathbb{P}(A=1\mid X=x)
\eta(x) = \mathbb{E}(Y\mid X=x)
\mu_{a}(x) = \mathbb{E}(Y\mid X=x,A=a)

denote the propensity score, marginal, and treatment-specific outcome regressions, respectively. We sometimes omit arguments from functions to ease notation, e.g., note that \tau=(\eta-\mu_{0})/\pi. We also index quantities by a distribution P when needed, e.g., \tau(x) under a particular distribution P is written \tau_{P}(x); depending on context, no indexing means the quantity is evaluated at the true \mathbb{P}, e.g., \tau(x)=\tau_{\mathbb{P}}(x).

Our goal is to study estimation of the CATE \tau(x)=\mu_{1}(x)-\mu_{0}(x) at a point x_{0}\in(0,1)^{d}, with error quantified by the mean absolute error

\mathbb{E}\left|\widehat{\tau}(x_{0})-\tau(x_{0})\right|.

As detailed in subsequent sections, we work in a nonparametric model \mathcal{P} whose components are infinite-dimensional functions but with some smoothness. We say a function is s-smooth if it belongs to a Hölder class with index s; this essentially means it has s-1 bounded derivatives, and the highest order derivative is continuous. To be more precise, let \lfloor s\rfloor denote the largest integer strictly smaller than s, and let D^{\alpha}=\frac{\partial^{\alpha}}{\partial x_{1}^{\alpha_{1}}\dots\partial x_{d}^{\alpha_{d}}} denote the partial derivative operator. Then the Hölder class with index s contains all functions g:\mathcal{X}\rightarrow\mathbb{R} that are \lfloor s\rfloor times continuously differentiable, with derivatives up to order \lfloor s\rfloor bounded, i.e.,

\left|D^{\alpha}g(x)\right|\leq C<\infty

for all \alpha=(\alpha_{1},\dots,\alpha_{d}) with \sum_{j}\alpha_{j}\leq\lfloor s\rfloor and for all x\in\mathcal{X}, and with \lfloor s\rfloor-order derivatives satisfying the Lipschitz condition

\left|D^{\beta}g(x)-D^{\beta}g(x^{\prime})\right|\leq C\|x-x^{\prime}\|^{s-\lfloor s\rfloor}

for some C<\infty, for all \beta=(\beta_{1},\dots,\beta_{d}) with \sum_{j}\beta_{j}=\lfloor s\rfloor, and for all x,x^{\prime}\in\mathcal{X}, where for a vector v\in\mathbb{R}^{d} we let \|v\| denote the Euclidean norm. Sometimes Hölder classes are referenced by both the smoothness s and constant C, but we focus our discussion on the smoothness s and omit the constant, which is assumed finite and independent of n.
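For concreteness, here is a small worked instance of this definition (an illustration only, not an additional assumption): take d=1 and s=3/2, so that \lfloor s\rfloor=1. Then g belongs to the Hölder class with index s if |g(x)|\leq C and |g^{\prime}(x)|\leq C for all x\in\mathcal{X}, and the first derivative satisfies

|g^{\prime}(x)-g^{\prime}(x^{\prime})|\leq C|x-x^{\prime}|^{1/2}

for all x,x^{\prime}\in\mathcal{X}.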

We write the squared L_{2}(Q) norm of a function as \|g\|_{Q}^{2}=\int g(z)^{2}\ dQ(z). The sup-norm is denoted by \|f\|_{\infty}=\sup_{z\in\mathcal{Z}}|f(z)|. For a matrix A we let \|A\| and \|A\|_{2} denote the operator/spectral and Frobenius norms, respectively, and let \lambda_{\min}(A) and \lambda_{\max}(A) denote the minimum and maximum eigenvalues of A, respectively. We write a_{n}\lesssim b_{n} if a_{n}\leq Cb_{n} for C a positive constant independent of n, and a_{n}\asymp b_{n} if a_{n}\leq Cb_{n} and b_{n}\leq Ca_{n} (i.e., if a_{n}\lesssim b_{n} and b_{n}\lesssim a_{n}). We write a_{n}\sim b_{n} to mean that a_{n} and b_{n} are proportional, i.e., a_{n}=Cb_{n} for some C. We also use a\vee b=\max(a,b) and a\wedge b=\min(a,b). We use the shorthand \mathbb{P}_{n}(f)=\mathbb{P}_{n}\{f(Z)\}=\frac{1}{n}\sum_{i=1}^{n}f(Z_{i}) to write sample averages, and similarly \mathbb{U}_{n}(f)=\mathbb{U}_{n}\{f(Z_{1},Z_{2})\}=\frac{1}{n(n-1)}\sum_{i\neq j}f(Z_{i},Z_{j}) for the U-statistic measure.

3 Fundamental Limits

In this section we derive a lower bound on the minimax rate for CATE estimation. This result has several crucial implications, both practical and theoretical. First, it gives a benchmark for the best possible performance of any CATE estimator in the nonparametric model defined in Theorem 1. In particular, if an estimator is shown to attain this benchmark, then one can safely conclude the estimator cannot be improved, at least in terms of worst-case rates, without adding assumptions; conversely, if the benchmark is not shown to be attained, then one should continue searching for better estimators (or better lower or upper risk bounds). Second, a tight minimax lower bound is important in its own right as a measure of the fundamental limits of CATE estimation, illustrating precisely how difficult CATE estimation is in a statistical sense. The main result of this section is given in Theorem 1 below; its proof is completed, and the resulting rate discussed in detail, in Section 3.3.

Theorem 1.

For x_{0}\in(0,1)^{d}, let \mathcal{P} denote the model where:

  1. f(x) is bounded above by a constant,

  2. \pi(x) is \alpha-smooth,

  3. \mu_{0}(x) is \beta-smooth, and

  4. \tau(x) is \gamma-smooth.

Let s\equiv(\alpha+\beta)/2. Then for n larger than a constant depending on (\alpha,\beta,\gamma,d), the minimax rate is lower bounded as

\inf_{\widehat{\tau}}\sup_{P\in\mathcal{P}}\mathbb{E}_{P}|\widehat{\tau}(x_{0})-\tau_{P}(x_{0})|\gtrsim\begin{cases}n^{-1/\left(1+\frac{d}{2\gamma}+\frac{d}{4s}\right)}&\text{ if }s<\frac{d/4}{1+d/2\gamma}\\ n^{-1/\left(2+\frac{d}{\gamma}\right)}&\text{ otherwise}.\end{cases}

Remark 1.

In Appendix A we also give results (both lower and upper bounds) for the model that puts smoothness assumptions on the marginal regression \eta(x)=\mathbb{E}(Y\mid X=x), instead of the control regression \mu_{0}(x)=\mathbb{E}(Y\mid X=x,A=0). Interestingly, the minimax rates differ in these two models, but only in the regime where the regression function is more smooth than the propensity score (i.e., \beta>\alpha). Specifically, when \eta is \beta-smooth, the minimax rate from Theorem 1 holds but with s=(\alpha+\beta)/2 replaced by \min(\alpha,s).

Crucially, Condition 4 allows the CATE \tau(x) to have its own smoothness \gamma, which is necessarily at least the regression smoothness \beta, but can also be much larger, as described in the Introduction. We defer discussion of the details of the overall minimax rate of Theorem 1 to Section 3.3, moving first to a proof of the result.

Remark 2.

For simplicity, the lower bound result in Theorem 1 is given for a large model in which the covariate density is only bounded. However, as discussed in detail later in Section 4 and Remark 11, for the stated rate to be attainable more conditions on the covariate density are required. In Section B.8 of the Appendix, we give a particular submodel of \mathcal{P} under which upper and lower bounds on the minimax rate match up to constants; it will be important in future work to further elucidate the role of the covariate density in CATE estimation.

The primary strategy in deriving minimax lower bounds is to construct distributions that are similar enough that they are statistically indistinguishable, but for which the parameter of interest is maximally separated; this implies no estimator can have error uniformly smaller than this separation. More specifically, we derive our lower bound using a localized version of the method of fuzzy hypotheses (Ibragimov et al., 1987; Birgé and Massart, 1995; Nemirovski, 2000; Ingster et al., 2003; Tsybakov, 2009; Robins et al., 2009b). In the classic Le Cam two-point method, which can be used to derive minimax lower bounds for nonparametric regression at a point (Tsybakov, 2009), it suffices to consider a pair of distributions that differ locally; however, for nonlinear functional estimation, such pairs give bounds that are too loose. One instead needs to construct pairs of mixture distributions, which can be viewed via a prior over distributions in the model (Birgé and Massart, 1995; Tsybakov, 2009; Robins et al., 2009b). Our construction combines these two approaches via a localized mixture, as will be described in detail in the next subsection.

Remark 3.

In what follows we focus on the lower bound in the low smoothness regime where s<\frac{d/4}{1+d/2\gamma}. The n^{-1/(2+d/\gamma)} lower bound for the high smoothness regime matches the classic smooth nonparametric regression rate, and follows from a standard two-point argument, using the same construction as in Section 2.5 of Tsybakov (2009).

The following lemma, adapted from Section 2.7.4 of Tsybakov (2009), provides the foundation for the minimax lower bound result of this section.

Lemma 1 (Tsybakov (2009)).

Let P_{\lambda} and Q_{\lambda} denote distributions in \mathcal{P} indexed by a vector \lambda=(\lambda_{1},\dots,\lambda_{k}), with n-fold products denoted by P_{\lambda}^{n} and Q_{\lambda}^{n}, respectively. Let \varpi denote a prior distribution over \lambda. If

H^{2}\left(\int P_{\lambda}^{n}\ d\varpi(\lambda),\int Q_{\lambda}^{n}\ d\varpi(\lambda)\right)\leq\alpha<2

and

\psi(P_{\lambda})-\psi(Q_{\lambda^{\prime}})\geq s>0

for a functional \psi:\mathcal{P}\mapsto\mathbb{R} and for all \lambda,\lambda^{\prime}, then

\inf_{\widehat{\psi}}\sup_{P\in\mathcal{P}}\mathbb{E}_{P}\left\{\ell\left(\left|\widehat{\psi}-\psi(P)\right|\right)\right\}\geq\ell(s/2)\left(\frac{1-\sqrt{\alpha(1-\alpha/4)}}{2}\right)

for any monotonic non-negative loss function \ell.


Lemma 1 illuminates the three ingredients for deriving a minimax lower bound, and shows how they interact. The ingredients are: (i) a pair of mixture distributions, (ii) the distance between their n-fold products, which is ideally small, and (iii) the separation of the parameter of interest under the mixtures, which is ideally large. Finding the right minimax lower bound requires balancing these three ingredients appropriately: with too much distance or not enough separation, the lower bound will be too loose. In the following subsections we describe these three ingredients in detail.

3.1 Construction

In this subsection we detail the distributions P_{\lambda} and Q_{\lambda} used to construct the minimax lower bound. The main idea is to mix constructions for nonparametric regression and functional estimation, by perturbing the CATE with a bump at the point x_{0}, and to also use a mixture of perturbations of the propensity score and regression functions \pi and \mu_{0}, locally near x_{0}.

For our lower bound results, we work in the setting where Y is binary; this is mostly to ease notation and calculations. Note however that this still yields a valid lower bound in the general continuous Y case, since a lower bound in the strict submodel where Y is binary is also a lower bound across the larger model \mathcal{P}. Importantly, when Y is binary, the density p of an observation Z can be indexed via either the quadruple (f,\pi,\mu_{0},\mu_{1}) or (f,\pi,\mu_{0},\tau); here we make use of the latter parametrization (and in the appendix we consider the (f,\pi,\eta,\tau) parametrization). We first give the construction for the \alpha\geq\beta case in the definition below, and then go on to discuss details (and in Appendix A we give constructions for all other regimes).

Definition 1 (Distributions P_{\lambda} and Q_{\lambda}).

Let:

  1. B:\mathbb{R}^{d}\rightarrow\mathbb{R} denote a C^{\infty} function with B(x)=1 for x\in[-1/4,1/4]^{d}, and B(x)=0 for x\notin[-1/2,1/2]^{d},

  2. \mathcal{C}_{h}(x_{0}) denote the cube centered at x_{0}\in(0,1)^{d} with sides of length h\leq 1/4,

  3. (\mathcal{X}_{1},\dots,\mathcal{X}_{k}) denote a partition of \mathcal{C}_{h}(x_{0}) into k cubes of equal size (for k an integer raised to the power d), with midpoints (m_{1},\dots,m_{k}), so each cube \mathcal{X}_{j}=\mathcal{C}_{h/k^{1/d}}(m_{j}) has side length h/k^{1/d}.

Then for \lambda_{j}\in\{-1,1\} define the functions

\tau_{h}(x) = h^{\gamma}B\left(\frac{x-x_{0}}{2h}\right)
\mu_{0\lambda}(x) = \frac{1}{2}+\left(h/k^{1/d}\right)^{\beta}\sum_{j=1}^{k}\lambda_{j}B\left(\frac{x-m_{j}}{h/k^{1/d}}\right)
\pi_{\lambda}(x) = \frac{1}{2}+\left(h/k^{1/d}\right)^{\alpha}\sum_{j=1}^{k}\lambda_{j}B\left(\frac{x-m_{j}}{h/k^{1/d}}\right)
f(x) = \mathbbold{1}(x\in\mathcal{S}_{hk})\Big{/}\left\{1-\left(\frac{4^{d}-1}{2^{d}}\right)h^{d}\right\}

where \mathcal{S}_{hk}=\left\{\bigcup_{j=1}^{k}\mathcal{C}_{h/2k^{1/d}}(m_{j})\right\}\bigcup\left\{[0,1]^{d}\setminus\mathcal{C}_{2h}(x_{0})\right\}. Finally let the distributions P_{\lambda} and Q_{\lambda} be defined via the densities

p_{\lambda} = (f,1/2,\mu_{0\lambda}-\tau_{h}/2,\tau_{h})
q_{\lambda} = (f,\pi_{\lambda},\mu_{0\lambda},0).

Remark 4.

Under the (f,\pi,\eta,\tau) parametrization, since \eta=\pi\tau+\mu_{0}, the above densities can equivalently be written as p_{\lambda}=(f,1/2,\mu_{0\lambda},\tau_{h}) and q_{\lambda}=(f,\pi_{\lambda},\mu_{0\lambda},0).

Figure 1 shows an illustration of our construction in the d=1 case. As mentioned above, the CATE is perturbed with a bump at x_{0} and the nuisance functions \pi and \mu_{0} with bumps locally near x_{0}. The regression function \mu_{0} is perturbed under both P_{\lambda} and Q_{\lambda}, since it is less smooth than the propensity score in the \alpha\geq\beta case. The choices of the CATE mimic those in the two-point proof of the lower bound for nonparametric regression at a point (see, e.g., Section 2.5 of Tsybakov (2009)), albeit with a particular flat-top bump function, while the choices of nuisance functions \pi and \mu_{0} are more similar to those in the lower bound for functionals such as the expected conditional covariance (cf. Section 4 of Robins et al. (2009b)). In this sense our construction can be viewed as combining those for nonparametric regression and functional estimation, similar to Shen et al. (2020). In what follows we remark on some important details.

Remark 5.

Section 3.2 of Shen et al. (2020) used a similar construction for conditional variance estimation. Some important distinctions are: (i) they focused on the univariate and low smoothness setting; (ii) in that problem there is only one nuisance function, so the null can be a point rather than a mixture distribution; and (iii) they use a different, arguably more complicated, approach to bound the distance between distributions. Our work can thus be used to generalize such variance estimation results to arbitrary dimension and smoothness.

Figure 1: Minimax lower bound construction in the d=1 case. An example null density p_{\lambda} is displayed in panel (a) and an alternative density q_{\lambda} in panel (b). The black, red, and blue lines denote the CATE, marginal outcome regression \eta=\pi\tau+\mu_{0}, and propensity score functions, respectively, and the gray line denotes the support of the covariate density.

First we remark on the choice of CATE in the construction. As mentioned above, the bump construction resembles that of the standard Le Cam lower bound for nonparametric regression at a point, but differs in that we use a specialized bump function with a flat top. Crucially, this choice ensures the CATE is constant and equal to h^{\gamma} for all x in the cube \mathcal{C}_{h}(x_{0}) centered at x_{0} with sides of length h, and that it is equal to zero for all x\notin\mathcal{C}_{2h}(x_{0}), i.e., outside the cube centered at x_{0} with side length 2h. The fact that the CATE is constant across the top of the bump (which will be the only place where observations appear near x_{0}) eases Hellinger distance calculations substantially. It is straightforward to check that the CATE \tau_{h}(x) is \gamma-smooth in this construction (see page 93 of Tsybakov (2009)).

Remark 6.

One example of a bump function B satisfying the conditions above is

B(x)=\left[1+\frac{g\{(4x)^{2}-1\}}{g\{2-(4x)^{2}\}}\right]^{-1}

for g(t)=\exp(-1/t)\mathbbold{1}(t>0).
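The following short snippet (a numerical illustration, not part of the original material) evaluates this bump function and checks its flat-top property, namely that B=1 on [-1/4,1/4] and B=0 outside [-1/2,1/2]:

import numpy as np

def bump(x):
    # Flat-top bump from Remark 6: B(x) = [1 + g((4x)^2 - 1)/g(2 - (4x)^2)]^{-1},
    # with g(t) = exp(-1/t) 1(t > 0); written as g2/(g1 + g2) to avoid dividing by zero.
    x = np.asarray(x, dtype=float)
    u = (4 * x) ** 2
    g = lambda t: np.where(t > 0, np.exp(-1.0 / np.where(t > 0, t, 1.0)), 0.0)
    g1, g2 = g(u - 1.0), g(2.0 - u)
    return np.where(g2 > 0, g2 / (g1 + g2), 0.0)

xs = np.linspace(-1, 1, 2001)
print(np.all(bump(xs[np.abs(xs) <= 0.25]) == 1.0))   # flat top: B = 1 on [-1/4, 1/4]
print(np.all(bump(xs[np.abs(xs) >= 0.5]) == 0.0))    # support:  B = 0 outside [-1/2, 1/2]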

For the propensity score and regression functions, we similarly have

B\left(\frac{x-m_{j}}{h/k^{1/d}}\right)=\begin{cases}1&\text{ for }x\in\mathcal{C}_{h/2k^{1/d}}(m_{j})\\ 0&\text{ for }x\notin\mathcal{C}_{h/k^{1/d}}(m_{j})\end{cases}

i.e., each bump equals one on the half-width cube \mathcal{C}_{h/2k^{1/d}}(m_{j}) around m_{j}, and is identically zero outside the larger cube \mathcal{C}_{h/k^{1/d}}(m_{j}) around m_{j}. It is again straightforward to check that \pi_{\lambda}(x) and \mu_{0\lambda}(x) are \alpha- and \beta-smooth, respectively.

The covariate density is chosen to be uniform, but on the set \mathcal{S}_{hk} that captures the middle of all the nuisance bumps \left\{\bigcup_{j=1}^{k}\mathcal{C}_{h/2k^{1/d}}(m_{j})\right\}, together with the space \left\{[0,1]^{d}\setminus\mathcal{C}_{2h}(x_{0})\right\} away from x_{0}. Importantly, this choice ensures there is only mass where the nuisance bumps B\left(\frac{x-m_{j}}{h/k^{1/d}}\right) are constant and non-zero (and where \tau_{h}(x)=h^{\gamma}), or else far away from x_{0}, where the densities are the same under P_{\lambda} and Q_{\lambda}. Note that, as h\rightarrow 0, the Lebesgue measure of the set \mathcal{S}_{hk} tends to one, and the covariate density tends towards a standard uniform distribution.
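The normalizing constant in the definition of f above comes from a direct volume calculation (recorded here for convenience): the k small cubes have total Lebesgue measure k(h/2k^{1/d})^{d}=(h/2)^{d}, while removing \mathcal{C}_{2h}(x_{0}) deletes measure (2h)^{d}, so

\mathrm{Leb}(\mathcal{S}_{hk}) = 1-(2h)^{d}+(h/2)^{d} = 1-\left(\frac{4^{d}-1}{2^{d}}\right)h^{d},

which is exactly the denominator making f a probability density, and which tends to one as h\rightarrow 0.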

3.2 Hellinger Distance

As mentioned previously, deriving a tight minimax lower bound requires carefully balancing the distance between distributions in our construction. To this end, in this subsection we bound the Hellinger distance between the n-fold product mixtures \int P_{\lambda}^{n}\ d\varpi(\lambda) and \int Q_{\lambda}^{n}\ d\varpi(\lambda), for \varpi a uniform prior distribution, so that (\lambda_{1},\dots,\lambda_{k}) are iid Rademachers.

In general these product densities can be complicated, making direct distance calculations difficult. However, Theorem 2.1 from Robins et al. (2009b) can be used to relate the distance between the n-fold products to those of simpler posteriors over a single observation. In the following lemma we adapt this result to localized constructions like those in Definition 1.

Lemma 2.

Let P_{\lambda} and Q_{\lambda} denote distributions indexed by a vector \lambda=(\lambda_{1},\dots,\lambda_{k}), and let \mathcal{Z}=\cup_{j=1}^{k}\mathcal{Z}_{j} denote a partition of the sample space. Assume:

  1. P_{\lambda}(\mathcal{Z}_{j})=Q_{\lambda}(\mathcal{Z}_{j})=p_{j} for all \lambda, and

  2. the conditional distributions \mathbbold{1}_{\mathcal{Z}_{j}}dP_{\lambda}/p_{j} and \mathbbold{1}_{\mathcal{Z}_{j}}dQ_{\lambda}/p_{j} (given an observation is in \mathcal{Z}_{j}) do not depend on \lambda_{\ell} for \ell\neq j, and only differ on partitions j\in S\subseteq\{1,\dots,k\}.

For a prior distribution \varpi over \lambda, let \overline{p}=\int p_{\lambda}\ d\varpi(\lambda) and \overline{q}=\int q_{\lambda}\ d\varpi(\lambda), and define

\delta_{1} = \max_{j\in S}\sup_{\lambda}\int_{\mathcal{Z}_{j}}\frac{(p_{\lambda}-\overline{p})^{2}}{p_{\lambda}p_{j}}\ d\nu
\delta_{2} = \max_{j\in S}\sup_{\lambda}\int_{\mathcal{Z}_{j}}\frac{(q_{\lambda}-p_{\lambda})^{2}}{p_{\lambda}p_{j}}\ d\nu
\delta_{3} = \max_{j\in S}\sup_{\lambda}\int_{\mathcal{Z}_{j}}\frac{(\overline{q}-\overline{p})^{2}}{p_{\lambda}p_{j}}\ d\nu

for a dominating measure \nu. If \overline{p}/p_{\lambda}\leq b<\infty and np_{j}\max(1,\delta_{1},\delta_{2})\leq b for all j, then H^{2}\left(\int P_{\lambda}^{n}\ d\varpi(\lambda),\int Q_{\lambda}^{n}\ d\varpi(\lambda)\right) is bounded above by

Cn\left(\sum_{j\in S}p_{j}\right)\left\{n\left(\max_{j\in S}p_{j}\right)\Big{(}\delta_{1}\delta_{2}+\delta_{2}^{2}\Big{)}+\delta_{3}\right\}

for a constant C only depending on b.


In the next proposition, we bound the quantities from Lemma 2 and put the results together to obtain a bound on the desired Hellinger distance between product mixtures.

Proposition 1.

Assume h\leq 1/4 and h^{\gamma}+2(h/k^{1/d})^{\beta}\leq 1-4\varepsilon for some \varepsilon\in(0,1/4), and take h^{\gamma}=4(h/k^{1/d})^{2s} for s\equiv(\alpha+\beta)/2. Then for the distributions P_{\lambda} and Q_{\lambda} from Definition 1, with \varpi the uniform distribution over \{-1,1\}^{k}, the quantities \delta_{j} from Lemma 2 satisfy

\delta_{1}\leq\left(\frac{2^{d+1}\|B\|_{2}^{2}}{\varepsilon}\right)\left(h/k^{1/d}\right)^{2\beta},\ \ \ \delta_{2}\leq\left(\frac{2^{d+1}\|B\|_{2}^{2}}{\varepsilon}\right)\left(h/k^{1/d}\right)^{2\alpha},\ \ \ \delta_{3}=0,

and p_{j}\leq 2(h/2)^{d}/k. Further H^{2}\left(\int P_{\lambda}^{n}\ d\varpi(\lambda),\int Q_{\lambda}^{n}\ d\varpi(\lambda)\right) is bounded above by

4C\left(\frac{2\|B\|_{2}^{2}}{\varepsilon}\right)^{2}\left(\frac{n^{2}h^{2d}}{k}\right)\left\{\left(h/k^{1/d}\right)^{4s}+\left(h/k^{1/d}\right)^{4\alpha}\right\}

for C a constant only depending on \varepsilon.


Before moving to the proof of Proposition 1, we briefly discuss and give some remarks. Compared to the Hellinger distance arising in the average treatment effect or expected conditional covariance lower bounds (Robins et al., 2009b), there is an extra h^{d} factor in the numerator. Of course, one cannot simply repeat those calculations with k/h^{d} bins, since then for example the k^{-4s/d} term would also be inflated to (k/h^{d})^{-4s/d}; our carefully localized construction is crucial to obtain the right rate in this case. We also note that the choice h^{\gamma}=4(h/k^{1/d})^{2s} is required for ensuring that the averaged densities \overline{p}(z) and \overline{q}(z) are equal (implying that \delta_{3}=0); specifically this equalizes the CATE bump under P_{\lambda} with the squared nuisance bumps under Q_{\lambda}.

Proof.

Here the relevant partition of the sample space \mathcal{X}\times\mathcal{A}\times\mathcal{Y}=[0,1]^{d}\times\{0,1\}\times\{0,1\} is \mathcal{Z}_{j}=\mathcal{C}_{h/2k^{1/d}}(m_{j})\times\{0,1\}\times\{0,1\}, j=1,\dots,k, along with \mathcal{Z}_{j}^{\prime}, which partitions the space [0,1]^{d}\setminus\mathcal{C}_{2h}(x_{0}) away from x_{0} into disjoint cubes with side lengths h/2k^{1/d}. Thus we have

P_{\lambda}(\mathcal{Z}_{j})=P_{\lambda}(\mathcal{Z}_{j}^{\prime})=Q_{\lambda}(\mathcal{Z}_{j})=Q_{\lambda}(\mathcal{Z}_{j}^{\prime})=p_{j}

where p_{j}=\int\mathbbold{1}\{x\in\mathcal{C}_{h/2k^{1/d}}(m_{j})\}f(x)\ dx. In Appendix Section B.1 we show that (h/2)^{d}/k\leq p_{j}\leq 2(h/2)^{d}/k when h\leq 1/4, and so p_{j} is proportional to the volume of a cube with side lengths h/2k^{1/d}. Further the conditional distributions \mathbbold{1}_{\mathcal{Z}_{j}}dP_{\lambda}/p_{j} and \mathbbold{1}_{\mathcal{Z}_{j}}dQ_{\lambda}/p_{j} do not depend on \lambda_{\ell} for \ell\neq j, since \lambda_{j} only changes the density in \mathcal{Z}_{j}. In what follows we focus on the partitions \mathcal{Z}_{j} and not \mathcal{Z}_{j}^{\prime}, since the distributions are exactly equal on the latter (in the language of Lemma 2, the set S indexes only the partitions \mathcal{Z}_{j}, and note that for this set we have \sum_{j\in S}p_{j}=kp_{j}\leq 2(h/2)^{d}). Note when (\lambda_{1},\dots,\lambda_{k}) are iid Rademacher random variables the marginalized densities from Lemma 2 are

\overline{p}(z) = f(x)\left\{\frac{1}{4}+(2a-1)(2y-1)\frac{h^{\gamma}}{4}B\left(\frac{x-x_{0}}{2h}\right)\right\}
\overline{q}(z) = f(x)\left\{\frac{1}{4}+(2a-1)(2y-1)\left(h/k^{1/d}\right)^{2s}\sum_{j=1}^{k}B\left(\frac{x-m_{j}}{h/k^{1/d}}\right)^{2}\right\}.

The first step is to show that relevant densities and density ratios are appropriately bounded. We give these details in Appendix Section B.1. Next it remains to bound the quantities \delta_{1}, \delta_{2}, and \delta_{3}.

We begin with \delta_{3}, the distance between the marginalized densities \overline{p} and \overline{q}, which is tackled somewhat differently from \delta_{1} and \delta_{2}. Because we take (h/k^{1/d})^{2s}=h^{\gamma}/4 it follows that \overline{q}(z)-\overline{p}(z) equals

(2a-1)(2y-1)f(x)\left\{\left(h/k^{1/d}\right)^{2s}\sum_{j=1}^{k}B\left(\frac{x-m_{j}}{h/k^{1/d}}\right)^{2}-\frac{h^{\gamma}}{4}B\left(\frac{x-x_{0}}{2h}\right)\right\}=0,   (2)

since f(x)=0 for x\notin\mathcal{S}_{hk} and

B\left(\frac{x-m_{j}}{h/k^{1/d}}\right)=B\left(\frac{x-x_{0}}{2h}\right)=0 \ \text{ for } x\in\left\{[0,1]^{d}\setminus\mathcal{C}_{2h}(x_{0})\right\}
B\left(\frac{x-m_{j}}{h/k^{1/d}}\right)=B\left(\frac{x-x_{0}}{2h}\right)=1 \ \text{ for } x\in\bigcup_{j=1}^{k}\mathcal{C}_{h/2k^{1/d}}(m_{j}).

We note that this result requires a carefully selected relationship between h and k, which guarantees that the squared nuisance bumps under Q_{\lambda} equal the CATE bumps under P_{\lambda}. This also exploits the flat-top bump functions we use, together with a covariate density that only puts mass at these tops, so that the squared terms are constant and no observations occur elsewhere where the bumps are not equal. Without these choices of bump function and covariate density, the expression in (2) would only be bounded by h^{\gamma}, and so \delta_{3} would only be bounded by h^{2\gamma}; in that case, the \delta_{3} term would dominate the Hellinger bound in Lemma 2, and the resulting minimax lower bound would reduce to the oracle rate n^{-1/(2+d/\gamma)}, which is uninformative in the low-smoothness regimes we are considering.
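The identity \overline{p}=\overline{q} on the support of f can also be verified numerically. The snippet below is a sketch for the d=1 construction of Definition 1, with illustrative smoothness values chosen so that all probabilities lie in (0,1); the common covariate-density factor f(x) cancels and is omitted. It enumerates all \lambda\in\{-1,1\}^{k} for a small k and compares the \lambda-averaged densities at the bump tops.

import itertools
import numpy as np

def bump(x):
    # Flat-top bump from Remark 6 (equals 1 on [-1/4, 1/4], 0 outside [-1/2, 1/2]).
    x = np.asarray(x, dtype=float)
    u = (4 * x) ** 2
    g = lambda t: np.where(t > 0, np.exp(-1.0 / np.where(t > 0, t, 1.0)), 0.0)
    g1, g2 = g(u - 1.0), g(2.0 - u)
    return np.where(g2 > 0, g2 / (g1 + g2), 0.0)

x0, h, k = 0.5, 0.2, 2                      # d = 1, h <= 1/4, k bins
alpha, beta = 2.0, 1.0                      # illustrative smoothness values
s = (alpha + beta) / 2
w = h / k                                   # bin width h / k^{1/d}
mids = x0 - h / 2 + (np.arange(k) + 0.5) * w
cate_height = 4 * w ** (2 * s)              # plays the role of h^gamma = 4 (h/k^{1/d})^{2s}

def dens(x, a, y, pi, mu0, tau):
    # Density of (A, Y) given X = x for binary Y, under the (pi, mu0, tau) parametrization.
    mu_a = mu0 + a * tau
    return pi ** a * (1 - pi) ** (1 - a) * mu_a ** y * (1 - mu_a) ** (1 - y)

xs = mids                                   # bump tops, where the covariate density puts local mass
p_bar = np.zeros((len(xs), 2, 2))
q_bar = np.zeros((len(xs), 2, 2))
for lam in itertools.product([-1, 1], repeat=k):
    bumps = sum(l * bump((xs - m) / w) for l, m in zip(lam, mids))
    mu0_lam = 0.5 + w ** beta * bumps
    pi_lam = 0.5 + w ** alpha * bumps
    tau_h = cate_height * bump((xs - x0) / (2 * h))
    for a in (0, 1):
        for y in (0, 1):
            p_bar[:, a, y] += dens(xs, a, y, 0.5, mu0_lam - tau_h / 2, tau_h)   # p_lambda
            q_bar[:, a, y] += dens(xs, a, y, pi_lam, mu0_lam, 0.0)              # q_lambda
p_bar /= 2 ** k
q_bar /= 2 ** k
print("max |p_bar - q_bar| at the bump tops:", np.abs(p_bar - q_bar).max())     # ~ 0 up to floating point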

Now we move to the distance \delta_{1}, which does not end up depending on h and is somewhat easier to handle. For it we have

\delta_{1} \leq\left(\frac{2^{d}k}{h^{d}}\right)\max_{\ell}\sup_{\lambda}\int_{\mathcal{X}_{\ell}}\sum_{a,y}\frac{f(x)^{2}}{4p_{\lambda}(z)}\left(h/k^{1/d}\right)^{2\beta}\sum_{j=1}^{k}B\left(\frac{x-m_{j}}{h/k^{1/d}}\right)^{2}\ dx
\leq\left(\frac{2^{d}k}{h^{d}}\right)\left(\frac{2}{\varepsilon}\right)\left(h/k^{1/d}\right)^{2\beta}\max_{\ell}\int_{\mathcal{X}_{\ell}}\sum_{j=1}^{k}B\left(\frac{x-m_{j}}{h/k^{1/d}}\right)^{2}\ dx
\leq\left(\frac{2^{d}\|B\|_{2}^{2}}{\varepsilon/2}\right)\left(h/k^{1/d}\right)^{2\beta}

where the first line follows by definition, since p_{\ell}\geq(h/2)^{d}/k, and since B\left(\frac{x-m_{j}}{h/k^{1/d}}\right)=0 outside of the cube \mathcal{C}_{h/k^{1/d}}(m_{j}), which implies that

\left\{\sum_{j}\lambda_{j}B\left(\frac{x-m_{j}}{h/k^{1/d}}\right)\right\}^{2}=\sum_{j}\lambda_{j}^{2}B\left(\frac{x-m_{j}}{h/k^{1/d}}\right)^{2},

the inequality in the second line since p_{\lambda}(z)/f(x)\geq\varepsilon and f(x)\leq 2 as in (23) and (22), and the last inequality since

\int_{\mathcal{X}_{\ell}}\sum_{j=1}^{k}B\left(\frac{x-m_{j}}{h/k^{1/d}}\right)^{2}\ dx = \int_{\mathcal{X}_{\ell}}B\left(\frac{x-m_{\ell}}{h/k^{1/d}}\right)^{2}\ dx\leq\frac{h^{d}}{k}\int B(u)^{2}\ du

by a change of variables.

For \delta_{2} we use a mix of the above logic for \delta_{3} and \delta_{1}. Note that ({q}_{\lambda}-p_{\lambda})^{2} equals

f(x)^{2}\Bigg{[}(a-1/2)\left(h/k^{1/d}\right)^{\alpha}\sum_{j=1}^{k}\lambda_{j}B\left(\frac{x-m_{j}}{h/k^{1/d}}\right)+(2a-1)(2y-1)\left\{\left(h/k^{1/d}\right)^{2s}\sum_{j=1}^{k}B\left(\frac{x-m_{j}}{h/k^{1/d}}\right)^{2}-\frac{h^{\gamma}}{4}B\left(\frac{x-x_{0}}{2h}\right)\right\}\Bigg{]}^{2}
\leq(1/2)f(x)^{2}\left(h/k^{1/d}\right)^{2\alpha}\sum_{j=1}^{k}B\left(\frac{x-m_{j}}{h/k^{1/d}}\right)^{2}

where we used the fact that (a+b)^{2}\leq 2(a^{2}+b^{2}) and \left\{\sum_{j}\lambda_{j}B\left(\frac{x-m_{j}}{h/k^{1/d}}\right)\right\}^{2}=\sum_{j}B\left(\frac{x-m_{j}}{h/k^{1/d}}\right)^{2}, along with the same logic as above with \delta_{3} (ensuring the second term in the square equals zero). Now we have

\delta_{2} \leq\left(\frac{2^{d}k}{h^{d}}\right)\left(\frac{2}{\varepsilon}\right)\left(h/k^{1/d}\right)^{2\alpha}\max_{\ell}\int_{\mathcal{X}_{\ell}}\sum_{j=1}^{k}B\left(\frac{x-m_{j}}{h/k^{1/d}}\right)^{2}\ dx\leq\left(\frac{2^{d}\|B\|_{2}^{2}}{\varepsilon/2}\right)\left(h/k^{1/d}\right)^{2\alpha}

using the exact same logic as for \delta_{1}. Plugging these bounds on (\delta_{1},\delta_{2},\delta_{3}) into Lemma 2, together with the fact that p_{j}\leq 2(h/2)^{d}/k and \sum_{j\in S}p_{j}\leq 2(h/2)^{d}, yields the result. ∎


3.3 Choice of Parameters & Final Rate

Finally we detail how the parameters h and k can be chosen to ensure the Hellinger distance from Proposition 1 remains bounded, and use the result to finalize the proof of Theorem 1.

Proposition 2.

Let

\frac{h}{k^{1/d}}=\left(\frac{h^{\gamma}}{4}\right)^{1/2s}=\left(\frac{1}{C^{*}n^{2}}\right)^{\frac{1}{d+4s+2sd/\gamma}}

for C^{*}=2^{2d/\gamma+5}C(\|B\|_{2}^{2}/\varepsilon)^{2} and C the constant from Proposition 1. Then under the assumptions of Proposition 1 we have

H^{2}\left(\int P_{\lambda}^{n}\ d\varpi(\lambda),\int Q_{\lambda}^{n}\ d\varpi(\lambda)\right)\ \leq\ 1

and h^{\gamma}=4(\sqrt{C^{*}}n)^{-1/\left(1+\frac{d}{2\gamma}+\frac{d}{4s}\right)}.
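As a concrete numerical illustration (a sketch, not taken from the paper; the constant C^{*} is set to one since it only affects constants), the following snippet solves these equations for h and k at given (\alpha,\beta,\gamma,d,n) and reports the implied separation h^{\gamma}:

def lower_bound_tuning(alpha, beta, gamma, d, n, C_star=1.0):
    # Tuning from Proposition 2: h / k^{1/d} = (h^gamma / 4)^{1/(2s)} = (C* n^2)^{-1/(d + 4s + 2sd/gamma)}.
    s = (alpha + beta) / 2
    ratio = (1.0 / (C_star * n ** 2)) ** (1.0 / (d + 4 * s + 2 * s * d / gamma))  # h / k^{1/d}
    h = (4 * ratio ** (2 * s)) ** (1.0 / gamma)   # from h^gamma = 4 (h/k^{1/d})^{2s}
    k = (h / ratio) ** d                          # number of bins (before rounding to an integer^d)
    return h, k, h ** gamma                       # separation h^gamma at x0

# Low-smoothness example: d = 2, s = 0.25 < (d/4)/(1 + d/(2*gamma)) = 1/3, n large enough that h <= 1/4.
h, k, sep = lower_bound_tuning(alpha=0.25, beta=0.25, gamma=2.0, d=2, n=10**7)
print(h, k, sep)                                  # h ~ 0.2, k ~ 4e6 bins, h^gamma ~ 0.04
print(4 * (10**7) ** (-1 / (1 + 2 / (2 * 2.0) + 2 / (4 * 0.25))))   # matches 4 n^{-1/(1 + d/2gamma + d/4s)}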


The proof of Proposition 2 follows directly from Proposition 1, after plugging in the selected values of h and k. Importantly, it (together with the alternative construction for the \beta>\alpha case given in Appendix A) also settles the proof of Theorem 1 via Lemma 1. This follows since, with the proposed choices of h and k, the Hellinger distance is appropriately bounded so that the term (1-\sqrt{\alpha(1-\alpha/4)})/2=(1-\sqrt{3/4})/2\approx 0.067 in Lemma 1 is a constant (greater than 1/20, for example), while the separation in the CATE at x_{0}, which equals h^{\gamma}, is proportional to n^{-1/(1+d/2\gamma+d/4s)} under all P_{\lambda} and Q_{\lambda}. Therefore this separation is indeed the minimax rate in the low smoothness regime where s<(d/4)/(1+d/2\gamma). Note again that, as discussed in Remark 3, when s>(d/4)/(1+d/2\gamma) the rate n^{-1/(1+d/2\gamma+d/4s)} is faster than the usual nonparametric regression rate n^{-1/(2+d/\gamma)}, and so the standard lower bound construction as in Section 2.5 of Tsybakov (2009) indicates that the slower rate n^{-1/(2+d/\gamma)} is the tighter lower bound in that regime.

Figure 2 illustrates the minimax rate from Theorem 1, as a function of the average nuisance smoothness s/d (scaled by dimension) and the CATE smoothness scaled by dimension, \gamma/d. A number of important features about the rate are worth highlighting.

Figure 2: The minimax rate for CATE estimation, as a function of average nuisance smoothness s and CATE smoothness \gamma, each scaled by covariate dimension d. The black dotted line denotes a threshold on the nuisance smoothness s/d, below which the oracle nonparametric regression rate n^{-1/(2+d/\gamma)} is unachievable (the “elbow” phenomenon).

First, of course, the rate never slows with higher nuisance smoothness s/d, for any CATE smoothness \gamma/d, and vice versa. However, there is an important elbow phenomenon, akin to that found in functional estimation problems (Bickel and Ritov, 1988; Birgé and Massart, 1995; Tsybakov, 2009; Robins et al., 2009b). In particular, the minimax lower bound shows that when the average nuisance smoothness is low enough that s<\frac{d/4}{1+d/2\gamma}, the oracle rate n^{-1/(2+d/\gamma)} (which could be achieved if one actually observed the potential outcomes) is in fact unachievable. This verifies a conjecture in Kennedy (2023).

Notably, though, the elbow phenomenon we find in the problem of CATE estimation differs quite substantially from that for classic pathwise differentiable functionals. For the latter, the rate is parametric (i.e., n^{-1/2}) above some threshold, and nonparametric (n^{-1/(1+d/4s)}) below. In contrast, in our setting the rate matches that of nonparametric regression above the threshold, and otherwise is a combination of nonparametric regression and functional estimation rates. Thus in this problem there are many elbows, with the threshold depending on the CATE smoothness \gamma. In particular, our minimax rate below the threshold,

n^{-1/\left(1+\frac{d}{2\gamma}+\frac{d}{4s}\right)},

is a mixture of the nonparametric regression rate n^{-1/(1+d/2\gamma)} (on the squared scale) and the classic functional estimation rate n^{-1/(1+d/4s)}. This means, for example, that in regimes where the CATE is very smooth, e.g., \gamma\rightarrow\infty, the CATE estimation problem begins to resemble that of pathwise-differentiable functional estimation, where the elbow occurs at s>d/4, with rates approaching the parametric rate n^{-1/2} above, and the functional estimation rate n^{-1/(1+d/4s)} below. At the other extreme, where the CATE does not have any extra smoothness, so that \gamma\rightarrow\beta (note we must have \gamma\geq\beta), the elbow threshold condition holds for any \alpha\geq 0. Thus, at this other extreme, there is no elbow phenomenon, and the CATE estimation problem resembles that of smooth nonparametric regression, with optimal rate n^{-1/(2+d/\beta)}. For the arguably more realistic setting, where the CATE smoothness \gamma may take intermediate values between \beta and \infty, the minimax rate is a mixture, interpolating between the two extremes. All of this quantifies the sense in which the CATE can be viewed as a regression/functional hybrid.
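As a quick numerical illustration of this interpolation (a helper sketch, not part of the paper), the exponent r in the minimax rate n^{-r} of Theorem 1 can be evaluated directly:

def cate_rate_exponent(s, gamma, d):
    # Exponent r such that the minimax rate of Theorem 1 is n^{-r}.
    if s < (d / 4) / (1 + d / (2 * gamma)):        # low-smoothness regime (below the elbow threshold)
        return 1 / (1 + d / (2 * gamma) + d / (4 * s))
    return 1 / (2 + d / gamma)                     # oracle nonparametric regression regime

# Fix d = 5 and nuisance smoothness s = 1; increase the CATE smoothness gamma.
for gamma in [1.0, 2.0, 5.0, 50.0]:
    print(gamma, round(cate_rate_exponent(s=1.0, gamma=gamma, d=5), 3))
# gamma = 1, 2, 5 fall in the oracle regime (r = 1/(2 + d/gamma) = 0.143, 0.222, 0.333); at gamma = 50
# the threshold is crossed, and r = 0.435 falls short of the oracle exponent 1/(2 + d/gamma) ~ 0.476,
# illustrating the elbow.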

It is also worth mentioning that no estimator previously proposed in the literature (that we know of) attains the minimax rate in Theorem 1 in full generality. Some estimators have been shown to attain the oracle rate n^{-1/(2+d/\gamma)}, but only under stronger assumptions than the minimal condition we find here, i.e., that s>\frac{d/4}{1+d/2\gamma}. One exception is the undersmoothed R-Learner estimator analyzed in Kennedy (2023), which did achieve the rate n^{-1/(2+d/\gamma)} whenever s>(d/4)/(1+d/2\gamma), under some conditions (including that \alpha\geq\beta). However, in the low-smoothness regime where s<(d/4)/(1+d/2\gamma), that estimator’s rate was n^{-2s/d}, which is slower than the minimax rate we find here. This motivates our work in the following section, where we propose and analyze a new estimator, whose error matches the minimax rate in much greater generality (under some conditions, e.g., on how well the covariate density is estimated).

Remark 7.

A slightly modified version of our construction also reveals that, when the CATE \tau(x)=\tau is constant, the classic functional estimation rate n^{-1/\left(1+\frac{d}{4s}\right)} acts as a minimax lower bound. To the best of our knowledge, this result has not been noted elsewhere.

4 Attainability

In this section we show that the minimax lower bound of Theorem 1 is actually attainable, via a new local polynomial version of the R-Learner (Nie and Wager, 2021; Kennedy, 2023), based on an adaptation of higher-order influence functions (Robins et al., 2008, 2009a, 2017).

4.1 Proposed Estimator & Decomposition

In this subsection we first describe our proposed estimator, and then give a preliminary error bound, which motivates the specific bias and variance calculations in following subsections. In short, the estimator is a higher-order influence function-based version of the local polynomial R-learner analyzed in Kennedy (2023). At its core, the R-Learner essentially regresses outcome residuals on treatment residuals to estimate a weighted average of the CATE. Early versions for a constant or otherwise parametric CATE were studied by Chamberlain (1987); Robinson (1988), and Robins (1994), with more flexible series, RKHS, and lasso versions studied more recently by Robins et al. (2008), Nie and Wager (2021), and Chernozhukov et al. (2017), respectively. None of this previous work obtained the minimax optimal rates in Theorem 2.

Definition 2 (Higher-Order Local Polynomial R-Learner).

Let K_{h}(x)=\frac{1}{h^{d}}\mathbbold{1}(\|x-x_{0}\|\leq h/2). For each covariate x_{j}, j=1,\dots,d, define \rho(x_{j})=\{\rho_{0}(x_{j}),\rho_{1}(x_{j}),\dots,\rho_{\lfloor\gamma\rfloor}(x_{j})\}^{\mathrm{\scriptscriptstyle T}} as the first (\lfloor\gamma\rfloor+1) terms of the Legendre polynomial series (shifted to be orthonormal on [0,1]),

\rho_{m}(x_{j})=\sum_{\ell=0}^{m}(-1)^{\ell+m}\sqrt{2m+1}{m\choose\ell}{m+\ell\choose\ell}x_{j}^{\ell}.

Define \rho(x) to be the corresponding tensor product of all interactions of \rho(x_{1}),\dots,\rho(x_{d}) up to order \lfloor\gamma\rfloor, which has length q={{d+\lfloor\gamma\rfloor}\choose{\lfloor\gamma\rfloor}} and is orthonormal on [0,1]^{d}, and finally define \rho_{h}(x)=\rho\left(1/2+(x-x_{0})/h\right). The proposed estimator is then defined as

\widehat{\tau}(x_{0})=\rho_{h}(x_{0})^{\mathrm{\scriptscriptstyle T}}\widehat{Q}^{-1}\widehat{R}   (3)

where \widehat{Q} is a q\times q matrix and \widehat{R} a q-vector given by

\widehat{Q} = \mathbb{P}_{n}\Big{\{}\rho_{h}(X)K_{h}(X)\widehat{\varphi}_{a1}(Z)\rho_{h}(X)^{\mathrm{\scriptscriptstyle T}}\Big{\}}+\mathbb{U}_{n}\Big{\{}\rho_{h}(X_{1})K_{h}(X_{1})\widehat{\varphi}_{a2}(Z_{1},Z_{2})K_{h}(X_{2})\rho_{h}(X_{1})^{\mathrm{\scriptscriptstyle T}}\Big{\}}
\widehat{R} = \mathbb{P}_{n}\Big{\{}\rho_{h}(X_{1})K_{h}(X_{1})\widehat{\varphi}_{y1}(Z_{1})\Big{\}}+\mathbb{U}_{n}\Big{\{}\rho_{h}(X_{1})K_{h}(X_{1})\widehat{\varphi}_{y2}(Z_{1},Z_{2})K_{h}(X_{2})\Big{\}},

respectively, and

\widehat{\varphi}_{a1}(Z) = A\{A-\widehat{\pi}(X)\}
\widehat{\varphi}_{y1}(Z) = \{Y-\widehat{\mu}_{0}(X)\}\{A-\widehat{\pi}(X)\}
\widehat{\varphi}_{a2}(Z_{1},Z_{2}) = -\{A_{1}-\widehat{\pi}(X_{1})\}b_{hk}(X_{1})^{\mathrm{\scriptscriptstyle T}}\widehat{\Omega}^{-1}b_{hk}(X_{2})A_{2}
\widehat{\varphi}_{y2}(Z_{1},Z_{2}) = -\{A_{1}-\widehat{\pi}(X_{1})\}b_{hk}(X_{1})^{\mathrm{\scriptscriptstyle T}}\widehat{\Omega}^{-1}b_{hk}(X_{2})\{Y_{2}-\widehat{\mu}_{0}(X_{2})\}
b_{hk}(x) = b\{1/2+(x-x_{0})/h\}\mathbbold{1}(\|x-x_{0}\|\leq h/2)
\widehat{\Omega} = \int_{v\in[0,1]^{d}}b(v)b(v)^{\mathrm{\scriptscriptstyle T}}\ d\widehat{F}(x_{0}+h(v-1/2))

for b:\mathbb{R}^{d}\mapsto\mathbb{R}^{k} a basis of dimension k. The nuisance estimators (\widehat{F},\widehat{\pi},\widehat{\mu}_{0}) are constructed from a separate training sample D^{n}, independent of that on which \mathbb{U}_{n} operates.


The estimator in Definition 2 can be viewed as a localized higher-order estimator, and depends on two main tuning parameters: the bandwidth h, which controls how locally one averages near x_{0}, and the dimension k of the basis b, which controls how bias and variance are balanced in the second-order U-statistic terms in \widehat{Q} and \widehat{R}. Specific properties of the basis b are discussed shortly, e.g., just prior to Remark 7 and in (6). We also note that while the basis b will have dimension k growing with sample size, the dimension of the basis \rho is fixed depending on the smoothness of the CATE; for the latter we use the Legendre series for concreteness, but expect other bases to work as well.
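To make this concrete, here is a minimal numerical sketch of Definition 2 in the d=1 case (an illustration under simplifying assumptions, not the authors’ implementation): the nuisance estimates \widehat{\pi} and \widehat{\mu}_{0} are passed in as plug-in functions (in practice they would be fit on the separate training sample D^{n}), \widehat{F} is taken to be uniform near x_{0} so that \widehat{\Omega}=hI_{k}, and the localized basis b is taken to be the same shifted Legendre family as \rho.

import numpy as np
from numpy.polynomial import legendre

def shifted_legendre(m, u):
    # Orthonormal Legendre polynomial of degree m on [0, 1]: sqrt(2m+1) P_m(2u - 1).
    c = np.zeros(m + 1); c[m] = 1.0
    return np.sqrt(2 * m + 1) * legendre.legval(2 * np.asarray(u) - 1, c)

def basis(u, dim):
    # Stack the first `dim` orthonormal shifted Legendre polynomials, evaluated at u.
    return np.column_stack([shifted_legendre(m, u) for m in range(dim)])

def hol_rlearner(X, A, Y, x0, h, gamma_floor, k, pi_hat, mu0_hat):
    # Sketch of tau_hat(x0) in eq. (3): first-order terms plus second-order U-statistic corrections.
    n = len(X)
    in_win = (np.abs(X - x0) <= h / 2).astype(float)
    Kh = in_win / h                                    # K_h(X_i), d = 1
    U = 0.5 + (X - x0) / h                             # local rescaling of the window to [0, 1]
    rho = basis(U, gamma_floor + 1) * in_win[:, None]  # rho_h(X_i), zero outside the window
    rho_x0 = basis([0.5], gamma_floor + 1)[0]          # rho_h(x0) = rho(1/2)
    res_a, res_y = A - pi_hat(X), Y - mu0_hat(X)

    # First-order terms: P_n{rho_h K_h phi_a1 rho_h^T} and P_n{rho_h K_h phi_y1}.
    Q1 = (rho * (Kh * A * res_a)[:, None]).T @ rho / n
    R1 = rho.T @ (Kh * res_a * res_y) / n

    # Localized basis b_hk and Omega_hat = h * I_k under a uniform F_hat near x0.
    b = basis(U, k) * in_win[:, None]
    G = (b @ b.T) / h                                  # b_hk(X_i)^T Omega_hat^{-1} b_hk(X_j)
    np.fill_diagonal(G, 0.0)                           # U-statistic excludes i = j

    # Second-order U-statistic corrections (phi_a2 and phi_y2 carry a minus sign).
    Q2 = -(rho * (Kh * res_a * (G @ (Kh * A)))[:, None]).T @ rho / (n * (n - 1))
    R2 = -rho.T @ (Kh * res_a * (G @ (Kh * res_y))) / (n * (n - 1))

    Qhat, Rhat = Q1 + Q2, R1 + R2
    return rho_x0 @ np.linalg.solve(Qhat, Rhat)

# Toy usage with oracle plug-in nuisances (purely illustrative).
rng = np.random.default_rng(0)
n = 2000
X = rng.uniform(size=n)
pi = 0.5 + 0.2 * np.sin(4 * np.pi * X)
A = rng.binomial(1, pi).astype(float)
Y = A * (1 + X) + np.sin(8 * np.pi * X) + rng.normal(size=n)   # true CATE tau(x) = 1 + x
print(hol_rlearner(X, A, Y, x0=0.5, h=0.3, gamma_floor=1, k=4,
                   pi_hat=lambda x: 0.5 + 0.2 * np.sin(4 * np.pi * x),
                   mu0_hat=lambda x: np.sin(8 * np.pi * x)))    # should be roughly tau(0.5) = 1.5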

The U-statistic terms are important for debiasing the first-order sample average terms. In addition, our proposed estimator can be viewed as estimating a locally weighted projection parameter \tau_{h}(x_{0})=\rho_{h}(x_{0})^{\mathrm{\scriptscriptstyle T}}\theta, with coefficients given by

\operatorname*{arg\,min}_{\beta}\mathbb{E}\left[K_{h}(x)\pi(x)\{1-\pi(x)\}\Big{\{}\tau(x)-\beta^{\mathrm{\scriptscriptstyle T}}\rho_{h}(x)\Big{\}}^{2}\right]=Q^{-1}R   (4)

for

Q = \int\rho_{h}(x)K_{h}(x)\pi(x)\{1-\pi(x)\}\rho_{h}(x)^{\mathrm{\scriptscriptstyle T}}\ dF(x)
R = \int\rho_{h}(x)K_{h}(x)\pi(x)\{1-\pi(x)\}\tau(x)\ dF(x).

In other words, this projection parameter \tau_{h}(x_{0}) is a K_{h}(x)\pi(x)\{1-\pi(x)\}-weighted least squares projection of the CATE \tau(x) on the scaled Legendre polynomials \rho_{h}(x). Crucially, since \rho_{h}(x) includes polynomials in x up to order \lfloor\gamma\rfloor, the projection parameter is within h^{\gamma} of the target CATE; this is formalized in the following proposition.

Proposition 3.

Let \tau_{h}(x)=\rho_{h}(x)^{\mathrm{\scriptscriptstyle T}}Q^{-1}R denote the x_{0}-specific projection parameter from (4), and assume:

  1. \tau(x) is \gamma-smooth,

  2. the eigenvalues of Q are bounded below away from zero, and

  3. \int\mathbbold{1}\{\|x-x_{0}\|\leq h/2\}\ dF(x)\lesssim h^{d}.

Then for any x with \|x-x_{0}\|\leq Ch we have

|\tau_{h}(x)-\tau(x)|\lesssim h^{\gamma}.

Proof.

This proof follows from a higher-order kernel argument (e.g., Proposition 1.13 of Tsybakov (2009), Proposition 4.1.5 of Giné and Nickl (2021)), after noting that we can treat K_{h}(x)\pi(x)\{1-\pi(x)\} itself as a kernel. A similar result was also proved in Kennedy (2023). A detailed proof is given in Appendix B.3. ∎


In Proposition S5 in the Appendix we give simple sufficient conditions under which the eigenvalues of Q are bounded. In short, this holds under standard boundedness conditions on the propensity score and covariate density.

As mentioned above, our estimator (3) can be viewed as a modified higher-order estimator. More specifically, \widehat{S}=\widehat{R}-\widehat{Q}\theta is closely related to a second-order estimator of the moment condition

\mathbb{E}\Big{[}\rho_{h}(X)K_{h}(X)\Big{\{}A-\pi(X)\Big{\}}\Big{\{}Y-\mu_{0}(X)-A\tau(X)\Big{\}}\Big{]}=R-Q\theta=0   (5)

under the assumption that \tau(x)=\rho_{h}(x)^{\mathrm{\scriptscriptstyle T}}\theta (this is not exactly true in our case, but it is enough that it is approximately true locally near x_{0}, as will be proved shortly). Indeed, letting \widehat{Q}_{1}=\mathbb{P}_{n}\{\rho_{h}(X)K_{h}(X)\widehat{\varphi}_{a1}(Z)\rho_{h}(X)^{\mathrm{\scriptscriptstyle T}}\} and \widehat{R}_{1}=\mathbb{P}_{n}\{\rho_{h}(X)K_{h}(X)\widehat{\varphi}_{y1}(Z)\} denote the first terms in \widehat{Q}=\widehat{Q}_{1}+\widehat{Q}_{2} and \widehat{R}=\widehat{R}_{1}+\widehat{R}_{2}, respectively, we see that \widehat{S}_{1}=\widehat{R}_{1}-\widehat{Q}_{1}\theta is a usual first-order influence function-based estimator of the moment condition (5). Similarly, the second terms \widehat{Q}_{2} and \widehat{R}_{2} are akin to the second-order U-statistic corrections that would be added using the higher-order influence function methodology developed by Robins et al. (2008, 2009a, 2017). However, these terms differ in two important ways, both relating to localization near x_{0}. First, the U-statistic is localized with respect to both X_{1} and X_{2}, i.e., the product K_{h}(X_{1})K_{h}(X_{2}) is included, whereas only K_{h}(X_{1}) would arise if the goal were purely to estimate the moment condition (5) for fixed h. Second, the basis functions

bhk(x)=b(1/2+xx0h)1(xx0h/2)b_{hk}(x)=b\left(1/2+\frac{x-x_{0}}{h}\right)\mathbbold{1}(\|x-x_{0}\|\leq h/2)

appearing in φ^a2\widehat{\varphi}_{a2}, φ^y2\widehat{\varphi}_{y2}, and Ω^\widehat{\Omega} are localized; they only operate on XXs near x0x_{0}, stretching them out so as to map the cube [x0h/2,x0+h/2]d[x_{0}-h/2,x_{0}+h/2]^{d} around x0x_{0} to the whole space [0,1]d[0,1]^{d} (e.g., bhk(x0h/2)=b(0)b_{hk}(x_{0}-h/2)=b(0), bhk(x0)=b(1/2)b_{hk}(x_{0})=b(1/2), etc.). This is the same localization that is used with the Legendre basis ρ(x)\rho(x). In this sense, these localized basis terms spend all their approximation power locally rather than globally away from x0x_{0}. (Specific approximating properties we require of bb will be detailed shortly, in (6)). These somewhat subtle distinctions play a crucial role in appropriately controlling bias, as will be described in more detail shortly.
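As a small illustration of this localization (our sketch, with d = 1 and a simple monomial basis standing in for b), the snippet below builds b_hk and shows that it maps the edges and center of the window [x0 − h/2, x0 + h/2] to b(0), b(1/2), and b(1), and vanishes outside the window.

```python
import numpy as np

def b(v, k):
    # simple monomial basis on [0, 1] standing in for b; the paper allows splines,
    # CDV wavelets, local polynomial partition series, etc.
    return np.vstack([v ** j for j in range(k)])

def b_hk(x, x0, h, k):
    v = 0.5 + (x - x0) / h                       # stretch the local window onto [0, 1]
    inside = np.abs(x - x0) <= h / 2
    return b(np.clip(v, 0, 1), k) * inside       # zero outside the window around x0

x0, h, k = 0.5, 0.2, 4
pts = np.array([x0 - h / 2, x0, x0 + h / 2, 0.9])
print(b_hk(pts, x0, h, k))
# columns at x0 - h/2, x0, x0 + h/2 equal b(0), b(1/2), b(1); the column at 0.9 is zero
```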

Remark 8.

Note again that, as with other higher-order estimators, the estimator (3) depends on an initial estimate of the covariate distribution FF (near x0x_{0}), through Ω^\widehat{\Omega}. Importantly, we do not take this estimator F^\widehat{F} to be the empirical distribution, in general, since then our optimal choices of the tuning parameter kk would yield Ω^\widehat{\Omega} non-invertible; this occurs with higher-order estimators of pathwise differentiable functionals as well (Mukherjee et al., 2017). As discussed in Remark 11, and in more detail shortly, we do give conditions under which the estimation error in Ω^\widehat{\Omega} or F^\widehat{F} does not impact the overall rate of τ^(x0)\widehat{\tau}(x_{0}).

Crucially, Proposition 3 allows us to focus on understanding the estimation error in τ^(x0)\widehat{\tau}(x_{0}) with respect to the projection parameter τh(x0)\tau_{h}(x_{0}), treating hγh^{\gamma} as a separate approximation bias. The next result gives a finite-sample bound on this error, showing how it is controlled by the error in estimating the components of QQ and RR.

Proposition 4.

Let S=RQ(Q1R)S=R-Q(Q^{-1}R) and S^=R^Q^(Q1R)\widehat{S}=\widehat{R}-\widehat{Q}(Q^{-1}R). The estimator (3) satisfies

|τ^(x0)τh(x0)|ρ(1/2)(Q1+Q^1Q1)S^S,\left|\widehat{\tau}(x_{0})-\tau_{h}(x_{0})\right|\ \leq\ \|\rho(1/2)\|\left(\|Q^{-1}\|+\|\widehat{Q}^{-1}-{Q}^{-1}\|\right)\left\|\widehat{S}-S\right\|,

and if Q1\|Q^{-1}\| and Q^1Q1\|\widehat{Q}^{-1}-{Q}^{-1}\| are bounded above, then

𝔼|τ^(x0)τh(x0)|\displaystyle\mathbb{E}\left|\widehat{\tau}(x_{0})-\tau_{h}(x_{0})\right|\ maxj𝔼{𝔼(S^jSjDn)2+var(S^jDn)}.\displaystyle\lesssim\ \max_{j}\sqrt{\mathbb{E}\left\{\mathbb{E}(\widehat{S}_{j}-S_{j}\mid D^{n})^{2}+\text{var}(\widehat{S}_{j}\mid D^{n})\right\}}.

for DnD^{n} a separate independent training sample on which (F^,π^,μ^0)(\widehat{F},\widehat{\pi},\widehat{\mu}_{0}) are estimated.


Thus Proposition 4 tells us that bounding the conditional bias and variance of S^=R^Q^(Q1R)\widehat{S}=\widehat{R}-\widehat{Q}(Q^{-1}R) will also yield finite-sample bounds on the error in τ^(x0)\widehat{\tau}(x_{0}), relative to the projection parameter τh(x0)\tau_{h}(x_{0}). These bias and variance bounds will be derived in the following two subsections.

4.2 Bias

In this subsection we derive bounds on the conditional bias of the estimator S^=R^Q^(Q1R)\widehat{S}=\widehat{R}-\widehat{Q}(Q^{-1}R), relative to the components of the projection parameter (4), given the training sample DnD^{n}. The main ideas behind the approach are to use localized versions of higher-order influence function arguments, along with a specialized localized basis construction, which yields smaller bias because the basis only needs to be used in a shrinking window around x0x_{0}.

Here we rely on the basis b(x)b(x) having optimal Hölder approximation properties. In particular, we assume that the approximation error of projections in the L2L_{2} norm satisfies

(IΠb)gFks/d for any s-smooth function g\|(I-\Pi_{b})g\|_{F^{*}}\ \lesssim\ k^{-s/d}\ \text{ for any $s$-smooth function $g$} (6)

where Πbg=argmin=θTb(g)2𝑑F\Pi_{b}g=\operatorname*{arg\,min}_{\ell=\theta^{\mathrm{\scriptscriptstyle T}}b}\int(g-\ell)^{2}\ dF^{*} is the usual linear projection of gg on bb, for dF(v)=dF(x0+h(v1/2))dF^{*}(v)=dF(x_{0}+h(v-1/2)) the distribution in h(x0)\mathcal{B}_{h}(x_{0}), the hh-ball around x0x_{0}, mapped to [0,1]d[0,1]^{d}. In a slight abuse of notation (to ease exposition), we omit the dependence of Πbg\Pi_{b}g on FF^{*}. The approximating condition (6) holds for numerous bases, including spline, CDV wavelet, and local polynomial partition series (and polynomial and Fourier series, up to log factors); it is used often in the literature. We refer to Belloni et al. (2015) for more discussion and specific examples (see their Condition A.3 and the subsequent discussion, for example in their Section 3.2).
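As a numerical illustration of condition (6) (ours, not from the paper; d = 1, F* uniform, a piecewise-constant partition series as the basis, and a Weierstrass-type s-Hölder function chosen so that the rate is actually attained rather than beaten), the projection error below decays at the k^{-s} rate.

```python
import numpy as np

# g is a Weierstrass-type sum, an s-Holder function whose projection error
# saturates the k^{-s} rate (a function with a single kink would decay faster).
s = 0.6
j = np.arange(1, 15)[:, None]
g = lambda x: (2.0 ** (-s * j) * np.cos(2.0 ** j * np.pi * x)).sum(axis=0)

x  = np.linspace(0, 1, 200001)
gx = g(x)

for k in [4, 16, 64, 256]:
    cell  = np.minimum((x * k).astype(int), k - 1)      # k-cell partition of [0, 1]
    means = np.bincount(cell, weights=gx, minlength=k) / np.bincount(cell, minlength=k)
    err   = np.sqrt(np.mean((gx - means[cell]) ** 2))   # L2 projection error onto the partition basis
    print(k, err, err * k ** s)                         # err * k^s stays roughly constant
```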

Proposition 5.

Assume:

  1. 1.

    λmax(Ω)\lambda_{\max}(\Omega) is bounded above,

  2. 2.

    the basis bb satisfies approximating condition (6),

  3. 3.

    π^(x)π(x)\widehat{\pi}(x)-\pi(x) is α\alpha-smooth,

  4. 4.

    μ^0(x)μ0(x)\widehat{\mu}_{0}(x)-\mu_{0}(x) is β\beta-smooth.

Then

|𝔼(S^jSjDn)|\displaystyle|\mathbb{E}(\widehat{S}_{j}-S_{j}\mid D^{n})| (h/k1/d)2s+hγ(h/k1/d)α\displaystyle\ \lesssim\ \left(h/k^{1/d}\right)^{2s}+h^{\gamma}\left(h/k^{1/d}\right)^{\alpha}
+(hγ+μ^0μ0F)(π^πFΩ^1Ω1).\displaystyle\hskip 36.135pt+\Big{(}h^{\gamma}+\|\widehat{\mu}_{0}-\mu_{0}\|_{F^{*}}\Big{)}\left(\|\widehat{\pi}-\pi\|_{F^{*}}\|\widehat{\Omega}^{-1}-\Omega^{-1}\|\right).

Before delving into the proof, we give some brief discussion. The bias consists of three terms; the first two are the main bias terms that would result even if the covariate distribution FF were known, and the third is essentially the contribution from having to estimate FF. (In Lemma S2 of the Appendix we show how the operator norm error of Ω^\widehat{\Omega} is bounded above by estimation error of the distribution FF itself). We note that the second of the main bias terms hγ(h/k1/d)αh^{\gamma}(h/k^{1/d})^{\alpha} will be of smaller order in all regimes we consider. Compared to the main bias term in a usual higher-order influence function analysis, which is k2s/dk^{-2s/d} (e.g., for the average treatment effect), our bias term is smaller; this is a result of using the localized basis bhk(x)b_{hk}(x) defined in (3), which only has to be utilized locally near x0x_{0} (this smaller bias will be partially offset by a larger variance, as discussed in the next subsection). As mentioned in Remark 11, the contribution from having to estimate FF is only a third-order term, since the estimation error of Ω^\widehat{\Omega} (in terms of operator norm) is multiplied by a product of propensity score errors (in L2(F)L_{2}(F^{*}) norm) with the sum of regression errors and smoothing bias hγh^{\gamma}, which is typically of smaller order. In Proposition 6, given after the following proof of Proposition 5, we show how the bias simplifies when FF is estimated accurately enough.

Proof.

By iterated expectation, the conditional mean of the first-order term in R^\widehat{R}, i.e., the quantity 𝔼{ρh(X1)Kh(X1)φ^y1(Z)Dn}\mathbb{E}\{\rho_{h}(X_{1})K_{h}(X_{1})\widehat{\varphi}_{y1}(Z)\mid D^{n}\} is equal to

R\displaystyle R +ρh(x)Kh(x){π(x)π^(x)}{π(x)τ(x)+μ0(x)μ^0(x)}𝑑F(x)\displaystyle+\int\rho_{h}(x)K_{h}(x)\{\pi(x)-\widehat{\pi}(x)\}\{\pi(x)\tau(x)+\mu_{0}(x)-\widehat{\mu}_{0}(x)\}\ dF(x)
=R+ρ(v){π(v)π^(v)}{π(v)τ(v)+μ0(v)μ^0(v)}𝑑F(v)\displaystyle=R+\int\rho(v)\{\pi^{*}(v)-\widehat{\pi}^{*}(v)\}\{\pi^{*}(v)\tau^{*}(v)+\mu_{0}^{*}(v)-\widehat{\mu}_{0}^{*}(v)\}\ dF^{*}(v)

where we use the change of variable v=12+xx0hv=\frac{1}{2}+\frac{x-x_{0}}{h} and again define for any function g:dg:\mathbb{R}^{d}\mapsto\mathbb{R} its corresponding stretched version as g(v)=g(x0+h(v1/2))g^{*}(v)=g(x_{0}+h(v-1/2)). To ease notation it is left implicit that any integral over vv is only over {v:v1/21/2}\{v:\|v-1/2\|\leq 1/2\}. Similarly for Q^\widehat{Q} we have that 𝔼{ρh(X1)Kh(X1)φ^a1(Z)ρh(X1)TDn}\mathbb{E}\{\rho_{h}(X_{1})K_{h}(X_{1})\widehat{\varphi}_{a1}(Z)\rho_{h}(X_{1})^{\mathrm{\scriptscriptstyle T}}\mid D^{n}\} equals

Q\displaystyle Q +ρh(x)Kh(x){π(x)π^(x)}π(x)ρh(x)T𝑑F(x)\displaystyle+\int\rho_{h}(x)K_{h}(x)\{\pi(x)-\widehat{\pi}(x)\}\pi(x)\rho_{h}(x)^{\mathrm{\scriptscriptstyle T}}\ dF(x)
=Q+ρ(v){π(v)π^(v)}π(v)ρ(v)𝑑F(v)\displaystyle=Q+\int\rho(v)\{\pi^{*}(v)-\widehat{\pi}^{*}(v)\}\pi^{*}(v)\rho(v)\ dF^{*}(v)

so that for the first-order term in S^\widehat{S} (denoted R^1Q^1θ\widehat{R}_{1}-\widehat{Q}_{1}\theta in discussion of the moment condition (5)) we have

𝔼{ρh(X1)\displaystyle\mathbb{E}\{\rho_{h}(X_{1}) Kh(X1)φ^y1(Z)Dn}𝔼{ρh(X1)Kh(X1)φ^a1(Z)ρh(X1)Dn}θ\displaystyle K_{h}(X_{1})\widehat{\varphi}_{y1}(Z)\mid D^{n}\}-\mathbb{E}\{\rho_{h}(X_{1})K_{h}(X_{1})\widehat{\varphi}_{a1}(Z)\rho_{h}(X_{1})\mid D^{n}\}\theta
=RQθ+ρ(v){π(v)π^(v)}{μ0(v)μ^0(v)}𝑑F(v)\displaystyle=R-Q\theta+\int\rho(v)\{\pi^{*}(v)-\widehat{\pi}^{*}(v)\}\{\mu_{0}^{*}(v)-\widehat{\mu}_{0}^{*}(v)\}\ dF^{*}(v)
+ρ(v)π(v){π(v)π^(v)}{τ(v)τh(v)}𝑑F(v).\displaystyle\hskip 36.135pt+\int\rho(v)\pi^{*}(v)\{\pi^{*}(v)-\widehat{\pi}^{*}(v)\}\{\tau^{*}(v)-\tau_{h}^{*}(v)\}\ dF^{*}(v). (7)

The conditional mean of the second-order influence function term in R^\widehat{R} is

𝔼{ρh(X1)Kh(X1)φ^y2(Z1,Z2)Kh(X2)Dn}\displaystyle\mathbb{E}\{\rho_{h}(X_{1})K_{h}(X_{1})\widehat{\varphi}_{y2}(Z_{1},Z_{2})K_{h}(X_{2})\mid D^{n}\} (8)
=ρ(v1){π(v1)π^(v1)}Π^b(πτ+μ0μ^0)(v1)𝑑F(v1)\displaystyle=-\int\rho(v_{1})\{\pi^{*}(v_{1})-\widehat{\pi}^{*}(v_{1})\}\widehat{\Pi}_{b}(\pi^{*}\tau^{*}+\mu_{0}^{*}-\widehat{\mu}_{0}^{*})(v_{1})\ dF^{*}(v_{1})

where we define

Πbg(u)\displaystyle\Pi_{b}g^{*}(u) =b(u)TΩ1b(v)g(v)𝑑F(v)\displaystyle=b(u)^{\mathrm{\scriptscriptstyle T}}\Omega^{-1}\int{b}(v)g^{*}(v)\ dF^{*}(v)

as the FF^{*}-weighted linear projection of gg^{*} on the basis bb, and Π^bg(u)\widehat{\Pi}_{b}g^{*}(u) as the estimated version, which simply replaces Ω\Omega with Ω^\widehat{\Omega}. Similarly for Q^θ\widehat{Q}\theta we have

𝔼{ρh(X1)Kh(X1)φ^a2(Z1,Z2)Kh(X2)ρh(X2)TθDn}\displaystyle\mathbb{E}\{\rho_{h}(X_{1})K_{h}(X_{1})\widehat{\varphi}_{a2}(Z_{1},Z_{2})K_{h}(X_{2})\rho_{h}(X_{2})^{\mathrm{\scriptscriptstyle T}}\theta\mid D^{n}\} (9)
=ρ(v1){π(v1)π^(v1)}Π^b(πτh)(v1)𝑑F(v1)\displaystyle=-\int\rho(v_{1})\{\pi^{*}(v_{1})-\widehat{\pi}^{*}(v_{1})\}\widehat{\Pi}_{b}(\pi^{*}\tau_{h}^{*})(v_{1})\ dF^{*}(v_{1})

so that the conditional mean (8) minus the conditional mean (9) equals

ρ(v1){π(v1)π^(v1)}Π^b{π(ττh)+(μ0μ^0)}(v1)𝑑F(v1).\displaystyle-\int\rho(v_{1})\{\pi^{*}(v_{1})-\widehat{\pi}^{*}(v_{1})\}\widehat{\Pi}_{b}\{\pi^{*}(\tau^{*}-\tau_{h}^{*})+(\mu_{0}^{*}-\widehat{\mu}_{0}^{*})\}(v_{1})\ dF^{*}(v_{1}). (10)

Therefore adding the first- and second-order expected values in (7) and (10), the overall bias relative to SS is

ρ(v){π(v)π^(v)}(IΠ^b){π(ττh)+(μ0μ^0)}(v)𝑑F(v)\displaystyle\int\rho(v)\{\pi^{*}(v)-\widehat{\pi}^{*}(v)\}(I-\widehat{\Pi}_{b})\{\pi^{*}(\tau^{*}-\tau_{h}^{*})+(\mu_{0}^{*}-\widehat{\mu}_{0}^{*})\}(v)\ dF^{*}(v)
=(IΠb){ρ(ππ^)}(v)(IΠb){π(ττh)+(μ0μ^0)}(v)𝑑F(v)\displaystyle=\int(I-\Pi_{b})\{\rho(\pi^{*}-\widehat{\pi}^{*})\}(v)(I-\Pi_{b})\{\pi^{*}(\tau^{*}-\tau_{h}^{*})+(\mu_{0}^{*}-\widehat{\mu}_{0}^{*})\}(v)\ dF^{*}(v) (11)
+ρ(v){π(v)π^(v)}(ΠbΠ^b){π(ττh)+(μ0μ^0)}(v)𝑑F(v)\displaystyle\hskip 14.45377pt+\int\rho(v)\{\pi^{*}(v)-\widehat{\pi}^{*}(v)\}(\Pi_{b}-\widehat{\Pi}_{b})\{\pi^{*}(\tau^{*}-\tau_{h}^{*})+(\mu_{0}^{*}-\widehat{\mu}_{0}^{*})\}(v)\ dF^{*}(v) (12)

by the orthogonality of a projection with its residuals (Lemma S1(i)).

Now we analyze the bias terms (11) and (12) separately; the first is the main bias term, which would arise even if the covariate density were known, and the second is the contribution coming from having to estimate the covariate density.

Crucially, by virtue of using the localized basis bhkb_{hk}, the projections in these bias terms are of stretched versions of the nuisance functions (ππ^)(\pi^{*}-\widehat{\pi}^{*}) and (μ0μ^0)(\mu_{0}^{*}-\widehat{\mu}_{0}^{*}), on the standard non-localized basis bb, with weights equal to the stretched density dFdF^{*}. This is important because stretching a function increases its smoothness; in particular, the stretched and scaled function g(v)/hαg^{*}(v)/h^{\alpha} is α\alpha-smooth whenever gg is α\alpha-smooth. This follows since |Dαg(v)Dαg(v)||D^{\lfloor\alpha\rfloor}g^{*}(v)-D^{\lfloor\alpha\rfloor}g^{*}(v^{\prime})| equals

|Dαg(x0+h(v1/2))Dαg(x0+h(v1/2))|\displaystyle\left|D^{\lfloor\alpha\rfloor}g(x_{0}+h(v-1/2))-D^{\lfloor\alpha\rfloor}g(x_{0}+h(v^{\prime}-1/2))\right|
=hα|g(α)(x0+h(v1/2))g(α)(x0+h(v1/2))|\displaystyle\hskip 28.90755pt=h^{\lfloor\alpha\rfloor}\left|g^{(\lfloor\alpha\rfloor)}(x_{0}+h(v-1/2))-g^{(\lfloor\alpha\rfloor)}(x_{0}+h(v^{\prime}-1/2))\right|
hα|vv|\displaystyle\hskip 28.90755pt\lesssim h^{\alpha}|v-v^{\prime}|

where the equality in the second line follows by the chain rule, and the inequality in the last line since gg is α\alpha-smooth. Thus the above implies hα|Dαg(v)Dαg(v)||vv|,h^{-\alpha}\left|D^{\lfloor\alpha\rfloor}g^{*}(v)-D^{\lfloor\alpha\rfloor}g^{*}(v^{\prime})\right|\lesssim|v-v^{\prime}|, i.e., that g(v)/hαg^{*}(v)/h^{\alpha} is α\alpha-smooth.
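As a concrete example, if g(x)=c\|x-x_{0}\|^{\alpha} for some \alpha\leq 1, then g^{*}(v)=ch^{\alpha}\|v-1/2\|^{\alpha}, so g^{*}(v)/h^{\alpha} is \alpha-smooth with exactly the same Hölder constant as g; the stretching is precisely what produces the extra factor h^{\alpha} in the approximation bound (13) below.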

Therefore if gg is α\alpha-smooth, then (IΠb)g/hαFkα/d\|(I-\Pi_{b})g^{*}/h^{\alpha}\|_{F^{*}}\lesssim k^{-\alpha/d} by the Hölder approximation properties (6) of the basis bb, and so it follows that

(IΠb)gFhαkα/d=(h/k1/d)α\|(I-\Pi_{b})g^{*}\|_{F^{*}}\ \lesssim\ h^{\alpha}k^{-\alpha/d}=\left(h/k^{1/d}\right)^{\alpha} (13)

for any α\alpha-smooth function gg.

Now consider the bias term (11). This term satisfies

[\displaystyle\int\Big{[} (IΠb){ρ(ππ^)}(v)][(IΠb){π(ττh)+(μ0μ^0)}(v)]dF(v)\displaystyle(I-\Pi_{b})\{\rho(\pi^{*}-\widehat{\pi}^{*})\}(v)\Big{]}\Big{[}(I-\Pi_{b})\{\pi^{*}(\tau^{*}-\tau_{h}^{*})+(\mu_{0}^{*}-\widehat{\mu}_{0}^{*})\}(v)\Big{]}\ dF^{*}(v)
(IΠb){ρ(ππ^)}F[(IΠb){π(ττh)}F\displaystyle\leq\|(I-\Pi_{b})\{\rho(\pi^{*}-\widehat{\pi}^{*})\}\|_{F^{*}}\Big{[}\|(I-\Pi_{b})\{\pi^{*}(\tau^{*}-\tau_{h}^{*})\}\|_{F^{*}}
+(IΠb)(μ0μ^0)F]\displaystyle\hskip 28.90755pt+\|(I-\Pi_{b})(\mu_{0}^{*}-\widehat{\mu}_{0}^{*})\|_{F^{*}}\Big{]}
(h/k1/d)α{hγ+(h/k1/d)β}=(h/k1/d)2s+hγ(h/k1/d)α\displaystyle\lesssim\ \left(h/k^{1/d}\right)^{\alpha}\Big{\{}h^{\gamma}+\left(h/k^{1/d}\right)^{\beta}\Big{\}}=\left(h/k^{1/d}\right)^{2s}+h^{\gamma}\left(h/k^{1/d}\right)^{\alpha}

where the second line follows by Cauchy-Schwarz, and the third by (13), since (ππ^)(\pi-\widehat{\pi}) and (μ0μ^0)(\mu_{0}-\widehat{\mu}_{0}) are assumed α\alpha- and β\beta-smooth, respectively (note ρ(v)\rho(v) is a polynomial, so the smoothness of ρ(ππ^)\rho(\pi^{*}-\widehat{\pi}^{*}) is the same as (ππ^)(\pi^{*}-\widehat{\pi}^{*})), along with the fact that

(IΠb)π(τ\displaystyle\|(I-\Pi_{b})\pi^{*}(\tau^{*} τh)F2π(ττh)F2\displaystyle-\tau_{h}^{*})\|_{F^{*}}^{2}\leq\|\pi^{*}(\tau^{*}-\tau_{h}^{*})\|_{F^{*}}^{2}
Kh(x){τ(x)τh(x)}2𝑑F(x)h2γ\displaystyle\leq\int K_{h}(x)\Big{\{}\tau(x)-\tau_{h}(x)\Big{\}}^{2}dF(x)\lesssim h^{2\gamma}

where the first inequality follows by Lemma S1(ii), the second by definition of FF^{*} and since π(x)1\pi(x)\leq 1, and the last by Proposition 3.

Now for the term in (12), let θb,g=Ω1bg𝑑F\theta_{b,g}=\Omega^{-1}\int bg\ dF^{*} denote the coefficients of the projection Πbg\Pi_{b}g, and note for any functions g1,g2g_{1},g_{2} we have

g1(ΠbΠ^b)(g2)𝑑F\displaystyle\int g_{1}(\Pi_{b}-\widehat{\Pi}_{b})(g_{2})\ dF^{*} =(Ω1/2θb,g1)TΩ1/2(Ω1Ω^1)Ω1/2(Ω1/2θb,g2)\displaystyle=\left(\Omega^{1/2}\theta_{b,g_{1}}\right)^{\mathrm{\scriptscriptstyle T}}\Omega^{1/2}(\Omega^{-1}-\widehat{\Omega}^{-1})\Omega^{1/2}\left(\Omega^{1/2}\theta_{b,g_{2}}\right)
g1FΩ1/2(Ω1Ω^1)Ω1/2g2F\displaystyle\leq\|g_{1}\|_{F^{*}}\|\Omega^{1/2}(\Omega^{-1}-\widehat{\Omega}^{-1})\Omega^{1/2}\|\|g_{2}\|_{F^{*}}
g1Fg2FΩΩ^1Ω1\displaystyle\leq\|g_{1}\|_{F^{*}}\|g_{2}\|_{F^{*}}\|\Omega\|\|\widehat{\Omega}^{-1}-\Omega^{-1}\|

where the first equality follows by definition, the second line since the L2L_{2} norm of the coefficients of a (weighted) projection is no more than the weighted L2(F)L_{2}(F^{*}) norm of the function itself (Lemma S1(iii)), and the last by the sub-multiplicative property of the operator norm, along with the fact that Ω1/22=Ω\|\Omega^{1/2}\|^{2}=\|\Omega\|. ∎


Several of our results require the eigenvalues of Ω\Omega to be bounded above and below away from zero. Proposition S6 in the Appendix gives simple sufficient conditions for this to hold (similar to Proposition S5 for the matrix QQ, which was mentioned earlier after Proposition 3).

The next result is a refined version of Proposition 5, giving high-level conditions under which estimation of FF itself (rather than the matrix Ω1\Omega^{-1}) does not impact the bias. We refer to Remark 11 for more detailed discussion of these conditions, and note that the result follows from Proposition 5 together with Lemma S2 in the Appendix.

Proposition 6.

Under the assumptions of Proposition 5, and if

  1. 1.

    λmin(Ω)\lambda_{\min}(\Omega) is bounded below away from zero,

  2. 2.

    dF^/dF\|d\widehat{F}^{*}/dF^{*}\|_{\infty} is bounded above and below away from zero,

  3. 3.

    (dF^/dF)1(h/k1/d)2sπ^πF(μ^0μ0F+hγ)\|(d\widehat{F}^{*}/dF^{*})-1\|_{\infty}\lesssim\frac{(h/k^{1/d})^{2s}}{\|\widehat{\pi}-\pi\|_{F^{*}}(\|\widehat{\mu}_{0}-\mu_{0}\|_{F^{*}}+h^{\gamma})},

then, when hγ(h/k1/d)βh^{\gamma}\lesssim(h/k^{1/d})^{\beta}, the bias satisfies |𝔼(S^jSjDn)|(h/k1/d)2s.|\mathbb{E}(\widehat{S}_{j}-S_{j}\mid D^{n})|\lesssim\left(h/k^{1/d}\right)^{2s}.


4.3 Variance

In this subsection we derive bounds on the conditional variance of the estimators R^j\widehat{R}_{j} and Q^j\widehat{Q}_{j\ell}, given the training sample DnD^{n}. The main tool used here is a localized version of second-order U-statistic variance arguments, recognizing that our higher-order estimator is, conditionally, a second-order U-statistic over nhdnh^{d} observations.

Proposition 7.

Assume:

  1. 1.

    y2y^{2}, π^2\widehat{\pi}^{2}, μ^02\widehat{\mu}_{0}^{2}, and μ^0μ0F\|\widehat{\mu}_{0}-\mu_{0}\|_{F^{*}} are all bounded above, and

  2. 2.

    λmax(Ω)\lambda_{\max}(\Omega) is bounded above.

Then

var(S^jDn)\displaystyle\text{var}(\widehat{S}_{j}\mid D^{n}) 1nhd(1+knhd(1+Ω^1Ω12)).\displaystyle\ \lesssim\ \frac{1}{nh^{d}}\left(1+\frac{k}{nh^{d}}\left(1+\|\widehat{\Omega}^{-1}-\Omega^{-1}\|^{2}\right)\right).

We give the proof of Proposition 7 in Appendix B.6, and so just make some comments here. First, the variance here is analogous to that of a higher-order (quadratic) influence function estimator (cf. Theorem 1 of Robins et al. (2009a)), except with sample size nn deflated to nhdnh^{d}. This is to be expected given the double localization in our proposed estimator. Another important note is that the contribution to the variance from having to estimate FF is relatively minimal, compared to the bias, as detailed in Proposition 5. For the bias, non-trivial rate conditions are needed to ensure estimation of FF does not play a role, whereas for the variance one only needs the operator norm of Ω^1Ω1\widehat{\Omega}^{-1}-\Omega^{-1} to be bounded (under regularity conditions, this amounts to the estimator F^\widehat{F} only having bounded errors, in a relative sense, as noted in the following remark).

Remark 9.

By Lemma S2, under the assumptions of Proposition 6, it follows that

Ω^1Ω1(dF^/dF)1,\|\widehat{\Omega}^{-1}-\Omega^{-1}\|\ \lesssim\ \|(d\widehat{F}^{*}/dF^{*})-1\|_{\infty},

so estimation of FF will not affect the conditional variances as long as the error of F^\widehat{F} is bounded in uniform norm.

4.4 Overall Rate

Combining the approximation bias in Proposition 3 with the decomposition in Proposition 4, and the bias and variance bounds from Proposition 6 and Proposition 7, respectively, shows that

𝔼P|τ^(x0)τP(x0)|hγ+(h/k1/d)2s+1nhd(1+knhd)\mathbb{E}_{P}|\widehat{\tau}(x_{0})-\tau_{P}(x_{0})|\ \lesssim\ h^{\gamma}+\left(h/k^{1/d}\right)^{2s}+\sqrt{\frac{1}{nh^{d}}\left(1+\frac{k}{nh^{d}}\right)} (14)

under all the combined assumptions of these results, which are compiled in the statement of Theorem 2 below.

The first two terms in (14) are the bias, with hγh^{\gamma} an oracle bias that would remain even if one had direct access to the potential outcomes (Y1Y0)(Y^{1}-Y^{0}) (or equivalently, samples of τ(X)+ϵ\tau(X)+\epsilon for some ϵ\epsilon with conditional mean zero), and (h/k1/d)2s(h/k^{1/d})^{2s} analogous to a squared nuisance bias term, but shrunken due to the stretching induced by the localized basis bhkb_{hk}. Similarly, 1/(nhd)1/(nh^{d}) is an oracle variance that would remain even if given access to the potential outcomes, whereas the k/(nhd)k/(nh^{d}) factor is a contribution from nuisance estimation (akin to the variance of a series regression on kk basis terms with nhdnh^{d} samples).

Balancing bias and variance in (14) by taking the tuning parameters to satisfy

hn(1/γ)/(1+d2γ+d4s)h\sim n^{-(1/\gamma)/\left(1+\frac{d}{2\gamma}+\frac{d}{4s}\right)}

and

kn(d2sdγ)/(1+d2γ+d4s)k\sim n^{\left(\frac{d}{2s}-\frac{d}{\gamma}\right)/\left(1+\frac{d}{2\gamma}+\frac{d}{4s}\right)}

ensures the rate matches the minimax lower bound from Theorem 1 (in the low-smoothness regime), proving that the lower bound is in fact tight. (In the high-smoothness regime where s>(d/4)/(1+d/2γ)s>(d/4)/(1+d/2\gamma), one can simply take knhdk\sim nh^{d} and hn1/(2γ+d)h\sim n^{-1/(2\gamma+d)}.) This is formalized in the following theorem.
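As a quick check on the exponent algebra (ours, not part of the proof), the sympy sketch below substitutes these choices of h and k into the terms of (14) and confirms that both bias terms and the nuisance variance term are all of order n^{-1/(1 + d/2γ + d/4s)}; the oracle variance 1/(nh^d) is of no larger order in the low-smoothness regime where these choices apply.

```python
import sympy as sp

d, gamma, s = sp.symbols('d gamma s', positive=True)
r = 1 / (1 + d / (2 * gamma) + d / (4 * s))   # target rate: error ~ n^{-r}
a = r / gamma                                  # h = n^{-a}
b = (d / (2 * s) - d / gamma) * r              # k = n^{b}

exponents = {
    'oracle bias h^gamma'              : -gamma * a,
    'nuisance bias (h/k^{1/d})^{2s}'   : -2 * s * (a + b / d),
    'nuisance variance sqrt(k)/(n h^d)': b / 2 - 1 + d * a,
}
for name, expo in exponents.items():
    print(name, sp.simplify(expo + r))   # each prints 0, i.e. all three terms are n^{-r}
```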

Theorem 2.

Assume the regularity conditions:

  1. A.

    The eigenvalues of QQ and Ω\Omega are bounded above and below away from zero.

  2. B.

    π^(x)π(x)\widehat{\pi}(x)-\pi(x) is α\alpha-smooth and μ^0(x)μ0(x)\widehat{\mu}_{0}(x)-\mu_{0}(x) is β\beta-smooth.

  3. C.

    The quantities y2y^{2}, (π^2,μ^02)(\widehat{\pi}^{2},\widehat{\mu}_{0}^{2}), μ^0μ0F\|\widehat{\mu}_{0}-\mu_{0}\|_{F^{*}}, and Q^1Q1\|\widehat{Q}^{-1}-{Q}^{-1}\| are all bounded above, and dF^/dF\|d\widehat{F}^{*}/dF^{*}\|_{\infty} is bounded above and below away from zero.

Also assume the basis bb satisfies Hölder approximating condition (6), and:

  1. 1.

    (dF^/dF)1n1/(1+d2γ+d4s(1+d2γ))π^πF(μ^0μ0F+hγ)\|(d\widehat{F}^{*}/dF^{*})-1\|_{\infty}\lesssim\frac{n^{-1/\left(1+\frac{d}{2\gamma}+\frac{d}{4s}\vee\left(1+\frac{d}{2\gamma}\right)\right)}}{\|\widehat{\pi}-\pi\|_{F^{*}}(\|\widehat{\mu}_{0}-\mu_{0}\|_{F^{*}}+h^{\gamma})},

  2. 2.

    π(x)\pi(x) is α\alpha-smooth, and ϵπ(x)1ϵ\epsilon\leq\pi(x)\leq 1-\epsilon for some ϵ>0\epsilon>0,

  3. 3.

    μ0(x)\mu_{0}(x) is β\beta-smooth,

  4. 4.

    τ(x)\tau(x) is γ\gamma-smooth.

Finally let the tuning parameters satisfy

hn(1/γ)/(1+d2γ+d4s) and kn(d2sdγ)/(1+d2γ+d4s)h\sim n^{-(1/\gamma)/\left(1+\frac{d}{2\gamma}+\frac{d}{4s}\right)}\ \text{ and }\ k\sim n^{\left(\frac{d}{2s}-\frac{d}{\gamma}\right)/\left(1+\frac{d}{2\gamma}+\frac{d}{4s}\right)}

if s<d/41+d/2γs<\frac{d/4}{1+d/2\gamma}, or hn12γ+dh\sim n^{-\frac{1}{2\gamma+d}} and knhdk\sim nh^{d} otherwise. Then the estimator τ^\widehat{\tau} from Definition 2 has error upper bounded as

𝔼P|τ^(x0)τP(x0)|{n1/(1+d2γ+d4s) if s<d/41+d/2γn1/(2+dγ) otherwise.\displaystyle\mathbb{E}_{P}|\widehat{\tau}(x_{0})-\tau_{P}(x_{0})|\lesssim\begin{cases}n^{-1/\left(1+\frac{d}{2\gamma}+\frac{d}{4s}\right)}&\text{ if }s<\frac{d/4}{1+d/2\gamma}\\ n^{-1/\left(2+\frac{d}{\gamma}\right)}&\text{ otherwise.}\end{cases}

We refer to Section 3.3 for more detailed discussion and visualization of the rate from Theorem 2. Here we give two remarks discussing the regularity conditions A–C and Condition 1 (which ensures the covariate distribution is estimated accurately enough).

Remark 10.

Condition A is a standard collinearity restriction used with least squares estimators; simple sufficient conditions are given in Propositions S5 and S6 in the Appendix. In Lemma S3 in the Appendix we also prove that this condition holds for a class of densities contained in the model 𝒫\mathcal{P} in Theorem 1, ensuring that the upper bound holds over the same submodel. A sufficient condition for Condition B to hold is that the estimators π^(x)\widehat{\pi}(x) and μ^0(x)\widehat{\mu}_{0}(x) match the (known) smoothness of π(x)\pi(x) and μ0(x)\mu_{0}(x); this would be the case for standard minimax optimal estimators based on series or local polynomial methods. Condition C is a mild boundedness condition on the outcome YY (which could be weakened at the expense of adding some complexity in the analysis), as well as on the nuisance estimators (F^,π^,μ^0)(\widehat{F}^{*},\widehat{\pi},\widehat{\mu}_{0}) and, even weaker, on the errors μ^0μ0F\|\widehat{\mu}_{0}-\mu_{0}\|_{F^{*}} and Q^1Q1\|\widehat{Q}^{-1}-{Q}^{-1}\| (which would typically not only be bounded but also converge to zero).

Remark 11.

First, Condition 1 of Theorem 2 will of course hold if the covariate distribution is estimated at a rate faster than that of the CATE (i.e., the numerator of the rate in Condition 1); however, it also holds under substantially weaker conditions, depending on how accurately π\pi and μ0\mu_{0} are estimated. This is because the condition really amounts to a third-order term (the covariate distribution error multiplied by a product of nuisance errors) being of smaller order than the CATE rate. Specifically, the result of Theorem 2 can also be written as

𝔼P|τ^(x0)τP(x0)|n1/(1+d2γ+d4s(1+d2γ))+R3,n,\mathbb{E}_{P}|\widehat{\tau}(x_{0})-\tau_{P}(x_{0})|\ \lesssim\ n^{-1/\left(1+\frac{d}{2\gamma}+\frac{d}{4s}\vee\left(1+\frac{d}{2\gamma}\right)\right)}+R_{3,n}, (15)

for the third-order error term

R3,n=(dF^/dF)1π^πF(μ^0μ0F+hγ),R_{3,n}=\|(d\widehat{F}^{*}/dF^{*})-1\|_{\infty}\|\widehat{\pi}-\pi\|_{F^{*}}\Big{(}\|\widehat{\mu}_{0}-\mu_{0}\|_{F^{*}}+h^{\gamma}\Big{)},

so that Condition 1 simply requires this third-order term to be of smaller order than the first minimax optimal rate in (15) (note in the above that hγh^{\gamma} matches the overall CATE estimation error, under our tuning parameter choices, which would typically be of smaller order than the regression error μ^0μ0F\|\widehat{\mu}_{0}-\mu_{0}\|_{F^{*}}). Second, we note that we leave the condition in terms of the L2(F)L_{2}(F^{*}) errors π^πF\|\widehat{\pi}-\pi\|_{F^{*}} and μ^0μ0F\|\widehat{\mu}_{0}-\mu_{0}\|_{F^{*}} because, although we assume π\pi and μ0\mu_{0} are α\alpha- and β\beta-smooth, technically, they do not need to be estimated at particular rates for any of the other results we prove to hold. Of course, under these smoothness assumptions, there are available minimax optimal estimators for which

π^πFn1/(2+d/α) and μ^0μ0Fn1/(2+d/β).\|\widehat{\pi}-\pi\|_{F^{*}}\asymp n^{-1/(2+d/\alpha)}\ \text{ and }\ \|\widehat{\mu}_{0}-\mu_{0}\|_{F^{*}}\asymp n^{-1/(2+d/\beta)}.

If in addition there exists some ζ\zeta for which (dF^/dF)1n1/(2+d/ζ)\|(d\widehat{F}^{*}/dF^{*})-1\|_{\infty}\asymp n^{-1/(2+d/\zeta)} (e.g., if FF has a density that is ζ\zeta-smooth), then Condition 1 reduces to ζ>d/(1/Mα,β,γ,d2)\zeta>{d}/{(1/M_{\alpha,\beta,\gamma,d}-2)}, for

Mα,β,γ,d11+d/2γ+d/4s(1+d/2γ)12+d/α12+d/β.M_{\alpha,\beta,\gamma,d}\equiv\frac{1}{1+d/2\gamma+d/4s\vee(1+d/2\gamma)}-\frac{1}{2+d/\alpha}-\frac{1}{2+d/\beta}.

Exploring CATE estimation under weaker conditions on the covariate distribution is an interesting avenue for future work; we suspect the minimax rate changes depending on what is assumed about this distribution, as is the case for average effects (e.g., page 338 of Robins et al. (2008)) and conditional variance estimation (Wang et al., 2008; Shen et al., 2020).
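For concreteness, the reduction of Condition 1 just described can be evaluated numerically; the helper below is purely illustrative (the smoothness values in the example call are hypothetical, chosen only so that the constant is positive) and simply evaluates M_{α,β,γ,d} and the implied threshold for ζ.

```python
def zeta_threshold(alpha, beta, gamma, d):
    """Evaluate M_{alpha,beta,gamma,d} and the threshold zeta > d / (1/M - 2)."""
    s = (alpha + beta) / 2
    cate_exp = 1 / max(1 + d / (2 * gamma) + d / (4 * s), 2 + d / gamma)  # CATE rate exponent
    M = cate_exp - 1 / (2 + d / alpha) - 1 / (2 + d / beta)
    if M <= 0:
        return M, float('inf')   # in this regime no finite zeta satisfies the reduction
    return M, d / (1 / M - 2)

# hypothetical smoothness values; returns roughly (0.11, 0.58)
print(zeta_threshold(alpha=0.5, beta=0.5, gamma=10.0, d=4))
```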

5 Discussion

In this paper we have characterized the minimax rate for estimating heterogeneous causal effects in a smooth nonparametric model. We derived a lower bound on the minimax rate using a localized version of the method of fuzzy hypotheses, and a matching upper bound via a new local polynomial R-Learner estimator based on higher-order influence functions. We also characterized how the minimax rate changes depending on whether the propensity score or the regression function is smoother, under models that parametrize either the control regression or the marginal regression function. The minimax rate has several important features. First, it exhibits a so-called elbow phenomenon: when the nuisance functions (regression and propensity scores) are smooth enough, the rate matches that of standard smooth nonparametric regression (the same that would be obtained if potential outcomes were actually observed). On the other hand, when the average nuisance smoothness is below the relevant threshold, the rate obtained is slower. This leads to a second important feature: in the latter low-smoothness regime, the minimax rate is a mixture of the minimax rates for nonparametric regression and functional estimation. This quantifies how the CATE can be viewed as a regression/functional hybrid.

There are numerous important avenues left for future work. We detail a few briefly here, based on: different error metrics, inference and testing, adaptivity, and practical implementation. First, we have focused on estimation error at a point, but one could also consider global rates in L2L_{2} or LL_{\infty} norm, for example. We expect L2L_{2} rates to be the same, and LL_{\infty} rates to be the same up to log factors, but verifying this would be useful. In addition, it would also be very important to study the distribution of the proposed estimator, beyond just bias and variance, e.g., for purposes of inference (LL_{\infty} rates could also be useful in this respect). Relatedly, one could consider minimax rates for testing whether the CATE is zero, for example, versus ϵ\epsilon-separated in some distance. The goal of the present work is mostly to further our theoretical understanding of the fundamental limits of CATE estimation, so there remains lots to do to make the optimal rates obtained here achievable in practice. For example, although we have specified particular values of the tuning parameters hh and kk to confirm attainability of our minimax lower bound, it would be practically useful to have more data-driven approaches for selection. In particular, the optimal tuning values depend on underlying smoothness, and since in practice this is often unknown, a natural next step is to study adaptivity. For example one could study whether approaches based on Lepski’s method could be used, as in Mukherjee et al. (2015) and Liu et al. (2021). There are also potential computational challenges associated with constructing the tensor products in ρ(x)\rho(x) when dimension dd is not small, as well as evaluating the U-statistic terms of our estimator, and inverting the matrices Q^\widehat{Q} and Ω^\widehat{\Omega}. Finally, in this work we have assumed the nuisance functions are Hölder-smooth, a classic infinite-dimensional function class from which important insights can be drawn. However, it will be important to explore minimax rates in other function classes as well.

Acknowledgements

EK gratefully acknowledges support from NSF DMS Grant 1810979, NSF CAREER Award 2047444, and NIH R01 Grant LM013361-01A1, and SB and LW from NSF Grant DMS1713003. EK also thanks Matteo Bonvini and Tiger Zeng for very helpful discussions.

References

  • Athey and Imbens [2016] S. Athey and G. Imbens. Recursive partitioning for heterogeneous causal effects. Proceedings of the National Academy of Sciences, 113(27):7353–7360, 2016.
  • Belloni et al. [2015] A. Belloni, V. Chernozhukov, D. Chetverikov, and K. Kato. Some new asymptotic theory for least squares series: pointwise and uniform results. Journal of Econometrics, 186(2):345–366, 2015.
  • Bickel and Ritov [1988] P. J. Bickel and Y. Ritov. Estimating integrated squared density derivatives: sharp best order of convergence estimates. Sankhyā, pages 381–393, 1988.
  • Birgé and Massart [1995] L. Birgé and P. Massart. Estimation of integral functionals of a density. The Annals of Statistics, 23(1):11–29, 1995.
  • Carbery and Wright [2001] A. Carbery and J. Wright. Distributional and lql^{q} norm inequalities for polynomials over convex bodies in rnr^{n}. Mathematical Research Letters, 8(3):233–248, 2001.
  • Chamberlain [1987] G. Chamberlain. Asymptotic efficiency in estimation with conditional moment restrictions. Journal of Econometrics, 34(3):305–334, 1987.
  • Chernozhukov et al. [2017] V. Chernozhukov, M. Goldman, V. Semenova, and M. Taddy. Orthogonal machine learning for demand estimation: High dimensional causal inference in dynamic panels. arXiv preprint arXiv:1712.09988, 2017.
  • Fan et al. [2019] Q. Fan, Y.-C. Hsu, R. P. Lieli, and Y. Zhang. Estimation of conditional average treatment effects with high-dimensional data. arXiv preprint arXiv:1908.02399, 2019.
  • Foster and Syrgkanis [2019] D. J. Foster and V. Syrgkanis. Orthogonal statistical learning. arXiv preprint arXiv:1901.09036, 2019.
  • Foster et al. [2011] J. C. Foster, J. M. Taylor, and S. J. Ruberg. Subgroup identification from randomized clinical trial data. Statistics in Medicine, 30(24):2867–2880, 2011.
  • Gao and Han [2020] Z. Gao and Y. Han. Minimax optimal nonparametric estimation of heterogeneous treatment effects. arXiv preprint arXiv:2002.06471, 2020.
  • Giné and Nickl [2021] E. Giné and R. Nickl. Mathematical foundations of infinite-dimensional statistical models. Cambridge University Press, 2021.
  • Hahn et al. [2020] P. R. Hahn, J. S. Murray, and C. M. Carvalho. Bayesian regression tree models for causal inference: regularization, confounding, and heterogeneous effects. Bayesian Analysis, 2020.
  • Hernán and Robins [2020] M. A. Hernán and J. M. Robins. Causal inference: What if. CRC Boca Raton, FL, 2020.
  • Ibragimov et al. [1987] I. A. Ibragimov, A. S. Nemirovskii, and R. Khas’minskii. Some problems on nonparametric estimation in gaussian white noise. Theory of Probability & Its Applications, 31(3):391–406, 1987.
  • Imai and Ratkovic [2013] K. Imai and M. Ratkovic. Estimating treatment effect heterogeneity in randomized program evaluation. The Annals of Applied Statistics, 7(1):443–470, 2013.
  • Ingster et al. [2003] Y. Ingster, J. I. Ingster, and I. Suslina. Nonparametric goodness-of-fit testing under Gaussian models, volume 169. Springer Science & Business Media, 2003.
  • Kennedy [2023] E. H. Kennedy. Towards optimal doubly robust estimation of heterogeneous causal effects. Electronic Journal of Statistics, 17(2):3008–3049, 2023.
  • Künzel et al. [2019] S. R. Künzel, J. S. Sekhon, P. J. Bickel, and B. Yu. Metalearners for estimating heterogeneous treatment effects using machine learning. Proceedings of the National Academy of Sciences, 116(10):4156–4165, 2019.
  • Lee et al. [2017] S. Lee, R. Okui, and Y.-J. Whang. Doubly robust uniform confidence band for the conditional average treatment effect function. Journal of Applied Econometrics, 32(7):1207–1225, 2017.
  • Liu et al. [2021] L. Liu, R. Mukherjee, J. M. Robins, and E. J. Tchetgen Tchetgen. Adaptive estimation of nonparametric functionals. Journal of Machine Learning Research, 22(99):1–66, 2021.
  • Luedtke and van der Laan [2016] A. R. Luedtke and M. J. van der Laan. Super-learning of an optimal dynamic treatment rule. The International Journal of Biostatistics, 12(1):305–332, 2016.
  • Mukherjee et al. [2015] R. Mukherjee, E. J. Tchetgen Tchetgen, and J. M. Robins. Lepski’s method and adaptive estimation of nonlinear integral functionals of density. arXiv preprint arXiv:1508.00249, 2015.
  • Mukherjee et al. [2017] R. Mukherjee, W. K. Newey, and J. M. Robins. Semiparametric efficient empirical higher order influence function estimators. arXiv preprint arXiv:1705.07577, 2017.
  • Nemirovski [2000] A. Nemirovski. Topics in non-parametric statistics. Ecole d’Eté de Probabilités de Saint-Flour, 28:85, 2000.
  • Nie and Wager [2021] X. Nie and S. Wager. Quasi-oracle estimation of heterogeneous treatment effects. Biometrika, 108(2):299–319, 2021.
  • Robins [1994] J. M. Robins. Correcting for non-compliance in randomized trials using structural nested mean models. Communications in Statistics - Theory and Methods, 23(8):2379–2412, 1994.
  • Robins et al. [1992] J. M. Robins, S. D. Mark, and W. K. Newey. Estimating exposure effects by modelling the expectation of exposure conditional on confounders. Biometrics, 48(2):479–495, 1992.
  • Robins et al. [2008] J. M. Robins, L. Li, E. J. Tchetgen Tchetgen, and A. W. van der Vaart. Higher order influence functions and minimax estimation of nonlinear functionals. Probability and Statistics: Essays in Honor of David A. Freedman, pages 335–421, 2008.
  • Robins et al. [2009a] J. M. Robins, L. Li, E. J. Tchetgen Tchetgen, and A. W. van der Vaart. Quadratic semiparametric von mises calculus. Metrika, 69(2-3):227–247, 2009a.
  • Robins et al. [2009b] J. M. Robins, E. J. Tchetgen Tchetgen, L. Li, and A. W. van der Vaart. Semiparametric minimax rates. Electronic Journal of Statistics, 3:1305–1321, 2009b.
  • Robins et al. [2017] J. M. Robins, L. Li, R. Mukherjee, E. J. Tchetgen Tchetgen, and A. W. van der Vaart. Minimax estimation of a functional on a structured high dimensional model. The Annals of Statistics, 45(5):1951–1987, 2017.
  • Robinson [1988] P. M. Robinson. Root-n-consistent semiparametric regression. Econometrica, 56(4):931–954, 1988.
  • Semenova and Chernozhukov [2017] V. Semenova and V. Chernozhukov. Estimation and inference about conditional average treatment effect and other structural functions. arXiv, pages arXiv–1702, 2017.
  • Shalit et al. [2017] U. Shalit, F. D. Johansson, and D. Sontag. Estimating individual treatment effect: generalization bounds and algorithms. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 3076–3085. JMLR.org, 2017.
  • Shen et al. [2020] Y. Shen, C. Gao, D. Witten, and F. Han. Optimal estimation of variance in nonparametric regression with random design. The Annals of Statistics, 48(6):3589–3618, 2020.
  • Tsybakov [2009] A. B. Tsybakov. Introduction to Nonparametric Estimation. New York: Springer, 2009.
  • van der Laan [2005] M. J. van der Laan. Statistical inference for variable importance. The International Journal of Biostatistics, 188, 2005.
  • van der Laan and Robins [2003] M. J. van der Laan and J. M. Robins. Unified Methods for Censored Longitudinal Data and Causality. New York: Springer, 2003.
  • Vansteelandt and Joffe [2014] S. Vansteelandt and M. M. Joffe. Structural nested models and g-estimation: the partially realized promise. Statistical Science, 29(4):707–731, 2014.
  • Wager and Athey [2018] S. Wager and S. Athey. Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association, 113(523):1228–1242, 2018.
  • Wang et al. [2008] L. Wang, L. D. Brown, T. T. Cai, and M. Levine. Effect of mean on variance function estimation in nonparametric regression. The Annals of Statistics, 36(2):646–664, 2008.
  • Zhao et al. [2017] Q. Zhao, D. S. Small, and A. Ertefaie. Selective inference for effect modification via the lasso. arXiv preprint arXiv:1705.08020, 2017.
  • Zimmert and Lechner [2019] M. Zimmert and M. Lechner. Nonparametric estimation of causal heterogeneity under high-dimensional confounding. arXiv preprint arXiv:1908.08779, 2019.
Appendices

Roadmap of Main Results

The following figures illustrate how Theorems 1 and 2 follow from supporting propositions and lemmas (with arrows indicating that a result plays a role in implying another).

[Diagram: Lemma 2 (generic Hellinger bound), Proposition S3 (density expression), Proposition S4 (density under P_λ, Q_λ), and Proposition 1 (our Hellinger bound), together with Proposition 2 (choice of h, k) and Lemma 1 (fuzzy hypotheses), feed into Theorem 1.]

Figure 3: Roadmap for lower bound result in Theorem 1.

[Diagram: Lemma S1 (projection properties) and Lemma S2 (Ω-hat error bounded by F-hat error) feed into Proposition 5 (bias of S-hat_j) and Proposition 6 (bias of S-hat_j with accurate F-hat); these, together with Proposition 7 (variance of S-hat_j), Proposition 4 (τ-hat error bounded by RMSE of S-hat_j), and Proposition 3 (smoothing bias), feed into Theorem 2.]

Figure 4: Roadmap for upper bound result in Theorem 2.

Propositions S1 and S2 are not included in the above figures since they support the alternative parametrization detailed in Theorem 3; however they play roles analogous to Propositions 4 and 5, respectively. Propositions S5 and S6 and Lemma S3 are not included since they are stand-alone results, giving conditions under which bounded eigenvalue assumptions hold.

Appendix A Alternative Parametrization

In the main text we focused on the model that puts smoothness assumptions on (π,μ0,τ)(\pi,\mu_{0},\tau), and gave a detailed proof of the minimax lower bound in the αβ\alpha\geq\beta case. In this section we give lower bound constructions for the βα\beta\geq\alpha case, as well as minimax lower bounds for a different model that puts smoothness assumptions on (π,η,τ)(\pi,\eta,\tau), i.e., the marginal regression η(x)\eta(x) rather than the control regression μ0(x)\mu_{0}(x).

In the main text we also gave an estimator that was minimax optimal for all (α,β)(\alpha,\beta) in the (π,μ0,τ)(\pi,\mu_{0},\tau) model (under some conditions on the covariate density). This estimator is also optimal in the (π,η,τ)(\pi,\eta,\tau) model, as described shortly in Section A.2. Here we also give matching upper bounds for the (π,η,τ)(\pi,\eta,\tau) model using a different but closely related higher-order estimator (which itself is also minimax optimal in the αβ\alpha\geq\beta case in the (π,μ0,τ)(\pi,\mu_{0},\tau) model).

A.1 Lower Bounds

The following table summarizes the constructions used to derive all minimax lower bounds stated in this paper.

Construction
Model Regime pλp_{\lambda} qλq_{\lambda}
μ0\mu_{0} is β\beta-smooth αβ\alpha\geq\beta (1/2,μ0λτh/2,τh)(1/2,\mu_{0\lambda}-\tau_{h}/2,\tau_{h}) (πλ,μ0λ,0)(\pi_{\lambda},\mu_{0\lambda},0)
βα\beta\geq\alpha (πλ,1/2τh/2,τh)(\pi_{\lambda},1/2-\tau_{h}/2,\tau_{h}) (πλ,μ0λ,0)(\pi_{\lambda},\mu_{0\lambda},0)
η\eta is β\beta-smooth αβ\alpha\geq\beta (1/2,ηλ,τh)(1/2,\eta_{\lambda},\tau_{h}) (πλ,ηλ,0)(\pi_{\lambda},\eta_{\lambda},0)
βα\beta\geq\alpha (1/2,1/2,1τh)(1/2,1/2,1-\tau_{h}) (πλ,1/2,1)(\pi_{\lambda},1/2,1)
Table 1: The constructions used for our lower bound arguments. For the μ0\mu_{0} parametrization the density components are (π,μ0,τ)(\pi,\mu_{0},\tau) and for the η\eta parametrization the components are (π,η,τ)(\pi,\eta,\tau). The first row is the construction detailed in the main text. The constructions in the first and third rows are equivalent, since η=πτ+μ0\eta=\pi\tau+\mu_{0}, so that η=μ0λ\eta=\mu_{0\lambda} in the first row. These first/third rows are the construction used in the first preprint of this paper.

Theorem 3 below states the minimax rate for the (π,η,τ)(\pi,\eta,\tau) model that puts smoothness assumptions on the marginal regression function η\eta, rather than the control regression μ0\mu_{0}.

Theorem 3.

For x0(0,1)x_{0}\in(0,1), let 𝒫\mathcal{P} denote the model where:

  1. 1.

    f(x)f(x) is bounded above by a constant,

  2. 2.

    π(x)\pi(x) is α\alpha-smooth,

  3. 3.

    η(x)\eta(x) is β\beta-smooth, and

  4. 4.

    τ(x)\tau(x) is γ\gamma-smooth.

Let s(α+β)/2s\equiv(\alpha+\beta)/2. Then for nn larger than a constant depending on (α,β,γ,d)(\alpha,\beta,\gamma,d), the minimax rate is lower bounded as

infτ^supP𝒫𝔼P|τ^(x0)τP(x0)|{n1/(1+d2γ+d4min(α,s)) if min(α,s)<d/41+d/2γn1/(2+dγ) otherwise.\displaystyle\inf_{\widehat{\tau}}\sup_{P\in\mathcal{P}}\mathbb{E}_{P}|\widehat{\tau}(x_{0})-\tau_{P}(x_{0})|\gtrsim\begin{cases}n^{-1/\left(1+\frac{d}{2\gamma}+\frac{d}{4\min(\alpha,s)}\right)}&\text{ if }\min(\alpha,s)<\frac{d/4}{1+d/2\gamma}\\ n^{-1/\left(2+\frac{d}{\gamma}\right)}&\text{ otherwise}.\end{cases}

In the following subsections we give proofs of Theorems 1 and 3 for the βα\beta\geq\alpha regimes not detailed in the main text. Given the constructions listed in Table 1, the arguments are all very similar to those presented in the main text for the αβ\alpha\geq\beta case.

A.1.1 Setting 1: αβ\alpha\geq\beta

When αβ\alpha\geq\beta, the same construction can be used for both the μ0\mu_{0} and η\eta parametrizations. This is the construction in the first and third rows of Table 1, and detailed in the main text. Thus the proofs there settle this case.

A.1.2 Setting 2: βα\beta\geq\alpha and η\eta is β\beta-smooth

For the construction in the 4th row of Table 1, η(x)=1/2\eta(x)=1/2 is constant and so certainly β\beta-smooth, and π(x)\pi(x) is α\alpha-smooth as in the main text. Further we have by Proposition S3 that

pλ(z)\displaystyle p_{\lambda}(z) =f(x)[14+(2a1)(2y1){14hγ4B(xx02h)}]\displaystyle=f(x)\left[\frac{1}{4}+(2a-1)(2y-1)\left\{\frac{1}{4}-\frac{h^{\gamma}}{4}B\left(\frac{x-x_{0}}{2h}\right)\right\}\right]
qλ(z)\displaystyle q_{\lambda}(z) =f(x)[14+(a1/2)Δπ(x)+(2a1)(2y1){14Δπ(x)2}]\displaystyle=f(x)\left[\frac{1}{4}+(a-1/2)\Delta_{\pi}(x)+(2a-1)(2y-1)\left\{\frac{1}{4}-\Delta_{\pi}(x)^{2}\right\}\right]

with Δπ(x)=(h/k1/d)αj=1kλjB(xmjh/k1/d)\Delta_{\pi}(x)=\left(h/k^{1/d}\right)^{\alpha}\sum_{j=1}^{k}\lambda_{j}B\left(\frac{x-m_{j}}{h/k^{1/d}}\right), so that p¯(z)=pλ(z)\overline{p}(z)=p_{\lambda}(z) and

q¯(z)\displaystyle\overline{q}(z) =f(x)[14+(2a1)(2y1){14(h/k1/d)2αj=1kB(xmjh/k1/d)2}].\displaystyle=f(x)\left[\frac{1}{4}+(2a-1)(2y-1)\left\{\frac{1}{4}-\left(h/k^{1/d}\right)^{2\alpha}\sum_{j=1}^{k}B\left(\frac{x-m_{j}}{h/k^{1/d}}\right)^{2}\right\}\right].

Now we detail the quantities (δ1,δ2,δ3)(\delta_{1},\delta_{2},\delta_{3}) from Lemma 2, which yield a bound on the Hellinger distance between mixtures of nn-fold products from PλP_{\lambda} and QλQ_{\lambda}.

First, since p¯(z)=pλ(z)\overline{p}(z)=p_{\lambda}(z) it immediately follows that δ1=0\delta_{1}=0. Similarly, taking

hγ4=(h/k1/d)2α\frac{h^{\gamma}}{4}=\left(h/k^{1/d}\right)^{2\alpha}

ensures that p¯(z)=q¯(z)\overline{p}(z)=\overline{q}(z) and thus δ3=0\delta_{3}=0, just as in (2) in the main text. For δ2\delta_{2}, note that qλ(z)pλ(z)=f(x)(a1/2)Δπ(x)q_{\lambda}(z)-p_{\lambda}(z)=f(x)(a-1/2)\Delta_{\pi}(x) (with the above choices of hh and kk) so that

δ2(2dB22ϵ/2)(h/k1/d)2α\delta_{2}\leq\left(\frac{2^{d}\|B\|_{2}^{2}}{\epsilon/2}\right)\left(h/k^{1/d}\right)^{2\alpha}

by the same arguments as for δ1\delta_{1} in the main text. Now by Lemma 2 the Hellinger distance is bounded above as

H2(Pλn𝑑ϖ(λ),Qλn𝑑ϖ(λ))(n2h2dk)(h/k1/d)4αH^{2}\left(\int P_{\lambda}^{n}\ d\varpi(\lambda),\int Q_{\lambda}^{n}\ d\varpi(\lambda)\right)\ \lesssim\ \left(\frac{n^{2}h^{2d}}{k}\right)\left(h/k^{1/d}\right)^{4\alpha}

so taking

hk1/d=(hγ4)1/2α(1n2)1d+4α+2αd/γ\frac{h}{k^{1/d}}=\left(\frac{h^{\gamma}}{4}\right)^{1/2\alpha}\sim\left(\frac{1}{n^{2}}\right)^{\frac{1}{d+4\alpha+2\alpha d/\gamma}}

ensures the Hellinger distance is bounded above by one, and yields the promised minimax rate, since the separation in τ(x0)\tau(x_{0}) under this construction is

hγ4=(hk1/d)2αn1/(1+d2γ+d4α).\frac{h^{\gamma}}{4}=\left(\frac{h}{k^{1/d}}\right)^{2\alpha}\sim n^{-1/\left(1+\frac{d}{2\gamma}+\frac{d}{4\alpha}\right)}.
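The calibration above can also be checked symbolically; the following sympy sketch (an illustration, not part of the argument) works on the exponent scale, writing eps = h/k^{1/d} = n^{-e}, and recovers both the choice of e in the display above and the stated separation rate.

```python
import sympy as sp

d, alpha, gamma, e = sp.symbols('d alpha gamma e', positive=True)

h_exp   = -2 * alpha * e / gamma            # h = n^{h_exp}, from h^gamma = eps^{2 alpha}
k_exp   = d * (h_exp + e)                   # k = (h / eps)^d = n^{k_exp}
hel_exp = 2 + 2 * d * h_exp - 4 * alpha * e - k_exp   # exponent of n in n^2 h^{2d} eps^{4 alpha} / k

e_star = sp.solve(sp.Eq(hel_exp, 0), e)[0]  # makes the Hellinger bound of constant order
sep    = sp.simplify(2 * alpha * e_star)    # separation h^gamma = eps^{2 alpha} = n^{-2 alpha e}
print(sp.simplify(sep - 1 / (1 + d / (2 * gamma) + d / (4 * alpha))))   # prints 0
```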

A.1.3 Setting 3: βα\beta\geq\alpha and μ0\mu_{0} is β\beta-smooth

For the construction in the 2nd row of Table 1 it is again clear that π(x)\pi(x) and μ0(x)\mu_{0}(x) are α\alpha- and β\beta-smooth, respectively. Note for the latter that, under pλp_{\lambda}, the control regression μ0(x)=1/2τh(x)/2\mu_{0}(x)=1/2-\tau_{h}(x)/2 is γ\gamma-smooth and γβ\gamma\geq\beta by definition. By Proposition S3 we have

qλ(z)=f(x){14+(a1/2)Δπ(x)+(y1/2)Δμ(x)+(2a1)(2y1)Δπ(x)Δμ(x)}q_{\lambda}(z)=f(x)\left\{\frac{1}{4}+(a-1/2)\Delta_{\pi}(x)+(y-1/2)\Delta_{\mu}(x)+(2a-1)(2y-1)\Delta_{\pi}(x)\Delta_{\mu}(x)\right\}

with Δπ(x)=(h/k1/d)αj=1kλjB(xmjh/k1/d)\Delta_{\pi}(x)=\left(h/k^{1/d}\right)^{\alpha}\sum_{j=1}^{k}\lambda_{j}B\left(\frac{x-m_{j}}{h/k^{1/d}}\right) and Δμ(x)=(h/k1/d)βj=1kλjB(xmjh/k1/d)\Delta_{\mu}(x)=\left(h/k^{1/d}\right)^{\beta}\sum_{j=1}^{k}\lambda_{j}B\left(\frac{x-m_{j}}{h/k^{1/d}}\right), as in the main text, and

pλ(z)\displaystyle p_{\lambda}(z) =f(x)(14+(a1/2)Δπ(x)(y1/2)τh(x)/2\displaystyle=f(x)\bigg{(}\frac{1}{4}+(a-1/2)\Delta_{\pi}(x)-(y-1/2)\tau_{h}(x)/2
+(2a1)(2y1)[Δπ(x)τh(x)/2+aτh(x){12+Δπ(x)}])\displaystyle\hskip 36.135pt+(2a-1)(2y-1)\left[-\Delta_{\pi}(x)\tau_{h}(x)/2+a\tau_{h}(x)\left\{\frac{1}{2}+\Delta_{\pi}(x)\right\}\right]\bigg{)}
=f(x)[14+(a1/2)Δπ(x)+(2a1)(2y1){τh(x)4+(a1/2)Δπ(x)τh(x)}]\displaystyle=f(x)\left[\frac{1}{4}+(a-1/2)\Delta_{\pi}(x)+(2a-1)(2y-1)\left\{\frac{\tau_{h}(x)}{4}+(a-1/2)\Delta_{\pi}(x)\tau_{h}(x)\right\}\right]

where the second equality follows since (2a1)2=1(2a-1)^{2}=1. Therefore

p¯(z)\displaystyle\overline{p}(z) =f(x){14+(2a1)(2y1)hγ4B(xx02h)}\displaystyle=f(x)\left\{\frac{1}{4}+(2a-1)(2y-1)\frac{h^{\gamma}}{4}B\left(\frac{x-x_{0}}{2h}\right)\right\}
q¯(z)\displaystyle\overline{q}(z) =f(x){14+(2a1)(2y1)(h/k1/d)2sj=1kB(xmjh/k1/d)2}\displaystyle=f(x)\left\{\frac{1}{4}+(2a-1)(2y-1)\left(h/k^{1/d}\right)^{2s}\sum_{j=1}^{k}B\left(\frac{x-m_{j}}{h/k^{1/d}}\right)^{2}\right\}

as in the main text. Now we again detail the quantities (δ1,δ2,δ3)(\delta_{1},\delta_{2},\delta_{3}) from Lemma 2.

Since the marginalized densities p¯(z)\overline{p}(z) and q¯(z)\overline{q}(z) here take the same form as those in the main text, the same arguments imply δ3=0\delta_{3}=0 as long as hγ/4=(h/k1/d)2sh^{\gamma}/4=(h/k^{1/d})^{2s}. Since

{pλ(z)p¯(z)}2\displaystyle\{p_{\lambda}(z)-\overline{p}(z)\}^{2} =[f(x){(a1/2)+(y1/2)τh(x)}Δπ(x)]2f(x)2Δπ(x)2\displaystyle=\left[f(x)\Big{\{}(a-1/2)+(y-1/2)\tau_{h}(x)\Big{\}}\Delta_{\pi}(x)\right]^{2}\leq f(x)^{2}\Delta_{\pi}(x)^{2}

as long as τh(x)1\tau_{h}(x)\leq 1, we have that

δ1(h/k1/d)2α\delta_{1}\lesssim\left(h/k^{1/d}\right)^{2\alpha}

by the same arguments as for δ1\delta_{1} in the main text (and δ2\delta_{2} in the previous subsection). Similarly, with the above choices of hh and kk ensuring δ3=0\delta_{3}=0, we have

{qλ(z)pλ(z)}2\displaystyle\{q_{\lambda}(z)-p_{\lambda}(z)\}^{2} =[f(x)(y1/2){Δμ(x)τh(x)Δπ(x)}]2\displaystyle=\left[f(x)(y-1/2)\Big{\{}\Delta_{\mu}(x)-\tau_{h}(x)\Delta_{\pi}(x)\Big{\}}\right]^{2}
=[f(x)(y1/2)Δμ(x){14Δπ(x)2}]2\displaystyle=\left[f(x)(y-1/2)\Delta_{\mu}(x)\Big{\{}1-4\Delta_{\pi}(x)^{2}\Big{\}}\right]^{2}
f(x)2Δμ(x)2\displaystyle\lesssim f(x)^{2}\Delta_{\mu}(x)^{2}

where the second equality follows since τh(x)/4=Δπ(x)Δμ(x)\tau_{h}(x)/4=\Delta_{\pi}(x)\Delta_{\mu}(x), and the last inequality since Δπ(x)2\Delta_{\pi}(x)^{2} is bounded. Therefore

δ2(h/k1/d)2β\delta_{2}\lesssim\left(h/k^{1/d}\right)^{2\beta}

by the same arguments as for δ1\delta_{1} and δ2\delta_{2} in the main text. This implies by Lemma 2 that the Hellinger distance is bounded above by

(n2h2dk){(h/k1/d)4s+(h/k1/d)4β}(n2h2dk)(h/k1/d)4s\left(\frac{n^{2}h^{2d}}{k}\right)\left\{\left(h/k^{1/d}\right)^{4s}+\left(h/k^{1/d}\right)^{4\beta}\right\}\ \lesssim\ \left(\frac{n^{2}h^{2d}}{k}\right)\left(h/k^{1/d}\right)^{4s}

where the inequality follows since βα\beta\geq\alpha in this setting. Now, as in the main text, taking

hk1/d=(hγ4)1/2s(1n2)1d+4s+2sd/γ\frac{h}{k^{1/d}}=\left(\frac{h^{\gamma}}{4}\right)^{1/2s}\sim\left(\frac{1}{n^{2}}\right)^{\frac{1}{d+4s+2sd/\gamma}}

ensures the Hellinger distance is bounded above by one, and yields the promised minimax rate, since the separation in τ(x0)\tau(x_{0}) under this construction is

hγ4=(hk1/d)2sn1/(1+d2γ+d4s).\frac{h^{\gamma}}{4}=\left(\frac{h}{k^{1/d}}\right)^{2s}\sim n^{-1/\left(1+\frac{d}{2\gamma}+\frac{d}{4s}\right)}.

A.2 Upper Bounds

In the model where η\eta instead of μ0\mu_{0} is assumed to be β\beta-smooth, the estimator described in the main text is also minimax optimal (since if η\eta is β\beta-smooth, then μ0=ηπτ\mu_{0}=\eta-\pi\tau is min(α,β)\min(\alpha,\beta)-smooth, and so in the α<β\alpha<\beta case one gets the same rates as in the main text with ss replaced by min(s,α)\min(s,\alpha)). However, one can also use the following alternative estimator, which we detail for completeness.

Definition 3 (Higher-Order Local Polynomial R-Learner).

Let Kh(x)K_{h}(x) and ρh(x)\rho_{h}(x) be defined exactly as in Definition 2. The proposed estimator is then defined as

τ^(x0)=ρh(x0)TQ^1R^\widehat{\tau}(x_{0})=\rho_{h}(x_{0})^{\mathrm{\scriptscriptstyle T}}\widehat{Q}^{-1}\widehat{R} (16)

where Q^\widehat{Q} is a q×qq\times q matrix and R^\widehat{R} a qq-vector now given by

Q^\displaystyle\widehat{Q} =n{ρh(X)Kh(X)φ^a1(Z)ρh(X)T}+𝕌n{ρh(X1)Kh(X1)φ^a2(Z1,Z2)Kh(X2)ρh(X1)T}\displaystyle=\mathbb{P}_{n}\Big{\{}\rho_{h}(X)K_{h}(X)\widehat{\varphi}_{a1}(Z)\rho_{h}(X)^{\mathrm{\scriptscriptstyle T}}\Big{\}}+\mathbb{U}_{n}\Big{\{}\rho_{h}(X_{1})K_{h}(X_{1})\widehat{\varphi}_{a2}(Z_{1},Z_{2})K_{h}(X_{2})\rho_{h}(X_{1})^{\mathrm{\scriptscriptstyle T}}\Big{\}}
R^\displaystyle\widehat{R} =n{ρh(X1)Kh(X1)φ^y1(Z1)}+𝕌n{ρh(X1)Kh(X1)φ^y2(Z1,Z2)Kh(X2)},\displaystyle=\mathbb{P}_{n}\Big{\{}\rho_{h}(X_{1})K_{h}(X_{1})\widehat{\varphi}_{y1}(Z_{1})\Big{\}}+\mathbb{U}_{n}\Big{\{}\rho_{h}(X_{1})K_{h}(X_{1})\widehat{\varphi}_{y2}(Z_{1},Z_{2})K_{h}(X_{2})\Big{\}},

respectively, with

φ^a1(Z)\displaystyle\widehat{\varphi}_{a1}(Z) ={Aπ^(X)}2\displaystyle=\{A-\widehat{\pi}(X)\}^{2}
φ^y1(Z)\displaystyle\widehat{\varphi}_{y1}(Z) ={Yη^(X)}{Aπ^(X)}\displaystyle=\{Y-\widehat{\eta}(X)\}\{A-\widehat{\pi}(X)\}
φ^a2(Z1,Z2)\displaystyle\widehat{\varphi}_{a2}(Z_{1},Z_{2}) ={A1π^(X1)}bhk(X1)TΩ^1bhk(X2){A2π^(X2)}\displaystyle=-\{A_{1}-\widehat{\pi}(X_{1})\}b_{hk}(X_{1})^{\mathrm{\scriptscriptstyle T}}\widehat{\Omega}^{-1}b_{hk}(X_{2})\{A_{2}-\widehat{\pi}(X_{2})\}
φ^y2(Z1,Z2)\displaystyle\widehat{\varphi}_{y2}(Z_{1},Z_{2}) ={A1π^(X1)}bhk(X1)TΩ^1bhk(X2){Y2η^(X2)}\displaystyle=-\{A_{1}-\widehat{\pi}(X_{1})\}b_{hk}(X_{1})^{\mathrm{\scriptscriptstyle T}}\widehat{\Omega}^{-1}b_{hk}(X_{2})\{Y_{2}-\widehat{\eta}(X_{2})\}
bhk(x)\displaystyle b_{hk}(x) =b{1/2+(xx0)/h}1(xx0h/2)\displaystyle=b\{1/2+(x-x_{0})/h\}\mathbbold{1}(\|x-x_{0}\|\leq h/2)
Ω^\displaystyle\widehat{\Omega} =v[0,1]db(v)b(v)T𝑑F^(x0+h(v1/2))\displaystyle=\int_{v\in[0,1]^{d}}b(v)b(v)^{\mathrm{\scriptscriptstyle T}}\ d\widehat{F}(x_{0}+h(v-1/2))

for b:dkb:\mathbb{R}^{d}\mapsto\mathbb{R}^{k} a basis of dimension kk. The nuisance estimators (F^,π^,η^)(\widehat{F},\widehat{\pi},\widehat{\eta}) are constructed from a separate training sample DnD^{n}, independent of that on which 𝕌n\mathbb{U}_{n} operates.
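To fix ideas, the snippet below sketches only the first-order part of the estimator (16) in d = 1, with a local-linear basis, a boxcar kernel, and oracle values plugged in for π and η; the U-statistic correction, the localized basis b_hk, and the estimation of F are all omitted. It is therefore an illustration (with data-generating choices that are ours) of how the pieces fit together, not an implementation of the full procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
n, x0, h = 5000, 0.5, 0.25

X   = rng.uniform(0, 1, n)
pi  = 0.4 + 0.2 * X                      # hypothetical propensity score
A   = rng.binomial(1, pi)
tau = np.sin(3 * X)                      # hypothetical CATE
mu0 = X ** 2
Y   = mu0 + A * tau + rng.normal(0, 0.1, n)
eta = mu0 + pi * tau                     # marginal regression E(Y | X), used as an oracle eta-hat

rho = np.vstack([np.ones(n), (X - x0) / h])      # local-linear basis rho_h(X), d = 1
K   = (np.abs(X - x0) <= h / 2) / h              # boxcar kernel K_h(X)

phi_a1 = (A - pi) ** 2                           # first-order terms of (16)
phi_y1 = (Y - eta) * (A - pi)
Q1 = (rho * (K * phi_a1)) @ rho.T / n
R1 = (rho * (K * phi_y1)).sum(axis=1) / n

tau_hat = np.array([1.0, 0.0]) @ np.linalg.solve(Q1, R1)   # rho_h(x0) = (1, 0)
print(tau_hat, np.sin(3 * x0))                   # estimate is reasonably close to tau(x0)
```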

Like the estimator (3) in the main text, the estimator (16) can also be viewed as a modified second-order estimator of the projection parameter. However, instead of estimating the moment condition (5) under a linear effect-type assumption, the estimator (16) instead estimates the numerator and denominator terms RR and QQ separately. For example, the first term in R^\widehat{R}, i.e., n{ρh(X)Kh(X)φ^y1(Z)},\mathbb{P}_{n}\{\rho_{h}(X)K_{h}(X)\widehat{\varphi}_{y1}(Z)\}, is the usual first-order influence function-based estimator of RR. Kennedy [2023] analyzed an undersmoothed version of this estimator (where the nuisance estimates π^\widehat{\pi} and μ^\widehat{\mu} themselves are undersmoothed linear smoothers), calling it the local polynomial R-learner. The second term

𝕌n{ρh(X1)Kh(X1)φ^y2(Z1,Z2)Kh(X2)}\mathbb{U}_{n}\Big{\{}\rho_{h}(X_{1})K_{h}(X_{1})\widehat{\varphi}_{y2}(Z_{1},Z_{2})K_{h}(X_{2})\Big{\}}

is similar to the second-order U-statistic correction that would be added using the higher-order influence function methodology developed by Robins et al. [2008, 2009a, 2017]. However, this term differs in its localization just as described for the estimator in the main text.

Proposition S1.

The estimator (16) satisfies

|τ^(x0)τh(x0)|ρ(1/2)(Q1+Q^1Q1)(R^R+QQ^2Q1R),\left|\widehat{\tau}(x_{0})-\tau_{h}(x_{0})\right|\ \leq\ \|\rho(1/2)\|\left(\|Q^{-1}\|+\|\widehat{Q}^{-1}-{Q}^{-1}\|\right)\left(\left\|\widehat{R}-R\right\|+\left\|Q-\widehat{Q}\right\|_{2}\|Q^{-1}R\|\right),

and further if Q1\|Q^{-1}\|, Q^1Q1\|\widehat{Q}^{-1}-{Q}^{-1}\|, and Q1R\|Q^{-1}R\| are all bounded above, then

𝔼|τ^(x0)τh(x0)|\displaystyle\mathbb{E}\left|\widehat{\tau}(x_{0})-\tau_{h}(x_{0})\right|\ maxj𝔼{𝔼(R^jRjDn)2+var(R^jDn)}\displaystyle\lesssim\ \max_{j}\sqrt{\mathbb{E}\left\{\mathbb{E}(\widehat{R}_{j}-R_{j}\mid D^{n})^{2}+\text{var}(\widehat{R}_{j}\mid D^{n})\right\}}
+maxj,𝔼{𝔼(Q^jQjDn)2+var(Q^jDn)}.\displaystyle\hskip 36.135pt+\max_{j,\ell}\sqrt{\mathbb{E}\left\{\mathbb{E}(\widehat{Q}_{j\ell}-Q_{j\ell}\mid D^{n})^{2}+\text{var}(\widehat{Q}_{j\ell}\mid D^{n})\right\}}.

for DnD^{n} a separate independent training sample on which (F^,π^,η^)(\widehat{F},\widehat{\pi},\widehat{\eta}) are estimated.


Proof.

We have

|τ^(x0)τh(x0)|\displaystyle|\widehat{\tau}(x_{0})-\tau_{h}(x_{0})| =|ρh(x0)TQ^1{(R^R)+(QQ^)Q1R}|\displaystyle=\left|\rho_{h}(x_{0})^{\mathrm{\scriptscriptstyle T}}\widehat{Q}^{-1}\left\{\left(\widehat{R}-R\right)+\left(Q-\widehat{Q}\right)Q^{-1}R\right\}\right|
ρ(1/2)Q^1(R^R+QQ^Q1R)\displaystyle\leq\|\rho(1/2)\|\left\|\widehat{Q}^{-1}\right\|\left(\left\|\widehat{R}-R\right\|+\left\|Q-\widehat{Q}\right\|\left\|Q^{-1}R\right\|\right)
ρ(1/2)(Q1+Q^1Q1)(R^R+QQ^2Q1R)\displaystyle\leq\|\rho(1/2)\|\left(\left\|Q^{-1}\right\|+\left\|\widehat{Q}^{-1}-Q^{-1}\right\|\right)\left(\left\|\widehat{R}-R\right\|+\left\|Q-\widehat{Q}\right\|_{2}\left\|Q^{-1}R\right\|\right)

by the sub-multiplicative and triangle inequalities of the operator norm, along with the fact that AA2\|A\|\leq\|A\|_{2}; this is exactly the first inequality. For the second inequality, first note that ρ(x)Cq\|\rho(x)\|\leq Cq, as described in the proof of Proposition 3, so that together with the bounds on Q1\|Q^{-1}\|, Q^1Q1\|\widehat{Q}^{-1}-Q^{-1}\|, and Q1R\|Q^{-1}R\|, it suffices to bound 𝔼R^R\mathbb{E}\|\widehat{R}-R\| and 𝔼QQ^2\mathbb{E}\|Q-\widehat{Q}\|_{2}. The second inequality now follows since

𝔼R^R\displaystyle\mathbb{E}\|\widehat{R}-R\| 𝔼R^R2=j𝔼[𝔼{(R^jRj)2Dn}]\displaystyle\leq\sqrt{\mathbb{E}\|\widehat{R}-R\|^{2}}=\sqrt{\sum_{j}\mathbb{E}\left[\mathbb{E}\Big{\{}(\widehat{R}_{j}-R_{j})^{2}\mid D^{n}\Big{\}}\right]}
=j𝔼{bias(R^jDn)2+var(R^jDn)}\displaystyle=\sqrt{\sum_{j}\mathbb{E}\left\{\text{bias}(\widehat{R}_{j}\mid D^{n})^{2}+\text{var}(\widehat{R}_{j}\mid D^{n})\right\}}
(d+γγ)maxj𝔼{bias(R^jDn)2+var(R^jDn)}\displaystyle\leq\sqrt{{{d+\lfloor\gamma\rfloor}\choose{\lfloor\gamma\rfloor}}}\max_{j}\sqrt{\mathbb{E}\left\{\text{bias}(\widehat{R}_{j}\mid D^{n})^{2}+\text{var}(\widehat{R}_{j}\mid D^{n})\right\}}

using Jensen’s inequality and iterated expectation. The last line follows since the length of RR is (d+γγ){{d+\lfloor\gamma\rfloor}\choose{\lfloor\gamma\rfloor}}. The logic is the same for 𝔼Q^Q2=𝔼j,(Q^jQj)2\mathbb{E}\|\widehat{Q}-Q\|_{2}=\mathbb{E}\sqrt{\sum_{j,\ell}(\widehat{Q}_{j\ell}-Q_{j\ell})^{2}}. ∎


Proposition S2.

Assume:

  1. λmax(Ω)\lambda_{\max}(\Omega) is bounded above,

  2. the basis bb satisfies approximating condition (6),

  3. π^(x)π(x)\widehat{\pi}(x)-\pi(x) is α\alpha-smooth,

  4. η^(x)η(x)\widehat{\eta}(x)-\eta(x) is β\beta-smooth.

Then

|𝔼(R^jRjDn)|\displaystyle|\mathbb{E}(\widehat{R}_{j}-R_{j}\mid D^{n})| (h/k1/d)2s+π^πFη^ηFΩ^1Ω1\displaystyle\ \lesssim\ \left(h/k^{1/d}\right)^{2s}+\|\widehat{\pi}-\pi\|_{F^{*}}\|\widehat{\eta}-\eta\|_{F^{*}}\|\widehat{\Omega}^{-1}-\Omega^{-1}\|
|𝔼(Q^jQjDn)|\displaystyle|\mathbb{E}(\widehat{Q}_{j\ell}-Q_{j\ell}\mid D^{n})| (h/k1/d)2α+π^πF2Ω^1Ω1.\displaystyle\ \lesssim\ \left(h/k^{1/d}\right)^{2\alpha}+\|\widehat{\pi}-\pi\|_{F^{*}}^{2}\|\widehat{\Omega}^{-1}-\Omega^{-1}\|.

Proof.

We only prove the result for R^j\widehat{R}_{j}, since the logic is the same for Q^j\widehat{Q}_{j\ell}. By iterated expectation, the conditional mean of the first-order term is

𝔼{ρh(X1)Kh(X1)\displaystyle\mathbb{E}\{\rho_{h}(X_{1})K_{h}(X_{1}) φ^y1(Z)Dn}=R+ρh(x)Kh(x){π(x)π^(x)}{η(x)η^(x)}dF(x)\displaystyle\widehat{\varphi}_{y1}(Z)\mid D^{n}\}=R+\int\rho_{h}(x)K_{h}(x)\{\pi(x)-\widehat{\pi}(x)\}\{\eta(x)-\widehat{\eta}(x)\}\ dF(x)
=R+ρ(v){π(v)π^(v)}{η(v)η^(v)}𝑑F(v)\displaystyle=R+\int\rho(v)\{\pi^{*}(v)-\widehat{\pi}^{*}(v)\}\{\eta^{*}(v)-\widehat{\eta}^{*}(v)\}\ dF^{*}(v) (17)

where we use the change of variable v=12+xx0hv=\frac{1}{2}+\frac{x-x_{0}}{h} and again define for any function g:dg:\mathbb{R}^{d}\mapsto\mathbb{R} its corresponding stretched version as g(v)=g(x0+h(v1/2))g^{*}(v)=g(x_{0}+h(v-1/2)). To ease notation it is left implicit that any integral over vv is only over {v:v1/21/2}\{v:\|v-1/2\|\leq 1/2\}. Similarly, (minus) the conditional mean of the second-order influence function term is

𝔼{ρh(X1)Kh(X1)φ^y2(Z1,Z2)Kh(X2)Dn}\displaystyle-\mathbb{E}\{\rho_{h}(X_{1})K_{h}(X_{1})\widehat{\varphi}_{y2}(Z_{1},Z_{2})K_{h}(X_{2})\mid D^{n}\}
=ρh(x1)Kh(x1){π(x1)π^(x1)}bhk(x1)TΩ^1bhk(x2){η(x2)η^(x2)}Kh(x2)𝑑F(x2)𝑑F(x1)\displaystyle=\iint\rho_{h}(x_{1})K_{h}(x_{1})\{\pi(x_{1})-\widehat{\pi}(x_{1})\}b_{hk}(x_{1})^{\mathrm{\scriptscriptstyle T}}\widehat{\Omega}^{-1}b_{hk}(x_{2})\{\eta(x_{2})-\widehat{\eta}(x_{2})\}K_{h}(x_{2})\ dF(x_{2})\ dF(x_{1})
=ρ(v1){π(v1)π^(v1)}b(v1)TΩ^1b(v2){η(v2)η^(v2)}𝑑F(v2)𝑑F(v1)\displaystyle=\iint\rho(v_{1})\{\pi^{*}(v_{1})-\widehat{\pi}^{*}(v_{1})\}b(v_{1})^{\mathrm{\scriptscriptstyle T}}\widehat{\Omega}^{-1}b(v_{2})\{\eta^{*}(v_{2})-\widehat{\eta}^{*}(v_{2})\}\ dF^{*}(v_{2})\ dF^{*}(v_{1})
=ρ(v1){π(v1)π^(v1)}Π^b(ηη^)(v1)𝑑F(v1)\displaystyle=\int\rho(v_{1})\{\pi^{*}(v_{1})-\widehat{\pi}^{*}(v_{1})\}\widehat{\Pi}_{b}(\eta^{*}-\widehat{\eta}^{*})(v_{1})\ dF^{*}(v_{1}) (18)

where we define

Πbg(u)\displaystyle\Pi_{b}g^{*}(u) =b(u)TΩ1b(v)g(v)𝑑F(v)\displaystyle=b(u)^{\mathrm{\scriptscriptstyle T}}\Omega^{-1}\int{b}(v)g^{*}(v)\ dF^{*}(v)

as the FF^{*}-weighted linear projection of gg^{*} on the basis bb, and Π^bg(u)\widehat{\Pi}_{b}g^{*}(u) as the estimated version, which simply replaces Ω\Omega with Ω^\widehat{\Omega}. Therefore adding the first- and second-order expected values in (17) and (18), the overall bias relative to RR is

ρ(v)\displaystyle\int\rho(v) {π(v)π^(v)}(IΠ^b)(ηη^)(v)dF(v)\displaystyle\{\pi^{*}(v)-\widehat{\pi}^{*}(v)\}(I-\widehat{\Pi}_{b})(\eta^{*}-\widehat{\eta}^{*})(v)\ dF^{*}(v)
=(IΠb){ρ(ππ^)}(v)(IΠb)(ηη^)(v)𝑑F(v)\displaystyle=\int(I-\Pi_{b})\{\rho(\pi^{*}-\widehat{\pi}^{*})\}(v)(I-\Pi_{b})(\eta^{*}-\widehat{\eta}^{*})(v)\ dF^{*}(v) (19)
+ρ(v){π(v)π^(v)}(ΠbΠ^b)(ηη^)(v)𝑑F(v)\displaystyle\hskip 43.36243pt+\int\rho(v)\{\pi^{*}(v)-\widehat{\pi}^{*}(v)\}(\Pi_{b}-\widehat{\Pi}_{b})(\eta^{*}-\widehat{\eta}^{*})(v)\ dF^{*}(v) (20)

where the last line follows from orthogonality of a projection with its residuals (Lemma S1(i)).

Now we analyze the bias terms (19) and (20) separately; the first is the main bias term, which would arise even if the covariate density were known, and the second is the contribution coming from having to estimate the covariate density.

Recall from (13) in the main text that if gg is α\alpha-smooth, then (IΠb)g/hαFkα/d\|(I-\Pi_{b})g^{*}/h^{\alpha}\|_{F^{*}}\lesssim k^{-\alpha/d} by the Hölder approximation properties (6) of the basis bb, and so it follows that

(IΠb)gFhαkα/d=(k/hd)α/d\|(I-\Pi_{b})g^{*}\|_{F^{*}}\ \lesssim\ h^{\alpha}k^{-\alpha/d}=(k/h^{d})^{-\alpha/d}

for any α\alpha-smooth function gg. Therefore now consider the bias term (19). This term satisfies

[\displaystyle\int\Big{[} (IΠb){ρ(ππ^)}(v)]{(IΠb)(ηη^)(v)}dF(v)\displaystyle(I-\Pi_{b})\{\rho(\pi^{*}-\widehat{\pi}^{*})\}(v)\Big{]}\Big{\{}(I-\Pi_{b})(\eta^{*}-\widehat{\eta}^{*})(v)\Big{\}}\ dF^{*}(v)
(IΠb){ρ(ππ^)}F(IΠb)(ηη^)F\displaystyle\leq\|(I-\Pi_{b})\{\rho(\pi^{*}-\widehat{\pi}^{*})\}\|_{F^{*}}\|(I-\Pi_{b})(\eta^{*}-\widehat{\eta}^{*})\|_{F^{*}}
(k/hd)2s/d\displaystyle\lesssim\ (k/h^{d})^{-2s/d}

where the second line follows by Cauchy-Schwarz, and the third by (13), since (ππ^)(\pi-\widehat{\pi}) and (ηη^)(\eta-\widehat{\eta}) are assumed α\alpha- and β\beta-smooth, respectively (note ρ(v)\rho(v) is a polynomial, so the smoothness of ρ(ππ^)\rho(\pi^{*}-\widehat{\pi}^{*}) is the same as (ππ^)(\pi^{*}-\widehat{\pi}^{*})).

Now for the term in (20), let θb,g=Ω1bg𝑑F\theta_{b,g}=\Omega^{-1}\int bg\ dF^{*} denote the coefficients of the projection Πbg\Pi_{b}g, and note for any functions g1,g2g_{1},g_{2} we have

g1(ΠbΠ^b)(g2)𝑑F\displaystyle\int g_{1}(\Pi_{b}-\widehat{\Pi}_{b})(g_{2})\ dF^{*} =(Ω1/2θb,g1)TΩ1/2(Ω1Ω^1)Ω1/2(Ω1/2θb,g2)\displaystyle=\left(\Omega^{1/2}\theta_{b,g_{1}}\right)^{\mathrm{\scriptscriptstyle T}}\Omega^{1/2}(\Omega^{-1}-\widehat{\Omega}^{-1})\Omega^{1/2}\left(\Omega^{1/2}\theta_{b,g_{2}}\right)
g1FΩ1/2(Ω1Ω^1)Ω1/2g2F\displaystyle\leq\|g_{1}\|_{F^{*}}\|\Omega^{1/2}(\Omega^{-1}-\widehat{\Omega}^{-1})\Omega^{1/2}\|\|g_{2}\|_{F^{*}}
g1Fg2FΩΩ^1Ω1\displaystyle\leq\|g_{1}\|_{F^{*}}\|g_{2}\|_{F^{*}}\|\Omega\|\|\widehat{\Omega}^{-1}-\Omega^{-1}\|

where the first equality follows by definition, the second line since the L2L_{2} norm of the coefficients of a (weighted) projection is no more than the weighted L2()L_{2}(\mathbb{P}) norm of the function itself (Lemma S1(iii)), and the last by the sub-multiplicative property of the operator norm, along with the fact that Ω1/22=Ω\|\Omega^{1/2}\|^{2}=\|\Omega\|. ∎


We omit details on variance calculations because the logic is exactly the same as for the estimator (3) studied in Section 4.3 of the main text.

Combining the approximation bias from Proposition 3 with the error decomposition in Proposition S1, the bias bounds in Proposition S2, the variance bounds discussed above, and sufficient conditions on estimation of the covariate density, we obtain that the estimator (16) has error satisfying

𝔼P|τ^(x0)τP(x0)|hγ+(h/k1/d)2min(s,α)+1nhd(1+knhd)\mathbb{E}_{P}|\widehat{\tau}(x_{0})-\tau_{P}(x_{0})|\ \lesssim\ h^{\gamma}+\left(h/k^{1/d}\right)^{2\min(s,\alpha)}+\sqrt{\frac{1}{nh^{d}}\left(1+\frac{k}{nh^{d}}\right)} (21)

Balancing terms as in Section 4.4 yields the minimax rate from Theorem 3.
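
For intuition, the balancing in (21) can be sketched as follows; this is only a heuristic, and the precise tuning choices are given in Section 4.4 of the main text. Write \bar{s}\equiv\min(s,\alpha). Taking k\asymp nh^{d} makes the variance term of order (nh^{d})^{-1/2} and the second bias term of order (h/k^{1/d})^{2\bar{s}}=n^{-2\bar{s}/d}, so that balancing h^{\gamma}\asymp(nh^{d})^{-1/2} gives

h\asymp n^{-1/(2\gamma+d)},\qquad \mathbb{E}_{P}|\widehat{\tau}(x_{0})-\tau_{P}(x_{0})|\ \lesssim\ n^{-\gamma/(2\gamma+d)}+n^{-2\bar{s}/d}.

When the term n^{-2\bar{s}/d} dominates, one can instead take k larger than nh^{d} and balance all three terms, h^{\gamma}\asymp(h/k^{1/d})^{2\bar{s}}\asymp\sqrt{k}/(nh^{d}), i.e., k\asymp n^{2}h^{2d+2\gamma}, which gives

\mathbb{E}_{P}|\widehat{\tau}(x_{0})-\tau_{P}(x_{0})|\ \lesssim\ h^{\gamma}\asymp n^{-1/\{1+d/(2\gamma)+d/(4\bar{s})\}}.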

Appendix B Additional Results

B.1 Density Bounds in Proposition 1

Here we show that the relevant densities and density ratios from Proposition 1 are appropriately bounded.

Proof.

When h1/4h\leq 1/4, it follows that on 𝒮hk\mathcal{S}_{hk} we have

1f(x)={1(4d12d)hd}12.1\leq f(x)=\left\{1-\left(\frac{4^{d}-1}{2^{d}}\right)h^{d}\right\}^{-1}\leq 2. (22)

Further, since B(x)1(x[1,1]d)B(x)\leq\mathbbold{1}(x\in[-1,1]^{d}), a,y{0,1}a,y\in\{0,1\}, and λj{1,1}\lambda_{j}\in\{-1,1\} for each jj, we have on 𝒮hk\mathcal{S}_{hk} that

1412(h/k1/d)βhγ4pλ(z)f(x)14+12(h/k1/d)β+hγ4,\frac{1}{4}-\frac{1}{2}\left(h/k^{1/d}\right)^{\beta}-\frac{h^{\gamma}}{4}\ \leq\frac{p_{\lambda}(z)}{f(x)}\leq\ \frac{1}{4}+\frac{1}{2}\left(h/k^{1/d}\right)^{\beta}+\frac{h^{\gamma}}{4},

regardless of the values of h,k0h,k\geq 0. Therefore when hγ+2(h/k1/d)β14ϵh^{\gamma}+2(h/k^{1/d})^{\beta}\leq 1-4\epsilon, the above bound implies

pλ(z)f(x)1412(h/k1/d)βhγ4ϵ.\frac{p_{\lambda}(z)}{f(x)}\geq\frac{1}{4}-\frac{1}{2}\left(h/k^{1/d}\right)^{\beta}-\frac{h^{\gamma}}{4}\geq\epsilon. (23)

Similarly, when hγ+2(h/k1/d)β14ϵh^{\gamma}+2(h/k^{1/d})^{\beta}\leq 1-4\epsilon (which implies hγ1h^{\gamma}\leq 1) we also have

p¯(z)pλ(z)14+hγ41412(h/k1/d)βhγ41/2ϵ.\frac{\overline{p}(z)}{p_{\lambda}(z)}\leq\frac{\frac{1}{4}+\frac{h^{\gamma}}{4}}{\frac{1}{4}-\frac{1}{2}\left(h/k^{1/d}\right)^{\beta}-\frac{h^{\gamma}}{4}}\leq\frac{1/2}{\epsilon}.

Note that, although Robins et al. [2009b] assume pλ(z)p_{\lambda}(z) is uniformly bounded away from zero in their version of Lemma 2, they only use that assumption through a bound on p¯/pλ\overline{p}/p_{\lambda}, which ensures their quantity cc is bounded (see page 1319). Since such a bound was established above, their argument also applies in our setting. ∎


B.2 Density Expressions

Proposition S3.

For binary AA and YY, the density of Z=(X,A,Y)Z=(X,A,Y) at a point z=(x,a,y) can be written as

f(x)\displaystyle f(x) (14+(a1/2)Δπ(x)+(y1/2)Δμ(x)\displaystyle\bigg{(}\frac{1}{4}+(a-1/2)\Delta_{\pi}(x)+(y-1/2)\Delta_{\mu}(x)
+(2a1)(2y1)[Δπ(x)Δμ(x)+aτ(x){12+Δπ(x)}])\displaystyle\hskip 28.90755pt+(2a-1)(2y-1)\left[\Delta_{\pi}(x)\Delta_{\mu}(x)+a\tau(x)\left\{\frac{1}{2}+\Delta_{\pi}(x)\right\}\right]\bigg{)}

for Δπ(x)=π(x)1/2\Delta_{\pi}(x)=\pi(x)-1/2 and Δμ(x)=μ0(x)1/2\Delta_{\mu}(x)=\mu_{0}(x)-1/2, or equivalently as

f(x)\displaystyle f(x) (14+(a1/2)Δπ(x)+(y1/2)Δη(x)\displaystyle\bigg{(}\frac{1}{4}+(a-1/2)\Delta_{\pi}(x)+(y-1/2)\Delta_{\eta}(x)
+(2a1)(2y1)[Δπ(x)Δη(x)+τ(x){14Δπ(x)2}])\displaystyle\hskip 28.90755pt+(2a-1)(2y-1)\left[\Delta_{\pi}(x)\Delta_{\eta}(x)+\tau(x)\left\{\frac{1}{4}-\Delta_{\pi}(x)^{2}\right\}\right]\bigg{)}

for Δη(x)=η(x)1/2\Delta_{\eta}(x)=\eta(x)-1/2.
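
Proposition S3 follows from a direct calculation; the following is a short verification sketch. Writing \Delta_{\mu_{a}}(x)=\mu_{a}(x)-1/2=\Delta_{\mu}(x)+a\tau(x), the density factors (with respect to the product of FF and counting measure on \{0,1\}^{2}) as

p(z)=f(x)\,\mathbb{P}(A=a\mid X=x)\,\mathbb{P}(Y=y\mid X=x,A=a)=f(x)\left\{\frac{1}{2}+(2a-1)\Delta_{\pi}(x)\right\}\left\{\frac{1}{2}+(2y-1)\Delta_{\mu_{a}}(x)\right\}.

Expanding the product and using that a(2a-1)=a for a\in\{0,1\} gives the first display; substituting \Delta_{\mu}(x)=\Delta_{\eta}(x)-\tau(x)\{1/2+\Delta_{\pi}(x)\} (using that \eta(x)=\mathbb{E}(Y\mid X=x), so that \eta=\mu_{0}+\pi\tau) and simplifying gives the second.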


Proposition S4.

The densities under PλP_{\lambda} and QλQ_{\lambda} from Definition 1 are given by

pλ(z)\displaystyle p_{\lambda}(z) =f(x)[14+(y1/2)(h/k1/d)βj=1kλjB(xmjh/k1/d)+(2a1)(2y1)hγ4B(xx0h)]\displaystyle=f(x)\Bigg{[}\frac{1}{4}+(y-1/2)\left(h/k^{1/d}\right)^{\beta}\sum_{j=1}^{k}\lambda_{j}B\left(\frac{x-m_{j}}{h/k^{1/d}}\right)+(2a-1)(2y-1)\frac{h^{\gamma}}{4}B\left(\frac{x-x_{0}}{h}\right)\Bigg{]}
qλ(z)\displaystyle q_{\lambda}(z) =f(x)[14+{(a1/2)(h/k1/d)α+(y1/2)(h/k1/d)β}j=1kλjB(xmjh/k1/d)\displaystyle=f(x)\Bigg{[}\frac{1}{4}+\left\{(a-1/2)\left(h/k^{1/d}\right)^{\alpha}+(y-1/2)\left(h/k^{1/d}\right)^{\beta}\right\}\sum_{j=1}^{k}\lambda_{j}B\left(\frac{x-m_{j}}{h/k^{1/d}}\right)
+(2a1)(2y1)(h/k1/d)2sj=1kB(xmjh/k1/d)2]\displaystyle\hskip 50.58878pt+(2a-1)(2y-1)\left(h/k^{1/d}\right)^{2s}\sum_{j=1}^{k}B\left(\frac{x-m_{j}}{h/k^{1/d}}\right)^{2}\Bigg{]}

where s(α+β)/2s\equiv(\alpha+\beta)/2.


We note that both densities reduce to f(x)/4 for all x𝒞2h(x0)x\notin\mathcal{C}_{2h}(x_{0}), since B(xmjh/k1/d)=0B\left(\frac{x-m_{j}}{h/k^{1/d}}\right)=0 for x𝒞h/k1/d(mj)x\notin\mathcal{C}_{h/k^{1/d}}(m_{j}) and 𝒞h/k1/d(mj)𝒞h(x0)𝒞2h(x0)\mathcal{C}_{h/k^{1/d}}(m_{j})\subseteq\mathcal{C}_{h}(x_{0})\subseteq\mathcal{C}_{2h}(x_{0}), and since B(xx0h)=0B\left(\frac{x-x_{0}}{h}\right)=0 for x𝒞2h(x0)x\notin\mathcal{C}_{2h}(x_{0}).

B.3 Proof of Proposition 3

Proof.

To ease notation we prove the result in the d=1d=1 case but the logic is the same when d>1d>1. First note that the local polynomial projection operator Lg(x)g(x)wh(x;x)𝑑xLg(x^{\prime})\equiv\int g(x)w_{h}(x;x^{\prime})\ dx for

wh(x;x)ρh(x)TQ1ρh(x)Kh(x)π(x){1π(x)}f(x)w_{h}(x;x^{\prime})\equiv\rho_{h}(x^{\prime})^{\mathrm{\scriptscriptstyle T}}Q^{-1}\rho_{h}(x)K_{h}(x)\pi(x)\{1-\pi(x)\}f(x)

reproduces polynomials, in the sense that, for any polynomial of the form g(x)=aTρh(x)g(x)=a^{\mathrm{\scriptscriptstyle T}}\rho_{h}(x), aqa\in\mathbb{R}^{q}, we have

Lg(x)\displaystyle Lg(x^{\prime}) ={aTρh(x)}wh(x;x)𝑑x\displaystyle=\int\Big{\{}a^{\mathrm{\scriptscriptstyle T}}\rho_{h}(x)\Big{\}}w_{h}(x;x^{\prime})\ dx
=ρh(x)TQ1ρh(x)Kh(x)π(x){1π(x)}ρh(x)Tf(x)𝑑xa\displaystyle=\rho_{h}(x^{\prime})^{\mathrm{\scriptscriptstyle T}}Q^{-1}\int\rho_{h}(x)K_{h}(x)\pi(x)\{1-\pi(x)\}\rho_{h}(x)^{\mathrm{\scriptscriptstyle T}}f(x)\ dx\ a
=ρh(x)TQ1Qa=aTρh(x)=g(x).\displaystyle=\rho_{h}(x^{\prime})^{\mathrm{\scriptscriptstyle T}}Q^{-1}Qa=a^{\mathrm{\scriptscriptstyle T}}\rho_{h}(x^{\prime})=g(x^{\prime}).

Then for xx0C2h\|x^{\prime}-x_{0}\|\leq C_{2}h we have that τh(x)τ(x)=wh(x;x){τ(x)τ(x)}𝑑x\tau_{h}(x^{\prime})-\tau(x^{\prime})=\int w_{h}(x;x^{\prime})\Big{\{}\tau(x)-\tau(x^{\prime})\Big{\}}dx equals

wh(x;x)\displaystyle\int w_{h}(x;x^{\prime}) {j=1γ1Djτ(x)j!(xx)j+Dγτ(x)γ!(xx)γ}dx\displaystyle\left\{\sum_{j=1}^{\lfloor\gamma\rfloor-1}\frac{D^{j}\tau(x^{\prime})}{j!}(x-x^{\prime})^{j}+\frac{D^{\lfloor\gamma\rfloor}\tau(x^{*})}{\lfloor\gamma\rfloor!}(x-x^{\prime})^{\lfloor\gamma\rfloor}\right\}dx
=0+wh(x;x){Dγτ(x+ϵ(xx))Dγτ(x)γ!}(xx)γ𝑑x\displaystyle=0+\int w_{h}(x;x^{\prime})\left\{\frac{D^{\lfloor\gamma\rfloor}\tau(x^{\prime}+\epsilon(x-x^{\prime}))-D^{\lfloor\gamma\rfloor}\tau(x^{\prime})}{{\lfloor\gamma\rfloor}!}\right\}(x-x^{\prime})^{\lfloor\gamma\rfloor}\ dx
|wh(x;x)|C1xxγγγ!xxγ𝑑x\displaystyle\leq\int\left|w_{h}(x;x^{\prime})\right|\frac{C_{1}\|x-x^{\prime}\|^{\gamma-{\lfloor\gamma\rfloor}}}{{\lfloor\gamma\rfloor}!}\|x-x^{\prime}\|^{{\lfloor\gamma\rfloor}}\ dx
C1γ!|wh(x;x)|(xx0+C2h)γ𝑑x\displaystyle\leq\frac{C_{1}}{{\lfloor\gamma\rfloor}!}\int\left|w_{h}(x;x^{\prime})\right|\left(\|x-x_{0}\|+C_{2}h\right)^{\gamma}dx
C1hγγ!{hd|wh(x0+hu;x)|(u+C2)γ𝑑u}\displaystyle\leq\frac{C_{1}h^{\gamma}}{{\lfloor\gamma\rfloor}!}\left\{h^{d}\int\left|w_{h}(x_{0}+hu;x^{\prime})\right|(\|u\|+C_{2})^{\gamma}\ du\right\} (24)

where the first line follows by Taylor expansion (with x=x+ϵ(xx)x^{*}=x^{\prime}+\epsilon(x-x^{\prime}) for some ϵ[0,1]\epsilon\in[0,1]), the second by the polynomial reproducing property, the third since τ\tau is γ\gamma-smooth, the fourth by the triangle inequality with x0xC2h\|x_{0}-x^{\prime}\|\leq C_{2}h, and the last by a change of variable u=(xx0)/hu=(x-x_{0})/h (so that dx=hddudx=h^{d}\ du).

Now the result follows since we show the term on the right in (24) (in brackets) is bounded under the stated assumptions. Specifically it equals

(u+C2)γ|ρh(x)TQ1ρh(x0+hu)|1(u1/2)\displaystyle\int(\|u\|+C_{2})^{\gamma}\left|\rho_{h}(x^{\prime})^{\mathrm{\scriptscriptstyle T}}Q^{-1}\rho_{h}(x_{0}+hu)\right|\mathbbold{1}(\|u\|\leq 1/2)
×π(x0+hu){1π(x0+hu)}f(x0+hu)du\displaystyle\hskip 36.135pt\times\pi(x_{0}+hu)\{1-\pi(x_{0}+hu)\}f(x_{0}+hu)\ du
(1/4)(1/2+C2)γρ(1/2)Q11/21/2ρ(1/2+u)f(x0+hu)𝑑u\displaystyle\leq(1/4)(1/2+C_{2})^{\gamma}\|\rho(1/2)\|\left\|Q^{-1}\right\|\int_{-1/2}^{1/2}\|\rho(1/2+u)\|f(x_{0}+hu)\ du
=(C3q2/4)(1/2+C2)γQ1hd1(xx0h/2)𝑑F(x)1\displaystyle=(C_{3}q^{2}/4)(1/2+C_{2})^{\gamma}\left\|Q^{-1}\right\|h^{-d}\int\mathbbold{1}(\|x-x_{0}\|\leq h/2)\ dF(x)\lesssim 1

where the second line follows from the submultiplicative property of the operator norm and since π(1π)1/4\pi(1-\pi)\leq 1/4, and the last from Assumptions 2 and 3 and since

ρ(x)2\displaystyle\|\rho(x)\|^{2} (d+γγ)(2γ+1)C3q2\displaystyle\leq{{d+\lfloor\gamma\rfloor}\choose{\lfloor\gamma\rfloor}}(2\lfloor\gamma\rfloor+1)\leq C_{3}q^{2}

for all xx (Belloni et al. [2015], Example 3.1), since each Legendre term satisfies |ρm(xj)|2m+12γ+1|\rho_{m}(x_{j})|\leq\sqrt{2m+1}\leq\sqrt{2\lfloor\gamma\rfloor+1} for mγm\leq\lfloor\gamma\rfloor, and the length of ρ\rho is q=(d+γγ)q={{d+\lfloor\gamma\rfloor}\choose{\lfloor\gamma\rfloor}}, i.e., the maximum number of monomials in a polynomial in dd variables with degree up to γ\lfloor\gamma\rfloor. ∎


B.4 Proof of Proposition 4

Proof.

For the first inequality, we have

|τ^(x0)τh(x0)|\displaystyle|\widehat{\tau}(x_{0})-\tau_{h}(x_{0})| =|ρh(x0)TQ^1{(R^R)+(QQ^)Q1R}|\displaystyle=\left|\rho_{h}(x_{0})^{\mathrm{\scriptscriptstyle T}}\widehat{Q}^{-1}\left\{\left(\widehat{R}-R\right)+\left(Q-\widehat{Q}\right)Q^{-1}R\right\}\right|
ρ(1/2)(Q1+Q^1Q1)((R^R)(Q^Q)(Q1R))\displaystyle\leq\|\rho(1/2)\|\left(\left\|Q^{-1}\right\|+\left\|\widehat{Q}^{-1}-Q^{-1}\right\|\right)\left(\left\|(\widehat{R}-R)-(\widehat{Q}-Q)(Q^{-1}R)\right\|\right)

by the sub-multiplicative and triangle inequalities of the operator norm. For the second inequality, first note ρ(x)Cq\|\rho(x)\|\leq Cq, as described in the proof of Proposition 3. The result now follows from the bounds on Q1\|Q^{-1}\| and Q^1Q1\|\widehat{Q}^{-1}-{Q}^{-1}\| and since

𝔼S^S\displaystyle\mathbb{E}\|\widehat{S}-S\| 𝔼S^S2=j𝔼[𝔼{(S^jSj)2Dn}]\displaystyle\leq\sqrt{\mathbb{E}\|\widehat{S}-S\|^{2}}=\sqrt{\sum_{j}\mathbb{E}\left[\mathbb{E}\Big{\{}(\widehat{S}_{j}-S_{j})^{2}\mid D^{n}\Big{\}}\right]}
=j𝔼{bias(S^jDn)2+var(S^jDn)}\displaystyle=\sqrt{\sum_{j}\mathbb{E}\left\{\text{bias}(\widehat{S}_{j}\mid D^{n})^{2}+\text{var}(\widehat{S}_{j}\mid D^{n})\right\}}
(d+γγ)maxj𝔼{bias(S^jDn)2+var(S^jDn)}\displaystyle\leq\sqrt{{{d+\lfloor\gamma\rfloor}\choose{\lfloor\gamma\rfloor}}}\max_{j}\sqrt{\mathbb{E}\left\{\text{bias}(\widehat{S}_{j}\mid D^{n})^{2}+\text{var}(\widehat{S}_{j}\mid D^{n})\right\}}

by Jensen’s inequality and iterated expectation. The last line follows since the length of SS is (d+γγ){{d+\lfloor\gamma\rfloor}\choose{\lfloor\gamma\rfloor}}. ∎


B.5 Eigenvalues of QQ and Ω\Omega

Proposition S5.

If (i) ϵπ(x)1ϵ\epsilon\leq\pi(x)\leq 1-\epsilon and (ii) the density dF(x)dF(x) is bounded above and below away from zero on {x:xx0h/2}\{x:\|x-x_{0}\|\leq h/2\}, then the eigenvalues of QQ are bounded above and below away from zero.


Proof.

Define the stretched function g(v)=g(x0+h(v1/2))g^{*}(v)=g(x_{0}+h(v-1/2)) for any g:dg:\mathbb{R}^{d}\mapsto\mathbb{R}. This maps values of gg in the small cube [x0h/2,x0+h/2]d[x_{0}-h/2,x_{0}+h/2]^{d} around x0x_{0} to the unit cube [0,1]d[0,1]^{d}. Then note that, with the change of variables v=12+xx0hv=\frac{1}{2}+\frac{x-x_{0}}{h},

Q\displaystyle Q =ρh(x)Kh(x)π(x){1π(x)}ρh(x)T𝑑F(x)\displaystyle=\int\rho_{h}(x)K_{h}(x)\pi(x)\{1-\pi(x)\}\rho_{h}(x)^{\mathrm{\scriptscriptstyle T}}\ dF(x)
=v1/21/2ρ(v)ρ(v)Tπ(v){1π(v)}𝑑F(v).\displaystyle=\int_{\|v-1/2\|\leq 1/2}\rho(v)\rho(v)^{\mathrm{\scriptscriptstyle T}}\pi^{*}(v)\{1-\pi^{*}(v)\}\ dF^{*}(v).

Next note that ϵ(1ϵ)π(1π)1/4\epsilon(1-\epsilon)\leq\pi(1-\pi)\leq 1/4, so the eigenvalues of QQ will be bounded if those of the matrix

ρ(v)ρ(v)T𝑑F(x0+h(v1/2))\int\rho(v)\rho(v)^{\mathrm{\scriptscriptstyle T}}\ dF(x_{0}+h(v-1/2))

are. But ρ(x)ρ(x)T𝑑x=I\int\rho(x)\rho(x)^{\mathrm{\scriptscriptstyle T}}\ dx=I by orthonormality of the Legendre polynomials on [0,1]d[0,1]^{d}, and the local boundedness of dFdF ensures dF/dμdF^{*}/d\mu is bounded above and below away from zero, for μ\mu the uniform measure. Therefore Proposition 2.1 of Belloni et al. [2015] yields the result. ∎


Proposition S6.

If (i) the basis b(v)b(v) is orthonormal with respect to the uniform measure, and (ii) the density dF(x)dF(x) is bounded above and below away from zero on {x:xx0h/2}\{x:\|x-x_{0}\|\leq h/2\}, then the eigenvalues of Ω\Omega are bounded above and below away from zero.


Proof.

The proof is similar to that of Proposition S5. First note that

Ω=bhk(x)Kh(x)bhk(x)T𝑑F(x)=b(v)b(v)T𝑑F(v)\displaystyle\Omega=\int b_{hk}(x)K_{h}(x)b_{hk}(x)^{\mathrm{\scriptscriptstyle T}}dF(x)=\int b(v)b(v)^{\mathrm{\scriptscriptstyle T}}\ dF^{*}(v)

by the change of variables v=12+xx0hv=\frac{1}{2}+\frac{x-x_{0}}{h}, and where dF(v)=dF(x0+h(v1/2))dF^{*}(v)=dF(x_{0}+h(v-1/2)) as before. Further b(v)b(v)T𝑑v=I\int b(v)b(v)^{\mathrm{\scriptscriptstyle T}}\ dv=I by the assumed orthonormality, and the local boundedness of dFdF ensures dF/dμdF^{*}/d\mu is bounded above and below away from zero, for μ\mu the uniform measure. Therefore Proposition 2.1 of Belloni et al. [2015] yields the result. ∎


B.6 Proof of Proposition 7

Proof.

Given the training data DnD^{n}, S^j=[R^Q^θ]j\widehat{S}_{j}=[\widehat{R}-\widehat{Q}\theta]_{j} is a second-order U-statistic with kernel

ξ(Z1,Z2)=ρh(X1)\displaystyle\xi(Z_{1},Z_{2})=\rho_{h}(X_{1}) Kh(X1)[{φ^y1(Z1)φ^a1(Z1)ρh(X1)Tθ}\displaystyle K_{h}(X_{1})\Big{[}\Big{\{}\widehat{\varphi}_{y1}(Z_{1})-\widehat{\varphi}_{a1}(Z_{1})\rho_{h}(X_{1})^{\mathrm{\scriptscriptstyle T}}\theta\Big{\}}
+{φ^y2(Z1,Z2)φ^a2(Z1,Z2)ρh(X1)Tθ}Kh(X2)]\displaystyle+\Big{\{}\widehat{\varphi}_{y2}(Z_{1},Z_{2})-\widehat{\varphi}_{a2}(Z_{1},Z_{2})\rho_{h}(X_{1})^{\mathrm{\scriptscriptstyle T}}\theta\Big{\}}K_{h}(X_{2})\Big{]}

Thus its conditional variance is given by the usual variance of a second-order U-statistic (e.g., Lemma 6 of Robins et al. [2009a]), which is

var{𝕌n(ξ)Dn}\displaystyle\text{var}\left\{\mathbb{U}_{n}(\xi)\mid D^{n}\right\} =(4(n2)n(n1))var{ξ1(Z1)Dn}+(2n(n1))var{ξ(Z1,Z2)Dn}\displaystyle=\left(\frac{4(n-2)}{n(n-1)}\right)\text{var}\{\xi_{1}(Z_{1})\mid D^{n}\}+\left(\frac{2}{n(n-1)}\right)\text{var}\{\xi(Z_{1},Z_{2})\mid D^{n}\}
4𝔼{ξ1(Z1)2Dn}n+4𝔼{ξ(Z1,Z2)2Dn}n2\displaystyle\leq\frac{4\mathbb{E}\{\xi_{1}(Z_{1})^{2}\mid D^{n}\}}{n}+\frac{4\mathbb{E}\{\xi(Z_{1},Z_{2})^{2}\mid D^{n}\}}{n^{2}} (25)

for ξ1(z1)=ξ(z1,z2)𝑑(z2)\xi_{1}(z_{1})=\int\xi(z_{1},z_{2})\ d\mathbb{P}(z_{2}), if n2n\geq 2. In our case ξ1\xi_{1} equals

ρh(X1)Kh(X1)\displaystyle\rho_{h}(X_{1})K_{h}(X_{1}) [φ^y1(Z1)φ^a1(Z1)ρh(X1)Tθ\displaystyle\Big{[}\widehat{\varphi}_{y1}(Z_{1})-\widehat{\varphi}_{a1}(Z_{1})\rho_{h}(X_{1})^{\mathrm{\scriptscriptstyle T}}\theta
+{A1π^(X1)}{Π^b(μ0μ^0)(X1)Π^bπ^(X1)ρh(X1)Tθ}].\displaystyle\hskip 28.90755pt+\{A_{1}-\widehat{\pi}(X_{1})\}\Big{\{}\widehat{\Pi}_{b}(\mu_{0}^{*}-\widehat{\mu}_{0}^{*})(X_{1})-\widehat{\Pi}_{b}\widehat{\pi}^{*}(X_{1})\rho_{h}(X_{1})^{\mathrm{\scriptscriptstyle T}}\theta\Big{\}}\Big{]}.

Therefore for the first term in (25) we have

ξ12𝑑\displaystyle\int\xi_{1}^{2}\ d\mathbb{P} 2(ρh2Kh2(φ^y1φ^a1ρhTθ)2d\displaystyle\leq 2\bigg{(}\int\rho_{h}^{2}K_{h}^{2}(\widehat{\varphi}_{y1}-\widehat{\varphi}_{a1}\rho_{h}^{\mathrm{\scriptscriptstyle T}}\theta)^{2}\ d\mathbb{P}
+ρh2Kh2{π(1π)+(ππ^)2}{Π^b(μ0μ^0)+Π^bπρhTθ}2dF)\displaystyle\hskip 28.90755pt+\int\rho_{h}^{2}K_{h}^{2}\left\{\pi(1-\pi)+(\pi-\widehat{\pi})^{2}\right\}\left\{\widehat{\Pi}_{b}(\mu_{0}^{*}-\widehat{\mu}_{0}^{*})+\widehat{\Pi}_{b}\pi^{*}\rho_{h}^{\mathrm{\scriptscriptstyle T}}\theta\right\}^{2}dF\bigg{)}
1hd(ρ(v)2𝑑F(v)+ρ(v)2[{Π^b(μ0μ^0)(v)}2+{Π^bπ(v)ρ(v)Tθ}2]𝑑F(v))\displaystyle\lesssim\frac{1}{h^{d}}\left(\int\rho(v)^{2}\ dF^{*}(v)+\int\rho(v)^{2}\left[\left\{\widehat{\Pi}_{b}(\mu_{0}^{*}-\widehat{\mu}_{0}^{*})(v)\right\}^{2}+\left\{\widehat{\Pi}_{b}\pi^{*}(v)\rho(v)^{\mathrm{\scriptscriptstyle T}}\theta\right\}^{2}\right]\ dF^{*}(v)\right)
1hd(1+μ^0μ0F2)\displaystyle\lesssim\frac{1}{h^{d}}\left(1+\|\widehat{\mu}_{0}-\mu_{0}\|_{F^{*}}^{2}\right)

where the second inequality follows by the change of variables v=12+xx0hv=\frac{1}{2}+\frac{x-x_{0}}{h}, and since all of (φ^y1,φ^a1,π,π^)(\widehat{\varphi}_{y1},\widehat{\varphi}_{a1},\pi,\widehat{\pi}) are uniformly bounded, 0Kh(x)hd0\leq K_{h}(x)\lesssim h^{-d}, and supvρ(v)q\sup_{v}\|\rho(v)\|\lesssim q (Belloni et al. [2015], Example 3.1); and the third follows from Lemma S1(ii). Now, consider the second term in (25); we only detail the term in ξ\xi involving φ^y2\widehat{\varphi}_{y2} since the logic is the same for the φ^a2\widehat{\varphi}_{a2} term, and the terms involving (φ^y1,φ^a1)(\widehat{\varphi}_{y1},\widehat{\varphi}_{a1}) are bounded by standard arguments. Letting b¯j(x)Ω1/2bj(x)\overline{b}_{j}(x)\equiv\Omega^{-1/2}b_{j}(x) and MΩ1/2Ω^1Ω1/2M\equiv\Omega^{1/2}\widehat{\Omega}^{-1}\Omega^{1/2}, we have 𝔼{ξ(Z1,Z2)2Dn}\mathbb{E}\{\xi(Z_{1},Z_{2})^{2}\mid D^{n}\} upper bounded by a multiple of

[ρh(x1)\displaystyle\int\bigg{[}\rho_{h}(x_{1}) {a1π^(x1)}{bh(x1)TΩ^1bh(x2)}{y2μ^0(x2)}Kh(x1)Kh(x2)]2d(z1)d(z2)\displaystyle\{a_{1}-\widehat{\pi}(x_{1})\}\Big{\{}{b}_{h}(x_{1})^{\mathrm{\scriptscriptstyle T}}\widehat{\Omega}^{-1}{b}_{h}(x_{2})\Big{\}}\{y_{2}-\widehat{\mu}_{0}(x_{2})\}K_{h}(x_{1})K_{h}(x_{2})\bigg{]}^{2}\ d\mathbb{P}(z_{1})\ d\mathbb{P}(z_{2})
1h2dρh(x1)2{bhk(x1)TΩ^1bhk(x2)}2Kh(x1)Kh(x2)𝑑F(x1)𝑑F(x2)\displaystyle\lesssim\frac{1}{h^{2d}}\int\rho_{h}(x_{1})^{2}\Big{\{}b_{hk}(x_{1})^{\mathrm{\scriptscriptstyle T}}\widehat{\Omega}^{-1}b_{hk}(x_{2})\Big{\}}^{2}K_{h}(x_{1})K_{h}(x_{2})\ dF(x_{1})\ dF(x_{2})
1h2d{b¯(v1)TMb¯(v2)}2𝑑F(v1)𝑑F(v2)\displaystyle\lesssim\frac{1}{h^{2d}}\int\Big{\{}\overline{b}(v_{1})^{\mathrm{\scriptscriptstyle T}}M\overline{b}(v_{2})\Big{\}}^{2}\ dF^{*}(v_{1})\ dF^{*}(v_{2})
=M22h2d=(1h2d)Ω1/2Ω1Ω1/2+Ω1/2(Ω^1Ω1)Ω1/222\displaystyle=\frac{\|M\|_{2}^{2}}{h^{2d}}=\left(\frac{1}{h^{2d}}\right)\|\Omega^{1/2}\Omega^{-1}\Omega^{1/2}+\Omega^{1/2}(\widehat{\Omega}^{-1}-\Omega^{-1})\Omega^{1/2}\|_{2}^{2}
2(1h2d)(I22+Ω1/2(Ω^1Ω1)Ω1/222)\displaystyle\leq 2\left(\frac{1}{h^{2d}}\right)\left(\|I\|_{2}^{2}+\|\Omega^{1/2}(\widehat{\Omega}^{-1}-\Omega^{-1})\Omega^{1/2}\|_{2}^{2}\right)
2(1h2d)k(1+Ω2Ω^1Ω12)\displaystyle\leq 2\left(\frac{1}{h^{2d}}\right)k\left(1+\|\Omega\|^{2}\|\widehat{\Omega}^{-1}-\Omega^{-1}\|^{2}\right)

where the first line follows by definition, the second since (aπ^)(a-\widehat{\pi}) and (yμ^0)(y-\widehat{\mu}_{0}) are uniformly bounded and 0Kh(x)hd0\leq K_{h}(x)\lesssim h^{-d}, the third by a change of variables v=12+xx0hv=\frac{1}{2}+\frac{x-x_{0}}{h}, by definition of b¯\overline{b} and MM, and since ρ(v)\rho(v) is bounded, the fourth by definition since b¯b¯T𝑑F=I\int\overline{b}\overline{b}^{\mathrm{\scriptscriptstyle T}}\ dF^{*}=I, and the last two by basic properties of the Frobenius norm (e.g., the triangle inequality and A2kA\|A\|_{2}\leq\sqrt{k}\|A\|) together with (a+b)22(a2+b2)(a+b)^{2}\leq 2(a^{2}+b^{2}). ∎
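
As a small sanity check on the classical U-statistic variance formula invoked in the display above (stated there conditionally on D^{n}), the following simulation compares the empirical variance of a second-order U-statistic to that formula, in the simplest case of a symmetric kernel and iid standard normal data; the kernel and distribution are arbitrary illustrative choices.

import numpy as np

# Monte Carlo check of the second-order U-statistic variance formula
#   var(U_n) = [4(n-2)/(n(n-1))] var(xi_1(Z1)) + [2/(n(n-1))] var(xi(Z1,Z2)),
# for a simple symmetric kernel xi(z1,z2) = z1 z2 + z1 + z2 and Z ~ N(0,1),
# where xi_1(z) = E[xi(z, Z2)] = z, so var(xi_1) = 1 and var(xi) = 3.

rng = np.random.default_rng(1)
n, reps = 30, 20000

def u_stat(z):                               # U_n = average of xi over ordered pairs i != j
    M = z[:, None] * z[None, :] + z[:, None] + z[None, :]
    return (M.sum() - np.trace(M)) / (n * (n - 1))

u_vals = np.array([u_stat(rng.standard_normal(n)) for _ in range(reps)])
theory = 4 * (n - 2) / (n * (n - 1)) * 1.0 + 2 / (n * (n - 1)) * 3.0
print("empirical var:", u_vals.var(), "   formula:", theory)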


B.7 Technical Lemmas

Lemma S1.

Let Πwf(t)=b(t)TΩ1b(x)w(x)f(x)𝑑(x)\Pi_{w}f(t)=b(t)^{\mathrm{\scriptscriptstyle T}}\Omega^{-1}\int b(x)w(x)f(x)\ d\mathbb{P}(x) denote a ww-weighted projection with Ω=bwbT𝑑\Omega=\int bwb^{T}d\mathbb{P} for w0w\geq 0. And let θw,f=Ω1b(x)w(x)f(x)𝑑(x)\theta_{w,f}=\Omega^{-1}\int b(x)w(x)f(x)\ d\mathbb{P}(x) denote the coefficients of the projection. Then:

  (i) projections are orthogonal to residuals, i.e.,

    (Πwf)(IΠw)gw𝑑=0,\int(\Pi_{w}f)(I-\Pi_{w})g\ wd\mathbb{P}=0,

  (ii) the L2(w)L_{2}(w\mathbb{P}) norm of a projection is no more than that of the function, i.e.,

    (Πwf)2w𝑑f2w𝑑,\int(\Pi_{w}f)^{2}w\ d\mathbb{P}\leq\int f^{2}\ wd\mathbb{P},

    and the L2(w)L_{2}(w\mathbb{P}) norm of the approximation error is no more than that of the function, i.e.,

    {(IΠw)f}2w𝑑f2w𝑑,\int\{(I-\Pi_{w})f\}^{2}w\ d\mathbb{P}\leq\int f^{2}\ wd\mathbb{P},

  (iii) the L2L_{2} norm of the scaled coefficients is no more than the L2(w)L_{2}(w\mathbb{P}) norm of the function, i.e.,

    Ω1/2θw,f2f2w𝑑.\|\Omega^{1/2}\theta_{w,f}\|^{2}\leq\int f^{2}\ wd\mathbb{P}.
Proof.

For (i) let b(x)=Ω1/2b(x)b^{*}(x)=\Omega^{-1/2}b(x) so that bw(b)T𝑑=I\int b^{*}w(b^{*})^{\mathrm{\scriptscriptstyle T}}d\mathbb{P}=I and note that

(Πwf)(IΠw)\displaystyle\int(\Pi_{w}f)(I-\Pi_{w}) gwd=b(x)TΩ1b(t)f(t)g(x)w𝑑(t)w𝑑(x)\displaystyle g\ wd\mathbb{P}=\int\int b(x)^{\mathrm{\scriptscriptstyle T}}\Omega^{-1}b(t)f(t)g(x)\ wd\mathbb{P}(t)\ wd\mathbb{P}(x)
b(x)TΩ1b(t)f(t)b(x)TΩ1b(u)g(u)w𝑑(t)w𝑑(x)w𝑑(u)\displaystyle\hskip 65.04256pt-\int b(x)^{\mathrm{\scriptscriptstyle T}}\Omega^{-1}b(t)f(t)b(x)^{\mathrm{\scriptscriptstyle T}}\Omega^{-1}b(u)g(u)\ wd\mathbb{P}(t)\ wd\mathbb{P}(x)\ wd\mathbb{P}(u)
=b(x)Tb(t)f(t)g(x)w𝑑(t)w𝑑(x)\displaystyle=\int b^{*}(x)^{\mathrm{\scriptscriptstyle T}}b^{*}(t)f(t)g(x)\ wd\mathbb{P}(t)\ wd\mathbb{P}(x)
f(t)b(t)Tb(x)b(x)Tb(u)g(u)w𝑑(t)w𝑑(x)w𝑑(u)\displaystyle\hskip 65.04256pt-\int f(t)b^{*}(t)^{\mathrm{\scriptscriptstyle T}}b^{*}(x)b^{*}(x)^{\mathrm{\scriptscriptstyle T}}b^{*}(u)g(u)\ wd\mathbb{P}(t)\ wd\mathbb{P}(x)\ wd\mathbb{P}(u)
=b(x)Tb(t)f(t)g(x)w𝑑(t)w𝑑(x)\displaystyle=\int b^{*}(x)^{\mathrm{\scriptscriptstyle T}}b^{*}(t)f(t)g(x)\ wd\mathbb{P}(t)\ wd\mathbb{P}(x)
f(t)b(t)Tb(u)g(u)w𝑑(t)w𝑑(u)=0\displaystyle\hskip 65.04256pt-\int f(t)b^{*}(t)^{\mathrm{\scriptscriptstyle T}}b^{*}(u)g(u)\ wd\mathbb{P}(t)\ wd\mathbb{P}(u)=0

where the last line follows since b(x)b(x)Tw𝑑(x)=I\int b^{*}(x)b^{*}(x)^{\mathrm{\scriptscriptstyle T}}\ wd\mathbb{P}(x)=I.

For (ii) we have by definition that {Πwf(x)}2w(x)𝑑F(x)\int\{\Pi_{w}f(x)\}^{2}w(x)\ dF(x) equals

\displaystyle\int {f(u)w(u)b(u)T𝑑(u)Ω1b(x)}{b(x)TΩ1b(t)w(t)f(t)𝑑(t)}w(x)d(x)\displaystyle\left\{\int f(u)w(u)b(u)^{\mathrm{\scriptscriptstyle T}}\ d\mathbb{P}(u)\Omega^{-1}b(x)\right\}\left\{b(x)^{\mathrm{\scriptscriptstyle T}}\Omega^{-1}\int b(t)w(t)f(t)\ d\mathbb{P}(t)\right\}w(x)d\mathbb{P}(x)
={f(u)w(u)b(u)T𝑑(u)Ω1}{b(t)w(t)f(t)𝑑(t)}\displaystyle=\left\{\int f(u)w(u)b(u)^{\mathrm{\scriptscriptstyle T}}\ d\mathbb{P}(u)\Omega^{-1}\right\}\left\{\int b(t)w(t)f(t)\ d\mathbb{P}(t)\right\} (26)
=f(t)Πwf(t)w(t)𝑑(t)\displaystyle=\int f(t)\Pi_{w}f(t)w(t)\ d\mathbb{P}(t)

and so

(fΠwf)2w𝑑\displaystyle\int(f-\Pi_{w}f)^{2}w\ d\mathbb{P} =f2w𝑑2fΠwfw𝑑+(Πwf)2w𝑑\displaystyle=\int f^{2}w\ d\mathbb{P}-2\int f\Pi_{w}fw\ d\mathbb{P}+\int(\Pi_{w}f)^{2}w\ d\mathbb{P}
=f2w𝑑(Πwf)2w𝑑\displaystyle=\int f^{2}w\ d\mathbb{P}-\int(\Pi_{w}f)^{2}w\ d\mathbb{P}

which implies the results, as long as w0w\geq 0, so that the far left and right sides are non-negative.

For (iii) we have

Ω1/2θw,f2\displaystyle\|\Omega^{1/2}\theta_{w,f}\|^{2} =(Ω1/2θw,f)T(Ω1/2θw,f)=θw,fTΩθw,f\displaystyle=(\Omega^{1/2}\theta_{w,f})^{\mathrm{\scriptscriptstyle T}}(\Omega^{1/2}\theta_{w,f})=\theta_{w,f}^{\mathrm{\scriptscriptstyle T}}\Omega\theta_{w,f}
=(bTwf𝑑)Ω1(bwf𝑑)\displaystyle=\left(\int b^{\mathrm{\scriptscriptstyle T}}wf\ d\mathbb{P}\right)\Omega^{-1}\left(\int bwf\ d\mathbb{P}\right)
={Πwf(x)}2w(x)𝑑(x)\displaystyle=\int\{\Pi_{w}f(x)\}^{2}w(x)\ d\mathbb{P}(x)

where the last equality holds by (26), and so the result follows from Lemma S1(ii). ∎
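
As a concrete finite-dimensional illustration of Lemma S1 (not used in any proof), one can take \mathbb{P} to be an empirical measure, so that \Pi_{w} becomes a weighted least-squares projection; the basis, weight function, and test functions below are arbitrary choices.

import numpy as np

# Numerical illustration of Lemma S1 with P the empirical measure on x_1,...,x_n:
#   Omega = (1/n) sum_i b(x_i) w(x_i) b(x_i)',  Pi_w f = B theta_{w,f},
#   theta_{w,f} = Omega^{-1} (1/n) sum_i b(x_i) w(x_i) f(x_i).

rng = np.random.default_rng(2)
n = 500
x = rng.uniform(0, 1, n)
w = 0.5 + x                                   # arbitrary nonnegative weight
B = np.vander(x, 4, increasing=True)          # basis b(x) = (1, x, x^2, x^3)
f, g = np.sin(3 * x), np.exp(x)               # arbitrary test functions

Omega = B.T @ (B * (w / n)[:, None])
theta_f = np.linalg.solve(Omega, B.T @ (w * f) / n)
theta_g = np.linalg.solve(Omega, B.T @ (w * g) / n)
Pf, Pg = B @ theta_f, B @ theta_g
wnorm2 = lambda u: np.mean(w * u ** 2)        # squared L2(w P_n) norm

print("(i)   projection orthogonal to residual:", np.mean(w * Pf * (g - Pg)))   # ~ 0
print("(ii)  ||Pi_w f||^2 <= ||f||^2:", wnorm2(Pf) <= wnorm2(f))
print("      ||(I - Pi_w) f||^2 <= ||f||^2:", wnorm2(f - Pf) <= wnorm2(f))
print("(iii) ||Omega^{1/2} theta_{w,f}||^2 <= ||f||^2:",
      np.isclose(theta_f @ Omega @ theta_f, wnorm2(Pf)) and theta_f @ Omega @ theta_f <= wnorm2(f))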


Lemma S2.

Assume:

  1. 0<bλmin(Ω)λmax(Ω)B<0<b\leq\lambda_{\min}(\Omega)\leq\lambda_{\max}(\Omega)\leq B<\infty,

  2. 0<cdF^/dFC<0<c\leq d\widehat{F}^{*}/dF^{*}\leq C<\infty pointwise, so that in particular dF^/dFC\|d\widehat{F}^{*}/dF^{*}\|_{\infty}\leq C and dF/dF^1/c\|dF^{*}/d\widehat{F}^{*}\|_{\infty}\leq 1/c.

Then

bcλmin(Ω^)λmax(Ω^)BCbc\leq\lambda_{\min}(\widehat{\Omega})\leq\lambda_{\max}(\widehat{\Omega})\leq BC

and

Ω^1Ω1(Bb2c)(dF^/dF)1.\|\widehat{\Omega}^{-1}-\Omega^{-1}\|\leq\left(\frac{B}{b^{2}c}\right)\|(d\widehat{F}^{*}/dF^{*})-1\|_{\infty}.

Proof.

The logic mirrors that of the proof of Proposition 2.1 in Belloni et al. [2015]. Note that

aTΩ^a\displaystyle a^{\mathrm{\scriptscriptstyle T}}\widehat{\Omega}a ={aTb(v)}2𝑑F^(v)\displaystyle=\int\Big{\{}a^{\mathrm{\scriptscriptstyle T}}b(v)\Big{\}}^{2}\ d\widehat{F}^{*}(v)
dF^/dF{aTb(v)}2𝑑F(v)\displaystyle\leq\|d\widehat{F}^{*}/dF^{*}\|_{\infty}\int\Big{\{}a^{\mathrm{\scriptscriptstyle T}}b(v)\Big{\}}^{2}\ d{F}^{*}(v)
=dF^/dFaTΩa\displaystyle=\|d\widehat{F}^{*}/dF^{*}\|_{\infty}\ a^{\mathrm{\scriptscriptstyle T}}\Omega a

and by the same logic, aTΩadF/dF^aTΩ^aa^{\mathrm{\scriptscriptstyle T}}\Omega a\leq\|dF^{*}/d\widehat{F}^{*}\|_{\infty}\ a^{\mathrm{\scriptscriptstyle T}}\widehat{\Omega}a. Therefore

λmax(Ω^)\displaystyle\lambda_{\max}(\widehat{\Omega}) =maxa=1aTΩ^adF^/dFλmax(Ω)\displaystyle=\max_{\|a\|=1}a^{\mathrm{\scriptscriptstyle T}}\widehat{\Omega}a\leq\|d\widehat{F}^{*}/dF^{*}\|_{\infty}\lambda_{\max}(\Omega)
λmin(Ω^)\displaystyle\lambda_{\min}(\widehat{\Omega}) =mina=1aTΩ^adF/dF^1λmin(Ω)\displaystyle=\min_{\|a\|=1}a^{\mathrm{\scriptscriptstyle T}}\widehat{\Omega}a\geq\|dF^{*}/d\widehat{F}^{*}\|_{\infty}^{-1}\lambda_{\min}(\Omega)

by the min-max theorem, which gives the first inequality. For the second, note

Ω^1Ω1\displaystyle\|\widehat{\Omega}^{-1}-\Omega^{-1}\| =Ω^1(ΩΩ^)Ω1\displaystyle=\|\widehat{\Omega}^{-1}(\Omega-\widehat{\Omega})\Omega^{-1}\|
Ω^1ΩΩ^Ω1\displaystyle\leq\|\widehat{\Omega}^{-1}\|\|\Omega-\widehat{\Omega}\|\|\Omega^{-1}\|

by the sub-multiplicative property of the operator norm, and then

ΩΩ^(dF^/dF)1Ω\|\Omega-\widehat{\Omega}\|\leq\|(d\widehat{F}^{*}/dF^{*})-1\|_{\infty}\ \|\Omega\|

by the same logic as above. ∎


B.8 Covariate Density

Although the lower bounds in Theorems 1 and 3 are given for a simple model in which the covariate density is merely bounded, the rates are only attainable when the covariate density is known or can be estimated accurately enough, as discussed in Remark 11. More specifically, the lower and upper bounds match for a model in which the covariate density dF(x)dF(x) is known, satisfies 1{xx0h/2}𝑑F(x)hd\int\mathbbold{1}\{\|x-x_{0}\|\leq h/2\}\ dF(x)\asymp h^{d}, and has local support {xd:dF(x)>0,xx0h/2}\{x\in\mathbb{R}^{d}:dF(x)>0,\|x-x_{0}\|\leq h/2\} equal to a union of no more than kk disjoint cubes, all of proportional volume, for hh and kk as defined in Proposition 2. The lower bounds follow from Theorems 1 and 3, since the density used in our construction satisfies these conditions; the matching upper bound is implied by the results in Section 4 together with the following lemma, in which we prove that Condition A holds for this class of densities.

Lemma S3.

Assume:

  1. dF(x)dF(x) satisfies 1{xx0h/2}𝑑F(x)hd\int\mathbbold{1}\{\|x-x_{0}\|\leq h/2\}\ dF(x)\asymp h^{d},

  2. dF(x)dF(x) is bounded above and below away from zero on its local support {xd:dF(x)>0,xx0h/2}\{x\in\mathbb{R}^{d}:dF(x)>0,\|x-x_{0}\|\leq h/2\}, which is a union of no more than kk disjoint cubes, all of proportional volume, for hh and kk as defined in Proposition 2.

Let dF(v)=dF(x0+h(v1/2))dF^{*}(v)=dF(x_{0}+h(v-1/2)) denote the distribution in h(x0)\mathcal{B}_{h}(x_{0}), the hh-ball around x0x_{0}, mapped to [0,1]d[0,1]^{d}, and similarly for π\pi^{*}. Then the eigenvalues of

Q=ρ(v)ρ(v)Tπ(v){1π(v)}𝑑F(v)Q=\int\rho(v)\rho(v)^{\mathrm{\scriptscriptstyle T}}\pi^{*}(v)\{1-\pi^{*}(v)\}\ dF^{*}(v)

are bounded above and below away from zero, and there exists a basis bb with Hölder approximation property (6), for which the eigenvalues of

Ω=b(v)b(v)T𝑑F(v)\Omega=\int b(v)b(v)^{\mathrm{\scriptscriptstyle T}}\ dF^{*}(v)

are bounded above and below away from zero.


Proof.

Note that since ϵ(1ϵ)π(1π)1/4\epsilon(1-\epsilon)\leq\pi(1-\pi)\leq 1/4 and since dF(x)dF(x) is bounded above and below on its support, the eigenvalues of QQ will be bounded if those of the matrix

j=1kρ(v)ρ(v)T1(vMj)𝑑v\sum_{j=1}^{k}\int\rho(v)\rho(v)^{\mathrm{\scriptscriptstyle T}}\mathbbold{1}(v\in M_{j})\ dv

are, where MjM_{j} indicates the jjth disjoint cube making up the local support of FF. By the min-max theorem, the eigenvalues are bounded by the min and max of

aTρ(v)ρ(v)Tj1(vMj)dva={aTρ(v)}2j1(vMj)dva^{\mathrm{\scriptscriptstyle T}}\int\rho(v)\rho(v)^{\mathrm{\scriptscriptstyle T}}\sum_{j}\mathbbold{1}(v\in M_{j})\ dv\ a=\int\Big{\{}a^{\mathrm{\scriptscriptstyle T}}\rho(v)\Big{\}}^{2}\sum_{j}\mathbbold{1}(v\in M_{j})\ dv

over all aqa\in\mathbb{R}^{q} with a=1\|a\|=1. First consider lower bounding the eigenvalues. Note g(v)=aTρ(v)g(v)=a^{\mathrm{\scriptscriptstyle T}}\rho(v) is a polynomial of degree at most qq. Therefore

jg(v)21(vMj)𝑑v\displaystyle\sum_{j}\int g(v)^{2}\mathbbold{1}(v\in M_{j})\ dv g(v)21{g(v)2ϵ}j1(vMj)dv\displaystyle\geq\int g(v)^{2}\mathbbold{1}\{g(v)^{2}\geq\epsilon\}\sum_{j}\mathbbold{1}(v\in M_{j})\ dv
ϵ1{g(v)2ϵ}j1(vMj)dv\displaystyle\geq\epsilon\int\mathbbold{1}\{g(v)^{2}\geq\epsilon\}\sum_{j}\mathbbold{1}(v\in M_{j})\ dv
ϵ{j1(vMj)𝑑v1{g(v)2<ϵ}𝑑v}\displaystyle\geq\epsilon\left\{\sum_{j}\int\mathbbold{1}(v\in M_{j})\ dv-\int\mathbbold{1}\{g(v)^{2}<\epsilon\}\ dv\right\}
\displaystyle\geq\epsilon\left(C^{*}-C\epsilon^{1/2q}\right)\geq\frac{C^{*}}{2}\left(\frac{C^{*}}{2C}\right)^{2q}>0

where the last line follows since

j1(vMj)dv\displaystyle\int\sum_{j}\mathbbold{1}(v\in M_{j})\ dv 𝑑F(v)=𝑑F(x0+h(v1/2))\displaystyle\asymp\int dF^{*}(v)=\int dF(x_{0}+h(v-1/2))
=hd1{xx0h/2}𝑑F(x)C\displaystyle=h^{-d}\int\mathbbold{1}\{\|x-x_{0}\|\leq h/2\}\ dF(x)\geq C^{*}

from a change of variables x=x0+h(v1/2)x=x_{0}+h(v-1/2), so that v=1/2+(xx0)/hv=1/2+(x-x_{0})/h and hddv=dxh^{d}\ dv=dx, together with Assumption 1 of the lemma, and by Theorem 4 of Carbery and Wright [2001]. The last inequality follows by choosing ϵ=(C/2C)2q\epsilon=\left(C^{*}/2C\right)^{2q}, so that CCϵ1/2q=C/2C^{*}-C\epsilon^{1/2q}=C^{*}/2. To upper-bound the eigenvalues, note that

{aTρ(v)}2j1(vMj)dv{aTρ(v)}2𝑑v=a2=1\displaystyle\int\Big{\{}a^{\mathrm{\scriptscriptstyle T}}\rho(v)\Big{\}}^{2}\sum_{j}\mathbbold{1}(v\in M_{j})\ dv\leq\int\Big{\{}a^{\mathrm{\scriptscriptstyle T}}\rho(v)\Big{\}}^{2}\ dv=\|a\|^{2}=1

since ρ(v)ρ(v)T𝑑v=I\int\rho(v)\rho(v)^{\mathrm{\scriptscriptstyle T}}\ dv=I by the orthonormality of ρ\rho.

Now consider the eigenvalues of Ω\Omega. We will construct a basis of order kk for which the eigenvalues of Ω\Omega are bounded and for which the Hölder approximation property (6) holds.

Remark 12.

For simplicity we only consider the case where there are exactly kk cubes in the local support of FF; if there are finitely many cubes, say M<M<\infty, the arguments are more straightforward, orthonormalizing a standard series MM times, once per cube, and taking k/Mk/M terms from each. We omit details in the intermediate case where the number of cubes scales at a rate slower than kk, but we expect similar arguments to those below can be used.

First, let ρ(x)\rho(x) denote the tensor products of a Legendre basis of order ss, orthonormal on [1,1][-1,1], i.e., tensor products of

ρj(x)=j+1/22jj!djdxj(x21)j\rho_{j}(x_{\ell})=\frac{\sqrt{j+1/2}}{2^{j}j!}\frac{d^{j}}{dx_{\ell}^{j}}(x_{\ell}^{2}-1)^{j}

for j{1,,s}j\in\{1,\dots,s\} and {1,,d}\ell\in\{1,\dots,d\}. This basis satisfies

ρ(x)ρ(x)T𝑑x=Is.\int\rho(x)\rho(x)^{\mathrm{\scriptscriptstyle T}}\ dx=I_{s}.

Now shift and rescale so that the basis is orthonormal on MjM_{j} with

bj(x)=2dkρ(4k1/d(x12mjx0h))b_{j}(x)=2^{d}\sqrt{k}\rho\left(4k^{1/d}\left(x-\frac{1}{2}-\frac{m_{j}-x_{0}}{h}\right)\right)

and so satisfies

\displaystyle\int b_{j}(x)b_{j}(x)^{\mathrm{\scriptscriptstyle T}}\mathbbold{1}(x\in M_{j})\ dx =4^{d}k\int\rho\left(4k^{1/d}\left(x-\frac{1}{2}-\frac{m_{j}-x_{0}}{h}\right)\right)\rho\left(4k^{1/d}\left(x-\frac{1}{2}-\frac{m_{j}-x_{0}}{h}\right)\right)^{\mathrm{\scriptscriptstyle T}}\mathbbold{1}(x\in M_{j})\ dx
\displaystyle=\int_{[-1,1]^{d}}\rho(v)\rho(v)^{\mathrm{\scriptscriptstyle T}}\ dv=I_{s}

where we used the change of variable v=4k1/d(x12mjx0h)v=4k^{1/d}\left(x-\frac{1}{2}-\frac{m_{j}-x_{0}}{h}\right) so that dv=4dkdxdv=4^{d}k\ dx and x=v/4k1/d+1/2+(mjx0)/hMjx=v/4k^{1/d}+1/2+(m_{j}-x_{0})/h\in M_{j} when v[1,1]dv\in[-1,1]^{d}. Finally define the basis

b(v)={b1(v)T1(vM1),,bk(v)T1(vMk)}Tb(v)=\{b_{1}(v)^{\mathrm{\scriptscriptstyle T}}\mathbbold{1}(v\in M_{1}),\dots,b_{k}(v)^{\mathrm{\scriptscriptstyle T}}\mathbbold{1}(v\in M_{k})\}^{\mathrm{\scriptscriptstyle T}}

of length sksk. Then

b(v)b(v)T=(b1(v)b1(v)T1(vM1)000b2(v)b2(v)T1(vM2)000bk(v)bk(v)T1(vMk))\displaystyle b(v)b(v)^{\mathrm{\scriptscriptstyle T}}=\begin{pmatrix}b_{1}(v)b_{1}(v)^{\mathrm{\scriptscriptstyle T}}\mathbbold{1}(v\in M_{1})&0&\cdots&0\\ 0&b_{2}(v)b_{2}(v)^{\mathrm{\scriptscriptstyle T}}\mathbbold{1}(v\in M_{2})&\cdots&0\\ \vdots&\vdots&\ddots&\vdots\\ 0&0&\cdots&b_{k}(v)b_{k}(v)^{\mathrm{\scriptscriptstyle T}}\mathbbold{1}(v\in M_{k})\\ \end{pmatrix}

Therefore, since dF(v)dF^{*}(v) is bounded above and below on its support, the jj-th diagonal block of Ω\Omega is bounded above and below (in the positive semidefinite order) by constant multiples of

bj(v)bj(v)T1(vMj)𝑑v\displaystyle\int b_{j}(v)b_{j}(v)^{\mathrm{\scriptscriptstyle T}}\mathbbold{1}(v\in M_{j})\ dv =Is.\displaystyle=I_{s}.

Therefore the eigenvalues of Ω\Omega are bounded above and below away from zero, as desired. Further, by the same higher-order kernel arguments for local polynomials as in Proposition 3, the basis satisfies the Hölder approximation property (6). ∎
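
The construction in this proof can also be illustrated numerically. The sketch below (d=1, with hypothetical interval centers and widths) builds the shifted and rescaled Legendre blocks b_{j} and checks that the matrix \int b(v)b(v)^{\mathrm{\scriptscriptstyle T}}\ dv over the union of the intervals is approximately the identity, so that the eigenvalues of \Omega are bounded whenever dF^{*} is bounded above and below there.

import numpy as np
from numpy.polynomial import legendre

# Numerical check (d = 1) of the block basis construction in Lemma S3: shifted and
# rescaled Legendre polynomials supported on k disjoint intervals M_j, normalized so
# that int b(v) b(v)' dv over the union of the M_j is the identity. The interval
# centers and widths below are illustrative assumptions.

k, s = 5, 3                                    # number of intervals, polynomials per interval
centers = (np.arange(k) + 0.5) / k             # hypothetical interval centers in [0, 1]
half = 1 / (4 * k)                             # M_j = [c_j - 1/(4k), c_j + 1/(4k)]

def rho(u):                                    # Legendre basis orthonormal on [-1, 1]
    return np.stack([np.sqrt(j + 0.5) * legendre.Legendre.basis(j)(u) for j in range(s)])

def b(v):                                      # stacked local basis of length s * k
    out = np.zeros((s * k, v.size))
    for j, c in enumerate(centers):
        inside = np.abs(v - c) <= half
        out[j * s:(j + 1) * s, :] = 2 * np.sqrt(k) * rho(4 * k * (v - c)) * inside
    return out

v = np.linspace(0, 1, 200001)                  # quadrature grid on [0, 1]
Bv = b(v)
Omega = Bv @ Bv.T / v.size                     # approximates int b(v) b(v)' dv
print("max |Omega - I| =", np.abs(Omega - np.eye(s * k)).max())   # small, up to quadrature error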