Minimax rates for heterogeneous causal effect estimation
Abstract
Estimation of heterogeneous causal effects – i.e., how effects of policies and treatments vary across subjects – is a fundamental task in causal inference. Many methods for estimating conditional average treatment effects (CATEs) have been proposed in recent years, but questions surrounding optimality have remained largely unanswered. In particular, a minimax theory of optimality has yet to be developed, with the minimax rate of convergence and construction of rate-optimal estimators remaining open problems. In this paper we derive the minimax rate for CATE estimation, in a Hölder-smooth nonparametric model, and present a new local polynomial estimator, giving high-level conditions under which it is minimax optimal. Our minimax lower bound is derived via a localized version of the method of fuzzy hypotheses, combining lower bound constructions for nonparametric regression and functional estimation. Our proposed estimator can be viewed as a local polynomial R-Learner, based on a localized modification of higher-order influence function methods. The minimax rate we find exhibits several interesting features, including a non-standard elbow phenomenon and an unusual interpolation between nonparametric regression and functional estimation rates. The latter quantifies how the CATE, as an estimand, can be viewed as a regression/functional hybrid.
Keywords: causal inference, functional estimation, higher order influence functions, nonparametric regression, optimal rates of convergence.
1 Introduction
In this paper we consider estimating the difference in regression functions
    τ(x) = E(Y ∣ X = x, A = 1) − E(Y ∣ X = x, A = 0)    (1)
from an iid sample of observations of Z = (X, A, Y). Let Y^a denote the counterfactual outcome that would have been observed under treatment level a. Then, under the assumptions of consistency (i.e., Y = Y^a if A = a), positivity (i.e., ε ≤ P(A = 1 ∣ X) ≤ 1 − ε with probability one, for some ε > 0), and no unmeasured confounding (i.e., A ⫫ Y^a ∣ X), the quantity τ(x) also equals the conditional average treatment effect (CATE)
    τ(x) = E(Y^1 − Y^0 ∣ X = x).
The CATE gives a more individualized picture of treatment effects compared to the overall average treatment effect (ATE) E(Y^1 − Y^0), and plays a crucial role in many fundamental tasks in causal inference, including assessing effect heterogeneity, constructing optimal treatment policies, generalizing treatment effects to new populations, finding subgroups with enhanced effects, and more. Further, these tasks have far-reaching implications across the sciences, from personalizing medicine to optimizing voter turnout. We refer to Hernán and Robins (2020) (chapter 4), Nie and Wager (2021), Kennedy (2023), and citations therein, for general discussion and review.
The simplest approach to CATE estimation would be to assume a low-dimensional parametric model for the outcome regression; then maximum likelihood estimates could be easily constructed, and under regularity conditions the resulting plug-in estimator would be minimax optimal. However, when the covariates have continuous components, it is typically difficult to specify a correct parametric model, and under misspecification the previously described approach could lead to substantial bias. This suggests the need for more flexible methods. Early work in flexible CATE estimation employed semiparametric models, for example partially linear models assuming the treatment effect to be constant, or structural nested models in which the effect followed some known parametric form, but leaving other parts of the distribution unspecified (Robinson, 1988; Robins et al., 1992; Robins, 1994; van der Laan and Robins, 2003; van der Laan, 2005; Vansteelandt and Joffe, 2014). An important theme in this work is that the CATE can be much more structured and simple than the rest of the data-generating process. Specifically, the individual regression functions for each treatment level may be very complex (e.g., non-smooth or non-sparse), even when their difference is very smooth or sparse, or even constant or zero. We refer to Kennedy (2023) for some recent discussion of this point.
More recently there has been increased emphasis on incorporating nonparametrics and machine learning tools for CATE estimation. We briefly detail two especially relevant streams of this recent literature, based on so-called DR-Learner and R-Learner methods, both of which rely on doubly robust-style estimation. The DR-Learner is a model-free meta-algorithm first proposed by van der Laan (2005) (Section 4.2), which essentially takes the components of the classic doubly robust estimator of the ATE, and rather than averaging, instead regresses on covariates. It has since been specialized to particular methods, e.g., cross-validated ensembles (Luedtke and van der Laan, 2016), kernel (Lee et al., 2017; Fan et al., 2019; Zimmert and Lechner, 2019) and series methods (Semenova and Chernozhukov, 2017), empirical risk minimization (Foster and Syrgkanis, 2019), and linear smoothers (Kennedy, 2023). On the other hand, the R-Learner is a flexible adaptation of the double-residual regression method originally built for partially linear models (Robinson, 1988), with the first nonparametric version proposed by Robins et al. (2008) (Section 5.2) using series methods. The R-Learner has since been adapted to RKHS regression (Nie and Wager, 2021), lasso (Zhao et al., 2017; Chernozhukov et al., 2017), and local polynomials (Kennedy, 2023). Many flexible non-doubly robust methods have also been proposed in recent years, often based on inverse-weighting or direct regression estimation (Foster et al., 2011; Imai and Ratkovic, 2013; Athey and Imbens, 2016; Shalit et al., 2017; Wager and Athey, 2018; Künzel et al., 2019; Hahn et al., 2020).
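To make the meta-algorithms described above concrete, the following is a minimal Python sketch of the DR-Learner idea: form the doubly robust pseudo-outcome from estimated nuisance functions on one half of the data, then regress it on covariates using the other half. The use of scikit-learn random forests, the clipping constants, and the function names are illustrative assumptions; this is not the estimator analyzed in this paper.

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

    def dr_learner(X, A, Y, X_eval, seed=0):
        """Minimal DR-Learner sketch: regress the doubly robust pseudo-outcome on covariates.

        Nuisance functions (propensity score and outcome regressions) are fit on one half
        of the data; the pseudo-outcome regression uses the other half (sample splitting).
        """
        rng = np.random.default_rng(seed)
        n = len(Y)
        idx = rng.permutation(n)
        train, est = idx[: n // 2], idx[n // 2:]

        # Nuisance estimates: propensity score pi(x) and outcome regressions mu_a(x).
        pi_hat = RandomForestClassifier(random_state=seed).fit(X[train], A[train])
        mu0_hat = RandomForestRegressor(random_state=seed).fit(
            X[train][A[train] == 0], Y[train][A[train] == 0])
        mu1_hat = RandomForestRegressor(random_state=seed).fit(
            X[train][A[train] == 1], Y[train][A[train] == 1])

        pi = np.clip(pi_hat.predict_proba(X[est])[:, 1], 0.01, 0.99)
        mu0, mu1 = mu0_hat.predict(X[est]), mu1_hat.predict(X[est])

        # Doubly robust pseudo-outcome: its conditional mean is the CATE if either
        # the propensity score or the outcome regressions are correctly estimated.
        A_e, Y_e = A[est], Y[est]
        pseudo = (A_e - pi) / (pi * (1 - pi)) * (Y_e - np.where(A_e == 1, mu1, mu0)) + mu1 - mu0

        # Second-stage regression of the pseudo-outcome on covariates.
        cate_model = RandomForestRegressor(random_state=seed).fit(X[est], pseudo)
        return cate_model.predict(X_eval)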
Despite the wide variety of methods available for flexible CATE estimation, questions of optimality have remained mostly unsolved. Gao and Han (2020) studied minimax optimality, but in a specialized model where the propensity score has zero smoothness, and covariates are non-random; this model does not reflect the kinds of assumptions typically used in practice, e.g., in the papers cited in the previous paragraph. Some but not all of these papers derive upper bounds on the error of their proposed CATE estimators; in the best case, these take the form of an oracle error rate (which would remain even if the potential outcomes were observed and regressed on covariates), plus some contribution coming from having to estimate nuisance functions (i.e., outcome regressions and propensity scores). The fastest rates we are aware of come from Foster and Syrgkanis (2019) and Kennedy (2023). Foster and Syrgkanis (2019) studied global error rates, obtaining an oracle error plus sums of squared errors in all nuisance components. Kennedy (2023) studied pointwise error rates, giving two main results; in the first, they obtain the oracle error plus a product of nuisance errors, while in the second, they obtain a faster rate via undersmoothing (described in more detail in Section 3.3). However, since these are all upper bounds on the errors of particular procedures, it is unknown whether these rates are optimal in any sense, and if they are not, how they might be improved upon. In this paper we resolve these questions (via the minimax framework, in a nonparametric model that allows components of the data-generating process to be infinite-dimensional, yet smooth in the Hölder sense).
More specifically, in Section 3 we derive a lower bound on the minimax rate of CATE estimation, indicating the best possible (worst-case) performance of any estimator, in a model where the CATE, regression function, and propensity score are Hölder-smooth functions. Our derivation uses an adaptation of the method of fuzzy hypotheses, which is specially localized compared to the constructions previously used for obtaining lower bounds in functional estimation and hypothesis testing (Ibragimov et al., 1987; Birgé and Massart, 1995; Nemirovski, 2000; Ingster et al., 2003; Tsybakov, 2009; Robins et al., 2009b). In Section 4, we confirm that our minimax lower bound is tight (under some conditions), by proposing and analyzing a new local polynomial R-Learner, using localized adaptations of higher order influence function methodology (Robins et al., 2008, 2009a, 2017). In addition to giving a new estimator that is provably optimal (under some conditions, e.g., on how well the covariate density is estimated), our results also confirm that previously proposed estimators were not generally optimal in this smooth nonparametric model. Our minimax rate also sheds light on the nature of the CATE as a statistical quantity, showing how it acts as a regression/functional hybrid: for example, the rate interpolates between nonparametric regression and functional estimation, depending on the relative smoothness of the CATE and nuisance functions (outcome regression and propensity score).
2 Setup & Notation
We consider an iid sample of n observations of Z = (X, A, Y) from a distribution P, where X denotes covariates, A a treatment or policy indicator, and Y an outcome of interest. We let F denote the distribution function of the covariates (with density f as needed), and let
    π(x) = P(A = 1 ∣ X = x),   m(x) = E(Y ∣ X = x),   μ_a(x) = E(Y ∣ X = x, A = a)
denote the propensity score, marginal, and treatment-specific outcome regressions, respectively. We sometimes omit arguments from functions to ease notation, e.g., writing π for π(X). We also index quantities by a distribution when needed, e.g., the propensity score under a particular distribution Q is written π_Q; depending on context, no indexing means the quantity is evaluated at the true distribution P.
Our goal is to study estimation of the CATE at a point x0, with error quantified by the mean absolute error
    E |τ̂(x0) − τ(x0)|.
As detailed in subsequent sections, we work in a nonparametric model whose components are infinite-dimensional functions but with some smoothness. We say a function is s-smooth if it belongs to a Hölder class with index s; this essentially means it has bounded derivatives up to order ⌊s⌋, and the highest-order derivative is Hölder continuous. To be more precise, let ⌊s⌋ denote the largest integer strictly smaller than s, and let D^m denote the partial derivative operator for a multi-index m = (m_1, …, m_d). Then the Hölder class with index s contains all functions f that are ⌊s⌋ times continuously differentiable, with derivatives up to order ⌊s⌋ bounded, i.e.,
    |D^m f(x)| ≤ C
for all m with |m| = m_1 + ⋯ + m_d ≤ ⌊s⌋ and for all x, and with ⌊s⌋-order derivatives satisfying the Lipschitz condition
    |D^m f(x) − D^m f(x′)| ≤ C ‖x − x′‖^{s − ⌊s⌋}
for some C < ∞, for all m with |m| = ⌊s⌋, and for all x, x′, where for a vector v we let ‖v‖ denote the Euclidean norm. Sometimes Hölder classes are referenced by both the smoothness s and the constant C, but we focus our discussion on the smoothness and omit the constant, which is assumed finite and independent of n.
We write the squared L_2(P) norm of a function f as ‖f‖² = ∫ f(z)² dP(z). The sup-norm is denoted by ‖f‖_∞ = sup_z |f(z)|. For a matrix A we let ‖A‖ and ‖A‖_F denote the operator/spectral and Frobenius norms, respectively, and let λ_min(A) and λ_max(A) denote the minimum and maximum eigenvalues of A, respectively. We write a ≲ b if a ≤ Cb for a positive constant C independent of n, and a ≍ b if a ≲ b and b ≲ a (i.e., if a = O(b) and b = O(a)). We write a ∝ b to mean that a and b are proportional, i.e., a = Cb for some constant C. We also use a ∧ b = min(a, b) and a ∨ b = max(a, b). We use the shorthand P_n f = (1/n) Σ_i f(Z_i) to write sample averages, and similarly U_n f = {n(n − 1)}^{-1} Σ_{i ≠ j} f(Z_i, Z_j) for the U-statistic measure.
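For concreteness, the following small Python sketch spells out the last two pieces of notation, the sample average measure P_n and the second-order U-statistic measure U_n; the scalar data and placeholder kernels here are purely illustrative and not quantities used later in the paper.

    import numpy as np

    def sample_average(f, Z):
        """P_n f = (1/n) * sum_i f(Z_i)."""
        return np.mean([f(z) for z in Z])

    def u_statistic(g, Z):
        """U_n g = {n(n-1)}^{-1} * sum over ordered pairs i != j of g(Z_i, Z_j)."""
        n = len(Z)
        total = sum(g(Z[i], Z[j]) for i in range(n) for j in range(n) if i != j)
        return total / (n * (n - 1))

    # Toy illustration with scalar observations and placeholder kernels.
    rng = np.random.default_rng(0)
    Z = rng.normal(size=50)
    print(sample_average(lambda z: z ** 2, Z))     # plain empirical second moment
    print(u_statistic(lambda z1, z2: z1 * z2, Z))  # average over ordered pairs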
3 Fundamental Limits
In this section we derive a lower bound on the minimax rate for CATE estimation. This result has several crucial implications, both practical and theoretical. First, it gives a benchmark for the best possible performance of any CATE estimator in the nonparametric model defined in Theorem 1. In particular, if an estimator is shown to attain this benchmark, then one can safely conclude the estimator cannot be improved, at least in terms of worst-case rates, without adding assumptions; conversely, if the benchmark is not shown to be attained, then one should continue searching for other better estimators (or better lower or upper risk bounds). Second, a tight minimax lower bound is important in its own right as a measure of the fundamental limits of CATE estimation, illustrating precisely how difficult CATE estimation is in a statistical sense. The main result of this section is given in Theorem 1 below. It is finally proved and discussed in detail in Section 3.3.
Theorem 1.
For , let denote the model where:
1. is bounded above by a constant,
2. is -smooth,
3. is -smooth, and
4. is -smooth.
Let . Then for larger than a constant depending on , the minimax rate is lower bounded as
Remark 1.
In Appendix A we also give results (both lower and upper bounds) for the model that puts smoothness assumptions on the marginal regression , instead of the control regression . Interestingly, the minimax rates differ in these two models, but only in the regime where the regression function is more smooth than the propensity score (i.e., ). Specifically, when is -smooth, the minimax rate from Theorem 1 holds but with replaced by .
Crucially, Condition 4 allows the CATE to have its own smoothness , which is necessarily at least the regression smoothness , but can also be much larger, as described in the Introduction.
We defer discussion of the details of the overall minimax rate of Theorem 1 to Section 3.3, moving first to a proof of the result.
Remark 2.
For simplicity, the lower bound result in Theorem 1 is given for a large model in which the covariate density is only bounded. However, as discussed in detail later in Section 4 and Remark 11, for the stated rate to be attainable more conditions on the covariate density are required. In Section B.8 of the Appendix, we give a particular submodel of under which upper and lower bounds on the minimax rate match up to constants; it will be important in future work to further elucidate the role of the covariate density in CATE estimation.
The primary strategy in deriving minimax lower bounds is to construct distributions that are similar enough that they are statistically indistinguishable, but for which the parameter of interest is maximally separated; this implies no estimator can have error uniformly smaller than this separation. More specifically, we derive our lower bound using a localized version of the method of fuzzy hypotheses (Ibragimov et al., 1987; Birgé and Massart, 1995; Nemirovski, 2000; Ingster et al., 2003; Tsybakov, 2009; Robins et al., 2009b). In the classic Le Cam two-point method, which can be used to derive minimax lower bounds for nonparametric regression at a point (Tsybakov, 2009), it suffices to consider a pair of distributions that differ locally; however, for nonlinear functional estimation, such pairs give bounds that are too loose. One instead needs to construct pairs of mixture distributions, which can be viewed via a prior over distributions in the model (Birgé and Massart, 1995; Tsybakov, 2009; Robins et al., 2009b). Our construction combines these two approaches via a localized mixture, as will be described in detail in the next subsection.
Remark 3.
In what follows we focus on the lower bound in the low smoothness regime where . The lower bound for the high smoothness regime matches the classic smooth nonparametric regression rate, and follows from a standard two-point argument, using the same construction as in Section 2.5 of Tsybakov (2009).
The following lemma, adapted from Section 2.7.4 of Tsybakov (2009), provides the foundation for the minimax lower bound result of this section.
Lemma 1 (Tsybakov (2009)).
Let and denote distributions in indexed by a vector , with -fold products denoted by and , respectively. Let denote a prior distribution over . If
and
for a functional and for all , then
for any monotonic non-negative loss function .
Lemma 1 illuminates the three ingredients for deriving a minimax lower bound, and shows how they interact. The ingredients are: (i) a pair of mixture distributions, (ii) the distance between their -fold products, which is ideally small, and (iii) the separation of the parameter of interest under the mixtures, which is ideally large. Finding the right minimax lower bound requires balancing these three ingredients appropriately: with too much distance or not enough separation, the lower bound will be too loose. In the following subsections we describe these three ingredients in detail.
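As a toy illustration of how ingredients (ii) and (iii) interact (this is not the construction used in this paper), the sketch below computes the squared Hellinger distance between two n-fold product Bernoulli distributions whose means are separated by a shrinking amount delta. Choosing delta of order n^{-1/2} keeps the n-fold distance bounded, so the hypotheses remain statistically indistinguishable while the parameter separation shrinks at that rate; the Bernoulli example and constants are assumptions made purely for illustration.

    import numpy as np

    def hellinger_sq_bernoulli(p, q):
        """Squared Hellinger distance between Bernoulli(p) and Bernoulli(q)."""
        return (np.sqrt(p) - np.sqrt(q)) ** 2 + (np.sqrt(1 - p) - np.sqrt(1 - q)) ** 2

    def hellinger_sq_product(h2, n):
        """H^2 between n-fold products, via 1 - H^2(P^n,Q^n)/2 = (1 - H^2(P,Q)/2)^n."""
        return 2 * (1 - (1 - h2 / 2) ** n)

    for n in [100, 1_000, 10_000]:
        # A separation delta of order n^{-1/2} keeps the n-fold Hellinger distance
        # bounded away from 2, so no test can reliably distinguish the hypotheses.
        delta = 0.5 / np.sqrt(n)
        h2 = hellinger_sq_bernoulli(0.5, 0.5 + delta)
        print(n, hellinger_sq_product(h2, n))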
3.1 Construction
In this subsection we detail the distributions and used to construct the minimax lower bound.
The main idea is to mix constructions for nonparametric regression and functional estimation, by perturbing the CATE with a bump at the point , and to also use a mixture of perturbations of the propensity score and regression functions and , locally near .
For our lower bound results, we work in the setting where the outcome Y is binary; this is mostly to ease notation and calculations. Note however that this still yields a valid lower bound in the general continuous-outcome case, since a lower bound in the strict submodel where Y is binary is also a lower bound across the larger model. Importantly, when Y is binary, the density of an observation can be indexed via either the quadruple , or ; here we make use of the latter parametrization (and in the appendix we consider the parametrization).
We first give the construction for the case in the definition below, and then go on to discuss details (and in Appendix A we give constructions for all other regimes).
Definition 1 (Distributions and ).
Let:
1. denote a function with for , and for ,
2. denote the cube centered at with sides of length ,
3. denote a partition of into cubes of equal size (for an integer raised to the power ), with midpoints , so each cube has side length .
where . Finally let the distributions and be defined via the densities
Remark 4.
Under the parametrization, since , the above densities can equivalently be written as and .
Figure 1 shows an illustration of our construction in the case. As mentioned above, the CATE is perturbed with a bump at and the nuisance functions and with bumps locally near . The regression function is perturbed under both and , since it is less smooth than the propensity score in the case. The choices of the CATE mimic those in the two-point proof of the lower bound for nonparametric regression at a point (see, e.g., Section 2.5 of Tsybakov (2009)), albeit with a particular flat-top bump function, while the choices of nuisance functions and are more similar to those in the lower bound for functionals such as the expected conditional covariance (cf. Section 4 of Robins et al. (2009b)). In this sense our construction can be viewed as combining those for nonparametric regression and functional estimation, similar to Shen et al. (2020). In what follows we remark on some important details.
Remark 5.
Section 3.2 of Shen et al. (2020) used a similar construction for conditional variance estimation. Some important distinctions are: (i) they focused on the univariate and low smoothness setting; (ii) in that problem there is only one nuisance function, so the null can be a point rather than a mixture distribution; and (iii) they use a different, arguably more complicated, approach to bound the distance between distributions. Our work can thus be used to generalize such variance estimation results to arbitrary dimension and smoothness.
[Figure 1: Illustration of the lower bound construction: the CATE is perturbed by a single flat-top bump at the point of interest, while the propensity score and regression function are perturbed by a grid of signed bumps in a shrinking neighborhood around that point.]
First we remark on the choice of CATE in the construction. As mentioned above, the bump construction resembles that of the standard Le Cam lower bound for nonparametric regression at a point, but differs in that we use a specialized bump function with a flat top. Crucially, this choice ensures the CATE is constant and equal to for all in the cube centered at with sides of length , and that it is equal to zero for all , i.e., outside the cube centered at with side length . The fact that the CATE is constant across the top of the bump (which will be the only place where observations appear near ) eases Hellinger distance calculations substantially. It is straightforward to check that the CATE is -smooth in this construction (see page 93 of Tsybakov (2009)).
Remark 6.
One example of a bump function satisfying the conditions above is
for .
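Since the precise formula is not essential, the sketch below gives one standard flat-top construction consistent with the description above (infinitely differentiable, equal to one on the inner half-interval and zero outside the unit interval); it is not necessarily the exact bump function used in the proofs.

    import numpy as np

    def phi(x):
        """exp(-1/x) for x > 0, else 0; the classic C-infinity building block."""
        return np.where(x > 0, np.exp(-1.0 / np.maximum(x, 1e-300)), 0.0)

    def smooth_step(t):
        """C-infinity step: 0 for t <= 0, 1 for t >= 1, strictly increasing in between."""
        return phi(t) / (phi(t) + phi(1.0 - t))

    def flat_top_bump(u):
        """C-infinity bump equal to 1 on [-1/2, 1/2] and 0 outside [-1, 1]."""
        return smooth_step(2.0 - 2.0 * np.abs(u))

    u = np.linspace(-1.5, 1.5, 7)
    print(flat_top_bump(u))   # equals 1 on the inner half-interval, 0 beyond |u| = 1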
For the propensity score and regression functions, we similarly have
i.e., each bump equals one on the half- cube around , and is identically zero outside the main larger cube around . It is again straightforward to check that and are - and -smooth, respectively.
The covariate density is chosen to be uniform, but on the set that captures the middle of all the nuisance bumps , together with the space away from . Importantly, this choice ensures there is only mass where the nuisance bumps are constant and non-zero (and where ), or else far away from , where the densities are the same under and . Note that, as , the Lebesgue measure of the set tends to one, and the covariate density tends towards a standard uniform distribution.
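To fix ideas, the sketch below lays out the geometry of Definition 1 in one dimension: a single flat-top bump perturbing the CATE at the point of interest, and a grid of narrower flat-top bumps with iid Rademacher signs perturbing the nuisance functions inside the same neighborhood. The bump widths, heights, and the simplified (piecewise-linear) profile are arbitrary illustrative choices, not the calibrated magnitudes used in the actual proof, which depend on the smoothness indices.

    import numpy as np

    def flat_top_profile(u):
        """Simplified flat-top profile for the schematic: 1 on [-1/2, 1/2],
        linear taper to 0 at |u| = 1 (the proof uses a smooth version)."""
        return np.clip(2.0 - 2.0 * np.abs(u), 0.0, 1.0)

    rng = np.random.default_rng(1)
    x0, h = 0.5, 0.2          # point of interest and neighborhood half-width (illustrative)
    k = 8                     # number of nuisance bumps inside the neighborhood (illustrative)
    centers = x0 - h + (2 * h) * (np.arange(k) + 0.5) / k   # midpoints of an equal partition
    lam = rng.choice([-1, 1], size=k)                        # Rademacher signs

    x = np.linspace(0, 1, 1001)
    cate_bump = 0.3 * flat_top_profile((x - x0) / h)         # single bump perturbing the CATE at x0
    nuisance = np.zeros_like(x)
    for c, s in zip(centers, lam):
        # narrow flat-top bumps with random signs, perturbing the nuisance functions near x0
        nuisance += 0.1 * s * flat_top_profile((x - c) * k / h)

    # cate_bump and nuisance can now be added to baseline functions and plotted
    print(cate_bump.max(), np.abs(nuisance).max())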
3.2 Hellinger Distance
As mentioned previously, deriving a tight minimax lower bound requires carefully balancing the distance between distributions in our construction. To this end, in this subsection we bound the Hellinger distance between the -fold product mixtures and , for a uniform prior distribution, so that are iid Rademachers.
In general these product densities can be complicated, making direct distance calculations difficult. However, Theorem 2.1 from Robins et al. (2009b) can be used to relate the distance between the -fold products to those of simpler posteriors over a single observation. In the following lemma we adapt this result to localized constructions like those in Definition 1.
Lemma 2.
Let and denote distributions indexed by a vector , and let denote a partition of the sample space. Assume:
1. for all , and
2. the conditional distributions and (given an observation is in ) do not depend on for , and only differ on partitions .
For a prior distribution over , let and , and define
for a dominating measure . If and for all , then is bounded above by
for a constant only depending on .
In the next proposition, we bound the quantities from Lemma 2 and put the results together to obtain a bound on the desired Hellinger distance between product mixtures.
Proposition 1.
Before moving to the proof of Proposition 1, we briefly discuss and give some remarks. Compared to the Hellinger distance arising in the average treatment effect or expected conditional covariance lower bounds (Robins et al., 2009b), there is an extra factor in the numerator. Of course, one cannot simply repeat those calculations with bins, since then for example the term would also be inflated to ; our carefully localized construction is crucial to obtain the right rate in this case. We also note that the choice is required for ensuring that the averaged densities and are equal (implying that ); specifically this equalizes the CATE bump under with the squared nuisance bumps under .
Proof.
Here the relevant partition of the sample space is , , along with , which partitions the space away from into disjoint cubes with side lengths . Thus we have
where . In Appendix Section B.1 we show that when , and so is proportional to the volume of a cube with side lengths . Further the conditional distributions and do not depend on for , since only changes the density in . In what follows we focus on the partitions and not , since the distributions are exactly equal on the latter (in the language of Lemma 2, the set indexes only the partitions , and note that for this set we have ). Note when are iid Rademacher random variables the marginalized densities from Lemma 2 are
The first step is to show that relevant densities and density ratios are appropriately bounded. We give these details in Appendix Section B.1. Next it remains to bound the quantities , , and .
We begin with , the distance between marginalized densities and , which is tackled somewhat differently from and . Because we take it follows that equals
(2)
since for and
We note that this result requires a carefully selected relationship between and , which guarantees that the squared nuisance bumps under equal the CATE bumps under . This also exploits the flat-top bump functions we use, together with a covariate density that only puts mass at these tops, so that the squared terms are constant and no observations occur elsewhere where the bumps are not equal. Without these choices of bump function and covariate density, the expression in (2) would only be bounded by , and so would only be bounded by ; in that case, the term would dominate the Hellinger bound in Lemma 2, and the resulting minimax lower bound would reduce to the oracle rate , which is uninformative in the low-smoothness regimes we are considering.
Now we move to the distance , which does not end up depending on and is somewhat easier to handle. For it we have
where the first line follows by definition, and since , and outside of the cube , which implies that
the inequality in the second line since and as in (23) and (22), and the last inequality since
by a change of variables.
For we use a mix of the above logic for and . Note that equals
where we used the fact that and , along with the same logic as above with (ensuring the second term in the square equals zero). Now we have
using the exact same logic as for . Plugging these bounds on into Lemma 2, together with the fact that and , yields the result. ∎
3.3 Choice of Parameters & Final Rate
Finally we detail how the parameters and can be chosen to ensure the Hellinger distance from Proposition 1 remains bounded, and use the result to finalize the proof of Theorem 1.
Proposition 2.
The proof of Proposition 2 follows directly from Proposition 1, after plugging in the selected values of and . Importantly, it (together with the alternative construction for the case given in Appendix A) also settles the proof of Theorem 1 via Lemma 1. This follows since, with the proposed choices of and , the Hellinger distance is appropriately bounded so that the term
in Lemma 1 is a constant (greater than 1/20 for example), while the separation in the CATE at , which equals , is proportional to under all and . Therefore this separation is indeed the minimax rate in the low smoothness regime where . Note again that, as discussed in Remark 3, when the rate is faster than the usual nonparametric regression rate , and so the standard lower bound construction as in Section 2.5 of Tsybakov (2009) indicates that the slower rate is the tighter lower bound in that regime.
Figure 2 illustrates the minimax rate from Theorem 1, as a function of the average nuisance smoothness (scaled by dimension), and the CATE smoothness scaled by dimension . A number of important features about the rate are worth highlighting.
[Figure 2: The minimax rate from Theorem 1, as a function of the dimension-scaled average nuisance smoothness and the dimension-scaled CATE smoothness.]
First, of course, the rate never slows with higher nuisance smoothness , for any CATE smoothness , and vice versa. However, there is an important elbow phenomenon, akin to that found in functional estimation problems (Bickel and Ritov, 1988; Birgé and Massart, 1995; Tsybakov, 2009; Robins et al., 2009b). In particular, the minimax lower bound shows that when the average nuisance smoothness is low enough that , the oracle rate (which could be achieved if one actually observed the potential outcomes) is in fact unachievable. This verifies a conjecture in Kennedy (2023).
Notably, though, the elbow phenomenon we find in the problem of CATE estimation differs quite substantially from that for classic pathwise differentiable functionals. For the latter, the rate is parametric (i.e., ) above some threshold, and nonparametric () below. In contrast, in our setting the rate matches that of nonparametric regression above the threshold, and otherwise is a combination of nonparametric regression and functional estimation rates. Thus in this problem there are many elbows, with the threshold depending on the CATE smoothness . In particular, our minimax rate below the threshold,
is a mixture of the nonparametric regression rate (on the squared scale) and the classic functional estimation rate . This means, for example, that in regimes where the CATE is very smooth, e.g., , the CATE estimation problem begins to resemble that of pathwise-differentiable functional estimation, where the elbow occurs at , with rates approaching the parametric rate above, and the functional estimation rate below. At the other extreme, where the CATE does not have any extra smoothness, so that (note we must have ), the elbow threshold condition holds for any . Thus, at this other extreme, there is no elbow phenomenon, and the CATE estimation problem resembles that of smooth nonparametric regression, with optimal rate . For the arguably more realistic setting, where the CATE smoothness may take intermediate values between and , the minimax rate is a mixture, interpolating between the two extremes. All of this quantifies the sense in which the CATE can be viewed as a regression/functional hybrid.
It is also worth mentioning that no estimator previously proposed in the literature (that we know of) attains the minimax rate in Theorem 1 in full generality. Some estimators have been shown to attain the oracle rate , but only under stronger assumptions than the minimal condition we find here, i.e., that . One exception is the undersmoothed R-learner estimator analyzed in Kennedy (2023), which did achieve the rate whenever , under some conditions (including that ). However, in the low-smoothness regime where , that estimator’s rate was , which is slower than the minimax rate we find here. This motivates our work in the following section, where we propose and analyze a new estimator, whose error matches the minimax rate in much greater generality (under some conditions, e.g., on how well the covariate density is estimated).
Remark 7.
A slightly modified version of our construction also reveals that, when the CATE is constant, the classic functional estimation rate acts as a minimax lower bound. To the best of our knowledge, this result has not been noted elsewhere.
4 Attainability
In this section we show that the minimax lower bound of Theorem 1 is actually attainable, via a new local polynomial version of the R-Learner (Nie and Wager, 2021; Kennedy, 2023), based on an adaptation of higher-order influence functions (Robins et al., 2008, 2009a, 2017).
4.1 Proposed Estimator & Decomposition
In this subsection we first describe our proposed estimator, and then give a preliminary error bound, which motivates the specific bias and variance calculations in the following subsections. In short, the estimator is a higher-order influence function-based version of the local polynomial R-learner analyzed in Kennedy (2023). At its core, the R-Learner essentially regresses outcome residuals on treatment residuals to estimate a weighted average of the CATE. Early versions for a constant or otherwise parametric CATE were studied by Chamberlain (1987), Robinson (1988), and Robins (1994), with more flexible series, RKHS, and lasso versions studied more recently by Robins et al. (2008), Nie and Wager (2021), and Chernozhukov et al. (2017), respectively. None of this previous work obtained the minimax optimal rates in Theorem 2.
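Before turning to Definition 2, it may help to see the basic residual-on-residual idea in code. The sketch below is a plain kernel-weighted, local-constant R-Learner, not the higher-order estimator defined next: it regresses outcome residuals on treatment residuals with weights that concentrate near the evaluation point. The box kernel, the scikit-learn nuisance estimators, and the clipping constants are illustrative assumptions.

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

    def local_r_learner(X, A, Y, x0, h, seed=0):
        """Kernel-weighted residual-on-residual (local constant) R-Learner sketch.

        Estimates the CATE at x0 by regressing outcome residuals Y - m_hat(X) on treatment
        residuals A - pi_hat(X), with weights localizing to a bandwidth-h box around x0.
        Nuisances are fit on a separate half of the data (sample splitting).
        """
        rng = np.random.default_rng(seed)
        n = len(Y)
        idx = rng.permutation(n)
        train, est = idx[: n // 2], idx[n // 2:]

        pi_hat = RandomForestClassifier(random_state=seed).fit(X[train], A[train])
        m_hat = RandomForestRegressor(random_state=seed).fit(X[train], Y[train])

        pi = np.clip(pi_hat.predict_proba(X[est])[:, 1], 0.01, 0.99)
        res_a = A[est] - pi                        # treatment residuals
        res_y = Y[est] - m_hat.predict(X[est])     # outcome residuals

        # Box kernel weights: only observations within h of x0 (coordinate-wise) contribute,
        # so h must be large enough that the local window is non-empty.
        w = np.all(np.abs(X[est] - x0) <= h, axis=1).astype(float)

        # Weighted least squares of res_y on res_a with no intercept (local constant fit).
        return np.sum(w * res_a * res_y) / np.sum(w * res_a ** 2)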
Definition 2 (Higher-Order Local Polynomial R-Learner).
Let . For each covariate , define as the first terms of the Legendre polynomial series (shifted to be orthonormal on ),
Define to be the corresponding tensor product of all interactions of up to order , which has length and is orthonormal on , and finally define . The proposed estimator is then defined as
(3)
where is a matrix and a -vector given by
respectively, and
for a basis of dimension . The nuisance estimators are constructed from a separate training sample , independent of that on which operates.
The estimator in Definition 2 can be viewed as a localized higher-order estimator, and depends on two main tuning parameters: the bandwidth , which controls how locally one averages near , and the dimension of the basis , which controls how bias and variance are balanced in the second-order U-statistic terms in and . Specific properties of the basis are discussed shortly, e.g., just prior to Remark 7 and in (6). We also note that while the basis will have dimension growing with sample size, the dimension of the basis is fixed depending on the smoothness of the CATE; for the latter we use the Legendre series for concreteness, but expect other bases to work as well.
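The sketch below shows one way to build the basis ingredients that Definition 2 rests on: shifted Legendre polynomials taken to be orthonormal on the unit interval, their tensor products, and localized versions obtained by rescaling the h-cube around the evaluation point to the unit cube. The choice of unit interval, the full tensor grid (rather than interactions up to a fixed total order), the specific affine rescaling map, and the numpy routines are assumptions for illustration; the analysis only requires the approximation property in (6) below.

    import numpy as np
    from numpy.polynomial import legendre
    from itertools import product

    def shifted_legendre(u, degree):
        """Evaluate the first (degree+1) shifted Legendre polynomials, orthonormal on [0, 1]."""
        u = np.atleast_1d(u)
        out = np.empty((len(u), degree + 1))
        for j in range(degree + 1):
            coefs = np.zeros(j + 1)
            coefs[j] = 1.0
            # P_j on [-1, 1], shifted to [0, 1] and scaled to have unit L2 norm there
            out[:, j] = np.sqrt(2 * j + 1) * legendre.legval(2 * u - 1, coefs)
        return out

    def tensor_basis(x, degree):
        """Tensor products of coordinate-wise shifted Legendre polynomials."""
        x = np.atleast_2d(x)                       # shape (n, d)
        d = x.shape[1]
        per_coord = [shifted_legendre(x[:, j], degree) for j in range(d)]
        cols = []
        for combo in product(range(degree + 1), repeat=d):
            col = np.ones(x.shape[0])
            for j, deg in enumerate(combo):
                col = col * per_coord[j][:, deg]
            cols.append(col)
        return np.column_stack(cols)               # shape (n, (degree+1)^d)

    def localized_basis(x, x0, h, degree):
        """Localized basis: map the h-cube around x0 to the unit cube, then evaluate."""
        u = (np.atleast_2d(x) - np.asarray(x0)) / (2 * h) + 0.5   # illustrative rescaling
        return tensor_basis(u, degree)

    # quick sanity check of orthonormality on the unit square by Monte Carlo
    rng = np.random.default_rng(0)
    U = rng.uniform(size=(200_000, 2))
    B = tensor_basis(U, degree=2)
    print(np.round(B.T @ B / len(U), 2))   # approximately the identity matrix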
The U-statistic terms are important for debiasing the first-order sample average terms. In addition, our proposed estimator can be viewed as estimating a locally weighted projection parameter , with coefficients given by
(4)
for
In other words, this projection parameter is a -weighted least squares projection of the CATE on the scaled Legendre polynomials . Crucially, since includes polynomials in up to order , the projection parameter is within of the target CATE; this is formalized in the following proposition.
Proposition 3.
Let denote the -specific projection parameter from (4), and assume:
1. is -smooth,
2. the eigenvalues of are bounded below away from zero, and
3. .
Then for any with we have
Proof.
In Proposition S5 in the Appendix we give simple sufficient conditions under which the eigenvalues of are bounded. In short, this holds under standard boundedness conditions on the propensity score and covariate density.
As mentioned above, our estimator (3) can be viewed as a modified higher-order estimator. More specifically, is closely related to a second-order estimator of the moment condition
(5)
under the assumption that (this is not exactly true in our case, but it is enough that it is approximately true locally near , as will be proved shortly). Indeed, letting and denote the first terms in and , respectively, we see that is a usual first-order influence function-based estimator of the moment condition (5). Similarly, the second terms and are akin to the second-order U-statistic corrections that would be added using the higher-order influence function methodology developed by Robins et al. (2008, 2009a, 2017). However, these terms differ in two important ways, both relating to localization near . First, the U-statistic is localized with respect to both and , i.e., the product is included, whereas only would arise if the goal were purely to estimate the moment condition (5) for fixed . Second, the basis functions
appearing in , , and are localized; they only operate on s near , stretching them out so as to map the cube around to the whole space (e.g., , , etc.). This is the same localization that is used with the Legendre basis . In this sense, these localized basis terms spend all their approximation power locally rather than globally away from . (Specific approximating properties we require of will be detailed shortly, in (6)). These somewhat subtle distinctions play a crucial role in appropriately controlling bias, as will be described in more detail shortly.
Remark 8.
Note again that, as with other higher-order estimators, the estimator (3) depends on an initial estimate of the covariate distribution (near ), through . Importantly, we do not take this estimator to be the empirical distribution, in general, since then our optimal choices of the tuning parameter would yield non-invertible; this occurs with higher-order estimators of pathwise differentiable functionals as well (Mukherjee et al., 2017). As discussed in Remark 11, and in more detail shortly, we do give conditions under which the estimation error in or does not impact the overall rate of .
Crucially, Proposition 3 allows us to focus on understanding the estimation error in with respect to the projection parameter , treating as a separate approximation bias. The next result gives a finite-sample bound on this error, showing how it is controlled by the error in estimating the components of and .
Proposition 4.
Let and . The estimator (3) satisfies
and if and are bounded above, then
for a separate independent training sample on which are estimated.
Thus Proposition 4 tells us that bounding the conditional bias and variance of will also yield finite-sample bounds on the error in , relative to the projection parameter . These bias and variance bounds will be derived in the following two subsections.
4.2 Bias
In this subsection we derive bounds on the conditional bias of the estimator , relative to the components of the projection parameter (4), given the training sample . The main ideas behind the approach are to use localized versions of higher-order influence function arguments, along with a specialized localized basis construction, which results in smaller bias due to the fact that the bases only need to be used in a shrinking window around .
Here we rely on the basis having optimal Hölder approximation properties. In particular, we assume the approximation error of projections in norm satisfies
(6)
where is the usual linear projection of on ,
for the distribution in , the -ball around , mapped to .
In a slight abuse of notation (to ease exposition), we omit the dependence of on .
The approximating condition (6) holds for numerous bases, including spline, CDV wavelet, and local polynomial partition series (and polynomial and Fourier series, up to log factors); it is used often in the literature. We refer to Belloni et al. (2015) for more discussion and specific examples (see their Condition A.3 and subsequent discussion in, for example, their Section 3.2).
Proposition 5.
Assume:
1. is bounded above,
2. the basis satisfies approximating condition (6),
3. is -smooth,
4. is -smooth.
Then
Before delving into the proof, we give some brief discussion. The bias consists of three terms; the first two are the main bias terms that would result even if the covariate distribution were known, and the third is essentially the contribution from having to estimate . (In Lemma S2 of the Appendix we show how the operator norm error of is bounded above by estimation error of the distribution itself). We note that the second of the main bias terms will be of smaller order in all regimes we consider. Compared to the main bias term in a usual higher-order influence function analysis, which is (e.g., for the average treatment effect), our bias term is smaller; this is a result of using the localized basis defined in (3), which only has to be utilized locally near (this smaller bias will be partially offset by a larger variance, as discussed in the next subsection). As mentioned in Remark 11, the contribution from having to estimate is only a third-order term, since the estimation error of (in terms of operator norm) is multiplied by a product of propensity score errors (in norm) with the sum of regression errors and smoothing bias , which is typically of smaller order. In Proposition 6, given after the following proof of Proposition 5, we show how the bias simplifies when is estimated accurately enough.
Proof.
By iterated expectation, the conditional mean of the first-order term in , i.e., the quantity is equal to
where we use the change of variable and again define for any function its corresponding stretched version as . To ease notation it is left implicit that any integral over is only over . Similarly for we have that equals
so that for the first-order term in (denoted in discussion of the moment condition (5)) we have
(7)
The conditional mean of the second-order influence function term in is
(8)
where we define
as the -weighted linear projection of on the basis , and as the estimated version, which simply replaces with . Similarly for we have
(9)
so that the conditional mean (8) minus the conditional mean (9) equals
(10)
Therefore adding the first- and second-order expected values in (7) and (10), the overall bias relative to is
(11)
(12)
by the orthogonality of a projection with its residuals (Lemma S1(i)).
Now we analyze the bias terms (11) and (12) separately; the first is the main bias term, which would arise even if the covariate density were known, and the second is the contribution coming from having to estimate the covariate density.
Crucially, by virtue of using the localized basis , the projections in these bias terms are of stretched versions of the nuisance functions and , on the standard non-localized basis , with weights equal to the stretched density . This is important because stretching a function increases its smoothness; in particular, the stretched and scaled function is -smooth whenever is -smooth. This follows since equals
where the second equality follows by the chain rule, and the third since is -smooth. Thus the above implies
i.e., that is -smooth.
Therefore if is -smooth, then by the Hölder approximation properties (6) of the basis , and so it follows that
(13)
for any -smooth function .
Therefore now consider the bias term (11). This term satisfies
where the second line follows by Cauchy-Schwarz, and the third by (13), since and are assumed - and -smooth, respectively (note is a polynomial, so the smoothness of is the same as ), along with the fact that
where the first inequality follows by Lemma S1(ii), the second by definition of and since , and the last by Proposition 3.
Now for the term in (12), let denote the coefficients of the projection , and note for any functions we have
where the first equality follows by definition, the second line since the norm of the coefficients of a (weighted) projection is no more than the weighted norm of the function itself (Lemma S1(iii)), and the last by the sub-multiplicative property of the operator norm, along with the fact that . ∎
Several of our results require the eigenvalues of to be bounded above and below away from zero. Proposition S6 in the Appendix gives simple sufficient conditions for this to hold (similar to Proposition S5 for the matrix , which was mentioned earlier after Proposition 3).
The next result is a refined version of Proposition 5, giving high-level conditions under which estimation of itself (rather than the matrix ) does not impact the bias. We refer to Remark 11 for more detailed discussion of these conditions, and note that the result follows from Proposition 5 together with Lemma S2 in the Appendix.
Proposition 6.
Under the assumptions of Proposition 5, and if:
1. is bounded below away from zero,
2. is bounded above and below away from zero,
3. ,
then, when , the bias satisfies
4.3 Variance
In this subsection we derive bounds on the conditional variance of the estimators and , given the training sample . The main tool used here is a localized version of second-order U-statistic variance arguments, recognizing that our higher-order estimator is, conditionally, a second-order U-statistic over observations.
Proposition 7.
Assume:
1. , , , and are all bounded above, and
2. is bounded above.
Then
We give the proof of Proposition 7 in Appendix B.6, and so just make some comments here. First, the variance here is analogous to that of a higher-order (quadratic) influence function estimator (cf. Theorem 1 of Robins et al. (2009a)), except with sample size deflated to . This is to be expected given the double localization in our proposed estimator. Another important note is that the contribution to the variance from having to estimate is relatively minimal, compared to the bias, as detailed in Proposition 5. For the bias, non-trivial rate conditions are needed to ensure estimation of does not play a role, whereas for the variance one only needs the operator norm of to be bounded (under regularity conditions, this amounts to the estimator only having bounded errors, in a relative sense, as noted in the following remark).
4.4 Overall Rate
Combining the approximation bias in Proposition 3 with the decomposition in Proposition 4, and the bias and variance bounds from Proposition 6 and Proposition 7, respectively, shows that
(14)
under all the combined assumptions of these results, which are compiled in the statement of Theorem 2 below.
The first two terms in (14) are the bias, with an oracle bias that would remain even if one had direct access to the potential outcomes (or equivalently, samples of for some with conditional mean zero), and analogous to a squared nuisance bias term, but shrunken due to the stretching induced by the localized basis . Similarly, is an oracle variance that would remain even if given access to the potential outcomes, whereas the factor is a contribution from nuisance estimation (akin to the variance of a series regression on basis terms with samples).
Balancing bias and variance in (14) by taking the tuning parameters to satisfy
and
ensures the rate matches the minimax lower bound from Theorem 1 (in the low smoothness regime), proving that lower bound is in fact tight. (In the high smoothness regime where , one can simply take and ). This is formalized in the following theorem.
Theorem 2.
Assume the regularity conditions:
A. The eigenvalues of and are bounded above and below away from zero.
B. is -smooth and is -smooth.
C. The quantities , , , and are all bounded above, and is bounded above and below away from zero.
Also assume the basis satisfies Hölder approximating condition (6), and:
1. ,
2. is -smooth, and for some ,
3. is -smooth,
4. is -smooth.
Finally let the tuning parameters satisfy
if , or and otherwise. Then the estimator from Definition 2 has error upper bounded as
We refer to Section 3.3 for more detailed discussion and visualization of the rate from Theorem 2. Here we give two remarks discussing the regularity conditions A–C and Condition 1 (which ensures the covariate distribution is estimated accurately enough).
Remark 10.
Condition A is a standard collinearity restriction used with least squares estimators; simple sufficient conditions are given in Propositions S5 and S6 in the Appendix. In Lemma S3 in the Appendix we also prove that this condition holds for a class of densities contained in the model in Theorem 1, ensuring that the upper bound holds over the same submodel. A sufficient condition for Condition B to hold is that the estimators and match the (known) smoothness of and ; this would be the case for standard minimax optimal estimators based on series or local polynomial methods. Condition C is a mild boundedness condition on the outcome (which could be weakened at the expense of adding some complexity in the analysis), as well as on the nuisance estimators, and even weaker, on the errors and (which would typically not only be bounded but decreasing to zero).
Remark 11.
First, Condition 1 of Theorem 2 will of course hold if the covariate distribution is estimated at a rate faster than that of the CATE (i.e., the numerator of the rate in Condition 1); however, it also holds under substantially weaker conditions, depending on how accurately and are estimated. This is because the condition really amounts to a third-order term (the covariate distribution error multiplied by a product of nuisance errors) being of smaller order than the CATE rate. Specifically, the result of Theorem 2 can also be written as
(15)
for the third-order error term
so that Condition 1 simply requires this third-order term to be of smaller order than the first minimax optimal rate in (15) (note in the above that matches the overall CATE estimation error, under our tuning parameter choices, which would typically be of smaller order than the regression error ). Second, we note that we leave the condition in terms of the errors and because, although we assume and are - and -smooth, technically, they do not need to be estimated at particular rates for any of the other results we prove to hold. Of course, under these smoothness assumptions, there are available minimax optimal estimators for which
If in addition there exists some for which (e.g., if has a density that is -smooth), then Condition 1 reduces to , for
Exploring CATE estimation under weaker conditions on the covariate distribution is an interesting avenue for future work; we suspect the minimax rate changes depending on what is assumed about this distribution, as is the case for average effects (e.g., page 338 of Robins et al. (2008)) and conditional variance estimation (Wang et al., 2008; Shen et al., 2020).
5 Discussion
In this paper we have characterized the minimax rate for estimating heterogeneous causal effects in a smooth nonparametric model. We derived a lower bound on the minimax rate using a localized version of the method of fuzzy hypotheses, and a matching upper bound via a new local polynomial R-Learner estimator based on higher-order influence functions. We also characterized how the minimax rate changes depending on whether the propensity score or the regression function is smoother, under parametrizations based on either the control regression or the marginal regression function. The minimax rate has several important features. First, it exhibits a so-called elbow phenomenon: when the nuisance functions (regression and propensity scores) are smooth enough, the rate matches that of standard smooth nonparametric regression (the same that would be obtained if potential outcomes were actually observed). On the other hand, when the average nuisance smoothness is below the relevant threshold, the rate obtained is slower. This leads to a second important feature: in the latter low-smoothness regime, the minimax rate is a mixture of the minimax rates for nonparametric regression and functional estimation. This quantifies how the CATE can be viewed as a regression/functional hybrid.
There are numerous important avenues left for future work. We detail a few briefly here, based on: different error metrics, inference and testing, adaptivity, and practical implementation. First, we have focused on estimation error at a point, but one could also consider global rates in or norm, for example. We expect rates to be the same, and rates to be the same up to log factors, but verifying this would be useful. It would also be very important to study the distribution of the proposed estimator, beyond just bias and variance, e.g., for purposes of inference ( rates could also be useful in this respect). Relatedly, one could consider minimax rates for testing whether the CATE is zero, for example, versus -separated in some distance. The goal of the present work is mostly to further our theoretical understanding of the fundamental limits of CATE estimation, so much remains to be done to make the optimal rates obtained here achievable in practice. For example, although we have specified particular values of the tuning parameters and to confirm attainability of our minimax lower bound, it would be practically useful to have more data-driven approaches for selection. In particular, the optimal tuning values depend on the underlying smoothness, and since in practice this is often unknown, a natural next step is to study adaptivity. For example, one could study whether approaches based on Lepski's method could be used, as in Mukherjee et al. (2015) and Liu et al. (2021). There are also potential computational challenges associated with constructing the tensor products in when the dimension is not small, as well as evaluating the U-statistic terms of our estimator, and inverting the matrices and . Finally, in this work we have assumed the nuisance functions are Hölder-smooth, a classic infinite-dimensional function class from which important insights can be drawn. However, it will be important to explore minimax rates in other function classes as well.
Acknowledgements
EK gratefully acknowledges support from NSF DMS Grant 1810979, NSF CAREER Award 2047444, and NIH R01 Grant LM013361-01A1, and SB and LW from NSF Grant DMS1713003. EK also thanks Matteo Bonvini and Tiger Zeng for very helpful discussions.
References
- Athey and Imbens [2016] S. Athey and G. Imbens. Recursive partitioning for heterogeneous causal effects. Proceedings of the National Academy of Sciences, 113(27):7353–7360, 2016.
- Belloni et al. [2015] A. Belloni, V. Chernozhukov, D. Chetverikov, and K. Kato. Some new asymptotic theory for least squares series: pointwise and uniform results. Journal of Econometrics, 186(2):345–366, 2015.
- Bickel and Ritov [1988] P. J. Bickel and Y. Ritov. Estimating integrated squared density derivatives: sharp best order of convergence estimates. Sankhyā, pages 381–393, 1988.
- Birgé and Massart [1995] L. Birgé and P. Massart. Estimation of integral functionals of a density. The Annals of Statistics, 23(1):11–29, 1995.
- Carbery and Wright [2001] A. Carbery and J. Wright. Distributional and norm inequalities for polynomials over convex bodies in . Mathematical Research Letters, 8(3):233–248, 2001.
- Chamberlain [1987] G. Chamberlain. Asymptotic efficiency in estimation with conditional moment restrictions. Journal of Econometrics, 34(3):305–334, 1987.
- Chernozhukov et al. [2017] V. Chernozhukov, M. Goldman, V. Semenova, and M. Taddy. Orthogonal machine learning for demand estimation: High dimensional causal inference in dynamic panels. arXiv preprint arXiv:1712.09988, 2017.
- Fan et al. [2019] Q. Fan, Y.-C. Hsu, R. P. Lieli, and Y. Zhang. Estimation of conditional average treatment effects with high-dimensional data. arXiv preprint arXiv:1908.02399, 2019.
- Foster and Syrgkanis [2019] D. J. Foster and V. Syrgkanis. Orthogonal statistical learning. arXiv preprint arXiv:1901.09036, 2019.
- Foster et al. [2011] J. C. Foster, J. M. Taylor, and S. J. Ruberg. Subgroup identification from randomized clinical trial data. Statistics in Medicine, 30(24):2867–2880, 2011.
- Gao and Han [2020] Z. Gao and Y. Han. Minimax optimal nonparametric estimation of heterogeneous treatment effects. arXiv preprint arXiv:2002.06471, 2020.
- Giné and Nickl [2021] E. Giné and R. Nickl. Mathematical foundations of infinite-dimensional statistical models. Cambridge University Press, 2021.
- Hahn et al. [2020] P. R. Hahn, J. S. Murray, and C. M. Carvalho. Bayesian regression tree models for causal inference: regularization, confounding, and heterogeneous effects. Bayesian Analysis, 2020.
- Hernán and Robins [2020] M. A. Hernán and J. M. Robins. Causal inference: What if. CRC Boca Raton, FL, 2020.
- Ibragimov et al. [1987] I. A. Ibragimov, A. S. Nemirovskii, and R. Khas’minskii. Some problems on nonparametric estimation in gaussian white noise. Theory of Probability & Its Applications, 31(3):391–406, 1987.
- Imai and Ratkovic [2013] K. Imai and M. Ratkovic. Estimating treatment effect heterogeneity in randomized program evaluation. The Annals of Applied Statistics, 7(1):443–470, 2013.
- Ingster et al. [2003] Y. Ingster, J. I. Ingster, and I. Suslina. Nonparametric goodness-of-fit testing under Gaussian models, volume 169. Springer Science & Business Media, 2003.
- Kennedy [2023] E. H. Kennedy. Towards optimal doubly robust estimation of heterogeneous causal effects. Electronic Journal of Statistics, 17(2):3008–3049, 2023.
- Künzel et al. [2019] S. R. Künzel, J. S. Sekhon, P. J. Bickel, and B. Yu. Metalearners for estimating heterogeneous treatment effects using machine learning. Proceedings of the National Academy of Sciences, 116(10):4156–4165, 2019.
- Lee et al. [2017] S. Lee, R. Okui, and Y.-J. Whang. Doubly robust uniform confidence band for the conditional average treatment effect function. Journal of Applied Econometrics, 32(7):1207–1225, 2017.
- Liu et al. [2021] L. Liu, R. Mukherjee, J. M. Robins, and E. J. Tchetgen Tchetgen. Adaptive estimation of nonparametric functionals. Journal of Machine Learning Research, 22(99):1–66, 2021.
- Luedtke and van der Laan [2016] A. R. Luedtke and M. J. van der Laan. Super-learning of an optimal dynamic treatment rule. The International Journal of Biostatistics, 12(1):305–332, 2016.
- Mukherjee et al. [2015] R. Mukherjee, E. J. Tchetgen Tchetgen, and J. M. Robins. Lepski’s method and adaptive estimation of nonlinear integral functionals of density. arXiv preprint arXiv:1508.00249, 2015.
- Mukherjee et al. [2017] R. Mukherjee, W. K. Newey, and J. M. Robins. Semiparametric efficient empirical higher order influence function estimators. arXiv preprint arXiv:1705.07577, 2017.
- Nemirovski [2000] A. Nemirovski. Topics in non-parametric statistics. Ecole d’Eté de Probabilités de Saint-Flour, 28:85, 2000.
- Nie and Wager [2021] X. Nie and S. Wager. Quasi-oracle estimation of heterogeneous treatment effects. Biometrika, 108(2):299–319, 2021.
- Robins [1994] J. M. Robins. Correcting for non-compliance in randomized trials using structural nested mean models. Communications in Statistics - Theory and Methods, 23(8):2379–2412, 1994.
- Robins et al. [1992] J. M. Robins, S. D. Mark, and W. K. Newey. Estimating exposure effects by modelling the expectation of exposure conditional on confounders. Biometrics, 48(2):479–495, 1992.
- Robins et al. [2008] J. M. Robins, L. Li, E. J. Tchetgen Tchetgen, and A. W. van der Vaart. Higher order influence functions and minimax estimation of nonlinear functionals. Probability and Statistics: Essays in Honor of David A. Freedman, pages 335–421, 2008.
- Robins et al. [2009a] J. M. Robins, L. Li, E. J. Tchetgen Tchetgen, and A. W. van der Vaart. Quadratic semiparametric von mises calculus. Metrika, 69(2-3):227–247, 2009a.
- Robins et al. [2009b] J. M. Robins, E. J. Tchetgen Tchetgen, L. Li, and A. W. van der Vaart. Semiparametric minimax rates. Electronic Journal of Statistics, 3:1305–1321, 2009b.
- Robins et al. [2017] J. M. Robins, L. Li, R. Mukherjee, E. J. Tchetgen Tchetgen, and A. W. van der Vaart. Minimax estimation of a functional on a structured high dimensional model. The Annals of Statistics, 45(5):1951–1987, 2017.
- Robinson [1988] P. M. Robinson. Root-n-consistent semiparametric regression. Econometrica, 56(4):931–954, 1988.
- Semenova and Chernozhukov [2017] V. Semenova and V. Chernozhukov. Estimation and inference about conditional average treatment effect and other structural functions. arXiv preprint, 2017.
- Shalit et al. [2017] U. Shalit, F. D. Johansson, and D. Sontag. Estimating individual treatment effect: generalization bounds and algorithms. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 3076–3085. JMLR.org, 2017.
- Shen et al. [2020] Y. Shen, C. Gao, D. Witten, and F. Han. Optimal estimation of variance in nonparametric regression with random design. The Annals of Statistics, 48(6):3589–3618, 2020.
- Tsybakov [2009] A. B. Tsybakov. Introduction to Nonparametric Estimation. New York: Springer, 2009.
- van der Laan [2005] M. J. van der Laan. Statistical inference for variable importance. The International Journal of Biostatistics, 188, 2005.
- van der Laan and Robins [2003] M. J. van der Laan and J. M. Robins. Unified Methods for Censored Longitudinal Data and Causality. New York: Springer, 2003.
- Vansteelandt and Joffe [2014] S. Vansteelandt and M. M. Joffe. Structural nested models and g-estimation: the partially realized promise. Statistical Science, 29(4):707–731, 2014.
- Wager and Athey [2018] S. Wager and S. Athey. Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association, 113(523):1228–1242, 2018.
- Wang et al. [2008] L. Wang, L. D. Brown, T. T. Cai, and M. Levine. Effect of mean on variance function estimation in nonparametric regression. The Annals of Statistics, 36(2):646–664, 2008.
- Zhao et al. [2017] Q. Zhao, D. S. Small, and A. Ertefaie. Selective inference for effect modification via the lasso. arXiv preprint arXiv:1705.08020, 2017.
- Zimmert and Lechner [2019] M. Zimmert and M. Lechner. Nonparametric estimation of causal heterogeneity under high-dimensional confounding. arXiv preprint arXiv:1908.08779, 2019.
Roadmap of Main Results
The following figures illustrate how Theorems 1 and 2 follow from supporting propositions and lemmas (with arrows indicating that a result plays a role in implying another).
Propositions S1 and S2 are not included in the above figures since they support the alternative parametrization detailed in Theorem 3; however, they play roles analogous to Propositions 4 and 5, respectively. Propositions S5 and S6 and Lemma S3 are not included since they are stand-alone results, giving conditions under which the bounded-eigenvalue assumptions hold.
Appendix A Alternative Parametrization
In the main text we focused on the model that puts smoothness assumptions on , and gave a detailed proof of the minimax lower bound in the case. In this section we give lower bound constructions for the case, as well as minimax lower bounds for a different model that puts smoothness assumptions on , i.e., the marginal regression rather than the control regression .
In the main text we also gave an estimator that is minimax optimal for all in the model (under some conditions on the covariate density). That estimator is also optimal in the model, as described in Section A.2 below. Here we also give matching upper bounds for the model using a closely related higher-order estimator (which is itself also minimax optimal in the case in the model).
A.1 Lower Bounds
The following table (Table 1) summarizes the constructions used to derive all minimax lower bounds stated in this paper.

| Model | Regime | Construction |
|---|---|---|
| is -smooth | | |
| is -smooth | | |
Theorem 3 below states the minimax rate for the model that puts smoothness assumptions on the marginal regression function , rather than the control regression .
Theorem 3.
For , let denote the model where:
1. is bounded above by a constant,
2. is -smooth,
3. is -smooth, and
4. is -smooth.
Let . Then for larger than a constant depending on , the minimax rate is lower bounded as
In the following subsections we give proofs of Theorems 1 and 3 for the regimes not detailed in the main text. Given the constructions listed in Table 1, the arguments are all very similar to those presented in the main text for the case.
A.1.1 Setting 1:
When , the same construction can be used for both the and parametrizations. This is the construction in the first and third rows of Table 1, and is detailed in the main text, so the proofs there settle this case.
A.1.2 Setting 2: and is -smooth
For the construction in the fourth row of Table 1, is constant and hence certainly -smooth, and is -smooth as in the main text. Further, we have by Proposition S3 that
with , so that and
Now we detail the quantities from Lemma 2, which yield a bound on the Hellinger distance between mixtures of -fold products from and .
First, since it immediately follows that . Similarly, taking
ensures that and thus , just as in (2) in the main text. For , note that (with the above choices of and ) so that
by the same arguments as for in the main text. Now by Lemma 2 the Hellinger distance is bounded above as
so taking
ensures the Hellinger distance is bounded above by one, and yields the promised minimax rate, since the separation in under this construction is
A.1.3 Setting 3: and is -smooth
For the construction in the second row of Table 1, it is again clear that and are - and -smooth, respectively. Note for the latter that, under , the control regression is -smooth and by definition. By Proposition S3 we have
with and , as in the main text, and
where the second equality follows since . Therefore
as in the main text.
Now we again detail the quantities from Lemma 2.
Since the marginalized densities and here take the same form as those in the main text, the same arguments imply as long as . Since
as long as , we have that
by the same arguments as for in the main text (and in the previous subsection). Similarly, with the above choices of and ensuring , we have
where the second equality follows since , and the last inequality since is bounded. Therefore
by the same arguments as for and in the main text. This implies by Lemma 2 that the Hellinger distance is bounded above by
where the inequality follows since in this setting. Now, as in the main text, taking
ensures the Hellinger distance is bounded above by one, and yields the promised minimax rate, since the separation in under this construction is
A.2 Upper Bounds
In the model where instead of is assumed to be -smooth, the estimator described in the main text is also minimax optimal (since if is -smooth, then is -smooth, and so in the case one gets the same rates as in the main text with replaced by ). However, one can also use the following alternative estimator, which we detail for completeness.
Definition 3 (Higher-Order Local Polynomial R-Learner).
Let and be defined exactly as in Definition 2. The proposed estimator is then defined as
(16)
where is a matrix and a -vector now given by
respectively, with
for a basis of dimension .
The nuisance estimators are constructed from a separate training sample , independent of that on which operates.
Like the estimator (3) in the main text, the estimator (16) can also be viewed as a modified second-order estimator of the projection parameter. However, instead of estimating the moment condition (5) under a linear effect-type assumption, the estimator (16) estimates the numerator and denominator terms and separately. For example, the first term in , i.e., is the usual first-order influence function-based estimator of . Kennedy [2023] analyzed an undersmoothed version of this estimator (where the nuisance estimates and are themselves undersmoothed linear smoothers), calling it the local polynomial R-learner. The second term
is similar to the second-order U-statistic correction that would be added using the higher-order influence function methodology developed by Robins et al. [2008, 2009a, 2017]. However, this term differs in its localization just as described for the estimator in the main text.
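To fix ideas, here is a minimal numerical sketch of the first-order building block just discussed: a pointwise local polynomial R-learner in one dimension, regressing outcome residuals on treatment residuals times a local polynomial basis with kernel weights. It is not the estimator (16), nor an exact implementation of the undersmoothed estimator of Kennedy [2023]; in particular it omits the second-order U-statistic correction, and the triangular kernel, bandwidth, degree, and nuisance functions `pi_hat` and `m_hat` (to be fit on an independent training split, per the sample splitting above) are illustrative choices only.

```python
import numpy as np

def local_poly_r_learner(x0, X, A, Y, pi_hat, m_hat, h=0.25, degree=1):
    """First-order local polynomial R-learner sketch at a point x0 (d = 1).

    Weighted least squares of the outcome residual Y - m_hat(X) on the treatment
    residual A - pi_hat(X) times a polynomial basis in (X - x0)/h, with kernel
    weights localizing to a bandwidth-h neighborhood of x0.  The nuisance
    estimates pi_hat, m_hat should come from an independent training split.
    """
    X, A, Y = map(np.asarray, (X, A, Y))
    res_a = A - pi_hat(X)                              # treatment residual
    res_y = Y - m_hat(X)                               # outcome residual
    u = (X - x0) / h
    w = np.clip(1.0 - np.abs(u), 0.0, None)            # triangular kernel weights
    B = np.vander(u, N=degree + 1, increasing=True)    # local basis 1, u, ..., u^degree
    D = B * res_a[:, None]
    Q = D.T @ (w[:, None] * D)                         # localized "denominator" matrix
    R = D.T @ (w * res_y)                              # localized "numerator" vector
    return np.linalg.solve(Q, R)[0]                    # intercept estimates tau(x0)

# toy check with oracle nuisances: tau(x) = 1 + x should be recovered approximately
rng = np.random.default_rng(0)
n = 20_000
X = rng.uniform(0, 1, n)
pi = lambda x: 0.3 + 0.4 * x
m0 = lambda x: np.sin(4 * x)
tau = lambda x: 1.0 + x
A = rng.binomial(1, pi(X))
Y = m0(X) + A * tau(X) + rng.normal(0, 1, n)
m = lambda x: m0(x) + pi(x) * tau(x)                   # marginal regression E(Y | X)
print(local_poly_r_learner(0.5, X, A, Y, pi, m))       # roughly tau(0.5) = 1.5
```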
Proposition S1.
The estimator (16) satisfies
and further if , , and are all bounded above, then
for a separate independent training sample on which are estimated.
Proof.
We have
by the sub-multiplicative and triangle inequalities of the operator norm, along with the fact that . Together with the bounds on , , and , this yields the first inequality. For the second inequality, first note , as described in the proof of Proposition 3. The second inequality now follows since
using Jensen’s inequality and iterated expectation. The last line follows since the length of is . The logic is the same for . ∎
Proposition S2.
Assume:
1. is bounded above,
2. the basis satisfies approximating condition (6),
3. is -smooth,
4. is -smooth.
Then
Proof.
We only prove the result for , since the logic is the same for . By iterated expectation, the conditional mean of the first-order term is
(17)
where we use the change of variable and again define for any function its corresponding stretched version as . To ease notation it is left implicit that any integral over is only over . Similarly, (minus) the conditional mean of the second-order influence function term is
(18)
where we define
as the -weighted linear projection of on the basis , and as the estimated version, which simply replaces with . Therefore adding the first- and second-order expected values in (17) and (18), the overall bias relative to is
(19)
(20)
where the last line follows from orthogonality of a projection with its residuals (Lemma S1(i)).
Now we analyze the bias terms (19) and (20) separately; the first is the main bias term, which would arise even if the covariate density were known, and the second is the contribution coming from having to estimate the covariate density.
Recall from (13) in the main text that if is -smooth, then by the Hölder approximation properties (6) of the basis , and so it follows that
for any -smooth function . Now consider the bias term (19), which satisfies
where the second line follows by Cauchy-Schwarz, and the third by (13),
since and are assumed - and -smooth, respectively (note is a polynomial, so the smoothness of is the same as ).
Now for the term in (20), let denote the coefficients of the projection , and note for any functions we have
where the first equality follows by definition, the second line since the norm of the coefficients of a (weighted) projection is no more than the weighted norm of the function itself (Lemma S1(iii)), and the last by the sub-multiplicative property of the operator norm, along with the fact that . ∎
We omit details on variance calculations because the logic is exactly the same as for the estimator (3) studied in Section 4.3 of the main text.
Combining the approximation bias in Proposition 3 with the decomposition in Proposition S1, and the relevant bias and variance bounds together with sufficient conditions on covariate density estimation, we obtain that the estimator (16) has error satisfying
(21)
Balancing terms as in Section 4.4 yields the minimax rate from Theorem 3.
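As a toy illustration of this balancing step only (not of the actual terms in (21), whose form is specific to this problem), the following symbolic sketch equates a textbook squared-bias term of order k^{-2g/d} with a variance term of order k/n and recovers the familiar series-regression rate; the exponents are placeholders, chosen purely to show the mechanics of trading off a tuning parameter.

```python
import sympy as sp

# Placeholder squared bias ~ k**(-2*g/d) and variance ~ k/n (textbook series
# regression orders, NOT the terms of (21)).  Balancing them pins down the
# tuning parameter k and the resulting squared-error rate.
n, g, d = sp.symbols('n g d', positive=True)
k_star = n ** (d / (d + 2 * g))                     # solves k**(-2*g/d) = k/n
bias_sq = sp.powsimp(k_star ** (-2 * g / d))        # n**(-2*g/(d + 2*g))
variance = k_star / n
print(sp.simplify(bias_sq / variance))              # 1: the two terms balance at k_star
print(bias_sq)                                      # the familiar n**(-2*g/(2*g + d)) rate
```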
Appendix B Additional Results
B.1 Density Bounds in Proposition 1
Here we show that the relevant densities and density ratios from Proposition 1 are appropriately bounded.
Proof.
When then it follows that on we have
(22)
Further, since , , and , we have on that
regardless of the values of . Therefore when , the above bound implies
(23)
Similarly, when (which implies ) we also have
Note that, although Robins et al. [2009b] assume is uniformly lower bounded away from zero in their version of Lemma 2, they only use a bound on to ensure their quantity is bounded (see page 1319). Therefore this condition also holds in our case. ∎
B.2 Density Expressions
Proposition S3.
For binary and , the density of can be written as
for and , or equivalently as
for .
Proposition S4.
We note that the densities are both equal to for all away from , since for and , and since for .
B.3 Proof of Proposition 3
Proof.
To ease notation we prove the result in the case but the logic is the same when . First note that the local polynomial projection operator for
reproduces polynomials, in the sense that, for any polynomial of the form , , we have
Then for we have that equals
(24)
where the first line follows by Taylor expansion (with for some ), the second by the polynomial reproducing property, the third since is -smooth, the fourth by the triangle inequality with , and the last by a change of variable (so that ).
Now the result follows, since the bracketed term on the right in (24) is bounded under the stated assumptions. Specifically, it equals
where the second line follows from the submultiplicative property of the operator norm and since , and the last from Assumptions 2 and 3 and since
for all (Belloni et al. [2015], Example 3.1), since each Legendre term satisfies for , and the length of is , i.e., the maximum number of monomials in a polynomial in variables with degree up to . ∎
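The polynomial-reproducing property used at the start of this proof is easy to check numerically: a weighted least-squares projection onto the span of monomials up to degree q returns any such polynomial exactly, while a generic smooth function incurs approximation error. The design, kernel weights, and degree below are arbitrary illustrative choices, not those of the estimator.

```python
import numpy as np

rng = np.random.default_rng(0)
n, q = 500, 3
x = rng.uniform(-1, 1, n)                    # design points in a local window
w = np.clip(1 - np.abs(x), 0, None)          # nonnegative kernel weights
B = np.vander(x, N=q + 1, increasing=True)   # monomial basis 1, x, ..., x^q

def wls_project(g_vals):
    """Weighted least-squares projection of g onto span{1, x, ..., x^q}."""
    coef, *_ = np.linalg.lstsq(np.sqrt(w)[:, None] * B, np.sqrt(w) * g_vals, rcond=None)
    return B @ coef

g = 2.0 - x + 0.5 * x ** 3                   # a polynomial of degree <= q
print(np.max(np.abs(wls_project(g) - g)))    # ~1e-13: the projection reproduces it exactly
h = np.sin(3 * x)                            # a non-polynomial function
print(np.max(np.abs(wls_project(h) - h)))    # visibly nonzero approximation error
```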
B.4 Proof of Proposition 4
Proof.
For the first inequality, we have
by the sub-multiplicative and triangle inequalities of the operator norm. For the second inequality, first note , as described in the proof of Proposition 3. The result now follows from the bounds on and and since
by Jensen’s inequality and iterated expectation. The last line follows since . ∎
B.5 Eigenvalues of and
Proposition S5.
If (i) and (ii) the density is bounded above and below away from zero on , then the eigenvalues of are bounded above and below away from zero.
Proof.
Define the stretched function for any . This maps values of in the small cube around to the whole space . Then note that, with the change of variables ,
Next note that , so the eigenvalues of will be bounded if those of the matrix
are. But by orthonormality of the Legendre polynomials on , and the local boundedness of ensures is bounded above and below away from zero, for the uniform measure. Therefore Proposition 2.1 of Belloni et al. [2015] yields the result. ∎
Proposition S6.
If (i) the basis is orthonormal with respect to the uniform measure, and (ii) the density is bounded above and below away from zero on , then the eigenvalues of are bounded above and below away from zero.
Proof.
The proof is similar to that of Proposition S5. First note that
by the change of variables , and where as before. Further by the assumed orthonormality, and the local boundedness of ensures is bounded above and below away from zero, for the uniform measure. Therefore Proposition 2.1 of Belloni et al. [2015] yields the result. ∎
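The following Monte Carlo sketch illustrates Propositions S5 and S6 in one dimension (it is not part of any proof): for a Legendre basis made orthonormal on [0, 1] with respect to the uniform measure, and a covariate density bounded between 0.5 and 1.5, the empirical Gram matrix has eigenvalues that stay within roughly those same bounds. The particular density, basis size, and sample size are arbitrary illustrative choices.

```python
import numpy as np
from numpy.polynomial import legendre

def shifted_legendre_basis(x, k):
    """First k Legendre polynomials shifted to [0, 1] and scaled to be
    orthonormal with respect to the uniform measure on [0, 1]."""
    cols = []
    for j in range(k):
        c = np.zeros(j + 1); c[j] = 1.0
        cols.append(np.sqrt(2 * j + 1) * legendre.legval(2 * x - 1, c))
    return np.stack(cols, axis=1)

rng = np.random.default_rng(1)
k, n = 8, 100_000
# density on [0, 1] bounded between 0.5 and 1.5: rejection sampling from
# f(x) = 1 + 0.5*sin(2*pi*x) with a uniform proposal and envelope 1.5
x = rng.uniform(0, 1, 4 * n)
keep = rng.uniform(0, 1.5, 4 * n) < 1 + 0.5 * np.sin(2 * np.pi * x)
x = x[keep][:n]

B = shifted_legendre_basis(x, k)
G = B.T @ B / len(x)                     # Monte Carlo estimate of E[b(X) b(X)^T]
eig = np.linalg.eigvalsh(G)
print(eig.min(), eig.max())              # eigenvalues stay within roughly [0.5, 1.5]
```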
B.6 Proof of Proposition 7
Proof.
Given the training data , is a second-order U-statistic with kernel
Thus its conditional variance is given by the usual variance of a second-order U-statistic (e.g., Lemma 6 of Robins et al. [2009a]), which is
(25)
for , if . In our case equals
Therefore for the first term in (25) we have
where the second inequality follows by the change of variables , and since all of are uniformly bounded, , and (Belloni et al. [2015], Example 3.1); and the third follows from Lemma S1(ii). Now, consider the second term in (25); we only detail the term in involving since the logic is the same for the term, and the terms involving are bounded by standard arguments. Letting and , we have upper bounded by a multiple of
where the first line follows by definition, the second since and are uniformly bounded and , the third by a change of variables , by definition of and , and since is bounded, the fourth by definition since , and the last two by basic properties of the Frobenius norm (e.g., the triangle inequality and ) together with . ∎
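The "usual variance of a second-order U-statistic" invoked at the start of this proof can be sanity-checked numerically with a toy kernel. The kernel h(z1, z2) = z1*z2 and the Gaussian data below are placeholders unrelated to the kernel of the estimator, and the closed-form zeta1, zeta2 are computed for that toy choice only.

```python
import numpy as np

rng = np.random.default_rng(2)

def u_stat2(z):
    """Degree-2 U-statistic with symmetric kernel h(z1, z2) = z1 * z2."""
    n = len(z)
    s = z.sum()
    # sum_{i != j} z_i z_j = (sum z)^2 - sum z^2, averaged over n(n-1) ordered pairs
    return (s ** 2 - (z ** 2).sum()) / (n * (n - 1))

n, reps, mu, sigma = 50, 20_000, 1.0, 1.0
sims = np.array([u_stat2(rng.normal(mu, sigma, n)) for _ in range(reps)])

# textbook variance of a degree-2 U-statistic:
#   Var(U) = (2 / (n (n - 1))) * (2 (n - 2) zeta1 + zeta2),
# with zeta1 = Cov(h(Z1, Z2), h(Z1, Z3)) and zeta2 = Var(h(Z1, Z2)).
zeta1 = sigma ** 2 * mu ** 2                       # for h = z1*z2 and Z ~ N(mu, sigma^2)
zeta2 = (sigma ** 2 + mu ** 2) ** 2 - mu ** 4
var_formula = 2.0 / (n * (n - 1)) * (2 * (n - 2) * zeta1 + zeta2)
print(sims.var(), var_formula)                     # the two should agree closely
```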
B.7 Technical Lemmas
Lemma S1.
Let denote a -weighted projection with for , and let denote the coefficients of the projection. Then:
(i) projections are orthogonal to residuals, i.e.,
(ii) the norm of a projection is no more than that of the function, i.e.,
and the norm of the approximation error is no more than that of the function, i.e.,
(iii) the norm of the scaled coefficients is no more than the norm of the function, i.e.,
Proof.
For (i) let so that and note that
where the last line follows since .
For (ii) we have by definition that equals
(26)
and so
which implies the results, as long as , so that the far left and right sides are non-negative.
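Lemma S1 can also be checked empirically. The sketch below forms a w-weighted projection onto a small polynomial basis and verifies residual orthogonality, the norm contractions in (ii), and one natural reading of (iii) (that the Omega-scaled coefficient norm is at most the weighted norm of the function, with Omega the weighted Gram matrix). Since the displayed inequalities themselves are not reproduced above, this is an illustration of the standard statements rather than of the exact ones in the lemma.

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 5_000, 4
x = rng.uniform(-1, 1, n)
w = 0.5 + rng.uniform(0, 1, n)               # positive weights, bounded away from zero
B = np.vander(x, N=k, increasing=True)       # basis b(x) = (1, x, x^2, x^3)
g = np.sin(2 * x) + x ** 2                   # an arbitrary function to project

def wdot(u, v):                              # empirical w-weighted inner product
    return np.mean(w * u * v)

Omega = (B * w[:, None]).T @ B / n           # Omega = E_n[w b b^T]
beta = np.linalg.solve(Omega, (B * w[:, None]).T @ g / n)
proj = B @ beta                              # w-weighted projection of g onto span(b)

# (i) residuals are orthogonal (in the w-inner product) to every basis function
print(max(abs(wdot(g - proj, B[:, j])) for j in range(k)))     # ~ 0
# (ii) norms of the projection and of the residual are both <= norm of g
print(wdot(proj, proj) <= wdot(g, g), wdot(g - proj, g - proj) <= wdot(g, g))
# (iii) scaled coefficient norm is <= weighted norm of g: beta^T Omega beta <= ||g||_w^2
print(beta @ Omega @ beta <= wdot(g, g))
```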
Lemma S2.
Assume:
1.
2.
Then
and
Proof.
The logic mirrors that of the proof of Proposition 2.1 in Belloni et al. [2015]. Note that
and by the same logic, . Therefore
by the min-max theorem, which gives the first inequality. For the second, note
by the sub-multiplicative property of the operator norm, and then
by the same logic as above. ∎
B.8 Covariate Density
Although the lower bounds in Theorems 1 and 3 are given for a simple model where the covariate density is merely bounded, the rates are only attainable when the covariate density is known or can be estimated accurately enough, as discussed in Remark 11. More specifically, the lower and upper bounds match for a model in which the covariate density is known, satisfies , and has local support on a union of no more than disjoint cubes all with proportional volume, for and defined in Proposition 2. The lower bounds come from Theorems 1 and 3, since the density used in our construction satisfies these conditions, while the matching upper bound is implied by the results in Section 4 together with the following lemma, in which we prove that Condition A. holds for this class of densities.
Lemma S3.
Assume:
1. satisfies ,
2. is bounded above and below away from zero on its local support , which is a union of no more than disjoint cubes all with proportional volume, for and defined in Proposition 2.
Let denote the distribution in , the -ball around , mapped to , and similarly for . Then the eigenvalues of
are bounded above and below away from zero, and there exists a basis with Hölder approximation property (6), for which the eigenvalues of
are bounded above and below away from zero.
Proof.
Note that since and since is bounded above and below on its support, the eigenvalues of will be bounded if those of the matrix
are, where indicates the th disjoint cube making up the local support of . By the min-max theorem, the eigenvalues are bounded by the min and max of
over all with . First consider lower bounding the eigenvalues. Note is a polynomial of degree at most . Therefore
where the last line follows since
from a change of variables , so that and , together with Assumption 1 of the lemma, and by Theorem 4 of Carbery and Wright [2001]. The last equality follows if we choose . To upper-bound the eigenvalues, note that
since by the orthonormality of .
Now consider the eigenvalues of . We will construct a basis of order for which the eigenvalues of are bounded and for which the Hölder approximation property (6) holds.
Remark 12.
For simplicity we only consider the case where there are exactly cubes in the local support of ; if there are finitely many cubes, say , the arguments are more straightforward, orthonormalizing a standard series times, once per cube, and taking terms from each. We omit details in the intermediate case where the number of cubes scales at a rate slower than , but we expect that arguments similar to those below can be used.
First, let denote the tensor products of a Legendre basis of order , orthonormal on , i.e., tensor products of
for and . This basis satisfies
Now shift and rescale so that the basis is orthonormal on with
and so satisfies
where we used the change of variable so that and when . Finally define the basis
of length . Then
Therefore since is bounded above and below on its support, the -th diagonal block of is proportional to
Therefore the eigenvalues of are all proportional to one and bounded as desired. Further, by the same higher-order kernel arguments for local polynomials as in Proposition 3, the basis satisfies the Hölder approximation property (6). ∎
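To illustrate the per-cube construction at the end of this proof (in one dimension, with two disjoint intervals standing in for the disjoint cubes), the sketch below orthonormalizes a Legendre basis separately on each piece of the support and checks that the resulting Gram matrix, under a density supported on those pieces, is well conditioned. The intervals, mixture density, and degree are illustrative choices only, not those appearing in the lemma.

```python
import numpy as np
from numpy.polynomial import legendre

def legendre_on_interval(x, lo, hi, q):
    """Legendre polynomials of degree 0..q shifted/rescaled to [lo, hi],
    orthonormal there with respect to Lebesgue measure, and zero outside."""
    u = 2 * (x - lo) / (hi - lo) - 1.0
    inside = (x >= lo) & (x <= hi)
    cols = []
    for j in range(q + 1):
        c = np.zeros(j + 1); c[j] = 1.0
        val = np.sqrt((2 * j + 1) / (hi - lo)) * legendre.legval(u, c)
        cols.append(np.where(inside, val, 0.0))
    return np.stack(cols, axis=1)

rng = np.random.default_rng(4)
q = 2
cubes = [(0.0, 0.3), (0.6, 0.9)]                 # two disjoint "cubes" (intervals, d = 1)
# density: uniform mixture over the two intervals, bounded above/below on its support
n = 100_000
which = rng.integers(0, 2, n)
lo_arr = np.array([c[0] for c in cubes])[which]
hi_arr = np.array([c[1] for c in cubes])[which]
x = rng.uniform(lo_arr, hi_arr)

B = np.hstack([legendre_on_interval(x, lo, hi, q) for lo, hi in cubes])
G = B.T @ B / n                                  # estimate of E[b(X) b(X)^T]
print(np.round(np.linalg.eigvalsh(G), 3))        # all ~ 5/3: bounded away from 0 and infinity
```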