
Nonparametric Estimation of the Continuous Treatment Effect with Measurement Error

Wei Huang (School of Mathematics and Statistics, University of Melbourne, Australia; [email protected]) and Zheng Zhang (Center for Applied Statistics, Institute of Statistics & Big Data, Renmin University of China, China; [email protected])
Abstract

We identify the average dose-response function (ADRF) for a continuously valued error contaminated treatment by a weighted conditional expectation. We then estimate the weights nonparametrically by maximising a local generalised empirical likelihood subject to an expanding set of conditional moment equations incorporated into the deconvolution kernels. Thereafter, we construct a deconvolution kernel estimator of ADRF. We derive the asymptotic bias and variance of our ADRF estimator and provide its asymptotic linear expansion, which helps conduct statistical inference. To select our smoothing parameters, we adopt the simulation-extrapolation method and propose a new extrapolation procedure to stabilise the computation. Monte Carlo simulations and a real data study illustrate our method’s practical performance.

keywords:
continuous treatment, deconvolution kernel, measurement error, sieve estimator, simulation-extrapolation method, stabilised weights
Footnote: The authors contributed equally to this work and are listed in alphabetical order.

1 Introduction

Identifying and estimating the causal effect of a treatment or policy from observational studies is of great interest to economics, social science, and public health researchers. There, confounding issues usually exist (i.e. individual characteristics are related to both the treatment selection and the potential outcome), making the causal effect not directly identifiable from the data. Early studies focused on whether an individual receives the treatment or not (e.g. Rosenbaum and Rubin, 1983, 1984; Hahn, 1998; Hirano et al., 2003). More recently, as opposed to such binary treatments, researchers have been investigating the causal effect of a continuously valued treatment, where the effect depends not only on the introduction of the treatment but also on the intensity; see Hirano and Imbens (2004); Galvao and Wang (2015); Kennedy et al. (2017); Fong et al. (2018); Huber et al. (2020); Dong et al. (2021); Ai et al. (2021, 2022), and Huang et al. (2021), among others. However, all these methods require the treatment data to be measured without errors.

In many empirical applications, the treatment and confounding variables may be inaccurately observed. For example, Mahajan (2006) studied the binary treatment effect when the observed treatment indicators are subject to misclassification; see also Lewbel (2007) and Molinari (2008), among others. Battistin and Chesher (2014) investigated the extent to which confounder measurement error affects the analysis of the treatment effect. Continuous treatment variables are also likely to be measured with error in practice. For example, in the Epidemiologic Study Cohort data from the first National Health and Nutrition Examination (NHANES-I) (see Carroll et al., 2006), over 75% of the variance in the fat intake data is made up of measurement error. However, to the best of our knowledge, no existing work has considered the identification and estimation of the causal effect from such error-contaminated continuous treatment data. To bridge this gap in the body of knowledge, we focus on continuous treatment data measured with classical error; that is, instead of observing the received treatment, researchers only observe the sum of the treatment and a random error.

In particular, we study a causal parameter of primary interest to scholars in the literature, the average dose-response function (ADRF). It is defined as the population mean of the individual potential outcome corresponding to certain levels of the treatment. To resolve the confounding problem, we adopt the most widely imposed assumption in the literature on treatment effects, the unconfoundedness condition (e.g. Rosenbaum and Rubin, 1983, 1984; Hirano and Imbens, 2004), which assumes that the assignment of the treatment is independent of the potential outcome of interest given a set of observable covariates.

Perhaps the most straightforward approach to this problem is based on a parametric model specifying how the outcome relates to the confounders and the treatment. One may then apply the parametric methods for measurement error data in the literature (see e.g. Carroll et al., 2006). However, the parametric approach suffers from the model misspecification problem and does not incorporate available information on the treatment mechanism. Thus, this paper focuses on robust nonparametric estimation of the ADRF.

In the literature on error-free continuous treatment data, nonparametric estimation of the ADRF under the unconfoundedness assumption has been extensively studied. For example, Galvao and Wang (2015) identified the ADRF via an unconditional weighted expectation, with the weighting function being the ratio of two conditional densities of the treatment variable. They estimated these two conditional densities separately and then constructed the estimator of the ADRF. However, it is well known that such a ratio estimator can be unstable owing to its sensitivity to the estimation of the denominator (Kang and Schafer, 2007). To improve the robustness of estimation, Kennedy et al. (2017) developed a doubly robust estimator of the ADRF by regressing a doubly robust mapping on the treatment; see the more detailed discussion in Remark 2 of Section 3.1. Ai et al. (2021) identified the ADRF via a weighted conditional expectation given the treatment variable, where the weighting function is the ratio of the treatment variable's marginal density to its conditional density given the confounders. They estimated the weighting function directly (rather than the two densities in the ratio separately) by maximising entropy subject to some moment restrictions that identify the weighting function, and then obtained the nonparametric weighted estimator of the ADRF. The idea of estimating the density ratio directly has also been exploited in the (bio)statistics literature, based on parametric modelling (Qin, 1998), semiparametric modelling (Cheng and Chu, 2004), and machine learning with an augmented dataset (Díaz et al., 2021).

However, no existing method for error-free treatment data can be easily extended to error-contaminated treatment data. Indeed, under the unconfoundedness assumption, all those methods depend on estimating a weighting function with the treatment's conditional density given the confounders in the denominator. Estimating this density is a critical step and is full of challenges in the presence of measurement error. Although nonparametric methods for estimating the (conditional) density of a variable measured with classical error have been developed (see e.g. Fan, 1991a, b; Meister, 2006), estimating the density and plugging it into the denominator suffers from the instability issue mentioned above, which can be even worse when the treatment is measured with errors. Unfortunately, existing techniques for improving the robustness do not apply to measurement error data; for example, the approach of Ai et al. (2021) relies heavily on moment information about general functions of the treatment variable, which is hard to obtain in the presence of measurement error. To the best of our knowledge, estimating the moment of a general function of a variable measured with errors, and the associated theoretical behaviour, are challenging problems that remain open in the literature (see Hall and Lahiri, 2008 and our discussion in Section 3.1 for more details).

We propose a broad class of novel and robust nonparametric estimators of the ADRF when the treatment data are measured with error. We first represent the ADRF, as in Ai et al. (2021), as a weighted conditional expectation of the observed outcome given the treatment variable, where the weighting function is the ratio of the treatment variable's marginal density to its conditional density given the confounders. We then propose a novel approach that identifies the weighting function without estimating the moment of a general function of the treatment. Specifically, we propose a local empirical likelihood technique that only requires estimating the conditional moment of a general function of the error-free confounders given each fixed value of the treatment. A consistent nonparametric estimator of such a conditional moment can be obtained using a local constant deconvolution kernel estimator from the errors-in-variables regression literature; see Fan and Truong (1993); Carroll and Hall (2004); Hall et al. (2007); Delaigle et al. (2009), and Meister (2009), among others. Moreover, by choosing different criterion functions for the local generalised empirical likelihood, the proposed method produces a wide class of weights, including exponential tilting, empirical likelihood and generalised regression as important special cases. Once the weights are consistently estimated, we construct our estimator of the ADRF by estimating the weighted conditional expectation of the observed outcome using another local constant deconvolution kernel estimator.

Based on our identification theorem and the literature on the local constant deconvolution kernel estimator, the consistency of our estimator is straightforward. However, the asymptotic behaviour of a local constant deconvolution kernel estimator incorporating such local generalised empirical likelihood weights has never been investigated. We show that the asymptotic bias of our estimator consists of two parts, one from estimating the weights and one from estimating the weighted conditional expectation, but that it can achieve the same convergence rate as in the error-free case. The asymptotic variance depends on the type of measurement error. We study the two most commonly used types of measurement error: the ordinary smooth and the supersmooth ones. The convergence rate of the variance is slower than in the error-free case but achieves the optimal rate for nonparametric estimation in the measurement error literature.

Moreover, it is well known that, without imposing additional conditions, the explicit expression of the asymptotic variance of a deconvolution kernel estimator for supersmooth errors (e.g. Gaussian errors) is not derivable; even the exact convergence rate remains unknown. Therefore, it is common to use residual bootstrap methods for inference (see e.g. Fan, 1991a; Delaigle et al., 2009, 2015). However, residual bootstrap methods do not work well in our treatment effect framework. We thus provide the asymptotic linear expansion of our ADRF estimator and, based on it, propose an undersmoothed (pointwise) confidence interval for the ADRF.

Data-driven bandwidth selection for a local polynomial deconvolution kernel estimator is challenging. To the best of our knowledge, the only method in the literature is the simulation-extrapolation (SIMEX) method proposed by Delaigle and Hall (2008). However, the linear back-extrapolation suggested by those authors performed unstably in our numerical studies. We thus propose a new local constant extrapolant function. Monte Carlo simulations show that our estimator is stable and outperforms the naive estimator that does not adjust for the measurement error. We also demonstrate the practical value of our estimator by studying the causal effect of fat intake on breast cancer using the Epidemiologic Study Cohort data from NHANES-I (Carroll et al., 2006).

The remainder of the paper is organised as follows. We introduce the basic framework and notations in Section 2. Section 3 presents our estimation procedure, followed by the asymptotic studies of our estimator in Section 4. We discuss the method of selecting our smoothing parameters in Section 5. Finally, we present our simulation results and real data analysis in Section 6.

2 Basic Framework

We consider a continuously valued treatment in which the observed treatment variable is denoted by $T$, with probability density function $f_{T}(t)$ and support $\mathcal{T}\subset\mathbb{R}$. Let $Y^{\ast}(t)$ denote the potential outcome if one were treated at level $t$, for $t\in\mathcal{T}$. In practice, each individual can receive only one treatment level $T$, and we observe only the corresponding outcome $Y := Y^{\ast}(T)$. We are also given a vector of covariates $\boldsymbol{X}\in\mathbb{R}^{r}$, with $r$ a positive integer, which is related to both $T$ and $Y^{\ast}(t)$ for $t\in\mathcal{T}$. Moreover, we consider the situation in which the treatment level is measured with classical error; that is, instead of observing $T$, we observe $S$ such that

$S = T + U,$ (1)

where $U$ is the measurement error, independent of $T$, $\boldsymbol{X}$ and $\{Y^{\ast}(t)\}_{t\in\mathcal{T}}$, and its characteristic function $\phi_{U}$ is known; see Remark 1 for the case in which $\phi_{U}$ is unknown.

The goal of this paper is to nonparametrically estimate the unconditional ADRF $\mu(t) := \mathbb{E}\{Y^{\ast}(t)\}$ from an independent and identically distributed (i.i.d.) sample $\{S_{i},\boldsymbol{X}_{i},Y_{i}\}_{i=1}^{N}$ drawn from the joint distribution of $(S,\boldsymbol{X},Y)$.

$Y^{\ast}(t)$ is never observed simultaneously for all $t\in\mathcal{T}$, or even on a dense subset of $\mathcal{T}$, for any individual; it is observed only at a particular treatment level $T$. Thus, to identify $\mu(t)$ from the observed data, the following assumption is imposed in most of the treatment effect literature (e.g. Rubin, 1990; Kennedy et al., 2017; Ai et al., 2021, 2022; D'Amour et al., 2021).

Assumption 1

We assume

(i) (Unconfoundedness) for all $t\in\mathcal{T}$, given $\boldsymbol{X}$, $T$ is independent of $Y^{\ast}(t)$; that is, $T\perp\{Y^{\ast}(t)\}_{t\in\mathcal{T}}\,|\,\boldsymbol{X}$;

(ii) (No Interference) for $i=1,\ldots,N$, the outcome of individual $i$ is not affected by the treatment assignments to any other individuals; that is, $Y_{i}^{\ast}(T_{i},\boldsymbol{T}_{(-i)})=Y_{i}^{\ast}(T_{i},\boldsymbol{T}'_{(-i)})$ for any $\boldsymbol{T}_{(-i)}$, $\boldsymbol{T}'_{(-i)}$, where $Y_{i}^{\ast}(T_{i},\boldsymbol{T}_{(-i)})$ is the potential outcome of individual $i$ when the treatment assignments to individual $i$ and to the others are $T_{i}$ and $\boldsymbol{T}_{(-i)}:=(T_{1},\ldots,T_{i-1},T_{i+1},\ldots,T_{N})$, respectively;

(iii) (Consistency) $Y=Y^{\ast}(t)$ a.s. if $T=t$;

(iv) (Positivity) the generalised propensity score $f_{T|X}$ satisfies $f_{T|X}(t|\boldsymbol{X})>0$ a.s. for all $t\in\mathcal{T}$.

Under Assumption 1, for every fixed $t\in\mathcal{T}$, $\mu(t)$ can be identified as follows:

$\mu(t) = \mathbb{E}[\mathbb{E}\{Y^{\ast}(t)|\boldsymbol{X}\}] = \mathbb{E}[\mathbb{E}\{Y^{\ast}(t)|\boldsymbol{X},T=t\}]$ (by Assumption 1 (i) and (ii))
$= \mathbb{E}\{\mathbb{E}(Y|\boldsymbol{X},T=t)\}$ (by Assumption 1 (iii))
$= \int_{\mathcal{X}}\int_{\mathcal{Y}}\frac{f_{T}(t)}{f_{T|X}(t|\boldsymbol{x})}\,y\,f_{Y|X,T}(y|\boldsymbol{x},t)\,f_{X|T}(\boldsymbol{x}|t)\,dy\,d\boldsymbol{x}$ (by Assumption 1 (iv))
$= \mathbb{E}\{\pi_{0}(t,\boldsymbol{X})Y\,|\,T=t\},$ (2)

where

$\pi_{0}(t,\boldsymbol{x}) := \frac{f_{T}(t)}{f_{T|X}(t|\boldsymbol{x})},$ (3)

with $f_{T}$ and $f_{T|X}$ being the density function of $T$ and the conditional density of $T$ given $\boldsymbol{X}$, respectively. The function $\pi_{0}(t,\boldsymbol{x})$ is called the stabilised weight function in Ai et al. (2021).

If $T$ is fully observable and $\pi_{0}(t,\boldsymbol{x})$ is known, estimating $\mu(t)$ in (2) reduces to a standard regression problem. For example, $\mu(t)$ can be consistently estimated using the Nadaraya–Watson estimator:

$\mu_{NW}(t) := \frac{\sum_{i=1}^{N}\pi_{0}(t,\boldsymbol{X}_{i})Y_{i}L\{(t-T_{i})/h\}}{\sum_{i=1}^{N}L\{(t-T_{i})/h\}}, \quad t\in\mathcal{T},$ (4)

where $L(\cdot)$ is a prespecified univariate kernel function such that $\int_{-\infty}^{\infty}L(x)\,dx=1$ and $h$ is the bandwidth. However, in practice we observe only $S$ in (1) rather than $T$, and $\pi_{0}(t,\boldsymbol{x})$ is also unknown. We address these issues in the next section.
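
For intuition, the following is a minimal Python sketch of the oracle estimator (4); it applies only in the idealised setting of this paragraph, where the $T_i$'s are observed and $\pi_0$ is known, and the Gaussian kernel and toy data-generating process are assumptions of the illustration, not part of the paper.

```python
import numpy as np

def nw_weighted(t, T, Y, pi0_vals, h,
                L=lambda u: np.exp(-u**2 / 2) / np.sqrt(2 * np.pi)):
    """Oracle weighted Nadaraya-Watson estimator (4) of mu(t); requires
    error-free treatments T and known stabilised weights pi0_vals."""
    w = L((t - T) / h)                       # ordinary kernel weights around t
    return np.sum(pi0_vals * Y * w) / np.sum(w)

# Toy illustration with pi0 = 1 (no confounding) and mu(t) = sin(t).
rng = np.random.default_rng(0)
T = rng.uniform(-2, 2, 500)
Y = np.sin(T) + 0.1 * rng.standard_normal(500)
print(nw_weighted(0.5, T, Y, np.ones_like(T), h=0.3))  # ~ sin(0.5) = 0.479
```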

Remark 1

The characteristic function $\phi_{U}$ of the measurement error may be unknown in some empirical applications. Several methods for estimating it consistently have been proposed in the literature. For example, Diggle and Hall (1993) assumed that data from the error distribution are observable and proposed estimating $\phi_{U}$ from those error data nonparametrically. In some applications, the error-contaminated observations are replicated, and we can estimate the error density from these replicates; see Delaigle et al. (2009). When $\phi_{U}$ is known up to certain parameters, parametric methods are applicable (e.g. Meister, 2006). Finally, a nonparametric method that does not require any additional data is also available (e.g. Delaigle and Hall, 2016). Once a consistent estimator of $\phi_{U}$ is obtained, our proposed method can be directly applied.

In particular, the assumptions that (a) the error distribution is known up to certain parameters that are identifiable from previous studies, and (b) repeated error-contaminated data are available, are commonly met in practice. When assumption (a) is satisfied, we can usually obtain $\sqrt{N_{U}}$-consistent estimators of the parameters, and thus a $\sqrt{N_{U}}$-consistent estimator of $\phi_{U}(t)$ uniformly in $t\in\mathbb{R}$, where $N_{U}$ is the sample size of the previous studies. For example, in our real data example in Section 6.3, the error distribution is known up to its variance. Since the convergence rate of our proposed estimator $\widehat{\mu}(t)$ is slower than $N^{-1/2}$, its asymptotic behaviour is insensitive to the estimation of $\phi_{U}$ provided that $N=O(N_{U})$. Under assumption (b), we observe

$S_{jk} = T_{j} + U_{jk}, \quad k=1,2, \quad j=1,\ldots,N,$

where the $U_{jk}$'s are i.i.d. Note that

$\mathbb{E}[\exp\{it(S_{j1}-S_{j2})\}] = \mathbb{E}\{\exp(itU_{j1})\}\,\mathbb{E}\{\exp(-itU_{j2})\} = |\phi_{U}(t)|^{2}.$

Delaigle et al. (2008) proposed estimating $\phi_{U}(t)$ by $\widehat{\phi}_{U}(t) = \big|N^{-1}\sum_{j=1}^{N}\cos\{t(S_{j1}-S_{j2})\}\big|^{1/2}$. They showed that, under some regularity conditions, the local constant deconvolution kernel estimator using this $\widehat{\phi}_{U}(t)$ behaves asymptotically the same as the one using $\phi_{U}(t)$. Specifically, when $U$ is ordinary smooth (as defined in (18)), they require $f_{T}$ to be sufficiently smooth relative to $U$'s density. In contrast, if $U$ is supersmooth as in (19), the optimal convergence rate of a local constant deconvolution kernel estimator is logarithmic in $N$, which is so slow that the error incurred by estimating $\phi_{U}$ is negligible. This result can be extended to our setting.
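
As a hedged sketch of the replicated-data route (b), $\widehat{\phi}_{U}(t)$ can be computed directly from the differences $S_{j1}-S_{j2}$; the Laplace error and all numerical values in the toy check below are assumptions of this illustration.

```python
import numpy as np

def phi_U_hat(t, S1, S2):
    """|phi_U(t)| estimated from replicated contaminated measurements,
    as in Delaigle et al. (2008): |N^{-1} sum_j cos{t(S_j1 - S_j2)}|^{1/2}."""
    return np.abs(np.mean(np.cos(t * (S1 - S2)))) ** 0.5

# Toy check with Laplace(0, b) errors, for which |phi_U(t)| = 1/(1 + b^2 t^2).
rng = np.random.default_rng(1)
N, b = 100_000, 0.5
T = rng.standard_normal(N)
S1, S2 = T + rng.laplace(0, b, N), T + rng.laplace(0, b, N)
print(phi_U_hat(2.0, S1, S2), 1 / (1 + b**2 * 2.0**2))  # both ~ 0.5
```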

3 Estimation Procedure

To overcome the problem that the $L\{(t-T_{i})/h\}$'s are not empirically accessible, we apply the deconvolution kernel approach (e.g. Stefanski and Carroll, 1990; Fan and Truong, 1993). This method is often used in nonparametric regression in which the covariates are measured with classical error, as in (1); the idea is as follows. The density of $S$ is the convolution of the densities of $T$ and $U$, so that $\phi_{S}(w)=\phi_{T}(w)\phi_{U}(w)$, where $\phi_{S}$ and $\phi_{T}$ are the characteristic functions of $S$ and $T$, respectively. We consider $U$ with $\phi_{U}(w)\neq 0$ for all $w\in\mathbb{R}$. Using the Fourier inversion theorem, if $|\phi_{T}|$ is integrable, we have

$f_{T}(t) = \frac{1}{2\pi}\int_{-\infty}^{\infty}\exp(-iwt)\frac{\phi_{S}(w)}{\phi_{U}(w)}\,dw.$ (5)

This inspired Stefanski and Carroll (1990) to estimate $f_{T}$ by $\widehat{f}_{T,h}(t) := (Nh)^{-1}\sum_{i=1}^{N}L_{U}\{(t-S_{i})/h\}$, where

$L_{U}(v) := \frac{1}{2\pi}\int_{-\infty}^{\infty}\exp(-iwv)\frac{\phi_{L}(w)}{\phi_{U}(w/h)}\,dw,$ (6)

with $\phi_{L}$ the Fourier transform of the kernel $L$, which aims to prevent $\widehat{f}_{T,h}$ from becoming unreliably large in its tails.
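
To make (6) concrete, here is a minimal numerical sketch of $L_{U}$, assuming a Laplace$(0,b)$ error, for which $\phi_{U}(w)=1/(1+b^{2}w^{2})$, and the kernel with $\phi_{L}(w)=(1-w^{2})^{3}$ on $[-1,1]$ mentioned in Section 4; the Fourier integral in (6) is approximated by a simple Riemann sum.

```python
import numpy as np

def deconv_kernel(v, h, b=0.5, grid=2001):
    """Deconvolution kernel L_U(v) in (6) for Laplace(0, b) error,
    phi_U(w) = 1/(1 + b^2 w^2), and phi_L(w) = (1 - w^2)^3 on [-1, 1].
    Only the cosine part survives because phi_L and phi_U are even;
    the Fourier integral is approximated by a Riemann sum."""
    w = np.linspace(-1.0, 1.0, grid)               # phi_L vanishes outside [-1, 1]
    ratio = (1 - w**2)**3 * (1 + (b * w / h)**2)   # phi_L(w) / phi_U(w / h)
    integrand = np.cos(np.outer(np.atleast_1d(v), w)) * ratio
    return integrand.sum(axis=1) * (w[1] - w[0]) / (2 * np.pi)

# Sanity check: like an ordinary kernel, L_U integrates to 1 in v.
v = np.linspace(-40, 40, 8001)
print(deconv_kernel(v, h=0.3).sum() * (v[1] - v[0]))  # ~ 1.0
```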

Based on this idea, Fan and Truong (1993) proposed a consistent errors-in-variables regression estimator by replacing the $L\{(t-T_{i})/h\}$'s in (4) with the $L_{U}\{(t-S_{i})/h\}$'s. In our context, an errors-in-variables estimator of $\mu(t)$ is

$\widetilde{\mu}(t) := \frac{\sum_{i=1}^{N}\pi_{0}(t,\boldsymbol{X}_{i})Y_{i}L_{U}\{(t-S_{i})/h\}}{\sum_{i=1}^{N}L_{U}\{(t-S_{i})/h\}}.$ (7)

Noting that $U\perp(T,\boldsymbol{X},Y)$, we have

$\mathbb{E}[L_{U}\{(t-S)/h\}|T,\boldsymbol{X},Y] = \mathbb{E}[L_{U}\{(t-S)/h\}|T]$
$= \frac{1}{2\pi}\int_{-\infty}^{\infty}\exp\{-iw(t-T)/h\}\,\mathbb{E}[\exp(iwU/h)]\,\frac{\phi_{L}(w)}{\phi_{U}(w/h)}\,dw$
$= \frac{1}{2\pi}\int_{-\infty}^{\infty}\exp\{-iw(t-T)/h\}\,\phi_{L}(w)\,dw = L\{(t-T)/h\},$ (8)

where the last equality follows from the Fourier inversion theorem. Using this property, $\widetilde{\mu}(t)$ has the same asymptotic bias as $\mu_{NW}(t)$, which shrinks to zero as $h\to 0$. Then, to verify its consistency for $\mu(t)$, it suffices to show that its asymptotic variance decays to zero as $N\to\infty$, using a straightforward extension of the proof in Fan and Truong (1993).

However, the challenge is that $\pi_{0}$ is unknown in practice. We next show how to estimate $\pi_{0}(t,\boldsymbol{X})$ from the error-contaminated data $(S_{i},\boldsymbol{X}_{i},Y_{i})$, $i=1,\ldots,N$.

3.1 Estimating $\pi_{0}(t,\boldsymbol{X})$

Observing (3), a straightforward way to estimate $\pi_{0}$ is to estimate $f_{T}$ and $f_{T|X}$ and then compute their ratio. However, this ratio estimator is sensitive to small values of $f_{T|X}$, since small errors in estimating $f_{T|X}$ lead to large errors in the ratio (see Ai et al., 2021, 2022 for an example and Appendix A.1.1 in the supplementary file for a detailed illustration). As in the error-free treatment effect literature, we instead treat $\pi_{0}$ as a whole and estimate it directly to mitigate this problem. In particular, we estimate it nonparametrically from an expanding set of equations, which is closely related to the idea in Ai et al. (2021). However, their method is not applicable to error-contaminated data.

Specifically, when the $T_{i}$'s are fully observable, Ai et al. (2021) found that the moment equation

$\mathbb{E}\{\pi_{0}(T,\boldsymbol{X})u(\boldsymbol{X})v(T)\} = \mathbb{E}\{v(T)\}\,\mathbb{E}\{u(\boldsymbol{X})\}$ (9)

holds for any integrable functions $u(\boldsymbol{X})$ and $v(T)$, and that it identifies $\pi_{0}(\cdot,\cdot)$. They further estimated the function $\pi_{0}(\cdot,\cdot):\mathcal{T}\times\mathcal{X}\to\mathbb{R}$ by maximising a generalised empirical likelihood, subject to the restrictions of the sample version of (9); see the corresponding remark in the supplementary file for more details. However, those restrictions are not computable in our context since $T$ is not observable, and nonparametric estimation of the moment $\mathbb{E}[v(T)]$ for a general function $v(\cdot)$ from the contaminated data $\{S_{i}\}_{i=1}^{N}$ is challenging, with theoretical properties that are difficult, if not impossible, to derive. For example, Hall and Lahiri (2008) studied nonparametric estimation of the absolute moment $\mathbb{E}[|T|^{q}]$ when $T$ is subject to ordinary smooth error (see definition (18)) and found that the theoretical behaviour of the estimator differs with $q$: if $q$ is an even integer, $\sqrt{N}$-consistency is only achievable under the strong condition $\mathbb{E}[T^{2q}]+\mathbb{E}[U^{2q}]<\infty$; if $q$ is an odd integer, $\sqrt{N}$-consistency is achievable if and only if the distribution of the measurement error is sufficiently "rough" in terms of the rate at which $\phi_{U}$ converges to zero in its tails; and for $q>0$ not an integer, $\sqrt{N}$-consistency is generally impossible. For other forms of $v(T)$, or when the error is supersmooth (see definition (19)), consistent nonparametric estimation of $\mathbb{E}[v(T)]$ and the corresponding theoretical behaviour remain open problems to the best of our knowledge.

Thus, to stabilise the estimation of $\pi_{0}$, we derive another expanding set of equations that identifies $\pi_{0}$ from the error-contaminated data while avoiding the estimation of $\mathbb{E}[v(T)]$. Specifically, instead of estimating the function $\pi_{0}(\cdot,\cdot):\mathcal{T}\times\mathcal{X}\to\mathbb{R}$, we estimate its projection $\pi_{0}(t,\cdot):\mathcal{X}\to\mathbb{R}$ for every fixed $t\in\mathcal{T}$, and find that

$\mathbb{E}\{\pi_{0}(t,\boldsymbol{X})u(\boldsymbol{X})|T=t\} = \int_{\mathcal{X}}\frac{f_{T}(t)}{f_{T|X}(t|\boldsymbol{x})}u(\boldsymbol{x})f_{X|T}(\boldsymbol{x}|t)\,d\boldsymbol{x} = \mathbb{E}\{u(\boldsymbol{X})\}$ (10)

holds for any integrable function $u(\boldsymbol{X})$. Although equation (10) still depends on the unobservable $T$, the quantity $\mathbb{E}\{\pi_{0}(t,\boldsymbol{X})u(\boldsymbol{X})|T=t\}$ can be estimated from the observable $(S,\boldsymbol{X})$ using the deconvolution kernel introduced in (8). In the following theorem, we show that the corresponding moment condition identifies the function $\pi_{0}(t,\cdot)$ from $(S,\boldsymbol{X})$ for every fixed $t\in\mathcal{T}$.

Theorem 3.1

Let $L_{U}(\cdot)$ be the deconvolution kernel function defined in (6). For every fixed $t\in\mathcal{T}$ and any integrable function $u(\boldsymbol{X})$,

$\lim_{h_{0}\to 0}\frac{\mathbb{E}[\pi(t,\boldsymbol{X})u(\boldsymbol{X})L_{U}\{(t-S)/h_{0}\}]}{\mathbb{E}[L_{U}\{(t-S)/h_{0}\}]} = \mathbb{E}[u(\boldsymbol{X})]$ (11)

holds if and only if $\pi(t,\boldsymbol{X})=\pi_{0}(t,\boldsymbol{X})$ a.s.

The proof is provided in Appendix A.2 in the supplementary file. Theorem 3.1 suggests a way of estimating the weighting function, namely, solving a sample analogue of (11) for any integrable function $u(\boldsymbol{x})$, where $h_{0}$ tends to 0 as the sample size tends to infinity. However, this entails solving an infinite number of equations, which is impossible with a finite sample of observations in practice. To overcome this difficulty, we approximate the infinite-dimensional function space of $u(\boldsymbol{x})$ by a sequence of finite-dimensional sieves. Specifically, let $u_{K}(\boldsymbol{x}) := (u_{K,1}(\boldsymbol{x}),\ldots,u_{K,K}(\boldsymbol{x}))^{\top}$ denote known basis functions of dimension $K$ (e.g. power series, B-splines, or trigonometric polynomials). The function $u_{K}(\boldsymbol{x})$ provides approximation sieves that can approximate any suitable function $u(\boldsymbol{x})$ arbitrarily well as $K\to\infty$ (see Chen, 2007 for a discussion of sieve approximation). Since the sieve spans a subspace of the original function space, $\pi_{0}(t,\boldsymbol{X})$ also satisfies

$\lim_{h_{0}\to 0}\frac{\mathbb{E}[\pi(t,\boldsymbol{X})u_{K}(\boldsymbol{X})L_{U}\{(t-S)/h_{0}\}]}{\mathbb{E}[L_{U}\{(t-S)/h_{0}\}]} = \mathbb{E}\{u_{K}(\boldsymbol{X})\}.$ (12)

Equation (12) asymptotically identifies $\pi_{0}(t,\boldsymbol{X})$ as $K\to\infty$. We observe that, for any increasing and globally concave function $\rho(v)$,

$\pi^{\ast}(t,\boldsymbol{X}) := \rho'\{{\lambda_{t}^{\ast}}^{\top}u_{K}(\boldsymbol{X})\}$ (13)

solves (12), where $\rho'(\cdot)$ is the derivative of $\rho(\cdot)$, $\lambda_{t}^{\ast} := \operatorname{argmax}_{\lambda\in\mathbb{R}^{K}}G_{t}^{*}(\lambda)$, and $G_{t}^{*}(\lambda)$ is a strictly concave function defined by

$G_{t}^{*}(\lambda) := \lim_{h_{0}\to 0}\frac{\mathbb{E}[\rho\{\lambda^{\top}u_{K}(\boldsymbol{X})\}L_{U}\{(t-S)/h_{0}\}]}{\mathbb{E}[L_{U}\{(t-S)/h_{0}\}]} - \lambda^{\top}\mathbb{E}\{u_{K}(\boldsymbol{X})\}.$

Indeed, by the first-order condition $\nabla G_{t}^{*}(\lambda_{t}^{*})=0$, we see that (12) holds with $\pi(t,\boldsymbol{X})=\pi^{\ast}(t,\boldsymbol{X})$. The estimator of $\pi_{0}(t,\boldsymbol{X})$ is then naturally defined as the empirical counterpart of (13). Therefore, for every fixed $t\in\mathcal{T}$, we propose estimating $\pi_{0}(t,\boldsymbol{X})$ by

$\widehat{\pi}(t,\boldsymbol{X}) = \rho'\{\widehat{\lambda}_{t}^{\top}u_{K}(\boldsymbol{X})\}$ (14)

with $\widehat{\lambda}_{t} = \operatorname{argmax}_{\lambda\in\mathbb{R}^{K}}\widehat{G}_{t}(\lambda)$ and

$\widehat{G}_{t}(\lambda) := \frac{\sum_{i=1}^{N}\rho\{\lambda^{\top}u_{K}(\boldsymbol{X}_{i})\}L_{U}\{(t-S_{i})/h_{0}\}}{\sum_{i=1}^{N}L_{U}\{(t-S_{i})/h_{0}\}} - \lambda^{\top}\Big\{\frac{1}{N}\sum_{i=1}^{N}u_{K}(\boldsymbol{X}_{i})\Big\}.$ (15)

Some of the deconvolution kernel values $L_{U}\{(t-S_{i})/h_{0}\}$ may be negative, so the objective function $\widehat{G}_{t}(\cdot)$ need not be strictly concave in a finite sample. However, as $N\to\infty$ and $h_{0}\to 0$, $\widehat{G}_{t}(\cdot)\xrightarrow{p}G_{t}^{*}(\cdot)$, and $G_{t}^{*}(\cdot)$ is strictly concave. Therefore, with probability approaching one, $\widehat{G}_{t}(\cdot)$ is strictly concave and $\widehat{\lambda}_{t}$ exists uniquely. Remark 3 in Section 5 introduces a way of solving this maximisation problem quickly and stably in finite samples; an illustrative sketch is given below.
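
As an illustrative sketch (not the authors' exact implementation), $\widehat{\lambda}_{t}$ and $\widehat{\pi}(t,\boldsymbol{X}_{i})$ in (14)-(15) can be computed with an off-the-shelf optimiser. Here we assume the exponential tilting choice $\rho(v)=-\exp(-v-1)$ discussed below, a scalar confounder with a power series sieve, and precomputed deconvolution kernel weights $L_{U}\{(t-S_{i})/h_{0}\}$ (e.g. from a routine such as the `deconv_kernel` sketch above).

```python
import numpy as np
from scipy.optimize import minimize

rho = lambda v: -np.exp(-v - 1)           # exponential tilting
rho_prime = lambda v: np.exp(-v - 1)      # its derivative rho'

def power_sieve(X, K):
    """Power series sieve u_K(x) = (1, x, ..., x^{K-1}) for scalar x."""
    return np.column_stack([X**k for k in range(K)])

def fit_weights(t, X, w, K=4):
    """Maximise G_hat_t(lambda) in (15) and return pi_hat(t, X_i) as in (14);
    w[i] is the deconvolution kernel weight L_U{(t - S_i)/h_0}."""
    uK = power_sieve(X, K)                  # N x K basis matrix
    ubar = uK.mean(axis=0)                  # N^{-1} sum_i u_K(X_i)
    wn = w / w.sum()                        # self-normalised kernel weights

    def neg_G(lam):                         # minimise -G_hat_t(lambda)
        return -(wn @ rho(uK @ lam) - lam @ ubar)

    lam_hat = minimize(neg_G, np.zeros(K), method="BFGS").x
    return rho_prime(uK @ lam_hat)          # pi_hat(t, X_i), i = 1, ..., N

# Toy usage with positive placeholder weights (pi_hat should be ~ 1 here).
rng = np.random.default_rng(2)
X, w = rng.standard_normal(300), rng.uniform(0.5, 1.5, 300)
print(fit_weights(0.0, X, w, K=3)[:5])
```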

Our estimator $\widehat{\pi}(t,\boldsymbol{X}_{i})$ has a local generalised empirical likelihood interpretation. To see this, Appendix A.3 shows that $\widehat{\pi}(t,\boldsymbol{X}_{i})$ is the dual solution to the following local generalised empirical likelihood maximisation problem: for every fixed $t\in\mathcal{T}$,

$\max_{\{\pi_{i}\}_{i=1}^{N}} -\frac{\sum_{i=1}^{N}D(\pi_{i})L_{U}\{(t-S_{i})/h_{0}\}}{\sum_{i=1}^{N}L_{U}\{(t-S_{i})/h_{0}\}}$ subject to $\frac{\sum_{i=1}^{N}\pi_{i}u_{K}(\boldsymbol{X}_{i})L_{U}\{(t-S_{i})/h_{0}\}}{\sum_{i=1}^{N}L_{U}\{(t-S_{i})/h_{0}\}} = \frac{1}{N}\sum_{i=1}^{N}u_{K}(\boldsymbol{X}_{i}),$ (16)

where $D(v)$ is a distance measure from $v$ to 1 for $v\in\mathbb{R}$, which is continuously differentiable and satisfies $D(1)=0$ and

$\rho(-v) = D\{(D')^{-1}(v)\} - v\,(D')^{-1}(v).$

Problem (16) minimises a distance between the desired weights $N^{-1}\pi_{i}$ and the empirical frequencies $N^{-1}$ locally, in a small neighbourhood of $T_{i}=t$, subject to the sample analogue of the moment restriction (12).

Since the dual formulation (14) is equivalent to the primal problem (16) and simplifies the subsequent discussion, we express the estimator in terms of $\rho(v)$ in the rest of the paper. In particular, $\rho(v)=-\exp(-v-1)$ corresponds to exponential tilting (Kitamura and Stutzer, 1997; Imbens et al., 1998; Ai et al., 2022), $\rho(v)=\log(1+v)$ corresponds to the empirical likelihood (Owen, 2001), $\rho(v)=-(1-v)^{2}/2$ corresponds to the continuous updating of the generalised method of moments (Hansen, 1982), and $\rho(v)=v-\exp(-v)$ corresponds to the inverse logistic.
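
For reference, a small sketch collecting these four criterion functions and their derivatives $\rho'$, which are all that the dual estimator (14) requires:

```python
import numpy as np

# (rho, rho') pairs for the four criterion functions named above.
RHO = {
    "exp_tilting":  (lambda v: -np.exp(-v - 1), lambda v: np.exp(-v - 1)),
    "emp_lik":      (lambda v: np.log(1 + v),   lambda v: 1 / (1 + v)),
    "cue_gmm":      (lambda v: -(1 - v)**2 / 2, lambda v: 1 - v),
    "inv_logistic": (lambda v: v - np.exp(-v),  lambda v: 1 + np.exp(-v)),
}
```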

Now, replacing $\pi_{0}(t,\boldsymbol{X}_{i})$ in (7) with $\widehat{\pi}(t,\boldsymbol{X}_{i})$, we obtain an estimator of $\mu(t)$:

$\widehat{\mu}(t) := \frac{\sum_{i=1}^{N}\widehat{\pi}(t,\boldsymbol{X}_{i})Y_{i}L_{U}\{(t-S_{i})/h\}}{\sum_{i=1}^{N}L_{U}\{(t-S_{i})/h\}}.$ (17)
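
Putting the pieces together, a hedged sketch of (17): given routines such as the illustrative `deconv_kernel` and `fit_weights` above, $\widehat{\mu}(t)$ is a weighted local constant deconvolution kernel regression, where the weight-estimation bandwidth $h_{0}$ and the regression bandwidth $h$ may differ.

```python
import numpy as np

def mu_hat(t, S, X, Y, h, h0, deconv_kernel, fit_weights, K=4):
    """ADRF estimator (17): weighted local constant deconvolution kernel
    regression with estimated stabilised weights pi_hat from (14)-(15)."""
    w0 = deconv_kernel((t - S) / h0, h0)     # L_U{(t - S_i)/h_0}, for pi_hat
    pi_hat = fit_weights(t, X, w0, K=K)      # estimated weights at this t
    w = deconv_kernel((t - S) / h, h)        # L_U{(t - S_i)/h}, for (17)
    return np.sum(pi_hat * Y * w) / np.sum(w)
```
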
Remark 2

When $T$ is observed without error, Kennedy et al. (2017) proposed a doubly robust estimator of $\mu(t)$ obtained by regressing a pseudo-outcome

$\xi(T,\boldsymbol{X},Y;f_{T|X},m) = \frac{\int f_{T|X}(T|\boldsymbol{x})f_{X}(\boldsymbol{x})\,d\boldsymbol{x}}{f_{T|X}(T|\boldsymbol{X})}\{Y-m(T,\boldsymbol{X})\} + \int m(T,\boldsymbol{x})f_{X}(\boldsymbol{x})\,d\boldsymbol{x}$

onto $T=t$, i.e. $\widehat{\mu}_{DR}(t) = \boldsymbol{g}_{ht}^{\top}(t)\widehat{\boldsymbol{\beta}}_{h}(t)$, where $m(T,\boldsymbol{X}) := \mathbb{E}[Y|T,\boldsymbol{X}]$ is the outcome regression function, $\boldsymbol{g}_{ht}(a) := (1,(a-t)/h)^{\top}$ and

$\widehat{\boldsymbol{\beta}}_{h}(t) := \arg\min_{\boldsymbol{\beta}\in\mathbb{R}^{2}}\frac{1}{Nh}\sum_{i=1}^{N}L\Big(\frac{T_{i}-t}{h}\Big)\Big\{\widehat{\xi}(T_{i},\boldsymbol{X}_{i},Y_{i};\widehat{f}_{T|X},\widehat{m}) - \boldsymbol{g}_{ht}(T_{i})^{\top}\boldsymbol{\beta}\Big\}^{2},$
$\widehat{\xi}(T,\boldsymbol{X},Y;\widehat{f}_{T|X},\widehat{m}) := \frac{N^{-1}\sum_{i=1}^{N}\widehat{f}_{T|X}(T|\boldsymbol{X}_{i})}{\widehat{f}_{T|X}(T|\boldsymbol{X})}\{Y-\widehat{m}(T,\boldsymbol{X})\} + \frac{1}{N}\sum_{i=1}^{N}\widehat{m}(T,\boldsymbol{X}_{i}),$

with $L(\cdot)$ a prespecified kernel function, and $\widehat{f}_{T|X}(\cdot)$ and $\widehat{m}(\cdot)$ consistent estimators of $f_{T|X}(\cdot)$ and $m(\cdot)$. Kennedy et al. (2017) showed that $\widehat{\mu}_{DR}(t)$ enjoys double robustness: (i) when both $f_{T|X}(\cdot)$ and $m(\cdot)$ are consistently estimated and the product of the estimators' local rates of convergence is sufficiently small, $\widehat{\mu}_{DR}(t)$ behaves asymptotically the same as the standard local linear estimator of $\mu(t)$ in (2) with known $\pi_{0}$; (ii) when either $f_{T|X}(\cdot)$ or $m(\cdot)$ is consistently estimated and the other is misspecified, $\widehat{\mu}_{DR}(t)$ remains consistent.

This idea can be adapted to our setup with measurement error (1) by replacing the standard kernel with the deconvolution one. For example, we can define a doubly robust estimator of $\mu(t)$ by

$\widehat{\mu}_{DK}(t) := \frac{\sum_{i=1}^{N}\widehat{\xi}_{DK}(t,\boldsymbol{X}_{i},Y_{i};\widehat{\pi},\widehat{m}_{DK})L_{U}\{(t-S_{i})/h\}}{\sum_{i=1}^{N}L_{U}\{(t-S_{i})/h\}},$

where

$\widehat{\xi}_{DK}(t,\boldsymbol{X},Y;\widehat{\pi},\widehat{m}_{DK}) = \widehat{\pi}(t,\boldsymbol{X})\{Y-\widehat{m}_{DK}(t,\boldsymbol{X})\} + \frac{1}{N}\sum_{i=1}^{N}\widehat{m}_{DK}(t,\boldsymbol{X}_{i})$

and $\widehat{m}_{DK}(\cdot)$ is some consistent estimator of $m(\cdot)$. Compared with our proposed estimator $\widehat{\mu}(t)$ in (17), $\widehat{\mu}_{DK}(t)$ additionally requires a consistent estimator of $m(\cdot)$ from the error-contaminated data $(S_{i},\boldsymbol{X}_{i},Y_{i})_{i=1}^{N}$. Establishing practical estimators (with tuning parameter selection techniques) and the corresponding theoretical results for this method is beyond the scope of this paper and is left for future work.

4 Large Sample Properties

In this section, we first establish the $L_{\infty}$ and $L_{2}$ convergence rates of $\widehat{\pi}(t,\cdot)$ for every fixed $t\in\mathcal{T}$, and then investigate the asymptotic behaviour of the proposed ADRF estimator $\widehat{\mu}(t)$. Note that $\widehat{\mu}(t)$ (resp. $\widehat{\pi}(t,\cdot)$) is a nonparametric estimator whose asymptotic behaviour is governed by its asymptotic bias and variance, defined respectively as the expectation and variance of the limiting distribution of $\widehat{\mu}(t)-\mu(t)$ (resp. $\widehat{\pi}(t,\cdot)-\pi_{0}(t,\cdot)$). Based on (8), we will show that the asymptotic biases of the two estimators are the same as their counterparts in the error-free case. That is, they depend on the smoothness of $\pi_{0}$, $\mu$, and the density of $T$, and on the approximation error of the sieve basis $u_{K}$. In particular, the following conditions are required:

Assumption 2

The kernel function $L(\cdot)$ is an even function such that $\int_{-\infty}^{\infty}L(u)\,du=1$ and has finite moments of order 3.

Assumption 3

We assume

(i) the support $\mathcal{X}$ of $\boldsymbol{X}$ is a compact subset of $\mathbb{R}^{r}$, and the support $\mathcal{T}$ of the treatment variable $T$ is a compact subset of $\mathbb{R}$;

(ii) (Strict Positivity) there exists a positive constant $\eta_{\min}$ such that $f_{T|X}(t|\boldsymbol{x})\geq\eta_{\min}>0$ for all $\boldsymbol{x}\in\mathcal{X}$.

Assumption 4

(i) The densities $f_{T}(t)$, $f_{T|X}(t|\boldsymbol{X})$ and $f_{T|Y,X}(t|Y,\boldsymbol{X})$ are third-order continuously differentiable w.r.t. $t$ almost surely. (ii) The derivatives of $f_{T|X}(t|\boldsymbol{X})$ and $f_{T|Y,X}(t|Y,\boldsymbol{X})$, namely $\{\partial_{t}^{d}f_{T|X}(t|\boldsymbol{X}),\ \partial_{t}^{d}f_{T|Y,X}(t|Y,\boldsymbol{X})\ \text{for}\ d=0,1,2,3\}$, are integrable in $t$ almost surely.

Assumption 5

For every $t\in\mathcal{T}$, (i) the function $\pi_{0}(t,\boldsymbol{x})$ is $s$-times continuously differentiable w.r.t. $\boldsymbol{x}\in\mathcal{X}$, where $s>r/2$ is an integer; (ii) there exist $\lambda_{t}\in\mathbb{R}^{K}$ and a positive constant $\alpha>0$ such that $\sup_{\boldsymbol{x}\in\mathcal{X}}|(\rho')^{-1}\{\pi_{0}(t,\boldsymbol{x})\}-\lambda_{t}^{\top}u_{K}(\boldsymbol{x})| = O(K^{-\alpha})$.

Assumption 6

(i) For every $K$, the eigenvalues of $\mathbb{E}[u_{K}(\boldsymbol{X})u_{K}(\boldsymbol{X})^{\top}|T=t]$ are bounded away from zero and infinity, and are twice differentiable w.r.t. $t$ for $t\in\mathcal{T}$. (ii) There is a sequence of constants $\zeta(K)$ satisfying $\sup_{\boldsymbol{x}\in\mathcal{X}}\|u_{K}(\boldsymbol{x})\|\leq\zeta(K)$ such that $\zeta(K)\{K^{-\alpha}+h_{0}^{2}+h^{2}\}\to 0$ as $N\to\infty$, where $\|\cdot\|$ denotes the Euclidean norm.

Assumption 7

For every $t\in\mathcal{T}$, there exist $\gamma_{t}\in\mathbb{R}^{K}$ and a positive constant $\ell>0$ such that $\sup_{\boldsymbol{x}\in\mathcal{X}}|m(t,\boldsymbol{x})-\gamma_{t}^{\top}u_{K}(\boldsymbol{x})| = O(K^{-\ell})$, where $m(t,\boldsymbol{x}) = \mathbb{E}[Y|T=t,\boldsymbol{X}=\boldsymbol{x}]$.

Assumption 8

$R_{1}^{2+\delta}(t) := \mathbb{E}[|\pi_{0}(t,\boldsymbol{X})Y-\mu(t)|^{2+\delta}\,|\,T=t]$, $R_{2}^{2+\delta}(t) := \mathbb{E}[|\pi_{0}(t,\boldsymbol{X})m(t,\boldsymbol{X})-\mu(t)|^{2+\delta}\,|\,T=t]$ and $R_{3}^{2+\delta}(t) := \mathbb{E}[|\pi_{0}(t,\boldsymbol{X})\{Y-m(t,\boldsymbol{X})\}|^{2+\delta}\,|\,T=t]$ are bounded for some $\delta>0$, for all $t\in\mathcal{T}$.

Assumption 3 (i) restricts the covariates and the treatment to be bounded. This condition is commonly imposed in the nonparametric regression literature. It can be relaxed if we restrict the tail distributions of $\boldsymbol{X}$ and $T$. For example, Chen et al. (2008, Assumption 3) allowed the support of $\boldsymbol{X}$ to be the entire Euclidean space but imposed $\int_{\mathbb{R}^{r}}(1+|\boldsymbol{x}|^{2})^{\omega}f_{X}(\boldsymbol{x})\,d\boldsymbol{x}<\infty$ for some $\omega>0$.

Assumption 3 (ii) is a strict positivity condition, requiring that every subject has some chance of receiving every treatment level, regardless of covariates. This condition is also imposed in a large body of literature in the absence of measurement error (see e.g. Kennedy et al., 2017, Assumption 2 and D'Amour et al., 2021, Assumption 3), particularly when no restrictions are imposed on the potential outcome distribution. It can be relaxed if other smoothness conditions are imposed on the potential outcome distribution (Ma and Wang, 2020), or if different target parameters are considered. For example, Muñoz and Van Der Laan (2012) studied the estimation of a stochastic intervention causal parameter, defined by $\mathbb{E}[\mathbb{E}[Y|T+a(\boldsymbol{X}),\boldsymbol{X}]]$, under the weaker positivity condition $\sup_{t\in\mathcal{T}}\{f_{T|X}(t-a(\boldsymbol{X})|\boldsymbol{X})/f_{T|X}(t|\boldsymbol{X})\}<\infty$ a.e., where $a(\boldsymbol{X})$ is a user-specified intervention function. Díaz and van der Laan (2013) studied the estimation of a conditional causal dose-response curve, defined by $\mathbb{E}[\mathbb{E}[Y|T=t,\boldsymbol{X}]|\boldsymbol{Z}]$, where $\boldsymbol{Z}\subset\boldsymbol{X}$ is a subset of the observed covariates, under the weaker positivity condition $\sup_{t\in\mathcal{T}}\{b(t,\boldsymbol{Z})/f_{T|X}(t|\boldsymbol{X})\}<\infty$ a.e., for a user-specified weight function $b(t,\boldsymbol{Z})$. Although Assumption 3 (ii) is not the mildest condition in the literature, we maintain it throughout this paper owing to its technical benefits, especially in the presence of measurement error.

Assumption 4 contains the smoothness conditions required for nonparametric estimation. Under Assumption 1, the parameter of interest $\mu(t) = \mathbb{E}[Y^{*}(t)]$ can also be written as $\mu(t) = \mathbb{E}[\pi_{0}(t,\boldsymbol{X})Y|T=t] = \mathbb{E}[f_{T|Y,X}(t|Y,\boldsymbol{X})Y/f_{T|X}(t|\boldsymbol{X})]$. Note that Assumption 4 (i) implies that $t\mapsto f_{T|Y,X}(t|Y,\boldsymbol{X})/f_{T|X}(t|\boldsymbol{X})$ is third-order continuously differentiable almost surely. Furthermore, using the Leibniz integral rule and Assumption 4 (ii), the target parameter $t\mapsto\mu(t)$ is third-order continuously differentiable.

Assumption 5 (i) is used to control the complexity (measured by the uniform entropy integral) of the function class $\{\pi_{0}(t,\boldsymbol{x}),\boldsymbol{x}\in\mathcal{X}\}$, so that it forms a Donsker class and empirical process theory can be applied (Van Der Vaart et al., 1996, Corollary 2.7.2). Despite its stringency, smoothness conditions of this type are commonly adopted in the nonparametric inference literature; see Chen et al. (2008, Assumption 4 (i)) and Fan et al. (2021, Condition E.1.7).

Assumption 5 (ii) requires the sieve approximation error of $(\rho')^{-1}\{\pi_{0}(t,\boldsymbol{x})\}$ to shrink at a polynomial rate. This condition is satisfied for a variety of sieve basis functions. For example, it holds with $\alpha=+\infty$ if $\boldsymbol{X}$ is discrete, and with $\alpha=s/r$ if $\boldsymbol{X}$ is continuous and $u_{K}(\boldsymbol{x})$ is a power series or a B-spline, where $s$ is the smoothness of the approximand and $r$ is the dimension of $\boldsymbol{X}$. Assumption 7 imposes a similar sieve approximation error condition on $m(t,\boldsymbol{x})$.

Assumption 6 (i) rules out near multicollinearity of the approximating basis functions, which is standard in the sieve regression literature. Assumption 6 (ii) is satisfied with $\zeta(K)=O(K)$ if $u_{K}(\boldsymbol{x})$ is a power series and with $\zeta(K)=O(\sqrt{K})$ if $u_{K}$ is a B-spline (Newey, 1997). Assumption 8 imposes boundedness conditions on moments of the response variable, which are also standard in the errors-in-variables problem (e.g. Fan and Truong, 1993; Delaigle et al., 2009). This condition is needed to derive the asymptotic distribution of the proposed estimator by applying the Lyapunov central limit theorem.

The asymptotic variance of our estimator differs depending on the type of the distribution of $U$ and on the decay rates of $h_{0}$ and $h$; this is different from the error-free case. We consider two types of $U$, the ordinary smooth case and the supersmooth case, which are standard in the errors-in-variables literature (see e.g. Fan and Truong, 1993, Delaigle et al., 2009, and Meister, 2009, among others, for more details).

An ordinary smooth error of order $\beta\geq 1$ satisfies

$\lim_{t\to\infty}t^{\beta}\phi_{U}(t)=c \quad\text{and}\quad \lim_{t\to\infty}t^{\beta+1}\phi_{U}^{(1)}(t)=-c\beta,$ (18)

for some constant $c>0$. A supersmooth error of order $\beta\geq 1$ satisfies

$d_{0}|t|^{\beta_{0}}\exp(-|t|^{\beta}/\gamma) \leq |\phi_{U}(t)| \leq d_{1}|t|^{\beta_{1}}\exp(-|t|^{\beta}/\gamma) \quad\text{as}\quad |t|\to\infty,$ (19)

for some positive constants $d_{0},d_{1},\gamma$ and some constants $\beta_{0}$ and $\beta_{1}$. Examples of ordinary smooth errors include Laplace errors, Gamma errors, and their convolutions; Cauchy errors, Gaussian errors, and their convolutions are supersmooth. The order $\beta$ describes the decay rate of the characteristic function $\phi_{U}(t)$ as $t\to\infty$, which corresponds to the smoothness of the error distribution (e.g. $\beta=1$ for the Cauchy distribution, $\beta=2$ for the Laplace and Gaussian distributions; for the Gamma distribution, $\beta$ relates to both the shape and the scale parameters).
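
As a quick numerical illustration of (18) and (19) (the parameter values are assumptions of this sketch), one can compare the tail behaviour of $\phi_{U}$ for a Laplace error, which is ordinary smooth with $\beta=2$ and $t^{2}\phi_{U}(t)\to 1/b^{2}$, and a Gaussian error, which is supersmooth with $\beta=2$ and $\gamma=2/\sigma^{2}$:

```python
import numpy as np

b, sigma = 0.5, 0.5
phi_laplace = lambda t: 1 / (1 + b**2 * t**2)       # ordinary smooth, beta = 2
phi_gauss = lambda t: np.exp(-sigma**2 * t**2 / 2)  # supersmooth, beta = 2

for t in [10.0, 50.0, 100.0]:
    # t^2 * phi_U(t) stabilises at c = 1/b^2 = 4 in the Laplace case, while
    # the Gaussian characteristic function decays faster than any polynomial.
    print(t, t**2 * phi_laplace(t), phi_gauss(t))
```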

Since division by $\phi_{U}$ appears in the inverse Fourier transform representation (5), it is natural to expect better estimation results for larger $|\phi_{U}|$ (i.e. smaller $\beta$); indeed, it has been found in the literature (see e.g. Fan, 1991b, Fan and Truong, 1993, Delaigle et al., 2009, and Meister, 2009, among others) that, for both the ordinary smooth and supersmooth cases, the higher the order $\beta$, the harder the deconvolution, i.e. the slower the variance of a deconvolution kernel estimator converges. This difficulty is intrinsic to nonparametric estimation with errors in variables (Fan and Truong, 1993, Carroll et al., 2006).

Such an influence will be seen in the convergence rate of our estimator in the following theorems. Depending on the type of the distribution of $U$, we need the following different conditions on $L$ to derive the asymptotic variance:

Assumption O (Ordinary Smooth Case): $\|\phi_{L}\|_{\infty}<\infty$, $\int_{-\infty}^{\infty}|t|^{\beta+1}\{|\phi_{L}(t)|+|\partial_{t}\phi_{L}(t)|\}\,dt<\infty$ and $\int_{-\infty}^{\infty}|t^{\beta}\phi_{L}(t)|^{2}\,dt<\infty$.

Assumption S (Supersmooth Case): $\phi_{L}(t)$ is supported on $[-1,1]$ and bounded.

These assumptions concern the prespecified kernel function and are easily satisfied. For example, the kernel whose Fourier transform is $\phi_{L}(u)=(1-u^{2})^{3}\,\mathbbm{1}_{[-1,1]}(u)$ satisfies these conditions (e.g. Fan and Truong, 1993 and Delaigle et al., 2009). In the following two sections, we establish the large sample properties of $\widehat{\pi}(t,\cdot)$ and $\widehat{\mu}(t)$ under the two types of $U$.

4.1 Asymptotics for the Ordinary Smooth Error

To establish the large sample properties of $\widehat{\mu}(t)$, we first show that the estimated weight function $\widehat{\pi}(t,\cdot)$ is consistent and derive its convergence rates under both the $L_{\infty}$ norm and the $L_{2}$ norm.

Theorem 4.1

Suppose that the error $U$ is ordinary smooth of order $\beta$ satisfying (18) and that Assumption O holds. Under Assumptions 2–6 and $\zeta(K)\sqrt{K/(Nh_{0}^{1+2\beta})}\to 0$ as $N\to\infty$, for every fixed $t\in\mathcal{T}$,

$\sup_{\boldsymbol{x}\in\mathcal{X}}|\widehat{\pi}(t,\boldsymbol{x})-\pi_{0}(t,\boldsymbol{x})| = O_{p}\Big(\zeta(K)\{K^{-\alpha}+h_{0}^{2}\} + \zeta(K)\sqrt{\frac{K}{Nh_{0}^{1+2\beta}}}\Big),$
$\int_{\mathcal{X}}|\widehat{\pi}(t,\boldsymbol{x})-\pi_{0}(t,\boldsymbol{x})|^{2}\,dF_{X}(\boldsymbol{x}) = O_{p}\Big(\{K^{-2\alpha}+h_{0}^{4}\} + \frac{K}{Nh_{0}^{1+2\beta}}\Big),$
$\frac{1}{N}\sum_{i=1}^{N}|\widehat{\pi}(t,\boldsymbol{X}_{i})-\pi_{0}(t,\boldsymbol{X}_{i})|^{2} = O_{p}\Big(\{K^{-2\alpha}+h_{0}^{4}\} + \frac{K}{Nh_{0}^{1+2\beta}}\Big).$

The proof of Theorem 4.1 is presented in Appendix C. The first parts of the rates, $\zeta(K)\{K^{-\alpha}+h_{0}^{2}\}$ and $\{K^{-2\alpha}+h_{0}^{4}\}$, are the rates of the asymptotic bias; the terms $\zeta(K)\sqrt{K/(Nh_{0}^{1+2\beta})}$ and $K/(Nh_{0}^{1+2\beta})$ correspond to the asymptotic variance.

We next establish the asymptotic linear expansion and asymptotic normality of $\widehat{\mu}(t)-\mu(t)$. To aid the presentation, we define the following quantities. For $i=1,\ldots,N$, $\eta_{h,h_{0}}(S_{i},\boldsymbol{X}_{i},Y_{i};t) := \phi_{h}(S_{i},\boldsymbol{X}_{i},Y_{i};t) + \psi_{h_{0}}(S_{i},\boldsymbol{X}_{i},Y_{i};t)$, where

$\phi_{h}(S_{i},\boldsymbol{X}_{i},Y_{i};t) := \big[\pi_{0}(t,\boldsymbol{X}_{i})Y_{i}L_{U,h}(t-S_{i}) - \mathbb{E}\{\pi_{0}(t,\boldsymbol{X})YL_{U,h}(t-S)\}\big] - \mu(t)\big[L_{U,h}(t-S_{i}) - \mathbb{E}\{L_{U,h}(t-S)\}\big],$
$\psi_{h_{0}}(S_{i},\boldsymbol{X}_{i},Y_{i};t) := \mu(t)\big[L_{U,h_{0}}(t-S_{i}) - \mathbb{E}\{L_{U,h_{0}}(t-S)\}\big] - \big[m(t,\boldsymbol{X}_{i})\pi_{0}(t,\boldsymbol{X}_{i})L_{U,h_{0}}(t-S_{i}) - \mathbb{E}\{m(t,\boldsymbol{X})\pi_{0}(t,\boldsymbol{X})L_{U,h_{0}}(t-S)\}\big],$

with $L_{U,h}(v) := h^{-1}L_{U}(v/h)$. The population means of both $\phi_{h}$ and $\psi_{h_{0}}$ are zero. Letting "$\ast$" denote the convolution operator, we define

$V_{j} := f_{T}^{-2}(t)\,(R_{j}^{2}f_{T})\ast f_{U}(t)\cdot C, \quad\text{for}\ j=1,2,$

where $C := \int_{-\infty}^{\infty}J^{2}(v)\,dv = (2\pi c^{2})^{-1}\int|w|^{2\beta}\phi_{L}^{2}(w)\,dw$, with $c$ defined in (18), $J(v) := (2\pi c)^{-1}\int_{-\infty}^{\infty}\exp(-iwv)\phi_{L}(w)w^{\beta}\,dw$, and $R_{1}^{2},R_{2}^{2}$ defined in Assumption 8. Moreover, let $(R_{1}R_{2})(t) := \mathbb{E}\big[\{\pi_{0}(t,\boldsymbol{X})Y-\mu(t)\}\{\mu(t)-\pi_{0}(t,\boldsymbol{X})m(t,\boldsymbol{X})\}|T=t\big]$ and $v_{h}(t) := \mathbb{E}\{L_{U,h}^{2}(t-S)\}$.

Theorem 4.2

Suppose that the error $U$ is ordinary smooth of order $\beta$ satisfying (18), and that Assumption O, Assumptions 1–8, and the following condition hold:

$\frac{(K^{-\ell}+h_{0}^{2})(K^{-\alpha}+h_{0}^{2})}{h^{2}} + \frac{(h\wedge h_{0})^{1/2+\beta}}{h_{0}^{1+2\beta}}\frac{K}{\sqrt{N}} \to 0,$

where $(h\wedge h_{0}) = h\mathbbm{1}\{h=O(h_{0})\} + h_{0}\mathbbm{1}\{h_{0}=o(h)\}$. Then, for every fixed $t\in\mathcal{T}$,

$\widehat{\mu}(t)-\mu(t) = \frac{\kappa_{21}}{2}\Big[\frac{f_{T}(t)\Phi_{1}(t)-\mu(t)\partial_{t}^{2}f_{T}(t)}{f_{T}(t)}\Big]h^{2} + o(h^{2})$
$\quad + \frac{\kappa_{21}}{2}\Big[\frac{\mu(t)\partial_{t}^{2}f_{T}(t)-f_{T}(t)\Phi_{2}(t)}{f_{T}(t)}\Big]h_{0}^{2} + o(h_{0}^{2})$
$\quad + \sum_{i=1}^{N}\frac{\eta_{h,h_{0}}(S_{i},\boldsymbol{X}_{i},Y_{i};t)}{Nf_{T}(t)} + o_{P}\Big\{\frac{1}{\sqrt{N(h\wedge h_{0})^{1+2\beta}}}\Big\},$ (20)

where $\kappa_{ij} := \int u^{i}L^{j}(u)\,du$, $\Phi_{1}(t) := \mathbb{E}\big[\{Y\partial_{t}^{2}f_{T|Y,\boldsymbol{X}}(t|Y,\boldsymbol{X})\}/\{f_{T|\boldsymbol{X}}(t|\boldsymbol{X})\}\big]$, and $\Phi_{2}(t) := \mathbb{E}\big[\{m(t,\boldsymbol{X})\partial_{t}^{2}f_{T|\boldsymbol{X}}(t|\boldsymbol{X})\}/\{f_{T|\boldsymbol{X}}(t|\boldsymbol{X})\}\big]$. Furthermore,

a) if $h=o(h_{0})$, then $\sqrt{h^{1+2\beta}/N}\sum_{i=1}^{N}\eta_{h,h_{0}}(S_{i},\boldsymbol{X}_{i},Y_{i};t)/f_{T}(t)\overset{d}{\to}N(0,V_{1})$;

b) if $h_{0}=o(h)$, then $\sqrt{h_{0}^{1+2\beta}/N}\sum_{i=1}^{N}\eta_{h,h_{0}}(S_{i},\boldsymbol{X}_{i},Y_{i};t)/f_{T}(t)\overset{d}{\to}N(0,V_{2})$;

c) if $h_{0}=\tilde{c}h$ for a constant $\tilde{c}>0$, then $\sqrt{h^{1+2\beta}/N}\sum_{i=1}^{N}\eta_{h,h_{0}}(S_{i},\boldsymbol{X}_{i},Y_{i};t)/f_{T}(t)\overset{d}{\to}N(0,V_{3})$, where

$V_{3} := \frac{(R_{1}^{2}f_{T})\ast f_{U}(t)}{f_{T}^{2}(t)}\int_{-\infty}^{\infty}J^{2}(v)\,dv + \frac{(R_{2}^{2}f_{T})\ast f_{U}(t)}{\tilde{c}^{2+2\beta}f_{T}^{2}(t)}\int_{-\infty}^{\infty}J^{2}(v/\tilde{c})\,dv + \frac{2\{(R_{1}R_{2})f_{T}\}\ast f_{U}(t)}{\tilde{c}^{1+\beta}f_{T}^{2}(t)}\int_{-\infty}^{\infty}J(v)J(v/\tilde{c})\,dv.$

In particular, when $\tilde{c}=1$, $V_{3}$ reduces to $f_{T}^{-2}(t)(R_{3}^{2}f_{T})\ast f_{U}(t)\cdot C$, with $R_{3}^{2}$ defined in Assumption 8.

The proof of Theorem 4.2 is presented in Appendix D. From the theorem, we see that as long as $\pi_{0}(t,\cdot)$ and $m(t,\cdot)$ are sufficiently smooth or $K$ grows sufficiently fast, and $h_{0}$ decays fast enough that $(K^{-\ell}+h_{0}^{2})(K^{-\alpha}+h_{0}^{2}) = o(h^{2})$, the error arising from the sieve approximation is asymptotically negligible. For example, using the usual trade-off between the squared bias and the variance, $\widehat{\mu}(t)-\mu(t)$ achieves the optimal convergence rate $N^{-2/(2\beta+5)}$ if $h_{0}\asymp h\asymp N^{-1/(2\beta+5)}$. In that case, we require $K=o(h^{-2})$, $\alpha+\ell>1$, and $\alpha>1/2$ if a spline basis is used or $\alpha>1$ if a power series is used (the detailed derivation can be found in Appendix A.5).
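
For completeness, the rate quoted in the preceding paragraph follows from the usual bias-variance calculation implied by Theorem 4.2 (a hedged back-of-the-envelope sketch, taking $h_{0}\asymp h$ so that both bias terms are of order $h^{2}$ and the variance of the linear term is of order $(Nh^{1+2\beta})^{-1}$): equating the squared bias with the variance,

$h^{4} \asymp \frac{1}{Nh^{1+2\beta}} \;\Longleftrightarrow\; h^{5+2\beta} \asymp N^{-1} \;\Longleftrightarrow\; h \asymp N^{-1/(2\beta+5)},$

and the resulting estimation error is then of order $h^{2} = N^{-2/(2\beta+5)}$.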

The rate $N^{-2/(2\beta+5)}$ above is optimal over all possible nonparametric regression estimators when the regressors are measured with ordinary smooth errors, as shown in Fan and Truong (1993). Note that, for the error-free local constant estimator, the convergence rates of the asymptotic bias and of the asymptotic standard deviation are $h^{2}$ and $(Nh)^{-1/2}$, respectively (see e.g. Fan and Gijbels, 1996). Our proposed estimator $\widehat{\mu}(t)$ has the same rate of asymptotic bias as in the error-free case, but its asymptotic standard deviation is inflated by the factor $h^{-\beta}$, owing to the ordinary smoothness of the error distribution.

In addition to asymptotic normality, (20) provides the asymptotic linear expansion of $\widehat{\mu}(t)-\mu(t)$, which can be used to conduct statistical inference. It is known in the measurement error literature (see e.g. Delaigle et al., 2015, Appendix C) that the closed-form asymptotic variances $V_{1}$, $V_{2}$, and $V_{3}$ are difficult to estimate. However, using our linear expansion in (20), to estimate the asymptotic variance we only need consistent estimators of $\phi_{h}(S,\boldsymbol{X},Y;t)$ and $\psi_{h_{0}}(S,\boldsymbol{X},Y;t)$. For example, $\pi_{0}(t,\boldsymbol{X})$ and $\mu(t)$ can be estimated using our $\widehat{\pi}(t,\boldsymbol{X})$ and $\widehat{\mu}(t)$, respectively, and $\mathbb{E}[Y|T=t,\boldsymbol{X}]$ can be estimated using Liang's (2000) method. Then, we can construct a pointwise confidence interval for $\mu(t)$ using the undersmoothing technique (see Appendix A.4 for the detailed method and some simulation results). Other confidence intervals based on bias correction (see e.g. Calonico et al., 2018) are also possible, but they require a better estimation of the asymptotic bias and corresponding adjustments of the variance estimation with theoretical justification; this is beyond the scope of this paper and is left for future work.
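To make this concrete, the following minimal Python sketch (with hypothetical inputs; it is not the full procedure of Appendix A.4) turns estimated influence values $\widehat{\eta}_{h,h_{0}}(S_{i},\boldsymbol{X}_{i},Y_{i};t)$, obtained by plugging $\widehat{\pi}$, $\widehat{\mu}$, and an estimate of $\mathbb{E}[Y|T=t,\boldsymbol{X}]$ into (20), together with a density estimate $\widehat{f}_{T}(t)$, into an undersmoothed pointwise confidence interval based on the leading stochastic term of the expansion.

```python
import numpy as np
from scipy.stats import norm

def pointwise_ci(mu_hat_t, eta_hat, f_T_hat_t, level=0.95):
    """Undersmoothed pointwise CI for mu(t) from the linear expansion (20).

    eta_hat : estimated influence values eta_{h,h0}(S_i, X_i, Y_i; t),
              obtained by plugging pi-hat, mu-hat and an estimate of
              E[Y | T = t, X] into the expansion (hypothetical inputs).
    """
    N = len(eta_hat)
    # Standard error of (N f_T(t))^{-1} sum_i eta_i, the leading stochastic term.
    se = np.std(eta_hat, ddof=1) / (np.sqrt(N) * f_T_hat_t)
    z = norm.ppf(0.5 + level / 2.0)
    return mu_hat_t - z * se, mu_hat_t + z * se
```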

4.2 Asymptotics for the Supersmooth Error

The next two theorems establish the asymptotic properties of our estimator for the supersmooth case.

Theorem 4.3

Suppose that the error $U$ is supersmooth of order $\beta$ satisfying (19) and that Assumption S holds. Under Assumptions 2–6 and the condition $\zeta^{2}(K)K\cdot(Nh_{0})^{-1}\cdot\exp(2h_{0}^{-\beta}/\gamma)\rightarrow 0$ as $N\rightarrow\infty$, for every fixed $t\in\mathcal{T}$,

\begin{aligned}
&\sup_{\boldsymbol{x}\in\mathcal{X}}|\widehat{\pi}(t,\boldsymbol{x})-\pi_{0}(t,\boldsymbol{x})|=O_{p}\left(\zeta(K)\cdot\left[\left\{K^{-\alpha}+h_{0}^{2}\right\}+\frac{\exp\left(h_{0}^{-\beta}/\gamma\right)}{\sqrt{h_{0}}}\cdot\sqrt{\frac{K}{N}}\right]\right),\\
&\int_{\mathcal{X}}|\widehat{\pi}(t,\boldsymbol{x})-\pi_{0}(t,\boldsymbol{x})|^{2}\,dF_{X}(\boldsymbol{x})=O_{p}\left(\{K^{-2\alpha}+h_{0}^{4}\}+\frac{\exp\left(2h_{0}^{-\beta}/\gamma\right)}{h_{0}}\cdot\frac{K}{N}\right),\\
&\frac{1}{N}\sum_{i=1}^{N}|\widehat{\pi}(t,\boldsymbol{X}_{i})-\pi_{0}(t,\boldsymbol{X}_{i})|^{2}=O_{p}\left(\{K^{-2\alpha}+h_{0}^{4}\}+\frac{\exp\left(2h_{0}^{-\beta}/\gamma\right)}{h_{0}}\cdot\frac{K}{N}\right).
\end{aligned}

The proof of Theorem 4.3 is presented in Appendix C. Comparing these results with those in Theorem 4.1, the asymptotic bias is the same as in the ordinary smooth case. The rate of the asymptotic variance, however, becomes much slower, as is expected in the errors-in-variables context; see Fan and Truong (1993) and Delaigle et al. (2009).

Theorem 4.4

Suppose that the error $U$ is supersmooth of order $\beta$ satisfying (19) and that Assumption S and Assumptions 1–8 hold. Letting $e(h):=h^{1/2}\exp(-h^{-\beta}/\gamma)$, we have $v_{h}(t)=\mathbb{E}\{L^{2}_{U,h}(t-S)\}=O\{e(h)^{-2}\}$. If, as $h\rightarrow 0$, $v_{h}(t)\rightarrow\infty$ and

\frac{(K^{-\ell}+h_{0}^{2})\cdot(K^{-\alpha}+h_{0}^{2})}{h^{2}}+\frac{K}{\{e(h)\wedge e(h_{0})\}\sqrt{N}}\to 0\quad\text{as}\ N\rightarrow\infty\,,

then, for every fixed t𝒯t\in\mathcal{T},

\begin{aligned}
\widehat{\mu}(t)-\mu(t)={}& \frac{\kappa_{21}}{2}\bigg[\frac{f_{T}(t)\Phi_{1}(t)-\mu(t)\partial_{t}^{2}f_{T}(t)}{f_{T}(t)}\bigg]\cdot h^{2}+o(h^{2})\\
&+\frac{\kappa_{21}}{2}\bigg[\frac{\mu(t)\partial_{t}^{2}f_{T}(t)-f_{T}(t)\Phi_{2}(t)}{f_{T}(t)}\bigg]\cdot h_{0}^{2}+o(h_{0}^{2})\\
&+\sum_{i=1}^{N}\frac{\eta_{h,h_{0}}(S_{i},\boldsymbol{X}_{i},Y_{i};t)}{N\cdot f_{T}(t)}\cdot\{1+o_{P}(1)\}\,,
\end{aligned}
\tag{21}

where $\kappa_{21}$, $\Phi_{1}(t)$, $\Phi_{2}(t)$, and $\eta_{h,h_{0}}$ are defined as in Theorem 4.2, and

\{Nf_{T}(t)\}^{-1}\sum_{i=1}^{N}\eta_{h,h_{0}}(S_{i},\boldsymbol{X}_{i},Y_{i};t)=O_{p}\big[N^{-1/2}\{e(h)\wedge e(h_{0})\}^{-1}\big]\,.

Moreover, if $v_{h}(t)\geq d_{1}f_{S}(t)h^{d_{3}}\exp(2h^{-\beta}/\gamma-d_{2}h^{-d_{4}\beta})$ for some constants $d_{1},d_{2}>0$, $0<d_{4}<1$, and $d_{3}$, we have

[\mathrm{var}\{\eta_{h,h_{0}}(S_{i},\boldsymbol{X}_{i},Y_{i};t)\}]^{-1/2}\cdot\frac{1}{\sqrt{N}}\sum_{i=1}^{N}\eta_{h,h_{0}}(S_{i},\boldsymbol{X}_{i},Y_{i};t)\overset{D}{\to}N(0,1)\,.

The proof of Theorem 4.4 is presented in Appendix D. As in the ordinary smooth case, as long as $\pi_{0}(t,\cdot)$ is sufficiently smooth or $K$ grows sufficiently fast, and $h_{0}$ decays fast enough, the sieve approximation error of our estimator $\widehat{\mu}(t)$ is asymptotically negligible, and the dominating bias term is the same as in the ordinary smooth case. The asymptotic variance, however, is affected by the measurement error $U$: for bandwidth $b=h$ or $h_{0}$, the standard deviation converges at the rate $\{Nb\exp(-2b^{-\beta}/\gamma)\}^{-1/2}$, which is inflated by the factor $\exp(b^{-\beta}/\gamma)$ relative to the error-free rate $(Nb)^{-1/2}$, owing to the supersmoothness of the error distribution.

From the theorem, when $h\asymp h_{0}$ and $\min(h,h_{0})=d(\log N)^{-1/\beta}$ for a constant $d>(2/\gamma)^{1/\beta}$, the rate of the variance, $N^{-1}\{e(h)\wedge e(h_{0})\}^{-2}=o(h^{4}+h_{0}^{4})$, is negligible compared to the asymptotic bias, and the convergence rate of $\widehat{\mu}(t)-\mu(t)$ is $(\log N)^{-2/\beta}$. This result is analogous to those in the literature on nonparametric regression with measurement error (see e.g. Fan, 1991a; Fan and Truong, 1993; Delaigle et al., 2015, among others), and it achieves the optimal convergence rate among all nonparametric regression estimators when the regressors are measured with supersmooth errors, as shown in Fan and Truong (1993).

Note that, in the supersmooth error case, an explicit expression and the exact convergence rate of $\mathrm{var}\{\eta_{h,h_{0}}(S,\boldsymbol{X},Y;t)\}$ are extremely hard (if not impossible) to derive without additional assumptions. To establish the asymptotic distribution of $\widehat{\mu}$ via the Lyapunov central limit theorem, a lower bound on the deconvolution kernel's second moment is required; in particular, we require $v_{h}(t)\geq d_{1}f_{S}(t)h^{d_{3}}\exp(2h^{-\beta}/\gamma-d_{2}h^{-d_{4}\beta})$. Such a bound is commonly imposed in the measurement error literature; see Fan (1991a) and Delaigle et al. (2015), among others. Fan (1991a) showed that this lower bound holds under some mild conditions on $\phi_{U}$ and $\phi_{L}$ (e.g. (19) and Assumption S hold, $\phi_{L}(t)>c_{L}(1-t)^{3}$ for $t\in[1-\epsilon,1)$ for some $c_{L},\epsilon>0$, and the real part $R_{U}(t)$ and imaginary part $I_{U}(t)$ of $\phi_{U}$ satisfy $R_{U}(t)=o\{I_{U}(t)\}$ or $I_{U}(t)=o\{R_{U}(t)\}$ as $t\rightarrow\infty$). These assumptions do not exclude the commonly used kernel function $L$ defined below Assumption S, or error distributions such as the Gaussian, Cauchy, and Gaussian mixture.

Our explicit asymptotic linear expansion of $\widehat{\mu}(t)-\mu(t)$ in (21) is particularly helpful for statistical inference in the supersmooth error case, given the difficulty of deriving an explicit expression for the asymptotic variance. Most of the literature provides only the convergence rate; see Fan and Truong (1993), Meister (2006), and Meister (2009), among others.

5 Selecting the Smoothing Parameters

In this section, we discuss how to choose the three smoothing parameters $K$, $h_{0}$, and $h$ needed to compute our estimator $\widehat{\mu}(t)$ (see (14), (15), and (17)). Before delving into our method, we need some preliminaries.

5.1 Preliminaries

The smoothing parameters in nonparametric regression are usually selected by either minimising certain cross-validation (CV) criteria or minimising an approximation of the asymptotic bias and variance of the estimator.

In nonparametric errors-in-variables regression, as pointed out by Carroll et al. (2006) and Meister (2009), approximating the asymptotic bias and variance of the estimator can be extremely challenging, if not impossible. Unfortunately, the CV criteria are not computable either. To see this, assume for now that $K$ and $h_{0}$ in (14) and (15) are given, and adapt the CV criterion to our context to choose $h$, which would be

CV(h)=\sum^{N}_{i=1}\big\{\widehat{\pi}(T_{i},\boldsymbol{X}_{i})Y_{i}-\widehat{\mu}^{-i}(T_{i})\big\}^{2}w(T_{i})\,, \tag{22}

where $w$ is a weight function that prevents the CV criterion from being dominated by unreliable data points from the tails of the distribution of $T$, and $\widehat{\mu}^{-i}$ denotes the estimator obtained as in (17) but without using the observations from individual $i$. Now, we see that (22) is not computable in errors-in-variables regression problems, since the $T_{i}$'s are not observable.

To tackle this problem, Delaigle and Hall (2008) proposed combining the CV and SIMEX methods (e.g. Cook and Stefanski, 1994; Stefanski and Cook, 1995). Specifically, in the simulation step, we generate two additional sets of contaminated data, namely $S_{i,d}^{*}=S_{i}+U_{i,d}^{*}$ and $S_{i,d}^{**}=S_{i,d}^{*}+U_{i,d}^{**}$, for $i=1,\ldots,N$ and $d=1,\ldots,D$ with $D$ a large number, where the $U_{i,d}^{*}$'s and $U_{i,d}^{**}$'s are i.i.d. as $U$ in (1). Inserting first the $S_{i,d}^{*}$'s and then the $S_{i,d}^{**}$'s in (14) and (17) in place of the $S_{i}$'s, we obtain $(\widehat{\pi}^{*}_{d},\widehat{\mu}_{d}^{*})$ and $(\widehat{\pi}^{**}_{d},\widehat{\mu}_{d}^{**})$, respectively, for $d=1,\ldots,D$. The authors then suggested deriving two CV-type bandwidths, $\widehat{h}^{*}$ and $\widehat{h}^{**}$, which minimise $\sum^{D}_{d=1}CV_{d}^{*}(h)/D$ and $\sum^{D}_{d=1}CV_{d}^{**}(h)/D$, respectively, where

\begin{aligned}
CV_{d}^{*}(h)={}& \sum^{N}_{i=1}\big\{\widehat{\pi}^{*}_{d}(S_{i},\boldsymbol{X}_{i})Y_{i}-\widehat{\mu}^{*,-i}_{d}(S_{i})\big\}^{2}w(S_{i})\,,\\
CV_{d}^{**}(h)={}& \sum^{N}_{i=1}\big\{\widehat{\pi}^{**}_{d}(S^{*}_{i,d},\boldsymbol{X}_{i})Y_{i}-\widehat{\mu}^{**,-i}_{d}(S^{*}_{i,d})\big\}^{2}w(S^{*}_{i,d})\,,
\end{aligned}

for $d=1,\ldots,D$, where $\widehat{\mu}^{*,-i}_{d}$ and $\widehat{\mu}^{**,-i}_{d}$ are obtained as $\widehat{\mu}_{d}^{*}$ and $\widehat{\mu}_{d}^{**}$, respectively, but without using the observations from individual $i$.

The $S^{**}_{i,d}$'s are contaminated versions of the $S^{*}_{i,d}$'s, playing the same role for the $S^{*}_{i,d}$'s as the $S^{*}_{i,d}$'s play for the $S_{i}$'s and the $S_{i}$'s play for the $T_{i}$'s. Intuitively, we then expect the relationship between $\widehat{h}^{*}$ and our target bandwidth $h$ to be similar to that between $\widehat{h}^{**}$ and $\widehat{h}^{*}$. Thus, the authors proposed an extrapolation step to obtain an estimator of $h$. Specifically, they considered $h/\widehat{h}^{*}\approx\widehat{h}^{*}/\widehat{h}^{**}$ and used a linear back-extrapolation procedure that, in our context, would give the bandwidth

\widehat{h}_{DH}=(\widehat{h}^{*})^{2}/\widehat{h}^{**}\,. \tag{23}
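As a schematic illustration of the simulation and linear back-extrapolation steps, the following sketch assumes two hypothetical helpers: `sample_error`, which draws i.i.d. copies of $U$, and `cv_curve`, which evaluates the criterion $CV^{*}_{d}$ (or $CV^{**}_{d}$) over a grid of candidate bandwidths; it is not our exact implementation.

```python
import numpy as np

def simex_bandwidth_dh(S, h_grid, sample_error, cv_curve, D=20, seed=0):
    """Linear back-extrapolated bandwidth h_DH = (h*)^2 / h** of eq. (23).

    sample_error(n, rng) -> n i.i.d. draws of U           (hypothetical helper)
    cv_curve(h_grid, S_contaminated, S_previous) -> CV values over h_grid
                                                          (hypothetical helper)
    """
    rng = np.random.default_rng(seed)
    N = len(S)
    cv1 = np.zeros(len(h_grid))
    cv2 = np.zeros(len(h_grid))
    for _ in range(D):
        S1 = S + sample_error(N, rng)    # S*_{i,d}  = S_i      + U*_{i,d}
        S2 = S1 + sample_error(N, rng)   # S**_{i,d} = S*_{i,d} + U**_{i,d}
        cv1 += cv_curve(h_grid, S1, S)   # accumulate CV*_d
        cv2 += cv_curve(h_grid, S2, S1)  # accumulate CV**_d
    h_star = h_grid[np.argmin(cv1)]      # h-hat*:  minimiser of sum_d CV*_d / D
    h_star2 = h_grid[np.argmin(cv2)]     # h-hat**: minimiser of sum_d CV**_d / D
    return h_star**2 / h_star2           # eq. (23)
```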

5.2 Two-step Procedure and Local Constant Extrapolation

In our case, recall that we have two more smoothing parameters, $K$ and $h_{0}$. We can either extend the SIMEX method to choose all three parameters simultaneously, or choose $K$ and $h_{0}$ by other methods first and then apply SIMEX to choose $h$. The first option incurs a high computational burden and is unstable in practice. We therefore adopt the second, which leads to a two-step procedure.

Note from Theorems 4.2 and 4.4 that our estimator achieves the optimal rate when $h\asymp h_{0}$ balances the rate of the bias, $h^{2}+h_{0}^{2}$, against that of the standard deviation, $\sqrt{v_{h}(t)/N+v_{h_{0}}(t)/N}$. Note also that the plug-in bandwidth $h_{PI}$ of Delaigle and Gijbels (2002), for the deconvolution kernel estimator of the density of $T$ with bandwidth $h$, minimises the asymptotic MSE of that estimator, whose bias is of rate $h^{2}$ and whose standard deviation is of rate $\sqrt{v_{h}(t)/N}$. Thus, we should have $h\asymp h_{0}\asymp h_{PI}$. Moreover, for $K$ to satisfy all the conditions in our theorems when $h\asymp h_{0}$, we require $K=o(h^{-2})$.

Thus, we first set $h_{0}=h_{PI}$. Then, to choose $K$, we note from (11) that $\mathbb{E}\{\pi_{0}(t,\boldsymbol{X})\exp(\boldsymbol{X})|T=t\}=\mathbb{E}\{\exp(\boldsymbol{X})\}$ holds. We propose to choose $K=\lfloor\tilde{c}h_{PI}^{-2}\log(h_{PI}+1)\rfloor$ such that $K\geq 2$, where the constant $\tilde{c}$ minimises the following generalised CV criterion (Craven and Wahba, 1978):

\int_{\mathcal{T}}\bigg|\frac{\sum^{N}_{i=1}\widehat{\pi}(t,\boldsymbol{X}_{i})\exp(\boldsymbol{X}_{i})L_{U}\{(t-S_{i})/h_{PI}\}}{\sum^{N}_{i=1}L_{U}\{(t-S_{i})/h_{PI}\}}-\frac{\sum^{N}_{i=1}\exp(\boldsymbol{X}_{i})}{N}\bigg|^{2}\bigg/(1-K/N)^{2}\,dt\,.

Such a choice of $K$ and $h_{0}$ is not guaranteed to minimise the error of our final estimator $\widehat{\mu}(t)$. However, combined with our choice of $h$ below, it guarantees the optimal convergence rate of $\widehat{\mu}(t)$ if a B-spline basis is used; for the polynomial sieve basis, the smoothness parameter $\alpha$ defined in Assumption 5 needs to be larger than 1 (see Appendix A.5). Moreover, our simulation results show that this choice works well in practice (see Section 6 for more discussion).
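A minimal sketch of this step, assuming a hypothetical helper `gcv` that refits $\widehat{\pi}$ for a given $K$ and evaluates the criterion above, could look as follows.

```python
import numpy as np

def choose_K(h_PI, gcv, c_grid=np.linspace(0.5, 5.0, 10)):
    """Choose K = floor(c * h_PI^{-2} * log(h_PI + 1)) with K >= 2, where c
    minimises the generalised CV criterion (gcv: K -> scalar, hypothetical)."""
    Ks = sorted({max(2, int(c * h_PI**-2 * np.log(h_PI + 1))) for c in c_grid})
    return min(Ks, key=gcv)  # K with the smallest GCV score
```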

In the second step, we could simply adopt $\widehat{h}_{DH}$ in (23). However, in our numerical study, the linear back-extrapolation sometimes gave highly unstable results. We expected a larger value of $D$ to reduce the variability; for example, Delaigle and Hall (2008) used $D=20$. However, even with $D=40$, we still found some unacceptable results. This was somewhat expected, since the extrapolant function that should be used in practice is unknown (Carroll et al., 2006, Section 5.3.2). Therefore, we introduce a new extrapolation procedure.

In particular, instead of extrapolating parametrically from $\widehat{h}^{*}$ and $\widehat{h}^{**}$, we suggest approximating the relationship between the $h^{*}_{d}$'s and $h^{**}_{d}$'s using a local constant estimator (see Fan and Gijbels, 1996), where $h^{*}_{d}=c^{*}_{d}h_{PI}$ and $h^{**}_{d}=c^{**}_{d}h_{PI}$, with the constants $c^{*}_{d}$ and $c^{**}_{d}$ minimising $CV_{d}^{*}(h)$ and $CV_{d}^{**}(h)$, respectively, for $d=1,\ldots,D$. We then take this approximated relationship as the extrapolant function. Specifically, we choose the bandwidth $h$ to be

\widehat{h}=\frac{\sum^{D}_{d=1}h_{d}^{*}\cdot\varphi\{(\widehat{h}^{*}-h^{**}_{d})/b\}}{\sum^{D}_{d=1}\varphi\{(\widehat{h}^{*}-h^{**}_{d})/b\}}\,, \tag{24}

where $\varphi$ is the Gaussian kernel function and the bandwidth $b$ is selected by leave-one-out cross-validation. The local constant estimator is well studied and widely used, and it is fast and stable to compute. In our simulation study, we found that $D=35$ is sufficiently large to ensure good performance.
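In code, the extrapolant (24) is simply a Nadaraya-Watson regression of the $h^{*}_{d}$'s on the $h^{**}_{d}$'s, evaluated at $\widehat{h}^{*}$; a minimal sketch:

```python
import numpy as np

def local_constant_extrapolation(h_star_d, h_star2_d, h_hat_star, b):
    """Eq. (24): Gaussian-kernel local constant fit of h*_d on h**_d,
    evaluated at h-hat*; b is the cross-validated bandwidth."""
    u = (h_hat_star - np.asarray(h_star2_d)) / b
    w = np.exp(-0.5 * u**2)  # Gaussian kernel (normalising constant cancels)
    return float(np.sum(w * np.asarray(h_star_d)) / np.sum(w))
```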

Remark 3

Recall from (15) that some of the deconvolution kernel values $L_{U}\{(t-S_{i})/h_{0}\}$ may be negative, so that the maximisation of $\widehat{G}_{t}(\lambda)$ is not strictly concave in finite samples. With $h_{0}=h_{PI}$, truncating the negative $L_{U}\{(t-S_{i})/h_{0}\}$'s to 0 is a fast and stable way to resolve the problem, and it performed well in our simulations.
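In code, the truncation of Remark 3 is a one-liner; here `L_U_vals` stands for the (possibly negative) deconvolution kernel values $L_{U}\{(t-S_{i})/h_{0}\}$ (illustrative values only, not from the paper).

```python
import numpy as np

L_U_vals = np.array([0.8, -0.1, 0.3, -0.05])  # illustrative kernel values
L_U_trunc = np.maximum(L_U_vals, 0.0)          # truncate negatives to zero
```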

6 Numerical Properties

Figure 1: Plots of the true curve (solid line) and estimated curves corresponding to the 1st (dashed line), 2nd (dash-dotted line), and 3rd (dotted line) quartiles of the ISEs from the 200 Monte Carlo samples of models 1 (row 1) and 2 (row 2) with Laplace measurement errors and $N=500$. The third column depicts the boxplots of the 200 ISEs of each estimator.
Figure 2: Plots of the true curve (solid line) and estimated curves CM (row 1) and $\widehat{\text{CM}}$ (row 2) corresponding to the 1st (dashed line), 2nd (dash-dotted line), and 3rd (dotted line) quartiles of the 200 ISEs from model 3 with Laplace measurement errors and $N=250$ (left) and $N=500$ (centre), and the boxplots of the 200 ISEs (right).
Figure 3: Plots of the true curve (solid line) and estimated curves corresponding to the 1st (dashed line), 2nd (dash-dotted line), and 3rd (dotted line) quartiles of the 200 ISEs from model 4 with Gaussian measurement errors and $N=250$.

6.1 Simulation Settings

Let $\xi_{x,1},\ldots,\xi_{x,10}$ be i.i.d. uniform random variables supported on $[0,1]$ and $\xi_{t}\sim N(0,1)$. We consider the following four models, in which $\xi_{y}$ is generated from a standard normal distribution for models 1 and 4 and from a uniform distribution supported on $[0,1]$ for models 2 and 3:

1. $X=0.3+0.4\xi_{x,1}$, $T=X+\xi_{t}$, and $Y^{\ast}(t)=(t-0.5)^{2}+X+\xi_{y}$ ($X$ affects $T$ and $Y$ linearly);

2. $X=\sum^{2}_{j=1}0.3\xi_{x,j}$, $T=1+X^{2}+\xi_{t}$, and $Y^{\ast}(t)=\exp(-6+6t)/\{1+\exp(-6+6t)\}+X+\xi_{y}$ ($X$ affects $T$ nonlinearly and $Y$ linearly);

3. $X=\sum^{10}_{j=1}0.2\xi_{x,j}$, $T=X+\xi_{t}$, and $Y^{\ast}(t)=-t+\sqrt{X}+\xi_{y}$ ($X$ affects $T$ linearly but $Y$ nonlinearly);

4. $X=0.2+0.6\xi_{x,1}$, $T=\sqrt{X}-0.7+\xi_{t}$, and $Y^{\ast}(t)=t+\exp(X)+\xi_{y}$ ($X$ affects $T$ and $Y$ nonlinearly).

For each model, we generate 200 samples of $(S,X,Y)$ of size 250 or 500, where $S=T+U$ with $U$ either a Laplace random variable with mean 0 and $\text{var}(U)/\text{var}(T)=0.25$, or a mean-zero Gaussian random variable with $\text{var}(U)/\text{var}(T)=0.2$.
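For instance, a sample from model 1 with Laplace errors can be generated as follows (a sketch, not our exact implementation; the Laplace scale is set from the sample variance of $T$).

```python
import numpy as np

rng = np.random.default_rng(1)
N = 500
X = 0.3 + 0.4 * rng.uniform(size=N)          # model 1: X = 0.3 + 0.4 xi_{x,1}
T = X + rng.normal(size=N)                   # T = X + xi_t, xi_t ~ N(0, 1)
Y = (T - 0.5) ** 2 + X + rng.normal(size=N)  # observed outcome Y = Y*(T)
b = np.sqrt(0.25 * T.var() / 2)              # var(Laplace(0, b)) = 2 b^2
S = T + rng.laplace(scale=b, size=N)         # S = T + U, var(U)/var(T) = 0.25
```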

For each combination of model, sample size, and measurement error type, we calculate our estimator in (17). To measure the quality of an estimator, we calculate the integrated squared error $\text{ISE}=\int_{q_{0.1}}^{q_{0.9}}\{\widehat{\mu}(t)-\mu(t)\}^{2}\,dt$, where $q_{0.1}$ and $q_{0.9}$ are the 0.1 and 0.9 quantiles of $T$, respectively.
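The ISE can be approximated numerically, e.g. by the trapezoid rule on a grid between the two quantiles (a sketch; `mu_hat` and `mu` are assumed to be callables).

```python
import numpy as np

def ise(mu_hat, mu, q_lo, q_hi, n_grid=401):
    """ISE = int_{q_{0.1}}^{q_{0.9}} {mu_hat(t) - mu(t)}^2 dt (trapezoid rule)."""
    t = np.linspace(q_lo, q_hi, n_grid)
    return np.trapz((mu_hat(t) - mu(t)) ** 2, t)
```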

To highlight the importance of accounting for the measurement errors in the estimation, we also calculate, for each sample, the naive estimator that ignores the errors. That is, we apply the estimator of Ai et al. (2021, 2022) to our data, replacing the $T_{i}$'s there with the $S_{i}$'s. Specifically, the naive estimator is

\frac{\sum^{N}_{i=1}\widetilde{\pi}(S_{i},\boldsymbol{X}_{i})Y_{i}\varphi\{(t-S_{i})/h_{n}\}}{\sum^{N}_{i=1}\varphi\{(t-S_{i})/h_{n}\}}\,,\quad t\in\mathcal{T}\,, \tag{25}

where $\varphi$ is the standard normal density function and $\widetilde{\pi}(S_{i},\boldsymbol{X}_{i})$ is calculated using the corresponding error-free estimator (see Ai et al., 2021, 2022 for more details).
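The naive estimator (25) is a standard Nadaraya-Watson smoother applied to the contaminated $S_{i}$'s; a minimal sketch, with the weights $\widetilde{\pi}(S_{i},\boldsymbol{X}_{i})$ supplied as an assumed input:

```python
import numpy as np
from scipy.stats import norm

def naive_estimator(t_grid, S, pi_tilde, Y, h_n):
    """Eq. (25): Nadaraya-Watson smoother of pi-tilde(S_i, X_i) * Y_i on S_i,
    ignoring the measurement error. pi_tilde is an assumed input."""
    W = norm.pdf((t_grid[:, None] - S[None, :]) / h_n)  # varphi{(t - S_i)/h_n}
    return (W @ (pi_tilde * Y)) / W.sum(axis=1)
```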

Unless otherwise specified, we take the kernel function $L$ for the deconvolution kernel method to be the one whose Fourier transform is $\phi_{L}(u)=(1-u^{2})^{3}\cdot\mathbbm{1}_{[-1,1]}(u)$.
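For reference, the deconvolution kernel built from this $\phi_{L}$, $L_{U}(x)=(2\pi)^{-1}\int e^{-\mathrm{i}ux}\phi_{L}(u)/\phi_{U}(u/h)\,du$ (Stefanski and Carroll, 1990), can be evaluated numerically as in the sketch below; the Laplace characteristic function at the end is an illustrative example with an arbitrary scale, not a quantity fixed by the paper.

```python
import numpy as np

def phi_L(u):
    """Fourier transform of the kernel L: (1 - u^2)^3 on [-1, 1]."""
    return np.where(np.abs(u) <= 1, (1 - u**2) ** 3, 0.0)

def L_U(x, h, phi_U):
    """Deconvolution kernel L_U(x) = (1/2pi) int e^{-iux} phi_L(u)/phi_U(u/h) du,
    computed by the trapezoid rule on [-1, 1], outside which phi_L vanishes."""
    u = np.linspace(-1.0, 1.0, 2001)
    x = np.atleast_1d(np.asarray(x, dtype=float))
    integrand = np.exp(-1j * np.outer(x, u)) * phi_L(u) / phi_U(u / h)
    return np.real(np.trapz(integrand, u, axis=1)) / (2 * np.pi)

# Example: Laplace(0, b) error, whose characteristic function is 1/(1 + b^2 t^2).
phi_U_laplace = lambda t, b=0.5: 1.0 / (1.0 + (b * t) ** 2)
vals = L_U(np.linspace(-3, 3, 7), h=0.3, phi_U=phi_U_laplace)
```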

To illustrate the potential benefit of our methods over the naive estimator without confounding it with the effect of the smoothing parameter selectors, we first use the theoretically optimal smoothing parameters for each method, i.e. the parameters that jointly minimise the integrated squared error (ISE) of that method. This yields the optimal naive estimator (NV) and the optimal proposed conditional moment estimator $\widehat{\mu}$ (CM).

Recall from Section 5.2 that we do not choose $K$ and $h_{0}$ by minimising the estimated ISE of our estimator. To see how much we might lose by doing so, we also calculate the estimator with our choice of $K$ and $h_{0}$ and the optimal $h$ that minimises the ISE, which we denote by $\widetilde{\text{CM}}$.

Finally, to assess the performance of our method in practice, we calculate $\widehat{\mu}$ using the smoothing parameters selected from the data by the method in Section 5.2 and denote it by $\widehat{\text{CM}}$. We take the weight function $w$ to be an indicator function that equals 1 when the $S_{i}$'s or $S^{*}_{i,d}$'s are within their 5% to 95% quantiles, and 0 otherwise. For comparison, we also compute the naive estimator in (25) with $K_{1}$, $K_{2}$, and $h_{n}$ selected using 10-fold CV, denoted by $\widehat{\text{NV}}$.

6.2 Simulation Results

In this section, we present our simulation results. The full results, comprising boxplots of the 200 ISE values of each estimator obtained from the 200 simulated samples for models 1 to 4, are given in Appendix A.6. Figures 1 to 3 depict the true curve of each model and the three estimated curves corresponding to the 1st, 2nd, and 3rd quartiles of the 200 ISE values.

Overall, the simulation results show that our method with the theoretically optimal smoothing parameters (CM) performs better than the naive one (NV). This confirms the advantage of adapting the estimation to the measurement errors. A graphical example is presented in Figure 1, which shows the quartile curves of NV and CM for models 1 and 2 with Laplace measurement errors and $N=500$.

The simulation results also confirm the theoretical prediction that the performance of our method improves as the sample size increases. Figure 2 exemplifies the effect of increasing $N$ by depicting the quartile curves and ISE boxplots of CM and $\widehat{\text{CM}}$ for model 3 when the measurement errors follow a Laplace distribution, with $N=250$ and $N=500$. The improvement with increasing sample size can also be seen in the boxplots in Appendix A.6.

Comparing the ISE values of $\widetilde{\text{CM}}$ with those of CM, we find that our choice of $K$ and $h_{0}$ discussed in Section 5.2 lowers the performance of our estimator only marginally in most cases. Recall that $K$ is the number of polynomial basis functions of $\boldsymbol{X}$ used to estimate $\pi_{0}$ and $h_{0}$ is the bandwidth used to estimate $\mathbb{E}\{\pi_{0}(t,\boldsymbol{X})|T=t\}$; both relate to the dependence of $T$ on $\boldsymbol{X}$ and of $Y^{*}(t)$ on $\boldsymbol{X}$. We therefore included nonlinear relationships between $T$ and $\boldsymbol{X}$ or between $Y^{*}(t)$ and $\boldsymbol{X}$ in the simulation models (see models 2 to 4). Our choice of $K$ and $h_{0}$ still works well compared with the optimal one. Figure 3 provides a graphical example from model 4, where $\boldsymbol{X}$ affects both $T$ and $Y^{*}(t)$ nonlinearly, with Gaussian measurement errors and $N=250$.

Finally, from the behaviour of $\widehat{\text{CM}}$ in all our simulation studies, we observe that our modified SIMEX method introduced in Section 5.2 performs well and stably. This can also be seen in Figures 2 and 3.

6.3 Real Data Example

We demonstrate the practical value of our data-driven $\widehat{\text{CM}}$ estimator using the Epidemiologic Study Cohort data from NHANES-I. We estimate the causal effect of the long-term log-transformed daily saturated fat intake on the risk of breast cancer based on a sample of 3,145 women aged 25 to 50. The data were analysed by Carroll et al. (2006) using a logistic regression calibration method and are available from https://carroll.stat.tamu.edu/data-and-documentation. The daily saturated fat intake was measured using a single 24-hour recall, and the log-transformation was taken as $\log(5+\text{saturated fat})$. Previous nutrition studies have estimated that over 75% of the variance in those data is made up of measurement error. According to Carroll et al. (2006), it is reasonable to assume the classical measurement error model (1) with a Gaussian measurement error $U$. The outcome variable $Y$ equals 1 if the individual has breast cancer and 0 otherwise. The covariates in $\boldsymbol{X}$ are age, the poverty index ratio, the body mass index, alcohol use (yes or no), family history of breast cancer, age at menarche (a dummy variable taking 1 if age is $\leq 12$), menopausal status (pre or post), and race, all of which are assumed to have been measured without appreciable error.

We first apply our estimator to the data for a Gaussian measurement error with error variance $\text{var}(U)/\text{var}(S)=0.75$, which corresponds to $\text{var}(U)/\text{var}(T)=3$. As pointed out by Delaigle and Gijbels (2004), the error variances estimated by other nutrition studies may be inaccurate. Thus, we also consider the cases $\text{var}(U)/\text{var}(S)=0.43$, $\text{var}(U)/\text{var}(S)=0.17$, and $\text{var}(U)=0$ (i.e. $\text{var}(U)/\text{var}(T)=0.75$, $\text{var}(U)/\text{var}(T)=0.2$, and the error-free case).
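These conversions follow from the classical model (1): $S=T+U$ with $U$ independent of $T$, so $\operatorname{var}(S)=\operatorname{var}(T)+\operatorname{var}(U)$ and hence

\frac{\operatorname{var}(U)}{\operatorname{var}(S)}=r
\quad\Longleftrightarrow\quad
\frac{\operatorname{var}(U)}{\operatorname{var}(T)}=\frac{r}{1-r}\,,
\qquad \frac{0.75}{0.25}=3\,,\quad \frac{0.43}{0.57}\approx 0.75\,,\quad \frac{0.17}{0.83}\approx 0.2\,.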

Figure 4: Estimation of the treatment effect of the log-saturated fat intake on the risk of breast cancer for a Gaussian error with $\text{var}(U)/\text{var}(S)=0.75$ (solid line), $\text{var}(U)/\text{var}(S)=0.43$ (dash-dotted line), $\text{var}(U)/\text{var}(S)=0.17$ (dashed line) with its 95% pointwise confidence band (shaded), and $\text{var}(U)=0$ (dotted line).

Figure 4 presents the estimated curves using the smoothing parameters selected as described in Section 5, together with a 95% undersmoothed pointwise confidence band for $\text{var}(U)/\text{var}(S)=0.17$ (see Appendix A.4 for the method and for the confidence bands for $\text{var}(U)/\text{var}(S)=0.43$ and 0.75). Overall, the estimated risk of breast cancer shows a decreasing trend across the range of transformed saturated fat intake. When the measurement error variance is 0.17 of $\text{var}(S)$ or 0, there is a marginal increasing trend between $t=3$ and 3.4. The 95% confidence bands for $\text{var}(U)/\text{var}(S)=0.17$ and 0.43 show an overall decreasing trend with a slight increase between $t=3$ and 3.4. These findings concur with the results of Carroll et al. (2006), who found in their multivariate logistic regression calibration that the coefficient of the log-transformed saturated fat intake on the risk of breast cancer was significant and negative. However, the results should be treated with extreme caution because of possible misclassification in the breast cancer data and the lack of follow-up of breast cancer cases with high fat intakes; see Carroll et al. (2006, Chapter 3.3).

Acknowledgements

The authors would like to sincerely thank the Editor, Steffen Lauritzen, the Associate Editor, and the two referees for their constructive suggestions and comments. Wei Huang's research was supported by the Professor Maurice H. Belz Fund of the University of Melbourne. Zheng Zhang's research was supported by the National Natural Science Foundation of China [grant number 12001535], the Natural Science Foundation of Beijing [grant number 1222007], and the fund for building world-class universities (disciplines) of Renmin University of China [project number KYGJC2022014]. The authors contributed equally to this work and are listed in alphabetical order.

References

  • Ai et al. (2021) Ai, C., Linton, O., Motegi, K. and Zhang, Z. (2021) A unified framework for efficient estimation of general treatment models. Quantitative Economics, 12, 779–816.
  • Ai et al. (2022) Ai, C., Linton, O. and Zhang, Z. (2022) Estimation and inference for the counterfactual distribution and quantile functions in continuous treatment models. Journal of Econometrics, 228, 39–61.
  • Battistin and Chesher (2014) Battistin, E. and Chesher, A. (2014) Treatment effect estimation with covariate measurement error. Journal of Econometrics, 178, 707–715.
  • Calonico et al. (2018) Calonico, S., Cattaneo, M. D. and Farrell, M. H. (2018) On the effect of bias estimation on coverage accuracy in nonparametric inference. Journal of the American Statistical Association, 113, 767–779.
  • Carroll and Hall (2004) Carroll, R. J. and Hall, P. (2004) Low order approximations in deconvolution and regression with errors in variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 66, 31–46.
  • Carroll et al. (2006) Carroll, R. J., Ruppert, D., Stefanski, L. A. and Crainiceanu, C. M. (2006) Measurement Error in Nonlinear Models: A Modern Perspective. Chapman & Hall/CRC.
  • Chen (2007) Chen, X. (2007) Large sample sieve estimation of semi-nonparametric models. Handbook of Econometrics, 6, 5549–5632.
  • Chen et al. (2008) Chen, X., Hong, H. and Tarozzi, A. (2008) Semiparametric efficiency in GMM models with auxiliary data. The Annals of Statistics, 36, 808–843.
  • Cheng and Chu (2004) Cheng, K. F. and Chu, C.-K. (2004) Semiparametric density estimation under a two-sample density ratio model. Bernoulli, 10, 583–604.
  • Cook and Stefanski (1994) Cook, J. and Stefanski, L. (1994) Simulation-extrapolation estimation in parametric measurement error models. Journal of the American Statistical Association, 89, 1314–1328.
  • Craven and Wahba (1978) Craven, P. and Wahba, G. (1978) Smoothing noisy data with spline functions. Numerische Mathematik, 31, 377–403.
  • Delaigle et al. (2009) Delaigle, A., Fan, J. and Carroll, R. J. (2009) A design-adaptive local polynomial estimator for the errors-in-variables problem. Journal of the American Statistical Association, 104, 348–359.
  • Delaigle and Gijbels (2002) Delaigle, A. and Gijbels, I. (2002) Estimation of integrated squared density derivatives from a contaminated sample. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 64, 869–886.
  • Delaigle and Gijbels (2004) — (2004) Bootstrap bandwidth selection in kernel density estimation from a contaminated sample. Annals of the Institute of Statistical Mathematics, 56, 19–47.
  • Delaigle and Hall (2008) Delaigle, A. and Hall, P. (2008) Using simex for smoothing-parameter choice in errors-in-variables problems. Journal of the American Statistical Association, 103, 280–287.
  • Delaigle and Hall (2016) — (2016) Methodology for non-parametric deconvolution when the error distribution is unknown. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 231–252.
  • Delaigle et al. (2008) Delaigle, A., Hall, P. and Meister, A. (2008) On deconvolution with repeated measurements. The Annals of Statistics, 36, 665–685.
  • Delaigle et al. (2015) Delaigle, A., Hall, P. and Wishart, J. (2015) Confidence bands in nonparametric errors-in-variables regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 77, 149–169.
  • Díaz and van der Laan (2013) Díaz, I. and van der Laan, M. J. (2013) Targeted data adaptive estimation of the causal dose–response curve. Journal of Causal Inference, 1, 171–192.
  • Díaz et al. (2021) Díaz, I., Williams, N., Hoffman, K. L. and Schenck, E. J. (2021) Nonparametric causal effects based on longitudinal modified treatment policies. Journal of the American Statistical Association, 1–16.
  • Diggle and Hall (1993) Diggle, P. J. and Hall, P. (1993) A fourier approach to nonparametric deconvolution of a density estimate. Journal of the Royal Statistical Society: Series B (Methodological), 55, 523–531.
  • Dong et al. (2021) Dong, Y., Lee, Y.-Y. and Gou, M. (2021) Regression discontinuity designs with a continuous treatment. Journal of the American Statistical Association, 1–31.
  • D’Amour et al. (2021) D’Amour, A., Ding, P., Feller, A., Lei, L. and Sekhon, J. (2021) Overlap in observational studies with high-dimensional covariates. Journal of Econometrics, 221, 644–654.
  • Fan (1991a) Fan, J. (1991a) Asymptotic normality for deconvolution kernel density estimators. Sankhyā: The Indian Journal of Statistics, Series A, 97–110.
  • Fan (1991b) — (1991b) On the optimal rates of convergence for nonparametric deconvolution problems. The Annals of Statistics, 1257–1272.
  • Fan and Gijbels (1996) Fan, J. and Gijbels, I. (1996) Local Polynomial Modelling and Its Applications. Chapman & Hall/CRC.
  • Fan et al. (2021) Fan, J., Imai, K., Lee, I., Liu, H., Ning, Y. and Yang, X. (2021) Optimal covariate balancing conditions in propensity score estimation. Journal of Business & Economic Statistics, 1–14.
  • Fan and Truong (1993) Fan, J. and Truong, Y. K. (1993) Nonparametric regression with errors in variables. The Annals of Statistics, 21, 1900–1925.
  • Fong et al. (2018) Fong, C., Hazlett, C. and Imai, K. (2018) Covariate balancing propensity score for a continuous treatment: Application to the efficacy of political advertisements. The Annals of Applied Statistics, 12, 156–177.
  • Galvao and Wang (2015) Galvao, A. F. and Wang, L. (2015) Uniformly semiparametric efficient estimation of treatment effects with a continuous treatment. Journal of the American Statistical Association, 110, 1528–1542.
  • Hahn (1998) Hahn, J. (1998) On the role of the propensity score in efficient semiparametric estimation of average treatment effects. Econometrica, 66, 315–331.
  • Hall and Lahiri (2008) Hall, P. and Lahiri, S. N. (2008) Estimation of distributions, moments and quantiles in deconvolution problems. The Annals of Statistics, 36, 2110–2134.
  • Hall et al. (2007) Hall, P. and Meister, A. (2007) A ridge-parameter approach to deconvolution. The Annals of Statistics, 35, 1535–1558.
  • Hansen (1982) Hansen, L. (1982) Large sample properties of generalized method of moments estimators. Econometrica, 50, 1029–1054.
  • Hirano and Imbens (2004) Hirano, K. and Imbens, G. W. (2004) The propensity score with continuous treatments. Applied Bayesian Modeling and Causal Inference from Incomplete-data Perspectives, 226164, 73–84.
  • Hirano et al. (2003) Hirano, K., Imbens, G. W. and Ridder, G. (2003) Efficient estimation of average treatment effects using the estimated propensity score. Econometrica, 71, 1161–1189.
  • Huang et al. (2021) Huang, W., Linton, O. and Zhang, Z. (2021) A unified framework for specification tests of continuous treatment effect models. Journal of Business & Economic Statistics, forthcoming.
  • Huber et al. (2020) Huber, M., Hsu, Y.-C., Lee, Y.-Y. and Lettry, L. (2020) Direct and indirect effects of continuous treatments based on generalized propensity score weighting. Journal of Applied Econometrics, 35, 814–840.
  • Imbens et al. (1998) Imbens, G., Johnson, P. and Spady, R. H. (1998) Information theoretic approaches to inference in moment condition models. Econometrica, 66, 333–357.
  • Kang and Schafer (2007) Kang, J. D. and Schafer, J. L. (2007) Demystifying double robustness: A comparison of alternative strategies for estimating a population mean from incomplete data. Statistical Science, 22, 523–539.
  • Kennedy et al. (2017) Kennedy, E. H., Ma, Z., McHugh, M. D. and Small, D. S. (2017) Non-parametric methods for doubly robust estimation of continuous treatment effects. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79, 1229–1245.
  • Kitamura and Stutzer (1997) Kitamura, Y. and Stutzer, M. (1997) An information-theoretic alternative to generalized method of moments estimation. Econometrica: Journal of the Econometric Society, 861–874.
  • Lewbel (2007) Lewbel, A. (2007) Estimation of average treatment effects with misclassification. Econometrica, 75, 537–551.
  • Liang (2000) Liang, H. (2000) Asymptotic normality of parametric part in partially linear models with measurement error in the nonparametric part. Journal of Statistical Planning and Inference, 86, 51–62.
  • Ma and Wang (2020) Ma, X. and Wang, J. (2020) Robust inference using inverse probability weighting. Journal of the American Statistical Association, 115, 1851–1860.
  • Mahajan (2006) Mahajan, A. (2006) Identification and estimation of regression models with misclassification. Econometrica, 74, 631–665.
  • Meister (2006) Meister, A. (2006) Density estimation with normal measurement error with unknown variance. Statistica Sinica, 195–211.
  • Meister (2009) — (2009) Deconvolution Problems in Nonparametric Statistics. Springer-Verlag Berlin Heidelberg.
  • Molinari (2008) Molinari, F. (2008) Partial identification of probability distributions with misclassified data. Journal of Econometrics, 144, 81–117.
  • Muñoz and Van Der Laan (2012) Muñoz, I. D. and Van Der Laan, M. (2012) Population intervention causal effects based on stochastic interventions. Biometrics, 68, 541–549.
  • Newey (1997) Newey, W. K. (1997) Convergence rates and asymptotic normality for series estimators. Journal of Econometrics, 79, 147–168.
  • Owen (2001) Owen, A. B. (2001) Empirical likelihood. Chapman and Hall/CRC.
  • Qin (1998) Qin, J. (1998) Inferences for case-control and semiparametric two-sample density ratio models. Biometrika, 85, 619–630.
  • Rosenbaum and Rubin (1983) Rosenbaum, P. R. and Rubin, D. B. (1983) The central role of the propensity score in observational studies for causal effects. Biometrika, 70, 45–55.
  • Rosenbaum and Rubin (1984) — (1984) Reducing bias in observational studies using subclassification on the propensity score. Journal of the American Statistical Association, 79, 516–524.
  • Stefanski and Carroll (1990) Stefanski, L. and Carroll, R. J. (1990) Deconvoluting kernel density estimators. Statistics, 21, 169–184.
  • Stefanski and Cook (1995) Stefanski, L. and Cook, J. (1995) Simulation-extrapolation: the measurement error jackknife. Journal of the American Statistical Association, 90, 1247–1256.
  • Van Der Vaart et al. (1996) van der Vaart, A. W. and Wellner, J. A. (1996) Weak Convergence and Empirical Processes: With Applications to Statistics. Springer Science & Business Media.