
Nonparametric inference on non-negative dissimilarity measures at the boundary of the parameter space

Aaron Hudson
Fred Hutchinson Cancer Center
Abstract

It is often of interest to assess whether a function-valued statistical parameter, such as a density function or a mean regression function, is equal to any function in a class of candidate null parameters. This can be framed as a statistical inference problem where the target estimand is a scalar measure of dissimilarity between the true function-valued parameter and the closest function among all candidate null values. These estimands are typically defined to be zero when the null holds and positive otherwise. While there is well-established theory and methodology for performing efficient inference when one assumes a parametric model for the function-valued parameter, methods for inference in the nonparametric setting are limited. When the null holds, and the target estimand resides at the boundary of the parameter space, existing nonparametric estimators either achieve a non-standard limiting distribution or a sub-optimal convergence rate, making inference challenging. In this work, we propose a strategy for constructing nonparametric estimators with improved asymptotic performance. Notably, our estimators converge at the parametric rate at the boundary of the parameter space and also achieve a tractable null limiting distribution. As illustrations, we discuss how this framework can be applied to perform inference in nonparametric regression problems, and also to perform nonparametric assessment of stochastic dependence.

1 Introduction

Suppose we are interested in studying a function-valued parameter of an unknown probability distribution, such as a conditional mean function or a density function. For such parameters, one can typically define a goodness-of-fit functional, which measures the closeness of any given candidate function to the true population parameter. The goodness-of-fit achieves its minimum when evaluated at the true population parameter. It is often of scientific interest to compare multiple models for the function-valued parameter. In particular, one may seek to determine whether the minimizer of the goodness-of-fit over a, possibly large, function class is equal to the minimizer over a smaller sub-class. The difference between the minima over reduced and full function classes can serve as a natural measure of dissimilarity for comparing the corresponding minimizers. This dissimilarity measure is non-negative, with values of zero corresponding to no dissimilarity. The main focus of this work is on estimation of such dissimilarity measures and testing the null hypothesis of no dissimilarity, or equality of goodness-of-fit.

As an example, suppose that an investigator would like to determine whether an exposure is conditionally associated with an outcome, given a set of confounding variables. This can be formulated as a statistical inference problem, where the objective is to determine whether the conditional mean of the outcome, given both the exposure and confounders, is equivalent to the conditional mean of the outcome, given only the confounders. One can specify a full model for the conditional mean as a class of functions that depends on both the exposure and confounders, while the reduced model is the subclass of functions that may depend on the confounders but do not depend on the exposure. Several goodness-of-fit measures, such as the expected squared error loss, can be used to assess how close a candidate parameter is to the conditional mean given the exposure and confounders. One can then test for conditional independence by assessing whether the best approximation of the conditional mean in the full model class is an improvement over the best approximation in the reduced class, in terms of the goodness-of-fit.

When the function-valued parameter of interest is modeled using a finite-dimensional function class, there are standard procedures available for performing inference. For instance, the classical likelihood ratio test is widely-used to compare classes of regression functions when the conditional distribution of the outcome given the predictor and covariates is assumed to belong to a parametric family of probability distributions (Wilks, 1938). There also exist approaches for efficient inference in settings where the reduced and full function classes are both infinite-dimensional, but the difference between the two classes is finite dimensional. For instance, in regression problems of the form described in the example above, it is common to assume that the conditional mean of the outcome given the exposure and covariates follows a partially linear model. In a partially linear model, the full conditional mean can be expressed as the sum of an unknown function of the confounders, which is only assumed to belong to a large infinite-dimensional function class, plus a linear function of the exposure of interest. One can therefore assess for conditional dependence by determining whether the linear function has zero slope, which is a well-studied inference problem (Chernozhukov et al., 2018; Bhattacharya and Zhao, 1997; Robinson, 1988; Donald and Newey, 1994).

In this work, we focus on the more challenging setting in which the difference between the full and reduced function classes is infinite-dimensional. Recently, several investigators have examined whether modern methods for estimation of smooth functionals of unknown probability distributions in a nonparametric model, such as targeted-minimum loss-based estimation (van der Laan and Rose, 2011, 2018) and one-step estimation (Pfanzagl, 1982), can be applied to attain inference on non-negative dissimilarity measures (Williamson et al., 2021a, b; Hines et al., 2022; Kennedy et al., 2023; Kandasamy et al., 2015). For these estimation strategies to be viable, the target estimand – in this case the non-negative dissimilarity measure – must be a pathwise differentiable functional of the underlying probability distribution with non-zero pathwise derivative. In essence, this means that the target estimand makes smooth but non-negligible changes in response to infinitesimally small perturbations around the unknown probability distribution. While pathwise differentiability of the target can be established in many examples, the pathwise derivative is typically zero when the null hypothesis of no dissimilarity holds. That the derivative is zero can be seen as a consequence of the fact that, under the null, the target estimand achieves its minimum at the true unknown distribution. In this setting, conventional estimation strategies do not achieve parametric-rate convergence or attain tractable limiting distributions, making hypothesis testing challenging.

When the target estimand satisfies additional smoothness assumptions, it can be possible to construct estimators with improved asymptotic behavior by utilizing higher-order pathwise derivatives (Pfanzagl, 1985; Robins et al., 2008; van der Vaart, 2014; Carone et al., 2018). While this approach has been successful in some examples (Luedtke et al., 2019), it is seemingly rare that for a given statistical functional, higher-order pathwise derivatives exist, so this strategy does not appear to be broadly applicable.

In this work, we propose a general method for estimation and inference on non-negative dissimilarity measures. Our proposal builds upon recent developments on the construction of omnibus tests for equality of function-valued parameters to fixed null parameters (Hudson et al., 2021; Westling, 2021). The key idea used is that one can perform inference on a function-valued parameter by estimating a large collection of simpler one-dimensional estimands that act as an effective summary thereof. Here, we show that in many instances, non-negative dissimilarity measures can be represented as the largest value in a collection of simple one-dimensional estimands. In such cases, we can estimate non-negative dissimilarity measures using the maximum of suitably well-behaved estimators for these scalar quantities. Our main results show that when efficient estimators for the simple estimands are used, the resulting estimator for the non-negative dissimilarity measure achieves parametric-rate convergence under the null and also attains a tractable limiting distribution. This makes it possible to construct well-calibrated asymptotic tests of the null. We also show that when the alternative holds, our estimator is asymptotically efficient. To the best of our knowledge, our work is the first to provide a general theoretical basis for recovering parametric rate inference on non-negative dissimilarity measures in a nonparametric model.

The remainder of the paper is organized as follows. In Section 2, we formally introduce the class of non-negative dissimilarity measures of interest, and we describe some motivating examples. In Section 3, we review an existing approach for inference based on plug-in estimation and provide a discussion of some of its limitations. In Section 4, we propose a new estimator for non-negative dissimilarity measures, and we describe its theoretical properties. In Section 5, we present multiplier bootstrap methods for testing the null of no dissimilarity, and for constructing confidence intervals. In Section 6 we discuss implementation and practical concerns. In Section 7, we illustrate how our methodology can be used to perform inference in a nonparametric regression model. We present results from our simulation study in Section 8, and we conclude with a discussion in Section 9.

2 Preliminaries

2.1 Data structure and target estimand

Let Z_{1},\ldots,Z_{n} be i.i.d. random vectors, generated from an unknown probability distribution P_{0}. We make few assumptions about P_{0} and only require that it belongs to a flexible nonparametric model \mathcal{M}, which is essentially unrestricted, aside from mild regularity conditions. For a given probability distribution P in \mathcal{M}, let \theta_{P} be a function-valued summary of interest with domain \mathcal{O}\subseteq\mathbb{R}^{d} for a positive integer d and range \mathcal{K}\subseteq\mathbb{R}. We denote by \theta_{0}:=\theta_{P_{0}} the evaluation of this summary at P_{0}.

Suppose that \theta_{P} is known to belong to a, possibly infinite-dimensional, function class \Theta. For a given distribution P, we define a real-valued functional G_{P}:\Theta\to\mathbb{R} that satisfies

\displaystyle G_{P}(\theta_{P})=\inf_{\theta\in\Theta}G_{P}(\theta). (1)

The functional G_{P} measures the goodness-of-fit of any function \theta\in\Theta – larger values of G_{P}(\theta) indicate that \theta and \theta_{P} are farther away from one another, in a sense. Throughout this paper, we use the shorthand notation G_{0}:=G_{P_{0}} to denote the value of the goodness-of-fit measure at P_{0}.

Let \Theta^{*}\subset\Theta be a subclass of \Theta, and let \theta_{P}^{*} be a function that satisfies

\displaystyle G_{P}(\theta_{P}^{*})=\inf_{\theta\in\Theta^{*}}G_{P}(\theta).

In essence, \theta^{*}_{P} is the closest function to \theta_{P} among all functions in the subclass \Theta^{*}. We define as our target parameter the difference between the goodness-of-fit of \theta_{P} and \theta^{*}_{P},

\displaystyle\Psi_{P}:=G_{P}(\theta^{*}_{P})-G_{P}(\theta_{P}), (2)

and we again use the shorthand notation \Psi_{0}:=\Psi_{P_{0}}. Throughout this manuscript, we refer to \Psi_{0} as the improvement in fit because it represents the improvement in the goodness-of-fit attained by using the full function class instead of the reduced class.

Because \Theta^{*} is contained within \Theta, it can be seen that \Psi_{P} is a non-negative statistical functional, and \Psi_{P} is equal to zero only when \theta_{P} provides no improvement in fit compared with \theta^{*}_{P}. In many applications, a problem of central importance is to determine whether \theta^{*}_{P} is inferior to \theta_{P} in terms of goodness-of-fit. We are interested in performing a test of the null hypothesis

\displaystyle H_{0}:\Psi_{0}=0. (3)

Additionally, because statistical functionals that have the representation in (2) have scientifically meaningful interpretations in some contexts, estimation of \Psi_{0} and confidence interval construction are also of practical interest. Our paper provides a general framework for estimation, testing, and confidence interval construction for statistical functionals of this form.

2.2 Examples

In what follows, we introduce some working examples. As a first example, we discuss statistical inference in nonparametric regression models, and second, we discuss a nonparametric approach for assessing dependence between a pair of random variables. We then describe a simple way to define a goodness-of-fit measure for any function-valued parameter.

Example 1: Inference in a Nonparametric Regression Model
Let Z=(W,X,Y), where Y\in\mathbb{R} is a real-valued outcome variable, and X\in\mathbb{R}^{d_{1}} and W\in\mathbb{R}^{d_{2}} are vectors of predictor variables with dimensions d_{1} and d_{2}, respectively. We define \Theta as a (possibly large) class of prediction functions with domain \mathbb{R}^{d_{1}+d_{2}} and range \mathbb{R}. Each function \theta\in\Theta takes as input a realization (w,x) of the predictor vector (W,X) and returns as output a predicted outcome.

We are interested in studying the conditional mean of the outcome given the predictors, defined as \theta_{P}:(w,x)\mapsto E_{P}[Y|X=x,W=w]. It is well-known that the conditional mean can be characterized as the minimizer of the expected squared error loss over \Theta, if \Theta is sufficiently large. That is, defining the goodness-of-fit measure

\displaystyle G_{P}:\theta\mapsto\int\left\{y-\theta(w,x)\right\}^{2}dP(w,x,y),

the conditional mean satisfies G_{P}(\theta_{P})=\inf_{\theta\in\Theta}G_{P}(\theta).

Consider now the set of candidate prediction functions that do not depend on X, which we write as

\displaystyle\Theta^{*}:=\left\{\theta\in\Theta:\theta(w,x_{1})=\theta(w,x_{2})\text{ for every }x_{1}\neq x_{2}\right\}.

When \Theta is large, any minimizer \theta^{*}_{P} of the expected squared error loss over \Theta^{*} is almost everywhere equal to the conditional mean of Y given W. We are often interested in determining whether X is an important set of predictors in the sense that it does not need to be included in a prediction function in order for the optimal squared error loss to be achieved. If X is not important in this sense, the conditional mean of Y given X and W does not depend on X, and the difference in the expected squared error loss \Psi_{0}=G_{0}(\theta^{*}_{0})-G_{0}(\theta_{0}) is zero. Otherwise, \Psi_{0} is positive. Thus, assessing variable importance can be framed as a statistical inference problem of the type described in Section 2.
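For concreteness, a minimal sketch of the naive plug-in estimate of \Psi_{0} in this example is given below; it fits flexible regressions for the full and reduced conditional means and differences their empirical mean squared errors. The choice of learner (random forests) and the simple sample split are illustrative assumptions, and, as discussed in Section 3, an estimator of this type does not by itself yield valid inference when the null holds.

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    def plugin_improvement_in_fit(W, X, Y, seed=0):
        """Naive plug-in estimate of Psi_0 in Example 1.
        W: (n, d2) confounders; X: (n, d1) exposures; Y: (n,) outcomes."""
        rng = np.random.default_rng(seed)
        n = len(Y)
        train = rng.random(n) < 0.5                 # simple sample split
        evaln = ~train
        WX = np.column_stack([W, X])
        full = RandomForestRegressor(random_state=seed).fit(WX[train], Y[train])
        reduced = RandomForestRegressor(random_state=seed).fit(W[train], Y[train])
        mse_full = np.mean((Y[evaln] - full.predict(WX[evaln])) ** 2)       # estimate of G_0(theta_0)
        mse_reduced = np.mean((Y[evaln] - reduced.predict(W[evaln])) ** 2)  # estimate of G_0(theta*_0)
        return mse_reduced - mse_full               # plug-in estimate of Psi_0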

Many recent works have studied inference on variable importance estimands of a similar form to that we describe above (see, e.g., Williamson et al., 2021a, b; Verdinelli and Wasserman, 2021; Zhang and Janson, 2020). These works all encounter difficulties with constructing estimators for their original target estimand that achieve parametric rate convergence under the null. To the best of our knowledge, there is currently no solution available to this problem.

Example 2: Nonparametric Assessment of Stochastic Dependence
Let Z=(X,Y), where X\in\mathbb{R} and Y\in\mathbb{R} are real-valued random variables, and let \theta_{P} denote the log of the joint density of (X,Y) under P with respect to some dominating measure \nu. Our objective here is to determine whether X and Y are dependent. If X and Y are independent, by basic laws of probability, the joint density function can be expressed as the product of the marginal density functions, i.e.,

\displaystyle\exp\theta_{P}(x,y)=\int\exp\theta_{P}(x,y_{1})\nu(dy_{1})\int\exp\theta_{P}(x_{1},y)\nu(dx_{1})

for all x,y\in\mathbb{R}. We can therefore assess dependence between X and Y by defining a goodness-of-fit measure for the joint density function, and determining whether the goodness-of-fit of the true joint density is lower than the goodness-of-fit of the product of the marginal densities.

Let \Theta be a collection of candidate values for the log density function, and assume that \Theta is large enough to contain \theta_{0}, the log density under P_{0}. The density function can be represented as a minimizer of the expected cross-entropy loss. Therefore, defining the goodness-of-fit measure

\displaystyle G_{P}:\theta\mapsto-\int\theta(x,y)dP(x,y),

the log joint density \theta_{P} satisfies (1).

We now define \Theta^{*} as the class of candidate log density functions for which the joint density can be expressed as the product of two marginal density functions – that is,

\displaystyle\Theta^{*}:=\left\{\theta\in\Theta:\exp\theta(x,y)=\int\exp\theta(x,y_{1})\nu(dy_{1})\int\exp\theta(x_{1},y)\nu(dx_{1})\text{ for all }x,y\in\mathbb{R}\right\}.

Any minimizer \theta^{*}_{P} of G_{P} over \Theta^{*} is almost everywhere equal to the log of the product of the marginal densities of X and Y under P. Therefore, \Psi_{0}:=G_{0}(\theta^{*}_{0})-G_{0}(\theta_{0}) is zero if X and Y are independent, and \Psi_{0} is otherwise positive. One can assess dependence between X and Y by performing inference on \Psi_{0}, so similar to the previous example, this problem falls within our framework.
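To make the interpretation of \Psi_{0} explicit, write p_{0}=\exp\theta_{0} for the joint density and p_{X}, p_{Y} for its marginal densities. A short calculation using the definitions above gives

\displaystyle\Psi_{0}=\int\theta_{0}(x,y)dP_{0}(x,y)-\int\log\left\{p_{X}(x)p_{Y}(y)\right\}dP_{0}(x,y)=\int p_{0}(x,y)\log\frac{p_{0}(x,y)}{p_{X}(x)p_{Y}(y)}d\nu(x,y),

which is the Kullback-Leibler divergence between the joint distribution and the product of its marginals; it is non-negative and equals zero exactly when X and Y are independent.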

The measure of dependence \Psi_{0} we have defined here is commonly referred to as the mutual information and has been a widely-studied measure of stochastic dependence (see, e.g., Paninski, 2003; Steuer et al., 2002). We are not aware of an existing nonparametric estimator for the mutual information that achieves parametric rate convergence under the null of independence. This appears to be a longstanding open problem.

Example 3: Generic L_{2} Distance

Suppose one is interested in assessing whether a given function-valued parameter \theta_{P} is equal to a fixed and known function \theta^{*}. For a measure \nu on \mathcal{O}, one can define the goodness-of-fit of a candidate function \theta as its integrated squared difference from \theta_{P}:

\displaystyle G_{P}:\theta\mapsto\int\left\{\theta(o)-\theta_{P}(o)\right\}^{2}d\nu(o).

Because G_{P} is non-negative, and G_{P}(\theta_{P})=0, it is easy to see that G_{P} is minimized by \theta_{P}.

One might wish to perform inference on the quantity

\displaystyle\Psi_{0}=G_{0}(\theta^{*})-G_{0}(\theta_{0})=\int\left\{\theta^{*}(o)-\theta_{0}(o)\right\}^{2}d\nu(o).

Clearly, \Psi_{0} is equal to zero only when \theta_{0} is equal to \theta^{*} \nu-almost everywhere. Estimands of this form can be of interest when one wishes to construct an omnibus test of the hypothesis that \theta_{0}=\theta^{*}. The framework we develop in this paper can be applied in this setting as well, and so, methodology for inference on the improvement in fit can be seen as generally useful for performing inference on function-valued parameters.

3 Plug-in estimation of the improvement in fit

We now describe an approach for nonparametric inference on Ψ0\Psi_{0} based on plug-in estimation, and we discuss the shortcomings of this approach. The methodology we describe below and its limitations are discussed extensively by Williamson et al. (2021b) in the context of nonparametric regression, though their theoretical and methodological results are more broadly applicable.

Suppose that for any \theta\in\Theta, G_{P}(\theta) is a pathwise differentiable functional of P, meaning that G_{P}(\theta) changes smoothly with respect to small changes in P (Bickel et al., 1998). When G_{P}(\theta) is pathwise differentiable, it is generally possible to construct an estimator G_{n}(\theta) that is asymptotically linear in the sense that

\displaystyle G_{n}(\theta)-G_{0}(\theta)=\frac{1}{n}\sum_{i=1}^{n}\phi_{P_{0}}(Z_{i};\theta)+r_{n}(\theta), (4)

where \phi_{P_{0}}(Z;\theta) has mean zero and finite variance under P_{0}, and r_{n}(\theta)=o_{P}(n^{-1/2}) is an asymptotically negligible remainder term. The function \phi_{P_{0}}(\cdot;\theta) determines the first order asymptotic behavior of G_{n}(\theta) and is commonly referred to as the influence function of G_{n}(\theta). Because G_{n}(\theta) is asymptotically linear, it is n^{1/2}-rate consistent and asymptotically Gaussian by the central limit theorem. Conventional strategies for constructing asymptotically linear estimators include one-step estimation (Pfanzagl, 1982) and targeted minimum loss-based estimation (van der Laan and Rose, 2011, 2018).
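As a simple illustration of (4), in Example 1 with a fixed candidate \theta, the empirical mean squared error

\displaystyle G_{n}(\theta)=\frac{1}{n}\sum_{i=1}^{n}\left\{Y_{i}-\theta(W_{i},X_{i})\right\}^{2}

is exactly linear, with influence function \phi_{P_{0}}(z;\theta)=\left\{y-\theta(w,x)\right\}^{2}-G_{0}(\theta) for z=(w,x,y) and remainder r_{n}(\theta)=0. For more complicated goodness-of-fit functionals, the influence function and a non-zero remainder must be derived on a case-by-case basis.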

Given an asymptotically linear estimator G_{n} for G_{0}, we can obtain estimators \theta_{n} and \theta_{n}^{*} for \theta_{0} and \theta^{*}_{0} by minimizing G_{n} over \Theta and \Theta^{*}, respectively. That is, we take

\displaystyle\theta_{n}:=\underset{\theta\in\Theta}{\text{arg min}}\,G_{n}(\theta),\quad\theta^{*}_{n}:=\underset{\theta\in\Theta^{*}}{\text{arg min}}\,G_{n}(\theta).

We can then obtain the following plug-in estimator \Psi_{n} for \Psi_{0}:

\displaystyle\Psi_{n}:=G_{n}(\theta^{*}_{n})-G_{n}(\theta_{n}).

It can be shown that, under mild regularity conditions, the plug-in estimator is asymptotically linear with influence function \phi_{P_{0}}(\cdot;\theta^{*}_{0})-\phi_{P_{0}}(\cdot;\theta_{0}) (Williamson et al., 2021b). That is, the plug-in estimator satisfies

\displaystyle\Psi_{n}-\Psi_{0}=\frac{1}{n}\sum_{i=1}^{n}\left\{\phi_{P_{0}}(Z_{i};\theta^{*}_{0})-\phi_{P_{0}}(Z_{i};\theta_{0})\right\}+o_{P}(n^{-1/2}). (5)

From an initial inspection, it would appear that there is no loss in efficiency resulting from estimating \theta_{0} and \theta_{0}^{*}. That is, if \theta_{0} and \theta_{0}^{*} were known, then the estimator G_{n}(\theta^{*}_{0})-G_{n}(\theta_{0}) would have the same asymptotically linear representation as \Psi_{n}.

Under the null, G_{n}(\theta_{n}) and G_{n}(\theta^{*}_{n}) have the same influence function, and the leading term in (5) vanishes. Therefore, the convergence rate and limiting distribution of the plug-in estimator are determined by the higher-order remainder term. When \Theta is a finite-dimensional model, it is often possible to establish that, under the null, the remainder term is O_{P}(n^{-1}) and attains a tractable limiting distribution. Conversely, in the infinite-dimensional setting, the remainder typically converges at a slower-than-n rate, and its asymptotic distribution is difficult to characterize. This makes it challenging to approximate the null sampling distribution of \Psi_{n} and hence challenging to construct a hypothesis test for no improvement in fit. Moreover, confidence intervals based on a normal approximation to the sampling distribution can fail to achieve the nominal coverage rate when \Psi_{0}=0.

In order to develop an estimator for \Psi_{0} that has better asymptotic properties than the plug-in, it is helpful for us to further investigate the source of the plug-in estimator’s poor behavior. We can first recognize that, as \Psi_{0} is a measure of an improvement in fit, estimating \Psi_{0} involves performing a search away from \theta^{*}_{0} to identify whether any candidate function in the difference between the full and reduced function classes, \Theta\setminus\Theta^{*}, provides a better fit than \theta^{*}_{0}.

Suppose now that \Theta\setminus\Theta^{*} can be expressed as a collection of, potentially many, one-dimensional sub-models. Let g be a fixed function from \mathcal{K}\times\mathbb{R} to \mathcal{K} that satisfies g(k,0)=k for any k\in\mathcal{K}. For a scalar \beta and a fixed function f:\mathcal{O}\to\mathbb{R}, we define \theta^{*}_{P,f} as the one-dimensional sub-model

\displaystyle\theta_{P,f}^{*}(\cdot;\beta):o\mapsto g(\theta^{*}_{P}(o),\beta f(o)). (6)

We have constructed our sub-model \theta^{*}_{P,f} so that it passes through the null best fit \theta^{*}_{P} at \beta=0, i.e.,

\displaystyle\theta^{*}_{P,f}(\cdot;0)=\theta^{*}_{P}(\cdot). (7)

We can therefore interpret f as the path along which \theta^{*}_{P,f} approaches \theta^{*}_{P} as \beta tends to zero. We assume that there exists a function class \mathcal{F} and a symmetric interval \mathcal{B} such that

\displaystyle\Theta\setminus\Theta^{*}=\left\{\theta=\theta^{*}_{P,f}(\cdot;\beta):f\in\mathcal{F},\beta\in\mathcal{B}\right\}.

We will see that using this representation for our model facilitates making comparisons between any function in \Theta\setminus\Theta^{*} and the best null fit.

We now define G_{P,f}:\mathcal{B}\to\mathbb{R} as the goodness-of-fit of \theta^{*}_{P,f}(\cdot;\beta), i.e.,

\displaystyle G_{P,f}(\beta):=G_{P}(\theta^{*}_{P,f}(\cdot;\beta)), (8)

and similarly as above, we use the shorthand notation G_{0,f}:=G_{P_{0},f}. We assume that G_{P,f} is a smooth convex function, and we denote the first and second derivatives of G_{P,f} in \beta by

\displaystyle G^{\prime}_{P,f}(\beta):=\frac{d}{d\beta}G_{P,f}(\beta),\quad G^{\prime\prime}_{P,f}(\beta):=\frac{d^{2}}{d\beta^{2}}G_{P,f}(\beta). (9)

We define \beta_{P,f} as the minimizer of the goodness-of-fit measure along the parametric sub-model over the interval \mathcal{B}:

\displaystyle\beta_{P,f}:=\underset{\beta\in\mathcal{B}}{\text{arg min}}\,G_{P,f}(\beta).

Due to the convexity of G_{P,f}, for large enough \mathcal{B}, \beta_{P,f} is the unique solver of G^{\prime}_{P,f}(\beta_{P,f})=0. Under this regime, \theta_{P} satisfies G_{P}(\theta_{P})=\inf_{f\in\mathcal{F}}G_{P,f}(\beta_{P,f}), and we can write \Psi_{P} as

\displaystyle\Psi_{P}=\sup_{f\in\mathcal{F}}\left\{G_{P,f}(0)-G_{P,f}(\beta_{P,f})\right\}.

We can see that, in view of condition (7), \Psi_{P}=0 only when \sup_{f\in\mathcal{F}}|\beta_{P,f}|=0.
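For illustration, consider Example 1 with the additive choice g(k,b)=k+b, so that \theta^{*}_{P,f}(\cdot;\beta)=\theta^{*}_{P}+\beta f; this choice is made here only for concreteness. Under the squared error goodness-of-fit,

\displaystyle G_{P,f}(\beta)=\int\left\{y-\theta^{*}_{P}(w)-\beta f(w,x)\right\}^{2}dP(w,x,y),\qquad\beta_{P,f}=\frac{\int f(w,x)\left\{y-\theta^{*}_{P}(w)\right\}dP(w,x,y)}{\int f(w,x)^{2}dP(w,x)},

and the improvement along the path f is G_{P,f}(0)-G_{P,f}(\beta_{P,f})=\beta_{P,f}^{2}\int f(w,x)^{2}dP(w,x). Assuming the representation above holds for this choice of \mathcal{F} and \mathcal{B}, \Psi_{P} is the supremum of these quantities over f\in\mathcal{F}.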

Let G_{n} and \theta^{*}_{n} be the estimators for G_{0} and \theta^{*}_{0} described earlier in this section, and let \theta^{*}_{n,f}(\cdot;\beta):o\mapsto g(\theta_{n}^{*}(o),\beta f(o)) be the plug-in estimator for the sub-model. We define the plug-in estimator for G_{0,f} as

\displaystyle G_{n,f}(\beta):=G_{n}(\theta_{n,f}^{*}(\cdot;\beta)),

and we write its first and second derivatives as

\displaystyle G^{\prime}_{n,f}(\beta):=\frac{d}{d\beta}G_{n,f}(\beta),\quad G^{\prime\prime}_{n,f}(\beta):=\frac{d^{2}}{d\beta^{2}}G_{n,f}(\beta).

We define the plug-in estimator \beta_{n,f} for \beta_{0,f} as the minimizer of G_{n,f} over \mathcal{B}. For large \mathcal{B}, \beta_{n,f} satisfies G^{\prime}_{n,f}(\beta_{n,f})=0 for all f in \mathcal{F}, and the plug-in estimator \theta_{n} for \theta_{0} satisfies

\displaystyle G_{n}(\theta_{n})=\inf_{f\in\mathcal{F}}G_{n,f}(\beta_{n,f}).

The plug-in estimator for \Psi_{0} can therefore be expressed as

\displaystyle\Psi_{n}=\sup_{f\in\mathcal{F}}\left\{G_{n,f}(0)-G_{n,f}(\beta_{n,f})\right\}.

Using this representation for the plug-in estimator \Psi_{n} makes it easier for us to carefully study its asymptotic behavior in the setting where \Psi_{0}=0. By performing a second order Taylor expansion for G_{n,f} around \beta_{n,f} for every f\in\mathcal{F}, we can write the plug-in as

\displaystyle\Psi_{n}=\sup_{f\in\mathcal{F}}\left\{G^{\prime}_{n,f}(\beta_{n,f})(0-\beta_{n,f})+\frac{1}{2}G^{\prime\prime}_{n,f}(\beta_{n,f})\beta_{n,f}^{2}\right\}+r_{n},

where r_{n} is a higher order remainder term that should approach zero at a faster rate than the leading terms. Because G^{\prime}_{n,f}(\beta_{n,f})=0 for all f, the first term in this expansion vanishes, leaving us with

\displaystyle\Psi_{n}=\sup_{f\in\mathcal{F}}\frac{1}{2}G^{\prime\prime}_{n,f}(\beta_{n,f})\beta_{n,f}^{2}+r_{n}.

If G^{\prime\prime}_{n,f}(\beta_{n,f}) is consistent for G^{\prime\prime}_{0,f}(\beta_{0,f}) uniformly in \mathcal{F}, then by Slutsky’s theorem, one can replace the random quantities G^{\prime\prime}_{n,f}(\beta_{n,f}) with the fixed values G^{\prime\prime}_{0,f}(\beta_{0,f}) in the above display. It would appear then that, under the null, the limiting distribution of \Psi_{n} is determined by the behavior of the stochastic process \{\beta_{n,f}:f\in\mathcal{F}\}.

If it were possible to characterize the joint limiting distribution of \{\beta_{n,f}:f\in\mathcal{F}\} under the null, where \beta_{0,f}=0 for all f\in\mathcal{F}, the limiting distribution of \Psi_{n} could be obtained using a straightforward application of the continuous mapping theorem. Typically, \beta_{0,f} is a pathwise differentiable parameter for each f\in\mathcal{F}, making it possible to construct estimators thereof that converge at an n^{1/2}-rate and achieve a Gaussian limiting distribution. Ideally, one would be able to establish that the standardized process \left\{n^{1/2}\left[\beta_{n,f}-\beta_{0,f}\right]:f\in\mathcal{F}\right\} converges weakly to a Gaussian process as long as the collection of paths \mathcal{F} is not overly complex. However, in many settings, this property is not satisfied by the plug-in estimator. One can view the plug-in estimator \beta_{n,f} for \beta_{0,f} as a functional of the estimator G_{n,f} for G_{0,f}. As stated above, estimating G_{0,f} requires us to estimate the nuisance parameter \theta^{*}_{0}. In settings where \Theta^{*} is a large nonparametric function class, our estimator \theta^{*}_{n} for \theta^{*}_{0} will necessarily converge slower than the parametric rate of n^{1/2} and may retain non-negligible asymptotic bias. Consequently, \theta^{*}_{n} generates bias for G_{n,f}, which leads to \beta_{n,f} retaining non-negligible bias as well. Indeed, \beta_{n,f} will typically converge slower than the parametric rate of n^{1/2}, causing \Psi_{n} to converge at a sub-optimal rate and achieve a non-standard limiting distribution.

To summarize, estimating \Psi_{0} requires one to perform a search away from \theta^{*}_{P} in order to attempt to identify a candidate function in \Theta\setminus\Theta^{*} that provides an improvement in the goodness-of-fit. In the regime we describe above, performing this search is equivalent to finding the best fit along each parametric sub-model that passes through the null, and subsequently taking the best fit among all of the sub-models. From the above argument, we can see that the plug-in estimator has poor asymptotic properties because the plug-in estimator for the best fit along the parametric sub-models can be sub-optimal when \Theta^{*} is large. Thus, the key to obtaining an estimator with improved asymptotic properties is to efficiently estimate the best fit along each of the sub-models that comprise \Theta\setminus\Theta^{*}.

4 Bias-corrected estimation of the improvement in fit

From the discussion in Section 3, it would seem that if one had an efficient estimator for the goodness-of-fit along each parametric sub-model, and hence an efficient estimator for \beta_{0,f}, one could obtain an estimator for the improvement in fit \Psi_{0} that has better asymptotic properties than the plug-in. In what follows, we describe a general strategy for constructing an estimator that has a tractable limiting distribution when \Psi_{0} is at the boundary of the parameter space. We show that our newly-proposed estimator enjoys the same n-rate convergence that is typically attained in parametric models.

4.1 Uniform inference along the parametric sub-models

Our proposal requires us to construct an estimator for \{G_{0,f}(\beta):f\in\mathcal{F},\beta\in\mathcal{B}\} that enables us to perform inference uniformly along the collection of parametric sub-models. In this sub-section, we first outline a set of sufficient conditions under which an estimator has asymptotic properties that facilitate uniform inference. We then describe a strategy for constructing an estimator that satisfies these conditions.

We begin by providing the assumptions upon which our first main theoretical result relies. We consider two types of assumptions. The first type (A) is a set of deterministic conditions on the goodness-of-fit functional and the underlying probability distribution, whereas the second set of assumptions (B) is stochastic in nature and describes conditions that our estimator \{\tilde{G}_{n,f}(\beta):f\in\mathcal{F},\beta\in\mathcal{B}\} must satisfy.

  • Assumption A1: For any f\in\mathcal{F} and any \beta\in\mathcal{B}, G_{P,f}(\beta) is pathwise differentiable in a nonparametric model, and its nonparametric efficient influence function is given by \phi_{P,f}(\cdot;\beta):\mathcal{Z}\to\mathbb{R}.

  • Assumption A2: G_{P,f} and \phi_{P,f} are twice differentiable in \beta for each f in \mathcal{F}, and the derivatives are given by

    \displaystyle G^{\prime}_{P,f}(\beta):=\frac{d}{d\beta}G_{P,f}(\beta),\quad G^{\prime\prime}_{P,f}(\beta):=\frac{d^{2}}{d\beta^{2}}G_{P,f}(\beta),
    \displaystyle\phi^{\prime}_{P,f}(\cdot;\beta):=\frac{d}{d\beta}\phi_{P,f}(\cdot;\beta),\quad\phi^{\prime\prime}_{P,f}(\cdot;\beta):=\frac{d^{2}}{d\beta^{2}}\phi_{P,f}(\cdot;\beta).
  • Assumption A3: There exist constants C_{1},C_{2}>0 such that, for any \{\beta_{f}:f\in\mathcal{F}\}, \sup_{f\in\mathcal{F}}\left\{G_{0,f}(\beta_{f})-G_{0,f}(\beta_{0,f})\right\}<C_{1} implies that

    \displaystyle\sup_{f\in\mathcal{F}}(\beta_{0,f}-\beta_{f})^{2}\leq C_{2}\sup_{f\in\mathcal{F}}\left\{G_{0,f}(\beta_{f})-G_{0,f}(\beta_{0,f})\right\}.
  • Assumption A4: For each f\in\mathcal{F}, G^{\prime}_{0,f}(\beta_{0,f})=0. Additionally, G^{\prime\prime}_{0,f} is bounded away from zero in a neighborhood of \beta_{0,f}, uniformly in \mathcal{F}. That is, \inf_{f\in\mathcal{F}}G^{\prime\prime}_{0,f}(\beta_{f}) is positive whenever \sup_{f\in\mathcal{F}}|\beta_{f}-\beta_{0,f}| is small.

  • Assumption A5: Both the function classes \left\{\phi_{P_{0},f}(\cdot;\beta):f\in\mathcal{F},\beta\in\mathcal{B}\right\} and \left\{\phi^{\prime}_{P_{0},f}(\cdot;\beta_{0,f}):f\in\mathcal{F}\right\} are P_{0}-Donsker.

Assumption A1 requires that the goodness-of-fit is pathwise differentiable, which, as noted in Section 3, enables us to construct n^{1/2}-consistent estimators. When G_{P,f} is a pathwise differentiable estimand, its efficient influence function is guaranteed to exist, and knowledge of the efficient influence function is often needed for constructing efficient estimators and studying their asymptotic properties in nonparametric models. We note that because we assume G_{P,f_{1}}(0)=G_{P,f_{2}}(0) for any f_{1},f_{2} (recall we assume (7) holds), the efficient influence functions \phi_{P,f_{1}}(\cdot;0) and \phi_{P,f_{2}}(\cdot;0) are also equal. Assumptions A2 and A3 state that the goodness-of-fit must be smooth and convex along each of the parametric sub-models. Assumption A4 requires that \mathcal{B} is large enough to contain the global optimizer of G_{0,f} over \mathbb{R}, and that the goodness-of-fit satisfies some additional smoothness constraints in a neighborhood of the optimizer. Assumption A5 states that, while \mathcal{F} may be specified as a large nonparametric function class, it must satisfy some mild complexity constraints.

  • Assumption B1: For any f_{1},f_{2}\in\mathcal{F} with f_{1}\neq f_{2}, we have that \tilde{G}_{n,f_{1}}(0)=\tilde{G}_{n,f_{2}}(0).

  • Assumption B2: \tilde{G}_{n,f} is an asymptotically linear estimator for G_{0,f} in the sense that the remainder \tilde{r}_{n,f}(\beta):=\left\{\tilde{G}_{n,f}(\beta)-G_{0,f}(\beta)\right\}-\frac{1}{n}\sum_{i=1}^{n}\phi_{0,f}(Z_{i};\beta) satisfies

    \displaystyle\sup_{f\in\mathcal{F},\beta\in\mathcal{B}}|\tilde{r}_{n,f}(\beta)|=o_{P}(n^{-1/2}).
  • Assumption B3: The derivative of \tilde{G}_{n,f} exists and is given by \tilde{G}^{\prime}_{n,f}(\beta)=\frac{d}{d\beta}\tilde{G}_{n,f}(\beta). Moreover, letting \tilde{r}^{\prime}_{n,f}(\beta)=\frac{d}{d\beta}\tilde{r}_{n,f}(\beta), we have

    \displaystyle\sup_{f\in\mathcal{F},\beta\in\mathcal{B}}|\tilde{r}^{\prime}_{n,f}(\beta)|=o_{P}(n^{-1/2}).
  • Assumption B4: The second derivative of \tilde{G}_{n,f} exists and is given by \tilde{G}^{\prime\prime}_{n,f}(\beta)=\frac{d^{2}}{d\beta^{2}}\tilde{G}_{n,f}(\beta). Moreover, letting \tilde{r}^{\prime\prime}_{n,f}(\beta)=\frac{d^{2}}{d\beta^{2}}\tilde{r}_{n,f}(\beta), we have

    \displaystyle\sup_{f\in\mathcal{F},\beta\in\mathcal{B}}|\tilde{r}^{\prime\prime}_{n,f}(\beta)|=o_{P}(1).

Assumption B1 states that the estimators for the goodness-of-fit along the parametric sub-models all take the same value at \beta=0. In view of condition (7), all sub-models intersect and attain the same value of the goodness-of-fit at \beta=0, so it is natural to require that our estimator also has this property. Assumption B2 places a requirement that \tilde{G}_{n,f}(\beta) is an asymptotically linear estimator for G_{0,f}(\beta), where the asymptotic linearity holds uniformly over \mathcal{F}\times\mathcal{B}. Assumption B3 states that \tilde{G}_{n,f}(\beta) is differentiable, and that the derivative \tilde{G}_{n,f}^{\prime}(\beta) is an asymptotically linear estimator for G^{\prime}_{0,f}(\beta), uniformly over \mathcal{F}\times\mathcal{B}. Finally, Assumption B4 requires that the second derivative of \tilde{G}_{n,f}(\beta) exists and converges in probability to G^{\prime\prime}_{0,f}(\beta), uniformly over \mathcal{F}\times\mathcal{B}.

For a given estimator \{\tilde{G}_{n,f}(\beta):f\in\mathcal{F},\beta\in\mathcal{B}\} of \{G_{0,f}(\beta):f\in\mathcal{F},\beta\in\mathcal{B}\}, let \{\tilde{\beta}_{n,f}:f\in\mathcal{F}\} satisfy

\displaystyle\sup_{f\in\mathcal{F}}|\tilde{G}_{n,f}^{\prime}(\tilde{\beta}_{n,f})|=o_{P}(n^{-1/2}).

The following theorem states that, under mild regularity conditions, \tilde{\beta}_{n,f} is an asymptotically linear estimator for \beta_{0,f}, and moreover that the collection \{\tilde{\beta}_{n,f}:f\in\mathcal{F}\}, when appropriately standardized, achieves a Gaussian limiting distribution.

Theorem 1.

Let \ell^{\infty}(\mathcal{F}) denote the space of bounded functionals on \mathcal{F}, and let \mathbb{H}_{0} be a tight mean-zero Gaussian process with covariance

\displaystyle\Sigma_{0}:(f_{1},f_{2})\mapsto E_{0}[\phi^{\prime}_{0,f_{1}}(Z;\beta_{0,f_{1}})\phi^{\prime}_{0,f_{2}}(Z;\beta_{0,f_{2}})].

If Assumptions A1-A4 hold, and if \{\tilde{G}_{n,f}(\beta):f\in\mathcal{F},\beta\in\mathcal{B}\} satisfies Assumptions B1-B4, then \tilde{\beta}_{n,f} is asymptotically linear with influence function

\displaystyle z\mapsto-\left\{G^{\prime\prime}_{0,f}(\beta_{0,f})\right\}^{-1}\phi^{\prime}_{0,f}(z;\beta_{0,f}).

Moreover, if A5 also holds, then the process \left\{n^{1/2}[\tilde{\beta}_{n,f}-\beta_{0,f}]:f\in\mathcal{F}\right\} converges weakly to \left\{\left[G^{\prime\prime}_{0,f}(\beta_{0,f})\right]^{-1}\mathbb{H}_{0}(f):f\in\mathcal{F}\right\} as an element of \ell^{\infty}(\mathcal{F}), with respect to the supremum norm.

Theorem 1 can be viewed as a generalization of well-known results that show M-estimators are asymptotically linear in finite-dimensional models (see, e.g., Theorem 5.23 of van der Vaart, 2000). Our result on uniform asymptotic linearity in infinite-dimensional models can be proven using a fairly standard argument.

In what follows, we suggest some approaches for constructing an estimator that satisfies Assumptions B1-B4. We describe at a high level what types of conditions are needed for a given estimation strategy to be valid, though the specific requirements depend on the target estimand G_{0,f} and vary from problem to problem. Later on, in Section 7, we demonstrate how to construct an estimator that satisfies Assumptions B1-B4 in an example.

Suppose that one has available an estimator \hat{P}_{n} for the underlying probability distribution P_{0}. Typically, estimation of the entire probability distribution P_{0} is not necessary, and one will only need to estimate nuisance components upon which G_{0,f}(\beta) and \phi_{0,f}(\cdot;\beta) depend. We assume that \{G_{P,f}(\beta):f\in\mathcal{F},\beta\in\mathcal{B}\} and \{\phi_{P,f}(\cdot;\beta):f\in\mathcal{F},\beta\in\mathcal{B}\} depend on P only through a nuisance Q_{P}, which resides in a space \mathcal{Q} endowed with norm \|\cdot\|_{\mathcal{Q}}. The true value of the nuisance component is given by Q_{P_{0}}, and the plug-in estimator for the nuisance is Q_{\hat{P}_{n}}.

As a starting point, one might consider using G_{\hat{P}_{n},f}(\beta) as an estimator for G_{0,f}(\beta). If \hat{P}_{n} belongs to the model \mathcal{M}, the plug-in estimator satisfies Assumption B1. This leaves Assumptions B2 through B4 to be verified. Suppose now that Q_{P} is itself pathwise differentiable and can therefore be estimated at an n^{1/2}-rate. Then if Q_{\hat{P}_{n}} is an asymptotically linear estimator for Q_{P_{0}}, one can argue that G_{\hat{P}_{n},f}(\beta) is also asymptotically linear by applying the delta method. Assumption B2 then holds, as long as the asymptotic linearity is preserved uniformly over \mathcal{F}\times\mathcal{B}. Asymptotic linearity of G^{\prime}_{\hat{P}_{n},f} (Assumption B3) and consistency of G^{\prime\prime}_{\hat{P}_{n},f} (Assumption B4) can be established using a similar argument.

In many instances, the nuisance Q_{P} can include quantities such as density functions or conditional mean functions, which are not pathwise differentiable in a nonparametric model. In this case, it is not possible to construct an n^{1/2}-rate consistent estimator for the nuisance. Obtaining an estimator for the nuisance usually involves making a bias-variance trade-off that may be sub-optimal for the objective of estimating the goodness-of-fit. When the nuisance estimator retains non-negligible bias, it is possible that the bias propagates, leading to G_{\hat{P}_{n},f}(\beta) being biased as well. As a consequence, G_{\hat{P}_{n},f}(\beta) may not be asymptotically linear, and we may require more sophisticated methods to construct an n^{1/2}-consistent estimator.

One widely-used method for obtaining an asymptotically linear estimator when the initial estimator G_{\hat{P}_{n},f}(\beta) is biased is to perform a one-step bias correction (Pfanzagl, 1982). Consider the plug-in estimator \phi_{\hat{P}_{n},f}(\cdot;\beta) for the efficient influence function. The empirical average of this estimated efficient influence function serves as a first-order correction for the bias of the initial estimator. By adding this empirical average to the initial estimator, one obtains the so-called one-step estimator:

\displaystyle\tilde{G}_{n,f}(\beta)=G_{\hat{P}_{n},f}(\beta)+\frac{1}{n}\sum_{i=1}^{n}\phi_{\hat{P}_{n},f}(Z_{i};\beta).

It can be easily seen that the one-step estimator satisfies Assumption B1. In what follows, we briefly discuss what arguments one will typically use to verify Assumptions B2-B4. While we do not provide a detailed discussion here, we refer readers to a recent review by Hines et al. (2022), which provides a more in-depth explanation.

The estimation error of the one-step estimator has the exact representation,

\displaystyle\tilde{G}_{n,f}(\beta)-G_{0,f}(\beta)=\frac{1}{n}\sum_{i=1}^{n}\phi_{0,f}(Z_{i};\beta)+R^{\mathrm{i}}_{n,f}(\beta)+R^{\mathrm{ii}}_{n,f}(\beta),

where we define R^{\mathrm{i}}_{n,f}(\beta) and R^{\mathrm{ii}}_{n,f}(\beta) as

\displaystyle R^{\mathrm{i}}_{n,f}(\beta):=\frac{1}{n}\sum_{i=1}^{n}\left\{\phi_{\hat{P}_{n},f}(Z_{i};\beta)-\phi_{0,f}(Z_{i};\beta)\right\}-\int\left\{\phi_{\hat{P}_{n},f}(z;\beta)-\phi_{0,f}(z;\beta)\right\}dP_{0}(z),
\displaystyle R^{\mathrm{ii}}_{n,f}(\beta):=\left\{G_{\hat{P}_{n},f}(\beta)-G_{0,f}(\beta)\right\}+\int\phi_{\hat{P}_{n},f}(z;\beta)dP_{0}(z).

Asymptotic linearity of the one-step estimator follows if it can be established that R^{\mathrm{i}}_{n,f}(\beta) and R^{\mathrm{ii}}_{n,f}(\beta) converge to zero in probability at an n^{1/2}-rate. The first term R^{\mathrm{i}}_{n,f}(\beta) is a difference-in-differences remainder that is asymptotically negligible when \phi_{\hat{P}_{n},f}(\cdot;\beta) is consistent for \phi_{P_{0},f}(\cdot;\beta) and \phi_{\hat{P}_{n},f}(\cdot;\beta) is contained within a P_{0}-Donsker class (see Lemmas 19.24 and 19.26 of van der Vaart, 2000). The second term R^{\mathrm{ii}}_{n,f}(\beta) is a second-order remainder term, which can usually be bounded above by the squared norm of the difference between the nuisance estimator and its true value, \|Q_{\hat{P}_{n}}-Q_{P_{0}}\|_{\mathcal{Q}}^{2}. One can argue that if the nuisance estimator is n^{1/4}-consistent with respect to \|\cdot\|_{\mathcal{Q}}, then R^{\mathrm{ii}}_{n,f}(\beta)=o_{P}(n^{-1/2}). Even in a nonparametric model, there exist several approaches for constructing n^{1/4}-rate consistent nuisance estimators when one makes only mild structural assumptions on Q_{P_{0}}, such as smoothness or monotonicity (see, e.g., van de Geer, 2000; Tsybakov, 2009). To verify that Assumptions B3 and B4 hold, one can perform a similar analysis to show that the first and second derivatives of the remainder terms R^{\mathrm{i}}_{n,f}(\beta) and R^{\mathrm{ii}}_{n,f}(\beta), with respect to \beta, tend to zero at the requisite rate.
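To fix ideas, the following minimal sketch illustrates the one-step construction above and the resulting bias-corrected improvement-in-fit estimator of Section 4.2, computed over a finite collection of paths. The callables g_plugin and eif_plugin, which evaluate the plug-in goodness-of-fit G_{\hat{P}_{n},f}(\beta) and the estimated efficient influence function \phi_{\hat{P}_{n},f}(\cdot;\beta) at a given (f,\beta), are hypothetical placeholders that must be derived separately for each problem, and approximating \mathcal{F} by a finite grid of paths is likewise an implementation choice (practical concerns are discussed in Section 6).

    import numpy as np
    from scipy.optimize import minimize_scalar

    def one_step_G(beta, f, Z, g_plugin, eif_plugin):
        """One-step estimator of G_{0,f}(beta): plug-in value plus the empirical
        mean of the estimated efficient influence function."""
        correction = np.mean([eif_plugin(z, beta, f) for z in Z])
        return g_plugin(beta, f) + correction

    def improvement_in_fit(Z, paths, g_plugin, eif_plugin, bounds=(-10.0, 10.0)):
        """Bias-corrected estimate of Psi_0: for each path f, profile the one-step
        goodness-of-fit over beta and record the improvement relative to beta = 0."""
        improvements = []
        for f in paths:  # `paths` is a finite collection used to approximate F
            fit = minimize_scalar(
                lambda b: one_step_G(b, f, Z, g_plugin, eif_plugin),
                bounds=bounds, method="bounded")
            improvements.append(
                one_step_G(0.0, f, Z, g_plugin, eif_plugin) - fit.fun)
        return max(improvements)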

While we focused on one-step estimation above because we find its simplicity appealing, other strategies for constructing bias-corrected estimators, such as targeted minimum loss-based estimation, could alternatively be used. These strategies are usually also viable under a similar set of regularity conditions.

4.2 Asymptotic properties of proposed estimator

We are at this point prepared to describe the bias-corrected estimator for \Psi_{0} and its asymptotic properties. As stated in Section 3, we estimate \Psi_{0} as

\displaystyle\tilde{\Psi}_{n}=\sup_{f\in\mathcal{F}}\left\{\tilde{G}_{n,f}(0)-\tilde{G}_{n,f}(\tilde{\beta}_{n,f})\right\}, (10)

where \{\tilde{G}_{n,f}(\beta):f\in\mathcal{F},\beta\in\mathcal{B}\} is an estimator satisfying the conditions outlined in Section 4.1.

In this section, we establish weak convergence of \tilde{\Psi}_{n}. We show that \tilde{\Psi}_{n} attains a tractable limiting distribution under mild regularity conditions, but the limiting distribution and convergence rate depend on the true value of \Psi_{0}. We study two cases. First, we consider the setting in which \Psi_{0}=0, and the null hypothesis of no improvement in fit (3) holds. Second, we study the case in which \Psi_{0} is a positive constant.

Case 1: The improvement in fit is zero (\Psi_{0}=0)

Suppose that the null of no improvement in fit holds. Recall from Section 3 that when \Psi_{0}=0, \sup_{f\in\mathcal{F}}|\beta_{0,f}|=0. Also, as discussed in Section 3, by performing a Taylor expansion for \tilde{G}_{n,f} around \tilde{\beta}_{n,f}, we have

\displaystyle\tilde{\Psi}_{n}=\frac{1}{2}\sup_{f\in\mathcal{F}}\tilde{G}^{\prime\prime}_{n,f}(\check{\beta}_{n,f})\tilde{\beta}^{2}_{n,f}, (11)

for some \check{\beta}_{n,f} satisfying |\check{\beta}_{n,f}-\beta_{0,f}|\leq|\tilde{\beta}_{n,f}-\beta_{0,f}|. Under Assumption B4, we are able, in (11), to replace \tilde{G}^{\prime\prime}_{n,f}(\check{\beta}_{n,f}) with G^{\prime\prime}_{0,f}(\beta_{0,f}). This and the fact that \sup_{f\in\mathcal{F}}\tilde{\beta}_{n,f}^{2}=O_{P}(n^{-1}) allow us to write

\displaystyle\tilde{\Psi}_{n}=\frac{1}{2}\sup_{f\in\mathcal{F}}G^{\prime\prime}_{0,f}(0)\tilde{\beta}^{2}_{n,f}+o_{P}(n^{-1})=\sup_{f\in\mathcal{F}}\frac{1}{2G^{\prime\prime}_{0,f}(0)}\left[\frac{1}{n}\sum_{i=1}^{n}\phi^{\prime}_{P_{0},f}(Z_{i};0)\right]^{2}+o_{P}(n^{-1}). (12)

Thus, under the null, \tilde{\Psi}_{n} can be represented as the squared supremum of an empirical process, plus an asymptotically negligible remainder. By applying Theorem 1 in conjunction with Slutsky’s theorem and the continuous mapping theorem, we have that n\tilde{\Psi}_{n} converges weakly to \sup_{f\in\mathcal{F}}\left[\left\{2G^{\prime\prime}_{0,f}(0)\right\}^{-1/2}\mathbb{H}_{0}(f)\right]^{2}, where \mathbb{H}_{0} is the Gaussian process described in Theorem 1. The following theorem states this result formally.

Theorem 2.

Suppose that the null hypothesis of no improvement in fit (3) holds, and that the Assumptions of Theorem 1 are all satisfied. Then n\tilde{\Psi}_{n} converges weakly to \sup_{f\in\mathcal{F}}\left[\left\{2G^{\prime\prime}_{0,f}(0)\right\}^{-1/2}\mathbb{H}_{0}(f)\right]^{2}.

We can apply Theorem 2 to obtain an approximation to the sampling distribution of \tilde{\Psi}_{n} under the null of zero improvement in fit, making it easy for us to perform a hypothesis test. In particular, Theorem 2 implies that a test which rejects the null when n\tilde{\Psi}_{n} is larger than the (1-\alpha) quantile of the distribution of \sup_{f\in\mathcal{F}}\left[\left\{2G^{\prime\prime}_{0,f}(0)\right\}^{-1/2}\mathbb{H}_{0}(f)\right]^{2} will achieve type-1 error control at the \alpha-level in the limit of large n.
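As a simple illustration of how Theorem 2 can be used, the sketch below approximates the null quantile by simulating mean-zero Gaussian vectors with an estimated covariance over a finite grid of m paths. This is a plain Monte Carlo stand-in rather than the multiplier bootstrap developed in Section 5, and the inputs Sigma_hat (an m-by-m estimate of \Sigma_{0} with all \beta_{0,f}=0) and gpp_hat (length-m estimates of G^{\prime\prime}_{0,f}(0)) are assumed to be available from fitted nuisance quantities.

    import numpy as np

    def null_quantile(Sigma_hat, gpp_hat, alpha=0.05, n_draws=10000, seed=0):
        """Approximate the (1 - alpha) quantile of the null limit in Theorem 2 by
        simulating the Gaussian process H_0 on a finite grid of m paths."""
        rng = np.random.default_rng(seed)
        gpp_hat = np.asarray(gpp_hat)
        draws = rng.multivariate_normal(np.zeros(len(gpp_hat)), Sigma_hat, size=n_draws)
        stats = np.max(draws ** 2 / (2.0 * gpp_hat), axis=1)  # sup_f [{2 G''}^{-1/2} H_0(f)]^2
        return np.quantile(stats, 1.0 - alpha)

    # Reject H_0 at level alpha when n * Psi_tilde_n exceeds null_quantile(Sigma_hat, gpp_hat, alpha).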

We used two key ingredients to construct an improvement in fit estimator with parametric rate convergence under the null. First, we found it useful to represent the difference between the full and reduced models for the function-valued parameter of interest as the union of many one-dimensional parametric sub-models. We have deduced that, under the null, the asymptotic behavior of an improvement in fit estimator is determined in large part by the complexity of the collection of paths along which the estimated goodness-of-fit minimizer over the full model can possibly approach the minimizer over the reduced model. We found it necessary to constrain this complexity to ensure that the improvement in fit estimator converges sufficiently quickly. In our regime, this can be easily achieved by restricting the size of \mathcal{F}. We expect that if one were to assume a different form for \Theta\setminus\Theta^{*}, one would still need to impose a constraint that plays a similar role in order to obtain an n-rate consistent estimator. Second, efficient estimation of the improvement in fit along any sub-model is needed. In settings where the reduced model is infinite-dimensional, estimation of the goodness-of-fit minimizer over the reduced model can generate bias for the improvement in fit estimator and reduce its convergence rate. Fortunately, an efficient estimator can be obtained using standard techniques for bias correction.

Case 2: The improvement in fit is bounded away from zero (\Psi_{0}>0)

Now, consider the setting where \Psi_{0} is a positive constant. Let f_{0} and f_{n} be functions that satisfy

\displaystyle G_{0,f_{0}}(0)-G_{0,f_{0}}(\beta_{0,f_{0}})=\sup_{f\in\mathcal{F}}\left\{G_{0,f}(0)-G_{0,f}(\beta_{0,f})\right\},\quad\tilde{G}_{n,f_{n}}(0)-\tilde{G}_{n,f_{n}}(\tilde{\beta}_{n,f_{n}})=\sup_{f\in\mathcal{F}}\left\{\tilde{G}_{n,f}(0)-\tilde{G}_{n,f}(\tilde{\beta}_{n,f})\right\}. (13)

We can express the estimation error of \tilde{\Psi}_{n} as

\displaystyle\tilde{\Psi}_{n}-\Psi_{0}=\left\{\sup_{f_{1}\in\mathcal{F}}\tilde{G}_{n,f_{1}}(0)-\tilde{G}_{n,f_{1}}(\tilde{\beta}_{n,f_{1}})\right\}-\left\{\sup_{f_{2}\in\mathcal{F}}G_{0,f_{2}}(0)-G_{0,f_{2}}(\beta_{0,f_{2}})\right\}
\displaystyle=\left\{\tilde{G}_{n,f_{n}}(0)-\tilde{G}_{n,f_{n}}(\tilde{\beta}_{n,f_{n}})\right\}-\left\{G_{0,f_{0}}(0)-G_{0,f_{0}}(\beta_{0,f_{0}})\right\}.

One might expect that f_{n} should approach f_{0} as n grows, so \tilde{G}_{n,f_{n}}(\tilde{\beta}_{n,f_{n}}) should behave similarly to \tilde{G}_{n,f_{0}}(\beta_{0,f_{0}}). In fact, if one could establish that

\displaystyle\left|\left\{\tilde{G}_{n,f_{0}}(0)-\tilde{G}_{n,f_{0}}(\beta_{0,f_{0}})\right\}-\left\{\tilde{G}_{n,f_{n}}(0)-\tilde{G}_{n,f_{n}}(\tilde{\beta}_{n,f_{n}})\right\}\right|=o_{P}(n^{-1/2}), (14)

then one could conclude that \tilde{\Psi}_{n} is asymptotically linear with influence function z\mapsto\{\phi_{P_{0},f_{0}}(z;0)-\phi_{P_{0},f_{0}}(z;\beta_{0,f_{0}})\} under Assumption B2.

The remainder term in (14) is o_{P}(n^{-1/2}) under mild assumptions. Because \tilde{G}_{n,f_{0}}(0)-\tilde{G}_{n,f_{n}}(0) is zero under Assumption B1, it only needs to be shown that \tilde{G}_{n,f_{0}}(\beta_{0,f_{0}})-\tilde{G}_{n,f_{n}}(\tilde{\beta}_{n,f_{n}}) is asymptotically negligible. Because the goodness-of-fit estimator is asymptotically linear, \tilde{G}_{n,f_{0}}(\beta_{0,f_{0}})-\tilde{G}_{n,f_{n}}(\tilde{\beta}_{n,f_{n}}) is approximately equal to G_{0,f_{0}}(\beta_{0,f_{0}})-G_{0,f_{n}}(\beta_{0,f_{n}}), which is commonly referred to as the excess risk in the literature on M-estimation (van de Geer, 2000). Thus, in essence, one can verify (14) by showing that the excess risk converges to zero in probability at an n^{1/2}-rate. This can be done using standard arguments from the M-estimation literature. The following result provides explicit conditions under which \tilde{\Psi}_{n} is an asymptotically linear estimator for \Psi_{0}.

Theorem 3.

Suppose that the improvement in fit is positive, i.e., \Psi_{0}>0. Suppose further that Assumptions A1, A5, B1, and B2 hold, and that there exists a sequence d_{n}=o(n^{1/2-\delta}) for some \delta>0 such that

\displaystyle\sup_{\{(f,\beta):G_{0,f}(\beta)-G_{0,f_{0}}(\beta_{0,f_{0}})\leq d_{n}\}}\left[\int\left\{\phi_{P_{0},f}(z;\beta)-\phi_{P_{0},f_{0}}(z;\beta_{0,f_{0}})\right\}^{2}dP_{0}(z)\right]^{1/2}=o(1). (15)

Then $\tilde{\Psi}_{n}$ is an asymptotically linear estimator for $\Psi_{0}$ with influence function

\displaystyle z\mapsto\phi_{P_{0},f_{0}}(z;0)-\phi_{P_{0},f_{0}}(z;\beta_{0,f_{0}}).

An important consequence of Theorem 3 is that $\tilde{\Psi}_{n}$ is asymptotically efficient in a nonparametric model, and hence performs as well as the plug-in estimator described in Section 3 when $\Psi_{0}>0$. The assumption in (15) is a type of smoothness condition that is commonly assumed in the literature on estimation in high-dimensional and nonparametric models (see, e.g., van de Geer, 2008; Negahban et al., 2012; Bibaut and van der Laan, 2019). The condition ensures that $\phi_{P_{0},f}(z;\beta)$ and $\phi_{P_{0},f_{0}}(z;\beta_{0,f_{0}})$ are close in $L_{2}(P_{0})$ distance when $G_{0,f}(\beta)-G_{0,f_{0}}(\beta_{0,f_{0}})$ is small.

Some conditions that are needed by Theorem 2 are not needed by Theorem 3. Notably, it is not necessary for $\sup_{f\in\mathcal{F}}|G^{\prime}_{0,f}(\beta_{0,f})|$ to be zero. This means that $\mathcal{B}$ can be mis-specified in the sense that the interval is too small, and along any sub-model $\theta^{*}_{P,f}$, there can exist a candidate that achieves a better fit than $\theta^{*}_{P,f}(\cdot;\beta_{0,f})$. In other words, we allow there to be $\beta_{0,f}^{*}\in\mathbb{R}\setminus\mathcal{B}$ for which $G_{0,f}(\beta_{0,f})>G_{0,f}(\beta^{*}_{0,f})$. Even then, $\tilde{\Psi}_{n}$ remains an asymptotically linear estimator for $\sup_{f\in\mathcal{F},\beta\in\mathcal{B}}\{G_{0,f}(0)-G_{0,f}(\beta)\}$.

5 Construction of tests and intervals for the improvement in fit

In this section, we propose strategies for testing and confidence set construction for the improvement in fit. Our approach uses a computationally efficient bootstrap algorithm, which we describe in detail below.

We also provide theoretical results that establish the validity of our proposed bootstrap method. Before proceeding, it is helpful to first state the regularity conditions upon which our results rely.

  • Assumption C1: The function class $\left\{\left[G^{\prime\prime}_{P,f}(0)\right]^{-1/2}\phi^{\prime}_{P,f}(\cdot;0):f\in\mathcal{F}\right\}$ depends on $P$ only through a nuisance $Q$, which takes values in a space $\mathcal{Q}$ endowed with norm $\|\cdot\|_{\mathcal{Q}}$, and our estimator $Q_{\hat{P}_{n}}$ satisfies $\|Q_{\hat{P}_{n}}-Q_{P_{0}}\|_{\mathcal{Q}}=o_{P}(1)$.

  • Assumption C2: As $\|Q_{P}-Q_{P_{0}}\|_{\mathcal{Q}}$ approaches zero, both $\sup_{f\in\mathcal{F}}\int\left\{\phi_{P,f}^{\prime}(z;0)-\phi^{\prime}_{P_{0},f}(z;0)\right\}^{2}dP_{0}(z)$ and $\sup_{f\in\mathcal{F}}|G^{\prime\prime}_{P,f}(0)-G_{P_{0},f}^{\prime\prime}(0)|$ tend to zero as well.

  • Assumption C3: There exist $\delta_{1},\delta_{2}>0$ such that the function classes

    \displaystyle\Phi_{\delta_{1}}:=\left\{\phi_{P,f}(\cdot;0)-\phi_{P,f}(\cdot;\beta):f\in\mathcal{F},\beta\in\mathcal{B},\|Q_{P}-Q_{P_{0}}\|_{\mathcal{Q}}<\delta_{1}\right\},
    \displaystyle\Phi^{\prime}_{\delta_{2}}:=\left\{\phi^{\prime}_{P,f}(\cdot;0):f\in\mathcal{F},\|Q_{P}-Q_{P_{0}}\|_{\mathcal{Q}}<\delta_{2}\right\},

    are $P_{0}$-Donsker, with finite squared envelope function and finite bracketing integral (see, e.g., Chapter 19 of van der Vaart, 2000), and

    \displaystyle\inf\left\{G^{\prime\prime}_{P,f}(0):f\in\mathcal{F},\|Q_{P}-Q_{P_{0}}\|_{\mathcal{Q}}<\delta_{2}\right\}>0.

Assumption C1 states that, for our bootstrap methods to be viable, estimation of the entire probability distribution is not needed, and it is sufficient to only estimate nuisance parameters upon which the efficient influence function depends. Recall that we made a similar assumption when we described construction of asymptotically linear estimators in Section 4.1. Assumption C2 states that when we estimate the nuisance components consistently, the plug-in estimator for the efficient influence function is consistent as well. Assumption C3 states that our efficient influence function estimator belongs to a function class that is not overly complex, with probability tending to one.

5.1 Approximation of the null limiting distribution

To perform a test of the hypothesis of no improvement in fit, we need an approximation for the asymptotic cumulative distribution function of $\tilde{\Psi}_{n}$ under the null. While we are able to characterize the null limiting distribution of $\tilde{\Psi}_{n}$ using Theorem 2, it is possible that a closed form expression for the distribution function is not available. However, we can use resampling techniques to obtain an approximation.

We approximate the null limiting distribution of $\tilde{\Psi}_{n}$ using the multiplier bootstrap method proposed by Hudson et al. (2021). The multiplier bootstrap is a computationally efficient method for approximating the sampling distribution of estimators that can be represented as a functional of a well-behaved empirical process, plus a negligible remainder. Such an approach is applicable in our setting because $\tilde{\Psi}_{n}$ has such a representation (see (12)).

For $m=1,2,\ldots,M$ and $M$ large, let $\boldsymbol{\xi}_{m}=(\xi_{1,m},\ldots,\xi_{n,m})$ be an $n$-dimensional vector of independent Rademacher random variables, drawn independently of $Z_{1},\ldots,Z_{n}$. We define the multiplier bootstrap statistic

\displaystyle T_{n,m}^{\xi}:=\sup_{f\in\mathcal{F}}\frac{1}{2G^{\prime\prime}_{\hat{P}_{n},f}(0)}\left\{\frac{1}{n}\sum_{i=1}^{n}\xi_{i,m}\phi^{\prime}_{\hat{P}_{n},f}(Z_{i};0)\right\}^{2}, (16)

as an approximate draw from the null limiting distribution of $\tilde{\Psi}_{n}$.

For a realization $t$ of $n\tilde{\Psi}_{n}$, let

\displaystyle\rho_{0}(t):=P_{0}\left(\sup_{f\in\mathcal{F}}\frac{1}{2G^{\prime\prime}_{0,f}(0)}\mathbb{H}^{2}(f)>t\right)=\lim_{n\to\infty}P_{0}(\tilde{\Psi}_{n}>n^{-1}t), (17)

denote the p-value for a test of no improvement in fit, based on the limiting distribution of $\tilde{\Psi}_{n}$. Given a large sample of multiplier bootstrap statistics, one can approximate the p-value as

\displaystyle\rho_{M,n}(t):=\frac{1}{M}\sum_{m=1}^{M}\mathds{1}\left(T^{\xi}_{n,m}>n^{-1}t\right).
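In code, the bootstrap p-value reduces to a few lines once the statistic in (16) can be evaluated. The sketch below assumes a user-supplied function, here called boot_stat, that returns $T_{n,m}^{\xi}$ for a given vector of multipliers (for instance, by solving the quadratic program described in Section 6.2); the function names are illustrative.

# A minimal sketch, assuming boot_stat(xi) returns T_{n,m}^xi in (16) for a
# given vector of Rademacher multipliers, and psi_tilde is the improvement in
# fit estimate, so that the test statistic is t = n * psi_tilde.
multiplier_bootstrap_pvalue <- function(n, M, boot_stat, psi_tilde) {
  T_boot <- replicate(M, {
    xi <- sample(c(-1, 1), size = n, replace = TRUE)  # Rademacher multipliers
    boot_stat(xi)
  })
  # rho_{M,n}(t) with t = n * psi_tilde, i.e., compare T_boot to t / n
  mean(T_boot > psi_tilde)
}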

The following result due to Hudson et al. (2021) provides conditions under which the bootstrap approximation of the limiting distribution is asymptotically valid, and use of the bootstrap p-value is appropriate.

Theorem 4.

Let $\xi_{1},\xi_{2},\ldots,\xi_{n}$ be independent Rademacher random variables, also independent of $Z_{1},\ldots,Z_{n}$, and let $T_{n}^{\xi}=\sup_{f\in\mathcal{F}}\frac{1}{2G^{\prime\prime}_{\hat{P}_{n},f}(0)}\left\{\frac{1}{n}\sum_{i=1}^{n}\xi_{i}\phi^{\prime}_{\hat{P}_{n},f}(Z_{i};0)\right\}^{2}$. Under Assumptions C1 through C3, $nT_{n}^{\xi}$ converges weakly to $\sup_{f\in\mathcal{F}}\left[\left\{2G^{\prime\prime}_{0,f}(0)\right\}^{-1/2}\mathbb{H}(f)\right]^{2}$, conditional upon $Z_{1},\ldots,Z_{n}$, in outer probability.

5.2 Interval construction for $\Psi_{0}$

In this section, we present a method for constructing a confidence interval for $\Psi_{0}$. The standard approach of interval construction based on a Gaussian approximation of the sampling distribution of an estimator is inadvisable because $\tilde{\Psi}_{n}$ is only asymptotically Gaussian when $\Psi_{0}$ is bounded away from zero. We show that this issue can be overcome by instead constructing a confidence interval via hypothesis test inversion.

Suppose that one could perform a level $\alpha$ test of the hypothesis $H:\Psi_{0}=\psi$ for any $\psi\geq 0$. Then the set

\displaystyle\mathcal{C}_{n}^{1-\alpha}:=\left\{\psi\geq 0:\text{we fail to reject }\Psi_{0}=\psi\text{ based on }Z_{1},\ldots,Z_{n}\right\},

would be a $100(1-\alpha)\%$ confidence interval for $\Psi_{0}$. That is, in the limit of large $n$, $\mathcal{C}_{n}^{1-\alpha}$ would contain $\Psi_{0}$ with probability at least $(1-\alpha)$.

We construct a test of $\Psi_{0}=\psi$ using the test statistic $S_{n}(\psi):=|\tilde{\Psi}_{n}-\psi|$. Let $s^{1-\alpha}_{n}$ be an approximation for the $(1-\alpha)$ quantile of the limiting distribution of $|\tilde{\Psi}_{n}-\Psi_{0}|$. For a suitable $s^{1-\alpha}_{n}$, a test that rejects the null when $S_{n}(\psi)$ exceeds $s^{1-\alpha}_{n}$ achieves asymptotic type-1 error control at level $\alpha$. Moreover, an asymptotically valid confidence set can be obtained by setting

\displaystyle\mathcal{C}_{n}^{1-\alpha}=\left\{\psi\geq 0:S_{n}(\psi)\leq s^{1-\alpha}_{n}\right\}=\left[\max\left(0,\tilde{\Psi}_{n}-s_{n}^{1-\alpha}\right),\tilde{\Psi}_{n}+s_{n}^{1-\alpha}\right].

It is not immediately obvious how to select $s_{n}^{1-\alpha}$ because the limiting distribution and convergence rate of $|\tilde{\Psi}_{n}-\Psi_{0}|$ depend on whether $\Psi_{0}=0$. To address this concern, in what follows, we present a multiplier bootstrap approximation of the limiting distribution that adapts to the unknown value of $\Psi_{0}$.

Let $\pi_{n}$ be any random sequence that converges to one in probability when $\Psi_{0}=0$, and converges to zero in probability when $\Psi_{0}>0$. For instance, we can set

\displaystyle\pi_{n}=\rho_{M,n}\left(\frac{n}{\log(n)}\tilde{\Psi}_{n}\right), (18)

where $\rho_{M,n}$ is the multiplier bootstrap p-value. That this choice of $\pi_{n}$ is valid follows from the fact that $\tilde{\Psi}_{n}$ is consistent for $\Psi_{0}$ and $n$-rate convergent under the null. Now, similarly to Section 5.1, for $m=1,\ldots,M$ and $M$ large, we generate a pair of random variables as follows. The first random variable is $T_{n,m}^{\xi}$ in (16), which is a multiplier bootstrap approximation of a draw from the limiting distribution of $|\tilde{\Psi}_{n}-\Psi_{0}|$ in the setting where $\Psi_{0}=0$. We take the second random variable as a multiplier bootstrap approximation of a draw from the limiting distribution of $|\tilde{\Psi}_{n}-\Psi_{0}|$ when $\Psi_{0}>0$. Specifically, we define this second random variable as

\displaystyle U^{\xi}_{n,m}:=\left|\frac{1}{n}\sum_{i=1}^{n}\xi_{i,m}\left\{\phi_{\hat{P}_{n},f_{n}}(Z_{i};0)-\phi_{\hat{P}_{n},f_{n}}(Z_{i};\tilde{\beta}_{n,f_{n}})\right\}\right|, (19)

where $\boldsymbol{\xi}_{m}$ is the vector of Rademacher random variables defined in Section 5.1 (the same vector may be used to construct $T_{n,m}^{\xi}$ and $U_{n,m}^{\xi}$), and $f_{n}$ is as defined in (13). Finally, we take an approximate draw from the sampling distribution of $|\tilde{\Psi}_{n}-\Psi_{0}|$ as $V_{n,m}^{\xi}$, where

\displaystyle V_{n,m}^{\xi}:=\pi_{n}T^{\xi}_{n,m}+(1-\pi_{n})U^{\xi}_{n,m},

and we set $s_{n}^{1-\alpha}$ as the $(1-\alpha)$ quantile of $(V_{n,1}^{\xi},\ldots,V_{n,M}^{\xi})$.
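The construction above translates directly into a short resampling routine. The sketch below is a minimal illustration, assuming user-supplied functions boot_T and boot_U that return $T_{n,m}^{\xi}$ and $U_{n,m}^{\xi}$ for a given multiplier vector, and an estimate psi_tilde of $\Psi_0$; these names are placeholders rather than part of our formal proposal.

# A minimal sketch of the adaptive interval of Section 5.2.
adaptive_interval <- function(n, M, boot_T, boot_U, psi_tilde, alpha = 0.05) {
  xi_mat <- matrix(sample(c(-1, 1), n * M, replace = TRUE), nrow = n)
  T_boot <- apply(xi_mat, 2, boot_T)
  U_boot <- apply(xi_mat, 2, boot_U)            # same multipliers for T and U
  pi_n   <- mean(T_boot > psi_tilde / log(n))   # pi_n in (18)
  V_boot <- pi_n * T_boot + (1 - pi_n) * U_boot
  s      <- unname(quantile(V_boot, probs = 1 - alpha))
  c(lower = max(0, psi_tilde - s), upper = psi_tilde + s)
}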

Because $\pi_{n}$ converges to one when $\Psi_{0}$ is zero and approaches zero when $\Psi_{0}$ is positive, $V_{n,m}^{\xi}$ adaptively identifies whether $T^{\xi}_{n,m}$ or $U^{\xi}_{n,m}$ is a more appropriate approximation of a draw from the sampling distribution of $|\tilde{\Psi}_{n}-\Psi_{0}|$. The following result states that $V_{n,m}^{\xi}$ is an asymptotically valid approximation regardless of whether $\Psi_{0}$ is zero or nonzero, thereby justifying our selection of $s_{n}^{1-\alpha}$.

Theorem 5.

Let $\xi_{1},\xi_{2},\ldots,\xi_{n}$ be independent Rademacher random variables, also independent of $Z_{1},\ldots,Z_{n}$. Let $\pi_{n}$ be a random sequence that converges to one in probability when $\Psi_{0}=0$ and converges to zero in probability when $\Psi_{0}>0$. Let $T_{n}^{\xi}=\sup_{f\in\mathcal{F}}\frac{1}{2G^{\prime\prime}_{\hat{P}_{n},f}(0)}\left\{\frac{1}{n}\sum_{i=1}^{n}\xi_{i}\phi^{\prime}_{\hat{P}_{n},f}(Z_{i};0)\right\}^{2}$, let $U_{n}^{\xi}=\left|\frac{1}{n}\sum_{i=1}^{n}\xi_{i}\left\{\phi_{\hat{P}_{n},f_{n}}(Z_{i};0)-\phi_{\hat{P}_{n},f_{n}}(Z_{i};\tilde{\beta}_{n,f_{n}})\right\}\right|$, and let $V_{n}^{\xi}=\pi_{n}T_{n}^{\xi}+(1-\pi_{n})U_{n}^{\xi}$. Let $\mathbb{I}$ be a mean zero Gaussian random variable with variance $E_{0}[\{\phi_{P_{0},f_{0}}(Z;0)-\phi_{P_{0},f_{0}}(Z;\beta_{0,f_{0}})\}^{2}]$, with $f_{0}$ defined in (13). Suppose that Assumptions C1-C3 are met. Then when $\Psi_{0}=0$ and the conditions of Theorem 2 hold, $nV_{n}^{\xi}$ converges weakly to $\sup_{f\in\mathcal{F}}\left[\left\{2G^{\prime\prime}_{0,f}(0)\right\}^{-1/2}\mathbb{H}(f)\right]^{2}$, conditional upon $Z_{1},\ldots,Z_{n}$, in outer probability. When $\Psi_{0}>0$ and the conditions of Theorem 3 hold, $n^{1/2}V_{n}^{\xi}$ converges weakly to $|\mathbb{I}|$, conditional upon $Z_{1},\ldots,Z_{n}$, in outer probability.

6 Implementation

In this section we discuss implementation of our proposed method for inference on the improvement in fit. First, we describe how to construct a model for $\Theta\setminus\Theta^{*}$. We subsequently discuss how to calculate the improvement in fit estimator and how to implement our proposed bootstrap procedures for testing the null of no improvement in fit and constructing confidence sets.

6.1 Constructing the collection of parametric sub-models

We propose to construct $\mathcal{F}$ as a space of linear combinations of basis functions from $\mathcal{O}$ to $\mathbb{R}$, where the coefficients for the basis functions are required to satisfy a constraint that induces structure on the function class. Let $\mathcal{H}=h_{1}\oplus h_{2}\oplus\cdots$ be a vector space defined as the span of basis functions $h_{1},h_{2},\ldots$ from $\mathcal{O}$ to $\mathbb{R}$. Let $\Gamma$ be a functional on $\mathcal{H}$ that measures the complexity of any function in $\mathcal{H}$, with larger values corresponding to greater complexity. We set $\mathcal{F}$ to have bounded complexity. Additionally, we impose a constraint that $\inf_{f\in\mathcal{F}}G^{\prime\prime}_{0,f}(0)$ is bounded away from zero. In view of Assumption A4, such a constraint is needed in order for us to establish weak convergence of our proposed improvement in fit estimator under the null. Finally, we set $\mathcal{F}=\mathcal{F}_{\lambda}$, where we define

\displaystyle\mathcal{F}_{\lambda}:=\left\{f=\sum_{j=1}^{\infty}a_{j}h_{j}:a_{1},a_{2},\ldots\in\mathbb{R},\ \frac{\Gamma(f)}{G^{\prime\prime}_{0,f}(0)}\leq\lambda\right\}, (20)

and $\lambda$ is a tuning parameter. In practice, we recommend truncating the basis at a large level $J$ to facilitate computation.

As an example, one could construct $\mathcal{F}$ using a reproducing kernel Hilbert space (RKHS). Let $\kappa:\mathcal{O}\times\mathcal{O}\to\mathbb{R}$ be a positive definite kernel function, and let $\mathcal{S}_{\kappa}$ denote its unique reproducing kernel Hilbert space, endowed with inner product $\langle\cdot,\cdot\rangle_{\kappa}$. One can select the basis functions $h_{1},h_{2},\ldots$ as the eigenfunctions of $\kappa$ with respect to the RKHS inner product, and we denote the corresponding eigenvalues by $\gamma_{1}\leq\gamma_{2}\leq\ldots$. The complexity of any function $s:o\mapsto\sum_{j=1}^{\infty}a_{j}h_{j}(o)$ can be measured by its RKHS norm $\Gamma(s)=\langle s,s\rangle_{\kappa}=\sum_{j}\left(\frac{a_{j}}{\gamma_{j}}\right)^{2}$. The RKHS norm is a measure of smoothness, with higher values corresponding to lesser smoothness. Reproducing kernel Hilbert spaces are appealing because they are flexible and contain close approximations of smooth functions (Micchelli et al., 2006). Moreover, the fact that the RKHS norm is available in quadratic form in the coefficients simplifies computation. Alternative approaches, such as constructing $\mathcal{H}$ using a spline basis and setting $\Gamma$ as a variation norm, are commonly used in nonparametric regression problems and could also be considered (see, e.g., Tibshirani et al., 2005; Benkeser and van der Laan, 2016).
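For concreteness, the sketch below illustrates one simple finite-dimensional choice that preserves the quadratic-form property. Rather than the eigenfunction basis described above, it takes a truncated basis of kernel sections $\kappa(\cdot,o_1),\ldots,\kappa(\cdot,o_J)$ at a set of reference points, in which case the RKHS norm of $f=\sum_{j}a_{j}\kappa(\cdot,o_{j})$ is $\mathbf{a}^{\top}\mathbf{K}\mathbf{a}$, with $\mathbf{K}$ the Gram matrix; the Gaussian kernel and the bandwidth sigma below are illustrative choices, not part of our formal proposal.

# A minimal sketch: a Gaussian-kernel Gram matrix over reference points O
# (a J x d matrix). When the basis functions are the kernel sections
# kappa(., O[j, ]), this matrix plays the role of the complexity matrix L
# introduced in Section 6.2.
gaussian_kernel <- function(o1, o2, sigma = 1) {
  exp(-sum((o1 - o2)^2) / (2 * sigma^2))
}

gram_matrix <- function(O, sigma = 1) {
  J <- nrow(O)
  K <- matrix(0, J, J)
  for (i in seq_len(J)) {
    for (j in seq_len(J)) {
      K[i, j] <- gaussian_kernel(O[i, ], O[j, ], sigma)
    }
  }
  K
}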

We now discuss specification of the interval $\mathcal{B}$. The choice of $\mathcal{B}$ does not affect the null limiting distribution but may affect the limiting distribution when $\Psi_{0}>0$. The main role of $\mathcal{B}$ is to regularize $\sup_{f\in\mathcal{F}}|\tilde{\beta}_{n,f}|$ to ensure that the variance of the estimator is well-controlled. Recall from our discussion of Theorem 3 in Section 4.2 that $\beta_{0,f}$, which is defined to be the minimizer of $G_{P_{0},f}$ over $\mathcal{B}$, does not need to be the global minimizer over $\mathbb{R}$; our results show that $\tilde{\Psi}_{n}$ has a well-behaved limiting distribution regardless. We treat the width of the interval as an additional tuning parameter.

We find that in some settings, $\tilde{\Psi}_{n}$ can retain good asymptotic behavior under the alternative even when $\mathcal{B}$ is taken to be an interval of arbitrary width. In view of Theorem 1, the variance of $\tilde{\beta}_{n,f}$ has an inverse relationship with $G^{\prime\prime}_{0,f}(\beta_{0,f})$. Therefore, constructing $\mathcal{F}$ to only include functions for which $G^{\prime\prime}_{0,f}(\beta_{0,f})$ is bounded from below also serves to regularize $\sup_{f\in\mathcal{F}}|\tilde{\beta}_{n,f}|$. In particular, when $G_{0,f}$ is a quadratic function, and $G^{\prime\prime}_{0,f}$ is a constant function, it is sufficient to ensure that $G^{\prime\prime}_{0,f}(0)$ is bounded away from zero. Because this constraint is already incorporated into $\mathcal{F}$ with the above specification, constraining the width of $\mathcal{B}$ is unnecessary in such instances.

We recommend selecting the tuning parameters $\lambda$ and $\mathcal{B}$ (when needed) by performing cross-validation with respect to the loss $f\mapsto\tilde{G}_{n,f}(\tilde{\beta}_{n,f})$. We note that while our asymptotic results implicitly assume that $\mathcal{F}$ is pre-specified, it is argued in Hudson et al. (2021) that one can select $\mathcal{F}$ data-adaptively without compromising type-1 error control, as long as the adaptive choice converges to a fixed class. In some settings, e.g., when the sample size is small, it is possible that the data-adaptive choice is moderately or highly variable, and that failure to account for this variability could lead to type-1 error inflation. One can avoid this issue by using a more conservative sample splitting approach, wherein one partition of the data is used for tuning parameter selection, and a second independent partition is used to estimate $\Psi_{0}$.

6.2 Computation

We now discuss how to calculate the improvement in fit estimator $\tilde{\Psi}_{n}$ and how to implement the multiplier bootstrap for hypothesis testing and confidence interval construction.

Calculating $\tilde{\Psi}_{n}$ requires us to solve the optimization problem in (10). When we use the specification of $\mathcal{F}$ in Section 6.1, it is possible for this problem to be non-convex, and it can be particularly challenging to solve when a closed form solution for $\tilde{\beta}_{n,f}$ is not available. We find, however, that when $\tilde{G}_{n,f}$ is a quadratic function of $\beta$, a computationally efficient solution is available. In Examples 1 and 3 in Section 2.2, $G_{P,f}$ is a quadratic function when one considers a sub-model of the form $\theta^{*}_{P,f}(\cdot;\beta)=\theta^{*}_{P}(\cdot)+\beta f(\cdot)$, so this special case captures at least some examples. In what follows, we present an approach for solving the problem in the setting where $\tilde{G}_{n,f}$ is a quadratic function of $\beta$. We describe a more general method in the Supplementary Materials.

Suppose that for any $f=\sum_{j=1}^{J}a_{j}h_{j}$ and $\mathbf{a}=(a_{1},\ldots,a_{J})$, there exists a $J\times J$-dimensional matrix $\mathbf{H}_{1}$ and a $J$-dimensional vector $\mathbf{H}_{2}$ such that

\displaystyle\tilde{G}_{n,f}(\beta)=\frac{\beta^{2}}{2}\mathbf{a}^{\top}\mathbf{H}_{1}\mathbf{a}-\beta\mathbf{H}_{2}^{\top}\mathbf{a}+\text{const},

where "const" refers to a constant that depends neither on $\mathbf{a}$ nor $\beta$. It can easily be seen that $\tilde{\beta}_{n,f}$ has the exact representation

\displaystyle\tilde{\beta}_{n,f}=\frac{\mathbf{H}^{\top}_{2}\mathbf{a}}{\mathbf{a}^{\top}\mathbf{H}_{1}\mathbf{a}}.

Additionally, the second derivative estimator satisfies $\tilde{G}^{\prime\prime}_{n,f}(\beta)=\mathbf{a}^{\top}\mathbf{H}_{1}\mathbf{a}$ for all $\beta$. Now, $\tilde{G}_{n,f}(0)-\tilde{G}_{n,f}(\tilde{\beta}_{n,f})$ can be expressed as

\displaystyle 2\left\{\tilde{G}_{n,f}(0)-\tilde{G}_{n,f}(\tilde{\beta}_{n,f})\right\}=\frac{\mathbf{a}^{\top}\mathbf{H}_{2}\mathbf{H}_{2}^{\top}\mathbf{a}}{\mathbf{a}^{\top}\mathbf{H}_{1}\mathbf{a}}.

Suppose now that $\Gamma(f)$ is available in quadratic form in the coefficients of the basis functions -- that is, $\Gamma(f)=\mathbf{a}^{\top}\mathbf{L}\mathbf{a}$ for a $J\times J$ matrix $\mathbf{L}$. Using the above representation for $\tilde{G}_{n,f}$, we can express $\tilde{\Psi}_{n}$ as

\displaystyle 2\tilde{\Psi}_{n}=\max_{\mathbf{a}}\left\{\frac{\mathbf{a}^{\top}\mathbf{H}_{2}\mathbf{H}_{2}^{\top}\mathbf{a}}{\mathbf{a}^{\top}\mathbf{H}_{1}\mathbf{a}}:\frac{\mathbf{a}^{\top}\mathbf{L}\mathbf{a}}{\mathbf{a}^{\top}\mathbf{H}_{1}\mathbf{a}}\leq\lambda\right\}=\max_{\mathbf{a}}\left\{\mathbf{a}^{\top}\mathbf{H}_{2}\mathbf{H}_{2}^{\top}\mathbf{a}:\mathbf{a}^{\top}\mathbf{L}\mathbf{a}\leq\lambda,\ \mathbf{a}^{\top}\mathbf{H}_{1}\mathbf{a}\leq 1\right\}. (21)

The optimization problem in (21) is a quadratically constrained quadratic program (QCQP) and can be solved using publicly available software, such as the CVXR package in R (Fu et al., 2017).
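Although the objective in (21) is a convex quadratic being maximized, it is rank-one: $\mathbf{a}^{\top}\mathbf{H}_{2}\mathbf{H}_{2}^{\top}\mathbf{a}=(\mathbf{H}_{2}^{\top}\mathbf{a})^{2}$, and the constraint set is symmetric under $\mathbf{a}\mapsto-\mathbf{a}$. One can therefore maximize the linear functional $\mathbf{H}_{2}^{\top}\mathbf{a}$ and square the optimum, which is a disciplined convex program. The sketch below illustrates this reformulation with CVXR; it is a minimal illustration under the assumption that $\mathbf{H}_{1}$ and $\mathbf{L}$ are positive semi-definite, not a full implementation.

# A minimal sketch: solve (21) by maximizing H2' a over the constraint set
# and squaring the optimal value.
library(CVXR)

solve_improvement_qcqp <- function(H1, H2, L, lambda) {
  J <- length(H2)
  a <- Variable(J)
  prob <- Problem(
    Maximize(t(H2) %*% a),
    list(quad_form(a, L) <= lambda, quad_form(a, H1) <= 1)
  )
  res <- solve(prob)
  list(value = res$value^2,                 # equals 2 * Psi_tilde_n in (21)
       a     = as.vector(res$getValue(a)))
}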

Multiplier bootstrap samples can be calculated using a similar method. We first observe that when we use the specification of $\theta^{*}_{P,f}(\cdot;\beta)$ in (6), the Riesz representation theorem implies that $G^{\prime}_{0,f}(0)$ is a linear functional of $f$. Consequently, the efficient influence function $\phi^{\prime}_{P_{0},f}(\cdot;0)$ is also linear in $f$. Therefore, for any $f=\sum a_{j}h_{j}$, we have

\displaystyle\phi^{\prime}_{P_{0},f}(z;0)=\sum_{j=1}^{J}a_{j}\phi^{\prime}_{P_{0},h_{j}}(z;0).

Now, let $\boldsymbol{\Phi}$ be an $n\times J$ matrix with element $(i,j)$ given by $\phi^{\prime}_{\hat{P}_{n},h_{j}}(Z_{i};0)$, and let $\boldsymbol{\xi}_{m}$ be an $n$-dimensional vector of Rademacher random variables, as in Sections 5.1 and 5.2. Similarly to $\tilde{\Psi}_{n}$, the multiplier bootstrap test statistic $T_{n,m}^{\xi}$ in (16) can be expressed as

\displaystyle 2n^{2}T^{\xi}_{n,m}=\max_{\mathbf{a}}\left\{\mathbf{a}^{\top}\boldsymbol{\Phi}^{\top}\boldsymbol{\xi}_{m}\boldsymbol{\xi}_{m}^{\top}\boldsymbol{\Phi}\mathbf{a}:\mathbf{a}^{\top}\mathbf{L}\mathbf{a}\leq\lambda,\ \mathbf{a}^{\top}\mathbf{H}_{1}\mathbf{a}\leq 1\right\}. (22)

The optimization problem in (22) is also a QCQP and can be solved efficiently. Finally, $U_{n,m}^{\xi}$ can be written as

\displaystyle U_{n,m}^{\xi}=\left|\frac{1}{n}\sum_{i=1}^{n}\xi_{i,m}\left\{\phi_{\hat{P}_{n},f_{n}}(Z_{i};0)-\phi_{\hat{P}_{n},f_{n}}\left(Z_{i};\frac{\mathbf{H}_{2}^{\top}\tilde{\mathbf{a}}_{n}}{\tilde{\mathbf{a}}_{n}^{\top}\mathbf{H}_{1}\tilde{\mathbf{a}}_{n}}\right)\right\}\right|,

where $f_{n}=\sum_{j=1}^{J}\tilde{a}_{j,n}h_{j}$ and $\tilde{\mathbf{a}}_{n}=(\tilde{a}_{1,n},\ldots,\tilde{a}_{J,n})$ is a solution to (21).
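Putting these pieces together, a single bootstrap draw can be computed by reusing the rank-one reformulation above, since the objective in (22) has the same structure with $\mathbf{H}_{2}$ replaced by $\boldsymbol{\Phi}^{\top}\boldsymbol{\xi}_{m}$. The sketch below assumes that the matrix $\boldsymbol{\Phi}$, the vectors of fitted influence function values appearing in (19), and the quantities from (21) have already been computed; it is illustrative rather than prescriptive, and it uses the section's convention that the second-derivative estimate equals $\mathbf{a}^{\top}\mathbf{H}_{1}\mathbf{a}$.

# A minimal sketch of one bootstrap draw (T_{n,m}^xi, U_{n,m}^xi). Phi is the
# n x J matrix with entries phi'_{P_hat, h_j}(Z_i; 0); phi0_fn and phibeta_fn
# are the n-vectors phi_{P_hat, f_n}(Z_i; 0) and phi_{P_hat, f_n}(Z_i; beta_n)
# from (19); H1, L, lambda are as in Section 6.2.
bootstrap_draw <- function(xi, Phi, H1, L, lambda, phi0_fn, phibeta_fn) {
  # Rank-one QCQP of (22), reusing solve_improvement_qcqp() from above
  fit <- solve_improvement_qcqp(H1, crossprod(Phi, xi), L, lambda)
  a_xi <- fit$a
  # Evaluate (16) directly at the maximizing coefficient vector a_xi
  T_xi <- mean(xi * (Phi %*% a_xi))^2 / (2 * drop(t(a_xi) %*% H1 %*% a_xi))
  # U_{n,m}^xi from (19)
  U_xi <- abs(mean(xi * (phi0_fn - phibeta_fn)))
  c(T = T_xi, U = U_xi)
}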

7 Illustration: Inference in a Nonparametric Regression Model

In this section, we apply our framework to perform inference on the non-negative dissimilarity measure described in Example 1. In the Supplementary Materials, we describe how our framework can be used to construct a test of stochastic dependence, following the setting described in Example 2.

In this problem, we are tasked with assessing whether a subset of a collection of predictor variables is needed for attaining an optimal prediction function. As in Section 2.2, our data take the form $Z=(W,X,Y)$, where $Y$ is a real-valued outcome variable, $X$ is the predictor vector of interest, and $W$ is a vector of covariates. We denote by $\mu_{P,Y}:w\mapsto E_{P}[Y|W=w]$ the conditional mean of the outcome given the covariates. Our objectives are to assess whether there exists a function that depends on both $X$ and $W$ which predicts $Y$ better than $\mu_{P_{0},Y}(W)$, and to measure the best achievable improvement in predictive performance by any function in a large class.

We specify the parametric sub-model θP,f\theta^{*}_{P,f} as

θP,f(z;β)=μP,Y(w)+βf(w,x).\displaystyle\theta^{*}_{P,f}(z;\beta)=\mu_{P,Y}(w)+\beta f(w,x).

The goodness-of-fit of any candidate in the sub-model is defined as

GP,f(β):={yμP,Y(w)βf(w,x)}2𝑑P(z),\displaystyle G_{P,f}(\beta):=\int\left\{y-\mu_{P,Y}(w)-\beta f(w,x)\right\}^{2}dP(z),

and the first and second derivatives are given by

GP,f(β)=2f(w,x){yμY,P(w)βf(w,x)}𝑑P(z),\displaystyle G^{\prime}_{P,f}(\beta)=-2\int f(w,x)\left\{y-\mu_{Y,P}(w)-\beta f(w,x)\right\}dP(z),
GP,f′′(β)=2f2(w,x)𝑑P(z).\displaystyle G^{\prime\prime}_{P,f}(\beta)=2\int f^{2}(w,x)dP(z).

As we noted in Section 4, knowledge of the efficient influence function of GP,f(β)G_{P,f}(\beta) is helpful for constructing an asymptotically linear estimator thereof. Additionally, we require the derivative of the efficient influence function to exist. Let μf,P(w)=E[f(W,X)|W=w]\mu_{f,P}(w)=E[f(W,X)|W=w] represent the conditional mean of f(W,X)f(W,X) given WW. The form of the efficient influence function and its derivative are provided in the following lemma.

Lemma 1.

The efficient influence function of GP,f(β)={yμY,P(w)βf(w,x)}2𝑑P(z)G_{P,f}(\beta)=\int\left\{y-\mu_{Y,P}(w)-\beta f(w,x)\right\}^{2}dP(z) is given by

ϕP,f(;β):(w,x,y){yμP,Y(w)βf(w,x)}2+2β{yμY,P(w)}μf,P(w)GP,f(β).\displaystyle\phi_{P,f}(\cdot;\beta):(w,x,y)\mapsto\left\{y-\mu_{P,Y}(w)-\beta f(w,x)\right\}^{2}+2\beta\left\{y-\mu_{Y,P}(w)\right\}\mu_{f,P}(w)-G_{P,f}(\beta).

It is also easy to see that the efficient influence function is twice differentiable in $\beta$. The evaluations of its first and second derivatives at $\beta=0$ are given by

\displaystyle\phi^{\prime}_{P,f}(\cdot;0):(w,x,y)\mapsto-2\left\{y-\mu_{P,Y}(w)\right\}\left\{f(x,w)-\mu_{P,f}(w)\right\}-G^{\prime}_{P,f}(0),
\displaystyle\phi^{\prime\prime}_{P,f}(\cdot;0):(w,x,y)\mapsto 2f^{2}(w,x)-G^{\prime\prime}_{P,f}(0).

From Lemma 1, we can see that $\{G_{0,f}(\beta):f\in\mathcal{F},\beta\in\mathcal{B}\}$ and $\{\phi_{P_{0},f}(\cdot;\beta):f\in\mathcal{F},\beta\in\mathcal{B}\}$ depend on the nuisance parameters $\mu_{Y,P_{0}}$ and $\{\mu_{P_{0},f}:f\in\mathcal{F}\}$. One can obtain nonparametric estimators $\mu_{n,Y}$ and $\{\mu_{n,f}:f\in\mathcal{F}\}$ for the nuisances using any of a wide variety of flexible data-adaptive regression procedures, including kernel ridge regression (Wahba, 1990), neural networks (Barron, 1989), the highly adaptive lasso (Benkeser and van der Laan, 2016), or the Super Learner (van der Laan et al., 2007). In our implementation, we use kernel ridge regression, in large part because it is computationally efficient.

It may at first seem computationally difficult to estimate the conditional mean of $f(X,W)$ given $W$ for all $f\in\mathcal{F}$. However, because we have assumed that $f$ can be represented as a linear combination of basis functions $h_{1},h_{2},\ldots$, we can perform this calculation without too much trouble. For $f=\sum_{j=1}^{J}a_{j}h_{j}$, we have the representation $\mu_{P,f}=\sum_{j=1}^{J}a_{j}\mu_{P,h_{j}}$. Therefore, one can obtain estimators $\mu_{n,h_{j}}$ for $\mu_{P_{0},h_{j}}$ for $j=1,\ldots,J$ and then estimate $\mu_{P_{0},f}$ as $\mu_{n,f}=\sum_{j=1}^{J}a_{j}\mu_{n,h_{j}}$.
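In code, this reduces to one regression per basis function. The sketch below assumes a generic regression routine, here called regress, that returns fitted values of the conditional mean of its first argument given the second (for instance, a kernel ridge regression fit); the names are placeholders.

# A minimal sketch: basis_vals is an n x J matrix with entries h_j(W_i, X_i),
# W holds the covariates, and regress(v, W) returns fitted values of E[v | W].
fit_basis_conditional_means <- function(basis_vals, W, regress) {
  # One regression per basis function; column j holds mu_{n, h_j}(W_i)
  apply(basis_vals, 2, function(h) regress(h, W))
}

# For f = sum_j a_j h_j, the fitted values of mu_{n, f} at the sample points
# follow by linearity:
#   mu_basis_vals <- fit_basis_conditional_means(basis_vals, W, regress)
#   mu_f_vals     <- mu_basis_vals %*% a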

Consider the following initial plug-in estimator for G0,f(β)G_{0,f}(\beta):

Gn,f(β)=1ni=1n{Yiμn,Y(Wi)βf(Wi,Xi)}2,\displaystyle G_{n,f}(\beta)=\frac{1}{n}\sum_{i=1}^{n}\left\{Y_{i}-\mu_{n,Y}(W_{i})-\beta f(W_{i},X_{i})\right\}^{2},

and consider the efficient influence function estimator

ϕn,f(w,x,y;β)={yμn,Y(w)βf(w,x)}2+2β{yμn,Y(w)}μn,f(w)Gn,f(β).\displaystyle\phi_{n,f}(w,x,y;\beta)=\left\{y-\mu_{n,Y}(w)-\beta f(w,x)\right\}^{2}+2\beta\left\{y-\mu_{n,Y}(w)\right\}\mu_{n,f}(w)-G_{n,f}(\beta).

The initial estimator for G0,f(β)G_{0,f}(\beta) is biased, so a corrected estimator is needed so that one can perform inference. We can use the following one-step bias-corrected estimator:

G~n,f(β)\displaystyle\tilde{G}_{n,f}(\beta) =Gn,f(β)+1ni=1nϕn,f(Zi;β)\displaystyle=G_{n,f}(\beta)+\frac{1}{n}\sum_{i=1}^{n}\phi_{n,f}(Z_{i};\beta)
=1ni=1n[{Yiμn,Y(Wi)βf(Wi,Xi)}2+2β{Yiμn,Y(Wi)}μn,f(Wi)].\displaystyle=\frac{1}{n}\sum_{i=1}^{n}\left[\left\{Y_{i}-\mu_{n,Y}(W_{i})-\beta f(W_{i},X_{i})\right\}^{2}+2\beta\left\{Y_{i}-\mu_{n,Y}(W_{i})\right\}\mu_{n,f}(W_{i})\right]. (23)
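The one-step estimator in (23) is a simple empirical average once the nuisance fits are in hand. A minimal sketch, assuming vectors of fitted values for $\mu_{n,Y}(W_{i})$, $f(W_{i},X_{i})$, and $\mu_{n,f}(W_{i})$:

# A minimal sketch of (23): Y is the outcome vector, muY_vals = mu_{n,Y}(W_i),
# f_vals = f(W_i, X_i), and muf_vals = mu_{n,f}(W_i).
one_step_gof <- function(beta, Y, muY_vals, f_vals, muf_vals) {
  resid <- Y - muY_vals
  mean((resid - beta * f_vals)^2 + 2 * beta * resid * muf_vals)
}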

The following lemma provides conditions under which the one-step estimator satisfies Assumptions B1 through B4.

Lemma 2.

Suppose that the nuisance estimators satisfy the rate conditions

\displaystyle\left[\int\left\{\mu_{Y,P_{0}}(w)-\mu_{Y,n}(w)\right\}^{2}dP_{0}(w)\right]^{1/2}=o_{P}(n^{-1/4}),
\displaystyle\sup_{f\in\mathcal{F}}\int\left|\left\{\mu_{P_{0},f}(w)-\mu_{n,f}(w)\right\}\left\{\mu_{Y,P_{0}}(w)-\mu_{Y,n}(w)\right\}\right|dP_{0}(w)=o_{P}(n^{-1/2}).

Suppose also that there exist P0P_{0}-Donsker classes Φ\Phi, Φ\Phi^{\prime} and a P0P_{0}-Glivenko-Cantelli class Φ′′\Phi^{\prime\prime} such that, with probability tending to one, each of the following holds:

{ϕn,f(;β)ϕP0,f(;β):f,β}Φ,\displaystyle\{\phi_{n,f}(\cdot;\beta)-\phi_{P_{0},f}(\cdot;\beta):f\in\mathcal{F},\beta\in\mathcal{B}\}\subset\Phi,
{ϕn,f(;0)ϕP0,f(;0):f}Φ,\displaystyle\{\phi^{\prime}_{n,f}(\cdot;0)-\phi^{\prime}_{P_{0},f}(\cdot;0):f\in\mathcal{F}\}\subset\Phi^{\prime},
{ϕn,f′′(;0)ϕP0,f′′(;0):f}Φ′′.\displaystyle\{\phi^{\prime\prime}_{n,f}(\cdot;0)-\phi^{\prime\prime}_{P_{0},f}(\cdot;0):f\in\mathcal{F}\}\subset\Phi^{\prime\prime}.

Then the one-step estimator G~n,f\tilde{G}_{n,f} in (23) satisfies Assumptions B1-B4.

The condition on the convergence rates of the nuisance estimators is standard and holds when all nuisance estimators are $n^{1/4}$-rate convergent. This rate is attained by many nonparametric regression estimators under weak structural assumptions on the true conditional mean functions, so the condition is fairly mild.

We conclude with a brief comment about computation. We observe that G~n,f(β)\tilde{G}_{n,f}(\beta) is a quadratic function of β\beta, so the implementation scheme described in Section 6.2 can be applied in this example.

8 Simulation Study

In this section, we assess the empirical performance of our proposal in a simulation study. We again consider the nonparametric regression task discussed in Section 7.

We generate synthetic data sets as follows. First, we generate independent $3$-dimensional random vectors $A_{1},\ldots,A_{n}$ from a Gaussian distribution with mean zero and covariance

\displaystyle\mathbf{V}=\begin{pmatrix}1&.5&.5\\ .5&1&.5\\ .5&.5&1\end{pmatrix}.

We then define the predictor vector as $X_{i}=(2\gamma(A_{i,1})-1,2\gamma(A_{i,2})-1,2\gamma(A_{i,3})-1)$, where $\gamma$ is the standard normal distribution function. We generate the outcome $Y$ according to the model

\displaystyle Y_{i}=\sin(\pi X_{i,1})-2\left(X_{i,2}-\frac{1}{2}\right)^{2}+\mathds{1}(X_{i,2}>0)\exp(X_{i,1})+\epsilon_{i},

where the white noise $\epsilon_{i}$ is a continuous uniform $[-6,6]$ random variable, drawn independently of $X_{i}$. Our objective is to determine which of the elements of the predictor $X$ are conditionally associated with the outcome $Y$, given the other elements. Clearly, $Y$ is conditionally dependent on the first and second elements, and not the third.
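A compact way to reproduce this data-generating process is sketched below; the function name is illustrative.

# A minimal sketch of the simulation design in Section 8.
simulate_data <- function(n) {
  V <- matrix(0.5, 3, 3); diag(V) <- 1
  A <- matrix(rnorm(n * 3), n, 3) %*% chol(V)   # mean-zero Gaussian, covariance V
  X <- 2 * pnorm(A) - 1
  eps <- runif(n, min = -6, max = 6)
  Y <- sin(pi * X[, 1]) - 2 * (X[, 2] - 0.5)^2 + (X[, 2] > 0) * exp(X[, 1]) + eps
  list(X = X, Y = Y)
}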

Our target of inference is the improvement in fit estimand defined in Example 1. More specifically, for each predictor, we estimate the improvement in fit comparing a flexible regression model containing prediction functions that may depend on all predictors, with a model only containing functions that do not depend on the predictor of interest. We estimate all nuisance parameters described in Section 7 using kernel ridge regression. We construct $\mathcal{F}$ using the reproducing kernel Hilbert space corresponding to the Gaussian kernel, and we consider two choices for the smoothness parameter. First, we use an oracle approach, which sets $\mathcal{F}$ as the smoothest class containing a function that is proportional to the difference between the full conditional mean of $Y$ given all predictors $X$, and the conditional mean of $Y$ given the predictors that are not of interest. We recall from our discussion in Section 2.2 that this difference between conditional means is the function that maximizes the improvement in fit. Second, we consider a data-adaptive procedure, where the smoothness parameter is selected using cross-validation, and no sample-splitting is performed.

We compare our method with a sample splitting approach proposed by Williamson et al. (2021b). They propose to separate the dataset into two independent partitions -- one of which is used to estimate the optimal goodness-of-fit over the full model, and the second of which is used to estimate the optimal goodness-of-fit over the reduced model. Inference on $\Psi_{0}$ can then be performed using a two-sample Wald test, and Wald-type intervals can similarly be constructed. We expect our approach to offer an improvement because we do not require sample-splitting for valid inference. We also expect our proposal to benefit from achieving a fast $n$-rate of convergence at the boundary, compared with the sample-splitting approach, which is only $n^{1/2}$-rate convergent.

We generate 1000 synthetic data sets under the data-generating process described above for $n\in\{100,200,400,800,1600,3200\}$. We compare all methods under consideration in terms of mean squared error, type-1 error control under the null, power under the alternative, confidence interval coverage, and average confidence interval width.

Figure 1 shows the root mean squared error for each proposed improvement in fit estimator as a function of the sample size. We find that, while the root mean squared error of each estimator approaches zero, our proposed estimators converge much more quickly than the sample splitting estimator, with the oracle version of our approach performing best.

In Figures 2 and 3 we plot the rejection probability for a test of the null of no improvement in fit as a function of the nominal type-1 error level. We find that all tests considered achieve asymptotic type-1 error control in the setting where $\Psi_{0}=0$, though we acknowledge there is type-1 error inflation when the sample size is small. We find that both of our proposed tests are well-powered under the alternative, outperforming the sample-splitting approach.

In Figures 4 and 5 we plot the coverage probability and average width of 95% confidence intervals as a function of the sample size. We find that all approaches considered achieve nominal coverage as the sample size grows, though there is a tendency for our proposed intervals to exhibit undercoverage when the sample size is small. Our proposed estimator with oracle selection of $\mathcal{F}$ achieves the lowest average width, followed by the adaptive approach and the sample-splitting approach.

9 Discussion

In this work, we have presented a general framework for inference on non-negative dissimilarity measures. Our proposed methodology has wide-ranging utility. As examples, we described how this framework can be applied to perform rate-optimal inference on statistical functionals arising in nonparametric regression and graphical modeling problems. Our framework can also be useful in other settings, such as causal inference problems. For instance, some statistical functionals that have been used for studying treatment heterogeneity (see, e.g., Levy et al., 2021; Hines et al., 2021; Sanchez-Becerra, 2023) have the representation described in Section 2, so one can perform inference using our general approach.

Our work has some notable limitations that we plan to address in future research. While our proposal for inference on the improvement in fit enjoys good behavior in a large sample setting, we observed in our simulation study that it may have undesirable small sample properties, such as mild type-1 error inflation or poor coverage. Additionally, our estimator suffers a loss in precision when we select $\mathcal{F}$ in a data-adaptive manner. In future work, we plan to investigate whether the performance can be improved using, e.g., small-sample adjustments or improved data-adaptive methods for tuning parameter selection. Additionally, because our results assume smoothness of the goodness-of-fit functional, it is not clear whether our results can be directly applied to perform inference on estimands such as $L_{1}$ distances. It is of interest to develop a more flexible inferential strategy that relaxes this assumption. Our methodology also places complexity constraints on nuisance parameter estimators, which prohibits us from using estimators such as gradient boosted trees (Friedman, 2002). It is of interest to develop cross-fitted versions of our improvement in fit estimator and multiplier bootstrap strategy that relax this assumption (Zheng and van der Laan, 2011; Chernozhukov et al., 2018).

There also remain several open theoretical and methodological questions. For instance, while we have established $n$-rate consistency of our proposed improvement in fit estimator, it is unclear whether our test of the null of no improvement in fit is optimal in any sense. It would be important to characterize the power of our test and to determine whether there exists a more powerful test. Additionally, it is of interest to understand how specification of the sub-model $\theta^{*}_{P,f}$ affects our procedure's performance. It is possible that there are many ways for one to construct a sub-model while still obtaining valid inference on $\Psi_{0}$. It is not clear how this choice affects the estimator or whether there is an optimal choice. We expect that, in practice, this choice will need to be made in consideration of theoretical properties, such as power, and more practical concerns, such as ease of implementation and computational efficiency.

Figure 1: Monte Carlo estimates of the root mean squared error for improvement in fit estimators in our simulation study.
Figure 2: Monte Carlo estimates of the rejection probability as a function of the nominal type-1 error level, under the null of no improvement in fit in our simulation study.
Figure 3: Monte Carlo estimates of the rejection probability as a function of the nominal type-1 error level, under the alternative of positive improvement in fit in our simulation study.
Figure 4: Monte Carlo estimates of coverage probability of 95% confidence intervals for the improvement in fit in our simulation study.
Figure 5: Monte Carlo estimates of the average width of 95% confidence intervals for the improvement in fit in our simulation study.

References

  • Barron (1989) Barron, A. R. (1989). Statistical properties of artificial neural networks. In Proceedings of the 28th IEEE Conference on Decision and Control, pages 280–285. IEEE.
  • Benkeser and van der Laan (2016) Benkeser, D. and van der Laan, M. (2016). The highly adaptive lasso estimator. In 2016 IEEE international conference on data science and advanced analytics (DSAA), pages 689–696. IEEE.
  • Bhattacharya and Zhao (1997) Bhattacharya, P. and Zhao, P.-L. (1997). Semiparametric inference in a partial linear model. The Annals of Statistics 25, 244–262.
  • Bibaut and van der Laan (2019) Bibaut, A. F. and van der Laan, M. J. (2019). Fast rates for empirical risk minimization over càdlàg functions with bounded sectional variation norm. arXiv preprint arXiv:1907.09244 .
  • Bickel et al. (1998) Bickel, P. J., Klaassen, C. A., Ritov, Y., and Wellner, J. A. (1998). Efficient and adaptive estimation for semiparametric models. Springer.
  • Carone et al. (2018) Carone, M., Díaz, I., and van der Laan, M. J. (2018). Higher-order targeted loss-based estimation. Targeted learning in data science: causal inference for complex longitudinal studies pages 483–510.
  • Chernozhukov et al. (2018) Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., and Robins, J. (2018). Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal 21, C1–C68.
  • Donald and Newey (1994) Donald, S. G. and Newey, W. K. (1994). Series estimation of semilinear models. Journal of Multivariate Analysis 50, 30–40.
  • Friedman (2002) Friedman, J. H. (2002). Stochastic gradient boosting. Computational Statistics & Data Analysis 38, 367–378.
  • Fu et al. (2017) Fu, A., Narasimhan, B., and Boyd, S. (2017). CVXR: An R package for disciplined convex optimization. arXiv preprint arXiv:1711.07582.
  • Hines et al. (2021) Hines, O., Diaz-Ordaz, K., and Vansteelandt, S. (2021). Parameterising the effect of a continuous exposure using average derivative effects. arXiv preprint arXiv:2109.13124 .
  • Hines et al. (2022) Hines, O., Diaz-Ordaz, K., and Vansteelandt, S. (2022). Variable importance measures for heterogeneous causal effects. arXiv preprint arXiv:2204.06030 .
  • Hines et al. (2022) Hines, O., Dukes, O., Diaz-Ordaz, K., and Vansteelandt, S. (2022). Demystifying statistical learning based on efficient influence functions. The American Statistician 76, 292–304.
  • Hudson et al. (2021) Hudson, A., Carone, M., and Shojaie, A. (2021). Inference on function-valued parameters using a restricted score test. arXiv preprint arXiv:2105.06646 .
  • Kandasamy et al. (2015) Kandasamy, K., Krishnamurthy, A., Poczos, B., Wasserman, L., et al. (2015). Nonparametric von mises estimators for entropies, divergences and mutual informations. Advances in Neural Information Processing Systems 28,.
  • Kennedy et al. (2023) Kennedy, E. H., Balakrishnan, S., and Wasserman, L. A. (2023). Semiparametric Counterfactual Density Estimation. Biometrika .
  • Levy et al. (2021) Levy, J., van der Laan, M., Hubbard, A., and Pirracchio, R. (2021). A fundamental measure of treatment effect heterogeneity. Journal of Causal Inference 9, 83–108.
  • Luedtke et al. (2019) Luedtke, A., Carone, M., and van der Laan, M. J. (2019). An omnibus non-parametric test of equality in distribution for unknown functions. Journal of the Royal Statistical Society Series B: Statistical Methodology 81, 75–99.
  • Micchelli et al. (2006) Micchelli, C. A., Xu, Y., and Zhang, H. (2006). Universal kernels. Journal of Machine Learning Research 7,.
  • Negahban et al. (2012) Negahban, S. N., Ravikumar, P., Wainwright, M. J., and Yu, B. (2012). A unified framework for high-dimensional analysis of m-estimators with decomposable regularizers.
  • Paninski (2003) Paninski, L. (2003). Estimation of entropy and mutual information. Neural Computation 15, 1191–1253.
  • Pfanzagl (1982) Pfanzagl, J. (1982). Contributions to a general asymptotic statistical theory. Springer.
  • Pfanzagl (1985) Pfanzagl, J. (1985). Asymptotic expansions for general statistical models, volume 31. Springer-Verlag.
  • Robins et al. (2008) Robins, J., Li, L., Tchetgen, E., van der Vaart, A., et al. (2008). Higher order influence functions and minimax estimation of nonlinear functionals. Probability and Statistics: Essays in Honor of David A. Freedman 2, 335–421.
  • Robinson (1988) Robinson, P. M. (1988). Root-n-consistent semiparametric regression. Econometrica: Journal of the Econometric Society pages 931–954.
  • Sanchez-Becerra (2023) Sanchez-Becerra, A. (2023). Robust inference for the treatment effect variance in experiments using machine learning. arXiv preprint arXiv:2306.03363 .
  • Steuer et al. (2002) Steuer, R., Kurths, J., Daub, C. O., Weise, J., and Selbig, J. (2002). The mutual information: detecting and evaluating dependencies between variables. Bioinformatics 18, S231–S240.
  • Tibshirani et al. (2005) Tibshirani, R., Saunders, M., Rosset, S., Zhu, J., and Knight, K. (2005). Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67, 91–108.
  • Tsybakov (2009) Tsybakov, A. B. (2009). Introduction to Nonparametric Estimation. Springer.
  • van de Geer (2000) van de Geer, S. A. (2000). Empirical Processes in M-estimation, volume 6. Cambridge university press.
  • van de Geer (2008) van de Geer, S. A. (2008). High-dimensional generalized linear models and the lasso.
  • van der Laan et al. (2007) van der Laan, M. J., Polley, E. C., and Hubbard, A. E. (2007). Super learner. Statistical applications in genetics and molecular biology 6,.
  • van der Laan and Rose (2011) van der Laan, M. J. and Rose, S. (2011). Targeted learning: causal inference for observational and experimental data. Springer Science & Business Media.
  • van der Laan and Rose (2018) van der Laan, M. J. and Rose, S. (2018). Targeted learning in data science. Springer.
  • van der Vaart (2014) van der Vaart, A. (2014). Higher order tangent spaces and influence functions. Statistical Science pages 679–686.
  • van der Vaart and Wellner (1996) van der Vaart, A. and Wellner, J. (1996). Weak convergence and empirical processes. Springer.
  • van der Vaart (2000) van der Vaart, A. W. (2000). Asymptotic statistics, volume 3. Cambridge university press.
  • Verdinelli and Wasserman (2021) Verdinelli, I. and Wasserman, L. (2021). Decorrelated variable importance. arXiv preprint arXiv:2111.10853 .
  • Wahba (1990) Wahba, G. (1990). Spline models for observational data. SIAM.
  • Westling (2021) Westling, T. (2021). Nonparametric tests of the causal null with nondiscrete exposures. Journal of the American Statistical Association pages 1–12.
  • Wilks (1938) Wilks, S. S. (1938). The large-sample distribution of the likelihood ratio for testing composite hypotheses. The Annals of Mathematical Statistics 9, 60–62.
  • Williamson et al. (2021a) Williamson, B. D., Gilbert, P. B., Carone, M., and Simon, N. (2021). Nonparametric variable importance assessment using machine learning techniques. Biometrics 77, 9–22.
  • Williamson et al. (2021b) Williamson, B. D., Gilbert, P. B., Simon, N. R., and Carone, M. (2021). A general framework for inference on algorithm-agnostic variable importance. Journal of the American Statistical Association pages 1–14.
  • Zhang and Janson (2020) Zhang, L. and Janson, L. (2020). Floodgate: inference for model-free variable importance. arXiv preprint arXiv:2007.01283 .
  • Zheng and van der Laan (2011) Zheng, W. and van der Laan, M. J. (2011). Cross-validated targeted minimum-loss-based estimation. In Targeted Learning, pages 459–474. Springer.

Supplementary Materials

S1 Implementation for Non-quadratic Objectives

Here, we propose a general method for computing the improvement in fit estimator. As noted in Section 6.2, computation can be challenging because, when we use the specification of $\mathcal{F}$ in (20), the optimization problem in (10) is possibly non-convex. Moreover, a closed form expression for $\tilde{\beta}_{n,f}$ may not be available, further complicating the problem.

As in Section 6.2, we focus on the setting where the complexity measure $\Gamma$ is available in quadratic form, satisfying $\Gamma(f)=\mathbf{a}^{\top}\mathbf{L}\mathbf{a}$ for some matrix $\mathbf{L}$. Also, we note that in general, the second derivative of the goodness-of-fit $G_{P,f}^{\prime\prime}(0)$ is available in quadratic form in the coefficients as a consequence of the Riesz representation theorem. We assume that $\tilde{G}_{n,f}^{\prime\prime}(0)$ can be expressed as $\mathbf{a}^{\top}\Omega\mathbf{a}$ for some $\Omega$.

We propose a slightly modified estimator of $\Psi_{0}$ that has nearly identical asymptotic properties to $\tilde{\Psi}_{n}$, but which can be easier to compute in a more general setting. The main idea is to separate estimation of $\Psi_{0}$ into two parts. First, as before, for each $f\in\mathcal{F}$ we perform a search to identify whether any candidate parameter in the sub-model is an improvement over the null in terms of goodness-of-fit. Where we differ is that we confine our search to a small neighborhood of zero. When there exists evidence that, for some $f$, $G_{0,f}$ is not minimized at zero, we search over a larger function class to identify a candidate parameter that achieves a better fit than the null. In contrast to the strategy proposed in Section 6, we specify $\Theta\setminus\Theta^{*}$ as a class over which an optimum can more easily be identified. In what follows, we provide details and rationale for this method.

First, Taylor's theorem implies that when $\beta_{0,f}$ resides within a neighborhood of zero, $\beta_{0,f}$ is approximately equal to $-\{G^{\prime\prime}_{0,f}(0)\}^{-1}G^{\prime}_{0,f}(0)$. Additionally, in this setting we have that

\displaystyle\Psi_{0}^{*}:=\sup_{f\in\mathcal{F}}\frac{\left\{G^{\prime}_{0,f}(0)\right\}^{2}}{2G^{\prime\prime}_{0,f}(0)}\approx\sup_{f\in\mathcal{F}}G_{0,f}(0)-G_{0,f}(\beta_{0,f})=\Psi_{0}.
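To spell out the approximation, a second-order Taylor expansion of $G_{0,f}$ about zero gives

\displaystyle G_{0,f}(\beta)\approx G_{0,f}(0)+\beta G^{\prime}_{0,f}(0)+\frac{\beta^{2}}{2}G^{\prime\prime}_{0,f}(0),

which is minimized at $\beta\approx-\{G^{\prime\prime}_{0,f}(0)\}^{-1}G^{\prime}_{0,f}(0)$, so that

\displaystyle G_{0,f}(0)-G_{0,f}(\beta_{0,f})\approx\frac{\{G^{\prime}_{0,f}(0)\}^{2}}{2G^{\prime\prime}_{0,f}(0)}.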

Because $G_{0,f}$ is assumed to be convex for all $f$, if $G^{\prime}_{0,f^{*}}(0)\neq 0$ for some $f^{*}$, then $\beta_{0,f^{*}}\neq 0$, and $\Psi_{0}>0$. And so, one can assess whether $\Psi_{0}=0$ by checking whether $\Psi_{0}^{*}=0$. This is roughly equivalent to performing a search over $\mathcal{F}\times\mathcal{B}$ to identify the best fit, where $\mathcal{B}$ is taken to be a small interval containing zero. To estimate $\Psi^{*}_{0}$, we use the estimator

\displaystyle\tilde{\Psi}_{n}^{*}:=\sup_{f\in\mathcal{F}}\frac{\left\{\tilde{G}_{n,f}^{\prime}(0)\right\}^{2}}{2\tilde{G}^{\prime\prime}_{n,f}(0)}.

When the null holds, and the conditions of Theorem 2 hold as well, $\tilde{\Psi}^{*}_{n}$ has the same asymptotic representation as $\tilde{\Psi}_{n}$. That is,

\displaystyle\tilde{\Psi}^{*}_{n}=\sup_{f\in\mathcal{F}}\frac{1}{2G^{\prime\prime}_{0,f}(0)}\left[\frac{1}{n}\sum_{i=1}^{n}\phi^{\prime}_{P_{0},f}(Z_{i};0)\right]^{2}+o_{P}(n^{-1}).

As noted in Section 6.2, the Riesz representation theorem implies that $G^{\prime}_{0,f}(0)$ is a linear functional of $f$. We assume that our estimator $\tilde{G}^{\prime}_{n,f}(0)$ is also linear in $f$, and can be expressed as $\mathbf{H}^{\top}\mathbf{a}$ for some $J$-dimensional vector $\mathbf{H}$. Thus, using the specification of $\mathcal{F}$ in (20), $\tilde{\Psi}_{n}^{*}$ can be expressed as

\displaystyle 2\tilde{\Psi}^{*}_{n}=\max_{\mathbf{a}}\left\{\mathbf{a}^{\top}\mathbf{H}\mathbf{H}^{\top}\mathbf{a}:\mathbf{a}^{\top}\mathbf{L}\mathbf{a}\leq\lambda,\ \mathbf{a}^{\top}\Omega\mathbf{a}\leq 1\right\}.

This problem is a quadratically constrained quadratic program and can be solved efficiently.

It is possible that $\tilde{\Psi}_{n}^{*}$ is a poor approximation for $\tilde{\Psi}_{n}$ when the null of no improvement in fit does not hold. Let $\pi^{*}_{n}\in(0,1)$ be a random sequence that converges to one in probability when $\Psi_{0}^{*}=0$ holds and converges to zero in probability when $\Psi_{0}^{*}>0$. For instance, we can take $\pi^{*}_{n}$ as

\displaystyle\pi^{*}_{n}=\rho_{0}\left(\frac{n}{\log(n)}\tilde{\Psi}_{n}^{*}\right),

where $\rho_{0}$ is as defined in (17). For large values of $\pi^{*}_{n}$, $\tilde{\Psi}^{*}_{n}$ can replace $\tilde{\Psi}_{n}$. Otherwise, an alternative estimator may be needed.

To estimate $\Psi_{0}$ when the null does not hold, we consider an alternative specification of $\Theta\setminus\Theta^{*}$. A main source of our difficulty with solving the optimization problem in (10) is that the constraint $\{G^{\prime\prime}_{0,f}(0)\}^{-1}\Gamma(f)\leq\lambda$ is non-convex. Under the null, constraining $\{G^{\prime\prime}_{0,f}(0)\}^{-1}$ is necessary, as Theorem 2 states that Assumption A4 must hold in order for our improvement in fit estimator to have a well-behaved limiting distribution. However, this assumption is not needed for Theorem 3 to hold. As we discussed previously in Section 6.1, for Theorem 3 to hold, we really only need to ensure that the variance of $\{\tilde{\beta}_{n,f}:f\in\mathcal{F}\}$ is well controlled. When $G_{0,f}$ is quadratic, this can be achieved by constraining $\{G^{\prime\prime}_{0,f}(0)\}^{-1}$ and leaving $\mathcal{B}$ unconstrained, as was done previously. When $G_{0,f}$ is non-quadratic, we can alternatively leave $\{G^{\prime\prime}_{0,f}(0)\}^{-1}$ unconstrained and carefully select the width of $\mathcal{B}$. We instead set $\mathcal{F}$ as the function class

\displaystyle\mathcal{F}^{*}_{\lambda}:=\left\{f=\sum_{j=1}^{\infty}a_{j}h_{j}:\Gamma(f)\leq\lambda\right\},

and we set $\mathcal{B}$ as $\mathcal{B}^{*}_{\sigma}:=[-\sigma,\sigma]$ for some $\sigma>0$. Now, we define $\Psi_{0}^{**}$ as

\displaystyle\Psi_{0}^{**}:=\inf_{f\in\mathcal{F}^{*}_{\lambda}}\inf_{\beta\in\mathcal{B}^{*}_{\sigma}}G_{0,f}(\beta),

and we estimate $\Psi_{0}^{**}$ by

\displaystyle\tilde{\Psi}_{n}^{**}:=\inf_{f\in\mathcal{F}^{*}_{\lambda}}\inf_{\beta\in\mathcal{B}^{*}_{\sigma}}\tilde{G}_{n,f}(\beta).

Calculating $\tilde{\Psi}_{n}^{**}$ will in many cases be much easier than calculating $\tilde{\Psi}_{n}$. We can write $\tilde{\Psi}_{n}^{**}$ as

\displaystyle\tilde{\Psi}_{n}^{**}=\inf_{f\in\mathcal{F}^{*}_{\lambda}}\inf_{\beta\in\mathcal{B}^{*}_{\sigma}}\tilde{G}_{n,\beta f}(1)=\min_{\mathbf{b}}\left\{\tilde{G}_{n,\sum b_{j}h_{j}}(1):\mathbf{b}^{\top}\mathbf{L}\mathbf{b}\leq\sigma\lambda\right\},

where $\mathbf{b}=(b_{1},\ldots,b_{J})$ is a $J$-dimensional vector. This optimization problem has only a single convex constraint, so the problem is convex whenever the objective function is also convex. This can greatly simplify computation.

Of course, $\tilde{\Psi}_{n}^{**}$ and $\tilde{\Psi}_{n}$ can potentially achieve different limiting distributions when $\Psi_{0}>0$. This is because $\Psi_{0}^{**}$ and $\Psi_{0}$ are defined as optima over different function classes. While the two values are expected to be similar, they will not necessarily be equal. Nonetheless, one can still apply Theorem 3 to establish weak convergence of $\tilde{\Psi}_{n}^{**}$ to a Gaussian distribution, as long as $\mathcal{F}^{*}_{\lambda}\times\mathcal{B}^{*}_{\sigma}$ is not too large.

Finally, we combine Ψ~n\tilde{\Psi}_{n}^{*} and Ψ~n\tilde{\Psi}_{n}^{**} to obtain a single estimator for Ψ0\Psi_{0}. We define our estimator as

Ψˇn=πnΨ~n+(1πn)Ψ~n.\displaystyle\check{\Psi}_{n}=\pi_{n}^{*}\tilde{\Psi}_{n}^{*}+(1-\pi_{n}^{*})\tilde{\Psi}_{n}^{**}.

Because $\pi^{*}_{n}$ tends to one when the null holds and approaches zero when the null fails, $\check{\Psi}_{n}$ has approximately the same asymptotic behavior as $\tilde{\Psi}_{n}$. That is, $\check{\Psi}_{n}$ behaves like the supremum of an empirical process under the null, and like a sample average otherwise. Therefore, the multiplier bootstrap tests described in Sections 5.1 and 5.2 remain valid, and a similar strategy for implementation as described in Section 6.2 can be used.

S2 Illustration: Nonparametric Assessment of Stochastic Dependence

In this Section, we briefly discuss inference in Example 2 from Section 2.2, where we are interested in assessing whether a pair of random variables is independent. Our data takes the form Z=(X,Y)Z=(X,Y), where XX and YY are one-dimensional random variables.

We assess dependence by comparing the expectation of the log of the product of the marginal densities of XX and YY with the expectation of the logarithm of an approximation for the joint density. Let qP,Xq_{P,X} and qP,Yq_{P,Y} denote the marginal densities of XX and YY under PP, and let θP=logqP,XqP,Y\theta^{*}_{P}=\log q_{P,X}q_{P,Y} denote the log of the product of the marginal densities. We use the following sub-model to approximate the logarithm of the joint density:

\displaystyle\theta^{*}_{P,f}:(x,y)\mapsto\theta^{*}_{P}(x,y)+\beta f(x,y)-\log\int\exp(\theta^{*}_{P}(u,v)+\beta f(u,v))d\mu(u,v).

With a straightforward calculation, it can be verified that (7) is satisfied, and moreover, we can see that θP,f\theta^{*}_{P,f} is a valid candidate log density, as for any ff, expθP,f(z;β)𝑑μ(z)=1\int\exp\theta^{*}_{P,f}(z;\beta)d\mu(z)=1.

With the above specification for the parametric sub-model, the goodness-of-fit takes the form

GP,f(β)=EP[logqP,X(X)logqP,Y(Y)βf(X,Y)]+log(EPXEPY[exp(βf(X,Y))]),\displaystyle G_{P,f}(\beta)=E_{P}[-\log q_{P,X}(X)-\log q_{P,Y}(Y)-\beta f(X,Y)]+\log\left(E_{P_{X}}E_{P_{Y}}[\exp(\beta f(X,Y))]\right),

where EPXE_{P_{X}} and EPYE_{P_{Y}} denote the marginal expectations under PP, with respect to XX and YY, respectively. We now observe that along any submodel, the difference in goodness-of-fit comparing a given candidate parameter with the null is given by

ψP,f(β):=GP,f(β)GP,f(0)=βEP[f(X,Y)]+log(EPXEPY[exp(βf(X,Y))]),\displaystyle\psi_{P,f}(\beta):=G_{P,f}(\beta)-G_{P,f}(0)=-\beta E_{P}[f(X,Y)]+\log\left(E_{P_{X}}E_{P_{Y}}[\exp(\beta f(X,Y))]\right),

and this expression does not depend on the marginal densities qP,Xq_{P,X} and qP,Yq_{P,Y}. Thus, estimation of the marginal densities is not needed.

The derivatives GP,f(0)G^{\prime}_{P,f}(0) and GP,f′′(0)G^{\prime\prime}_{P,f}(0) are given by

GP,f(0)=ddβψP,f(β)|β=0=EP[f(X,Y)]+EPXEPY[f(X,Y)],\displaystyle G^{\prime}_{P,f}(0)=\frac{d}{d\beta}\psi_{P,f}(\beta)\bigg{|}_{\beta=0}=-E_{P}[f(X,Y)]+E_{P_{X}}E_{P_{Y}}[f(X,Y)],
GP,f′′(0)=d2dβ2ψP,f(β)|β=0=EPXEPY[f2(X,Y)]{EPXEPY[f(X,Y)]}2.\displaystyle G^{\prime\prime}_{P,f}(0)=\frac{d^{2}}{d\beta^{2}}\psi_{P,f}(\beta)\bigg{|}_{\beta=0}=E_{P_{X}}E_{P_{Y}}[f^{2}(X,Y)]-\left\{E_{P_{X}}E_{P_{Y}}[f(X,Y)]\right\}^{2}.

One can interpret GP,f(0)G^{\prime}_{P,f}(0) as the difference between the true mean of f(X,Y)f(X,Y) under PP and the value the mean would hypothetically take if XX and YY were independent. The second derivative GP,f′′(0)G^{\prime\prime}_{P,f}(0) represents the variance of f(X,Y)f(X,Y) under the assumption that XX and YY are independent.

Because \psi_{P,f}(\beta) does not depend on P through any nuisance parameters that fail to be pathwise differentiable, a plug-in estimator, obtained by evaluating the above expression at the empirical distribution, is expected to be asymptotically linear, and no sophisticated bias correction should be needed. We use the following plug-in estimator for \psi_{P_{0},f}(\beta):

ψ~n,f(β)=βni=1nf(Xi,Yi)+log{1n2i=1nj=1nexp(βf(Xi,Yj))}.\displaystyle\tilde{\psi}_{n,f}(\beta)=\frac{-\beta}{n}\sum_{i=1}^{n}f(X_{i},Y_{i})+\log\left\{\frac{1}{n^{2}}\sum_{i=1}^{n}\sum_{j=1}^{n}\exp(\beta f(X_{i},Y_{j}))\right\}.
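The plug-in estimator only requires evaluating f over all pairs of observations. The following sketch, with hypothetical argument names and assuming f is vectorized over numpy arrays, shows one way this could be computed.

import numpy as np

def psi_tilde(beta, f, x, y):
    # Plug-in estimate of psi_{P0,f}(beta) for the dependence example.
    diag_term = -beta * np.mean(f(x, y))               # -(beta / n) sum_i f(X_i, Y_i)
    fmat = f(x[:, None], y[None, :])                   # n-by-n matrix of f(X_i, Y_j)
    cross_term = np.log(np.mean(np.exp(beta * fmat)))  # log of the n^{-2} double sum
    return diag_term + cross_term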

By an application of the functional delta method, one can show that the plug-in estimator is asymptotically linear with influence function

\displaystyle\phi_{P,f}(z;\beta)-\phi_{P,f}(z;0)=-\beta\left\{f(x,y)-E_{P}[f(X,Y)]\right\}+\frac{E_{P_{Y}}[\exp(\beta f(x,Y))]+E_{P_{X}}[\exp(\beta f(X,y))]}{E_{P_{X}}E_{P_{Y}}[\exp(\beta f(X,Y))]}-2.

Similarly, we estimate G0,f(0)G^{\prime}_{0,f}(0) as

G~n,f(0)=ddβψ~n,f(β)|β=0=1ni=1nf(Xi,Yi)+1n2i1=1ni2=1nf(Xi1,Yi2),\displaystyle\tilde{G}^{\prime}_{n,f}(0)=\frac{d}{d\beta}\tilde{\psi}_{n,f}(\beta)\bigg{|}_{\beta=0}=\frac{-1}{n}\sum_{i=1}^{n}f(X_{i},Y_{i})+\frac{1}{n^{2}}\sum_{i_{1}=1}^{n}\sum_{i_{2}=1}^{n}f(X_{i_{1}},Y_{i_{2}}),

and \tilde{G}^{\prime}_{n,f}(0) is asymptotically linear with efficient influence function

ϕP,f(z;0)=f(x,y)+EP[f(X,Y)]+{EPY[f(x,Y)]+EPX[f(X,y)]2EPX[EPY[f(X,Y)]]}.\displaystyle\phi^{\prime}_{P,f}(z;0)=-f(x,y)+E_{P}[f(X,Y)]+\left\{E_{P_{Y}}\left[f(x,Y)\right]+E_{P_{X}}\left[f(X,y)\right]-2E_{P_{X}}[E_{P_{Y}}[f(X,Y)]]\right\}.

We estimate the second derivative G0,f′′(0)G^{\prime\prime}_{0,f}(0) as

\displaystyle\tilde{G}^{\prime\prime}_{n,f}(0)=\frac{d^{2}}{d\beta^{2}}\tilde{\psi}_{n,f}(\beta)\bigg{|}_{\beta=0}=\frac{1}{n^{2}}\sum_{i_{1}=1}^{n}\sum_{i_{2}=1}^{n}f^{2}(X_{i_{1}},Y_{i_{2}})-\left\{\frac{1}{n^{2}}\sum_{i_{1}=1}^{n}\sum_{i_{2}=1}^{n}f(X_{i_{1}},Y_{i_{2}})\right\}^{2}.
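The derivative estimates admit an equally direct computation; the sketch below follows the same hypothetical conventions as the previous snippet.

import numpy as np

def derivative_estimates(f, x, y):
    # Estimates of G'_{0,f}(0) and G''_{0,f}(0) based on the displays above.
    fmat = f(x[:, None], y[None, :])               # f(X_{i1}, Y_{i2}) over all pairs
    g1 = -np.mean(f(x, y)) + np.mean(fmat)         # first-derivative estimate
    g2 = np.mean(fmat ** 2) - np.mean(fmat) ** 2   # second-derivative estimate
    return g1, g2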

In this setting, a closed form solution for β~n,f\tilde{\beta}_{n,f} is not available, and if one uses the specification for \mathcal{F} in (20), the problem

inff,βψ~n,f(β)\displaystyle\inf_{f\in\mathcal{F},\beta\in\mathcal{B}}\tilde{\psi}_{n,f}(\beta)

is difficult to solve. As an alternative, we recommend the more general implementation strategy presented in Section S1.
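Concretely, under a basis representation f=\sum_{j}b_{j}h_{j} as in Section S1, the criterion passed to a generic constrained solver can be assembled from the plug-in estimator above. The sketch below is illustrative: the basis functions, the penalty matrix L, and the tuning parameters sigma and lam are assumed to be supplied by the user, and psi_tilde and estimate_psi_star_star refer to the earlier sketches in these Supplementary Materials.

def make_objective(basis, x, y):
    # Returns b -> psi~_{n, sum_j b_j h_j}(1): the plug-in criterion with beta fixed
    # at one and f given by the basis expansion with coefficient vector b.
    def objective(b):
        f = lambda u, v: sum(bj * hj(u, v) for bj, hj in zip(b, basis))
        return psi_tilde(1.0, f, x, y)
    return objective

# Example usage (all inputs hypothetical):
# value, b_hat = estimate_psi_star_star(make_objective(basis, x, y), L, sigma, lam, len(basis))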

S3 Proofs of Theoretical Results

Proof of Theorem 1

We have by definition that {G0,f(β~n,f)G0,f(β0,f)}\left\{G_{0,f}(\tilde{\beta}_{n,f})-G_{0,f}(\beta_{0,f})\right\} and {G~n,f(β0,f)G~n,f(β~n,f)}\left\{\tilde{G}_{n,f}(\beta_{0,f})-\tilde{G}_{n,f}\left(\tilde{\beta}_{n,f}\right)\right\} are non-negative. Under Assumption B2, we can write

0\displaystyle 0 supf{G0,f(β~n,f)G0,f(β0,f)}{G~n,f(β~n,f)G~n,f(β0,f)}\displaystyle\leq\sup_{f\in\mathcal{F}}\left\{G_{0,f}(\tilde{\beta}_{n,f})-G_{0,f}(\beta_{0,f})\right\}-\left\{\tilde{G}_{n,f}\left(\tilde{\beta}_{n,f}\right)-\tilde{G}_{n,f}(\beta_{0,f})\right\}
=supf1ni=1n{ϕP0,f(Zi;β~n,f)ϕP0,f(Zi;β0,f)}+oP(n1/2)\displaystyle=\sup_{f\in\mathcal{F}}\frac{1}{n}\sum_{i=1}^{n}\left\{\phi_{P_{0},f}(Z_{i};\tilde{\beta}_{n,f})-\phi_{P_{0},f}(Z_{i};\beta_{0,f})\right\}+o_{P}(n^{-1/2})
supf,β1ni=1n{ϕP0,f(Zi;β)ϕP0,f(Zi;β0,f)}+oP(n1/2).\displaystyle\leq\sup_{f\in\mathcal{F},\beta\in\mathcal{B}}\frac{1}{n}\sum_{i=1}^{n}\left\{\phi_{P_{0},f}(Z_{i};\beta)-\phi_{P_{0},f}(Z_{i};\beta_{0,f})\right\}+o_{P}(n^{-1/2}).

Thus, because \{\phi_{P_{0},f}(\cdot;\beta):f\in\mathcal{F},\beta\in\mathcal{B}\} is P_{0}-Donsker, we have that \sup_{f\in\mathcal{F}}\left\{G_{0,f}(\tilde{\beta}_{n,f})-G_{0,f}(\beta_{0,f})\right\}=O_{P}(n^{-1/2}). Assumption A3 then implies that \sup_{f\in\mathcal{F}}|\tilde{\beta}_{n,f}-\beta_{0,f}|=o_{P}(1).

Now, because we have that β~n,f\tilde{\beta}_{n,f} satisfies G~n,f(β~n,f)=oP(n1/2)\tilde{G}^{\prime}_{n,f}(\tilde{\beta}_{n,f})=o_{P}(n^{-1/2}), Taylor’s theorem implies

\displaystyle\tilde{G}^{\prime}_{n,f}(\tilde{\beta}_{n,f})=\tilde{G}^{\prime}_{n,f}(\beta_{0,f})+\tilde{G}^{\prime\prime}_{n,f}(\bar{\beta}_{n,f})(\tilde{\beta}_{n,f}-\beta_{0,f})=o_{P}(n^{-1/2}),

for some β¯n,f\bar{\beta}_{n,f} that satisfies |β¯n,fβ0,f||β~n,fβ0,f||\bar{\beta}_{n,f}-\beta_{0,f}|\leq|\tilde{\beta}_{n,f}-\beta_{0,f}|. By rearranging terms and invoking Assumption B4, the estimation error for β~n,f\tilde{\beta}_{n,f} can be expressed as

β~n,fβ0,f={G0,f′′(β¯n,f)+oP(1)}1G~n,f(β0,f).\displaystyle\tilde{\beta}_{n,f}-\beta_{0,f}=-\left\{G^{\prime\prime}_{0,f}(\bar{\beta}_{n,f})+o_{P}(1)\right\}^{-1}\tilde{G}^{\prime}_{n,f}(\beta_{0,f}).

Because G0,f(β0,f)=0G^{\prime}_{0,f}(\beta_{0,f})=0, Assumption B2 implies that

β~n,fβ0,f={G0,f′′(β¯n,f)+rn,f′′(β¯n,f)}1{1ni=1nϕ0,f(Zi;β0,f)+rn,f(β0,f)}.\displaystyle\tilde{\beta}_{n,f}-\beta_{0,f}=-\left\{G^{\prime\prime}_{0,f}(\bar{\beta}_{n,f})+r^{\prime\prime}_{n,f}(\bar{\beta}_{n,f})\right\}^{-1}\left\{\frac{1}{n}\sum_{i=1}^{n}\phi^{\prime}_{0,f}(Z_{i};\beta_{0,f})+r^{\prime}_{n,f}(\beta_{0,f})\right\}.

Now, because {β~n,f:f}\{\tilde{\beta}_{n,f}:f\in\mathcal{F}\} is uniformly consistent for {β0,f:f}\{\beta_{0,f}:f\in\mathcal{F}\}, the continuous mapping theorem and Assumptions A4 and B4 allow us to replace {G0,f′′(β¯n,f)+rn,f′′(β¯n,f)}1\left\{G^{\prime\prime}_{0,f}(\bar{\beta}_{n,f})+r^{\prime\prime}_{n,f}(\bar{\beta}_{n,f})\right\}^{-1} with {G0,f′′(β0,f)}1\left\{G^{\prime\prime}_{0,f}(\beta_{0,f})\right\}^{-1} in the above display. Thus, we have

\displaystyle\tilde{\beta}_{n,f}-\beta_{0,f}=\frac{-1}{G^{\prime\prime}_{0,f}(\beta_{0,f})}\left\{\frac{1}{n}\sum_{i=1}^{n}\phi^{\prime}_{P_{0},f}(Z_{i};\beta_{0,f})\right\}+o_{P}(n^{-1/2}),

as claimed. The weak convergence result follows as an immediate consequence of the Donsker Assumption A5.

Proof of Theorem 2

This result follows directly from an application of the continuous mapping theorem.

Proof of Theorem 3

Following our discussion in Section 4.2, it suffices to show that G_{n,f}(\tilde{\beta}_{n,f})-G_{n,f_{0}}(\beta_{0,f_{0}})=o_{P}(n^{-1/2}). First, we write

\displaystyle\sup_{f\in\mathcal{F}}G_{n,f}(\tilde{\beta}_{n,f})-G_{n,f_{0}}(\beta_{0,f_{0}})\leq\left\{G_{n,f_{n}}(\tilde{\beta}_{n,f_{n}})-G_{n,f_{0}}(\beta_{0,f_{0}})\right\}-\left\{G_{0,f_{n}}(\tilde{\beta}_{n,f_{n}})-G_{0,f_{0}}(\beta_{0,f_{0}})\right\}
\displaystyle=\frac{1}{n}\sum_{i=1}^{n}\left\{\phi_{P_{0},f_{n}}(Z_{i};\tilde{\beta}_{n,f_{n}})-\phi_{P_{0},f_{0}}(Z_{i};\beta_{0,f_{0}})\right\}+o_{P}(n^{-1/2})
\displaystyle\leq\frac{1}{n}\sum_{i=1}^{n}\sup_{f\in\mathcal{F},\beta\in\mathcal{B}}\left\{\phi_{P_{0},f}(Z_{i};\beta)-\phi_{P_{0},f_{0}}(Z_{i};\beta_{0,f_{0}})\right\}+o_{P}(n^{-1/2}).
=OP(n1/2).\displaystyle=O_{P}(n^{-1/2}).

From the above argument, we can conclude that this remainder term is O_{P}(n^{-1/2}).

Now, we have under the smoothness assumption in (15) that

{ϕP0,fn(z;β~n,fn)ϕP0,f0(z;β0,f0)}2𝑑P0(z)=oP(1).\displaystyle\int\left\{\phi_{P_{0},f_{n}}(z;\tilde{\beta}_{n,f_{n}})-\phi_{P_{0},f_{0}}(z;\beta_{0,f_{0}})\right\}^{2}dP_{0}(z)=o_{P}(1).

We can now apply Lemma 19.24 of van der Vaart (2000) to conclude that

\displaystyle\frac{1}{n}\sum_{i=1}^{n}\left\{\phi_{P_{0},f_{n}}(Z_{i};\tilde{\beta}_{n,f_{n}})-\phi_{P_{0},f_{0}}(Z_{i};\beta_{0,f_{0}})\right\}=o_{P}(n^{-1/2}),

thereby completing the proof.

Proof of Theorem 4

Let \mathcal{L} be the space of bounded Lipschitz-1 functions \ell:\mathbb{R}\to[-1,1]. That is, any \ell in \mathcal{L} satisfies |\ell(a_{1})-\ell(a_{2})|\leq|a_{1}-a_{2}| for any a_{1},a_{2}\in\mathbb{R}. Let E_{\xi} denote expectation with respect to the distribution of the multiplier weights \xi, treating the observations Z_{1},\ldots,Z_{n} as fixed. We show that

sup|Eξ[({nTnξ}1/2)]E0[(supf|{2G0,f′′(0)}1/2(f)|)]|\displaystyle\sup_{\ell\in\mathcal{L}}\left|E_{\xi}\left[\ell\left(\left\{nT_{n}^{\xi}\right\}^{1/2}\right)\right]-E_{0}\left[\ell\left(\sup_{f\in\mathcal{F}}\left|\left\{2G^{\prime\prime}_{0,f}(0)\right\}^{-1/2}\mathbb{H}(f)\right|\right)\right]\right|

converges to zero in outer probability. This is equivalent to weak convergence by the Portmanteau lemma (see, e.g., Lemma 18.9 of van der Vaart, 2000).

Let ()\mathcal{\ell}^{\infty}(\mathcal{F}) denote the space of bounded functionals on \mathcal{F}, and let \mathcal{E} be the space of Lipschitz-1 functionals e:()[1,1]e:\ell^{\infty}(\mathcal{F})\to[-1,1]. That is, for F1,F2F_{1},F_{2} in ()\ell^{\infty}(\mathcal{F}), any ee\in\mathcal{E} satisfies |e(F1)e(F2)|supf|F1(f)F2(f)||e(F_{1})-e(F_{2})|\leq\sup_{f\in\mathcal{F}}|F_{1}(f)-F_{2}(f)|. We now define:

Λnξ:f{2GP^n,f′′(0)}1/2[n1/2i=1nϕP^n,f(Zi;0)],\displaystyle\Lambda^{\xi}_{n}:f\mapsto\{2G^{\prime\prime}_{\hat{P}_{n},f}(0)\}^{-1/2}\left[n^{-1/2}\sum_{i=1}^{n}\phi^{\prime}_{\hat{P}_{n},f}(Z_{i};0)\right],
Λ0:f{2GP0,f′′(0)}1/2(f).\displaystyle\Lambda_{0}:f\mapsto\{2G^{\prime\prime}_{P_{0},f}(0)\}^{-1/2}\mathbb{H}(f).

It is shown by Hudson et al. (2021) that under Assumptions C1-C3,

\displaystyle\sup_{e\in\mathcal{E}}\left|E_{\xi}\left[e\left(\Lambda_{n}^{\xi}\right)\right]-E_{0}\left[e\left(\Lambda_{0}\right)\right]\right|

converges to zero in outer probability. The proof is completed by recognizing that for any \ell\in\mathcal{L}, the functional F\mapsto\ell\left(\sup_{f\in\mathcal{F}}\left|F(f)\right|\right) is contained within \mathcal{E}.

Proof of Theorem 5

Case 1: Ψ0=0\Psi_{0}=0

We first consider the setting in which \Psi_{0}=0. As in the proof of Theorem 4, let \mathcal{L} be the space of bounded Lipschitz-1 functions \ell:\mathbb{R}\to[-1,1]. We show that

sup|Eξ[({πnnTnξ+(1πn)nUnξ}1/2)]E0[(supf|{2G0,f′′(0)}1/2(f)|)]|\displaystyle\sup_{\ell\in\mathcal{L}}\left|E_{\xi}\left[\ell\left(\left\{\pi_{n}nT^{\xi}_{n}+(1-\pi_{n})nU^{\xi}_{n}\right\}^{1/2}\right)\right]-E_{0}\left[\ell\left(\sup_{f\in\mathcal{F}}\left|\left\{2G^{\prime\prime}_{0,f}(0)\right\}^{-1/2}\mathbb{H}(f)\right|\right)\right]\right|

converges to zero in outer probability.

First, by applying the triangle inequality and invoking the Lipschitz property, we have

sup|Eξ[({πnnTnξ+(1πn)nUnξ}1/2)]E0[(supf|{2G0,f′′(0)}1/2(f)|)]|An+Bn,\displaystyle\sup_{\ell\in\mathcal{L}}\left|E_{\xi}\left[\ell\left(\left\{\pi_{n}nT^{\xi}_{n}+(1-\pi_{n})nU^{\xi}_{n}\right\}^{1/2}\right)\right]-E_{0}\left[\ell\left(\sup_{f\in\mathcal{F}}\left|\left\{2G^{\prime\prime}_{0,f}(0)\right\}^{-1/2}\mathbb{H}(f)\right|\right)\right]\right|\leq A_{n}+B_{n},

where we define

\displaystyle A_{n}:=\sup_{\ell\in\mathcal{L}}\left|E_{\xi}\left[\ell\left(\left|nT^{\xi}_{n}\right|^{1/2}\right)\right]-E_{0}\left[\ell\left(\sup_{f\in\mathcal{F}}\left|\left\{2G^{\prime\prime}_{0,f}(0)\right\}^{-1/2}\mathbb{H}(f)\right|\right)\right]\right|,
Bn:=Eξ|{(1πn)nUnξ+πnnTnξ}1/2|nTnξ|1/2|.\displaystyle B_{n}:=E_{\xi}\left|\left\{(1-\pi_{n})nU^{\xi}_{n}+\pi_{n}nT^{\xi}_{n}\right\}^{1/2}-|nT^{\xi}_{n}|^{1/2}\right|.

We have already shown in Theorem 4 that the first term AnA_{n} converges to zero in outer probability, so it only remains to verify this for the second term BnB_{n}.

By the reverse triangle inequality, we have

Bn(1πn){Eξ[(nUnξ)1/2]+Eξ[(nTnξ)1/2]}.\displaystyle B_{n}\leq(1-\pi_{n})\left\{E_{\xi}\left[\left(nU^{\xi}_{n}\right)^{1/2}\right]+E_{\xi}\left[\left(nT^{\xi}_{n}\right)^{1/2}\right]\right\}.

Because (1πn)=oP(1)(1-\pi_{n})=o_{P}(1) when Ψ0=0\Psi_{0}=0, it suffices to show that Eξ[(nUn)1/2]E_{\xi}\left[\left(nU_{n}\right)^{1/2}\right] and Eξ[(nTn)1/2]E_{\xi}\left[\left(nT_{n}\right)^{1/2}\right] are both OP(1)O_{P}(1).

We first show that E_{\xi}\left[\left(nU_{n}\right)^{1/2}\right] is bounded in probability. By Jensen's inequality, we have

Eξ[(nUn)1/2]Eξ[nUn]1/2.\displaystyle E_{\xi}\left[\left(nU_{n}\right)^{1/2}\right]\leq E_{\xi}\left[nU_{n}\right]^{1/2}.

Now, by Taylor’s theorem, we have

Eξ[|i=1nξi{ϕP^n,fn(Zi;0)ϕP^n,fn(Zi;β~n,fn)}|]Eξ[supf|1n1/2i=1nξi{ϕP^n,f(Zi;0)}|]supf|n1/2β¯n,f|,\displaystyle E_{\xi}\left[\left|\sum_{i=1}^{n}\xi_{i}\left\{\phi_{\hat{P}_{n},f_{n}}(Z_{i};0)-\phi_{\hat{P}_{n},f_{n}}(Z_{i};\tilde{\beta}_{n,f_{n}})\right\}\right|\right]\leq E_{\xi}\left[\sup_{f\in\mathcal{F}}\left|\frac{1}{n^{1/2}}\sum_{i=1}^{n}\xi_{i}\left\{\phi^{\prime}_{\hat{P}_{n},f}(Z_{i};0)\right\}\right|\right]\sup_{f\in\mathcal{F}}\left|n^{1/2}\bar{\beta}_{n,f}\right|,

for some {β¯n,f:f}\{\bar{\beta}_{n,f}:f\in\mathcal{F}\} that satisfies |β¯n,f||β~n,f||\bar{\beta}_{n,f}|\leq|\tilde{\beta}_{n,f}| for all ff\in\mathcal{F}. Because supf|n1/2β¯n,f|=OP(1)\sup_{f\in\mathcal{F}}|n^{1/2}\bar{\beta}_{n,f}|=O_{P}(1) under the conditions of Theorem 2, it suffices to show that

Eξ[supf|1n1/2i=1nξi{ϕP^n,f(Zi;0)}|]=OP(1).\displaystyle E_{\xi}\left[\sup_{f\in\mathcal{F}}\left|\frac{1}{n^{1/2}}\sum_{i=1}^{n}\xi_{i}\left\{\phi^{\prime}_{\hat{P}_{n},f}(Z_{i};0)\right\}\right|\right]=O_{P}(1). (24)

By the triangle inequality, we have the upper bound

Eξ[supf|1n1/2i=1nξi{ϕP^n,f(Zi;0)}|]Bn,1+Bn,2+Bn,3,\displaystyle E_{\xi}\left[\sup_{f\in\mathcal{F}}\left|\frac{1}{n^{1/2}}\sum_{i=1}^{n}\xi_{i}\left\{\phi^{\prime}_{\hat{P}_{n},f}(Z_{i};0)\right\}\right|\right]\leq B_{n,1}+B_{n,2}+B_{n,3},

where we define

\displaystyle B_{n,1}=E_{\xi}\left[\sup_{f\in\mathcal{F}}\left|\frac{1}{n^{1/2}}\sum_{i=1}^{n}\xi_{i}\int\left\{\phi^{\prime}_{\hat{P}_{n},f}(z;0)-\phi^{\prime}_{P_{0},f}(z;0)\right\}dP_{0}(z)\right|\right],
\displaystyle B_{n,2}=E_{\xi}\left[\sup_{f\in\mathcal{F}}\left|\frac{1}{n^{1/2}}\sum_{i=1}^{n}\xi_{i}\left[\left\{\phi^{\prime}_{\hat{P}_{n},f}(Z_{i};0)-\phi^{\prime}_{P_{0},f}(Z_{i};0)\right\}-\int\left\{\phi^{\prime}_{\hat{P}_{n},f}(z;0)-\phi^{\prime}_{P_{0},f}(z;0)\right\}dP_{0}(z)\right]\right|\right],
\displaystyle B_{n,3}=E_{\xi}\left[\sup_{f\in\mathcal{F}}\left|\frac{1}{n^{1/2}}\sum_{i=1}^{n}\xi_{i}\phi^{\prime}_{P_{0},f}(Z_{i};0)\right|\right].

It can be seen through an application of the Cauchy-Schwarz inequality that B_{n,1}=o_{P}(1) under Assumption C2. To see that B_{n,2} is bounded in probability, we first note that Assumption C1 implies that, with probability tending to one,

Bn,2Eξ[supφΦδ|1n1/2i=1nξiφ(Zi)|].\displaystyle B_{n,2}\leq E_{\xi}\left[\sup_{\varphi\in\Phi_{\delta}}\left|\frac{1}{n^{1/2}}\sum_{i=1}^{n}\xi_{i}\varphi(Z_{i})\right|\right].

In view of Markov’s inequality, it is sufficient to show that this upper bound has finite expectation. Lemma 2.3.6 of van der Vaart and Wellner (1996) implies that

E0Eξ[supφΦδ|1n1/2i=1nξiφ(Zi)|]2E0[supφΦδ|1n1/2i=1nφ(Zi)|].\displaystyle E_{0}E_{\xi}\left[\sup_{\varphi\in\Phi_{\delta}}\left|\frac{1}{n^{1/2}}\sum_{i=1}^{n}\xi_{i}\varphi(Z_{i})\right|\right]\leq 2E_{0}\left[\sup_{\varphi\in\Phi_{\delta}}\left|\frac{1}{n^{1/2}}\sum_{i=1}^{n}\varphi(Z_{i})\right|\right].

Because under Assumption C3, Φδ\Phi_{\delta} has finite bracketing integral, we have by Corollary 19.35 of van der Vaart (2000) that

\displaystyle E_{0}\left[\sup_{\varphi\in\Phi_{\delta}}\left|\frac{1}{n^{1/2}}\sum_{i=1}^{n}\varphi(Z_{i})\right|\right]<\infty,

thereby establishing that Bn,2=OP(1)B_{n,2}=O_{P}(1). That Bn,3=OP(1)B_{n,3}=O_{P}(1) follows from a similar argument.

To argue that E_{\xi}\left[\left(nT^{\xi}_{n}\right)^{1/2}\right] is O_{P}(1), we use the same argument as is used to show that (24) holds. In brief, the result follows from the facts that (\mathrm{i}) both \{G^{\prime\prime}_{\hat{P}_{n},f}(0):f\in\mathcal{F}\} and \{z\mapsto\phi^{\prime}_{\hat{P}_{n},f}(z;0):f\in\mathcal{F}\} are uniformly consistent under Assumption C2, and (\mathrm{ii}) the class

Φ¯δ2:={z[GP,f′′(0)]1ϕP,f(z;0)[GP0,f′′(0)]1ϕP0,f(z;0):f,QPQP0𝒬δ2}\displaystyle\bar{\Phi}^{\prime}_{\delta_{2}}:=\{z\mapsto\left[G^{\prime\prime}_{P,f}(0)\right]^{-1}\phi^{\prime}_{P,f}(z;0)-\left[G^{\prime\prime}_{P_{0},f}(0)\right]^{-1}\phi^{\prime}_{P_{0},f}(z;0):f\in\mathcal{F},\|Q_{P}-Q_{P_{0}}\|_{\mathcal{Q}}\leq\delta_{2}\}

is P0P_{0}-Donsker with finite bracketing integral under Assumption C3.

Case 2: Ψ0>0\Psi_{0}>0

We now consider the setting where Ψ0>0\Psi_{0}>0. We show that

sup|Eξ[(πnn1/2Tnξ+(1πn)n1/2Unξ)]E0[(|𝕀|)]|,\displaystyle\sup_{\ell\in\mathcal{L}}\left|E_{\xi}\left[\ell\left(\pi_{n}n^{1/2}T^{\xi}_{n}+(1-\pi_{n})n^{1/2}U^{\xi}_{n}\right)\right]-E_{0}\left[\ell\left(|\mathbb{I}|\right)\right]\right|,

converges to zero in outer probability. Similarly as for Case 1, we have by the triangle inequality that

sup|Eξ[(πnn1/2Tnξ+(1πn)n1/2Unξ)]E0[(|𝕀|)]|An+Bn,\displaystyle\sup_{\ell\in\mathcal{L}}\left|E_{\xi}\left[\ell\left(\pi_{n}n^{1/2}T^{\xi}_{n}+(1-\pi_{n})n^{1/2}U^{\xi}_{n}\right)\right]-E_{0}\left[\ell\left(|\mathbb{I}|\right)\right]\right|\leq A_{n}+B_{n},

where we define

An:=sup|Eξ[(n1/2Unξ)]E0[(𝕀)]|,\displaystyle A_{n}:=\sup_{\ell\in\mathcal{L}}\left|E_{\xi}\left[\ell\left(n^{1/2}U^{\xi}_{n}\right)\right]-E_{0}[\ell(\mathbb{I})]\right|,
Bn:=πn{Eξ[n1/2Unξ]+Eξ[n1/2Tnξ]}.\displaystyle B_{n}:=\pi_{n}\left\{E_{\xi}\left[n^{1/2}U_{n}^{\xi}\right]+E_{\xi}\left[n^{1/2}T_{n}^{\xi}\right]\right\}.

We first argue that A_{n} converges to zero in outer probability. By assumption, the function

zϕP^n,fn(z;0)ϕP^n,fn(z;β~n,fn)\displaystyle z\mapsto\phi_{\hat{P}_{n},f_{n}}(z;0)-\phi_{\hat{P}_{n},f_{n}}(z;\tilde{\beta}_{n,f_{n}})

is contained within a P_{0}-Donsker class with probability tending to one. Also, we have under Assumption C2 that \int\left\{\phi_{\hat{P}_{n},f_{n}}(z;0)-\phi_{P_{0},f_{n}}(z;0)\right\}^{2}dP_{0}(z)=o_{P}(1). We now argue that \int\left\{\phi_{\hat{P}_{n},f_{n}}(z;\tilde{\beta}_{n,f_{n}})-\phi_{P_{0},f_{0}}(z;\beta_{0,f_{0}})\right\}^{2}dP_{0}(z)=o_{P}(1). We have the upper bound

\displaystyle\int\left\{\phi_{\hat{P}_{n},f_{n}}(z;\tilde{\beta}_{n,f_{n}})-\phi_{P_{0},f_{0}}(z;\beta_{0,f_{0}})\right\}^{2}dP_{0}(z)\leq
\displaystyle 2\left[\int\left\{\phi_{\hat{P}_{n},f_{n}}(z;\tilde{\beta}_{n,f_{n}})-\phi_{P_{0},f_{n}}(z;\tilde{\beta}_{n,f_{n}})\right\}^{2}dP_{0}(z)+\int\left\{\phi_{P_{0},f_{n}}(z;\tilde{\beta}_{n,f_{n}})-\phi_{P_{0},f_{0}}(z;\beta_{0,f_{0}})\right\}^{2}dP_{0}(z)\right].

Under Assumption C2, we have

\displaystyle\int\left\{\phi_{\hat{P}_{n},f_{n}}(z;\tilde{\beta}_{n,f_{n}})-\phi_{P_{0},f_{n}}(z;\tilde{\beta}_{n,f_{n}})\right\}^{2}dP_{0}(z)\leq\sup_{f\in\mathcal{F},\beta\in\mathcal{B}}\int\left\{\phi_{\hat{P}_{n},f}(z;\beta)-\phi_{P_{0},f}(z;\beta)\right\}^{2}dP_{0}(z)=o_{P}(1).

Additionally, we have

\displaystyle\int\left\{\phi_{P_{0},f_{n}}(z;\tilde{\beta}_{n,f_{n}})-\phi_{P_{0},f_{0}}(z;\beta_{0,f_{0}})\right\}^{2}dP_{0}(z)=o_{P}(1)

under the conditions of Theorem 3. Now, by Theorem 2 of Hudson et al. (2021), we can conclude that |An||A_{n}| converges to zero in outer probability.

We now argue that Bn=oP(1)B_{n}=o_{P}(1). Because πn=oP(1)\pi_{n}=o_{P}(1), we only need to show that Eξ[n1/2Unξ]E_{\xi}\left[n^{1/2}U_{n}^{\xi}\right] and Eξ[n1/2Tnξ]E_{\xi}\left[n^{1/2}T_{n}^{\xi}\right] are bounded in probability. That Eξ[n1/2Unξ]=OP(1)E_{\xi}\left[n^{1/2}U_{n}^{\xi}\right]=O_{P}(1) follows from the same argument as was used to show that (24) holds in Case 1.

To argue that Eξ[n1/2Tnξ]=OP(1)E_{\xi}\left[n^{1/2}T_{n}^{\xi}\right]=O_{P}(1), we begin by applying the Cauchy-Schwarz inequality and invoking Assumption C2 to get

{Tnξ}1/2\displaystyle\left\{T_{n}^{\xi}\right\}^{1/2} [1ni=1nξi2]1/2[supf1ni=1n{[GP^n,f′′(0)]1ϕP^n,f(Zi;0)}2]1/2\displaystyle\leq\left[\frac{1}{n}\sum_{i=1}^{n}\xi_{i}^{2}\right]^{1/2}\left[\sup_{f\in\mathcal{F}}\frac{1}{n}\sum_{i=1}^{n}\left\{[G^{\prime\prime}_{\hat{P}_{n},f}(0)]^{-1}\phi^{\prime}_{\hat{P}_{n},f}(Z_{i};0)\right\}^{2}\right]^{1/2}
=[supf1ni=1n{[GP^n,f′′(0)]1ϕP^n,f(Zi;0)}2]1/2.\displaystyle=\left[\sup_{f\in\mathcal{F}}\frac{1}{n}\sum_{i=1}^{n}\left\{[G^{\prime\prime}_{\hat{P}_{n},f}(0)]^{-1}\phi^{\prime}_{\hat{P}_{n},f}(Z_{i};0)\right\}^{2}\right]^{1/2}.

Now, by the triangle inequality,

[supf1ni=1n{[GP^n,f′′(0)]1ϕP^n,f(Zi;0)}2]1/2\displaystyle\left[\sup_{f\in\mathcal{F}}\frac{1}{n}\sum_{i=1}^{n}\left\{[G^{\prime\prime}_{\hat{P}_{n},f}(0)]^{-1}\phi^{\prime}_{\hat{P}_{n},f}(Z_{i};0)\right\}^{2}\right]^{1/2}\leq
[supf1ni=1n{[GP0,f′′(0)]1ϕP0,f(Zi;0)}2]1/2+\displaystyle\left[\sup_{f\in\mathcal{F}}\frac{1}{n}\sum_{i=1}^{n}\left\{[G^{\prime\prime}_{P_{0},f}(0)]^{-1}\phi^{\prime}_{P_{0},f}(Z_{i};0)\right\}^{2}\right]^{1/2}+
\displaystyle\left[\sup_{f\in\mathcal{F}}\frac{1}{n}\sum_{i=1}^{n}\left\{[G^{\prime\prime}_{\hat{P}_{n},f}(0)]^{-1}\phi^{\prime}_{\hat{P}_{n},f}(Z_{i};0)-[G^{\prime\prime}_{P_{0},f}(0)]^{-1}\phi^{\prime}_{P_{0},f}(Z_{i};0)\right\}^{2}\right]^{1/2}.

Because {[G0,f′′(0)]1ϕ0,f(;0):f}\{[G^{\prime\prime}_{0,f}(0)]^{-1}\phi^{\prime}_{0,f}(\cdot;0):f\in\mathcal{F}\} is a P0P_{0}-Donsker class with finite squared envelope function, we have by Lemma 2.10.4 of van der Vaart and Wellner (1996) that {[[G0,f′′(0)]1ϕ0,f(;0)]2:f}\left\{\left[[G^{\prime\prime}_{0,f}(0)]^{-1}\phi^{\prime}_{0,f}(\cdot;0)\right]^{2}:f\in\mathcal{F}\right\} is a P0P_{0}-Glivenko-Cantelli class, and so

[supf1ni=1n{[GP0,f′′(0)]1ϕP0,f(Zi;0)}2]1/2=OP(1).\displaystyle\left[\sup_{f\in\mathcal{F}}\frac{1}{n}\sum_{i=1}^{n}\left\{[G^{\prime\prime}_{P_{0},f}(0)]^{-1}\phi^{\prime}_{P_{0},f}(Z_{i};0)\right\}^{2}\right]^{1/2}=O_{P}(1).

Now, by the triangle inequality

\displaystyle\sup_{f\in\mathcal{F}}\left[\frac{1}{n}\sum_{i=1}^{n}\left\{[G^{\prime\prime}_{\hat{P}_{n},f}(0)]^{-1}\phi^{\prime}_{\hat{P}_{n},f}(Z_{i};0)-[G^{\prime\prime}_{P_{0},f}(0)]^{-1}\phi^{\prime}_{P_{0},f}(Z_{i};0)\right\}^{2}\right]^{1/2}\leq
\displaystyle\sup_{f\in\mathcal{F}}\Bigg{[}\frac{1}{n}\sum_{i=1}^{n}\left\{[G^{\prime\prime}_{\hat{P}_{n},f}(0)]^{-1}\phi^{\prime}_{\hat{P}_{n},f}(Z_{i};0)-[G^{\prime\prime}_{P_{0},f}(0)]^{-1}\phi^{\prime}_{P_{0},f}(Z_{i};0)\right\}^{2}-
\displaystyle\int\left\{[G^{\prime\prime}_{\hat{P}_{n},f}(0)]^{-1}\phi^{\prime}_{\hat{P}_{n},f}(z;0)-[G^{\prime\prime}_{P_{0},f}(0)]^{-1}\phi^{\prime}_{P_{0},f}(z;0)\right\}^{2}dP_{0}\Bigg{]}^{1/2}+
\displaystyle\sup_{f\in\mathcal{F}}\left[\int\left\{[G^{\prime\prime}_{\hat{P}_{n},f}(0)]^{-1}\phi^{\prime}_{\hat{P}_{n},f}(z;0)-[G^{\prime\prime}_{P_{0},f}(0)]^{-1}\phi^{\prime}_{P_{0},f}(z;0)\right\}^{2}dP_{0}\right]^{1/2}.

We have by assumption that

\displaystyle\left[\int\left\{[G^{\prime\prime}_{\hat{P}_{n},f}(0)]^{-1}\phi^{\prime}_{\hat{P}_{n},f}(z;0)-[G^{\prime\prime}_{P_{0},f}(0)]^{-1}\phi^{\prime}_{P_{0},f}(z;0)\right\}^{2}dP_{0}\right]^{1/2}=o_{P}(1).

Additionally, Assumption C2 and Lemma 2.10.4 of van der Vaart and Wellner (1996) imply that the class \left\{\left\{[G^{\prime\prime}_{P,f}(0)]^{-1}\phi^{\prime}_{P,f}(\cdot;0)-[G^{\prime\prime}_{P_{0},f}(0)]^{-1}\phi^{\prime}_{P_{0},f}(\cdot;0)\right\}^{2}:f\in\mathcal{F},\|Q_{P}-Q_{P_{0}}\|_{\mathcal{Q}}\leq\delta_{2}\right\} is P_{0}-Glivenko-Cantelli with probability tending to one. Therefore,

\displaystyle\Bigg{[}\sup_{f\in\mathcal{F}}\frac{1}{n}\sum_{i=1}^{n}\left\{[G^{\prime\prime}_{\hat{P}_{n},f}(0)]^{-1}\phi^{\prime}_{\hat{P}_{n},f}(Z_{i};0)-[G^{\prime\prime}_{P_{0},f}(0)]^{-1}\phi^{\prime}_{P_{0},f}(Z_{i};0)\right\}^{2}-
{[GP^n,f′′(0)]1ϕP^n,f(z;0)[GP0,f′′(0)]1ϕP0,f(z;0)}2dP0]1/2=oP(1).\displaystyle\int\left\{[G^{\prime\prime}_{\hat{P}_{n},f}(0)]^{-1}\phi^{\prime}_{\hat{P}_{n},f}(z;0)-[G^{\prime\prime}_{P_{0},f}(0)]^{-1}\phi^{\prime}_{P_{0},f}(z;0)\right\}^{2}dP_{0}\Bigg{]}^{1/2}=o_{P}(1).

Thus, \left\{T_{n}^{\xi}\right\}^{1/2} is bounded above by an O_{P}(1) quantity that does not depend on \xi. This allows us to write

Eξ[n1/2Tnξ]OP(1)Eξ[{nTnξ}1/2].\displaystyle E_{\xi}\left[n^{1/2}T_{n}^{\xi}\right]\leq O_{P}(1)E_{\xi}\left[\left\{nT_{n}^{\xi}\right\}^{1/2}\right].

That Eξ[{nTnξ}1/2]=OP(1)E_{\xi}\left[\left\{nT_{n}^{\xi}\right\}^{1/2}\right]=O_{P}(1) follows from the argument presented in Case 1. This completes the proof.

Proof of Lemma 1

Suppose a given distribution PP in \mathcal{M} has density pp with respect to a dominating measure ν\nu, and let η:𝒵\eta:\mathcal{Z}\to\mathbb{R} be a fixed function that has mean zero and finite variance under PP. Let PϵP_{\epsilon} be a one-dimensional parametric sub-model for PP indexed by the parameter ϵ\epsilon, which satisfies the following:

1. The sub-model passes through P at \epsilon=0; that is, P_{\epsilon}=P at \epsilon=0.

2. The density of the parametric sub-model is given by p_{\epsilon}, and the score function at \epsilon=0 is given by \eta. That is,

\displaystyle\frac{d}{d\epsilon}\log p_{\epsilon}(z)\bigg{|}_{\epsilon=0}=\eta(z).

We refer to ddϵGPϵ,f(β)|ϵ=0\frac{d}{d\epsilon}G_{P_{\epsilon},f}(\beta)|_{\epsilon=0} as the pathwise derivative of GPϵ,f(β)G_{P_{\epsilon},f}(\beta). The nonparametric efficient influence function ϕP,f(;β)\phi_{P,f}(\cdot;\beta) is the unique function that satisfies the following two properties:

1. For every \eta, \frac{d}{d\epsilon}G_{P_{\epsilon},f}(\beta)\big{|}_{\epsilon=0}=\int\phi_{P,f}(z;\beta)\eta(z)p(z)d\nu(z).

2. \phi_{P,f}(Z;\beta) has mean zero under P. That is, \int\phi_{P,f}(z;\beta)dP(z)=0.

We can therefore find the efficient influence function by calculating the pathwise derivative.

Let PϵP_{\epsilon} have density

pϵ=p(1+ϵη).\displaystyle p_{\epsilon}=p(1+\epsilon\eta).

Any distribution in a small neighborhood of P can be approximated using a sub-model of this form.

The goodness-of-fit under PϵP_{\epsilon} is given by

GPϵ,f(β)={yμPϵ,Y(w)βf(w,x)}2p(z){1+ϵη(z)}𝑑ν(z).\displaystyle G_{P_{\epsilon},f}(\beta)=\int\left\{y-\mu_{P_{\epsilon},Y}(w)-\beta f(w,x)\right\}^{2}p(z)\{1+\epsilon\eta(z)\}d\nu(z).

Through a simple calculation, it can be shown that

ddϵμPϵ,Y(w)|ϵ=0={y1μY,P(w)}η(w,x1,y1)p(x1,y1|w)ν(dx1,dy1),\displaystyle\frac{d}{d\epsilon}\mu_{P_{\epsilon},Y}(w)\bigg{|}_{\epsilon=0}=\int\{y_{1}-\mu_{Y,P}(w)\}\eta(w,x_{1},y_{1})p(x_{1},y_{1}|w)\nu(dx_{1},dy_{1}),

where p(\cdot|w) denotes the conditional density of (X,Y) given that W=w, under P.
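For completeness, a brief sketch of this calculation, using only the sub-model p_{\epsilon}=p(1+\epsilon\eta) introduced above (and writing p(w,x_{1},y_{1}) and p(w) for the joint and marginal densities under P), is as follows:

\displaystyle\frac{d}{d\epsilon}\mu_{P_{\epsilon},Y}(w)\bigg{|}_{\epsilon=0}=\frac{d}{d\epsilon}\int y_{1}\frac{p(w,x_{1},y_{1})\{1+\epsilon\eta(w,x_{1},y_{1})\}}{p(w)\{1+\epsilon E_{P}[\eta(Z)|W=w]\}}\nu(dx_{1},dy_{1})\bigg{|}_{\epsilon=0}
\displaystyle=\int y_{1}\left\{\eta(w,x_{1},y_{1})-E_{P}[\eta(Z)|W=w]\right\}p(x_{1},y_{1}|w)\nu(dx_{1},dy_{1})
\displaystyle=\int\{y_{1}-\mu_{Y,P}(w)\}\eta(w,x_{1},y_{1})p(x_{1},y_{1}|w)\nu(dx_{1},dy_{1}),

where the last equality follows because \int\eta(w,x_{1},y_{1})p(x_{1},y_{1}|w)\nu(dx_{1},dy_{1})=E_{P}[\eta(Z)|W=w] and \int y_{1}p(x_{1},y_{1}|w)\nu(dx_{1},dy_{1})=\mu_{Y,P}(w). We now have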

ddϵGPϵ(β)|ϵ=0=\displaystyle\frac{d}{d\epsilon}G_{P_{\epsilon}}(\beta)\bigg{|}_{\epsilon=0}= {yμP,Y(w)βf(w,x)}2p(z)η(z)𝑑ν\displaystyle\int\left\{y-\mu_{P,Y}(w)-\beta f(w,x)\right\}^{2}p(z)\eta(z)d\nu-
2{yμP,Y(w)βf(w,x)}{{y1μP,Y(w)}η(w,x1,y1)p(x1,y1|w)ν(dx1,dy1)}p(z)𝑑ν(z),\displaystyle 2\int\{y-\mu_{P,Y}(w)-\beta f(w,x)\}\left\{\int\{y_{1}-\mu_{P,Y}(w)\}\eta(w,x_{1},y_{1})p(x_{1},y_{1}|w)\nu(dx_{1},dy_{1})\right\}p(z)d\nu(z),
=\displaystyle= {yμP,Y(w)βf(w,x)}2p(z)η(z)𝑑ν+\displaystyle\int\left\{y-\mu_{P,Y}(w)-\beta f(w,x)\right\}^{2}p(z)\eta(z)d\nu+
2[βf(w,x1)p(x1,y1|w)ν(dx1,dy1)]{yμP,Y(w)}η(w,x,y)p(z)𝑑ν\displaystyle 2\int\left[\int\beta f(w,x_{1})p(x_{1},y_{1}|w)\nu(dx_{1},dy_{1})\right]\{y-\mu_{P,Y}(w)\}\eta(w,x,y)p(z)d\nu
=\displaystyle= \int\left[\left\{y-\mu_{P,Y}(w)-\beta f(w,x)\right\}^{2}+2\beta\mu_{P,f}(w)\left\{y-\mu_{P,Y}(w)\right\}\right]\eta(w,x,y)p(z)d\nu,

where the second equality follows from an application of the law of total expectation to the second summand. The “non-mean-centered” efficient influence function is thus given by

z=(w,x,y){yμY,P(w)βf(w,x)}2+2βμf,P(w){yμY,P(w)}.\displaystyle z=(w,x,y)\mapsto\left\{y-\mu_{Y,P}(w)-\beta f(w,x)\right\}^{2}+2\beta\mu_{f,P}(w)\left\{y-\mu_{Y,P}(w)\right\}.

The proof is completed by centering the above function at its mean.

Proof of Lemma 2

We write the estimation error for the one-step estimator as

G~n,f(β)G0,f(β)=1ni=1nϕP0,f(Zi;β)+Rn,fi(β)+Rn,fii(β),\displaystyle\tilde{G}_{n,f}(\beta)-G_{0,f}(\beta)=\frac{1}{n}\sum_{i=1}^{n}\phi_{P_{0},f}(Z_{i};\beta)+R^{\mathrm{i}}_{n,f}(\beta)+R^{\mathrm{ii}}_{n,f}(\beta),

where the remainder terms are

\displaystyle R_{n,f}^{\mathrm{i}}(\beta)=\frac{1}{n}\sum_{i=1}^{n}\left\{\phi_{n,f}(Z_{i};\beta)-\phi_{P_{0},f}(Z_{i};\beta)\right\}-\int\left\{\phi_{n,f}(z;\beta)-\phi_{P_{0},f}(z;\beta)\right\}dP_{0}(z),
Rn,fii(β)=ϕn,f(z;β)𝑑P0(z)+{Gn,f(β)GP0,f(β)}.\displaystyle R_{n,f}^{\mathrm{ii}}(\beta)=\int\phi_{n,f}(z;\beta)dP_{0}(z)+\left\{G_{n,f}(\beta)-G_{P_{0},f}(\beta)\right\}.

Following our discussion in Section 4.1, it suffices to argue each of the following:

supf,β|Rn,fi(β)|=oP(n1/2),supf|ddβ{Rn,fi(β)}β=0|=oP(n1/2),supf|d2dβ2{Rn,fi(β)}β=0|=oP(1),\displaystyle\sup_{f\in\mathcal{F},\beta\in\mathcal{B}}|R^{\mathrm{i}}_{n,f}(\beta)|=o_{P}(n^{-1/2}),\quad\sup_{f\in\mathcal{F}}\left|\frac{d}{d\beta}\left\{R^{\mathrm{i}}_{n,f}(\beta)\right\}_{\beta=0}\right|=o_{P}(n^{-1/2}),\quad\sup_{f\in\mathcal{F}}\left|\frac{d^{2}}{d\beta^{2}}\left\{R^{\mathrm{i}}_{n,f}(\beta)\right\}_{\beta=0}\right|=o_{P}(1),
supf,β|Rn,fii(β)|=oP(n1/2),supf|ddβ{Rn,fii(β)}β=0|=oP(n1/2),supf|d2dβ2{Rn,fii(β)}β=0|=oP(1).\displaystyle\sup_{f\in\mathcal{F},\beta\in\mathcal{B}}|R^{\mathrm{ii}}_{n,f}(\beta)|=o_{P}(n^{-1/2}),\quad\sup_{f\in\mathcal{F}}\left|\frac{d}{d\beta}\left\{R^{\mathrm{ii}}_{n,f}(\beta)\right\}_{\beta=0}\right|=o_{P}(n^{-1/2}),\quad\sup_{f\in\mathcal{F}}\left|\frac{d^{2}}{d\beta^{2}}\left\{R^{\mathrm{ii}}_{n,f}(\beta)\right\}_{\beta=0}\right|=o_{P}(1).

First, we argue that \sup_{f\in\mathcal{F},\beta\in\mathcal{B}}|R^{\mathrm{i}}_{n,f}(\beta)|=o_{P}(n^{-1/2}). It is shown in the proof of Lemma 19.26 of van der Vaart (2000) that this convergence rate is achieved when

supf,β{ϕn,f(z;β)ϕP0,f(z;β)}2𝑑P0(z)=oP(1),\displaystyle\sup_{f\in\mathcal{F},\beta\in\mathcal{B}}\int\left\{\phi_{n,f}(z;\beta)-\phi_{P_{0},f}(z;\beta)\right\}^{2}dP_{0}(z)=o_{P}(1),

and when the class \{\phi_{n,f}(\cdot;\beta)-\phi_{P_{0},f}(\cdot;\beta):f\in\mathcal{F},\beta\in\mathcal{B}\} is contained within a P_{0}-Donsker class with probability tending to one. That the influence functions are uniformly consistent follows as a consequence of the rate conditions on the nuisance parameter estimators, and the Donsker condition holds by assumption. Similarly, that \sup_{f\in\mathcal{F}}|\frac{d}{d\beta}\{R^{\mathrm{i}}_{n,f}(\beta)\}_{\beta=0}|=o_{P}(n^{-1/2}) and \sup_{f\in\mathcal{F}}|\frac{d^{2}}{d\beta^{2}}\{R^{\mathrm{i}}_{n,f}(\beta)\}_{\beta=0}|=o_{P}(1) follow from consistency of the nuisance estimators and the assumed complexity constraints.

Now, we argue that supf,β|Rn,fii(β)|=oP(n1/2)\sup_{f\in\mathcal{F},\beta\in\mathcal{B}}\left|R^{\mathrm{ii}}_{n,f}(\beta)\right|=o_{P}(n^{-1/2}). This remainder term has the exact representation

ϕn,f(z;β)𝑑P0(z)+{Gn,f(β)GP0,f(β)}=\displaystyle\int\phi_{n,f}(z;\beta)dP_{0}(z)+\left\{G_{n,f}(\beta)-G_{P_{0},f}(\beta)\right\}=
{yμn,Y(w)βf(w,x)}2+2β{yμn,Y(w)}μn,f(w){yμ0,Y(w)βf(w,x)}2dP0(z)=\displaystyle\int\left\{y-\mu_{n,Y}(w)-\beta f(w,x)\right\}^{2}+2\beta\left\{y-\mu_{n,Y}(w)\right\}\mu_{n,f}(w)-\left\{y-\mu_{0,Y}(w)-\beta f(w,x)\right\}^{2}dP_{0}(z)=
[{yμn,Y(w)}{yμP0,Y(w)}][{yμn,Y(w)}+{yμP0,Y(w)}]𝑑P0(z)\displaystyle\int\left[\left\{y-\mu_{n,Y}(w)\right\}-\left\{y-\mu_{P_{0},Y}(w)\right\}\right]\left[\left\{y-\mu_{n,Y}(w)\right\}+\left\{y-\mu_{P_{0},Y}(w)\right\}\right]dP_{0}(z)-
2β[{μP0,Y(w)μn,Y(w)}f(w,x){yμn,Y(w)}μn,f(w)]𝑑P0(z)=\displaystyle\int 2\beta\left[\left\{\mu_{P_{0},Y}(w)-\mu_{n,Y}(w)\right\}f(w,x)-\left\{y-\mu_{n,Y}(w)\right\}\mu_{n,f}(w)\right]dP_{0}(z)=
\displaystyle\int\left\{\mu_{P_{0},Y}(w)-\mu_{n,Y}(w)\right\}^{2}dP_{0}(z)-2\beta\int\left\{\mu_{P_{0},Y}(w)-\mu_{n,Y}(w)\right\}\left\{\mu_{f,P_{0}}(w)-\mu_{n,f}(w)\right\}dP_{0}(z).

It can be seen that \sup_{f\in\mathcal{F},\beta\in\mathcal{B}}\left|R^{\mathrm{ii}}_{n,f}(\beta)\right|=o_{P}(n^{-1/2}) when the rate conditions on the nuisance estimators are met. The derivative of the second remainder term is

\displaystyle\frac{d}{d\beta}R^{\mathrm{ii}}_{n,f}(\beta)\bigg{|}_{\beta=0}=-2\int\left\{\mu_{P_{0},Y}(w)-\mu_{n,Y}(w)\right\}\left\{\mu_{f,P_{0}}(w)-\mu_{n,f}(w)\right\}dP_{0}(z),

and so \sup_{f\in\mathcal{F}}|\frac{d}{d\beta}\{R^{\mathrm{ii}}_{n,f}(\beta)\}_{\beta=0}|=o_{P}(n^{-1/2}) under the rate conditions as well. Finally, it is easily seen that \sup_{f\in\mathcal{F}}|\frac{d^{2}}{d\beta^{2}}\{R^{\mathrm{ii}}_{n,f}(\beta)\}_{\beta=0}|=0.