Nonparametric inference on non-negative dissimilarity measures at the boundary of the parameter space
Abstract
It is often of interest to assess whether a function-valued statistical parameter, such as a density function or a mean regression function, is equal to any function in a class of candidate null parameters. This can be framed as a statistical inference problem where the target estimand is a scalar measure of dissimilarity between the true function-valued parameter and the closest function among all candidate null values. These estimands are typically defined to be zero when the null holds and positive otherwise. While there is well-established theory and methodology for performing efficient inference when one assumes a parametric model for the function-valued parameter, methods for inference in the nonparametric setting are limited. When the null holds, and the target estimand resides at the boundary of the parameter space, existing nonparametric estimators either achieve a non-standard limiting distribution or a sub-optimal convergence rate, making inference challenging. In this work, we propose a strategy for constructing nonparametric estimators with improved asymptotic performance. Notably, our estimators converge at the parametric rate at the boundary of the parameter space and also achieve a tractable null limiting distribution. As illustrations, we discuss how this framework can be applied to perform inference in nonparametric regression problems, and also to perform nonparametric assessment of stochastic dependence.
1 Introduction
Suppose we are interested in studying a function-valued parameter of an unknown probability distribution, such as a conditional mean function or a density function. For such parameters, one can typically define a goodness-of-fit functional, which measures the closeness of any given candidate function to the true population parameter. The goodness-of-fit achieves its minimum when evaluated at the true population parameter. It is often of scientific interest to compare multiple models for the function-valued parameter. In particular, one may seek to determine whether the minimizer of the goodness-of-fit over a, possibly large, function class is equal to the minimizer over a smaller sub-class. The difference between the minima over reduced and full function classes can serve as a natural measure of dissimilarity for comparing the corresponding minimizers. This dissimilarity measure is non-negative, with values of zero corresponding to no dissimilarity. The main focus of this work is on estimation of such dissimilarity measures and testing the null hypothesis of no dissimilarity, or equality of goodness-of-fit.
As an example, suppose that an investigator would like to determine whether an exposure is conditionally associated with an outcome, given a set of confounding variables. This can be formulated as a statistical inference problem, where the objective is to determine whether the conditional mean of the outcome, given both the exposure and confounders, is equivalent to the conditional mean of the outcome, given only the confounders. One can specify a full model for the conditional mean as a class of functions that depend on both the exposure and confounders, while the reduced model is the subclass of functions that may depend on the confounders but do not depend on the exposure. Several goodness-of-fit measures, such as the expected squared error loss, can be used to assess how close a candidate parameter is to the conditional mean given the exposure and confounders. One can then test for conditional independence by assessing whether the best approximation of the conditional mean in the full model class improves upon the best approximation in the reduced class, in terms of the goodness-of-fit.
When the function-valued parameter of interest is modeled using a finite-dimensional function class, there are standard procedures available for performing inference. For instance, the classical likelihood ratio test is widely-used to compare classes of regression functions when the conditional distribution of the outcome given the predictor and covariates is assumed to belong to a parametric family of probability distributions (Wilks, 1938). There also exist approaches for efficient inference in settings where the reduced and full function classes are both infinite-dimensional, but the difference between the two classes is finite dimensional. For instance, in regression problems of the form described in the example above, it is common to assume that the conditional mean of the outcome given the exposure and covariates follows a partially linear model. In a partially linear model, the full conditional mean can be expressed as the sum of an unknown function of the confounders, which is only assumed to belong to a large infinite-dimensional function class, plus a linear function of the exposure of interest. One can therefore assess for conditional dependence by determining whether the linear function has zero slope, which is a well-studied inference problem (Chernozhukov et al., 2018; Bhattacharya and Zhao, 1997; Robinson, 1988; Donald and Newey, 1994).
In this work, we focus on the more challenging setting in which the difference between the full and reduced function classes is infinite-dimensional. Recently, several investigators have examined whether modern methods for estimation of smooth functionals of unknown probability distributions in a nonparametric model, such as targeted-minimum loss-based estimation (van der Laan and Rose, 2011, 2018) and one-step estimation (Pfanzagl, 1982), can be applied to attain inference on non-negative dissimilarity measures (Williamson et al., 2021a, b; Hines et al., 2022; Kennedy et al., 2023; Kandasamy et al., 2015). For these estimation strategies to be viable, the target estimand – in this case the non-negative dissimilarity measure – must be a pathwise differentiable functional of the underlying probability distribution with non-zero pathwise derivative. In essence, this means that the target estimand makes smooth but non-negligible changes in response to infinitesimally small perturbations around the unknown probability distribution. While pathwise differentiability of the target can be established in many examples, the pathwise derivative is typically zero when the null hypothesis of no dissimilarity holds. That the derivative is zero can be seen as a consequence of the fact that, under the null, the target estimand achieves its minimum at the true unknown distribution. In this setting, conventional estimation strategies do not achieve parametric-rate convergence or attain tractable limiting distributions, making hypothesis testing challenging.
When the target estimand satisfies additional smoothness assumptions, it can be possible to construct estimators with improved asymptotic behavior by utilizing higher-order pathwise derivatives (Pfanzagl, 1985; Robins et al., 2008; van der Vaart, 2014; Carone et al., 2018). While this approach has been successful in some examples (Luedtke et al., 2019), it is seemingly rare that for a given statistical functional, higher-order pathwise derivatives exist, so this strategy does not appear to be broadly applicable.
In this work, we propose a general method for estimation and inference on non-negative dissimilarity measures. Our proposal builds upon recent developments in the construction of omnibus tests for equality of function-valued parameters to fixed null parameters (Hudson et al., 2021; Westling, 2021). The key idea is that one can perform inference on a function-valued parameter by estimating a large collection of simpler one-dimensional estimands that act as an effective summary thereof. Here, we show that in many instances, non-negative dissimilarity measures can be represented as the largest value in a collection of simple one-dimensional estimands. In such cases, we can estimate non-negative dissimilarity measures using the maximum of suitably well-behaved estimators for these scalar quantities. Our main results show that when efficient estimators for the simple estimands are used, the resulting estimator for the non-negative dissimilarity measure achieves parametric-rate convergence under the null and also attains a tractable limiting distribution. This makes it possible to construct well-calibrated asymptotic tests of the null. We also show that when the alternative holds, our estimator is asymptotically efficient. To the best of our knowledge, our work is the first to provide a general theoretical basis for recovering parametric-rate inference on non-negative dissimilarity measures in a nonparametric model.
The remainder of the paper is organized as follows. In Section 2, we formally introduce the class of non-negative dissimilarity measures of interest, and we describe some motivating examples. In Section 3, we review an existing approach for inference based on plug-in estimation and provide a discussion of some of its limitations. In Section 4, we propose a new estimator for non-negative dissimilarity measures, and we describe its theoretical properties. In Section 5, we present multiplier bootstrap methods for testing the null of no dissimilarity, and for constructing confidence intervals. In Section 6 we discuss implementation and practical concerns. In Section 7, we illustrate how our methodology can be used to perform inference in a nonparametric regression model. We present results from our simulation study in Section 8, and we conclude with a discussion in Section 9.
2 Preliminaries
2.1 Data structure and target estimand
Let be i.i.d. random vectors, generated from an unknown probability distribution . We make few assumptions about and only require that it belongs to a flexible nonparametric model , which is essentially unrestricted, aside from mild regularity conditions. For a given probability distribution in , let be a function-valued summary of interest with domain for a positive integer and range . We denote by the evaluation of this summary at .
Suppose that is known to belong to a, possibly infinite-dimensional, function class . For a given distribution , we define a real-valued functional that satisfies
(1)
The functional measures the goodness-of-fit of any function – larger values of indicate that and are farther away from one another, in a sense. Throughout this paper, we use the shorthand notation to denote the value of the goodness-of-fit measure at .
Let be a subclass of , and let be a function that satisfies
In essence, is the closest function to among all functions in the subclass . We define as our target parameter the difference between the goodness-of-fit of and ,
(2)
and we again use the shorthand notation . Throughout this manuscript, we refer to as the improvement in fit because it represents the improvement in the goodness-of-fit attained by using the full function class instead of the reduced class.
Because is contained within , it can be seen that is a non-negative statistical functional, and is only equal to zero when provides no improvement in fit compared with . In many applications, a problem of central importance is to determine whether is inferior to in terms of goodness-of-fit. Letting , we are interested in performing a test of the null hypothesis
(3)
Additionally, because statistical functionals that have the representation in (2) have scientifically meaningful interpretations in some contexts, estimation of and confidence interval construction are also of practical interest. Our paper provides a general framework for estimation, testing, and confidence interval construction for statistical functionals of this form.
2.2 Examples
In what follows, we introduce some working examples. As a first example, we discuss statistical inference in nonparametric regression models, and second, we discuss a nonparametric approach for assessing dependence between a pair of random variables. We then describe a simple way to define a goodness-of-fit measure for any function-valued parameter.
Example 1: Inference in a Nonparametric Regression Model
Let , where is a real-valued outcome variable, and and are vectors of predictor variables with dimensions and , respectively.
We define as a (possibly large) class of prediction functions with domain and range .
Each function takes as input a realization of the predictor vector and returns as output a predicted outcome.
We are interested in studying the conditional mean of the outcome given the predictors, defined as . It is well-known that the conditional mean can be characterized as the minimizer of the expected squared error loss over , if is sufficiently large. That is, defining the goodness-of-fit measure
the conditional mean satisfies .
Consider now the set of candidate prediction functions that do not depend on , which we write as
When is large, any minimizer of the expected squared error loss over is almost everywhere equal to the conditional mean of given . We are often interested in determining whether is an important set of predictors, in the sense that it needs to be included in a prediction function in order for the optimal squared error loss to be achieved. If is not important in this sense, the conditional mean of given and does not depend on , and the difference in the expected squared error loss is zero. Otherwise, is positive. Thus, assessing variable importance can be framed as a statistical inference problem of the type described in Section 2.
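To fix ideas, the following display sketches this example in notation of our own choosing (writing Y for the outcome, X for the exposures of interest, W for the remaining predictors, F for the full class, and F0 for the reduced class); it is an assumed reconstruction rather than the paper's exact display.

\[
\Gamma_P(\theta) = E_P\big[\{Y - \theta(X, W)\}^2\big], \qquad
\theta_P = \operatorname*{argmin}_{\theta \in \mathcal{F}} \Gamma_P(\theta) = E_P[Y \mid X, W],
\]
\[
\Psi(P) = \min_{\theta \in \mathcal{F}_0} \Gamma_P(\theta) - \min_{\theta \in \mathcal{F}} \Gamma_P(\theta) \ge 0,
\qquad \mathcal{F}_0 = \{\theta \in \mathcal{F} : \theta \text{ does not depend on } x\},
\]

so that the improvement in fit is zero exactly when no function of both X and W predicts Y better, in mean squared error, than the best function of W alone.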
Many recent works have studied inference on variable importance estimands of a similar form to that we describe above (see, e.g., Williamson et al., 2021a, b; Verdinelli and Wasserman, 2021; Zhang and Janson, 2020). These works all encounter difficulties with constructing estimators for their original target estimand that achieve parametric rate convergence under the null. To the best of our knowledge, there is currently no solution available to this problem.
Example 2: Nonparametric Assessment of Stochastic Dependence
Let , where and are real-valued random variables, and let denote the log of the joint density of under with respect to some dominating measure .
Our objective here is to determine whether and are dependent.
If and are independent, by basic laws of probability, the joint density function can be expressed as the product of the marginal density functions, i.e.,
for all . We can therefore assess dependence between and by defining a goodness-of-fit measure for the joint density function, and determining whether the goodness-of-fit of the true joint density is lower than the goodness-of-fit of the product of the marginal densities.
Let be a collection of candidate values for the log density function, and assume that is large enough to contain , the log density under . The density function can be represented as a minimizer of the expected cross-entropy loss. Therefore, defining the goodness-of-fit measure
the joint density satisfies (1).
We now define as the class of candidate log density functions for which the joint density can be expressed as the product of two marginal density functions – that is,
Any minimizer of over is almost everywhere equal to the product of the marginal densities of and under . Therefore, is zero if and are independent, and is otherwise positive. One can assess dependence between and by performing inference on , so similar to the previous example, this problem falls within our framework.
The measure of dependence we have defined here is commonly referred to as the mutual information and is a widely studied measure of stochastic dependence (see, e.g., Paninski, 2003; Steuer et al., 2002). We are not aware of an existing nonparametric estimator for the mutual information that achieves parametric-rate convergence under the null of independence. This appears to be a longstanding open problem.
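In notation of our own choosing (with joint density p_{Y,Z} and marginals p_Y and p_Z for the two random variables), the improvement in fit in this example reduces to the familiar mutual information; the display below is an assumed reconstruction rather than the paper's.

\[
\Psi(P) = \min_{\theta \in \mathcal{F}_0} \Gamma_P(\theta) - \min_{\theta \in \mathcal{F}} \Gamma_P(\theta)
= E_P\!\left[\log \frac{p_{Y,Z}(Y, Z)}{p_Y(Y)\, p_Z(Z)}\right] = I(Y; Z) \ge 0,
\]

with equality to zero if and only if the two variables are independent.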
Example 3: Generic Distance
Suppose one is interested in assessing whether a given function-valued parameter is equal to a fixed and known function . For a measure on , one can define as a goodness-of-fit measure an integrated squared difference between and :
Because is non-negative, and , it is easy to see that is minimized by .
One might wish to perform inference on the quantity
Clearly, is equal to zero only when is equal to almost everywhere . Estimands of this form can be of interest when one wishes to construct an omnibus test of the hypothesis that . The framework we develop in this paper can be applied in this setting as well, and so, methodology for inference on the improvement in fit can be seen as generally useful for performing inference on function-valued parameters.
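In assumed notation (with the true function-valued parameter written as theta_P, the fixed null function as f_0, and the dominating measure as mu; these symbols are ours, not necessarily the paper's), the construction in this example reads:

\[
\Gamma_P(\theta) = \int \{\theta(x) - \theta_P(x)\}^2 \, d\mu(x), \qquad
\Psi(P) = \Gamma_P(f_0) - \Gamma_P(\theta_P) = \int \{f_0(x) - \theta_P(x)\}^2 \, d\mu(x),
\]

which is zero exactly when the true parameter equals the null function mu-almost everywhere, so a test of no improvement in fit serves as an omnibus test of this equality.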
3 Plug-in estimation of the improvement in fit
We now describe an approach for nonparametric inference on based on plug-in estimation, and we discuss the shortcomings of this approach. The methodology we describe below and its limitations are discussed extensively by Williamson et al. (2021b) in the context of nonparametric regression, though their theoretical and methodological results are more broadly applicable.
Suppose that for any , is a pathwise differentiable functional of , meaning that changes smoothly with respect to small changes in (Bickel et al., 1998). When is pathwise differentiable, it is generally possible to construct an estimator that is asymptotically linear in the sense that
(4)
where has mean zero and finite variance under , and is an asymptotically negligible remainder term. The function determines the first order asymptotic behavior of and is commonly referred to as the influence function of . Because is asymptotically linear, it is -rate consistent and asymptotically Gaussian by the central limit theorem. Conventional strategies for constructing asymptotically linear estimators include one-step estimation (Pfanzagl, 1982) and targeted minimum loss-based estimation (van der Laan and Rose, 2011, 2018).
Given an asymptotically linear estimator for , we can obtain estimators and for and by minimizing over and , respectively. That is, we take
We can then obtain the following plug-in estimator for :
It can be shown that, under mild regularity conditions, the plug-in estimator is asymptotically linear with influence function (Williamson et al., 2021b). That is, the plug-in estimator satisfies
(5)
From an initial inspection, it would appear that there is no loss in efficiency resulting from estimating and . That is, if and were known, then the estimator would have the same asymptotically linear representation as .
Under the null, and have the same influence function, and the leading term in (5) vanishes. Therefore, the convergence rate and limiting distribution of the plug-in estimator are determined by the higher-order remainder term. When is a finite-dimensional model, it is often possible to establish that, under the null, the remainder term is and attains a tractable limiting distribution. Conversely, in the infinite-dimensional setting, the remainder typically converges at a slower-than- rate, and its asymptotic distribution is difficult to characterize. This makes it challenging to approximate the null sampling distribution of and hence challenging to construct a hypothesis test for no improvement in fit. Moreover, confidence intervals based on a normal approximation to the sampling distribution can fail to achieve the nominal coverage rate when .
In order to develop an estimator for that has better asymptotic properties than the plug-in, it is helpful to further investigate the source of the plug-in estimator’s poor behavior. We can first recognize that, as is a measure of an improvement in fit, estimating involves performing a search away from to identify whether any candidate function in the difference between the full and reduced function classes, , provides a better fit than .
Suppose now that can be expressed as a collection of, potentially many, one-dimensional sub-models. Let be a fixed function from to that satisfies for any . For a scalar and a fixed function , we define as the one-dimensional sub-model
(6)
We have constructed our sub-model so that it passes through the null best fit at , i.e.,
(7)
We can therefore interpret as the path along which approaches as tends to zero. We assume that there exists a function class and a symmetric interval such that
We will see that using this representation for our model facilitates making comparisons between any function in and the best null fit.
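One concrete possibility, written in notation of our own choosing (and so only an assumed instance of the general construction), is an additive perturbation of the reduced-model best fit:

\[
\theta_{h,\delta} = \theta_0^{\mathrm{R}} + \delta\, h, \qquad h \in \mathcal{H},\ \delta \in [-\delta_0, \delta_0],
\]

so that the sub-model at delta equal to zero returns the null best fit for every direction h, as required by (7), and the full class is represented as the union of these one-dimensional sub-models over h in the class and delta in the symmetric interval.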
We now define as the goodness-of-fit of , i.e.,
(8)
and similarly as above, we use the shorthand notation . We assume that is a smooth convex function, and we denote the first and second derivatives of in by
(9)
We define as the minimizer of the goodness-of-fit measure along the parametric sub-model over the interval :
Due to the convexity of , for large enough , is the unique solver of . Under this regime, satisfies , and we can write as
We can see that, in view of condition (7), only when .
Let and be the estimators for and described earlier in this section, and let be the plug-in estimator for the sub-model. We define the plug-in estimator for as
and we write its first and second derivatives as
We define the plug-in estimator for as the minimizer of over . For large , satisfies for all in , and the plug-in estimator for satisfies
The plug-in estimator for can therefore be expressed as
Using this representation for the plug-in estimator makes it easier for us to carefully study its asymptotic behavior in the setting where . By performing a second order Taylor expansion for around for every , we can write the plug-in as
where is a higher order remainder term that should approach zero at a faster rate than the leading terms. Because for all , the first term in this expansion vanishes, leaving us with
If is consistent for uniformly in , then by Slutsky’s theorem, one can replace the random quantities with the fixed values in the above display. It would appear then that, under the null, the limiting distribution of is determined by the behavior of the stochastic process .
If it were possible to characterize the joint limiting distribution of under the null where for all , the limiting distribution of could be characterized using a straightforward application of the continuous mapping theorem. Typically, is a pathwise differentiable parameter for each , making it possible to construct estimators thereof that converge at a -rate and achieve a Gaussian limiting distribution. Ideally, one would be able to establish that the standardized process converges weakly to a Gaussian process as long as the collection of paths is not overly complex. However, in many settings, this property is not satisfied by the plug-in estimator. One can view the plug-in estimator for as a functional of the estimator for . As stated above, estimating requires us to estimate the nuisance parameter . In settings where is a large nonparametric function class, our estimator for will necessarily converge slower than the parametric rate of and may retain non-negligible asymptotic bias. Consequently, generates bias for , which leads to retaining non-negligible bias as well. Indeed, will typically converge slower than the parametric rate of , causing to converge at a sub-optimal rate and achieve a non-standard limiting distribution.
To summarize, estimating requires one to perform a search away from in order to attempt to identify a candidate function in that provides an improvement in the goodness-of-fit. In the regime we describe above, performing this search is equivalent to finding the best fit along each parametric sub-model that passes through the null, and subsequently taking the best fit among all of the sub-models. From the above argument, we can see that the plug-in estimator has poor asymptotic properties because the plug-in estimator for the best fit along the parametric sub-models can be sub-optimal when is large. Thus, the key to obtaining an estimator with improved asymptotic properties is to efficiently estimate the best fit along each of the sub-models that comprise .
4 Bias-corrected estimation of the improvement in fit
From the discussion in Section 3, it would seem that if one had an efficient estimator for the goodness-of-fit along each parametric sub-model, and hence an efficient estimator for , one could obtain an estimator for the improvement in fit that has better asymptotic properties than the plug-in. In what follows, we describe a general strategy for constructing an estimator that has a tractable limiting distribution when is at the boundary of the parameter space. We show that our newly-proposed estimator enjoys the same -rate convergence that is typically attained in parametric models.
4.1 Uniform inference along the parametric sub-models
Our proposal requires us to construct an estimator for that enables us to perform inference uniformly along the collection of parametric sub-models. In this sub-section, we first outline a set of sufficient conditions under which an estimator has asymptotic properties that facilitate uniform inference. We then describe a strategy for constructing an estimator that satisfies these conditions.
We begin by providing assumptions upon which our first main theoretical result relies. We consider two types of assumptions. The first type (A) is a set of deterministic conditions on the goodness-of-fit functional and the underlying probability distribution, whereas the second set of assumptions (B) is stochastic in nature and describes conditions that our estimator must satisfy.
- Assumption A1: For any and any , is pathwise differentiable in a nonparametric model, and its nonparametric efficient influence function is given by .
- Assumption A2: and are twice differentiable in for each in , and the derivatives are given by
- Assumption A3: There exist positive constants such that, for any , implies that
- Assumption A4: For each , . Additionally, is bounded away from zero in a neighborhood of , uniformly in . That is, is positive whenever is small.
- Assumption A5: Both the function classes and are -Donsker.
Assumption A1 requires that the goodness-of-fit is pathwise differentiable, which, as noted in Section 3, enables us to construct -consistent estimators. When is a pathwise differentiable estimand, its efficient influence function is guaranteed to exist, and knowledge of the efficient influence function is often needed for constructing efficient estimators and studying their asymptotic properties in nonparametric models. We note that because we assume for any (recall we assume (7) holds), the efficient influence functions and are also equal. Assumptions A2 and A3 state that the goodness-of-fit must be smooth and convex along each of the parametric sub-models. Assumption A4 requires that is large enough to contain the global optimizer of over , and that the goodness-of-fit satisfies some additional smoothness constraints in a neighborhood of the optimizer. Assumption A5 states that, while may be specified as a large nonparametric function class, it must satisfy some mild complexity constraints.
- Assumption B1: For any with we have that .
- Assumption B2: is an asymptotically linear estimator for in the sense that the remainder satisfies
- Assumption B3: The derivative of exists and is given by . Moreover, letting , we have
- Assumption B4: The second derivative of exists and is given by . Moreover, letting , we have
Assumption B1 states that the estimator for the goodness-of-fit along any parametric sub-model takes the same value at . In view of condition (7), all sub-models intersect and attain the same value for the goodness-of-fit at , so it is natural to assume that our estimator also has this property. Assumption B2 places a requirement that is an asymptotically linear estimator for , where the asymptotic linearity holds uniformly over . Assumption B3 states that is differentiable, and the derivative is an asymptotically linear estimator for , uniformly over . Finally, Assumption B4 requires that the second derivative of exists and converges in probability to the second derivative of , uniformly over .
For a given estimator of , let satisfy
The following theorem states that, under mild regularity conditions, is an asymptotically linear estimator for , and moreover the collection , when appropriately standardized, achieves a Gaussian limiting distribution.
Theorem 1.
Let denote the space of bounded functionals on , and let be a tight mean zero Gaussian process with covariance
If Assumptions A1-A4 hold, and if satisfies Assumptions B1-B4, then is asymptotically linear with influence function
Moreover, if A5 also holds, then the process converges weakly to as an element of , with respect to the supremum norm.
Theorem 1 can be viewed as a generalization of well-known results that show M-estimators are asymptotically linear in finite-dimensional models (see, e.g., Theorem 5.23 of van der Vaart, 2000). Our result on uniform asymptotic linearity in infinite-dimensional models can be proven using a fairly standard argument.
In what follows, we suggest some approaches for constructing an estimator that satisfies Assumptions B1-B4. We describe at a high level what types of conditions are needed for a given estimation strategy to be valid, though the specific requirements depend on the target estimand and vary from problem to problem. Later on, in Section 7, we demonstrate how to construct estimators and that satisfy Assumptions B1-B4 in an example.
Suppose that one has available an estimator for the underlying probability distribution . Typically, estimation of the entire probability distribution is not necessary, and one will only need to estimate nuisance components upon which and depend. We assume that and depend on only through a nuisance , which resides in a space endowed with norm . The true value of the nuisance component is given by , and the plug-in estimator for the nuisance is .
As a starting point, one might consider using as an estimator for . If belongs to the model , the plug-in estimator satisfies Assumption B1. This leaves Assumptions B2 through B4 to be verified. Suppose now that is itself pathwise differentiable and can therefore be estimated at an -rate. Then if is an asymptotically linear estimator for , one can argue that is also asymptotically linear by applying the delta method. Assumption B2 then holds, as long as the asymptotic linearity is preserved uniformly over . Asymptotic linearity of (Assumption B3) and consistency of (Assumption B4) can be established using a similar argument.
In many instances, the nuisance can include quantities such as density functions or conditional mean functions, which are not pathwise differentiable in a nonparametric model. In this case, it is not possible to construct an -rate consistent estimator for the nuisance. Obtaining an estimator for the nuisance usually involves making a bias-variance trade-off that may be sub-optimal for the objective of estimating the goodness-of-fit. When the nuisance estimator retains non-negligible bias, it is possible that the bias propagates, leading to being biased as well. As a consequence, may not be asymptotically linear, and we may require more sophisticated methods to construct an -consistent estimator.
One widely-used method for obtaining an asymptotically linear estimator when the initial estimator is biased is to perform a one-step bias correction (Pfanzagl, 1982). Consider the plug-in estimator for the efficient influence function . The empirical average of the estimator for the efficient influence function serves as a first-order correction for the bias of the initial estimator. By adding this empirical average to the initial estimator, one can obtain the so-called one-step estimator:
It can be easily seen that the one-step estimator satisfies Assumption B1. In what follows, we briefly discuss what arguments one will typically use to verify Assumptions B2-B4. While we do not provide a detailed discussion here, we refer readers to a recent review by Hines et al. (2022), which provides a more in-depth explanation.
The estimation error of the one-step estimator has the exact representation,
where we define and as
Asymptotic linearity of the one-step estimator follows if it can be established that and converge to zero in probability at an -rate. The first term is a difference-in-differences remainder that is asymptotically negligible when is consistent for and is contained within a -Donsker class (see Lemmas 19.24 and 19.26 of van der Vaart, 2000). The second term is a second-order remainder term, which can usually be bounded above by the squared norm of the difference between the nuisance estimator and its true value, . One can argue that if the nuisance estimator is -consistent with respect to , then . Even in a nonparametric model, there exist several approaches for constructing -rate consistent nuisance estimators when one makes only mild structural assumptions on , such as smoothness or monotonicity (see, e.g., van de Geer, 2000; Tsybakov, 2009). To verify that Assumptions B3 and B4 hold, one can perform a similar analysis to show that the first and second derivatives of the remainder terms and , with respect to , tend to zero at the requisite rate.
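Written out in notation of our own choosing (with the empirical measure denoted by P_n, the true efficient influence function by phi, and its plug-in estimate by phi-hat), the standard decomposition behind this argument is as follows; this is an assumed reconstruction, not the paper's display.

\[
\hat\Gamma_n(h,\delta) - \Gamma_0(h,\delta)
= (\mathbb{P}_n - P_0)\,\phi_{h,\delta}
+ \underbrace{(\mathbb{P}_n - P_0)\,(\hat\phi_{h,\delta} - \phi_{h,\delta})}_{\text{empirical process remainder}}
+ \underbrace{\Gamma_{\hat P_n}(h,\delta) - \Gamma_0(h,\delta) + P_0\,\hat\phi_{h,\delta}}_{\text{second-order remainder}},
\]

where the first term drives the asymptotically linear behavior, and asymptotic linearity follows once the two remainder terms are shown to vanish at the rate described above.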
While we focused on one-step estimation above because we find its simplicity appealing, other strategies for constructing bias-corrected estimators, such as targeted minimum loss-based estimation, could alternatively be used. These strategies are usually also viable under a similar set of regularity conditions.
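For concreteness, the sketch below shows the generic one-step update in R. The helper functions `plugin_fn` and `eif_fn` are hypothetical placeholders (not provided by the paper): the first returns the plug-in estimate of the goodness-of-fit along a sub-model, and the second evaluates the estimated efficient influence function at each observation.

```r
# Generic one-step bias correction (sketch; `plugin_fn` and `eif_fn` are
# hypothetical helpers supplied by the user).
one_step <- function(data, plugin_fn, eif_fn, h, delta) {
  plugin_est <- plugin_fn(data, h, delta)  # initial (possibly biased) plug-in estimate
  eif_vals   <- eif_fn(data, h, delta)     # estimated influence function at each observation
  plugin_est + mean(eif_vals)              # add the empirical mean as a first-order correction
}
```

In principle, the same template can be reused for the derivative estimators appearing in Assumptions B3 and B4, with the corresponding derivative influence functions substituted.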
4.2 Asymptotic properties of proposed estimator
We are at this point prepared to describe the bias-corrected estimator for and its asymptotic properties. As stated in Section 3, we estimate as
(10)
where is an estimator satisfying the conditions outlined in Section 4.1.
In this section we establish weak convergence of . We show that attains a tractable limiting distribution under mild regularity conditions, but the limiting distribution and convergence rate depend on the true value of . We study two cases. First, we consider the setting in which , and the null hypothesis of no improvement in fit (3) holds. Second, we study the case in which is a positive constant.
Case 1: The improvement in fit is zero ()
Suppose that the null of no improvement in fit holds. Recall from Section 2 that when , . Also, as discussed in Section 3, by performing a Taylor expansion for around , we have
(11)
for some satisfying . Under Assumption B4, we are able, in (11), to replace with . This and the fact that allow us to write
(12)
Thus, under the null, can be represented as the squared supremum of an empirical process, plus an asymptotically negligible remainder. By applying Theorem 1 in conjunction with Slutsky’s theorem and the continuous mapping theorem, we have that converges weakly to , where is the Gaussian process described in Theorem 1. The following theorem states this result formally.
Theorem 2.
Suppose that the null hypothesis of no improvement in fit (3) holds, and the Assumptions of Theorem 1 are all satisfied. Then converges weakly to .
We can apply Theorem 2 to obtain an approximation for the sampling distribution of under the null of zero improvement in fit, making it easy for us to perform a hypothesis test. In particular, Theorem 2 implies that a test which rejects the null when is larger than the quantile of the distribution of will achieve type-1 error control at the -level in the limit of large .
We used two key ingredients to construct an improvement in fit estimator with parametric-rate convergence under the null. First, we found it useful to represent the difference between the full and reduced models for the function-valued parameter of interest as the union of many one-dimensional parametric sub-models. We have deduced that, under the null, the asymptotic behavior of an improvement in fit estimator is determined in large part by the complexity of the collection of paths along which the estimated goodness-of-fit minimizer over the full model can possibly approach the minimizer over the reduced model. We found it necessary to constrain the complexity to ensure that the improvement in fit estimator converges sufficiently quickly. In our regime, this can be easily achieved by restricting the size of . We expect that if one were to assume a different form for , one would still need to impose a constraint that plays a similar role in order to obtain an -rate consistent estimator. Second, efficient estimation of the improvement in fit along any sub-model is needed. In settings where the reduced model is infinite-dimensional, estimation of the goodness-of-fit minimizer over the reduced model can generate bias for the improvement in fit estimator and reduce its convergence rate. Fortunately, an efficient estimator can be obtained using standard techniques for bias correction.
Case 2: The improvement in fit is bounded away from zero
Now, consider the setting where is a positive constant. Let and be functions that satisfy
(13)
We can express the estimation error of as
One might expect that should approach as grows, so should behave similarly to . In fact, if one could establish that
(14)
then one would be able to conclude that is asymptotically linear with influence function under Assumption B2.
The remainder term in (14) is under mild assumptions. Because is zero under Assumption B1, it only needs to be shown that is asymptotically negligible. Because the goodness-of-fit estimator is asymptotically linear, is approximately equal to , which is commonly referred to as the excess risk in the literature on M-estimation (van de Geer, 2000). Thus, in essence, one can verify (14) by showing that the excess risk converges to zero in probability at an -rate. This can be done using standard arguments from the M-estimation literature. The following result provides explicit conditions under which is an asymptotically linear estimator for .
Theorem 3.
Suppose that the improvement in fit is positive, i.e., . Suppose further that Assumptions A1, A5, B1, and B2 hold, and there exists a sequence for some such that
(15)
Then is an asymptotically linear estimator for with influence function
An important consequence of Theorem 3 is that is asymptotically efficient in a nonparametric model, and hence performs as well as the plug-in estimator described in Section 3 when . The assumption in (15) is a type of smoothness condition that is assumed commonly in the literature on estimation in high-dimensional and nonparametric models (see, e.g., van de Geer, 2008; Negahban et al., 2012; Bibaut and van der Laan, 2019). The condition ensures that and are close in distance when is small.
Some conditions that are needed by Theorem 2 are not needed by Theorem 3. Notably, it is not necessary for to be zero. This means that can be mis-specified in the sense that the interval is too small, and along any sub-model , there can exist a candidate that achieves a better fit than . In other words, we allow there to be for which . Even then, remains an asymptotically linear estimator for .
5 Construction of tests and intervals for the improvement in fit
In this section, we propose strategies for testing and confidence set construction for the improvement in fit. Our approach uses a computationally efficient bootstrap algorithm, which we describe in detail below.
We also provide theoretical results that establish the validity of our proposed bootstrap method. Before proceeding, it is helpful to first state the regularity conditions upon which our results rely.
- Assumption C1: The function class depends on only through a nuisance , which takes values in a space endowed with norm , and our estimator satisfies .
- Assumption C2: As approaches zero, both and tend to zero as well.
- Assumption C3: There exist such that the function classes
are -Donsker, with finite squared envelope function and finite bracketing integral (see, e.g., Chapter 19 of van der Vaart, 2000), and
Assumption C1 states that, for our bootstrap methods to be viable, estimation of the entire probability distribution is not needed, and it is sufficient to only estimate nuisance parameters upon which the efficient influence function depends. Recall that we made a similar assumption when we described construction of asymptotically linear estimators in Section 4.1. Assumption C2 states that when we estimate the nuisance components consistently, the plug-in estimator for the efficient influence function is consistent as well. Assumption C3 states that our efficient influence function estimator belongs to a function class that is not overly complex, with probability tending to one.
5.1 Approximation of the null limiting distribution
To perform a test of the hypothesis of no improvement in fit, we need an approximation for the asymptotic cumulative distribution function of under the null. While we are able to characterize the null limiting distribution of using Theorem 2, it is possible that a closed form expression for the distribution function is not available. However, we can use resampling techniques to obtain an approximation.
We approximate the null limiting distribution of using the multiplier bootstrap method proposed by Hudson et al. (2021). The multiplier bootstrap is a computationally efficient method for approximating the sampling distribution of estimators that can be represented as a functional of a well-behaved empirical process, plus a negligible remainder. Such an approach is applicable in our setting because has such a representation (see (12)).
For and large, let be an -dimensional vector of independent Rademacher random variables, also drawn independently from . We define the multiplier bootstrap statistic
(16)
as an approximate draw from the null limiting distribution of .
For a realization of , let
(17)
denote the p-value for a test of no improvement in fit, based on the limiting distribution of . Given a large sample of multiplier bootstrap statistics, one can approximate the p-value as
The following result due to Hudson et al. (2021) provides conditions under which the bootstrap approximation of the limiting distribution is asymptotically valid, and use of the bootstrap p-value is appropriate.
Theorem 4.
Let be independent Rademacher random variables, also independent of , and let . Under Assumptions C1 through C3, converges weakly to , conditional upon , in outer probability.
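In practice, the p-value approximation in (17) is straightforward to compute. The sketch below (R) assumes a hypothetical helper `boot_stat_fn(xi)` that evaluates the bootstrap statistic in (16) for a given vector of multipliers, and a test statistic `test_stat` on the same scale as the bootstrap draws; neither is specified by the paper.

```r
# Multiplier-bootstrap p-value (sketch; `boot_stat_fn` and the scaling of
# `test_stat` are assumptions on our part).
multiplier_pvalue <- function(test_stat, boot_stat_fn, n, B = 1000) {
  boot_draws <- replicate(B, {
    xi <- sample(c(-1, 1), size = n, replace = TRUE)  # Rademacher multipliers
    boot_stat_fn(xi)
  })
  mean(boot_draws >= test_stat)  # proportion of bootstrap draws exceeding the observed statistic
}
```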
5.2 Interval construction for
In this section, we present a method for constructing a confidence interval for . The standard approach for interval construction based on a Gaussian approximation of the sampling distribution of an estimator is inadvisable because is only asymptotically Gaussian when is bounded away from zero. We show that this issue can be overcome by instead constructing a confidence interval via hypothesis test inversion.
Suppose that one could perform a level test of the hypothesis for any . Then the set
would be a confidence interval for . That is, in the limit of large , would contain with probability at least .
We construct a test of using the test statistic . Let be an approximation for the quantile of the limiting distribution of . For a suitable , a test that rejects the null when exceeds will achieve asymptotic type-1 error control at the level . Moreover, an asymptotically valid confidence set can be obtained by setting
It is not immediately obvious how to select because the limiting distribution and convergence rate of depend on whether . To address this concern, in what follows, we present a multiplier bootstrap approximation of the limiting distribution that adapts to the unknown value of .
Let be any random sequence that converges to one in probability when , and converges to zero in probability when . For instance, we can set
(18)
where is the multiplier bootstrap p-value. That this choice of is valid follows from the fact that is consistent for and -rate convergent under the null. Now, similar to Section 5.1, for and large, we generate a pair of random variables as follows. The first random variable is in (16), which is a multiplier bootstrap approximation of a draw from the limiting distribution of under the setting where . We take the second random variable as a multiplier bootstrap approximation of a draw from the limiting distribution of when . Specifically, we define this second random variable as
(19)
where is the vector of Rademacher random variables defined in Section 5.1 (the same vector may be used to construct and ), and is as defined in (13). Finally, we take an approximate draw from the sampling distribution of as , where
and we set as the quantile of .
Because converges to zero when is zero and approaches one when is large, adaptively identifies whether or is a more appropriate approximation of a draw from the sampling distribution of . The following result states that is an asymptotically valid approximation regardless of whether is zero or nonzero, thereby justifying our selection of .
Theorem 5.
Let be independent Rademacher random variables, also independent of . Let be a random sequence that converges to one in probability when and converges to zero in probability when . Let , let , and . Let be a mean zero Gaussian random variable with variance , with defined in (13). Suppose that Assumptions C1-C3 are met. Then when , and the conditions of Theorem 2 hold, converges weakly to , conditional upon , in outer probability. When , , and the conditions of Theorem 3 hold, converges weakly to , conditional upon , in outer probability.
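Computationally, the confidence interval is obtained by inverting this family of tests over a grid of candidate values. The R sketch below uses hypothetical helpers (not provided by the paper): `stat_fn(psi)` returns the test statistic for the candidate null value psi, and `crit_fn(psi, alpha)` returns the corresponding adaptive multiplier-bootstrap critical value described above.

```r
# Confidence interval by test inversion (sketch; `stat_fn` and `crit_fn` are
# hypothetical helpers, and returning a range assumes the non-rejected set is
# an interval).
invert_test_ci <- function(stat_fn, crit_fn, psi_grid, alpha = 0.05) {
  keep <- vapply(psi_grid, function(psi) stat_fn(psi) <= crit_fn(psi, alpha),
                 logical(1))
  range(psi_grid[keep])  # endpoints of the set of non-rejected candidate values
}
```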
6 Implementation
In this section we discuss implementation of our proposed method for inference on the improvement in fit. First we describe how to construct a model for . We subsequently discuss how to calculate the improvement in fit estimator and how to implement our proposed bootstrap procedures for testing the null of no improvement in fit and constructing confidence sets.
6.1 Constructing the collection of parametric sub-models
We propose to construct as a space of linear combinations of basis functions from to , where the coefficients for the basis functions are required to satisfy a constraint that induces structure on the function class. Let be a vector space defined as the span of basis functions from to . Let be a functional on that measures the complexity of any function in , with larger values corresponding to greater complexity. We set to have bounded complexity. Additionally, we impose a constraint that is bounded away from zero. In view of Assumption A4, such a constraint is needed in order for us to establish weak convergence of our proposed improvement in fit estimator under the null. Finally, we set , where we define
(20)
and is a tuning parameter. In practice, we recommend truncating the basis at a large level to facilitate computation.
As an example, one could construct using a reproducing kernel Hilbert space (RKHS). Let be a positive definite kernel function, and let denote its unique reproducing kernel Hilbert space, endowed with inner product . One can select the basis functions as the eigenfunctions of , with respect to the RKHS inner product. We denote the corresponding eigenvalues by . The complexity of any function can be measured by its RKHS norm . The RKHS norm is a measure of smoothness, with higher values corresponding to less smooth functions. Reproducing kernel Hilbert spaces are appealing because they are flexible and contain close approximations of smooth functions (Micchelli et al., 2006). Moreover, the fact that the RKHS norm can be expressed as a quadratic form in the basis coefficients simplifies computation. Alternative approaches, such as constructing using a spline basis and setting as a variation norm, are commonly used in nonparametric regression problems and could also be considered (see, e.g., Tibshirani et al., 2005; Benkeser and van der Laan, 2016).
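For illustration, the sketch below (R) approximates the leading eigenfunctions of a Gaussian kernel by the eigendecomposition of the kernel matrix evaluated at the observed design points, a standard Nystrom-type approximation; the bandwidth `sigma` and truncation level `d` are arbitrary illustrative choices rather than values used in the paper.

```r
# Approximate leading RKHS basis functions of a Gaussian kernel via the
# eigendecomposition of the kernel matrix at the observed points (sketch;
# `sigma` and `d` are illustrative tuning values).
rkhs_basis <- function(X, sigma = 1, d = 10) {
  X  <- as.matrix(X)
  K  <- exp(-as.matrix(dist(X))^2 / (2 * sigma^2))  # Gaussian kernel matrix
  eg <- eigen(K, symmetric = TRUE)
  list(values  = eg$values[1:d] / nrow(X),  # approximate operator eigenvalues
       vectors = eg$vectors[, 1:d])         # basis functions evaluated at the data
}
```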
We now discuss specification of the interval . The choice of does not affect the null limiting distribution but may affect the limiting distribution when . The main role of is to regularize to ensure that the variance of the estimator is well-controlled. Recall from our discussion of Theorem 3 in Section 4.2 that , which is defined to be the minimizer of over , does not need to be the global minimizer over ; our results show that has a well-behaved limiting distribution regardless. We treat the width of the interval as an additional tuning parameter.
We find that in some settings, can retain good asymptotic behavior under the alternative even when is taken to be an interval of arbitrary width. In view of Theorem 1, the variance of has an inverse relationship with . Therefore, constructing to only include functions for which is bounded from below also serves to regularize . In particular, when is a quadratic function, and is a constant function, it is sufficient to ensure that is bounded away from zero. Because this constraint is already incorporated into with the above specification, constraining the width of is unnecessary in such instances.
We recommend selecting the tuning parameters , and (when needed) by performing cross-validation with respect to the loss . We note that while our asymptotic results implicitly assume that is pre-specified, it is argued in Hudson et al. (2021) that one can select data-adaptively without compromising type-1 error control as long as the adaptive choice converges to a fixed class. In some settings, e.g., when the sample size is small, it is possible that the data-adaptive choice is highly or moderately variable, and that failure to account for this variability could lead to type-1 error inflation. One can avoid this issue by using a more conservative sample splitting approach, wherein one partition of the data is used for tuning parameter selection, and a second independent partition is used to estimate .
6.2 Computation
We now discuss how to calculate the improvement in fit estimator and how to implement the multiplier bootstrap for hypothesis testing and confidence interval construction.
Calculating requires us to solve the optimization problem in (10). When we use the specification of in Section 6.1, it is possible for this problem to be non-convex, and it can be particularly challenging to solve when a closed-form solution for is not available. We find, however, that when is a quadratic function of , a computationally efficient solution is available. In Examples 1 and 3 in Section 2.2, is a quadratic function when one considers a sub-model of the form , so this special case captures at least some examples. In what follows, we present an approach for solving this problem in the setting where is a quadratic function of . We then describe a more general method in the Supplementary Materials.
Suppose that for any and , there exists a -dimensional matrix and a -dimensional vector such that
where “const” refers to a constant that depends neither on nor . It can be easily seen that has the exact representation
Additionally, the second derivative estimator satisfies for all . Now, can be expressed as
Suppose now that is available in quadratic form in the coefficients of the basis functions – that is, for a matrix . Using the above representation for , we can express as
(21)
The optimization problem in (21) is a quadratically constrained quadratic program (QCQP) and can be solved using publicly available software, such as the CVXR package in R (Fu et al., 2017).
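To illustrate how such a problem is passed to CVXR, the toy sketch below specifies a convex QCQP: a quadratic objective with a single quadratic complexity constraint. The matrices `A`, `b`, `Q`, and the bound `r` are placeholders of our own (the actual matrices in (21) are problem-specific), and the paper's problem may additionally require handling non-convexity, for example by solving over sign constraints separately.

```r
# Toy convex QCQP solved with CVXR (sketch; A, b, Q, r are placeholder inputs).
library(CVXR)

solve_qcqp <- function(A, b, Q, r) {
  d    <- length(b)
  beta <- Variable(d)
  # minimize (1/2) beta' A beta - b' beta  subject to  beta' Q beta <= r
  objective  <- Minimize(0.5 * quad_form(beta, A) - t(b) %*% beta)
  constraint <- list(quad_form(beta, Q) <= r)
  result <- solve(Problem(objective, constraint))
  result$getValue(beta)
}

# Example usage with positive definite A and Q so that the problem is convex.
set.seed(1)
A <- diag(3); Q <- diag(3); b <- rnorm(3)
solve_qcqp(A, b, Q, r = 1)
```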
Multiplier bootstrap samples can be calculated using a similar method. We first observe that when we use the specification of in (6), the Riesz representation theorem implies that is a linear functional of . Consequently, the efficient influence function is also linear in . Therefore, for any , we have
Now, let be an matrix with element given by , and let be an -dimensional vector of Rademacher random variables, as in Sections 5.1 and 5.2. As with , the multiplier bootstrap test statistic in (16) can be expressed as
(22)
The optimization problem in (22) is also a QCQP and can be solved efficiently. Finally, can be written as
where is a solution to (21).
7 Illustration: Inference in a Nonparametric Regression Model
In this section, we apply our framework to perform inference on the non-negative dissimilarity measure described in Example 1. In the Supplementary Materials, we describe how our framework can be used to construct a test of stochastic dependence, following the setting described in Example 2.
In this problem, we are tasked with assessing whether a subset of a collection of predictor variables is needed for attaining an optimal prediction function. As in Section 2.2, our data take the form , where is a real-valued outcome variable, is the predictor vector of interest, and is a vector of covariates. We denote by the conditional mean of the outcome given the covariates. Our objectives are to assess whether there exists a function that depends on both and which predicts better than , and to measure the best achievable improvement in predictive performance by any function in a large class.
We specify the parametric sub-model as
The goodness-of-fit of any candidate in the sub-model is defined as
and the first and second derivatives are given by
As we noted in Section 4, knowledge of the efficient influence function of is helpful for constructing an asymptotically linear estimator thereof. Additionally, we require the derivative of the efficient influence function to exist. Let represent the conditional mean of given . The form of the efficient influence function and its derivative are provided in the following lemma.
Lemma 1.
The efficient influence function of is given by
It is also easy to see that the efficient influence function is twice differentiable in . The evaluation of its first and second derivatives at are given by
From Lemma 1, we can see that and depend on the nuisance parameters and . One can obtain nonparametric estimators and for the nuisance using any of a wide variety of flexible data-adaptive regression procedures, including kernel ridge regression (Wahba, 1990), neural networks (Barron, 1989), the highly adaptive lasso (Benkeser and van der Laan, 2016), or the Super Learner (van der Laan et al., 2007). In our implementation, we use kernel ridge regression, in large part because it is computationally efficient.
It may at first seem computationally difficult to estimate the conditional mean of given for all . However, because we have assumed that can be represented as a linear combination of basis functions , we can perform this calculation without too much trouble. For , we have the representation . Therefore, one can obtain estimators for for and then estimate as .
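A minimal kernel ridge regression sketch in R is given below; it can be used both for the conditional mean of the outcome given the covariates and, column by column, for the conditional mean of each basis function given the covariates. The Gaussian kernel, bandwidth `sigma`, and penalty `lambda` are illustrative choices of ours, not the paper's settings.

```r
# Kernel ridge regression for the nuisance conditional means (sketch; the
# kernel, `sigma`, and `lambda` are illustrative choices).
krr_fit <- function(W, Y, sigma = 1, lambda = 1e-2) {
  W <- as.matrix(W)
  K <- exp(-as.matrix(dist(W))^2 / (2 * sigma^2))           # Gaussian kernel matrix
  alpha <- solve(K + lambda * nrow(W) * diag(nrow(W)), Y)   # ridge coefficients
  function(Wnew) {
    Wnew <- as.matrix(Wnew)
    D2 <- outer(rowSums(Wnew^2), rowSums(W^2), "+") - 2 * Wnew %*% t(W)
    exp(-D2 / (2 * sigma^2)) %*% alpha                      # predictions at new points
  }
}

# Usage sketch: `Phi` is a hypothetical n-by-d matrix holding each basis
# function evaluated at the observations.
# mu_hat <- krr_fit(W, Y)        # estimate of E[Y | W]
# m_hat  <- lapply(seq_len(ncol(Phi)), function(j) krr_fit(W, Phi[, j]))
```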
Consider the following initial plug-in estimator for :
and consider the efficient influence function estimator
The initial estimator for is biased, so a corrected estimator is needed so that one can perform inference. We can use the following one-step bias-corrected estimator:
(23)
The following lemma provides conditions under which the one-step estimator satisfies Assumptions B1 through B4.
Lemma 2.
Suppose that the nuisance estimators satisfy the rate conditions
Suppose also that there exist -Donsker classes , and a -Glivenko-Cantelli class such that, with probability tending to one, each of the following holds:
Then the one-step estimator in (23) satisfies Assumptions B1-B4.
The condition on the convergence rates of the nuisance estimators is standard and holds when all nuisance estimators are -rate convergent. This rate is attained by many nonparametric regression estimators under weak structural assumptions on the true conditional mean functions, so the condition is fairly mild.
We conclude with a brief comment about computation. We observe that is a quadratic function of , so the implementation scheme described in Section 6.2 can be applied in this example.
8 Simulation Study
In this section, we assess the empirical performance of our proposal in a simulation study. We again consider the nonparametric regression task discussed in Section 7.
We generate synthetic data sets as follows. First, we generate independent -dimensional random vectors from a Gaussian distribution with mean zero and covariance
We then define the predictor vector as , where is the standard normal distribution function. We generate the outcome according to the model
where the white noise is a continuous uniform random variable, drawn independently of . Our objective is to determine which of the elements of the predictor are conditionally associated with the outcome , given the other elements. Clearly, is conditionally dependent on the first and second elements, and not the third.
Our target of inference is the improvement in fit estimand defined in Example 1. More specifically, for each predictor, we estimate the improvement in fit comparing a flexible regression model containing prediction functions that may depend on all predictors, with a model only containing functions that do not depend on the predictor of interest. We estimate all nuisance parameters described in Section 7 using kernel ridge regression. We construct using the reproducing kernel Hilbert space corresponding to the Gaussian kernel, and we consider two choices for the smoothness parameter. First, we use an oracle approach, which sets as the smoothest class containing a function that is proportional to the difference between the full conditional mean of given all predictors , and the conditional mean of given predictors that are not of interest. We recall from our discussion in Section 2.2 that the difference between conditional means is the function that maximizes the improvement in fit. Second, we consider a data-adaptive procedure, where the smoothness parameter is selected using cross-validation, and no sample-splitting is performed.
We compare our method with a sample-splitting approach proposed by Williamson et al. (2021b). They propose to separate the data set into two independent partitions: one is used to estimate the optimal goodness-of-fit over the full model, and the other is used to estimate the optimal goodness-of-fit over the reduced model. Inference on can then be performed using a two-sample Wald test, and Wald-type intervals can be constructed similarly. We expect our approach to offer an improvement because we do not require sample-splitting for valid inference. We also expect our proposal to benefit from achieving the fast -rate of convergence at the boundary, whereas the sample-splitting approach is only -rate convergent.
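For concreteness, the following simplified sketch shows the shape of such a sample-splitting comparison: mean losses from the two halves are compared with a two-sample Wald statistic. The loss construction and variance estimator are placeholders and should not be read as the exact procedure of Williamson et al. (2021b).

```python
import numpy as np
from scipy.stats import norm

def split_sample_wald(loss_reduced, loss_full, level=0.05):
    """Simplified sketch of a sample-splitting comparison: per-observation
    losses for the reduced model are computed on one half of the data and
    losses for the full model on the other half, and the mean difference is
    assessed with a two-sample Wald statistic. Inputs are placeholders."""
    n1, n2 = len(loss_reduced), len(loss_full)
    psi_hat = np.mean(loss_reduced) - np.mean(loss_full)        # improvement in fit
    se = np.sqrt(np.var(loss_reduced, ddof=1) / n1 + np.var(loss_full, ddof=1) / n2)
    p_value = 1.0 - norm.cdf(psi_hat / se)                      # one-sided test of no improvement
    z = norm.ppf(1.0 - level / 2.0)
    return psi_hat, se, p_value, (psi_hat - z * se, psi_hat + z * se)
```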
We generate 1000 synthetic data sets under the data-generating process described above for . We compare all methods under consideration in terms of mean squared error, type-1 error control under the null, power under the alternative, confidence interval coverage, and average confidence interval width.
Figure 1 shows the root mean squared error for each proposed improvement in fit estimator as a function of the sample size. We find that, while the root mean squared error of each estimator approaches zero, our proposed estimators converge much more quickly than the sample splitting estimator, with the oracle version of our approach performing best.
In Figures 2 and 3 we plot the rejection probability for a test of the null of no improvement in fit as a function of the nominal type-1 error level. We find that all tests considered achieve asymptotic type-1 error control in the setting where , though we acknowledge there is type-1 error inflation when the sample size is small. We also find that our proposed tests are well-powered under the alternative, both outperforming the sample-splitting test.
In Figures 4 and 5 we plot the coverage probability and average width of 95% confidence intervals as a function of the sample size. We find that all approaches considered achieve nominal coverage as the sample size grows, though there is a tendency for our proposed intervals to exhibit undercoverage when the sample size is small. Our proposed estimator with oracle selection of achieves the lowest average width, followed by the adaptive approach and the sample-splitting approach.
9 Discussion
In this work, we have presented a general framework for inference on non-negative dissimilarity measures. Our proposed methodology has wide-ranging utility. As examples, we described how this framework can be applied to perform rate-optimal inference on statistical functionals arising in nonparametric regression and graphical modeling problems. Our framework can also be useful in other settings, such as causal inference problems. For instance, some statistical functionals that have been used for studying treatment heterogeneity (see, e.g., Levy et al., 2021; Hines et al., 2021; Sanchez-Becerra, 2023) have the representation described in Section 2, so one can perform inference using our general approach.
Our work has some notable limitations that we plan to address in future research. While our proposal for inference on the improvement in fit enjoys good behavior in large samples, we observed in our simulation study that it may have undesirable small-sample properties, such as mild type-1 error inflation or poor coverage. Additionally, our estimator suffers a loss in precision when the smoothness parameter of is selected in a data-adaptive manner. In future work, we plan to investigate whether the performance can be improved using, e.g., small-sample adjustments or improved data-adaptive methods for tuning parameter selection. Additionally, because our results assume smoothness of the goodness-of-fit functional, it is not clear whether our results can be directly applied to perform inference on estimands such as distances. It is of interest to develop a more flexible inferential strategy that relaxes this assumption. Our methodology also places complexity constraints on nuisance parameter estimators, which prohibits us from using estimators such as gradient boosted trees (Friedman, 2002). It is of interest to develop cross-fitted versions of our improvement in fit estimator and multiplier bootstrap strategy that relax this assumption (Zheng and van der Laan, 2011; Chernozhukov et al., 2018).
There also remain several open theoretical and methodological questions. For instance, while we have established rate consistency of our proposed improvement in fit estimator, it is unclear whether our test of the null of no improvement in fit is optimal in any sense. It would be important to characterize the power of our test and to determine whether there exists a more powerful test. Additionally, it is of interest to understand how specification of the sub-model affects our procedure’s performance. It is possible that there are many ways for one to construct a sub-model while still obtaining valid inference on . It is not clear how this choice affects the estimator or whether there is an optimal choice. We expect that, in practice, this choice will need to be made in consideration of theoretical properties, such as power, and more practical concerns, such as ease of implementation and computational efficiency.
References
- Barron (1989) Barron, A. R. (1989). Statistical properties of artificial neural networks. In Proceedings of the 28th IEEE Conference on Decision and Control, pages 280–285. IEEE.
- Benkeser and van der Laan (2016) Benkeser, D. and van der Laan, M. (2016). The highly adaptive lasso estimator. In 2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA), pages 689–696. IEEE.
- Bhattacharya and Zhao (1997) Bhattacharya, P. and Zhao, P.-L. (1997). Semiparametric inference in a partial linear model. The Annals of Statistics 25, 244–262.
- Bibaut and van der Laan (2019) Bibaut, A. F. and van der Laan, M. J. (2019). Fast rates for empirical risk minimization over càdlàg functions with bounded sectional variation norm. arXiv preprint arXiv:1907.09244 .
- Bickel et al. (1998) Bickel, P. J., Klaassen, C. A., Ritov, Y., and Wellner, J. A. (1998). Efficient and adaptive estimation for semiparametric models. Springer.
- Carone et al. (2018) Carone, M., Díaz, I., and van der Laan, M. J. (2018). Higher-order targeted loss-based estimation. In Targeted Learning in Data Science: Causal Inference for Complex Longitudinal Studies, pages 483–510.
- Chernozhukov et al. (2018) Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., and Robins, J. (2018). Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal 21, C1–C68.
- Donald and Newey (1994) Donald, S. G. and Newey, W. K. (1994). Series estimation of semilinear models. Journal of Multivariate Analysis 50, 30–40.
- Friedman (2002) Friedman, J. H. (2002). Stochastic gradient boosting. Computational Statistics & Data Analysis 38, 367–378.
- Fu et al. (2017) Fu, A., Narasimhan, B., and Boyd, S. (2017). CVXR: An R package for disciplined convex optimization. arXiv preprint arXiv:1711.07582 .
- Hines et al. (2021) Hines, O., Diaz-Ordaz, K., and Vansteelandt, S. (2021). Parameterising the effect of a continuous exposure using average derivative effects. arXiv preprint arXiv:2109.13124 .
- Hines et al. (2022) Hines, O., Diaz-Ordaz, K., and Vansteelandt, S. (2022). Variable importance measures for heterogeneous causal effects. arXiv preprint arXiv:2204.06030 .
- Hines et al. (2022) Hines, O., Dukes, O., Diaz-Ordaz, K., and Vansteelandt, S. (2022). Demystifying statistical learning based on efficient influence functions. The American Statistician 76, 292–304.
- Hudson et al. (2021) Hudson, A., Carone, M., and Shojaie, A. (2021). Inference on function-valued parameters using a restricted score test. arXiv preprint arXiv:2105.06646 .
- Kandasamy et al. (2015) Kandasamy, K., Krishnamurthy, A., Poczos, B., Wasserman, L., et al. (2015). Nonparametric von Mises estimators for entropies, divergences and mutual informations. Advances in Neural Information Processing Systems 28.
- Kennedy et al. (2023) Kennedy, E. H., Balakrishnan, S., and Wasserman, L. A. (2023). Semiparametric Counterfactual Density Estimation. Biometrika .
- Levy et al. (2021) Levy, J., van der Laan, M., Hubbard, A., and Pirracchio, R. (2021). A fundamental measure of treatment effect heterogeneity. Journal of Causal Inference 9, 83–108.
- Luedtke et al. (2019) Luedtke, A., Carone, M., and van der Laan, M. J. (2019). An omnibus non-parametric test of equality in distribution for unknown functions. Journal of the Royal Statistical Society Series B: Statistical Methodology 81, 75–99.
- Micchelli et al. (2006) Micchelli, C. A., Xu, Y., and Zhang, H. (2006). Universal kernels. Journal of Machine Learning Research 7.
- Negahban et al. (2012) Negahban, S. N., Ravikumar, P., Wainwright, M. J., and Yu, B. (2012). A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers.
- Paninski (2003) Paninski, L. (2003). Estimation of entropy and mutual information. Neural Computation 15, 1191–1253.
- Pfanzagl (1982) Pfanzagl, J. (1982). Contributions to a general asymptotic statistical theory. Springer.
- Pfanzagl (1985) Pfanzagl, J. (1985). Asymptotic expansions for general statistical models, volume 31. Springer-Verlag.
- Robins et al. (2008) Robins, J., Li, L., Tchetgen, E., van der Vaart, A., et al. (2008). Higher order influence functions and minimax estimation of nonlinear functionals. Probability and Statistics: Essays in Honor of David A. Freedman 2, 335–421.
- Robinson (1988) Robinson, P. M. (1988). Root-n-consistent semiparametric regression. Econometrica: Journal of the Econometric Society pages 931–954.
- Sanchez-Becerra (2023) Sanchez-Becerra, A. (2023). Robust inference for the treatment effect variance in experiments using machine learning. arXiv preprint arXiv:2306.03363 .
- Steuer et al. (2002) Steuer, R., Kurths, J., Daub, C. O., Weise, J., and Selbig, J. (2002). The mutual information: detecting and evaluating dependencies between variables. Bioinformatics 18, S231–S240.
- Tibshirani et al. (2005) Tibshirani, R., Saunders, M., Rosset, S., Zhu, J., and Knight, K. (2005). Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67, 91–108.
- Tsybakov (2009) Tsybakov, A. B. (2009). Introduction to Nonparametric Estimation. Springer.
- van de Geer (2000) van de Geer, S. A. (2000). Empirical Processes in M-estimation, volume 6. Cambridge university press.
- van de Geer (2008) van de Geer, S. A. (2008). High-dimensional generalized linear models and the lasso.
- van der Laan et al. (2007) van der Laan, M. J., Polley, E. C., and Hubbard, A. E. (2007). Super learner. Statistical Applications in Genetics and Molecular Biology 6.
- van der Laan and Rose (2011) van der Laan, M. J. and Rose, S. (2011). Targeted learning: causal inference for observational and experimental data. Springer Science & Business Media.
- van der Laan and Rose (2018) van der Laan, M. J. and Rose, S. (2018). Targeted learning in data science. Springer.
- van der Vaart (2014) van der Vaart, A. (2014). Higher order tangent spaces and influence functions. Statistical Science pages 679–686.
- van der Vaart and Wellner (1996) van der Vaart, A. and Wellner, J. (1996). Weak convergence and empirical processes. Springer.
- van der Vaart (2000) van der Vaart, A. W. (2000). Asymptotic statistics, volume 3. Cambridge university press.
- Verdinelli and Wasserman (2021) Verdinelli, I. and Wasserman, L. (2021). Decorrelated variable importance. arXiv preprint arXiv:2111.10853 .
- Wahba (1990) Wahba, G. (1990). Spline models for observational data. SIAM.
- Westling (2021) Westling, T. (2021). Nonparametric tests of the causal null with nondiscrete exposures. Journal of the American Statistical Association pages 1–12.
- Wilks (1938) Wilks, S. S. (1938). The large-sample distribution of the likelihood ratio for testing composite hypotheses. The Annals of Mathematical Statistics 9, 60–62.
- Williamson et al. (2021a) Williamson, B. D., Gilbert, P. B., Carone, M., and Simon, N. (2021). Nonparametric variable importance assessment using machine learning techniques. Biometrics 77, 9–22.
- Williamson et al. (2021b) Williamson, B. D., Gilbert, P. B., Simon, N. R., and Carone, M. (2021). A general framework for inference on algorithm-agnostic variable importance. Journal of the American Statistical Association pages 1–14.
- Zhang and Janson (2020) Zhang, L. and Janson, L. (2020). Floodgate: inference for model-free variable importance. arXiv preprint arXiv:2007.01283 .
- Zheng and van der Laan (2011) Zheng, W. and van der Laan, M. J. (2011). Cross-validated targeted minimum-loss-based estimation. In Targeted Learning, pages 459–474. Springer.
Supplementary Materials
S1 Implementation for Non-quadratic Objectives
Here, we propose a general method for computing the improvement in fit estimator. As noted in Section 6.2, computation can be challenging because, when we use the specification of in (20), the optimization problem in (10) is possibly non-convex. Moreover, a closed form expression for may not be available, further complicating the problem.
As in Section 6.2, we focus on the setting where the complexity measure is available in quadratic form, satisfying for some matrix . We also note that, in general, the second derivative of the goodness-of-fit is quadratic in the coefficients as a consequence of the Riesz representation theorem. We assume that can be expressed as for some .
We propose a slightly modified estimator for that has nearly identical asymptotic properties to , but which can be easier to compute in a more general setting. The main idea is to separate estimation of into two parts. First, as before, for each we perform a search to identify whether any candidate parameter in the sub-model is an improvement over the null in terms of goodness-of-fit. Where we differ is that we confine our search to a small neighborhood of zero. When there exists evidence that, for some , is not minimized at zero, we search over a larger function class to identify a candidate parameter that achieves a better fit than the null. In contrast to the strategy proposed in Section 6, we specify as a class over which an optimum can more easily be identified. In what follows, we provide details and rationale for this method.
First, Taylor’s theorem implies that when resides within a neighborhood of zero, is approximately equal to . Additionally, in this setting we have that
Because is assumed to be convex for all , for some implies that , and . And so, one can assess whether by checking whether . This is roughly equivalent to performing a search over to identify the best fit, where is taken to be a small interval containing zero. To estimate , we use the estimator
When the null holds, and the conditions of Theorem 2 hold as well, has the same asymptotic representation as . That is,
As noted in Section 6.2, the Riesz representation theorem implies that is a linear functional of . We assume that our estimator is also linear, and can be expressed as for some -dimensional vector . Thus, using the specification of in (20), can be expressed as
This problem is a quadratically constrained quadratic program and can be solved efficiently.
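As an illustration, a problem of this form can be handed directly to a disciplined convex programming solver; the sketch below uses cvxpy, the Python analogue of the CVXR package cited in the references, with hypothetical inputs standing in for the estimated first derivative, second derivative, and constraint matrix.

```python
import numpy as np
import cvxpy as cp

def solve_local_qcqp(grad, hess, Omega):
    """Sketch of a quadratically constrained quadratic program of the form
    described above: maximize a concave quadratic approximation subject to a
    quadratic complexity constraint.

    grad : (d,) linear term built from the estimated first derivative
    hess : (d, d) symmetric PSD matrix from the estimated second derivative
    Omega: (d, d) symmetric PSD matrix defining the quadratic constraint
    (Inputs are hypothetical; the exact objective is given in the text.)"""
    d = len(grad)
    beta = cp.Variable(d)
    objective = cp.Maximize(grad @ beta - 0.5 * cp.quad_form(beta, hess))
    constraints = [cp.quad_form(beta, Omega) <= 1.0]
    problem = cp.Problem(objective, constraints)
    problem.solve()
    return beta.value, problem.value
```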
It is possible that is a poor approximation for when the null of no improvement in fit does not hold. Let be a random sequence that converges to one in probability when holds and converges to zero in probability when . For instance, we can take as
where is as defined in (17). For large values of , can replace . Otherwise, an alternative estimator may be needed.
To estimate when the null does not hold, we consider an alternative specification of . A main source of our difficulty with solving the optimization problem in (10) is that the constraint is non-convex. Under the null, constraining is necessary, as Theorem 2 states that Assumption A4 must hold in order for our improvement in fit estimator to have a well-behaved limiting distribution. However, this assumption is not needed for Theorem 3 to hold. As we discussed previously in Section 6.1, for Theorem 3 to hold, we really only need to ensure that the variance of is well controlled. When is quadratic, this can be achieved by constraining and leaving unconstrained, as was done previously. When is non-quadratic, we can alternatively leave unconstrained and carefully select the width of . We instead set as the function class
and we set as for some . Now, we define as
and we estimate as
Calculating will in many cases be much easier than calculating . We can write as
where is a -dimensional vector. This optimization problem has only a single convex constraint, so the problem is convex whenever the objective function is also convex. This can greatly simplify computation.
Of course, and can potentially achieve different limiting distributions when . This is because and are defined as optima over different function classes. While the two values are expected to be similar, they will not necessarily be equal. Nonetheless, one can still apply Theorem 3 to establish weak convergence of to a Gaussian distribution, as long as is not too large.
Finally, we combine and to obtain a single estimator for . We define our estimator as
Because tends to one when the null holds and approaches zero when the null fails, has approximately the same asymptotic behavior as . That is, behaves like the supremum of an empirical process under the null, and like a sample average otherwise. Therefore, the multiplier bootstrap tests described in Sections 5.1 and 5.2 remain valid, and an implementation strategy similar to that described in Section 6.2 can be used.
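Schematically, the combination can be coded as a gated average, as in the sketch below; the gating rule shown (an indicator that a preliminary statistic falls below a threshold) is one natural choice rather than the exact definition given above.

```python
def combined_estimate(psi_local, psi_alt, gate):
    """Sketch of the combined estimator: 'gate' plays the role of the random
    sequence that tends to one in probability under the null and to zero under
    the alternative. The combination rule is illustrative."""
    return gate * psi_local + (1.0 - gate) * psi_alt

# example usage with an indicator gate based on a hypothetical threshold
# gate = float(test_statistic <= threshold)
# psi_hat = combined_estimate(psi_local, psi_alt, gate)
```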
S2 Illustration: Nonparametric Assessment of Stochastic Dependence
In this section, we briefly discuss inference in Example 2 from Section 2.2, where we are interested in assessing whether a pair of random variables is independent. The data take the form , where and are one-dimensional random variables.
We assess dependence by comparing the expectation of the log of the product of the marginal densities of and with the expectation of the logarithm of an approximation for the joint density. Let and denote the marginal densities of and under , and let denote the log of the product of the marginal densities. We use the following sub-model to approximate the logarithm of the joint density:
With a straightforward calculation, it can be verified that (7) is satisfied, and moreover, we can see that is a valid candidate log density, as for any , .
With the above specification for the parametric sub-model, the goodness-of-fit takes the form
where and denote the marginal expectations under , with respect to and , respectively. We now observe that along any submodel, the difference in goodness-of-fit comparing a given candidate parameter with the null is given by
and this expression does not depend on the marginal densities and . Thus, estimation of the marginal densities is not needed.
The derivatives and are given by
One can interpret as the difference between the true mean of under and the value the mean would hypothetically take if and were independent. The second derivative represents the variance of under the assumption that and are independent.
Because does not depend on through any nuisance parameters that are not pathwise differentiable, it is expected that a plug-in estimator, which is defined as a functional of the cumulative distribution function, would be asymptotically linear, and no sophisticated methods for bias correction should be needed. We use the following plug-in estimator for :
By an application of the functional delta method, one can show that the plug-in estimator is asymptotically linear with influence function
Similarly, we estimate as
and is asymptotically linear with efficient influence function
We estimate the second derivative as
In this setting, a closed form solution for is not available, and if one uses the specification for in (20), the problem
is difficult to solve. As an alternative, we recommend using the more general implementation strategy presented in the Supplementary Materials Section S1.
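Returning to the derivative estimators above, the following sketch computes simple empirical versions of the first and second derivatives, using the empirical joint distribution and the product of the empirical marginals. It is illustrative only and does not reproduce the exact estimators given in the preceding displays.

```python
import numpy as np

def plug_in_derivatives(h, x, y):
    """Empirical versions of the two derivatives discussed above: the first
    derivative as the mean of h(X, Y) under the empirical joint distribution
    minus its mean under the product of the empirical marginals, and the
    second derivative as the variance of h(X, Y) under that product measure.
    h is a vectorized function of two arguments; x and y are one-dimensional
    numpy arrays. Names are illustrative."""
    joint_mean = np.mean(h(x, y))
    H = h(x[:, None], y[None, :])                  # h(x_i, y_j) for all pairs (i, j)
    prod_mean = H.mean()                           # mean under product of marginals
    first_deriv = joint_mean - prod_mean
    second_deriv = np.mean((H - prod_mean) ** 2)   # variance under independence
    return first_deriv, second_deriv
```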
S3 Proofs of Theoretical Results
Proof of Theorem 1
We have by definition that and are non-negative. Under Assumption B2, we can write
Thus from the fact that is -Donsker, we have that . Assumption A3 implies that .
Now, because we have that satisfies , Taylor’s theorem implies
for some that satisfies . By rearranging terms and invoking Assumption B4, the estimation error for can be expressed as
Because , Assumption B2 implies that
Now, because is uniformly consistent for , the continuous mapping theorem and Assumptions A4 and B4 allow us to replace with in the above display. Thus, we have
as claimed. The weak convergence result follows as an immediate consequence of the Donsker Assumption A5.
Proof of Theorem 2
This result follows directly from an application of the continuous mapping theorem.
Proof of Theorem 3
Following our discussion in Section 4.2, it suffices to show that . First, we write
From the above argument, we can conclude that the remainder is at least .
Proof of Theorem 4
Let be the space of bounded Lipschitz-1 functions . That is, any in satisfies for any . Let denote the expectation of a random variable with respect to the distribution of (treating as fixed). We show that
converges to zero in outer probability. This is equivalent to weak convergence by the Portmanteau lemma (see, e.g., Lemma 18.9 of van der Vaart, 2000).
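For reference, the bounded-Lipschitz characterization of weak convergence invoked here can be written in generic notation as follows; the displays in this proof are conditional analogues of this criterion.

```latex
% Bounded-Lipschitz characterization of weak convergence (generic notation):
% T_n converges weakly to T if and only if
\[
  \sup_{f \in \mathrm{BL}_1}
  \bigl|\, \mathbb{E}^{*} f(T_n) - \mathbb{E} f(T) \,\bigr|
  \;\longrightarrow\; 0,
\]
% where BL_1 denotes the class of functions bounded by one and Lipschitz with
% constant one.
```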
Let denote the space of bounded functionals on , and let be the space of Lipschitz-1 functionals . That is, for in , any satisfies . We now define:
It is shown by Hudson et al. (2021) that under Assumptions C1-C3,
The proof is completed by recognizing that for any , the functional is contained within .
Proof of Theorem 5
Case 1:
We first consider the setting in which . Let be a Lipschitz-1 function on . As in the proof of Theorem 4, let be the space of Lipschitz-1 functions on . We show that
converges to zero in outer probability.
First, by applying the triangle inequality and invoking the Lipschitz property, we have
where we define
We have already shown in Theorem 4 that the first term converges to zero in outer probability, so it only remains to verify this for the second term .
By the reverse triangle inequality, we have
Because when , it suffices to show that and are both .
We first show that is bounded in probability. First, we have by Jensen’s inequality that
Now, by Taylor’s theorem, we have
for some that satisfies for all . Because under the conditions of Theorem 2, it suffices to show that
(24) |
By the triangle inequality, we have the upper bound
where we define
It can be seen through an application of the Cauchy-Schwarz inequality under Assumption C2 that is bounded in probability. To see that is bounded in probability, we first note that Assumption C1 implies that, with probability tending to one,
In view of Markov’s inequality, it is sufficient to show that this upper bound has finite expectation. Lemma 2.3.6 of van der Vaart and Wellner (1996) implies that
Because under Assumption C3, has finite bracketing integral, we have by Corollary 19.35 of van der Vaart (2000) that
thereby establishing that . That follows from a similar argument.
To argue that is , we use the same argument as is used to show that (24) holds. In brief, the result follows from the facts that () both and are uniformly consistent under Assumption C2, and () the class
is -Donsker with finite bracketing integral under Assumption C3.
Case 2:
We now consider the setting where . We show that
converges to zero in outer probability. Similarly as for Case 1, we have by the triangle inequality that
where we define
We first argue that converges to zero in outer probability. First, we have by assumption that the function
is contained within a -Donsker class with probability tending to one. Also, we have under Assumption C2 that . We now argue that . We have the upper bound
Under Assumption C2, we have
Additionally, we have
under the conditions of Theorem 3. Now, by Theorem 2 of Hudson et al. (2021), we can conclude that converges to zero in outer probability.
We now argue that . Because , we only need to show that and are bounded in probability. That follows from the same argument as was used to show that (24) holds in Case 1.
To argue that , we begin by applying the Cauchy-Schwarz inequality and invoking Assumption C2 to get
Now, by the triangle inequality,
Because is a -Donsker class with finite squared envelope function, we have by Lemma 2.10.4 of van der Vaart and Wellner (1996) that is a -Glivenko-Cantelli class, and so
Now, by the triangle inequality
We have by assumption that
Additionally, Assumption C2 and Lemma 2.10.4 of van der Vaart and Wellner (1996) imply that the class is -Glivenko-Cantelli with probability tending to one. Therefore,
Thus, we have that . This allows us to write
That follows from the argument presented in Case 1. This completes the proof.
Proof of Lemma 1
Suppose a given distribution in has density with respect to a dominating measure , and let be a fixed function that has mean zero and finite variance under . Let be a one-dimensional parametric sub-model for indexed by the parameter , which satisfies the following:
1. The sub-model passes through at – that is, at
2. The density of the parametric sub-model is given by , and the score function is given by at . That is,
We refer to as the pathwise derivative of . The nonparametric efficient influence function is the unique function that satisfies the following two properties:
1. For every , .
2. has mean zero under . That is, .
We can therefore find the efficient influence function by calculating the pathwise derivative.
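For clarity, the two properties can be written in generic notation as follows, with $\Psi$ denoting the pathwise differentiable functional, $D_P$ its efficient influence function, and $s$ the score of a sub-model through $P$; this is the standard characterization and is not specific to our setting.

```latex
% Defining properties of the nonparametric efficient influence function:
\[
  \frac{d}{d\epsilon}\, \Psi(P_\epsilon) \Big|_{\epsilon = 0}
  \;=\; \mathbb{E}_P\!\bigl[ D_P(O)\, s(O) \bigr]
  \quad \text{for every sub-model score } s,
  \qquad \text{and} \qquad
  \mathbb{E}_P\!\bigl[ D_P(O) \bigr] \;=\; 0 .
\]
```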
Let have density
Any distribution in a small neighborhood of can be approximated using a sub-model of this form.
The goodness-of-fit under is given by
Through a simple calculation, it can be shown that
where denotes the conditional density of given that , under . We now have
where the second equality follows from an application of the law of total expectation to the second summand. The “non-mean-centered” efficient influence function is thus given by
The proof is completed by centering the above function about its mean.
Proof of Lemma 2
We write the estimation error for the one-step estimator as
where the remainder terms are
Following from our discussion in Section 4.1, it suffices to argue each of the following:
First, we argue that . It is shown in the proof of Lemma 19.26 of van der Vaart (2000) that this convergence rate is achieved when
and when the class is contained within a -Donsker class with probability tending to one. That the influence function estimators are uniformly consistent follows as a consequence of the rate conditions on the nuisance parameter estimators, and the Donsker condition holds by assumption. Similarly, that and follow from consistency of the nuisance estimators and the assumed complexity constraints.
Now, we argue that . This remainder term has the exact representation
It can be seen that when the rate conditions on the nuisance estimators are met. The derivative of the second remainder term is
and so under the rate conditions as well. Finally, it is easily seen that .