Nonparametric inference on non-negative dissimilarity measures at the boundary of the parameter space
Abstract
It is often of interest to assess whether a function-valued statistical parameter, such as a density function or a mean regression function, is equal to any function in a class of candidate null parameters. This can be framed as a statistical inference problem where the target estimand is a scalar measure of dissimilarity between the true function-valued parameter and the closest function among all candidate null values. These estimands are typically defined to be zero when the null holds and positive otherwise. While there is well-established theory and methodology for performing efficient inference when one assumes a parametric model for the function-valued parameter, methods for inference in the nonparametric setting are limited. When the null holds, and the target estimand resides at the boundary of the parameter space, existing nonparametric estimators either achieve a non-standard limiting distribution or a sub-optimal convergence rate, making inference challenging. In this work, we propose a strategy for constructing nonparametric estimators with improved asymptotic performance. Notably, our estimators converge at the parametric rate at the boundary of the parameter space and also achieve a tractable null limiting distribution. As illustrations, we discuss how this framework can be applied to perform inference in nonparametric regression problems, and also to perform nonparametric assessment of stochastic dependence.
1 Introduction
Suppose we are interested in studying a function-valued parameter of an unknown probability distribution, such as a conditional mean function or a density function. For such parameters, one can typically define a goodness-of-fit functional, which measures the closeness of any given candidate function to the true population parameter. The goodness-of-fit achieves its minimum when evaluated at the true population parameter. It is often of scientific interest to compare multiple models for the function-valued parameter. In particular, one may seek to determine whether the minimizer of the goodness-of-fit over a, possibly large, function class is equal to the minimizer over a smaller sub-class. The difference between the minima over reduced and full function classes can serve as a natural measure of dissimilarity for comparing the corresponding minimizers. This dissimilarity measure is non-negative, with values of zero corresponding to no dissimilarity. The main focus of this work is on estimation of such dissimilarity measures and testing the null hypothesis of no dissimilarity, or equality of goodness-of-fit.
As an example, suppose that an investigator would like to determine whether an exposure is conditionally associated with an outcome, given a set of confounding variables. This can be formulated as a statistical inference problem, where the objective is to determine whether the conditional mean of the outcome, given both the exposure and confounders, is equivalent to the conditional mean of the outcome, given only the confounders. One can specify a full model for the conditional mean as a class of functions that depend on both the exposure and confounders, while the reduced model is the subclass of functions that may depend on the confounders but do not depend on the exposure. Several goodness-of-fit measures, such as the expected squared error loss, can be used to assess how close a candidate parameter is to the conditional mean given the exposure and confounders. One can then test for conditional independence by assessing whether the best approximation of the conditional mean in the full model class improves upon the best approximation in the reduced class, in terms of the goodness-of-fit.
When the function-valued parameter of interest is modeled using a finite-dimensional function class, there are standard procedures available for performing inference. For instance, the classical likelihood ratio test is widely-used to compare classes of regression functions when the conditional distribution of the outcome given the predictor and covariates is assumed to belong to a parametric family of probability distributions (Wilks, 1938). There also exist approaches for efficient inference in settings where the reduced and full function classes are both infinite-dimensional, but the difference between the two classes is finite dimensional. For instance, in regression problems of the form described in the example above, it is common to assume that the conditional mean of the outcome given the exposure and covariates follows a partially linear model. In a partially linear model, the full conditional mean can be expressed as the sum of an unknown function of the confounders, which is only assumed to belong to a large infinite-dimensional function class, plus a linear function of the exposure of interest. One can therefore assess for conditional dependence by determining whether the linear function has zero slope, which is a well-studied inference problem (Chernozhukov et al., 2018; Bhattacharya and Zhao, 1997; Robinson, 1988; Donald and Newey, 1994).
In this work, we focus on the more challenging setting in which the difference between the full and reduced function classes is infinite-dimensional. Recently, several investigators have examined whether modern methods for estimation of smooth functionals of unknown probability distributions in a nonparametric model, such as targeted-minimum loss-based estimation (van der Laan and Rose, 2011, 2018) and one-step estimation (Pfanzagl, 1982), can be applied to attain inference on non-negative dissimilarity measures (Williamson et al., 2021a, b; Hines et al., 2022; Kennedy et al., 2023; Kandasamy et al., 2015). For these estimation strategies to be viable, the target estimand – in this case the non-negative dissimilarity measure – must be a pathwise differentiable functional of the underlying probability distribution with non-zero pathwise derivative. In essence, this means that the target estimand makes smooth but non-negligible changes in response to infinitesimally small perturbations around the unknown probability distribution. While pathwise differentiability of the target can be established in many examples, the pathwise derivative is typically zero when the null hypothesis of no dissimilarity holds. That the derivative is zero can be seen as a consequence of the fact that, under the null, the target estimand achieves its minimum at the true unknown distribution. In this setting, conventional estimation strategies do not achieve parametric-rate convergence or attain tractable limiting distributions, making hypothesis testing challenging.
When the target estimand satisfies additional smoothness assumptions, it can be possible to construct estimators with improved asymptotic behavior by utilizing higher-order pathwise derivatives (Pfanzagl, 1985; Robins et al., 2008; van der Vaart, 2014; Carone et al., 2018). While this approach has been successful in some examples (Luedtke et al., 2019), it is seemingly rare that for a given statistical functional, higher-order pathwise derivatives exist, so this strategy does not appear to be broadly applicable.
In this work, we propose a general method for estimation and inference on non-negative dissimilarity measures. Our proposal builds upon recent developments in the construction of omnibus tests for equality of function-valued parameters to fixed null parameters (Hudson et al., 2021; Westling, 2021). The key idea is that one can perform inference on a function-valued parameter by estimating a large collection of simpler one-dimensional estimands that act as an effective summary thereof. Here, we show that in many instances, non-negative dissimilarity measures can be represented as the largest value in a collection of simple one-dimensional estimands. In such cases, we can estimate non-negative dissimilarity measures using the maximum of suitably well-behaved estimators for these scalar quantities. Our main results show that when efficient estimators for the simple estimands are used, the resulting estimator for the non-negative dissimilarity measure achieves parametric-rate convergence under the null and also attains a tractable limiting distribution. This makes it possible to construct well-calibrated asymptotic tests of the null. We also show that when the alternative holds, our estimator is asymptotically efficient. To the best of our knowledge, our work is the first to provide a general theoretical basis for recovering parametric-rate inference on non-negative dissimilarity measures in a nonparametric model.
The remainder of the paper is organized as follows. In Section 2, we formally introduce the class of non-negative dissimilarity measures of interest, and we describe some motivating examples. In Section 3, we review an existing approach for inference based on plug-in estimation and provide a discussion of some of its limitations. In Section 4, we propose a new estimator for non-negative dissimilarity measures, and we describe its theoretical properties. In Section 5, we present multiplier bootstrap methods for testing the null of no dissimilarity, and for constructing confidence intervals. In Section 6 we discuss implementation and practical concerns. In Section 7, we illustrate how our methodology can be used to perform inference in a nonparametric regression model. We present results from our simulation study in Section 8, and we conclude with a discussion in Section 9.
2 Preliminaries
2.1 Data structure and target estimand
Let be i.i.d. random vectors, generated from an unknown probability distribution . We make few assumptions about and only require that it belongs to a flexible nonparametric model , which is essentially unrestricted, aside from mild regularity conditions. For a given probability distribution in , let be a function-valued summary of interest with domain for a positive integer and range . We denote by the evaluation of this summary at .
Suppose that is known to belong to a, possibly infinite-dimensional, function class . For a given distribution , we define a real-valued functional that satisfies
(1)
The functional measures the goodness-of-fit of any function – larger values of indicate that and are farther away from one another, in a sense. Throughout this paper, we use the shorthand notation to denote the value of the goodness-of-fit measure at .
Let be a subclass of , and let be a function that satisfies
In essence, is the closest function to among all functions in the subclass . We define as our target parameter the difference between the goodness-of-fit of and ,
(2)
and we again use the shorthand notation . Throughout this manuscript, we refer to as the improvement in fit because it represents the improvement in the goodness-of-fit attained by using the full function class instead of the reduced class.
Because is contained within , it can be seen that is a non-negative statistical functional, and is only equal to zero when provides no improvement in fit compared with . In many applications, a problem of central importance is to determine whether is inferior to in terms of goodness-of-fit. Letting , we are interested in performing a test of the null hypothesis
(3)
Additionally, because statistical functionals that have the representation in (2) have scientifically meaningful interpretations in some contexts, estimation of and confidence interval construction are also of practical interest. Our paper provides a general framework for estimation, testing, and confidence interval construction for statistical functionals of this form.
2.2 Examples
In what follows, we introduce some working examples. As a first example, we discuss statistical inference in nonparametric regression models, and second, we discuss a nonparametric approach for assessing dependence between a pair of random variables. We then describe a simple way to define a goodness-of-fit measure for any function-valued parameter.
Example 1: Inference in a Nonparametric Regression Model
Let , where is a real-valued outcome variable, and and are vectors of predictor variables with dimensions and , respectively.
We define as a (possibly large) class of prediction functions with domain and range .
Each function takes as input a realization of the predictor vector and returns as output a predicted outcome.
We are interested in studying the conditional mean of the outcome given the predictors, defined as . It is well-known that the conditional mean can be characterized as the minimizer of the expected squared error loss over , if is sufficiently large. That is, defining the goodness-of-fit measure
the conditional mean satisfies .
Consider now the set of candidate prediction functions that do not depend on , which we write as
When is large, any minimizer of the expected squared error loss over is almost everywhere equal to the conditional mean of given . We are often interested in determining whether is an important set of predictors, in the sense that it needs to be included in a prediction function in order for the optimal squared error loss to be achieved. If is not important in this sense, the conditional mean of given and does not depend on , and the difference in the expected squared error loss is zero. Otherwise, is positive. Thus, assessing variable importance can be framed as a statistical inference problem of the type described in Section 2.
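To fix ideas, the following display sketches this example in notation of our own choosing (writing Y for the outcome, X for the exposures of interest, W for the remaining predictors, F for the full class, and F0 for the reduced class); it is an assumed reconstruction rather than the paper's exact display.

\[
\Gamma_P(\theta) = E_P\big[\{Y - \theta(X, W)\}^2\big], \qquad
\theta_P = \operatorname*{argmin}_{\theta \in \mathcal{F}} \Gamma_P(\theta) = E_P[Y \mid X, W],
\]
\[
\Psi(P) = \min_{\theta \in \mathcal{F}_0} \Gamma_P(\theta) - \min_{\theta \in \mathcal{F}} \Gamma_P(\theta) \ge 0,
\qquad \mathcal{F}_0 = \{\theta \in \mathcal{F} : \theta \text{ does not depend on } x\},
\]

so that the improvement in fit is zero exactly when no function of both X and W predicts Y better, in mean squared error, than the best function of W alone.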
Many recent works have studied inference on variable importance estimands of a similar form to that we describe above (see, e.g., Williamson et al., 2021a, b; Verdinelli and Wasserman, 2021; Zhang and Janson, 2020). These works all encounter difficulties with constructing estimators for their original target estimand that achieve parametric rate convergence under the null. To the best of our knowledge, there is currently no solution available to this problem.
Example 2: Nonparametric Assessment of Stochastic Dependence
Let , where and are real-valued random variables, and let denote the log of the joint density of under with respect to some dominating measure .
Our objective here is to determine whether and are dependent.
If and are independent, by basic laws of probability, the joint density function can be expressed as the product of the marginal density functions, i.e.,
for all . We can therefore assess dependence between and by defining a goodness-of-fit measure for the joint density function, and determining whether the goodness-of-fit of the true joint density is lower than the goodness-of-fit of the product of the marginal densities.
Let be a collection of candidate values for the log density function, and assume that is large enough to contain , the log density under . The density function can be represented as a minimizer of the expected cross-entropy loss. Therefore, defining the goodness-of-fit measure
the joint density satisfies (1).
We now define as the class of candidate log density functions for which the joint density can be expressed as the product of two marginal density functions – that is,
Any minimizer of over is almost everywhere equal to the product of the marginal densities of and under . Therefore, is zero if and are independent, and is otherwise positive. One can assess dependence between and by performing inference on , so similar to the previous example, this problem falls within our framework.
The measure of dependence we have defined here is commonly referred to as the mutual information and is a widely studied measure of stochastic dependence (see, e.g., Paninski, 2003; Steuer et al., 2002). We are not aware of an existing nonparametric estimator for the mutual information that achieves parametric-rate convergence under the null of independence. This appears to be a longstanding open problem.
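In notation of our own choosing (with joint density p_{Y,Z} and marginals p_Y and p_Z for the two random variables), the improvement in fit in this example reduces to the familiar mutual information; the display below is an assumed reconstruction rather than the paper's.

\[
\Psi(P) = \min_{\theta \in \mathcal{F}_0} \Gamma_P(\theta) - \min_{\theta \in \mathcal{F}} \Gamma_P(\theta)
= E_P\!\left[\log \frac{p_{Y,Z}(Y, Z)}{p_Y(Y)\, p_Z(Z)}\right] = I(Y; Z) \ge 0,
\]

with equality to zero if and only if the two variables are independent.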
Example 3: Generic Distance
Suppose one is interested in assessing whether a given function-valued parameter is equal to a fixed and known function . For a measure on , one can define as a goodness-of-fit measure an integrated squared difference between and :
Because is non-negative, and , it is easy to see that is minimized by .
One might wish to perform inference on the quantity
Clearly, is equal to zero only when is equal to almost everywhere . Estimands of this form can be of interest when one wishes to construct an omnibus test of the hypothesis that . The framework we develop in this paper can be applied in this setting as well, and so, methodology for inference on the improvement in fit can be seen as generally useful for performing inference on function-valued parameters.
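In assumed notation (with the true function-valued parameter written as theta_P, the fixed null function as f_0, and the dominating measure as mu; these symbols are ours, not necessarily the paper's), the construction in this example reads:

\[
\Gamma_P(\theta) = \int \{\theta(x) - \theta_P(x)\}^2 \, d\mu(x), \qquad
\Psi(P) = \Gamma_P(f_0) - \Gamma_P(\theta_P) = \int \{f_0(x) - \theta_P(x)\}^2 \, d\mu(x),
\]

which is zero exactly when the true parameter equals the null function mu-almost everywhere, so a test of no improvement in fit serves as an omnibus test of this equality.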
3 Plug-in estimation of the improvement in fit
We now describe an approach for nonparametric inference on based on plug-in estimation, and we discuss the shortcomings of this approach. The methodology we describe below and its limitations are discussed extensively by Williamson et al. (2021b) in the context of nonparametric regression, though their theoretical and methodological results are more broadly applicable.
Suppose that for any , is a pathwise differentiable functional of , meaning that changes smoothly with respect to small changes in (Bickel et al., 1998). When is pathwise differentiable, it is generally possible to construct an estimator that is asymptotically linear in the sense that
(4)
where has mean zero and finite variance under , and is an asymptotically negligible remainder term. The function determines the first order asymptotic behavior of and is commonly referred to as the influence function of . Because is asymptotically linear, it is -rate consistent and asymptotically Gaussian by the central limit theorem. Conventional strategies for constructing asymptotically linear estimators include one-step estimation (Pfanzagl, 1982) and targeted minimum loss-based estimation (van der Laan and Rose, 2011, 2018).
Given an asymptotically linear estimator for , we can obtain estimators and for and by minimizing over and , respectively. That is, we take
We can then obtain the following plug-in estimator for :
It can be shown that, under mild regularity conditions, the plug-in estimator is asymptotically linear with influence function (Williamson et al., 2021b). That is, the plug-in estimator satisfies
(5)
From an initial inspection, it would appear that there is no loss in efficiency resulting from estimating and . That is, if and were known, then the estimator would have the same asymptotically linear representation as .
Under the null, and have the same influence function, and the leading term in (5) vanishes. Therefore, the convergence rate and limiting distribution of the plug-in estimator are determined by the higher-order remainder term. When is a finite-dimensional model, it is often possible to establish that, under the null, the remainder term is and attains a tractable limiting distribution. Conversely, in the infinite-dimensional setting, the remainder typically converges at a slower-than- rate, and its asymptotic distribution is difficult to characterize. This makes it challenging to approximate the null sampling distribution of and hence challenging to construct a hypothesis test for no improvement in fit. Moreover, confidence intervals based on a normal approximation to the sampling distribution can fail to achieve the nominal coverage rate when .
In order to develop an estimator for that has better asymptotic properties than the plug-in, it is helpful to further investigate the source of the plug-in estimator’s poor behavior. We can first recognize that, as is a measure of an improvement in fit, estimating involves performing a search away from to identify whether any candidate function in the difference between the full and reduced function classes, , provides a better fit than .
Suppose now that can be expressed as a collection of, potentially many, one-dimensional sub-models. Let be a fixed function from to that satisfies for any . For a scalar and a fixed function , we define as the one-dimensional sub-model
(6)
We have constructed our sub-model so that it passes through the null best fit at , i.e.,
(7)
We can therefore interpret as the path along which approaches as tends to zero. We assume that there exists a function class and a symmetric interval such that
We will see that using this representation for our model facilitates making comparisons between any function in and the best null fit.
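One concrete possibility, written in notation of our own choosing (and so only an assumed instance of the general construction), is an additive perturbation of the reduced-model best fit:

\[
\theta_{h,\delta} = \theta_0^{\mathrm{R}} + \delta\, h, \qquad h \in \mathcal{H},\ \delta \in [-\delta_0, \delta_0],
\]

so that the sub-model at delta equal to zero returns the null best fit for every direction h, as required by (7), and the full class is represented as the union of these one-dimensional sub-models over h in the class and delta in the symmetric interval.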
We now define as the goodness-of-fit of , i.e.,
(8)
and similarly as above, we use the shorthand notation . We assume that is a smooth convex function, and we denote the first and second derivatives of in by
(9)
We define as the minimizer of the goodness-of-fit measure along the parametric sub-model over the interval :
Due to the convexity of , for large enough , is the unique solver of . Under this regime, satisfies , and we can write as
We can see that, in view of condition (7), only when .
Let and be the estimators for and described earlier in this section, and let be the plug-in estimator for the sub-model. We define the plug-in estimator for as
and we write its first and second derivatives as
We define the plug-in estimator for as the minimizer of over . For large , satisfies for all in , and the plug-in estimator for satisfies
The plug-in estimator for can therefore be expressed as
Using this representation for the plug-in estimator makes it easier for us to carefully study its asymptotic behavior in the setting where . By performing a second order Taylor expansion for around for every , we can write the plug-in as
where is a higher order remainder term that should approach zero at a faster rate than the leading terms. Because for all , the first term in this expansion vanishes, leaving us with
If is consistent for uniformly in , then by Slutsky’s theorem, one can replace the random quantities with the fixed values in the above display. It would appear then that, under the null, the limiting distribution of is determined by the behavior of the stochastic process .
If it were possible to characterize the joint limiting distribution of under the null where for all , the limiting distribution of could be characterized using a straightforward application of the continuous mapping theorem. Typically, is a pathwise differentiable parameter for each , making it possible to construct estimators thereof that converge at a -rate and achieve a Gaussian limiting distribution. Ideally, one would be able to establish that the standardized process converges weakly to a Gaussian process as long as the collection of paths is not overly complex. However, in many settings, this property is not satisfied by the plug-in estimator. One can view the plug-in estimator for as a functional of the estimator for . As stated above, estimating requires us to estimate the nuisance parameter . In settings where is a large nonparametric function class, our estimator for will necessarily converge slower than the parametric rate of and may retain non-negligible asymptotic bias. Consequently, generates bias for , which leads to retaining non-negligible bias as well. Indeed, will typically converge slower than the parametric rate of , causing to converge at a sub-optimal rate and achieve a non-standard limiting distribution.
To summarize, estimating requires one to perform a search away from in order to attempt to identify a candidate function in that provides an improvement in the goodness-of-fit. In the regime we describe above, performing this search is equivalent to finding the best fit along each parametric sub-model that passes through the null, and subsequently taking the best fit among all of the sub-models. From the above argument, we can see that the plug-in estimator has poor asymptotic properties because the plug-in estimator for the best fit along the parametric sub-models can be sub-optimal when is large. Thus, the key to obtaining an estimator with improved asymptotic properties is to efficiently estimate the best fit along each of the sub-models that comprise .
4 Bias-corrected estimation of the improvement in fit
From the discussion in Section 3, it would seem that if one had an efficient estimator for the goodness-of-fit along each parametric sub-model, and hence an efficient estimator for , one could obtain an estimator for the improvement in fit that has better asymptotic properties than the plug-in. In what follows, we describe a general strategy for constructing an estimator that has a tractable limiting distribution when is at the boundary of the parameter space. We show that our newly-proposed estimator enjoys the same -rate convergence that is typically attained in parametric models.
4.1 Uniform inference along the parametric sub-models
Our proposal requires us to construct an estimator for that enables us to perform inference uniformly along the collection of parametric sub-models. In this sub-section, we first outline a set of sufficient conditions under which an estimator has asymptotic properties that facilitate uniform inference. We then describe a strategy for constructing an estimator that satisfies these conditions.
We begin by providing assumptions upon which our first main theoretical result relies. We consider two types of assumptions. The first type (A) is a set of deterministic conditions on the goodness-of-fit functional and the underlying probability distribution, whereas the second set of assumptions (B) is stochastic in nature and describes conditions that our estimator must satisfy.
- Assumption A1: For any and any , is pathwise differentiable in a nonparametric model, and its nonparametric efficient influence function is given by .
- Assumption A2: and are twice differentiable in for each in , and the derivatives are given by
- Assumption A3: There exist positive constants such that, for any , implies that
- Assumption A4: For each , . Additionally, is bounded away from zero in a neighborhood of , uniformly in . That is, is positive whenever is small.
- Assumption A5: Both the function classes and are -Donsker.
Assumption A1 requires that the goodness-of-fit is pathwise differentiable, which, as noted in Section 3, enables us to construct -consistent estimators. When is a pathwise differentiable estimand, its efficient influence function is guaranteed to exist, and knowledge of the efficient influence function is often needed for constructing efficient estimators and studying their asymptotic properties in nonparametric models. We note that because we assume for any (recall we assume (7) holds), the efficient influence functions and are also equal. Assumptions A2 and A3 state that the goodness-of-fit must be smooth and convex along each of the parametric sub-models. Assumption A4 requires that is large enough to contain the global optimizer of over , and that the goodness-of-fit satisfies some additional smoothness constraints in a neighborhood of the optimizer. Assumption A5 states that, while may be specified as a large nonparametric function class, it must satisfy some mild complexity constraints.
- Assumption B1: For any with we have that .
- Assumption B2: is an asymptotically linear estimator for in the sense that the remainder satisfies
- Assumption B3: The derivative of exists and is given by . Moreover, letting , we have
- Assumption B4: The second derivative of exists and is given by . Moreover, letting , we have
Assumption B1 states that the estimator for the goodness-of-fit along any parametric sub-model takes the same value at . In view of condition (7), all sub-models intersect and attain the same value for the goodness-of-fit at , so it is natural to assume that our estimator also has this property. Assumption B2 places a requirement that is an asymptotically linear estimator for , where the asymptotic linearity holds uniformly over . Assumption B3 states that is differentiable, and the derivative is an asymptotically linear estimator for , uniformly over . Finally, Assumption B4 requires that the second derivative of exists and converges in probability to the second derivative of , uniformly over .
For a given estimator of , let satisfy
The following theorem states that, under mild regularity conditions, is an asymptotically linear estimator for , and moreover the collection , when appropriately standardized, achieves a Gaussian limiting distribution.
Theorem 1.
Let denote the space of bounded functionals on , and let be a tight mean zero Gaussian process with covariance
If Assumptions A1-A4 hold, and if satisfies Assumptions B1-B4, then is asymptotically linear with influence function
Moreover, if A5 also holds, then the process converges weakly to as an element of , with respect to the supremum norm.
Theorem 1 can be viewed as a generalization of well-known results that show M-estimators are asymptotically linear in finite-dimensional models (see, e.g., Theorem 5.23 of van der Vaart, 2000). Our result on uniform asymptotic linearity in infinite-dimensional models can be proven using a fairly standard argument.
In what follows, we suggest some approaches for constructing an estimator that satisfies Assumptions B1-B4. We describe at a high level what types of conditions are needed for a given estimation strategy to be valid, though the specific requirements depend on the target estimand and vary from problem to problem. Later on, in Section 7, we demonstrate how to construct estimators and that satisfy Assumptions B1-B4 in an example.
Suppose that one has available an estimator for the underlying probability distribution . Typically, estimation of the entire probability distribution is not necessary, and one will only need to estimate nuisance components upon which and depend. We assume that and depend on only through a nuisance , which resides in a space endowed with norm . The true value of the nuisance component is given by , and the plug-in estimator for the nuisance is .
As a starting point, one might consider using as an estimator for . If belongs to the model , the plug-in estimator satisfies Assumption B1. This leaves Assumptions B2 through B4 to be verified. Suppose now that is itself pathwise differentiable and can therefore be estimated at an -rate. Then if is an asymptotically linear estimator for , one can argue that is also asymptotically linear by applying the delta method. Assumption B2 then holds, as long as the asymptotic linearity is preserved uniformly over . Asymptotic linearity of (Assumption B3) and consistency of (Assumption B4) can be established using a similar argument.
In many instances, the nuisance can include quantities such as density functions or conditional mean functions, which are not pathwise differentiable in a nonparametric model. In this case, it is not possible to construct an -rate consistent estimator for the nuisance. Obtaining an estimator for the nuisance usually involves making a bias-variance trade-off that may be sub-optimal for the objective of estimating the goodness-of-fit. When the nuisance estimator retains non-negligible bias, it is possible that the bias propagates, leading to being biased as well. As a consequence, may not be asymptotically linear, and we may require more sophisticated methods to construct an -consistent estimator.
One widely-used method for obtaining an asymptotically linear estimator when the initial estimator is biased is to perform a one-step bias correction (Pfanzagl, 1982). Consider the plug-in estimator for the efficient influence function . The empirical average of the estimator for the efficient influence function serves as a first-order correction for the bias of the initial estimator. By adding this empirical average to the initial estimator, one can obtain the so-called one-step estimator:
It can be easily seen that the one-step estimator satisfies Assumption B1. In what follows, we briefly discuss what arguments one will typically use to verify Assumptions B2-B4. While we do not provide a detailed discussion here, we refer readers to a recent review by Hines et al. (2022), which provides a more in-depth explanation.
The estimation error of the one-step estimator has the exact representation,
where we define and as
Asymptotic linearity of the one-step estimator follows if it can be established that and converge to zero in probability at an -rate. The first term is a difference-in-differences remainder that is asymptotically negligible when is consistent for and is contained within a -Donsker class (see Lemmas 19.24 and 19.26 of van der Vaart, 2000). The second term is a second-order remainder term, which can usually be bounded above by the squared norm of the difference between the nuisance estimator and its true value, . One can argue that if the nuisance estimator is -consistent with respect to , then . Even in a nonparametric model, there exist several approaches for constructing -rate consistent nuisance estimators when one makes only mild structural assumptions on , such as smoothness or monotonicity (see, e.g., van de Geer, 2000; Tsybakov, 2009). To verify that Assumptions B3 and B4 hold, one can perform a similar analysis to show that the first and second derivatives of the remainder terms and , with respect to , tend to zero at the requisite rate.
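Written out in notation of our own choosing (with the empirical measure denoted by P_n, the true efficient influence function by phi, and its plug-in estimate by phi-hat), the standard decomposition behind this argument is as follows; this is an assumed reconstruction, not the paper's display.

\[
\hat\Gamma_n(h,\delta) - \Gamma_0(h,\delta)
= (\mathbb{P}_n - P_0)\,\phi_{h,\delta}
+ \underbrace{(\mathbb{P}_n - P_0)\,(\hat\phi_{h,\delta} - \phi_{h,\delta})}_{\text{empirical process remainder}}
+ \underbrace{\Gamma_{\hat P_n}(h,\delta) - \Gamma_0(h,\delta) + P_0\,\hat\phi_{h,\delta}}_{\text{second-order remainder}},
\]

where the first term drives the asymptotically linear behavior, and asymptotic linearity follows once the two remainder terms are shown to vanish at the rate described above.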
While we focused on one-step estimation above because we find its simplicity appealing, other strategies for constructing bias-corrected estimators, such as targeted minimum loss-based estimation, could alternatively be used. These strategies are usually also viable under a similar set of regularity conditions.
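For concreteness, the sketch below shows the generic one-step update in R. The helper functions `plugin_fn` and `eif_fn` are hypothetical placeholders (not provided by the paper): the first returns the plug-in estimate of the goodness-of-fit along a sub-model, and the second evaluates the estimated efficient influence function at each observation.

```r
# Generic one-step bias correction (sketch; `plugin_fn` and `eif_fn` are
# hypothetical helpers supplied by the user).
one_step <- function(data, plugin_fn, eif_fn, h, delta) {
  plugin_est <- plugin_fn(data, h, delta)  # initial (possibly biased) plug-in estimate
  eif_vals   <- eif_fn(data, h, delta)     # estimated influence function at each observation
  plugin_est + mean(eif_vals)              # add the empirical mean as a first-order correction
}
```

In principle, the same template can be reused for the derivative estimators appearing in Assumptions B3 and B4, with the corresponding derivative influence functions substituted.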
4.2 Asymptotic properties of proposed estimator
We are at this point prepared to describe the bias-corrected estimator for and its asymptotic properties. As stated in Section 3, we estimate as
(10)
where is an estimator satisfying the conditions outlined in Section 4.1.
In this section we establish weak convergence of . We show that attains a tractable limiting distribution under mild regularity conditions, but the limiting distribution and convergence rate depend on the true value of . We study two cases. First, we consider the setting in which , and the null hypothesis of no improvement in fit (3) holds. Second, we study the case in which is a positive constant.
Case 1: The improvement in fit is zero ()
Suppose that the null of no improvement in fit holds. Recall from Section 2 that when , . Also, as discussed in Section 3, by performing a Taylor expansion for around , we have
(11)
for some satisfying . Under Assumption B4, we are able, in (11), to replace with . This and the fact that allow us to write
(12)
Thus, under the null, can be represented as the squared supremum of an empirical process, plus an asymptotically negligible remainder. By applying Theorem 1 in conjunction with Slutsky’s theorem and the continuous mapping theorem, we have that converges weakly to , where is the Gaussian process described in Theorem 1. The following theorem states this result formally.
Theorem 2.
Suppose that the null hypothesis of no improvement in fit (3) holds, and the Assumptions of Theorem 1 are all satisfied. Then converges weakly to .
We can apply Theorem 2 to obtain an approximation for the sampling distribution of under the null of zero improvement in fit, making it easy for us to perform a hypothesis test. In particular, Theorem 2 implies that a test which rejects the null when is larger than the quantile of the distribution of will achieve type-1 error control at the -level in the limit of large .
We used two key ingredients to construct an improvement in fit estimator with parametric-rate convergence under the null. First, we found it useful to represent the difference between the full and reduced models for the function-valued parameter of interest as the union of many one-dimensional parametric sub-models. We have deduced that, under the null, the asymptotic behavior of an improvement in fit estimator is determined in large part by the complexity of the collection of paths along which the estimated goodness-of-fit minimizer over the full model can possibly approach the minimizer over the reduced model. We found it necessary to constrain the complexity to ensure that the improvement in fit estimator converges sufficiently quickly. In our regime, this can be easily achieved by restricting the size of . We expect that if one were to assume a different form for , one would still need to impose a constraint that plays a similar role in order to obtain an -rate consistent estimator. Second, efficient estimation of the improvement in fit along any sub-model is needed. In settings where the reduced model is infinite-dimensional, estimation of the goodness-of-fit minimizer over the reduced model can generate bias for the improvement in fit estimator and reduce its convergence rate. Fortunately, an efficient estimator can be obtained using standard techniques for bias correction.
Case 2: The improvement in fit is bounded away from zero
Now, consider the setting where is a positive constant. Let and be functions that satisfy
(13)
We can express the estimation error of as
One might expect that should approach as grows, so should behave similarly to . In fact, if one could establish that
(14)
then one would be able to conclude that is asymptotically linear with influence function under Assumption B2.
The remainder term in (14) is under mild assumptions. Because is zero under Assumption B1, it only needs to be shown that is asymptotically negligible. Because the goodness-of-fit estimator is asymptotically linear, is approximately equal to , which is commonly referred to as the excess risk in the literature on M-estimation (van de Geer, 2000). Thus, in essence, one can verify (14) by showing that the excess risk converges to zero in probability at an -rate. This can be done using standard arguments from the M-estimation literature. The following result provides explicit conditions under which is an asymptotically linear estimator for .
Theorem 3.
Suppose that the improvement in fit is positive, i.e., . Suppose further that Assumptions A1, A5, B1, and B2 hold, and there exists a sequence for some such that
(15)
Then is an asymptotically linear estimator for with influence function
An important consequence of Theorem 3 is that is asymptotically efficient in a nonparametric model, and hence performs as well as the plug-in estimator described in Section 3 when . The assumption in (15) is a type of smoothness condition that is assumed commonly in the literature on estimation in high-dimensional and nonparametric models (see, e.g., van de Geer, 2008; Negahban et al., 2012; Bibaut and van der Laan, 2019). The condition ensures that and are close in distance when is small.
Some conditions that are needed by Theorem 2 are not needed by Theorem 3. Notably, it is not necessary for to be zero. This means that can be mis-specified in the sense that the interval is too small, and along any sub-model , there can exist a candidate that achieves a better fit than . In other words, we allow there to be for which . Even then, remains an asymptotically linear estimator for .
5 Construction of tests and intervals for the improvement in fit
In this section, we propose strategies for testing and confidence set construction for the improvement in fit. Our approach uses a computationally efficient bootstrap algorithm, which we describe in detail below.
We also provide theoretical results that establish the validity of our proposed bootstrap method. Before proceeding, it is helpful to first state the regularity conditions upon which our results rely.
- Assumption C1: The function class depends on only through a nuisance , which takes values in a space endowed with norm , and our estimator satisfies .
- Assumption C2: As approaches zero, both and tend to zero as well.
- Assumption C3: There exist such that the function classes
are -Donsker, with finite squared envelope function and finite bracketing integral (see, e.g., Chapter 19 of van der Vaart, 2000), and
Assumption C1 states that, for our bootstrap methods to be viable, estimation of the entire probability distribution is not needed, and it is sufficient to only estimate nuisance parameters upon which the efficient influence function depends. Recall that we made a similar assumption when we described construction of asymptotically linear estimators in Section 4.1. Assumption C2 states that when we estimate the nuisance components consistently, the plug-in estimator for the efficient influence function is consistent as well. Assumption C3 states that our efficient influence function estimator belongs to a function class that is not overly complex, with probability tending to one.
5.1 Approximation of the null limiting distribution
To perform a test of the hypothesis of no improvement in fit, we need an approximation for the asymptotic cumulative distribution function of under the null. While we are able to characterize the null limiting distribution of using Theorem 2, it is possible that a closed form expression for the distribution function is not available. However, we can use resampling techniques to obtain an approximation.
We approximate the null limiting distribution of using the multiplier bootstrap method proposed by Hudson et al. (2021). The multiplier bootstrap is a computationally efficient method for approximating the sampling distribution of estimators that can be represented as a functional of a well-behaved empirical process, plus a negligible remainder. Such an approach is applicable in our setting because has such a representation (see (12)).
For and large, let be an -dimensional vector of independent Rademacher random variables, also drawn independently from . We define the multiplier bootstrap statistic
(16)
as an approximate draw from the null limiting distribution of .
For a realization of , let
(17)
denote the p-value for a test of no improvement in fit, based on the limiting distribution of . Given a large sample of multiplier bootstrap statistics, one can approximate the p-value as
The following result due to Hudson et al. (2021) provides conditions under which the bootstrap approximation of the limiting distribution is asymptotically valid, and use of the bootstrap p-value is appropriate.
Theorem 4.
Let be independent Rademacher random variables, also independent of , and let . Under Assumptions C1 through C3, converges weakly to , conditional upon , in outer probability.
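In practice, the p-value approximation in (17) is straightforward to compute. The sketch below (R) assumes a hypothetical helper `boot_stat_fn(xi)` that evaluates the bootstrap statistic in (16) for a given vector of multipliers, and a test statistic `test_stat` on the same scale as the bootstrap draws; neither is specified by the paper.

```r
# Multiplier-bootstrap p-value (sketch; `boot_stat_fn` and the scaling of
# `test_stat` are assumptions on our part).
multiplier_pvalue <- function(test_stat, boot_stat_fn, n, B = 1000) {
  boot_draws <- replicate(B, {
    xi <- sample(c(-1, 1), size = n, replace = TRUE)  # Rademacher multipliers
    boot_stat_fn(xi)
  })
  mean(boot_draws >= test_stat)  # proportion of bootstrap draws exceeding the observed statistic
}
```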
5.2 Interval construction for
In this section, we present a method for constructing a confidence interval for . The standard approach for interval construction based on a Gaussian approximation of the sampling distribution of an estimator is inadvisable because is only asymptotically Gaussian when is bounded away from zero. We show that this issue can be overcome by instead constructing a confidence interval via hypothesis test inversion.
Suppose that one could perform a level test of the hypothesis for any . Then the set
would be a confidence interval for . That is, in the limit of large , would contain with probability at least .
We construct a test of using the test statistic . Let be an approximation for the quantile of the limiting distribution of . For a suitable , a test that rejects the null when exceeds will achieve asymptotic type-1 error control at the level . Moreover, an asymptotically valid confidence set can be obtained by setting
It is not immediately obvious how to select because the limiting distribution and convergence rate of depend on whether . To address this concern, in what follows, we present a multiplier bootstrap approximation of the limiting distribution that adapts to the unknown value of .
Let be any random sequence that converges to one in probability when , and converges to zero in probability when . For instance, we can set
(18)
where is the multiplier bootstrap p-value. That this choice of is valid follows from the fact that is consistent for and -rate convergent under the null. Now, similar to Section 5.1, for and large, we generate a pair of random variables as follows. The first random variable is in (16), which is a multiplier bootstrap approximation of a draw from the limiting distribution of under the setting where . We take the second random variable as a multiplier bootstrap approximation of a draw from the limiting distribution of when . Specifically, we define this second random variable as
(19)
where is the vector of Rademacher random variables defined in Section 5.1 (the same vector may be used to construct and ), and is as defined in (13). Finally, we take an approximate draw from the sampling distribution of as , where
and we set as the quantile of .
Because converges to zero when is zero and approaches one when is large, adaptively identifies whether or is a more appropriate approximation of a draw from the sampling distribution of . The following result states that is an asymptotically valid approximation regardless of whether is zero or nonzero, thereby justifying our selection of .
Theorem 5.
Let be independent Rademacher random variables, also independent of . Let be a random sequence that converges to one in probability when and converges to zero in probability when . Let , let , and . Let be a mean zero Gaussian random variable with variance , with defined in (13). Suppose that Assumptions C1-C3 are met. Then when , and the conditions of Theorem 2 hold, converges weakly to , conditional upon , in outer probability. When , , and the conditions of Theorem 3 hold, converges weakly to , conditional upon , in outer probability.
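Computationally, the confidence interval is obtained by inverting this family of tests over a grid of candidate values. The R sketch below uses hypothetical helpers (not provided by the paper): `stat_fn(psi)` returns the test statistic for the candidate null value psi, and `crit_fn(psi, alpha)` returns the corresponding adaptive multiplier-bootstrap critical value described above.

```r
# Confidence interval by test inversion (sketch; `stat_fn` and `crit_fn` are
# hypothetical helpers, and returning a range assumes the non-rejected set is
# an interval).
invert_test_ci <- function(stat_fn, crit_fn, psi_grid, alpha = 0.05) {
  keep <- vapply(psi_grid, function(psi) stat_fn(psi) <= crit_fn(psi, alpha),
                 logical(1))
  range(psi_grid[keep])  # endpoints of the set of non-rejected candidate values
}
```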
6 Implementation
In this section we discuss implementation of our proposed method for inference on the improvement in fit. First we describe how to construct a model for . We subsequently discuss how to calculate the improvement in fit estimator and how to implement our proposed bootstrap procedures for testing the null of no improvement in fit and constructing confidence sets.
6.1 Constructing the collection of parametric sub-models
We propose to construct as a space of linear combinations of basis functions from to , where the coefficients for the basis functions are required to satisfy a constraint that induces structure on the function class. Let be a vector space defined as the span of basis functions from to . Let be a functional on that measures the complexity of any function in , with larger values corresponding to greater complexity. We set to have bounded complexity. Additionally, we impose a constraint that is bounded away from zero. In view of Assumption A4, such a constraint is needed in order for us to establish weak convergence of our proposed improvement in fit estimator under the null. Finally, we set , where we define
(20)
and is a tuning parameter. In practice, we recommend truncating the basis at a large level to facilitate computation.
As an example, one could construct using a reproducing kernel Hilbert space (RKHS). Let be a positive definite kernel function, and let denote its unique reproducing kernel Hilbert space, endowed with inner product . One can select the basis functions as the eigenfunctions of , with respect to the RKHS inner product. We denote the corresponding eigenvalues by . The complexity of any function can be measured by its RKHS norm . The RKHS norm is a measure of smoothness, with higher values corresponding to less smooth functions. Reproducing kernel Hilbert spaces are appealing because they are flexible and contain close approximations of smooth functions (Micchelli et al., 2006). Moreover, the fact that the RKHS norm can be expressed as a quadratic form in the basis coefficients simplifies computation. Alternative approaches, such as constructing using a spline basis and setting as a variation norm, are commonly used in nonparametric regression problems and could also be considered (see, e.g., Tibshirani et al., 2005; Benkeser and van der Laan, 2016).
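For illustration, the sketch below (R) approximates the leading eigenfunctions of a Gaussian kernel by the eigendecomposition of the kernel matrix evaluated at the observed design points, a standard Nystrom-type approximation; the bandwidth `sigma` and truncation level `d` are arbitrary illustrative choices rather than values used in the paper.

```r
# Approximate leading RKHS basis functions of a Gaussian kernel via the
# eigendecomposition of the kernel matrix at the observed points (sketch;
# `sigma` and `d` are illustrative tuning values).
rkhs_basis <- function(X, sigma = 1, d = 10) {
  X  <- as.matrix(X)
  K  <- exp(-as.matrix(dist(X))^2 / (2 * sigma^2))  # Gaussian kernel matrix
  eg <- eigen(K, symmetric = TRUE)
  list(values  = eg$values[1:d] / nrow(X),  # approximate operator eigenvalues
       vectors = eg$vectors[, 1:d])         # basis functions evaluated at the data
}
```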
We now discuss specification of the interval . The choice of does not affect the null limiting distribution but may affect the limiting distribution when . The main role of is to regularize to ensure that the variance of the estimator is well-controlled. Recall from our discussion of Theorem 3 in Section 4.2 that , which is defined to be the minimizer of over , does not need to be the global minimizer over ; our results show that has a well-behaved limiting distribution regardless. We treat the width of the interval as an additional tuning parameter.
We find that in some settings, can retain good asymptotic behavior under the alternative even when is taken to be an interval of arbitrary width. In view of Theorem 1, the variance of has an inverse relationship with . Therefore, constructing to only include functions for which is bounded from below also serves to regularize . In particular, when is a quadratic function, and is a constant function, it is sufficient to ensure that is bounded away from zero. Because this constraint is already incorporated into with the above specification, constraining the width of is unnecessary in such instances.
We recommend selecting the tuning parameters , and (when needed) by performing cross-validation with respect to the loss . We note that while our asymptotic results implicitly assume that is pre-specified, it is argued in Hudson et al. (2021) that one can select data-adaptively without compromising type-1 error control as long as the adaptive choice converges to a fixed class. In some settings, e.g., when the sample size is small, it is possible that the data-adaptive choice is highly or moderately variable, and that failure to account for this variability could lead to type-1 error inflation. One can avoid this issue by using a more conservative sample splitting approach, wherein one partition of the data is used for tuning parameter selection, and a second independent partition is used to estimate .
6.2 Computation
We now discuss how to calculate the improvement in fit estimator and how to implement the multiplier bootstrap for hypothesis testing and confidence interval construction.
Calculating requires us to solve the optimization problem in (10). When we use the specification of in Section 6.1, it is possible for this problem to be non-convex, and it can be particularly challenging to solve when a closed-form solution for is not available. We find, however, that when is a quadratic function of , a computationally efficient solution is available. In Examples 1 and 3 in Section 2.2, is a quadratic function when one considers a sub-model of the form , so this special case captures at least some examples. In what follows, we present an approach for solving this problem in the setting where is a quadratic function of . We then describe a more general method in the Supplementary Materials.
Suppose that for any and , there exists a -dimensional matrix and a -dimensional vector such that
where “const” refers to a constant that depends neither on nor . It can be easily seen that has the exact representation
Additionally, the second derivative estimator satisfies for all . Now, can be expressed as
Suppose now that is available in quadratic form in the coefficients of the basis functions – that is, for a matrix . Using the above representation for , we can express as
(21)
The optimization problem in (21) is a quadratically constrained quadratic program (QCQP) and can be solved using publicly available software, such as the CVXR package in R (Fu et al., 2017).
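To illustrate how such a problem is passed to CVXR, the toy sketch below specifies a convex QCQP: a quadratic objective with a single quadratic complexity constraint. The matrices `A`, `b`, `Q`, and the bound `r` are placeholders of our own (the actual matrices in (21) are problem-specific), and the paper's problem may additionally require handling non-convexity, for example by solving over sign constraints separately.

```r
# Toy convex QCQP solved with CVXR (sketch; A, b, Q, r are placeholder inputs).
library(CVXR)

solve_qcqp <- function(A, b, Q, r) {
  d    <- length(b)
  beta <- Variable(d)
  # minimize (1/2) beta' A beta - b' beta  subject to  beta' Q beta <= r
  objective  <- Minimize(0.5 * quad_form(beta, A) - t(b) %*% beta)
  constraint <- list(quad_form(beta, Q) <= r)
  result <- solve(Problem(objective, constraint))
  result$getValue(beta)
}

# Example usage with positive definite A and Q so that the problem is convex.
set.seed(1)
A <- diag(3); Q <- diag(3); b <- rnorm(3)
solve_qcqp(A, b, Q, r = 1)
```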
Multiplier bootstrap samples can be calculated using a similar method. We first observe that when we use the specification of in (6), the Riesz representation theorem implies that is a linear functional of . Consequently, the efficient influence function is also linear in . Therefore, for any , we have
Now, let be an matrix with element given by , and let be an -dimensional vector of Rademacher random variables, as in Sections 5.1 and 5.2. As with , the multiplier bootstrap test statistic in (16) can be expressed as
(22)
The optimization problem in (22) is also a QCQP and can be solved efficiently. Finally, can be written as
where is a solution to (21).
7 Illustration: Inference in a Nonparametric Regression Model
In this section, we apply our framework to perform inference on the non-negative dissimilarity measure described in Example 1. In the Supplementary Materials, we describe how our framework can be used to construct a test of stochastic dependence, following the setting described in Example 2.
In this problem, we are tasked with assessing whether a subset of a collection of predictor variables is needed for attaining an optimal prediction function. As in Section 2.2, our data take the form , where is a real-valued outcome variable, is the predictor vector of interest, and is a vector of covariates. We denote by the conditional mean of the outcome given the covariates. Our objectives are to assess whether there exists a function that depends on both and which predicts better than , and to measure the best achievable improvement in predictive performance by any function in a large class.
We specify the parametric sub-model as
The goodness-of-fit of any candidate in the sub-model is defined as
and the first and second derivatives are given by
As we noted in Section 4, knowledge of the efficient influence function of is helpful for constructing an asymptotically linear estimator thereof. Additionally, we require the derivative of the efficient influence function to exist. Let represent the conditional mean of given . The form of the efficient influence function and its derivative are provided in the following lemma.
Lemma 1.
The efficient influence function of is given by
It is also easy to see that the efficient influence function is twice differentiable in . The evaluation of its first and second derivatives at are given by
From Lemma 1, we can see that and depend on the nuisance parameters and . One can obtain nonparametric estimators and for the nuisance using any of a wide variety of flexible data-adaptive regression procedures, including kernel ridge regression (Wahba, 1990), neural networks (Barron, 1989), the highly adaptive lasso (Benkeser and van der Laan, 2016), or the Super Learner (van der Laan et al., 2007). In our implementation, we use kernel ridge regression, in large part because it is computationally efficient.
It may at first seem computationally difficult to estimate the conditional mean of given for all . However, because we have assumed that can be represented as a linear combination of basis functions , we can perform this calculation without too much trouble. For , we have the representation . Therefore, one can obtain estimators for for and then estimate as .
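A minimal kernel ridge regression sketch in R is given below; it can be used both for the conditional mean of the outcome given the covariates and, column by column, for the conditional mean of each basis function given the covariates. The Gaussian kernel, bandwidth `sigma`, and penalty `lambda` are illustrative choices of ours, not the paper's settings.

```r
# Kernel ridge regression for the nuisance conditional means (sketch; the
# kernel, `sigma`, and `lambda` are illustrative choices).
krr_fit <- function(W, Y, sigma = 1, lambda = 1e-2) {
  W <- as.matrix(W)
  K <- exp(-as.matrix(dist(W))^2 / (2 * sigma^2))           # Gaussian kernel matrix
  alpha <- solve(K + lambda * nrow(W) * diag(nrow(W)), Y)   # ridge coefficients
  function(Wnew) {
    Wnew <- as.matrix(Wnew)
    D2 <- outer(rowSums(Wnew^2), rowSums(W^2), "+") - 2 * Wnew %*% t(W)
    exp(-D2 / (2 * sigma^2)) %*% alpha                      # predictions at new points
  }
}

# Usage sketch: `Phi` is a hypothetical n-by-d matrix holding each basis
# function evaluated at the observations.
# mu_hat <- krr_fit(W, Y)        # estimate of E[Y | W]
# m_hat  <- lapply(seq_len(ncol(Phi)), function(j) krr_fit(W, Phi[, j]))
```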
Consider the following initial plug-in estimator for :
and consider the efficient influence function estimator
The initial estimator for is biased, so a corrected estimator is needed so that one can perform inference. We can use the following one-step bias-corrected estimator:
(23)
The following lemma provides conditions under which the one-step estimator satisfies Assumptions B1 through B4.
Lemma 2.
Suppose that the nuisance estimators satisfy the rate conditions
Suppose also that there exist -Donsker classes , and a -Glivenko-Cantelli class such that, with probability tending to one, each of the following holds:
Then the one-step estimator in (23) satisfies Assumptions B1-B4.
The condition on the convergence rates of the nuisance estimators is standard and holds when all nuisance estimators are -rate convergent. This rate is attained by many nonparametric regression estimators under weak structural assumptions on the true conditional mean functions, so the condition is fairly mild.
We conclude with a brief comment about computation. We observe that is a quadratic function of , so the implementation scheme described in Section 6.2 can be applied in this example.
8 Simulation Study
In this section, we assess the empirical performance of our proposal in a simulation study. We again consider the nonparametric regression task discussed in Section 7.
We generate synthetic data sets as follows. First, we generate independent -dimensional random vectors from a Gaussian distribution with mean zero and covariance
We then define the predictor vector as , where is the standard normal distribution function. We generate the outcome according to the model
where the white noise is a continuous uniform random variable, drawn independently of . Our objective is to determine which of the elements of the predictor are conditionally associated with the outcome , given the other elements. Clearly, is conditionally dependent on the first and second elements, and not the third.
Our target of inference is the improvement in fit estimand defined in Example 1. More specifically, for each predictor, we estimate the improvement in fit comparing a flexible regression model containing prediction functions that may depend on all predictors, with a model only containing functions that do not depend on the predictor of interest. We estimate all nuisance parameters described in Section 7 using kernel ridge regression. We construct using the reproducing kernel Hilbert space corresponding to the Gaussian kernel, and we consider two choices for the smoothness parameter. First, we use an oracle approach, which sets as the smoothest class containing a function that is proportional to the difference between the full conditional mean of given all predictors , and the conditional mean of given predictors that are not of interest. We recall from our discussion in Section 2.2 that the difference between conditional means is the function that maximizes the improvement in fit. Second, we consider a data-adaptive procedure, where the smoothness parameter is selected using cross-validation, and no sample-splitting is performed.
We compare our method with a sample-splitting approach proposed by Williamson et al. (2021b). They propose to separate the data set into two independent partitions: one is used to estimate the optimal goodness-of-fit over the full model, and the other is used to estimate the optimal goodness-of-fit over the reduced model. Inference on can then be performed using a two-sample Wald test, and Wald-type intervals can be constructed similarly. We expect our approach to offer an improvement because we do not require sample-splitting for valid inference. We also expect our proposal to benefit from achieving the fast -rate of convergence at the boundary, whereas the sample-splitting approach is only -rate convergent.
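For concreteness, the following simplified sketch shows the shape of such a sample-splitting comparison: mean losses from the two halves are compared with a two-sample Wald statistic. The loss construction and variance estimator are placeholders and should not be read as the exact procedure of Williamson et al. (2021b).

```python
import numpy as np
from scipy.stats import norm

def split_sample_wald(loss_reduced, loss_full, level=0.05):
    """Simplified sketch of a sample-splitting comparison: per-observation
    losses for the reduced model are computed on one half of the data and
    losses for the full model on the other half, and the mean difference is
    assessed with a two-sample Wald statistic. Inputs are placeholders."""
    n1, n2 = len(loss_reduced), len(loss_full)
    psi_hat = np.mean(loss_reduced) - np.mean(loss_full)        # improvement in fit
    se = np.sqrt(np.var(loss_reduced, ddof=1) / n1 + np.var(loss_full, ddof=1) / n2)
    p_value = 1.0 - norm.cdf(psi_hat / se)                      # one-sided test of no improvement
    z = norm.ppf(1.0 - level / 2.0)
    return psi_hat, se, p_value, (psi_hat - z * se, psi_hat + z * se)
```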
We generate 1000 synthetic data sets under the data-generating process described above for . We compare all methods under consideration in terms of mean squared error, type-1 error control under the null, power under the alternative, confidence interval coverage, and average confidence interval width.
Figure 1 shows the root mean squared error for each proposed improvement in fit estimator as a function of the sample size. We find that, while the root mean squared error of each estimator approaches zero, our proposed estimators converge much more quickly than the sample splitting estimator, with the oracle version of our approach performing best.
In Figures 2 and 3 we plot the rejection probability for a test of the null of no improvement in fit as a function of the nominal type-1 error level. We find that all tests considered achieve asymptotic type-1 error control in the setting where , though we acknowledge there is type-1 error inflation when the sample size is small. We also find that our proposed tests are well-powered under the alternative, both outperforming the sample-splitting test.
In Figures 4 and 5 we plot the coverage probability and average width of 95% confidence intervals as a function of the sample size. We find that all approaches considered achieve nominal coverage as the sample size grows, though there is a tendency for our proposed intervals to exhibit undercoverage when the sample size is small. Our proposed estimator with oracle selection of achieves the lowest average width, followed by the adaptive approach and the sample-splitting approach.
9 Discussion
In this work, we have presented a general framework for inference on non-negative dissimilarity measures. Our proposed methodology has wide-ranging utility. As examples, we described how this framework can be applied to perform rate-optimal inference on statistical functionals arising in nonparametric regression and graphical modeling problems. Our framework can also be useful in other settings, such as causal inference problems. For instance, some statistical functionals that have been used for studying treatment heterogeneity (see, e.g., Levy et al., 2021; Hines et al., 2021; Sanchez-Becerra, 2023) have the representation described in Section 2, so one can perform inference using our general approach.
Our work has some notable limitations that we plan to address in future research. While our proposal for inference on the improvement in fit enjoys good behavior in large samples, we observed in our simulation study that it may have undesirable small-sample properties, such as mild type-1 error inflation or poor coverage. Additionally, our estimator suffers a loss in precision when the smoothness parameter of is selected in a data-adaptive manner. In future work, we plan to investigate whether the performance can be improved using, e.g., small-sample adjustments or improved data-adaptive methods for tuning parameter selection. Additionally, because our results assume smoothness of the goodness-of-fit functional, it is not clear whether our results can be directly applied to perform inference on estimands such as distances. It is of interest to develop a more flexible inferential strategy that relaxes this assumption. Our methodology also places complexity constraints on nuisance parameter estimators, which prohibits us from using estimators such as gradient boosted trees (Friedman, 2002). It is of interest to develop cross-fitted versions of our improvement in fit estimator and multiplier bootstrap strategy that relax this assumption (Zheng and van der Laan, 2011; Chernozhukov et al., 2018).
There also remain several open theoretical and methodological questions. For instance, while we have established rate consistency of our proposed improvement in fit estimator, it is unclear whether our test of the null of no improvement in fit is optimal in any sense. It would be important to characterize the power of our test and to determine whether there exists a more powerful test. Additionally, it is of interest to understand how specification of the sub-model affects our procedure’s performance. It is possible that there are many ways for one to construct a sub-model while still obtaining valid inference on . It is not clear how this choice affects the estimator or whether there is an optimal choice. We expect that, in practice, this choice will need to be made in consideration of theoretical properties, such as power, and more practical concerns, such as ease of implementation and computational efficiency.
References
- Barron (1989) Barron, A. R. (1989). Statistical properties of artificial neural networks. In Proceedings of the 28th IEEE Conference on Decision and Control, pages 280–285. IEEE.
- Benkeser and van der Laan (2016) Benkeser, D. and van der Laan, M. (2016). The highly adaptive lasso estimator. In 2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA), pages 689–696. IEEE.
- Bhattacharya and Zhao (1997) Bhattacharya, P. and Zhao, P.-L. (1997). Semiparametric inference in a partial linear model. The Annals of Statistics 25, 244–262.
- Bibaut and van der Laan (2019) Bibaut, A. F. and van der Laan, M. J. (2019). Fast rates for empirical risk minimization over càdlàg functions with bounded sectional variation norm. arXiv preprint arXiv:1907.09244 .
- Bickel et al. (1998) Bickel, P. J., Klaassen, C. A., Ritov, Y., and Wellner, J. A. (1998). Efficient and adaptive estimation for semiparametric models. Springer.
- Carone et al. (2018) Carone, M., Díaz, I., and van der Laan, M. J. (2018). Higher-order targeted loss-based estimation. In Targeted Learning in Data Science: Causal Inference for Complex Longitudinal Studies, pages 483–510.
- Chernozhukov et al. (2018) Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., and Robins, J. (2018). Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal 21, C1–C68.
- Donald and Newey (1994) Donald, S. G. and Newey, W. K. (1994). Series estimation of semilinear models. Journal of Multivariate Analysis 50, 30–40.
- Friedman (2002) Friedman, J. H. (2002). Stochastic gradient boosting. Computational Statistics & Data Analysis 38, 367–378.
- Fu et al. (2017) Fu, A., Narasimhan, B., and Boyd, S. (2017). CVXR: An R package for disciplined convex optimization. arXiv preprint arXiv:1711.07582 .
- Hines et al. (2021) Hines, O., Diaz-Ordaz, K., and Vansteelandt, S. (2021). Parameterising the effect of a continuous exposure using average derivative effects. arXiv preprint arXiv:2109.13124 .
- Hines et al. (2022) Hines, O., Diaz-Ordaz, K., and Vansteelandt, S. (2022). Variable importance measures for heterogeneous causal effects. arXiv preprint arXiv:2204.06030 .
- Hines et al. (2022) Hines, O., Dukes, O., Diaz-Ordaz, K., and Vansteelandt, S. (2022). Demystifying statistical learning based on efficient influence functions. The American Statistician 76, 292–304.
- Hudson et al. (2021) Hudson, A., Carone, M., and Shojaie, A. (2021). Inference on function-valued parameters using a restricted score test. arXiv preprint arXiv:2105.06646 .
- Kandasamy et al. (2015) Kandasamy, K., Krishnamurthy, A., Poczos, B., Wasserman, L., et al. (2015). Nonparametric von Mises estimators for entropies, divergences and mutual informations. Advances in Neural Information Processing Systems 28.
- Kennedy et al. (2023) Kennedy, E. H., Balakrishnan, S., and Wasserman, L. A. (2023). Semiparametric Counterfactual Density Estimation. Biometrika .
- Levy et al. (2021) Levy, J., van der Laan, M., Hubbard, A., and Pirracchio, R. (2021). A fundamental measure of treatment effect heterogeneity. Journal of Causal Inference 9, 83–108.
- Luedtke et al. (2019) Luedtke, A., Carone, M., and van der Laan, M. J. (2019). An omnibus non-parametric test of equality in distribution for unknown functions. Journal of the Royal Statistical Society Series B: Statistical Methodology 81, 75–99.
- Micchelli et al. (2006) Micchelli, C. A., Xu, Y., and Zhang, H. (2006). Universal kernels. Journal of Machine Learning Research 7.
- Negahban et al. (2012) Negahban, S. N., Ravikumar, P., Wainwright, M. J., and Yu, B. (2012). A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers.
- Paninski (2003) Paninski, L. (2003). Estimation of entropy and mutual information. Neural Computation 15, 1191–1253.
- Pfanzagl (1982) Pfanzagl, J. (1982). Contributions to a general asymptotic statistical theory. Springer.
- Pfanzagl (1985) Pfanzagl, J. (1985). Asymptotic expansions for general statistical models, volume 31. Springer-Verlag.
- Robins et al. (2008) Robins, J., Li, L., Tchetgen, E., van der Vaart, A., et al. (2008). Higher order influence functions and minimax estimation of nonlinear functionals. Probability and Statistics: Essays in Honor of David A. Freedman 2, 335–421.
- Robinson (1988) Robinson, P. M. (1988). Root-n-consistent semiparametric regression. Econometrica: Journal of the Econometric Society pages 931–954.
- Sanchez-Becerra (2023) Sanchez-Becerra, A. (2023). Robust inference for the treatment effect variance in experiments using machine learning. arXiv preprint arXiv:2306.03363 .
- Steuer et al. (2002) Steuer, R., Kurths, J., Daub, C. O., Weise, J., and Selbig, J. (2002). The mutual information: detecting and evaluating dependencies between variables. Bioinformatics 18, S231–S240.
- Tibshirani et al. (2005) Tibshirani, R., Saunders, M., Rosset, S., Zhu, J., and Knight, K. (2005). Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67, 91–108.
- Tsybakov (2009) Tsybakov, A. B. (2009). Introduction to Nonparametric Estimation. Springer.
- van de Geer (2000) van de Geer, S. A. (2000). Empirical Processes in M-estimation, volume 6. Cambridge university press.
- van de Geer (2008) van de Geer, S. A. (2008). High-dimensional generalized linear models and the lasso.
- van der Laan et al. (2007) van der Laan, M. J., Polley, E. C., and Hubbard, A. E. (2007). Super learner. Statistical Applications in Genetics and Molecular Biology 6.
- van der Laan and Rose (2011) van der Laan, M. J. and Rose, S. (2011). Targeted learning: causal inference for observational and experimental data. Springer Science & Business Media.
- van der Laan and Rose (2018) van der Laan, M. J. and Rose, S. (2018). Targeted learning in data science. Springer.
- van der Vaart (2014) van der Vaart, A. (2014). Higher order tangent spaces and influence functions. Statistical Science pages 679–686.
- van der Vaart and Wellner (1996) van der Vaart, A. and Wellner, J. (1996). Weak convergence and empirical processes. Springer.
- van der Vaart (2000) van der Vaart, A. W. (2000). Asymptotic statistics, volume 3. Cambridge university press.
- Verdinelli and Wasserman (2021) Verdinelli, I. and Wasserman, L. (2021). Decorrelated variable importance. arXiv preprint arXiv:2111.10853 .
- Wahba (1990) Wahba, G. (1990). Spline models for observational data. SIAM.
- Westling (2021) Westling, T. (2021). Nonparametric tests of the causal null with nondiscrete exposures. Journal of the American Statistical Association pages 1–12.
- Wilks (1938) Wilks, S. S. (1938). The large-sample distribution of the likelihood ratio for testing composite hypotheses. The Annals of Mathematical Statistics 9, 60–62.
- Williamson et al. (2021a) Williamson, B. D., Gilbert, P. B., Carone, M., and Simon, N. (2021). Nonparametric variable importance assessment using machine learning techniques. Biometrics 77, 9–22.
- Williamson et al. (2021b) Williamson, B. D., Gilbert, P. B., Simon, N. R., and Carone, M. (2021). A general framework for inference on algorithm-agnostic variable importance. Journal of the American Statistical Association pages 1–14.
- Zhang and Janson (2020) Zhang, L. and Janson, L. (2020). Floodgate: inference for model-free variable importance. arXiv preprint arXiv:2007.01283 .
- Zheng and van der Laan (2011) Zheng, W. and van der Laan, M. J. (2011). Cross-validated targeted minimum-loss-based estimation. In Targeted Learning, pages 459–474. Springer.
Supplementary Materials
S1 Implementation for Non-quadratic Objectives
Here, we propose a general method for computing the improvement in fit estimator. As noted in Section 6.2, computation can be challenging because, when we use the specification of in (20), the optimization problem in (10) is possibly non-convex. Moreover, a closed form expression for may not be available, further complicating the problem.
As in Section 6.2, we focus on the setting where the complexity measure is available in quadratic form, satisfying for some matrix . We also note that, in general, the second derivative of the goodness-of-fit is quadratic in the coefficients as a consequence of the Riesz representation theorem. We assume that can be expressed as for some .
We propose a slightly modified estimator for that has nearly identical asymptotic properties to , but which can be easier to compute in a more general setting. The main idea is to separate estimation of into two parts. First, as before, for each we perform a search to identify whether any candidate parameter in the sub-model is an improvement over the null in terms of goodness-of-fit. Where we differ is that we confine our search to a small neighborhood of zero. When there exists evidence that, for some , is not minimized at zero, we search over a larger function class to identify a candidate parameter that achieves a better fit than the null. In contrast to the strategy proposed in Section 6, we specify as a class over which an optimum can more easily be identified. In what follows, we provide details and rationale for this method.
First, Taylor’s theorem implies that when resides within a neighborhood of zero, is approximately equal to . Additionally, in this setting we have that
Because is assumed to be convex for all , for some implies that , and . And so, one can assess whether by checking whether . This is roughly equivalent to performing a search over to identify the best fit, where is taken to be a small interval containing zero. To estimate , we use the estimator
When the null holds, and the conditions of Theorem 2 hold as well, has the same asymptotic representation as . That is,
As noted in Section 6.2, the Riesz representation theorem implies that is a linear functional of . We assume that our estimator is also linear, and can be expressed as for some -dimensional vector . Thus, using the specification of in (20), can be expressed as
This problem is a quadratically constrained quadratic program and can be solved efficiently.
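As an illustration, a problem of this form can be handed directly to a disciplined convex programming solver; the sketch below uses cvxpy, the Python analogue of the CVXR package cited in the references, with hypothetical inputs standing in for the estimated first derivative, second derivative, and constraint matrix.

```python
import numpy as np
import cvxpy as cp

def solve_local_qcqp(grad, hess, Omega):
    """Sketch of a quadratically constrained quadratic program of the form
    described above: maximize a concave quadratic approximation subject to a
    quadratic complexity constraint.

    grad : (d,) linear term built from the estimated first derivative
    hess : (d, d) symmetric PSD matrix from the estimated second derivative
    Omega: (d, d) symmetric PSD matrix defining the quadratic constraint
    (Inputs are hypothetical; the exact objective is given in the text.)"""
    d = len(grad)
    beta = cp.Variable(d)
    objective = cp.Maximize(grad @ beta - 0.5 * cp.quad_form(beta, hess))
    constraints = [cp.quad_form(beta, Omega) <= 1.0]
    problem = cp.Problem(objective, constraints)
    problem.solve()
    return beta.value, problem.value
```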
It is possible that is a poor approximation for when the null of no improvement in fit does not hold. Let be a random sequence that converges to one in probability when holds and converges to zero in probability when . For instance, we can take as
where is as defined in (17). For large values of , can replace . Otherwise, an alternative estimator may be needed.
To estimate when the null does not hold, we consider an alternative specification of . A main source of our difficulty with solving the optimization problem in (10) is that the constraint is non-convex. Under the null, constraining is necessary, as Theorem 2 states that Assumption A4 must hold in order for our improvement in fit estimator to have a well-behaved limiting distribution. However, this assumption is not needed for Theorem 3 to hold. As we discussed previously in Section 6.1, for Theorem 3 to hold, we really only need to ensure that the variance of is well controlled. When is quadratic, this can be achieved by constraining and leaving unconstrained, as was done previously. When is non-quadratic, we can alternatively leave unconstrained and carefully select the width of . We instead set as the function class
and we set as for some . Now, we define as
and we estimate as
Calculating will in many cases be much easier than calculating . We can write as
where is a -dimensional vector. This optimization problem has only a single convex constraint, so the problem is convex whenever the objective function is also convex. This can greatly simplify computation.
Of course, and can potentially achieve different limiting distributions when . This is because and are defined as optima over different function classes. While the two values are expected to be similar, they will not necessarily be equal. Nonetheless, one can still apply Theorem 3 to establish weak convergence of to a Gaussian distribution, as long as is not too large.
Finally, we combine and to obtain a single estimator for . We define our estimator as
Because tends to one when the null holds and approaches zero when the null fails, has approximately the same asymptotic behavior as . That is, behaves like the supremum of an empirical process under the null, and like a sample average otherwise. Therefore, the multiplier bootstrap tests described in Sections 5.1 and 5.2 remain valid, and an implementation strategy similar to that described in Section 6.2 can be used.
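Schematically, the combination can be coded as a gated average, as in the sketch below; the gating rule shown (an indicator that a preliminary statistic falls below a threshold) is one natural choice rather than the exact definition given above.

```python
def combined_estimate(psi_local, psi_alt, gate):
    """Sketch of the combined estimator: 'gate' plays the role of the random
    sequence that tends to one in probability under the null and to zero under
    the alternative. The combination rule is illustrative."""
    return gate * psi_local + (1.0 - gate) * psi_alt

# example usage with an indicator gate based on a hypothetical threshold
# gate = float(test_statistic <= threshold)
# psi_hat = combined_estimate(psi_local, psi_alt, gate)
```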
S2 Illustration: Nonparametric Assessment of Stochastic Dependence
In this section, we briefly discuss inference in Example 2 from Section 2.2, where we are interested in assessing whether a pair of random variables is independent. The data take the form , where and are one-dimensional random variables.
We assess dependence by comparing the expectation of the log of the product of the marginal densities of and with the expectation of the logarithm of an approximation for the joint density. Let and denote the marginal densities of and under , and let denote the log of the product of the marginal densities. We use the following sub-model to approximate the logarithm of the joint density:
With a straightforward calculation, it can be verified that (7) is satisfied, and moreover, we can see that is a valid candidate log density, as for any , .
With the above specification for the parametric sub-model, the goodness-of-fit takes the form
where and denote the marginal expectations under , with respect to and , respectively. We now observe that along any submodel, the difference in goodness-of-fit comparing a given candidate parameter with the null is given by
and this expression does not depend on the marginal densities and . Thus, estimation of the marginal densities is not needed.
The derivatives and are given by
One can interpret as the difference between the true mean of under and the value the mean would hypothetically take if and were independent. The second derivative represents the variance of under the assumption that and are independent.
Because does not depend on through any nuisance parameters that are not pathwise differentiable, it is expected that a plug-in estimator, which is defined as a functional of the cumulative distribution function, would be asymptotically linear, and no sophisticated methods for bias correction should be needed. We use the following plug-in estimator for :
By an application of the functional delta method, one can show that the plug-in estimator is asymptotically linear with influence function
Similarly, we estimate as
and is asymptotically linear with efficient influence function
We estimate the second derivative as
In this setting, a closed form solution for is not available, and if one uses the specification for in (20), the problem
is difficult to solve. As an alternative, we recommend using the more general implementation strategy presented in the Supplementary Materials Section S1.
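Returning to the derivative estimators above, the following sketch computes simple empirical versions of the first and second derivatives, using the empirical joint distribution and the product of the empirical marginals. It is illustrative only and does not reproduce the exact estimators given in the preceding displays.

```python
import numpy as np

def plug_in_derivatives(h, x, y):
    """Empirical versions of the two derivatives discussed above: the first
    derivative as the mean of h(X, Y) under the empirical joint distribution
    minus its mean under the product of the empirical marginals, and the
    second derivative as the variance of h(X, Y) under that product measure.
    h is a vectorized function of two arguments; x and y are one-dimensional
    numpy arrays. Names are illustrative."""
    joint_mean = np.mean(h(x, y))
    H = h(x[:, None], y[None, :])                  # h(x_i, y_j) for all pairs (i, j)
    prod_mean = H.mean()                           # mean under product of marginals
    first_deriv = joint_mean - prod_mean
    second_deriv = np.mean((H - prod_mean) ** 2)   # variance under independence
    return first_deriv, second_deriv
```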
S3 Proofs of Theoretical Results
Proof of Theorem 1
We have by definition that and are non-negative. Under Assumption B2, we can write
Thus from the fact that is -Donsker, we have that . Assumption A3 implies that .
Now, because we have that satisfies , Taylor’s theorem implies
for some that satisfies . By rearranging terms and invoking Assumption B4, the estimation error for can be expressed as
Because , Assumption B2 implies that
Now, because is uniformly consistent for , the continuous mapping theorem and Assumptions A4 and B4 allow us to replace with in the above display. Thus, we have
as claimed. The weak convergence result follows as an immediate consequence of the Donsker Assumption A5.
Proof of Theorem 2
This result follows directly from an application of the continuous mapping theorem.
Proof of Theorem 3
Following our discussion in Section 4.2, it suffices to show that . First, we write
From the above argument, we can conclude that the remainder is at least .
Proof of Theorem 4
Let be the space of bounded Lipschitz-1 functions . That is, any in satisfies for any . Let denote the expectation of a random variable with respect to the distribution of (treating as fixed). We show that
converges to zero in outer probability. This is equivalent to weak convergence by the Portmanteau lemma (see, e.g., Lemma 18.9 of van der Vaart, 2000).
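For reference, the bounded-Lipschitz characterization of weak convergence invoked here can be written in generic notation as follows; the displays in this proof are conditional analogues of this criterion.

```latex
% Bounded-Lipschitz characterization of weak convergence (generic notation):
% T_n converges weakly to T if and only if
\[
  \sup_{f \in \mathrm{BL}_1}
  \bigl|\, \mathbb{E}^{*} f(T_n) - \mathbb{E} f(T) \,\bigr|
  \;\longrightarrow\; 0,
\]
% where BL_1 denotes the class of functions bounded by one and Lipschitz with
% constant one.
```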
Let denote the space of bounded functionals on , and let be the space of Lipschitz-1 functionals . That is, for in , any satisfies . We now define:
It is shown by Hudson et al. (2021) that under Assumptions C1-C3,
The proof is completed by recognizing that for any , the functional is contained within .
Proof of Theorem 5
Case 1:
We first consider the setting in which . Let be a Lipschitz-1 function on . As in the proof of Theorem 4, let be the space of Lipschitz-1 functions on . We show that
converges to zero in outer probability.
First, by applying the triangle inequality and invoking the Lipschitz property, we have
where we define
We have already shown in Theorem 4 that the first term converges to zero in outer probability, so it only remains to verify this for the second term .
By the reverse triangle inequality, we have
Because when , it suffices to show that and are both .
We first show that is bounded in probability. First, we have by Jensen’s inequality that
Now, by Taylor’s theorem, we have
for some that satisfies for all . Because under the conditions of Theorem 2, it suffices to show that
(24) |
By the triangle inequality, we have the upper bound
where we define
It can be seen through an application of the Cauchy-Schwarz inequality under Assumption C2 that is bounded in probability. To see that is bounded in probability, we first note that Assumption C1 implies that, with probability tending to one,
In view of Markov’s inequality, it is sufficient to show that this upper bound has finite expectation. Lemma 2.3.6 of van der Vaart and Wellner (1996) implies that
Because under Assumption C3, has finite bracketing integral, we have by Corollary 19.35 of van der Vaart (2000) that
thereby establishing that . That follows from a similar argument.
To argue that is , we use the same argument as is used to show that (24) holds. In brief, the result follows from the facts that () both and are uniformly consistent under Assumption C2, and () the class
is -Donsker with finite bracketing integral under Assumption C3.
Case 2:
We now consider the setting where . We show that
converges to zero in outer probability. Similarly as for Case 1, we have by the triangle inequality that
where we define
We first argue that converges to zero in outer probability. First, we have by assumption that the function
is contained within a -Donsker class with probability tending to one. Also, we have under Assumption C2 that . We now argue that . We have the upper bound
Under Assumption C2, we have
Additionally, we have
under the conditions of Theorem 3. Now, by Theorem 2 of Hudson et al. (2021), we can conclude that converges to zero in outer probability.
We now argue that . Because , we only need to show that and are bounded in probability. That follows from the same argument as was used to show that (24) holds in Case 1.
To argue that , we begin by applying the Cauchy-Schwarz inequality and invoking Assumption C2 to get
Now, by the triangle inequality,
Because is a -Donsker class with finite squared envelope function, we have by Lemma 2.10.4 of van der Vaart and Wellner (1996) that is a -Glivenko-Cantelli class, and so
Now, by the triangle inequality
We have by assumption that
Additionally, Assumption C2 and Lemma 2.10.4 of van der Vaart and Wellner (1996) imply that the class is -Glivenko-Cantelli with probability tending to one. Therefore,
Thus, we have that . This allows us to write
That follows from the argument presented in Case 1. This completes the proof.
Proof of Lemma 1
Suppose a given distribution in has density with respect to a dominating measure , and let be a fixed function that has mean zero and finite variance under . Let be a one-dimensional parametric sub-model for indexed by the parameter , which satisfies the following:
1. The sub-model passes through at – that is, at
2. The density of the parametric sub-model is given by , and the score function is given by at . That is,
We refer to as the pathwise derivative of . The nonparametric efficient influence function is the unique function that satisfies the following two properties:
1. For every , .
2. has mean zero under . That is, .
We can therefore find the efficient influence function by calculating the pathwise derivative.
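For clarity, the two properties can be written in generic notation as follows, with $\Psi$ denoting the pathwise differentiable functional, $D_P$ its efficient influence function, and $s$ the score of a sub-model through $P$; this is the standard characterization and is not specific to our setting.

```latex
% Defining properties of the nonparametric efficient influence function:
\[
  \frac{d}{d\epsilon}\, \Psi(P_\epsilon) \Big|_{\epsilon = 0}
  \;=\; \mathbb{E}_P\!\bigl[ D_P(O)\, s(O) \bigr]
  \quad \text{for every sub-model score } s,
  \qquad \text{and} \qquad
  \mathbb{E}_P\!\bigl[ D_P(O) \bigr] \;=\; 0 .
\]
```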
Let have density
Any distribution in a small neighborhood of can be approximated using a sub-model of this form.
The goodness-of-fit under is given by
Through a simple calculation, it can be shown that
where denotes the conditional density of given that , under . We now have
where the second equality follows from an application of the law of total expectation to the second summand. The “non-mean-centered” efficient influence function is thus given by
The proof is completed by centering the above function about its mean.
Proof of Lemma 2
We write the estimation error for the one-step estimator as
where the remainder terms are
Following from our discussion in Section 4.1, it suffices to argue each of the following:
First, we argue that . It is shown in the proof of Lemma 19.26 of van der Vaart (2000) that this convergence rate is achieved when
and when the class is contained within a -Donsker class with probability tending to one. That the influence function estimators are uniformly consistent follows as a consequence of the rate conditions on the nuisance parameter estimators, and the Donsker condition holds by assumption. Similarly, that and follow from consistency of the nuisance estimators and the assumed complexity constraints.
Now, we argue that . This remainder term has the exact representation
It can be seen that when the rate conditions on the nuisance estimators are met. The derivative of the second remainder term is
and so under the rate conditions as well. Finally, it is easily seen that .