Test and Measure for Partial Mean Dependence Based on Machine Learning Methods
Abstract
It is of importance to investigate the significance of a subset of covariates for the response, given the remaining covariates, in regression modeling. To this end, we propose a significance test for the partial mean independence problem based on machine learning methods and data splitting. The test statistic converges to the standard chi-squared distribution under the null hypothesis, while it converges to a normal distribution under the fixed alternative hypothesis. Power enhancement and algorithm stability are also discussed. If the null hypothesis is rejected, we propose a partial Generalized Measure of Correlation (pGMC) to measure the partial mean dependence of the response on the covariates of interest after controlling for the nonlinear effect of the conditioning covariates. We present the appealing theoretical properties of the pGMC, establish the asymptotic normality of its estimator with the optimal root-n convergence rate, and derive a valid confidence interval. As an important special case with no conditioning covariates, we introduce a new test of the overall significance of covariates for the response in a model-free setting. Numerical studies and real data analysis are conducted to compare with existing approaches and to demonstrate the validity and flexibility of the proposed procedures.
Keywords: machine learning methods; partial mean independence test; partial generalized measure of correlation.
1 Introduction
It is of importance to investigate the significance of a subset of covariates for the response in regression modeling. In practice, some covariates are often known to be related to the response of interest based on historical analysis or domain knowledge. We aim to test the significance of another group of covariates for the response given the known covariates, especially with high dimensional data. For example, Scheetz et al. (2006) detected 22 gene probes related to human eye disease from 18,976 different gene probes, and it is of interest to test whether the remaining high dimensional gene probes still contribute to the response conditional on the subset of 22 identified probes. This problem can be formulated as the following hypothesis testing problem
(1.1) |
where the response is a scalar-valued random variable of interest, the covariates form a (possibly high dimensional) random vector, and the covariates of interest and the conditioning covariates are two subvectors of it. The equality in the null hypothesis indicates that the response and the covariates of interest are partially mean independent after controlling for the nonlinear effect of the conditioning covariates. If the null hypothesis is not rejected, the covariates of interest are considered insignificant for the response and can be omitted from the regression model. Thus, this test is also called a significance test (Delgado and González-Manteiga, 2001) or an omitted variable test (Fan and Li, 1996). In the literature, several tests in nonparametric settings have been proposed based on locally and globally smoothing methodologies, whose representatives are Fan and Li (1996)'s test and Delgado and González-Manteiga (2001)'s test, respectively. However, both tests are based on kernel smoothing estimation and severely suffer from the curse of dimensionality. Zhu and Zhu (2018) proposed a dimension reduction-based adaptive-to-model (DREAM) test for the significance of a subset of covariates. However, the DREAM test only targets relatively low dimensional settings and cannot perform satisfactorily in high dimensions. Therefore, testing this hypothesis in high dimensional settings remains challenging.
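In generic notation (introduced here only for illustration), with $Y$ denoting the scalar response, $\mathbf{Z}$ the known conditioning covariates, and $\mathbf{X}$ the covariates whose significance is in question, a partial mean independence hypothesis of this type reads
\[
H_0:\ E(Y\mid \mathbf{X},\mathbf{Z}) = E(Y\mid \mathbf{Z})\ \text{almost surely}
\qquad \text{versus} \qquad
H_1:\ P\{E(Y\mid \mathbf{X},\mathbf{Z}) \neq E(Y\mid \mathbf{Z})\} > 0.
\]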
In this paper, we propose a new significance test for the partial mean independence problem in (1.1) based on machine learning methods. The main obstacle in testing is the estimation of the conditional means in high dimensions, where classical nonparametric methods such as kernel smoothing and local polynomial fitting perform poorly. Alternatively, we apply machine learning methods such as deep neural networks (DNN) and eXtreme Gradient Boosting (XGBoost) to estimate the conditional means in the new test. Specifically, we randomly split the data into two parts: we estimate the conditional means via machine learning methods using the first part of the data and conduct the test based on the second part. We show theoretically that the test statistic converges to the standard chi-squared distribution under the null hypothesis, while it converges to a normal distribution under the fixed alternative hypothesis. We also introduce a new reformulation of the null hypothesis, which is helpful for obtaining a tractable limiting null distribution. Power enhancement and algorithm stability based on multiple data splitting are also discussed.
Recently, there has been increasing interest in developing powerful testing procedures for the partial mean independence problem. Some remarkable developments include Dai et al. (2024), Williamson et al. (2023), Verdinelli and Wasserman (2024), and Lundborg et al. (2022). These papers also adopt machine learning methods and sample splitting. To deal with the fundamental difficulty that a naive test statistic has a degenerate distribution, different attempts have been made. Dai et al. (2024) added random noise to their test statistic. Williamson et al. (2023) alleviated this problem by estimating two non-vanishing parameters on separate splits of the data, which was also adopted in the two-split test of Dai et al. (2024). The method of Verdinelli and Wasserman (2024) is similar to that of Dai et al. (2024) but does not introduce additional randomness; they achieved tractable limiting null distributions at the price of possible power loss. Lundborg et al. (2022) introduced a notable procedure named the Projected Covariance Measure and proved that, under linear regression models with fixed dimensional covariates, their procedure is more powerful than those of Dai et al. (2024) and Williamson et al. (2023). However, under the model-free framework, which is the focus of this paper, their theory requires three or more subsamples. For other relevant papers, see also Lei et al. (2018), Zhang and Janson (2020), Shah and Peters (2020), Gan et al. (2022), Tansey et al. (2022), and Verdinelli and Wasserman (2023). For a recent review, see Lundborg (2023).
In this paper, we introduce a new procedure to deal with the degenerate null distribution issue. The proposed test procedure has four merits compared with other existing methods. Firstly, the newly proposed test statistic neither involves data perturbation as in Dai et al. (2024) nor requires three or more subsamples as in Williamson et al. (2023). Secondly, the proposed procedure has nontrivial power against local alternative hypotheses that converge to the null faster than those in Williamson et al. (2023) and Dai et al. (2024). Thirdly, our test procedure only requires estimating two conditional means and is thus computationally practical, while Algorithm 1 in Lundborg et al. (2022) requires estimating at least five conditional means. Furthermore, we introduce a novel power enhancement procedure, which differs substantially from the power enhancement technique based on a thresholding-type statistic in Fan et al. (2015), turning the degeneracy issue into a blessing for power.
If the null hypothesis is rejected, then it is meaningful to measure the partial mean dependence of the response on the covariates of interest after controlling for the nonlinear effect of the conditioning covariates. The traditional Pearson partial correlation coefficient cannot measure such nonlinear partial dependence. In the literature, several works focus on quantifying the partial/conditional dependence between two random objects given another one. Székely and Rizzo (2014) defined the partial distance correlation, with the help of a Hilbert space construction, for measuring partial dependence. Wang et al. (2015) introduced a conditional distance correlation for conditional dependence. Azadkia and Chatterjee (2021) proposed a new conditional dependence measure that generalizes the univariate nonparametric measure of regression dependence in Dette et al. (2013). However, although conditional independence implies conditional mean independence, the converse does not hold, and conditional mean independence has a clearer connection to regression modeling. To measure partial mean dependence, Park et al. (2015) extended the idea of Székely and Rizzo (2014) and defined the partial Martingale Difference Correlation (pMDC) to measure the conditional mean dependence of the response on the covariates of interest adjusting for the conditioning covariates. However, the asymptotic properties of the pMDC remain unclear. As another main contribution, we propose a new partial dependence measure, called the partial Generalized Measure of Correlation (pGMC), derive its theoretical properties, and construct a confidence interval. The new pGMC is based on the decomposition formula of the conditional variance and thus can be considered as an extension of the Generalized Measure of Correlation (GMC) proposed by Zheng et al. (2012). The new pGMC has several appealing properties. First, it takes values between 0 and 1: it is 0 if and only if the response and the covariates of interest are conditionally mean independent given the conditioning covariates, and it is 1 if and only if the response is almost surely equal to a measurable function of the covariates. Second, the asymptotic normality of the proposed estimator of the pGMC is derived with the optimal root-n convergence rate, and the associated confidence interval is also constructed.
As an important special case when there is no conditioning random object, (1.1) becomes
(1.2) |
where the null hypothesis means that the conditional mean of the response given the covariates does not depend on the covariates. It is a fundamental testing problem to check the overall significance of covariates for modeling the mean of the response in regression problems. For example, the F-test is the classical and standard procedure for determining the overall significance of predictors in linear models. Without parametric model assumptions, many model checking procedures can also be applied, for instance, Wang and Akritas (2006), González-Manteiga and Crujeiras (2013), and Guo et al. (2016). In high dimensions, many testing procedures, including Goeman et al. (2006), Zhong and Chen (2011), Guo and Chen (2016), and Cui et al. (2018), have been developed to assess the overall significance for linear or generalized linear models. To accommodate high dimensionality in the model-free setting, both Zhang et al. (2018) and Li et al. (2023) considered a weaker but necessary null hypothesis
(1.3) |
and proposed sum-type test statistics aggregating all marginal tests. In this paper, we also propose a new test of the overall significance of the covariates for the response.
The paper is organized as follows. In section 2, we propose a new significance test for partial mean independence. In section 3, a new partial Generalized Measure of Correlation (pGMC) is introduced. Section 4 considers the important special case of testing the overall significance of the covariates. Numerical studies are conducted in section 5. In section 6, we present two real data examples. Additional simulation studies and the proofs of theoretical results are presented in the supplementary materials.
2 Partial Mean Independence Test (pMIT)
2.1 Test statistic and its asymptotic distributions
In this section, we consider the partial mean independence testing problem (1.1); that is, we aim to check whether the response and the covariates of interest are partially mean independent after controlling for the nonlinear effect of the conditioning covariates. We first give a new reformulation of the null hypothesis as follows
(2.1) |
where the transformed response is defined as the response minus its conditional mean given the conditioning covariates. Note that the transformed response has mean zero, and thus the right-hand side of (2.1) actually reduces to 0. Based on this reformulation, a natural idea is to compare estimators of the two sides of (2.1) and measure whether their difference equals 0. By introducing the sample analog estimator of the zero term, one can avoid the fundamental difficulty that a naive test statistic has a degenerate distribution. Suppose we observe a random sample from the population. We propose a new test statistic based on machine learning methods and the data splitting technique. To be specific, we split the data randomly into two independent parts, a larger first part and a smaller second part, whose sizes are assumed, without loss of generality, to be integers. We first estimate the conditional mean of the response given the conditioning covariates using machine learning methods based on the first part of the data, and then estimate the conditional expectation on the left-hand side of (2.1) by regressing the estimated transformed response on all covariates, again using machine learning methods based on the first part, to obtain its final estimator. Then, we construct the test statistic by comparing the resulting estimators of the two sides of (2.1) based on the second part of the data,
(2.2) |
To simplify notation, a summation over the second part of the data means that the index belongs to the corresponding index set. Intuitively, the larger the test statistic is, the stronger the evidence we have to reject the null hypothesis. We call this test the partial mean independence test (pMIT).
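As a rough illustration of this splitting workflow, the following Python sketch estimates the conditional mean given the conditioning covariates on a training part, regresses the resulting residuals on all covariates, and evaluates a simple studentized contrast on the held-out part against the chi-squared limit. The statistic below is only an illustrative stand-in for (2.2); all function names, the gradient-boosting learner, and the splitting fraction are our own choices rather than the paper's.

```python
import numpy as np
from scipy.stats import chi2
from sklearn.ensemble import GradientBoostingRegressor

def pmit_sketch(Y, X, Z, train_frac=0.8, seed=0):
    """Schematic partial-mean-independence check via unbalanced data splitting.
    Y: (n,) response; X: (n, p1) covariates of interest; Z: (n, p2) conditioning covariates.
    Returns an illustrative chi-squared(1)-calibrated statistic and its p-value."""
    rng = np.random.default_rng(seed)
    n = len(Y)
    idx = rng.permutation(n)
    n1 = int(train_frac * n)              # larger training part, smaller testing part
    tr, te = idx[:n1], idx[n1:]

    # Step 1: learn E(Y | Z) on the training part.
    g_hat = GradientBoostingRegressor().fit(Z[tr], Y[tr])

    # Step 2: regress the training residuals on (X, Z), mimicking the estimation of
    # the conditional mean of the transformed response.
    resid_tr = Y[tr] - g_hat.predict(Z[tr])
    m_hat = GradientBoostingRegressor().fit(np.hstack([X[tr], Z[tr]]), resid_tr)

    # Step 3: on the testing part, contrast the fitted residual regression with the
    # residuals themselves; under the null the contrast should center at zero.
    resid_te = Y[te] - g_hat.predict(Z[te])
    fitted_te = m_hat.predict(np.hstack([X[te], Z[te]]))
    contrast = fitted_te * resid_te       # illustrative contrast, not the paper's (2.2)
    T = len(te) * np.mean(contrast) ** 2 / (np.var(contrast) + 1e-12)
    return T, chi2.sf(T, df=1)
```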
Remark 2.1.
First, the data splitting is crucial for controlling the type-I error of the pMIT procedure. Second, the unbalanced sample splitting with a larger training set for the conditional means is necessary to achieve a tractable limiting null distribution for the significance test (1.1). This observation distinguishes the proposed pMIT test from Cai et al. (2022), which proposed a similar test for (1.1) via data splitting; however, Cai et al. (2022) did not theoretically investigate either the asymptotic distributions or the sample splitting ratio.
To proceed, we discuss the reformulation (2.1) further. Different from existing approaches that compare the two conditional expectations directly, we compare one conditional expectation with one unconditional expectation of the transformed response. With the unbalanced sample splitting strategy, under the null hypothesis the estimation errors from the training part are controlled and become negligible, while with the formulation (2.1) the remaining term converges to a chi-squared distribution and determines the asymptotic distribution of the test statistic, making it valid for inference. If instead we did not estimate the unconditional expectation and considered the naive comparison directly, the resulting statistic would still suffer from the degeneracy issue. Thus, introducing the estimation of the zero term is crucial and is a new complement to existing methods. Essentially, we tackle the degeneracy issue by introducing additional randomness through the error term. There are, however, two differences between our procedure and that of Dai et al. (2024). Firstly, our procedure does not involve data perturbation and does not require selecting a perturbation size. Secondly, and more subtly, we introduce the randomness directly in the comparison of the two sides of (2.1), while Dai et al. (2024)'s procedure perturbs the difference of the prediction errors of the two conditional mean estimators. This seemingly minor change has a major beneficial effect: theoretically, our procedure can detect local alternative hypotheses that converge to the null at a faster rate. Intuitively, our procedure directly detects shifts of the conditional mean, while their procedure focuses on differences of prediction errors. Furthermore, our reformulation is general and can be adapted to other conditional mean restriction testing problems. For instance, we may extend our procedure to test conditional independence: conditional independence of the response and the covariates of interest given the conditioning covariates is equivalent to conditional mean independence of the indicator that the response does not exceed a threshold, for every threshold value. We can then apply our method for each threshold and integrate the information across thresholds to construct the final test statistic.
Then, we derive the asymptotic distributions of the test statistic. Let the asymptotic variance of the comparison term be defined accordingly; it can be estimated by its sample analog on the second part of the data. We impose the following assumption on the estimation errors of the conditional means obtained by machine learning methods. For some positive rate exponent and a generic positive constant, we suppose
-
(C1)
and .
The first part of (C1) reasonably assumes that machine learning methods are able to estimate the conditional mean accurately with a certain convergence rate. For example, Bauer and Kohler (2019), Schmidt-Hieber (2020) and Kohler and Langer (2021) studied the estimation errors of conditional mean estimators based on DNNs. Specifically, Bauer and Kohler (2019) derived a convergence rate depending on the smoothness parameter and the intrinsic dimensionality of the regression function; from their Corollary 1, the rate can be even faster in favorable cases. Further, rates of convergence of various boosting procedures have been widely studied. For instance, Bühlmann and Yu (2003) derived the rate of convergence of the estimation error for one-dimensional regression when smoothing splines are adopted as weak learners. Bühlmann (2006) established the consistency of L2-Boosting in the high-dimensional setting without stating the rate. Kueck et al. (2023) proved that, for high-dimensional linear regression models, iterated post-L2-Boosting and orthogonal L2-Boosting achieve a rate of convergence depending on the sparsity level. More discussion is given after Condition (C2). For the second part of (C1), we note that the variance term can be estimated using an estimator of the corresponding conditional mean, so the second part of (C1) follows from the first assumption in (C1) together with a similar assumption on that estimator. In the literature, Williamson et al. (2021) assumed an analogous rate condition on the corresponding regression estimator, and Lundborg et al. (2022) imposed similar conditions on the estimation errors of the relevant estimators in their Assumptions 3-4. In the special case of a linear regression model with the ordinary least squares estimator, it can be shown that our proposed estimator satisfies the required rate; for the linear model with the LASSO estimator, we show that the required rate holds with a bound depending on the sparsity level. Thus, in both cases, the second assumption in (C1) holds.
Although we are not constrained to any particular method for estimating the conditional means, we have found that the proposed choice works well in practice. Under the null hypothesis, it generally has a very small overall estimation error and thus controls the empirical size very well, while under the alternative hypothesis, as long as over-fitting is not too severe, our procedure can still detect the corresponding alternatives. As in Lundborg et al. (2022), we do not claim that our choice is always the best, but we find that it works well in practice and regard it as a sensible default.
Next, we derive the asymptotic distributions of in the following theorem.
Theorem 2.1.
Assume that for some positive constants and , and Condition (C1) holds with . Under the null hypothesis, we have
Under local alternative hypotheses with ,
Further assume . Under the fixed alternative hypothesis,
Theorem 2.1 demonstrates that the asymptotic null distribution of the test statistic is the standard chi-squared distribution, which provides a computationally efficient way to compute p-values without any resampling procedures. Under the fixed alternative hypothesis, the suitably normalized statistic is asymptotically normal. The theorem also reveals that the proposed test has nontrivial power against local alternative hypotheses that converge to the null at a rate governed by the size of the second part of the data. Although this rate is generally slower than the root-n rate, it is very close to it under parametric regression models, where the testing part can constitute nearly the entire sample. The condition on the splitting reveals the relative roles of the two subsample sizes: if the conditional means can be estimated accurately, then the training subsample can be reduced. However, the splitting is challenging to determine from a theoretical perspective, since the attainable estimation rate is generally unknown and is often related to the smoothness of the conditional mean function and the dimensionality. Practically, we suggest achieving unbalanced sample splitting by tuning the ratio of the training subsample size to the total sample size. We provide an adaptive procedure for selecting a proper splitting ratio in section 2.2, through which the Type I error can be well controlled without much loss of power.
Now suppose that the machine learning estimator of the conditional mean converges to some limiting function, which may not equal the true conditional mean. As long as the limiting function yields a nonzero discrepancy in (2.1) under the alternative hypothesis, the test statistic diverges, and thus we can still detect the alternative hypothesis even when the estimator is not consistent for the true conditional mean.
We further note that Theorem 2.1 can be strengthened to uniform results. If we strengthen condition (C1) to hold uniformly over all data-generating distributions in a suitable class and assume uniform moment bounds with positive constants, then, by Lemmas 18-20 in Shah and Peters (2020) (see also Lemmas 16-17 in Lundborg et al. (2022)), we can show that under the null hypothesis the test statistic converges to the chi-squared distribution uniformly and the test uniformly controls the nominal level.
From Theorem 2.1, our procedure can detect local alternative hypotheses that converge to the null at a rate determined by the size of the testing subsample. Under the rate assumption referenced in condition (C2), this local rate is faster than the local rates of Williamson et al. (2023) and Dai et al. (2024) reported in Lundborg et al. (2022), although slower than the parametric root-n rate. Although the procedure in Lundborg et al. (2022) can achieve the fastest possible rate, that result is obtained with fixed dimension under linear regression models. Furthermore, compared with Lundborg et al. (2022), we establish the asymptotic distribution of the proposed test statistic under the alternative hypothesis within the model-free framework.
In addition, we comment that a larger training sample could have a negative impact on the statistical power of our testing procedure. This is the price we pay for dealing with the estimation errors of machine learning methods and the critical issue of a degenerate limiting null distribution. Similarly, Dai et al. (2024) also required a larger training sample to make their bias-sd-ratio negligible. When the alternatives are believed to be complex, machine learning methods including DNNs can be helpful for estimating the conditional means accurately. If the relationship between the response and the covariates is believed to be parametric, parametric methods should be used, and then a larger training sample is not necessary. The loss of power associated with performing a test on a small testing sample can be offset by efficient estimation of the regression functions. In the next section, we also discuss procedures for power enhancement.
2.2 Power enhancement and algorithm stability
We first present our pMIT procedure in Algorithm 1 based on single data splitting.
In the following, we make some improvements to enhance the empirical testing power. Inspired by the seminal idea of Fan et al. (2015), we first introduce a power-enhanced version of the test statistic by adding a nonnegative enhancement term. As discussed before, under the null hypothesis the enhancement term is asymptotically negligible, so the enhanced statistic still converges to the standard chi-squared distribution and the size can be controlled asymptotically. Under the local alternative hypothesis, the enhancement term diverges and, being nonnegative, the power is always enhanced. This reveals that the degeneracy issue can be a blessing in terms of power. The same idea can also be applied to other recently developed test statistics. How to choose the enhancement magnitude adaptively warrants further investigation: selecting it involves a trade-off between size control and power enhancement, since a larger value yields higher power but risks distorting the empirical size. In practice, we simply take a fixed default value.
Next, we discuss how to select the sample splitting ratio to balance the estimation bias and the test efficiency in Algorithm 1, which can be viewed as a trade-off between Type I and Type II errors. For the selection of the splitting ratio, Dai et al. (2024) provided an easy-to-use formula as well as a novel data-adaptive procedure. Inspired by Dai et al. (2024), we also consider a data-adaptive splitting procedure and select the splitting ratio by controlling the estimated Type I error on permuted datasets. To be specific, we first permute the covariates of interest and obtain the corresponding p-values. After that, we estimate the Type I error by the proportion of rejections over permutations, and select the value of the splitting ratio that controls the estimated Type I error at a predetermined significance level. This procedure is presented in Algorithm 2. When searching over the candidate set of splitting ratios, the search stops as soon as the criterion is met, which improves computational efficiency.
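A rough sketch of this data-adaptive search is given below (Algorithm 2 contains the precise steps). Here `pmit_pvalue` is a placeholder callable returning a single-split p-value (for instance, a wrapper around the sketch given earlier), the candidate grid of training fractions and the number of permutations are illustrative, and permuting the covariates of interest is used to generate datasets on which the null hypothesis holds.

```python
import numpy as np

def select_split_ratio(Y, X, Z, pmit_pvalue, ratios=(0.5, 0.6, 0.7, 0.8, 0.9),
                       n_perm=50, alpha=0.05, seed=0):
    """Return the first candidate training fraction whose estimated Type I error,
    computed over datasets with permuted X, is controlled at level alpha.
    Candidates are assumed ordered from most powerful to most conservative."""
    rng = np.random.default_rng(seed)
    for ratio in ratios:
        rejections = 0
        for _ in range(n_perm):
            # Permuting X breaks its association with (Y, Z), so the null holds.
            X_perm = X[rng.permutation(len(Y))]
            if pmit_pvalue(Y, X_perm, Z, train_frac=ratio) <= alpha:
                rejections += 1
        if rejections / n_perm <= alpha:  # estimated Type I error is controlled
            return ratio                  # stop at the first ratio meeting the criterion
    return ratios[-1]                     # fall back to the most conservative split
```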
Finally, the testing performance of the pMIT procedure outlined in Algorithm 1 may be affected by the additional randomness of the data splitting. To improve the algorithm stability, we introduce an ensemble testing procedure based on multiple data splits in Algorithm 3. Specifically, we aggregate p-values from different sample splits to obtain an adjusted p-value. P-value aggregation is an important problem in statistics; for recent theoretical investigations, see for instance DiCiccio et al. (2020), Vovk and Wang (2020), and Choi and Kim (2023). Instead of combining p-values, Guo and Shah (2023) studied the combination of test statistics across data splits and provided a deep theoretical investigation.
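One convenient aggregation rule, used later in the simulations, is the Cauchy combination test of Liu and Xie (2020). A minimal sketch with equal weights is:

```python
import numpy as np
from scipy.stats import cauchy

def cauchy_combine(pvals):
    """Combine p-values from multiple data splits via the Cauchy combination test
    (Liu and Xie, 2020) with equal weights."""
    p = np.clip(np.asarray(pvals, dtype=float), 1e-15, 1 - 1e-15)  # avoid infinite tangents
    t0 = np.mean(np.tan((0.5 - p) * np.pi))   # Cauchy-transformed average
    return cauchy.sf(t0)                       # combined p-value

# Example: aggregate p-values obtained from several random splits of the same data.
print(cauchy_combine([0.03, 0.20, 0.08, 0.51]))
```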
3 Partial Generalized Measure of Correlation (pGMC)
3.1 Definition and properties
The rejection of the null hypothesis implies that the covariates of interest still contribute to the conditional mean of the response after controlling for the nonlinear effect of the conditioning covariates. It is then of great interest to quantify the partial dependence between the response and the covariates of interest given the conditioning covariates. In this section, we propose a new partial dependence measure, called the partial Generalized Measure of Correlation (pGMC), derive its theoretical properties, and construct a confidence interval.
To quantify the partial dependence between the response and the covariates of interest given the conditioning covariates, a commonly used quantity is the partial correlation coefficient, defined as
which equals the Pearson correlation coefficient between the two population regression errors. It only measures linear correlation and can be zero even when there is explicit partial dependence between the response and the covariates of interest given the conditioning covariates. For example, let the covariate of interest and the conditioning covariate be two independent standard normal random variables and let the response be a nonlinear transformation (say, the square) of the covariate of interest. In this case, the response and the covariate of interest are perfectly dependent given the conditioning covariate, but the partial correlation coefficient is zero. To measure the nonlinear dependence of the conditional mean of the response on the covariates of interest adjusting for the conditioning covariates, we consider the following conditional variance decomposition
It motivates us to consider the following partial measure
(3.1) |
which can be interpreted as the variance of the response explained by the covariates of interest given the conditioning covariates in a model-free framework. To make the measure well defined, it is necessary to assume that the denominator of (3.1) is nonzero; that is, the response is not almost surely equal to a measurable function of the conditioning covariates. Otherwise, given the conditioning covariates, the response is almost surely constant and it makes no sense to discuss the contribution of the covariates of interest to its conditional mean. When there is no conditioning covariate set, the measure reduces to the Generalized Measure of Correlation (GMC) proposed by Zheng et al. (2012). Thus, the pGMC can be considered as an extension of the GMC to measuring the partial mean dependence of the response on the covariates of interest given the conditioning covariates.
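In generic notation (again ours, for illustration), with $Y$ the response, $\mathbf{X}$ the covariates of interest and $\mathbf{Z}$ the conditioning covariates, the conditional variance decomposition referred to above is the conditional law of total variance,
\[
\mathrm{Var}(Y\mid \mathbf{Z}) \;=\; E\{\mathrm{Var}(Y\mid \mathbf{X},\mathbf{Z})\mid \mathbf{Z}\} \;+\; \mathrm{Var}\{E(Y\mid \mathbf{X},\mathbf{Z})\mid \mathbf{Z}\},
\]
and a ratio of the kind described in (3.1), consistent with properties (5)-(6) below, would take the form
\[
\mathrm{pGMC} \;=\; \frac{E\big[\{E(Y\mid \mathbf{X},\mathbf{Z})-E(Y\mid \mathbf{Z})\}^{2}\big]}{E\big[\{Y-E(Y\mid \mathbf{Z})\}^{2}\big]},
\]
that is, the share of the expected conditional variance of the response given the conditioning covariates that is attributable to the covariates of interest.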
Next, we discuss more properties of in the following proposition.
Proposition 3.1.
The partial Generalized Measure of Correlation (pGMC) defined in (3.1) satisfies that
-
1.
The pGMC is an index taking values between 0 and 1.
-
2.
The pGMC equals 0 if and only if the two conditional means coincide almost surely; that is, given the conditioning covariates, the covariates of interest do not add further information about the conditional mean of the response.
-
3.
The pGMC equals 1 if and only if the response is almost surely equal to its conditional mean given all the covariates; that is, the response is almost surely equal to a measurable function of the covariates.
-
4.
The relation of and the partial correlation coefficient satisfies that and implies that . If , then ; If , then .
-
5.
The relation of and satisfies that
If , then
-
6.
can also be rewritten as
-
7.
Suppose is a subset of and , then
In Proposition 3.1, properties (1)-(3) demonstrate that the new measure characterizes nonlinear partial mean dependence: it is 0 if and only if the response and the covariates of interest are conditionally mean independent given the conditioning covariates, and it is 1 if and only if the response is almost surely equal to a measurable function of the covariates. For example, in the quadratic example above with two independent standard normal covariates, the pGMC equals 1. Property (4) further indicates that the pGMC is capable of revealing conditional association between the response and the covariates of interest beyond the linear relation. The relationship between the new measure and the GMC of Zheng et al. (2012) is established in property (5). As the GMC is viewed as a generalization of the classical R-squared to a nonparametric model, the new partial measure can be viewed as the relative difference in explained variation obtained using the full set of covariates as compared to using the conditioning subset only. Moreover, when the conditional mean of the response does not depend on the conditioning covariates, the new partial measure reduces to the GMC; thus the pGMC is a natural extension of the GMC to partial mean dependence. From property (6), the pGMC measures the relative reduction of the population residual sum of squares when the covariates of interest are added to the model. The last property (7) means that the larger the conditioning information set is, the less additional information the remaining variables provide about the conditional mean of the response. In particular, the partial measure is always bounded by its unconditional counterpart.
Remark 3.1.
The pGMC is related with the novel conditional dependence measure proposed by Azadkia and Chatterjee (2021). Their measure is defined as
where the measure is built from conditional distribution functions of the response rather than conditional means. Comparing the two formulas, the pGMC can be considered as the conditional mean version of their measure. As argued in Cook and Li (2002) and Shao and Zhang (2014), the conditional mean is of primary interest in many applications such as regression problems. The pGMC provides a nonparametric way to measure the additional contribution of the covariates of interest to the conditional mean of the response given the conditioning covariates.
We remark that Williamson et al. (2021) introduced an appealing variable importance measure
which is essentially a normalized difference between two explained variances. It can measure the importance of the covariates of interest for the response. However, it is not a correlation-type measure of the partial mean dependence of the response on the covariates of interest given the conditioning covariates: it attains its maximal value only under conditions that additionally require the conditional mean of the response not to depend on the conditioning covariates, so the interpretation of its extreme value is unclear. The pGMC shares the same numerator, which measures the discrepancy between the two conditional means; the difference between their measure and the pGMC lies in the normalization term in the denominator. The pGMC normalizes by the expected conditional variation of the response given the conditioning covariates, which makes it a suitable partial mean dependence measure with a clear interpretation: the pGMC equals 1 if and only if the response is almost surely a measurable function of the covariates. More interesting properties of the pGMC as a partial mean dependence measure are also established in Proposition 3.1. These distinct attributes position the pGMC as a valuable addition to the literature on partial mean dependence measures.
3.2 Estimation and confidence interval
Next, we estimate the pGMC via data splitting. We split the data randomly into two independent parts of equal sample size and estimate the two conditional means using the first part. According to the representation of the pGMC in property (6), we estimate its numerator by the following statistic computed on the second part,
To improve the estimation efficiency, we use the cross-fitting method: another estimator is obtained similarly by swapping the roles of the two parts, and the numerator estimator is taken to be the average of the two. We remark that cross-fitting has been shown to improve estimation efficiency asymptotically in Fan et al. (2012), Chernozhukov et al. (2018) and Vansteelandt and Dukes (2022). The denominator of the pGMC is estimated in the same cross-fitted manner. Thus, the pGMC is estimated by the ratio of the cross-fitted numerator and denominator estimators.
It is worth noting that balanced data splitting and cross-fitting are used to estimate the pGMC, which differs from the unbalanced data splitting used for the significance testing procedure in the previous section. This is because, when the partial dependence between the response and the covariates of interest given the conditioning covariates truly exists, that is, when the pGMC is positive, we can use balanced data splitting to improve the estimation efficiency under the following condition (C2).
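A rough Python sketch of such a cross-fitted plug-in estimator, written for the ratio form displayed after (3.1) (our assumed form) and with a gradient-boosting learner standing in for XGBoost or a DNN, is given below; a root-n confidence interval as in section 3.2 would then be the estimate plus or minus a normal quantile times its estimated standard error.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def pgmc_estimate(Y, X, Z, seed=0):
    """Cross-fitted plug-in estimate of the assumed pGMC ratio
    E[(E(Y|X,Z) - E(Y|Z))^2] / E[(Y - E(Y|Z))^2]."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(Y))
    folds = (idx[: len(Y) // 2], idx[len(Y) // 2:])   # balanced split
    num = den = 0.0
    for a, b in ((0, 1), (1, 0)):                     # cross-fitting: swap the two halves
        tr, te = folds[a], folds[b]
        g = GradientBoostingRegressor().fit(Z[tr], Y[tr])                      # E(Y | Z)
        f = GradientBoostingRegressor().fit(np.hstack([X[tr], Z[tr]]), Y[tr])  # E(Y | X, Z)
        g_te = g.predict(Z[te])
        f_te = f.predict(np.hstack([X[te], Z[te]]))
        num += np.mean((f_te - g_te) ** 2) / 2.0      # numerator, averaged over the two folds
        den += np.mean((Y[te] - g_te) ** 2) / 2.0     # denominator, averaged over the two folds
    return num / den
```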
-
(C2)
and , where and ; and
, where and .
The rate assumptions in (C2) are also adopted in Chernozhukov et al. (2018) (equation 3.8), Williamson et al. (2021) (assumption A1), and Vansteelandt and Dukes (2022) (assumptions in Theorem 2). For a full discussion of the estimation of nuisance functions, see Chernozhukov et al. (2018), who showed that penalized and related methods in various sparse models can satisfy the above condition. For the DNN, Bauer and Kohler (2019) derived the convergence rate in terms of the smoothness parameter and the intrinsic dimensionality; for other recent developments, see also Schmidt-Hieber (2020) and Kohler and Langer (2021). The remaining moment assumptions in (C2) follow if the two conditional variance functions are bounded above.
Next, we establish the asymptotic normality of the proposed pGMC estimator in Theorem 3.1.
Theorem 3.1.
Suppose Conditions (C2) and are satisfied. When , we have
(3.2) |
where
As discussed in Remark 3.1, the pGMC is the conditional mean version of the measure introduced in Azadkia and Chatterjee (2021). However, the above theorem shows that the asymptotic properties of the two estimators are different. Azadkia and Chatterjee (2021) only derived a convergence rate for their estimator, which depends strongly on the dimensions of both groups of covariates; our convergence rate is significantly faster. Besides, we also establish asymptotic normality, which paves the way for deriving a confidence interval for the pGMC.
Next, we estimate the asymptotic variance in (3.2) and derive the confidence interval for the pGMC. Define
(3.3) |
A natural plug-in estimator of the asymptotic variance is then given by
(3.4) |
where the unknown quantities are replaced by consistent estimators, for example the cross-fitted estimators described above. Hence, a confidence interval for the pGMC at a given confidence level is
where the critical value is the corresponding upper-tail quantile of the standard normal distribution.
4 Special case: Conditional mean independence Test
As an important special case when there is no conditioning random object, the original pMIT problem (1.1) becomes
The null hypothesis indicates that the conditional mean of the response given the covariates does not depend on the covariates. This is the fundamental testing problem of checking the overall significance of the covariates for modeling the mean of the response, and such a test should be conducted before any regression modeling. Similar to the pMIT test statistic (2.2), we construct the test statistic by comparing the estimators of the conditional and unconditional means of the response based on data splitting and machine learning methods,
where the unconditional mean is estimated by the sample average over the second part of the data. With the unbalanced sample splitting strategy, the estimation error from the training part becomes negligible. We impose a similar estimation error condition on the machine learning estimator of the conditional mean of the response given the covariates.
-
(C1’)
for some .
The above condition (C1') corresponds to the first part of condition (C1) in the absence of conditioning covariates. The asymptotic distributions of the test statistic are established in the following corollary.
Corollary 4.1.
Assume that and Condition (C1’) holds with . When the null hypothesis holds, we have
Under local alternative hypotheses with ,
Further, assume that . When the fixed alternative hypothesis holds,
Corollary 4.1 demonstrates that the asymptotic null distribution of the standardized test statistic is the standard chi-squared distribution. Asymptotic distributions under local and fixed alternative hypotheses are also derived. We remark that the proposed significance test can be applied to high dimensional settings in a model-free framework. It is more appealing than the methods proposed by Zhang et al. (2018) and Li et al. (2023), which considered sum-type test statistics over all marginal tests under the weaker but necessary null hypothesis (1.3), because the summation of all marginal test statistics ignores the interactions among covariates and can fail for pure interaction models. We will show this comparison via simulations in the supplementary materials. Moreover, we can apply multiple data splitting to improve the algorithm stability. Since the procedure is similar to Algorithm 3 in Section 2, we omit the details here.
5 Numerical Studies
In this section, we present some numerical results to assess the finite sample performance of our proposed model-free statistical inference methods. To conserve space, extensive additional simulation results are included in the supplementary materials.
5.1 Partial mean independence test
We consider the following two examples including linear and nonlinear models.
Example A1: We first consider a linear regression
where the sample size and dimensions are specified below. The covariate vector is generated from a multivariate normal distribution with a prespecified correlation structure, and the stochastic error term follows a normal distribution. We consider three different settings of the coefficients of the covariates of interest: (1) the null hypothesis, under which all of these coefficients are set to zero; (2) the sparse alternative, under which only the first two elements are nonzero with equal magnitude; and (3) the dense alternative, under which every element is nonzero with equal magnitude.
Example A2: We consider a nonlinear regression
where the covariates and the error term are generated in the same way as in Example A1. The response depends on the covariates of interest via a quadratic term of a linear combination of them. (1) The null hypothesis holds when all of the corresponding coefficients are zero. We consider two settings slightly different from Example A1: (2) the sparse alternative, under which the first five elements are nonzero with equal magnitude; and (3) the dense alternative, under which the first half of the elements are nonzero with equal magnitude.
In each simulation, we estimate the conditional mean functions using the XGBoost (eXtreme Gradient Boosting) procedure (Chen and Guestrin, 2016) via the R package xgboost, and a deep neural network implemented with the torch module in Python. Default hyperparameter settings are used for XGBoost. For the neural network, we consider a fully connected network with fixed depth and width, trained with the Adam optimizer using a fixed learning rate, batch size, and maximum number of training epochs. To prevent under-fitting or over-fitting, it is vital to control the generalization ability of the model; we therefore employ an early stopping strategy, a useful practical technique that reduces the generalization error efficiently and prevents over-fitting. It is a data-driven approach to determining the number of training epochs: the model stops training when the validation loss does not improve within a given patience (a few epochs). In practice, we use the testing data to validate the model.
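A minimal PyTorch sketch of this training loop with early stopping is given below; the network width and depth, learning rate, batch size, epoch budget, and patience are placeholder values, since the exact settings are not fixed by the description above.

```python
import copy
import torch
import torch.nn as nn

def fit_mlp(x_train, y_train, x_val, y_val,
            width=64, depth=3, lr=1e-3, batch_size=32, max_epochs=200, patience=10):
    """Fit a fully connected regression network with Adam and early stopping
    on the validation loss. All hyperparameter values are illustrative."""
    layers, d_in = [], x_train.shape[1]
    for _ in range(depth):
        layers += [nn.Linear(d_in, width), nn.ReLU()]
        d_in = width
    model = nn.Sequential(*layers, nn.Linear(d_in, 1))

    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    loader = torch.utils.data.DataLoader(
        torch.utils.data.TensorDataset(x_train, y_train),
        batch_size=batch_size, shuffle=True)

    best_loss, best_state, wait = float("inf"), None, 0
    for _ in range(max_epochs):
        model.train()
        for xb, yb in loader:
            opt.zero_grad()
            loss_fn(model(xb).squeeze(-1), yb).backward()
            opt.step()
        model.eval()
        with torch.no_grad():
            val_loss = loss_fn(model(x_val).squeeze(-1), y_val).item()
        if val_loss < best_loss:          # keep the best model seen so far
            best_loss, best_state, wait = val_loss, copy.deepcopy(model.state_dict()), 0
        else:
            wait += 1
            if wait >= patience:          # early stopping: no recent improvement
                break
    model.load_state_dict(best_state)
    return model
```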
Recall that we start with unbalanced single data splitting in Algorithm 1. To evaluate the effect of the data splitting ratio, we consider different choices of the splitting ratio in the numerical examples. We also implement the easy-to-use formula provided in Dai et al. (2024) with the quantities defined therein, and the proposed adaptive data splitting procedure in Algorithm 2. From the corresponding table in the Supplement, it is interesting to note that for balanced data splitting, the empirical size of pMIT is hard to control under the null hypothesis, which is consistent with our theoretical results in Theorem 2.1. As the training proportion becomes larger, the empirical sizes get closer to the significance level, while the test loses power due to the smaller sample size of the second part of the data. We find that both the easy-to-use formula and the data-adaptive splitting procedure work well, while the data-adaptive splitting procedure generally has higher power than the easy-to-use formula.
Then, we compare different aggregation methods in the proposed ensemble partial mean independence test, including the multiple random splitting procedure in Meinshausen et al. (2009), the Cauchy combination approach in Liu and Xie (2020), and the rank-transformed subsampling method in Guo and Shah (2023). We summarize these combining methods in a table in the Supplement. In our simulation results, we find that the rank-transformed subsampling method has the largest power, while the multiple random splitting procedure in Meinshausen et al. (2009) is generally conservative. However, the computational cost of the rank-transformed subsampling method is generally heavy. Thus, in the following simulations we adopt the Cauchy combination approach for its computational efficiency and its ability to detect signals.
To proceed, we conduct simulation experiments to assess the finite sample performance of the proposed single data splitting procedure in Algorithm 1 (denoted as pMIT), its multiple data splitting version in Algorithm 3 (denoted as pMIT-multi), the power enhanced procedure (denoted as pMITe), and its multiple data splitting version (denoted as pMITe-multi). The adaptive data splitting in Algorithm 2 is adopted to determine the splitting ratio. For comparison, we also implement Algorithm 3 in Williamson et al. (2023), implemented by the cv_vim function from the R package vimp, resulting in 4 folds (denoted as vim); the Projected Covariance Measure in Algorithm 1 of Lundborg et al. (2022) with their practically recommended splitting ratio (denoted as pcm); the multiple splitting version in Algorithm 2 of Lundborg et al. (2022) (denoted as pcm-multi); and the one-split test and the combined one-split test in Dai et al. (2024) with their proposed adaptive splitting scheme (denoted as DSP and DSP-multi, respectively). For each experiment, we run 500 replications to compute the empirical sizes and powers. The simulation results are presented in Tables 1-2. Additional simulations are given in the Supplement.
Table 1: Empirical sizes and powers for Example A1.

| Setting | N | pMIT (XGB) | pMIT (DNN) | pMIT-multi (XGB) | pMIT-multi (DNN) | pMITe (XGB) | pMITe (DNN) | pMITe-multi (XGB) | pMITe-multi (DNN) | pcm | pcm-multi | vim | DSP | DSP-multi |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| H0 | 100 | 0.032 | 0.060 | 0.032 | 0.076 | 0.050 | 0.064 | 0.060 | 0.076 | 0.058 | 0.000 | 0.048 | 0.000 | 0.000 |
| H0 | 200 | 0.062 | 0.030 | 0.052 | 0.060 | 0.062 | 0.054 | 0.056 | 0.062 | 0.030 | 0.000 | 0.036 | 0.000 | 0.000 |
| H0 | 300 | 0.038 | 0.046 | 0.050 | 0.050 | 0.060 | 0.048 | 0.054 | 0.072 | 0.026 | 0.000 | 0.032 | 0.000 | 0.000 |
| H0 | 400 | 0.050 | 0.044 | 0.054 | 0.050 | 0.056 | 0.050 | 0.066 | 0.070 | 0.016 | 0.000 | 0.030 | 0.000 | 0.000 |
| sparse | 100 | 0.110 | 0.136 | 0.146 | 0.212 | 0.284 | 0.186 | 0.538 | 0.250 | 0.138 | 0.008 | 0.308 | 0.060 | 0.090 |
| sparse | 200 | 0.442 | 0.492 | 0.926 | 0.886 | 0.740 | 0.522 | 0.994 | 0.908 | 0.684 | 0.794 | 0.880 | 0.416 | 0.690 |
| sparse | 300 | 0.822 | 0.738 | 1.000 | 1.000 | 0.984 | 0.764 | 1.000 | 1.000 | 0.966 | 1.000 | 0.962 | 0.852 | 1.000 |
| sparse | 400 | 0.984 | 0.818 | 1.000 | 1.000 | 1.000 | 0.828 | 1.000 | 1.000 | 0.998 | 1.000 | 1.000 | 0.994 | 1.000 |
| dense | 100 | 0.076 | 0.228 | 0.068 | 0.410 | 0.246 | 0.316 | 0.470 | 0.496 | 0.126 | 0.000 | 0.244 | 0.264 | 0.384 |
| dense | 200 | 0.210 | 0.674 | 0.272 | 0.982 | 0.554 | 0.710 | 0.930 | 0.996 | 0.430 | 0.280 | 0.662 | 0.808 | 1.000 |
| dense | 300 | 0.492 | 0.896 | 0.732 | 1.000 | 0.866 | 0.908 | 0.998 | 1.000 | 0.836 | 0.956 | 0.908 | 0.994 | 1.000 |
| dense | 400 | 0.804 | 0.938 | 0.974 | 1.000 | 0.986 | 0.952 | 1.000 | 1.000 | 0.976 | 1.000 | 0.974 | 1.000 | 1.000 |
Table 2: Empirical sizes and powers for Example A2.

| Setting | N | pMIT (XGB) | pMIT (DNN) | pMIT-multi (XGB) | pMIT-multi (DNN) | pMITe (XGB) | pMITe (DNN) | pMITe-multi (XGB) | pMITe-multi (DNN) | pcm | pcm-multi | vim | DSP | DSP-multi |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| H0 | 100 | 0.034 | 0.044 | 0.032 | 0.036 | 0.042 | 0.046 | 0.054 | 0.050 | 0.058 | 0.000 | 0.048 | 0.000 | 0.000 |
| H0 | 200 | 0.054 | 0.046 | 0.036 | 0.054 | 0.060 | 0.060 | 0.048 | 0.076 | 0.024 | 0.000 | 0.036 | 0.000 | 0.000 |
| H0 | 300 | 0.048 | 0.060 | 0.054 | 0.036 | 0.052 | 0.060 | 0.056 | 0.074 | 0.018 | 0.000 | 0.058 | 0.000 | 0.000 |
| H0 | 400 | 0.048 | 0.038 | 0.060 | 0.056 | 0.056 | 0.052 | 0.062 | 0.076 | 0.028 | 0.000 | 0.050 | 0.000 | 0.000 |
| sparse | 100 | 0.126 | 0.104 | 0.270 | 0.172 | 0.330 | 0.198 | 0.650 | 0.446 | 0.074 | 0.000 | 0.054 | 0.020 | 0.044 |
| sparse | 200 | 0.164 | 0.220 | 0.398 | 0.382 | 0.466 | 0.296 | 0.896 | 0.642 | 0.088 | 0.008 | 0.224 | 0.078 | 0.128 |
| sparse | 300 | 0.262 | 0.284 | 0.538 | 0.640 | 0.698 | 0.326 | 0.964 | 0.808 | 0.234 | 0.052 | 0.446 | 0.336 | 0.652 |
| sparse | 400 | 0.450 | 0.356 | 0.840 | 0.826 | 0.914 | 0.410 | 1.000 | 0.932 | 0.442 | 0.446 | 0.662 | 0.742 | 0.946 |
| dense | 100 | 0.126 | 0.114 | 0.288 | 0.170 | 0.368 | 0.208 | 0.714 | 0.464 | 0.062 | 0.000 | 0.050 | 0.044 | 0.024 |
| dense | 200 | 0.152 | 0.222 | 0.344 | 0.438 | 0.428 | 0.388 | 0.872 | 0.730 | 0.062 | 0.000 | 0.100 | 0.242 | 0.388 |
| dense | 300 | 0.160 | 0.376 | 0.386 | 0.800 | 0.518 | 0.440 | 0.924 | 0.876 | 0.088 | 0.000 | 0.198 | 0.610 | 0.852 |
| dense | 400 | 0.214 | 0.496 | 0.484 | 0.934 | 0.652 | 0.552 | 0.986 | 0.956 | 0.140 | 0.012 | 0.294 | 0.880 | 1.000 |
According to the simulation results, we have the following findings. Firstly, the empirical sizes of our proposed tests and of the vim procedure are close to the significance level, while the empirical sizes of pcm-multi, DSP and DSP-multi are very small. Secondly, the power enhancement strategy and multiple data splitting effectively improve the empirical power of the proposed pMIT; we therefore recommend the multiple data splitting version of the power-enhanced test statistic, pMITe-multi. Thirdly, the proposed methods are strong alternatives to existing tests in most cases: the empirical powers of our procedure with power enhancement and multiple data splitting are similar to or even higher than those of the other procedures. For instance, in Example A2, the pcm and vim procedures do not have high power against dense alternative hypotheses.
In addition, we also conduct numerical simulations to assess the impact of different network structures (the depth and the width). The corresponding results, provided in the Supplement, indicate that the performance of the proposed methods is not sensitive to the network structure. Furthermore, we also check the robustness of the proposed methods in the case of heterogeneous errors via simulations; see Section A.2 in the Supplement.
5.2 Confidence interval for pGMC
In this subsection, we conduct numerical simulations to evaluate the empirical performance of the proposed confidence interval for the pGMC. We consider two models in the following.
Example B1:
where the coefficient vectors have the specified dimensions. The two groups of covariates are independent and are generated from multivariate normal distributions with prespecified correlation structures. Only the first three elements of each coefficient vector are nonzero, with equal magnitudes chosen so that the true pGMC takes a prespecified value. The error term follows the standard normal distribution.
Example B2:
where the coefficient vector has the specified dimension and the model involves the first elements of the two covariate groups. The other settings are the same as in Example B1.
To make inference for the pGMC on the simulated data, we build a fully connected neural network with the default settings and apply a distance correlation based feature screening technique (Li et al., 2012) before training. To be more specific, we first apply the distance correlation based feature screening approach to the training data in order to reduce the dimension of the predictors, and then use the reduced predictors to train the neural network model. We repeat 500 realizations for each setting to calculate the coverage probabilities (CP) and the average lengths (AL) of 95% confidence intervals. The corresponding numerical results are reported in Table 3.
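A minimal sketch of this screening step, assuming the Python `dcor` package is available, ranks the predictors by their marginal distance correlation with the response and keeps only the top few before the network is trained; the number of retained predictors is illustrative.

```python
import numpy as np
import dcor

def screen_by_distance_correlation(X, y, keep=50):
    """Keep the `keep` columns of X with the largest marginal distance
    correlation with y (screening in the spirit of Li et al., 2012)."""
    scores = np.array([dcor.distance_correlation(X[:, j], y) for j in range(X.shape[1])])
    selected = np.argsort(scores)[::-1][:keep]
    return X[:, selected], selected
```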
Table 3: Empirical coverage probabilities (CP) and average lengths (AL) of 95% confidence intervals for the pGMC in Examples B1 and B2.

| | | B1: CP | B1: AL | B1: CP | B1: AL | B2: CP | B2: AL | B2: CP | B2: AL |
|---|---|---|---|---|---|---|---|---|---|
| 100 | 100 | 0.874 | 0.135 | 0.860 | 0.133 | 0.804 | 0.139 | 0.840 | 0.140 |
| 200 | 200 | 0.918 | 0.093 | 0.882 | 0.092 | 0.870 | 0.095 | 0.864 | 0.096 |
| 300 | 300 | 0.926 | 0.074 | 0.912 | 0.074 | 0.894 | 0.077 | 0.894 | 0.076 |
| 400 | 400 | 0.930 | 0.064 | 0.926 | 0.064 | 0.918 | 0.066 | 0.916 | 0.065 |
| 500 | 500 | 0.946 | 0.056 | 0.942 | 0.057 | 0.942 | 0.058 | 0.938 | 0.058 |
The results in Table 3 provide strong evidence corroborating the asymptotic theory. As the sample size increases, the empirical coverage probabilities tend to the nominal level of 95% and the average lengths of the confidence intervals become smaller.
6 Real data example
6.1 Significant parts on facial images
Automatic age prediction from facial images has gained substantial relevance in numerous applications, particularly due to the rapid growth of social media; see Rothe et al. (2015), Antipov et al. (2017) and Fang et al. (2019), for instance. In this subsection, we investigate the significance of three key regions of facial photos for human age prediction. The data set is available from www.kaggle.com/datasets/mariafrenti/age-prediction and contains thousands of images, each being a three-channel facial photograph, together with the age of the person in each image. Our goal is to make statistical inference about the significant predictive features on the images for predicting age.

We randomly select a subsample of images and artificially cover different key parts of the images (Case 1: eyes; Case 2: mouths; Case 3: the top-left corner edge region); see Figure 1 for an illustration. In each covered image, the covered part is treated as the covariates of interest, while the uncovered part plays the role of the conditioning covariates. We employ a ResNet-18 neural network and add two fully connected layers at the end of the original network for training. Other hyperparameters are kept the same as in the simulation experiments. The proposed procedures, along with the other existing methods, are conducted and the resulting p-values for each case are presented in Table 4.
Table 4: P-values for testing the significance of the three covered facial regions.

| Region | pMIT | pMIT-multi | pMITe | pMITe-multi | pcm | pcm-multi | vim | DSP | DSP-multi |
|---|---|---|---|---|---|---|---|---|---|
| Case 1 | 0.043 | 0.045 | 0.036 | 0.037 | 0.441 | 0.386 | 0.522 | 0.209 | 0.522 |
| Case 2 | 0.046 | 0.029 | 0.036 | 0.024 | 0.196 | 0.426 | 0.627 | 0.050 | 0.064 |
| Case 3 | 0.178 | 0.195 | 0.164 | 0.182 | 0.142 | 0.409 | 0.470 | 0.616 | 0.801 |
While most other testing methods fail to detect the two significant regions in Cases 1 and 2, the results in Table 4 show that our pMIT tests consistently reject the null hypothesis for these two cases, suggesting that both the eyes and the mouth are significant regions for age prediction, as further illustrated visually in Figure 1. The corresponding confidence intervals of the pGMC are also obtained for the two regions. Region 3 (the top-left corner edge region) is recognized as unimportant by all testing procedures, which is consistent with intuition.
6.2 Significant gene expression levels
The second data set concerns gene expression and was previously studied in Scheetz et al. (2006) and Huang et al. (2008). It contains 120 twelve-week-old male rats, for which 31,042 different probes from eye tissue were measured. In Chiang et al. (2006), the gene TRIM32 was believed to cause Bardet-Biedl syndrome, which is a genetically heterogeneous disease of multiple organ systems including the retina. Scheetz et al. (2006) excluded the probes that were not expressed in the eye or that lacked sufficient variation; as a result, a total of 18,976 probes were retained. Among these probes, Huang et al. (2008) further selected 3,000 probes with large variances. We are interested in the genes whose expression has significant effects on that of TRIM32.
We first apply our proposed conditional mean independence test to check whether the 3,000 selected probes are significantly predictive of the expression of TRIM32. The p-values of the proposed conditional mean independence tests are all very small, providing strong evidence that the probes have significant effects on the prediction of the gene expression of TRIM32.
Based on the results in Huang et al. (2008), a total of 19 significant probes are selected by the adaptive group LASSO. We are concerned with whether any significant probe has been omitted. Hence, we apply the newly proposed significance test procedures to test whether the remaining probes still contribute to the conditional mean of the gene expression of TRIM32 given the selected probes. We standardize the data and adopt XGBoost and a DNN with three hidden layers to estimate the conditional means. Other existing procedures are also implemented for comparison. From Table 5, it is observed that none of the procedures rejects the null hypothesis at the 5% significance level. This means that, given the selected probes, the other probes do not bring significant additional information for predicting the gene expression of TRIM32. Based on the inference above, we conclude that the 19 probes selected by the adaptive group LASSO in Huang et al. (2008) provide useful information for further biological study of TRIM32.
Table 5: P-values for testing the significance of the remaining probes given the 19 selected probes.

| | pMIT (XGB) | pMIT (DNN) | pMIT-multi (XGB) | pMIT-multi (DNN) | pMITe (XGB) | pMITe (DNN) | pMITe-multi (XGB) | pMITe-multi (DNN) | pcm | pcm-multi | vim | DSP | DSP-multi |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| p-values | 0.931 | 0.718 | 0.484 | 0.245 | 0.880 | 0.683 | 0.438 | 0.181 | 0.945 | 0.350 | 0.583 | 0.544 | 0.538 |
7 Conclusion
In this paper, we propose a new significance test for the partial mean independence problem based on machine learning methods and data splitting, which is applicable to high dimensional data. The pMIT test statistic converges to the standard chi-squared distribution under the null hypothesis, while it converges to a normal distribution under the fixed alternative hypothesis. Power enhancement and algorithm stability based on multiple data splitting are also discussed. When the partial mean independence hypothesis is rejected, we propose a new partial dependence measure, called the partial Generalized Measure of Correlation (pGMC), based on the decomposition formula of the conditional variance, derive its theoretical properties, and construct a confidence interval. As an important special case when there is no conditioning random object, we also investigate significance testing of potentially high dimensional covariates for the conditional mean of the response in a model-free framework. In the future, it would be interesting to extend the proposed ideas to conditional/partial quantile independence and conditional/partial independence problems.
Supplementary materials. Additional simulation results including the finite sample performance of the conditional mean independence test, and all theoretical proofs are included in the supplementary materials.
References
- Antipov et al. (2017) Antipov, G., Baccouche, M., Berrani, S.-A., and Dugelay, J.-L. (2017), “Effective training of convolutional neural networks for face-based gender and age prediction,” Pattern Recognition, 72, 15–26.
- Azadkia and Chatterjee (2021) Azadkia, M. and Chatterjee, S. (2021), “A simple measure of conditional dependence,” The Annals of Statistics, 49, 3070–3102.
- Bauer and Kohler (2019) Bauer, B. and Kohler, M. (2019), “On deep learning as a remedy for the curse of dimensionality in nonparametric regression,” The Annals of Statistics, 47, 2261–2285.
- Bühlmann (2006) Bühlmann, P. (2006), “Boosting for high-dimensional linear models,” The Annals of Statistics, 34, 559 – 583.
- Bühlmann and Yu (2003) Bühlmann, P. and Yu, B. (2003), “Boosting with the L2 loss: regression and classification,” Journal of the American Statistical Association, 98, 324–339.
- Cai et al. (2022) Cai, Z., Lei, J., and Roeder, K. (2022), “Model-free prediction test with application to genomics data,” Proceedings of the National Academy of Sciences, 119, e2205518119.
- Chen and Guestrin (2016) Chen, T. and Guestrin, C. (2016), “Xgboost: A scalable tree boosting system,” in Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining, pp. 785–794.
- Chernozhukov et al. (2018) Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., and Robins, J. (2018), “Double/debiased machine learning for treatment and structural parameters,” The Econometrics Journal, 21, C1–C68.
- Chiang et al. (2006) Chiang, A. P., Beck, J. S., Yen, H. J., Tayeh, M. K., Scheetz, T. E., Swiderski, R. E., Nishimura, D. Y., Braun, T. A., Kim, K. Y. A., and Huang, J. (2006), “Homozygosity mapping with SNP arrays identifies TRIM32, an E3 ubiquitin ligase, as a Bardet–Biedl syndrome gene (BBS11),” Proceedings of the National Academy of Sciences of the United States of America, 103.
- Choi and Kim (2023) Choi, W. and Kim, I. (2023), “Averaging p-values under exchangeability,” Statistics & Probability Letters, 194, 109748.
- Cook and Li (2002) Cook, R. D. and Li, B. (2002), “Dimension reduction for conditional mean in regression,” The Annals of Statistics, 30, 455–474.
- Cui et al. (2018) Cui, H., Guo, W., and Zhong, W. (2018), “Test for high-dimensional regression coefficients using refitted cross-validation variance estimation,” The Annals of Statistics, 46, 958–988.
- Dai et al. (2024) Dai, B., Shen, X., and Pan, W. (2024), “Significance tests of feature relevance for a black-box learner,” IEEE Transactions on Neural Networks and Learning Systems, 35, 1898–1911.
- Delgado and González-Manteiga (2001) Delgado, M. and González-Manteiga, W. (2001), “Significance testing in nonparametric regression based on the bootstrap,” The Annals of Statistics, 29, 1469–1507.
- Dette, H., Siburg, K. F., and Stoimenov, P. A. (2013), “A copula-based non-parametric measure of regression dependence,” Scandinavian Journal of Statistics, 40, 21–41.
- DiCiccio, C. J., DiCiccio, T. J., and Romano, J. P. (2020), “Exact tests via multiple data splitting,” Statistics & Probability Letters, 166, 108865.
- Fan, J., Guo, S., and Hao, N. (2012), “Variance estimation using refitted cross-validation in ultrahigh dimensional regression,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), 74, 37–65.
- Fan, J., Liao, Y., and Yao, J. (2015), “Power enhancement in high-dimensional cross-sectional tests,” Econometrica, 83, 1497–1541.
- Fan, Y. and Li, Q. (1996), “Consistent model specification tests: omitted variables and semiparametric functional forms,” Econometrica, 64, 865–890.
- Fang, J., Yuan, Y., Lu, X., and Feng, Y. (2019), “Muti-stage learning for gender and age prediction,” Neurocomputing, 334, 114–124.
- Gan, L., Zheng, L., and Allen, G. I. (2022), “Inference for Interpretable Machine Learning: Fast, Model-Agnostic Confidence Intervals for Feature Importance,” arXiv preprint arXiv:2206.02088.
- Goeman, J. J., Van De Geer, S. A., and Van Houwelingen, H. C. (2006), “Testing against a high dimensional alternative,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68, 477–493.
- González-Manteiga, W. and Crujeiras, R. M. (2013), “An updated review of goodness-of-fit tests for regression models,” Test, 22, 361–411.
- Guo, B. and Chen, S. X. (2016), “Tests for high dimensional generalized linear models,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), 78, 1079–1102.
- Guo, F. R. and Shah, R. D. (2023), “Rank-transformed subsampling: inference for multiple data splitting and exchangeable p-values,” arXiv preprint arXiv:2301.02739.
- Guo, X., Wang, T., and Zhu, L. (2016), “Model checking for parametric single-index models: a dimension reduction model-adaptive approach,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), 78, 1013–1035.
- Huang, J., Ma, S., and Zhang, C. H. (2008), “Adaptive LASSO for sparse high-dimensional regression,” Statistica Sinica, 18, 1603–1618.
- Kohler, M. and Langer, S. (2021), “On the rate of convergence of fully connected deep neural network regression estimates,” The Annals of Statistics, 49, 2231–2249.
- Kueck, J., Luo, Y., Spindler, M., and Wang, Z. (2023), “Estimation and inference of treatment effects with L2-boosting in high-dimensional settings,” Journal of Econometrics, 234, 714–731.
- Lei, J., G’Sell, M., Rinaldo, A., Tibshirani, R. J., and Wasserman, L. (2018), “Distribution-free predictive inference for regression,” Journal of the American Statistical Association, 113, 1094–1111.
- Li, R., Xu, K., Zhou, Y., and Zhu, L. (2023), “Testing the effects of high-dimensional covariates via aggregating cumulative covariances,” Journal of the American Statistical Association, 118, 2184–2194.
- Li, R., Zhong, W., and Zhu, L. (2012), “Feature screening via distance correlation learning,” Journal of the American Statistical Association, 107, 1129–1139.
- Liu, Y. and Xie, J. (2020), “Cauchy combination test: a powerful test with analytic p-value calculation under arbitrary dependency structures,” Journal of the American Statistical Association, 115, 393–402.
- Lundborg, A. R. (2023), “Modern Methods for Variable Significance Testing,” Ph.D. thesis, University of Cambridge.
- Lundborg, A. R., Kim, I., Shah, R. D., and Samworth, R. J. (2022), “The Projected Covariance Measure for assumption-lean variable significance testing,” arXiv preprint arXiv:2211.02039.
- Meinshausen, N., Meier, L., and Bühlmann, P. (2009), “P-values for high-dimensional regression,” Journal of the American Statistical Association, 104, 1671–1681.
- Park, T., Shao, X., and Yao, S. (2015), “Partial martingale difference correlation,” Electronic Journal of Statistics, 9, 1492–1517.
- Rothe, R., Timofte, R., and Van Gool, L. (2015), “Dex: Deep expectation of apparent age from a single image,” in Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 10–15.
- Scheetz, T. E., Kim, K., Swiderski, R. E., Philp, A. R., Braun, T. A., Knudtson, K. L., Dorrance, A. M., DiBona, G. F., Huang, J., and Casavant, T. L. (2006), “Regulation of gene expression in the mammalian eye and its relevance to eye disease,” Proceedings of the National Academy of Sciences of the United States of America, 103, 14429–14434.
- Schmidt-Hieber, J. (2020), “Nonparametric regression using deep neural networks with ReLU activation function,” The Annals of Statistics, 48, 1875–1897.
- Shah, R. D. and Peters, J. (2020), “The hardness of conditional independence testing and the generalised covariance measure,” The Annals of Statistics, 48, 1514–1538.
- Shao, X. and Zhang, J. (2014), “Martingale difference correlation and its use in high-dimensional variable screening,” Journal of the American Statistical Association, 109, 1302–1318.
- Székely, G. and Rizzo, M. (2014), “Partial distance correlation with methods for dissimilarities,” The Annals of Statistics, 42, 2382–2412.
- Tansey, W., Veitch, V., Zhang, H., Rabadan, R., and Blei, D. M. (2022), “The holdout randomization test for feature selection in black box models,” Journal of Computational and Graphical Statistics, 31, 151–162.
- Vansteelandt, S. and Dukes, O. (2022), “Assumption-lean inference for generalised linear model parameters,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), 84, 657–685.
- Verdinelli, I. and Wasserman, L. (2023), “Feature Importance: A Closer Look at Shapley Values and LOCO,” arXiv preprint arXiv:2303.05981.
- Verdinelli, I. and Wasserman, L. (2024), “Decorrelated variable importance,” Journal of Machine Learning Research, 25, 1–27.
- Vovk, V. and Wang, R. (2020), “Combining p-values via averaging,” Biometrika, 107, 791–808.
- Wang, L. and Akritas, M. G. (2006), “Testing for covariate effects in the fully nonparametric analysis of covariance model,” Journal of the American Statistical Association, 101, 722–736.
- Wang, X., Pan, W., Hu, W., Tian, Y., and Zhang, H. (2015), “Conditional distance correlation,” Journal of the American Statistical Association, 110, 1726–1734.
- Williamson, B. D., Gilbert, P. B., Carone, M., and Simon, N. (2021), “Nonparametric variable importance assessment using machine learning techniques,” Biometrics, 77, 9–22.
- Williamson, B. D., Gilbert, P. B., Simon, N. R., and Carone, M. (2023), “A general framework for inference on algorithm-agnostic variable importance,” Journal of the American Statistical Association, 118, 1645–1658.
- Zhang, L. and Janson, L. (2020), “Floodgate: inference for model-free variable importance,” arXiv preprint arXiv:2007.01283.
- Zhang, X., Yao, S., and Shao, X. (2018), “Conditional mean and quantile dependence testing in high dimension,” The Annals of Statistics, 46, 219–246.
- Zheng, S., Shi, N.-Z., and Zhang, Z. (2012), “Generalized measures of correlation for asymmetry, nonlinearity, and beyond,” Journal of the American Statistical Association, 107, 1239–1252.
- Zhong, P.-S. and Chen, S. X. (2011), “Tests for high-dimensional regression coefficients with factorial designs,” Journal of the American Statistical Association, 106, 260–274.
- Zhu, X. and Zhu, L. (2018), “Dimension reduction-based significance testing in nonparametric regression,” Electronic Journal of Statistics, 12, 1468–1506.