
Test and Measure for Partial Mean Dependence Based on Machine Learning Methods

Leheng Cai1, Xu Guo2 and Wei Zhong3
1 Tsinghua University, 2 Beijing Normal University and 3 Xiamen University
We would like to thank the Editor, the Associate Editor, and the anonymous reviewers for their valuable comments and constructive suggestions, which led to significant improvements in the paper. All authors contributed equally to this article, and the authors are listed in alphabetical order. Corresponding authors: Xu Guo, Wei Zhong. Email address: [email protected]; [email protected].
Abstract

It is of importance to investigate the significance of a subset of covariates $W$ for the response $Y$ given covariates $Z$ in regression modeling. To this end, we propose a significance test for the partial mean independence problem based on machine learning methods and data splitting. The test statistic converges to the standard chi-squared distribution under the null hypothesis while it converges to a normal distribution under the fixed alternative hypothesis. Power enhancement and algorithm stability are also discussed. If the null hypothesis is rejected, we propose a partial Generalized Measure of Correlation (pGMC) to measure the partial mean dependence of $Y$ given $W$ after controlling for the nonlinear effect of $Z$. We present the appealing theoretical properties of the pGMC and establish the asymptotic normality of its estimator with the optimal root-$N$ convergence rate. Furthermore, the valid confidence interval for the pGMC is also derived. As an important special case when there are no conditional covariates $Z$, we introduce a new test of overall significance of covariates for the response in a model-free setting. Numerical studies and real data analysis are also conducted to compare with existing approaches and to demonstrate the validity and flexibility of our proposed procedures.

Keywords: machine learning methods; partial mean independence test; partial generalized measure of correlation.

1 Introduction

It is of importance to investigate the significance of a subset of covariates for the response in regression modeling. In practice, some covariates are often known to be related to the response of interest based on historical analysis or domain knowledge. We aim to test the significance of another group of covariates for the response given the known covariates, especially in high dimensional data. For example, Scheetz et al. (2006) detected 22 gene probes related to human eye disease from 18,976 different gene probes, and it is interesting to test whether the remaining high dimensional gene probes still contribute to the response conditional on the subset of 22 identified gene probes. This problem can be formulated as the following hypothesis testing problem

H_{0}:\,\,\,\mbox{E}(Y|X)=\mbox{E}(Y|Z) \quad \mbox{v.s.} \quad H_{1}:\,\,\,\mbox{E}(Y|X)\neq\mbox{E}(Y|Z), \qquad (1.1)

where $Y$ is a scalar-valued response variable of interest, $X=(X_{1},\cdots,X_{p})^{\top}=(Z^{\top},W^{\top})^{\top}\in\mathbb{R}^{p}$ is a $p$-dimensional vector of covariates, $Z\in\mathbb{R}^{p_{1}}$ and $W\in\mathbb{R}^{p_{2}}$ are two subsets of $X$, and $p_{1}+p_{2}=p$. The equality in $H_{0}$ indicates that $Y$ and $W$ are partially mean independent after controlling for the nonlinear effect of $Z$. If $H_{0}$ is not rejected, $W$ is considered insignificant for $Y$ and can be omitted from the regression model. Thus, this test is also called a significance test (Delgado and Gonzáles-Manteiga, 2001) or an omitted variable test (Fan and Li, 1996). In the literature, several tests in nonparametric settings have been proposed based on locally and globally smoothing methodologies, whose representatives are Fan and Li (1996)'s test and Delgado and Gonzáles-Manteiga (2001)'s test, respectively. However, both tests are based on kernel smoothing estimation and severely suffer from the curse of dimensionality. Zhu and Zhu (2018) proposed a dimension reduction-based adaptive-to-model (DREAM) test for the significance of a subset of covariates. However, the DREAM test only focuses on the relatively low dimensional situation and cannot perform satisfactorily in high dimensions. Therefore, it is still challenging to test $H_{0}$ in high dimensional settings.

In this paper, we propose a new significance test for the partial mean independence problem in (1.1) based on machine learning methods. The main obstacle in testing $H_{0}$ is estimating the conditional means in high dimensions, where classical nonparametric estimators such as kernel smoothing and local polynomial fitting perform poorly. Alternatively, we apply machine learning methods such as deep neural networks (DNN) and eXtreme Gradient Boosting (XGBoost) to estimate the conditional means in the new test. Specifically, we randomly split the data into two parts. We estimate the conditional means via machine learning methods using the first part of the data and conduct the test based on the second part. We theoretically show that the test statistic converges to the standard chi-squared distribution under the null hypothesis while it converges to a normal distribution under the fixed alternative hypothesis. In this paper, we introduce a new reformulation of the null hypothesis, which is helpful to obtain a tractable limiting null distribution. Power enhancement and algorithm stability based on multiple data splitting are also discussed.

Recently, there has been increasing interest in developing powerful testing procedures for the partial mean independence problem. Some remarkable developments include Dai et al. (2024), Williamson et al. (2023), Verdinelli and Wasserman (2024), and Lundborg et al. (2022). These papers also adopt machine learning methods and sample splitting. To deal with the fundamental difficulty that a naive test statistic has a degenerate distribution, different attempts have been made. Dai et al. (2024) added random noise to their test statistic. Williamson et al. (2023) alleviated this problem by estimating two non-vanishing parameters on separate splits of the data, which was also adopted in the two-split test of Dai et al. (2024). The method of Verdinelli and Wasserman (2024) is similar to Dai et al. (2024) but without introducing additional randomness. They achieved tractable limiting null distributions at the price of possible power loss. Lundborg et al. (2022) introduced a notable procedure named the Projected Covariance Measure and proved that, under linear regression models with fixed dimensional $W$, their procedure is more powerful than those of Dai et al. (2024) and Williamson et al. (2023). However, under the model-free framework, which is the focus of this paper, their theory requires three or more subsamples. For other relevant papers, see also Lei et al. (2018), Zhang and Janson (2020), Shah and Peters (2020), Gan et al. (2022), Tansey et al. (2022), and Verdinelli and Wasserman (2023). For a recent review, see Lundborg (2023).

In this paper, we introduce a new procedure to deal with the degenerate null distribution issue. The proposed test procedure has the following four merits compared with existing methods. Firstly, the newly proposed test statistic does not involve data perturbation as in Dai et al. (2024), nor does it require three or more subsamples as in Williamson et al. (2023). Secondly, the proposed procedure has nontrivial power against local alternative hypotheses that converge to the null faster than those in Williamson et al. (2023) and Dai et al. (2024). Thirdly, our test procedure only requires estimating two conditional means and is thus computationally practical, while Algorithm 1 in Lundborg et al. (2022) necessitates estimating at least five conditional means. Furthermore, we introduce a novel power enhancement procedure, which differs significantly from the power enhancement technique using an $L_{\infty}$-type thresholding statistic in Fan et al. (2015), turning the degeneracy issue into a blessing of power.

If the null hypothesis $H_{0}$ is rejected, then it is meaningful to measure the partial mean dependence of $Y$ given $W$ after controlling for the nonlinear effect of $Z$. The traditional Pearson's partial correlation coefficient cannot measure the nonlinear partial dependence between $Y$ and $W$ given $Z$. In the literature, several works focus on quantifying the partial/conditional dependence between two random objects given another one. Székely and Rizzo (2014) defined the partial distance correlation with the help of a Hilbert space for measuring the partial dependence. Wang et al. (2015) introduced a conditional distance correlation for conditional dependence. Azadkia and Chatterjee (2021) proposed a new conditional dependence measure which generalizes a univariate nonparametric measure of regression dependence in Dette et al. (2013). However, although conditional independence implies conditional mean independence, the converse does not hold, and conditional mean independence has a clear connection to regression modelling. To measure the partial mean dependence, Park et al. (2015) extended the idea of Székely and Rizzo (2014) to define the partial Martingale Difference Correlation (pMDC), which measures the conditional mean dependence of $Y$ given $W$ adjusting for $Z$. However, the asymptotic properties of the pMDC are still unclear. As another main contribution, we propose a new partial dependence measure, called the partial Generalized Measure of Correlation (pGMC), derive its theoretical properties, and construct a confidence interval. The new pGMC is based on the decomposition formula of the conditional variance and thus can be considered an extension of the Generalized Measure of Correlation (GMC) proposed by Zheng et al. (2012). The new pGMC has several appealing properties. First, it takes values between 0 and 1. The value is 0 if and only if $Y$ and $W$ are conditionally mean independent given $Z$, and is 1 if and only if $Y$ is equal to a measurable function of $W$ given $Z$. Second, the asymptotic normality of the proposed estimator of the pGMC is derived with the optimal convergence rate, and the associated confidence interval is also constructed.

As an important special case when there is no conditioning random object $Z$, (1.1) becomes

\widetilde{H}_{0}:\,\,\,\mbox{E}(Y|X)=\mbox{E}(Y) \quad \mbox{v.s.} \quad \widetilde{H}_{1}:\,\,\,\mbox{E}(Y|X)\neq\mbox{E}(Y), \qquad (1.2)

where the null hypothesis $\widetilde{H}_{0}$ means the conditional mean of $Y$ given $X$ does not depend on $X$. It is a fundamental testing problem to check the overall significance of covariates $X$ for modeling the mean of the response $Y$ in regression problems. For example, the F-test is the classical and standard procedure to determine the overall significance of predictors in linear models. Without parametric model assumptions, many model checking procedures can also be applied, for instance, Wang and Akritas (2006), González-Manteiga and Crujeiras (2013), and Guo et al. (2016). In high dimensions, many testing procedures, including Goeman et al. (2006), Zhong and Chen (2011), Guo and Chen (2016), and Cui et al. (2018), have been developed to assess the overall significance for linear models or generalized linear models. To accommodate the high dimensionality in the model-free setting, both Zhang et al. (2018) and Li et al. (2023) considered a weaker but necessary null hypothesis

\widetilde{H}_{0}^{\prime}:\,\,\,\mbox{E}(Y|X_{j})=\mbox{E}(Y),\mbox{ almost surely, for all }1\leq j\leq p, \qquad (1.3)

and proposed sum-type test statistics over all marginal tests. In this paper, we also propose a new test of overall significance for $\widetilde{H}_{0}$.

The paper is organized as follows. In section 2, we propose a new significance test for partial mean independence. In section 3, a new partial Generalized Measure of Correlation (pGMC) is introduced. Section 4 considers the important special case of testing the overall significance for $\widetilde{H}_{0}$. Numerical studies are conducted in section 5. In section 6, we present two real data examples. Additional simulation studies and the proofs of theoretical results are presented in the supplementary materials.

2 Partial Mean Independence Test (pMIT)

2.1 Test statistic and its asymptotic distributions

In this section, we consider the partial mean independence test problem (1.1). That is, it aims to check whether $Y$ and $W$ are partially mean independent after controlling for the nonlinear effect of $Z$. We first make a new reformulation of the null hypothesis $H_{0}$ as follows

{H}_{0}^{\prime}:\,\,\,\mbox{E}(\tilde{Y}|X)=\mbox{E}(\tilde{Y}), \qquad (2.1)

where $\tilde{Y}=Y-\mbox{E}(Y|Z)$. Note that $\mbox{E}(\tilde{Y})=\mbox{E}[Y-\mbox{E}(Y|Z)]=0$, and thus the right-hand side of $H^{\prime}_{0}$ actually reduces to 0. Based on this reformulation $H^{\prime}_{0}$, a natural idea is to compare the estimators of $\mbox{E}(\tilde{Y}|X)$ and $\mbox{E}(\tilde{Y})$ via measuring whether $\mbox{E}[\{\mbox{E}(\tilde{Y}|X)-\mbox{E}(\tilde{Y})\}^{2}]$ equals 0. By introducing the sample analog estimator of the zero term $\mbox{E}(\tilde{Y})$, one can avoid the fundamental difficulty that a naive test statistic has a degenerate distribution. Let ${\mathcal{D}}=\{X_{i},Y_{i}\}_{i=1}^{N}$ be a random sample from the population $\{X,Y\}$. We propose a new test statistic based on machine learning methods and a data splitting technique. To be specific, we split the data randomly into two independent parts, denoted by ${\mathcal{D}}_{1}$ and ${\mathcal{D}}_{2}$. Let $|{\mathcal{D}}_{2}|=n_{2}=:n$ and $|{\mathcal{D}}_{1}|=n_{1}=:n^{\gamma}$ with $\gamma>1$, where $n_{1}+n_{2}=n^{\gamma}+n=N$. Without loss of generality, we assume that $n^{\gamma}$ is an integer. We first estimate the conditional mean $h(Z_{i})=\mbox{E}(Y_{i}|Z_{i})$ using machine learning methods based on the first subset of data ${\mathcal{D}}_{1}$, denoted by $\widehat{h}_{{\mathcal{D}}_{1}}(Z_{i})$, and then estimate $g(X_{i})=\mbox{E}(\tilde{Y}_{i}|X_{i})$ by regressing $Y_{i}-\widehat{h}_{{\mathcal{D}}_{1}}(Z_{i})$ on $X_{i}$ using machine learning methods based on ${\mathcal{D}}_{1}$ to obtain its final estimator $\widehat{g}_{{\mathcal{D}}_{1}}(X_{i})$. Then, we construct the test statistic by comparing the estimators of $\mbox{E}(\tilde{Y}|X)$ and $\mbox{E}(\tilde{Y})$ based on ${\mathcal{D}}_{2}$,

T_{n}=\frac{1}{n}\sum_{i\in{\mathcal{D}}_{2}}\left[\widehat{g}_{{\mathcal{D}}_{1}}(X_{i})-\frac{1}{n}\sum_{j\in{\mathcal{D}}_{2}}\left\{Y_{j}-\widehat{h}_{{\mathcal{D}}_{1}}(Z_{j})\right\}\right]^{2}. \qquad (2.2)

To simplify notation, here we use $i\in{\mathcal{D}}_{2}$ to denote that $i$ belongs to the index set corresponding to ${\mathcal{D}}_{2}$. Intuitively, the larger $T_{n}$ is, the stronger the evidence to reject $H_{0}$. We call this test the partial mean independence test (pMIT).
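To make the construction concrete, the following is a minimal Python sketch of $T_{n}$ in (2.2) and its studentized version $V_{n}$ under a single data split. The scikit-learn gradient boosting learner is only an illustrative stand-in for the XGBoost or DNN estimators discussed later, and the function name pmit_statistic is ours.

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def pmit_statistic(Z, W, Y, xi=0.8, seed=0):
    """Sketch of the pMIT statistic T_n in (2.2) based on one data split.

    Z, W are (N, p1) and (N, p2) covariate arrays and Y is the (N,) response;
    xi = n1/N is the splitting ratio. The learners below stand in for any
    machine learning estimator satisfying condition (C1).
    """
    rng = np.random.default_rng(seed)
    N = len(Y)
    idx = rng.permutation(N)
    n1 = int(xi * N)
    d1, d2 = idx[:n1], idx[n1:]
    X = np.hstack([Z, W])

    # Estimate h(Z) = E(Y|Z) on D1.
    h_hat = GradientBoostingRegressor().fit(Z[d1], Y[d1])
    # Estimate g(X) = E(Y - h(Z) | X) by regressing the residual on X, still on D1.
    g_hat = GradientBoostingRegressor().fit(X[d1], Y[d1] - h_hat.predict(Z[d1]))

    # Form T_n on D2 by comparing g_hat(X_i) with the sample mean of Y - h_hat(Z).
    n = len(d2)
    tilde_y = Y[d2] - h_hat.predict(Z[d2])
    T_n = np.mean((g_hat.predict(X[d2]) - np.mean(tilde_y)) ** 2)

    # Studentize: V_n = n * T_n / sigma_hat^2, with sigma_hat^2 estimating E[{Y - h(Z)}^2].
    V_n = n * T_n / np.mean(tilde_y ** 2)
    return T_n, V_n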

Remark 2.1.

First, the data splitting is crucial to control the type-I error in the pMIT procedure. Second, the unbalanced sample splitting with a larger training set for the conditional means is necessary to achieve a tractable limiting null distribution for the significance test (1.1). This observation distinguishes the proposed pMIT test from Cai et al. (2022), which proposed a similar test for (1.1) via data splitting. However, Cai et al. (2022) did not theoretically investigate either the asymptotic distributions or the sample splitting ratio.

To proceed, we make some remarks on the reformulation (2.1). Different from existing approaches that compare the two conditional expectations $\mbox{E}(Y|X)$ and $\mbox{E}(Y|Z)$, we compare one conditional expectation $\mbox{E}(\tilde{Y}|X)$ and one unconditional expectation $\mbox{E}(\tilde{Y})$ of the transformed $\tilde{Y}$. With the unbalanced sample splitting strategy, under the null hypothesis, the estimation errors from $\sum_{i\in{\mathcal{D}}_{2}}\widehat{g}_{{\mathcal{D}}_{1}}^{2}(X_{i})$ are controlled and this term becomes negligible. Meanwhile, with the formulation $H^{\prime}_{0}$, $\sum_{i\in{\mathcal{D}}_{2}}(n^{-1}\sum_{j\in{\mathcal{D}}_{2}}\tilde{Y}_{j})^{2}$ converges to a chi-squared distribution and determines the asymptotic distribution of $T_{n}$, making it valid for inference. If instead we do not estimate $\mbox{E}(\tilde{Y})$ and consider $n^{-1}\sum_{i\in{\mathcal{D}}_{2}}\widehat{g}^{2}_{{\mathcal{D}}_{1}}(X_{i})$ directly, this natural statistic still has the degeneracy issue. Thus, introducing the estimation of $\mbox{E}(\tilde{Y})$ is crucial and is a new complement to existing methods. Essentially, we tackle the degeneracy issue by introducing additional randomness in the error term of $Y$. However, there are two differences between our procedure and that of Dai et al. (2024). Firstly, our procedure does not involve data perturbation and does not require selecting a perturbation size. Secondly, and more subtly, we introduce the randomness in $\tilde{Y}$ directly in the comparison with $g(X)$, while Dai et al. (2024)'s procedure involves data perturbation with the difference of prediction errors of $m(X)$ and $h(Z)$. This seemingly minor change has a major beneficial effect: theoretically, our procedure can detect local alternative hypotheses that converge to the null at a faster rate. Intuitively, our procedure directly detects shifts from $g(X)$, while their procedure focuses on $\mbox{E}[g^{2}(X)]$. Furthermore, our reformulation is general and can be adopted to deal with other conditional mean restriction testing problems. For instance, we may extend our procedure to test conditional independence. Indeed, the conditional independence of $Y$ and $W$ given $Z$ is equivalent to $\mbox{E}[I(Y\leq y)|X]=\mbox{E}[I(Y\leq y)|Z]$ for any $y$. We can then apply our method for each $y$ and integrate the information across different $y$'s to construct the final test statistic.

Then, we derive the asymptotic distributions of the statistic $T_{n}$. Let $\sigma^{2}_{Y|Z}=\mbox{Var}\{Y-\mbox{E}(Y|Z)\}=\mbox{E}[\{Y-h(Z)\}^{2}]$, which can be estimated by $\widehat{\sigma}^{2}_{Y|Z}={n}^{-1}\sum_{i\in{\mathcal{D}}_{2}}\{Y_{i}-\widehat{h}_{{\mathcal{D}}_{1}}(Z_{i})\}^{2}$. We impose the following assumption on the estimation errors of the conditional means obtained from machine learning methods. For some $a\in(0,1)$ and a generic positive constant $c$, we suppose

  • (C1)

    $\mbox{E}[\{\widehat{h}_{{\mathcal{D}}_{1}}(Z)-h(Z)\}^{2}]\leq cn_{1}^{-a}=cn^{-\gamma a}$ and $\mbox{E}[\{\widehat{g}_{{\mathcal{D}}_{1}}(X)-g(X)\}^{2}]\leq cn_{1}^{-a}=cn^{-\gamma a}$.

The first part of (C1) reasonably assumes that machine learning methods are able to estimate the conditional mean accurately with a certain convergence rate of $n_{1}^{-a}$. For example, Bauer and Kohler (2019), Schmidt-Hieber (2020) and Kohler and Langer (2021) studied the high-level estimation errors of estimators of conditional means via DNNs. Specifically, Bauer and Kohler (2019) showed that ${\rm E}[\{\widehat{h}_{{\mathcal{D}}_{1}}(Z)-h(Z)\}^{2}]=O((\log n_{1})^{3}n_{1}^{-2q/(2q+d^{*})})$, where $q$ is the smoothness parameter and $d^{*}$ is the intrinsic dimensionality. From their Corollary 1, the above rate can even be of the order $(\log n_{1})^{3}n_{1}^{-2/3}$. Further, rates of convergence of various boosting procedures have been widely studied. For instance, Bühlmann and Yu (2003) showed that the estimation error achieves the rate of convergence $O(n^{-2q/(2q+1)})$ for one-dimensional regression when adopting smoothing splines as weak learners. Bühlmann (2006) established the consistency of $L_{2}$-Boosting in the high-dimensional setting without stating the rate. Kueck et al. (2023) proved that for the high-dimensional linear regression model, iterated post-$L_{2}$-Boosting and orthogonal $L_{2}$-Boosting achieve the rate of convergence $O(s\log p\log n/n)$ with $s$ being the sparsity level. More discussion can be found after Condition (C2). For the second part of (C1), we note that $g(X_{i})={\rm E}(Y_{i}|X_{i})-{\rm E}(Y_{i}|Z_{i})$ can be estimated by $\widehat{g}_{{\mathcal{D}}_{1}}(X_{i})=\widehat{m}_{{\mathcal{D}}_{1}}(X_{i})-\widehat{h}_{{\mathcal{D}}_{1}}(Z_{i})$, where $\widehat{m}_{{\mathcal{D}}_{1}}(X_{i})$ is an estimator of the conditional mean $m(X_{i})={\rm E}(Y_{i}|X_{i})$. Thus, the second part of (C1) is deduced from the first assumption in (C1) and a similar assumption on $\widehat{m}_{{\mathcal{D}}_{1}}(X_{i})$. In the literature, Williamson et al. (2021) assumed that $\mbox{E}[\{\widetilde{h}_{{\mathcal{D}}_{1}}(Z)-h(Z)\}^{2}]\leq cn_{1}^{-a}$ with $1/2<a$, where $\widetilde{h}(Z)$ is obtained by regressing $\widehat{m}_{{\mathcal{D}}_{1}}(X)$ on $Z$ based on ${\mathcal{D}}_{1}$. Lundborg et al. (2022) also imposed similar conditions on the estimation errors of the relevant estimators in their Assumptions 3-4. In a special case, for the linear regression model with the ordinary least squares estimator, it can be shown that our proposed estimator satisfies $\mbox{E}[\{\widehat{g}_{{\mathcal{D}}_{1}}(X)-g(X)\}^{2}]\leq cn_{1}^{-1}$. Further, for the linear regression model with the LASSO estimator, we show that our proposed estimator satisfies $\mbox{E}[\{\widehat{g}_{{\mathcal{D}}_{1}}(X)-g(X)\}^{2}]\leq cs\log p/n_{1}$ with $s$ being the sparsity level. Thus, in both cases, the second assumption in (C1) holds.

Although we are not restricted to any particular method for estimating $g(X)$, we have found that the proposed one works well in practice. In fact, under the null hypothesis, it generally yields a very small overall estimation error $\sum_{i\in\mathcal{D}_{2}}\widehat{g}^{2}_{{\mathcal{D}}_{1}}(X_{i})$ and thus controls the empirical size very well, while under the alternative hypothesis, as long as over-fitting is not too severe, our procedure can still detect the corresponding alternative. As in Lundborg et al. (2022), we do not claim that our choice is always the best, but we find that it works well in practice and set it as a sensible default.

Next, we derive the asymptotic distributions of $T_{n}$ in the following theorem.

Theorem 2.1.

Assume that $c_{Y}<{\rm{E}}[\{Y-h(Z)\}^{2}]<C_{Y}$ for some positive constants $c_{Y}$ and $C_{Y}$, and that Condition (C1) holds with $\gamma>1/a$. Under the null hypothesis, we have

\displaystyle V_{n}=:n\widehat{\sigma}^{-2}_{Y|Z}T_{n}\overset{d}{\rightarrow}\chi^{2}(1).

Under the local alternative hypotheses $H_{1n}:g(X)=\delta(X)/\sqrt{n}$ with $0<c<{\rm E}[\delta^{2}(X)]<C<\infty$,

\displaystyle V_{n}=n\widehat{\sigma}^{-2}_{Y|Z}T_{n}\overset{d}{\rightarrow}\chi^{2}(1)+\sigma^{-2}_{Y|Z}{\rm E}[\delta^{2}(X)].

Further assume that ${\rm E}[\{m(X)-h(Z)\}^{4}]<\infty$. Under the fixed alternative hypothesis,

\displaystyle\sqrt{n}\left(T_{n}-{\rm E}[\{m(X)-h(Z)\}^{2}]\right)\overset{d}{\rightarrow}N\left(0,{\rm Var}\left[\{m(X)-h(Z)\}^{2}\right]\right).

Theorem 2.1 demonstrates that $T_{n}$ is $n$-consistent under the null hypothesis and that the asymptotic null distribution is the standard chi-squared distribution, which provides a computationally efficient way to compute p-values without any resampling procedure. Under the fixed alternative hypothesis, $T_{n}$ is root-$n$-consistent and asymptotically normal. The theorem also reveals that the proposed test has nontrivial power against local alternative hypotheses that converge to the null at the root-$n$ rate. Although the root-$n$ rate is generally slower than the root-$N$ rate, it is very close to the root-$N$ rate under parametric regression models, where we can take $a=1$ and $\gamma$ can be very close to 1. The condition $\gamma>1/a$ reveals the role of the sample size $n_{1}$ used to estimate $g(X)$. If $g(X)$ can be estimated accurately, then $a$ can be close to 1 and the sample size $n_{1}$ can be reduced. However, it is challenging to determine $\gamma$ from a theoretical perspective, since $a$ is generally unknown and often related to the smoothness of the conditional mean function and the dimensionality. Practically, we suggest achieving unbalanced sample splitting by tuning the ratio of the sample size of $\mathcal{D}_{1}$ to the total sample size, i.e., $\xi=n_{1}/N$. We provide an adaptive procedure for selecting a proper $\xi$ in section 2.2, through which the Type I error can be well controlled without much loss of power.
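For illustration, the $p$-value implied by the chi-squared calibration of Theorem 2.1 is simply the upper tail of $\chi^{2}(1)$ evaluated at $V_{n}$; a minimal sketch using scipy (the helper name is ours):

from scipy.stats import chi2

def pmit_pvalue(V_n):
    # Under H0, V_n converges to chi^2(1); large values are evidence against H0.
    return chi2.sf(V_n, df=1)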

Now suppose that $\mbox{E}[\{\widehat{g}_{{\mathcal{D}}_{1}}(X)-\tilde{g}(X)\}^{2}]=O(n^{-a}_{1})$, where $\tilde{g}(X)$ may not be equal to the true function $g(X)$. As long as $\mbox{E}[\tilde{g}^{2}(X)]>c_{g}$ for some positive constant $c_{g}$ under the alternative hypothesis, we have $V_{n}=nT_{n}/\widehat{\sigma}_{Y|Z}^{2}=\sigma_{Y|Z}^{-2}\sum_{i\in{\mathcal{D}}_{2}}[\tilde{g}(X_{i})-n^{-1}\sum_{j\in{\mathcal{D}}_{2}}\{Y_{j}-h(Z_{j})\}]^{2}+o_{p}(1)$ and thus $V_{n}\rightarrow\infty$. That is, we can still detect the alternative hypothesis even when $\widehat{g}_{{\mathcal{D}}_{1}}(X)$ is not a consistent estimator of $g(X)$.

We further note that Theorem 2.1 can be strengthened to uniform results. Suppose we strengthen Condition (C1) to hold for all distributions $P$ of $(X,Y)$ and assume that $c_{Y}^{\prime}<\mbox{E}|Y-h(Z)|^{2+\tau}<C_{Y}^{\prime}$ for all $P$, where $\tau>0$ and $c_{Y}^{\prime}$, $C_{Y}^{\prime}$ are positive constants. Then, from Lemmas 18-20 in Shah and Peters (2020) (see also Lemmas 16-17 in Lundborg et al. (2022)), we can show that under the null hypothesis, $V_{n}$ converges to the chi-squared distribution uniformly and the test uniformly controls the nominal level $\alpha$.

From Theorem 2.1, in terms of ${\rm E}[g^{2}(X)]$, our procedure can detect local alternative hypotheses that converge to the null at the $n$ rate. Under the assumption that $1/2<a<1$ as in Condition (C2), this rate is faster than the root-$N$ rate and slower than the $N$ rate. In contrast, from Lundborg et al. (2022), the convergence rates of Williamson et al. (2023) and Dai et al. (2024) are generally root-$N$. Although the procedure in Lundborg et al. (2022) can achieve the fastest possible $N$ rate, their theorem is obtained for fixed dimensional $W$ under linear regression models. Furthermore, compared with Lundborg et al. (2022), we establish the asymptotic distribution of the proposed test statistic under the alternative hypothesis within the model-free framework.

In addition, we note that a larger training sample could have a negative impact on the statistical power of our testing procedure. This is the price we pay for dealing with the estimation errors of machine learning methods and the critical degenerate limiting null distribution issue. Similarly, Dai et al. (2024) also required a larger training sample to make their bias-sd-ratio $T_{2}$ negligible. When the alternatives are believed to be complex, machine learning methods including DNNs can be helpful to estimate the conditional means accurately. If the relationship between the response and the covariates is believed to be parametric, parametric methods should be used, and then a larger training sample is not necessary. The loss of power associated with performing a test on a small testing sample could be offset by efficient estimation of the regression functions. In the next section, we also discuss some procedures for power enhancement.

2.2 Power enhancement and algorithm stability

We first present our pMIT procedure in Algorithm 1 based on single data splitting.

Algorithm 1 The pMIT procedure based on single data splitting
  Step 1. For a given splitting ratio $\xi$, i.e., $\xi=n_{1}/N$, construct the pMIT test statistic $T_{n}$ in equation (2.2) and obtain the $p$-value according to Theorem 2.1, denoted as $p_{0}$.
  Step 2. Reject the null hypothesis if $p_{0}<\alpha$, where $\alpha$ is the pre-determined significance level.

In the following, we make some improvements to enhance the empirical testing power. Inspired by the seminal idea of Fan et al. (2015), we first introduce a power-enhanced version of $V_{n}$, $V_{n}^{*}=V_{n}+\tau\sum_{i\in{\mathcal{D}}_{2}}\widehat{g}^{2}_{{\mathcal{D}}_{1}}(X_{i})$ with $\tau>0$. As discussed before, under $H_{0}$, $\sum_{i\in{\mathcal{D}}_{2}}\widehat{g}^{2}_{{\mathcal{D}}_{1}}(X_{i})$ is asymptotically negligible and thus $V_{n}^{*}$ still converges to $\chi^{2}(1)$ under $H_{0}$, which implies that the size can be controlled asymptotically. Meanwhile, under the local alternative hypothesis $H_{1n}:g(X)=\delta(X)/\sqrt{n}$, $\sum_{i\in{\mathcal{D}}_{2}}\widehat{g}^{2}_{{\mathcal{D}}_{1}}(X_{i})\rightarrow\mbox{E}[\delta^{2}(X)]$ and thus $V_{n}^{*}\overset{d}{\rightarrow}\chi^{2}(1)+\sigma^{-2}_{Y|Z}{\rm E}[\delta^{2}(X)]+\tau{\rm E}[\delta^{2}(X)]$. Since $V_{n}^{*}\geq V_{n}$, the power is always enhanced. This reveals that the degeneracy issue can be a blessing in terms of power. The same idea can also be applied to other recently developed test statistics. How to adaptively choose $\tau$ warrants further investigation, since selecting an appropriate value of $\tau$ involves a trade-off between size control and power enhancement: with a larger $\tau$, the power is larger but the empirical size risks being distorted. In practice, we simply take $\tau=1$.
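A minimal sketch of the power-enhanced statistic, assuming $V_{n}$ and the fitted values $\widehat{g}_{{\mathcal{D}}_{1}}(X_{i})$ on ${\mathcal{D}}_{2}$ are available from the sketches above; the helper name is ours.

import numpy as np
from scipy.stats import chi2

def power_enhanced_pvalue(V_n, g_hat_d2, tau=1.0):
    # V_n^* = V_n + tau * sum_{i in D2} g_hat(X_i)^2; the added term is
    # asymptotically negligible under H0, so the chi^2(1) calibration is kept.
    V_star = V_n + tau * np.sum(np.asarray(g_hat_d2) ** 2)
    return V_star, chi2.sf(V_star, df=1)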

Next, we discuss how to select the sample splitting ratio $\xi$ to balance the estimation bias and the test efficiency in Algorithm 1, which can be viewed as a trade-off between Type I and Type II errors. For the selection of $\xi$, Dai et al. (2024) provided an easy-to-use formula as well as a novel data-adaptive procedure. Inspired by Dai et al. (2024), we also consider a data-adaptive splitting procedure and select the optimal $\xi$ by controlling the estimated Type I error on permutation datasets. To be specific, we first randomize the covariates $W$ and obtain the corresponding $p$-values. After that, we estimate the Type I error by the proportion of rejections over $M$ permutations, and select the value of $\xi$ that controls the estimated Type I error under a predetermined significance level. This procedure is presented in Algorithm 2. In the process of searching for the optimal $\xi$ in the candidate set $\Xi$, the search stops once the criterion $\widehat{{\rm Err}}(\xi)\leq\alpha$ is met, for computational efficiency.

Algorithm 2 A data-adaptive procedure to select the sample splitting ratio $\xi$
  Step 1. For a given splitting ratio $\xi$ in the candidate set $\Xi$ (e.g., $\Xi=\{(k-1)/k:2\leq k\leq\kappa\}$, $\kappa=10$), shuffle the covariates $W$, and then apply Algorithm 1 with the splitting ratio $\xi$ to the permuted data and obtain the $p$-value, denoted as $p_{1}$.
  Step 2. Repeat Step 1 $M$ times (e.g., $M=200$). Obtain the set of $p$-values $\{p_{1},p_{2},\cdots,p_{M}\}$ corresponding to the ratio $\xi$. The estimated Type I error is given by
\displaystyle\widehat{{\rm Err}}(\xi)=M^{-1}\sum_{m=1}^{M}\mathbf{1}\left(p_{m}\leq\alpha\right).
  Step 3. Set $\xi^{\star}$ as the smallest value of $\xi$ that controls the estimated Type I error,
\xi^{\star}=\min\left\{\xi\in\Xi:\widehat{{\rm Err}}(\xi)\leq\alpha\right\}.
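A minimal Python sketch of Algorithm 2, reusing the hypothetical helpers pmit_statistic and pmit_pvalue sketched in section 2.1:

import numpy as np

def select_split_ratio(Z, W, Y, alpha=0.05, M=200,
                       candidates=tuple((k - 1) / k for k in range(2, 11)),
                       seed=0):
    """Pick the smallest splitting ratio whose estimated Type I error on
    permuted data is at most alpha; candidates default to {1/2, 2/3, ..., 9/10}."""
    rng = np.random.default_rng(seed)
    for xi in sorted(candidates):
        rejections = 0
        for m in range(M):
            W_perm = W[rng.permutation(len(Y))]          # shuffle W to mimic H0
            _, V_n = pmit_statistic(Z, W_perm, Y, xi=xi, seed=m)
            rejections += pmit_pvalue(V_n) <= alpha
        if rejections / M <= alpha:                      # Err_hat(xi) <= alpha: stop the search
            return xi
    return max(candidates)                               # fall back to the largest ratio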

Finally, the testing performance of the pMIT procedure outlined in Algorithm 1 may be affected by the additional randomness of the data splitting. To improve algorithm stability, we introduce an ensemble testing procedure $\rm pMIT_{M}$ based on multiple data splittings in Algorithm 3. Specifically, we aggregate the $p$-values from different sample splits to obtain an adjusted $p$-value $Q^{*}$. The aggregation of $p$-values is an important problem in statistics; for recent theoretical investigations, see for instance DiCiccio et al. (2020), Vovk and Wang (2020), and Choi and Kim (2023). Instead of combining $p$-values, Guo and Shah (2023) studied the combination of test statistics from data splitting and provided a deep theoretical investigation.

Algorithm 3 The pMIT procedure based on multiple data splitting
  Step 1. Apply Algorithm 1 via a single random data split to the original data and obtain the $p$-value, denoted as $p_{1}$.
  Step 2. Repeat Step 1 $B$ times. Obtain the set of $p$-values $\{p_{1},p_{2},\cdots,p_{B}\}$.
  Step 3. Calculate the adjusted $p$-value $Q^{*}$ by aggregating the $p$-values $\{p_{1},p_{2},\cdots,p_{B}\}$ from Step 2. (Note: different $p$-value aggregation methods are specified in the simulation section.)
  Step 4. Reject $H_{0}$ at significance level $\alpha$ if $Q^{\ast}<\alpha$.
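As one concrete choice of the aggregation in Step 3, the following is a sketch of the Cauchy combination rule of Liu and Xie (2020) with equal weights; the clipping constant is ours.

import numpy as np

def cauchy_combination(pvalues):
    # Q* = 0.5 - arctan( mean_b tan{(0.5 - p_b) * pi} ) / pi; clip to avoid p-values of exactly 0 or 1.
    p = np.clip(np.asarray(pvalues, dtype=float), 1e-15, 1 - 1e-15)
    stat = np.mean(np.tan((0.5 - p) * np.pi))
    return 0.5 - np.arctan(stat) / np.pi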

3 Partial Generalized Measure of Correlation (pGMC)

3.1 Definition and properties

The rejection of $H_{0}$ implies that the sub-vector of covariates $W$ still contributes to the conditional mean of $Y$ after controlling for the nonlinear effect of $Z$. Then, it is of great interest to quantify the partial dependence between $W$ and $Y$ given $Z$. In this section, we propose a new partial dependence measure, called the partial Generalized Measure of Correlation (pGMC), derive its theoretical properties, and construct a confidence interval.

To quantify the partial dependence between $W$ and $Y$ given $Z$, a commonly used method is the partial correlation coefficient, defined as

\rho(Y,W|Z)=\frac{\mbox{E}[\{Y-\mbox{E}(Y|Z)\}\{W-\mbox{E}(W|Z)\}]}{\sqrt{\mbox{E}[\{Y-\mbox{E}(Y|Z)\}^{2}]}\sqrt{\mbox{E}[\{W-\mbox{E}(W|Z)\}^{2}]}},

which equals the Pearson correlation coefficient between the errors $Y-\mbox{E}(Y|Z)$ and $W-\mbox{E}(W|Z)$. It only measures the linear correlation between $Y-\mbox{E}(Y|Z)$ and $W-\mbox{E}(W|Z)$, and it can be zero even when there is explicit partial dependence between $Y$ and $W$ given $Z$. For example, let $W$ and $Z$ be two independent standard normal random variables and set $Y=Z+W^{2}$. In this case, $Y$ and $W$ are perfectly dependent given $Z$, but $\rho(Y,W|Z)=0$. To measure the nonlinear dependence of the conditional mean of $Y$ on $W$ adjusting for $Z$, we consider the following conditional variance decomposition

\displaystyle\mbox{Var}(Y|Z)=\mbox{E}\left\{\mbox{Var}(Y|X)|Z\right\}+\mbox{Var}\left\{\mbox{E}(Y|X)|Z\right\}.

This motivates us to consider the following partial measure

r^{2}(Y,W|Z)=\frac{\mbox{E}[\mbox{Var}\{\mbox{E}(Y|X)|Z\}]}{\mbox{E}[\mbox{Var}(Y|Z)]}=\frac{\mbox{E}[\{\mbox{E}(Y|X)-\mbox{E}(Y|Z)\}^{2}]}{\mbox{E}[\{Y-\mbox{E}(Y|Z)\}^{2}]}=\frac{\mbox{E}[\{m(X)-h(Z)\}^{2}]}{\mbox{E}[\{Y-h(Z)\}^{2}]}, \qquad (3.1)

which can be interpreted as the explained variance of $Y$ by $W$ given $Z$ in a model-free framework. To make $r^{2}(Y,W|Z)$ well defined, it is necessary to assume that the denominator of (3.1) is nonzero; that is, $Y$ is not almost surely equal to a measurable function of $Z$. Otherwise, given $Z$, $Y$ is a constant almost surely and it makes no sense to discuss the contribution of $W$ to the conditional mean of $Y$. When there is no conditioning covariate set $Z$, $r^{2}(Y,W|Z)$ reduces to $\mbox{GMC}(Y|X)=\mbox{Var}\{\mbox{E}(Y|X)\}/\mbox{Var}(Y)$, the Generalized Measure of Correlation (GMC) proposed by Zheng et al. (2012). Thus, the pGMC can be considered an extension of the GMC to measure the partial dependence between $W$ and $Y$ given $Z$.
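To illustrate the definition, consider again the toy example $Y=Z+W^{2}$ with $W$ and $Z$ independent standard normal variables; a short calculation based only on (3.1) gives the pGMC in closed form:

m(X)=\mbox{E}(Y|X)=Z+W^{2},\qquad h(Z)=\mbox{E}(Y|Z)=Z+\mbox{E}(W^{2})=Z+1,

r^{2}(Y,W|Z)=\frac{\mbox{E}[\{m(X)-h(Z)\}^{2}]}{\mbox{E}[\{Y-h(Z)\}^{2}]}=\frac{\mbox{E}[(W^{2}-1)^{2}]}{\mbox{E}[(W^{2}-1)^{2}]}=1,

whereas the partial correlation coefficient $\rho(Y,W|Z)=0$, as noted above.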

Next, we discuss more properties of $r^{2}(Y,W|Z)$ in the following proposition.

Proposition 3.1.

The partial Generalized Measure of Correlation (pGMC) $r^{2}(Y,W|Z)$ defined in (3.1) satisfies the following:

  • 1.

    $r^{2}(Y,W|Z)$ is an index belonging to $[0,1]$. That is, $0\leq r^{2}(Y,W|Z)\leq 1$.

  • 2.

    $r^{2}(Y,W|Z)=0$ if and only if ${\rm E}(Y|X)={\rm E}(Y|Z)$, almost surely. That is, given $Z$, $W$ does not add further information about the conditional mean of $Y$.

  • 3.

    $r^{2}(Y,W|Z)=1$ if and only if $Y={\rm E}(Y|X)$, almost surely. That is, $Y$ is almost surely equal to a measurable function of $W$ given $Z$.

  • 4.

    The relation between $r^{2}(Y,W|Z)$ and the partial correlation coefficient $\rho(Y,W|Z)$ satisfies $r^{2}(Y,W|Z)\geq\rho^{2}(Y,W|Z)$, and $\rho(Y,W|Z)\neq 0$ implies that $r^{2}(Y,W|Z)\neq 0$. If $r^{2}(Y,W|Z)=0$, then $\rho^{2}(Y,W|Z)=0$; if $\rho(Y,W|Z)=\pm 1$, then $r^{2}(Y,W|Z)=1$.

  • 5.

    The relation between $r^{2}(Y,W|Z)$ and ${\rm GMC}(Y|W)$ satisfies

    r^{2}(Y,W|Z)=\frac{{\rm Var}\left\{{\rm E}(Y|X)\right\}-{\rm Var}\left\{{\rm E}(Y|Z)\right\}}{{\rm Var}(Y)-{\rm Var}\left\{{\rm E}(Y|Z)\right\}}=\frac{{\rm GMC}(Y|X)-{\rm GMC}(Y|Z)}{1-{\rm GMC}(Y|Z)}.

    If ${\rm E}(Y|Z)={\rm E}(Y)$, then $r^{2}(Y,W|Z)={\rm GMC}(Y|X)={\rm GMC}(Y|W)$.

  • 6.

    $r^{2}(Y,W|Z)$ can also be rewritten as

    r^{2}(Y,W|Z)=\frac{{\rm E}\left[\{Y-h(Z)\}^{2}\right]-{\rm E}\left[\{Y-m(X)\}^{2}\right]}{{\rm E}\left[\{Y-h(Z)\}^{2}\right]}=1-\frac{{\rm E}\left[\{Y-m(X)\}^{2}\right]}{{\rm E}\left[\{Y-h(Z)\}^{2}\right]}.
  • 7.

    Suppose $Z_{s}$ is a subset of $Z$ and $X=(Z_{s}^{\top},W_{s}^{\top})^{\top}$; then ${\rm GMC}(Y|X)\geq r^{2}(Y,W_{s}|Z_{s})\geq r^{2}(Y,W|Z)$.

In Proposition 3.1, properties (1)-(3) demonstrate that the new measure $r^{2}(Y,W|Z)$ characterizes nonlinear partial dependence: $r^{2}(Y,W|Z)$ is 0 if and only if $Y$ and $W$ are conditionally mean independent given $Z$, and is 1 if and only if $Y$ is equal to a measurable function of $W$ given $Z$. For example, for $Y=Z+W^{2}$ where $W$ and $Z$ are two independent standard normal random variables, $r^{2}(Y,W|Z)=1$. In property (4), $r^{2}(Y,W|Z)\geq\rho^{2}(Y,W|Z)$ further indicates that $r^{2}(Y,W|Z)$ is capable of revealing conditional association between $Y$ and $W$ beyond the linear relation. The relationship between the new partial dependence measure and the GMC of Zheng et al. (2012) is established in property (5). As the GMC can be viewed as a generalization of the classical $R^{2}$ to a nonparametric model, the new partial dependence measure can be viewed as the relative difference between the population $R^{2}$ obtained using the full set of covariates $X$ and that obtained using the subset of covariates $Z$ only. Moreover, when the conditional mean of $Y$ does not depend on $Z$, the new partial dependence measure $r^{2}(Y,W|Z)$ reduces to $\mbox{GMC}(Y|X)$. Thus $r^{2}(Y,W|Z)$ is a natural extension of $\mbox{GMC}(Y|X)$ for partial mean dependence. From property (6), $r^{2}(Y,W|Z)$ measures the relative reduction of the population residual sum of squares when $W$ is added to the model. The last property (7) means that the more conditioning information there is, the less additional information the remaining variables provide about the conditional mean of the response. On the other hand, the partial dependence measure is always bounded by $\mbox{GMC}(Y|X)$.

Remark 3.1.

The pGMC $r^{2}(Y,W|Z)$ is related to the novel conditional dependence measure proposed by Azadkia and Chatterjee (2021). Their measure is defined as

\xi(Y,W|Z)=\frac{\int{\rm E}[{\rm Var}\{{\rm E}(I(Y\geq y)|X)|Z\}]dF_{Y}(y)}{\int{\rm E}[{\rm Var}\{I(Y\geq y)|Z\}]dF_{Y}(y)},

where $F_{Y}(y)$ is the distribution function of $Y$. By comparing the formulas for $r^{2}(Y,W|Z)$ and $\xi(Y,W|Z)$, $r^{2}(Y,W|Z)$ can be considered the conditional mean version of $\xi(Y,W|Z)$. As in Cook and Li (2002) and Shao and Zhang (2014), the conditional mean is of primary interest in many applications such as regression problems. The pGMC provides a nonparametric way to measure the additional contribution of $W$ to the conditional mean of $Y$ given $Z$.

We remark that Williamson et al. (2021) introduced an appealing variable importance measure

\phi(Y,W|Z)=\frac{{\rm E}[\{m(X)-h(Z)\}^{2}]}{{\rm E}[\{Y-{\rm E}(Y)\}^{2}]},

which is indeed the difference between ${\rm GMC}(Y|X)$ and ${\rm GMC}(Y|Z)$. It can measure the importance of $W$ for $Y$. However, it cannot serve as a correlation measure for the partial mean dependence of $Y$ on $W$ conditional on $Z$, because $\phi(Y,W|Z)=1$ if and only if ${\rm GMC}(Y|X)=1$, i.e., $Y={\rm E}(Y|X)$, and ${\rm GMC}(Y|Z)=0$, i.e., ${\rm E}(Y|Z)={\rm E}(Y)$, which requires that the conditional mean of $Y$ does not depend on $Z$. Thus, the interpretation of $\phi(Y,W|Z)=1$ is unclear. The pGMC $r^{2}(Y,W|Z)$ shares the same numerator as $\phi(Y,W|Z)$, which measures the difference between the two conditional means $m(X)$ and $h(Z)$. The difference between $\phi(Y,W|Z)$ and the pGMC lies in the normalization term in the denominator. The pGMC adopts ${\rm E}[\{Y-h(Z)\}^{2}]$ as the normalization term, which makes the pGMC a suitable partial mean dependence measure with a clear interpretation: $r^{2}(Y,W|Z)=1$ if and only if $Y$ is equal to a measurable function of $W$ given $Z$. More interesting properties of the pGMC as a partial mean dependence measure are established in Proposition 3.1. These distinct attributes position our pGMC as a valuable augmentation to the current literature on partial mean dependence measures.

3.2 Estimation and confidence interval

Next, we estimate the pGMC $r^{2}(Y,W|Z)$ via data splitting. We split the data randomly into two independent parts with equal sample sizes, denoted by ${\mathcal{D}}_{1}$ and ${\mathcal{D}}_{2}$, and estimate the conditional means $h(Z)$ and $m(X)$ using ${\mathcal{D}}_{1}$. According to the representation of $r^{2}(Y,W|Z)$ in property (6), we estimate its numerator $\mbox{E}[\{Y-h(Z)\}^{2}]-\mbox{E}[\{Y-m(X)\}^{2}]$ by the following statistic using ${\mathcal{D}}_{2}$,

\displaystyle R_{n}=\frac{1}{n}\sum_{i\in{\mathcal{D}}_{2}}\left\{Y_{i}-\widehat{h}_{{\mathcal{D}}_{1}}(Z_{i})\right\}^{2}-\frac{1}{n}\sum_{i\in{\mathcal{D}}_{2}}\left\{Y_{i}-\widehat{m}_{{\mathcal{D}}_{1}}(X_{i})\right\}^{2}.

To improve the estimation efficiency, we use the cross-fitting method to obtain another estimator $R_{n}^{\prime}$ by swapping the roles of ${\mathcal{D}}_{1}$ and ${\mathcal{D}}_{2}$. Then, we define $R_{n}^{\ast}=(R_{n}+R_{n}^{\prime})/2$ to estimate the numerator of $r^{2}(Y,W|Z)$. We remark that the cross-fitting method has been shown to improve the estimation efficiency asymptotically in Fan et al. (2012), Chernozhukov et al. (2018) and Vansteelandt and Dukes (2022). Similarly, we estimate the denominator $\mbox{E}[\{Y-h(Z)\}^{2}]$ of $r^{2}(Y,W|Z)$ by $\widehat{\sigma}^{2*}_{Y|Z}=(\widehat{\sigma}^{2}_{Y|Z}+\widetilde{\sigma}^{2}_{Y|Z})/2$, where $\widehat{\sigma}^{2}_{Y|Z}={n}^{-1}\sum_{i\in{\mathcal{D}}_{2}}\{Y_{i}-\widehat{h}_{{\mathcal{D}}_{1}}(Z_{i})\}^{2}$ and $\widetilde{\sigma}^{2}_{Y|Z}$ is defined similarly by swapping the roles of ${\mathcal{D}}_{1}$ and ${\mathcal{D}}_{2}$. Thus, the pGMC $r^{2}(Y,W|Z)$ can be estimated by

\displaystyle\widehat{r}^{2}(Y,W|Z)=\frac{R_{n}^{\ast}}{\widehat{\sigma}^{2*}_{Y|Z}}.

It is worth noting that balanced data splitting and cross-fitting are used to estimate the pGMC, which differs from the unbalanced data splitting used for the significance testing procedure in the previous section. This is because, when the partial dependence between $W$ and $Y$ given $Z$ truly exists, i.e., $r^{2}(Y,W|Z)\neq 0$, we can use balanced data splitting to improve the estimation efficiency under the following condition (C2).

  • (C2)

    $\mbox{E}[\{\widehat{m}_{{\mathcal{D}}_{1}}(X)-m(X)\}^{2}]=o(n^{-1/2}_{1})$ and $\mbox{E}[\sigma^{2}(X)\{\widehat{m}_{{\mathcal{D}}_{1}}(X)-m(X)\}^{2}]=o(1)$, where $\epsilon=Y-m(X)$ and $\sigma^{2}(X)=\mbox{E}(\epsilon^{2}|X)$; $\mbox{E}[\{\widehat{h}_{{\mathcal{D}}_{1}}(Z)-h(Z)\}^{2}]=o(n^{-1/2}_{1})$ and $\mbox{E}[\delta^{2}(X)\{\widehat{h}_{{\mathcal{D}}_{1}}(Z)-h(Z)\}^{2}]=o(1)$, where $\eta=Y-h(Z)$ and $\delta^{2}(X)=\mbox{E}(\eta^{2}|X)$.

Here $n_{1}=n_{2}=n$. The assumptions $\mbox{E}[\{\widehat{m}_{{\mathcal{D}}_{1}}(X)-m(X)\}^{2}]=o(n^{-1/2})$ and $\mbox{E}[\{\widehat{h}_{{\mathcal{D}}_{1}}(Z)-h(Z)\}^{2}]=o(n^{-1/2})$ in (C2) are also adopted in Chernozhukov et al. (2018) (equation 3.8), Williamson et al. (2021) (assumption A1) and Vansteelandt and Dukes (2022) (assumptions in Theorem 2). For a full discussion of the estimation of nuisance functions, see Chernozhukov et al. (2018), who showed that the $l_{1}$-penalized and related methods in various sparse models can satisfy the above condition. For the DNN, Bauer and Kohler (2019) showed that ${\rm E}[\{\widehat{m}_{{\mathcal{D}}_{1}}(X)-m(X)\}^{2}]=O((\log n)^{3}n^{-2q/(2q+d^{*})})$, where $q$ is the smoothness parameter and $d^{*}$ is the intrinsic dimensionality. For other recent developments, see also Schmidt-Hieber (2020) and Kohler and Langer (2021). The other assumptions, $\mbox{E}[\sigma^{2}(X)\{\widehat{m}_{{\mathcal{D}}_{1}}(X)-m(X)\}^{2}]=o(1)$ and $\mbox{E}[\delta^{2}(X)\{\widehat{h}_{{\mathcal{D}}_{1}}(Z)-h(Z)\}^{2}]=o(1)$, follow if the conditional variance functions $\sigma^{2}(X)$ and $\delta^{2}(X)$ are bounded above.
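For concreteness, the following is a minimal Python sketch of the cross-fitted estimator $\widehat{r}^{2}(Y,W|Z)$ described in this subsection; the gradient boosting learner is an illustrative stand-in for any estimator satisfying condition (C2), and the function name is ours.

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def pgmc_estimate(Z, W, Y, seed=0):
    """Cross-fitted estimator r_hat^2 = R_n^* / sigma_hat^2* with a balanced split."""
    rng = np.random.default_rng(seed)
    N = len(Y)
    idx = rng.permutation(N)
    d1, d2 = idx[: N // 2], idx[N // 2:]
    X = np.hstack([Z, W])

    def one_fold(train, test):
        # Fit h(Z) = E(Y|Z) and m(X) = E(Y|X) on the training fold.
        h_hat = GradientBoostingRegressor().fit(Z[train], Y[train])
        m_hat = GradientBoostingRegressor().fit(X[train], Y[train])
        eta2 = (Y[test] - h_hat.predict(Z[test])) ** 2   # {Y - h_hat(Z)}^2
        eps2 = (Y[test] - m_hat.predict(X[test])) ** 2   # {Y - m_hat(X)}^2
        return np.mean(eta2) - np.mean(eps2), np.mean(eta2)   # (R_n, sigma_hat^2) on the test fold

    R_a, s_a = one_fold(d1, d2)
    R_b, s_b = one_fold(d2, d1)                          # swap the roles of D1 and D2
    return ((R_a + R_b) / 2) / ((s_a + s_b) / 2)         # R_n^* / sigma_hat^{2*}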

Next, we establish the asymptotic normality of $\widehat{r}^{2}(Y,W|Z)$ in Theorem 3.1.

Theorem 3.1.

Suppose Condition (C2) holds and ${\rm E}[\{Y-h(Z)\}^{4}]<\infty$. When $r^{2}(Y,W|Z)\neq 0$, we have

\sqrt{N}\left\{\widehat{r}^{2}(Y,W|Z)-r^{2}(Y,W|Z)\right\}\overset{d}{\rightarrow}N\left(0,{\rm Var}(\Phi)\right), \qquad (3.2)

where

\Phi=\frac{\left\{m(X)-h(Z)\right\}^{2}+2\epsilon\left\{m(X)-h(Z)\right\}}{\sigma_{Y|Z}^{2}}-\frac{{\rm E}\left[\left\{m(X)-h(Z)\right\}^{2}\right]\eta^{2}}{\sigma_{Y|Z}^{4}}.

As discussed in Remark 3.1, $r^{2}(Y,W|Z)$ is the conditional mean version of $\xi(Y,W|Z)$ introduced in Azadkia and Chatterjee (2021). However, from the above theorem, one can see that the asymptotic properties of the two estimators are different. Azadkia and Chatterjee (2021) only derived the convergence rate of the estimator of $\xi(Y,W|Z)$, which is slightly larger than $N^{-1/p}$ and depends strongly on the dimensions of both $Z$ and $W$. Our convergence rate $N^{-1/2}$ is significantly faster. Besides, we also establish the asymptotic normality, which paves the way to derive the confidence interval for $r^{2}(Y,W|Z)$.

Next, we estimate the asymptotic variance in (3.2) and derive the confidence interval for $r^{2}(Y,W|Z)$. Define

\widehat{m}(X_{i})=\left\{\begin{array}{ll}\widehat{m}_{{\mathcal{D}}_{1}}(X_{i}),&i\in{\mathcal{D}}_{2};\\ \widehat{m}_{{\mathcal{D}}_{2}}(X_{i}),&i\in{\mathcal{D}}_{1},\end{array}\right.\qquad\widehat{h}(Z_{i})=\left\{\begin{array}{ll}\widehat{h}_{{\mathcal{D}}_{1}}(Z_{i}),&i\in{\mathcal{D}}_{2};\\ \widehat{h}_{{\mathcal{D}}_{2}}(Z_{i}),&i\in{\mathcal{D}}_{1},\end{array}\right. \qquad (3.3)

$\widehat{\epsilon}_{i}=Y_{i}-\widehat{m}(X_{i})$ and $\widehat{\eta}_{i}=Y_{i}-\widehat{h}(Z_{i})$. A natural plug-in estimator of $\mbox{Var}(\Phi)$ is given by

\widehat{\mbox{Var}}(\Phi)=\frac{1}{N}\sum_{i=1}^{N}\widehat{\Phi}_{i}^{2}, \qquad (3.4)

where $\widehat{\Phi}_{i}$ is any consistent estimator of $\Phi_{i}$. For example, $\widehat{\Phi}_{i}$ can be taken as $\Phi_{i}$ with $\sigma_{Y|Z}^{2}$, $h(Z_{i})$, $m(X_{i})$, $\epsilon_{i}$, $\eta_{i}$, and ${\rm E}[\{m(X)-h(Z)\}^{2}]$ replaced by $\widehat{\sigma}_{Y|Z}^{2*}$, $\widehat{h}(Z_{i})$, $\widehat{m}(X_{i})$, $\widehat{\epsilon}_{i}$, $\widehat{\eta}_{i}$, and $R_{n}^{\ast}$, respectively. Hence, a confidence interval at confidence level $1-\alpha$ for $r^{2}(Y,W|Z)$ is

\displaystyle\left(\frac{R_{n}^{\ast}}{\widehat{\sigma}_{Y|Z}^{2*}}-z_{\alpha/2}\cdot\sqrt{\frac{\widehat{\mbox{Var}}(\Phi)}{N}},\,\,\frac{R_{n}^{\ast}}{\widehat{\sigma}_{Y|Z}^{2*}}+z_{\alpha/2}\cdot\sqrt{\frac{\widehat{\mbox{Var}}(\Phi)}{N}}\right),

where $z_{\alpha/2}$ is the upper-tailed $\alpha/2$ critical value of the standard normal distribution.
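A hedged sketch of the plug-in variance estimator (3.4) and the resulting interval follows; the inputs m_hat_all, h_hat_all, R_star and sigma2_star are assumed to hold the cross-fitted predictions from (3.3) and the cross-fitted numerator and denominator estimates, and the function name is ours.

import numpy as np
from scipy.stats import norm

def pgmc_confidence_interval(Y, m_hat_all, h_hat_all, R_star, sigma2_star, alpha=0.05):
    """Plug-in (1 - alpha) confidence interval for the pGMC."""
    N = len(Y)
    eps_hat = Y - m_hat_all                               # epsilon_hat_i
    eta_hat = Y - h_hat_all                               # eta_hat_i
    diff = m_hat_all - h_hat_all                          # m_hat(X_i) - h_hat(Z_i)
    # Plug-in version of Phi_i from Theorem 3.1.
    phi_hat = (diff ** 2 + 2 * eps_hat * diff) / sigma2_star \
        - R_star * eta_hat ** 2 / sigma2_star ** 2
    half_width = norm.ppf(1 - alpha / 2) * np.sqrt(np.mean(phi_hat ** 2) / N)
    r2_hat = R_star / sigma2_star
    return r2_hat - half_width, r2_hat + half_width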

4 Special Case: Conditional Mean Independence Test

As an important special case when there is no conditioning random object $Z$, the original testing problem (1.1) becomes

\widetilde{H}_{0}:\,\,\,\mbox{E}(Y|X)=\mbox{E}(Y)\quad\mbox{v.s.}\quad\widetilde{H}_{1}:\,\,\,\mbox{E}(Y|X)\neq\mbox{E}(Y).

$\widetilde{H}_{0}$ indicates that the conditional mean of $Y$ given $X$ does not depend on $X$. It is the fundamental testing problem of checking the overall significance of the covariates $X$ for modeling the mean of the response $Y$, and it should be conducted before any regression modeling. Similar to the pMIT test statistic (2.2), we consider the test statistic comparing the estimators of $\mbox{E}(Y|X)$ and $\mbox{E}(Y)$ based on data splitting and machine learning methods,

\displaystyle T_{1n}=\frac{1}{n}\sum_{i\in{\mathcal{D}}_{2}}\left\{\widehat{m}_{{\mathcal{D}}_{1}}(X_{i})-\bar{Y}_{{\mathcal{D}}_{2}}\right\}^{2},

where $\bar{Y}_{{\mathcal{D}}_{2}}=n^{-1}\sum_{j\in{\mathcal{D}}_{2}}Y_{j}$ is the average of $Y$ over the second part of the data. With the unbalanced sample splitting strategy, $\sum_{i\in{\mathcal{D}}_{2}}\{\widehat{m}_{{\mathcal{D}}_{1}}(X_{i})-\mbox{E}(Y)\}^{2}$ becomes negligible compared with the contribution of $(\bar{Y}_{{\mathcal{D}}_{2}}-\mbox{E}(Y))^{2}$. Denote $\sigma_{Y}^{2}:=\mbox{Var}(Y)$ and $\widehat{\sigma}_{Y}^{2}=N^{-1}\sum_{i=1}^{N}\left(Y_{i}-N^{-1}\sum_{i=1}^{N}Y_{i}\right)^{2}$. We impose a similar estimation error condition on the conditional mean of $Y$ given $X$ estimated via machine learning methods.
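A minimal sketch of $T_{1n}$ and its studentized version, following the same pattern as the pMIT sketch in section 2.1; the learner and the function name are illustrative.

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def overall_significance_statistic(X, Y, xi=0.8, seed=0):
    """T_1n for the special case with no conditioning covariates Z."""
    rng = np.random.default_rng(seed)
    N = len(Y)
    idx = rng.permutation(N)
    n1 = int(xi * N)
    d1, d2 = idx[:n1], idx[n1:]

    m_hat = GradientBoostingRegressor().fit(X[d1], Y[d1])   # estimate m(X) = E(Y|X) on D1
    T_1n = np.mean((m_hat.predict(X[d2]) - np.mean(Y[d2])) ** 2)

    sigma2_Y = np.mean((Y - np.mean(Y)) ** 2)                # sigma_hat_Y^2 over the full sample
    V_1n = len(d2) * T_1n / sigma2_Y                         # compare with chi^2(1)
    return T_1n, V_1n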

  • (C1’)

    $\mbox{E}[\{\widehat{m}_{{\mathcal{D}}_{1}}(X)-m(X)\}^{2}]\leq cn_{1}^{-a}=cn^{-\gamma a}$ for some $0<a<1$.

The above condition (C1') is the first part of Condition (C1) with $h(Z)$ and $\widehat{h}_{{\mathcal{D}}_{1}}(Z)$ replaced by $m(X)$ and $\widehat{m}_{{\mathcal{D}}_{1}}(X)$. The asymptotic distributions of $T_{1n}$ are established in the following corollary.

Corollary 4.1.

Assume that ${\rm E}(Y^{2})<\infty$ and Condition (C1') holds with $\gamma>1/a$. When the null hypothesis holds, we have

\displaystyle n\widehat{\sigma}_{Y}^{-2}T_{1n}\overset{d}{\rightarrow}\chi^{2}(1).

Under the local alternative hypotheses $\widetilde{H}_{1n}:{\rm E}(Y|X)-{\rm E}(Y)=\Delta(X)/\sqrt{n}$ with $0<c<{\rm E}[\Delta^{2}(X)]<C<\infty$,

\displaystyle n\widehat{\sigma}_{Y}^{-2}T_{1n}\overset{d}{\rightarrow}\chi^{2}(1)+\sigma^{-2}_{Y}{\rm E}[\Delta^{2}(X)].

Further, assume that ${\rm E}[m(X)^{4}]<\infty$ and let $\mu={\rm E}(Y)$. When the fixed alternative hypothesis holds,

\displaystyle\sqrt{n}\left(T_{1n}-{\rm E}\left[\{m(X)-\mu\}^{2}\right]\right)\overset{d}{\rightarrow}N\left(0,{\rm Var}\left[\left\{m(X)-\mu\right\}^{2}\right]\right).

Corollary 4.1 demonstrates that the asymptotic null distribution of the standardized test statistic $T_{1n}$ is the standard chi-squared distribution. Asymptotic distributions under local and fixed alternative hypotheses are also derived. We remark that the proposed significance test can be applied to high dimensional settings in a model-free framework. It is more appealing than the methods proposed by Zhang et al. (2018) and Li et al. (2023), which considered sum-type test statistics over all marginal tests under the weaker but necessary null hypothesis (1.3), because the summation of all marginal test statistics ignores the interactions among covariates and can fail for pure interaction models. We will show this comparison via simulations in the supplementary materials. Moreover, we can apply multiple data splitting to improve the algorithm stability; since the procedure is similar to Algorithm 3 in Section 2, we omit the details here.

5 Numerical Studies

In this section, we present some numerical results to assess the finite sample performance of our proposed model-free statistical inference methods. To conserve space, extensive additional simulation results are included in the supplementary materials.

5.1 Partial mean independence test

We consider the following two examples including linear and nonlinear models.

Example A1: We first consider a linear regression

Y_{i}=Z_{i1}+Z_{i2}+\sum_{j=1}^{p_{2}}\theta_{j}W_{ij}+\epsilon_{i},\ \ \ i=1,2,...,N,

where $Z_{i}=(Z_{i1},\ldots,Z_{ip_{1}})^{\top}$ and $W_{i}=(W_{i1},\ldots,W_{ip_{2}})^{\top}$, and we set $p_{1}=p_{2}=25$. For each $i$, the covariate vector $X_{i}=(Z_{i}^{\top},W_{i}^{\top})^{\top}$ is generated from the multivariate normal distribution $N(0,\Sigma)$ with ${\Sigma}_{ij}=0.3^{|i-j|}$ for $i,j=1,2,\cdots,p$. The stochastic error term $\epsilon_{i}$ follows the normal distribution $N(0,0.5^{2})$. We consider three different settings of ${\theta}=(\theta_{1},\ldots,\theta_{p_{2}})^{\top}$: (1) the null hypothesis $H_{0}$: $\theta_{j}=0$ for all $1\leq j\leq p_{2}$; (2) the sparse alternative $H_{1}$: the first two elements of ${\theta}$ are nonzero with equal magnitude and $\|{\theta}\|=1/\sqrt{2}$; (3) the dense alternative $H_{1}$: each element of ${\theta}$ is nonzero with equal magnitude and $\|{\theta}\|=1/\sqrt{2}$.

Example A2: We consider a nonlinear regression

Y_{i}=Z_{i1}+Z_{i2}+\left(\sum_{j=1}^{p_{2}}\theta_{j}W_{ij}\right)^{2}+\epsilon_{i},\ \ \ i=1,2,...,N,

where $X_{i}$ and $\epsilon_{i}$ are generated under the same setting as in Example A1. The response $Y$ depends on $W$ via a quadratic term of a linear combination of $W$. (1) The null hypothesis $H_{0}$ holds when all $\theta_{j}=0$. We consider two slightly different settings of ${\theta}$ from Example A1: (2) the sparse $H_{1}$: the first five elements of ${\theta}$ are nonzero with equal magnitude and $\|{\theta}\|=1/\sqrt{2}$; (3) the dense $H_{1}$: the first half of the elements of ${\theta}$ are nonzero with equal magnitude and $\|{\theta}\|=1/\sqrt{2}$.

In each simulation, we estimate the conditional mean functions using XGBoost (eXtreme Gradient Boosting; Chen and Guestrin, 2016) via the R package xgboost, and a deep neural network via the torch module in Python. For XGBoost, we set eta = 0.1 and nround = 200 as the default hyperparameters. For the neural network, we use a fully connected network with depth L = 3 and width ϖ = 50, a learning rate of 0.003, a batch size of 50, and at most 200 training epochs, with the Adam optimizer for parameter optimization. To guard against under-fitting and over-fitting, we pay attention to the generalization ability of the fitted model and adopt an early stopping strategy, a practical technique that efficiently reduces the generalization error and prevents over-fitting. It determines the number of training epochs in a data-driven way: training stops when the validation loss does not improve within a given patience (a few epochs). In practice, we use the testing data to validate the model.
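The following sketch illustrates how the two conditional-mean learners could be configured with the hyperparameters above. It is illustrative Python code rather than the authors' implementation: the R package xgboost is replaced by its Python counterpart, the function names are ours, and the patience of 10 epochs and the interpretation of depth L = 3 as three hidden layers are our own assumptions.

import copy
import torch
import torch.nn as nn
from xgboost import XGBRegressor

def fit_xgboost(X, y):
    # eta = 0.1 and 200 boosting rounds, as in the text
    model = XGBRegressor(learning_rate=0.1, n_estimators=200)
    model.fit(X, y)
    return model

class MLP(nn.Module):
    # fully connected network, assumed to mean depth L = 3 hidden layers of width 50
    def __init__(self, d_in, width=50, depth=3):
        super().__init__()
        layers, d = [], d_in
        for _ in range(depth):
            layers += [nn.Linear(d, width), nn.ReLU()]
            d = width
        layers.append(nn.Linear(d, 1))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x).squeeze(-1)

def fit_dnn(X, y, X_val, y_val, lr=3e-3, batch=50, max_epochs=200, patience=10):
    X = torch.as_tensor(X, dtype=torch.float32)
    y = torch.as_tensor(y, dtype=torch.float32)
    X_val = torch.as_tensor(X_val, dtype=torch.float32)
    y_val = torch.as_tensor(y_val, dtype=torch.float32)
    model, loss_fn = MLP(X.shape[1]), nn.MSELoss()
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    best, best_state, wait = float("inf"), None, 0
    for _ in range(max_epochs):
        perm = torch.randperm(len(X))
        for i in range(0, len(X), batch):
            idx = perm[i:i + batch]
            opt.zero_grad()
            loss_fn(model(X[idx]), y[idx]).backward()
            opt.step()
        with torch.no_grad():                 # early stopping on validation loss
            val = loss_fn(model(X_val), y_val).item()
        if val < best:
            best, best_state, wait = val, copy.deepcopy(model.state_dict()), 0
        else:
            wait += 1
            if wait >= patience:
                break
    model.load_state_dict(best_state)
    return model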

Recall that we start with unbalanced single data splitting in Algorithm 1. To evaluate the effect of the data splitting ratio, we consider different choices of ξ = n_1/N = 0.5, 0.6, …, 0.9 in the numerical examples. We also implement the easy-to-use formula of Dai et al. (2024) with N_0 = 0.1N defined therein, and the proposed adaptive data splitting procedure in Algorithm 2. From Table LABEL:ratio-A1 in the Supplement, it is interesting to note that under balanced data splitting (ξ = 0.5) the empirical size of pMIT is difficult to control under H_0, which is consistent with our theoretical results in Theorem 2.1. As ξ becomes larger, the empirical sizes get closer to the significance level 0.05, while the test loses power due to the smaller sample size of the second part of the data. We find that both the easy-to-use formula and the data-adaptive splitting procedure work well, with the data-adaptive splitting procedure generally achieving higher power than the easy-to-use formula.

Then, we compare different aggregation methods in the proposed ensemble partial mean independence test, including the multiple random splitting procedure of Meinshausen et al. (2009), the Cauchy combination approach of Liu and Xie (2020), and the rank-transformed subsampling method of Guo and Shah (2023). These combining methods are summarized in Table LABEL:pvalue-combine in the Supplement. In our simulation results, we find that the rank-transformed subsampling method has the highest power, while the multiple random splitting procedure of Meinshausen et al. (2009) is generally conservative. However, the computational cost of the rank-transformed subsampling method is heavy. Thus, in the subsequent simulations we adopt the Cauchy combination approach for its computational efficiency and satisfactory ability to detect signals.
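For reference, a minimal sketch of the (equally weighted) Cauchy combination rule of Liu and Xie (2020), applied to the p-values p_1, …, p_B obtained from B random splits, is given below; the function name is ours.

import numpy as np

def cauchy_combination(pvals):
    pvals = np.asarray(pvals, dtype=float)
    # transform each p-value to the standard Cauchy scale and average
    t = np.mean(np.tan((0.5 - pvals) * np.pi))
    # the standard Cauchy tail probability at t gives the combined p-value
    return 0.5 - np.arctan(t) / np.pi

# e.g., combining p-values from B = 5 random splits
print(cauchy_combination([0.20, 0.03, 0.15, 0.40, 0.08]))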

To proceed, we conduct simulation experiments to assess the finite sample performance of our proposed single data splitting procedure in Algorithm 1, denoted as pMIT; the multiple data splitting (B = 10) in Algorithm 3, denoted as pMIT_M; the power enhanced procedure, denoted as pMITe; and its multiple data splitting version, denoted as pMITe_M. The adaptive data splitting in Algorithm 2 is adopted to determine the splitting ratio ξ. For comparison, we also implement Algorithm 3 of Williamson et al. (2023) via the cv_vim function in the R package vimp with K = 2 (resulting in 4 folds), denoted as vim; the Projected Covariance Measure in Algorithm 1 of Lundborg et al. (2022) with their practically recommended splitting ratio ξ = 1/2, denoted as pcm; the multiple splitting version in Algorithm 2 of Lundborg et al. (2022) with B = 6 splits, denoted as pcm_M; and the one-split test and the combined one-split test of Dai et al. (2024) with their proposed adaptive splitting scheme, denoted as DSP and DSP_M, respectively. For each experiment, we run 500 replications to compute the empirical sizes and powers. The simulation results are presented in Tables 1-2; additional simulations are provided in the Supplement.

Table 1: Comparison of empirical sizes and powers of the partial mean independence tests using XGBoost (XGB) and deep neural networks (DNN) in Example A1 at the significance level α = 0.05.
N pMIT(XGB) pMIT(DNN) pMIT_M(XGB) pMIT_M(DNN) pMITe(XGB) pMITe(DNN) pMITe_M(XGB) pMITe_M(DNN) pcm pcm_M vim DSP DSP_M
H0\rm H_{0} 100 0.032 0.060 0.032 0.076 0.050 0.064 0.060 0.076 0.058 0.000 0.048 0.000 0.000
200 0.062 0.030 0.052 0.060 0.062 0.054 0.056 0.062 0.030 0.000 0.036 0.000 0.000
300 0.038 0.046 0.050 0.050 0.060 0.048 0.054 0.072 0.026 0.000 0.032 0.000 0.000
400 0.050 0.044 0.054 0.050 0.056 0.050 0.066 0.070 0.016 0.000 0.030 0.000 0.000
H1\rm H_{1} sparse 100 0.110 0.136 0.146 0.212 0.284 0.186 0.538 0.250 0.138 0.008 0.308 0.060 0.090
200 0.442 0.492 0.926 0.886 0.740 0.522 0.994 0.908 0.684 0.794 0.880 0.416 0.690
300 0.822 0.738 1.000 1.000 0.984 0.764 1.000 1.000 0.966 1.000 0.962 0.852 1.000
400 0.984 0.818 1.000 1.000 1.000 0.828 1.000 1.000 0.998 1.000 1.000 0.994 1.000
H1\rm H_{1} dense 100 0.076 0.228 0.068 0.410 0.246 0.316 0.470 0.496 0.126 0.000 0.244 0.264 0.384
200 0.210 0.674 0.272 0.982 0.554 0.710 0.930 0.996 0.430 0.280 0.662 0.808 1.000
300 0.492 0.896 0.732 1.000 0.866 0.908 0.998 1.000 0.836 0.956 0.908 0.994 1.000
400 0.804 0.938 0.974 1.000 0.986 0.952 1.000 1.000 0.976 1.000 0.974 1.000 1.000
Table 2: Comparison of empirical sizes and powers of the partial mean independence tests using XGBoost (XGB) and deep neural networks (DNN) in Example A2 at the significance level α = 0.05.
N pMIT(XGB) pMIT(DNN) pMIT_M(XGB) pMIT_M(DNN) pMITe(XGB) pMITe(DNN) pMITe_M(XGB) pMITe_M(DNN) pcm pcm_M vim DSP DSP_M
H0\rm H_{0} 100 0.034 0.044 0.032 0.036 0.042 0.046 0.054 0.050 0.058 0.000 0.048 0.000 0.000
200 0.054 0.046 0.036 0.054 0.060 0.060 0.048 0.076 0.024 0.000 0.036 0.000 0.000
300 0.048 0.060 0.054 0.036 0.052 0.060 0.056 0.074 0.018 0.000 0.058 0.000 0.000
400 0.048 0.038 0.060 0.056 0.056 0.052 0.062 0.076 0.028 0.000 0.050 0.000 0.000
H1\rm H_{1} sparse 100 0.126 0.104 0.270 0.172 0.330 0.198 0.650 0.446 0.074 0.000 0.054 0.020 0.044
200 0.164 0.220 0.398 0.382 0.466 0.296 0.896 0.642 0.088 0.008 0.224 0.078 0.128
300 0.262 0.284 0.538 0.640 0.698 0.326 0.964 0.808 0.234 0.052 0.446 0.336 0.652
400 0.450 0.356 0.840 0.826 0.914 0.410 1.000 0.932 0.442 0.446 0.662 0.742 0.946
H1\rm H_{1} dense 100 0.126 0.114 0.288 0.170 0.368 0.208 0.714 0.464 0.062 0.000 0.050 0.044 0.024
200 0.152 0.222 0.344 0.438 0.428 0.388 0.872 0.730 0.062 0.000 0.100 0.242 0.388
300 0.160 0.376 0.386 0.800 0.518 0.440 0.924 0.876 0.088 0.000 0.198 0.610 0.852
400 0.214 0.496 0.484 0.934 0.652 0.552 0.986 0.956 0.140 0.012 0.294 0.880 1.000

According to the simulation results, we have the following findings. First, the empirical sizes of our proposed tests and of the vim procedure are close to the significance level 0.05, whereas the empirical sizes of pcm_M, DSP and DSP_M are very small. Second, the power enhancement strategy and multiple data splitting effectively improve the empirical power of the proposed pMIT; we therefore recommend the multiple data splitting version of the power enhanced test statistic, pMITe_M. Third, the proposed methods are strong alternatives to existing tests in most cases: the empirical powers of our procedure with power enhancement and multiple data splitting are similar to or even higher than those of the other procedures. For instance, in Example A2, the pcm and vim procedures do not have high power under the dense alternative hypotheses.

In addition, we conduct numerical simulations to assess the impact of different network structures (the depth L and the width ϖ). The corresponding results are provided in Table LABEL:DNN_different, indicating that the performance of the proposed methods is not sensitive to the network structure. Furthermore, we also check via simulations the robustness of the proposed methods in the presence of heterogeneous errors; see Section A.2 in the Supplement.

5.2 Confidence interval for pGMC

In this subsection, we conduct numerical simulations to evaluate the empirical performance of the proposed confidence interval for the pGMC. We consider two models in the following.

Example B1:

Yi=Ziβ+Wiθ+ϵii=1,2,,N,Y_{i}=Z_{i}^{\top}\beta+W_{i}^{\top}\theta+\epsilon_{i}\ \ \ i=1,2,...,N,

where Z_i is a [p/2]-dimensional vector. The covariates Z_i and W_i are independent, and each is generated from N(0, Σ) with Σ_{ij} = ρ^{|i−j|} and ρ = 0.5. Moreover, only the first three elements of β and θ are nonzero with equal magnitude such that ||β|| = ||θ|| = 1. The error term follows the standard normal distribution.

Example B2:

Y_{i}=Z_{i1}+2\sin\left(\frac{W_{i1}}{2}\right)+\epsilon_{i},\ \ \ i=1,2,\ldots,N,

where Z_i is a [p/2]-dimensional vector, and Z_{i1} and W_{i1} denote the first elements of the covariates Z_i and W_i, respectively. The other settings are the same as in Example B1.

To make inference for the pGMC on the simulated data, we build a fully connected neural network with the default settings and apply the distance correlation based feature screening technique (Li et al., 2012) before training. Specifically, we first apply distance correlation based feature screening to the training data to reduce the dimension of the predictors, and then use the reduced predictors to train the neural network. For each setting, 500 realizations are used to compute the coverage probabilities (CP) and the average lengths (AL) of the 95% confidence intervals. The corresponding numerical results are reported in Table 3.
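A minimal sketch of this screening step is given below: it computes the (biased) sample distance correlation between each predictor and the response and keeps the d predictors with the largest values. The implementation and the choice of d are our own illustration, not the authors' code.

import numpy as np

def dcor(x, y):
    # biased sample distance correlation between two univariate samples
    def centered(v):
        d = np.abs(v[:, None] - v[None, :])
        return d - d.mean(axis=0, keepdims=True) - d.mean(axis=1, keepdims=True) + d.mean()
    A, B = centered(np.asarray(x, float)), centered(np.asarray(y, float))
    dcov2 = max((A * B).mean(), 0.0)                 # squared distance covariance
    denom = np.sqrt((A * A).mean() * (B * B).mean())  # product of distance variances
    return np.sqrt(dcov2 / denom) if denom > 0 else 0.0

def dc_screen(X, y, d):
    # indices of the d columns of X with the largest distance correlation with y
    scores = np.array([dcor(X[:, j], y) for j in range(X.shape[1])])
    return np.argsort(scores)[::-1][:d]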

Table 3: The coverage probabilities (CP) and the average lengths (AL) of 95% CIs for pGMC under Example B1 and Example B2.
n_1 n_2 CP(B1, p=100) AL(B1, p=100) CP(B1, p=200) AL(B1, p=200) CP(B2, p=100) AL(B2, p=100) CP(B2, p=200) AL(B2, p=200)
100 100 0.874 0.135 0.860 0.133 0.804 0.139 0.840 0.140
200 200 0.918 0.093 0.882 0.092 0.870 0.095 0.864 0.096
300 300 0.926 0.074 0.912 0.074 0.894 0.077 0.894 0.076
400 400 0.930 0.064 0.926 0.064 0.918 0.066 0.916 0.065
500 500 0.946 0.056 0.942 0.057 0.942 0.058 0.938 0.058

The results in Table 3 strongly corroborate the asymptotic theory. As the sample size N increases, the empirical coverage probabilities approach the nominal level of 95% and the average lengths of the confidence intervals decrease.

6 Real data example

6.1 Significant parts on facial images

Automatic age prediction from facial images has gained substantial relevance in numerous applications, particularly due to the rapid growth of social media; see Rothe et al. (2015), Antipov et al. (2017) and Fang et al. (2019), for instance. In this subsection, we investigate the significance of three key regions of facial photos for human age prediction. The data set is available from www.kaggle.com/datasets/mariafrenti/age-prediction and contains thousands of images, each being a three-channel 128×128-pixel facial photograph, together with the age of the person in the image. Our goal is to make statistical inferences about the significant predictive features of the images for predicting age.

Figure 1: Seven randomly selected original images (the first row). Different parts are covered in three different cases: case 1 - the 2nd row: eyes region (29:57, 29:100), case 2 - the 3rd row: mouth region (74:89, 40:91), case 3 - the 4th row: top-left corner edge region (0:28,0:28).

We randomly select 1000 images and artificially cover different key parts of the images (case 1: eyes; case 2: mouths; case 3: top-left corner edge region); see Figure 1 for an illustration. In each covered image, the covered part is treated as the variables of interest (denoted as W), while the uncovered parts are denoted as Z. We employ a ResNet-18 neural network and add two fully connected layers at the end of the original network for training. The other hyperparameters remain the same as in the simulation experiments. We apply the proposed procedures along with the other existing methods, and the resulting p-values for each case are reported in Table 4.
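A minimal sketch of such a network is given below, assuming torchvision's ResNet-18 as the backbone; the hidden width of 256 is an illustrative choice of ours and is not specified in the text.

import torch.nn as nn
from torchvision.models import resnet18

def build_age_model(hidden=256):
    model = resnet18(weights=None)           # ResNet-18 backbone, trained from scratch
    in_features = model.fc.in_features       # 512 for ResNet-18
    model.fc = nn.Sequential(                # two added fully connected layers
        nn.Linear(in_features, hidden),
        nn.ReLU(),
        nn.Linear(hidden, 1),                # scalar age prediction
    )
    return model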

Table 4: P-values for the facial images dataset.
pMIT pMITM\rm pMIT_{M} pMITe pMITeM\rm pMITe_{M} pcm pcmM\rm pcm_{M} vim DSP DSPM\rm DSP_{M}
Case 1 0.043 0.045 0.036 0.037 0.441 0.386 0.522 0.209 0.522
Case 2 0.046 0.029 0.036 0.024 0.196 0.426 0.627 0.050 0.064
Case 3 0.178 0.195 0.164 0.182 0.142 0.409 0.470 0.616 0.801

While most of the other testing methods fail to detect the two significant regions in Cases 1 and 2, the results in Table 4 show that our pMIT tests consistently reject H_0 in both cases, suggesting that both the eyes and the mouth are significant regions for age prediction in computer vision; this is further visually illustrated in Figure 1. The corresponding 95% confidence intervals of the pGMC for the two regions are (0.036, 0.071) and (0.034, 0.055), respectively. Region 3 (the top-left corner edge region) is recognized as unimportant by all testing procedures, which is consistent with our visual observation.

6.2 Significant gene expression levels

The second data set concerns gene expression and was previously studied in Scheetz et al. (2006) and Huang et al. (2008). It contains 120 twelve-week-old male rats, for which 31,042 different probes from eye tissue were measured. In Chiang et al. (2006), the gene TRIM32 was identified as a cause of Bardet–Biedl syndrome, a genetically heterogeneous disease of multiple organ systems including the retina. Scheetz et al. (2006) excluded the probes that were not expressed in the eye or that lacked sufficient variation, leaving a total of 18,976 probes for analysis. Among these 18,976 probes, Huang et al. (2008) further selected 3,000 probes with large variances. We are interested in the genes whose expression levels have significant effects on the expression of TRIM32.

We first apply our proposed conditional mean independence test to check whether the 3,000 selected probes are significantly predictive of the expression of TRIM32. The p-values of the proposed conditional mean tests are smaller than 0.001, providing strong evidence that the probes have significant effects on predicting the gene expression of TRIM32.

Based on the results in Huang et al. (2008), a total of 19 significant probes were selected by the adaptive group LASSO. We are interested in whether any significant probes have been omitted. Hence, we apply the newly proposed significance test procedures to test whether the remaining 2,981 probes are still significant for the gene expression of TRIM32 given the 19 selected probes. We standardize the data and adopt XGBoost and a DNN with three hidden layers to estimate the conditional means. Other existing procedures are also implemented for comparison. From Table 5, we observe that none of the procedures reject the null hypothesis at the significance level α = 0.05. That is, given the 19 selected probes, the remaining probes do not bring significant additional information for predicting the gene expression of TRIM32. Based on this inference, we conclude that the 19 probes selected by the adaptive group LASSO in Huang et al. (2008) provide useful information for further biological study of TRIM32.

Table 5: The results of the partial mean independence tests for the gene expression data.
pMIT(XGB) pMIT(DNN) pMIT_M(XGB) pMIT_M(DNN) pMITe(XGB) pMITe(DNN) pMITe_M(XGB) pMITe_M(DNN) pcm pcm_M vim DSP DSP_M
p-values 0.931 0.718 0.484 0.245 0.880 0.683 0.438 0.181 0.945 0.350 0.583 0.544 0.538

7 Conclusion

In this paper, we propose a new significance test for the partial mean independence problem based on machine learning methods and data splitting, which is applicable to high dimensional data. The pMIT test statistic converges to the standard chi-squared distribution under the null hypothesis and to a normal distribution under the fixed alternative hypothesis. Power enhancement and algorithm stability based on multiple data splitting are also discussed. When the partial mean independence hypothesis is rejected, we propose a new partial dependence measure, the partial Generalized Measure of Correlation (pGMC), based on a decomposition formula for the conditional variance. We derive its theoretical properties and construct valid confidence intervals. As an important special case with no conditioning covariates Z, we also investigate significance testing of potentially high dimensional covariates X for the conditional mean of Y in a model-free framework. In the future, it would be interesting to extend the proposed ideas to conditional/partial quantile independence and conditional/partial independence problems.

Supplementary materials. Additional simulation results, including the finite sample performance of the conditional mean independence test, and all theoretical proofs are included in the supplementary materials.

References

  • Antipov et al. (2017) Antipov, G., Baccouche, M., Berrani, S.-A., and Dugelay, J.-L. (2017), “Effective training of convolutional neural networks for face-based gender and age prediction,” Pattern Recognition, 72, 15–26.
  • Azadkia and Chatterjee (2021) Azadkia, M. and Chatterjee, S. (2021), “A simple measure of conditional dependence,” The Annals of Statistics, 49, 3070–3102.
  • Bauer and Kohler (2019) Bauer, B. and Kohler, M. (2019), “On deep learning as a remedy for the curse of dimensionality in nonparametric regression,” The Annals of Statistics, 47, 2261–2285.
  • Bühlmann (2006) Bühlmann, P. (2006), “Boosting for high-dimensional linear models,” The Annals of Statistics, 34, 559–583.
  • Bühlmann and Yu (2003) Bühlmann, P. and Yu, B. (2003), “Boosting with the L2 loss: regression and classification,” Journal of the American Statistical Association, 98, 324–339.
  • Cai et al. (2022) Cai, Z., Lei, J., and Roeder, K. (2022), “Model-free prediction test with application to genomics data,” Proceedings of the National Academy of Sciences, 119, e2205518119.
  • Chen and Guestrin (2016) Chen, T. and Guestrin, C. (2016), “XGBoost: A scalable tree boosting system,” in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794.
  • Chernozhukov et al. (2018) Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., and Robins, J. (2018), “Double/debiased machine learning for treatment and structural parameters,” The Econometrics Journal, 21, C1–C68.
  • Chiang et al. (2006) Chiang, A. P., Beck, J. S., Yen, H. J., Tayeh, M. K., Scheetz, T. E., Swiderski, R. E., Nishimura, D. Y., Braun, T. A., Kim, K. Y. A., and Huang, J. (2006), “Homozygosity mapping with SNP arrays identifies TRIM32, an E3 ubiquitin ligase, as a Bardet–Biedl syndrome gene (BBS11),” Proceedings of the National Academy of Sciences of the United States of America, 103.
  • Choi and Kim (2023) Choi, W. and Kim, I. (2023), “Averaging p-values under exchangeability,” Statistics & Probability Letters, 194, 109748.
  • Cook and Li (2002) Cook, R. D. and Li, B. (2002), “Dimension reduction for conditional mean in regression,” The Annals of Statistics, 30, 455–474.
  • Cui et al. (2018) Cui, H., Guo, W., and Zhong, W. (2018), “Test for high-dimensional regression coefficients using refitted cross-validation variance estimation,” The Annals of Statistics, 46, 958–988.
  • Dai et al. (2024) Dai, B., Shen, X., and Pan, W. (2024), “Significance tests of feature relevance for a black-box learner,” IEEE Transactions on Neural Networks and Learning Systems, 35, 1898–1911.
  • Delgado and Gonzáles-Manteiga (2001) Delgado, M. and Gonzáles-Manteiga, W. (2001), “Significance testing in nonparametric regression based on the bootstrap,” The Annals of Statistics, 29, 1469–1507.
  • Dette et al. (2013) Dette, H., Siburg, K. F., and Stoimenov, P. A. (2013), “A copula-based non-parametric measure of regression dependence,” Scandinavian Journal of Statistics, 40, 21–41.
  • DiCiccio et al. (2020) DiCiccio, C. J., DiCiccio, T. J., and Romano, J. P. (2020), “Exact tests via multiple data splitting,” Statistics & Probability Letters, 166, 108865.
  • Fan et al. (2012) Fan, J., Guo, S., and Hao, N. (2012), “Variance estimation using refitted cross-validation in ultrahigh dimensional regression,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), 74, 37–65.
  • Fan et al. (2015) Fan, J., Liao, Y., and Yao, J. (2015), “Power enhancement in high-dimensional cross-sectional tests,” Econometrica, 83, 1497–1541.
  • Fan and Li (1996) Fan, Y. and Li, Q. (1996), “Consistence model specification tests: omitted variables and semiparametric functional forms,” Econometrica, 64, 865–890.
  • Fang et al. (2019) Fang, J., Yuan, Y., Lu, X., and Feng, Y. (2019), “Muti-stage learning for gender and age prediction,” Neurocomputing, 334, 114–124.
  • Gan et al. (2022) Gan, L., Zheng, L., and Allen, G. I. (2022), “Inference for Interpretable Machine Learning: Fast, Model-Agnostic Confidence Intervals for Feature Importance,” arXiv preprint arXiv:2206.02088.
  • Goeman et al. (2006) Goeman, J. J., Van De Geer, S. A., and Van Houwelingen, H. C. (2006), “Testing against a high dimensional alternative,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68, 477–493.
  • González-Manteiga and Crujeiras (2013) González-Manteiga, W. and Crujeiras, R. M. (2013), “An updated review of goodness-of-fit tests for regression models,” Test, 22, 361–411.
  • Guo and Chen (2016) Guo, B. and Chen, S. X. (2016), “Tests for high dimensional generalized linear models,” Journal of the Royal Statistical Society Series B: Statistical Methodology, 78, 1079–1102.
  • Guo and Shah (2023) Guo, F. R. and Shah, R. D. (2023), “Rank-transformed subsampling: inference for multiple data splitting and exchangeable p-values,” arXiv preprint arXiv:2301.02739.
  • Guo et al. (2016) Guo, X., Wang, T., and Zhu, L. (2016), “Model checking for parametric single-index models: a dimension reduction model-adaptive approach,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), 78, 1013–1035.
  • Huang et al. (2008) Huang, J., Ma, S., and Zhang, C. H. (2008), “Adaptive LASSO for sparse high-dimensional regression,” Statistica Sinica, 18, 1603–1618.
  • Kohler and Langer (2021) Kohler, M. and Langer, S. (2021), “On the rate of convergence of fully connected deep neural network regression estimates,” The Annals of Statistics, 49, 2231–2249.
  • Kueck et al. (2023) Kueck, J., Luo, Y., Spindler, M., and Wang, Z. (2023), “Estimation and inference of treatment effects with L2-boosting in high-dimensional settings,” Journal of Econometrics, 234, 714–731.
  • Lei et al. (2018) Lei, J., G’Sell, M., Rinaldo, A., Tibshirani, R. J., and Wasserman, L. (2018), “Distribution-free predictive inference for regression,” Journal of the American Statistical Association, 113, 1094–1111.
  • Li et al. (2023) Li, R., Xu, K., Zhou, Y., and Zhu, L. (2023), “Testing the effects of high-dimensional covariates via aggregating cumulative covariances,” Journal of the American Statistical Association, 118, 2184–2194.
  • Li et al. (2012) Li, R., Zhong, W., and Zhu, L. (2012), “Feature screening via distance correlation learning,” Journal of the American Statistical Association, 107, 1129–1139.
  • Liu and Xie (2020) Liu, Y. and Xie, J. (2020), “Cauchy combination test: a powerful test with analytic p-value calculation under arbitrary dependency structures,” Journal of the American Statistical Association, 115, 393–402.
  • Lundborg (2023) Lundborg, A. R. (2023), “Modern Methods for Variable Significance Testing,” Ph.D. thesis, University of Cambridge.
  • Lundborg et al. (2022) Lundborg, A. R., Kim, I., Shah, R. D., and Samworth, R. J. (2022), “The Projected Covariance Measure for assumption-lean variable significance testing,” arXiv preprint arXiv:2211.02039.
  • Meinshausen et al. (2009) Meinshausen, N., Meier, L., and Bühlmann, P. (2009), “P-values for high-dimensional regression,” Journal of the American Statistical Association, 104, 1671–1681.
  • Park et al. (2015) Park, T., Shao, X., and Yao, S. (2015), “Partial martingale difference correlation,” Electronic Journal of Statistics, 9, 1492–1517.
  • Rothe et al. (2015) Rothe, R., Timofte, R., and Van Gool, L. (2015), “Dex: Deep expectation of apparent age from a single image,” in Proceedings of the IEEE international conference on computer vision workshops, pp. 10–15.
  • Scheetz et al. (2006) Scheetz, T. E., Kim, K., Swiderski, R. E., Philp, A. R., Braun, T. A., Knudtson, K. L., Dorrance, A. M., Dibona, G. F., Jian, H., and Casavant, T. L. (2006), “Regulation of gene expression in the mammalian eye and its relevance to eye disease,” Proceedings of the National Academy of Sciences of the United States of America, 103, 14429–14434.
  • Schmidt-Hieber (2020) Schmidt-Hieber, J. (2020), “Nonparametric regression using deep neural networks with ReLU activation function,” The Annals of Statistics, 48, 1875–1897.
  • Shah and Peters (2020) Shah, R. D. and Peters, J. (2020), “The hardness of conditional independence testing and the generalised covariance measure,” The Annals of Statistics, 48, 1514–1538.
  • Shao and Zhang (2014) Shao, X. and Zhang, J. (2014), “Martingale difference correlation and its use in high-dimensional variable screening,” Journal of the American Statistical Association, 109, 1302–1318.
  • Székely and Rizzo (2014) Székely, G. and Rizzo, M. (2014), “Partial distance correlation with methods for dissimilarities,” The Annals of Statistics, 42, 2382–2412.
  • Tansey et al. (2022) Tansey, W., Veitch, V., Zhang, H., Rabadan, R., and Blei, D. M. (2022), “The holdout randomization test for feature selection in black box models,” Journal of Computational and Graphical Statistics, 31, 151–162.
  • Vansteelandt and Dukes (2022) Vansteelandt, S. and Dukes, O. (2022), “Assumption-lean inference for generalised linear model parameters,” Journal of the Royal Statistical Society Series B: Statistical Methodology, 84, 657–685.
  • Verdinelli and Wasserman (2023) Verdinelli, I. and Wasserman, L. (2023), “Feature Importance: A Closer Look at Shapley Values and LOCO,” arXiv preprint arXiv:2303.05981.
  • Verdinelli and Wasserman (2024) Verdinelli, I. and Wasserman, L. (2024), “Decorrelated variable importance,” Journal of Machine Learning Research, 25, 1–27.
  • Vovk and Wang (2020) Vovk, V. and Wang, R. (2020), “Combining p-values via averaging,” Biometrika, 107, 791–808.
  • Wang and Akritas (2006) Wang, L. and Akritas, M. G. (2006), “Testing for covariate effects in the fully nonparametric analysis of covariance model,” Journal of the American Statistical Association, 101, 722–736.
  • Wang et al. (2015) Wang, X., Pan, W., Hu, W., Tian, Y., and Zhang, H. (2015), “Conditional distance correlation,” Journal of the American Statistical Association, 110, 1726–1734.
  • Williamson et al. (2021) Williamson, B. D., Gilbert, P. B., Carone, M., and Simon, N. (2021), “Nonparametric variable importance assessment using machine learning techniques,” Biometrics, 77, 9–22.
  • Williamson et al. (2023) Williamson, B. D., Gilbert, P. B., Simon, N. R., and Carone, M. (2023), “A general framework for inference on algorithm-agnostic variable importance,” Journal of the American Statistical Association, 118, 1645–1658.
  • Zhang and Janson (2020) Zhang, L. and Janson, L. (2020), “Floodgate: inference for model-free variable importance,” arXiv preprint arXiv:2007.01283.
  • Zhang et al. (2018) Zhang, X., Yao, S., and Shao, X. (2018), “Conditional mean and quantile dependence testing in high dimension,” The Annals of Statistics, 46, 219–246.
  • Zheng et al. (2012) Zheng, S., Shi, N.-Z., and Zhang, Z. (2012), “Generalized measures of correlation for asymmetry, nonlinearity, and beyond,” Journal of the American Statistical Association, 107, 1239–1252.
  • Zhong and Chen (2011) Zhong, P.-S. and Chen, S. X. (2011), “Tests for high-dimensional regression coefficients with factorial designs,” Journal of the American Statistical Association, 106, 260–274.
  • Zhu and Zhu (2018) Zhu, X. and Zhu, L. (2018), “Dimension reduction-based significance testing in nonparametric regression,” Electronic Journal of Statistics, 12, 1468–1506.