
The conditionally studentized test for high-dimensional parametric regressions

Feng Liang1, Chuhan Wang1, Jiaqi Huang2 and Lixing Zhu1,3
All the authors contributed equally to this research and the names are in seniority order. Jiaqi Huang is the corresponding author. Email address: [email protected] (J. Huang).
1 Center for Statistics and Data Science, Beijing Normal University, Zhuhai, China
2 School of Statistics, Beijing Normal University, Beijing, China
3 Department of Mathematics, Hong Kong Baptist University, Hong Kong
Abstract

This paper studies model checking for general parametric regression models having no dimension reduction structure on the predictor vector. Using any U-statistic type test as an initial test, this paper combines the sample-splitting and conditional studentization approaches to construct a COnditionally Studentized Test (COST). Whether the initial test is global or local smoothing-based, and whether the dimension of the predictor vector and the number of parameters are fixed or diverge at certain rates, the proposed test always has a normal weak limit under the null hypothesis. When the dimension of the predictor vector diverges to infinity at a faster rate than the number of parameters, or even than the sample size, these results still hold under some conditions. This shows the potential of our method to handle higher-dimensional problems. Further, the test can detect local alternatives distinct from the null hypothesis at the fastest possible rate of convergence in hypothesis testing. We also discuss the optimal sample splitting for power performance. The numerical studies offer information on its merits and limitations in finite sample cases, including the setting where the dimension of the predictor vector equals the sample size. As a generic methodology, it could be applied to other testing problems.


Keywords: Asymptotic model-free test; conditional studentization; high dimensions; model checking; sample-splitting.

1 Introduction

Consider the general parametric regression model:

Y=g(\boldsymbol{X};\boldsymbol{\theta})+\varepsilon,

where g is a known function of (\boldsymbol{X},\boldsymbol{\theta}) and \boldsymbol{\theta}\in\boldsymbol{\Theta}\subset\mathbb{R}^{p} is an unknown parameter vector. \boldsymbol{X} is a predictor vector in \mathbb{R}^{q} and Y is the univariate response, where \mathbb{R}^{p} (\mathbb{R}^{q}) stands for the p- (q-)dimensional Euclidean space. For many models, such as linear and generalized linear models, p=q, but in general this may not be the case. As is well known, when an assumed model is used, model checking is necessary before further analyzing the data. Specifically, the null hypothesis is, for a subset \boldsymbol{\Theta}\subset\mathbb{R}^{p},

H_{0}: \Pr\{E(Y|\boldsymbol{X})=g(\boldsymbol{\theta}_{0},\boldsymbol{X})\}=1 \text{ for some } \boldsymbol{\theta}_{0}\in\boldsymbol{\Theta} \qquad (1.1)

versus the alternative hypothesis

H_{1}: \Pr\{E(Y|\boldsymbol{X})=g(\boldsymbol{\theta},\boldsymbol{X})\}<1 \text{ for all } \boldsymbol{\theta}\in\boldsymbol{\Theta}. \qquad (1.2)

Relevant problems have been investigated intensively in the literature. Most existing methods are for cases with fixed dimensions p and q. Examples include nonparametric estimation-based local smoothing tests such as Härdle and Mammen (1993), Zheng (1996), Zhu and Li (1998), Lavergne and Patilea (2008, 2012), Guo et al. (2016), and empirical process-based global smoothing tests, e.g. Stute and Zhu (2002), Zhu (2003), Escanciano (2006), and Stute et al. (2008). These two general methodologies have their respective advantages and limitations. Local smoothing tests often have tractable weak limits under the null hypothesis, but most of them can only detect local alternatives distinct from the null hypothesis at rates of convergence slower than 1/\sqrt{n}, where n is the sample size. As the rates can be even slower in higher-dimensional cases, the curse of dimensionality is a big challenge. In contrast, the limiting null distributions of global smoothing tests are usually intractable, resorting to resampling approximations to determine critical values, but they can detect local alternatives distinct from the null hypothesis at the fastest possible rate 1/\sqrt{n} of convergence in hypothesis testing. The dimensionality is still an issue as this type of test involves high-dimensional empirical processes.

For paradigms with large dimensions p and q that may diverge as the sample size goes to infinity, there are few methods available in the literature. Some relevant references for models with dimension reduction structures are as follows. Shah and Bühlmann (2018) and Janková et al. (2020) respectively proposed goodness-of-fit tests for high-dimensional sparse linear and generalized linear models with fixed designs. For problems with random designs and without sparse structures, Tan and Zhu (2019) and Tan and Zhu (2022) considered adaptive-to-model tests for high-dimensional single-index and multi-index models, respectively, in which the number of linear combinations of \boldsymbol{X} is fixed, with diverging dimensions p\geq q. Their methods extend the test first proposed by Guo et al. (2016) in fixed-dimension cases for multi-index models. These three tests critically hinge on the dimension reduction structures of the predictors. Otherwise, as Tan and Zhu (2022) discussed, they fail to work because the limiting distributions under the null and alternative hypotheses degenerate to constants and cannot be used to determine critical values, and resampling approximations cannot work well either.

The current paper proposes a COnditionally Studentized Test (COST) for general parametric models without dimension reduction structures in high-dimensional cases. The basic idea for constructing this novel test is that, based on an initial test that can be rewritten as a U-statistic (either a local or a global smoothing test), we divide the sample of size n into two subsamples of sizes n_{1} and n_{2}, and use the conditional studentization approach to construct the final test. The dimension p can diverge at a rate with leading term n_{1}^{1/3}, corresponding to the rate n^{1/3} that Tan and Zhu (2022) achieved. Further, the restriction q\leq p is no longer necessary. That is, the dimension q of the predictor vector can be higher than p, and even than the sample size n, under some regularity conditions on the regression and related functions. Note that the conditions in this paper are not imposed on the significance of the predictors in the regression function. Thus, understandably, when q is large, the conditions could rather stringently restrict the forms of the related functions, but the results still show the potential of our method in higher-dimensional settings. The details will be presented in Section 2, and Remark 1 in that section gives further explanations. Section 4 reports some numerical studies that check the performance of the test when q is larger than p and equal to n. As the conclusions are similar, some other studies with q>n are not reported, to save space. The conditions are given in Section 6. We also discuss the optimal sample splitting between the sizes n_{1} and n_{2} for the power performance of the test. The following merits of the novel test are worth mentioning. Under rather general parametric model structures, whether the initial test is global or local smoothing-based, and whether the dimensions are fixed or diverge at certain rates, the final test always has a normal weak limit under the null hypothesis, which is often a merit of local smoothing tests; and it can detect local alternatives distinct from the null at a rate as close to 1/\sqrt{n_{1}} (n_{1}/n\to C for a given positive constant C) as possible, which is the typical optimal rate global smoothing tests can achieve. These unique features are very different from any existing test: other than being able to handle high-dimensional problems, the test also enjoys the advantages of both local and global smoothing tests. On the other hand, the test statistic converges to its weak limit at the rate 1/\sqrt{n_{1}} rather than 1/\sqrt{n}, so it may lose some power in theory. We will discuss this limitation in more detail later.

The rest of the paper is organized as follows. Section 2 describes the test statistic construction. Section 3 presents the asymptotic properties of the test statistic under the null and alternative hypotheses, and an investigation of the optimal sample-splitting scheme. Section 4 contains numerical studies including simulations and a real data analysis. To examine the performance of the test, the simulation studies include settings favoring the existing method in the literature; settings where the condition on the number of parameters is violated; and settings with the dimension of the predictor vector much larger than the number of parameters, even equal to the sample size. Section 5 comments on the advantages and limitations of the method. It discusses why, in our setting, we do not apply the commonly used cross-fitting approach for power enhancement, and briefly discusses the challenge of extending or modifying our method to handle models of higher dimensions with sparse structures. Section 6 states the regularity conditions with some remarks. As the proofs of the main results are technically demanding and lengthy, we put them in the Supplementary Material.

2 Test statistic construction

As the test construction requires estimating the parameter in the model, we first briefly give the details.

2.1 Notation and parameter estimation

Write the underlying regression function as m(\boldsymbol{x})=E(Y|\boldsymbol{X}=\boldsymbol{x}) and the error as \varepsilon=Y-m(\boldsymbol{X}). To study the power performance, we also consider a sequence of alternative hypotheses:

H_{1n}: Y=g(\boldsymbol{\theta}_{0},\boldsymbol{X})+\delta_{n}l(\boldsymbol{X})+\varepsilon. \qquad (2.1)

When \delta_{n}\to 0 as n\to\infty, (2.1) corresponds to a sequence of local alternatives. When \delta_{n} is a fixed constant, (2.1) reduces to the global alternative model (1.2).

Set

πœ½βˆ—=arg⁑minπœ½βˆˆπš―β€‹E​{Yβˆ’g​(𝜽,𝑿)}2=arg⁑minπœ½βˆˆπš―β€‹E​{m​(𝑿)βˆ’g​(𝜽,𝑿)}2.\displaystyle\boldsymbol{\theta}^{*}=\underset{\boldsymbol{\theta}\in\boldsymbol{\Theta}}{\arg\min}E\left\{Y-g(\boldsymbol{\theta},\boldsymbol{X})\right\}^{2}=\underset{\boldsymbol{\theta}\in\boldsymbol{\Theta}}{\arg\min}E\left\{m(\boldsymbol{X})-g(\boldsymbol{\theta},\boldsymbol{X})\right\}^{2}. (2.2)

Under the null hypothesis in (1.1) and the regularity Condition 1 specified in Section 6, \boldsymbol{\theta}^{*}=\boldsymbol{\theta}_{0}. Under the alternatives, \boldsymbol{\theta}^{*} typically depends on the distribution of \boldsymbol{X}. To save space, redefine l(\boldsymbol{X})=m(\boldsymbol{X})-g(\boldsymbol{\theta}^{*},\boldsymbol{X}) under the global alternative hypothesis with fixed \delta_{n}=1, while we still write l(\boldsymbol{X})=m(\boldsymbol{X})-g(\boldsymbol{\theta}_{0},\boldsymbol{X}) under the local alternative hypothesis.

We next introduce additional notation used throughout the paper. Denote \dot{g}(\boldsymbol{\theta},\boldsymbol{X})=\partial g(\boldsymbol{\theta},\boldsymbol{X})/\partial\boldsymbol{\theta} and \ddot{g}(\boldsymbol{\theta},\boldsymbol{X})=\partial\dot{g}(\boldsymbol{\theta},\boldsymbol{X})/\partial\boldsymbol{\theta}^{\top}. Under the null hypothesis and the local alternative hypothesis, we define \boldsymbol{\Sigma}=E\{\dot{g}(\boldsymbol{\theta}_{0},\boldsymbol{X})\dot{g}(\boldsymbol{\theta}_{0},\boldsymbol{X})^{\top}\}. Under the global alternatives, without confusion, we define \boldsymbol{\Sigma}=E\{\dot{g}(\boldsymbol{\theta}^{*},\boldsymbol{X})\dot{g}(\boldsymbol{\theta}^{*},\boldsymbol{X})^{\top}\}-E[\{m(\boldsymbol{X})-g(\boldsymbol{\theta}^{*},\boldsymbol{X})\}\ddot{g}(\boldsymbol{\theta}^{*},\boldsymbol{X})] and \boldsymbol{\Sigma}_{*}=E\{\dot{g}(\boldsymbol{\theta}^{*},\boldsymbol{X})\dot{g}(\boldsymbol{\theta}^{*},\boldsymbol{X})^{\top}\}. Use \|\cdot\| to represent the L_{2} norm. Write the conditional expectation E(A|B=b) as E_{B}(A) for any random variable A and random variable/vector B.

The least squares estimator of 𝜽\boldsymbol{\theta} is defined by

\boldsymbol{\hat{\theta}}=\arg\min_{\boldsymbol{\theta}\in\boldsymbol{\Theta}}\sum_{i=1}^{n}\{Y_{i}-g(\boldsymbol{\theta},\boldsymbol{X}_{i})\}^{2}. \qquad (2.3)

Under the regularity Conditions 1-5 in Section 6, the convergence rate and the asymptotically linear representation of \boldsymbol{\hat{\theta}} can be derived, which are important for studying the asymptotic properties of the test statistic under the null and alternative hypotheses. We state them as three lemmas in the Supplementary Material: the first two lemmas are Theorems 1 and 2, and the third lemma is an extension of Theorem 4, in Tan and Zhu (2022).
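To fix ideas, here is a minimal sketch of computing the least squares estimator in (2.3) with a generic mean function; the function names and the linear-model example are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from scipy.optimize import least_squares

def fit_theta(g, theta_init, X, Y):
    """Least squares estimate of theta for Y = g(theta, X) + error, as in (2.3)."""
    # scipy minimizes the sum of squared residuals e_i(theta) = Y_i - g(theta, X_i).
    return least_squares(lambda th: Y - g(th, X), theta_init).x

# Illustrative use on a linear model g(theta, x) = x^T theta (so p = q here).
rng = np.random.default_rng(0)
n, p = 200, 5
X = rng.standard_normal((n, p))
Y = X @ (np.ones(p) / np.sqrt(p)) + rng.standard_normal(n)
theta_hat = fit_theta(lambda th, X: X @ th, np.zeros(p), X, Y)
```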

2.2 The motivation

In the literature, several existing tests have a similar structure, with a weight function W_{n}(\cdot,\cdot) depending on the sample size n such that a test statistic can be written, before standardization, as \sum_{i=1}^{n}\sum_{j\neq i}\hat{e}_{i}\hat{e}_{j}W_{n}(\boldsymbol{X}_{i},\boldsymbol{X}_{j})/n(n-1), where \hat{e}_{i}=Y_{i}-g(\boldsymbol{\hat{\theta}},\boldsymbol{X}_{i}) is the residual. Some classic tests are as follows. Bierens (1982) proposed the integrated conditional moment (ICM) test based on the weight function \exp(-\|\boldsymbol{X}_{1}-\boldsymbol{X}_{2}\|^{2}/2); Escanciano (2009) discussed tests with general weight functions; Li et al. (2019) used the weight function 1/\sqrt{\|\boldsymbol{X}_{1}-\boldsymbol{X}_{2}\|^{2}+1}, induced by an idea bridging local and global smoothing tests. Other tests can be written or approximately written as U-statistics, including the local smoothing tests proposed by Härdle and Mammen (1993) and Zheng (1996), and the global smoothing tests suggested by Stute et al. (1998a, b), Tan and Zhu (2019) and Tan and Zhu (2022). However, these tests do not apply to cases with diverging dimension p without dimension reduction structures.

We now construct a novel test by combining the sample-splitting and conditional studentization approaches. The following observations motivate the test construction. Let e=Y-g(\boldsymbol{\theta}_{0},\boldsymbol{X}). Under the null hypothesis, m(\boldsymbol{X})=g(\boldsymbol{\theta}_{0},\boldsymbol{X}) and e=\varepsilon. Note that E(e|\boldsymbol{X})=0 under the null hypothesis, so that E\{eE(e|\boldsymbol{X})f(\boldsymbol{X})\}=E\{E^{2}(e|\boldsymbol{X})f(\boldsymbol{X})\}=0, whereas this quantity is greater than zero under the alternatives. More generally, when we use E_{\boldsymbol{X}}\{eW_{n}(\boldsymbol{X},\boldsymbol{X}_{2})\}, with, say, a kernel function K\{(\boldsymbol{X}-\boldsymbol{X}_{2})/h\}/h^{q} in lieu of W_{n}(\boldsymbol{X},\boldsymbol{X}_{2}), as an approximation of E(e|\boldsymbol{X})f(\boldsymbol{X}), this quantity can be approximated by E\{e_{1}e_{2}W_{n}(\boldsymbol{X}_{1},\boldsymbol{X}_{2})\}. For brevity, we abbreviate W_{n}(\boldsymbol{X}_{i},\boldsymbol{X}_{j}) as w_{ij}. Thus, in general, this quantity can be estimated by \frac{1}{n}\sum_{i=1}^{n}\hat{e}_{i}\left(\frac{1}{n-1}\sum_{j=1,j\neq i}^{n}\hat{e}_{j}w_{ij}\right)=\frac{1}{n(n-1)}\sum_{i=1}^{n}\sum_{j=1,j\neq i}^{n}\hat{e}_{i}\hat{e}_{j}w_{ij}, which can be used to define a non-standardized statistic:

U_{n}=\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\hat{e}_{i}\left(\frac{1}{\sqrt{n-1}}\sum_{j=1,j\neq i}^{n}\hat{e}_{j}w_{ij}\right)=\frac{1}{\sqrt{n(n-1)}}\sum_{i=1}^{n}\sum_{j=1,j\neq i}^{n}\hat{e}_{i}\hat{e}_{j}w_{ij}. \qquad (2.4)

Different tests use different weight functions w_{ij}. Examples include the kernel function K\{(\boldsymbol{X}_{i}-\boldsymbol{X}_{j})/h\}/h^{q} with a bandwidth h (see, e.g., Zheng (1996)), the exponential function \exp(-\|\boldsymbol{X}_{i}-\boldsymbol{X}_{j}\|^{2}/2) (see, e.g., Bierens (1982)), and the weight function 1/\sqrt{\|\boldsymbol{X}_{i}-\boldsymbol{X}_{j}\|^{2}+1} used by Li et al. (2019). These tests often have no tractable limiting null distributions.
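For concreteness, these three weight functions can be computed as follows; this is a hedged sketch in which the function names and the Gaussian choice of kernel are our own illustrative assumptions.

```python
import numpy as np

def pairwise_sqdist(X1, X2):
    # ||X_i - X_j||^2 for all pairs of rows, shape (n1, n2).
    return ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=2)

def w_kernel(X1, X2, h):
    # Product Gaussian kernel K{(X_i - X_j)/h}/h^q as in Zheng (1996).
    q = X1.shape[1]
    return np.exp(-pairwise_sqdist(X1, X2) / (2 * h ** 2)) / ((2 * np.pi) ** (q / 2) * h ** q)

def w_bierens(X1, X2):
    # exp(-||X_i - X_j||^2 / 2) as in Bierens (1982).
    return np.exp(-pairwise_sqdist(X1, X2) / 2)

def w_li(X1, X2):
    # 1 / sqrt(||X_i - X_j||^2 + 1) as in Li et al. (2019).
    return 1.0 / np.sqrt(pairwise_sqdist(X1, X2) + 1.0)
```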

2.3 The test statistic

We now modify U_{n} to avoid using the same observations twice. We estimate \boldsymbol{\theta} from the two parts of the sample separately. Define

\boldsymbol{\hat{\theta}}_{1}=\arg\min_{\boldsymbol{\theta}\in\boldsymbol{\Theta}}\sum_{i=1}^{n_{1}}\{Y_{i}-g(\boldsymbol{\theta},\boldsymbol{X}_{i})\}^{2} \text{ and } \boldsymbol{\hat{\theta}}_{2}=\arg\min_{\boldsymbol{\theta}\in\boldsymbol{\Theta}}\sum_{j=n_{1}+1}^{n}\{Y_{j}-g(\boldsymbol{\theta},\boldsymbol{X}_{j})\}^{2},

where the sample of size n is divided into two disjoint parts \mathcal{N}_{1}=\{(\boldsymbol{X}_{i},Y_{i})\}_{i=1}^{n_{1}} and \mathcal{N}_{2}=\{(\boldsymbol{X}_{j},Y_{j})\}_{j=n_{1}+1}^{n} of sizes n_{1} and n_{2} satisfying n=n_{1}+n_{2}. All the results in Section 2 hold when n is replaced by n_{1} or n_{2}. Define a modified test statistic as

U_{\mathcal{N}_{1},\mathcal{N}_{2}}=\frac{1}{\sqrt{n_{1}}}\sum_{i=1}^{n_{1}}\hat{e}_{i}\left(\frac{1}{\sqrt{n_{2}}}\sum_{j=n_{1}+1}^{n}\hat{e}_{j}w_{ij}\right)=\frac{1}{\sqrt{n_{1}n_{2}}}\sum_{i=1}^{n_{1}}\hat{e}_{i}\sum_{j=n_{1}+1}^{n}\hat{e}_{j}w_{ij}, \qquad (2.5)

where \hat{e}_{i}=Y_{i}-g(\boldsymbol{\hat{\theta}}_{1},\boldsymbol{X}_{i}), i=1,\dots,n_{1}, and \hat{e}_{j}=Y_{j}-g(\boldsymbol{\hat{\theta}}_{2},\boldsymbol{X}_{j}), j=n_{1}+1,\dots,n. Again, this test statistic usually differs little from the previous U_{n}, as they have similar asymptotic behaviors. However, this seemingly minor modification plays a vital role in constructing a conditionally studentized test with a normal weak limit under the null hypothesis. The key ingredient in the construction is that no residual appears in both of the two independent sums, while the weight function w_{ij} links the two sums.

To see how to define the final statistic, we give the decomposition of U_{\mathcal{N}_{1},\mathcal{N}_{2}}. Under the null hypothesis, we have the following:

U_{\mathcal{N}_{1},\mathcal{N}_{2}} =\frac{1}{\sqrt{n_{1}n_{2}}}\sum_{i=1}^{n_{1}}\sum_{j=n_{1}+1}^{n}\hat{e}_{i}\hat{e}_{j}w_{ij}
=\frac{1}{\sqrt{n_{1}}}\sum_{i=1}^{n_{1}}e_{i}\frac{1}{\sqrt{n_{2}}}\sum_{j=n_{1}+1}^{n}\hat{e}_{j}\left[w_{ij}-\dot{g}(\boldsymbol{\theta}^{*},\boldsymbol{X}_{i})^{\top}\boldsymbol{\Sigma}^{-1}E_{\boldsymbol{X}_{j}}\{\dot{g}(\boldsymbol{\theta}^{*},\boldsymbol{X}_{1})w_{1j}\}\right]+o_{p}(1)
=:\frac{1}{\sqrt{n_{1}}}\sum_{i=1}^{n_{1}}e_{i}\tilde{w}_{i}(\mathcal{N}_{2})+o_{p}(1), \qquad (2.6)

where E_{\boldsymbol{X}_{j}}\{\dot{g}(\boldsymbol{\theta}^{*},\boldsymbol{X}_{1})w_{1j}\} is the conditional expectation given \boldsymbol{X}_{j}, and \boldsymbol{\theta}^{*} is defined in (2.2). The detailed justification can be found in Corollary 1 of the Supplementary Material. The decomposition only concerns the residuals \hat{e}_{i} from the first part of the sample. Given \mathcal{N}_{2}, e_{1}\tilde{w}_{1}(\mathcal{N}_{2}),\ldots,e_{n_{1}}\tilde{w}_{n_{1}}(\mathcal{N}_{2}) are conditionally independent and identically distributed random variables. Intuitively, under the null hypothesis, the following random sequence would have a normal weak limit by applying the Central Limit Theorem conditionally:

V_{n}:=\frac{\frac{1}{\sqrt{n_{1}}}\sum_{i=1}^{n_{1}}e_{i}\tilde{w}_{i}(\mathcal{N}_{2})}{\sqrt{\frac{1}{n_{1}}\sum_{i=1}^{n_{1}}\left\{e_{i}\tilde{w}_{i}(\mathcal{N}_{2})-\frac{1}{n_{1}}\sum_{i=1}^{n_{1}}e_{i}\tilde{w}_{i}(\mathcal{N}_{2})\right\}^{2}}} \qquad (2.7)

which is a conditionally studentized version of \sum_{i=1}^{n_{1}}e_{i}\tilde{w}_{i}(\mathcal{N}_{2}). To define the final test statistic, we use \sum_{i=1}^{n_{1}}\sum_{j=n_{1}+1}^{n}\hat{e}_{i}\hat{e}_{j}w_{ij}/\sqrt{n_{1}n_{2}} in lieu of \sum_{i=1}^{n_{1}}e_{i}\tilde{w}_{i}(\mathcal{N}_{2})/\sqrt{n_{1}}, and for the denominator, which is a conditional standard deviation, we use \hat{e}_{i}, \hat{\boldsymbol{\Sigma}} and \boldsymbol{\hat{\theta}} to replace e_{i}, \boldsymbol{\Sigma} and \boldsymbol{\theta}^{*}:

\hat{V}_{n}:=\frac{\frac{1}{\sqrt{n_{1}n_{2}}}\sum_{i=1}^{n_{1}}\sum_{j=n_{1}+1}^{n}\hat{e}_{i}\hat{e}_{j}w_{ij}}{\sqrt{\frac{1}{n_{1}}\sum_{i=1}^{n_{1}}\left\{\hat{e}_{i}\tilde{\tilde{w}}_{i}(\mathcal{N}_{2})-\frac{1}{n_{1}}\sum_{i=1}^{n_{1}}\hat{e}_{i}\tilde{\tilde{w}}_{i}(\mathcal{N}_{2})\right\}^{2}}}, \qquad (2.8)

where

\tilde{\tilde{w}}_{i}(\mathcal{N}_{2})=\frac{1}{\sqrt{n_{2}}}\sum_{j=n_{1}+1}^{n}\hat{e}_{j}\left[w_{ij}-\dot{g}(\boldsymbol{\hat{\theta}},\boldsymbol{X}_{i})^{\top}\hat{\boldsymbol{\Sigma}}^{-1}\frac{1}{n_{1}}\sum_{i=1}^{n_{1}}\{\dot{g}(\boldsymbol{\hat{\theta}},\boldsymbol{X}_{i})w_{ij}\}\right],

and \hat{\boldsymbol{\Sigma}}=\frac{1}{n}\sum_{i=1}^{n}\dot{g}(\boldsymbol{\hat{\theta}},\boldsymbol{X}_{i})\dot{g}(\boldsymbol{\hat{\theta}},\boldsymbol{X}_{i})^{\top}. We will prove that \boldsymbol{\hat{\theta}}, \frac{1}{n_{1}}\sum_{i=1}^{n_{1}}\{\dot{g}(\boldsymbol{\hat{\theta}},\boldsymbol{X}_{i})w_{ij}\} and \hat{\boldsymbol{\Sigma}} are consistent estimators of \boldsymbol{\theta}^{*}, E_{\boldsymbol{X}_{j}}\{\dot{g}(\boldsymbol{\theta}^{*},\boldsymbol{X}_{1})w_{1j}\} and \boldsymbol{\Sigma}_{*}, respectively, so that the consistency of this estimated conditional standard deviation holds. Note that in \tilde{\tilde{w}}_{i}(\mathcal{N}_{2}), we use the full data-based estimator \boldsymbol{\hat{\theta}} instead of \boldsymbol{\hat{\theta}}_{2}. We find that, asymptotically, there is no difference between the two estimators, but using \boldsymbol{\hat{\theta}} yields a faster convergence rate to \boldsymbol{\theta}^{*}. In Corollary 2 of the Supplementary Material, we will show that under the null hypothesis and the local alternative hypothesis in (2.1) with np^{3}\delta_{n}^{4}\to 0,

\frac{\frac{1}{n_{1}}\sum_{i=1}^{n_{1}}\left\{e_{i}\tilde{w}_{i}(\mathcal{N}_{2})-\frac{1}{n_{1}}\sum_{i=1}^{n_{1}}e_{i}\tilde{w}_{i}(\mathcal{N}_{2})\right\}^{2}}{\frac{1}{n_{1}}\sum_{i=1}^{n_{1}}\left\{\hat{e}_{i}\tilde{\tilde{w}}_{i}(\mathcal{N}_{2})-\frac{1}{n_{1}}\sum_{i=1}^{n_{1}}\hat{e}_{i}\tilde{\tilde{w}}_{i}(\mathcal{N}_{2})\right\}^{2}}\stackrel{p}{\to}1, \qquad (2.9)

where \stackrel{p}{\to} stands for convergence in probability. In contrast, under the global alternative hypothesis in (2.1) with fixed \delta_{n}, \tilde{\tilde{w}}_{i}(\mathcal{N}_{2}) will be proved to be a consistent estimator of \tilde{w}_{i}^{(0)}(\mathcal{N}_{2}):=\sum_{j=n_{1}+1}^{n}\hat{e}_{j}\left[w_{ij}-\dot{g}(\boldsymbol{\theta}^{*},\boldsymbol{X}_{i})^{\top}\boldsymbol{\Sigma}_{*}^{-1}E_{\boldsymbol{X}_{j}}\{\dot{g}(\boldsymbol{\theta}^{*},\boldsymbol{X}_{1})w_{1j}\}\right]/\sqrt{n_{2}}. Note that when a kernel function K\{(\boldsymbol{X}-\boldsymbol{X}_{2})/h\}/h^{q} is used as the weight function W_{n}(\boldsymbol{X},\boldsymbol{X}_{2}), it can go to infinity as the sample size goes to infinity. But this is not a problem in our construction: the studentized test is scale-invariant, and the factor h^{q} in the weight w_{ij} cancels between the numerator and the denominator. Thus, we can consider the weight function without this factor.
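Putting the pieces together, the statistic \hat{V}_{n} in (2.8) can be assembled as in the following minimal sketch. It assumes the user has already computed the subsample residuals, the weight matrix w_{ij}, the gradient matrix with rows \dot{g}(\boldsymbol{\hat{\theta}},\boldsymbol{X}_{i}) for i\in\mathcal{N}_{1}, and \hat{\boldsymbol{\Sigma}}; all names are illustrative and this is not the authors' code.

```python
import numpy as np

def cost_statistic(e1, e2, W, G1, Sigma_hat):
    """Conditionally studentized statistic of (2.8).

    e1: residuals on N1 (length n1); e2: residuals on N2 (length n2);
    W: weight matrix with W[i, j] = w_{ij} (n1 x n2);
    G1: rows gdot(theta_hat, X_i) for i in N1 (n1 x p);
    Sigma_hat: p x p estimate of Sigma_*.
    """
    n1, n2 = len(e1), len(e2)
    # Numerator: sum_i sum_j e1_i e2_j w_ij / sqrt(n1 * n2).
    num = e1 @ W @ e2 / np.sqrt(n1 * n2)
    # A[:, j] = (1/n1) sum_i gdot(theta_hat, X_i) w_ij, estimating
    # E_{X_j}{gdot(theta*, X_1) w_{1j}}.
    A = G1.T @ W / n1
    # Projected weights: w_ij - gdot(theta_hat, X_i)^T Sigma_hat^{-1} A[:, j].
    W_proj = W - G1 @ np.linalg.solve(Sigma_hat, A)
    # w~~_i(N2) = (1/sqrt(n2)) * sum_j e2_j * projected weight.
    wtt = W_proj @ e2 / np.sqrt(n2)
    s = e1 * wtt
    denom = np.sqrt(np.mean((s - s.mean()) ** 2))  # estimated conditional SD
    return num / denom
```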

Remark 1.

It is worth noticing that in the theorems and corollaries, we impose some restrictions on the divergence rate of the parameter dimension p, such as p^{3}\log n_{1}/n_{1}\to 0 or np^{3}\delta_{n}^{4}\to 0, but do not directly constrain the dimension q of the predictor vector \boldsymbol{X}. In fact, some constraints are hidden in the regularity conditions in Section 6 on the regression and other related functions, so that the requirement p\geq q in Tan and Zhu (2022) is not needed and q can diverge to infinity much faster than p, and even than the sample size n. For instance, our test statistic can deal with the following model:

Y= \theta_{1}X_{1}X_{2}\cdots X_{p^{3}}X_{p^{3}+1}+\theta_{2}X_{2}X_{3}\cdots X_{p^{3}+1}X_{p^{3}+2}+\theta_{3}X_{3}X_{4}\cdots X_{p^{3}+2}X_{p^{3}+3}+\cdots+\theta_{p}X_{p}X_{p+1}\cdots X_{p^{4}-1}X_{p^{4}}+\varepsilon,

where \boldsymbol{X}=(X_{1},X_{2},\cdots,X_{p^{4}})\sim\mathbf{N}(0,\mathrm{I}_{p^{4}}) and q=p^{4}=n. The required conditions are satisfied. Similarly, we can find models with even higher dimension q>n. Therefore, although the conditions are strong, the results show the potential of our method to handle problems in large-dimensional settings.

3 Asymptotic properties

3.1 The limiting null distribution

Under the null hypothesis, the conditionally studentized version V_{n} of \sum_{i=1}^{n_{1}}e_{i}\tilde{w}_{i}(\mathcal{N}_{2}) has a normal weak limit under some regularity conditions (see Corollary 3 in the Supplementary Material for details). We can prove the asymptotic equivalence between the numerator (denominator) of \hat{V}_{n} and that of V_{n}. The result is stated as follows.

Theorem 1.

Suppose that Conditions 1-9 in Section 6 hold. Under the null hypothesis in (1.1), if p^{3}\log n_{1}/n_{1}\to 0 and p^{3}\log n_{2}/n_{2}\to 0,

\hat{V}_{n}\stackrel{d}{\rightarrow}\mathbf{N}(0,1), \qquad (3.1)

where \stackrel{d}{\rightarrow} stands for convergence in distribution.

Therefore, we can compute the critical values easily.
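For instance, with the real-data value reported in Section 4.3 (Cost_{n}=2.0420, p-value about 0.0412), the computation amounts to a standard normal tail probability; the two-sided form below is our reading of that example.

```python
from scipy.stats import norm

v_hat = 2.0420                              # observed statistic
p_value = 2 * (1 - norm.cdf(abs(v_hat)))    # two-sided p-value, about 0.0412
critical_value = norm.ppf(0.975)            # 5%-level two-sided cutoff, about 1.96
```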

Remark 2.

In Theorem 1, we do not restrict the specific relationship between n_{1} and n_{2}. However, this relationship influences the power of our test; Theorem 2 below states the results.

Remark 3.

In Theorem 1, we require p^{3}\log n_{1}/n_{1}\to 0 and p^{3}\log n_{2}/n_{2}\to 0 because, when analyzing the residuals, we use the asymptotically linear representation of the parameter estimator, and we also need the consistency of some sample covariance matrices for their population counterparts in the sense of the L_{2} norm. Tan and Zhu (2022) showed that, to obtain the asymptotically linear representation of the parameter estimator, this rate cannot be faster in general. However, for linear regression models, the rate of divergence can be improved to p^{2}=o(\max\{n_{1},n_{2}\}): the asymptotically linear representation of the parameter estimator still holds (Theorem 2 in Tan and Zhu (2019)), and the sample covariance matrices are still consistent.

3.2 Power study

Consider the power performance under the alternative hypothesis in (2.1). A fixed non-zero constant \delta_{n} corresponds to the global alternative hypothesis with Y=m(\boldsymbol{X})+\varepsilon, where m(\boldsymbol{X})\neq g(\boldsymbol{\theta},\boldsymbol{X}) for all \boldsymbol{\theta}\in\boldsymbol{\Theta}, and \delta_{n}\to 0 as n\to\infty corresponds to local alternatives. Define \boldsymbol{\Sigma}_{\varepsilon}=E\{\varepsilon^{2}\dot{g}(\boldsymbol{\theta}^{*},\boldsymbol{X})\dot{g}(\boldsymbol{\theta}^{*},\boldsymbol{X})^{\top}\}, \boldsymbol{\Sigma}_{l}=E\{l^{2}(\boldsymbol{X})\dot{g}(\boldsymbol{\theta}^{*},\boldsymbol{X})\dot{g}(\boldsymbol{\theta}^{*},\boldsymbol{X})^{\top}\} and \boldsymbol{\Sigma}_{cov}^{*}=E\{\dot{g}(\boldsymbol{\theta}^{*},\boldsymbol{X}_{1})\dot{g}(\boldsymbol{\theta}^{*},\boldsymbol{X}_{2})^{\top}w_{12}\}. We state the following results.

Theorem 2.

Suppose that Conditions 1-11 hold, p^{3}\log n_{1}/n_{1}\to 0 and p^{3}\log n_{2}/n_{2}\to 0. Under the alternative hypothesis in (2.1):

(a) When \delta_{n}=1 for the global alternatives, recalling l(\boldsymbol{X})=m(\boldsymbol{X})-g(\boldsymbol{\theta}^{*},\boldsymbol{X}), \boldsymbol{\Sigma}=E\{\dot{g}(\boldsymbol{\theta}^{*},\boldsymbol{X})\dot{g}(\boldsymbol{\theta}^{*},\boldsymbol{X})^{\top}\}-E[\{m(\boldsymbol{X})-g(\boldsymbol{\theta}^{*},\boldsymbol{X})\}\ddot{g}(\boldsymbol{\theta}^{*},\boldsymbol{X})] and \boldsymbol{\Sigma}_{*}=E\{\dot{g}(\boldsymbol{\theta}^{*},\boldsymbol{X})\dot{g}(\boldsymbol{\theta}^{*},\boldsymbol{X})^{\top}\} from Section 2, we can obtain that

\frac{\hat{V}_{n}}{\sqrt{n_{1}}}\stackrel{p}{\to}\frac{E\{l(\boldsymbol{X}_{1})l(\boldsymbol{X}_{2})w_{12}\}}{\sqrt{V_{(0)}}}, \qquad (3.2)

where \stackrel{p}{\to} stands for convergence in probability, and

V_{(0)}= E[\{\varepsilon_{1}^{2}+l^{2}(\boldsymbol{X}_{1})\}E_{\boldsymbol{X}_{1}}^{2}\{l(\boldsymbol{X}_{2})w_{12}\}]-E^{2}\{l(\boldsymbol{X}_{1})l(\boldsymbol{X}_{2})w_{12}\}
+E\{l(\boldsymbol{X}_{2})\dot{g}(\boldsymbol{\theta}^{*},\boldsymbol{X}_{1})w_{12}\}^{\top}\boldsymbol{\Sigma}_{*}^{-1}(\boldsymbol{\Sigma}_{\varepsilon}+\boldsymbol{\Sigma}_{l})\boldsymbol{\Sigma}_{*}^{-1}E\{l(\boldsymbol{X}_{2})\dot{g}(\boldsymbol{\theta}^{*},\boldsymbol{X}_{1})w_{12}\}
-2E[\{\varepsilon_{1}^{2}+l^{2}(\boldsymbol{X}_{1})\}l(\boldsymbol{X}_{2})\dot{g}(\boldsymbol{\theta}^{*},\boldsymbol{X}_{1})w_{12}]^{\top}\boldsymbol{\Sigma}_{*}^{-1}E\{l(\boldsymbol{X}_{2})\dot{g}(\boldsymbol{\theta}^{*},\boldsymbol{X}_{1})w_{12}\}.

In cases (b)-(d), recalling l(\boldsymbol{X})=m(\boldsymbol{X})-g(\boldsymbol{\theta}_{0},\boldsymbol{X}) and \boldsymbol{\Sigma}=E\{\dot{g}(\boldsymbol{\theta}_{0},\boldsymbol{X})\dot{g}(\boldsymbol{\theta}_{0},\boldsymbol{X})^{\top}\}, we have:

(b) When \delta_{n}=1/\sqrt{n},

\hat{V}_{n}-\frac{\frac{\sqrt{n_{1}}}{\sqrt{n_{2}n}}\sum_{j=n_{1}+1}^{n}\varepsilon_{j}H(\boldsymbol{X}_{j})+\frac{\sqrt{n_{1}n_{2}}}{n}\mu}{\sqrt{E_{\mathcal{N}_{2}}\left[\left\{\frac{1}{\sqrt{n_{2}}}\sum_{j=n_{1}+1}^{n}\varepsilon_{1}\varepsilon_{j}w^{\prime}_{1j}+\frac{\sqrt{n_{2}}}{\sqrt{n}}\varepsilon_{1}H(\boldsymbol{X}_{1})\right\}^{2}\right]}}\stackrel{d}{\to}\mathbf{N}(0,1), \qquad (3.3)

where

w^{\prime}_{1j}= w_{1j}-\dot{g}(\boldsymbol{\theta}^{*},\boldsymbol{X}_{j})^{\top}\boldsymbol{\Sigma}^{-1}E_{\boldsymbol{X}_{1}}\{\dot{g}(\boldsymbol{\theta}^{*},\boldsymbol{X}_{j})w_{1j}\}-\dot{g}(\boldsymbol{\theta}^{*},\boldsymbol{X}_{1})^{\top}\boldsymbol{\Sigma}^{-1}E_{\boldsymbol{X}_{j}}\{\dot{g}(\boldsymbol{\theta}^{*},\boldsymbol{X}_{1})w_{1j}\}+\dot{g}(\boldsymbol{\theta}^{*},\boldsymbol{X}_{1})^{\top}\boldsymbol{\Sigma}^{-1}\boldsymbol{\Sigma}_{cov}^{*}\boldsymbol{\Sigma}^{-1}\dot{g}(\boldsymbol{\theta}^{*},\boldsymbol{X}_{j}),

\mu= E\{l(\boldsymbol{X}_{1})l(\boldsymbol{X}_{2})w_{12}\}-2E\{\dot{g}(\boldsymbol{\theta}_{0},\boldsymbol{X}_{1})l(\boldsymbol{X}_{2})w_{12}\}^{\top}\boldsymbol{\Sigma}^{-1}E\{\dot{g}(\boldsymbol{\theta}_{0},\boldsymbol{X}_{1})l(\boldsymbol{X}_{1})\}+E\{\dot{g}(\boldsymbol{\theta}_{0},\boldsymbol{X}_{1})l(\boldsymbol{X}_{1})\}^{\top}\boldsymbol{\Sigma}^{-1}\boldsymbol{\Sigma}_{cov}^{*}\boldsymbol{\Sigma}^{-1}E\{\dot{g}(\boldsymbol{\theta}_{0},\boldsymbol{X}_{1})l(\boldsymbol{X}_{1})\},

H(\boldsymbol{X}_{1})= E_{\boldsymbol{X}_{1}}\{l(\boldsymbol{X}_{2})w_{12}\}-E_{\boldsymbol{X}_{1}}\{\dot{g}(\boldsymbol{\theta}_{0},\boldsymbol{X}_{2})w_{12}\}^{\top}\boldsymbol{\Sigma}^{-1}E\{\dot{g}(\boldsymbol{\theta}_{0},\boldsymbol{X}_{1})l(\boldsymbol{X}_{1})\}-\dot{g}(\boldsymbol{\theta}_{0},\boldsymbol{X}_{1})^{\top}\boldsymbol{\Sigma}^{-1}E\{\dot{g}(\boldsymbol{\theta}_{0},\boldsymbol{X}_{1})l(\boldsymbol{X}_{2})w_{12}\}+\dot{g}(\boldsymbol{\theta}_{0},\boldsymbol{X}_{1})^{\top}\boldsymbol{\Sigma}^{-1}\boldsymbol{\Sigma}_{cov}^{*}\boldsymbol{\Sigma}^{-1}E\{\dot{g}(\boldsymbol{\theta}_{0},\boldsymbol{X}_{1})l(\boldsymbol{X}_{1})\},

and \sum_{j=n_{1}+1}^{n}\varepsilon_{j}H(\boldsymbol{X}_{j})/\sqrt{n_{2}} converges to \mathbf{N}(0,E[\{\varepsilon_{1}H(\boldsymbol{X}_{1})\}^{2}]) in distribution.

In particular, if n_{1}/n=o(1), then \hat{V}_{n}\stackrel{d}{\to}\mathbf{N}(0,1).

(c) When \delta_{n}=n^{-\alpha} with \frac{1}{4}<\alpha<\frac{1}{2}, n^{1-4\alpha}p^{3}\log n\to 0 and n^{-\alpha}\sqrt{n_{1}}\to\infty, then \hat{V}_{n}\stackrel{p}{\to}\infty.

(d) When \delta_{n}=n^{-\alpha} with \alpha>\frac{1}{2}, then \hat{V}_{n}\stackrel{d}{\to}\mathbf{N}(0,1).

Remark 4.

Theorem 2 shows that the test can detect local alternatives distinct from the null at the fastest possible rate of order 1/\sqrt{n_{1}} in general. Due to the possibly different sizes n_{1} and n_{2} of the two subsamples, the above analysis presents more detailed results for different cases than those for existing tests in the literature. From the above results, we can see that only in case (b) with \delta_{n}=1/\sqrt{n} is the optimal splitting n_{1}=n_{2}=n/2, while in cases (a) and (c), a large n_{1} can enhance power. Practically, however, if n_{2} is too small, the conditional variance cannot be estimated well, so the test may not perform well. These claims were confirmed when we conducted numerical studies using n_{2}=0.5n, n_{2}=0.25n, and n_{2}=0.1n. Thus, in the numerical studies, we report the results with n_{2}=0.25n. Another issue is the power performance when \delta_{n}=n^{-\alpha} with 0<\alpha<1/4. We do not discuss this case mainly because of the difficulty of studying the negligibility of the remaining terms in the asymptotically linear representations of \boldsymbol{\hat{\theta}}_{1}-\boldsymbol{\theta}_{0} and \boldsymbol{\hat{\theta}}_{2}-\boldsymbol{\theta}_{0} under the local alternatives.

4 Numerical Studies

4.1 Simulations

In this section, some numerical studies are conducted to examine the performance of the test proposed in Section 2. As used in Li et al. (2019), the weight function 1/\sqrt{\|\boldsymbol{X}_{1}-\boldsymbol{X}_{2}\|^{2}+1} combines the merits of local smoothing and global smoothing tests. But it is theoretically flawed for large p, as it converges to a constant when p goes to infinity. To remedy this defect, our chosen weight function not only includes this weight function but also contains another weight function, \sum_{k=1}^{p}K\{(X_{ik}-X_{jk})/h\}/h, where K(\cdot) denotes a kernel density function and h is the bandwidth. The summation form ensures that it works in diverging dimension cases. As a result, the weight function, defined as 0.5\times[1/\sqrt{\|\boldsymbol{X}_{1}-\boldsymbol{X}_{2}\|^{2}+1}+\sum_{k=1}^{p}K\{(X_{ik}-X_{jk})/h\}/h], is a hybrid of two equally weighted functions.

To make the simulation results convincing, we compare with the test AICM_{n} proposed by Tan and Zhu (2022), which can also handle diverging dimension cases and shows its advantages there. It is worth noticing that the method in Janková et al. (2020) could also be applied to non-sparse models with random designs and dimensions under our constraints. Tan and Zhu (2022) made a comparison for logistic models with three model settings, and AICM_{n} outperformed the test in Janková et al. (2020) in two of the three settings. We have also conducted simulations under the same model settings and found that COST worked similarly to AICM_{n}. Therefore, we put those results in Section 4.2, and the main text here only reports the numerical comparison with AICM_{n}. In addition, we also evaluate the performance of another related test, DrCost_{n}, which can be seen as a modified version of Cost_{n}: when the underlying model has a dimension reduction structure, DrCost_{n} contains a dimension reduction step (see Tan and Zhu (2022)). We design three studies: models with dimension reduction structures; with dimension reduction structures and diverging dimensions; and without dimension reduction structures. The first two studies favor Tan and Zhu (2022)'s method, and the third study deals with general models. The predictor vectors \boldsymbol{X}_{i} are independently generated from the multivariate normal distribution \mathbf{N}(\boldsymbol{0},\boldsymbol{\Sigma}), where \boldsymbol{\Sigma}=\boldsymbol{\Sigma}_{1}=I_{p} or \boldsymbol{\Sigma}=\boldsymbol{\Sigma}_{2}=(0.5^{|i-j|})_{p\times p}. The errors \varepsilon are independently drawn from the standard normal distribution \mathbf{N}(0,1). The simulation results are all based on 1,000 replications. The results of AICM_{n} are computed by the code provided by the authors of Tan and Zhu (2022). For the bandwidth h, we take h=c\cdot n^{-0.2}, where c is a constant taking the five values 0.5, 0.8, 1, 1.2, 1.5. To check how robust the test is against the value of c, we use model H_{11} as an example in Figure 1. It shows that our test performs robustly, with similar size and power levels across c. Therefore, we use the bandwidth with c=1.

Figure 1: Empirical sizes and powers of Cost_{n} under different bandwidths for H_{11} in Study 1, when n=400, q=17 and \boldsymbol{\Sigma}=\boldsymbol{\Sigma}_{1}.
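A sketch of this hybrid weight with a Gaussian kernel and the bandwidth h=c\cdot n^{-0.2} (c=1 as chosen above); the function name and the Gaussian choice of K are illustrative assumptions.

```python
import numpy as np

def hybrid_weight(X1, X2, n, c=1.0):
    """0.5 * [ 1/sqrt(||X_i - X_j||^2 + 1) + sum_k K{(X_ik - X_jk)/h}/h ], h = c * n^{-0.2}."""
    h = c * n ** (-0.2)
    diff = X1[:, None, :] - X2[None, :, :]                 # shape (n1, n2, dim)
    w_global = 1.0 / np.sqrt((diff ** 2).sum(axis=2) + 1.0)
    # Sum of one-dimensional Gaussian kernels over the components.
    w_local = (np.exp(-(diff / h) ** 2 / 2) / np.sqrt(2 * np.pi)).sum(axis=2) / h
    return 0.5 * (w_global + w_local)
```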

For comparisons, we consider four numerical studies. Studies 1 and 2 have multi-index model structures that favor AICM_{n}, and Study 3 does not have such a structure. As our method can, under some strong conditions on the regression function, handle cases where the dimension q of the predictor vector is much higher than the dimension p of the parameter vector, we consider Study 4, which has q=p^{2} and q=n. As the limiting null distribution of AICM_{n} is generally intractable, the wild bootstrap approximation is used to determine its critical values. For model H_{11} in Study 1 and model H_{21} in Study 2, the numerical results of AICM_{n} are excerpted from Tan and Zhu (2022) to make the paper self-contained.

Study 1. Generate data from the following double-index and triple-index models:

H_{11}: Y=\boldsymbol{\beta}_{1}^{\top}\boldsymbol{X}+a(\boldsymbol{\beta}_{2}^{\top}\boldsymbol{X})^{2}+\varepsilon,
H_{12}: Y=X_{1}+\cos(2X_{2})+a\exp(3X_{2})+\varepsilon.

We set \boldsymbol{\beta}_{1}=(\underbrace{1,\cdots,1}_{q_{1}},0,\cdots,0)^{\top}/\sqrt{q_{1}} and \boldsymbol{\beta}_{2}=(0,\cdots,0,\underbrace{1,\cdots,1}_{q_{1}})^{\top}/\sqrt{q_{1}} with q_{1}=[q/2]_{-}, where [\ ]_{-} denotes the largest integer smaller than or equal to q/2. Here X_{i} denotes the i-th component of \boldsymbol{X}. The first hypothesized model is single-index and the second is double-index; under the alternatives they contain a second and a third index, respectively. The empirical sizes and powers are reported in Tables 1-2 in Section 4.2.
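As an illustration, data from model H_{11} can be generated as follows (here with \boldsymbol{\Sigma}=\boldsymbol{\Sigma}_{1}=I_{q}; the function name is ours, and a=0 gives the null model):

```python
import numpy as np

def gen_H11(n, q, a=0.0, seed=None):
    rng = np.random.default_rng(seed)
    q1 = q // 2                                    # [q/2]_-
    b1 = np.r_[np.ones(q1), np.zeros(q - q1)] / np.sqrt(q1)
    b2 = np.r_[np.zeros(q - q1), np.ones(q1)] / np.sqrt(q1)
    X = rng.standard_normal((n, q))                # Sigma_1 = I_q case
    Y = X @ b1 + a * (X @ b2) ** 2 + rng.standard_normal(n)
    return X, Y

X, Y = gen_H11(n=400, q=17, a=0.25, seed=1)
```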

Table 1 shows that AICM_{n} works better for model H_{11}, and DrCost_{n} with the dimension reduction step outperforms Cost_{n}, though only slightly. When the sample size gets larger, our tests gradually work closer to AICM_{n}. However, for model H_{12}, the situation changes. The results in Table 2 suggest that Cost_{n} outperforms DrCost_{n} and AICM_{n}, although the model favors them. We checked the details and found that the structural dimension of the central subspace is underestimated as 1 in this model, and the residuals cannot be estimated well under the alternative hypothesis. This might be one reason that AICM_{n} and DrCost_{n} do not work well.

Study 2. Generate data from the following multi-index models with q=0.1n and q=\sqrt{n} respectively:

H_{21}: Y=\boldsymbol{\beta}_{0}^{\top}\boldsymbol{X}+a\exp(\boldsymbol{\beta}_{0}^{\top}\boldsymbol{X})+\varepsilon,
H_{22}: Y=\boldsymbol{\beta}_{1}^{\top}\boldsymbol{X}+\exp(\boldsymbol{\beta}_{2}^{\top}\boldsymbol{X})+a\exp(-\boldsymbol{\beta}_{0}^{\top}\boldsymbol{X})+\varepsilon,

where \boldsymbol{\beta}_{0}=(1,1,\cdots,1)^{\top}/\sqrt{q} and the other notations are the same as stated above. In this study, the dimension q is large; thus, the regularity conditions fail to hold in theory. The empirical sizes and powers for Study 2 are presented in Tables 3-4 in Section 4.2.

The simulation results suggest that no competitor can maintain the significance level for model H_{21} when q diverges at the rate 0.1n. This indicates that the dimensionality condition is violated too severely. For model H_{22} with q=\sqrt{n}, which also violates the condition, we can see that when the sample size is large, our test performs well in significance level maintenance with relatively high power, whereas AICM_{n} is liberal in general. The test DrCost_{n} with the dimension reduction step works slightly worse than Cost_{n}. This suggests that the test may still be usable when the sample size is large.

Study 3. Generate data from the following models without dimension reduction structures:

H_{31}: Y=X_{1}+\cos(2X_{2})+a\sum_{i=1}^{q}\exp(3X_{i})+\varepsilon,
H_{32}: Y=\sum_{i=1}^{q}\sin(\beta_{0i}X_{i})+a\sum_{i=1}^{q}\exp(3X_{i})+\varepsilon,
H_{33}: Y=\sum_{i=1}^{q-1}X_{i}X_{i+1}+a\cos(\boldsymbol{\beta}_{0}^{\top}\boldsymbol{X})+\varepsilon,
H_{34}: Y=\sum_{i=1}^{q-2}X_{i}X_{i+1}\sin(\pi X_{i+2})+a(\boldsymbol{\beta}_{0}^{\top}\boldsymbol{X})^{3}+\varepsilon,

where \beta_{0i} denotes the i-th component of \boldsymbol{\beta}_{0} and the rest of the notations are as stated above. The four models have no dimension reduction structures; they represent, under the null, low-dimensional and high-dimensional regressions, including high-dimensional regressions with interactions among the covariates, and, under the alternatives, high-dimensional departures.

Tables 5-8 in Section 4.2 report the empirical sizes and powers. AICM_{n} fails entirely in high-dimensional scenarios, especially for model H_{34}, which allows higher-order interactions between the covariates. In almost all cases, AICM_{n} cannot maintain the significance level and even has no empirical power. This phenomenon suggests that AICM_{n} relies critically on dimension reduction structures. By comparison, Cost_{n} works much better, as expected.

Study 4. Generate data from the following models with q=p^{2} and q=n respectively:

H_{41}: Y=\sum_{i=1}^{p}\sin\left(\beta_{1i}\prod_{j=(i-1)r+1}^{\min(ir,q)}X_{j}\right)+a(\boldsymbol{\beta}_{1}^{\top}\boldsymbol{X})^{2}+\varepsilon,
H_{42}: Y=\sum_{i=1}^{p}\sin\left(\beta_{1i}\sum_{j=(i-1)r+1}^{(i-1)r+r_{1}}X_{j}+\sum_{j=(i-1)r+r_{1}+1}^{\min(ir,q)}X_{j}\right)+a(\boldsymbol{\beta}_{1}^{\top}\boldsymbol{X})^{2}+\varepsilon,

where \boldsymbol{X}=(X_{1},X_{2},\cdots,X_{q})^{\top}, \boldsymbol{\beta}_{1}=(\underbrace{1,\cdots,1}_{p_{1}},0,\cdots,0)^{\top}/\sqrt{p_{1}} with p_{1}=[p/2]_{-}, and \beta_{1i} denotes the i-th component of \boldsymbol{\beta}_{1}. In addition, define r=[q/p]^{+} and r_{1}=[r/2]_{-}, where [\ ]^{+} takes the smallest integer greater than or equal to q/p. The other notations remain as before. The two models have a much higher dimension of the predictor vector than that of the parameter vector. Tables 9-10 in Section 4.2 report the empirical sizes and powers. The simulation results suggest that Cost_{n} may not be significantly affected by the dimension q of the predictor vector when the regression function has a particular structure in the predictors, and it still works well in both significance level maintenance and power performance, whereas AICM_{n} entirely fails to work.

The performance of our test is more robust against model settings than AICM_{n}. Thus, our test is more suitable when the sample size is relatively large. But in moderate sample size scenarios, the test loses some power; see Table 7 for instance. This is because the sample-splitting technique reduces the effective sample size, so the test converges to its weak limit at a slower rate than classic tests. Another reason could be the use of the limiting null distribution to determine the critical values, whereas AICM_{n} uses the bootstrap approximation, which favors small and moderate sample size scenarios.

4.2 Simulation results

Table 1: Empirical sizes and powers for H_{11} in Study 1 (p = q).
                               a      n=100  n=100  n=100  n=100  n=200  n=400  n=600
                                      q=2    q=4    q=6    q=8    q=12   q=17   q=20
Cost_{n}, \Sigma_{1}           0.00   0.044  0.044  0.048  0.055  0.061  0.045  0.053
                               0.25   0.326  0.313  0.367  0.357  0.674  0.951  0.995
DrCost_{n}, \Sigma_{1}         0.00   0.050  0.046  0.042  0.054  0.050  0.041  0.048
                               0.25   0.343  0.348  0.339  0.304  0.620  0.917  0.979
AICM_{n}, \Sigma_{1}           0.00   0.055  0.051  0.076  0.051  0.050  0.050  0.065
(from Tan and Zhu (2022))      0.25   0.556  0.564  0.553  0.562  0.853  0.992  1.000
Cost_{n}, \Sigma_{2}           0.00   0.043  0.053  0.052  0.061  0.048  0.054  0.055
                               0.25   0.337  0.624  0.767  0.817  0.996  1.000  1.000
DrCost_{n}, \Sigma_{2}         0.00   0.051  0.049  0.046  0.056  0.058  0.048  0.065
                               0.25   0.281  0.540  0.703  0.768  0.987  1.000  1.000
AICM_{n}, \Sigma_{2}           0.00   0.052  0.043  0.059  0.070  0.057  0.049  0.050
(from Tan and Zhu (2022))      0.25   0.481  0.820  0.916  0.956  1.000  1.000  1.000
Table 2: Empirical sizes and powers for H_{12} in Study 1 (p = 2).
                               a      n=100  n=100  n=100  n=100  n=200  n=400  n=600
                                      q=2    q=4    q=6    q=8    q=12   q=17   q=20
Cost_{n}, \Sigma_{1}           0.00   0.054  0.048  0.038  0.055  0.046  0.045  0.050
                               0.10   0.530  0.487  0.473  0.533  0.726  0.878  0.923
DrCost_{n}, \Sigma_{1}         0.00   0.051  0.051  0.060  0.042  0.054  0.054  0.046
                               0.10   0.403  0.425  0.415  0.390  0.619  0.757  0.850
AICM_{n}, \Sigma_{1}           0.00   0.054  0.059  0.044  0.051  0.069  0.066  0.068
                               0.10   0.098  0.093  0.084  0.103  0.112  0.172  0.240
Cost_{n}, \Sigma_{2}           0.00   0.051  0.036  0.052  0.043  0.038  0.039  0.039
                               0.10   0.502  0.509  0.557  0.493  0.743  0.830  0.885
DrCost_{n}, \Sigma_{2}         0.00   0.050  0.045  0.044  0.040  0.046  0.053  0.052
                               0.10   0.248  0.266  0.288  0.294  0.457  0.601  0.688
AICM_{n}, \Sigma_{2}           0.00   0.050  0.053  0.048  0.038  0.059  0.064  0.053
                               0.10   0.043  0.063  0.050  0.051  0.042  0.031  0.018
Table 3: Empirical sizes and powers for H_{21} in Study 2 with q=0.1n (p = q).
                               a      n=50   n=100  n=500  n=1000
                                      q=5    q=10   q=50   q=100
Cost_{n}, \Sigma_{1}           0.00   0.039  0.058  0.063  0.064
                               0.10   0.116  0.222  0.798  0.984
DrCost_{n}, \Sigma_{1}         0.00   0.067  0.053  0.065  0.075
                               0.10   0.107  0.151  0.566  0.871
AICM_{n}, \Sigma_{1}           0.00   0.062  0.057  0.071  0.081
(from Tan and Zhu (2022))      0.10   0.163  0.250  0.858  0.994
Cost_{n}, \Sigma_{2}           0.00   0.051  0.057  0.058  0.072
                               0.10   0.184  0.461  0.993  1.000
DrCost_{n}, \Sigma_{2}         0.00   0.050  0.059  0.070  0.071
                               0.10   0.154  0.331  0.970  0.992
AICM_{n}, \Sigma_{2}           0.00   0.064  0.068  0.079  0.107
(from Tan and Zhu (2022))      0.10   0.235  0.582  0.935  0.959
Table 4: Empirical sizes and powers for H_{22} in Study 2 with q=\sqrt{n} (p = 2q).
                               a      n=100  n=400  n=900
                                      q=10   q=20   q=30
Cost_{n}, \Sigma_{1}           0.00   0.104  0.068  0.061
                               0.50   0.568  0.991  1.000
DrCost_{n}, \Sigma_{1}         0.00   0.094  0.060  0.059
                               0.50   0.512  0.962  1.000
AICM_{n}, \Sigma_{1}           0.00   0.079  0.074  0.074
                               0.50   0.970  0.999  1.000
Cost_{n}, \Sigma_{2}           0.00   0.056  0.059  0.037
                               0.50   0.393  0.824  0.953
DrCost_{n}, \Sigma_{2}         0.00   0.074  0.061  0.057
                               0.50   0.358  0.708  0.889
AICM_{n}, \Sigma_{2}           0.00   0.086  0.091  0.069
                               0.50   0.760  0.839  0.917
Table 5: Empirical sizes and powers for H_{31} in Study 3 (p = 2).
                               a      n=100  n=100  n=100  n=100  n=200  n=400  n=600
                                      q=2    q=4    q=6    q=8    q=12   q=17   q=20
Cost_{n}, \Sigma_{1}           0.00   0.042  0.035  0.043  0.046  0.057  0.048  0.047
                               0.10   0.658  0.726  0.825  0.817  0.933  0.975  0.970
AICM_{n}, \Sigma_{1}           0.00   0.049  0.057  0.050  0.048  0.069  0.065  0.066
                               0.10   0.041  0.033  0.063  0.082  0.144  0.211  0.258
Cost_{n}, \Sigma_{2}           0.00   0.034  0.045  0.041  0.045  0.039  0.048  0.051
                               0.10   0.610  0.738  0.850  0.838  0.935  0.950  0.978
AICM_{n}, \Sigma_{2}           0.00   0.050  0.055  0.048  0.039  0.061  0.063  0.056
                               0.10   0.028  0.010  0.017  0.022  0.041  0.070  0.104
Table 6: Empirical sizes and powers for H_{32} in Study 3 (p = q).
                               a      n=100  n=100  n=100  n=100  n=200  n=400  n=600
                                      q=2    q=4    q=6    q=8    q=12   q=17   q=20
Cost_{n}, \Sigma_{1}           0.00   0.035  0.029  0.046  0.046  0.055  0.064  0.041
                               0.10   0.588  0.757  0.782  0.840  0.988  1.000  1.000
AICM_{n}, \Sigma_{1}           0.00   0.061  0.056  0.056  0.079  0.074  0.048  0.000
                               0.10   0.353  0.162  0.030  0.002  0.000  0.000  0.000
Cost_{n}, \Sigma_{2}           0.00   0.042  0.051  0.053  0.051  0.051  0.070  0.049
                               0.10   0.487  0.637  0.760  0.843  0.967  0.998  1.000
AICM_{n}, \Sigma_{2}           0.00   0.060  0.063  0.047  0.057  0.064  0.053  0.000
                               0.10   0.391  0.177  0.022  0.000  0.000  0.000  0.000
Table 7: Empirical sizes and powers for H_{33} in Study 3 (p = q-1).
                               a      n=100  n=100  n=100  n=100  n=200  n=400  n=600
                                      q=2    q=4    q=6    q=8    q=12   q=17   q=20
Cost_{n}, \Sigma_{1}           0.00   0.047  0.054  0.062  0.042  0.058  0.039  0.052
                               0.50   0.688  0.647  0.644  0.615  0.915  0.996  0.998
AICM_{n}, \Sigma_{1}           0.00   0.052  0.038  0.020  0.002  0.000  0.000  0.000
                               0.50   0.905  0.792  0.550  0.144  0.000  0.000  0.000
Cost_{n}, \Sigma_{2}           0.00   0.035  0.052  0.051  0.048  0.044  0.053  0.051
                               0.50   0.675  0.558  0.465  0.399  0.572  0.774  0.867
AICM_{n}, \Sigma_{2}           0.00   0.060  0.038  0.025  0.004  0.000  0.000  0.000
                               0.50   0.904  0.724  0.496  0.162  0.000  0.000  0.000
Table 8: Empirical sizes and powers for H_{34} in Study 3 (p = q-2).
                               a      n=100  n=100  n=100  n=100  n=200  n=400  n=600
                                      q=2    q=4    q=6    q=8    q=12   q=17   q=20
Cost_{n}, \Sigma_{1}           0.00   /      0.052  0.047  0.046  0.053  0.057  0.055
                               0.50   /      0.489  0.345  0.226  0.433  0.661  0.787
AICM_{n}, \Sigma_{1}           0.00   /      0.038  0.015  0.000  0.000  0.000  0.000
                               0.50   /      0.642  0.176  0.022  0.000  0.000  0.000
Cost_{n}, \Sigma_{2}           0.00   /      0.048  0.044  0.057  0.051  0.042  0.066
                               0.50   /      0.909  0.835  0.750  0.932  0.995  0.999
AICM_{n}, \Sigma_{2}           0.00   /      0.047  0.022  0.005  0.000  0.000  0.000
                               0.50   /      0.619  0.251  0.075  0.032  0.004  0.000
Table 9: Empirical sizes and powers for $H_{41}$ in Study 4.

Test            a      n=100   n=100   n=100   n=100   n=200   n=400   n=600
                       p=2     p=4     p=6     p=8     p=12    p=17    p=20
q = p^2:
Cost_n, Σ1      0.00   0.053   0.063   0.047   0.048   0.046   0.055   0.049
                0.25   0.357   0.399   0.421   0.460   0.762   0.965   0.994
AICM_n, Σ1      0.00   0.037   0.000   0.000   0.000   0.000   0.000   0.000
                0.25   0.279   0.000   0.000   0.000   0.000   0.000   0.000
Cost_n, Σ2      0.00   0.044   0.052   0.051   0.047   0.055   0.046   0.039
                0.25   0.272   0.645   0.841   0.913   1.000   1.000   1.000
AICM_n, Σ2      0.00   0.094   0.000   0.000   0.000   0.000   0.000   0.000
                0.25   0.206   0.000   0.000   0.000   0.000   0.000   0.000
q = n:
Cost_n, Σ1      0.00   0.037   0.041   0.049   0.043   0.051   0.059   0.041
                0.25   0.486   0.424   0.432   0.435   0.765   0.961   1.000
AICM_n, Σ1      0.00   0.000   0.000   0.000   0.000   0.000   0.000   0.000
                0.25   0.000   0.000   0.000   0.000   0.000   0.000   0.000
Cost_n, Σ2      0.00   0.047   0.052   0.037   0.041   0.063   0.039   0.065
                0.25   0.455   0.766   0.881   0.921   0.994   1.000   1.000
AICM_n, Σ2      0.00   0.000   0.000   0.000   0.000   0.000   0.000   0.000
                0.25   0.000   0.000   0.000   0.000   0.000   0.000   0.000
Table 10: Empirical sizes and powers for $H_{42}$ in Study 4.

Test            a      n=100   n=100   n=100   n=100   n=200   n=400   n=600
                       p=2     p=4     p=6     p=8     p=12    p=17    p=20
q = p^2:
Cost_n, Σ1      0.00   0.052   0.034   0.055   0.051   0.064   0.066   0.049
                0.25   0.324   0.381   0.401   0.416   0.755   0.960   0.999
AICM_n, Σ1      0.00   0.038   0.000   0.000   0.000   0.000   0.000   0.000
                0.25   0.329   0.000   0.000   0.000   0.000   0.000   0.000
Cost_n, Σ2      0.00   0.040   0.053   0.051   0.053   0.054   0.055   0.044
                0.25   0.335   0.652   0.834   0.876   0.999   1.000   1.000
AICM_n, Σ2      0.00   0.049   0.000   0.000   0.000   0.000   0.000   0.000
                0.25   0.320   0.000   0.000   0.000   0.000   0.000   0.000
q = n:
Cost_n, Σ1      0.00   0.046   0.056   0.048   0.067   0.066   0.056   0.048
                0.25   0.191   0.145   0.119   0.112   0.138   0.171   0.197
AICM_n, Σ1      0.00   0.000   0.000   0.000   0.000   0.000   0.000   0.000
                0.25   0.000   0.000   0.000   0.000   0.000   0.000   0.000
Cost_n, Σ2      0.00   0.055   0.060   0.052   0.053   0.066   0.051   0.056
                0.25   0.214   0.258   0.281   0.275   0.452   0.671   0.821
AICM_n, Σ2      0.00   0.000   0.000   0.000   0.000   0.000   0.000   0.000
                0.25   0.000   0.000   0.000   0.000   0.000   0.000   0.000

4.3 A real data example

In this subsection, we use the CSM data set to illustrate our method. The CSM data set was first analyzed by Ahmed et al. (2015) and can be obtained from https://archive.ics.uci.edu/ml/datasets/CSM+%28Conventional+and+Social+Media+Movies%29+Dataset+2014+and+2015. After removing 30 observations with missing responses and/or covariates, 187 observations remain in the data set. The response variable $Y$ is Gross Income. There are 11 predictor variables: Rating $X_1$, Genre $X_2$, Budget $X_3$, Screens $X_4$, Sequel $X_5$, Sentiment $X_6$, Views $X_7$, Likes $X_8$, Dislikes $X_9$, Comments $X_{10}$ and Aggregate Followers $X_{11}$. To explore the relationship between the response $Y$ and the predictor vector $\boldsymbol{X}=(X_{1},X_{2},\dots,X_{11})^{\top}$, we first check whether the data set follows a linear regression model, which is often used in practice. Figure 2 (a) shows that the residuals may exhibit a systematic, roughly linear pattern against the fitted values. Moreover, the value of our proposed test statistic is $Cost_n=2.0420$, with a $p$-value of about 0.0412. Therefore, a linear regression model may not be tenable for these data, and a more plausible model is needed. Using sufficient dimension reduction techniques such as cumulative slicing estimation (CSE), we find that the estimated structural dimension of this data set is $\hat{q}=1$, and the corresponding projected direction is

\hat{\boldsymbol{\beta}}=(0.3186,0.0748,0.5351,0.5384,0.0733,-0.0548,-0.2498,0.2082,0.3812,-0.0534,0.2332)^{\top}.

As a result, we establish the following polynomial regression model

Y=\theta_{1}+\theta_{2}(\boldsymbol{\beta}^{\top}\boldsymbol{X})+\theta_{3}(\boldsymbol{\beta}^{\top}\boldsymbol{X})^{2}+\varepsilon. \quad (4.1)

The value of the test statistic is $Cost_n=0.7150$, with a $p$-value of 0.4746, indicating that model (4.1) may be more appropriate for the CSM data set. We also plot the residuals against the fitted responses in Figure 2 (b); the residuals show no clear nonlinear pattern. As this model has a dimension reduction structure, Tan and Zhu (2022)'s test also supports this modeling.
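To make the modeling step concrete, the following Python sketch (our own illustration, not the authors' code) fits model (4.1) by ordinary least squares given the CSE direction reported above; the arrays X (187 by 11) and y are hypothetical placeholders for the cleaned CSM data.

```python
import numpy as np

# A minimal sketch, assuming the cleaned CSM data are already loaded
# into X (187 x 11 predictor matrix) and y (Gross Income vector).
beta_hat = np.array([0.3186, 0.0748, 0.5351, 0.5384, 0.0733, -0.0548,
                     -0.2498, 0.2082, 0.3812, -0.0534, 0.2332])

def fit_model_41(X, y, beta):
    """OLS fit of Y = theta1 + theta2*(beta'X) + theta3*(beta'X)^2 + eps."""
    index = X @ beta                                    # projected predictor beta'X
    design = np.column_stack([np.ones_like(index), index, index ** 2])
    theta, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ theta                      # for the plot in Figure 2(b)
    return theta, residuals
```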

Figure 2: (a) Scatter plot of residuals generated from the linear regression model versus the fitted values; (b) scatter plot of residuals generated from model (4.1) versus the fitted values.

5 Discussions

This paper develops a novel test statistic for checking general parametric regression models in high-dimensional scenarios. By combining a sample-splitting strategy with a conditional studentization approach, the proposed test attains a normal limiting null distribution. It does not depend on the dimension reduction model structures that existing tests critically rely on. Moreover, our method is easy to implement and does not need a resampling approximation to determine critical values. The simulation results also show that our test, in many cases, maintains the significance level and has good power performance. Thus, this research could be a useful contribution to the field. Further, as a generic methodology, it could be applied to other model-checking problems.
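To convey the structure of the approach, the following Python sketch illustrates a generic split-sample, conditionally studentized statistic built from a weighted residual sum. It is our own schematic under simplifying assumptions (residuals eps_hat from the parametric fit and a Gaussian-type weight as in Remark 6 below), not the exact definition of $Cost_n$ given in the paper.

```python
import numpy as np

def gaussian_weight(A, B):
    """Gaussian-type weight scaled by q, as in Remark 6 (illustrative)."""
    q = A.shape[1]
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * q))

def cost_sketch(eps_hat, X, n1, weight=gaussian_weight):
    """Schematic split-sample, conditionally studentized statistic."""
    n = len(eps_hat)
    S1, S2 = np.arange(n1), np.arange(n1, n)        # sample split
    W = weight(X[S1], X[S2])                         # n1 x (n - n1) weights
    v = W @ eps_hat[S2]                              # v_i = sum_j eps_j * w_ij
    t = eps_hat[S1] * v                              # conditionally i.i.d. given subset 2
    # studentize by the conditional second moment given the second subset
    return t.sum() / np.sqrt(n1 * np.mean(t ** 2))   # approx N(0,1) under H0
```

Conditionally on the second subset, the summands t are i.i.d. with mean close to zero under the null, so a central limit theorem, together with a Berry-Esseen bound (cf. Condition 9 below), underlies the standard normal limit.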

The sample-splitting technique also brings some limitations. The main shortcoming is that the test statistic converges to its weak limit at the rate $1/\sqrt{n_1}$ rather than $1/\sqrt{n}$, which causes some loss of power. Thus, the sample size should not be too small; otherwise, this methodology may not work well. We note that the commonly used cross-fitting idea often enhances power in other testing problems, but the following observations make us hesitant to adopt it here. Although the studentization approach uses as denominator the conditional variance given the second subset of the data, the test involves all datum points even in the conditional variance estimation. If we constructed another conditionally studentized test given the first subset of the data, the numerators of the two test statistics would be the same, but the denominators would be highly correlated, and their covariance would be hard, if not impossible, to compute. As the limiting null distribution is intractable in this case, a possible solution is a resampling approximation such as the wild bootstrap. We tried this idea for Model 1 in Study 1 and found that the power can be enhanced, but only slightly; moreover, such a solution gives up the main advantage of our method, namely the tractable limiting null distribution, and may therefore not be worthwhile. Another issue concerns the model dimensions. In our setting, without a sparsity structure, the method can handle cases where the number of parameters diverges at the rate of order $n^{1/3}$, and most likely this rate cannot be improved (see, e.g., Tan and Zhu (2022)). To handle a larger number of parameters, a sparsity assumption on the parameters could be necessary; see Shah and Bühlmann (2018) and Janková et al. (2020), who checked sparse linear and generalized linear models. In those cases, however, the construction of the test statistic may require a penalization method for variable selection; in model settings more general than those of Shah and Bühlmann (2018) and Janková et al. (2020), it is still unclear whether the asymptotic behaviors can be derived and, even if they can, whether the asymptotic distribution-free property still holds. These questions deserve further study. On the other hand, it is of interest that when certain conditions on the regression function hold, the dimension of the predictor vector can be high, even higher than the sample size, although those conditions are stringent in large-$q$ scenarios. The simulations with $q=p^2$ and $q=n$ support this observation, and additional simulations with $q$ larger than $n$ showed a similar phenomenon. This shows the potential of our method to tackle models with higher-dimensional predictor vectors.

6 Regularity Conditions

In these conditions, $\|\cdot\|$ denotes the $L_2$ norm of a vector or a matrix.

Condition 1.

There exists a unique minimizer $\boldsymbol{\theta}^{*}\in\mathbb{R}^{p}$ of the squared loss, lying in the interior of the compact parameter set $\boldsymbol{\Theta}$.

Condition 2.

Denote $\boldsymbol{\theta}=(\theta_{1},\cdots,\theta_{p})^{\top}$. The regression function admits third derivatives with respect to $\boldsymbol{\theta}\in\boldsymbol{\Theta}$. For all $j,k=1,2,\cdots,p$, define

\dot{g}(\boldsymbol{\theta},\boldsymbol{x})=\frac{\partial g(\boldsymbol{\theta},\boldsymbol{x})}{\partial\boldsymbol{\theta}},\quad\ddot{g}(\boldsymbol{\theta},\boldsymbol{x})=\frac{\partial^{2}g(\boldsymbol{\theta},\boldsymbol{x})}{\partial\boldsymbol{\theta}\partial\boldsymbol{\theta}^{\top}},\quad\dot{g}_{j}(\boldsymbol{\theta},\boldsymbol{x})=\frac{\partial g(\boldsymbol{\theta},\boldsymbol{x})}{\partial\theta_{j}}\ \text{ and }\ \ddot{g}_{jk}(\boldsymbol{\theta},\boldsymbol{x})=\frac{\partial^{2}g(\boldsymbol{\theta},\boldsymbol{x})}{\partial\theta_{j}\partial\theta_{k}}.

Let $U(\boldsymbol{\theta}^{*})$ be the subset consisting of all $\boldsymbol{\theta}$ in the interior of $\boldsymbol{\Theta}$ satisfying $\|\boldsymbol{\theta}-\boldsymbol{\theta}^{*}\|\leq r_{0}$, where $r_{0}$ is a positive constant. Further, for all $j,k=1,2,\cdots,p$ and all $\boldsymbol{\theta}\in U(\boldsymbol{\theta}^{*})$, there exists a function $F(\cdot)$ such that for any $\boldsymbol{x}$, $|g(\boldsymbol{\theta},\boldsymbol{x})|\leq F(\boldsymbol{x})$, $|\dot{g}_{j}(\boldsymbol{\theta},\boldsymbol{x})|\leq F(\boldsymbol{x})$ and $|\ddot{g}_{jk}(\boldsymbol{\theta},\boldsymbol{x})|\leq F(\boldsymbol{x})$, with $E\{F^{4}(\boldsymbol{X})\}=O(1)$. The fourth moment $E(\varepsilon^{4})$ of the regression error $\varepsilon$ is bounded.

Condition 3.

For $j=1,2,\cdots,p$, define $\psi_{j}(\boldsymbol{\theta},\boldsymbol{x})=\{m(\boldsymbol{x})-g(\boldsymbol{\theta},\boldsymbol{x})\}\dot{g}_{j}(\boldsymbol{\theta},\boldsymbol{x})$ and $P\ddot{\psi}_{j\boldsymbol{\theta}}=E\{\frac{\partial^{2}\psi_{j}}{\partial\boldsymbol{\theta}\partial\boldsymbol{\theta}^{\top}}(\boldsymbol{\theta},\boldsymbol{X})\}$. Let $\lambda_{i}(P\ddot{\psi}_{j\boldsymbol{\theta}})$ be the $i$-th eigenvalue of $P\ddot{\psi}_{j\boldsymbol{\theta}}$. Assume $\max_{1\leq i,j\leq p}\lambda_{i}(P\ddot{\psi}_{j\boldsymbol{\theta}})\leq c$ for any $\boldsymbol{\theta}\in U(\boldsymbol{\theta}^{*})$, where $c$ is a positive constant free of $n$ and $p$.

For any matrix $\boldsymbol{A}$, let $\lambda_{\min}(\boldsymbol{A})$ and $\lambda_{\max}(\boldsymbol{A})$ be the smallest and the largest eigenvalues of $\boldsymbol{A}$, respectively.

Condition 4.

Define $\boldsymbol{\Sigma}=E\{\dot{g}(\boldsymbol{\theta}^{*},\boldsymbol{X})\dot{g}(\boldsymbol{\theta}^{*},\boldsymbol{X})^{\top}\}-E[\{m(\boldsymbol{X})-g(\boldsymbol{\theta}^{*},\boldsymbol{X})\}\ddot{g}(\boldsymbol{\theta}^{*},\boldsymbol{X})]$. There exist two constants $a$ and $b$, not depending on $n$ and $p$, such that $0<a\leq\lambda_{\min}(\boldsymbol{\Sigma})\leq\lambda_{\max}(\boldsymbol{\Sigma})\leq b<\infty$.

Besides, define $\boldsymbol{\Sigma}_{*}=E\{\dot{g}(\boldsymbol{\theta}^{*},\boldsymbol{X})\dot{g}(\boldsymbol{\theta}^{*},\boldsymbol{X})^{\top}\}$ and $\boldsymbol{\Sigma}_{\varepsilon}=E\{\varepsilon^{2}\dot{g}(\boldsymbol{\theta}^{*},\boldsymbol{X})\dot{g}(\boldsymbol{\theta}^{*},\boldsymbol{X})^{\top}\}$, where $\varepsilon$ is the error term corresponding to $\boldsymbol{X}$ in our regression model. There exist four constants $a^{*}$, $b^{*}$, $a_{\varepsilon}$ and $b_{\varepsilon}$, not depending on $n$ and $p$, such that $0<a^{*}\leq\lambda_{\min}(\boldsymbol{\Sigma}_{*})\leq\lambda_{\max}(\boldsymbol{\Sigma}_{*})\leq b^{*}<\infty$ and $0<a_{\varepsilon}\leq\lambda_{\min}(\boldsymbol{\Sigma}_{\varepsilon})\leq\lambda_{\max}(\boldsymbol{\Sigma}_{\varepsilon})\leq b_{\varepsilon}<\infty$.

Finally, define $\boldsymbol{\Sigma}^{*}_{cov}=E\{\dot{g}(\boldsymbol{\theta}^{*},\boldsymbol{X}_{1})\dot{g}(\boldsymbol{\theta}^{*},\boldsymbol{X}_{2})^{\top}w_{12}\}$ and assume $\|\boldsymbol{\Sigma}^{*}_{cov}\|\leq b_{cov}$, where $b_{cov}$ is a constant free of $n$ and $p$.

Condition 5.

Define $\ddot{\psi}_{jkl}(\boldsymbol{\theta},\boldsymbol{x})=\frac{\partial^{2}\psi_{j}}{\partial\theta_{k}\partial\theta_{l}}(\boldsymbol{\theta},\boldsymbol{x})$ and $g_{jkl}^{(3)}(\boldsymbol{\theta},\boldsymbol{x})=\frac{\partial^{3}g}{\partial\theta_{j}\partial\theta_{k}\partial\theta_{l}}(\boldsymbol{\theta},\boldsymbol{x})$. There exist two measurable functions $F_{1}(\cdot)$ and $F_{2}(\cdot)$ with $E\{F_{1}^{4}(\boldsymbol{X})\}=O(1)$ and $E\{F_{2}^{8}(\boldsymbol{X})\}=O(1)$, satisfying $|\ddot{\psi}_{jkl}(\boldsymbol{\theta}_{1},\boldsymbol{x})-\ddot{\psi}_{jkl}(\boldsymbol{\theta}_{2},\boldsymbol{x})|\leq\sqrt{p}\|\boldsymbol{\theta}_{1}-\boldsymbol{\theta}_{2}\|F_{1}(\boldsymbol{x})$ and $|g_{jkl}^{(3)}(\boldsymbol{\theta}_{1},\boldsymbol{x})-g_{jkl}^{(3)}(\boldsymbol{\theta}_{2},\boldsymbol{x})|\leq\sqrt{p}\|\boldsymbol{\theta}_{1}-\boldsymbol{\theta}_{2}\|F_{2}(\boldsymbol{x})$, where $\boldsymbol{\theta}_{1},\boldsymbol{\theta}_{2}\in U(\boldsymbol{\theta}^{*})$.

Condition 6.

There exists a measurable function $L(\cdot)$ with $E\{L^{4}(\boldsymbol{X})\}=O(1)$, satisfying

\|\dot{g}(\boldsymbol{\theta}_{1},\boldsymbol{x})-\dot{g}(\boldsymbol{\theta}_{2},\boldsymbol{x})\|\leq\sqrt{p}\|\boldsymbol{\theta}_{1}-\boldsymbol{\theta}_{2}\|L(\boldsymbol{x}),

where $\boldsymbol{\theta}_{1},\boldsymbol{\theta}_{2}\in U(\boldsymbol{\theta}^{*})$.

Remark 5.

Conditions 1, 2, 3 and 5 appear in Tan and Zhu (2022) and are commonly assumed in the high-dimensional model checking literature. Condition 4 is similar to the regularity condition (A2) in Tan and Zhu (2022), while we additionally assume that the largest and smallest eigenvalues of the other three matrices are bounded. Condition 6 is a general Lipschitz condition. Conditions 1-6 do not directly constrain the dimension $q$ of the predictor vector $\boldsymbol{X}$; the constraints are hidden in the boundedness of the related functions and their derivatives. Although these conditions can be stringent when $q$ diverges quickly to infinity, some functions still meet the requirements, so the test can work with a high-dimensional predictor vector.

Recall that $W_n(\boldsymbol{x},\boldsymbol{z})$ is a function of $\boldsymbol{x}$ and $\boldsymbol{z}$; where no confusion arises, we write $w_{ij}=W_n(\boldsymbol{X}_i,\boldsymbol{X}_j)$ for simplicity.

Condition 7.

$\inf_{\boldsymbol{x}_{i},\boldsymbol{x}_{j}}W_{n}(\boldsymbol{x}_{i},\boldsymbol{x}_{j})\geq 0$ and $E\{W_{n}^{4}(\boldsymbol{X}_{1},\boldsymbol{X}_{2})\}=E(w_{12}^{4})\leq C_{1}$ for a positive constant $C_{1}$.

Remark 6.

For Condition 7, $w_{12}$, as a weight function of $\boldsymbol{X}_1$ and $\boldsymbol{X}_2$, satisfies $E(w_{12}^{4})\leq C_1$ in many forms, such as $w_{12}=1/\sqrt{\|\boldsymbol{X}_1-\boldsymbol{X}_2\|^{2}/q+1}$ and $w_{12}=\exp(-\|\boldsymbol{X}_1-\boldsymbol{X}_2\|^{2}/2q)$; in both cases, $w_{12}\leq 1$ always holds. For a form like $w_{12}=\sum_{k=1}^{q}\exp\{-(X_{1k}-X_{2k})^{2}/2\}$, although $E(w_{12}^{4})$ may diverge to infinity as $q\to\infty$, we can rescale by $q$ and use $w_{12}^{*}=w_{12}/q$ in place of $w_{12}$; then $E({w_{12}^{*}}^{4})\leq 1$. In fact, under Conditions 4 and 7, we can infer that $\|E\{\dot{g}(\boldsymbol{\theta}^{*},\boldsymbol{X}_{1})\}\|$ and $\|E_{\boldsymbol{X}_{1}}\{\dot{g}(\boldsymbol{\theta}^{*},\boldsymbol{X}_{2})w_{12}\}\|$ are bounded. In the Supplementary Material, we give more details showing that this condition is satisfied.
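As a quick numerical check of Condition 7 for the two weight forms above, the following Python sketch (under the purely illustrative assumption of standard normal predictors) verifies empirically that $E(w_{12}^{4})$ stays bounded as $q$ grows.

```python
import numpy as np

# Empirical check that E(w_12^4) stays bounded for the two weight forms
# in Remark 6; standard normal predictors are an illustrative assumption.
rng = np.random.default_rng(0)
for q in (10, 100, 1000):
    X1, X2 = rng.standard_normal((2, 2000, q))
    d2 = ((X1 - X2) ** 2).sum(axis=1)            # ||X_1 - X_2||^2
    w_a = 1.0 / np.sqrt(d2 / q + 1.0)            # first form, always <= 1
    w_b = np.exp(-d2 / (2.0 * q))                # second form, always <= 1
    print(q, (w_a ** 4).mean(), (w_b ** 4).mean())
```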

Condition 8.

$\max_{1\leq k,l\leq p}E\{\varepsilon_{1}^{4}\dot{g}_{k}^{2}(\boldsymbol{\theta}_{0},\boldsymbol{X}_{1})\dot{g}_{l}^{2}(\boldsymbol{\theta}_{0},\boldsymbol{X}_{1})\}<\infty$.

Define

w^{\prime}_{ij}=w_{ij}-\dot{g}(\boldsymbol{\theta}^{*},\boldsymbol{X}_{j})^{\top}\boldsymbol{\Sigma}^{-1}E_{\boldsymbol{X}_{i}}\{\dot{g}(\boldsymbol{\theta}^{*},\boldsymbol{X}_{j})w_{ij}\}-\dot{g}(\boldsymbol{\theta}^{*},\boldsymbol{X}_{i})^{\top}\boldsymbol{\Sigma}^{-1}E_{\boldsymbol{X}_{j}}\{\dot{g}(\boldsymbol{\theta}^{*},\boldsymbol{X}_{i})w_{ij}\}+\dot{g}(\boldsymbol{\theta}^{*},\boldsymbol{X}_{i})^{\top}\boldsymbol{\Sigma}^{-1}\boldsymbol{\Sigma}_{cov}^{*}\boldsymbol{\Sigma}^{-1}\dot{g}(\boldsymbol{\theta}^{*},\boldsymbol{X}_{j}). \quad (6.1)
Condition 9.

$E(\varepsilon_{1}^{2}\varepsilon_{2}^{2}w_{12}^{\prime 2})\geq C_{3}$ and $E(\varepsilon_{1}^{4}\varepsilon_{2}^{4}w_{12}^{\prime 4})\leq C_{4}$, where $C_{3}$ and $C_{4}$ are two positive constants. Besides, $E[\{\dot{g}(\boldsymbol{\theta}_{0},\boldsymbol{X}_{1})^{\top}\boldsymbol{\Sigma}^{-1}\boldsymbol{\Sigma}_{cov}^{*}\boldsymbol{\Sigma}^{-1}\dot{g}(\boldsymbol{\theta}_{0},\boldsymbol{X}_{2})\}^{4}]=O(1)$.

Remark 7.

Condition 9 is a sufficient condition for the Berry-Esseen bound, which is essential to ensure the asymptotic normality of our test statistic under the null hypothesis. To interpret $w^{\prime}_{ij}$, note that the linear projection of $w_{ij}$ on $\dot{g}(\boldsymbol{\theta}^{*},\boldsymbol{X}_{i})$ is $\dot{g}(\boldsymbol{\theta}^{*},\boldsymbol{X}_{i})^{\top}\boldsymbol{\Sigma}^{-1}E_{\boldsymbol{X}_{j}}\{\dot{g}(\boldsymbol{\theta}^{*},\boldsymbol{X}_{i})w_{ij}\}$, and the linear projection of the remainder term, $w_{ij}-\dot{g}(\boldsymbol{\theta}^{*},\boldsymbol{X}_{i})^{\top}\boldsymbol{\Sigma}^{-1}E_{\boldsymbol{X}_{j}}\{\dot{g}(\boldsymbol{\theta}^{*},\boldsymbol{X}_{i})w_{ij}\}$, on $\dot{g}(\boldsymbol{\theta}^{*},\boldsymbol{X}_{j})$ is $\dot{g}(\boldsymbol{\theta}^{*},\boldsymbol{X}_{j})^{\top}\boldsymbol{\Sigma}^{-1}E_{\boldsymbol{X}_{i}}\{\dot{g}(\boldsymbol{\theta}^{*},\boldsymbol{X}_{j})w_{ij}\}-\dot{g}(\boldsymbol{\theta}^{*},\boldsymbol{X}_{i})^{\top}\boldsymbol{\Sigma}^{-1}E\{\dot{g}(\boldsymbol{\theta}^{*},\boldsymbol{X}_{1})\dot{g}(\boldsymbol{\theta}^{*},\boldsymbol{X}_{2})^{\top}w_{12}\}\boldsymbol{\Sigma}^{-1}\dot{g}(\boldsymbol{\theta}^{*},\boldsymbol{X}_{j})$. Hence $w^{\prime}_{ij}$ is the remainder after projecting $w_{ij}$ on both $\dot{g}(\boldsymbol{\theta}^{*},\boldsymbol{X}_{i})$ and $\dot{g}(\boldsymbol{\theta}^{*},\boldsymbol{X}_{j})$; in other words, $w^{\prime}_{ij}$ is essentially the residual when $w_{ij}$ is predicted linearly by $\dot{g}(\boldsymbol{\theta}^{*},\boldsymbol{X}_{i})$ and $\dot{g}(\boldsymbol{\theta}^{*},\boldsymbol{X}_{j})$. Since the structures of $w_{ij}$ and $\dot{g}(\boldsymbol{\theta}^{*},\boldsymbol{X}_{i})$ are different, it is reasonable to assume that $E(\varepsilon_{1}^{2}\varepsilon_{2}^{2}w_{12}^{\prime 2})$ has a lower bound as $p$ goes to infinity; since $w_{ij}$ is bounded, it is also reasonable to assume the boundedness of $E(\varepsilon_{1}^{4}\varepsilon_{2}^{4}w_{12}^{\prime 4})$.
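To illustrate (6.1), the following Python sketch approximates the projection terms by sample averages in a toy case of our own choosing: a linear regression function so that $\dot{g}(\boldsymbol{\theta}^{*},\boldsymbol{x})=\boldsymbol{x}$, with standard normal predictors so that $\boldsymbol{\Sigma}\approx\boldsymbol{I}_{q}$; the conditional expectations are replaced by empirical averages.

```python
import numpy as np

# Toy illustration of the projection residual w'_ij in (6.1), under our
# own simplifying assumptions: g(theta, x) = theta'x so g_dot = x,
# X ~ N(0, I_q), hence Sigma ~ I_q; conditional expectations are
# approximated by sample averages (the i = j terms are kept for brevity).
rng = np.random.default_rng(1)
n, q = 1000, 5
X = rng.standard_normal((n, q))
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
W = np.exp(-d2 / (2.0 * q))            # w_ij = exp(-||X_i - X_j||^2 / 2q)

a = W @ X / n                          # row i ~ E_{X_i}{ g_dot(X_j) w_ij }
b = W.T @ X / n                        # row j ~ E_{X_j}{ g_dot(X_i) w_ij }
Sigma_cov = X.T @ W @ X / n**2         # ~ Sigma*_cov = E{ g_dot(X_1) g_dot(X_2)' w_12 }
W_prime = W - a @ X.T - X @ b.T + X @ Sigma_cov @ X.T   # equation (6.1)
```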

Define $\boldsymbol{\Sigma}_{l}=E\{l^{2}(\boldsymbol{X})\dot{g}(\boldsymbol{\theta}_{0},\boldsymbol{X})\dot{g}(\boldsymbol{\theta}_{0},\boldsymbol{X})^{\top}\}$. Under the global alternative hypothesis, $l(\boldsymbol{X})=m(\boldsymbol{X})-g(\boldsymbol{\theta}^{*},\boldsymbol{X})$.

Condition 10.

$E\{l^{4}(\boldsymbol{X})\}=O(1)$. Further, there are two constants $a_{l}$ and $b_{l}$, not depending on $n$ and $p$, such that $0<a_{l}\leq\lambda_{\min}(\boldsymbol{\Sigma}_{l})\leq\lambda_{\max}(\boldsymbol{\Sigma}_{l})\leq b_{l}<\infty$.

Recall the sequence of local alternatives, $H_{1n}:Y=g(\boldsymbol{\theta}_{0},\boldsymbol{X})+\delta_{n}l(\boldsymbol{X})+\varepsilon$, and the definition of $H(\boldsymbol{X}_{1})$ in Subsection 3.2.

Condition 11.

$H(\boldsymbol{X}_{1})\neq 0$ a.s., $E\{|\varepsilon_{1}H(\boldsymbol{X}_{1})|^{2}\}\geq C_{5}>0$, and $E\{|\varepsilon_{1}H(\boldsymbol{X}_{1})|^{4}\}\leq C_{6}<\infty$, where $C_{5}$ and $C_{6}$ are two constants unrelated to $n$ and $p$.

Remark 8.

Conditions 10 and 11 are used to derive the asymptotic distribution of our test statistic under the alternative hypotheses. They are not necessary for the test to have high power under the alternatives; on the contrary, if we imposed no conditions on the moments of $l(\boldsymbol{X})$ and $H(\boldsymbol{X})$, or no upper bound on the eigenvalues of $\boldsymbol{\Sigma}_{l}$, the test statistic could diverge at a faster rate and thus have even higher power. On the other hand, when studying the behavior under the local alternatives, these conditions make the investigation of the limiting properties of the test statistic easier and the resulting statements more transparent.

SUPPLEMENTARY MATERIAL

Supplement to "The conditionally studentized test for high-dimensional parametric regressions": technical proofs of the theorems. (.pdf file)

References

  • Ahmed etΒ al. [2015] M.Β Ahmed, M.Β Jahangir, H.Β Afzal, A.Β Majeed, and I.Β Siddiqi. Using crowd-source based features from social media and conventional features to predict the movies popularity. In 2015 IEEE International Conference on Smart City/SocialCom/SustainCom (SmartCity), pages 273–278. IEEE, 2015.
  • Bierens [1982] H.Β J. Bierens. Consistent model specification tests. Journal of Econometrics, 20:105–134, 1982.
  • Escanciano [2006] J.Β C. Escanciano. A consistent diagnostic test for regression models using projections. Econometric Theory, 22:1030–1051, 2006.
  • Escanciano [2009] J.Β C. Escanciano. On the lack of power of omnibus specification tests. Econometric Theory, 25:162–194, 2009.
  • Guo etΒ al. [2016] X.Β Guo, T.Β Wang, and L.Β Zhu. Model checking for parametric single-index models: a dimension reduction model-adaptive approach. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 78:1013–1035, 2016.
  • HΓ€rdle and Mammen [1993] W.Β HΓ€rdle and E.Β Mammen. Comparing nonparametric versus parametric regression fits. The Annals of Statistics, 21:1926–1947, 1993.
  • JankovΓ‘ etΒ al. [2020] J.Β JankovΓ‘, R.Β D. Shah, P.Β BΓΌhlmann, and R.Β J. Samworth. Goodness-of-fit testing in high dimensional generalized linear models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 82:773–795, 2020.
  • Lavergne and Patilea [2008] P.Β Lavergne and V.Β Patilea. Breaking the curse of dimensionality in nonparametric testing. Journal of Econometrics, 143:103–122, 2008.
  • Lavergne and Patilea [2012] P. Lavergne and V. Patilea. One for all and all for one: regression checks with many regressors. Journal of Business & Economic Statistics, 30:41–52, 2012.
  • Li etΒ al. [2019] L.Β Li, S.Β N. Chiu, and L.Β Zhu. Model checking for regressions: An approach bridging between local smoothing and global smoothing methods. Computational Statistics & Data Analysis, 138:64–82, 2019.
  • Shah and BΓΌhlmann [2018] R.Β D. Shah and P.Β BΓΌhlmann. Goodness-of-fit tests for high dimensional linear models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80:113–135, 2018.
  • Stute and Zhu [2002] W.Β Stute and L.-X. Zhu. Model checks for generalized linear models. Scandinavian Journal of Statistics, 29:535–545, 2002.
  • Stute etΒ al. [1998a] W.Β Stute, W.Β G. Manteiga, and M.Β P. Quindimil. Bootstrap approximations in model checks for regression. Journal of the American Statistical Association, 93:141–149, 1998a.
  • Stute etΒ al. [1998b] W.Β Stute, S.Β Thies, and L.-X. Zhu. Model checks for regression: an innovation process approach. The Annals of Statistics, 26:1916–1934, 1998b.
  • Stute etΒ al. [2008] W.Β Stute, W.Β Xu, and L.Β Zhu. Model diagnosis for parametric regression in high-dimensional spaces. Biometrika, 95:451–467, 2008.
  • Tan and Zhu [2019] F.Β Tan and L.Β Zhu. Adaptive-to-model checking for regressions with diverging number of predictors. The Annals of Statistics, 47:1960–1994, 2019.
  • Tan and Zhu [2022] F.Β Tan and L.Β Zhu. Integrated conditional moment test and beyond: when the number of covariates is divergent. Biometrika, 109:103–122, 2022.
  • Zheng [1996] J.Β X. Zheng. A consistent test of functional form via nonparametric estimation techniques. Journal of Econometrics, 75:263–289, 1996.
  • Zhu and Li [1998] L.Β Zhu and R.Β Li. Dimension-reduction type test for linearity of a stochastic regression model. Acta Mathematicae Applicatae Sinica, 14:165–175, 1998.
  • Zhu [2003] L.-X. Zhu. Model checking of dimension-reduction type for regression. Statistica Sinica, 13:283–296, 2003.