
High-dimensional inference for single-index model with latent factors

Yanmei Shi1, Meiling Hao2*, Yanlin Tang3, Heng Lian4 and Xu Guo1*

*Corresponding authors: Meiling Hao and Xu Guo; email: [email protected] and [email protected]
 
1 School of Statistics, Beijing Normal University, Beijing, China
2 School of Statistics, University of International Business and Economics, Beijing, China
3 Key Laboratory of Advanced Theory and Application in Statistics and Data Science–MOE, School of Statistics, East China Normal University, Shanghai, China
4 Department of Mathematics, City University of Hong Kong, Hong Kong, China
Abstract

Models with latent factors have recently attracted considerable attention. However, most investigations focus on linear regression models and thus cannot capture nonlinearity. To address this issue, we propose a novel Factor Augmented Single-Index Model. We first address the concern of whether it is necessary to consider the augmented part by introducing a score-type test statistic. Compared with previous test statistics, ours requires estimating neither the high-dimensional regression coefficients nor the high-dimensional precision matrix, making it simpler to implement. We also propose a Gaussian multiplier bootstrap to determine the critical value. The validity of our procedure is theoretically established under suitable conditions. We further investigate the penalized estimation of the regression model. With estimated latent factors, we establish the error bounds of the estimators. Lastly, we introduce a debiased estimator and construct confidence intervals for individual coefficients based on asymptotic normality. No moment condition is imposed on the error term, so our procedures work well when the random error follows heavy-tailed distributions or when outliers are present. We demonstrate the finite sample performance of the proposed method through comprehensive numerical studies and an application to the FRED-MD macroeconomic dataset.

Keywords: Factor augmented regression, latent factor regression, robustness, single-index model, score-type test statistic

1 Introduction

The rapid advancement of information technology has brought significant changes in both data collection and analysis. Various disciplines, such as economics, social sciences, and genetics, increasingly collect high-dimensional data for comprehensive research and analysis (Belloni et al., 2012; Bühlmann et al., 2014; Fan et al., 2020). Most existing high-dimensional procedures (for a recent comprehensive review, see for instance Fan et al., (2020)) were developed under the assumption that there are no latent factors associated with both the response variables and covariates. However, this assumption is often violated in the real world. For instance, in genetic studies, the influence of specific DNA segments on gene expression can be confounded by population structure and artifacts in microarray expression (Listgarten et al., 2010). Similarly, in healthcare research, disentangling the impact of nutrient intake on cancer risk can be confounded by factors such as physical health, social class, and behavioral aspects (Fewell et al., 2007). Failure to account for latent factors in the model may lead to biased inferences. These examples underscore the importance of considering latent factors within models to ensure accurate and reliable conclusions.

To tackle the issues posed by latent factors, Fan et al., (2024) proposed the following Factor Augmented sparse linear Regression Model (FARM):

Y = \bm{f}^{\top}\bm{\gamma} + \bm{u}^{\top}\bm{\beta} + \varepsilon,   (1.1)
\text{with}\ \ \bm{x} = \bm{B}\bm{f} + \bm{u}.   (1.2)

Here, $Y$ is a response variable, $\bm{x}$ is a $p$-dimensional covariate vector, $\bm{f}$ is a $K$-dimensional vector of latent factors, $\bm{B}\in\mathbb{R}^{p\times K}$ is the corresponding factor loading matrix, and $\bm{u}$ is a $p$-dimensional vector of idiosyncratic components, which is uncorrelated with $\bm{f}$. The vectors $\bm{\beta}=(\beta_{1},\ldots,\beta_{p})^{\top}\in\mathbb{R}^{p}$ and $\bm{\gamma}=(\gamma_{1},\ldots,\gamma_{K})^{\top}\in\mathbb{R}^{K}$ are regression parameters quantifying the contributions of $\bm{u}$ and $\bm{f}$, respectively. The random error $\varepsilon$ satisfies $\mathbb{E}(\varepsilon)=0$ and is independent of $\bm{u}$ and $\bm{f}$. In fact, model (1.2) is a commonly used structure for characterizing the interdependence among features: the variables are intercorrelated through a shared set of latent factors.

Numerous methodologies have been proposed to enable statistical analysis of models with latent factors. Guo et al., (2022) introduced a deconfounding approach for conducting statistical inference on individual regression coefficients $\beta_{j}$, $j=1,\ldots,p$, integrating the findings of Ćevid et al., (2020) with the debiased Lasso method (Zhang and Zhang, 2014; Van de Geer et al., 2014). Ouyang et al., (2023) investigated the inference problem of generalized linear regression models (GLM) with latent factors. Sun et al., (2023) considered the multiple testing problem for GLM with latent factors. Bing et al., (2023) focused on inferring high-dimensional multivariate response regression models with latent factors.

Although the above FARM is powerful in dealing with latent factors, it may not be flexible enough to handle a nonlinear relationship between the covariates and the response. To capture nonlinearity, the single-index model (SIM) is usually adopted due to its flexibility and interpretability. As a result, the SIM has been the subject of extensive attention and in-depth research in the past decade. Several approaches have been explored for the variable selection problem in high-dimensional SIMs; examples include Kong and Xia, (2007), Zhu and Zhu, (2009), Wang et al., (2012), Radchenko, (2015), Plan and Vershynin, (2016), and Rejchel and Bogdan, (2020). However, limited attention has been paid to SIMs with latent factors. This motivates us to investigate the high-dimensional single-index model with latent factors.

To this end, we consider the following Factor Augmented sparse Single Index Model (FASIM), which integrates both the latent factors and the covariates,

Y = g(\bm{f}^{\top}\bm{\gamma} + \bm{u}^{\top}\bm{\beta}, \varepsilon),
\text{with}\ \ \bm{x} = \bm{B}\bm{f} + \bm{u}.   (1.3)

Here the link function $g(\cdot,\cdot)$ is unknown, and the other variables and parameters in model (1.3) remain as defined in models (1.1) and (1.2). When $g(a,b)=a+b$, the above FASIM reduces to the FARM introduced by Fan et al., (2024). Since $g(\cdot,\cdot)$ can be unknown, the FASIM is very flexible and can capture nonlinearity.

For the FASIM, the first concern is whether $\bm{\beta}$ is zero or not. If $\bm{\beta}=\bm{0}$, the model reduces to a single-index factor regression model, which was also considered by Fan et al., (2017), Jiang et al., (2019), and Luo et al., (2022). For this important problem, Fan et al., (2024) considered the maximum of the debiased lasso estimator of $\bm{\beta}$ under FARM. However, their procedure is computationally expensive since it requires estimating both high-dimensional regression coefficients and a high-dimensional precision matrix. In this paper, we first introduce a score-type test statistic, which needs to estimate neither high-dimensional regression coefficients nor a high-dimensional precision matrix. Hence, our procedure is very simple to implement. We also propose a Gaussian multiplier bootstrap to determine our test statistic's critical value. The validity of our procedure is theoretically established under suitable conditions. We also give a power analysis for our test procedure.

When the FASIM is adequate, it is then of importance to estimate its parameters. For the SIM, Rejchel and Bogdan, (2020) considered a distribution function transformation of the responses, and then estimated the unknown parameters using the Lasso method. Similar procedures have also been investigated by Zhu and Zhu, (2009) and Wang et al., (2012). This transformation enhances the model's robustness in scenarios where random errors follow heavy-tailed distributions or when outliers are present. Motivated by their procedures, we also consider the distribution function transformation of the responses and then introduce penalized estimation of the unknown parameters. It should be emphasized, however, that unlike Zhu and Zhu, (2009), Wang et al., (2012) and Rejchel and Bogdan, (2020), in our model $\bm{f}$ and $\bm{u}$ are unobserved and must be estimated first. This introduces additional technical difficulty. In this paper, we establish the estimation error bounds of our penalized estimators under mild assumptions. Notably, no moment condition is required for the error term in the FASIM.

Lastly, we investigate the construction of confidence intervals for each regression coefficient in the FASIM. Due to its inherent bias, the penalized estimator cannot be directly used for statistical inference. Eftekhari et al., (2021) investigated the inference problem of the SIM by adopting the debiasing technique (Zhang and Zhang, 2014; Van de Geer et al., 2014). However, their procedure cannot handle latent factors. To this end, we introduce a debiased estimator for the FASIM and establish its corresponding asymptotic normality. Compared with Eftekhari et al., (2021), our procedure does not need sample-splitting and is robust to outliers.

The remainder of this paper is structured as follows. In Section 2, we delve into the reformulation of the FASIM and the estimation of the latent factors. In Section 3, we develop a powerful procedure for testing whether $\bm{\beta}=\bm{0}$. In Section 4, we consider the regularized estimation of $\bm{\beta}$ and establish the $\ell_1$ and $\ell_2$-estimation error bounds for this estimator. Further, we introduce a debiased estimator and construct confidence intervals for each coefficient. We present the findings of our simulation studies in Section 5 and provide an analysis of real data in Section 6 to assess the performance and effectiveness of the proposed approach. Conclusions and discussions are presented in Section 7. Proofs of the main theorems are provided in the Appendix. Proofs of related technical lemmas are given in the Supplementary Material.

Notation. Let $\mathbb{I}(\cdot)$ denote the indicator function. For a vector $\bm{a}=(a_{1},\ldots,a_{m})^{\top}\in\mathbb{R}^{m}$, we denote its $\ell_q$ norm as $\|\bm{a}\|_{q}=(\sum_{\ell=1}^{m}|a_{\ell}|^{q})^{1/q}$ for $1\leq q<\infty$, $\|\bm{a}\|_{\infty}=\max_{1\leq\ell\leq m}|a_{\ell}|$, and $\|\bm{a}\|_{0}=\sum_{\ell=1}^{m}\mathbb{I}(a_{\ell}\neq 0)$. For any integer $m$, we define $[m]=\{1,\ldots,m\}$. The Orlicz norm of a scalar random variable $X$ is defined as $\|X\|_{\psi_{2}}=\inf\{c>0:\mathbb{E}\exp(X^{2}/c^{2})\leq 2\}$. For a random vector $\bm{x}\in\mathbb{R}^{m}$, we define its Orlicz norm as $\|\bm{x}\|_{\psi_{2}}=\sup_{\|\bm{c}\|_{2}=1}\|\bm{c}^{\top}\bm{x}\|_{\psi_{2}}$. Furthermore, we use $\bm{I}_{K}$, $\bm{1}_{K}$ and $\bm{0}_{K}$ to denote the identity matrix in $\mathbb{R}^{K\times K}$, the $K$-dimensional vector with all elements equal to $1$, and the $K$-dimensional vector with all elements equal to $0$, respectively. For a matrix $\bm{A}=(A_{jk})$, $\bm{a}_{i}$ represents the $i$-th row of $\bm{A}$, and $\bm{A}_{j}$ represents the $j$-th column of $\bm{A}$. We define $\|\bm{A}\|_{\mathbb{F}}=\sqrt{\sum_{jk}A_{jk}^{2}}$, $\|\bm{A}\|_{\max}=\max_{j,k}|A_{jk}|$, $\|\bm{A}\|_{\infty}=\max_{j}\sum_{k}|A_{jk}|$, $\|\bm{A}\|_{1}=\max_{k}\sum_{j}|A_{jk}|$ and $\|\bm{A}\|_{\text{sum}}=\sum_{j,k}|A_{jk}|$ to be its Frobenius norm, element-wise max-norm, matrix $\ell_{\infty}$-norm, matrix $\ell_{1}$-norm and element-wise sum-norm, respectively. Besides, we use $\lambda_{\min}(\bm{A})$ and $\lambda_{\max}(\bm{A})$ to denote the minimal and maximal eigenvalues of $\bm{A}$, respectively. We use $|\mathcal{A}|$ to denote the cardinality of a set $\mathcal{A}$. For two positive sequences $\{a_{n}\}_{n\geq 1}$, $\{b_{n}\}_{n\geq 1}$, we write $a_{n}=O(b_{n})$ if there exists a positive constant $C$ such that $a_{n}\leq C b_{n}$, and we write $a_{n}=o(b_{n})$ if $a_{n}/b_{n}\rightarrow 0$. Furthermore, if $a_{n}=O(b_{n})$ is satisfied, we write $a_{n}\lesssim b_{n}$. If $a_{n}\lesssim b_{n}$ and $b_{n}\lesssim a_{n}$, we write $a_{n}\asymp b_{n}$ for short. In addition, $a_{n}=O_{\mathbb{P}}(b_{n})$ and $a_{n}=o_{\mathbb{P}}(b_{n})$ have similar meanings except that the relation for $a_{n}/b_{n}$ holds with high probability. The parameters $c,c_{0},C,C_{1},C_{2}$ and $K'$ appearing in this paper are all positive constants.

2 Factor augmented single-index model

In this section, we investigate the reformulation of the FASIM and the estimation of the latent factors, as well as their properties.

2.1 Reformulation of FASIM

The unknown smooth function $g(\cdot,\cdot)$ makes the estimation of the FASIM challenging. To address this concern, we reformulate the FASIM as a linear regression model with a transformed response. With this transformation, we can estimate the parameters without estimating the unknown function $g(\cdot,\cdot)$.

Without loss of generality, we assume that $\mathbb{E}(\bm{x})=\bm{0}$ and that $\bm{\Sigma}=\mathrm{Cov}(\bm{x})$ is a positive definite matrix. Let $\bm{\sigma}_{\bm{u}h}=\mathrm{Cov}\{\bm{u},h(Y)\}$ and $\bm{\sigma}_{\bm{f}h}=\mathrm{Cov}\{\bm{f},h(Y)\}$ for a given transformation function $h(\cdot)$ of the response. Define $\bm{\beta}_{h}=\mathbb{E}(\bm{u}\bm{u}^{\top})^{-1}\bm{\sigma}_{\bm{u}h}$, $\bm{\gamma}_{h}=\mathbb{E}(\bm{f}\bm{f}^{\top})^{-1}\bm{\sigma}_{\bm{f}h}$, $\bm{\eta}_{h}=(\bm{\beta}_{h}^{\top},\bm{\gamma}_{h}^{\top})^{\top}$, $\bm{\eta}=(\bm{\beta}^{\top},\bm{\gamma}^{\top})^{\top}$ and $\bm{v}=(\bm{u}^{\top},\bm{f}^{\top})^{\top}$. In the SIM framework, the following linear expectation condition is commonly assumed.

Proposition 2.1.

Assume that $\mathbb{E}(\bm{v}\mid\bm{v}^{\top}\bm{\eta})$ is a linear function of $\bm{v}^{\top}\bm{\eta}$. Then $\bm{\eta}_{h}$ is proportional to $\bm{\eta}$, that is, $\bm{\eta}_{h}=\kappa_{h}\bm{\eta}$ for some constant $\kappa_{h}$.

Proposition 2.1 follows from Theorem 2.1 of Li and Duan, (1989). The condition in this proposition is referred to as the linearity condition (LC) for predictors. It is satisfied when $\bm{v}$ follows an elliptical distribution, a commonly used assumption in the sufficient dimension reduction literature (Li, 1991; Cook and Ni, 2005). Hall and Li, (1993) showed that the LC holds approximately in many high-dimensional settings. Throughout the paper, we assume $\kappa_{h}\neq 0$, which is a relatively mild assumption. In fact, when $h(\cdot)$ is monotone and $g(\cdot,\cdot)$ is monotonic in its first argument, this assumption is automatically satisfied.

From the definitions of $\bm{\beta}_{h}$ and $\bm{\gamma}_{h}$, our model can be rewritten as:

h(Y) = \bm{x}^{\top}\bm{\beta}_{h} + \bm{f}^{\top}\bm{\varphi}_{h} + e_{h},   (2.1)

where $\bm{\varphi}_{h}=\bm{\gamma}_{h}-\bm{B}^{\top}\bm{\beta}_{h}$, and the error term $e_{h}$ satisfies $\mathbb{E}(e_{h})=0$, $\mathbb{E}(e_{h}\bm{u})=\bm{0}_{p}$ and $\mathbb{E}(e_{h}\bm{f})=\bm{0}_{K}$. This means we have recast the FASIM as a FARM with a transformed response. For identification of the FASIM, it is usually assumed that $\|\bm{\eta}\|_{2}=1$ and that its first element is positive; thus estimating the direction of $\bm{\eta}$ is sufficient for the FASIM. The proportionality between $\bm{\eta}$ and $\bm{\eta}_{h}$, along with the transformed linear regression model (2.1), reduces the difficulty of the analysis.

In practice, we need to choose an appropriate transformation function $h(\cdot)$. It is worth mentioning that the procedure in Eftekhari et al., (2021) essentially works with $h(Y)=Y$. Motivated by Zhu and Zhu, (2009) and Rejchel and Bogdan, (2020), we set $h(Y)=F(Y)-1/2$, where $F(\cdot)$ is the distribution function of $Y$. This specific choice makes our procedures robust against outliers and heavy tails. Indeed, with equation (2.1) and the widely imposed sub-Gaussian assumption on the predictors, the boundedness of $F(Y)-1/2$ ensures that the transformed error term $e_{h}$ is sub-Gaussian, even if the original error term $\varepsilon$ follows a Cauchy distribution. The response-distribution transformation is also preferred for additional reasons discussed later.
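In sample form, this transform is nothing more than a centered rank. Below is a minimal sketch in Python/NumPy (our illustration language throughout; the function name is ours, and ties are ignored):

```python
import numpy as np

def ecdf_transform(y):
    """Response-distribution transform h(Y_i) = F_n(Y_i) - 1/2.

    F_n(Y_i) = rank(Y_i)/n, so the transformed responses lie in
    (-1/2, 1/2] no matter how heavy-tailed Y is.
    """
    n = len(y)
    ranks = np.argsort(np.argsort(y)) + 1  # rank of Y_i among Y_1, ..., Y_n
    return ranks / n - 0.5
```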

2.2 Factor estimation

Throughout the paper, we assume that the data $\{\bm{x}_{i},\bm{f}_{i},Y_{i},e_{hi}\}_{i=1}^{n}$ are independent and identically distributed (i.i.d.) copies of $\{\bm{x},\bm{f},Y,e_{h}\}$. Let $\bm{X}=(\bm{x}_{1},\ldots,\bm{x}_{n})^{\top}$, $\bm{F}=(\bm{f}_{1},\ldots,\bm{f}_{n})^{\top}$, $\bm{Y}=(Y_{1},\ldots,Y_{n})^{\top}$ and $\bm{e}_{h}=(e_{h1},\ldots,e_{hn})^{\top}$. We consider the high-dimensional scenario where the dimension $p$ of the observed covariate vector $\bm{x}$ can be much larger than the sample size $n$. For the FASIM, we first need to estimate the latent factor vector $\bm{f}$ since only the predictor vector $\bm{x}$ and the response $Y$ are observable. To address this issue, we impose an identifiability assumption similar to that in Fan et al., (2013). That is,

\mathrm{Cov}(\bm{f}) = \bm{I}_{K}, \ \ \text{and}\ \ \bm{B}^{\top}\bm{B}\ \ \text{is diagonal}.

Consequently, the constrained least squares estimator of (𝑭,𝑩)(\bm{F},\bm{B}) based on 𝑿{\bm{X}} is given as

(\widehat{\bm{F}},\widehat{\bm{B}}) = \arg\min_{\bm{F}\in\mathbb{R}^{n\times K},\,\bm{B}\in\mathbb{R}^{p\times K}} \|\bm{X}-\bm{F}\bm{B}^{\top}\|_{\mathbb{F}}^{2},   (2.2)
\text{subject to}\ \ \frac{1}{n}\bm{F}^{\top}\bm{F}=\bm{I}_{K}\ \ \text{and}\ \ \bm{B}^{\top}\bm{B}\ \ \text{diagonal}.

Let $\bm{U}=(\bm{u}_{1},\ldots,\bm{u}_{n})^{\top}$, with $\bm{u}_{i}=\bm{x}_{i}-\bm{B}\bm{f}_{i}$, $i=1,\ldots,n$. Denote $\bm{\Sigma}_{\bm{u}}=\mathrm{Cov}(\bm{u})$.

Elementary manipulation yields that the columns of $\widehat{\bm{F}}/\sqrt{n}$ are the eigenvectors corresponding to the largest $K$ eigenvalues of the matrix $\bm{X}\bm{X}^{\top}$ and $\widehat{\bm{B}}=\bm{X}^{\top}\widehat{\bm{F}}(\widehat{\bm{F}}^{\top}\widehat{\bm{F}})^{-1}=n^{-1}\bm{X}^{\top}\widehat{\bm{F}}$. Then the estimator of $\bm{U}$ is

\widehat{\bm{U}} = \bm{X}-\widehat{\bm{F}}\widehat{\bm{B}}^{\top} = \left(\bm{I}_{n}-\frac{1}{n}\widehat{\bm{F}}\widehat{\bm{F}}^{\top}\right)\bm{X},

see Fan et al., (2013). Since $K$ is related to the number of spiked eigenvalues of $\bm{X}\bm{X}^{\top}$, it is usually small. Therefore, we treat $K$ as a fixed constant, as suggested by Fan et al., (2024). Additionally, let $\widehat{\bm{\Sigma}}$, $\widehat{\bm{\Lambda}}=\mathrm{diag}(\widehat{\lambda}_{1},\ldots,\widehat{\lambda}_{K})$ and $\widehat{\bm{\Gamma}}=(\widehat{\bm{\zeta}}_{1},\ldots,\widehat{\bm{\zeta}}_{K})$ be suitable estimators of the covariance matrix $\bm{\Sigma}$ of $\bm{x}$, the matrix $\bm{\Lambda}=\mathrm{diag}(\lambda_{1},\ldots,\lambda_{K})$ of its leading $K$ eigenvalues, and the matrix $\bm{\Gamma}=(\bm{\zeta}_{1},\ldots,\bm{\zeta}_{K})$ of the corresponding $K$ orthonormalized eigenvectors, respectively.
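The estimators $\widehat{\bm{F}}$, $\widehat{\bm{B}}$ and $\widehat{\bm{U}}$ admit a direct implementation via a single eigen-decomposition of the $n\times n$ Gram matrix; here is a minimal sketch (function and variable names are ours):

```python
import numpy as np

def estimate_factors(X, K):
    """PCA estimators of (F, B, U): the columns of F_hat/sqrt(n) are the top-K
    eigenvectors of X X^T, B_hat = X^T F_hat / n, U_hat = (I - F_hat F_hat^T / n) X."""
    n = X.shape[0]
    eigvals, eigvecs = np.linalg.eigh(X @ X.T)   # eigenvalues in ascending order
    idx = np.argsort(eigvals)[::-1][:K]          # indices of the K largest
    F_hat = np.sqrt(n) * eigvecs[:, idx]         # so that F_hat^T F_hat / n = I_K
    B_hat = X.T @ F_hat / n
    U_hat = X - F_hat @ B_hat.T
    return F_hat, B_hat, U_hat
```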

We proceed by presenting the regularity assumptions imposed in seminal works about factor analysis, such as Bai, (2003) and Fan et al., (2013, 2024).

Assumption 1.

We make the following assumptions.

  • (i)

    There exists a positive constant $c_{0}$ such that $\|\bm{f}\|_{\psi_{2}}\leq c_{0}$ and $\|\bm{u}\|_{\psi_{2}}\leq c_{0}$. In addition, $\mathbb{E}(U_{ij})=\mathbb{E}(F_{ik})=0$, $i=1,\ldots,n$, $j=1,\ldots,p$, $k=1,\ldots,K$.

  • (ii)

    There exists a constant $\tau>1$ such that $p/\tau\leq\lambda_{\min}(\bm{B}^{\top}\bm{B})\leq\lambda_{\max}(\bm{B}^{\top}\bm{B})\leq p\tau$.

  • (iii)

    There exists a constant $\Upsilon>0$ such that $\|\bm{B}\|_{\max}\leq\Upsilon$ and $\mathbb{E}|\bm{u}^{\top}\bm{u}-\mathrm{tr}(\bm{\Sigma}_{\bm{u}})|^{4}\leq\Upsilon p^{2}$.
  • (iv)

    There exists a positive constant $\kappa<1$ such that $\kappa\leq\lambda_{\min}(\bm{\Sigma}_{\bm{u}})$, $\lambda_{\max}(\bm{\Sigma}_{\bm{u}}),\|\bm{\Sigma}_{\bm{u}}\|_{1}\leq 1/\kappa$, and $\min_{1\leq k,l\leq p}\mathrm{Var}(U_{ik}U_{il})\geq\kappa$.

Assumption 2.

(Initial pilot estimators). Assume that $\widehat{\bm{\Sigma}}$, $\widehat{\bm{\Lambda}}$ and $\widehat{\bm{\Gamma}}$ satisfy $\|\widehat{\bm{\Sigma}}-\bm{\Sigma}\|_{\max}=O_{\mathbb{P}}\{\sqrt{(\log p)/n}\}$, $\|(\widehat{\bm{\Lambda}}-\bm{\Lambda})\widehat{\bm{\Lambda}}^{-1}\|_{\max}=O_{\mathbb{P}}\{\sqrt{(\log p)/n}\}$, and $\|\widehat{\bm{\Gamma}}-\bm{\Gamma}\|_{\max}=O_{\mathbb{P}}\{\sqrt{(\log p)/(np)}\}$.

Remark 1.

Assumption 2 is taken from Bayle and Fan, (2022). This assumption holds in various scenarios of interest, such as for the sample covariance matrix under sub-Gaussian distributions (Fan et al., 2013). Moreover, estimators such as the marginal and spatial Kendall's tau (Fan et al., 2018), and the elementwise adaptive Huber estimator (Fan et al., 2019) satisfy this assumption.

We summarize the theoretical results related to consistent factor estimation in Lemma 1 in Supplementary Material, which directly follows from Proposition 2.1 in Fan et al., (2024) and Lemma 3.1 in Bayle and Fan, (2022).

In practice, the number of latent factors $K$ is often unknown, and determining $K$ in a data-driven way is a crucial challenge. Numerous methods have been introduced in the literature to estimate $K$ (Bai and Ng, 2002; Lam and Yao, 2012; Ahn and Horenstein, 2013; Fan et al., 2022). In this paper, we employ the ratio method in our numerical studies (Luo et al., 2009; Lam and Yao, 2012; Ahn and Horenstein, 2013). Let $\lambda_{k}(\bm{X}\bm{X}^{\top})$ be the $k$-th largest eigenvalue of $\bm{X}\bm{X}^{\top}$. The number of factors can be consistently estimated by

\widehat{K} = \arg\max_{k\leq K_{\max}}\frac{\lambda_{k}(\bm{X}\bm{X}^{\top})}{\lambda_{k+1}(\bm{X}\bm{X}^{\top})},

where $1\leq K_{\max}\leq n$ is a prescribed upper bound for $K$. In our subsequent theoretical analysis, we treat $K$ as known. All the theoretical results remain valid conditional on the event that $\widehat{K}$ consistently estimates $K$.
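Given the spectrum of $\bm{X}\bm{X}^{\top}$, the ratio estimator is essentially a one-liner; a sketch (the default $K_{\max}=10$ is our illustrative choice):

```python
import numpy as np

def estimate_K(X, K_max=10):
    """Eigenvalue-ratio estimator of the number of factors K."""
    eigvals = np.sort(np.linalg.eigvalsh(X @ X.T))[::-1]   # descending order
    ratios = eigvals[:K_max] / eigvals[1:K_max + 1]        # lambda_k / lambda_{k+1}
    return int(np.argmax(ratios)) + 1                      # 0-based index -> k
```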

3 Adequacy test of factor model

In this section, we aim to assess the adequacy of the factor model and determine whether FASIM (1.3) should serve as an alternative to $Y=g(\bm{f}^{\top}\bm{\gamma},\varepsilon)$. The primary question of interest pertains to the following hypothesis:

H_{0}: \bm{\beta}_{h}=\bm{0}\ \ \text{versus}\ \ H_{1}: \bm{\beta}_{h}\neq\bm{0}.   (3.1)

From Proposition 2.1, the null hypothesis is equivalent to $H_{0}:\bm{\beta}=\bm{0}$.

3.1 Factor-adjusted score-type test

In this subsection, we develop a factor-adjusted score-type test (FAST) and derive its Gaussian approximation result. For the adequacy testing problem, Fan et al., (2024) considered the maximum of the debiased lasso estimator of $\bm{\beta}$ under FARM. However, their procedure requires estimating both high-dimensional regression coefficients and a high-dimensional precision matrix, and is thus computationally expensive. In contrast, our proposed FAST needs neither. Hence, our procedure is straightforward to implement, saving both computation time and computational resources.

Under the null hypothesis in (3.1), we have

\mathbb{E}\left[\left\{F(Y)-\frac{1}{2}-\bm{f}^{\top}\bm{\gamma}_{h}\right\}\bm{u}\right] = \mathbb{E}[e_{h}\bm{u}] = \bm{0},

while under the alternative hypothesis $H_{1}$, we have

\mathbb{E}\left[\left\{F(Y)-\frac{1}{2}-\bm{f}^{\top}\bm{\gamma}_{h}\right\}\bm{u}\right] = \mathbb{E}\{\bm{u}(e_{h}+\bm{u}^{\top}\bm{\beta}_{h})\} = \bm{\Sigma}_{\bm{u}}\bm{\beta}_{h}\neq\bm{0}.

This observation motivates us to consider the following individual test statistic

T_{nj} = \frac{1}{\sqrt{n}}\sum_{i=1}^{n}\left[\left\{F_{n}(Y_{i})-\frac{1}{2}-\widehat{\bm{f}}_{i}^{\top}\widehat{\bm{\gamma}}_{h}\right\}\widehat{U}_{ij}\right].   (3.2)

Here we utilize the empirical distribution function $F_{n}(y)=n^{-1}\sum_{i=1}^{n}\mathbb{I}(Y_{i}\leq y)$ as an estimator of $F(y)$. In the empirical distribution function, the term $\sum_{j=1}^{n}\mathbb{I}(Y_{j}\leq Y_{i})$ is the rank of $Y_{i}$. Since rank-based statistics, such as the Wilcoxon test and the Kruskal-Wallis ANOVA test, are well known to be robust, this intuitively explains why our procedures with the response-distribution transformation are robust to outliers in the response. We consider the least squares estimator $\widehat{\bm{\gamma}}_{h}=(\widehat{\bm{F}}^{\top}\widehat{\bm{F}})^{-1}\widehat{\bm{F}}^{\top}\{F_{n}(\bm{Y})-1/2\}$. Denote $\bm{T}_{n}=(T_{n1},\ldots,T_{np})^{\top}$. To test the null hypothesis $H_{0}:\bm{\beta}_{h}=\bm{0}$, we consider the $\ell_{\infty}$ norm of $\bm{T}_{n}$. That is,

M_{n} = \|\bm{T}_{n}\|_{\infty}.   (3.3)

It is clear that for the FAST statistic $M_{n}$ defined above, we only need to estimate the low-dimensional parameter $\bm{\gamma}_{h}$; no high-dimensional parameters are required. In addition, since FAST does not rely on the estimation of the precision matrix, we can avoid assumptions on the $\ell_{\infty}$ norm of the precision matrix, such as those imposed in Fan et al., (2024). Further define $S_{nj}$ as follows:

S_{nj} = \frac{1}{\sqrt{n}}\sum_{i=1}^{n}\left[\left\{F(Y_{i})-\frac{1}{2}-\bm{f}_{i}^{\top}\bm{\gamma}_{h}\right\}U_{ij}+m_{j}(Y_{i})\right],   (3.4)

where $m_{j}(y)=\mathbb{E}[(X_{1j}-\bm{f}_{1}^{\top}\bm{b}_{j})\{\mathbb{I}(Y\geq y)-F(Y)\}]$. Denote $\bm{S}_{n}=(S_{n1},\ldots,S_{np})^{\top}$ and $\bm{\Omega}^{*}=\mathrm{Cov}(\bm{S}_{n})$. Let $\sigma_{j}^{2}=\bm{\Omega}_{jj}^{*}$, $j=1,\ldots,p$. It can be shown that $T_{nj}=S_{nj}+o_{\mathbb{P}}(1)$.
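Combining the pieces above, $M_{n}$ is computable in a few lines; a sketch reusing ecdf_transform and estimate_factors from our Section 2 sketches (names are ours):

```python
import numpy as np

def fast_statistic(X, y, K):
    """FAST statistic M_n = ||T_n||_inf of (3.2)-(3.3)."""
    n = X.shape[0]
    F_hat, B_hat, U_hat = estimate_factors(X, K)
    h = ecdf_transform(y)                   # F_n(Y_i) - 1/2
    gamma_hat = F_hat.T @ h / n             # least squares; F_hat^T F_hat = n I_K
    resid = h - F_hat @ gamma_hat           # F_n(Y_i) - 1/2 - f_hat_i' gamma_hat
    T_n = resid @ U_hat / np.sqrt(n)        # vector (T_n1, ..., T_np)
    return np.max(np.abs(T_n)), resid, U_hat
```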

Assumption 3.

(i) $\{\log(np)\}^{7}/n=o(1)$.

(ii) $\min_{1\leq j\leq p}\sigma_{j}^{2}$ is bounded away from zero.

Assumption 3 is mild and frequently employed in high-dimensional settings; it is the technical requirement for bounding the difference between $M_{n}$ and $\|\bm{\mathrm{Z}}\|_{\infty}$, where $\bm{\mathrm{Z}}\sim\mathrm{N}_{p}(\bm{0},\bm{\Omega}^{*})$. Specifically, condition (i) imposes a suitable restriction on the growth rate of $p$ and is commonly used in the high-dimensional inference literature; see Zhang and Cheng, (2017) and Dezeure et al., (2017). Condition (ii) imposes a boundedness restriction on the second moment of $S_{nj}$, which is also assumed in Ning and Liu, (2017).

Theorem 3.1.

Suppose that Assumptions 1-3 and the LC hold. Under the null hypothesis, that is, $\bm{\beta}_{h}=\bm{0}$, we have

\lim_{n\rightarrow\infty}\sup_{t\in\mathbb{R}}\left|\Pr(M_{n}\leq t)-\Pr\left(\|\bm{\mathrm{Z}}\|_{\infty}\leq t\right)\right| = 0,   (3.5)

where $\bm{\mathrm{Z}}\sim\mathrm{N}_{p}(\bm{0},\bm{\Omega}^{*})$.

Theorem 3.1 indicates that our test statistic can be approximated by the maximum of a high-dimensional Gaussian vector under mild conditions. Based on this, we reject the null hypothesis $H_{0}$ at the significance level $\alpha$ if and only if $M_{n}>c_{1-\alpha}$, where $c_{1-\alpha}$ is the $(1-\alpha)$-th quantile of the distribution of $\|\bm{\mathrm{Z}}\|_{\infty}$.

As demonstrated by many authors, such as Cai et al., (2014) and Ma et al., (2021), the distribution of $\|\bm{\mathrm{Z}}\|_{\infty}$ can be asymptotically approximated by the Gumbel distribution. However, this generally requires restrictions on the covariance matrix of $\bm{\mathrm{Z}}$; for instance, that the eigenvalues of $\bm{\Omega}^{*}$ are uniformly bounded. In addition, the critical value obtained from the Gumbel distribution may not work well in practice since this weak convergence is typically slow (Zhang and Cheng, 2017). Instead of adopting the Gumbel approximation, we use the bootstrap to approximate the distribution of $M_{n}$.

3.2 Gaussian multiplier bootstrap

Given that $\bm{\Omega}^{*}$ is unknown, we employ the Gaussian multiplier bootstrap to derive the critical value $c_{1-\alpha}$. The procedure and its theoretical properties are outlined in this subsection.

  • 1.

    Generate i.i.d. random variables $\mathrm{N}_{1},\ldots,\mathrm{N}_{n}\sim\mathrm{N}(0,1)$ independent of the observed dataset $\mathcal{D}=\{\bm{x}_{1},\ldots,\bm{x}_{n},Y_{1},\ldots,Y_{n}\}$, and compute

    \widehat{\mathcal{G}} = \max_{j\in[p]}\left|\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\left[\left\{F_{n}(Y_{i})-\frac{1}{2}-\widehat{\bm{f}}_{i}^{\top}\widehat{\bm{\gamma}}_{h}\right\}\widehat{U}_{ij}+\widehat{m}_{j}(Y_{i})\right]\mathrm{N}_{i}\right|.   (3.6)

    Here $\widehat{m}_{j}(y)=n^{-1}\sum_{i=1}^{n}(X_{ij}-\widehat{\bm{f}}_{i}^{\top}\widehat{\bm{b}}_{j})\{\mathbb{I}(Y_{i}\geq y)-F_{n}(Y_{i})\}$.

  • 2.

    Repeat the first step independently $B$ times to obtain $\widehat{\mathcal{G}}_{1},\ldots,\widehat{\mathcal{G}}_{B}$. Approximate the critical value $c_{1-\alpha}$ via the $(1-\alpha)$-th quantile of the empirical distribution of the bootstrap statistics:

    \widehat{c}_{1-\alpha} = \inf\left\{t\geq 0:\frac{1}{B}\sum_{b=1}^{B}\mathbb{I}(\widehat{\mathcal{G}}_{b}\leq t)\geq 1-\alpha\right\}.   (3.7)
  • 3.

    Reject the null hypothesis $H_{0}$ if and only if

    M_{n} \geq \widehat{c}_{1-\alpha}.   (3.8)
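The three steps above are short to implement; the following sketch (reusing fast_statistic from Section 3.1; the vectorized computation of $\widehat{m}_{j}(Y_{i})$ uses $\widehat{U}_{ij}=X_{ij}-\widehat{\bm{f}}_{i}^{\top}\widehat{\bm{b}}_{j}$, and the returned bootstrap $p$-value is equivalent to comparing $M_{n}$ with $\widehat{c}_{1-\alpha}$):

```python
import numpy as np

def fast_bootstrap(X, y, K, B=2000, seed=0):
    """Gaussian multiplier bootstrap for FAST (steps 1-3 above, a sketch)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    M_n, resid, U_hat = fast_statistic(X, y, K)
    Fn = (np.argsort(np.argsort(y)) + 1) / n               # F_n(Y_i), no ties
    A = (y[None, :] >= y[:, None]).astype(float) - Fn[None, :]
    m_hat = A @ U_hat / n                                  # m_hat[k, j] = m_hat_j(Y_k)
    xi = resid[:, None] * U_hat + m_hat                    # per-observation summands
    G = np.abs(rng.standard_normal((B, n)) @ xi).max(axis=1) / np.sqrt(n)
    return M_n, np.mean(G >= M_n)                          # statistic, bootstrap p-value
```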
Theorem 3.2.

Suppose that the conditions in Theorem 3.1 are satisfied. Under the null hypothesis, we have

\sup_{x>0}\left|\Pr(M_{n}\leq x)-\Pr(\widehat{\mathcal{G}}\leq x\mid\mathcal{D})\right| \rightarrow 0.

Theorem 3.2 demonstrates the validity of the proposed bootstrap procedure, grounded in the Gaussian approximation theory established in Theorem 3.1. Based on this result, we can define our decision rule as follows:

\psi_{\infty,\alpha} = \mathbb{I}\left(M_{n}>\widehat{c}_{1-\alpha}\right).

The null hypothesis $H_{0}$ is rejected if and only if $\psi_{\infty,\alpha}=1$.

3.3 Power analysis

We next conduct the asymptotic power analysis of $M_{n}$. To demonstrate the efficiency of the test statistic, we consider the following local alternative parameter family for $\bm{\beta}_{h}$ under $H_{1}$:

\bm{\Delta}(c) = \left\{\bm{\beta}_{h}\in\mathbb{R}^{p}:\max_{j\in[p]}|\beta_{hj}|\geq\sqrt{c\,\frac{\log p}{n}}\right\},   (3.9)

where $c$ is a positive constant. Here $s=\|\bm{\beta}_{h}\|_{0}$ quantifies the sparsity of the parameter $\bm{\beta}_{h}$. Denote $\mu_{j}=\mathbb{E}(U_{ij}^{2})$.

Assumption 4.

Suppose that $\bm{\Sigma}_{\bm{u}}=\mathrm{Cov}(\bm{u})$ is a diagonal matrix and $\mu_{j}$ is bounded away from zero.

Assumption 4 imposes a diagonal structure on the covariance matrix $\bm{\Sigma}_{\bm{u}}$, a common assumption in factor analysis (Kim and Mueller, 1978). We note that the diagonality of $\bm{\Sigma}_{\bm{u}}$ is imposed here only to simplify the power analysis.

Theorem 3.3.

Suppose that Assumptions 1-4 and the LC hold. For the test statistic $M_{n}$ defined in (3.3), if $s=o(n/\log p)$, we have

\lim_{(n,p)\rightarrow\infty}\inf_{\bm{\beta}_{h}\in\bm{\Delta}(2+\varrho_{0})}\Pr\left(M_{n}\geq\widehat{c}_{1-\alpha}\right) = 1,   (3.10)

where $\varrho_{0}$ is a positive constant.

Theorem 3.3 suggests that our test procedure maintains high power even when only a few components of $\bm{\beta}_{h}$ have magnitudes larger than $\sqrt{(2+\varrho_{0})(\log p)/n}$. Thus, our testing procedure is powerful against sparse alternatives. Notably, this separation rate is the minimax optimal rate for local alternative hypotheses, as discussed in Verzelen, (2012), Cai et al., (2014), Zhang and Cheng, (2017), and Ma et al., (2021).

4 Estimation and inference of FASIM

When the null hypothesis is rejected, indicating that the factor model is inadequate, we need to consider FASIM. In this section, we aim to investigate the estimation and inference of the FASIM.

4.1 Regularization estimation

In the high-dimensional regime, we employ the $\ell_{1}$ penalty (Tibshirani, 1996) to estimate the unknown parameter vectors $\bm{\beta}_{h}$ and $\bm{\gamma}_{h}$ in model (2.1):

(\widehat{\bm{\beta}}_{h},\widehat{\bm{\gamma}}_{h}) = \arg\min_{\bm{\beta}\in\mathbb{R}^{p},\,\bm{\gamma}\in\mathbb{R}^{K}}\left[\frac{1}{2n}\sum_{i=1}^{n}\left\{F_{n}(Y_{i})-\frac{1}{2}-\widehat{\bm{u}}_{i}^{\top}\bm{\beta}-\widehat{\bm{f}}_{i}^{\top}\bm{\gamma}\right\}^{2}+\lambda\|\bm{\beta}\|_{1}\right],   (4.1)

where λ>0\lambda>0 is a tuning parameter.

By utilizing the pseudo response observations $F_{n}(Y_{i})$ instead of $Y_{i}$, our approach is inherently robust when confronted with outliers. We note that for the SIM without latent factors, the distribution function transformation has also been considered by many other authors, such as Zhu and Zhu, (2009), Wang et al., (2012), and Rejchel and Bogdan, (2020). However, in our model $\bm{f}$ and $\bm{u}$ are unobserved and must be estimated first, so additional effort is required to derive the theoretical properties of $\widehat{\bm{\beta}}_{h}$.

Let $\widetilde{F}_{n}(\bm{Y})=(\bm{I}_{n}-\widehat{\bm{P}})\{F_{n}(\bm{Y})-1/2\}$ represent the residuals of the response vector $F_{n}(\bm{Y})-1/2$ after it has been projected onto the column space of $\widehat{\bm{F}}$, where $\widehat{\bm{P}}=n^{-1}\widehat{\bm{F}}\widehat{\bm{F}}^{\top}$ is the corresponding projection matrix. Since $\widehat{\bm{U}}=(\bm{I}_{n}-\widehat{\bm{P}})\bm{X}$, we have $\widehat{\bm{F}}^{\top}\widehat{\bm{U}}=\bm{0}_{K\times p}$. Then direct calculations yield that the solution of (4.1) is equivalent to

\widehat{\bm{\beta}}_{h} = \arg\min_{\bm{\beta}\in\mathbb{R}^{p}}\left\{\frac{1}{2n}\left\|\widetilde{F}_{n}(\bm{Y})-\widehat{\bm{U}}\bm{\beta}\right\|_{2}^{2}+\lambda\|\bm{\beta}\|_{1}\right\},   (4.2)
\widehat{\bm{\gamma}}_{h} = (\widehat{\bm{F}}^{\top}\widehat{\bm{F}})^{-1}\widehat{\bm{F}}^{\top}\left\{F_{n}(\bm{Y})-\frac{1}{2}\right\}.   (4.3)
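A sketch of (4.2)-(4.3) using scikit-learn's coordinate-descent Lasso, whose objective $(1/2n)\|r-\bm{U}\bm{\beta}\|_{2}^{2}+\alpha\|\bm{\beta}\|_{1}$ matches (4.2) with $\alpha=\lambda$ (in practice $\lambda$ would be tuned, e.g., by cross-validation; names are ours):

```python
import numpy as np
from sklearn.linear_model import Lasso

def fasim_lasso(X, y, K, lam):
    """Penalized estimators (4.2)-(4.3), a sketch."""
    n = X.shape[0]
    F_hat, _, U_hat = estimate_factors(X, K)
    h = ecdf_transform(y)
    gamma_hat = F_hat.T @ h / n               # (4.3), since F_hat^T F_hat = n I_K
    h_tilde = h - F_hat @ gamma_hat           # (I_n - P_hat){F_n(Y) - 1/2}
    fit = Lasso(alpha=lam, fit_intercept=False).fit(U_hat, h_tilde)
    return fit.coef_, gamma_hat
```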

For any vector $\widehat{\bm{\beta}}_{h}$ defined in (4.2), we have the following result.

Theorem 4.1.

Under Assumptions 1-2 and the LC, assume that $\log p=o(n)$. If $s=o(n/\log p)$ and $\lambda\asymp\sqrt{(\log p)/n}$, then $\widehat{\bm{\beta}}_{h}$ defined in (4.2) satisfies

\|\widehat{\bm{\beta}}_{h}-\bm{\beta}_{h}\|_{2} = O_{\mathbb{P}}(\lambda\sqrt{s}), \quad \text{and} \quad \|\widehat{\bm{\beta}}_{h}-\bm{\beta}_{h}\|_{1} = O_{\mathbb{P}}(\lambda s).

Based on the empirical distribution function of the response, Theorem 4.1 establishes the $\ell_{1}$ and $\ell_{2}$-error bounds for our estimator $\widehat{\bm{\beta}}_{h}$. Compared with Zhu and Zhu, (2009), Wang et al., (2012), and Rejchel and Bogdan, (2020), in our model $\bm{f}$ and $\bm{u}$ are unobserved and must be estimated first, which adds new technical difficulties. Further, compared with Fan et al., (2024), our procedure does not need any moment condition on the error term $\varepsilon$ in the model.

4.2 Inference of FASIM

Due to the bias inherent in penalized estimation, $\widehat{\bm{\beta}}_{h}$ is unsuitable for direct use in statistical inference. To overcome this obstacle, Zhang and Zhang, (2014) and Van de Geer et al., (2014) proposed the debiasing technique for the Lasso estimator in linear regression. Eftekhari et al., (2021) extended the technique to the SIM. Han et al., (2023) and Yang et al., (2024) further considered robust inference for high-dimensional SIMs. However, their procedures cannot handle latent factors. To this end, motivated by Zhang and Zhang, (2014) and Van de Geer et al., (2014), we construct the following debiased estimator for the FASIM:

\widetilde{\bm{\beta}}_{h} = \widehat{\bm{\beta}}_{h}+\frac{1}{n}\widehat{\bm{\Theta}}\widehat{\bm{U}}^{\top}\left\{F_{n}(\bm{Y})-\frac{1}{2}-\widehat{\bm{U}}\widehat{\bm{\beta}}_{h}\right\},   (4.4)

where $\widehat{\bm{\Theta}}$ is a sparse estimator of $\bm{\Sigma}_{u}^{-1}$, with $\bm{\Sigma}_{u}=\mathbb{E}(\bm{u}_{i}\bm{u}_{i}^{\top})$. Denote $\widehat{\bm{\Sigma}}_{u}=n^{-1}\sum_{i=1}^{n}\widehat{\bm{u}}_{i}\widehat{\bm{u}}_{i}^{\top}$. We construct $\widehat{\bm{\Theta}}$ by employing the approach of Cai et al., (2011). Concretely, $\widehat{\bm{\Theta}}$ is the solution to the following constrained optimization problem:

\widehat{\bm{\Theta}} = \arg\min_{\bm{\Theta}\in\mathbb{R}^{p\times p}}\|\bm{\Theta}\|_{\text{sum}}, \quad \text{s.t.}\ \|\bm{\Theta}\widehat{\bm{\Sigma}}_{u}-\bm{I}_{p}\|_{\max}\leq\delta_{n},   (4.5)

where $\delta_{n}$ is a predetermined tuning parameter. In general, $\widehat{\bm{\Theta}}$ is not symmetric since there is no symmetry constraint in (4.5). Symmetry can be enforced through additional operations. Denote $\widehat{\bm{\Theta}}=(\widehat{\bm{\xi}}_{1},\ldots,\widehat{\bm{\xi}}_{p})^{\top}=(\widehat{\xi}_{ij})_{1\leq i,j\leq p}$. Write $\widehat{\bm{\Theta}}^{\text{sym}}=(\widehat{\xi}_{ij}^{\text{sym}})_{1\leq i,j\leq p}$, where

\widehat{\xi}_{ij}^{\text{sym}} = \widehat{\xi}_{ij}\mathbb{I}\left(|\widehat{\xi}_{ij}|\leq|\widehat{\xi}_{ji}|\right)+\widehat{\xi}_{ji}\mathbb{I}\left(|\widehat{\xi}_{ij}|>|\widehat{\xi}_{ji}|\right).

Clearly, $\widehat{\bm{\Theta}}^{\text{sym}}$ is a symmetric matrix. For simplicity, we write $\widehat{\bm{\Theta}}$ for the symmetric estimator in the rest of the paper. Next, we consider the estimation error bound of $\widehat{\bm{\Theta}}$. To this end, we introduce the following assumption on $\bm{\Sigma}_{u}^{-1}$.

Assumption 5.

There exists a positive constant $M^{*}$ such that $\|\bm{\Sigma}_{u}^{-1}\|_{\infty}\leq M^{*}$. Moreover, $\bm{\Sigma}_{u}^{-1}=(\bm{\xi}_{1},\ldots,\bm{\xi}_{p})^{\top}=(\xi_{ij})_{1\leq i,j\leq p}$ is row-wise sparse, i.e., $\max_{i\in[p]}\sum_{j=1}^{p}|\xi_{ij}|^{q}\leq c_{n,p}$, where $c_{n,p}$ is positive, bounded away from zero, and allowed to increase as $n$ and $p$ grow, and $0\leq q<1$.

Assumption 5 requires that $\bm{\Sigma}_{u}^{-1}$ be sparse in terms of both its $\ell_{\infty}$-norm and its matrix row space. Van de Geer et al., (2014), Cai et al., (2016) and Ning and Liu, (2017) discussed similar assumptions on precision matrix estimation and more general inverse Hessian matrix estimation.

The estimation error bound of $\widehat{\bm{\Theta}}$ and the upper bound of $\|\widehat{\bm{\Theta}}-\bm{\Sigma}_{u}^{-1}\|_{\infty}$ are given in Proposition 7 and Lemma 8 in the Supplementary Material; they are crucial for establishing the theoretical results below. Denote $\bm{m}=\{m_{1}(Y),\ldots,m_{p}(Y)\}^{\top}$, with $m_{j}(y)=\mathbb{E}[(X_{1j}-\bm{f}_{1}^{\top}\bm{b}_{j})\{\mathbb{I}(Y\geq y)-F(Y)\}]$. Let $\bm{m}_{i}$ be the i.i.d. copies of $\bm{m}$, namely, $\bm{m}_{i}=\{m_{1}(Y_{i}),\ldots,m_{p}(Y_{i})\}^{\top}$.

Theorem 4.2.

Suppose that the conditions in Theorem 4.1 and Assumption 5 are satisfied. If $s=o(\sqrt{n}/\log p)$ and $\log p\ll n^{(1-q)/(3-q)}$, where $q$ is given in Assumption 5, then

\sqrt{n}(\widetilde{\bm{\beta}}_{h}-\bm{\beta}_{h}) = \bm{Z}_{b}+\bm{E},
\bm{Z}_{b} = \frac{1}{\sqrt{n}}\sum_{i=1}^{n}\bm{\Sigma}_{u}^{-1}\left(\bm{u}_{i}e_{hi}+\bm{m}_{i}\right), \quad \|\bm{E}\|_{\infty}=o_{p}(1).

Theorem 4.2 indicates that the asymptotic representation of $\sqrt{n}(\widetilde{\bm{\beta}}_{h}-\bm{\beta}_{h})$ can be divided into two terms. The major term $\bm{Z}_{b}$ is associated with the errors $\bm{e}_{h}$ and $\bm{m}$, while the remainder $\bm{E}$ vanishes as $n$ approaches infinity.

Further, for $j=1,\ldots,p$, the variance of $Z_{bj}$ is $\sigma_{zj}^{2}=\bm{e}_{j}^{\top}\bm{\Sigma}_{u}^{-1}\mathrm{Var}\{\bm{u}e_{h}+\bm{m}\}\bm{\Sigma}_{u}^{-1}\bm{e}_{j}$, with $\bm{e}_{j}$ being the unit vector with only its $j$-th element equal to 1. Based on Theorem 4.2, the asymptotic variance of $\sqrt{n}(\widetilde{\beta}_{hj}-\beta_{hj})$ can be estimated by $\widehat{\sigma}_{zj}^{2}=\bm{e}_{j}^{\top}\widehat{\bm{\Theta}}\widetilde{\bm{\Sigma}}\widehat{\bm{\Theta}}\bm{e}_{j}$, with $\widetilde{\bm{\Sigma}}=(\widetilde{\sigma}_{lk})_{1\leq l,k\leq p}$, $\widetilde{\sigma}_{lk}=n^{-1}\sum_{i=1}^{n}\{\widehat{U}_{il}\widetilde{e}_{hi}+\widehat{m}_{l}(Y_{i})\}\{\widehat{U}_{ik}\widetilde{e}_{hi}+\widehat{m}_{k}(Y_{i})\}$, $\widetilde{e}_{hi}=F_{n}(Y_{i})-1/2-\widehat{\bm{u}}_{i}^{\top}\widehat{\bm{\beta}}_{h}-\widehat{\bm{f}}_{i}^{\top}\widehat{\bm{\gamma}}_{h}$ and $\widehat{m}_{j}(y)=n^{-1}\sum_{i=1}^{n}(X_{ij}-\widehat{\bm{f}}_{i}^{\top}\widehat{\bm{b}}_{j})\{\mathbb{I}(Y_{i}\geq y)-F_{n}(Y_{i})\}$. Thus the $(1-\alpha)$ confidence interval for $\beta_{hj}$, $j\in[p]$, can be constructed as follows:

\mathcal{CI}_{\alpha}(\beta_{hj}) = \left(\widetilde{\beta}_{hj}-\frac{\widehat{\sigma}_{zj}z_{1-\alpha/2}}{\sqrt{n}},\ \widetilde{\beta}_{hj}+\frac{\widehat{\sigma}_{zj}z_{1-\alpha/2}}{\sqrt{n}}\right),   (4.6)

where $\widetilde{\beta}_{hj}$ is the $j$-th component of $\widetilde{\bm{\beta}}_{h}$ and $z_{1-\alpha/2}$ is the $(1-\alpha/2)$-th quantile of the standard normal distribution.
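Given a symmetrized precision-matrix estimator $\widehat{\bm{\Theta}}$ (the CLIME program (4.5) itself needs a linear-programming solver and is not reproduced here), the debiased estimator (4.4) and the intervals (4.6) take only a few lines; a sketch with illustrative names, where h denotes the vector $F_{n}(\bm{Y})-1/2$:

```python
import numpy as np
from scipy.stats import norm

def debiased_ci(U_hat, F_hat, h, y, beta_hat, gamma_hat, Theta_hat, alpha=0.05):
    """Debiased estimator (4.4) and confidence intervals (4.6), a sketch."""
    n = U_hat.shape[0]
    beta_tilde = beta_hat + Theta_hat @ U_hat.T @ (h - U_hat @ beta_hat) / n  # (4.4)
    resid = h - U_hat @ beta_hat - F_hat @ gamma_hat       # e_tilde_hi
    Fn = (np.argsort(np.argsort(y)) + 1) / n
    A = (y[None, :] >= y[:, None]).astype(float) - Fn[None, :]
    m_hat = A @ U_hat / n                                  # m_hat_j(Y_i)
    W = U_hat * resid[:, None] + m_hat                     # rows: U_i e_i + m(Y_i)
    Sigma_tilde = W.T @ W / n                              # plug-in variance matrix
    se = np.sqrt(np.diag(Theta_hat @ Sigma_tilde @ Theta_hat) / n)
    z = norm.ppf(1 - alpha / 2)
    return beta_tilde, beta_tilde - z * se, beta_tilde + z * se
```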

5 Numerical studies

In this section, simulation studies are conducted to evaluate the finite sample performance of the proposed methods. We implement the proposed method with the following two models.
Model 1: Linear model

Y = \bm{u}^{\top}\bm{\beta}+\bm{f}^{\top}\bm{\gamma}+\varepsilon.   (5.1)

Model 2: Nonlinear model

Y = \exp\left(\bm{u}^{\top}\bm{\beta}+\bm{f}^{\top}\bm{\gamma}+\varepsilon\right).   (5.2)

Under each setting, we generate $n$ i.i.d. observations and replicate the experiment 500 times.

5.1 Adequacy test of factor model

In this subsection, we set $n=200$, $K=2$, $p=200$ or $500$, and $\bm{X}=\bm{F}\bm{B}^{\top}+\bm{U}$. Here, the entries of $\bm{B}$ are generated from the uniform distribution $\mathrm{Unif}(-1,1)$ and every row of $\bm{U}$ follows $\mathrm{N}(\bm{0},\bm{\Sigma}_{\bm{u}})$ with $\bm{\Sigma}_{\bm{u}}=(\sigma_{uij})_{1\leq i,j\leq p}$ and $\sigma_{uij}=0.5^{|i-j|}$. We consider two cases of $\bm{F}$ generation:
Case i. We generate 𝒇i\bm{f}_{i} from standard normal distribution, that is,

\bm{f}_{i}\sim\mathrm{N}(\bm{0},\bm{I}_{K}),\ i=1,\ldots,n.   (5.3)

Case ii. We generate 𝒇i\bm{f}_{i} from the AR(1) model:

\bm{f}_{i}=\bm{\Phi}\bm{f}_{i-1}+\bm{\pi}_{i},\ i=2,\ldots,n,   (5.4)

where $\bm{\Phi}=(\phi_{ij})\in\mathbb{R}^{K\times K}$ with $\phi_{ij}=0.4^{|i-j|+1}$, $i,j\in[K]$. In addition, $\bm{f}_{1}$ and $\bm{\pi}_{i}$, $i\geq 1$, are independently drawn from $\mathrm{N}(\bm{0},\bm{I}_{K})$. We generate $\varepsilon$ either from (a) $\mathrm{N}(0,0.25)$ or (b) Student's t distribution with 3 degrees of freedom, denoted as $\mathrm{t}_{3}$. We set $\bm{\gamma}=(0.5,0.5)^{\top}$ and $\bm{\beta}=\omega\,(\bm{1}_{3}^{\top},\bm{0}_{p-3}^{\top})^{\top}$, $\omega\geq 0$. When $\omega=0$, the null hypothesis holds and the simulation results correspond to the empirical size; otherwise, they correspond to the empirical power.
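A sketch of this data-generating process for Model 1 under error setting (a), with illustrative names:

```python
import numpy as np

def simulate_model1(n=200, p=200, K=2, omega=0.0, ar_factors=False, seed=0):
    """Generate (X, Y) from Model 1 (5.1) under Case i or Case ii."""
    rng = np.random.default_rng(seed)
    B = rng.uniform(-1, 1, size=(p, K))
    idx = np.arange(p)
    Sigma_u = 0.5 ** np.abs(idx[:, None] - idx[None, :])   # sigma_uij = 0.5^{|i-j|}
    U = rng.multivariate_normal(np.zeros(p), Sigma_u, size=n)
    F = rng.standard_normal((n, K))                        # Case i: i.i.d. N(0, I_K)
    if ar_factors:                                         # Case ii: AR(1) factors
        Phi = 0.4 ** (np.abs(idx[:K, None] - idx[None, :K]) + 1)
        for t in range(1, n):
            F[t] = Phi @ F[t - 1] + rng.standard_normal(K)
    beta = omega * np.r_[np.ones(3), np.zeros(p - 3)]
    gamma = np.full(K, 0.5)
    Y = U @ beta + F @ gamma + rng.normal(0.0, 0.5, n)     # N(0, 0.25) errors
    return F @ B.T + U, Y                                  # X = F B^T + U
```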

To assess the robustness of our proposed method, we introduce outliers into the responses. We randomly pick a proportion $p_{out}$ of the responses and increase them by $m_{out}$ times the maximum of the original responses, abbreviated as $p_{out}+m_{out}*\max(\text{response})$. Here, $p_{out}$ represents the proportion of outliers, while $m_{out}$ is a predetermined constant indicating the strength of the outliers. Throughout the simulations, we adopt the strategy $10\%+10*\max(\text{response})$ to pollute the observations. We compare our proposed FAST with the FabTest of Fan et al., (2024). This comparison under the linear model (5.1) is illustrated in Figures 1 and 2. Here, “FAST_i” and “FAST_ii” denote the results of FAST under settings (5.3) and (5.4) of the $\bm{F}$ generation, respectively; similarly, “FabTest_i” and “FabTest_ii” denote the corresponding results of the FabTest of Fan et al., (2024). The first rows in Figures 1 and 2 represent the results obtained with the original data, while the second rows correspond to the results obtained after incorporating outliers (namely, $10\%+10*\max(\text{response})$). The figures suggest that both FAST and FabTest exhibit commendable performance under the Gaussian distribution. But when confronted with heavy tails or outliers, our proposed method outperforms FabTest. Specifically, as illustrated in the second rows of these two figures, the power curves associated with FAST reach 1 notably faster, whereas the power curves of FabTest consistently hover around the significance level of 0.05.

Figures 1 and 2 should be here.

Figures 3 and 4 illustrate the power curves of FAST and FabTest under the nonlinear model (5.2). We observe a distinct performance in the nonlinear scenario: even with the light-tailed error distribution $\mathrm{N}(0,0.25)$ in the original data, the power performance of FabTest is notably inferior to that of FAST. Furthermore, with the $\mathrm{t}_{3}$-distribution in the original data, the power curves of FAST continue to reach 1 faster. Similarly, when outliers are introduced, FabTest fails completely, whereas FAST continues to perform very well. These results illustrate the robustness of FAST in testing.

Figures 3 and 4 should be here.

As previously mentioned, our method avoids estimating high-dimensional regression coefficients and a high-dimensional precision matrix, significantly reducing computational costs. Table 1 compares the average computation time of FAST and the FabTest of Fan et al., (2024) under the linear model (5.1). The table indicates that under the same settings, the average computation time of FAST is significantly less than that of FabTest. Additionally, the average computation time of both tests increases as the parameter dimension $p$ increases to 500, as expected.

Table 1 should be here.

5.2 Accuracy of estimation

To illustrate the accuracy of our estimation, we set the number of latent factors to $K=2$, the dimension of $\bm{x}$ to $p=500$, $\bm{\gamma}=(0.5,0.5)^{\top}$, the first $s=3$ entries of $\bm{\beta}$ to 0.5, and the remaining $p-s$ entries to 0. Throughout this subsection, we generate each entry of $\bm{F}$ and $\bm{U}$ from the standard Gaussian distribution $\mathrm{N}(0,1)$, while each entry of $\bm{B}$ is randomly generated from the uniform distribution $\mathrm{Unif}(-1,1)$. The error $\varepsilon$ is generated from three scenarios: (a) the standard Gaussian distribution $\mathrm{N}(0,1)$; (b) the uniform distribution $\mathrm{Unif}(-3/2,3/2)$; (c) $\mathrm{t}_{3}$. Under each setting, $n$ is set so that $\sqrt{s(\log p)/n}$ takes uniform grids in $[0.10,0.30]$ with step 0.05.

To validate the necessity of considering latent factors in observational studies, we compare the relative error obtained by our approach (denoted “FASIM_Lasso”) with the method of Rejchel and Bogdan, (2020) (denoted “SIM_Lasso”), which considers the single-index model without incorporating the factor effect, and the method of Fan et al., (2024) (denoted “FA_Lasso”). Figures 5 and 6 depict the relative error

L_{2}(\widehat{\bm{\beta}}_{h},\bm{\beta}_{h}) = \frac{\|\widehat{\bm{\beta}}_{h}-\bm{\beta}_{h}\|_{2}}{\|\bm{\beta}_{h}\|_{2}}

of the linear model (5.1) and the nonlinear model (5.2), respectively. The simulation results indicate that for the original data, FA_Lasso performs very well when the error is not heavy-tailed, but when the error follows $\mathrm{t}_{3}$, it performs significantly worse. In contrast, the proposed method maintains stable and good performance across the different error distributions, illustrating its robustness. Additionally, compared to SIM_Lasso, FASIM_Lasso enjoys smaller relative errors; this is partially due to the inadequacy of the SIM in Rejchel and Bogdan, (2020). Figures 5 and 6 also show that the statistical rates $\|\widehat{\bm{\beta}}_{h}-\bm{\beta}_{h}\|_{2}$ of FASIM_Lasso are bounded by $O\{\sqrt{s(\log p)/n}\}$, which validates Theorem 4.1.

Figures 5 and 6 should be here.

To investigate the robustness of FASIM_Lasso, we introduce outliers into the observations using the aforementioned method, specifically $10\%+10*\max(\text{response})$. We report the $L_{2}(\widehat{\bm{\beta}},\bm{\beta})$ of FA_Lasso and the $L_{2}(\widehat{\bm{\beta}}_{h},\bm{\beta}_{h})$ of FASIM_Lasso and SIM_Lasso. The results are depicted in Figures 7 and 8. The figures indicate that our proposed method achieves more precise results, with smaller relative errors in all scenarios. This implies that our proposed method is much more robust than FA_Lasso when confronted with outliers in the observed data.

Figures 7 and 8 should be here.

5.3 Validity of confidence interval

In this subsection, we construct confidence intervals based on the proposed method in Section 4. We generate $\varepsilon$ either from (a) $\mathrm{N}(0,0.25)$ or (b) $\mathrm{t}_{3}$, and set $n=200$, $p=200$ or $500$. In addition, we generate $\bm{F}$, $\bm{U}$, $\bm{B}$ and the parameters $\bm{\beta}$, $\bm{\gamma}$ as described in Section 5.2. To assess the performance of our methods, we examine the empirical coverage probabilities and the average lengths of confidence intervals for the individual coefficients across all covariates, the covariates in $\mathcal{S}=\{1,2,3\}$ and those in $\mathcal{S}^{C}=[p]\setminus\mathcal{S}$. We define

\text{CP} = \frac{\sum_{j=1}^{p}\mathbb{I}\{\beta_{hj}\in\mathcal{CI}_{\alpha}(\beta_{hj})\}}{p}, \quad \text{AL} = \frac{\sum_{j=1}^{p}2\widehat{\sigma}_{zj}z_{1-\alpha/2}/\sqrt{n}}{p},
\text{CP}_{\mathcal{S}} = \frac{\sum_{j\in\mathcal{S}}\mathbb{I}\{\beta_{hj}\in\mathcal{CI}_{\alpha}(\beta_{hj})\}}{s}, \quad \text{AL}_{\mathcal{S}} = \frac{\sum_{j\in\mathcal{S}}2\widehat{\sigma}_{zj}z_{1-\alpha/2}/\sqrt{n}}{s},
\text{CP}_{\mathcal{S}^{C}} = \frac{\sum_{j\in\mathcal{S}^{C}}\mathbb{I}\{\beta_{hj}\in\mathcal{CI}_{\alpha}(\beta_{hj})\}}{p-s}, \quad \text{AL}_{\mathcal{S}^{C}} = \frac{\sum_{j\in\mathcal{S}^{C}}2\widehat{\sigma}_{zj}z_{1-\alpha/2}/\sqrt{n}}{p-s}.

The results are summarized in Table 2. The empirical coverage probabilities are close to the nominal level across all settings, and the average lengths of the confidence intervals are very short. This illustrates the merits of our proposed methods. In addition, we find that even when the error follows a $\mathrm{t}_{3}$ distribution, the average interval lengths increase only slightly.

Table 2 should be here.

6 Real data analysis

In this section, we apply our method to a macroeconomic dataset named FRED-MD (McCracken and Ng, 2016), which comprises 126 monthly U.S. macroeconomic variables. Because these variables measure aspects of economic health, they are influenced by latent factors and can thus be regarded as intercorrelated. We analyze this dataset to illustrate the performance of FASIM and to evaluate the adequacy of the factor regression model. In our study, we take “GS5”, the 5-year treasury rate, as the response variable $Y$, and the remaining variables as the predictors $\bm{x}$. In addition, to demonstrate the robustness of FASIM, we apply the previously mentioned outlier contamination to the response variable, specifically $10\%+10*\max(\text{response})$. Due to the economic crisis of 2008-2009, the data remained unstable even after implementing the suggested transformations. Therefore, we examine data from two distinct periods: February 1992 to October 2007 and August 2010 to February 2020.

We first employ our FAST and the FabTest of Fan et al., (2024) to test the adequacy of the factor models, and denote the corresponding factor-only models by F_SIM and F_LM, respectively. We use 2000 bootstrap replications to obtain the critical value. For the FabTest, we use 10-fold cross-validation to compute the tuning parameters and refitted cross-validation based on iterated sure independence screening to estimate the error variance. The $p$-values are provided in Table 3. At the significance level of 0.05, for both the original dataset and the polluted dataset, the results indicate the inadequacy of F_SIM for “GS5” within the two distinct time periods. For F_LM, the results indicate inadequacy for “GS5” in the original dataset within both periods; when the data are polluted, the null hypothesis is rejected during the period from February 1992 to October 2007 but not during the other period. This implies the necessity of introducing the idiosyncratic component $\bm{u}$ into the regression model. Hence, in the subsequent study on prediction accuracy, we consider FASIM and FARM for comparison.

Table 3 should be here.

We compare the forecasting results of FASIM with those of FARM. For each time period and model, predictions are made using a moving-window approach with a window size of 90 months. Indexing the panel data from 1 within each of the two time periods, for each $t>90$ we use the 90 previous observations $\{(\bm{x}_{t-90},Y_{t-90}),\ldots,(\bm{x}_{t-1},Y_{t-1})\}$ to train the models (FASIM and FARM) and output predictions $\widehat{Y}_{\text{FASIM},t}$ and $\widehat{Y}_{\text{FARM},t}$. For FASIM, as defined in Equation (1), after obtaining the estimated parameters $\widehat{\bm{\beta}}_{h}$ and $\widehat{\bm{\gamma}}_{h}$, the estimated latent factor vector $\widehat{\bm{f}}_{t}$, and the estimated idiosyncratic component $\widehat{\bm{u}}_{t}$ at the $t$-th time point, we obtain the estimator $\widehat{g}$ of the unknown link function $g$ via spline regression. The predicted value of the response is then calculated as $\widehat{Y}_{t}=\widehat{g}(\widehat{\bm{f}}_{t}^{\top}\widehat{\bm{\gamma}}+\widehat{\bm{u}}_{t}^{\top}\widehat{\bm{\beta}})$. The accuracy of FASIM and FARM is measured by the mean squared error (MSE) (Hastie et al., 2009), defined as

\[
\text{MSE}=\frac{1}{T-90}\sum_{t=91}^{T}(Y_{t}-\widehat{Y}_{t})^{2},
\]

where $T$ denotes the total number of data points in a given time period.
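The moving-window exercise can be summarized by the sketch below; `fit` and `predict` are hypothetical placeholders for the full FASIM (or FARM) pipeline of factor extraction, penalized estimation, and spline fitting of the link function.

```python
import numpy as np

def rolling_mse(X, y, fit, predict, window=90):
    """Moving-window forecast evaluation with the MSE defined above.

    X : (T, p) predictor panel; y : (T,) response series.
    fit(X_train, y_train) returns a fitted model; predict(model, x_t)
    returns the scalar forecast of Y_t from the covariate vector x_t.
    """
    T = len(y)
    sq_errors = []
    for t in range(window, T):       # forecasts for t = window + 1, ..., T
        model = fit(X[t - window:t], y[t - window:t])
        sq_errors.append((y[t] - predict(model, X[t])) ** 2)
    return float(np.mean(sq_errors))
```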

Table 4 should be here.

Table 4 presents the prediction accuracy results of FASIM and FARM for “GS5” on the original and polluted datasets within the two time periods. The performance of FASIM and FARM on the original data is similar. However, while the MSEs of both methods increase on the polluted dataset, the increase is substantial for FARM, whereas FASIM shows only a modest increase. This suggests that FASIM is more robust than FARM.

7 Conclusions and discussions

To capture nonlinearity in the presence of latent factors, in this paper we introduce a novel Factor Augmented sparse Single-Index Model (FASIM). For this newly proposed model, we first address whether the augmented part is necessary. We develop a score-type test statistic that requires estimating neither high-dimensional regression coefficients nor a high-dimensional precision matrix. We also propose a Gaussian multiplier bootstrap to determine the critical value for our proposed test statistic FAST, and the validity of our procedure is theoretically established under mild conditions. When the model passes the test, we employ the $\ell_{1}$ penalty to estimate the unknown parameters, establishing both $\ell_{1}$ and $\ell_{2}$ error bounds. We further introduce a debiased estimator and establish its asymptotic normality. Numerical results illustrate the robustness and effectiveness of our proposed procedures.

In practice, it would be of interest to test whether the FASIM actually reduces to the FARM. We may also consider multiple testing problems for the FASIM. Furthermore, the condition $\kappa_{h}\neq 0$ excludes even link functions, and in particular the problem of sparse phase retrieval. We will explore these topics in the near future.

Appendix

Appendix A Proofs of Theorems

In the following, we present the proofs of our main theorems. To save space, the proofs of some technical lemmas are relegated to the Supplementary Material.

A.1 Proof of Theorem 3.1

Proof.

By Lemma 2 in the Supplementary Material, with high probability, we have

\[
\left|\max_{j\in[p]}|T_{nj}|-\max_{j\in[p]}|S_{nj}|\right|\leq\max_{j\in[p]}\left|T_{nj}-S_{nj}\right|\leq Cr(n,p),
\]

for some constant $C$, as $(n,p)\rightarrow\infty$. Here $r(n,p)=O\left\{(\log p)^{3/2}\log(np)/n+(\log p)(\log n)/\sqrt{n}\right\}$. This implies that

\begin{align}
&\left|\Pr\left(\max_{j\in[p]}|T_{nj}|\leq t\right)-\Pr\left(\|\bm{\mathrm{Z}}\|_{\infty}\leq t\right)\right|\nonumber\\
&\qquad\leq\underbrace{\left|\Pr\left\{\max_{j\in[p]}|S_{nj}|\leq t+Cr(n,p)\right\}-\Pr\left\{\|\bm{\mathrm{Z}}\|_{\infty}\leq t+Cr(n,p)\right\}\right|}_{\text{I}_{\text{B},1}}\nonumber\\
&\qquad\quad+\underbrace{\left|\Pr\left\{\|\bm{\mathrm{Z}}\|_{\infty}\leq t+Cr(n,p)\right\}-\Pr\left(\|\bm{\mathrm{Z}}\|_{\infty}\leq t\right)\right|}_{\text{I}_{\text{B},2}}.\tag{A.1}
\end{align}

For the term $\text{I}_{\text{B},1}$ in (A.1), recall that $S_{nj}=n^{-1/2}\sum_{i=1}^{n}\left[\left\{F(Y_{i})-1/2-\bm{f}_{i}^{\top}\bm{\gamma}_{h}\right\}\left(X_{ij}-\bm{f}_{i}^{\top}\bm{b}_{j}\right)+m_{j}(Y_{i})\right]$ with $m_{j}(y)=\mathbb{E}[(X_{1j}-\bm{f}_{1}^{\top}\bm{b}_{j})\{\mathbb{I}(Y\geq y)-F(Y)\}]$. By the sub-Gaussian assumptions on $\bm{f}$ and $\bm{u}$, the summands $\left\{F(Y_{i})-1/2-\bm{f}_{i}^{\top}\bm{\gamma}_{h}\right\}\left(X_{ij}-\bm{f}_{i}^{\top}\bm{b}_{j}\right)+m_{j}(Y_{i})$, $i=1,\ldots,n$, form a sub-exponential sequence with bounded $\psi_{1}$-norm. In addition, by Assumption 3, applying Lemma 3 in the Supplementary Material, we have

\[
\text{I}_{\text{B},1}=\lim_{n\rightarrow\infty}\sup_{t\in\mathbb{R}}\left|\Pr\left\{\max_{j\in[p]}|S_{nj}|\leq t+Cr(n,p)\right\}-\Pr\left\{\|\bm{\mathrm{Z}}\|_{\infty}\leq t+Cr(n,p)\right\}\right|=0.\tag{A.2}
\]

For the term $\text{I}_{\text{B},2}$ in (A.1), by Lemma 4 in the Supplementary Material, we have

\begin{align}
\text{I}_{\text{B},2}&=\left|\Pr\left\{\|\bm{\mathrm{Z}}\|_{\infty}\leq t+Cr(n,p)\right\}-\Pr\left(\|\bm{\mathrm{Z}}\|_{\infty}\leq t\right)\right|\nonumber\\
&\lesssim r(n,p)\sqrt{1\vee\log\left\{\frac{p}{r(n,p)}\right\}}\rightarrow 0.\tag{A.3}
\end{align}

Combining (A.2) and (A.3) with (A.1), we have

\[
\left|\Pr\left(\max_{j\in[p]}|T_{nj}|\leq t\right)-\Pr\left(\|\bm{\mathrm{Z}}\|_{\infty}\leq t\right)\right|\rightarrow 0.
\]

The proof is completed. ∎

A.2 Proof of Theorem 3.2

Proof.

First, we have the following decomposition:

\begin{align*}
\left|\Pr\left(M_{n}\leq x\right)-\Pr^{*}(\widehat{\mathcal{G}}\leq x)\right|&=\left|\Pr\left(M_{n}\leq x\right)-\Pr\left(\|\bm{\mathrm{Z}}\|_{\infty}\leq x\right)+\Pr\left(\|\bm{\mathrm{Z}}\|_{\infty}\leq x\right)-\Pr^{*}(\widehat{\mathcal{G}}\leq x)\right|\\
&\leq\left|\Pr\left(M_{n}\leq x\right)-\Pr\left(\|\bm{\mathrm{Z}}\|_{\infty}\leq x\right)\right|+\left|\Pr\left(\|\bm{\mathrm{Z}}\|_{\infty}\leq x\right)-\Pr^{*}(\widehat{\mathcal{G}}\leq x)\right|.
\end{align*}

By Theorem 3.1, we have

\[
\left|\Pr\left(M_{n}\leq x\right)-\Pr\left(\|\bm{\mathrm{Z}}\|_{\infty}\leq x\right)\right|\rightarrow 0.
\]

We write

\[
\rho^{*}=\sup_{x>0}\left|\Pr\left(\|\bm{\mathrm{Z}}\|_{\infty}\leq x\right)-\Pr^{*}(\widehat{\mathcal{G}}\leq x)\right|.\tag{A.4}
\]

Denote $\widetilde{S}_{nj}=n^{-1/2}\sum_{i=1}^{n}\left[\left\{F_{n}(Y_{i})-1/2-\widehat{\bm{f}}_{i}^{\top}\widehat{\bm{\gamma}}_{h}\right\}\widehat{U}_{ij}+\widehat{m}_{j}(Y_{i})\right]\mathrm{N}_{i}$. Given the dataset $\mathcal{D}=\{\bm{x}_{1},\ldots,\bm{x}_{n},Y_{1},\ldots,Y_{n}\}$, the covariance matrix of $\widetilde{\bm{S}}_{n}=(\widetilde{S}_{nj})_{j=1}^{p}$ is

\[
\widetilde{\bm{\Omega}}=\mathrm{Cov}(\widetilde{\bm{S}}_{n}\mid\mathcal{D}),
\]

with $\widetilde{\Omega}_{jk}=n^{-1}\sum_{i=1}^{n}\{\widehat{U}_{ij}\widehat{e}_{hi}+\widehat{m}_{j}(Y_{i})\}\{\widehat{U}_{ik}\widehat{e}_{hi}+\widehat{m}_{k}(Y_{i})\}$, $j,k=1,\ldots,p$, and $\widehat{e}_{hi}=F_{n}(Y_{i})-1/2-\widehat{\bm{f}}_{i}^{\top}\widehat{\bm{\gamma}}_{h}$. Recall that $\bm{\Omega}^{*}=\mathrm{Cov}(\bm{S}_{n})$, with $\Omega_{jk}^{*}=\mathbb{E}\{U_{ij}e_{hi}+m_{j}(Y_{i})\}\{U_{ik}e_{hi}+m_{k}(Y_{i})\}$, $j,k=1,\ldots,p$, and $e_{hi}=F(Y_{i})-1/2-\bm{f}_{i}^{\top}\bm{\gamma}_{h}$. By Lemma 5 in the Supplementary Material, we have

\[
\|\bm{\Omega}^{*}-\widetilde{\bm{\Omega}}\|_{\max}=o_{\mathbb{P}}\left(\frac{1}{\log p}\right).
\]

Then it follows from Lemma 2.1 in Chernozhukov et al., (2023) that $\rho^{*}\stackrel{\mathbb{P}}{\rightarrow}0$.

Combining this result with Theorem 3.1, we have

\[
\left|\Pr\left(M_{n}\leq x\right)-\Pr^{*}(\widehat{\mathcal{G}}\leq x)\right|\rightarrow 0.\tag{A.5}
\]

The proof is completed. ∎

A.3 Proof of Theorem 3.3

Proof.

First, we have the following decomposition:

\[
\underbrace{\max_{j\in[p]}\left|\sqrt{n}\beta_{hj}\mu_{j}\right|}_{\text{I}_{\text{H}_{1}}}\leq\max_{j\in[p]}|T_{nj}|+\underbrace{\max_{j\in[p]}\left|\sqrt{n}\beta_{hj}\mu_{j}-T_{nj}\right|}_{\text{I}_{\text{H}_{2}}}.\tag{A.6}
\]

For the term $\text{I}_{\text{H}_{2}}=\max_{j\in[p]}\left|\sqrt{n}\beta_{hj}\mu_{j}-T_{nj}\right|$ in (A.6), denote $\eta^{\natural}=\max_{j\in[p]}|(T_{nj}-\sqrt{n}\beta_{hj}\mu_{j})-(S_{nj}-\sqrt{n}\beta_{hj}\mu_{j})|$. By Lemma 2 in the Supplementary Material, $\eta^{\natural}=O_{\mathbb{P}}\{r(n,p)\}$, with $r(n,p)=(\log p)^{3/2}\log(np)/n+(\log p)(\log n)/\sqrt{n}+s\|\bm{\beta}_{h}\|_{\infty}\log p/\sqrt{n}+\max_{j\in[p]}|\mu_{j}\beta_{hj}|/\sqrt{n}$. For simplicity, we rewrite $S_{nj}-\sqrt{n}\beta_{hj}\mu_{j}=\sum_{i=1}^{n}\varpi_{ij}/\sqrt{n}$, where $\varpi_{ij}=\left\{F(Y_{i})-1/2-\bm{f}_{i}^{\top}\bm{\gamma}_{h}\right\}U_{ij}+m_{j}(Y_{i})-\beta_{hj}\mu_{j}$. Under Assumption 1, it is easy to show that $\{\varpi_{ij}\}_{i=1}^{n}$ is an i.i.d. mean-zero sub-exponential random variable sequence. Assume that $\mu_{j}=\mathbb{E}(U_{ij}^{2})\geq C_{\min}$. Denoting $K^{\prime}=\max_{i\in[n]}\|\varpi_{ij}\|_{\psi_{1}}$, by Bernstein's inequality we have

\[
\max_{j\in[p]}\left|S_{nj}-\sqrt{n}\beta_{hj}\mu_{j}\right|=\max_{j\in[p]}\left|\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\varpi_{ij}\right|\leq C_{\min}\sqrt{\frac{\log p}{2}},\tag{A.7}
\]

with probability at least $1-2\exp\{-C(\log p)/(2K^{\prime 2})\}$. Therefore, with high probability, we have

\[
\max_{j\in[p]}\left|T_{nj}-\sqrt{n}\beta_{hj}\mu_{j}\right|\leq\max_{j\in[p]}\left|S_{nj}-\sqrt{n}\beta_{hj}\mu_{j}\right|+\eta^{\natural}\leq C_{\min}\sqrt{\frac{\log p}{2}}+Cs\|\bm{\beta}_{h}\|_{\infty}\frac{\log p}{\sqrt{n}}+C\max_{j\in[p]}\frac{|\mu_{j}\beta_{hj}|}{\sqrt{n}}.\tag{A.8}
\]
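For the reader's convenience, the Bernstein step behind (A.7) is the standard tail bound for a normalized sum of independent, mean-zero sub-exponential variables: for every $t\geq 0$ and a generic constant $c>0$,
\[
\Pr\left(\left|\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\varpi_{ij}\right|\geq t\right)\leq 2\exp\left\{-c\min\left(\frac{t^{2}}{K^{\prime 2}},\ \frac{\sqrt{n}\,t}{K^{\prime}}\right)\right\}.
\]
Taking $t=C_{\min}\sqrt{(\log p)/2}$, the quadratic term dominates once $\log p=o(n)$, and the union bound over $j\in[p]$ is absorbed into the constant $C$ in the probability statement following (A.7).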

For any $\bm{\beta}_{h}\in\bm{\Delta}(2+\varrho_{0})$, we have

\[
\text{I}_{\text{H}_{1}}=\max_{j\in[p]}\left|\sqrt{n}\beta_{hj}\mu_{j}\right|>C_{\min}\sqrt{(2+\varrho_{0})\log p}.\tag{A.9}
\]

Substituting (A.8) and (A.9) into (A.6), we have

\begin{align*}
\max_{j\in[p]}\left|T_{nj}\right|&\geq\max_{j\in[p]}\left|\sqrt{n}\beta_{hj}\mu_{j}\right|-\max_{j\in[p]}\left|\sqrt{n}\beta_{hj}\mu_{j}-T_{nj}\right|\\
&\geq C_{\min}\sqrt{n}\|\bm{\beta}_{h}\|_{\infty}-C_{\min}\sqrt{\frac{\log p}{2}}-Cs\|\bm{\beta}_{h}\|_{\infty}\frac{\log p}{\sqrt{n}}-C\max_{j\in[p]}\frac{|\mu_{j}\beta_{hj}|}{\sqrt{n}}.
\end{align*}

By Theorem 3.2, $\widehat{c}_{1-\alpha}$ is asymptotically equal to the $(1-\alpha)$-th quantile of $\max_{j\in[p]}\left|T_{nj}-\sqrt{n}\beta_{hj}\mu_{j}\right|$. By (A.8), we have $\widehat{c}_{1-\alpha}\leq C_{\min}\sqrt{(\log p)/2}+Cs\|\bm{\beta}_{h}\|_{\infty}\log p/\sqrt{n}+C\max_{j\in[p]}|\mu_{j}\beta_{hj}|/\sqrt{n}$ with high probability. Therefore,

\[
\lim_{(n,p)\rightarrow\infty}\inf_{\bm{\beta}_{h}\in\bm{\Delta}(2+\varrho_{0})}\Pr\left(M_{n}\geq\widehat{c}_{1-\alpha}\right)=1.
\]

The proof is completed. ∎

A.4 Proof of Theorem 4.1

Proof.

Recall that $\widehat{\bm{\beta}}_{h}$ is the solution to the optimization problem

\[
\widehat{\bm{\beta}}_{h}=\arg\min_{\bm{\beta}\in\mathbb{R}^{p}}\left\{\frac{1}{2n}\left\|\widetilde{F}_{n}(\bm{Y})-\widehat{\bm{U}}\bm{\beta}\right\|_{2}^{2}+\lambda\|\bm{\beta}\|_{1}\right\}.
\]
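As a minimal illustration (ours, not the paper's implementation), this optimization can be carried out with scikit-learn once $\widehat{\bm{U}}$ and the projected rank-transformed response $\widetilde{F}_{n}(\bm{Y})$ have been computed, since the Lasso objective in scikit-learn coincides with the display above.

```python
from sklearn.linear_model import Lasso

def fit_beta_h(U_hat, r, lam):
    """Solve min_beta (2n)^{-1} * ||r - U_hat @ beta||_2^2 + lam * ||beta||_1.

    U_hat : (n, p) estimated idiosyncratic components.
    r     : (n,) projected rank-transformed response F~_n(Y).
    lam   : penalty level, of order sqrt(log(p) / n) as in Theorem 4.1.
    """
    # sklearn's Lasso minimizes (1 / (2 * n_samples)) * ||r - U b||_2^2
    # + alpha * ||b||_1, which matches the display with alpha = lam.
    model = Lasso(alpha=lam, fit_intercept=False)
    model.fit(U_hat, r)
    return model.coef_
```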

For the loss function $\mathcal{L}_{n}(\bm{\beta}_{h})=(2n)^{-1}\|(\bm{I}_{n}-\widehat{\bm{P}})\{F_{n}(\bm{Y})-1/2\}-\widehat{\bm{U}}\bm{\beta}_{h}\|_{2}^{2}$ in (4.2), to establish the $\ell_{2}$-norm error bound for $\widehat{\bm{\beta}}_{h}$, we aim to prove the following inequality:

\[
C\|\widehat{\bm{\beta}}_{h}-\bm{\beta}_{h}\|_{2}^{2}\leq\langle\nabla\mathcal{L}_{n}(\widehat{\bm{\beta}}_{h})-\nabla\mathcal{L}_{n}(\bm{\beta}_{h}),\widehat{\bm{\beta}}_{h}-\bm{\beta}_{h}\rangle\leq C_{1}\lambda\sqrt{s}\|\widehat{\bm{\beta}}_{h}-\bm{\beta}_{h}\|_{2},\tag{A.10}
\]

where $\nabla\mathcal{L}_{n}(\bm{\beta}_{h})$ is the gradient of the loss function $\mathcal{L}_{n}(\bm{\beta}_{h})$, $s=\|\bm{\beta}_{h}\|_{0}$, and $\lambda\asymp\sqrt{(\log p)/n}$, with positive constants $C$ and $C_{1}$.

We first derive the upper bound in (A.10). By the KKT condition, there exists a subgradient $\bm{\kappa}\in\partial\|\widehat{\bm{\beta}}_{h}\|_{1}$ such that $\nabla\mathcal{L}_{n}(\widehat{\bm{\beta}}_{h})+\lambda\bm{\kappa}=\bm{0}$. We then derive

\begin{align}
\langle\nabla\mathcal{L}_{n}(\widehat{\bm{\beta}}_{h})-\nabla\mathcal{L}_{n}(\bm{\beta}_{h}),\widehat{\bm{\beta}}_{h}-\bm{\beta}_{h}\rangle&=\nabla\mathcal{L}_{n}(\widehat{\bm{\beta}}_{h})^{\top}(\widehat{\bm{\beta}}_{h}-\bm{\beta}_{h})-\nabla\mathcal{L}_{n}(\bm{\beta}_{h})^{\top}(\widehat{\bm{\beta}}_{h}-\bm{\beta}_{h})\nonumber\\
&=\underbrace{-\lambda\bm{\kappa}^{\top}(\widehat{\bm{\beta}}_{h}-\bm{\beta}_{h})}_{\text{I}_{1}}+\underbrace{\nabla\mathcal{L}_{n}(\bm{\beta}_{h})^{\top}(\bm{\beta}_{h}-\widehat{\bm{\beta}}_{h})}_{\text{I}_{2}}.\tag{A.11}
\end{align}

Denote $\widehat{\bm{\theta}}_{h}=\widehat{\bm{\beta}}_{h}-\bm{\beta}_{h}$. For $\text{I}_{1}=-\lambda\bm{\kappa}^{\top}(\widehat{\bm{\beta}}_{h}-\bm{\beta}_{h})$ in (A.11), by the definition of the subgradient, we have

\begin{align}
\text{I}_{1}&=-\lambda\bm{\kappa}^{\top}(\widehat{\bm{\beta}}_{h}-\bm{\beta}_{h})\nonumber\\
&\leq\lambda(\|\bm{\beta}_{h}\|_{1}-\|\widehat{\bm{\beta}}_{h}\|_{1})\nonumber\\
&=\lambda(\|\bm{\beta}_{h\mathcal{S}}\|_{1}-\|\widehat{\bm{\theta}}_{h\mathcal{S}}+\bm{\beta}_{h\mathcal{S}}\|_{1}-\|\widehat{\bm{\theta}}_{h\mathcal{S}^{C}}\|_{1})\nonumber\\
&\leq\lambda(\|\widehat{\bm{\theta}}_{h\mathcal{S}}\|_{1}-\|\widehat{\bm{\theta}}_{h\mathcal{S}^{C}}\|_{1}),\tag{A.12}
\end{align}

where $\mathcal{S}=\{j\in[p]:\beta_{h,j}\neq 0\}$ and $\mathcal{S}^{C}=[p]\setminus\mathcal{S}$. For $\text{I}_{2}=\nabla\mathcal{L}_{n}(\bm{\beta}_{h})^{\top}(\bm{\beta}_{h}-\widehat{\bm{\beta}}_{h})$ in (A.11), by Hölder's inequality and Lemma 6 in the Supplementary Material, we have

\begin{align}
\text{I}_{2}&=\nabla\mathcal{L}_{n}(\bm{\beta}_{h})^{\top}(\bm{\beta}_{h}-\widehat{\bm{\beta}}_{h})\nonumber\\
&\leq\|\nabla\mathcal{L}_{n}(\bm{\beta}_{h})\|_{\infty}\|\bm{\beta}_{h}-\widehat{\bm{\beta}}_{h}\|_{1}\nonumber\\
&\lesssim\sqrt{\frac{\log p}{n}}\|\widehat{\bm{\theta}}_{h}\|_{1}.\tag{A.13}
\end{align}

Recall that $\lambda\asymp\sqrt{(\log p)/n}$, and assume that $\lambda/2>C\sqrt{(\log p)/n}$. By the triangle and Cauchy-Schwarz inequalities, combining (A.12) with (A.13), we have

\begin{align}
\langle\nabla\mathcal{L}_{n}(\widehat{\bm{\beta}}_{h})-\nabla\mathcal{L}_{n}(\bm{\beta}_{h}),\widehat{\bm{\beta}}_{h}-\bm{\beta}_{h}\rangle&\leq\lambda\left(\|\widehat{\bm{\theta}}_{h\mathcal{S}}\|_{1}-\|\widehat{\bm{\theta}}_{h\mathcal{S}^{C}}\|_{1}\right)+C\sqrt{\frac{\log p}{n}}\left(\|\widehat{\bm{\theta}}_{h\mathcal{S}}\|_{1}+\|\widehat{\bm{\theta}}_{h\mathcal{S}^{C}}\|_{1}\right)\nonumber\\
&=\left\{\lambda+C\sqrt{\frac{\log p}{n}}\right\}\|\widehat{\bm{\theta}}_{h\mathcal{S}}\|_{1}+\left\{C\sqrt{\frac{\log p}{n}}-\lambda\right\}\|\widehat{\bm{\theta}}_{h\mathcal{S}^{C}}\|_{1}\nonumber\\
&\leq\frac{3}{2}\lambda\sqrt{s}\|\widehat{\bm{\beta}}_{h}-\bm{\beta}_{h}\|_{2}.\tag{A.14}
\end{align}

The inequality (A.14) shows that the upper bound in (A.10) holds.

Next, we focus on proving the lower bound in (A.10). Note that for any $\Delta\in\mathbb{R}^{p}$,

\[
\varepsilon_{n}(\Delta)=\left\{\nabla\mathcal{L}_{n}(\bm{\beta}_{h}+\Delta)-\nabla\mathcal{L}_{n}(\bm{\beta}_{h})\right\}^{\top}\Delta=\frac{1}{n}(\widehat{\bm{U}}\Delta)^{\top}(\widehat{\bm{U}}\Delta).
\]

Because $\langle\nabla\mathcal{L}_{n}(\widehat{\bm{\beta}}_{h})-\nabla\mathcal{L}_{n}(\bm{\beta}_{h}),\widehat{\bm{\beta}}_{h}-\bm{\beta}_{h}\rangle\geq 0$, it follows from (A.14) that $0\leq(3\lambda/2)\|\widehat{\bm{\theta}}_{h\mathcal{S}}\|_{1}-(\lambda/2)\|\widehat{\bm{\theta}}_{h\mathcal{S}^{C}}\|_{1}$, and hence $\|\widehat{\bm{\theta}}_{h\mathcal{S}^{C}}\|_{1}\leq 3\|\widehat{\bm{\theta}}_{h\mathcal{S}}\|_{1}$. Besides, we have the sparsity assumption that $s\{(\log p)/n+1/p\}\rightarrow 0$. Then, according to Lemma C.2 in Fan et al., (2024), we have

\[
\Pr\left\{\frac{\left|\frac{1}{n}(\widehat{\bm{U}}\widehat{\bm{\theta}}_{h})^{\top}(\widehat{\bm{U}}\widehat{\bm{\theta}}_{h})\right|}{\|\widehat{\bm{\theta}}_{h}\|_{2}^{2}}\geq C\right\}\rightarrow 1.
\]

That is, with high probability, we have

\[
\langle\nabla\mathcal{L}_{n}(\widehat{\bm{\beta}}_{h})-\nabla\mathcal{L}_{n}(\bm{\beta}_{h}),\widehat{\bm{\beta}}_{h}-\bm{\beta}_{h}\rangle\geq C\|\widehat{\bm{\beta}}_{h}-\bm{\beta}_{h}\|_{2}^{2},\tag{A.15}
\]

which shows that the lower bound in (A.10) holds. Therefore, we obtain the $\ell_{2}$-norm error bound for $\widehat{\bm{\beta}}_{h}$, i.e.,

\[
\|\widehat{\bm{\beta}}_{h}-\bm{\beta}_{h}\|_{2}\leq C\lambda\sqrt{s},\tag{A.16}
\]

where $\lambda\asymp\sqrt{(\log p)/n}$. Moreover, the cone condition $\|\widehat{\bm{\theta}}_{h\mathcal{S}^{C}}\|_{1}\leq 3\|\widehat{\bm{\theta}}_{h\mathcal{S}}\|_{1}$ implies

\[
\|\widehat{\bm{\beta}}_{h}-\bm{\beta}_{h}\|_{1}\leq C\sqrt{s}\|\widehat{\bm{\beta}}_{h}-\bm{\beta}_{h}\|_{2}.
\]

Therefore,

\[
\|\widehat{\bm{\beta}}_{h}-\bm{\beta}_{h}\|_{1}\leq C\lambda s.\tag{A.17}
\]

Here, $\lambda\asymp\sqrt{(\log p)/n}$.
The proof is completed. ∎

A.5 Proof of Theorem 4.2

Proof.

Recall that $\bm{m}=\{m_{1}(Y),\ldots,m_{p}(Y)\}^{\top}$, with $m_{j}(y)=\mathbb{E}[(X_{1j}-\bm{f}_{1}^{\top}\bm{b}_{j})\{\mathbb{I}(Y\geq y)-F(Y)\}]$. Let $\bm{m}_{i}=\{m_{1}(Y_{i}),\ldots,m_{p}(Y_{i})\}^{\top}$, $i=1,\ldots,n$, be the i.i.d. copies of $\bm{m}$. We have the following decomposition:

\[
\sqrt{n}\left(\widetilde{\bm{\beta}}_{h}-\bm{\beta}_{h}\right)=\bm{Z}_{b}+\bm{E}.
\]

Here, $\bm{Z}_{b}=n^{-1/2}\sum_{i=1}^{n}\bm{\Sigma}_{u}^{-1}\left(\bm{u}_{i}e_{hi}+\bm{m}_{i}\right)$ and $\bm{E}=\sqrt{n}\left(\widetilde{\bm{\beta}}_{h}-\bm{\beta}_{h}\right)-\bm{Z}_{b}$. By Lemma 9 in the Supplementary Material, as long as $s=o(\sqrt{n}/\log p)$ and $\log p\ll n^{(1-q)/(3-q)}$, we have $\|\bm{E}\|_{\infty}=o_{\mathbb{P}}(1)$, where $q$ is given in Assumption 5. The proof is completed. ∎

References

  • Ahn, S. C. and Horenstein, A. R. (2013). Eigenvalue ratio test for the number of factors. Econometrica, 81(3):1203–1227.
  • Bai, J. (2003). Inferential theory for factor models of large dimensions. Econometrica, 71(1):135–171.
  • Bai, J. and Ng, S. (2002). Determining the number of factors in approximate factor models. Econometrica, 70(1):191–221.
  • Bayle, P. and Fan, J. (2022). Factor-augmented regularized model for hazard regression. arXiv preprint arXiv:2210.01067.
  • Belloni, A., Chen, D., Chernozhukov, V., and Hansen, C. (2012). Sparse models and methods for optimal instruments with an application to eminent domain. Econometrica, 80(6):2369–2429.
  • Bing, X., Cheng, W., Feng, H., and Ning, Y. (2023). Inference in high-dimensional multivariate response regression with hidden variables. Journal of the American Statistical Association, pages 1–12.
  • Bühlmann, P., Kalisch, M., and Meier, L. (2014). High-dimensional statistics with a view toward applications in biology. Annual Review of Statistics and Its Application, 1(1):255–278.
  • Cai, T., Liu, W., and Luo, X. (2011). A constrained $\ell_{1}$ minimization approach to sparse precision matrix estimation. Journal of the American Statistical Association, 106(494):594–607.
  • Cai, T., Liu, W., and Xia, Y. (2014). Two-sample test of high dimensional means under dependence. Journal of the Royal Statistical Society Series B: Statistical Methodology, 76(2):349–372.
  • Cai, T., Liu, W., and Zhou, H. H. (2016). Estimating sparse precision matrix: Optimal rates of convergence and adaptive estimation. Annals of Statistics, 44(2):455–488.
  • Ćevid, D., Bühlmann, P., and Meinshausen, N. (2020). Spectral deconfounding via perturbed sparse linear models. Journal of Machine Learning Research, 21(1):9442–9482.
  • Chernozhukov, V., Chetverikov, D., and Koike, Y. (2023). Nearly optimal central limit theorem and bootstrap approximations in high dimensions. Annals of Applied Probability, 33(3):2374–2425.
  • Cook, R. D. and Ni, L. (2005). Sufficient dimension reduction via inverse regression: A minimum discrepancy approach. Journal of the American Statistical Association, 100(470):410–428.
  • Dezeure, R., Bühlmann, P., and Zhang, C. (2017). High-dimensional simultaneous inference with the bootstrap. Test, 26(4):685–719.
  • Eftekhari, H., Banerjee, M., and Ritov, Y. (2021). Inference in high-dimensional single-index models under symmetric designs. Journal of Machine Learning Research, 22(27):1–63.
  • Fan, J., Guo, J., and Zheng, S. (2022). Estimating number of factors by adjusted eigenvalues thresholding. Journal of the American Statistical Association, 117(538):852–861.
  • Fan, J., Li, R., Zhang, C., and Zou, H. (2020). Statistical Foundations of Data Science. Chapman and Hall/CRC.
  • Fan, J., Liao, Y., and Mincheva, M. (2013). Large covariance estimation by thresholding principal orthogonal complements. Journal of the Royal Statistical Society Series B: Statistical Methodology, 75(4):603–680.
  • Fan, J., Liu, H., and Wang, W. (2018). Large covariance estimation through elliptical factor models. Annals of Statistics, 46(4):1383.
  • Fan, J., Lou, Z., and Yu, M. (2024). Are latent factor regression and sparse regression adequate? Journal of the American Statistical Association, 119(546):1076–1088.
  • Fan, J., Wang, W., and Zhong, Y. (2019). Robust covariance estimation for approximate factor models. Journal of Econometrics, 208(1):5–22.
  • Fan, J., Xue, L., and Yao, J. (2017). Sufficient forecasting using factor models. Journal of Econometrics, 201(2):292–306.
  • Fewell, Z., Davey Smith, G., and Sterne, J. A. (2007). The impact of residual and unmeasured confounding in epidemiologic studies: a simulation study. American Journal of Epidemiology, 166(6):646–655.
  • Guo, Z., Ćevid, D., and Bühlmann, P. (2022). Doubly debiased lasso: High-dimensional inference under hidden confounding. Annals of Statistics, 50(3):1320.
  • Hall, P. and Li, K.-C. (1993). On almost linearity of low dimensional projections from high dimensional data. Annals of Statistics, 21(2):867–889.
  • Han, D., Han, M., Huang, J., and Lin, Y. (2023). Robust inference for high-dimensional single index models. Scandinavian Journal of Statistics, 50(4):1590–1615.
  • Hastie, T., Tibshirani, R., and Friedman, J. H. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction, volume 2. Springer.
  • Jiang, F., Ma, Y., and Wei, Y. (2019). Sufficient direction factor model and its application to gene expression quantitative trait loci discovery. Biometrika, 106(2):417–432.
  • Kim, J.-O. and Mueller, C. W. (1978). Factor Analysis: Statistical Methods and Practical Issues, volume 14. Sage.
  • Kong, E. and Xia, Y. (2007). Variable selection for the single-index model. Biometrika, 94(1):217–229.
  • Lam, C. and Yao, Q. (2012). Factor modeling for high-dimensional time series: inference for the number of factors. Annals of Statistics, 40(2):694–726.
  • Li, K.-C. (1991). Sliced inverse regression for dimension reduction. Journal of the American Statistical Association, 86(414):316–327.
  • Li, K.-C. and Duan, N. (1989). Regression analysis under link violation. Annals of Statistics, 17(3):1009–1052.
  • Listgarten, J., Kadie, C., Schadt, E. E., and Heckerman, D. (2010). Correction for hidden confounders in the genetic analysis of gene expression. Proceedings of the National Academy of Sciences, 107(38):16465–16470.
  • Luo, R., Wang, H., and Tsai, C.-L. (2009). Contour projected dimension reduction. Annals of Statistics, 37:3743–3778.
  • Luo, W., Xue, L., Yao, J., and Yu, X. (2022). Inverse moment methods for sufficient forecasting using high-dimensional predictors. Biometrika, 109(2):473–487.
  • Ma, R., Cai, T., and Li, H. (2021). Global and simultaneous hypothesis testing for high-dimensional logistic regression models. Journal of the American Statistical Association, 116(534):984–998.
  • McCracken, M. W. and Ng, S. (2016). FRED-MD: A monthly database for macroeconomic research. Journal of Business and Economic Statistics, 34(4):574–589.
  • Ning, Y. and Liu, H. (2017). A general theory of hypothesis tests and confidence regions for sparse high dimensional models. Annals of Statistics, 45(1):158–195.
  • Ouyang, J., Tan, K. M., and Xu, G. (2023). High-dimensional inference for generalized linear models with hidden confounding. Journal of Machine Learning Research, 24(296):1–61.
  • Plan, Y. and Vershynin, R. (2016). The generalized lasso with non-linear observations. IEEE Transactions on Information Theory, 62(3):1528–1537.
  • Radchenko, P. (2015). High dimensional single index models. Journal of Multivariate Analysis, 139:266–282.
  • Rejchel, W. and Bogdan, M. (2020). Rank-based Lasso – efficient methods for high-dimensional robust model selection. Journal of Machine Learning Research, 21(244):1–47.
  • Sun, Y., Ma, L., and Xia, Y. (2023). A decorrelating and debiasing approach to simultaneous inference for high-dimensional confounded models. Journal of the American Statistical Association, (to appear):1–24.
  • Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B: Statistical Methodology, 58(1):267–288.
  • Van de Geer, S., Bühlmann, P., Ritov, Y., and Dezeure, R. (2014). On asymptotically optimal confidence regions and tests for high-dimensional models. Annals of Statistics, 42(3):1166–1202.
  • Verzelen, N. (2012). Minimax risks for sparse regressions: Ultra-high dimensional phenomenons. Electronic Journal of Statistics, 6:38–90.
  • Wang, T., Xu, P., and Zhu, L. (2012). Non-convex penalized estimation in high-dimensional models with single-index structure. Journal of Multivariate Analysis, 109:221–235.
  • Yang, W., Shi, H., Guo, X., and Zou, C. (2024). Robust group and simultaneous inferences for high-dimensional single index model. Manuscript.
  • Zhang, C. and Zhang, S. S. (2014). Confidence intervals for low dimensional parameters in high dimensional linear models. Journal of the Royal Statistical Society Series B: Statistical Methodology, 76(1):217–242.
  • Zhang, X. and Cheng, G. (2017). Simultaneous inference for high-dimensional linear models. Journal of the American Statistical Association, 112(518):757–768.
  • Zhu, L. and Zhu, L. (2009). Nonconcave penalized inverse regression in single-index models with high dimensional predictors. Journal of Multivariate Analysis, 100(5):862–875.

Figures

Figure 1: Power curves under the linear model (5.1) with $p=200$ and $\bm{\beta}=\omega\ast\left(\mathbf{1}_{3},\mathbf{0}_{p-3}\right)^{\top}$. “FAST_i”, “FAST_ii”, “FabTest_i” and “FabTest_ii” denote the results of the FAST in this paper and the FabTest in Fan et al., (2024) under settings (5.3) and (5.4) for generating $\bm{F}$, respectively. The first row presents the results for the original data, and the second row presents the results after adding outliers. The first column shows the results when the error follows $\mathrm{N}(0,0.25)$, and the second column shows the results when the error follows $t_{3}$.
Figure 2: Power curves under the linear model (5.1) with $p=500$ and $\bm{\beta}=\omega\ast\left(\mathbf{1}_{3},\mathbf{0}_{p-3}\right)^{\top}$. “FAST_i”, “FAST_ii”, “FabTest_i” and “FabTest_ii” denote the results of the FAST in this paper and the FabTest in Fan et al., (2024) under settings (5.3) and (5.4) for generating $\bm{F}$, respectively. The first row presents the results for the original data, and the second row presents the results after adding outliers. The first column shows the results when the error follows $\mathrm{N}(0,0.25)$, and the second column shows the results when the error follows $t_{3}$.
Figure 3: Power curves under the non-linear model (5.2) with $p=200$ and $\bm{\beta}=\omega\ast\left(\mathbf{1}_{3},\mathbf{0}_{p-3}\right)^{\top}$. “FAST_i”, “FAST_ii”, “FabTest_i” and “FabTest_ii” denote the results of the FAST in this paper and the FabTest in Fan et al., (2024) under settings (5.3) and (5.4) for generating $\bm{F}$, respectively. The first row presents the results for the original data, and the second row presents the results after adding outliers. The first column shows the results when the error follows $\mathrm{N}(0,0.25)$, and the second column shows the results when the error follows $t_{3}$.
Figure 4: Power curves under the non-linear model (5.2) with $p=500$ and $\bm{\beta}=\omega\ast\left(\mathbf{1}_{3},\mathbf{0}_{p-3}\right)^{\top}$. “FAST_i”, “FAST_ii”, “FabTest_i” and “FabTest_ii” denote the results of the FAST in this paper and the FabTest in Fan et al., (2024) under settings (5.3) and (5.4) for generating $\bm{F}$, respectively. The first row presents the results for the original data, and the second row presents the results after adding outliers. The first column shows the results when the error follows $\mathrm{N}(0,0.25)$, and the second column shows the results when the error follows $t_{3}$.
Figure 5: The relative errors of $\widehat{\bm{\beta}}_{h}$ and $\widehat{\bm{\beta}}$. Panels (a), (b) and (c) depict the estimation results for the linear model (5.1) with noise $\varepsilon$ following $\mathrm{N}(0,1)$, $\mathrm{Unif}(-3/2,3/2)$ and $t_{3}$, respectively. “FASIM_Lasso”, “SIM_Lasso” and “FA_Lasso” denote the relative errors of the parameter $\bm{\beta}_{h}$ under the FASIM in this paper, under the SIM without the factor effect in Rejchel and Bogdan, (2020), and of the parameter $\bm{\beta}$ under the FARM in Fan et al., (2024), respectively.
Figure 6: The relative errors of $\widehat{\bm{\beta}}_{h}$ and $\widehat{\bm{\beta}}$. Panels (a), (b) and (c) depict the estimation results for the nonlinear model (5.2) with noise $\varepsilon$ following $\mathrm{N}(0,1)$, $\mathrm{Unif}(-3/2,3/2)$ and $t_{3}$, respectively. “FASIM_Lasso”, “SIM_Lasso” and “FA_Lasso” denote the relative errors of the parameter $\bm{\beta}_{h}$ under the FASIM in this paper, under the SIM without the factor effect in Rejchel and Bogdan, (2020), and of the parameter $\bm{\beta}$ under the FARM in Fan et al., (2024), respectively.
Figure 7: The relative errors of $\widehat{\bm{\beta}}_{h}$ and $\widehat{\bm{\beta}}$ with outliers. Panels (a), (b) and (c) depict the estimation results for the linear model (5.1) with noise $\varepsilon$ following $\mathrm{N}(0,1)$, $\mathrm{Unif}(-3/2,3/2)$ and $t_{3}$, respectively. “FASIM_Lasso”, “SIM_Lasso” and “FA_Lasso” denote the relative errors of the parameter $\bm{\beta}_{h}$ under the FASIM in this paper, under the SIM without the factor effect in Rejchel and Bogdan, (2020), and of the parameter $\bm{\beta}$ under the FARM in Fan et al., (2024), respectively.
Figure 8: The relative errors of $\widehat{\bm{\beta}}_{h}$ and $\widehat{\bm{\beta}}$ with outliers. Panels (a), (b) and (c) depict the estimation results for the nonlinear model (5.2) with noise $\varepsilon$ following $\mathrm{N}(0,1)$, $\mathrm{Unif}(-3/2,3/2)$ and $t_{3}$, respectively. “FASIM_Lasso”, “SIM_Lasso” and “FA_Lasso” denote the relative errors of the parameter $\bm{\beta}_{h}$ under the FASIM in this paper, under the SIM without the factor effect in Rejchel and Bogdan, (2020), and of the parameter $\bm{\beta}$ under the FARM in Fan et al., (2024), respectively.

Tables

Table 1: The average computation time (unit: seconds). “FAST_i”, “FAST_ii”, “FabTest_i” and “FabTest_ii” denote the results of FAST and FabTest under settings (5.3) and (5.4) for generating $\bm{F}$, respectively.
p    ε                      FAST_i  FabTest_i  FAST_ii  FabTest_ii
200  $\mathrm{N}(0,0.25)$   1.50    46.33      1.60     44.53
200  $t_{3}$                1.61    45.62      1.48     47.08
500  $\mathrm{N}(0,0.25)$   3.72    156.00     3.75     156.53
500  $t_{3}$                3.65    208.86     3.73     146.21
Table 2: The empirical coverage probabilities and average lengths of the confidence intervals.
Model    p    ε                      CP     CP_S   CP_{S^C}  AL     AL_S   AL_{S^C}
Model 1  500  $\mathrm{N}(0,0.25)$   0.951  0.955  0.951     0.040  0.039  0.040
Model 1  500  $t_{3}$                0.947  0.950  0.947     0.066  0.064  0.066
Model 1  200  $\mathrm{N}(0,0.25)$   0.952  0.940  0.952     0.040  0.039  0.040
Model 1  200  $t_{3}$                0.946  0.943  0.946     0.067  0.066  0.067
Model 2  500  $\mathrm{N}(0,0.25)$   0.951  0.966  0.951     0.041  0.040  0.041
Model 2  500  $t_{3}$                0.946  0.930  0.946     0.069  0.068  0.069
Model 2  200  $\mathrm{N}(0,0.25)$   0.951  0.974  0.950     0.041  0.040  0.041
Model 2  200  $t_{3}$                0.947  0.956  0.947     0.067  0.066  0.067
Table 3: The $p$-values for the original and polluted datasets in the two time periods.
Time Period      Data      F_SIM   F_LM
1992.02-2007.10  original  0.0000  $2\times 10^{-3}$
1992.02-2007.10  polluted  0.0000  0.0000
2010.08-2020.02  original  0.0000  $1.7\times 10^{-2}$
2010.08-2020.02  polluted  0.0000  0.2015
Table 4: The MSE for the original and polluted datasets in the two time periods.
Time Period      Data      FARM    FASIM
1992.02-2007.10  original  0.2620  0.2764
1992.02-2007.10  polluted  1.0592  0.9251
2010.08-2020.02  original  0.3328  0.3748
2010.08-2020.02  polluted  0.9432  0.8764