
Sparsity learning via structured functional factor augmentation

Hanteng Ma
School of Statistics and Data Science, Shanghai University of Finance and Economics
Ziliang Shen
School of Statistics and Data Science, Shanghai University of Finance and Economics
Xingdong Feng
School of Statistics and Data Science, Shanghai University of Finance and Economics
Xin Liu  
School of Statistics and Data Science, Shanghai University of Finance and Economics
Abstract

As one of the most powerful tools for examining the association between functional covariates and a response, the functional regression model has been widely adopted in various interdisciplinary studies. Usually, a limited number of functional covariates is assumed in a functional linear regression model. Nevertheless, correlations may exist between functional covariates in high-dimensional functional linear regression models, which poses significant challenges to statistical inference and functional variable selection. In this article, a novel functional factor augmentation structure (fFAS) is proposed for multivariate functional series, and a multivariate functional factor augmentation selection model (fFASM) is further developed to address variable selection with correlated functional covariates. Theoretical justifications for the proposed fFAS are provided, and statistical inference results for the proposed fFASM are established. Numerical investigations support the superior performance of the novel fFASM in terms of estimation accuracy and selection consistency.


Keywords: correlated functional covariates, functional factor augmentation structure, functional variable selection, factor augmentation regression.

1 Introduction

Functional data, usually referred to as a sequential collection of instances over time with serial dependence, have been widely observed in various scenarios from different scientific disciplines, such as earth, medical, and social sciences (Peng et al., 2005; Centofanti et al., 2021). Instead of employing conventional time series modeling techniques, an underlying trajectory is usually assumed for a functional process and sampled at a set of time points in some interval, so that statistical modeling and inference are feasible, including estimation of the mean process and its intrinsic covariance structure, trajectory recovery, and prediction of a functional; see, for example, Ramsay & Silverman (2005); Yao et al. (2005); Ferraty (2006); Hall & Horowitz (2007); Hörmann & Kokoszka (2010); Cuevas (2014); Hörmann et al. (2015); Aneiros, Horová, Hušková & Vieu (2022); Petersen (2024), among others.

Conventionally, functional data analysis has focused on a single or a fixed number of separate functional series, while recent studies have explored associations between multiple functional series. The most widely adopted model is arguably the functional linear model, which links a response linearly with a set of covariates, at least one of which is functional, through unknown functional coefficient curves. Such a model offers a statistical tool to infer the linear association between the response and the functional covariates. Several studies have proceeded along this path, employing parametric, non-parametric, and semi-parametric models; see, for instance, Cardot et al. (2003); Yao et al. (2005); Li & Hsing (2007); Müller & Yao (2008); Yuan & Cai (2010); Chen et al. (2011); Kong et al. (2016); Chen et al. (2022), among others.

However, two main issues arise in multivariate functional data analysis in practice. One is that functional series may be correlated with each other, especially in high-dimensional or sparsely observed cases, while existing functional linear model analyses typically assume independence between multiple functional covariates. To address this problem, some efforts have been made, for example, to explicitly assume pairwise covariance structures for functional covariates (Chiou & Müller, 2016). However, these methods turn out to be less robust or accurate due to a lack of flexibility and computational efficiency. Alternatively, one can describe such associations using a low-rank factor model (Bai, 2003; Fan et al., 2020), where all functional covariates are assumed to share a finite number of latent functional factors (Castellanos et al., 2015), without assuming any explicit correlation structure for the multivariate functional covariates. A few studies have touched on so-called functional factor models, e.g., Hays et al. (2012); Chen et al. (2021); Yang & Ling (2024), but these methods provide limited exploration of common functional factors shared by functional covariates. To the best of our knowledge, very limited research has focused on how to capture common associations for multivariate functional series efficiently and effectively with inferential justifications from a statistical perspective.

Another issue is how to select useful functional covariates in multivariate functional linear regression models when correlations between functional covariates exist, which is also frequently needed in practice, such as in Internet of Things (IoT) data analysis (Gonzalez-Vidal et al., 2019), antibiotic resistance prediction (Jiménez et al., 2020), non-small cell lung cancer (NSCLC) clinical studies (Fang et al., 2015), and stock market forecasting (Nti et al., 2019; Htun et al., 2023). A common way to achieve functional variable selection is to impose a penalty on the corresponding functional coefficient curves in a group-wise manner, using popular penalties such as the lasso (Tibshirani, 1996), SCAD (Fan & Li, 2001), and MCP (Zhang, 2010). A few studies have touched on functional variable selection in certain scenarios; see, for example, Matsui & Konishi (2011); Kong et al. (2016); Aneiros, Horová, Hušková & Vieu (2022), among others. However, when correlation exists between functional covariates, such a strategy turns out to be less accurate in functional coefficient estimation and fails to capture the truly useful covariates, as demonstrated in the simulation studies of this article. Consequently, it remains statistically challenging to select useful functional covariates with selection consistency in multivariate functional linear models with correlated functional covariates in high dimensions.

Motivated by the challenges above, we first propose a novel functional factor augmentation structure (fFAS) to capture associations for correlated multivariate functional data, and further propose a functional factor augmentation selection model (fFASM) to achieve penalized selection of functional covariates in high dimensions when correlations exist between them. Not only is the correlation addressed without assuming an explicit covariance structure, but theoretical properties of the estimated functional factors are also established. Further, in the presence of correlated functional covariates, the proposed fFASM method successfully captures the useful functional covariates in the context of functional factor models. Numerical studies on both simulated and real datasets support the superior performance of the proposed fFASM. The main contributions of the proposed method are two-fold as follows.

  • We propose a feasible time-independent functional factor augmentation structure (fFAS) for functional data and establish theoretical justifications for statistical inference, offering new insights into the current literature. We also discuss the impact of the truncation inherent in functional data analysis on the proposed fFAS, and provide solutions to enhance the robustness and applicability of our model. A key result quantifies and controls the discrepancy between the true factor model and the one estimated from a truncated expansion.

  • A multivariate functional linear regression model with correlated functional covariates is proposed by employing the proposed functional factor augmentation structure. As correlations often exist among multiple functional covariates in high dimensions but may be difficult to estimate, the proposed fFASM method decomposes the functional covariates into two weakly correlated parts and simultaneously expands the dimension of the parameters to be estimated. In this way, our approach improves the performance of functional variable selection.

The rest of the paper is organized as follows. Section 2 introduces a novel fFAS for functional processes with its statistical properties, and the fFASM is further proposed in Section 3 for correlated functional covariates and functional variable selection in detail with theoretical justifications. Section 4 employs simulated data to examine the proposed method in various scenarios, and Section 5 presents its applications on two sets of real-world datasets. Section 6 concludes the article with discussions.

Notation: $\bm{I}_{n}$ denotes the $n\times n$ identity matrix; $\bm{0}$ refers to the $n\times m$ zero matrix; $\bm{0}_{n}$ and $\bm{1}_{n}$ represent the all-zero and all-one vectors in $\mathbb{R}^{n}$, respectively. For a vector $\bm{a}$, $\|\bm{a}\|_{2}$ denotes its Euclidean norm. For a vector $\bm{v}\in\mathbb{R}^{p}$ and $S\subseteq[p]$, denote $\bm{v}_{S}=(\bm{v}_{i})_{i\in S}$ as its sub-vector. For a matrix $\bm{M}\in\mathbb{R}^{n\times m}$, $I\subseteq[n]$ and $J\subseteq[m]$, define $\bm{M}_{IJ}=(\bm{M}_{ij})_{i\in I,j\in J}$, $\bm{M}_{I.}=(\bm{M}_{ij})_{i\in I,j\in[m]}$ and $\bm{M}_{.J}=(\bm{M}_{ij})_{i\in[n],j\in J}$; its entry-wise max norm is $\|\bm{M}\|_{\max}=\max_{i,j}|M_{ij}|$, and $\|\bm{M}\|_{F}$ and $\|\bm{M}\|_{p}$ denote its Frobenius and induced $p$-norms, respectively. Denote $\lambda_{\min}(\bm{M})$ as the minimum eigenvalue of $\bm{M}$ if it is symmetric. Let $\nabla$ and $\nabla^{2}$ be the gradient and Hessian operators. For $f:\mathbb{R}^{p}\rightarrow\mathbb{R}$ and $I,J\subseteq[p]$, define $\nabla_{I}f(x)=(\nabla f(x))_{I}$ and $\nabla_{IJ}^{2}f(x)=(\nabla^{2}f(x))_{IJ}$. $N(\bm{\mu},\bm{\Sigma})$ refers to the normal distribution with mean $\bm{\mu}$ and covariance matrix $\bm{\Sigma}$.

2 A novel functional factor augmentation structure

2.1 Functional observations and their expansion

To start with, suppose that GG functional processes are observed and collected sequentially in a sample {Wi(g)(tij(g)),tij(g)𝒯;g=1,,G,i=1,,n,j=1,,ni(g)}\{W_{i}^{(g)}(t_{ij}^{(g)}),t_{ij}^{(g)}\in\mathcal{T};~{}g=1,\ldots,G,~{}i=1,\ldots,n,~{}j=1,\ldots,n_{i}^{(g)}\} for the gg-th functional process from the ii-th subject at time tijt_{ij} in a certain time interval 𝒯\mathcal{T}, and we use Wi(g)(tj)W_{i}^{(g)}(t_{j}) for abbreviation without confusion. Usually in the analysis of a single functional process, Wi(g)(tj)W_{i}^{(g)}(t_{j}) is assumed to consist of the underlying process Xi(g)(tj)X_{i}^{(g)}(t_{j}) and an independent noise εi(g)(tj)\varepsilon_{i}^{(g)}(t_{j}), i.e., Wi(g)(tj)=Xi(g)(tj)+εi(g)(tj)W_{i}^{(g)}(t_{j})=X_{i}^{(g)}(t_{j})+\varepsilon_{i}^{(g)}(t_{j}), where Xi(g)()X_{i}^{(g)}(\cdot) and εi(g)()\varepsilon_{i}^{(g)}(\cdot) are assumed to be identically and independently distributed (i.i.d.) with a mean function and a covariance structure, respectively, and 𝔼(εi(g)(t))=0,𝔼(Xi(g)(t))=μ(t)\mathbb{E}(\varepsilon_{i}^{(g)}(t))=0,\mathbb{E}(X_{i}^{(g)}(t))=\mu(t). For convenience, we centralize these functional processes and still use the notation Xi(g)()X_{i}^{(g)}(\cdot) to represent them. To recover the functional trajectory Xi(g)(t)X_{i}^{(g)}(t), it is usually smoothed by assuming an expansion over a set of pre-specified orthogonal basis functions {ϕ0j(g)(),j=1,,mg}\{\phi_{0j}^{(g)}(\cdot),j=1,\ldots,m_{g}\} as

Xi(g)(t)=j=1mga0,ij(g)ϕ0j(g)(t)+e0i(g)(t)=𝒂0,i(g)ϕ0(g)(t)+e0i(g)(t),X_{i}^{(g)}(t)=\sum_{j=1}^{m_{g}}a_{0,ij}^{(g)}\phi_{0j}^{(g)}(t)+e_{0i}^{(g)}(t)=\bm{a}_{0,i}^{(g)}{}^{\top}\bm{\phi}_{0}^{(g)}(t)+e_{0i}^{(g)}(t), (1)

where $\bm{a}_{0,i}^{(g)}=(a_{0,i1}^{(g)},\ldots,a_{0,im_{g}}^{(g)})^{\top}$ are the time-invariant coefficients, $\bm{\phi}_{0}^{(g)}(t)=(\phi_{01}^{(g)}(t),\ldots,\phi_{0m_{g}}^{(g)}(t))^{\top}$, and $e_{0i}^{(g)}(\cdot)$ is the residual orthogonal to the $\phi_{0j}^{(g)}(\cdot)$'s. Since $\bm{\phi}_{0}^{(g)}(\cdot)$ is usually unknown in practice, an identifiable estimator $\widehat{\bm{a}}_{0,i}^{(g)}$ can be obtained under the conditions given in Lemma 1.

Lemma 1

Suppose $X_{i}^{(g)}(\cdot)$ satisfies the following properties:

  • 𝔼(Xi(g)(t))=𝔼(a0,ij(g))=𝔼(a0,ij(g)a0,ik(g))=𝔼(e0,i(g)(t))=𝔼(a0,ij(g)e0i(g)(t))=0\mathbb{E}(X_{i}^{(g)}(t))=\mathbb{E}(a_{0,ij}^{(g)})=\mathbb{E}(a_{0,ij}^{(g)}a_{0,ik}^{(g)})=\mathbb{E}(e_{0,i}^{(g)}(t))=\mathbb{E}(a_{0,ij}^{(g)}e_{0i}^{(g)}(t))=0, where kjk\neq j,

  • 𝔼(Xi(g1)(s)e0i(g2)(t))=0\mathbb{E}(X_{i}^{(g_{1})}(s)e_{0i}^{(g_{2})}(t))=0 for g1g2g_{1}\neq g_{2}, g1,g2=1,,Gg_{1},g_{2}=1,\ldots,G, s,t𝒯s,t\in\mathcal{T},

  • Var(a0,i1(g))>Var(a0,i2(g))>>Var(a0,img(g))\operatorname{Var}(a_{0,i1}^{(g)})>\operatorname{Var}(a_{0,i2}^{(g)})>\ldots>\operatorname{Var}(a_{0,im_{g}}^{(g)}),

  • all the eigenvalues of Cov(e0,i(g)(s),e0,i(g)(t))\operatorname{Cov}(e_{0,i}^{(g)}(s),e_{0,i}^{(g)}(t)) are less than λmin(Cov(𝒂0,i(g)))\lambda_{\min}(\operatorname{Cov}(\bm{a}_{0,i}^{(g)})), where λmin(Cov(𝒂0,i(g)))\lambda_{\min}(\operatorname{Cov}(\bm{a}_{0,i}^{(g)})) is the smallest eigenvalue of Cov(𝒂0,i(g))\operatorname{Cov}(\bm{a}_{0,i}^{(g)}),

Then the $j$-th eigenfunction $\gamma_{j}^{(g)}(\cdot)$ of $\operatorname{Cov}(X_{i}^{(g)}(s),X_{i}^{(g)}(t))$ is $\phi_{0j}^{(g)}(\cdot)$, so that the $j$-th functional score $a_{ij}^{(g)}\coloneqq\int_{\mathcal{T}}X_{i}^{(g)}(t)\gamma_{j}^{(g)}(t)\,dt=a_{0,ij}^{(g)}$.

Remark 1

Lemma 1 states that $\bm{a}_{i}^{(g)}=(a_{i1}^{(g)},\ldots,a_{im_{g}}^{(g)})^{\top}$ is a reasonable approximation of $\bm{a}_{0,i}^{(g)}$. All eigenvalues of $\operatorname{Cov}(e_{0,i}^{(g)}(s),e_{0,i}^{(g)}(t))$ are required to be less than $\lambda_{\min}(\operatorname{Cov}(\bm{a}_{0,i}^{(g)}))$, indicating that the leading part and the residual part can be separated according to their contributions to the variance of $X_{i}^{(g)}(t)$.

The whole recovery process can be achieved by the popular functional principal component analysis (fPCA) with the Karhunen-Loève (KL) expansion (Yao et al., 2005).
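To make this recovery step concrete, the following sketch (assuming densely observed, pre-smoothed and centered curves on a common grid; all names are illustrative) estimates the eigenfunctions and functional scores of a single process by an eigendecomposition of the discretized sample covariance, a standard numerical implementation of fPCA.

```python
import numpy as np

def fpca_scores(X, t, n_components):
    """Discretized fPCA for curves observed on a common grid.

    X : (n, T) array of centered curves, one row per subject.
    t : (T,) array of uniformly spaced grid points on [0, 1].
    Returns estimated eigenfunctions (T, m) and scores (n, m).
    """
    n, T = X.shape
    dt = t[1] - t[0]                              # uniform grid spacing
    cov = X.T @ X / n                             # pointwise sample covariance
    eigval, eigvec = np.linalg.eigh(cov * dt)     # discretized covariance operator
    order = np.argsort(eigval)[::-1][:n_components]
    # Rescale eigenvectors so that eigenfunctions have unit L2 norm on [0, 1].
    phi = eigvec[:, order] / np.sqrt(dt)
    # Scores a_{ij} = \int X_i(t) phi_j(t) dt, approximated by a Riemann sum.
    scores = X @ phi * dt
    return phi, scores

# Usage: phi_hat, a_hat = fpca_scores(X_centered, t_grid, n_components=10)
```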

2.2 A functional factor augmentation structure

As correlations may exist between multivariate functional processes, we propose a functional factor augmentation structure (fFAS) to address the issue. Consider a simplified scenario with only two correlated functional processes, Xi(1)(t)X_{i}^{(1)}(t) and Xi(2)(t)X_{i}^{(2)}(t), generated by

Xi(1)(t)\displaystyle X_{i}^{(1)}(t) =𝒂0,i(1)ϕ0(1)(t)+e0i(1)(t),\displaystyle=\bm{a}_{0,i}^{(1)}{}^{\top}\cdot\bm{\phi}_{0}^{(1)}(t)+e_{0i}^{(1)}(t), (2)
Xi(2)(t)\displaystyle X_{i}^{(2)}(t) =𝒂0,i(2)ϕ0(2)(t)+e0i(2)(t),\displaystyle=\bm{a}_{0,i}^{(2)}{}^{\top}\cdot\bm{\phi}_{0}^{(2)}(t)+e_{0i}^{(2)}(t),

where e0i(1)(t)e_{0i}^{(1)}(t) and e0i(2)(t)e_{0i}^{(2)}(t) are independent of each other. Assume 𝒂0,i(g1)\bm{a}_{0,i}^{(g_{1})} and e0i(g2)(t)e_{0i}^{(g_{2})}(t) are uncorrelated for g1,g2=1,,Gg_{1},g_{2}=1,\ldots,G, where G=2G=2, and hence the correlation between Xi(1)(t)X_{i}^{(1)}(t) and Xi(2)(t)X_{i}^{(2)}(t) arises only from that between 𝒂0,i(1)\bm{a}_{0,i}^{(1)} and 𝒂0,i(2)\bm{a}_{0,i}^{(2)}. To capture such a correlation, it is assumed that each 𝒂0,i(g)\bm{a}_{0,i}^{(g)} shares KK common underlying factors using a linear combination

(𝒂0,i(1)𝒂0,i(2))=(B(1)B(2))𝒇i+(𝒖i(1)𝒖i(2))=𝑩𝒇i+𝒖i,\displaystyle\left(\begin{array}[]{ll}\bm{a}_{0,i}^{(1)}\\ \bm{a}_{0,i}^{(2)}\end{array}\right)=\left(\begin{array}[]{ll}B^{(1)}\\ B^{(2)}\end{array}\right)\bm{f}_{i}+\left(\begin{array}[]{ll}\bm{u}_{i}^{(1)}\\ \bm{u}_{i}^{(2)}\end{array}\right)=\bm{B}\bm{f}_{i}+\bm{u}_{i}, (9)

where $\bm{B}=(B^{(1)\top},B^{(2)\top})^{\top}$ is a $p\times K$ factor loading matrix with $p=\sum_{g=1}^{G}m_{g}$, $\bm{f}_{i}$ is a $K\times 1$ vector of latent factors, and $\bm{u}_{i}$ is an idiosyncratic component, independent of $\bm{f}_{i}$, whose entries are only weakly correlated. Note that the covariance $C_{\bm{a}_{0}}:=\operatorname{Cov}(\bm{a}_{0,i})=\operatorname{Cov}(\bm{B}\bm{f}_{i}+\bm{u}_{i})=\bm{B}\operatorname{Cov}(\bm{f}_{i})\bm{B}^{\top}+\operatorname{Cov}(\bm{u}_{i})$. For model identifiability, it is assumed that $\operatorname{Cov}(\bm{f}_{i})=\bm{I}_{K}$ and $\operatorname{Cov}(\bm{u}_{i})=\omega\Lambda_{u}$ with a $p\times p$ matrix $\Lambda_{u}$ satisfying $\|\Lambda_{u}\|_{\max}\leq C_{u}$, a constant depending on the distribution of $\bm{u}_{i}$. By the spectral decomposition, $\bm{B}=(\sqrt{\lambda_{1}^{(B)}}\bm{\xi}_{1}^{(B)},\ldots,\sqrt{\lambda_{K}^{(B)}}\bm{\xi}_{K}^{(B)})$ with the eigenvalues $\lambda_{1}^{(B)}>\ldots>\lambda_{K}^{(B)}$ of $\bm{B}\bm{B}^{\top}$ and the corresponding eigenvectors $\{\bm{\xi}_{k}^{(B)},k=1,\ldots,K\}$. In this way, $X_{i}^{(1)}(t)$ and $X_{i}^{(2)}(t)$ in (2) can be expanded as

Xi(1)(t)\displaystyle X_{i}^{(1)}(t) =ϕ0(1)(t)(B(1)𝒇i+𝒖i(1))+e0i(1)(t)=ϕ0~(t)(1)𝒇i+ϕ0(1)(t)𝒖i(1)+e0i(1)(t),\displaystyle={\bm{\phi}_{0}^{(1)}(t)}{}^{\top}(B^{(1)}\bm{f}_{i}+\bm{u}_{i}^{(1)})+e_{0i}^{(1)}(t)={{\tilde{\bm{\phi}_{0}}}{}^{(1)}(t)}{}^{\top}\bm{f}_{i}+{\bm{\phi}_{0}^{(1)}(t)}{}^{\top}\bm{u}_{i}^{(1)}+e_{0i}^{(1)}(t), (10)
Xi(2)(t)\displaystyle X_{i}^{(2)}(t) =ϕ0(2)(t)(B(2)𝒇i+𝒖i(2))+e0i(2)(t)=ϕ0~(t)(2)𝒇i+ϕ0(2)(t)𝒖i(2)+e0i(2)(t),\displaystyle={\bm{\phi}_{0}^{(2)}(t)}{}^{\top}(B^{(2)}\bm{f}_{i}+\bm{u}_{i}^{(2)})+e_{0i}^{(2)}(t)={{\tilde{\bm{\phi}_{0}}}{}^{(2)}(t)}{}^{\top}\bm{f}_{i}+{\bm{\phi}_{0}^{(2)}(t)}{}^{\top}\bm{u}_{i}^{(2)}+e_{0i}^{(2)}(t),

where $\tilde{\bm{\phi}}_{0}^{(g)}(t)^{\top}=\bm{\phi}_{0}^{(g)}(t)^{\top}B^{(g)}$. This indicates that $X_{i}^{(g)}(t)$ can be decomposed into a common functional factor part driven by $\bm{f}_{i}$ and a weakly correlated part $\bm{\phi}_{0}^{(g)}(t)^{\top}\bm{u}_{i}^{(g)}$, plus an independent error term. We provide an example to illustrate such a fFAS when two functional covariates share a linear structure.

Example 1

Suppose Xi(1)()X_{i}^{(1)}{(\cdot)} and Xi(2)()X_{i}^{(2)}{(\cdot)} are associated with a linear structure as

Xi(2)(t)=𝔼(Xi(2)(t)|Xi(1)(t))+ϵi(t)=𝒯β(s,t)Xi(1)(s)𝑑s+ϵi(t),\displaystyle X_{i}^{(2)}{(t)}=\mathbb{E}(X_{i}^{(2)}(t)|X_{i}^{(1)}{(t)})+\epsilon_{i}(t)=\int_{\mathcal{T}}\beta(s,t)X_{i}^{(1)}(s)ds+\epsilon_{i}{(t)},

with β(s,t)=k=1m=1𝔼(a0,im(1)a0,ik(2))𝔼(a0,im(1))2ϕ0k(2)(t)ϕ0m(1)(s)\beta(s,t)=\sum_{k=1}^{\infty}\sum_{m=1}^{\infty}\frac{\mathbb{E}(a_{0,im}^{(1)}a_{0,ik}^{(2)})}{\mathbb{E}(a_{0,im}^{(1)}{}^{2})}\phi_{0k}^{(2)}(t)\phi_{0m}^{(1)}(s). Then it is obtained that

𝒯β(s,t)Xi(1)(s)𝑑s\displaystyle\int_{\mathcal{T}}\beta(s,t)X_{i}^{(1)}(s)ds =𝒯β(s,t)j=1a0,ij(1)ϕ0j(1)(s)ds\displaystyle=\int_{\mathcal{T}}\beta(s,t)\sum_{j=1}^{\infty}a_{0,ij}^{(1)}\phi_{0j}^{(1)}(s)ds
=𝒯k=1m=1𝔼(a0,im(1)a0,ik(2))𝔼(a0,im(1))2ϕ0k(2)(t)ϕ0m(1)(s)j=1a0,ij(1)ϕ0j(1)(s)ds\displaystyle=\int_{\mathcal{T}}\sum_{k=1}^{\infty}\sum_{m=1}^{\infty}\frac{\mathbb{E}(a_{0,im}^{(1)}a_{0,ik}^{(2)})}{\mathbb{E}(a_{0,im}^{(1)}{}^{2})}\phi_{0k}^{(2)}(t)\phi_{0m}^{(1)}(s)\sum_{j=1}^{\infty}a_{0,ij}^{(1)}\phi_{0j}^{(1)}(s)ds
=k=1(j=1𝔼(a0,ij(1)a0,ik(2))𝔼(a0,ij(1))2a0,ij(1))ϕ0k(2)(t),\displaystyle=\sum_{k=1}^{\infty}\left(\sum_{j=1}^{\infty}\frac{\mathbb{E}(a_{0,ij}^{(1)}a_{0,ik}^{(2)})}{\mathbb{E}(a_{0,ij}^{(1)}{}^{2})}a_{0,ij}^{(1)}\right)\phi_{0k}^{(2)}(t),

so that a0,ik(2)j=1𝔼(a0,ij(1)a0,ik(2))𝔼(a0,ij(1))2a0,ij(1)=j=1m1𝔼(a0,ij(1)a0,ik(2))𝔼(a0,ij(1))2a0,ij(1)+j=m1+1𝔼(a0,ij(1)a0,ik(2))𝔼(a0,ij(1))2a0,ij(1)a_{0,ik}^{(2)}\approx\sum_{j=1}^{\infty}\frac{\mathbb{E}(a_{0,ij}^{(1)}a_{0,ik}^{(2)})}{\mathbb{E}(a_{0,ij}^{(1)}{}^{2})}a_{0,ij}^{(1)}=\sum_{j=1}^{m_{1}}\frac{\mathbb{E}(a_{0,ij}^{(1)}a_{0,ik}^{(2)})}{\mathbb{E}(a_{0,ij}^{(1)}{}^{2})}a_{0,ij}^{(1)}+\sum_{j=m_{1}+1}^{\infty}\frac{\mathbb{E}(a_{0,ij}^{(1)}a_{0,ik}^{(2)})}{\mathbb{E}(a_{0,ij}^{(1)}{}^{2})}a_{0,ij}^{(1)}. With a truncation of the first m1m_{1} and m2m_{2} scores in Xi(1)(t)X_{i}^{(1)}(t) and Xi(2)(t)X_{i}^{(2)}(t), and combining them as 𝐚0,i=(𝐚0,i(1),𝐚0,i(2)){\bm{a}_{0,i}}=({\bm{a}_{0,i}^{(1)}}^{\top},{\bm{a}_{0,i}^{(2)}}^{\top})^{\top}, and a matrix 𝐄\bm{E} with elements Ekj=𝔼(a0,ij(1)a0,ik(2))𝔼(a0,ij(1))2E_{kj}=\frac{\mathbb{E}(a_{0,ij}^{(1)}a_{0,ik}^{(2)})}{\mathbb{E}(a_{0,ij}^{(1)}{}^{2})} and a diagonal matrix Λ\Lambda with elements Λjj=𝔼(a0,ij(1))2\Lambda_{jj}=\mathbb{E}(a_{0,ij}^{(1)}{}^{2}), it is easily obtained that

(𝒂0,i(1)𝒂0,i(2))(𝑰m1𝑬)𝒂0,i(1)=((𝑰m1𝑬)Λ12)(Λ12𝒂0,i(1))=((𝑰m1𝑬)Λ12𝑷)(𝑷Λ12𝒂0,i(1)),\left(\begin{array}[]{ll}\bm{a}_{0,i}^{(1)}\\ \bm{a}_{0,i}^{(2)}\end{array}\right)\approx\left(\begin{array}[]{cc}\bm{I}_{m_{1}}\\ \bm{E}\end{array}\right)\bm{a}_{0,i}^{(1)}=\left(\left(\begin{array}[]{cc}\bm{I}_{m_{1}}\\ \bm{E}\end{array}\right)\Lambda^{\frac{1}{2}}\right)\left(\Lambda^{-\frac{1}{2}}\bm{a}_{0,i}^{(1)}\right)=\left(\left(\begin{array}[]{cc}\bm{I}_{m_{1}}\\ \bm{E}\end{array}\right)\Lambda^{\frac{1}{2}}\bm{P}\right)(\bm{P}^{\top}\Lambda^{-\frac{1}{2}}\bm{a}_{0,i}^{(1)}),

where 𝐏\bm{P} is an orthogonal matrix, so that ((𝐈m1𝐄)Λ12𝐏)((𝐈m1𝐄)Λ12𝐏)\left(\left(\begin{array}[]{cc}\bm{I}_{m_{1}}\\ \bm{E}\end{array}\right)\Lambda^{\frac{1}{2}}\bm{P}\right)^{\top}\left(\left(\begin{array}[]{cc}\bm{I}_{m_{1}}\\ \bm{E}\end{array}\right)\Lambda^{\frac{1}{2}}\bm{P}\right) is diagonal, and 𝐟i=𝐏Λ12𝐚0,i(1)\bm{f}_{i}=\bm{P}^{\top}\Lambda^{-\frac{1}{2}}\bm{a}_{0,i}^{(1)} with Cov(𝐟i)=𝐈m1\operatorname{Cov}(\bm{f}_{i})=\bm{I}_{m_{1}}. Note that no matter what values of m1m_{1} and m2m_{2} are in practice, one can always obtain such a functional factor augmentation structure.
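A quick Monte Carlo check of Example 1 can illustrate the induced structure. Under the hypothetical configuration below (the values of $m_1$, $m_2$ and $\Lambda$ are illustrative), the scores of the second process are an exact linear map of those of the first, so the covariance of the stacked scores has rank $m_1$ and a fFAS with $K=m_1$ factors holds.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m1, m2 = 5000, 3, 4

Lam = np.diag([4.0, 2.0, 1.0])                 # Var(a^{(1)}), illustrative values
a1 = rng.multivariate_normal(np.zeros(m1), Lam, size=n)   # scores of X^{(1)}
E = rng.normal(size=(m2, m1))                  # linear map: a^{(2)} = E a^{(1)}
a2 = a1 @ E.T
a0 = np.hstack([a1, a2])                       # stacked scores, m1 + m2 columns

cov = np.cov(a0, rowvar=False)
eigvals = np.sort(np.linalg.eigvalsh(cov))[::-1]
# Only the first m1 eigenvalues are non-negligible, i.e. the stacked scores
# admit an exact factor structure with K = m1 common factors.
print(np.round(eigvals, 3))
```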

Next, we consider a more general case where $\bm{a}_{0,i}^{(g)}$ itself contains correlations for $g=1,\ldots,G$, i.e., $\mathbb{E}(a_{0,ij}^{(g)}a_{0,ik}^{(g)})\neq 0$ for all $0<j<k<m_{g}$. Lemma 2 then shows that, in this case, $\bm{a}_{i}^{(g)}$ still approximates $\bm{a}_{0,i}^{(g)}$ up to an orthogonal rotation induced by the basis functions in the K-L expansion.

Lemma 2

Under the conditions in Lemma 1 without 𝔼(a0,ij(g)a0,ik(g))=0\mathbb{E}(a_{0,ij}^{(g)}a_{0,ik}^{(g)})=0, there exists an orthogonal matrix P(g)P^{(g)}, such that ϕ0(g)()=P(g)(γ1(g)(),,γmg(g)())\bm{\phi}_{0}^{(g)}(\cdot)=P^{(g)}({\gamma}_{1}^{(g)}(\cdot),\ldots,{\gamma}_{m_{g}}^{(g)}(\cdot))^{\top}, and

(a0,i1(g)a0,i2(g)a0,img(g))=P(g)(ai1(g)ai2(g)aimg(g)).\displaystyle\left(\begin{array}[]{ll}a_{0,i1}^{(g)}\\ a_{0,i2}^{(g)}\\ \vdots\\ a_{0,im_{g}}^{(g)}\end{array}\right)=P^{(g)}\left(\begin{array}[]{ll}a_{i1}^{(g)}\\ a_{i2}^{(g)}\\ \vdots\\ a_{im_{g}}^{(g)}\end{array}\right).

Furthermore, denote 𝐚i=(𝐚i(1),,𝐚i(G))\bm{a}_{i}^{\top}=(\bm{a}_{i}^{(1)}{}^{\top},\ldots,\bm{a}_{i}^{(G)}{}^{\top}) and 𝐚0,i=(𝐚0,i(1),,𝐚0,i(G))\bm{a}_{0,i}^{\top}=(\bm{a}_{0,i}^{(1)}{}^{\top},\ldots,\bm{a}_{0,i}^{(G)}{}^{\top}). Then

Cov(𝒂i)=(P(1)𝟎𝟎𝟎P(2)𝟎𝟎𝟎P(G))Cov(𝒂0,i)(P(1)𝟎𝟎𝟎P(2)𝟎𝟎𝟎P(G)),\displaystyle\operatorname{Cov}(\bm{a}_{i})=\begin{pmatrix}P^{(1)}&\bm{0}&\cdots&\bm{0}\\ \bm{0}&P^{(2)}&\cdots&\bm{0}\\ \vdots&\vdots&\ddots&\vdots\\ \bm{0}&\bm{0}&\cdots&P^{(G)}\end{pmatrix}^{\top}\operatorname{Cov}(\bm{a}_{0,i})\begin{pmatrix}P^{(1)}&\bm{0}&\cdots&\bm{0}\\ \bm{0}&P^{(2)}&\cdots&\bm{0}\\ \vdots&\vdots&\ddots&\vdots\\ \bm{0}&\bm{0}&\cdots&P^{(G)}\end{pmatrix},

where

Cov(𝒂i)=(Σm1Σm1m2Σm1mGΣm2m1Σm2Σm2mGΣmGm1ΣmGm2ΣmG),\displaystyle{\operatorname{Cov}(\bm{a}_{i})}=\begin{pmatrix}\Sigma_{m_{1}}&\Sigma_{m_{1}m_{2}}&\cdots&\Sigma_{m_{1}m_{G}}\\ \Sigma_{m_{2}m_{1}}&\Sigma_{m_{2}}&\cdots&\Sigma_{m_{2}m_{G}}\\ \vdots&\vdots&\ddots&\vdots\\ \Sigma_{m_{G}m_{1}}&\Sigma_{m_{G}m_{2}}&\cdots&\Sigma_{m_{G}}\end{pmatrix},

and

Cov(𝒂0,i)=(Σ0,m1Σ0,m1m2Σ0,m1mGΣ0,m2m1Σ0,m2Σ0,m2mGΣ0,mGm1Σ0,mGm2Σ0,mG).\displaystyle{\operatorname{Cov}(\bm{a}_{0,i})}=\begin{pmatrix}\Sigma_{0,m_{1}}&\Sigma_{0,m_{1}m_{2}}&\cdots&\Sigma_{0,m_{1}m_{G}}\\ \Sigma_{0,m_{2}m_{1}}&\Sigma_{0,m_{2}}&\cdots&\Sigma_{0,m_{2}m_{G}}\\ \vdots&\vdots&\ddots&\vdots\\ \Sigma_{0,m_{G}m_{1}}&\Sigma_{0,m_{G}m_{2}}&\cdots&\Sigma_{0,m_{G}}\end{pmatrix}.
Remark 2

Lemma 2 indicates that if there is a fFAS on 𝐚0,i\bm{a}_{0,i}, then there will be a fFAS on 𝐚i\bm{a}_{i} by the fact that

𝒂i=(P(1)𝟎𝟎𝟎P(2)𝟎𝟎𝟎P(G))𝒂0,i=𝑷𝒂0,i=𝑷𝑩𝒇i+𝑷𝒖i,\displaystyle\bm{a}_{i}=\begin{pmatrix}P^{(1)}&\bm{0}&\cdots&\bm{0}\\ \bm{0}&P^{(2)}&\cdots&\bm{0}\\ \vdots&\vdots&\ddots&\vdots\\ \bm{0}&\bm{0}&\cdots&P^{(G)}\end{pmatrix}^{\top}\bm{a}_{0,i}=\bm{P}^{\top}\bm{a}_{0,i}=\bm{P}^{\top}\bm{B}\bm{f}_{i}+\bm{P}^{\top}\bm{u}_{i}, (11)

and treating 𝐏𝐁\bm{P}^{\top}\bm{B} and 𝐏𝐮i\bm{P}^{\top}\bm{u}_{i} as the updated loading matrix and the idiosyncratic component in (9), respectively, where 𝐏\bm{P} is still an orthogonal matrix. More specifically, if 𝐚0,i\bm{a}_{0,i} has a fFAS induced by 𝐟i\bm{f}_{i}, 𝐚i\bm{a}_{i} also has such a fFAS with the same factor 𝐟i\bm{f}_{i}. Consequently, even if the functional covariates are correlated with each other, we can still use the functional scores 𝐚i\bm{a}_{i} to estimate 𝐚0,i\bm{a}_{0,i} as if they were uncorrelated.

To further obtain the estimates of the loading matrix $\bm{B}$ and the functional factors $\bm{F}=(\bm{f}_{1},\ldots,\bm{f}_{n})^{\top}\in\mathbb{R}^{n\times K}$ after obtaining the score matrix $\bm{A}=(\bm{a}_{1},\ldots,\bm{a}_{n})^{\top}$ with $\bm{a}_{i}$ in (11), we apply principal component analysis (Bai, 2003) to the covariance of $\bm{A}$, and further obtain $\widehat{\bm{U}}=\bm{A}-\widehat{\bm{F}}\widehat{\bm{B}}^{\top}$, where $\widehat{\bm{U}}=(\widehat{\bm{u}}_{1},\ldots,\widehat{\bm{u}}_{n})^{\top}\in\mathbb{R}^{n\times p}$. More specifically, the columns of $\widehat{\bm{F}}/\sqrt{n}$ are the eigenvectors of $\bm{A}\bm{A}^{\top}$ corresponding to the top $K$ eigenvalues, and $\widehat{\bm{B}}=n^{-1}\bm{A}^{\top}\widehat{\bm{F}}$. Equivalently, $\widehat{\bm{B}}=(\sqrt{\lambda_{1}}\bm{\xi}_{1},\ldots,\sqrt{\lambda_{K}}\bm{\xi}_{K})$ and $\widehat{\bm{F}}=\bm{A}\widehat{\bm{B}}\operatorname{diag}(\lambda_{1}^{-1},\ldots,\lambda_{K}^{-1})$, where $\{\lambda_{j}\}_{j=1}^{K}$ and $\{\bm{\xi}_{j}\}_{j=1}^{K}$ are the top $K$ eigenvalues in descending order and their associated eigenvectors of the sample covariance matrix.
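The estimation step just described can be coded directly. The sketch below (names are illustrative; the score matrix $\bm{A}$ is assumed to be centered) extracts $\widehat{\bm{F}}$, $\widehat{\bm{B}}$ and $\widehat{\bm{U}}$ from the top-$K$ eigenstructure of $\bm{A}\bm{A}^{\top}$, following the PCA construction above.

```python
import numpy as np

def fit_ffas(A, K):
    """Estimate factors F (n, K), loadings B (p, K) and idiosyncratic U (n, p)
    from a centered score matrix A (n, p), via Bai (2003)-type PCA."""
    n, p = A.shape
    # Eigenvectors of A A^T corresponding to the top-K eigenvalues.
    eigval, eigvec = np.linalg.eigh(A @ A.T)
    top = np.argsort(eigval)[::-1][:K]
    F_hat = np.sqrt(n) * eigvec[:, top]      # columns of F_hat / sqrt(n) are eigenvectors
    B_hat = A.T @ F_hat / n                  # B_hat = n^{-1} A^T F_hat
    U_hat = A - F_hat @ B_hat.T              # idiosyncratic component
    return F_hat, B_hat, U_hat
```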

A practical issue is how to determine the number of factors KK. Given that latent factors, loadings, and idiosyncratic components are all unobservable, a conditional sparsity criterion is then adopted (Ahn & Horenstein, 2013; Fan et al., 2020). Specifically, let λk(𝑨𝑨)\lambda_{k}\left({\bm{A}}^{\top}{\bm{A}}\right) denote the kk-th largest eigenvalue of 𝑨𝑨{\bm{A}}^{\top}{\bm{A}}, KmaxK_{\mathop{\rm max}} be a prespecified upper bound for KK, and CnC_{n} be a constant dependent on nn and p=g=1Gmgp=\sum_{g=1}^{G}m_{g}, and then KK is determined by

K^=argminkKmaxλk+1(𝑨𝑨)+Cnλk(𝑨𝑨)+Cn\widehat{K}=\underset{k\leq K_{\mathop{\rm max}}}{\operatorname{argmin}}\frac{\lambda_{k+1}\left({\bm{A}}^{\top}{\bm{A}}\right)+C_{n}}{\lambda_{k}\left({\bm{A}}^{\top}{\bm{A}}\right)+C_{n}} (12)

for a given CnC_{n} and KmaxK_{\mathop{\rm max}}. An alternative criterion may be the information criteria proposed by Bai & Ng (2002) and Fan et al. (2013), referred to as PC and IC, respectively.
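A minimal sketch of the ratio criterion (12) is given below, reusing the centered score matrix; the default choices of $C_n$ and $K_{\max}$ here are illustrative rather than the tuned values studied in the cited papers.

```python
import numpy as np

def select_K(A, K_max=10, C_n=None):
    """Choose the number of factors by minimizing the perturbed eigenvalue
    ratio in (12) over k = 1, ..., K_max (assumes p > K_max)."""
    n, p = A.shape
    lam = np.sort(np.linalg.eigvalsh(A.T @ A))[::-1]   # eigenvalues of A^T A, descending
    if C_n is None:
        C_n = np.mean(lam) / np.log(max(n, p))         # illustrative choice of C_n
    ratios = [(lam[k] + C_n) / (lam[k - 1] + C_n) for k in range(1, K_max + 1)]
    return int(np.argmin(ratios)) + 1
```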

2.3 Properties of functional factor augmentation structure

In this subsection, we evaluate the estimation error of the proposed functional factors $\bm{f}_{i}$. Denote by $\bm{\xi}_{k}^{(B)}$, $k=1,\ldots,K$, the first $K$ eigenvectors of $\bm{B}\bm{B}^{\top}$, and let $\Lambda_{B}=\operatorname{diag}\{\lambda_{1}^{(B)},\ldots,\lambda_{K}^{(B)}\}$ collect the corresponding eigenvalues, sorted in descending order. Let $\lambda_{\max}^{\bm{u}}$ be the largest eigenvalue of $\operatorname{Cov}(\bm{u}_{i})=\omega\Lambda_{u}$. From (11), the covariance matrix of $\bm{a}_{i}$ equals $\operatorname{Cov}(\bm{a}_{i})=\mathbb{E}(\bm{a}_{i}\bm{a}_{i}^{\top})=\mathbb{E}(\bm{A}^{\top}\bm{A}/n)=\bm{B}\bm{B}^{\top}+\mathbb{E}(\bm{U}^{\top}\bm{U}/n)$. The covariance can be further expanded using a perturbation

C𝒂(ω)=𝑩𝑩+Λ𝒖ω+O(ω2),\displaystyle C_{\bm{a}}(\omega)=\bm{B}\bm{B}^{\top}+\Lambda_{\bm{u}}\omega+O(\omega^{2}),

where Λ𝒖\Lambda_{\bm{u}} is a perturbing matrix with ω\omega such that

λk(C)(ω)\displaystyle\lambda_{k}^{(C)}(\omega) =λk(B)+(𝝃k(B)Λ𝒖𝝃k(B))ω+O(ω2),\displaystyle=\lambda_{k}^{(B)}+(\bm{\xi}^{(B)}_{k}{}^{\top}\Lambda_{\bm{u}}\bm{\xi}^{(B)}_{k})\omega+O(\omega^{2}),
𝝃k(C)(ω)\displaystyle\bm{\xi}_{k}^{(C)}(\omega) =𝝃k(B)(𝑩𝑩λk(B)I)+Λ𝒖𝝃k(B)ω+O(ω2),\displaystyle=\bm{\xi}^{(B)}_{k}-(\bm{B}\bm{B}^{\top}-\lambda_{k}^{(B)}I)^{+}\Lambda_{\bm{u}}\bm{\xi}_{k}^{(B)}\omega+O(\omega^{2}),

with λk(C)(ω)\lambda_{k}^{(C)}(\omega) and 𝝃k(C)\bm{\xi}^{(C)}_{k}, k=1,,Kk=1,\dots,K being the first KK eigenvalues and eigenvectors of C𝒂(ω)C_{\bm{a}}(\omega) (Shi, 1997), respectively. Based on C𝒂(ω)C_{\bm{a}}(\omega) we have 𝑩(ω)=(λ1(C)(ω)𝝃1(C)(ω),,λK(C)(ω)𝝃K(C)(ω)){\bm{B}(\omega)}=\left(\sqrt{\lambda_{1}^{(C)}(\omega)}\bm{\xi}_{1}^{(C)}(\omega),\ldots,\sqrt{\lambda_{K}^{(C)}(\omega)}\bm{\xi}_{K}^{(C)}(\omega)\right) and 𝑭(ω)=𝑨𝑩(ω)[diag(λ1(C),λK(C))]1{\bm{F}}(\omega)=\bm{A}\bm{B}(\omega)\left[\operatorname{diag}\left(\lambda_{1}^{(C)}\ldots,\lambda_{K}^{(C)}\right)\right]^{-1}. Accordingly, the following lemma shows the estimation error of the proposed functional structure factors for a given 𝑩\bm{B}.

Lemma 3

For the fixed truncation numbers m1,,mGm_{1},\ldots,m_{G}, if min1i<jK(λi(B)λj(B))>c1\underset{1\leq i<j\leq K}{\min}(\lambda_{i}^{(B)}-\lambda_{j}^{(B)})>c_{1} and λmax𝐮<c2\lambda_{\mathop{\rm max}}^{\bm{u}}<c_{2} for some constant c1,c2>0c_{1},c_{2}>0, then

𝒇i(ω)𝒇i22\displaystyle\|{\bm{f}}_{i}^{\top}(\omega)-\bm{f}_{i}^{\top}\|_{2}^{2} =OP(ω2𝒂i22+𝒖i𝑩ΛB122),and\displaystyle=O_{P}(\omega^{2}\|\bm{a}_{i}\|_{2}^{2}+\|\bm{u}_{i}^{\top}\bm{B}\Lambda_{B}^{-1}\|_{2}^{2}),~{}\quad{\mbox{and}}
𝒖i(ω)𝒖i22\displaystyle\|{\bm{u}}_{i}^{\top}(\omega)-\bm{u}_{i}^{\top}\|_{2}^{2} =OP(ω2𝒂i22+𝒖i𝑩ΛB1𝑩22)\displaystyle=O_{P}(\omega^{2}\|\bm{a}_{i}\|_{2}^{2}+\|\bm{u}_{i}^{\top}\bm{B}\Lambda_{B}^{-1}\bm{B}^{\top}\|_{2}^{2})

with 𝐟i(ω)\bm{f}_{i}^{\top}(\omega) and 𝐮i(ω)\bm{u}_{i}^{\top}(\omega) calculated based on C𝐚(ω)C_{\bm{a}}(\omega).

Remark 3

The errors in estimating the functional structure factors come mainly from two sources: the estimation error of the eigenvalues and eigenvectors, and the projection of $\bm{u}_{i}$ onto the column space of $\bm{B}$.

Furthermore, in real applications, when only repeated observations $\{(t_{ij}^{(g)},W_{i}^{(g)}(t_{ij}^{(g)})),t_{ij}^{(g)}\in\mathcal{T},j=1,\ldots,n_{i}^{(g)}\}$ are available for each subject, a common practice is to estimate smooth trajectories $\{\widehat{X}_{i}^{(g)}(\cdot),i=1,\ldots,n,g=1,\ldots,G\}$ and hence obtain $\widehat{\bm{A}}$, $\widehat{\bm{B}}$ and $\widehat{\bm{F}}$. To further analyze the estimation error of $\widehat{\bm{F}}$, more assumptions are required to describe the properties of the scores estimated by fPCA. For each functional covariate, denote the covariance function $\text{Cov}_{X}^{(g)}(s,t)=\operatorname{Cov}(X_{i}^{(g)}(s),X_{i}^{(g)}(t))$, $g=1,\ldots,G$, and let $\tau_{k}^{(g)}$ be the $k$-th eigenvalue of $\text{Cov}_{X}^{(g)}(s,t)$. Assume that $m_{g}\leq s_{n}$ for $g=1,\ldots,G$.

Assumption 1
  • (A1)

    (The decay rates of eigenvalues) For g=1,,Gg=1,\ldots,G, τ1(g)<\tau_{1}^{(g)}<\infty, τk(g)τk+1(g)Cka1\tau_{k}^{(g)}-\tau_{k+1}^{(g)}\geq Ck^{-a-1} for k1k\geq 1.

  • (A2)

    (The common truncation parameter sns_{n} cannot be too large) (sn2a+2+sna+4)/n=o(1).(s_{n}^{2a+2}+s_{n}^{a+4})/n=o(1).

This implies that τk(g)Cka\tau_{k}^{(g)}\geq Ck^{-a}. As the covariance functions CovX(g)(s,t),g=1,,G\text{Cov}_{X}^{(g)}(s,t),g=1,\ldots,G are bounded, one has a>1a>1.

Assumption 2

We defer to the Supplementary Material for Conditions (B1)–(B4) on the underlying processes Xi(g)(t)X_{i}^{(g)}(t). These conditions specify how data are sampled and smoothed.

Lemma 4

For a fixed number GG of trajectory types, under Assumptions 1 and 2,

{(𝑨^𝑨^)ijE(𝑨𝑨)ij}/n=Op(1/n),for anyi,jp.\displaystyle\{(\widehat{\bm{A}}^{\top}\widehat{\bm{A}})_{ij}-E(\bm{A}^{\top}\bm{A})_{ij}\}/n=O_{p}(1/\sqrt{n}),\quad\text{for any}~{}i,j\leq p.

Further, 𝐚^i𝐚i22OP(1/n)\|\widehat{\bm{a}}_{i}-{\bm{a}}_{i}\|_{2}^{2}\leq O_{P}(1/n).

Lemma 4 indicates the convergence rate of estimating 𝔼(𝑨𝑨)ij/n\mathbb{E}(\bm{A}^{\top}\bm{A})_{ij}/n using (𝑨^𝑨^)ij(\widehat{\bm{A}}^{\top}\widehat{\bm{A}})_{ij} is the same as using (𝑨𝑨)ij({\bm{A}}^{\top}{\bm{A}})_{ij}. Combined with Lemma 3, the estimation error of functional structure factors is formally established.

Theorem 1

With 𝐅^=𝐀^𝐁^diag(λ^11,λ^K1)\widehat{\bm{F}}=\widehat{\bm{A}}\widehat{\bm{{B}}}\operatorname{diag}\left(\widehat{\lambda}_{1}^{-1}\ldots,\widehat{\lambda}_{{K}}^{-1}\right) and 𝐔^=𝐀^𝐅^𝐁^\widehat{\bm{U}}=\widehat{\bm{A}}-\widehat{\bm{F}}\widehat{\bm{B}}^{\top}, we have

{𝒇^i𝒇i22=OP((ω2+1/n)𝒂i22+𝒖i𝑩ΛB122),𝒖^i𝒖i22=OP((ω2+1/n)𝒂i22+𝒖i𝑩ΛB1𝑩22).\displaystyle\left\{\begin{array}[]{l}\|\widehat{\bm{f}}_{i}^{\top}-\bm{f}_{i}^{\top}\|_{2}^{2}=O_{P}((\omega^{2}+1/n)\|\bm{a}_{i}\|_{2}^{2}+\|\bm{u}_{i}^{\top}\bm{B}\Lambda_{B}^{-1}\|_{2}^{2}),\\ \|\widehat{\bm{u}}_{i}^{\top}-\bm{u}_{i}^{\top}\|_{2}^{2}=O_{P}((\omega^{2}+1/n)\|\bm{a}_{i}\|_{2}^{2}+\|\bm{u}_{i}^{\top}\bm{B}\Lambda_{B}^{-1}\bm{B}^{\top}\|_{2}^{2}).\end{array}\right.

2.4 Truncation analysis and the relationship between KK and mgm_{g}

A practical issue when modeling functional data from sampled instances is how to determine the number of basis functions in the truncation, i.e., $m_{g}$ for $X_{i}^{(g)}(t)$, and how this truncation affects the estimate of the functional factors $\bm{f}_{i}$ in (9). Usually, $m_{g}$ is determined by the cumulative contribution of the corresponding functional scores to the variance, which tends to overestimate $m_{g}$, namely, $\widehat{m}_{g}>m_{g}$. To illustrate, assume $\widehat{m}_{1}>m_{1}$ and $\widehat{m}_{g}=m_{g}$ for $g=2,\ldots,G$, and define $\tilde{\bm{a}}_{i}=(\bm{a}_{i}^{(1)\top},\tilde{\bm{a}}_{i}^{(1)\top},\bm{a}_{i}^{(2)\top},\ldots,\bm{a}_{i}^{(G)\top})^{\top}$, where $\tilde{\bm{a}}_{i}^{(1)}$ is a redundant variable for the factor model. In this case,

Cov((𝒂i(g),𝒂~i(g)))=(Σmg𝟎𝟎Λ~0(g)),\displaystyle\operatorname{Cov}((\bm{a}_{i}^{(g)\top},\tilde{\bm{a}}_{i}^{(g)\top})^{\top})=\begin{pmatrix}\Sigma_{m_{g}}&\bm{0}\\ \bm{0}&\tilde{\Lambda}_{0}^{(g)}\end{pmatrix},

and

Cov(𝒂~i)=(Λm1𝟎Σm1m2Σm1mG𝟎Λ~0(1)𝟎𝟎Σm2m1𝟎Λm2Σm2mGΣmGm1𝟎ΣmGm2ΛmG).\displaystyle\operatorname{Cov}(\tilde{\bm{a}}_{i})=\begin{pmatrix}\Lambda_{m_{1}}&\bm{0}&\Sigma_{m_{1}m_{2}}&\cdots&\Sigma_{m_{1}m_{G}}\\ \bm{0}&\tilde{\Lambda}_{0}^{(1)}&\bm{0}&\cdots&\bm{0}\\ \Sigma_{m_{2}m_{1}}&\bm{0}&\Lambda_{m_{2}}&\cdots&\Sigma_{m_{2}m_{G}}\\ \vdots&\vdots&\vdots&\ddots&\vdots\\ \Sigma_{m_{G}m_{1}}&\bm{0}&\Sigma_{m_{G}m_{2}}&\cdots&\Lambda_{m_{G}}\end{pmatrix}.

If ξ=(ξ1,ξ2)\xi^{\star}=(\xi_{1}^{\star\top},\xi_{2}^{\star\top})^{\top} is one of the first KK eigenvectors of Cov(𝒂i)\operatorname{Cov}({\bm{a}}_{i}), (ξ1,𝟎,ξ2)(\xi_{1}^{\star\top},\bm{0},\xi_{2}^{\star\top})^{\top} is an eigenvector of Cov(𝒂~i)\operatorname{Cov}(\tilde{\bm{a}}_{i}) with the same eigenvalue, and the position of 𝟎\bm{0} in (ξ1,𝟎,ξ2)(\xi_{1}^{\star\top},\bm{0},\xi_{2}^{\star\top})^{\top} corresponds to that of Λ~0(1)\tilde{\Lambda}_{0}^{(1)} in Cov(𝒂~i)\operatorname{Cov}(\tilde{\bm{a}}_{i}). As 𝑭^=𝑨𝑩^diag(λ11,λK1)\widehat{\bm{F}}=\bm{A}\widehat{\bm{B}}\operatorname{diag}\left(\lambda_{1}^{-1}\ldots,\lambda_{K}^{-1}\right), 𝒂~i(g)\tilde{\bm{a}}_{i}^{(g)} will not affect 𝑭^\widehat{\bm{F}} since the row of 𝑩\bm{B} corresponding to 𝒂~i(g)\tilde{\bm{a}}_{i}^{(g)} is 𝟎\bm{0} when K^=K{\widehat{K}}=K. Actually, in this overestimating situation, we can treat

𝒂~i=(B1𝟎B2)𝒇i+(𝒖i(1)𝒂~i(1)𝒖i(C))=B𝒇i+𝒖~i,\displaystyle\tilde{\bm{a}}_{i}=\begin{pmatrix}B_{1}\\ \bm{0}\\ B_{2}\end{pmatrix}\bm{f}_{i}+\begin{pmatrix}\bm{u}_{i}^{(1)}\\ \tilde{\bm{a}}_{i}^{(1)}\\ \bm{u}_{i}^{(C)}\end{pmatrix}=B^{\star}\bm{f}_{i}+\tilde{\bm{u}}_{i},

as the real model with 𝒖i=(𝒖i(1),𝒖i(C))\bm{u}_{i}=(\bm{u}_{i}^{(1)}{}^{\top},\bm{u}_{i}^{(C)}{}^{\top})^{\top} and 𝑩=(B1,B2)\bm{B}=\left(B_{1}^{\top},B_{2}^{\top}\right)^{\top}.

Another case is that $m_{g}$ may be underestimated for some specific $g_{0}\leq G$, especially when $\omega$ and $K$ are relatively small. In this case, the majority of the variance of $X_{i}(t)$ is concentrated in $\tilde{\bm{\phi}}_{0}^{(g_{0})}(t)^{\top}\bm{f}_{i}$ by (10). When the $\tilde{\bm{\phi}}_{0}^{(g)}(t)$, $g=1,\ldots,G$, are linearly independent, $\widehat{m}_{g_{0}}$ is determined as $K$ by the variance contribution criterion. However, when they are correlated, an underestimation of $m_{g_{0}}$ may not significantly affect the estimation of $\widehat{\bm{f}}_{i}$. To illustrate, consider the covariance matrix of $\bm{a}_{i}$ with $G=2$ given by

Cov(𝒂i)=(Λm1Σm1m2Σm2m1Λm2),\displaystyle{\operatorname{Cov}(\bm{a}_{i})}=\begin{pmatrix}\Lambda_{m_{1}}&\Sigma_{m_{1}m_{2}}\\ \Sigma_{m_{2}m_{1}}&\Lambda_{m_{2}}\end{pmatrix},

with $K<m_{1},m_{2}$, and, for convenience, we assume $\Lambda_{u}=\bm{I}_{p}$. Since $B^{(g)}$ is an $m_{g}\times K$ matrix, $B^{(g)}B^{(g)\top}$ has at most $K$ non-zero eigenvalues, denoted as $\nu_{j}^{(g)},j=1,\ldots,K$. By plugging in $\Lambda_{m_{g}}=P^{(g)\top}B^{(g)}B^{(g)\top}P^{(g)}+\omega\bm{I}_{m_{g}}=\operatorname{diag}\{\nu_{1}^{(g)}+\omega,\ldots,\nu_{K}^{(g)}+\omega,\omega,\ldots,\omega\}$, $\operatorname{Cov}(\bm{a}_{i})$ is written as

Cov(𝒂i)\displaystyle{\operatorname{Cov}(\bm{a}_{i})} =(Λm1(K)0Σm1(K)m2(K)Σm1(K)m2(C)0Λm1(C)Σm1(C)m2(K)Σm1(C)m2(C)Σm2(K)m1(K)Σm2(K)m1(C)Λm2(K)0Σm2(C)m1(K)Σm2(C)m1(C)0Λm2(C))\displaystyle=\begin{pmatrix}\Lambda_{m_{1(K)}}&0&\Sigma_{m_{1(K)}m_{2(K)}}&\Sigma_{m_{1(K)}m_{2(C)}}\\ 0&\Lambda_{m_{1(C)}}&\Sigma_{m_{1(C)}m_{2(K)}}&\Sigma_{m_{1(C)}m_{2(C)}}\\ \Sigma_{m_{2(K)}m_{1(K)}}&\Sigma_{m_{2(K)}m_{1(C)}}&\Lambda_{m_{2(K)}}&0\\ \Sigma_{m_{2(C)}m_{1(K)}}&\Sigma_{m_{2(C)}m_{1(C)}}&0&\Lambda_{m_{2(C)}}\end{pmatrix}
(Λm1(K)0Σm1(K)m2(K)00000Σm2(K)m1(K)0Λm2(K)00000),\displaystyle\approx\begin{pmatrix}\Lambda_{m_{1(K)}}&0&\Sigma_{m_{1(K)}m_{2(K)}}&0\\ 0&0&0&0\\ \Sigma_{m_{2(K)}m_{1(K)}}&0&\Lambda_{m_{2(K)}}&0\\ 0&0&0&0\end{pmatrix},

with $\Lambda_{m_{g}(K)}=\operatorname{diag}\{\nu_{1}^{(g)}+\omega,\ldots,\nu_{K}^{(g)}+\omega\}$ and $\Lambda_{m_{g}(C)}=\operatorname{diag}\{\omega,\ldots,\omega\}$. Consequently, when $\omega\approx 0$, underestimating $m_{g}$ will not significantly affect $\widehat{\bm{f}}_{i}$ as long as $\widehat{m}_{g}\geq K$, which is easy to achieve in practice.

3 The functional factor augmentation selection model

3.1 The functional linear regression model

In this section, we address a multivariate functional linear regression model with correlated functional covariates, using the proposed fFAS in Section 2. To start with, consider a functional linear regression model where a scalar response YY with 𝔼(Y)=μY\mathbb{E}(Y)=\mu_{Y} is generated by a group of GG functional covariates {X(1)(t),X(2)(t),,X(G)(t);t𝒯}\{X^{(1)}(t),X^{(2)}(t),\ldots,X^{(G)}(t);t\in\mathcal{T}\} as

Y=μY+g=1G𝒯β(t)(g)(X(g)(t)μ(g)(t))dt+ϵ,Y=\mu_{Y}+\sum_{g=1}^{G}\int_{\mathcal{T}}\beta^{*}{}^{(g)}(t)\left(X^{(g)}(t)-\mu^{(g)}(t)\right)dt+\epsilon, (13)

where β()(g)\beta^{*}{}^{(g)}(\cdot) is the square-integrable regression parameter function, and ϵ\epsilon is a random noise with a zero mean and a constant variance σ2\sigma^{2}. We note β0=μY\beta_{0}^{*}=\mu_{Y} as the intercept term. Thus, with given i.i.d samples {yi,Xi(g)(t),g=1,,G;t𝒯,i=1,,n\{y_{i},X_{i}^{(g)}(t),g=1,\ldots,G;t\in\mathcal{T},i=1,\ldots,n} which have the same distribution as {Y,X(g)(t),g=1,,G;t𝒯}\{Y,X^{(g)}(t),g=1,\ldots,G;t\in\mathcal{T}\}, the sample functional linear regression model is described as

yi\displaystyle y_{i} =β0+g=1G𝒯β(t)(g)(Xi(g)(t)μ(g)(t))dt+ϵi.\displaystyle=\beta_{0}^{*}+\sum_{g=1}^{G}\int_{\mathcal{T}}\beta^{*}{}^{(g)}(t)\left(X_{i}^{(g)}(t)-\mu^{(g)}(t)\right)dt+\epsilon_{i}. (14)

To detect active functional covariates in the presence of correlation, we develop a functional factor augmentation selection model (fFASM) as follows. Without loss of generality, we use $X_{i}^{(g)}(\cdot)$ and $y_{i}$ to denote the centered functional covariates and scalar response, respectively, and accordingly (14) is equivalent to

yi=g=1G𝒯β(t)(g)Xi(g)(t)dt+ϵi,y_{i}=\sum_{g=1}^{G}\int_{\mathcal{T}}\beta^{*}{}^{(g)}(t)X_{i}^{(g)}(t)dt+\epsilon_{i}, (15)

and can be further expanded by K-L expansion as

yi\displaystyle y_{i} =g=1Gj=1mgaij(g)ηj+(g)g=1G𝒯β(t)(g)ei(g)(t)dt+ϵi=𝑯𝒂i+ϵ~i,\displaystyle=\sum_{g=1}^{G}\sum_{j=1}^{m_{g}}a_{ij}^{(g)}\eta_{j}^{*}{}^{(g)}+\sum_{g=1}^{G}\int_{\mathcal{T}}\beta^{*}{}^{(g)}(t)e_{i}^{(g)}(t)dt+\epsilon_{i}=\bm{H}^{*}{}^{\top}\cdot\bm{a}_{i}\ +\tilde{\epsilon}_{i},

where $\bm{H}^{*}=(\bm{\eta}^{*(1)\top},\ldots,\bm{\eta}^{*(G)\top})^{\top}\in\mathbb{R}^{p}$, $\eta_{j}^{*(g)}=\int_{\mathcal{T}}\beta^{*(g)}(t)\gamma_{j}^{(g)}(t)\,dt$, $j=1,\ldots,m_{g}$, and $\tilde{\epsilon}_{i}=\epsilon_{i}+\sum_{g=1}^{G}\int_{\mathcal{T}}\beta^{*(g)}(t)e_{i}^{(g)}(t)\,dt$, where each $\int_{\mathcal{T}}\beta^{*(g)}(t)e_{i}^{(g)}(t)\,dt$, $g=1,\ldots,G$, has zero mean. Consequently, selecting useful functional covariates $X_{i}^{(g)}(t)$ amounts to finding the $g$'s with $\beta^{*(g)}(\cdot)\neq 0$, which is further formulated as $\bm{\eta}^{*(g)}\neq\bm{0}$. Plugging in the proposed fFAS, one easily obtains

yi=𝑯(𝑩𝒇i+𝒖i)+ϵ~i,\displaystyle y_{i}=\bm{H}^{*}{}^{\top}\left(\bm{B}\bm{f}_{i}+\bm{u}_{i}\right)+\tilde{\epsilon}_{i},

or equivalently,

𝒚=𝑭𝑩𝑯+𝑼𝑯+ϵ~,\displaystyle\bm{y}=\bm{F}\bm{B}^{\top}\bm{H}^{*}{}+\bm{U}\bm{H}^{*}{}+\tilde{\bm{\epsilon}}, (16)

and with 𝑭^\widehat{\bm{F}}, 𝑩^\widehat{\bm{B}} and 𝑼^\widehat{\bm{U}} obtained by the fFAS, (16) is further equivalent to

(𝑰n𝑷𝑭^)𝒚\displaystyle\left(\bm{I}_{n}-\bm{P}_{\widehat{\bm{F}}}\right)\bm{y} =(𝑰n𝑷𝑭^)𝑼^𝑯+(𝑰n𝑷𝑭^)(𝒆~+ϵ~),\displaystyle=\left(\bm{I}_{n}-\bm{P}_{\widehat{\bm{F}}}\right)\widehat{\bm{U}}\bm{H}^{*}{}+\left(\bm{I}_{n}-\bm{P}_{\widehat{\bm{F}}}\right)\left(\tilde{\bm{e}}+\tilde{\bm{\epsilon}}\right), (17)

where $\bm{P}_{\widehat{\bm{F}}}=\widehat{\bm{F}}(\widehat{\bm{F}}^{\top}\widehat{\bm{F}})^{-1}\widehat{\bm{F}}^{\top}$ is the orthogonal projection matrix onto the column space $C(\widehat{\bm{F}})$, and $\tilde{\bm{e}}=(\tilde{e}_{1},\ldots,\tilde{e}_{n})^{\top}=(\bm{A}-\widehat{\bm{A}}+\mathbb{E}(\widehat{\bm{A}}))\bm{H}^{*}$. Consequently, to select useful functional covariates, the penalized loss function

Ln(𝒚,𝑼^𝑯,𝑭^)=1n(𝑰P𝑭^)(𝒚𝑼^𝑯)22+g=1Gk=1mgJλ(ηk(g))\displaystyle L_{n}\left(\bm{y},\widehat{\bm{U}}\bm{H},\widehat{\bm{F}}\right)=\frac{1}{n}\left\|(\bm{I}-P_{\widehat{\bm{F}}})(\bm{y}-\widehat{\bm{U}}\bm{H})\right\|_{2}^{2}+\sum_{g=1}^{G}\sum_{k=1}^{m_{g}}J_{\lambda}\left(\eta_{k}^{(g)}\right)

is minimized with respect to $\bm{H}$, where $J_{\lambda}=\lambda\cdot J(\cdot)$ is a penalty controlled by the parameter $\lambda$, and $J(\cdot)$ can be set to popular penalties such as the lasso, SCAD, or MCP; $\lambda$ can be selected using cross-validation. Hence, we transform the problem from model selection with highly correlated functional covariates to model selection with weakly correlated or uncorrelated ones by lifting the parameter space to a higher dimension. Once $\widehat{\bm{H}}=\mathop{\rm argmin}_{\bm{H}}L_{n}(\bm{y},\widehat{\bm{U}}\bm{H},\widehat{\bm{F}})$ is obtained, $\beta^{(g)}(t)$ is estimated as

β^(g)(t)=j=1m^gη^j(g)γ^j(g)(t).\displaystyle\widehat{\beta}^{(g)}(t)=\sum_{j=1}^{\widehat{m}_{g}}\widehat{\eta}_{j}^{(g)}\widehat{\gamma}_{j}^{(g)}(t).

Accordingly, when $\widehat{\beta}^{(g)}(t)\neq 0$, $X_{i}^{(g)}(t)$ is selected as a useful functional covariate. Note that group selection methods, such as the GM strategy of Aneiros, Novo & Vieu (2022), are not adopted in our case.
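In the linear case, minimizing the penalized loss $L_{n}(\bm{y},\widehat{\bm{U}}\bm{H},\widehat{\bm{F}})$ above amounts to a sparse regression of the projected response on the projected $\widehat{\bm{U}}$, followed by rebuilding each coefficient curve from its block of $\widehat{\bm{H}}$. The sketch below uses an $\ell_1$ penalty via scikit-learn's Lasso as a stand-in for the MCP used in our experiments (a dedicated MCP solver would be substituted in practice); the eigenfunction estimates and the per-covariate truncation levels are assumed to come from the fPCA step.

```python
import numpy as np
from sklearn.linear_model import Lasso

def ffasm_linear(y, U_hat, F_hat, lam):
    """Penalized fit of (17): project out C(F_hat), then run a sparse
    regression of the projected response on the projected U_hat."""
    n = len(y)
    P_F = F_hat @ np.linalg.solve(F_hat.T @ F_hat, F_hat.T)  # projection onto C(F_hat)
    M = np.eye(n) - P_F
    model = Lasso(alpha=lam, fit_intercept=False)             # L1 stand-in for MCP
    model.fit(M @ U_hat, M @ y)
    return model.coef_                                        # estimate of H

def reconstruct_beta(H_hat, phi_hat_list):
    """Rebuild each beta^(g)(t) on the grid from its block of H_hat and the
    estimated eigenfunctions phi_hat_list[g] of shape (T, m_g)."""
    betas, start = [], 0
    for phi in phi_hat_list:
        m_g = phi.shape[1]
        betas.append(phi @ H_hat[start:start + m_g])
        start += m_g
    # A covariate is selected when its reconstructed curve is not identically zero.
    return betas
```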

Also, this procedure may work even in the generalized linear context. Indeed,

𝒚\displaystyle\bm{y} =(𝑨^𝔼(𝑨^))𝑯+(𝑨𝑨^+𝔼(𝑨^))𝑯+ϵ~\displaystyle=\left(\widehat{\bm{A}}-\mathbb{E}(\widehat{\bm{A}})\right)\bm{H}^{*}+\left(\bm{A}-\widehat{\bm{A}}+\mathbb{E}(\widehat{\bm{A}})\right)\bm{H}^{*}{}+\tilde{\bm{\epsilon}}
=𝑭^𝑩^𝑯+𝑼^𝑯+(𝑨𝑨^+𝔼(𝑨^))𝑯+ϵ~\displaystyle=\widehat{\bm{F}}\widehat{\bm{B}}^{\top}\bm{H}^{*}{}+\widehat{\bm{U}}\bm{H}^{*}{}+\left(\bm{A}-\widehat{\bm{A}}+\mathbb{E}(\widehat{\bm{A}})\right)\bm{H}^{*}{}+\tilde{\bm{\epsilon}}
=𝑭^𝑩^𝑯+𝑼^𝑯+𝒆~+ϵ~,\displaystyle=\widehat{\bm{F}}\widehat{\bm{B}}^{\top}\bm{H}^{*}{}+\widehat{\bm{U}}\bm{H}^{*}{}+\tilde{\bm{e}}+\tilde{\bm{\epsilon}}, (18)

indicating that the explanatory variables are switched from $\bm{A}$ to $\widehat{\bm{A}}$. In practice, when the sample mean of $\widehat{\bm{A}}$ is used as a substitute for $\mathbb{E}(\widehat{\bm{A}})$, $\mathbb{E}(\widehat{\bm{a}}_{i})\rightarrow 0$ under some regularity conditions (Kong et al., 2016). After centering $\widehat{\bm{A}}$, by $(\widehat{\bm{A}}-\mathbb{E}(\widehat{\bm{A}}))\bm{H}^{*}=\widehat{\bm{F}}\widehat{\bm{B}}^{\top}\bm{H}^{*}+\widehat{\bm{U}}\bm{H}^{*}=\widehat{\bm{F}}\bm{\gamma}^{*}+\widehat{\bm{U}}\bm{H}^{*}$, the unknown parameters are transformed into $(\bm{H},\bm{\gamma})$, so that their corresponding covariates $\widehat{\bm{U}}$ and $\widehat{\bm{F}}$ are weakly correlated by introducing $\bm{\gamma}$. Note that in the linear case, $\bm{\gamma}^{*}$ is further eliminated by using the projection matrix in (17). Consequently, for generalized linear models and samples without centering, the loss function is updated as

Ln(𝒚,𝑾^𝜽)=\displaystyle L_{n}\left(\bm{y},\widehat{\bm{W}}\bm{\theta}\right)= 1ni=1n[yi(β0+𝒇^i𝜸+𝒖^i𝑯)+b(β0+𝒇^i𝜸+𝒖^i𝑯)]\displaystyle\frac{1}{n}\sum_{i=1}^{n}\left[-y_{i}\left(\beta_{0}+\widehat{\bm{f}}_{i}^{\top}\bm{\gamma}+\widehat{\bm{u}}_{i}^{\top}\bm{H}\right)+b\left(\beta_{0}+\widehat{\bm{f}}_{i}^{\top}\bm{\gamma}+\widehat{\bm{u}}_{i}^{\top}\bm{H}\right)\right]
+g=1Gk=1mgJλ(ηk(g))+Jλ(β0),\displaystyle+\sum_{g=1}^{G}\sum_{k=1}^{m_{g}}J_{\lambda}\left(\eta_{k}^{(g)}\right)+J_{\lambda}(\beta_{0}), (19)

where 𝒘^i=(1,𝒖^i,𝒇^i)\widehat{\bm{w}}_{i}=\left(1,\widehat{\bm{u}}_{i}^{\top},\widehat{\bm{f}}_{i}^{\top}\right)^{\top}, 𝑾^=(𝒘^1,,𝒘^n)\widehat{\bm{W}}=\left(\widehat{\bm{w}}_{1},\ldots,\widehat{\bm{w}}_{n}\right) and 𝜽=(β0,𝑯,𝜸)\bm{\theta}=\left(\beta_{0},\bm{H}^{\top},\bm{\gamma}^{\top}\right)^{\top}, β0\beta_{0} is the intercept term and b()b(\cdot) is a known function, where b(z)=z2/2b(z)=z^{2}/2 in linear models. The estimate is given by (β0^,𝑯^,𝜸^)=argmin𝜽Ln(𝒚,𝑾^𝜽)\left(\widehat{\beta_{0}},\widehat{\bm{H}}^{\top},\widehat{\bm{\gamma}}^{\top}\right)^{\top}=\mathop{\rm argmin}_{\bm{\theta}}L_{n}\left(\bm{y},\widehat{\bm{W}}\bm{\theta}\right).
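To illustrate the structure of the generalized linear loss (19), the sketch below evaluates it for a logistic model, where $b(z)=\log(1+e^{z})$, on the augmented design $(1,\widehat{\bm{u}}_{i}^{\top},\widehat{\bm{f}}_{i}^{\top})^{\top}$, with an $\ell_1$ penalty standing in for the MCP used in our experiments; a generic numerical optimizer such as scipy.optimize.minimize can then be applied to this function.

```python
import numpy as np

def glm_loss(theta, y, U_hat, F_hat, lam):
    """Penalized loss (19) for a logistic model on the augmented design.

    theta stacks (beta_0, H, gamma); b(z) = log(1 + exp(z)) is evaluated
    stably via np.logaddexp(0, z); the penalty is L1 on beta_0 and H."""
    p = U_hat.shape[1]
    beta0, H, gamma = theta[0], theta[1:1 + p], theta[1 + p:]
    eta = beta0 + F_hat @ gamma + U_hat @ H            # linear predictor
    nll = np.mean(-y * eta + np.logaddexp(0.0, eta))   # -y*z + b(z), averaged
    return nll + lam * (np.abs(H).sum() + np.abs(beta0))
```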

3.2 Theoretical justifications for functional variable selection

In this section, the proposed method is theoretically investigated in the generalized linear model context of (3.1), so the linear model is covered as a special case. To start with, some notation and assumptions are introduced. Recall $p=\sum_{g=1}^{G}m_{g}$, and define $p_{1}=1+p$, $\bm{H}_{1}=(\beta_{0},\bm{H}^{\top})^{\top}$, $\bm{H}_{1}^{*}=(\beta_{0}^{*},\bm{H}^{*\top})^{\top}$, $\bm{\theta}^{*}=(\beta_{0}^{*},\bm{H}^{*\top},(\bm{B}^{\top}\bm{H}^{*})^{\top})^{\top}$, $S=\operatorname{supp}(\bm{\theta}^{*})$, $S_{1}=\operatorname{supp}(\bm{H}_{1}^{*})$, and $S_{2}=[p_{1}+K]\backslash S$. Suppose $\widehat{\bm{F}}$ and $\widehat{\bm{U}}$ are obtained given $\bm{A}$. We write

Xi(g)(t)\displaystyle X_{i}^{(g)}(t) =Xf,i(g)(t)+Xu,i(g)(t)+e0i(g)(t),\displaystyle=X_{f,i}^{(g)}(t)+X_{u,i}^{(g)}(t)+e_{0i}^{(g)}(t), (20)

with Xf,i(g)(t)=𝜸(g)(t)B(g)𝒇iX_{f,i}^{(g)}(t)=\bm{\gamma}^{(g)}(t){}^{\top}B^{(g)}\bm{f}_{i} and Xu,i(g)(t)=𝜸(g)(t)𝒖i(g)X_{u,i}^{(g)}(t)={\bm{\gamma}^{(g)}(t)}{}^{\top}\bm{u}_{i}^{(g)}.

Assumption 3

(Smoothness). b(z)C3()b(z)\in C^{3}(\mathbb{R}), i.e., for some constants M2M_{2} and M3M_{3}, 0b′′(z)M20\leq b^{\prime\prime}(z)\leq M_{2} and |b′′′(z)|\left|b^{\prime\prime\prime}(z)\right| M3,z\leq M_{3},\forall z.

Assumption 4

(Restricted strong convexity and irrepresentable condition). For 𝐖=(𝐰1,,𝐰n){\bm{W}}=\left({\bm{w}}_{1},\ldots,{\bm{w}}_{n}\right) where 𝐰i=(1,𝐮i,𝐟i){\bm{w}}_{i}=\left(1,{\bm{u}}_{i}^{\top},{\bm{f}}_{i}^{\top}\right)^{\top}, there exist κ2>κ>0\kappa_{2}>\kappa_{\infty}>0 and τ(0,0.5)\tau\in(0,0.5) such that

(Convexity)[SS2Ln(𝒚,𝑾𝜽)]114κ, for =2 and ,\displaystyle\text{(Convexity)}\left\|\left[\nabla_{SS}^{2}L_{n}\left(\bm{y},\bm{W}\boldsymbol{\theta}^{*}\right)\right]^{-1}\right\|_{\ell}\leq\frac{1}{4\kappa_{\ell}},\quad\text{ for }\ell=2\text{ and }\infty, (21)
(Irrepresentable condition)S2S2Ln(𝒚,𝑾𝜽)[SS2Ln(𝒚,𝑾𝜽)]112τ.\displaystyle\text{(Irrepresentable condition)}\left\|\nabla_{S_{2}S}^{2}L_{n}\left(\bm{y},\bm{W}\boldsymbol{\theta}^{*}\right)\left[\nabla_{SS}^{2}L_{n}\left(\bm{y},\bm{W}\boldsymbol{\theta}^{*}\right)\right]^{-1}\right\|_{\infty}\leq 1-2\tau. (22)
Assumption 5

(Estimation of factor model). γj(g)()Mγ\|\gamma_{j}^{(g)}(\cdot)\|_{\infty}\leq M_{\gamma} for j=1,,mg,g=1,,Gj=1,\ldots,m_{g},~{}g=1,\ldots,G, and Xf,i(g)()12MγM0λ1(B)\|X_{f,i}^{(g)}(\cdot)\|_{\infty}\leq\frac{1}{2}M_{\gamma}M_{0}\sqrt{\lambda_{1}^{(B)}} and Xu,i(g)()12MγM0\|X_{u,i}^{(g)}(\cdot)\|_{\infty}\leq\frac{1}{2}M_{\gamma}M_{0} for some constant M0>2M_{0}>2. In addition, there exists a K×KK\times K nonsingular matrix 𝐕0\bm{V}_{0}, and 𝐕=(𝐈p1𝟎p1×K𝟎K×p1𝐕0)\bm{V}=\left(\begin{array}[]{cc}\bm{I}_{p_{1}}&\bm{0}_{{p_{1}}\times K}\\ \bm{0}_{K\times{p_{1}}}&\bm{V}_{0}\end{array}\right) such that for 𝐖¯=𝐖^𝐕\overline{\bm{W}}=\widehat{\bm{W}}\bm{V}, we have 𝐖¯𝐖maxM02\|\overline{\bm{W}}-\bm{W}\|_{\mathop{\rm max}}\leq\frac{M_{0}}{2} and maxj[p1+K]\mathop{\rm max}_{j\in[p_{1}+K]} (1ni=1n|w¯ijwij|2)1/22κτ3M0M2|S|\left(\frac{1}{n}\sum_{i=1}^{n}\left|\bar{w}_{ij}-w_{ij}\right|^{2}\right)^{1/2}\leq\frac{2\kappa_{\infty}\tau}{3M_{0}M_{2}|S|}, with wijw_{ij} and w¯ij\bar{w}_{ij} being the (i,j)(i,j)-th element of 𝐖\mathbf{W} and 𝐖¯\overline{\mathbf{W}}.

Assumption 3 holds for a large family of generalized linear models; for example, the linear model has $b(z)=\frac{1}{2}z^{2}$, $M_{2}=1$, and $M_{3}=0$, and the logistic model has $b(z)=\log(1+e^{z})$ and finite $M_{2},M_{3}$. Assumption 4 is easily satisfied for a small submatrix and holds with high probability as long as $\mathbb{E}[\nabla^{2}L_{n}(\bm{y},\bm{W}\bm{\theta}^{*})]$ satisfies similar conditions, by standard concentration inequalities (Merlevède et al., 2011). Assumption 5 bounds $\|\bm{W}\|_{\max}$ when $X_{i}^{(g)}(\cdot)$ and $\varepsilon_{i}^{(g)}(\cdot)$ are bounded, and

𝒇^i𝒇i22\displaystyle\|\widehat{\bm{f}}_{i}^{\top}-\bm{f}_{i}^{\top}\|_{2}^{2} =O(ω2+𝒖i𝑩ΛB122+1/n),\displaystyle=O(\omega^{2}+\|\bm{u}_{i}^{\top}\bm{B}\Lambda_{B}^{-1}\|_{2}^{2}+1/n),
𝒖^i𝒖i22\displaystyle\|\widehat{\bm{u}}_{i}^{\top}-\bm{u}_{i}^{\top}\|_{2}^{2} =O(ω2+𝒖i𝑩ΛB1𝑩22+1/n),\displaystyle=O(\omega^{2}+\|\bm{u}_{i}^{\top}\bm{B}\Lambda_{B}^{-1}\bm{B}^{\top}\|_{2}^{2}+1/n),

will be satisfied with a high probability when nn is large and ω\omega is small. Furthermore, with J()=||J(\cdot)=|\cdot| and 𝑯^\bm{\widehat{H}} obtained by minimizing (3.1), the following result is established.

Theorem 2

Suppose Assumptions 3-5 hold. Define $M=M_{0}^{3}M_{3}|S|^{3/2}$, and

ε=maxj[p1+K]|1ni=1nw¯ij[yi+b(𝒂iT𝑯+β0)]|.\varepsilon^{*}=\mathop{\rm max}_{j\in[p_{1}+K]}\left|\frac{1}{n}\sum_{i=1}^{n}\bar{w}_{ij}\left[-y_{i}+b^{\prime}\left({\bm{a}}_{i}^{T}\bm{H}^{*}+\beta_{0}^{*}\right)\right]\right|.

If 7ετ<λ<κ2κτ12M|S|\frac{7\varepsilon^{*}}{\tau}<\lambda<\frac{\kappa_{2}\kappa_{\infty}\tau}{12M\sqrt{|S|}}, then supp(𝐇^)supp(𝐇)\operatorname{supp}(\widehat{\bm{H}})\subseteq\operatorname{supp}\left(\bm{H}^{*}\right) and

𝑯^𝑯6λ5κ,𝑯^𝑯24λ|S|κ2,𝑯^𝑯16λ|S|5κ.\left\|\widehat{\bm{H}}-\bm{H}^{*}\right\|_{\infty}\leq\frac{6\lambda}{5\kappa_{\infty}},\quad\left\|\widehat{\bm{H}}-\bm{H}^{*}\right\|_{2}\leq\frac{4\lambda\sqrt{|S|}}{\kappa_{2}},\quad\left\|\widehat{\bm{H}}-\bm{H}^{*}\right\|_{1}\leq\frac{6\lambda|S|}{5\kappa_{\infty}}.

In addition, if ε<κ2κτ212CM|S|\varepsilon^{*}<\frac{\kappa_{2}\kappa_{\infty}\tau^{2}}{12CM\sqrt{|S|}} and min{|𝐇1j|:𝐇1j0,j[p1]}>6Cε5κτ\min\left\{\left|{\bm{H}_{1}^{*}}_{j}\right|:{\bm{H}_{1}^{*}}_{j}\neq 0,j\in[p_{1}]\right\}>\frac{6C\varepsilon^{*}}{5\kappa_{\infty}\tau} hold for some C>7C>7, where 𝐇1j{\bm{H}_{1}^{*}}_{j} means the j-th element of 𝐇1{\bm{H}_{1}^{*}}, then by taking λ(7τε,Cτε)\lambda\in\left(\frac{7}{\tau}\varepsilon^{*},\frac{C}{\tau}\varepsilon^{*}\right) the sign consistency sign(𝐇^)=sign(𝐇)\operatorname{sign}(\widehat{\bm{H}})=\operatorname{sign}\left(\bm{H}^{*}\right) is achieved.

Theorem 2 guarantees the selection consistency of the functional covariates by establishing sign consistency under mild conditions. Furthermore, it characterizes the relationship between $\|\widehat{\bm{H}}-\bm{H}^{*}\|$ and $\lambda$, whose value depends on $\varepsilon^{*}$ and $\tau$, where $\varepsilon^{*}$ comes from the first-order partial derivative of the empirical loss function (without penalty) evaluated at the true parameters, and $\tau$ satisfies a generalized irrepresentable condition from (22) (Lee et al., 2015). When $\varepsilon^{*}$ is small, selecting an appropriate $\lambda$ leads to satisfactory performance in model estimation. Furthermore, when $\bm{A}$ is not available, $\widehat{\bm{A}}$ is estimated and employed instead. The sign consistency is then still guaranteed by showing that the loss function (3.1) based on $\widehat{\bm{A}}$ has the following asymptotic property.

Theorem 3

Let Ln(𝐲,𝐖^𝛉)|𝐀L_{n}\left(\bm{y},\widehat{\bm{W}}\bm{\theta}\right)|_{\bm{A}} be the loss function in (3.1), based on which 𝐟^i\widehat{\bm{f}}_{i} and 𝐮^i\widehat{\bm{u}}_{i} are obtained using the true 𝐀\bm{A}, and Ln(𝐲,𝐖^𝛉)|𝐀^L_{n}\left(\bm{y},\widehat{\bm{W}}\bm{\theta}\right)|_{\widehat{\bm{A}}} using 𝐀^{\widehat{\bm{A}}}. Then for any fixed 𝛉\bm{\theta}, by Assumption 3 and Lemma 4,

|Ln(𝒚,𝑾^𝜽)|𝑨Ln(𝒚,𝑾^𝜽)|𝑨^|0 in probability.\displaystyle\left|L_{n}\left(\bm{y},\widehat{\bm{W}}\bm{\theta}\right)|_{\bm{A}}-L_{n}\left(\bm{y},\widehat{\bm{W}}\bm{\theta}\right)|_{\widehat{\bm{A}}}\right|\rightarrow 0\text{ in probability}.
Corollary 1

Suppose Ln(𝐲,𝐖^𝛉)L_{n}\left(\bm{y},\widehat{\bm{W}}\bm{\theta}\right) has a unique global minimum. Then Ln(𝐲,𝐖^𝛉)|𝐀L_{n}\left(\bm{y},\widehat{\bm{W}}\bm{\theta}\right)|_{\bm{A}} and Ln(𝐲,𝐖^𝛉)|𝐀^L_{n}\left(\bm{y},\widehat{\bm{W}}\bm{\theta}\right)|_{\widehat{\bm{A}}} will have an identical global minimizer when nn is sufficiently large, so that the estimates from Ln(𝐲,𝐖^𝛉)|𝐀^L_{n}\left(\bm{y},\widehat{\bm{W}}\bm{\theta}\right)|_{\widehat{\bm{A}}} will also have sign consistency.

4 Simulation Studies

In this section, the performance of the proposed fFASM estimator is examined for correlated functional data using simulation studies. To start with, the functional covariates {Xi(g)(t)}g=1G\{X_{i}^{(g)}(t)\}_{g=1}^{G} are generated by Xi(g)(t)=𝒂0,i(g)ϕ0,mg(g)(t)+εi(t)X_{i}^{(g)}(t)=\bm{a}_{0,i}^{(g)}{}^{\top}\cdot\bm{\phi}_{0,m_{g}}^{(g)}(t)+\varepsilon_{i}(t) with εi()\varepsilon_{i}(\cdot) independently and identically (i.i.d.) drawn from N(0,0.25)N(0,0.25), and sampled over 51 uniformly distributed grid points for tt from 𝒯=[0,1]\mathcal{T}=[0,1] with interval lengths of 0.02, where GG is set as 20, 50, 100 and 150, respectively. The basis functions ϕ0,mg(g)(t)=ϕm(t)=(ϕ1(t),,ϕm(t))\bm{\phi}_{0,m_{g}}^{(g)}(t)=\bm{\phi}_{m}(t)=(\phi_{1}(t),\ldots,\phi_{m}(t))^{\top} are Fourier basis functions with mm set as 10. To blend correlations into the functional covariates Xi(g)(t)X_{i}^{(g)}(t), two scenarios are considered when 𝒂i\bm{a}_{i} is generated:

  • Scenario I: the factor model case, i.e., 𝒂0,i=(𝒂0,i(1),,𝒂0,i(G))=𝑩𝒇i+𝒖i,\bm{a}_{0,i}=\left(\bm{a}_{0,i}^{(1)\top}{},\ldots,\bm{a}_{0,i}^{(G)\top}{}\right)^{\top}=\bm{B}\bm{f}_{i}+\bm{u}_{i}, where the elements of 𝒇iK\bm{f}_{i}\in\mathbb{R}^{K}, 𝒖iGm\bm{u}_{i}\in\mathbb{R}^{Gm} and 𝑩Gm×K\bm{B}\in\mathbb{R}^{Gm\times K} are generated from N(0,25)N(0,25), N(0,1)N(0,1), and N(0,1)N(0,1), respectively. The true number of factors KK is set as 1,2,,61,2,\ldots,6, respectively;

  • Scenario II: the equal correlation case, i.e., 𝒂0,i\bm{a}_{0,i} are generated from a multivariate normal distribution 𝒂0,iN(𝟎,ΣEqual)\bm{a}_{0,i}\sim N(\bm{0},\Sigma_{Equal}) where ΣEqual\Sigma_{Equal} has diagonal elements as 1 and off-diagonal elements as ρ\rho. The true value of ρ\rho is set as 0,0.1,0.2,0.90,0.1,0.2\ldots,0.9, respectively.

Furthermore, the true functional coefficients β(g)\beta^{*(g)} in (15) are generated as β(1)(t)=β(2)(t)=β(3)(t)=β(4)(t)=(1,1/22,,1/m2)ϕm(t)\beta^{*(1)}(t)=\beta^{*(2)}(t)=\beta^{*(3)}(t)=\beta^{*(4)}(t)=(1,1/2^{2},\ldots,1/m^{2})\cdot\bm{\phi}_{m}(t) (the Harmonic attenuation type), β(5)(t)=β(6)(t)=(0,1/22,0,,0)ϕm(t)\beta^{*(5)}(t)=\beta^{*(6)}(t)=(0,1/2^{2},0,\ldots,0)\cdot\bm{\phi}_{m}(t) (the weak single signal type), and β(g)(t)=0\beta^{*(g)}(t)=0 for g=7,8,,Gg=7,8,\ldots,G. Accordingly, the response yiy_{i} is generated by (15) with ϵiN(0,0.1)\epsilon_{i}\sim N(0,0.1).
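As an illustration, one way to generate data under Scenario I is sketched below in Python; the particular Fourier basis convention, the form of the linear predictor in (15) (taken here to be \sum_{g}\bm{a}_{0,i}^{(g)\top}\bm{b}^{(g)} with orthonormal basis coefficients \bm{b}^{(g)} for the coefficient curves), and all variable names are assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(2024)
n, G, m, K = 100, 20, 10, 2
t = np.linspace(0, 1, 51)                        # 51 equally spaced grid points on [0, 1]

def fourier_basis(t, m):
    """First m Fourier basis functions on [0, 1] (one common convention)."""
    Phi = np.ones((len(t), m))
    for j in range(1, m):
        k = (j + 1) // 2
        Phi[:, j] = np.sqrt(2) * (np.sin(2 * np.pi * k * t) if j % 2
                                  else np.cos(2 * np.pi * k * t))
    return Phi                                   # len(t) x m

Phi = fourier_basis(t, m)

# Scenario I: basis scores follow the factor structure a_{0,i} = B f_i + u_i
B = rng.standard_normal((G * m, K))
F = rng.normal(0.0, 5.0, size=(n, K))            # factor variance 25
U = rng.standard_normal((n, G * m))
A0 = F @ B.T + U                                 # n x (G*m) score matrix

# observed functional covariates X_i^{(g)}(t) = a_{0,i}^{(g)T} phi(t) + eps_i(t)
X = A0.reshape(n, G, m) @ Phi.T + rng.normal(0.0, 0.5, size=(n, G, len(t)))

# true coefficient curves: 4 harmonic-attenuation, 2 weak-single-signal, rest zero
beta = np.zeros((G, m))
beta[:4] = 1.0 / np.arange(1, m + 1) ** 2
beta[4:6, 1] = 1.0 / 2 ** 2

# scalar response, assuming the linear predictor sum_g a_{0,i}^{(g)T} b^{(g)}
y = np.einsum('igm,gm->i', A0.reshape(n, G, m), beta) + rng.normal(0.0, np.sqrt(0.1), n)
```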

We compare the performance of the proposed fFASM method (with the MCP penalty) with other functional variable selection methods, including the MCP (MCP) and the group MCP (grMCP) methods, on the synthetic data in different scenarios. To evaluate the performance, three different measures are adopted, i.e., the model size defined as the cardinality of the set {g:β^(g)(t)0}\left\{g:\widehat{\beta}^{(g)}(t)\neq 0\right\}, the integrated mean squared error (IMSE)

IMSE=g=1G𝔼(β^(g)(t)β(g)(t)L22)=g=1G𝔼𝒯(β^(g)(t)β(g)(t))2𝑑t,IMSE=\sum_{g=1}^{G}\mathbb{E}\left(\left\|\widehat{\beta}^{(g)}(t)-\beta^{*(g)}(t)\right\|^{2}_{L^{2}}\right)=\sum_{g=1}^{G}\mathbb{E}\int_{\mathcal{T}}\left(\widehat{\beta}^{(g)}(t)-\beta^{*(g)}(t)\right)^{2}dt,

and the true positive rate (TPR)

TPR=TPTP+FN,\text{TPR}=\frac{\text{TP}}{\text{TP}+\text{FN}},

where TP represents the number of correctly identified nonzero instances of β^(g)(t)\widehat{\beta}^{(g)}(t) (coefficient curves that are actually nonzero and estimated as nonzero), and FN represents the number of instances that are actually nonzero but are estimated as zero. Note that a model with a TPR of 1 and a model size of 6 indicates perfect recovery. The proposed fFASM method employs the MCP regularization, and the hyperparameters are tuned by minimizing the cross-validation error over a randomly selected subset of [n/3][n/3] training samples, where the sample size nn is set as 100. The whole experiment is repeated 500 times, and the averaged results are reported.
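A brief Python sketch of the three measures, computed from coefficient curves evaluated on the common grid, is given below; the Riemann-sum approximation of the integral and the zero tolerance tol are illustrative choices, and the expectation in the IMSE is approximated in practice by averaging over the 500 replications.

```python
import numpy as np

def imse(beta_hat, beta_true, t):
    """sum_g of the integrated squared error, approximated by a Riemann sum;
    both curve arrays have shape (G, len(t)) on an equally spaced grid t."""
    dt = t[1] - t[0]
    return float(((beta_hat - beta_true) ** 2).sum() * dt)

def model_size(beta_hat, tol=1e-10):
    """Number of functional covariates whose estimated coefficient curve is nonzero."""
    return int(np.any(np.abs(beta_hat) > tol, axis=1).sum())

def tpr(beta_hat, beta_true, tol=1e-10):
    """True positive rate TP / (TP + FN) over the coefficient curves."""
    truly_nonzero = np.any(np.abs(beta_true) > tol, axis=1)
    selected = np.any(np.abs(beta_hat) > tol, axis=1)
    tp = np.sum(truly_nonzero & selected)
    fn = np.sum(truly_nonzero & ~selected)
    return tp / (tp + fn)
```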

Figure 1: Estimation and selection performance of three methods with different numbers of factors KK for Scenario I. In each row, four subfigures are displayed with G=20,50,100,150G=20,50,100,150, respectively, and in each column with a given KK, the IMSE (with 0.5 standard deviations), the TPR, and model size are presented, respectively.

Figure 1 shows the estimation and selection performance for Scenario I (the factor model structure case) using different methods. As is easily observed, the proposed fFASM method significantly and consistently outperforms MCP and grMCP in the sense that fFASM has the smallest IMSE for each KK, demonstrating a more accurate estimation of the functional coefficients. Furthermore, the TPR of the proposed fFASM method remains greater than those of MCP and grMCP as KK increases, whereas the TPRs of MCP and grMCP even drop dramatically with increasing KK. In terms of the model size, the proposed fFASM tends to select slightly more functional covariates than the true model size 6, while a closer look at the selection results shows that the MCP and grMCP methods tend to include irrelevant functional covariates, with grMCP being more conservative than MCP in selecting functional covariates.

Figure 2: Estimation and selection performance of three methods with different correlations for Scenario II. In each row, four subfigures are displayed with G=20,50,100,150G=20,50,100,150, respectively, and in each column with a given ρ\rho, the IMSE (with 0.5 standard deviations), the TPR, and model size are presented, respectively.

Furthermore, Figure 2 shows the estimation and selection results in Scenario II (the equal correlation case). The IMSE values of the proposed fFASM method remain at a low level as the correlation ρ\rho increases in each subfigure of the first row, indicating robust estimation performance of fFASM, while those of MCP and grMCP tend to increase, demonstrating less satisfactory estimation performance. Although the two competitor methods show slightly better results when the functional covariates are weakly correlated with each other, this is expected since the equal correlation structure corresponds to a small number of factors, which may be captured easily. In terms of TPR, the fFASM method is again consistently close to 1 as the correlation varies, while the TPRs of the competitor methods drop dramatically. Also, when the number of functional covariates GG increases, the TPRs of all three methods drop, as expected. When looking at the model size, the proposed fFASM remains stable and may slightly overestimate the model size, while the competitor methods show much less stable model sizes.

Additionally, the selection frequencies of the two types of functional coefficients β(g)(t)\beta^{*(g)}(t) are examined, namely the Harmonic attenuation type (Type1) and the weak single signal type (Type2), displayed in Figures 3 and 4, respectively. Note that the number of true Type1 functional covariates is 4 and that of Type2 is 2 in both scenarios. On the one hand, Figure 3 shows that the selection frequency of the proposed fFASM method first decreases and then increases as the number of factors KK increases for both types of functional covariates, while those of MCP and grMCP tend to 0, especially in the Type2 setting where the two competitor methods start from lower values. Note that the selection frequencies of the proposed method for both types are not extraordinarily high, since the functional linear model somewhat lacks fit due to the complexity of the true model. On the other hand, in Scenario II, Figure 4 shows a much improved selection consistency of the proposed method over grMCP and MCP for both types, even though fewer Type2 covariates may still be selected because of a slight violation of the norm requirement in Theorem 2.

Figure 3: The selection frequencies of truly useful functional covariates in Scenario I with different KK and GG values. The first row displays the selection frequency that at least three out of four truly useful ones are selected for Type1, and the second displays the frequency that at least one out of two for Type2.
Figure 4: The selection frequencies of truly useful functional covariates in Scenario II with different ρ\rho and GG values. The first and second rows display the frequency that all truly useful ones are selected for Type1 and Type2, respectively.

5 Real Data Application

5.1 Effects of macroeconomic covariates on life expectancy

In this section, the effects of macroeconomic covariates on national life expectancy are explored for European and American countries using the open-sourced EPS data platform (https://www.epsnet.com.cn/). The data have been collected and documented annually over a span of 21 years, from 2000 to 2020, for 40 countries or regions. Our focus is the average life expectancy as a scalar dependent variable against 33 macroeconomic functional covariates, such as gross domestic product, healthcare expenditure, and educational attainment. The functional covariates are correlated with each other. For example, higher levels of educational attainment often correlate with higher GDP, since a more educated workforce usually contributes to greater economic productivity. To estimate life expectancy and identify useful functional covariates, the proposed fFASM method and the MCP and grMCP methods are employed. Specifically, to recover the functional trajectories, the yearly observation points from 2000 to 2020 are used as grid points. During the study, the collected data for 30 randomly selected countries (or regions) out of 40 are used as the training set to build the model, and the remaining 10 are used as the test set. The whole procedure is repeated 200 times; the average out-of-sample R2R^{2} and model size are reported in Table 1, and the top five selected functional covariates in Table 2.
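A sketch of the repeated random-split evaluation is given below; the exact definition of the out-of-sample R^{2} used here (one minus the ratio of the held-out residual sum of squares to the total sum of squares) is one common convention and an assumption on our part, and fit_fn is a placeholder standing in for any of the fFASM, MCP, or grMCP fitting routines. The same scheme applies to the housing application in Section 5.2 with n_train set to 20.

```python
import numpy as np

def out_of_sample_r2(y_test, y_pred):
    """One common convention: 1 - SSE/SST on the held-out observations."""
    sse = np.sum((y_test - y_pred) ** 2)
    sst = np.sum((y_test - np.mean(y_test)) ** 2)
    return 1.0 - sse / sst

def repeated_split_r2(X, y, fit_fn, n_train, n_rep=200, seed=0):
    """Average out-of-sample R^2 over n_rep random train/test splits;
    fit_fn(X_tr, y_tr) must return a callable predict(X_te) -> y_hat."""
    rng = np.random.default_rng(seed)
    n = len(y)
    scores = []
    for _ in range(n_rep):
        idx = rng.permutation(n)
        tr, te = idx[:n_train], idx[n_train:]
        predict = fit_fn(X[tr], y[tr])
        scores.append(out_of_sample_r2(y[te], predict(X[te])))
    return float(np.mean(scores))
```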

The results from both the prediction performance in Table 1 and the variable selection in Table 2 strongly support the benefits of the proposed fFASM. The average out-of-sample R2R^{2} from fFASM is the greatest among the three methods, indicating its competitive prediction accuracy. Further, the variables selected by fFASM are closely related to economic development and social well-being, providing insights into education and labor market conditions, which may have significant influence on the average life expectancy of a nation. Although all three methods select the key variable “Employment” with the highest selection frequencies, “Adult literacy rate” and “Total fertility rate” are both selected with relatively large frequencies by fFASM and MCP, while grMCP fails to select them frequently. Additionally, the fFASM method selects “Gross national income per capita” and “Unemployment rate”, which are only weakly correlated with other variables, while MCP selects “Labor force”, which is highly correlated with “Employment” and “Unemployment rate”. The grMCP method focuses mainly on “Employment” and largely ignores the other variables, whose selection rates are lower than 5%.

Method                             fFASM     MCP       grMCP
average out-of-sample R2R^{2}      0.288     0.223     0.227
average model size                 1.95      1.53      1.04
Table 1: The average out-of-sample R2R^{2} and model size for EPS data.
fFASM
  Rank  Name                                                      Frequency
  1     Employment (million people)                               160
  2     Adult literacy rate (age 15 and above) (%)                56
  3     Total fertility rate                                      54
  4     Unemployment rate (% of total labor force)                21
  5     Gross national income per capita (current USD)            17

MCP
  Rank  Name                                                      Frequency
  1     Employment (million people)                               166
  2     Total fertility rate                                      57
  3     Adult literacy rate (age 15 and above, %)                 37
  4     Labor force (people)                                      16
  5     Unemployment rate (% of total labor force)                7

grMCP
  Rank  Name                                                      Frequency
  1     Employment (million people)                               187
  2     Prevalence of undernourishment (% of total population)    7
  3     Adult literacy rate (age 15 and above) (%)                3
  4     Gross national income per capita (current USD)            2
  5     Per capita health expenditure (current USD)               1
Table 2: Top 5 variables with highest selection frequency.

5.2 Prediction of sales areas of commercial houses

In this section, we focus on the prediction of the annual average sales area of commercial houses in each province of China. Data are collected from the National Bureau of Statistics of China (https://data.stats.gov.cn/) on a monthly basis for 31 administrative regions of mainland China in the years 2022 and 2023. The dependent variable is the annual average sales area of commercial houses against 60 functional covariates, including cement, medium tractors, engines, aluminum materials, and other variables related to industrial output. To recover the functional trajectories, 24 grid points are predetermined on a monthly basis from 2022 to 2023. To predict the response, the proposed fFASM, the MCP and grMCP methods are employed. During the study, the collected data for 20 randomly selected provinces are used as the training set to build the model, and the remaining 11 as the test set to evaluate the prediction performance, with 200 repetitions of the whole procedure. The average out-of-sample R2R^{2} and the average model sizes are reported in Table 3, and the most frequently selected functional covariates in Table 4.

From Table 3, the proposed fFASM method is found to show better performance than MCP and grMCP in terms of a larger average out-of-sample R2R^{2} and a slightly lower average model size. In Table 4, all three methods select “Cement” as the most frequently selected functional covariate over the 200 repetitions, which is expected since it is the most important raw material required for house construction. Similarly, the functional covariate “Aluminum Materials” is selected among the top five. Specifically, the fFASM and MCP methods select the functional covariates “Medium Tractors” and “Engines”, which are closely related to construction machinery used in the house building industry, while grMCP selects “Mechanized Paper”, “Hydropower Generation” and “Phosphate Rock”, which seem to have limited connection with house construction. This may explain why grMCP has a larger model size but relatively worse average out-of-sample performance.

Method                             fFASM     MCP       grMCP
average out-of-sample R2R^{2}      0.612     0.601     0.355
average model size                 1.35      2.20      2.55
Table 3: The average out-of-sample R2R^{2} and model size for sales area of commercial houses data.
fFASM
  Rank  Name                                        Frequency
  1     Cement                                      198
  2     Medium Tractors                             13
  3     Engines                                     12
  3     Aluminum Materials                          12
  5     Chemical Pesticides (Active Ingredients)    4

MCP
  Rank  Name                                        Frequency
  1     Cement                                      199
  2     Medium Tractors                             48
  3     Aluminum Materials                          19
  4     Computers                                   18
  4     Engines                                     18

grMCP
  Rank  Name                                        Frequency
  1     Cement                                      137
  2     Aluminum Materials                          97
  3     Mechanized Paper                            68
  4     Hydropower Generation                       30
  5     Phosphate Rock                              20
Table 4: Top 5 variables with highest selection frequency.

6 Conclusion

In this article, a novel functional factor augmentation structure (fFAS) is proposed to capture associations among correlated functional processes, and a functional factor augmentation selection model (fFASM) is further developed to select useful functional covariates in high dimensions with correlated functional covariates. Not only is the correlation between functional covariates addressed without assuming an explicit covariance structure, but theoretical properties of the estimated functional factors are also established. We primarily discuss the rationale for constructing an fFAS, how to estimate the fFAS and its estimation error, and the impact of truncating functional data on the validity and estimation of the factor model. Due to the unique characteristics of functional data, the assumed factor model and the actually estimated factor model may differ.

Numerical investigations on both simulated and real datasets support the superior performance of the proposed fFASM method. It is found that the proposed method performs better than general functional variable selection methods when dealing with the variable selection problem of correlated multivariate functional covariates. A practical issue may be how to determine the model size, as our method tends to select slightly more functional covariates in simulation studies, which may be a trade-off for modeling functional processes with correlations in high dimensions.

References

  • Ahn & Horenstein (2013) Ahn, S. C. & Horenstein, A. R. (2013), ‘Eigenvalue ratio test for the number of factors’, Econometrica 81(3), 1203–1227.
  • Aneiros, Horová, Hušková & Vieu (2022) Aneiros, G., Horová, I., Hušková, M. & Vieu, P. (2022), ‘On functional data analysis and related topics’, Journal of Multivariate Analysis 189, 104861.
  • Aneiros, Novo & Vieu (2022) Aneiros, G., Novo, S. & Vieu, P. (2022), ‘Variable selection in functional regression models: A review’, Journal of Multivariate Analysis 188, 104871. 50th Anniversary Jubilee Edition.
    https://www.sciencedirect.com/science/article/pii/S0047259X21001494
  • Bai (2003) Bai, J. (2003), ‘Inferential theory for factor models of large dimensions’, Econometrica 71(1), 135–171.
  • Bai & Ng (2002) Bai, J. & Ng, S. (2002), ‘Determining the number of factors in approximate factor models’, Econometrica 70(1), 191–221.
  • Cardot et al. (2003) Cardot, H., Ferraty, F. & Sarda, P. (2003), ‘Spline estimators for the functional linear model’, Statistica Sinica pp. 571–591.
  • Castellanos et al. (2015) Castellanos, L., Vu, V. Q., Perel, S., Schwartz, A. B. & Kass, R. E. (2015), ‘A multivariate gaussian process factor model for hand shape during reach-to-grasp movements’, Statistica Sinica 25(1), 5.
  • Centofanti et al. (2021) Centofanti, F., Lepore, A., Menafoglio, A., Palumbo, B. & Vantini, S. (2021), ‘Functional regression control chart’, Technometrics 63(3), 281–294.
  • Chen et al. (2022) Chen, C., Guo, S. & Qiao, X. (2022), ‘Functional linear regression: dependence and error contamination’, Journal of Business & Economic Statistics 40(1), 444–457.
  • Chen et al. (2011) Chen, D., Hall, P. & Müller, H.-G. (2011), ‘Single and multiple index functional regression models with nonparametric link’.
  • Chen et al. (2021) Chen, L., Wang, W. & Wu, W. B. (2021), ‘Dynamic semiparametric factor model with structural breaks’, Journal of Business & Economic Statistics 39(3), 757–771.
  • Chiou & Müller (2016) Chiou, J.-M. & Müller, H.-G. (2016), ‘A pairwise interaction model for multivariate functional and longitudinal data’, Biometrika 103(2), 377–396.
  • Cuevas (2014) Cuevas, A. (2014), ‘A partial overview of the theory of statistics with functional data’, Journal of Statistical Planning and Inference 147, 1–23.
  • Fan et al. (2020) Fan, J., Ke, Y. & Wang, K. (2020), ‘Factor-adjusted regularized model selection’, Journal of Econometrics 216(1), 71–85.
  • Fan & Li (2001) Fan, J. & Li, R. (2001), ‘Variable selection via nonconcave penalized likelihood and its oracle properties’, Journal of the American statistical Association 96(456), 1348–1360.
  • Fan et al. (2013) Fan, J., Liao, Y. & Mincheva, M. (2013), ‘Large covariance estimation by thresholding principal orthogonal complements’, Journal of the Royal Statistical Society Series B: Statistical Methodology 75(4), 603–680.
  • Fang et al. (2015) Fang, L., Zhao, H., Wang, P., Yu, M., Yan, J., Cheng, W. & Chen, P. (2015), ‘Feature selection method based on mutual information and class separability for dimension reduction in multidimensional time series for clinical data’, Biomedical Signal Processing and Control 21, 82–89.
  • Ferraty (2006) Ferraty, F. (2006), Nonparametric functional data analysis, Springer.
  • Gonzalez-Vidal et al. (2019) Gonzalez-Vidal, A., Jimenez, F. & Gomez-Skarmeta, A. F. (2019), ‘A methodology for energy multivariate time series forecasting in smart buildings based on feature selection’, Energy and Buildings 196, 71–82.
  • Hall & Horowitz (2007) Hall, P. & Horowitz, J. L. (2007), ‘Methodology and convergence rates for functional linear regression’.
  • Hays et al. (2012) Hays, S., Shen, H. & Huang, J. Z. (2012), ‘Functional dynamic factor models with application to yield curve forecasting’, The Annals of Applied Statistics pp. 870–894.
  • Hörmann et al. (2015) Hörmann, S., Kidziński, Ł. & Hallin, M. (2015), ‘Dynamic functional principal components’, Journal of the Royal Statistical Society Series B: Statistical Methodology 77(2), 319–348.
  • Hörmann & Kokoszka (2010) Hörmann, S. & Kokoszka, P. (2010), ‘Weakly dependent functional data’.
  • Htun et al. (2023) Htun, H. H., Biehl, M. & Petkov, N. (2023), ‘Survey of feature selection and extraction techniques for stock market prediction’, Financial Innovation 9(1), 26.
  • Jiménez et al. (2020) Jiménez, F., Palma, J., Sánchez, G., Marín, D., Francisco Palacios, M. & Lucía López, M. (2020), ‘Feature selection based multivariate time series forecasting: An application to antibiotic resistance outbreaks prediction’, Artificial Intelligence in Medicine 104, 101818.
    https://www.sciencedirect.com/science/article/pii/S0933365719306608
  • Kong et al. (2016) Kong, D., Xue, K., Yao, F. & Zhang, H. H. (2016), ‘Partially functional linear regression in high dimensions’, Biometrika 103(1), 147–159.
  • Lee et al. (2015) Lee, J. D., Sun, Y. & Taylor, J. E. (2015), ‘On model selection consistency of regularized m-estimators’.
  • Li & Hsing (2007) Li, Y. & Hsing, T. (2007), ‘On rates of convergence in functional linear regression’, Journal of Multivariate Analysis 98(9), 1782–1804.
  • Matsui & Konishi (2011) Matsui, H. & Konishi, S. (2011), ‘Variable selection for functional regression models via the l1 regularization’, Computational Statistics & Data Analysis 55(12), 3304–3310.
  • Merlevède et al. (2011) Merlevède, F., Peligrad, M. & Rio, E. (2011), ‘A bernstein type inequality and moderate deviations for weakly dependent sequences’, Probability Theory and Related Fields 151, 435–474.
  • Müller & Yao (2008) Müller, H.-G. & Yao, F. (2008), ‘Functional additive models’, Journal of the American Statistical Association 103(484), 1534–1544.
  • Nti et al. (2019) Nti, K. O., Adekoya, A. & Weyori, B. (2019), ‘Random forest based feature selection of macroeconomic variables for stock market prediction’, American Journal of Applied Sciences 16(7), 200–212.
  • Peng et al. (2005) Peng, R. D., Dominici, F., Pastor-Barriuso, R., Zeger, S. L. & Samet, J. M. (2005), ‘Seasonal analyses of air pollution and mortality in 100 us cities’, American journal of epidemiology 161(6), 585–594.
  • Petersen (2024) Petersen, A. (2024), ‘Mean and covariance estimation for discretely observed high-dimensional functional data: Rates of convergence and division of observational regimes’, Journal of Multivariate Analysis p. 105355.
  • Ramsay & Silverman (2005) Ramsay, J. & Silverman, B. (2005), Functional Data Analysis, Springer Series in Statistics, Springer.
  • Shi (1997) Shi, L. (1997), ‘Local influence in principal components analysis’, Biometrika 84(1), 175–186.
  • Tibshirani (1996) Tibshirani, R. (1996), ‘Regression shrinkage and selection via the lasso’, Journal of the Royal Statistical Society Series B: Statistical Methodology 58(1), 267–288.
  • Yang & Ling (2024) Yang, S. & Ling, N. (2024), ‘Robust estimation of functional factor models with functional pairwise spatial signs’, Computational Statistics pp. 1–24.
  • Yao et al. (2005) Yao, F., Müller, H.-G. & Wang, J.-L. (2005), ‘Functional linear regression analysis for longitudinal data’, The Annals of Statistics 33(6), 2873 – 2903.
  • Yuan & Cai (2010) Yuan, M. & Cai, T. T. (2010), ‘A reproducing kernel hilbert space approach to functional linear regression’.
  • Zhang (2010) Zhang, C.-H. (2010), ‘Nearly unbiased variable selection under minimax concave penalty’, Annals of statistics 38(2), 894–942.

SUPPLEMENTARY MATERIAL

7 Regularity Condition

In this section, we introduce some regularity conditions (Kong et al. 2016). Without loss of generality, we assume that {X(g)(),g=1,,G}\{X^{(g)}(\cdot),g=1,\ldots,G\} have been centred to have zero mean. With Wi(g)(tij)=Xi(g)(tij)+ϵi(g)(tij)W_{i}^{(g)}(t_{ij})=X^{(g)}_{i}(t_{ij})+\epsilon_{i}^{(g)}(t_{ij}), for definiteness, we consider the local linear smoother for each set of subjects using bandwidths {hi(g),g=1,,G}\{h_{i}^{(g)},g=1,\ldots,G\}, and denote the smoothed trajectories by X^i(g)()\widehat{X}_{i}^{(g)}(\cdot).

Condition (B1) consists of regularity assumptions for functional data. Condition (B2) is standard for the local linear smoother, and (B3)–(B4) concern how the functional predictors are sampled and smoothed.

  • (B1)

    For g=1,,Gg=1,\ldots,G, for any C>0C>0 there exists an ϵ>0\epsilon>0 such that

    supt𝒯[𝔼{|X(g)(t)|C}]<,sups,t𝒯(𝔼[{|st|ϵ|X(g)(s)X(g)(t)|}C])<.\sup_{t\in\mathcal{T}}\left[\mathbb{E}\left\{\left|X^{(g)}(t)\right|^{C}\right\}\right]<\infty,\quad\sup_{s,t\in\mathcal{T}}\left(\mathbb{E}\left[\left\{|s-t|^{-\epsilon}\left|X^{(g)}(s)-X^{(g)}(t)\right|\right\}^{C}\right]\right)<\infty.

    For each integer r1,(τk(g))rE(ak(g))2rr\geq 1,\left(\tau_{k}^{(g)}\right)^{-r}E\left({a_{k}^{(g)}}\right)^{2r} is bounded uniformly in kk.

  • (B2)

    For g=1,,G,X(g)()g=1,\ldots,G,X^{(g)}(\cdot) is twice continuously differentiable on 𝒯\mathcal{T} with probability 1 , and 𝔼{X(g)(t)′′}4dt<\int\mathbb{E}\left\{X^{(g)}{}^{\prime\prime}(t)\right\}^{4}dt<\infty, where X(g)()′′X^{(g)}{}^{\prime\prime}(\cdot) denotes the second derivative of X(g)()X^{(g)}(\cdot). The following condition concerns the design on which Xi(g)()X_{i}^{(g)}(\cdot) is observed and the local linear smoother X^i(g)()\widehat{X}_{i}^{(g)}(\cdot). When a function is said to be smooth, we mean that it is continuously differentiable to an adequate order.

  • (B3)

    For g=1,,G,{tij(g),j=1,,ni(g)}g=1,\ldots,G,\left\{t_{ij}^{(g)},j=1,\ldots,n_{i}^{(g)}\right\} are considered deterministic and ordered increasingly for i=1,,ni=1,\ldots,n. There exist densities pi(g)p_{i}^{(g)} uniformly smooth over 𝒯\mathcal{T}, satisfying 𝒯pi(g)(t)𝑑t=1\int_{\mathcal{T}}p_{i}^{(g)}(t)dt=1 and 0<c1<infi{inftTpi(g)(t)}<supi{suptTpi(g)(t)}<c2<0<c_{1}<\inf_{i}\left\{\inf_{t\in T}p_{i}^{(g)}(t)\right\}<\sup_{i}\left\{\sup_{t\in T}p_{i}^{(g)}(t)\right\}<c_{2}<\infty that generate tij(g)t_{ij}^{(g)} according to tij(g)=Pi(g){j/(ni(g)+1)}1t_{ij}^{(g)}=P_{i}^{(g)}{}^{-1}\{j/\left(n_{i}^{(g)}+1\right)\}, where Pi(g)1P_{i}^{(g)}{}^{-1} is the inverse of Pi(g)(t)=tpi(g)(s)𝑑sP_{i}^{(g)}(t)=\int_{-\infty}^{t}p_{i}^{(g)}(s)ds. For each g=1,,Gg=1,\ldots,G, there exist a common sequence of bandwidths h(g)h^{(g)} such that 0<c1<0<c_{1}< infihi(g)/h(g)<supihi(g)/h(g)<c2<\inf_{i}h_{i}^{(g)}/h^{(g)}<\sup_{i}h_{i}^{(g)}/h^{(g)}<c_{2}<\infty, where hi(g)h_{i}^{(g)} is the bandwidth for X^i(g)\widehat{X}_{i}^{(g)}. The kernel density function is smooth and compactly supported.

Let 𝒯=[a0,b0],ti0(g)=a0,ti,ni(g)+1(g)=b0\mathcal{T}=\left[a_{0},b_{0}\right],t_{i0}^{(g)}=a_{0},t_{i,n_{i}^{(g)}+1}^{(g)}=b_{0}, let Δi(g)=sup{ti,j+1(g)ti,j(g),j=0,,ni(g)}\Delta_{i}^{(g)}=\sup\left\{t_{i,j+1}^{(g)}-t_{i,j}^{(g)},j=0,\ldots,n_{i}^{(g)}\right\} and n(g)=n(g)(n)=infi=1,,nni(g)n^{(g)}=n^{(g)}(n)=\inf_{i=1,\ldots,n}n_{i}^{(g)}. The condition below is to let the smooth estimate X^i(g)\widehat{X}_{i}^{(g)} serve as well as the true Xi(g)X_{i}^{(g)} in the asymptotic analysis, denoting 0<liman/bn<0<\lim a_{n}/b_{n}<\infty by anbna_{n}\sim b_{n}.

  • (B4)

    For g=1,,G,supiΔi(g)=O((n(g))1),h(g)(n(g))1/5,n(g)n5/4g=1,\ldots,G,~{}\sup_{i}\Delta_{i}^{(g)}=O\left(\left(n^{(g)}\right)^{-1}\right),~{}h^{(g)}\sim\left(n^{(g)}\right)^{-1/5},n^{(g)}n^{-5/4}\rightarrow\infty.
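Conditions (B2)-(B4) concern the local linear smoother \widehat{X}_{i}^{(g)}(\cdot) built from the discretely observed, noise-contaminated trajectories W_{i}^{(g)}(t_{ij}). A minimal Python sketch of such a smoother is given below, assuming an Epanechnikov kernel and a single bandwidth h; the kernel, the bandwidth, and all names are illustrative choices rather than those used in the paper.

```python
import numpy as np

def local_linear_smoother(t_obs, w_obs, t_grid, h):
    """Local linear smoother of noisy observations (t_obs, w_obs), evaluated at
    t_grid with bandwidth h and an Epanechnikov kernel (compactly supported)."""
    x_hat = np.empty(len(t_grid))
    for j, t0 in enumerate(t_grid):
        d = t_obs - t0
        k = np.maximum(0.0, 0.75 * (1.0 - (d / h) ** 2))     # kernel weights
        s0, s1, s2 = k.sum(), (k * d).sum(), (k * d ** 2).sum()
        w = k * (s2 - d * s1)                                 # local linear weights
        x_hat[j] = (w * w_obs).sum() / max(s0 * s2 - s1 ** 2, 1e-12)
    return x_hat

# usage: recover a smooth trajectory from noisy discrete observations
rng = np.random.default_rng(0)
t_obs = np.sort(rng.uniform(0, 1, 60))
w_obs = np.sin(2 * np.pi * t_obs) + rng.normal(0, 0.5, 60)
x_hat = local_linear_smoother(t_obs, w_obs, np.linspace(0, 1, 51), h=0.1)
```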

8 Proof of Theorem

Proof of Lemma 3:

Proof: ΛC,K=diag{λ1(C),,λK(C)}\Lambda_{C,K}=\operatorname{diag}\{\lambda_{1}^{(C)},\ldots,\lambda_{K}^{(C)}\} consists of the first KK eigenvalues of C𝒂(ω)C_{\bm{a}}(\omega), ΛB=diag{λ1(B),,λK(B)}\Lambda_{B}=\operatorname{diag}\{\lambda_{1}^{(B)},\ldots,\lambda_{K}^{(B)}\} consists of the first KK eigenvalues of BBBB^{\top}, and the ii-th eigenvalue of Λu\Lambda_{u} is denoted by λi(u)\lambda_{i}^{(u)}.

𝒇i(ω)𝒇i=\displaystyle{\bm{f}}_{i}^{\top}(\omega)-\bm{f}_{i}^{\top}= 𝒂i(𝝃1(C)(ω),,𝝃K(C)(ω))ΛC,K1/2(ω)(𝒂i𝒖i)𝑩ΛB1\displaystyle\bm{a}_{i}^{\top}(\bm{\xi}_{1}^{(C)}(\omega),\ldots,\bm{\xi}_{K}^{(C)}(\omega))\Lambda_{C,K}^{-1/2}(\omega)-(\bm{a}_{i}^{\top}-\bm{u}_{i}^{\top})\bm{B}\Lambda_{B}^{-1}
=\displaystyle= 𝒂i(𝝃1(C)(ω),,𝝃K(C)(ω))ΛC,K1/2(ω)(𝒂i𝒖i)(𝝃1(B),,𝝃K(B))ΛB1/2\displaystyle\bm{a}_{i}^{\top}(\bm{\xi}_{1}^{(C)}(\omega),\ldots,\bm{\xi}_{K}^{(C)}(\omega))\Lambda_{C,K}^{-1/2}(\omega)-(\bm{a}_{i}^{\top}-\bm{u}_{i}^{\top})(\bm{\xi}_{1}^{(B)},\ldots,\bm{\xi}_{K}^{(B)})\Lambda_{B}^{-1/2}
=\displaystyle= 𝒂i(𝝃1(B),,𝝃K(B))(ΛC,K1/2(ω)ΛB1/2)𝒂i((𝐁𝐁λk(B)I)+Λ𝒖𝝃k(B)ω)k=1,,KΛC,K1/2\displaystyle\bm{a}_{i}^{\top}(\bm{\xi}_{1}^{(B)},\ldots,\bm{\xi}_{K}^{(B)})(\Lambda_{C,K}^{-1/2}(\omega)-\Lambda_{B}^{-1/2})-\bm{a}_{i}^{\top}((\mathbf{B}\mathbf{B}^{\top}-\lambda_{k}^{(B)}I)^{+}\Lambda_{\bm{u}}\bm{\xi}_{k}^{(B)}\omega)_{k=1,\ldots,K}\Lambda_{C,K}^{-1/2}
+𝒖i𝐁ΛB1+𝒂iO(ω2).\displaystyle+\bm{u}_{i}^{\top}\mathbf{B}\Lambda_{B}^{-1}+\bm{a}_{i}^{\top}O(\omega^{2}).

Then we bound each part separately. For the first part, since λi(C)=λi(B)+(𝝃k(B)Λ𝒖𝝃k(B))ω+O(ω2)\lambda_{i}^{(C)}=\lambda_{i}^{(B)}+(\bm{\xi}^{(B)}_{k}{}^{\top}\Lambda_{\bm{u}}\bm{\xi}^{(B)}_{k})\omega+O(\omega^{2}), we have

1λi(C)1λi(B)\displaystyle\frac{1}{\sqrt{\lambda_{i}^{(C)}}}-\frac{1}{\sqrt{\lambda_{i}^{(B)}}} =λi(B)λi(C)λi(B)λi(C)\displaystyle=\frac{\sqrt{\lambda_{i}^{(B)}}-\sqrt{\lambda_{i}^{(C)}}}{\sqrt{\lambda_{i}^{(B)}\lambda_{i}^{(C)}}}
=λi(B)λi(C)λi(B)λi(C)(λi(B)+λi(C))\displaystyle=\frac{{\lambda_{i}^{(B)}}-{\lambda_{i}^{(C)}}}{\sqrt{\lambda_{i}^{(B)}\lambda_{i}^{(C)}}\left(\sqrt{\lambda_{i}^{(B)}}+\sqrt{\lambda_{i}^{(C)}}\right)}
=((𝝃k(B)Λ𝒖𝝃k(B))ω+O(ω2))λi(B)λi(C)(λi(B)+λi(C))\displaystyle=\frac{-\left((\bm{\xi}^{(B)}_{k}{}^{\top}\Lambda_{\bm{u}}\bm{\xi}^{(B)}_{k})\omega+O(\omega^{2})\right)}{\sqrt{\lambda_{i}^{(B)}\lambda_{i}^{(C)}}\left(\sqrt{\lambda_{i}^{(B)}}+\sqrt{\lambda_{i}^{(C)}}\right)}
𝒂i(𝝃1(B),,𝝃K(B))(ΛC,K1/2(ω)ΛB1/2)22\displaystyle\|\bm{a}_{i}^{\top}(\bm{\xi}_{1}^{(B)},\ldots,\bm{\xi}_{K}^{(B)})(\Lambda_{C,K}^{-1/2}(\omega)-\Lambda_{B}^{-1/2})\|_{2}^{2} |ΛC,K1/2(ω)ΛB1/2|max2𝒂i22\displaystyle\leq\||\Lambda_{C,K}^{-1/2}(\omega)-\Lambda_{B}^{-1/2}|\|_{\mathop{\rm max}}^{2}\|\bm{a}_{i}\|_{2}^{2}
Cs12ω2𝒂i22+O(ω3)𝒂i22\displaystyle\leq C_{s_{1}}^{2}\omega^{2}\|\bm{a}_{i}\|_{2}^{2}+O(\omega^{3})\|\bm{a}_{i}\|_{2}^{2}

where Cs1=λmax(Λ𝒖)miniK3/2(λi(C),λi(B))C_{s_{1}}=\frac{\lambda_{\mathop{\rm max}}(\Lambda_{\bm{u}})}{\min_{i\leq K}^{3/2}(\lambda_{i}^{(C)},\lambda_{i}^{(B)})} is a constant. For the second part,

|𝒂i(𝐁𝐁λk(B)I)+Λ𝒖𝝃k(B)ω|2\displaystyle|\bm{a}_{i}^{\top}(\mathbf{B}\mathbf{B}^{\top}-\lambda_{k}^{(B)}I)^{+}\Lambda_{\bm{u}}\bm{\xi}_{k}^{(B)}\omega|^{2} (λmax𝒖mini<jK(λi(B)λj(B)))2𝒂i22ω2\displaystyle\leq(\frac{\lambda_{\mathop{\rm max}}^{\bm{u}}}{\min_{i<j\leq K}(\lambda_{i}^{(B)}-\lambda_{j}^{(B)})})^{2}\|\bm{a}_{i}\|_{2}^{2}\omega^{2}
𝒂i((𝐁𝐁λk(B)I)+Λ𝒖𝝃k(B)ω)k=1,,KΛC1/222\displaystyle\|\bm{a}_{i}^{\top}((\mathbf{B}\mathbf{B}^{\top}-\lambda_{k}^{(B)}I)^{+}\Lambda_{\bm{u}}\bm{\xi}_{k}^{(B)}\omega)_{k=1,\ldots,K}{\Lambda^{C}}^{-1/2}\|_{2}^{2} 1λmin(C)(ω)(λmax𝒖mini<jK(λi(B)λj(B)))2𝒂i22ω2.\displaystyle\leq\frac{1}{\lambda^{(C)}_{\min}(\omega)}(\frac{\lambda_{\mathop{\rm max}}^{\bm{u}}}{\min_{i<j\leq K}(\lambda_{i}^{(B)}-\lambda_{j}^{(B)})})^{2}\|\bm{a}_{i}\|_{2}^{2}\omega^{2}.

For the third part, with Cov(𝒖i)=ωΛu\operatorname{Cov}(\bm{u}_{i})=\omega\Lambda_{u}, we have

𝒖i𝐁ΛB122𝒖i22λmin(B)OP(ω).\displaystyle\|\bm{u}_{i}^{\top}\mathbf{B}\Lambda_{B}^{-1}\|_{2}^{2}\leq\frac{\|\bm{u}_{i}\|_{2}^{2}}{\lambda_{\min}^{(B)}}\leq O_{P}(\omega).

To sum up, we obtain that

𝒇i(ω)𝒇i22=OP(ω2𝒂i22+𝒖i𝐁ΛB122)\|{\bm{f}}_{i}^{\top}(\omega)-\bm{f}_{i}^{\top}\|_{2}^{2}=O_{P}(\omega^{2}\|\bm{a}_{i}\|_{2}^{2}+\|\bm{u}_{i}^{\top}\mathbf{B}\Lambda_{B}^{-1}\|_{2}^{2})

For 𝒖i\bm{u}_{i}, with

𝑭\displaystyle\bm{F} =(𝑨𝑼)𝐁ΛB,1\displaystyle=(\bm{A}-\bm{U})\mathbf{B}\Lambda^{B}{}^{-1},

we have

𝑼\displaystyle\bm{U} =𝑨(𝑨𝑼)BΛB𝐁1\displaystyle=\bm{A}-(\bm{A}-\bm{U})B\Lambda^{B}{}^{-1}\mathbf{B}^{\top}
=𝑨(𝑨𝑼)j=1K𝝃j𝝃j(B)(B).\displaystyle=\bm{A}-(\bm{A}-\bm{U})\sum_{j=1}^{K}\bm{\xi}_{j}{}^{(B)}\bm{\xi}_{j}^{(B)\top}.

It follows that

𝑼j=K+1p𝝃j𝝃j(B)(B)=𝑨j=K+1p𝝃j𝝃j(B)(B).\displaystyle\bm{U}\sum_{j=K+1}^{p}\bm{\xi}_{j}{}^{(B)}\bm{\xi}_{j}^{(B)\top}=\bm{A}\sum_{j=K+1}^{p}\bm{\xi}_{j}{}^{(B)}\bm{\xi}_{j}^{(B)\top}.
𝒖i(ω)𝒖i=\displaystyle{\bm{u}_{i}}^{\top}(\omega)-\bm{u}_{i}^{\top}= 𝒂ij=K+1p𝝃j(C)(ω)𝝃j(C)(ω)𝒂ij=K+1p𝝃j(B)𝝃j(B)𝒖ij=1K𝝃j(B)𝝃j(B).\displaystyle\bm{a}_{i}^{\top}\sum_{j=K+1}^{p}\bm{\xi}_{j}^{(C)}(\omega)\bm{\xi}_{j}^{(C)}(\omega)^{\top}-\bm{a}_{i}^{\top}\sum_{j=K+1}^{p}\bm{\xi}_{j}^{(B)}\bm{\xi}_{j}^{(B)}{}^{\top}-\bm{u}_{i}^{\top}\sum_{j=1}^{K}\bm{\xi}_{j}^{(B)}\bm{\xi}_{j}^{(B)}{}^{\top}.

Here,

𝒂ij=K+1p𝝃j(C)(ω)𝝃j(C)(ω)𝒂ij=K+1p𝝃j(B)𝝃j(B)\displaystyle\bm{a}_{i}^{\top}\sum_{j=K+1}^{p}\bm{\xi}_{j}^{(C)}(\omega)\bm{\xi}_{j}^{(C)}(\omega)^{\top}-\bm{a}_{i}^{\top}\sum_{j=K+1}^{p}\bm{\xi}_{j}^{(B)}\bm{\xi}_{j}^{(B)}{}^{\top}
=\displaystyle= 𝒂i(𝑰pj=1K𝝃j(C)(ω)𝝃j(C)(ω))𝒂i(𝑰pj=1K𝝃j(B)𝝃j(B))\displaystyle\bm{a}_{i}^{\top}\left(\bm{I}_{p}-\sum_{j=1}^{K}\bm{\xi}_{j}^{(C)}(\omega)\bm{\xi}_{j}^{(C)}(\omega)^{\top}\right)-\bm{a}_{i}^{\top}\left(\bm{I}_{p}-\sum_{j=1}^{K}\bm{\xi}_{j}^{(B)}\bm{\xi}_{j}^{(B)}{}^{\top}\right)
=\displaystyle= 𝒂ij=1K𝝃j(C)(ω)[𝝃j(C)(ω)𝝃j(B)]+𝒂ij=1K[𝝃j(B)𝝃j(C)(ω)]𝝃j(B)\displaystyle-\bm{a}_{i}^{\top}\sum_{j=1}^{K}\bm{\xi}_{j}^{(C)}(\omega)[\bm{\xi}_{j}^{(C)}(\omega)-\bm{\xi}_{j}^{(B)}]^{\top}+\bm{a}_{i}^{\top}\sum_{j=1}^{K}[\bm{\xi}_{j}^{(B)}-\bm{\xi}_{j}^{(C)}(\omega)]\bm{\xi}_{j}^{(B)}{}^{\top}
=\displaystyle= 𝒂ij=1K[𝝃j(C)(ω)𝝃j(B)][𝝃j(C)(ω)𝝃j(B)]𝒂ij=1K𝝃j(B)[𝝃j(C)(ω)𝝃j(B)]\displaystyle-\bm{a}_{i}^{\top}\sum_{j=1}^{K}[\bm{\xi}_{j}^{(C)}(\omega)-\bm{\xi}_{j}^{(B)}][\bm{\xi}_{j}^{(C)}(\omega)-\bm{\xi}_{j}^{(B)}]^{\top}-\bm{a}_{i}^{\top}\sum_{j=1}^{K}\bm{\xi}_{j}^{(B)}[\bm{\xi}_{j}^{(C)}(\omega)-\bm{\xi}_{j}^{(B)}]^{\top}
+𝒂ij=1K[𝝃j(B)𝝃j(C)(ω)]𝝃j(B)\displaystyle+\bm{a}_{i}^{\top}\sum_{j=1}^{K}[\bm{\xi}_{j}^{(B)}-\bm{\xi}_{j}^{(C)}(\omega)]\bm{\xi}_{j}^{(B)}{}^{\top}
=\displaystyle= ω2𝒂ij=1K(𝐁𝐁λj(B)I)+Λu𝝃j(B)𝝃j(B)Λu(𝐁𝐁λj(B)I)++𝒂iO(ω3)\displaystyle-\omega^{2}\bm{a}_{i}^{\top}\sum_{j=1}^{K}(\mathbf{B}\mathbf{B}^{\top}-\lambda_{j}^{(B)}I)^{+}\Lambda_{u}\bm{\xi}_{j}^{(B)}\bm{\xi}_{j}^{(B)\top}\Lambda_{u}(\mathbf{B}\mathbf{B}^{\top}-\lambda_{j}^{(B)}I)^{+}+\bm{a}_{i}^{\top}O(\omega^{3})
+ω𝒂ij=1K𝝃j(B)𝝃j(B)(𝐁𝐁λj(B)I)+Λu+ω𝒂ij=1K(𝐁𝐁λj(B)I)+Λu𝝃j(B)𝝃j(B)\displaystyle+\omega\bm{a}_{i}^{\top}\sum_{j=1}^{K}\bm{\xi}_{j}^{(B)}\bm{\xi}_{j}^{(B)}{}^{\top}(\mathbf{B}\mathbf{B}^{\top}-\lambda_{j}^{(B)}I)^{+}\Lambda_{u}+\omega\bm{a}_{i}^{\top}\sum_{j=1}^{K}(\mathbf{B}\mathbf{B}^{\top}-\lambda_{j}^{(B)}I)^{+}\Lambda_{u}\bm{\xi}_{j}^{(B)}\bm{\xi}_{j}^{(B)\top}
+𝒂iO(ω2).\displaystyle+\bm{a}_{i}^{\top}O(\omega^{2}).

Then we give the bound separately:

𝒂ij=1K(𝐁𝐁λj(B)I)+Λu𝝃j(B)𝝃j(B)Λu(𝐁𝐁λj(B)I)+22\displaystyle\|\bm{a}_{i}^{\top}\sum_{j=1}^{K}(\mathbf{B}\mathbf{B}^{\top}-\lambda_{j}^{(B)}I)^{+}\Lambda_{u}\bm{\xi}_{j}^{(B)}\bm{\xi}_{j}^{(B)\top}\Lambda_{u}(\mathbf{B}\mathbf{B}^{\top}-\lambda_{j}^{(B)}I)^{+}\|_{2}^{2}
j=1K(𝐁𝐁λj(B)I)+Λu𝝃j(B)𝝃j(B)Λu(𝐁𝐁λj(B)I)+22𝒂i22\displaystyle\leq\|\sum_{j=1}^{K}(\mathbf{B}\mathbf{B}^{\top}-\lambda_{j}^{(B)}I)^{+}\Lambda_{u}\bm{\xi}_{j}^{(B)}\bm{\xi}_{j}^{(B)\top}\Lambda_{u}(\mathbf{B}\mathbf{B}^{\top}-\lambda_{j}^{(B)}I)^{+}\|_{2}^{2}\|\bm{a}_{i}\|_{2}^{2}
(λmaxu)4j=1K(𝐁𝐁λj(B)I)+𝝃j(B)𝝃j(B)(𝐁𝐁λj(B)I)+22𝒂i22\displaystyle\leq(\lambda^{u}_{max})^{4}\|\sum_{j=1}^{K}(\mathbf{B}\mathbf{B}^{\top}-\lambda_{j}^{(B)}I)^{+}\bm{\xi}_{j}^{(B)}\bm{\xi}_{j}^{(B)\top}(\mathbf{B}\mathbf{B}^{\top}-\lambda_{j}^{(B)}I)^{+}\|_{2}^{2}\|\bm{a}_{i}\|_{2}^{2}
(pK)(λmax𝒖mini<jK(λi(B)λj(B)))4𝒂i22.\displaystyle\leq(p-K)(\frac{\lambda_{\mathop{\rm max}}^{\bm{u}}}{\min_{i<j\leq K}(\lambda_{i}^{(B)}-\lambda_{j}^{(B)})})^{4}\|\bm{a}_{i}\|_{2}^{2}.
𝒂ij=1K𝝃j(B)𝝃j(B)(𝐁𝐁λj(B)I)+Λu22(m1+m2r)(λmax𝒖mini<jK(λi(B)λj(B)))2𝒂i22\displaystyle\|\bm{a}_{i}^{\top}\sum_{j=1}^{K}\bm{\xi}_{j}^{(B)}\bm{\xi}_{j}^{(B)}{}^{\top}(\mathbf{B}\mathbf{B}^{\top}-\lambda_{j}^{(B)}I)^{+}\Lambda_{u}\|_{2}^{2}\leq(m_{1}+m_{2}-r)(\frac{\lambda_{\mathop{\rm max}}^{\bm{u}}}{\min_{i<j\leq K}(\lambda_{i}^{(B)}-\lambda_{j}^{(B)})})^{2}\|\bm{a}_{i}\|_{2}^{2}

To sum up, after removing the constants, we obtain that

𝒖i(ω)𝒖i22=OP(ω2𝒂i22+𝒖i𝐁ΛB1𝐁22)\|{\bm{u}}_{i}^{\top}(\omega)-\bm{u}_{i}^{\top}\|_{2}^{2}=O_{P}(\omega^{2}\|\bm{a}_{i}\|_{2}^{2}+\|\bm{u}_{i}^{\top}\mathbf{B}\Lambda_{B}^{-1}\mathbf{B}^{\top}\|_{2}^{2})
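The argument above rests on the first-order perturbation expansion \lambda_{i}^{(C)}=\lambda_{i}^{(B)}+(\bm{\xi}^{(B)}_{i}{}^{\top}\Lambda_{\bm{u}}\bm{\xi}^{(B)}_{i})\omega+O(\omega^{2}) of the eigenvalues of C_{\bm{a}}(\omega)=\bm{B}\bm{B}^{\top}+\omega\Lambda_{\bm{u}}. The following Python sketch is an informal numerical check of this expansion; the dimensions and all names are illustrative, and the reported deviation should shrink at the rate \omega^{2}.

```python
import numpy as np

rng = np.random.default_rng(1)
p, K = 15, 3
B = rng.standard_normal((p, K))
Lam_u = np.diag(rng.uniform(0.5, 1.5, size=p))

# leading K eigenpairs of B B^T (the unperturbed "signal" part)
lam, xi = np.linalg.eigh(B @ B.T)                       # ascending order
lam_B, xi_B = lam[::-1][:K], xi[:, ::-1][:, :K]

for omega in (1e-1, 1e-2, 1e-3):
    lam_C = np.linalg.eigvalsh(B @ B.T + omega * Lam_u)[::-1][:K]
    correction = np.einsum('jk,jl,lk->k', xi_B, Lam_u, xi_B)   # xi_k' Lam_u xi_k
    err = np.max(np.abs(lam_C - (lam_B + omega * correction)))
    print(f"omega = {omega:g}, max deviation from first-order expansion = {err:.2e}")
```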

Proof of Lemma 4:

Proof:

Under (A1)–(A2) and (B1)–(B4), using Lemma 2(a) in Kong et al. (2016), for i=1,,n;k1,k2=1,,sn;j=1,,Gi=1,\ldots,n;k_{1},k_{2}=1,\ldots,s_{n};j=1,\ldots,G,

𝑨^𝑨^\displaystyle\hat{\bm{A}}^{\top}\hat{\bm{A}} =(i=1n𝒂i𝒂i)Gsn×Gsn=i=1n(𝒂i(1)𝒂i(1)𝒂i(1)𝒂i(G)𝒂i(G)𝒂i(1)𝒂i(G)𝒂i(G))\displaystyle=(\sum_{i=1}^{n}\bm{a}_{i}\bm{a}_{i}^{\top})_{Gs_{n}\times Gs_{n}}=\sum_{i=1}^{n}\left(\begin{array}[]{ccc}\bm{a}_{i}^{(1)}\bm{a}_{i}^{(1)\top}&\ldots&\bm{a}_{i}^{(1)}\bm{a}_{i}^{(G)\top}\\ \ldots&\ldots&\ldots\\ \bm{a}_{i}^{(G)}\bm{a}_{i}^{(1)\top}&\ldots&\bm{a}_{i}^{(G)}\bm{a}_{i}^{(G)\top}\end{array}\right)
=i=1n((aik1(1)aik2(1))sn×sn(aik1(1)aik2(G))sn×sn(aik1(G)aik2(1))sn×sn(aik1(G)aik2(G))sn×sn)\displaystyle=\sum_{i=1}^{n}\left(\begin{array}[]{ccc}(a_{ik_{1}}^{(1)}a_{ik_{2}}^{(1)})_{s_{n}\times s_{n}}&\ldots&(a_{ik_{1}}^{(1)}a_{ik_{2}}^{(G)})_{s_{n}\times s_{n}}\\ \ldots&\ldots&\ldots\\ (a_{ik_{1}}^{(G)}a_{ik_{2}}^{(1)})_{s_{n}\times s_{n}}&\ldots&(a_{ik_{1}}^{(G)}a_{ik_{2}}^{(G)})_{s_{n}\times s_{n}}\end{array}\right)
=((i=1naik1(1)aik2(1))sn×sn(i=1naik1(1)aik2(G))sn×sn(i=1naik1(G)aik2(1))sn×sn(i=1naik1(G)aik2(G))sn×sn),\displaystyle=\left(\begin{array}[]{ccc}(\sum_{i=1}^{n}a_{ik_{1}}^{(1)}a_{ik_{2}}^{(1)})_{s_{n}\times s_{n}}&\ldots&(\sum_{i=1}^{n}a_{ik_{1}}^{(1)}a_{ik_{2}}^{(G)})_{s_{n}\times s_{n}}\\ \ldots&\ldots&\ldots\\ (\sum_{i=1}^{n}a_{ik_{1}}^{(G)}a_{ik_{2}}^{(1)})_{s_{n}\times s_{n}}&\ldots&(\sum_{i=1}^{n}a_{ik_{1}}^{(G)}a_{ik_{2}}^{(G)})_{s_{n}\times s_{n}}\end{array}\right),

then the analysis of {(𝑨𝑨)ijE(𝑨𝑨)ij}/n\{(\bm{A}^{\top}\bm{A})_{ij}-E(\bm{A}^{\top}\bm{A})_{ij}\}/n is equivalent to analyzing the following.

n1i=1n[a^ik1(j)a^ik2(j)E(aik1(j)aik2(j))]\displaystyle n^{-1}\sum_{i=1}^{n}[\hat{a}_{ik_{1}}^{(j)}\hat{a}_{ik_{2}}^{(j)}-E(a_{ik_{1}}^{(j)}a_{ik_{2}}^{(j)})] =n1i=1n[(a^ik1(j)a^ik2(j)aik1(j)aik2(j))+(aik1(j)aik2(j)E(aik1(j)aik2(j)))]\displaystyle=n^{-1}\sum_{i=1}^{n}[(\hat{a}_{ik_{1}}^{(j)}\hat{a}_{ik_{2}}^{(j)}-a_{ik_{1}}^{(j)}a_{ik_{2}}^{(j)})+(a_{ik_{1}}^{(j)}a_{ik_{2}}^{(j)}-E(a_{ik_{1}}^{(j)}a_{ik_{2}}^{(j)}))]
=Op(k1a/2+1n1/2+k2a/2+1n1/2+n1/2)\displaystyle=O_{p}(k_{1}^{a/2+1}n^{-1/2}+k_{2}^{a/2+1}n^{-1/2}+n^{-1/2})
Op(sna/2+1n1/2)\displaystyle\leq O_{p}(s_{n}^{a/2+1}n^{-1/2})

Proof of Theorem 1

Proof: We view the estimation error as a perturbation

1n𝑨𝑨=𝔼(𝒂i𝒂i)+1nΛe+O(1n)\displaystyle\frac{1}{n}\bm{A}^{\top}\bm{A}=\mathbb{E}(\bm{a}_{i}\bm{a}_{i}^{\top})+\frac{1}{\sqrt{n}}\Lambda_{e}+O(\frac{1}{n})

where Λe/n+O(1/n)\Lambda_{e}/\sqrt{n}+O(1/n) represents the estimation error, and the equation holds with probability tending to one. Λe\Lambda_{e} is a bounded random matrix since 𝑨\bm{A} is bounded. Consider

𝒇^i𝒇i(ω)=\displaystyle\widehat{\bm{f}}_{i}^{\top}-\bm{f}_{i}^{\top}(\omega)= 𝒂^i(𝝃^1,,𝝃^K)Λ^K1/2𝒂i(𝝃1(C)(ω),,𝝃K(C)(ω))ΛC,K1/2(ω),\displaystyle\widehat{\bm{a}}_{i}^{\top}(\widehat{\bm{\xi}}_{1},\ldots,\widehat{\bm{\xi}}_{K}){\widehat{\Lambda}_{K}}^{-1/2}-\bm{a}_{i}^{\top}(\bm{\xi}_{1}^{(C)}(\omega),\ldots,\bm{\xi}_{K}^{(C)}(\omega))\Lambda_{C,K}^{-1/2}(\omega),
𝒖^i𝒖i(ω)=\displaystyle\widehat{\bm{u}}_{i}^{\top}-\bm{u}_{i}^{\top}(\omega)= 𝒂^ij=K+1p𝝃^j𝝃^j𝒂ij=K+1p𝝃j(C)(ω)𝝃j(C)(ω).\displaystyle\widehat{\bm{a}}_{i}^{\top}\sum_{j=K+1}^{p}\widehat{\bm{\xi}}_{j}\widehat{\bm{\xi}}_{j}^{\top}-\bm{a}_{i}^{\top}\sum_{j=K+1}^{p}\bm{\xi}_{j}^{(C)}(\omega)\bm{\xi}_{j}^{(C)}(\omega){}^{\top}.

Following the steps in the proof of Lemma 3, we have

𝒇^i𝒇i(ω)22\displaystyle\|\widehat{\bm{f}}_{i}^{\top}-\bm{f}_{i}^{\top}(\omega)\|_{2}^{2} OP(1n𝒂i22)\displaystyle\leq O_{P}(\frac{1}{n}\|\bm{a}_{i}\|_{2}^{2})
𝒖^i𝒖i(ω)22\displaystyle\|\widehat{\bm{u}}_{i}^{\top}-\bm{u}_{i}^{\top}(\omega)\|_{2}^{2} OP(1n𝒂i22)\displaystyle\leq O_{P}(\frac{1}{n}\|\bm{a}_{i}\|_{2}^{2})

Proof of Theorem 2

Proof: Recall that

𝜽^=(𝐇^1γ^)=argmin𝜽{Ln(𝐲,𝐖^𝜽)+λ𝜽[p1]1}.\widehat{\boldsymbol{\theta}}=\left(\begin{array}[]{c}\widehat{\mathbf{H}}_{1}\\ \widehat{\mathbf{\gamma}}\end{array}\right)=\operatorname{argmin}_{\boldsymbol{\theta}}\left\{L_{n}(\mathbf{y},\widehat{\mathbf{W}}\boldsymbol{\theta})+\lambda\left\|\boldsymbol{\theta}_{[p_{1}]}\right\|_{1}\right\}. (23)

Also, Assumption 5 tells us 𝐕0\mathbf{V}_{0} is nonsingular and so is 𝐕=(𝐈p1𝟎p1×K𝟎K×p1𝐕0)\mathbf{V}=\left(\begin{array}[]{cc}\mathbf{I}_{p_{1}}&\mathbf{0}_{p_{1}\times K}\\ \mathbf{0}_{K\times p_{1}}&\mathbf{V}_{0}\end{array}\right).

Define 𝐖¯=(1,𝒖¯t,𝒇¯t)=𝐖^𝐕=(1,𝒖^t,𝒇^t)𝐕\overline{\mathbf{W}}=\left(1,\bar{\bm{u}}_{t}^{\top},\bar{\bm{f}}_{t}^{\top}\right)=\widehat{\mathbf{W}}\mathbf{V}=\left(1,\widehat{\bm{u}}_{t}^{\top},\widehat{\bm{f}}_{t}^{\top}\right)\cdot\mathbf{V}, 𝜽¯=𝐕1𝜽^\bar{\boldsymbol{\theta}}=\mathbf{V}^{-1}\widehat{\boldsymbol{\theta}}, 𝐁^0=(𝟎K,𝐁^),𝜽^=(𝐇1𝐁^0𝐇1)\widehat{\mathbf{B}}_{0}=\left(\mathbf{0}_{K},\widehat{\mathbf{B}}^{\top}\right)^{\top},\widehat{\boldsymbol{\theta}}^{*}=\left(\begin{array}[]{c}\mathbf{H}_{1}^{*}\\ \widehat{\mathbf{B}}_{0}^{\top}\mathbf{H}_{1}^{*}\end{array}\right) and 𝜽¯=𝐕1𝜽^\bar{\boldsymbol{\theta}}^{*}=\mathbf{V}^{-1}\widehat{\boldsymbol{\theta}}^{*}. We easily see that 𝐇^=𝜽^[p1]=𝜽¯[p1]\widehat{\mathbf{H}}=\widehat{\boldsymbol{\theta}}_{[p_{1}]}=\bar{\boldsymbol{\theta}}_{[p_{1}]} and

𝜽¯=argmin𝜽{Ln(𝐲,𝐖¯𝜽)+λ𝜽[p1]1}.\bar{\boldsymbol{\theta}}=\operatorname{argmin}_{\boldsymbol{\theta}}\left\{L_{n}(\mathbf{y},\overline{\mathbf{W}}\boldsymbol{\theta})+\lambda\left\|\boldsymbol{\theta}_{[p_{1}]}\right\|_{1}\right\}.

Then it follows that supp(𝐇^1)=supp(𝜽¯[p1])\operatorname{supp}(\widehat{\mathbf{H}}_{1})=\operatorname{supp}\left(\bar{\boldsymbol{\theta}}_{[p_{1}]}\right) and 𝐇^1𝐇1=𝜽¯[p1]𝜽¯[p1]𝜽¯𝜽¯\left\|\widehat{\mathbf{H}}_{1}-\mathbf{H}_{1}{}^{*}\right\|=\left\|\bar{\boldsymbol{\theta}}_{[p_{1}]}-\bar{\boldsymbol{\theta}}_{[p_{1}]}^{*}\right\|\leq\left\|\bar{\boldsymbol{\theta}}-\bar{\boldsymbol{\theta}}^{*}\right\| for any norm \|\cdot\|.

Consequently, Theorem 2 is reduced to studying 𝜽¯\bar{\boldsymbol{\theta}} and the loss function Ln(𝐲,𝐖¯𝜽)L_{n}(\mathbf{y},\overline{\mathbf{W}}\boldsymbol{\theta}). Lemma 5 below implies that all the regularity conditions (with A=A=\infty) in Lemma 6 are satisfied.

Let wtjw_{tj} and w¯tj\bar{w}_{tj} be the (t,j)(t,j)-th element of 𝐖\mathbf{W} and 𝐖¯\overline{\mathbf{W}}, respectively. Observe that Ln(𝐲,𝐖¯𝜽)=L_{n}(\mathbf{y},\overline{\mathbf{W}}\boldsymbol{\theta})= 1nt=1n[yt𝐰¯t𝜽+b(𝐰¯t𝜽)],Ln(𝐲,𝐖¯𝜽)=1nt=1n[yt+b(𝐰¯tT𝜽)]𝐰¯t\frac{1}{n}\sum_{t=1}^{n}\left[-y_{t}\overline{\mathbf{w}}_{t}^{\top}\boldsymbol{\theta}+b\left(\overline{\mathbf{w}}_{t}^{\top}\boldsymbol{\theta}\right)\right],\nabla L_{n}(\mathbf{y},\overline{\mathbf{W}}\boldsymbol{\theta})=\frac{1}{n}\sum_{t=1}^{n}\left[-y_{t}+b^{\prime}\left(\overline{\mathbf{w}}_{t}^{T}\boldsymbol{\theta}\right)\right]\overline{\mathbf{w}}_{t} and 𝐖¯𝜽¯=𝐀𝑯+(β0)n×1\overline{\mathbf{W}}\bar{\boldsymbol{\theta}}^{*}=\mathbf{A}\boldsymbol{H}^{*}+(\beta_{0}^{*})_{n\times 1}. Hence Ln(𝐲,𝐖¯𝜽¯)=ε\left\|\nabla L_{n}\left(\mathbf{y},\overline{\mathbf{W}}\bar{\boldsymbol{\theta}}^{*}\right)\right\|_{\infty}=\varepsilon^{*} and consequently, SLn(𝐲,𝐖¯𝜽¯)ε,SLn(𝐲,𝐖¯𝜽¯)2ε|S|\left\|\nabla_{S}L_{n}\left(\mathbf{y},\overline{\mathbf{W}}\bar{\boldsymbol{\theta}}^{*}\right)\right\|_{\infty}\leq\varepsilon^{*},\left\|\nabla_{S}L_{n}\left(\mathbf{y},\overline{\mathbf{W}}\bar{\boldsymbol{\theta}}^{*}\right)\right\|_{2}\leq\varepsilon^{*}\sqrt{|S|} and SLn(𝐲,𝐖¯𝜽¯)1ε|S|\left\|\nabla_{S}L_{n}\left(\mathbf{y},\overline{\mathbf{W}}\bar{\boldsymbol{\theta}}^{*}\right)\right\|_{1}\leq\varepsilon^{*}|S|. In addition, λ>7ε/τε\lambda>7\varepsilon^{*}/\tau\geq\varepsilon^{*}.

Based on these estimates, all the results follow from Lemma 6 and some simple algebra.

Here we present Lemma 5 used above, together with its proof.

Lemma 5

(Lemma C.1 of Fan et al. (2020)) Let Assumptions 3, 4 and 5 hold. Treat Ln(𝐲,𝐖¯𝛉)L_{n}(\mathbf{y},\overline{\mathbf{W}}\boldsymbol{\theta}) as a function of 𝛉\boldsymbol{\theta}, and the derivatives below are taken with respect to it. Firstly, by Assumption 5, we have 𝐖maxM02\|\bm{W}\|_{\mathop{\rm max}}\leq\frac{M_{0}}{2}. Then

  • (i)

    .S2L(𝐲,𝐖¯𝜽).S2L(𝐲,𝐖¯𝜽¯)M𝜽𝜽¯2,𝜽\left\|\nabla_{.S}^{2}L(\mathbf{y},\overline{\mathbf{W}}\boldsymbol{\theta})-\nabla_{.S}^{2}L\left(\mathbf{y},\overline{\mathbf{W}}\bar{\boldsymbol{\theta}}^{*}\right)\right\|_{\infty}\leq M\left\|\boldsymbol{\theta}-\bar{\boldsymbol{\theta}}^{*}\right\|_{2},\forall\bm{\theta} with supp(𝜽)S\operatorname{supp}(\boldsymbol{\theta})\subseteq S,

  • (ii)

    (SS2L(𝐲,𝐖¯𝜽¯))112κ\left\|\left(\nabla_{SS}^{2}L\left(\mathbf{y},\overline{\mathbf{W}}\bar{\boldsymbol{\theta}}^{*}\right)\right)^{-1}\right\|_{\infty}\leq\frac{1}{2\kappa_{\infty}},

  • (iii)

    (SS2L(𝐲,𝐖¯𝜽¯))1212κ2\left\|\left(\nabla_{SS}^{2}L\left(\mathbf{y},\overline{\mathbf{W}}\bar{\boldsymbol{\theta}}^{*}\right)\right)^{-1}\right\|_{2}\leq\frac{1}{2\kappa_{2}},

  • (iv)

    S2S2L(𝐲,𝐖¯𝜽¯)(SS2L(𝐲,𝐖¯𝜽¯))11τ\left\|\nabla_{S_{2}S}^{2}L\left(\mathbf{y},\overline{\mathbf{W}}\bar{\boldsymbol{\theta}}^{*}\right)\left(\nabla_{SS}^{2}L\left(\mathbf{y},\overline{\mathbf{W}}\bar{\boldsymbol{\theta}}^{*}\right)\right)^{-1}\right\|_{\infty}\leq 1-\tau.

Proof of Lemma 5 (i) Based on the fact that 𝐖𝜽=𝐖¯𝜽¯=𝑨𝑯+(β0)n×1\mathbf{W}\boldsymbol{\theta}^{*}=\overline{\mathbf{W}}\overline{\boldsymbol{\theta}}^{*}={\bm{A}}\bm{H}^{*}+(\beta_{0}^{*})_{n\times 1}, we have 2Ln(𝐲,𝐖𝜽)=1nt=1nb′′(𝐰¯tT𝜽¯)𝐰t𝐰tT\nabla^{2}L_{n}\left(\mathbf{y},\mathbf{W}\boldsymbol{\theta}^{*}\right)=\frac{1}{n}\sum_{t=1}^{n}b^{\prime\prime}\left(\overline{\mathbf{w}}_{t}^{T}\overline{\boldsymbol{\theta}}^{*}\right)\mathbf{w}_{t}\mathbf{w}_{t}^{T} and 2Ln(𝐲,𝐖¯𝜽¯)=1nt=1nb′′(𝐰¯tT𝜽¯)𝐰¯t𝐰¯tT\nabla^{2}L_{n}\left(\mathbf{y},\overline{\mathbf{W}}\overline{\boldsymbol{\theta}}^{*}\right)=\frac{1}{n}\sum_{t=1}^{n}b^{\prime\prime}\left(\overline{\mathbf{w}}_{t}^{T}\overline{\boldsymbol{\theta}}^{*}\right)\overline{\mathbf{w}}_{t}\overline{\mathbf{w}}_{t}^{T}. For any j,k[p1+K]j,k\in[p_{1}+K] and supp(𝜽)S\operatorname{supp}(\boldsymbol{\theta})\subseteq S,

|jk2Ln(𝐲,𝐖¯𝜽)jk2Ln(𝐲,𝐖¯𝜽¯)|1nt=1n|b′′(𝐰¯tT𝜽)b′′(𝐰¯tT𝜽¯)||𝐰¯tj𝐰¯tk|\displaystyle\left|\nabla_{jk}^{2}L_{n}(\mathbf{y},\overline{\mathbf{W}}\boldsymbol{\theta})-\nabla_{jk}^{2}L_{n}\left(\mathbf{y},\overline{\mathbf{W}}\overline{\boldsymbol{\theta}}^{*}\right)\right|\leq\frac{1}{n}\sum_{t=1}^{n}\left|b^{\prime\prime}\left(\overline{\mathbf{w}}_{t}^{T}\boldsymbol{\theta}\right)-b^{\prime\prime}\left(\overline{\mathbf{w}}_{t}^{T}\overline{\boldsymbol{\theta}}^{*}\right)\right|\cdot\left|\overline{\mathbf{w}}_{tj}\overline{\mathbf{w}}_{tk}\right|
1nt=1nM3|𝐰¯tT(𝜽𝜽¯)|𝐖¯max2\displaystyle\leq\frac{1}{n}\sum_{t=1}^{n}M_{3}\left|\overline{\mathbf{w}}_{t}^{T}\left(\boldsymbol{\theta}-\overline{\boldsymbol{\theta}}^{*}\right)\right|\cdot\|\overline{\mathbf{W}}\|_{\mathop{\rm max}}^{2} (24)

By the Cauchy-Schwarz inequality and 𝐖¯max𝐖max+𝐖¯𝐖maxM0\|\overline{\mathbf{W}}\|_{\mathop{\rm max}}\leq\|\mathbf{W}\|_{\mathop{\rm max}}+\|\overline{\mathbf{W}}-\mathbf{W}\|_{\mathop{\rm max}}\leq M_{0}, we obtain that for t[n],|𝐰¯tT(𝜽𝜽¯)|=|𝐰¯tST(𝜽𝜽¯)S|𝐰¯tS2𝜽𝜽¯2|S|M0𝜽𝜽¯2t\in[n],\left|\overline{\mathbf{w}}_{t}^{T}\left(\boldsymbol{\theta}-\overline{\boldsymbol{\theta}}^{*}\right)\right|=\left|\overline{\mathbf{w}}_{tS}^{T}\left(\boldsymbol{\theta}-\overline{\boldsymbol{\theta}}^{*}\right)_{S}\right|\leq\left\|\overline{\mathbf{w}}_{tS}\right\|_{2}\left\|\boldsymbol{\theta}-\overline{\boldsymbol{\theta}}^{*}\right\|_{2}\leq\sqrt{|S|}M_{0}\left\|\boldsymbol{\theta}-\overline{\boldsymbol{\theta}}^{*}\right\|_{2}. Plugging this result back into (24), we get

|jk2Ln(𝐲,𝐖¯𝜽)jk2Ln(𝐲,𝐖¯𝜽¯)||S|M3M03𝜽𝜽¯2,j,k[p+K]\displaystyle\left|\nabla_{jk}^{2}L_{n}(\mathbf{y},\overline{\mathbf{W}}\boldsymbol{\theta})-\nabla_{jk}^{2}L_{n}\left(\mathbf{y},\overline{\mathbf{W}}\overline{\boldsymbol{\theta}}^{*}\right)\right|\leq\sqrt{|S|}M_{3}M_{0}^{3}\left\|\boldsymbol{\theta}-\overline{\boldsymbol{\theta}}^{*}\right\|_{2},\quad\forall j,k\in[p+K]
S2Ln(𝐲,𝐖¯𝜽)S2Ln(𝐲,𝐖¯𝜽¯)|S|3/2M3M03𝜽𝜽¯2=M𝜽𝜽¯2\displaystyle\left\|\nabla_{\cdot S}^{2}L_{n}(\mathbf{y},\overline{\mathbf{W}}\boldsymbol{\theta})-\nabla_{\cdot S}^{2}L_{n}\left(\mathbf{y},\overline{\mathbf{W}}\overline{\boldsymbol{\theta}}^{*}\right)\right\|_{\infty}\leq|S|^{3/2}M_{3}M_{0}^{3}\left\|\boldsymbol{\theta}-\overline{\boldsymbol{\theta}}^{*}\right\|_{2}=M\left\|\boldsymbol{\theta}-\overline{\boldsymbol{\theta}}^{*}\right\|_{2}

(ii) Now we come to the second claim. For any k[p+K]k\in[p+K],

kS2Ln(𝐲,𝐖¯𝜽¯)kS2Ln(𝐲,𝐖𝜽)1nt=1nb′′(𝐱tT𝜷)w¯tk𝐰¯tSTwtk𝐰tST\displaystyle\left\|\nabla_{kS}^{2}L_{n}\left(\mathbf{y},\overline{\mathbf{W}}\overline{\boldsymbol{\theta}}^{*}\right)-\nabla_{kS}^{2}L_{n}\left(\mathbf{y},\mathbf{W}\boldsymbol{\theta}^{*}\right)\right\|_{\infty}\leq\frac{1}{n}\sum_{t=1}^{n}b^{\prime\prime}\left(\mathbf{x}_{t}^{T}\boldsymbol{\beta}^{*}\right)\left\|\bar{w}_{tk}\overline{\mathbf{w}}_{tS}^{T}-w_{tk}\mathbf{w}_{tS}^{T}\right\|_{\infty}
M2|S|nt=1nw¯tk𝐰¯tSTwtk𝐰tST2\displaystyle\leq\frac{M_{2}\sqrt{|S|}}{n}\sum_{t=1}^{n}\left\|\bar{w}_{tk}\overline{\mathbf{w}}_{tS}^{T}-w_{tk}\mathbf{w}_{tS}^{T}\right\|_{2}

Also, by 𝐖maxM0/2\|\mathbf{W}\|_{\mathop{\rm max}}\leq M_{0}/2 and 𝐖¯maxM0\|\overline{\mathbf{W}}\|_{\mathop{\rm max}}\leq M_{0} we have

w¯tk𝐰¯tSTwtk𝐰tST2|wtk|(𝐰¯tS𝐰tS)T2+|w¯tkwtk|𝐰¯tST2\displaystyle\left\|\bar{w}_{tk}\overline{\mathbf{w}}_{tS}^{T}-w_{tk}\mathbf{w}_{tS}^{T}\right\|_{2}\leq\left|w_{tk}\right|\cdot\left\|\left(\overline{\mathbf{w}}_{tS}-\mathbf{w}_{tS}\right)^{T}\right\|_{2}+\left|\bar{w}_{tk}-w_{tk}\right|\cdot\left\|\overline{\mathbf{w}}_{tS}^{T}\right\|_{2}
𝐖max𝐰¯tS𝐰tS2+|w¯tkwtk||S|𝐖¯max\displaystyle\leq\|\mathbf{W}\|_{\mathop{\rm max}}\left\|\overline{\mathbf{w}}_{tS}-\mathbf{w}_{tS}\right\|_{2}+\left|\bar{w}_{tk}-w_{tk}\right|\cdot\sqrt{|S|}\|\overline{\mathbf{W}}\|_{\mathop{\rm max}}
M02𝐰¯tS𝐰tS2+M0|S||w¯tkwtk|\displaystyle\leq\frac{M_{0}}{2}\left\|\overline{\mathbf{w}}_{tS}-\mathbf{w}_{tS}\right\|_{2}+M_{0}\sqrt{|S|}\cdot\left|\bar{w}_{tk}-w_{tk}\right|

Define δ=maxj[p+K](1nt=1n|w¯tjwtj|2)1/2\delta=\mathop{\rm max}_{j\in[p+K]}\left(\frac{1}{n}\sum_{t=1}^{n}\left|\bar{w}_{tj}-w_{tj}\right|^{2}\right)^{1/2}. By Jensen’s inequality, J[p+K]\forall J\subseteq[p+K],

1nt=1n𝐰¯tJ𝐰tJ2(1nt=1n𝐰¯tJ𝐰tJ22)1/2(|J|nmaxj[p+K]t=1n|w¯tjwtj|2)1/2|J|δ.\frac{1}{n}\sum_{t=1}^{n}\left\|\overline{\mathbf{w}}_{tJ}-\mathbf{w}_{tJ}\right\|_{2}\leq\left(\frac{1}{n}\sum_{t=1}^{n}\left\|\overline{\mathbf{w}}_{tJ}-\mathbf{w}_{tJ}\right\|_{2}^{2}\right)^{1/2}\leq\left(\frac{|J|}{n}\mathop{\rm max}_{j\in[p+K]}\sum_{t=1}^{n}\left|\bar{w}_{tj}-w_{tj}\right|^{2}\right)^{1/2}\leq\sqrt{|J|}\delta.

As a result,

S2Ln(𝐲,𝐖¯𝜽¯)S2Ln(𝐲,𝐖𝜽)\displaystyle\left\|\nabla_{\cdot S}^{2}L_{n}\left(\mathbf{y},\overline{\mathbf{W}}\overline{\boldsymbol{\theta}}^{*}\right)-\nabla_{\cdot S}^{2}L_{n}\left(\mathbf{y},\mathbf{W}\boldsymbol{\theta}^{*}\right)\right\|_{\infty}
=maxk[p+K]kS2Ln(𝐲,𝐖¯𝜽¯)kS2Ln(𝐲,𝐖𝜽)32M0M2|S|δ\displaystyle=\mathop{\rm max}_{k\in[p+K]}\left\|\nabla_{kS}^{2}L_{n}\left(\mathbf{y},\overline{\mathbf{W}}\overline{\boldsymbol{\theta}}^{*}\right)-\nabla_{kS}^{2}L_{n}\left(\mathbf{y},\mathbf{W}\boldsymbol{\theta}^{*}\right)\right\|_{\infty}\leq\frac{3}{2}M_{0}M_{2}|S|\delta (25)

Let α=(SS2Ln(𝐲,𝐖𝜽))1[SS2Ln(𝐲,𝐖¯𝜽¯)SS2Ln(𝐲,𝐖𝜽)]\alpha=\left\|\left(\nabla_{SS}^{2}L_{n}\left(\mathbf{y},\mathbf{W}\boldsymbol{\theta}^{*}\right)\right)^{-1}\left[\nabla_{SS}^{2}L_{n}\left(\mathbf{y},\overline{\mathbf{W}}\overline{\boldsymbol{\theta}}^{*}\right)-\nabla_{SS}^{2}L_{n}\left(\mathbf{y},\mathbf{W}\boldsymbol{\theta}^{*}\right)\right]\right\|_{\infty}. Then

α\displaystyle\alpha (SS2Ln(𝐲,𝐖𝜽))1SS2Ln(𝐲,𝐖¯T𝜽¯)SS2Ln(𝐲,𝐖𝜽)\displaystyle\leq\left\|\left(\nabla_{SS}^{2}L_{n}\left(\mathbf{y},\mathbf{W}\boldsymbol{\theta}^{*}\right)\right)^{-1}\right\|_{\infty}\left\|\nabla_{SS}^{2}L_{n}\left(\mathbf{y},\overline{\mathbf{W}}^{T}\overline{\boldsymbol{\theta}}^{*}\right)-\nabla_{SS}^{2}L_{n}\left(\mathbf{y},\mathbf{W}\boldsymbol{\theta}^{*}\right)\right\|_{\infty}
38κM0M2|S|δ12.\displaystyle\leq\frac{3}{8\kappa_{\infty}}M_{0}M_{2}|S|\delta\leq\frac{1}{2}. (26)

Lemma 7 yields

(SS2Ln(𝐲,𝐖¯𝜽¯))1(SS2Ln(𝐲,𝐖𝜽))1(SS2Ln(𝐲,𝐖𝜽))1α1α\displaystyle\left\|\left(\nabla_{SS}^{2}L_{n}\left(\mathbf{y},\overline{\mathbf{W}}\overline{\boldsymbol{\theta}}^{*}\right)\right)^{-1}-\left(\nabla_{SS}^{2}L_{n}\left(\mathbf{y},\mathbf{W}\boldsymbol{\theta}^{*}\right)\right)^{-1}\right\|_{\infty}\leq\left\|\left(\nabla_{SS}^{2}L_{n}\left(\mathbf{y},\mathbf{W}\boldsymbol{\theta}^{*}\right)\right)^{-1}\right\|_{\infty}\frac{\alpha}{1-\alpha}
14κα112316κ2M0M2|S|δ.\displaystyle\leq\frac{1}{4\kappa_{\infty}}\cdot\frac{\alpha}{1-\frac{1}{2}}\leq\frac{3}{16\kappa_{\infty}^{2}}M_{0}M_{2}|S|\delta.

We also have a cruder bound (SS2Ln(𝐲,𝐖¯𝜽¯))1(SS2Ln(𝐲,𝐖𝜽))114κ\left\|\left(\nabla_{SS}^{2}L_{n}\left(\mathbf{y},\overline{\mathbf{W}}\overline{\boldsymbol{\theta}}^{*}\right)\right)^{-1}-\left(\nabla_{SS}^{2}L_{n}\left(\mathbf{y},\mathbf{W}\boldsymbol{\theta}^{*}\right)\right)^{-1}\right\|_{\infty}\leq\frac{1}{4\kappa_{\infty}}, which leads to

(SS2Ln(𝐲,𝐖¯𝜽¯))1(SS2Ln(𝐲,𝐖𝜽))1+14κ12κ\displaystyle\left\|\left(\nabla_{SS}^{2}L_{n}\left(\mathbf{y},\overline{\mathbf{W}}\overline{\boldsymbol{\theta}}^{*}\right)\right)^{-1}\right\|_{\infty}\leq\left\|\left(\nabla_{SS}^{2}L_{n}\left(\mathbf{y},\mathbf{W}\boldsymbol{\theta}^{*}\right)\right)^{-1}\right\|_{\infty}+\frac{1}{4\kappa_{\infty}}\leq\frac{1}{2\kappa_{\infty}} (27)

(iii) The third argument follows (27) easily. Since 𝐀2𝐀\|\mathbf{A}\|_{2}\leq\|\mathbf{A}\|_{\infty} holds for any symmetric matrix 𝐀\mathbf{A}, we have (SS2Ln(𝐲,𝐖¯𝜽¯))1(SS2Ln(𝐲,𝐖𝜽))1214κ14κ2\left\|\left(\nabla_{SS}^{2}L_{n}\left(\mathbf{y},\overline{\mathbf{W}}\overline{\boldsymbol{\theta}}^{*}\right)\right)^{-1}-\left(\nabla_{SS}^{2}L_{n}\left(\mathbf{y},\mathbf{W}\boldsymbol{\theta}^{*}\right)\right)^{-1}\right\|_{2}\leq\frac{1}{4\kappa_{\infty}}\leq\frac{1}{4\kappa_{2}} and thus (SS2Ln(𝐲,𝐖¯𝜽¯))1212κ2\left\|\left(\nabla_{SS}^{2}L_{n}\left(\mathbf{y},\overline{\mathbf{W}}\overline{\boldsymbol{\theta}}^{*}\right)\right)^{-1}\right\|_{2}\leq\frac{1}{2\kappa_{2}}. (iv) Finally we prove the last inequality. On the one hand,

S2S2Ln(𝐲,𝐖¯𝜽¯)(SS2Ln(𝐲,𝐖¯𝜽¯))1S2S2Ln(𝐲,𝐖𝜽)(SS2Ln(𝐲,𝐖𝜽))1\displaystyle\left\|\nabla_{S_{2}S}^{2}L_{n}\left(\mathbf{y},\overline{\mathbf{W}}\overline{\boldsymbol{\theta}}^{*}\right)\left(\nabla_{SS}^{2}L_{n}\left(\mathbf{y},\overline{\mathbf{W}}\overline{\boldsymbol{\theta}}^{*}\right)\right)^{-1}-\nabla_{S_{2}S}^{2}L_{n}\left(\mathbf{y},\mathbf{W}\boldsymbol{\theta}^{*}\right)\left(\nabla_{SS}^{2}L_{n}\left(\mathbf{y},\mathbf{W}\boldsymbol{\theta}^{*}\right)\right)^{-1}\right\|_{\infty}
S2S2Ln(𝐲,𝐖¯𝜽¯)S2S2Ln(𝐲,𝐖𝜽)(SS2Ln(𝐲,𝐖¯𝜽¯))1\displaystyle\leq\left\|\nabla_{S_{2}S}^{2}L_{n}\left(\mathbf{y},\overline{\mathbf{W}}\overline{\boldsymbol{\theta}}^{*}\right)-\nabla_{S_{2}S}^{2}L_{n}\left(\mathbf{y},\mathbf{W}\boldsymbol{\theta}^{*}\right)\right\|_{\infty}\left\|\left(\nabla_{SS}^{2}L_{n}\left(\mathbf{y},\overline{\mathbf{W}}\overline{\boldsymbol{\theta}}^{*}\right)\right)^{-1}\right\|_{\infty}
+S2S2Ln(𝐲,𝐖𝜽)[(SS2Ln(𝐲,𝐖¯𝜽¯))1(SS2Ln(𝐲,𝐖𝜽))1]\displaystyle+\left\|\nabla_{S_{2}S}^{2}L_{n}\left(\mathbf{y},\mathbf{W}\boldsymbol{\theta}^{*}\right)\left[\left(\nabla_{SS}^{2}L_{n}\left(\mathbf{y},\overline{\mathbf{W}}\overline{\boldsymbol{\theta}}^{*}\right)\right)^{-1}-\left(\nabla_{SS}^{2}L_{n}\left(\mathbf{y},\mathbf{W}\boldsymbol{\theta}^{*}\right)\right)^{-1}\right]\right\|_{\infty}

From claim (ii) and (25), it is easy to see that

S2S2Ln(𝐲,𝐖¯𝜽¯)S2S2Ln(𝐲,𝐖𝜽)(SS2Ln(𝐲,𝐖¯𝜽¯))114κ3M0M2|S|δ.\left\|\nabla_{S_{2}S}^{2}L_{n}\left(\mathbf{y},\overline{\mathbf{W}}\overline{\boldsymbol{\theta}}^{*}\right)-\nabla_{S_{2}S}^{2}L_{n}\left(\mathbf{y},\mathbf{W}\boldsymbol{\theta}^{*}\right)\right\|_{\infty}\left\|\left(\nabla_{SS}^{2}L_{n}\left(\mathbf{y},\overline{\mathbf{W}}\overline{\boldsymbol{\theta}}^{*}\right)\right)^{-1}\right\|_{\infty}\leq\frac{1}{4\kappa_{\infty}}3M_{0}M_{2}|S|\delta.

On the other hand, we can take 𝐀=S2S2Ln(𝐲,𝐖𝜽),𝐁=SS2Ln(𝐲,𝐖𝜽)\mathbf{A}=\nabla_{S_{2}S}^{2}L_{n}\left(\mathbf{y},\mathbf{W}\boldsymbol{\theta}^{*}\right),\mathbf{B}=\nabla_{SS}^{2}L_{n}\left(\mathbf{y},\mathbf{W}\boldsymbol{\theta}^{*}\right) and 𝐂=SS2Ln(𝐲,𝐖¯𝜽¯)\mathbf{C}=\nabla_{SS}^{2}L_{n}\left(\mathbf{y},\overline{\mathbf{W}}\overline{\boldsymbol{\theta}}^{*}\right)- SS2Ln(𝐲,𝐖𝜽)\nabla_{SS}^{2}L_{n}\left(\mathbf{y},\mathbf{W}\boldsymbol{\theta}^{*}\right). By Assumption 4, 𝐀𝐁112τ1\left\|\mathbf{AB}^{-1}\right\|_{\infty}\leq 1-2\tau\leq 1. Lemma 7 forces that

S2S2Ln(𝐲,𝐖𝜽)[(SS2Ln(𝐲,𝐖¯𝜽¯))1(SS2Ln(𝐲,𝐖𝜽))1]\displaystyle\left\|\nabla_{S_{2}S}^{2}L_{n}\left(\mathbf{y},\mathbf{W}\boldsymbol{\theta}^{*}\right)\left[\left(\nabla_{SS}^{2}L_{n}\left(\mathbf{y},\overline{\mathbf{W}}\overline{\boldsymbol{\theta}}^{*}\right)\right)^{-1}-\left(\nabla_{SS}^{2}L_{n}\left(\mathbf{y},\mathbf{W}\boldsymbol{\theta}^{*}\right)\right)^{-1}\right]\right\|_{\infty}
\displaystyle=\left\|\mathbf{A}\left[(\mathbf{B}+\mathbf{C})^{-1}-\mathbf{B}^{-1}\right]\right\|_{\infty}\leq\left\|\mathbf{AB}^{-1}\right\|_{\infty}\frac{\left\|\mathbf{CB}^{-1}\right\|_{\infty}}{1-\left\|\mathbf{CB}^{-1}\right\|_{\infty}}\leq\frac{\|\mathbf{C}\|_{\infty}\left\|\mathbf{B}^{-1}\right\|_{\infty}}{1-\|\mathbf{C}\|_{\infty}\left\|\mathbf{B}^{-1}\right\|_{\infty}}

We have shown above in (8) that 𝐂𝐁138κM0M2|S|δ1/2\|\mathbf{C}\|_{\infty}\left\|\mathbf{B}^{-1}\right\|_{\infty}\leq\frac{3}{8\kappa_{\infty}}M_{0}M_{2}|S|\delta\leq 1/2. As a result,

S2S2Ln(𝐲,𝐖𝜽)[(SS2Ln(𝐲,𝐖¯𝜽¯))1(SS2Ln(𝐲,𝐖𝜽))1]34κM0M2|S|δ.\left\|\nabla_{S_{2}S}^{2}L_{n}\left(\mathbf{y},\mathbf{W}\boldsymbol{\theta}^{*}\right)\left[\left(\nabla_{SS}^{2}L_{n}\left(\mathbf{y},\overline{\mathbf{W}}\overline{\boldsymbol{\theta}}^{*}\right)\right)^{-1}-\left(\nabla_{SS}^{2}L_{n}\left(\mathbf{y},\mathbf{W}\boldsymbol{\theta}^{*}\right)\right)^{-1}\right]\right\|_{\infty}\leq\frac{3}{4\kappa_{\infty}}M_{0}M_{2}|S|\delta.

By combining these estimates, we have

S2S2Ln(𝐲,𝐖¯𝜽¯)(SS2Ln(𝐲,𝐖¯𝜽¯))1S2S2Ln(𝐲,𝐖𝜽)(SS2Ln(𝐲,𝐖𝜽))1\displaystyle\left\|\nabla_{S_{2}S}^{2}L_{n}\left(\mathbf{y},\overline{\mathbf{W}}\overline{\boldsymbol{\theta}}^{*}\right)\left(\nabla_{SS}^{2}L_{n}\left(\mathbf{y},\overline{\mathbf{W}}\overline{\boldsymbol{\theta}}^{*}\right)\right)^{-1}-\nabla_{S_{2}S}^{2}L_{n}\left(\mathbf{y},\mathbf{W}\boldsymbol{\theta}^{*}\right)\left(\nabla_{SS}^{2}L_{n}\left(\mathbf{y},\mathbf{W}\boldsymbol{\theta}^{*}\right)\right)^{-1}\right\|_{\infty}
32κM0M2|S|δτ.\displaystyle\leq\frac{3}{2\kappa_{\infty}}M_{0}M_{2}|S|\delta\leq\tau.

Therefore S2S2Ln(𝐲,𝐖¯𝜽¯)(SS2Ln(𝐲,𝐖¯𝜽¯))1(12τ)+τ=1τ\left\|\nabla_{S_{2}S}^{2}L_{n}\left(\mathbf{y},\overline{\mathbf{W}}\overline{\boldsymbol{\theta}}^{*}\right)\left(\nabla_{SS}^{2}L_{n}\left(\mathbf{y},\overline{\mathbf{W}}\overline{\boldsymbol{\theta}}^{*}\right)\right)^{-1}\right\|_{\infty}\leq(1-2\tau)+\tau=1-\tau.
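To make claim (iv) concrete, the following is a minimal numerical sketch (not part of the proof) using a least-squares Hessian \mathbf{W}^{\top}\mathbf{W}/n; the support sets, sample size, and perturbation level are hypothetical. It only illustrates that a small perturbation of the design moves the quantity \left\|\nabla_{S_{2}S}^{2}L_{n}\left(\nabla_{SS}^{2}L_{n}\right)^{-1}\right\|_{\infty} only slightly, which is the mechanism behind the bound (1-2\tau)+\tau=1-\tau.

import numpy as np

# Toy illustration of claim (iv) with a least-squares Hessian H = W^T W / n.
# The support sets, sample size, and perturbation below are hypothetical.
rng = np.random.default_rng(0)
n, p = 500, 8
S, S2 = list(range(4)), list(range(4, 8))
W = rng.normal(size=(n, p))
W_bar = W + 0.01 * rng.normal(size=(n, p))  # mildly perturbed design

def cross_ratio(W_mat):
    # || H_{S2,S} (H_{S,S})^{-1} ||_inf for H = W_mat^T W_mat / n
    H = W_mat.T @ W_mat / n
    M = H[np.ix_(S2, S)] @ np.linalg.inv(H[np.ix_(S, S)])
    return np.linalg.norm(M, ord=np.inf)

print(cross_ratio(W), cross_ratio(W_bar))  # the two values should be close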

Finally, we use the following lemma to prove the theorem.

Lemma 6

(Theorem B.1 of Fan et al. (2020)) For the optimization problem (23), if the conditions of Lemma 5 are satisfied, then the following hold (a small numerical reading of the admissible \lambda ranges is sketched after the lemma):

  1.

    Error bounds: if

    7τLn(𝜽)<λ<κ24|S|min{A,κτ3M},\frac{7}{\tau}\left\|\nabla L_{n}\left(\boldsymbol{\theta}^{*}\right)\right\|_{\infty}<\lambda<\frac{\kappa_{2}}{4\sqrt{|S|}}\min\left\{A,\frac{\kappa_{\infty}\tau}{3M}\right\},

    then supp(𝜽^)S\operatorname{supp}(\widehat{\boldsymbol{\theta}})\subseteq S and

    𝜽^𝜽35κ(SLn(𝜽)+λ),\displaystyle\left\|\widehat{\boldsymbol{\theta}}-\boldsymbol{\theta}^{*}\right\|_{\infty}\leq\frac{3}{5\kappa_{\infty}}\left(\left\|\nabla_{S}L_{n}\left(\boldsymbol{\theta}^{*}\right)\right\|_{\infty}+\lambda\right),
    𝜽^𝜽22κ2(SLn(𝜽)2+λ|S1|),\displaystyle\left\|\widehat{\boldsymbol{\theta}}-\boldsymbol{\theta}^{*}\right\|_{2}\leq\frac{2}{\kappa_{2}}\left(\left\|\nabla_{S}L_{n}\left(\boldsymbol{\theta}^{*}\right)\right\|_{2}+\lambda\sqrt{\left|S_{1}\right|}\right),
    𝜽^𝜽1min{35κ(SLn(𝜽)1+λ|S1|),2|S|κ2(SLn(𝜽)2+λ|S1|)}.\displaystyle\left\|\widehat{\boldsymbol{\theta}}-\boldsymbol{\theta}^{*}\right\|_{1}\leq\min\left\{\frac{3}{5\kappa_{\infty}}\left(\left\|\nabla_{S}L_{n}\left(\boldsymbol{\theta}^{*}\right)\right\|_{1}+\lambda\left|S_{1}\right|\right),\frac{2\sqrt{|S|}}{\kappa_{2}}\left(\left\|\nabla_{S}L_{n}\left(\boldsymbol{\theta}^{*}\right)\right\|_{2}+\lambda\sqrt{\left|S_{1}\right|}\right)\right\}.
  2.

    Sign consistency: In addition, if the following two conditions

    min{|𝐇1j|:𝐇1j0,j[p1]}>CκτLn(𝜽)\displaystyle\min\left\{\left|{\mathbf{H}_{1}^{*}}_{j}\right|:{\mathbf{H}_{1}^{*}}_{j}\neq 0,j\in[p_{1}]\right\}>\frac{C}{\kappa_{\infty}\tau}\left\|\nabla L_{n}\left(\boldsymbol{\theta}^{*}\right)\right\|_{\infty}
    Ln(𝜽)<κ2τ7C|S|min{A,κτ3M}\displaystyle\left\|\nabla L_{n}\left(\boldsymbol{\theta}^{*}\right)\right\|_{\infty}<\frac{\kappa_{2}\tau}{7C\sqrt{|S|}}\min\left\{A,\frac{\kappa_{\infty}\tau}{3M}\right\}

    hold for some C\geq 5, then by taking \lambda\in\left(\frac{7}{\tau}\left\|\nabla L_{n}\left(\boldsymbol{\theta}^{*}\right)\right\|_{\infty},\frac{1}{\tau}\left(\frac{5C}{3}-1\right)\left\|\nabla L_{n}\left(\boldsymbol{\theta}^{*}\right)\right\|_{\infty}\right), the estimator achieves sign consistency: \operatorname{sign}(\widehat{\mathbf{H}})=\operatorname{sign}\left(\mathbf{H}^{*}\right).
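As a sanity check on the two \lambda ranges in Lemma 6, the following minimal sketch evaluates them with purely hypothetical constants (\tau, \kappa_{2}, \kappa_{\infty}, A, M, C, |S| and the gradient bound are placeholders, not values from the paper); it only verifies that the intervals are non-empty for those values.

import numpy as np

# Hypothetical constants; none of these values come from the paper.
tau, kappa_2, kappa_inf = 0.5, 1.0, 1.0
A, M, C, S_size = 1.0, 2.0, 5.0, 6
grad_inf = 5e-4  # stand-in for ||grad L_n(theta*)||_inf

# Error-bound window (item 1 of Lemma 6)
lo1 = 7 / tau * grad_inf
hi1 = kappa_2 / (4 * np.sqrt(S_size)) * min(A, kappa_inf * tau / (3 * M))
print("error-bound window:", (lo1, hi1), "non-empty:", lo1 < hi1)

# Sign-consistency window (item 2 of Lemma 6)
lo2 = 7 / tau * grad_inf
hi2 = (5 * C / 3 - 1) / tau * grad_inf
print("sign-consistency window:", (lo2, hi2), "non-empty:", lo2 < hi2)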

Proof of Lemma 6

We need the following auxiliary lemmas to prove Lemma 6.

Define B_{S}\left(\boldsymbol{\theta}^{*},r\right)=\left\{\boldsymbol{\theta}:\left\|\boldsymbol{\theta}-\boldsymbol{\theta}^{*}\right\|_{2}\leq r,\operatorname{supp}(\boldsymbol{\theta})\subseteq S\right\} for r>0.

Lemma 7

Suppose 𝐀q×r\mathbf{A}\in\mathbb{R}^{q\times r} and 𝐁,𝐂r×r\mathbf{B},\mathbf{C}\in\mathbb{R}^{r\times r} and 𝐂𝐁1<1\left\|\mathbf{CB}^{-1}\right\|<1, where \|\cdot\| is an induced norm. Then 𝐀[(𝐁+𝐂)1𝐁1]𝐀𝐁1𝐂𝐁11𝐂𝐁1\left\|\mathbf{A}\left[(\mathbf{B}+\mathbf{C})^{-1}-\mathbf{B}^{-1}\right]\right\|\leq\frac{\left\|\mathbf{AB}^{-1}\right\|\cdot\left\|\mathbf{CB}^{-1}\right\|}{1-\left\|\mathbf{CB}^{-1}\right\|}.
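Lemma 7 is a standard perturbation bound for matrix inverses. Below is a minimal numerical check using the induced \ell_{\infty}-norm; the matrices are random placeholders with hypothetical sizes, not objects from the paper.

import numpy as np

# Minimal numerical check of Lemma 7 with the induced infinity-norm.
rng = np.random.default_rng(1)
q, r = 3, 5
A = rng.normal(size=(q, r))
B = np.eye(r) + 0.1 * rng.normal(size=(r, r))  # well-conditioned B
C = 0.05 * rng.normal(size=(r, r))             # small perturbation

def norm_inf(X):
    # induced infinity-norm (maximum absolute row sum)
    return np.linalg.norm(X, ord=np.inf)

B_inv = np.linalg.inv(B)
assert norm_inf(C @ B_inv) < 1  # condition of the lemma

lhs = norm_inf(A @ (np.linalg.inv(B + C) - B_inv))
rhs = norm_inf(A @ B_inv) * norm_inf(C @ B_inv) / (1 - norm_inf(C @ B_inv))
print(lhs <= rhs)  # expected: True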

Lemma 8

Under conditions (i), (ii), and (iii) in Lemma 5, we have \left\|\left(\nabla_{SS}^{2}L_{n}(\boldsymbol{\theta})\right)^{-1}\right\|_{2}<\kappa_{2}^{-1} and \left\|\left(\nabla_{SS}^{2}L_{n}(\boldsymbol{\theta})\right)^{-1}\right\|_{\infty}<\kappa_{\infty}^{-1} over B_{S}\left(\boldsymbol{\theta}^{*},\min\left\{A,\frac{\kappa_{\infty}}{M}\right\}\right).

Lemma 9

Suppose \lambda\geq 0, \mathcal{M} is a Euclidean space, \boldsymbol{\theta}_{0}\in\mathcal{M}, L(\boldsymbol{\theta})\in C^{2}(\mathcal{M}) and is convex, and R(\boldsymbol{\theta}) is convex. In addition, there exist \kappa,A>0 such that \nabla^{2}L(\boldsymbol{\theta})\succeq\kappa I as long as \left\|\boldsymbol{\theta}-\boldsymbol{\theta}_{0}\right\|_{2}\leq A. If \left\|\nabla L\left(\boldsymbol{\theta}_{0}\right)\right\|_{2}+\lambda\inf_{\mathbf{h}\in\partial R\left(\boldsymbol{\theta}_{0}\right)}\|\mathbf{h}\|_{2}<\frac{1}{2}\kappa A, then L_{\lambda}(\boldsymbol{\theta})=L(\boldsymbol{\theta})+\lambda R(\boldsymbol{\theta}) has a unique minimizer \hat{\boldsymbol{\theta}} and \left\|\widehat{\boldsymbol{\theta}}-\boldsymbol{\theta}_{0}\right\|_{2}\leq\frac{2}{\kappa}\left(\left\|\nabla L\left(\boldsymbol{\theta}_{0}\right)\right\|_{2}+\lambda\inf_{\mathbf{h}\in\partial R\left(\boldsymbol{\theta}_{0}\right)}\|\mathbf{h}\|_{2}\right).

Lemma 10

Suppose L(\boldsymbol{\theta})\in C^{2}\left(\mathbb{R}^{p}\right) and is convex. R(\boldsymbol{\theta}) is convex and R(\boldsymbol{\alpha}+\boldsymbol{\beta})=R(\boldsymbol{\alpha})+R(\boldsymbol{\beta}) for \boldsymbol{\alpha}\in\mathcal{M} and \boldsymbol{\beta}\in\mathcal{M}^{\perp}, where \mathcal{M} is a linear subspace of \mathbb{R}^{p} and \mathcal{M}^{\perp} is its orthogonal complement. In addition, there exists R^{*}(\boldsymbol{\theta})\in C\left(\mathbb{R}^{p}\right) such that |\langle\boldsymbol{\alpha},\boldsymbol{\beta}\rangle|\leq R(\boldsymbol{\alpha})R^{*}(\boldsymbol{\beta}) for \boldsymbol{\alpha}\in\mathcal{M}^{\perp} and \boldsymbol{\beta}\in\mathbb{R}^{p}. Let L_{\lambda}(\boldsymbol{\theta})=L(\boldsymbol{\theta})+\lambda R(\boldsymbol{\theta}) where \lambda\geq 0, and \hat{\boldsymbol{\theta}}\in\operatorname{argmin}_{\boldsymbol{\theta}\in\mathcal{M}}L_{\lambda}(\boldsymbol{\theta}).

If R(L(𝛉^))<λR^{*}(\nabla L(\hat{\boldsymbol{\theta}}))<\lambda and 𝛉T2L(𝛉^)𝛉>0\boldsymbol{\theta}^{T}\nabla^{2}L(\hat{\boldsymbol{\theta}})\boldsymbol{\theta}>0 for all 𝛉\boldsymbol{\theta}\in\mathcal{M}, then 𝛉^\hat{\boldsymbol{\theta}} is the unique global minimizer of Lλ(𝛉)L_{\lambda}(\boldsymbol{\theta}).
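Lemma 10 is the certificate used below to show \widehat{\boldsymbol{\theta}}=\bar{\boldsymbol{\theta}}: solve the problem restricted to \mathcal{M} and verify that the gradient is small on \mathcal{M}^{\perp}. The following sketch illustrates the check with a plain quadratic loss and an \ell_{1} penalty; the data, support, and \lambda are hypothetical, and it is not the fFASM estimator itself.

import numpy as np

# Restricted fit plus certificate check in the spirit of Lemma 10, with
# L(theta) = ||y - X theta||^2 / (2n), R = l1-norm, M = {theta : theta_{S2} = 0},
# and R*(g) = ||g_{S2}||_inf.  All quantities below are hypothetical.
rng = np.random.default_rng(2)
n, p, lam = 200, 10, 0.5
S1 = [0, 1, 2]
theta_star = np.zeros(p)
theta_star[S1] = [2.0, -1.5, 1.0]
X = rng.normal(size=(n, p))
y = X @ theta_star + 0.1 * rng.normal(size=n)

def grad(theta):
    # gradient of the quadratic loss
    return X.T @ (X @ theta - y) / n

# Coordinate descent on the restricted problem (coordinates outside S1 stay zero).
theta = np.zeros(p)
for _ in range(500):
    for j in S1:
        resid = y - X @ theta + X[:, j] * theta[j]
        z = X[:, j] @ resid / n
        theta[j] = np.sign(z) * max(abs(z) - lam, 0.0) / (X[:, j] @ X[:, j] / n)

S2 = [j for j in range(p) if j not in S1]
# Certificate: R*(grad L(theta_hat)) < lambda implies theta_hat is the global minimizer.
print(np.max(np.abs(grad(theta)[S2])) < lam)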

Now we start the proof. First, we study the restricted problem

𝜽¯=argmin𝜽{Ln(𝜽)+λR(𝜽)},\bar{\boldsymbol{\theta}}=\arg\min_{\boldsymbol{\theta}\in\mathcal{M}}\left\{L_{n}(\boldsymbol{\theta})+\lambda R(\boldsymbol{\theta})\right\},

where \mathcal{M}=\{\boldsymbol{\theta}:\boldsymbol{\theta}\in\mathbb{R}^{p+K},\boldsymbol{\theta}_{S_{2}}=\boldsymbol{0}\} is the oracle parameter set. Take R(\boldsymbol{\theta})=\left\|\boldsymbol{\theta}_{[p]}\right\|_{1} and R^{*}(\boldsymbol{\theta})=\left\|\boldsymbol{\theta}_{S_{2}}\right\|_{\infty}. Let A_{1}=\min\left\{A,\frac{\kappa_{\infty}\tau}{3M}\right\}, and hence A_{1}\leq\min\left\{A,\frac{\kappa_{\infty}}{M}\right\}. Lemma 8 shows that \left\|\left(\nabla_{SS}^{2}L_{n}(\boldsymbol{\theta})\right)^{-1}\right\|_{2}<\kappa_{2}^{-1} and \left\|\left(\nabla_{SS}^{2}L_{n}(\boldsymbol{\theta})\right)^{-1}\right\|_{\infty}<\kappa_{\infty}^{-1} over B_{S}\left(\boldsymbol{\theta}^{*},A_{1}\right).

Since \operatorname{supp}\left(\boldsymbol{\theta}^{*}\right)\subseteq S, there exists \mathbf{h}\in\partial R\left(\boldsymbol{\theta}^{*}\right) with \|\mathbf{h}\|_{2}\leq\sqrt{\left|S_{1}\right|}, so \inf_{\mathbf{h}\in\partial R\left(\boldsymbol{\theta}^{*}\right)}\|\mathbf{h}\|_{2}\leq\sqrt{\left|S_{1}\right|}. Therefore

SLn(𝜽)2+λ𝐡212κ2A112κ2A.\left\|\nabla_{S}L_{n}\left(\boldsymbol{\theta}^{*}\right)\right\|_{2}+\lambda\|\mathbf{h}\|_{2}\leq\frac{1}{2}\kappa_{2}A_{1}\leq\frac{1}{2}\kappa_{2}A.

Then Lemma 9 implies that 𝜽¯𝜽22κ2(SL(𝜽)2+λ|S1|)A1\left\|\bar{\boldsymbol{\theta}}-\boldsymbol{\theta}^{*}\right\|_{2}\leq\frac{2}{\kappa_{2}}\left(\left\|\nabla_{S}L\left(\boldsymbol{\theta}^{*}\right)\right\|_{2}+\lambda\sqrt{\left|S_{1}\right|}\right)\leq A_{1}.

Second, we study the L_{\infty} bound. On the one hand, the optimality condition yields \nabla_{S}L_{n}(\bar{\boldsymbol{\theta}})\in-\lambda\partial\left\|\bar{\boldsymbol{\theta}}_{S}\right\|_{1} and hence \left\|\nabla_{S}L_{n}(\bar{\boldsymbol{\theta}})\right\|_{\infty}\leq\lambda. On the other hand, by letting \boldsymbol{\theta}_{t}=(1-t)\boldsymbol{\theta}^{*}+t\bar{\boldsymbol{\theta}} (0\leq t\leq 1), we have

\displaystyle\nabla_{S}L_{n}(\bar{\boldsymbol{\theta}})-\nabla_{S}L_{n}\left(\boldsymbol{\theta}^{*}\right)=\int_{0}^{1}\nabla_{SS}^{2}L_{n}\left(\boldsymbol{\theta}_{t}\right)\left(\bar{\boldsymbol{\theta}}-\boldsymbol{\theta}^{*}\right)_{S}\mathrm{d}t
\displaystyle=\nabla_{SS}^{2}L_{n}\left(\boldsymbol{\theta}^{*}\right)\left(\bar{\boldsymbol{\theta}}-\boldsymbol{\theta}^{*}\right)_{S}+\int_{0}^{1}\left[\nabla_{SS}^{2}L_{n}\left(\boldsymbol{\theta}_{t}\right)-\nabla_{SS}^{2}L_{n}\left(\boldsymbol{\theta}^{*}\right)\right]\left(\bar{\boldsymbol{\theta}}-\boldsymbol{\theta}^{*}\right)_{S}\mathrm{d}t.

Hence

(𝜽¯𝜽)S(SS2Ln(𝜽))1[SLn(𝜽¯)SLn(𝜽)]\displaystyle\left\|\left(\bar{\boldsymbol{\theta}}-\boldsymbol{\theta}^{*}\right)_{S}-\left(\nabla_{SS}^{2}L_{n}\left(\boldsymbol{\theta}^{*}\right)\right)^{-1}\left[\nabla_{S}L_{n}(\bar{\boldsymbol{\theta}})-\nabla_{S}L_{n}\left(\boldsymbol{\theta}^{*}\right)\right]\right\|_{\infty}
01(SS2Ln(𝜽))1[SS2Ln(𝜽¯t)SS2Ln(𝜽)](𝜽¯𝜽)Sdt\displaystyle\leq\int_{0}^{1}\left\|\left(\nabla_{SS}^{2}L_{n}\left(\boldsymbol{\theta}^{*}\right)\right)^{-1}\left[\nabla_{SS}^{2}L_{n}\left(\bar{\boldsymbol{\theta}}_{t}\right)-\nabla_{SS}^{2}L_{n}\left(\boldsymbol{\theta}^{*}\right)\right]\left(\bar{\boldsymbol{\theta}}-\boldsymbol{\theta}^{*}\right)_{S}\right\|_{\infty}\mathrm{d}t
(SS2Ln(𝜽))1supt[0,1]SS2Ln(𝜽¯t)SS2Ln(𝜽)𝜽¯𝜽\displaystyle\leq\left\|\left(\nabla_{SS}^{2}L_{n}\left(\boldsymbol{\theta}^{*}\right)\right)^{-1}\right\|_{\infty}\sup_{t\in[0,1]}\left\|\nabla_{SS}^{2}L_{n}\left(\bar{\boldsymbol{\theta}}_{t}\right)-\nabla_{SS}^{2}L_{n}\left(\boldsymbol{\theta}^{*}\right)\right\|_{\infty}\left\|\bar{\boldsymbol{\theta}}-\boldsymbol{\theta}^{*}\right\|_{\infty}

By (i), (ii), and (iii) in Lemma 5, we obtain

(𝜽¯𝜽)S(SS2Ln(𝜽))1[SLn(𝜽¯)SLn(𝜽)]M2κ𝜽¯𝜽2𝜽¯𝜽.\left\|\left(\bar{\boldsymbol{\theta}}-\boldsymbol{\theta}^{*}\right)_{S}-\left(\nabla_{SS}^{2}L_{n}\left(\boldsymbol{\theta}^{*}\right)\right)^{-1}\left[\nabla_{S}L_{n}(\bar{\boldsymbol{\theta}})-\nabla_{S}L_{n}\left(\boldsymbol{\theta}^{*}\right)\right]\right\|_{\infty}\leq\frac{M}{2\kappa_{\infty}}\left\|\bar{\boldsymbol{\theta}}-\boldsymbol{\theta}^{*}\right\|_{2}\left\|\bar{\boldsymbol{\theta}}-\boldsymbol{\theta}^{*}\right\|_{\infty}.

Since \bar{\boldsymbol{\theta}}\in B_{S}\left(\boldsymbol{\theta}^{*},A_{1}\right), we have

𝜽¯𝜽\displaystyle\left\|\bar{\boldsymbol{\theta}}-\boldsymbol{\theta}^{*}\right\|_{\infty} (SS2Ln(𝜽))1SLn(𝜽¯)SLn(𝜽)+M2κ𝜽¯𝜽2𝜽¯𝜽\displaystyle\leq\left\|\left(\nabla_{SS}^{2}L_{n}\left(\boldsymbol{\theta}^{*}\right)\right)^{-1}\right\|_{\infty}\left\|\nabla_{S}L_{n}(\bar{\boldsymbol{\theta}})-\nabla_{S}L_{n}\left(\boldsymbol{\theta}^{*}\right)\right\|_{\infty}+\frac{M}{2\kappa_{\infty}}\left\|\bar{\boldsymbol{\theta}}-\boldsymbol{\theta}^{*}\right\|_{2}\left\|\bar{\boldsymbol{\theta}}-\boldsymbol{\theta}^{*}\right\|_{\infty}
12κ(λ+SLn(𝜽))+16𝜽¯𝜽.\displaystyle\leq\frac{1}{2\kappa_{\infty}}\left(\lambda+\left\|\nabla_{S}L_{n}\left(\boldsymbol{\theta}^{*}\right)\right\|_{\infty}\right)+\frac{1}{6}\left\|\overline{\boldsymbol{\theta}}-\boldsymbol{\theta}^{*}\right\|_{\infty}.

Rearranging the above inequality (moving the term \frac{1}{6}\left\|\bar{\boldsymbol{\theta}}-\boldsymbol{\theta}^{*}\right\|_{\infty} to the left-hand side), we obtain

𝜽¯𝜽35κ(SLn(𝜽)+λ).\|\bar{\boldsymbol{\theta}}-\boldsymbol{\theta}^{*}\|_{\infty}\leq\frac{3}{5\kappa_{\infty}}\left(\left\|\nabla_{S}L_{n}\left(\boldsymbol{\theta}^{*}\right)\right\|_{\infty}+\lambda\right). (28)

Third, we study the L_{1} bound. The bound on \left\|\bar{\boldsymbol{\theta}}-\boldsymbol{\theta}^{*}\right\|_{1} can be obtained in a similar way. Using the fact that the induced norms \|\cdot\|_{1} and \|\cdot\|_{\infty} coincide for symmetric matrices,

𝜽¯𝜽1\displaystyle\left\|\bar{\boldsymbol{\theta}}-\boldsymbol{\theta}^{*}\right\|_{1} (SS2Ln(𝜽))11SLn(𝜽¯)SLn(𝜽)1+M2κ𝜽¯𝜽2𝜽¯𝜽1\displaystyle\leq\left\|\left(\nabla_{SS}^{2}L_{n}\left(\boldsymbol{\theta}^{*}\right)\right)^{-1}\right\|_{1}\left\|\nabla_{S}L_{n}(\bar{\boldsymbol{\theta}})-\nabla_{S}L_{n}\left(\boldsymbol{\theta}^{*}\right)\right\|_{1}+\frac{M}{2\kappa_{\infty}}\left\|\bar{\boldsymbol{\theta}}-\boldsymbol{\theta}^{*}\right\|_{2}\left\|\bar{\boldsymbol{\theta}}-\boldsymbol{\theta}^{*}\right\|_{1}
12κ(λ|S1|+SLn(𝜽)1)+16𝜽¯𝜽1.\displaystyle\leq\frac{1}{2\kappa_{\infty}}\left(\lambda\left|S_{1}\right|+\left\|\nabla_{S}L_{n}\left(\boldsymbol{\theta}^{*}\right)\right\|_{1}\right)+\frac{1}{6}\left\|\bar{\boldsymbol{\theta}}-\boldsymbol{\theta}^{*}\right\|_{1}.

Hence 𝜽¯𝜽135κ(SLn(𝜽)1+λ|S1|)\left\|\bar{\boldsymbol{\theta}}-\boldsymbol{\theta}^{*}\right\|_{1}\leq\frac{3}{5\kappa_{\infty}}\left(\left\|\nabla_{S}L_{n}\left(\boldsymbol{\theta}^{*}\right)\right\|_{1}+\lambda\left|S_{1}\right|\right). Since supp(𝜽¯)S\operatorname{supp}(\bar{\boldsymbol{\theta}})\subseteq S, we also have

𝜽¯𝜽1|S|𝜽¯𝜽22|S|κ2(SL(𝜽)2+λ|S1|).\left\|\bar{\boldsymbol{\theta}}-\boldsymbol{\theta}^{*}\right\|_{1}\leq\sqrt{|S|}\left\|\bar{\boldsymbol{\theta}}-\boldsymbol{\theta}^{*}\right\|_{2}\leq\frac{2\sqrt{|S|}}{\kappa_{2}}\left(\left\|\nabla_{S}L\left(\boldsymbol{\theta}^{*}\right)\right\|_{2}+\lambda\sqrt{\left|S_{1}\right|}\right).

This gives another L_{1} bound. By Lemma 10, to derive \hat{\boldsymbol{\theta}}=\bar{\boldsymbol{\theta}} it remains to show that \left\|\nabla_{S_{2}}L_{n}(\bar{\boldsymbol{\theta}})\right\|_{\infty}<\lambda. Using a Taylor expansion, we have

S2Ln(𝜽¯)S2Ln(𝜽)=01S2S2Ln(𝜽t)(𝜽¯𝜽)Sdt\displaystyle\nabla_{S_{2}}L_{n}(\bar{\boldsymbol{\theta}})-\nabla_{S_{2}}L_{n}\left(\boldsymbol{\theta}^{*}\right)=\int_{0}^{1}\nabla_{S_{2}S}^{2}L_{n}\left(\boldsymbol{\theta}_{t}\right)\left(\bar{\boldsymbol{\theta}}-\boldsymbol{\theta}^{*}\right)_{S}\mathrm{d}t (29)
=S2S2Ln(𝜽)(𝜽¯𝜽)+01[S2S2Ln(𝜽t)S2S2Ln(𝜽)](𝜽¯𝜽)Sdt.\displaystyle=\nabla_{S_{2}S}^{2}L_{n}\left(\boldsymbol{\theta}^{*}\right)\left(\bar{\boldsymbol{\theta}}-\boldsymbol{\theta}^{*}\right)+\int_{0}^{1}\left[\nabla_{S_{2}S}^{2}L_{n}\left(\boldsymbol{\theta}_{t}\right)-\nabla_{S_{2}S}^{2}L_{n}\left(\boldsymbol{\theta}^{*}\right)\right]\left(\bar{\boldsymbol{\theta}}-\boldsymbol{\theta}^{*}\right)_{S}\mathrm{d}t.

On the one hand, the first term in (29) satisfies

S2S2Ln(𝜽)(𝜽¯𝜽)\displaystyle\left\|\nabla_{S_{2}S}^{2}L_{n}\left(\boldsymbol{\theta}^{*}\right)\left(\bar{\boldsymbol{\theta}}-\boldsymbol{\theta}^{*}\right)\right\|_{\infty} =[S2S2Ln(𝜽)(SS2Ln(𝜽))1][SS2Ln(𝜽)(𝜽¯𝜽)]\displaystyle=\left\|\left[\nabla_{S_{2}S}^{2}L_{n}\left(\boldsymbol{\theta}^{*}\right)\left(\nabla_{SS}^{2}L_{n}\left(\boldsymbol{\theta}^{*}\right)\right)^{-1}\right]\left[\nabla_{SS}^{2}L_{n}\left(\boldsymbol{\theta}^{*}\right)\left(\bar{\boldsymbol{\theta}}-\boldsymbol{\theta}^{*}\right)\right]\right\|_{\infty}
(1τ)SS2Ln(𝜽)(𝜽¯𝜽).\displaystyle\leq(1-\tau)\left\|\nabla_{SS}^{2}L_{n}\left(\boldsymbol{\theta}^{*}\right)\left(\bar{\boldsymbol{\theta}}-\boldsymbol{\theta}^{*}\right)\right\|_{\infty}.

By the Taylor expansion, the triangle inequality, (i) in Lemma 5, and the fact that \bar{\boldsymbol{\theta}}\in B_{S}\left(\boldsymbol{\theta}^{*},A_{1}\right),

SS2Ln(𝜽)(𝜽¯𝜽)\displaystyle\left\|\nabla_{SS}^{2}L_{n}\left(\boldsymbol{\theta}^{*}\right)\left(\bar{\boldsymbol{\theta}}-\boldsymbol{\theta}^{*}\right)\right\|_{\infty} SLn(𝜽¯)SLn(𝜽)\displaystyle\leq\left\|\nabla_{S}L_{n}(\bar{\boldsymbol{\theta}})-\nabla_{S}L_{n}\left(\boldsymbol{\theta}^{*}\right)\right\|_{\infty}
+01[SS2Ln(𝜽¯t)SS2Ln(𝜽)](𝜽¯𝜽)dt\displaystyle+\int_{0}^{1}\left\|\left[\nabla_{SS}^{2}L_{n}\left(\bar{\boldsymbol{\theta}}_{t}\right)-\nabla_{SS}^{2}L_{n}\left(\boldsymbol{\theta}^{*}\right)\right]\left(\bar{\boldsymbol{\theta}}-\boldsymbol{\theta}^{*}\right)\right\|_{\infty}\mathrm{d}t
SLn(𝜽¯)+SLn(𝜽)+M𝜽¯𝜽2𝜽¯𝜽\displaystyle\leq\left\|\nabla_{S}L_{n}(\bar{\boldsymbol{\theta}})\right\|_{\infty}+\left\|\nabla_{S}L_{n}\left(\boldsymbol{\theta}^{*}\right)\right\|_{\infty}+M\left\|\bar{\boldsymbol{\theta}}-\boldsymbol{\theta}^{*}\right\|_{2}\left\|\bar{\boldsymbol{\theta}}-\boldsymbol{\theta}^{*}\right\|_{\infty}
λ+SLn(𝜽)+κτ3𝜽¯𝜽.\displaystyle\leq\lambda+\left\|\nabla_{S}L_{n}\left(\boldsymbol{\theta}^{*}\right)\right\|_{\infty}+\frac{\kappa_{\infty}\tau}{3}\left\|\bar{\boldsymbol{\theta}}-\boldsymbol{\theta}^{*}\right\|_{\infty}.

On the other hand, we bound the second term in (29). Note that \boldsymbol{\theta}_{t}\in B_{S}\left(\boldsymbol{\theta}^{*},A_{1}\right) for all t\in[0,1]. Then (i) in Lemma 5 yields

01[S2S2Ln(𝜽t)S2S2Ln(𝜽)](𝜽¯𝜽)dt\displaystyle\left\|\int_{0}^{1}\left[\nabla_{S_{2}S}^{2}L_{n}\left(\boldsymbol{\theta}_{t}\right)-\nabla_{S_{2}S}^{2}L_{n}\left(\boldsymbol{\theta}^{*}\right)\right]\left(\bar{\boldsymbol{\theta}}-\boldsymbol{\theta}^{*}\right)\mathrm{d}t\right\|_{\infty}
supt[0,1]S2S2Ln(𝜽t)S2S2Ln(𝜽)𝜽¯𝜽κτ3𝜽¯𝜽.\displaystyle\leq\sup_{t\in[0,1]}\left\|\nabla_{S_{2}S}^{2}L_{n}\left(\boldsymbol{\theta}_{t}\right)-\nabla_{S_{2}S}^{2}L_{n}\left(\boldsymbol{\theta}^{*}\right)\right\|_{\infty}\left\|\bar{\boldsymbol{\theta}}-\boldsymbol{\theta}^{*}\right\|_{\infty}\leq\frac{\kappa_{\infty}\tau}{3}\left\|\bar{\boldsymbol{\theta}}-\boldsymbol{\theta}^{*}\right\|_{\infty}.

As a result,

S2Ln(𝜽¯)\displaystyle\left\|\nabla_{S_{2}}L_{n}(\overline{\boldsymbol{\theta}})\right\|_{\infty} S2Ln(𝜽)+(1τ)(λ+SLn(𝜽)+κτ3𝜽¯𝜽)\displaystyle\leq\left\|\nabla_{S_{2}}L_{n}\left(\boldsymbol{\theta}^{*}\right)\right\|_{\infty}+(1-\tau)\left(\lambda+\left\|\nabla_{S}L_{n}\left(\boldsymbol{\theta}^{*}\right)\right\|_{\infty}+\frac{\kappa_{\infty}\tau}{3}\left\|\bar{\boldsymbol{\theta}}-\boldsymbol{\theta}^{*}\right\|_{\infty}\right)
+κτ3𝜽¯𝜽\displaystyle+\frac{\kappa_{\infty}\tau}{3}\left\|\bar{\boldsymbol{\theta}}-\boldsymbol{\theta}^{*}\right\|_{\infty}
λτ(λ2κ3𝜽¯𝜽2τLn(𝜽)).\displaystyle\leq\lambda-\tau\left(\lambda-\frac{2\kappa_{\infty}}{3}\left\|\bar{\boldsymbol{\theta}}-\boldsymbol{\theta}^{*}\right\|_{\infty}-\frac{2}{\tau}\left\|\nabla L_{n}\left(\boldsymbol{\theta}^{*}\right)\right\|_{\infty}\right).

Recall the L_{\infty} bound in (28). By plugging in this estimate, and using the assumptions 0<\tau<1 and \lambda>\frac{20}{3\tau}\left\|\nabla L_{n}\left(\boldsymbol{\theta}^{*}\right)\right\|_{\infty} (which is implied by \lambda>\frac{7}{\tau}\left\|\nabla L_{n}\left(\boldsymbol{\theta}^{*}\right)\right\|_{\infty}), we derive that

S2Ln(𝜽¯)\displaystyle\left\|\nabla_{S_{2}}L_{n}(\bar{\boldsymbol{\theta}})\right\|_{\infty} λτ(λ25(SLn(𝜽)+λ)2τLn(𝜽))\displaystyle\leq\lambda-\tau\left(\lambda-\frac{2}{5}\left(\left\|\nabla_{S}L_{n}\left(\boldsymbol{\theta}^{*}\right)\right\|_{\infty}+\lambda\right)-\frac{2}{\tau}\left\|\nabla L_{n}\left(\boldsymbol{\theta}^{*}\right)\right\|_{\infty}\right)
λτ(35λ4τLn(𝜽))<λ.\displaystyle\leq\lambda-\tau\left(\frac{3}{5}\lambda-\frac{4}{\tau}\left\|\nabla L_{n}\left(\boldsymbol{\theta}^{*}\right)\right\|_{\infty}\right)<\lambda.

This implies \widehat{\boldsymbol{\theta}}=\bar{\boldsymbol{\theta}}, so all the bounds established for \bar{\boldsymbol{\theta}} carry over to \widehat{\boldsymbol{\theta}}. The claim on sign consistency follows from elementary computation, so we omit its proof.

Proof of Theorem 3

Proof: Write the loss function in (3.1) as L_{n}\left(\bm{y},\widehat{\bm{W}}\bm{\theta}\right)=\frac{1}{n}\sum_{i=1}^{n}L\left(y_{i},\widehat{\bm{w}}_{i}^{\top}\bm{\theta}\right). We write L\left(y_{i},\widehat{\bm{w}}_{i}^{\top}\bm{\theta}\right)|_{\bm{A}} to indicate that \widehat{\bm{f}}_{i} and \widehat{\bm{u}}_{i} in the loss function are computed from \bm{A}, and similarly for L\left(y_{i},\widehat{\bm{w}}_{i}^{\top}\bm{\theta}\right)|_{\widehat{\bm{A}}}. Let \zeta_{i}=L\left(y_{i},\widehat{\bm{w}}_{i}^{\top}\bm{\theta}\right)|_{\widehat{\bm{A}}}-L\left(y_{i},\widehat{\bm{w}}_{i}^{\top}\bm{\theta}\right)|_{{\bm{A}}}. For fixed n, the \zeta_{i}, i=1,\ldots,n, are identically distributed, so the Cauchy–Schwarz inequality applied to the covariances gives

Var(1ni=1nζi)𝔼(ζi2).\displaystyle\text{Var}(\frac{1}{n}\sum_{i=1}^{n}\zeta_{i})\leq\mathbb{E}(\zeta_{i}^{2}).

Since \zeta_{i}\rightarrow 0 in probability, we have \mathbb{E}(\zeta_{i}^{2})\rightarrow 0.
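The variance step above can be read as a consequence of the Cauchy–Schwarz inequality for identically distributed (possibly dependent) summands. The following is a purely illustrative Monte Carlo check; the \zeta_{i} below are a toy construction with a shared component, not the differences defined in this proof.

import numpy as np

# Monte Carlo illustration: for identically distributed, correlated zeta_i,
# the variance of their average does not exceed E(zeta_i^2).
rng = np.random.default_rng(3)
n, reps = 50, 20000
common = rng.normal(size=(reps, 1))                    # shared component induces correlation
zeta = 0.7 * common + 0.3 * rng.normal(size=(reps, n))
means = zeta.mean(axis=1)
print(means.var(), np.mean(zeta[:, 0] ** 2))           # left value should not exceed right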