
Slicing-free Inverse Regression in High-dimensional

Sufficient Dimension Reduction

Qing Mai$^{1}$, Xiaofeng Shao$^{2}$, Runmin Wang$^{3}$ and Xin Zhang$^{1}$

$^{1}$Florida State University

$^{2}$University of Illinois at Urbana-Champaign

$^{3}$Texas A&M University

Abstract: Sliced inverse regression (SIR, Li, 1991) is a pioneering work and the most recognized method in sufficient dimension reduction. While promising progress has been made in the theory and methods of high-dimensional SIR, two challenges remain in high-dimensional multivariate applications. First, choosing the number of slices in SIR is a difficult problem, and it depends on the sample size, the distribution of the variables, and other practical considerations. Second, the extension of SIR from a univariate response to a multivariate one is not trivial. Targeting the same dimension reduction subspace as SIR, we propose a new slicing-free method that provides a unified solution to sufficient dimension reduction with high-dimensional covariates and a univariate or multivariate response. We achieve this by adopting the recently developed martingale difference divergence matrix (MDDM, Lee and Shao, 2018) and penalized eigen-decomposition algorithms. To establish the consistency of our method with a high-dimensional predictor and a multivariate response, we develop a new concentration inequality for the sample MDDM around its population counterpart using theories for U-statistics, which may be of independent interest. Simulations and real data analysis demonstrate the favorable finite-sample performance of the proposed method.

Key words and phrases: Multivariate response, Sliced inverse regression, Sufficient dimension reduction, U-statistic.

1 Introduction

Sufficient dimension reduction (SDR) is an important statistical tool for data visualization, summary and inference. It extracts low-rank projections of the predictors $\mathbf{X}$ that contain all the information about the response $Y$, without specifying a parametric model beforehand. The semi-parametric nature of SDR leads to great flexibility and convenience in practice. After SDR, we can model the conditional distribution of the response given the lower-dimensional projected covariates using existing parametric or non-parametric methods. A salient feature of SDR is that the low-rank projection space can be accurately estimated at a parametric rate, with the nonparametric part treated as an infinite-dimensional nuisance parameter. For example, in multi-index models, SDR can estimate the multiple projection directions without estimating the unspecified link function.

A cornerstone of SDR is SIR (sliced inverse regression), pioneered by Li (1991), who first discovered the inner connection between the low-rank projection space and the eigen-space of $\mathrm{cov}\{\mathrm{E}(\mathbf{X}\mid Y)\}$ under suitable assumptions. SIR is performed by slicing the response $Y$ and aggregating the conditional mean of the predictor $\mathbf{X}$ given the response $Y$ within each slice. To illustrate the idea, we consider a univariate response $Y$. Slicing involves picking $K+1$ constants $-\infty=a_{0}<a_{1}<\ldots<a_{K}=\infty$ and defining a new random variable $H$, where $H=k$ if and only if $a_{k-1}<Y\leq a_{k}$. Upon centering and standardizing the covariate, i.e., $\mathbf{X}\rightarrow\widetilde{\mathbf{X}}=\boldsymbol{\Sigma}_{\mathbf{X}}^{-1/2}(\mathbf{X}-\mathrm{E}(\mathbf{X}))$, a simple eigen-decomposition can be conducted to find linear projections that explain most of the variability in the conditional expectation of the transformed predictor given the response across slices, that is, $\mathrm{cov}\{\mathrm{E}(\widetilde{\mathbf{X}}\mid H)\}$. As an important variation of SIR, sliced average variance estimation (Cook and Weisberg, 1991) utilizes the conditional variance across slices. A key step in these inverse regression methods is apparently the choice of the slicing scheme. If $Y$ is sliced too coarsely, we may not be able to capture the full dependence of $Y$ on the predictors, which could lead to a large bias in the estimation of $\mathrm{cov}\{\mathrm{E}(\widetilde{\mathbf{X}}\mid Y)\}$. If $Y$ is sliced too finely, the within-slice sample size is small, leading to large variability in estimation. Although Li (1991) and Hsing and Carroll (1992) showed that SIR can still be consistent in large samples even when the slicing scheme is chosen poorly, Zhu and Ng (1995) argued that the choice of slicing scheme is critical for achieving high estimation efficiency. To the best of our knowledge, there is no universal guidance on the choice of the slicing scheme in the literature.

Zhu et al. (2010) and Cook and Zhang (2014) showed that it is beneficial to aggregate multiple slicing schemes rather than rely on a single one. However, the proposals in the above-mentioned papers have their own limitations, as they focus exclusively on a univariate response. In many real-life problems, it is common to encounter multi-response data. Component-wise analysis may not be sufficient for multi-response data because it does not fully exploit the dependence among the response components. But slicing a multivariate response is notoriously hard due to the curse of dimensionality, a common problem in multivariate nonparametric smoothing. As the dimension of the response becomes moderately large, it is increasingly difficult to ensure that each slice contains a decent number of samples, and the estimation can be unstable in practice. Hence, it is highly desirable to develop new SDR methods that do not involve slicing.

An important line of research in the recent SDR literature is to develop SDR methods for datasets with high-dimensional covariates, as motivated by many contemporary applications. The idea of SDR is naturally attractive for high-dimensional datasets, as an effective reduction of the dimension of $\mathbf{X}$ facilitates the use of existing modeling and inference methods that are tailored for low-dimensional covariates. However, most classical SDR methods are not directly applicable to the large-$p$-small-$n$ setting, where $p$ is the dimension of $\mathbf{X}$ and $n$ is the sample size. To overcome the challenges with high-dimensional covariates, several methods have been proposed recently. Lin et al. (2018) show that the SIR estimator is consistent if and only if $\lim p/n=0$. When the dimension $p$ is larger than $n$, they propose a diagonal thresholding screening SIR (DT-SIR) algorithm and show that it consistently recovers the dimension reduction space under certain sparsity assumptions on both the covariance matrix of the predictors and the loadings of the directions. Lin et al. (2019) further introduce a simple Lasso regression method to estimate the SDR space by constructing artificial response variables from the top eigenvectors of the estimated conditional covariance matrix. Tan, Wang, Liu and Zhang (2018) propose a two-stage computational framework to solve the sparse generalized eigenvalue problem, which includes high-dimensional SDR as a special case, and propose a truncated Rayleigh flow method (namely, RIFLE) to estimate the leading generalized eigenvector. See also Lin et al. (2020) and Tan, Wang, Zhang, Liu and Cook (2018) for related recent work. These methods provide valuable tools to tackle high-dimensional SDR problems. However, all of them still rely on SIR as an important component of their methodology and involve choosing a single slicing scheme, with little guidance provided on the slicing scheme. Consequently, these methods cannot be easily applied to data with a multivariate response, and the impact of the choice of slicing scheme is unclear.

In this article, we propose a novel slicing-free SDR method in the high-dimensional setting. Our proposal is inspired by a recent nonlinear dependence metric: the martingale difference divergence matrix (MDDM, Lee and Shao, 2018). The MDDM was developed by Lee and Shao (2018) as a matrix-valued extension of the martingale difference divergence (MDD) of Shao and Zhang (2014), which measures the (conditional) mean dependence of a response variable given a covariate, and was used for dimension reduction of a multivariate time series. As recently revealed by Zhang et al. (2020), at the population level, the eigenvectors (or generalized eigenvectors) of the MDDM are always contained in the central subspace. Building on these prior works, we propose a penalized eigen-decomposition of MDDM to perform SDR in high dimensions. When the covariance matrix of the predictor is the identity matrix, we adopt the truncated power method with hard thresholding to estimate the top-$K$ eigenvectors of MDDM. For more general covariance structures, we adopt the RIFLE algorithm (Tan, Wang, Liu and Zhang, 2018) and apply it to the sample MDDM instead of the sample SIR estimator of $\mathrm{cov}\{\mathrm{E}(\mathbf{X}\mid Y)\}$. With the use of the sample MDDM, this approach is completely slicing-free and treats univariate and multivariate responses in a unified way; the practical difficulty of selecting the number of slices (especially for a multivariate response) is thus circumvented. On the theory front, we derive a concentration inequality for the sample MDDM around its population counterpart using theories for U-statistics, and obtain rigorous non-asymptotic theoretical justification for the estimated central subspaces in both settings. Simulations and real data analysis confirm that the proposed penalized MDDM approach outperforms slicing-based methods in estimation accuracy.

The rest of this paper is organized as follows. In Section 2, we give a brief review of the martingale difference divergence matrix (MDDM) and then present a new concentration inequality for the sample MDDM around its population counterpart. In Section 3, we present our general methodology of adopting MDDM in both model-free and model-based SDR problems, where we establish population-level connections between the central subspace and the (generalized) eigen-decomposition of MDDM. Algorithms for the regularized eigen-decomposition and generalized eigen-decomposition problems are proposed in Sections 4.1 and 4.2, respectively. Theoretical properties are established in Section 5. Section 6 contains numerical studies. Finally, Section 7 concludes the paper with a short discussion. The Supplementary Materials collect all additional technical details and numerical results.

2 MDDM and its concentration inequality

Consider a pair of random vectors $\mathbf{V}\in\mathbb{R}^{p}$, $\mathbf{U}\in\mathbb{R}^{q}$ such that $\mathrm{E}(\|\mathbf{U}\|^{2}+\|\mathbf{V}\|^{2})<\infty$. We use $\|\mathbf{U}\|=|\mathbf{U}|_{q}$ to denote the Euclidean norm in $\mathbb{R}^{q}$. Define

\[
\mathrm{MDDM}(\mathbf{V}\mid\mathbf{U})=-\mathrm{E}\left[\{\mathbf{V}-\mathrm{E}(\mathbf{V})\}\{\mathbf{V}^{\prime}-\mathrm{E}(\mathbf{V}^{\prime})\}^{\mathrm{T}}\|\mathbf{U}-\mathbf{U}^{\prime}\|\right]\in\mathbb{R}^{p\times p},
\]

where $(\mathbf{V}^{\prime},\mathbf{U}^{\prime})$ is an independent copy of $(\mathbf{V},\mathbf{U})$. Lee and Shao (2018) established the following key properties of $\mathrm{MDDM}(\mathbf{V}\mid\mathbf{U})$: (i) it is symmetric and positive semi-definite; (ii) $\mathrm{E}(\mathbf{V}\mid\mathbf{U})=\mathrm{E}(\mathbf{V})$ almost surely is equivalent to $\mathrm{MDDM}(\mathbf{V}\mid\mathbf{U})=0$; (iii) for any $p\times d$ matrix $\mathbf{A}$, $\mathrm{MDDM}(\mathbf{A}^{\mathrm{T}}\mathbf{V}\mid\mathbf{U})=\mathbf{A}^{\mathrm{T}}\mathrm{MDDM}(\mathbf{V}\mid\mathbf{U})\mathbf{A}$; (iv) there exist $p-d$ linearly independent combinations of $\mathbf{V}$ that are (conditionally) mean independent of $\mathbf{U}$ if and only if $\mathrm{rank}(\mathrm{MDDM}(\mathbf{V}\mid\mathbf{U}))=d$.

Given a random sample of size $n$, i.e., $(\mathbf{U}_{k},\mathbf{V}_{k})_{k=1}^{n}$, the sample estimate of $\mathrm{MDDM}(\mathbf{V}\mid\mathbf{U})$, denoted by $\mathrm{MDDM}_{n}(\mathbf{V}\mid\mathbf{U})$, is defined as

\[
\mathrm{MDDM}_{n}(\mathbf{V}\mid\mathbf{U})=-\frac{1}{n^{2}}\sum_{j,k=1}^{n}(\mathbf{V}_{j}-\overline{\mathbf{V}}_{n})(\mathbf{V}_{k}-\overline{\mathbf{V}}_{n})^{\mathrm{T}}|\mathbf{U}_{j}-\mathbf{U}_{k}|_{q}, \tag{2.1}
\]

where $\overline{\mathbf{V}}_{n}=n^{-1}\sum_{k=1}^{n}\mathbf{V}_{k}$ is the sample mean.
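To make the estimator in (2.1) concrete, the following is a minimal NumPy sketch of the sample MDDM; the function name and interface are our own illustration, not code from the paper or any released package.

```python
import numpy as np

def sample_mddm(V, U):
    """Sample MDDM_n(V | U) as in (2.1).

    V : (n, p) array of observations of V.
    U : (n, q) array of observations of U.
    Returns a (p, p) symmetric, positive semi-definite matrix.
    """
    V = np.asarray(V, dtype=float)
    U = np.asarray(U, dtype=float)
    n = V.shape[0]
    Vc = V - V.mean(axis=0)                                             # center V by the sample mean
    D = np.sqrt(((U[:, None, :] - U[None, :, :]) ** 2).sum(axis=2))     # pairwise distances |U_j - U_k|_q
    # -(1/n^2) * sum_{j,k} (V_j - Vbar)(V_k - Vbar)^T |U_j - U_k|_q
    return -(Vc.T @ D @ Vc) / n**2

# Toy check: V depends on U in the mean, so the sample MDDM is far from zero.
rng = np.random.default_rng(0)
U = rng.normal(size=(200, 2))
V = np.column_stack([U[:, 0] ** 2, rng.normal(size=200)])
M = sample_mddm(V, U)
print(M.shape, np.allclose(M, M.T))
```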

In the following, we present a concentration inequality for the sample MDDM around its population counterpart, which plays an instrumental role in our consistency proof for the proposed penalized MDDM method later. To this end, we let $\mathbf{V}=(V_{1},\cdots,V_{p})^{\mathrm{T}}\in\mathbb{R}^{p}$ and assume the following condition.

  1. (C1)

     There exist two positive constants $\sigma_{0}$ and $C_{0}$ such that
     \[
     \sup_{p}\max_{1\leq j\leq p}\mathrm{E}\{\exp(2\sigma_{0}V_{j}^{2})\}\leq C_{0},\qquad \mathrm{E}\{\exp(2\sigma_{0}\|\mathbf{U}\|_{q}^{2})\}\leq C_{0}. \tag{2.2}
     \]

For a matrix $A=(a_{ij})$, we denote its max norm as $\|A\|_{\max}=\max_{ij}|a_{ij}|$.

Theorem 1.

Suppose that Condition (C1) holds. There exist a positive integer $n_{0}=n_{0}(\sigma_{0},C_{0},q)<\infty$, a constant $\gamma=\gamma(\sigma_{0},C_{0},q)\in(0,1/2)$ and a finite positive constant $D_{0}=D_{0}(\sigma_{0},C_{0},q)<\infty$ such that when $n\geq n_{0}$ and $16>\epsilon>D_{0}n^{-\gamma}$, we have

\[
P\left(\|\mathrm{MDDM}_{n}(\mathbf{V}\mid\mathbf{U})-\mathrm{MDDM}(\mathbf{V}\mid\mathbf{U})\|_{\max}>12\epsilon\right)\leq 54p^{2}\exp\left\{-\frac{\epsilon^{2}n}{36\log^{3}(n)}\right\}.
\]

The above bound is non-asymptotic and holds for all $(n,p,\epsilon)$ as long as the stated condition is satisfied. The exponent $\epsilon^{2}n/\log^{3}(n)$ is due to the use of a truncation argument along with Hoeffding's inequality for U-statistics, and seems hard to improve. Nevertheless, we are able to achieve an exponential-type bound under a uniform sub-Gaussian condition on both $\mathbf{V}$ and $\mathbf{U}$. This result may be of independent theoretical interest. For example, in the time series dimension reduction problem studied by Lee and Shao (2018), our Theorem 1 could potentially help extend the theory there from low-dimensional multivariate time series to higher dimensions.

3 Slicing-free Inverse Regression via MDDM

3.1 Inverse regression subspace in sufficient dimension reduction

Sufficient dimension reduction (SDR) methods aim to identify the central subspace, which preserves all the information in the predictors. In this paper, we consider the SDR problem of a multivariate response $\mathbf{Y}\in\mathbb{R}^{q}$ on a multivariate predictor $\mathbf{X}\in\mathbb{R}^{p}$. The central subspace $\mathcal{S}_{\mathbf{Y}\mid\mathbf{X}}$ is defined as the intersection of all subspaces $\mathcal{S}$ such that $\mathbf{Y}\perp\!\!\!\perp\mathbf{X}\mid\mathbf{P}_{\mathcal{S}}\mathbf{X}$, where $\mathbf{P}_{\mathcal{S}}$ is the projection matrix onto $\mathcal{S}$. By construction, the central subspace $\mathcal{S}_{\mathbf{Y}\mid\mathbf{X}}$ is the smallest dimension reduction subspace that contains all the information in the conditional distribution of $\mathbf{Y}$ given $\mathbf{X}$. Many methods have been proposed for the recovery of the central subspace or a portion of it (Li, 1991; Cook and Weisberg, 1991; Bura and Cook, 2001; Chiaromonte et al., 2002; Yin and Cook, 2003; Cook and Ni, 2005; Li and Wang, 2007; Zhou and He, 2008). See Li (2018) for a comprehensive review. Although the central subspace is well defined for both univariate and multivariate responses, most existing SDR methods consider the case of a univariate response, and the extension to a multivariate response is non-trivial.

The definition of the central subspace is not very constructive, as it requires taking the intersection of all subspaces $\mathcal{S}\subseteq\mathbb{R}^{p}$ such that $\mathbf{Y}\perp\!\!\!\perp\mathbf{X}\mid\mathbf{P}_{\mathcal{S}}\mathbf{X}$. It is indeed a very ambitious goal to estimate the central subspace without specifying a model between $\mathbf{Y}$ and $\mathbf{X}$. To achieve this, we often need additional assumptions such as the linearity and coverage conditions. The linearity condition requires that, for any basis $\boldsymbol{\beta}$ of the central subspace, $\mathrm{E}(\mathbf{X}\mid\boldsymbol{\beta}^{\mathrm{T}}\mathbf{X})$ is linear in $\boldsymbol{\beta}^{\mathrm{T}}\mathbf{X}$. The linearity condition is guaranteed if $\mathbf{X}$ is elliptically contoured, and it allows us to connect the central subspace to the conditional expectation $\mathrm{E}(\mathbf{X}\mid\mathbf{Y})$. Define $\boldsymbol{\Sigma}_{\mathbf{X}}$ as the covariance of $\mathbf{X}$ and the inverse regression subspace

\[
\mathcal{S}_{\mathrm{E}(\mathbf{X}\mid\mathbf{Y})}\equiv\mathrm{span}\{\mathrm{E}(\mathbf{X}\mid\mathbf{Y}=\mathbf{y})-\mathrm{E}(\mathbf{X}):\mathbf{y}\in\mathbb{R}^{q}\ \mathrm{such\ that}\ \mathrm{E}(\mathbf{X}\mid\mathbf{Y}=\mathbf{y})\ \mathrm{exists}\}. \tag{3.3}
\]

The following property is well known and often adopted in developing SDR methods.

Proposition 1.

Under the linearity condition, we have $\mathcal{S}_{\mathrm{E}(\mathbf{X}\mid\mathbf{Y})}\subseteq\boldsymbol{\Sigma}_{\mathbf{X}}\mathcal{S}_{\mathbf{Y}\mid\mathbf{X}}\subseteq\mathbb{R}^{p}$.

The coverage condition further assumes that $\mathcal{S}_{\mathrm{E}(\mathbf{X}\mid\mathbf{Y})}=\boldsymbol{\Sigma}_{\mathbf{X}}\mathcal{S}_{\mathbf{Y}\mid\mathbf{X}}$. It follows that we can estimate the central subspace by modeling the conditional expectation of $\mathbf{X}$. Indeed, many SDR methods approximate $\mathrm{E}(\mathbf{X}\mid\mathbf{Y})$. For example, the most classical SDR method, sliced inverse regression (SIR), slices a univariate $Y$ into several categories and estimates the mean of $\mathbf{X}$ within each slice. Most later methods also follow this slice-and-estimate procedure. Apparently, the slicing scheme is important for the estimation. If there are too few slices, we may not be able to fully capture the dependence of $\mathbf{X}$ on $Y$; if there are too many slices, there are not enough samples within each slice to allow accurate estimation.

3.2 MDDM in SDR

In this section, we lay the foundation for the application of MDDM in SDR. We show that the subspace spanned by MDDM coincides with the inverse regression subspace in (3.3). In particular, we have the following Proposition 2, which was used in Zhang et al. (2020), without a proof, in the context of multivariate linear regression.

Proposition 2.

For multivariate $\mathbf{X}\in\mathbb{R}^{p}$ and $\mathbf{Y}\in\mathbb{R}^{q}$, assuming the existence of $\mathrm{E}(\mathbf{X})$, $\mathrm{E}(\mathbf{X}\mid\mathbf{Y})$ and $\mathrm{MDDM}(\mathbf{X}\mid\mathbf{Y})$, we have $\mathcal{S}_{\mathrm{E}(\mathbf{X}\mid\mathbf{Y})}=\mathrm{span}\{\mathrm{MDDM}(\mathbf{X}\mid\mathbf{Y})\}$.

Therefore, the rank of $\mathrm{MDDM}(\mathbf{X}\mid\mathbf{Y})$ is the dimension of the inverse regression subspace, and the non-trivial eigenvectors of $\mathrm{MDDM}(\mathbf{X}\mid\mathbf{Y})$ contain all the information for $\mathcal{S}_{\mathrm{E}(\mathbf{X}\mid\mathbf{Y})}$. Combining Propositions 1 and 2, we immediately have that (i) under the linearity condition, $\boldsymbol{\Sigma}_{\mathbf{X}}^{-1}\mathrm{span}\{\mathrm{MDDM}(\mathbf{X}\mid\mathbf{Y})\}\subseteq\mathcal{S}_{\mathbf{Y}\mid\mathbf{X}}$; and (ii) under the linearity and coverage conditions, $\boldsymbol{\Sigma}_{\mathbf{X}}^{-1}\mathrm{span}\{\mathrm{MDDM}(\mathbf{X}\mid\mathbf{Y})\}=\mathcal{S}_{\mathbf{Y}\mid\mathbf{X}}$.

Henceforth, we assume both the linearity and coverage conditions, which are assumed either explicitly or implicitly in inverse regression type dimension reduction methods (e.g., Li, 1991; Cook and Ni, 2005; Zhu et al., 2010; Cook and Zhang, 2014). Then the central subspace is related to the eigen-decomposition of $\mathrm{MDDM}(\mathbf{X}\mid\mathbf{Y})$. Specifically, we have the following scenarios.

If $\mathrm{cov}(\mathbf{X})=\sigma^{2}\mathbf{I}_{p}$ for some $\sigma^{2}>0$, then obviously $\mathrm{span}\{\mathrm{MDDM}(\mathbf{X}\mid\mathbf{Y})\}=\mathcal{S}_{\mathbf{Y}\mid\mathbf{X}}$. This includes single-index and multiple-index models with uncorrelated predictors. Let $K$ be the rank of $\mathrm{MDDM}(\mathbf{X}\mid\mathbf{Y})$; then the dimension of the central subspace is $K$, and the first $K$ eigenvectors of $\mathrm{MDDM}(\mathbf{X}\mid\mathbf{Y})$ span the central subspace.

If $\mathrm{cov}(\mathbf{X}\mid\mathbf{Y})=\sigma^{2}\mathbf{I}_{p}$ for some $\sigma^{2}>0$, then we have $\boldsymbol{\Sigma}_{\mathbf{X}}=\sigma^{2}\mathbf{I}_{p}+\mathrm{cov}\{\mathrm{E}(\mathbf{X}\mid\mathbf{Y})\}$. Because $\mathrm{span}[\mathrm{cov}\{\mathrm{E}(\mathbf{X}\mid\mathbf{Y})\}]=\mathcal{S}_{\mathrm{E}(\mathbf{X}\mid\mathbf{Y})}$, we can show that $\mathcal{S}_{\mathbf{Y}\mid\mathbf{X}}=\boldsymbol{\Sigma}_{\mathbf{X}}^{-1}\mathrm{span}\{\mathrm{MDDM}(\mathbf{X}\mid\mathbf{Y})\}=\mathrm{span}\{\mathrm{MDDM}(\mathbf{X}\mid\mathbf{Y})\}$. To see this, let $\mathrm{cov}\{\mathrm{E}(\mathbf{X}\mid\mathbf{Y})\}=\mathbf{U}\mathbf{U}^{\mathrm{T}}$ for some $\mathbf{U}\in\mathbb{R}^{p\times K}$; then $\mathrm{span}(\mathbf{U})=\mathrm{span}[\mathrm{cov}\{\mathrm{E}(\mathbf{X}\mid\mathbf{Y})\}]=\mathrm{span}\{\mathrm{MDDM}(\mathbf{X}\mid\mathbf{Y})\}$, and we may also write $\mathrm{MDDM}(\mathbf{X}\mid\mathbf{Y})=\mathbf{U}\boldsymbol{\Psi}\mathbf{U}^{\mathrm{T}}$ for some symmetric positive definite matrix $\boldsymbol{\Psi}\in\mathbb{R}^{K\times K}$. The result then follows by applying the Woodbury matrix identity to $\boldsymbol{\Sigma}_{\mathbf{X}}^{-1}=(\sigma^{2}\mathbf{I}_{p}+\mathbf{U}\mathbf{U}^{\mathrm{T}})^{-1}=\sigma^{-2}\mathbf{I}_{p}-\sigma^{-2}\mathbf{U}(\sigma^{2}\mathbf{I}_{K}+\mathbf{U}^{\mathrm{T}}\mathbf{U})^{-1}\mathbf{U}^{\mathrm{T}}$. The non-trivial eigenvectors of $\mathrm{MDDM}(\mathbf{X}\mid\mathbf{Y})$ again span the central subspace.

For a general covariance structure, the $K$-dimensional central subspace $\mathcal{S}_{\mathbf{Y}\mid\mathbf{X}}=\boldsymbol{\Sigma}_{\mathbf{X}}^{-1}\mathrm{span}\{\mathrm{MDDM}(\mathbf{X}\mid\mathbf{Y})\}$ can be obtained via a generalized eigen-decomposition. Specifically, consider the generalized eigenvalue problem

\[
\mathrm{MDDM}(\mathbf{X}\mid\mathbf{Y})\mathbf{v}_{i}=\varphi_{i}\boldsymbol{\Sigma}_{\mathbf{X}}\mathbf{v}_{i},\quad \varphi_{i}\geq 0,\ \mathbf{v}_{i}\in\mathbb{R}^{p}, \tag{3.4}
\]

where $\mathbf{v}_{i}^{\mathrm{T}}\boldsymbol{\Sigma}_{\mathbf{X}}\mathbf{v}_{j}=0$ for $i\neq j$. Then, similar to (Li, 2007; Chen et al., 2010), it is straightforward to show that the leading generalized eigenvectors span the central subspace, $\mathcal{S}_{\mathbf{Y}\mid\mathbf{X}}=\mathrm{span}(\mathbf{v}_{1},\dots,\mathbf{v}_{K})$.
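In low dimensions, (3.4) can be solved directly with a standard generalized symmetric eigensolver. The sketch below is our own illustration (the function name and the toy single-index model are ours): it pairs the sample MDDM with the sample covariance and calls scipy.linalg.eigh.

```python
import numpy as np
from scipy.linalg import eigh

def central_subspace_lowdim(X, Y, K):
    """Estimate span(v_1, ..., v_K) from (3.4) when p is small relative to n."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    Sigma = Xc.T @ Xc / n                                               # sample covariance of X
    D = np.sqrt(((Y[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2))     # pairwise |Y_j - Y_k|
    M = -(Xc.T @ D @ Xc) / n**2                                         # sample MDDM(X | Y)
    # Generalized eigenproblem M v = phi * Sigma v; eigh returns eigenvalues in ascending order.
    phi, V = eigh(M, Sigma)
    return V[:, ::-1][:, :K]                                            # top-K generalized eigenvectors

# Toy usage with a single-index model: the estimate is roughly proportional to beta (up to sign/scale).
rng = np.random.default_rng(1)
n, p = 500, 10
X = rng.normal(size=(n, p))
beta = np.zeros(p); beta[:3] = 1 / np.sqrt(3)
Y = (np.sin(X @ beta) + 0.1 * rng.normal(size=n))[:, None]
print(central_subspace_lowdim(X, Y, K=1).ravel().round(2))
```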

Existing works in SDR often focus on the eigen-decomposition or the generalized eigen-decomposition of $\mathrm{cov}\{\mathrm{E}(\mathbf{X}\mid Y=y)\}$, where non-parametric estimates of $\mathrm{E}(\mathbf{X}\mid Y=y)$ are obtained by slicing the support of the univariate response $Y$. Compared to these approaches, the MDDM approach requires no tuning parameter selection (i.e., no slicing scheme needs to be specified). Moreover, the high-dimensional theoretical study of MDDM is easier and does not require additional assumptions on the conditional mean function $\mathrm{E}(\mathbf{X}\mid\mathbf{Y})$, such as smoothness of the empirical mean function of $\mathbf{X}$ given $Y$ (e.g., the sliced stable condition in Lin et al. (2018)).

3.3 MDDM for model-based SDR

So far, we have discussed model-free SDR. Another important research area in SDR is model-based methods, which provide invaluable intuition for the use of inverse regression estimation under the assumption that the conditional distribution of $\mathbf{X}\mid\mathbf{Y}$ is normal. In this section, we consider the principal fitted component (PFC) model, which was discussed in detail in Cook and Forzani (2009) and Cook (2007), and generalize it from a univariate response to a multivariate response. We argue that the (generalized) eigen-decomposition of MDDM is potentially advantageous over likelihood-based approaches under the PFC model. This is somewhat surprising but reasonable, considering that the advantages of MDDM over least squares and likelihood-based estimation were demonstrated in Zhang et al. (2020) under the multivariate linear model.

Let $\mathbf{X}_{\mathbf{y}}\sim\mathbf{X}\mid(\mathbf{Y}=\mathbf{y})$ denote the conditional variable; then the PFC model is

\[
\mathbf{X}_{\mathbf{y}}=\boldsymbol{\mu}+\boldsymbol{\Gamma}\boldsymbol{\nu}_{\mathbf{y}}+\boldsymbol{\varepsilon},\quad\boldsymbol{\varepsilon}\sim N(0,\boldsymbol{\Delta}), \tag{3.5}
\]

where $\boldsymbol{\Gamma}\in\mathbb{R}^{p\times K}$, $K<p$, is a non-stochastic orthogonal matrix, and $\boldsymbol{\nu}_{\mathbf{y}}\in\mathbb{R}^{K}$ is a latent variable that depends on $\mathbf{y}$. The latent variable $\boldsymbol{\nu}_{\mathbf{y}}$ is then fitted as $\boldsymbol{\nu}_{\mathbf{y}}=\boldsymbol{\alpha}\mathbf{f}_{\mathbf{y}}$ with some user-specified functions $\mathbf{f}_{\mathbf{y}}=(f_{1}(\mathbf{y}),\ldots,f_{m}(\mathbf{y}))^{\mathrm{T}}\in\mathbb{R}^{m}$, $m\geq K$, that map the $q$-dimensional response to an $m$-dimensional vector. In the univariate PFC model, $q=1$, so the $m$ functions can be viewed as an expansion of the response (similar to slicing). For our multivariate extension of the PFC model, there is no requirement that $m\geq q$. The PFC model can be written as

\[
\mathbf{X}_{\mathbf{y}}=\boldsymbol{\mu}+\boldsymbol{\Gamma}\boldsymbol{\alpha}\mathbf{f}_{\mathbf{y}}+\boldsymbol{\varepsilon}, \tag{3.6}
\]

where $\boldsymbol{\Gamma}$ and $\boldsymbol{\alpha}$ are estimated similarly to multivariate reduced-rank regression with $\mathbf{X}\in\mathbb{R}^{p}$ being the response and $\mathbf{f}_{\mathbf{y}}\in\mathbb{R}^{m}$ being the predictor. Finally, the central subspace under this PFC model is $\boldsymbol{\Delta}^{-1}\mathrm{span}(\boldsymbol{\Gamma})$, which simplifies to $\mathrm{span}(\boldsymbol{\Gamma})$ if we further assume isotropic error (i.e., the isotropic PFC model) $\boldsymbol{\Delta}=\mathrm{cov}(\mathbf{X}\mid\mathbf{Y})=\sigma^{2}\mathbf{I}_{p}$.

For the PFC model, our MDDM approach is the same as its model-free counterpart, and has two main advantages over likelihood-based PFC estimation: (i) there is no need to specify the functions $\mathbf{f}_{\mathbf{y}}$, and thus no risk of mis-specification; (ii) the extension to the high-dimensional setting is much more straightforward. Moreover, under the isotropic PFC model, the central subspace $\mathcal{S}_{\mathbf{Y}\mid\mathbf{X}}=\mathrm{span}(\boldsymbol{\Gamma})$ is spanned exactly by the first $K$ eigenvectors of $\mathrm{MDDM}(\mathbf{X}\mid\mathbf{Y})$.
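As a quick illustration of the isotropic PFC model (3.5)-(3.6), the snippet below generates data and checks that the top eigenvector of the sample MDDM aligns with $\mathrm{span}(\boldsymbol{\Gamma})$. The specific choices of $p$, $K$, $\mathbf{f}_{\mathbf{y}}$ and $\boldsymbol{\alpha}$ are arbitrary and ours alone; this is a sketch, not the paper's simulation code.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, K = 400, 30, 1
Gamma = np.zeros((p, K)); Gamma[:5, 0] = 1 / np.sqrt(5)       # sparse orthonormal loading
Y = rng.normal(size=(n, 1))
f_y = np.column_stack([Y[:, 0], Y[:, 0] ** 2])                # user-specified f_y = (y, y^2)
alpha = np.array([[1.0, 0.5]])                                # K x m coefficient matrix
X = f_y @ alpha.T @ Gamma.T + 0.5 * rng.normal(size=(n, p))   # isotropic PFC: X_y = Gamma alpha f_y + eps

# Top eigenvector of the sample MDDM(X | Y) should be close to the column of Gamma.
Xc = X - X.mean(axis=0)
D = np.abs(Y - Y.T)                                           # |Y_j - Y_k| for a univariate response
M = -(Xc.T @ D @ Xc) / n**2
w, V = np.linalg.eigh(M)
print(abs(V[:, -1] @ Gamma[:, 0]))                            # close to 1 when the signal dominates
```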

4 Estimation

4.1 Penalized decomposition of MDDM

Based on the results in the last section, a penalized eigen-decomposition of MDDM can be used to estimate the central subspace in high dimensions when the covariance $\boldsymbol{\Sigma}_{\mathbf{X}}$ or the conditional covariance $\mathrm{cov}(\mathbf{X}\mid\mathbf{Y})$ is proportional to the identity matrix $\mathbf{I}_{p}$. We consider the construction of such an estimate. It is worth mentioning that the penalized decomposition of MDDM we develop here is immediately applicable to the dimension reduction of multivariate stationary time series in Lee and Shao (2018), which is beyond the scope of this article. Moreover, it is well known that $\boldsymbol{\Sigma}_{\mathbf{X}}^{-1}$ is not easy to estimate in high dimensions. Hence, even for a general covariance structure, the eigen-decomposition of MDDM provides an estimate of the inverse regression subspace (which may differ from the central subspace) that is useful for exploratory data analysis (e.g., detecting and visualizing a non-linear mean function).

As such, we consider the estimation of the eigenvectors of $\mathrm{MDDM}(\mathbf{X}\mid\mathbf{Y})$. We assume that $\mathrm{MDDM}(\mathbf{X}\mid\mathbf{Y})$ has $K$ nontrivial eigenvectors, denoted by $\boldsymbol{\beta}_{1},\ldots,\boldsymbol{\beta}_{K}$. We use the shorthand notation $\mathbf{M}=\mathrm{MDDM}(\mathbf{X}\mid\mathbf{Y})$. Also, we note that, given the first $k-1$ eigenvectors, $\boldsymbol{\beta}_{k}$ is the top eigenvector of $\mathbf{M}_{k}$, where $\mathbf{M}_{k}=\mathbf{M}-\sum_{l<k}(\boldsymbol{\beta}_{l}^{\mathrm{T}}\mathbf{M}\boldsymbol{\beta}_{l})\boldsymbol{\beta}_{l}\boldsymbol{\beta}_{l}^{\mathrm{T}}$.

It is well known that the eigenvectors cannot be accurately estimated in high dimensions without additional assumptions. We adopt the popular sparsity assumption that many entries in $\boldsymbol{\beta}_{k}$ are zero. To estimate these sparse eigenvectors, denote $\widehat{\mathbf{M}}_{1}=\mathrm{MDDM}_{n}(\mathbf{X}\mid\mathbf{Y})$, where the sample $\mathrm{MDDM}_{n}$ is defined in (2.1). We find $\widehat{\boldsymbol{\beta}}_{k}$, $k=1,\ldots,K$, as follows:

\[
\widehat{\boldsymbol{\beta}}_{k}=\arg\max_{\boldsymbol{\beta}}\ \boldsymbol{\beta}^{\mathrm{T}}\widehat{\mathbf{M}}_{k}\boldsymbol{\beta}\quad\mbox{s.t.}\quad\boldsymbol{\beta}^{\mathrm{T}}\boldsymbol{\beta}=1,\ \|\boldsymbol{\beta}\|_{0}\leq s, \tag{4.7}
\]

where $\widehat{\mathbf{M}}_{1}=\mathrm{MDDM}_{n}(\mathbf{X}\mid\mathbf{Y})$, $\widehat{\mathbf{M}}_{k}=\widehat{\mathbf{M}}_{1}-\sum_{l<k}\delta_{l}\widehat{\boldsymbol{\beta}}_{l}\widehat{\boldsymbol{\beta}}_{l}^{\mathrm{T}}$ for $k>1$ with $\delta_{l}=\widehat{\boldsymbol{\beta}}_{l}^{\mathrm{T}}\widehat{\mathbf{M}}_{1}\widehat{\boldsymbol{\beta}}_{l}$, and $s$ is a tuning parameter.

We solve the above problem by combining the truncated power method with hard thresholding. For a vector $\mathbf{v}\in\mathbb{R}^{p}$ and a positive integer $s$, denote by $v_{s}^{*}$ the $s$-th largest value of $|v_{j}|$, $j=1,\ldots,p$. The hard-thresholding operator is $\mathrm{HT}(\mathbf{v},s)=(v_{1}I(|v_{1}|\geq v_{s}^{*}),\ldots,v_{p}I(|v_{p}|\geq v_{s}^{*}))^{\mathrm{T}}$, which sets the $p-s$ smallest elements (in magnitude) of $\mathbf{v}$ to zero. We solve (4.7) by Algorithm 1, where the initialization $\widehat{\boldsymbol{\beta}}_{1}^{(0)}$ may be randomly generated. Note that Yuan and Zhang (2013) proposed Algorithm 1 to perform principal component analysis through penalized eigen-decomposition of the sample covariance. A code sketch of the operator $\mathrm{HT}(\cdot,s)$ is given below.
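The hard-thresholding operator takes only a few lines of NumPy; the helper below is our own sketch (ties at the $s$-th largest magnitude keep all tied entries, consistent with the inequality $|v_{j}|\geq v_{s}^{*}$).

```python
import numpy as np

def hard_threshold(v, s):
    """HT(v, s): keep entries with magnitude >= the s-th largest |v_j|, zero out the rest."""
    v = np.asarray(v, dtype=float)
    vs = np.sort(np.abs(v))[::-1][s - 1]        # v_s^*, the s-th largest absolute value
    return np.where(np.abs(v) >= vs, v, 0.0)

print(hard_threshold([0.1, -2.0, 0.5, 1.5], s=2))   # [ 0.  -2.   0.   1.5]
```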

In our algorithm, we require a pre-specified sparsity level $s$ and subspace dimension $K$. In theory, we show that our estimators of $\boldsymbol{\beta}_{k}$, $k=1,\dots,K$, are all consistent for their population counterparts when the sparsity level $s$ is sufficiently large (i.e., larger than the population sparsity level) and the number of directions $K$ is no bigger than the true dimension of the central subspace. Therefore, our method is flexible in the sense that the pre-specified $s$ and $K$ do not have to be exactly correct. In practice, especially in exploratory data analysis, the number of sequentially extracted directions is often set to be small (i.e., $K=1,2$ or $3$), while the determination of the true central subspace dimension is a separate and important research topic in SDR (e.g., Bura and Yang, 2011; Luo and Li, 2016) and is beyond the scope of this paper. Moreover, the pre-specified sparsity level $s$ combined with $\ell_{0}$ regularization is potentially convenient for post-dimension-reduction inference (Kim et al., 2020), as seen in the post-selection inference for canonical correlation analysis that is done over subsets of variables of pre-specified cardinality (McKeague and Zhang, 2020).

As pointed out by a referee, other sparse principal component analysis (PCA) methods can potentially be applied to decompose MDDM. We choose to extend the algorithm in Yuan and Zhang (2013) to facilitate computation and theoretical development. For computationally efficient sparse PCA methods such as Zou et al. (2006) and Witten et al. (2009), the theoretical properties are unfortunately unknown; hence, we expect the theoretical study of their MDDM variants to be very challenging. On the other hand, for theoretically justified sparse PCA methods such as Vu and Lei (2013) and Cai et al. (2013), the computation is less efficient.

  1. Input: $s$, $K$, $\widehat{\mathbf{M}}_{1}=\widehat{\mathbf{M}}=\mathrm{MDDM}_{n}(\mathbf{X}\mid\mathbf{Y})$.

  2. Initialize $\widehat{\boldsymbol{\beta}}_{1}^{(0)}$.

  3. For $k=1,\ldots,K$, do

     (a) Iterate over $t$ until convergence:

         i. Set $\widehat{\boldsymbol{\beta}}_{k}^{(t)}=\widehat{\mathbf{M}}_{k}\widehat{\boldsymbol{\beta}}_{k}^{(t-1)}$.

         ii. If $\|\widehat{\boldsymbol{\beta}}_{k}^{(t)}\|_{0}\leq s$, set $\widehat{\boldsymbol{\beta}}_{k}^{(t)}=\widehat{\boldsymbol{\beta}}_{k}^{(t)}/\|\widehat{\boldsymbol{\beta}}_{k}^{(t)}\|_{2}$; otherwise set $\widehat{\boldsymbol{\beta}}_{k}^{(t)}=\mathrm{HT}(\widehat{\boldsymbol{\beta}}_{k}^{(t)},s)/\|\mathrm{HT}(\widehat{\boldsymbol{\beta}}_{k}^{(t)},s)\|_{2}$.

     (b) Set $\widehat{\boldsymbol{\beta}}_{k}=\widehat{\boldsymbol{\beta}}_{k}^{(t)}$ at convergence and $\widehat{\mathbf{M}}_{k+1}=\widehat{\mathbf{M}}_{k}-(\widehat{\boldsymbol{\beta}}_{k}^{\mathrm{T}}\widehat{\mathbf{M}}\widehat{\boldsymbol{\beta}}_{k})\,\widehat{\boldsymbol{\beta}}_{k}\widehat{\boldsymbol{\beta}}_{k}^{\mathrm{T}}$.

  4. Output $\widehat{\mathcal{S}}_{\mathbf{Y}\mid\mathbf{X}}=\mathrm{span}(\widehat{\boldsymbol{\beta}}_{1},\ldots,\widehat{\boldsymbol{\beta}}_{K})$.

Algorithm 1 Penalized eigen-decomposition of MDDM.
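For concreteness, here is a minimal NumPy sketch of Algorithm 1 applied to a sample MDDM matrix (e.g., computed as in Section 2). The random initialization, the change-tolerance stopping rule and the function names are our own choices, so this is an illustration rather than a released implementation.

```python
import numpy as np

def hard_threshold(v, s):
    """Keep the s largest-magnitude entries of v and zero out the rest."""
    vs = np.sort(np.abs(v))[::-1][s - 1]
    return np.where(np.abs(v) >= vs, v, 0.0)

def penalized_eigen_mddm(M_hat, K, s, max_iter=500, tol=1e-8, seed=0):
    """Algorithm 1: truncated power iterations with hard thresholding plus deflation."""
    rng = np.random.default_rng(seed)
    p = M_hat.shape[0]
    M_k = M_hat.copy()
    betas = []
    for _ in range(K):
        beta = rng.normal(size=p)
        beta /= np.linalg.norm(beta)                      # random initialization
        for _ in range(max_iter):
            b = M_k @ beta                                # power step: beta^(t) = M_k beta^(t-1)
            if np.count_nonzero(b) > s:
                b = hard_threshold(b, s)                  # enforce ||beta||_0 <= s
            b /= np.linalg.norm(b)
            if np.linalg.norm(b - beta) < tol:
                beta = b
                break
            beta = b
        betas.append(beta)
        delta = beta @ M_hat @ beta                       # deflation: M_{k+1} = M_k - delta * beta beta^T
        M_k = M_k - delta * np.outer(beta, beta)
    return np.column_stack(betas)                         # columns span the estimated central subspace
```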

4.2 Generalized eigenvalue problems with MDDM

Now we consider a general (arbitrary) covariance structure $\boldsymbol{\Sigma}_{\mathbf{X}}$. We continue to use $\boldsymbol{\beta}_{1},\ldots,\boldsymbol{\beta}_{K}$ to denote the nontrivial generalized eigenvectors spanning $\boldsymbol{\Sigma}_{\mathbf{X}}^{-1}\mathrm{span}\{\mathrm{MDDM}(\mathbf{X}\mid\mathbf{Y})\}$, so that the central subspace is spanned by the $\boldsymbol{\beta}$'s. Again, we assume that these eigenvectors are sparse. In principle, we could assume that $\boldsymbol{\Sigma}_{\mathbf{X}}^{-1}$ is also sparse and construct its estimate accordingly. However, $\boldsymbol{\Sigma}_{\mathbf{X}}^{-1}$ is a nuisance parameter for our ultimate goal, and additional assumptions on it may unnecessarily limit the applicability of our method. Hence, we take a different approach as follows.

To avoid estimating $\boldsymbol{\Sigma}_{\mathbf{X}}^{-1}$, we note that $\boldsymbol{\beta}_{1},\ldots,\boldsymbol{\beta}_{K}$ can also be viewed as the generalized eigenvectors defined as follows, which is equivalent to (3.4),

\[
\boldsymbol{\beta}_{k}=\arg\max_{\boldsymbol{\beta}}\ \boldsymbol{\beta}^{\mathrm{T}}\mathbf{M}\boldsymbol{\beta},\quad\mbox{s.t. }\boldsymbol{\beta}^{\mathrm{T}}\boldsymbol{\Sigma}_{\mathbf{X}}\boldsymbol{\beta}=1,\ \boldsymbol{\beta}_{l}^{\mathrm{T}}\boldsymbol{\Sigma}_{\mathbf{X}}\boldsymbol{\beta}=0\mbox{ for any }l<k. \tag{4.8}
\]

Directly solving the generalized eigen-decomposition problem in (4.8) is not easy if we want to further impose penalties, because it is difficult to satisfy the orthogonality constraints. Therefore, we consider another form of (4.8) that does not involve the orthogonality constraints. This alternative form is based on the following lemma.

Lemma 1.

Let $\lambda_{j}=\boldsymbol{\beta}_{j}^{\mathrm{T}}\mathbf{M}\boldsymbol{\beta}_{j}$ and $\mathbf{M}_{k}=\mathbf{M}-\boldsymbol{\Sigma}_{\mathbf{X}}(\sum_{j<k}\lambda_{j}\boldsymbol{\beta}_{j}\boldsymbol{\beta}_{j}^{\mathrm{T}})\boldsymbol{\Sigma}_{\mathbf{X}}$. We have

\[
\boldsymbol{\beta}_{k}=\arg\max_{\boldsymbol{\beta}}\ \boldsymbol{\beta}^{\mathrm{T}}\mathbf{M}_{k}\boldsymbol{\beta},\quad\mbox{s.t. }\boldsymbol{\beta}^{\mathrm{T}}\boldsymbol{\Sigma}_{\mathbf{X}}\boldsymbol{\beta}=1. \tag{4.9}
\]

Motivated by Lemma 1, we consider the penalized problem $\widehat{\boldsymbol{\beta}}_{k}=\arg\max_{\boldsymbol{\beta}}\boldsymbol{\beta}^{\mathrm{T}}\widehat{\mathbf{M}}_{k}\boldsymbol{\beta}$ subject to $\boldsymbol{\beta}^{\mathrm{T}}\boldsymbol{\Sigma}_{\mathbf{X}}\boldsymbol{\beta}=1$ and $\|\boldsymbol{\beta}\|_{0}\leq s$, where $\widehat{\mathbf{M}}_{1}=\mathrm{MDDM}_{n}(\mathbf{X}\mid\mathbf{Y})$ and $\widehat{\mathbf{M}}_{k}=\widehat{\mathbf{M}}_{1}-\widehat{\boldsymbol{\Sigma}}_{\mathbf{X}}\left(\sum_{l<k}\delta_{l}\widehat{\boldsymbol{\beta}}_{l}\widehat{\boldsymbol{\beta}}_{l}^{\mathrm{T}}\right)\widehat{\boldsymbol{\Sigma}}_{\mathbf{X}}$ for $k>1$ with $\delta_{l}=\widehat{\boldsymbol{\beta}}_{l}^{\mathrm{T}}\widehat{\mathbf{M}}\widehat{\boldsymbol{\beta}}_{l}$, and $s$ is a tuning parameter. We adopt the RIFLE algorithm in Tan, Wang, Liu and Zhang (2018) to solve this problem; see the details in Algorithm 2. In our simulation studies, we considered a randomly generated initial value $\widehat{\boldsymbol{\beta}}_{1}^{(0)}$ and a fixed step size $\eta=1$, and observed reasonably good performance.

Although Algorithm 2 is a generalization of the RIFLE algorithm in Tan, Wang, Liu and Zhang (2018), there are important differences between the two. On one hand, the RIFLE algorithm only extracts the first generalized eigenvector, whereas Algorithm 2 is capable of estimating multiple generalized eigenvectors by properly deflating the MDDM. In sufficient dimension reduction problems, the central subspace often has a structural dimension greater than 1, and it is necessary to find more than one generalized eigenvector. Hence, Algorithm 2 is potentially more useful than the RIFLE algorithm in practice. On the other hand, the usefulness of the RIFLE algorithm has been demonstrated in several statistical applications, including sparse sliced inverse regression. Here, Algorithm 2 decomposes the MDDM, which is the first time the penalized generalized eigenvector problem has been used to perform sufficient dimension reduction in a slicing-free manner in high dimensions. A brief analysis of the computational complexity is included in Section S3 of the Supplementary Materials.

  1. Input: $s$, $K$, $\widehat{\mathbf{M}}_{1}=\widehat{\mathbf{M}}$ and step size $\eta>0$.

  2. Initialize $\widehat{\boldsymbol{\beta}}_{1}^{(0)}$.

  3. For $k=1,\ldots,K$, do

     (a) Iterate over $t$ until convergence:

         i. Set $\rho^{(t-1)}=\dfrac{(\widehat{\boldsymbol{\beta}}_{k}^{(t-1)})^{\mathrm{T}}\widehat{\mathbf{M}}_{k}\widehat{\boldsymbol{\beta}}_{k}^{(t-1)}}{(\widehat{\boldsymbol{\beta}}_{k}^{(t-1)})^{\mathrm{T}}\widehat{\boldsymbol{\Sigma}}_{\mathbf{X}}\widehat{\boldsymbol{\beta}}_{k}^{(t-1)}}$.

         ii. $\mathbf{C}=\mathbf{I}+(\eta/\rho^{(t-1)})\cdot(\widehat{\mathbf{M}}_{k}-\rho^{(t-1)}\widehat{\boldsymbol{\Sigma}}_{\mathbf{X}})$.

         iii. $\widetilde{\boldsymbol{\beta}}_{k}^{(t)}=\mathbf{C}\widehat{\boldsymbol{\beta}}_{k}^{(t-1)}/\|\mathbf{C}\widehat{\boldsymbol{\beta}}_{k}^{(t-1)}\|_{2}$.

         iv. $\widehat{\boldsymbol{\beta}}_{k}^{(t)}=\mathrm{HT}(\widetilde{\boldsymbol{\beta}}_{k}^{(t)},s)/\|\mathrm{HT}(\widetilde{\boldsymbol{\beta}}_{k}^{(t)},s)\|_{2}$.

     (b) Set $\widetilde{\boldsymbol{\beta}}_{k}=\widehat{\boldsymbol{\beta}}_{k}^{(t)}$ at convergence and scale it to obtain $\widehat{\boldsymbol{\beta}}_{k}=\widetilde{\boldsymbol{\beta}}_{k}/\sqrt{\widetilde{\boldsymbol{\beta}}_{k}^{\mathrm{T}}\widehat{\boldsymbol{\Sigma}}_{\mathbf{X}}\widetilde{\boldsymbol{\beta}}_{k}}$.

     (c) Set $\widehat{\mathbf{M}}_{k+1}=\widehat{\mathbf{M}}_{k}-\widehat{\boldsymbol{\Sigma}}_{\mathbf{X}}(\widehat{\boldsymbol{\beta}}_{k}^{\mathrm{T}}\widehat{\mathbf{M}}\widehat{\boldsymbol{\beta}}_{k})\,\widehat{\boldsymbol{\beta}}_{k}\widehat{\boldsymbol{\beta}}_{k}^{\mathrm{T}}\widehat{\boldsymbol{\Sigma}}_{\mathbf{X}}$.

  4. Output $\widehat{\mathcal{S}}_{\mathbf{Y}\mid\mathbf{X}}=\mathrm{span}(\widehat{\boldsymbol{\beta}}_{1},\ldots,\widehat{\boldsymbol{\beta}}_{K})$.

Algorithm 2 Generalized eigen-decomposition of MDDM.
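The following NumPy sketch mirrors Algorithm 2 under simple defaults (random initialization, a fixed step size, a change-tolerance stopping rule); it assumes the generalized Rayleigh quotient stays positive along the iterations and reuses a hard-thresholding helper like the one sketched in Section 4.1. It is our own illustration, not a released implementation.

```python
import numpy as np

def hard_threshold(v, s):
    vs = np.sort(np.abs(v))[::-1][s - 1]
    return np.where(np.abs(v) >= vs, v, 0.0)

def generalized_eigen_mddm(M_hat, Sigma_hat, K, s, eta=1.0, max_iter=500, tol=1e-8, seed=0):
    """Algorithm 2: RIFLE-style updates with deflation for K sparse generalized eigenvectors."""
    rng = np.random.default_rng(seed)
    p = M_hat.shape[0]
    M_k = M_hat.copy()
    betas = []
    for _ in range(K):
        beta = rng.normal(size=p)
        beta /= np.linalg.norm(beta)
        for _ in range(max_iter):
            rho = (beta @ M_k @ beta) / (beta @ Sigma_hat @ beta)   # generalized Rayleigh quotient
            C = np.eye(p) + (eta / rho) * (M_k - rho * Sigma_hat)   # ascent step on the quotient
            b = C @ beta
            b /= np.linalg.norm(b)
            b = hard_threshold(b, s)                                # enforce ||beta||_0 <= s
            b /= np.linalg.norm(b)
            if np.linalg.norm(b - beta) < tol:
                beta = b
                break
            beta = b
        beta_hat = beta / np.sqrt(beta @ Sigma_hat @ beta)          # scale so beta^T Sigma_hat beta = 1
        betas.append(beta_hat)
        delta = beta_hat @ M_hat @ beta_hat                         # deflate with Sigma_hat on both sides
        M_k = M_k - delta * (Sigma_hat @ np.outer(beta_hat, beta_hat) @ Sigma_hat)
    return np.column_stack(betas)
```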

5 Theoretical properties

In this section, we consider theoretical properties of the generalized eigenvectors of $(\mathrm{MDDM}(\mathbf{X}\mid\mathbf{Y}),\boldsymbol{\Sigma}_{\mathbf{X}})$. Recall that, if we know that $\boldsymbol{\Sigma}_{\mathbf{X}}=\mathbf{I}$, the generalized eigenvectors reduce to eigenvectors and can be estimated by Algorithm 1. If we do not have any information about $\boldsymbol{\Sigma}_{\mathbf{X}}$, we can find the generalized eigenvectors with Algorithm 2. Either way, we let $\boldsymbol{\beta}_{k}$, $k=1,\ldots,K$, be the first $K$ (generalized) eigenvectors of $\mathrm{MDDM}(\mathbf{X}\mid\mathbf{Y})$. Throughout the proof, we let $C$ denote a generic constant that can vary from line to line. We show the consistency of $\widehat{\boldsymbol{\beta}}_{k}$ by proving that $\eta_{k}=|\sin\Theta(\widehat{\boldsymbol{\beta}}_{k},\boldsymbol{\beta}_{k})|\leq Cs\epsilon$. We assume that $K$ is fixed and $s\epsilon\leq 1$. Recall that we define $\lambda_{j}=\boldsymbol{\beta}_{j}^{\mathrm{T}}\mathbf{M}\boldsymbol{\beta}_{j}$ as the (generalized) eigenvalue. Further define $d=\max_{k=1}^{K}\{\|\boldsymbol{\beta}_{k}\|_{0}\}$. When we study Algorithm 1 or Algorithm 2, we assume that $s=d+2s^{\prime}$, where $s^{\prime}=Cd$ for a sufficiently large $C$. To apply the concentration inequality for MDDM, we restate below Condition (C1) in terms of $\mathbf{X}$ and $\mathbf{Y}$ as Condition (C1'), along with other suitable conditions:

  1. (C1’)

    There exist two positive constants σ0\sigma_{0} and C0C_{0} such that E{exp(2σ0𝐘q2)}C0\mathrm{E}\{\exp(2\sigma_{0}\|\mathbf{Y}\|_{q}^{2})\}\leq C_{0} and suppmax1jpE{exp(2σ0Xj2)}C0\sup_{p}\max_{1\leq j\leq p}\mathrm{E}\{\exp(2\sigma_{0}X_{j}^{2})\}\leq C_{0}.

  2. (C2)

    There exist Δ>0\Delta>0 such that mink=1,,K(λkλk+1)Δ\min_{k=1,\ldots,K}(\lambda_{k}-\lambda_{k+1})\geq\Delta.

  3. (C3)

    There exists constants U,LU,L that do not depend on n,pn,p such that LλKλ1UL\leq\lambda_{K}\leq\lambda_{1}\leq U.

  4. (C4)

    As nn\rightarrow\infty, dn1/2(logp)1/2(logn)3/20dn^{-1/2}{(\log p)^{1/2}(\log{n})^{3/2}}\rightarrow 0.

Condition (C2) guarantees that the eigenvectors are well defined. Condition (C3) imposes bounds on the eigenvalues of $\mathbf{M}$; researchers often impose similar assumptions on the covariance matrix to achieve consistent estimation. Condition (C4) restricts the growth rate of $p,d$ with respect to $n$. Note that $d$ is the population sparsity level of the $\boldsymbol{\beta}_{k}$'s, while $s$ is the user-specified sparsity level in Algorithms 1 and 2. If we fix $d$, the dimension is allowed to grow at the rate $\log p=o(n\log^{-3}n)$. When we allow $d$ to diverge, we require it to diverge more slowly than $\{n/(\log p\log^{3}n)\}^{1/2}$.

We present the non-asymptotic results for Algorithm 1 in the following theorem, where the constants $D_{1},D_{2},\sigma_{0},\gamma,C_{0}$ are defined previously in Theorem 1 under Condition (C1).

Theorem 2.

Assume that Conditions (C1’), (C2) & (C3) hold and 𝚺𝐗=𝐈{\bm{\Sigma}}_{\mathbf{X}}=\mathbf{I}. Further assume that, there exists θ(0,1/2)\theta\in(0,1/2), such that for k=1,,Kk=1,\ldots,K, we have (𝛃^k0)T𝛃k2θ,(\widehat{\boldsymbol{\beta}}_{k}^{0})^{\mathrm{\tiny T}}\boldsymbol{\beta}_{k}\geq 2\theta, and

\[
\mu=\sqrt{\left[1+2\left\{\left(\frac{d}{s^{\prime}}\right)^{1/2}+\frac{d}{s^{\prime}}\right\}\right]\left\{1-0.5\theta(1+\theta)(1-(\gamma^{*})^{2})\right\}}<1, \tag{5.10}
\]

where $\gamma^{*}=\dfrac{\lambda_{K}-\frac{3}{4}\Delta}{\lambda_{K}-\frac{1}{4}\Delta}$. Then there exist a positive integer $n_{0}=n_{0}(\sigma_{0},C_{0},q)<\infty$, $\gamma=\gamma(\sigma_{0},C_{0},q)\in(0,1/2)$ and a finite positive $D_{0}=D_{0}(\sigma_{0},C_{0},q)$ such that when $n>n_{0}$, we have $D_{0}n^{-\gamma}<\dfrac{\Delta}{4s}$, and for any $D_{0}n^{-\gamma}<\epsilon<\min\{\dfrac{\Delta}{4s},\theta\}$, with probability greater than $1-54p^{2}\exp\left\{-\dfrac{\epsilon^{2}n}{36\log^{3}n}\right\}$,

\[
|\sin\boldsymbol{\Theta}(\widehat{\boldsymbol{\beta}}_{k},\boldsymbol{\beta}_{k})|\leq Cs\epsilon,\quad k=1,\dots,K. \tag{5.11}
\]

Let $n^{-1/2}(\log p)^{1/2}(\log n)^{3/2}\ll\epsilon\ll d^{-1}$; then Theorem 2 directly implies the following asymptotic result that justifies the consistency of our estimator.

Corollary 1.

Assume that Conditions (C1’), (C2)–(C4) hold. Suppose there exists γ>0\gamma>0 such that dsmin{nγ,n12(logp)12(logn)32}d\leq s\ll\min\{n^{\gamma},\dfrac{n^{\frac{1}{2}}}{(\log p)^{\frac{1}{2}}(\log{n})^{\frac{3}{2}}}\}. Under the conditions in Theorem 2, the quantities |sin𝚯(𝛃^k,𝛃k)|0|\sin\boldsymbol{\Theta}(\widehat{\boldsymbol{\beta}}_{k},\boldsymbol{\beta}_{k})|\rightarrow 0 with a probability tending to 1, for k=1,,Kk=1,\ldots,K.

Corollary 1 reveals that, without specifying a model, Algorithm 1 can achieve consistency when $p$ grows at an exponential rate of $n$. To be exact, we can allow $\log p=o\{n/(d^{2}\log^{3}n)\}$. Here the theoretical results are established for the output of Algorithm 1, instead of the solution of the optimization problem (4.7). Note that it is possible to have some gap between the theoretical optimal solution of (4.7) and the estimate we use in practice, because the optimization problem is nonconvex and numerically we might not achieve the global maximum. Thus it might be more meaningful to study the property of the estimate obtained as the output of Algorithm 1. The above theorem guarantees that the estimate we use in practice has the desired theoretical properties.

In the meantime, although our rate in Theorem 2 is not as sharp as those established very recently for sparse sliced inverse regression by Lin et al. (2018) when $\Sigma_{X}=I$, by Lin et al. (2019) for general $\Sigma_{X}$, and by Tan et al. (2020), we have some unique advantages over these proposals. For simplicity, we assume that $d$ is fixed in the subsequent discussion. First, both sliced inverse regression methods require estimation of within-slice means rather than the MDDM. As shown in Theorem 1, MDDM converges to its population counterpart at a slower rate than the sample within-slice mean. However, by adopting MDDM, we no longer need to determine the slicing scheme, and we do not encounter the curse of dimensionality when slicing a multivariate response. Second, Lin et al. (2018) only achieve the optimal rate when $p=o(n^{2})$, and cannot handle ultra-high dimensions. In contrast, Algorithm 1 allows $p$ to diverge at an exponential rate of $n$, and is more suitable for ultra-high-dimensional data. Third, although Tan et al. (2020) achieve consistency when $\log p=o(n)$, they make much more restrictive model assumptions. For example, they assume that $Y$ is categorical and $\mathbf{X}$ is normal within each slice of $Y$; they also randomly split the dataset to form independent batches to facilitate their proofs, which is not done in their numerical studies. The theoretical properties of their proposal are unclear beyond the (conditionally) Gaussian model and without the sample splitting. In contrast, our method makes no model assumption between $\mathbf{X}$ and $Y$, and our theory requires no sample splitting. Thus, our results are more widely applicable, and the rates we obtain seem hard to improve. Also, unlike the theory in Tan et al. (2020), our theoretical result characterizes the exact method we use in practice. Moreover, the convergence rate of our method has only an additional factor of $\log^{3}(n)$ compared to Tan et al. (2020), which grows slowly in $n$ and imposes only a mild restriction on the dimensionality. For example, for any constant $\xi\in(0,1)$, if $\log p=O(n^{1-\xi})$, our method is consistent. In this sense, although we cannot handle the optimal dimensionality $\log p=o(n)$, the gap is very small.

Next, we further consider the penalized generalized eigen-decomposition in Algorithm 2. We assume that the step size $\eta$ satisfies $\eta\lambda_{\max}(\boldsymbol{\Sigma}_{\mathbf{X}})<1/2$ and

\[
\sqrt{\left[1+2\left\{\left(\frac{d}{s^{\prime}}\right)^{1/2}+\frac{d}{s^{\prime}}\right\}\right]\left[1-\frac{\eta\lambda_{\min}(\boldsymbol{\Sigma}_{\mathbf{X}})(1-\frac{\lambda_{2}}{\lambda_{1}})}{16\kappa(\boldsymbol{\Sigma}_{\mathbf{X}})+16\frac{\lambda_{2}}{\lambda_{1}}}\right]}<1, \tag{5.12}
\]

where $\lambda_{\max}(\boldsymbol{\Sigma}_{\mathbf{X}})$, $\lambda_{\min}(\boldsymbol{\Sigma}_{\mathbf{X}})$ and $\kappa(\boldsymbol{\Sigma}_{\mathbf{X}})$ are, respectively, the largest eigenvalue, the smallest eigenvalue and the condition number of $\boldsymbol{\Sigma}_{\mathbf{X}}$. The non-asymptotic results are as follows.

Theorem 3.

Assume that Conditions (C1’), (C2) & (C3) hold. Suppose there exists γ(0,1/2)\gamma\in(0,1/2) such that ds=o(nγ)d\leq s=o(n^{\gamma}), and there exists a constant θ(κ(𝚺𝐗),λmax(𝚺𝐗),Δ,λ1,λK,η)(0,1)\theta(\kappa({\bm{\Sigma}}_{\mathbf{X}}),\lambda_{\max}({\bm{\Sigma}}_{\mathbf{X}}),\Delta,\lambda_{1},\lambda_{K},\eta)\in(0,1) such that (𝛃^k0)T𝛃k𝛃^k021θ\dfrac{(\widehat{\boldsymbol{\beta}}_{k}^{0})^{\mathrm{\tiny T}}\boldsymbol{\beta}_{k}}{\|\widehat{\boldsymbol{\beta}}_{k}^{0}\|_{2}}\geq 1-\theta. Then there exists a positive integer n0=n0(s0,C0)<n_{0}=n_{0}(s_{0},C_{0})<\infty and four finite positive constants D0=D0(γ,σ0,C0)(0,)D_{0}=D_{0}(\gamma,\sigma_{0},C_{0})\in(0,\infty), D1=D1(C0)(0,)D_{1}=D_{1}(C_{0})\in(0,\infty), D2=D2(σ0,C0)(0,)D_{2}=D_{2}(\sigma_{0},C_{0})\in(0,\infty) and ϵ0=ϵ0(λ1,λ2,λmin(𝚺),Δ)\epsilon_{0}=\epsilon_{0}(\lambda_{1},\lambda_{2},\lambda_{\min}({\bm{\Sigma}}),\Delta) such that for any ϵ\epsilon that satisfies sϵ<ϵ0s\epsilon<\epsilon_{0} and D0nγ<ϵ1D_{0}n^{-\gamma}<\epsilon\leq 1, with a probability greater than 1D1p2nexp{D2ϵ2n/log3n}1-D_{1}p^{2}n\exp\{-D_{2}\epsilon^{2}n/\log^{3}{n}\}, we have |sinΘ(𝛃^k,𝛃k)|Csϵ|\sin\Theta(\widehat{\boldsymbol{\beta}}_{k},\boldsymbol{\beta}_{k})|\leq Cs\epsilon, for k=1,,Kk=1,\ldots,K.

Theorem 3 is proved by showing that $\widehat{\mathbf{M}}_{k}$ and $\widehat{\boldsymbol{\Sigma}}_{\mathbf{X}}$ are close to their population counterparts, in the sense that $\mathbf{u}^{\mathrm{T}}\widehat{\mathbf{M}}_{k}\mathbf{u}$ and $\mathbf{u}^{\mathrm{T}}\widehat{\boldsymbol{\Sigma}}_{\mathbf{X}}\mathbf{u}$ are close to $\mathbf{u}^{\mathrm{T}}\mathbf{M}_{k}\mathbf{u}$ and $\mathbf{u}^{\mathrm{T}}\boldsymbol{\Sigma}_{\mathbf{X}}\mathbf{u}$, respectively, for any $\mathbf{u}$ with only $s$ nonzero elements. Then we use the fact that Algorithm 2 is a generalization of the RIFLE algorithm (Tan, Wang, Liu and Zhang, 2018), and some properties of the latter allow us to establish the consistency of Algorithm 2. By comparison, our proofs are significantly more involved than those in Tan, Wang, Liu and Zhang (2018), because we have to estimate $K$ generalized eigenvectors instead of just the first one. We need to carefully control the error bounds to guarantee that the estimation errors do not accumulate to a higher order beyond the first generalized eigenvector.

Analogous to Corollary 1, we can easily obtain asymptotic consistency results by translating Theorem 3.

Corollary 2.

Assume that Conditions (C1'), (C2)–(C4) hold. Suppose there exists $\gamma\in(0,1/2)$ such that $d\leq s\ll\min\{n^{\gamma},\dfrac{n^{1/2}}{(\log p\log^{3}n)^{1/2}}\}$. Under the conditions in Theorem 3, the quantities $|\sin\boldsymbol{\Theta}(\widehat{\boldsymbol{\beta}}_{k},\boldsymbol{\beta}_{k})|\rightarrow 0$ with probability tending to 1, for $k=1,\ldots,K$.

Corollary 2 shows that Algorithm 2 produces consistent estimates of the generalized eigenvectors $\boldsymbol{\beta}_{k}$ even when $p$ grows at an exponential rate of the sample size $n$, and thus is suitable for ultra-high-dimensional problems. Similar to Corollary 1, Corollary 2 has no gap between theory and the numerical outputs, as it is a result concerning the outputs of Algorithm 2. We note that the dimensionality in Corollary 2 is the same as that in Corollary 1. Thus, with a properly chosen step size $\eta$, the penalized generalized eigen-decomposition is intrinsically no more difficult than the penalized eigen-decomposition. But if we know that $\boldsymbol{\Sigma}_{\mathbf{X}}$ is the identity matrix, it is still beneficial to exploit such information and use Algorithm 1, because Algorithm 1 does not involve a step size and is more convenient in practice. Also, although Algorithm 2 does not achieve the same rate of convergence as recent sparse sliced inverse regression proposals, it has the same practical and theoretical advantages as Algorithm 1, which we do not repeat.

Finally, we note that our theoretical studies require conditions on the initial value. Specifically, we require the initial value to be non-orthogonal to the truth. This is a common technical condition for iterative algorithms; see Yuan and Zhang (2013); Tan, Wang, Liu and Zhang (2018) for example. Such conditions do not seem critical for our algorithms to work in practice. In our numerical studies to be presented, we use randomly generated initial values, and the performance of our methods appears to be competitive.

6 Numerical studies

6.1 Simulations

We compare our slicing-free approaches to state-of-the-art high-dimensional extensions of sliced inverse regression estimators. Both univariate-response and multivariate-response settings are considered. Specifically, for univariate-response simulations, we include Rifle-SIR (Tan, Wang, Liu and Zhang, 2018) and Lasso-SIR (Lin et al., 2019) as two main competitors; for multivariate-response simulations, we mainly compare our method with the projective resampling approach to SIR (PR-SIR, Li et al., 2008), which is a computationally expensive method that repeatedly projects the multivariate response to one-dimensional subspaces. For Rifle-SIR, we adopt the Rifle algorithm to estimate the leading eigenvector of the slicing-based sample estimate of $\mathrm{cov}\{\mathrm{E}(\mathbf{X}\mid Y)\}$. In addition, we include the oracle-SIR as a benchmark method, where we perform SIR on the subset of truly relevant variables (hence a low-dimensional estimation problem). For all these SIR-based methods, we include two different slicing schemes by setting the number of slices to be $3$ and $10$, where $3$ is the minimal number of slices required to recover our two-dimensional central subspace and $10$ is a typical choice used in the literature. To evaluate the performance of these SDR methods, we use the subspace estimation error defined as $\mathcal{D}(\widehat{\boldsymbol{\beta}},\boldsymbol{\beta})=\|\mathbf{P}_{\widehat{\boldsymbol{\beta}}}-\mathbf{P}_{\boldsymbol{\beta}}\|_{F}/\sqrt{2K}$, where $\widehat{\boldsymbol{\beta}},\boldsymbol{\beta}\in\mathbb{R}^{p\times K}$ are the estimated and the true basis matrices of the central subspace and $\mathbf{P}_{\widehat{\boldsymbol{\beta}}},\mathbf{P}_{\boldsymbol{\beta}}\in\mathbb{R}^{p\times p}$ are the corresponding projection matrices. This subspace estimation error is always between 0 and 1, and a small value indicates good estimation.
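The subspace estimation error $\mathcal{D}(\widehat{\boldsymbol{\beta}},\boldsymbol{\beta})$ can be computed as below. This is a minimal sketch of our own; it uses pseudo-inverse-based projections so that non-orthonormal basis matrices are handled as well.

```python
import numpy as np

def subspace_error(B_hat, B):
    """D(B_hat, B) = ||P_{B_hat} - P_B||_F / sqrt(2K) for p x K basis matrices."""
    B_hat = np.atleast_2d(B_hat)
    B = np.atleast_2d(B)
    K = B.shape[1]
    P_hat = B_hat @ np.linalg.pinv(B_hat)      # projection onto span(B_hat)
    P = B @ np.linalg.pinv(B)                  # projection onto span(B)
    return np.linalg.norm(P_hat - P, "fro") / np.sqrt(2 * K)

# 0 for identical subspaces, 1 for orthogonal one-dimensional subspaces.
b1 = np.array([[1.0], [0.0], [0.0]])
b2 = np.array([[0.0], [1.0], [0.0]])
print(subspace_error(b1, b1), subspace_error(b2, b1))   # 0.0 and 1.0
```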

First, we consider the following six models for univariate response regression: \mathcal{M}_{1} and \mathcal{M}_{2} are single index models (i.e., K=1); \mathcal{M}_{3}–\mathcal{M}_{5} are multiple index models (i.e., K=2) that are widely used in the SDR literature; and \mathcal{M}_{6} is an isotropic PFC model with K=1. Specifically,

\mathcal{M}_{1}:\; Y=(\boldsymbol{\beta}_{1}^{\mathrm{\tiny T}}\mathbf{X})+\sin(\boldsymbol{\beta}_{1}^{\mathrm{\tiny T}}\mathbf{X})+\epsilon, \qquad \mathcal{M}_{2}:\; Y=2\arctan(\boldsymbol{\beta}_{1}^{\mathrm{\tiny T}}\mathbf{X})+0.1(\boldsymbol{\beta}_{1}^{\mathrm{\tiny T}}\mathbf{X})^{3}+\epsilon,
\mathcal{M}_{3}:\; Y=\frac{\boldsymbol{\beta}_{1}^{\mathrm{\tiny T}}\mathbf{X}}{0.5+(1.5+\boldsymbol{\beta}_{2}^{\mathrm{\tiny T}}\mathbf{X})^{2}}+0.2\epsilon, \qquad \mathcal{M}_{4}:\; Y=\boldsymbol{\beta}_{1}^{\mathrm{\tiny T}}\mathbf{X}+(\boldsymbol{\beta}_{1}^{\mathrm{\tiny T}}\mathbf{X})\cdot(\boldsymbol{\beta}_{2}^{\mathrm{\tiny T}}\mathbf{X})+0.3\epsilon,
\mathcal{M}_{5}:\; Y=\mathrm{sign}(\boldsymbol{\beta}_{1}^{\mathrm{\tiny T}}\mathbf{X})\cdot\log(|\boldsymbol{\beta}_{2}^{\mathrm{\tiny T}}\mathbf{X}+5|)+0.2\epsilon, \qquad \mathcal{M}_{6}:\; \mathbf{X}=2\boldsymbol{\beta}_{1}\exp(Y)/3+0.5\boldsymbol{\epsilon},

where \mathbf{X}\sim N_{p}(0,\boldsymbol{\Sigma}_{\mathbf{X}}) and \epsilon\sim N(0,1) for \mathcal{M}_{1}–\mathcal{M}_{5}, and Y\sim N(0,1), \boldsymbol{\epsilon}\sim N_{p}(0,\mathbf{I}_{p}) for the isotropic PFC model (\mathcal{M}_{6}). The sparse directions in the central subspace, \boldsymbol{\beta}_{1},\boldsymbol{\beta}_{2}\in\mathbb{R}^{p}, are orthogonal: we set the first s=6 elements of \boldsymbol{\beta}_{1} and the 6-th to 12-th elements of \boldsymbol{\beta}_{2} to 1/\sqrt{6}, while all other elements are zero. For \mathcal{M}_{1}–\mathcal{M}_{5}, we consider both the independent predictor setting with \boldsymbol{\Sigma}_{\mathbf{X}}=\mathbf{I}_{p} and the correlated predictor setting with the auto-regressive correlation \boldsymbol{\Sigma}_{\mathbf{X}}(i,j)=0.5^{|i-j|}, i,j=1,2,\dots,p. For each model setting, we vary the sample size n\in\{200,500,800\} and the predictor dimension p\in\{200,500,800,1200,2000\}, and simulate 1000 independent data sets.
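As a concrete illustration of this data-generating scheme, the sketch below (our own code) simulates one data set from \mathcal{M}_{3} under the AR(1) covariance; the other models are generated analogously, and drawing \mathbf{X} through the full p\times p covariance matrix is for clarity rather than efficiency.

import numpy as np

def simulate_M3(n=200, p=800, rho=0.5, seed=0):
    rng = np.random.default_rng(seed)
    # sparse directions: beta1 on coordinates 1-6, beta2 on coordinates 6-12, entries 1/sqrt(6)
    beta1 = np.zeros(p); beta1[:6] = 1 / np.sqrt(6)
    beta2 = np.zeros(p); beta2[5:12] = 1 / np.sqrt(6)
    # AR(1) covariance Sigma_X(i, j) = rho^{|i - j|}
    idx = np.arange(p)
    Sigma = rho ** np.abs(idx[:, None] - idx[None, :])
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    eps = rng.standard_normal(n)
    Y = (X @ beta1) / (0.5 + (1.5 + X @ beta2) ** 2) + 0.2 * eps
    return X, Y, np.column_stack([beta1, beta2])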

For our method, we apply the generalized eigen-decomposition algorithm (Algorithm 2) in all six models (even when the covariance of \mathbf{X} is the identity matrix). In the single index models \mathcal{M}_{1} and \mathcal{M}_{2}, we use random initialization (\widehat{\boldsymbol{\beta}}^{(0)} is randomly generated from the p-dimensional standard normal distribution) for our algorithm and Rifle-SIR to demonstrate the robustness to initialization; the step size in the algorithm is simply fixed at \eta=1. For the more challenging models \mathcal{M}_{3}–\mathcal{M}_{6}, we consider the best-case scenario for each method: the true parameter \boldsymbol{\beta} is used as the initial value, and an optimal \eta\in\{0.1,0.2,\dots,1.0\} is selected from a separate training sample with 400 observations. The results based on 1000 replications for n=200 and p=800 are summarized in Table 1, while the rest of the results can be found in the Supplementary Materials. Overall, the slicing-free MDDM approach is much more accurate than the existing SIR-based methods and is almost as accurate as the oracle-SIR. Moreover, it is clear that SIR-type methods are rather sensitive to the choice of the number of slices.

𝚺𝐗\boldsymbol{\Sigma}_{\mathbf{X}} MDDM Oracle-SIR(3) Oracle-SIR(10) Rifle-SIR(3) Rifle-SIR(10) LassoSIR(3) LassoSIR(10)
Error SE Error SE Error SE Error SE Error SE Error SE Error SE
𝐈p\mathbf{I}_{p} 1\mathcal{M}_{1} 10.1 0.1 12.5 0.1 10.3 0.1 25.2 1.0 53.7 1.4 37.9 0.4 59.9 0.7
2\mathcal{M}_{2} 10.3 0.1 13.1 0.1 10.6 0.1 26.1 1.0 54.7 1.4 40.1 0.4 61.5 0.7
3\mathcal{M}_{3} 17.7 0.2 40.8 0.2 27.7 0.2 71.3 0.0 71.2 0.0 76.5 0.2 85.0 0.2
4\mathcal{M}_{4} 23.0 0.2 45.8 0.3 36.4 0.3 71.9 0.0 71.6 0.0 85.2 0.2 91.5 0.2
5\mathcal{M}_{5} 30.8 0.6 28.8 0.2 22.1 0.1 71.6 0.0 71.2 0.0 71.2 0.3 81.3 0.3
AR 1\mathcal{M}_{1} 18.7 0.3 21.0 0.2 17.6 0.2 34.7 0.8 39.8 1.1 35.3 0.3 35.5 0.3
2\mathcal{M}_{2} 14.2 0.2 20.7 0.2 14.8 0.2 33.1 0.7 33.6 1.1 34.6 0.3 30.5 0.3
3\mathcal{M}_{3} 25.2 0.3 44.6 0.2 34.1 0.2 71.5 0.0 71.3 0.0 54.8 0.2 47.1 0.3
4\mathcal{M}_{4} 59.1 0.5 75.1 0.2 69.9 0.3 81.0 0.2 78.7 0.2 89.7 0.2 92.1 0.2
5\mathcal{M}_{5} 46.2 0.6 46.4 0.2 35.5 0.2 73.8 0.1 72.4 0.0 66.5 0.2 61.4 0.3
PFC 6\mathcal{M}_{6} 34.6 0.6 48.9 0.5 33.4 0.5 40.1 0.7 30.8 0.6 70.7 0.0 70.7 0.0
Table 1: Averaged subspace estimation errors and the corresponding standard errors (multiplied by 100) for the univariate response models (n=200, p=800).

Next, we further consider the following three multivariate response models, where the response dimension is q=4. These three models are, respectively, a multivariate linear model, a single-index heteroscedastic error model, and an isotropic PFC model. The predictors satisfy \mathbf{X}\sim N_{p}(0,\mathbf{I}_{p}) in the first two (forward regression) models, and we therefore apply Algorithm 1 under \mathcal{M}_{7} and \mathcal{M}_{8}. For the isotropic PFC model \mathcal{M}_{9}, where \mathbf{X}\mid\mathbf{Y}\sim N_{p}(\boldsymbol{\beta}f(\mathbf{Y}),\mathbf{I}_{p}), we still apply Algorithm 2 to be consistent with the univariate case. For the projective resampling methods, PR-SIR and PR-Oracle-SIR, we generate n\log(n) random projections, which is sufficiently large for the PR methods to reach their full potential. A data-generating sketch for \mathcal{M}_{9} is given after the model list below.

  1. \mathcal{M}_{7}: Y_{1}=\boldsymbol{\beta}_{1}^{\mathrm{\tiny T}}\mathbf{X}+\epsilon_{1}, Y_{2}=\boldsymbol{\beta}_{2}^{\mathrm{\tiny T}}\mathbf{X}+\epsilon_{2}, Y_{3}=\epsilon_{3} and Y_{4}=\epsilon_{4}. The errors (\epsilon_{1},\dots,\epsilon_{4}) are standard normal and mutually independent except that \mathrm{cov}(\epsilon_{1},\epsilon_{2})=-0.5. For this model, the central subspace is spanned by \boldsymbol{\beta}_{1}=(1,0,0,0,\dots,0)^{\mathrm{\tiny T}} and \boldsymbol{\beta}_{2}=(0,2,1,0,\dots,0)^{\mathrm{\tiny T}}.

  2. \mathcal{M}_{8}: Y_{1}=\exp(\epsilon_{1}) and Y_{i}=\epsilon_{i} for i=2,3,4, where (\epsilon_{1},\dots,\epsilon_{4}) are standard normal and mutually independent except that \mathrm{cov}(\epsilon_{1},\epsilon_{2})=\sin(\boldsymbol{\beta}^{\mathrm{\tiny T}}\mathbf{X}). For this model, the central subspace is spanned by \boldsymbol{\beta}=(0.8,0.6,0,0,\dots,0)^{\mathrm{\tiny T}}. Note that marginally each response is independent of \mathbf{X}.

  3. \mathcal{M}_{9}: \mathbf{X}=\boldsymbol{\beta}\left(\frac{1}{3}\sin(Y_{1})+\frac{2}{3}\exp(Y_{2})+Y_{3}\right)+\boldsymbol{\epsilon}, where \boldsymbol{\beta}=(1/\sqrt{6}\cdot\mathbf{1}_{6},\mathbf{0}_{p-6}) and \boldsymbol{\epsilon}\sim N(0,\mathbf{I}_{p}). Hence, \mathcal{S}_{\mathbf{Y}\mid\mathbf{X}}=\mathrm{span}(\boldsymbol{\beta}).
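As noted above, here is a minimal sketch (our own code) of generating data from the isotropic PFC model \mathcal{M}_{9}; the marginal distribution of \mathbf{Y} is not restated in the model description, so we take \mathbf{Y}\sim N_{q}(0,\mathbf{I}_{q}) purely for illustration, mirroring \mathcal{M}_{6}.

import numpy as np

def simulate_M9(n=200, p=800, q=4, seed=0):
    rng = np.random.default_rng(seed)
    beta = np.zeros(p); beta[:6] = 1 / np.sqrt(6)         # beta = (1/sqrt(6) * 1_6, 0_{p-6})
    Y = rng.standard_normal((n, q))                        # assumed marginal, for illustration only
    f = np.sin(Y[:, 0]) / 3 + 2 * np.exp(Y[:, 1]) / 3 + Y[:, 2]
    X = np.outer(f, beta) + rng.standard_normal((n, p))    # X = beta * f(Y) + eps, eps ~ N(0, I_p)
    return X, Y, beta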

Again, we consider various sample size and predictor dimension setups, each with 1000 replicates. We summarize the subspace estimation errors for p\in\{100,200,400\} in Table 2; the results for p=800 and p=1200 are gathered in the supplement (Table S6). It is clear that the proposed MDDM approach is much better than PR-SIR, and it also improves much faster than PR-SIR as the sample size increases. The MDDM method performs better in inverse regression models, such as the isotropic PFC model, than in forward regression models, such as the linear model and the index models. This finding is more apparent in the multivariate response simulations than in the univariate response simulations. This is expected, as MDDM directly targets the inverse regression subspace, which is more directly driven by the response in the isotropic PFC models.

n=100n=100 n=200n=200 n=400n=400
p=100p=100 p=200p=200 p=400p=400 p=100p=100 p=200p=200 p=400p=400 p=100p=100 p=200p=200 p=400p=400
Error SE Error SE Error SE Error SE Error SE Error SE Error SE Error SE Error SE
7\mathcal{M}_{7} MDDM 37.1 0.5 39.8 0.5 42.5 0.5 24.0 0.4 25.3 0.4 26.9 0.4 16.1 0.3 17.3 0.3 18.6 0.3
PR-Oracle-SIR(3) 12.6 0.2 12.2 0.2 12.0 0.2 8.8 0.1 8.5 0.1 8.7 0.1 5.9 0.1 5.8 0.1 5.8 0.1
PR-Oracle-SIR(10) 16.2 0.3 15.7 0.3 15.6 0.3 9.6 0.2 9.4 0.2 95.2 0.2 6.0 0.1 6.0 0.1 6.0 0.1
PR-SIR(3) 79.9 0.1 88.2 0.1 93.5 0.0 67.9 0.1 79.3 0.1 87.8 0.0 54.6 0.1 67.6 0.1 79.0 0.0
PR-SIR(10) 83.5 0.1 90.6 0.1 94.9 0.0 70.1 0.1 81.6 0.1 90.1 0.1 55.3 0.1 68.2 0.1 80.2 0.1
8\mathcal{M}_{8} MDDM 79.4 0.9 85.8 0.8 90.0 0.7 55.9 1.2 61.0 1.2 68.4 1.2 27.1 0.9 30.3 1.0 31.0 1.0
PR-Oracle-SIR(3) 40.9 0.9 41.3 0.9 41.4 0.9 26.0 0.7 24.9 0.7 25.0 0.6 14.9 0.4 14.9 0.4 15.0 0.4
PR-Oracle-SIR(10) 44.1 0.9 43.8 0.9 43.5 0.9 25.1 0.6 23.7 0.6 24.1 0.6 13.1 0.3 13.0 0.3 13.2 0.3
PR-SIR(3) 99.3 0.0 99.7 0.0 99.8 0.0 99.2 0.0 99.7 0.0 99.8 0.0 98.8 0.0 99.6 0.0 99.8 0.0
PR-SIR(10) 99.3 0.0 99.7 0.0 99.9 0.0 99.1 0.0 99.6 0.0 99.8 0.0 98.4 0.1 99.6 0.0 99.8 0.0
9\mathcal{M}_{9} MDDM 15.3 0.3 15.4 0.3 15.7 0.3 9.9 0.1 10.1 0.1 10.0 0.1 7.1 0.1 7.2 0.1 7.1 0.1
PR-Oracle-SIR(3) 15.2 0.2 15.2 0.2 14.9 0.2 10.5 0.1 10.6 0.1 10.5 0.1 7.5 0.1 7.6 0.1 7.4 0.1
PR-Oracle-SIR(10) 13.8 0.2 13.9 0.2 13.6 0.2 9.4 0.1 9.7 0.1 9.6 0.1 6.8 0.1 6.8 0.1 6.7 0.1
PR-SIR(3) 58.5 0.2 72.3 0.2 84.0 0.2 44.6 0.1 58.2 0.1 71.4 0.1 33.1 0.1 44.6 0.1 57.9 0.1
PR-SIR(10) 54.8 0.2 68.5 0.2 80.6 0.2 41.1 0.2 54.3 0.2 67.7 0.2 30.2 0.1 41.0 0.1 54.2 0.1
Table 2: Averaged subspace estimation errors and the corresponding standard errors (multiplied by 100) for the multivariate response models (p\in\{100,200,400\}).

6.2 Real Data Illustration

Figure 1: Quantile-quantile plots for prediction error comparisons between MDDM and Lasso-SIR (left panel), and between MDDM-ID and Lasso-SIR (right panel). Each point corresponds to the prediction mean squared errors for one of the q=15 response variables, where different shapes represent different quantiles.
Figure 2: The averaged prediction error over 500 training-testing sample splits and over the q=15 response variables.

In this section we use our method to analyze the NCI-60 data set (Shoemaker, 2006), which contains microRNA expression profiles and cancer drug activity measurements on the NCI-60 cell lines. The multivariate response consists of the activity measurements of q=15 cancer drugs; the predictors are the expression levels of p=365 microRNAs; the sample size is n=60.

First, we examine the predictive performance of our method using 500 random training-testing sample splits; each time we randomly pick 5 observations to form the test set. We consider K=5 for all methods. For MDDM, we include both the eigen-decomposition (Algorithm 1) and the generalized eigen-decomposition (Algorithm 2). To distinguish the two versions, we write “MDDM-ID” for the eigen-decomposition approach, because it implicitly assumes that the covariance of \mathbf{X}, or the conditional covariance of \mathbf{X}\mid\mathbf{Y}, is a constant times the identity matrix. We use random initial values and choose the sparsity level to be s=25 in the way described in Section S2 of the Supplementary Materials. The five reduced predictors \boldsymbol{\beta}_{k}^{\mathrm{\tiny T}}\mathbf{X}, k=1,\dots,5, are then fed into a generalized additive model for each drug. Finally, we evaluate the mean squared prediction error on the test sample. Rifle-SIR can only estimate a one-dimensional subspace, which did not yield accurate predictions on this data set; hence, for comparison, we compute the five leading directions from Lasso-SIR. The 25th, 50th and 75th percentiles of the squared prediction errors are obtained for each of the 15 responses and each of the three methods, and we construct quantile-quantile plots in Figure 1. The red line is the y=x line, and the black dashed line is a simple linear regression fit of the results indicated by the y-axis label against those indicated by the x-axis label. Clearly, for all the quantiles and for all the response variables, the MDDM results (MDDM or MDDM-ID) are better than Lasso-SIR in terms of prediction. In addition, we construct side-by-side boxplots of the prediction error averaged over all response variables in Figure 2 to evaluate the overall improvement. Interestingly, MDDM-ID is slightly better than MDDM. This is likely due to the small sample size: with only 55 training samples, the sample covariance of p=365 variables is difficult to estimate accurately. We further include additional real data analysis results in Section S2 of the Supplementary Materials.
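The evaluation loop can be summarized by the schematic sketch below (our own code). Here estimate_mddm_directions is a hypothetical stand-in for the penalized (generalized) eigen-decomposition of the sample MDDM, and a gradient boosting regressor replaces the generalized additive model purely for illustration.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor

def evaluate_split(X, Y, estimate_mddm_directions, K=5, seed=0):
    # one random 55/5 training-testing split; returns the test MSE for each response
    X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, test_size=5, random_state=seed)
    B = estimate_mddm_directions(X_tr, Y_tr, K=K)          # hypothetical: returns a p x K sparse basis
    Z_tr, Z_te = X_tr @ B, X_te @ B                        # reduced predictors beta_k^T X
    mse = []
    for j in range(Y.shape[1]):                            # one flexible regression per drug
        model = GradientBoostingRegressor().fit(Z_tr, Y_tr[:, j])
        mse.append(np.mean((model.predict(Z_te) - Y_te[:, j]) ** 2))
    return np.array(mse)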

7 Discussion

In this paper, we propose a slicing-free high-dimensional SDR method based on a penalized eigen-decomposition of the sample MDDM. Our proposal is motivated by the usefulness of MDDM for dimension reduction and yields a relatively straightforward implementation in view of the recently developed RIFLE algorithm (Tan, Wang, Liu and Zhang, 2018), by simply replacing the slicing-based estimator with the sample MDDM. Our methodology and implementation involve no slicing and treat univariate and multivariate responses in a unified fashion. Both the theoretical support and the finite sample investigations provide convincing evidence that MDDM is a very competitive alternative to SIR and may routinely be used as a surrogate for SIR-based estimators in many related sufficient dimension reduction problems.

As with most SDR methods, our proposal requires the linearity condition, the violation of which can make SDR very challenging to tackle. Existing proposals that relax the linearity condition are often practically difficult due to excessive computational costs and cannot be easily extended to high dimensions (Cook and Nachtsheim, 1994; Ma and Zhu, 2012). One potentially useful approach is to transform the data before SDR to alleviate obvious violations of the linearity assumption (Mai and Zou, 2015), but an in-depth study along this line is beyond the scope of the current paper. In addition, we observe from our simulation studies that RIFLE requires the choice of several tuning parameters, such as the step size and the initial value, and the optimization error could depend on these tuning parameters in a nontrivial way. Further investigation of the optimization error and data-driven choices for these tuning parameters would be desirable and is left for future research.

As pointed out by a referee, many SDR methods beyond SIR involve slicing. It will be interesting to study how to perform them in a slicing-free fashion as well. For example, Cook and Weisberg (1991) attempt to perform dimension reduction by estimating the conditional covariance of 𝐗\mathbf{X}, while Yin and Cook (2003) consider the conditional third moment. These methods slice the response to estimate the conditional moments. In the future, one can develop slicing-free methods to estimate these higher-order moments and conduct SDR.

Supplementary Materials

Supplement to “Slicing-free Inverse Regression in High-dimensional Sufficient Dimension Reduction”. In the supplement, we present additional simulation results and proofs.

Acknowledgements

The authors are grateful to the Editor, Associate Editor and anonymous referees, whose suggestions led to great improvement of this work. The authors contributed equally to this work and are listed in alphabetical order. Mai and Zhang’s research in this article is supported in part by National Science Foundation grant CCF-1908969. Shao’s research in this article is supported in part by National Science Foundation grant DMS-1607489.

References

  • Bura and Cook (2001) Bura, E. and Cook, R. D. (2001), ‘Extending sliced inverse regression: The weighted chi-squared test’, Journal of the American Statistical Association 96, 996–1003.
  • Bura and Yang (2011) Bura, E. and Yang, J. (2011), ‘Dimension estimation in sufficient dimension reduction: a unifying approach’, Journal of Multivariate analysis 102(1), 130–142.
  • Cai et al. (2013) Cai, T. T., Ma, Z. and Wu, Y. (2013), ‘Sparse PCA: Optimal rates and adaptive estimation’, Annals of Statistics 41, 3074–3110.
  • Chen et al. (2010) Chen, X., Zou, C., Cook, R. D. et al. (2010), ‘Coordinate-independent sparse sufficient dimension reduction and variable selection’, The Annals of Statistics 38(6), 3696–3723.
  • Chiaromonte et al. (2002) Chiaromonte, F., Cook, R. D. and Li, B. (2002), ‘Sufficient dimension reduction in regressions with categorical predictors’, The Annals of Statistics 30, 475–497.
  • Cook (2007) Cook, R. D. (2007), ‘Fisher lecture: Dimension reduction in regression (with discussion)’, Statistical Science 22, 1–26.
  • Cook and Forzani (2009) Cook, R. D. and Forzani, L. (2009), ‘Principal fitted components for dimension reduction in regression’, Statistical Science 23, 485–501.
  • Cook and Nachtsheim (1994) Cook, R. D. and Nachtsheim, C. J. (1994), ‘Reweighting to achieve elliptically contoured covariates in regression’, Journal of the American Statistical Association 89(426), 592–599.
  • Cook and Ni (2005) Cook, R. D. and Ni, L. (2005), ‘Sufficient dimension reduction via inverse regression: A minimum discrepancy approach’, Journal of the American Statistical Association 100(470), 410–428.
  • Cook and Weisberg (1991) Cook, R. D. and Weisberg, S. (1991), ‘Comment on “sliced inverse regression for dimension reduction”’, Journal of American Statistical Association 86, 328–332.
  • Cook and Zhang (2014) Cook, R. D. and Zhang, X. (2014), ‘Fused estimators of the central subspace in sufficient dimension reduction’, J. Amer. Statist. Assoc. 109, 815–827.
  • Cruz-Cano and Lee (2014) Cruz-Cano, R. and Lee, M.-L. T. (2014), ‘Fast regularized canonical correlation analysis’, Computational Statistics & Data Analysis 70, 88–100.
  • Hsing and Carroll (1992) Hsing, T. and Carroll, R. (1992), ‘An asymptotic theory for sliced inverse regression’, Ann. Statist 20, 1040–1061.
  • Huo and Székely (2016) Huo, X. and Székely, G. J. (2016), ‘Fast computing for distance covariance’, Technometrics 58(4), 435–447.
  • Kim et al. (2020) Kim, K., Li, B., Yu, Z., Li, L. et al. (2020), ‘On post dimension reduction statistical inference’, Annals of Statistics 48(3), 1567–1592.
  • Lee and Shao (2018) Lee, C. E. and Shao, X. (2018), ‘Martingale difference divergence matrix and its application to dimension reduction for stationary multivariate time series’, J. Amer. Statist. Assoc. 113, 216–229.
  • Li (2018) Li, B. (2018), Sufficient Dimension Reduction, Methods and Applications with R, CRC Press.
  • Li and Wang (2007) Li, B. and Wang, S. (2007), ‘On directional regression for dimension reduction’, Journal of the American Statistical Association 102, 997–1008.
  • Li et al. (2008) Li, B., Wen, S. and Zhu, L. (2008), ‘On a projective resampling method for dimension reduction with multivariate responses’, Journal of the American Statistical Association 103(483), 1177–1186.
  • Li (1991) Li, K. C. (1991), ‘Sliced inverse regression for dimension reduction’, Journal of the American Statistical Association 86, 316–327.
  • Li (2007) Li, L. (2007), ‘Sparse sufficient dimension reduction’, Biometrika 94(3), 603–613.
  • Lin et al. (2020) Lin, Q., Li, X., Huang, D. and Liu, J. (2020), ‘On the optimality of sliced inverse regression in high dimensions’, Annals of Statistics, to appear .
  • Lin et al. (2018) Lin, Q., Zhao, Z. and Liu, J. (2018), ‘On consistency and sparsity for sliced inverse regression in high dimension’, Annals of Statistics 46(2), 580–610.
  • Lin et al. (2019) Lin, Q., Zhao, Z. and Liu, J. S. (2019), ‘Sparse sliced inverse regression via lasso’, Journal of the American Statistical Association 114, 1726–1739.
  • Luo and Li (2016) Luo, W. and Li, B. (2016), ‘Combining eigenvalues and variation of eigenvectors for order determination’, Biometrika 103(4), 875–887.
  • Ma and Zhu (2012) Ma, Y. and Zhu, L. (2012), ‘A semiparametric approach to dimension reduction’, Journal of the American Statistical Association 107(497), 168–179.
  • Mai and Zhang (2019) Mai, Q. and Zhang, X. (2019), ‘An iterative penalized least squares approach to sparse canonical correlation analysis’, Biometrics 75(3), 734–744.
  • Mai and Zou (2015) Mai, Q. and Zou, H. (2015), ‘Nonparametric variable transformation in sufficient dimension reduction’, Technometrics 57(1), 1–10.
  • McKeague and Zhang (2020) McKeague, I. W. and Zhang, X. (2020), ‘Significance testing for canonical correlation analysis in high dimensions’, arXiv preprint arXiv:2010.08673 .
  • Serfling (1980) Serfling, R. J. (1980), Approximation Theorems of Mathematical Statistics, Wiley Series in Probability and Mathematical Statistics.
  • Shao and Zhang (2014) Shao, X. and Zhang, J. (2014), ‘Martingale difference correlation and its use in high dimensional variable screening’, J. Amer. Statist. Assoc. 109, 1302–1318.
  • Shoemaker (2006) Shoemaker, R. H. (2006), ‘The NCI60 human tumour cell line anticancer drug screen’, Nature Reviews Cancer 6(10), 813–823.
  • Székely et al. (2007) Székely, G. J., Rizzo, M. L. and Bakirov, N. K. (2007), ‘Measuring and testing dependence by correlation of distances’, Annals of Statistics 35, 2769–2794.
  • Tan, Wang, Zhang, Liu and Cook (2018) Tan, K. M., Wang, Z., Zhang, T., Liu, H. and Cook, R. D. (2018), ‘A convex formulation for high-dimensional sparse sliced inverse regression’, Biometrika 105(4), 769–782.
  • Tan et al. (2020) Tan, K., Shi, L. and Yu, Z. (2020), ‘Sparse SIR: Optimal rates and adaptive estimation’, The Annals of Statistics 48(1), 64–85.
  • Tan, Wang, Liu and Zhang (2018) Tan, K., Wang, Z., Liu, H. and Zhang, T. (2018), ‘Sparse generalized eigenvalue problem: Optimal statistical rates via truncated rayleigh flow’, Journal of the Royal Statistical Society: Series B 80(5), 1057–1086.
  • Vershynin (2018) Vershynin, R. (2018), High-dimensional probability: An introduction with applications in data science, Vol. 47, Cambridge university press.
  • Vu and Lei (2013) Vu, V. Q. and Lei, J. (2013), ‘Minimax sparse principal subspace estimation in high dimensions’, Annals of Statistics 41, 2905–2947.
  • Witten et al. (2009) Witten, D. M., Tibshirani, R. and Hastie, T. (2009), ‘A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis’, Biostatistics 10(3), 515–534.
  • Yin and Cook (2003) Yin, X. and Cook, R. D. (2003), ‘Estimating central subspaces via inverse third moments’, Biometrika 90, 113–125.
  • Yuan and Zhang (2013) Yuan, X.-T. and Zhang, T. (2013), ‘Truncated power method for sparse eigenvalue problems’, Journal of Machine Learning Research 14(Apr), 899–925.
  • Zhang et al. (2020) Zhang, X., Lee, C. E. and Shao, X. (2020), ‘Envelopes in multivariate regression models with nonlinearity and heteroscedasticity’, Biometrika 107, 965–981.
  • Zhou and He (2008) Zhou, J. and He, X. (2008), ‘Dimension reduction based on constrained canonical correlation and variable filtering’, The Annals of Statistics 36, 1649–1668.
  • Zhu and Ng (1995) Zhu, L. and Ng, K. W. (1995), ‘Asymptotics of sliced inverse regression’, Statist. Sinica 5, 727–736.
  • Zhu et al. (2010) Zhu, L.-P., Zhu, L.-X. and Feng, Z.-H. (2010), ‘Dimension reduction in regressions through cumulative slicing estimation’, Journal of the American Statistical Association 105(492), 1455–1466.
  • Zou et al. (2006) Zou, H., Hastie, T. and Tibshirani, R. (2006), ‘Sparse principal component analysis’, Journal of computational and graphical statistics 15(2), 265–286.

Qing Mai, Florida State University E-mail: [email protected]

Xiaofeng Shao, University of Illinois at Urbana-Champaign E-mail: [email protected]

Runmin Wang, Texas A&M University E-mail: [email protected]

Xin Zhang, Florida State University E-mail: [email protected]

Supplementary Materials for “Slicing-free Inverse Regression in High-Dimensional Sufficient Dimension Reduction”

Qing Mai, Xiaofeng Shao, Runmin Wang and Xin Zhang

In the supplement, we present additional simulation results in Section S1, additional real data analysis results in Section S2, a discussion of computational complexity in Section S3, the proofs of Proposition 2 and Lemma 1 in Section S4, the proof of Theorem 1 in Section S5, and the proofs of Theorems 2 and 3 in Section S6.

S1 Additional Simulation Results

In this section, we present all simulation results for the models described in Section 6, except those already presented in the main paper. Specifically, Tables S1 and S2 contain the results for the single index models (\mathcal{M}_{1} and \mathcal{M}_{2}), Tables S3 and S4 contain the results for the multiple index models (\mathcal{M}_{3}–\mathcal{M}_{5}), Table S5 summarizes the results for the univariate-response PFC model \mathcal{M}_{6}, and Table S6 gathers the remaining results for the multivariate response models (\mathcal{M}_{7}–\mathcal{M}_{9}).

The patterns are similar to those presented in the paper. Overall the newly proposed method outperforms the competitors in most scenarios, especially when the dimensionality is significantly larger than the sample size (i.e., high-dimensional setting). It is also observed that SIR-based methods are rather sensitive to the choice of number of slices, whereas our method is slicing-free and is thus easier to use in practice.

MDDM Oracle-SIR(3) Oracle-SIR(10) Rifle-SIR(3) Rifle-SIR(10) LassoSIR(3) LassoSIR(10)
Error SE Error SE Error SE Error SE Error SE Error SE Error SE
1\mathcal{M}_{1} n = 200 p = 200 10.3 0.1 12.6 0.1 10.5 0.1 16.4 0.6 30.0 1.2 28.5 0.2 30.6 0.3
p = 500 10.4 0.1 12.7 0.1 10.6 0.1 22.1 0.9 47.1 1.4 34.6 0.3 47.7 0.6
p = 800 10.1 0.1 12.5 0.1 10.3 0.1 25.2 1.0 53.7 1.4 37.9 0.4 59.9 0.7
p = 1200 10.0 0.1 12.4 0.1 10.3 0.1 27.5 1.0 54.3 1.4 42.4 0.5 71.4 0.7
p = 2000 10.1 0.1 12.6 0.1 10.5 0.1 29.7 1.1 63.4 1.4 48.4 0.6 81.7 0.6
n = 500 p = 200 6.3 0.1 7.6 0.1 6.3 0.1 8.9 0.4 12.1 0.7 15.0 0.1 12.8 0.1
p = 500 6.4 0.1 7.9 0.1 6.5 0.1 11.8 0.6 21.6 1.1 16.3 0.1 14.6 0.1
p = 800 6.2 0.1 7.7 0.1 6.4 0.1 13.2 0.7 26.1 1.2 16.6 0.1 16.2 0.2
p = 1200 6.4 0.1 7.6 0.1 6.4 0.1 13.4 0.7 30.3 1.3 17.3 0.2 17.9 0.2
p = 2000 6.3 0.1 7.7 0.1 6.3 0.1 14.9 0.8 32.9 1.3 18.3 0.2 21.8 0.3
n =800 p = 200 5.0 0.1 6.0 0.1 5.0 0.1 7.4 0.4 8.7 0.6 11.1 0.1 9.3 0.1
p = 500 5.2 0.1 6.1 0.1 5.1 0.1 8.3 0.4 12.9 0.8 11.9 0.1 10.1 0.1
p = 800 5.1 0.1 6.1 0.1 5.1 0.1 9.5 0.6 20.1 1.1 12.4 0.1 11.1 0.1
p = 1200 5.1 0.1 6.1 0.1 5.1 0.1 9.5 0.6 20.1 1.1 12.4 0.1 11.1 0.1
p = 2000 4.9 0.1 6.1 0.1 5.0 0.1 10.6 0.6 23.1 1.2 12.8 0.1 12.2 0.1
2\mathcal{M}_{2} n = 200 p = 200 10.4 0.1 13.1 0.1 10.7 0.1 17.5 0.6 29.7 1.2 30.3 0.2 31.6 0.3
p = 500 10.6 0.1 13.3 0.1 10.8 0.1 23.8 0.9 48.9 1.4 36.7 0.3 49.9 0.6
p = 800 10.3 0.1 13.1 0.1 10.6 0.1 26.1 1.0 54.7 1.4 40.1 0.4 61.5 0.7
p = 1200 54.7 0.8 12.9 0.1 10.4 0.1 74.4 0.8 95.1 0.4 45.0 0.5 71.4 0.7
p = 2000 55.3 0.8 13.1 0.1 10.6 0.1 76.5 0.7 96.8 0.3 51.2 0.6 82.7 0.6
n = 500 p = 200 6.4 0.1 8.0 0.1 6.5 0.1 9.7 0.4 12.0 0.7 15.8 0.1 13.4 0.1
p = 500 6.7 0.1 8.2 0.1 6.6 0.1 12.4 0.6 21.2 1.1 17.1 0.1 15.1 0.1
p = 800 6.5 0.1 8.0 0.1 6.5 0.1 13.6 0.7 25.1 1.2 17.6 0.2 16.7 0.2
p = 1200 17.1 0.7 8.0 0.1 6.5 0.1 37.4 1.1 74.6 1.0 18.2 0.2 18.3 0.2
p = 2000 16.9 0.8 8.0 0.1 6.5 0.1 38.3 1.2 77.4 0.9 19.0 0.2 22.2 0.3
n =800 p = 200 5.1 0.1 6.3 0.1 5.1 0.1 6.9 0.3 8.3 0.5 11.6 0.1 9.6 0.1
p = 500 5.3 0.1 6.4 0.1 5.2 0.1 7.8 0.4 13.1 0.8 12.5 0.1 10.5 0.1
p = 800 5.0 0.1 6.3 0.1 5.1 0.1 8.4 0.4 16.9 1.0 12.6 0.1 10.8 0.1
p = 1200 10.8 0.6 6.4 0.1 5.2 0.1 23.9 1.0 53.9 1.3 13.1 0.1 11.4 0.1
p = 2000 11.3 0.6 6.3 0.1 5.1 0.1 26.6 1.1 57.6 1.2 13.6 0.1 12.4 0.1
Table S1: d(V,V^)d(V,\hat{V}) and corresponding standard errors (in 10210^{-2}) for single index models (identity variance)
MDDM Oracle-SIR(3) Oracle-SIR(10) Rifle-SIR(3) Rifle-SIR(10) LassoSIR(3) LassoSIR(10)
Error SE Error SE Error SE Error SE Error SE Error SE Error SE
1\mathcal{M}_{1} n = 200 p = 200 17.6 0.2 20.8 0.2 17.8 0.2 26.2 0.5 25.3 0.7 32.1 0.3 28.4 0.3
p = 500 18.3 0.3 21.1 0.2 18.0 0.2 32.9 0.7 35.0 1.0 34.5 0.3 33.1 0.3
p = 800 18.7 0.3 21.0 0.2 17.6 0.2 34.7 0.8 39.8 1.1 35.3 0.3 35.5 0.3
p = 1200 26.1 0.6 21.3 0.2 18.0 0.2 47.0 1.0 81.2 0.7 36.7 0.3 41.2 0.4
p = 2000 26.3 0.7 21.0 0.2 17.7 0.2 47.8 1.0 81.4 0.7 39.1 0.4 49.6 0.6
n = 500 p = 200 10.8 0.1 13.2 0.1 11.0 0.1 13.7 0.2 12.7 0.4 18.6 0.2 15.3 0.1
p = 500 11.0 0.1 13.3 0.1 11.2 0.1 14.3 0.3 15.7 0.6 19.5 0.2 16.5 0.2
p = 800 10.9 0.1 13.5 0.1 11.1 0.1 15.1 0.4 18.4 0.8 20.1 0.2 17.1 0.2
p = 1200 14.2 0.5 13.4 0.1 11.1 0.1 23.1 0.8 49.9 1.1 20.3 0.2 17.8 0.2
p = 2000 13.6 0.5 13.5 0.1 11.2 0.1 26.9 0.9 53.8 1.2 20.7 0.2 18.6 0.2
n =800 p = 200 8.8 0.1 10.7 0.1 8.9 0.1 10.6 0.1 9.7 0.3 14.5 0.1 11.9 0.1
p = 500 8.7 0.1 10.5 0.1 8.9 0.1 11.1 0.3 9.9 0.3 14.8 0.1 12.5 0.1
p = 800 8.6 0.1 10.5 0.1 8.8 0.1 11.7 0.4 12.4 0.6 15.0 0.1 12.5 0.1
p = 1200 10.2 0.4 10.4 0.1 8.8 0.1 18.3 0.8 37.9 1.1 15.1 0.1 12.9 0.1
p = 2000 11.1 0.5 10.6 0.1 8.8 0.1 20.5 0.8 38.2 1.1 15.8 0.1 13.4 0.1
2\mathcal{M}_{2} n = 200 p = 200 14.0 0.2 20.8 0.2 14.8 0.2 25.4 0.5 21.4 0.7 31.8 0.3 23.3 0.2
p = 500 14.3 0.2 21.1 0.2 15.0 0.2 31.4 0.7 29.0 1.0 34.1 0.3 27.4 0.3
p = 800 14.2 0.2 20.7 0.2 14.8 0.2 33.1 0.7 33.6 1.1 34.6 0.3 30.5 0.3
p = 1200 23.9 0.7 21.1 0.2 14.9 0.2 45.1 1.0 75.9 0.9 36.9 0.3 34.9 0.4
p = 2000 24.9 0.8 20.6 0.2 14.6 0.2 47.9 1.0 77.2 0.9 38.6 0.4 42.5 0.5
n = 500 p = 200 8.7 0.1 13.2 0.1 9.0 0.1 13.7 0.3 10.4 0.4 18.6 0.2 12.6 0.1
p = 500 8.9 0.1 13.3 0.1 9.3 0.1 14.3 0.3 13.4 0.6 19.4 0.2 13.4 0.1
p = 800 8.8 0.1 13.3 0.1 9.1 0.1 14.0 0.3 14.9 0.7 19.9 0.2 13.8 0.1
p = 1200 12.8 0.5 13.3 0.1 9.2 0.1 24.3 0.8 45.9 1.2 20.1 0.2 14.6 0.1
p = 2000 12.5 0.5 13.5 0.1 9.3 0.1 27.0 0.9 49.7 1.2 20.6 0.2 15.1 0.1
n =800 p = 200 7.1 0.1 10.6 0.1 7.4 0.1 10.6 0.1 7.9 0.2 14.4 0.1 9.8 0.1
p = 500 7.1 0.1 10.6 0.1 7.3 0.1 10.9 0.2 9.3 0.4 15.0 0.1 10.2 0.1
p = 800 7.1 0.1 10.6 0.1 7.3 0.1 11.6 0.3 10.8 0.6 15.1 0.1 10.2 0.1
p = 1200 8.8 0.4 10.4 0.1 7.2 0.1 17.5 0.7 32.8 1.1 15.0 0.1 10.5 0.1
p = 2000 9.7 0.5 10.5 0.1 7.3 0.1 19.7 0.8 34.3 1.2 15.9 0.1 10.8 0.1
Table S2: d(V,V^)d(V,\hat{V}) and corresponding standard errors (in 10210^{-2}) for single index models (AR-type variance)
MDDM Oracle-SIR(3) Oracle-SIR(10) Rifle-SIR(3) Rifle-SIR(10) LassoSIR(3) LassoSIR(10)
Error SE Error SE Error SE Error SE Error SE Error SE Error SE
3\mathcal{M}_{3} n = 200 p = 200 17.5 0.2 40.6 0.2 27.7 0.2 71.3 0.0 71.2 0.0 69.6 0.2 67.0 0.3
p = 500 18.1 0.2 40.7 0.2 28.2 0.2 71.3 0.0 71.2 0.0 74.8 0.2 80.1 0.3
p = 800 17.7 0.2 40.8 0.2 27.7 0.2 71.3 0.0 71.2 0.0 76.5 0.2 85.0 0.2
p = 1200 17.9 0.2 40.7 0.2 28.0 0.2 31.7 0.4 18.6 0.2 78.4 0.2 88.8 0.2
p = 2000 18.1 0.2 40.8 0.2 28.0 0.2 32.1 0.4 18.9 0.2 80.6 0.2 92.5 0.2
n = 500 p = 200 10.6 0.1 27.8 0.2 17.0 0.1 70.9 0.0 70.9 0.0 48.6 0.3 30.1 0.2
p = 500 10.8 0.1 27.5 0.2 17.2 0.1 70.9 0.0 70.9 0.0 53.7 0.3 39.1 0.3
p = 800 10.8 0.1 27.4 0.2 17.3 0.1 70.9 0.0 70.9 0.0 57.1 0.2 46.0 0.4
p = 1200 10.7 0.1 27.4 0.2 17.1 0.1 18.9 0.2 11.0 0.1 59.1 0.2 53.0 0.4
p = 2000 10.7 0.1 27.6 0.2 17.1 0.1 19.0 0.2 11.2 0.1 62.1 0.2 60.7 0.4
n =800 p = 200 8.2 0.1 22.0 0.1 13.5 0.1 70.8 0.0 70.8 0.0 36.0 0.2 20.8 0.1
p = 500 8.3 0.1 21.9 0.1 13.5 0.1 70.8 0.0 70.8 0.0 40.4 0.2 24.0 0.2
p = 800 8.2 0.1 21.9 0.1 13.3 0.1 70.8 0.0 70.8 0.0 42.2 0.3 26.2 0.2
p = 1200 8.3 0.1 22.0 0.1 13.4 0.1 14.9 0.1 8.7 0.1 44.7 0.2 29.8 0.3
p = 2000 8.3 0.1 22.2 0.1 13.5 0.1 14.8 0.1 8.6 0.1 47.9 0.3 36.4 0.4
4\mathcal{M}_{4} n = 200 p = 200 23.1 0.2 46.2 0.3 36.3 0.3 72.0 0.0 71.6 0.0 78.1 0.2 78.2 0.3
p = 500 22.8 0.2 45.8 0.3 35.9 0.3 72.1 0.0 71.6 0.0 83.0 0.2 87.8 0.2
p = 800 23.0 0.2 45.8 0.3 36.4 0.3 71.9 0.0 71.6 0.0 85.2 0.2 91.5 0.2
p = 1200 23.2 0.3 45.8 0.3 36.3 0.3 38.1 0.4 25.2 0.3 87.3 0.2 93.8 0.2
p = 2000 23.1 0.3 45.7 0.3 36.2 0.3 54.0 0.5 34.1 0.5 89.2 0.2 95.9 0.1
n = 500 p = 200 13.4 0.1 31.0 0.2 21.7 0.1 71.2 0.0 71.0 0.0 55.3 0.3 43.2 0.3
p = 500 13.5 0.1 31.0 0.2 21.6 0.1 71.2 0.0 71.0 0.0 61.3 0.3 55.3 0.4
p = 800 13.4 0.1 31.0 0.2 21.8 0.1 71.2 0.0 71.0 0.0 64.7 0.2 62.8 0.4
p = 1200 13.4 0.1 30.8 0.2 21.6 0.1 21.5 0.2 14.3 0.1 67.2 0.2 67.8 0.3
p = 2000 13.5 0.1 31.0 0.2 21.8 0.1 22.0 0.2 14.4 0.1 69.7 0.2 73.3 0.3
n =800 p = 200 10.5 0.1 25.2 0.2 17.2 0.1 71.0 0.0 70.9 0.0 42.7 0.2 28.7 0.2
p = 500 10.4 0.1 24.9 0.2 17.1 0.1 71.0 0.0 70.9 0.0 47.3 0.3 34.2 0.3
p = 800 10.5 0.1 25.1 0.2 17.1 0.1 71.0 0.0 70.9 0.0 50.2 0.3 39.9 0.4
p = 1200 10.3 0.1 24.9 0.2 17.0 0.1 16.9 0.2 11.0 0.1 52.4 0.3 45.8 0.4
p = 2000 10.6 0.1 25.2 0.2 17.2 0.1 17.3 0.2 11.3 0.1 56.5 0.3 53.8 0.4
5\mathcal{M}_{5} n = 200 p = 200 30.6 0.6 29.1 0.1 22.0 0.1 71.6 0.0 71.2 0.0 58.8 0.3 56.6 0.3
p = 500 30.4 0.6 28.9 0.2 22.2 0.1 71.5 0.0 71.2 0.0 67.4 0.3 73.5 0.4
p = 800 30.8 0.6 28.8 0.2 22.1 0.1 71.6 0.0 71.2 0.0 71.2 0.3 81.3 0.3
p= 1200 31.0 0.6 29.1 0.1 22.3 0.1 20.0 0.2 14.6 0.1 74.1 0.3 86.4 0.3
p = 2000 31.3 0.6 28.7 0.2 22.1 0.1 19.3 0.2 14.4 0.1 77.8 0.3 90.3 0.2
n = 500 p = 200 12.4 0.2 18.4 0.1 13.6 0.1 71.1 0.0 70.9 0.0 31.7 0.2 24.5 0.2
p = 500 11.8 0.2 18.4 0.1 13.6 0.1 71.0 0.0 70.9 0.0 35.9 0.2 29.2 0.2
p = 800 11.9 0.2 18.4 0.1 13.6 0.1 71.0 0.0 70.9 0.0 37.9 0.3 32.9 0.3
p = 1200 11.6 0.2 18.4 0.1 13.7 0.1 12.1 0.1 8.6 0.1 39.7 0.3 37.3 0.3
p = 2000 12.0 0.2 18.5 0.1 13.6 0.1 12.1 0.1 8.6 0.1 42.1 0.3 46.3 0.4
n = 800 p = 200 7.9 0.1 14.7 0.1 10.7 0.1 70.9 0.0 70.8 0.0 23.8 0.1 17.4 0.1
p = 500 7.8 0.1 14.5 0.1 10.6 0.1 70.9 0.0 70.8 0.0 25.7 0.2 19.2 0.1
p = 800 7.8 0.1 14.6 0.1 10.7 0.1 70.9 0.0 70.8 0.0 27.1 0.2 20.7 0.2
p = 1200 7.7 0.1 14.6 0.1 10.6 0.1 9.3 0.1 6.5 0.1 28.3 0.2 21.6 0.2
p = 2000 7.9 0.1 14.4 0.1 10.7 0.1 9.4 0.1 6.6 0.1 29.9 0.2 25.0 0.2
Table S3: d(V,V^)d(V,\hat{V}) and corresponding standard errors (in 10210^{-2}) for multiple index models (identity variance)
MDDM Oracle-SIR(3) Oracle-SIR(10) Rifle-SIR(3) Rifle-SIR(10) LassoSIR(3) LassoSIR(10)
Error SE Error SE Error SE Error SE Error SE Error SE Error SE
3\mathcal{M}_{3} n = 200 p = 200 41.4 0.4 58.8 0.2 50.2 0.2 72.6 0.0 72.2 0.0 67.0 0.2 62.8 0.2
p = 500 42.1 0.4 59.0 0.2 50.4 0.2 72.9 0.1 72.4 0.0 70.3 0.2 71.1 0.2
p = 800 43.0 0.4 59.1 0.2 50.4 0.2 72.7 0.0 72.3 0.0 72.1 0.2 75.1 0.2
p = 1200 42.7 0.4 59.4 0.2 50.8 0.2 53.0 0.4 39.9 0.4 73.0 0.2 78.6 0.2
p = 2000 43.5 0.4 59.2 0.2 50.8 0.2 53.5 0.4 40.2 0.4 74.8 0.2 82.1 0.2
n = 500 p = 200 25.6 0.3 44.6 0.2 34.3 0.2 71.4 0.0 71.3 0.0 51.5 0.2 42.4 0.2
p = 500 25.8 0.3 44.4 0.2 34.5 0.2 71.5 0.0 71.3 0.0 53.8 0.2 45.7 0.2
p = 800 25.2 0.3 44.6 0.2 34.1 0.2 71.5 0.0 71.3 0.0 54.8 0.2 47.1 0.3
p = 1200 25.9 0.3 44.4 0.2 34.4 0.2 33.8 0.4 23.8 0.2 55.8 0.2 50.1 0.3
p = 2000 25.7 0.3 44.3 0.2 34.4 0.2 35.0 0.4 23.8 0.2 56.9 0.2 54.0 0.3
n =800 p = 200 20.1 0.2 37.0 0.2 27.8 0.2 71.2 0.0 71.0 0.0 43.8 0.2 34.9 0.2
p = 500 19.8 0.2 37.8 0.2 27.6 0.2 71.2 0.0 71.1 0.0 46.1 0.2 36.6 0.2
p = 800 20.0 0.2 37.3 0.2 27.9 0.2 71.2 0.0 71.0 0.0 47.0 0.2 37.9 0.2
p = 1200 19.8 0.2 37.1 0.2 27.8 0.2 26.7 0.3 18.5 0.2 47.6 0.2 38.6 0.2
p = 2000 19.9 0.2 37.4 0.2 27.7 0.2 27.5 0.3 18.7 0.2 48.8 0.2 40.7 0.2
4\mathcal{M}_{4} n = 200 p = 200 58.1 0.4 75.9 0.2 70.1 0.3 80.4 0.2 78.2 0.2 85.7 0.2 84.6 0.2
p = 500 59.3 0.5 75.8 0.2 70.2 0.3 95.2 0.1 94.2 0.1 88.8 0.2 90.2 0.2
p = 800 59.1 0.5 75.1 0.2 69.9 0.3 81.0 0.2 78.7 0.2 89.7 0.2 92.1 0.2
p = 1200 59.5 0.5 75.3 0.2 69.9 0.3 73.9 0.3 62.3 0.5 90.5 0.2 93.9 0.2
p = 2000 60.4 0.5 75.4 0.2 69.9 0.3 74.1 0.4 63.0 0.5 91.9 0.2 95.2 0.1
n = 500 p = 200 38.6 0.4 61.4 0.2 50.9 0.3 74.5 0.1 73.5 0.1 70.5 0.2 63.0 0.3
p = 500 38.7 0.4 61.6 0.2 51.1 0.3 74.7 0.1 73.4 0.1 74.0 0.2 69.0 0.3
p = 800 39.3 0.4 61.3 0.2 50.8 0.2 77.5 0.2 74.9 0.1 75.6 0.2 72.5 0.3
p = 1200 39.5 0.4 61.3 0.2 51.0 0.2 54.9 0.4 40.2 0.4 77.1 0.2 76.0 0.3
p = 2000 39.7 0.4 61.5 0.2 51.3 0.3 56.1 0.4 40.6 0.4 78.7 0.2 79.7 0.3
n =800 p = 200 30.7 0.3 53.1 0.2 42.2 0.2 73.1 0.1 72.4 0.0 61.3 0.2 51.4 0.3
p = 500 30.5 0.3 52.9 0.2 42.0 0.2 73.1 0.1 72.4 0.0 64.1 0.2 54.4 0.3
p = 800 30.6 0.3 53.7 0.2 42.3 0.2 73.4 0.1 72.5 0.0 65.9 0.2 58.2 0.3
p = 1200 30.8 0.3 53.5 0.2 42.2 0.2 45.3 0.4 31.5 0.3 67.2 0.2 60.8 0.3
p = 2000 30.7 0.3 53.6 0.2 42.2 0.2 45.2 0.4 31.1 0.3 68.6 0.2 64.6 0.3
5\mathcal{M}_{5} n = 200 p = 200 45.5 0.6 46.5 0.2 35.7 0.2 73.9 0.1 72.4 0.0 62.6 0.2 51.8 0.2
p = 500 46.9 0.6 46.9 0.2 35.8 0.2 73.9 0.1 72.4 0.0 65.6 0.2 57.4 0.3
p = 800 46.2 0.6 46.4 0.2 35.5 0.2 73.8 0.1 72.4 0.0 66.5 0.2 61.4 0.3
p = 1200 46.9 0.6 46.3 0.2 35.8 0.2 33.0 0.3 23.9 0.2 67.2 0.3 65.7 0.3
p = 2000 46.1 0.6 46.5 0.2 35.7 0.2 33.6 0.3 23.8 0.2 68.8 0.3 72.0 0.3
n = 500 p = 200 24.7 0.4 31.4 0.2 22.6 0.1 72.0 0.0 71.3 0.0 44.8 0.2 32.6 0.1
p = 500 24.7 0.4 31.1 0.2 22.4 0.1 71.9 0.0 71.3 0.0 46.8 0.2 35.4 0.2
p = 800 25.4 0.4 31.3 0.2 22.9 0.1 72.0 0.0 71.3 0.0 48.0 0.2 36.7 0.2
p = 1200 25.2 0.4 31.5 0.2 22.7 0.1 20.8 0.2 14.3 0.1 48.2 0.2 37.6 0.2
p = 2000 25.5 0.4 31.7 0.2 22.8 0.1 20.9 0.2 14.4 0.1 49.1 0.2 39.5 0.2
n =800 p = 200 18.7 0.3 25.6 0.1 18.0 0.1 71.5 0.0 71.1 0.0 36.7 0.1 25.6 0.1
p = 500 18.1 0.3 25.3 0.1 17.7 0.1 71.5 0.0 71.1 0.0 38.4 0.2 27.1 0.1
p = 800 18.6 0.3 25.4 0.1 18.0 0.1 71.5 0.0 71.1 0.0 39.3 0.2 28.4 0.1
p = 1200 18.7 0.3 25.3 0.1 17.9 0.1 16.3 0.1 11.0 0.1 40.2 0.2 29.2 0.1
p = 2000 19.1 0.3 25.3 0.1 18.0 0.1 16.5 0.1 11.2 0.1 41.0 0.2 30.3 0.1
Table S4: d(V,V^)d(V,\hat{V}) and corresponding standard errors (in 10210^{-2}) for multiple index models (AR-type variance)
MDDM Oracle-SIR(3) Oracle-SIR(10) Rifle-SIR(3) Rifle-SIR(10) LassoSIR(3) LassoSIR(10)
Error SE Error SE Error SE Error SE Error SE Error SE Error SE
6\mathcal{M}_{6} n = 200 p = 200 34.3 0.5 49.1 0.5 33.5 0.4 50.0 0.7 30.6 0.5 70.7 0.0 70.7 0.0
p = 500 34.2 0.5 48.7 0.5 32.8 0.4 49.5 0.7 30.1 0.5 70.7 0.0 70.7 0.0
p = 800 34.6 0.6 48.9 0.5 33.4 0.5 50.1 0.7 30.8 0.6 70.7 0.0 70.7 0.0
p = 1200 44.5 0.7 49.3 0.5 33.6 0.5 67.0 0.7 38.9 0.7 70.7 0.0 70.7 0.0
p = 2000 44.8 0.8 48.1 0.5 33.2 0.5 66.7 0.8 39.4 0.7 70.7 0.0 70.7 0.0
n = 500 p = 200 22.0 0.4 35.5 0.4 22.6 0.3 33.0 0.5 19.0 0.3 70.7 0.0 70.7 0.0
p = 500 21.8 0.4 34.7 0.4 22.3 0.3 32.4 0.5 18.8 0.4 70.7 0.0 70.7 0.0
p = 800 21.7 0.3 34.6 0.4 22.5 0.3 32.3 0.5 18.9 0.3 70.7 0.0 70.7 0.0
p = 1200 27.1 0.5 35.5 0.4 22.5 0.3 43.4 0.7 23.4 0.4 70.7 0.0 70.7 0.0
p = 2000 26.9 0.5 34.8 0.4 22.4 0.3 42.3 0.7 23.6 0.5 70.7 0.0 70.7 0.0
n =800 p = 200 17.2 0.3 29.2 0.3 18.6 0.3 25.8 0.4 14.9 0.2 70.7 0.0 70.7 0.0
p = 500 16.7 0.3 28.5 0.3 18.1 0.2 25.3 0.4 14.3 0.3 70.7 0.0 70.7 0.0
p = 800 16.5 0.2 28.4 0.3 17.9 0.2 25.0 0.4 14.1 0.2 70.7 0.0 70.7 0.0
p =1200 20.8 0.4 29.0 0.4 18.2 0.3 31.7 0.5 18.5 0.4 70.7 0.0 70.7 0.0
p = 2000 21.2 0.4 28.7 0.4 18.3 0.3 32.4 0.6 18.8 0.3 70.7 0.0 70.7 0.0
Table S5: d(V,V^)d(V,\hat{V}) and corresponding standard errors (in 10210^{-2}) for isotropic PFC models (6\mathcal{M}_{6})
n=100n=100 n=200n=200 n=400n=400
p=800p=800 p=1200p=1200 p=800p=800 p=1200p=1200 p=800p=800 p=1200p=1200
Error SE Error SE Error SE Error SE Error SE Error SE
7\mathcal{M}_{7} MDDM 45.0 0.5 45.9 0.5 27.5 0.3 28.7 0.4 18.8 0.3 19.6 0.3
PR-Oracle-SIR(3) 26.3 0.2 26.5 0.2 18.3 0.1 18.4 0.2 12.6 0.1 12.6 0.1
PR-Oracle-SIR(10) 33.0 0.3 33.0 0.3 20.1 0.2 20.2 0.2 13.0 0.1 13.1 0.1
PR-SIR(3) 96.6 0.0 97.7 0.0 93.2 0.0 95.3 0.0 87.6 0.0 91.1 0.0
PR-SIR(10) 97.4 0.0 98.2 0.0 95.0 0.0 96.6 0.0 90.0 0.1 93.7 0.0
8\mathcal{M}_{8} MDDM 93.5 0.6 94.9 0.5 73.6 1.1 77.1 1.1 36.7 1.1 37.8 1.2
PR-Oracle-SIR(3) 80.1 0.6 80.1 0.6 64.0 0.7 64.7 0.7 42.5 0.5 41.1 0.5
PR-Oracle-SIR(10) 79.1 0.6 78.9 0.6 58.4 0.6 58.7 0.6 34.5 0.4 34.2 0.4
PR-SIR(3) 99.9 0.0 100.0 0.0 99.9 0.0 100.0 0.0 99.9 0.0 99.9 0.0
PR-SIR(10) 99.9 0.0 100.0 0.0 99.9 0.0 100.0 0.0 99.9 0.0 100.0 0.0
9\mathcal{M}_{9} MDDM 17.3 0.4 17.5 0.4 10.0 0.1 10.1 0.1 7.1 0.1 7.1 0.1
PR-Oracle-SIR(3) 15.5 0.2 15.2 0.2 10.6 0.1 10.6 0.1 7.5 0.1 7.5 0.1
PR-Oracle-SIR(10) 14.2 0.2 13.9 0.2 9.6 0.1 9.6 0.1 6.8 0.1 6.8 0.1
PR-SIR(3) 92.2 0.1 95.0 0.1 83.0 0.1 88.3 0.1 71.0 0.1 77.9 0.1
PR-SIR(10) 90.0 0.1 93.4 0.1 80.1 0.1 85.8 0.1 67.4 0.1 74.7 0.1
Table S6: Averaged subspace estimation errors and the corresponding standard errors (multiplied by 100) for the multivariate response models (p\in\{800,1200\}).

S2 More on Real Data Analysis

S2.1 Choice of Tuning Parameter on Real Data

To apply our proposal on real data, we need to determine the tuning parameter s, i.e., the desired level of sparsity. In penalized problems such as sparse PCA and sparse SIR, tuning parameters are often chosen with cross-validation. We could also employ cross-validation to choose s. However, as with almost any procedure, cross-validation would considerably slow down the computation. Moreover, as we observe in our theoretical and simulation studies, our method is not very sensitive to s; the result is reasonably stable as long as s is larger than d. Hence, we resort to a faster tuning method on the real data as follows. We start with a sequence of reasonable sparsity levels \mathcal{S}, which is set to be \{1,\ldots,45\}. For each element of \mathcal{S}, we calculate \hat{\boldsymbol{\beta}} and the sample distance covariance (Székely et al. 2007) between \mathbf{Y}_{i} and \hat{\boldsymbol{\beta}}^{\mathrm{\tiny T}}\mathbf{X}_{i}, i=1,2,\ldots,n. Here the distance covariance is used as a model-free measure of the dependence between \mathbf{Y}_{i} and \hat{\boldsymbol{\beta}}^{\mathrm{\tiny T}}\mathbf{X}_{i}. Intuitively, the distance covariance increases as the pre-specified sparsity increases. Therefore, we plot the sample distance covariance against the sparsity levels in Figure S1 and pick the smallest sparsity level beyond which larger sparsity levels do not lead to a significantly larger distance covariance (i.e., the “elbow method”). Based on Figure S1, a sparsity level between 20 and 25 seems reasonable, and we pick s=25 as the pre-specified sparsity.
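This tuning rule can be sketched as follows (our own code), using a textbook V-statistic implementation of the squared sample distance covariance; fit_beta is a hypothetical stand-in that returns the estimated basis \hat{\boldsymbol{\beta}} at a given sparsity level s, and the elbow is then read off the resulting curve as in Figure S1.

import numpy as np

def dist_matrix(Z):
    # pairwise Euclidean distance matrix of the rows of Z
    sq = np.sum(Z ** 2, axis=1)
    return np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2 * Z @ Z.T, 0.0))

def double_center(D):
    return D - D.mean(axis=0, keepdims=True) - D.mean(axis=1, keepdims=True) + D.mean()

def distance_covariance_sq(Z, Y):
    # squared sample distance covariance (V-statistic form of Szekely et al. 2007)
    A, B = double_center(dist_matrix(Z)), double_center(dist_matrix(Y))
    return np.mean(A * B)

def tuning_curve(X, Y, fit_beta, sparsity_grid):
    # dependence between Y and beta_hat(s)^T X over candidate sparsity levels s
    return [distance_covariance_sq(X @ fit_beta(X, Y, s), Y) for s in sparsity_grid]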

Figure S1: Distance covariance for different pre-specified sparsities.

S2.2 Additional Real Data Analysis Results

To further demonstrate our method on the real data, we also construct the scatterplot of the leading two directions \boldsymbol{\beta}_{1}^{\mathrm{\tiny T}}\mathbf{X} and \boldsymbol{\beta}_{2}^{\mathrm{\tiny T}}\mathbf{X} in Figure S2. For the first direction, the experimental units from leukemia (LE) and colorectal cancer (CO) lie on different sides of the plot. For the second direction, CO and LE have similar values, whereas the units obtained from central nervous system cancer (CN), breast cancer (BR), lung cancer (LC) and ovarian cancer (OV) are on the opposite side of the figure. This pattern coincides with the first two canonical correlation analysis directions in a previous study of the same data set (Figure 8, Cruz-Cano and Lee 2014). Such a finding is very encouraging, as our slicing-free approach automatically detects the most significant associations between the response and the predictors while being applied directly to the multivariate response.

Figure S2: The scatter plot of the first two leading directions.

S3 Computational Complexity

We briefly analyze the computational complexity of our proposed methods. For both algorithms, we need to compute the sample MDDM. The current computational complexity of sample MDDM is O(n2p)O(n^{2}p). If we adapt the fast computing algorithm of Huo and Székely (2016) developed for distance correlation to MDDM, we might be able to reduce the complexity to O(pnlogn)O(pn\log n). In Algorithm 2, we further need to compute the sample covariance at the complexity level of O(np2)O(np^{2}).
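For reference, a direct vectorized computation of the sample MDDM from its definition (the centered double sum displayed in Section S5) can be sketched as follows (our own code); in our setting \mathbf{V} plays the role of the predictors and \mathbf{U} the role of the response, and the work reduces to one n\times n pairwise distance matrix and two matrix products.

import numpy as np

def sample_mddm(V, U):
    # MDDM_n(V | U) = -(1/n^2) * sum_{k,l} (V_k - Vbar)(V_l - Vbar)^T |U_k - U_l|,
    # computed here as -(1/n^2) * Vc^T D Vc with D the pairwise distance matrix of U
    n = V.shape[0]
    Vc = V - V.mean(axis=0)                                # center the predictors
    sq = np.sum(U ** 2, axis=1)
    D = np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2 * U @ U.T, 0.0))
    return -(Vc.T @ D @ Vc) / n ** 2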

To apply the two algorithms, we assume that the maximum number of iterations is TT in finding the KK directions (Step 3(a) in both algorithms). After obtaining MDDM, Algorithm 1 has a computational complexity of O(KT(ps+p)+(K1)(s2+ps))O(KT(ps+p)+(K-1)(s^{2}+ps)), where (ps+p)(ps+p) is the computation complexity of each iteration (Yuan and Zhang 2013) and (s2+ps)(s^{2}+ps) is the computation complexity of deflating 𝐌^k\widehat{\mathbf{M}}_{k} in Step 3(b). Note that, in Step 3(b), we repeatedly exploit the sparsity of 𝜷^k\widehat{\boldsymbol{\beta}}_{k} to reduce the computation complexity. For example, to compute 𝜷^k𝜷^kT\widehat{\boldsymbol{\beta}}_{k}\widehat{\boldsymbol{\beta}}_{k}^{\mathrm{\tiny T}}, we only need to compute the s2s^{2} nonzero elements, and the same applies to 𝜷^kT𝐌^𝜷^k\widehat{\boldsymbol{\beta}}_{k}^{\mathrm{\tiny T}}\widehat{\mathbf{M}}\widehat{\boldsymbol{\beta}}_{k}, which has a computational complexity of O(ps)O(ps). For Algorithm 2, the computational complexity is O(KT(ps+p)+(K1)(p2+ps))O(KT(ps+p)+(K-1)(p^{2}+ps)). Note that this computational complexity only differs from that of Algorithm 1 in the term p2p^{2}. This term is the cost to deflate 𝐌^\widehat{\mathbf{M}} in Step 3(b). Since 𝚺^𝐗𝜷^k𝜷^kT𝚺^𝐗\widehat{\bm{\Sigma}}_{\mathbf{X}}\widehat{\boldsymbol{\beta}}_{k}\widehat{\boldsymbol{\beta}}_{k}^{\mathrm{\tiny T}}\widehat{\bm{\Sigma}}_{\mathbf{X}} is not guaranteed to be sparse as 𝜷^k𝜷^kT\widehat{\boldsymbol{\beta}}_{k}\widehat{\boldsymbol{\beta}}_{k}^{\mathrm{\tiny T}}, the deflation is more computationally expensive. Otherwise, each iteration in Step 3(a) of Algorithm 2 has the same complexity as Step 3(a) of Algorithm 1 (Tan, Wang, Liu and Zhang 2018).

S4 Proofs for Proposition 2 & Lemma 1

Proof of Proposition 2.

From the basic properties of MDDM (c.f. beginning of Section 1), E(𝐀T𝐗Y)=E(𝐀T𝐗)\mathrm{E}(\mathbf{A}^{\mathrm{\tiny T}}\mathbf{X}\mid Y)=\mathrm{E}(\mathbf{A}^{\mathrm{\tiny T}}\mathbf{X}) is equivalent to MDDM(𝐀T𝐗Y)=0\mathrm{MDDM}(\mathbf{A}^{\mathrm{\tiny T}}\mathbf{X}\mid Y)=0.

Suppose the rank of MDDM(𝐗Y)\mathrm{MDDM}(\mathbf{X}\mid Y) is dd, then there exists an orthogonal basis matrix for p\mathbb{R}^{p}, (𝜷,𝜷0)(\boldsymbol{\beta},\boldsymbol{\beta}_{0}) with 𝜷p×d\boldsymbol{\beta}\in\mathbb{R}^{p\times d} and 𝜷0p×(pd)\boldsymbol{\beta}_{0}\in\mathbb{R}^{p\times(p-d)}, such that span(𝜷)=span{MDDM(𝐗Y)}\mathrm{span}(\boldsymbol{\beta})=\mathrm{span}\{\mathrm{MDDM}(\mathbf{X}\mid Y)\}. This implies 𝜷0TMDDM(𝐗Y)=0\boldsymbol{\beta}_{0}^{\mathrm{\tiny T}}\mathrm{MDDM}(\mathbf{X}\mid Y)=0 and equivalently 𝜷0T{E(𝐗Y)E(𝐗)}=0\boldsymbol{\beta}_{0}^{\mathrm{\tiny T}}\{\mathrm{E}(\mathbf{X}\mid Y)-\mathrm{E}(\mathbf{X})\}=0. Therefore, span(𝜷0)𝒮E(𝐗Y)\mathrm{span}(\boldsymbol{\beta}_{0})\subseteq\mathcal{S}_{\mathrm{E}(\mathbf{X}\mid Y)}^{\perp}, which leads to 𝒮E(𝐗Y)span{E(𝐗Y)E(𝐗)}span(𝜷)\mathcal{S}_{\mathrm{E}(\mathbf{X}\mid Y)}\equiv\mathrm{span}\{\mathrm{E}(\mathbf{X}\mid Y)-\mathrm{E}(\mathbf{X})\}\subseteq\mathrm{span}(\boldsymbol{\beta}).

Similarly, for any vector 𝐯𝒮E(𝐗Y)\mathbf{v}\in\mathcal{S}_{\mathrm{E}(\mathbf{X}\mid Y)}^{\perp} we have 𝐯T{E(𝐗Y)E(𝐗)}=0\mathbf{v}^{\mathrm{\tiny T}}\{\mathrm{E}(\mathbf{X}\mid Y)-\mathrm{E}(\mathbf{X})\}=0 and hence 𝐯TMDDM(𝐗Y)𝐯=0\mathbf{v}^{\mathrm{\tiny T}}\mathrm{MDDM}(\mathbf{X}\mid Y)\mathbf{v}=0. This implies that 𝐯span(𝜷0)\mathbf{v}\in\mathrm{span}(\boldsymbol{\beta}_{0}) and hence, 𝒮E(𝐗Y)span(𝜷0)\mathcal{S}_{\mathrm{E}(\mathbf{X}\mid Y)}^{\perp}\subseteq\mathrm{span}(\boldsymbol{\beta}_{0}) and span(𝜷)𝒮E(𝐗Y)\mathrm{span}(\boldsymbol{\beta})\subseteq\mathcal{S}_{\mathrm{E}(\mathbf{X}\mid Y)}.

For the proof of Lemma 1, we need the following elementary lemma. We include its proof for completeness.

Lemma S2.

Let 𝛂k{\bm{\alpha}}_{k} be the normalized kkth eigenvector of 𝚺𝐗1/2𝐌𝚺𝐗1/2{\bm{\Sigma}}_{\mathbf{X}}^{-1/2}\mathbf{M}{\bm{\Sigma}}_{\mathbf{X}}^{-1/2}. Then we must have 𝛃k=𝚺𝐗1/2𝛂k\boldsymbol{\beta}_{k}={\bm{\Sigma}}_{\mathbf{X}}^{-1/2}{\bm{\alpha}}_{k}.

Proof of Lemma S2.

Let 𝜶=𝚺𝐗1/2𝜷{\bm{\alpha}}={\bm{\Sigma}}_{\mathbf{X}}^{1/2}\boldsymbol{\beta} in (4.8), we have that 𝜷k=𝚺𝐗1/2𝜶k\boldsymbol{\beta}_{k}={\bm{\Sigma}}_{\mathbf{X}}^{-1/2}{\bm{\alpha}}_{k}, where

𝜶k=argmax𝜶𝜶T𝚺𝐗1/2𝐌𝚺𝐗1/2𝜶 s.t 𝜶T𝜶=1,𝜶T𝜶l=0,l<k{\bm{\alpha}}_{k}=\arg\max_{{\bm{\alpha}}}{\bm{\alpha}}^{\mathrm{\tiny T}}{\bm{\Sigma}}_{\mathbf{X}}^{-1/2}\mathbf{M}{\bm{\Sigma}}_{\mathbf{X}}^{-1/2}{\bm{\alpha}}\mbox{ s.t ${\bm{\alpha}}^{\mathrm{\tiny T}}{\bm{\alpha}}=1,{\bm{\alpha}}^{\mathrm{\tiny T}}{\bm{\alpha}}_{l}=0,l<k$} (S1)

It is easy to see that 𝜶k{\bm{\alpha}}_{k} is the kkth eigenvector of 𝚺𝐗1/2𝐌𝚺𝐗1/2{\bm{\Sigma}}_{\mathbf{X}}^{-1/2}\mathbf{M}{\bm{\Sigma}}_{\mathbf{X}}^{-1/2} and the conclusion follows. ∎

Proof of Lemma 1.

By Lemma S2, we have 𝚺𝐗1/2𝐌𝚺𝐗1/2=j=1pλj𝜶j𝜶jT{\bm{\Sigma}}_{\mathbf{X}}^{-1/2}\mathbf{M}{\bm{\Sigma}}_{\mathbf{X}}^{-1/2}=\sum_{j=1}^{p}\lambda_{j}{\bm{\alpha}}_{j}{\bm{\alpha}}_{j}^{\mathrm{\tiny T}}. It follows that 𝐌=𝚺𝐗{j=1pλj𝜷j𝜷jT}𝚺𝐗\mathbf{M}={\bm{\Sigma}}_{\mathbf{X}}\{\sum_{j=1}^{p}\lambda_{j}\boldsymbol{\beta}_{j}\boldsymbol{\beta}_{j}^{\mathrm{\tiny T}}\}{\bm{\Sigma}}_{\mathbf{X}}. Hence, 𝐌k=𝚺𝐗{j=kpλj𝜷j𝜷jT}𝚺𝐗\mathbf{M}_{k}={\bm{\Sigma}}_{\mathbf{X}}\{\sum_{j=k}^{p}\lambda_{j}\boldsymbol{\beta}_{j}\boldsymbol{\beta}_{j}^{\mathrm{\tiny T}}\}{\bm{\Sigma}}_{\mathbf{X}} and 𝜷k\boldsymbol{\beta}_{k} is its leading generalized eigenvector subject to 𝜷T𝚺𝐗𝜷=1\boldsymbol{\beta}^{\mathrm{\tiny T}}{\bm{\Sigma}}_{\mathbf{X}}\boldsymbol{\beta}=1. ∎

S5 Proof for Theorem 1

Let (\mathbf{V}_{k\cdot},\mathbf{U}_{k\cdot})_{k=1}^{n} be an iid sample from the joint distribution of (\mathbf{V},\mathbf{U}), where \mathbf{V}_{k\cdot}=(V_{k1},\cdots,V_{kp})^{\mathrm{\tiny T}} is the kth sample. Write \mathrm{MDDM}(\mathbf{V}|\mathbf{U})=-\mathrm{E}\{(\mathbf{V}-\mathrm{E}(\mathbf{V}))(\mathbf{V}^{\prime}-\mathrm{E}(\mathbf{V}^{\prime}))^{\mathrm{\tiny T}}|\mathbf{U}-\mathbf{U}^{\prime}|_{q}\}=-\mathbf{R}-\mathbf{S}+2\mathbf{T}, where

\mathbf{R} = \mathrm{E}(\mathbf{V}\mathbf{V}^{\prime{\mathrm{\tiny T}}}|\mathbf{U}-\mathbf{U}^{\prime}|_{q}),
\mathbf{S} = \mathrm{E}(\mathbf{V})\mathrm{E}(\mathbf{V}^{\prime})^{\mathrm{\tiny T}}\mathrm{E}(|\mathbf{U}-\mathbf{U}^{\prime}|_{q}),
\mathbf{T} = \mathrm{E}[\mathbf{V}\mathbf{V}^{\prime{\mathrm{\tiny T}}}|\mathbf{U}^{\prime}-\mathbf{U}^{\prime\prime}|_{q}] = \mathrm{E}[\mathrm{E}(\mathbf{V})\mathbf{V}^{\prime{\mathrm{\tiny T}}}|\mathbf{U}-\mathbf{U}^{\prime}|_{q}],

with (𝐕,𝐔)(\mathbf{V}^{\prime},\mathbf{U}^{\prime}) and (𝐕′′,𝐔′′)(\mathbf{V}^{{}^{\prime\prime}},\mathbf{U}^{{}^{\prime\prime}}) being iid copies of (𝐕,𝐔)(\mathbf{V},\mathbf{U}). Note that 𝐕=(V1,,Vp)T\mathbf{V}^{\prime}=(V_{1}^{\prime},\cdots,V_{p}^{\prime})^{T}. At the sample level, we have

MDDMn(𝐕|𝐔)=1n2k,l=1n(𝐕k𝐕¯n)(𝐕l𝐕¯n)T|𝐔k𝐔l|q=𝐑n𝐒n+2𝐓n,\mathrm{MDDM}_{n}(\mathbf{V}|\mathbf{U})=-\frac{1}{n^{2}}\sum_{k,l=1}^{n}(\mathbf{V}_{k\cdot}-\bar{\mathbf{V}}_{n})(\mathbf{V}_{l\cdot}-\bar{\mathbf{V}}_{n})^{T}|\mathbf{U}_{k\cdot}-\mathbf{U}_{l\cdot}|_{q}=-\mathbf{R}_{n}-\mathbf{S}_{n}+2\mathbf{T}_{n},

where 𝐕¯n=n1k=1n𝐕k\bar{\mathbf{V}}_{n}=n^{-1}\sum_{k=1}^{n}\mathbf{V}_{k\cdot}, and

\mathbf{R}_{n} = n^{-2}\sum_{k,l=1}^{n}\mathbf{V}_{k\cdot}\mathbf{V}_{l\cdot}^{\mathrm{\tiny T}}|\mathbf{U}_{k\cdot}-\mathbf{U}_{l\cdot}|_{q},
\mathbf{S}_{n} = n^{-2}\sum_{k,l=1}^{n}\mathbf{V}_{k\cdot}\mathbf{V}_{l\cdot}^{\mathrm{\tiny T}}\cdot n^{-2}\sum_{k,l=1}^{n}|\mathbf{U}_{k\cdot}-\mathbf{U}_{l\cdot}|_{q},
\mathbf{T}_{n} = n^{-3}\sum_{k,l,h=1}^{n}\mathbf{V}_{k\cdot}\mathbf{V}_{h\cdot}^{\mathrm{\tiny T}}|\mathbf{U}_{h\cdot}-\mathbf{U}_{l\cdot}|_{q}.
Proposition S3.

Suppose Condition (C1) holds. There exists a positive integer n0=n0(σ0,C0,q)<n_{0}=n_{0}(\sigma_{0},C_{0},q)<\infty, γ=γ(σ0,C0,q)(0,1/2)\gamma=\gamma(\sigma_{0},C_{0},q)\in(0,1/2) and a finite positive constant D0=D0(σ0,C0,q)<D_{0}=D_{0}(\sigma_{0},C_{0},q)<\infty such that when nn0n\geq n_{0} and 16>ϵ>D0nγ16>\epsilon>D_{0}n^{-\gamma}, we have

P(𝐑n𝐑max>4ϵ)10p2exp(ϵ2n4log3n).P(\|\mathbf{R}_{n}-\mathbf{R}\|_{max}>4\epsilon)\leq 10p^{2}\exp\left(-\frac{\epsilon^{2}n}{4\log^{3}{n}}\right).

Proof of Proposition S3: Throughout the proof, C is a generic positive constant that may vary from line to line. We shall first find a bound for P(\|\mathbf{R}_{n}-\mathbf{R}\|_{\max}>4\epsilon). For i,j=1,\cdots,p, let R_{ij}=\mathrm{E}[V_{i}V_{j}^{\prime}|\mathbf{U}-\mathbf{U}^{\prime}|_{q}] and R_{n,ij}=n^{-2}\sum_{k,l=1}^{n}V_{ki}V_{lj}|\mathbf{U}_{k\cdot}-\mathbf{U}_{l\cdot}|_{q}. Note that

P(𝐑n𝐑max>4ϵ)p2maxi,j=1,,pP(|Rn,ijRij|>4ϵ).P(\|\mathbf{R}_{n}-\mathbf{R}\|_{max}>4\epsilon)\leq p^{2}\max_{i,j=1,\cdots,p}P(|R_{n,ij}-R_{ij}|>4\epsilon).

We shall focus on Case I, (i,j)=(1,2), since the other cases can be treated in the same fashion and the bound holds uniformly over all pairs (i,j).

Case I: (i,j)=(1,2). Write \widetilde{R}_{n,12}=\{n(n-1)\}^{-1}\sum_{k\not=l}^{n}V_{k1}V_{l2}|\mathbf{U}_{k\cdot}-\mathbf{U}_{l\cdot}|_{q}. Let \mathbf{W}=(\mathbf{U}^{\mathrm{\tiny T}},V_{1},V_{2})^{\mathrm{\tiny T}}, \mathbf{W}^{\prime}=((\mathbf{U}^{\prime})^{\mathrm{\tiny T}},V_{1}^{\prime},V_{2}^{\prime})^{\mathrm{\tiny T}} and \mathbf{W}_{k}=(\mathbf{U}_{k\cdot}^{\mathrm{\tiny T}},V_{k1},V_{k2})^{\mathrm{\tiny T}}. Define the kernel h_{1} as

h1(𝐖;𝐖)=V1V2|𝐔𝐔|q+V1V2|𝐔𝐔|q2h_{1}(\mathbf{W};\mathbf{W}^{\prime})=\frac{V_{1}V_{2}^{\prime}|\mathbf{U}-\mathbf{U}^{\prime}|_{q}+V_{1}^{\prime}V_{2}|\mathbf{U}-\mathbf{U}^{\prime}|_{q}}{2}

Then h1h_{1} is symmetric, R~n,12={n(n1)}1klnh1(𝐖k;𝐖l)\widetilde{R}_{n,12}=\{n(n-1)\}^{-1}\sum_{k\not=l}^{n}h_{1}(\mathbf{W}_{k};\mathbf{W}_{l}) is a U-statistic of order two and Rn,12=n1nR~n,12R_{n,12}=\frac{n-1}{n}\widetilde{R}_{n,12}.

Under Condition (C1), there exists a positive constant C_{1}=C_{1}(\sigma_{0},C_{0})<\infty such that |R_{12}|=|\mathrm{E}(V_{1}V_{2}^{\prime}|\mathbf{U}-\mathbf{U}^{\prime}|_{q})|\leq\mathrm{E}^{1/2}(V_{1}^{2})\mathrm{E}^{1/2}(V_{2}^{\prime 2})\mathrm{E}^{1/2}(|\mathbf{U}-\mathbf{U}^{\prime}|_{q}^{2})<C_{1}. When \epsilon\geq C_{1}/(2n), we have |R_{12}|/n\leq 2\epsilon and

P(|R_{n,12}-R_{12}|\geq 4\epsilon) = P\left(\left|\frac{n-1}{n}(\widetilde{R}_{n,12}-R_{12})-\frac{1}{n}R_{12}\right|\geq 4\epsilon\right) \leq P(|\widetilde{R}_{n,12}-R_{12}|+|R_{12}/n|\geq 4\epsilon)\leq P(|\widetilde{R}_{n,12}-R_{12}|\geq 2\epsilon).

Next we decompose

\widetilde{R}_{n,12} = \{n(n-1)\}^{-1}\sum_{k\not=l}^{n}h_{1}(\mathbf{W}_{k},\mathbf{W}_{l}){\bf 1}(|h_{1}(\mathbf{W}_{k},\mathbf{W}_{l})|\leq M) + \{n(n-1)\}^{-1}\sum_{k\not=l}^{n}h_{1}(\mathbf{W}_{k},\mathbf{W}_{l}){\bf 1}(|h_{1}(\mathbf{W}_{k},\mathbf{W}_{l})|>M) = \widetilde{R}_{n,12,1}+\widetilde{R}_{n,12,2},

where the choice of M will be addressed at the end of the proof. We also decompose the population counterpart as R_{12}=\mathrm{E}[h_{1}{\bf 1}(|h_{1}|\leq M)]+\mathrm{E}[h_{1}{\bf 1}(|h_{1}|>M)]=R_{12,1}+R_{12,2}.

By Lemma C on page 200 of Serfling (1980), we derive that for m=n/2m=\lfloor n/2\rfloor, and t>0t>0,

E[exp(tR~n,12,1)]Em[exp(th1𝟏(|h1|M)/m)]\mathrm{E}[\exp(t\widetilde{R}_{n,12,1})]\leq\mathrm{E}^{m}[\exp(th_{1}{\bf 1}(|h_{1}|\leq M)/m)]

which entails that

P(\widetilde{R}_{n,12,1}-R_{12,1}\geq\epsilon) \leq \exp(-t(\epsilon+R_{12,1}))\mathrm{E}[\exp(t\widetilde{R}_{n,12,1})] \leq \exp(-t\epsilon)\mathrm{E}^{m}\{\exp(t(h_{1}{\bf 1}(|h_{1}|\leq M)-R_{12,1})/m)\} \leq \exp(-t\epsilon)\exp(t^{2}M^{2}/(2m)),

where we have applied Markov’s inequality and Lemma A(ii) [cf. page 200 of Serfling (1980)] in the first and third inequalities above, respectively. Applying the same argument with h_{1}{\bf 1}(|h_{1}|\leq M) replaced by -h_{1}{\bf 1}(|h_{1}|\leq M), we can obtain

P(R~n,12,1R12,1ϵ)exp(tϵ)exp(t2M2/(2m)).P(\widetilde{R}_{n,12,1}-R_{12,1}\leq-\epsilon)\leq\exp(-t\epsilon)\exp(t^{2}M^{2}/(2m)).

Choosing t=ϵm/M2t=\epsilon m/M^{2}, we obtain that

P(|R~n,12,1R12,1|ϵ)2exp(ϵ2m/(2M2))\displaystyle P(|\widetilde{R}_{n,12,1}-R_{12,1}|\geq\epsilon)\leq 2\exp(-\epsilon^{2}m/(2M^{2})) (S2)

Next we turn to R~n,12,2\widetilde{R}_{n,12,2}. First of all, by the Cauchy-Schwarz inequality, |R12,2|E1/2(h12)P1/2(|h1|>M)|R_{12,2}|\leq\mathrm{E}^{1/2}(h_{1}^{2})P^{1/2}(|h_{1}|>M). Applying the inequalities |ab|(a2+b2)/2|ab|\leq(a^{2}+b^{2})/2 and (a+b)22(a2+b2)(a+b)^{2}\leq 2(a^{2}+b^{2}) for any a,ba,b\in\mathbb{R}, we derive

h1(𝐖k,𝐖l)\displaystyle h_{1}(\mathbf{W}_{k},\mathbf{W}_{l}) \displaystyle\leq Vk1Vl2|𝐔k𝐔l|q+Vk2Vl1|𝐔k𝐔l|q2\displaystyle\frac{V_{k1}V_{l2}|\mathbf{U}_{k\cdot}-\mathbf{U}_{l\cdot}|_{q}+V_{k2}V_{l1}|\mathbf{U}_{k\cdot}-\mathbf{U}_{l\cdot}|_{q}}{2}
\displaystyle\leq 14{(Vk1Vl2+Vk2Vl1)2+|𝐔k𝐔l|q2}\displaystyle\frac{1}{4}\{(V_{k1}V_{l2}+V_{k2}V_{l1})^{2}+|\mathbf{U}_{k\cdot}-\mathbf{U}_{l\cdot}|_{q}^{2}\}
\displaystyle\leq 12{Vk12Vl22+Vk22Vl12+|𝐔k|q2+|𝐔l|q2}\displaystyle\frac{1}{2}\{V_{k1}^{2}V_{l2}^{2}+V_{k2}^{2}V_{l1}^{2}+|\mathbf{U}_{k\cdot}|_{q}^{2}+|\mathbf{U}_{l\cdot}|_{q}^{2}\}
\displaystyle\leq 12{Vk14/2+Vl24/2+Vk24/2+Vl14/2+|𝐔k|q2+|𝐔l|q2}\displaystyle\frac{1}{2}\{V_{k1}^{4}/2+V_{l2}^{4}/2+V_{k2}^{4}/2+V_{l1}^{4}/2+|\mathbf{U}_{k\cdot}|_{q}^{2}+|\mathbf{U}_{l\cdot}|_{q}^{2}\}

Then it is easy to show that under the uniform sub-Gaussian moment assumption in Condition (C1) and the upper bound on h1(𝐖k,𝐖l)h_{1}(\mathbf{W}_{k},\mathbf{W}_{l}) above, we have that E1/2(h12)C2\mathrm{E}^{1/2}(h_{1}^{2})\leq C_{2} for some C2=C2(σ0,C0)<C_{2}=C_{2}(\sigma_{0},C_{0})<\infty. Moreover, since q1/2𝐔max𝐔qq^{1/2}\|\mathbf{U}\|_{\max}\geq\|\mathbf{U}\|_{q}, we can derive that

P(|h1|>M)\displaystyle P(|h_{1}|>M) (S3)
\displaystyle\leq P[max{|V1|,|V1|,|V2|,|V2|,2q1/2𝐔max,2q1/2𝐔max}(M2)1/3]\displaystyle P[\max\{|V_{1}|,|V_{1}^{\prime}|,|V_{2}|,|V_{2}^{\prime}|,2q^{1/2}\|\mathbf{U}\|_{\max},2q^{1/2}\|\mathbf{U}^{\prime}\|_{\max}\}\geq(\frac{M}{2})^{1/3}]
\displaystyle\leq 2P{|V1|(M2)1/3}+2P{|V2|(M2)1/3}+2P{2q1/2𝐔max(M2)1/3}\displaystyle 2P\{|V_{1}|\geq(\frac{M}{2})^{1/3}\}+2P\{|V_{2}|\geq(\frac{M}{2})^{1/3}\}+2P\{2q^{1/2}\|\mathbf{U}\|_{\max}\geq(\frac{M}{2})^{1/3}\}
\displaystyle\leq 2P{|V1|(M2)1/3}+2P{|V2|(M2)1/3}+2j=1qP{2q1/2|Uj|(M2)1/3}\displaystyle 2P\{|V_{1}|\geq(\frac{M}{2})^{1/3}\}+2P\{|V_{2}|\geq(\frac{M}{2})^{1/3}\}+2\sum_{j=1}^{q}P\{2q^{1/2}|U_{j}|\geq(\frac{M}{2})^{1/3}\}

Because V1V_{1} is sub-Gaussian as assumed in Condition (C1), by Proposition 2.5.2 in Vershynin (2018), we have P{|V1|(M2)1/3}2exp{C(M2)2/3}P\{|V_{1}|\geq(\frac{M}{2})^{1/3}\}\leq 2\exp\{-C(\frac{M}{2})^{2/3}\} for some positive constant CC. We apply similar arguments to all the remaining terms in (S3) and derive that

P(|h1|>M)(8+4q)exp{2Cq1(M2)2/3}.P(|h_{1}|>M)\leq(8+4q)\exp\{-2Cq^{-1}(\frac{M}{2})^{2/3}\}.

Thus |R12,2|(8+4q)1/2C2exp{Cq1(M2)2/3}|R_{12,2}|\leq(8+4q)^{1/2}C_{2}\exp\{-Cq^{-1}(\frac{M}{2})^{2/3}\}. If we choose ϵ>0\epsilon>0 such that (8+4q)1/2C2exp{Cq1(M2)2/3}ϵ/2(8+4q)^{1/2}C_{2}\exp\{-Cq^{-1}(\frac{M}{2})^{2/3}\}\leq\epsilon/2, then |R12,2|ϵ/2|R_{12,2}|\leq\epsilon/2, which leads to P(|R~n,12,2R12,2|ϵ)P(|R~n,12,2|ϵ/2)P(|\widetilde{R}_{n,12,2}-R_{12,2}|\geq\epsilon)\leq P(|\widetilde{R}_{n,12,2}|\geq\epsilon/2). To bound P(|R~n,12,2|ϵ/2)P(|\widetilde{R}_{n,12,2}|\geq\epsilon/2), we write

|R~n,12,2|\displaystyle|\widetilde{R}_{n,12,2}| =\displaystyle= |{n(n1)}1klnh1(𝐖k,𝐖l)𝟏(|h1(𝐖k,𝐖l)|>M)|\displaystyle|\{n(n-1)\}^{-1}\sum_{k\not=l}^{n}h_{1}(\mathbf{W}_{k},\mathbf{W}_{l}){\bf 1}(|h_{1}(\mathbf{W}_{k},\mathbf{W}_{l})|>M)|
\displaystyle\leq {n(n1)}1kln|h1(𝐖k,𝐖l)|𝟏(𝐖kmax>q1/6(M2)1/3)\displaystyle\{n(n-1)\}^{-1}\sum_{k\not=l}^{n}|h_{1}(\mathbf{W}_{k},\mathbf{W}_{l})|{\bf 1}(\|\mathbf{W}_{k}\|_{\max}>q^{-1/6}(\frac{M}{2})^{1/3})
+{n(n1)}1kln|h1(𝐖k,𝐖l)|𝟏(𝐖lmax>q1/6(M2)1/3)\displaystyle+\{n(n-1)\}^{-1}\sum_{k\not=l}^{n}|h_{1}(\mathbf{W}_{k},\mathbf{W}_{l})|{\bf 1}(\|\mathbf{W}_{l}\|_{\max}>q^{-1/6}(\frac{M}{2})^{1/3})
\displaystyle\equiv L1+L2.\displaystyle L_{1}+L_{2}.

Without loss of generality, we only consider L1L_{1}. Define Fk=𝟏(𝐖kmax>q1/6(M2)1/3)F_{k}={\bf 1}(\|\mathbf{W}_{k}\|_{\max}>q^{-1/6}(\frac{M}{2})^{1/3}). Note that

L1\displaystyle L_{1} =\displaystyle= {n(n1)}1kl|Vk1Vl2|𝐔k𝐔lFk\displaystyle\{n(n-1)\}^{-1}\sum_{k\neq l}|V_{k1}V_{l2}|\|\mathbf{U}_{k}-\mathbf{U}_{l}\|F_{k}
\displaystyle\leq {n(n1)}1kl|Vk1Vl2|𝐔kFk+{n(n1)}1kl|Vk1Vl2|𝐔lFk\displaystyle\{n(n-1)\}^{-1}\sum_{k\neq l}|V_{k1}V_{l2}|\|\mathbf{U}_{k}\|F_{k}+\{n(n-1)\}^{-1}\sum_{k\neq l}|V_{k1}V_{l2}|\|\mathbf{U}_{l}\|F_{k}
\displaystyle\leq {n(n1)}1(k=1n|Vk1|𝐔kFk)(l=1n|Vl2|)+{n(n1)}1(k=1n|Vk1|Fk)l=1n|Vl2|𝐔l\displaystyle\{n(n-1)\}^{-1}(\sum_{k=1}^{n}|V_{k1}|\|\mathbf{U}_{k}\|F_{k})\cdot(\sum_{l=1}^{n}|V_{l2}|)+\{n(n-1)\}^{-1}(\sum_{k=1}^{n}|V_{k1}|F_{k})\cdot\sum_{l=1}^{n}|V_{l2}|\|\mathbf{U}_{l}\|
\displaystyle\equiv L11+L12\displaystyle L_{11}+L_{12}

For L11L_{11}, note that, for any λ>0\lambda>0, Eexp{λ|Vl2|2}=Eexp{λVl22}\mathrm{E}\exp\{\lambda|V_{l2}|^{2}\}=\mathrm{E}\exp\{\lambda V_{l2}^{2}\}. Since Vl2V_{l2} is sub-Gaussian by Condition (C1), we have that |Vl2||V_{l2}| is also sub-Gaussian by Proposition 2.5.2 in Vershynin (2018). Hence, it follows from Bernstein’s inequality [Theorem 2.8.1 in Vershynin (2018)] that for ϵ(0,1)\epsilon\in(0,1), n2n\geq 2,

P(|1n1l=1n{|Vl2|E|Vl2|}|ϵ)2exp(Cnϵ2)\displaystyle P(|\frac{1}{n-1}\sum_{l=1}^{n}\{|V_{l2}|-\mathrm{E}|V_{l2}|\}|\geq\epsilon)\leq 2\exp(-Cn\epsilon^{2}) (S4)

Regarding 1nk=1n|Vk1|𝐔kFk\frac{1}{n}\sum_{k=1}^{n}|V_{k1}|\|\mathbf{U}_{k}\|F_{k}, we note that |Vk1|𝐔kFk|Vk1|𝐔kFk|Vk1|j=1q|Ukj|Fk|V_{k1}|\|\mathbf{U}_{k}\|F_{k}\leq|V_{k1}|\|\mathbf{U}_{k}\|\cdot F_{k}\leq|V_{k1}|\cdot\sum_{j=1}^{q}|U_{kj}|\cdot F_{k}. Since |Vk1||V_{k1}| and |Ukj||U_{kj}| are sub-Gaussian, we have that |Vk1|𝐔kFk|V_{k1}|\|\mathbf{U}_{k}\|F_{k} is sub-exponential (Lemma 2.7.7 in Vershynin (2018)). Again by Bernstein’s inequality, we have that for ϵ(0,1)\epsilon\in(0,1),

P(|1nk=1n{|Vk1|𝐔kFkE(|Vk1|𝐔kFk)}|ϵ)2exp(Cnϵ2).\displaystyle P(|\frac{1}{n}\sum_{k=1}^{n}\{|V_{k1}|\|\mathbf{U}_{k}\|F_{k}-\mathrm{E}(|V_{k1}|\|\mathbf{U}_{k}\|F_{k})\}|\geq\epsilon)\leq 2\exp(-Cn\epsilon^{2}). (S5)

Moreover, it is easy to see that there exists a C3=C3(σ0,C0)C_{3}=C_{3}(\sigma_{0},C_{0}) such that E1/2(|Vk1|𝐔k)2C3\mathrm{E}^{1/2}(|V_{k1}|\|\mathbf{U}_{k}\|)^{2}\leq C_{3}, so E(|Vk1|𝐔kFk)E1/2(|Vk1|𝐔k)2E1/2FkC3[2(2+q)]1/2exp{Cq1/3(M2)2/3}\mathrm{E}(|V_{k1}|\|\mathbf{U}_{k}\|F_{k})\leq\mathrm{E}^{1/2}(|V_{k1}|\|\mathbf{U}_{k}\|)^{2}\mathrm{E}^{1/2}F_{k}\leq C_{3}[2(2+q)]^{1/2}\exp\{-Cq^{-1/3}(\frac{M}{2})^{2/3}\}, where we have used the fact that E(Fk)=P(𝐖kmax>q1/6(M2)1/3)2(2+q)exp{2Cq1/3(M2)2/3}\mathrm{E}(F_{k})=P(\|\mathbf{W}_{k}\|_{\max}>q^{-1/6}(\frac{M}{2})^{1/3})\leq 2(2+q)\exp\{-2Cq^{-1/3}(\frac{M}{2})^{2/3}\} by a union bound argument and the uniform sub-Gaussianity assumption in Condition (C1). Hence, P(|L11|ϵ/8)ξ1(ϵ)+ξ2(ϵ)P(|L_{11}|\geq\epsilon/8)\leq\xi_{1}(\epsilon)+\xi_{2}(\epsilon), where ξ1(ϵ)=P(1nk=1n|Vk1|𝐔kFk>(8E|V2|+8)1ϵ)\xi_{1}(\epsilon)=P(\frac{1}{n}\sum_{k=1}^{n}|V_{k1}|\|\mathbf{U}_{k}\|F_{k}>(8E|V_{2}|+8)^{-1}\epsilon) and ξ2(ϵ)=P((n1)1l=1n|Vl2|>E|V2|+1)\xi_{2}(\epsilon)=P((n-1)^{-1}\sum_{l=1}^{n}|V_{l2}|>E|V_{2}|+1). If we choose ϵ\epsilon such that ϵ>(E|V2|+1)C316[2(2+q)]1/2exp{Cq1/3(M2)2/3}\epsilon>(E|V_{2}|+1)C_{3}16[2(2+q)]^{1/2}\exp\{-Cq^{-1/3}(\frac{M}{2})^{2/3}\} and ϵ<16E|V2|+16\epsilon<16\mathrm{E}|V_{2}|+16, then it follows from (S5) that

ξ1(ϵ)P(|1nk=1n{|Vk1|𝐔kFkE(|Vk1|𝐔kFk)}|ϵ(16E|V2|+16)1)2exp(Cnϵ2).\xi_{1}(\epsilon)\leq P(|\frac{1}{n}\sum_{k=1}^{n}\{|V_{k1}|\|\mathbf{U}_{k}\|F_{k}-\mathrm{E}(|V_{k1}|\|\mathbf{U}_{k}\|F_{k})\}|\geq\epsilon(16E|V_{2}|+16)^{-1})\leq 2\exp(-Cn\epsilon^{2}).

In addition, we can use (S4) to derive that ξ2(ϵ)2exp(Cn)\xi_{2}(\epsilon)\leq 2\exp(-Cn). Combining these results, we have P(|L11|ϵ/8)2exp(Cnϵ2)P(|L_{11}|\geq\epsilon/8)\leq 2\exp(-Cn\epsilon^{2}) when ϵ((E|V2|+1)C316[2(2+q)]1/2exp{Cq1/3(M2)2/3},16E|V2|+16)\epsilon\in((E|V_{2}|+1)C_{3}16[2(2+q)]^{1/2}\exp\{-Cq^{-1/3}(\frac{M}{2})^{2/3}\},16E|V_{2}|+16). Thus if we choose M=(logn)3/2M=(\log{n})^{3/2}, we can find an n0n_{0}, a γ(0,1/2)\gamma\in(0,1/2), and a D0D_{0} such that when nn0n\geq n_{0} and 16>ϵ>D0nγ16>\epsilon>D_{0}n^{-\gamma}, we have P(|L11|>ϵ/8)2exp(Cnϵ2)P(|L_{11}|>\epsilon/8)\leq 2\exp(-Cn\epsilon^{2}). Similar arguments lead to P(|L12|>ϵ/8)2exp(Cnϵ2)P(|L_{12}|>\epsilon/8)\leq 2\exp(-Cn\epsilon^{2}). This implies that P(|L1|>ϵ/4)4exp(Cnϵ2)P(|L_{1}|>\epsilon/4)\leq 4\exp(-Cn\epsilon^{2}) and similarly P(|L2|>ϵ/4)4exp(Cnϵ2)P(|L_{2}|>\epsilon/4)\leq 4\exp(-Cn\epsilon^{2}). Therefore P(|R~n,12,2|ϵ/2)8exp(Cnϵ2)P(|\widetilde{R}_{n,12,2}|\geq\epsilon/2)\leq 8\exp(-Cn\epsilon^{2}). In view of (S2), the desired statement follows by choosing a large enough n0n_{0} such that 4log3(n0)>C4\log^{3}(n_{0})>C.

Proof of Theorem 1: Notice that for any ϵ>0\epsilon>0,

P(MDDMn(𝐕|𝐔)MDDM(𝐕|𝐔)max>12ϵ)\displaystyle P(\|\mathrm{MDDM}_{n}(\mathbf{V}|\mathbf{U})-\mathrm{MDDM}(\mathbf{V}|\mathbf{U})\|_{max}>12\epsilon)
P(𝐑n𝐑max>4ϵ)+P(𝐒n𝐒max>4ϵ)+P(𝐓n𝐓max>4ϵ)\displaystyle\leq P(\|\mathbf{R}_{n}-\mathbf{R}\|_{max}>4\epsilon)+P(\|\mathbf{S}_{n}-\mathbf{S}\|_{max}>4\epsilon)+P(\|\mathbf{T}_{n}-\mathbf{T}\|_{max}>4\epsilon)

The concentration bound for 𝐑n𝐑max\|\mathbf{R}_{n}-\mathbf{R}\|_{max} has been obtained in Proposition S3, and we shall address the concentration of 𝐓n𝐓max\|\mathbf{T}_{n}-\mathbf{T}\|_{max} in the proof below. The proof for the concentration of 𝐒n𝐒max\|\mathbf{S}_{n}-\mathbf{S}\|_{max} is similar and simpler, so it is omitted. Note that

P(𝐓n𝐓max>4ϵ)p2maxi,j=1,,pP(|Tn,ijTij|>4ϵ).P(\|\mathbf{T}_{n}-\mathbf{T}\|_{max}>4\epsilon)\leq p^{2}\max_{i,j=1,\cdots,p}P(|T_{n,ij}-T_{ij}|>4\epsilon).

Following the same argument as at the beginning of the proof of Proposition S3, we shall only focus on the case (i,j)=(1,2)(i,j)=(1,2), as the other cases can be treated in exactly the same manner.

Let T12=E[V1V2|𝐔𝐔′′|q]=E[E(V1)V2|𝐔𝐔|q]T_{12}=\mathrm{E}[V_{1}V_{2}^{\prime}|\mathbf{U}^{\prime}-\mathbf{U}^{{}^{\prime\prime}}|_{q}]=\mathrm{E}[\mathrm{E}(V_{1})V_{2}^{{}^{\prime}}|\mathbf{U}-\mathbf{U}^{\prime}|_{q}] and Tn,12=n3k,l,h=1nVk1Vh2|𝐔h𝐔l|qT_{n,12}=n^{-3}\sum_{k,l,h=1}^{n}V_{k1}V_{h2}|\mathbf{U}_{h\cdot}-\mathbf{U}_{l\cdot}|_{q}. Let

T~n,12\displaystyle\widetilde{T}_{n,12} =\displaystyle= 1n(n1)(n2)k<l<h[Vk1Vh2|𝐔h𝐔l|q+Vk1Vl2|𝐔l𝐔h|q\displaystyle\frac{1}{n(n-1)(n-2)}\sum_{k<l<h}[V_{k1}V_{h2}|\mathbf{U}_{h\cdot}-\mathbf{U}_{l\cdot}|_{q}+V_{k1}V_{l2}|\mathbf{U}_{l\cdot}-\mathbf{U}_{h\cdot}|_{q}
+Vl1Vk2|𝐔k𝐔h|q+Vl1Vh2|𝐔h𝐔k|q+Vh1Vl2|𝐔l𝐔k|q\displaystyle+V_{l1}V_{k2}|\mathbf{U}_{k\cdot}-\mathbf{U}_{h\cdot}|_{q}+V_{l1}V_{h2}|\mathbf{U}_{h\cdot}-\mathbf{U}_{k\cdot}|_{q}+V_{h1}V_{l2}|\mathbf{U}_{l\cdot}-\mathbf{U}_{k\cdot}|_{q}
+Vh1Vk2|𝐔k𝐔l|q]=6{n(n1)(n2)}1k<l<hh3(𝐖k,𝐖l,𝐖h),\displaystyle+V_{h1}V_{k2}|\mathbf{U}_{k\cdot}-\mathbf{U}_{l\cdot}|_{q}]=6\{n(n-1)(n-2)\}^{-1}\sum_{k<l<h}h_{3}(\mathbf{W}_{k},\mathbf{W}_{l},\mathbf{W}_{h}),

where h3h_{3} is a kernel function for a U-statistic of order three. Following the same argument used to deal with R~n,12\widetilde{R}_{n,12} in the proof of Proposition S3, we write T~n,12=T~n,12,1+T~n,12,2\widetilde{T}_{n,12}=\widetilde{T}_{n,12,1}+\widetilde{T}_{n,12,2}, where

T~n,12,1\displaystyle\widetilde{T}_{n,12,1} =\displaystyle= 6{n(n1)(n2)}1k<l<hh3𝟏(|h3|M),\displaystyle 6\{n(n-1)(n-2)\}^{-1}\sum_{k<l<h}h_{3}{\bf 1}(|h_{3}|\leq M),
T~n,12,2\displaystyle\widetilde{T}_{n,12,2} =\displaystyle= 6{n(n1)(n2)}1k<l<hh3𝟏(|h3|>M).\displaystyle 6\{n(n-1)(n-2)\}^{-1}\sum_{k<l<h}h_{3}{\bf 1}(|h_{3}|>M).

Correspondingly, we define T12=T12,1+T12,2T_{12}=T_{12,1}+T_{12,2}, where T12,1=E[h3𝟏(|h3|M)]T_{12,1}=\mathrm{E}[h_{3}{\bf 1}(|h_{3}|\leq M)] and T12,2=E[h3𝟏(|h3|>M)]T_{12,2}=\mathrm{E}[h_{3}{\bf 1}(|h_{3}|>M)]. By using the same argument for R~n,12,1\widetilde{R}_{n,12,1}, we can show that

P(|T~n,12,1T12,1|ϵ)2exp(ϵ2n/3/(2M2))P(|\widetilde{T}_{n,12,1}-T_{12,1}|\geq\epsilon)\leq 2\exp(-\epsilon^{2}\lfloor n/3\rfloor/(2M^{2}))

since T~n,12,1\widetilde{T}_{n,12,1} is a third-order UU-statistic. We also note that, by the same argument used in bounding h1h_{1}, we can get

|h3|112{Vk14+Vl14+Vh14+Vk24+Vl24+Vh24+8|𝐔h|q2+8|𝐔k|q2+8|𝐔l|q2}.|h_{3}|\leq\frac{1}{12}\{V_{k1}^{4}+V_{l1}^{4}+V_{h1}^{4}+V_{k2}^{4}+V_{l2}^{4}+V_{h2}^{4}+8|\mathbf{U}_{h\cdot}|_{q}^{2}+8|\mathbf{U}_{k\cdot}|_{q}^{2}+8|\mathbf{U}_{l\cdot}|_{q}^{2}\}.

It follows from the Cauchy-Schwarz inequality that |T12,2|E1/2(h32)P1/2(|h3|>M)|T_{12,2}|\leq E^{1/2}(h_{3}^{2})P^{1/2}(|h_{3}|>M). By exactly the same argument as for (S3), we can show that P(|h3|>M)Cexp(C3M2/3)P(|h_{3}|>M)\leq C\exp(-C_{3}M^{2/3}) for some C3=C3(σ0,C0,q)>0C_{3}=C_{3}(\sigma_{0},C_{0},q)>0. Hence |T12,2|Cexp(C3M2/3)|T_{12,2}|\leq C\exp(-C_{3}M^{2/3}). Therefore, for ϵ2Cexp(C3M2/3)\epsilon\geq 2C\exp(-C_{3}M^{2/3}), we have |T12,2|ϵ/2|T_{12,2}|\leq\epsilon/2 and thus P(|T~n,12,2T12,2|>ϵ)P(|T~n,12,2|ϵ/2)P(|\widetilde{T}_{n,12,2}-T_{12,2}|>\epsilon)\leq P(|\widetilde{T}_{n,12,2}|\geq\epsilon/2). By setting M=log3/2(n)M=\log^{3/2}(n) and adopting the same argument as used in bounding P(|R~n,12,2|ϵ/2)P(|\widetilde{R}_{n,12,2}|\geq\epsilon/2), we can derive that P(|T~n,12,2|ϵ/2)12exp(C4nϵ2)P(|\widetilde{T}_{n,12,2}|\geq\epsilon/2)\leq 12\exp(-C_{4}n\epsilon^{2}) when nn1n\geq n_{1} and ϵ(D1nγ1,16)\epsilon\in(D_{1}n^{-\gamma_{1}},16) for some C4=C4(σ0,C0,q)>0C_{4}=C_{4}(\sigma_{0},C_{0},q)>0, D1=D1(σ0,C0,q)D_{1}=D_{1}(\sigma_{0},C_{0},q), n1=n1(σ0,C0,q)n_{1}=n_{1}(\sigma_{0},C_{0},q) and γ1=γ1(σ0,C0,q)\gamma_{1}=\gamma_{1}(\sigma_{0},C_{0},q).

Combining the above results, we obtain that for 16>ϵ>D1nγ116>\epsilon>D_{1}n^{-\gamma_{1}} and nn1n\geq n_{1}, we have

P(|T~n,12T12|2ϵ)\displaystyle P(|\widetilde{T}_{n,12}-T_{12}|\geq 2\epsilon) \displaystyle\leq 12exp(C4nϵ2)+2exp(ϵ2n/3/(2M2))\displaystyle 12\exp(-C_{4}n\epsilon^{2})+2\exp(-\epsilon^{2}\lfloor n/3\rfloor/(2M^{2}))
\displaystyle\leq 14exp(ϵ2n/(6log3(n))),\displaystyle 14\exp(-\epsilon^{2}n/(6\log^{3}(n))),

where we choose n1n_{1} such that 6log3(n1)>C416\log^{3}(n_{1})>C_{4}^{-1}.

Further we note that

Tn,12T12=(n1)(n2)n2(T~n,12T12)3n2n2T12+n1n2(R~n,12R12)+n1n2R12.T_{n,12}-T_{12}=\frac{(n-1)(n-2)}{n^{2}}(\widetilde{T}_{n,12}-T_{12})-\frac{3n-2}{n^{2}}T_{12}+\frac{n-1}{n^{2}}(\widetilde{R}_{n,12}-R_{12})+\frac{n-1}{n^{2}}R_{12}.

There exists a finite positive constant C6=C6(σ0,C0,q)C_{6}=C_{6}(\sigma_{0},C_{0},q) such that |R12|C6|R_{12}|\leq C_{6} and |T12|C6|T_{12}|\leq C_{6} so if we choose ϵ3C6/n\epsilon\geq 3C_{6}/n, then |3n2n2T12|ϵ|\frac{3n-2}{n^{2}}T_{12}|\leq\epsilon and |n1n2R12|ϵ/3|\frac{n-1}{n^{2}}R_{12}|\leq\epsilon/3. Then for nn1n\geq n_{1} and ϵ>D1nγ1\epsilon>D_{1}n^{-\gamma_{1}},

P(|Tn,12T12|>4ϵ)\displaystyle P(|T_{n,12}-T_{12}|>4\epsilon) \displaystyle\leq P(|T~n,12T12|>2ϵ)+P(|R~n,12R12|>2ϵ/3)\displaystyle P(|\widetilde{T}_{n,12}-T_{12}|>2\epsilon)+P(|\widetilde{R}_{n,12}-R_{12}|>2\epsilon/3)
\displaystyle\leq 14exp(ϵ2n/(6log3(n)))+10exp{ϵ2n36log3(n)}\displaystyle 14\exp(-\epsilon^{2}n/(6\log^{3}(n)))+10\exp\left\{-\frac{\epsilon^{2}n}{36\log^{3}(n)}\right\}
\displaystyle\leq 24exp{ϵ2n36log3(n)}\displaystyle 24\exp\left\{-\frac{\epsilon^{2}n}{36\log^{3}(n)}\right\}

Thus the conclusion follows from the above inequality and Proposition S3.

S6 Proofs for Theorems 2 & 3

S6.1 Two Generic Algorithms and Their Properties

We first describe two generic algorithms and their properties that will be used in the proofs of Theorems 2 & 3. Consider two matrices 𝐀,𝐁p×p\mathbf{A},\mathbf{B}\in\mathbb{R}^{p\times p}, their estimates 𝐀^,𝐁^p×p\widehat{\mathbf{A}},\widehat{\mathbf{B}}\in\mathbb{R}^{p\times p} and vectors 𝐯,𝐯0,𝐯tp\mathbf{v},\mathbf{v}_{0},\mathbf{v}_{t}\in\mathbb{R}^{p}. We have Algorithm 3 for the penalized eigen-decomposition for 𝐀\mathbf{A}, and Algorithm 4 for the penalized generalized eigen-decomposition for (𝐀,𝐁)(\mathbf{A},\mathbf{B}). Algorithm 3 was originally proposed in Yuan and Zhang (2013), and in our Algorithm 1 we apply it KK times to perform penalized eigen-decomposition for MDDM. Algorithm 4 was originally proposed as the RIFLE Algorithm in Tan, Wang, Liu and Zhang (2018), and we apply it KK times to perform penalized generalized eigen-decomposition for MDDM.

  1. Input: s,𝐀^s,\widehat{\mathbf{A}}.

  2. Initialize 𝐯0\mathbf{v}_{0}.

  3. Iterate over tt until convergence:

    (a) Set 𝐯t=𝐀^𝐯t1\mathbf{v}_{t}=\widehat{\mathbf{A}}\mathbf{v}_{t-1}.

    (b) If 𝐯t0s\|\mathbf{v}_{t}\|_{0}\leq s, set 𝐯t=𝐯t𝐯t2\mathbf{v}_{t}=\dfrac{\mathbf{v}_{t}}{\|\mathbf{v}_{t}\|_{2}}; else set 𝐯t=HT(𝐯t,s)HT(𝐯t,s)2\mathbf{v}_{t}=\dfrac{\mathrm{HT}(\mathbf{v}_{t},s)}{\|\mathrm{HT}(\mathbf{v}_{t},s)\|_{2}}.

  4. Output 𝐯\mathbf{v}_{\infty} at convergence.

Algorithm 3 A generic penalized eigen-decomposition algorithm.
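
To make the iteration concrete, here is a short Python sketch of Algorithm 3 (our own illustration, not the authors' implementation; the toy matrix, initializer and convergence check are arbitrary choices). HT(𝐯,s)\mathrm{HT}(\mathbf{v},s) denotes hard thresholding: keep the ss entries of 𝐯\mathbf{v} that are largest in absolute value and set the rest to zero.

```python
import numpy as np

def hard_threshold(v, s):
    """HT(v, s): keep the s largest entries of v in absolute value, zero out the rest."""
    out = np.zeros_like(v)
    keep = np.argsort(np.abs(v))[-s:]
    out[keep] = v[keep]
    return out

def truncated_power(A_hat, s, v0, max_iter=1000, tol=1e-8):
    """Algorithm 3: truncated power iteration for the leading s-sparse eigenvector of A_hat."""
    v = v0 / np.linalg.norm(v0)
    for _ in range(max_iter):
        v_new = A_hat @ v
        if np.count_nonzero(v_new) > s:      # truncate only when the iterate is not s-sparse
            v_new = hard_threshold(v_new, s)
        v_new /= np.linalg.norm(v_new)
        if min(np.linalg.norm(v_new - v), np.linalg.norm(v_new + v)) < tol:  # up to sign
            return v_new
        v = v_new
    return v

# Toy usage: a sparse rank-one spike plus a small symmetric perturbation.
rng = np.random.default_rng(0)
p, d = 50, 5
beta = np.zeros(p); beta[:d] = 1 / np.sqrt(d)                   # true sparse leading eigenvector
noise = 0.05 * rng.standard_normal((p, p))
A_hat = 4 * np.outer(beta, beta) + (noise + noise.T) / 2
v0 = hard_threshold(A_hat[:, np.argmax(np.diag(A_hat))], 3 * d)  # a simple sparse initializer
v_est = truncated_power(A_hat, s=3 * d, v0=v0)
print(abs(v_est @ beta))                                         # close to 1 when estimation succeeds
```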

Yuan and Zhang (2013) proved a property of Algorithm 3 that is important for our proofs. Assume that 𝐀\mathbf{A} has a unique leading eigenvector 𝐯𝐈\mathbf{v}^{*}_{\mathbf{I}} with 𝐯𝐈0d\|\mathbf{v}^{*}_{\mathbf{I}}\|_{0}\leq d. Denote the eigenvalues of 𝐀\mathbf{A} by λ1𝐈,,λp𝐈\lambda_{1}^{\mathbf{I}},\ldots,\lambda^{\mathbf{I}}_{p}, and let Δ𝐈=λ1𝐈maxj>1λj𝐈>0\Delta_{\mathbf{I}}=\lambda_{1}^{\mathbf{I}}-\max_{j>1}\lambda_{j}^{\mathbf{I}}>0 denote the eigengap. Also for any positive integer kk^{\prime}, define

ρ(𝐄𝐀,k)=sup𝐮2=1,𝐮0k|𝐮T𝐄𝐀𝐮|,\rho(\mathbf{E}_{\mathbf{A}},k^{\prime})=\sup_{\|\mathbf{u}\|_{2}=1,\|\mathbf{u}\|_{0}\leq k^{\prime}}|\mathbf{u}^{\mathrm{\tiny T}}\mathbf{E}_{\mathbf{A}}\mathbf{u}|,

where 𝐄𝐀=𝐀^𝐀\mathbf{E}_{\mathbf{A}}=\widehat{\mathbf{A}}-\mathbf{A}, with 𝐀^\widehat{\mathbf{A}} being an estimate of 𝐀\mathbf{A}. We have the following proposition.

Proposition S4 (cf. Theorem 4 of Yuan and Zhang (2013)).

In Algorithm 3, let s=d+2ss=d+2s^{\prime} with sds^{\prime}\geq d. Assume that ρ(𝐄𝐀,s)Δ𝐈\rho(\mathbf{E}_{\mathbf{A}},s)\leq\Delta_{\mathbf{I}}. Define

γ(s)=λ1𝐈Δ𝐈+ρ(𝐄𝐀,s)λ1𝐈ρ(𝐄𝐀,s)<1,δ𝐈(s)=2ρ(𝐄𝐀,s)ρ(𝐄𝐀,s)2+(Δ𝐈2ρ(𝐄𝐀,s))2.\gamma(s)=\dfrac{\lambda_{1}^{\mathbf{I}}-\Delta_{\mathbf{I}}+\rho(\mathbf{E}_{\mathbf{A}},s)}{\lambda_{1}^{\mathbf{I}}-\rho(\mathbf{E}_{\mathbf{A}},s)}<1,\quad\delta_{\mathbf{I}}(s)=\dfrac{\sqrt{2}\rho(\mathbf{E}_{\mathbf{A}},s)}{\sqrt{\rho(\mathbf{E}_{\mathbf{A}},s)^{2}+(\Delta_{\mathbf{I}}-2\rho(\mathbf{E}_{\mathbf{A}},s))^{2}}}.

If |𝐯0T𝐯𝐈|θ+δ𝐈(s)|\mathbf{v}_{0}^{\mathrm{\tiny T}}\mathbf{v}^{*}_{\mathbf{I}}|\geq\theta+\delta_{\mathbf{I}}(s) for some 𝐯00s,𝐯0=1,\|\mathbf{v}_{0}\|_{0}\leq s^{\prime},\|\mathbf{v}_{0}\|=1, and θ(0,1)\theta\in(0,1) such that

μ=[1+2{(ds)1/2+ds}]{10.5θ(1+θ)(1γ(s)2)}<1,\mu=\sqrt{[1+2\{(\frac{d}{s^{\prime}})^{1/2}+\frac{d}{s^{\prime}}\}]\{1-0.5\theta(1+\theta)(1-\gamma(s)^{2})\}}<1,

then we either have

1|𝐯0T𝐯𝐈|<10δ𝐈(s)/(1μ),\sqrt{1-|\mathbf{v}_{0}^{\mathrm{\tiny T}}\mathbf{v}^{*}_{\mathbf{I}}|}<\sqrt{10}\delta_{\mathbf{I}}(s)/(1-\mu),

or for all t0t\geq 0,

1|𝐯tT𝐯𝐈|μt1|𝐯0T𝐯𝐈|+10δ𝐈(s)/(1μ).\sqrt{1-|\mathbf{v}_{t}^{\mathrm{\tiny T}}\mathbf{v}^{*}_{\mathbf{I}}|}\leq\mu^{t}\sqrt{1-|\mathbf{v}_{0}^{\mathrm{\tiny T}}\mathbf{v}^{*}_{\mathbf{I}}|}+\sqrt{10}\delta_{\mathbf{I}}(s)/(1-\mu).
  1. Input: s,𝐀^,𝐁^s,\widehat{\mathbf{A}},\widehat{\mathbf{B}} and step size η>0\eta>0.

  2. Initialize 𝐯0\mathbf{v}_{0}.

  3. Iterate over tt until convergence:

    (a) Set ρ(t1)=𝐯t1T𝐀^𝐯t1𝐯t1T𝐁^𝐯t1\rho^{(t-1)}=\dfrac{\mathbf{v}_{t-1}^{\mathrm{\tiny T}}\widehat{\mathbf{A}}\mathbf{v}_{t-1}}{\mathbf{v}_{t-1}^{\mathrm{\tiny T}}\widehat{\mathbf{B}}\mathbf{v}_{t-1}}.

    (b) 𝐂=𝐈+(η/ρ(t1))(𝐀^ρ(t1)𝐁^)\mathbf{C}=\mathbf{I}+(\eta/\rho^{(t-1)})\cdot(\widehat{\mathbf{A}}-\rho^{(t-1)}\widehat{\mathbf{B}}).

    (c) 𝐯t=𝐂𝐯t1/𝐂𝐯t12\mathbf{v}_{t}=\mathbf{C}\mathbf{v}_{t-1}/\|\mathbf{C}\mathbf{v}_{t-1}\|_{2}.

    (d) 𝐯t=HT(𝐯t,s)HT(𝐯t,s)2\mathbf{v}_{t}=\dfrac{\mathrm{HT}(\mathbf{v}_{t},s)}{\|\mathrm{HT}(\mathbf{v}_{t},s)\|_{2}}.

  4. Output 𝐯\mathbf{v}_{\infty} at convergence.

Algorithm 4 A generic penalized generalized eigen-decomposition algorithm.
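
A corresponding Python sketch of Algorithm 4 is given below (again our own illustration rather than the RIFLE implementation of Tan, Wang, Liu and Zhang (2018); the step size, toy inputs and convergence check are arbitrary). It reuses the same hard-thresholding operator HT(,s)\mathrm{HT}(\cdot,s) as in the sketch of Algorithm 3.

```python
import numpy as np

def hard_threshold(v, s):
    out = np.zeros_like(v)
    keep = np.argsort(np.abs(v))[-s:]
    out[keep] = v[keep]
    return out

def rifle(A_hat, B_hat, s, v0, eta, max_iter=1000, tol=1e-8):
    """Algorithm 4: gradient-style update for the leading s-sparse generalized
    eigenvector of the pair (A_hat, B_hat), followed by hard thresholding."""
    p = A_hat.shape[0]
    v = v0 / np.linalg.norm(v0)
    for _ in range(max_iter):
        rho = (v @ A_hat @ v) / (v @ B_hat @ v)              # generalized Rayleigh quotient
        C = np.eye(p) + (eta / rho) * (A_hat - rho * B_hat)
        v_new = C @ v
        v_new /= np.linalg.norm(v_new)
        v_new = hard_threshold(v_new, s)
        v_new /= np.linalg.norm(v_new)
        if min(np.linalg.norm(v_new - v), np.linalg.norm(v_new + v)) < tol:
            return v_new
        v = v_new
    return v

# Toy usage with a well-conditioned B and a sparse signal direction in A.
rng = np.random.default_rng(1)
p, d = 40, 4
beta = np.zeros(p); beta[:d] = 1 / np.sqrt(d)
B_hat = np.eye(p) + 0.1 * np.diag(rng.random(p))
A_hat = 3 * B_hat @ np.outer(beta, beta) @ B_hat + 0.02 * np.eye(p)
eta = 0.5 / np.linalg.eigvalsh(B_hat).max()                  # keep eta * lambda_max(B) < 1
v0 = hard_threshold(np.linalg.solve(B_hat, A_hat)[:, 0], 3 * d)
v_est = rifle(A_hat, B_hat, s=3 * d, v0=v0, eta=eta)
print(abs(v_est @ beta))
```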

Based on the results in Tan, Wang, Liu and Zhang (2018), we can also derive the following useful results for Algorithm 4. We assume that the matrix pair (𝐀,𝐁)(\mathbf{A},\mathbf{B}) has the leading generalized eigenvector 𝐯\mathbf{v}^{*} such that 𝐯0d\|\mathbf{v}^{*}\|_{0}\leq d. The generalized eigenvalues of (𝐀,𝐁)(\mathbf{A},\mathbf{B}) are referred to as λj,j=1,,p\lambda_{j},j=1,\ldots,p and their estimates are λ^j,j=1,,p\widehat{\lambda}_{j},j=1,\ldots,p. We introduce the following notation:

cr(𝐀,𝐁)\displaystyle\text{cr}(\mathbf{A},\mathbf{B}) =\displaystyle= min𝐯:𝐯2=1{(𝐯T𝐀𝐯)2+(𝐯T𝐁𝐯)2}1/2>0\displaystyle\min_{\mathbf{v}:\|\mathbf{v}\|_{2}=1}\{(\mathbf{v}^{\mathrm{\tiny T}}\mathbf{A}\mathbf{v})^{2}+(\mathbf{v}^{\mathrm{\tiny T}}\mathbf{B}\mathbf{v})^{2}\}^{1/2}>0 (S6)
cr(k)\displaystyle\text{cr}(k^{\prime}) =\displaystyle= infF:|F|kcr(𝐀F,𝐁F),\displaystyle\inf_{F:|F|\leq k^{\prime}}\text{cr}(\mathbf{A}_{F},\mathbf{B}_{F}), (S7)
δ(k)\displaystyle\delta(k^{\prime}) =\displaystyle= ρ(𝐄𝐀,k)2+ρ(𝐄𝐁,k)2,\displaystyle\sqrt{\rho(\mathbf{E}_{\mathbf{A}},k^{\prime})^{2}+\rho(\mathbf{E}_{\mathbf{B}},k^{\prime})^{2}}, (S8)

where 𝐄𝐁=𝐁𝐁^\mathbf{E}_{\mathbf{B}}=\mathbf{B}-\widehat{\mathbf{B}}, with 𝐁^\widehat{\mathbf{B}} being an estimate of 𝐁\mathbf{B}. Also denote by κ(𝐁)\kappa(\mathbf{B}) the condition number of 𝐁\mathbf{B} and define ω1(F)=supsupp(𝐮)F,𝐮2=1𝐮T𝐀𝐮𝐮T𝐁𝐮\omega_{1}(F)=\sup_{\mathrm{supp}(\mathbf{u})\subset F,\|\mathbf{u}\|_{2}=1}\frac{\mathbf{u}^{\mathrm{\tiny T}}\mathbf{A}\mathbf{u}}{\mathbf{u}^{\mathrm{\tiny T}}\mathbf{B}\mathbf{u}} for any index set FF. We consider the following assumption:

Assumption S1.

For sufficiently large nn, there are constants b,c>0b,c>0, such that δ(s)cr(s)b\dfrac{\delta(s)}{cr(s)}\leq b and ρ(𝐄𝐁,s)cλmin(𝐁)\rho(\mathbf{E}_{\mathbf{B}},s)\leq c\lambda_{\min}(\mathbf{B}) for any s=o(n)s=o(n).

We also denote cupper=1+c1cc_{\text{upper}}=\dfrac{1+c}{1-c} with cc defined in Assumption S1. We estimate 𝐯\mathbf{v}^{*} by the RIFLE algorithm with step size η\eta. We choose η\eta such that ηλmax(𝐁)<1/(1+c)\eta\lambda_{\max}(\mathbf{B})<1/(1+c). Further, in the RIFLE algorithm, let s=2s+ds=2s^{\prime}+d and choose s=Cds^{\prime}=Cd for sufficiently large CC. The initial value 𝐯0\mathbf{v}_{0} satisfies 𝐯02=1\|\mathbf{v}_{0}\|_{2}=1.

Proposition S5 (Based on Theorem 1 and Corollary 1 in Tan, Wang, Liu and Zhang (2018)).

Under Assumption S1, we have the following conclusions:

  1. 1.

    For any FF such that supp(𝐯0)F\mathrm{supp}(\mathbf{v}_{0})\subset F, there exists a constant aa such that

    (1a)ω1(F)ω^1(F)(1+a)ω1(F).(1-a)\omega_{1}(F)\leq\hat{\omega}_{1}(F)\leq(1+a)\omega_{1}(F). (S9)
  2. 2.

    Choose η\eta such that

    ν=1+2{(d/s)1/2+d/s}11+c8ηλmin(𝐁)1αcupperκ(𝐁)+α<1,\nu=\sqrt{1+2\{(d/s^{\prime})^{1/2}+d/s^{\prime}\}}\sqrt{1-\frac{1+c}{8}\eta\lambda_{\min}(\mathbf{B})\frac{1-\alpha}{c_{\text{upper}}\kappa(\mathbf{B})+\alpha}}<1, (S10)

    where α=(1+a)λ2(1a)λ1\alpha=\frac{(1+a)\lambda_{2}}{(1-a)\lambda_{1}}. Input an initial vector 𝐯0\mathbf{v}_{0} such that |(𝐯)T𝐯0|𝐯21θ(𝐀,𝐁)\dfrac{|(\mathbf{v}^{*})^{\mathrm{\tiny T}}\mathbf{v}_{0}|}{\|\mathbf{v}^{*}\|_{2}}\geq 1-\theta(\mathbf{A},\mathbf{B}), where

    θ(𝐀,𝐁)=min[18cupperκ(𝐁),1/α13cupperκ(𝐁),1α30(1+c)cupper2ηλmax(𝐁)κ2(𝐁){cupperκ(𝐁)+α}].\theta(\mathbf{A},\mathbf{B})=\min\left[\dfrac{1}{8c_{\text{upper}}\kappa(\mathbf{B})},\dfrac{1/\alpha-1}{3c_{\text{upper}}\kappa(\mathbf{B})},\dfrac{1-\alpha}{30(1+c)c_{\text{upper}}^{2}\eta\lambda_{\max}(\mathbf{B})\kappa^{2}(\mathbf{B})\{c_{\text{upper}}\kappa(\mathbf{B})+\alpha\}}\right]. (S11)

    Further denote

    ξ=minj>1λ1(1+a)λj1+λ121+(1a)2λj2.\xi=\min_{j>1}\dfrac{\lambda_{1}-(1+a)\lambda_{j}}{\sqrt{1+\lambda_{1}^{2}}\sqrt{1+(1-a)^{2}\lambda_{j}^{2}}}. (S12)

    Assume that ξ>δ(s)/cr(s)\xi>\delta(s)/cr(s). Then we have

    1|(𝐯)T𝐯t|𝐯2νtθ(𝐀,𝐁)+101ν2ξ{cr(s)δ(s)}δ(s).\sqrt{1-\dfrac{|(\mathbf{v}^{*})^{\mathrm{\tiny T}}\mathbf{v}_{t}|}{\|\mathbf{v}^{*}\|_{2}}}\leq\nu^{t}\sqrt{\theta(\mathbf{A},\mathbf{B})}+\dfrac{\sqrt{10}}{1-\nu}\dfrac{2}{\xi\{cr(s)-\delta(s)\}}\delta(s). (S13)

We rewrite Proposition S5 in the following more user-friendly form. We denote

ϕ\displaystyle\phi =\displaystyle= λ1λ2,a=min{1/2,Δλ1+λ2,λmin(𝐁)2}\displaystyle\lambda_{1}-\lambda_{2},\quad a^{*}=\min\{1/2,\frac{\Delta}{\lambda_{1}+\lambda_{2}},\frac{\lambda_{\min}(\mathbf{B})}{2}\} (S14)
ξ\displaystyle\xi^{*} =\displaystyle= λ1λ22(1+λ12),α=(1+a)λ2(1a)λ1.\displaystyle\dfrac{\lambda_{1}-\lambda_{2}}{2(1+\lambda_{1}^{2})},\quad\alpha^{*}=\frac{(1+a^{*})\lambda_{2}}{(1-a^{*})\lambda_{1}}. (S15)

We have the following lemma.

Lemma S3.

Assume that Assumption S1 holds. Choose η\eta such that

ν=1+2{(d/s)1/2+d/s}11+c8ηλmin(𝐁)1λ2λ1cupperκ(𝐁)+3λ2λ1<1.\nu^{*}=\sqrt{1+2\{(d/s^{\prime})^{1/2}+d/s^{\prime}\}}\sqrt{1-\frac{1+c}{8}\eta\lambda_{\min}(\mathbf{B})\frac{1-\frac{\lambda_{2}}{\lambda_{1}}}{c_{\text{upper}}\kappa(\mathbf{B})+3\frac{\lambda_{2}}{\lambda_{1}}}}<1.

Also assume that δ(s)<min{1/2,λmin(𝐁)/2}\delta(s)<\min\{1/2,\lambda_{\min}(\mathbf{B})/2\}, 2δ(s)λmin(𝐁)+2δ(s)λmin(𝐁)λ1(F)<a\dfrac{2\delta(s)}{\lambda_{\min}(\mathbf{B})}+\dfrac{2\delta(s)}{\lambda_{\min}(\mathbf{B})\lambda_{1}(F)}<a^{*} and ξ>2δ(s)/cr(s)\xi^{*}>2\delta(s)/cr(s). Input an initial vector 𝐯0\mathbf{v}_{0} such that |(𝐯)T𝐯0|𝐯21θ(𝐀,𝐁)\dfrac{|(\mathbf{v}^{*})^{\mathrm{\tiny T}}\mathbf{v}_{0}|}{\|\mathbf{v}^{*}\|_{2}}\geq 1-\theta^{*}(\mathbf{A},\mathbf{B}), where 0<θ(𝐀,𝐁)<10<\theta^{*}(\mathbf{A},\mathbf{B})<1,

θ(𝐀,𝐁)=min[18cupperκ(𝐁),1/α13cupperκ(𝐁),1α30(1+c)cupper2ηλmax(𝐁)κ2(𝐁){cupperκ(𝐁)+α}].\theta^{*}(\mathbf{A},\mathbf{B})=\min\left[\dfrac{1}{8c_{\text{upper}}\kappa(\mathbf{B})},\dfrac{1/\alpha^{*}-1}{3c_{\text{upper}}\kappa(\mathbf{B})},\dfrac{1-\alpha^{*}}{30(1+c)c_{\text{upper}}^{2}\eta\lambda_{\max}(\mathbf{B})\kappa^{2}(\mathbf{B})\{c_{\text{upper}}\kappa(\mathbf{B})+\alpha^{*}\}}\right]. (S16)

We have

|sinΘ(𝐯,𝐯)|201ν2ξ{cr(s)δ(s)}δ(s).|\sin\Theta(\mathbf{v}^{*},\mathbf{v}_{\infty})|\leq\dfrac{\sqrt{20}}{1-\nu^{*}}\dfrac{2}{\xi^{*}\{cr(s)-\delta(s)\}}\delta(s). (S17)
Proof of Lemma S3.

We prove Lemma S3 by showing that all the conditions in Proposition S5 are met. Note that

ω^1(F)=sup𝐮2=1,supp(𝐮)F𝐮T𝐀^𝐮𝐮T𝐁^𝐮=sup𝐮2=1,supp(𝐮)F𝐮T𝐀𝐮+𝐮T(𝐀^𝐀)𝐮𝐮T𝐁𝐮+𝐮T(𝐁^𝐁)𝐮.\hat{\omega}_{1}(F)=\sup_{\|\mathbf{u}\|_{2}=1,\mathrm{supp}(\mathbf{u})\subset F}\dfrac{\mathbf{u}^{\mathrm{\tiny T}}\widehat{\mathbf{A}}\mathbf{u}}{\mathbf{u}^{\mathrm{\tiny T}}\widehat{\mathbf{B}}\mathbf{u}}=\sup_{\|\mathbf{u}\|_{2}=1,\mathrm{supp}(\mathbf{u})\subset F}\dfrac{\mathbf{u}^{\mathrm{\tiny T}}\mathbf{A}\mathbf{u}+\mathbf{u}^{\mathrm{\tiny T}}(\widehat{\mathbf{A}}-\mathbf{A})\mathbf{u}}{\mathbf{u}^{\mathrm{\tiny T}}\mathbf{B}\mathbf{u}+\mathbf{u}^{\mathrm{\tiny T}}(\widehat{\mathbf{B}}-\mathbf{B})\mathbf{u}}. (S18)

It is obvious that

sup𝐮2=1,supp(𝐮)F𝐮T𝐀𝐮δ(s)𝐮T𝐁𝐮+δ(s)ω^1(F)sup𝐮2=1,supp(𝐮)F𝐮T𝐀𝐮+δ(s)𝐮T𝐁𝐮δ(s)\sup_{\|\mathbf{u}\|_{2}=1,\mathrm{supp}(\mathbf{u})\subset F}\dfrac{\mathbf{u}^{\mathrm{\tiny T}}\mathbf{A}\mathbf{u}-\delta(s)}{\mathbf{u}^{\mathrm{\tiny T}}\mathbf{B}\mathbf{u}+\delta(s)}\leq\hat{\omega}_{1}(F)\leq\sup_{\|\mathbf{u}\|_{2}=1,\mathrm{supp}(\mathbf{u})\subset F}\dfrac{\mathbf{u}^{\mathrm{\tiny T}}\mathbf{A}\mathbf{u}+\delta(s)}{\mathbf{u}^{\mathrm{\tiny T}}\mathbf{B}\mathbf{u}-\delta(s)} (S19)

Because a>2δ(s)λmin(𝐁)+2δ(s)λmin(𝐁)λ1(F)a^{*}>\dfrac{2\delta(s)}{\lambda_{\min}(\mathbf{B})}+\dfrac{2\delta(s)}{\lambda_{\min}(\mathbf{B})\lambda_{1}(F)}, by Lemma S4, we have that Assumption S1 implies

(1a)ω1(F)ω^1(F)(1+a)ω1(F).(1-a^{*})\omega_{1}(F)\leq\hat{\omega}_{1}(F)\leq(1+a^{*})\omega_{1}(F). (S20)

Also, by our definition, a1/2a^{*}\leq 1/2. It follows that λ2λ1(1+a)λ2(1a)λ1=α3λ2λ1\frac{\lambda_{2}}{\lambda_{1}}\leq\frac{(1+a)\lambda_{2}}{(1-a)\lambda_{1}}=\alpha\leq\frac{3\lambda_{2}}{\lambda_{1}}. Hence, νν<1\nu\leq\nu^{*}<1, where ν\nu is defined in (S10). In addition, because aϕλ1+λ2ϕ2λ2a^{*}\leq\frac{\phi}{\lambda_{1}+\lambda_{2}}\leq\frac{\phi}{2\lambda_{2}}, we have ξξ\xi\geq\xi^{*}, where ξ\xi is defined in (S12). Finally, because 18cupperκ(𝐁)<1\dfrac{1}{8c_{\text{upper}}\kappa(\mathbf{B})}<1, we have θ(𝐀,𝐁)<1\theta^{*}(\mathbf{A},\mathbf{B})<1. Because aϕλ1+λ2a^{*}\leq\dfrac{\phi}{\lambda_{1}+\lambda_{2}}, we have 1α>01-\alpha^{*}>0 and thus θ(𝐀,𝐁)>0\theta^{*}(\mathbf{A},\mathbf{B})>0.

By Proposition S5, we have

1|(𝐯)T𝐯t|𝐯2\displaystyle\sqrt{1-\dfrac{|(\mathbf{v}^{*})^{\mathrm{\tiny T}}\mathbf{v}_{t}|}{\|\mathbf{v}^{*}\|_{2}}} \displaystyle\leq νtθ(𝐀,𝐁)+101ν2ξ{cr(s)δ(s)}δ(s)\displaystyle\nu^{t}\sqrt{\theta(\mathbf{A},\mathbf{B})}+\dfrac{\sqrt{10}}{1-\nu}\dfrac{2}{\xi\{cr(s)-\delta(s)\}}\delta(s) (S21)
\displaystyle\leq (ν)tθ(𝐀,𝐁)+101ν2ξ{cr(s)δ(s)}δ(s).\displaystyle(\nu^{*})^{t}\sqrt{\theta^{*}(\mathbf{A},\mathbf{B})}+\dfrac{\sqrt{10}}{1-\nu^{*}}\dfrac{2}{\xi^{*}\{cr(s)-\delta(s)\}}\delta(s). (S22)

Let tt\rightarrow\infty and we have that 1|(𝐯)T𝐯|𝐯2101ν2ξ{cr(s)δ(s)}δ(s).\sqrt{1-\dfrac{|(\mathbf{v}^{*})^{\mathrm{\tiny T}}\mathbf{v}_{\infty}|}{\|\mathbf{v}^{*}\|_{2}}}\leq\dfrac{\sqrt{10}}{1-\nu^{*}}\dfrac{2}{\xi^{*}\{cr(s)-\delta(s)\}}\delta(s). Finally, we note that 1|(𝐯)T𝐯|𝐯2=1|cosΘ(𝐯,𝐯)|1-\dfrac{|(\mathbf{v}^{*})^{\mathrm{\tiny T}}\mathbf{v}_{\infty}|}{\|\mathbf{v}^{*}\|_{2}}=1-|\cos\Theta(\mathbf{v}^{*},\mathbf{v}_{\infty})|. Since sin2Θ(𝐯,𝐯)=(1+|cosΘ(𝐯,𝐯)|)(1|cosΘ(𝐯,𝐯)|)2(1|cosΘ(𝐯,𝐯)|)\sin^{2}\Theta(\mathbf{v}^{*},\mathbf{v}_{\infty})=(1+|\cos\Theta(\mathbf{v}^{*},\mathbf{v}_{\infty})|)(1-|\cos\Theta(\mathbf{v}^{*},\mathbf{v}_{\infty})|)\leq 2(1-|\cos\Theta(\mathbf{v}^{*},\mathbf{v}_{\infty})|), we have the desired conclusion. ∎

Lemma S4.

Consider two symmetric matrices 𝐀,𝐁\mathbf{A},\mathbf{B}, where λmin(𝐁)>0\lambda_{\min}(\mathbf{B})>0. For any F{1,,p}F\subset\{1,\ldots,p\}, denote

ω1(F)=λ1(F)=sup𝐮2=1,supp(𝐮)F𝐮T𝐀𝐮𝐮T𝐁𝐮.\omega_{1}(F)=\lambda_{1}(F)=\sup_{\|\mathbf{u}\|_{2}=1,\mathrm{supp}(\mathbf{u})\subset F}\dfrac{\mathbf{u}^{\mathrm{\tiny T}}\mathbf{A}\mathbf{u}}{\mathbf{u}^{\mathrm{\tiny T}}\mathbf{B}\mathbf{u}}. (S23)

For any 0<ϵ<min{12,λmin(𝐁)2}0<\epsilon<\min\{\frac{1}{2},\dfrac{\lambda_{\min}(\mathbf{B})}{2}\}, we have that

sup𝐮2=1,supp(𝐮)F𝐮T𝐀𝐮+ϵ𝐮T𝐁𝐮ϵ\displaystyle\sup_{\|\mathbf{u}\|_{2}=1,\mathrm{supp}(\mathbf{u})\subset F}\dfrac{\mathbf{u}^{\mathrm{\tiny T}}\mathbf{A}\mathbf{u}+\epsilon}{\mathbf{u}^{\mathrm{\tiny T}}\mathbf{B}\mathbf{u}-\epsilon} \displaystyle\leq (1+a)λ1(F),\displaystyle(1+a)\lambda_{1}(F), (S24)
sup𝐮2=1,supp(𝐮)F𝐮T𝐀𝐮ϵ𝐮T𝐁𝐮+ϵ\displaystyle\sup_{\|\mathbf{u}\|_{2}=1,\mathrm{supp}(\mathbf{u})\subset F}\dfrac{\mathbf{u}^{\mathrm{\tiny T}}\mathbf{A}\mathbf{u}-\epsilon}{\mathbf{u}^{\mathrm{\tiny T}}\mathbf{B}\mathbf{u}+\epsilon} \displaystyle\geq (1a)λ1(F),\displaystyle(1-a)\lambda_{1}(F), (S25)

where a=2ϵλmin(𝐁)+2ϵλmin(𝐁)ω1(F)a=\dfrac{2\epsilon}{\lambda_{\min}(\mathbf{B})}+\dfrac{2\epsilon}{\lambda_{\min}(\mathbf{B})\omega_{1}(F)}.

Proof of Lemma S4.

For (S24), note that, for any 𝐮\mathbf{u}, we have 𝐮T𝐁𝐮ϵ{1ϵλmin(𝐁)}𝐮T𝐁𝐮.\mathbf{u}^{\mathrm{\tiny T}}\mathbf{B}\mathbf{u}-\epsilon\geq\{1-\frac{\epsilon}{\lambda_{\min}(\mathbf{B})}\}\mathbf{u}^{\mathrm{\tiny T}}\mathbf{B}\mathbf{u}. So

sup𝐮2=1,supp(𝐮)F𝐮T𝐀𝐮+ϵ𝐮T𝐁𝐮ϵ\displaystyle\sup_{\|\mathbf{u}\|_{2}=1,\mathrm{supp}(\mathbf{u})\subset F}\dfrac{\mathbf{u}^{\mathrm{\tiny T}}\mathbf{A}\mathbf{u}+\epsilon}{\mathbf{u}^{\mathrm{\tiny T}}\mathbf{B}\mathbf{u}-\epsilon} (S26)
\displaystyle\leq 11ϵλmin(𝐁)sup𝐮2=1,supp(𝐮)F𝐮T𝐀𝐮𝐮T𝐁𝐮+11ϵλmin(𝐁)sup𝐮2=1,supp(𝐮)Fϵ𝐮T𝐁𝐮\displaystyle\dfrac{1}{1-\frac{\epsilon}{\lambda_{\min}(\mathbf{B})}}\cdot\sup_{\|\mathbf{u}\|_{2}=1,\mathrm{supp}(\mathbf{u})\subset F}\dfrac{\mathbf{u}^{\mathrm{\tiny T}}\mathbf{A}\mathbf{u}}{\mathbf{u}^{\mathrm{\tiny T}}\mathbf{B}\mathbf{u}}+\dfrac{1}{1-\frac{\epsilon}{\lambda_{\min}(\mathbf{B})}}\cdot\sup_{\|\mathbf{u}\|_{2}=1,\mathrm{supp}(\mathbf{u})\subset F}\dfrac{\epsilon}{\mathbf{u}^{\mathrm{\tiny T}}\mathbf{B}\mathbf{u}} (S27)
\displaystyle\leq 11ϵλmin(𝐁)ω1(F)+2ϵλmin(𝐁)ω1(F)ω1(F),\displaystyle\dfrac{1}{1-\frac{\epsilon}{\lambda_{\min}(\mathbf{B})}}\cdot\omega_{1}(F)+\dfrac{2\epsilon}{\lambda_{\min}(\mathbf{B})\omega_{1}(F)}\cdot\omega_{1}(F), (S28)

where in the last inequality we use the fact that ϵ<λmin(𝐁)2\epsilon<\dfrac{\lambda_{\min}(\mathbf{B})}{2}. Also note that, for any 0<x<1/20<x<1/2, we have 11x<1+2x\dfrac{1}{1-x}<1+2x. It follows that 11ϵλmin(𝐁)1+2ϵλmin(𝐁).\dfrac{1}{1-\frac{\epsilon}{\lambda_{\min}(\mathbf{B})}}\leq 1+\dfrac{2\epsilon}{\lambda_{\min}(\mathbf{B})}. Hence,

sup𝐮2=1,supp(𝐮)F𝐮T𝐀𝐮+ϵ𝐮T𝐁𝐮ϵ{1+2ϵλmin(𝐁)+2ϵλmin(𝐁)ω1(F)}ω1(F)=(1+a)ω1(F).\sup_{\|\mathbf{u}\|_{2}=1,\mathrm{supp}(\mathbf{u})\subset F}\dfrac{\mathbf{u}^{\mathrm{\tiny T}}\mathbf{A}\mathbf{u}+\epsilon}{\mathbf{u}^{\mathrm{\tiny T}}\mathbf{B}\mathbf{u}-\epsilon}\leq\left\{1+\dfrac{2\epsilon}{\lambda_{\min}(\mathbf{B})}+\dfrac{2\epsilon}{\lambda_{\min}(\mathbf{B})\omega_{1}(F)}\right\}\omega_{1}(F)=(1+a)\omega_{1}(F). (S29)

Similarly, for (S25), we note that, for any 𝐮\mathbf{u}, 𝐮T𝐁𝐮+ϵ{1+ϵλmin(𝐁)}𝐮T𝐁𝐮.\mathbf{u}^{\mathrm{\tiny T}}\mathbf{B}\mathbf{u}+\epsilon\leq\{1+\frac{\epsilon}{\lambda_{\min}(\mathbf{B})}\}\mathbf{u}^{\mathrm{\tiny T}}\mathbf{B}\mathbf{u}. So

sup𝐮2=1,supp(𝐮)F𝐮T𝐀𝐮ϵ𝐮T𝐁𝐮+ϵ\displaystyle\sup_{\|\mathbf{u}\|_{2}=1,\mathrm{supp}(\mathbf{u})\subset F}\dfrac{\mathbf{u}^{\mathrm{\tiny T}}\mathbf{A}\mathbf{u}-\epsilon}{\mathbf{u}^{\mathrm{\tiny T}}\mathbf{B}\mathbf{u}+\epsilon} \displaystyle\geq 11+ϵλmin(𝐁)ω1(F)11+ϵλmin(𝐁)ϵλmin(𝐁),\displaystyle\dfrac{1}{1+\frac{\epsilon}{\lambda_{\min}(\mathbf{B})}}\cdot\omega_{1}(F)-\dfrac{1}{1+\frac{\epsilon}{\lambda_{\min}(\mathbf{B})}}\cdot\dfrac{\epsilon}{\lambda_{\min}(\mathbf{B})}, (S30)
\displaystyle\geq 11+ϵλmin(𝐁)ω1(F)ϵλmin(𝐁)ω1(F)ω1(F).\displaystyle\dfrac{1}{1+\frac{\epsilon}{\lambda_{\min}(\mathbf{B})}}\cdot\omega_{1}(F)-\dfrac{\epsilon}{\lambda_{\min}(\mathbf{B})\omega_{1}(F)}\cdot\omega_{1}(F). (S31)

Because 11+x>1x\frac{1}{1+x}>1-x for any x>0x>0, we have 11+ϵλmin(𝐁)>1ϵλmin(𝐁)\dfrac{1}{1+\frac{\epsilon}{\lambda_{\min}(\mathbf{B})}}>1-\dfrac{\epsilon}{\lambda_{\min}(\mathbf{B})}. Hence,

sup𝐮2=1,supp(𝐮)F𝐮T𝐀𝐮ϵ𝐮T𝐁𝐮+ϵ(1ϵλmin(𝐁)ϵλmin(𝐁)ω1(F))ω1(F)(1a)ω1(F).\sup_{\|\mathbf{u}\|_{2}=1,\mathrm{supp}(\mathbf{u})\subset F}\dfrac{\mathbf{u}^{\mathrm{\tiny T}}\mathbf{A}\mathbf{u}-\epsilon}{\mathbf{u}^{\mathrm{\tiny T}}\mathbf{B}\mathbf{u}+\epsilon}\geq(1-\dfrac{\epsilon}{\lambda_{\min}(\mathbf{B})}-\dfrac{\epsilon}{\lambda_{\min}(\mathbf{B})\omega_{1}(F)})\omega_{1}(F)\geq(1-a)\omega_{1}(F). (S32)

The conclusion follows. ∎

S6.2 Additional technical lemmas

We first derive several lemmas concerning a parameter 𝜷k\boldsymbol{\beta}_{k} (either in the penalized eigen-decomposition or the penalized generalized eigen-decomposition) and its estimate 𝜷^k\widehat{\boldsymbol{\beta}}_{k}. We denote ηk=|sinΘ(𝜷k,𝜷^k)|\eta_{k}=|\sin\Theta(\boldsymbol{\beta}_{k},\hat{\boldsymbol{\beta}}_{k})|, and λ^k=𝜷^kT𝐌^𝜷^k\hat{\lambda}_{k}=\widehat{\boldsymbol{\beta}}_{k}^{\mathrm{\tiny T}}\widehat{\mathbf{M}}\widehat{\boldsymbol{\beta}}_{k}.

Lemma S5.

If 𝛃^k0s\|\widehat{\boldsymbol{\beta}}_{k}\|_{0}\leq s and 𝛃k0s\|\boldsymbol{\beta}_{k}\|_{0}\leq s for k=1,,Kk=1,\ldots,K, we have that

vec(𝜷k𝜷kT𝜷^k𝜷^kT)1\displaystyle\|\mathrm{vec}(\boldsymbol{\beta}_{k}\boldsymbol{\beta}_{k}^{\mathrm{\tiny T}}-\hat{\boldsymbol{\beta}}_{k}\hat{\boldsymbol{\beta}}_{k}^{\mathrm{\tiny T}})\|_{1} \displaystyle\leq 2sηk\displaystyle 2s\eta_{k} (S33)
Proof of Lemma S5.

For (S33), set 𝜻k=vec(𝜷k𝜷kT𝜷^k𝜷^kT){\bm{\zeta}}_{k}=\mathrm{vec}(\boldsymbol{\beta}_{k}\boldsymbol{\beta}_{k}^{\mathrm{\tiny T}}-\hat{\boldsymbol{\beta}}_{k}\hat{\boldsymbol{\beta}}_{k}^{\mathrm{\tiny T}}). We have that

𝜻k22\displaystyle\|{\bm{\zeta}}_{k}\|_{2}^{2} =\displaystyle= 𝜷k𝜷kT𝜷^k𝜷^kTF2=Tr((𝜷k𝜷kT𝜷^k𝜷^kT)(𝜷k𝜷kT𝜷^k𝜷^kT))\displaystyle\|\boldsymbol{\beta}_{k}\boldsymbol{\beta}_{k}^{\mathrm{\tiny T}}-\hat{\boldsymbol{\beta}}_{k}\hat{\boldsymbol{\beta}}_{k}^{\mathrm{\tiny T}}\|_{F}^{2}={\mathrm{Tr}}((\boldsymbol{\beta}_{k}\boldsymbol{\beta}_{k}^{\mathrm{\tiny T}}-\hat{\boldsymbol{\beta}}_{k}\hat{\boldsymbol{\beta}}_{k}^{\mathrm{\tiny T}})(\boldsymbol{\beta}_{k}\boldsymbol{\beta}_{k}^{\mathrm{\tiny T}}-\hat{\boldsymbol{\beta}}_{k}\hat{\boldsymbol{\beta}}_{k}^{\mathrm{\tiny T}})) (S34)
=\displaystyle= 2(1(𝜷kT𝜷^k)2)=2ηk2\displaystyle 2(1-(\boldsymbol{\beta}_{k}^{\mathrm{\tiny T}}\hat{\boldsymbol{\beta}}_{k})^{2})=2\eta^{2}_{k} (S35)

Hence, by the Cauchy-Schwarz inequality,

𝜻k1𝜻k0𝜻k22s22ηk2=2sηk\displaystyle\|{\bm{\zeta}}_{k}\|_{1}\leq\sqrt{\|{\bm{\zeta}}_{k}\|_{0}}\|{\bm{\zeta}}_{k}\|_{2}\leq\sqrt{2s^{2}}\cdot\sqrt{2\eta_{k}^{2}}=2s\eta_{k} (S36)
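
As a quick sanity check of Lemma S5 (illustration only; the dimension, sparsity level and random draws below are arbitrary choices of ours), one can verify the bound vec(𝜷k𝜷kT𝜷^k𝜷^kT)12sηk\|\mathrm{vec}(\boldsymbol{\beta}_{k}\boldsymbol{\beta}_{k}^{\mathrm{\tiny T}}-\hat{\boldsymbol{\beta}}_{k}\hat{\boldsymbol{\beta}}_{k}^{\mathrm{\tiny T}})\|_{1}\leq 2s\eta_{k} numerically for random ss-sparse unit vectors.

```python
# A quick numerical sanity check (illustration only) of the bound in Lemma S5:
# for s-sparse unit vectors beta and beta_hat,
# ||vec(beta beta^T - beta_hat beta_hat^T)||_1 <= 2 * s * |sin Theta(beta, beta_hat)|.
import numpy as np

rng = np.random.default_rng(2)
p, s = 30, 5
for _ in range(1000):
    beta, beta_hat = np.zeros(p), np.zeros(p)
    beta[rng.choice(p, s, replace=False)] = rng.standard_normal(s)
    beta_hat[rng.choice(p, s, replace=False)] = rng.standard_normal(s)
    beta /= np.linalg.norm(beta)
    beta_hat /= np.linalg.norm(beta_hat)
    eta = np.sqrt(max(0.0, 1.0 - (beta @ beta_hat) ** 2))          # |sin Theta|
    lhs = np.abs(np.outer(beta, beta) - np.outer(beta_hat, beta_hat)).sum()
    assert lhs <= 2 * s * eta + 1e-10
print("Lemma S5 bound held on all random draws")
```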

Lemma S6.

If 𝛃k0s\|\boldsymbol{\beta}_{k}\|_{0}\leq s, we have that

vec(𝜷k𝜷kT)1s.\|\mathrm{vec}(\boldsymbol{\beta}_{k}\boldsymbol{\beta}_{k}^{\mathrm{\tiny T}})\|_{1}\leq s. (S37)
Proof of Lemma S6.

By the Cauchy-Schwarz inequality, we have that

vec(𝜷k𝜷kT)1vec(𝜷k𝜷kT)0vec(𝜷k𝜷kT)2\|\mathrm{vec}(\boldsymbol{\beta}_{k}\boldsymbol{\beta}_{k}^{\mathrm{\tiny T}})\|_{1}\leq\sqrt{\|\mathrm{vec}(\boldsymbol{\beta}_{k}\boldsymbol{\beta}_{k}^{\mathrm{\tiny T}})\|_{0}}\|\mathrm{vec}(\boldsymbol{\beta}_{k}\boldsymbol{\beta}_{k}^{\mathrm{\tiny T}})\|_{2} (S38)

Note that vec(𝜷k𝜷kT)0s2\|\mathrm{vec}(\boldsymbol{\beta}_{k}\boldsymbol{\beta}_{k}^{\mathrm{\tiny T}})\|_{0}\leq s^{2} and

vec(𝜷k𝜷kT)22=𝜷k𝜷kTF2=Tr(𝜷k𝜷kT𝜷k𝜷kT)=1\|\mathrm{vec}(\boldsymbol{\beta}_{k}\boldsymbol{\beta}_{k}^{\mathrm{\tiny T}})\|_{2}^{2}=\|\boldsymbol{\beta}_{k}\boldsymbol{\beta}_{k}^{\mathrm{\tiny T}}\|_{F}^{2}={\mathrm{Tr}}(\boldsymbol{\beta}_{k}\boldsymbol{\beta}_{k}^{\mathrm{\tiny T}}\boldsymbol{\beta}_{k}\boldsymbol{\beta}_{k}^{\mathrm{\tiny T}})=1 (S39)

where we use the fact that 𝜷kT𝜷k=1\boldsymbol{\beta}_{k}^{\mathrm{\tiny T}}\boldsymbol{\beta}_{k}=1. And we have the desired conclusion. ∎

Throughout the rest of this section, we also repeatedly use the fact that, for a vector 𝐮\mathbf{u} with 𝐮2=1\|\mathbf{u}\|_{2}=1 and 𝐮0s\|\mathbf{u}\|_{0}\leq s, we must have 𝐮1s\|\mathbf{u}\|_{1}\leq\sqrt{s} by the Cauchy-Schwarz inequality.

S6.3 Proof for Theorem 2

In this subsection, we assume that 𝚺𝐗=𝐈\boldsymbol{\Sigma}_{\mathbf{X}}=\mathbf{I}, 𝜷^k\widehat{\boldsymbol{\beta}}_{k} are solutions produced by Algorithm 1 for the penalized eigen-decomposition problem, and λ^k=(𝜷^k)T𝐌^𝜷^k\widehat{\lambda}_{k}=(\widehat{\boldsymbol{\beta}}_{k})^{\mathrm{\tiny T}}\widehat{\mathbf{M}}\widehat{\boldsymbol{\beta}}_{k}. We assume all the conditions in Theorem 2. We have the following result.

Lemma S7.

If 𝛃^k0s,𝛃k0s\|\widehat{\boldsymbol{\beta}}_{k}\|_{0}\leq s,\|\boldsymbol{\beta}_{k}\|_{0}\leq s, we have that

|λ^kλk|s(2ηk+1)ϵ+(λ1+λk)ηk2.|\hat{\lambda}_{k}-\lambda_{k}|\leq s(2\eta_{k}+1)\epsilon+(\lambda_{1}+\lambda_{k})\eta_{k}^{2}. (S40)
Proof of Lemma S7.

Note that

|λ^kλk|=|𝐌^,𝜷^k𝜷^kT𝐌,𝜷k𝜷kT|\displaystyle|\hat{\lambda}_{k}-\lambda_{k}|=|\langle\widehat{\mathbf{M}},\hat{\boldsymbol{\beta}}_{k}\hat{\boldsymbol{\beta}}_{k}^{\mathrm{\tiny T}}\rangle-\langle\mathbf{M},\boldsymbol{\beta}_{k}\boldsymbol{\beta}_{k}^{\mathrm{\tiny T}}\rangle| (S41)
\displaystyle\leq |𝐌^𝐌,𝜷^k𝜷^kT𝜷k𝜷kT|+|𝐌^𝐌,𝜷k𝜷kT|+|𝐌,𝜷^k𝜷^kT𝜷k𝜷kT|\displaystyle|\langle\widehat{\mathbf{M}}-\mathbf{M},\hat{\boldsymbol{\beta}}_{k}\hat{\boldsymbol{\beta}}_{k}^{\mathrm{\tiny T}}-\boldsymbol{\beta}_{k}\boldsymbol{\beta}_{k}^{\mathrm{\tiny T}}\rangle|+|\langle\widehat{\mathbf{M}}-\mathbf{M},\boldsymbol{\beta}_{k}\boldsymbol{\beta}_{k}^{\mathrm{\tiny T}}\rangle|+|\langle\mathbf{M},\hat{\boldsymbol{\beta}}_{k}\hat{\boldsymbol{\beta}}_{k}^{\mathrm{\tiny T}}-\boldsymbol{\beta}_{k}\boldsymbol{\beta}_{k}^{\mathrm{\tiny T}}\rangle| (S42)
\displaystyle\equiv L1+L2+L3\displaystyle L_{1}+L_{2}+L_{3} (S43)

By Lemma S5,

L1𝐌^𝐌maxvec(𝜷^k𝜷^kT𝜷k𝜷kT)12sηk𝐌^𝐌max2sηkϵ.\displaystyle L_{1}\leq\|\widehat{\mathbf{M}}-\mathbf{M}\|_{\max}\|\mathrm{vec}(\hat{\boldsymbol{\beta}}_{k}\hat{\boldsymbol{\beta}}_{k}^{\mathrm{\tiny T}}-\boldsymbol{\beta}_{k}\boldsymbol{\beta}_{k}^{\mathrm{\tiny T}})\|_{1}\leq 2s\eta_{k}\|\widehat{\mathbf{M}}-\mathbf{M}\|_{\max}\leq 2s\eta_{k}\epsilon. (S44)

By Lemma S6,

L2𝐌^𝐌maxvec(𝜷k𝜷kT)1s𝐌^𝐌maxsϵ.\displaystyle L_{2}\leq\|\widehat{\mathbf{M}}-\mathbf{M}\|_{\max}\|\mathrm{vec}(\boldsymbol{\beta}_{k}\boldsymbol{\beta}_{k}^{\mathrm{\tiny T}})\|_{1}\leq s\|\widehat{\mathbf{M}}-\mathbf{M}\|_{\max}\leq s\epsilon. (S45)

For L3L_{3}, note that 𝐌=j=1pλj𝜷j𝜷jT\mathbf{M}=\sum_{j=1}^{p}\lambda_{j}\boldsymbol{\beta}_{j}\boldsymbol{\beta}_{j}^{\mathrm{\tiny T}}, which implies that

𝐌,𝜷^k𝜷^kT=j=1pλj(𝜷^kT𝜷j)2=j=1pλjcos2𝚯(𝜷^k,𝜷j).\langle\mathbf{M},\widehat{\boldsymbol{\beta}}_{k}\widehat{\boldsymbol{\beta}}_{k}^{\mathrm{\tiny T}}\rangle=\sum_{j=1}^{p}\lambda_{j}(\widehat{\boldsymbol{\beta}}_{k}^{\mathrm{\tiny T}}\boldsymbol{\beta}_{j})^{2}=\sum_{j=1}^{p}\lambda_{j}\cos^{2}\boldsymbol{\Theta}(\hat{\boldsymbol{\beta}}_{k},\boldsymbol{\beta}_{j}).

Also note that j=1pcos2𝚯(𝜷^k,𝜷j)=1\sum_{j=1}^{p}\cos^{2}\boldsymbol{\Theta}(\hat{\boldsymbol{\beta}}_{k},\boldsymbol{\beta}_{j})=1. Hence,

L3\displaystyle L_{3} =\displaystyle= |j=1pλjcos2𝚯(𝜷j,𝜷^k)λk|\displaystyle|\sum_{j=1}^{p}\lambda_{j}\cos^{2}\boldsymbol{\Theta}(\boldsymbol{\beta}_{j},\hat{\boldsymbol{\beta}}_{k})-\lambda_{k}| (S46)
\displaystyle\leq jkλjcos2𝚯(𝜷^k,𝜷j)+λk|cos2𝚯(𝜷^k,𝜷k)1|\displaystyle\sum_{j\neq k}\lambda_{j}\cos^{2}\boldsymbol{\Theta}(\widehat{\boldsymbol{\beta}}_{k},\boldsymbol{\beta}_{j})+\lambda_{k}|\cos^{2}\boldsymbol{\Theta}(\widehat{\boldsymbol{\beta}}_{k},\boldsymbol{\beta}_{k})-1| (S47)
\displaystyle\leq λ1jkcos2𝚯(𝜷^k,𝜷j)+λk(1cos2𝚯(𝜷^k,𝜷k))\displaystyle\lambda_{1}\sum_{j\neq k}\cos^{2}\boldsymbol{\Theta}(\widehat{\boldsymbol{\beta}}_{k},\boldsymbol{\beta}_{j})+\lambda_{k}(1-\cos^{2}\boldsymbol{\Theta}(\widehat{\boldsymbol{\beta}}_{k},\boldsymbol{\beta}_{k})) (S48)
\displaystyle\leq (λ1+λk)(1cos2𝚯(𝜷k,𝜷^k))\displaystyle(\lambda_{1}+\lambda_{k})(1-\cos^{2}\boldsymbol{\Theta}(\boldsymbol{\beta}_{k},\hat{\boldsymbol{\beta}}_{k})) (S49)
=\displaystyle= (λ1+λk)ηk2\displaystyle(\lambda_{1}+\lambda_{k})\eta_{k}^{2} (S50)

In Lemmas S8S10, we assume that the event 𝐌^𝐌maxϵ\|\widehat{\mathbf{M}}-\mathbf{M}\|_{\max}\leq\epsilon has happened.

Lemma S8.

For the first direction 𝛃^1\widehat{\boldsymbol{\beta}}_{1}, we have that |sinΘ(𝛃^1,𝛃1)|Csϵ|\sin\Theta(\widehat{\boldsymbol{\beta}}_{1},\boldsymbol{\beta}_{1})|\leq Cs\epsilon and |λ^1λ1|Csϵ|\hat{\lambda}_{1}-\lambda_{1}|\leq Cs\epsilon.

Proof of Lemma S8.

It is easy to see that

ρ(𝐌^𝐌,s)=sup𝐮2=1,𝐮0s|𝐮T(𝐌^𝐌)𝐮|sup𝐮2=1,𝐮0s𝐮12𝐌^𝐌maxsϵ.\displaystyle\rho(\widehat{\mathbf{M}}-\mathbf{M},s)=\sup_{\|\mathbf{u}\|_{2}=1,\|\mathbf{u}\|_{0}\leq s}|\mathbf{u}^{\mathrm{\tiny T}}(\widehat{\mathbf{M}}-\mathbf{M})\mathbf{u}|\leq\sup_{\|\mathbf{u}\|_{2}=1,\|\mathbf{u}\|_{0}\leq s}\|\mathbf{u}\|_{1}^{2}\cdot\|\widehat{\mathbf{M}}-\mathbf{M}\|_{\max}\leq s\epsilon. (S51)

Under our assumptions about ϵ\epsilon, by Proposition S4, we have |sinΘ(𝜷^1,𝜷1)|Csϵ\lvert\sin\Theta(\widehat{\boldsymbol{\beta}}_{1},\boldsymbol{\beta}_{1})\rvert\leq Cs\epsilon at convergence. Lemma S7 further implies that |λ^1λ1|Csϵ|\hat{\lambda}_{1}-\lambda_{1}|\leq Cs\epsilon. ∎

Lemma S9.

If |sinΘ(𝛃^j,𝛃j)|Csϵ|\sin\Theta(\widehat{\boldsymbol{\beta}}_{j},\boldsymbol{\beta}_{j})|\leq Cs\epsilon for sufficiently small ϵ\epsilon and all jkj\leq k, then ρ(𝐌^k+1𝐌k+1,s)Csϵ\rho(\widehat{\mathbf{M}}_{k+1}-\mathbf{M}_{k+1},s)\leq Cs\epsilon.

Proof of Lemma S9.

Define 𝐍l=λ^l𝜷^l𝜷^lTλl𝜷l𝜷lT\mathbf{N}_{l}=\hat{\lambda}_{l}\hat{\boldsymbol{\beta}}_{l}\hat{\boldsymbol{\beta}}_{l}^{\mathrm{\tiny T}}-\lambda_{l}\boldsymbol{\beta}_{l}\boldsymbol{\beta}_{l}^{\mathrm{\tiny T}}. Then

𝐌^k+1𝐌k+1=(𝐌^𝐌)lk𝐍l.\widehat{\mathbf{M}}_{k+1}-\mathbf{M}_{k+1}=(\widehat{\mathbf{M}}-\mathbf{M})-\sum_{l\leq k}\mathbf{N}_{l}. (S52)

It follows that

ρ(𝐌^k+1𝐌k+1,s)ρ(𝐌^𝐌,s)+lkρ(𝐍l,s).\rho(\widehat{\mathbf{M}}_{k+1}-\mathbf{M}_{k+1},s)\leq\rho(\widehat{\mathbf{M}}-\mathbf{M},s)+\sum_{l\leq k}\rho(\mathbf{N}_{l},s). (S53)

According to the proof in Lemma S8, ρ(𝐌^𝐌,s)sϵ\rho(\widehat{\mathbf{M}}-\mathbf{M},s)\leq s\epsilon.

For any vector 𝐮\mathbf{u}, we have

𝐮T𝐍l𝐮\displaystyle\mathbf{u}^{\mathrm{\tiny T}}\mathbf{N}_{l}\mathbf{u} =\displaystyle= λ^l(𝐮T𝜷^l)2λl(𝐮T𝜷l)2\displaystyle\widehat{\lambda}_{l}(\mathbf{u}^{\mathrm{\tiny T}}\widehat{\boldsymbol{\beta}}_{l})^{2}-{\lambda}_{l}(\mathbf{u}^{\mathrm{\tiny T}}{\boldsymbol{\beta}}_{l})^{2} (S54)
=\displaystyle= (λ^lλl)(𝐮T𝜷^l)2+λl{(𝐮T𝜷l)2(𝐮T𝜷^l)2}L1+L2\displaystyle(\widehat{\lambda}_{l}-\lambda_{l})(\mathbf{u}^{\mathrm{\tiny T}}\widehat{\boldsymbol{\beta}}_{l})^{2}+\lambda_{l}\{(\mathbf{u}^{\mathrm{\tiny T}}\boldsymbol{\beta}_{l})^{2}-(\mathbf{u}^{\mathrm{\tiny T}}\widehat{\boldsymbol{\beta}}_{l})^{2}\}\equiv L_{1}+L_{2} (S55)

By Lemma S7, we have that |λ^lλl|Csϵ|\hat{\lambda}_{l}-\lambda_{l}|\leq Cs\epsilon when |sinΘ(𝜷^j,𝜷j)|Csϵ|\sin\Theta(\widehat{\boldsymbol{\beta}}_{j},\boldsymbol{\beta}_{j})|\leq Cs\epsilon. Also, |𝐮T𝜷l|𝐮2𝜷l2=1|\mathbf{u}^{\mathrm{\tiny T}}\boldsymbol{\beta}_{l}|\leq\|\mathbf{u}\|_{2}\cdot\|\boldsymbol{\beta}_{l}\|_{2}=1. It follows that L1CsϵL_{1}\leq Cs\epsilon. For L2L_{2}, we assume that cos𝚯(𝜷^l,𝜷l)>0\cos\boldsymbol{\Theta}(\widehat{\boldsymbol{\beta}}_{l},\boldsymbol{\beta}_{l})>0 without loss of generality, because otherwise we can consider the proof for 𝜷^l-\widehat{\boldsymbol{\beta}}_{l}. Note that

L2\displaystyle L_{2} \displaystyle\leq λl|𝐮T(𝜷^l𝜷l)||𝐮T(𝜷^l+𝜷l)|\displaystyle\lambda_{l}|\mathbf{u}^{\mathrm{\tiny T}}(\widehat{\boldsymbol{\beta}}_{l}-\boldsymbol{\beta}_{l})|\cdot|\mathbf{u}^{\mathrm{\tiny T}}(\widehat{\boldsymbol{\beta}}_{l}+\boldsymbol{\beta}_{l})|
\displaystyle\leq λl𝐮22𝜷^l𝜷l2(𝜷^l2+𝜷l2)\displaystyle\lambda_{l}\|\mathbf{u}\|_{2}^{2}\cdot\|\widehat{\boldsymbol{\beta}}_{l}-\boldsymbol{\beta}_{l}\|_{2}(\|\widehat{\boldsymbol{\beta}}_{l}\|_{2}+\|\boldsymbol{\beta}_{l}\|_{2})
=\displaystyle= C𝜷^l𝜷l2\displaystyle C\|\widehat{\boldsymbol{\beta}}_{l}-\boldsymbol{\beta}_{l}\|_{2}
=\displaystyle= C22𝜷^lT𝜷l=C2(1cosΘ(𝜷^l,𝜷l))\displaystyle C\sqrt{2-2\widehat{\boldsymbol{\beta}}_{l}^{\mathrm{\tiny T}}\boldsymbol{\beta}_{l}}=C\sqrt{2(1-\cos\Theta(\widehat{\boldsymbol{\beta}}_{l},\boldsymbol{\beta}_{l}))}
=\displaystyle= C1cos2Θ(𝜷^l,𝜷l)/1+cosΘ(𝜷^l,𝜷l)\displaystyle C\sqrt{1-\cos^{2}\Theta(\widehat{\boldsymbol{\beta}}_{l},\boldsymbol{\beta}_{l})}/\sqrt{1+\cos\Theta(\widehat{\boldsymbol{\beta}}_{l},\boldsymbol{\beta}_{l})}
\displaystyle\leq C|sinΘ(𝜷^l,𝜷l)|\displaystyle C|\sin\Theta(\widehat{\boldsymbol{\beta}}_{l},\boldsymbol{\beta}_{l})|
\displaystyle\leq Csϵ.\displaystyle Cs\epsilon.

Hence, ρ(𝐍l,s)Csϵ\rho(\mathbf{N}_{l},s)\leq Cs\epsilon and the conclusion follows. ∎

Lemma S10.

Assume that |sinΘ(𝛃^l,𝛃l)|<Csϵ|\sin\Theta(\hat{\boldsymbol{\beta}}_{l},\boldsymbol{\beta}_{l})|<Cs\epsilon for any l<kl<k, where 0<ϵ<min{Δ4s,θ}0<\epsilon<\min\{\dfrac{\Delta}{4s},\theta\} for θ\theta defined in Theorem 2. We have that |sinΘ(𝛃^k,𝛃k)|Csϵ|\sin\Theta(\hat{\boldsymbol{\beta}}_{k},\boldsymbol{\beta}_{k})|\leq Cs\epsilon.

Proof of Lemma S10.

The conclusion follows from Lemma S9 and Proposition S4. ∎

Proof of Theorem 2.

Under the event 𝐌^𝐌maxϵ\|\widehat{\mathbf{M}}-\mathbf{M}\|_{\max}\leq\epsilon, we have that |sinΘ(𝜷^k,𝜷k)|Csϵ|\sin\Theta(\widehat{\boldsymbol{\beta}}_{k},\boldsymbol{\beta}_{k})|\leq Cs\epsilon by Lemmas S8S10. Then by Theorem 1 we have the desired conclusion. ∎

S6.4 Proof for Theorem 3

In this subsection we prove Theorem 3, where 𝚺𝐗{\bm{\Sigma}}_{\mathbf{X}} could be different from the identity matrix. We first present a simple lemma, which is a modified version of Lemma 6 in Mai and Zhang (2019).

Lemma S11.

For two vectors 𝐮,𝐯\mathbf{u},\mathbf{v} and a positive definite matrix 𝚺{\bm{\Sigma}}, define ξ𝐈=1cos𝚯(𝐯,𝐮),ξ𝚺=1cos𝚯(𝚺1/2𝐯,𝚺1/2𝐮)\xi_{\mathbf{I}}=1-\cos\boldsymbol{\Theta}(\mathbf{v},\mathbf{u}),\xi_{{\bm{\Sigma}}}=1-\cos\boldsymbol{\Theta}({\bm{\Sigma}}^{1/2}\mathbf{v},{\bm{\Sigma}}^{1/2}\mathbf{u}). We have that

λmin(𝚺)λmax(𝚺)ξ𝚺ξ𝐈λmax(𝚺)λmin(𝚺)ξ𝚺\dfrac{\lambda_{\min}({\bm{\Sigma}})}{\lambda_{\max}({\bm{\Sigma}})}\xi_{{\bm{\Sigma}}}\leq\xi_{\mathbf{I}}\leq\dfrac{\lambda_{\max}({\bm{\Sigma}})}{\lambda_{\min}({\bm{\Sigma}})}\xi_{{\bm{\Sigma}}}
Proof of Lemma S11.

We first show the latter half of the desired inequality. Without loss of generality, we assume that 𝐮T𝚺𝐮=1,𝐯T𝚺𝐯=1\mathbf{u}^{\mathrm{\tiny T}}{\bm{\Sigma}}\mathbf{u}=1,\mathbf{v}^{\mathrm{\tiny T}}{\bm{\Sigma}}\mathbf{v}=1, because we can always normalize 𝐮,𝐯\mathbf{u},\mathbf{v} to satisfy these conditions.

Note that

λmin(𝚺)(𝐮𝐯)T(𝐮𝐯)\displaystyle\lambda_{\min}({\bm{\Sigma}})(\mathbf{u}-\mathbf{v})^{\mathrm{\tiny T}}(\mathbf{u}-\mathbf{v}) \displaystyle\leq (𝐮𝐯)T𝚺(𝐮𝐯)\displaystyle(\mathbf{u}-\mathbf{v})^{\mathrm{\tiny T}}{\bm{\Sigma}}(\mathbf{u}-\mathbf{v})
=\displaystyle= 𝐮T𝚺𝐮2𝐮T𝚺𝐯+𝐯T𝚺𝐯=2ξ𝚺\displaystyle\mathbf{u}^{\mathrm{\tiny T}}{\bm{\Sigma}}\mathbf{u}-2\mathbf{u}^{\mathrm{\tiny T}}{\bm{\Sigma}}\mathbf{v}+\mathbf{v}^{\mathrm{\tiny T}}{\bm{\Sigma}}\mathbf{v}=2\xi_{{\bm{\Sigma}}}
λmax(𝚺)𝐮T𝐮\displaystyle\lambda_{\max}({\bm{\Sigma}})\mathbf{u}^{\mathrm{\tiny T}}\mathbf{u} \displaystyle\geq 𝐮T𝚺𝐮=1\displaystyle\mathbf{u}^{\mathrm{\tiny T}}{\bm{\Sigma}}\mathbf{u}=1
λmax(𝚺)𝐯T𝐯\displaystyle\lambda_{\max}({\bm{\Sigma}})\mathbf{v}^{\mathrm{\tiny T}}\mathbf{v} \displaystyle\geq 𝐯T𝚺𝐯=1\displaystyle\mathbf{v}^{\mathrm{\tiny T}}{\bm{\Sigma}}\mathbf{v}=1

Consequently,

(𝐮𝐯)T(𝐮𝐯)2ξ𝚺λmin(𝚺)\displaystyle(\mathbf{u}-\mathbf{v})^{\mathrm{\tiny T}}(\mathbf{u}-\mathbf{v})\leq\frac{2\xi_{{\bm{\Sigma}}}}{\lambda_{\min}({\bm{\Sigma}})}
𝐮T𝐮1/λmax(𝚺),𝐯T𝐯1/λmax(𝚺)\displaystyle\mathbf{u}^{\mathrm{\tiny T}}\mathbf{u}\geq 1/\lambda_{\max}({\bm{\Sigma}}),\mathbf{v}^{\mathrm{\tiny T}}\mathbf{v}\geq 1/\lambda_{\max}({\bm{\Sigma}})

Now

2λmax(𝚺)ξ𝚺λmin(𝚺)\displaystyle\dfrac{2\lambda_{\max}({\bm{\Sigma}})\xi_{{\bm{\Sigma}}}}{\lambda_{\min}({\bm{\Sigma}})} \displaystyle\geq 2ξ𝚺λmin(𝚺)𝐮T𝐮𝐯T𝐯\displaystyle\dfrac{2\xi_{{\bm{\Sigma}}}}{\lambda_{\min}({\bm{\Sigma}})\sqrt{\mathbf{u}^{\mathrm{\tiny T}}\mathbf{u}}\sqrt{\mathbf{v}^{\mathrm{\tiny T}}\mathbf{v}}}
\displaystyle\geq (𝐮𝐯)T(𝐮𝐯)𝐮T𝐮𝐯T𝐯=𝐮T𝐮𝐯T𝐯+𝐯T𝐯𝐮T𝐮2𝐮T𝐯𝐮T𝐮𝐯T𝐯\displaystyle\dfrac{(\mathbf{u}-\mathbf{v})^{\mathrm{\tiny T}}(\mathbf{u}-\mathbf{v})}{\sqrt{\mathbf{u}^{\mathrm{\tiny T}}\mathbf{u}}\sqrt{\mathbf{v}^{\mathrm{\tiny T}}\mathbf{v}}}=\dfrac{\sqrt{\mathbf{u}^{\mathrm{\tiny T}}\mathbf{u}}}{\sqrt{\mathbf{v}^{\mathrm{\tiny T}}\mathbf{v}}}+\dfrac{\sqrt{\mathbf{v}^{\mathrm{\tiny T}}\mathbf{v}}}{\sqrt{\mathbf{u}^{\mathrm{\tiny T}}\mathbf{u}}}-2\dfrac{\mathbf{u}^{\mathrm{\tiny T}}\mathbf{v}}{\sqrt{\mathbf{u}^{\mathrm{\tiny T}}\mathbf{u}}\sqrt{\mathbf{v}^{\mathrm{\tiny T}}\mathbf{v}}}
\displaystyle\geq 22𝐮T𝐯𝐮T𝐮𝐯T𝐯=2ξ𝐈,\displaystyle 2-2\dfrac{\mathbf{u}^{\mathrm{\tiny T}}\mathbf{v}}{\sqrt{\mathbf{u}^{\mathrm{\tiny T}}\mathbf{u}}\sqrt{\mathbf{v}^{\mathrm{\tiny T}}\mathbf{v}}}=2\xi_{\mathbf{I}},

and we have the second half of the inequality. For the first half of the inequality, define 𝐮=𝚺1/2𝐮,𝐯=𝚺1/2𝐯\mathbf{u}^{*}={\bm{\Sigma}}^{1/2}\mathbf{u},\mathbf{v}^{*}={\bm{\Sigma}}^{1/2}\mathbf{v}. Applying the second half of the inequality to the vectors 𝐮,𝐯\mathbf{u}^{*},\mathbf{v}^{*} and the matrix 𝚺1{\bm{\Sigma}}^{-1} yields the desired conclusion. ∎
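
The following brief numerical check (illustration only; the random positive definite 𝚺{\bm{\Sigma}}, the dimension and the tolerance are arbitrary choices of ours) verifies the two-sided bound of Lemma S11 on randomly drawn vectors.

```python
# Numerically checking Lemma S11: with xi_I = 1 - cos(u, v) and
# xi_Sigma = 1 - cos(Sigma^{1/2} u, Sigma^{1/2} v), we should have
# (lam_min/lam_max) * xi_Sigma <= xi_I <= (lam_max/lam_min) * xi_Sigma.
import numpy as np

rng = np.random.default_rng(1)
p = 5
G = rng.standard_normal((p, p))
Sigma = G @ G.T + p * np.eye(p)                       # a positive definite matrix
w, Q = np.linalg.eigh(Sigma)
Sigma_half = Q @ np.diag(np.sqrt(w)) @ Q.T            # symmetric square root
lam_min, lam_max = w.min(), w.max()

def one_minus_cos(a, b):
    return 1.0 - a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

for _ in range(1000):
    u, v = rng.standard_normal(p), rng.standard_normal(p)
    xi_I = one_minus_cos(u, v)
    xi_S = one_minus_cos(Sigma_half @ u, Sigma_half @ v)
    assert lam_min / lam_max * xi_S - 1e-9 <= xi_I <= lam_max / lam_min * xi_S + 1e-9
print("Lemma S11 bounds held on all random draws")
```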

We now show the following lemma parallel to Lemma S7. Recall that ηk=|sin𝚯(𝜷^k,𝜷k)|\eta_{k}=|\sin\boldsymbol{\Theta}(\widehat{\boldsymbol{\beta}}_{k},\boldsymbol{\beta}_{k})|.

Lemma S12.

If 𝛃^k0s,𝛃k0s\|\widehat{\boldsymbol{\beta}}_{k}\|_{0}\leq s,\|\boldsymbol{\beta}_{k}\|_{0}\leq s, we have that

|λ^kλk|s(2ηk+1)ϵ+2λmax(𝚺𝐗)λmin(𝚺𝐗)(λ1+λk)ηk2.|\hat{\lambda}_{k}-\lambda_{k}|\leq s(2\eta_{k}+1)\epsilon+2\dfrac{\lambda_{\max}({\bm{\Sigma}}_{\mathbf{X}})}{\lambda_{\min}({\bm{\Sigma}}_{\mathbf{X}})}(\lambda_{1}+\lambda_{k})\eta_{k}^{2}. (S56)
Proof of Lemma S12.

Note that

|λ^kλk|=|𝐌^,𝜷^k𝜷^kT𝐌,𝜷k𝜷kT|\displaystyle|\hat{\lambda}_{k}-\lambda_{k}|=|\langle\widehat{\mathbf{M}},\hat{\boldsymbol{\beta}}_{k}\hat{\boldsymbol{\beta}}_{k}^{\mathrm{\tiny T}}\rangle-\langle\mathbf{M},\boldsymbol{\beta}_{k}\boldsymbol{\beta}_{k}^{\mathrm{\tiny T}}\rangle| (S57)
\displaystyle\leq |𝐌^𝐌,𝜷^k𝜷^kT𝜷k𝜷kT|+|𝐌^𝐌,𝜷k𝜷kT|+|𝐌,𝜷^k𝜷^kT𝜷k𝜷kT|\displaystyle|\langle\widehat{\mathbf{M}}-\mathbf{M},\hat{\boldsymbol{\beta}}_{k}\hat{\boldsymbol{\beta}}_{k}^{\mathrm{\tiny T}}-\boldsymbol{\beta}_{k}\boldsymbol{\beta}_{k}^{\mathrm{\tiny T}}\rangle|+|\langle\widehat{\mathbf{M}}-\mathbf{M},\boldsymbol{\beta}_{k}\boldsymbol{\beta}_{k}^{\mathrm{\tiny T}}\rangle|+|\langle\mathbf{M},\hat{\boldsymbol{\beta}}_{k}\hat{\boldsymbol{\beta}}_{k}^{\mathrm{\tiny T}}-\boldsymbol{\beta}_{k}\boldsymbol{\beta}_{k}^{\mathrm{\tiny T}}\rangle| (S58)
\displaystyle\equiv L1+L2+L3\displaystyle L_{1}+L_{2}+L_{3} (S59)

By Lemma S5,

L1𝐌^𝐌maxvec(𝜷^k𝜷^kT𝜷k𝜷kT)12sηk𝐌^𝐌max2sηkϵ.\displaystyle L_{1}\leq\|\widehat{\mathbf{M}}-\mathbf{M}\|_{\max}\|\mathrm{vec}(\hat{\boldsymbol{\beta}}_{k}\hat{\boldsymbol{\beta}}_{k}^{\mathrm{\tiny T}}-\boldsymbol{\beta}_{k}\boldsymbol{\beta}_{k}^{\mathrm{\tiny T}})\|_{1}\leq 2s\eta_{k}\|\widehat{\mathbf{M}}-\mathbf{M}\|_{\max}\leq 2s\eta_{k}\epsilon. (S60)

By Lemma S6,

L2𝐌^𝐌maxvec(𝜷k𝜷kT)1s𝐌^𝐌maxsϵ.\displaystyle L_{2}\leq\|\widehat{\mathbf{M}}-\mathbf{M}\|_{\max}\|\mathrm{vec}(\boldsymbol{\beta}_{k}\boldsymbol{\beta}_{k}^{\mathrm{\tiny T}})\|_{1}\leq s\|\widehat{\mathbf{M}}-\mathbf{M}\|_{\max}\leq s\epsilon. (S61)

For L3L_{3}, note that 𝐌=𝚺𝐗(j=1pλj𝜷j𝜷jT)𝚺𝐗\mathbf{M}={\bm{\Sigma}}_{\mathbf{X}}\left(\sum_{j=1}^{p}\lambda_{j}\boldsymbol{\beta}_{j}\boldsymbol{\beta}_{j}^{\mathrm{\tiny T}}\right){\bm{\Sigma}}_{\mathbf{X}} by the proof of Lemma 1, which implies that

𝐌,𝜷^k𝜷^kT=j=1pλj(𝜷^kT𝚺𝐗𝜷j)2=j=1pλjcos2𝚯(𝚺𝐗1/2𝜷^k,𝚺𝐗1/2𝜷j)\langle\mathbf{M},\widehat{\boldsymbol{\beta}}_{k}\widehat{\boldsymbol{\beta}}_{k}^{\mathrm{\tiny T}}\rangle=\sum_{j=1}^{p}\lambda_{j}(\widehat{\boldsymbol{\beta}}_{k}^{\mathrm{\tiny T}}{\bm{\Sigma}}_{\mathbf{X}}\boldsymbol{\beta}_{j})^{2}=\sum_{j=1}^{p}\lambda_{j}\cos^{2}\boldsymbol{\Theta}({\bm{\Sigma}}_{\mathbf{X}}^{1/2}\hat{\boldsymbol{\beta}}_{k},{\bm{\Sigma}}_{\mathbf{X}}^{1/2}\boldsymbol{\beta}_{j})

Also note that j=1pcos2𝚯(𝚺𝐗1/2𝜷^k,𝚺𝐗1/2𝜷j)=1\sum_{j=1}^{p}\cos^{2}\boldsymbol{\Theta}({\bm{\Sigma}}_{\mathbf{X}}^{1/2}\hat{\boldsymbol{\beta}}_{k},{\bm{\Sigma}}_{\mathbf{X}}^{1/2}\boldsymbol{\beta}_{j})=1. Without loss of generality, assume that cos𝚯(𝜷k,𝜷^k)>0\cos\boldsymbol{\Theta}(\boldsymbol{\beta}_{k},\hat{\boldsymbol{\beta}}_{k})>0. We have that

L3\displaystyle L_{3} =\displaystyle= |j=1pλjcos2𝚯(𝚺𝐗1/2𝜷j,𝚺𝐗1/2𝜷^k)λk|\displaystyle|\sum_{j=1}^{p}\lambda_{j}\cos^{2}\boldsymbol{\Theta}({\bm{\Sigma}}_{\mathbf{X}}^{1/2}\boldsymbol{\beta}_{j},{\bm{\Sigma}}_{\mathbf{X}}^{1/2}\hat{\boldsymbol{\beta}}_{k})-\lambda_{k}|
\displaystyle\leq jkλjcos2𝚯(𝚺𝐗1/2𝜷^k,𝚺𝐗1/2𝜷j)+λk|cos2𝚯(𝚺𝐗1/2𝜷^k,𝚺𝐗1/2𝜷k)1|\displaystyle\sum_{j\neq k}\lambda_{j}\cos^{2}\boldsymbol{\Theta}({\bm{\Sigma}}_{\mathbf{X}}^{1/2}\widehat{\boldsymbol{\beta}}_{k},{\bm{\Sigma}}_{\mathbf{X}}^{1/2}\boldsymbol{\beta}_{j})+\lambda_{k}|\cos^{2}\boldsymbol{\Theta}({\bm{\Sigma}}_{\mathbf{X}}^{1/2}\widehat{\boldsymbol{\beta}}_{k},{\bm{\Sigma}}_{\mathbf{X}}^{1/2}\boldsymbol{\beta}_{k})-1|
\displaystyle\leq λ1jkcos2𝚯(𝚺𝐗1/2𝜷^k,𝚺𝐗1/2𝜷j)+λk(1cos2𝚯(𝚺𝐗1/2𝜷^k,𝚺𝐗1/2𝜷k))\displaystyle\lambda_{1}\sum_{j\neq k}\cos^{2}\boldsymbol{\Theta}({\bm{\Sigma}}_{\mathbf{X}}^{1/2}\widehat{\boldsymbol{\beta}}_{k},{\bm{\Sigma}}_{\mathbf{X}}^{1/2}\boldsymbol{\beta}_{j})+\lambda_{k}(1-\cos^{2}\boldsymbol{\Theta}({\bm{\Sigma}}_{\mathbf{X}}^{1/2}\widehat{\boldsymbol{\beta}}_{k},{\bm{\Sigma}}_{\mathbf{X}}^{1/2}\boldsymbol{\beta}_{k}))
\displaystyle\leq (λ1+λk)(1cos2𝚯(𝚺𝐗1/2𝜷k,𝚺𝐗1/2𝜷^k))\displaystyle(\lambda_{1}+\lambda_{k})(1-\cos^{2}\boldsymbol{\Theta}({\bm{\Sigma}}_{\mathbf{X}}^{1/2}\boldsymbol{\beta}_{k},{\bm{\Sigma}}_{\mathbf{X}}^{1/2}\hat{\boldsymbol{\beta}}_{k}))
=\displaystyle= (λ1+λk)(1cos𝚯(𝚺𝐗1/2𝜷k,𝚺𝐗1/2𝜷^k))(1+cos𝚯(𝚺𝐗1/2𝜷k,𝚺𝐗1/2𝜷^k))\displaystyle(\lambda_{1}+\lambda_{k})(1-\cos\boldsymbol{\Theta}({\bm{\Sigma}}_{\mathbf{X}}^{1/2}\boldsymbol{\beta}_{k},{\bm{\Sigma}}_{\mathbf{X}}^{1/2}\hat{\boldsymbol{\beta}}_{k}))(1+\cos\boldsymbol{\Theta}({\bm{\Sigma}}_{\mathbf{X}}^{1/2}\boldsymbol{\beta}_{k},{\bm{\Sigma}}_{\mathbf{X}}^{1/2}\hat{\boldsymbol{\beta}}_{k}))
\displaystyle\leq 2(λ1+λk)(1cos𝚯(𝚺𝐗1/2𝜷k,𝚺𝐗1/2𝜷^k))\displaystyle 2(\lambda_{1}+\lambda_{k})(1-\cos\boldsymbol{\Theta}({\bm{\Sigma}}_{\mathbf{X}}^{1/2}\boldsymbol{\beta}_{k},{\bm{\Sigma}}_{\mathbf{X}}^{1/2}\hat{\boldsymbol{\beta}}_{k}))
\displaystyle\leq 2λmax(𝚺𝐗)λmin(𝚺𝐗)(λ1+λk)(1cos𝚯(𝜷k,𝜷^k)),\displaystyle 2\dfrac{\lambda_{\max}({\bm{\Sigma}}_{\mathbf{X}})}{\lambda_{\min}({\bm{\Sigma}}_{\mathbf{X}})}(\lambda_{1}+\lambda_{k})(1-\cos\boldsymbol{\Theta}(\boldsymbol{\beta}_{k},\hat{\boldsymbol{\beta}}_{k})),

where the last inequality follows from Lemma S11. Further note that 1cos𝚯(𝜷k,𝜷^k)=sin2𝚯(𝜷k,𝜷^k)1+cos𝚯(𝜷k,𝜷^k)ηk21-\cos\boldsymbol{\Theta}(\boldsymbol{\beta}_{k},\hat{\boldsymbol{\beta}}_{k})=\dfrac{\sin^{2}\boldsymbol{\Theta}(\boldsymbol{\beta}_{k},\hat{\boldsymbol{\beta}}_{k})}{1+\cos\boldsymbol{\Theta}(\boldsymbol{\beta}_{k},\hat{\boldsymbol{\beta}}_{k})}\leq\eta_{k}^{2} and we have the desired conclusion. ∎

We consider the event that \|\widehat{\mathbf{M}}-\mathbf{M}\|_{\max}\leq\epsilon and \|\widehat{{\bm{\Sigma}}}_{\mathbf{X}}-{\bm{\Sigma}}_{\mathbf{X}}\|_{\max}\leq\epsilon, where \epsilon is small enough that (i) \sqrt{2}s\epsilon<\min\{1/2,\lambda_{\min}({\bm{\Sigma}}_{\mathbf{X}})\}; (ii) \frac{\sqrt{2}s\epsilon}{\lambda_{\min}({\bm{\Sigma}}_{\mathbf{X}})}+\frac{\sqrt{2}s\epsilon}{\lambda_{\min}({\bm{\Sigma}}_{\mathbf{X}})\lambda_{1}}<a^{*}; and (iii) \frac{\sqrt{2}s\epsilon}{\lambda_{\min}({\bm{\Sigma}}_{\mathbf{X}})}<\frac{\Delta}{2(1+\lambda_{1}^{2})}. Here \Delta is defined as in Condition (C2) and a^{*}=\min\{\frac{1}{2},\frac{\Delta}{\lambda_{1}+\lambda_{2}},\frac{\lambda_{\min}({\bm{\Sigma}}_{\mathbf{X}})}{2}\}. On this event, \rho(\widehat{{\bm{\Sigma}}}_{\mathbf{X}}-{\bm{\Sigma}}_{\mathbf{X}},s)\leq s\epsilon. Also note that \tilde{\boldsymbol{\beta}}_{k}=\widehat{\boldsymbol{\beta}}_{k}/\|\widehat{\boldsymbol{\beta}}_{k}\|_{2}. In the RIFLE algorithm, the initial value \boldsymbol{\beta}_{k}^{0} is chosen to be sufficiently close to \boldsymbol{\beta}_{k}, and the step size \eta satisfies \eta\leq\frac{\lambda_{\min}({\bm{\Sigma}}_{\mathbf{X}})}{2} and

\sqrt{1+2\{(d/s^{\prime})^{1/2}+d/s^{\prime}\}}\sqrt{1-\frac{\eta\lambda_{\min}({\bm{\Sigma}}_{\mathbf{X}})\Delta}{16\lambda_{K}\{\kappa({\bm{\Sigma}}_{\mathbf{X}})+1\}}}<1. (S62)

Without loss of generality, in what follows we assume that \cos\boldsymbol{\Theta}(\widehat{\boldsymbol{\beta}}_{j},\boldsymbol{\beta}_{j})>0 for each j, because otherwise we can always replace \widehat{\boldsymbol{\beta}}_{j} with -\widehat{\boldsymbol{\beta}}_{j}, which spans the same subspace.
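
To fix ideas, the following is a minimal numerical sketch of the sequential procedure analyzed in the remainder of this section: a simplified RIFLE-style truncated gradient update for the leading generalized eigenvector of (\widehat{\mathbf{M}}_{k},\widehat{{\bm{\Sigma}}}_{\mathbf{X}}), the rescaling \widehat{\boldsymbol{\beta}}_{k}=\tilde{\boldsymbol{\beta}}_{k}/(\tilde{\boldsymbol{\beta}}_{k}^{\mathrm{\tiny T}}\widehat{{\bm{\Sigma}}}_{\mathbf{X}}\tilde{\boldsymbol{\beta}}_{k})^{1/2} (consistent with the relation \tilde{\boldsymbol{\beta}}_{k}=\widehat{\boldsymbol{\beta}}_{k}/\|\widehat{\boldsymbol{\beta}}_{k}\|_{2} noted above), and a deflation of the form \widehat{\mathbf{M}}_{k+1}=\widehat{\mathbf{M}}_{k}-\hat{\lambda}_{k}\widehat{{\bm{\Sigma}}}_{\mathbf{X}}\widehat{\boldsymbol{\beta}}_{k}\widehat{\boldsymbol{\beta}}_{k}^{\mathrm{\tiny T}}\widehat{{\bm{\Sigma}}}_{\mathbf{X}}, which is the form that the matrices \mathbf{N}_{l} in the proof of Lemma S14 presuppose. The function names, the simplified update, and the use of NumPy are our own illustrative assumptions and do not reproduce the exact algorithm of the main text.

import numpy as np

def truncate(v, s):
    """Keep the s largest-magnitude entries of v and zero out the rest."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-s:]
    out[idx] = v[idx]
    return out

def rifle_step(beta, M, Sigma, eta, s):
    """One truncated ascent step in the direction M beta - rho * Sigma beta
    (proportional to the gradient of the generalized Rayleigh quotient),
    followed by renormalization to unit Euclidean norm."""
    rho = (beta @ M @ beta) / (beta @ Sigma @ beta)
    beta = truncate(beta + eta * (M @ beta - rho * (Sigma @ beta)), s)
    return beta / np.linalg.norm(beta)

def sequential_estimate(M_hat, Sigma_hat, inits, s, eta=0.05, n_iter=500):
    """Sequentially estimate (lambda_k, beta_k) for k = 1, ..., K = len(inits).
    Each inits[k] plays the role of beta_k^0 and should be close to beta_k."""
    M_k = M_hat.copy()
    betas, lams = [], []
    for beta0 in inits:
        beta_tilde = beta0 / np.linalg.norm(beta0)
        for _ in range(n_iter):
            beta_tilde = rifle_step(beta_tilde, M_k, Sigma_hat, eta, s)
        # rescale so that beta_hat' Sigma_hat beta_hat = 1
        beta_hat = beta_tilde / np.sqrt(beta_tilde @ Sigma_hat @ beta_tilde)
        lam_hat = beta_hat @ M_k @ beta_hat
        # deflation: subtract lam_hat * Sigma_hat beta_hat beta_hat' Sigma_hat (cf. N_l below)
        M_k = M_k - lam_hat * np.outer(Sigma_hat @ beta_hat, Sigma_hat @ beta_hat)
        betas.append(beta_hat)
        lams.append(lam_hat)
    return np.column_stack(betas), np.array(lams)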

Lemma S13.

For the first direction 𝛃^1\widehat{\boldsymbol{\beta}}_{1}, we have that |sinΘ(𝛃^1,𝛃1)|Csϵ|\sin\Theta(\widehat{\boldsymbol{\beta}}_{1},\boldsymbol{\beta}_{1})|\leq Cs\epsilon and |λ^1λ1|Csϵ|\hat{\lambda}_{1}-\lambda_{1}|\leq Cs\epsilon.

Proof of Lemma S13.

It is easy to see that

\rho(\widehat{\mathbf{M}}-\mathbf{M},s)=\sup_{\|\mathbf{u}\|_{2}=1,\|\mathbf{u}\|_{0}\leq s}|\mathbf{u}^{\mathrm{\tiny T}}(\widehat{\mathbf{M}}-\mathbf{M})\mathbf{u}|\leq\sup_{\|\mathbf{u}\|_{2}=1,\|\mathbf{u}\|_{0}\leq s}\|\mathbf{u}\|_{1}^{2}\cdot\|\widehat{\mathbf{M}}-\mathbf{M}\|_{\max}\leq s\epsilon, (S63)

where the last inequality uses \|\mathbf{u}\|_{1}\leq\sqrt{s}\|\mathbf{u}\|_{2}=\sqrt{s} for any s-sparse unit vector \mathbf{u}.

Similarly, ρ(𝚺^𝐗𝚺𝐗,s)sϵ\rho(\widehat{{\bm{\Sigma}}}_{\mathbf{X}}-{\bm{\Sigma}}_{\mathbf{X}},s)\leq s\epsilon. Denote δ1(s)=ρ2(𝐌^𝐌,s)+ρ2(𝚺^𝐗𝚺𝐗,s)\delta_{1}(s)=\sqrt{\rho^{2}(\widehat{\mathbf{M}}-\mathbf{M},s)+\rho^{2}(\widehat{{\bm{\Sigma}}}_{\mathbf{X}}-{\bm{\Sigma}}_{\mathbf{X}},s)}. It follows that δ1(s)2sϵ\delta_{1}(s)\leq\sqrt{2}s\epsilon. Under our assumptions about ϵ\epsilon, we have that |sinΘ(𝜷^1,𝜷1)|Csϵ|\sin\Theta(\widehat{\boldsymbol{\beta}}_{1},\boldsymbol{\beta}_{1})|\leq Cs\epsilon by Lemma S3. Lemma S12 further implies that |λ^1λ1|Csϵ|\hat{\lambda}_{1}-\lambda_{1}|\leq Cs\epsilon. ∎
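
The elementary inequality behind (S63), namely |\mathbf{u}^{\mathrm{\tiny T}}\mathbf{A}\mathbf{u}|\leq\|\mathbf{u}\|_{1}^{2}\|\mathbf{A}\|_{\max}\leq s\|\mathbf{A}\|_{\max} for any symmetric \mathbf{A} and s-sparse unit vector \mathbf{u}, is easy to confirm numerically. The short check below (NumPy, purely illustrative and not part of the paper's code) does so for a random symmetric error matrix.

import numpy as np

rng = np.random.default_rng(1)
p, s = 50, 5
A = rng.standard_normal((p, p)); A = (A + A.T) / 2.0   # a generic symmetric "error" matrix
for _ in range(2000):
    u = np.zeros(p)
    support = rng.choice(p, size=s, replace=False)
    u[support] = rng.standard_normal(s)
    u /= np.linalg.norm(u)                               # s-sparse unit vector
    quad = abs(u @ A @ u)
    assert quad <= np.sum(np.abs(u)) ** 2 * np.max(np.abs(A)) + 1e-12
    assert quad <= s * np.max(np.abs(A)) + 1e-12          # since ||u||_1^2 <= s ||u||_2^2 = s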

Lemma S14.

If |\sin\Theta(\widehat{\boldsymbol{\beta}}_{j},\boldsymbol{\beta}_{j})|\leq Cs\epsilon for all j\leq k and \epsilon is sufficiently small, then \delta(s)\leq Cs\epsilon.

Proof of Lemma S14.

It suffices to show that ρ(𝐌^k+1𝐌k+1,s)Csϵ\rho(\widehat{\mathbf{M}}_{k+1}-\mathbf{M}_{k+1},s)\leq Cs\epsilon. Define 𝐍l=𝚺^𝐗λ^l𝜷^l𝜷^lT𝚺^𝐗𝚺𝐗λl𝜷l𝜷lT𝚺𝐗\mathbf{N}_{l}=\widehat{{\bm{\Sigma}}}_{\mathbf{X}}\hat{\lambda}_{l}\hat{\boldsymbol{\beta}}_{l}\hat{\boldsymbol{\beta}}_{l}^{\mathrm{\tiny T}}\widehat{{\bm{\Sigma}}}_{\mathbf{X}}-{\bm{\Sigma}}_{\mathbf{X}}\lambda_{l}\boldsymbol{\beta}_{l}\boldsymbol{\beta}_{l}^{\mathrm{\tiny T}}{\bm{\Sigma}}_{\mathbf{X}}. Then

𝐌^k+1𝐌k+1=(𝐌^𝐌)lk𝐍l.\widehat{\mathbf{M}}_{k+1}-\mathbf{M}_{k+1}=(\widehat{\mathbf{M}}-\mathbf{M})-\sum_{l\leq k}\mathbf{N}_{l}. (S64)
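
Identity (S64) is immediate once the deflated matrices are written out. Writing \widehat{\mathbf{M}}_{k+1}=\widehat{\mathbf{M}}-\sum_{l\leq k}\hat{\lambda}_{l}\widehat{{\bm{\Sigma}}}_{\mathbf{X}}\hat{\boldsymbol{\beta}}_{l}\hat{\boldsymbol{\beta}}_{l}^{\mathrm{\tiny T}}\widehat{{\bm{\Sigma}}}_{\mathbf{X}} and \mathbf{M}_{k+1}=\mathbf{M}-\sum_{l\leq k}\lambda_{l}{\bm{\Sigma}}_{\mathbf{X}}\boldsymbol{\beta}_{l}\boldsymbol{\beta}_{l}^{\mathrm{\tiny T}}{\bm{\Sigma}}_{\mathbf{X}} (the forms that the difference matrices \mathbf{N}_{l} presuppose) and subtracting termwise,

\widehat{\mathbf{M}}_{k+1}-\mathbf{M}_{k+1}=(\widehat{\mathbf{M}}-\mathbf{M})-\sum_{l\leq k}\left(\widehat{{\bm{\Sigma}}}_{\mathbf{X}}\hat{\lambda}_{l}\hat{\boldsymbol{\beta}}_{l}\hat{\boldsymbol{\beta}}_{l}^{\mathrm{\tiny T}}\widehat{{\bm{\Sigma}}}_{\mathbf{X}}-{\bm{\Sigma}}_{\mathbf{X}}\lambda_{l}\boldsymbol{\beta}_{l}\boldsymbol{\beta}_{l}^{\mathrm{\tiny T}}{\bm{\Sigma}}_{\mathbf{X}}\right)=(\widehat{\mathbf{M}}-\mathbf{M})-\sum_{l\leq k}\mathbf{N}_{l}.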

It follows that

ρ(𝐌^k+1𝐌k+1,s)ρ(𝐌^𝐌,s)+lkρ(𝐍l,s).\rho(\widehat{\mathbf{M}}_{k+1}-\mathbf{M}_{k+1},s)\leq\rho(\widehat{\mathbf{M}}-\mathbf{M},s)+\sum_{l\leq k}\rho(\mathbf{N}_{l},s). (S65)

By the proof of Lemma S13, \rho(\widehat{\mathbf{M}}-\mathbf{M},s)\leq s\epsilon.

For any unit vector \mathbf{u} with \|\mathbf{u}\|_{0}\leq s, we have

\mathbf{u}^{\mathrm{\tiny T}}\mathbf{N}_{l}\mathbf{u} = \hat{\lambda}_{l}(\mathbf{u}^{\mathrm{\tiny T}}\widehat{{\bm{\Sigma}}}_{\mathbf{X}}\hat{\boldsymbol{\beta}}_{l})^{2}-{\lambda}_{l}(\mathbf{u}^{\mathrm{\tiny T}}{{\bm{\Sigma}}}_{\mathbf{X}}{\boldsymbol{\beta}}_{l})^{2} (S66)
= (\hat{\lambda}_{l}-\lambda_{l})(\mathbf{u}^{\mathrm{\tiny T}}\widehat{{\bm{\Sigma}}}_{\mathbf{X}}\widehat{\boldsymbol{\beta}}_{l})^{2}+\lambda_{l}\{(\mathbf{u}^{\mathrm{\tiny T}}\widehat{{\bm{\Sigma}}}_{\mathbf{X}}\widehat{\boldsymbol{\beta}}_{l})^{2}-(\mathbf{u}^{\mathrm{\tiny T}}{\bm{\Sigma}}_{\mathbf{X}}\boldsymbol{\beta}_{l})^{2}\}\equiv L_{1}+L_{2}. (S67)

By Lemma S12 and the assumption that |\sin\Theta(\widehat{\boldsymbol{\beta}}_{j},\boldsymbol{\beta}_{j})|\leq Cs\epsilon for all j\leq k, we have that |\hat{\lambda}_{l}-\lambda_{l}|\leq Cs\epsilon. Also,

|\mathbf{u}^{\mathrm{\tiny T}}\widehat{{\bm{\Sigma}}}_{\mathbf{X}}\widehat{\boldsymbol{\beta}}_{l}| \leq |\mathbf{u}^{\mathrm{\tiny T}}{\bm{\Sigma}}_{\mathbf{X}}\widehat{\boldsymbol{\beta}}_{l}|+\|\mathbf{u}\|_{1}\cdot\|\widehat{\boldsymbol{\beta}}_{l}\|_{1}\|\widehat{{\bm{\Sigma}}}_{\mathbf{X}}-{\bm{\Sigma}}_{\mathbf{X}}\|_{\max}\leq|\mathbf{u}^{\mathrm{\tiny T}}{\bm{\Sigma}}_{\mathbf{X}}\widehat{\boldsymbol{\beta}}_{l}|+s\epsilon (S68)
\leq \sqrt{(\mathbf{u}^{\mathrm{\tiny T}}{\bm{\Sigma}}_{\mathbf{X}}\mathbf{u})\cdot(\widehat{\boldsymbol{\beta}}_{l}^{\mathrm{\tiny T}}{\bm{\Sigma}}_{\mathbf{X}}\widehat{\boldsymbol{\beta}}_{l})}+s\epsilon\leq\lambda_{\max}({\bm{\Sigma}}_{\mathbf{X}})+s\epsilon. (S69)

It follows that L_{1}\leq Cs\epsilon. For L_{2}, note that (\mathbf{u}^{\mathrm{\tiny T}}\widehat{{\bm{\Sigma}}}_{\mathbf{X}}\widehat{\boldsymbol{\beta}}_{l})^{2}-(\mathbf{u}^{\mathrm{\tiny T}}{\bm{\Sigma}}_{\mathbf{X}}\boldsymbol{\beta}_{l})^{2}=(\mathbf{u}^{\mathrm{\tiny T}}\widehat{{\bm{\Sigma}}}_{\mathbf{X}}\widehat{\boldsymbol{\beta}}_{l}+\mathbf{u}^{\mathrm{\tiny T}}{\bm{\Sigma}}_{\mathbf{X}}\boldsymbol{\beta}_{l})(\mathbf{u}^{\mathrm{\tiny T}}\widehat{{\bm{\Sigma}}}_{\mathbf{X}}\widehat{\boldsymbol{\beta}}_{l}-\mathbf{u}^{\mathrm{\tiny T}}{\bm{\Sigma}}_{\mathbf{X}}\boldsymbol{\beta}_{l}). On the one hand,

|\mathbf{u}^{\mathrm{\tiny T}}\widehat{{\bm{\Sigma}}}_{\mathbf{X}}\widehat{\boldsymbol{\beta}}_{l}+\mathbf{u}^{\mathrm{\tiny T}}{\bm{\Sigma}}_{\mathbf{X}}\boldsymbol{\beta}_{l}|\leq|\mathbf{u}^{\mathrm{\tiny T}}\widehat{\bm{\Sigma}}_{\mathbf{X}}\widehat{\boldsymbol{\beta}}_{l}|+|\mathbf{u}^{\mathrm{\tiny T}}{\bm{\Sigma}}_{\mathbf{X}}\boldsymbol{\beta}_{l}| (S70)
\leq \sqrt{\mathbf{u}^{\mathrm{\tiny T}}\widehat{{\bm{\Sigma}}}_{\mathbf{X}}\mathbf{u}\cdot\widehat{\boldsymbol{\beta}}_{l}^{\mathrm{\tiny T}}\widehat{{\bm{\Sigma}}}_{\mathbf{X}}\widehat{\boldsymbol{\beta}}_{l}}+\sqrt{\mathbf{u}^{\mathrm{\tiny T}}{\bm{\Sigma}}_{\mathbf{X}}\mathbf{u}\cdot\boldsymbol{\beta}_{l}^{\mathrm{\tiny T}}{\bm{\Sigma}}_{\mathbf{X}}\boldsymbol{\beta}_{l}} (S71)
= \sqrt{\mathbf{u}^{\mathrm{\tiny T}}\widehat{{\bm{\Sigma}}}_{\mathbf{X}}\mathbf{u}}+\sqrt{\mathbf{u}^{\mathrm{\tiny T}}{\bm{\Sigma}}_{\mathbf{X}}\mathbf{u}}=\sqrt{\mathbf{u}^{\mathrm{\tiny T}}{\bm{\Sigma}}_{\mathbf{X}}\mathbf{u}+\mathbf{u}^{\mathrm{\tiny T}}(\widehat{{\bm{\Sigma}}}_{\mathbf{X}}-{\bm{\Sigma}}_{\mathbf{X}})\mathbf{u}}+\sqrt{\mathbf{u}^{\mathrm{\tiny T}}{\bm{\Sigma}}_{\mathbf{X}}\mathbf{u}} (S72)
\leq \sqrt{\lambda_{\max}({\bm{\Sigma}}_{\mathbf{X}})+s\epsilon}+\sqrt{\lambda_{\max}({\bm{\Sigma}}_{\mathbf{X}})}. (S73)

On the other hand,

|𝐮T𝚺^𝐗𝜷^l𝐮T𝚺𝐗𝜷l|\displaystyle|\mathbf{u}^{\mathrm{\tiny T}}\widehat{{\bm{\Sigma}}}_{\mathbf{X}}\hat{\boldsymbol{\beta}}_{l}-\mathbf{u}^{\mathrm{\tiny T}}{\bm{\Sigma}}_{\mathbf{X}}\boldsymbol{\beta}_{l}| \displaystyle\leq |𝐮T(𝚺^𝐗𝚺𝐗)𝜷^l|+|𝐮T𝚺𝐗(𝜷^l𝜷l)|\displaystyle|\mathbf{u}^{\mathrm{\tiny T}}(\widehat{{\bm{\Sigma}}}_{\mathbf{X}}-{\bm{\Sigma}}_{\mathbf{X}})\hat{\boldsymbol{\beta}}_{l}|+|\mathbf{u}^{\mathrm{\tiny T}}{\bm{\Sigma}}_{\mathbf{X}}(\hat{\boldsymbol{\beta}}_{l}-\boldsymbol{\beta}_{l})| (S74)
\displaystyle\leq sϵ+|𝐮T𝚺𝐗(𝜷^l𝜷l)|\displaystyle s\epsilon+|\mathbf{u}^{\mathrm{\tiny T}}{\bm{\Sigma}}_{\mathbf{X}}(\hat{\boldsymbol{\beta}}_{l}-\boldsymbol{\beta}_{l})| (S75)
\displaystyle\equiv sϵ+L3.\displaystyle s\epsilon+L_{3}. (S76)

Further note that

L3\displaystyle L_{3} \displaystyle\leq 𝐮T𝚺𝐗𝐮(𝜷^l𝜷l)T𝚺𝐗(𝜷^l𝜷l)λmax(𝚺𝐗)𝜷^l𝜷l2\displaystyle\sqrt{\mathbf{u}^{\mathrm{\tiny T}}{\bm{\Sigma}}_{\mathbf{X}}\mathbf{u}\cdot(\hat{\boldsymbol{\beta}}_{l}-\boldsymbol{\beta}_{l})^{\mathrm{\tiny T}}{\bm{\Sigma}}_{\mathbf{X}}(\hat{\boldsymbol{\beta}}_{l}-\boldsymbol{\beta}_{l})}\leq\lambda_{\max}({\bm{\Sigma}}_{\mathbf{X}})\cdot\|\hat{\boldsymbol{\beta}}_{l}-\boldsymbol{\beta}_{l}\|_{2} (S77)
=\displaystyle= λmax(𝚺𝐗)1𝜷~lT𝚺^𝐗𝜷~l𝜷~l𝜷~T𝚺^𝐗𝜷~l𝜷l2\displaystyle\lambda_{\max}({\bm{\Sigma}}_{\mathbf{X}})\cdot\dfrac{1}{\sqrt{\tilde{\boldsymbol{\beta}}_{l}^{\mathrm{\tiny T}}\widehat{{\bm{\Sigma}}}_{\mathbf{X}}\tilde{\boldsymbol{\beta}}_{l}}}\|\tilde{\boldsymbol{\beta}}_{l}-\sqrt{\tilde{\boldsymbol{\beta}}^{\mathrm{\tiny T}}\widehat{{\bm{\Sigma}}}_{\mathbf{X}}\tilde{\boldsymbol{\beta}}_{l}}\cdot\boldsymbol{\beta}_{l}\|_{2} (S78)
\displaystyle\leq λmax(𝚺𝐗)1𝜷~lT𝚺^𝐗𝜷~l{𝜷~l𝜷l2+|1𝜷l2𝜷~lT𝚺^𝐗𝜷~l|𝜷l2},\displaystyle\lambda_{\max}({\bm{\Sigma}}_{\mathbf{X}})\cdot\dfrac{1}{\sqrt{\tilde{\boldsymbol{\beta}}_{l}^{\mathrm{\tiny T}}\widehat{{\bm{\Sigma}}}_{\mathbf{X}}\tilde{\boldsymbol{\beta}}_{l}}}\{\|\tilde{\boldsymbol{\beta}}_{l}-\boldsymbol{\beta}_{l}^{*}\|_{2}+|\dfrac{1}{\|\boldsymbol{\beta}_{l}\|_{2}}-\sqrt{\tilde{\boldsymbol{\beta}}_{l}^{\mathrm{\tiny T}}\widehat{{\bm{\Sigma}}}_{\mathbf{X}}\tilde{\boldsymbol{\beta}}_{l}}|\cdot\|\boldsymbol{\beta}_{l}\|_{2}\}, (S79)

where \boldsymbol{\beta}_{l}^{*}=\boldsymbol{\beta}_{l}/\|\boldsymbol{\beta}_{l}\|_{2}. By our assumption, \|\tilde{\boldsymbol{\beta}}_{l}-\boldsymbol{\beta}_{l}^{*}\|_{2}\leq\sqrt{2}|\sin\Theta(\tilde{\boldsymbol{\beta}}_{l},\boldsymbol{\beta}_{l}^{*})|=\sqrt{2}|\sin\Theta(\widehat{\boldsymbol{\beta}}_{l},\boldsymbol{\beta}_{l})|\leq Cs\epsilon. For the second term, since (\boldsymbol{\beta}^{*}_{l})^{\mathrm{\tiny T}}{\bm{\Sigma}}_{\mathbf{X}}\boldsymbol{\beta}_{l}^{*}=\boldsymbol{\beta}_{l}^{\mathrm{\tiny T}}{\bm{\Sigma}}_{\mathbf{X}}\boldsymbol{\beta}_{l}/\|\boldsymbol{\beta}_{l}\|_{2}^{2}=1/\|\boldsymbol{\beta}_{l}\|_{2}^{2}, we have \|\boldsymbol{\beta}_{l}\|_{2}=1/\sqrt{(\boldsymbol{\beta}^{*}_{l})^{\mathrm{\tiny T}}{\bm{\Sigma}}_{\mathbf{X}}\boldsymbol{\beta}_{l}^{*}}. It follows that

|1𝜷l2𝜷~lT𝚺^𝐗𝜷~l|=|(𝜷l)T𝚺𝐗𝜷l𝜷~lT𝚺^𝐗𝜷~l|\displaystyle|\dfrac{1}{\|\boldsymbol{\beta}_{l}\|_{2}}-\sqrt{\tilde{\boldsymbol{\beta}}_{l}^{\mathrm{\tiny T}}\widehat{{\bm{\Sigma}}}_{\mathbf{X}}\tilde{\boldsymbol{\beta}}_{l}}|=|\sqrt{(\boldsymbol{\beta}_{l}^{*})^{\mathrm{\tiny T}}{\bm{\Sigma}}_{\mathbf{X}}\boldsymbol{\beta}_{l}^{*}}-\sqrt{\tilde{\boldsymbol{\beta}}_{l}^{\mathrm{\tiny T}}\widehat{{\bm{\Sigma}}}_{\mathbf{X}}\tilde{\boldsymbol{\beta}}_{l}}| (S80)
\displaystyle\leq |(𝜷l)T𝚺𝐗𝜷l𝜷~lT𝚺𝐗𝜷~l|+|𝜷~lT(𝚺𝐗𝚺^𝐗)𝜷~l|/(𝜷~T𝚺𝐗𝜷~+𝜷~T𝚺^𝐗𝜷~)\displaystyle|\sqrt{(\boldsymbol{\beta}_{l}^{*})^{\mathrm{\tiny T}}{\bm{\Sigma}}_{\mathbf{X}}\boldsymbol{\beta}_{l}^{*}}-\sqrt{\tilde{\boldsymbol{\beta}}_{l}^{\mathrm{\tiny T}}{\bm{\Sigma}}_{\mathbf{X}}\tilde{\boldsymbol{\beta}}_{l}}|+|\tilde{\boldsymbol{\beta}}_{l}^{\mathrm{\tiny T}}({\bm{\Sigma}}_{\mathbf{X}}-\widehat{{\bm{\Sigma}}}_{\mathbf{X}})\tilde{\boldsymbol{\beta}}_{l}|/(\sqrt{\tilde{\boldsymbol{\beta}}^{\mathrm{\tiny T}}{\bm{\Sigma}}_{\mathbf{X}}\tilde{\boldsymbol{\beta}}}+\sqrt{\tilde{\boldsymbol{\beta}}^{\mathrm{\tiny T}}\widehat{{\bm{\Sigma}}}_{\mathbf{X}}\tilde{\boldsymbol{\beta}}}) (S81)
\displaystyle\leq |(𝜷l)T𝚺𝐗𝜷l𝜷~lT𝚺^𝐗𝜷~l|(𝜷l)T𝚺𝐗𝜷l+𝜷~lT𝚺^𝐗𝜷~l+sϵλmin(𝚺𝐗).\displaystyle\dfrac{|(\boldsymbol{\beta}_{l}^{*})^{\mathrm{\tiny T}}{\bm{\Sigma}}_{\mathbf{X}}\boldsymbol{\beta}_{l}^{*}-\tilde{\boldsymbol{\beta}}_{l}^{\mathrm{\tiny T}}\widehat{{\bm{\Sigma}}}_{\mathbf{X}}\tilde{\boldsymbol{\beta}}_{l}|}{\sqrt{(\boldsymbol{\beta}_{l}^{*})^{\mathrm{\tiny T}}{\bm{\Sigma}}_{\mathbf{X}}\boldsymbol{\beta}_{l}^{*}}+\sqrt{\tilde{\boldsymbol{\beta}}_{l}^{\mathrm{\tiny T}}\widehat{{\bm{\Sigma}}}_{\mathbf{X}}\tilde{\boldsymbol{\beta}}_{l}}}+\dfrac{s\epsilon}{\lambda_{\min}({\bm{\Sigma}}_{\mathbf{X}})}. (S82)

Because (𝜷l)T𝚺𝐗𝜷l+𝜷~lT𝚺^𝐗𝜷~lλmin(𝚺𝐗)\sqrt{(\boldsymbol{\beta}_{l}^{*})^{\mathrm{\tiny T}}{\bm{\Sigma}}_{\mathbf{X}}\boldsymbol{\beta}_{l}^{*}}+\sqrt{\tilde{\boldsymbol{\beta}}_{l}^{\mathrm{\tiny T}}\widehat{{\bm{\Sigma}}}_{\mathbf{X}}\tilde{\boldsymbol{\beta}}_{l}}\geq\lambda_{\min}({\bm{\Sigma}}_{\mathbf{X}}), it suffices to find a bound for |(𝜷l)T𝚺𝐗𝜷l𝜷~lT𝚺^𝐗𝜷~l||(\boldsymbol{\beta}_{l}^{*})^{\mathrm{\tiny T}}{\bm{\Sigma}}_{\mathbf{X}}\boldsymbol{\beta}_{l}^{*}-\tilde{\boldsymbol{\beta}}_{l}^{\mathrm{\tiny T}}\widehat{{\bm{\Sigma}}}_{\mathbf{X}}\tilde{\boldsymbol{\beta}}_{l}|.

|(𝜷l)T𝚺𝐗𝜷l𝜷~lT𝚺^𝐗𝜷~l||(𝜷l)T𝚺𝐗𝜷l𝜷~lT𝚺𝐗𝜷~l|+sϵ\displaystyle|(\boldsymbol{\beta}_{l}^{*})^{\mathrm{\tiny T}}{\bm{\Sigma}}_{\mathbf{X}}\boldsymbol{\beta}_{l}^{*}-\tilde{\boldsymbol{\beta}}_{l}^{\mathrm{\tiny T}}\widehat{{\bm{\Sigma}}}_{\mathbf{X}}\tilde{\boldsymbol{\beta}}_{l}|\leq|(\boldsymbol{\beta}_{l}^{*})^{\mathrm{\tiny T}}{\bm{\Sigma}}_{\mathbf{X}}\boldsymbol{\beta}_{l}^{*}-\tilde{\boldsymbol{\beta}}_{l}^{\mathrm{\tiny T}}{{\bm{\Sigma}}}_{\mathbf{X}}\tilde{\boldsymbol{\beta}}_{l}|+s\epsilon (S83)
\displaystyle\leq |(𝜷l𝜷~l)T𝚺𝐗𝜷l|+|𝜷~lT𝚺𝐗(𝜷l𝜷~l)|+sϵ\displaystyle|(\boldsymbol{\beta}_{l}^{*}-\tilde{\boldsymbol{\beta}}_{l})^{\mathrm{\tiny T}}{\bm{\Sigma}}_{\mathbf{X}}\boldsymbol{\beta}_{l}^{*}|+|\tilde{\boldsymbol{\beta}}_{l}^{\mathrm{\tiny T}}{\bm{\Sigma}}_{\mathbf{X}}(\boldsymbol{\beta}_{l}^{*}-\tilde{\boldsymbol{\beta}}_{l})|+s\epsilon (S85)
\displaystyle\leq (𝜷l𝜷~l)T𝚺𝐗(𝜷l𝜷~l)(𝜷l)T𝚺𝐗𝜷l+𝜷~lT𝚺𝐗𝜷~l(𝜷l𝜷~l)T𝚺𝐗(𝜷l𝜷~l)\displaystyle\sqrt{(\boldsymbol{\beta}_{l}^{*}-\tilde{\boldsymbol{\beta}}_{l})^{\mathrm{\tiny T}}{\bm{\Sigma}}_{\mathbf{X}}(\boldsymbol{\beta}_{l}^{*}-\tilde{\boldsymbol{\beta}}_{l})\cdot(\boldsymbol{\beta}_{l}^{*})^{\mathrm{\tiny T}}{\bm{\Sigma}}_{\mathbf{X}}\boldsymbol{\beta}_{l}^{*}}+\sqrt{\tilde{\boldsymbol{\beta}}_{l}^{\mathrm{\tiny T}}{\bm{\Sigma}}_{\mathbf{X}}\tilde{\boldsymbol{\beta}}_{l}\cdot(\boldsymbol{\beta}_{l}^{*}-\tilde{\boldsymbol{\beta}}_{l})^{\mathrm{\tiny T}}{\bm{\Sigma}}_{\mathbf{X}}(\boldsymbol{\beta}_{l}^{*}-\tilde{\boldsymbol{\beta}}_{l})}
+sϵ\displaystyle+s\epsilon
\displaystyle\leq 2λmax(𝚺𝐗)𝜷l𝜷~l2+sϵ\displaystyle 2\lambda_{\max}({\bm{\Sigma}}_{\mathbf{X}})\cdot\|\boldsymbol{\beta}_{l}^{*}-\tilde{\boldsymbol{\beta}}_{l}\|_{2}+s\epsilon (S86)
\displaystyle\leq 2\lambda_{\max}({\bm{\Sigma}}_{\mathbf{X}})\sqrt{2\{1-\cos\Theta(\boldsymbol{\beta}^{*}_{l},\tilde{\boldsymbol{\beta}}_{l})\}}+s\epsilon\leq 2\sqrt{2}\lambda_{\max}({\bm{\Sigma}}_{\mathbf{X}})|\sin\Theta(\boldsymbol{\beta}_{l}^{*},\tilde{\boldsymbol{\beta}}_{l})|+s\epsilon\leq Cs\epsilon, (S87)

which also implies that \dfrac{1}{\sqrt{\tilde{\boldsymbol{\beta}}_{l}^{\mathrm{\tiny T}}\widehat{{\bm{\Sigma}}}_{\mathbf{X}}\tilde{\boldsymbol{\beta}}_{l}}}\leq\dfrac{1}{\sqrt{(\boldsymbol{\beta}_{l}^{*})^{\mathrm{\tiny T}}{\bm{\Sigma}}_{\mathbf{X}}\boldsymbol{\beta}_{l}^{*}-Cs\epsilon}}\leq\sqrt{\dfrac{2}{\lambda_{\min}({\bm{\Sigma}}_{\mathbf{X}})}} provided that Cs\epsilon\leq\dfrac{\lambda_{\min}({\bm{\Sigma}}_{\mathbf{X}})}{2}. Combining the above bounds yields L_{3}\leq Cs\epsilon, so L_{2}\leq Cs\epsilon and hence \rho(\mathbf{N}_{l},s)\leq Cs\epsilon for every l\leq k. Finally, by (S65) we have the desired conclusion. ∎
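
For completeness, the elementary fact used twice above (in bounding \|\tilde{\boldsymbol{\beta}}_{l}-\boldsymbol{\beta}_{l}^{*}\|_{2} and in the last step leading to (S87)) is the following: for any unit vectors \mathbf{a} and \mathbf{b} with \cos\Theta(\mathbf{a},\mathbf{b})\geq 0,

\|\mathbf{a}-\mathbf{b}\|_{2}^{2}=2-2\cos\Theta(\mathbf{a},\mathbf{b})=\dfrac{2\sin^{2}\Theta(\mathbf{a},\mathbf{b})}{1+\cos\Theta(\mathbf{a},\mathbf{b})}\leq 2\sin^{2}\Theta(\mathbf{a},\mathbf{b}),

so that \|\mathbf{a}-\mathbf{b}\|_{2}\leq\sqrt{2}|\sin\Theta(\mathbf{a},\mathbf{b})|.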

Now we define crk(s)=infF:|F|scr(𝐌k,F,𝚺F)cr_{k}(s)=\inf_{F:|F|\leq s}cr(\mathbf{M}_{k,F},{\bm{\Sigma}}_{F}). We have the following lemma.

Lemma S15.

Assume that |\sin\Theta(\hat{\boldsymbol{\beta}}_{l},\boldsymbol{\beta}_{l})|<Cs\epsilon for all l<k. If \dfrac{2s\epsilon}{\lambda_{\min}({\bm{\Sigma}}_{\mathbf{X}})}+\dfrac{2s\epsilon}{\lambda_{\min}({\bm{\Sigma}}_{\mathbf{X}})\lambda_{1}}<\min\{\frac{1}{2},\dfrac{\Delta}{\lambda_{1}+\lambda_{2}},\dfrac{\lambda_{\min}({\bm{\Sigma}}_{\mathbf{X}})}{2},\dfrac{\Delta}{2(1+\lambda_{1}^{2})}cr_{k}(s)\}, then |\sin\Theta(\hat{\boldsymbol{\beta}}_{k},\boldsymbol{\beta}_{k})|\leq Cs\epsilon.

Proof of Lemma S15.

Combining Lemma S14 with Lemma S3, we obtain the desired conclusion. ∎

Proof of Theorem 3.

Combining Lemma S13 (the base case) with Lemma S15 (the induction step) and applying induction on k, we have that if \|\widehat{\mathbf{M}}-\mathbf{M}\|_{\max}\leq\epsilon and \|\widehat{{\bm{\Sigma}}}_{\mathbf{X}}-{\bm{\Sigma}}_{\mathbf{X}}\|_{\max}\leq\epsilon, then |\sin\Theta(\hat{\boldsymbol{\beta}}_{k},\boldsymbol{\beta}_{k})|\leq Cs\epsilon for k=1,\ldots,K. The desired conclusion then follows from Theorem 1. ∎