Slicing-free Inverse Regression in High-dimensional
Sufficient Dimension Reduction
Qing Mai1, Xiaofeng Shao2, Runmin Wang3 and Xin Zhang1
Florida State University1
University of Illinois at Urbana-Champaign2
Texas A&M University3
Abstract: Sliced inverse regression (SIR, Li, 1991) is a pioneering work and the most recognized method in sufficient dimension reduction. While promising progress has been made in the theory and methods of high-dimensional SIR, two challenges persist in high-dimensional multivariate applications. First, choosing the number of slices in SIR is a difficult problem, as it depends on the sample size, the distribution of the variables, and other practical considerations. Second, the extension of SIR from a univariate response to a multivariate one is not trivial. Targeting the same dimension reduction subspace as SIR, we propose a new slicing-free method that provides a unified solution to sufficient dimension reduction with high-dimensional covariates and a univariate or multivariate response. We achieve this by adopting the recently developed martingale difference divergence matrix (MDDM, Lee and Shao, 2018) and penalized eigen-decomposition algorithms. To establish the consistency of our method with a high-dimensional predictor and a multivariate response, we develop a new concentration inequality for the sample MDDM around its population counterpart using theories for U-statistics, which may be of independent interest. Simulations and real data analysis demonstrate the favorable finite-sample performance of the proposed method.
Key words and phrases: Multivariate response, Sliced inverse regression, Sufficient dimension reduction, U-statistic.
1 Introduction
Sufficient dimension reduction (SDR) is an important statistical tool for data visualization, summary, and inference. It extracts low-rank projections of the predictors that contain all the information about the response, without specifying a parametric model beforehand. The semi-parametric nature of SDR leads to great flexibility and convenience in practice. After SDR, we can model the conditional distribution of the response given the lower-dimensional projected covariates using existing parametric or non-parametric methods. A salient feature of SDR is that the low-rank projection space can be accurately estimated at a parametric rate, with the nonparametric part treated as an infinite-dimensional nuisance parameter. For example, in multi-index models, SDR can estimate the multiple projection directions without estimating the unspecified link function.
A cornerstone of SDR is sliced inverse regression (SIR), pioneered by Li (1991), who first discovered the inner connection between the low-rank projection space and the eigen-space of $\Sigma^{-1}\mathrm{cov}\{E(X\mid Y)\}$, where $\Sigma=\mathrm{cov}(X)$, under suitable assumptions. SIR is performed by slicing the response and aggregating the conditional mean of the predictor given the response within each slice. To illustrate the idea, we consider a univariate response $Y$. Slicing involves picking constants $-\infty=a_0<a_1<\cdots<a_H=\infty$ and defining a new random variable $\widetilde{Y}$, where $\widetilde{Y}=h$ if and only if $a_{h-1}<Y\le a_h$. Upon centering and standardizing the covariate, i.e., $Z=\Sigma^{-1/2}\{X-E(X)\}$, a simple eigen-decomposition can be conducted to find linear projections that explain most of the variability in the conditional expectation of the transformed predictor given the response across slices, that is, the eigen-decomposition of $\mathrm{cov}\{E(Z\mid\widetilde{Y})\}$. As an important variation of SIR, sliced average variance estimation (Cook and Weisberg, 1991) utilizes the conditional variance across slices. A key step in these inverse regression methods is apparently the choice of the slicing scheme. If $Y$ is sliced too coarsely, we may not be able to capture the full dependence of $Y$ on the predictors, which could lead to a large bias in estimation. If $Y$ is sliced too finely, the within-slice sample size is small, leading to a large variability in estimation. Although Li (1991) and Hsing and Carroll (1992) showed that SIR can still be consistent in large samples even when the slicing scheme is chosen poorly, Zhu and Ng (1995) argued that the choice of the slicing scheme is critical to achieve high estimation efficiency. To the best of our knowledge, the literature offers no universal guidance on the choice of the slicing scheme.
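To make the slice-and-estimate recipe concrete, here is a minimal Python sketch of classical (low-dimensional) SIR; the quantile-based slicing rule, the function name, and the assumption that the sample covariance is invertible ($p < n$) are our illustrative choices, not prescriptions from the paper.

```python
import numpy as np

def sir_candidate_matrix(X, y, n_slices=10):
    """Classical SIR for univariate y: slice y, average the standardized
    predictors within each slice, and form the weighted covariance of the
    within-slice means. Assumes p < n so that Sigma is invertible."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    Sigma = np.cov(Xc, rowvar=False)
    w, Q = np.linalg.eigh(Sigma)
    Sigma_inv_half = Q @ np.diag(w ** -0.5) @ Q.T     # Sigma^{-1/2}
    Z = Xc @ Sigma_inv_half                            # standardized predictors

    # Quantile-based slices: roughly equal numbers of observations per slice
    edges = np.quantile(y, np.linspace(0, 1, n_slices + 1))
    labels = np.searchsorted(edges[1:-1], y, side="right")

    M = np.zeros((p, p))
    for h in range(n_slices):
        idx = labels == h
        if idx.any():
            mh = Z[idx].mean(axis=0)
            M += idx.mean() * np.outer(mh, mh)         # weight = slice proportion
    return M, Sigma_inv_half

# The top eigenvectors of M, mapped back through Sigma_inv_half, estimate the
# SIR directions; their quality visibly depends on the choice of n_slices.
```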
Zhu et al. (2010) and Cook and Zhang (2014) showed that it is beneficial to aggregate multiple slicing schemes rather than relying on a single one. However, the proposals in the above-mentioned papers have their own limitations, as they focus exclusively on the univariate response. In many real-life problems, it is common to encounter multi-response data. Component-wise analysis may not be sufficient for multi-response data because it does not fully exploit the dependence among the response components. But slicing a multivariate response is notoriously hard due to the curse of dimensionality, a common problem in multivariate nonparametric smoothing. As the dimension of the response becomes moderately large, it is increasingly difficult to ensure that each slice contains a sufficient number of observations, and the estimation can be unstable in practice. Hence, it is highly desirable to develop new SDR methods that do not involve slicing.
An important line of research in the recent SDR literature is to develop SDR methods for datasets with high-dimensional covariates, as motivated by many contemporary applications. The idea of SDR is naturally attractive for high-dimensional datasets, as an effective reduction of the dimension of $X$ facilitates the use of existing modeling and inference methods that are tailored for low-dimensional covariates. However, most classical SDR methods are not directly applicable to the large-$p$-small-$n$ setting, where $p$ is the dimension of $X$ and $n$ is the sample size. To overcome the challenges with high-dimensional covariates, several methods have been proposed recently. Lin et al. (2018) show that the SIR estimator is consistent if and only if $p/n\to 0$. When the dimension $p$ is larger than the sample size, they propose a diagonal thresholding screening SIR (DT-SIR) algorithm and show that it consistently recovers the dimension reduction space under certain sparsity assumptions on both the covariance matrix of the predictors and the loadings of the directions. Lin et al. (2019) further introduce a simple Lasso regression method to obtain an estimate of the SDR space by constructing artificial response variables made up from top eigenvectors of the estimated conditional covariance matrix. Tan, Wang, Liu and Zhang (2018) propose a two-stage computational framework to solve the sparse generalized eigenvalue problem, which includes high-dimensional SDR as a special case, and propose a truncated Rayleigh flow method (namely, RIFLE) to estimate the leading generalized eigenvector. Also see Lin et al. (2020) and Tan, Wang, Zhang, Liu and Cook (2018) for related recent work. These methods provide valuable tools to tackle the high-dimensional SDR problem. However, all of these methods still rely on SIR as an important component of their methodology and involve choosing a single slicing scheme, with little guidance on that choice. Consequently, these methods cannot be easily applied to data with a multivariate response, and the impact of the choice of slicing scheme is unclear.
In this article, we propose a novel slicing-free SDR method in the high-dimensional setting. Our proposal is inspired by a recently developed nonlinear dependence metric: the martingale difference divergence matrix (MDDM, Lee and Shao, 2018). The MDDM was developed by Lee and Shao (2018) as a matrix-valued extension of the martingale difference divergence (MDD) in Shao and Zhang (2014), which measures the conditional mean dependence of a response variable given a covariate, and it was used for dimension reduction of a multivariate time series. As recently revealed by Zhang et al. (2020), at the population level, the eigenvectors (or generalized eigenvectors) of the MDDM are always contained in the central subspace. Building on these prior works, we propose a penalized eigen-decomposition of the MDDM to perform SDR in high dimensions. When the covariance matrix of the predictor is proportional to the identity matrix, we adopt the truncated power method with hard thresholding to estimate the top eigenvectors of the MDDM. For a more general covariance structure, we adopt the RIFLE algorithm (Tan, Wang, Liu and Zhang, 2018) and apply it to the sample MDDM instead of the slicing-based SIR estimator. With the use of the sample MDDM, this approach is completely slicing-free and treats the univariate response and multivariate response in a unified way; thus the practical difficulty of selecting the number of slices (especially for a multivariate response) is circumvented. On the theory front, we derive a concentration inequality for the sample MDDM around its population counterpart using theories for U-statistics, and we obtain rigorous non-asymptotic theoretical justification for the estimated central subspaces in both settings. Simulations and real data analysis confirm that the proposed penalized MDDM approach outperforms slicing-based methods in estimation accuracy.
The rest of this paper is organized as follows. In Section 2, we give a brief review of the martingale difference divergence matrix (MDDM) and then present a new concentration inequality for the sample MDDM around its population counterpart. In Section 3, we present our general methodology of adopting MDDM in both model-free and model-based SDR problems, where we establish population level connections between the central subspace and the eigen-decomposition and the generalized eigen-decomposition of MDDM. Algorithms for regularized eigen-decomposition and generalized eigen-decomposition problems are proposed in Sections 4.1 and 4.2, respectively. Theoretical properties are established in Section 5. Section 6 contains numerical studies. Finally, Section 7 concludes the paper with a short discussion. The Supplementary Materials collect all additional technical details and numerical results.
2 MDDM and its concentration inequality
Consider a pair of random vectors $(U, V)$ such that $E(\|U\| + \|V\|^{2}) < \infty$, where $\|\cdot\|$ denotes the Euclidean norm. Define the martingale difference divergence matrix
$$\mathrm{MDDM}(V\mid U) = -E\big[\{V - E(V)\}\{V' - E(V')\}^{\top}\,\|U - U'\|\big],$$
where $(U', V')$ is an independent copy of $(U, V)$. Lee and Shao (2018) established the following key properties of $\mathrm{MDDM}(V\mid U)$: (i) it is symmetric and positive semi-definite; (ii) $E(V\mid U) = E(V)$ almost surely is equivalent to $\mathrm{MDDM}(V\mid U) = 0$; (iii) for any matrix $A$ with conformable dimensions, $\mathrm{MDDM}(AV\mid U) = A\,\mathrm{MDDM}(V\mid U)A^{\top}$; (iv) there exist $k$ linearly independent combinations of $V$ that are (conditionally) mean independent of $U$ if and only if the rank of $\mathrm{MDDM}(V\mid U)$ is at most $\dim(V) - k$.
Given a random sample of size $n$, i.e., $\{(U_i, V_i)\}_{i=1}^{n}$, the sample estimate of $\mathrm{MDDM}(V\mid U)$, denoted by $\widehat{\mathrm{MDDM}}(V\mid U)$, is defined as
$$\widehat{\mathrm{MDDM}}(V\mid U) = -\frac{1}{n^{2}}\sum_{i=1}^{n}\sum_{j=1}^{n}(V_i - \bar{V})(V_j - \bar{V})^{\top}\,\|U_i - U_j\|, \tag{2.1}$$
where $\bar{V} = n^{-1}\sum_{i=1}^{n}V_i$ is the sample mean.
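For concreteness, a short Python sketch of the sample MDDM in (2.1) follows; the function name is ours, and the double sum is written as a single quadratic form in the centered data.

```python
import numpy as np

def sample_mddm(V, U):
    """Sample MDDM(V | U) from (2.1):
    -(1/n^2) * sum_{i,j} (V_i - Vbar)(V_j - Vbar)^T * ||U_i - U_j||.
    V, U are arrays with n rows; the output is a symmetric PSD matrix
    whose size equals the number of columns of V."""
    V = np.asarray(V, dtype=float).reshape(len(V), -1)
    U = np.asarray(U, dtype=float).reshape(len(U), -1)
    n = V.shape[0]
    Vc = V - V.mean(axis=0)                                    # center V
    D = np.linalg.norm(U[:, None, :] - U[None, :, :], axis=2)  # ||U_i - U_j||
    return -(Vc.T @ D @ Vc) / n**2                             # the double sum

# For sufficient dimension reduction we take V = X and U = Y, i.e.
# M_hat = sample_mddm(X, Y) is the p x p matrix decomposed in later sections.
```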
In the following, we present a concentration inequality for the sample MDDM around its population counterpart, which plays an instrumental role in our consistency proof for the proposed penalized MDDM method later. To this end, we assume the following condition.
- (C1) There exist two positive constants such that $U$ and $V$ satisfy the uniform sub-Gaussian moment condition
(2.2)
For a matrix $A = (a_{ij})$, we denote its max norm as $\|A\|_{\max} = \max_{i,j}|a_{ij}|$.
Theorem 1.
Suppose that Condition (C1) holds. There exists a positive integer , and a finite positive constant such that when and , we have
The above bound is non-asymptotic and holds for all $(n, p)$ as long as the stated condition is satisfied. The exponent in the bound is due to the use of a truncation argument along with Hoeffding's inequality for U-statistics, and it seems hard to improve. Nevertheless, we are able to achieve an exponential-type bound under a uniform sub-Gaussian condition on both $U$ and $V$. This result may be of independent theoretical interest. For example, in the time series dimension reduction problem studied by Lee and Shao (2018), our Theorem 1 could potentially help extend the theory there from low-dimensional multivariate time series to higher dimensions.
3 Slicing-free Inverse Regression via MDDM
3.1 Inverse regression subspace in sufficient dimension reduction
Sufficient dimension reduction (SDR) methods aim to identify the central subspace, which preserves all the information that the predictors carry about the response. In this paper, we consider the SDR problem of a multivariate response $Y \in \mathbb{R}^{q}$ on a multivariate predictor $X \in \mathbb{R}^{p}$. The central subspace, denoted by $\mathcal{S}_{Y\mid X}$, is defined as the intersection of all subspaces $\mathcal{S}$ such that $Y \perp\!\!\!\perp X \mid P_{\mathcal{S}}X$, where $P_{\mathcal{S}}$ is the projection matrix onto $\mathcal{S}$. By construction, the central subspace is the smallest dimension reduction subspace that contains all the information in the conditional distribution of $Y$ given $X$. Many methods have been proposed for the recovery of the central subspace or a portion of it (Li, 1991; Cook and Weisberg, 1991; Bura and Cook, 2001; Chiaromonte et al., 2002; Yin and Cook, 2003; Cook and Ni, 2005; Li and Wang, 2007; Zhou and He, 2008). See Li (2018) for a comprehensive review. Although the central subspace is well defined for both univariate and multivariate responses, most existing SDR methods consider the case of a univariate response, and the extension to a multivariate response is non-trivial.
The definition of the central subspace is not very constructive, as it requires taking the intersection of all subspaces $\mathcal{S}$ such that $Y \perp\!\!\!\perp X \mid P_{\mathcal{S}}X$. It is indeed a very ambitious goal to estimate the central subspace without specifying a model between $Y$ and $X$. To achieve this, we often need additional assumptions such as the linearity and coverage conditions. The linearity condition requires that, for any basis $\beta$ of the central subspace $\mathcal{S}_{Y\mid X}$, $E(X\mid \beta^{\top}X)$ is linear in $\beta^{\top}X$. The linearity condition is guaranteed if $X$ is elliptically contoured, and it allows us to connect the central subspace to the conditional expectation $E(X\mid Y)$. Define $\Sigma = \mathrm{cov}(X)$ and the inverse regression subspace
$$\mathcal{S}_{E(X\mid Y)} = \mathrm{span}\{E(X\mid Y=y) - E(X):\ y \text{ in the support of } Y\}. \tag{3.3}$$
The following property is well known and often adopted in developing SDR methods.
Proposition 1.
Under the linearity condition, we have $\Sigma^{-1}\mathcal{S}_{E(X\mid Y)} \subseteq \mathcal{S}_{Y\mid X}$.
The coverage condition further assumes that $\Sigma^{-1}\mathcal{S}_{E(X\mid Y)} = \mathcal{S}_{Y\mid X}$. It follows that we can estimate the central subspace by modeling the conditional expectation $E(X\mid Y)$. Indeed, many SDR methods approximate $E(X\mid Y)$. For example, the most classical SDR method, sliced inverse regression (SIR), slices a univariate $Y$ into several categories and estimates the mean of $X$ within each slice. Most later methods also follow this slice-and-estimate procedure. Apparently, the slicing scheme is important for estimation. If there are too few slices, we may not be able to fully capture the dependence of $Y$ on $X$; however, if there are too many slices, there are not enough samples within each slice to allow accurate estimation.
3.2 MDDM in SDR
In this section, we lay the foundation for the application of MDDM in SDR. We show that the subspace spanned by MDDM coincides with the inverse regression subspace in (3.3). In particular, we have the following Proposition 2, which was used in Zhang et al. (2020), without a proof, in the context of multivariate linear regression.
Proposition 2.
For multivariate $X$ and $Y$, assuming the existence of $\Sigma$, $E(X\mid Y)$ and $\mathrm{MDDM}(X\mid Y)$, we have $\mathrm{span}\{\mathrm{MDDM}(X\mid Y)\} = \mathcal{S}_{E(X\mid Y)}$.
Therefore, the rank of $\mathrm{MDDM}(X\mid Y)$ equals the dimensionality of the inverse regression subspace, and the non-trivial eigenvectors of $\mathrm{MDDM}(X\mid Y)$ contain all the information about $\mathcal{S}_{E(X\mid Y)}$. Combining Propositions 1 and 2, we immediately have that (i) under the linearity condition, $\Sigma^{-1}\mathrm{span}\{\mathrm{MDDM}(X\mid Y)\} \subseteq \mathcal{S}_{Y\mid X}$; and (ii) under the linearity and coverage conditions, $\Sigma^{-1}\mathrm{span}\{\mathrm{MDDM}(X\mid Y)\} = \mathcal{S}_{Y\mid X}$.
Henceforth, we assume both the linearity and coverage conditions, which are assumed either explicitly or implicitly in inverse regression type dimension reduction methods (e.g., Li, 1991; Cook and Ni, 2005; Zhu et al., 2010; Cook and Zhang, 2014). Then the central subspace is related to the eigen-decomposition of $\mathrm{MDDM}(X\mid Y)$. Specifically, we have the following scenarios.
If $\Sigma = \sigma^{2}\mathbf{I}_p$ for some $\sigma^{2} > 0$, then obviously $\mathcal{S}_{Y\mid X} = \mathrm{span}\{\mathrm{MDDM}(X\mid Y)\}$. This includes single index and multiple index models with uncorrelated predictors. Let $d$ be the rank of $\mathrm{MDDM}(X\mid Y)$; then the dimension of the central subspace is $d$, and the first $d$ eigenvectors of $\mathrm{MDDM}(X\mid Y)$ span the central subspace.
If $\Sigma = \sigma^{2}\mathbf{I}_p + \Gamma\Omega\Gamma^{\top}$ for some $\sigma^{2} > 0$, where the columns of $\Gamma$ form a basis of $\mathrm{span}\{\mathrm{MDDM}(X\mid Y)\}$ and $\Omega$ is symmetric positive definite, then we still have $\mathcal{S}_{Y\mid X} = \mathrm{span}\{\mathrm{MDDM}(X\mid Y)\}$. Because $\mathcal{S}_{Y\mid X} = \Sigma^{-1}\mathcal{S}_{E(X\mid Y)}$ under the linearity and coverage conditions, it suffices to show that $\Sigma^{-1}\mathrm{span}(\Gamma) = \mathrm{span}(\Gamma)$, which follows by applying the Woodbury matrix identity to $\Sigma^{-1}$. The non-trivial eigenvectors of $\mathrm{MDDM}(X\mid Y)$ again span the central subspace.
For a general covariance structure $\Sigma$, the $d$-dimensional central subspace can be obtained via a generalized eigen-decomposition. Specifically, consider the generalized eigenvalue problem
$$\mathrm{MDDM}(X\mid Y)\,v_k = \lambda_k \Sigma v_k, \qquad \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p \ge 0, \tag{3.4}$$
where the generalized eigenvectors are normalized so that $v_k^{\top}\Sigma v_k = 1$ for $k = 1, \ldots, p$. Then, similar to Li (2007) and Chen et al. (2010), it is straightforward to show that the first $d$ generalized eigenvectors span the central subspace, i.e., $\mathcal{S}_{Y\mid X} = \mathrm{span}(v_1, \ldots, v_d)$.
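In low dimensions ($p < n$, no sparsity penalty), the target in (3.4) can be computed directly with a generalized symmetric eigensolver; the sketch below is only meant to make the estimand explicit, and the assumption that the sample covariance is nonsingular is ours.

```python
import numpy as np
from scipy.linalg import eigh

def mddm_directions(M_hat, Sigma_hat, d):
    """Top-d solutions of M v = lambda * Sigma v, with v normalized so that
    v^T Sigma v = 1 (eigh handles the symmetric-definite generalized problem)."""
    eigvals, eigvecs = eigh(M_hat, Sigma_hat)   # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:d]
    return eigvecs[:, order], eigvals[order]

# The span of the returned columns estimates the central subspace under the
# linearity and coverage conditions; when p > n, Sigma_hat is singular and the
# penalized algorithms of Section 4 are needed instead.
```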
Existing works in SDR often focus on the eigen-decomposition or the generalized eigen-decomposition of $\mathrm{cov}\{E(X\mid Y)\}$, where non-parametric estimates of $E(X\mid Y)$ are obtained by slicing the support of the univariate response $Y$. Compared to these approaches, the MDDM approach requires no tuning parameter selection (i.e., no slicing scheme to specify). Moreover, the high-dimensional theoretical study of MDDM is easier and does not require additional assumptions on the conditional mean function, such as smoothness of the empirical mean function of $X$ given $Y$ (e.g., the sliced stable condition in Lin et al. (2018)).
3.3 MDDM for model-based SDR
So far, we have discussed model-free SDR. Another important research area in SDR is model-based methods, which provide invaluable intuition for the use of inverse regression estimation under the assumption that the conditional distribution of $X$ given $Y$ is normal. In this section, we consider the principal fitted component (PFC) model, which was discussed in detail in Cook and Forzani (2009) and Cook (2007), and we generalize it from a univariate response to a multivariate response. We argue that the (generalized) eigen-decomposition of MDDM is potentially advantageous over likelihood-based approaches under the PFC model. This is somewhat surprising but reasonable, considering that the advantages of MDDM over least squares and likelihood-based estimation were demonstrated in Zhang et al. (2020) under the multivariate linear model.
Let $X_y$ denote $X$ conditional on $Y = y$; then the PFC model is
$$X_y = \mu + \Gamma\nu_y + \varepsilon, \tag{3.5}$$
where $\mu \in \mathbb{R}^p$ is the overall mean, $\varepsilon$ is a mean-zero error vector, $\Gamma \in \mathbb{R}^{p\times d}$ is a non-stochastic matrix with orthogonal columns, and $\nu_y \in \mathbb{R}^{d}$ is the latent variable that depends on $y$. Then the latent variable is fitted as $\nu_y = \beta f(y)$ with some user-specified functions $f(y) = (f_1(y), \ldots, f_r(y))^{\top}$ that map the $q$-dimensional response to an $r$-dimensional vector. In the univariate PFC model, $q = 1$, so the functions can be viewed as an expansion of the response (similar to slicing). For our multivariate extension of the PFC model, no such expansion is required. The PFC model can then be written as
$$X_y = \mu + \Gamma\beta f(y) + \varepsilon, \tag{3.6}$$
where $\Gamma$ and $\beta$ are estimated similarly to multivariate reduced-rank regression with $X$ being the response and $f(Y)$ being the predictor. Finally, the central subspace under this PFC model is $\mathrm{span}(\Delta^{-1}\Gamma)$, where $\Delta = \mathrm{cov}(\varepsilon)$, which simplifies to $\mathrm{span}(\Gamma)$ if we further assume an isotropic error $\Delta = \sigma^{2}\mathbf{I}_p$ (i.e., the isotropic PFC model).
For the PFC model, our MDDM approach is the same as its model-free counterpart, and it has two main advantages over likelihood-based PFC estimation: (i) there is no need to specify the functions $f$, and thus no risk of mis-specification; (ii) the extension to the high-dimensional setting is much more straightforward. Moreover, under the isotropic PFC model, the central subspace is exactly spanned by the first $d$ eigenvectors of $\mathrm{MDDM}(X\mid Y)$.
4 Estimation
4.1 Penalized decomposition of MDDM
Based on the results in the last section, a penalized eigen-decomposition of MDDM can be used for estimating the central subspace in high dimensions when the covariance $\Sigma$ or the conditional covariance of $X$ given $Y$ is proportional to the identity matrix. We consider the construction of such an estimate. It is worth mentioning that the penalized decomposition of MDDM we develop here is immediately applicable to the dimension reduction of multivariate stationary time series in Lee and Shao (2018), which is beyond the scope of this article. Moreover, it is well-known that $\Sigma$ is not easy to estimate in high dimensions. Then, even when $\Sigma$ has a general covariance structure, the eigen-decomposition of MDDM provides an estimate of the inverse regression subspace (though it may differ from the central subspace) that is useful for exploratory data analysis (e.g., detecting and visualizing a non-linear mean function).
As such, we consider the estimation of the eigenvectors of $\mathrm{MDDM}(X\mid Y)$. We assume that $\mathrm{MDDM}(X\mid Y)$ has $d$ nontrivial eigenvectors, denoted by $v_1, \ldots, v_d$, with eigenvalues $\lambda_1 \ge \cdots \ge \lambda_d > 0$. We use the shorthand notation $\mathbf{M} = \mathrm{MDDM}(X\mid Y)$. Also, we note that, given the first $k-1$ eigenvectors, $v_k$ is the top eigenvector of the deflated matrix $\mathbf{M}_k = \mathbf{M} - \sum_{j<k}\lambda_j v_j v_j^{\top}$, where $\mathbf{M}_1 = \mathbf{M}$.
It is well-known that the eigenvectors cannot be accurately estimated in high dimensions without additional assumptions. We adopt the popular sparsity assumption that many entries in $v_1, \ldots, v_d$ are zero. To estimate these sparse eigenvectors, denote $\widehat{\mathbf{M}} = \widehat{\mathrm{MDDM}}(X\mid Y)$, where the sample MDDM is defined in (2.1). We find the estimates $\hat{v}_1, \hat{v}_2, \ldots$ sequentially as follows:
$$\hat{v}_k = \arg\max_{v\in\mathbb{R}^p}\ v^{\top}\widehat{\mathbf{M}}_k v \quad\text{subject to}\quad \|v\|_2 = 1,\ \|v\|_0 \le s, \tag{4.7}$$
where $\widehat{\mathbf{M}}_1 = \widehat{\mathbf{M}}$, $\widehat{\mathbf{M}}_k = \widehat{\mathbf{M}} - \sum_{j<k}(\hat{v}_j^{\top}\widehat{\mathbf{M}}\hat{v}_j)\hat{v}_j\hat{v}_j^{\top}$ for $k > 1$, and $s$ is a tuning parameter that controls the sparsity level.
We solve the above problem by combining the truncated power method with hard thresholding. For a vector $u \in \mathbb{R}^p$ and a positive integer $s$, denote by $|u|_{(s)}$ the $s$-th largest value of $(|u_1|, \ldots, |u_p|)$. The hard-thresholding operator $\mathrm{TRC}(u, s)$ sets to zero the elements of $u$ whose absolute values are below $|u|_{(s)}$. We solve (4.7) by Algorithm 1, where the initialization may be randomly generated. Note that Yuan and Zhang (2013) proposed Algorithm 1 to perform principal component analysis through penalized eigen-decomposition of the sample covariance.
In our algorithm, we require a pre-specified sparsity level $s$ and subspace dimension $K$. In theory, we show that our estimators $\hat{v}_1, \ldots, \hat{v}_K$ are all consistent for their population counterparts when the sparsity level $s$ is sufficiently large (i.e., larger than the population sparsity level) and the number of directions $K$ is no bigger than the true dimension of the central subspace. Therefore, our method is flexible in the sense that the pre-specified $s$ and $K$ do not have to be exactly correct. In practice, especially in exploratory data analysis, the number of sequentially extracted directions is often set to be small, while the determination of the true central subspace dimension is a separate and important research topic in SDR (e.g., Bura and Yang, 2011; Luo and Li, 2016) and is beyond the scope of this paper. Moreover, the pre-specified sparsity level combined with the sparsity-constrained estimation is potentially convenient for post-dimension-reduction inference (Kim et al., 2020), as seen in the post-selection inference of canonical correlation analysis that is done over subsets of variables of pre-specified cardinality (McKeague and Zhang, 2020).
As pointed out by a referee, other sparse principal component analysis (PCA) methods can potentially be applied to decompose MDDM. We choose to extend the algorithm in Yuan and Zhang (2013) to facilitate computation and theoretical development. For computationally efficient sparse PCA methods such as Zou et al. (2006) and Witten et al. (2009), the theoretical properties are unfortunately unknown. Hence, we expect the theoretical study of their MDDM-variants to be very challenging. On the other hand, for theoretically justified sparse PCA methods such as Vu and Lei (2013) and Cai et al. (2013), the computation is less efficient.
1. Input: $\widehat{\mathbf{M}}_1 = \widehat{\mathbf{M}}$, sparsity level $s$, and number of directions $K$.
2. Initialize $v_k^{(0)}$ for $k = 1, \ldots, K$.
3. For $k = 1, \ldots, K$, do
   (a) Iterate over $t = 0, 1, 2, \ldots$ until convergence:
       i. Set $u^{(t+1)} = \widehat{\mathbf{M}}_k v_k^{(t)} / \|\widehat{\mathbf{M}}_k v_k^{(t)}\|$.
       ii. If $\|u^{(t+1)}\|_0 \le s$, set $v_k^{(t+1)} = u^{(t+1)}$;
       else set $v_k^{(t+1)} = \mathrm{TRC}(u^{(t+1)}, s)/\|\mathrm{TRC}(u^{(t+1)}, s)\|$.
   (b) Set $\hat{v}_k = v_k^{(t)}$ at convergence and $\widehat{\mathbf{M}}_{k+1} = \widehat{\mathbf{M}}_k - (\hat{v}_k^{\top}\widehat{\mathbf{M}}_k\hat{v}_k)\hat{v}_k\hat{v}_k^{\top}$.
4. Output $\hat{v}_1, \ldots, \hat{v}_K$.
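A minimal Python sketch of Algorithm 1 follows (a truncated power iteration with hard thresholding and deflation applied to the sample MDDM); the iteration cap, tolerance, and random initialization are our illustrative choices.

```python
import numpy as np

def hard_threshold(u, s):
    """Keep the s largest entries of u in absolute value; zero out the rest."""
    v = np.zeros_like(u)
    keep = np.argsort(np.abs(u))[-s:]
    v[keep] = u[keep]
    return v

def truncated_power_mddm(M_hat, K, s, max_iter=200, tol=1e-6, seed=None):
    """Sequentially extract K sparse eigenvectors of the sample MDDM."""
    rng = np.random.default_rng(seed)
    p = M_hat.shape[0]
    Mk = M_hat.copy()
    V = np.zeros((p, K))
    for k in range(K):
        v = rng.standard_normal(p)
        v /= np.linalg.norm(v)
        for _ in range(max_iter):
            u = Mk @ v
            u /= np.linalg.norm(u)                 # power step
            u = hard_threshold(u, s)               # enforce sparsity level s
            u /= np.linalg.norm(u)
            if np.linalg.norm(u - v) < tol:
                v = u
                break
            v = u
        V[:, k] = v
        lam = v @ Mk @ v                           # current eigenvalue estimate
        Mk = Mk - lam * np.outer(v, v)             # deflate for the next direction
    return V
```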
4.2 Generalized eigenvalue problems with MDDM
Now we consider a general (arbitrary) covariance structure $\Sigma$. We continue to use $v_1, \ldots, v_d$ to denote the nontrivial generalized eigenvectors in (3.4), so that the central subspace is spanned by the $v_k$'s. Again, we assume that these eigenvectors are sparse. In principle, we could assume that $\Sigma$ is also sparse and construct its estimate accordingly. However, $\Sigma$ is a nuisance parameter for our ultimate goal, and additional assumptions on it may unnecessarily limit the applicability of our method. Hence, we take a different approach as follows.
To avoid estimating $\Sigma^{-1}$, we note that $v_1, \ldots, v_d$ can also be viewed as the generalized eigenvectors defined as follows, which is equivalent to (3.4):
$$v_k = \arg\max_{v}\ v^{\top}\mathbf{M}v \quad\text{subject to}\quad v^{\top}\Sigma v = 1\ \text{and}\ v^{\top}\Sigma v_j = 0\ \text{for all}\ j < k. \tag{4.8}$$
Directly solving the generalized eigen-decomposition problem in (4.8) is not easy if we want to further impose penalties, because it is difficult to satisfy the orthogonality constraints. Therefore, we further consider another form of (4.8) that does not involve the orthogonality constraints. This alternative form is based on the following lemma.
Lemma 1.
Let $\mathbf{M}_1 = \mathbf{M}$ and $\mathbf{M}_k = \mathbf{M} - \sum_{j<k}\lambda_j\Sigma v_j v_j^{\top}\Sigma$ for $k > 1$. We have
$$v_k = \arg\max_{v}\ \frac{v^{\top}\mathbf{M}_k v}{v^{\top}\Sigma v}. \tag{4.9}$$
Motivated by Lemma 1, we consider the penalized problem $\hat{v}_k = \arg\max_{v}\ v^{\top}\widehat{\mathbf{M}}_k v / (v^{\top}\widehat{\Sigma} v)$ such that $\|v\|_0 \le s$, where $\widehat{\Sigma}$ is the sample covariance of $X$, $\widehat{\mathbf{M}}_1 = \widehat{\mathbf{M}}$ and $\widehat{\mathbf{M}}_k = \widehat{\mathbf{M}} - \sum_{j<k}(\hat{v}_j^{\top}\widehat{\mathbf{M}}\hat{v}_j)\widehat{\Sigma}\hat{v}_j\hat{v}_j^{\top}\widehat{\Sigma}$ for $k > 1$, and $s$ is a tuning parameter. We adopt the RIFLE algorithm in Tan, Wang, Liu and Zhang (2018) to solve this problem. See the details in Algorithm 2. In our simulation studies, we considered randomly generated initial values and a fixed step size, and observed reasonably good performance.
Although Algorithm 2 is a generalization of the RIFLE algorithm in Tan, Wang, Liu and Zhang (2018), there are important differences between the two. On one hand, the RIFLE algorithm only extracts the first generalized eigenvector, whereas Algorithm 2 is capable of estimating multiple generalized eigenvectors by properly deflating the MDDM. In sufficient dimension reduction problems, the central subspace often has a structural dimension greater than 1, and it is necessary to find more than one generalized eigenvector. Hence, Algorithm 2 is potentially more useful than the RIFLE algorithm in practice. On the other hand, the usefulness of the RIFLE algorithm has been demonstrated in several statistical applications, including sparse sliced inverse regression. Here, Algorithm 2 decomposes the MDDM, which is the first time the penalized generalized eigenvector problem is used to perform sufficient dimension reduction in a slicing-free manner in high dimensions. A brief analysis of the computational complexity is included in Section S3 of the Supplementary Materials.
1. Input: $\widehat{\mathbf{M}}_1 = \widehat{\mathbf{M}}$, $\widehat{\Sigma}$, sparsity level $s$, number of directions $K$, and step size $\eta$.
2. Initialize $v_k^{(0)}$ for $k = 1, \ldots, K$.
3. For $k = 1, \ldots, K$, do
   (a) Iterate over $t = 0, 1, 2, \ldots$ until convergence:
       i. Set $\rho^{(t)} = \{(v_k^{(t)})^{\top}\widehat{\mathbf{M}}_k v_k^{(t)}\}/\{(v_k^{(t)})^{\top}\widehat{\Sigma}v_k^{(t)}\}$.
       ii. $u^{(t+1)} = v_k^{(t)} + (\eta/\rho^{(t)})\{\widehat{\mathbf{M}}_k - \rho^{(t)}\widehat{\Sigma}\}v_k^{(t)}$.
       iii. $u^{(t+1)} = u^{(t+1)}/\|u^{(t+1)}\|$.
       iv. $v_k^{(t+1)} = \mathrm{TRC}(u^{(t+1)}, s)/\|\mathrm{TRC}(u^{(t+1)}, s)\|$.
   (b) Set $\hat{v}_k = v_k^{(t)}$ at convergence and scale it to obtain $\hat{v}_k^{\top}\widehat{\Sigma}\hat{v}_k = 1$.
   (c) Set $\widehat{\mathbf{M}}_{k+1} = \widehat{\mathbf{M}}_k - (\hat{v}_k^{\top}\widehat{\mathbf{M}}_k\hat{v}_k)\widehat{\Sigma}\hat{v}_k\hat{v}_k^{\top}\widehat{\Sigma}$.
4. Output $\hat{v}_1, \ldots, \hat{v}_K$.
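The sketch below mirrors Algorithm 2 in Python: a gradient step on the generalized Rayleigh quotient, normalization, hard thresholding, and deflation of the sample MDDM. The step size, tolerance, and random initialization are illustrative, and hard_threshold repeats the helper shown after Algorithm 1.

```python
import numpy as np

def hard_threshold(u, s):
    """Keep the s largest entries of u in absolute value; zero out the rest."""
    v = np.zeros_like(u)
    keep = np.argsort(np.abs(u))[-s:]
    v[keep] = u[keep]
    return v

def rifle_mddm(M_hat, Sigma_hat, K, s, eta=0.1, max_iter=500, tol=1e-6, seed=None):
    """Sequentially estimate K sparse generalized eigenvectors of
    (sample MDDM, sample covariance), in the spirit of Algorithm 2."""
    rng = np.random.default_rng(seed)
    p = M_hat.shape[0]
    Mk = M_hat.copy()
    V = np.zeros((p, K))
    for k in range(K):
        v = rng.standard_normal(p)
        v /= np.linalg.norm(v)
        for _ in range(max_iter):
            rho = (v @ Mk @ v) / (v @ Sigma_hat @ v)                # Rayleigh quotient
            u = v + (eta / rho) * (Mk @ v - rho * (Sigma_hat @ v))  # ascent step
            u /= np.linalg.norm(u)
            u = hard_threshold(u, s)
            u /= np.linalg.norm(u)
            if np.linalg.norm(u - v) < tol:
                v = u
                break
            v = u
        v = v / np.sqrt(v @ Sigma_hat @ v)         # scale so that v^T Sigma v = 1
        V[:, k] = v
        lam = v @ Mk @ v
        Sv = Sigma_hat @ v
        Mk = Mk - lam * np.outer(Sv, Sv)           # deflate the MDDM
    return V
```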
5 Theoretical properties
In this section, we consider theoretical properties of the estimated (generalized) eigenvectors of the sample MDDM. Recall that, if we know that $\Sigma$ is proportional to the identity matrix, the generalized eigenvectors reduce to eigenvectors and can be estimated by Algorithm 1. If we do not have any information about $\Sigma$, we can find the generalized eigenvectors with Algorithm 2. Either way, we let $v_1, \ldots, v_K$ be the first $K$ (generalized) eigenvectors of $\mathbf{M} = \mathrm{MDDM}(X\mid Y)$. Throughout the proof, we let $C$ denote a generic constant that can vary from line to line. We show the consistency of the outputs $\hat{v}_1, \ldots, \hat{v}_K$ by proving that each $\hat{v}_k$ converges to $v_k$ (up to sign). We assume that $K$ is fixed, while $p$ may diverge with $n$. Recall that $\lambda_k$ denotes the $k$-th (generalized) eigenvalue. When we study Algorithm 1 or Algorithm 2, we assume that the user-specified sparsity level $s$ is sufficiently large relative to the population sparsity level. To apply the concentration inequality for $\widehat{\mathbf{M}}$, we restate below Condition (C1) in terms of $X$ and $Y$ as Condition (C1'), along with other suitable conditions:
- (C1') There exist two positive constants such that $X$ and $Y$ satisfy the moment bound in (2.2).
- (C2) There exist constants such that the nonzero (generalized) eigenvalues of $\mathbf{M}$ are well separated.
- (C3) There exist constants that do not depend on $p$ such that the eigenvalues of $\Sigma$ are bounded below and above by them.
- (C4) As $n \to \infty$, the dimension $p$ and the sparsity level satisfy the growth restriction discussed below.
Condition (C2) guarantees that the eigenvectors are well-defined. Condition (C3) imposes bounds on the eigenvalues of $\Sigma$; researchers often impose similar assumptions on the covariance matrix to achieve consistent estimation. Condition (C4) restricts the growth rate of $p$ with respect to $n$. Note that the population sparsity level refers to the number of nonzero elements in the $v_k$'s, while $s$ is the user-specified sparsity level in Algorithms 1 and 2. If we fix the sparsity level, the dimension $p$ is allowed to grow at an exponential rate in $n$. When we allow the sparsity level to diverge, we require it to diverge slowly relative to $n$.
We present the non-asymptotic results for Algorithm 1 in the following theorem, where the constants are defined previously in Theorem 1 under Condition (C1).
Theorem 2.
Assume that Conditions (C1’), (C2) & (C3) hold and . Further assume that, there exists , such that for , we have and
(5.10) |
where . Then there exists a positive integer , and a finite positive such that when , we have and for any , with a probability greater than ,
(5.11) |
Theorem 2 directly implies the following asymptotic result, which justifies the consistency of our estimator.
Corollary 1.
Assume that Conditions (C1’), (C2)–(C4) hold. Suppose there exists such that . Under the conditions in Theorem 2, the quantities with a probability tending to 1, for .
Corollary 1 reveals that, without specifying a model, Algorithm 1 can achieve consistency even when $p$ grows at an exponential rate of the sample size $n$. Here the theoretical results are established for the output of Algorithm 1, instead of the solution of the optimization problem (4.7). Note that it is possible to have some gap between the theoretical optimal solution of (4.7) and the estimate we use in practice, because the optimization problem is nonconvex and numerically we might not achieve the global maximum. Thus it might be more meaningful to study the property of the estimate obtained as the output of Algorithm 1. The above theorem guarantees that the estimate we use in practice has the desired theoretical properties.
In the meantime, although our rate in Theorem 2 is not as fast as the rates for sparse sliced inverse regression established very recently by Lin et al. (2018), Lin et al. (2019), and Tan et al. (2020), we have some unique advantages over these proposals. For simplicity, we assume that the sparsity level is fixed in the subsequent discussion. First, both sliced inverse regression methods require estimation of within-slice means rather than the MDDM. As shown in Theorem 1, MDDM converges to its population counterpart at a slower rate than the sample within-slice mean. However, by adopting MDDM, we no longer need to determine the slicing scheme, and we do not encounter the curse-of-dimensionality problem when slicing a multivariate response. Second, Lin et al. (2018) only achieves the optimal rate in a moderate-dimensional regime and cannot handle ultra-high dimensions. In contrast, Algorithm 1 allows $p$ to diverge at an exponential rate of $n$, and is thus more suitable for ultra-high-dimensional data. Third, although Tan et al. (2020) achieves consistency in higher dimensions, they make much more restrictive model assumptions. For example, they assume that $Y$ is categorical and $X$ is normal within each slice of $Y$; they randomly split the dataset to form independent batches to facilitate their proofs, which is not done in their numerical studies. The theoretical properties of their proposal are unclear beyond the (conditionally) Gaussian model and without the sample splitting. In contrast, our method makes no model assumption between $X$ and $Y$, and our theory requires no sample splitting. Thus, our results are more widely applicable, and the rates we obtain seem hard to improve. Also, unlike the theory in Tan et al. (2020), our theoretical result characterizes the exact estimate we use in practice. Moreover, the convergence rate of our method has an additional factor compared to Tan et al. (2020) that grows slowly and thus imposes only a mild restriction on the dimensionality. In this sense, although we cannot handle the optimal dimensionality, the gap is very small.
Next, we further consider the penalized generalized eigen-decomposition in Algorithm 2. We assume that the step size $\eta$ satisfies
(5.12)
where $\lambda_{\max}(\Sigma)$, $\lambda_{\min}(\Sigma)$ and $\kappa(\Sigma)$ are, respectively, the largest eigenvalue, the smallest eigenvalue and the condition number of $\Sigma$. The non-asymptotic results are as follows.
Theorem 3.
Assume that Conditions (C1’), (C2) & (C3) hold. Suppose there exists such that , and there exists a constant such that . Then there exists a positive integer and four finite positive constants , , and such that for any that satisfies and , with a probability greater than , we have , for .
Theorem 3 is proved by first showing that $\widehat{\mathbf{M}}$ and $\widehat{\Sigma}$ are close to their population counterparts, in the sense that the quadratic forms $v^{\top}\widehat{\mathbf{M}}v$ and $v^{\top}\widehat{\Sigma}v$ are close to $v^{\top}\mathbf{M}v$ and $v^{\top}\Sigma v$, respectively, for any $v$ with only a few nonzero elements. Then we use the fact that Algorithm 2 is a generalization of the RIFLE algorithm (Tan, Wang, Liu and Zhang, 2018), and some properties of the latter allow us to establish the consistency of Algorithm 2. By comparison, our proofs are significantly more involved than the one in Tan, Wang, Liu and Zhang (2018), because we have to estimate multiple generalized eigenvectors instead of just the first one. We need to carefully control the error bounds to guarantee that the estimation errors do not accumulate to a higher order beyond the first generalized eigenvector.
Analogous to Corollary 1, we can easily obtain asymptotic consistency results by translating Theorem 3.
Corollary 2.
Assume that Conditions (C1)–(C4) hold. Suppose there exists such that . Under the conditions in Theorem 3, the quantities with a probability tending to 1, for .
Corollary 2 shows that Algorithm 2 produces consistent estimates of the generalized eigenvectors even when $p$ grows at an exponential rate of the sample size $n$, and thus it is suitable for ultra-high-dimensional problems. Similar to Corollary 1, Corollary 2 has no gap between the theory and the numerical outputs, as it is a result concerning the outputs of Algorithm 2. We note that the allowable dimensionality in Corollary 2 is the same as that in Corollary 1. Thus, with a properly chosen step size $\eta$, the penalized generalized eigen-decomposition is intrinsically no more difficult than the penalized eigen-decomposition. But if we know that $\Sigma$ is proportional to the identity matrix, it is still beneficial to exploit such information and use Algorithm 1, because Algorithm 1 does not involve the step size and is more convenient in practice. Also, although Algorithm 2 does not achieve the same rate of convergence as recent sparse sliced inverse regression proposals, it has the same practical and theoretical advantages as Algorithm 1, which we do not repeat.
Finally, we note that our theoretical studies require conditions on the initial value. Specifically, we require the initial value to be non-orthogonal to the truth. This is a common technical condition for iterative algorithms; see Yuan and Zhang (2013); Tan, Wang, Liu and Zhang (2018) for example. Such conditions do not seem critical for our algorithms to work in practice. In our numerical studies to be presented, we use randomly generated initial values, and the performance of our methods appears to be competitive.
6 Numerical studies
6.1 Simulations
We compare our slicing-free approaches to state-of-the-art high-dimensional extensions of sliced inverse regression. Both univariate and multivariate response settings are considered. Specifically, for univariate response simulations, we include Rifle-SIR (Tan, Wang, Liu and Zhang, 2018) and Lasso-SIR (Lin et al., 2019) as the two main competitors; for multivariate response simulations, we mainly compare our method with the projective resampling approach to SIR (PR-SIR, Li et al., 2008), a computationally expensive method that repeatedly projects the multivariate response onto one-dimensional subspaces. For Rifle-SIR, we adopt the Rifle algorithm to estimate the leading eigenvector of the slicing-based sample matrix. In addition, we include the oracle-SIR as a benchmark, where we perform SIR on the subset of truly relevant variables (hence a low-dimensional estimation problem). For all the SIR-based methods, we include two different slicing schemes by setting the number of slices to 3 and 10, where 3 is the minimal number of slices required to recover our two-dimensional central subspace and 10 is a typical choice in the literature. To evaluate the performance of these SDR methods, we use the subspace estimation error $\|P_{\widehat{B}} - P_{B}\|_{F}/\sqrt{2d}$, where $\widehat{B}$ and $B$ are the estimated and the true basis matrices of the central subspace, $P_{\widehat{B}}$ and $P_{B}$ are the corresponding projection matrices, and $d$ is the dimension of the central subspace. This subspace estimation error is always between 0 and 1 (reported below in percentages), and a smaller value indicates a better estimation.
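A small sketch of the subspace estimation error used in our comparisons (our implementation of the projection-matrix distance described above; the normalization keeps the error in [0, 1], and it is multiplied by 100 in the tables):

```python
import numpy as np

def subspace_error(B_hat, B):
    """Distance between span(B_hat) and span(B) through their projection
    matrices, normalized to lie between 0 and 1."""
    def proj(A):
        Q, _ = np.linalg.qr(A)          # orthonormal basis of span(A)
        return Q @ Q.T
    d = B.shape[1]
    return np.linalg.norm(proj(B_hat) - proj(B), "fro") / np.sqrt(2 * d)
```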
First, we consider six models for univariate-response regression: two single index models, three multiple index models that are widely used in the SDR literature, and one isotropic PFC model.
The sparse directions in the central subspace are orthogonal: we let a small set of leading elements in one direction and a disjoint set of elements in the other direction be nonzero, while all other elements are zero. For the five forward regression models, we consider both an independent predictor setting and a correlated predictor setting with auto-regressive correlation among the predictors. For each model setting, we vary the sample size and predictor dimension and simulate 1000 independent data sets.
For our method, we applied the generalized eigen-decomposition algorithm (Algorithm 2) in all six models (even when the covariance is the identity matrix). In the single index models, we use random initialization (the initial value is generated from a $p$-dimensional standard normal distribution) for our algorithm and Rifle-SIR to demonstrate robustness to initialization, and the step size in Algorithm 2 is simply held fixed. For the more challenging multiple index models, we consider the best-case scenario for each method: the true parameter is used as the initial value and the optimal tuning parameter is selected from a separate training sample with 400 observations. The results based on 1000 replications for one combination of the sample size and dimension are summarized in Table 1, while the rest of the results can be found in the Supplemental Materials. Overall, the slicing-free MDDM approach is much more accurate than existing SIR-based methods and is almost as accurate as the oracle-SIR. Moreover, it is clear that SIR-type methods are rather sensitive to the choice of the number of slices.
MDDM | Oracle-SIR(3) | Oracle-SIR(10) | Rifle-SIR(3) | Rifle-SIR(10) | LassoSIR(3) | LassoSIR(10) | |||||||||
Error | SE | Error | SE | Error | SE | Error | SE | Error | SE | Error | SE | Error | SE | ||
10.1 | 0.1 | 12.5 | 0.1 | 10.3 | 0.1 | 25.2 | 1.0 | 53.7 | 1.4 | 37.9 | 0.4 | 59.9 | 0.7 | ||
10.3 | 0.1 | 13.1 | 0.1 | 10.6 | 0.1 | 26.1 | 1.0 | 54.7 | 1.4 | 40.1 | 0.4 | 61.5 | 0.7 | ||
17.7 | 0.2 | 40.8 | 0.2 | 27.7 | 0.2 | 71.3 | 0.0 | 71.2 | 0.0 | 76.5 | 0.2 | 85.0 | 0.2 | ||
23.0 | 0.2 | 45.8 | 0.3 | 36.4 | 0.3 | 71.9 | 0.0 | 71.6 | 0.0 | 85.2 | 0.2 | 91.5 | 0.2 | ||
30.8 | 0.6 | 28.8 | 0.2 | 22.1 | 0.1 | 71.6 | 0.0 | 71.2 | 0.0 | 71.2 | 0.3 | 81.3 | 0.3 | ||
AR | 18.7 | 0.3 | 21.0 | 0.2 | 17.6 | 0.2 | 34.7 | 0.8 | 39.8 | 1.1 | 35.3 | 0.3 | 35.5 | 0.3 | |
14.2 | 0.2 | 20.7 | 0.2 | 14.8 | 0.2 | 33.1 | 0.7 | 33.6 | 1.1 | 34.6 | 0.3 | 30.5 | 0.3 | ||
25.2 | 0.3 | 44.6 | 0.2 | 34.1 | 0.2 | 71.5 | 0.0 | 71.3 | 0.0 | 54.8 | 0.2 | 47.1 | 0.3 | ||
59.1 | 0.5 | 75.1 | 0.2 | 69.9 | 0.3 | 81.0 | 0.2 | 78.7 | 0.2 | 89.7 | 0.2 | 92.1 | 0.2 | ||
46.2 | 0.6 | 46.4 | 0.2 | 35.5 | 0.2 | 73.8 | 0.1 | 72.4 | 0.0 | 66.5 | 0.2 | 61.4 | 0.3 | ||
PFC | 34.6 | 0.6 | 48.9 | 0.5 | 33.4 | 0.5 | 40.1 | 0.7 | 30.8 | 0.6 | 70.7 | 0.0 | 70.7 | 0.0 |
Next, we further consider the following three multivariate-response models: a multivariate linear model, a single-index heteroscedastic error model, and an isotropic PFC model. The predictors have identity covariance in the first two (forward regression) models; therefore, we applied Algorithm 1 for our method under these two models. For the isotropic PFC model, we still apply Algorithm 2 to be consistent with the univariate-response analysis. For the projective resampling methods, PR-SIR and PR-Oracle-SIR, we generated a sufficiently large number of random projections so that the PR methods reach their fullest potential.
- (Multivariate linear model) The errors are independent standard normal except for certain components. For this model, the central subspace is spanned by two directions.
- (Single-index heteroscedastic error model) The errors are independent standard normal except for certain components. For this model, the central subspace is one-dimensional; note that marginally each response is independent of the index.
- (Isotropic PFC model) The central subspace is spanned by the loading directions of the latent component.
Again we considered various sample size and predictor dimension setups, each with 1000 replicates. We summarize the subspace estimation errors below, with the remaining results gathered in Table S6 in the supplement. It is clear that the proposed MDDM approach is much better than PR-SIR, and it also improves much faster than PR-SIR as the sample size increases. The MDDM method performed better in inverse regression models, such as the isotropic PFC model, than in forward regression models, such as the linear model and the index models. This finding is more apparent in the multivariate response simulations than in the univariate response simulations. This is expected, as MDDM directly targets the inverse regression subspace, which is more directly driven by the response in the isotropic PFC models.
Error | SE | Error | SE | Error | SE | Error | SE | Error | SE | Error | SE | Error | SE | Error | SE | Error | SE | ||
MDDM | 37.1 | 0.5 | 39.8 | 0.5 | 42.5 | 0.5 | 24.0 | 0.4 | 25.3 | 0.4 | 26.9 | 0.4 | 16.1 | 0.3 | 17.3 | 0.3 | 18.6 | 0.3 | |
PR-Oracle-SIR(3) | 12.6 | 0.2 | 12.2 | 0.2 | 12.0 | 0.2 | 8.8 | 0.1 | 8.5 | 0.1 | 8.7 | 0.1 | 5.9 | 0.1 | 5.8 | 0.1 | 5.8 | 0.1 | |
PR-Oracle-SIR(10) | 16.2 | 0.3 | 15.7 | 0.3 | 15.6 | 0.3 | 9.6 | 0.2 | 9.4 | 0.2 | 95.2 | 0.2 | 6.0 | 0.1 | 6.0 | 0.1 | 6.0 | 0.1 | |
PR-SIR(3) | 79.9 | 0.1 | 88.2 | 0.1 | 93.5 | 0.0 | 67.9 | 0.1 | 79.3 | 0.1 | 87.8 | 0.0 | 54.6 | 0.1 | 67.6 | 0.1 | 79.0 | 0.0 | |
PR-SIR(10) | 83.5 | 0.1 | 90.6 | 0.1 | 94.9 | 0.0 | 70.1 | 0.1 | 81.6 | 0.1 | 90.1 | 0.1 | 55.3 | 0.1 | 68.2 | 0.1 | 80.2 | 0.1 | |
MDDM | 79.4 | 0.9 | 85.8 | 0.8 | 90.0 | 0.7 | 55.9 | 1.2 | 61.0 | 1.2 | 68.4 | 1.2 | 27.1 | 0.9 | 30.3 | 1.0 | 31.0 | 1.0 | |
PR-Oracle-SIR(3) | 40.9 | 0.9 | 41.3 | 0.9 | 41.4 | 0.9 | 26.0 | 0.7 | 24.9 | 0.7 | 25.0 | 0.6 | 14.9 | 0.4 | 14.9 | 0.4 | 15.0 | 0.4 | |
PR-Oracle-SIR(10) | 44.1 | 0.9 | 43.8 | 0.9 | 43.5 | 0.9 | 25.1 | 0.6 | 23.7 | 0.6 | 24.1 | 0.6 | 13.1 | 0.3 | 13.0 | 0.3 | 13.2 | 0.3 | |
PR-SIR(3) | 99.3 | 0.0 | 99.7 | 0.0 | 99.8 | 0.0 | 99.2 | 0.0 | 99.7 | 0.0 | 99.8 | 0.0 | 98.8 | 0.0 | 99.6 | 0.0 | 99.8 | 0.0 | |
PR-SIR(10) | 99.3 | 0.0 | 99.7 | 0.0 | 99.9 | 0.0 | 99.1 | 0.0 | 99.6 | 0.0 | 99.8 | 0.0 | 98.4 | 0.1 | 99.6 | 0.0 | 99.8 | 0.0 | |
MDDM | 15.3 | 0.3 | 15.4 | 0.3 | 15.7 | 0.3 | 9.9 | 0.1 | 10.1 | 0.1 | 10.0 | 0.1 | 7.1 | 0.1 | 7.2 | 0.1 | 7.1 | 0.1 | |
PR-Oracle-SIR(3) | 15.2 | 0.2 | 15.2 | 0.2 | 14.9 | 0.2 | 10.5 | 0.1 | 10.6 | 0.1 | 10.5 | 0.1 | 7.5 | 0.1 | 7.6 | 0.1 | 7.4 | 0.1 | |
PR-Oracle-SIR(10) | 13.8 | 0.2 | 13.9 | 0.2 | 13.6 | 0.2 | 9.4 | 0.1 | 9.7 | 0.1 | 9.6 | 0.1 | 6.8 | 0.1 | 6.8 | 0.1 | 6.7 | 0.1 | |
PR-SIR(3) | 58.5 | 0.2 | 72.3 | 0.2 | 84.0 | 0.2 | 44.6 | 0.1 | 58.2 | 0.1 | 71.4 | 0.1 | 33.1 | 0.1 | 44.6 | 0.1 | 57.9 | 0.1 | |
PR-SIR(10) | 54.8 | 0.2 | 68.5 | 0.2 | 80.6 | 0.2 | 41.1 | 0.2 | 54.3 | 0.2 | 67.7 | 0.2 | 30.2 | 0.1 | 41.0 | 0.1 | 54.2 | 0.1 |
6.2 Real Data Illustration
[Figure 1 about here: quantile-quantile plots of the percentiles of squared prediction errors, comparing MDDM and MDDM-ID with Lasso-SIR across the drug responses.]
[Figure 2 about here: side-by-side boxplots of the prediction errors averaged over all response variables.]
In this section we use our method to analyze the NCI-60 data set (Shoemaker, 2006), which contains microRNA expression profiles and cancer drug activity measurements on the NCI-60 cell lines. The multivariate response consists of the activity levels of 15 cancer drugs, the predictors are the expression levels of the microRNAs, and the sample size is small relative to the number of predictors.
First, we examine the predictive performance of our method by random training-testing sample splits; each time we randomly pick a subset of observations to form the test set. We consider five reduced directions for all methods. For MDDM, we include both the eigen-decomposition (Algorithm 1) and the generalized eigen-decomposition (Algorithm 2). To distinguish the two versions of MDDM, we write "MDDM-ID" for the eigen-decomposition approach because it implicitly assumes that the covariance of $X$ or the conditional covariance of $X$ given $Y$ is a constant times the identity matrix. We use random initial values and choose the sparsity level in the way described in Section S2 of the Supplementary Materials. Then the five reduced predictors $\hat{v}_k^{\top}X$, $k = 1, \ldots, 5$, are fed into a generalized additive model for each drug. Finally, we evaluate the mean squared prediction error on the test sample. Rifle-SIR can only estimate a one-dimensional subspace, which did not yield accurate prediction in this data set; hence, for comparison, we compute five leading directions from Lasso-SIR. Three percentiles of the squared prediction errors for each of the 15 responses are obtained for all three methods, and we construct quantile-quantile plots in Figure 1. The red line is the 45-degree line, and the black dashed line is a simple linear regression fit of the results indicated by the y-axis label against those indicated by the x-axis label. Clearly, for all the quantiles and for all the response variables, the MDDM results (MDDM or MDDM-ID) are better than Lasso-SIR in terms of prediction. In addition, we construct side-by-side boxplots of the prediction error averaged over all response variables in Figure 2 to evaluate the overall improvement. Interestingly, MDDM-ID is slightly better than MDDM. This is likely due to the small sample size: with a small training sample, the sample covariance of the predictors is difficult to estimate accurately. We further include additional real data analysis results in Section S4 of the Supplementary Materials.
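A sketch of one training-testing split of this analysis is given below, reusing sample_mddm and rifle_mddm from the earlier sketches; the test-set size, the number of directions, the sparsity level, and the plain linear fit standing in for the per-drug generalized additive model are all our illustrative choices.

```python
import numpy as np

def evaluate_split(X, Y, K=5, s=30, n_test=10, seed=None):
    """One random training/testing split: estimate K MDDM directions on the
    training data, then predict each response from the K reduced predictors.
    A linear fit is used as a stand-in for the per-drug GAM."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    test = rng.choice(n, size=n_test, replace=False)
    train = np.setdiff1d(np.arange(n), test)

    M_hat = sample_mddm(X[train], Y[train])        # sketch from Section 2
    Sigma_hat = np.cov(X[train], rowvar=False)
    V = rifle_mddm(M_hat, Sigma_hat, K=K, s=s)     # sketch from Section 4.2

    Z_tr = np.column_stack([np.ones(len(train)), X[train] @ V])
    Z_te = np.column_stack([np.ones(len(test)), X[test] @ V])
    errors = []
    for j in range(Y.shape[1]):                    # one fit per drug response
        coef, *_ = np.linalg.lstsq(Z_tr, Y[train, j], rcond=None)
        errors.append(np.mean((Y[test, j] - Z_te @ coef) ** 2))
    return np.asarray(errors)                      # squared prediction errors
```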
7 Discussion
In this paper, we propose a slicing-free high-dimensional SDR method based on a penalized eigen-decomposition of the sample MDDM. Our proposal is motivated by the usefulness of MDDM for dimension reduction and yields a relatively straightforward implementation in view of the recently developed RIFLE algorithm (Tan, Wang, Liu and Zhang, 2018), by simply replacing the slicing-based estimator with the sample MDDM. Our methodology and implementation involve no slicing and treat the univariate and multivariate response in a unified fashion. Both theoretical support and finite-sample investigations provide convincing evidence that MDDM is a very competitive alternative to SIR and may be used routinely as a surrogate for SIR-based estimators in many related sufficient dimension reduction problems.
As with most SDR methods, our proposal requires the linearity condition, the violation of which can make SDR very challenging to tackle. It seems that existing proposals that relax the linearity condition are often practically difficult due to excessive computational costs, and cannot be easily extended to high dimensions (Cook and Nachtsheim, 1994; Ma and Zhu, 2012). One potentially useful approach is to transform data before SDR to alleviate obvious violation of the linearity assumption (Mai and Zou, 2015). But an in-depth study along this line is beyond the scope of the current paper. In addition, we observe from our simulation studies that RIFLE requires the choice of several tuning parameters, such as the step size and the initial value, and the optimization error could depend on these tuning parameters in a nontrivial way. Further investigation on the optimization error and data-driven choice for these tuning parameters would be desirable and is left for future research.
As pointed out by a referee, many SDR methods beyond SIR involve slicing. It will be interesting to study how to perform them in a slicing-free fashion as well. For example, Cook and Weisberg (1991) perform dimension reduction by estimating the conditional covariance of $X$ given the response, while Yin and Cook (2003) consider the conditional third moment. These methods slice the response to estimate the conditional moments. In the future, one can develop slicing-free methods to estimate these higher-order moments and conduct SDR.
Supplementary Materials
Supplement to “Slicing-free Inverse Regression in High-dimensional Sufficient Dimension Reduction”. In the supplement, we present additional simulation results and proofs.
Acknowledgements
The authors are grateful to the Editor, Associate Editor and anonymous referees, whose suggestions led to great improvement of this work. The authors contributed equally to this work and are listed in alphabetical order. Mai and Zhang's research in this article is supported in part by National Science Foundation grant CCF-1908969. Shao's research in this article is supported in part by National Science Foundation grant DMS-1607489.
References
- Bura and Cook (2001) Bura, E. and Cook, R. D. (2001), ‘Extending sliced inverse regression: The weighted chi-squared test’, Journal of the American Statistical Association 96, 996–1003.
- Bura and Yang (2011) Bura, E. and Yang, J. (2011), ‘Dimension estimation in sufficient dimension reduction: a unifying approach’, Journal of Multivariate analysis 102(1), 130–142.
- Cai et al. (2013) Cai, T. T., Ma, Z. and Wu, Y. (2013), ‘Sparse pca: Optimal rates and adaptive estimation’, Annals of Statistics 41, 3074–3110.
- Chen et al. (2010) Chen, X., Zou, C., Cook, R. D. et al. (2010), ‘Coordinate-independent sparse sufficient dimension reduction and variable selection’, The Annals of Statistics 38(6), 3696–3723.
- Chiaromonte et al. (2002) Chiaromonte, F., Cook, R. D. and Li, B. (2002), 'Sufficient dimension reduction in regressions with categorical predictors', The Annals of Statistics 30, 475–497.
- Cook (2007) Cook, R. D. (2007), ‘Fisher lecture: Dimension reduction in regression (with discussion)’, Statistical Science 22, 1–26.
- Cook and Forzani (2009) Cook, R. D. and Forzani, L. (2009), 'Principal fitted components for dimension reduction in regression', Statistical Science 23, 485–501.
- Cook and Nachtsheim (1994) Cook, R. D. and Nachtsheim, C. J. (1994), ‘Reweighting to achieve elliptically contoured covariates in regression’, Journal of the American Statistical Association 89(426), 592–599.
- Cook and Ni (2005) Cook, R. D. and Ni, L. (2005), ‘Sufficient dimension reduction via inverse regression: A minimum discrepancy approach’, Journal of the American Statistical Association 100(470), 410–428.
- Cook and Weisberg (1991) Cook, R. D. and Weisberg, S. (1991), ‘Comment on “sliced inverse regression for dimension reduction”’, Journal of American Statistical Association 86, 328–332.
- Cook and Zhang (2014) Cook, R. D. and Zhang, X. (2014), ‘Fused estimators of the central subspace in sufficient dimension reduction’, J. Amer. Statist. Assoc. 109, 815–827.
- Cruz-Cano and Lee (2014) Cruz-Cano, R. and Lee, M.-L. T. (2014), ‘Fast regularized canonical correlation analysis’, Computational Statistics & Data Analysis 70, 88–100.
- Hsing and Carroll (1992) Hsing, T. and Carroll, R. (1992), ‘An asymptotic theory for sliced inverse regression’, Ann. Statist 20, 1040–1061.
- Huo and Székely (2016) Huo, X. and Székely, G. J. (2016), ‘Fast computing for distance covariance’, Technometrics 58(4), 435–447.
- Kim et al. (2020) Kim, K., Li, B., Yu, Z., Li, L. et al. (2020), ‘On post dimension reduction statistical inference’, Annals of Statistics 48(3), 1567–1592.
- Lee and Shao (2018) Lee, C. E. and Shao, X. (2018), ‘Martingale difference divergence matrix and its application to dimension reduction for stationary multivariate time series’, J. Amer. Statist. Assoc. 113, 216–229.
- Li (2018) Li, B. (2018), Sufficient Dimension Reduction, Methods and Applications with R, CRC Press.
- Li and Wang (2007) Li, B. and Wang, S. (2007), ‘On directional regression for dimension reduction’, Journal of the American Statistical Association 102, 997–1008.
- Li et al. (2008) Li, B., Wen, S. and Zhu, L. (2008), ‘On a projective resampling method for dimension reduction with multivariate responses’, Journal of the American Statistical Association 103(483), 1177–1186.
- Li (1991) Li, K. C. (1991), ‘Sliced inverse regression for dimension reduction’, Journal of the American Statistical Association 86, 316–327.
- Li (2007) Li, L. (2007), ‘Sparse sufficient dimension reduction’, Biometrika 94(3), 603–613.
- Lin et al. (2020) Lin, Q., Li, X., Huang, D. and Liu, J. (2020), ‘On the optimality of sliced inverse regression in high dimensions’, Annals of Statistics, to appear .
- Lin et al. (2018) Lin, Q., Zhao, Z. and Liu, J. (2018), ‘On consistency and sparsity for sliced inverse regression in high dimension’, Annals of Statistics 46(2), 580–610.
- Lin et al. (2019) Lin, Q., Zhao, Z. and Liu, J. S. (2019), ‘Sparse sliced inverse regression via lasso’, Journal of the American Statistical Association 114, 1726–1739.
- Luo and Li (2016) Luo, W. and Li, B. (2016), ‘Combining eigenvalues and variation of eigenvectors for order determination’, Biometrika 103(4), 875–887.
- Ma and Zhu (2012) Ma, Y. and Zhu, L. (2012), ‘A semiparametric approach to dimension reduction’, Journal of the American Statistical Association 107(497), 168–179.
- Mai and Zhang (2019) Mai, Q. and Zhang, X. (2019), ‘An iterative penalized least squares approach to sparse canonical correlation analysis’, Biometrics 75(3), 734–744.
- Mai and Zou (2015) Mai, Q. and Zou, H. (2015), ‘Nonparametric variable transformation in sufficient dimension reduction’, Technometrics 57(1), 1–10.
- McKeague and Zhang (2020) McKeague, I. W. and Zhang, X. (2020), ‘Significance testing for canonical correlation analysis in high dimensions’, arXiv preprint arXiv:2010.08673 .
- Serfling (1980) Serfling, R. J. (1980), Approximation Theorems of Mathematical Statistics, Wiley Series in Probability and Mathematical Statistics.
- Shao and Zhang (2014) Shao, X. and Zhang, J. (2014), ‘Martingale difference correlation and its use in high dimensional variable screening’, J. Amer. Statist. Assoc. 109, 1302–1318.
- Shoemaker (2006) Shoemaker, R. H. (2006), ‘The nci60 human tumour cell line anticancer drug screen’, Nature Reviews Cancer 6(10), 813–823.
- Székely et al. (2007) Székely, G. J., Rizzo, M. L. and Bakirov, N. K. (2007), ‘Measuring and testing dependence by correlation of distances’, Annals of Statistics 35, 2769–2794.
- Tan, Wang, Zhang, Liu and Cook (2018) Tan, K. M., Wang, Z., Zhang, T., Liu, H. and Cook, R. D. (2018), ‘A convex formulation for high-dimensional sparse sliced inverse regression’, Biometrika 105(4), 769–782.
- Tan et al. (2020) Tan, K., Shi, L. and Yu, Z. (2020), ‘Sparse sir: Optimal rates and adaptive estimation’, The Annals of Statistics 48(1), 64–85.
- Tan, Wang, Liu and Zhang (2018) Tan, K., Wang, Z., Liu, H. and Zhang, T. (2018), ‘Sparse generalized eigenvalue problem: Optimal statistical rates via truncated rayleigh flow’, Journal of the Royal Statistical Society: Series B 80(5), 1057–1086.
- Vershynin (2018) Vershynin, R. (2018), High-dimensional probability: An introduction with applications in data science, Vol. 47, Cambridge university press.
- Vu and Lei (2013) Vu, V. Q. and Lei, J. (2013), ‘Minimax sparse principal subspace estimation in high dimensions’, Annals of Statistics 41, 2905–2947.
- Witten et al. (2009) Witten, D. M., Tibshirani, R. and Hastie, T. (2009), ‘A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis’, Biostatistics 10(3), 515–534.
- Yin and Cook (2003) Yin, X. and Cook, R. D. (2003), ‘Estimating central subspaces via inverse third moments’, Biometrika 90, 113–125.
- Yuan and Zhang (2013) Yuan, X.-T. and Zhang, T. (2013), ‘Truncated power method for sparse eigenvalue problems’, Journal of Machine Learning Research 14(Apr), 899–925.
- Zhang et al. (2020) Zhang, X., Lee, C. E. and Shao, X. (2020), ‘Envelopes in multivariate regression models with nonlinearity and heteroscedasticity’, Biometrika 107, 965–981.
- Zhou and He (2008) Zhou, J. and He, X. (2008), ‘Dimension reduction based on constrained canonical correlation and variable filtering’, The Annals of Statistics 36, 1649–1668.
- Zhu and Ng (1995) Zhu, L. and Ng, K. W. (1995), ‘Asymptotics of sliced inverse regression’, Statist. Sinica 5, 727–736.
- Zhu et al. (2010) Zhu, L.-P., Zhu, L.-X. and Feng, Z.-H. (2010), ‘Dimension reduction in regressions through cumulative slicing estimation’, Journal of the American Statistical Association 105(492), 1455–1466.
- Zou et al. (2006) Zou, H., Hastie, T. and Tibshirani, R. (2006), ‘Sparse principal component analysis’, Journal of computational and graphical statistics 15(2), 265–286.
Qing Mai, Florida State University E-mail: [email protected]
Xiaofeng Shao, University of Illinois at Urbana-Champaign E-mail: [email protected]
Runmin Wang, Texas A&M University E-mail: [email protected]
Xin Zhang, Florida State University E-mail: [email protected]
Supplementary Materials for “Slicing-free Inverse Regression in High-Dimensional Sufficient Dimension Reduction”
In the supplement, we present some additional simulation results in Section S1, additional real data analysis results in Section S2, some discussion of computational complexity in Section S3, proofs of Proposition 2 & Lemma 1 in Section S4, proofs of Theorem 1 in Section S5, proofs of Theorem 2 & 3 in Section S6.
S1 Additional Simulation Results
In this section, we shall present all simulation results for models described in Section 6, except the ones which have already been presented in the main paper. Specifically, Table S1 and S2 contain the results for single index models ( and ), Table S3 and S4 contain results for multiple index models ( - ). Results for PFC model (univariate response) are summarized in Table S5, and we gather the rest of results for models with multivariate response variables (-) in Table S6.
The patterns are similar to those presented in the main paper. Overall, the newly proposed method outperforms the competitors in most scenarios, especially when the dimension is much larger than the sample size (i.e., the high-dimensional setting). We also observe that the SIR-based methods are rather sensitive to the choice of the number of slices, whereas our method is slicing-free and thus easier to use in practice.
MDDM | Oracle-SIR(3) | Oracle-SIR(10) | Rifle-SIR(3) | Rifle-SIR(10) | LassoSIR(3) | LassoSIR(10) | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Error | SE | Error | SE | Error | SE | Error | SE | Error | SE | Error | SE | Error | SE | |||
n = 200 | p = 200 | 10.3 | 0.1 | 12.6 | 0.1 | 10.5 | 0.1 | 16.4 | 0.6 | 30.0 | 1.2 | 28.5 | 0.2 | 30.6 | 0.3 | |
p = 500 | 10.4 | 0.1 | 12.7 | 0.1 | 10.6 | 0.1 | 22.1 | 0.9 | 47.1 | 1.4 | 34.6 | 0.3 | 47.7 | 0.6 | ||
p = 800 | 10.1 | 0.1 | 12.5 | 0.1 | 10.3 | 0.1 | 25.2 | 1.0 | 53.7 | 1.4 | 37.9 | 0.4 | 59.9 | 0.7 | ||
p = 1200 | 10.0 | 0.1 | 12.4 | 0.1 | 10.3 | 0.1 | 27.5 | 1.0 | 54.3 | 1.4 | 42.4 | 0.5 | 71.4 | 0.7 | ||
p = 2000 | 10.1 | 0.1 | 12.6 | 0.1 | 10.5 | 0.1 | 29.7 | 1.1 | 63.4 | 1.4 | 48.4 | 0.6 | 81.7 | 0.6 | ||
n = 500 | p = 200 | 6.3 | 0.1 | 7.6 | 0.1 | 6.3 | 0.1 | 8.9 | 0.4 | 12.1 | 0.7 | 15.0 | 0.1 | 12.8 | 0.1 | |
p = 500 | 6.4 | 0.1 | 7.9 | 0.1 | 6.5 | 0.1 | 11.8 | 0.6 | 21.6 | 1.1 | 16.3 | 0.1 | 14.6 | 0.1 | ||
p = 800 | 6.2 | 0.1 | 7.7 | 0.1 | 6.4 | 0.1 | 13.2 | 0.7 | 26.1 | 1.2 | 16.6 | 0.1 | 16.2 | 0.2 | ||
p = 1200 | 6.4 | 0.1 | 7.6 | 0.1 | 6.4 | 0.1 | 13.4 | 0.7 | 30.3 | 1.3 | 17.3 | 0.2 | 17.9 | 0.2 | ||
p = 2000 | 6.3 | 0.1 | 7.7 | 0.1 | 6.3 | 0.1 | 14.9 | 0.8 | 32.9 | 1.3 | 18.3 | 0.2 | 21.8 | 0.3 | ||
n =800 | p = 200 | 5.0 | 0.1 | 6.0 | 0.1 | 5.0 | 0.1 | 7.4 | 0.4 | 8.7 | 0.6 | 11.1 | 0.1 | 9.3 | 0.1 | |
p = 500 | 5.2 | 0.1 | 6.1 | 0.1 | 5.1 | 0.1 | 8.3 | 0.4 | 12.9 | 0.8 | 11.9 | 0.1 | 10.1 | 0.1 | ||
p = 800 | 5.1 | 0.1 | 6.1 | 0.1 | 5.1 | 0.1 | 9.5 | 0.6 | 20.1 | 1.1 | 12.4 | 0.1 | 11.1 | 0.1 | ||
p = 1200 | 5.1 | 0.1 | 6.1 | 0.1 | 5.1 | 0.1 | 9.5 | 0.6 | 20.1 | 1.1 | 12.4 | 0.1 | 11.1 | 0.1 | ||
p = 2000 | 4.9 | 0.1 | 6.1 | 0.1 | 5.0 | 0.1 | 10.6 | 0.6 | 23.1 | 1.2 | 12.8 | 0.1 | 12.2 | 0.1 | ||
n = 200 | p = 200 | 10.4 | 0.1 | 13.1 | 0.1 | 10.7 | 0.1 | 17.5 | 0.6 | 29.7 | 1.2 | 30.3 | 0.2 | 31.6 | 0.3 | |
p = 500 | 10.6 | 0.1 | 13.3 | 0.1 | 10.8 | 0.1 | 23.8 | 0.9 | 48.9 | 1.4 | 36.7 | 0.3 | 49.9 | 0.6 | ||
p = 800 | 10.3 | 0.1 | 13.1 | 0.1 | 10.6 | 0.1 | 26.1 | 1.0 | 54.7 | 1.4 | 40.1 | 0.4 | 61.5 | 0.7 | ||
p = 1200 | 54.7 | 0.8 | 12.9 | 0.1 | 10.4 | 0.1 | 74.4 | 0.8 | 95.1 | 0.4 | 45.0 | 0.5 | 71.4 | 0.7 | ||
p = 2000 | 55.3 | 0.8 | 13.1 | 0.1 | 10.6 | 0.1 | 76.5 | 0.7 | 96.8 | 0.3 | 51.2 | 0.6 | 82.7 | 0.6 | ||
n = 500 | p = 200 | 6.4 | 0.1 | 8.0 | 0.1 | 6.5 | 0.1 | 9.7 | 0.4 | 12.0 | 0.7 | 15.8 | 0.1 | 13.4 | 0.1 | |
p = 500 | 6.7 | 0.1 | 8.2 | 0.1 | 6.6 | 0.1 | 12.4 | 0.6 | 21.2 | 1.1 | 17.1 | 0.1 | 15.1 | 0.1 | ||
p = 800 | 6.5 | 0.1 | 8.0 | 0.1 | 6.5 | 0.1 | 13.6 | 0.7 | 25.1 | 1.2 | 17.6 | 0.2 | 16.7 | 0.2 | ||
p = 1200 | 17.1 | 0.7 | 8.0 | 0.1 | 6.5 | 0.1 | 37.4 | 1.1 | 74.6 | 1.0 | 18.2 | 0.2 | 18.3 | 0.2 | ||
p = 2000 | 16.9 | 0.8 | 8.0 | 0.1 | 6.5 | 0.1 | 38.3 | 1.2 | 77.4 | 0.9 | 19.0 | 0.2 | 22.2 | 0.3 | ||
n =800 | p = 200 | 5.1 | 0.1 | 6.3 | 0.1 | 5.1 | 0.1 | 6.9 | 0.3 | 8.3 | 0.5 | 11.6 | 0.1 | 9.6 | 0.1 | |
p = 500 | 5.3 | 0.1 | 6.4 | 0.1 | 5.2 | 0.1 | 7.8 | 0.4 | 13.1 | 0.8 | 12.5 | 0.1 | 10.5 | 0.1 | ||
p = 800 | 5.0 | 0.1 | 6.3 | 0.1 | 5.1 | 0.1 | 8.4 | 0.4 | 16.9 | 1.0 | 12.6 | 0.1 | 10.8 | 0.1 | ||
p = 1200 | 10.8 | 0.6 | 6.4 | 0.1 | 5.2 | 0.1 | 23.9 | 1.0 | 53.9 | 1.3 | 13.1 | 0.1 | 11.4 | 0.1 | ||
p = 2000 | 11.3 | 0.6 | 6.3 | 0.1 | 5.1 | 0.1 | 26.6 | 1.1 | 57.6 | 1.2 | 13.6 | 0.1 | 12.4 | 0.1 |
MDDM | Oracle-SIR(3) | Oracle-SIR(10) | Rifle-SIR(3) | Rifle-SIR(10) | LassoSIR(3) | LassoSIR(10) | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Error | SE | Error | SE | Error | SE | Error | SE | Error | SE | Error | SE | Error | SE | |||
n = 200 | p = 200 | 17.6 | 0.2 | 20.8 | 0.2 | 17.8 | 0.2 | 26.2 | 0.5 | 25.3 | 0.7 | 32.1 | 0.3 | 28.4 | 0.3 | |
p = 500 | 18.3 | 0.3 | 21.1 | 0.2 | 18.0 | 0.2 | 32.9 | 0.7 | 35.0 | 1.0 | 34.5 | 0.3 | 33.1 | 0.3 | ||
p = 800 | 18.7 | 0.3 | 21.0 | 0.2 | 17.6 | 0.2 | 34.7 | 0.8 | 39.8 | 1.1 | 35.3 | 0.3 | 35.5 | 0.3 | ||
p = 1200 | 26.1 | 0.6 | 21.3 | 0.2 | 18.0 | 0.2 | 47.0 | 1.0 | 81.2 | 0.7 | 36.7 | 0.3 | 41.2 | 0.4 | ||
p = 2000 | 26.3 | 0.7 | 21.0 | 0.2 | 17.7 | 0.2 | 47.8 | 1.0 | 81.4 | 0.7 | 39.1 | 0.4 | 49.6 | 0.6 | ||
n = 500 | p = 200 | 10.8 | 0.1 | 13.2 | 0.1 | 11.0 | 0.1 | 13.7 | 0.2 | 12.7 | 0.4 | 18.6 | 0.2 | 15.3 | 0.1 | |
p = 500 | 11.0 | 0.1 | 13.3 | 0.1 | 11.2 | 0.1 | 14.3 | 0.3 | 15.7 | 0.6 | 19.5 | 0.2 | 16.5 | 0.2 | ||
p = 800 | 10.9 | 0.1 | 13.5 | 0.1 | 11.1 | 0.1 | 15.1 | 0.4 | 18.4 | 0.8 | 20.1 | 0.2 | 17.1 | 0.2 | ||
p = 1200 | 14.2 | 0.5 | 13.4 | 0.1 | 11.1 | 0.1 | 23.1 | 0.8 | 49.9 | 1.1 | 20.3 | 0.2 | 17.8 | 0.2 | ||
p = 2000 | 13.6 | 0.5 | 13.5 | 0.1 | 11.2 | 0.1 | 26.9 | 0.9 | 53.8 | 1.2 | 20.7 | 0.2 | 18.6 | 0.2 | ||
n =800 | p = 200 | 8.8 | 0.1 | 10.7 | 0.1 | 8.9 | 0.1 | 10.6 | 0.1 | 9.7 | 0.3 | 14.5 | 0.1 | 11.9 | 0.1 | |
p = 500 | 8.7 | 0.1 | 10.5 | 0.1 | 8.9 | 0.1 | 11.1 | 0.3 | 9.9 | 0.3 | 14.8 | 0.1 | 12.5 | 0.1 | ||
p = 800 | 8.6 | 0.1 | 10.5 | 0.1 | 8.8 | 0.1 | 11.7 | 0.4 | 12.4 | 0.6 | 15.0 | 0.1 | 12.5 | 0.1 | ||
p = 1200 | 10.2 | 0.4 | 10.4 | 0.1 | 8.8 | 0.1 | 18.3 | 0.8 | 37.9 | 1.1 | 15.1 | 0.1 | 12.9 | 0.1 | ||
p = 2000 | 11.1 | 0.5 | 10.6 | 0.1 | 8.8 | 0.1 | 20.5 | 0.8 | 38.2 | 1.1 | 15.8 | 0.1 | 13.4 | 0.1 | ||
n = 200 | p = 200 | 14.0 | 0.2 | 20.8 | 0.2 | 14.8 | 0.2 | 25.4 | 0.5 | 21.4 | 0.7 | 31.8 | 0.3 | 23.3 | 0.2 | |
p = 500 | 14.3 | 0.2 | 21.1 | 0.2 | 15.0 | 0.2 | 31.4 | 0.7 | 29.0 | 1.0 | 34.1 | 0.3 | 27.4 | 0.3 | ||
p = 800 | 14.2 | 0.2 | 20.7 | 0.2 | 14.8 | 0.2 | 33.1 | 0.7 | 33.6 | 1.1 | 34.6 | 0.3 | 30.5 | 0.3 | ||
p = 1200 | 23.9 | 0.7 | 21.1 | 0.2 | 14.9 | 0.2 | 45.1 | 1.0 | 75.9 | 0.9 | 36.9 | 0.3 | 34.9 | 0.4 | ||
p = 2000 | 24.9 | 0.8 | 20.6 | 0.2 | 14.6 | 0.2 | 47.9 | 1.0 | 77.2 | 0.9 | 38.6 | 0.4 | 42.5 | 0.5 | ||
n = 500 | p = 200 | 8.7 | 0.1 | 13.2 | 0.1 | 9.0 | 0.1 | 13.7 | 0.3 | 10.4 | 0.4 | 18.6 | 0.2 | 12.6 | 0.1 | |
p = 500 | 8.9 | 0.1 | 13.3 | 0.1 | 9.3 | 0.1 | 14.3 | 0.3 | 13.4 | 0.6 | 19.4 | 0.2 | 13.4 | 0.1 | ||
p = 800 | 8.8 | 0.1 | 13.3 | 0.1 | 9.1 | 0.1 | 14.0 | 0.3 | 14.9 | 0.7 | 19.9 | 0.2 | 13.8 | 0.1 | ||
p = 1200 | 12.8 | 0.5 | 13.3 | 0.1 | 9.2 | 0.1 | 24.3 | 0.8 | 45.9 | 1.2 | 20.1 | 0.2 | 14.6 | 0.1 | ||
p = 2000 | 12.5 | 0.5 | 13.5 | 0.1 | 9.3 | 0.1 | 27.0 | 0.9 | 49.7 | 1.2 | 20.6 | 0.2 | 15.1 | 0.1 | ||
n =800 | p = 200 | 7.1 | 0.1 | 10.6 | 0.1 | 7.4 | 0.1 | 10.6 | 0.1 | 7.9 | 0.2 | 14.4 | 0.1 | 9.8 | 0.1 | |
p = 500 | 7.1 | 0.1 | 10.6 | 0.1 | 7.3 | 0.1 | 10.9 | 0.2 | 9.3 | 0.4 | 15.0 | 0.1 | 10.2 | 0.1 | ||
p = 800 | 7.1 | 0.1 | 10.6 | 0.1 | 7.3 | 0.1 | 11.6 | 0.3 | 10.8 | 0.6 | 15.1 | 0.1 | 10.2 | 0.1 | ||
p = 1200 | 8.8 | 0.4 | 10.4 | 0.1 | 7.2 | 0.1 | 17.5 | 0.7 | 32.8 | 1.1 | 15.0 | 0.1 | 10.5 | 0.1 | ||
p = 2000 | 9.7 | 0.5 | 10.5 | 0.1 | 7.3 | 0.1 | 19.7 | 0.8 | 34.3 | 1.2 | 15.9 | 0.1 | 10.8 | 0.1 |
MDDM | Oracle-SIR(3) | Oracle-SIR(10) | Rifle-SIR(3) | Rifle-SIR(10) | LassoSIR(3) | LassoSIR(10) | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Error | SE | Error | SE | Error | SE | Error | SE | Error | SE | Error | SE | Error | SE | |||
n = 200 | p = 200 | 17.5 | 0.2 | 40.6 | 0.2 | 27.7 | 0.2 | 71.3 | 0.0 | 71.2 | 0.0 | 69.6 | 0.2 | 67.0 | 0.3 | |
p = 500 | 18.1 | 0.2 | 40.7 | 0.2 | 28.2 | 0.2 | 71.3 | 0.0 | 71.2 | 0.0 | 74.8 | 0.2 | 80.1 | 0.3 | ||
p = 800 | 17.7 | 0.2 | 40.8 | 0.2 | 27.7 | 0.2 | 71.3 | 0.0 | 71.2 | 0.0 | 76.5 | 0.2 | 85.0 | 0.2 | ||
p = 1200 | 17.9 | 0.2 | 40.7 | 0.2 | 28.0 | 0.2 | 31.7 | 0.4 | 18.6 | 0.2 | 78.4 | 0.2 | 88.8 | 0.2 | ||
p = 2000 | 18.1 | 0.2 | 40.8 | 0.2 | 28.0 | 0.2 | 32.1 | 0.4 | 18.9 | 0.2 | 80.6 | 0.2 | 92.5 | 0.2 | ||
n = 500 | p = 200 | 10.6 | 0.1 | 27.8 | 0.2 | 17.0 | 0.1 | 70.9 | 0.0 | 70.9 | 0.0 | 48.6 | 0.3 | 30.1 | 0.2 | |
p = 500 | 10.8 | 0.1 | 27.5 | 0.2 | 17.2 | 0.1 | 70.9 | 0.0 | 70.9 | 0.0 | 53.7 | 0.3 | 39.1 | 0.3 | ||
p = 800 | 10.8 | 0.1 | 27.4 | 0.2 | 17.3 | 0.1 | 70.9 | 0.0 | 70.9 | 0.0 | 57.1 | 0.2 | 46.0 | 0.4 | ||
p = 1200 | 10.7 | 0.1 | 27.4 | 0.2 | 17.1 | 0.1 | 18.9 | 0.2 | 11.0 | 0.1 | 59.1 | 0.2 | 53.0 | 0.4 | ||
p = 2000 | 10.7 | 0.1 | 27.6 | 0.2 | 17.1 | 0.1 | 19.0 | 0.2 | 11.2 | 0.1 | 62.1 | 0.2 | 60.7 | 0.4 | ||
n =800 | p = 200 | 8.2 | 0.1 | 22.0 | 0.1 | 13.5 | 0.1 | 70.8 | 0.0 | 70.8 | 0.0 | 36.0 | 0.2 | 20.8 | 0.1 | |
p = 500 | 8.3 | 0.1 | 21.9 | 0.1 | 13.5 | 0.1 | 70.8 | 0.0 | 70.8 | 0.0 | 40.4 | 0.2 | 24.0 | 0.2 | ||
p = 800 | 8.2 | 0.1 | 21.9 | 0.1 | 13.3 | 0.1 | 70.8 | 0.0 | 70.8 | 0.0 | 42.2 | 0.3 | 26.2 | 0.2 | ||
p = 1200 | 8.3 | 0.1 | 22.0 | 0.1 | 13.4 | 0.1 | 14.9 | 0.1 | 8.7 | 0.1 | 44.7 | 0.2 | 29.8 | 0.3 | ||
p = 2000 | 8.3 | 0.1 | 22.2 | 0.1 | 13.5 | 0.1 | 14.8 | 0.1 | 8.6 | 0.1 | 47.9 | 0.3 | 36.4 | 0.4 | ||
n = 200 | p = 200 | 23.1 | 0.2 | 46.2 | 0.3 | 36.3 | 0.3 | 72.0 | 0.0 | 71.6 | 0.0 | 78.1 | 0.2 | 78.2 | 0.3 | |
p = 500 | 22.8 | 0.2 | 45.8 | 0.3 | 35.9 | 0.3 | 72.1 | 0.0 | 71.6 | 0.0 | 83.0 | 0.2 | 87.8 | 0.2 | ||
p = 800 | 23.0 | 0.2 | 45.8 | 0.3 | 36.4 | 0.3 | 71.9 | 0.0 | 71.6 | 0.0 | 85.2 | 0.2 | 91.5 | 0.2 | ||
p = 1200 | 23.2 | 0.3 | 45.8 | 0.3 | 36.3 | 0.3 | 38.1 | 0.4 | 25.2 | 0.3 | 87.3 | 0.2 | 93.8 | 0.2 | ||
p = 2000 | 23.1 | 0.3 | 45.7 | 0.3 | 36.2 | 0.3 | 54.0 | 0.5 | 34.1 | 0.5 | 89.2 | 0.2 | 95.9 | 0.1 | ||
n = 500 | p = 200 | 13.4 | 0.1 | 31.0 | 0.2 | 21.7 | 0.1 | 71.2 | 0.0 | 71.0 | 0.0 | 55.3 | 0.3 | 43.2 | 0.3 | |
p = 500 | 13.5 | 0.1 | 31.0 | 0.2 | 21.6 | 0.1 | 71.2 | 0.0 | 71.0 | 0.0 | 61.3 | 0.3 | 55.3 | 0.4 | ||
p = 800 | 13.4 | 0.1 | 31.0 | 0.2 | 21.8 | 0.1 | 71.2 | 0.0 | 71.0 | 0.0 | 64.7 | 0.2 | 62.8 | 0.4 | ||
p = 1200 | 13.4 | 0.1 | 30.8 | 0.2 | 21.6 | 0.1 | 21.5 | 0.2 | 14.3 | 0.1 | 67.2 | 0.2 | 67.8 | 0.3 | ||
p = 2000 | 13.5 | 0.1 | 31.0 | 0.2 | 21.8 | 0.1 | 22.0 | 0.2 | 14.4 | 0.1 | 69.7 | 0.2 | 73.3 | 0.3 | ||
n =800 | p = 200 | 10.5 | 0.1 | 25.2 | 0.2 | 17.2 | 0.1 | 71.0 | 0.0 | 70.9 | 0.0 | 42.7 | 0.2 | 28.7 | 0.2 | |
p = 500 | 10.4 | 0.1 | 24.9 | 0.2 | 17.1 | 0.1 | 71.0 | 0.0 | 70.9 | 0.0 | 47.3 | 0.3 | 34.2 | 0.3 | ||
p = 800 | 10.5 | 0.1 | 25.1 | 0.2 | 17.1 | 0.1 | 71.0 | 0.0 | 70.9 | 0.0 | 50.2 | 0.3 | 39.9 | 0.4 | ||
p = 1200 | 10.3 | 0.1 | 24.9 | 0.2 | 17.0 | 0.1 | 16.9 | 0.2 | 11.0 | 0.1 | 52.4 | 0.3 | 45.8 | 0.4 | ||
p = 2000 | 10.6 | 0.1 | 25.2 | 0.2 | 17.2 | 0.1 | 17.3 | 0.2 | 11.3 | 0.1 | 56.5 | 0.3 | 53.8 | 0.4 | ||
n = 200 | p = 200 | 30.6 | 0.6 | 29.1 | 0.1 | 22.0 | 0.1 | 71.6 | 0.0 | 71.2 | 0.0 | 58.8 | 0.3 | 56.6 | 0.3 | |
p = 500 | 30.4 | 0.6 | 28.9 | 0.2 | 22.2 | 0.1 | 71.5 | 0.0 | 71.2 | 0.0 | 67.4 | 0.3 | 73.5 | 0.4 | ||
p = 800 | 30.8 | 0.6 | 28.8 | 0.2 | 22.1 | 0.1 | 71.6 | 0.0 | 71.2 | 0.0 | 71.2 | 0.3 | 81.3 | 0.3 | ||
p= 1200 | 31.0 | 0.6 | 29.1 | 0.1 | 22.3 | 0.1 | 20.0 | 0.2 | 14.6 | 0.1 | 74.1 | 0.3 | 86.4 | 0.3 | ||
p = 2000 | 31.3 | 0.6 | 28.7 | 0.2 | 22.1 | 0.1 | 19.3 | 0.2 | 14.4 | 0.1 | 77.8 | 0.3 | 90.3 | 0.2 | ||
n = 500 | p = 200 | 12.4 | 0.2 | 18.4 | 0.1 | 13.6 | 0.1 | 71.1 | 0.0 | 70.9 | 0.0 | 31.7 | 0.2 | 24.5 | 0.2 | |
p = 500 | 11.8 | 0.2 | 18.4 | 0.1 | 13.6 | 0.1 | 71.0 | 0.0 | 70.9 | 0.0 | 35.9 | 0.2 | 29.2 | 0.2 | ||
p = 800 | 11.9 | 0.2 | 18.4 | 0.1 | 13.6 | 0.1 | 71.0 | 0.0 | 70.9 | 0.0 | 37.9 | 0.3 | 32.9 | 0.3 | ||
p = 1200 | 11.6 | 0.2 | 18.4 | 0.1 | 13.7 | 0.1 | 12.1 | 0.1 | 8.6 | 0.1 | 39.7 | 0.3 | 37.3 | 0.3 | ||
p = 2000 | 12.0 | 0.2 | 18.5 | 0.1 | 13.6 | 0.1 | 12.1 | 0.1 | 8.6 | 0.1 | 42.1 | 0.3 | 46.3 | 0.4 | ||
n = 800 | p = 200 | 7.9 | 0.1 | 14.7 | 0.1 | 10.7 | 0.1 | 70.9 | 0.0 | 70.8 | 0.0 | 23.8 | 0.1 | 17.4 | 0.1 | |
p = 500 | 7.8 | 0.1 | 14.5 | 0.1 | 10.6 | 0.1 | 70.9 | 0.0 | 70.8 | 0.0 | 25.7 | 0.2 | 19.2 | 0.1 | ||
p = 800 | 7.8 | 0.1 | 14.6 | 0.1 | 10.7 | 0.1 | 70.9 | 0.0 | 70.8 | 0.0 | 27.1 | 0.2 | 20.7 | 0.2 | ||
p = 1200 | 7.7 | 0.1 | 14.6 | 0.1 | 10.6 | 0.1 | 9.3 | 0.1 | 6.5 | 0.1 | 28.3 | 0.2 | 21.6 | 0.2 | ||
p = 2000 | 7.9 | 0.1 | 14.4 | 0.1 | 10.7 | 0.1 | 9.4 | 0.1 | 6.6 | 0.1 | 29.9 | 0.2 | 25.0 | 0.2 |
MDDM | Oracle-SIR(3) | Oracle-SIR(10) | Rifle-SIR(3) | Rifle-SIR(10) | LassoSIR(3) | LassoSIR(10) | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Error | SE | Error | SE | Error | SE | Error | SE | Error | SE | Error | SE | Error | SE | |||
n = 200 | p = 200 | 41.4 | 0.4 | 58.8 | 0.2 | 50.2 | 0.2 | 72.6 | 0.0 | 72.2 | 0.0 | 67.0 | 0.2 | 62.8 | 0.2 | |
p = 500 | 42.1 | 0.4 | 59.0 | 0.2 | 50.4 | 0.2 | 72.9 | 0.1 | 72.4 | 0.0 | 70.3 | 0.2 | 71.1 | 0.2 | ||
p = 800 | 43.0 | 0.4 | 59.1 | 0.2 | 50.4 | 0.2 | 72.7 | 0.0 | 72.3 | 0.0 | 72.1 | 0.2 | 75.1 | 0.2 | ||
p = 1200 | 42.7 | 0.4 | 59.4 | 0.2 | 50.8 | 0.2 | 53.0 | 0.4 | 39.9 | 0.4 | 73.0 | 0.2 | 78.6 | 0.2 | ||
p = 2000 | 43.5 | 0.4 | 59.2 | 0.2 | 50.8 | 0.2 | 53.5 | 0.4 | 40.2 | 0.4 | 74.8 | 0.2 | 82.1 | 0.2 | ||
n = 500 | p = 200 | 25.6 | 0.3 | 44.6 | 0.2 | 34.3 | 0.2 | 71.4 | 0.0 | 71.3 | 0.0 | 51.5 | 0.2 | 42.4 | 0.2 | |
p = 500 | 25.8 | 0.3 | 44.4 | 0.2 | 34.5 | 0.2 | 71.5 | 0.0 | 71.3 | 0.0 | 53.8 | 0.2 | 45.7 | 0.2 | ||
p = 800 | 25.2 | 0.3 | 44.6 | 0.2 | 34.1 | 0.2 | 71.5 | 0.0 | 71.3 | 0.0 | 54.8 | 0.2 | 47.1 | 0.3 | ||
p = 1200 | 25.9 | 0.3 | 44.4 | 0.2 | 34.4 | 0.2 | 33.8 | 0.4 | 23.8 | 0.2 | 55.8 | 0.2 | 50.1 | 0.3 | ||
p = 2000 | 25.7 | 0.3 | 44.3 | 0.2 | 34.4 | 0.2 | 35.0 | 0.4 | 23.8 | 0.2 | 56.9 | 0.2 | 54.0 | 0.3 | ||
n =800 | p = 200 | 20.1 | 0.2 | 37.0 | 0.2 | 27.8 | 0.2 | 71.2 | 0.0 | 71.0 | 0.0 | 43.8 | 0.2 | 34.9 | 0.2 | |
p = 500 | 19.8 | 0.2 | 37.8 | 0.2 | 27.6 | 0.2 | 71.2 | 0.0 | 71.1 | 0.0 | 46.1 | 0.2 | 36.6 | 0.2 | ||
p = 800 | 20.0 | 0.2 | 37.3 | 0.2 | 27.9 | 0.2 | 71.2 | 0.0 | 71.0 | 0.0 | 47.0 | 0.2 | 37.9 | 0.2 | ||
p = 1200 | 19.8 | 0.2 | 37.1 | 0.2 | 27.8 | 0.2 | 26.7 | 0.3 | 18.5 | 0.2 | 47.6 | 0.2 | 38.6 | 0.2 | ||
p = 2000 | 19.9 | 0.2 | 37.4 | 0.2 | 27.7 | 0.2 | 27.5 | 0.3 | 18.7 | 0.2 | 48.8 | 0.2 | 40.7 | 0.2 | ||
n = 200 | p = 200 | 58.1 | 0.4 | 75.9 | 0.2 | 70.1 | 0.3 | 80.4 | 0.2 | 78.2 | 0.2 | 85.7 | 0.2 | 84.6 | 0.2 | |
p = 500 | 59.3 | 0.5 | 75.8 | 0.2 | 70.2 | 0.3 | 95.2 | 0.1 | 94.2 | 0.1 | 88.8 | 0.2 | 90.2 | 0.2 | ||
p = 800 | 59.1 | 0.5 | 75.1 | 0.2 | 69.9 | 0.3 | 81.0 | 0.2 | 78.7 | 0.2 | 89.7 | 0.2 | 92.1 | 0.2 | ||
p = 1200 | 59.5 | 0.5 | 75.3 | 0.2 | 69.9 | 0.3 | 73.9 | 0.3 | 62.3 | 0.5 | 90.5 | 0.2 | 93.9 | 0.2 | ||
p = 2000 | 60.4 | 0.5 | 75.4 | 0.2 | 69.9 | 0.3 | 74.1 | 0.4 | 63.0 | 0.5 | 91.9 | 0.2 | 95.2 | 0.1 | ||
n = 500 | p = 200 | 38.6 | 0.4 | 61.4 | 0.2 | 50.9 | 0.3 | 74.5 | 0.1 | 73.5 | 0.1 | 70.5 | 0.2 | 63.0 | 0.3 | |
p = 500 | 38.7 | 0.4 | 61.6 | 0.2 | 51.1 | 0.3 | 74.7 | 0.1 | 73.4 | 0.1 | 74.0 | 0.2 | 69.0 | 0.3 | ||
p = 800 | 39.3 | 0.4 | 61.3 | 0.2 | 50.8 | 0.2 | 77.5 | 0.2 | 74.9 | 0.1 | 75.6 | 0.2 | 72.5 | 0.3 | ||
p = 1200 | 39.5 | 0.4 | 61.3 | 0.2 | 51.0 | 0.2 | 54.9 | 0.4 | 40.2 | 0.4 | 77.1 | 0.2 | 76.0 | 0.3 | ||
p = 2000 | 39.7 | 0.4 | 61.5 | 0.2 | 51.3 | 0.3 | 56.1 | 0.4 | 40.6 | 0.4 | 78.7 | 0.2 | 79.7 | 0.3 | ||
n =800 | p = 200 | 30.7 | 0.3 | 53.1 | 0.2 | 42.2 | 0.2 | 73.1 | 0.1 | 72.4 | 0.0 | 61.3 | 0.2 | 51.4 | 0.3 | |
p = 500 | 30.5 | 0.3 | 52.9 | 0.2 | 42.0 | 0.2 | 73.1 | 0.1 | 72.4 | 0.0 | 64.1 | 0.2 | 54.4 | 0.3 | ||
p = 800 | 30.6 | 0.3 | 53.7 | 0.2 | 42.3 | 0.2 | 73.4 | 0.1 | 72.5 | 0.0 | 65.9 | 0.2 | 58.2 | 0.3 | ||
p = 1200 | 30.8 | 0.3 | 53.5 | 0.2 | 42.2 | 0.2 | 45.3 | 0.4 | 31.5 | 0.3 | 67.2 | 0.2 | 60.8 | 0.3 | ||
p = 2000 | 30.7 | 0.3 | 53.6 | 0.2 | 42.2 | 0.2 | 45.2 | 0.4 | 31.1 | 0.3 | 68.6 | 0.2 | 64.6 | 0.3 | ||
n = 200 | p = 200 | 45.5 | 0.6 | 46.5 | 0.2 | 35.7 | 0.2 | 73.9 | 0.1 | 72.4 | 0.0 | 62.6 | 0.2 | 51.8 | 0.2 | |
p = 500 | 46.9 | 0.6 | 46.9 | 0.2 | 35.8 | 0.2 | 73.9 | 0.1 | 72.4 | 0.0 | 65.6 | 0.2 | 57.4 | 0.3 | ||
p = 800 | 46.2 | 0.6 | 46.4 | 0.2 | 35.5 | 0.2 | 73.8 | 0.1 | 72.4 | 0.0 | 66.5 | 0.2 | 61.4 | 0.3 | ||
p = 1200 | 46.9 | 0.6 | 46.3 | 0.2 | 35.8 | 0.2 | 33.0 | 0.3 | 23.9 | 0.2 | 67.2 | 0.3 | 65.7 | 0.3 | ||
p = 2000 | 46.1 | 0.6 | 46.5 | 0.2 | 35.7 | 0.2 | 33.6 | 0.3 | 23.8 | 0.2 | 68.8 | 0.3 | 72.0 | 0.3 | ||
n = 500 | p = 200 | 24.7 | 0.4 | 31.4 | 0.2 | 22.6 | 0.1 | 72.0 | 0.0 | 71.3 | 0.0 | 44.8 | 0.2 | 32.6 | 0.1 | |
p = 500 | 24.7 | 0.4 | 31.1 | 0.2 | 22.4 | 0.1 | 71.9 | 0.0 | 71.3 | 0.0 | 46.8 | 0.2 | 35.4 | 0.2 | ||
p = 800 | 25.4 | 0.4 | 31.3 | 0.2 | 22.9 | 0.1 | 72.0 | 0.0 | 71.3 | 0.0 | 48.0 | 0.2 | 36.7 | 0.2 | ||
p = 1200 | 25.2 | 0.4 | 31.5 | 0.2 | 22.7 | 0.1 | 20.8 | 0.2 | 14.3 | 0.1 | 48.2 | 0.2 | 37.6 | 0.2 | ||
p = 2000 | 25.5 | 0.4 | 31.7 | 0.2 | 22.8 | 0.1 | 20.9 | 0.2 | 14.4 | 0.1 | 49.1 | 0.2 | 39.5 | 0.2 | ||
n =800 | p = 200 | 18.7 | 0.3 | 25.6 | 0.1 | 18.0 | 0.1 | 71.5 | 0.0 | 71.1 | 0.0 | 36.7 | 0.1 | 25.6 | 0.1 | |
p = 500 | 18.1 | 0.3 | 25.3 | 0.1 | 17.7 | 0.1 | 71.5 | 0.0 | 71.1 | 0.0 | 38.4 | 0.2 | 27.1 | 0.1 | ||
p = 800 | 18.6 | 0.3 | 25.4 | 0.1 | 18.0 | 0.1 | 71.5 | 0.0 | 71.1 | 0.0 | 39.3 | 0.2 | 28.4 | 0.1 | ||
p = 1200 | 18.7 | 0.3 | 25.3 | 0.1 | 17.9 | 0.1 | 16.3 | 0.1 | 11.0 | 0.1 | 40.2 | 0.2 | 29.2 | 0.1 | ||
p = 2000 | 19.1 | 0.3 | 25.3 | 0.1 | 18.0 | 0.1 | 16.5 | 0.1 | 11.2 | 0.1 | 41.0 | 0.2 | 30.3 | 0.1 |
MDDM | Oracle-SIR(3) | Oracle-SIR(10) | Rifle-SIR(3) | Rifle-SIR(10) | LassoSIR(3) | LassoSIR(10) | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Error | SE | Error | SE | Error | SE | Error | SE | Error | SE | Error | SE | Error | SE | |||
n = 200 | p = 200 | 34.3 | 0.5 | 49.1 | 0.5 | 33.5 | 0.4 | 50.0 | 0.7 | 30.6 | 0.5 | 70.7 | 0.0 | 70.7 | 0.0 | |
p = 500 | 34.2 | 0.5 | 48.7 | 0.5 | 32.8 | 0.4 | 49.5 | 0.7 | 30.1 | 0.5 | 70.7 | 0.0 | 70.7 | 0.0 | ||
p = 800 | 34.6 | 0.6 | 48.9 | 0.5 | 33.4 | 0.5 | 50.1 | 0.7 | 30.8 | 0.6 | 70.7 | 0.0 | 70.7 | 0.0 | ||
p = 1200 | 44.5 | 0.7 | 49.3 | 0.5 | 33.6 | 0.5 | 67.0 | 0.7 | 38.9 | 0.7 | 70.7 | 0.0 | 70.7 | 0.0 | ||
p = 2000 | 44.8 | 0.8 | 48.1 | 0.5 | 33.2 | 0.5 | 66.7 | 0.8 | 39.4 | 0.7 | 70.7 | 0.0 | 70.7 | 0.0 | ||
n = 500 | p = 200 | 22.0 | 0.4 | 35.5 | 0.4 | 22.6 | 0.3 | 33.0 | 0.5 | 19.0 | 0.3 | 70.7 | 0.0 | 70.7 | 0.0 | |
p = 500 | 21.8 | 0.4 | 34.7 | 0.4 | 22.3 | 0.3 | 32.4 | 0.5 | 18.8 | 0.4 | 70.7 | 0.0 | 70.7 | 0.0 | ||
p = 800 | 21.7 | 0.3 | 34.6 | 0.4 | 22.5 | 0.3 | 32.3 | 0.5 | 18.9 | 0.3 | 70.7 | 0.0 | 70.7 | 0.0 | ||
p = 1200 | 27.1 | 0.5 | 35.5 | 0.4 | 22.5 | 0.3 | 43.4 | 0.7 | 23.4 | 0.4 | 70.7 | 0.0 | 70.7 | 0.0 | ||
p = 2000 | 26.9 | 0.5 | 34.8 | 0.4 | 22.4 | 0.3 | 42.3 | 0.7 | 23.6 | 0.5 | 70.7 | 0.0 | 70.7 | 0.0 | ||
n =800 | p = 200 | 17.2 | 0.3 | 29.2 | 0.3 | 18.6 | 0.3 | 25.8 | 0.4 | 14.9 | 0.2 | 70.7 | 0.0 | 70.7 | 0.0 | |
p = 500 | 16.7 | 0.3 | 28.5 | 0.3 | 18.1 | 0.2 | 25.3 | 0.4 | 14.3 | 0.3 | 70.7 | 0.0 | 70.7 | 0.0 | ||
p = 800 | 16.5 | 0.2 | 28.4 | 0.3 | 17.9 | 0.2 | 25.0 | 0.4 | 14.1 | 0.2 | 70.7 | 0.0 | 70.7 | 0.0 | ||
p =1200 | 20.8 | 0.4 | 29.0 | 0.4 | 18.2 | 0.3 | 31.7 | 0.5 | 18.5 | 0.4 | 70.7 | 0.0 | 70.7 | 0.0 | ||
p = 2000 | 21.2 | 0.4 | 28.7 | 0.4 | 18.3 | 0.3 | 32.4 | 0.6 | 18.8 | 0.3 | 70.7 | 0.0 | 70.7 | 0.0 |
Error | SE | Error | SE | Error | SE | Error | SE | Error | SE | Error | SE | ||
MDDM | 45.0 | 0.5 | 45.9 | 0.5 | 27.5 | 0.3 | 28.7 | 0.4 | 18.8 | 0.3 | 19.6 | 0.3 | |
PR-Oracle-SIR(3) | 26.3 | 0.2 | 26.5 | 0.2 | 18.3 | 0.1 | 18.4 | 0.2 | 12.6 | 0.1 | 12.6 | 0.1 | |
PR-Oracle-SIR(10) | 33.0 | 0.3 | 33.0 | 0.3 | 20.1 | 0.2 | 20.2 | 0.2 | 13.0 | 0.1 | 13.1 | 0.1 | |
PR-SIR(3) | 96.6 | 0.0 | 97.7 | 0.0 | 93.2 | 0.0 | 95.3 | 0.0 | 87.6 | 0.0 | 91.1 | 0.0 | |
PR-SIR(10) | 97.4 | 0.0 | 98.2 | 0.0 | 95.0 | 0.0 | 96.6 | 0.0 | 90.0 | 0.1 | 93.7 | 0.0 | |
MDDM | 93.5 | 0.6 | 94.9 | 0.5 | 73.6 | 1.1 | 77.1 | 1.1 | 36.7 | 1.1 | 37.8 | 1.2 | |
PR-Oracle-SIR(3) | 80.1 | 0.6 | 80.1 | 0.6 | 64.0 | 0.7 | 64.7 | 0.7 | 42.5 | 0.5 | 41.1 | 0.5 | |
PR-Oracle-SIR(10) | 79.1 | 0.6 | 78.9 | 0.6 | 58.4 | 0.6 | 58.7 | 0.6 | 34.5 | 0.4 | 34.2 | 0.4 | |
PR-SIR(3) | 99.9 | 0.0 | 100.0 | 0.0 | 99.9 | 0.0 | 100.0 | 0.0 | 99.9 | 0.0 | 99.9 | 0.0 | |
PR-SIR(10) | 99.9 | 0.0 | 100.0 | 0.0 | 99.9 | 0.0 | 100.0 | 0.0 | 99.9 | 0.0 | 100.0 | 0.0 | |
MDDM | 17.3 | 0.4 | 17.5 | 0.4 | 10.0 | 0.1 | 10.1 | 0.1 | 7.1 | 0.1 | 7.1 | 0.1 | |
PR-Oracle-SIR(3) | 15.5 | 0.2 | 15.2 | 0.2 | 10.6 | 0.1 | 10.6 | 0.1 | 7.5 | 0.1 | 7.5 | 0.1 | |
PR-Oracle-SIR(10) | 14.2 | 0.2 | 13.9 | 0.2 | 9.6 | 0.1 | 9.6 | 0.1 | 6.8 | 0.1 | 6.8 | 0.1 | |
PR-SIR(3) | 92.2 | 0.1 | 95.0 | 0.1 | 83.0 | 0.1 | 88.3 | 0.1 | 71.0 | 0.1 | 77.9 | 0.1 | |
PR-SIR(10) | 90.0 | 0.1 | 93.4 | 0.1 | 80.1 | 0.1 | 85.8 | 0.1 | 67.4 | 0.1 | 74.7 | 0.1 |
S2 More on Real Data Analysis
S2.1 Choice of Tuning Parameter on Real Data
To apply our proposal to real data, we need to determine the tuning parameter, i.e., the desired level of sparsity. In penalized problems such as sparse PCA and sparse SIR, tuning parameters are often chosen by cross-validation, and we could do the same here. However, cross-validation would considerably slow down the computation. Moreover, as we observe in our theoretical and simulation studies, our method is not very sensitive to this parameter; the result is reasonably stable as long as it is larger than d. Hence, we resort to a faster tuning method on the real data, as follows. We start with a sequence of reasonable sparsity levels. For each candidate sparsity level, we obtain the corresponding estimate and calculate the sample distance covariance (Székely et al., 2007) between the resulting reduced predictors and the response. Here the distance covariance is used as a model-free measure of dependence. Intuitively, the distance covariance increases as the pre-specified sparsity increases. Therefore, we plot the sample distance covariance against the sparsity levels in Figure S1 and pick a sparsity level at which the distance covariance is large enough, while any larger sparsity level does not lead to a noticeably larger distance covariance (i.e., the “elbow method”). Based on Figure S1, a sparsity level between 20 and 25 seems reasonable, and we pick a value in this range as the pre-specified sparsity.
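For readers who wish to reproduce this tuning step, a minimal sketch is given below. It implements the sample distance covariance of Székely et al. (2007) in its V-statistic form (double-centered pairwise distance matrices) and evaluates it along a grid of candidate sparsity levels; the function fit_mddm_directions is a hypothetical placeholder for a sparse solver such as Algorithm 1 or 2 of the main paper and is not part of our software.

```python
import numpy as np

def dcov(u, v):
    """Sample distance covariance (Szekely et al., 2007), V-statistic form."""
    def centered_dist(z):
        z = np.asarray(z, dtype=float).reshape(len(z), -1)
        d = np.linalg.norm(z[:, None, :] - z[None, :, :], axis=2)  # pairwise distances
        return d - d.mean(axis=0) - d.mean(axis=1, keepdims=True) + d.mean()
    A, B = centered_dist(u), centered_dist(v)
    return np.sqrt(max((A * B).mean(), 0.0))

def sparsity_path(X, Y, levels, fit_mddm_directions):
    """Distance covariance between the reduced predictors and the response for
    each candidate sparsity level; plot the output against `levels` and pick the elbow."""
    scores = []
    for s in levels:
        B_hat = fit_mddm_directions(X, Y, s)   # hypothetical: p x d directions at sparsity s
        scores.append(dcov(X @ B_hat, Y))
    return scores
```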
[Figure S1: the sample distance covariance between the reduced predictors and the response, plotted against the pre-specified sparsity levels.]
S2.2 Additional Real Data Analysis Results
To further demonstrate our method on the real data, we also construct the scatterplot of the leading two directions in Figure S2. For the first direction, the experimental units from leukemia cancer (LE) and colorectal cancer (CO) fall on opposite sides of the plot. For the second direction, CO and LE take similar values, whereas the units from central nervous system cancer (CN), breast cancer (BR), lung cancer (LC) and ovarian cancer (OV) lie on the opposite side of the figure. This pattern coincides with the first two canonical correlation analysis directions in a previous study of the same data set (Figure 8, Cruz-Cano and Lee, 2014). Such a finding is very encouraging, as our slicing-free approach automatically detects the most significant associations between the response and the predictors while being applied directly to the multivariate response.
[Figure S2: scatterplot of the observations projected onto the leading two estimated directions, labeled by cancer type.]
S3 Computational Complexity
We briefly analyze the computational complexity of the proposed methods. For both algorithms, we need to compute the sample MDDM, whose current computational complexity is . If we adapt the fast computing algorithm of Huo and Székely (2016), developed for the distance correlation, to MDDM, we might be able to reduce the complexity to . In Algorithm 2, we further need to compute the sample covariance matrix, at a complexity of .
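As a concrete illustration of this cost, the sketch below computes the sample MDDM directly from a pairwise-sum form, which we state as we understand it from Lee and Shao (2018): minus the average over all pairs (i, j) of the centered cross-products weighted by the response distances. The n-by-n response distance matrix is then the quadratic-in-n bottleneck mentioned above.

```python
import numpy as np

def sample_mddm(X, Y):
    """Sample MDDM, assuming the pairwise form
        -(1/n^2) * sum_{i,j} (X_i - Xbar)(X_j - Xbar)^T * ||Y_i - Y_j||
    of Lee and Shao (2018). The pairwise response distances cost O(n^2 q)
    and the matrix products cost O(n^2 p)."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float).reshape(len(X), -1)
    n = X.shape[0]
    Xc = X - X.mean(axis=0)                                     # centered predictors, n x p
    D = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=2)   # ||Y_i - Y_j||, n x n
    return -(Xc.T @ D @ Xc) / n**2                              # p x p sample MDDM
```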
To apply the two algorithms, we assume that the maximum number of iterations in finding each direction (Step 3(a) in both algorithms) is . After obtaining the MDDM, Algorithm 1 has a computational complexity of , where is the computational complexity of each iteration (Yuan and Zhang, 2013) and is the computational complexity of deflating in Step 3(b). Note that, in Step 3(b), we repeatedly exploit the sparsity of to reduce the computational cost. For example, to compute , we only need to compute the nonzero elements, and the same applies to , which has a computational complexity of . For Algorithm 2, the computational complexity is . This differs from that of Algorithm 1 only in the term , which is the cost of deflating in Step 3(b). Since is not guaranteed to be sparse, the deflation is more computationally expensive. Otherwise, each iteration in Step 3(a) of Algorithm 2 has the same complexity as Step 3(a) of Algorithm 1 (Tan, Wang, Liu and Zhang, 2018).
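To illustrate the sparsity argument in the previous paragraph, the sketch below performs a generic projection deflation, M ← (I − vvᵀ)M(I − vvᵀ), touching only the rows and columns indexed by the support of v. The exact deflation used in Algorithms 1 and 2 is specified in the main paper; this sketch is only meant to show why a sparse v with support size s brings the update cost down to the order of p·s.

```python
import numpy as np

def deflate_sparse(M, v):
    """Projection deflation M <- (I - v v^T) M (I - v v^T) for symmetric M,
    updating only the rows/columns in the support of v (illustrative only;
    not the exact deflation step of Algorithm 1 or 2)."""
    supp = np.flatnonzero(v)                 # support of v, size s
    vs = v[supp]
    Mv = M[:, supp] @ vs                     # M v, using only s columns of M
    vMv = vs @ Mv[supp]                      # scalar v^T M v
    M = M.copy()
    M[supp, :] -= np.outer(vs, Mv)           # subtract v v^T M (only s rows change)
    M[:, supp] -= np.outer(Mv, vs)           # subtract M v v^T (only s columns change)
    M[np.ix_(supp, supp)] += vMv * np.outer(vs, vs)  # add back v (v^T M v) v^T
    return M
```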
S4 Proofs for Proposition 2 & Lemma 1
Proof of Proposition 2.
From the basic properties of MDDM (cf. the beginning of Section 1), is equivalent to .
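For convenience, we restate the basic quadratic-form property of MDDM as we understand it from Lee and Shao (2018): for any vector $v$,
$$ v^\top \mathrm{MDDM}(\mathbf{X}\mid\mathbf{Y})\, v \;=\; \mathrm{MDD}(v^\top \mathbf{X}\mid \mathbf{Y})^2 \;\ge\; 0, $$
and this quantity equals zero if and only if $E(v^\top\mathbf{X}\mid\mathbf{Y}) = E(v^\top\mathbf{X})$ almost surely. This is the kind of equivalence invoked in the argument below.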
Suppose that the rank of is ; then there exists an orthogonal basis matrix for , with and , such that . This implies and, equivalently, . Therefore, , which leads to .
Similarly, for any vector we have and hence . This implies that and hence, and .
∎
For the proof of Lemma 1, we need the following elementary lemma. We include its proof for completeness.
Lemma S2.
Let be the normalized th eigenvector of . Then we must have .
S5 Proof for Theorem 1
Let be an iid sample from the joint distribution of , where is the th sample. Write where
with and being iid copies of . Note that . At the sample level, we have
where , and
Proposition S3.
Suppose Condition (C1) holds. There exists a positive integer , and a finite positive constant such that when and , we have
Proof of Proposition S3: Throughout the proof, is a generic positive constant that may vary from line to line. We first find a bound for . For , let and . Note that
We shall focus on Case I, , since the other cases can be treated in the same fashion and the bound holds uniformly over all such pairs.
Case I, , write . Let and and . Define the kernel as
Then is symmetric, is a U-statistic of order two and .
Under Condition (C1), there exists a positive constant such that . When satisfies , then and
Next we decompose
where the choice of will be addressed at the end of the proof. We also decompose its population counterpart .
By Lemma C on page 200 of Serfling (1980), we derive that for , and ,
which entails that
where we have applied Markov’s inequality and Lemma A(ii) [cf. Page 200 of Serfling (1980)] in the first and third inequality above, respectively. Applying the same argument with replaced by , we can obtain
Choosing , we obtain that
(S2)
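For the reader's convenience, we recall the classical exponential bound for U-statistics with bounded kernels (see, e.g., Section 5.6 of Serfling, 1980), which is the type of inequality behind the bound for the truncated part above: if $U_n$ is a U-statistic of order $m$ with mean $\theta$ and a kernel taking values in $[a,b]$, then for every $t>0$,
$$ P\left(U_n-\theta \ge t\right) \;\le\; \exp\!\left(-\frac{2\lfloor n/m\rfloor\, t^{2}}{(b-a)^{2}}\right). $$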
Next we turn to . First of all, by the Cauchy-Schwarz inequality, . Applying the inequality and for any , we derive
Then it is easy to show that under the uniform sub-Gaussian moment assumption in Condition (C1) and the upper bound on above, we have that for some . Moreover, since , we can derive that
(S3)
Because is sub-Gaussian as assumed in Condition (C1), by Proposition 2.5.2 in Vershynin (2018), we have for some positive constant . We apply similar arguments to all the remaining terms in (S3) and derive that
Thus . If we choose such that , then , which leads to . To bound , we write
Without loss of generality, we only consider . Define . Note that
For , note that, for any , . Since is sub-Gaussian by Condition (C1), we have that is also sub-Gaussian by Proposition 2.5.2 in Vershynin (2018). Hence, it follows from Bernstein’s inequality [Theorem 2.8.1 in Vershynin (2018)] that for , ,
(S4)
Regarding , we note that . Since and are sub-Gaussian, we have that is sub-exponential (Lemma 2.7.7 in Vershynin (2018)). Again by Bernstein’s inequality, we have that for ,
(S5)
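Two standard facts from Vershynin (2018) are used repeatedly in this step, and we recall them here in generic form. Lemma 2.7.7 states that the product of two sub-Gaussian random variables is sub-exponential, with $\|XY\|_{\psi_1}\le \|X\|_{\psi_2}\|Y\|_{\psi_2}$, and Bernstein's inequality (Theorem 2.8.1) states that for independent, mean-zero, sub-exponential $Z_1,\dots,Z_n$ and every $t\ge 0$,
$$ P\left(\Big|\sum_{i=1}^{n} Z_i\Big|\ge t\right) \;\le\; 2\exp\!\left[-c\min\!\left(\frac{t^{2}}{\sum_{i}\|Z_i\|_{\psi_1}^{2}},\; \frac{t}{\max_{i}\|Z_i\|_{\psi_1}}\right)\right] $$
for an absolute constant $c>0$.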
Moreover, it is easy to see that there exists a such that , so , where we have used the fact that , which follows from a union bound argument and the uniform sub-Gaussianity assumption in Condition (C1). Hence, , where and . Choosing such that and , it then follows from (S5) that
In addition, we can use (S4) to derive that . Combining these results, we have , when . Thus if we choose , we can find a , , and such that when and , we have . Similar arguments lead to . This implies that and similarly . Therefore . In view of (S2), the desired statement follows by choosing large enough such that .
Proof of Theorem 1: Notice that for any ,
The concentration bound for has been obtained in Proposition S3, and we shall address the concentration of in the proof below. The proof for the concentration of is similar and simpler, and is thus omitted. Note that
Following the same argument as used at the beginning of the proof of Proposition S3, we only focus on the case , as the other cases can be treated in exactly the same manner.
Let and . Let
where is a kernel function for a U-statistic of order three. Following the same argument used to deal with in the proof of Proposition S3, we write , where
Correspondingly, we define , where and . By using the same argument for , we can show that
since is a third-order U-statistic. Also, we note that, by the same argument used in bounding , we can get
It follows from the Cauchy-Schwarz inequality that . By using exactly the same argument as for (S3), we can show that for some . Hence . We can choose such that . Therefore, for , we have . By setting and adopting the same argument as used in bounding , we can derive that when and for some , , and .
Combining the above results, we obtain that for and , we have
where we choose such that .
Further we note that
There exists a finite positive constant such that and so if we choose , then and . Then for and ,
Thus the conclusion follows from the above inequality and Proposition S3.
∎
S6 Proofs for Theorems 2 & 3
S6.1 Two Generic Algorithms and Their Properties
We first describe two generic algorithms and their properties that will help our proofs of Theorems 2 & 3. Consider two matrices , their estimates , and vectors . We have Algorithm 3 for the penalized eigen-decomposition of , and Algorithm 4 for the penalized generalized eigen-decomposition of . Algorithm 3 was originally proposed by Yuan and Zhang (2013), and in our Algorithm 1 we apply it repeatedly, once for each direction, to perform the penalized eigen-decomposition of the MDDM. Algorithm 4 was originally proposed as the RIFLE algorithm in Tan, Wang, Liu and Zhang (2018), and we apply it repeatedly in the same way to perform the penalized generalized eigen-decomposition for the MDDM.
1. Input: .
2. Initialize .
   (a) Iterate over until convergence:
   (b) Set .
   (c) If , set ; else .
3. Output at convergence.
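For concreteness, the following minimal sketch records the truncated power update of Yuan and Zhang (2013) as we understand it (matrix-vector product, truncation to the k largest entries in magnitude, renormalization); the variable names and the stopping rule are illustrative choices rather than part of the original algorithm.

```python
import numpy as np

def truncate(v, k):
    """Keep the k largest entries of v in magnitude, zero out the rest, renormalize."""
    keep = np.argpartition(np.abs(v), -k)[-k:]
    out = np.zeros_like(v)
    out[keep] = v[keep]
    return out / np.linalg.norm(out)

def truncated_power(A, k, x0, max_iter=100, tol=1e-8):
    """Sketch of the truncated power method (Yuan and Zhang, 2013) for a
    k-sparse leading eigenvector of a symmetric matrix A."""
    x = x0 / np.linalg.norm(x0)
    for _ in range(max_iter):
        y = A @ x
        y /= np.linalg.norm(y)
        if np.count_nonzero(y) > k:
            y = truncate(y, k)
        if np.linalg.norm(y - x) < tol:     # illustrative stopping rule
            return y
        x = y
    return x
```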
Yuan and Zhang (2013) proved a property for Algorithm 3 that is important for our proof. Assume that has a unique leading eigenvector with . Denote as the eigenvalues. We assume that there exists a constant . Also for any positive integer , define
where , with being an estimate of . We have the following proposition.
Proposition S4 (cf. Theorem 4 of Yuan and Zhang (2013)).
In Algorithm 3, let with . Assume that . Define
If for some and such that
then we either have
or for all ,
1. Input: and step size .
2. Initialize .
3. Iterate over until convergence:
   (a) Set .
   (b) .
   (c) .
   (d) .
4. Output at convergence.
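Similarly, a minimal sketch of the truncated Rayleigh flow (RIFLE) update of Tan, Wang, Liu and Zhang (2018), as we recall it, is given below; the step size and stopping rule are illustrative choices. In both of our algorithms, such a routine would be called once per direction, with a deflation step (cf. Section S3) applied between calls.

```python
import numpy as np

def rifle(A, B, k, x0, eta=0.01, max_iter=200, tol=1e-8):
    """Sketch of RIFLE (Tan, Wang, Liu and Zhang, 2018) for a k-sparse leading
    generalized eigenvector of the matrix pair (A, B)."""
    def truncate(v, k):
        keep = np.argpartition(np.abs(v), -k)[-k:]
        out = np.zeros_like(v)
        out[keep] = v[keep]
        return out / np.linalg.norm(out)

    x = x0 / np.linalg.norm(x0)
    for _ in range(max_iter):
        rho = (x @ A @ x) / (x @ B @ x)                      # generalized Rayleigh quotient
        C = np.eye(len(x)) + (eta / rho) * (A - rho * B)     # one gradient step on the quotient
        y = C @ x
        y = truncate(y / np.linalg.norm(y), k)
        if np.linalg.norm(y - x) < tol:                      # illustrative stopping rule
            return y
        x = y
    return x
```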
Based on the results in Tan, Wang, Liu and Zhang (2018), we can also derive the following useful results for Algorithm 4. We assume that the matrix pair has the leading generalized eigenvector such that . The generalized eigenvalues of are referred to as and their estimates are . We introduce the following notation:
(S6)
(S7)
(S8)
where , with being an estimate of . Also denote as the condition number of and for any index set . We consider the following assumption:
Assumption S1.
For sufficiently large , there are constants , such that and for any .
We also denote for defined in Assumption S1. We estimate with the RIFLE algorithm with the step size . We choose such that . Further, in the RIFLE algorithm, let and choose for sufficiently large . The initial value satisfies that .
Proposition S5 (Based on Theorem 1 and Corollary 1 in Tan, Wang, Liu and Zhang (2018)).
Under Assumption S1, we have the following conclusions:
1. For any such that , there exists a constant such that
   (S9)
2. Choose such that
   (S10)
   where . Input an initial vector such that , where
   (S11)
   Further denote
   (S12)
   Assume that ; then we have
   (S13)
We rewrite Proposition S5 in the following more user-friendly form. We denote
(S14)
(S15)
We have the following lemma.
Lemma S3.
Assume that Assumption S1 holds. Choose such that
Also assume that , and . Input an initial vector such that , where ,
(S16)
We have
(S17)
Proof of Lemma S3.
Also, by our definition, . It follows that . Hence, , where is defined in (S10). In addition, because , we have , where is defined in (S12). Finally, because , we have . Because , we have and thus .
By Proposition S5, we have
(S21)
(S22)
Let , and we have that . Finally, we note that . Since , we have the desired conclusion. ∎
Lemma S4.
Consider two symmetric matrices , where . For any , denote
(S23)
For any , we have that
(S24)
(S25)
where .
Proof of Lemma S4.
For (S24), note that, for any , we have . So
(S26)
(S27)
(S28)
where in the last inequality we use the fact that . Also note that, for any , we have . It follows that . Hence,
(S29)
Similarly, for (S25), we note that, for any , we have . So
(S30)
(S31)
Because for any , we have . Hence,
(S32)
The conclusion follows. ∎
S6.2 Additional technical lemmas
We first derive several lemmas concerning a parameter (either in the penalized eigen-decomposition or the penalized generalized eigen-decomposition) and its estimate . We denote , and .
Lemma S5.
If , we have that
(S33)
Proof of Lemma S5.
∎
Lemma S6.
If , we have that
(S37)
Proof of Lemma S6.
By the Cauchy-Schwarz inequality, we have that
(S38)
Note that and
(S39)
where we use the fact that . This gives the desired conclusion. ∎
Throughout the rest of this section, we also repeatedly use the fact that, for a vector , if , , we must have that .
S6.3 Proof for Theorem 2
In this subsection, we assume that , are solutions produced by Algorithm 1 for the penalized eigen-decomposition problem, and . We assume all the conditions in Theorem 2. We have the following result.
Lemma S7.
If , we have that
(S40)
Proof of Lemma S7.
Note that
(S41)
(S42)
(S43)
By Lemma S5,
(S44)
By Lemma S6,
(S45)
For , note that , which implies that
Also note that . Hence,
(S46)
(S47)
(S48)
(S49)
(S50)
∎
Lemma S8.
For the first direction , we have that and .
Proof of Lemma S8.
It is easy to see that
(S51)
Lemma S9.
If for sufficiently small , all , then .
Proof of Lemma S9.
For any vector , we have
(S54)
(S55)
By Lemma S7, we have that when . Also, . It follows that . For , we assume that without loss of generality, because otherwise we can consider the proof for . Note that
Hence, and the conclusion follows. ∎
Lemma S10.
Assume that for any , where for defined in Theorem 2. We have that .
S6.4 Proof for Theorem 3
In this subsection we prove Theorem 3, where could be different from the identity matrix. We first present a simple lemma, which is a modified version of Lemma 6 in Mai and Zhang (2019).
Lemma S11.
For two vectors and a positive definite matrix , define . We have that
Proof of Lemma S11.
We first show the latter half of the desired inequality. Without loss of generality, we assume that , because we can always normalize to satisfy these conditions.
Note that
Consequently,
Now
and we have the second half of the inequality. For the first half of the inequality, define . Apply the second half of the inequality to vectors and matrix , and we have the desired conclusion. ∎
We now show the following lemma parallel to Lemma S7. Recall that .
Lemma S12.
If , we have that
(S56)
We consider the event for , , and , where is defined as in Condition (C2) and . As a direct consequence, . Also note that . In the RIFLE algorithm, is chosen to be sufficiently close to , and the step size satisfies and
(S62)
Without loss of generality, in what follows we assume that , because otherwise we can always consider , which spans the same subspace as .
Lemma S13.
For the first direction , we have that and .
Proof of Lemma S13.
It is easy to see that
(S63)
Lemma S14.
If for sufficiently small , all , then .
Proof of Lemma S14.
It suffices to show that . Define . Then
(S64)
It follows that
(S65)
According to the proof of Lemma S13, .
For any vector , we have
(S66)
(S67)
By Lemma S12, we have that when . Also,
(S68)
(S69)
It follows that . For , note that . On one hand,
(S70)
(S71)
(S72)
(S73)
On the other hand,
(S74)
(S75)
(S76)
Further note that
(S77)
(S78)
(S79)
where . By our assumption, . For the second term, since , we have . It follows that,
(S80)
(S81)
(S82)
Because , it suffices to find a bound for .
(S83)
(S85)
(S86)
(S87)
which also implies that if . Finally, by (S65) we have the desired conclusion. ∎
Now we define . We have the following lemma.
Lemma S15.
Assume that for any . If , we have that .