
Slicing-free Inverse Regression in High-dimensional

Sufficient Dimension Reduction

Qing Mai$^{1}$, Xiaofeng Shao$^{2}$, Runmin Wang$^{3}$ and Xin Zhang$^{1}$

$^{1}$Florida State University

$^{2}$University of Illinois at Urbana-Champaign

$^{3}$Texas A&M University

Abstract: Sliced inverse regression (SIR, Li, 1991) is a pioneering work and the most recognized method in sufficient dimension reduction. While promising progress has been made in the theory and methods of high-dimensional SIR, two challenges remain in high-dimensional multivariate applications. First, choosing the number of slices in SIR is a difficult problem, and it depends on the sample size, the distribution of the variables, and other practical considerations. Second, the extension of SIR from a univariate response to a multivariate one is not trivial. Targeting the same dimension reduction subspace as SIR, we propose a new slicing-free method that provides a unified solution to sufficient dimension reduction with high-dimensional covariates and a univariate or multivariate response. We achieve this by adopting the recently developed martingale difference divergence matrix (MDDM, Lee and Shao, 2018) and penalized eigen-decomposition algorithms. To establish the consistency of our method with a high-dimensional predictor and a multivariate response, we develop a new concentration inequality for the sample MDDM around its population counterpart using theories for U-statistics, which may be of independent interest. Simulations and real data analysis demonstrate the favorable finite-sample performance of the proposed method.

Key words and phrases: Multivariate response, Sliced inverse regression, Sufficient dimension reduction, U-statistic.

1 Introduction

Sufficient dimension reduction (SDR) is an important statistical tool for data visualization, summary and inference. It extracts low-rank projections of the predictors $\mathbf{X}$ that contain all the information about the response $Y$, without specifying a parametric model beforehand. The semi-parametric nature of SDR leads to great flexibility and convenience in practice. After SDR, we can model the conditional distribution of the response given the lower-dimensional projected covariates using existing parametric or non-parametric methods. A salient feature of SDR is that the low-rank projection space can be accurately estimated at a parametric rate, with the nonparametric part treated as an infinite-dimensional nuisance parameter. For example, in multi-index models, SDR can estimate the multiple projection directions without estimating the unspecified link function.

A cornerstone of SDR is SIR (sliced inverse regression), pioneered by Li (1991), who first discovered the inner connection between the low-rank projection space and the eigen-space of $\mathrm{cov}\{\mathrm{E}(\mathbf{X}\mid Y)\}$ under suitable assumptions. SIR is performed by slicing the response $Y$ and aggregating the conditional mean of the predictor $\mathbf{X}$ given the response $Y$ within each slice. To illustrate the idea, we consider a univariate response $Y$. Slicing involves picking $K+1$ constants $-\infty=a_{0}<a_{1}<\ldots<a_{K}=\infty$ and defining a new random variable $H$, where $H=k$ if and only if $a_{k-1}<Y\leq a_{k}$. Upon centering and standardizing the covariate, i.e., $\mathbf{X}\rightarrow\widetilde{\mathbf{X}}=\boldsymbol{\Sigma}_{\mathbf{X}}^{-1/2}(\mathbf{X}-\mathrm{E}(\mathbf{X}))$, a simple eigen-decomposition can be conducted to find linear projections that explain most of the variability in the conditional expectation of the transformed predictor given the response across slices, that is, $\mathrm{cov}\{\mathrm{E}(\widetilde{\mathbf{X}}\mid H)\}$. As an important variation of SIR, sliced average variance estimation (Cook and Weisberg, 1991) utilizes the conditional variance across slices. A key step in these inverse regression methods is apparently the choice of the slicing scheme. If $Y$ is sliced too coarsely, we may not be able to capture the full dependence of $Y$ on the predictors, which could lead to a large bias in the estimation of $\mathrm{cov}\{\mathrm{E}(\widetilde{\mathbf{X}}\mid Y)\}$. If $Y$ is sliced too finely, the within-slice sample size is small, leading to large variability in estimation. Although Li (1991) and Hsing and Carroll (1992) showed that SIR can still be consistent in large samples even when the slicing scheme is chosen poorly, Zhu and Ng (1995) argued that the choice of slicing scheme is critical for achieving high estimation efficiency. To the best of our knowledge, there is no universal guidance on the choice of the slicing scheme in the literature.

Zhu et al. (2010) and Cook and Zhang (2014) showed that it is beneficial to aggregate multiple slicing schemes rather than rely on a single one. However, the proposals in the above-mentioned papers have their own limitations, as they focus exclusively on a univariate response. In many real-life problems, it is common to encounter multi-response data. Component-wise analysis may not be sufficient for multi-response data because it does not fully exploit the dependence among the response components. But slicing a multivariate response is notoriously hard due to the curse of dimensionality, a common problem in multivariate nonparametric smoothing. As the dimension of the response becomes moderately large, it is increasingly difficult to ensure that each slice contains a decent number of samples, and the estimation can be unstable in practice. Hence, it is highly desirable to develop new SDR methods that do not involve slicing.

An important line of research in the recent SDR literature is to develop SDR methods for datasets with high-dimensional covariates, as motivated by many contemporary applications. The idea of SDR is naturally attractive for high-dimensional datasets, as an effective reduction of the dimension of $\mathbf{X}$ facilitates the use of existing modeling and inference methods that are tailored for low-dimensional covariates. However, most classical SDR methods are not directly applicable to the large-$p$-small-$n$ setting, where $p$ is the dimension of $\mathbf{X}$ and $n$ is the sample size. To overcome the challenges with high-dimensional covariates, several methods have been proposed recently. Lin et al. (2018) show that the SIR estimator is consistent if and only if $\lim p/n=0$. When the dimension $p$ is larger than $n$, they propose a diagonal thresholding screening SIR (DT-SIR) algorithm and show that it consistently recovers the dimension reduction space under certain sparsity assumptions on both the covariance matrix of the predictors and the loadings of the directions. Lin et al. (2019) further introduce a simple Lasso regression method to estimate the SDR space by constructing artificial response variables from the top eigenvectors of the estimated conditional covariance matrix. Tan, Wang, Liu and Zhang (2018) propose a two-stage computational framework to solve the sparse generalized eigenvalue problem, which includes high-dimensional SDR as a special case, and propose a truncated Rayleigh flow method (namely, RIFLE) to estimate the leading generalized eigenvector. See also Lin et al. (2020) and Tan, Wang, Zhang, Liu and Cook (2018) for related recent work. These methods provide valuable tools to tackle high-dimensional SDR problems. However, all of them still rely on SIR as an important component of their methodology and involve choosing a single slicing scheme, with little guidance provided on the slicing scheme. Consequently, these methods cannot be easily applied to data with a multivariate response, and the impact of the choice of slicing scheme is unclear.

In this article, we propose a novel slicing-free SDR method in the high-dimensional setting. Our proposal is inspired by a recent nonlinear dependence metric: the martingale difference divergence matrix (MDDM, Lee and Shao, 2018). The MDDM was developed by Lee and Shao (2018) as a matrix-valued extension of the martingale difference divergence (MDD) of Shao and Zhang (2014), which measures the (conditional) mean dependence of a response variable given a covariate, and was used for dimension reduction of a multivariate time series. As recently revealed by Zhang et al. (2020), at the population level, the eigenvectors (or generalized eigenvectors) of the MDDM are always contained in the central subspace. Building on these prior works, we propose a penalized eigen-decomposition of MDDM to perform SDR in high dimensions. When the covariance matrix of the predictor is the identity matrix, we adopt the truncated power method with hard thresholding to estimate the top-$K$ eigenvectors of MDDM. For more general covariance structures, we adopt the RIFLE algorithm (Tan, Wang, Liu and Zhang, 2018) and apply it to the sample MDDM instead of the sample SIR estimator of $\mathrm{cov}\{\mathrm{E}(\mathbf{X}\mid Y)\}$. With the use of the sample MDDM, this approach is completely slicing-free and treats univariate and multivariate responses in a unified way; the practical difficulty of selecting the number of slices (especially for a multivariate response) is thus circumvented. On the theory front, we derive a concentration inequality for the sample MDDM around its population counterpart using theories for U-statistics, and obtain rigorous non-asymptotic theoretical justification for the estimated central subspaces in both settings. Simulations and real data analysis confirm that the proposed penalized MDDM approach outperforms slicing-based methods in estimation accuracy.

The rest of this paper is organized as follows. In Section 2, we give a brief review of the martingale difference divergence matrix (MDDM) and then present a new concentration inequality for the sample MDDM around its population counterpart. In Section 3, we present our general methodology of adopting MDDM in both model-free and model-based SDR problems, where we establish population-level connections between the central subspace and the (generalized) eigen-decomposition of MDDM. Algorithms for the regularized eigen-decomposition and generalized eigen-decomposition problems are proposed in Sections 4.1 and 4.2, respectively. Theoretical properties are established in Section 5. Section 6 contains numerical studies. Finally, Section 7 concludes the paper with a short discussion. The Supplementary Materials collect all additional technical details and numerical results.

2 MDDM and its concentration inequality

Consider a pair of random vectors $\mathbf{V}\in\mathbb{R}^{p}$, $\mathbf{U}\in\mathbb{R}^{q}$ such that $\mathrm{E}(\|\mathbf{U}\|^{2}+\|\mathbf{V}\|^{2})<\infty$. We use $\|\mathbf{U}\|=|\mathbf{U}|_{q}$ to denote the Euclidean norm in $\mathbb{R}^{q}$. Define

\[
\mathrm{MDDM}(\mathbf{V}\mid\mathbf{U})=-\mathrm{E}\left[\{\mathbf{V}-\mathrm{E}(\mathbf{V})\}\{\mathbf{V}^{\prime}-\mathrm{E}(\mathbf{V}^{\prime})\}^{\mathrm{T}}\|\mathbf{U}-\mathbf{U}^{\prime}\|\right]\in\mathbb{R}^{p\times p},
\]

where $(\mathbf{V}^{\prime},\mathbf{U}^{\prime})$ is an independent copy of $(\mathbf{V},\mathbf{U})$. Lee and Shao (2018) established the following key properties of $\mathrm{MDDM}(\mathbf{V}\mid\mathbf{U})$: (i) it is symmetric and positive semi-definite; (ii) $\mathrm{E}(\mathbf{V}\mid\mathbf{U})=\mathrm{E}(\mathbf{V})$ almost surely is equivalent to $\mathrm{MDDM}(\mathbf{V}\mid\mathbf{U})=0$; (iii) for any $p\times d$ matrix $\mathbf{A}$, $\mathrm{MDDM}(\mathbf{A}^{\mathrm{T}}\mathbf{V}\mid\mathbf{U})=\mathbf{A}^{\mathrm{T}}\mathrm{MDDM}(\mathbf{V}\mid\mathbf{U})\mathbf{A}$; (iv) there exist $p-d$ linearly independent combinations of $\mathbf{V}$ that are (conditionally) mean independent of $\mathbf{U}$ if and only if $\mathrm{rank}(\mathrm{MDDM}(\mathbf{V}\mid\mathbf{U}))=d$.

Given a random sample of size $n$, i.e., $(\mathbf{U}_{k},\mathbf{V}_{k})_{k=1}^{n}$, the sample estimate of $\mathrm{MDDM}(\mathbf{V}\mid\mathbf{U})$, denoted by $\mathrm{MDDM}_{n}(\mathbf{V}\mid\mathbf{U})$, is defined as

\[
\mathrm{MDDM}_{n}(\mathbf{V}\mid\mathbf{U})=-\frac{1}{n^{2}}\sum_{j,k=1}^{n}(\mathbf{V}_{j}-\overline{\mathbf{V}}_{n})(\mathbf{V}_{k}-\overline{\mathbf{V}}_{n})^{\mathrm{T}}|\mathbf{U}_{j}-\mathbf{U}_{k}|_{q}, \tag{2.1}
\]

where $\overline{\mathbf{V}}_{n}=n^{-1}\sum_{k=1}^{n}\mathbf{V}_{k}$ is the sample mean.
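To make the estimator in (2.1) concrete, the following is a minimal NumPy sketch of the sample MDDM; the function name and interface are our own illustration, not code from the paper or any released package.

```python
import numpy as np

def sample_mddm(V, U):
    """Sample MDDM_n(V | U) as in (2.1).

    V : (n, p) array of observations of V.
    U : (n, q) array of observations of U.
    Returns a (p, p) symmetric, positive semi-definite matrix.
    """
    V = np.asarray(V, dtype=float)
    U = np.asarray(U, dtype=float)
    n = V.shape[0]
    Vc = V - V.mean(axis=0)                                             # center V by the sample mean
    D = np.sqrt(((U[:, None, :] - U[None, :, :]) ** 2).sum(axis=2))     # pairwise distances |U_j - U_k|_q
    # -(1/n^2) * sum_{j,k} (V_j - Vbar)(V_k - Vbar)^T |U_j - U_k|_q
    return -(Vc.T @ D @ Vc) / n**2

# Toy check: V depends on U in the mean, so the sample MDDM is far from zero.
rng = np.random.default_rng(0)
U = rng.normal(size=(200, 2))
V = np.column_stack([U[:, 0] ** 2, rng.normal(size=200)])
M = sample_mddm(V, U)
print(M.shape, np.allclose(M, M.T))
```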

In the following, we present a concentration inequality for the sample MDDM around its population counterpart, which plays an instrumental role in our consistency proof for the proposed penalized MDDM method later. To this end, we let $\mathbf{V}=(V_{1},\cdots,V_{p})^{\mathrm{T}}\in\mathbb{R}^{p}$ and assume the following condition.

  1. (C1)

     There exist two positive constants $\sigma_{0}$ and $C_{0}$ such that
     \[
     \sup_{p}\max_{1\leq j\leq p}\mathrm{E}\{\exp(2\sigma_{0}V_{j}^{2})\}\leq C_{0},\qquad \mathrm{E}\{\exp(2\sigma_{0}\|\mathbf{U}\|_{q}^{2})\}\leq C_{0}. \tag{2.2}
     \]

For a matrix $A=(a_{ij})$, we denote its max norm as $\|A\|_{\max}=\max_{ij}|a_{ij}|$.

Theorem 1.

Suppose that Condition (C1) holds. There exist a positive integer $n_{0}=n_{0}(\sigma_{0},C_{0},q)<\infty$, a constant $\gamma=\gamma(\sigma_{0},C_{0},q)\in(0,1/2)$ and a finite positive constant $D_{0}=D_{0}(\sigma_{0},C_{0},q)<\infty$ such that when $n\geq n_{0}$ and $16>\epsilon>D_{0}n^{-\gamma}$, we have

\[
P\left(\|\mathrm{MDDM}_{n}(\mathbf{V}\mid\mathbf{U})-\mathrm{MDDM}(\mathbf{V}\mid\mathbf{U})\|_{\max}>12\epsilon\right)\leq 54p^{2}\exp\left\{-\frac{\epsilon^{2}n}{36\log^{3}(n)}\right\}.
\]

The above bound is non-asymptotic and holds for all $(n,p,\epsilon)$ as long as the stated condition is satisfied. The exponent $\epsilon^{2}n/\log^{3}(n)$ is due to the use of a truncation argument along with Hoeffding's inequality for U-statistics, and seems hard to improve. Nevertheless, we are able to achieve an exponential-type bound under a uniform sub-Gaussian condition on both $\mathbf{V}$ and $\mathbf{U}$. This result may be of independent theoretical interest. For example, in the time series dimension reduction problem studied by Lee and Shao (2018), our Theorem 1 could potentially help extend the theory there from low-dimensional multivariate time series to higher dimensions.

3 Slicing-free Inverse Regression via MDDM

3.1 Inverse regression subspace in sufficient dimension reduction

Sufficient dimension reduction (SDR) methods aim to identify the central subspace, which preserves all the information in the predictors. In this paper, we consider the SDR problem of a multivariate response $\mathbf{Y}\in\mathbb{R}^{q}$ on a multivariate predictor $\mathbf{X}\in\mathbb{R}^{p}$. The central subspace $\mathcal{S}_{\mathbf{Y}\mid\mathbf{X}}$ is defined as the intersection of all subspaces $\mathcal{S}$ such that $\mathbf{Y}\perp\!\!\!\perp\mathbf{X}\mid\mathbf{P}_{\mathcal{S}}\mathbf{X}$, where $\mathbf{P}_{\mathcal{S}}$ is the projection matrix onto $\mathcal{S}$. By construction, the central subspace $\mathcal{S}_{\mathbf{Y}\mid\mathbf{X}}$ is the smallest dimension reduction subspace that contains all the information in the conditional distribution of $\mathbf{Y}$ given $\mathbf{X}$. Many methods have been proposed for the recovery of the central subspace or a portion of it (Li, 1991; Cook and Weisberg, 1991; Bura and Cook, 2001; Chiaromonte et al., 2002; Yin and Cook, 2003; Cook and Ni, 2005; Li and Wang, 2007; Zhou and He, 2008). See Li (2018) for a comprehensive review. Although the central subspace is well defined for both univariate and multivariate responses, most existing SDR methods consider the case of a univariate response, and the extension to a multivariate response is non-trivial.

The definition of the central subspace is not very constructive, as it requires taking the intersection of all subspaces $\mathcal{S}\subseteq\mathbb{R}^{p}$ such that $\mathbf{Y}\perp\!\!\!\perp\mathbf{X}\mid\mathbf{P}_{\mathcal{S}}\mathbf{X}$. It is indeed a very ambitious goal to estimate the central subspace without specifying a model between $\mathbf{Y}$ and $\mathbf{X}$. To achieve this, we often need additional assumptions such as the linearity and coverage conditions. The linearity condition requires that, for any basis $\boldsymbol{\beta}$ of the central subspace, $\mathrm{E}(\mathbf{X}\mid\boldsymbol{\beta}^{\mathrm{T}}\mathbf{X})$ is linear in $\boldsymbol{\beta}^{\mathrm{T}}\mathbf{X}$. The linearity condition is guaranteed if $\mathbf{X}$ is elliptically contoured, and it allows us to connect the central subspace to the conditional expectation $\mathrm{E}(\mathbf{X}\mid\mathbf{Y})$. Define $\boldsymbol{\Sigma}_{\mathbf{X}}$ as the covariance of $\mathbf{X}$ and the inverse regression subspace

\[
\mathcal{S}_{\mathrm{E}(\mathbf{X}\mid\mathbf{Y})}\equiv\mathrm{span}\{\mathrm{E}(\mathbf{X}\mid\mathbf{Y}=\mathbf{y})-\mathrm{E}(\mathbf{X}):\mathbf{y}\in\mathbb{R}^{q}\ \mathrm{such\ that}\ \mathrm{E}(\mathbf{X}\mid\mathbf{Y}=\mathbf{y})\ \mathrm{exists}\}. \tag{3.3}
\]

The following property is well known and often adopted in developing SDR methods.

Proposition 1.

Under the linearity condition, we have $\mathcal{S}_{\mathrm{E}(\mathbf{X}\mid\mathbf{Y})}\subseteq\boldsymbol{\Sigma}_{\mathbf{X}}\mathcal{S}_{\mathbf{Y}\mid\mathbf{X}}\subseteq\mathbb{R}^{p}$.

The coverage condition further assumes that $\mathcal{S}_{\mathrm{E}(\mathbf{X}\mid\mathbf{Y})}=\boldsymbol{\Sigma}_{\mathbf{X}}\mathcal{S}_{\mathbf{Y}\mid\mathbf{X}}$. It follows that we can estimate the central subspace by modeling the conditional expectation of $\mathbf{X}$. Indeed, many SDR methods approximate $\mathrm{E}(\mathbf{X}\mid\mathbf{Y})$. For example, the most classical SDR method, sliced inverse regression (SIR), slices a univariate $Y$ into several categories and estimates the mean of $\mathbf{X}$ within each slice. Most later methods also follow this slice-and-estimate procedure. Apparently, the slicing scheme is important for the estimation. If there are too few slices, we may not be able to fully capture the dependence of $\mathbf{X}$ on $Y$; if there are too many slices, there are not enough samples within each slice to allow accurate estimation.

3.2 MDDM in SDR

In this section, we lay the foundation for the application of MDDM in SDR. We show that the subspace spanned by MDDM coincides with the inverse regression subspace in (3.3). In particular, we have the following Proposition 2, which was used in Zhang et al. (2020), without a proof, in the context of multivariate linear regression.

Proposition 2.

For multivariate $\mathbf{X}\in\mathbb{R}^{p}$ and $\mathbf{Y}\in\mathbb{R}^{q}$, assuming the existence of $\mathrm{E}(\mathbf{X})$, $\mathrm{E}(\mathbf{X}\mid\mathbf{Y})$ and $\mathrm{MDDM}(\mathbf{X}\mid\mathbf{Y})$, we have $\mathcal{S}_{\mathrm{E}(\mathbf{X}\mid\mathbf{Y})}=\mathrm{span}\{\mathrm{MDDM}(\mathbf{X}\mid\mathbf{Y})\}$.

Therefore, the rank of $\mathrm{MDDM}(\mathbf{X}\mid\mathbf{Y})$ is the dimension of the inverse regression subspace, and the non-trivial eigenvectors of $\mathrm{MDDM}(\mathbf{X}\mid\mathbf{Y})$ contain all the information for $\mathcal{S}_{\mathrm{E}(\mathbf{X}\mid\mathbf{Y})}$. Combining Propositions 1 and 2, we immediately have that (i) under the linearity condition, $\boldsymbol{\Sigma}_{\mathbf{X}}^{-1}\mathrm{span}\{\mathrm{MDDM}(\mathbf{X}\mid\mathbf{Y})\}\subseteq\mathcal{S}_{\mathbf{Y}\mid\mathbf{X}}$; and (ii) under the linearity and coverage conditions, $\boldsymbol{\Sigma}_{\mathbf{X}}^{-1}\mathrm{span}\{\mathrm{MDDM}(\mathbf{X}\mid\mathbf{Y})\}=\mathcal{S}_{\mathbf{Y}\mid\mathbf{X}}$.

Henceforth, we assume both the linearity and coverage conditions, which are assumed either explicitly or implicitly in inverse regression type dimension reduction methods (e.g., Li, 1991; Cook and Ni, 2005; Zhu et al., 2010; Cook and Zhang, 2014). Then the central subspace is related to the eigen-decomposition of $\mathrm{MDDM}(\mathbf{X}\mid\mathbf{Y})$. Specifically, we have the following scenarios.

If $\mathrm{cov}(\mathbf{X})=\sigma^{2}\mathbf{I}_{p}$ for some $\sigma^{2}>0$, then obviously $\mathrm{span}\{\mathrm{MDDM}(\mathbf{X}\mid\mathbf{Y})\}=\mathcal{S}_{\mathbf{Y}\mid\mathbf{X}}$. This includes single-index and multiple-index models with uncorrelated predictors. Let $K$ be the rank of $\mathrm{MDDM}(\mathbf{X}\mid\mathbf{Y})$; then the dimension of the central subspace is $K$, and the first $K$ eigenvectors of $\mathrm{MDDM}(\mathbf{X}\mid\mathbf{Y})$ span the central subspace.

If $\mathrm{cov}(\mathbf{X}\mid\mathbf{Y})=\sigma^{2}\mathbf{I}_{p}$ for some $\sigma^{2}>0$, then we have $\boldsymbol{\Sigma}_{\mathbf{X}}=\sigma^{2}\mathbf{I}_{p}+\mathrm{cov}\{\mathrm{E}(\mathbf{X}\mid\mathbf{Y})\}$. Because $\mathrm{span}[\mathrm{cov}\{\mathrm{E}(\mathbf{X}\mid\mathbf{Y})\}]=\mathcal{S}_{\mathrm{E}(\mathbf{X}\mid\mathbf{Y})}$, we can show that $\mathcal{S}_{\mathbf{Y}\mid\mathbf{X}}=\boldsymbol{\Sigma}_{\mathbf{X}}^{-1}\mathrm{span}\{\mathrm{MDDM}(\mathbf{X}\mid\mathbf{Y})\}=\mathrm{span}\{\mathrm{MDDM}(\mathbf{X}\mid\mathbf{Y})\}$. To see this, let $\mathrm{cov}\{\mathrm{E}(\mathbf{X}\mid\mathbf{Y})\}=\mathbf{U}\mathbf{U}^{\mathrm{T}}$ for some $\mathbf{U}\in\mathbb{R}^{p\times K}$; then $\mathrm{span}(\mathbf{U})=\mathrm{span}[\mathrm{cov}\{\mathrm{E}(\mathbf{X}\mid\mathbf{Y})\}]=\mathrm{span}\{\mathrm{MDDM}(\mathbf{X}\mid\mathbf{Y})\}$, and we may also write $\mathrm{MDDM}(\mathbf{X}\mid\mathbf{Y})=\mathbf{U}\boldsymbol{\Psi}\mathbf{U}^{\mathrm{T}}$ for some symmetric positive definite matrix $\boldsymbol{\Psi}\in\mathbb{R}^{K\times K}$. The result then follows by applying the Woodbury matrix identity to $\boldsymbol{\Sigma}_{\mathbf{X}}^{-1}=(\sigma^{2}\mathbf{I}_{p}+\mathbf{U}\mathbf{U}^{\mathrm{T}})^{-1}=\sigma^{-2}\mathbf{I}_{p}-\sigma^{-2}\mathbf{U}(\sigma^{2}\mathbf{I}_{K}+\mathbf{U}^{\mathrm{T}}\mathbf{U})^{-1}\mathbf{U}^{\mathrm{T}}$. The non-trivial eigenvectors of $\mathrm{MDDM}(\mathbf{X}\mid\mathbf{Y})$ again span the central subspace.

For a general covariance structure, the $K$-dimensional central subspace $\mathcal{S}_{\mathbf{Y}\mid\mathbf{X}}=\boldsymbol{\Sigma}_{\mathbf{X}}^{-1}\mathrm{span}\{\mathrm{MDDM}(\mathbf{X}\mid\mathbf{Y})\}$ can be obtained via a generalized eigen-decomposition. Specifically, consider the generalized eigenvalue problem

\[
\mathrm{MDDM}(\mathbf{X}\mid\mathbf{Y})\mathbf{v}_{i}=\varphi_{i}\boldsymbol{\Sigma}_{\mathbf{X}}\mathbf{v}_{i},\quad \varphi_{i}\geq 0,\ \mathbf{v}_{i}\in\mathbb{R}^{p}, \tag{3.4}
\]

where $\mathbf{v}_{i}^{\mathrm{T}}\boldsymbol{\Sigma}_{\mathbf{X}}\mathbf{v}_{j}=0$ for $i\neq j$. Then, similar to (Li, 2007; Chen et al., 2010), it is straightforward to show that the leading generalized eigenvectors span the central subspace, $\mathcal{S}_{\mathbf{Y}\mid\mathbf{X}}=\mathrm{span}(\mathbf{v}_{1},\dots,\mathbf{v}_{K})$.
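In low dimensions, (3.4) can be solved directly with a standard generalized symmetric eigensolver. The sketch below is our own illustration (the function name and the toy single-index model are ours): it pairs the sample MDDM with the sample covariance and calls scipy.linalg.eigh.

```python
import numpy as np
from scipy.linalg import eigh

def central_subspace_lowdim(X, Y, K):
    """Estimate span(v_1, ..., v_K) from (3.4) when p is small relative to n."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    Sigma = Xc.T @ Xc / n                                               # sample covariance of X
    D = np.sqrt(((Y[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2))     # pairwise |Y_j - Y_k|
    M = -(Xc.T @ D @ Xc) / n**2                                         # sample MDDM(X | Y)
    # Generalized eigenproblem M v = phi * Sigma v; eigh returns eigenvalues in ascending order.
    phi, V = eigh(M, Sigma)
    return V[:, ::-1][:, :K]                                            # top-K generalized eigenvectors

# Toy usage with a single-index model: the estimate is roughly proportional to beta (up to sign/scale).
rng = np.random.default_rng(1)
n, p = 500, 10
X = rng.normal(size=(n, p))
beta = np.zeros(p); beta[:3] = 1 / np.sqrt(3)
Y = (np.sin(X @ beta) + 0.1 * rng.normal(size=n))[:, None]
print(central_subspace_lowdim(X, Y, K=1).ravel().round(2))
```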

Existing works in SDR often focus on the eigen-decomposition or the generalized eigen-decomposition of $\mathrm{cov}\{\mathrm{E}(\mathbf{X}\mid Y=y)\}$, where non-parametric estimates of $\mathrm{E}(\mathbf{X}\mid Y=y)$ are obtained by slicing the support of the univariate response $Y$. Compared to these approaches, the MDDM approach requires no tuning parameter selection (i.e., no slicing scheme needs to be specified). Moreover, the high-dimensional theoretical study of MDDM is easier and does not require additional assumptions on the conditional mean function $\mathrm{E}(\mathbf{X}\mid\mathbf{Y})$, such as smoothness of the empirical mean function of $\mathbf{X}$ given $Y$ (e.g., the sliced stable condition in Lin et al. (2018)).

3.3 MDDM for model-based SDR

So far, we have discussed model-free SDR. Another important research area in SDR is model-based methods, which provide invaluable intuition for the use of inverse regression estimation under the assumption that the conditional distribution of $\mathbf{X}\mid\mathbf{Y}$ is normal. In this section, we consider the principal fitted component (PFC) model, which was discussed in detail in Cook and Forzani (2009) and Cook (2007), and generalize it from a univariate response to a multivariate response. We argue that the (generalized) eigen-decomposition of MDDM is potentially advantageous over likelihood-based approaches under the PFC model. This is somewhat surprising but reasonable, considering that the advantages of MDDM over least squares and likelihood-based estimation were demonstrated in Zhang et al. (2020) under the multivariate linear model.

Let $\mathbf{X}_{\mathbf{y}}\sim\mathbf{X}\mid(\mathbf{Y}=\mathbf{y})$ denote the conditional variable; then the PFC model is

\[
\mathbf{X}_{\mathbf{y}}=\boldsymbol{\mu}+\boldsymbol{\Gamma}\boldsymbol{\nu}_{\mathbf{y}}+\boldsymbol{\varepsilon},\quad\boldsymbol{\varepsilon}\sim N(0,\boldsymbol{\Delta}), \tag{3.5}
\]

where $\boldsymbol{\Gamma}\in\mathbb{R}^{p\times K}$, $K<p$, is a non-stochastic orthogonal matrix, and $\boldsymbol{\nu}_{\mathbf{y}}\in\mathbb{R}^{K}$ is a latent variable that depends on $\mathbf{y}$. The latent variable $\boldsymbol{\nu}_{\mathbf{y}}$ is then fitted as $\boldsymbol{\nu}_{\mathbf{y}}=\boldsymbol{\alpha}\mathbf{f}_{\mathbf{y}}$ with some user-specified functions $\mathbf{f}_{\mathbf{y}}=(f_{1}(\mathbf{y}),\ldots,f_{m}(\mathbf{y}))^{\mathrm{T}}\in\mathbb{R}^{m}$, $m\geq K$, that map the $q$-dimensional response to an $m$-dimensional vector. In the univariate PFC model, $q=1$, so the $m$ functions can be viewed as an expansion of the response (similar to slicing). For our multivariate extension of the PFC model, there is no requirement that $m\geq q$. The PFC model can be written as

\[
\mathbf{X}_{\mathbf{y}}=\boldsymbol{\mu}+\boldsymbol{\Gamma}\boldsymbol{\alpha}\mathbf{f}_{\mathbf{y}}+\boldsymbol{\varepsilon}, \tag{3.6}
\]

where $\boldsymbol{\Gamma}$ and $\boldsymbol{\alpha}$ are estimated similarly to multivariate reduced-rank regression with $\mathbf{X}\in\mathbb{R}^{p}$ being the response and $\mathbf{f}_{\mathbf{y}}\in\mathbb{R}^{m}$ being the predictor. Finally, the central subspace under this PFC model is $\boldsymbol{\Delta}^{-1}\mathrm{span}(\boldsymbol{\Gamma})$, which simplifies to $\mathrm{span}(\boldsymbol{\Gamma})$ if we further assume isotropic error (i.e., the isotropic PFC model) $\boldsymbol{\Delta}=\mathrm{cov}(\mathbf{X}\mid\mathbf{Y})=\sigma^{2}\mathbf{I}_{p}$.

For the PFC model, our MDDM approach is the same as its model-free counterpart, and has two main advantages over likelihood-based PFC estimation: (i) there is no need to specify the functions $\mathbf{f}_{\mathbf{y}}$, and thus no risk of mis-specification; (ii) the extension to the high-dimensional setting is much more straightforward. Moreover, under the isotropic PFC model, the central subspace $\mathcal{S}_{\mathbf{Y}\mid\mathbf{X}}=\mathrm{span}(\boldsymbol{\Gamma})$ is spanned exactly by the first $K$ eigenvectors of $\mathrm{MDDM}(\mathbf{X}\mid\mathbf{Y})$.
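As a quick illustration of the isotropic PFC model (3.5)-(3.6), the snippet below generates data and checks that the top eigenvector of the sample MDDM aligns with $\mathrm{span}(\boldsymbol{\Gamma})$. The specific choices of $p$, $K$, $\mathbf{f}_{\mathbf{y}}$ and $\boldsymbol{\alpha}$ are arbitrary and ours alone; this is a sketch, not the paper's simulation code.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, K = 400, 30, 1
Gamma = np.zeros((p, K)); Gamma[:5, 0] = 1 / np.sqrt(5)       # sparse orthonormal loading
Y = rng.normal(size=(n, 1))
f_y = np.column_stack([Y[:, 0], Y[:, 0] ** 2])                # user-specified f_y = (y, y^2)
alpha = np.array([[1.0, 0.5]])                                # K x m coefficient matrix
X = f_y @ alpha.T @ Gamma.T + 0.5 * rng.normal(size=(n, p))   # isotropic PFC: X_y = Gamma alpha f_y + eps

# Top eigenvector of the sample MDDM(X | Y) should be close to the column of Gamma.
Xc = X - X.mean(axis=0)
D = np.abs(Y - Y.T)                                           # |Y_j - Y_k| for a univariate response
M = -(Xc.T @ D @ Xc) / n**2
w, V = np.linalg.eigh(M)
print(abs(V[:, -1] @ Gamma[:, 0]))                            # close to 1 when the signal dominates
```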

4 Estimation

4.1 Penalized decomposition of MDDM

Based on the results in the last section, a penalized eigen-decomposition of MDDM can be used to estimate the central subspace in high dimensions when the covariance $\boldsymbol{\Sigma}_{\mathbf{X}}$ or the conditional covariance $\mathrm{cov}(\mathbf{X}\mid\mathbf{Y})$ is proportional to the identity matrix $\mathbf{I}_{p}$. We consider the construction of such an estimate. It is worth mentioning that the penalized decomposition of MDDM we develop here is immediately applicable to the dimension reduction of multivariate stationary time series in Lee and Shao (2018), which is beyond the scope of this article. Moreover, it is well known that $\boldsymbol{\Sigma}_{\mathbf{X}}^{-1}$ is not easy to estimate in high dimensions. Hence, even for a general covariance structure, the eigen-decomposition of MDDM provides an estimate of the inverse regression subspace (which may differ from the central subspace) that is useful for exploratory data analysis (e.g., detecting and visualizing a non-linear mean function).

As such, we consider the estimation of the eigenvectors of $\mathrm{MDDM}(\mathbf{X}\mid\mathbf{Y})$. We assume that $\mathrm{MDDM}(\mathbf{X}\mid\mathbf{Y})$ has $K$ nontrivial eigenvectors, denoted by $\boldsymbol{\beta}_{1},\ldots,\boldsymbol{\beta}_{K}$. We use the shorthand notation $\mathbf{M}=\mathrm{MDDM}(\mathbf{X}\mid\mathbf{Y})$. Also, we note that, given the first $k-1$ eigenvectors, $\boldsymbol{\beta}_{k}$ is the top eigenvector of $\mathbf{M}_{k}$, where $\mathbf{M}_{k}=\mathbf{M}-\sum_{l<k}(\boldsymbol{\beta}_{l}^{\mathrm{T}}\mathbf{M}\boldsymbol{\beta}_{l})\boldsymbol{\beta}_{l}\boldsymbol{\beta}_{l}^{\mathrm{T}}$.

It is well known that the eigenvectors cannot be accurately estimated in high dimensions without additional assumptions. We adopt the popular sparsity assumption that many entries in $\boldsymbol{\beta}_{k}$ are zero. To estimate these sparse eigenvectors, denote $\widehat{\mathbf{M}}_{1}=\mathrm{MDDM}_{n}(\mathbf{X}\mid\mathbf{Y})$, where the sample $\mathrm{MDDM}_{n}$ is defined in (2.1). We find $\widehat{\boldsymbol{\beta}}_{k}$, $k=1,\ldots,K$, as follows:

\[
\widehat{\boldsymbol{\beta}}_{k}=\arg\max_{\boldsymbol{\beta}}\ \boldsymbol{\beta}^{\mathrm{T}}\widehat{\mathbf{M}}_{k}\boldsymbol{\beta}\quad\mbox{s.t.}\quad\boldsymbol{\beta}^{\mathrm{T}}\boldsymbol{\beta}=1,\ \|\boldsymbol{\beta}\|_{0}\leq s, \tag{4.7}
\]

where $\widehat{\mathbf{M}}_{1}=\mathrm{MDDM}_{n}(\mathbf{X}\mid\mathbf{Y})$, $\widehat{\mathbf{M}}_{k}=\widehat{\mathbf{M}}_{1}-\sum_{l<k}\delta_{l}\widehat{\boldsymbol{\beta}}_{l}\widehat{\boldsymbol{\beta}}_{l}^{\mathrm{T}}$ for $k>1$ with $\delta_{l}=\widehat{\boldsymbol{\beta}}_{l}^{\mathrm{T}}\widehat{\mathbf{M}}_{1}\widehat{\boldsymbol{\beta}}_{l}$, and $s$ is a tuning parameter.

We solve the above problem by combining the truncated power method with hard thresholding. For a vector $\mathbf{v}\in\mathbb{R}^{p}$ and a positive integer $s$, denote by $v_{s}^{*}$ the $s$-th largest value of $|v_{j}|$, $j=1,\ldots,p$. The hard-thresholding operator is $\mathrm{HT}(\mathbf{v},s)=(v_{1}I(|v_{1}|\geq v_{s}^{*}),\ldots,v_{p}I(|v_{p}|\geq v_{s}^{*}))^{\mathrm{T}}$, which sets the $p-s$ smallest elements (in magnitude) of $\mathbf{v}$ to zero. We solve (4.7) by Algorithm 1, where the initialization $\widehat{\boldsymbol{\beta}}_{1}^{(0)}$ may be randomly generated. Note that Yuan and Zhang (2013) proposed Algorithm 1 to perform principal component analysis through penalized eigen-decomposition of the sample covariance. A code sketch of the operator $\mathrm{HT}(\cdot,s)$ is given below.
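The hard-thresholding operator takes only a few lines of NumPy; the helper below is our own sketch (ties at the $s$-th largest magnitude keep all tied entries, consistent with the inequality $|v_{j}|\geq v_{s}^{*}$).

```python
import numpy as np

def hard_threshold(v, s):
    """HT(v, s): keep entries with magnitude >= the s-th largest |v_j|, zero out the rest."""
    v = np.asarray(v, dtype=float)
    vs = np.sort(np.abs(v))[::-1][s - 1]        # v_s^*, the s-th largest absolute value
    return np.where(np.abs(v) >= vs, v, 0.0)

print(hard_threshold([0.1, -2.0, 0.5, 1.5], s=2))   # [ 0.  -2.   0.   1.5]
```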

In our algorithm, we require a pre-specified sparsity level $s$ and subspace dimension $K$. In theory, we show that our estimators of $\boldsymbol{\beta}_{k}$, $k=1,\dots,K$, are all consistent for their population counterparts when the sparsity level $s$ is sufficiently large (i.e., larger than the population sparsity level) and the number of directions $K$ is no bigger than the true dimension of the central subspace. Therefore, our method is flexible in the sense that the pre-specified $s$ and $K$ do not have to be exactly correct. In practice, especially in exploratory data analysis, the number of sequentially extracted directions is often set to be small (i.e., $K=1,2$ or $3$), while the determination of the true central subspace dimension is a separate and important research topic in SDR (e.g., Bura and Yang, 2011; Luo and Li, 2016) and is beyond the scope of this paper. Moreover, the pre-specified sparsity level $s$ combined with $\ell_{0}$ regularization is potentially convenient for post-dimension-reduction inference (Kim et al., 2020), as seen in the post-selection inference for canonical correlation analysis that is done over subsets of variables of pre-specified cardinality (McKeague and Zhang, 2020).

As pointed out by a referee, other sparse principal component analysis (PCA) methods can potentially be applied to decompose MDDM. We choose to extend the algorithm in Yuan and Zhang (2013) to facilitate computation and theoretical development. For computationally efficient sparse PCA methods such as Zou et al. (2006) and Witten et al. (2009), the theoretical properties are unfortunately unknown; hence, we expect the theoretical study of their MDDM variants to be very challenging. On the other hand, for theoretically justified sparse PCA methods such as Vu and Lei (2013) and Cai et al. (2013), the computation is less efficient.

  1. Input: $s$, $K$, $\widehat{\mathbf{M}}_{1}=\widehat{\mathbf{M}}=\mathrm{MDDM}_{n}(\mathbf{X}\mid\mathbf{Y})$.

  2. Initialize $\widehat{\boldsymbol{\beta}}_{1}^{(0)}$.

  3. For $k=1,\ldots,K$, do

     (a) Iterate over $t$ until convergence:

         i. Set $\widehat{\boldsymbol{\beta}}_{k}^{(t)}=\widehat{\mathbf{M}}_{k}\widehat{\boldsymbol{\beta}}_{k}^{(t-1)}$.

         ii. If $\|\widehat{\boldsymbol{\beta}}_{k}^{(t)}\|_{0}\leq s$, set $\widehat{\boldsymbol{\beta}}_{k}^{(t)}=\widehat{\boldsymbol{\beta}}_{k}^{(t)}/\|\widehat{\boldsymbol{\beta}}_{k}^{(t)}\|_{2}$; otherwise set $\widehat{\boldsymbol{\beta}}_{k}^{(t)}=\mathrm{HT}(\widehat{\boldsymbol{\beta}}_{k}^{(t)},s)/\|\mathrm{HT}(\widehat{\boldsymbol{\beta}}_{k}^{(t)},s)\|_{2}$.

     (b) Set $\widehat{\boldsymbol{\beta}}_{k}=\widehat{\boldsymbol{\beta}}_{k}^{(t)}$ at convergence and $\widehat{\mathbf{M}}_{k+1}=\widehat{\mathbf{M}}_{k}-(\widehat{\boldsymbol{\beta}}_{k}^{\mathrm{T}}\widehat{\mathbf{M}}\widehat{\boldsymbol{\beta}}_{k})\,\widehat{\boldsymbol{\beta}}_{k}\widehat{\boldsymbol{\beta}}_{k}^{\mathrm{T}}$.

  4. Output $\widehat{\mathcal{S}}_{\mathbf{Y}\mid\mathbf{X}}=\mathrm{span}(\widehat{\boldsymbol{\beta}}_{1},\ldots,\widehat{\boldsymbol{\beta}}_{K})$.

Algorithm 1 Penalized eigen-decomposition of MDDM.
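For concreteness, here is a minimal NumPy sketch of Algorithm 1 applied to a sample MDDM matrix (e.g., computed as in Section 2). The random initialization, the change-tolerance stopping rule and the function names are our own choices, so this is an illustration rather than a released implementation.

```python
import numpy as np

def hard_threshold(v, s):
    """Keep the s largest-magnitude entries of v and zero out the rest."""
    vs = np.sort(np.abs(v))[::-1][s - 1]
    return np.where(np.abs(v) >= vs, v, 0.0)

def penalized_eigen_mddm(M_hat, K, s, max_iter=500, tol=1e-8, seed=0):
    """Algorithm 1: truncated power iterations with hard thresholding plus deflation."""
    rng = np.random.default_rng(seed)
    p = M_hat.shape[0]
    M_k = M_hat.copy()
    betas = []
    for _ in range(K):
        beta = rng.normal(size=p)
        beta /= np.linalg.norm(beta)                      # random initialization
        for _ in range(max_iter):
            b = M_k @ beta                                # power step: beta^(t) = M_k beta^(t-1)
            if np.count_nonzero(b) > s:
                b = hard_threshold(b, s)                  # enforce ||beta||_0 <= s
            b /= np.linalg.norm(b)
            if np.linalg.norm(b - beta) < tol:
                beta = b
                break
            beta = b
        betas.append(beta)
        delta = beta @ M_hat @ beta                       # deflation: M_{k+1} = M_k - delta * beta beta^T
        M_k = M_k - delta * np.outer(beta, beta)
    return np.column_stack(betas)                         # columns span the estimated central subspace
```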

4.2 Generalized eigenvalue problems with MDDM

Now we consider a general (arbitrary) covariance structure $\boldsymbol{\Sigma}_{\mathbf{X}}$. We continue to use $\boldsymbol{\beta}_{1},\ldots,\boldsymbol{\beta}_{K}$ to denote the nontrivial generalized eigenvectors spanning $\boldsymbol{\Sigma}_{\mathbf{X}}^{-1}\mathrm{span}\{\mathrm{MDDM}(\mathbf{X}\mid\mathbf{Y})\}$, so that the central subspace is spanned by the $\boldsymbol{\beta}$'s. Again, we assume that these eigenvectors are sparse. In principle, we could assume that $\boldsymbol{\Sigma}_{\mathbf{X}}^{-1}$ is also sparse and construct its estimate accordingly. However, $\boldsymbol{\Sigma}_{\mathbf{X}}^{-1}$ is a nuisance parameter for our ultimate goal, and additional assumptions on it may unnecessarily limit the applicability of our method. Hence, we take a different approach as follows.

To avoid estimating $\boldsymbol{\Sigma}_{\mathbf{X}}^{-1}$, we note that $\boldsymbol{\beta}_{1},\ldots,\boldsymbol{\beta}_{K}$ can also be viewed as the generalized eigenvectors defined as follows, which is equivalent to (3.4),

\[
\boldsymbol{\beta}_{k}=\arg\max_{\boldsymbol{\beta}}\ \boldsymbol{\beta}^{\mathrm{T}}\mathbf{M}\boldsymbol{\beta},\quad\mbox{s.t. }\boldsymbol{\beta}^{\mathrm{T}}\boldsymbol{\Sigma}_{\mathbf{X}}\boldsymbol{\beta}=1,\ \boldsymbol{\beta}_{l}^{\mathrm{T}}\boldsymbol{\Sigma}_{\mathbf{X}}\boldsymbol{\beta}=0\mbox{ for any }l<k. \tag{4.8}
\]

Directly solving the generalized eigen-decomposition problem in (4.8) is not easy if we want to further impose penalties, because it is difficult to satisfy the orthogonality constraints. Therefore, we consider another form of (4.8) that does not involve the orthogonality constraints. This alternative form is based on the following lemma.

Lemma 1.

Let $\lambda_{j}=\boldsymbol{\beta}_{j}^{\mathrm{T}}\mathbf{M}\boldsymbol{\beta}_{j}$ and $\mathbf{M}_{k}=\mathbf{M}-\boldsymbol{\Sigma}_{\mathbf{X}}(\sum_{j<k}\lambda_{j}\boldsymbol{\beta}_{j}\boldsymbol{\beta}_{j}^{\mathrm{T}})\boldsymbol{\Sigma}_{\mathbf{X}}$. We have

\[
\boldsymbol{\beta}_{k}=\arg\max_{\boldsymbol{\beta}}\ \boldsymbol{\beta}^{\mathrm{T}}\mathbf{M}_{k}\boldsymbol{\beta},\quad\mbox{s.t. }\boldsymbol{\beta}^{\mathrm{T}}\boldsymbol{\Sigma}_{\mathbf{X}}\boldsymbol{\beta}=1. \tag{4.9}
\]

Motivated by Lemma 1, we consider the penalized problem $\widehat{\boldsymbol{\beta}}_{k}=\arg\max_{\boldsymbol{\beta}}\boldsymbol{\beta}^{\mathrm{T}}\widehat{\mathbf{M}}_{k}\boldsymbol{\beta}$ subject to $\boldsymbol{\beta}^{\mathrm{T}}\boldsymbol{\Sigma}_{\mathbf{X}}\boldsymbol{\beta}=1$ and $\|\boldsymbol{\beta}\|_{0}\leq s$, where $\widehat{\mathbf{M}}_{1}=\mathrm{MDDM}_{n}(\mathbf{X}\mid\mathbf{Y})$ and $\widehat{\mathbf{M}}_{k}=\widehat{\mathbf{M}}_{1}-\widehat{\boldsymbol{\Sigma}}_{\mathbf{X}}\left(\sum_{l<k}\delta_{l}\widehat{\boldsymbol{\beta}}_{l}\widehat{\boldsymbol{\beta}}_{l}^{\mathrm{T}}\right)\widehat{\boldsymbol{\Sigma}}_{\mathbf{X}}$ for $k>1$ with $\delta_{l}=\widehat{\boldsymbol{\beta}}_{l}^{\mathrm{T}}\widehat{\mathbf{M}}\widehat{\boldsymbol{\beta}}_{l}$, and $s$ is a tuning parameter. We adopt the RIFLE algorithm in Tan, Wang, Liu and Zhang (2018) to solve this problem; see the details in Algorithm 2. In our simulation studies, we considered a randomly generated initial value $\widehat{\boldsymbol{\beta}}_{1}^{(0)}$ and a fixed step size $\eta=1$, and observed reasonably good performance.

Although Algorithm 2 is a generalization of the RIFLE algorithm in Tan, Wang, Liu and Zhang (2018), there are important differences between the two. On one hand, the RIFLE algorithm only extracts the first generalized eigenvector, whereas Algorithm 2 is capable of estimating multiple generalized eigenvectors by properly deflating the MDDM. In sufficient dimension reduction problems, the central subspace often has a structural dimension greater than 1, and it is necessary to find more than one generalized eigenvector. Hence, Algorithm 2 is potentially more useful than the RIFLE algorithm in practice. On the other hand, the usefulness of the RIFLE algorithm has been demonstrated in several statistical applications, including sparse sliced inverse regression. Here, Algorithm 2 decomposes the MDDM, which is the first time the penalized generalized eigenvector problem has been used to perform sufficient dimension reduction in a slicing-free manner in high dimensions. A brief analysis of the computational complexity is included in Section S3 of the Supplementary Materials.

  1. Input: $s$, $K$, $\widehat{\mathbf{M}}_{1}=\widehat{\mathbf{M}}$ and step size $\eta>0$.

  2. Initialize $\widehat{\boldsymbol{\beta}}_{1}^{(0)}$.

  3. For $k=1,\ldots,K$, do

     (a) Iterate over $t$ until convergence:

         i. Set $\rho^{(t-1)}=\dfrac{(\widehat{\boldsymbol{\beta}}_{k}^{(t-1)})^{\mathrm{T}}\widehat{\mathbf{M}}_{k}\widehat{\boldsymbol{\beta}}_{k}^{(t-1)}}{(\widehat{\boldsymbol{\beta}}_{k}^{(t-1)})^{\mathrm{T}}\widehat{\boldsymbol{\Sigma}}_{\mathbf{X}}\widehat{\boldsymbol{\beta}}_{k}^{(t-1)}}$.

         ii. $\mathbf{C}=\mathbf{I}+(\eta/\rho^{(t-1)})\cdot(\widehat{\mathbf{M}}_{k}-\rho^{(t-1)}\widehat{\boldsymbol{\Sigma}}_{\mathbf{X}})$.

         iii. $\widetilde{\boldsymbol{\beta}}_{k}^{(t)}=\mathbf{C}\widehat{\boldsymbol{\beta}}_{k}^{(t-1)}/\|\mathbf{C}\widehat{\boldsymbol{\beta}}_{k}^{(t-1)}\|_{2}$.

         iv. $\widehat{\boldsymbol{\beta}}_{k}^{(t)}=\mathrm{HT}(\widetilde{\boldsymbol{\beta}}_{k}^{(t)},s)/\|\mathrm{HT}(\widetilde{\boldsymbol{\beta}}_{k}^{(t)},s)\|_{2}$.

     (b) Set $\widetilde{\boldsymbol{\beta}}_{k}=\widehat{\boldsymbol{\beta}}_{k}^{(t)}$ at convergence and scale it to obtain $\widehat{\boldsymbol{\beta}}_{k}=\widetilde{\boldsymbol{\beta}}_{k}/\sqrt{\widetilde{\boldsymbol{\beta}}_{k}^{\mathrm{T}}\widehat{\boldsymbol{\Sigma}}_{\mathbf{X}}\widetilde{\boldsymbol{\beta}}_{k}}$.

     (c) Set $\widehat{\mathbf{M}}_{k+1}=\widehat{\mathbf{M}}_{k}-\widehat{\boldsymbol{\Sigma}}_{\mathbf{X}}(\widehat{\boldsymbol{\beta}}_{k}^{\mathrm{T}}\widehat{\mathbf{M}}\widehat{\boldsymbol{\beta}}_{k})\,\widehat{\boldsymbol{\beta}}_{k}\widehat{\boldsymbol{\beta}}_{k}^{\mathrm{T}}\widehat{\boldsymbol{\Sigma}}_{\mathbf{X}}$.

  4. Output $\widehat{\mathcal{S}}_{\mathbf{Y}\mid\mathbf{X}}=\mathrm{span}(\widehat{\boldsymbol{\beta}}_{1},\ldots,\widehat{\boldsymbol{\beta}}_{K})$.

Algorithm 2 Generalized eigen-decomposition of MDDM.
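The following NumPy sketch mirrors Algorithm 2 under simple defaults (random initialization, a fixed step size, a change-tolerance stopping rule); it assumes the generalized Rayleigh quotient stays positive along the iterations and reuses a hard-thresholding helper like the one sketched in Section 4.1. It is our own illustration, not a released implementation.

```python
import numpy as np

def hard_threshold(v, s):
    vs = np.sort(np.abs(v))[::-1][s - 1]
    return np.where(np.abs(v) >= vs, v, 0.0)

def generalized_eigen_mddm(M_hat, Sigma_hat, K, s, eta=1.0, max_iter=500, tol=1e-8, seed=0):
    """Algorithm 2: RIFLE-style updates with deflation for K sparse generalized eigenvectors."""
    rng = np.random.default_rng(seed)
    p = M_hat.shape[0]
    M_k = M_hat.copy()
    betas = []
    for _ in range(K):
        beta = rng.normal(size=p)
        beta /= np.linalg.norm(beta)
        for _ in range(max_iter):
            rho = (beta @ M_k @ beta) / (beta @ Sigma_hat @ beta)   # generalized Rayleigh quotient
            C = np.eye(p) + (eta / rho) * (M_k - rho * Sigma_hat)   # ascent step on the quotient
            b = C @ beta
            b /= np.linalg.norm(b)
            b = hard_threshold(b, s)                                # enforce ||beta||_0 <= s
            b /= np.linalg.norm(b)
            if np.linalg.norm(b - beta) < tol:
                beta = b
                break
            beta = b
        beta_hat = beta / np.sqrt(beta @ Sigma_hat @ beta)          # scale so beta^T Sigma_hat beta = 1
        betas.append(beta_hat)
        delta = beta_hat @ M_hat @ beta_hat                         # deflate with Sigma_hat on both sides
        M_k = M_k - delta * (Sigma_hat @ np.outer(beta_hat, beta_hat) @ Sigma_hat)
    return np.column_stack(betas)
```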

5 Theoretical properties

In this section, we consider theoretical properties of the generalized eigenvectors of $(\mathrm{MDDM}(\mathbf{X}\mid\mathbf{Y}),\boldsymbol{\Sigma}_{\mathbf{X}})$. Recall that, if we know that $\boldsymbol{\Sigma}_{\mathbf{X}}=\mathbf{I}$, the generalized eigenvectors reduce to eigenvectors and can be estimated by Algorithm 1. If we do not have any information about $\boldsymbol{\Sigma}_{\mathbf{X}}$, we can find the generalized eigenvectors with Algorithm 2. Either way, we let $\boldsymbol{\beta}_{k}$, $k=1,\ldots,K$, be the first $K$ (generalized) eigenvectors of $\mathrm{MDDM}(\mathbf{X}\mid\mathbf{Y})$. Throughout the proof, we let $C$ denote a generic constant that can vary from line to line. We show the consistency of $\widehat{\boldsymbol{\beta}}_{k}$ by proving that $\eta_{k}=|\sin\Theta(\widehat{\boldsymbol{\beta}}_{k},\boldsymbol{\beta}_{k})|\leq Cs\epsilon$. We assume that $K$ is fixed and $s\epsilon\leq 1$. Recall that we define $\lambda_{j}=\boldsymbol{\beta}_{j}^{\mathrm{T}}\mathbf{M}\boldsymbol{\beta}_{j}$ as the (generalized) eigenvalue. Further define $d=\max_{k=1}^{K}\{\|\boldsymbol{\beta}_{k}\|_{0}\}$. When we study Algorithm 1 or Algorithm 2, we assume that $s=d+2s^{\prime}$, where $s^{\prime}=Cd$ for a sufficiently large $C$. To apply the concentration inequality for MDDM, we restate below Condition (C1) in terms of $\mathbf{X}$ and $\mathbf{Y}$ as Condition (C1'), along with other suitable conditions:

  1. (C1’)

    There exist two positive constants σ0\sigma_{0} and C0C_{0} such that E{exp(2σ0𝐘q2)}C0\mathrm{E}\{\exp(2\sigma_{0}\|\mathbf{Y}\|_{q}^{2})\}\leq C_{0} and suppmax1jpE{exp(2σ0Xj2)}C0\sup_{p}\max_{1\leq j\leq p}\mathrm{E}\{\exp(2\sigma_{0}X_{j}^{2})\}\leq C_{0}.

  2. (C2)

    There exist Δ>0\Delta>0 such that mink=1,,K(λkλk+1)Δ\min_{k=1,\ldots,K}(\lambda_{k}-\lambda_{k+1})\geq\Delta.

  3. (C3)

    There exists constants U,LU,L that do not depend on n,pn,p such that LλKλ1UL\leq\lambda_{K}\leq\lambda_{1}\leq U.

  4. (C4)

    As nn\rightarrow\infty, dn1/2(logp)1/2(logn)3/20dn^{-1/2}{(\log p)^{1/2}(\log{n})^{3/2}}\rightarrow 0.

Condition (C2) guarantees that the eigenvectors are well defined. Condition (C3) imposes bounds on the eigenvalues of $\mathbf{M}$; researchers often impose similar assumptions on the covariance matrix to achieve consistent estimation. Condition (C4) restricts the growth rate of $p,d$ with respect to $n$. Note that $d$ is the population sparsity level of the $\boldsymbol{\beta}_{k}$'s, while $s$ is the user-specified sparsity level in Algorithms 1 and 2. If we fix $d$, the dimension is allowed to grow at the rate $\log p=o(n\log^{-3}n)$. When we allow $d$ to diverge, we require it to diverge more slowly than $\{n/(\log p\log^{3}n)\}^{1/2}$.

We present the non-asymptotic results for Algorithm 1 in the following theorem, where the constants $D_{1},D_{2},\sigma_{0},\gamma,C_{0}$ are defined previously in Theorem 1 under Condition (C1).

Theorem 2.

Assume that Conditions (C1’), (C2) & (C3) hold and 𝚺𝐗=𝐈{\bm{\Sigma}}_{\mathbf{X}}=\mathbf{I}. Further assume that, there exists θ(0,1/2)\theta\in(0,1/2), such that for k=1,,Kk=1,\ldots,K, we have (𝛃^k0)T𝛃k2θ,(\widehat{\boldsymbol{\beta}}_{k}^{0})^{\mathrm{\tiny T}}\boldsymbol{\beta}_{k}\geq 2\theta, and

\[
\mu=\sqrt{\left[1+2\left\{\left(\frac{d}{s^{\prime}}\right)^{1/2}+\frac{d}{s^{\prime}}\right\}\right]\left\{1-0.5\theta(1+\theta)(1-(\gamma^{*})^{2})\right\}}<1, \tag{5.10}
\]

where $\gamma^{*}=\dfrac{\lambda_{K}-\frac{3}{4}\Delta}{\lambda_{K}-\frac{1}{4}\Delta}$. Then there exist a positive integer $n_{0}=n_{0}(\sigma_{0},C_{0},q)<\infty$, $\gamma=\gamma(\sigma_{0},C_{0},q)\in(0,1/2)$ and a finite positive $D_{0}=D_{0}(\sigma_{0},C_{0},q)$ such that when $n>n_{0}$, we have $D_{0}n^{-\gamma}<\dfrac{\Delta}{4s}$, and for any $D_{0}n^{-\gamma}<\epsilon<\min\{\dfrac{\Delta}{4s},\theta\}$, with probability greater than $1-54p^{2}\exp\left\{-\dfrac{\epsilon^{2}n}{36\log^{3}n}\right\}$,

\[
|\sin\boldsymbol{\Theta}(\widehat{\boldsymbol{\beta}}_{k},\boldsymbol{\beta}_{k})|\leq Cs\epsilon,\quad k=1,\dots,K. \tag{5.11}
\]

Let $n^{-1/2}(\log p)^{1/2}(\log n)^{3/2}\ll\epsilon\ll d^{-1}$; then Theorem 2 directly implies the following asymptotic result that justifies the consistency of our estimator.

Corollary 1.

Assume that Conditions (C1’), (C2)–(C4) hold. Suppose there exists γ>0\gamma>0 such that dsmin{nγ,n12(logp)12(logn)32}d\leq s\ll\min\{n^{\gamma},\dfrac{n^{\frac{1}{2}}}{(\log p)^{\frac{1}{2}}(\log{n})^{\frac{3}{2}}}\}. Under the conditions in Theorem 2, the quantities |sin𝚯(𝛃^k,𝛃k)|0|\sin\boldsymbol{\Theta}(\widehat{\boldsymbol{\beta}}_{k},\boldsymbol{\beta}_{k})|\rightarrow 0 with a probability tending to 1, for k=1,,Kk=1,\ldots,K.

Corollary 1 reveals that, without specifying a model, Algorithm 1 can achieve consistency when $p$ grows at an exponential rate of $n$. To be exact, we can allow $\log p=o\{n/(d^{2}\log^{3}n)\}$. Here the theoretical results are established for the output of Algorithm 1, instead of the solution of the optimization problem (4.7). Note that it is possible to have some gap between the theoretical optimal solution of (4.7) and the estimate we use in practice, because the optimization problem is nonconvex and numerically we might not achieve the global maximum. Thus it might be more meaningful to study the property of the estimate obtained as the output of Algorithm 1. The above theorem guarantees that the estimate we use in practice has the desired theoretical properties.

In the meantime, although our rate in Theorem 2 is not as sharp as those established very recently for sparse sliced inverse regression by Lin et al. (2018) when $\Sigma_{X}=I$, by Lin et al. (2019) for general $\Sigma_{X}$, and by Tan et al. (2020), we have some unique advantages over these proposals. For simplicity, we assume that $d$ is fixed in the subsequent discussion. First, both sliced inverse regression methods require estimation of within-slice means rather than the MDDM. As shown in Theorem 1, MDDM converges to its population counterpart at a slower rate than the sample within-slice mean. However, by adopting MDDM, we no longer need to determine the slicing scheme, and we do not encounter the curse of dimensionality when slicing a multivariate response. Second, Lin et al. (2018) only achieve the optimal rate when $p=o(n^{2})$, and cannot handle ultra-high dimensions. In contrast, Algorithm 1 allows $p$ to diverge at an exponential rate of $n$, and is more suitable for ultra-high-dimensional data. Third, although Tan et al. (2020) achieve consistency when $\log p=o(n)$, they make much more restrictive model assumptions. For example, they assume that $Y$ is categorical and $\mathbf{X}$ is normal within each slice of $Y$; they also randomly split the dataset to form independent batches to facilitate their proofs, which is not done in their numerical studies. The theoretical properties of their proposal are unclear beyond the (conditionally) Gaussian model and without the sample splitting. In contrast, our method makes no model assumption between $\mathbf{X}$ and $Y$, and our theory requires no sample splitting. Thus, our results are more widely applicable, and the rates we obtain seem hard to improve. Also, unlike the theory in Tan et al. (2020), our theoretical result characterizes the exact method we use in practice. Moreover, the convergence rate of our method has only an additional factor of $\log^{3}(n)$ compared to Tan et al. (2020), which grows slowly in $n$ and imposes only a mild restriction on the dimensionality. For example, for any constant $\xi\in(0,1)$, if $\log p=O(n^{1-\xi})$, our method is consistent. In this sense, although we cannot handle the optimal dimensionality $\log p=o(n)$, the gap is very small.

Next, we further consider the penalized generalized eigen-decomposition in Algorithm 2. We assume that the step size $\eta$ satisfies $\eta\lambda_{\max}(\boldsymbol{\Sigma}_{\mathbf{X}})<1/2$ and

\[
\sqrt{\left[1+2\left\{\left(\frac{d}{s^{\prime}}\right)^{1/2}+\frac{d}{s^{\prime}}\right\}\right]\left[1-\frac{\eta\lambda_{\min}(\boldsymbol{\Sigma}_{\mathbf{X}})(1-\frac{\lambda_{2}}{\lambda_{1}})}{16\kappa(\boldsymbol{\Sigma}_{\mathbf{X}})+16\frac{\lambda_{2}}{\lambda_{1}}}\right]}<1, \tag{5.12}
\]

where $\lambda_{\max}(\boldsymbol{\Sigma}_{\mathbf{X}})$, $\lambda_{\min}(\boldsymbol{\Sigma}_{\mathbf{X}})$ and $\kappa(\boldsymbol{\Sigma}_{\mathbf{X}})$ are, respectively, the largest eigenvalue, the smallest eigenvalue and the condition number of $\boldsymbol{\Sigma}_{\mathbf{X}}$. The non-asymptotic results are as follows.

Theorem 3.

Assume that Conditions (C1’), (C2) & (C3) hold. Suppose there exists γ(0,1/2)\gamma\in(0,1/2) such that ds=o(nγ)d\leq s=o(n^{\gamma}), and there exists a constant θ(κ(𝚺𝐗),λmax(𝚺𝐗),Δ,λ1,λK,η)(0,1)\theta(\kappa({\bm{\Sigma}}_{\mathbf{X}}),\lambda_{\max}({\bm{\Sigma}}_{\mathbf{X}}),\Delta,\lambda_{1},\lambda_{K},\eta)\in(0,1) such that (𝛃^k0)T𝛃k𝛃^k021θ\dfrac{(\widehat{\boldsymbol{\beta}}_{k}^{0})^{\mathrm{\tiny T}}\boldsymbol{\beta}_{k}}{\|\widehat{\boldsymbol{\beta}}_{k}^{0}\|_{2}}\geq 1-\theta. Then there exists a positive integer n0=n0(s0,C0)<n_{0}=n_{0}(s_{0},C_{0})<\infty and four finite positive constants D0=D0(γ,σ0,C0)(0,)D_{0}=D_{0}(\gamma,\sigma_{0},C_{0})\in(0,\infty), D1=D1(C0)(0,)D_{1}=D_{1}(C_{0})\in(0,\infty), D2=D2(σ0,C0)(0,)D_{2}=D_{2}(\sigma_{0},C_{0})\in(0,\infty) and ϵ0=ϵ0(λ1,λ2,λmin(𝚺),Δ)\epsilon_{0}=\epsilon_{0}(\lambda_{1},\lambda_{2},\lambda_{\min}({\bm{\Sigma}}),\Delta) such that for any ϵ\epsilon that satisfies sϵ<ϵ0s\epsilon<\epsilon_{0} and D0nγ<ϵ1D_{0}n^{-\gamma}<\epsilon\leq 1, with a probability greater than 1D1p2nexp{D2ϵ2n/log3n}1-D_{1}p^{2}n\exp\{-D_{2}\epsilon^{2}n/\log^{3}{n}\}, we have |sinΘ(𝛃^k,𝛃k)|Csϵ|\sin\Theta(\widehat{\boldsymbol{\beta}}_{k},\boldsymbol{\beta}_{k})|\leq Cs\epsilon, for k=1,,Kk=1,\ldots,K.

Theorem 3 is proved by showing that $\widehat{\mathbf{M}}_{k}$ and $\widehat{\boldsymbol{\Sigma}}_{\mathbf{X}}$ are close to their population counterparts, in the sense that $\mathbf{u}^{\mathrm{T}}\widehat{\mathbf{M}}_{k}\mathbf{u}$ and $\mathbf{u}^{\mathrm{T}}\widehat{\boldsymbol{\Sigma}}_{\mathbf{X}}\mathbf{u}$ are close to $\mathbf{u}^{\mathrm{T}}\mathbf{M}_{k}\mathbf{u}$ and $\mathbf{u}^{\mathrm{T}}\boldsymbol{\Sigma}_{\mathbf{X}}\mathbf{u}$, respectively, for any $\mathbf{u}$ with only $s$ nonzero elements. Then we use the fact that Algorithm 2 is a generalization of the RIFLE algorithm (Tan, Wang, Liu and Zhang, 2018), and some properties of the latter allow us to establish the consistency of Algorithm 2. By comparison, our proofs are significantly more involved than those in Tan, Wang, Liu and Zhang (2018), because we have to estimate $K$ generalized eigenvectors instead of just the first one. We need to carefully control the error bounds to guarantee that the estimation errors do not accumulate to a higher order beyond the first generalized eigenvector.

Analogous to Corollary 1, we can easily obtain asymptotic consistency results by translating Theorem 3.

Corollary 2.

Assume that Conditions (C1'), (C2)–(C4) hold. Suppose there exists $\gamma\in(0,1/2)$ such that $d\leq s\ll\min\{n^{\gamma},\dfrac{n^{1/2}}{(\log p\log^{3}n)^{1/2}}\}$. Under the conditions in Theorem 3, the quantities $|\sin\boldsymbol{\Theta}(\widehat{\boldsymbol{\beta}}_{k},\boldsymbol{\beta}_{k})|\rightarrow 0$ with probability tending to 1, for $k=1,\ldots,K$.

Corollary 2 shows that Algorithm 2 produces consistent estimates of the generalized eigenvectors $\boldsymbol{\beta}_{k}$ even when $p$ grows at an exponential rate of the sample size $n$, and thus is suitable for ultra-high-dimensional problems. Similar to Corollary 1, Corollary 2 has no gap between theory and the numerical outputs, as it is a result concerning the outputs of Algorithm 2. We note that the dimensionality in Corollary 2 is the same as that in Corollary 1. Thus, with a properly chosen step size $\eta$, the penalized generalized eigen-decomposition is intrinsically no more difficult than the penalized eigen-decomposition. But if we know that $\boldsymbol{\Sigma}_{\mathbf{X}}$ is the identity matrix, it is still beneficial to exploit such information and use Algorithm 1, because Algorithm 1 does not involve a step size and is more convenient in practice. Also, although Algorithm 2 does not achieve the same rate of convergence as recent sparse sliced inverse regression proposals, it has the same practical and theoretical advantages as Algorithm 1, which we do not repeat.

Finally, we note that our theoretical studies require conditions on the initial value. Specifically, we require the initial value to be non-orthogonal to the truth. This is a common technical condition for iterative algorithms; see Yuan and Zhang (2013); Tan, Wang, Liu and Zhang (2018) for example. Such conditions do not seem critical for our algorithms to work in practice. In our numerical studies to be presented, we use randomly generated initial values, and the performance of our methods appears to be competitive.

6 Numerical studies

6.1 Simulations

We compare our slicing-free approaches to state-of-the-art high-dimensional extensions of sliced inverse regression estimators. Both univariate-response and multivariate-response settings are considered. Specifically, for univariate-response simulations, we include Rifle-SIR (Tan, Wang, Liu and Zhang, 2018) and Lasso-SIR (Lin et al., 2019) as two main competitors; for multivariate-response simulations, we mainly compare our method with the projective resampling approach to SIR (PR-SIR, Li et al., 2008), which is a computationally expensive method that repeatedly projects the multivariate response to one-dimensional subspaces. For Rifle-SIR, we adopt the Rifle algorithm to estimate the leading eigenvector of the slicing-based sample estimate of $\mathrm{cov}\{\mathrm{E}(\mathbf{X}\mid Y)\}$. In addition, we include the oracle-SIR as a benchmark method, where we perform SIR on the subset of truly relevant variables (hence a low-dimensional estimation problem). For all these SIR-based methods, we include two different slicing schemes by setting the number of slices to be $3$ and $10$, where $3$ is the minimal number of slices required to recover our two-dimensional central subspace and $10$ is a typical choice used in the literature. To evaluate the performance of these SDR methods, we use the subspace estimation error defined as $\mathcal{D}(\widehat{\boldsymbol{\beta}},\boldsymbol{\beta})=\|\mathbf{P}_{\widehat{\boldsymbol{\beta}}}-\mathbf{P}_{\boldsymbol{\beta}}\|_{F}/\sqrt{2K}$, where $\widehat{\boldsymbol{\beta}},\boldsymbol{\beta}\in\mathbb{R}^{p\times K}$ are the estimated and the true basis matrices of the central subspace and $\mathbf{P}_{\widehat{\boldsymbol{\beta}}},\mathbf{P}_{\boldsymbol{\beta}}\in\mathbb{R}^{p\times p}$ are the corresponding projection matrices. This subspace estimation error is always between 0 and 1, and a small value indicates good estimation.
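The subspace estimation error $\mathcal{D}(\widehat{\boldsymbol{\beta}},\boldsymbol{\beta})$ can be computed as below. This is a minimal sketch of our own; it uses pseudo-inverse-based projections so that non-orthonormal basis matrices are handled as well.

```python
import numpy as np

def subspace_error(B_hat, B):
    """D(B_hat, B) = ||P_{B_hat} - P_B||_F / sqrt(2K) for p x K basis matrices."""
    B_hat = np.atleast_2d(B_hat)
    B = np.atleast_2d(B)
    K = B.shape[1]
    P_hat = B_hat @ np.linalg.pinv(B_hat)      # projection onto span(B_hat)
    P = B @ np.linalg.pinv(B)                  # projection onto span(B)
    return np.linalg.norm(P_hat - P, "fro") / np.sqrt(2 * K)

# 0 for identical subspaces, 1 for orthogonal one-dimensional subspaces.
b1 = np.array([[1.0], [0.0], [0.0]])
b2 = np.array([[0.0], [1.0], [0.0]])
print(subspace_error(b1, b1), subspace_error(b2, b1))   # 0.0 and 1.0
```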

First, we consider the following six models for univariate response regression: \mathcal{M}_{1} and \mathcal{M}_{2} are single index models (i.e., K=1); \mathcal{M}_{3}–\mathcal{M}_{5} are multiple index models (i.e., K=2) that are widely used in the SDR literature; and \mathcal{M}_{6} is an isotropic PFC model with K=1. Specifically,

\mathcal{M}_{1}:\; Y=(\boldsymbol{\beta}_{1}^{\mathrm{\tiny T}}\mathbf{X})+\sin(\boldsymbol{\beta}_{1}^{\mathrm{\tiny T}}\mathbf{X})+\epsilon, \qquad \mathcal{M}_{2}:\; Y=2\arctan(\boldsymbol{\beta}_{1}^{\mathrm{\tiny T}}\mathbf{X})+0.1(\boldsymbol{\beta}_{1}^{\mathrm{\tiny T}}\mathbf{X})^{3}+\epsilon,
\mathcal{M}_{3}:\; Y=\frac{\boldsymbol{\beta}_{1}^{\mathrm{\tiny T}}\mathbf{X}}{0.5+(1.5+\boldsymbol{\beta}_{2}^{\mathrm{\tiny T}}\mathbf{X})^{2}}+0.2\epsilon, \qquad \mathcal{M}_{4}:\; Y=\boldsymbol{\beta}_{1}^{\mathrm{\tiny T}}\mathbf{X}+(\boldsymbol{\beta}_{1}^{\mathrm{\tiny T}}\mathbf{X})\cdot(\boldsymbol{\beta}_{2}^{\mathrm{\tiny T}}\mathbf{X})+0.3\epsilon,
\mathcal{M}_{5}:\; Y=\mathrm{sign}(\boldsymbol{\beta}_{1}^{\mathrm{\tiny T}}\mathbf{X})\cdot\log(|\boldsymbol{\beta}_{2}^{\mathrm{\tiny T}}\mathbf{X}+5|)+0.2\epsilon, \qquad \mathcal{M}_{6}:\; \mathbf{X}=2\boldsymbol{\beta}_{1}\exp(Y)/3+0.5\boldsymbol{\epsilon},

where \mathbf{X}\sim N_{p}(0,\boldsymbol{\Sigma}_{\mathbf{X}}) and \epsilon\sim N(0,1) for \mathcal{M}_{1}–\mathcal{M}_{5}, and Y\sim N(0,1), \boldsymbol{\epsilon}\sim N_{p}(0,\mathbf{I}_{p}) for the isotropic PFC model (\mathcal{M}_{6}). The sparse directions in the central subspace, \boldsymbol{\beta}_{1},\boldsymbol{\beta}_{2}\in\mathbb{R}^{p}, are orthogonal: we set the first s=6 elements of \boldsymbol{\beta}_{1} and the 6-th to 12-th elements of \boldsymbol{\beta}_{2} to 1/\sqrt{6}, while all other elements are zero. For \mathcal{M}_{1}–\mathcal{M}_{5}, we consider both the independent predictor setting with \boldsymbol{\Sigma}_{\mathbf{X}}=\mathbf{I}_{p} and the correlated predictor setting with the auto-regressive correlation \boldsymbol{\Sigma}_{\mathbf{X}}(i,j)=0.5^{|i-j|}, i,j=1,2,\dots,p. For each model setting, we vary the sample size n\in\{200,500,800\} and the predictor dimension p\in\{200,500,800,1200,2000\}, and simulate 1000 independent data sets.
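As a concrete illustration of this data-generating scheme, the sketch below (our own code) simulates one data set from \mathcal{M}_{3} under the AR(1) covariance; the other models are generated analogously, and drawing \mathbf{X} through the full p\times p covariance matrix is for clarity rather than efficiency.

import numpy as np

def simulate_M3(n=200, p=800, rho=0.5, seed=0):
    rng = np.random.default_rng(seed)
    # sparse directions: beta1 on coordinates 1-6, beta2 on coordinates 6-12, entries 1/sqrt(6)
    beta1 = np.zeros(p); beta1[:6] = 1 / np.sqrt(6)
    beta2 = np.zeros(p); beta2[5:12] = 1 / np.sqrt(6)
    # AR(1) covariance Sigma_X(i, j) = rho^{|i - j|}
    idx = np.arange(p)
    Sigma = rho ** np.abs(idx[:, None] - idx[None, :])
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    eps = rng.standard_normal(n)
    Y = (X @ beta1) / (0.5 + (1.5 + X @ beta2) ** 2) + 0.2 * eps
    return X, Y, np.column_stack([beta1, beta2])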

For our method, we apply the generalized eigen-decomposition algorithm (Algorithm 2) in all six models (even when the covariance of \mathbf{X} is the identity matrix). In the single index models \mathcal{M}_{1} and \mathcal{M}_{2}, we use random initialization (\widehat{\boldsymbol{\beta}}^{(0)} is randomly generated from the p-dimensional standard normal distribution) for our algorithm and Rifle-SIR to demonstrate the robustness to initialization; the step size in the algorithm is simply fixed at \eta=1. For the more challenging models \mathcal{M}_{3}–\mathcal{M}_{6}, we consider the best-case scenario for each method: the true parameter \boldsymbol{\beta} is used as the initial value, and an optimal \eta\in\{0.1,0.2,\dots,1.0\} is selected from a separate training sample with 400 observations. The results based on 1000 replications for n=200 and p=800 are summarized in Table 1, while the rest of the results can be found in the Supplementary Materials. Overall, the slicing-free MDDM approach is much more accurate than the existing SIR-based methods and is almost as accurate as the oracle-SIR. Moreover, it is clear that SIR-type methods are rather sensitive to the choice of the number of slices.

𝚺𝐗\boldsymbol{\Sigma}_{\mathbf{X}} MDDM Oracle-SIR(3) Oracle-SIR(10) Rifle-SIR(3) Rifle-SIR(10) LassoSIR(3) LassoSIR(10)
Error SE Error SE Error SE Error SE Error SE Error SE Error SE
𝐈p\mathbf{I}_{p} 1\mathcal{M}_{1} 10.1 0.1 12.5 0.1 10.3 0.1 25.2 1.0 53.7 1.4 37.9 0.4 59.9 0.7
2\mathcal{M}_{2} 10.3 0.1 13.1 0.1 10.6 0.1 26.1 1.0 54.7 1.4 40.1 0.4 61.5 0.7
3\mathcal{M}_{3} 17.7 0.2 40.8 0.2 27.7 0.2 71.3 0.0 71.2 0.0 76.5 0.2 85.0 0.2
4\mathcal{M}_{4} 23.0 0.2 45.8 0.3 36.4 0.3 71.9 0.0 71.6 0.0 85.2 0.2 91.5 0.2
5\mathcal{M}_{5} 30.8 0.6 28.8 0.2 22.1 0.1 71.6 0.0 71.2 0.0 71.2 0.3 81.3 0.3
AR 1\mathcal{M}_{1} 18.7 0.3 21.0 0.2 17.6 0.2 34.7 0.8 39.8 1.1 35.3 0.3 35.5 0.3
2\mathcal{M}_{2} 14.2 0.2 20.7 0.2 14.8 0.2 33.1 0.7 33.6 1.1 34.6 0.3 30.5 0.3
3\mathcal{M}_{3} 25.2 0.3 44.6 0.2 34.1 0.2 71.5 0.0 71.3 0.0 54.8 0.2 47.1 0.3
4\mathcal{M}_{4} 59.1 0.5 75.1 0.2 69.9 0.3 81.0 0.2 78.7 0.2 89.7 0.2 92.1 0.2
5\mathcal{M}_{5} 46.2 0.6 46.4 0.2 35.5 0.2 73.8 0.1 72.4 0.0 66.5 0.2 61.4 0.3
PFC 6\mathcal{M}_{6} 34.6 0.6 48.9 0.5 33.4 0.5 40.1 0.7 30.8 0.6 70.7 0.0 70.7 0.0
Table 1: Averaged subspace estimation errors and the corresponding standard errors (multiplied by 100) for the univariate response models (n=200, p=800).

Next, we further consider the following three multivariate response models, where the response dimension is q=4. These three models are, respectively, a multivariate linear model, a single-index heteroscedastic error model, and an isotropic PFC model. The predictors satisfy \mathbf{X}\sim N_{p}(0,\mathbf{I}_{p}) in the first two (forward regression) models, and we therefore apply Algorithm 1 under \mathcal{M}_{7} and \mathcal{M}_{8}. For the isotropic PFC model \mathcal{M}_{9}, where \mathbf{X}\mid\mathbf{Y}\sim N_{p}(\boldsymbol{\beta}f(\mathbf{Y}),\mathbf{I}_{p}), we still apply Algorithm 2 to be consistent with the univariate case. For the projective resampling methods, PR-SIR and PR-Oracle-SIR, we generate n\log(n) random projections, which is sufficiently large for the PR methods to reach their full potential. A data-generating sketch for \mathcal{M}_{9} is given after the model list below.

  1. \mathcal{M}_{7}: Y_{1}=\boldsymbol{\beta}_{1}^{\mathrm{\tiny T}}\mathbf{X}+\epsilon_{1}, Y_{2}=\boldsymbol{\beta}_{2}^{\mathrm{\tiny T}}\mathbf{X}+\epsilon_{2}, Y_{3}=\epsilon_{3} and Y_{4}=\epsilon_{4}. The errors (\epsilon_{1},\dots,\epsilon_{4}) are standard normal and mutually independent except that \mathrm{cov}(\epsilon_{1},\epsilon_{2})=-0.5. For this model, the central subspace is spanned by \boldsymbol{\beta}_{1}=(1,0,0,0,\dots,0)^{\mathrm{\tiny T}} and \boldsymbol{\beta}_{2}=(0,2,1,0,\dots,0)^{\mathrm{\tiny T}}.

  2. \mathcal{M}_{8}: Y_{1}=\exp(\epsilon_{1}) and Y_{i}=\epsilon_{i} for i=2,3,4, where (\epsilon_{1},\dots,\epsilon_{4}) are standard normal and mutually independent except that \mathrm{cov}(\epsilon_{1},\epsilon_{2})=\sin(\boldsymbol{\beta}^{\mathrm{\tiny T}}\mathbf{X}). For this model, the central subspace is spanned by \boldsymbol{\beta}=(0.8,0.6,0,0,\dots,0)^{\mathrm{\tiny T}}. Note that marginally each response is independent of \mathbf{X}.

  3. \mathcal{M}_{9}: \mathbf{X}=\boldsymbol{\beta}\left(\frac{1}{3}\sin(Y_{1})+\frac{2}{3}\exp(Y_{2})+Y_{3}\right)+\boldsymbol{\epsilon}, where \boldsymbol{\beta}=(1/\sqrt{6}\cdot\mathbf{1}_{6},\mathbf{0}_{p-6}) and \boldsymbol{\epsilon}\sim N(0,\mathbf{I}_{p}). Hence, \mathcal{S}_{\mathbf{Y}\mid\mathbf{X}}=\mathrm{span}(\boldsymbol{\beta}).
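As noted above, here is a minimal sketch (our own code) of generating data from the isotropic PFC model \mathcal{M}_{9}; the marginal distribution of \mathbf{Y} is not restated in the model description, so we take \mathbf{Y}\sim N_{q}(0,\mathbf{I}_{q}) purely for illustration, mirroring \mathcal{M}_{6}.

import numpy as np

def simulate_M9(n=200, p=800, q=4, seed=0):
    rng = np.random.default_rng(seed)
    beta = np.zeros(p); beta[:6] = 1 / np.sqrt(6)         # beta = (1/sqrt(6) * 1_6, 0_{p-6})
    Y = rng.standard_normal((n, q))                        # assumed marginal, for illustration only
    f = np.sin(Y[:, 0]) / 3 + 2 * np.exp(Y[:, 1]) / 3 + Y[:, 2]
    X = np.outer(f, beta) + rng.standard_normal((n, p))    # X = beta * f(Y) + eps, eps ~ N(0, I_p)
    return X, Y, beta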

Again, we consider various sample size and predictor dimension setups, each with 1000 replicates. We summarize the subspace estimation errors for p\in\{100,200,400\} in Table 2; the results for p=800 and p=1200 are gathered in the supplement (Table S6). It is clear that the proposed MDDM approach is much better than PR-SIR, and it also improves much faster than PR-SIR as the sample size increases. The MDDM method performs better in inverse regression models, such as the isotropic PFC model, than in forward regression models, such as the linear model and the index models. This finding is more apparent in the multivariate response simulations than in the univariate response simulations. This is expected, as MDDM directly targets the inverse regression subspace, which is more directly driven by the response in the isotropic PFC models.

n=100n=100 n=200n=200 n=400n=400
p=100p=100 p=200p=200 p=400p=400 p=100p=100 p=200p=200 p=400p=400 p=100p=100 p=200p=200 p=400p=400
Error SE Error SE Error SE Error SE Error SE Error SE Error SE Error SE Error SE
7\mathcal{M}_{7} MDDM 37.1 0.5 39.8 0.5 42.5 0.5 24.0 0.4 25.3 0.4 26.9 0.4 16.1 0.3 17.3 0.3 18.6 0.3
PR-Oracle-SIR(3) 12.6 0.2 12.2 0.2 12.0 0.2 8.8 0.1 8.5 0.1 8.7 0.1 5.9 0.1 5.8 0.1 5.8 0.1
PR-Oracle-SIR(10) 16.2 0.3 15.7 0.3 15.6 0.3 9.6 0.2 9.4 0.2 95.2 0.2 6.0 0.1 6.0 0.1 6.0 0.1
PR-SIR(3) 79.9 0.1 88.2 0.1 93.5 0.0 67.9 0.1 79.3 0.1 87.8 0.0 54.6 0.1 67.6 0.1 79.0 0.0
PR-SIR(10) 83.5 0.1 90.6 0.1 94.9 0.0 70.1 0.1 81.6 0.1 90.1 0.1 55.3 0.1 68.2 0.1 80.2 0.1
8\mathcal{M}_{8} MDDM 79.4 0.9 85.8 0.8 90.0 0.7 55.9 1.2 61.0 1.2 68.4 1.2 27.1 0.9 30.3 1.0 31.0 1.0
PR-Oracle-SIR(3) 40.9 0.9 41.3 0.9 41.4 0.9 26.0 0.7 24.9 0.7 25.0 0.6 14.9 0.4 14.9 0.4 15.0 0.4
PR-Oracle-SIR(10) 44.1 0.9 43.8 0.9 43.5 0.9 25.1 0.6 23.7 0.6 24.1 0.6 13.1 0.3 13.0 0.3 13.2 0.3
PR-SIR(3) 99.3 0.0 99.7 0.0 99.8 0.0 99.2 0.0 99.7 0.0 99.8 0.0 98.8 0.0 99.6 0.0 99.8 0.0
PR-SIR(10) 99.3 0.0 99.7 0.0 99.9 0.0 99.1 0.0 99.6 0.0 99.8 0.0 98.4 0.1 99.6 0.0 99.8 0.0
9\mathcal{M}_{9} MDDM 15.3 0.3 15.4 0.3 15.7 0.3 9.9 0.1 10.1 0.1 10.0 0.1 7.1 0.1 7.2 0.1 7.1 0.1
PR-Oracle-SIR(3) 15.2 0.2 15.2 0.2 14.9 0.2 10.5 0.1 10.6 0.1 10.5 0.1 7.5 0.1 7.6 0.1 7.4 0.1
PR-Oracle-SIR(10) 13.8 0.2 13.9 0.2 13.6 0.2 9.4 0.1 9.7 0.1 9.6 0.1 6.8 0.1 6.8 0.1 6.7 0.1
PR-SIR(3) 58.5 0.2 72.3 0.2 84.0 0.2 44.6 0.1 58.2 0.1 71.4 0.1 33.1 0.1 44.6 0.1 57.9 0.1
PR-SIR(10) 54.8 0.2 68.5 0.2 80.6 0.2 41.1 0.2 54.3 0.2 67.7 0.2 30.2 0.1 41.0 0.1 54.2 0.1
Table 2: Averaged subspace estimation errors and the corresponding standard errors (multiplied by 100) for the multivariate response models (p\in\{100,200,400\}).

6.2 Real Data Illustration

Figure 1: Quantile-quantile plots for prediction error comparisons between MDDM and Lasso-SIR (left panel), and between MDDM-ID and Lasso-SIR (right panel). Each point corresponds to the prediction mean squared errors for one of the q=15 response variables, where different shapes represent different quantiles.
Figure 2: The averaged prediction error over 500 training-testing sample splits and over the q=15 response variables.

In this section we use our method to analyze the NCI-60 data set (Shoemaker, 2006), which contains microRNA expression profiles and cancer drug activity measurements on the NCI-60 cell lines. The multivariate response consists of the activity measurements of q=15 cancer drugs; the predictors are the expression levels of p=365 microRNAs; the sample size is n=60.

First, we examine the predictive performance of our method using 500 random training-testing sample splits; each time we randomly pick 5 observations to form the test set. We consider K=5 for all methods. For MDDM, we include both the eigen-decomposition (Algorithm 1) and the generalized eigen-decomposition (Algorithm 2). To distinguish the two versions, we write “MDDM-ID” for the eigen-decomposition approach, because it implicitly assumes that the covariance of \mathbf{X}, or the conditional covariance of \mathbf{X}\mid\mathbf{Y}, is a constant times the identity matrix. We use random initial values and choose the sparsity level to be s=25 in the way described in Section S2 of the Supplementary Materials. The five reduced predictors \boldsymbol{\beta}_{k}^{\mathrm{\tiny T}}\mathbf{X}, k=1,\dots,5, are then fed into a generalized additive model for each drug. Finally, we evaluate the mean squared prediction error on the test sample. Rifle-SIR can only estimate a one-dimensional subspace, which did not yield accurate predictions on this data set; hence, for comparison, we compute the five leading directions from Lasso-SIR. The 25th, 50th and 75th percentiles of the squared prediction errors are obtained for each of the 15 responses and each of the three methods, and we construct quantile-quantile plots in Figure 1. The red line is the y=x line, and the black dashed line is a simple linear regression fit of the results indicated by the y-axis label against those indicated by the x-axis label. Clearly, for all the quantiles and for all the response variables, the MDDM results (MDDM or MDDM-ID) are better than Lasso-SIR in terms of prediction. In addition, we construct side-by-side boxplots of the prediction error averaged over all response variables in Figure 2 to evaluate the overall improvement. Interestingly, MDDM-ID is slightly better than MDDM. This is likely due to the small sample size: with only 55 training samples, the sample covariance of p=365 variables is difficult to estimate accurately. We further include additional real data analysis results in Section S2 of the Supplementary Materials.
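The evaluation loop can be summarized by the schematic sketch below (our own code). Here estimate_mddm_directions is a hypothetical stand-in for the penalized (generalized) eigen-decomposition of the sample MDDM, and a gradient boosting regressor replaces the generalized additive model purely for illustration.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor

def evaluate_split(X, Y, estimate_mddm_directions, K=5, seed=0):
    # one random 55/5 training-testing split; returns the test MSE for each response
    X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, test_size=5, random_state=seed)
    B = estimate_mddm_directions(X_tr, Y_tr, K=K)          # hypothetical: returns a p x K sparse basis
    Z_tr, Z_te = X_tr @ B, X_te @ B                        # reduced predictors beta_k^T X
    mse = []
    for j in range(Y.shape[1]):                            # one flexible regression per drug
        model = GradientBoostingRegressor().fit(Z_tr, Y_tr[:, j])
        mse.append(np.mean((model.predict(Z_te) - Y_te[:, j]) ** 2))
    return np.array(mse)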

7 Discussion

In this paper, we propose a slicing-free high-dimensional SDR method based on a penalized eigen-decomposition of the sample MDDM. Our proposal is motivated by the usefulness of MDDM for dimension reduction and yields a relatively straightforward implementation in view of the recently developed RIFLE algorithm (Tan, Wang, Liu and Zhang, 2018), by simply replacing the slicing-based estimator with the sample MDDM. Our methodology and implementation involve no slicing and treat univariate and multivariate responses in a unified fashion. Both the theoretical support and the finite sample investigations provide convincing evidence that MDDM is a very competitive alternative to SIR and may routinely be used as a surrogate for SIR-based estimators in many related sufficient dimension reduction problems.

As with most SDR methods, our proposal requires the linearity condition, the violation of which can make SDR very challenging to tackle. Existing proposals that relax the linearity condition are often practically difficult due to excessive computational costs and cannot be easily extended to high dimensions (Cook and Nachtsheim, 1994; Ma and Zhu, 2012). One potentially useful approach is to transform the data before SDR to alleviate obvious violations of the linearity assumption (Mai and Zou, 2015), but an in-depth study along this line is beyond the scope of the current paper. In addition, we observe from our simulation studies that RIFLE requires the choice of several tuning parameters, such as the step size and the initial value, and the optimization error could depend on these tuning parameters in a nontrivial way. Further investigation of the optimization error and data-driven choices for these tuning parameters would be desirable and is left for future research.

As pointed out by a referee, many SDR methods beyond SIR involve slicing. It will be interesting to study how to perform them in a slicing-free fashion as well. For example, Cook and Weisberg (1991) attempt to perform dimension reduction by estimating the conditional covariance of 𝐗\mathbf{X}, while Yin and Cook (2003) consider the conditional third moment. These methods slice the response to estimate the conditional moments. In the future, one can develop slicing-free methods to estimate these higher-order moments and conduct SDR.

Supplementary Materials

Supplement to “Slicing-free Inverse Regression in High-dimensional Sufficient Dimension Reduction”. In the supplement, we present additional simulation results and proofs.

Acknowledgements

The authors are grateful to the Editor, Associate Editor and anonymous referees, whose suggestions led to great improvement of this work. The authors contributed equally to this work and are listed in alphabetical order. Mai and Zhang’s research in this article is supported in part by National Science Foundation grant CCF-1908969. Shao’s research in this article is supported in part by National Science Foundation grant DMS-1607489.

References

  • Bura and Cook (2001) Bura, E. and Cook, R. D. (2001), ‘Extending sliced inverse regression: The weighted chi-squared test’, Journal of the American Statistical Association 96, 996–1003.
  • Bura and Yang (2011) Bura, E. and Yang, J. (2011), ‘Dimension estimation in sufficient dimension reduction: a unifying approach’, Journal of Multivariate analysis 102(1), 130–142.
  • Cai et al. (2013) Cai, T. T., Ma, Z. and Wu, Y. (2013), ‘Sparse PCA: Optimal rates and adaptive estimation’, Annals of Statistics 41, 3074–3110.
  • Chen et al. (2010) Chen, X., Zou, C., Cook, R. D. et al. (2010), ‘Coordinate-independent sparse sufficient dimension reduction and variable selection’, The Annals of Statistics 38(6), 3696–3723.
  • Chiaromonte et al. (2002) Chiaromonte, F., Cook, R. D. and Li, B. (2002), ‘Sufficient dimension reduction in regressions with categorical predictors’, The Annals of Statistics 30, 475–497.
  • Cook (2007) Cook, R. D. (2007), ‘Fisher lecture: Dimension reduction in regression (with discussion)’, Statistical Science 22, 1–26.
  • Cook and Forzani (2009) Cook, R. D. and Forzani, L. (2009), ‘Principal fitted components for dimension reduction in regression’, Statistical Science 23, 485–501.
  • Cook and Nachtsheim (1994) Cook, R. D. and Nachtsheim, C. J. (1994), ‘Reweighting to achieve elliptically contoured covariates in regression’, Journal of the American Statistical Association 89(426), 592–599.
  • Cook and Ni (2005) Cook, R. D. and Ni, L. (2005), ‘Sufficient dimension reduction via inverse regression: A minimum discrepancy approach’, Journal of the American Statistical Association 100(470), 410–428.
  • Cook and Weisberg (1991) Cook, R. D. and Weisberg, S. (1991), ‘Comment on “sliced inverse regression for dimension reduction”’, Journal of American Statistical Association 86, 328–332.
  • Cook and Zhang (2014) Cook, R. D. and Zhang, X. (2014), ‘Fused estimators of the central subspace in sufficient dimension reduction’, J. Amer. Statist. Assoc. 109, 815–827.
  • Cruz-Cano and Lee (2014) Cruz-Cano, R. and Lee, M.-L. T. (2014), ‘Fast regularized canonical correlation analysis’, Computational Statistics & Data Analysis 70, 88–100.
  • Hsing and Carroll (1992) Hsing, T. and Carroll, R. (1992), ‘An asymptotic theory for sliced inverse regression’, Ann. Statist 20, 1040–1061.
  • Huo and Székely (2016) Huo, X. and Székely, G. J. (2016), ‘Fast computing for distance covariance’, Technometrics 58(4), 435–447.
  • Kim et al. (2020) Kim, K., Li, B., Yu, Z., Li, L. et al. (2020), ‘On post dimension reduction statistical inference’, Annals of Statistics 48(3), 1567–1592.
  • Lee and Shao (2018) Lee, C. E. and Shao, X. (2018), ‘Martingale difference divergence matrix and its application to dimension reduction for stationary multivariate time series’, J. Amer. Statist. Assoc. 113, 216–229.
  • Li (2018) Li, B. (2018), Sufficient Dimension Reduction, Methods and Applications with R, CRC Press.
  • Li and Wang (2007) Li, B. and Wang, S. (2007), ‘On directional regression for dimension reduction’, Journal of the American Statistical Association 102, 997–1008.
  • Li et al. (2008) Li, B., Wen, S. and Zhu, L. (2008), ‘On a projective resampling method for dimension reduction with multivariate responses’, Journal of the American Statistical Association 103(483), 1177–1186.
  • Li (1991) Li, K. C. (1991), ‘Sliced inverse regression for dimension reduction’, Journal of the American Statistical Association 86, 316–327.
  • Li (2007) Li, L. (2007), ‘Sparse sufficient dimension reduction’, Biometrika 94(3), 603–613.
  • Lin et al. (2020) Lin, Q., Li, X., Huang, D. and Liu, J. (2020), ‘On the optimality of sliced inverse regression in high dimensions’, Annals of Statistics, to appear .
  • Lin et al. (2018) Lin, Q., Zhao, Z. and Liu, J. (2018), ‘On consistency and sparsity for sliced inverse regression in high dimension’, Annals of Statistics 46(2), 580–610.
  • Lin et al. (2019) Lin, Q., Zhao, Z. and Liu, J. S. (2019), ‘Sparse sliced inverse regression via lasso’, Journal of the American Statistical Association 114, 1726–1739.
  • Luo and Li (2016) Luo, W. and Li, B. (2016), ‘Combining eigenvalues and variation of eigenvectors for order determination’, Biometrika 103(4), 875–887.
  • Ma and Zhu (2012) Ma, Y. and Zhu, L. (2012), ‘A semiparametric approach to dimension reduction’, Journal of the American Statistical Association 107(497), 168–179.
  • Mai and Zhang (2019) Mai, Q. and Zhang, X. (2019), ‘An iterative penalized least squares approach to sparse canonical correlation analysis’, Biometrics 75(3), 734–744.
  • Mai and Zou (2015) Mai, Q. and Zou, H. (2015), ‘Nonparametric variable transformation in sufficient dimension reduction’, Technometrics 57(1), 1–10.
  • McKeague and Zhang (2020) McKeague, I. W. and Zhang, X. (2020), ‘Significance testing for canonical correlation analysis in high dimensions’, arXiv preprint arXiv:2010.08673 .
  • Serfling (1980) Serfling, R. J. (1980), Approximation Theorems of Mathematical Statistics, Wiley Series in Probability and Mathematical Statistics.
  • Shao and Zhang (2014) Shao, X. and Zhang, J. (2014), ‘Martingale difference correlation and its use in high dimensional variable screening’, J. Amer. Statist. Assoc. 109, 1302–1318.
  • Shoemaker (2006) Shoemaker, R. H. (2006), ‘The NCI60 human tumour cell line anticancer drug screen’, Nature Reviews Cancer 6(10), 813–823.
  • Székely et al. (2007) Székely, G. J., Rizzo, M. L. and Bakirov, N. K. (2007), ‘Measuring and testing dependence by correlation of distances’, Annals of Statistics 35, 2769–2794.
  • Tan, Wang, Zhang, Liu and Cook (2018) Tan, K. M., Wang, Z., Zhang, T., Liu, H. and Cook, R. D. (2018), ‘A convex formulation for high-dimensional sparse sliced inverse regression’, Biometrika 105(4), 769–782.
  • Tan et al. (2020) Tan, K., Shi, L. and Yu, Z. (2020), ‘Sparse SIR: Optimal rates and adaptive estimation’, The Annals of Statistics 48(1), 64–85.
  • Tan, Wang, Liu and Zhang (2018) Tan, K., Wang, Z., Liu, H. and Zhang, T. (2018), ‘Sparse generalized eigenvalue problem: Optimal statistical rates via truncated rayleigh flow’, Journal of the Royal Statistical Society: Series B 80(5), 1057–1086.
  • Vershynin (2018) Vershynin, R. (2018), High-dimensional probability: An introduction with applications in data science, Vol. 47, Cambridge university press.
  • Vu and Lei (2013) Vu, V. Q. and Lei, J. (2013), ‘Minimax sparse principal subspace estimation in high dimensions’, Annals of Statistics 41, 2905–2947.
  • Witten et al. (2009) Witten, D. M., Tibshirani, R. and Hastie, T. (2009), ‘A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis’, Biostatistics 10(3), 515–534.
  • Yin and Cook (2003) Yin, X. and Cook, R. D. (2003), ‘Estimating central subspaces via inverse third moments’, Biometrika 90, 113–125.
  • Yuan and Zhang (2013) Yuan, X.-T. and Zhang, T. (2013), ‘Truncated power method for sparse eigenvalue problems’, Journal of Machine Learning Research 14(Apr), 899–925.
  • Zhang et al. (2020) Zhang, X., Lee, C. E. and Shao, X. (2020), ‘Envelopes in multivariate regression models with nonlinearity and heteroscedasticity’, Biometrika 107, 965–981.
  • Zhou and He (2008) Zhou, J. and He, X. (2008), ‘Dimension reduction based on constrained canonical correlation and variable filtering’, The Annals of Statistics 36, 1649–1668.
  • Zhu and Ng (1995) Zhu, L. and Ng, K. W. (1995), ‘Asymptotics of sliced inverse regression’, Statist. Sinica 5, 727–736.
  • Zhu et al. (2010) Zhu, L.-P., Zhu, L.-X. and Feng, Z.-H. (2010), ‘Dimension reduction in regressions through cumulative slicing estimation’, Journal of the American Statistical Association 105(492), 1455–1466.
  • Zou et al. (2006) Zou, H., Hastie, T. and Tibshirani, R. (2006), ‘Sparse principal component analysis’, Journal of computational and graphical statistics 15(2), 265–286.

Qing Mai, Florida State University E-mail: [email protected]

Xiaofeng Shao, University of Illinois at Urbana-Champaign E-mail: [email protected]

Runmin Wang, Texas A&M University E-mail: [email protected]

Xin Zhang, Florida State University E-mail: [email protected]

Supplementary Materials for “Slicing-free Inverse Regression in High-Dimensional Sufficient Dimension Reduction”

Qing Mai, Xiaofeng Shao, Runmin Wang and Xin Zhang

In the supplement, we present additional simulation results in Section S1, additional real data analysis results in Section S2, a discussion of computational complexity in Section S3, the proofs of Proposition 2 and Lemma 1 in Section S4, the proof of Theorem 1 in Section S5, and the proofs of Theorems 2 and 3 in Section S6.

S1 Additional Simulation Results

In this section, we present all simulation results for the models described in Section 6, except those already presented in the main paper. Specifically, Tables S1 and S2 contain the results for the single index models (\mathcal{M}_{1} and \mathcal{M}_{2}), Tables S3 and S4 contain the results for the multiple index models (\mathcal{M}_{3}–\mathcal{M}_{5}), Table S5 summarizes the results for the univariate-response PFC model \mathcal{M}_{6}, and Table S6 gathers the remaining results for the multivariate response models (\mathcal{M}_{7}–\mathcal{M}_{9}).

The patterns are similar to those presented in the paper. Overall the newly proposed method outperforms the competitors in most scenarios, especially when the dimensionality is significantly larger than the sample size (i.e., high-dimensional setting). It is also observed that SIR-based methods are rather sensitive to the choice of number of slices, whereas our method is slicing-free and is thus easier to use in practice.

MDDM Oracle-SIR(3) Oracle-SIR(10) Rifle-SIR(3) Rifle-SIR(10) LassoSIR(3) LassoSIR(10)
Error SE Error SE Error SE Error SE Error SE Error SE Error SE
1\mathcal{M}_{1} n = 200 p = 200 10.3 0.1 12.6 0.1 10.5 0.1 16.4 0.6 30.0 1.2 28.5 0.2 30.6 0.3
p = 500 10.4 0.1 12.7 0.1 10.6 0.1 22.1 0.9 47.1 1.4 34.6 0.3 47.7 0.6
p = 800 10.1 0.1 12.5 0.1 10.3 0.1 25.2 1.0 53.7 1.4 37.9 0.4 59.9 0.7
p = 1200 10.0 0.1 12.4 0.1 10.3 0.1 27.5 1.0 54.3 1.4 42.4 0.5 71.4 0.7
p = 2000 10.1 0.1 12.6 0.1 10.5 0.1 29.7 1.1 63.4 1.4 48.4 0.6 81.7 0.6
n = 500 p = 200 6.3 0.1 7.6 0.1 6.3 0.1 8.9 0.4 12.1 0.7 15.0 0.1 12.8 0.1
p = 500 6.4 0.1 7.9 0.1 6.5 0.1 11.8 0.6 21.6 1.1 16.3 0.1 14.6 0.1
p = 800 6.2 0.1 7.7 0.1 6.4 0.1 13.2 0.7 26.1 1.2 16.6 0.1 16.2 0.2
p = 1200 6.4 0.1 7.6 0.1 6.4 0.1 13.4 0.7 30.3 1.3 17.3 0.2 17.9 0.2
p = 2000 6.3 0.1 7.7 0.1 6.3 0.1 14.9 0.8 32.9 1.3 18.3 0.2 21.8 0.3
n =800 p = 200 5.0 0.1 6.0 0.1 5.0 0.1 7.4 0.4 8.7 0.6 11.1 0.1 9.3 0.1
p = 500 5.2 0.1 6.1 0.1 5.1 0.1 8.3 0.4 12.9 0.8 11.9 0.1 10.1 0.1
p = 800 5.1 0.1 6.1 0.1 5.1 0.1 9.5 0.6 20.1 1.1 12.4 0.1 11.1 0.1
p = 1200 5.1 0.1 6.1 0.1 5.1 0.1 9.5 0.6 20.1 1.1 12.4 0.1 11.1 0.1
p = 2000 4.9 0.1 6.1 0.1 5.0 0.1 10.6 0.6 23.1 1.2 12.8 0.1 12.2 0.1
2\mathcal{M}_{2} n = 200 p = 200 10.4 0.1 13.1 0.1 10.7 0.1 17.5 0.6 29.7 1.2 30.3 0.2 31.6 0.3
p = 500 10.6 0.1 13.3 0.1 10.8 0.1 23.8 0.9 48.9 1.4 36.7 0.3 49.9 0.6
p = 800 10.3 0.1 13.1 0.1 10.6 0.1 26.1 1.0 54.7 1.4 40.1 0.4 61.5 0.7
p = 1200 54.7 0.8 12.9 0.1 10.4 0.1 74.4 0.8 95.1 0.4 45.0 0.5 71.4 0.7
p = 2000 55.3 0.8 13.1 0.1 10.6 0.1 76.5 0.7 96.8 0.3 51.2 0.6 82.7 0.6
n = 500 p = 200 6.4 0.1 8.0 0.1 6.5 0.1 9.7 0.4 12.0 0.7 15.8 0.1 13.4 0.1
p = 500 6.7 0.1 8.2 0.1 6.6 0.1 12.4 0.6 21.2 1.1 17.1 0.1 15.1 0.1
p = 800 6.5 0.1 8.0 0.1 6.5 0.1 13.6 0.7 25.1 1.2 17.6 0.2 16.7 0.2
p = 1200 17.1 0.7 8.0 0.1 6.5 0.1 37.4 1.1 74.6 1.0 18.2 0.2 18.3 0.2
p = 2000 16.9 0.8 8.0 0.1 6.5 0.1 38.3 1.2 77.4 0.9 19.0 0.2 22.2 0.3
n =800 p = 200 5.1 0.1 6.3 0.1 5.1 0.1 6.9 0.3 8.3 0.5 11.6 0.1 9.6 0.1
p = 500 5.3 0.1 6.4 0.1 5.2 0.1 7.8 0.4 13.1 0.8 12.5 0.1 10.5 0.1
p = 800 5.0 0.1 6.3 0.1 5.1 0.1 8.4 0.4 16.9 1.0 12.6 0.1 10.8 0.1
p = 1200 10.8 0.6 6.4 0.1 5.2 0.1 23.9 1.0 53.9 1.3 13.1 0.1 11.4 0.1
p = 2000 11.3 0.6 6.3 0.1 5.1 0.1 26.6 1.1 57.6 1.2 13.6 0.1 12.4 0.1
Table S1: d(V,V^)d(V,\hat{V}) and corresponding standard errors (in 10210^{-2}) for single index models (identity variance)
MDDM Oracle-SIR(3) Oracle-SIR(10) Rifle-SIR(3) Rifle-SIR(10) LassoSIR(3) LassoSIR(10)
Error SE Error SE Error SE Error SE Error SE Error SE Error SE
1\mathcal{M}_{1} n = 200 p = 200 17.6 0.2 20.8 0.2 17.8 0.2 26.2 0.5 25.3 0.7 32.1 0.3 28.4 0.3
p = 500 18.3 0.3 21.1 0.2 18.0 0.2 32.9 0.7 35.0 1.0 34.5 0.3 33.1 0.3
p = 800 18.7 0.3 21.0 0.2 17.6 0.2 34.7 0.8 39.8 1.1 35.3 0.3 35.5 0.3
p = 1200 26.1 0.6 21.3 0.2 18.0 0.2 47.0 1.0 81.2 0.7 36.7 0.3 41.2 0.4
p = 2000 26.3 0.7 21.0 0.2 17.7 0.2 47.8 1.0 81.4 0.7 39.1 0.4 49.6 0.6
n = 500 p = 200 10.8 0.1 13.2 0.1 11.0 0.1 13.7 0.2 12.7 0.4 18.6 0.2 15.3 0.1
p = 500 11.0 0.1 13.3 0.1 11.2 0.1 14.3 0.3 15.7 0.6 19.5 0.2 16.5 0.2
p = 800 10.9 0.1 13.5 0.1 11.1 0.1 15.1 0.4 18.4 0.8 20.1 0.2 17.1 0.2
p = 1200 14.2 0.5 13.4 0.1 11.1 0.1 23.1 0.8 49.9 1.1 20.3 0.2 17.8 0.2
p = 2000 13.6 0.5 13.5 0.1 11.2 0.1 26.9 0.9 53.8 1.2 20.7 0.2 18.6 0.2
n =800 p = 200 8.8 0.1 10.7 0.1 8.9 0.1 10.6 0.1 9.7 0.3 14.5 0.1 11.9 0.1
p = 500 8.7 0.1 10.5 0.1 8.9 0.1 11.1 0.3 9.9 0.3 14.8 0.1 12.5 0.1
p = 800 8.6 0.1 10.5 0.1 8.8 0.1 11.7 0.4 12.4 0.6 15.0 0.1 12.5 0.1
p = 1200 10.2 0.4 10.4 0.1 8.8 0.1 18.3 0.8 37.9 1.1 15.1 0.1 12.9 0.1
p = 2000 11.1 0.5 10.6 0.1 8.8 0.1 20.5 0.8 38.2 1.1 15.8 0.1 13.4 0.1
2\mathcal{M}_{2} n = 200 p = 200 14.0 0.2 20.8 0.2 14.8 0.2 25.4 0.5 21.4 0.7 31.8 0.3 23.3 0.2
p = 500 14.3 0.2 21.1 0.2 15.0 0.2 31.4 0.7 29.0 1.0 34.1 0.3 27.4 0.3
p = 800 14.2 0.2 20.7 0.2 14.8 0.2 33.1 0.7 33.6 1.1 34.6 0.3 30.5 0.3
p = 1200 23.9 0.7 21.1 0.2 14.9 0.2 45.1 1.0 75.9 0.9 36.9 0.3 34.9 0.4
p = 2000 24.9 0.8 20.6 0.2 14.6 0.2 47.9 1.0 77.2 0.9 38.6 0.4 42.5 0.5
n = 500 p = 200 8.7 0.1 13.2 0.1 9.0 0.1 13.7 0.3 10.4 0.4 18.6 0.2 12.6 0.1
p = 500 8.9 0.1 13.3 0.1 9.3 0.1 14.3 0.3 13.4 0.6 19.4 0.2 13.4 0.1
p = 800 8.8 0.1 13.3 0.1 9.1 0.1 14.0 0.3 14.9 0.7 19.9 0.2 13.8 0.1
p = 1200 12.8 0.5 13.3 0.1 9.2 0.1 24.3 0.8 45.9 1.2 20.1 0.2 14.6 0.1
p = 2000 12.5 0.5 13.5 0.1 9.3 0.1 27.0 0.9 49.7 1.2 20.6 0.2 15.1 0.1
n =800 p = 200 7.1 0.1 10.6 0.1 7.4 0.1 10.6 0.1 7.9 0.2 14.4 0.1 9.8 0.1
p = 500 7.1 0.1 10.6 0.1 7.3 0.1 10.9 0.2 9.3 0.4 15.0 0.1 10.2 0.1
p = 800 7.1 0.1 10.6 0.1 7.3 0.1 11.6 0.3 10.8 0.6 15.1 0.1 10.2 0.1
p = 1200 8.8 0.4 10.4 0.1 7.2 0.1 17.5 0.7 32.8 1.1 15.0 0.1 10.5 0.1
p = 2000 9.7 0.5 10.5 0.1 7.3 0.1 19.7 0.8 34.3 1.2 15.9 0.1 10.8 0.1
Table S2: d(V,V^)d(V,\hat{V}) and corresponding standard errors (in 10210^{-2}) for single index models (AR-type variance)
MDDM Oracle-SIR(3) Oracle-SIR(10) Rifle-SIR(3) Rifle-SIR(10) LassoSIR(3) LassoSIR(10)
Error SE Error SE Error SE Error SE Error SE Error SE Error SE
3\mathcal{M}_{3} n = 200 p = 200 17.5 0.2 40.6 0.2 27.7 0.2 71.3 0.0 71.2 0.0 69.6 0.2 67.0 0.3
p = 500 18.1 0.2 40.7 0.2 28.2 0.2 71.3 0.0 71.2 0.0 74.8 0.2 80.1 0.3
p = 800 17.7 0.2 40.8 0.2 27.7 0.2 71.3 0.0 71.2 0.0 76.5 0.2 85.0 0.2
p = 1200 17.9 0.2 40.7 0.2 28.0 0.2 31.7 0.4 18.6 0.2 78.4 0.2 88.8 0.2
p = 2000 18.1 0.2 40.8 0.2 28.0 0.2 32.1 0.4 18.9 0.2 80.6 0.2 92.5 0.2
n = 500 p = 200 10.6 0.1 27.8 0.2 17.0 0.1 70.9 0.0 70.9 0.0 48.6 0.3 30.1 0.2
p = 500 10.8 0.1 27.5 0.2 17.2 0.1 70.9 0.0 70.9 0.0 53.7 0.3 39.1 0.3
p = 800 10.8 0.1 27.4 0.2 17.3 0.1 70.9 0.0 70.9 0.0 57.1 0.2 46.0 0.4
p = 1200 10.7 0.1 27.4 0.2 17.1 0.1 18.9 0.2 11.0 0.1 59.1 0.2 53.0 0.4
p = 2000 10.7 0.1 27.6 0.2 17.1 0.1 19.0 0.2 11.2 0.1 62.1 0.2 60.7 0.4
n =800 p = 200 8.2 0.1 22.0 0.1 13.5 0.1 70.8 0.0 70.8 0.0 36.0 0.2 20.8 0.1
p = 500 8.3 0.1 21.9 0.1 13.5 0.1 70.8 0.0 70.8 0.0 40.4 0.2 24.0 0.2
p = 800 8.2 0.1 21.9 0.1 13.3 0.1 70.8 0.0 70.8 0.0 42.2 0.3 26.2 0.2
p = 1200 8.3 0.1 22.0 0.1 13.4 0.1 14.9 0.1 8.7 0.1 44.7 0.2 29.8 0.3
p = 2000 8.3 0.1 22.2 0.1 13.5 0.1 14.8 0.1 8.6 0.1 47.9 0.3 36.4 0.4
4\mathcal{M}_{4} n = 200 p = 200 23.1 0.2 46.2 0.3 36.3 0.3 72.0 0.0 71.6 0.0 78.1 0.2 78.2 0.3
p = 500 22.8 0.2 45.8 0.3 35.9 0.3 72.1 0.0 71.6 0.0 83.0 0.2 87.8 0.2
p = 800 23.0 0.2 45.8 0.3 36.4 0.3 71.9 0.0 71.6 0.0 85.2 0.2 91.5 0.2
p = 1200 23.2 0.3 45.8 0.3 36.3 0.3 38.1 0.4 25.2 0.3 87.3 0.2 93.8 0.2
p = 2000 23.1 0.3 45.7 0.3 36.2 0.3 54.0 0.5 34.1 0.5 89.2 0.2 95.9 0.1
n = 500 p = 200 13.4 0.1 31.0 0.2 21.7 0.1 71.2 0.0 71.0 0.0 55.3 0.3 43.2 0.3
p = 500 13.5 0.1 31.0 0.2 21.6 0.1 71.2 0.0 71.0 0.0 61.3 0.3 55.3 0.4
p = 800 13.4 0.1 31.0 0.2 21.8 0.1 71.2 0.0 71.0 0.0 64.7 0.2 62.8 0.4
p = 1200 13.4 0.1 30.8 0.2 21.6 0.1 21.5 0.2 14.3 0.1 67.2 0.2 67.8 0.3
p = 2000 13.5 0.1 31.0 0.2 21.8 0.1 22.0 0.2 14.4 0.1 69.7 0.2 73.3 0.3
n =800 p = 200 10.5 0.1 25.2 0.2 17.2 0.1 71.0 0.0 70.9 0.0 42.7 0.2 28.7 0.2
p = 500 10.4 0.1 24.9 0.2 17.1 0.1 71.0 0.0 70.9 0.0 47.3 0.3 34.2 0.3
p = 800 10.5 0.1 25.1 0.2 17.1 0.1 71.0 0.0 70.9 0.0 50.2 0.3 39.9 0.4
p = 1200 10.3 0.1 24.9 0.2 17.0 0.1 16.9 0.2 11.0 0.1 52.4 0.3 45.8 0.4
p = 2000 10.6 0.1 25.2 0.2 17.2 0.1 17.3 0.2 11.3 0.1 56.5 0.3 53.8 0.4
5\mathcal{M}_{5} n = 200 p = 200 30.6 0.6 29.1 0.1 22.0 0.1 71.6 0.0 71.2 0.0 58.8 0.3 56.6 0.3
p = 500 30.4 0.6 28.9 0.2 22.2 0.1 71.5 0.0 71.2 0.0 67.4 0.3 73.5 0.4
p = 800 30.8 0.6 28.8 0.2 22.1 0.1 71.6 0.0 71.2 0.0 71.2 0.3 81.3 0.3
p= 1200 31.0 0.6 29.1 0.1 22.3 0.1 20.0 0.2 14.6 0.1 74.1 0.3 86.4 0.3
p = 2000 31.3 0.6 28.7 0.2 22.1 0.1 19.3 0.2 14.4 0.1 77.8 0.3 90.3 0.2
n = 500 p = 200 12.4 0.2 18.4 0.1 13.6 0.1 71.1 0.0 70.9 0.0 31.7 0.2 24.5 0.2
p = 500 11.8 0.2 18.4 0.1 13.6 0.1 71.0 0.0 70.9 0.0 35.9 0.2 29.2 0.2
p = 800 11.9 0.2 18.4 0.1 13.6 0.1 71.0 0.0 70.9 0.0 37.9 0.3 32.9 0.3
p = 1200 11.6 0.2 18.4 0.1 13.7 0.1 12.1 0.1 8.6 0.1 39.7 0.3 37.3 0.3
p = 2000 12.0 0.2 18.5 0.1 13.6 0.1 12.1 0.1 8.6 0.1 42.1 0.3 46.3 0.4
n = 800 p = 200 7.9 0.1 14.7 0.1 10.7 0.1 70.9 0.0 70.8 0.0 23.8 0.1 17.4 0.1
p = 500 7.8 0.1 14.5 0.1 10.6 0.1 70.9 0.0 70.8 0.0 25.7 0.2 19.2 0.1
p = 800 7.8 0.1 14.6 0.1 10.7 0.1 70.9 0.0 70.8 0.0 27.1 0.2 20.7 0.2
p = 1200 7.7 0.1 14.6 0.1 10.6 0.1 9.3 0.1 6.5 0.1 28.3 0.2 21.6 0.2
p = 2000 7.9 0.1 14.4 0.1 10.7 0.1 9.4 0.1 6.6 0.1 29.9 0.2 25.0 0.2
Table S3: d(V,V^)d(V,\hat{V}) and corresponding standard errors (in 10210^{-2}) for multiple index models (identity variance)
MDDM Oracle-SIR(3) Oracle-SIR(10) Rifle-SIR(3) Rifle-SIR(10) LassoSIR(3) LassoSIR(10)
Error SE Error SE Error SE Error SE Error SE Error SE Error SE
3\mathcal{M}_{3} n = 200 p = 200 41.4 0.4 58.8 0.2 50.2 0.2 72.6 0.0 72.2 0.0 67.0 0.2 62.8 0.2
p = 500 42.1 0.4 59.0 0.2 50.4 0.2 72.9 0.1 72.4 0.0 70.3 0.2 71.1 0.2
p = 800 43.0 0.4 59.1 0.2 50.4 0.2 72.7 0.0 72.3 0.0 72.1 0.2 75.1 0.2
p = 1200 42.7 0.4 59.4 0.2 50.8 0.2 53.0 0.4 39.9 0.4 73.0 0.2 78.6 0.2
p = 2000 43.5 0.4 59.2 0.2 50.8 0.2 53.5 0.4 40.2 0.4 74.8 0.2 82.1 0.2
n = 500 p = 200 25.6 0.3 44.6 0.2 34.3 0.2 71.4 0.0 71.3 0.0 51.5 0.2 42.4 0.2
p = 500 25.8 0.3 44.4 0.2 34.5 0.2 71.5 0.0 71.3 0.0 53.8 0.2 45.7 0.2
p = 800 25.2 0.3 44.6 0.2 34.1 0.2 71.5 0.0 71.3 0.0 54.8 0.2 47.1 0.3
p = 1200 25.9 0.3 44.4 0.2 34.4 0.2 33.8 0.4 23.8 0.2 55.8 0.2 50.1 0.3
p = 2000 25.7 0.3 44.3 0.2 34.4 0.2 35.0 0.4 23.8 0.2 56.9 0.2 54.0 0.3
n =800 p = 200 20.1 0.2 37.0 0.2 27.8 0.2 71.2 0.0 71.0 0.0 43.8 0.2 34.9 0.2
p = 500 19.8 0.2 37.8 0.2 27.6 0.2 71.2 0.0 71.1 0.0 46.1 0.2 36.6 0.2
p = 800 20.0 0.2 37.3 0.2 27.9 0.2 71.2 0.0 71.0 0.0 47.0 0.2 37.9 0.2
p = 1200 19.8 0.2 37.1 0.2 27.8 0.2 26.7 0.3 18.5 0.2 47.6 0.2 38.6 0.2
p = 2000 19.9 0.2 37.4 0.2 27.7 0.2 27.5 0.3 18.7 0.2 48.8 0.2 40.7 0.2
4\mathcal{M}_{4} n = 200 p = 200 58.1 0.4 75.9 0.2 70.1 0.3 80.4 0.2 78.2 0.2 85.7 0.2 84.6 0.2
p = 500 59.3 0.5 75.8 0.2 70.2 0.3 95.2 0.1 94.2 0.1 88.8 0.2 90.2 0.2
p = 800 59.1 0.5 75.1 0.2 69.9 0.3 81.0 0.2 78.7 0.2 89.7 0.2 92.1 0.2
p = 1200 59.5 0.5 75.3 0.2 69.9 0.3 73.9 0.3 62.3 0.5 90.5 0.2 93.9 0.2
p = 2000 60.4 0.5 75.4 0.2 69.9 0.3 74.1 0.4 63.0 0.5 91.9 0.2 95.2 0.1
n = 500 p = 200 38.6 0.4 61.4 0.2 50.9 0.3 74.5 0.1 73.5 0.1 70.5 0.2 63.0 0.3
p = 500 38.7 0.4 61.6 0.2 51.1 0.3 74.7 0.1 73.4 0.1 74.0 0.2 69.0 0.3
p = 800 39.3 0.4 61.3 0.2 50.8 0.2 77.5 0.2 74.9 0.1 75.6 0.2 72.5 0.3
p = 1200 39.5 0.4 61.3 0.2 51.0 0.2 54.9 0.4 40.2 0.4 77.1 0.2 76.0 0.3
p = 2000 39.7 0.4 61.5 0.2 51.3 0.3 56.1 0.4 40.6 0.4 78.7 0.2 79.7 0.3
n =800 p = 200 30.7 0.3 53.1 0.2 42.2 0.2 73.1 0.1 72.4 0.0 61.3 0.2 51.4 0.3
p = 500 30.5 0.3 52.9 0.2 42.0 0.2 73.1 0.1 72.4 0.0 64.1 0.2 54.4 0.3
p = 800 30.6 0.3 53.7 0.2 42.3 0.2 73.4 0.1 72.5 0.0 65.9 0.2 58.2 0.3
p = 1200 30.8 0.3 53.5 0.2 42.2 0.2 45.3 0.4 31.5 0.3 67.2 0.2 60.8 0.3
p = 2000 30.7 0.3 53.6 0.2 42.2 0.2 45.2 0.4 31.1 0.3 68.6 0.2 64.6 0.3
5\mathcal{M}_{5} n = 200 p = 200 45.5 0.6 46.5 0.2 35.7 0.2 73.9 0.1 72.4 0.0 62.6 0.2 51.8 0.2
p = 500 46.9 0.6 46.9 0.2 35.8 0.2 73.9 0.1 72.4 0.0 65.6 0.2 57.4 0.3
p = 800 46.2 0.6 46.4 0.2 35.5 0.2 73.8 0.1 72.4 0.0 66.5 0.2 61.4 0.3
p = 1200 46.9 0.6 46.3 0.2 35.8 0.2 33.0 0.3 23.9 0.2 67.2 0.3 65.7 0.3
p = 2000 46.1 0.6 46.5 0.2 35.7 0.2 33.6 0.3 23.8 0.2 68.8 0.3 72.0 0.3
n = 500 p = 200 24.7 0.4 31.4 0.2 22.6 0.1 72.0 0.0 71.3 0.0 44.8 0.2 32.6 0.1
p = 500 24.7 0.4 31.1 0.2 22.4 0.1 71.9 0.0 71.3 0.0 46.8 0.2 35.4 0.2
p = 800 25.4 0.4 31.3 0.2 22.9 0.1 72.0 0.0 71.3 0.0 48.0 0.2 36.7 0.2
p = 1200 25.2 0.4 31.5 0.2 22.7 0.1 20.8 0.2 14.3 0.1 48.2 0.2 37.6 0.2
p = 2000 25.5 0.4 31.7 0.2 22.8 0.1 20.9 0.2 14.4 0.1 49.1 0.2 39.5 0.2
n =800 p = 200 18.7 0.3 25.6 0.1 18.0 0.1 71.5 0.0 71.1 0.0 36.7 0.1 25.6 0.1
p = 500 18.1 0.3 25.3 0.1 17.7 0.1 71.5 0.0 71.1 0.0 38.4 0.2 27.1 0.1
p = 800 18.6 0.3 25.4 0.1 18.0 0.1 71.5 0.0 71.1 0.0 39.3 0.2 28.4 0.1
p = 1200 18.7 0.3 25.3 0.1 17.9 0.1 16.3 0.1 11.0 0.1 40.2 0.2 29.2 0.1
p = 2000 19.1 0.3 25.3 0.1 18.0 0.1 16.5 0.1 11.2 0.1 41.0 0.2 30.3 0.1
Table S4: d(V,V^)d(V,\hat{V}) and corresponding standard errors (in 10210^{-2}) for multiple index models (AR-type variance)
MDDM Oracle-SIR(3) Oracle-SIR(10) Rifle-SIR(3) Rifle-SIR(10) LassoSIR(3) LassoSIR(10)
Error SE Error SE Error SE Error SE Error SE Error SE Error SE
6\mathcal{M}_{6} n = 200 p = 200 34.3 0.5 49.1 0.5 33.5 0.4 50.0 0.7 30.6 0.5 70.7 0.0 70.7 0.0
p = 500 34.2 0.5 48.7 0.5 32.8 0.4 49.5 0.7 30.1 0.5 70.7 0.0 70.7 0.0
p = 800 34.6 0.6 48.9 0.5 33.4 0.5 50.1 0.7 30.8 0.6 70.7 0.0 70.7 0.0
p = 1200 44.5 0.7 49.3 0.5 33.6 0.5 67.0 0.7 38.9 0.7 70.7 0.0 70.7 0.0
p = 2000 44.8 0.8 48.1 0.5 33.2 0.5 66.7 0.8 39.4 0.7 70.7 0.0 70.7 0.0
n = 500 p = 200 22.0 0.4 35.5 0.4 22.6 0.3 33.0 0.5 19.0 0.3 70.7 0.0 70.7 0.0
p = 500 21.8 0.4 34.7 0.4 22.3 0.3 32.4 0.5 18.8 0.4 70.7 0.0 70.7 0.0
p = 800 21.7 0.3 34.6 0.4 22.5 0.3 32.3 0.5 18.9 0.3 70.7 0.0 70.7 0.0
p = 1200 27.1 0.5 35.5 0.4 22.5 0.3 43.4 0.7 23.4 0.4 70.7 0.0 70.7 0.0
p = 2000 26.9 0.5 34.8 0.4 22.4 0.3 42.3 0.7 23.6 0.5 70.7 0.0 70.7 0.0
n =800 p = 200 17.2 0.3 29.2 0.3 18.6 0.3 25.8 0.4 14.9 0.2 70.7 0.0 70.7 0.0
p = 500 16.7 0.3 28.5 0.3 18.1 0.2 25.3 0.4 14.3 0.3 70.7 0.0 70.7 0.0
p = 800 16.5 0.2 28.4 0.3 17.9 0.2 25.0 0.4 14.1 0.2 70.7 0.0 70.7 0.0
p =1200 20.8 0.4 29.0 0.4 18.2 0.3 31.7 0.5 18.5 0.4 70.7 0.0 70.7 0.0
p = 2000 21.2 0.4 28.7 0.4 18.3 0.3 32.4 0.6 18.8 0.3 70.7 0.0 70.7 0.0
Table S5: d(V,V^)d(V,\hat{V}) and corresponding standard errors (in 10210^{-2}) for isotropic PFC models (6\mathcal{M}_{6})
n=100n=100 n=200n=200 n=400n=400
p=800p=800 p=1200p=1200 p=800p=800 p=1200p=1200 p=800p=800 p=1200p=1200
Error SE Error SE Error SE Error SE Error SE Error SE
7\mathcal{M}_{7} MDDM 45.0 0.5 45.9 0.5 27.5 0.3 28.7 0.4 18.8 0.3 19.6 0.3
PR-Oracle-SIR(3) 26.3 0.2 26.5 0.2 18.3 0.1 18.4 0.2 12.6 0.1 12.6 0.1
PR-Oracle-SIR(10) 33.0 0.3 33.0 0.3 20.1 0.2 20.2 0.2 13.0 0.1 13.1 0.1
PR-SIR(3) 96.6 0.0 97.7 0.0 93.2 0.0 95.3 0.0 87.6 0.0 91.1 0.0
PR-SIR(10) 97.4 0.0 98.2 0.0 95.0 0.0 96.6 0.0 90.0 0.1 93.7 0.0
8\mathcal{M}_{8} MDDM 93.5 0.6 94.9 0.5 73.6 1.1 77.1 1.1 36.7 1.1 37.8 1.2
PR-Oracle-SIR(3) 80.1 0.6 80.1 0.6 64.0 0.7 64.7 0.7 42.5 0.5 41.1 0.5
PR-Oracle-SIR(10) 79.1 0.6 78.9 0.6 58.4 0.6 58.7 0.6 34.5 0.4 34.2 0.4
PR-SIR(3) 99.9 0.0 100.0 0.0 99.9 0.0 100.0 0.0 99.9 0.0 99.9 0.0
PR-SIR(10) 99.9 0.0 100.0 0.0 99.9 0.0 100.0 0.0 99.9 0.0 100.0 0.0
9\mathcal{M}_{9} MDDM 17.3 0.4 17.5 0.4 10.0 0.1 10.1 0.1 7.1 0.1 7.1 0.1
PR-Oracle-SIR(3) 15.5 0.2 15.2 0.2 10.6 0.1 10.6 0.1 7.5 0.1 7.5 0.1
PR-Oracle-SIR(10) 14.2 0.2 13.9 0.2 9.6 0.1 9.6 0.1 6.8 0.1 6.8 0.1
PR-SIR(3) 92.2 0.1 95.0 0.1 83.0 0.1 88.3 0.1 71.0 0.1 77.9 0.1
PR-SIR(10) 90.0 0.1 93.4 0.1 80.1 0.1 85.8 0.1 67.4 0.1 74.7 0.1
Table S6: Averaged subspace estimation errors and the corresponding standard errors (multiplied by 100) for the multivariate response models (p\in\{800,1200\}).

S2 More on Real Data Analysis

S2.1 Choice of Tuning Parameter on Real Data

To apply our proposal on real data, we need to determine the tuning parameter s, i.e., the desired level of sparsity. In penalized problems such as sparse PCA and sparse SIR, tuning parameters are often chosen with cross-validation. We could also employ cross-validation to choose s. However, as with almost any procedure, cross-validation would considerably slow down the computation. Moreover, as we observe in our theoretical and simulation studies, our method is not very sensitive to s; the result is reasonably stable as long as s is larger than d. Hence, we resort to a faster tuning method on the real data as follows. We start with a sequence of reasonable sparsity levels \mathcal{S}, which is set to be \{1,\ldots,45\}. For each element of \mathcal{S}, we calculate \hat{\boldsymbol{\beta}} and the sample distance covariance (Székely et al. 2007) between \mathbf{Y}_{i} and \hat{\boldsymbol{\beta}}^{\mathrm{\tiny T}}\mathbf{X}_{i}, i=1,2,\ldots,n. Here the distance covariance is used as a model-free measure of the dependence between \mathbf{Y}_{i} and \hat{\boldsymbol{\beta}}^{\mathrm{\tiny T}}\mathbf{X}_{i}. Intuitively, the distance covariance increases as the pre-specified sparsity increases. Therefore, we plot the sample distance covariance against the sparsity levels in Figure S1 and pick the smallest sparsity level beyond which larger sparsity levels do not lead to a significantly larger distance covariance (i.e., the “elbow method”). Based on Figure S1, a sparsity level between 20 and 25 seems reasonable, and we pick s=25 as the pre-specified sparsity.
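This tuning rule can be sketched as follows (our own code), using a textbook V-statistic implementation of the squared sample distance covariance; fit_beta is a hypothetical stand-in that returns the estimated basis \hat{\boldsymbol{\beta}} at a given sparsity level s, and the elbow is then read off the resulting curve as in Figure S1.

import numpy as np

def dist_matrix(Z):
    # pairwise Euclidean distance matrix of the rows of Z
    sq = np.sum(Z ** 2, axis=1)
    return np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2 * Z @ Z.T, 0.0))

def double_center(D):
    return D - D.mean(axis=0, keepdims=True) - D.mean(axis=1, keepdims=True) + D.mean()

def distance_covariance_sq(Z, Y):
    # squared sample distance covariance (V-statistic form of Szekely et al. 2007)
    A, B = double_center(dist_matrix(Z)), double_center(dist_matrix(Y))
    return np.mean(A * B)

def tuning_curve(X, Y, fit_beta, sparsity_grid):
    # dependence between Y and beta_hat(s)^T X over candidate sparsity levels s
    return [distance_covariance_sq(X @ fit_beta(X, Y, s), Y) for s in sparsity_grid]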

Figure S1: Distance covariance for different pre-specified sparsities.

S2.2 Additional Real Data Analysis Results

To further demonstrate our method on the real data, we also construct the scatterplot of the leading two directions \boldsymbol{\beta}_{1}^{\mathrm{\tiny T}}\mathbf{X} and \boldsymbol{\beta}_{2}^{\mathrm{\tiny T}}\mathbf{X} in Figure S2. For the first direction, the experimental units from leukemia (LE) and colorectal cancer (CO) lie on different sides of the plot. For the second direction, CO and LE have similar values, whereas the units obtained from central nervous system cancer (CN), breast cancer (BR), lung cancer (LC) and ovarian cancer (OV) are on the opposite side of the figure. This pattern coincides with the first two canonical correlation analysis directions in a previous study of the same data set (Figure 8, Cruz-Cano and Lee 2014). Such a finding is very encouraging, as our slicing-free approach automatically detects the most significant associations between the response and the predictors while being applied directly to the multivariate response.

Figure S2: The scatter plot of the first two leading directions.

S3 Computational Complexity

We briefly analyze the computational complexity of our proposed methods. For both algorithms, we need to compute the sample MDDM. The current computational complexity of sample MDDM is O(n2p)O(n^{2}p). If we adapt the fast computing algorithm of Huo and Székely (2016) developed for distance correlation to MDDM, we might be able to reduce the complexity to O(pnlogn)O(pn\log n). In Algorithm 2, we further need to compute the sample covariance at the complexity level of O(np2)O(np^{2}).
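For reference, a direct vectorized computation of the sample MDDM from its definition (the centered double sum displayed in Section S5) can be sketched as follows (our own code); in our setting \mathbf{V} plays the role of the predictors and \mathbf{U} the role of the response, and the work reduces to one n\times n pairwise distance matrix and two matrix products.

import numpy as np

def sample_mddm(V, U):
    # MDDM_n(V | U) = -(1/n^2) * sum_{k,l} (V_k - Vbar)(V_l - Vbar)^T |U_k - U_l|,
    # computed here as -(1/n^2) * Vc^T D Vc with D the pairwise distance matrix of U
    n = V.shape[0]
    Vc = V - V.mean(axis=0)                                # center the predictors
    sq = np.sum(U ** 2, axis=1)
    D = np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2 * U @ U.T, 0.0))
    return -(Vc.T @ D @ Vc) / n ** 2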

To apply the two algorithms, we assume that the maximum number of iterations is TT in finding the KK directions (Step 3(a) in both algorithms). After obtaining MDDM, Algorithm 1 has a computational complexity of O(KT(ps+p)+(K1)(s2+ps))O(KT(ps+p)+(K-1)(s^{2}+ps)), where (ps+p)(ps+p) is the computation complexity of each iteration (Yuan and Zhang 2013) and (s2+ps)(s^{2}+ps) is the computation complexity of deflating 𝐌^k\widehat{\mathbf{M}}_{k} in Step 3(b). Note that, in Step 3(b), we repeatedly exploit the sparsity of 𝜷^k\widehat{\boldsymbol{\beta}}_{k} to reduce the computation complexity. For example, to compute 𝜷^k𝜷^kT\widehat{\boldsymbol{\beta}}_{k}\widehat{\boldsymbol{\beta}}_{k}^{\mathrm{\tiny T}}, we only need to compute the s2s^{2} nonzero elements, and the same applies to 𝜷^kT𝐌^𝜷^k\widehat{\boldsymbol{\beta}}_{k}^{\mathrm{\tiny T}}\widehat{\mathbf{M}}\widehat{\boldsymbol{\beta}}_{k}, which has a computational complexity of O(ps)O(ps). For Algorithm 2, the computational complexity is O(KT(ps+p)+(K1)(p2+ps))O(KT(ps+p)+(K-1)(p^{2}+ps)). Note that this computational complexity only differs from that of Algorithm 1 in the term p2p^{2}. This term is the cost to deflate 𝐌^\widehat{\mathbf{M}} in Step 3(b). Since 𝚺^𝐗𝜷^k𝜷^kT𝚺^𝐗\widehat{\bm{\Sigma}}_{\mathbf{X}}\widehat{\boldsymbol{\beta}}_{k}\widehat{\boldsymbol{\beta}}_{k}^{\mathrm{\tiny T}}\widehat{\bm{\Sigma}}_{\mathbf{X}} is not guaranteed to be sparse as 𝜷^k𝜷^kT\widehat{\boldsymbol{\beta}}_{k}\widehat{\boldsymbol{\beta}}_{k}^{\mathrm{\tiny T}}, the deflation is more computationally expensive. Otherwise, each iteration in Step 3(a) of Algorithm 2 has the same complexity as Step 3(a) of Algorithm 1 (Tan, Wang, Liu and Zhang 2018).

S4 Proofs for Proposition 2 & Lemma 1

Proof of Proposition 2.

From the basic properties of MDDM (c.f. beginning of Section 1), E(𝐀T𝐗Y)=E(𝐀T𝐗)\mathrm{E}(\mathbf{A}^{\mathrm{\tiny T}}\mathbf{X}\mid Y)=\mathrm{E}(\mathbf{A}^{\mathrm{\tiny T}}\mathbf{X}) is equivalent to MDDM(𝐀T𝐗Y)=0\mathrm{MDDM}(\mathbf{A}^{\mathrm{\tiny T}}\mathbf{X}\mid Y)=0.

Suppose the rank of MDDM(𝐗Y)\mathrm{MDDM}(\mathbf{X}\mid Y) is dd, then there exists an orthogonal basis matrix for p\mathbb{R}^{p}, (𝜷,𝜷0)(\boldsymbol{\beta},\boldsymbol{\beta}_{0}) with 𝜷p×d\boldsymbol{\beta}\in\mathbb{R}^{p\times d} and 𝜷0p×(pd)\boldsymbol{\beta}_{0}\in\mathbb{R}^{p\times(p-d)}, such that span(𝜷)=span{MDDM(𝐗Y)}\mathrm{span}(\boldsymbol{\beta})=\mathrm{span}\{\mathrm{MDDM}(\mathbf{X}\mid Y)\}. This implies 𝜷0TMDDM(𝐗Y)=0\boldsymbol{\beta}_{0}^{\mathrm{\tiny T}}\mathrm{MDDM}(\mathbf{X}\mid Y)=0 and equivalently 𝜷0T{E(𝐗Y)E(𝐗)}=0\boldsymbol{\beta}_{0}^{\mathrm{\tiny T}}\{\mathrm{E}(\mathbf{X}\mid Y)-\mathrm{E}(\mathbf{X})\}=0. Therefore, span(𝜷0)𝒮E(𝐗Y)\mathrm{span}(\boldsymbol{\beta}_{0})\subseteq\mathcal{S}_{\mathrm{E}(\mathbf{X}\mid Y)}^{\perp}, which leads to 𝒮E(𝐗Y)span{E(𝐗Y)E(𝐗)}span(𝜷)\mathcal{S}_{\mathrm{E}(\mathbf{X}\mid Y)}\equiv\mathrm{span}\{\mathrm{E}(\mathbf{X}\mid Y)-\mathrm{E}(\mathbf{X})\}\subseteq\mathrm{span}(\boldsymbol{\beta}).

Similarly, for any vector 𝐯𝒮E(𝐗Y)\mathbf{v}\in\mathcal{S}_{\mathrm{E}(\mathbf{X}\mid Y)}^{\perp} we have 𝐯T{E(𝐗Y)E(𝐗)}=0\mathbf{v}^{\mathrm{\tiny T}}\{\mathrm{E}(\mathbf{X}\mid Y)-\mathrm{E}(\mathbf{X})\}=0 and hence 𝐯TMDDM(𝐗Y)𝐯=0\mathbf{v}^{\mathrm{\tiny T}}\mathrm{MDDM}(\mathbf{X}\mid Y)\mathbf{v}=0. This implies that 𝐯span(𝜷0)\mathbf{v}\in\mathrm{span}(\boldsymbol{\beta}_{0}) and hence, 𝒮E(𝐗Y)span(𝜷0)\mathcal{S}_{\mathrm{E}(\mathbf{X}\mid Y)}^{\perp}\subseteq\mathrm{span}(\boldsymbol{\beta}_{0}) and span(𝜷)𝒮E(𝐗Y)\mathrm{span}(\boldsymbol{\beta})\subseteq\mathcal{S}_{\mathrm{E}(\mathbf{X}\mid Y)}.

For the proof of Lemma 1, we need the following elementary lemma. We include its proof for completeness.

Lemma S2.

Let 𝛂k{\bm{\alpha}}_{k} be the normalized kkth eigenvector of 𝚺𝐗1/2𝐌𝚺𝐗1/2{\bm{\Sigma}}_{\mathbf{X}}^{-1/2}\mathbf{M}{\bm{\Sigma}}_{\mathbf{X}}^{-1/2}. Then we must have 𝛃k=𝚺𝐗1/2𝛂k\boldsymbol{\beta}_{k}={\bm{\Sigma}}_{\mathbf{X}}^{-1/2}{\bm{\alpha}}_{k}.

Proof of Lemma S2.

Let 𝜶=𝚺𝐗1/2𝜷{\bm{\alpha}}={\bm{\Sigma}}_{\mathbf{X}}^{1/2}\boldsymbol{\beta} in (4.8), we have that 𝜷k=𝚺𝐗1/2𝜶k\boldsymbol{\beta}_{k}={\bm{\Sigma}}_{\mathbf{X}}^{-1/2}{\bm{\alpha}}_{k}, where

𝜶k=argmax𝜶𝜶T𝚺𝐗1/2𝐌𝚺𝐗1/2𝜶 s.t 𝜶T𝜶=1,𝜶T𝜶l=0,l<k{\bm{\alpha}}_{k}=\arg\max_{{\bm{\alpha}}}{\bm{\alpha}}^{\mathrm{\tiny T}}{\bm{\Sigma}}_{\mathbf{X}}^{-1/2}\mathbf{M}{\bm{\Sigma}}_{\mathbf{X}}^{-1/2}{\bm{\alpha}}\mbox{ s.t ${\bm{\alpha}}^{\mathrm{\tiny T}}{\bm{\alpha}}=1,{\bm{\alpha}}^{\mathrm{\tiny T}}{\bm{\alpha}}_{l}=0,l<k$} (S1)

It is easy to see that 𝜶k{\bm{\alpha}}_{k} is the kkth eigenvector of 𝚺𝐗1/2𝐌𝚺𝐗1/2{\bm{\Sigma}}_{\mathbf{X}}^{-1/2}\mathbf{M}{\bm{\Sigma}}_{\mathbf{X}}^{-1/2} and the conclusion follows. ∎

Proof of Lemma 1.

By Lemma S2, we have 𝚺𝐗1/2𝐌𝚺𝐗1/2=j=1pλj𝜶j𝜶jT{\bm{\Sigma}}_{\mathbf{X}}^{-1/2}\mathbf{M}{\bm{\Sigma}}_{\mathbf{X}}^{-1/2}=\sum_{j=1}^{p}\lambda_{j}{\bm{\alpha}}_{j}{\bm{\alpha}}_{j}^{\mathrm{\tiny T}}. It follows that 𝐌=𝚺𝐗{j=1pλj𝜷j𝜷jT}𝚺𝐗\mathbf{M}={\bm{\Sigma}}_{\mathbf{X}}\{\sum_{j=1}^{p}\lambda_{j}\boldsymbol{\beta}_{j}\boldsymbol{\beta}_{j}^{\mathrm{\tiny T}}\}{\bm{\Sigma}}_{\mathbf{X}}. Hence, 𝐌k=𝚺𝐗{j=kpλj𝜷j𝜷jT}𝚺𝐗\mathbf{M}_{k}={\bm{\Sigma}}_{\mathbf{X}}\{\sum_{j=k}^{p}\lambda_{j}\boldsymbol{\beta}_{j}\boldsymbol{\beta}_{j}^{\mathrm{\tiny T}}\}{\bm{\Sigma}}_{\mathbf{X}} and 𝜷k\boldsymbol{\beta}_{k} is its leading generalized eigenvector subject to 𝜷T𝚺𝐗𝜷=1\boldsymbol{\beta}^{\mathrm{\tiny T}}{\bm{\Sigma}}_{\mathbf{X}}\boldsymbol{\beta}=1. ∎

S5 Proof for Theorem 1

Let (\mathbf{V}_{k\cdot},\mathbf{U}_{k\cdot})_{k=1}^{n} be an iid sample from the joint distribution of (\mathbf{V},\mathbf{U}), where \mathbf{V}_{k\cdot}=(V_{k1},\cdots,V_{kp})^{\mathrm{\tiny T}} is the kth sample. Write \mathrm{MDDM}(\mathbf{V}|\mathbf{U})=-\mathrm{E}\{(\mathbf{V}-\mathrm{E}(\mathbf{V}))(\mathbf{V}^{\prime}-\mathrm{E}(\mathbf{V}^{\prime}))^{\mathrm{\tiny T}}|\mathbf{U}-\mathbf{U}^{\prime}|_{q}\}=-\mathbf{R}-\mathbf{S}+2\mathbf{T}, where

\mathbf{R} = \mathrm{E}(\mathbf{V}\mathbf{V}^{\prime{\mathrm{\tiny T}}}|\mathbf{U}-\mathbf{U}^{\prime}|_{q}),
\mathbf{S} = \mathrm{E}(\mathbf{V})\mathrm{E}(\mathbf{V}^{\prime})^{\mathrm{\tiny T}}\mathrm{E}(|\mathbf{U}-\mathbf{U}^{\prime}|_{q}),
\mathbf{T} = \mathrm{E}[\mathbf{V}\mathbf{V}^{\prime{\mathrm{\tiny T}}}|\mathbf{U}^{\prime}-\mathbf{U}^{\prime\prime}|_{q}] = \mathrm{E}[\mathrm{E}(\mathbf{V})\mathbf{V}^{\prime{\mathrm{\tiny T}}}|\mathbf{U}-\mathbf{U}^{\prime}|_{q}],

with (𝐕,𝐔)(\mathbf{V}^{\prime},\mathbf{U}^{\prime}) and (𝐕′′,𝐔′′)(\mathbf{V}^{{}^{\prime\prime}},\mathbf{U}^{{}^{\prime\prime}}) being iid copies of (𝐕,𝐔)(\mathbf{V},\mathbf{U}). Note that 𝐕=(V1,,Vp)T\mathbf{V}^{\prime}=(V_{1}^{\prime},\cdots,V_{p}^{\prime})^{T}. At the sample level, we have

MDDMn(𝐕|𝐔)=1n2k,l=1n(𝐕k𝐕¯n)(𝐕l𝐕¯n)T|𝐔k𝐔l|q=𝐑n𝐒n+2𝐓n,\mathrm{MDDM}_{n}(\mathbf{V}|\mathbf{U})=-\frac{1}{n^{2}}\sum_{k,l=1}^{n}(\mathbf{V}_{k\cdot}-\bar{\mathbf{V}}_{n})(\mathbf{V}_{l\cdot}-\bar{\mathbf{V}}_{n})^{T}|\mathbf{U}_{k\cdot}-\mathbf{U}_{l\cdot}|_{q}=-\mathbf{R}_{n}-\mathbf{S}_{n}+2\mathbf{T}_{n},

where 𝐕¯n=n1k=1n𝐕k\bar{\mathbf{V}}_{n}=n^{-1}\sum_{k=1}^{n}\mathbf{V}_{k\cdot}, and

\mathbf{R}_{n} = n^{-2}\sum_{k,l=1}^{n}\mathbf{V}_{k\cdot}\mathbf{V}_{l\cdot}^{\mathrm{\tiny T}}|\mathbf{U}_{k\cdot}-\mathbf{U}_{l\cdot}|_{q},
\mathbf{S}_{n} = n^{-2}\sum_{k,l=1}^{n}\mathbf{V}_{k\cdot}\mathbf{V}_{l\cdot}^{\mathrm{\tiny T}}\cdot n^{-2}\sum_{k,l=1}^{n}|\mathbf{U}_{k\cdot}-\mathbf{U}_{l\cdot}|_{q},
\mathbf{T}_{n} = n^{-3}\sum_{k,l,h=1}^{n}\mathbf{V}_{k\cdot}\mathbf{V}_{h\cdot}^{\mathrm{\tiny T}}|\mathbf{U}_{h\cdot}-\mathbf{U}_{l\cdot}|_{q}.
Proposition S3.

Suppose Condition (C1) holds. There exists a positive integer n0=n0(σ0,C0,q)<n_{0}=n_{0}(\sigma_{0},C_{0},q)<\infty, γ=γ(σ0,C0,q)(0,1/2)\gamma=\gamma(\sigma_{0},C_{0},q)\in(0,1/2) and a finite positive constant D0=D0(σ0,C0,q)<D_{0}=D_{0}(\sigma_{0},C_{0},q)<\infty such that when nn0n\geq n_{0} and 16>ϵ>D0nγ16>\epsilon>D_{0}n^{-\gamma}, we have

P(𝐑n𝐑max>4ϵ)10p2exp(ϵ2n4log3n).P(\|\mathbf{R}_{n}-\mathbf{R}\|_{max}>4\epsilon)\leq 10p^{2}\exp\left(-\frac{\epsilon^{2}n}{4\log^{3}{n}}\right).

Proof of Proposition S3: Throughout the proof, C is a generic positive constant that may vary from line to line. We shall first find a bound for P(\|\mathbf{R}_{n}-\mathbf{R}\|_{\max}>4\epsilon). For i,j=1,\cdots,p, let R_{ij}=\mathrm{E}[V_{i}V_{j}^{\prime}|\mathbf{U}-\mathbf{U}^{\prime}|_{q}] and R_{n,ij}=n^{-2}\sum_{k,l=1}^{n}V_{ki}V_{lj}|\mathbf{U}_{k\cdot}-\mathbf{U}_{l\cdot}|_{q}. Note that

P(𝐑n𝐑max>4ϵ)p2maxi,j=1,,pP(|Rn,ijRij|>4ϵ).P(\|\mathbf{R}_{n}-\mathbf{R}\|_{max}>4\epsilon)\leq p^{2}\max_{i,j=1,\cdots,p}P(|R_{n,ij}-R_{ij}|>4\epsilon).

We shall focus on Case I, (i,j)=(1,2), since the other cases can be treated in the same fashion and the bound holds uniformly over all pairs (i,j).

Case I: (i,j)=(1,2). Write \widetilde{R}_{n,12}=\{n(n-1)\}^{-1}\sum_{k\not=l}^{n}V_{k1}V_{l2}|\mathbf{U}_{k\cdot}-\mathbf{U}_{l\cdot}|_{q}. Let \mathbf{W}=(\mathbf{U}^{\mathrm{\tiny T}},V_{1},V_{2})^{\mathrm{\tiny T}}, \mathbf{W}^{\prime}=((\mathbf{U}^{\prime})^{\mathrm{\tiny T}},V_{1}^{\prime},V_{2}^{\prime})^{\mathrm{\tiny T}} and \mathbf{W}_{k}=(\mathbf{U}_{k\cdot}^{\mathrm{\tiny T}},V_{k1},V_{k2})^{\mathrm{\tiny T}}. Define the kernel h_{1} as

h1(𝐖;𝐖)=V1V2|𝐔𝐔|q+V1V2|𝐔𝐔|q2h_{1}(\mathbf{W};\mathbf{W}^{\prime})=\frac{V_{1}V_{2}^{\prime}|\mathbf{U}-\mathbf{U}^{\prime}|_{q}+V_{1}^{\prime}V_{2}|\mathbf{U}-\mathbf{U}^{\prime}|_{q}}{2}

Then h1h_{1} is symmetric, R~n,12={n(n1)}1klnh1(𝐖k;𝐖l)\widetilde{R}_{n,12}=\{n(n-1)\}^{-1}\sum_{k\not=l}^{n}h_{1}(\mathbf{W}_{k};\mathbf{W}_{l}) is a U-statistic of order two and Rn,12=n1nR~n,12R_{n,12}=\frac{n-1}{n}\widetilde{R}_{n,12}.

Under Condition (C1), there exists a positive constant C_{1}=C_{1}(\sigma_{0},C_{0})<\infty such that |R_{12}|=|\mathrm{E}(V_{1}V_{2}^{\prime}|\mathbf{U}-\mathbf{U}^{\prime}|_{q})|\leq\mathrm{E}^{1/2}(V_{1}^{2})\mathrm{E}^{1/2}(V_{2}^{\prime 2})\mathrm{E}^{1/2}(|\mathbf{U}-\mathbf{U}^{\prime}|_{q}^{2})<C_{1}. When \epsilon\geq C_{1}/(2n), we have |R_{12}|/n\leq 2\epsilon and

P(|R_{n,12}-R_{12}|\geq 4\epsilon) = P\left(\left|\frac{n-1}{n}(\widetilde{R}_{n,12}-R_{12})-\frac{1}{n}R_{12}\right|\geq 4\epsilon\right) \leq P(|\widetilde{R}_{n,12}-R_{12}|+|R_{12}/n|\geq 4\epsilon)\leq P(|\widetilde{R}_{n,12}-R_{12}|\geq 2\epsilon).

Next we decompose

\widetilde{R}_{n,12} = \{n(n-1)\}^{-1}\sum_{k\not=l}^{n}h_{1}(\mathbf{W}_{k},\mathbf{W}_{l}){\bf 1}(|h_{1}(\mathbf{W}_{k},\mathbf{W}_{l})|\leq M) + \{n(n-1)\}^{-1}\sum_{k\not=l}^{n}h_{1}(\mathbf{W}_{k},\mathbf{W}_{l}){\bf 1}(|h_{1}(\mathbf{W}_{k},\mathbf{W}_{l})|>M) = \widetilde{R}_{n,12,1}+\widetilde{R}_{n,12,2},

where the choice of M will be addressed at the end of the proof. We also decompose the population counterpart as R_{12}=\mathrm{E}[h_{1}{\bf 1}(|h_{1}|\leq M)]+\mathrm{E}[h_{1}{\bf 1}(|h_{1}|>M)]=R_{12,1}+R_{12,2}.

By Lemma C on page 200 of Serfling (1980), we derive that for m=n/2m=\lfloor n/2\rfloor, and t>0t>0,

E[exp(tR~n,12,1)]Em[exp(th1𝟏(|h1|M)/m)]\mathrm{E}[\exp(t\widetilde{R}_{n,12,1})]\leq\mathrm{E}^{m}[\exp(th_{1}{\bf 1}(|h_{1}|\leq M)/m)]

which entails that

P(\widetilde{R}_{n,12,1}-R_{12,1}\geq\epsilon) \leq \exp(-t(\epsilon+R_{12,1}))\mathrm{E}[\exp(t\widetilde{R}_{n,12,1})] \leq \exp(-t\epsilon)\mathrm{E}^{m}\{\exp(t(h_{1}{\bf 1}(|h_{1}|\leq M)-R_{12,1})/m)\} \leq \exp(-t\epsilon)\exp(t^{2}M^{2}/(2m)),

where we have applied Markov’s inequality and Lemma A(ii) [cf. page 200 of Serfling (1980)] in the first and third inequalities above, respectively. Applying the same argument with h_{1}{\bf 1}(|h_{1}|\leq M) replaced by -h_{1}{\bf 1}(|h_{1}|\leq M), we can obtain

P(R~n,12,1R12,1ϵ)exp(tϵ)exp(t2M2/(2m)).P(\widetilde{R}_{n,12,1}-R_{12,1}\leq-\epsilon)\leq\exp(-t\epsilon)\exp(t^{2}M^{2}/(2m)).

Choosing t=ϵm/M2t=\epsilon m/M^{2}, we obtain that

P(|R~n,12,1R12,1|ϵ)2exp(ϵ2m/(2M2))\displaystyle P(|\widetilde{R}_{n,12,1}-R_{12,1}|\geq\epsilon)\leq 2\exp(-\epsilon^{2}m/(2M^{2})) (S2)

Next we turn to R~n,12,2\widetilde{R}_{n,12,2}. First of all, by the Cauchy-Schwarz inequality, |R12,2|E1/2(h12)P1/2(|h1|>M)|R_{12,2}|\leq\mathrm{E}^{1/2}(h_{1}^{2})P^{1/2}(|h_{1}|>M). Applying the inequalities |ab|(a2+b2)/2|ab|\leq(a^{2}+b^{2})/2 and (a+b)22(a2+b2)(a+b)^{2}\leq 2(a^{2}+b^{2}) for any a,ba,b\in\mathbb{R}, we derive

h1(𝐖k,𝐖l)\displaystyle h_{1}(\mathbf{W}_{k},\mathbf{W}_{l}) \displaystyle\leq Vk1Vl2|𝐔k𝐔l|q+Vk2Vl1|𝐔k𝐔l|q2\displaystyle\frac{V_{k1}V_{l2}|\mathbf{U}_{k\cdot}-\mathbf{U}_{l\cdot}|_{q}+V_{k2}V_{l1}|\mathbf{U}_{k\cdot}-\mathbf{U}_{l\cdot}|_{q}}{2}
\displaystyle\leq 14{(Vk1Vl2+Vk2Vl1)2+|𝐔k𝐔l|q2}\displaystyle\frac{1}{4}\{(V_{k1}V_{l2}+V_{k2}V_{l1})^{2}+|\mathbf{U}_{k\cdot}-\mathbf{U}_{l\cdot}|_{q}^{2}\}
\displaystyle\leq 12{Vk12Vl22+Vk22Vl12+|𝐔k|q2+|𝐔l|q2}\displaystyle\frac{1}{2}\{V_{k1}^{2}V_{l2}^{2}+V_{k2}^{2}V_{l1}^{2}+|\mathbf{U}_{k\cdot}|_{q}^{2}+|\mathbf{U}_{l\cdot}|_{q}^{2}\}
\displaystyle\leq 12{Vk14/2+Vl24/2+Vk24/2+Vl14/2+|𝐔k|q2+|𝐔l|q2}\displaystyle\frac{1}{2}\{V_{k1}^{4}/2+V_{l2}^{4}/2+V_{k2}^{4}/2+V_{l1}^{4}/2+|\mathbf{U}_{k\cdot}|_{q}^{2}+|\mathbf{U}_{l\cdot}|_{q}^{2}\}

Then it is easy to show that under the uniform sub-Gaussian moment assumption in Condition (C1) and the upper bound on h1(𝐖k,𝐖l)h_{1}(\mathbf{W}_{k},\mathbf{W}_{l}) above, we have that E1/2(h12)C2\mathrm{E}^{1/2}(h_{1}^{2})\leq C_{2} for some C2=C2(σ0,C0)<C_{2}=C_{2}(\sigma_{0},C_{0})<\infty. Moreover, since q1/2𝐔max𝐔qq^{1/2}\|\mathbf{U}\|_{\max}\geq\|\mathbf{U}\|_{q}, we can derive that

P(|h1|>M)\displaystyle P(|h_{1}|>M) (S3)
\displaystyle\leq P[max{|V1|,|V1|,|V2|,|V2|,2q1/2𝐔max,2q1/2𝐔max}(M2)1/3]\displaystyle P[\max\{|V_{1}|,|V_{1}^{\prime}|,|V_{2}|,|V_{2}^{\prime}|,2q^{1/2}\|\mathbf{U}\|_{\max},2q^{1/2}\|\mathbf{U}^{\prime}\|_{\max}\}\geq(\frac{M}{2})^{1/3}]
\displaystyle\leq 2P{|V1|(M2)1/3}+2P{|V2|(M2)1/3}+2P{2q1/2𝐔max(M2)1/3}\displaystyle 2P\{|V_{1}|\geq(\frac{M}{2})^{1/3}\}+2P\{|V_{2}|\geq(\frac{M}{2})^{1/3}\}+2P\{2q^{1/2}\|\mathbf{U}\|_{\max}\geq(\frac{M}{2})^{1/3}\}
\displaystyle\leq 2P{|V1|(M2)1/3}+2P{|V2|(M2)1/3}+2j=1qP{2q1/2|Uj|(M2)1/3}\displaystyle 2P\{|V_{1}|\geq(\frac{M}{2})^{1/3}\}+2P\{|V_{2}|\geq(\frac{M}{2})^{1/3}\}+2\sum_{j=1}^{q}P\{2q^{1/2}|U_{j}|\geq(\frac{M}{2})^{1/3}\}

Because V1V_{1} is sub-Gaussian as assumed in Condition (C1), by Proposition 2.5.2 in Vershynin (2018), we have P{|V1|(M2)1/3}2exp{C(M2)2/3}P\{|V_{1}|\geq(\frac{M}{2})^{1/3}\}\leq 2\exp\{-C(\frac{M}{2})^{2/3}\} for some positive constant CC. We apply similar arguments to all the remaining terms in (S3) and derive that

P(|h1|>M)(8+4q)exp{2Cq1(M2)2/3}.P(|h_{1}|>M)\leq(8+4q)\exp\{-2Cq^{-1}(\frac{M}{2})^{2/3}\}.

Thus |R12,2|(8+4q)1/2C2exp{Cq1(M2)2/3}|R_{12,2}|\leq(8+4q)^{1/2}C_{2}\exp\{-Cq^{-1}(\frac{M}{2})^{2/3}\}. If we choose ϵ>0\epsilon>0 such that (8+4q)1/2C2exp{Cq1(M2)2/3}ϵ/2(8+4q)^{1/2}C_{2}\exp\{-Cq^{-1}(\frac{M}{2})^{2/3}\}\leq\epsilon/2, then |R12,2|ϵ/2|R_{12,2}|\leq\epsilon/2, which leads to P(|R~n,12,2R12,2|ϵ)P(|R~n,12,2|ϵ/2)P(|\widetilde{R}_{n,12,2}-R_{12,2}|\geq\epsilon)\leq P(|\widetilde{R}_{n,12,2}|\geq\epsilon/2). To bound P(|R~n,12,2|ϵ/2)P(|\widetilde{R}_{n,12,2}|\geq\epsilon/2), we write

|R~n,12,2|\displaystyle|\widetilde{R}_{n,12,2}| =\displaystyle= |{n(n1)}1klnh1(𝐖k,𝐖l)𝟏(|h1(𝐖k,𝐖l)|>M)|\displaystyle|\{n(n-1)\}^{-1}\sum_{k\not=l}^{n}h_{1}(\mathbf{W}_{k},\mathbf{W}_{l}){\bf 1}(|h_{1}(\mathbf{W}_{k},\mathbf{W}_{l})|>M)|
\displaystyle\leq {n(n1)}1kln|h1(𝐖k,𝐖l)|𝟏(𝐖kmax>q1/6(M2)1/3)\displaystyle\{n(n-1)\}^{-1}\sum_{k\not=l}^{n}|h_{1}(\mathbf{W}_{k},\mathbf{W}_{l})|{\bf 1}(\|\mathbf{W}_{k}\|_{\max}>q^{-1/6}(\frac{M}{2})^{1/3})
+{n(n1)}1kln|h1(𝐖k,𝐖l)|𝟏(𝐖lmax>q1/6(M2)1/3)\displaystyle+\{n(n-1)\}^{-1}\sum_{k\not=l}^{n}|h_{1}(\mathbf{W}_{k},\mathbf{W}_{l})|{\bf 1}(\|\mathbf{W}_{l}\|_{\max}>q^{-1/6}(\frac{M}{2})^{1/3})
\displaystyle\equiv L1+L2.\displaystyle L_{1}+L_{2}.

Without loss of generality, we only consider L1L_{1}. Define Fk=𝟏(𝐖kmax>q1/6(M2)1/3)F_{k}={\bf 1}(\|\mathbf{W}_{k}\|_{\max}>q^{-1/6}(\frac{M}{2})^{1/3}). Note that

L1\displaystyle L_{1} =\displaystyle= {n(n1)}1kl|Vk1Vl2|𝐔k𝐔lFk\displaystyle\{n(n-1)\}^{-1}\sum_{k\neq l}|V_{k1}V_{l2}|\|\mathbf{U}_{k}-\mathbf{U}_{l}\|F_{k}
\displaystyle\leq {n(n1)}1kl|Vk1Vl2|𝐔kFk+{n(n1)}1kl|Vk1Vl2|𝐔lFk\displaystyle\{n(n-1)\}^{-1}\sum_{k\neq l}|V_{k1}V_{l2}|\|\mathbf{U}_{k}\|F_{k}+\{n(n-1)\}^{-1}\sum_{k\neq l}|V_{k1}V_{l2}|\|\mathbf{U}_{l}\|F_{k}
\displaystyle\leq {n(n1)}1(k=1n|Vk1|𝐔kFk)(l=1n|Vl2|)+{n(n1)}1(k=1n|Vk1|Fk)l=1n|Vl2|𝐔l\displaystyle\{n(n-1)\}^{-1}(\sum_{k=1}^{n}|V_{k1}|\|\mathbf{U}_{k}\|F_{k})\cdot(\sum_{l=1}^{n}|V_{l2}|)+\{n(n-1)\}^{-1}(\sum_{k=1}^{n}|V_{k1}|F_{k})\cdot\sum_{l=1}^{n}|V_{l2}|\|\mathbf{U}_{l}\|
\displaystyle\equiv L11+L12\displaystyle L_{11}+L_{12}

For L11L_{11}, note that, for any λ>0\lambda>0, Eexp{λ|Vl2|2}=Eexp{λVl22}\mathrm{E}\exp\{\lambda|V_{l2}|^{2}\}=\mathrm{E}\exp\{\lambda V_{l2}^{2}\}. Since Vl2V_{l2} is sub-Gaussian by Condition (C1), we have that |Vl2||V_{l2}| is also sub-Gaussian by Proposition 2.5.2 in Vershynin (2018). Hence, it follows from Bernstein’s inequality [Theorem 2.8.1 in Vershynin (2018)] that for ϵ(0,1)\epsilon\in(0,1), n2n\geq 2,

P(|1n1l=1n{|Vl2|E|Vl2|}|ϵ)2exp(Cnϵ2)\displaystyle P(|\frac{1}{n-1}\sum_{l=1}^{n}\{|V_{l2}|-\mathrm{E}|V_{l2}|\}|\geq\epsilon)\leq 2\exp(-Cn\epsilon^{2}) (S4)

Regarding 1nk=1n|Vk1|𝐔kFk\frac{1}{n}\sum_{k=1}^{n}|V_{k1}|\|\mathbf{U}_{k}\|F_{k}, we note that |Vk1|𝐔kFk|Vk1|𝐔kFk|Vk1|j=1q|Ukj|Fk|V_{k1}|\|\mathbf{U}_{k}\|F_{k}\leq|V_{k1}|\|\mathbf{U}_{k}\|\cdot F_{k}\leq|V_{k1}|\cdot\sum_{j=1}^{q}|U_{kj}|\cdot F_{k}. Since |Vk1||V_{k1}| and |Ukj||U_{kj}| are sub-Gaussian, we have that |Vk1|𝐔kFk|V_{k1}|\|\mathbf{U}_{k}\|F_{k} is sub-exponential (Lemma 2.7.7 in Vershynin (2018)). Again by Bernstein’s inequality, we have that for ϵ(0,1)\epsilon\in(0,1),

P(|1nk=1n{|Vk1|𝐔kFkE(|Vk1|𝐔kFk)}|ϵ)2exp(Cnϵ2).\displaystyle P(|\frac{1}{n}\sum_{k=1}^{n}\{|V_{k1}|\|\mathbf{U}_{k}\|F_{k}-\mathrm{E}(|V_{k1}|\|\mathbf{U}_{k}\|F_{k})\}|\geq\epsilon)\leq 2\exp(-Cn\epsilon^{2}). (S5)

Moreover, it is easy to see that there exists a C3=C3(σ0,C0)C_{3}=C_{3}(\sigma_{0},C_{0}) such that E1/2(|Vk1|𝐔k)2C3\mathrm{E}^{1/2}(|V_{k1}|\|\mathbf{U}_{k}\|)^{2}\leq C_{3}, so E(|Vk1|𝐔kFk)E1/2(|Vk1|𝐔k)2E1/2FkC3[2(2+q)]1/2exp{Cq1/3(M2)2/3}\mathrm{E}(|V_{k1}|\|\mathbf{U}_{k}\|F_{k})\leq\mathrm{E}^{1/2}(|V_{k1}|\|\mathbf{U}_{k}\|)^{2}\mathrm{E}^{1/2}F_{k}\leq C_{3}[2(2+q)]^{1/2}\exp\{-Cq^{-1/3}(\frac{M}{2})^{2/3}\}, where we have used the fact that E(Fk)=P(𝐖kmax>q1/6(M2)1/3)2(2+q)exp{2Cq1/3(M2)2/3}\mathrm{E}(F_{k})=P(\|\mathbf{W}_{k}\|_{\max}>q^{-1/6}(\frac{M}{2})^{1/3})\leq 2(2+q)\exp\{-2Cq^{-1/3}(\frac{M}{2})^{2/3}\} by a union bound argument and the uniform sub-Gaussianity assumption in Condition (C1). Hence, P(|L11|ϵ/8)ξ1(ϵ)+ξ2(ϵ)P(|L_{11}|\geq\epsilon/8)\leq\xi_{1}(\epsilon)+\xi_{2}(\epsilon), where ξ1(ϵ)=P(1nk=1n|Vk1|𝐔kFk>(8E|V2|+8)1ϵ)\xi_{1}(\epsilon)=P(\frac{1}{n}\sum_{k=1}^{n}|V_{k1}|\|\mathbf{U}_{k}\|F_{k}>(8E|V_{2}|+8)^{-1}\epsilon) and ξ2(ϵ)=P((n1)1l=1n|Vl2|>E|V2|+1)\xi_{2}(\epsilon)=P((n-1)^{-1}\sum_{l=1}^{n}|V_{l2}|>E|V_{2}|+1). If we choose ϵ\epsilon such that ϵ>(E|V2|+1)C316[2(2+q)]1/2exp{Cq1/3(M2)2/3}\epsilon>(E|V_{2}|+1)C_{3}16[2(2+q)]^{1/2}\exp\{-Cq^{-1/3}(\frac{M}{2})^{2/3}\} and ϵ<16E|V2|+16\epsilon<16\mathrm{E}|V_{2}|+16, then it follows from (S5) that

ξ1(ϵ)P(|1nk=1n{|Vk1|𝐔kFkE(|Vk1|𝐔kFk)}|ϵ(16E|V2|+16)1)2exp(Cnϵ2).\xi_{1}(\epsilon)\leq P(|\frac{1}{n}\sum_{k=1}^{n}\{|V_{k1}|\|\mathbf{U}_{k}\|F_{k}-\mathrm{E}(|V_{k1}|\|\mathbf{U}_{k}\|F_{k})\}|\geq\epsilon(16E|V_{2}|+16)^{-1})\leq 2\exp(-Cn\epsilon^{2}).

In addition, we can use (S4) to derive that ξ2(ϵ)2exp(Cn)\xi_{2}(\epsilon)\leq 2\exp(-Cn). Combining these results, we have P(|L11|ϵ/8)2exp(Cnϵ2)P(|L_{11}|\geq\epsilon/8)\leq 2\exp(-Cn\epsilon^{2}) when ϵ((E|V2|+1)C316[2(2+q)]1/2exp{Cq1/3(M2)2/3},16E|V2|+16)\epsilon\in((E|V_{2}|+1)C_{3}16[2(2+q)]^{1/2}\exp\{-Cq^{-1/3}(\frac{M}{2})^{2/3}\},16E|V_{2}|+16). Thus if we choose M=(logn)3/2M=(\log{n})^{3/2}, we can find an n0n_{0}, a γ(0,1/2)\gamma\in(0,1/2), and a D0D_{0} such that when nn0n\geq n_{0} and 16>ϵ>D0nγ16>\epsilon>D_{0}n^{-\gamma}, we have P(|L11|>ϵ/8)2exp(Cnϵ2)P(|L_{11}|>\epsilon/8)\leq 2\exp(-Cn\epsilon^{2}). Similar arguments lead to P(|L12|>ϵ/8)2exp(Cnϵ2)P(|L_{12}|>\epsilon/8)\leq 2\exp(-Cn\epsilon^{2}). This implies that P(|L1|>ϵ/4)4exp(Cnϵ2)P(|L_{1}|>\epsilon/4)\leq 4\exp(-Cn\epsilon^{2}) and similarly P(|L2|>ϵ/4)4exp(Cnϵ2)P(|L_{2}|>\epsilon/4)\leq 4\exp(-Cn\epsilon^{2}). Therefore P(|R~n,12,2|ϵ/2)8exp(Cnϵ2)P(|\widetilde{R}_{n,12,2}|\geq\epsilon/2)\leq 8\exp(-Cn\epsilon^{2}). In view of (S2), the desired statement follows by choosing a large enough n0n_{0} such that 4log3(n0)>C4\log^{3}(n_{0})>C.

Proof of Theorem 1: Notice that for any ϵ>0\epsilon>0,

P(MDDMn(𝐕|𝐔)MDDM(𝐕|𝐔)max>12ϵ)\displaystyle P(\|\mathrm{MDDM}_{n}(\mathbf{V}|\mathbf{U})-\mathrm{MDDM}(\mathbf{V}|\mathbf{U})\|_{max}>12\epsilon)
P(𝐑n𝐑max>4ϵ)+P(𝐒n𝐒max>4ϵ)+P(𝐓n𝐓max>4ϵ)\displaystyle\leq P(\|\mathbf{R}_{n}-\mathbf{R}\|_{max}>4\epsilon)+P(\|\mathbf{S}_{n}-\mathbf{S}\|_{max}>4\epsilon)+P(\|\mathbf{T}_{n}-\mathbf{T}\|_{max}>4\epsilon)

The concentration bound for 𝐑n𝐑max\|\mathbf{R}_{n}-\mathbf{R}\|_{max} has been obtained in Proposition S3, and we shall address the concentration of 𝐓n𝐓max\|\mathbf{T}_{n}-\mathbf{T}\|_{max} in the proof below. The proof for the concentration of 𝐒n𝐒max\|\mathbf{S}_{n}-\mathbf{S}\|_{max} is similar and simpler, so it is omitted. Note that

P(𝐓n𝐓max>4ϵ)p2maxi,j=1,,pP(|Tn,ijTij|>4ϵ).P(\|\mathbf{T}_{n}-\mathbf{T}\|_{max}>4\epsilon)\leq p^{2}\max_{i,j=1,\cdots,p}P(|T_{n,ij}-T_{ij}|>4\epsilon).

Following the same argument as at the beginning of the proof of Proposition S3, we shall only focus on the case (i,j)=(1,2)(i,j)=(1,2), as the other cases can be treated in exactly the same manner.

Let T12=E[V1V2|𝐔𝐔′′|q]=E[E(V1)V2|𝐔𝐔|q]T_{12}=\mathrm{E}[V_{1}V_{2}^{\prime}|\mathbf{U}^{\prime}-\mathbf{U}^{{}^{\prime\prime}}|_{q}]=\mathrm{E}[\mathrm{E}(V_{1})V_{2}^{{}^{\prime}}|\mathbf{U}-\mathbf{U}^{\prime}|_{q}] and Tn,12=n3k,l,h=1nVk1Vh2|𝐔h𝐔l|qT_{n,12}=n^{-3}\sum_{k,l,h=1}^{n}V_{k1}V_{h2}|\mathbf{U}_{h\cdot}-\mathbf{U}_{l\cdot}|_{q}. Let

T~n,12\displaystyle\widetilde{T}_{n,12} =\displaystyle= 1n(n1)(n2)k<l<h[Vk1Vh2|𝐔h𝐔l|q+Vk1Vl2|𝐔l𝐔h|q\displaystyle\frac{1}{n(n-1)(n-2)}\sum_{k<l<h}[V_{k1}V_{h2}|\mathbf{U}_{h\cdot}-\mathbf{U}_{l\cdot}|_{q}+V_{k1}V_{l2}|\mathbf{U}_{l\cdot}-\mathbf{U}_{h\cdot}|_{q}
+Vl1Vk2|𝐔k𝐔h|q+Vl1Vh2|𝐔h𝐔k|q+Vh1Vl2|𝐔l𝐔k|q\displaystyle+V_{l1}V_{k2}|\mathbf{U}_{k\cdot}-\mathbf{U}_{h\cdot}|_{q}+V_{l1}V_{h2}|\mathbf{U}_{h\cdot}-\mathbf{U}_{k\cdot}|_{q}+V_{h1}V_{l2}|\mathbf{U}_{l\cdot}-\mathbf{U}_{k\cdot}|_{q}
+Vh1Vk2|𝐔k𝐔l|q]=6{n(n1)(n2)}1k<l<hh3(𝐖k,𝐖l,𝐖h),\displaystyle+V_{h1}V_{k2}|\mathbf{U}_{k\cdot}-\mathbf{U}_{l\cdot}|_{q}]=6\{n(n-1)(n-2)\}^{-1}\sum_{k<l<h}h_{3}(\mathbf{W}_{k},\mathbf{W}_{l},\mathbf{W}_{h}),

where h3h_{3} is a kernel function for a U-statistic of order three. Following the same argument used to deal with R~n,12\widetilde{R}_{n,12} in the proof of Proposition S3, we write T~n,12=T~n,12,1+T~n,12,2\widetilde{T}_{n,12}=\widetilde{T}_{n,12,1}+\widetilde{T}_{n,12,2}, where

T~n,12,1\displaystyle\widetilde{T}_{n,12,1} =\displaystyle= 6{n(n1)(n2)}1k<l<hh3𝟏(|h3|M),\displaystyle 6\{n(n-1)(n-2)\}^{-1}\sum_{k<l<h}h_{3}{\bf 1}(|h_{3}|\leq M),
T~n,12,2\displaystyle\widetilde{T}_{n,12,2} =\displaystyle= 6{n(n1)(n2)}1k<l<hh3𝟏(|h3|>M).\displaystyle 6\{n(n-1)(n-2)\}^{-1}\sum_{k<l<h}h_{3}{\bf 1}(|h_{3}|>M).

Correspondingly, we define T12=T12,1+T12,2T_{12}=T_{12,1}+T_{12,2}, where T12,1=E[h3𝟏(|h3|M)]T_{12,1}=\mathrm{E}[h_{3}{\bf 1}(|h_{3}|\leq M)] and T12,2=E[h3𝟏(|h3|>M)]T_{12,2}=\mathrm{E}[h_{3}{\bf 1}(|h_{3}|>M)]. By using the same argument for R~n,12,1\widetilde{R}_{n,12,1}, we can show that

P(|T~n,12,1T12,1|ϵ)2exp(ϵ2n/3/(2M2))P(|\widetilde{T}_{n,12,1}-T_{12,1}|\geq\epsilon)\leq 2\exp(-\epsilon^{2}\lfloor n/3\rfloor/(2M^{2}))

since T~n,12,1\widetilde{T}_{n,12,1} is a third-order UU-statistic. We also note that, by the same argument used in bounding h1h_{1}, we can get

|h3|112{Vk14+Vl14+Vh14+Vk24+Vl24+Vh24+8|𝐔h|q2+8|𝐔k|q2+8|𝐔l|q2}.|h_{3}|\leq\frac{1}{12}\{V_{k1}^{4}+V_{l1}^{4}+V_{h1}^{4}+V_{k2}^{4}+V_{l2}^{4}+V_{h2}^{4}+8|\mathbf{U}_{h\cdot}|_{q}^{2}+8|\mathbf{U}_{k\cdot}|_{q}^{2}+8|\mathbf{U}_{l\cdot}|_{q}^{2}\}.

It follows from the Cauchy-Schwarz inequality that |T12,2|E1/2(h32)P1/2(|h3|>M)|T_{12,2}|\leq E^{1/2}(h_{3}^{2})P^{1/2}(|h_{3}|>M). By exactly the same argument as for (S3), we can show that P(|h3|>M)Cexp(C3M2/3)P(|h_{3}|>M)\leq C\exp(-C_{3}M^{2/3}) for some C3=C3(σ0,C0,q)>0C_{3}=C_{3}(\sigma_{0},C_{0},q)>0. Hence |T12,2|Cexp(C3M2/3)|T_{12,2}|\leq C\exp(-C_{3}M^{2/3}). Therefore, for ϵ2Cexp(C3M2/3)\epsilon\geq 2C\exp(-C_{3}M^{2/3}), we have |T12,2|ϵ/2|T_{12,2}|\leq\epsilon/2 and thus P(|T~n,12,2T12,2|>ϵ)P(|T~n,12,2|ϵ/2)P(|\widetilde{T}_{n,12,2}-T_{12,2}|>\epsilon)\leq P(|\widetilde{T}_{n,12,2}|\geq\epsilon/2). By setting M=log3/2(n)M=\log^{3/2}(n) and adopting the same argument as used in bounding P(|R~n,12,2|ϵ/2)P(|\widetilde{R}_{n,12,2}|\geq\epsilon/2), we can derive that P(|T~n,12,2|ϵ/2)12exp(C4nϵ2)P(|\widetilde{T}_{n,12,2}|\geq\epsilon/2)\leq 12\exp(-C_{4}n\epsilon^{2}) when nn1n\geq n_{1} and ϵ(D1nγ1,16)\epsilon\in(D_{1}n^{-\gamma_{1}},16) for some C4=C4(σ0,C0,q)>0C_{4}=C_{4}(\sigma_{0},C_{0},q)>0, D1=D1(σ0,C0,q)D_{1}=D_{1}(\sigma_{0},C_{0},q), n1=n1(σ0,C0,q)n_{1}=n_{1}(\sigma_{0},C_{0},q) and γ1=γ1(σ0,C0,q)\gamma_{1}=\gamma_{1}(\sigma_{0},C_{0},q).

Combining the above results, we obtain that for 16>ϵ>D1nγ116>\epsilon>D_{1}n^{-\gamma_{1}} and nn1n\geq n_{1}, we have

P(|T~n,12T12|2ϵ)\displaystyle P(|\widetilde{T}_{n,12}-T_{12}|\geq 2\epsilon) \displaystyle\leq 12exp(C4nϵ2)+2exp(ϵ2n/3/(2M2))\displaystyle 12\exp(-C_{4}n\epsilon^{2})+2\exp(-\epsilon^{2}\lfloor n/3\rfloor/(2M^{2}))
\displaystyle\leq 14exp(ϵ2n/(6log3(n))),\displaystyle 14\exp(-\epsilon^{2}n/(6\log^{3}(n))),

where we choose n1n_{1} such that 6log3(n1)>C416\log^{3}(n_{1})>C_{4}^{-1}.

Further we note that

Tn,12T12=(n1)(n2)n2(T~n,12T12)3n2n2T12+n1n2(R~n,12R12)+n1n2R12.T_{n,12}-T_{12}=\frac{(n-1)(n-2)}{n^{2}}(\widetilde{T}_{n,12}-T_{12})-\frac{3n-2}{n^{2}}T_{12}+\frac{n-1}{n^{2}}(\widetilde{R}_{n,12}-R_{12})+\frac{n-1}{n^{2}}R_{12}.

There exists a finite positive constant C6=C6(σ0,C0,q)C_{6}=C_{6}(\sigma_{0},C_{0},q) such that |R12|C6|R_{12}|\leq C_{6} and |T12|C6|T_{12}|\leq C_{6} so if we choose ϵ3C6/n\epsilon\geq 3C_{6}/n, then |3n2n2T12|ϵ|\frac{3n-2}{n^{2}}T_{12}|\leq\epsilon and |n1n2R12|ϵ/3|\frac{n-1}{n^{2}}R_{12}|\leq\epsilon/3. Then for nn1n\geq n_{1} and ϵ>D1nγ1\epsilon>D_{1}n^{-\gamma_{1}},

P(|Tn,12T12|>4ϵ)\displaystyle P(|T_{n,12}-T_{12}|>4\epsilon) \displaystyle\leq P(|T~n,12T12|>2ϵ)+P(|R~n,12R12|>2ϵ/3)\displaystyle P(|\widetilde{T}_{n,12}-T_{12}|>2\epsilon)+P(|\widetilde{R}_{n,12}-R_{12}|>2\epsilon/3)
\displaystyle\leq 14exp(ϵ2n/(6log3(n)))+10exp{ϵ2n36log3(n)}\displaystyle 14\exp(-\epsilon^{2}n/(6\log^{3}(n)))+10\exp\left\{-\frac{\epsilon^{2}n}{36\log^{3}(n)}\right\}
\displaystyle\leq 24exp{ϵ2n36log3(n)}\displaystyle 24\exp\left\{-\frac{\epsilon^{2}n}{36\log^{3}(n)}\right\}

Thus the conclusion follows from the above inequality and Proposition S3.

S6 Proofs for Theorems 2 & 3

S6.1 Two Generic Algorithms and Their Properties

We first describe two generic algorithms and their properties that will be used in the proofs of Theorems 2 & 3. Consider two matrices 𝐀,𝐁p×p\mathbf{A},\mathbf{B}\in\mathbb{R}^{p\times p}, their estimates 𝐀^,𝐁^p×p\widehat{\mathbf{A}},\widehat{\mathbf{B}}\in\mathbb{R}^{p\times p} and vectors 𝐯,𝐯0,𝐯tp\mathbf{v},\mathbf{v}_{0},\mathbf{v}_{t}\in\mathbb{R}^{p}. We have Algorithm 3 for the penalized eigen-decomposition for 𝐀\mathbf{A}, and Algorithm 4 for the penalized generalized eigen-decomposition for (𝐀,𝐁)(\mathbf{A},\mathbf{B}). Algorithm 3 was originally proposed in Yuan and Zhang (2013), and in our Algorithm 1 we apply it KK times to perform penalized eigen-decomposition for MDDM. Algorithm 4 was originally proposed as the RIFLE Algorithm in Tan, Wang, Liu and Zhang (2018), and we apply it KK times to perform penalized generalized eigen-decomposition for MDDM.

  1. Input: s,𝐀^s,\widehat{\mathbf{A}}.

  2. Initialize 𝐯0\mathbf{v}_{0}.

  3. Iterate over tt until convergence:

    (a) Set 𝐯t=𝐀^𝐯t1\mathbf{v}_{t}=\widehat{\mathbf{A}}\mathbf{v}_{t-1}.

    (b) If 𝐯t0s\|\mathbf{v}_{t}\|_{0}\leq s, set 𝐯t=𝐯t𝐯t2\mathbf{v}_{t}=\dfrac{\mathbf{v}_{t}}{\|\mathbf{v}_{t}\|_{2}}; else set 𝐯t=HT(𝐯t,s)HT(𝐯t,s)2\mathbf{v}_{t}=\dfrac{\mathrm{HT}(\mathbf{v}_{t},s)}{\|\mathrm{HT}(\mathbf{v}_{t},s)\|_{2}}.

  4. Output 𝐯\mathbf{v}_{\infty} at convergence.

Algorithm 3 A generic penalized eigen-decomposition algorithm.
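
To make the iteration concrete, here is a short Python sketch of Algorithm 3 (our own illustration, not the authors' implementation; the toy matrix, initializer and convergence check are arbitrary choices). HT(𝐯,s)\mathrm{HT}(\mathbf{v},s) denotes hard thresholding: keep the ss entries of 𝐯\mathbf{v} that are largest in absolute value and set the rest to zero.

```python
import numpy as np

def hard_threshold(v, s):
    """HT(v, s): keep the s largest entries of v in absolute value, zero out the rest."""
    out = np.zeros_like(v)
    keep = np.argsort(np.abs(v))[-s:]
    out[keep] = v[keep]
    return out

def truncated_power(A_hat, s, v0, max_iter=1000, tol=1e-8):
    """Algorithm 3: truncated power iteration for the leading s-sparse eigenvector of A_hat."""
    v = v0 / np.linalg.norm(v0)
    for _ in range(max_iter):
        v_new = A_hat @ v
        if np.count_nonzero(v_new) > s:      # truncate only when the iterate is not s-sparse
            v_new = hard_threshold(v_new, s)
        v_new /= np.linalg.norm(v_new)
        if min(np.linalg.norm(v_new - v), np.linalg.norm(v_new + v)) < tol:  # up to sign
            return v_new
        v = v_new
    return v

# Toy usage: a sparse rank-one spike plus a small symmetric perturbation.
rng = np.random.default_rng(0)
p, d = 50, 5
beta = np.zeros(p); beta[:d] = 1 / np.sqrt(d)                   # true sparse leading eigenvector
noise = 0.05 * rng.standard_normal((p, p))
A_hat = 4 * np.outer(beta, beta) + (noise + noise.T) / 2
v0 = hard_threshold(A_hat[:, np.argmax(np.diag(A_hat))], 3 * d)  # a simple sparse initializer
v_est = truncated_power(A_hat, s=3 * d, v0=v0)
print(abs(v_est @ beta))                                         # close to 1 when estimation succeeds
```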

Yuan and Zhang (2013) proved a property of Algorithm 3 that is important for our proofs. Assume that 𝐀\mathbf{A} has a unique leading eigenvector 𝐯𝐈\mathbf{v}^{*}_{\mathbf{I}} with 𝐯𝐈0d\|\mathbf{v}^{*}_{\mathbf{I}}\|_{0}\leq d. Denote the eigenvalues of 𝐀\mathbf{A} by λ1𝐈,,λp𝐈\lambda_{1}^{\mathbf{I}},\ldots,\lambda^{\mathbf{I}}_{p}, and let Δ𝐈=λ1𝐈maxj>1λj𝐈>0\Delta_{\mathbf{I}}=\lambda_{1}^{\mathbf{I}}-\max_{j>1}\lambda_{j}^{\mathbf{I}}>0 denote the eigengap. Also for any positive integer kk^{\prime}, define

ρ(𝐄𝐀,k)=sup𝐮2=1,𝐮0k|𝐮T𝐄𝐀𝐮|,\rho(\mathbf{E}_{\mathbf{A}},k^{\prime})=\sup_{\|\mathbf{u}\|_{2}=1,\|\mathbf{u}\|_{0}\leq k^{\prime}}|\mathbf{u}^{\mathrm{\tiny T}}\mathbf{E}_{\mathbf{A}}\mathbf{u}|,

where 𝐄𝐀=𝐀^𝐀\mathbf{E}_{\mathbf{A}}=\widehat{\mathbf{A}}-\mathbf{A}, with 𝐀^\widehat{\mathbf{A}} being an estimate of 𝐀\mathbf{A}. We have the following proposition.

Proposition S4 (cf. Theorem 4 of Yuan and Zhang (2013)).

In Algorithm 3, let s=d+2ss=d+2s^{\prime} with sds^{\prime}\geq d. Assume that ρ(𝐄𝐀,s)Δ𝐈\rho(\mathbf{E}_{\mathbf{A}},s)\leq\Delta_{\mathbf{I}}. Define

γ(s)=λ1𝐈Δ𝐈+ρ(𝐄𝐀,s)λ1𝐈ρ(𝐄𝐀,s)<1,δ𝐈(s)=2ρ(𝐄𝐀,s)ρ(𝐄𝐀,s)2+(Δ𝐈2ρ(𝐄𝐀,s))2.\gamma(s)=\dfrac{\lambda_{1}^{\mathbf{I}}-\Delta_{\mathbf{I}}+\rho(\mathbf{E}_{\mathbf{A}},s)}{\lambda_{1}^{\mathbf{I}}-\rho(\mathbf{E}_{\mathbf{A}},s)}<1,\quad\delta_{\mathbf{I}}(s)=\dfrac{\sqrt{2}\rho(\mathbf{E}_{\mathbf{A}},s)}{\sqrt{\rho(\mathbf{E}_{\mathbf{A}},s)^{2}+(\Delta_{\mathbf{I}}-2\rho(\mathbf{E}_{\mathbf{A}},s))^{2}}}.

If |𝐯0T𝐯𝐈|θ+δ𝐈(s)|\mathbf{v}_{0}^{\mathrm{\tiny T}}\mathbf{v}^{*}_{\mathbf{I}}|\geq\theta+\delta_{\mathbf{I}}(s) for some 𝐯00s,𝐯0=1,\|\mathbf{v}_{0}\|_{0}\leq s^{\prime},\|\mathbf{v}_{0}\|=1, and θ(0,1)\theta\in(0,1) such that

μ=[1+2{(ds)1/2+ds}]{10.5θ(1+θ)(1γ(s)2)}<1,\mu=\sqrt{[1+2\{(\frac{d}{s^{\prime}})^{1/2}+\frac{d}{s^{\prime}}\}]\{1-0.5\theta(1+\theta)(1-\gamma(s)^{2})\}}<1,

then we either have

1|𝐯0T𝐯𝐈|<10δ𝐈(s)/(1μ),\sqrt{1-|\mathbf{v}_{0}^{\mathrm{\tiny T}}\mathbf{v}^{*}_{\mathbf{I}}|}<\sqrt{10}\delta_{\mathbf{I}}(s)/(1-\mu),

or for all t0t\geq 0,

1|𝐯tT𝐯𝐈|μt1|𝐯0T𝐯𝐈|+10δ𝐈(s)/(1μ).\sqrt{1-|\mathbf{v}_{t}^{\mathrm{\tiny T}}\mathbf{v}^{*}_{\mathbf{I}}|}\leq\mu^{t}\sqrt{1-|\mathbf{v}_{0}^{\mathrm{\tiny T}}\mathbf{v}^{*}_{\mathbf{I}}|}+\sqrt{10}\delta_{\mathbf{I}}(s)/(1-\mu).
  1. Input: s,𝐀^,𝐁^s,\widehat{\mathbf{A}},\widehat{\mathbf{B}} and step size η>0\eta>0.

  2. Initialize 𝐯0\mathbf{v}_{0}.

  3. Iterate over tt until convergence:

    (a) Set ρ(t1)=𝐯t1T𝐀^𝐯t1𝐯t1T𝐁^𝐯t1\rho^{(t-1)}=\dfrac{\mathbf{v}_{t-1}^{\mathrm{\tiny T}}\widehat{\mathbf{A}}\mathbf{v}_{t-1}}{\mathbf{v}_{t-1}^{\mathrm{\tiny T}}\widehat{\mathbf{B}}\mathbf{v}_{t-1}}.

    (b) 𝐂=𝐈+(η/ρ(t1))(𝐀^ρ(t1)𝐁^)\mathbf{C}=\mathbf{I}+(\eta/\rho^{(t-1)})\cdot(\widehat{\mathbf{A}}-\rho^{(t-1)}\widehat{\mathbf{B}}).

    (c) 𝐯t=𝐂𝐯t1/𝐂𝐯t12\mathbf{v}_{t}=\mathbf{C}\mathbf{v}_{t-1}/\|\mathbf{C}\mathbf{v}_{t-1}\|_{2}.

    (d) 𝐯t=HT(𝐯t,s)HT(𝐯t,s)2\mathbf{v}_{t}=\dfrac{\mathrm{HT}(\mathbf{v}_{t},s)}{\|\mathrm{HT}(\mathbf{v}_{t},s)\|_{2}}.

  4. Output 𝐯\mathbf{v}_{\infty} at convergence.

Algorithm 4 A generic penalized generalized eigen-decomposition algorithm.
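
A corresponding Python sketch of Algorithm 4 is given below (again our own illustration rather than the RIFLE implementation of Tan, Wang, Liu and Zhang (2018); the step size, toy inputs and convergence check are arbitrary). It reuses the same hard-thresholding operator HT(,s)\mathrm{HT}(\cdot,s) as in the sketch of Algorithm 3.

```python
import numpy as np

def hard_threshold(v, s):
    out = np.zeros_like(v)
    keep = np.argsort(np.abs(v))[-s:]
    out[keep] = v[keep]
    return out

def rifle(A_hat, B_hat, s, v0, eta, max_iter=1000, tol=1e-8):
    """Algorithm 4: gradient-style update for the leading s-sparse generalized
    eigenvector of the pair (A_hat, B_hat), followed by hard thresholding."""
    p = A_hat.shape[0]
    v = v0 / np.linalg.norm(v0)
    for _ in range(max_iter):
        rho = (v @ A_hat @ v) / (v @ B_hat @ v)              # generalized Rayleigh quotient
        C = np.eye(p) + (eta / rho) * (A_hat - rho * B_hat)
        v_new = C @ v
        v_new /= np.linalg.norm(v_new)
        v_new = hard_threshold(v_new, s)
        v_new /= np.linalg.norm(v_new)
        if min(np.linalg.norm(v_new - v), np.linalg.norm(v_new + v)) < tol:
            return v_new
        v = v_new
    return v

# Toy usage with a well-conditioned B and a sparse signal direction in A.
rng = np.random.default_rng(1)
p, d = 40, 4
beta = np.zeros(p); beta[:d] = 1 / np.sqrt(d)
B_hat = np.eye(p) + 0.1 * np.diag(rng.random(p))
A_hat = 3 * B_hat @ np.outer(beta, beta) @ B_hat + 0.02 * np.eye(p)
eta = 0.5 / np.linalg.eigvalsh(B_hat).max()                  # keep eta * lambda_max(B) < 1
v0 = hard_threshold(np.linalg.solve(B_hat, A_hat)[:, 0], 3 * d)
v_est = rifle(A_hat, B_hat, s=3 * d, v0=v0, eta=eta)
print(abs(v_est @ beta))
```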

Based on the results in Tan, Wang, Liu and Zhang (2018), we can also derive the following useful results for Algorithm 4. We assume that the matrix pair (𝐀,𝐁)(\mathbf{A},\mathbf{B}) has the leading generalized eigenvector 𝐯\mathbf{v}^{*} such that 𝐯0d\|\mathbf{v}^{*}\|_{0}\leq d. The generalized eigenvalues of (𝐀,𝐁)(\mathbf{A},\mathbf{B}) are referred to as λj,j=1,,p\lambda_{j},j=1,\ldots,p and their estimates are λ^j,j=1,,p\widehat{\lambda}_{j},j=1,\ldots,p. We introduce the following notation:

cr(𝐀,𝐁)\displaystyle\text{cr}(\mathbf{A},\mathbf{B}) =\displaystyle= min𝐯:𝐯2=1{(𝐯T𝐀𝐯)2+(𝐯T𝐁𝐯)2}1/2>0\displaystyle\min_{\mathbf{v}:\|\mathbf{v}\|_{2}=1}\{(\mathbf{v}^{\mathrm{\tiny T}}\mathbf{A}\mathbf{v})^{2}+(\mathbf{v}^{\mathrm{\tiny T}}\mathbf{B}\mathbf{v})^{2}\}^{1/2}>0 (S6)
cr(k)\displaystyle\text{cr}(k^{\prime}) =\displaystyle= infF:|F|kcr(𝐀F,𝐁F),\displaystyle\inf_{F:|F|\leq k^{\prime}}\text{cr}(\mathbf{A}_{F},\mathbf{B}_{F}), (S7)
δ(k)\displaystyle\delta(k^{\prime}) =\displaystyle= ρ(𝐄𝐀,k)2+ρ(𝐄𝐁,k)2,\displaystyle\sqrt{\rho(\mathbf{E}_{\mathbf{A}},k^{\prime})^{2}+\rho(\mathbf{E}_{\mathbf{B}},k^{\prime})^{2}}, (S8)

where 𝐄𝐁=𝐁𝐁^\mathbf{E}_{\mathbf{B}}=\mathbf{B}-\widehat{\mathbf{B}}, with 𝐁^\widehat{\mathbf{B}} being an estimate of 𝐁\mathbf{B}. Also denote by κ(𝐁)\kappa(\mathbf{B}) the condition number of 𝐁\mathbf{B} and define ω1(F)=supsupp(𝐮)F,𝐮2=1𝐮T𝐀𝐮𝐮T𝐁𝐮\omega_{1}(F)=\sup_{\mathrm{supp}(\mathbf{u})\subset F,\|\mathbf{u}\|_{2}=1}\frac{\mathbf{u}^{\mathrm{\tiny T}}\mathbf{A}\mathbf{u}}{\mathbf{u}^{\mathrm{\tiny T}}\mathbf{B}\mathbf{u}} for any index set FF. We consider the following assumption:

Assumption S1.

For sufficiently large nn, there are constants b,c>0b,c>0, such that δ(s)cr(s)b\dfrac{\delta(s)}{cr(s)}\leq b and ρ(𝐄𝐁,s)cλmin(𝐁)\rho(\mathbf{E}_{\mathbf{B}},s)\leq c\lambda_{\min}(\mathbf{B}) for any s=o(n)s=o(n).

We also denote cupper=1+c1cc_{\text{upper}}=\dfrac{1+c}{1-c} with cc defined in Assumption S1. We estimate 𝐯\mathbf{v}^{*} by the RIFLE algorithm with step size η\eta. We choose η\eta such that ηλmax(𝐁)<1/(1+c)\eta\lambda_{\max}(\mathbf{B})<1/(1+c). Further, in the RIFLE algorithm, let s=2s+ds=2s^{\prime}+d and choose s=Cds^{\prime}=Cd for sufficiently large CC. The initial value 𝐯0\mathbf{v}_{0} satisfies 𝐯02=1\|\mathbf{v}_{0}\|_{2}=1.

Proposition S5 (Based on Theorem 1 and Corollary 1 in Tan, Wang, Liu and Zhang (2018)).

Under Assumption S1, we have the following conclusions:

  1. 1.

    For any FF such that supp(𝐯0)F\mathrm{supp}(\mathbf{v}_{0})\subset F, there exists a constant aa such that

    (1a)ω1(F)ω^1(F)(1+a)ω1(F).(1-a)\omega_{1}(F)\leq\hat{\omega}_{1}(F)\leq(1+a)\omega_{1}(F). (S9)
  2. 2.

    Choose η\eta such that

    ν=1+2{(d/s)1/2+d/s}11+c8ηλmin(𝐁)1αcupperκ(𝐁)+α<1,\nu=\sqrt{1+2\{(d/s^{\prime})^{1/2}+d/s^{\prime}\}}\sqrt{1-\frac{1+c}{8}\eta\lambda_{\min}(\mathbf{B})\frac{1-\alpha}{c_{\text{upper}}\kappa(\mathbf{B})+\alpha}}<1, (S10)

    where α=(1+a)λ2(1a)λ1\alpha=\frac{(1+a)\lambda_{2}}{(1-a)\lambda_{1}}. Input an initial vector 𝐯0\mathbf{v}_{0} such that |(𝐯)T𝐯0|𝐯21θ(𝐀,𝐁)\dfrac{|(\mathbf{v}^{*})^{\mathrm{\tiny T}}\mathbf{v}_{0}|}{\|\mathbf{v}^{*}\|_{2}}\geq 1-\theta(\mathbf{A},\mathbf{B}), where

    θ(𝐀,𝐁)=min[18cupperκ(𝐁),1/α13cupperκ(𝐁),1α30(1+c)cupper2ηλmax(𝐁)κ2(𝐁){cupperκ(𝐁)+α}].\theta(\mathbf{A},\mathbf{B})=\min\left[\dfrac{1}{8c_{\text{upper}}\kappa(\mathbf{B})},\dfrac{1/\alpha-1}{3c_{\text{upper}}\kappa(\mathbf{B})},\dfrac{1-\alpha}{30(1+c)c_{\text{upper}}^{2}\eta\lambda_{\max}(\mathbf{B})\kappa^{2}(\mathbf{B})\{c_{\text{upper}}\kappa(\mathbf{B})+\alpha\}}\right]. (S11)

    Further denote

    ξ=minj>1λ1(1+a)λj1+λ121+(1a)2λj2.\xi=\min_{j>1}\dfrac{\lambda_{1}-(1+a)\lambda_{j}}{\sqrt{1+\lambda_{1}^{2}}\sqrt{1+(1-a)^{2}\lambda_{j}^{2}}}. (S12)

    Assume that ξ>δ(s)/cr(s)\xi>\delta(s)/cr(s). Then we have

    1|(𝐯)T𝐯t|𝐯2νtθ(𝐀,𝐁)+101ν2ξ{cr(s)δ(s)}δ(s).\sqrt{1-\dfrac{|(\mathbf{v}^{*})^{\mathrm{\tiny T}}\mathbf{v}_{t}|}{\|\mathbf{v}^{*}\|_{2}}}\leq\nu^{t}\sqrt{\theta(\mathbf{A},\mathbf{B})}+\dfrac{\sqrt{10}}{1-\nu}\dfrac{2}{\xi\{cr(s)-\delta(s)\}}\delta(s). (S13)

We rewrite Proposition S5 in the following more user-friendly form. We denote

ϕ\displaystyle\phi =\displaystyle= λ1λ2,a=min{1/2,Δλ1+λ2,λmin(𝐁)2}\displaystyle\lambda_{1}-\lambda_{2},\quad a^{*}=\min\{1/2,\frac{\Delta}{\lambda_{1}+\lambda_{2}},\frac{\lambda_{\min}(\mathbf{B})}{2}\} (S14)
ξ\displaystyle\xi^{*} =\displaystyle= λ1λ22(1+λ12),α=(1+a)λ2(1a)λ1.\displaystyle\dfrac{\lambda_{1}-\lambda_{2}}{2(1+\lambda_{1}^{2})},\quad\alpha^{*}=\frac{(1+a^{*})\lambda_{2}}{(1-a^{*})\lambda_{1}}. (S15)

We have the following lemma.

Lemma S3.

Assume that Assumption S1 holds. Choose η\eta such that

ν=1+2{(d/s)1/2+d/s}11+c8ηλmin(𝐁)1λ2λ1cupperκ(𝐁)+3λ2λ1<1.\nu^{*}=\sqrt{1+2\{(d/s^{\prime})^{1/2}+d/s^{\prime}\}}\sqrt{1-\frac{1+c}{8}\eta\lambda_{\min}(\mathbf{B})\frac{1-\frac{\lambda_{2}}{\lambda_{1}}}{c_{\text{upper}}\kappa(\mathbf{B})+3\frac{\lambda_{2}}{\lambda_{1}}}}<1.

Also assume that δ(s)<min{1/2,λmin(𝐁)/2}\delta(s)<\min\{1/2,\lambda_{\min}(\mathbf{B})/2\}, 2δ(s)λmin(𝐁)+2δ(s)λmin(𝐁)λ1(F)<a\dfrac{2\delta(s)}{\lambda_{\min}(\mathbf{B})}+\dfrac{2\delta(s)}{\lambda_{\min}(\mathbf{B})\lambda_{1}(F)}<a^{*} and ξ>2δ(s)/cr(s)\xi^{*}>2\delta(s)/cr(s). Input an initial vector 𝐯0\mathbf{v}_{0} such that |(𝐯)T𝐯0|𝐯21θ(𝐀,𝐁)\dfrac{|(\mathbf{v}^{*})^{\mathrm{\tiny T}}\mathbf{v}_{0}|}{\|\mathbf{v}^{*}\|_{2}}\geq 1-\theta^{*}(\mathbf{A},\mathbf{B}), where 0<θ(𝐀,𝐁)<10<\theta^{*}(\mathbf{A},\mathbf{B})<1,

θ(𝐀,𝐁)=min[18cupperκ(𝐁),1/α13cupperκ(𝐁),1α30(1+c)cupper2ηλmax(𝐁)κ2(𝐁){cupperκ(𝐁)+α}].\theta^{*}(\mathbf{A},\mathbf{B})=\min\left[\dfrac{1}{8c_{\text{upper}}\kappa(\mathbf{B})},\dfrac{1/\alpha^{*}-1}{3c_{\text{upper}}\kappa(\mathbf{B})},\dfrac{1-\alpha^{*}}{30(1+c)c_{\text{upper}}^{2}\eta\lambda_{\max}(\mathbf{B})\kappa^{2}(\mathbf{B})\{c_{\text{upper}}\kappa(\mathbf{B})+\alpha^{*}\}}\right]. (S16)

We have

|sinΘ(𝐯,𝐯)|201ν2ξ{cr(s)δ(s)}δ(s).|\sin\Theta(\mathbf{v}^{*},\mathbf{v}_{\infty})|\leq\dfrac{\sqrt{20}}{1-\nu^{*}}\dfrac{2}{\xi^{*}\{cr(s)-\delta(s)\}}\delta(s). (S17)
Proof of Lemma S3.

We prove Lemma S3 by showing that all the conditions in Proposition S5 are met. Note that

ω^1(F)=sup𝐮2=1,supp(𝐮)F𝐮T𝐀^𝐮𝐮T𝐁^𝐮=sup𝐮2=1,supp(𝐮)F𝐮T𝐀𝐮+𝐮T(𝐀^𝐀)𝐮𝐮T𝐁𝐮+𝐮T(𝐁^𝐁)𝐮.\hat{\omega}_{1}(F)=\sup_{\|\mathbf{u}\|_{2}=1,\mathrm{supp}(\mathbf{u})\subset F}\dfrac{\mathbf{u}^{\mathrm{\tiny T}}\widehat{\mathbf{A}}\mathbf{u}}{\mathbf{u}^{\mathrm{\tiny T}}\widehat{\mathbf{B}}\mathbf{u}}=\sup_{\|\mathbf{u}\|_{2}=1,\mathrm{supp}(\mathbf{u})\subset F}\dfrac{\mathbf{u}^{\mathrm{\tiny T}}\mathbf{A}\mathbf{u}+\mathbf{u}^{\mathrm{\tiny T}}(\widehat{\mathbf{A}}-\mathbf{A})\mathbf{u}}{\mathbf{u}^{\mathrm{\tiny T}}\mathbf{B}\mathbf{u}+\mathbf{u}^{\mathrm{\tiny T}}(\widehat{\mathbf{B}}-\mathbf{B})\mathbf{u}}. (S18)

It is obvious that

sup𝐮2=1,supp(𝐮)F𝐮T𝐀𝐮δ(s)𝐮T𝐁𝐮+δ(s)ω^1(F)sup𝐮2=1,supp(𝐮)F𝐮T𝐀𝐮+δ(s)𝐮T𝐁𝐮δ(s)\sup_{\|\mathbf{u}\|_{2}=1,\mathrm{supp}(\mathbf{u})\subset F}\dfrac{\mathbf{u}^{\mathrm{\tiny T}}\mathbf{A}\mathbf{u}-\delta(s)}{\mathbf{u}^{\mathrm{\tiny T}}\mathbf{B}\mathbf{u}+\delta(s)}\leq\hat{\omega}_{1}(F)\leq\sup_{\|\mathbf{u}\|_{2}=1,\mathrm{supp}(\mathbf{u})\subset F}\dfrac{\mathbf{u}^{\mathrm{\tiny T}}\mathbf{A}\mathbf{u}+\delta(s)}{\mathbf{u}^{\mathrm{\tiny T}}\mathbf{B}\mathbf{u}-\delta(s)} (S19)

Because a>2δ(s)λmin(𝐁)+2δ(s)λmin(𝐁)λ1(F)a^{*}>\dfrac{2\delta(s)}{\lambda_{\min}(\mathbf{B})}+\dfrac{2\delta(s)}{\lambda_{\min}(\mathbf{B})\lambda_{1}(F)}, by Lemma S4, we have that Assumption S1 implies

(1a)ω1(F)ω^1(F)(1+a)ω1(F).(1-a^{*})\omega_{1}(F)\leq\hat{\omega}_{1}(F)\leq(1+a^{*})\omega_{1}(F). (S20)

Also, by our definition, a1/2a^{*}\leq 1/2. It follows that λ2λ1(1+a)λ2(1a)λ1=α3λ2λ1\frac{\lambda_{2}}{\lambda_{1}}\leq\frac{(1+a)\lambda_{2}}{(1-a)\lambda_{1}}=\alpha\leq\frac{3\lambda_{2}}{\lambda_{1}}. Hence, νν<1\nu\leq\nu^{*}<1, where ν\nu is defined in (S10). In addition, because aϕλ1+λ2ϕ2λ2a^{*}\leq\frac{\phi}{\lambda_{1}+\lambda_{2}}\leq\frac{\phi}{2\lambda_{2}}, we have ξξ\xi\geq\xi^{*}, where ξ\xi is defined in (S12). Finally, because 18cupperκ(𝐁)<1\dfrac{1}{8c_{\text{upper}}\kappa(\mathbf{B})}<1, we have θ(𝐀,𝐁)<1\theta^{*}(\mathbf{A},\mathbf{B})<1. Because aϕλ1+λ2a^{*}\leq\dfrac{\phi}{\lambda_{1}+\lambda_{2}}, we have 1α>01-\alpha^{*}>0 and thus θ(𝐀,𝐁)>0\theta^{*}(\mathbf{A},\mathbf{B})>0.

By Proposition S5, we have

1|(𝐯)T𝐯t|𝐯2\displaystyle\sqrt{1-\dfrac{|(\mathbf{v}^{*})^{\mathrm{\tiny T}}\mathbf{v}_{t}|}{\|\mathbf{v}^{*}\|_{2}}} \displaystyle\leq νtθ(𝐀,𝐁)+101ν2ξ{cr(s)δ(s)}δ(s)\displaystyle\nu^{t}\sqrt{\theta(\mathbf{A},\mathbf{B})}+\dfrac{\sqrt{10}}{1-\nu}\dfrac{2}{\xi\{cr(s)-\delta(s)\}}\delta(s) (S21)
\displaystyle\leq (ν)tθ(𝐀,𝐁)+101ν2ξ{cr(s)δ(s)}δ(s).\displaystyle(\nu^{*})^{t}\sqrt{\theta^{*}(\mathbf{A},\mathbf{B})}+\dfrac{\sqrt{10}}{1-\nu^{*}}\dfrac{2}{\xi^{*}\{cr(s)-\delta(s)\}}\delta(s). (S22)

Let tt\rightarrow\infty and we have that 1|(𝐯)T𝐯|𝐯2101ν2ξ{cr(s)δ(s)}δ(s).\sqrt{1-\dfrac{|(\mathbf{v}^{*})^{\mathrm{\tiny T}}\mathbf{v}_{\infty}|}{\|\mathbf{v}^{*}\|_{2}}}\leq\dfrac{\sqrt{10}}{1-\nu^{*}}\dfrac{2}{\xi^{*}\{cr(s)-\delta(s)\}}\delta(s). Finally, we note that 1|(𝐯)T𝐯|𝐯2=1|cosΘ(𝐯,𝐯)|1-\dfrac{|(\mathbf{v}^{*})^{\mathrm{\tiny T}}\mathbf{v}_{\infty}|}{\|\mathbf{v}^{*}\|_{2}}=1-|\cos\Theta(\mathbf{v}^{*},\mathbf{v}_{\infty})|. Since sin2Θ(𝐯,𝐯)=(1+|cosΘ(𝐯,𝐯)|)(1|cosΘ(𝐯,𝐯)|)2(1|cosΘ(𝐯,𝐯)|)\sin^{2}\Theta(\mathbf{v}^{*},\mathbf{v}_{\infty})=(1+|\cos\Theta(\mathbf{v}^{*},\mathbf{v}_{\infty})|)(1-|\cos\Theta(\mathbf{v}^{*},\mathbf{v}_{\infty})|)\leq 2(1-|\cos\Theta(\mathbf{v}^{*},\mathbf{v}_{\infty})|), we have the desired conclusion. ∎

Lemma S4.

Consider two symmetric matrices 𝐀,𝐁\mathbf{A},\mathbf{B}, where λmin(𝐁)>0\lambda_{\min}(\mathbf{B})>0. For any F{1,,p}F\subset\{1,\ldots,p\}, denote

ω1(F)=λ1(F)=sup𝐮2=1,supp(𝐮)F𝐮T𝐀𝐮𝐮T𝐁𝐮.\omega_{1}(F)=\lambda_{1}(F)=\sup_{\|\mathbf{u}\|_{2}=1,\mathrm{supp}(\mathbf{u})\subset F}\dfrac{\mathbf{u}^{\mathrm{\tiny T}}\mathbf{A}\mathbf{u}}{\mathbf{u}^{\mathrm{\tiny T}}\mathbf{B}\mathbf{u}}. (S23)

For any 0<ϵ<min{12,λmin(𝐁)2}0<\epsilon<\min\{\frac{1}{2},\dfrac{\lambda_{\min}(\mathbf{B})}{2}\}, we have that

sup𝐮2=1,supp(𝐮)F𝐮T𝐀𝐮+ϵ𝐮T𝐁𝐮ϵ\displaystyle\sup_{\|\mathbf{u}\|_{2}=1,\mathrm{supp}(\mathbf{u})\subset F}\dfrac{\mathbf{u}^{\mathrm{\tiny T}}\mathbf{A}\mathbf{u}+\epsilon}{\mathbf{u}^{\mathrm{\tiny T}}\mathbf{B}\mathbf{u}-\epsilon} \displaystyle\leq (1+a)λ1(F),\displaystyle(1+a)\lambda_{1}(F), (S24)
sup𝐮2=1,supp(𝐮)F𝐮T𝐀𝐮ϵ𝐮T𝐁𝐮+ϵ\displaystyle\sup_{\|\mathbf{u}\|_{2}=1,\mathrm{supp}(\mathbf{u})\subset F}\dfrac{\mathbf{u}^{\mathrm{\tiny T}}\mathbf{A}\mathbf{u}-\epsilon}{\mathbf{u}^{\mathrm{\tiny T}}\mathbf{B}\mathbf{u}+\epsilon} \displaystyle\geq (1a)λ1(F),\displaystyle(1-a)\lambda_{1}(F), (S25)

where a=2ϵλmin(𝐁)+2ϵλmin(𝐁)ω1(F)a=\dfrac{2\epsilon}{\lambda_{\min}(\mathbf{B})}+\dfrac{2\epsilon}{\lambda_{\min}(\mathbf{B})\omega_{1}(F)}.

Proof of Lemma S4.

For (S24), note that, for any 𝐮\mathbf{u}, we have 𝐮T𝐁𝐮ϵ{1ϵλmin(𝐁)}𝐮T𝐁𝐮.\mathbf{u}^{\mathrm{\tiny T}}\mathbf{B}\mathbf{u}-\epsilon\geq\{1-\frac{\epsilon}{\lambda_{\min}(\mathbf{B})}\}\mathbf{u}^{\mathrm{\tiny T}}\mathbf{B}\mathbf{u}. So

sup𝐮2=1,supp(𝐮)F𝐮T𝐀𝐮+ϵ𝐮T𝐁𝐮ϵ\displaystyle\sup_{\|\mathbf{u}\|_{2}=1,\mathrm{supp}(\mathbf{u})\subset F}\dfrac{\mathbf{u}^{\mathrm{\tiny T}}\mathbf{A}\mathbf{u}+\epsilon}{\mathbf{u}^{\mathrm{\tiny T}}\mathbf{B}\mathbf{u}-\epsilon} (S26)
\displaystyle\leq 11ϵλmin(𝐁)sup𝐮2=1,supp(𝐮)F𝐮T𝐀𝐮𝐮T𝐁𝐮+11ϵλmin(𝐁)sup𝐮2=1,supp(𝐮)Fϵ𝐮T𝐁𝐮\displaystyle\dfrac{1}{1-\frac{\epsilon}{\lambda_{\min}(\mathbf{B})}}\cdot\sup_{\|\mathbf{u}\|_{2}=1,\mathrm{supp}(\mathbf{u})\subset F}\dfrac{\mathbf{u}^{\mathrm{\tiny T}}\mathbf{A}\mathbf{u}}{\mathbf{u}^{\mathrm{\tiny T}}\mathbf{B}\mathbf{u}}+\dfrac{1}{1-\frac{\epsilon}{\lambda_{\min}(\mathbf{B})}}\cdot\sup_{\|\mathbf{u}\|_{2}=1,\mathrm{supp}(\mathbf{u})\subset F}\dfrac{\epsilon}{\mathbf{u}^{\mathrm{\tiny T}}\mathbf{B}\mathbf{u}} (S27)
\displaystyle\leq 11ϵλmin(𝐁)ω1(F)+2ϵλmin(𝐁)ω1(F)ω1(F),\displaystyle\dfrac{1}{1-\frac{\epsilon}{\lambda_{\min}(\mathbf{B})}}\cdot\omega_{1}(F)+\dfrac{2\epsilon}{\lambda_{\min}(\mathbf{B})\omega_{1}(F)}\cdot\omega_{1}(F), (S28)

where in the last inequality we use the fact that ϵ<λmin(𝐁)2\epsilon<\dfrac{\lambda_{\min}(\mathbf{B})}{2}. Also note that, for any 0<x<1/20<x<1/2, we have 11x<1+2x\dfrac{1}{1-x}<1+2x. It follows that 11ϵλmin(𝐁)1+2ϵλmin(𝐁).\dfrac{1}{1-\frac{\epsilon}{\lambda_{\min}(\mathbf{B})}}\leq 1+\dfrac{2\epsilon}{\lambda_{\min}(\mathbf{B})}. Hence,

sup𝐮2=1,supp(𝐮)F𝐮T𝐀𝐮+ϵ𝐮T𝐁𝐮ϵ{1+2ϵλmin(𝐁)+2ϵλmin(𝐁)ω1(F)}ω1(F)=(1+a)ω1(F).\sup_{\|\mathbf{u}\|_{2}=1,\mathrm{supp}(\mathbf{u})\subset F}\dfrac{\mathbf{u}^{\mathrm{\tiny T}}\mathbf{A}\mathbf{u}+\epsilon}{\mathbf{u}^{\mathrm{\tiny T}}\mathbf{B}\mathbf{u}-\epsilon}\leq\left\{1+\dfrac{2\epsilon}{\lambda_{\min}(\mathbf{B})}+\dfrac{2\epsilon}{\lambda_{\min}(\mathbf{B})\omega_{1}(F)}\right\}\omega_{1}(F)=(1+a)\omega_{1}(F). (S29)

Similarly, for (S25), we note that, for any 𝐮\mathbf{u}, 𝐮T𝐁𝐮+ϵ{1+ϵλmin(𝐁)}𝐮T𝐁𝐮.\mathbf{u}^{\mathrm{\tiny T}}\mathbf{B}\mathbf{u}+\epsilon\leq\{1+\frac{\epsilon}{\lambda_{\min}(\mathbf{B})}\}\mathbf{u}^{\mathrm{\tiny T}}\mathbf{B}\mathbf{u}. So

sup𝐮2=1,supp(𝐮)F𝐮T𝐀𝐮ϵ𝐮T𝐁𝐮+ϵ\displaystyle\sup_{\|\mathbf{u}\|_{2}=1,\mathrm{supp}(\mathbf{u})\subset F}\dfrac{\mathbf{u}^{\mathrm{\tiny T}}\mathbf{A}\mathbf{u}-\epsilon}{\mathbf{u}^{\mathrm{\tiny T}}\mathbf{B}\mathbf{u}+\epsilon} \displaystyle\geq 11+ϵλmin(𝐁)ω1(F)11+ϵλmin(𝐁)ϵλmin(𝐁),\displaystyle\dfrac{1}{1+\frac{\epsilon}{\lambda_{\min}(\mathbf{B})}}\cdot\omega_{1}(F)-\dfrac{1}{1+\frac{\epsilon}{\lambda_{\min}(\mathbf{B})}}\cdot\dfrac{\epsilon}{\lambda_{\min}(\mathbf{B})}, (S30)
\displaystyle\geq 11+ϵλmin(𝐁)ω1(F)ϵλmin(𝐁)ω1(F)ω1(F).\displaystyle\dfrac{1}{1+\frac{\epsilon}{\lambda_{\min}(\mathbf{B})}}\cdot\omega_{1}(F)-\dfrac{\epsilon}{\lambda_{\min}(\mathbf{B})\omega_{1}(F)}\cdot\omega_{1}(F). (S31)

Because 11+x>1x\frac{1}{1+x}>1-x for any x>0x>0, we have 11+ϵλmin(𝐁)>1ϵλmin(𝐁)\dfrac{1}{1+\frac{\epsilon}{\lambda_{\min}(\mathbf{B})}}>1-\dfrac{\epsilon}{\lambda_{\min}(\mathbf{B})}. Hence,

sup𝐮2=1,supp(𝐮)F𝐮T𝐀𝐮ϵ𝐮T𝐁𝐮+ϵ(1ϵλmin(𝐁)ϵλmin(𝐁)ω1(F))ω1(F)(1a)ω1(F).\sup_{\|\mathbf{u}\|_{2}=1,\mathrm{supp}(\mathbf{u})\subset F}\dfrac{\mathbf{u}^{\mathrm{\tiny T}}\mathbf{A}\mathbf{u}-\epsilon}{\mathbf{u}^{\mathrm{\tiny T}}\mathbf{B}\mathbf{u}+\epsilon}\geq(1-\dfrac{\epsilon}{\lambda_{\min}(\mathbf{B})}-\dfrac{\epsilon}{\lambda_{\min}(\mathbf{B})\omega_{1}(F)})\omega_{1}(F)\geq(1-a)\omega_{1}(F). (S32)

The conclusion follows. ∎

S6.2 Additional technical lemmas

We first derive several lemmas concerning a parameter 𝜷k\boldsymbol{\beta}_{k} (either in the penalized eigen-decomposition or the penalized generalized eigen-decomposition) and its estimate 𝜷^k\widehat{\boldsymbol{\beta}}_{k}. We denote ηk=|sinΘ(𝜷k,𝜷^k)|\eta_{k}=|\sin\Theta(\boldsymbol{\beta}_{k},\hat{\boldsymbol{\beta}}_{k})|, and λ^k=𝜷^kT𝐌^𝜷^k\hat{\lambda}_{k}=\widehat{\boldsymbol{\beta}}_{k}^{\mathrm{\tiny T}}\widehat{\mathbf{M}}\widehat{\boldsymbol{\beta}}_{k}.

Lemma S5.

If 𝛃^k0s\|\widehat{\boldsymbol{\beta}}_{k}\|_{0}\leq s and 𝛃k0s\|\boldsymbol{\beta}_{k}\|_{0}\leq s for k=1,,Kk=1,\ldots,K, we have that

vec(𝜷k𝜷kT𝜷^k𝜷^kT)1\displaystyle\|\mathrm{vec}(\boldsymbol{\beta}_{k}\boldsymbol{\beta}_{k}^{\mathrm{\tiny T}}-\hat{\boldsymbol{\beta}}_{k}\hat{\boldsymbol{\beta}}_{k}^{\mathrm{\tiny T}})\|_{1} \displaystyle\leq 2sηk\displaystyle 2s\eta_{k} (S33)
Proof of Lemma S5.

For (S33), set 𝜻k=vec(𝜷k𝜷kT𝜷^k𝜷^kT){\bm{\zeta}}_{k}=\mathrm{vec}(\boldsymbol{\beta}_{k}\boldsymbol{\beta}_{k}^{\mathrm{\tiny T}}-\hat{\boldsymbol{\beta}}_{k}\hat{\boldsymbol{\beta}}_{k}^{\mathrm{\tiny T}}). We have that

𝜻k22\displaystyle\|{\bm{\zeta}}_{k}\|_{2}^{2} =\displaystyle= 𝜷k𝜷kT𝜷^k𝜷^kTF2=Tr((𝜷k𝜷kT𝜷^k𝜷^kT)(𝜷k𝜷kT𝜷^k𝜷^kT))\displaystyle\|\boldsymbol{\beta}_{k}\boldsymbol{\beta}_{k}^{\mathrm{\tiny T}}-\hat{\boldsymbol{\beta}}_{k}\hat{\boldsymbol{\beta}}_{k}^{\mathrm{\tiny T}}\|_{F}^{2}={\mathrm{Tr}}((\boldsymbol{\beta}_{k}\boldsymbol{\beta}_{k}^{\mathrm{\tiny T}}-\hat{\boldsymbol{\beta}}_{k}\hat{\boldsymbol{\beta}}_{k}^{\mathrm{\tiny T}})(\boldsymbol{\beta}_{k}\boldsymbol{\beta}_{k}^{\mathrm{\tiny T}}-\hat{\boldsymbol{\beta}}_{k}\hat{\boldsymbol{\beta}}_{k}^{\mathrm{\tiny T}})) (S34)
=\displaystyle= 2(1(𝜷kT𝜷^k)2)=2ηk2\displaystyle 2(1-(\boldsymbol{\beta}_{k}^{\mathrm{\tiny T}}\hat{\boldsymbol{\beta}}_{k})^{2})=2\eta^{2}_{k} (S35)

Hence, by the Cauchy-Schwarz inequality,

𝜻k1𝜻k0𝜻k22s22ηk2=2sηk\displaystyle\|{\bm{\zeta}}_{k}\|_{1}\leq\sqrt{\|{\bm{\zeta}}_{k}\|_{0}}\|{\bm{\zeta}}_{k}\|_{2}\leq\sqrt{2s^{2}}\cdot\sqrt{2\eta_{k}^{2}}=2s\eta_{k} (S36)
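
As a quick sanity check of Lemma S5 (illustration only; the dimension, sparsity level and random draws below are arbitrary choices of ours), one can verify the bound vec(𝜷k𝜷kT𝜷^k𝜷^kT)12sηk\|\mathrm{vec}(\boldsymbol{\beta}_{k}\boldsymbol{\beta}_{k}^{\mathrm{\tiny T}}-\hat{\boldsymbol{\beta}}_{k}\hat{\boldsymbol{\beta}}_{k}^{\mathrm{\tiny T}})\|_{1}\leq 2s\eta_{k} numerically for random ss-sparse unit vectors.

```python
# A quick numerical sanity check (illustration only) of the bound in Lemma S5:
# for s-sparse unit vectors beta and beta_hat,
# ||vec(beta beta^T - beta_hat beta_hat^T)||_1 <= 2 * s * |sin Theta(beta, beta_hat)|.
import numpy as np

rng = np.random.default_rng(2)
p, s = 30, 5
for _ in range(1000):
    beta, beta_hat = np.zeros(p), np.zeros(p)
    beta[rng.choice(p, s, replace=False)] = rng.standard_normal(s)
    beta_hat[rng.choice(p, s, replace=False)] = rng.standard_normal(s)
    beta /= np.linalg.norm(beta)
    beta_hat /= np.linalg.norm(beta_hat)
    eta = np.sqrt(max(0.0, 1.0 - (beta @ beta_hat) ** 2))          # |sin Theta|
    lhs = np.abs(np.outer(beta, beta) - np.outer(beta_hat, beta_hat)).sum()
    assert lhs <= 2 * s * eta + 1e-10
print("Lemma S5 bound held on all random draws")
```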

Lemma S6.

If 𝛃k0s\|\boldsymbol{\beta}_{k}\|_{0}\leq s, we have that

vec(𝜷k𝜷kT)1s.\|\mathrm{vec}(\boldsymbol{\beta}_{k}\boldsymbol{\beta}_{k}^{\mathrm{\tiny T}})\|_{1}\leq s. (S37)
Proof of Lemma S6.

By the Cauchy-Schwarz inequality, we have that

vec(𝜷k𝜷kT)1vec(𝜷k𝜷kT)0vec(𝜷k𝜷kT)2\|\mathrm{vec}(\boldsymbol{\beta}_{k}\boldsymbol{\beta}_{k}^{\mathrm{\tiny T}})\|_{1}\leq\sqrt{\|\mathrm{vec}(\boldsymbol{\beta}_{k}\boldsymbol{\beta}_{k}^{\mathrm{\tiny T}})\|_{0}}\|\mathrm{vec}(\boldsymbol{\beta}_{k}\boldsymbol{\beta}_{k}^{\mathrm{\tiny T}})\|_{2} (S38)

Note that vec(𝜷k𝜷kT)0s2\|\mathrm{vec}(\boldsymbol{\beta}_{k}\boldsymbol{\beta}_{k}^{\mathrm{\tiny T}})\|_{0}\leq s^{2} and

vec(𝜷k𝜷kT)22=𝜷k𝜷kTF2=Tr(𝜷k𝜷kT𝜷k𝜷kT)=1\|\mathrm{vec}(\boldsymbol{\beta}_{k}\boldsymbol{\beta}_{k}^{\mathrm{\tiny T}})\|_{2}^{2}=\|\boldsymbol{\beta}_{k}\boldsymbol{\beta}_{k}^{\mathrm{\tiny T}}\|_{F}^{2}={\mathrm{Tr}}(\boldsymbol{\beta}_{k}\boldsymbol{\beta}_{k}^{\mathrm{\tiny T}}\boldsymbol{\beta}_{k}\boldsymbol{\beta}_{k}^{\mathrm{\tiny T}})=1 (S39)

where we use the fact that 𝜷kT𝜷k=1\boldsymbol{\beta}_{k}^{\mathrm{\tiny T}}\boldsymbol{\beta}_{k}=1. And we have the desired conclusion. ∎

Throughout the rest of this section, we also repeatedly use the fact that, for a vector 𝐮\mathbf{u} with 𝐮2=1\|\mathbf{u}\|_{2}=1 and 𝐮0s\|\mathbf{u}\|_{0}\leq s, we must have 𝐮1s\|\mathbf{u}\|_{1}\leq\sqrt{s} by the Cauchy-Schwarz inequality.

S6.3 Proof for Theorem 2

In this subsection, we assume that 𝚺𝐗=𝐈\boldsymbol{\Sigma}_{\mathbf{X}}=\mathbf{I}, 𝜷^k\widehat{\boldsymbol{\beta}}_{k} are solutions produced by Algorithm 1 for the penalized eigen-decomposition problem, and λ^k=(𝜷^k)T𝐌^𝜷^k\widehat{\lambda}_{k}=(\widehat{\boldsymbol{\beta}}_{k})^{\mathrm{\tiny T}}\widehat{\mathbf{M}}\widehat{\boldsymbol{\beta}}_{k}. We assume all the conditions in Theorem 2. We have the following result.

Lemma S7.

If 𝛃^k0s,𝛃k0s\|\widehat{\boldsymbol{\beta}}_{k}\|_{0}\leq s,\|\boldsymbol{\beta}_{k}\|_{0}\leq s, we have that

|λ^kλk|s(2ηk+1)ϵ+(λ1+λk)ηk2.|\hat{\lambda}_{k}-\lambda_{k}|\leq s(2\eta_{k}+1)\epsilon+(\lambda_{1}+\lambda_{k})\eta_{k}^{2}. (S40)
Proof of Lemma S7.

Note that

|λ^kλk|=|𝐌^,𝜷^k𝜷^kT𝐌,𝜷k𝜷kT|\displaystyle|\hat{\lambda}_{k}-\lambda_{k}|=|\langle\widehat{\mathbf{M}},\hat{\boldsymbol{\beta}}_{k}\hat{\boldsymbol{\beta}}_{k}^{\mathrm{\tiny T}}\rangle-\langle\mathbf{M},\boldsymbol{\beta}_{k}\boldsymbol{\beta}_{k}^{\mathrm{\tiny T}}\rangle| (S41)
\displaystyle\leq |𝐌^𝐌,𝜷^k𝜷^kT𝜷k𝜷kT|+|𝐌^𝐌,𝜷k𝜷kT|+|𝐌,𝜷^k𝜷^kT𝜷k𝜷kT|\displaystyle|\langle\widehat{\mathbf{M}}-\mathbf{M},\hat{\boldsymbol{\beta}}_{k}\hat{\boldsymbol{\beta}}_{k}^{\mathrm{\tiny T}}-\boldsymbol{\beta}_{k}\boldsymbol{\beta}_{k}^{\mathrm{\tiny T}}\rangle|+|\langle\widehat{\mathbf{M}}-\mathbf{M},\boldsymbol{\beta}_{k}\boldsymbol{\beta}_{k}^{\mathrm{\tiny T}}\rangle|+|\langle\mathbf{M},\hat{\boldsymbol{\beta}}_{k}\hat{\boldsymbol{\beta}}_{k}^{\mathrm{\tiny T}}-\boldsymbol{\beta}_{k}\boldsymbol{\beta}_{k}^{\mathrm{\tiny T}}\rangle| (S42)
\displaystyle\equiv L1+L2+L3\displaystyle L_{1}+L_{2}+L_{3} (S43)

By Lemma S5,

L1𝐌^𝐌maxvec(𝜷^k𝜷^kT𝜷k𝜷kT)12sηk𝐌^𝐌max2sηkϵ.\displaystyle L_{1}\leq\|\widehat{\mathbf{M}}-\mathbf{M}\|_{\max}\|\mathrm{vec}(\hat{\boldsymbol{\beta}}_{k}\hat{\boldsymbol{\beta}}_{k}^{\mathrm{\tiny T}}-\boldsymbol{\beta}_{k}\boldsymbol{\beta}_{k}^{\mathrm{\tiny T}})\|_{1}\leq 2s\eta_{k}\|\widehat{\mathbf{M}}-\mathbf{M}\|_{\max}\leq 2s\eta_{k}\epsilon. (S44)

By Lemma S6,

L2𝐌^𝐌maxvec(𝜷k𝜷kT)1s𝐌^𝐌maxsϵ.\displaystyle L_{2}\leq\|\widehat{\mathbf{M}}-\mathbf{M}\|_{\max}\|\mathrm{vec}(\boldsymbol{\beta}_{k}\boldsymbol{\beta}_{k}^{\mathrm{\tiny T}})\|_{1}\leq s\|\widehat{\mathbf{M}}-\mathbf{M}\|_{\max}\leq s\epsilon. (S45)

For L3L_{3}, note that 𝐌=j=1pλj𝜷j𝜷jT\mathbf{M}=\sum_{j=1}^{p}\lambda_{j}\boldsymbol{\beta}_{j}\boldsymbol{\beta}_{j}^{\mathrm{\tiny T}}, which implies that

𝐌,𝜷^k𝜷^kT=j=1pλj(𝜷^kT𝜷j)2=j=1pλjcos2𝚯(𝜷^k,𝜷j).\langle\mathbf{M},\widehat{\boldsymbol{\beta}}_{k}\widehat{\boldsymbol{\beta}}_{k}^{\mathrm{\tiny T}}\rangle=\sum_{j=1}^{p}\lambda_{j}(\widehat{\boldsymbol{\beta}}_{k}^{\mathrm{\tiny T}}\boldsymbol{\beta}_{j})^{2}=\sum_{j=1}^{p}\lambda_{j}\cos^{2}\boldsymbol{\Theta}(\hat{\boldsymbol{\beta}}_{k},\boldsymbol{\beta}_{j}).

Also note that j=1pcos2𝚯(𝜷^k,𝜷j)=1\sum_{j=1}^{p}\cos^{2}\boldsymbol{\Theta}(\hat{\boldsymbol{\beta}}_{k},\boldsymbol{\beta}_{j})=1. Hence,

L3\displaystyle L_{3} =\displaystyle= |j=1pλjcos2𝚯(𝜷j,𝜷^k)λk|\displaystyle|\sum_{j=1}^{p}\lambda_{j}\cos^{2}\boldsymbol{\Theta}(\boldsymbol{\beta}_{j},\hat{\boldsymbol{\beta}}_{k})-\lambda_{k}| (S46)
\displaystyle\leq jkλjcos2𝚯(𝜷^k,𝜷j)+λk|cos2𝚯(𝜷^k,𝜷k)1|\displaystyle\sum_{j\neq k}\lambda_{j}\cos^{2}\boldsymbol{\Theta}(\widehat{\boldsymbol{\beta}}_{k},\boldsymbol{\beta}_{j})+\lambda_{k}|\cos^{2}\boldsymbol{\Theta}(\widehat{\boldsymbol{\beta}}_{k},\boldsymbol{\beta}_{k})-1| (S47)
\displaystyle\leq λ1jkcos2𝚯(𝜷^k,𝜷j)+λk(1cos2𝚯(𝜷^k,𝜷k))\displaystyle\lambda_{1}\sum_{j\neq k}\cos^{2}\boldsymbol{\Theta}(\widehat{\boldsymbol{\beta}}_{k},\boldsymbol{\beta}_{j})+\lambda_{k}(1-\cos^{2}\boldsymbol{\Theta}(\widehat{\boldsymbol{\beta}}_{k},\boldsymbol{\beta}_{k})) (S48)
\displaystyle\leq (λ1+λk)(1cos2𝚯(𝜷k,𝜷^k))\displaystyle(\lambda_{1}+\lambda_{k})(1-\cos^{2}\boldsymbol{\Theta}(\boldsymbol{\beta}_{k},\hat{\boldsymbol{\beta}}_{k})) (S49)
=\displaystyle= (λ1+λk)ηk2\displaystyle(\lambda_{1}+\lambda_{k})\eta_{k}^{2} (S50)

In Lemmas S8S10, we assume that the event 𝐌^𝐌maxϵ\|\widehat{\mathbf{M}}-\mathbf{M}\|_{\max}\leq\epsilon has happened.

Lemma S8.

For the first direction 𝛃^1\widehat{\boldsymbol{\beta}}_{1}, we have that |sinΘ(𝛃^1,𝛃1)|Csϵ|\sin\Theta(\widehat{\boldsymbol{\beta}}_{1},\boldsymbol{\beta}_{1})|\leq Cs\epsilon and |λ^1λ1|Csϵ|\hat{\lambda}_{1}-\lambda_{1}|\leq Cs\epsilon.

Proof of Lemma S8.

It is easy to see that

ρ(𝐌^𝐌,s)=sup𝐮2=1,𝐮0s|𝐮T(𝐌^𝐌)𝐮|sup𝐮2=1,𝐮0s𝐮12𝐌^𝐌maxsϵ.\displaystyle\rho(\widehat{\mathbf{M}}-\mathbf{M},s)=\sup_{\|\mathbf{u}\|_{2}=1,\|\mathbf{u}\|_{0}\leq s}|\mathbf{u}^{\mathrm{\tiny T}}(\widehat{\mathbf{M}}-\mathbf{M})\mathbf{u}|\leq\sup_{\|\mathbf{u}\|_{2}=1,\|\mathbf{u}\|_{0}\leq s}\|\mathbf{u}\|_{1}^{2}\cdot\|\widehat{\mathbf{M}}-\mathbf{M}\|_{\max}\leq s\epsilon. (S51)

Under our assumptions about ϵ\epsilon, by Proposition S4, we have |sinΘ(𝜷^1,𝜷1)|Csϵ\lvert\sin\Theta(\widehat{\boldsymbol{\beta}}_{1},\boldsymbol{\beta}_{1})\rvert\leq Cs\epsilon at convergence. Lemma S7 further implies that |λ^1λ1|Csϵ|\hat{\lambda}_{1}-\lambda_{1}|\leq Cs\epsilon. ∎

Lemma S9.

If |sinΘ(𝛃^j,𝛃j)|Csϵ|\sin\Theta(\widehat{\boldsymbol{\beta}}_{j},\boldsymbol{\beta}_{j})|\leq Cs\epsilon for sufficiently small ϵ\epsilon and all jkj\leq k, then ρ(𝐌^k+1𝐌k+1,s)Csϵ\rho(\widehat{\mathbf{M}}_{k+1}-\mathbf{M}_{k+1},s)\leq Cs\epsilon.

Proof of Lemma S9.

Define 𝐍l=λ^l𝜷^l𝜷^lTλl𝜷l𝜷lT\mathbf{N}_{l}=\hat{\lambda}_{l}\hat{\boldsymbol{\beta}}_{l}\hat{\boldsymbol{\beta}}_{l}^{\mathrm{\tiny T}}-\lambda_{l}\boldsymbol{\beta}_{l}\boldsymbol{\beta}_{l}^{\mathrm{\tiny T}}. Then

𝐌^k+1𝐌k+1=(𝐌^𝐌)lk𝐍l.\widehat{\mathbf{M}}_{k+1}-\mathbf{M}_{k+1}=(\widehat{\mathbf{M}}-\mathbf{M})-\sum_{l\leq k}\mathbf{N}_{l}. (S52)

It follows that

ρ(𝐌^k+1𝐌k+1,s)ρ(𝐌^𝐌,s)+lkρ(𝐍l,s).\rho(\widehat{\mathbf{M}}_{k+1}-\mathbf{M}_{k+1},s)\leq\rho(\widehat{\mathbf{M}}-\mathbf{M},s)+\sum_{l\leq k}\rho(\mathbf{N}_{l},s). (S53)

According to the proof in Lemma S8, ρ(𝐌^𝐌,s)sϵ\rho(\widehat{\mathbf{M}}-\mathbf{M},s)\leq s\epsilon.

For any vector 𝐮\mathbf{u}, we have

𝐮T𝐍l𝐮\displaystyle\mathbf{u}^{\mathrm{\tiny T}}\mathbf{N}_{l}\mathbf{u} =\displaystyle= λ^l(𝐮T𝜷^l)2λl(𝐮T𝜷l)2\displaystyle\widehat{\lambda}_{l}(\mathbf{u}^{\mathrm{\tiny T}}\widehat{\boldsymbol{\beta}}_{l})^{2}-{\lambda}_{l}(\mathbf{u}^{\mathrm{\tiny T}}{\boldsymbol{\beta}}_{l})^{2} (S54)
=\displaystyle= (λ^lλl)(𝐮T𝜷^l)2+λl{(𝐮T𝜷l)2(𝐮T𝜷^l)2}L1+L2\displaystyle(\widehat{\lambda}_{l}-\lambda_{l})(\mathbf{u}^{\mathrm{\tiny T}}\widehat{\boldsymbol{\beta}}_{l})^{2}+\lambda_{l}\{(\mathbf{u}^{\mathrm{\tiny T}}\boldsymbol{\beta}_{l})^{2}-(\mathbf{u}^{\mathrm{\tiny T}}\widehat{\boldsymbol{\beta}}_{l})^{2}\}\equiv L_{1}+L_{2} (S55)

By Lemma S7, we have that |λ^lλl|Csϵ|\hat{\lambda}_{l}-\lambda_{l}|\leq Cs\epsilon when |sinΘ(𝜷^j,𝜷j)|Csϵ|\sin\Theta(\widehat{\boldsymbol{\beta}}_{j},\boldsymbol{\beta}_{j})|\leq Cs\epsilon. Also, |𝐮T𝜷l|𝐮2𝜷l2=1|\mathbf{u}^{\mathrm{\tiny T}}\boldsymbol{\beta}_{l}|\leq\|\mathbf{u}\|_{2}\cdot\|\boldsymbol{\beta}_{l}\|_{2}=1. It follows that L1CsϵL_{1}\leq Cs\epsilon. For L2L_{2}, we assume that cos𝚯(𝜷^l,𝜷l)>0\cos\boldsymbol{\Theta}(\widehat{\boldsymbol{\beta}}_{l},\boldsymbol{\beta}_{l})>0 without loss of generality, because otherwise we can consider the proof for 𝜷^l-\widehat{\boldsymbol{\beta}}_{l}. Note that

L2\displaystyle L_{2} \displaystyle\leq λl|𝐮T(𝜷^l𝜷l)||𝐮T(𝜷^l+𝜷l)|\displaystyle\lambda_{l}|\mathbf{u}^{\mathrm{\tiny T}}(\widehat{\boldsymbol{\beta}}_{l}-\boldsymbol{\beta}_{l})|\cdot|\mathbf{u}^{\mathrm{\tiny T}}(\widehat{\boldsymbol{\beta}}_{l}+\boldsymbol{\beta}_{l})|
\displaystyle\leq λl𝐮22𝜷^l𝜷l2(𝜷^l2+𝜷l2)\displaystyle\lambda_{l}\|\mathbf{u}\|_{2}^{2}\cdot\|\widehat{\boldsymbol{\beta}}_{l}-\boldsymbol{\beta}_{l}\|_{2}(\|\widehat{\boldsymbol{\beta}}_{l}\|_{2}+\|\boldsymbol{\beta}_{l}\|_{2})
=\displaystyle= C𝜷^l𝜷l2\displaystyle C\|\widehat{\boldsymbol{\beta}}_{l}-\boldsymbol{\beta}_{l}\|_{2}
=\displaystyle= C22𝜷^lT𝜷l=C2(1cosΘ(𝜷^l,𝜷l))\displaystyle C\sqrt{2-2\widehat{\boldsymbol{\beta}}_{l}^{\mathrm{\tiny T}}\boldsymbol{\beta}_{l}}=C\sqrt{2(1-\cos\Theta(\widehat{\boldsymbol{\beta}}_{l},\boldsymbol{\beta}_{l}))}
=\displaystyle= C1cos2Θ(𝜷^l,𝜷l)/1+cosΘ(𝜷^l,𝜷l)\displaystyle C\sqrt{1-\cos^{2}\Theta(\widehat{\boldsymbol{\beta}}_{l},\boldsymbol{\beta}_{l})}/\sqrt{1+\cos\Theta(\widehat{\boldsymbol{\beta}}_{l},\boldsymbol{\beta}_{l})}
\displaystyle\leq C|sinΘ(𝜷^l,𝜷l)|\displaystyle C|\sin\Theta(\widehat{\boldsymbol{\beta}}_{l},\boldsymbol{\beta}_{l})|
\displaystyle\leq Csϵ.\displaystyle Cs\epsilon.

Hence, ρ(𝐍l,s)Csϵ\rho(\mathbf{N}_{l},s)\leq Cs\epsilon and the conclusion follows. ∎

Lemma S10.

Assume that |sinΘ(𝛃^l,𝛃l)|<Csϵ|\sin\Theta(\hat{\boldsymbol{\beta}}_{l},\boldsymbol{\beta}_{l})|<Cs\epsilon for any l<kl<k, where 0<ϵ<min{Δ4s,θ}0<\epsilon<\min\{\dfrac{\Delta}{4s},\theta\} for θ\theta defined in Theorem 2. We have that |sinΘ(𝛃^k,𝛃k)|Csϵ|\sin\Theta(\hat{\boldsymbol{\beta}}_{k},\boldsymbol{\beta}_{k})|\leq Cs\epsilon.

Proof of Lemma S10.

The conclusion follows from Lemma S9 and Proposition S4. ∎

Proof of Theorem 2.

Under the event 𝐌^𝐌maxϵ\|\widehat{\mathbf{M}}-\mathbf{M}\|_{\max}\leq\epsilon, we have that |sinΘ(𝜷^k,𝜷k)|Csϵ|\sin\Theta(\widehat{\boldsymbol{\beta}}_{k},\boldsymbol{\beta}_{k})|\leq Cs\epsilon by Lemmas S8S10. Then by Theorem 1 we have the desired conclusion. ∎

S6.4 Proof for Theorem 3

In this subsection we prove Theorem 3, where 𝚺𝐗{\bm{\Sigma}}_{\mathbf{X}} could be different from the identity matrix. We first present a simple lemma, which is a modified version of Lemma 6 in Mai and Zhang (2019).

Lemma S11.

For two vectors 𝐮,𝐯\mathbf{u},\mathbf{v} and a positive definite matrix 𝚺{\bm{\Sigma}}, define ξ𝐈=1cos𝚯(𝐯,𝐮),ξ𝚺=1cos𝚯(𝚺1/2𝐯,𝚺1/2𝐮)\xi_{\mathbf{I}}=1-\cos\boldsymbol{\Theta}(\mathbf{v},\mathbf{u}),\xi_{{\bm{\Sigma}}}=1-\cos\boldsymbol{\Theta}({\bm{\Sigma}}^{1/2}\mathbf{v},{\bm{\Sigma}}^{1/2}\mathbf{u}). We have that

λmin(𝚺)λmax(𝚺)ξ𝚺ξ𝐈λmax(𝚺)λmin(𝚺)ξ𝚺\dfrac{\lambda_{\min}({\bm{\Sigma}})}{\lambda_{\max}({\bm{\Sigma}})}\xi_{{\bm{\Sigma}}}\leq\xi_{\mathbf{I}}\leq\dfrac{\lambda_{\max}({\bm{\Sigma}})}{\lambda_{\min}({\bm{\Sigma}})}\xi_{{\bm{\Sigma}}}
Proof of Lemma S11.

We first show the latter half of the desired inequality. Without loss of generality, we assume that 𝐮T𝚺𝐮=1,𝐯T𝚺𝐯=1\mathbf{u}^{\mathrm{\tiny T}}{\bm{\Sigma}}\mathbf{u}=1,\mathbf{v}^{\mathrm{\tiny T}}{\bm{\Sigma}}\mathbf{v}=1, because we can always normalize 𝐮,𝐯\mathbf{u},\mathbf{v} to satisfy these conditions.

Note that

λmin(𝚺)(𝐮𝐯)T(𝐮𝐯)\displaystyle\lambda_{\min}({\bm{\Sigma}})(\mathbf{u}-\mathbf{v})^{\mathrm{\tiny T}}(\mathbf{u}-\mathbf{v}) \displaystyle\leq (𝐮𝐯)T𝚺(𝐮𝐯)\displaystyle(\mathbf{u}-\mathbf{v})^{\mathrm{\tiny T}}{\bm{\Sigma}}(\mathbf{u}-\mathbf{v})
=\displaystyle= 𝐮T𝚺𝐮2𝐮T𝚺𝐯+𝐯T𝚺𝐯=2ξ𝚺\displaystyle\mathbf{u}^{\mathrm{\tiny T}}{\bm{\Sigma}}\mathbf{u}-2\mathbf{u}^{\mathrm{\tiny T}}{\bm{\Sigma}}\mathbf{v}+\mathbf{v}^{\mathrm{\tiny T}}{\bm{\Sigma}}\mathbf{v}=2\xi_{{\bm{\Sigma}}}
λmax(𝚺)𝐮T𝐮\displaystyle\lambda_{\max}({\bm{\Sigma}})\mathbf{u}^{\mathrm{\tiny T}}\mathbf{u} \displaystyle\geq 𝐮T𝚺𝐮=1\displaystyle\mathbf{u}^{\mathrm{\tiny T}}{\bm{\Sigma}}\mathbf{u}=1
λmax(𝚺)𝐯T𝐯\displaystyle\lambda_{\max}({\bm{\Sigma}})\mathbf{v}^{\mathrm{\tiny T}}\mathbf{v} \displaystyle\geq 𝐯T𝚺𝐯=1\displaystyle\mathbf{v}^{\mathrm{\tiny T}}{\bm{\Sigma}}\mathbf{v}=1

Consequently,

(𝐮𝐯)T(𝐮𝐯)2ξ𝚺λmin(𝚺)\displaystyle(\mathbf{u}-\mathbf{v})^{\mathrm{\tiny T}}(\mathbf{u}-\mathbf{v})\leq\frac{2\xi_{{\bm{\Sigma}}}}{\lambda_{\min}({\bm{\Sigma}})}
𝐮T𝐮1/λmax(𝚺),𝐯T𝐯1/λmax(𝚺)\displaystyle\mathbf{u}^{\mathrm{\tiny T}}\mathbf{u}\geq 1/\lambda_{\max}({\bm{\Sigma}}),\mathbf{v}^{\mathrm{\tiny T}}\mathbf{v}\geq 1/\lambda_{\max}({\bm{\Sigma}})

Now

2λmax(𝚺)ξ𝚺λmin(𝚺)\displaystyle\dfrac{2\lambda_{\max}({\bm{\Sigma}})\xi_{{\bm{\Sigma}}}}{\lambda_{\min}({\bm{\Sigma}})} \displaystyle\geq 2ξ𝚺λmin(𝚺)𝐮T𝐮𝐯T𝐯\displaystyle\dfrac{2\xi_{{\bm{\Sigma}}}}{\lambda_{\min}({\bm{\Sigma}})\sqrt{\mathbf{u}^{\mathrm{\tiny T}}\mathbf{u}}\sqrt{\mathbf{v}^{\mathrm{\tiny T}}\mathbf{v}}}
\displaystyle\geq (𝐮𝐯)T(𝐮𝐯)𝐮T𝐮𝐯T𝐯=𝐮T𝐮𝐯T𝐯+𝐯T𝐯𝐮T𝐮2𝐮T𝐯𝐮T𝐮𝐯T𝐯\displaystyle\dfrac{(\mathbf{u}-\mathbf{v})^{\mathrm{\tiny T}}(\mathbf{u}-\mathbf{v})}{\sqrt{\mathbf{u}^{\mathrm{\tiny T}}\mathbf{u}}\sqrt{\mathbf{v}^{\mathrm{\tiny T}}\mathbf{v}}}=\dfrac{\sqrt{\mathbf{u}^{\mathrm{\tiny T}}\mathbf{u}}}{\sqrt{\mathbf{v}^{\mathrm{\tiny T}}\mathbf{v}}}+\dfrac{\sqrt{\mathbf{v}^{\mathrm{\tiny T}}\mathbf{v}}}{\sqrt{\mathbf{u}^{\mathrm{\tiny T}}\mathbf{u}}}-2\dfrac{\mathbf{u}^{\mathrm{\tiny T}}\mathbf{v}}{\sqrt{\mathbf{u}^{\mathrm{\tiny T}}\mathbf{u}}\sqrt{\mathbf{v}^{\mathrm{\tiny T}}\mathbf{v}}}
\displaystyle\geq 22𝐮T𝐯𝐮T𝐮𝐯T𝐯=2ξ𝐈,\displaystyle 2-2\dfrac{\mathbf{u}^{\mathrm{\tiny T}}\mathbf{v}}{\sqrt{\mathbf{u}^{\mathrm{\tiny T}}\mathbf{u}}\sqrt{\mathbf{v}^{\mathrm{\tiny T}}\mathbf{v}}}=2\xi_{\mathbf{I}},

and we have the second half of the inequality. For the first half of the inequality, define 𝐮=𝚺1/2𝐮,𝐯=𝚺1/2𝐯\mathbf{u}^{*}={\bm{\Sigma}}^{1/2}\mathbf{u},\mathbf{v}^{*}={\bm{\Sigma}}^{1/2}\mathbf{v}. Applying the second half of the inequality to the vectors 𝐮,𝐯\mathbf{u}^{*},\mathbf{v}^{*} and the matrix 𝚺1{\bm{\Sigma}}^{-1} yields the desired conclusion. ∎
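
The following brief numerical check (illustration only; the random positive definite 𝚺{\bm{\Sigma}}, the dimension and the tolerance are arbitrary choices of ours) verifies the two-sided bound of Lemma S11 on randomly drawn vectors.

```python
# Numerically checking Lemma S11: with xi_I = 1 - cos(u, v) and
# xi_Sigma = 1 - cos(Sigma^{1/2} u, Sigma^{1/2} v), we should have
# (lam_min/lam_max) * xi_Sigma <= xi_I <= (lam_max/lam_min) * xi_Sigma.
import numpy as np

rng = np.random.default_rng(1)
p = 5
G = rng.standard_normal((p, p))
Sigma = G @ G.T + p * np.eye(p)                       # a positive definite matrix
w, Q = np.linalg.eigh(Sigma)
Sigma_half = Q @ np.diag(np.sqrt(w)) @ Q.T            # symmetric square root
lam_min, lam_max = w.min(), w.max()

def one_minus_cos(a, b):
    return 1.0 - a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

for _ in range(1000):
    u, v = rng.standard_normal(p), rng.standard_normal(p)
    xi_I = one_minus_cos(u, v)
    xi_S = one_minus_cos(Sigma_half @ u, Sigma_half @ v)
    assert lam_min / lam_max * xi_S - 1e-9 <= xi_I <= lam_max / lam_min * xi_S + 1e-9
print("Lemma S11 bounds held on all random draws")
```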

We now show the following lemma parallel to Lemma S7. Recall that ηk=|sin𝚯(𝜷^k,𝜷k)|\eta_{k}=|\sin\boldsymbol{\Theta}(\widehat{\boldsymbol{\beta}}_{k},\boldsymbol{\beta}_{k})|.

Lemma S12.

If 𝛃^k0s,𝛃k0s\|\widehat{\boldsymbol{\beta}}_{k}\|_{0}\leq s,\|\boldsymbol{\beta}_{k}\|_{0}\leq s, we have that

|λ^kλk|s(2ηk+1)ϵ+2λmax(𝚺𝐗)λmin(𝚺𝐗)(λ1+λk)ηk2.|\hat{\lambda}_{k}-\lambda_{k}|\leq s(2\eta_{k}+1)\epsilon+2\dfrac{\lambda_{\max}({\bm{\Sigma}}_{\mathbf{X}})}{\lambda_{\min}({\bm{\Sigma}}_{\mathbf{X}})}(\lambda_{1}+\lambda_{k})\eta_{k}^{2}. (S56)
Proof of Lemma S12.

Note that

|λ^kλk|=|𝐌^,𝜷^k𝜷^kT𝐌,𝜷k𝜷kT|\displaystyle|\hat{\lambda}_{k}-\lambda_{k}|=|\langle\widehat{\mathbf{M}},\hat{\boldsymbol{\beta}}_{k}\hat{\boldsymbol{\beta}}_{k}^{\mathrm{\tiny T}}\rangle-\langle\mathbf{M},\boldsymbol{\beta}_{k}\boldsymbol{\beta}_{k}^{\mathrm{\tiny T}}\rangle| (S57)
\displaystyle\leq |𝐌^𝐌,𝜷^k𝜷^kT𝜷k𝜷kT|+|𝐌^𝐌,𝜷k𝜷kT|+|𝐌,𝜷^k𝜷^kT𝜷k𝜷kT|\displaystyle|\langle\widehat{\mathbf{M}}-\mathbf{M},\hat{\boldsymbol{\beta}}_{k}\hat{\boldsymbol{\beta}}_{k}^{\mathrm{\tiny T}}-\boldsymbol{\beta}_{k}\boldsymbol{\beta}_{k}^{\mathrm{\tiny T}}\rangle|+|\langle\widehat{\mathbf{M}}-\mathbf{M},\boldsymbol{\beta}_{k}\boldsymbol{\beta}_{k}^{\mathrm{\tiny T}}\rangle|+|\langle\mathbf{M},\hat{\boldsymbol{\beta}}_{k}\hat{\boldsymbol{\beta}}_{k}^{\mathrm{\tiny T}}-\boldsymbol{\beta}_{k}\boldsymbol{\beta}_{k}^{\mathrm{\tiny T}}\rangle| (S58)
\displaystyle\equiv L1+L2+L3\displaystyle L_{1}+L_{2}+L_{3} (S59)

By Lemma S5,

L1𝐌^𝐌maxvec(𝜷^k𝜷^kT𝜷k𝜷kT)12sηk𝐌^𝐌max2sηkϵ.\displaystyle L_{1}\leq\|\widehat{\mathbf{M}}-\mathbf{M}\|_{\max}\|\mathrm{vec}(\hat{\boldsymbol{\beta}}_{k}\hat{\boldsymbol{\beta}}_{k}^{\mathrm{\tiny T}}-\boldsymbol{\beta}_{k}\boldsymbol{\beta}_{k}^{\mathrm{\tiny T}})\|_{1}\leq 2s\eta_{k}\|\widehat{\mathbf{M}}-\mathbf{M}\|_{\max}\leq 2s\eta_{k}\epsilon. (S60)

By Lemma S6,

L2𝐌^𝐌maxvec(𝜷k𝜷kT)1s𝐌^𝐌maxsϵ.\displaystyle L_{2}\leq\|\widehat{\mathbf{M}}-\mathbf{M}\|_{\max}\|\mathrm{vec}(\boldsymbol{\beta}_{k}\boldsymbol{\beta}_{k}^{\mathrm{\tiny T}})\|_{1}\leq s\|\widehat{\mathbf{M}}-\mathbf{M}\|_{\max}\leq s\epsilon. (S61)

For L3L_{3}, note that 𝐌=𝚺𝐗(j=1pλj𝜷j𝜷jT)𝚺𝐗\mathbf{M}={\bm{\Sigma}}_{\mathbf{X}}\left(\sum_{j=1}^{p}\lambda_{j}\boldsymbol{\beta}_{j}\boldsymbol{\beta}_{j}^{\mathrm{\tiny T}}\right){\bm{\Sigma}}_{\mathbf{X}} by the proof of Lemma 1, which implies that

𝐌,𝜷^k𝜷^kT=j=1pλj(𝜷^kT𝚺𝐗𝜷j)2=j=1pλjcos2𝚯(𝚺𝐗1/2𝜷^k,𝚺𝐗1/2𝜷j)\langle\mathbf{M},\widehat{\boldsymbol{\beta}}_{k}\widehat{\boldsymbol{\beta}}_{k}^{\mathrm{\tiny T}}\rangle=\sum_{j=1}^{p}\lambda_{j}(\widehat{\boldsymbol{\beta}}_{k}^{\mathrm{\tiny T}}{\bm{\Sigma}}_{\mathbf{X}}\boldsymbol{\beta}_{j})^{2}=\sum_{j=1}^{p}\lambda_{j}\cos^{2}\boldsymbol{\Theta}({\bm{\Sigma}}_{\mathbf{X}}^{1/2}\hat{\boldsymbol{\beta}}_{k},{\bm{\Sigma}}_{\mathbf{X}}^{1/2}\boldsymbol{\beta}_{j})

Also note that j=1pcos2𝚯(𝚺𝐗1/2𝜷^k,𝚺𝐗1/2𝜷j)=1\sum_{j=1}^{p}\cos^{2}\boldsymbol{\Theta}({\bm{\Sigma}}_{\mathbf{X}}^{1/2}\hat{\boldsymbol{\beta}}_{k},{\bm{\Sigma}}_{\mathbf{X}}^{1/2}\boldsymbol{\beta}_{j})=1. Without loss of generality, assume that cos𝚯(𝜷k,𝜷^k)>0\cos\boldsymbol{\Theta}(\boldsymbol{\beta}_{k},\hat{\boldsymbol{\beta}}_{k})>0. We have that

L3\displaystyle L_{3} =\displaystyle= |j=1pλjcos2𝚯(𝚺𝐗1/2𝜷j,𝚺𝐗1/2𝜷^k)λk|\displaystyle|\sum_{j=1}^{p}\lambda_{j}\cos^{2}\boldsymbol{\Theta}({\bm{\Sigma}}_{\mathbf{X}}^{1/2}\boldsymbol{\beta}_{j},{\bm{\Sigma}}_{\mathbf{X}}^{1/2}\hat{\boldsymbol{\beta}}_{k})-\lambda_{k}|
\displaystyle\leq jkλjcos2𝚯(𝚺𝐗1/2𝜷^k,𝚺𝐗1/2𝜷j)+λk|cos2𝚯(𝚺𝐗1/2𝜷^k,𝚺𝐗1/2𝜷k)1|\displaystyle\sum_{j\neq k}\lambda_{j}\cos^{2}\boldsymbol{\Theta}({\bm{\Sigma}}_{\mathbf{X}}^{1/2}\widehat{\boldsymbol{\beta}}_{k},{\bm{\Sigma}}_{\mathbf{X}}^{1/2}\boldsymbol{\beta}_{j})+\lambda_{k}|\cos^{2}\boldsymbol{\Theta}({\bm{\Sigma}}_{\mathbf{X}}^{1/2}\widehat{\boldsymbol{\beta}}_{k},{\bm{\Sigma}}_{\mathbf{X}}^{1/2}\boldsymbol{\beta}_{k})-1|
\displaystyle\leq λ1jkcos2𝚯(𝚺𝐗1/2𝜷^k,𝚺𝐗1/2𝜷j)+λk(1cos2𝚯(𝚺𝐗1/2𝜷^k,𝚺𝐗1/2𝜷k))\displaystyle\lambda_{1}\sum_{j\neq k}\cos^{2}\boldsymbol{\Theta}({\bm{\Sigma}}_{\mathbf{X}}^{1/2}\widehat{\boldsymbol{\beta}}_{k},{\bm{\Sigma}}_{\mathbf{X}}^{1/2}\boldsymbol{\beta}_{j})+\lambda_{k}(1-\cos^{2}\boldsymbol{\Theta}({\bm{\Sigma}}_{\mathbf{X}}^{1/2}\widehat{\boldsymbol{\beta}}_{k},{\bm{\Sigma}}_{\mathbf{X}}^{1/2}\boldsymbol{\beta}_{k}))
\displaystyle\leq (λ1+λk)(1cos2𝚯(𝚺𝐗1/2𝜷k,𝚺𝐗1/2𝜷^k))\displaystyle(\lambda_{1}+\lambda_{k})(1-\cos^{2}\boldsymbol{\Theta}({\bm{\Sigma}}_{\mathbf{X}}^{1/2}\boldsymbol{\beta}_{k},{\bm{\Sigma}}_{\mathbf{X}}^{1/2}\hat{\boldsymbol{\beta}}_{k}))
=\displaystyle= (λ1+λk)(1cos𝚯(𝚺𝐗1/2𝜷k,𝚺𝐗1/2𝜷^k))(1+cos𝚯(𝚺𝐗1/2𝜷k,𝚺𝐗1/2𝜷^k))\displaystyle(\lambda_{1}+\lambda_{k})(1-\cos\boldsymbol{\Theta}({\bm{\Sigma}}_{\mathbf{X}}^{1/2}\boldsymbol{\beta}_{k},{\bm{\Sigma}}_{\mathbf{X}}^{1/2}\hat{\boldsymbol{\beta}}_{k}))(1+\cos\boldsymbol{\Theta}({\bm{\Sigma}}_{\mathbf{X}}^{1/2}\boldsymbol{\beta}_{k},{\bm{\Sigma}}_{\mathbf{X}}^{1/2}\hat{\boldsymbol{\beta}}_{k}))
\displaystyle\leq 2(λ1+λk)(1cos𝚯(𝚺𝐗1/2𝜷k,𝚺𝐗1/2𝜷^k))\displaystyle 2(\lambda_{1}+\lambda_{k})(1-\cos\boldsymbol{\Theta}({\bm{\Sigma}}_{\mathbf{X}}^{1/2}\boldsymbol{\beta}_{k},{\bm{\Sigma}}_{\mathbf{X}}^{1/2}\hat{\boldsymbol{\beta}}_{k}))
\displaystyle\leq 2λmax(𝚺𝐗)λmin(𝚺𝐗)(λ1+λk)(1cos𝚯(𝜷k,𝜷^k)),\displaystyle 2\dfrac{\lambda_{\max}({\bm{\Sigma}}_{\mathbf{X}})}{\lambda_{\min}({\bm{\Sigma}}_{\mathbf{X}})}(\lambda_{1}+\lambda_{k})(1-\cos\boldsymbol{\Theta}(\boldsymbol{\beta}_{k},\hat{\boldsymbol{\beta}}_{k})),

where the last inequality follows from Lemma S11. Further note that 1cos𝚯(𝜷k,𝜷^k)=sin2𝚯(𝜷k,𝜷^k)1+cos𝚯(𝜷k,𝜷^k)ηk21-\cos\boldsymbol{\Theta}(\boldsymbol{\beta}_{k},\hat{\boldsymbol{\beta}}_{k})=\dfrac{\sin^{2}\boldsymbol{\Theta}(\boldsymbol{\beta}_{k},\hat{\boldsymbol{\beta}}_{k})}{1+\cos\boldsymbol{\Theta}(\boldsymbol{\beta}_{k},\hat{\boldsymbol{\beta}}_{k})}\leq\eta_{k}^{2} and we have the desired conclusion. ∎

We consider the event that \|\widehat{\mathbf{M}}-\mathbf{M}\|_{\max}\leq\epsilon and \|\widehat{{\bm{\Sigma}}}_{\mathbf{X}}-{\bm{\Sigma}}_{\mathbf{X}}\|_{\max}\leq\epsilon, where \epsilon is small enough that (i) \sqrt{2}s\epsilon<\min\{1/2,\lambda_{\min}({\bm{\Sigma}}_{\mathbf{X}})\}; (ii) \frac{\sqrt{2}s\epsilon}{\lambda_{\min}({\bm{\Sigma}}_{\mathbf{X}})}+\frac{\sqrt{2}s\epsilon}{\lambda_{\min}({\bm{\Sigma}}_{\mathbf{X}})\lambda_{1}}<a^{*}; and (iii) \frac{\sqrt{2}s\epsilon}{\lambda_{\min}({\bm{\Sigma}}_{\mathbf{X}})}<\frac{\Delta}{2(1+\lambda_{1}^{2})}. Here \Delta is defined as in Condition (C2) and a^{*}=\min\{\frac{1}{2},\frac{\Delta}{\lambda_{1}+\lambda_{2}},\frac{\lambda_{\min}({\bm{\Sigma}}_{\mathbf{X}})}{2}\}. On this event, \rho(\widehat{{\bm{\Sigma}}}_{\mathbf{X}}-{\bm{\Sigma}}_{\mathbf{X}},s)\leq s\epsilon. Also note that \tilde{\boldsymbol{\beta}}_{k}=\widehat{\boldsymbol{\beta}}_{k}/\|\widehat{\boldsymbol{\beta}}_{k}\|_{2}. In the RIFLE algorithm, the initial value \boldsymbol{\beta}_{k}^{0} is chosen to be sufficiently close to \boldsymbol{\beta}_{k}, and the step size \eta satisfies \eta\leq\frac{\lambda_{\min}({\bm{\Sigma}}_{\mathbf{X}})}{2} and

\sqrt{1+2\{(d/s^{\prime})^{1/2}+d/s^{\prime}\}}\sqrt{1-\frac{\eta\lambda_{\min}({\bm{\Sigma}}_{\mathbf{X}})\Delta}{16\lambda_{K}\{\kappa({\bm{\Sigma}}_{\mathbf{X}})+1\}}}<1. (S62)

Without loss of generality, in what follows we assume that \cos\boldsymbol{\Theta}(\widehat{\boldsymbol{\beta}}_{j},\boldsymbol{\beta}_{j})>0 for each j, because otherwise we can always replace \widehat{\boldsymbol{\beta}}_{j} with -\widehat{\boldsymbol{\beta}}_{j}, which spans the same subspace.
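
To fix ideas, the following is a minimal numerical sketch of the sequential procedure analyzed in the remainder of this section: a simplified RIFLE-style truncated gradient update for the leading generalized eigenvector of (\widehat{\mathbf{M}}_{k},\widehat{{\bm{\Sigma}}}_{\mathbf{X}}), the rescaling \widehat{\boldsymbol{\beta}}_{k}=\tilde{\boldsymbol{\beta}}_{k}/(\tilde{\boldsymbol{\beta}}_{k}^{\mathrm{\tiny T}}\widehat{{\bm{\Sigma}}}_{\mathbf{X}}\tilde{\boldsymbol{\beta}}_{k})^{1/2} (consistent with the relation \tilde{\boldsymbol{\beta}}_{k}=\widehat{\boldsymbol{\beta}}_{k}/\|\widehat{\boldsymbol{\beta}}_{k}\|_{2} noted above), and a deflation of the form \widehat{\mathbf{M}}_{k+1}=\widehat{\mathbf{M}}_{k}-\hat{\lambda}_{k}\widehat{{\bm{\Sigma}}}_{\mathbf{X}}\widehat{\boldsymbol{\beta}}_{k}\widehat{\boldsymbol{\beta}}_{k}^{\mathrm{\tiny T}}\widehat{{\bm{\Sigma}}}_{\mathbf{X}}, which is the form that the matrices \mathbf{N}_{l} in the proof of Lemma S14 presuppose. The function names, the simplified update, and the use of NumPy are our own illustrative assumptions and do not reproduce the exact algorithm of the main text.

import numpy as np

def truncate(v, s):
    """Keep the s largest-magnitude entries of v and zero out the rest."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-s:]
    out[idx] = v[idx]
    return out

def rifle_step(beta, M, Sigma, eta, s):
    """One truncated ascent step in the direction M beta - rho * Sigma beta
    (proportional to the gradient of the generalized Rayleigh quotient),
    followed by renormalization to unit Euclidean norm."""
    rho = (beta @ M @ beta) / (beta @ Sigma @ beta)
    beta = truncate(beta + eta * (M @ beta - rho * (Sigma @ beta)), s)
    return beta / np.linalg.norm(beta)

def sequential_estimate(M_hat, Sigma_hat, inits, s, eta=0.05, n_iter=500):
    """Sequentially estimate (lambda_k, beta_k) for k = 1, ..., K = len(inits).
    Each inits[k] plays the role of beta_k^0 and should be close to beta_k."""
    M_k = M_hat.copy()
    betas, lams = [], []
    for beta0 in inits:
        beta_tilde = beta0 / np.linalg.norm(beta0)
        for _ in range(n_iter):
            beta_tilde = rifle_step(beta_tilde, M_k, Sigma_hat, eta, s)
        # rescale so that beta_hat' Sigma_hat beta_hat = 1
        beta_hat = beta_tilde / np.sqrt(beta_tilde @ Sigma_hat @ beta_tilde)
        lam_hat = beta_hat @ M_k @ beta_hat
        # deflation: subtract lam_hat * Sigma_hat beta_hat beta_hat' Sigma_hat (cf. N_l below)
        M_k = M_k - lam_hat * np.outer(Sigma_hat @ beta_hat, Sigma_hat @ beta_hat)
        betas.append(beta_hat)
        lams.append(lam_hat)
    return np.column_stack(betas), np.array(lams)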

Lemma S13.

For the first direction 𝛃^1\widehat{\boldsymbol{\beta}}_{1}, we have that |sinΘ(𝛃^1,𝛃1)|Csϵ|\sin\Theta(\widehat{\boldsymbol{\beta}}_{1},\boldsymbol{\beta}_{1})|\leq Cs\epsilon and |λ^1λ1|Csϵ|\hat{\lambda}_{1}-\lambda_{1}|\leq Cs\epsilon.

Proof of Lemma S13.

It is easy to see that

\rho(\widehat{\mathbf{M}}-\mathbf{M},s)=\sup_{\|\mathbf{u}\|_{2}=1,\|\mathbf{u}\|_{0}\leq s}|\mathbf{u}^{\mathrm{\tiny T}}(\widehat{\mathbf{M}}-\mathbf{M})\mathbf{u}|\leq\sup_{\|\mathbf{u}\|_{2}=1,\|\mathbf{u}\|_{0}\leq s}\|\mathbf{u}\|_{1}^{2}\cdot\|\widehat{\mathbf{M}}-\mathbf{M}\|_{\max}\leq s\epsilon, (S63)

where the last inequality uses \|\mathbf{u}\|_{1}\leq\sqrt{s}\|\mathbf{u}\|_{2}=\sqrt{s} for any s-sparse unit vector \mathbf{u}.

Similarly, ρ(𝚺^𝐗𝚺𝐗,s)sϵ\rho(\widehat{{\bm{\Sigma}}}_{\mathbf{X}}-{\bm{\Sigma}}_{\mathbf{X}},s)\leq s\epsilon. Denote δ1(s)=ρ2(𝐌^𝐌,s)+ρ2(𝚺^𝐗𝚺𝐗,s)\delta_{1}(s)=\sqrt{\rho^{2}(\widehat{\mathbf{M}}-\mathbf{M},s)+\rho^{2}(\widehat{{\bm{\Sigma}}}_{\mathbf{X}}-{\bm{\Sigma}}_{\mathbf{X}},s)}. It follows that δ1(s)2sϵ\delta_{1}(s)\leq\sqrt{2}s\epsilon. Under our assumptions about ϵ\epsilon, we have that |sinΘ(𝜷^1,𝜷1)|Csϵ|\sin\Theta(\widehat{\boldsymbol{\beta}}_{1},\boldsymbol{\beta}_{1})|\leq Cs\epsilon by Lemma S3. Lemma S12 further implies that |λ^1λ1|Csϵ|\hat{\lambda}_{1}-\lambda_{1}|\leq Cs\epsilon. ∎
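
The elementary inequality behind (S63), namely |\mathbf{u}^{\mathrm{\tiny T}}\mathbf{A}\mathbf{u}|\leq\|\mathbf{u}\|_{1}^{2}\|\mathbf{A}\|_{\max}\leq s\|\mathbf{A}\|_{\max} for any symmetric \mathbf{A} and s-sparse unit vector \mathbf{u}, is easy to confirm numerically. The short check below (NumPy, purely illustrative and not part of the paper's code) does so for a random symmetric error matrix.

import numpy as np

rng = np.random.default_rng(1)
p, s = 50, 5
A = rng.standard_normal((p, p)); A = (A + A.T) / 2.0   # a generic symmetric "error" matrix
for _ in range(2000):
    u = np.zeros(p)
    support = rng.choice(p, size=s, replace=False)
    u[support] = rng.standard_normal(s)
    u /= np.linalg.norm(u)                               # s-sparse unit vector
    quad = abs(u @ A @ u)
    assert quad <= np.sum(np.abs(u)) ** 2 * np.max(np.abs(A)) + 1e-12
    assert quad <= s * np.max(np.abs(A)) + 1e-12          # since ||u||_1^2 <= s ||u||_2^2 = s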

Lemma S14.

If |\sin\Theta(\widehat{\boldsymbol{\beta}}_{j},\boldsymbol{\beta}_{j})|\leq Cs\epsilon for all j\leq k and \epsilon is sufficiently small, then \delta(s)\leq Cs\epsilon.

Proof of Lemma S14.

It suffices to show that ρ(𝐌^k+1𝐌k+1,s)Csϵ\rho(\widehat{\mathbf{M}}_{k+1}-\mathbf{M}_{k+1},s)\leq Cs\epsilon. Define 𝐍l=𝚺^𝐗λ^l𝜷^l𝜷^lT𝚺^𝐗𝚺𝐗λl𝜷l𝜷lT𝚺𝐗\mathbf{N}_{l}=\widehat{{\bm{\Sigma}}}_{\mathbf{X}}\hat{\lambda}_{l}\hat{\boldsymbol{\beta}}_{l}\hat{\boldsymbol{\beta}}_{l}^{\mathrm{\tiny T}}\widehat{{\bm{\Sigma}}}_{\mathbf{X}}-{\bm{\Sigma}}_{\mathbf{X}}\lambda_{l}\boldsymbol{\beta}_{l}\boldsymbol{\beta}_{l}^{\mathrm{\tiny T}}{\bm{\Sigma}}_{\mathbf{X}}. Then

𝐌^k+1𝐌k+1=(𝐌^𝐌)lk𝐍l.\widehat{\mathbf{M}}_{k+1}-\mathbf{M}_{k+1}=(\widehat{\mathbf{M}}-\mathbf{M})-\sum_{l\leq k}\mathbf{N}_{l}. (S64)
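
Identity (S64) is immediate once the deflated matrices are written out. Writing \widehat{\mathbf{M}}_{k+1}=\widehat{\mathbf{M}}-\sum_{l\leq k}\hat{\lambda}_{l}\widehat{{\bm{\Sigma}}}_{\mathbf{X}}\hat{\boldsymbol{\beta}}_{l}\hat{\boldsymbol{\beta}}_{l}^{\mathrm{\tiny T}}\widehat{{\bm{\Sigma}}}_{\mathbf{X}} and \mathbf{M}_{k+1}=\mathbf{M}-\sum_{l\leq k}\lambda_{l}{\bm{\Sigma}}_{\mathbf{X}}\boldsymbol{\beta}_{l}\boldsymbol{\beta}_{l}^{\mathrm{\tiny T}}{\bm{\Sigma}}_{\mathbf{X}} (the forms that the difference matrices \mathbf{N}_{l} presuppose) and subtracting termwise,

\widehat{\mathbf{M}}_{k+1}-\mathbf{M}_{k+1}=(\widehat{\mathbf{M}}-\mathbf{M})-\sum_{l\leq k}\left(\widehat{{\bm{\Sigma}}}_{\mathbf{X}}\hat{\lambda}_{l}\hat{\boldsymbol{\beta}}_{l}\hat{\boldsymbol{\beta}}_{l}^{\mathrm{\tiny T}}\widehat{{\bm{\Sigma}}}_{\mathbf{X}}-{\bm{\Sigma}}_{\mathbf{X}}\lambda_{l}\boldsymbol{\beta}_{l}\boldsymbol{\beta}_{l}^{\mathrm{\tiny T}}{\bm{\Sigma}}_{\mathbf{X}}\right)=(\widehat{\mathbf{M}}-\mathbf{M})-\sum_{l\leq k}\mathbf{N}_{l}.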

It follows that

ρ(𝐌^k+1𝐌k+1,s)ρ(𝐌^𝐌,s)+lkρ(𝐍l,s).\rho(\widehat{\mathbf{M}}_{k+1}-\mathbf{M}_{k+1},s)\leq\rho(\widehat{\mathbf{M}}-\mathbf{M},s)+\sum_{l\leq k}\rho(\mathbf{N}_{l},s). (S65)

By the proof of Lemma S13, \rho(\widehat{\mathbf{M}}-\mathbf{M},s)\leq s\epsilon.

For any unit vector \mathbf{u} with \|\mathbf{u}\|_{0}\leq s, we have

\mathbf{u}^{\mathrm{\tiny T}}\mathbf{N}_{l}\mathbf{u} = \hat{\lambda}_{l}(\mathbf{u}^{\mathrm{\tiny T}}\widehat{{\bm{\Sigma}}}_{\mathbf{X}}\hat{\boldsymbol{\beta}}_{l})^{2}-{\lambda}_{l}(\mathbf{u}^{\mathrm{\tiny T}}{{\bm{\Sigma}}}_{\mathbf{X}}{\boldsymbol{\beta}}_{l})^{2} (S66)
= (\hat{\lambda}_{l}-\lambda_{l})(\mathbf{u}^{\mathrm{\tiny T}}\widehat{{\bm{\Sigma}}}_{\mathbf{X}}\widehat{\boldsymbol{\beta}}_{l})^{2}+\lambda_{l}\{(\mathbf{u}^{\mathrm{\tiny T}}\widehat{{\bm{\Sigma}}}_{\mathbf{X}}\widehat{\boldsymbol{\beta}}_{l})^{2}-(\mathbf{u}^{\mathrm{\tiny T}}{\bm{\Sigma}}_{\mathbf{X}}\boldsymbol{\beta}_{l})^{2}\}\equiv L_{1}+L_{2}. (S67)

By Lemma S12 and the assumption that |\sin\Theta(\widehat{\boldsymbol{\beta}}_{j},\boldsymbol{\beta}_{j})|\leq Cs\epsilon for all j\leq k, we have that |\hat{\lambda}_{l}-\lambda_{l}|\leq Cs\epsilon. Also,

|\mathbf{u}^{\mathrm{\tiny T}}\widehat{{\bm{\Sigma}}}_{\mathbf{X}}\widehat{\boldsymbol{\beta}}_{l}| \leq |\mathbf{u}^{\mathrm{\tiny T}}{\bm{\Sigma}}_{\mathbf{X}}\widehat{\boldsymbol{\beta}}_{l}|+\|\mathbf{u}\|_{1}\cdot\|\widehat{\boldsymbol{\beta}}_{l}\|_{1}\|\widehat{{\bm{\Sigma}}}_{\mathbf{X}}-{\bm{\Sigma}}_{\mathbf{X}}\|_{\max}\leq|\mathbf{u}^{\mathrm{\tiny T}}{\bm{\Sigma}}_{\mathbf{X}}\widehat{\boldsymbol{\beta}}_{l}|+s\epsilon (S68)
\leq \sqrt{(\mathbf{u}^{\mathrm{\tiny T}}{\bm{\Sigma}}_{\mathbf{X}}\mathbf{u})\cdot(\widehat{\boldsymbol{\beta}}_{l}^{\mathrm{\tiny T}}{\bm{\Sigma}}_{\mathbf{X}}\widehat{\boldsymbol{\beta}}_{l})}+s\epsilon\leq\lambda_{\max}({\bm{\Sigma}}_{\mathbf{X}})+s\epsilon. (S69)

It follows that L_{1}\leq Cs\epsilon. For L_{2}, note that (\mathbf{u}^{\mathrm{\tiny T}}\widehat{{\bm{\Sigma}}}_{\mathbf{X}}\widehat{\boldsymbol{\beta}}_{l})^{2}-(\mathbf{u}^{\mathrm{\tiny T}}{\bm{\Sigma}}_{\mathbf{X}}\boldsymbol{\beta}_{l})^{2}=(\mathbf{u}^{\mathrm{\tiny T}}\widehat{{\bm{\Sigma}}}_{\mathbf{X}}\widehat{\boldsymbol{\beta}}_{l}+\mathbf{u}^{\mathrm{\tiny T}}{\bm{\Sigma}}_{\mathbf{X}}\boldsymbol{\beta}_{l})(\mathbf{u}^{\mathrm{\tiny T}}\widehat{{\bm{\Sigma}}}_{\mathbf{X}}\widehat{\boldsymbol{\beta}}_{l}-\mathbf{u}^{\mathrm{\tiny T}}{\bm{\Sigma}}_{\mathbf{X}}\boldsymbol{\beta}_{l}). On the one hand,

|\mathbf{u}^{\mathrm{\tiny T}}\widehat{{\bm{\Sigma}}}_{\mathbf{X}}\widehat{\boldsymbol{\beta}}_{l}+\mathbf{u}^{\mathrm{\tiny T}}{\bm{\Sigma}}_{\mathbf{X}}\boldsymbol{\beta}_{l}|\leq|\mathbf{u}^{\mathrm{\tiny T}}\widehat{\bm{\Sigma}}_{\mathbf{X}}\widehat{\boldsymbol{\beta}}_{l}|+|\mathbf{u}^{\mathrm{\tiny T}}{\bm{\Sigma}}_{\mathbf{X}}\boldsymbol{\beta}_{l}| (S70)
\leq \sqrt{\mathbf{u}^{\mathrm{\tiny T}}\widehat{{\bm{\Sigma}}}_{\mathbf{X}}\mathbf{u}\cdot\widehat{\boldsymbol{\beta}}_{l}^{\mathrm{\tiny T}}\widehat{{\bm{\Sigma}}}_{\mathbf{X}}\widehat{\boldsymbol{\beta}}_{l}}+\sqrt{\mathbf{u}^{\mathrm{\tiny T}}{\bm{\Sigma}}_{\mathbf{X}}\mathbf{u}\cdot\boldsymbol{\beta}_{l}^{\mathrm{\tiny T}}{\bm{\Sigma}}_{\mathbf{X}}\boldsymbol{\beta}_{l}} (S71)
= \sqrt{\mathbf{u}^{\mathrm{\tiny T}}\widehat{{\bm{\Sigma}}}_{\mathbf{X}}\mathbf{u}}+\sqrt{\mathbf{u}^{\mathrm{\tiny T}}{\bm{\Sigma}}_{\mathbf{X}}\mathbf{u}}=\sqrt{\mathbf{u}^{\mathrm{\tiny T}}{\bm{\Sigma}}_{\mathbf{X}}\mathbf{u}+\mathbf{u}^{\mathrm{\tiny T}}(\widehat{{\bm{\Sigma}}}_{\mathbf{X}}-{\bm{\Sigma}}_{\mathbf{X}})\mathbf{u}}+\sqrt{\mathbf{u}^{\mathrm{\tiny T}}{\bm{\Sigma}}_{\mathbf{X}}\mathbf{u}} (S72)
\leq \sqrt{\lambda_{\max}({\bm{\Sigma}}_{\mathbf{X}})+s\epsilon}+\sqrt{\lambda_{\max}({\bm{\Sigma}}_{\mathbf{X}})}. (S73)

On the other hand,

|𝐮T𝚺^𝐗𝜷^l𝐮T𝚺𝐗𝜷l|\displaystyle|\mathbf{u}^{\mathrm{\tiny T}}\widehat{{\bm{\Sigma}}}_{\mathbf{X}}\hat{\boldsymbol{\beta}}_{l}-\mathbf{u}^{\mathrm{\tiny T}}{\bm{\Sigma}}_{\mathbf{X}}\boldsymbol{\beta}_{l}| \displaystyle\leq |𝐮T(𝚺^𝐗𝚺𝐗)𝜷^l|+|𝐮T𝚺𝐗(𝜷^l𝜷l)|\displaystyle|\mathbf{u}^{\mathrm{\tiny T}}(\widehat{{\bm{\Sigma}}}_{\mathbf{X}}-{\bm{\Sigma}}_{\mathbf{X}})\hat{\boldsymbol{\beta}}_{l}|+|\mathbf{u}^{\mathrm{\tiny T}}{\bm{\Sigma}}_{\mathbf{X}}(\hat{\boldsymbol{\beta}}_{l}-\boldsymbol{\beta}_{l})| (S74)
\displaystyle\leq sϵ+|𝐮T𝚺𝐗(𝜷^l𝜷l)|\displaystyle s\epsilon+|\mathbf{u}^{\mathrm{\tiny T}}{\bm{\Sigma}}_{\mathbf{X}}(\hat{\boldsymbol{\beta}}_{l}-\boldsymbol{\beta}_{l})| (S75)
\displaystyle\equiv sϵ+L3.\displaystyle s\epsilon+L_{3}. (S76)

Further note that

L3\displaystyle L_{3} \displaystyle\leq 𝐮T𝚺𝐗𝐮(𝜷^l𝜷l)T𝚺𝐗(𝜷^l𝜷l)λmax(𝚺𝐗)𝜷^l𝜷l2\displaystyle\sqrt{\mathbf{u}^{\mathrm{\tiny T}}{\bm{\Sigma}}_{\mathbf{X}}\mathbf{u}\cdot(\hat{\boldsymbol{\beta}}_{l}-\boldsymbol{\beta}_{l})^{\mathrm{\tiny T}}{\bm{\Sigma}}_{\mathbf{X}}(\hat{\boldsymbol{\beta}}_{l}-\boldsymbol{\beta}_{l})}\leq\lambda_{\max}({\bm{\Sigma}}_{\mathbf{X}})\cdot\|\hat{\boldsymbol{\beta}}_{l}-\boldsymbol{\beta}_{l}\|_{2} (S77)
=\displaystyle= λmax(𝚺𝐗)1𝜷~lT𝚺^𝐗𝜷~l𝜷~l𝜷~T𝚺^𝐗𝜷~l𝜷l2\displaystyle\lambda_{\max}({\bm{\Sigma}}_{\mathbf{X}})\cdot\dfrac{1}{\sqrt{\tilde{\boldsymbol{\beta}}_{l}^{\mathrm{\tiny T}}\widehat{{\bm{\Sigma}}}_{\mathbf{X}}\tilde{\boldsymbol{\beta}}_{l}}}\|\tilde{\boldsymbol{\beta}}_{l}-\sqrt{\tilde{\boldsymbol{\beta}}^{\mathrm{\tiny T}}\widehat{{\bm{\Sigma}}}_{\mathbf{X}}\tilde{\boldsymbol{\beta}}_{l}}\cdot\boldsymbol{\beta}_{l}\|_{2} (S78)
\displaystyle\leq λmax(𝚺𝐗)1𝜷~lT𝚺^𝐗𝜷~l{𝜷~l𝜷l2+|1𝜷l2𝜷~lT𝚺^𝐗𝜷~l|𝜷l2},\displaystyle\lambda_{\max}({\bm{\Sigma}}_{\mathbf{X}})\cdot\dfrac{1}{\sqrt{\tilde{\boldsymbol{\beta}}_{l}^{\mathrm{\tiny T}}\widehat{{\bm{\Sigma}}}_{\mathbf{X}}\tilde{\boldsymbol{\beta}}_{l}}}\{\|\tilde{\boldsymbol{\beta}}_{l}-\boldsymbol{\beta}_{l}^{*}\|_{2}+|\dfrac{1}{\|\boldsymbol{\beta}_{l}\|_{2}}-\sqrt{\tilde{\boldsymbol{\beta}}_{l}^{\mathrm{\tiny T}}\widehat{{\bm{\Sigma}}}_{\mathbf{X}}\tilde{\boldsymbol{\beta}}_{l}}|\cdot\|\boldsymbol{\beta}_{l}\|_{2}\}, (S79)

where \boldsymbol{\beta}_{l}^{*}=\boldsymbol{\beta}_{l}/\|\boldsymbol{\beta}_{l}\|_{2}. By our assumption, \|\tilde{\boldsymbol{\beta}}_{l}-\boldsymbol{\beta}_{l}^{*}\|_{2}\leq\sqrt{2}|\sin\Theta(\tilde{\boldsymbol{\beta}}_{l},\boldsymbol{\beta}_{l}^{*})|=\sqrt{2}|\sin\Theta(\widehat{\boldsymbol{\beta}}_{l},\boldsymbol{\beta}_{l})|\leq Cs\epsilon. For the second term, since (\boldsymbol{\beta}^{*}_{l})^{\mathrm{\tiny T}}{\bm{\Sigma}}_{\mathbf{X}}\boldsymbol{\beta}_{l}^{*}=\boldsymbol{\beta}_{l}^{\mathrm{\tiny T}}{\bm{\Sigma}}_{\mathbf{X}}\boldsymbol{\beta}_{l}/\|\boldsymbol{\beta}_{l}\|_{2}^{2}=1/\|\boldsymbol{\beta}_{l}\|_{2}^{2}, we have \|\boldsymbol{\beta}_{l}\|_{2}=1/\sqrt{(\boldsymbol{\beta}^{*}_{l})^{\mathrm{\tiny T}}{\bm{\Sigma}}_{\mathbf{X}}\boldsymbol{\beta}_{l}^{*}}. It follows that

|1𝜷l2𝜷~lT𝚺^𝐗𝜷~l|=|(𝜷l)T𝚺𝐗𝜷l𝜷~lT𝚺^𝐗𝜷~l|\displaystyle|\dfrac{1}{\|\boldsymbol{\beta}_{l}\|_{2}}-\sqrt{\tilde{\boldsymbol{\beta}}_{l}^{\mathrm{\tiny T}}\widehat{{\bm{\Sigma}}}_{\mathbf{X}}\tilde{\boldsymbol{\beta}}_{l}}|=|\sqrt{(\boldsymbol{\beta}_{l}^{*})^{\mathrm{\tiny T}}{\bm{\Sigma}}_{\mathbf{X}}\boldsymbol{\beta}_{l}^{*}}-\sqrt{\tilde{\boldsymbol{\beta}}_{l}^{\mathrm{\tiny T}}\widehat{{\bm{\Sigma}}}_{\mathbf{X}}\tilde{\boldsymbol{\beta}}_{l}}| (S80)
\displaystyle\leq |(𝜷l)T𝚺𝐗𝜷l𝜷~lT𝚺𝐗𝜷~l|+|𝜷~lT(𝚺𝐗𝚺^𝐗)𝜷~l|/(𝜷~T𝚺𝐗𝜷~+𝜷~T𝚺^𝐗𝜷~)\displaystyle|\sqrt{(\boldsymbol{\beta}_{l}^{*})^{\mathrm{\tiny T}}{\bm{\Sigma}}_{\mathbf{X}}\boldsymbol{\beta}_{l}^{*}}-\sqrt{\tilde{\boldsymbol{\beta}}_{l}^{\mathrm{\tiny T}}{\bm{\Sigma}}_{\mathbf{X}}\tilde{\boldsymbol{\beta}}_{l}}|+|\tilde{\boldsymbol{\beta}}_{l}^{\mathrm{\tiny T}}({\bm{\Sigma}}_{\mathbf{X}}-\widehat{{\bm{\Sigma}}}_{\mathbf{X}})\tilde{\boldsymbol{\beta}}_{l}|/(\sqrt{\tilde{\boldsymbol{\beta}}^{\mathrm{\tiny T}}{\bm{\Sigma}}_{\mathbf{X}}\tilde{\boldsymbol{\beta}}}+\sqrt{\tilde{\boldsymbol{\beta}}^{\mathrm{\tiny T}}\widehat{{\bm{\Sigma}}}_{\mathbf{X}}\tilde{\boldsymbol{\beta}}}) (S81)
\displaystyle\leq |(𝜷l)T𝚺𝐗𝜷l𝜷~lT𝚺^𝐗𝜷~l|(𝜷l)T𝚺𝐗𝜷l+𝜷~lT𝚺^𝐗𝜷~l+sϵλmin(𝚺𝐗).\displaystyle\dfrac{|(\boldsymbol{\beta}_{l}^{*})^{\mathrm{\tiny T}}{\bm{\Sigma}}_{\mathbf{X}}\boldsymbol{\beta}_{l}^{*}-\tilde{\boldsymbol{\beta}}_{l}^{\mathrm{\tiny T}}\widehat{{\bm{\Sigma}}}_{\mathbf{X}}\tilde{\boldsymbol{\beta}}_{l}|}{\sqrt{(\boldsymbol{\beta}_{l}^{*})^{\mathrm{\tiny T}}{\bm{\Sigma}}_{\mathbf{X}}\boldsymbol{\beta}_{l}^{*}}+\sqrt{\tilde{\boldsymbol{\beta}}_{l}^{\mathrm{\tiny T}}\widehat{{\bm{\Sigma}}}_{\mathbf{X}}\tilde{\boldsymbol{\beta}}_{l}}}+\dfrac{s\epsilon}{\lambda_{\min}({\bm{\Sigma}}_{\mathbf{X}})}. (S82)

Because (𝜷l)T𝚺𝐗𝜷l+𝜷~lT𝚺^𝐗𝜷~lλmin(𝚺𝐗)\sqrt{(\boldsymbol{\beta}_{l}^{*})^{\mathrm{\tiny T}}{\bm{\Sigma}}_{\mathbf{X}}\boldsymbol{\beta}_{l}^{*}}+\sqrt{\tilde{\boldsymbol{\beta}}_{l}^{\mathrm{\tiny T}}\widehat{{\bm{\Sigma}}}_{\mathbf{X}}\tilde{\boldsymbol{\beta}}_{l}}\geq\lambda_{\min}({\bm{\Sigma}}_{\mathbf{X}}), it suffices to find a bound for |(𝜷l)T𝚺𝐗𝜷l𝜷~lT𝚺^𝐗𝜷~l||(\boldsymbol{\beta}_{l}^{*})^{\mathrm{\tiny T}}{\bm{\Sigma}}_{\mathbf{X}}\boldsymbol{\beta}_{l}^{*}-\tilde{\boldsymbol{\beta}}_{l}^{\mathrm{\tiny T}}\widehat{{\bm{\Sigma}}}_{\mathbf{X}}\tilde{\boldsymbol{\beta}}_{l}|.

|(𝜷l)T𝚺𝐗𝜷l𝜷~lT𝚺^𝐗𝜷~l||(𝜷l)T𝚺𝐗𝜷l𝜷~lT𝚺𝐗𝜷~l|+sϵ\displaystyle|(\boldsymbol{\beta}_{l}^{*})^{\mathrm{\tiny T}}{\bm{\Sigma}}_{\mathbf{X}}\boldsymbol{\beta}_{l}^{*}-\tilde{\boldsymbol{\beta}}_{l}^{\mathrm{\tiny T}}\widehat{{\bm{\Sigma}}}_{\mathbf{X}}\tilde{\boldsymbol{\beta}}_{l}|\leq|(\boldsymbol{\beta}_{l}^{*})^{\mathrm{\tiny T}}{\bm{\Sigma}}_{\mathbf{X}}\boldsymbol{\beta}_{l}^{*}-\tilde{\boldsymbol{\beta}}_{l}^{\mathrm{\tiny T}}{{\bm{\Sigma}}}_{\mathbf{X}}\tilde{\boldsymbol{\beta}}_{l}|+s\epsilon (S83)
\displaystyle\leq |(𝜷l𝜷~l)T𝚺𝐗𝜷l|+|𝜷~lT𝚺𝐗(𝜷l𝜷~l)|+sϵ\displaystyle|(\boldsymbol{\beta}_{l}^{*}-\tilde{\boldsymbol{\beta}}_{l})^{\mathrm{\tiny T}}{\bm{\Sigma}}_{\mathbf{X}}\boldsymbol{\beta}_{l}^{*}|+|\tilde{\boldsymbol{\beta}}_{l}^{\mathrm{\tiny T}}{\bm{\Sigma}}_{\mathbf{X}}(\boldsymbol{\beta}_{l}^{*}-\tilde{\boldsymbol{\beta}}_{l})|+s\epsilon (S85)
\displaystyle\leq (𝜷l𝜷~l)T𝚺𝐗(𝜷l𝜷~l)(𝜷l)T𝚺𝐗𝜷l+𝜷~lT𝚺𝐗𝜷~l(𝜷l𝜷~l)T𝚺𝐗(𝜷l𝜷~l)\displaystyle\sqrt{(\boldsymbol{\beta}_{l}^{*}-\tilde{\boldsymbol{\beta}}_{l})^{\mathrm{\tiny T}}{\bm{\Sigma}}_{\mathbf{X}}(\boldsymbol{\beta}_{l}^{*}-\tilde{\boldsymbol{\beta}}_{l})\cdot(\boldsymbol{\beta}_{l}^{*})^{\mathrm{\tiny T}}{\bm{\Sigma}}_{\mathbf{X}}\boldsymbol{\beta}_{l}^{*}}+\sqrt{\tilde{\boldsymbol{\beta}}_{l}^{\mathrm{\tiny T}}{\bm{\Sigma}}_{\mathbf{X}}\tilde{\boldsymbol{\beta}}_{l}\cdot(\boldsymbol{\beta}_{l}^{*}-\tilde{\boldsymbol{\beta}}_{l})^{\mathrm{\tiny T}}{\bm{\Sigma}}_{\mathbf{X}}(\boldsymbol{\beta}_{l}^{*}-\tilde{\boldsymbol{\beta}}_{l})}
+sϵ\displaystyle+s\epsilon
\displaystyle\leq 2λmax(𝚺𝐗)𝜷l𝜷~l2+sϵ\displaystyle 2\lambda_{\max}({\bm{\Sigma}}_{\mathbf{X}})\cdot\|\boldsymbol{\beta}_{l}^{*}-\tilde{\boldsymbol{\beta}}_{l}\|_{2}+s\epsilon (S86)
\displaystyle\leq 2\lambda_{\max}({\bm{\Sigma}}_{\mathbf{X}})\sqrt{2\{1-\cos\Theta(\boldsymbol{\beta}^{*}_{l},\tilde{\boldsymbol{\beta}}_{l})\}}+s\epsilon\leq 2\sqrt{2}\lambda_{\max}({\bm{\Sigma}}_{\mathbf{X}})|\sin\Theta(\boldsymbol{\beta}_{l}^{*},\tilde{\boldsymbol{\beta}}_{l})|+s\epsilon\leq Cs\epsilon, (S87)

which also implies that \dfrac{1}{\sqrt{\tilde{\boldsymbol{\beta}}_{l}^{\mathrm{\tiny T}}\widehat{{\bm{\Sigma}}}_{\mathbf{X}}\tilde{\boldsymbol{\beta}}_{l}}}\leq\dfrac{1}{\sqrt{(\boldsymbol{\beta}_{l}^{*})^{\mathrm{\tiny T}}{\bm{\Sigma}}_{\mathbf{X}}\boldsymbol{\beta}_{l}^{*}-Cs\epsilon}}\leq\sqrt{\dfrac{2}{\lambda_{\min}({\bm{\Sigma}}_{\mathbf{X}})}} provided that Cs\epsilon\leq\dfrac{\lambda_{\min}({\bm{\Sigma}}_{\mathbf{X}})}{2}. Combining the above bounds yields L_{3}\leq Cs\epsilon, so L_{2}\leq Cs\epsilon and hence \rho(\mathbf{N}_{l},s)\leq Cs\epsilon for every l\leq k. Finally, by (S65) we have the desired conclusion. ∎
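
For completeness, the elementary fact used twice above (in bounding \|\tilde{\boldsymbol{\beta}}_{l}-\boldsymbol{\beta}_{l}^{*}\|_{2} and in the last step leading to (S87)) is the following: for any unit vectors \mathbf{a} and \mathbf{b} with \cos\Theta(\mathbf{a},\mathbf{b})\geq 0,

\|\mathbf{a}-\mathbf{b}\|_{2}^{2}=2-2\cos\Theta(\mathbf{a},\mathbf{b})=\dfrac{2\sin^{2}\Theta(\mathbf{a},\mathbf{b})}{1+\cos\Theta(\mathbf{a},\mathbf{b})}\leq 2\sin^{2}\Theta(\mathbf{a},\mathbf{b}),

so that \|\mathbf{a}-\mathbf{b}\|_{2}\leq\sqrt{2}|\sin\Theta(\mathbf{a},\mathbf{b})|.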

Now we define crk(s)=infF:|F|scr(𝐌k,F,𝚺F)cr_{k}(s)=\inf_{F:|F|\leq s}cr(\mathbf{M}_{k,F},{\bm{\Sigma}}_{F}). We have the following lemma.

Lemma S15.

Assume that |\sin\Theta(\hat{\boldsymbol{\beta}}_{l},\boldsymbol{\beta}_{l})|<Cs\epsilon for all l<k. If \dfrac{2s\epsilon}{\lambda_{\min}({\bm{\Sigma}}_{\mathbf{X}})}+\dfrac{2s\epsilon}{\lambda_{\min}({\bm{\Sigma}}_{\mathbf{X}})\lambda_{1}}<\min\{\frac{1}{2},\dfrac{\Delta}{\lambda_{1}+\lambda_{2}},\dfrac{\lambda_{\min}({\bm{\Sigma}}_{\mathbf{X}})}{2},\dfrac{\Delta}{2(1+\lambda_{1}^{2})}cr_{k}(s)\}, then |\sin\Theta(\hat{\boldsymbol{\beta}}_{k},\boldsymbol{\beta}_{k})|\leq Cs\epsilon.

Proof of Lemma S15.

Combining Lemma S14 with Lemma S3, we obtain the desired conclusion. ∎

Proof of Theorem 3.

Combining Lemma S13 (the base case) with Lemma S15 (the induction step) and applying induction on k, we have that if \|\widehat{\mathbf{M}}-\mathbf{M}\|_{\max}\leq\epsilon and \|\widehat{{\bm{\Sigma}}}_{\mathbf{X}}-{\bm{\Sigma}}_{\mathbf{X}}\|_{\max}\leq\epsilon, then |\sin\Theta(\hat{\boldsymbol{\beta}}_{k},\boldsymbol{\beta}_{k})|\leq Cs\epsilon for k=1,\ldots,K. The desired conclusion then follows from Theorem 1. ∎