
On Sufficient Graphical Models

Bing Li ([email protected])
Department of Statistics, Pennsylvania State University
326 Thomas Building, University Park, PA 16802

Kyongwon Kim ([email protected])
Department of Statistics, Ewha Womans University
52 Ewhayeodae-gil, Seodaemun-gu, Seoul, Republic of Korea, 03760
Abstract

We introduce a sufficient graphical model by applying the recently developed nonlinear sufficient dimension reduction techniques to the evaluation of conditional independence. The graphical model is nonparametric in nature, as it does not make distributional assumptions such as the Gaussian or copula Gaussian assumptions. However, unlike a fully nonparametric graphical model, which relies on the high-dimensional kernel to characterize conditional independence, our graphical model is based on conditional independence given a set of sufficient predictors with a substantially reduced dimension. In this way we avoid the curse of dimensionality that comes with a high-dimensional kernel. We develop the population-level properties, convergence rate, and variable selection consistency of our estimate. By simulation comparisons and an analysis of the DREAM 4 Challenge data set, we demonstrate that our method outperforms the existing methods when the Gaussian or copula Gaussian assumptions are violated, and its performance remains excellent in the high-dimensional setting.

Keywords: conjoined conditional covariance operator, generalized sliced inverse regression, nonlinear sufficient dimension reduction, reproducing kernel Hilbert space

1 Introduction

In this paper we propose a new nonparametric statistical graphical model, which we call the sufficient graphical model, by applying recently developed nonlinear sufficient dimension reduction techniques to the construction of distribution-free graphical models.

Let $\mathcal{G}=(\Gamma,\mathcal{E})$ be an undirected graph consisting of a finite set of nodes $\Gamma=\{1,\dots,p\}$ and a set of edges $\mathcal{E}\subseteq\{(i,j)\in\Gamma\times\Gamma:i\neq j\}$. Since $(i,j)$ and $(j,i)$ represent the same edge in an undirected graph, we can assume without loss of generality that $i>j$. A statistical graphical model links $\mathcal{G}$ with a random vector $X=(X^1,\dots,X^p)$ by the conditional independence

$$(i,j)\notin\mathcal{E}\;\Leftrightarrow\;X^i \perp\!\!\!\perp X^j \mid X^{-(i,j)}, \qquad (1)$$

where $X^{-(i,j)}=\{X^1,\dots,X^p\}\setminus\{X^i,X^j\}$, and $A \perp\!\!\!\perp B \mid C$ means conditional independence. Thus, nodes $i$ and $j$ are connected if and only if $X^i$ and $X^j$ are dependent given $X^{-(i,j)}$. Our goal is to estimate the edge set $\mathcal{E}$ based on a sample $X_1,\dots,X_n$ of $X$. See Lauritzen (1996).

One of the most popular statistical graphical models is the Gaussian graphical model, which assumes $X\sim N(\mu,\Sigma)$. Under the Gaussian assumption, the conditional independence in (1) is encoded in the precision matrix $\Theta=\Sigma^{-1}$ in the following sense:

$$X^i \perp\!\!\!\perp X^j \mid X^{-(i,j)}\;\Leftrightarrow\;\theta_{ij}=0, \qquad (2)$$

where $\theta_{ij}$ is the $(i,j)$th entry of the precision matrix $\Theta$. By this equivalence, estimating $\mathcal{E}$ amounts to identifying the positions of the zero entries of the precision matrix, which can be achieved by sparse estimation methods such as the lasso (Tibshirani, 1996), the SCAD (Fan and Li, 2001), and the adaptive lasso (Zou, 2006). A variety of methods have been developed for estimating the Gaussian graphical model, including, for example, Meinshausen and Bühlmann (2006), Yuan and Lin (2007), Bickel and Levina (2008), and Peng et al. (2009). See also Friedman et al. (2008), Guo et al. (2010), and Lam and Fan (2009).
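To make the equivalence (2) concrete, here is a minimal numerical sketch of edge recovery under the Gaussian assumption; the use of scikit-learn's graphical lasso, the penalty level, and the zero tolerance are our own illustrative choices, not part of the paper.

```python
# A minimal sketch of Gaussian graphical model estimation via the graphical lasso;
# the penalty alpha and the tolerance 1e-8 are illustrative choices.
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)
Sigma = np.array([[1.0, 0.5, 0.0],
                  [0.5, 1.0, 0.0],
                  [0.0, 0.0, 1.0]])
X = rng.multivariate_normal(np.zeros(3), Sigma, size=500)   # n = 500, p = 3

Theta = GraphicalLasso(alpha=0.05).fit(X).precision_        # sparse estimate of the precision matrix
edges = [(i, j) for i in range(3) for j in range(i)         # i > j, as in the text
         if abs(Theta[i, j]) > 1e-8]                        # nonzero theta_ij <=> edge (i, j)
print(edges)                                                # we expect [(1, 0)] here
```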

Since the Gaussian distribution assumption is restrictive, many recent advances have focused on relaxing it. A main challenge in doing so is to avoid the curse of dimensionality (Bellman, 1961): a straightforward nonparametric extension would resort to a high-dimensional kernel, which is known to be ineffective. One way to relax the Gaussian assumption without invoking a high-dimensional kernel is to use the copula Gaussian distribution, which is the approach taken by Liu et al. (2009), Liu et al. (2012a), and Xue and Zou (2012), and is further extended to the transelliptical model by Liu et al. (2012b).

However, the copula Gaussian assumption could still be restrictive: for example, if $A$ and $B$ are random variables satisfying $B=A^2+\epsilon$, where $A$ and $\epsilon$ are i.i.d. $N(0,1)$, then $(A,B)$ does not satisfy the copula Gaussian assumption. To further relax the distributional assumption, Li et al. (2014) proposed a new statistical relation called additive conditional independence as an alternative criterion for constructing the graphical model. This relation has the advantage of achieving nonparametric model flexibility without using a high-dimensional kernel, while obeying the same set of semi-graphoid axioms that govern conditional independence (Dawid, 1979; Pearl and Verma, 1987). See also Lee et al. (2016b) and Li and Solea (2018a). Other approaches to nonparametric graphical models include Fellinghauer et al. (2013) and Voorman et al. (2013).

In this paper, instead of relying on additivity to avoid the curse of dimensionality, we apply recently developed nonlinear sufficient dimension reduction (Lee et al., 2013; Li, 2018b) to achieve this goal. The estimation proceeds in two steps: first, we use nonlinear sufficient dimension reduction to reduce $X^{-(i,j)}$ to a low-dimensional random vector $U^{ij}$; second, we use the kernel method to construct a nonparametric graphical model based on $(X^i,X^j)$ and the dimension-reduced random vector $U^{ij}$. The main differences between this approach and Li et al. (2014) are, first, that we retain conditional independence as the criterion for constructing the network, which is a widely accepted criterion with a more direct interpretation, and, second, that we are no longer restricted by an additive structure in the graphical model. Another attractive feature of our method comes from the "kernel trick": its computational complexity depends on the sample size rather than on the size of the network.

The rest of the paper is organized as follows. In Sections 2 and 3, we introduce the sufficient graphical model and describe its estimation method at the population level. In Section 4 we lay out the detailed algorithms to implement the method. In Section 5 we develop the asymptotic properties, such as estimation consistency, variable selection consistency, and convergence rates. In Section 6, we conduct simulation studies to compare our method with existing methods. In Section 7, we apply our method to the DREAM 4 Challenge gene network data set. Section 8 concludes the paper with some further discussion. Due to limited space, we put all proofs and some additional results in the Supplementary Material.

2 Sufficient graphical model

In classical sufficient dimension reduction, we seek the lowest-dimensional subspace $\mathcal{S}$ of $\mathbb{R}^p$ such that, after projecting $X\in\mathbb{R}^p$ onto $\mathcal{S}$, the information about the response $Y$ is preserved; that is, $Y \perp\!\!\!\perp X \mid P_{\mathcal{S}}X$, where $P_{\mathcal{S}}$ is the projection onto $\mathcal{S}$. This subspace is called the central subspace, written as $\mathcal{S}_{Y|X}$. See, for example, Li (1991), Cook (1994), and Li (2018b). Li et al. (2011) and Lee et al. (2013) extended this framework to the nonlinear setting by considering the more general problem $Y \perp\!\!\!\perp X \mid \mathcal{G}$, where $\mathcal{G}$ is a sub-$\sigma$-field of the $\sigma$-field generated by $X$. The class of functions in a Hilbert space that are measurable with respect to $\mathcal{G}$ is called the central class, written as $\mathfrak{S}_{Y|X}$. Li et al. (2011) introduced the Principal Support Vector Machine, and Lee et al. (2013) generalized the Sliced Inverse Regression (Li, 1991) and the Sliced Average Variance Estimate (Cook and Weisberg, 1991) to estimate the central class. Precursors of this theory include Bach and Jordan (2002), Wu (2008), and Wang (2008).

To link this up with the statistical graphical model, let $(\Omega,\mathcal{F},P)$ be a probability space, $(\Omega_X,\mathcal{F}_X)$ a Borel measurable space with $\Omega_X\subseteq\mathbb{R}^p$, and $X:\Omega\to\Omega_X$ a random vector with distribution $P_X$. The $i$th component of $X$ is denoted by $X^i$, and its range by $\Omega_{X^i}$. We assume $\Omega_X=\Omega_{X^1}\times\cdots\times\Omega_{X^p}$. Let $X^{(i,j)}=(X^i,X^j)$ and let $X^{-(i,j)}$ be as defined in the Introduction. Let $\sigma(X^{-(i,j)})$ be the $\sigma$-field generated by $X^{-(i,j)}$. We assume that, for each $(i,j)\in\Gamma\times\Gamma$, there is a proper sub-$\sigma$-field $\mathcal{G}^{-(i,j)}$ of $\sigma(X^{-(i,j)})$ such that

$$X^{(i,j)} \perp\!\!\!\perp X^{-(i,j)} \mid \mathcal{G}^{-(i,j)}. \qquad (3)$$

Without loss of generality, we assume that $\mathcal{G}^{-(i,j)}$ is the smallest sub-$\sigma$-field of $\sigma(X^{-(i,j)})$ that satisfies the above relation; that is, $\mathcal{G}^{-(i,j)}$ is the central $\sigma$-field for $X^{(i,j)}$ versus $X^{-(i,j)}$. There are plenty of examples of joint distributions of $X$ for which condition (3) holds for every pair $(i,j)$: see Section S10 of the Supplementary Material. Using the properties of conditional independence developed in Dawid (1979) (with a detailed proof given in Li (2018b)), we can show that (3) implies the following equivalence.

Theorem 1

If $X^{(i,j)} \perp\!\!\!\perp X^{-(i,j)} \mid \mathcal{G}^{-(i,j)}$, then

$$X^i \perp\!\!\!\perp X^j \mid X^{-(i,j)}\ \Leftrightarrow\ X^i \perp\!\!\!\perp X^j \mid \mathcal{G}^{-(i,j)}.$$

This equivalence motivates us to use $X^i \perp\!\!\!\perp X^j \mid \mathcal{G}^{-(i,j)}$ as the criterion to construct the graph $\mathcal{G}$, after performing nonlinear sufficient dimension reduction of $X^{(i,j)}$ versus $X^{-(i,j)}$ for each $(i,j)\in\Gamma\times\Gamma$, $i>j$.

Definition 2

Under condition (3), the graph defined by

$$(i,j)\notin\mathcal{E}\;\Leftrightarrow\;X^i \perp\!\!\!\perp X^j \mid \mathcal{G}^{-(i,j)}$$

is called the sufficient graphical model.

3 Estimation: population-level development

The estimation of the sufficient graphical model involves two steps: the first step is to use nonlinear sufficient dimension reduction to estimate $\mathcal{G}^{-(i,j)}$; the second is to construct a graph based on the reduced data

$$\{(X^{(i,j)},\mathcal{G}^{-(i,j)}):(i,j)\in\Gamma\times\Gamma,\ i>j\}.$$

In this section we describe the two steps at the population level. To do so, we need some preliminary concepts, such as the covariance operator between two reproducing kernel Hilbert spaces, the mean element in a reproducing kernel Hilbert space, the inverse of an operator, and the centered reproducing kernel Hilbert space. These concepts are defined in Section S1.2 of the Supplementary Material. A fuller development of the related theory can be found in Li (2018b). The symbols $\mathrm{ran}(\cdot)$ and $\overline{\mathrm{ran}}(\cdot)$ will be used to denote the range of a linear operator and the closure of that range.

3.1 Step 1: Nonlinear dimension reduction

We use generalized sliced inverse regression (Lee et al., 2013; Li, 2018b) to perform the nonlinear dimension reduction. For each pair $(i,j)\in\Gamma\times\Gamma$, $i>j$, let $\Omega_{X^{-(i,j)}}$ be the range of $X^{-(i,j)}$, which is the Cartesian product of $\Omega_{X^1},\dots,\Omega_{X^p}$ with $\Omega_{X^i}$ and $\Omega_{X^j}$ removed. Let

$$\kappa_X^{-(i,j)}:\,\Omega_{X^{-(i,j)}}\times\Omega_{X^{-(i,j)}}\to\mathbb{R}$$

be a positive semidefinite kernel. Let $\mathcal{H}_X^{-(i,j)}$ be the centered reproducing kernel Hilbert space generated by $\kappa_X^{-(i,j)}$. Let $\Omega_{X^{(i,j)}}$, $\kappa_X^{(i,j)}$, and $\mathcal{H}_X^{(i,j)}$ be the analogous objects defined for $X^{(i,j)}$.

Assumption 1
$$E[\kappa_X^{-(i,j)}(X^{-(i,j)},X^{-(i,j)})]<\infty,\qquad E[\kappa_X^{(i,j)}(X^{(i,j)},X^{(i,j)})]<\infty.$$

This is a very mild assumption that is satisfied by most kernels. Under this assumption, the following covariance operators are well defined:

$$\Sigma_{X^{-(i,j)}X^{(i,j)}}:\mathcal{H}_X^{(i,j)}\to\mathcal{H}_X^{-(i,j)},\qquad \Sigma_{X^{-(i,j)}X^{-(i,j)}}:\mathcal{H}_X^{-(i,j)}\to\mathcal{H}_X^{-(i,j)}.$$

For the formal definition of the covariance operator, see Section S1.2 of the Supplementary Material. Next, we introduce the regression operator from $\mathcal{H}_X^{(i,j)}$ to $\mathcal{H}_X^{-(i,j)}$. For this purpose we need the following assumption.

Assumption 2

$\mathrm{ran}(\Sigma_{X^{-(i,j)}X^{(i,j)}})\subseteq\mathrm{ran}(\Sigma_{X^{-(i,j)}X^{-(i,j)}})$.

As argued in Li (2018b), this assumption can be interpreted as a type of collective smoothness in the relation between $X^{(i,j)}$ and $X^{-(i,j)}$: intuitively, it requires that the operator $\Sigma_{X^{-(i,j)}X^{(i,j)}}$ send all input functions into the low-frequency domain of the operator $\Sigma_{X^{-(i,j)}X^{-(i,j)}}$. Under Assumption 2, the linear operator

$$R_{X^{-(i,j)}X^{(i,j)}}=\Sigma_{X^{-(i,j)}X^{-(i,j)}}^{-1}\Sigma_{X^{-(i,j)}X^{(i,j)}}$$

is well defined, and we call it the regression operator from $\mathcal{H}_X^{(i,j)}$ to $\mathcal{H}_X^{-(i,j)}$. The meaning of the inverse $\Sigma_{X^{-(i,j)}X^{-(i,j)}}^{-1}$ is defined in Section S1.2 of the Supplementary Material. The regression operator in this form was formally defined in Lee et al. (2016a), but earlier forms existed in Fukumizu et al. (2004); see also Li (2018a).

Assumption 3

$R_{X^{-(i,j)}X^{(i,j)}}$ is a finite-rank operator, with rank $d_{ij}$.

Intuitively, this assumption means that $R_{X^{-(i,j)}X^{(i,j)}}$ filters out the high-frequency functions of $X^{(i,j)}$, so that, for any $f\in\mathcal{H}_X^{(i,j)}$, $R_{X^{-(i,j)}X^{(i,j)}}f$ is relatively smooth. It would be violated, for example, if one could find an $f\in\mathcal{H}_X^{(i,j)}$ that makes $R_{X^{-(i,j)}X^{(i,j)}}f$ arbitrarily choppy. The regression operator plays a crucial role in nonlinear sufficient dimension reduction. Let $L_2(P_{X^{-(i,j)}})$ be the $L_2$-space with respect to the distribution $P_{X^{-(i,j)}}$ of $X^{-(i,j)}$. As shown in Lee et al. (2013), the closure of the range of the regression operator is equal to the central class; that is,

$$\overline{\mathrm{ran}}(R_{X^{-(i,j)}X^{(i,j)}})=\mathfrak{S}_{X^{(i,j)}|X^{-(i,j)}} \qquad (4)$$

under the following assumption.

Assumption 4
  1. $\mathcal{H}_X^{-(i,j)}$ is dense in $L_2(P_{X^{-(i,j)}})$ modulo constants; that is, for any $f\in L_2(P_{X^{-(i,j)}})$ and any $\epsilon>0$, there is a $g\in\mathcal{H}_X^{-(i,j)}$ such that $\mathrm{var}[f(X^{-(i,j)})-g(X^{-(i,j)})]<\epsilon$;

  2. $\mathfrak{S}_{X^{(i,j)}|X^{-(i,j)}}$ is sufficient and complete.

The first condition essentially requires the kernel $\kappa_X^{-(i,j)}$ to be a universal kernel with respect to the $L_2(P_{X^{-(i,j)}})$-norm. It means that $\mathcal{H}_X^{-(i,j)}$ is rich enough to approximate any $L_2(P_{X^{-(i,j)}})$-function arbitrarily closely. For example, this condition is satisfied by the Gaussian radial basis function kernel, but not by the polynomial kernel. For more information on universal kernels, see Sriperumbudur, Fukumizu, and Lanckriet (2011). The completeness in the second condition means that

$$E[g(X^{-(i,j)})\mid X^{(i,j)}]=0\ \text{almost surely}\ \Rightarrow\ g(X^{-(i,j)})=0\ \text{almost surely}.$$

This concept is defined in Lee, Li, and Chiaromonte (2013), and is similar to the classical definition of completeness with $X^{-(i,j)}$ treated as the parameter. Lee, Li, and Chiaromonte (2013) showed that completeness is a mild condition that is satisfied by most nonparametric models.

A basis of the central class $\mathfrak{S}_{X^{(i,j)}|X^{-(i,j)}}$ can be found by solving the following generalized eigenvalue problem: for $k=1,\dots,d_{ij}$,

$$\begin{aligned}\text{maximize}\quad&\,\langle f,\Sigma_{X^{-(i,j)}X^{(i,j)}}A\Sigma_{X^{(i,j)}X^{-(i,j)}}f\rangle_{-(i,j)}\\ \text{subject to}\quad&\,\langle f,\Sigma_{X^{-(i,j)}X^{-(i,j)}}f\rangle_{-(i,j)}=1,\quad\langle f,\Sigma_{X^{-(i,j)}X^{-(i,j)}}f_\ell\rangle_{-(i,j)}=0\ \text{ for }\ell=1,\dots,k-1,\end{aligned} \qquad (5)$$

where $A:\mathcal{H}_X^{(i,j)}\to\mathcal{H}_X^{(i,j)}$ is any nonsingular and self-adjoint operator, and $\langle\cdot,\cdot\rangle_{-(i,j)}$ is the inner product in $\mathcal{H}_X^{-(i,j)}$. That is, if $f_1^{ij},\dots,f_{d_{ij}}^{ij}$ are the first $d_{ij}$ eigenfunctions of this eigenvalue problem, then they span the central class. This type of estimate of the central class is called generalized sliced inverse regression. Convenient choices of $A$ are the identity mapping $I$ and the operator $\Sigma_{X^{(i,j)}X^{(i,j)}}^{-1}$. If we use the latter, then we need the following assumption.

Assumption 5

$\mathrm{ran}(\Sigma_{X^{(i,j)}X^{-(i,j)}})\subseteq\mathrm{ran}(\Sigma_{X^{(i,j)}X^{(i,j)}})$.

This assumption has a similar interpretation to Assumption 2; see Section S11 in the Supplementary Material. At the population level, choosing $A$ to be $\Sigma_{X^{(i,j)}X^{(i,j)}}^{-1}$ achieves better scaling, because it downweights those components of the output of $\Sigma_{X^{(i,j)}X^{-(i,j)}}$ that have larger variances. However, if the sample size is not sufficiently large, involving an estimate of $\Sigma_{X^{(i,j)}X^{(i,j)}}^{-1}$ in the procedure could incur extra variation that overwhelms the benefit it brings. In that case, a nonrandom operator such as $A=I$ is preferable. In this paper we use $A=\Sigma_{X^{(i,j)}X^{(i,j)}}^{-1}$. Let $U^{ij}$ denote the random vector $(f_1^{ij}(X^{-(i,j)}),\dots,f_{d_{ij}}^{ij}(X^{-(i,j)}))$. The set of random vectors $\{U^{ij}:(i,j)\in\Gamma\times\Gamma,\ i>j\}$ is the output of the nonlinear sufficient dimension reduction step.

3.2 Step 2: Estimation of the sufficient graphical model

To estimate the edge set of the sufficient graphical model, we need a way to determine whether $X^i \perp\!\!\!\perp X^j \mid U^{ij}$ holds. We use a linear operator introduced by Fukumizu et al. (2008) to perform this task, which is briefly described as follows. Let $U$, $V$, $W$ be random vectors taking values in measurable spaces $(\Omega_U,\mathcal{F}_U)$, $(\Omega_V,\mathcal{F}_V)$, and $(\Omega_W,\mathcal{F}_W)$, respectively. Let $\Omega_{UW}=\Omega_U\times\Omega_W$, $\Omega_{VW}=\Omega_V\times\Omega_W$, $\mathcal{F}_{UW}=\mathcal{F}_U\times\mathcal{F}_W$, and $\mathcal{F}_{VW}=\mathcal{F}_V\times\mathcal{F}_W$. Let

$$\kappa_{UW}:\Omega_{UW}\times\Omega_{UW}\to\mathbb{R},\quad\kappa_{VW}:\Omega_{VW}\times\Omega_{VW}\to\mathbb{R},\quad\kappa_W:\Omega_W\times\Omega_W\to\mathbb{R}$$

be positive kernels. For example, for $(u_1,w_1),(u_2,w_2)\in\Omega_{UW}\times\Omega_{UW}$, $\kappa_{UW}$ returns a real number denoted by $\kappa_{UW}[(u_1,w_1),(u_2,w_2)]$. Let $\mathcal{H}_{UW}$, $\mathcal{H}_{VW}$, and $\mathcal{H}_W$ be the centered reproducing kernel Hilbert spaces generated by the kernels $\kappa_{UW}$, $\kappa_{VW}$, and $\kappa_W$. Define the covariance operators

$$\begin{aligned}&\,\Sigma_{(UW)(VW)}:\mathcal{H}_{VW}\to\mathcal{H}_{UW},\quad\Sigma_{(UW)W}:\mathcal{H}_W\to\mathcal{H}_{UW},\\ &\,\Sigma_{(VW)W}:\mathcal{H}_W\to\mathcal{H}_{VW},\quad\Sigma_{WW}:\mathcal{H}_W\to\mathcal{H}_W\end{aligned} \qquad (6)$$

as before. The following definition is due to Fukumizu et al. (2008). Since this operator plays a special role in this paper, we give it a name, the "conjoined conditional covariance operator," which figuratively depicts its form.

Definition 3

Suppose

  1. if $S$ is $W$, $(U,W)$, or $(V,W)$, then $E[\kappa_S(S,S)]<\infty$;

  2. $\mathrm{ran}(\Sigma_{W(VW)})\subseteq\mathrm{ran}(\Sigma_{WW})$ and $\mathrm{ran}(\Sigma_{W(UW)})\subseteq\mathrm{ran}(\Sigma_{WW})$.

Then the operator $\Sigma_{\ddot{U}\ddot{V}|W}=\Sigma_{(UW)(VW)}-\Sigma_{(UW)W}\Sigma_{WW}^{-1}\Sigma_{W(VW)}$ is called the conjoined conditional covariance operator between $U$ and $V$ given $W$.

The word "conjoined" describes the peculiar way in which $W$ appears in $\Sigma_{(UW)W}$ and $\Sigma_{W(VW)}$, which differs from an ordinary conditional covariance operator, where these operators would be replaced by $\Sigma_{UW}$ and $\Sigma_{WV}$. The following proposition is due to Fukumizu et al. (2008); a proof of a special case is given in Fukumizu et al. (2004).

Proposition 4

Suppose

  1. $\mathcal{H}_{UW}\otimes\mathcal{H}_{VW}$ is probability determining;

  2. for each $f\in\mathcal{H}_{UW}$, the function $E[f(U,W)\mid W=\cdot]$ belongs to $\mathcal{H}_W$;

  3. for each $g\in\mathcal{H}_{VW}$, the function $E[g(V,W)\mid W=\cdot]$ belongs to $\mathcal{H}_W$.

Then $\Sigma_{\ddot{U}\ddot{V}|W}=0$ if and only if $U \perp\!\!\!\perp V \mid W$.

The notion of probability determining in the context of reproducing kernel Hilbert spaces was defined in Fukumizu et al. (2004). For a generic random vector $X$, a reproducing kernel Hilbert space $\mathcal{H}_X$ based on a kernel $\kappa_X$ is probability determining if and only if the mapping $P\mapsto E_P[\kappa_X(\cdot,X)]$ is injective. Intuitively, this requires the family of expectations $\{E_P f(X): f\in\mathcal{H}_X\}$ to be rich enough to identify $P$. For example, the Gaussian radial basis function kernel is probability determining, but a polynomial kernel is not. We apply the above proposition to $X^i$, $X^j$, $U^{ij}$ for each $(i,j)\in\Gamma\times\Gamma$, $i>j$. Let

$$\kappa_{XU}^{i,ij}:(\Omega_{X^i}\times\Omega_{U^{ij}})\times(\Omega_{X^i}\times\Omega_{U^{ij}})\to\mathbb{R}$$

be a positive definite kernel, and let $\mathcal{H}_{XU}^{i,ij}$ be the centered reproducing kernel Hilbert space generated by $\kappa_{XU}^{i,ij}$. Similarly, let $\kappa_U^{ij}:\Omega_{U^{ij}}\times\Omega_{U^{ij}}\to\mathbb{R}$ be a positive kernel, and $\mathcal{H}_U^{ij}$ the centered reproducing kernel Hilbert space generated by $\kappa_U^{ij}$.

Assumption 6

Conditions (1) and (2) of Definition 3 and conditions (1), (2), and (3) of Proposition 4 are satisfied with $U$, $V$, and $W$ therein replaced by $X^i$, $X^j$, and $U^{ij}$, respectively, for each $(i,j)\in\Gamma\times\Gamma$ with $i>j$.

Under this assumption, the conjoined conditional covariance operator $\Sigma_{\ddot{X}^i\ddot{X}^j|U^{ij}}$ is well defined and has the following property.

Corollary 5

Under Assumption 6, we have $(i,j)\notin\mathcal{E}\Leftrightarrow\Sigma_{\ddot{X}^i\ddot{X}^j|U^{ij}}=0$.

This corollary motivates us to estimate the graph by thresholding the norm of the estimated conjoined conditional covariance operator.

4 Estimation: sample-level implementation

4.1 Implementation of step 1

Let $X_1,\dots,X_n$ be an i.i.d. sample of $X$. At the sample level, the centered reproducing kernel Hilbert space $\mathcal{H}_X^{-(i,j)}$ is spanned by the functions

$$\{\kappa_X^{-(i,j)}(\cdot,X_a^{-(i,j)})-E_n[\kappa_X^{-(i,j)}(\cdot,X^{-(i,j)})]:a=1,\dots,n\}, \qquad (7)$$

where $\kappa_X^{-(i,j)}(\cdot,X^{-(i,j)})$ stands for the function $u\mapsto\kappa_X^{-(i,j)}(u,X^{-(i,j)})$, and $E_n[\kappa_X^{-(i,j)}(\cdot,X^{-(i,j)})]$ for the function $u\mapsto E_n[\kappa_X^{-(i,j)}(u,X^{-(i,j)})]$.

We estimate the covariance operators $\Sigma_{X^{-(i,j)}X^{(i,j)}}$ and $\Sigma_{X^{-(i,j)}X^{-(i,j)}}$ by

$$\begin{aligned}\hat\Sigma_{X^{-(i,j)}X^{(i,j)}}=&\,E_n\{[\kappa_X^{-(i,j)}(\cdot,X^{-(i,j)})-E_n\kappa_X^{-(i,j)}(\cdot,X^{-(i,j)})]\\ &\,\otimes[\kappa_X^{(i,j)}(\cdot,X^{(i,j)})-E_n\kappa_X^{(i,j)}(\cdot,X^{(i,j)})]\},\\ \hat\Sigma_{X^{-(i,j)}X^{-(i,j)}}=&\,E_n\{[\kappa_X^{-(i,j)}(\cdot,X^{-(i,j)})-E_n\kappa_X^{-(i,j)}(\cdot,X^{-(i,j)})]\\ &\,\otimes[\kappa_X^{-(i,j)}(\cdot,X^{-(i,j)})-E_n\kappa_X^{-(i,j)}(\cdot,X^{-(i,j)})]\},\end{aligned}$$

respectively. We estimate $\Sigma_{X^{(i,j)}X^{(i,j)}}^{-1}$ by the Tychonoff-regularized inverse $(\hat\Sigma_{X^{(i,j)}X^{(i,j)}}+\epsilon_X^{(i,j)}I)^{-1}$, where $I:\mathcal{H}_X^{(i,j)}\to\mathcal{H}_X^{(i,j)}$ is the identity operator. The regularized inverse is used to avoid overfitting; it plays the same role as ridge regression (Hoerl and Kennard, 1970), which alleviates overfitting by adding a multiple of the identity matrix to the sample covariance matrix before inverting it.
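As a side illustration, the following toy numpy snippet shows the regularized inverse on a rank-deficient Gram-type matrix; the matrix and the constant $\epsilon$ are arbitrary examples, not quantities from the paper.

```python
# A toy illustration of the Tychonoff-regularized inverse: as in ridge regression,
# we add a multiple of the identity before inverting, so the inverse always exists.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 5))
G = A @ A.T                                      # 50 x 50 Gram-type matrix of rank 5
eps = 1e-3
G_reg_inv = np.linalg.inv(G + eps * np.eye(50))  # well defined even though G is singular
```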

At the sample level, the generalized eigenvalue problem (5) takes the following form: at the $k$th iteration,

$$\begin{aligned}\text{maximize}\quad&\,\langle f,\hat\Sigma_{X^{-(i,j)}X^{(i,j)}}(\hat\Sigma_{X^{(i,j)}X^{(i,j)}}+\epsilon_X^{(i,j)}I)^{-1}\hat\Sigma_{X^{(i,j)}X^{-(i,j)}}f\rangle_{-(i,j)}\\ \text{subject to}\quad&\,\langle f,\hat\Sigma_{X^{-(i,j)}X^{-(i,j)}}f\rangle_{-(i,j)}=1,\quad\langle f,\hat\Sigma_{X^{-(i,j)}X^{-(i,j)}}f_\ell\rangle_{-(i,j)}=0,\quad\ell=1,\dots,k-1,\end{aligned} \qquad (8)$$

where $f_1,\dots,f_{k-1}$ are the maximizers from the previous steps. The first $d_{ij}$ eigenfunctions form an estimate of a basis of the central class $\mathfrak{S}_{X^{(i,j)}|X^{-(i,j)}}$.

Let $K_{X^{-(i,j)}}$ be the $n\times n$ matrix whose $(a,b)$th entry is $\kappa_X^{-(i,j)}(X_a^{-(i,j)},X_b^{-(i,j)})$, let $Q=I_n-1_n1_n^{\sf T}/n$, and let $G_{X^{-(i,j)}}=QK_{X^{-(i,j)}}Q$. Let $a^1,\dots,a^{d_{ij}}$ be the first $d_{ij}$ eigenvectors of the matrix

$$(G_{X^{-(i,j)}}+\epsilon_X^{-(i,j)}I_n)^{-1}G_{X^{-(i,j)}}G_{X^{(i,j)}}(G_{X^{(i,j)}}+\epsilon_X^{(i,j)}I_n)^{-1}G_{X^{-(i,j)}}(G_{X^{-(i,j)}}+\epsilon_X^{-(i,j)}I_n)^{-1}.$$

Let $b^r=(G_{X^{-(i,j)}}+\epsilon_X^{-(i,j)}I_n)^{-1}a^r$ for $r=1,\dots,d_{ij}$. As shown in Section S12.2, the eigenfunctions $f_1^{ij},\dots,f_{d_{ij}}^{ij}$ are calculated by

$$f_r^{ij}=\sum_{a=1}^n b_a^r\{\kappa_X^{-(i,j)}(\cdot,X_a^{-(i,j)})-E_n[\kappa_X^{-(i,j)}(\cdot,X^{-(i,j)})]\}.$$

The statistics $\hat U_a^{ij}=(f_1^{ij}(X_a^{-(i,j)}),\dots,f_{d_{ij}}^{ij}(X_a^{-(i,j)}))$, $a=1,\dots,n$, will be used as the input for the second step.
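A rough numpy sketch of this computation is given below, assuming the centered Gram matrices `G_minus` ($=G_{X^{-(i,j)}}$) and `G_plus` ($=G_{X^{(i,j)}}$), the ridge constants `eps_minus`, `eps_plus`, and the dimension `d_ij` are already available. The variable and function names are ours, and the returned matrix equals the sample values of $\hat U^{ij}$ only up to a constant shift in each coordinate, which is immaterial for the radial basis kernels used in step 2.

```python
# A sketch of step 1 for one pair (i, j): eigen-decompose the matrix displayed
# above and map its leading eigenvectors to the sample values of U^{ij}.
import numpy as np

def step1_sufficient_predictors(G_minus, G_plus, eps_minus, eps_plus, d_ij):
    n = G_minus.shape[0]
    R_minus = np.linalg.inv(G_minus + eps_minus * np.eye(n))   # (G_- + eps I_n)^{-1}
    R_plus = np.linalg.inv(G_plus + eps_plus * np.eye(n))      # (G_+ + eps I_n)^{-1}
    M = R_minus @ G_minus @ G_plus @ R_plus @ G_minus @ R_minus
    eigval, eigvec = np.linalg.eigh((M + M.T) / 2)             # M is symmetric; symmetrize for stability
    A = eigvec[:, np.argsort(eigval)[::-1][:d_ij]]             # a^1, ..., a^{d_ij}
    B = R_minus @ A                                            # b^r = (G_- + eps I_n)^{-1} a^r
    return G_minus @ B    # n x d_ij matrix; row a ~ U_a^{ij} up to a constant shift
```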

4.2 Implementation of step 2

This step consists of estimating the conjoined conditional covariance operators for each $(i,j)$ and thresholding their norms. At the sample level, the centered reproducing kernel Hilbert spaces generated by the kernels $\kappa_{XU}^{i,ij}$, $\kappa_{XU}^{j,ij}$, and $\kappa_U^{ij}$ are

$$\begin{aligned}\mathcal{H}_{XU}^{i,ij}=&\,\mathrm{span}\{\kappa_{XU}^{i,ij}(\cdot,(X_a^i,U_a^{ij}))-E_n[\kappa_{XU}^{i,ij}(\cdot,(X^i,U^{ij}))]:a=1,\dots,n\},\\ \mathcal{H}_{XU}^{j,ij}=&\,\mathrm{span}\{\kappa_{XU}^{j,ij}(\cdot,(X_a^j,U_a^{ij}))-E_n[\kappa_{XU}^{j,ij}(\cdot,(X^j,U^{ij}))]:a=1,\dots,n\},\\ \mathcal{H}_U^{ij}=&\,\mathrm{span}\{\kappa_U^{ij}(\cdot,U_a^{ij})-E_n[\kappa_U^{ij}(\cdot,U^{ij})]:a=1,\dots,n\},\end{aligned}$$

where, for example, $\kappa_{XU}^{i,ij}(\cdot,(X_a^i,U_a^{ij}))$ denotes the function

$$\Omega_{X^i}\times\Omega_{U^{ij}}\to\mathbb{R},\quad(x^i,u^{ij})\mapsto\kappa_{XU}^{i,ij}((x^i,u^{ij}),(X_a^i,U_a^{ij})),$$

and $E_n[\kappa_{XU}^{i,ij}(\cdot,(X^i,U^{ij}))]$ denotes the function

$$\Omega_{X^i}\times\Omega_{U^{ij}}\to\mathbb{R},\quad(x^i,u^{ij})\mapsto E_n[\kappa_{XU}^{i,ij}((x^i,u^{ij}),(X^i,U^{ij}))].$$

We estimate the covariance operators $\Sigma_{(X^iU^{ij})(X^jU^{ij})}$, $\Sigma_{(X^iU^{ij})U^{ij}}$, $\Sigma_{U^{ij}(X^jU^{ij})}$, and $\Sigma_{U^{ij}U^{ij}}$ by

$$\begin{aligned}\hat\Sigma_{(X^iU^{ij})(X^jU^{ij})}=&\,E_n\{[\kappa_{XU}^{i,ij}(\cdot,(X^i,U^{ij}))-E_n\kappa_{XU}^{i,ij}(\cdot,(X^i,U^{ij}))]\\ &\,\quad\otimes[\kappa_{XU}^{j,ij}(\cdot,(X^j,U^{ij}))-E_n\kappa_{XU}^{j,ij}(\cdot,(X^j,U^{ij}))]\},\\ \hat\Sigma_{(X^iU^{ij})U^{ij}}=&\,E_n\{[\kappa_{XU}^{i,ij}(\cdot,(X^i,U^{ij}))-E_n\kappa_{XU}^{i,ij}(\cdot,(X^i,U^{ij}))]\\ &\,\quad\otimes[\kappa_U^{ij}(\cdot,U^{ij})-E_n\kappa_U^{ij}(\cdot,U^{ij})]\},\\ \hat\Sigma_{U^{ij}(X^jU^{ij})}=&\,E_n\{[\kappa_U^{ij}(\cdot,U^{ij})-E_n\kappa_U^{ij}(\cdot,U^{ij})]\\ &\,\quad\otimes[\kappa_{XU}^{j,ij}(\cdot,(X^j,U^{ij}))-E_n\kappa_{XU}^{j,ij}(\cdot,(X^j,U^{ij}))]\},\\ \hat\Sigma_{U^{ij}U^{ij}}=&\,E_n\{[\kappa_U^{ij}(\cdot,U^{ij})-E_n\kappa_U^{ij}(\cdot,U^{ij})]\\ &\,\quad\otimes[\kappa_U^{ij}(\cdot,U^{ij})-E_n\kappa_U^{ij}(\cdot,U^{ij})]\},\end{aligned} \qquad (9)$$

respectively. We then estimate the conjoined conditional covariance operator by

$$\hat\Sigma_{\ddot{X}^i\ddot{X}^j|U^{ij}}=\hat\Sigma_{(X^iU^{ij})(X^jU^{ij})}-\hat\Sigma_{(X^iU^{ij})U^{ij}}(\hat\Sigma_{U^{ij}U^{ij}}+\epsilon_U^{(i,j)}I)^{-1}\hat\Sigma_{U^{ij}(X^jU^{ij})},$$

where, again, we have used Tychonoff regularization to estimate the inverted covariance operator $\Sigma_{U^{ij}U^{ij}}$. Let $K_{U^{ij}}$, $K_{X^iU^{ij}}$, and $K_{X^jU^{ij}}$ be the Gram matrices

$$\begin{aligned}K_{U^{ij}}=&\,\{\kappa_U^{ij}(U_a^{ij},U_b^{ij})\}_{a,b=1}^n,\\ K_{X^iU^{ij}}=&\,\{\kappa_{XU}^{i,ij}((X_a^i,U_a^{ij}),(X_b^i,U_b^{ij}))\}_{a,b=1}^n,\\ K_{X^jU^{ij}}=&\,\{\kappa_{XU}^{j,ij}((X_a^j,U_a^{ij}),(X_b^j,U_b^{ij}))\}_{a,b=1}^n,\end{aligned}$$

and $G_{X^iU^{ij}}$, $G_{X^jU^{ij}}$, and $G_{U^{ij}}$ their centered versions

$$G_{X^iU^{ij}}=QK_{X^iU^{ij}}Q,\quad G_{X^jU^{ij}}=QK_{X^jU^{ij}}Q,\quad G_{U^{ij}}=QK_{U^{ij}}Q.$$

As shown in Section S12 in the Supplementary Material,

$$\|\hat\Sigma_{\ddot{X}^i\ddot{X}^j|U^{ij}}\|_{\mathrm{HS}}=\bigl\|G_{X^iU^{ij}}^{1/2}G_{X^jU^{ij}}^{1/2}-G_{X^iU^{ij}}^{1/2}G_{U^{ij}}(G_{U^{ij}}+\epsilon_U^{(i,j)}Q)^{\dagger}G_{X^jU^{ij}}^{1/2}\bigr\|_{\mathrm{F}},$$

where $\|\cdot\|_{\mathrm{F}}$ is the Frobenius norm. Estimation of the edge set is then based on thresholding this norm; that is,

$$\hat{\mathcal{E}}=\{(i,j)\in\Gamma\times\Gamma:\ i>j,\ \|\hat\Sigma_{\ddot{X}^i\ddot{X}^j|U^{ij}}\|_{\mathrm{HS}}>\rho_n\}$$

for some chosen $\rho_n>0$.
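A minimal numpy sketch of this thresholding rule follows, assuming the centered Gram matrices `G_iu` ($=G_{X^iU^{ij}}$), `G_ju` ($=G_{X^jU^{ij}}$), `G_u` ($=G_{U^{ij}}$) and the ridge constant `eps_u` are available; the helper names are ours.

```python
# A sketch of the Hilbert-Schmidt norm of the estimated conjoined conditional
# covariance operator, computed from centered Gram matrices as displayed above.
import numpy as np

def psd_sqrt(G):
    """Symmetric square root of a positive semidefinite matrix."""
    w, v = np.linalg.eigh((G + G.T) / 2)
    return (v * np.sqrt(np.clip(w, 0.0, None))) @ v.T

def ccco_hs_norm(G_iu, G_ju, G_u, eps_u):
    n = G_u.shape[0]
    Q = np.eye(n) - np.ones((n, n)) / n
    S_i, S_j = psd_sqrt(G_iu), psd_sqrt(G_ju)
    correction = G_u @ np.linalg.pinv(G_u + eps_u * Q) @ S_j
    return np.linalg.norm(S_i @ S_j - S_i @ correction, ord='fro')

# the edge (i, j) is placed in the estimated edge set when this norm exceeds rho_n
```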

4.3 Tuning

We have three types of tuning constants: those for the kernels, those for the Tychonoff regularization, and the threshold $\rho_n$. For the Tychonoff regularization, we have $\epsilon_X^{(i,j)}$ and $\epsilon_X^{-(i,j)}$ for step 1, and $\epsilon_U^{(i,j)}$ for step 2. In this paper we use the Gaussian radial basis function as the kernel:

$$\kappa(u,v)=\exp(-\gamma\|u-v\|^2). \qquad (10)$$

For each $(i,j)$, we have five $\gamma$'s to determine: $\gamma_X^{(i,j)}$ for the kernel $\kappa_X^{(i,j)}$, $\gamma_X^{-(i,j)}$ for $\kappa_X^{-(i,j)}$, $\gamma_{XU}^{i,ij}$ for $\kappa_{XU}^{i,ij}$, $\gamma_{XU}^{j,ij}$ for $\kappa_{XU}^{j,ij}$, and $\gamma_U^{ij}$ for $\kappa_U^{ij}$. They are chosen by the following formula (see, for example, Li (2018b)):

$$1/\sqrt{\gamma}=\binom{n}{2}^{-1}\sum_{a<b}\|s_a-s_b\|, \qquad (11)$$

where $s_1,\dots,s_n$ is the sample of random vectors corresponding to the kernel in question. For example, for the kernel $\kappa_{XU}^{j,ij}$, $s_a=(X_a^j,U_a^{ij})$.
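A small numpy sketch of the bandwidth rule (11) and the resulting centered Gram matrix is given below; here `s` is an $n\times q$ array whose rows play the role of $s_1,\dots,s_n$ (for instance, $s_a=(X_a^j,U_a^{ij})$ for the kernel $\kappa_{XU}^{j,ij}$), and the function name is ours.

```python
# Bandwidth rule (11) and the Gaussian radial basis kernel (10), returning the
# centered Gram matrix G = Q K Q used throughout Section 4.
import numpy as np
from scipy.spatial.distance import pdist, squareform

def rbf_gram(s):
    dists = pdist(s)                              # all pairwise distances ||s_a - s_b||, a < b
    gamma = 1.0 / np.mean(dists) ** 2             # (11): 1/sqrt(gamma) = average pairwise distance
    K = np.exp(-gamma * squareform(dists) ** 2)   # kernel (10)
    n = s.shape[0]
    Q = np.eye(n) - np.ones((n, n)) / n
    return Q @ K @ Q
```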

For the tuning parameters in the Tychonoff regularization, we use the following generalized cross validation criterion (GCV; see Golub et al., 1979), which is minimized over $\epsilon$:

$$\mathrm{GCV}(\epsilon)=\sum_{i<j}\frac{\|G_1-G_2^{\sf T}[G_2+\epsilon\,\lambda_{\max}(G_2)I_n]^{-1}G_1\|_{\mathrm{F}}}{\tfrac{1}{n}\,\mathrm{tr}\{I_n-G_2^{\sf T}[G_2+\epsilon\,\lambda_{\max}(G_2)I_n]^{-1}\}}, \qquad (12)$$

where $G_1,G_2\in\mathbb{R}^{n\times n}$ are positive semidefinite matrices, and $\lambda_{\max}(G_2)$ is the largest eigenvalue of $G_2$. The matrices $G_1$ and $G_2$ are the following for the three tuning parameters:

  1. G1=GX(i,j)G_{\scriptscriptstyle 1}=G_{\scriptscriptstyle X^{\scriptscriptstyle-(i,j)}}, G2=GX(i,j)G_{\scriptscriptstyle 2}=G_{\scriptscriptstyle X^{\scriptscriptstyle(i,j)}} for ϵX(i,j)\epsilon_{\scriptscriptstyle X}^{\scriptscriptstyle(i,j)};

  2. G1=GX(i,j)G_{\scriptscriptstyle 1}=G_{\scriptscriptstyle X^{\scriptscriptstyle(i,j)}}, G2=GX(i,j)G_{\scriptscriptstyle 2}=G_{\scriptscriptstyle X^{\scriptscriptstyle-(i,j)}} for ϵX(i,j)\epsilon_{\scriptscriptstyle X}^{\scriptscriptstyle-(i,j)};

  3. G1=GX(i,j)G_{\scriptscriptstyle 1}=G_{\scriptscriptstyle X^{\scriptscriptstyle(i,j)}}, G2=GUijG_{\scriptscriptstyle 2}=G_{\scriptscriptstyle U^{\scriptscriptstyle ij}} for ϵU(i,j)\epsilon_{\scriptscriptstyle U}^{\scriptscriptstyle(i,j)}.

We minimize (12) over a grid to choose ϵ\epsilon, as detailed in Section 6.
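To make the tuning steps concrete, the following is a minimal Python sketch of the bandwidth rule (11), the Gaussian radial basis function kernel (10), its centered Gram matrix, and a grid search over the criterion (12); the function names and the use of scipy are our own choices, and the ridge term ϵλmax(G2)In is written out with the identity matrix made explicit, as in (12) and (13):

    import numpy as np
    from scipy.spatial.distance import pdist, squareform

    def rbf_gamma(S):
        """Formula (11): 1/sqrt(gamma) is the average pairwise Euclidean
        distance among the rows of S (the sample that feeds the kernel)."""
        return 1.0 / pdist(S).mean() ** 2

    def centered_gram(S, gamma):
        """Centered Gram matrix G = Q K Q with K from (10) and Q = I - (1/n) 1 1'."""
        K = np.exp(-gamma * squareform(pdist(S)) ** 2)
        n = K.shape[0]
        Q = np.eye(n) - np.ones((n, n)) / n
        return Q @ K @ Q

    def gcv_criterion(pairs, eps):
        """Criterion (12), summed over the (i, j) pairs; each element of
        `pairs` is a tuple (G1, G2) of n-by-n centered Gram matrices."""
        total = 0.0
        for G1, G2 in pairs:
            n = G2.shape[0]
            lam = np.linalg.eigvalsh(G2)[-1]                       # lambda_max(G2)
            A = G2.T @ np.linalg.inv(G2 + eps * lam * np.eye(n))   # smoother-type matrix
            total += np.linalg.norm(G1 - A @ G1, "fro") / (np.trace(np.eye(n) - A) / n)
        return total

    def choose_epsilon(pairs, grid=(10.0, 1.0, 1e-1, 1e-2, 1e-3, 1e-4)):
        """Minimize (12) over the grid {10^{-l} : l = -1, 0, ..., 4} used in Section 6."""
        return min(grid, key=lambda eps: gcv_criterion(pairs, eps))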

We also use generalized cross validation to determine the thresholding parameter ρn\rho_{\scriptscriptstyle n}. Let ^(ρ)\hat{{\cal E}}(\rho) be the estimated edge set using a threshold ρ\rho, and, for each iΓi\in\Gamma, let Ci(ρ)={Xj:jΓ,(i,j)^(ρ)}C^{\scriptscriptstyle i}(\rho)=\{X^{\scriptscriptstyle j}:\,j\in\Gamma,\,(i,j)\in\hat{{\cal E}}(\rho)\} be the subset of components of XX in the neighborhood of node ii in the graph (Γ,^(ρ))(\Gamma,\hat{{\cal E}}(\rho)). The basic idea is to apply generalized cross validation to the regression of the feature of XiX^{\scriptscriptstyle i} on the feature of Ci(ρ)C^{\scriptscriptstyle i}(\rho). The generalized cross validation criterion for this regression takes the form

GCV(ρ)=i=1pGXiGCi(ρ)𝖳[GCi(ρ)+ϵλmax(GCi(ρ))In]1GXiF1ntr{InGCi(ρ)𝖳[GCi(ρ)+ϵλmax(GCi(ρ))In]1},\displaystyle\text{GCV}(\rho)=\sum_{\scriptscriptstyle i=1}^{\scriptscriptstyle p}\frac{\|G_{\scriptscriptstyle X^{\scriptscriptstyle i}}-G_{\scriptscriptstyle C^{\scriptscriptstyle i}(\rho)}^{\scriptscriptstyle\sf T}[G_{\scriptscriptstyle C^{\scriptscriptstyle i}(\rho)}+\epsilon\hskip 1.42262pt\lambda_{\scriptscriptstyle\max}(G_{\scriptscriptstyle C^{\scriptscriptstyle i}(\rho)})I_{\scriptscriptstyle n}]^{\scriptscriptstyle-1}G_{\scriptscriptstyle X^{\scriptscriptstyle i}}\|_{\scriptscriptstyle\mathrm{F}}}{\frac{1}{n}\text{tr}\{I_{\scriptscriptstyle n}-G_{\scriptscriptstyle C^{\scriptscriptstyle i}(\rho)}^{\scriptscriptstyle\sf T}[G_{\scriptscriptstyle C^{\scriptscriptstyle i}(\rho)}+\epsilon\hskip 1.42262pt\lambda_{\scriptscriptstyle\max}(G_{\scriptscriptstyle C^{\scriptscriptstyle i}(\rho)})I_{\scriptscriptstyle n}]^{\scriptscriptstyle-1}\}}, (13)

where GCi(ρ)=QKCi(ρ)QG_{\scriptscriptstyle C^{\scriptscriptstyle i}(\rho)}=QK_{\scriptscriptstyle C^{\scriptscriptstyle i}(\rho)}Q, and KCi(ρ)K_{\scriptscriptstyle C^{\scriptscriptstyle i}(\rho)} is the n×nn\times n kernel matrix for the sample of Ci(ρ)C^{\scriptscriptstyle i}(\rho). We minimize GCV(ρ)\text{GCV}(\rho) over the grid ρ{×102:=2,,7}\rho\in\{\ell\times 10^{\scriptscriptstyle-2}:\ell=2,\dots,7\} to determine the optimal threshold ρn\rho_{\scriptscriptstyle n}.

Regarding the choice of the dimension of UijU^{\scriptscriptstyle ij}, to our knowledge there is no systematic procedure available for determining the dimension of the central class in nonlinear sufficient dimension reduction. Some recently developed order-determination methods for linear sufficient dimension reduction, such as the ladle estimate and the predictor augmentation estimator (Luo and Li, 2016, 2020), may be generalizable to the nonlinear sufficient dimension reduction setting; we leave this topic to future research. Our experience and intuition indicate that a small dimension, such as 1 or 2, for the central class suffices in most cases. For example, in the classical nonparametric regression problem Y=f(X)+ϵY=f(X)+\epsilon with X   ϵX\;\,\rule[0.0pt]{0.29999pt}{6.69998pt}\rule[-0.20004pt]{6.99997pt}{0.29999pt}\rule[0.0pt]{0.29999pt}{6.69998pt}\;\,\epsilon, the dimension of the central class is by definition equal to 1.

5 Asymptotic theory

In this section we develop the consistency and convergence rates of our estimate and related operators. The challenge of this analysis is that our procedure involves two steps: we first extract the sufficient predictor using one set of kernels, and then substitute it into another set of kernels to get the final result. Thus we need to understand how the error propagates from the first step to the second. We also develop the asymptotic theory allowing pp to go to infinity with nn, which is presented in the Supplementary Material.

5.1 Overview

Our goal is to derive the convergence rate of

|Σ^X¨iX¨j|U^ijHSΣX¨iX¨j|UijHS|,\displaystyle\left|\|\hat{\Sigma}_{\scriptscriptstyle\ddot{X}^{\scriptscriptstyle i}\ddot{X}^{\scriptscriptstyle j}|\hat{U}^{\scriptscriptstyle ij}}\|_{\scriptscriptstyle\mathrm{HS}}-\|\Sigma_{\scriptscriptstyle\ddot{X}^{\scriptscriptstyle i}\ddot{X}^{\scriptscriptstyle j}|U^{\scriptscriptstyle ij}}\|_{\scriptscriptstyle\mathrm{HS}}\right|,

since Σ^X¨iX¨j|U^ijHS\|\hat{\Sigma}_{\scriptscriptstyle\ddot{X}^{\scriptscriptstyle i}\ddot{X}^{\scriptscriptstyle j}|\hat{U}^{\scriptscriptstyle ij}}\|_{\scriptscriptstyle\mathrm{HS}} is the quantity we threshold to determine the edge set. By the triangle inequality,

|Σ^X¨iX¨j|U^ijHSΣX¨iX¨j|UijHS|Σ^X¨iX¨j|U^ijΣX¨iX¨j|UijHS\displaystyle\,\left|\|\hat{\Sigma}_{\scriptscriptstyle\ddot{X}^{\scriptscriptstyle i}\ddot{X}^{\scriptscriptstyle j}|\hat{U}^{\scriptscriptstyle ij}}\|_{\scriptscriptstyle\mathrm{HS}}-\|\Sigma_{\scriptscriptstyle\ddot{X}^{\scriptscriptstyle i}\ddot{X}^{\scriptscriptstyle j}|U^{\scriptscriptstyle ij}}\|_{\scriptscriptstyle\mathrm{HS}}\right|\leq\|\hat{\Sigma}_{\scriptscriptstyle\ddot{X}^{\scriptscriptstyle i}\ddot{X}^{\scriptscriptstyle j}|\hat{U}^{\scriptscriptstyle ij}}-\Sigma_{\scriptscriptstyle\ddot{X}^{\scriptscriptstyle i}\ddot{X}^{\scriptscriptstyle j}|U^{\scriptscriptstyle ij}}\|_{\scriptscriptstyle\mathrm{HS}}
Σ^X¨iX¨j|U^ijΣ^X¨iX¨j|UijHS+Σ^X¨iX¨j|UijΣX¨iX¨j|UijHS.\displaystyle\,\hskip 72.26999pt\leq\|\hat{\Sigma}_{\scriptscriptstyle\ddot{X}^{\scriptscriptstyle i}\ddot{X}^{\scriptscriptstyle j}|\hat{U}^{\scriptscriptstyle ij}}-\hat{\Sigma}_{\scriptscriptstyle\ddot{X}^{\scriptscriptstyle i}\ddot{X}^{\scriptscriptstyle j}|U^{\scriptscriptstyle ij}}\|_{\scriptscriptstyle\mathrm{HS}}+\|\hat{\Sigma}_{\scriptscriptstyle\ddot{X}^{\scriptscriptstyle i}\ddot{X}^{\scriptscriptstyle j}|U^{\scriptscriptstyle ij}}-\Sigma_{\scriptscriptstyle\ddot{X}^{\scriptscriptstyle i}\ddot{X}^{\scriptscriptstyle j}|U^{\scriptscriptstyle ij}}\|_{\scriptscriptstyle\mathrm{HS}}.

So we need to derive the convergence rates of the following quantities:

(i)U^ijUij[H(i,j)(X)]dij,(ii)Σ^X¨iX¨j|U^ijΣ^X¨iX¨j|UijHS,(iii)Σ^X¨iX¨j|UijΣX¨iX¨j|UijHS,\displaystyle\begin{split}&\,\mbox{(i)}\quad\|\hat{U}^{\scriptscriptstyle ij}-U^{\scriptscriptstyle ij}\|_{\scriptscriptstyle[\mbox{\rsfstena H}^{\scriptscriptstyle-(i,j)}(X)]^{\scriptscriptstyle d_{\scriptscriptstyle ij}}},\\ &\,\mbox{(ii)}\quad\|\hat{\Sigma}_{\scriptscriptstyle\ddot{X}^{\scriptscriptstyle i}\ddot{X}^{\scriptscriptstyle j}|\hat{U}^{\scriptscriptstyle ij}}-\hat{\Sigma}_{\scriptscriptstyle\ddot{X}^{\scriptscriptstyle i}\ddot{X}^{\scriptscriptstyle j}|U^{\scriptscriptstyle ij}}\|_{\scriptscriptstyle\mathrm{HS}},\\ &\,\mbox{(iii)}\quad\|\hat{\Sigma}_{\scriptscriptstyle\ddot{X}^{\scriptscriptstyle i}\ddot{X}^{\scriptscriptstyle j}|U^{\scriptscriptstyle ij}}-\Sigma_{\scriptscriptstyle\ddot{X}^{\scriptscriptstyle i}\ddot{X}^{\scriptscriptstyle j}|U^{\scriptscriptstyle ij}}\|_{\scriptscriptstyle\mathrm{HS}},\end{split} (14)

where, to avoid overly crowded subscripts, we have used H(i,j)(X)\mbox{\rsfsten H}\,^{\scriptscriptstyle-(i,j)}(X) to denote HX(i,j)\mbox{\rsfsten H}\,^{\scriptscriptstyle-(i,j)}_{\scriptscriptstyle X} when it occurs as a subscript. The first and third convergence rates can be derived using the asymptotic tools for linear operators developed in Fukumizu et al. (2007), Li and Song (2017), Lee et al. (2016a), and Solea and Li (2020). The second convergence rate is, however, a new problem, and it will also be useful in similar settings that require constructing estimators based on predictors extracted by sufficient dimension reduction. In some sense, this is akin to the post dimension reduction problem considered in Kim et al. (2020).

In the following, if {an}\{a_{\scriptscriptstyle n}\} and {bn}\{b_{\scriptscriptstyle n}\} are sequences of positive numbers, then we write anbna_{\scriptscriptstyle n}\prec b_{\scriptscriptstyle n} if an/bn0a_{\scriptscriptstyle n}/b_{\scriptscriptstyle n}\to 0. We write anbna_{\scriptscriptstyle n}\asymp b_{\scriptscriptstyle n} if 0<lim infn(bn/an)lim supn(bn/an)<0<\liminf_{\scriptscriptstyle n}(b_{\scriptscriptstyle n}/a_{\scriptscriptstyle n})\leq\limsup_{\scriptscriptstyle n}(b_{\scriptscriptstyle n}/a_{\scriptscriptstyle n})<\infty. We write bnanb_{\scriptscriptstyle n}\preceq a_{\scriptscriptstyle n} if either bnanb_{\scriptscriptstyle n}\prec a_{\scriptscriptstyle n} or bnanb_{\scriptscriptstyle n}\asymp a_{\scriptscriptstyle n}. Because (i,j)(i,j) is fixed in the asymptotic development, and also to emphasize the dependence on nn, in the rest of this section we denote ϵX(i,j)\epsilon_{\scriptscriptstyle X}^{\scriptscriptstyle(i,j)}, ϵX(i,j)\epsilon_{\scriptscriptstyle X}^{\scriptscriptstyle-(i,j)}, and ϵU(i,j)\epsilon_{\scriptscriptstyle U}^{\scriptscriptstyle(i,j)} by ϵn\epsilon_{\scriptscriptstyle n}, ηn\eta_{\scriptscriptstyle n}, and δn\delta_{\scriptscriptstyle n}, respectively.

5.2 Transparent kernel

We first develop what we call the “transparent kernel” that passes information from step 1 to step 2 efficiently. Let Ω\Omega be a nonempty set, and κ:Ω×Ω\kappa:\Omega\times\Omega\to\mathbb{R} a positive kernel.

Definition 6

We say that κ\kappa is a transparent kernel if, for each tΩt\in\Omega, the function sκ(s,t)s\mapsto\kappa(s,t) is twice differentiable and

  1. κ(s,t)/s|s=t=0\partial\kappa(s,t)/\partial s|_{\scriptscriptstyle s=t}=0;

  2. the matrix H(s,t)=2κ(s,t)/ss𝖳H(s,t)=\partial^{\scriptscriptstyle 2}\kappa(s,t)/\partial s\partial s^{\scriptscriptstyle\sf T} has a bounded operator norm; that is, there exist <C1C2<-\infty<C_{\scriptscriptstyle 1}\leq C_{\scriptscriptstyle 2}<\infty such that

    C1λmin(H(s,t))λmax(H(s,t))<C2\displaystyle C_{\scriptscriptstyle 1}\leq\lambda_{\scriptscriptstyle\min}(H(s,t))\leq\lambda_{\scriptscriptstyle\max}(H(s,t))<C_{\scriptscriptstyle 2}

    for all (s,t)Ω×Ω(s,t)\in\Omega\times\Omega, where λmin()\lambda_{\scriptscriptstyle\min}(\cdot) and λmax()\lambda_{\scriptscriptstyle\max}(\cdot) denote the smallest and largest eigenvalues, respectively.

For example, the Gaussian radial basis function kernel is transparent, but the exponential kernel κ(u,v)=τ2exp(γuv)\kappa(u,v)=\tau^{\scriptscriptstyle 2}\exp(-\gamma\|u-v\|) is not. This condition implies a type of Lipschitz continuity in a setting that involves two reproducing kernels κ0\kappa_{\scriptscriptstyle 0} and κ1\kappa_{\scriptscriptstyle 1}, where the argument of κ1\kappa_{\scriptscriptstyle 1} is the evaluation of a member of the reproducing kernel Hilbert space generated by κ0\kappa_{\scriptscriptstyle 0}.
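As a quick numerical illustration (not part of the theory), the two conditions of Definition 6 can be checked for the Gaussian radial basis function kernel by finite differences; the helper functions below are ours and serve only as a sanity test of the definition:

    import numpy as np

    def rbf(s, t, gamma=1.0):
        return np.exp(-gamma * np.sum((s - t) ** 2))

    def num_grad(f, s, h=1e-5):
        """Central-difference gradient of s -> f(s)."""
        g = np.zeros_like(s)
        for k in range(s.size):
            e = np.zeros_like(s); e[k] = h
            g[k] = (f(s + e) - f(s - e)) / (2 * h)
        return g

    def num_hess(f, s, h=1e-4):
        """Central-difference Hessian of s -> f(s), symmetrized."""
        d = s.size
        H = np.zeros((d, d))
        for k in range(d):
            e = np.zeros(d); e[k] = h
            H[k] = (num_grad(f, s + e, h) - num_grad(f, s - e, h)) / (2 * h)
        return (H + H.T) / 2

    rng = np.random.default_rng(0)
    t = rng.standard_normal(3)
    f = lambda s: rbf(s, t)
    print(np.allclose(num_grad(f, t), 0.0, atol=1e-6))               # condition 1: gradient vanishes at s = t
    print(np.linalg.eigvalsh(num_hess(f, rng.standard_normal(3))))   # condition 2: eigenvalues stay in a fixed range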

Theorem 7

Suppose H0\mbox{\rsfsten H}\,_{\scriptscriptstyle 0} is the reproducing kernel Hilbert space generated by κ0\kappa_{\scriptscriptstyle 0}, H0d\mbox{\rsfsten H}\,_{\scriptscriptstyle 0}^{\scriptscriptstyle\,d} is the dd-fold Cartesian product of H0\mbox{\rsfsten H}\,_{\scriptscriptstyle 0} with inner product defined by

U,VH0d=u1,v1H0++ud,vdH0\displaystyle\langle U,V\rangle_{\scriptscriptstyle\mbox{\rsfstena H}_{\scriptscriptstyle 0}^{\scriptscriptstyle\ d}}=\langle u_{\scriptscriptstyle 1},v_{\scriptscriptstyle 1}\rangle_{\scriptscriptstyle\mbox{\rsfstena H}_{\scriptscriptstyle 0}}+\cdots+\langle u_{\scriptscriptstyle d},v_{\scriptscriptstyle d}\rangle_{\scriptscriptstyle\mbox{\rsfstena H}_{\scriptscriptstyle 0}}

where U=(u1,,ud)U=(u_{\scriptscriptstyle 1},\ldots,u_{\scriptscriptstyle d}) and V=(v1,,vd)V=(v_{\scriptscriptstyle 1},\ldots,v_{\scriptscriptstyle d}) are members of H0d\mbox{\rsfsten H}\,_{\scriptscriptstyle 0}^{\scriptscriptstyle\,d}, and H1\mbox{\rsfsten H}\,_{\scriptscriptstyle 1} is the reproducing kernel Hilbert space generated by κ1\kappa_{\scriptscriptstyle 1}. Then:

  (i) for any U,VH0d,aΩU,V\in\mbox{\rsfsten H}\,_{\scriptscriptstyle 0}^{\scriptscriptstyle\,d},\ a\in\Omega, we have

    U(a)V(a)d[κ0(a,a)]1/2UVH0d;\displaystyle\|U(a)-V(a)\|_{\scriptscriptstyle\mathbb{R}^{\scriptscriptstyle\,d}}\leq[\kappa_{\scriptscriptstyle 0}(a,a)]^{\scriptscriptstyle 1/2}\ \|U-V\|_{\scriptscriptstyle\mbox{\rsfstena H}_{\scriptscriptstyle 0}^{\scriptscriptstyle\ d}};

  (ii) if κ1(s,t)\kappa_{\scriptscriptstyle 1}(s,t) is a transparent kernel, then there exists a C>0C>0 such that, for each U,VH0dU,V\in\mbox{\rsfsten H}\,_{\scriptscriptstyle 0}^{\scriptscriptstyle\,d} and aΩa\in\Omega,

    κ1(,U(a))κ1(,V(a))H1C[κ0(a,a)]1/2UVH0d.\displaystyle\,\|\kappa_{\scriptscriptstyle 1}(\cdot,U(a))-\kappa_{\scriptscriptstyle 1}(\cdot,V(a))\|_{\scriptscriptstyle\mbox{\rsfstena H}_{\scriptscriptstyle 1}}\leq C\,[\kappa_{\scriptscriptstyle 0}(a,a)]^{\scriptscriptstyle 1/2}\,\|U-V\|_{\scriptscriptstyle\mbox{\rsfstena H}_{\scriptscriptstyle 0}^{\scriptscriptstyle\,\,d}}.

A direct consequence of this theorem is the following: if U^\hat{U} is an estimate of some UU, a member of H0d\mbox{\rsfsten H}\,_{\scriptscriptstyle 0}^{\scriptscriptstyle d}, with U^UH0d=OP(bn)\|\hat{U}-U\|_{\scriptscriptstyle{\mbox{\rsfstena H}_{\scriptscriptstyle 0}}^{\scriptscriptstyle d}}=O_{\scriptscriptstyle P}(b_{\scriptscriptstyle n}) for some 0<bn00<b_{\scriptscriptstyle n}\to 0, if Σ^(U^)\hat{\Sigma}(\hat{U}) is a linear operator estimated from the sample U^1,,U^n\hat{U}_{\scriptscriptstyle 1},\ldots,\hat{U}_{\scriptscriptstyle n} (and perhaps some other random vectors), and if Σ^(U)\hat{\Sigma}(U) is the same linear operator estimated from the sample U1,,UnU_{\scriptscriptstyle 1},\ldots,U_{\scriptscriptstyle n}, then

Σ^(U^)Σ^(U)HS=OP(bn).\displaystyle\|\hat{\Sigma}(\hat{U})-\hat{\Sigma}(U)\|_{\scriptscriptstyle\mathrm{HS}}=O_{\scriptscriptstyle P}(b_{\scriptscriptstyle n}). (15)

This result is somewhat surprising, because sample estimates such as Σ^(U^)\hat{\Sigma}(\hat{U}) can be viewed as En𝔾(X,U^)E_{\scriptscriptstyle n}{\mathbb{G}}(X,\hat{U}), where U^\hat{U} is an estimate of a function UU in a functional space with norm \|\cdot\| and 𝔾{\mathbb{G}} is an operator-valued function. If U^U=OP(bn)\|\hat{U}-U\|=O_{\scriptscriptstyle P}(b_{\scriptscriptstyle n}) for some bn0b_{\scriptscriptstyle n}\to 0, then it is not necessarily true that

En𝔾(X,U^)En𝔾(X,U)=OP(bn),\displaystyle\|E_{\scriptscriptstyle n}{\mathbb{G}}(X,\hat{U})-E_{\scriptscriptstyle n}{\mathbb{G}}(X,U)\|=O_{\scriptscriptstyle P}(b_{\scriptscriptstyle n}),

particularly when UU is an infinite-dimensional object. Yet relation (15) states exactly this. The reason is that the reproducing kernel property separates the function U^\hat{U} from its argument (i.e. U^(x)=U^,κ(,x)\hat{U}(x)=\langle\hat{U},\kappa(\cdot,x)\rangle), which implies a type of uniformity among U^(X1),,U^(Xn)\hat{U}(X_{\scriptscriptstyle 1}),\ldots,\hat{U}(X_{\scriptscriptstyle n}). This point will be made clear in the proof in the Supplementary Material. Statement (15) is made precise by the next theorem.

Theorem 8

Suppose conditions (1) and (2) of Definition 3 are satisfied with UU, VV, WW therein replaced by XiX^{\scriptscriptstyle i}, XjX^{\scriptscriptstyle j}, and UijU^{\scriptscriptstyle ij}. Suppose, furthermore:

  (a) κUij\kappa_{\scriptscriptstyle U}^{\scriptscriptstyle ij}, κXUi,ij\kappa_{\scriptscriptstyle XU}^{\scriptscriptstyle i,ij}, and κXUj,ij\kappa_{\scriptscriptstyle XU}^{\scriptscriptstyle j,ij} are transparent kernels;

  (b) U^ijUij[H(i,j)(X)]dij=OP(bn)\|\hat{U}^{\scriptscriptstyle ij}-U^{\scriptscriptstyle ij}\|_{\scriptscriptstyle[\mbox{\rsfstena H}^{\scriptscriptstyle\ -(i,j)}(X)]^{\scriptscriptstyle d_{\scriptscriptstyle ij}}}=O_{\scriptscriptstyle P}(b_{\scriptscriptstyle n}) for some 0<bn00<b_{\scriptscriptstyle n}\to 0.

Then

  (i) Σ^U^ijU^ijΣ^UijUijHS=OP(bn)\|\hat{\Sigma}_{\scriptscriptstyle\hat{U}^{\scriptscriptstyle ij}\hat{U}^{\scriptscriptstyle ij}}-\hat{\Sigma}_{\scriptscriptstyle U^{\scriptscriptstyle ij}U^{\scriptscriptstyle ij}}\|_{\scriptscriptstyle\mathrm{HS}}=O_{\scriptscriptstyle P}(b_{\scriptscriptstyle n});

  (ii) Σ^(XiU^ij)U^ijΣ^(XiUij)UijHS=OP(bn)\|\hat{\Sigma}_{\scriptscriptstyle(X^{\scriptscriptstyle i}\hat{U}^{\scriptscriptstyle ij})\hat{U}^{\scriptscriptstyle ij}}-\hat{\Sigma}_{\scriptscriptstyle(X^{\scriptscriptstyle i}U^{\scriptscriptstyle ij})U^{\scriptscriptstyle ij}}\|_{\scriptscriptstyle\mathrm{HS}}=O_{\scriptscriptstyle P}(b_{\scriptscriptstyle n});

  (iii) Σ^(XiU^ij)(XjU^ij)Σ^(XiUij)(XjUij)HS=OP(bn)\|\hat{\Sigma}_{\scriptscriptstyle(X^{\scriptscriptstyle i}\hat{U}^{\scriptscriptstyle ij})(X^{\scriptscriptstyle j}\hat{U}^{\scriptscriptstyle ij})}-\hat{\Sigma}_{\scriptscriptstyle(X^{\scriptscriptstyle i}U^{\scriptscriptstyle ij})(X^{\scriptscriptstyle j}U^{\scriptscriptstyle ij})}\|_{\scriptscriptstyle\mathrm{HS}}=O_{\scriptscriptstyle P}(b_{\scriptscriptstyle n}).

Using Theorem 8 we can derive the convergence rate of Σ^X¨iX¨j|U^ijΣ^X¨iX¨j|UijHS\|\hat{\Sigma}_{\scriptscriptstyle\ddot{X}^{\scriptscriptstyle i}\ddot{X}^{\scriptscriptstyle j}|\hat{U}^{\scriptscriptstyle ij}}-\hat{\Sigma}_{\scriptscriptstyle\ddot{X}^{\scriptscriptstyle i}\ddot{X}^{\scriptscriptstyle j}|U^{\scriptscriptstyle ij}}\|_{\scriptscriptstyle\mathrm{HS}}.

Theorem 9

Suppose the conditions in Theorem 8 are satisfied and, furthermore,

  (a) ΣUijUij1ΣUij(XiUij)\Sigma_{\scriptscriptstyle U^{\scriptscriptstyle ij}U^{\scriptscriptstyle ij}}^{\scriptscriptstyle\scriptscriptstyle-1}\Sigma_{\scriptscriptstyle U^{\scriptscriptstyle ij}(X^{\scriptscriptstyle i}U^{\scriptscriptstyle ij})} and ΣUijUij1ΣUij(XjUij)\Sigma_{\scriptscriptstyle U^{\scriptscriptstyle ij}U^{\scriptscriptstyle ij}}^{\scriptscriptstyle\scriptscriptstyle-1}\Sigma_{\scriptscriptstyle U^{\scriptscriptstyle ij}(X^{\scriptscriptstyle j}U^{\scriptscriptstyle ij})} are bounded linear operators;

  (b) bnδn1b_{\scriptscriptstyle n}\preceq\delta_{\scriptscriptstyle n}\prec 1.

Then Σ^X¨iX¨j|U^ijΣ^X¨iX¨j|UijHS=OP(bn).\|\hat{\Sigma}_{\scriptscriptstyle\ddot{X}^{\scriptscriptstyle i}\ddot{X}^{\scriptscriptstyle j}|\hat{U}^{\scriptscriptstyle ij}}-\hat{\Sigma}_{\scriptscriptstyle\ddot{X}^{\scriptscriptstyle i}\ddot{X}^{\scriptscriptstyle j}|U^{\scriptscriptstyle ij}}\|_{\scriptscriptstyle\mathrm{HS}}=O_{\scriptscriptstyle P}(b_{\scriptscriptstyle n}).

Note that, unlike in Theorem 8, where our assumptions imply

ΣX(i,j)X(i,j)1ΣX(i,j)X(i,j)\displaystyle\Sigma_{\scriptscriptstyle X^{\scriptscriptstyle-(i,j)}X^{\scriptscriptstyle-(i,j)}}^{\scriptscriptstyle\scriptscriptstyle-1}\Sigma_{\scriptscriptstyle X^{\scriptscriptstyle-(i,j)}X^{\scriptscriptstyle(i,j)}}

is a finite-rank operator, here we do not assume ΣUijUij1ΣUij(XjUij)\Sigma_{\scriptscriptstyle U^{\scriptscriptstyle ij}U^{\scriptscriptstyle ij}}^{\scriptscriptstyle\scriptscriptstyle-1}\Sigma_{\scriptscriptstyle U^{\scriptscriptstyle ij}(X^{\scriptscriptstyle j}U^{\scriptscriptstyle ij})} to be a finite-rank (or even Hilbert-Schmidt) operator; instead, we assume it to be a bounded operator. This is because (Xj,Uij)(X^{\scriptscriptstyle j},U^{\scriptscriptstyle ij}) contains UijU^{\scriptscriptstyle ij}, which makes it unreasonable to assume ΣUijUij1ΣUij(XjUij)\Sigma_{\scriptscriptstyle U^{\scriptscriptstyle ij}U^{\scriptscriptstyle ij}}^{\scriptscriptstyle\scriptscriptstyle-1}\Sigma_{\scriptscriptstyle U^{\scriptscriptstyle ij}(X^{\scriptscriptstyle j}U^{\scriptscriptstyle ij})} to be finite-rank or Hilbert-Schmidt. For example, when XjX^{\scriptscriptstyle j} is a constant, ΣUij(XjUij)\Sigma_{\scriptscriptstyle U^{\scriptscriptstyle ij}(X^{\scriptscriptstyle j}U^{\scriptscriptstyle ij})} is the same as ΣUijUij\Sigma_{\scriptscriptstyle U^{\scriptscriptstyle ij}U^{\scriptscriptstyle ij}}, and ΣUijUij1ΣUijUij\Sigma_{\scriptscriptstyle U^{\scriptscriptstyle ij}U^{\scriptscriptstyle ij}}^{\scriptscriptstyle\scriptscriptstyle-1}\Sigma_{\scriptscriptstyle U^{\scriptscriptstyle ij}U^{\scriptscriptstyle ij}} is not a Hilbert-Schmidt operator, though it is bounded. Theorem 9 shows that the convergence rate of (ii) in (14) is the same as that of (i) in (14); it now remains to derive the convergence rates of (i) and (iii).

5.3 Convergence rates of (i) and (iii) in (14)

We first present the convergence rate of U^ij\hat{U}^{\scriptscriptstyle ij} to UijU^{\scriptscriptstyle ij}. The proof is similar to that of Theorem 5 of Li and Song (2017) but with two differences. First, Li and Song (2017) took AA in (5) to be II, whereas we take it to be ΣYY\Sigma_{\scriptscriptstyle YY}. In particular, the generalized sliced inverse regression in Li and Song (2017) only has one tuning parameter ηn\eta_{\scriptscriptstyle n}, but we have two tuning parameters ηn\eta_{\scriptscriptstyle n} and ϵn\epsilon_{\scriptscriptstyle n}. Second, Li and Song (2017) defined (in the current notation) frijf_{\scriptscriptstyle r}^{\scriptscriptstyle ij} to be the eigenfunctions of

ΣX(i,j)X(i,j)1ΣX(i,j)X(i,j)ΣX(i,j)X(i,j)1ΣX(i,j)X(i,j)ΣX(i,j)X(i,j)1,\displaystyle\Sigma_{\scriptscriptstyle X^{\scriptscriptstyle-(i,j)}X^{\scriptscriptstyle-(i,j)}}^{\scriptscriptstyle\scriptscriptstyle-1}\Sigma_{\scriptscriptstyle X^{\scriptscriptstyle-(i,j)}X^{\scriptscriptstyle(i,j)}}\Sigma_{\scriptscriptstyle X^{\scriptscriptstyle(i,j)}X^{\scriptscriptstyle(i,j)}}^{\scriptscriptstyle\scriptscriptstyle-1}\Sigma_{\scriptscriptstyle X^{\scriptscriptstyle(i,j)}X^{\scriptscriptstyle-(i,j)}}\Sigma_{\scriptscriptstyle X^{\scriptscriptstyle-(i,j)}X^{\scriptscriptstyle-(i,j)}}^{\scriptscriptstyle\scriptscriptstyle-1},

which is different from the generalized eigenvalue problem (5). For these reasons we need to re-derive the convergence rate of U^ij\hat{U}^{\scriptscriptstyle ij}.

Theorem 10

Suppose

  (a) Assumption 1 is satisfied;

  (b) ΣX(i,j)X(i,j)\Sigma_{\scriptscriptstyle X^{\scriptscriptstyle-(i,j)}X^{\scriptscriptstyle(i,j)}} is a finite-rank operator with

    ran(ΣX(i,j)X(i,j))ran(ΣX(i,j)X(i,j)2),\displaystyle\,\mathrm{ran}(\Sigma_{\scriptscriptstyle X^{\scriptscriptstyle-(i,j)}X^{\scriptscriptstyle(i,j)}})\subseteq\mathrm{ran}(\Sigma_{\scriptscriptstyle X^{\scriptscriptstyle-(i,j)}X^{\scriptscriptstyle-(i,j)}}^{\scriptscriptstyle{\color[rgb]{0,0,0}{{2}}}}),
    ran(ΣX(i,j)X(i,j))ran(ΣX(i,j)X(i,j));\displaystyle\,\mathrm{ran}(\Sigma_{\scriptscriptstyle X^{\scriptscriptstyle(i,j)}X^{\scriptscriptstyle-(i,j)}})\subseteq\mathrm{ran}(\Sigma_{\scriptscriptstyle X^{\scriptscriptstyle(i,j)}X^{\scriptscriptstyle(i,j)}});

  (c) n1/2ηn1n^{\scriptscriptstyle-1/2}\prec\eta_{\scriptscriptstyle n}\prec 1, n1/2ϵn1n^{\scriptscriptstyle-1/2}\prec\epsilon_{\scriptscriptstyle n}\prec 1;

  (d) λ1ij>>λdijij\lambda^{\scriptscriptstyle ij}_{\scriptscriptstyle 1}>\cdots>\lambda^{\scriptscriptstyle ij}_{\scriptscriptstyle d_{\scriptscriptstyle ij}}.

Then, U^ijUij[H(i,j)(X)]dij=OP(ηn3/2ϵn1n1+ηn1n1/2+ηn+ϵn).\|\hat{U}^{\scriptscriptstyle ij}-U^{\scriptscriptstyle ij}\|_{\scriptscriptstyle[\mbox{\rsfstena H}^{\scriptscriptstyle-(i,j)}(X)]^{\scriptscriptstyle d_{\scriptscriptstyle ij}}}=O_{\scriptscriptstyle P}(\eta_{\scriptscriptstyle n}^{\scriptscriptstyle-3/2}\epsilon_{\scriptscriptstyle n}^{\scriptscriptstyle-1}n^{\scriptscriptstyle-1}+\eta_{\scriptscriptstyle n}^{\scriptscriptstyle-1}n^{\scriptscriptstyle-1/2}+{\color[rgb]{0,0,0}{{\eta_{\scriptscriptstyle n}}}}+\epsilon_{\scriptscriptstyle n}).

An immediate consequence is that, under the transparent kernel assumption, the bnb_{\scriptscriptstyle n} in Theorem 9 is the same as this rate. We next derive the convergence rate in (iii) of (14). This rate depends on the tuning parameter δn\delta_{\scriptscriptstyle n} in the estimate of the conjoined conditional covariance operator, and it reaches bnb_{\scriptscriptstyle n} for the optimal choice of δn\delta_{\scriptscriptstyle n}.

Theorem 11

Suppose conditions (1) and (2) of Definition 3 are satisfied with UU, VV, WW therein replaced by XiX^{\scriptscriptstyle i}, XjX^{\scriptscriptstyle j}, and UijU^{\scriptscriptstyle ij}. Suppose, furthermore,

  (a) ΣUijUij1ΣUij(XiUij)\Sigma_{\scriptscriptstyle U^{\scriptscriptstyle ij}U^{\scriptscriptstyle ij}}^{\scriptscriptstyle\scriptscriptstyle-1}\Sigma_{\scriptscriptstyle U^{\scriptscriptstyle ij}(X^{\scriptscriptstyle i}U^{\scriptscriptstyle ij})} and ΣUijUij1ΣUij(XjUij)\Sigma_{\scriptscriptstyle U^{\scriptscriptstyle ij}U^{\scriptscriptstyle ij}}^{\scriptscriptstyle\scriptscriptstyle-1}\Sigma_{\scriptscriptstyle U^{\scriptscriptstyle ij}(X^{\scriptscriptstyle j}U^{\scriptscriptstyle ij})} are bounded linear operators;

  (b) bnδn1b_{\scriptscriptstyle n}\preceq\delta_{\scriptscriptstyle n}\prec 1.

Then Σ^X¨iX¨j|UijΣX¨iX¨j|UijHS=OP(δn).\|\hat{\Sigma}_{\scriptscriptstyle\ddot{X}^{\scriptscriptstyle i}\ddot{X}^{\scriptscriptstyle j}|U^{\scriptscriptstyle ij}}-\Sigma_{\scriptscriptstyle\ddot{X}^{\scriptscriptstyle i}\ddot{X}^{\scriptscriptstyle j}|U^{\scriptscriptstyle ij}}\|_{\scriptscriptstyle\mathrm{HS}}=O_{\scriptscriptstyle P}(\delta_{\scriptscriptstyle n}). Consequently, if δnbn\delta_{\scriptscriptstyle n}\asymp b_{\scriptscriptstyle n}, then

Σ^X¨iX¨j|UijΣX¨iX¨j|UijHS=OP(bn).\displaystyle\|\hat{\Sigma}_{\scriptscriptstyle\ddot{X}^{\scriptscriptstyle i}\ddot{X}^{\scriptscriptstyle j}|U^{\scriptscriptstyle ij}}-\Sigma_{\scriptscriptstyle\ddot{X}^{\scriptscriptstyle i}\ddot{X}^{\scriptscriptstyle j}|U^{\scriptscriptstyle ij}}\|_{\scriptscriptstyle\mathrm{HS}}=O_{\scriptscriptstyle P}(b_{\scriptscriptstyle n}).

Finally, we combine Theorems 9 through 11 to obtain the convergence rate of Σ^X¨iX¨j|U^ij\hat{\Sigma}_{\scriptscriptstyle\ddot{X}^{\scriptscriptstyle i}\ddot{X}^{\scriptscriptstyle j}|\hat{U}^{\scriptscriptstyle ij}}. Since there are numerous cross-references among the conditions in these theorems, for a clear presentation we list all the original conditions in the next theorem, even if they have already appeared. These conditions fall into two categories: those for step 1, which involves the sufficient dimension reduction of X(i,j)X^{\scriptscriptstyle(i,j)} versus X(i,j)X^{\scriptscriptstyle-(i,j)}, and those for step 2, which involves the estimation of the conjoined conditional covariance operator. We refer to them as the first-level and second-level conditions, respectively.

Theorem 12

Suppose the following conditions hold:

  (a) (First-level kernel) E[κ(S,S)]<E[\kappa(S,S)]<\infty for κ=κX(i,j)\kappa=\kappa_{\scriptscriptstyle X}^{\scriptscriptstyle(i,j)} and κ=κX(i,j)\kappa=\kappa_{\scriptscriptstyle X}^{\scriptscriptstyle-(i,j)};

  (b) (First-level operator) ΣX(i,j)X(i,j)\Sigma_{\scriptscriptstyle X^{\scriptscriptstyle-(i,j)}X^{\scriptscriptstyle(i,j)}} is a finite-rank operator with rank dijd_{\scriptscriptstyle ij} and

    ran(ΣX(i,j)X(i,j))ran(ΣX(i,j)X(i,j)2),\displaystyle\,\mathrm{ran}(\Sigma_{\scriptscriptstyle X^{\scriptscriptstyle-(i,j)}X^{\scriptscriptstyle(i,j)}})\subseteq\mathrm{ran}(\Sigma_{\scriptscriptstyle X^{\scriptscriptstyle-(i,j)}X^{\scriptscriptstyle-(i,j)}}^{\scriptscriptstyle{\color[rgb]{0,0,0}{{2}}}}),
    ran(ΣX(i,j)X(i,j))ran(ΣX(i,j)X(i,j));\displaystyle\,\mathrm{ran}(\Sigma_{\scriptscriptstyle X^{\scriptscriptstyle(i,j)}X^{\scriptscriptstyle-(i,j)}})\subseteq\mathrm{ran}(\Sigma_{\scriptscriptstyle X^{\scriptscriptstyle(i,j)}X^{\scriptscriptstyle(i,j)}});

    furthermore, all the nonzero eigenvalues of ΣX(i,j)X(i,j)ΣX(i,j)X(i,j)1ΣX(i,j)X(i,j)\Sigma_{\scriptscriptstyle X^{\scriptscriptstyle(i,j)}X^{\scriptscriptstyle-(i,j)}}\Sigma_{\scriptscriptstyle X^{\scriptscriptstyle-(i,j)}X^{\scriptscriptstyle-(i,j)}}^{\scriptscriptstyle\scriptscriptstyle-1}\Sigma_{\scriptscriptstyle X^{\scriptscriptstyle-(i,j)}X^{\scriptscriptstyle(i,j)}} are distinct;

  (c) (First-level tuning parameters) n1/2ηn1n^{\scriptscriptstyle-1/2}\prec\eta_{\scriptscriptstyle n}\prec 1, n1/2ϵn1n^{\scriptscriptstyle-1/2}\prec\epsilon_{\scriptscriptstyle n}\prec 1, ηn3/2ϵn1n1+ηn1n1/2+ηn+ϵn1\eta_{\scriptscriptstyle n}^{\scriptscriptstyle-3/2}\epsilon_{\scriptscriptstyle n}^{\scriptscriptstyle-1}n^{\scriptscriptstyle-1}+\eta_{\scriptscriptstyle n}^{\scriptscriptstyle-1}n^{\scriptscriptstyle-1/2}+\eta_{\scriptscriptstyle n}+\epsilon_{\scriptscriptstyle n}\prec 1;

  (d) (Second-level kernel) E[κ(S,S)]<E[\kappa(S,S)]<\infty is satisfied for κ=κUij\kappa=\kappa_{\scriptscriptstyle U}^{\scriptscriptstyle ij}, κXUi,ij\kappa_{\scriptscriptstyle XU}^{\scriptscriptstyle i,ij}, and κXUj,ij\kappa_{\scriptscriptstyle XU}^{\scriptscriptstyle j,ij}; furthermore, these kernels are transparent;

  (e) (Second-level operators) ΣUijUij1ΣUij(XiUij)\Sigma_{\scriptscriptstyle U^{\scriptscriptstyle ij}U^{\scriptscriptstyle ij}}^{\scriptscriptstyle\scriptscriptstyle-1}\Sigma_{\scriptscriptstyle U^{\scriptscriptstyle ij}(X^{\scriptscriptstyle i}U^{\scriptscriptstyle ij})} and ΣUijUij1ΣUij(XjUij)\Sigma_{\scriptscriptstyle U^{\scriptscriptstyle ij}U^{\scriptscriptstyle ij}}^{\scriptscriptstyle\scriptscriptstyle-1}\Sigma_{\scriptscriptstyle U^{\scriptscriptstyle ij}(X^{\scriptscriptstyle j}U^{\scriptscriptstyle ij})} are bounded linear operators;

  (f) (Second-level tuning parameter) δnηn3/2ϵn1n1+ηn1n1/2+ηn+ϵn\delta_{\scriptscriptstyle n}\asymp\eta_{\scriptscriptstyle n}^{\scriptscriptstyle-3/2}\epsilon_{\scriptscriptstyle n}^{\scriptscriptstyle-1}n^{\scriptscriptstyle-1}+\eta_{\scriptscriptstyle n}^{\scriptscriptstyle-1}n^{\scriptscriptstyle-1/2}+{\color[rgb]{0,0,0}{{\eta_{\scriptscriptstyle n}}}}+\epsilon_{\scriptscriptstyle n}.

Then

Σ^X¨iX¨j|U^ijΣX¨iX¨j|UijHS=OP(ηn3/2ϵn1n1+ηn1n1/2+ηn+ϵn).\displaystyle\|\hat{\Sigma}_{\scriptscriptstyle\ddot{X}^{\scriptscriptstyle i}\ddot{X}^{\scriptscriptstyle j}|\hat{U}^{\scriptscriptstyle ij}}-\Sigma_{\scriptscriptstyle\ddot{X}^{\scriptscriptstyle i}\ddot{X}^{\scriptscriptstyle j}|U^{\scriptscriptstyle ij}}\|_{\scriptscriptstyle\mathrm{HS}}=O_{\scriptscriptstyle P}(\eta_{\scriptscriptstyle n}^{\scriptscriptstyle-3/2}\epsilon_{\scriptscriptstyle n}^{\scriptscriptstyle-1}n^{\scriptscriptstyle-1}+\eta_{\scriptscriptstyle n}^{\scriptscriptstyle-1}n^{\scriptscriptstyle-1/2}+{\color[rgb]{0,0,0}{{\eta_{\scriptscriptstyle n}}}}+\epsilon_{\scriptscriptstyle n}). (16)

Using this result we immediately arrive at the variable selection consistency of the sufficient graphical model.

Corollary 13

Under the conditions in Theorem 12, if

ηn3/2ϵn1n1+ηn1n1/2+ηn+ϵnρn1, and\displaystyle\,\eta_{\scriptscriptstyle n}^{\scriptscriptstyle-3/2}\epsilon_{\scriptscriptstyle n}^{\scriptscriptstyle-1}n^{\scriptscriptstyle-1}+\eta_{\scriptscriptstyle n}^{\scriptscriptstyle-1}n^{\scriptscriptstyle-1/2}+{\color[rgb]{0,0,0}{{\eta_{\scriptscriptstyle n}}}}+\epsilon_{\scriptscriptstyle n}\prec\rho_{\scriptscriptstyle n}\prec 1,\ \mbox{ and}
^={(i,j)Γ×Γ:i>j,Σ^X¨iX¨j|U^ijHS>ρn}\displaystyle\,\hat{{\cal E}}=\{(i,j)\in\Gamma\times\Gamma:\ i>j,\ \|\hat{\Sigma}_{\scriptscriptstyle\ddot{X}^{\scriptscriptstyle i}\ddot{X}^{\scriptscriptstyle j}|\hat{U}^{\scriptscriptstyle ij}}\|_{\scriptscriptstyle\mathrm{HS}}>\rho_{\scriptscriptstyle n}\}

then limnP(^=)=1\lim_{\scriptscriptstyle n\to\infty}P(\hat{{\cal E}}={\cal E})=1.

5.4 Optimal rates of tuning parameters

The convergence rate in Theorem 12 depends on ϵn\epsilon_{\scriptscriptstyle n} and ηn\eta_{\scriptscriptstyle n} explicitly, and δn\delta_{\scriptscriptstyle n} implicitly (in the sense that δnηn3/2ϵn1n1+ηn1n1/2+ηn+ϵn\delta_{\scriptscriptstyle n}\asymp\eta_{\scriptscriptstyle n}^{\scriptscriptstyle-3/2}\epsilon_{\scriptscriptstyle n}^{\scriptscriptstyle-1}n^{\scriptscriptstyle-1}+\eta_{\scriptscriptstyle n}^{\scriptscriptstyle-1}n^{\scriptscriptstyle-1/2}+{\color[rgb]{0,0,0}{{\eta_{\scriptscriptstyle n}}}}+\epsilon_{\scriptscriptstyle n} is optimal for fixed ϵn\epsilon_{\scriptscriptstyle n} and ηn\eta_{\scriptscriptstyle n}). Intuitively, when ϵn\epsilon_{\scriptscriptstyle n}, ηn\eta_{\scriptscriptstyle n}, and δn\delta_{\scriptscriptstyle n} increase, the biases increase and variances decrease; when they decrease, the biases decrease and the variances increase. Thus there should be critical rates for them that balance the bias and variance, which are the optimal rates.

Theorem 14

Under the conditions in Theorem 12, if ϵn\epsilon_{\scriptscriptstyle n}, ηn\eta_{\scriptscriptstyle n}, and δn\delta_{\scriptscriptstyle n} are of the form nan^{\scriptscriptstyle-a}, nbn^{\scriptscriptstyle-b}, and ncn^{\scriptscriptstyle-c} for some a>0a>0, b>0b>0, and c>0c>0, then

  (i) the optimal rates of the tuning parameters are

    n3/8ϵnn1/4,ηnn1/4,δnn1/4;\displaystyle n^{\scriptscriptstyle-{\color[rgb]{0,0,0}{{3/8}}}}\preceq\epsilon_{\scriptscriptstyle n}\preceq n^{\scriptscriptstyle-{\color[rgb]{0,0,0}{{1/4}}}},\quad\eta_{\scriptscriptstyle n}\asymp n^{\scriptscriptstyle-{\color[rgb]{0,0,0}{{1/4}}}},\quad\delta_{\scriptscriptstyle n}\asymp n^{\scriptscriptstyle-{\color[rgb]{0,0,0}{{1/4}}}};

  (ii) the optimal convergence rate of the estimated conjoined conditional covariance operator is

    Σ^X¨iX¨j|U^ijΣX¨iX¨j|UijHS=OP(n1/4).\displaystyle\|\hat{\Sigma}_{\scriptscriptstyle\ddot{X}^{\scriptscriptstyle i}\ddot{X}^{\scriptscriptstyle j}|\hat{U}^{\scriptscriptstyle ij}}-\Sigma_{\scriptscriptstyle\ddot{X}^{\scriptscriptstyle i}\ddot{X}^{\scriptscriptstyle j}|U^{\scriptscriptstyle ij}}\|_{\scriptscriptstyle\mathrm{HS}}=O_{\scriptscriptstyle P}(n^{\scriptscriptstyle-{\color[rgb]{0,0,0}{{1/4}}}}).

Note that there is a whole range of optimal ϵn\epsilon_{\scriptscriptstyle n}; this is because the convergence rate does not have a unique minimizer. It also means that the result is not very sensitive to this tuning parameter.
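To see where this range comes from, write ϵn=na\epsilon_{\scriptscriptstyle n}=n^{\scriptscriptstyle-a} and ηn=nb\eta_{\scriptscriptstyle n}=n^{\scriptscriptstyle-b} and compare the exponents in the rate of Theorem 12; a brief sketch of the balancing argument is

ηn3/2ϵn1n1+ηn1n1/2+ηn+ϵn=n3b/2+a1+nb1/2+nb+na.\displaystyle\eta_{\scriptscriptstyle n}^{\scriptscriptstyle-3/2}\epsilon_{\scriptscriptstyle n}^{\scriptscriptstyle-1}n^{\scriptscriptstyle-1}+\eta_{\scriptscriptstyle n}^{\scriptscriptstyle-1}n^{\scriptscriptstyle-1/2}+\eta_{\scriptscriptstyle n}+\epsilon_{\scriptscriptstyle n}=n^{\scriptscriptstyle 3b/2+a-1}+n^{\scriptscriptstyle b-1/2}+n^{\scriptscriptstyle-b}+n^{\scriptscriptstyle-a}.

Balancing the second and third terms gives b=1/4b=1/4, so both are of order n1/4n^{\scriptscriptstyle-1/4}; the first and fourth terms are then no larger than n1/4n^{\scriptscriptstyle-1/4} precisely when 1/4a3/81/4\leq a\leq 3/8, which corresponds to the range n3/8ϵnn1/4n^{\scriptscriptstyle-3/8}\preceq\epsilon_{\scriptscriptstyle n}\preceq n^{\scriptscriptstyle-1/4} in Theorem 14.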

In the above asymptotic analysis, we have treated pp as fixed when nn\to\infty. We have also developed the consistency and convergence rate in the scenario where the dimension pnp_{\scriptscriptstyle n} of XX goes to infinity with nn, which is placed in the Supplementary Material (Section S9) due to limited space.

6 Simulation

In this section we compare the performance of our sufficient graphical model with that of existing methods: Yuan and Lin (2007), Liu et al. (2009), Voorman et al. (2013), Fellinghauer et al. (2013), Lee et al. (2016b), and a Naïve method, which is based on the conjoined conditional covariance operator without the dimension reduction step.

By design, the sufficient graphical model has advantages over these existing methods under the following circumstances. First, since the sufficient graphical model does not make any distributional assumption, it should outperform Yuan and Lin (2007) and Liu et al. (2009) when the Gaussian or copula Gaussian assumptions are violated; second, owing to its sufficient dimension reduction step, the sufficient graphical model avoids the curse of dimensionality and should outperform Voorman et al. (2013), Fellinghauer et al. (2013), and the Naïve method in the high-dimensional setting; third, since the sufficient graphical model does not require an additive structure, it should outperform Lee et al. (2016b) when there is severe nonadditivity in the model. Our simulation comparisons will reflect these aspects.

For the sufficient graphical model, Lee et al. (2016b), and the Naïve method, we use the Gaussian radial basis function as the kernel. The regularization constants ϵX(i,j)\epsilon_{\scriptscriptstyle X}^{\scriptscriptstyle(i,j)}, ϵX(i,j)\epsilon_{\scriptscriptstyle X}^{\scriptscriptstyle-(i,j)}, and ϵU(i,j)\epsilon_{\scriptscriptstyle U}^{\scriptscriptstyle(i,j)} are chosen by the generalized cross validation criterion described in Section 4.3 with the grid {10:=1,0,1,2,3,4}\{10^{\scriptscriptstyle-\ell}:\ell=-1,0,1,2,3,4\}. The kernel parameters γX(i,j)\gamma_{\scriptscriptstyle X}^{\scriptscriptstyle(i,j)}, γX(i,j)\gamma_{\scriptscriptstyle X}^{\scriptscriptstyle-(i,j)}, γXUi,ij\gamma_{\scriptscriptstyle XU}^{\scriptscriptstyle i,ij}, γXUj,ij\gamma_{\scriptscriptstyle XU}^{\scriptscriptstyle j,ij}, and γUij\gamma_{\scriptscriptstyle U}^{\scriptscriptstyle ij} are chosen according to (11). Because the outcomes of tuning parameters are stable, for each model, we compute the generalized cross validation for the first five samples and use their average value for the rest of the simulation. The performance of each estimate is assessed using the averaged receiver operating characteristic curve as a function of the threshold ρ\rho. The accuracy of a method across all ρ\rho is measured by the area under the receiver operating characteristic curve.

To isolate the factors that affect accuracy, we first consider two models with relatively small dimensions and large sample sizes, which are

Model I:X1\displaystyle\text{Model I}:\quad X^{\scriptscriptstyle 1} =ϵ1,X2=ϵ2,X3=sin(2X1)+ϵ3\displaystyle\,=\epsilon_{\scriptscriptstyle 1},\ X^{\scriptscriptstyle 2}=\epsilon_{\scriptscriptstyle 2},\ X^{\scriptscriptstyle 3}=\text{sin}(2X^{\scriptscriptstyle 1})+\epsilon_{\scriptscriptstyle 3}
X4\displaystyle X^{\scriptscriptstyle 4} =(X1)2+(X2)2+ϵ4,X5=ϵ5,\displaystyle\,=(X^{\scriptscriptstyle 1})^{\scriptscriptstyle 2}+(X^{\scriptscriptstyle 2})^{\scriptscriptstyle 2}+\epsilon_{\scriptscriptstyle 4},\ X^{\scriptscriptstyle 5}=\epsilon_{\scriptscriptstyle 5},
Model II:X1\displaystyle\text{Model II}:\quad X^{\scriptscriptstyle 1} =ϵ1,X2=X1+ϵ2,X3=ϵ3,X4=(X1+X3)2+ϵ4,\displaystyle\,=\epsilon_{\scriptscriptstyle 1},\ X^{\scriptscriptstyle 2}=X^{\scriptscriptstyle 1}+\epsilon_{\scriptscriptstyle 2},\ X^{\scriptscriptstyle 3}=\epsilon_{\scriptscriptstyle 3},\ X^{\scriptscriptstyle 4}=(X^{\scriptscriptstyle 1}+X^{\scriptscriptstyle 3})^{\scriptscriptstyle 2}+\epsilon_{\scriptscriptstyle 4},
X5\displaystyle X^{\scriptscriptstyle 5} =cos(2X2X3)+ϵ5,X6=X4+ϵ6,\displaystyle\,=\text{cos}(2X^{\scriptscriptstyle 2}X^{\scriptscriptstyle 3})+\epsilon_{\scriptscriptstyle 5},\ X^{\scriptscriptstyle 6}=X^{\scriptscriptstyle 4}+\epsilon_{\scriptscriptstyle 6},

where ϵi\epsilon_{\scriptscriptstyle i}, i=1,,pi=1,\dots,p, are independent and identically distributed standard normal random variables. The edge sets of the two models are

Model I:\displaystyle\mbox{Model I}: ={(1,3),(1,4),(2,4),(1,2)}\displaystyle\,\quad{\cal E}=\{(1,3),(1,4),(2,4),(1,2)\}
Model II:\displaystyle\mbox{Model II}: ={(1,2),(1,4),(3,4),(1,3),(2,5),(3,5),(2,3),(4,6)}.\displaystyle\,\quad{\cal E}=\{(1,2),(1,4),(3,4),(1,3),(2,5),(3,5),(2,3),(4,6)\}.
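For concreteness, a sample from Models I and II can be generated as follows; this is a minimal sketch in Python (the noise terms are i.i.d. standard normal, as stated above, and the function names are ours):

    import numpy as np

    def generate_model_i(n, seed=None):
        """Model I: X1 = e1, X2 = e2, X3 = sin(2 X1) + e3,
        X4 = X1^2 + X2^2 + e4, X5 = e5."""
        rng = np.random.default_rng(seed)
        e = rng.standard_normal((n, 5))
        X = np.empty((n, 5))
        X[:, 0] = e[:, 0]
        X[:, 1] = e[:, 1]
        X[:, 2] = np.sin(2 * X[:, 0]) + e[:, 2]
        X[:, 3] = X[:, 0] ** 2 + X[:, 1] ** 2 + e[:, 3]
        X[:, 4] = e[:, 4]
        return X

    def generate_model_ii(n, seed=None):
        """Model II: X1 = e1, X2 = X1 + e2, X3 = e3, X4 = (X1 + X3)^2 + e4,
        X5 = cos(2 X2 X3) + e5, X6 = X4 + e6."""
        rng = np.random.default_rng(seed)
        e = rng.standard_normal((n, 6))
        X = np.empty((n, 6))
        X[:, 0] = e[:, 0]
        X[:, 1] = X[:, 0] + e[:, 1]
        X[:, 2] = e[:, 2]
        X[:, 3] = (X[:, 0] + X[:, 2]) ** 2 + e[:, 3]
        X[:, 4] = np.cos(2 * X[:, 1] * X[:, 2]) + e[:, 4]
        X[:, 5] = X[:, 3] + e[:, 5]
        return X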

We use n=100,1000n=100,1000 for each model, and for each nn, we generate 50 samples to compute the averaged receiver operating characteristic curves. The dimension dijd_{\scriptscriptstyle ij} for sufficient graphical model is taken to be 2 for all cases (we have also used dij=1d_{\scriptscriptstyle ij}=1 and the results are very similar to those presented here). The plots in the first row of Figure 1 show the averaged receiver operating characteristic curves for the seven methods, with the following plotting symbol assignment:

Sufficient graphical model: red solid line
Lee et al. (2016b): black solid line
Yuan and Lin (2007): red dashed line
Liu et al. (2009): black dashed line
Voorman et al. (2013): red dotted line
Fellinghauer et al. (2013): black dotted line
Naïve: blue dotted line

From these figures we see that the two top performers are clearly the sufficient graphical model and Lee et al. (2016b), and their performances are very similar. Note that neither of the two models satisfies the Gaussian or copula Gaussian assumption, which explains why the sufficient graphical model and Lee et al. (2016b) outperform Yuan and Lin (2007) and Liu et al. (2009). The sufficient graphical model and Lee et al. (2016b) also outperform Voorman et al. (2013), Fellinghauer et al. (2013), and the Naïve method, indicating that the curse of dimensionality already affects the fully nonparametric methods; these three nonparametric estimators have similar performances. Also note that Model I has an additive structure, which explains the slight advantage of Lee et al. (2016b) over the sufficient graphical model in subfigure (a) of Figure 1; Model II is not additive, and the advantage of Lee et al. (2016b) disappears in subfigure (b) of Figure 1.

(a) Model I   (b) Model II   (c) Model III   (d) Model IV
Figure 1: Averaged receiver operating characteristic curves for the four models. For Models I and II, left panel: n=100n=100; right panel: n=1000n=1000. For Models III and IV, left panel: n=50n=50; right panel: n=100n=100.

We next consider two models with relatively high dimensions and small sample sizes. A convenient systematic way to generate larger networks is via the hub structure. We choose p=200p=200, and randomly generate ten hubs h1,,h10h_{\scriptscriptstyle 1},\ldots,h_{\scriptscriptstyle 10} from the 200 vertices. For each hkh_{\scriptscriptstyle k}, we randomly select a set HkH_{\scriptscriptstyle k} of 19 vertices to form the neighborhood of hkh_{\scriptscriptstyle k}. With the network structures thus specified, our two probabilistic models are

Model III:Xi=1+|Xhk|2+ϵi,whereiHkhk,\displaystyle\,\text{Model {III}}:\quad X^{\scriptscriptstyle i}=1+|X^{\scriptscriptstyle h_{\scriptscriptstyle k}}|^{\scriptscriptstyle 2}+\epsilon_{\scriptscriptstyle i},\quad\text{where}\quad i\in H_{\scriptscriptstyle k}\setminus h_{\scriptscriptstyle k},
Model IV:Xi=sin((Xhk)3)ϵi,whereiHkhk,\displaystyle\,\text{Model {IV}}:\quad X^{\scriptscriptstyle i}=\sin((X^{\scriptscriptstyle h_{\scriptscriptstyle k}})^{\scriptscriptstyle 3})\epsilon_{\scriptscriptstyle i},\quad\text{where}\quad i\in H_{\scriptscriptstyle k}\setminus h_{\scriptscriptstyle k},

and the ϵi\epsilon_{\scriptscriptstyle i}’s are the same as in Models I and II. Note that, in Model III, the dependence of XiX^{\scriptscriptstyle i} on XhkX^{\scriptscriptstyle h_{\scriptscriptstyle k}} is through the conditional mean E(Xi|Xhk)E(X^{\scriptscriptstyle i}|X^{\scriptscriptstyle h_{\scriptscriptstyle k}}), whereas in Model IV the dependence is through the conditional variance var(Xi|Xhk)\mathrm{var}(X^{\scriptscriptstyle i}|X^{\scriptscriptstyle h_{\scriptscriptstyle k}}). For each model, we choose two sample sizes, n=50n=50 and n=100n=100. The averaged receiver operating characteristic curves (again averaged over 50 samples) are presented in the second row of Figure 1. From the figures we see that, in the high-dimensional setting with p>np>n, the sufficient graphical model substantially outperforms all the other methods, which clearly indicates the benefit of dimension reduction in constructing graphical models.
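A minimal sketch of the hub design and of generating data from Models III and IV is given below; for simplicity the ten neighbourhoods are taken to be disjoint (10 hubs plus 10 x 19 neighbours exhaust the 200 vertices), which is one way, though not the only way, to realize the random selection described above:

    import numpy as np

    def generate_hub_models(n, p=200, n_hubs=10, hub_degree=19, seed=None):
        """Hub design for Models III and IV. Hubs remain pure noise; each
        neighbour of hub h follows Model III (conditional mean) in X3 and
        Model IV (conditional variance) in X4."""
        rng = np.random.default_rng(seed)
        nodes = rng.permutation(p)
        hubs, rest = nodes[:n_hubs], nodes[n_hubs:]
        neighbours = np.split(rest[: n_hubs * hub_degree], n_hubs)

        X3 = rng.standard_normal((n, p))     # Model III starts from pure noise
        X4 = rng.standard_normal((n, p))     # Model IV
        for h, nb in zip(hubs, neighbours):
            X3[:, nb] = 1 + np.abs(X3[:, [h]]) ** 2 + rng.standard_normal((n, nb.size))
            X4[:, nb] = np.sin(X4[:, [h]] ** 3) * rng.standard_normal((n, nb.size))
        return X3, X4, hubs, neighbours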

We now consider a Gaussian graphical model to investigate any efficiency loss incurred by the sufficient graphical model. Following a structure similar to that used in Li et al. (2014), we choose p=20p=20, n=100,200n=100,200, and the model

Model V:XN(0,Θ1),\displaystyle\text{Model {V}}:X\sim N(0,\Theta^{\scriptscriptstyle-1}),

where Θ\Theta is a 20×2020\times 20 precision matrix with diagonal entries 1, 1, 1, 1.333, 3.010, 3.203, 1.543, 1.270, 1.544, 3, 1, 1, 1.2, 1, 1, 1, 1, 3, 2, 1, and nonzero off-diagonal entries θ3,5=1.418\theta_{\scriptscriptstyle 3,5}=1.418, θ4,10=0.744\theta_{\scriptscriptstyle 4,10}=-0.744, θ5,9=0.519\theta_{\scriptscriptstyle 5,9}=0.519, θ5,10=0.577\theta_{\scriptscriptstyle 5,10}=-0.577, θ13,17=0.287\theta_{\scriptscriptstyle 13,17}=0.287, θ17,20=0.542\theta_{\scriptscriptstyle 17,20}=0.542, θ14,15=0.998\theta_{\scriptscriptstyle 14,15}=0.998. As expected, Figure 2 shows that Yuan and Lin (2007), Liu et al. (2009), and Lee et al. (2016b) perform better than the sufficient graphical model in this case. However, the sufficient graphical model still performs reasonably well and significantly outperforms the fully nonparametric methods.
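A minimal sketch of simulating Model V, assuming (as is standard for a precision matrix) that the listed off-diagonal entries are placed symmetrically:

    import numpy as np

    def generate_model_v(n, seed=None):
        """Build the 20-by-20 precision matrix Theta from the entries listed in
        the text and draw n observations from N(0, Theta^{-1})."""
        diag = [1, 1, 1, 1.333, 3.010, 3.203, 1.543, 1.270, 1.544, 3,
                1, 1, 1.2, 1, 1, 1, 1, 3, 2, 1]
        off = {(3, 5): 1.418, (4, 10): -0.744, (5, 9): 0.519, (5, 10): -0.577,
               (13, 17): 0.287, (17, 20): 0.542, (14, 15): 0.998}   # 1-based indices
        Theta = np.diag(np.asarray(diag, dtype=float))
        for (i, j), v in off.items():
            Theta[i - 1, j - 1] = Theta[j - 1, i - 1] = v
        rng = np.random.default_rng(seed)
        return rng.multivariate_normal(np.zeros(20), np.linalg.inv(Theta), size=n)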

Figure 2: Averaged receiver operating characteristic curves for Model V. Left panel: n=100n=100; right panel: n=200n=200.

Finally, we conducted some simulations on the generalized cross validation criterion (13) for determining the threshold ρn\rho_{\scriptscriptstyle n}. We generated samples from Models I through V as described above, produced the receiver operating characteristic curves using the sufficient graphical model, and determined the threshold ρn\rho_{\scriptscriptstyle n} by (13). The results are presented in Figure S1 in the Supplementary Material. In each panel, the threshold ρn\rho_{\scriptscriptstyle n} determined by generalized cross validation is represented by a black dot on the red receiver operating characteristic curve.

7 Application

We now apply the sufficient graphical model to a data set from the DREAM 4 Challenge project and compare it with other methods. The goal of this challenge is to recover gene regulation networks from simulated steady-state data. A description of this data set can be found in Marbach et al. (2010). Since Lee et al. (2016b) already compared their method with Yuan and Lin (2007), Liu et al. (2009), Voorman et al. (2013), Fellinghauer et al. (2013), and the Naïve method on this data set and demonstrated the superiority of their method among these estimators, here we focus on comparing the sufficient graphical model with Lee et al. (2016b) and with the champion method of the DREAM 4 Challenge.

The data set contains data from five networks, each of dimension 100 and sample size 201. We use the Gaussian radial basis function kernel for the sufficient graphical model and Lee et al. (2016b), together with the tuning methods described in Section 4.3. For the sufficient graphical model, the dimensions dijd_{\scriptscriptstyle ij} are taken to be 1. We have also experimented with dij=2d_{\scriptscriptstyle ij}=2, but the results (not presented here) show no significant difference. Because the true networks are available, we can compare the receiver operating characteristic curves and their areas under the curve, which are shown in Table 1.

Table 1: Areas under the receiver operating characteristic curve for the sufficient graphical model, Lee et al. (2016b), the Naïve method, and the champion method in the DREAM 4 Challenge
                             Network 1   Network 2   Network 3   Network 4   Network 5
Sufficient graphical model     0.85        0.81        0.83        0.83        0.79
Lee et al. (2016b)             0.86        0.81        0.83        0.83        0.77
Champion                       0.91        0.81        0.83        0.83        0.75
Naïve                          0.78        0.76        0.78        0.76        0.71

As we can see from Table 1, the sufficient graphical model has the same areas under the receiver operating characteristic curve as Lee et al. (2016b) for Networks 2, 3, and 4, performs better than Lee et al. (2016b) for Network 5, but trails slightly behind it for Network 1. Compared with the champion method, the sufficient graphical model has the same areas under the curve for Networks 2, 3, and 4, performs better for Network 5, and performs worse for Network 1. Overall, the sufficient graphical model and Lee et al. (2016b) perform similarly on this data set, and they are on a par with the champion method. We should point out that the sufficient graphical model and Lee et al. (2016b) are purely empirical: they employ no knowledge about the underlying physical mechanism generating the gene expression data. However, according to Pinna et al. (2010), the champion method did use a differential equation that reflects the underlying physical mechanism. The results for threshold determination are presented in Figure S2 in the Supplementary Material.

8 Discussion

This paper is a first attempt to take advantage of recently developed nonlinear sufficient dimension reduction methods to estimate a statistical graphical model nonparametrically while avoiding the curse of dimensionality. Nonlinear sufficient dimension reduction is used as a module and applied repeatedly to evaluate conditional independence, which leads to a substantial gain in accuracy in the high-dimensional setting. Compared with the Gaussian and copula Gaussian methods, our method is not affected by violations of the Gaussian and copula Gaussian assumptions. Compared with the additive method (Lee et al., 2016b), our method does not require an additive structure and retains conditional independence as the criterion for determining the edges, which is a commonly accepted criterion. Compared with fully nonparametric methods, the sufficient graphical model avoids the curse of dimensionality and significantly enhances performance.

The present framework opens up several directions for further research. First, the current model assumes that the central class 𝔖X(i,j)|X(i,j)\mathfrak{S}_{\scriptscriptstyle X^{\scriptscriptstyle(i,j)}|X^{\scriptscriptstyle-(i,j)}} is complete, so that generalized sliced inverse regression is an exhaustive nonlinear sufficient dimension reduction estimate. When this condition is violated, generalized sliced inverse regression is no longer exhaustive, and we can employ other nonlinear sufficient dimension reduction methods, such as generalized sliced averaged variance estimation (Lee et al., 2013; Li, 2018b), to recover the part of the central class that generalized sliced inverse regression misses. Second, though we have assumed that there is a proper sufficient sub-σ\sigma-field 𝒢(i,j){\cal G}^{\scriptscriptstyle-(i,j)} for each (i,j)(i,j), the proposed estimation procedure is still justifiable when no such sub-σ\sigma-field exists. In this case, UijU^{\scriptscriptstyle ij} is still the most important set of functions characterizing the statistical dependence of X(i,j)X^{\scriptscriptstyle(i,j)} on X(i,j)X^{\scriptscriptstyle-(i,j)}, even though it is not sufficient. Without sufficiency, our method may be more appropriately called the principal graphical model than the sufficient graphical model. Third, the current method can be extended to functional graphical models, which are common in medical applications such as EEG and fMRI. Several functional graphical models have been proposed recently by Zhu et al. (2016), Qiao et al. (2019), Li and Solea (2018b), and Solea and Li (2020). The idea of a sufficient graph can be applied to this setting to improve efficiency.

This paper also contains some theoretical advances that are novel to nonlinear sufficient dimension reduction. For example, it introduces a general framework to characterize how the error of nonlinear sufficient dimension reduction propagates to the downstream analysis in terms of convergence rates. Furthermore, the results on convergence rates of various linear operators that allow the dimension of the predictor to go to infinity are the first of their kind in nonlinear sufficient dimension reduction. These advances will benefit the future development of sufficient dimension reduction in general, beyond the current context of estimating graphical models.


Acknowledgments

Bing Li’s research on this work was supported in part by the NSF Grant DMS-1713078. Kyongwon Kim’s work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. 2021R1F1A1046976, RS-2023-00219212) and by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (2021R1A6A1A10039823).

Supplementary Material

The Supplementary Material includes proofs of all theorems, lemmas, corollaries, and propositions in the paper, the asymptotic development for the high-dimensional setting, and some additional simulation plots for threshold determination.


References

  • Bach and Jordan (2002) F. R. Bach and M. I. Jordan. Kernel independent component analysis. Journal of Machine Learning Research, 3:1–48, 2002.
  • Bellman (1961) R Bellman. Curse of dimensionality. Adaptive control processes: a guided tour. Princeton, NJ, 1961.
  • Bickel and Levina (2008) Peter J Bickel and Elizaveta Levina. Covariance regularization by thresholding. The Annals of Statistics, pages 2577–2604, 2008.
  • Cook (1994) R Dennis Cook. Using dimension-reduction subspaces to identify important inputs in models of physical systems. In Proceedings of the section on Physical and Engineering Sciences, pages 18–25. American Statistical Association Alexandria, VA, 1994.
  • Cook and Weisberg (1991) R Dennis Cook and Sanford Weisberg. Comment. Journal of the American Statistical Association, 86(414):328–332, 1991.
  • Dawid (1979) A. P. Dawid. Conditional independence in statistical theory. Journal of the Royal Statistical Society. Series B (Methodological), pages 1–31, 1979.
  • Fan and Li (2001) Jianqing Fan and Runze Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456):1348–1360, 2001.
  • Fellinghauer et al. (2013) Bernd Fellinghauer, Peter Bühlmann, Martin Ryffel, Michael Von Rhein, and Jan D Reinhardt. Stable graphical model estimation with random forests for discrete, continuous, and mixed variables. Computational Statistics & Data Analysis, 64:132–152, 2013.
  • Friedman et al. (2008) Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432–441, 2008.
  • Fukumizu et al. (2004) Kenji Fukumizu, Francis R Bach, and Michael I Jordan. Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. Journal of Machine Learning Research, 5(Jan):73–99, 2004.
  • Fukumizu et al. (2007) Kenji Fukumizu, Francis R Bach, and Arthur Gretton. Statistical consistency of kernel canonical correlation analysis. Journal of Machine Learning Research, 8(Feb):361–383, 2007.
  • Fukumizu et al. (2008) Kenji Fukumizu, Arthur Gretton, Xiaohai Sun, and Bernhard Schölkopf. Kernel measures of conditional dependence. In Advances in neural information processing systems, pages 489–496, 2008.
  • Golub et al. (1979) Gene H. Golub, Michael Heath, and Grace Wahba. Generalized cross-validation as a method for choosing a good ridge parameter. Technometrics, 1979.
  • Guo et al. (2010) Jian Guo, Elizaveta Levina, George Michailidis, and Ji Zhu. Pairwise variable selection for high-dimensional model-based clustering. Biometrics, 66(3):793–804, 2010.
  • Hoerl and Kennard (1970) A. E. Hoerl and R. W. Kennard. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12:55–67, 1970.
  • Kim et al. (2020) Kyongwon Kim, Bing Li, Zhou Yu, and Lexin Li. On post dimension reduction statistical inference. to appear in The Annals of Statistics, 2020.
  • Lam and Fan (2009) Clifford Lam and Jianqing Fan. Sparsistency and rates of convergence in large covariance matrix estimation. The Annals of Statistics, 37(6B):4254, 2009.
  • Lauritzen (1996) Steffen L Lauritzen. Graphical models, volume 17. Clarendon Press, 1996.
  • Lee et al. (2016a) K.-Y. Lee, B. Li, and H. Zhao. Variable selection via additive conditional independence. Journal of the Royal Statistical Society: Series B, 78:1037–1055, 2016a.
  • Lee et al. (2013) Kuang-Yao Lee, Bing Li, and Francesca Chiaromonte. A general theory for nonlinear sufficient dimension reduction: Formulation and estimation. The Annals of Statistics, 41(1):221–249, 2013.
  • Lee et al. (2016b) Kuang-Yao Lee, Bing Li, and Hongyu Zhao. On an additive partial correlation operator and nonparametric estimation of graphical models. Biometrika, 103(3):513–530, 2016b.
  • Li et al. (2011) B. Li, A. Artemiou, and L. Li. Principal support vector machines for linear and nonlinear sufficient dimension reduction. The Annals of Statistics, 39:3182–3210, 2011.
  • Li (2018a) Bing Li. Linear operator-based statistical analysis: A useful paradigm for big data. Canadian Journal of Statistics, 46(1):79–103, 2018a.
  • Li (2018b) Bing Li. Sufficient Dimension Reduction: Methods and Applications with R. CRC Press, 2018b.
  • Li and Solea (2018a) Bing Li and Eftychia Solea. A nonparametric graphical model for functional data with application to brain networks based on fmri. Journal of the American Statistical Association, 113(just-accepted):1637–1655, 2018a.
  • Li and Solea (2018b) Bing Li and Eftychia Solea. A nonparametric graphical model for functional data with application to brain networks based on fmri. Journal of the American Statistical Association, 113:1637–1655, 2018b.
  • Li and Song (2017) Bing Li and Jun Song. Nonlinear sufficient dimension reduction for functional data. The Annals of Statistics, 45(3):1059–1095, 2017.
  • Li et al. (2014) Bing Li, Hyonho Chun, and Hongyu Zhao. On an additive semigraphoid model for statistical networks with application to pathway analysis. Journal of the American Statistical Association, 109(507):1188–1204, 2014.
  • Li (1991) Ker-Chau Li. Sliced inverse regression for dimension reduction. Journal of the American Statistical Association, 86(414):316–327, 1991.
  • Liu et al. (2009) Han Liu, John Lafferty, and Larry Wasserman. The nonparanormal: Semiparametric estimation of high dimensional undirected graphs. Journal of Machine Learning Research, 10(Oct):2295–2328, 2009.
  • Liu et al. (2012a) Han Liu, Fang Han, Ming Yuan, John Lafferty, and Larry Wasserman. High-dimensional semiparametric Gaussian copula graphical models. The Annals of Statistics, 40(4):2293–2326, 2012a.
  • Liu et al. (2012b) Han Liu, Fang Han, and Cun-Hui Zhang. Transelliptical graphical models. In Advances in Neural Information Processing Systems, pages 800–808, 2012b.
  • Luo and Li (2016) Wei Luo and Bing Li. Combining eigenvalues and variation of eigenvectors for order determination. Biometrika, 103:875–887, 2016.
  • Luo and Li (2020) Wei Luo and Bing Li. On order determination by predictor augmentation. Biometrika, to appear, 2020.
  • Marbach et al. (2010) Daniel Marbach, Robert J Prill, Thomas Schaffter, Claudio Mattiussi, Dario Floreano, and Gustavo Stolovitzky. Revealing strengths and weaknesses of methods for gene network inference. Proceedings of the National Academy of Sciences, 107(14):6286–6291, 2010.
  • Meinshausen and Bühlmann (2006) Nicolai Meinshausen and Peter Bühlmann. High-dimensional graphs and variable selection with the lasso. The Annals of Statistics, 34(3):1436–1462, 2006.
  • Pearl and Verma (1987) J. Pearl and T. Verma. The logic of representing dependencies by directed graphs. Technical report, Computer Science Department, University of California, Los Angeles, 1987.
  • Peng et al. (2009) Jie Peng, Pei Wang, Nengfeng Zhou, and Ji Zhu. Partial correlation estimation by joint sparse regression models. Journal of the American Statistical Association, 104(486):735–746, 2009.
  • Pinna et al. (2010) Andrea Pinna, Nicola Soranzo, and Alberto de la Fuente. From knockouts to networks: establishing direct cause-effect relationships through graph analysis. PLoS ONE, 5(10), 2010.
  • Qiao et al. (2019) Xinghao Qiao, Shaojun Guo, and Gareth M. James. Functional graphical models. Journal of the American Statistical Association, 114:211–222, 2019.
  • Solea and Li (2020) Eftychia Solea and Bing Li. Copula Gaussian graphical models for functional data. Submitted to the Journal of the American Statistical Association (under revision), 2020.
  • Sriperumbudur et al. (2011) B. K. Sriperumbudur, K. Fukumizu, and G. R. G. Lanckriet. Universality, characteristic kernels and RKHS embedding of measures. Journal of Machine Learning Research, 12:2389–2410, 2011.
  • Tibshirani (1996) Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288, 1996.
  • Voorman et al. (2013) Arend Voorman, Ali Shojaie, and Daniela Witten. Graph estimation with joint additive models. Biometrika, 101(1):85–101, 2013.
  • Wang (2008) Y. Wang. Nonlinear dimension reduction in feature space. PhD Thesis, The Pennsylvania State University, 2008.
  • Wu (2008) H. M. Wu. Kernel sliced inverse regression with applications to classification. Journal of Computational and Graphical Statistics, 17(3):590–610, 2008.
  • Xue and Zou (2012) Lingzhou Xue and Hui Zou. Regularized rank-based estimation of high-dimensional nonparanormal graphical models. The Annals of Statistics, 40(5):2541–2571, 2012.
  • Yuan and Lin (2007) Ming Yuan and Yi Lin. Model selection and estimation in the Gaussian graphical model. Biometrika, 94(1):19–35, 2007.
  • Zhu et al. (2016) Hongxiao Zhu, Nate Strawn, and David B. Dunson. Bayesian graphical models for multivariate functional data. Journal of Machine Learning Research, 17(204):1–27, 2016.
  • Zou (2006) Hui Zou. The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476):1418–1429, 2006.