On Sufficient Graphical Models
Abstract
We introduce a sufficient graphical model by applying the recently developed nonlinear sufficient dimension reduction techniques to the evaluation of conditional independence. The graphical model is nonparametric in nature, as it does not make distributional assumptions such as the Gaussian or copula Gaussian assumptions. However, unlike a fully nonparametric graphical model, which relies on a high-dimensional kernel to characterize conditional independence, our graphical model is based on conditional independence given a set of sufficient predictors with a substantially reduced dimension. In this way we avoid the curse of dimensionality that comes with a high-dimensional kernel. We develop the population-level properties, convergence rate, and variable selection consistency of our estimate. By simulation comparisons and an analysis of the DREAM 4 Challenge data set, we demonstrate that our method outperforms the existing methods when the Gaussian or copula Gaussian assumptions are violated, and its performance remains excellent in the high-dimensional setting.
Keywords: conjoined conditional covariance operator, generalized sliced inverse regression, nonlinear sufficient dimension reduction, reproducing kernel Hilbert space
1 Introduction
In this paper we propose a new nonparametric statistical graphical model, which we call the sufficient graphical model, by incorporating the recently developed nonlinear sufficient dimension reduction techniques into the construction of distribution-free graphical models.
Let $G = (\Gamma, E)$ be an undirected graph consisting of a finite set of nodes $\Gamma = \{1, \ldots, p\}$ and a set of edges $E \subseteq \{(i,j) \in \Gamma \times \Gamma : i \neq j\}$. Since $(i,j)$ and $(j,i)$ represent the same edge in an undirected graph, we can assume without loss of generality that $i < j$. A statistical graphical model links $G$ with a random vector $X = (X^1, \ldots, X^p)$ through the conditional independence
$$(i,j) \notin E \;\Longleftrightarrow\; X^i \perp\!\!\!\perp X^j \mid X^{-(i,j)}, \qquad (1)$$
where $X^{-(i,j)} = \{X^k : k \in \Gamma, \, k \neq i, j\}$, and $\perp\!\!\!\perp$ means conditional independence. Thus, nodes $i$ and $j$ are connected if and only if $X^i$ and $X^j$ are dependent given $X^{-(i,j)}$. Our goal is to estimate the edge set $E$ based on a sample of $X$. See Lauritzen (1996).
One of the most popular statistical graphical models is the Gaussian graphical model, which assumes that $X \sim N(\mu, \Sigma)$. Under the Gaussian assumption, the conditional independence in (1) is encoded in the precision matrix $\Theta = \Sigma^{-1}$ in the following sense:
$$(i,j) \notin E \;\Longleftrightarrow\; \theta_{ij} = 0, \qquad (2)$$
where $\theta_{ij}$ is the $(i,j)$th entry of the precision matrix $\Theta$. By this equivalence, estimating $E$ amounts to identifying the positions of the zero entries of the precision matrix, which can be achieved by sparse estimation methods such as those of Tibshirani (1996), Fan and Li (2001), and Zou (2006). A variety of methods have been developed for estimating the Gaussian graphical model, which include, for example, Meinshausen and Bühlmann (2006), Yuan and Lin (2007), Bickel and Levina (2008), and Peng et al. (2009). See also Friedman et al. (2008), Guo et al. (2010), and Lam and Fan (2009).
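To make the equivalence (2) concrete, the following minimal sketch simulates Gaussian data with a known sparse precision matrix and reads the edges off a graphical lasso fit (Friedman et al., 2008), using scikit-learn's GraphicalLasso purely for illustration; the precision matrix, penalty value, and tolerance below are arbitrary choices and are not part of the method proposed in this paper.

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)
# True precision matrix: the (1,3) entry is zero, so nodes 1 and 3 are not connected.
theta = np.array([[1.0, 0.4, 0.0],
                  [0.4, 1.0, 0.3],
                  [0.0, 0.3, 1.0]])
X = rng.multivariate_normal(np.zeros(3), np.linalg.inv(theta), size=1000)

# Sparse estimate of the precision matrix; nonzero off-diagonal entries are edges.
fit = GraphicalLasso(alpha=0.05).fit(X)
edges = {(i, j) for i in range(3) for j in range(i + 1, 3)
         if abs(fit.precision_[i, j]) > 1e-6}
print(edges)   # ideally {(0, 1), (1, 2)}
```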
Since the Gaussian distribution assumption is restrictive, many recent advances have focused on relaxing this assumption. A main challenge in doing so is to avoid the curse of dimensionality (Bellman, 1961): a straightforward nonparametric extension would resort to a high-dimensional kernel, which is known to be ineffective. One way to relax the Gaussian assumption without invoking a high-dimensional kernel is to use the copula Gaussian distribution, which is the approach taken by Liu et al. (2009), Liu et al. (2012a), and Xue and Zou (2012), and is further extended to the transelliptical model by Liu et al. (2012b).
However, the copula Gaussian assumption could still be restrictive: for example, if and are random variables satisfying , where and are i.i.d. , then does not satisfy the copula Gaussian assumption. To further relax the distributional assumption, Li et al. (2014) proposed a new statistical relation called the additive conditional independence as an alternative criterion for constructing the graphical model. This relation has the advantage of achieving nonparametric model flexibility without using a high-dimensional kernel, while obeying the same set of semi-graphoid axioms that govern the conditional independence (Dawid, 1979; Pearl and Verma, 1987). See also Lee et al. (2016b) and Li and Solea (2018a). Other approaches to nonparametric graphical models include Fellinghauer et al. (2013) and Voorman et al. (2013).
In this paper, instead of relying on additivity to avoid the curse of dimensionality, we apply the recently developed nonlinear sufficient dimension reduction (Lee et al., 2013; Li, 2018b) to achieve this goal. The estimation proceeds in two steps: first, we use nonlinear sufficient dimension reduction to reduce $X^{-(i,j)}$ to a low-dimensional random vector of sufficient predictors; second, we use the kernel method to construct a nonparametric graphical model based on $X^i$, $X^j$, and the dimension-reduced random vectors. The main differences between this approach and Li et al. (2014) are, first, we are able to retain conditional independence as the criterion for constructing the network, which is a widely accepted criterion with a more direct interpretation, and second, we are no longer restricted by the additive structure in the graphical model. Another attractive feature of our method is due to the “kernel trick”: its computational complexity depends on the sample size rather than on the sizes of the networks.
The rest of the paper is organized as follows. In Sections 2 and 3, we introduce the sufficient graphical model and describe its estimation method at the population level. In Section 4 we lay out the detailed algorithms to implement the method. In Section 5 we develop the asymptotic properties such as estimation consistency, variable selection consistency, and convergence rates. In Section 6, we conduct simulation studies to compare our method with the existing methods. In Section 7, we apply our method to the DREAM 4 Challenge gene network data set. Section 8 concludes the paper with some further discussions. Due to limited space we put all proofs and some additional results in the Supplementary Material.
2 Sufficient graphical model
In classical sufficient dimension reduction, we seek the lowest dimensional subspace $\mathcal{S}$ of $\mathbb{R}^p$ such that, after projecting $X \in \mathbb{R}^p$ onto $\mathcal{S}$, the information about the response $Y$ is preserved; that is, $Y \perp\!\!\!\perp X \mid P_{\mathcal{S}} X$, where $P_{\mathcal{S}}$ is the projection onto $\mathcal{S}$. This subspace is called the central subspace, written as $\mathcal{S}_{Y|X}$. See, for example, Li (1991), Cook (1994), and Li (2018b). Li et al. (2011) and Lee et al. (2013) extended this framework to the nonlinear setting by considering the more general problem: $Y \perp\!\!\!\perp X \mid \mathcal{G}$, where $\mathcal{G}$ is a sub-$\sigma$-field of the $\sigma$-field generated by $X$. The class of functions in a Hilbert space that are measurable with respect to the smallest such $\mathcal{G}$ is called the central class, written as $\mathfrak{S}_{Y|X}$. Li et al. (2011) introduced the Principal Support Vector Machine, and Lee et al. (2013) generalized the Sliced Inverse Regression (Li, 1991) and the Sliced Average Variance Estimate (Cook and Weisberg, 1991) to estimate the central class. Precursors of this theory include Bach and Jordan (2002), Wu (2008), and Wang (2008).
To link this up with the statistical graphical model, let $(\Omega, \mathcal{F}, P)$ be a probability space, $(\Omega_X, \mathcal{B}_X)$ a Borel measurable space with $\Omega_X \subseteq \mathbb{R}^p$, and $X: \Omega \to \Omega_X$ a random vector with distribution $P_X$. The $i$th component of $X$ is denoted by $X^i$, and its range by $\Omega_{X^i}$. We assume $\Omega_X = \Omega_{X^1} \times \cdots \times \Omega_{X^p}$. Let $\Gamma$ and $E$ be as defined in the Introduction, and let $\sigma(X^{-(i,j)})$ denote the $\sigma$-field generated by $X^{-(i,j)}$. We assume, for each $(i,j)$ with $i < j$, that there is a proper sub-$\sigma$-field $\mathcal{G}^{-(i,j)}$ of $\sigma(X^{-(i,j)})$ such that
$$(X^i, X^j) \perp\!\!\!\perp X^{-(i,j)} \mid \mathcal{G}^{-(i,j)}. \qquad (3)$$
Without loss of generality, we assume $\mathcal{G}^{-(i,j)}$ is the smallest sub-$\sigma$-field of $\sigma(X^{-(i,j)})$ that satisfies the above relation; that is, $\mathcal{G}^{-(i,j)}$ is the central $\sigma$-field for $(X^i, X^j)$ versus $X^{-(i,j)}$. There are plenty of examples of joint distributions of $X$ for which condition (3) holds for every pair $(i,j)$: see Section S10 of the Supplementary Material. Using the properties of conditional independence developed in Dawid (1979) (with a detailed proof given in Li (2018b)), we can show that (3) implies the following equivalence.
Theorem 1
If (3) holds, then $X^i \perp\!\!\!\perp X^j \mid X^{-(i,j)}$ if and only if $X^i \perp\!\!\!\perp X^j \mid \mathcal{G}^{-(i,j)}$.
This equivalence motivates us to use $X^i \perp\!\!\!\perp X^j \mid \mathcal{G}^{-(i,j)}$ as the criterion to construct the graph after performing nonlinear sufficient dimension reduction of $(X^i, X^j)$ versus $X^{-(i,j)}$ for each $(i,j)$, $i < j$.
Definition 2
We call the graph $G = (\Gamma, E)$ with edge set defined by $(i,j) \notin E \Leftrightarrow X^i \perp\!\!\!\perp X^j \mid \mathcal{G}^{-(i,j)}$ a sufficient graphical model.
3 Estimation: population-level development
The estimation of the sufficient graphical model involves two steps: the first is to use nonlinear sufficient dimension reduction to estimate, for each pair $(i,j)$, a set of sufficient predictors that generates the central $\sigma$-field $\mathcal{G}^{-(i,j)}$; the second is to construct a graph $G$ based on the reduced data.
In this section we describe the two steps at the population level. To do so, we need some preliminary concepts such as the covariance operator between two reproducing kernel Hilbert spaces, the mean element in a reproducing kernel Hilbert space, the inverse of an operator, as well as the centered reproducing kernel Hilbert space. These concepts are defined in the Supplementary Material, Section S1.2. A fuller development of the related theory can be found in Li (2018b). The symbols $\mathrm{ran}(\cdot)$ and $\overline{\mathrm{ran}}(\cdot)$ will be used to denote the range and the closure of the range of a linear operator.
3.1 Step 1: Nonlinear dimension reduction
We use generalized sliced inverse regression (Lee et al., 2013; Li, 2018b) to perform the nonlinear dimension reduction. For each pair $(i,j)$, $i < j$, let $\Omega_{X^{-(i,j)}}$ be the range of $X^{-(i,j)}$, which is the Cartesian product of $\Omega_{X^1}, \ldots, \Omega_{X^p}$ with $\Omega_{X^i}$ and $\Omega_{X^j}$ removed. Let
be a positive semidefinite kernel. Let be the centered reproducing kernel Hilbert space generated by . Let , , and be the similar objects defined for .
Assumption 1
This is a very mild assumption that is satisfied by most kernels. Under this assumption, the following covariance operators are well defined:
For the formal definition of the covariance operator, see S1.2. Next, we introduce the regression operator from to . For this purpose we need to make the following assumption.
Assumption 2
.
As argued in Li (2018b), this assumption can be interpreted as a type of collective smoothness in the relation between and : intuitively, it requires the operator sends all the input functions to the low-frequency domain of the operator . Under Assumption 2, the linear operator
is defined, and we call it the regression operator from to . The meaning of the inverse is defined in Section S1.2 in the Supplementary Material. The regression operator in this form was formally defined in Lee et al. (2016a), but earlier forms existed in Fukumizu et al. (2004); see also Li (2018a).
Assumption 3
is a finite-rank operator, with rank .
Intuitively, this assumption means that filters out the high frequency functions of , so that, for any , is relatively smooth. It will be violated, for example, if one can find an that makes arbitrarily choppy. The regression operator plays a crucial role in nonlinear sufficient dimension reduction. Let be the -space with respect to the distribution of . As shown in Lee et al. (2013), the closure of the range of the regression operator is equal to the central subspace; that is,
(4)
under the following assumption.
Assumption 4
1. is dense in modulo constants; that is, for any and any , there is a such that ;
2. is sufficient and complete.
The first condition essentially requires the kernel to be a universal kernel with respect to the $L_2$-norm. It means the reproducing kernel Hilbert space is rich enough to approximate any $L_2$-function arbitrarily closely. For example, it is satisfied by the Gaussian radial basis function kernel, but not by the polynomial kernel. For more information on universal kernels, see Sriperumbudur, Fukumizu, and Lanckriet (2011). The completeness in the second condition is defined in Lee, Li, and Chiaromonte (2013), and is similar to the classical notion of completeness, with the conditioning variable playing the role of the parameter. Lee, Li, and Chiaromonte (2013) showed that completeness is a mild condition, and is satisfied by most nonparametric models.
A basis of the central class can be found by solving the generalized eigenvalue problem: for ,
(5)
where is any nonsingular and self-adjoint operator, and is the inner product in . That is, if are the first eigenfunctions of this eigenvalue problem, then they span the central class. This type of estimate of the central class is called generalized sliced inverse regression. Convenient choices of are the identity mapping or the operator . If we use the latter, then we need the following assumption.
Assumption 5
.
This assumption has a similar interpretation to Assumption 2; see Section S11 in the Supplementary Material. At the population level, choosing to be achieves better scaling because it down-weights those components of the output of with larger variances. However, if the sample size is not sufficiently large, involving an estimate of in the procedure could incur extra variation that overwhelms the benefit brought by . In this case, a nonrandom operator such as is preferable. In this paper we use . Let denote the random vector . The set of random vectors is the output of the nonlinear sufficient dimension reduction step.
3.2 Step 2: Estimation of sufficient graphical model
To estimate the edge set of the sufficient graphical model we need to find a way to determine whether is true. We use a linear operator introduced by Fukumizu et al. (2008) to perform this task, which is briefly described as follows. Let , , be random vectors taking values in measurable spaces , , and . Let , , , and . Let
be positive kernels. For example, for , returns a real number denoted by . Let , , and be the centered reproducing kernel Hilbert spaces generated by the kernels , , and . Define the covariance operators
(6)
as before. The following definition is due to Fukumizu et al. (2008). Since it plays a special role in this paper, we give it a name, the “conjoined conditional covariance operator,” which figuratively depicts its form.
Definition 3
Suppose
1. If is , or , or , then ;
2. , .
Then the operator is called the conjoined conditional covariance operator between and given .
The word “conjoined” describes the peculiar way in which appears in and , which differs from an ordinary conditional covariance operator, where these operators are replaced by and . The following proposition is due to Fukumizu et al. (2008), a proof of a special case of which is given in Fukumizu et al. (2004).
Proposition 4
Suppose
1. is probability determining;
2. for each , the function belongs to ;
3. for each , the function belongs to .
Then if and only if .
The notion of probability determining in the context of reproducing kernel Hilbert spaces was defined in Fukumizu et al. (2004). For a generic random vector , a reproducing kernel Hilbert space based on a kernel is probability determining if and only if the mapping is injective. Intuitively, this requires the family of expectations to be rich enough to identify . For example, the Gaussian radial basis function is probability determining, but a polynomial kernel is not. We apply the above proposition to for each , . Let
be a positive definite kernel, and the centered reproducing kernel Hilbert space generated by . Similarly, let be a positive kernel, and the centered reproducing kernel Hilbert space generated by .
Assumption 6
Under this assumption, the conjoined conditional covariance operator is well defined and has the following property.
Corollary 5
Under Assumption 6, we have
This corollary motivates us to estimate the graph by thresholding the norm of the estimated conjoined conditional covariance operator.
4 Estimation: sample-level implementation
4.1 Implementation of step 1
Let be an i.i.d. sample of . At the sample level, the centered reproducing kernel Hilbert space is spanned by the functions
(7)
where stands for the function , and the function .
We estimate the covariance operators and by
respectively. We estimate by the Tychonoff-regularized inverse where is the identity operator. The regularized inverse is used to avoid overfitting. It plays the same role as ridge regression (Hoerl and Kennard, 1970), which alleviates overfitting by adding a multiple of the identity matrix to the sample covariance matrix before inverting it.
At the sample level, the generalized eigenvalue problem (5) takes the following form: at the th iteration,
(8)
where are the maximizers in the previous steps. The first eigenfunctions are an estimate of a basis in the central class .
Let be the matrix whose th entry is , , and . Let be the first eigenvectors of the matrix
Let for . As shown in Section S12.2, the eigenfunctions are calculated by
The statistics , , will be used as the input for the second step.
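The computation above can be summarized in a short sketch. The code below is one coordinate-level implementation of regularized kernel generalized sliced inverse regression for a single pair $(i,j)$, assuming Gaussian radial basis function kernels, doubly centered Gram matrices, and the regularized inverse of the covariance operator of $(X^i, X^j)$ as the weighting in (5); the paper's exact matrices are derived in Section S12.2, so the objective and constraint matrices, the small jitter, and the function names here are illustrative choices rather than a transcription of the paper.

```python
import numpy as np
from scipy.linalg import eigh
from scipy.spatial.distance import pdist, squareform

def rbf_gram(U, gamma):
    """Gaussian RBF Gram matrix exp(-gamma * ||u_a - u_b||^2) for the rows of U."""
    return np.exp(-gamma * squareform(pdist(U, "sqeuclidean")))

def center(K):
    """Doubly center a Gram matrix: Q K Q with Q = I - 11'/n."""
    n = K.shape[0]
    Q = np.eye(n) - np.ones((n, n)) / n
    return Q @ K @ Q

def gsir_predictors(X_rest, X_pair, gamma_x, gamma_y, eps_x, eps_y, d=2):
    """In-sample sufficient predictors of (X^i, X^j) versus X^{-(i,j)}.

    Returns an (n, d) matrix whose columns are the first d estimated
    eigenfunctions evaluated at the sample points.
    """
    n = X_rest.shape[0]
    Gx = center(rbf_gram(X_rest, gamma_x))   # centered Gram matrix of X^{-(i,j)}
    Gy = center(rbf_gram(X_pair, gamma_y))   # centered Gram matrix of (X^i, X^j)

    # Objective: coordinate form of Sigma_XY (Sigma_YY + eps_y I)^{-1} Sigma_YX.
    Ry = Gy @ np.linalg.inv(Gy + n * eps_y * np.eye(n))
    A = Gx @ Ry @ Gx
    # Constraint: Tychonoff-regularized Sigma_XX norm (jitter keeps it positive definite).
    B = Gx @ Gx + n * eps_x * Gx + 1e-8 * np.eye(n)

    # Generalized eigenvalue problem A c = lambda B c; keep the leading d eigenvectors.
    _, C = eigh(A, B)
    C = C[:, ::-1][:, :d]
    return Gx @ C                            # in-sample evaluations of the eigenfunctions
```

In practice the bandwidths gamma_x and gamma_y would be set by (11) and the regularization constants eps_x and eps_y by the generalized cross validation of Section 4.3; they are left as inputs here.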
4.2 Implementation of step 2
This step consists of estimating the conjoined conditional covariance operators for each and thresholding their norms. At the sample level, the centered reproducing kernel Hilbert spaces generated by the kernels , , and are
where, for example, denotes the function
and denotes the function
We estimate the covariance operators , , , and by
(9)
respectively. We then estimate the conjoined conditional covariance operator by
where, again, we have used Tychonoff regularization to estimate the inverted covariance operator . Let , , and be the Gram matrices
and , , and their centered versions
As shown in Section S12 in the Supplementary Material,
where is the Frobenius norm. Estimation of the edge set is then based on thresholding this norm; that is,
for some chosen .
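The following sketch assembles step 2 from the quantities just described: it evaluates the Frobenius norm of the estimated conjoined conditional covariance operator from the three centered Gram matrices and thresholds it. The trace expression below is one standard coordinate form of the plug-in estimator (in the notation of this sketch, the sample version of the operator minus its Tychonoff-regularized correction term); the paper's exact expression is derived in Section S12, so treat this as an assumed equivalent form rather than a transcription.

```python
import numpy as np

def ccco_norm(Gi, Gj, Gu, eps):
    """Frobenius norm of the estimated conjoined conditional covariance operator.

    Gi, Gj : centered Gram matrices of the kernels on (X^i, U) and (X^j, U)
    Gu     : centered Gram matrix of the kernel on the sufficient predictor U
    eps    : Tychonoff regularization constant for the inverted covariance operator
    """
    n = Gu.shape[0]
    # Coefficient matrix of the plug-in operator in the span of the centered
    # features: M = n [I - G_U (G_U + n*eps*I)^{-1}].
    M = n * (np.eye(n) - Gu @ np.linalg.inv(Gu + n * eps * np.eye(n)))
    # Squared Frobenius (Hilbert-Schmidt) norm = n^{-4} tr(M G_i M G_j).
    return np.sqrt(np.trace(M @ Gi @ M @ Gj)) / n ** 2

def sufficient_graphical_model(grams, eps, threshold):
    """Edge set estimate: grams maps each pair (i, j) to its (Gi, Gj, Gu)."""
    norms = {pair: ccco_norm(Gi, Gj, Gu, eps) for pair, (Gi, Gj, Gu) in grams.items()}
    return {pair for pair, v in norms.items() if v > threshold}, norms
```

The sufficient predictors feeding into Gu, Gi, and Gj are the outputs of step 1 (for example, the gsir_predictors sketch above).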
4.3 Tuning
We have three types of tuning constants: those for the kernels, those for Tychonoff regularization, and the threshold . For the Tychonoff regularization, we have and for step 1, and for step 2. In this paper we use the Gaussian radial basis function as the kernel:
$$\kappa(u, v) = \exp\left(-\gamma \|u - v\|^2\right). \qquad (10)$$
For each , we have five ’s to determine: for the kernel , for , for , for , and for , which are chosen by the following formula (see, for example, Li (2018b))
(11)
where are the samples of random vectors corresponding to the five kernels mentioned above. For example, for the kernel , . For the tuning parameters in Tychonoff regularization, we use the following generalized cross validation scheme (GCV; see Golub et al. (1979)):
(12)
where are positive semidefinite matrices, and is the largest eigenvalue of . The matrices and are the following matrices for the three tuning parameters:
1. , for ;
2. , for ;
3. , for .
We minimize (12) over a grid to choose , as detailed in Section 6.
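Since the exact display (12) is not reproduced above, the sketch below implements a generic generalized cross validation criterion of the Golub et al. (1979) type that is consistent with the description: a ridge-type smoother built from one positive semidefinite matrix, with the regularization scaled by its largest eigenvalue, is scored against another matrix. The score formula, the pairing of matrices with each tuning parameter, and the grid values are assumptions for illustration, not the paper's definitions.

```python
import numpy as np

def gcv_select(A, B, eps_grid):
    """Pick the regularization constant minimizing a Golub-Heath-Wahba-type GCV score.

    For each eps, the smoother is S = B (B + eps * lmax * I)^{-1}, where lmax is
    the largest eigenvalue of B, and the score is ||(I - S) A||_F^2 / [tr(I - S)]^2.
    A and B are the positive semidefinite matrices paired with the tuning
    parameter in question (see the list above).
    """
    n = B.shape[0]
    lmax = np.linalg.eigvalsh(B)[-1]
    scores = []
    for eps in eps_grid:
        S = B @ np.linalg.inv(B + eps * lmax * np.eye(n))
        resid = (np.eye(n) - S) @ A
        scores.append(np.linalg.norm(resid, "fro") ** 2 / np.trace(np.eye(n) - S) ** 2)
    return eps_grid[int(np.argmin(scores))]

# Hypothetical grid; the actual grid used in the simulations is specified in Section 6.
eps_grid = [10.0 ** k for k in range(-6, 1)]
```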
We also use generalized cross validation to determine the thresholding parameter . Let be the estimated edge set using a threshold , and, for each , let be the subset of components of in the neighborhood of node in the graph . The basic idea is to apply generalized cross validation to the regression of the feature of on the feature of . The generalized cross validation criterion for this regression takes the form
(13)
where , and is the kernel matrix for the sample of . We minimize over the grid to determine the optimal threshold .
Regarding the selection of the dimension of , to our knowledge there has been no systematic procedure available for determining the dimension of the central class in nonlinear sufficient dimension reduction. While some recently developed order-determination methods for linear sufficient dimension reduction, such as the ladle estimate and the predictor augmentation estimator (Luo and Li, 2016, 2020), may be generalizable to the nonlinear sufficient dimension reduction setting, we leave this topic to future research. Our experience and intuition indicate that a small dimension, such as 1 or 2, for the central class is sufficient in most cases. For example, in the classical nonparametric regression problem with , the dimension of the central class is by definition equal to 1.
5 Asymptotic theory
In this section we develop the consistency and convergence rates of our estimate and related operators. The challenge of this analysis is that our procedure involves two steps: we first extract the sufficient predictor using one set of kernels, and then substitute it into another set of kernels to get the final result. Thus we need to understand how the error propagates from the first step to the second. We also develop the asymptotic theory allowing the dimension of $X$ to go to infinity with the sample size, which is presented in the Supplementary Material.
5.1 Overview
Our goal is to derive the convergence rate of
as this is the quantity we threshold to determine the edge set. By the triangle inequality,
So we need to derive the convergence rates of the following quantities:
(14)
where, to avoid overly crowded subscripts, we have used to denote when it occurs as a subscript. The first and third convergence rates can be derived using the asymptotic tools for linear operators developed in Fukumizu et al. (2007), Li and Song (2017), Lee et al. (2016a), and Solea and Li (2020). The second convergence rate is, however, a new problem, and it will also be useful in similar settings that require constructing estimators based on predictors extracted by sufficient dimension reduction. In some sense, this is akin to the post dimension reduction problem considered in Kim et al. (2020).
In the following, if and are sequences of positive numbers, then we write if . We write if . We write if either or . Because is fixed in the asymptotic development, and also to emphasize the dependence on , in the rest of this section we denote , , and by , , and , respectively.
5.2 Transparent kernel
We first develop what we call the “transparent kernel” that passes information from step 1 to step 2 efficiently. Let be a nonempty set, and a positive kernel.
Definition 6
We say that is a transparent kernel if, for each , the function is twice differentiable and
1. ;
2. the matrix has a bounded operator norm; that is, there exist such that
for all , where and indicate the largest and smallest eigenvalues.
For example, the Gaussian radial basis function kernel is transparent, but the exponential kernel is not. This condition implies a type of Lipschitz continuity in a setting that involves two reproducing kernels and , where the argument of is the evaluation of a member of the reproducing kernel Hilbert space generated by .
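As an illustration of the smoothness involved (the precise conditions of Definition 6 are only partly visible above, so the computation below is only meant to indicate why the Gaussian kernel qualifies while the exponential kernel does not), the derivatives of the Gaussian radial basis function kernel are explicit:
\[
\kappa(u,v) = e^{-\gamma\|u-v\|^2}, \qquad
\frac{\partial \kappa(u,v)}{\partial u} = -2\gamma\,(u-v)\,e^{-\gamma\|u-v\|^2}\Big|_{v=u} = 0,
\]
\[
\frac{\partial^2 \kappa(u,v)}{\partial u\,\partial v^{\top}}
= \bigl(2\gamma I - 4\gamma^2 (u-v)(u-v)^{\top}\bigr)\,e^{-\gamma\|u-v\|^2}\Big|_{v=u} = 2\gamma I,
\]
so the kernel is twice differentiable, its gradient vanishes on the diagonal, and the mixed second-derivative matrix on the diagonal has all eigenvalues equal to $2\gamma$, hence bounded above and below. By contrast, the exponential kernel $e^{-\gamma\|u-v\|}$ is not differentiable at $u = v$.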
Theorem 7
Suppose is the reproducing kernel Hilbert space generated by , is the -fold Cartesian product of with inner product defined by
where and are members of , is the reproducing kernel Hilbert space generated by . Then:
(i) for any , we have
(ii) if is a transparent kernel, then there exists a such that, for each and ,
A direct consequence of this theorem is that, if is an estimate of some , a member of , with for some , is a linear operator estimated from the sample (and perhaps some other random vectors), and is a linear operator estimated from the sample , then,
(15)
This result is somewhat surprising, because sample estimates such as can be viewed as , where is an estimate of a function in a functional space with norm and is an operator-valued function. If for some , then it is not necessarily true that
particularly when is an infinite dimensional object. Yet relation (15) states exactly this. The reason behind this is that the reproducing kernel property separates the function and its argument (i.e. ), which implies a type of uniformity among . This point will be made clear in the proof in the Supplementary Material. Statement (15) is made precise by the next theorem.
Theorem 8
Suppose conditions (1) and (2) of Definition 3 are satisfied with , , therein replaced by , , and . Suppose, furthermore:
(a) , , and are transparent kernels;
(b) for some .
Then
(i) ;
(ii) ;
(iii) .
Using Theorem 8 we can derive the convergence rate of .
Theorem 9
Suppose conditions in Theorem 8 are satisfied and, furthermore,
(a) and are bounded linear operators;
(b) .
Then
Note that, unlike in Theorem 8, where our assumptions imply
is a finite-rank operator, here we do not assume to be a finite-rank (or even Hilbert-Schmidt) operator; instead, we assume it to be a bounded operator. This is because contains , which makes it unreasonable to assume to be finite-rank or Hilbert-Schmidt. For example, when is a constant, is the same as and is not a Hilbert-Schmidt operator, though it is bounded. Theorem 9 shows that the convergence rate of (ii) in (14) is the same as the convergence rate of (i) in (14); it now remains to derive the convergence rates of (i) and (iii).
5.3 Convergence rates of (i) and (iii) in (14)
We first present the convergence rate of to . The proof is similar to that of Theorem 5 of Li and Song (2017) but with two differences. First, Li and Song (2017) took in (5) to be , whereas we take it to be . In particular, the generalized sliced inverse regression in Li and Song (2017) only has one tuning parameter , but we have two tuning parameters and . Second, Li and Song (2017) defined (in the current notation) to be the eigenfunctions of
which is different from the generalized eigenvalue problem (5). For these reasons we need to re-derive the convergence rate of .
Theorem 10
Suppose
(a) Assumption 1 is satisfied;
(b) is a finite-rank operator with
(c) , ;
(d) for each , .
Then,
An immediate consequence is that, under the transparent kernel assumption, the in Theorem 9 is the same as this rate. We next derive the convergence rate in (iii) of (14). This rate depends on the tuning parameter in the estimate of the conjoined conditional covariance operator, and it reaches for the optimal choice of .
Theorem 11
Suppose conditions (1) and (2) of Definition 3 are satisfied with , , therein replaced by , , and . Suppose, furthermore,
(a) and are bounded linear operators;
(b) .
Then
Consequently, if , then
Finally, we combine Theorems 9 through 11 to obtain the convergence rate of . Since there are numerous cross-references among the conditions in these theorems, to make a clear presentation we list all the original conditions in the next theorem, even if they have already appeared. These conditions are of two categories: those for step 1, which involves sufficient dimension reduction of versus , and those for step 2, which involves the estimation of the conjoined conditional covariance operator. We refer to them as the first-level and second-level conditions, respectively.
Theorem 12
Suppose the following conditions hold:
(a) (First-level kernel) for and ;
(b) (First-level operator) is a finite-rank operator with rank and all the nonzero eigenvalues of are distinct;
(c) (First-level tuning parameters) , , ;
(d) (Second-level kernel) is satisfied for , , and ; furthermore, they are transparent kernels;
(e) (Second-level operators) and are bounded linear operators;
(f) (Second-level tuning parameter) .
Then
(16)
Using this result we immediately arrive at the variable selection consistency of the sufficient graphical model.
Corollary 13
5.4 Optimal rates of tuning parameters
The convergence rate in Theorem 12 depends on and explicitly, and implicitly (in the sense that is optimal for fixed and ). Intuitively, when , , and increase, the biases increase and variances decrease; when they decrease, the biases decrease and the variances increase. Thus there should be critical rates for them that balance the bias and variance, which are the optimal rates.
Theorem 14
Under the conditions in Theorem 12, if , , and are of the form , , and for some , , and , then
(i) the optimal rates of the tuning parameters are
(ii) the optimal convergence rate of the estimated conjoined conditional covariance operator is
Note that there is a range of optimal rates; this is because the convergence rate does not have a unique minimizer. This also means that the result is not very sensitive to this tuning parameter.
In the above asymptotic analysis, we have treated the dimension of $X$ as fixed as the sample size goes to infinity. We have also developed the consistency and convergence rates in the scenario where the dimension of $X$ goes to infinity with the sample size, which is placed in the Supplementary Material (Section S9) due to limited space.
6 Simulation
In this section we compare the performance of our sufficient graphical model with previous methods such as Yuan and Lin (2007), Liu et al. (2009), Voorman et al. (2013), Fellinghauer et al. (2013), Lee et al. (2016b), and a Naïve method which is based on the conjoined conditional covariance operator without the dimension reduction step.
By design, the sufficient graphical model has advantages over these existing methods under the following circumstances. First, since the sufficient graphical model does not make any distributional assumption, it should outperform Yuan and Lin (2007) and Liu et al. (2009) when the Gaussian or copula Gaussian assumptions are violated; second, due to the sufficient dimension reduction step, the sufficient graphical model avoids the curse of dimensionality and should outperform Voorman et al. (2013), Fellinghauer et al. (2013), and the Naïve method in the high-dimensional setting; third, since the sufficient graphical model does not require an additive structure, it should outperform Lee et al. (2016b) when there is severe nonadditivity in the model. Our simulation comparisons will reflect these aspects.
For the sufficient graphical model, Lee et al. (2016b), and the Naïve method, we use the Gaussian radial basis function as the kernel. The regularization constants , , and are chosen by the generalized cross validation criterion described in Section 4.3 with the grid . The kernel parameters , , , , and are chosen according to (11). Because the outcomes of tuning parameters are stable, for each model, we compute the generalized cross validation for the first five samples and use their average value for the rest of the simulation. The performance of each estimate is assessed using the averaged receiver operating characteristic curve as a function of the threshold . The accuracy of a method across all is measured by the area under the receiver operating characteristic curve.
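For concreteness, the sketch below shows one way to compute the receiver operating characteristic curve and its area for a single replication by sweeping the threshold over the estimated operator norms; it is an illustrative evaluation routine (the averaging over 50 replications and tie-handling are omitted), not code from the paper.

```python
import numpy as np

def roc_auc(norms, true_edges, all_pairs):
    """ROC curve and AUC for edge recovery, sweeping the threshold.

    norms      : dict mapping a pair (i, j) to its estimated operator norm
    true_edges : set of pairs (i, j) in the true edge set
    all_pairs  : list of all candidate pairs (i, j) with i < j
    """
    scores = np.array([norms[p] for p in all_pairs])
    labels = np.array([p in true_edges for p in all_pairs])
    order = np.argsort(-scores)              # larger norm => edge declared first
    labels = labels[order]
    tpr = np.concatenate([[0.0], np.cumsum(labels) / max(labels.sum(), 1)])
    fpr = np.concatenate([[0.0], np.cumsum(~labels) / max((~labels).sum(), 1)])
    return fpr, tpr, np.trapz(tpr, fpr)      # area under the ROC curve
```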
To isolate the factors that affect accuracy, we first consider two models with relatively small dimensions and large sample sizes, which are
where , are independent and identically distributed standard normal random variables. The edge sets of the two models are
We use for each model, and for each , we generate 50 samples to compute the averaged receiver operating characteristic curves. The dimension for the sufficient graphical model is taken to be 2 for all cases (we have also used and the results are very similar to those presented here). The plots in the first row of Figure 1 show the averaged receiver operating characteristic curves for the seven methods, with the following plotting symbol assignment:
| Method | Plotting symbol | Method | Plotting symbol |
|---|---|---|---|
| Sufficient graphical model | red solid line | Voorman et al. (2013) | red dotted line |
| Lee et al. (2016b) | black solid line | Fellinghauer et al. (2013) | black dotted line |
| Yuan and Lin (2007) | red dashed line | Naïve | blue dotted line |
| Liu et al. (2009) | black dashed line | | |
From these figures we see that the two top performers are clearly the sufficient graphical model and Lee et al. (2016b), and their performances are very similar. Note that neither of the two models satisfies the Gaussian or copula Gaussian assumption, which explains why the sufficient graphical model and Lee et al. (2016b) outperform Yuan and Lin (2007) and Liu et al. (2009). The sufficient graphical model and Lee et al. (2016b) also outperform Voorman et al. (2013), Fellinghauer et al. (2013), and the Naïve method, indicating that the curse of dimensionality already takes effect on the fully nonparametric methods. The three nonparametric estimators have similar performances. Also note that Model I has an additive structure, which explains the slight advantage of Lee et al. (2016b) over the sufficient graphical model in subfigure (a) of Figure 1; Model II is not additive, and the advantage of Lee et al. (2016b) disappears in subfigure (b) of Figure 1.
(Figure 1: averaged receiver operating characteristic curves of the seven methods under Models I–IV.)
We next consider two models with relatively high dimensions and small sample sizes. A convenient systematic way to generate larger networks is via the hub structure. We choose $p = 200$, and randomly generate ten hubs from the 200 vertices. For each hub, we randomly select a set of 19 vertices to form its neighborhood. With the network structures thus specified, our two probabilistic models are
and ’s are the same as in Models I and II. Note that, in Model III, the dependence of on is through the conditional mean , whereas in Model IV, the dependence is through the conditional variance . For each model, we choose two sample sizes and . The averaged receiver operating characteristic curves (again averaged over 50 samples) are presented in the second row of Figure 1. From the figures we see that, in the high-dimensional setting with , the sufficient graphical model substantially outperforms all the other methods, which clearly indicates the benefit of dimension reduction in constructing graphical models.
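A sketch of the hub-structure generation described above (ten hubs among 200 vertices, each with a randomly chosen neighborhood of 19 vertices) is given below; whether neighborhoods may overlap or include other hubs is not specified in the text, so this version simply samples them independently.

```python
import numpy as np

def hub_edge_set(p=200, n_hubs=10, nbhd_size=19, seed=0):
    """Randomly generate a hub-structured edge set on p vertices."""
    rng = np.random.default_rng(seed)
    hubs = rng.choice(p, size=n_hubs, replace=False)
    edges = set()
    for h in hubs:
        others = np.array([v for v in range(p) if v != h])
        for v in rng.choice(others, size=nbhd_size, replace=False):
            edges.add((min(int(h), int(v)), max(int(h), int(v))))
    return edges
```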
We now consider a Gaussian graphical model to investigate any efficiency loss incurred by the sufficient graphical model. Following a structure similar to that used in Li et al. (2014), we choose , , and the model
where the precision matrix has diagonal entries 1, 1, 1, 1.333, 3.010, 3.203, 1.543, 1.270, 1.544, 3, 1, 1, 1.2, 1, 1, 1, 1, 3, 2, 1, and nonzero off-diagonal entries , , , , , , . As expected, Figure 2 shows that Yuan and Lin (2007), Liu et al. (2009), and Lee et al. (2016b) perform better than the sufficient graphical model in this case. However, the sufficient graphical model still performs reasonably well and significantly outperforms the fully nonparametric methods.
(Figure 2: averaged receiver operating characteristic curves under the Gaussian graphical model, Model V.)
Finally, we conducted some simulations on the generalized cross validation criterion (13) for determining the threshold . We generated samples from Models I through V as described above, produced the receiver operating characteristic curves using the sufficient graphical model, and determined the threshold by (13). The results are presented in Figure S1 in the Supplementary Material. In each panel, the threshold determined by generalized cross validation is represented by a black dot on the red receiver operating characteristic curve.
7 Application
We now apply the sufficient graphical model to a data set from the DREAM 4 Challenge project and compare it with other methods. The goal of this challenge is to recover gene regulation networks from simulated steady-state data. A description of this data set can be found in Marbach et al. (2010). Since Lee et al. (2016b) already compared their method with Yuan and Lin (2007), Liu et al. (2009), Voorman et al. (2013), Fellinghauer et al. (2013), and the Naïve method on this data set and demonstrated its superiority among these estimators, here we focus on comparing the sufficient graphical model with Lee et al. (2016b) and with the champion method of the DREAM 4 Challenge.
The data set contains data from five networks, each of dimension 100 and sample size 201. We use the Gaussian radial basis function kernel for the sufficient graphical model and Lee et al. (2016b), with the tuning methods described in Section 4.3. For the sufficient graphical model, the dimensions are taken to be 1. We have also experimented with but the results (not presented here) show no significant difference. Because the true networks are available, we can compare the receiver operating characteristic curves and their areas under the curve, which are shown in Table 1.
Table 1: Areas under the receiver operating characteristic curves for the five DREAM 4 networks.

| Method | Network 1 | Network 2 | Network 3 | Network 4 | Network 5 |
|---|---|---|---|---|---|
| Sufficient graphical model | 0.85 | 0.81 | 0.83 | 0.83 | 0.79 |
| Lee et al. (2016b) | 0.86 | 0.81 | 0.83 | 0.83 | 0.77 |
| Champion | 0.91 | 0.81 | 0.83 | 0.83 | 0.75 |
| Naïve | 0.78 | 0.76 | 0.78 | 0.76 | 0.71 |
As we can see from Table 1, the sufficient graphical model has the same area under the receiver operating characteristic curve as Lee et al. (2016b) for Networks 2, 3, and 4, performs better than Lee et al. (2016b) for Network 5, but trails slightly behind Lee et al. (2016b) for Network 1; it has the same areas under the curve as the champion method for Networks 2, 3, and 4, performs better for Network 5, and performs worse for Network 1. Overall, the sufficient graphical model and Lee et al. (2016b) perform similarly on this data set, and they are on a par with the champion method. We should point out that the sufficient graphical model and Lee et al. (2016b) are purely empirical; they employ no knowledge about the underlying physical mechanism generating the gene expression data. However, according to Pinna et al. (2010), the champion method did use a differential equation that reflects the underlying physical mechanism. The results for threshold determination are presented in Figure S2 in the Supplementary Material.
8 Discussion
This paper is a first attempt to take advantage of the recently developed nonlinear sufficient dimension reduction methods to estimate the statistical graphical model nonparametrically while avoiding the curse of dimensionality. Nonlinear sufficient dimension reduction is used as a module and applied repeatedly to evaluate conditional independence, which leads to a substantial gain in accuracy in the high-dimensional setting. Compared with the Gaussian and copula Gaussian methods, our method is not affected by the violation of the Gaussian and copula Gaussian assumptions. Compared with the additive method (Lee et al., 2016b), our method does not require an additive structure and retains conditional independence as the criterion for determining the edges, which is a commonly accepted criterion. Compared with fully nonparametric methods, the sufficient graphical model avoids the curse of dimensionality and significantly enhances the performance.
The present framework opens up several directions for further research. First, the current model assumes that the central class is complete, so that generalized sliced inverse regression is an exhaustive nonlinear sufficient dimension reduction estimate. When this condition is violated, generalized sliced inverse regression is no longer exhaustive and we can employ other nonlinear sufficient dimension reduction methods, such as the generalized sliced averaged variance estimation (Lee et al., 2013; Li, 2018b), to recover the part of the central class that generalized sliced inverse regression misses. Second, though we have assumed that there is a proper sufficient sub-$\sigma$-field for each pair, the proposed estimation procedure is still justifiable when no such sub-$\sigma$-field exists. In this case, the estimated class is still the most important set of functions that characterize the statistical dependence of $(X^i, X^j)$ on $X^{-(i,j)}$, even though it is not sufficient. Without sufficiency, our method may be more appropriately called the Principal Graphical Model than the sufficient graphical model. Third, the current method can be extended to functional graphical models, which are common in medical applications such as EEG and fMRI. Several functional graphical models have been proposed recently, by Zhu et al. (2016), Qiao et al. (2019), Li and Solea (2018b), and Solea and Li (2020). The idea of a sufficient graph can be applied to this setting to improve efficiency.
This paper also contains some theoretical advances that are novel to nonlinear sufficient dimension reduction. For example, it introduces a general framework to characterize how the error of nonlinear sufficient dimension reduction propagates to the downstream analysis in terms of convergence rates. Furthermore, the results on convergence rates of various linear operators allowing the dimension of the predictor to go to infinity are the first of their kind in nonlinear sufficient dimension reduction. These advances will benefit the future development of sufficient dimension reduction in general, beyond the current context of estimating graphical models.
Acknowledgments
Bing Li’s research on this work was supported in part by the NSF Grant DMS-1713078. Kyongwon Kim’s work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. 2021R1F1A1046976, RS-2023-00219212) and by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (2021R1A6A1A10039823).
Supplementary Material
The Supplementary Material includes proofs of all theorems, lemmas, corollaries, and propositions in the paper, the asymptotic development for the high-dimensional setting, and some additional simulation plots for threshold determination.
References
- Bach and Jordan (2002) F. R. Bach and M. I. Jordan. Kernel independent component analysis. Journal of Machine Learning Research, 3:1–48, 2002.
- Bellman (1961) R Bellman. Curse of dimensionality. Adaptive control processes: a guided tour. Princeton, NJ, 1961.
- Bickel and Levina (2008) Peter J Bickel and Elizaveta Levina. Covariance regularization by thresholding. The Annals of Statistics, pages 2577–2604, 2008.
- Cook (1994) R Dennis Cook. Using dimension-reduction subspaces to identify important inputs in models of physical systems. In Proceedings of the section on Physical and Engineering Sciences, pages 18–25. American Statistical Association Alexandria, VA, 1994.
- Cook and Weisberg (1991) R Dennis Cook and Sanford Weisberg. Comment. Journal of the American Statistical Association, 86(414):328–332, 1991.
- Dawid (1979) A. P. Dawid. Conditional independence in statistical theory. Journal of the Royal Statistical Society. Series B (Methodological), pages 1–31, 1979.
- Fan and Li (2001) Jianqing Fan and Runze Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American statistical Association, 96(456):1348–1360, 2001.
- Fellinghauer et al. (2013) Bernd Fellinghauer, Peter Bühlmann, Martin Ryffel, Michael Von Rhein, and Jan D Reinhardt. Stable graphical model estimation with random forests for discrete, continuous, and mixed variables. Computational Statistics & Data Analysis, 64:132–152, 2013.
- Friedman et al. (2008) Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432–441, 2008.
- Fukumizu et al. (2004) Kenji Fukumizu, Francis R. Bach, and Michael I. Jordan. Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. Journal of Machine Learning Research, 5(Jan):73–99, 2004.
- Fukumizu et al. (2007) Kenji Fukumizu, Francis R Bach, and Arthur Gretton. Statistical consistency of kernel canonical correlation analysis. Journal of Machine Learning Research, 8(Feb):361–383, 2007.
- Fukumizu et al. (2008) Kenji Fukumizu, Arthur Gretton, Xiaohai Sun, and Bernhard Schölkopf. Kernel measures of conditional dependence. In Advances in neural information processing systems, pages 489–496, 2008.
- Golub et al. (1979) Gene H. Golub, Michael Heath, and Grace Wahba. Generalized cross-validation as a method for choosing a good ridge parameter. Technometrics, 1979.
- Guo et al. (2010) Jian Guo, Elizaveta Levina, George Michailidis, and Ji Zhu. Pairwise variable selection for high-dimensional model-based clustering. Biometrics, 66(3):793–804, 2010.
- Hoerl and Kennard (1970) A. E. Hoerl and R. W. Kennard. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12:55–67, 1970.
- Kim et al. (2020) Kyongwon Kim, Bing Li, Zhou Yu, and Lexin Li. On post dimension reduction statistical inference. to appear in The Annals of Statistics, 2020.
- Lam and Fan (2009) Clifford Lam and Jianqing Fan. Sparsistency and rates of convergence in large covariance matrix estimation. Annals of statistics, 37(6B):4254, 2009.
- Lauritzen (1996) Steffen L Lauritzen. Graphical models, volume 17. Clarendon Press, 1996.
- Lee et al. (2016a) K.-Y. Lee, B. Li, and H. Zhao. Variable selection via additive conditional independence. Journal of the Royal Statistical Society: Series B, 78:1037–1055, 2016a.
- Lee et al. (2013) Kuang-Yao Lee, Bing Li, and Francesca Chiaromonte. A general theory for nonlinear sufficient dimension reduction: Formulation and estimation. The Annals of Statistics, 41(1):221–249, 2013.
- Lee et al. (2016b) Kuang-Yao Lee, Bing Li, and Hongyu Zhao. On an additive partial correlation operator and nonparametric estimation of graphical models. Biometrika, 103(3):513–530, 2016b.
- Li et al. (2011) B. Li, A. Artemiou, and L. Li. Principal support vector machines for linear and nonlinear sufficient dimension reduction. The Annals of Statistics, 39:3182–3210, 2011.
- Li (2018a) Bing Li. Linear operator-based statistical analysis: A useful paradigm for big data. Canadian Journal of Statistics, 46(1):79–103, 2018a.
- Li (2018b) Bing Li. Sufficient Dimension Reduction: Methods and Applications with R. CRC Press, 2018b.
- Li and Solea (2018a) Bing Li and Eftychia Solea. A nonparametric graphical model for functional data with application to brain networks based on fMRI. Journal of the American Statistical Association, 113:1637–1655, 2018a.
- Li and Solea (2018b) Bing Li and Eftychia Solea. A nonparametric graphical model for functional data with application to brain networks based on fMRI. Journal of the American Statistical Association, 113:1637–1655, 2018b.
- Li and Song (2017) Bing Li and Jun Song. Nonlinear sufficient dimension reduction for functional data. The Annals of Statistics, 45(3):1059–1095, 2017.
- Li et al. (2014) Bing Li, Hyonho Chun, and Hongyu Zhao. On an additive semigraphoid model for statistical networks with application to pathway analysis. Journal of the American Statistical Association, 109(507):1188–1204, 2014.
- Li (1991) Ker-Chau Li. Sliced inverse regression for dimension reduction. Journal of the American Statistical Association, 86(414):316–327, 1991.
- Liu et al. (2009) Han Liu, John Lafferty, and Larry Wasserman. The nonparanormal: Semiparametric estimation of high dimensional undirected graphs. Journal of Machine Learning Research, 10(Oct):2295–2328, 2009.
- Liu et al. (2012a) Han Liu, Fang Han, Ming Yuan, John Lafferty, and Larry Wasserman. High-dimensional semiparametric Gaussian copula graphical models. The Annals of Statistics, 40(4):2293–2326, 2012a.
- Liu et al. (2012b) Han Liu, Fang Han, and Cun-Hui Zhang. Transelliptical graphical models. In Advances in neural information processing systems, pages 800–808, 2012b.
- Luo and Li (2016) Wei Luo and Bing Li. Combining eigenvalues and variation of eigenvectors for order determination. Biometrika, 103:875–887, 2016.
- Luo and Li (2020) Wei Luo and Bing Li. On order determination by predictor augmentation. Biometrika (To appear), 103:875–887, 2020.
- Marbach et al. (2010) Daniel Marbach, Robert J Prill, Thomas Schaffter, Claudio Mattiussi, Dario Floreano, and Gustavo Stolovitzky. Revealing strengths and weaknesses of methods for gene network inference. Proceedings of the national academy of sciences, 107(14):6286–6291, 2010.
- Meinshausen and Bühlmann (2006) Nicolai Meinshausen and Peter Bühlmann. High-dimensional graphs and variable selection with the lasso. The annals of statistics, pages 1436–1462, 2006.
- Pearl and Verma (1987) J. Pearl and T. Verma. The logic of representing dependencies by directed graphs. University of California (Los Angeles). Computer Science Department, 1987.
- Peng et al. (2009) Jie Peng, Pei Wang, Nengfeng Zhou, and Ji Zhu. Partial correlation estimation by joint sparse regression models. Journal of the American Statistical Association, 104(486):735–746, 2009.
- Pinna et al. (2010) Andrea Pinna, Nicola Soranzo, and Alberto de la Fuente. From knockouts to networks: establishing direct cause-effect relationships through graph analysis. PLoS One, 5(10), 2010.
- Qiao et al. (2019) Xinghao Qiao, Shaojun Guo, and Gareth M. James. Functional graphical models. Journal of the American Statistical Association, 114:211–222, 2019.
- Solea and Li (2020) Eftychia Solea and Bing Li. Copula Gaussian graphical models for functional data. Submitted to Journal of the American Statistical Association (under revision), 2020.
- Sriperumbudur et al. (2011) B. K. Sriperumbudur, K. Fukumizu, and G. R. G. Lanckriet. Universality, characteristic kernels and RKHS embedding of measures. Journal of Machine Learning Research, 12, 2011.
- Tibshirani (1996) Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288, 1996.
- Voorman et al. (2013) Arend Voorman, Ali Shojaie, and Daniela Witten. Graph estimation with joint additive models. Biometrika, 101(1):85–101, 2013.
- Wang (2008) Y. Wang. Nonlinear dimension reduction in feature space. PhD Thesis, The Pennsylvania State University, 2008.
- Wu (2008) H. M. Wu. Kernel sliced inverse regression with applications to classification. Journal of Computational and Graphical Statistics, 17(3):590–610, 2008.
- Xue and Zou (2012) Lingzhou Xue and Hui Zou. Regularized rank-based estimation of high-dimensional nonparanormal graphical models. The Annals of Statistics, 40(5):2541–2571, 2012.
- Yuan and Lin (2007) Ming Yuan and Yi Lin. Model selection and estimation in the Gaussian graphical model. Biometrika, 94(1):19–35, 2007.
- Zhu et al. (2016) Hongxiao Zhu, Nate Strawn, and David B. Dunson. Bayesian graphical models for multivariate functional data. Journal of Machine Learning Research, 14:1–27, 2016.
- Zou (2006) Hui Zou. The adaptive lasso and its oracle properties. Journal of the American statistical association, 101(476):1418–1429, 2006.