On Sufficient Graphical Models
Abstract
We introduce a sufficient graphical model by applying the recently developed nonlinear sufficient dimension reduction techniques to the evaluation of conditional independence. The graphical model is nonparametric in nature, as it does not make distributional assumptions such as the Gaussian or copula Gaussian assumptions. However, unlike a fully nonparametric graphical model, which relies on a high-dimensional kernel to characterize conditional independence, our graphical model is based on conditional independence given a set of sufficient predictors with a substantially reduced dimension. In this way we avoid the curse of dimensionality that comes with a high-dimensional kernel. We develop the population-level properties, convergence rate, and variable selection consistency of our estimate. By simulation comparisons and an analysis of the DREAM 4 Challenge data set, we demonstrate that our method outperforms the existing methods when the Gaussian or copula Gaussian assumptions are violated, and its performance remains excellent in the high-dimensional setting.
Keywords: conjoined conditional covariance operator, generalized sliced inverse regression, nonlinear sufficient dimension reduction, reproducing kernel Hilbert space
1 Introduction
In this paper we propose a new nonparametric statistical graphical model, which we call the sufficient graphical model, by incorporating the recently developed nonlinear sufficient dimension reduction techniques into the construction of distribution-free graphical models.
Let $G = (\Gamma, E)$ be an undirected graph consisting of a finite set of nodes $\Gamma = \{1, \ldots, p\}$ and a set of edges $E \subseteq \{(i,j) \in \Gamma \times \Gamma : i \neq j\}$. Since $(i,j)$ and $(j,i)$ represent the same edge in an undirected graph, we can assume without loss of generality that $i < j$. A statistical graphical model links $G$ with a random vector $X = (X^1, \ldots, X^p)$ through the conditional independence
$$(i,j) \notin E \;\Longleftrightarrow\; X^i \perp\!\!\!\perp X^j \mid X^{-(i,j)}, \qquad (1)$$
where $X^{-(i,j)} = \{X^k : k \in \Gamma, \, k \neq i, j\}$, and $\perp\!\!\!\perp$ means conditional independence. Thus, nodes $i$ and $j$ are connected if and only if $X^i$ and $X^j$ are dependent given $X^{-(i,j)}$. Our goal is to estimate the edge set $E$ based on a sample of $X$. See Lauritzen (1996).
One of the most popular statistical graphical models is the Gaussian graphical model, which assumes that $X \sim N(\mu, \Sigma)$. Under the Gaussian assumption, the conditional independence in (1) is encoded in the precision matrix $\Theta = \Sigma^{-1}$ in the following sense:
$$(i,j) \notin E \;\Longleftrightarrow\; \theta_{ij} = 0, \qquad (2)$$
where $\theta_{ij}$ is the $(i,j)$th entry of the precision matrix $\Theta$. By this equivalence, estimating $E$ amounts to identifying the positions of the zero entries of the precision matrix, which can be achieved by sparse estimation methods such as those of Tibshirani (1996), Fan and Li (2001), and Zou (2006). A variety of methods have been developed for estimating the Gaussian graphical model, which include, for example, Meinshausen and Bühlmann (2006), Yuan and Lin (2007), Bickel and Levina (2008), and Peng et al. (2009). See also Friedman et al. (2008), Guo et al. (2010), and Lam and Fan (2009).
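To make the equivalence (2) concrete, the following minimal sketch simulates Gaussian data with a known sparse precision matrix and reads the edges off a graphical lasso fit (Friedman et al., 2008), using scikit-learn's GraphicalLasso purely for illustration; the precision matrix, penalty value, and tolerance below are arbitrary choices and are not part of the method proposed in this paper.

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)
# True precision matrix: the (1,3) entry is zero, so nodes 1 and 3 are not connected.
theta = np.array([[1.0, 0.4, 0.0],
                  [0.4, 1.0, 0.3],
                  [0.0, 0.3, 1.0]])
X = rng.multivariate_normal(np.zeros(3), np.linalg.inv(theta), size=1000)

# Sparse estimate of the precision matrix; nonzero off-diagonal entries are edges.
fit = GraphicalLasso(alpha=0.05).fit(X)
edges = {(i, j) for i in range(3) for j in range(i + 1, 3)
         if abs(fit.precision_[i, j]) > 1e-6}
print(edges)   # ideally {(0, 1), (1, 2)}
```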
Since the Gaussian distribution assumption is restrictive, many recent advances have focused on relaxing this assumption. A main challenge in doing so is to avoid the curse of dimensionality (Bellman, 1961): a straightforward nonparametric extension would resort to a high-dimensional kernel, which is known to be ineffective. One way to relax the Gaussian assumption without invoking a high-dimensional kernel is to use the copula Gaussian distribution, which is the approach taken by Liu et al. (2009), Liu et al. (2012a), and Xue and Zou (2012), and is further extended to the transelliptical model by Liu et al. (2012b).
However, the copula Gaussian assumption could still be restrictive: for example, if and are random variables satisfying , where and are i.i.d. , then does not satisfy the copula Gaussian assumption. To further relax the distributional assumption, Li et al. (2014) proposed a new statistical relation called the additive conditional independence as an alternative criterion for constructing the graphical model. This relation has the advantage of achieving nonparametric model flexibility without using a high-dimensional kernel, while obeying the same set of semi-graphoid axioms that govern the conditional independence (Dawid, 1979; Pearl and Verma, 1987). See also Lee et al. (2016b) and Li and Solea (2018a). Other approaches to nonparametric graphical models include Fellinghauer et al. (2013) and Voorman et al. (2013).
In this paper, instead of relying on additivity to avoid the curse of dimensionality, we apply the recently developed nonlinear sufficient dimension reduction (Lee et al., 2013; Li, 2018b) to achieve this goal. The estimation proceeds in two steps: first, we use nonlinear sufficient dimension reduction to reduce $X^{-(i,j)}$ to a low-dimensional random vector of sufficient predictors; second, we use the kernel method to construct a nonparametric graphical model based on $X^i$, $X^j$, and the dimension-reduced random vectors. The main differences between this approach and Li et al. (2014) are, first, we are able to retain conditional independence as the criterion for constructing the network, which is a widely accepted criterion with a more direct interpretation, and second, we are no longer restricted by the additive structure in the graphical model. Another attractive feature of our method is due to the “kernel trick”: its computational complexity depends on the sample size rather than on the sizes of the networks.
The rest of the paper is organized as follows. In Sections 2 and 3, we introduce the sufficient graphical model and describe its estimation method at the population level. In Section 4 we lay out the detailed algorithms to implement the method. In Section 5 we develop the asymptotic properties such as estimation consistency, variable selection consistency, and convergence rates. In Section 6, we conduct simulation studies to compare our method with the existing methods. In Section 7, we apply our method to the DREAM 4 Challenge gene network data set. Section 8 concludes the paper with some further discussions. Due to limited space we put all proofs and some additional results in the Supplementary Material.
2 Sufficient graphical model
In classical sufficient dimension reduction, we seek the lowest dimensional subspace $\mathcal{S}$ of $\mathbb{R}^p$ such that, after projecting $X \in \mathbb{R}^p$ onto $\mathcal{S}$, the information about the response $Y$ is preserved; that is, $Y \perp\!\!\!\perp X \mid P_{\mathcal{S}} X$, where $P_{\mathcal{S}}$ is the projection onto $\mathcal{S}$. This subspace is called the central subspace, written as $\mathcal{S}_{Y|X}$. See, for example, Li (1991), Cook (1994), and Li (2018b). Li et al. (2011) and Lee et al. (2013) extended this framework to the nonlinear setting by considering the more general problem: $Y \perp\!\!\!\perp X \mid \mathcal{G}$, where $\mathcal{G}$ is a sub-$\sigma$-field of the $\sigma$-field generated by $X$. The class of functions in a Hilbert space that are measurable with respect to the smallest such $\mathcal{G}$ is called the central class, written as $\mathfrak{S}_{Y|X}$. Li et al. (2011) introduced the Principal Support Vector Machine, and Lee et al. (2013) generalized the Sliced Inverse Regression (Li, 1991) and the Sliced Average Variance Estimate (Cook and Weisberg, 1991) to estimate the central class. Precursors of this theory include Bach and Jordan (2002), Wu (2008), and Wang (2008).
To link this up with the statistical graphical model, let $(\Omega, \mathcal{F}, P)$ be a probability space, $(\Omega_X, \mathcal{B}_X)$ a Borel measurable space with $\Omega_X \subseteq \mathbb{R}^p$, and $X: \Omega \to \Omega_X$ a random vector with distribution $P_X$. The $i$th component of $X$ is denoted by $X^i$, and its range by $\Omega_{X^i}$. We assume $\Omega_X = \Omega_{X^1} \times \cdots \times \Omega_{X^p}$. Let $\Gamma$ and $E$ be as defined in the Introduction, and let $\sigma(X^{-(i,j)})$ denote the $\sigma$-field generated by $X^{-(i,j)}$. We assume, for each $(i,j)$ with $i < j$, that there is a proper sub-$\sigma$-field $\mathcal{G}^{-(i,j)}$ of $\sigma(X^{-(i,j)})$ such that
$$(X^i, X^j) \perp\!\!\!\perp X^{-(i,j)} \mid \mathcal{G}^{-(i,j)}. \qquad (3)$$
Without loss of generality, we assume $\mathcal{G}^{-(i,j)}$ is the smallest sub-$\sigma$-field of $\sigma(X^{-(i,j)})$ that satisfies the above relation; that is, $\mathcal{G}^{-(i,j)}$ is the central $\sigma$-field for $(X^i, X^j)$ versus $X^{-(i,j)}$. There are plenty of examples of joint distributions of $X$ for which condition (3) holds for every pair $(i,j)$: see Section S10 of the Supplementary Material. Using the properties of conditional independence developed in Dawid (1979) (with a detailed proof given in Li (2018b)), we can show that (3) implies the following equivalence.
Theorem 1
If (3) holds, then $X^i \perp\!\!\!\perp X^j \mid X^{-(i,j)}$ if and only if $X^i \perp\!\!\!\perp X^j \mid \mathcal{G}^{-(i,j)}$.
This equivalence motivates us to use $X^i \perp\!\!\!\perp X^j \mid \mathcal{G}^{-(i,j)}$ as the criterion to construct the graph after performing nonlinear sufficient dimension reduction of $(X^i, X^j)$ versus $X^{-(i,j)}$ for each $(i,j)$, $i < j$.
Definition 2
We call the graph $G = (\Gamma, E)$ with edge set defined by $(i,j) \notin E \Leftrightarrow X^i \perp\!\!\!\perp X^j \mid \mathcal{G}^{-(i,j)}$ a sufficient graphical model.
3 Estimation: population-level development
The estimation of the sufficient graphical model involves two steps: the first is to use nonlinear sufficient dimension reduction to estimate, for each pair $(i,j)$, a set of sufficient predictors that generates the central $\sigma$-field $\mathcal{G}^{-(i,j)}$; the second is to construct a graph $G$ based on the reduced data.
In this section we describe the two steps at the population level. To do so, we need some preliminary concepts such as the covariance operator between two reproducing kernel Hilbert spaces, the mean element in a reproducing kernel Hilbert space, the inverse of an operator, as well as the centered reproducing kernel Hilbert space. These concepts are defined in the Supplementary Material, Section S1.2. A fuller development of the related theory can be found in Li (2018b). The symbols $\mathrm{ran}(\cdot)$ and $\overline{\mathrm{ran}}(\cdot)$ will be used to denote the range and the closure of the range of a linear operator.
3.1 Step 1: Nonlinear dimension reduction
We use generalized sliced inverse regression (Lee et al., 2013; Li, 2018b) to perform the nonlinear dimension reduction. For each pair $(i,j)$, $i < j$, let $\Omega_{X^{-(i,j)}}$ be the range of $X^{-(i,j)}$, which is the Cartesian product of $\Omega_{X^1}, \ldots, \Omega_{X^p}$ with $\Omega_{X^i}$ and $\Omega_{X^j}$ removed. Let
be a positive semidefinite kernel. Let be the centered reproducing kernel Hilbert space generated by . Let , , and be the similar objects defined for .
Assumption 1
This is a very mild assumption that is satisfied by most kernels. Under this assumption, the following covariance operators are well defined:
For the formal definition of the covariance operator, see S1.2. Next, we introduce the regression operator from to . For this purpose we need to make the following assumption.
Assumption 2
.
As argued in Li (2018b), this assumption can be interpreted as a type of collective smoothness in the relation between and : intuitively, it requires the operator sends all the input functions to the low-frequency domain of the operator . Under Assumption 2, the linear operator
is defined, and we call it the regression operator from to . The meaning of the inverse is defined in Section S1.2 in the Supplementary Material. The regression operator in this form was formally defined in Lee et al. (2016a), but earlier forms existed in Fukumizu et al. (2004); see also Li (2018a).
Assumption 3
is a finite-rank operator, with rank .
Intuitively, this assumption means that filters out the high frequency functions of , so that, for any , is relatively smooth. It will be violated, for example, if one can find an that makes arbitrarily choppy. The regression operator plays a crucial role in nonlinear sufficient dimension reduction. Let be the -space with respect to the distribution of . As shown in Lee et al. (2013), the closure of the range of the regression operator is equal to the central subspace; that is,
(4)
under the following assumption.
Assumption 4
1. is dense in modulo constants; that is, for any and any , there is a such that ;
2. is sufficient and complete.
The first condition essentially requires the kernel to be a universal kernel with respect to the $L_2$-norm. It means the reproducing kernel Hilbert space is rich enough to approximate any $L_2$-function arbitrarily closely. For example, it is satisfied by the Gaussian radial basis function kernel, but not by the polynomial kernel. For more information on universal kernels, see Sriperumbudur, Fukumizu, and Lanckriet (2011). The completeness in the second condition is defined in Lee, Li, and Chiaromonte (2013), and is similar to the classical notion of completeness, with the conditioning variable playing the role of the parameter. Lee, Li, and Chiaromonte (2013) showed that completeness is a mild condition, and is satisfied by most nonparametric models.
A basis of the central class can be found by solving the generalized eigenvalue problem: for ,
(5)
where is any nonsingular and self-adjoint operator, and is the inner product in . That is, if are the first eigenfunctions of this eigenvalue problem, then they span the central class. This type of estimate of the central class is called generalized sliced inverse regression. Convenient choices of are the identity mapping or the operator . If we use the latter, then we need the following assumption.
Assumption 5
.
This assumption has a similar interpretation to Assumption 2; see Section S11 in the Supplementary Material. At the population level, choosing to be achieves better scaling because it down-weights those components of the output of with larger variances. However, if the sample size is not sufficiently large, involving an estimate of in the procedure could incur extra variation that overwhelms the benefit brought by . In this case, a nonrandom operator such as is preferable. In this paper we use . Let denote the random vector . The set of random vectors is the output of the nonlinear sufficient dimension reduction step.
3.2 Step 2: Estimation of sufficient graphical model
To estimate the edge set of the sufficient graphical model we need to find a way to determine whether is true. We use a linear operator introduced by Fukumizu et al. (2008) to perform this task, which is briefly described as follows. Let , , be random vectors taking values in measurable spaces , , and . Let , , , and . Let
be positive kernels. For example, for , returns a real number denoted by . Let , , and be the centered reproducing kernel Hilbert spaces generated by the kernels , , and . Define the covariance operators
(6)
as before. The following definition is due to Fukumizu et al. (2008). Since it plays a special role in this paper, we give it a name, the “conjoined conditional covariance operator,” which figuratively depicts its form.
Definition 3
Suppose
1. If is , or , or , then ;
2. , .
Then the operator is called the conjoined conditional covariance operator between and given .
The word “conjoined” describes the peculiar way in which appears in and , which differs from an ordinary conditional covariance operator, where these operators are replaced by and . The following proposition is due to Fukumizu et al. (2008), a proof of a special case of which is given in Fukumizu et al. (2004).
Proposition 4
Suppose
1. is probability determining;
2. for each , the function belongs to ;
3. for each , the function belongs to .
Then if and only if .
The notion of probability determining in the context of reproducing kernel Hilbert spaces was defined in Fukumizu et al. (2004). For a generic random vector , a reproducing kernel Hilbert space based on a kernel is probability determining if and only if the mapping is injective. Intuitively, this requires the family of expectations to be rich enough to identify . For example, the Gaussian radial basis function is probability determining, but a polynomial kernel is not. We apply the above proposition to for each , . Let
be a positive definite kernel, and the centered reproducing kernel Hilbert space generated by . Similarly, let be a positive kernel, and the centered reproducing kernel Hilbert space generated by .
Assumption 6
Under this assumption, the conjoined conditional covariance operator is well defined and has the following property.
Corollary 5
Under Assumption 6, we have
This corollary motivates us to estimate the graph by thresholding the norm of the estimated conjoined conditional covariance operator.
4 Estimation: sample-level implementation
4.1 Implementation of step 1
Let be an i.i.d. sample of . At the sample level, the centered reproducing kernel Hilbert space is spanned by the functions
(7)
where stands for the function , and the function .
We estimate the covariance operators and by
respectively. We estimate by the Tychonoff-regularized inverse where is the identity operator. The regularized inverse is used to avoid overfitting. It plays the same role as ridge regression (Hoerl and Kennard, 1970), which alleviates overfitting by adding a multiple of the identity matrix to the sample covariance matrix before inverting it.
At the sample level, the generalized eigenvalue problem (5) takes the following form: at the th iteration,
(8)
where are the maximizers in the previous steps. The first eigenfunctions are an estimate of a basis in the central class .
Let be the matrix whose th entry is , , and . Let be the first eigenvectors of the matrix
Let for . As shown in Section S12.2, the eigenfunctions are calculated by
The statistics , , will be used as the input for the second step.
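The computation above can be summarized in a short sketch. The code below is one coordinate-level implementation of regularized kernel generalized sliced inverse regression for a single pair $(i,j)$, assuming Gaussian radial basis function kernels, doubly centered Gram matrices, and the regularized inverse of the covariance operator of $(X^i, X^j)$ as the weighting in (5); the paper's exact matrices are derived in Section S12.2, so the objective and constraint matrices, the small jitter, and the function names here are illustrative choices rather than a transcription of the paper.

```python
import numpy as np
from scipy.linalg import eigh
from scipy.spatial.distance import pdist, squareform

def rbf_gram(U, gamma):
    """Gaussian RBF Gram matrix exp(-gamma * ||u_a - u_b||^2) for the rows of U."""
    return np.exp(-gamma * squareform(pdist(U, "sqeuclidean")))

def center(K):
    """Doubly center a Gram matrix: Q K Q with Q = I - 11'/n."""
    n = K.shape[0]
    Q = np.eye(n) - np.ones((n, n)) / n
    return Q @ K @ Q

def gsir_predictors(X_rest, X_pair, gamma_x, gamma_y, eps_x, eps_y, d=2):
    """In-sample sufficient predictors of (X^i, X^j) versus X^{-(i,j)}.

    Returns an (n, d) matrix whose columns are the first d estimated
    eigenfunctions evaluated at the sample points.
    """
    n = X_rest.shape[0]
    Gx = center(rbf_gram(X_rest, gamma_x))   # centered Gram matrix of X^{-(i,j)}
    Gy = center(rbf_gram(X_pair, gamma_y))   # centered Gram matrix of (X^i, X^j)

    # Objective: coordinate form of Sigma_XY (Sigma_YY + eps_y I)^{-1} Sigma_YX.
    Ry = Gy @ np.linalg.inv(Gy + n * eps_y * np.eye(n))
    A = Gx @ Ry @ Gx
    # Constraint: Tychonoff-regularized Sigma_XX norm (jitter keeps it positive definite).
    B = Gx @ Gx + n * eps_x * Gx + 1e-8 * np.eye(n)

    # Generalized eigenvalue problem A c = lambda B c; keep the leading d eigenvectors.
    _, C = eigh(A, B)
    C = C[:, ::-1][:, :d]
    return Gx @ C                            # in-sample evaluations of the eigenfunctions
```

In practice the bandwidths gamma_x and gamma_y would be set by (11) and the regularization constants eps_x and eps_y by the generalized cross validation of Section 4.3; they are left as inputs here.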
4.2 Implementation of step 2
This step consists of estimating the conjoined conditional covariance operators for each and thresholding their norms. At the sample level, the centered reproducing kernel Hilbert spaces generated by the kernels , , and are
where, for example, denotes the function
and denotes the function
We estimate the covariance operators , , , and by
(9)
respectively. We then estimate the conjoined conditional covariance operator by
where, again, we have used Tychonoff regularization to estimate the inverted covariance operator . Let , , and be the Gram matrices
and , , and their centered versions
As shown in Section S12 in the Supplementary Material,
where is the Frobenius norm. Estimation of the edge set is then based on thresholding this norm; that is,
for some chosen .
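The following sketch assembles step 2 from the quantities just described: it evaluates the Frobenius norm of the estimated conjoined conditional covariance operator from the three centered Gram matrices and thresholds it. The trace expression below is one standard coordinate form of the plug-in estimator (in the notation of this sketch, the sample version of the operator minus its Tychonoff-regularized correction term); the paper's exact expression is derived in Section S12, so treat this as an assumed equivalent form rather than a transcription.

```python
import numpy as np

def ccco_norm(Gi, Gj, Gu, eps):
    """Frobenius norm of the estimated conjoined conditional covariance operator.

    Gi, Gj : centered Gram matrices of the kernels on (X^i, U) and (X^j, U)
    Gu     : centered Gram matrix of the kernel on the sufficient predictor U
    eps    : Tychonoff regularization constant for the inverted covariance operator
    """
    n = Gu.shape[0]
    # Coefficient matrix of the plug-in operator in the span of the centered
    # features: M = n [I - G_U (G_U + n*eps*I)^{-1}].
    M = n * (np.eye(n) - Gu @ np.linalg.inv(Gu + n * eps * np.eye(n)))
    # Squared Frobenius (Hilbert-Schmidt) norm = n^{-4} tr(M G_i M G_j).
    return np.sqrt(np.trace(M @ Gi @ M @ Gj)) / n ** 2

def sufficient_graphical_model(grams, eps, threshold):
    """Edge set estimate: grams maps each pair (i, j) to its (Gi, Gj, Gu)."""
    norms = {pair: ccco_norm(Gi, Gj, Gu, eps) for pair, (Gi, Gj, Gu) in grams.items()}
    return {pair for pair, v in norms.items() if v > threshold}, norms
```

The sufficient predictors feeding into Gu, Gi, and Gj are the outputs of step 1 (for example, the gsir_predictors sketch above).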
4.3 Tuning
We have three types of tuning constants: those for the kernels, those for Tychonoff regularization, and the threshold . For the Tychonoff regularization, we have and for step 1, and for step 2. In this paper we use the Gaussian radial basis function as the kernel:
$$\kappa(u, v) = \exp\left(-\gamma \|u - v\|^2\right). \qquad (10)$$
For each , we have five ’s to determine: for the kernel , for , for , for , and for , which are chosen by the following formula (see, for example, Li (2018b))
(11)
where are the samples of random vectors corresponding to the five kernels mentioned above. For example, for the kernel , . For the tuning parameters in Tychonoff regularization, we use the following generalized cross validation scheme (GCV; see Golub et al. (1979)):
(12)
where are positive semidefinite matrices, and is the largest eigenvalue of . The matrices and are the following matrices for the three tuning parameters:
1. , for ;
2. , for ;
3. , for .
We minimize (12) over a grid to choose , as detailed in Section 6.
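Since the exact display (12) is not reproduced above, the sketch below implements a generic generalized cross validation criterion of the Golub et al. (1979) type that is consistent with the description: a ridge-type smoother built from one positive semidefinite matrix, with the regularization scaled by its largest eigenvalue, is scored against another matrix. The score formula, the pairing of matrices with each tuning parameter, and the grid values are assumptions for illustration, not the paper's definitions.

```python
import numpy as np

def gcv_select(A, B, eps_grid):
    """Pick the regularization constant minimizing a Golub-Heath-Wahba-type GCV score.

    For each eps, the smoother is S = B (B + eps * lmax * I)^{-1}, where lmax is
    the largest eigenvalue of B, and the score is ||(I - S) A||_F^2 / [tr(I - S)]^2.
    A and B are the positive semidefinite matrices paired with the tuning
    parameter in question (see the list above).
    """
    n = B.shape[0]
    lmax = np.linalg.eigvalsh(B)[-1]
    scores = []
    for eps in eps_grid:
        S = B @ np.linalg.inv(B + eps * lmax * np.eye(n))
        resid = (np.eye(n) - S) @ A
        scores.append(np.linalg.norm(resid, "fro") ** 2 / np.trace(np.eye(n) - S) ** 2)
    return eps_grid[int(np.argmin(scores))]

# Hypothetical grid; the actual grid used in the simulations is specified in Section 6.
eps_grid = [10.0 ** k for k in range(-6, 1)]
```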
We also use generalized cross validation to determine the thresholding parameter . Let be the estimated edge set using a threshold , and, for each , let be the subset of components of in the neighborhood of node in the graph . The basic idea is to apply generalized cross validation to the regression of the feature of on the feature of . The generalized cross validation criterion for this regression takes the form
(13)
where , and is the kernel matrix for the sample of . We minimize over the grid to determine the optimal threshold .
Regarding the selection of the dimension of , to our knowledge there has been no systematic procedure available for determining the dimension of the central class in nonlinear sufficient dimension reduction. While some recently developed order-determination methods for linear sufficient dimension reduction, such as the ladle estimate and the predictor augmentation estimator (Luo and Li, 2016, 2020), may be generalizable to the nonlinear sufficient dimension reduction setting, we leave this topic to future research. Our experience and intuition indicate that a small dimension, such as 1 or 2, for the central class is sufficient in most cases. For example, in the classical nonparametric regression problem with , the dimension of the central class is by definition equal to 1.
5 Asymptotic theory
In this section we develop the consistency and convergence rates of our estimate and related operators. The challenge of this analysis is that our procedure involves two steps: we first extract the sufficient predictor using one set of kernels, and then substitute it into another set of kernels to get the final result. Thus we need to understand how the error propagates from the first step to the second. We also develop the asymptotic theory allowing the dimension of $X$ to go to infinity with the sample size, which is presented in the Supplementary Material.
5.1 Overview
Our goal is to derive the convergence rate of
as this is the quantity we threshold to determine the edge set. By the triangle inequality,
So we need to derive the convergence rates of the following quantities:
(14)
where, to avoid overly crowded subscripts, we have used to denote when it occurs as a subscript. The first and third convergence rates can be derived using the asymptotic tools for linear operators developed in Fukumizu et al. (2007), Li and Song (2017), Lee et al. (2016a), and Solea and Li (2020). The second convergence rate is, however, a new problem, and it will also be useful in similar settings that require constructing estimators based on predictors extracted by sufficient dimension reduction. In some sense, this is akin to the post dimension reduction problem considered in Kim et al. (2020).
In the following, if and are sequences of positive numbers, then we write if . We write if . We write if either or . Because is fixed in the asymptotic development, and also to emphasize the dependence on , in the rest of this section we denote , , and by , , and , respectively.
5.2 Transparent kernel
We first develop what we call the “transparent kernel” that passes information from step 1 to step 2 efficiently. Let be a nonempty set, and a positive kernel.
Definition 6
We say that is a transparent kernel if, for each , the function is twice differentiable and
1. ;
2. the matrix has a bounded operator norm; that is, there exist such that
for all , where and indicate the largest and smallest eigenvalues.
For example, the Gaussian radial basis function kernel is transparent, but the exponential kernel is not. This condition implies a type of Lipschitz continuity in a setting that involves two reproducing kernels and , where the argument of is the evaluation of a member of the reproducing kernel Hilbert space generated by .
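As an illustration of the smoothness involved (the precise conditions of Definition 6 are only partly visible above, so the computation below is only meant to indicate why the Gaussian kernel qualifies while the exponential kernel does not), the derivatives of the Gaussian radial basis function kernel are explicit:
\[
\kappa(u,v) = e^{-\gamma\|u-v\|^2}, \qquad
\frac{\partial \kappa(u,v)}{\partial u} = -2\gamma\,(u-v)\,e^{-\gamma\|u-v\|^2}\Big|_{v=u} = 0,
\]
\[
\frac{\partial^2 \kappa(u,v)}{\partial u\,\partial v^{\top}}
= \bigl(2\gamma I - 4\gamma^2 (u-v)(u-v)^{\top}\bigr)\,e^{-\gamma\|u-v\|^2}\Big|_{v=u} = 2\gamma I,
\]
so the kernel is twice differentiable, its gradient vanishes on the diagonal, and the mixed second-derivative matrix on the diagonal has all eigenvalues equal to $2\gamma$, hence bounded above and below. By contrast, the exponential kernel $e^{-\gamma\|u-v\|}$ is not differentiable at $u = v$.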
Theorem 7
Suppose is the reproducing kernel Hilbert space generated by , is the -fold Cartesian product of with inner product defined by
where and are members of , is the reproducing kernel Hilbert space generated by . Then:
(i) for any , we have
(ii) if is a transparent kernel, then there exists a such that, for each and ,
A direct consequence of this theorem is that, if is an estimate of some , a member of , with for some , is a linear operator estimated from the sample (and perhaps some other random vectors), and is a linear operator estimated from the sample , then,
(15)
This result is somewhat surprising, because sample estimates such as can be viewed as , where is an estimate of a function in a functional space with norm and is an operator-valued function. If for some , then it is not necessarily true that
particularly when is an infinite dimensional object. Yet relation (15) states exactly this. The reason behind this is that the reproducing kernel property separates the function and its argument (i.e. ), which implies a type of uniformity among . This point will be made clear in the proof in the Supplementary Material. Statement (15) is made precise by the next theorem.
Theorem 8
Suppose conditions (1) and (2) of Definition 3 are satisfied with , , therein replaced by , , and . Suppose, furthermore:
(a) , , and are transparent kernels;
(b) for some .
Then
(i) ;
(ii) ;
(iii) .
Using Theorem 8 we can derive the convergence rate of .
Theorem 9
Suppose conditions in Theorem 8 are satisfied and, furthermore,
(a) and are bounded linear operators;
(b) .
Then
Note that, unlike in Theorem 8, where our assumptions imply
is a finite-rank operator, here we do not assume to be a finite-rank (or even Hilbert-Schmidt) operator; instead, we assume it to be a bounded operator. This is because contains , which makes it unreasonable to assume to be finite-rank or Hilbert-Schmidt. For example, when is a constant, is the same as and is not a Hilbert-Schmidt operator, though it is bounded. Theorem 9 shows that the convergence rate of (ii) in (14) is the same as the convergence rate of (i) in (14); it now remains to derive the convergence rates of (i) and (iii).
5.3 Convergence rates of (i) and (iii) in (14)
We first present the convergence rate of to . The proof is similar to that of Theorem 5 of Li and Song (2017) but with two differences. First, Li and Song (2017) took in (5) to be , whereas we take it to be . In particular, the generalized sliced inverse regression in Li and Song (2017) only has one tuning parameter , but we have two tuning parameters and . Second, Li and Song (2017) defined (in the current notation) to be the eigenfunctions of
which is different from the generalized eigenvalue problem (5). For these reasons we need to re-derive the convergence rate of .
Theorem 10
Suppose
(a) Assumption 1 is satisfied;
(b) is a finite-rank operator with
(c) , ;
(d) for each , .
Then,
An immediate consequence is that, under the transparent kernel assumption, the in Theorem 9 is the same as this rate. We next derive the convergence rate in (iii) of (14). This rate depends on the tuning parameter in the estimate of the conjoined conditional covariance operator, and it reaches for the optimal choice of .
Theorem 11
Suppose conditions (1) and (2) of Definition 3 are satisfied with , , therein replaced by , , and . Suppose, furthermore,
(a) and are bounded linear operators;
(b) .
Then
Consequently, if , then
Finally, we combine Theorems 9 through 11 to obtain the convergence rate of . Since there are numerous cross-references among the conditions in these theorems, to make a clear presentation we list all the original conditions in the next theorem, even if they have already appeared. These conditions are of two categories: those for step 1, which involves sufficient dimension reduction of versus , and those for step 2, which involves the estimation of the conjoined conditional covariance operator. We refer to them as the first-level and second-level conditions, respectively.
Theorem 12
Suppose the following conditions hold:
(a) (First-level kernel) for and ;
(b) (First-level operator) is a finite-rank operator with rank and all the nonzero eigenvalues of are distinct;
(c) (First-level tuning parameters) , , ;
(d) (Second-level kernel) is satisfied for , , and ; furthermore, they are transparent kernels;
(e) (Second-level operators) and are bounded linear operators;
(f) (Second-level tuning parameter) .
Then
(16)
Using this result we immediately arrive at the variable selection consistency of the sufficient graphical model.
Corollary 13
5.4 Optimal rates of tuning parameters
The convergence rate in Theorem 12 depends on and explicitly, and implicitly (in the sense that is optimal for fixed and ). Intuitively, when , , and increase, the biases increase and variances decrease; when they decrease, the biases decrease and the variances increase. Thus there should be critical rates for them that balance the bias and variance, which are the optimal rates.
Theorem 14
Under the conditions in Theorem 12, if , , and are of the form , , and for some , , and , then
(i) the optimal rates of the tuning parameters are
(ii) the optimal convergence rate of the estimated conjoined conditional covariance operator is
Note that there is a range of optimal rates; this is because the convergence rate does not have a unique minimizer. This also means that the result is not very sensitive to this tuning parameter.
In the above asymptotic analysis, we have treated the dimension of $X$ as fixed as the sample size goes to infinity. We have also developed the consistency and convergence rates in the scenario where the dimension of $X$ goes to infinity with the sample size, which is placed in the Supplementary Material (Section S9) due to limited space.
6 Simulation
In this section we compare the performance of our sufficient graphical model with previous methods such as Yuan and Lin (2007), Liu et al. (2009), Voorman et al. (2013), Fellinghauer et al. (2013), Lee et al. (2016b), and a Naïve method which is based on the conjoined conditional covariance operator without the dimension reduction step.
By design, the sufficient graphical model has advantages over these existing methods under the following circumstances. First, since the sufficient graphical model does not make any distributional assumption, it should outperform Yuan and Lin (2007) and Liu et al. (2009) when the Gaussian or copula Gaussian assumptions are violated; second, due to the sufficient dimension reduction step, the sufficient graphical model avoids the curse of dimensionality and should outperform Voorman et al. (2013), Fellinghauer et al. (2013), and the Naïve method in the high-dimensional setting; third, since the sufficient graphical model does not require an additive structure, it should outperform Lee et al. (2016b) when there is severe nonadditivity in the model. Our simulation comparisons will reflect these aspects.
For the sufficient graphical model, Lee et al. (2016b), and the Naïve method, we use the Gaussian radial basis function as the kernel. The regularization constants , , and are chosen by the generalized cross validation criterion described in Section 4.3 with the grid . The kernel parameters , , , , and are chosen according to (11). Because the outcomes of tuning parameters are stable, for each model, we compute the generalized cross validation for the first five samples and use their average value for the rest of the simulation. The performance of each estimate is assessed using the averaged receiver operating characteristic curve as a function of the threshold . The accuracy of a method across all is measured by the area under the receiver operating characteristic curve.
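For concreteness, the sketch below shows one way to compute the receiver operating characteristic curve and its area for a single replication by sweeping the threshold over the estimated operator norms; it is an illustrative evaluation routine (the averaging over 50 replications and tie-handling are omitted), not code from the paper.

```python
import numpy as np

def roc_auc(norms, true_edges, all_pairs):
    """ROC curve and AUC for edge recovery, sweeping the threshold.

    norms      : dict mapping a pair (i, j) to its estimated operator norm
    true_edges : set of pairs (i, j) in the true edge set
    all_pairs  : list of all candidate pairs (i, j) with i < j
    """
    scores = np.array([norms[p] for p in all_pairs])
    labels = np.array([p in true_edges for p in all_pairs])
    order = np.argsort(-scores)              # larger norm => edge declared first
    labels = labels[order]
    tpr = np.concatenate([[0.0], np.cumsum(labels) / max(labels.sum(), 1)])
    fpr = np.concatenate([[0.0], np.cumsum(~labels) / max((~labels).sum(), 1)])
    return fpr, tpr, np.trapz(tpr, fpr)      # area under the ROC curve
```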
To isolate the factors that affect accuracy, we first consider two models with relatively small dimensions and large sample sizes, which are
where , are independent and identically distributed standard normal random variables. The edge sets of the two models are
We use for each model, and for each , we generate 50 samples to compute the averaged receiver operating characteristic curves. The dimension for the sufficient graphical model is taken to be 2 for all cases (we have also used and the results are very similar to those presented here). The plots in the first row of Figure 1 show the averaged receiver operating characteristic curves for the seven methods, with the following plotting symbol assignment:
| Method | Plotting symbol | Method | Plotting symbol |
|---|---|---|---|
| Sufficient graphical model | red solid line | Voorman et al. (2013) | red dotted line |
| Lee et al. (2016b) | black solid line | Fellinghauer et al. (2013) | black dotted line |
| Yuan and Lin (2007) | red dashed line | Naïve | blue dotted line |
| Liu et al. (2009) | black dashed line | | |
From these figures we see that the two top performers are clearly the sufficient graphical model and Lee et al. (2016b), and their performances are very similar. Note that neither of the two models satisfies the Gaussian or copula Gaussian assumption, which explains why the sufficient graphical model and Lee et al. (2016b) outperform Yuan and Lin (2007) and Liu et al. (2009). The sufficient graphical model and Lee et al. (2016b) also outperform Voorman et al. (2013), Fellinghauer et al. (2013), and the Naïve method, indicating that the curse of dimensionality already takes effect on the fully nonparametric methods. The three nonparametric estimators have similar performances. Also note that Model I has an additive structure, which explains the slight advantage of Lee et al. (2016b) over the sufficient graphical model in subfigure (a) of Figure 1; Model II is not additive, and the advantage of Lee et al. (2016b) disappears in subfigure (b) of Figure 1.
(Figure 1: averaged receiver operating characteristic curves of the seven methods under Models I–IV.)
We next consider two models with relatively high dimensions and small sample sizes. A convenient systematic way to generate larger networks is via the hub structure. We choose $p = 200$, and randomly generate ten hubs from the 200 vertices. For each hub, we randomly select a set of 19 vertices to form its neighborhood. With the network structures thus specified, our two probabilistic models are
and ’s are the same as in Models I and II. Note that, in Model III, the dependence of on is through the conditional mean , whereas in Model IV, the dependence is through the conditional variance . For each model, we choose two sample sizes and . The averaged receiver operating characteristic curves (again averaged over 50 samples) are presented in the second row of Figure 1. From the figures we see that, in the high-dimensional setting with , the sufficient graphical model substantially outperforms all the other methods, which clearly indicates the benefit of dimension reduction in constructing graphical models.
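A sketch of the hub-structure generation described above (ten hubs among 200 vertices, each with a randomly chosen neighborhood of 19 vertices) is given below; whether neighborhoods may overlap or include other hubs is not specified in the text, so this version simply samples them independently.

```python
import numpy as np

def hub_edge_set(p=200, n_hubs=10, nbhd_size=19, seed=0):
    """Randomly generate a hub-structured edge set on p vertices."""
    rng = np.random.default_rng(seed)
    hubs = rng.choice(p, size=n_hubs, replace=False)
    edges = set()
    for h in hubs:
        others = np.array([v for v in range(p) if v != h])
        for v in rng.choice(others, size=nbhd_size, replace=False):
            edges.add((min(int(h), int(v)), max(int(h), int(v))))
    return edges
```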
We now consider a Gaussian graphical model to investigate any efficiency loss incurred by the sufficient graphical model. Following a structure similar to that used in Li et al. (2014), we choose , , and the model
where the precision matrix has diagonal entries 1, 1, 1, 1.333, 3.010, 3.203, 1.543, 1.270, 1.544, 3, 1, 1, 1.2, 1, 1, 1, 1, 3, 2, 1, and nonzero off-diagonal entries , , , , , , . As expected, Figure 2 shows that Yuan and Lin (2007), Liu et al. (2009), and Lee et al. (2016b) perform better than the sufficient graphical model in this case. However, the sufficient graphical model still performs reasonably well and significantly outperforms the fully nonparametric methods.
(Figure 2: averaged receiver operating characteristic curves under the Gaussian graphical model, Model V.)
Finally, we conducted some simulations on the generalized cross validation criterion (13) for determining the threshold . We generated samples from Models I through V as described above, produced the receiver operating characteristic curves using the sufficient graphical model, and determined the threshold by (13). The results are presented in Figure S1 in the Supplementary Material. In each panel, the threshold determined by generalized cross validation is represented by a black dot on the red receiver operating characteristic curve.
7 Application
We now apply the sufficient graphical model to a data set from the DREAM 4 Challenge project and compare it with other methods. The goal of this challenge is to recover gene regulation networks from simulated steady-state data. A description of this data set can be found in Marbach et al. (2010). Since Lee et al. (2016b) already compared their method with Yuan and Lin (2007), Liu et al. (2009), Voorman et al. (2013), Fellinghauer et al. (2013), and the Naïve method on this data set and demonstrated its superiority among these estimators, here we focus on comparing the sufficient graphical model with Lee et al. (2016b) and with the champion method of the DREAM 4 Challenge.
The data set contains data from five networks, each of dimension 100 and sample size 201. We use the Gaussian radial basis function kernel for the sufficient graphical model and Lee et al. (2016b), with the tuning methods described in Section 4.3. For the sufficient graphical model, the dimensions are taken to be 1. We have also experimented with but the results (not presented here) show no significant difference. Because the true networks are available, we can compare the receiver operating characteristic curves and their areas under the curve, which are shown in Table 1.
Table 1: Areas under the receiver operating characteristic curves for the five DREAM 4 networks.

| Method | Network 1 | Network 2 | Network 3 | Network 4 | Network 5 |
|---|---|---|---|---|---|
| Sufficient graphical model | 0.85 | 0.81 | 0.83 | 0.83 | 0.79 |
| Lee et al. (2016b) | 0.86 | 0.81 | 0.83 | 0.83 | 0.77 |
| Champion | 0.91 | 0.81 | 0.83 | 0.83 | 0.75 |
| Naïve | 0.78 | 0.76 | 0.78 | 0.76 | 0.71 |
As we can see from Table 1, the sufficient graphical model has the same area under the receiver operating characteristic curve as Lee et al. (2016b) for Networks 2, 3, and 4, performs better than Lee et al. (2016b) for Network 5, but trails slightly behind Lee et al. (2016b) for Network 1; it has the same areas under the curve as the champion method for Networks 2, 3, and 4, performs better for Network 5, and performs worse for Network 1. Overall, the sufficient graphical model and Lee et al. (2016b) perform similarly on this data set, and they are on a par with the champion method. We should point out that the sufficient graphical model and Lee et al. (2016b) are purely empirical; they employ no knowledge about the underlying physical mechanism generating the gene expression data. However, according to Pinna et al. (2010), the champion method did use a differential equation that reflects the underlying physical mechanism. The results for threshold determination are presented in Figure S2 in the Supplementary Material.
8 Discussion
This paper is a first attempt to take advantage of the recently developed nonlinear sufficient dimension reduction methods to estimate the statistical graphical model nonparametrically while avoiding the curse of dimensionality. Nonlinear sufficient dimension reduction is used as a module and applied repeatedly to evaluate conditional independence, which leads to a substantial gain in accuracy in the high-dimensional setting. Compared with the Gaussian and copula Gaussian methods, our method is not affected by the violation of the Gaussian and copula Gaussian assumptions. Compared with the additive method (Lee et al., 2016b), our method does not require an additive structure and retains conditional independence as the criterion for determining the edges, which is a commonly accepted criterion. Compared with fully nonparametric methods, the sufficient graphical model avoids the curse of dimensionality and significantly enhances the performance.
The present framework opens up several directions for further research. First, the current model assumes that the central class is complete, so that generalized sliced inverse regression is an exhaustive nonlinear sufficient dimension reduction estimate. When this condition is violated, generalized sliced inverse regression is no longer exhaustive and we can employ other nonlinear sufficient dimension reduction methods, such as the generalized sliced averaged variance estimation (Lee et al., 2013; Li, 2018b), to recover the part of the central class that generalized sliced inverse regression misses. Second, though we have assumed that there is a proper sufficient sub-$\sigma$-field for each pair, the proposed estimation procedure is still justifiable when no such sub-$\sigma$-field exists. In this case, the estimated class is still the most important set of functions that characterize the statistical dependence of $(X^i, X^j)$ on $X^{-(i,j)}$, even though it is not sufficient. Without sufficiency, our method may be more appropriately called the Principal Graphical Model than the sufficient graphical model. Third, the current method can be extended to functional graphical models, which are common in medical applications such as EEG and fMRI. Several functional graphical models have been proposed recently, by Zhu et al. (2016), Qiao et al. (2019), Li and Solea (2018b), and Solea and Li (2020). The idea of a sufficient graph can be applied to this setting to improve efficiency.
This paper also contains some theoretical advances that are novel to nonlinear sufficient dimension reduction. For example, it introduces a general framework to characterize how the error of nonlinear sufficient dimension reduction propagates to the downstream analysis in terms of convergence rates. Furthermore, the results on convergence rates of various linear operators allowing the dimension of the predictor to go to infinity are the first of their kind in nonlinear sufficient dimension reduction. These advances will benefit the future development of sufficient dimension reduction in general, beyond the current context of estimating graphical models.
Acknowledgments
Bing Li’s research on this work was supported in part by the NSF Grant DMS-1713078. Kyongwon Kim’s work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. 2021R1F1A1046976, RS-2023-00219212) and by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (2021R1A6A1A10039823).
Supplementary Material
The Supplementary Material includes proofs of all theorems, lemmas, corollaries, and propositions in the paper, the asymptotic development for the high-dimensional setting, and some additional simulation plots for threshold determination.
References
- Bach and Jordan (2002) F. R. Bach and M. I. Jordan. Kernel independent component analysis. Journal of Machine Learning Research, 3:1–48, 2002.
- Bellman (1961) R Bellman. Curse of dimensionality. Adaptive control processes: a guided tour. Princeton, NJ, 1961.
- Bickel and Levina (2008) Peter J Bickel and Elizaveta Levina. Covariance regularization by thresholding. The Annals of Statistics, pages 2577–2604, 2008.
- Cook (1994) R Dennis Cook. Using dimension-reduction subspaces to identify important inputs in models of physical systems. In Proceedings of the section on Physical and Engineering Sciences, pages 18–25. American Statistical Association Alexandria, VA, 1994.
- Cook and Weisberg (1991) R Dennis Cook and Sanford Weisberg. Comment. Journal of the American Statistical Association, 86(414):328–332, 1991.
- Dawid (1979) A. P. Dawid. Conditional independence in statistical theory. Journal of the Royal Statistical Society. Series B (Methodological), pages 1–31, 1979.
- Fan and Li (2001) Jianqing Fan and Runze Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American statistical Association, 96(456):1348–1360, 2001.
- Fellinghauer et al. (2013) Bernd Fellinghauer, Peter Bühlmann, Martin Ryffel, Michael Von Rhein, and Jan D Reinhardt. Stable graphical model estimation with random forests for discrete, continuous, and mixed variables. Computational Statistics & Data Analysis, 64:132–152, 2013.
- Friedman et al. (2008) Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432–441, 2008.
- Fukumizu et al. (2004) Kenji Fukumizu, Francis R. Bach, and Michael I. Jordan. Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. Journal of Machine Learning Research, 5(Jan):73–99, 2004.
- Fukumizu et al. (2007) Kenji Fukumizu, Francis R Bach, and Arthur Gretton. Statistical consistency of kernel canonical correlation analysis. Journal of Machine Learning Research, 8(Feb):361–383, 2007.
- Fukumizu et al. (2008) Kenji Fukumizu, Arthur Gretton, Xiaohai Sun, and Bernhard Schölkopf. Kernel measures of conditional dependence. In Advances in neural information processing systems, pages 489–496, 2008.
- Golub et al. (1979) Gene H. Golub, Michael Heath, and Grace Wahba. Generalized cross-validation as a method for choosing a good ridge parameter. Technometrics, 1979.
- Guo et al. (2010) Jian Guo, Elizaveta Levina, George Michailidis, and Ji Zhu. Pairwise variable selection for high-dimensional model-based clustering. Biometrics, 66(3):793–804, 2010.
- Hoerl and Kennard (1970) A. E. Hoerl and R. W. Kennard. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12:55–67, 1970.
- Kim et al. (2020) Kyongwon Kim, Bing Li, Zhou Yu, and Lexin Li. On post dimension reduction statistical inference. to appear in The Annals of Statistics, 2020.
- Lam and Fan (2009) Clifford Lam and Jianqing Fan. Sparsistency and rates of convergence in large covariance matrix estimation. Annals of statistics, 37(6B):4254, 2009.
- Lauritzen (1996) Steffen L Lauritzen. Graphical models, volume 17. Clarendon Press, 1996.
- Lee et al. (2016a) K.-Y. Lee, B. Li, and H. Zhao. Variable selection via additive conditional independence. Journal of the Royal Statistical Society: Series B, 78:1037–1055, 2016a.
- Lee et al. (2013) Kuang-Yao Lee, Bing Li, and Francesca Chiaromonte. A general theory for nonlinear sufficient dimension reduction: Formulation and estimation. The Annals of Statistics, 41(1):221–249, 2013.
- Lee et al. (2016b) Kuang-Yao Lee, Bing Li, and Hongyu Zhao. On an additive partial correlation operator and nonparametric estimation of graphical models. Biometrika, 103(3):513–530, 2016b.
- Li et al. (2011) B. Li, A. Artemiou, and L. Li. Principal support vector machines for linear and nonlinear sufficient dimension reduction. The Annals of Statistics, 39:3182–3210, 2011.
- Li (2018a) Bing Li. Linear operator-based statistical analysis: A useful paradigm for big data. Canadian Journal of Statistics, 46(1):79–103, 2018a.
- Li (2018b) Bing Li. Sufficient Dimension Reduction: Methods and Applications with R. CRC Press, 2018b.
- Li and Solea (2018a) Bing Li and Eftychia Solea. A nonparametric graphical model for functional data with application to brain networks based on fMRI. Journal of the American Statistical Association, 113:1637–1655, 2018a.
- Li and Solea (2018b) Bing Li and Eftychia Solea. A nonparametric graphical model for functional data with application to brain networks based on fMRI. Journal of the American Statistical Association, 113:1637–1655, 2018b.
- Li and Song (2017) Bing Li and Jun Song. Nonlinear sufficient dimension reduction for functional data. The Annals of Statistics, 45(3):1059–1095, 2017.
- Li et al. (2014) Bing Li, Hyonho Chun, and Hongyu Zhao. On an additive semigraphoid model for statistical networks with application to pathway analysis. Journal of the American Statistical Association, 109(507):1188–1204, 2014.
- Li (1991) Ker-Chau Li. Sliced inverse regression for dimension reduction. Journal of the American Statistical Association, 86(414):316–327, 1991.
- Liu et al. (2009) Han Liu, John Lafferty, and Larry Wasserman. The nonparanormal: Semiparametric estimation of high dimensional undirected graphs. Journal of Machine Learning Research, 10(Oct):2295–2328, 2009.
- Liu et al. (2012a) Han Liu, Fang Han, Ming Yuan, John Lafferty, and Larry Wasserman. High-dimensional semiparametric Gaussian copula graphical models. The Annals of Statistics, 40(4):2293–2326, 2012a.
- Liu et al. (2012b) Han Liu, Fang Han, and Cun-Hui Zhang. Transelliptical graphical models. In Advances in neural information processing systems, pages 800–808, 2012b.
- Luo and Li (2016) Wei Luo and Bing Li. Combining eigenvalues and variation of eigenvectors for order determination. Biometrika, 103:875–887, 2016.
- Luo and Li (2020) Wei Luo and Bing Li. On order determination by predictor augmentation. Biometrika (To appear), 103:875–887, 2020.
- Marbach et al. (2010) Daniel Marbach, Robert J Prill, Thomas Schaffter, Claudio Mattiussi, Dario Floreano, and Gustavo Stolovitzky. Revealing strengths and weaknesses of methods for gene network inference. Proceedings of the national academy of sciences, 107(14):6286–6291, 2010.
- Meinshausen and Bühlmann (2006) Nicolai Meinshausen and Peter Bühlmann. High-dimensional graphs and variable selection with the lasso. The annals of statistics, pages 1436–1462, 2006.
- Pearl and Verma (1987) J. Pearl and T. Verma. The logic of representing dependencies by directed graphs. University of California (Los Angeles). Computer Science Department, 1987.
- Peng et al. (2009) Jie Peng, Pei Wang, Nengfeng Zhou, and Ji Zhu. Partial correlation estimation by joint sparse regression models. Journal of the American Statistical Association, 104(486):735–746, 2009.
- Pinna et al. (2010) Andrea Pinna, Nicola Soranzo, and Alberto de la Fuente. From knockouts to networks: establishing direct cause-effect relationships through graph analysis. PLoS One, 5(10), 2010.
- Qiao et al. (2019) Xinghao Qiao, Shaojun Guo, and Gareth M. James. Functional graphical models. Journal of the American Statistical Association, 114:211–222, 2019.
- Solea and Li (2020) Eftychia Solea and Bing Li. Copula Gaussian graphical models for functional data. Submitted to Journal of the American Statistical Association (under revision), 2020.
- Sriperumbudur et al. (2011) B. K. Sriperumbudur, K. Fukumizu, and G. R. G. Lanckriet. Universality, characteristic kernels and RKHS embedding of measures. Journal of Machine Learning Research, 12, 2011.
- Tibshirani (1996) Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288, 1996.
- Voorman et al. (2013) Arend Voorman, Ali Shojaie, and Daniela Witten. Graph estimation with joint additive models. Biometrika, 101(1):85–101, 2013.
- Wang (2008) Y. Wang. Nonlinear dimension reduction in feature space. PhD Thesis, The Pennsylvania State University, 2008.
- Wu (2008) H. M. Wu. Kernel sliced inverse regression with applications to classification. Journal of Computational and Graphical Statistics, 17(3):590–610, 2008.
- Xue and Zou (2012) Lingzhou Xue and Hui Zou. Regularized rank-based estimation of high-dimensional nonparanormal graphical models. The Annals of Statistics, 40(5):2541–2571, 2012.
- Yuan and Lin (2007) Ming Yuan and Yi Lin. Model selection and estimation in the Gaussian graphical model. Biometrika, 94(1):19–35, 2007.
- Zhu et al. (2016) Hongxiao Zhu, Nate Strawn, and David B. Dunson. Bayesian graphical models for multivariate functional data. Journal of Machine Learning Research, 14:1–27, 2016.
- Zou (2006) Hui Zou. The adaptive lasso and its oracle properties. Journal of the American statistical association, 101(476):1418–1429, 2006.