Detecting abnormal connectivity in schizophrenia via a joint directed acyclic graph estimation model

Gemeng Zhang1, Aiying Zhang1, Biao Cai1, Zhuozhuo Tu2, Vince D. Calhoun3 and Yu-Ping Wang1

1Department of Biomedical Engineering, Tulane University, New Orleans, LA 70118 USA 2UBTECH Sydney Artificial Intelligence Centre, The University of Sydney, NSW, 2006 Australia 3Tri-institutional Center for Translational Research in Neuroimaging and Data Science (TReNDS) Georgia State University, Georgia Institute of Technology, Emory University, Atlanta, GA 30030 USA

Abstract

Functional connectivity (FC) has been widely used to study brain network interactions underlying the emerging cognition and behavior of an individual. FC is usually defined as the correlation or partial correlation between brain regions. Although FC is proved to be a good starting point to understand the brain organization, it fails to tell the causal relationship or the direction of interactions. Many directed acyclic graph (DAG) based methods were applied to study the directed interactions using functional magnetic resonance imaging (fMRI) data but the performance was severely limited by the small sample size and high dimensionality, hindering its applications. To overcome the obstacles, we propose a score based joint directed acyclic graph model to estimate the directed FC in fMRI data. Instead of using a combinatorial optimization framework, the structure of DAG is characterized with an algebra equation and further regularized with sparsity and group similarity terms. The simulation results have demonstrated the improved accuracy of the proposed model in detecting causality as compared to other existing methods. In our case-control study of the MIND Clinical Imaging Consortium (MCIC) data, we have successfully identified decreased functional integration, disrupted hub structures and characteristic edges (CtEs) in schizophrenia (SZ) patients. Further comparison between the results from directed FC and undirected FC illustrated the their different emphasis on selected features. We speculate that combining the features from undirected graphical model and directed graphical model might be a promising way to do FC analysis. The code of joint DAG estimation model is available at https://github.com/gmeng92/joint-notears.

Index Terms:

task fMRI, causality, joint estimation, schizophrenia

I INTRODUCTION

For the past decades, the research of brain functional connectivity (FC) through brain imaging data has became one of the hot topics in medical imaging field[1]. Specifically, studies using functional MRI (fMRI) have shown that there are significant connectivity changes in mental disorder patients compared to normal cohort[2][3]. As a serious mental illness, schizophrenia has also been conceived as a disorder with FC changes in large-scale brain networks[4]. Previous meta-analytic reviews of MRI studies on schizophrenia also suggest the underlying abnormalities in gray matter density[5][6], as well as the interregional FC derived from fMRI time series[7][8][9].

As a popular tool for accessing functional organization of brain and as an important biomarker for neurological disorders, functional connectivity is often defined based on the Pearson correlation which reflects the association between different regions of interests (ROIs). Although further studies used the partial correlation or precision matrix to remove the indirect correlation between brain regions and only keep the directly correlated pairs, they still cannot reveal the causes and effects between brain regions. More precise understanding of the information flow between brain regions enables us to better understand brain functional integration. The greedy equivalent search (GES) method[10][11], as an efficient way to do causal inference, has been applied on fMRI data to investigate differences in brain integrations between patients with Autism spectrum disorders[12] and individuals with traumatic brain injury[13]. By modeling the non-Gaussianity of the real fMRI data, researchers have also employed the linear non-gaussian acyclic models (LiNGAM)[14] to reveal the key differences in the default-mode network between patients with bipolar disorder and patients with major depression disorder[15].

Recently, there is a score-based approach for the structure learning of a directed acyclic graph (DAG)[16]. Different from other greedy equivalence search approaches, the new method incorporates the DAG constraints by introducing an algebraic characterization of the adjacency matrix of one DAG, which gives more flexibility in modeling the structure of Bayesian networks. For example, the authors enforced the sparsity of the DAG structure and got the state-of-the-art result in their simulation data. This motivates us to apply the similar idea for fMRI data analysis.

However, the high dimensionality and moderate individual variations prevent the score-based method from fitting well on the fMRI data since most score criteria are defined as the likelihood of the Bayesian networks. To mitigate the limitation when applying to the fMRI analysis, we further formulate a joint structure learning framework, in which each DAG’s structure estimation is performed by borrowing information from other observations or groups. Although the proposed structure learning model is not a convex one, we found that using the augmented Lagrange and L-BFGS method can deliver significantly improved results. The performance of the model is validated through a series of simulations including various settings for random graphs and compared with four other widely-used DAG estimation methods.

The rest of the paper is organized as follows. In Section II, we give some preliminary knowledge of Bayesian networks, formulate joint DAGs estimation model and describe the optimization strategy. In Section III, we show the results from both simulation studies and fMRI studies. Then we further present regarding the findings and limitations of the proposed method in SectionIV and conclude the paper is in the last section.

II METHODS

II-A Preliminary

A Bayesian network is a probabilistic graphical model that encodes the random variables as its nodes and their conditional dependencies as its directed edges. The node set $V$ and the edge set $E=V\times V$ make up with a directed acyclic graph (DAG) $G=(E,V)$ in which each edge $(i,j)$ represents a directed edge from variable $i$ to $j$ . $i$ is called the parent node of $j$ and $j$ is the child node of $i$ accordingly. Given one variable, say $j$ , and the set of all of its parent nodes $Pa_{j}$ , we can always write the conditional probability distribution of $j$ as $P(j|Pa_{j})$ .Without loss of generality, we can always factorize the joint distribution of random variables $Y=\{y_{1},y_{2},\dots,y_{d}\}$ as $P(y)=\prod_{i=1}^{d}P(y_{i}|Pa_{i})$ .

Given the data matrix $X\in\mathbb{R}^{n\times d}$ as the $n$ observations of random variables $y\in\mathbb{R}^{d}$ , learning the structure of the Bayesian network or directed acyclic graph (DAG) means finding a proper distribution $P(X)$ defined on the graph $G=(V,E)$ . More specifically, we want to determine all the conditional dependencies via the observation $X$ . We model these conditional dependencies via a structural equation model(SEM) defined by $X=XW+z$ , where $W\in\mathbf{R}^{d\times d}$ is the weighted adjacency matrix and $z$ is the random noise vector. Without further assumption on the noise distribution, we follow the approach in [16] and use the least square loss function for the linear SEM. It should be noted that the loss can be changed to any other smooth functions over $\mathbf{R}^{d\times d}$ . Moreover, several previous studies have shown that minimizing the least square loss can guarantee revealing the true DAG with high probability in finite-samples and even in high dimensions setting( $d\gg n$ ) and this result is consistent for both Gaussian SEM[17], [18] and non-Gaussian SEM[19].

In the literature[16], the authors firstly characterized the adjacency matrix of a DAG algebraically and turned the fussy searching problem in traditional GES methods into a continuous optimization issue. Their characterization is formulated with the theorem

Theorem 1

[16] A matrix $W\in\mathbb{R}^{d\times d}$ is an adjacency matrix of a DAG if and only if

h(W)=Tr(e^{W\circ W})-d=0,

where $\circ$ is the Hadamard product and $e^{\cdot}$ is the matrix exponential operator. Fortunately, this function is differentiable with a simple gradient

\nabla h(W)=2(e^{W\circ W})^{T}\circ W,

II-B Optimization

To overcome the sample size limitation and take advantage of data from multiple groups, we propose a joint DAG estimation model to incorporate the similarity prior in the data. The primal problem can be formulated as

	$\displaystyle\min_{\mathbf{W}}$		$\displaystyle F(\mathbf{W})=l(\mathbf{W};\mathbf{X})+P(\mathbf{W})$		(1)
	s.t.		$\displaystyle h(W^{(k)})=0,\ \ \forall k\in[K]$		(1)

where $l(\mathbf{W};\mathbf{X})$ is the loss function of SEM, i.e. $l(\mathbf{W};\mathbf{X})=\sum_{k=1}^{K}\frac{1}{2n_{k}}||X^{(k)}-X^{(k)}W^{(k)}||_{F}^{2}$ . $P(\mathbf{W})$ is the penalty term of the weighted adjacency matrices and usually chosen to encourage $W^{(1)},W^{(2)},\dots,W^{(K)}$ to share certain characteristics such as the similar pattern of nonzero elements. In addition, the sparsity of those matrices are usually encoded as a prior knowledge which benefits both model training and interpretation of the results. Considering the similar causal structure underlining different observations, we borrow the idea of group regularization from the undirected graphical model to encourage the shared DAG structures. This technique has been proven useful in many gene expression network studies. The group regularization term is often formulated as:

P(\mathbf{W})=\lambda_{1}\sum_{k=1}^{K}\sum_{i\neq j}|W^{(k)}_{ij}|+\lambda_{2}\sum_{i\neq j}(\sum_{k=1}^{K}(W^{(k)}_{ij})^{2})^{1/2}

(2)

where $\lambda_{1}$ and $\lambda_{2}$ are nonnegative parameters. $\lambda_{1}$ controls the sparsity of $W^{(k)}$ s and the larger $\lambda_{1}$ is, the sparser the solution of (1) will be. On the other side, $\lambda_{2}$ restrains the patterns of the nonzeros in $W^{(k)}$ s and the larger $\lambda_{2}$ is, the more identical $W^{(k)}$ s will be. If $\lambda_{1}$ and $\lambda_{2}$ are set to $0$ , the model is degraded to the original notears model.

By applying the augmented Lagrange, the primal problem is written as

	$\displaystyle\min_{\mathbf{W}}$		$\displaystyle l(\mathbf{W};\mathbf{X})+P(\mathbf{W})+\frac{\rho}{2}\sum_{k=1}^{K}\|h(W^{(k)})\|^{2}$		(3)
	s.t.		$\displaystyle h(w^{(k)})=0,\ \ \forall k\in[K]$		(3)

We use the dual ascent to solve problem (3) and the lagrange multiplier is derived as

	$\displaystyle L(\mathbf{W},\alpha_{1},\alpha_{2},\dots,\alpha_{K})=$	$\displaystyle l(\mathbf{W};\mathbf{X})+P(\mathbf{W})+\frac{\rho}{2}\sum_{k=1}^{K}\|h(W^{(k)})\|^{2}$		(4)
		$\displaystyle+\sum_{k=1}^{K}\alpha_{k}h(W^{(k)})$		(4)

Hence, the corresponding dual function is

g(\alpha_{1},\alpha_{2},\dots,\alpha_{K})=\min_{\mathbf{W}}L(\mathbf{W},\alpha_{1},\alpha_{2},\dots,\alpha_{K})

(5)

The dual problem is then

\max_{\alpha_{1},\alpha_{2},\dots,\alpha_{K}}g(\alpha_{1},\alpha_{2},\dots,\alpha_{K})

(6)

Denote $\mathbf{W}^{*}_{\mathbf{\alpha}}$ as the local minimizer of problem (5), i.e. $g(\alpha_{1},\alpha_{2},\dots,\alpha_{K})=L(\mathbf{W}^{*}_{\mathbf{\alpha}},\alpha_{1},\alpha_{2},\dots,\alpha_{K})$ . By noting that $g(\alpha_{1},\alpha_{2},\dots,\alpha_{K})$ is a linear function of $\alpha_{k}$ s, the partial derivatives are given by $\partial g(\alpha_{1},\alpha_{2},\dots,\alpha_{K})/\partial\alpha_{k}=h(W^{(k)^{*}})$ . Hence we can perform the dual ascent by updating $\alpha_{k}$ s with

\alpha_{k}\rightarrow\alpha_{k}+\rho h(W^{(k)^{*}})

(7)

The rest of the concern will be the unconstrained optimization of subproblem (5). Due to its high dimensionality and non-convexity, we follow the similar idea in [16] and solve it with L-BFGS algorithm[20] when $\lambda_{1}=\lambda_{2}=0$ . When $\lambda_{1},\lambda_{2}>0$ , the problem can be solved with proximal quasi-Newton (PQN) method[21]. As an iteration algorithm, at the $k$ -th step, the descent direction is searched through a quadratic approximation of the smooth term:

\mathbf{d}_{k}=\arg\min_{\mathbf{d}\in\mathbb{R}^{p}}\mathbf{g}_{k}^{T}\mathbf{d}+\frac{1}{2}\mathbf{d}^{T}B_{k}\mathbf{d}+\lambda_{1}||\mathbf{w}_{k}+\mathbf{d}||_{1}

(8)

where $\mathbf{g}_{k}$ is the gradient of $f(w)$ and $B_{k}$ is the L-BFGS approximation of the Hessian matrix of $f(w)$ . In addition, for each coordinate $j$ , the solution of problem (8) has a closed form update $\mathbf{d}\leftarrow\mathbf{d}+z^{*}e_{j}$ in which $e_{j}$ is the unit vector in standard basis and

	$\displaystyle z^{*}$	$\displaystyle=\arg\min_{z}\frac{1}{2}\underbrace{B_{j}j}_{a}z^{2}+\underbrace{(\mathbf{g}_{j}+(B\mathbf{d})_{j})}_{b}z+\lambda_{1}\|\underbrace{\mathbf{w}_{j}+\mathbf{d}_{j}}_{c}+z\|$		(9)
		$\displaystyle=-c+S_{c-b/a}(\lambda_{1}/a)$		(9)

It should be noted that the low-rank structure of $B_{k}$ makes it sparse and fast to compute during the coordinate update. The sparse regularization in the model can further enable us to speed up the computation. Instead of updating for all coordinates, an active set of coordinates can be chosen based on the subgradients and we just need to update regarding the active set. More details of the subproblem optimization can be find in Appendix A.

II-C Parameter Tuning

As it shows in the joint estimation model, there are two parameters, i.e., $\lambda_{1}$ and $\lambda_{2}$ that need to be specified before we use it for causality inference. Different setting of those parameters will lead to different causal structures. Training the model with a large $\lambda_{1}$ will lead to a sparse $\mathbf{W}$ but may underfit the data. On the other hand, the model with a smaller $\lambda_{1}$ gives rise to a denser $\mathbf{W}$ but may overfit the data. Consequently, improper choice of the parameter will lead to inaccurate inference result: a sparser causal graph tends to have more false negatives while a denser graph usually has severe false positives problem. Hence we would like to balance both the simplicity of the model and the goodness of fit. Similarly, the parameter $\lambda_{2}$ reflects how similar the DAG structures will be between groups. A larger $\lambda_{2}$ implies more similar structures between groups and vice versa. When $\lambda_{2}$ is trivially set to $0$ , the joint estimation model is then degraded to $K$ seperate NOTEARS models for $K$ groups; As $\lambda_{2}$ goes to $\infty$ , group variations will be wiped out gradually and the resulted $K$ DAGs will be identical to each other which is exactly the case that we regard the data as one group and apply the NOTEARS model directly. To find a proper combination of $(\lambda_{1},\lambda_{2})$ , we did a 5-fold cross-validation using the value of $l(\mathbf{W})$ as the performance measure.

Refer to caption — Figure 1: A demonstration of the joint estimation method compared to the NOTEARS method. The first row and second row show the weighted DAGs from group 1 and group 2, respectively. From left to right: the ground truth graphs, DAGs estimated by NOTEARS, the variation of NOTEARS estimation compared to ground truth, DAGs estimated by joint estimation, the variation of joint estimation compared to ground truth. In the 3rd and 5th column, the less the colored squares are, the better the performance the respective method has.

III RESULTS

III-A On Simulation Data

III-A1 Data Generating

To validate the performance of our joint DAGs estimation method, we evaluated it with the generated simulation data sets under different circumstances. We also compared our methods with the score-based greedy equivalent search (GES)[11], the PC algorithm[22], and the linear non-Gaussian acyclic model (LiNGAM)[14], as well as the NOTEARS method[16]. Regarding the data generation, a random graph $G_{0}$ is firstly generated from two random graph models: Erd $\ddot{o}$ s-R $\acute{e}$ nyi (ER) graph or scale-free (SF) graph. Then we randomly added extra edges to $G_{0}$ and got $G_{1}$ and $G_{2}$ to simulate the case-control study. In this way, we can build the association between those two graphs: sharing edges are the ones in $G_{0}$ and the varied edges are those independently added extra edges. Once graphs were generated, we equipped edges with weights according to an uniform distribution and obtained weight matrix $W_{i}$ for each graph. The generation and alternation of random graphs were realized through the Python package NetworkX. Given $W_{i}$ s, the observation sequences were then sampled on the basis of $X_{i}=W_{i}^{T}X_{i}+\epsilon_{i}$ for various distributions of $\epsilon_{i}$ . In this paper, we did this for three types of noise distribution: Gaussian, Exponential, and Gumbel. We repeated above procedure to generate $X_{i}\in\mathbb{R}^{n\times d}$ in two simulation settings. We generated the first group of simulation data with $n=300$ and $p=\{10,20,50,100\}$ ; as for the high dimension case, we then generated the second group of simulation data with $n=50$ and $p=\{20,40,60,100\}$ . It has been confirmed in many literatures that scale-free structure is common in many discovered real-world networks such as gene networks, social networks, and the brain functional networks[23][24].

III-A2 Parameters Setting

For each data set, we ran GES, PC, LiNGAM, NOTEARS and our proposed method, comparing their performance in terms of the reconstruction of the ground truth DAGs for two groups. We employed the following implementations in the experiment:

•

GES, PC and LinGAM were implemented using the widely used R package pcalg, which is available at https://cran.r-project.org/web/packages/pcalg/index.html.
•

NOTEARS method was implemented using Python code available at https://github.com/xunzheng/notears
•

The proposed method was implemented with Python code, which is available at https://github.com/gmeng92/joint-notears.

There are some further remarks on the compared methods: both PC and GES method output a graph instead of the weight matrix $W$ ; the graph returned by those two methods is a CPDAG instead of a DAG, which means part of the edges are undirected. During the evaluation, we treated PC and GES method favorably by regarding undirected edges as true positives if there exist directed edges between the same nodes. For the parameter tuning, we followed the same setting according to suggestions in [16].

III-A3 Metrics Used for Comparison

To evaluate the estimated DAGs (for LiNGAM, NOTEARS and our method) or CPDAGs(for PC and GES method), we used three common metrics: 1) false discovery rate (FDR), 2) true positive rate (TPR), and 3) structural Hamming distance (SHD). The first two measures comes from the confusion matrix, which is defined in Table I

TABLE I: Confusion Matrix

True Condition

Condition Postive

Condition Negative

Predicted Condition

Predicted

Postive

True Postive

False Positive

Predicted

Negative

False Negative

True Negative

The SHD is defined as the number of operations needed to convert the estimated graph to the ground truth graph. These operations include edge additions, deletions and direction reversals. Since we consider directed graphs, a distinction between true positives (TP) and reversed edges (R) is needed: the former is estimated with correct direction whereas the latter is not. Likewise, a false positive (FP) is an edge that is not in the undirected skeleton of the true graph. In addition, positive (P) is the set of estimated edges, true (T) is the set of true edges, false (F) is the set of non-edges in the ground truth graph. Finally, let E be the extra edges from the skeleton and M be the missing edges from the skeleton. The four metrics are then given by:

•

FDR = (R + FP)/P
•

TPR = TP/T
•

SHD = E + M + R

III-A4 Results

In Fig. 1, we gave a demonstration of the joint estimation method on a synthetic data set with two groups and limited sample size( $n=p=20$ ). By comparing the estimation results with the ground truth DAG, we can clearly see that the joint estimation has improved the estimation accuracy significantly compared to the NOTEARS approach. This demonstrates that structure learning does benefit from the group regularization.

Moreover, we show the comparison results when the noise type is Gaussian, average degree is 4, sample size are $n=300$ and $n=50$ respectively for ER graphs in Fig. 2. Both NOTEARS method and the proposed joint method have the highest TPR while the lowest FDR and SHD, which shows the superiority of those structure learning framework. It is noticeable that as the sample size decreases and the number of variables grows, the FDR and the SHD increase significantly. But the joint estimation can still efficiently suppress the FDR and SHD to a low level, that is, the estimation accuracy is sustained even in high dimensional scenario. For synthetic data sets with various noise types, graph densities and random graph types, the joint estimation method has outperformed other methods. More details can be found in Appendix.C.

III-B On real fMRI Data

The model was applied to fMRI data in the MIND Clinical Imaging Consortium (MCIC) from Mind Research Network(MRN,www.mrn.org). The fMRI data were collected from 208 subjects, among them 92 SZ patients (age: 34 $\pm$ 11, 22 females) and 116 healthy controls (age: 32 $\pm$ 11, 44 females)during a sensory motor task, a block design motor response to auditory stimulation. Specifically, the images were acquired on a Siemens3T Trio Scanner and 1.5 T Sonata with echo-planar imaging (EPI) sequences taking parameters (TR $=2000ms$ , TE $=30ms(3.0T)/40ms(1.5T)$ , field of view = $22cm$ , slice thickness = $4mm$ , $1mm$ skip, 27 slices, acquisition matrix = 64 $\times$ 64, flip angle = $90^{\circ}$ ). Then the data were pre-processed with SPM5 (http://www.fil.ion.ucl.ac.uk/spm) and were realigned, spatially normalized and resliced to $3\times 3\times 3mm$ , smoothed with a $10\times 10\times 10mm^{3}$ Gaussian kernel, and analyzed by multiple regression considering the stimulus and their temporal derivatives plus an intercept term as regressors. Finally the stimulus-on versus stimulus-off contrast images were extracted with $53\times 63\times 46$ voxels and all the voxels with missing measurements were excluded resulting in 41236 voxels. In order to filter irrelevant information, we further implemented a multiple t-test between case and control groups at the voxel level. Finally, $p=9816$ voxels were left for analysis and 116 ROIs were extracted based on the Automated Anatomical Labeling brain atlas. For each ROI, we then averaged the beta values of the voxels within that ROI and finally got an $116\times 208$ data matrix.

III-B1 Summary of graph theoretical measure

After splitting the data into two groups, we fed them into the joint estimation algorithm with $\lambda_{1}=0.001$ and $\lambda_{2}=0.01$ , and got the corresponding DAGs and the weighted adjacency matrices $\mathbf{W}_{1},\mathbf{W}_{2}$ of SZ patients and healthy controls respectively. The resulted DAGs of two group were visualized in Fig. 3 with 207 directed edges for SZ group and 187 directed edges for healthy controls. In the first step, we compared the two DAGs using some graph theoretical measures with the purpose of finding the general structural differences between two groups. Since the effect size of the weights still cannot be well estimated, the structures have a higher power and the edge weights are ignored in calculations. Those measures were calculated using the Brain Connectivity Toolbox[25]. Table II shows the summary of those measures and the p-values of the permutation test were listed in the last column.

The density is the fraction of present connections to all possible connections and it describe how the dense the graph is. The number of connections to one node is defined as the degree of that node. For a directed graph, those connections include edges coming in and out of the node. In directed network analysis, the in-degree counts the number of edges that come into one individual node, the out-degree counts the number of edges that come out of the node, and the sum-degree counts the total number of edges connected to the node regardless of the direction. Global efficiency is the measure of functional integration, which reflects the ability to combine specialized information from distributed brain regions. It is defined as the average inverse shortest path length, which represents the ability of the communication between two nodes. The local efficiency is the global efficiency computed on the neighborhood of the node and represents how well the graph communicates locally. Like the transitivity and clustering coefficient, local efficiency is a measure of functional segregation, which reflects the ability for specialized processing to occur within clusters interconnected modules of brain. Transitivity and clustering coefficient are based on the number of triangles in the graph. The fraction of triangles around an individual node is defined as the clustering coefficient while the ratio of triangles to triplets in the whole graph is defined as the transitivity. The mean clustering coefficient averages the values of all nodes. The assortativity coefficient is a correlation coefficient between the degrees of all nodes on two opposite ends of a link. Graphs with a positive assortativity coefficient are likely to have a resilient core of mutually interconnected high-degree hubs while a negative value means widely distributed hubs. The mean assortativity coefficient averages the out-degree/indegree, in-degree/out-degree, out-degree/out-degree and in-degree/in-degree correlations. The rich club coefficient at level k is the fraction of edges that connect nodes of degree k or higher out of the maximum number of edges that such nodes might share. We used the maximum of the rich club coefficients among various levels, which characterizes the tendency for high degree nodes to be more inter-connected among themselves than the nodes of a lower degree.

TABLE II: Comparison of graph theoretical measures between schizophrenia patients (SZ) and healthy control (HC)

Group	SZ	HC	p-value
Density	0.0155	0.0139	¡0.01
Transitivity	0.0275	0.0250	0.03
Mean clustering coefficient	0.0308	0.0290	¡0.01
Maximum rich club coefficient	0.9290	0.9274	0.54
Global efficiency	0.0314	0.0351	¡0.01
Mean local efficiency	0.0419	0.0431	¡0.01
Assortativity coefficient	-0.0158	-0.0668	0.27

III-B2 Hub nodes identification

To get better understanding of the estimated directed brain networks from two groups, we further identified the hub nodes as the nodes with significantly higher degrees than average. More specifically, we defined the nodes whose degree is higher than mean degree by three standard deviations. According to previous context, we have three type of degrees here: in-degree, out-degree and sum-degree. As a result, we can similarly define three types of hub nodes for directed networks: in-hub, out-hub and sum-hub. Since some nodes were identified as sum-hubs because it has a high in-degree or out-degree, we only keep the sum-hubs if they are not in-hubs or out-hubs.

In the healthy control group, we were able to identify two in-hubs located in gyrus rectus(REC.L) and cerebellum(CRBL10.R), one out-hub in superior occipital lobe(SOG.R) and one sum-hub in cerebellum(CRBL45.L). In the SZ patient group, we identified three in-hubs located in right temporal pole(TPOmid.R), cerebellum(CRBL10.R) and vermis(Vermis10) as well as two out-hubs in inferior frontal gyrus(IFGtriang.L) and cerebellum(CRBL3.R). Those results were summarized in Table III.

TABLE III: Identified hub nodes for case and control group

SZ Patients
	Index	ROI	MNI coordinate	Degree
In-hub	88	TPOmid.R	(44.22, 14.55, -32.23)	8
	108	CRBL10.R	(25.99, -33.84, -41.35)	7
	116	Vermis10	(0.36, -45.8, -31.68)	6
Out-hub	13	IFGtriang.L	(-45.58, 29.91, 13.99)	5
Out-hub	96	CRBL3.R	(12.32, -34.47, -19.39)	5
Healthy Control
	Index	ROI	MNI coordinate	Degree
In-hub	27	REC.L	(-5.08, 37.07, -18.14)	6
In-hub	108	CRBL10.R	(25.99, -33.84, -41.35)	5
Out-hub	50	SOG.R	(24.29, -80.85, 30.59)	5
Sum-Hub	97	CRBL45.L	(-15, -43.49, -16.93)	6

III-B3 Variations of edges between groups

The graph measures only give a brief summary of the variation between two groups. In this study, we are more interested in those particular edges or nodes shared between two groups or characterized themselves. Hence, we further did a permutation test for the comparison. 100 permutation tests were performed and the subjects were randomly shuffled into two groups - one with 92 subjects and another with 116 subjects. Then we extracted the significantly different edges (p ¡ 0.01 with FDR corrected value at q = 0.05) in case and control groups. For each group, the edges that exclusively exist within one group were marked as the characteristic edges(CtEs). Finally, 19 and 15 edges were identified as CtEs in SZ patients and healthy controls respectively, which are illustrated in Figure 4.

III-B4 Comparison with undirected graphs

Another interesting question one may ask is the added values of the directed graphs we obtained over the undirected graphs measured with Pearson correlation or partial correlation, which are often named as functional connectivity (FC). Since edges from both graphical models represent conditional dependency in spite of different angles, there are considerable overlaps between two graphs. Recent progress also suggests that the FC can be utilized as a prior to benefit the causal inference[26]. Specially, we used the $psi-$ learning method to estimate the undirected brain networks for SZ patients and healthy controls separately. $\psi-$ learning is a novel method for high-dimensional Gaussian Graphical Model (GGM)[27]. Compared to other partially correlation based methods, it can help ease computational burden and provide more accurate inference for the underlying networks. This method has also been proven to be an equivalent measure of the partial correlation coefficient and thus is flexible for network comparison through statistical tests. In this paper, we used the same implementation as in [28] with R package ”equSA”. The parameters were set as $\alpha_{1}=0.1$ and $\alpha_{2}=0.01$ as suggested in [27, 28]. For the group comparison, we set the significance level at 0.01 and further corrected the FDR with $q=0.05$ .

TABLE IV: Identified hub nodes for SZ patients and healthy control using undirected networks estimation

SZ Patients
Index	ROI	MNI coordinate	Degree
12	IFGoperc.R	(-50.2, 14.98, 21.41)	19
39	PHG.L	(-21.17, -15.95, -20.7)	15
40	PHG.R	(25.38,-15.15,-20.47)	21
Healthy Control
Index	ROI	MNI coordinate	Degree
12	IFGoperc.R	(-50.2, 14.98, 21.41)	21
111	Vermis45	(1.22, –52.36, -6.11)	7

Finally, we got 148 edges and 104 edges in SZ patients and healthy controls using $\psi-$ learning method respectively. To compare the undirected networks with the directed networks, we mainly checked three aspects: the hub nodes in undirected networks, the overlapped edges and the detected group variances between undirected networks and directed networks. The identified hub nodes is listed in Table IV. Opercular region of inferior frontal gyrus(IFGoperc.R) at the right lobe was identified as hub nodes in both groups. In SZ patients, parahippocampus on both lobes (PHG.L and PHG.R) were identified as hub nodes. In healthy control, a region at vermis (VERMIS45) was also identified as hub node. It should be noticed that there is no overlapped hub nodes between directed networks and undirected networks. The overlapped edges between directed networks and undirected networks for both groups were visualized in Figure 5. There is no CtEs appearance in either group. For the pairwise comparison, significantly different edges between two groups were visualized in Figure 6. As a comparison, we also highlighted the overlapped edges from both directed networks and undirected networks with blue color in that figure. However, those connections do not occupy considerable percentage shared in both groups ( $12.96\%$ in SZ patients and $15.57\%$ in healthy controls).

IV Discussion

IV-A Significance of results in MCIC study

The application of functional connectivity to study schizophrenia appears to be a very powerful approach since it provides a way to study aberrant connectivity between sets of regions (networks), which is thought to be a core feature of schizophrenia. However, the directed functional connectivity is rarely used as biomarkers in the schizophrenia study. From the comparison between two groups, we found that the directed brain network is significantly denser in schizophrenia patients. Moreover, the mean clustering coefficient of schizophrenia patients is smaller than that of the healthy controls. With a denser network, we observed that both the global and local efficiency of schizophrenia patients are significantly smaller than those of healthy controls. The functional integration represents the capability to combine information from distributed brain regions. This demonstrates the directed brain network of schizophrenia patients is less efficient in terms of brain region communication either globally or locally in the auditory task.

Among the hub nodes identified in two groups, we also observed disruptions of the hub structures in schizophrenia patients. The gyrus retus (REC.L), as an in-hub on the left hemisphere, is missing in SZ patients. Previous study[29] has demonstrated that pronounced gray matter volume decline at gyrus rectus has been observed in SZ patients, which supports our findings with a structural foundation. Another study[30] also showed that the reduced functional connectivity has been observed between gyrus rectus and other regions within self-referential processing network. Inferior frontal gyrus, triangular at left hemisphere (IFGtriang.L), has been shown to have increased activity during auditory hallucinations[31], and the Theory of Mind(ToM) deficits [32] in schizophrenia patients. Our result further implies that the hyperconnectivity related to this region is potentially an important biomarker of schizophrenia. The behavior of temporal pole on the right hemisphere (TPOmid.R) is also distinct between HC and SZP, which agrees with the finding in [33], [34].

In addition to the abnormal hub structures in SZ patients, we also have several findings in coordination with previous studies by comparing the CtEs extracted from both groups. For example, previous research[35] revealed that olfactory bulbs disconnectivity of olfactory regions in schizophrenia may account for olfactory dysfunction and disrupted integration with other sensory modalities in SZ patients. The missing connections to the olfactory cortex in CtEs of SZ patients also confirm this findings. In the subcortical regions, pallidum, putamen and caudate work together to communicate with the subthalamic nucleus[36]. Our findings illustrated the missing communications within and between putamen and pallidum, which is in line with the findings in [37],[38].

From Table IV, parahippocampus on both sides are identified as hub nodes in undirected functional connectivity for SZ patients. The parahippocampus plays an important role in the encoding and recognition of environmental scenes. Previous research[39] have discovered the abnormal connections between parahippocampus and temporal pole in early stage SZ patients. Several other researches also revealed the aberrant increased connectivity between parahippocampus and other limbic areas[40], including hippocampus[41].

When we compared the finding of directed graphical model and undirected graphical model, we found that the overlapped brain networks indicated by both methods cannot represent the feature of whole brain networks, especially the connection variations between SZ patients and healthy controls. This implies that the features from undirected networks and directed networks may not be regarded as the similar features in depicting the interactions between brain regions. In fact, the underlying distributions of two graphical modelling framework are intrinsically different. Figure 7 demonstrates the relationship between two models. Only the distributions falling in the intersection of directed graphs (D) and undirected graphs (U) are perfectly mapped through both graphical models. It is clearly that we do not know the underlying distribution of brain activities. We speculate that either method can only provide a sketch of the ground truth from one particular perspective. Hence, we reported the pairwise comparisons results of both methods.

IV-B Limitations

Currently, both the NOTEARS and the proposed method were optimized with second order approximation with a global searching over the feasible space, which is time consuming if we want to extend those methods to application with hundreds of variables, a typical case in brain imaging study. For example, it is impossible to directly apply the method for voxel-wise network analysis. It will be desirable if we can find a way to reduce the search space in the model. In fact, the model itself has a flexible setting to incorporate the prior information of the network in the optimization, which will significantly alleviate the computation burden. The prior information includes some known connections/disconnections and the directions of connections. Another limitation lies in the variances of the results that we found between the directed network method and the undirected network method. At this stage, most of whole brain functional connectivity studies are based on either directed or undirected network analysis. It will be meaningful to cross-validate the findings with more data sets.

V Conclusion

In this paper, we proposes a joint DAGs model for structural learning via a continuous optimization framework. By encouraging the group similarity in the DAG structure, the model can be used to find DAGs for multiple groups simultaneously. The efficiency of the proposed model was tested using simulation data sets with a wide range of varieties including density levels, noise types, etc. By comparing with other structural learning methods, the proposed model have high TPR and lower FDR, especially in high dimensional cases. As a demonstration of its application, the proposed model was used to detect the abnormal network structures in schizophrenia patients in task fMRI data from MCIC. The lower global and local efficiency demonstrated the deficiency of functional integration in SZ patients. In addition, we also identified disrupted hub structures and characteristic edges in SZ patients. Several of the findings have been in line with previous schizophrenia studies. Finally, we compared the results of directed networks to those with the undirected networks modelling method. Although both network modelling methods can differentiate the key alterations between schizophrenia patients and healthy controls, the extracted networks’ feature provide different perspectives and biological significance.

Acknowledgment

The authors would like to thank the partial support by NIH (P20GM109068, R01MH104680, R01MH107354, R01MH103220, R01EB020407) and NSF (#1539067).

References

[1] M. J. Jafri, G. D. Pearlson, M. Stevens, and V. D. Calhoun, “A method for functional network connectivity among spatially independent resting-state components in schizophrenia,” Neuroimage, vol. 39, no. 4, pp. 1666–1681, 2008.
[2] V. D. Calhoun, T. Eichele, and G. Pearlson, “Functional brain networks in schizophrenia: a review,” Frontiers in human neuroscience, vol. 3, p. 17, 2009.
[3] L. Cerliani, M. Mennes, R. M. Thomas, A. Di Martino, M. Thioux, and C. Keysers, “Increased functional connectivity between subcortical and cortical resting-state networks in autism spectrum disorder,” JAMA psychiatry, vol. 72, no. 8, pp. 767–777, 2015.
[4] M.-E. Lynall, D. S. Bassett, R. Kerwin, P. J. McKenna, M. Kitzbichler, U. Muller, and E. Bullmore, “Functional connectivity and brain networks in schizophrenia,” Journal of Neuroscience, vol. 30, no. 28, pp. 9477–9487, 2010.
[5] D. C. Glahn, A. R. Laird, I. Ellison-Wright, S. M. Thelen, J. L. Robinson, J. L. Lancaster, E. Bullmore, and P. T. Fox, “Meta-analysis of gray matter anomalies in schizophrenia: application of anatomic likelihood estimation and network analysis,” Biological psychiatry, vol. 64, no. 9, pp. 774–781, 2008.
[6] I. Ellison-Wright, D. C. Glahn, A. R. Laird, S. M. Thelen, and E. Bullmore, “The anatomy of first-episode and chronic schizophrenia: an anatomical likelihood estimation meta-analysis,” American Journal of Psychiatry, vol. 165, no. 8, pp. 1015–1023, 2008.
[7] M. Liang, Y. Zhou, T. Jiang, Z. Liu, L. Tian, H. Liu, and Y. Hao, “Widespread functional disconnectivity in schizophrenia with resting-state functional magnetic resonance imaging,” Neuroreport, vol. 17, no. 2, pp. 209–213, 2006.
[8] R. Salvador, S. Sarro, J. J. Gomar, J. Ortiz-Gil, F. Vila, A. Capdevila, E. Bullmore, P. J. McKenna, and E. Pomarol-Clotet, “Overall brain connectivity maps show cortico-subcortical abnormalities in schizophrenia,” Human brain mapping, vol. 31, no. 12, pp. 2003–2014, 2010.
[9] X. Ma, W. Zheng, C. Li, Z. Li, J. Tang, L. Yuan, L. Ouyang, K. Jin, Y. He, and X. Chen, “Decreased regional homogeneity and increased functional connectivity of default network correlated with neurocognitive deficits in subjects with genetic high-risk for schizophrenia: A resting-state fmri study,” Psychiatry Research, vol. 281, p. 112603, 2019.
[10] C. Meek, “Graphical models: Selecting causal and statistical models,” Ph.D. dissertation, PhD thesis, Carnegie Mellon University, 1997.
[11] D. M. Chickering, “Optimal structure identification with greedy search,” Journal of machine learning research, vol. 3, no. Nov, pp. 507–554, 2002.
[12] C. Hanson, S. J. Hanson, J. Ramsey, and C. Glymour, “Atypical effective connectivity of social brain networks in individuals with autism,” Brain connectivity, vol. 3, no. 6, pp. 578–589, 2013.
[13] E. Dobryakova, O. Boukrina, and G. R. Wylie, “Investigation of information flow during a novel working memory task in individuals with traumatic brain injury,” Brain connectivity, vol. 5, no. 7, pp. 433–441, 2015.
[14] S. Shimizu, P. O. Hoyer, A. Hyvärinen, and A. Kerminen, “A linear non-gaussian acyclic model for causal discovery,” Journal of Machine Learning Research, vol. 7, no. Oct, pp. 2003–2030, 2006.
[15] Y. Liu, X. Wu, J. Zhang, X. Guo, Z. Long, and L. Yao, “Altered effective connectivity model in the default mode network between bipolar and unipolar depression based on resting-state fmri,” Journal of affective disorders, vol. 182, pp. 8–17, 2015.
[16] X. Zheng, B. Aragam, P. K. Ravikumar, and E. P. Xing, “Dags with no tears: Continuous optimization for structure learning,” in Advances in Neural Information Processing Systems, 2018, pp. 9472–9483.
[17] B. Aragam, A. A. Amini, and Q. Zhou, “Learning directed acyclic graphs with penalized neighbourhood regression,” arXiv preprint arXiv:1511.08963, 2015.
[18] S. Van de Geer, P. Bühlmann et al., “ $l_{0}$ -penalized maximum likelihood for sparse directed acyclic graphs,” The Annals of Statistics, vol. 41, no. 2, pp. 536–567, 2013.
[19] P.-L. Loh and P. Bühlmann, “High-dimensional learning of linear causal networks via inverse covariance estimation,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 3065–3105, 2014.
[20] R. H. Byrd, P. Lu, J. Nocedal, and C. Zhu, “A limited memory algorithm for bound constrained optimization,” SIAM Journal on scientific computing, vol. 16, no. 5, pp. 1190–1208, 1995.
[21] K. Zhong, I. E.-H. Yen, I. S. Dhillon, and P. K. Ravikumar, “Proximal quasi-newton for computationally intensive l1-regularized m-estimators,” in Advances in Neural Information Processing Systems, 2014, pp. 2375–2383.
[22] P. Spirtes, C. N. Glymour, R. Scheines, D. Heckerman, C. Meek, G. Cooper, and T. Richardson, Causation, prediction, and search. MIT press, 2000.
[23] V. M. Eguiluz, D. R. Chialvo, G. A. Cecchi, M. Baliki, and A. V. Apkarian, “Scale-free brain functional networks,” Physical review letters, vol. 94, no. 1, p. 018102, 2005.
[24] M. P. van den Heuvel, C. J. Stam, M. Boersma, and H. H. Pol, “Small-world and scale-free organization of voxel-based resting-state functional connectivity in the human brain,” Neuroimage, vol. 43, no. 3, pp. 528–539, 2008.
[25] M. Rubinov and O. Sporns, “Complex network measures of brain connectivity: uses and interpretations,” Neuroimage, vol. 52, no. 3, pp. 1059–1069, 2010.
[26] A. T. Reid, D. B. Headley, R. D. Mill, R. Sanchez-Romero, L. Q. Uddin, D. Marinazzo, D. J. Lurie, P. A. Valdés-Sosa, S. J. Hanson, B. B. Biswal et al., “Advancing functional connectivity research from association to causation,” Nature neuroscience, vol. 22, no. 11, pp. 1751–1760, 2019.
[27] F. Liang, Q. Song, and P. Qiu, “An equivalent measure of partial correlation coefficients for high-dimensional gaussian graphical models,” Journal of the American Statistical Association, vol. 110, no. 511, pp. 1248–1265, 2015.
[28] A. Zhang, J. Fang, F. Liang, V. D. Calhoun, and Y.-P. Wang, “Aberrant brain connectivity in schizophrenia detected via a fast gaussian graphical model,” IEEE journal of biomedical and health informatics, vol. 23, no. 4, pp. 1479–1489, 2018.
[29] C. Zhao, J. Zhu, X. Liu, C. Pu, Y. Lai, L. Chen, X. Yu, and N. Hong, “Structural and functional brain abnormalities in schizophrenia: A cross-sectional study at different stages of the disease,” Progress in Neuro-Psychopharmacology and Biological Psychiatry, vol. 83, pp. 27–32, 2018.
[30] S. Li, N. Hu, W. Zhang, B. Tao, J. Dai, Y. Gong, Y. Tan, D. Cai, and S. Lui, “Dysconnectivity of multiple brain networks in schizophrenia: a meta-analysis of resting-state functional connectivity,” Frontiers in psychiatry, vol. 10, p. 482, 2019.
[31] P. K. McGuire, R. Murray, and G. Shah, “Increased blood flow in broca’s area during auditory hallucinations in schizophrenia,” The Lancet, vol. 342, no. 8873, pp. 703–706, 1993.
[32] P. Das, J. Lagopoulos, C. M. Coulston, A. F. Henderson, and G. S. Malhi, “Mentalizing impairment in schizophrenia: a functional mri study,” Schizophrenia research, vol. 134, no. 2-3, pp. 158–164, 2012.
[33] Z.-K. Gao, X.-R. Liu, C. Ma, K. Ma, S. Gao, and J. Zhang, “Functional alteration of brain network in schizophrenia: An fmri study based on mutual information,” EPL (Europhysics Letters), vol. 128, no. 5, p. 50005, 2020.
[34] M.-H. Chen, W.-C. Chang, Y.-M. Bai, K.-L. Huang, P.-C. Tu, T.-P. Su, C.-T. Li, W.-C. Lin, S.-J. Tsai, and J.-W. Hsu, “Cortico-thalamic dysconnection in early-stage schizophrenia: a functional connectivity magnetic resonance imaging study,” European Archives of Psychiatry and Clinical Neuroscience, pp. 1–8, 2019.
[35] S. Kiparizoska and T. Ikuta, “Disrupted olfactory integration in schizophrenia: functional connectivity study,” International Journal of Neuropsychopharmacology, vol. 20, no. 9, pp. 740–746, 2017.
[36] G. E. Alexander and M. D. Crutcher, “Functional architecture of basal ganglia circuits: neural substrates of parallel processing,” Trends in neurosciences, vol. 13, no. 7, pp. 266–271, 1990.
[37] Y. Wang, K. S. Hung, M. Y. Deng, S. S. Lui, J. C. Lee, H. K. Mak, P.-L. Khong, P. C. Sham, E. F. Cheung, and R. C. Chan, “Su101. altered resting-state functional connectivity of striatum in first-episode schizophrenia,” Schizophrenia Bulletin, vol. 43, no. Suppl 1, p. S197, 2017.
[38] L.-B. Cui, K. Liu, C. Li, L.-X. Wang, F. Guo, P. Tian, Y.-J. Wu, L. Guo, W.-M. Liu, Y.-B. Xi et al., “Putamen-related regional and network functional deficits in first-episode schizophrenia with auditory verbal hallucinations,” Schizophrenia Research, vol. 173, no. 1-2, pp. 13–22, 2016.
[39] Y. Du, S. L. Fryer, Z. Fu, D. Lin, J. Sui, J. Chen, E. Damaraju, E. Mennigen, B. Stuart, R. L. Loewy et al., “Dynamic functional connectivity impairments in early schizophrenia and clinical high-risk for psychosis,” Neuroimage, vol. 180, pp. 632–645, 2018.
[40] M. Hua, Y. Peng, Y. Zhou, W. Qin, C. Yu, and M. Liang, “Disrupted pathways from limbic areas to thalamus in schizophrenia highlighted by whole-brain resting-state effective connectivity analysis,” Progress in Neuro-Psychopharmacology and Biological Psychiatry, vol. 99, p. 109837, 2020.
[41] N. V. Kraguljac, D. M. White, N. Hadley, J. A. Hadley, L. ver Hoef, E. Davis, and A. C. Lahti, “Aberrant hippocampal connectivity in unmedicated patients with schizophrenia and effects of antipsychotic medication: a longitudinal resting state functional mri study,” Schizophrenia bulletin, vol. 42, no. 4, pp. 1046–1055, 2016.