
Cross-Modal Synthesis of Structural MRI and Functional Connectivity Networks via Conditional ViT-GANs

Abstract

The cross-modal synthesis between structural magnetic resonance imaging (sMRI) and functional network connectivity (FNC) is a relatively unexplored area in medical imaging, especially with respect to schizophrenia. This study employs conditional Vision Transformer Generative Adversarial Networks (cViT-GANs) to generate FNC data based on sMRI inputs. After training on a comprehensive dataset that included both individuals with schizophrenia and healthy control subjects, our cViT-GAN model effectively synthesized the FNC matrix for each subject, and then formed a group difference FNC matrix, obtaining a Pearson correlation of 0.73 with the actual FNC matrix. In addition, our FNC visualization results demonstrate significant correlations in particular subcortical brain regions, highlighting the model’s capability of capturing detailed structural-functional associations. This performance distinguishes our model from conditional CNN-based GAN alternatives such as Pix2Pix. Our research is one of the first attempts to link sMRI and FNC synthesis, setting it apart from other cross-modal studies that concentrate on T1- and T2-weighted MR images or the fusion of MRI and CT scans.

Index Terms—  Magnetic resonance imaging, generative model, vision transformer, image synthesis

1   Introduction

Generative adversarial networks (GANs) originated as a novel approach to building generative models by training a generator and a discriminator together [1]. GANs have revolutionized image generation, style adaptation, and data augmentation [2, 3], and their use in modality translation tasks in medical imaging is especially notable [4]. The vision transformer (ViT) has transformed visual tasks by modeling long-range dependencies, offering a reliable alternative to CNNs by adapting architectures from language processing [5]. ViT-GAN combines ViT's characteristics with the GAN framework, achieving superior image synthesis performance without CNN components [6]. However, the application of ViT-GANs to medical datasets, particularly brain MRI, remains largely unexplored.

This study presents the conditional ViT-GAN (cViT-GAN), a new generative framework for synthesizing functional network connectivity (FNC) matrices from sMRI inputs. The model uses the sMRI volume and a class identifier as conditions and is regularized by a newly designed correlation loss. FNC expresses temporal correlations between independent neural activities and is commonly depicted as a 2D matrix. To derive FNC matrices, independent component analysis (ICA) was applied to fMRI datasets from the same cohort as the sMRI samples [7]. Existing literature highlights the intricate relationship between structural and functional MRI modalities [8, 9], which are often complementary in nature. Early research indicates that integrating sMRI and fMRI can enhance diagnostic accuracy for disorders such as schizophrenia [10]. Although fMRI data are rich in temporal information, analyzing them can be computationally demanding; FNC serves as a more efficient alternative, summarizing complex temporal patterns into a 2D layout without significant diagnostic loss. Our research aims to understand the intricate correlation between sMRI and FNC in diagnosing and understanding schizophrenia. In addition, our cViT-GAN model produced promising FNC visualizations: from sMRI data alone, it generated FNC matrices whose functional domains closely match those of the original FNC data. By pinpointing these functional areas, this research helps illuminate the pathogenesis of schizophrenia.

Fig. 1: The cViT-GAN framework with the embedded class identifier for sMRI-to-FNC synthesis: the model is regularized by ViT perceptual loss and correlation loss.

2   Related Works

Recent studies have applied GANs and transformer-based architectures to medical imaging tasks, showing superior performance in image synthesis and reconstruction [11, 12, 13, 14]. Specifically, Pan et al. used a transformer-based encoder for translating MRI modalities, demonstrating enhanced performance [14]. Li et al. introduced MedViTGAN for generating synthetic histopathology images, incorporating an auxiliary classifier and adaptive loss mechanism [15]. Additionally, Abrol et al. showed that fusing multi-feature sMRI and fMRI data via deep learning significantly improves the prediction of Alzheimer’s disease progression [16].

3   Methods

In this section, we explain the methods employed in our cViT-GAN model for image synthesis, specifically for generating FNC matrices from sMRI inputs. We detail the generator and discriminator architectures, as well as the composite loss function, whose multiple components are designed to achieve accurate FNC matrix synthesis.

3.1   Conditional ViT-GAN

The cViT-GAN consists of two main components: a generator $G$ and a discriminator $D$. The architecture leverages the adversarial principle of GANs to optimize the following objective function:

$$\min_{G}\max_{D}\;\mathcal{L}(G,D) = \mathbb{E}_{y\sim p_{\text{data}}(y)}[\log D(y)] + \mathbb{E}_{x\sim p_{\text{input}}(x)}[\log(1 - D(G(x)))] \quad (1)$$

In the objective function $\mathcal{L}(G,D)$, the term $\mathbb{E}_{y\sim p_{\text{data}}(y)}[\log D(y)]$ is the expected log-likelihood that the discriminator $D$ correctly classifies real FNC matrices $y$, while $\mathbb{E}_{x\sim p_{\text{input}}(x)}[\log(1-D(G(x)))]$ is the expectation that $D$ correctly rejects the FNC matrices $\hat{y}=G(x)$ generated from sMRI inputs $x$.
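For concreteness, the following PyTorch-style sketch shows one way the two adversarial terms in Eq. (1) can be computed using binary cross-entropy over discriminator logits; the handles `G` and `D` stand in for the generator and discriminator modules and are assumptions for illustration rather than the exact implementation.

```python
# Minimal sketch of the adversarial objective in Eq. (1), assuming G maps sMRI
# volumes to FNC matrices and D returns a real/fake logit per sample.
import torch
import torch.nn.functional as F

def discriminator_loss(D, G, smri, fnc_real):
    fnc_fake = G(smri).detach()                    # stop gradients into the generator
    logits_real = D(fnc_real)
    logits_fake = D(fnc_fake)
    # log D(y) + log(1 - D(G(x))) expressed as binary cross-entropy terms
    loss_real = F.binary_cross_entropy_with_logits(logits_real, torch.ones_like(logits_real))
    loss_fake = F.binary_cross_entropy_with_logits(logits_fake, torch.zeros_like(logits_fake))
    return loss_real + loss_fake

def generator_adversarial_loss(D, G, smri):
    logits_fake = D(G(smri))
    # non-saturating form: the generator maximizes log D(G(x))
    return F.binary_cross_entropy_with_logits(logits_fake, torch.ones_like(logits_fake))
```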

Generator: The generator transforms 3D sMRI inputs $x$ into a 2D network-by-network FNC matrix $\hat{y}$. This conversion is accomplished through a series of transformer blocks that exploit the inherent patterns and features of the sMRI data.

Position Embedding with Class Identifier: Before being processed by the generator, the 3D sMRI volume undergoes a preliminary transformation: it is divided into a set of 3D patches, which are linearly projected into embedding vectors and combined with position embeddings. The introduction of a class identifier is crucial, as it provides the model with context about the nature of the input, i.e., whether it comes from a schizophrenia patient or a healthy control.
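A minimal sketch of this embedding step is given below, assuming a single-channel 128×128×128 input volume, 16×16×16 patches (an 8×8×8 grid), a 384-dimensional embedding, and a learned label embedding for the class identifier; all of these sizes are illustrative assumptions rather than the reported configuration.

```python
import torch
import torch.nn as nn

class PatchEmbed3D(nn.Module):
    """Split a 3D sMRI volume into patches, embed them, and prepend a class identifier.
    Assumes a 128x128x128 single-channel input with 16x16x16 patches (8x8x8 grid)."""
    def __init__(self, in_ch=1, patch=16, dim=384, n_classes=2, grid=(8, 8, 8)):
        super().__init__()
        self.proj = nn.Conv3d(in_ch, dim, kernel_size=patch, stride=patch)  # linear patch projection
        n_patches = grid[0] * grid[1] * grid[2]
        self.pos = nn.Parameter(torch.zeros(1, n_patches + 1, dim))         # position embeddings
        self.cls_embed = nn.Embedding(n_classes, dim)                       # SZ vs. HC identifier

    def forward(self, x, label):
        tokens = self.proj(x).flatten(2).transpose(1, 2)   # (B, N, dim) patch embeddings
        cls = self.cls_embed(label).unsqueeze(1)           # (B, 1, dim) class identifier token
        tokens = torch.cat([cls, tokens], dim=1)           # prepend the class token
        return tokens + self.pos                           # add position embeddings
```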

Transformer Block: The heart of the generator is the transformer block, which captures the complexities of the data through a combination of multi-head self-attention and position-wise feed-forward networks. The sequential nature of this design allows the generator to capture long-range dependencies in the data. The computation is given by:

$$Z = \text{LayerNorm}(\text{SelfAttention}(X) + X) \quad (2)$$
$$\text{Output} = \text{LayerNorm}(Z + \text{FFN}(Z)) \quad (3)$$
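As an illustrative sketch, Eqs. (2)-(3) can be realized with a post-norm transformer block as below; the embedding size, head count, and MLP ratio are placeholder choices, not the reported hyperparameters.

```python
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Post-norm transformer block following Eqs. (2)-(3)."""
    def __init__(self, dim=384, heads=6, mlp_ratio=4.0, dropout=0.0):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, dropout=dropout, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, int(dim * mlp_ratio)),
            nn.GELU(),
            nn.Linear(int(dim * mlp_ratio), dim),
        )
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)      # multi-head self-attention
        z = self.norm1(attn_out + x)          # Eq. (2)
        return self.norm2(z + self.ffn(z))    # Eq. (3)
```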

MLP for FNC Fragments: After the transformer blocks, each resulting embedding vector is further refined by a multi-layer perceptron (MLP). This step generates fragments, i.e., smaller pieces, of the desired FNC matrix. The final stage stitches these fragments together, yielding the synthesized FNC matrix $\hat{y}$.
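A minimal sketch of this fragment-and-stitch step follows; the 53×53 FNC size, token count, and fragment layout are assumptions chosen for illustration rather than the exact design.

```python
import torch
import torch.nn as nn

class FNCFragmentHead(nn.Module):
    """Map each token embedding to a small FNC fragment and stitch the fragments
    into the full network-by-network matrix (sizes are illustrative)."""
    def __init__(self, dim=384, n_tokens=64, fnc_size=53):
        super().__init__()
        self.fnc_size = fnc_size
        # each token predicts an equal share of the flattened FNC matrix
        self.frag_len = (fnc_size * fnc_size + n_tokens - 1) // n_tokens
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, self.frag_len))

    def forward(self, tokens):                              # tokens: (B, n_tokens, dim)
        frags = self.mlp(tokens)                            # (B, n_tokens, frag_len)
        flat = frags.flatten(1)[:, : self.fnc_size ** 2]    # drop any padding overshoot
        return flat.view(-1, self.fnc_size, self.fnc_size)  # stitched FNC matrix
```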

Discriminator: The discriminator $D$ plays a pivotal role in any generative setup. In this architecture, $D$ is trained to discern between authentic FNC matrices $y$ and those generated by our model, $\hat{y}$. Its design draws inspiration from the ViT model, which has established itself as a potent tool for image classification tasks.

3.2   Loss Functions

The standard GAN loss may not be sufficient for capturing the intricate relationships in FNC matrices. Therefore, our model employs a composite loss function, which incorporates several other loss components.

ViT Perceptual Loss: The ViT perceptual loss aims to capture high-level features and patterns in FNC matrices, moving beyond mere pixel-level differences. Instead of relying on the hierarchical structure of a CNN such as VGG, we utilize the attention-based features of a pre-trained ViT-16 network to compute this loss between real and generated FNC matrices. The ViT perceptual loss is defined as:

$$\mathcal{L}_{\text{ViT\_16}} = \frac{1}{W_{l} H_{l}} \sum_{i,j} \left( F^{l}(y)_{ij} - F^{l}(\hat{y})_{ij} \right)^{2} \quad (4)$$

Here, $F^{l}(\cdot)$ is the feature map extracted from the $l$-th block of the ViT-16 network, and $W_{l}$ and $H_{l}$ are the width and height, respectively, of the feature map at block $l$.
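Given a feature extractor that returns the token feature map of a chosen ViT-16 block (for instance, a forward hook on a pre-trained ViT; the exact block and preprocessing are not specified here), Eq. (4) reduces to a mean squared error over features, as in the sketch below.

```python
import torch

def vit_perceptual_loss(feat_extractor, y_real, y_fake):
    """Eq. (4): mean squared difference between ViT-16 feature maps of real and
    generated FNC matrices. `feat_extractor` is an assumed callable returning
    the block-l features; any resizing of the FNC matrices to the ViT input
    size is assumed to happen inside it."""
    f_real = feat_extractor(y_real)
    f_fake = feat_extractor(y_fake)
    return torch.mean((f_real - f_fake) ** 2)
```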

Correlation Loss: The correlation loss captures relationships between corresponding regions in the FNC matrix. This is particularly important as FNC matrices often show similarity not just in absolute values but also in the relative arrangement of values. The correlation loss is defined as:

$$\mathcal{L}_{\text{corr}} = 1 - \frac{\text{cov}(y, \hat{y})}{\sigma_{y}\, \sigma_{\hat{y}}} \quad (5)$$

In this equation, $\text{cov}(y,\hat{y})$ is the covariance between $y$ and $\hat{y}$, and $\sigma_{y}$ and $\sigma_{\hat{y}}$ are their respective standard deviations.
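A per-sample implementation sketch of Eq. (5), averaged over the batch, might look as follows; the epsilon term is added only for numerical stability.

```python
import torch

def correlation_loss(y, y_hat, eps=1e-8):
    """Eq. (5): one minus the Pearson correlation between flattened real and
    generated FNC matrices, computed per sample and averaged over the batch."""
    y = y.flatten(1)
    y_hat = y_hat.flatten(1)
    yc = y - y.mean(dim=1, keepdim=True)
    yhc = y_hat - y_hat.mean(dim=1, keepdim=True)
    cov = (yc * yhc).mean(dim=1)
    std_y = yc.pow(2).mean(dim=1).sqrt()
    std_yh = yhc.pow(2).mean(dim=1).sqrt()
    corr = cov / (std_y * std_yh + eps)
    return (1.0 - corr).mean()
```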

Total Loss: Altogether, the generator loss of our cViT-GAN is:

$$\mathcal{L}_{G} = \mathcal{L}_{\text{GAN}} + \lambda_{1}\mathcal{L}_{\text{MSE}} + \lambda_{2}\mathcal{L}_{\text{ViT\_16}} + \lambda_{3}\mathcal{L}_{\text{corr}} \quad (6)$$

Here, $\mathcal{L}_{\text{GAN}}$ is the adversarial loss for the generator. The hyperparameters $\lambda_{1}$, $\lambda_{2}$, and $\lambda_{3}$ control the balance of the MSE loss, ViT perceptual loss, and correlation loss, respectively.
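Combining the pieces, a sketch of the total generator loss in Eq. (6) is shown below; it reuses the illustrative helpers from the earlier sketches, and the lambda values are placeholders rather than tuned settings.

```python
import torch
import torch.nn.functional as F

def generator_total_loss(D, y_real, y_fake, feat_extractor,
                         lambda1=1.0, lambda2=0.1, lambda3=0.5):
    """Eq. (6): adversarial loss plus weighted MSE, ViT perceptual, and
    correlation terms. `vit_perceptual_loss` and `correlation_loss` refer to
    the illustrative sketches above; lambda values are placeholders."""
    logits_fake = D(y_fake)
    adv = F.binary_cross_entropy_with_logits(logits_fake, torch.ones_like(logits_fake))
    mse = F.mse_loss(y_fake, y_real)
    perceptual = vit_perceptual_loss(feat_extractor, y_real, y_fake)
    corr = correlation_loss(y_real, y_fake)
    return adv + lambda1 * mse + lambda2 * perceptual + lambda3 * corr
```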

4   Experiments

Table 1: Comparison of Pearson correlation and cosine similarity of cViT-GAN and baselines.

Model       | Generator | Discriminator | Class Identifier | Correlation Loss | Pearson | Cosine
cViT-GAN    | ViT       | ViT           | Yes              | Yes              | 0.731   | 0.732
DCGAN       | U-Net     | MLP           | Yes              | Yes              | 0.695   | 0.693
Pix2Pix [3] | U-Net     | PatchGAN      | Yes              | Yes              | 0.719   | 0.714
cViT-GAN-0  | ViT       | ViT           | No               | Yes              | 0.543   | 0.544
cViT-GAN-1  | ViT       | ViT           | Yes              | No               | 0.724   | 0.722
Fig. 2: Generated FNC (a) and real FNC (b) for the group difference analysis of schizophrenia.

Datasets: We employed two datasets related to clinical research on schizophrenia, sourced from multiple sites in the United States and China and comprising 827 and 815 participants, respectively. The first dataset amalgamated data from three studies—fBIRN, MPRC, and COBRE—acquired on nine different 3-Tesla scanners with standard echo planar imaging (EPI) sequences. The second dataset originated in China and was acquired using three different 3-Tesla scanners. Both datasets comprised a mix of control individuals and schizophrenia patients and included detailed demographic characteristics such as mean age and gender distribution.

Preprocessing: The sMRI data were preprocessed with a pipeline consisting of gray-scale segmentation and normalization. We enhanced the robustness of our model by introducing data augmentation techniques, including random rotation and Gaussian noise addition. For the fMRI data, preprocessing involved slice timing correction, realignment, and normalization. The FNC and sMRI feature vectors were derived following methods from our previous research [7]. Specifically, we employed independent component analysis (ICA) to extract time courses, which were subsequently used to compute the FNC matrices via cross-correlation.
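As a generic illustration of the final step, an FNC matrix can be obtained by cross-correlating ICA component time courses; the sketch below assumes a (timepoints × components) array and applies a Fisher z-transform, without reproducing the exact NeuroMark settings of [7].

```python
import numpy as np

def fnc_from_timecourses(tc):
    """Build an FNC matrix from ICA component time courses.
    `tc` has shape (timepoints, components); entries are Pearson correlations
    between component pairs, Fisher z-transformed for analysis."""
    fnc = np.corrcoef(tc, rowvar=False)                     # components x components
    np.fill_diagonal(fnc, 0.0)                              # ignore self-correlations
    return np.arctanh(np.clip(fnc, -0.999999, 0.999999))    # Fisher z-transform

# Example: 150 timepoints and 53 components yield a 53 x 53 FNC matrix.
fnc = fnc_from_timecourses(np.random.randn(150, 53))
```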

Models: In our experiment, we deployed the cViT-GAN model, consisting of a generator and a discriminator as shown in Figure 1. For comparative analysis, we included the following baseline models: 1) DCGAN: a deep convolutional generative adversarial network in which both the generator and the discriminator are built using CNNs. 2) Pix2Pix [3]: this model uses a U-Net architecture for the generator and a PatchGAN architecture for the discriminator, focusing primarily on image-to-image translation tasks. 3) cViT-GAN-0: a variant of cViT-GAN that uses only sMRI as the condition for generating outputs, without including any class identifier during concatenation. 4) cViT-GAN-1: another variant of cViT-GAN that uses both sMRI and class identifiers as conditions but excludes the correlation loss and ViT perceptual loss during training.

Training and Evaluation: All models were trained using the AdamW optimizer. The initial learning rate was set uniformly to 1×10^-4 across all architectures, including cViT-GAN, its variants, DCGAN, and Pix2Pix. Training lasted for 300 epochs and was executed on 8 NVIDIA V100 GPUs in a distributed environment. Both the generator and discriminator used a MultiStepLR scheduler, with the learning rate decreased by a factor of 0.1 at epoch 50 and again at epoch 150. To assess the models' performance comprehensively, we employed a 10-fold cross-validation approach. For evaluation, we compared each model's generated FNC matrices against the actual FNC matrices. Our evaluation metrics were Pearson correlation and cosine similarity, which capture the linear and angular relationships between real and generated matrices and indicate how well the generated FNC matrices reproduce the structural and functional traits present in the real data.
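The sketch below illustrates the optimizer and scheduler configuration described above, together with the two evaluation metrics; the `G` and `D` arguments are stand-ins for the generator and discriminator modules.

```python
import torch
import numpy as np

def make_optimizers(G, D):
    """AdamW with lr 1e-4 and MultiStepLR decay (factor 0.1) at epochs 50 and 150,
    applied to both the generator G and the discriminator D."""
    opt_g = torch.optim.AdamW(G.parameters(), lr=1e-4)
    opt_d = torch.optim.AdamW(D.parameters(), lr=1e-4)
    sched_g = torch.optim.lr_scheduler.MultiStepLR(opt_g, milestones=[50, 150], gamma=0.1)
    sched_d = torch.optim.lr_scheduler.MultiStepLR(opt_d, milestones=[50, 150], gamma=0.1)
    return opt_g, opt_d, sched_g, sched_d

def evaluate(fnc_real, fnc_gen):
    """Pearson correlation and cosine similarity between flattened FNC matrices."""
    a = np.asarray(fnc_real).ravel()
    b = np.asarray(fnc_gen).ravel()
    pearson = float(np.corrcoef(a, b)[0, 1])
    cosine = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return pearson, cosine
```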

5   Results

In this section, we compare the experimental results of our cViT-GAN model with various baseline models, specifically focusing on generating the "group difference FNC" between the healthy control group (HC) and schizophrenia patients (SZ). We use Pearson correlation and cosine similarity as evaluation metrics for these generative methods. Beyond quantitative measures, we also highlight functional domains where the generated group difference FNC closely matches the actual FNC; this high degree of comparability may underscore significant structural-functional relationships between these groups.

Basic results: Table 1 provides a comprehensive comparison of the Pearson correlation and cosine similarity scores among various generative models, including our proposed cViT-GAN. In terms of Pearson correlation and cosine similarity, the cViT-GAN outperforms all baseline models, achieving scores of 0.731 and 0.732, respectively. This highlights the advantage of using a ViT architecture for both the generator and discriminator, along with class identifiers and the newly designed correlation loss.

FNC domain visualizations: This analysis provides visual representations of group difference functional network connectivity matrices (HC-SZ). Figure 2 compares the generated group difference FNC with the real one. Remarkably, our cViT-GAN model generates FNC matrices that closely mimic the actual group difference FNC, specifically in subcortical regions such as CB-SC, CB-AUD, CB-SM, CB-VS, CB-CC, CB-DM, and CB-CB. In these critical subcortical areas, the correlation between synthetic FNC derived from sMRI and actual group difference FNC data reaches a similarity of up to 0.85.

This finding not only underscores the importance of subcortical structures in distinguishing between HC and SZ groups but also validates the accuracy and utility of our cViT-GAN model in capturing these essential structural-functional connections. The high degree of similarity in subcortical regions suggests that the model is adept at reproducing complex, real-world neurological patterns, especially those that mark differences between healthy and pathological states. This offers promising avenues for advancing our understanding of disorders like schizophrenia, equipping us with more accurate diagnostic tools and targeted treatment strategies.

Moreover, our model also revealed significant similarities in differential values for other region pairs, including SC-SM, CC-CC, SM-DM, and VS-DM, with moderate similarities noted in pairs such as VS-AUD and CC-SM. These observations add further insight into how differences in functional networks between the HC and SZ groups might be influenced by underlying structural abnormalities. Collectively, this knowledge is pivotal for developing improved diagnostic methods and treatment plans that can effectively address the complexities of conditions like schizophrenia.

6   Conclusion

In conclusion, our study reveals that sMRI data can closely reflect FNC, offering the potential for one imaging modality to inform the other in diagnostics. This deepens our understanding of the brain’s structure-function relationship, paving the way for personalized medicine and disease prediction. Our work sets the stage for holistic biomarkers through integrated neuroimaging, and the generative model enables FNC simulation under specific pathological states, yielding new insights into brain functionality and behavior.

References

  • [1] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio, “Generative adversarial nets,” Advances in neural information processing systems, vol. 27, 2014.
  • [2] Alec Radford, Luke Metz, and Soumith Chintala, “Unsupervised representation learning with deep convolutional generative adversarial networks,” arXiv preprint arXiv:1511.06434, 2015.
  • [3] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros, “Image-to-image translation with conditional adversarial networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 1125–1134.
  • [4] Bo Zhan, Di Li, Xi Wu, Jiliu Zhou, and Yan Wang, “Multi-modal MRI image synthesis via GAN with multi-scale gate mergence,” IEEE Journal of Biomedical and Health Informatics, vol. 26, no. 1, pp. 17–26, 2021.
  • [5] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020.
  • [6] Kwonjoon Lee, Huiwen Chang, Lu Jiang, Han Zhang, Zhuowen Tu, and Ce Liu, “ViTGAN: Training GANs with vision transformers,” arXiv preprint arXiv:2107.04589, 2021.
  • [7] Yuhui Du, Zening Fu, Jing Sui, Shuang Gao, Ying Xing, Dongdong Lin, Mustafa Salman, Anees Abrol, Md Abdur Rahaman, Jiayu Chen, et al., “NeuroMark: An automated and adaptive ICA based pipeline to identify reproducible fMRI markers of brain disorders,” NeuroImage: Clinical, vol. 28, p. 102375, 2020.
  • [8] Lucina Q Uddin, “Complex relationships between structural and functional brain connectivity,” Trends in cognitive sciences, vol. 17, no. 12, pp. 600–602, 2013.
  • [9] Md Abdur Rahaman, Jiayu Chen, Zening Fu, Noah Lewis, Armin Iraji, and Vince D Calhoun, “Multi-modal deep learning of functional and structural neuroimaging and genomic data to predict mental illness,” in 2021 43rd Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC). IEEE, 2021, pp. 3267–3272.
  • [10] Yuda Bi, Anees Abrol, Zening Fu, and Vince Calhoun, “MultiCrossViT: Multimodal vision transformer for schizophrenia prediction using structural MRI and functional network connectivity data,” arXiv preprint arXiv:2211.06726, 2022.
  • [11] Dong Nie, Roger Trullo, Jun Lian, Li Wang, Caroline Petitjean, Su Ruan, Qian Wang, and Dinggang Shen, “Medical image synthesis with deep convolutional adversarial networks,” IEEE Transactions on Biomedical Engineering, vol. 65, no. 12, pp. 2720–2730, 2018.
  • [12] Elizabeth K Cole, John M Pauly, Shreyas S Vasanawala, and Frank Ong, “Unsupervised MRI reconstruction with generative adversarial networks,” arXiv preprint arXiv:2008.13065, 2020.
  • [13] Chengjia Wang, Guang Yang, Giorgos Papanastasiou, Sotirios A Tsaftaris, David E Newby, Calum Gray, Gillian Macnaught, and Tom J MacGillivray, “DiCyc: GAN-based deformation invariant cross-domain information fusion for medical image synthesis,” Information Fusion, vol. 67, pp. 147–160, 2021.
  • [14] Kai Pan, Pujin Cheng, Ziqi Huang, Li Lin, and Xiaoying Tang, “Transformer-based T2-weighted MRI synthesis from T1-weighted images,” in 2022 44th Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC). IEEE, 2022, pp. 5062–5065.
  • [15] Meng Li, Chaoyi Li, Peter Hobson, Tony Jennings, and Brian C Lovell, “MedViTGAN: End-to-end conditional GAN for histopathology image augmentation with vision transformers,” in 2022 26th International Conference on Pattern Recognition (ICPR). IEEE, 2022, pp. 4406–4413.
  • [16] Anees Abrol, Zening Fu, Yuhui Du, and Vince D Calhoun, “Multimodal data fusion of deep learning and dynamic functional connectivity features to predict Alzheimer’s disease progression,” in 2019 41st Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC). IEEE, 2019, pp. 4409–4413.