
Fully Guided Neural Schrödinger Bridge for Brain MR Image Synthesis

Hanyeol Yang, Sunggyu Kim, Yongseon Yoo, and Jong-min Lee

Hanyeol Yang, Sunggyu Kim, Yongseon Yoo, and Jong-min Lee are with Hanyang University, Seoul, South Korea (e-mail: [email protected]).
Abstract

Multi-modal brain MRI provides essential complementary information for clinical diagnosis. However, acquiring all modalities is often challenging due to time and cost constraints. To address this, various methods have been proposed to generate missing modalities from available ones. Traditional approaches can be broadly categorized into two types: paired and unpaired methods. While paired methods offer superior performance, obtaining large-scale paired datasets is difficult in real-world scenarios. Conversely, unpaired methods facilitate large-scale data collection but struggle to preserve critical image features, such as tumors. In this paper, we propose Fully Guided Schrödinger Bridges (FGSB), a novel framework based on Neural Schrödinger Bridges, to overcome these limitations. FGSB achieves stable, high-quality generation of missing modalities using minimal paired data. Furthermore, when provided with ground truth or a segmentation network for specific regions, FGSB can generate missing modalities while preserving these critical areas, with reduced data requirements. Our proposed model consists of two consecutive phases: 1) a generation phase, which fuses a generated image, a paired reference image, and Gaussian noise, employing iterative refinement to mitigate issues such as mode collapse and improve generation quality; and 2) a training phase, which learns the mapping from the generated image to the target modality. Experiments demonstrate that FGSB achieves generation performance comparable to methods trained on large datasets, while using data from only two subjects. Moreover, incorporating lesion information into FGSB significantly enhances its ability to preserve crucial lesion features.

Index Terms: Magnetic resonance imaging (MRI), medical image synthesis, Schrödinger bridges

1 Introduction


Multi-modal brain MRI provides information about anatomy and lesions, with each modality contributing complementary information. This offers significant advantages for diagnosis and for accurate segmentation of regions of interest [1, 2]. However, successfully acquiring all multi-modal images is difficult in practice due to time and economic constraints. To address this issue, methods have been proposed to synthesize the modalities that were not acquired from those that were. Recently, medical image synthesis has been dominated by deep learning-based methodologies using GANs, diffusion models, vision transformers, etc. [3, 4, 5]. Image-to-image translation in computer vision is broadly categorized into two approaches [6]: paired learning and unpaired learning. Paired learning is suitable under the assumption that we have a sufficient number of well-paired source-domain and target-domain images. Unpaired learning is suitable when we do not have a target image corresponding to each source image, but we do have a sufficient amount of source and target images.

MT-Net [7], a paired learning method, performs edge-aware pre-training using an MAE approach [8] to overcome the data scarcity problem inherent in paired learning, where obtaining large amounts of corresponding source and target images is challenging. Fine-tuning is then performed using the pre-trained ViT [9] encoder for the downstream task, medical image synthesis. During fine-tuning, the pre-trained ViT encoder is partially frozen. Because of this design, when the images differ between the pre-training and fine-tuning phases due to variations in manufacturers, resolution, protocols, or other factors, the benefits of pre-training may not be fully realized. In addition, effective MAE-based pre-training requires a large amount of data [10], while the number of subjects available for both the pre-training and fine-tuning phases of MT-Net can be limited in real-world scenarios.

Acquiring sufficient paired data remains challenging in real-world settings. Therefore, unpaired learning approaches have been used in many medical image synthesis studies. Most unpaired learning methods rely on cycle-consistency [11] to preserve important elements of the source image, such as biological structures or lesions [12, 13], which introduces several limitations: additional computational overhead, increased training time, and high sensitivity to hyperparameter selection [14]. Furthermore, GAN-based models characterize the target-modality distribution through implicit generator-discriminator adversarial learning rather than explicit likelihood estimation. This indirect modeling can introduce training instabilities that manifest as premature convergence and mode collapse. In addition, these methods typically employ single-step generation without iterative refinement, potentially limiting the expressiveness of the learned distribution mappings. These inherent limitations can compromise both the fidelity and diversity of the synthesized images.

To overcome the above limitations, Syndiff [15] demonstrates the promising application of diffusion models [16, 17] to medical image synthesis, but its iterative refinement process has limitations. Specifically, the stochastic nature of time-step sampling and the absence of an explicit mechanism to ensure consistency across intermediate states may affect the stability of the generation process. Additionally, the dependency on cycle-consistency mechanisms in unpaired learning approaches remains a fundamental limitation.

Here, we propose FGSB, a neural Schrödinger bridge based architecture for medical image synthesis. FGSB generates a corresponding target image from a source image and gradually refines it using Gaussian noise. Unlike a general diffusion model, it uses a small number of time steps and employs a mutual information loss to maintain consistency across the intermediate images generated during the process. It uses a self-supervised discriminator [31, 32] to enable learning with a small number of subjects, and shows performance competitive with other models using only two subjects and no pre-training stage.

Our main contributions are described as follows:

  • We propose a novel brain MR image synthesis framework with a self-supervised discriminator. Our framework can be trained on extremely limited paired data, without requiring a pre-training procedure or specialized data augmentation methods.

  • We extend the neural Schrödinger bridge framework to medical image synthesis while addressing its inherent adversarial learning challenges.

  • Our framework demonstrates superior performance in preserving both image quality and clinically significant features, such as lesions and WMHs, compared to existing methods, even when trained on limited data.

2 Related Work

2.1 Diffusion model

Recently, diffusion models have gained prominence in computer vision as an alternative to GAN-based models [18, 19, 20]. Diffusion models involve two main operations: a forward process, in which Gaussian noise with a scheduled variance is applied to the image, and a reverse process, in which the noise applied in the forward process is removed by a deep neural network. Diffusion models explicitly learn the likelihood thanks to their mathematical foundations and relatively straightforward training procedures, while effectively addressing the mode collapse commonly observed in GANs [17]. However, a remaining disadvantage is that many sampling steps are needed to generate an image [17].
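To make the forward process concrete, the following minimal sketch applies the standard closed-form DDPM-style noising step under a linear variance schedule; the schedule values and function name are illustrative, not taken from any specific model discussed here.

```python
import torch

def forward_diffuse(x0, t, betas):
    """Closed-form forward process q(x_t | x_0):
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps, eps ~ N(0, I)."""
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)[t]
    eps = torch.randn_like(x0)
    x_t = alpha_bar.sqrt() * x0 + (1.0 - alpha_bar).sqrt() * eps
    return x_t, eps

# Example: a linear variance schedule over 1000 steps.
betas = torch.linspace(1e-4, 0.02, 1000)
x0 = torch.randn(1, 1, 256, 256)          # a normalized MR slice would go here
x_t, eps = forward_diffuse(x0, t=500, betas=betas)
```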

DDGAN [17] addresses this limitation through an adversarial learning approach. In conventional diffusion models, the distribution of the reverse process is assumed to be Gaussian. However, this assumption is mathematically precise only when a large number of time steps is used in the reverse process [21, 22]. DDGAN uses significantly larger step sizes, thereby reducing the number of time steps in the reverse process. As a result, the reverse-process distribution becomes a non-Gaussian multimodal distribution, which is effectively approximated through an adversarial learning framework.

2.2 Few-shot image synthesis

GAN-based models have great potential for many real-world applications such as image generation, image processing, and super-resolution. However, the vast amount of training data they require limits GAN-based models in applications with only small image sets. In real-life scenarios, the data available to train a GAN can be minimal, for example medical datasets, a particular celebrity's portraits, or a specific artist's artworks.

The scarcity of training samples exacerbates mode collapse in the generator, presenting a significant challenge. Furthermore, discriminator overfitting to the limited data impedes the provision of meaningful gradients to the generator, thereby amplifying mode collapse [23].

Therefore, previous works mitigate discriminator overfitting with various objectives [24, 25], gradient regularizers [26], model weight normalization [27], transfer learning [28], and data augmentation methods [24, 25] that regularize the discriminator.

These prior approaches require extensive hyperparameter tuning and have been validated predominantly on low-resolution images. Additionally, transfer learning approaches rely heavily on large-scale medical datasets and pre-trained weights, and data augmentation methods demand domain-specific customization based on image characteristics and anatomical regions.

While prior studies have demonstrated the efficacy of self-supervised learning-based discriminators in low-data regimes, their application to image-based tasks has been limited. These limitations manifest in two primary ways: the inadequacy of existing self-supervised tasks for image processing, and their restricted application to low-resolution image datasets. This creates a significant gap in the literature, particularly for high-resolution image processing applications  [29, 30].

The self-supervised discriminator [31, 32] enhances GAN training stability and image quality in few-shot scenarios by incorporating an auto-encoding task, which enables comprehensive feature extraction that captures both global composition and local textures. This approach prevents discriminator overfitting by ensuring the extraction of detailed feature maps capable of input image reconstruction, thereby providing meaningful gradients to the generator.

2.3 Neural Schrödinger Bridges

The Schrödinger bridge (SB) problem seeks the optimal transport trajectory from an arbitrary source distribution to a target distribution, and it is one of the recently studied methodologies in diffusion-model-based image-to-image translation [33, 34, 35], which is otherwise limited by the Gaussian assumption [14]. UNSB [14] solves the SB problem based on entropy regularization and achieves high performance without additional computation, unlike existing SB approaches.

The static formulation of the intermediate samples $\{x_t\}$ can be described by a Gaussian distribution [14, 23]:

$p(x_t \mid x_A, x_B) = \mathcal{N}\big(x_t \mid t\,x_B + (1-t)\,x_A,\ t(1-t)\,\tau\,\mathbf{I}\big)$   (1)

The stochastic control formulation [36] establishes two important properties of $\{x_t\}$: first, $\{x_t\}$ forms a Markov chain; second, $\tau$ controls the randomness of $\{x_t\}$ [36].
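As a minimal illustration of Eq. (1), the sketch below draws an intermediate sample $x_t$ by interpolating between $x_A$ and $x_B$ and adding bridge noise whose variance vanishes at $t = 0$ and $t = 1$; the helper name is ours.

```python
import torch

def sample_bridge_state(x_A, x_B, t, tau):
    """Sample x_t ~ N(t*x_B + (1-t)*x_A, t*(1-t)*tau*I) as in Eq. (1)."""
    mean = t * x_B + (1.0 - t) * x_A
    std = (t * (1.0 - t) * tau) ** 0.5
    return mean + std * torch.randn_like(x_A)
```

Larger values of $\tau$ spread the intermediate states further from the straight interpolation path, which is the randomness referred to above.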

Therefore, the distribution of the intermediate sample $x_t$ given $x_A$ is as follows:

$p(x_t \mid x_A) = \int p(x_t \mid x_A, x_B)\, dQ^{SB}_{B|A}(x_B \mid x_A)$   (2)

$Q^{SB}_{B|A}$ denotes the conditional distribution of $x_B$ given $x_A$, i.e., the optimal trajectory from $x_A$ to $x_B$. The conditional probability measure $Q^{SB}_{B|A}$ is obtained from $Q^{SB}_{BA}$ by applying Bayes' rule, where $Q^{SB}_{BA}$ is the solution of an entropy-regularized optimal transport problem.

UNSB shows that the SB can be represented as a composition of adversarial learning and a Markov chain.

$p(\{x_{t_n}\}) = p(x_{t_N} \mid x_{t_{N-1}})\, p(x_{t_{N-1}} \mid x_{t_{N-2}}) \cdots p(x_{t_1} \mid x_A)\, p(x_A)$   (3)

According to Eq. 3, we can learn the intermediate sampling outputs once $x_A$ is given. Since every intermediate step can be sampled, we can finally sample the predicted output $x_B$.

$q_{\phi_i}(x_{t_i}, x_B) := q_{\phi_i}(x_B \mid x_{t_i})\, p(x_{t_i})$   (4)
$q_{\phi_i}(x_B) := \mathbb{E}_{p(x_{t_i})}\big[q_{\phi_i}(x_B \mid x_{t_i})\big]$   (5)
$\min_{\phi_i} \mathcal{L}_{SB}(\phi_i, t_i) := \mathbb{E}_{q_{\phi_i}(x_{t_i}, x_B)}\big[\|x_{t_i} - x_B\|^2\big] - 2\tau(1 - t_i)\, H\big(q_{\phi_i}(x_{t_i}, x_B)\big)$   (6)
$\text{s.t.}\ \ \mathcal{L}_{Adv}(\phi_i, t_i) := D_{KL}\big(q_{\phi_i}(x_B) \,\|\, p(x_B)\big) = 0$   (7)

$q_{\phi_i}$ is a generator that maps intermediate samples to the predicted $x_B$ and is parameterized by a neural network. Eq. 7 shows that $\mathcal{L}_{Adv}$ serves as a crucial learning condition for the SB. UNSB proposes an enhanced discriminator architecture, which is particularly justified given the constraints of finite sampling and the curse of dimensionality encountered in mid-stage sampling. This theoretical framework provides mathematical support for our discriminator-centric approach under extremely sparse data conditions. The alignment between UNSB's theoretical foundations and our proposed solution offers compelling evidence for the effectiveness of focusing computational resources on discriminator optimization in low-data scenarios.

UNSB consists of two stages: generation and training. In the generation stage, to produce the intermediate sample $x_t$, the input image, the output of the network, and Gaussian noise with a predefined variance are combined to generate the image for the next stage. The training stage uses an adversarial loss and a patchNCE loss [37] to guide the output of the network toward the target modality while preserving content details. Unlike traditional diffusion models, the input image is required at every time step.

Despite its innovative approach, direct application of the UNSB framework presents several limitations. While UNSB employs unpaired learning to generate subsequent time step images solely through network input and output without any target modality information, this approach potentially compromises critical lesion information present in the source image. To address this limitation, we incorporate paired target modality information in both generation and training processes. This paired learning paradigm enables the utilization of reconstruction loss [38] and facilitates the integration of additional prior information, such as segmentation masks and intensity-related constraints [39].


Figure 1: An overview of the proposed framework for medical image translation, which consists of two consecutive stages: the generation (sampling) stage and the training stage. The training stage conducts adversarial learning, the Schrödinger bridge loss, etc. The generation stage samples the intermediate results $x_t$ except at $t = 0$.

3 Proposed Method

3.1 Training stage

For training our framework, we sample the time step $T$; in our experiments, $T \in [0, 4]$. To calculate our loss function $\mathcal{L}_{FGSB}$, we sample the source image $x_A$ from $dataset_A$ and the paired target image $x_B$.

We use a Markovian discriminator to assess the domain change from the source image to the target image at the patch level. The adversarial loss is adopted for the generator $G$ and the discriminator $D$. $x_t$ is computed from $x_B$, $\hat{x}_{B_t}$, and Gaussian noise with a predefined variance.

$\min_{G} \mathcal{L}_{adv} = \mathbb{E}_{x_A \sim dataset_A}\big[(D(G(x_t)) - 1)^2\big]$   (8)
$\min_{D} \mathcal{L}_{Adv} = \mathbb{E}_{G(x_t)}\big[(D(G(x_t)) - 0)^2\big] + \mathbb{E}_{x_B \sim dataset_B}\big[(D(x_B) - 1)^2\big]$   (9)
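A minimal PyTorch sketch of the least-squares adversarial terms in Eqs. (8) and (9); $D$ is assumed to be the patch-level (Markovian) discriminator returning a map of scores, and the helper names are ours.

```python
import torch

def generator_adv_loss(D, x_fake):
    """Eq. (8): the generator pushes D's scores on its outputs toward 1."""
    return ((D(x_fake) - 1.0) ** 2).mean()

def discriminator_adv_loss(D, x_fake, x_real):
    """Eq. (9): D pushes fake scores toward 0 and real scores toward 1."""
    loss_fake = (D(x_fake.detach()) ** 2).mean()
    loss_real = ((D(x_real) - 1.0) ** 2).mean()
    return loss_fake + loss_real
```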

To preserve anatomical structure information, we use an L1 loss and a patchNCE loss:

$\min_{G} \mathcal{L}_{L1} = \mathbb{E}_{x_t}\big[(\hat{x}_{B_t} - x_B)^2\big]$   (10)
$\min_{G} \mathcal{L}_{patchNCE} = \mathbb{E}_{x_t}\big[F(\hat{x}_{B_t}, x_A)\big]$   (11)
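The term $F(\hat{x}_{B_t}, x_A)$ in Eq. (11) is the patchNCE loss of [37]; the simplified single-layer sketch below assumes feat_q and feat_k are encoder features sampled at the same N spatial locations of the generated and source images (the full loss aggregates several encoder layers).

```python
import torch
import torch.nn.functional as F

def patch_nce_loss(feat_q, feat_k, temperature=0.07):
    """Simplified patch-wise InfoNCE: matching locations are positives,
    all other sampled locations act as negatives."""
    feat_q = F.normalize(feat_q, dim=1)            # (N, C) query patches
    feat_k = F.normalize(feat_k, dim=1).detach()   # (N, C) key patches
    logits = feat_q @ feat_k.t() / temperature     # (N, N) similarity matrix
    targets = torch.arange(feat_q.size(0), device=feat_q.device)
    return F.cross_entropy(logits, targets)        # positives on the diagonal
```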

We estimate the mutual information loss $\mathcal{L}_{SB}$ using a patch-wise mutual information estimator [40]. The mutual information estimator and the generator are jointly optimized through an adversarial training scheme, following a min-max optimization paradigm analogous to the adversarial loss formulation. The mutual information loss acts as a regularizer that constrains the intermediate states to adhere to the optimal transport trajectory toward the target domain, ensuring semantic consistency throughout the generation stage. The mutual information estimator loss is defined as:

$\min_{E} \mathcal{L}_{SB} = \mathbb{E}_{x_B \sim dataset_B}\big[-E(x_t, x_B)\big]$   (12)

When aggregated with $x_{prior}$, it becomes:

$\min_{E} \mathcal{L}_{SB} = \mathbb{E}_{x_B \sim dataset_B}\big[-E(x_t, x_B, x_B \odot x_{prior})\big]$   (13)

The generator's mutual information loss is defined as:

$\min_{G} \mathcal{L}_{SB} = \mathbb{E}_{G(x_t)}\big[-E(x_t, G(x_t))\big]$   (14)
$\min_{G} \mathcal{L}_{SB} = \mathbb{E}_{G(x_t)}\big[-E(x_t, G(x_t), G(x_t) \odot x_{prior})\big]$   (15)
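The sketch below writes Eqs. (12) and (14) directly as score-based objectives; PairCritic is a hypothetical stand-in for the patch-wise mutual information estimator $E$ of [40], whose full adversarial training scheme follows that work.

```python
import torch
import torch.nn as nn

class PairCritic(nn.Module):
    """Hypothetical stand-in for the estimator E: scores how well an
    intermediate sample x_t matches a candidate target image."""
    def __init__(self, channels=1, width=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2 * channels, width, 4, 2, 1), nn.LeakyReLU(0.2),
            nn.Conv2d(width, 1, 4, 2, 1),
        )

    def forward(self, x_t, x_target):
        return self.net(torch.cat([x_t, x_target], dim=1)).mean(dim=(1, 2, 3))

def estimator_sb_loss(E, x_t, x_B):
    """Eq. (12): raise the score of intermediate / real-target pairs."""
    return (-E(x_t, x_B)).mean()

def generator_sb_loss(E, x_t, x_fake):
    """Eq. (14): make the generated target score highly against x_t,
    keeping intermediate states on the trajectory toward the target domain."""
    return (-E(x_t, x_fake)).mean()
```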

Then, the final loss $\mathcal{L}_{FGSB}$ consists of $\mathcal{L}_{adv}$, $\mathcal{L}_{SB}$, $\mathcal{L}_{L1}$, and $\mathcal{L}_{patchNCE}$ with weight parameters:

$\min_{G} \mathcal{L}_{FGSB} := \mathcal{L}_{adv} + \lambda_{SB}\mathcal{L}_{SB} + \lambda_{L1}\mathcal{L}_{L1} + \lambda_{patchNCE}\mathcal{L}_{patchNCE}$   (16)

To aggregate additional information such as an intensity prior or a segmentation map, we use a context-preserving loss and a weighted patchNCE loss. In our experiments, the binary prior map $x_{prior}$ is defined as 1 where the intensity meets a predefined threshold condition, and 0 otherwise.

$\min_{G} \mathcal{L}_{cpl} = \mathbb{E}_{x_t \odot x_{prior}}\big[x_{prior} \odot (\hat{x}_{B_t} - x_B)^2\big]$   (17)

The patchNCE loss requires sampling both positive and negative patches for contrastive learning. Therefore, we first use $x_{prior}$ to identify and select critical patches for the sampling procedure.

$\min_{G} \mathcal{L}_{wpN} = \mathbb{E}_{x_t}\big[F(\hat{x}_{B_t}, x_A, x_{prior})\big]$   (18)

Except for the IXI experiment, we use the combined loss with the context-preserving loss and the weighted patchNCE loss:

$\min_{G} \mathcal{L}_{FGSB} := \mathcal{L}_{Adv} + \lambda_{SB}\mathcal{L}_{SB} + \lambda_{L1}\mathcal{L}_{L1} + \lambda_{NCE}\mathcal{L}_{NCE} + \lambda_{cpl}\mathcal{L}_{cpl} + \lambda_{wpN}\mathcal{L}_{wpN}$   (19)

While single-step generative adversarial networks (GANs) have demonstrated promising performance, a single generation pass can still yield suboptimal results.

The identity loss serves to translate the suboptimal intermediate results of the refinement process as far as possible into the target domain. It comprises an L1 loss and a patchNCE loss computed between the generator output and $x_B$.


Figure 2: The workflow of the identity loss, which is similar to the generation stage described above.

The final loss $\mathcal{L}_{FGSB}$ is as follows:

$\min_{G} \mathcal{L}_{FGSB} := \mathcal{L}_{Adv} + \lambda_{SB}\mathcal{L}_{SB} + \lambda_{L1}\mathcal{L}_{L1} + \lambda_{NCE}\mathcal{L}_{NCE} + \lambda_{cpl}\mathcal{L}_{cpl} + \lambda_{wpN}\mathcal{L}_{wpN} + \lambda_{idt}\mathcal{L}_{idt}$   (20)
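A small sketch of how the weighted terms of Eq. (20) could be aggregated in code; the dictionary keys and helper name are ours, and the default weights follow Sec. 4.4 under the assumption that the weighted reconstruction weight of 10.0 applies to the context-preserving term.

```python
def fgsb_total_loss(terms, weights=None):
    """Aggregate L_FGSB = L_adv + sum_k lambda_k * L_k as in Eq. (20).
    `terms` maps loss names ('adv', 'SB', 'L1', 'NCE', 'cpl', 'wpN', 'idt')
    to scalar tensors already computed for the current step."""
    defaults = {"SB": 1.0, "L1": 100.0, "NCE": 1.0, "cpl": 10.0, "wpN": 1.0, "idt": 1.0}
    weights = {**defaults, **(weights or {})}
    total = terms["adv"]
    for name, lam in weights.items():
        if name in terms:
            total = total + lam * terms[name]
    return total
```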

3.2 Generation stage

We now describe the sampling procedure, i.e., the generation stage for the intermediate and final generations. When $t = 0$, the source image $x_A$ is used directly as the initial network input $x_0$, since it originates from the source domain.

The predicted target image generated from $x_0$ serves as the input for the subsequent time step. During this process, $x_B$ is utilized to incorporate information from the original target domain.

During inference, the sampling procedure employs only Gaussian noise, since $x_B$ is unavailable and only the refinement of the predicted target image is required.
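Because the fusion rule for $x_t$ is not printed explicitly, the following sketch is only one plausible reading of this stage: it reuses the bridge interpolation of Eq. (1), mixing the running prediction with the paired $x_B$ during training and relying on Gaussian noise alone at inference; the time-conditioned generator signature G(x, t) is assumed.

```python
import torch

@torch.no_grad()
def generate(G, x_A, n_steps=5, tau=0.01, x_B=None):
    """Sketch of the generation (sampling) stage: start from the source image,
    predict the target, then fuse prediction, paired target (if available)
    and bridge noise to form the input of the next time step."""
    x_t = x_A                                    # t = 0: source image is the input
    for i in range(n_steps):
        t = i / max(n_steps - 1, 1)              # normalized time in [0, 1]
        x_hat_B = G(x_t, t)                      # predicted target at this step
        if i == n_steps - 1:
            return x_hat_B                       # final refined prediction
        noise = ((t * (1 - t) * tau) ** 0.5) * torch.randn_like(x_A)
        if x_B is not None:                      # training: reference target domain
            x_t = t * x_B + (1 - t) * x_hat_B + noise
        else:                                    # inference: noise-only refinement
            x_t = x_hat_B + noise
    return x_hat_B
```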


Figure 3: Illustration of the optimal transport trajectory. Based on the results of many single-step GANs, a single sampling step already produces results very similar to the target image. To improve the generation quality and preserve crucial features, the result must be refined iteratively and, unlike unpaired learning, it continuously references information from the target domain.

3.3 Self-Supervised Discriminator

Our approach uses the discriminator to map generated images to the target image domain. In scenarios where data scarcity is a critical constraint, numerous previous studies have focused on applying appropriate regularization techniques to the discriminator. Our approach offers a distinct advantage in that it eliminates the need for separate pre-training procedures or elaborate data augmentation strategies, thereby providing a more streamlined path to model optimization.

The self-supervised discriminator incorporates auxiliary decoders alongside the discriminator network. The discriminator extracts feature maps at two distinct scales, which are then used by two dedicated decoders for image reconstruction tasks. The crop decoder $Dec_1$, which processes the first feature map, performs partial image reconstruction using cropped feature representations. Simultaneously, the resize decoder $Dec_2$, which processes the subsequent feature map, performs complete image reconstruction. This self-supervised learning paradigm enables the discriminator to develop robust feature representations that capture both the global compositional structure and the fine-grained textural details of the input images.


Figure 4: Illustration of the self-supervised discriminator, which uses the downsampled target image $I'$, the randomly cropped target image $I^{part}$, and two time-conditioned decoders.

The decoders consist of only four time-conditional convolutional layers. The training process involves randomly cropping the target image at arbitrary positions. To maintain spatial correspondence, the feature map is cropped at positions that align with the target image crops, taking the receptive field into account. These cropped feature representations are then fed into the decoder network, which is trained to reconstruct the corresponding cropped image regions, establishing a direct mapping between local features and their spatial reconstructions. The secondary decoder receives the full extracted feature map as input and is trained to reconstruct the complete target image. During this process, the target image is down-sampled to half its original spatial dimensions to optimize computational efficiency while maintaining essential visual information.

The self-supervised discriminator loss is defined as:

$\min_{D} \mathcal{L}_{Adv} = \mathbb{E}_{G(x_t)}\big[(D(G(x_t)) - 0)^2\big] + \mathbb{E}_{x_B \sim dataset_{target}}\big[(D(x_B) - 1)^2\big] + \mathbb{E}_{x_B \sim dataset_{target}}\big[(Dec_1(D(x_B)) - I^{part})^2\big] + \mathbb{E}_{x_B \sim dataset_{target}}\big[(Dec_2(D(x_B)) - I')^2\big]$   (21)
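A sketch of Eq. (21) with the decoders paired to their targets as described above (the crop decoder reconstructs the cropped region $I^{part}$, the resize decoder reconstructs the half-resolution $I'$); the assumption that $D$ returns its patch scores together with the cropped and full feature maps is ours.

```python
import torch
import torch.nn.functional as F

def self_supervised_d_loss(D, dec_crop, dec_resize, x_fake, x_B, crop_box, t):
    """Adversarial + reconstruction objective for the discriminator (Eq. 21)."""
    scores_fake, _, _ = D(x_fake.detach(), t)
    scores_real, feat_crop, feat_full = D(x_B, t)   # feat_crop is assumed aligned with crop_box

    top, left, h, w = crop_box                      # random crop of the target image
    I_part = x_B[..., top:top + h, left:left + w]
    I_small = F.interpolate(x_B, scale_factor=0.5, mode="bilinear", align_corners=False)

    adv = (scores_fake ** 2).mean() + ((scores_real - 1.0) ** 2).mean()
    rec_part = F.mse_loss(dec_crop(feat_crop, t), I_part)     # reconstruct the crop
    rec_full = F.mse_loss(dec_resize(feat_full, t), I_small)  # reconstruct downsampled image
    return adv + rec_part + rec_full
```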

4 Experiments

4.1 Datasets

We use four datasets (IXI (https://brain-development.org/ixi-dataset/), BraTS2020 [21], MICCAI2017 WMH [41], and a private dataset) to evaluate our framework. All datasets were randomly split into non-overlapping training and test sets. The source image in all experiments is T1w, the most basic imaging sequence. T1-weighted imaging serves as an optimal source sequence for synthesis due to its rapid acquisition time and superior anatomical delineation. The clinical relevance of our approach is further supported by T1w's established role as a fundamental sequence in preliminary studies and its widespread use as a reference standard for multimodal imaging protocols [32, 42, 43]. For the BraTS2020 and MICCAI2017 WMH datasets, we performed only skull-stripping [44], since they were already bias-corrected and co-registered. The IXI dataset and the private cohort underwent additional spatial co-registration using FSL [45]. All images were intensity-normalized to the range [-1, 1] and padded to uniform dimensions (224×224 or 256×256) using minimum intensity values.
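A small sketch of the preprocessing described above, i.e., intensity normalization to [-1, 1] followed by padding to a square grid with the minimum intensity; it assumes each slice is no larger than the target size, and the function name is ours.

```python
import numpy as np

def normalize_and_pad(img, out_size=256):
    """Scale a 2D slice to [-1, 1], then pad to out_size x out_size with the minimum value."""
    lo, hi = float(img.min()), float(img.max())
    img = 2.0 * (img - lo) / max(hi - lo, 1e-8) - 1.0
    pad_h, pad_w = out_size - img.shape[0], out_size - img.shape[1]
    pads = ((pad_h // 2, pad_h - pad_h // 2), (pad_w // 2, pad_w - pad_w // 2))
    return np.pad(img, pads, mode="constant", constant_values=img.min())
```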

4.1.1 IXI dataset

We use 1.5T T1w and T2w for our experiment. Detailed scan parameters were TE=4.6ms, TR=9.81ms for T1w; TE=100ms, TR=8178.34ms for T2w. 100 axial slices were extracted and the slices that were mostly background were excluded from the learning process.

4.1.2 MICCAI2017 WMH challenge dataset

We use 3T T1w and FLAIR images for our experiment. Detailed scan parameters were TE=4.5ms, TR=7.9ms for T1w; TE=125ms, TR=11000ms, TI=2800ms for FLAIR. About 35 axial slices were extracted, and slices that were mostly background were excluded from the learning process. To reflect real-world applications, T1w is downsampled to 1.0×1.0×3.0 $mm^3$ and FLAIR is spatially co-registered to the downsampled T1w.

4.1.3 BraTS2020

We use all modalities in our experiment. 70 axial slices were extracted, and slices that were mostly background were excluded from the learning process. We select only low-grade glioma data. Since the dataset was collected from different cohorts, we sampled the data through direct visual inspection as much as possible to minimize bias due to differences in equipment or protocols.

4.1.4 Private dataset

The private cohort dataset was collected from Namwon Hospital, South Korea. We use only 3T T1w and FLAIR. Detailed scan parameters follow the ADNI3 protocol. 50 axial slices were extracted, and slices that were mostly background were excluded from the learning process.

4.2 Evaluation Metric

We employed multiple evaluation metrics to assess the quality of generated images. The fidelity of image reconstruction was measured using Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index (SSIM), and Normalized Root Mean Square Error (NRMSE).

Furthermore, to evaluate the preservation of clinically relevant features, we computed the Dice similarity coefficient and object-wise recall, specifically focusing on the accurate reconstruction of lesion regions. To evaluate Dice and recall, we use U-Net-based models [41, 46].
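For reference, these metrics can be computed with standard library routines as sketched below; the object-wise recall here (fraction of connected ground-truth lesions overlapped by the prediction) is our assumed definition, since the text does not spell it out.

```python
import numpy as np
from scipy import ndimage
from skimage.metrics import (peak_signal_noise_ratio, structural_similarity,
                             normalized_root_mse)

def image_metrics(real, fake):
    """PSNR / SSIM / NRMSE between a real and a synthesized slice."""
    rng = real.max() - real.min()
    return {"psnr": peak_signal_noise_ratio(real, fake, data_range=rng),
            "ssim": structural_similarity(real, fake, data_range=rng),
            "nrmse": normalized_root_mse(real, fake)}

def dice_and_object_recall(gt_mask, pred_mask):
    """Dice over the whole mask, plus the fraction of connected ground-truth
    lesions that are hit by at least one predicted voxel."""
    inter = np.logical_and(gt_mask, pred_mask).sum()
    dice = 2.0 * inter / (gt_mask.sum() + pred_mask.sum() + 1e-8)
    labels, n_lesions = ndimage.label(gt_mask)
    hits = sum(pred_mask[labels == i].any() for i in range(1, n_lesions + 1))
    recall = hits / max(n_lesions, 1)
    return dice, recall
```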

4.2.1 MICCAI2017 WMH challenge dataset

To evaluate the reconstruction of WMH lesions, we use the PGS segmentation method [47]. WMH lesions are segmented using pairs of T1w and FLAIR sequences, or alternatively, using T1w and synthetically generated FLAIR as input modalities. We then compare each segmentation mask with the ground truth.

4.2.2 BraTS2020

To evaluate tumor reconstruction performance, we first train a U-Net using only target images from the training set. Although the BraTS dataset provides four classes including the background, we merge all tumor subclasses into a single class during U-Net training.

The segmentation performance is evaluated by comparing: (1) the segmentation results of real target images in the test set against ground truth, and (2) the segmentation results of generated target images synthesized from source images in the test set against ground truth.

4.2.3 Private dataset

The private dataset lacks WMH lesion annotations. Therefore, we use the PGS segmentation method [47] to generate pseudo-annotations. The U-Net model is then trained only on FLAIR sequences using these pseudo-annotations.

Segmentation performance is evaluated by comparing the predicted segmentation masks from generated target images with the pseudo-annotations produced by the PGS network.

4.3 Comparison Methods

1) Pix2Pix is a paired image-to-image translation method that combines a conditional GAN architecture with pixel-wise reconstruction loss. This approach allows the model to learn direct mappings between source and target domains using paired training data.

2) Syndiff advances this concept by incorporating a DDGAN framework alongside pair-wise reconstruction loss. While the original Syndiff implementation includes CycleGAN components for unpaired learning, we focus on the paired training paradigm for a fair comparison with other methods.

3) MT-Net takes a paired learning approach by using a transformer-based architecture. It first uses masked auto-encoder (MAE) pre-training to build robust feature representations, followed by task-specific fine tuning. This strategy helps to overcome the limitations of traditional paired learning approaches.

4.4 Implementation Details

The adversarial loss $\mathcal{L}_{adv}$ and the patch-wise mutual information loss $\mathcal{L}_{SB}$ are adopted with the Markovian-based architecture, where a neural mutual information estimator [40] is used for estimating $\mathcal{L}_{SB}$. The detailed network architecture and hyperparameters follow the UNSB framework.

The weight for the reconstruction loss is $\lambda_{L1} = 100.0$ and for the weighted reconstruction loss $\lambda_{wL1} = 10.0$; all other weight parameters are 1.0. The weights for the identity loss terms are the same as the above weight parameters. Training hyperparameters were: 50 epochs for 25 subjects, 200 epochs for the other subject counts, and a learning rate of $10^{-4}$. Starting from the 100th epoch, the learning rate decreases linearly. To minimize stochastic fluctuations in the output, we constrain the variance of the random noise variable. We applied horizontal flipping as data augmentation.
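A sketch of the learning-rate schedule described above (constant $10^{-4}$, then linear decay from the 100th epoch); the choice of Adam and its betas is an assumption borrowed from common UNSB-style setups rather than stated here.

```python
import torch

def build_optimizer(model, total_epochs=200, decay_start=100, lr=1e-4):
    """Adam with a constant LR until `decay_start`, then linear decay to zero."""
    opt = torch.optim.Adam(model.parameters(), lr=lr, betas=(0.5, 0.999))

    def lr_lambda(epoch):
        if epoch < decay_start:
            return 1.0
        return max(0.0, 1.0 - (epoch - decay_start) / float(total_epochs - decay_start))

    sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda=lr_lambda)
    return opt, sched
```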

Both our model and the comparison models were implemented in Python with the PyTorch framework. All experiments were conducted on a single NVIDIA RTX 3090 or A6000.

5 Results

5.1 Private cohort

We conduct the FLAIR synthesis task on the private cohort dataset. Our model, FGSB, achieves higher performance than the other models. In particular, FGSB shows the best reconstruction performance for white matter hyperintensities.

Table 1: Quantitative results on the private cohort dataset (number of training subjects in parentheses).
Method          PSNR         SSIM          NRMSE         Dice          Recall
Pix2Pix (10)    25.53±1.42   0.891±0.019   0.166±0.031   0.369±0.238   0.455±0.3
Pix2Pix (172)   26.01±1.33   0.899±0.021   0.157±0.026   0.415±0.261   0.356±0.277
Syndiff (10)    24.3±1.47    0.834±0.018   0.208±0.032   0.051±0.193   0.036±0.09
Syndiff (172)   26.98±1.68   0.922±0.019   0.155±0.035   0.499±0.298   0.402±0.269
MT-Net (10)     25.82±1.18   0.892±0.016   0.183±0.026   0.118±0.13    0.147±0.203
MT-Net (172)    27.86±1.62   0.934±0.015   0.146±0.029   0.435±0.264   0.385±0.281
Ours (2)        27.58±2.5    0.929±0.033   0.136±0.047   0.543±0.274   0.572±0.316
Ours (5)        27.97±2.01   0.926±0.025   0.128±0.035   0.524±0.236   0.585±0.282
Ours (10)       29.25±2.2    0.943±0.021   0.111±0.033   0.613±0.233   0.691±0.28

5.2 IXI dataset

We evaluate the performance of T2-weighted MRI synthesis using the IXI dataset of healthy controls. Our model shows superior performance compared to existing baseline methods. For entries marked with , we report the performance values as presented in the original papers.

Table 2: Quantitative results on the IXI dataset.
Method          PSNR         SSIM           NRMSE
Pix2Pix (25)    25.53±1.42   0.891±0.019    0.166±0.031
Syndiff (172)   26.01±1.33   0.899±0.021    0.157±0.026
Syndiff (25)    24.3±1.47    0.834±0.018    0.208±0.09
MT-Net (172)    26.98±1.68   0.922±0.019    0.155±0.269
MT-Net (25)     25.82±1.18   0.892±0.016    0.183±0.203
Ours (2)        28.81±2.25   0.941±0.028    0.2177±0.071
Ours (5)        30.75±2.66   0.9554±0.024   0.176±0.066
Ours (10)       31.23±2.9    0.958±0.026    0.169±0.071
Ours (25)       32.44±2.9    0.962±0.022    0.146±0.061

5.3 MICCAI2017 WMH challenge dataset

We evaluate the performance of high resolution FLAIR synthesis using the MICCAI2017 WMH challenge dataset. Given the limited number of axial slices per subject, separate subject-wise reduction was not implemented in this study. Entries denoted by indicate MT-Net pre-trained on T1-weighted and FLAIR sequences using a private dataset comprising 172 subjects.

Table 3: Quantitative results on the MICCAI2017 WMH challenge dataset.
Method         PSNR         SSIM           NRMSE          Dice            Recall
Pix2Pix (10)   24.38±1.91   0.891±0.026    0.311±0.092    0.325±0.294     0.311±0.249
Syndiff (10)   25.45±1.95   0.899±0.026    0.278±0.111    0.11245±0.874   0.102±0.11
MT-Net (10)    24.21±1.77   0.878±0.031    0.3201±0.111   0.1002±0.05     0.092±0.039
MT-Net (10)    21.5±1.62    0.798±0.015    0.424±0.029    0.0±0.264       0.0±0.281
Ours (10)      29.58±3.25   0.9466±0.031   0.181±0.088    0.533±0.297     0.5226±0.315
Real           -            -              -              0.549±0.295     0.499±0.296

5.4 BraTS2020

We evaluate the performance of multi-modal MRI synthesis using the BraTS2020 dataset. For entries marked with , we report the performance values as presented in the original papers.

Table 4: Quantitative results on BraTS2020 (FLAIR from T1w).
Method         PSNR         SSIM           NRMSE         Dice
Syndiff (16)   23.14±2.3    0.885±0.021    0.284±0.054   0.287±0.331
Syndiff (25)   26.45±1.01   0.877±1.67     -             -
MT-Net (10)    23.14±1.67   0.885±0.024    0.319±0.1     0.001±0.035
Ours (16)      26.93±3.43   0.9311±0.042   0.198±0.088   0.659±0.379
Ours (3)       26.13±3.67   0.9211±0.041   0.22±0.122    0.628±0.378
Real           -            -              -             0.621±0.398
Table 5: Quantitative results on BraTS2020 (T2w from T1w).
Method         PSNR           SSIM          NRMSE         Dice
Syndiff (16)   23.14±2.3      0.885±0.021   0.284±0.054   0.287±0.331
Syndiff (25)   26.45±1.01     0.877±1.67    -             -
MT-Net (16)    25.25±2.37     0.916±0.023   0.271±0.05    0.471±0.41
MT-Net (16)    23.028±3.183   0.903±0.039   -             -
Ours (16)      29.19±3.54     0.957±0.025   0.179±0.069   0.61±0.401
Ours (3)       26.13±3.42     0.951±0.024   0.192±0.083   0.542±0.382
Real           -              -             -             0.7795±0.322
Table 6: Quantitative results on BraTS2020 (T1ce from T1w).
Method         PSNR          SSIM          NRMSE         Dice
Syndiff (16)   26.22±3.76    0.932±0.017   0.261±0.124   0.381±0.33
MT-Net (16)    25.74±3.32    0.922±0.017   0.273±0.122   0.287±0.3
MT-Net (25)    23.415±2.22   0.913±0.034   -             -
Ours (16)      28.43±3.72    0.931±0.022   0.204±0.11    0.351±0.325
Ours (3)       26.55±4.23    0.934±0.018   0.244±0.09    0.343±0.318
Real           -             -             -             0.357±0.326

6 Conclusion

Our proposed FGSB framework demonstrates robust training capabilities, achieving reliable image generation quality and preserving critical features such as lesions, despite limited data availability and without requiring separate pre-training procedures. The framework shows potential for broader applications in multimodal medical imaging, particularly in classification and segmentation tasks. Furthermore, the methodology shows promise for extension to various medical imaging domains. However, certain limitations remain, in particular the constraints inherent in paired training approaches and the one-to-one mapping restriction between source and target images.

References

  • [1] N. D. Prins and P. Scheltens, “White matter hyperintensities, cognitive impairment and dementia: an update,” Nature Reviews Neurology, vol. 11, no. 3, pp. 157–165, Feb 2015.
  • [2] O. Commowick, F. Cervenansky, and R. Ameli, “Msseg challenge proceedings: Multiple sclerosis lesions segmentation challenge using a data management and processing infrastructure,” in International Conference on Medical Image Computing and Computer-Assisted Intervention, 2016.
  • [3] Y. Luo et al., “Edge-preserving mri image synthesis via adversarial network with iterative multi-scale fusion,” Neurocomputing, vol. 452, pp. 63–77, 2021.
  • [4] B. Cao, H. Cao, J. Liu, P. Zhu, C. Zhang, and Q. Hu, “Autoencoder-based collaborative attention gan for multi-modal image synthesis,” IEEE Transactions on Multimedia, vol. 26, pp. 995–1010, May 2023.
  • [5] M. Özbey et al., “Unsupervised medical image translation with adversarial diffusion models,” IEEE Transactions on Medical Imaging, vol. 42, no. 12, pp. 3524–3539, 2023.
  • [6] L. Kong, C. Lian, D. Huang, Z. Li, Y. Hu, and Q. Zhou, “Breaking the dilemma of medical image-to-image translation,” in Neural Information Processing Systems, 2021.
  • [7] Y. Li, T. Zhou, K. He, Y. Zhou, and D. Shen, “Multi-scale transformer network with edge-aware pre-training for cross-modality mr image synthesis,” IEEE Transactions on Medical Imaging, vol. 42, no. 11, pp. 3395–3407, 2023.
  • [8] K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick, “Masked autoencoders are scalable vision learners,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 16000–16009.
  • [9] A. Dosovitskiy et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” in International Conference on Learning Representations, 2021.
  • [10] K. Kunanbayev, V. Shen, and D.-S. Kim, “Training ViT with Limited Data for Alzheimer’s Disease Classification: an Empirical Study,” in Proceedings of Medical Image Computing and Computer Assisted Intervention – MICCAI 2024, vol. LNCS 15012. Springer Nature Switzerland, October 2024.
  • [11] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in 2017 IEEE International Conference on Computer Vision (ICCV), 2017, pp. 2242–2251.
  • [12] M. Yurt, S. U. H. Dar, A. Erdem, E. Erdem, K. K. Oguz, and T. Çukur, “mustgan: multi-stream generative adversarial networks for mr image synthesis,” Medical Image Analysis, vol. 70, p. 101944, 2021.
  • [13] Y. Choi and S. Lee, “Ct synthesis using cyclegan with swin transformer for magnetic resonance imaging guided radiotherapy,” in Medical Imaging 2024: Physics of Medical Imaging, vol. 12925, 2024, pp. 825–829.
  • [14] B. Kim, G. Kwon, K. Kim, and J. C. Ye, “Unpaired image-to-image translation via neural schrödinger bridge,” in The Twelfth International Conference on Learning Representations, 2024.
  • [15] Z. Wang et al., “Mutual information guided diffusion for zero-shot cross-modality medical image translation,” IEEE Transactions on Medical Imaging, vol. 43, no. 8, pp. 2825–2838, 2024.
  • [16] J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” in Advances in Neural Information Processing Systems, vol. 33, 2020, pp. 6840–6851.
  • [17] Z. Xiao, K. Kreis, and A. Vahdat, “Tackling the generative learning trilemma with denoising diffusion gans,” in International Conference on Learning Representations, 2022.
  • [18] A. Jalal, M. Arvinte, G. Daras, E. Price, A. G. Dimakis, and J. Tamir, “Robust compressed sensing mri with deep generative priors,” in Advances in Neural Information Processing Systems, vol. 34, 2021, pp. 14938–14954.
  • [19] X. Meng et al., “A novel unified conditional score-based generative framework for multi-modal medical image completion,” arXiv [eess.IV], 2022.
  • [20] L. Jiang, Y. Mao, X. Wang, X. Chen, and C. Li, “Cola-diff: Conditional latent diffusion model for multi-modal mri synthesis,” in Medical Image Computing and Computer Assisted Intervention – MICCAI 2023, 2023, pp. 398–408.
  • [21] J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli, “Deep unsupervised learning using nonequilibrium thermodynamics,” in Proceedings of the 32nd International Conference on Machine Learning, vol. 37, 2015, pp. 2256–2265.
  • [22] W. Feller, On the Theory of Stochastic Processes, with Particular Reference to Applications, Jan. 2015, p. 769–798.
  • [23] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville, “Improved training of wasserstein gans,” in Advances in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds., vol. 30.   Curran Associates, Inc., 2017.
  • [24] S. Zhao, Z. Liu, J. Lin, J. Zhu, and S. Han, “Differentiable augmentation for data-efficient gan training,” in Conference on Neural Information Processing Systems (NeurIPS), 2020.
  • [25] T. Karras, S. Laine, M. Aittala, J. Hellsten, J. Lehtinen, and T. Aila, “Analyzing and improving the image quality of stylegan,” in 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 8107–8116.
  • [26] L. Mescheder, A. Geiger, and S. Nowozin, “Which training methods for gans do actually converge?” in Proceedings of the 35th International Conference on Machine Learning, 2018, pp. 3481–3490.
  • [27] T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida, “Spectral normalization for generative adversarial networks,” in International Conference on Learning Representations, 2018.
  • [28] S. Mo, M. Cho, and J. Shin, “Freeze the discriminator: a simple baseline for fine-tuning gans,” in CVPR AI for Content Creation Workshop, 2020.
  • [29] N.-T. Tran, V.-H. Tran, B.-N. Nguyen, L. Yang, and N.-M. Cheung, “Self-supervised gan: Analysis and improvement with multi-class minimax game,” in Neural Information Processing Systems, vol. 32, 2019, pp. 13232–13243.
  • [30] T. Chen, X. Zhai, M. Ritter, M. Lucic, and N. Houlsby, “Self-supervised gans via auxiliary rotation loss,” in 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 12146–12155.
  • [31] B. Liu, Y. Zhu, K. Song, and A. Elgammal, “Towards faster and stabilized gan training for high-fidelity few-shot image synthesis,” in International Conference on Learning Representations, 2021.
  • [32] A. Bourou, K. Daupin, V. Dubreuil, A. De Thonel, V. Mezger-Lallemand, and A. Genovesio, “Unpaired image-to-image translation with limited data to reveal subtle phenotypes,” in 2022 IEEE 19th International Symposium on Biomedical Imaging (ISBI), 2023, pp. 1–5.
  • [33] B. Li, K. Xue, B. Liu, and Y.-K. Lai, “Bbdm: Image-to-image translation with brownian bridge diffusion models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 1952–1961.
  • [34] X. Su, J. Song, C. Meng, and S. Ermon, “Dual diffusion implicit bridges for image-to-image translation,” in The Eleventh International Conference on Learning Representations, 2023.
  • [35] G.-H. Liu, A. Vahdat, D.-A. Huang, E. A. Theodorou, W. Nie, and A. Anandkumar, “I2sb: Image-to-image schrödinger bridge,” arXiv preprint arXiv:2302.05872, 2023.
  • [36] P. D. Pra, “A stochastic control approach to reciprocal diffusion processes,” Applied Mathematics & Optimization, vol. 23, no. 1, p. 313–329, Jan. 1991. [Online]. Available: https://doi.org/10.1007/bf01442404
  • [37] T. Park, A. A. Efros, R. Zhang, and J.-Y. Zhu, “Contrastive learning for conditional image synthesis,” in ECCV, 2020.
  • [38] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-image translation with conditional adversarial networks,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 5967–5976.
  • [39] S. Mo, M. Cho, and J. Shin, “Instance-aware image-to-image translation,” in International Conference on Learning Representations, 2019.
  • [40] M. I. Belghazi et al., “Mutual information neural estimation,” in Proceedings of the 35th International Conference on Machine Learning, vol. 80, 2018, pp. 531–540.
  • [41] H. J. Kuijf et al., “Standardized assessment of automatic segmentation of white matter hyperintensities and results of the wmh segmentation challenge,” IEEE Transactions on Medical Imaging, vol. 38, no. 11, pp. 2556–2568, 2019.
  • [42] G. Bhalerao et al., “Automated quality control of t1-weighted brain mri scans for clinical research: methods comparison and design of a quality prediction classifier,” medRxiv, 2024.
  • [43] D. Kawahara and Y. Nagata, “T1-weighted and t2-weighted mri image synthesis with convolutional generative adversarial networks,” Reports of Practical Oncology and Radiotherapy, vol. 26, no. 1, pp. 35–42, 2021.
  • [44] F. Isensee et al., “Automated brain extraction of multisequence mri using artificial neural networks,” Human Brain Mapping, vol. 40, no. 17, pp. 4952–4964, 2019.
  • [45] M. Jenkinson and S. Smith, “A global optimisation method for robust affine registration of brain images,” Medical Image Analysis, vol. 5, no. 2, pp. 143–156, 2001.
  • [46] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” CoRR, vol. abs/1505.04597, 2015.
  • [47] G. Park, J. Hong, B. A. Duffy, J.-M. Lee, and H. Kim, “White matter hyperintensities segmentation using the ensemble u-net with multi-scale highlighting foregrounds,” NeuroImage, vol. 237, p. 118140, 2021.