Can segmentation models be trained with fully synthetically generated data?
Abstract
In order to achieve good performance and generalisability, medical image segmentation models should be trained on sizeable datasets with sufficient variability. Due to ethics and governance restrictions, and the costs associated with labelling data, scientific development is often stifled, with models trained and tested on limited data. Data augmentation is often used to artificially increase the variability in the data distribution and improve model generalisability. Recent works have explored deep generative models for image synthesis, as such an approach would enable the generation of an effectively infinite amount of varied data, addressing the generalisability and data access problems. However, many proposed solutions limit the user’s control over what is generated. In this work, we propose brainSPADE, a model which combines a synthetic diffusion-based label generator with a semantic image generator. Our model can produce fully synthetic brain labels on-demand, with or without pathology of interest, and then generate a corresponding MRI image in an arbitrary, user-guided style. Experiments show that brainSPADE synthetic data can be used to train segmentation models with performance comparable to that of models trained on real data.
1 Background
In recent years, there has been a growing interest in applying deep learning models to medical image segmentation. Indeed, Convolutional Neural Networks are a good surrogate for manual segmentation [29], which is time-consuming and requires anatomical and radiological expertise. However, deep learning models typically require large and heterogeneous training datasets to achieve good and generalisable results [16]. Yet, access to sizeable medical imaging datasets is limited. Not only do they require specialised and costly equipment to acquire, but they are also subject to strict regulations, reduced accessibility, and complex maintenance in terms of data curation [27]. Even when these datasets are accessible, labels are often scarce and task-specific, increasing the domain shift between datasets. A model which can generate images and associated labels with arbitrary contrasts and pathologies would democratise medical image segmentation research and improve model accuracy and generalisability.
Brain magnetic resonance imaging (MRI) datasets are heterogeneous as they tend to arise from a diversity of image acquisition protocols, and are partially labelled. As certain pathologies tend to be more perceptible in some MRI contrasts than others, different acquisition protocols are often followed depending on the nature of the study [1]. Furthermore, there is a significant lack of label consistency across datasets: namely, the annotated regions in any given dataset will be tailored to the study for which they were acquired [4].
To address the lack of comprehensive labelled data for many brain MRI segmentation tasks, domain adaptation (DA) and multi-task learning techniques can be used to create models that are robust to small or incomplete datasets [9, 15]. Another approach is to augment the data by applying simple transformations to individual images or by modelling the data distribution along relevant directions of variability [2]. Generative modelling, typically using unsupervised learning methods [12], yields a representation of the input data distribution that does not require the user to have substantial prior knowledge about it. Some of these generative models are stochastic, allowing for continuous sampling of varied data, making them suitable for data augmentation. Such is the case for Generative Adversarial Networks (GANs) [12] and Variational Auto-Encoders (VAEs) [19]. In addition, some generative models allow for conditioning [24, 25, 33, 6, 34, 31], opening the door to models that provide data as a function of the user’s query.
Conditional generative models have been previously applied to augment data for brain MRI segmentation tasks [26], but they require non-synthetic input segmentation maps. To address this, we propose brainSPADE, a fully synthetic model of the neurotypical and diseased human brain, capable of generating unlimited paired data samples to train models for the segmentation of healthy regions and pathologies. Our model, brainSPADE, comprises two sub-models: 1) a synthetic label generator; and 2) a semantic image generator conditioned on the labels produced by the label generator. Our image generator gives the user independent control over the content and the contrast of the output images. We show that segmentation models trained on the fully synthetic data produced by the proposed generative model not only generalise well to real data but also generalise to out-of-domain distributions.
2 Materials and Methods
2.1 Materials
Data: We trained our models on T1-weighted and FLAIR MRI images from several datasets, which all have been aligned to the MNI152-T1 template:
- Training of the label generator: We used quality-controlled, semi-automatically generated labels from a subset of 200 patients from the Southall and Brent Revisited cohort (SABRE) [18], and from a set of 128 patients from the Brain Tumour Segmentation Challenge (BRATS) [23]. Healthy labels were obtained using Geodesic Information Flows (GIF) [5], and are in the form of partial volume (PV) maps of five anatomical regions: cerebrospinal fluid (CSF), white matter (WM), grey matter (GM), deep grey matter and brainstem. Tumour labels were provided with BRATS.
- Training of the image generator: We used the images and associated labels from the same SABRE and BRATS datasets, plus a subset of 38 volumes from the Alzheimer’s Disease Neuroimaging Initiative 2 (ADNI2) [3].
- Validation experiments: Hold-out sets from the SABRE and BRATS datasets were used for validation, plus a subset of 30 FLAIR volumes from the Open Access Series of Imaging Studies (OASIS) [20] and 34 T1 volumes from the Autism Brain Imaging Data Exchange (ABIDE) dataset [21]. The labelling mechanism for all datasets was the same as for the training process [5], except for the BRATS tumour labels, for which we used the challenge’s manual segmentations, and the ABIDE dataset, for which the CSF, GM and WM labels were generated with SPM12 (version r7771, running on MATLAB R2019a).
Slicing process: The proposed model works in 2D, on 192×256 slices. For the healthy label generator, 7008 random label slices were sampled from SABRE, and for the lesion label generator, 8636 label slices (two thirds containing at least 20 tumour pixels, one third containing none) were sampled from BRATS. For the image generator, 2765 random label slices and their equivalent multi-contrast images were sampled from SABRE, ADNI2, and BRATS. Only slices containing at least 10% brain pixels were considered, excluding the uppermost and lowermost slices.
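As an illustration of this slice-selection criterion, a minimal sketch is given below; the function name, array layout and threshold handling are assumptions made for illustration, not the released implementation.

```python
import numpy as np

def select_slices(brain_mask, min_brain_fraction=0.10):
    """Return axial slice indices whose brain coverage meets the threshold.

    brain_mask: (H, W, D) binary mask of brain voxels for a volume
                registered to MNI152 (192 x 256 in-plane).
    """
    kept = []
    for z in range(brain_mask.shape[2]):
        brain_fraction = brain_mask[:, :, z].mean()   # fraction of brain pixels in the slice
        if brain_fraction >= min_brain_fraction:      # discards the upper- and lowermost slices
            kept.append(z)
    return kept
```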
The code for this work was written in PyTorch (1.10.2) and will be released upon publication. The networks were trained using an NVIDIA Quadro RTX 8000 GPU and an NVIDIA DGX SuperPOD cluster.
2.2 Methods
The full model, comprising label and image generators, is depicted in Fig. 1.

Label Generator:
Segmentations contain the morphological characteristics of the patient, thus constituting Protected Health Information (PHI) [32] and requiring patient consent for sharing [11]. The development of a generative model of segmentations can, however, mitigate these concerns. Segmentations are rich in phenotype information, but they lack local textures, making them challenging for standard generative models such as GANs, as the absence of texture aggravates their already considerable training instability.
To address the intrinsic limitations of GANs for label map generation, we chose to apply state-of-the-art latent diffusion models (LDMs) [28, 14], generative models that sample noise from a Gaussian distribution and denoise it via a Markov chain process. Coupled with a VAE, LDMs become efficient and reliable generative models by performing the denoising process in the latent space. Based on [28], we first train a spatial VAE with two downsamplings, optimising the loss
$\mathcal{L}_{VAE} = \mathcal{L}_{L1}(x, \hat{x}) + \lambda_{KLD}\,\mathcal{L}_{KLD} + \lambda_{perc}\,\mathcal{L}_{perc}(x, \hat{x}) + \lambda_{adv}\,\mathcal{L}_{adv}(\hat{x})$,
where $x$ is the input ground-truth segmentation map, $\hat{x}$ is the reconstructed probabilistic partial volume segmentation map, $\mathcal{L}_{KLD}$ is the Kullback-Leibler divergence (KLD) [19], $\mathcal{L}_{perc}$ is a perceptual loss, $\mathcal{L}_{adv}$ is an adversarial loss computed using $D$, a patch-based discriminator based on [10], and $\mathcal{L}_{L1}$ is the L1-norm reconstruction term.
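A minimal sketch of how these four terms could be combined in PyTorch is shown below; the weight values, the perceptual-loss and discriminator modules, and the hinge-style generator term are placeholder assumptions rather than the trained configuration.

```python
import torch
import torch.nn.functional as F

def vae_loss(x, x_hat, mu, logvar, perceptual_loss, discriminator,
             w_kld=1e-6, w_perc=1.0, w_adv=0.5):
    """Sketch of the spatial VAE objective: L1 + KLD + perceptual + adversarial terms.

    x, x_hat: ground-truth and reconstructed partial-volume label maps.
    mu, logvar: parameters of the approximate posterior (for the KLD term).
    """
    l1 = F.l1_loss(x_hat, x)
    kld = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    perc = perceptual_loss(x_hat, x)
    adv = -discriminator(x_hat).mean()   # generator side of a patch-based adversarial loss
    return l1 + w_kld * kld + w_perc * perc + w_adv * adv
```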
Because the spatial VAE latent representation retains semantic structure, a valid latent cannot be obtained by sampling directly from a Gaussian distribution. Instead, Gaussian noise is sampled in the latent space and denoised by the LDM into a latent instance, which is then decoded with the VAE. The LDM is based on a time-conditioned U-Net [28] with 1000 time steps. Similar to [14], we use a fixed variance and the reparametrised formulation that predicts the noise $\epsilon$ added at each time step $t$. An L1 loss between the added noise and the predicted noise was used to optimise the model. Two separate LDMs were trained, one on healthy and one on tumour-affected semantic label slices.
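The noise-prediction objective can be sketched as follows; the U-Net call signature, the schedule variable and the helper name are illustrative assumptions in the spirit of [14, 28].

```python
import torch
import torch.nn.functional as F

def ldm_training_step(unet, z0, alphas_cumprod, num_timesteps=1000):
    """One diffusion training step in the VAE latent space (noise-prediction parametrisation).

    z0: clean latent produced by the spatial VAE encoder, shape (B, C, H, W).
    alphas_cumprod: (num_timesteps,) cumulative products of the noise schedule.
    """
    t = torch.randint(0, num_timesteps, (z0.shape[0],), device=z0.device)
    eps = torch.randn_like(z0)                                  # noise added at step t
    a_bar = alphas_cumprod.to(z0.device)[t].view(-1, 1, 1, 1)
    z_t = a_bar.sqrt() * z0 + (1 - a_bar).sqrt() * eps          # forward (noising) process
    eps_hat = unet(z_t, t)                                      # time-conditioned U-Net predicts the noise
    return F.l1_loss(eps_hat, eps)                              # L1 between added and predicted noise
```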
Image generator: SPADE [25] is an image synthesis model that generates high-quality images from semantic maps. The network is a VAE-GAN, in which the encoder yields a latent space representation of an input style image conveying the desired style, which is then used by the decoder, along with the semantic map to create an output image. The semantic maps are fed via special normalisation blocks that imprint the desired content on the output at different upsampling levels. We trained SPADE using the original losses from [25]: an adversarial Hinge loss based on a Patch-GAN discriminator, a perceptual loss, a KLD loss, and a regulariser feature matching loss. The ground truths for the losses corresponded to images matching the content of the input semantic map and input style image. The weights of these losses were tuned empirically.
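For reference, a condensed sketch of a spatially-adaptive normalisation block in the spirit of [25] is shown below; the layer widths, kernel sizes and the choice of parameter-free normalisation are illustrative rather than the exact configuration used here.

```python
import torch.nn as nn
import torch.nn.functional as F

class SPADEBlock(nn.Module):
    """The semantic map modulates normalised features via learned scale and shift maps."""

    def __init__(self, feat_channels, label_channels, hidden=128):
        super().__init__()
        self.norm = nn.InstanceNorm2d(feat_channels, affine=False)  # parameter-free normalisation
        self.shared = nn.Sequential(nn.Conv2d(label_channels, hidden, 3, padding=1), nn.ReLU())
        self.gamma = nn.Conv2d(hidden, feat_channels, 3, padding=1)
        self.beta = nn.Conv2d(hidden, feat_channels, 3, padding=1)

    def forward(self, feat, seg):
        seg = F.interpolate(seg, size=feat.shape[2:], mode="nearest")  # match feature resolution
        h = self.shared(seg)
        return self.norm(feat) * (1 + self.gamma(h)) + self.beta(h)
```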
While the original SPADE model was found to produce high-quality outputs, the following limitations were identified:
1. The latent space encoding the styles is solely optimised with the KLD loss, with no specific clustering enforcement. In our scenario, the style images are MRI contrasts that link the appearance of tissue to its magnetic properties; one therefore needs to ensure that the latent space is clustered based on contrast, and not on aspects such as the slice number.
2. SPADE is designed to accept categorical segmentations. However, previous work on MRI synthesis [30] shows that partial volume maps, which assign to each pixel a probability of belonging to each class, result in finer details in the output images.
3. As explained previously, SPADE is designed to handle the style and content at different stages of the network. However, the original training process uses paired semantic maps and images to calculate the losses, making it impossible to rule out that the style latent space holds some information about the content of the output image.
To address limitation 1, we added two losses. First, a modality and dataset discrimination loss $\mathcal{L}_{disc}$, calculated by forward passing the generated images through a modality and dataset discriminator $D_{md}$ pre-trained on real data:

$\mathcal{L}_{disc} = \lambda_{mod}\,\mathrm{BCE}(m_{\hat{x}}, m_{x}) + \lambda_{dat}\,\mathrm{BCE}(d_{\hat{x}}, d_{x})$   (1)

where BCE is the binary cross-entropy loss, $m_{\hat{x}}$ and $d_{\hat{x}}$ are the modality and dataset (SABRE, ADNI2, etc.) predicted by $D_{md}$ for the generated image $\hat{x}$, and $m_{x}$ and $d_{x}$ are those of its equivalent ground truth; the weights $\lambda_{mod}$ and $\lambda_{dat}$ were varied across training. Secondly, a contrastive learning loss on the latent space based on [7] is introduced as $\mathcal{L}_{contr} = 1 - \mathrm{sim}\big(E(s), E(T(s))\big)$, where $\mathrm{sim}$ is the cosine similarity index, $E$ is the style encoder, $s$ the input style image and $T$ a random affine transformation implemented with MONAI [8].
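A minimal sketch of the contrastive style term is given below, assuming a style encoder that returns a flat latent tensor; the affine-transform parameters and the batch handling are illustrative.

```python
import torch
import torch.nn.functional as F
from monai.transforms import RandAffine

# Hypothetical transform settings; only the use of a MONAI random affine is taken from the text.
rand_affine = RandAffine(prob=1.0, rotate_range=0.1, scale_range=0.1, padding_mode="zeros")

def style_contrastive_loss(style_encoder, style_images):
    """Pull together the style latents of an image and its affine-transformed copy.

    style_images: (B, C, H, W) batch of style images.
    """
    augmented = torch.stack([rand_affine(img) for img in style_images])  # per-sample affine transform
    z = style_encoder(style_images).flatten(1)
    z_aug = style_encoder(augmented).flatten(1)
    sim = F.cosine_similarity(z, z_aug, dim=1)   # cosine similarity between latent pairs
    return (1.0 - sim).mean()
```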
Limitation 2 is addressed by replacing categorical labels with the probabilistic partial volume maps that were also used to train our label generator.
Finally, to address limitation 3, we enforce the separation of the style and content generation pipelines by using different brain slices from the same volume as the style image and semantic map during training.
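One way to realise this decoupling is sketched below: the style slice and the semantic map are drawn from different positions within the same subject’s volume. The sampler name and array layout are hypothetical.

```python
import random

def sample_style_content_pair(volume_images, volume_labels):
    """Pick different slice indices from the same subject for the style and content inputs.

    volume_images: (D, C, H, W) image slices of one subject.
    volume_labels: (D, K, H, W) partial-volume label slices of the same subject.
    """
    content_idx, style_idx = random.sample(range(volume_images.shape[0]), 2)
    semantic_map = volume_labels[content_idx]   # drives the content pathway
    style_image = volume_images[style_idx]      # drives the style pathway only
    return semantic_map, style_image
```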

2.3 Segmentation network used for the experiments
For our segmentation experiments, we trained instances of the 2D nnU-Net [17] until convergence, keeping the default parameters and loss functions.
3 Experiments
To test whether fully synthetic datasets can be used to train segmentation models, we propose three experiments: training segmentation models on synthetic data for (1) healthy tissue segmentation, (2) out-of-distribution healthy tissue segmentation, and (3) tumour segmentation. All models are then tested on real data.
3.1 Can we learn to segment healthy regions using synthetic data?
3.1.1 Experiment set-up:
In this experiment, we train two models on T1 images to segment three regions (CSF, GM and WM): 1) a real-data model, trained on 7008 paired data slices from SABRE, sampled from 180 volumetric subjects; and 2) a synthetic-data model, trained on 20000 synthetic partial volume maps and the corresponding generated images.
The models were tested on a set of 25 hold-out volumes from SABRE, and Dice scores were calculated on volumes reconstructed by stacking the 2D segmentations produced by nnU-Net.
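For clarity, the per-region evaluation can be sketched as follows; the function and label conventions are assumptions used only for illustration.

```python
import numpy as np

def volume_dice(pred_slices, gt_slices, label):
    """Dice score for one region over a volume rebuilt from stacked 2D segmentations."""
    pred = np.stack(pred_slices) == label            # (D, H, W) boolean mask for the region
    gt = np.stack(gt_slices) == label
    intersection = np.logical_and(pred, gt).sum()
    return 2.0 * intersection / (pred.sum() + gt.sum() + 1e-8)
```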
Results: Table 1 summarises the Dice scores per region obtained by both models on the test set. Example segmentations are depicted in Appendix 0.B. We compared the results using a two-sided t-test. Even though the real-data model performed significantly better for all regions (p-value < 0.05), the Dice scores obtained by the synthetic-data model were comparable to performances reported in the literature [22]. It is important to note that the ground truth labels used for the real-data model and for our test set were obtained with GIF, whereas the labels for the synthetic-data model were produced by the label generator, potentially creating a gap between label distributions that could explain the difference in performance.
Table 1: Dice scores per tissue for the in-distribution (IoD) setting of Experiment 3.1.

Tissue | Real-data model | Synthetic-data model
---|---|---
CSF | 0.953 ± 0.008 | 0.919 ± 0.023
GM | 0.952 ± 0.006 | 0.925 ± 0.008
WM | 0.965 ± 0.005 | 0.945 ± 0.006
Table 2: Dice scores per tissue for the out-of-distribution settings of Experiment 3.2.

Near out of distribution (n-OoD)

Tissue | Real-data model (3.1) | Synthetic-data model (3.1) | Synthetic, target style | Real target data (reference)
---|---|---|---|---
CSF | 0.782 ± 0.002 | 0.825 ± 0.023 | 0.841 ± 0.017 | 0.914 ± 0.022
GM | 0.774 ± 0.019 | 0.881 ± 0.008 | 0.895 ± 0.010 | 0.971 ± 0.011
WM | 0.652 ± 0.036 | 0.873 ± 0.007 | 0.891 ± 0.007 | 0.973 ± 0.009

Far out of distribution (f-OoD)

Tissue | Real-data model (3.1) | Synthetic-data model (3.1) | Synthetic, target style | Real target data (reference)
---|---|---|---|---
CSF | 0.711 ± 0.042 | 0.736 ± 0.054 | 0.792 ± 0.034 | 0.830 ± 0.050
GM | 0.531 ± 0.033 | 0.592 ± 0.033 | 0.784 ± 0.027 | 0.826 ± 0.047
WM | 0.447 ± 0.180 | 0.433 ± 0.178 | 0.809 ± 0.031 | 0.862 ± 0.038
3.2 Can synthetic generative models address out-of-distribution segmentation?
Experiment set-up: As an extension of Experiment 3.1, we explored the potential of our model for handling out-of-distribution (OoD) style images, as it is able to translate between modalities and capture, to some extent, the style of unseen images. For this, we performed a near-OoD experiment (n-OoD) and a far-OoD experiment (f-OoD), using a set of slices from 25 T1 ABIDE volumes (n-OoD) and a set of 25 OASIS FLAIR volumes (f-OoD) as test target datasets. Both the real-data and synthetic-data models from Experiment 3.1 were tested on these datasets. We also trained reference models on 580 slices from five paired volumes sourced from the target n-OoD and f-OoD distributions, and models trained on 20000 brainSPADE-generated images whose styles were taken from unpaired images of the target n-OoD and f-OoD distributions at inference time.
Results: The Dice scores obtained for the different structures are reported in Table 2. Example segmentations can be found in Appendix 0.B. Both models from Experiment 3.1 experienced a drop in performance when tested on n-OoD and f-OoD data, whereas the models trained on target-styled synthetic data were significantly better (p-value < 0.0001), with Dice scores closer to, yet still significantly lower than, those achieved by the reference models trained on paired data from the target distributions. This shows that brainSPADE has some potential for domain adaptation.
3.3 Can we learn to segment pathologies from synthetic data?
Experiment set-up: In this experiment, we train models on T1 and FLAIR images to segment tumours from a hold-out set of sites from BRATS, unseen by brainSPADE. We trained three models: a real-data model, on 1064 slices from 5 paired subjects belonging to the target set; a synthetic-data model, on 20000 slices sampled from brainSPADE using the style of target T1 and FLAIR images; and a hybrid model, combining both training sets. The labels for the synthetic-data model were sampled using lesion conditioning, ensuring a balance between negative and positive samples. The resulting models were tested on 30 test volumes from the target dataset, as in Experiments 3.1 and 3.2, yielding Dice scores on tumours, accuracy, precision and recall.
Results: The results are reported in Table 3, with visual examples available in Appendix 0.B. The hybrid model achieved the top performance for all metrics, with significantly better recall than the real-data and synthetic-data models.
Table 3: Tumour segmentation performance on the BRATS hold-out sites (Experiment 3.3).

Metric | Real-data model | Synthetic-data model | Hybrid model
---|---|---|---
Dice on tumour | 0.813 ± 0.174 | 0.760 ± 0.187 | 0.876 ± 0.094
Accuracy | 0.995 ± 0.007 | 0.994 ± 0.006 | 0.997 ± 0.002
Precision | 0.878 ± 0.143 | 0.864 ± 0.124 | 0.921 ± 0.061
Recall | 0.773 ± 0.209 | 0.713 ± 0.238 | 0.852 ± 0.137
4 Discussion and Conclusion
We have shown that brainSPADE, a fully synthetic brain MRI generative model, can produce labelled datasets that can be used to train segmentation models with performance comparable to models trained using real data. The synthetic data generated by brainSPADE can not only replace real data for healthy tissue segmentation but also address pathological segmentation, as evidenced by Experiment 3.3. In addition, because the content pathway is completely separated from the style pathway in the generative pipeline, brainSPADE makes it possible to condition on unlabelled images, producing fully labelled datasets that can help train segmentation models with reasonable performance on that target distribution. The ability to replicate, to some extent, the style of an unseen dataset is shown in Experiment 3.2, where using OoD images as styles for brainSPADE results in a performance boost on that dataset.
These results open a promising pathway to tackle the lack of data in medical imaging segmentation tasks, where multi-modal synthetic data, conditioned by the user’s specifications on style and content, could not only help with data augmentation but also compensate for the unavailability of paired training data. In the future, our model could be fine-tuned on more modalities and pathologies, making it more generalisable and capable of addressing more complex segmentation tasks, e.g. involving small or multiple lesions. Synthetic medical data has the advantage of not retaining any personal information about the patient, as it introduces variations on the original anatomy that should remove traceability. Nonetheless, future work should analyse to what extent this model introduces variations on the training data and to what extent it memorises it. This is key to ensuring, on the one hand, that brainSPADE is stochastic and can produce an almost unlimited amount of data and, on the other hand, that the training data cannot be retrieved from the model via model-inversion attacks [13], a critical point if generative models are used as a public surrogate for real medical data.
Appendix 0.A Training set-ups
0.A.1 Training brainSPADE
0.A.1.1 Training the Label Generator
We trained the VAE for 800 epochs using the Adam optimizer and a batch size of 256 on an NVIDIA DGX A100 node. The training time was approximately 8 hours. The LDM was trained for 1500 epochs using the Adam optimizer and a batch size of 384 on an NVIDIA DGX A100 node. The training time was approximately 15 hours.
0.A.1.2 Training the Image Generator
The weights used to balance the different losses were tuned empirically, as noted in Section 2.2. An exponentially decaying learning rate was used with an Adam optimizer for 4800 epochs. The training time was approximately 2 weeks, using a batch size of 6 on an NVIDIA Quadro RTX 8000 GPU.
For the training process we used the following MONAI augmentations [8]:
- Random bias field augmentation, with a coefficient range of 0.2-0.6.
- Random contrast adjustment, with a coefficient range of 0.85-1.25.
- Random Gaussian noise addition, with a mean of 0.0 and a standard deviation range of 0.05-0.15.
The images were normalised using Z-normalisation.
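A sketch of an equivalent MONAI transform chain is given below; the probabilities are assumptions, while the ranges follow the values listed above.

```python
from monai.transforms import (Compose, NormalizeIntensity, RandAdjustContrast,
                              RandBiasField, RandGaussianNoise)

# Hypothetical augmentation chain applied to channel-first image slices.
train_transforms = Compose([
    RandBiasField(coeff_range=(0.2, 0.6), prob=0.5),    # random bias field
    RandAdjustContrast(gamma=(0.85, 1.25), prob=0.5),   # random contrast adjustment
    RandGaussianNoise(mean=0.0, std=0.15, prob=0.5),    # additive Gaussian noise (std up to ~0.15)
    NormalizeIntensity(),                                # Z-normalisation
])
```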
0.A.2 Training Segmentation nnU-Nets
We used nnU-Net to perform all our segmentation experiments. nnU-Net performs automatic hyperparameter selection based on the task and input data; we downloaded the package from GitHub (https://github.com/MIC-DKFZ/nnUNet.git) and selected the ‘2d’ training option. We modified the number of epochs to ensure convergence for all models.
Appendix 0.B Additional figures


References
- [1] Abd-Ellah, M.K., et al.: A review on brain tumor diagnosis from MRI images: Practical implications, key achievements, and lessons learned. Magnetic Resonance Imaging 61, 300–318 (2019)
- [2] Acero, J.C., et al.: SMOD - Data Augmentation Based on Statistical Models of Deformation to Enhance Segmentation in 2D Cine Cardiac MRI. In: FIMH (2019)
- [3] ADNI: Alzheimer’s Disease Neuroimaging Initiative, http://adni.loni.usc.edu/
- [4] Antonelli, M., et al.: The Medical Segmentation Decathlon. Nature Communications 13(1), 1–13 (2022)
- [5] Cardoso, M.J., et al.: Geodesic information flows. In: Medical Image Computing and Computer-Assisted Intervention (MICCAI). pp. 262–270 (2012). https://doi.org/10.1007/978-3-642-33418-4_33
- [6] Chen, D., et al.: Stylebank: An explicit representation for neural image style transfer (2017)
- [7] Chen, T., et al.: A Simple Framework for Contrastive Learning of Visual Representations (2020)
- [8] Consortium, M.: MONAI: Medical Open Network for AI (mar 2020)
- [9] Dorent, R., et al.: Learning joint segmentation of tissues and brain lesions from task-specific hetero-modal domain-shifted datasets. Medical Image Analysis 67, 101862 (2021)
- [10] Esser, P., et al.: Taming transformers for high-resolution image synthesis (2020)
- [11] European Commission: Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation) (Text with EEA relevance) (2016), https://eur-lex.europa.eu/eli/reg/2016/679/oj
- [12] Goodfellow, I.J., et al.: Generative Adversarial Nets. In: Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2. pp. 2672–2680. NIPS’14, MIT Press, Cambridge, MA, USA (2014)
- [13] Hidano, S., et al.: Model Inversion Attacks for Prediction Systems: Without Knowledge of Non-Sensitive Attributes. In: 2017 15th PST Annual Conference. pp. 115–11509 (2017)
- [14] Ho, J., et al.: Denoising Diffusion Probabilistic Models (2020)
- [15] Huo, Y., et al.: SynSeg-Net: Synthetic Segmentation Without Target Modality Ground Truth. IEEE Transactions on Medical Imaging 38(4), 1016–1025 (apr 2019)
- [16] Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT Press (2016)
- [17] Isensee, F., et al.: nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nature Methods 18(2), 203–211 (2021)
- [18] Jones, S., et al.: Cohort Profile Update: Southall and Brent Revisited (SABRE) study: a UK population-based comparison of cardiovascular disease and diabetes in people of European, South Asian and African Caribbean heritage. International Journal of Epidemiology 49(5), 1441–1442e (oct 2020)
- [19] Kingma, D.P., Welling, M.: Auto-Encoding Variational Bayes. CoRR (2014)
- [20] LaMontagne, P.J., et al.: OASIS-3: Longitudinal Neuroimaging, Clinical, and Cognitive Dataset for Normal Aging and Alzheimer Disease. medRxiv p. 2019.12.13.19014902 (jan 2019)
- [21] Martino, A.D., Yan, C.G., Li, Q., Denio, E., Castellanos, F.X., Alaerts, K.: The autism brain imaging data exchange: towards a large-scale evaluation of the intrinsic brain architecture in autism. Molecular Psychiatry 10, 659–667 (2013)
- [22] Mendrik, A.M., et al.: MRBrainS Challenge: Online Evaluation Framework for Brain Image Segmentation in 3T MRI Scans. Computational intelligence and neuroscience 2015, 813696 (2015)
- [23] Menze, B.H., et al.: The Multimodal Brain Tumor Image Segmentation Benchmark (BRATS). IEEE transactions on medical imaging 34(10), 1993–2024 (oct 2015)
- [24] van den Oord, A., et al.: Neural Discrete Representation Learning. In: Proceedings of the 31st International Conference on Neural Information Processing Systems. pp. 6309–6318. NIPS’17, Curran Associates Inc., Red Hook, NY, USA (2017)
- [25] Park, T., et al.: Semantic image synthesis with spatially-adaptive normalization. Proceedings of IEEE CVPR 2019-June, 2332–2341 (2019)
- [26] Qasim, A.B., et al.: Red-GAN: Attacking class imbalance via conditioned generation. Yet another medical imaging perspective. In: Arbel, T., Ben Ayed, I., de Bruijne, M., Descoteaux, M., Lombaert, H., Pal, C. (eds.) Proceedings of the Third Conference on Medical Imaging with Deep Learning. Proceedings of Machine Learning Research, vol. 121, pp. 655–668. PMLR (2020)
- [27] Rieke, N., et al.: The future of digital health with federated learning. npj Digital Medicine 3(1), 119 (2020)
- [28] Rombach, R., et al.: High-Resolution Image Synthesis with Latent Diffusion Models (2021)
- [29] Ronneberger, O., et al.: U-Net: Convolutional Networks for Biomedical Image Segmentation. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (eds.) MICCAI. pp. 234–241. Springer International Publishing, Cham (2015)
- [30] Rusak, F., et al.: 3D brain MRI GAN-based synthesis conditioned on partial volume maps. In: Burgos, N., Svoboda, D., Wolterink, J.M., Zhao, C. (eds.) SASHIMI 2020, pp. 11–20. Lecture Notes in Computer Science, Springer, Switzerland (2020)
- [31] Shi, Y., et al.: Retrieval-based Spatially Adaptive Normalization for Semantic Image Synthesis (apr 2022)
- [32] Wachinger, C., et al.: BrainPrint: a discriminative characterization of brain morphology. NeuroImage 109, 232–248 (apr 2015)
- [33] Wang, X., Gupta, A.: Generative image modeling using style and structure adversarial networks (2016)
- [34] Zhu, P., Abdal, R., Qin, Y., Wonka, P.: SEAN: Image Synthesis with Semantic Region-Adaptive Normalization. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition pp. 5103–5112 (nov 2019)