Self-supervised Brain Lesion Generation for Effective Data Augmentation of Medical Images
Abstract
Accurate brain lesion delineation is important for planning neurosurgical treatment. Automatic brain lesion segmentation methods based on convolutional neural networks have demonstrated remarkable performance. However, neural network performance is constrained by the lack of large-scale, well-annotated training datasets. In this manuscript, we propose a comprehensive framework to efficiently generate new samples for training a brain lesion segmentation model. We first train a lesion generator, based on an adversarial autoencoder, in a self-supervised manner. Next, we utilize a novel image composition algorithm, Soft Poisson Blending, to seamlessly combine synthetic lesions and brain images to obtain training samples. Finally, to effectively train the brain lesion segmentation model with augmented images, we introduce a new prototype consistency regularization to align real and synthetic features. Our framework is validated by extensive experiments on two public brain lesion segmentation datasets: ATLAS v2.0 and Shift MS. Our method outperforms existing brain image data augmentation schemes. For instance, our method improves the Dice score from 50.36% to 60.23% compared to a U-Net with conventional data augmentation techniques on the ATLAS v2.0 dataset.
keywords:
Brain Lesion Segmentation, Data Augmentation, Poisson Blending, Prototype Learning

1 Introduction
Brain lesions are often indicative of serious neurological conditions, from cancer to stroke. Magnetic Resonance (MR) imaging is widely used to detect brain lesions as MR provides excellent soft-tissue contrast, allowing for a clear distinction between healthy and abnormal brain tissue. Accurate segmentation of brain lesions is crucial for quantitative analysis of lesion progression and planning surgical treatments. The current clinical standard is delineation of the brain lesion boundary by a human expert, which is tedious, time-consuming, and costly. Neural networks have emerged as a promising technique to automate brain lesion segmentation [31, 26]. However, training neural networks requires large amounts of well-annotated images to ensure good performance. The need for large training datasets limits the development of automatic brain lesion segmentation models since the scale of brain lesion datasets is often limited.
Data augmentation is a widely used technique to increase training dataset size and diversity in order to improve model performance. Conventional data augmentation strategies include random rotations, brightness adjustment, etc. Although these simple spatial and appearance transformations improve segmentation performance to some extent, they do not fundamentally increase the diversity of the dataset and provide a smaller boost in performance compared to acquiring new data. Recently, methods to perform data augmentation based on image fusion (such as Mixup [43] and CutMix [42]) have been developed to increase training dataset diversity. However, these approaches may shift the distribution of the original dataset [29], which is catastrophic for small datasets as the model learns non-representative features from the augmented images [43]. Neural network approaches, such as Generative Adversarial Networks (GANs) [7], have been proposed to synthesize data for model training. However, the performance of GANs, like that of other neural networks, is constrained by the size of the training dataset, posing challenges in generating realistic images when only limited data are available. Regardless of the method used to create augmented samples, combining augmented and real samples does not guarantee the segmentation model will learn discriminative features to segment lesions on real samples, as there is no supervision in the feature space. Therefore, we raise two questions: How can we generate realistic images that will not shift the original data distribution with limited training samples? How can we effectively use synthetic samples to train a segmentation model to perform well on real samples?
We present a comprehensive framework to effectively augment brain imaging data with synthetic lesions to train a lesion segmentation model. Our framework has three stages: (1) train a lesion generator in a self-supervised manner, and sample feasible latent vectors from a constrained embedding space to create realistic paired lesion images and masks; (2) blend lesion images into existing brain images leveraging our novel image composition technique called Soft Poisson Blending (SPB); (3) train the segmentation model using a prototype consistency regularization term to align real and synthetic lesion features for better performance. The main contributions of our work are:
1. We develop a data augmentation method to enhance the performance of neural networks trained for segmentation tasks. It comprises a two-stage adversarial autoencoder (AAE), consisting of shape and intensity generation networks, that generates new foreground regions on-the-fly during training. Distinct from other GANs, we design a self-supervised approach in which we simulate image-label pairs for training our AAE. This enables our AAE to learn a larger distribution of lesions and improves the quality of synthetic results.
2. We introduce Soft Poisson Blending (SPB), based on Poisson Image Editing [27], to ensure realistic and smooth boundaries when inserting generated lesions into a brain image. SPB computes a refined guidance vector field that adjusts intensity values so that the synthetic lesion appears similar to the surrounding brain tissue.
3. We design a new loss term, prototype consistency regularization, to learn common features between synthetic and real training samples.

2 Related Work
2.1 Data Augmentation for Data Scarcity
To mitigate the problem of data scarcity when training neural networks, data augmentation techniques are used to increase the training dataset size and thereby improve model performance. Conventional data augmentation techniques include shape transformations (flip, rotation, scaling, etc.) and appearance transformations (color jittering, brightness and contrast adjustment, etc.) [15, 13]. However, the diversity of the dataset achieved by such transformations is limited. Furthermore, the intrinsic characteristics of the training samples are not fundamentally changed, limiting improvements in model performance. The development of GANs enabled more advanced data augmentation where entirely new samples are created by the model. GANs can generate realistic samples for both natural [2] and medical [24, 34] images. GANs may also be designed for image-to-image translation, generating images from existing images rather than from noise, in order to create images of different modalities or characteristics. Nevertheless, GANs require large training datasets to enable accurate image generation. Recently, image-manipulation-based data augmentation techniques [43, 42, 44] have been developed to increase training dataset size and variety by using a set of image manipulation rules to generate new samples. For instance, CutMix [42] cuts and fuses patches from different images to create new training samples. Such techniques must be carefully designed and may shift the distribution of the training dataset, especially when the original training dataset is small [29]. In this study, we combine the strengths of generative models and image-manipulation-based data augmentation techniques to synthesize realistic brain images with limited training samples.
2.2 Poisson Blending in Deep Learning
Poisson blending is an image processing technique to seamlessly integrate regions of a source image into a target image. In the context of data augmentation, this typically involves identifying the foreground of a source image and integrating this region into a target image. The blending process is guided by the gradient of the source image and the intensity of the target image [27]. Poisson blending generates more coherent images compared to simpler fusion methods such as Copy-Paste [6]. Poisson blending has been used for data augmentation on a variety of natural images [20]. In the context of medical imaging, [36] developed a self-supervised learning strategy to detect abnormalities in chest X-ray images, where abnormalities were generated by fusing image patches into the target image using Poisson blending. [38] utilized Poisson blending to generate retinal images with lesions and appearances different from those in the original training dataset to improve the performance of a lesion segmentation model. [17] introduced Poisson blending to the gastric disease classification task to improve model generalization. However, all of these applications operate on 2D images and, to our knowledge, Poisson blending has not yet been applied to 3D medical images.
2.3 Prototype Learning
Prototype networks [35] were first presented for few-shot learning, where the network learns to represent each class using a prototype (i.e., a representative vector in the embedding space). Each class prototype is typically obtained by computing the average feature of samples belonging to the class. This method was initially applied to image classification, where distances between different class prototypes were maximized across training samples [35]. Prototype learning was extended to image segmentation by computing the cosine similarity between class prototypes and individual pixel features; each test pixel is then assigned the class whose prototype it is most similar to. [39] designed a bidirectional prototype alignment mechanism for few-shot image segmentation. [16] utilized class prototypes to augment training samples in the feature space. [40] introduced a cyclic prototype consistency framework for semi-supervised medical image segmentation. In our work, we draw on the idea of prototype consistency to introduce a regularisation term during segmentation model training that aligns features between synthetic and real samples.
3 Methodology
Fig. 2 illustrates our entire framework, which comprises three stages: I. training the lesion generator; II. inserting synthetic lesions into brain images; and III. training the segmentation model on the generated images with prototype consistency regularization. Exemplar synthetic lesions and real lesions are shown in Fig. 1 (a); the synthetic lesions have appearances similar to those of real lesions.

3.1 Self-supervised AAE for Brain Lesion Synthesis
Training a 3D generative model with a small dataset is challenging, and achieving good model performance is unlikely. We therefore decompose brain lesion generation into two sub-tasks to reduce model complexity and use a self-supervised learning strategy during training. We designed two models, a shape adversarial autoencoder (shape AAE) and an intensity adversarial autoencoder (intensity AAE), to first create lesion masks and then perform texture synthesis to generate lesion images corresponding to the masks. Both the shape and intensity AAEs are trained in a self-supervised manner. For lesion synthesis, real lesion images are used to define the data distribution in the latent space, and synthetic lesion images are created by sampling latent vectors from only this distribution and passing them through the decoder block of the trained AAEs.
3.1.1 Shape and Intensity AAE Design
We follow the model architecture proposed in [30] to design the shape and intensity AAEs. As shown in Fig. 2, each AAE contains an encoder $E$, a decoder $G$, an image discriminator $D_x$, and a latent discriminator $D_z$. Although the structures are similar, for the intensity AAE we introduce a mask embedding block (MEB) [12] to provide shape guidance when generating the lesion intensity. The detailed structure of the MEB is shown in Fig. 3. It embeds the mask into the feature space to control the shape of synthetic lesions.

For training the AAE models we use a three-term loss: a reconstruction loss $\mathcal{L}_{rec}$, a latent adversarial loss, and an image adversarial loss. $\mathcal{L}_{rec}$ is computed as the mean absolute error (MAE) between the input image $x$ and the reconstructed image $\hat{x} = G(E(x))$:

$$\mathcal{L}_{rec} = \lVert x - \hat{x} \rVert_{1} \tag{1}$$

$\mathcal{L}_{rec}$ guarantees that $x$ and $\hat{x}$ look similar in general.
The latent adversarial loss is formulated to ensure that the latent space of the lesions follows a normal distribution:

$$\mathcal{L}_{D_z} = -\,\mathbb{E}_{z \sim \mathcal{N}(\mu,\sigma)}\big[\log D_z(z)\big] - \mathbb{E}_{x}\big[\log\big(1 - D_z(E(x))\big)\big] \tag{2}$$

$$\mathcal{L}_{E} = -\,\mathbb{E}_{x}\big[\log D_z(E(x))\big] \tag{3}$$

where $\mathcal{N}(\mu,\sigma)$ is a normal distribution with mean $\mu$ and standard deviation $\sigma$. Similarly, the image adversarial loss is designed to ensure the reconstructed image is realistic in appearance. It is defined as:

$$\mathcal{L}_{D_x} = -\,\mathbb{E}_{x}\big[\log D_x(x)\big] - \mathbb{E}_{x}\big[\log\big(1 - D_x(\hat{x})\big)\big] \tag{4}$$

$$\mathcal{L}_{G} = -\,\mathbb{E}_{x}\big[\log D_x(\hat{x})\big] \tag{5}$$
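To make these terms concrete, a minimal PyTorch sketch is given below. The module interfaces, the binary cross-entropy formulation of the adversarial terms, and the standard-normal prior are illustrative assumptions rather than the exact implementation; alternating generator/discriminator updates and optimizers are omitted.

```python
# Minimal sketch of the AAE loss terms (Eqs. 1-5); encoder/decoder/discriminator
# modules are assumed to be callables returning tensors/logits.
import torch
import torch.nn.functional as F

def bce(logits, target_is_real):
    """Binary cross-entropy against an all-real or all-fake target."""
    target = torch.ones_like(logits) if target_is_real else torch.zeros_like(logits)
    return F.binary_cross_entropy_with_logits(logits, target)

def aae_losses(encoder, decoder, d_latent, d_image, x):
    z = encoder(x)                      # latent code E(x)
    x_hat = decoder(z)                  # reconstruction G(E(x))

    l_rec = F.l1_loss(x_hat, x)         # Eq. (1): mean absolute error

    z_prior = torch.randn_like(z)       # prior sample (standard normal assumed)
    l_dz = bce(d_latent(z_prior), True) + bce(d_latent(z.detach()), False)  # Eq. (2)
    l_ez = bce(d_latent(z), True)                                           # Eq. (3)

    l_dx = bce(d_image(x), True) + bce(d_image(x_hat.detach()), False)      # Eq. (4)
    l_gx = bce(d_image(x_hat), True)                                        # Eq. (5)

    return l_rec, l_dz, l_ez, l_dx, l_gx
```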
3.1.2 Training Set Generation
Unlike other generative models, which are trained on real image-mask pairs, we train our model in a self-supervised manner on generated lesion-mask pairs. As shown in Fig. 1 (b) and (c), the latent distribution of real lesions (pink triangles) is a subset of that of the pre-generated lesions (grey dots), which indicates that training on pre-generated lesion-mask pairs is sufficient to learn a representative feature embedding space for real lesion-mask pairs.
The pre-generated lesion-mask pairs are created as follows. Inspired by [10], we first generate overlapping ellipsoids to simulate a general lesion shape. The half-axis lengths in the three directions follow a uniform distribution. Elastic deformations are then applied to the ellipsoids to make the shape more natural and irregular. Finally, we add Perlin noise [28] to make the boundary more irregular. A comparison between real lesion masks and those generated by this approach is shown in Fig. 4 (a) and (b). Note that the complexity of the pre-generated masks exceeds that of real lesion masks, which enhances the AAE's ability to learn the reconstruction task.
To generate the lesion images, we randomly select a location within a brain image from the training set and then extract the intensity values for voxels inside the pre-generated mask within that region. To increase variation in the training set, we apply foreground intensity perturbation [11] to randomly adjust the intensity values. Fig. 4 (c) and (d) show real lesions and those generated by our approach. Note that the styles of the images are similar, which ensures the pre-generated lesion-mask pairs are suitable for model training.
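A minimal sketch of this pre-generation pipeline is shown below. The half-axis range, deformation strength, and noise amplitude are illustrative values, smoothed Gaussian noise is used as a simple stand-in for Perlin noise, and the helper names are ours rather than those of the released implementation.

```python
# Minimal sketch of self-supervised lesion-mask pre-generation (Section 3.1.2).
import numpy as np
from scipy.ndimage import gaussian_filter, map_coordinates

def random_ellipsoid_mask(size=64, n_ellipsoids=3, axis_range=(4, 12), rng=None):
    """Union of overlapping random ellipsoids as a coarse lesion shape."""
    rng = rng or np.random.default_rng()
    zz, yy, xx = np.meshgrid(*[np.arange(size)] * 3, indexing="ij")
    mask = np.zeros((size,) * 3, dtype=bool)
    base = rng.uniform(size * 0.35, size * 0.65, 3)
    for _ in range(n_ellipsoids):
        c = base + rng.uniform(-4, 4, 3)                  # jitter keeps the ellipsoids overlapping
        a = rng.uniform(*axis_range, 3)                   # half-axis lengths per direction
        mask |= (((zz - c[0]) / a[0]) ** 2 + ((yy - c[1]) / a[1]) ** 2
                 + ((xx - c[2]) / a[2]) ** 2) <= 1.0
    return mask

def elastic_deform(mask, alpha=8.0, sigma=4.0, rng=None):
    """Warp the mask with a smooth random displacement field."""
    rng = rng or np.random.default_rng()
    disp = [gaussian_filter(rng.uniform(-1, 1, mask.shape), sigma) * alpha for _ in range(3)]
    coords = np.meshgrid(*[np.arange(s) for s in mask.shape], indexing="ij")
    warped = map_coordinates(mask.astype(float), [c + d for c, d in zip(coords, disp)], order=1)
    return warped > 0.5

def roughen_boundary(mask, sigma=2.0, amplitude=0.3, rng=None):
    """Perturb the mask boundary with smooth (Perlin-like) noise."""
    rng = rng or np.random.default_rng()
    noise = gaussian_filter(rng.uniform(-1, 1, mask.shape), sigma)
    return (gaussian_filter(mask.astype(float), 1.0) + amplitude * noise) > 0.5

def lesion_from_brain(brain, mask, corner, jitter=0.1, rng=None):
    """Copy voxel intensities under `mask` from a training brain image and perturb them."""
    rng = rng or np.random.default_rng()
    s = mask.shape[0]
    z, y, x = corner                                      # assumed to lie fully inside the brain volume
    lesion = brain[z:z + s, y:y + s, x:x + s] * mask
    lesion[mask] *= 1.0 + rng.uniform(-jitter, jitter)    # foreground intensity perturbation
    return lesion
```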
3.1.3 Constrained Sampling for New Lesion Synthesis
Once AAE model training is complete, we freeze both the shape AAE and intensity AAE model weights. As shown in Fig. 1 (b) and (c), the latent space of real lesions is a subset of the latent space of the lesions generated for training the AAE models. Therefore, we use a constrained sampling strategy to synthesize lesions that are more similar to real lesions in the latent space. Specifically, we first obtain the latent embedding vectors for real lesion masks and lesion intensity images using the shape encoder and the intensity encoder, respectively. Next, we apply Principal Component Analysis (PCA) to these vectors to obtain a dimensionality-reduced latent space representation. Note that we apply PCA to the shape and intensity latent embedding vectors separately. We keep the top principal components for each space, which cover a fixed proportion of the latent embedding vector variance.
To create a new synthetic lesion, we uniformly sample two vectors, one from the dimensionality-reduced shape latent space and one from the dimensionality-reduced intensity latent space. We then map these dimensionality-reduced vectors back to the original latent embedding spaces. Finally, we use these two latent embedding vectors as the input to the shape decoder and intensity decoder, respectively, to generate new synthetic lesion masks and images.
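The constrained sampling step can be sketched as follows; the explained-variance threshold and the uniform bounding-box sampling region are illustrative assumptions, since the exact sampling bounds are not specified above.

```python
# Minimal sketch of constrained latent sampling (Section 3.1.3). The shape and
# intensity latent spaces are handled separately with the same two helpers.
import numpy as np
from sklearn.decomposition import PCA

def fit_constrained_space(latents, variance=0.95):
    """Fit PCA on real-lesion latent vectors (N x D), keeping the top components."""
    pca = PCA(n_components=variance)          # components explaining `variance` of the variance
    reduced = pca.fit_transform(latents)
    return pca, reduced.min(axis=0), reduced.max(axis=0)

def sample_latent(pca, lo, hi, rng=None):
    """Uniformly sample in the reduced space and map back to the full latent space."""
    rng = rng or np.random.default_rng()
    z_reduced = rng.uniform(lo, hi)
    return pca.inverse_transform(z_reduced[None, :])[0]

# Usage with trained (frozen) encoders/decoders, e.g.:
#   pca_s, lo_s, hi_s = fit_constrained_space(shape_latents)
#   new_mask = shape_decoder(sample_latent(pca_s, lo_s, hi_s))
```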

3.2 Lesion Image Composition
After generating the synthetic lesion images, we fuse them with a brain image to create training samples for segmentation model training. We first select a plausible location in the brain for the generated lesion, then create a composite image using our modified Poisson image editing method, Soft Poisson Blending (SPB), to ensure the boundary between the image and the lesion is seamless. We detail this approach below.
3.2.1 Lesion Location Selection
The brain has a regular anatomical structure, with distinct regions including the ventricles and brain stem. The location of brain lesions depends on the underlying pathology (e.g., stroke or multiple sclerosis). For the datasets in this work, we use FastSurfer [8] to segment the white matter of the original patient's brain as a region proposal for lesion location selection. After obtaining the white matter mask, we apply a morphological binary erosion to shrink the masked region so that the synthetic lesion will not exceed the mask boundary. We then randomly (following a uniform distribution) select a voxel in the masked area as the center of the synthetic lesion.
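A short sketch of this selection step is shown below, assuming a binary white-matter mask (e.g., produced by FastSurfer) as input; the erosion extent is an illustrative stand-in for the expected lesion radius.

```python
# Minimal sketch of lesion location selection (Section 3.2.1).
import numpy as np
from scipy.ndimage import binary_erosion

def select_lesion_center(white_matter_mask, erosion_iters=8, rng=None):
    rng = rng or np.random.default_rng()
    eroded = binary_erosion(white_matter_mask,
                            structure=np.ones((3, 3, 3), dtype=bool),
                            iterations=erosion_iters)      # keep voxels away from the mask boundary
    candidates = np.argwhere(eroded)                       # remaining voxels are valid lesion centers
    return tuple(candidates[rng.integers(len(candidates))])
```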
3.2.2 Soft Poisson Blending
We developed Soft Poisson Blending (SPB), a modified implementation of Poisson Image Editing [27]. The key idea of Poisson blending is to use the Poisson partial differential equation under a Dirichlet boundary condition that specifies the intensity values in the boundary area. We first adapt the conventional Poisson blending approach to 3D images. Second, we refine the guidance vector field to ensure that the lesion foreground exhibits a natural internal appearance while remaining consistent with the background image at the edges.
For a brain image $f^{*}$ and a synthetic lesion image $g$, we define the target region in $f^{*}$ that we would like to blend with $g$ as $\Omega$, with boundary $\partial\Omega$. The value function $f$, defined on $\Omega$, takes the value of $f^{*}$ at $\partial\Omega$. We solve the optimization problem defined as:

$$\min_{f} \int_{\Omega} \lVert \nabla f - \mathbf{v} \rVert^{2}, \quad \text{with } f\big|_{\partial\Omega} = f^{*}\big|_{\partial\Omega} \tag{6}$$

where $\mathbf{v}$ is the guidance vector field and $\nabla$ is the gradient operator. This equation guarantees that: i) the gradient of the foreground content is as close as possible to $\mathbf{v}$; ii) the boundary voxel values of the foreground are consistent with the existing $f^{*}$, i.e., a seamless transition. The solution under the Dirichlet boundary condition is the Poisson equation:

$$\Delta f = \operatorname{div} \mathbf{v} \ \text{over} \ \Omega, \quad \text{with } f\big|_{\partial\Omega} = f^{*}\big|_{\partial\Omega} \tag{7}$$

where $\Delta$ is the Laplacian operator and $\operatorname{div}$ is the divergence operator. The guidance vector field $\mathbf{v}$ (Fig. 5 (f)) is calculated as the mixed gradient of the brain image $f^{*}$ (Fig. 5 (a)) and the synthetic lesion image $g$ (Fig. 5 (d)) by selecting, at each voxel, the gradient with the larger magnitude. However, using this definition of $\mathbf{v}$ to construct the blended image can lead to an unnatural appearance (Fig. 5 (c)). This unnatural appearance is caused by the magnitude of $\nabla g$ on $\partial\Omega$ being much higher than that of $\nabla f^{*}$, since the values of $g$ outside of the foreground are zero. Based on this observation, for Soft Poisson Blending, we modified the computation of $\mathbf{v}$ as follows:
(8) |
This results in the blended image becoming more realistic (Fig. 5 (e)) compared to the original Poisson Blending algorithm (Fig. 5 (f)).
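The sketch below illustrates the conventional 3D Poisson blending step (Eqs. 6-7) with the mixed-gradient guidance field, solved here by simple Jacobi iteration for readability. The refined guidance field of Eq. (8) and a faster sparse solver are omitted, and the function name and iteration count are illustrative.

```python
# Minimal sketch of 3D Poisson blending with a mixed-gradient guidance field.
# `omega` is assumed not to touch the volume border.
import numpy as np

def poisson_blend_3d(background, source, omega, n_iter=500):
    """background: brain image f*, source: synthetic lesion image g, omega: boolean target region."""
    f = background.astype(np.float64)      # current estimate; equals f* outside omega throughout
    bg = background.astype(np.float64)
    g = source.astype(np.float64)
    shifts = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    for _ in range(n_iter):
        acc = np.zeros_like(f)
        for s in shifts:
            f_n = np.roll(f, shift=s, axis=(0, 1, 2))      # neighbour values (Dirichlet outside omega)
            grad_bg = bg - np.roll(bg, shift=s, axis=(0, 1, 2))
            grad_g = g - np.roll(g, shift=s, axis=(0, 1, 2))
            v = np.where(np.abs(grad_bg) > np.abs(grad_g), grad_bg, grad_g)  # mixed gradient
            acc += f_n + v
        f[omega] = acc[omega] / len(shifts)                # Jacobi update inside omega only
    return f
```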

3.3 Prototype Consistency for Feature Alignment
After the synthetic lesions are blended into the brain images, the next step is training a segmentation model. Note that here the training dataset is a mixture of real and synthetic lesions, which provides us with a unique opportunity to use this information to improve the lesion representations at the feature map level. We propose a consistency regularization that favors networks whose feature maps for the two types of lesions (real and synthetic) are similar. We hypothesize this will tend towards feature maps that are more general to the segmentation problem and less specific to the particular training dataset, thereby increasing segmentation model robustness.
The feature map of real lesions is denoted as $F_r = F \odot \mathbb{1}[M_r]$, and the feature map of synthetic lesions is denoted as $F_s = F \odot \mathbb{1}[M_s]$. Here $F$ indicates the feature map of the composited image, $M_r$ and $M_s$ are the real and synthetic lesion masks, and $\mathbb{1}[\cdot]$ is an indicator function whose value is 1 if the condition is true and 0 otherwise. Inspired by the prototypical network [35], we aim to force the segmentation model to learn similar feature distributions for $F_r$ and $F_s$ via class prototypes. Specifically, we first obtain feature prototypes for both lesion types by averaging the feature maps for the specific lesion type over the spatial dimensions:

$$p_r = \frac{\sum_{j} F_r(j)}{\sum_{j} \mathbb{1}[j \in M_r]} \tag{9}$$

$$p_s = \frac{\sum_{j} F_s(j)}{\sum_{j} \mathbb{1}[j \in M_s]} \tag{10}$$

where $j$ is the spatial coordinate. The prototype difference loss can then be computed as:

$$\mathcal{L}_{pd} = \lVert p_r - p_s \rVert_{1} \tag{11}$$

where $\lVert \cdot \rVert_{1}$ is the L1-norm of a vector.
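A PyTorch sketch of the masked prototype computation and the prototype difference loss (Eqs. 9-11) is given below. The tensor layout (B, C, D, H, W) and the function names are illustrative, and the lesion masks are assumed to have been resized to the feature-map resolution.

```python
# Minimal sketch of masked prototypes and the prototype difference loss.
import torch

def prototype(features, mask, eps=1e-6):
    """Masked spatial average: one C-dimensional prototype per sample."""
    mask = mask.unsqueeze(1).float()                              # (B, 1, D, H, W)
    return (features * mask).sum(dim=(2, 3, 4)) / (mask.sum(dim=(2, 3, 4)) + eps)

def prototype_difference_loss(features, real_mask, synth_mask):
    p_real = prototype(features, real_mask)                       # (B, C)
    p_synth = prototype(features, synth_mask)                     # (B, C)
    return (p_real - p_synth).abs().sum(dim=1).mean()             # L1 distance, averaged over the batch
```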
Optimizing only $\mathcal{L}_{pd}$ neglects the intrinsic relation between class-specific features, since it only minimizes the discrepancy between the two class prototypes [1]. To this end, we develop a prototype relationship loss to maximize the consistency between relationship matrices constructed from the class prototypes and class-specific features. Since the number of voxels in the real and synthetic lesion areas can differ, we randomly sample $K$ feature vectors within each class to obtain $V_r \in \mathbb{R}^{K \times C}$ and $V_s \in \mathbb{R}^{K \times C}$, where $C$ is the number of feature channels. To measure consistency we compute the cosine similarity:

$$\operatorname{sim}(a, b) = \frac{a^{\top} b}{\lVert a \rVert_{2} \, \lVert b \rVert_{2}} \tag{12}$$

where $\lVert \cdot \rVert_{2}$ denotes the L2-norm. The prototype relationship loss is computed as:

(13)
The overall loss function for training the segmentation model is:

$$\mathcal{L} = \mathcal{L}_{seg} + \mathcal{L}_{pc}, \quad \mathcal{L}_{pc} = \lambda_{1} \mathcal{L}_{pd} + \lambda_{2} \mathcal{L}_{pr} \tag{14}$$

where $\lambda_{1}$ and $\lambda_{2}$ are weight factors, $\mathcal{L}_{pc}$ is the prototype consistency loss, comprised of the difference and relationship loss terms, and $\mathcal{L}_{seg}$ is a compound loss comprising the Dice and cross-entropy loss functions.
4 Experiments
Our framework is evaluated on two public brain lesion segmentation datasets: the ATLAS v2.0 dataset and the Shift MS segmentation dataset, described in Section 4.1. We compare our approach to existing methods (Section 4.3) to show its superiority in both the lesion synthesis and segmentation tasks. We further conduct ablation studies to validate the effectiveness of each component of our framework (Section 4.4).
Dataset | Location | Scanner | Field | Trn | Devin | Devout | Evlin
---|---|---|---|---|---|---|---
MSSEG-1 | Rennes | S Verio | 3.0T | 8 | 2 | 0 | 5
MSSEG-1 | Bordeaux | GE Disc | 3.0T | 5 | 1 | 0 | 2
MSSEG-1 | Lyon | S Aera / P Ingenia | 1.5T / 3.0T | 10 | 2 | 0 | 17
ISBI | Best | P Medical | 3.0T | 10 | 2 | 0 | 9
PubMRI | Ljubljana | S Mag | 3.0T | 0 | 0 | 25 | 0
4.1 Dataset and Preprocessing
4.1.1 ATLAS v2.0 Dataset
The ATLAS v2.0 dataset [19] is a large stroke segmentation dataset that contains 655 T1-weighted brain images collected from 33 research cohorts. All images undergo intensity standardization and are registered to the MNI-152 template (1 mm³ voxel spacing). A defacing step is applied to anonymize the scans. All lesion masks are annotated and then checked by two neurological experts. We randomly select 80% of the dataset as the training set and keep the remaining 20% for model evaluation.
4.1.2 Shift MS Dataset
The Shift MS dataset [23] is a multi-center white matter multiple sclerosis segmentation dataset comprised of MSSEG-1 [5], ISBI [3], PubMRI [18], and a private dataset collected from the University of Lausanne. We use the three public datasets for model training and evaluation since the private dataset is not publicly available. Detailed information on the dataset is shown in Table 1; the training set (Trn) is used for model training, while the in-domain development set (Devin), the out-of-domain development set (Devout), and the in-domain evaluation set (Evlin) are used only for evaluating model segmentation performance.
Each subject has two available modalities, FLAIR and T1 with contrast. In this work, we only use the FLAIR modality. All subjects have been preprocessed with image denoising, skull stripping, and bias field correction. We resample all images to 1 mm³ isotropic spacing. The ground-truth segmentation mask is determined by the consensus of annotations acquired from clinical experts.
Methods | Training Data Type | #Param | ATLAS v2.0 PSNR | ATLAS v2.0 MAE | Shifts MS PSNR | Shifts MS MAE
---|---|---|---|---|---|---
AE [33] | real | 64.11M | 31.72 | 0.0014 | 30.42 | 0.0017
AE [33] | synt | 64.11M | 32.07 | 0.0011 | 31.56 | 0.0014
AAE [22] | real | 64.41M | 30.66 | 0.0016 | 31.33 | 0.0014
AAE [22] | synt | 64.41M | 32.11 | 0.0010 | 32.84 | 0.0009
f-AnoGAN [34] | real | 78.15M | 29.86 | 0.0019 | 28.35 | 0.0023
f-AnoGAN [34] | synt | 78.15M | 30.22 | 0.0017 | 30.05 | 0.0018
DDPM [9] | real | 74.23M | 32.53 | 0.0009 | 32.38 | 0.0009
DDPM [9] | synt | 74.23M | 34.48 | 0.0007 | 34.23 | 0.0008
PNDM [21] | real | 74.23M | 32.56 | 0.0009 | 32.32 | 0.0009
PNDM [21] | synt | 74.23M | 34.87 | 0.0007 | 34.68 | 0.0007
Ours | real | 67.08M | 35.23 | 0.0006 | 34.38 | 0.0007
Ours | synt | 67.08M | 37.42 | 0.0004 | 36.57 | 0.0005
Methods | Devin DSC | Devin ASD | Devin HD95 | Devout DSC | Devout ASD | Devout HD95 | Evlin DSC | Evlin ASD | Evlin HD95
---|---|---|---|---|---|---|---|---|---
UNet [31] | 62.73±21.04 | 7.71±9.55 | 20.40±18.91 | 61.58±21.70 | 4.34±8.57 | 14.52±13.60 | 51.74±20.29 | 14.01±18.34 | 29.02±23.83
Attention UNet [25] | 72.34±13.24 | 2.46±2.77 | 12.53±16.88 | 64.80±22.98 | 3.66±8.41 | 12.83±15.65 | 65.62±15.21 | 3.34±5.07 | 13.92±16.64
SwinUNETR [37] | 66.65±14.34 | 3.54±4.51 | 18.04±17.42 | 61.39±22.29 | 4.79±8.24 | 15.27±14.60 | 57.83±13.85 | 5.38±8.24 | 20.94±20.77
MedNeXt [32] | 69.21±14.39 | 10.93±12.96 | 53.96±69.47 | 64.99±22.99 | 2.72±6.32 | 14.97±17.89 | 59.20±19.88 | 21.54±37.39 | 54.39±68.26
Ours | 0.91±1.09 | 12.26±19.37
4.2 Implementation Details
The framework is implemented in PyTorch 1.13.1. All model training and validation were performed using an NVIDIA A100 40G GPU. For the lesion generator, the input mask and image size is . We used the AdamW optimizer to train the shape and intensity models, with the learning rate set to . The batch size was 4 and the total number of training epochs was 100. For the segmentation model, we use UNet [31] as the backbone. The input patch size is and the batch size is 2. We used the AdamW optimizer for segmentation model training, with the learning rate set to and subsequently reduced with a cosine annealing strategy. A total of 500 training epochs was used for the ATLAS v2.0 dataset and 1000 for the Shift MS dataset. The loss function coefficients are set to , , for each dataset respectively; these values were chosen empirically.
4.3 Model Evaluation
We evaluated our framework on (1) its ability to generate lesions, (2) the performance of a segmentation model trained with images generated using our framework compared to other segmentation models, and (3) a comparison with other data augmentation techniques for training the segmentation model.
4.3.1 Lesion Synthesis Performance
To evaluate the effectiveness of our generative model and self-supervised training strategy, we compare the synthetic lesions generated by our framework to those of other generative models. We use the peak signal-to-noise ratio (PSNR) and mean absolute error (MAE) to evaluate the synthetic results. Structural similarity (SSIM) is not suitable for this scenario because the large proportion of background dominates the small foreground area, leading to an inaccurate representation of the actual image quality. The measures are reported in Table 2. We compare our approach to both GAN and diffusion models. For each method, the model is trained either with only real lesions or with only pre-generated image-mask pairs created using the self-supervised strategy (see Section 3.1.2). Across all models, training with pre-generated data yields superior metrics compared to training on the real dataset, underscoring the ability of appropriate synthetic data to enhance model training. Additionally, our approach achieves the highest PSNR and lowest MAE, indicating its robustness and adaptability. Notably, the improved performance of our model is not attributable to model size, as evidenced by the comparison with f-AnoGAN, which has the largest number of parameters but the lowest quantitative performance. The performance gains of our method are instead attributed to its architecture and the use of the self-supervised training strategy.
4.3.2 Downstream Segmentation Model Performance
We compare the segmentation performance achieved using our framework to that of existing segmentation models. All models were trained from scratch. Note that our framework can use any backbone model to perform the segmentation task; here we use UNet as the base segmentation model. Segmentation performance for all models was evaluated using the Dice similarity coefficient (DSC), average surface distance (ASD), and 95% Hausdorff distance (HD95). Table 4 shows the segmentation model performance for the ATLAS v2.0 dataset. Our approach consistently demonstrates improved performance for all metrics compared to all competing models. This consistent improvement is attributed to two pivotal enhancements: the use of an augmented brain lesion dataset that closely mimics real-world appearance for model training, and the incorporation of the prototype consistency loss to align model features for real and synthetic lesions. These enhancements not only elevate the base model's ability to accurately segment images but also emphasize a crucial insight: the quality of the dataset plays a more critical role than the network architecture in achieving model generalization for segmentation tasks. By leveraging high-fidelity synthetic data and strategic feature alignment, our approach effectively unleashes the potential of classical segmentation models like UNet, establishing that dataset augmentation and targeted modifications to the training loss function can improve performance for brain stroke lesion segmentation.
The quantitative metrics for model segmentation performance on the Shift MS dataset are reported in Table 3. As with ATLAS v2.0, our framework improves performance compared to the other models on the out-of-domain development set (Devout), which demonstrates that our framework improves domain generalization ability.
Fig. 6 and Fig. 7 show qualitative segmentation results for the ATLAS v2.0 and Shift MS Dataset, respectively. These results demonstrate that our framework consistently provides more accurate predictions compared to other models. Notably, in regions indicated by the yellow arrows, other models have either false positives or fail to segment the foreground. The improvements in our framework are particularly evident in the segmentation of small lesions, which are often challenging for other segmentation models to recognize.

4.3.3 Comparisons with Different Data Augmentation Methods
We compare the performance of our framework with other data augmentation methods, including voxel-based and GAN-based methods. Voxel-based data augmentation methods combine two existing images to create a new synthetic image. Here we adopted six such methods: Mixup [43], CutMix [42], Copy-Paste [6], TumorCP [41], CarveMix [44], and SelfMix [45]. For the GAN-based methods [4], a conditional GAN was adapted to generate either a deformation field (D), an intensity field (I), or a combination of both (D+I) to change the structure and/or appearance of images. Regardless of the data augmentation method, UNet was used as the segmentation model.
Quantitative segmentation model performance on the ATLAS v2.0 dataset and the Shift MS dataset is shown in Table 5 and Table 6, respectively. For the ATLAS v2.0 dataset, our framework achieves the best values for all segmentation metrics, improving DSC by 3.65, ASD by 5.99 mm, and HD95 by 3.27 mm compared to the second-best method. A similar trend is seen for the Shift MS dataset. One potential reason for these improvements is that the voxel-based data augmentation methods create unrealistic foreground areas, which may shift the decision boundary of the segmentation model and reduce its ability to generalize. For the GAN-based data augmentation methods, the generated deformation and intensity fields do not change the intrinsic characteristics of the original images, which results in models that may overfit the training data.

Methods | Type | DSC | ASD | HD95
---|---|---|---|---
Mixup [43] | Voxel | 43.75±35.41 | 23.45±20.48 | 43.76±32.54
CutMix [42] | Voxel | 47.32±32.46 | 22.71±20.03 | 40.35±31.56
Copy-Paste [6] | Voxel | 55.24±31.42 | 15.57±15.42 | 25.78±27.44
TumorCP [41] | Voxel | 55.64±32.59 | 15.43±15.08 | 24.86±26.94
CarveMix [44] | Voxel | 56.58±31.05 | 12.31±16.54 | 23.53±25.98
SelfMix [45] | Voxel | 54.13±31.24 | 16.78±17.55 | 25.97±26.88
cGAN (D) [4] | GAN | 52.79±32.23 | 20.43±22.48 | 36.72±30.23
cGAN (I) [4] | GAN | 50.66±30.54 | 21.14±23.84 | 39.46±32.36
cGAN (D+I) [4] | GAN | 51.48±31.26 | 22.58±28.56 | 37.22±31.28
Ours | Mixed | 60.23±29.48 | 6.32±13.68 | 20.26±25.81
Methods | Type | Devin DSC | Devin ASD | Devin HD95 | Devout DSC | Devout ASD | Devout HD95 | Evlin DSC | Evlin ASD | Evlin HD95
---|---|---|---|---|---|---|---|---|---|---
Mixup [43] | Voxel | 52.64±32.68 | 12.68±13.45 | 26.75±28.46 | 48.76±23.45 | 13.54±14.23 | 23.45±24.35 | 38.45±21.56 | 14.69±15.46 | 38.58±39.43
CutMix [42] | Voxel | 54.23±30.69 | 11.54±12.53 | 24.54±26.58 | 50.43±22.59 | 12.59±13.56 | 21.76±22.69 | 40.23±22.63 | 13.42±12.53 | 36.44±37.22
Copy-Paste [6] | Voxel | 60.36±15.48 | 8.42±9.64 | 22.43±25.46 | 58.46±19.53 | 11.23±10.77 | 20.69±21.34 | 43.56±23.28 | 10.85±11.49 | 25.12±26.41
TumorCP [41] | Voxel | 61.55±13.55 | 6.46±7.69 | 20.15±27.33 | 62.44±18.46 | 9.53±6.75 | 18.43±20.67 | 51.12±20.51 | 9.34±8.32 | 18.32±15.46
CarveMix [44] | Voxel | 65.28±12.47 | 5.24±5.53 | 15.63±18.42 | 63.86±17.63 | 8.75±7.53 | 15.23±20.49 | 54.53±18.24 | 6.43±7.22 | 15.44±12.75
SelfMix [45] | Voxel | 62.45±14.62 | 7.53±6.45 | 19.44±20.47 | 60.54±18.45 | 10.84±7.53 | 16.29±21.54 | 52.36±19.42 | 7.52±6.31 | 16.32±11.42
cGAN (D) [4] | GAN | 62.87±22.54 | 9.53±7.65 | 19.78±22.23 | 62.12±20.68 | 6.52±7.98 | 17.32±21.34 | 52.65±18.23 | 12.89±10.25 | 28.36±24.35
cGAN (I) [4] | GAN | 60.23±25.65 | 10.54±8.33 | 23.45±24.25 | 60.98±22.45 | 8.51±8.21 | 18.55±20.35 | 51.25±18.78 | 13.45±11.54 | 30.48±26.35
cGAN (D+I) [4] | GAN | 61.27±24.47 | 9.23±8.11 | 22.36±21.69 | 61.73±21.59 | 6.78±8.07 | 17.21±19.24 | 52.73±17.63 | 12.45±10.48 | 26.45±24.51
Ours | Mixed | 12.26±19.37
4.4 Ablation Studies
We validate the effectiveness of the three modules in our framework: synthetic lesion generation (SL), Soft Poisson Blending (SPB), and prototype consistency (PC). All ablation studies were conducted on the ATLAS v2.0 and Shift MS datasets.
4.4.1 Synthetic Lesion
We validated the performance of the segmentation model when using only synthetic lesions for training. The results are shown in the first row of Table 7 and Table 8 for the ATLAS v2.0 and Shift MS datasets, respectively. For the ATLAS v2.0 dataset, training with only synthetic lesions improves segmentation performance compared to using only real data for model training. In contrast, segmentation performance on the Shift MS dataset is worse. We believe the reason for this discrepancy is dataset size: the ATLAS v2.0 dataset contains 655 images, but the Shift MS dataset has only 33. While the foreground mismatch and boundary artifacts caused by directly inserting the synthetic lesions can increase the model's generalization, they are catastrophic for a small dataset like the Shift MS dataset, since synthetic lesions with boundary artifacts shift the segmentation model's feature distribution.
4.4.2 Soft Poisson Blending
Applying SPB to achieve a consistent appearance with real lesions and a seamless boundary improves the segmentation model performance for both datasets (the second row of Table 7 and Table 8). Our results demonstrate that resolving the inconsistent appearance of synthetic lesions improves the model performance.
4.4.3 Prototype Consistency
To address the potential feature gap between synthetic and real images, we introduced prototype consistency regularization. This penalty, applied to both real and synthetic lesions, ensures the segmentation model learns similar features for lesions regardless of origin. Results shown in the third row of Tables 7 and 8 demonstrate that applying prototype consistency regularization to the synthetic lesions alone (i.e., without SPB) already yields improved segmentation model performance. Moreover, integrating this regularization with a consistent lesion appearance further enhances segmentation performance, as evidenced in the fourth row of Table 7. The Shift MS dataset (Table 8) demonstrates a substantial improvement in segmentation performance compared to models where the consistency penalty was not employed, highlighting that feature alignment is most important for small datasets, where even a small shift in the synthetic lesion distribution can affect segmentation performance.
SL | SPB | PC | DSC | ASD | HD95
---|---|---|---|---|---
✓ | | | 54.86±29.84 | 16.96±24.59 | 34.98±36.01
✓ | ✓ | | 58.71±27.20 | 11.30±17.57 | 27.17±30.28
✓ | | ✓ | 59.38±30.80 | 18.93±72.62 | 30.00±73.03
✓ | ✓ | ✓ | 60.23±29.48 | 6.32±13.68 | 20.26±25.81
SL | SPB | PC | Devin DSC | Devin ASD | Devin HD95
---|---|---|---|---|---
✓ | | | 59.54±24.61 | 9.68±12.64 | 22.36±17.31
✓ | ✓ | | 67.16±14.26 | 5.98±6.74 | 20.06±16.99
✓ | | ✓ | 70.05±20.63 | 3.11±4.39 | 13.49±17.06
✓ | ✓ | ✓ | 78.35±8.74 | 0.97±1.09 | 8.15±12.06
SL | SPB | PC | Devout DSC | Devout ASD | Devout HD95
---|---|---|---|---|---
✓ | | | 58.48±25.06 | 6.89±9.77 | 17.75±15.71
✓ | ✓ | | 59.23±26.13 | 7.11±10.79 | 16.36±16.76
✓ | | ✓ | 59.99±24.63 | 3.30±7.16 | 15.88±17.47
✓ | ✓ | ✓ | 68.52±15.60 | 1.73±3.24 | 12.26±19.37
SL | SPB | PC | Evlin DSC | Evlin ASD | Evlin HD95
---|---|---|---|---|---
✓ | | | 42.48±23.76 | 18.42±19.71 | 33.52±23.88
✓ | ✓ | | 53.89±22.22 | 18.20±28.73 | 41.92±54.82
✓ | | ✓ | 54.55±25.35 | 6.05±7.33 | 17.30±15.15
✓ | ✓ | ✓ | 69.51±12.63 | 1.60±2.17 | 7.30±7.75
5 Conclusions & Discussions
We presented a comprehensive framework to augment existing training samples for brain lesion segmentation via a two-stage adversarial autoencoder (AAE) that generates new lesion images. The AAE is trained in a self-supervised manner, yet generates synthetic lesions with the same latent space distribution as real lesions. We then augment brain images with the synthetic lesions using Soft Poisson Blending (SPB) to create a seamless boundary between foreground and background, eliminating boundary artifacts. Finally, we introduce a prototype consistency regularisation term during segmentation model training to ensure similar features are learnt across synthetic and real lesions. The synthetic lesion samples boost segmentation model performance under the supervision of the prototype consistency penalty. Experiments on two public datasets demonstrate that our framework outperforms other data augmentation approaches and methods that only use augmented samples for model training. We do not compare our approach to models based on large-scale pre-training, such as SAM [14], because they are pre-trained on large-scale datasets, making direct comparisons unfair. Moreover, SAM-based methods largely depend on accurate user prompts to achieve good segmentation results, which differs from our fully automatic setting requiring no prompts. Currently, our framework is validated on brain lesion MRI datasets. Extending our framework to other image modalities and organs will be future work. Additionally, we will explore adding conditions to further control the lesion image synthesis process for controllable data augmentation.
Acknowledgments
This work was supported by Centre for Doctoral Training in Surgical and Interventional Engineering at King’s College London; the funding from the Wellcome Trust Award (218380/Z/19/Z) and the Wellcome/EPSRC Centre for Medical Engineering (WT203148/Z/16/Z).
References
- Asadi et al. [2023] Asadi, N., Davari, M., Mudur, S., Aljundi, R., Belilovsky, E., 2023. Prototype-sample relation distillation: towards replay-free continual learning, in: International Conference on Machine Learning, PMLR. pp. 1093–1106.
- Brock et al. [2018] Brock, A., Donahue, J., Simonyan, K., 2018. Large scale gan training for high fidelity natural image synthesis, in: International Conference on Learning Representations.
- Carass et al. [2017] Carass, A., Roy, S., Jog, A., Cuzzocreo, J.L., Magrath, E., Gherman, A., Button, J., Nguyen, J., Prados, F., Sudre, C.H., et al., 2017. Longitudinal multiple sclerosis lesion segmentation: resource and challenge. NeuroImage 148, 77–102.
- Chaitanya et al. [2021] Chaitanya, K., Karani, N., Baumgartner, C.F., Erdil, E., Becker, A., Donati, O., Konukoglu, E., 2021. Semi-supervised task-driven data augmentation for medical image segmentation. Medical Image Analysis 68, 101934.
- Commowick et al. [2018] Commowick, O., Istace, A., Kain, M., Laurent, B., Leray, F., Simon, M., Pop, S.C., Girard, P., Ameli, R., Ferré, J.C., et al., 2018. Objective evaluation of multiple sclerosis lesion segmentation using a data management and processing infrastructure. Scientific reports 8, 13650.
- Ghiasi et al. [2021] Ghiasi, G., Cui, Y., Srinivas, A., Qian, R., Lin, T.Y., Cubuk, E.D., Le, Q.V., Zoph, B., 2021. Simple copy-paste is a strong data augmentation method for instance segmentation, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 2918–2928.
- Goodfellow et al. [2014] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y., 2014. Generative adversarial nets. Advances in neural information processing systems 27.
- Henschel et al. [2020] Henschel, L., Conjeti, S., Estrada, S., Diers, K., Fischl, B., Reuter, M., 2020. Fastsurfer-a fast and accurate deep learning based neuroimaging pipeline. NeuroImage 219, 117012.
- Ho et al. [2020] Ho, J., Jain, A., Abbeel, P., 2020. Denoising diffusion probabilistic models. Advances in neural information processing systems 33, 6840–6851.
- Hu et al. [2023] Hu, Q., Chen, Y., Xiao, J., Sun, S., Chen, J., Yuille, A.L., Zhou, Z., 2023. Label-free liver tumor segmentation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7422–7432.
- Huo et al. [2023] Huo, J., Liu, Y., Ouyang, X., Granados, A., Ourselin, S., Sparks, R., 2023. Arhnet: Adaptive region harmonization for lesion-aware augmentation to improve segmentation performance, in: International Workshop on Machine Learning in Medical Imaging, Springer. pp. 377–386.
- Huo et al. [2022] Huo, J., Vakharia, V., Wu, C., Sharan, A., Ko, A., Ourselin, S., Sparks, R., 2022. Brain lesion synthesis via progressive adversarial variational auto-encoder, in: International Workshop on Simulation and Synthesis in Medical Imaging, Springer. pp. 101–111.
- Isensee et al. [2021] Isensee, F., Jaeger, P.F., Kohl, S.A., Petersen, J., Maier-Hein, K.H., 2021. nnu-net: a self-configuring method for deep learning-based biomedical image segmentation. Nature methods 18, 203–211.
- Kirillov et al. [2023] Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.Y., et al., 2023. Segment anything, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4015–4026.
- Krizhevsky et al. [2012] Krizhevsky, A., Sutskever, I., Hinton, G.E., 2012. Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems 25.
- Kuo et al. [2020] Kuo, C.W., Ma, C.Y., Huang, J.B., Kira, Z., 2020. Featmatch: Feature-based augmentation for semi-supervised learning, in: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XVIII 16, Springer. pp. 479–495.
- Lee and Cho [2023] Lee, H.s., Cho, H.c., 2023. Improving classification performance in gastric disease through realistic data augmentation technique based on poisson blending. Journal of Electrical Engineering & Technology , 1–8.
- Lesjak et al. [2018] Lesjak, Ž., Galimzianova, A., Koren, A., Lukin, M., Pernuš, F., Likar, B., Špiclin, Ž., 2018. A novel public mr image dataset of multiple sclerosis patients with lesion segmentations based on multi-rater consensus. Neuroinformatics 16, 51–63.
- Liew et al. [2022] Liew, S.L., Lo, B.P., Donnelly, M.R., Zavaliangos-Petropulu, A., Jeong, J.N., Barisano, G., Hutton, A., Simon, J.P., Juliano, J.M., Suri, A., et al., 2022. A large, curated, open-source stroke neuroimaging dataset to improve lesion segmentation algorithms. Scientific data 9, 320.
- Liu et al. [2021] Liu, C., Wang, Z., Wang, S., Tang, T., Tao, Y., Yang, C., Li, H., Liu, X., Fan, X., 2021. A new dataset, poisson gan and aquanet for underwater object grabbing. IEEE Transactions on Circuits and Systems for Video Technology 32, 2831–2844.
- Liu et al. [2022] Liu, L., Ren, Y., Lin, Z., Zhao, Z., 2022. Pseudo numerical methods for diffusion models on manifolds, in: International Conference on Learning Representations.
- Makhzani et al. [2015] Makhzani, A., Shlens, J., Jaitly, N., Goodfellow, I., Frey, B., 2015. Adversarial autoencoders. arXiv preprint arXiv:1511.05644 .
- Malinin et al. [2022] Malinin, A., Athanasopoulos, A., Barakovic, M., Cuadra, M.B., Gales, M.J., Granziera, C., Graziani, M., Kartashev, N., Kyriakopoulos, K., Lu, P.J., et al., 2022. Shifts 2.0: Extending the dataset of real distributional shifts. arXiv preprint arXiv:2206.15407 .
- Nie et al. [2017] Nie, D., Trullo, R., Lian, J., Petitjean, C., Ruan, S., Wang, Q., Shen, D., 2017. Medical image synthesis with context-aware generative adversarial networks, in: Medical Image Computing and Computer Assisted Intervention- MICCAI 2017: 20th International Conference, Quebec City, QC, Canada, September 11-13, 2017, Proceedings, Part III 20, Springer. pp. 417–425.
- Oktay et al. [2018] Oktay, O., Schlemper, J., Folgoc, L.L., Lee, M., Heinrich, M., Misawa, K., Mori, K., McDonagh, S., Hammerla, N.Y., Kainz, B., et al., 2018. Attention u-net: Learning where to look for the pancreas. arXiv preprint arXiv:1804.03999 .
- Pereira et al. [2016] Pereira, S., Pinto, A., Alves, V., Silva, C.A., 2016. Brain tumor segmentation using convolutional neural networks in mri images. IEEE transactions on medical imaging 35, 1240–1251.
- Pérez et al. [2023] Pérez, P., Gangnet, M., Blake, A., 2023. Poisson image editing, in: Seminal Graphics Papers: Pushing the Boundaries, Volume 2, pp. 577–582.
- Perlin [1985] Perlin, K., 1985. An image synthesizer. ACM Siggraph Computer Graphics 19, 287–296.
- Pinto et al. [2022] Pinto, F., Yang, H., Lim, S.N., Torr, P., Dokania, P., 2022. Using mixup as a regularizer can surprisingly improve accuracy & out-of-distribution robustness. Advances in Neural Information Processing Systems 35, 14608–14622.
- Rombach et al. [2022] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B., 2022. High-resolution image synthesis with latent diffusion models, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 10684–10695.
- Ronneberger et al. [2015] Ronneberger, O., Fischer, P., Brox, T., 2015. U-net: Convolutional networks for biomedical image segmentation, in: Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18, Springer. pp. 234–241.
- Roy et al. [2023] Roy, S., Koehler, G., Ulrich, C., Baumgartner, M., Petersen, J., Isensee, F., Jaeger, P.F., Maier-Hein, K.H., 2023. Mednext: transformer-driven scaling of convnets for medical image segmentation, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer. pp. 405–415.
- Rumelhart et al. [1986] Rumelhart, D.E., Hinton, G.E., Williams, R.J., 1986. Learning representations by back-propagating errors. nature 323, 533–536.
- Schlegl et al. [2019] Schlegl, T., Seeböck, P., Waldstein, S.M., Langs, G., Schmidt-Erfurth, U., 2019. f-anogan: Fast unsupervised anomaly detection with generative adversarial networks. Medical image analysis 54, 30–44.
- Snell et al. [2017] Snell, J., Swersky, K., Zemel, R., 2017. Prototypical networks for few-shot learning. Advances in neural information processing systems 30.
- Tan et al. [2021] Tan, J., Hou, B., Day, T., Simpson, J., Rueckert, D., Kainz, B., 2021. Detecting outliers with poisson image interpolation, in: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part V 24, Springer. pp. 581–591.
- Tang et al. [2022] Tang, Y., Yang, D., Li, W., Roth, H.R., Landman, B., Xu, D., Nath, V., Hatamizadeh, A., 2022. Self-supervised pre-training of swin transformers for 3d medical image analysis, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 20730–20740.
- Wang et al. [2022] Wang, H., Zhou, Y., Zhang, J., Lei, J., Sun, D., Xu, F., Xu, X., 2022. Anomaly segmentation in retinal images with poisson-blending data augmentation. Medical Image Analysis 81, 102534.
- Wang et al. [2019] Wang, K., Liew, J.H., Zou, Y., Zhou, D., Feng, J., 2019. Panet: Few-shot image semantic segmentation with prototype alignment, in: proceedings of the IEEE/CVF international conference on computer vision, pp. 9197–9206.
- Xu et al. [2022] Xu, Z., Wang, Y., Lu, D., Yu, L., Yan, J., Luo, J., Ma, K., Zheng, Y., Tong, R.K.y., 2022. All-around real label supervision: Cyclic prototype consistency learning for semi-supervised medical image segmentation. IEEE Journal of Biomedical and Health Informatics 26, 3174–3184.
- Yang et al. [2021] Yang, J., Zhang, Y., Liang, Y., Zhang, Y., He, L., He, Z., 2021. Tumorcp: A simple but effective object-level data augmentation for tumor segmentation, in: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part I 24, Springer. pp. 579–588.
- Yun et al. [2019] Yun, S., Han, D., Oh, S.J., Chun, S., Choe, J., Yoo, Y., 2019. Cutmix: Regularization strategy to train strong classifiers with localizable features, in: Proceedings of the IEEE/CVF international conference on computer vision, pp. 6023–6032.
- Zhang et al. [2018] Zhang, H., Cisse, M., Dauphin, Y.N., Lopez-Paz, D., 2018. mixup: Beyond empirical risk minimization, in: International Conference on Learning Representations.
- Zhang et al. [2023] Zhang, X., Liu, C., Ou, N., Zeng, X., Zhuo, Z., Duan, Y., Xiong, X., Yu, Y., Liu, Z., Liu, Y., et al., 2023. Carvemix: a simple data augmentation method for brain lesion segmentation. NeuroImage 271, 120041.
- Zhu et al. [2022] Zhu, Q., Wang, Y., Yin, L., Yang, J., Liao, F., Li, S., 2022. Selfmix: a self-adaptive data augmentation method for lesion segmentation, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer. pp. 683–692.