LeFusion: Controllable Pathology Synthesis via Lesion-Focused Diffusion Models
Abstract
Patient data from real-world clinical practice often suffers from data scarcity and long-tail imbalances, leading to biased outcomes or algorithmic unfairness. This study addresses these challenges by generating lesion-containing image-segmentation pairs from lesion-free images. Previous efforts in medical imaging synthesis have struggled with separating lesion information from background, resulting in low-quality backgrounds and limited control over the synthetic output. Inspired by diffusion-based image inpainting, we propose LeFusion, a lesion-focused diffusion model. By redesigning the diffusion learning objectives to focus on lesion areas, we simplify the learning process and improve control over the output while preserving high-fidelity backgrounds by integrating forward-diffused background contexts into the reverse diffusion process. Additionally, we tackle two major challenges in lesion texture synthesis: 1) multi-peak and 2) multi-class lesions. We introduce two effective strategies: histogram-based texture control and multi-channel decomposition, enabling the controlled generation of high-quality lesions in difficult scenarios. Furthermore, we incorporate lesion mask diffusion, allowing control over lesion size, location, and boundary, thus increasing lesion diversity. Validated on 3D cardiac lesion MRI and lung nodule CT datasets, LeFusion-generated data significantly improves the performance of state-of-the-art segmentation models, including nnUNet and SwinUNETR. Code and model are available at https://github.com/M3DV/LeFusion.
1 Introduction
The development of AI for healthcare often suffers from data scarcity (Ibrahim et al., 2021; Schäfer et al., 2024). In most biomedical scenarios, the number of pathological subjects is significantly lower than that of normal ones. This discrepancy primarily arises from the naturally occurring distribution of patient data, which frequently exhibits long-tail characteristics (Yang et al., 2022; Zhang et al., 2023). Additionally, potential biases in data collection can introduce issues related to algorithmic fairness (Xu et al., 2022; Chen et al., 2023; Yang et al., 2024), as well as concerns about security and privacy (Price & Cohen, 2019; Qayyum et al., 2020). As a result, it has been argued that “synthetic data can be better than real data” (Savage, 2023).
Generative lesion synthesis is a promising approach to generating diverse medical data, benefiting many medical applications (Khader et al., 2023). By learning from lesion-containing data, generative models can synthesize various types of lesions, which in turn benefit downstream applications (Han et al., 2019; Jin et al., 2021; Yang et al., 2019; Lyu et al., 2022a; Shin et al., 2018; Pishva et al., 2023; Du et al., 2023; Lyu et al., 2022b). While a range of generative methods have been explored, they often struggle to preserve high-quality backgrounds outside of the lesion areas. This is because generating anatomically correct backgrounds in the human body is far more challenging than synthesizing isolated lesions. Moreover, these methods often lack control over key aspects of lesion generation, including texture type, size, location, and mask alignment. These issues can severely degrade the performance of downstream applications, such as segmentation algorithms trained using such synthetic data. Fig. 1a illustrates standard diffusion models as an example.
One approach to avoiding the complexity of background generation is to start from readily available normal scans and synthesize lesions into them. This involves generating lesion masks and filling them with appropriate textures, ensuring perfect background preservation and mask alignment, while also allowing precise control over lesion size and location. In this paper, we refer to this method as background-preserving lesion synthesis. This approach has led to a resurgence of hand-crafted methods, which have been used to model COVID-19 lesions (Yao et al., 2021) and liver tumors (Hu et al., 2023). However, these methods rely on heuristics that do not generalize well.
Inspired by diffusion-based image inpainting schemes (Lugmayr et al., 2022; Avrahami et al., 2022), it has been shown that explicitly integrating real background context during the diffusion process ensures realistic background preservation outside lesion masks. Rather than using the background as conditional inputs (Rombach et al., 2022), these methods directly incorporate forward-diffused background contexts into the reverse diffusion process. However, these approaches are training-free and do not focus on specific inpainted content. In contrast, our study emphasizes lesion generation. In data-limited scenarios, it is more efficient for the model to focus solely on lesion synthesis.

To this end, we propose LeFusion, a lesion-focused diffusion model (Fig. 1b and Fig. 1c). We redesign the diffusion learning objectives to focus solely on lesion data. Similar to diffusion-based inpainting, the input combines forward-diffused backgrounds with reverse-diffused foregrounds, while the model reconstructs only the lesion, avoiding the need to allocate capacity to learning complex backgrounds. Furthermore, we address two major unresolved challenges in lesion synthesis: 1) multi-peak lesions, where lesions have distinct types, and 2) multi-class lesions, where multiple classes of lesions need to be generated simultaneously. For the multi-peak challenge, we introduce histogram-based texture control, integrating lesion texture histograms during training as a condition, which allows control over lesion types during inference. We find that explicitly controlling the histogram is crucial when generating lesions on normal scans; otherwise, the model tends to produce lesions biased toward healthy appearances. Notably, this histogram-based control method is generic and does not require any additional information beyond image-mask pairs. To handle the multi-class challenge, we propose a strategy for joint modeling of multi-class lesions through multi-channel decomposition, where the diffusion model generates different lesion classes via separate channels and then combines them into a single image. Finally, we introduce lesion mask diffusion, enabling control over size, location and boundary, thereby increasing the diversity of lesion masks.
We validate LeFusion on 3D lung nodule CT datasets (Armato III et al., 2011) and cardiac lesion MRI (Lalande et al., 2022), demonstrating its effectiveness in addressing both the multi-peak and multi-class challenges while generating high-quality synthetic lesions. In downstream segmentation tasks, we show that LeFusion-generated data significantly enhances the performance of state-of-the-art models such as nnUNet (Isensee et al., 2021) and SwinUNETR (Hatamizadeh et al., 2021).
2 Related Work
2.1 Generative Models for Lesion Synthesis
Lesion synthesis using generative models has garnered significant attention for its potential to create diverse and realistic medical datasets, particularly in addressing the scarcity and imbalance of pathological data in biomedical applications. Early approaches have employed variational autoencoders (VAEs) (Kingma & Welling, 2013), generative adversarial networks (GANs) (Goodfellow et al., 2020), and more recently, diffusion models (Ho et al., 2020) to generate synthetic lesions across various medical imaging modalities and applications, including lung nodules (Han et al., 2019; Jin et al., 2021; Yang et al., 2019) and COVID-19 lesions (Lyu et al., 2022a) in CT scans, colon polyps in colonoscopy (Shin et al., 2018; Pishva et al., 2023; Du et al., 2023), tumor cells in microscopy (Horvath et al., 2022), brain tumors in MRI (Billot et al., 2023), diabetic lesions in retinal images (Wang et al., 2022), and synthetic liver tumors (Lyu et al., 2022b).
However, a persistent challenge for these methods is preserving anatomically accurate backgrounds alongside the lesions. In medical imaging, the background must respect the anatomical structure of the human body, which makes generating realistic backgrounds significantly more difficult than synthesizing isolated lesions, particularly in large-scale 3D images. While recent studies (Hamamci et al., 2024; Peng et al., 2024) have begun to address large-scale 3D medical image generation, these methods often require significant computational resources and extensive data.
Another limitation of current methods is the lack of explicit control when generating image-mask pairs. Typically, both the lesion and its corresponding mask are generated simultaneously, without explicit constraints linking the two. This results in limited control over key lesion properties, such as texture, size, location, and alignment between the image and mask. The absence of high-quality background preservation and fine control over these properties hinders the scalability and effectiveness of these methods, negatively impacting the performance of downstream tasks such as segmentation.
2.2 Background-Preserving Lesion Synthesis
In clinical practice, normal scans (either whole or partial) are far more abundant than pathological ones. For instance, in lung nodule cases, most pathological scans contain only a single lesion, yet traditional lesion synthesis methods often focus on small crops around the lesion, utilizing only a tiny fraction of the original data (a lesion-centered cube crop covers a very small share of the total voxels in a full scan) and leaving large portions of the normal background unused. This raises the question of whether it is necessary to rely on deep models to generate normal backgrounds.
Background-preserving lesion synthesis addresses this issue by starting from normal scans and synthesizing lesions through filling textures into manually generated lesion masks. This approach separates the generation of lesion masks and textures, allowing for finer control over lesions while maintaining the original background structure. Prior work has predominantly relied on heuristic-based methods (Yao et al., 2021; Hu et al., 2023), leveraging the abundance of normal backgrounds to generate lesions of varying sizes and textures at different locations. These studies have demonstrated that the generated data can significantly benefit downstream segmentation tasks.
However, these hand-crafted approaches rely heavily on manual adjustments to ensure that the generated lesions resemble real-world pathology. For example, Hu et al. (2023) manually set grayscale values for texture synthesis, and lesion shapes are generated using morphological operations on ellipsoidal masks. Such hand-crafted rules limit the scalability and generalizability of these methods. In more complex scenarios, such as multi-peak and texture-rich lung nodules or multi-class cardiac lesions, these methods tend to fail (see Sec. 4.2). This highlights the need for more flexible and robust data-driven techniques to address these challenges.
A recent study (Chen et al., 2024) uses conditional diffusion (Ho et al., 2020; Rombach et al., 2022) to synthesize abdominal tumors, as illustrated in Fig. 1a, building on background-preserving lesion synthesis. The image is first encoded with VQGAN (Esser et al., 2021), and a latent diffusion model (Rombach et al., 2022) learns both the background and the lesion, using the lesion mask and background (excluding the lesion) as conditional inputs. However, as the model still needs to generate the background, the preservation of background integrity cannot be theoretically guaranteed. While it is possible to fill the area outside the mask with real background data, this may lead to inconsistencies in the final output. From our findings, this approach struggles with high-quality background generation, affecting downstream applications (Sec. 4.2), particularly in data-limited scenarios. We also tested a similar image-space conditional diffusion model, which showed similar limitations, though image-space diffusion outperformed latent diffusion, since the latter is further constrained by its autoencoder.
A few concurrent studies (Lai et al., 2024; Wu et al., 2024; Zhu et al., 2024) have employed advanced generative models for lesion synthesis. Due to differences in research focus and/or the unavailability of their code/models, a comprehensive comparison could not be conducted. This study primarily focuses on diffusion-based approaches within the context of background-preserving lesion synthesis.
3 LeFusion: Lesion-Focused Diffusion Model
We propose LeFusion, a lesion-focused diffusion model that concentrates solely on lesions. In Sec. 3.1, by combining forward-diffused backgrounds with reverse-diffused foregrounds, LeFusion reconstructs only the lesion, eliminating the need to model complex backgrounds. To address key challenges in lesion texture synthesis (Sec. 3.2), LeFusion introduces histogram-based texture control for multi-peak lesions, allowing control over distinct lesion types, and a multi-channel decomposition strategy for multi-class lesions, where classes are generated in separate channels and combined into a single image. In Sec. 3.3, lesion mask diffusion enables control over lesion size, location, and boundary, increasing mask diversity and enhancing lesion synthesis quality and flexibility.
3.1 Background-Preserving Generation via Inpainting

Decoupled Lesion and Background Generation.
Inspired by diffusion-based inpainting (Lugmayr et al., 2022; Avrahami et al., 2022), we aim to decouple the generation of lesions and background. As shown in Fig. 2, inpainting predicts the missing parts of an image, in our case the lesions. For standard diffusion models (Sohl-Dickstein et al., 2015; Ho et al., 2020), the inference process starts by sampling a noise vector $x_T \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ and gradually denoising it to produce the output image $x_0$. We focus on 3D images only in this study, while the method can be easily extended to 2D. The original grayscale image is denoted as $x \in \mathbb{R}^{D \times H \times W}$, where $D$, $H$, and $W$ represent the 3D image size. The reverse diffusion step from $x_t$ to $x_{t-1}$ is decoupled into lesion and background components, as shown below. Here, $M_f$ and $M_b = 1 - M_f$ represent the lesion foreground mask and background mask. The lesion part $x^{\text{fore}}_{t-1}$ is predicted through the reverse diffusion process $p_\theta(x_{t-1} \mid x_t)$ using a 3D U-Net, while the background part $x^{\text{back}}_{t-1}$ is derived from the original image $x$ by adding noise through forward diffusion $q(x_{t-1} \mid x)$.
$$x_{t-1} = M_f \odot x^{\text{fore}}_{t-1} + M_b \odot x^{\text{back}}_{t-1} \qquad (1)$$
This approach preserves the background accurately without requiring prediction.
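To make the decoupled reverse step concrete, below is a minimal PyTorch sketch of Eq. 1, assuming a standard DDPM noise schedule (`betas`, `alphas_cumprod`) and a noise-prediction network callable as `model(x_t, t)`; the function and variable names are illustrative and not taken from the released LeFusion code.

```python
import torch

def inpaint_reverse_step(x_t, x_real, mask_f, t, model, betas, alphas_cumprod):
    """One background-preserving reverse step (a sketch of Eq. 1).

    x_t     : current noisy volume, shape (B, 1, D, H, W)
    x_real  : original clean volume providing the background
    mask_f  : binary lesion-foreground mask, same shape as x_t
    model   : 3D U-Net predicting the noise eps_theta(x_t, t)
    betas, alphas_cumprod : 1D tensors of the DDPM noise schedule
    """
    beta_t, alpha_t, alpha_bar_t = betas[t], 1.0 - betas[t], alphas_cumprod[t]

    # Foreground: standard DDPM denoising step driven by the network prediction.
    eps = model(x_t, t)
    mean = (x_t - beta_t / torch.sqrt(1.0 - alpha_bar_t) * eps) / torch.sqrt(alpha_t)
    noise = torch.randn_like(x_t) if t > 0 else torch.zeros_like(x_t)
    x_fore = mean + torch.sqrt(beta_t) * noise

    # Background: forward-diffuse the *real* image to the matching noise level t-1,
    # so voxels outside the lesion mask stay consistent with the input scan.
    alpha_bar_prev = alphas_cumprod[t - 1] if t > 0 else torch.tensor(1.0)
    x_back = torch.sqrt(alpha_bar_prev) * x_real + \
             torch.sqrt(1.0 - alpha_bar_prev) * torch.randn_like(x_real)

    # Eq. 1: predicted lesion inside the mask, real (noised) background outside.
    return mask_f * x_fore + (1.0 - mask_f) * x_back
```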
Making Diffusion Model Lesion-Focused.
The method above, while applicable to any diffusion-based inpainting, cannot guarantee that the denoised output is focused on the lesion area. Unlike standard diffusion training, where the target region may vary, we know that the part to be inpainted is specifically the lesion in our application. Therefore, we can design the model to predict only the lesion and ignore other regions. To achieve this, we introduce a lesion-focused loss during training. The general diffusion process starts with the original image $x_0$, adding Gaussian noise over $T$ time steps (forward diffusion). The neural network $\epsilon_\theta$ is trained to predict the noise at time step $t$ (reverse diffusion), conditioned on the noised image $x_t$ and the time step. To ensure the model focuses only on the lesion, the mask $M_f$ is applied, calculating the loss exclusively within the lesion region. The training objective is defined as follows, where $\epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ represents the noise sampled from a Gaussian distribution:
$$\mathcal{L}_{\text{lesion}} = \mathbb{E}_{x_0, \epsilon, t}\!\left[\left\| M_f \odot \big(\epsilon - \epsilon_\theta(x_t, t)\big) \right\|^2\right] \qquad (2)$$
Despite the change in the training objective, inpainting inference remains unaffected. As shown in Eq. 1, outside $M_f$, the predicted lesion is replaced by the real noised background $x^{\text{back}}_{t-1}$.
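A minimal sketch of the lesion-focused training objective in Eq. 2 is shown below, again assuming a DDPM schedule and a 3D noise predictor `model(x_t, t)`; the masked-averaging detail is an assumption.

```python
import torch

def lesion_focused_loss(model, x0, mask_f, alphas_cumprod):
    """Training objective of Eq. 2: the noise-prediction error is computed
    only inside the lesion mask, so capacity is spent on the lesion alone."""
    b = x0.shape[0]
    t = torch.randint(0, len(alphas_cumprod), (b,), device=x0.device)
    a_bar = alphas_cumprod[t].view(b, 1, 1, 1, 1)

    eps = torch.randn_like(x0)                                    # forward-diffusion noise
    x_t = torch.sqrt(a_bar) * x0 + torch.sqrt(1.0 - a_bar) * eps  # noised image

    eps_pred = model(x_t, t)                                      # reverse-process prediction
    masked_sq_err = ((eps - eps_pred) ** 2) * mask_f              # only lesion voxels contribute
    return masked_sq_err.sum() / mask_f.sum().clamp(min=1.0)
```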
3.2 Fine Control of Lesion Textures
Handling Multi-Peak Distributed Lesions.

In the section above, the model relies on the noised background to infer the texture of the lesion within the mask. While this works for lesions with minimal texture differences (as with cardiac lesions), it becomes problematic with multi-peak data. As shown in Fig. 3, lung nodules exhibit distinct texture types. We empirically show that relying solely on the background can lead to mode collapse.
To address this, we propose a simple yet effective approach, histogram-based texture control. The lesion texture histogram $h$ is used as a condition via cross-attention (Rombach et al., 2022), i.e.,
$$\mathcal{L}_{\text{lesion}} = \mathbb{E}_{x_0, \epsilon, t}\!\left[\left\| M_f \odot \big(\epsilon - \epsilon_\theta(x_t, t, h)\big) \right\|^2\right] \qquad (3)$$
During training, the histogram is computed from the ground truth, and during inference, texture types can be controlled by adjusting the histogram. Notably, this approach requires no additional lesion type annotations, such as nodule attenuation.
In Sec. 4, we demonstrate that the proposed histogram-based texture control is crucial for generating lesions on normal scans. Without it, models tend to fail, biasing towards healthy appearances and producing overly subtle lesions, which degrades the performance of downstream segmentation tasks using these synthetic data.
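As an illustration, the histogram condition can be computed directly from an image-mask pair as sketched below; the bin count, intensity range, and the `cond=` keyword in the usage comment are assumptions, not the released interface.

```python
import torch

def lesion_histogram(x0, mask_f, bins=16, value_range=(-1.0, 1.0)):
    """Normalized intensity histogram of the lesion region: the only extra
    condition needed for histogram-based texture control (no type labels)."""
    lesion_voxels = x0[mask_f.bool()]
    hist = torch.histc(lesion_voxels, bins=bins, min=value_range[0], max=value_range[1])
    return hist / hist.sum().clamp(min=1.0)

# During training the histogram comes from the ground-truth lesion; at inference,
# a histogram taken from any reference lesion (e.g. a solid nodule) can be passed
# to steer the generated texture, e.g.:
#   eps_pred = model(x_t, t, cond=lesion_histogram(ref_image, ref_mask))
```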
Joint Modeling of Multi-Class Lesions.
The method above focuses on single-class lesions, but many medical applications, such as cardiac MRI, require modeling multiple lesion types, like myocardial infarction (MI) and persistent microvascular obstruction (PMO). To capture correlations between lesion types and generate textures for multiple lesions simultaneously, we use a joint modeling strategy called multi-channel decomposition, where each channel corresponds to a different lesion type. The diffusion model generates each lesion in its respective channel, and they are combined through lesion masks.
We expand the input image from $x \in \mathbb{R}^{D \times H \times W}$ to $x \in \mathbb{R}^{C \times D \times H \times W}$, based on the number of lesion classes $C$, where $C = 2$ in the cardiac MRI experiments. Similarly, the lesion mask $M_f$ and the noise $\epsilon$ are extended to $C$ channels. The training objective is:
$$\mathcal{L}_{\text{lesion}} = \mathbb{E}_{x_0, \epsilon, t}\!\left[\sum_{c=1}^{C}\left\| M^{(c)}_f \odot \big(\epsilon^{(c)} - \epsilon^{(c)}_\theta(x_t, t)\big) \right\|^2\right] \qquad (4)$$
where $(\cdot)^{(c)}$ refers to the $c$-th channel of the corresponding tensor. To combine these channels into a single image, we compute $x = M_b \odot x^{\text{back}} + \sum_{c=1}^{C} M^{(c)}_f \odot x^{(c)}$.
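A sketch of the recomposition step under the multi-channel decomposition is given below; the exact blending of overlapping masks is an assumption (here the per-class lesions are summed inside their masks and the real background fills the rest).

```python
import torch

def combine_channels(x_pred, masks_f, x_background):
    """Recompose one volume from per-class channels.

    x_pred       : (B, C, D, H, W) generated volume, one channel per lesion class
    masks_f      : (B, C, D, H, W) binary masks, one per lesion class
    x_background : (B, 1, D, H, W) original lesion-free background
    """
    lesion = (masks_f * x_pred).sum(dim=1, keepdim=True)      # class-wise lesions
    outside = 1.0 - masks_f.max(dim=1, keepdim=True).values   # complement of the mask union
    return lesion + outside * x_background                    # real background elsewhere
```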
3.3 Diversifying Lesion Masks via DiffMask

To further enhance the controllability and diversity of lesion generation, we introduce lesion mask diffusion (DiffMask). As shown in Fig. 4, to achieve fine control over lesion size, location, and boundary, we propose two key designs: the boundary mask and the control sphere. The former removes areas outside the boundary at each diffusion step, ensuring the generated mask stays within reasonable limits, while the latter manages the size and location of the lesion. During training, the control sphere is the bounding sphere of a real mask and is concatenated as a condition to the DiffMask input, with the real mask serving as the target. In inference, users can adjust the size and location to generate the desired lesion masks.
In terms of implementation, the architecture of DiffMask is similar to the texture diffusion model, also employing multi-channel decomposition to capture shape correlations and spatial distributions between multiple lesion masks. Each output channel is responsible for generating the lesion shape mask of a specific lesion. Finally, we apply a smoothing kernel as a post-processing step.
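The control-sphere condition of DiffMask can be built as in the sketch below; the concatenation call in the comment is a hypothetical illustration of how the sphere (together with the boundary mask) could condition the mask diffusion model.

```python
import torch

def control_sphere(shape, center, radius):
    """Binary control-sphere volume used to condition DiffMask. During training it
    is the bounding sphere of a real mask; at inference the user picks `center`
    and `radius` to steer lesion location and size."""
    d, h, w = shape
    zz, yy, xx = torch.meshgrid(
        torch.arange(d), torch.arange(h), torch.arange(w), indexing="ij")
    dist2 = (zz - center[0]) ** 2 + (yy - center[1]) ** 2 + (xx - center[2]) ** 2
    return (dist2 <= radius ** 2).float()

# Hypothetical conditioning: the sphere is concatenated to the noisy mask input,
# and a boundary mask zeroes out-of-bounds voxels at every diffusion step.
#   sphere = control_sphere((32, 64, 64), center=(16, 32, 32), radius=10)
#   mask_input = torch.cat([mask_t, sphere[None, None]], dim=1)
```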
4 Experiments
4.1 Setup
Dataset.
LIDC: Multi-Peak Lung Nodule CT.
We use the LIDC dataset (Armato III et al., 2011), which contains 1,010 chest CT scans, from which 2,624 pathological (P) regions of interest (ROIs) corresponding to lung nodules were extracted, along with 135 cases of healthy patients. The dataset was divided into an 808-case training set, comprising 2,104 lung nodule ROIs, and a 202-case test set, containing 520 lung nodule ROIs. Additionally, 3,076 normal (N) ROIs were cropped from the 135 healthy patients, representing regions where lung nodules typically appear. These normal ROIs were used for data augmentation in the experiments.
Emidec: Multi-Class Cardiac Lesion MRI. The Emidec dataset (Lalande et al., 2022) consists of delayed-enhancement MRI (DE-MRI) examinations in a short-axis orientation. This dataset offers access to 100 labeled cases, including 33 normal (N) and 67 pathological (P). The annotations cover 5 classes: background, left ventricle (LV), myocardium (Myo), myocardial infarction (MI), and persistent microvascular obstruction (PMO). We split the 67 P cases into 57 for training and 10 for testing. The 57 P cases are used to train the data synthesis model. In the downstream evaluation (Sec. 4.2), we use these models to synthesize P cases from both the 57 P and the 33 N cases for the training set.
Method Comparison.
The following synthesis algorithms are compared with LeFusion.
Copy-Paste. We used the masks from real lesion data and matched them with normal data, copying the original lesion textures onto normal cases to generate new synthetic data.
Hand-Crafted (Hu et al., 2023). The lesion mask is represented by the overlap of multiple ellipsoidal lesion masks, followed by several random morphological operations. The texture is approximated using Gaussian noise and softened through interpolation and Gaussian filtering (a minimal sketch is given after this method list).
RePaint (Lugmayr et al., 2022) or Blended-Diffusion (Avrahami et al., 2022). This inference-stage method removes the lesion region from the image and then fills in the lesion texture by combining forward-diffused backgrounds with reverse-diffused foregrounds. The model is trained with a global loss, which lacks the capability to focus on lesion information. Without guidance from lesion category information, it cannot target specific lesion classes and is limited to generating single-class lesions.
Cond-Diffusion (Ho et al., 2020; Rombach et al., 2022). This method uses the lesion mask and background image information as conditional inputs (Rombach et al., 2022) to a diffusion model (Ho et al., 2020). However, a downside of this approach is that it disrupts the background information. Furthermore, directly using multiple masks as conditional inputs fails to control the corresponding categories, limiting the method to generating single-class lesions.
Cond-Diffusion (L) (Chen et al., 2024). Cond-Diffusion (L) is conceptually a latent diffusion (Rombach et al., 2022) version of Cond-Diffusion, adding VQGAN (Esser et al., 2021) to map images into a latent space for diffusion. For a fair comparison, we fine-tuned the open-source code and pre-trained weights from Chen et al. (2024) and used the model outputs directly.
LeFusion and the Variants (Ours). Apart from standard LeFusion, there are two variants for fine control of lesion textures. Histogram-Based Texture Control (*-H): a variant of LeFusion that incorporates histogram control information, using the input histogram to guide the generation of multi-peak lesion textures. Multi-Channel Decomposition (*-J): when handling multi-class lesions, standard LeFusion trains individual models separately, lacking modeling of the correlation between different lesion classes. LeFusion-J is a generalized version that models multi-class lesions jointly.
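For reference, the hand-crafted baseline described above can be sketched roughly as follows; the intensity parameters and smoothing strength are illustrative placeholders, not the values used by Hu et al. (2023).

```python
import numpy as np
from scipy import ndimage

def handcrafted_lesion(background, center, radii, mean_val=-300.0, std_val=60.0, sigma=1.5):
    """Minimal sketch of the hand-crafted baseline: an ellipsoidal mask filled
    with Gaussian noise, softened by Gaussian filtering."""
    d, h, w = background.shape
    zz, yy, xx = np.ogrid[:d, :h, :w]
    mask = (((zz - center[0]) / radii[0]) ** 2 +
            ((yy - center[1]) / radii[1]) ** 2 +
            ((xx - center[2]) / radii[2]) ** 2) <= 1.0
    texture = np.random.normal(mean_val, std_val, size=background.shape)
    texture = ndimage.gaussian_filter(texture, sigma=sigma)  # soften the noise
    out = background.copy()
    out[mask] = texture[mask]                                 # paste inside the ellipsoid
    return out, mask
```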
4.2 Improving Segmentation with Synthetic Data
Lung Nodule Segmentation.
We show that LeFusion can effectively benefit the downstream application of training nnUNet (Isensee et al., 2021) and SwinUNETR (Hatamizadeh et al., 2021) to perform lung nodule segmentation. We use the following synthetic subset settings: P’: synthetic ROIs generated from the 808 real pathological cases P; N’: synthetic samples generated from the 135 normal subjects N; N”: a larger set of synthetic samples generated from the same 135 normal subjects N.
Tab. 1 shows the Dice and normalized surface distance (NSD) scores. For the first group (Texture Synthesis with Real Masks), the textures produced by Hand-Crafted (Hu et al., 2023) and RePaint (Lugmayr et al., 2022) differ significantly from real textures, making it difficult to achieve satisfactory results. Cond-Diffusion (Ho et al., 2020; Rombach et al., 2022) and Cond-Diffusion (L) (Chen et al., 2024), on the other hand, disrupt the background structure of the generated images. Our baseline model, LeFusion, is affected by the pixel distribution of the background due to the lack of histogram control information, yielding only a slight improvement over the baseline. We also synthesized lesion data on normal data N using hand-crafted synthetic masks and masks copied from lesion data, arriving at similar conclusions. In the final set of experiments, we validated the effectiveness of the proposed DiffMask for lesion synthesis and further explored accuracy gains with increasing amounts of synthetic data. Compared to the baseline, we achieved Dice improvements of 5.18% and 4.75% for nnUNet (Isensee et al., 2021) and SwinUNETR (Hatamizadeh et al., 2021), respectively, and NSD improvements of 4.4% and 4.53%.
Methods | Training Setting | nnU-Net (2021) Dice (↑) | nnU-Net (2021) NSD (↑) | SwinUNETR (2021) Dice (↑) | SwinUNETR (2021) NSD (↑)
Baseline | P | 78.26 | 88.90 | 78.38 | 88.67 |
Texture Synthesis with Real Masks on P’ | |||||
Hand-Crafted (2023) | P+P’ | 76.80 | 87.94 | 76.11 | 86.31 |
Cond-Diffusion (2020; 2022) | P+P’ | 77.05 | 87.69 | 77.51 | 88.09 |
Cond-Diffusion (L) (2024) | P+P’ | 76.66 | 87.20 | 76.44 | 86.56 |
RePaint (2022) | P+P’ | 77.57 | 88.07 | 77.14 | 87.96 |
LeFusion (Ours) | P+P’ | 78.77 | 89.25 | 78.43 | 88.54 |
LeFusion-H (Ours) | P+P’ | 80.62 | 90.90 | 80.95 | 90.98 |
Texture Synthesis with Hand-Crafted Synthetic Masks (2023) on N’ | |||||
Hand-Crafted (2023) | P+N’ | 75.10 | 85.50 | 74.88 | 84.64 |
Cond-Diffusion (2020; 2022) | P+N’ | 76.62 | 86.44 | 76.66 | 87.20 |
Cond-Diffusion (L) (2024) | P+N’ | 76.71 | 86.83 | 77.20 | 87.88 |
LeFusion (Ours) | P+N’ | 77.67 | 87.94 | 77.98 | 88.42 |
LeFusion-H (Ours) | P+N’ | 80.19 | 89.75 | 80.08 | 90.42 |
Texture Synthesis with Copied Masks (2023) on N’ | |||||
Copy-Paste | P+N’ | 77.29 | 87.60 | 77.80 | 88.84 |
Hand-Crafted (2023) | P+N’ | 76.04 | 86.57 | 76.58 | 87.72 |
Cond-Diffusion (2020; 2022) | P+N’ | 77.00 | 87.68 | 76.68 | 87.40 |
Cond-Diffusion (L) (2024) | P+N’ | 77.15 | 87.83 | 77.38 | 87.51 |
LeFusion (Ours) | P+N’ | 78.49 | 89.22 | 78.55 | 89.06 |
LeFusion-H (Ours) | P+N’ | 81.11 | 91.77 | 81.10 | 91.67 |
Enhanced with Diffusion-Based Synthetic Mask (DiffMask) | |||||
LeFusion-H+DiffMask (Ours) | P+N’ | 82.66 | 92.49 | 82.63 | 92.77 |
LeFusion-H+DiffMask (Ours) | P+N” | 83.19 | 93.21 | 83.07 | 93.10 |
LeFusion-H+DiffMask (Ours) | P+P’+N” | 83.44 | 93.35 | 83.13 | 93.20 |
Cardiac Lesion Segmentation.
Methods | Training Setting | nnU-Net (2021) MI Dice (↑) | nnU-Net (2021) PMO Dice (↑) | SwinUNETR (2021) MI Dice (↑) | SwinUNETR (2021) PMO Dice (↑)
Baseline | P | 68.61 | 36.32 | 57.79 | 35.76 |
Texture Synthesis with Real Masks on P’ | |||||
Hand-Crafted (2023) | P+P’ | 69.60 | 36.06 | 57.64 | 34.96 |
Cond-Diffusion (2020; 2022) | P+P’ | 66.89 | 37.76 | 56.75 | 36.31 |
Cond-Diffusion (L) (2024) | P+P’ | 68.07 | 31.93 | 56.97 | 32.72 |
RePaint (2022) | P+P’ | 69.14 | 28.93 | 55.14 | 33.86 |
LeFusion (Ours) | P+P’ | 69.88 | 34.79 | 57.85 | 35.63 |
LeFusion-J (Ours) | P+P’ | 69.95 | 38.01 | 59.61 | 37.99 |
Texture Synthesis with Hand-Crafted Synthetic Masks (2023) on N’ | |||||
Hand-Crafted (2023) | P+N’ | 68.19 | 35.73 | 56.18 | 35.01 |
Cond-Diffusion (2020; 2022) | P+N’ | 67.41 | 31.03 | 56.73 | 35.28 |
Cond-Diffusion (L) (2024) | P+N’ | 67.08 | 36.31 | 56.70 | 33.84 |
LeFusion (Ours) | P+N’ | 69.17 | 37.18 | 59.42 | 34.83 |
LeFusion-J (Ours) | P+N’ | 69.87 | 37.31 | 59.56 | 36.19 |
Enhanced with Diffusion-Based Synthetic Mask (DiffMask) | |||||
LeFusion-J+DiffMask (Ours) | P+N’ | 69.81 | 40.62 | 58.94 | 39.00 |
LeFusion-J+DiffMask (Ours) | P+N” | 70.17 | 42.44 | 58.60 | 41.24 |
LeFusion-J+DiffMask (Ours) | P+P’+N” | 70.34 | 43.54 | 60.54 | 41.70 |
LeFusion-J-H+DiffMask (Ours) | P+P’+N” | 71.28 | 43.41 | 59.30 | 42.49 |
We use the following synthetic subset settings: P’: synthetic cases generated from the 57 real pathological cases P; N’: synthetic cases generated from the 33 normal cases N; N”: a larger set of synthetic cases generated from the same 33 normal cases N.
We use combinations of these subsets to train nnUNet (Isensee et al., 2021) and SwinUNETR (Hatamizadeh et al., 2021). The results for the two types of lesions are reported in Tab. 2 in terms of Dice (0-100, higher is better). Due to page limitations, the corresponding NSD table is provided in Appendix Tab. A1.
In the first group of Tab. 2, we apply texture synthesis with real masks. The Hand-Crafted method (Hu et al., 2023) produces textures that differ from real ones, leading to a decrease in baseline performance. Cond-Diffusion (Ho et al., 2020; Rombach et al., 2022) and Cond-Diffusion (L) (Chen et al., 2024) disrupt the background structure and blur the lesion categories, decreasing MI performance. RePaint (Lugmayr et al., 2022), which relies on global information, struggles to generate textures that conform to lesion characteristics, resulting in a significant decrease in Dice for PMO lesions. LeFusion models the two lesions separately, ignoring the correlation between them, which leads to improved accuracy for MI but decreased accuracy for PMO. In contrast, our proposed LeFusion-J achieves superior results. For the second group, we expand the normal data using texture synthesis with hand-crafted synthetic masks. Since RePaint (Lugmayr et al., 2022) cannot distinguish between multiple lesion categories, we did not include it in this group. We observe results similar to those of the first group. Finally, we evaluate our proposed lesion mask synthesis. Our method significantly improves performance for both MI and PMO, and as the data volume grows, segmentation Dice consistently improves.
4.3 Visual Quality Assessment

Image Quality.
Fig. 5 shows the generation results of different methods for lung nodule CT and cardiac MRI based on lesion images and normal images. We have also quantitatively calculated and compared the similarity between our generated images and real images; more details can be found in Appendix D. Fig. 5(a) displays the synthesized visualization of lung nodules (red). The Cond-Diffusion method (Ho et al., 2020; Rombach et al., 2022) and Cond-Diffusion (L) (Chen et al., 2024) disrupt the background structure. The lesions generated by Hand-Crafted (Hu et al., 2023) and RePaint (Lugmayr et al., 2022) fail to capture texture information, such as the grayscale variations characteristic of the lesions. Our baseline LeFusion, without histogram control information, is easily influenced by background features, resulting in the generation of relatively shallow lesions. In contrast, our LeFusion-H can better utilize histogram information to control the generation of lesions. Fig. 5(b) displays the synthesized visualization of two lesions for the heart: MI (blue) and PMO (red). The generation results for the heart show similar conditions. It is important to note that since the heart has two types of lesions, LeFusion-J (joint modeling of both lesions) can more accurately reflect the textures of both lesion types and, compared to LeFusion, results in smoother transitions at the boundaries between the two lesions and the background. However, due to the small contrast difference between heart lesions, LeFusion-J-H with histogram control achieves similar effects. More visualizations can be found in the supplementary materials.
Histogram Control Analysis.
We studied the effect of histogram control on the lung nodule dataset. The visualization results are shown in Fig. 6. Under different histogram controls, the attenuation (“transparency”) of the generated lesions changes from shallow to deep. Without histogram control information, the generated lung nodules tend to match the pixel distribution of the normal lung background, resulting in overly light lesions. We selected 100 subsets and used LeFusion-H and LeFusion to generate each sample twice, then calculated the similarity between each pair using Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index Measure (SSIM); lower values indicate greater diversity. As shown in the figure, histogram control yields greater diversity between samples.
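The pairwise diversity measurement described above can be implemented as sketched below, using scikit-image metrics on the two samples generated per case; the exact aggregation is an assumption.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def pairwise_diversity(samples_a, samples_b, data_range=1.0):
    """Generate each case twice (samples_a[i], samples_b[i]) and measure how
    similar the two samples are; lower PSNR/SSIM means higher diversity."""
    psnrs, ssims = [], []
    for a, b in zip(samples_a, samples_b):
        psnrs.append(peak_signal_noise_ratio(a, b, data_range=data_range))
        ssims.append(structural_similarity(a, b, data_range=data_range))
    return float(np.mean(psnrs)), float(np.mean(ssims))
```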

We also conducted a quantitative analysis of the impact of histogram control on the lesion area. The addition of the control histogram influences the distribution of the original data’s histogram, effectively shifting or weighting it based on the control input. Empirically, this interpretation aligns well with the observed distributions. Further details can be found in Appendix C.
Mask Quality. Qualitative comparisons of real, hand-crafted, and DiffMask-generated masks are provided in Appendix A: DiffMask produces diverse shapes that closely resemble the real masks, whereas hand-crafted masks are comparatively uniform.
5 Conclusion
In conclusion, we introduce LeFusion, a novel lesion-focused diffusion model capable of recalibrating the diffusion learning objectives to lesion areas only. It preserves background by integrating forward-diffused background contexts into the reverse diffusion process. Our methodology is extended to handle challenging multi-peak and multi-class lesions, and further enhanced by a generative model for lesion masks, significantly diversifying our synthetic data. We demonstrate that synthetic data generated by our method can effectively boost the performance of state-of-the-art models like nnUNet (Isensee et al., 2021) and SwinUNETR (Hatamizadeh et al., 2021).
References
- Armato III et al. (2011) Samuel G Armato III, Geoffrey McLennan, Luc Bidaut, Michael F McNitt-Gray, Charles R Meyer, Anthony P Reeves, Binsheng Zhao, Denise R Aberle, Claudia I Henschke, Eric A Hoffman, et al. The lung image database consortium (lidc) and image database resource initiative (idri): a completed reference database of lung nodules on ct scans. Medical physics, 38(2):915–931, 2011.
- Avrahami et al. (2022) Omri Avrahami, Dani Lischinski, and Ohad Fried. Blended diffusion for text-driven editing of natural images. In Conference on Computer Vision and Pattern Recognition, pp. 18208–18218, 2022.
- Billot et al. (2023) Benjamin Billot, Douglas N Greve, Oula Puonti, Axel Thielscher, Koen Van Leemput, Bruce Fischl, Adrian V Dalca, Juan Eugenio Iglesias, et al. Synthseg: Segmentation of brain mri scans of any contrast and resolution without retraining. Medical Image Analysis, 86:102789, 2023.
- Chen et al. (2024) Qi Chen, Xiaoxi Chen, Haorui Song, Zhiwei Xiong, Alan Yuille, Chen Wei, and Zongwei Zhou. Towards generalizable tumor synthesis. In Conference on Computer Vision and Pattern Recognition, pp. 11147–11158, 2024.
- Chen et al. (2023) Richard J Chen, Judy J Wang, Drew FK Williamson, Tiffany Y Chen, Jana Lipkova, Ming Y Lu, Sharifa Sahai, and Faisal Mahmood. Algorithmic fairness in artificial intelligence for medicine and healthcare. Nature biomedical engineering, 7(6):719–742, 2023.
- Du et al. (2023) Yuhao Du, Yuncheng Jiang, Shuangyi Tan, Xusheng Wu, Qi Dou, Zhen Li, Guanbin Li, and Xiang Wan. Arsdm: colonoscopy images synthesis with adaptive refinement semantic diffusion models. In International conference on medical image computing and computer-assisted intervention, pp. 339–349. Springer, 2023.
- Esser et al. (2021) Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. In Conference on Computer Vision and Pattern Recognition, pp. 12873–12883, 2021.
- Goodfellow et al. (2020) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. Communications of the ACM, 63(11):139–144, 2020.
- Hamamci et al. (2024) Ibrahim Ethem Hamamci, Sezgin Er, Anjany Sekuboyina, Enis Simsar, Alperen Tezcan, Ayse Gulnihan Simsek, Sevval Nil Esirgun, Furkan Almas, Irem Dogan, Muhammed Furkan Dasdelen, et al. Generatect: Text-conditional generation of 3d chest ct volumes. In European Conference on Computer Vision, 2024.
- Han et al. (2019) Changhee Han, Yoshiro Kitamura, Akira Kudo, Akimichi Ichinose, Leonardo Rundo, Yujiro Furukawa, Kazuki Umemoto, Yuanzhong Li, and Hideki Nakayama. Synthesizing diverse lung nodules wherever massively: 3d multi-conditional gan-based ct image augmentation for object detection. In International Conference on 3D Vision, pp. 729–737. IEEE, 2019.
- Hatamizadeh et al. (2021) Ali Hatamizadeh, Vishwesh Nath, Yucheng Tang, Dong Yang, Holger R Roth, and Daguang Xu. Swin unetr: Swin transformers for semantic segmentation of brain tumors in mri images. In International MICCAI brainlesion workshop, pp. 272–284. Springer, 2021.
- Ho et al. (2020) Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
- Horvath et al. (2022) Izabela Horvath, Johannes Paetzold, Oliver Schoppe, Rami Al-Maskari, Ivan Ezhov, Suprosanna Shit, Hongwei Li, Ali Ertürk, and Bjoern Menze. Metgan: generative tumour inpainting and modality synthesis in light sheet microscopy. In IEEE Winter Conference on Applications of Computer Vision, pp. 227–237, 2022.
- Hu et al. (2023) Qixin Hu, Yixiong Chen, Junfei Xiao, Shuwen Sun, Jieneng Chen, Alan L Yuille, and Zongwei Zhou. Label-free liver tumor segmentation. In Conference on Computer Vision and Pattern Recognition, pp. 7422–7432, 2023.
- Ibrahim et al. (2021) Hussein Ibrahim, Xiaoxuan Liu, Nevine Zariffa, Andrew D Morris, and Alastair K Denniston. Health data poverty: an assailable barrier to equitable digital health care. The Lancet Digital Health, 3(4):e260–e265, 2021.
- Isensee et al. (2021) Fabian Isensee, Paul F Jaeger, Simon AA Kohl, Jens Petersen, and Klaus H Maier-Hein. nnu-net: a self-configuring method for deep learning-based biomedical image segmentation. Nature methods, 18(2):203–211, 2021.
- Jin et al. (2021) Qiangguo Jin, Hui Cui, Changming Sun, Zhaopeng Meng, and Ran Su. Free-form tumor synthesis in computed tomography images via richer generative adversarial network. Knowledge-Based Systems, 218:106753, 2021.
- Khader et al. (2023) Firas Khader, Gustav Müller-Franzes, Soroosh Tayebi Arasteh, Tianyu Han, Christoph Haarburger, Maximilian Schulze-Hagen, Philipp Schad, Sandy Engelhardt, Bettina Baeßler, Sebastian Foersch, et al. Denoising diffusion probabilistic models for 3d medical image generation. Scientific Reports, 13(1):7303, 2023.
- Kingma & Welling (2013) Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv Preprint, 2013.
- Lai et al. (2024) Yuxiang Lai, Xiaoxi Chen, Angtian Wang, Alan Yuille, and Zongwei Zhou. From pixel to cancer: Cellular automata in computed tomography. arXiv Preprint, 2024.
- Lalande et al. (2022) Alain Lalande, Zhihao Chen, Thibaut Pommier, Thomas Decourselle, Abdul Qayyum, Michel Salomon, Dominique Ginhac, Youssef Skandarani, Arnaud Boucher, Khawla Brahim, et al. Deep learning methods for automatic evaluation of delayed enhancement-mri. the results of the emidec challenge. Medical Image Analysis, 79:102428, 2022.
- Lugmayr et al. (2022) Andreas Lugmayr, Martin Danelljan, Andres Romero, Fisher Yu, Radu Timofte, and Luc Van Gool. Repaint: Inpainting using denoising diffusion probabilistic models. In Conference on Computer Vision and Pattern Recognition, pp. 11461–11471, 2022.
- Lyu et al. (2022a) Fei Lyu, Mang Ye, Jonathan Frederik Carlsen, Kenny Erleben, Sune Darkner, and Pong C Yuen. Pseudo-label guided image synthesis for semi-supervised covid-19 pneumonia infection segmentation. IEEE Transactions on Medical Imaging, 42(3):797–809, 2022a.
- Lyu et al. (2022b) Fei Lyu, Mang Ye, Andy J Ma, Terry Cheuk-Fung Yip, Grace Lai-Hung Wong, and Pong C Yuen. Learning from synthetic ct images via test-time training for liver tumor segmentation. IEEE Transactions on Medical Imaging, 41(9):2510–2520, 2022b.
- Peng et al. (2024) Wei Peng, Tomas Bosschieter, Jiahong Ouyang, Robert Paul, Edith V Sullivan, Adolf Pfefferbaum, Ehsan Adeli, Qingyu Zhao, and Kilian M Pohl. Metadata-conditioned generative models to synthesize anatomically-plausible 3d brain mris. Medical Image Analysis, 98:103325, 2024.
- Pishva et al. (2023) Alexander K Pishva, Vajira Thambawita, Jim Torresen, and Steven A Hicks. Repolyp: A framework for generating realistic colon polyps with corresponding segmentation masks using diffusion models. In 2023 IEEE 36th International Symposium on Computer-Based Medical Systems (CBMS), pp. 47–52. IEEE, 2023.
- Price & Cohen (2019) W Nicholson Price and I Glenn Cohen. Privacy in the age of medical big data. Nature medicine, 25(1):37–43, 2019.
- Qayyum et al. (2020) Adnan Qayyum, Junaid Qadir, Muhammad Bilal, and Ala Al-Fuqaha. Secure and robust machine learning for healthcare: A survey. IEEE Reviews in Biomedical Engineering, 14:156–180, 2020.
- Rombach et al. (2022) Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Conference on Computer Vision and Pattern Recognition, pp. 10684–10695, 2022.
- Savage (2023) Neil Savage. Synthetic data could be better than real data. Nature, 2023. URL https://doi.org/10.1038/d41586-023-01445-8.
- Schäfer et al. (2024) Raphael Schäfer, Till Nicke, Henning Höfener, Annkristin Lange, Dorit Merhof, Friedrich Feuerhake, Volkmar Schulz, Johannes Lotz, and Fabian Kiessling. Overcoming data scarcity in biomedical imaging with a foundational multi-task model. Nature Computational Science, pp. 1–15, 2024.
- Shin et al. (2018) Younghak Shin, Hemin Ali Qadir, and Ilangko Balasingham. Abnormal colon polyp image synthesis using conditional adversarial networks for improved detection performance. IEEE Access, 6:56007–56017, 2018.
- Sohl-Dickstein et al. (2015) Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International conference on machine learning, pp. 2256–2265. PMLR, 2015.
- Wang et al. (2022) Hualin Wang, Yuhong Zhou, Jiong Zhang, Jianqin Lei, Dongke Sun, Feng Xu, and Xiayu Xu. Anomaly segmentation in retinal images with poisson-blending data augmentation. Medical Image Analysis, 81:102534, 2022.
- Wu et al. (2024) Linshan Wu, Jiaxin Zhuang, Xuefeng Ni, and Hao Chen. Freetumor: Advance tumor segmentation via large-scale tumor synthesis. arXiv Preprint, 2024.
- Xu et al. (2022) Jie Xu, Yunyu Xiao, Wendy Hui Wang, Yue Ning, Elizabeth A Shenkman, Jiang Bian, and Fei Wang. Algorithmic fairness in computational medicine. EBioMedicine, 84, 2022.
- Yang et al. (2019) Jie Yang, Siqi Liu, Sasa Grbic, Arnaud Arindra Adiyoso Setio, Zhoubing Xu, Eli Gibson, Guillaume Chabin, Bogdan Georgescu, Andrew F Laine, and Dorin Comaniciu. Class-aware adversarial lung nodule synthesis in ct images. In International Symposium on Biomedical Imaging, pp. 1348–1352. IEEE, 2019.
- Yang et al. (2022) Yuzhe Yang, Yuan Yuan, Guo Zhang, Hao Wang, Ying-Cong Chen, Yingcheng Liu, Christopher G Tarolli, Daniel Crepeau, Jan Bukartyk, Mithri R Junna, et al. Artificial intelligence-enabled detection and assessment of parkinson’s disease using nocturnal breathing signals. Nature medicine, 28(10):2207–2215, 2022.
- Yang et al. (2024) Yuzhe Yang, Haoran Zhang, Judy W Gichoya, Dina Katabi, and Marzyeh Ghassemi. The limits of fair medical imaging ai in real-world generalization. Nature Medicine, pp. 1–11, 2024.
- Yao et al. (2021) Qingsong Yao, Li Xiao, Peihang Liu, and S Kevin Zhou. Label-free segmentation of covid-19 lesions in lung ct. IEEE Transactions on Medical Imaging, 40(10):2808–2819, 2021.
- Zhang et al. (2023) Yifan Zhang, Bingyi Kang, Bryan Hooi, Shuicheng Yan, and Jiashi Feng. Deep long-tailed learning: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(9):10795–10816, 2023.
- Zhu et al. (2024) Lingting Zhu, Noel Codella, Dongdong Chen, Zhenchao Jin, Lu Yuan, and Lequan Yu. Generative enhancement for 3d medical images. arXiv Preprint, 2024.
Appendix A Mask visualization
Fig. A1 shows the real lung nodule lesion mask (a), the hand-crafted synthetic lung nodule mask (b), and the mask generated by our proposed diffusion model (d). Comparing subfigures (a), (b), and (d) in Fig. A1, it is evident that the real masks exhibit diverse shapes, while the DiffMask-generated masks closely resemble the real ones, also displaying varied forms with similar characteristics. In contrast, the handcrafted masks are relatively uniform in shape and differ significantly from the real masks.
Besides, our proposed DiffMask can control the size and location of the lesion masks, allowing for the generation of more diverse lesions. As shown in Fig. A1 (c) and Fig. A1 (d), we can use the sphere (Size Ball) to precisely control the desired lesion size and its position within the background image. This control enables us to produce various masks.

Fig. A2 shows that our diffusion-generated masks are closer to the real masks and exhibit a more diverse range of shape patterns, while Hand-Crafted masks consistently show large, continuous regions.

Appendix B Downstream Segmentation Performance
Methods | Training Setting | nnU-Net (2021) MI NSD (↑) | nnU-Net (2021) PMO NSD (↑) | SwinUNETR (2021) MI NSD (↑) | SwinUNETR (2021) PMO NSD (↑)
Baseline | P | 59.27 | 29.19 | 47.66 | 20.51 |
Texture Synthesis with Real Masks on P’ | |||||
Hand-Crafted (2023) | P+P’ | 60.34 | 35.03 | 47.70 | 19.38 |
Cond-Diffusion (2020; 2022) | P+P’ | 58.21 | 24.04 | 46.35 | 20.87 |
Cond-Diffusion (L) (2024) | P+P’ | 58.98 | 25.96 | 46.83 | 18.72 |
RePaint (2022) | P+P’ | 59.79 | 23.61 | 45.10 | 19.19 |
LeFusion (Ours) | P+P’ | 60.77 | 33.13 | 46.66 | 20.05 |
LeFusion-J (Ours) | P+P’ | 60.44 | 36.65 | 48.23 | 24.11 |
Texture Synthesis with Hand-Crafted Synthetic Masks (2023) on N’ | |||||
Hand-Crafted (2023) | P+N’ | 57.88 | 25.00 | 47.40 | 19.82 |
Cond-Diffusion (2020; 2022) | P+N’ | 58.22 | 22.19 | 46.15 | 20.30 |
Cond-Diffusion (L) (2024) | P+N’ | 58.64 | 26.26 | 47.90 | 19.54 |
LeFusion (Ours) | P+N’ | 60.75 | 23.96 | 47.81 | 19.91 |
LeFusion-J (Ours) | P+N’ | 60.68 | 30.62 | 49.69 | 20.71 |
Enhanced with Diffusion-Based Synthetic Mask (DiffMask) | |||||
LeFusion-J+DiffMask (Ours) | P+N’ | 61.35 | 38.93 | 48.62 | 22.43 |
LeFusion-J+DiffMask (Ours) | P+N” | 61.48 | 35.03 | 50.00 | 23.92 |
LeFusion-J+DiffMask (Ours) | P+P’+N” | 61.27 | 41.62 | 52.82 | 23.77 |
LeFusion-J-H+DiffMask (Ours) | P+P’+N” | 62.74 | 40.96 | 50.73 | 24.25 |
Tab. A1, as a supplement to Tab. 2, presents the NSD metric measured under the same experimental settings as in Tab. 2.
In the first set of Tab. A1, we employ texture synthesis with real masks. The Hand-Crafted approach (Hu et al., 2023) generates textures that deviate from real ones, resulting in a decline in baseline performance. Cond-Diffusion (Ho et al., 2020; Rombach et al., 2022) and Cond-Diffusion (L) (Chen et al., 2024) disrupt the background structure, blurring the boundaries between lesion categories and reducing MI performance. RePaint (Lugmayr et al., 2022), which emphasizes global information, struggles to produce textures consistent with lesion characteristics, leading to a marked decrease in NSD for PMO lesions. LeFusion models the two lesions independently, disregarding the correlation between them, which results in higher accuracy for MI but reduced accuracy for PMO. Conversely, our proposed LeFusion-J outperforms the other methods. In the second group, we extend the normal data by synthesizing textures with hand-crafted synthetic masks. Since RePaint (Lugmayr et al., 2022) cannot distinguish multiple lesion categories, we did not repeat the experiments for it; the outcomes are similar to those of the first group. Lastly, we evaluate our proposed lesion mask synthesis. Our approach significantly enhances performance for both MI and PMO, and as the data volume increases, the downstream segmentation NSD improves consistently.
Appendix C Histogram Control Analysis
For a given original image, the histogram of its lesion region is denoted as $h_{\text{ori}}$, and the control histogram as $h^{(i)}_{\text{ctrl}}$ (where $i \in \{1, \dots, K\}$), with $K$ being the number of controlled histograms. The histogram of the lesion region in the resulting output image is denoted as $h_{\text{out}}$ and can be derived as follows:

$$\tilde{h} = \alpha_1\, h_{\text{ori}} + \beta_1 \qquad (5)$$

$$h_{\text{out}} \approx \alpha_2\, \big(\tilde{h} \odot h^{(i)}_{\text{ctrl}}\big) + \beta_2 \qquad (6)$$

In these formulas, $\alpha_1$ and $\alpha_2$ are scaling factors, and $\beta_1$ and $\beta_2$ are bias offset terms. Based on the above formulas, given the input image and the corresponding control information, we can theoretically deduce the histogram of the lesion in the mask area of the generated image.
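A possible way to obtain the fitted curves shown in Fig. A3 is a least-squares fit of the scaling factor and bias relating a source histogram to the observed output histogram, as sketched below; the fitting procedure is an assumption.

```python
import numpy as np

def fit_affine(h_src, h_out):
    """Least-squares fit of alpha, beta in h_out ≈ alpha * h_src + beta."""
    A = np.stack([h_src, np.ones_like(h_src)], axis=1)  # design matrix [h_src, 1]
    (alpha, beta), *_ = np.linalg.lstsq(A, h_out, rcond=None)
    return alpha, beta
```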
Fig. A3 shows the impact of histogram control information on the generation of lung nodules in a normal chest background. We randomly sampled 20 cases from the generated lesion data for statistical analysis. The first row represents the controlled histogram, where “No control” indicates that the standard diffusion configuration without histogram control information was used. The second row shows the original mask areas of the 20 cases, as well as the average histogram effects of the generated and predicted data. Rows 3, 4, and 5 display the results of three different sample cases.
When the model does not use histogram control, the generated histograms of lung nodules in the normal background areas tend to be relatively shallow, influenced by the surrounding background. These histograms resemble the textures of the background mask areas, as shown in the rightmost column. In this column, the dark blue line represents the histogram of the image mask area generated by the diffusion model, while the orange line represents the theoretically fitted histogram control effect.
After introducing histogram control information, the interaction between the original image lesion areas and the controlled histogram effects causes the generated mask areas to shift increasingly to the right, following the peak effects of the three histograms (control 1, control 2, control 3). The second-row average histogram effect shows the stable trend of this shift.

Appendix D Image Quality Evaluation
As shown in Tab. A2, we selected Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index Measure (SSIM) to calculate the similarity between synthetic pathological cases generated by different methods and real pathological cases. For cardiac lesions on the Emidec dataset, our proposed LeFusion achieves the highest average PSNR and SSIM. For lung nodules on the LIDC dataset, which contains only one lesion class, LeFusion also achieves the highest PSNR. These experiments demonstrate that our synthesized lesions are more similar to real lesions across both CT and MRI modalities.
Methods | Emidec-MI PSNR | Emidec-MI SSIM | Emidec-PMO PSNR | Emidec-PMO SSIM | Emidec-Avg. PSNR | Emidec-Avg. SSIM | LIDC PSNR | LIDC SSIM
Hand-Crafted (2023) | 9.39 | 8.30 | 7.63 | 9.15 | 8.516 | 8.70 | 1.97 | 0.07 |
Cond-Diffusion (2020; 2022) | 13.25 | 46.92 | 8.00 | 9.23 | 10.62 | 28.07 | 16.95 | 93.46 |
Cond-Diffusion(L) (2024) | 14.62 | 69.36 | 12.39 | 61.00 | 13.51 | 65.18 | 15.50 | 90.05 |
RePaint (2022) | 19.81 | 80.68 | 15.23 | 70.27 | 17.52 | 75.47 | 18.91 | 91.22 |
LeFusion-S (Ours) | 25.65 | 91.78 | 27.71 | 89.42 | 26.68 | 90.60 | 22.38 | 90.16 |
LeFusion-J (Ours) | 28.30 | 91.41 | 35.23 | 93.23 | 31.77 | 92.32 | — | —
Appendix E More Visualizations
In this section, we provide multiple illustrative figures demonstrating the effects of our proposed diffusion model, LeFusion.
Lung Nodule CT: Fig. A4 presents additional illustrations of the effects of histogram control. The histogram effectively controls the texture of the lesions, while the version without control information tends to generate lighter-colored lung nodules. Fig. A5 provides a visualization of the synthesized lesion results using real samples and their corresponding normal samples.


Cardiac Lesion MRI: Fig. A6 shows the visualization of the denoising process at different stages in LeFusion for inpainting. Fig. A7 and Fig. A8 respectively show the generation of pathological results on lesion cases and normal cases.



Appendix F Implementation Details
For all experiments, we used 6 A100 (40 GB) GPUs, covering both the diffusion and downstream segmentation tasks, with Python 3.8 and PyTorch 2.4.0. Dataset: For lung nodules, we located each lesion and cropped and padded it to a size of 64×64×32. For the heart, we uniformly cropped and padded to 72×72×10. Diffusion Model: Training the LeFusion diffusion model for lung nodules takes approximately one day on 3 A100 GPUs. For inference, generating a single sample takes about 30 seconds on one A100 GPU. Segmentation Model: We implemented nnUNet (Isensee et al., 2021) and SwinUNETR (Hatamizadeh et al., 2021) using the MONAI framework. For downstream tasks, both SwinUNETR and nnUNet were trained for 200 epochs. Due to differences in dataset sizes, training on a single A100 GPU took approximately 6 to 24 hours for SwinUNETR and 4 to 10 hours for nnUNet.
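A minimal sketch of the lesion-centered crop-and-pad preprocessing is shown below; the padding value and centering strategy are assumptions rather than the released preprocessing code.

```python
import numpy as np

def crop_and_pad(volume, center, target=(64, 64, 32), pad_value=0.0):
    """Extract a lesion-centered ROI of fixed size (64x64x32 for lung nodules,
    72x72x10 for cardiac MRI), padding wherever the crop exceeds the volume."""
    out = np.full(target, pad_value, dtype=volume.dtype)
    src_slices, dst_slices = [], []
    for c, t, s in zip(center, target, volume.shape):
        start = int(c) - t // 2
        src_lo, src_hi = max(start, 0), min(start + t, s)
        dst_lo = src_lo - start
        src_slices.append(slice(src_lo, src_hi))
        dst_slices.append(slice(dst_lo, dst_lo + (src_hi - src_lo)))
    out[tuple(dst_slices)] = volume[tuple(src_slices)]
    return out
```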