
Comparative Analysis of Diffusion Generative Models in Computational Pathology

Denisha Thakkar (a), Vincent Quoc-Huy Trinh (b), Sonal Varma (c),
Samira Ebrahimi Kahou (d,e), Hassan Rivaz (a), Mahdi S. Hosseini (a,e)
Abstract

Diffusion Generative Models (DGMs) have rapidly emerged as a central topic in computer vision, garnering significant interest across a wide array of deep learning applications. Despite their high computational demand, these models are extensively utilized for their superior sample quality and robust mode coverage. While research on diffusion generative models is advancing, their exploration within computational pathology and its large-scale datasets has been comparatively gradual. Bridging the gap between the high-quality generation capabilities of DGMs and the intricate nature of pathology data, this paper presents an in-depth comparative analysis of diffusion methods applied to a pathology dataset. Our analysis extends to datasets with varying Fields of View (FOV), revealing that DGMs are highly effective in producing high-quality synthetic data. An ablative study is also conducted, followed by a detailed discussion on the impact of various methods on the synthesized histopathology images. One striking observation from our experiments is that adjusting the image size during generation can simulate varying fields of view. These findings underscore the potential of DGMs to enhance the quality and diversity of synthetic pathology data, especially when used alongside real data, ultimately increasing the accuracy of deep learning models in histopathology. Code is available at https://github.com/AtlasAnalyticsLab/Diffusion4Path.

keywords:
Diffusion Generative Models, Latent Diffusion Model, Computational Pathology
journal: Under Review
Affiliations:
a. Department of Computer Science and Software Engineering (CSSE), Concordia University, Montreal, Quebec H3H 2R9, Canada
b. University of Montreal Hospital Center, Montreal, Quebec H2X 0C2, Canada
c. Queen's University and Kingston Health Sciences Center, Kingston, Ontario K7L 2V7, Canada
d. Department of Electrical and Software Engineering (ESE), University of Calgary, Calgary, Alberta T2N 1N4, Canada
e. Mila–Quebec Artificial Intelligence Institute, 6666 St-Urbain, #200, Montreal, Quebec H2S 3H1, Canada

1 Introduction

Histopathology involves diagnosing diseases by closely inspecting gigapixel tissue slides of microscopic structures to identify their characteristics [1]. Computational pathology has gained significant momentum, bringing about a transformative change in cancer diagnostics since the digitization of pathology slides [2]. In addition, the success of deep learning methods [3] has led to the development of numerous models that enhance histopathology diagnosis and can assist pathologists in their diagnostic workflow [4, 5]. However, these models require large volumes of annotated and unannotated data for effective analysis and cancer screening solutions [6].

Two primary data-related challenges influence the development and implementation of computational pathology foundation models in clinical settings. First, privacy concerns severely limit data availability, making much pathology data inaccessible to the public [7]. Patient confidentiality and strict regulatory requirements often prevent the sharing of medical data, which is crucial for training and validating deep learning models. This lack of accessible data limits the ability to create well-optimized and generalizable models. Without access to diverse data for research, it becomes challenging to develop models that are truly representative and effective across various patient populations and conditions.

Second, the limited quantity and variable quality of publicly available pathology data significantly undermine the performance of foundation models, especially for common lesions [8]. Public datasets often lack the breadth and depth needed to train comprehensive models, and the available data can be inconsistent in resolution, staining techniques, and annotation quality [9]. Without sufficient high-quality data, models may struggle to diagnose these conditions accurately.

Another major challenge with pathology data is its complexity, which requires experienced pathologists to analyze and interpret. The field deals with gigapixel images that have multiple magnification levels and unique features [10], setting them apart from typical computer vision data. Studying these images and creating synthetic datasets requires a deep understanding of their distinctive traits; this knowledge is essential to accurately replicate pathological data during synthesis, helping to advance data processing and research. These challenges highlight the need for innovative solutions that improve data sharing and the quality of publicly available pathology datasets. Addressing these issues is essential for advancing computational pathology and improving diagnostic workflows in clinical settings.

Generative models in pathology are still in their early stages, but they hold immense potential, especially for synthetic medical image generation, which can help address the challenges described above. By creating synthetic images, generative models can also help discover new patterns and regularities that provide deeper insights into rare tissue types [11]. Among generative models, Generative Adversarial Networks (GANs) [12] have set a high standard in creating high-fidelity data, greatly benefiting fields that demand high realism, such as medical imaging [13, 14, 15, 16, 17]. Despite GANs' achievements, their practical application faces challenges such as mode collapse and training instability, which are critical setbacks in sensitive applications such as medical imaging.

Recently, diffusion generative models (DGMs) [18, 19, 20, 21, 22, 23] have emerged with superior image generation capabilities and greater training stability. To date, diffusion models have proven useful in a wide variety of areas, ranging from generative tasks such as image generation [21] and image super-resolution [24] to discriminative tasks such as image segmentation [25] and classification [26]. DGMs offer a structured approach built on a strong mathematical foundation, which makes them reliable to use and flexible in terms of image generation. Although DGMs have been applied in various fields for tasks such as text-conditioned generation [27], image generation [28], and large-scale image generation [29], their potential for generating synthetic datasets and their applicability to tissue classification remain less explored.

In our research, we address this gap using datasets extracted at different fields of view from the same WSI dataset. This allows us to compare how synthesis differs across Fields of View (FOV) [30] from the same whole slides, specifically focusing on FOVs of 224 and 336. Different regions of a histopathology slide may require different levels of detail, from high-level overviews to detailed cellular structures, as demonstrated in Figure 1. We also provide a novel study on the effects of prompting with different patch sizes, which further helps to understand how patch size influences the generated FOV detail. This study shows that, when prompted with different sizes, DGMs can generate images with levels of detail unseen during training. We also show how these synthetic datasets affect the robustness and accuracy of deep learning classifiers.

Figure 1: The process of selecting and magnifying different regions from a histopathology slide to obtain patches with varying Fields of View (FOV). The images demonstrate the detail captured at FOV 600, FOV 400, and FOV 200, representing the data’s range of granularity and the model’s ability to maintain clarity at different magnification levels.

This study represents a novel exploration of DGMs in the field of computational pathology. It offers a comprehensive comparison of various baseline methods and showcases the ability of diffusion models to create coarse features and generate images at varying patch sizes.

The contributions of this paper are as follows:

1. Comprehensive Analysis: This paper includes a comprehensive study that compares various diffusion generative methods. The study provides a detailed analysis of the strengths and weaknesses of different techniques, offering valuable information on their performance and applicability in medical imaging. This research contributes to a better understanding of how different generative models perform in the context of pathology.

2. Novel Study of Diffusion Models in Pathology: This work not only uses baseline methods to synthesize images but also explores the adaptability of diffusion models by generating images of varying patch sizes. It highlights the models' unique ability to generate images with varying FOVs when prompted with different patch sizes, producing high-quality synthetic pathology images that accurately reflect features at different levels of detail. By generating images with varying FOVs, diffusion models can capture the intricate details and structural variations of tissue samples, a capability that can enhance the versatility and effectiveness of models in research and educational settings.

3. Comparison of FOVs: In our investigation, we address critical gaps using multiple subset datasets extracted from a single dataset. By synthesizing various FOVs from the same dataset, specifically FOVs of 224 and 336, we compare the differences in FID scores obtained by DGMs.

4. Exploration of Synthetic Dataset Applicability: This research also explores the applicability of synthetic datasets generated by diffusion models in enhancing deep learning classifiers. The study evaluates how well these synthetic images can be used to train and improve deep learning models, specifically for medical image classification. A key contribution is the demonstration of improved classification performance when synthetic datasets are used alongside real images; the empirical evidence shows that adding high-quality synthetic images can significantly boost classifier performance, leading to a more robust classifier.

2 Review of Diffusion Generative Methods

Diffusion models [31, 18, 19] are a powerful class of probabilistic generative models used to learn complex data distributions. We describe them within a common framework in which each category and method acts as a standalone module, so that any individual model or technique can be plugged in easily. The framework is as follows:

1. Training:

(a) Forward Process: Gaussian noise is introduced into the input images, with the magnitude determined randomly by a scheduler, resulting in varying levels of noise. This allows the model to explore a range of image noises, from subtle distortions to isotropic Gaussian noise. The noisy image, along with embedding vectors representing the noise level and class labels, is fed into a U-Net-based neural network.

(b) Reverse Process: The U-Net predicts the noise added to the images. This prediction is compared to the actual noise to calculate the loss, guiding the model's training process.

2. Sampling: The process starts with pure isotropic noise. The trained network predicts the noise at each step, and subtracting the predicted noise from the current image yields a less noisy image. This image is iteratively refined through multiple steps to generate the final image.

This whole process is illustrated in Figure 2.

Figure 2: Diffusion Generative Models framework used in the pathology dataset: The Training phase (top-right) details the data points' progression from x_0 to x_T with noise levels ε_0 to ε_T, where the model adds noise to the original data and estimates the reverse process guided by class information to understand tissue details. The Sampling phase (bottom-right) reverses the diffusion from a noisy state x_T to the original data point x_0 by iteratively denoising with noise predictions ε′ and learned parameters, prompting the class label to generate images of specific tissue types.

2.1 Denoising Diffusion Probabilistic Models

Denoising Diffusion Probabilistic Models (DDPM) [18] follow the same training and sampling methods explained above, with both the forward and reverse processes occurring in pixel space. In the forward process, Gaussian noise is added to the image x_0 at each timestep t according to:

q(\mathbf{x}_{t}|\mathbf{x}_{t-1})=\mathcal{N}(\mathbf{x}_{t};\sqrt{1-\beta_{t}}\,\mathbf{x}_{t-1},\beta_{t}\mathbf{I})   (1)

where β_t is the variance schedule. The reverse process aims to denoise the image step by step using a learned model p_θ, represented as:

p_{\theta}(\mathbf{x}_{t-1}|\mathbf{x}_{t})=\mathcal{N}(\mathbf{x}_{t-1};\boldsymbol{\mu}_{\theta}(\mathbf{x}_{t},t),\boldsymbol{\Sigma}_{\theta}(\mathbf{x}_{t},t))   (2)

The objective function to train the model is:

L_{t}^{\text{simple}}=\mathbb{E}_{t\sim[1,T],\,\mathbf{x},\,\boldsymbol{\epsilon}_{t}}\big[\|\boldsymbol{\epsilon}_{t}-\boldsymbol{\epsilon}_{\theta}(\mathbf{x}_{t},t)\|^{2}_{2}\big]   (3)
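To make the training and sampling procedures concrete, the following is a minimal PyTorch sketch of the simple objective (Eq. 3) and ancestral sampling. Here `model` stands in for the class-conditional U-Net described in Section 3.2, and the linear beta schedule values are illustrative rather than the exact configuration used in our experiments.

```python
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)      # linear variance schedule (illustrative)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)  # cumulative products \bar{alpha}_t

def training_step(model, x0, labels):
    """One DDPM training step: noise x0 via the closed form of Eq. (1), then Eq. (3)."""
    t = torch.randint(0, T, (x0.size(0),))               # random timestep per image
    eps = torch.randn_like(x0)                           # Gaussian noise
    ab = alpha_bars[t].view(-1, 1, 1, 1)
    x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * eps         # noisy image x_t
    return F.mse_loss(model(x_t, t, labels), eps)        # simple noise-prediction loss

@torch.no_grad()
def sample(model, labels, shape):
    """Ancestral sampling (Eq. 2): start from isotropic noise, denoise step by step."""
    x = torch.randn(shape)
    for t in reversed(range(T)):
        tt = torch.full((shape[0],), t, dtype=torch.long)
        eps_pred = model(x, tt, labels)
        mean = (x - (1 - alphas[t]) / (1 - alpha_bars[t]).sqrt() * eps_pred) / alphas[t].sqrt()
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + betas[t].sqrt() * noise               # one reverse step
    return x
```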

2.2 Latent Diffusion Models

Latent Diffusion Models (LDM) [32] operate in a lower-dimensional latent space learned by an autoencoder. The training and sampling stages follow the same principles as the general diffusion models described above, but with all processes occurring in the latent space: the encoder maps images into this latent space before diffusion begins. The forward process is defined as:

q(\mathbf{z}_{t}|\mathbf{z}_{t-1})=\mathcal{N}(\mathbf{z}_{t};\sqrt{1-\beta_{t}}\,\mathbf{z}_{t-1},\beta_{t}\mathbf{I})   (4)

where z_t represents the latent variables. The reverse process denoises the latent variables step by step:

p_{\theta}(\mathbf{z}_{t-1}|\mathbf{z}_{t})=\mathcal{N}(\mathbf{z}_{t-1};\boldsymbol{\mu}_{\theta}(\mathbf{z}_{t},t),\boldsymbol{\Sigma}_{\theta}(\mathbf{z}_{t},t))   (5)

LDMs also incorporate cross-attention mechanisms within the architecture, enhancing conditional image synthesis. After the latent-space processing, the autoencoder's decoder transforms the latent variables back into images. The objective function for training remains similar:

L_{t}^{\text{simple}}=\mathbb{E}_{t\sim[1,T],\,\mathbf{z},\,\boldsymbol{\epsilon}_{t}}\big[\|\boldsymbol{\epsilon}_{t}-\boldsymbol{\epsilon}_{\theta}(\mathbf{z}_{t},t)\|^{2}_{2}\big]   (6)
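The latent-space variant needs only a thin wrapper around the same machinery. Below is a minimal sketch, reusing `training_step` and `sample` from the DDPM sketch above and assuming a frozen, pretrained `autoencoder` exposing `encode`/`decode` methods (e.g., the VQ-f4 autoencoder of Section 3.2).

```python
import torch

def ldm_training_step(model, autoencoder, x0, labels):
    """Encode images into latents, then apply the same noise-prediction loss (Eq. 6)."""
    with torch.no_grad():
        z0 = autoencoder.encode(x0)          # images -> latents; autoencoder stays frozen
    return training_step(model, z0, labels)

@torch.no_grad()
def ldm_sample(model, autoencoder, labels, latent_shape):
    """Denoise in latent space (Eq. 5), then decode the latents back to images."""
    z = sample(model, labels, latent_shape)
    return autoencoder.decode(z)
```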

2.3 Conditioning

Recent advancements in diffusion models have introduced class-conditional generation, where additional class-related information is incorporated to guide the generation process. Our findings revealed that class-conditional generation significantly enhances the fidelity of the generated images to specific class characteristics: as the guidance scale increased, the generated images exhibited more precise and accurate representations of the target classes.

Classifier-Free Guidance (CFG) [33] is a technique that enables the generation of high-quality samples without relying on a classifier, addressing the limitations associated with classifier guidance. CFG modifies the score function in a way that emulates the effects of classifier guidance, but without using an explicit classifier. The approach involves training an unconditional denoising diffusion model alongside the conditional model, using a single neural network to parameterize both. The sampling process utilizes a combination of the conditional and unconditional score estimates, allowing for effective guidance without a classifier. This results in the production of high-quality synthetic images that are both varied and representative of the original dataset, enhancing the model’s performance in generating realistic images.

\epsilon_{t}=(1+w)\,\epsilon_{\theta}(x_{t},c)-w\,\epsilon_{\theta}(x_{t})   (7)

Here, \epsilon_{\theta}(x_{t},c) is the conditional model, \epsilon_{\theta}(x_{t}) is the unconditional model, and w is the guidance scale.
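A sketch of Eq. (7) follows; it assumes the shared network marks the unconditional case with a reserved null label, which is one common implementation choice rather than a requirement of the method.

```python
import torch

@torch.no_grad()
def guided_epsilon(model, x_t, t, labels, w, null_label):
    """Classifier-free guidance (Eq. 7): blend conditional and unconditional predictions."""
    eps_cond = model(x_t, t, labels)                                 # eps_theta(x_t, c)
    eps_uncond = model(x_t, t, torch.full_like(labels, null_label))  # eps_theta(x_t)
    return (1 + w) * eps_cond - w * eps_uncond                       # guided noise estimate
```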

2.4 More Sampling Choices

To explore different sampling methods and their effects on image generation and model performance, we incorporated two additional techniques: DDIM and Epsilon Scaling. DDIM accelerates image generation by redefining the diffusion process as a non-Markovian one, sampling over a subset of timesteps for faster generation without compromising model performance. The reverse diffusion process in DDIM is defined as:

\mathbf{x}_{t-1}=\sqrt{\alpha_{t-1}}\left(\frac{\mathbf{x}_{t}-\sqrt{1-\alpha_{t}}\,\epsilon_{\theta}(\mathbf{x}_{t})}{\sqrt{\alpha_{t}}}\right)+\sqrt{1-\alpha_{t-1}-\sigma^{2}_{t}}\,\epsilon_{\theta}(\mathbf{x}_{t})+\sigma_{t}\epsilon_{t}   (8)

where the variance σ_t is given by:

\sigma_{t}=\eta\sqrt{\frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_{t}}\cdot\beta_{t}}=\eta\sqrt{\tilde{\beta}_{t}}   (9)

In DDIM, setting η=0 eliminates the stochastic noise term, yielding deterministic DDIM sampling, while η=1 recovers the standard DDPM process, allowing interpolation between DDIM and DDPM. To address exposure bias, we employed Epsilon Scaling [34], which scales the epsilon value using a linear function, ensuring consistency between training and sampling, thereby reducing bias and improving sample quality. The epsilon value is scaled as:

\epsilon_{t}=\frac{\epsilon_{\theta}(\mathbf{x}_{t},t)}{\lambda_{t}}   (10)

where \lambda_{t}=kt+b is a linear schedule with slope k and intercept b. The scaled \epsilon_{\theta} is then used in the DDPM update:

\mathbf{x}_{t-1}=\frac{1}{\sqrt{\alpha_{t}}}\left(\mathbf{x}_{t}-\frac{1-\alpha_{t}}{\sqrt{1-\bar{\alpha}_{t}}}\,\boldsymbol{\epsilon}_{t}\right)+\sigma_{t}z   (11)
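The sketch below combines one DDIM reverse step (Eqs. 8-9, reading α_t in Eq. 8 as the cumulative product ᾱ_t, as in the DDIM formulation, so that subsampled timestep sequences are handled) with the epsilon-scaling correction (Eq. 10). The schedule constants k and b are illustrative placeholders, and `alpha_bars` comes from the DDPM sketch in Section 2.1.

```python
import torch

@torch.no_grad()
def ddim_step(model, x_t, t, t_prev, labels, eta=0.0, k=0.0, b=1.0):
    """One reverse step from timestep t to t_prev; eta=0 is deterministic DDIM, eta=1 DDPM."""
    tt = torch.full((x_t.size(0),), t, dtype=torch.long)
    eps = model(x_t, tt, labels) / (k * t + b)                 # epsilon scaling (Eq. 10)
    ab_t, ab_prev = alpha_bars[t], alpha_bars[t_prev]
    sigma = eta * ((1 - ab_prev) / (1 - ab_t) * (1 - ab_t / ab_prev)).sqrt()
    x0_pred = (x_t - (1 - ab_t).sqrt() * eps) / ab_t.sqrt()    # predicted clean image
    direction = (1 - ab_prev - sigma**2).sqrt() * eps          # direction pointing to x_t
    noise = sigma * torch.randn_like(x_t) if eta > 0 else 0.0
    return ab_prev.sqrt() * x0_pred + direction + noise        # Eq. (8)
```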

3 Experiment Details

3.1 KGH Dataset

The KGH (Kingston General Hospital) dataset consists of 1037 WSIs of healthy colon tissue and four types of colonic polyps. The pathological slides are annotated at the Region of Interest (ROI) level. The WSIs are rectangular and vary in size, but are roughly 60k × 100k pixels and occupy approximately 1 GB of memory each. Because of the size of the WSIs and our computational resources, we cannot process the slides as-is; instead, we extract tiles of these images, called patches. To extract patches, we must specify their FOV. We used an FOV of 224 at 20X magnification and resized the resulting patches to 128 × 128.
The KGH dataset contains healthy colon tissue and four different types of colon polyps: Tubular Adenoma (TA), Sessile Serrated Lesion (SSL), Hyperplastic Polyp (HP), and Tubulovillous Adenoma (TVA). HP is a non-neoplastic polyp, whereas the other three types (SSL, TA, and TVA) carry clonal abnormalities at the molecular level and are therefore considered neoplastic. All three neoplastic polyps have the potential to progress to adenocarcinoma.

Figure 3: Samples from Cancer Tissue and Normal WSI

We can visualize patches from these classes in Figure 3. Despite some visual differentiation in certain regions, distinguishing these classes can be challenging without a pathologist's expertise. Understanding them is crucial for drawing accurate comparisons with our generated images, and their subtle visual distinctions provide necessary context for training our models. To manage the high-dimensional nature of these pathology images, we extract smaller sections, referred to as patches, at a specified FOV. We also eliminated noisy images from the original set.

Patches were extracted from the annotated WSIs at specified FOVs. For our experiments, we maintained a consistent patch resolution of 1.75 microns per pixel (mpp). This approach allowed us to observe how different FOVs provide different perspectives of the same tissue at the same magnification. Specifically, we used FOVs of 224 and 336 to create the PKGH dataset. These FOVs were selected so that the patches offer a comprehensive view of the cellular and tissue structures necessary for training generative models, while not being so large that only a limited number of patches remain available for training and classification.
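The relationship between FOV, patch resolution, and pixel size amounts to simple bookkeeping; a small sketch (the 128 × 128 resize step matches our training size):

```python
def pixels_for_fov(fov_um: float, mpp: float) -> int:
    """Pixels needed to cover fov_um microns of tissue at mpp microns per pixel."""
    return round(fov_um / mpp)

# FOV 224 um at 1.75 mpp -> 128 px patches (PKGH_224); FOV 336 um -> 192 px
# patches (PKGH_336), which are then resized to 128 x 128 before training.
for fov in (224, 336):
    print(f"FOV {fov} um -> {pixels_for_fov(fov, mpp=1.75)} px")
```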

3.2 The Architecture

We used a U-Net model based on [21] for our experiments. During training, 128 × 128 images are progressively noised and fed into the U-Net, which predicts the noise that was added. The loss function compares this predicted noise with the actual noise from the stochastic process. Gaussian noise is iteratively added to the original image according to a variance schedule with a large total number of steps (T = 1000). The U-Net also incorporates timestep and class embeddings to learn class-specific features.

We implemented the LDM architecture from Rombach et al. [32], which consists of a Variational Autoencoder (VAE) and the U-Net denoiser, to which we added a class embedding. We used an autoencoder based on vector quantization (a VQ-autoencoder), referred to as VQF4-DM [35].
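As a concrete reference point, a comparable class-conditional U-Net can be assembled with the Hugging Face diffusers library; this sketch approximates the architecture described above and is not the exact configuration used in our experiments.

```python
import torch
from diffusers import UNet2DModel

# Class-conditional U-Net for 128 x 128 patches; num_class_embeds adds a class
# embedding that the model combines with the timestep embedding.
unet = UNet2DModel(
    sample_size=128,
    in_channels=3,
    out_channels=3,
    layers_per_block=2,
    block_out_channels=(128, 256, 256, 512),   # illustrative channel widths
    num_class_embeds=5,                        # Normal, HP, SSL, TA, TVA
)

noisy = torch.randn(4, 3, 128, 128)
timesteps = torch.randint(0, 1000, (4,))
labels = torch.randint(0, 5, (4,))
noise_pred = unet(noisy, timesteps, class_labels=labels).sample   # predicted noise
```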

3.3 Evaluation Metrics

A classic approach to evaluating generative models is to compare their log-likelihoods. This approach, however, has several shortcomings: a model can achieve high likelihood but low image quality, and conversely, low likelihood but high image quality. The two most common GAN [12] evaluation measures are the Inception Score (IS) [36] and the Fréchet Inception Distance (FID) [37].

The Inception Score (IS) relies on a pre-existing classifier (InceptionNet) [38] trained on ImageNet. IS computes the KL divergence between the conditional class distribution and the marginal class distribution over the generated data. IS does not capture intra-class diversity and is insensitive to the prior distribution over labels, and hence is biased towards the ImageNet dataset and the Inception model. Therefore, we do not use it for our dataset, whose content differs substantially from ImageNet; instead, we use FID and KID.

FID [37] calculates the Wasserstein-2 (a.k.a. Fréchet) distance between multivariate Gaussians fitted to the Inception-v3 embeddings of generated and real images. The Kernel Inception Distance (KID) [39] aims to improve on FID by relaxing the Gaussian assumption: KID measures the squared Maximum Mean Discrepancy (MMD) between the Inception representations of real and generated samples using a polynomial kernel. As a non-parametric test, it avoids the strict Gaussian assumption, assuming only that the kernel is a good similarity measure.
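As a sketch of how these metrics can be computed in practice, the torchmetrics implementations may be used as follows; the random uint8 tensors are placeholders standing in for real and generated patches.

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.image.kid import KernelInceptionDistance

fid = FrechetInceptionDistance(feature=2048)    # Inception-v3 pool features
kid = KernelInceptionDistance(subset_size=50)   # MMD over random subsets

real = torch.randint(0, 256, (100, 3, 128, 128), dtype=torch.uint8)  # placeholder
fake = torch.randint(0, 256, (100, 3, 128, 128), dtype=torch.uint8)  # placeholder

for metric in (fid, kid):
    metric.update(real, real=True)
    metric.update(fake, real=False)

print("FID:", fid.compute().item())
kid_mean, kid_std = kid.compute()
print("KID:", kid_mean.item(), "+/-", kid_std.item())
```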

We further use a deep learning classifier to compare original versus generated data and show how synthetic data is applicable in practice. This is explained separately in Section 3.4.3.

3.4 Experiments

3.4.1 Comparative analysis

In our research, we analyze the effects of various Diffusion Generative Models (DGMs) on a pathology dataset, focusing on comparisons rather than finding the single best method. We examine two models, DDPM and LDM, exploring their performance under pathological data conditions.

LDMs, operating in a learned latent space, are more efficient than pixel-based designs, while DDPMs apply diffusion directly to input images, which is useful for capturing complex patterns in medical images. LDMs generate images with fewer timesteps, making them faster for applications requiring rapid synthesis.

We also utilize classifier-free guidance to balance mode coverage and sample fidelity in conditional generation. This technique, while improving sample quality, may increase computational costs. Both DDPM and LDM are trained on the same pathology datasets with classifier-free guidance, allowing us to observe and evaluate their visual outputs and subtle improvements, essential for analyzing large pathological images.

3.4.2 Analysis on different patch size

In this experiment, we introduced a novel hyperparameter, Patch Size, during the image generation phase to observe its effect on the tissue structure in histopathology images. By varying the patch size, we found that it significantly enhances the model’s ability to capture fine details and intricate structures within the generated images, as the model adapts to the provided patch size, effectively learning the resolution of the patches.
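Because the denoising U-Net is, aside from its attention blocks (which accept any spatial size), fully convolutional, prompting generation with a different patch size amounts to starting from initial noise of that spatial size. A sketch follows, reusing `sample` from the framework sketch in Section 2.1, with `model` standing in for the network trained at 128 × 128.

```python
import torch

# Generate patches at sizes the model never saw during training (trained at 128).
generated = {}
for size in (64, 96, 128, 160, 192, 224):
    labels = torch.randint(0, 5, (4,))                    # four random tissue classes
    generated[size] = sample(model, labels, shape=(4, 3, size, size))
```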

\text{FOV}=\text{Patch Size}\times\text{Patch Resolution}

Patch resolution in histopathology images, typically measured in microns per pixel (mpp), determines the level of detail captured relative to the actual tissue size. High patch resolution captures more detailed information, making it ideal for identifying fine morphological features, while lower patch resolution provides a broader view of tissue structures. Understanding and adjusting patch resolution is crucial for optimizing image analysis.

We validated our findings by first confirming the patch resolution used earlier, where the FOV was 224 and the patch size was 128, resulting in a patch resolution of 1.75 microns per pixel (mpp) according to the formula mentioned above. Using this patch resolution (1.75 mpp) and varying the patch sizes, we calculated the corresponding FOVs:

Patch Size (px)   FOV (µm)
64                112
96                168
128               224 (original)
160               280
192               336
224               392
Table 1: Patch sizes and corresponding fields of view at 1.75 mpp.

3.4.3 Evaluation of Synthetic Pathology Dataset

The evaluation of generated pathology images is crucial to determine their quality and similarity to real pathology images. Although evaluation metrics provide valuable information about the quality of generated images, their applicability to real-world use remains uncertain. This study therefore analyzes the performance of the generated dataset on the ResNet-50 architecture [40] alongside the real dataset.
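A sketch of this evaluation setup follows, assuming a `train_loader` that yields batches of patches (real, DGM-generated, or both) with their tissue labels; initializing from ImageNet weights and the hyperparameters shown are illustrative assumptions rather than our exact settings.

```python
import torch
from torchvision import models

classifier = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
classifier.fc = torch.nn.Linear(classifier.fc.in_features, 5)  # five KGH tissue classes
optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-4)
criterion = torch.nn.CrossEntropyLoss()

classifier.train()
for images, labels in train_loader:          # real patches, synthetic patches, or both
    optimizer.zero_grad()
    loss = criterion(classifier(images), labels)
    loss.backward()
    optimizer.step()
```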

4 Results

The Fréchet Inception Distance (FID) scores of our DDPM method demonstrate strong performance when compared with similar studies in the field. These comparisons are crucial for highlighting the effectiveness of our approach in generating high-quality images. Table 2 compares the FID scores of our DDPM method with several studies in the literature, as referenced by [41]. We consider only these studies because they are based on the same DDPM methodology and architecture:

Study Number of Classes FID
Current Study (PKGH_224) 5 19.08
Current Study (PKGH_336) 5 18.45
[41] 5 35.11
[28] 1 20.11
Table 2: Comparison of FID scores with existing studies

When compared to other studies, our DDPM method’s FID scores of 19.08 for the PKGH_224 dataset and 18.45 for the PKGH_336 dataset are competitive, demonstrating superior performance over [41] and comparable results with [28]. These results highlight the importance of conditioning in DDPM and the significant impact of the FOV used. The findings underscore the value of FOV selection to achieve better outcomes in histopathological image generation.

4.1 Comparative Analysis

Figure 4: Real vs Generated Images from the FOV 224 (DDPM and LDM): The top row shows a representative image from the real dataset for each tissue type, while the bottom row displays a generated image of the same tissue.
Figure 5: Real vs Generated Images from the FOV 336 (DDPM and LDM): The top row shows a representative image from the real dataset for each tissue type, while the bottom row displays a generated image of the same tissue.

The analysis focuses on comparing the performance of DDPM and LDM across two datasets: PKGH_224 and PKGH_336.

Table 3 summarizes the results of all methods.

In our study, the PKGH_224 dataset revealed that the DDPM model, with DDPM sampling, achieved an FID score of 19.08 and a KID score of 0.0134, indicating a strong balance of image quality and diversity. In comparison, the LDM with DDPM sampling produced an FID score of 24.43 and a KID score of 0.0185, showing that while LDM’s latent space representation is effective, DDPM slightly outperformed it in this case. When using DDIM sampling, DDPM scored 22.66 (FID) and 0.0154 (KID), while LDM scored 25.56 (FID) and 0.0161 (KID), further demonstrating DDPM’s superior performance across sampling methods.

For the larger PKGH_336 dataset, DDPM again excelled with an FID of 18.45 and a KID of 0.0129 using DDPM sampling, maintaining consistent high performance across different dataset sizes. LDM with DDPM sampling scored 23.10 (FID) and 0.0160 (KID), showing good results but not surpassing DDPM. The DDIM sampling method reinforced this trend, with DDPM scoring 21.34 (FID) and 0.0159 (KID), while LDM scored 26.41 (FID) and 0.0199 (KID).

These results highlight that DDPM consistently outperforms LDM in generating detailed and realistic images, especially in smaller datasets like PKGH_224. This suggests that DDPM is more effective in controlling pixel space and producing high-quality images, making it a robust choice for synthetic pathology image generation across different dataset sizes.

PKGH_224
Training Method Sampling Method FID KID
DDPM DDPM 19.08 0.0134
Epsilon Scaling (s=1.014) 39.88 0.0338
DDIM 22.66 0.0154
LDM DDPM 24.43 0.0185
DDIM 25.56 0.0161
PKGH_336
Training Method Sampling Method FID KID
DDPM DDPM 18.45 0.0129
Epsilon Scaling (s=1.014) 45.12 0.0418
DDIM 21.34 0.0159
LDM DDPM 23.10 0.0160
DDIM 26.41 0.0199
Table 3: FID and KID Scores for Different Models and Sampling Methods on Two Datasets

Figure 6 shows a visual comparison of the DDPM and Epsilon Scaling sampling methods applied to histopathology images. The upper row illustrates output from DDPM, while the lower row showcases Epsilon Scaling results. Both methods can capture larger tissue structures from pathology images, but the DDPM images display more consistent structural details across samples. In contrast, Epsilon Scaling produces slightly varied textures, which could indicate greater flexibility in capturing diverse tissue characteristics. These observations suggest that while both methods are effective, DDPM may be more reliable for producing uniformly detailed images, whereas Epsilon Scaling captures a broader range of textural variations. This could also explain the higher FID score of the Epsilon Scaling method.

Figure 6: Comparison of generated pathology images using DDPM (top row) and Epsilon Scaling (bottom row).

4.2 Analysis on different patch size

Figure 7: This series of images demonstrates the capability of our pre-trained model to generate histology slices at various resolutions, from 64 × 64 to 224 × 224. Each step up in resolution reveals more detail, illustrating the model's ability to enhance image clarity from a single training setup.
Figure 8: Patches generated at different patch sizes, resembling different cell structures, for the five classes (128 × 128 is the patch size used for diffusion model training).
Input            FOV (µm)   Patch size (px)   FID
Original Patch   224        128 × 128         19.08
Small Patch      112        64 × 64           161.01
                 168        96 × 96           33.71
Large Patch      280        160 × 160         25.71
                 336        192 × 192         38.41
                 392        224 × 224         41.37
Table 4: Comparison of FID scores across different fields of view and patch sizes.

The results indicate that the original patch size, with a FOV of 224 and a patch resolution of 1.75 mpp, achieved the lowest FID score of 19.08, reflecting the highest fidelity in generated images; this is expected, as the model was trained at this setting. However, examining the FID scores for other patch sizes, especially those not used during training, provides insight into the model's ability to generate detailed images at different patch sizes.

Figure 9: Patches generated at patch size 64 × 64, which has the highest FID score.

This model was trained with a patch size of 128, and we experimented with generating images at both smaller and larger patch sizes. The table highlights the variation in image fidelity across patch sizes and FOVs. Smaller patch sizes, 64 × 64 and 96 × 96, exhibit higher FID scores of 161.01 and 33.71 respectively, indicating lower image fidelity. As the patch size increases to 160 × 160 and 192 × 192, the FID scores reach 25.71 and 38.41 respectively, suggesting better image quality but with some variability. The 224 × 224 patch size, while larger, shows a further decrease in fidelity with an FID score of 41.37. The visual representations in Figure 8 demonstrate that larger patch sizes generally capture more detailed, high-level cellular structures across the classes, enhancing the potential utility of synthetic data in diagnostic workflows.

The high FID score at the 64 × 64 patch size can be attributed to the fact that images generated at this resolution are often blurred and lack sufficient detail, as shown in Figure 9: at such a low resolution, the model struggles to accurately generate the fine patterns and intricate structures that are critical in pathology images, resulting in poor-quality images and a significantly higher FID score. In contrast, the 128 × 128 patch size allows the model to generate clearer images with more recognizable features and patterns, which contributes to a lower FID score. The blurriness at patch size 64 × 64 reflects the model's limitation in generating complex, high-fidelity images at this lower resolution.

Interestingly, even without training on certain pixel levels, the model can still perform well and generate quality images, demonstrating its robustness and versatility. This suggests that the generative model has the capacity to generalize beyond its training data to some extent, making it a valuable tool in scenarios where training data are scarce or diverse. This experiment highlights the model’s adaptability and versatility.

4.3 Evaluation on Synthetic Dataset

The table below summarizes the classification performance on the KGH dataset under different scenarios, providing a clear understanding of how the use of generated data impacts model accuracy.

Dataset ACC (%)
Real Dataset (PKGH_224) 89.95
Generated Dataset (PKGH_224) 88.62
Real + Generated (PKGH_224) 90.75
Real Dataset (PKGH_336) 94.06
Generated Dataset (PKGH_336) 92.44
Real + Generated (PKGH_336) 90.76
Table 5: Classification accuracy summary for PKGH_224 and PKGH_336: classification accuracy of a ResNet-50 on synthetic data. Higher accuracy demonstrates the effectiveness of DGM-generated (DDPM) synthetic samples in capturing significant features.

The results reveal notable differences in model accuracy between real and generated datasets for both PKGH_224 and PKGH_336. When trained on the real PKGH_224 dataset, the ResNet-50 model achieved an accuracy of 89.95%. However, training solely on the generated PKGH_224 dataset resulted in a slight drop in accuracy to 88.62%. Interestingly, augmenting the real dataset with generated data led to an improvement in test accuracy to 90.75%, indicating the added value of synthetic data in enhancing model performance.

For the PKGH_336 dataset, the model reached an accuracy of 94.06% on the real dataset, and 92.44% when trained on the generated dataset alone. However, combining real and generated data slightly decreased the accuracy to 90.76%. Despite this, the PKGH_336 dataset consistently demonstrated high performance, underscoring the robustness of the data, especially when leveraging the generated samples.

The variations in accuracy between PKGH_224 and PKGH_336 can be attributed to differences in the Field of View (FOV) of the datasets. The larger FOV in PKGH_336 captures more contextual information and intricate details, which likely enhances the model’s ability to learn and generalize from the data, contributing to its superior performance.

These findings highlight the potential benefits of incorporating synthetic data to boost model performance, particularly when real data is scarce or challenging to obtain. The overall accuracy scores affirm the effectiveness of DGM-generated synthetic samples in capturing essential features and improving the classification task.

5 Conclusion

In conclusion, we successfully tackled key challenges in histopathology by demonstrating the practical applications of diffusion generative models (DGMs) in medical imaging, particularly in generating high-quality synthetic datasets. The study confirmed that DGMs effectively learn from different patch resolutions, with larger patches providing superior results. While DDPM and LDM showed comparable performance despite their architectural differences, DDPM excelled in pixel space, and LDM generated high-quality images with fewer steps. Using two datasets, we found that larger FOV values yielded better FID scores and higher classification accuracy. This research highlights the potential of DGMs to enhance the robustness and accuracy of deep learning models in computational pathology, setting the stage for future advancements in the field.

6 Future Work

Future work will explore the potential of diffusion models to identify and analyze biomarkers for various diseases [5]. This involves developing methodologies to enhance the accuracy and reliability of biomarker discovery, leveraging the advanced capabilities of diffusion models to generate subtle and critical features in medical images that are indicative of specific biomarkers. Future research can also implement multi-class labeling techniques to enhance the classification of histopathology images, conditioning diffusion models on multi-class labels to achieve more precise image analysis and interpretation. The aim is to improve the accuracy and robustness of classifiers in medical imaging, leading to better diagnostic and analytical outcomes.

These research directions aim to expand the current understanding and application of diffusion models in medical imaging and biomarker discovery. They will contribute to developing more accurate and privacy-sensitive healthcare solutions.

Acknowledgments

The data collected for this study is supported by Huron Digital Pathology and an Ontario Molecular Pathology Research Network (OMPRN) funding grant. We thank Resources for Research Groups (RRG) – Digital Research Alliance Canada (DRAC) and the Computer Science Cluster at Concordia University for providing essential computational resources. Funding for this research is supported by NSERC Discovery Grant RGPIN/05378-2022 and a Concordia University FRDP Grant.

References

  • [1] Stephan W Jahn, Markus Plass, and Farid Moinfar. Digital pathology: advantages, limitations and emerging perspectives. Journal of clinical medicine, 9(11):3697, 2020.
  • [2] Vipul Baxi, Robin Edwards, Michael Montalto, and Saurabh Saha. Digital pathology and artificial intelligence in translational medicine and clinical practice. Modern Pathology, 35(1):23–32, 2022.
  • [3] Andrew Janowczyk and Anant Madabhushi. Deep learning for digital pathology image analysis: A comprehensive tutorial with selected use cases. Journal of pathology informatics, 7(1):29, 2016.
  • [4] Jeroen Van der Laak, Geert Litjens, and Francesco Ciompi. Deep learning in histopathology: the path to the clinic. Nature medicine, 27(5):775–784, 2021.
  • [5] Amelie Echle, Niklas Timon Rindtorff, Titus Josef Brinker, Tom Luedde, Alexander Thomas Pearson, and Jakob Nikolas Kather. Deep learning in cancer pathology: a new generation of clinical biomarkers. British journal of cancer, 124(4):686–696, 2021.
  • [6] Junghwan Cho, Kyewook Lee, Ellie Shin, Garry Choy, and Synho Do. How much data is needed to train a medical image deep learning system to achieve necessary high accuracy? arXiv preprint arXiv:1511.06348, 2015.
  • [7] W Nicholson Price and I Glenn Cohen. Privacy in the age of medical big data. Nature medicine, 25(1):37–43, 2019.
  • [8] Nati Daniel, Eliel Aknin, Ariel Larey, Yoni Peretz, Guy Sela, Yael Fisher, and Yonatan Savir. Between generating noise and generating images: Noise in the correct frequency improves the quality of synthetic histopathology images for digital pathology. In 2023 45th Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC), pages 1–7, 2023.
  • [9] Mahdi S Hosseini, Babak Ehteshami Bejnordi, Vincent Quoc-Huy Trinh, Lyndon Chan, Danial Hasan, Xingwen Li, Stephen Yang, Taehyo Kim, Haochen Zhang, Theodore Wu, et al. Computational pathology: a survey review and the way forward. Journal of Pathology Informatics, page 100357, 2024.
  • [10] Jogile Kuklyte, Jenny Fitzgerald, Sophie Nelissen, Haolin Wei, Aoife Whelan, Adam Power, Ajaz Ahmad, Martyna Miarka, Mark Gregson, Michael Maxwell, et al. Evaluation of the use of single-and multi-magnification convolutional neural networks for the determination and quantitation of lesions in nonclinical pathology studies. Toxicologic Pathology, 49(4):815–842, 2021.
  • [11] Richard J Chen, Ming Y Lu, Tiffany Y Chen, Drew FK Williamson, and Faisal Mahmood. Synthetic data in machine learning for medicine and healthcare. Nature Biomedical Engineering, 5(6):493–497, 2021.
  • [12] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. Communications of the ACM, 63(11):139–144, 2020.
  • [13] Youbao Tang, Yuxing Tang, Yingying Zhu, Jing Xiao, and Ronald M Summers. A disentangled generative model for disease decomposition in chest x-rays via normal image synthesis. Medical Image Analysis, 67:101839, 2021.
  • [14] Tao Zhou, Huazhu Fu, Geng Chen, Jianbing Shen, and Ling Shao. Hi-net: hybrid-fusion network for multi-modal mr image synthesis. IEEE transactions on medical imaging, 39(9):2772–2781, 2020.
  • [15] Adalberto Claudio Quiros, Roderick Murray-Smith, and Ke Yuan. Pathologygan: Learning deep representations of cancer tissue. arXiv preprint arXiv:1907.02644, 2019.
  • [16] Ansh Kapil, Armin Meier, Aleksandra Zuraw, Keith E Steele, Marlon C Rebelatto, Günter Schmidt, and Nicolas Brieu. Deep semi supervised generative learning for automated tumor proportion scoring on nsclc tissue needle biopsies. Scientific reports, 8(1):17343, 2018.
  • [17] Kianoush Falahkheirkhah, Saumya Tiwari, Kevin Yeh, Sounak Gupta, Loren Herrera-Hernandez, Michael R McCarthy, Rafael E Jimenez, John C Cheville, and Rohit Bhargava. Deepfake histologic images for enhancing digital pathology. Laboratory Investigation, 103(1):100006, 2023.
  • [18] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.
  • [19] Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. Advances in neural information processing systems, 32, 2019.
  • [20] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020.
  • [21] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Advances in neural information processing systems, 34:8780–8794, 2021.
  • [22] Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In International Conference on Machine Learning, pages 8162–8171. PMLR, 2021.
  • [23] Yang Song and Stefano Ermon. Improved techniques for training score-based generative models. Advances in neural information processing systems, 33:12438–12448, 2020.
  • [24] Haoying Li, Yifan Yang, Meng Chang, Shiqi Chen, Huajun Feng, Zhihai Xu, Qi Li, and Yueting Chen. Srdiff: Single image super-resolution with diffusion probabilistic models. Neurocomputing, 479:47–59, 2022.
  • [25] Hyun-Jic Oh and Won-Ki Jeong. Diffmix: Diffusion model-based data synthesis for nuclei segmentation and classification in imbalanced pathology image datasets. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 337–345. Springer, 2023.
  • [26] Xizewen Han, Huangjie Zheng, and Mingyuan Zhou. Card: Classification and regression diffusion models. Advances in Neural Information Processing Systems, 35:18100–18115, 2022.
  • [27] Srikar Yellapragada, Alexandros Graikos, Prateek Prasanna, Tahsin Kurc, Joel Saltz, and Dimitris Samaras. Pathldm: Text conditioned latent diffusion model for histopathology. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 5182–5191, 2024.
  • [28] Puria Azadi Moghadam, Sanne Van Dalen, Karina C Martin, Jochen Lennerz, Stephen Yip, Hossein Farahani, and Ali Bashashati. A morphology focused diffusion probabilistic model for synthesis of histopathology images. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 2000–2009, 2023.
  • [29] Alexandros Graikos, Srikar Yellapragada, Minh-Quan Le, Saarthak Kapse, Prateek Prasanna, Joel Saltz, and Dimitris Samaras. Learned representation-guided diffusion models for large-image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8532–8542, 2024.
  • [30] Ajay Basavanhally, Shridar Ganesan, Natalie Shih, Carolyn Mies, Michael Feldman, John Tomaszewski, and Anant Madabhushi. A boosted classifier for integrating multiple fields of view: Breast cancer grading in histopathology. In 2011 IEEE International Symposium on Biomedical Imaging: From Nano to Macro, pages 125–128. IEEE, 2011.
  • [31] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International conference on machine learning, pages 2256–2265. PMLR, 2015.
  • [32] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022.
  • [33] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022.
  • [34] Mang Ning, Mingxiao Li, Jianlin Su, Albert Ali Salah, and Itir Onal Ertugrul. Elucidating the exposure bias in diffusion models. arXiv preprint arXiv:2308.15321, 2023.
  • [35] Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning. Advances in neural information processing systems, 30, 2017.
  • [36] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. Advances in neural information processing systems, 29, 2016.
  • [37] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017.
  • [38] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1–9, 2015.
  • [39] Mikołaj Bińkowski, Danica J Sutherland, Michael Arbel, and Arthur Gretton. Demystifying mmd gans. arXiv preprint arXiv:1801.01401, 2018.
  • [40] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [41] Matteo Pozzi, Shahryar Noei, Erich Robbi, Luca Cima, Monica Moroni, Enrico Munari, Evelin Torresani, and Giuseppe Jurman. Generating synthetic data in digital pathology through diffusion models: a multifaceted approach to evaluation. medRxiv, pages 2023–11, 2023.