¹¹institutetext: Australian Institute of Health Innovation (AIHI), Macquarie University, Macquarie Park, NSW, 2109, Australia ²²institutetext: School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China ³³institutetext: National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University, Beijing, 100871, China ⁴⁴institutetext: School of Computer Science and Engineering, University of New South Wales, Australia

Dataset Distillation for Histopathology Image Classification

Cong Cong 11 Shiyu Xuan 2233 Sidong Liu 11 Maurice Pagnucco 44 Shiliang Zhang 33 Yang Song 44

Abstract

Deep neural networks (DNNs) have exhibited remarkable success in the field of histopathology image analysis. On the other hand, the contemporary trend of employing large models and extensive datasets has underscored the significance of dataset distillation, which involves compressing large-scale datasets into a condensed set of synthetic samples, offering distinct advantages in improving training efficiency and streamlining downstream applications. In this work, we introduce a novel dataset distillation algorithm tailored for histopathology image datasets (Histo-DD), which integrates stain normalisation and model augmentation into the distillation progress. Such integration can substantially enhance the compatibility with histopathology images that are often characterised by high colour heterogeneity. We conduct a comprehensive evaluation of the effectiveness of the proposed algorithm and the generated histopathology samples in both patch-level and slide-level classification tasks. The experimental results, carried out on three publicly available WSI datasets, including Camelyon16, TCGA-IDH, and UniToPath, demonstrate that the proposed Histo-DD can generate more informative synthetic patches than previous coreset selection and patch sampling methods. Moreover, the synthetic samples can preserve discriminative information, substantially reduce training efforts, and exhibit architecture-agnostic properties. These advantages indicate that synthetic samples can serve as an alternative to large-scale datasets.

Keywords:

Histopathology Image classification Dataset distillation

1 Introduction

Histopathology image analysis plays a pivotal role in delivering critical insights for cancer diagnosis and prognosis [20]. The advent of advanced image scanning technology, particularly Whole-Slide Imaging (WSI), which digitises entire histopathology slides, has spurred considerable interest among researchers to develop computer-aided systems in this domain. Notably, deep learning-based approaches have emerged as a focal point of investigation across various applications, encompassing histopathology image preprocessing (e.g., techniques for staining normalisation [3, 9]) and subsequent analyses, including histopathology image classification [38, 52, 54], segmentation [16, 17], and survival prediction [14].

Given the immense size of WSIs, it is impractical to input an entire WSI into a model for training. Therefore, most previous approaches partition WSIs into smaller patches and subsequently consolidate the results from these patches to derive slide-level conclusions. Thus, a cluster of studies has emerged with the aim of enhancing patch-level analysis [25, 45, 10]. Nonetheless, to ensure the accuracy of patch-level analysis, precise annotations at the patch level are typically required. However, patch-level annotations are unavailable in many scenarios, and only slide-level labels can be accessed. Consequently, an alternative line of research has emerged; treating the analysis of WSIs as a weakly-supervised learning task. Many of these methodologies rely on Multiple Instance Learning (MIL) frameworks [5, 38, 59, 52, 44]. Though effective, these methods necessitate training on vast numbers of patches, imposing significant computational and storage demands. One natural question to ask is: Do we really need such huge amounts of patches to obtain an effective model for WSI analysis? In fact, many MIL-based models already adopt some patch selection criteria during feature aggregation. These methods [6, 53, 52] select a small group of key patches based on their importance to the task or feature similarities. Although these strategies are effective, the criteria of key patch selection are often heuristic. In contrast, our objective is to distill the vast patch dataset using a deep learning approach, so that the original dataset can be represented by a very small set of synthetic samples while minimising information loss.

To obtain a highly condensed dataset, coreset selection methods have been introduced [37, 32]. These methods aim to curate a small set of samples using heuristics such as representation compactness [39] and forgetfulness [47]. While these methods offer computational efficiency, they encounter challenges, such as the absence of a guarantee that the selected samples are representative or optimal for downstream tasks [58]. More recently, [50] proposed dataset distillation to synthesise informative samples such that the model trained on them can perform similarly to that trained on the full training set. However, despite the success of dataset distillation methods in demonstrating performance guarantees using synthetic samples, their applicability to histopathology datasets has remained largely unexplored. This can be attributed to the unique challenges posed by histopathology images. Firstly, histopathology slides are incredibly large, often reaching gigapixel sizes. Secondly, these images can vary significantly in terms of tissue structures, staining, and cellular morphology.

To this end, we introduce a dataset distillation algorithm for histopathology images (Histo-DD). Rather than directly generating synthetic slides, Histo-DD generates samples from cropped patches. The synthetic patches, derived from the same slides, can be aggregated to create synthetic slides for slide-level analysis. The core idea of Histo-DD is that the model trained on the synthesised patches should provide similar performances to the one trained on the full dataset. To do this, we explicitly minimise the differences in layer-wise gradients between the model trained on synthetic samples and its counterpart trained on the complete dataset. To enhance the resilience of synthetic samples across various models and optimise training efficiency, we adopt model augmentation [55] to train synthetic samples. However, the inherent heterogeneity in stain colours can pose a challenge to effectively training synthetic samples. Drawing inspiration from the stain normalisation methods [30, 9], we propose integrating these techniques to mitigate this issue. One way to achieve this objective is to normalise the dataset as a pre-processing step, however, this strategy only brings marginal performance gains (shown in Section 5.2). In contrast, we resort to Differentiable Siamese Augmentation [56] and propose to integrate a differentiable stain normalisation as outlined in [30], along with other standard data augmentation techniques into the training procedure. Our results demonstrate that this integration substantially enhances the quality of the distilled samples and once these samples are learned, they can be used with stain normalisation algorithm to address colour variation issues in histopathology images.

We illustrate that the distilled samples obtained by Histo-DD can be seamlessly used for downstream patch and slide-level tasks, delivering comparable performance to models trained on the full dataset while largely reducing the training effort and only using less than 1% of the entire patch population. Furthermore, we explore the utility of our distilled samples in a real-life application: cross-centre transfer learning. Results on the TCGA-IDH [26], Camelyon16 [2] and UniToPath [1] datasets demonstrate that the synthetic patches retain most discriminative information contained in the original large-scale patch datasets, making them a viable substitute for the latter. To our knowledge, this study represents the first attempt to design a dataset distillation method for histopathology images, opening up new avenues for research and applications in this domain. Specifically, our contributions are summarised as follows:

•

We present the first study in dataset distillation for histopathology images, where we leverage a deep learning-based approach to generate a highly condensed set of synthetic patches. Compared to heuristic-based patch sampling or coreset selection methods, such learned synthetic samples offer a more encompassing representation of the rich information contained in the original dataset since they are directly optimised for the downstream tasks. Due to their small scale, these synthetic samples can then be efficiently utilised to train various deep-learning models for downstream analysis.
•

We propose to integrate differentiable stain normalisation into the learning process of synthetic patches. This enhancement improves the compatibility of dataset distillation algorithms specifically tailored for histopathology images.
•

We show a broad usage of the synthetic samples including patch-level classification, slide-level classification, and cross-centre in-domain transfer learning. Using synthetic samples on these tasks demonstrates a significant improvement in efficiency and obtains promising results.

2 Related Work

2.1 Coreset Selection

Coreset selection methods aim to find the most informative subset of the entire dataset, in which the model trained on it has close performance with the one trained on the full dataset. Selection of coreset instances can be conducted according to various criteria. For example, geometry-based approaches select key samples based on their distance to the coreset centre in the feature space [51, 15]. Moreover, error-based methods are based on the assumption that samples contributing more to the model’s loss are crucial for training. Accordingly, samples that result in higher errors or gradients are selected for training [47, 36]. Furthermore, recent studies [13, 31] tend to select samples that are closer to the decision boundary, as they are considered more challenging to classify. It is worth noting that most of these methods rely on heuristic selection criteria for incrementally choosing important data points. However, these heuristics cannot guarantee the optimality of the selected subset for training machine learning models [50].

2.2 Dataset Distillation

Dataset distillation aims to reduce the size of a large-scale dataset to a small set of synthetic samples, such that the models trained on these synthetic samples can perform similarly to the ones trained on the original large dataset [50]. Unlike coreset selection, dataset distillation modifies the selected samples, thus removing the restrictions imposed by uneditable elements and preserving a greater amount of information [21]. Based on different learning approaches, the current dataset distillation methods can be roughly divided into two categories. The first category approaches dataset distillation as a meta-learning task. In this context, the distilled data are formulated as learnable parameters that can be refined through a nested optimisation loop. The nested loop encompasses an extensive inner loop that unrolls an optimisation graph, which is notably computationally demanding [50]. Deng et al. [11] improved this unrolling process by adding a momentum term for extending the length of unrolled trajectories. Moreover, Nguyen et al. [33] approximate the inner-loop model optimisation by posing the classification task as a ridge regression problem. The synthetic set serves as the support set, while the original set acts as the target set. This method is further improved by Loo et al. [28] which adopts a more flexible conjugate kernel for approximating the output of a neural network.

Rather than optimising directly with respect to model performance, the second group of works uses surrogate objective approaches that focus on optimising an alternative objective. For example, Zhao et al. [58] propose to synthesise samples by minimising the layer-wise gradient differences. They also propose to optimise synthetic samples such that their data distribution is similar to that of real samples [57]. Moreover, the CAFE framework [49] employs a layer-wise feature alignment technique to enhance the learning of synthetic samples. Beyond these two primary categories, there are several other notable efforts in dataset distillation. Zhao et al. [56] explore the effectiveness of data augmentation in synthetic sample learning, and Zhang et al. [55] propose Acc-DD which adopts model augmentation to largely improve the training efficiency of dataset distillation algorithms. Recently, Liu et al. [27] proposed the DREAM framework which adopts clustering algorithms to select only representative samples for matching during optimisation, and Sajedi et al. [41] learn synthetic images by aligning the spatial attention maps of real and synthetic data.

Recently, dataset distillation algorithms have also been studies in the medical image domain. For example, Tian et al. [46] propose a generalisable dataset distillation algorithm to reduce communication costs in federated learning for skin lesion analysis. Moreover, Li et al. [23, 24] adopt a gradient matching algorithm to generate condensed sets of X-ray or gastric images. Their research demonstrates that these condensed images exhibit significantly enhanced efficiency and improved security when sharing data. However, the application of dataset distillation algorithms has not been explored in histopathology images.

3 Methods

We present a dataset distillation algorithm (Histo-DD) designed to generate synthetic samples from a WSI dataset. Given a set of $N$ WSI slides each containing $T_{n}$ sampled patches with $n$ denoting the slide’s index, depending on the given task, our algorithm can be applied directly on these $\sum_{1}^{N}T_{n}$ patches to obtain a highly condensed size of $m$ synthetic patches per class for a patch-level classification task, with $m\ll\sum_{1}^{N}T_{n}$ . Additionally, we can apply our algorithm to each of these $N$ slides in which we distil from $T_{n}$ patches to just $m$ synthetic patches per class (i.e., $N\times m$ synthetic patches in total) for a slide-level classification task. The overall algorithm is described in Algorithm 1 in which $\mathcal{P}$ denotes the set of synthetic patches and $\mathcal{D}$ denotes the original patches. By optimising the patch classifier, we establish a link between the optimisation objectives for both the original and distilled datasets, facilitating knowledge distillation from $\mathcal{D}$ to $\mathcal{P}$ . To mitigate the challenges posed by high colour heterogeneity in histopathology datasets, we introduce stain normalisation as a data augmentation technique to further enhance the quality of the learned synthetic patches.

Data: Training set

\mathcal{D}

, random initialised synthetic samples

\mathcal{P}

, a set of pre-trained models

\mathcal{S}

, differentiable augmentation

Aug(\cdot)

, # of distillation steps

K

, # of steps for model updates

U

, and the number of classes

C

for $k=0,\cdots,K-1$ do

Randomly select

\mathcal{M}_{\theta}

from

\mathcal{S}

;

Perform model augmentation on

\mathcal{M}_{\theta}

;

for $u=0,\cdots,U-1$ do

for $c=0,\cdots C$ do

Sample a batch

\mathcal{D}^{c}

\mathcal{P}^{c}

from

\mathcal{D}

and

\mathcal{P}

, respectively;

\mathcal{D}_{A}^{c}\leftarrow Aug(\mathcal{D}^{c})

;

\mathcal{P}_{A}^{c}\leftarrow Aug(\mathcal{P}^{c})

;

Calculate loss

L_{\mathcal{D}_{A}^{c}}\leftarrow\mathcal{L}(\mathcal{D}_{A}^{c},\mathcal{M}_{\theta})

;

Calculate loss

L_{\mathcal{P}_{A}^{c}}\leftarrow\mathcal{L}(\mathcal{P}_{A}^{c},\mathcal{M}_{\theta})

;

Update

\mathcal{P}_{A}^{c}\leftarrow\mathop{\min}_{\mathcal{P}_{A}^{c}}\sum_{l=0}^{L}Dis(\nabla_{k}L_{\mathcal{D}_{A}^{c}},\nabla_{k}L_{\mathcal{P}_{A}^{c}})

;

end for

Update

\mathcal{M}_{\theta}\leftarrow\mathcal{L}(\mathcal{D},\mathcal{M}_{\theta})

;

end for

Algorithm 1 Histo-DD algorithm for learning synthetic histopathology patches.

3.1 Synthetic Patch Learning

Recent studies [56, 58] show that gradient matching has attained state-of-the-art performance in the task of dataset distillation. Consequently, in this study, we employ gradient matching as our optimisation algorithm which adopts a nested loop optimisation to learn $\mathcal{P}$ from $\mathcal{D}$ . Specifically, the optimisation includes an outer loop (OL), which runs for $K$ steps, for synthetic sample optimisation, and an inner loop (IL), which runs for $U$ steps, for updating the classification model.

In each OL, a classification model $\mathcal{M}_{\theta}$ (e.g., ResNet, ConvNet, etc.) is chosen. To ensure the learned synthetic samples can generalise well to different $\mathcal{M}_{\theta}$ , previous studies require a significant number of iterations in OL. This can pose a notable challenge for histopathology images due to the abundance of patches derived from slides, resulting in a substantial increase in training iterations. Thus, we employ model augmentation [55] in OL which ensures the generalisation capacity of the synthetic samples while largely reducing training efforts.

In each loop of IL, we first sample a batch of real samples ( $\mathcal{D}^{c}$ ) and synthetic samples ( $\mathcal{P}^{c}$ ) from each class $c$ . To improve the generalisation performance of $\mathcal{D}$ , we apply Differentiable Siamese Augmentation (DSA) [56] during training. Meanwhile, to effectively tackle the high colour heterogeneity issue of histopathology images, we further integrate stain normalisation into the DSA process. Specifically, given a differentiable augmentation operation $Aug(\cdot)$ , we obtain $\mathcal{D}_{A}^{c}$ and $\mathcal{P}_{A}^{c}$ via $Aug(\mathcal{D}^{c})$ and $Aug(\mathcal{P}^{c})$ . Then $\mathcal{P}_{A}^{c}$ are optimised using the following equation:

\mathop{\min}_{\mathcal{P}_{A}^{c}}\sum_{c=0}^{C}Dis(\nabla_{k}\mathcal{L}(\mathcal{D}_{A}^{c},\mathcal{M}_{\theta}),\nabla_{k}\mathcal{L}(\mathcal{P}_{A}^{c},\mathcal{M}_{\theta}))

(1)

where $C$ is the total number of layers of a model, $Dis(\cdot,\cdot)$ is a distance function and $\mathcal{L}(\cdot,\cdot)$ denotes the loss function used for train the $\mathcal{M}_{\theta}$ . In our implementation, we use $L$ 1 loss for calculating $Dis(\cdot,\cdot)$ and cross-entropy for $\mathcal{L}(\cdot,\cdot)$ .

Model augmentation

Given the typically vast number of patches in histopathology datasets, naively adopting gradient match algorithms is time-consuming since the size of OL is usually large to ensure the generalization ability of synthetic samples. Drawing inspiration from [55], we employ model augmentation to enhance the generalization capacity of synthetic samples while significantly reducing the need for a high number of OLs. Specifically, we first pre-train a set ( $\mathcal{S}$ ) of networks with different hyperparameters. In detail, we train a patch-level classifier 5 times with different weight initialisations and then collect these five trained models to form $\mathcal{S}$ . During the outer loop, we randomly select one pre-trained model ( $\mathcal{M}_{\theta}$ ) from $\mathcal{S}$ . Then we apply model augmentation via perturbing the model parameter $\theta$ with a small value $d$ sampled from a Gaussian distribution $\mathcal{N}(0,1)$ . We normalise $d$ to make it compatible with different filters at different layers:

d_{l,j}\leftarrow\frac{d_{l,j}}{\left\|d_{l,j}\right\|_{F}}\left\|w_{l,j}\right\|_{F}

(2)

where $l,j$ denotes the $j$ -th filter in $l$ -th layer of $d$ , $w_{l,j}$ denotes the corresponding parameters and $\left\|\cdot\right\|$ denotes the Frobenius norm. Then, we can apply $d$ to $\theta$ via:

\theta\leftarrow\theta+\alpha\times d

(3)

and we use $\alpha$ to control the degree of perturbation. Applying model augmentation reduces the required number of outer loops which improves efficiency.

Data augmentation with stain normalisation

To further improve the generalisation performance of synthetic samples and alleviate the high colour heterogeneity issue of histopathology images, we integrate DSA with stain normalisation. By solving Eq. 1, we derive the following equation:

\frac{\partial Dis(\cdot)}{\partial\mathcal{P}_{A}^{c}}=\frac{\partial Dis(\cdot)}{\partial\nabla_{k}\mathcal{L}(\cdot)}\cdot\frac{\partial\nabla_{k}\mathcal{L}(\cdot)}{\partial Aug(\cdot)}\cdot\frac{\partial Aug(\cdot)}{\partial\mathcal{P}_{A}^{c}}

(4)

Eq. 4 shows that $Aug(\cdot)$ needs to be differentiable with regards to $\mathcal{P}_{A}^{c}$ . Following DSA, standard data augmentation can be reformulated as differentiable functions for signal back-propagation to $\mathcal{P}_{A}^{c}$ . However, it is implicit to reformulate existing stain normalisation approaches into differentiable functions compatible with DSA in a memory-efficient manner. Although trained deep learning-based stain normalisation methods [43, 42, 8, 9] can be used as black-box models for normalisation, they are sub-optimal as they necessitate additional model training and extra memory to use them during training. In contrast, conventional stain normalisation methods [40, 30, 48], which pose stain normalisation as a stain separation task, can be used without training. Thus, to optimise for both computational efficiency and differentiability, we opt for a conventional approach [30] and implement a differentiable version to fit into the training process. In particular, we chose a template image whose pixel mean and standard deviation are closest to the overall mean and standard deviation of the dataset and then computed its stain vector and colour concentration offline. During training, stain normalisation augmentation simply involves matrix multiplication to normalise the image with the template colour appearances using the obtained stain vector and concentration, ensuring overall differentiability.

4 Experiments

4.1 Dataset Details

We trained and evaluated our proposed dataset distillation method (Histo-DD) on three medical datasets, including TCGA-IDH [26], Camelyon16 [2], and UniToPath [1]. Further details regarding the quantity of train/test images can be found in the supplementary material.

TCGA-IDH [26] is used for predicting the isocitrate dehydrogenase (IDH) gene mutation status as it is an important diagnostic, prognostic, and therapeutic biomarker in glioma [35]. The dataset comprises two classes: Mutant (MU) and Wild Type (WT). We split the slides into train/test partitions following [26]. The training set comprises 1,191 slides, with 477 being MU and 714 being WT. Additionally, the testing dataset includes 59 MU slides and 90 WT slides. Within each slide, patches were cropped at a size of 1024 $\times$ 1024, captured at a 10 $\times$ magnification level. Only patches with over 50% tissue content were considered for both training and evaluation. The ground truth labels for these patches were derived from the slide-level labels.

Camelyon16 [2] is used for classifying breast histopathology as normal or tumour. It contains 399 slides of which 270 slides (159 normal vs 111 tumour) are used for training and 129 slides (80 normal vs 49 tumour) are used for testing. Specifically, we first apply the Otsu algorithm [34] to filter out the background regions of each slide and then densely extract patches of size 256 $\times$ 256 pixels at 5 $\times$ magnification level. Since the detailed pixel annotations are provided, we have labelled patches extracted from normal regions as ‘normal’ and those from tumour regions as ‘tumour’.

UniToPath [1] is used for colorectal polyps type classification (PT) and adenoma dysplasia grading (DG). It contains 9,536 patches extracted from 292 WSIs at 20 $\times$ magnification. For PT, each slide is annotated by expert UniTo pathologists, and is classified into six classes: normal tissue (NORM), hyperplastic polyp (HP), tubular adenoma high-grade dysplasia (TA.HG), tubular adenoma low-grade dysplasia (TA.LG), tubulo-villous adenoma, high-grade dysplasia (TVA.HG), and tubulo-villous adenoma, low-grade dysplasia (TVA.LG). We follow the official splits which use 208 slides (13 NORM, 31 HP, 16 TA.HG, 100 TA.LG, 13 TVA.HG and 30 TVA.LG) for training and 88 slides (8 NORM, 10 HP, 9 TA.HG, 46 TA.LG, 7 TVA.HG and 8 TVA.LG) for testing. Moreover, for DG, which is a binary classification task to predict the grade (LG or HG) of TA and TVA adenomas. We have 160 training slides (130 LG and 30 HG) and 70 testing slides (54 LG and 16 HG).

4.2 Experimental Setup

To evaluate the effectiveness of our dataset distillation algorithm, we compared it with coreset selection methods on the patch classification task and we provide ablation studies about the usefulness of stain normalisation in learning synthetic samples from a histopathology dataset. Moreover, to more thoroughly assess the efficacy of Histo-DD, we have carried out comprehensive experiments focusing on the application of synthetic patches. These experiments encompassed a broad range of applications including cross-architecture evaluation, patch-level classification, slide-level classification, and in-domain cross-centre transfer learning. We conduct our experiments using a two-stage approach. In the first stage, we focus on learning synthetic samples from the training datasets. Following previous works [58, 57, 29, 55], we employed ResNetAP-10 as our choice for dataset distillation. ResNetAP-10 is a modified version of ResNet-10, where the strided convolution is replaced with average pooling for downsampling. This modification enables more detailed gradients for synthetic images, leading to an enhancement in the quality of synthetic samples. In the second stage, we use the learned synthetic samples to train a classification model and test it on the original test set. We apply data augmentation for training in both stages. By evaluating the performance of the real test set, we aim to explore the effectiveness of synthetic patches in capturing the most representative characteristics of the original dataset.

To be consistent with the standard measurements in prior studies [58, 57], for each application, we repeated the experiment 5 times with different seeds and reported the mean and standard deviation of classification accuracy and the average running time for each epoch. Moreover, for the majority of our experiments, we investigated three distinct settings involving 10, 20, or 50 synthetic patches per class. This selection of quantities is based on their proven effectiveness from prior studies [58, 57, 56, 49, 41] in illustrating learning efficiency, which is the primary objective of dataset distillation.

We randomly select patches from the original dataset as the initial states of synthetic samples. For 10, 20, and 50 synthetic patches per class, we use 200, 500, and 1000 distillation steps ( $K$ ) correspondingly and use a consistent value of 100 for model update steps ( $U$ ). For TCGA-IDH, we learn the synthetic samples using a learning rate of $5e^{-3}$ , whereas for other datasets, we use a learning rate of $0.5$ . The discussion of these hyper-parameter selections is provided in Section 5.2 and the supplementary. For patch-level classifiers, we trained the models for 50 epochs, and for slide-level MIL models, we trained them for 60 epochs. We conducted all experiments using a workstation with a single NVidia Tesla V100 GPU and the codes were developed in Python (v3.8) using PyTorch (v1.10).

	Img/Cls	Ratio%	Random	Herding	Forgetting	Histo-DD	Full
TCGA-IDH	10	$6e^{-2}$	$51.9_{\pm 0.02}$	$53.0_{\pm 0.7}$	$53.0_{\pm 3.7}$	$\mathbf{69.7_{\pm 0.2}}$	$80.0_{\pm 0.7}$
	20	0.11	$55.4_{\pm 0.01}$	$53.4_{\pm 0.1}$	$54.3_{\pm 2.0}$	$\mathbf{70.3_{\pm 0.1}}$
	50	0.28	$57.1_{\pm 0.01}$	$54.6_{\pm 0.04}$	$57.0_{\pm 2.7}$	$\mathbf{71.8_{\pm 0.4}}$
Camelyon16	10	$9e^{-4}$	$68.2_{\pm 0.11}$	$72.7_{\pm 0.40}$	$64.2_{\pm 0.3}$	$\mathbf{87.8_{\pm 0.1}}$	$95.8_{\pm 0.05}$
	20	$2e^{-3}$	$73.1_{\pm 0.04}$	$78.4_{\pm 1.8}$	$69.1_{\pm 0.04}$	$\mathbf{88.8_{\pm 0.2}}$
	50	$5e^{-3}$	$77.6_{\pm 0.6}$	$81.4_{\pm 0.04}$	$73.8_{\pm 0.1}$	$\mathbf{89.6_{\pm 1.1}}$
UniToPath (DG)	10	0.18	$53.3_{\pm 0.03}$	$56.4_{\pm 0.01}$	$58.0_{\pm 0.01}$	$\mathbf{69.5_{\pm 0.8}}$	$81.4_{\pm 2.3}$
	20	0.36	$56.8_{\pm 0.01}$	$57.2_{\pm 0.01}$	$59.4_{\pm 0.01}$	$\mathbf{71.2_{\pm 0.8}}$
	50	0.89	$58.7_{\pm 0.02}$	$60.8_{\pm 0.02}$	$61.4_{\pm 0.02}$	$\mathbf{72.1_{\pm 1.1}}$
UniToPath (PT)	10	0.10	$29.5_{\pm 0.9}$	$36.6_{\pm 0.1}$	$16.7_{\pm 0.1}$	$\mathbf{43.6_{\pm 0.1}}$	$49.6_{\pm 0.3}$
	20	0.20	$32.1_{\pm 2.9}$	$37.2_{\pm 1.2}$	$21.5_{\pm 0.4}$	$\mathbf{44.2_{\pm 1.2}}$
	50	0.52	$36.8_{\pm 0.3}$	$39.2_{\pm 0.2}$	$25.4_{\pm 0.3}$	$\mathbf{45.2_{\pm 0.2}}$

Table 1: Patch-level classification results on IDH, UniToPath and Camlyon16. We compared our dataset distillation method with different core-set selection methods. The reported results are based on a ResNet18 model.

5 Results

5.1 Comparison to coreset selection methods

We compare our dataset distillation algorithm with coreset selection methods, including Random selection, Herding [51] and Forgetting [47], in patch classification tasks on three datasets and the results are shown in Table 1.

Specifically, for Random selection, we randomly selected samples from the training set as the coreset. In contrast, the Herding selection method selects coreset samples based on their proximity to the centre of the cluster, while the Forgetting method identifies samples that are more prone to being misclassified by the model. For Herding and Forgetting methods, we use ResNet18 for feature selection. The results in Table 1 show that our dataset distillation method outperforms all compared coreset selection methods by a large margin ( $>$ 10%). Notably, we observed that the smallest performance gap between Histo-DD and the upper bound was on the Camelyon16 dataset. We attribute this to the considerably larger number of patches extracted from Camelyon16, which likely provides a better representation of the real data distribution, thereby yielding more informative learned synthetic samples. Furthermore, we visualise the synthetic patches for each dataset in Figure 1 (a). These synthetic patches exhibit certain histopathological features while also concealing or modifying specific patient-related information.

Refer to caption — Figure 1: a) Visualisation of condensed image patches, and b) Testing accuracy on different datasets using different numbers of synthetic patches per class and their corresponding number of outer loops on convergence.

Img/Cls

None

Stain Norm [30]

Rotate

Flip

Scale

All

TCGA-IDH

66.6_{\pm 0.2}

67.9_{\pm 0.1}

67.6_{\pm 0.1}

67.0_{\pm 0.1}

67.6_{\pm 0.2}

\mathbf{69.7_{\pm 0.2}}

Camelyon16

82.9_{\pm 0.3}

85.5_{\pm 0.2}

83.1_{\pm 0.1}

83.3_{\pm 0.3}

84.2_{\pm 0.5}

\mathbf{86.8_{\pm 0.1}}

UniToPath

(DG)

64.5_{\pm 0.5}

67.4_{\pm 0.4}

66.1_{\pm 0.3}

66.1_{\pm 0.2}

65.6_{\pm 0.4}

\mathbf{69.5_{\pm 0.8}}

UniToPath

(PT)

39.4_{\pm 0.3}

41.7_{\pm 0.1}

40.3_{\pm 0.9}

41.1_{\pm 0.7}

41.1_{\pm 0.8}

\mathbf{43.6_{\pm 0.1}}

Table 2: Patch-level performance with different augmentation strategies.

5.2 Ablation study

Effectiveness of stain normalisation. We demonstrated the effectiveness of various data augmentation techniques in enhancing the learning of synthetic patches, as illustrated in Table 2. Here, we included the results of not applying data augmentations (None), using data augmentations including stain normalisation, rotation, flipping, and scaling, and combining them together (All). Note that we applied the same set of data augmentation to both real samples $\mathcal{D}$ and synthetic samples $\mathcal{P}$ during training. The results indicate that the incorporation of data augmentation consistently improves the quality of synthetic samples. Notably, the application of stain normalisation yields the most substantial enhancement in performance ( $\sim 3\%$ ), underscoring its valuable contribution to the learning process. Furthermore, the combination of all data augmentation methods yields the best overall performance. Moreover, we investigate training synthetic samples on datasets that are already stain-normalised. On the TCGA-IDH dataset, we obtain $64.9_{\pm 0.2}$ / $67.1_{\pm 0.3}$ test accuracy with/without applying stain normalisation on synthetic samples during testing (more results can be found in the supplementary). This indicates that applying stain normalisation as a pre-processing step only brings marginal performance gains. Moreover, it indicates that the synthetic samples learned in this manner are not compatible with stain normalisation techniques, making them less advantageous when addressing stain variations in histopathology datasets.

Different number of synthetic samples. Here, we investigate the performance of patch classification tasks on three datasets of a ResNet18 model trained using 10/20/50/100 synthetic patches per class. As depicted in Figure 1 (b), our findings reveal that classification accuracy improves as the size of the synthetic dataset increases. For instance, when utilizing 100 synthetic patches per class, Histo-DD achieves an accuracy of $75.1\%\pm 0.1$ , $93.5\%\pm 0.3$ , $76.0\%\pm 0.9$ , and $46.3\%\pm 0.3$ on TCGA-IDH, Camelyon16, UniToPath(DG), and UniToPath(PT), respectively. It is noteworthy that while the performance gap narrows with a larger synthetic dataset, achieving convergence requires more iterations in the outer loop, i.e., larger $K$ . To provide context, each outer loop iteration takes approximately 6 minutes with $U=100$ . Therefore, a trade-off exists between performance improvement and training time. Longer training time is indeed a common challenge in current dataset distillation algorithms. However, once the synthetic samples are obtained, they can be employed efficiently and effectively as replacements for the original datasets.

5.3 Synthetic patch evaluation

			RN18		DN121		ViT
	Img/Cls	Ratio(%)	Acc(%)	Time(s)	Acc(%)	Time(s)	Acc(%)	Time(s)
TCGA-IDH	10	$6e^{-2}$	$69.7_{\pm 0.5}$	$0.24_{({\color[rgb]{0,1,0}\times 167})}$	$64.0_{\pm 0.1}$	$1.43_{({\color[rgb]{0,1,0}\times 76})}$	$71.8_{\pm 0.2}$	$1.84_{({\color[rgb]{0,1,0}\times 98})}$
	20	0.11	$70.3_{\pm 0.1}$	$0.39_{({\color[rgb]{0,1,0}\times 103})}$	$69.4_{\pm 0.1}$	$2.80_{({\color[rgb]{0,1,0}\times 39})}$	$72.5_{\pm 0.4}$	$3.63_{({\color[rgb]{0,1,0}\times 50})}$
	50	0.28	$71.8_{\pm 0.4}$	$1.02_{({\color[rgb]{0,1,0}\times 39})}$	$70.2_{\pm 0.4}$	$4.40_{({\color[rgb]{0,1,0}\times 25})}$	$73.4_{\pm 0.4}$	$8.02_{({\color[rgb]{0,1,0}\times 23})}$
	Full	100	$80.0_{\pm 0.7}$	40	$76.3_{\pm 0.6}$	80	$78.9_{\pm 0.2}$	181
Camelyon16	10	$1e^{-3}$	$87.8_{\pm 0.1}$	$0.23_{({\color[rgb]{0,1,0}\times 7043})}$	$93.5_{\pm 0.3}$	$1.44_{({\color[rgb]{0,1,0}\times 3351})}$	$93.9_{\pm 0.7}$	$1.84_{({\color[rgb]{0,1,0}\times 6000})}$
	20	$2e^{-3}$	$88.8_{\pm 0.2}$	$0.38_{({\color[rgb]{0,1,0}\times 4263})}$	$94.0_{\pm 0.1}$	$2.76_{({\color[rgb]{0,1,0}\times 1750})}$	$94.4_{\pm 0.3}$	$3.61_{({\color[rgb]{0,1,0}\times 3060})}$
	50	$5e^{-3}$	$89.6_{\pm 0.7}$	$1.04_{({\color[rgb]{0,1,0}\times 1557})}$	$95.6_{\pm 0.2}$	$4.40_{({\color[rgb]{0,1,0}\times 1097})}$	$96.3_{\pm 0.3}$	$8.01_{({\color[rgb]{0,1,0}\times 1380})}$
	Full	100	$95.8_{\pm 0.1}$	1620	$98.7_{\pm 0.1}$	4826	$98.6_{\pm 0.1}$	11049
UniToPath (DG)	10	0.18	$69.5_{\pm 0.8}$	$0.25_{({\color[rgb]{0,1,0}\times 320})}$	$70.5_{\pm 1.6}$	$1.42_{({\color[rgb]{0,1,0}\times 81})}$	$67.2_{\pm 0.1}$	$1.82_{({\color[rgb]{0,1,0}\times 66})}$
	20	0.36	$71.2_{\pm 0.8}$	$0.38_{({\color[rgb]{0,1,0}\times 211})}$	$74.1_{\pm 0.4}$	$2.80_{({\color[rgb]{0,1,0}\times 41})}$	$68.6_{\pm 0.3}$	$3.63_{({\color[rgb]{0,1,0}\times 33})}$
	50	0.89	$72.1_{\pm 1.1}$	$1.01_{({\color[rgb]{0,1,0}\times 79})}$	$77.0_{\pm 0.1}$	$4.41_{({\color[rgb]{0,1,0}\times 26})}$	$73.5_{\pm 0.4}$	$8.07_{({\color[rgb]{0,1,0}\times 15})}$
	Full	100	$81.4_{\pm 2.3}$	80	$82.6_{\pm 0.5}$	115	$75.8_{\pm 0.4}$	120
UniToPath (PT)	10	0.10	$43.6_{\pm 0.1}$	$0.63_{({\color[rgb]{0,1,0}\times 317})}$	$46.9_{\pm 1.3}$	$2.01_{({\color[rgb]{0,1,0}\times 105})}$	$44.1_{\pm 0.9}$	$5.46_{({\color[rgb]{0,1,0}\times 39})}$
	20	0.20	$44.2_{\pm 0.2}$	$1.16_{({\color[rgb]{0,1,0}\times 172})}$	$49.4_{\pm 0.8}$	$3.98_{({\color[rgb]{0,1,0}\times 53})}$	$45.6_{\pm 0.6}$	$10.83_{({\color[rgb]{0,1,0}\times 20})}$
	50	0.52	$44.6_{\pm 0.2}$	$2.93_{({\color[rgb]{0,1,0}\times 68})}$	$50.5_{\pm 0.9}$	$9.88_{({\color[rgb]{0,1,0}\times 21})}$	$47.1_{\pm 0.5}$	$27.20_{({\color[rgb]{0,1,0}\times 8})}$
	Full	100	$49.6_{\pm 0.2}$	200	$51.5_{\pm 0.2}$	212	$48.5_{\pm 0.6}$	213

Table 3: Cross-architecture performance (%) on different datasets with different synthetic images per class. The distilled samples are obtained using ResNetAP-10 and we also measure the averaged per-epoch training time.

Cross-architecture evaluation. We evaluated the performance of our condensed data on architectures different from the one used for training them and the results are shown in Table 3. Here, we selected five commonly used networks for image classification, including ResNet18 (RN18) [18], DenseNet121 (DN121) [19] and ViT [12]. Remarkably, the synthetic samples, even when trained on a different model, can be used to train other models to yield performance that is on par with models trained on the complete dataset. Notably, training using our synthetic samples is significantly more time-efficient. For instance, training a ViT model on the original Camelyon16 dataset for a single epoch requires a substantial 11,049 seconds. In contrast, training the same ViT model on 50 synthetic samples per class takes only 8.01 seconds for one epoch, which is approximately 1,380 times faster. Moreover, it’s worth emphasising that the synthetic samples constitute less than 1% of the dataset.

	Img/Cls	Storage(G)	CLAM	Time(s)	DSMIL	Time(s)	TransMIL	Time(s)
TCGA-IDH	10	0.9	$79.3_{\pm 0.7}$	$6.3_{({\color[rgb]{0,1,0}\times 1.4})}$	$78.6_{\pm 0.5}$	$10.6_{({\color[rgb]{0,1,0}\times 1.14})}$	$75.8_{\pm 1.3}$	$22.0_{({\color[rgb]{0,1,0}\times 1.07})}$
	20	1.7	$80.4_{\pm 0.5}$	$6.5_{({\color[rgb]{0,1,0}\times 1.3})}$	$80.7_{\pm 0.5}$	$10.8_{({\color[rgb]{0,1,0}\times 1.12})}$	$77.1_{\pm 1.6}$	$22.1_{({\color[rgb]{0,1,0}\times 1.06})}$
	50	4.1	$82.6_{\pm 0.1}$	$8.1_{({\color[rgb]{0,1,0}\times 1.1})}$	$83.3_{\pm 0.5}$	$11.1_{({\color[rgb]{0,1,0}\times 1.09})}$	$79.0_{\pm 0.3}$	$23.0_{({\color[rgb]{0,1,0}\times 1.03})}$
	Full	4.9	$84.8_{\pm 0.7}$	9.0	$85.8_{\pm 0.1}$	12.1	$80.0_{\pm 0.1}$	23.5
Camelyon16	10	0.32	$84.1_{\pm 0.4}$	$3.1_{({\color[rgb]{0,1,0}\times 1.7})}$	$84.9_{\pm 0.4}$	$4.7_{({\color[rgb]{0,1,0}\times 1.6})}$	$81.5_{\pm 0.3}$	$8.4_{({\color[rgb]{0,1,0}\times 1.6})}$
	20	0.62	$84.6_{\pm 0.9}$	$3.5_{({\color[rgb]{0,1,0}\times 1.5})}$	$85.4_{\pm 0.7}$	$5.1_{({\color[rgb]{0,1,0}\times 1.4})}$	$82.5_{\pm 0.4}$	$8.6_{({\color[rgb]{0,1,0}\times 1.6})}$
	50	1.6	$86.7_{\pm 0.7}$	$4.2_{({\color[rgb]{0,1,0}\times 1.2})}$	$87.5_{\pm 0.6}$	$5.9_{({\color[rgb]{0,1,0}\times 1.2})}$	$83.8_{\pm 0.8}$	$8.9_{({\color[rgb]{0,1,0}\times 1.6})}$
	Full	16.75	$88.9_{\pm 0.4}$	5.8	$89.8_{\pm 0.9}$	7.3	$85.3_{\pm 0.3}$	13.8
UniToPath (DR)	10	0.22	$61.6_{\pm 0.3}$	$1.82_{({\color[rgb]{0,1,0}\times 1.10})}$	$62.1_{\pm 1.1}$	$4.68_{({\color[rgb]{0,1,0}\times 1.04})}$	$54.2_{\pm 1.4}$	$5.55_{({\color[rgb]{0,1,0}\times 1.05})}$
	20	0.43	$64.4_{\pm 1.4}$	$1.83_{({\color[rgb]{0,1,0}\times 1.10})}$	$66.3_{\pm 1.9}$	$4.74_{({\color[rgb]{0,1,0}\times 1.03})}$	$59.1_{\pm 1.9}$	$5.61_{({\color[rgb]{0,1,0}\times 1.04})}$
	50	1.04	$67.1_{\pm 0.9}$	$1.89_{({\color[rgb]{0,1,0}\times 1.06})}$	$68.9_{\pm 1.9}$	$4.77_{({\color[rgb]{0,1,0}\times 1.02})}$	$61.4_{\pm 0.9}$	$5.68_{({\color[rgb]{0,1,0}\times 1.02})}$
	Full	28.66	$68.8_{\pm 0.9}$	2.01	$70.3_{\pm 1.3}$	4.86	$62.9_{\pm 0.3}$	5.82
UniToPath (PT)	10	0.71	$83.4_{\pm 0.7}$	$2.22_{({\color[rgb]{0,1,0}\times 1.2})}$	$82.8_{\pm 1.2}$	$4.74_{({\color[rgb]{0,1,0}\times 1.10})}$	$80.5_{\pm 1.3}$	$6.72_{({\color[rgb]{0,1,0}\times 1.04})}$
	20	0.82	$84.8_{\pm 0.7}$	$2.26_{({\color[rgb]{0,1,0}\times 1.2})}$	$84.1_{\pm 0.1}$	$4.76_{({\color[rgb]{0,1,0}\times 1.09})}$	$81.9_{\pm 0.7}$	$6.75_{({\color[rgb]{0,1,0}\times 1.04})}$
	50	2.45	$87.1_{\pm 1.2}$	$2.40_{({\color[rgb]{0,1,0}\times 1.1})}$	$87.5_{\pm 2.0}$	$4.92_{({\color[rgb]{0,1,0}\times 1.06})}$	$83.4_{\pm 0.7}$	$6.82_{({\color[rgb]{0,1,0}\times 1.03})}$
	Full	34.60	$90.0_{\pm 0.9}$	2.68	$91.0_{\pm 0.7}$	5.20	$84.7_{\pm 1.2}$	7.01

Table 4: Slide-level classification results on TCGA-IDH and Camelyon16. We compared different MIL methods.

Slide-level classification. Though we distilled synthetic patches from the original patch datasets, we can also use them for slide-level tasks where we apply the same algorithm to every single slide to obtain a condensed representation of a slide. Slide-level classification in Whole Slide Image (WSI) datasets often employs Multiple Instance Learning (MIL). To conduct MIL training, each slide is treated as a bag and its extracted patches are treated as bag instances. Here, we leveraged the learned synthetic patches from each slide as bag instances to train several commonly used MIL baselines including CLAM [4], DSMIL [22], and TransMIL [44]. Subsequently, we evaluated their performance on real datasets. Specifically, both CLAM and TransMIL used a ResNet50 for feature extraction from the bag instances, followed by either a global pooling operator [4] or a Transformer [44] for instance feature aggregation. In DSMIL, we employed self-supervised learning for enhanced feature extraction and incorporated an attention mechanism for feature aggregation. The results, as shown in Table 4, demonstrate the compatibility of our synthetic patches with MIL methods, affirming that our synthetic patches contain informative features from the original dataset that MIL methods can successfully extract and aggregate.

In-domain cross-centre transfer learning. Models that exhibit excellent performance in one setting may demonstrate degraded performance elsewhere due to the differences in population, clinical setting, and user behaviour. This is usually known as the transportability problem in cross-centre studies [7]. These models are expected to improve through fine-tuning with cross-centre datasets. However, this presents a challenge as centres may be unable to share the raw data due to privacy concerns and governance policies attached to the data.

Here, we consider a practical scenario in which two medical centres work collaboratively on a given task. Suppose each centre has collected its own data and would like to share it for collaboration. Some common issues would occur when using real large-scale datasets, such as the performance of trained models on the new data, storage requirements, and privacy concerns. To show that our synthetic samples can be an efficient alleviation to these issues, we simulate such in-domain cross-centre transfer learning using the two medical centres (Radboud(C1), Utrecht(C2)) in Camelyon16 dataset.

Source $\Rightarrow$ Target			Target
			C2
			Patch-level				Slide-level
		Img/Cls	Acc	#Epochs	Time(s)	Storage	Acc	#Epochs	Time(s)	Storage
Source	C1	10	$94.5_{\pm 0.5}$	5	0.23	$1e^{-3}$ G	$94.2_{\pm 1.2}$	5	1.8	0.20G
		20	$95.7_{\pm 0.4}$	5	0.38	$2e^{-3}$ G	$94.5_{\pm 1.2}$	5	1.9	0.37G
		50	$96.8_{\pm 0.2}$	5	1.04	$6e^{-3}$ G	$95.8_{\pm 0.6}$	5	2.1	0.93G
		Full	$97.8_{\pm 0.2}$	5	1640	5.12G	$97.0_{\pm 0.6}$	10	2.8	5.12G
			C1
Source	C2	10	$94.3_{\pm 0.1}$	5	0.23	$1e^{-3}$ G	$95.9_{\pm 0.1}$	5	1.5	0.13G
		20	$95.3_{\pm 0.5}$	5	0.38	$2e^{-3}$ G	$96.8_{\pm 0.6}$	5	1.6	0.25G
		50	$96.4_{\pm 0.2}$	5	1.04	$6e^{-3}$ G	$97.5_{\pm 0.4}$	5	1.9	0.64G
		Full	$98.1_{\pm 0.1}$	5	1596	11.63G	$97.0_{\pm 0.0}$	10	3.6	11.63G

Table 5: Results of in-domain cross-centre transfer learning. Given a model pre-trained on the source dataset, we record its performance and training time for fine-tuning on the target datasets.

Suppose the target centre receives the data collected from the source centre (Source $\Rightarrow$ Target) and performs initial investigations by conducting patch-level and slide-level classifications using their own pre-trained models (e.g., ResNet18 for patch-level classification and CLAM for slide-level classification). The results, as presented in Table 5, highlight several key advantages of utilising condensed synthetic source domain data over the full source domain data in real clinical practice: 1) Rapid Fine-Tuning: A pre-trained model on the target domain can be swiftly fine-tuned on the newly provided source domain, achieving comparable performance to one fine-tuned on the entire source domain dataset. 2) Enhanced Data Transfer Efficiency: Our synthetic samples significantly reduce storage requirements, leading to more efficient data transfer.

Limitations. Histo-DD is task-specific, implying that the distilled samples may not be readily transferable to other tasks, such as image segmentation or detection. Here, we present one possible solution. Inspired by the success of continual learning, we could formulate the learning of task-agnostic synthetic samples as a task incremental learning task. Specifically, we plan to incrementally introduce a variety of task types, such as classification and segmentation, during the training phase of a set of synthetic samples.

6 Conclusion

This paper introduces a dataset distillation method (Histo-DD) tailored for learning synthetic samples from histopathology datasets. The proposed method can condense large-scale histopathology datasets consisting of patches into a small set of highly condensed and informative synthetic patches. During the learning process, we incorporate stain normalisation to further make the algorithm more suitable for histopathology datasets. We have shown that synthetic patches offer a data-efficient alternative to original large-scale datasets. Once trained, they enable efficient training for both patch-level and slide-level classifiers. Additionally, they are adaptable to complex tasks like in-domain cross-centre transfer learning. In future work, our research will focus on refining dataset distillation algorithms, to create task-agnostic synthetic samples.

References

[1] Barbano, C.A., Perlo, D., Tartaglione, E., Grangetto, M.: Unitopatho, a labeled histopathological dataset for colorectal polyps classification and adenoma dysplasia grading. IEEE (2021)
[2] Bejnordi, B.E., Veta, M., Van Diest, P.J., Van Ginneken, B., Karssemeijer, N., Litjens, G., Van Der Laak, J.A., Hermsen, M., Manson, Q.F., Balkenhol, M., et al.: Diagnostic assessment of deep learning algorithms for detection of lymph node metastases in women with breast cancer. JAMA 318(22), 2199–2210 (2017)
[3] de Bel, T., Bokhorst, J.M., van der Laak, J., Litjens, G.: Residual cyclegan for robust domain transformation of histopathological tissue slides. Medical Image Analysis 70, 102004 (2021)
[4] Campanella, G., Hanna, M.G., Geneslaw, L., Miraflor, A., Werneck Krauss Silva, V., Busam, K.J., Brogi, E., Reuter, V.E., Klimstra, D.S., Fuchs, T.J.: Clinical-grade computational pathology using weakly supervised deep learning on whole slide images. Nature Medicine 25(8), 1301–1309 (2019)
[5] Chen, R.J., Chen, C., Li, Y., Chen, T.Y., Trister, A.D., Krishnan, R.G., Mahmood, F.: Scaling vision transformers to gigapixel images via hierarchical self-supervised learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 16144–16155 (2022)
[6] Chikontwe, P., Kim, M., Nam, S.J., Go, H., Park, S.H.: Multiple instance learning with center embeddings for histopathology classification. In: Medical Image Computing and Computer Assisted Intervention. pp. 519–528. Springer (2020)
[7] Coiera, E., Liu, S.: Evidence synthesis, digital scribes, and translational challenges for artificial intelligence in healthcare. Cell Reports Medicine 3(12) (2022)
[8] Cong, C., Liu, S., Di Ieva, A., Pagnucco, M., Berkovsky, S., Song, Y.: Texture enhanced generative adversarial network for stain normalisation in histopathology images. In: 2021 IEEE 18th International Symposium on Biomedical Imaging. pp. 1949–1952. IEEE (2021)
[9] Cong, C., Liu, S., Di Ieva, A., Pagnucco, M., Berkovsky, S., Song, Y.: Colour adaptive generative networks for stain normalisation of histopathology images. Medical Image Analysis 82, 102580 (2022)
[10] Cong, C., Yang, Y., Liu, S., Pagnucco, M., Song, Y.: Imbalanced histopathology image classification using deep feature graph attention network. In: 2022 IEEE 19th International Symposium on Biomedical Imaging. pp. 1–4. IEEE (2022)
[11] Deng, Z., Russakovsky, O.: Remember the past: Distilling datasets into addressable memories for neural networks. Advances in Neural Information Processing Systems 35, 34391–34404 (2022)
[12] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
[13] Ducoffe, M., Precioso, F.: Adversarial active learning for deep networks: a margin based approach. arXiv preprint arXiv:1802.09841 (2018)
[14] Fan, L., Sowmya, A., Meijering, E., Song, Y.: Cancer survival prediction from whole slide images with self-supervised learning and slide consistency. IEEE Transactions on Medical Imaging (2022)
[15] Farahani, R.Z., Hekmatfar, M.: Facility location: concepts, models, algorithms and case studies. Springer Science & Business Media (2009)
[16] Graham, S., Vu, Q.D., Jahanifar, M., Raza, S.E.A., Minhas, F., Snead, D., Rajpoot, N.: One model is all you need: multi-task learning enables simultaneous histology image segmentation and classification. Medical Image Analysis 83, 102685 (2023)
[17] Guo, R., Xie, K., Pagnucco, M., Song, Y.: Sac-net: Learning with weak and noisy labels in histopathology image segmentation. Medical Image Analysis 86, 102790 (2023)
[18] He, K., Zhang, X., et al.: Deep residual learning for image recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 770–778 (2016)
[19] Huang, G., Liu, Z., Van Der Maaten, L., Weinberger, K.Q.: Densely connected convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4700–4708 (2017)
[20] Khened, M., Kori, A., Rajkumar, H., Krishnamurthi, G., Srinivasan, B.: A generalized deep learning framework for whole-slide image segmentation and analysis. Scientific Reports 11(1), 1–14 (2021)
[21] Lei, S., Tao, D.: A comprehensive survey to dataset distillation. arXiv preprint arXiv:2301.05603 (2023)
[22] Li, B., Li, Y., Eliceiri, K.W.: Dual-stream multiple instance learning network for whole slide image classification with self-supervised contrastive learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 14318–14328 (2021)
[23] Li, G., Togo, R., Ogawa, T., Haseyama, M.: Soft-label anonymous gastric x-ray image distillation. In: 2020 IEEE International Conference on Image Processing (ICIP). pp. 305–309. IEEE (2020)
[24] Li, G., Togo, R., Ogawa, T., Haseyama, M.: Compressed gastric image generation based on soft-label dataset distillation for medical data sharing. Computer Methods and Programs in Biomedicine 227, 107189 (2022)
[25] Li, Y., Ping, W.: Cancer metastasis detection with neural conditional random field. In: Medical Imaging with Deep Learning (2018)
[26] Liu, S., Shah, Z., Sav, A., Russo, C., Berkovsky, S., Qian, Y., Coiera, E., Di Ieva, A.: Isocitrate dehydrogenase (IDH) status prediction in histopathology images of gliomas using deep learning. Scientific Reports 10(1), 1–11 (2020)
[27] Liu, Y., Gu, J., Wang, K., Zhu, Z., Jiang, W., You, Y.: Dream: Efficient dataset distillation by representative matching. arXiv preprint arXiv:2302.14416 (2023)
[28] Loo, N., Hasani, R., Amini, A., Rus, D.: Efficient dataset distillation using random feature approximation. Advances in Neural Information Processing Systems 35, 13877–13891 (2022)
[29] Lu, M.Y., Williamson, D.F., Chen, T.Y., Chen, R.J., Barbieri, M., Mahmood, F.: Data-efficient and weakly supervised computational pathology on whole-slide images. Nature Biomedical Engineering 5(6), 555–570 (2021)
[30] Macenko, M., Niethammer, M., Marron, J.S., Borland, D., Woosley, J.T., Guan, X., Schmitt, C., Thomas, N.E.: A method for normalizing histology slides for quantitative analysis. In: IEEE International Symposium on Biomedical Imaging. pp. 1107–1110 (2009)
[31] Margatina, K., Vernikos, G., Barrault, L., Aletras, N.: Active learning by acquiring contrastive examples. arXiv preprint arXiv:2109.03764 (2021)
[32] Mirzasoleiman, B., Bilmes, J., Leskovec, J.: Coresets for data-efficient training of machine learning models. In: International Conference on Machine Learning. pp. 6950–6960. PMLR (2020)
[33] Nguyen, T., Chen, Z., Lee, J.: Dataset meta-learning from kernel ridge-regression. arXiv preprint arXiv:2011.00050 (2020)
[34] Otsu, N.: A threshold selection method from gray-level histograms. IEEE Transactions on Systems, Man, and Cybernetics 9(1), 62–66 (1979)
[35] Parsons, D.W., Jones, S., Zhang, X., Lin, et al.: An integrated genomic analysis of human glioblastoma multiforme. Science 321(5897), 1807–1812 (2008)
[36] Paul, M., Ganguli, S., Dziugaite, G.K.: Deep learning on a data diet: Finding important examples early in training. Advances in Neural Information Processing Systems 34, 20596–20607 (2021)
[37] Phillips, J.M.: Coresets and sketches. In: Handbook of discrete and computational geometry, pp. 1269–1288. Chapman and Hall/CRC (2017)
[38] Qu, L., Luo, X., Liu, S., Wang, M., Song, Z.: Dgmil: Distribution guided multiple instance learning for whole slide image classification. In: Medical Image Computing and Computer Assisted Intervention. pp. 24–34. Springer (2022)
[39] Rebuffi, S.A., Kolesnikov, A., Sperl, G., Lampert, C.H.: icarl: Incremental classifier and representation learning. In: Proceedings of the IEEE conference on Computer Vision and Pattern Recognition. pp. 2001–2010 (2017)
[40] Reinhard, E., Adhikhmin, M., Gooch, B., Shirley, P.: Color transfer between images. IEEE Computer Graphics and Applications (CG&A) 21(5), 34–41 (2001)
[41] Sajedi, A., Khaki, S., Amjadian, E., Liu, L.Z., Lawryshyn, Y.A., Plataniotis, K.N.: Datadam: Efficient dataset distillation with attention matching. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 17097–17107 (2023)
[42] Salehi, P., Chalechale, A.: Pix2pix-based stain-to-stain translation: A solution for robust stain normalization in histopathology images analysis. In: International Conference on Machine Vision and Image Processing. pp. 1–7 (2020)
[43] Shaban, M.T., Baur, C., Navab, N., Albarqouni, S.: StainGAN: Stain style transfer for digital histological images. In: International Symposium on Biomedical Imaging. pp. 953–956 (2019)
[44] Shao, Z., Bian, H., Chen, Y., Wang, Y., Zhang, J., Ji, X., et al.: Transmil: Transformer based correlated multiple instance learning for whole slide image classification. Advances in Neural Information Processing Systems 34, 2136–2147 (2021)
[45] Shen, Y., Ke, J.: A deformable crf model for histopathology whole-slide image classification. In: Medical Image Computing and Computer Assisted Intervention. pp. 500–508 (2020)
[46] Tian, Y., Wang, J., Jin, Y., Wang, L.: Communication-efficient federated skin lesion classification with generalizable dataset distillation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 14–24. Springer (2023)
[47] Toneva, M., Sordoni, A., Combes, R.T.d., Trischler, A., Bengio, Y., Gordon, G.J.: An empirical study of example forgetting during deep neural network learning. arXiv preprint arXiv:1812.05159 (2018)
[48] Vahadane, A., Peng, T., Sethi, A., Albarqouni, S., Wang, L., Baust, M., Steiger, K., Schlitter, A.M., Esposito, I., Navab, N.: Structure-preserving color normalization and sparse stain separation for histological images. IEEE transactions on medical imaging 35(8), 1962–1971 (2016)
[49] Wang, K., Zhao, B., Peng, X., Zhu, Z., Yang, S., Wang, S., Huang, G., Bilen, H., Wang, X., You, Y.: Cafe: Learning to condense dataset by aligning features. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 12196–12205 (2022)
[50] Wang, T., Zhu, J.Y., Torralba, A., Efros, A.A.: Dataset distillation. arXiv preprint arXiv:1811.10959 (2018)
[51] Welling, M.: Herding dynamical weights to learn. In: Proceedings of the 26th Annual International Conference on Machine Learning. pp. 1121–1128 (2009)
[52] Yang, J., Chen, H., Zhao, Y., Yang, F., Zhang, Y., He, L., Yao, J.: Remix: A general and efficient framework for multiple instance learning based whole slide image classification. In: Medical Image Computing and Computer Assisted Intervention. pp. 35–45. Springer (2022)
[53] Yao, J., Zhu, X., Jonnagaddala, J., Hawkins, N., Huang, J.: Whole slide images based cancer survival prediction using attention guided deep multiple instance learning networks. Medical Image Analysis 65, 101789 (2020)
[54] Zhang, J., Zhang, X., Ma, K., Gupta, R., Saltz, J., Vakalopoulou, M., Samaras, D.: Gigapixel whole-slide images classification using locally supervised learning. In: Medical Image Computing and Computer Assisted Intervention. pp. 192–201. Springer (2022)
[55] Zhang, L., Zhang, J., Lei, B., Mukherjee, S., Pan, X., Zhao, B., Ding, C., Li, Y., Xu, D.: Accelerating dataset distillation via model augmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 11950–11959 (2023)
[56] Zhao, B., Bilen, H.: Dataset condensation with differentiable siamese augmentation. In: International Conference on Machine Learning. pp. 12674–12685. PMLR (2021)
[57] Zhao, B., Bilen, H.: Dataset condensation with distribution matching. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 6514–6523 (2023)
[58] Zhao, B., Mopuri, K.R., Bilen, H.: Dataset condensation with gradient matching. arXiv preprint arXiv:2006.05929 (2020)
[59] Zhao, Y., Lin, Z., Sun, K., Zhang, Y., Huang, J., Wang, L., Yao, J.: Setmil: spatial encoding transformer-based multiple instance learning for pathological image analysis. In: Medical Image Computing and Computer Assisted Intervention. pp. 66–76. Springer (2022)