
1 Dartmouth College, Hanover, NH 03755, USA
2 Dartmouth-Hitchcock Medical Center, Lebanon, NH 03756, USA
Email: [email protected]

Difficulty Translation in Histopathology Images

Jerry Wei¹, Arief Suriawinata², Xiaoying Liu², Bing Ren², Mustafa Nasir-Moin¹, Naofumi Tomita¹, Jason Wei¹, Saeed Hassanpour¹
Abstract

The unique nature of histopathology images opens the door to domain-specific formulations of image translation models. We propose a difficulty translation model that modifies colorectal histopathology images to be more challenging to classify. Our model comprises a scorer, which provides an output confidence to measure the difficulty of images, and an image translator, which learns to translate images from easy-to-classify to hard-to-classify using a training set defined by the scorer. We present three findings. First, generated images were indeed harder to classify for both human pathologists and machine learning classifiers than their corresponding source images. Second, image classifiers trained with generated images as augmented data performed better on both easy and hard images from an independent test set. Finally, human annotator agreement and our model’s measure of difficulty correlated strongly, implying that for future work requiring human annotator agreement, the confidence score of a machine learning classifier could be used as a proxy.

Keywords:
Deep Learning · Histopathology Images · Generative Adversarial Networks

1 Introduction

Automated histopathology image analysis has advanced quickly in recent years [1, 2, 3, 4] due to substantial developments in the broader fields of deep learning and computer vision [5, 6, 7]. While histopathology imaging research typically applies these general computer vision models directly and without modification, there may be domain-specific models that, although they might not generalize to broader computer vision tasks, are useful specifically for analyzing histopathology images.

In this study, we formulate a difficulty translation model for histopathology images: given a histopathology image, we aim to modify it into a new image that is harder to classify. Our model is motivated by the observation that histopathology images exhibit a range of histological features that determines their histopathological label. For instance, both an image with small amounts of sessile serrated architectures and an image covered by sessile serrated architectures would be classified by a pathologist as a sessile serrated adenoma. We know that this range of features exists because normal tissue progressively develops precancerous or cancerous features over time, which differs from general-domain datasets such as ImageNet [8], in which classes are distinct by definition (there is no continuum between cats and dogs, for instance). This continuous spectrum of features allows us to use the confidence of a machine learning classifier to determine the amount and intensity of cancerous features in an image; in other words, the confidence of the classifier can act as a proxy for the extent of histological features in an image. In this paper, we propose and evaluate a difficulty translation model that generates hard-to-classify images that are useful as augmented data, thereby demonstrating a new way to exploit the unique nature of histopathology images. Our code is publicly available at https://github.com/BMIRDS/DifficultyTranslation.

Figure 1: Our proposed model for modifying colorectal histopathology images to be more challenging to classify.

2 Methods and Materials

Problem set-up and model. Given some training image $x_i$ of class $X$, we aim to generate $\tilde{x}_i$, which maintains the same histopathological class and general structure of $x_i$ but is more challenging to classify. We propose a model that comprises two networks: a scorer, which predicts the difficulty $c(x_i)$ of some image $x_i$ [9, 10], and an image translator, which translates images that are easy to classify into images that are harder to classify. In this study, we use ResNet-18 [5] as the scorer for colorectal histopathology images, train it to convergence on the downstream task of hyperplastic polyp/sessile serrated adenoma classification, and assign $c(x_i)$ as the softmax output (confidence) of class $X$. For the image translator, we use a cycle-consistent generative adversarial network (CycleGAN), which learns the mapping $G: \hat{X} \rightarrow \tilde{X}$, where we assign $\hat{X}$ as the class for the set of images $\{\hat{x}\}$ such that $c(\hat{x}_i)$ is high for all $\hat{x}_i \in \{\hat{x}\}$ (easy-to-classify images) and $\tilde{X}$ as the class for the set of images $\{\tilde{x}\}$ where $c(\tilde{x}_j)$ is low for all $\tilde{x}_j \in \{\tilde{x}\}$ (hard-to-classify images). With this configuration, given some image $\hat{x}_i \in \{\hat{x}\}$, we can generate a similar example $\tilde{x}_i$ that maintains the same histopathological class but is harder to classify. An overview schematic of our model is shown in Figure 1. Hereafter, we use the terms easy images and hard images to refer to images that are easy or hard to classify based on a pre-trained classifier's confidence output.¹ When referring to easy and hard as perceived by annotators, we use the terms high-agreement images and low-agreement images, which represent 3/3 annotator agreement and 2/3 annotator agreement, respectively.

¹ Under this metric for measuring the difficulty of images, the pre-trained classifier does not classify an image as easy or hard; it classifies an image as HP or SSA, and the classifier's confidence in its HP or SSA prediction determines whether the image is considered easy or hard to classify.
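
As a minimal illustration of the scorer, the sketch below computes $c(x_i)$ as the softmax confidence of a pre-trained PyTorch classifier for an image's known class. The scorer handle, tensor shapes, and function name are illustrative assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def score_image(scorer, image, class_idx):
    """Return c(x_i): the scorer's softmax confidence for the image's known class.

    scorer:    a pre-trained ResNet-18 classifier (hypothetical handle)
    image:     tensor of shape (3, 224, 224), already normalized
    class_idx: index of the image's histopathological class (e.g., HP)
    """
    scorer.eval()
    with torch.no_grad():
        logits = scorer(image.unsqueeze(0))              # shape (1, num_classes)
        confidence = F.softmax(logits, dim=1)[0, class_idx]
    return confidence.item()                             # high -> easy image, low -> hard image
```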

Dataset. For our experiments, we first collected and scanned 328 Formalin-Fixed Paraffin-Embedded (FFPE) whole-slide images of colorectal polyps, originally diagnosed as either hyperplastic polyps (HPs) or sessile serrated adenomas (SSAs), from patients at the Dartmouth-Hitchcock Medical Center, our tertiary medical institution. From these 328 whole-slide images, we then extracted 3,152 patches (224×224-pixel portions of whole-slide images) representing diagnostically relevant regions of interest for HPs or SSAs. Three board-certified practicing gastrointestinal pathologists at the Dartmouth-Hitchcock Medical Center independently labeled each image as HP or SSA. The use of the dataset in this study was approved by our Institutional Review Board (IRB).

The gold standard label for each image was determined by the majority vote of the three pathologists' labels. Table 1 shows the distribution of high-agreement and low-agreement images for each class in the training and test sets. Note that our dataset is imbalanced because SSAs naturally occur less frequently than HPs. Figure 2 shows examples of high-agreement and low-agreement images from each class. Images were split randomly by whole slide into the training set and test set, so images from the same whole slide went entirely into either the training set or the test set. We chose the task of sessile serrated adenoma detection because it is challenging and clinically important for colonoscopy, one of the most common screening tests for colorectal cancer [11]. We used a training set of 2,051 images and a test set of 1,101 images, each labeled as either HP or SSA.
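
As an illustration of the slide-level split described above, the following sketch groups patches by their source whole slide before dividing them into training and test sets, so that no slide contributes patches to both sets. The data representation and split fraction are hypothetical; the paper does not specify the slide-level split ratio.

```python
import random
from collections import defaultdict

def split_by_slide(patches, train_fraction=0.65, seed=0):
    """Split patches into train/test sets at the whole-slide level.

    patches: list of (slide_id, patch) pairs -- hypothetical representation.
    All patches from a given slide go entirely into either the training set
    or the test set, never both.
    """
    by_slide = defaultdict(list)
    for slide_id, patch in patches:
        by_slide[slide_id].append(patch)

    slide_ids = sorted(by_slide)
    random.Random(seed).shuffle(slide_ids)
    n_train = int(len(slide_ids) * train_fraction)

    train = [p for sid in slide_ids[:n_train] for p in by_slide[sid]]
    test = [p for sid in slide_ids[n_train:] for p in by_slide[sid]]
    return train, test
```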

Table 1: Distribution of data in our training and test sets based on the level of annotation agreement among three pathologist annotators. HP: hyperplastic polyp, SSA: sessile serrated adenoma.
                      Training Set Images          Test Set Images
Level of Agreement    HP      SSA    Total         HP     SSA    Total
2/3 Annotators        670     173    843           316    89     405
3/3 Annotators        860     348    1,208         492    204    696
Total                 1,530   521    2,051         808    293    1,101
Figure 2: Examples of high-agreement and low-agreement images for the hyperplastic (HP) and sessile serrated adenoma (SSA) classes.
Figure 3: Generated hyperplastic polyp (HP) images that are intended to be more difficult to classify than their real source images. ϕ is the selectivity parameter for the target-domain images used to train our image translator; images generated using a lower ϕ are intended to be more difficult to classify.

3 Experiments

3.1 Generating More Challenging Training Examples

In our image translation model, we define a selectivity parameter, ϕ, as the percentage of training set images used as hard data in the target domain for training our image translator. For instance, at ϕ = 50, the lower 50% of training set images by confidence, as determined by our pre-trained classifier, was used as target-domain training data for the image translator. We train our image translation model for various values of ϕ to generate difficult HP images. We find that, for a given image classifier, generated images, particularly those generated with lower ϕ, were indeed harder to classify than their corresponding source images (Figure 6 in the Appendix). Figure 3 shows examples of images generated with varying ϕ. While images generated with lower ϕ are typically harder to classify, at very low ϕ, generated images are no longer representative of their target class (e.g., generated HP images begin to look like SSA images). We therefore recommend using the smallest ϕ possible such that the original class label is generally maintained and a sufficient number of images is available to train the image translation model. For our dataset, we find that the model needs to be trained with at least 100 images (ϕ > 6.25).
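
To make the role of ϕ concrete, the sketch below selects the bottom ϕ% of HP training images by scorer confidence as the hard (target) domain for the image translator and, as an illustrative assumption (the paper does not specify how the easy source domain is sized for smaller ϕ), takes the top ϕ% as the easy (source) domain.

```python
import numpy as np

def select_domains(confidences, phi=25.0):
    """Select image-translator training domains from scorer confidences.

    confidences: array of c(x_i) values for all HP training images
    phi:         selectivity parameter (percent of images per domain)
    Returns indices of easy (source) and hard (target) domain images.
    """
    n = len(confidences)
    k = max(1, int(round(n * phi / 100.0)))
    order = np.argsort(confidences)   # ascending: lowest-confidence images first
    hard_idx = order[:k]              # hardest-to-classify images -> target domain
    easy_idx = order[-k:]             # easiest-to-classify images -> source domain
    return easy_idx, hard_idx

# With the 1,530 HP training images here, phi = 6.25 would give roughly 96 images
# per domain, in line with the requirement of at least 100 images (phi > 6.25).
```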

We also evaluated the difficulty of generated images by presenting them to three board-certified gastrointestinal pathologists in a blinded test. Treating the top 50% and bottom 50% of training images marked as HP, ranked by the confidence of our pre-trained classifier, as easy HP images and hard HP images, respectively, we randomly sampled 75 easy HP images, 75 hard HP images, 75 generated HP images translated from the selected easy HP images, and 75 SSA images from our training set. Each pathologist then independently classified each image as HP or SSA. As shown in Table 2, pathologists disagreed more often (2/3 annotator agreement) on generated images (22.7%) than on their real counterparts (13.3%), although not as often as they did on real hard images (26.7%). At the same time, generated images retained their original class label of HP 96% of the time based on annotator agreement.

To keep Table 2 readable, we omit the classification results for SSA images. Of the 75 images in our blinded test with an original ground truth of SSA, 89.3% were again marked as SSA by a majority of pathologists, indicating that our pathologist annotators were relatively consistent in their classifications. In Figure 4, we show examples of translations that were successful and unsuccessful in making images more difficult to classify.

Table 2: Pathologists agreed less on ground-truth labels for generated HP images compared with their real image counterparts. Most generated images maintained their HP label in that both source images and their corresponding generated images were classified as HP by the majority of annotators.
                   Pathologist Annotator Agreement (%)
Image Type         2/3 Agreement    3/3 Agreement    Maintained Label (%)
Real–Easy          13.3             86.7             100.0
Real–Hard          26.7             64.0             90.7
Generated–Hard     22.7             73.3             96.0
Figure 4: Generated images, which were intended to be more difficult to classify, had lower inter-annotator agreement between pathologists. The left set of images shows translations that successfully lowered inter-annotator agreement, while the right set of images shows translations that did not lower the inter-annotator agreement. In all examples, generated images retained their original ground-truth labels.

3.2 Improving the Performance of Classifiers

We conduct further experiments to explore using the generated images as additional data to better train a classifier. Given a training set, we use the easy HP images as source images to generate harder HP images. Of these generated harder images, we use the images that maintain the HP class label, according to our pre-trained classifier, as additional data for training a new classifier.
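
The label-maintenance filter described above could look roughly like the following sketch, which keeps only generated images that the pre-trained classifier still predicts as HP. The function and variable names are hypothetical.

```python
import torch

def filter_label_preserving(classifier, generated_images, hp_class_idx, batch_size=64):
    """Keep generated images whose predicted class is still HP.

    classifier:       the pre-trained classifier (hypothetical handle)
    generated_images: tensor of shape (N, 3, 224, 224) from the image translator
    hp_class_idx:     index of the HP class in the classifier's output
    """
    classifier.eval()
    kept = []
    with torch.no_grad():
        for start in range(0, generated_images.size(0), batch_size):
            batch = generated_images[start:start + batch_size]
            preds = classifier(batch).argmax(dim=1)
            kept.append(batch[preds == hp_class_idx])    # label maintained -> keep as augmented data
    return torch.cat(kept) if kept else generated_images[:0]
```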

For all image classifiers, we train ResNet-18 [5] for 50 epochs (far past convergence) using the Adam optimizer [12] with an L2 regularization factor of $10^{-4}$. We use an initial learning rate of $10^{-3}$, decaying by a factor of 0.91 every epoch. Every trained model used online color-jittering data augmentation, with factors uniformly sampled from ranges of ±0.5 for brightness, ±0.5 for contrast, ±0.2 for hue, and ±0.5 for saturation, as implemented in PyTorch.
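
The training configuration above corresponds roughly to the following PyTorch sketch; the dataset handling is omitted, and train_loader is a hypothetical DataLoader over the (possibly augmented) training set.

```python
import torch
import torchvision
from torchvision import transforms

# Online color jittering applied to every training image.
train_transform = transforms.Compose([
    transforms.ColorJitter(brightness=0.5, contrast=0.5, saturation=0.5, hue=0.2),
    transforms.ToTensor(),
])

model = torchvision.models.resnet18(num_classes=2)        # HP vs. SSA

# Adam with L2 regularization (weight decay) of 1e-4 and initial learning rate 1e-3.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)

# Learning rate decays by a factor of 0.91 after every epoch.
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.91)

criterion = torch.nn.CrossEntropyLoss()

for epoch in range(50):
    model.train()
    for images, labels in train_loader:                   # train_loader: hypothetical DataLoader
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()
```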

Table 3 shows the performance of classifiers trained with our generated images as additional data, compared with the baseline of standard training on the original dataset as well as a naïve data augmentation technique that directly combines parts of easy and hard images [13]. To account for variance in random weight initializations and performance fluctuations throughout training, we run each configuration with twenty random seeds, and for each seed we record the mean of the five highest per-epoch AUC scores. Notably, adding images generated at ϕ = 25 consistently outperformed both naïve data augmentation and no data augmentation on low-agreement and high-agreement images alike. We posit that naïve data augmentation was unsuccessful in this case because the features in the augmented data were not reflected in the test set.
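
The per-configuration summary statistic reported in Table 3 can be computed as in the sketch below, assuming per-epoch AUC values have already been collected for each random seed (the variable names are illustrative).

```python
import numpy as np

def summarize_runs(per_seed_epoch_aucs, top_k=5):
    """Summarize classifier performance across random seeds.

    per_seed_epoch_aucs: list with one entry per seed, each a list of per-epoch AUCs.
    For each seed, take the mean of its top_k highest per-epoch AUCs, then report
    the mean and standard error of that statistic across seeds.
    """
    per_seed_scores = np.array([
        np.mean(sorted(epoch_aucs, reverse=True)[:top_k])
        for epoch_aucs in per_seed_epoch_aucs
    ])
    mean = per_seed_scores.mean()
    sem = per_seed_scores.std(ddof=1) / np.sqrt(len(per_seed_scores))
    return mean, sem
```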

Table 3: Performance (% AUC ± standard error) of image classifiers trained with generated images as augmented data, evaluated on test set images with high annotator agreement, on test set images with low annotator agreement, and on all test set images.
                                   Test Set Performance
Training dataset                 High-Agreement   Low-Agreement   All Images
Unmodified original dataset      91.3 ± 0.2       66.0 ± 0.6      83.1 ± 0.2
+ naïve data augmentation        90.9 ± 0.2       64.3 ± 0.5      82.1 ± 0.3
+ generated images, ϕ = 50       91.9 ± 0.2       66.6 ± 0.4      83.8 ± 0.3
+ generated images, ϕ = 25       92.6 ± 0.2       68.1 ± 0.7      84.8 ± 0.4
+ generated images, ϕ = 12.5     91.7 ± 0.2       65.5 ± 0.4      83.4 ± 0.2

3.3 Comparing Machine and Human Difficulty Measures

While it was previously unconfirmed whether the confidence output of a machine learning model correlates with the human notion of difficulty, for our dataset we find that the confidence of our pre-trained classifier indeed correlates strongly with human annotator agreement. As shown in Figure 5, the predicted confidence distribution of images with high annotator agreement vastly differs from that of images with low annotator agreement. We compared these distributions using a two-sample Kolmogorov-Smirnov test [14] over all 1,530 HP images in the training set and obtained a Kolmogorov-Smirnov statistic of 0.302 with a statistically significant p-value of $p = 1.5 \times 10^{-30}$, indicating that the two distributions are not equal. The correlation between these two measures of difficulty implies that for tasks requiring human annotator agreement data, the confidence of a machine learning classifier, which is computed automatically, could be used as a reliable proxy.
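
This comparison maps directly onto SciPy's two-sample Kolmogorov-Smirnov test; a minimal sketch follows, where conf_high_agreement and conf_low_agreement are assumed to hold the pre-trained classifier's predicted HP-class confidences for the 3/3-agreement and 2/3-agreement HP training images.

```python
from scipy.stats import ks_2samp

# conf_high_agreement / conf_low_agreement: 1-D arrays of predicted confidences
# (hypothetical names; computed with the scorer described in Section 2).
statistic, p_value = ks_2samp(conf_high_agreement, conf_low_agreement)
print(f"KS statistic = {statistic:.3f}, p = {p_value:.1e}")
```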

Figure 5: Distributions of predicted confidences of a pre-trained classifier vastly differed for hyperplastic polyp (HP) images with low (2/3) and high (3/3) annotator agreement.

4 Related Work and Discussion

Generative adversarial networks (GANs) have been used in deep learning for medical imaging to generate synthetic data ranging from MRIs to CT scans [15, 16, 17]. In histopathology, several studies have used GANs for both image generation and translation [18, 19, 20, 21, 22, 23]. Along the same lines as our work, GAN-generated images have been used as augmented data for liver lesion [24], bone lesion [25], and rare skin condition classification [26]. Whereas prior work generates augmented data to improve overall performance, the augmented data that we generate aims to help classifiers specifically on challenging examples. To our knowledge, this focus on example difficulty substantially differentiates our methodology from previous work.

This paper advances related work from our group, which focused on colorectal polyp classification [27, 28] and used image translation between different colorectal polyp types to address data imbalance [29]. In this study, we translate images within the same class to make them more difficult to differentiate from other classes, arguing that the range of features in histopathology images can and should be utilized to train better-performing machine learning models.

One possible limitation is that it is challenging to measure whether generated images maintain the same quality and realistic features as real images. Although generated images occasionally contained minor mosaic-like patterns, they remained readable and improved classifier training over baseline augmentation methods, suggesting that useful histologic features were retained. Another possible approach to difficulty translation would be to use human annotator agreement directly, translating images from high to low agreement. In this paper, however, we define difficulty according to the confidence of a pre-trained classifier, since this framework generalizes to cases where annotator agreement data is unavailable.

To conclude, this work shows how to generate difficult yet meaningful training data by exploiting the range of features in histopathology images. Future research could explore difficulty translation in the context of curriculum learning [30] or defending against adversarial attacks [31]. This study and its results encourage further research to make use of the range of features in histopathology images in more creative ways.

5 Acknowledgments

This research was supported in part by National Institutes of Health grants R01LM012837, R01CA098286, and P20GM104416.

6 Appendix

Figure 6: Image classifiers made less confident predictions on images generated by our image translation model than on the real source hyperplastic polyp (HP) images in the training set. ϕ is the selectivity parameter for the target-domain images used to train our image translator; images generated using a lower ϕ are intended to be more difficult to classify. We show the predicted confidence distributions of machine learning classifiers with varying amounts of training, since classifiers with more training tend to make more confident predictions. The initial classifier is the original classifier used to define the training set for our image translation model, trained for 25 epochs. Earlier classifiers is the average of classifiers trained for 17, 19, 21, and 23 epochs, and later classifiers is the average of classifiers trained for 27, 29, 31, and 33 epochs.

References

  • [1] Coudray, N., Moreira, A.L., Sakellaropoulos, T., Fenyö, D., Razavian, N., Tsirigos, A.: Classification and mutation prediction from non-small cell lung cancer histopathology images using deep learning. bioRxiv (2017)
  • [2] Ehteshami Bejnordi, B., Veta, M., Johannes van Diest, P., van Ginneken, B., Karssemeijer, N., Litjens, G., van der Laak, J.A.W.M., the CAMELYON16 Consortium: Diagnostic Assessment of Deep Learning Algorithms for Detection of Lymph Node Metastases in Women With Breast Cancer. JAMA 318(22), 2199–2210 (2017)
  • [3] Tomita, N., Abdollahi, B., Wei, J., Ren, B., Suriawinata, A., Hassanpour, S.: Attention-Based Deep Neural Networks for Detection of Cancerous and Precancerous Esophagus Tissue on Histopathological Slides. JAMA Network Open 2(11), e1914645–e1914645 (2019)
  • [4] Wei, J.W., Tafe, L.J., Linnik, Y.A., Vaickus, L.J., Tomita, N., Hassanpour, S.: Pathologist-level classification of histologic patterns on resected lung adenocarcinoma slides with deep neural networks. Scientific Reports 9 (2019)
  • [5] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. CoRR (2015)
  • [6] Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: Pereira, F., Burges, C.J.C., Bottou, L., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems 25, pp. 1097–1105. Curran Associates, Inc. (2012)
  • [7] Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S.E., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A.: Going deeper with convolutions. CoRR (2014)
  • [8] Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: A Large-Scale Hierarchical Image Database. In: CVPR09 (2009)
  • [9] Hacohen, G., Weinshall, D.: On the power of curriculum learning in training deep networks. CoRR (2019)
  • [10] Weinshall, D., Cohen, G.: Curriculum learning by transfer learning: Theory and experiments with deep networks. CoRR (2018)
  • [11] Rex, D., Boland, R.C., Dominitz, J.A., Giardiello, F.M., Johnson, D.A., Kaltenbach, T., Levin, T.R., Lieberman, D., Robertson, D.J.: Colorectal cancer screening: Recommendations for physicians and patients from the U.S. Multi-Society Task Force on Colorectal Cancer. The American Journal of Gastroenterology 112(7), 1016–1030 (2017)
  • [12] Kingma, D., Ba, J.: Adam: A method for stochastic optimization. International Conference on Learning Representations (2014)
  • [13] Summers, C., Dinneen, M.J.: Improved mixed-example data augmentation. In: Applications of Computer Vision (WACV), 2019 IEEE Winter Conference on. IEEE (2019)
  • [14] Massey, F.J.: The Kolmogorov-Smirnov test for goodness of fit. Journal of the American Statistical Association 46, 68–78 (1951)
  • [15] Dar, S.U., Yurt, M., Karacan, L., Erdem, A., Erdem, E., Çukur, T.: Image synthesis in multi-contrast MRI with conditional generative adversarial networks. IEEE Transactions on Medical Imaging (2018)
  • [16] Salehinejad, H., Valaee, S., Dowdell, T., Colak, E., Barfett, J.: Generalization of deep neural networks for chest pathology classification in X-Rays using generative adversarial networks. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (2018)
  • [17] Wang, J., Zhao, Y., Noble, J.H., Dawant, B.M.: Conditional generative adversarial networks for metal artifact reduction in CT images of the ear. In: Proceedings of Medical Image Computing and Computer Assisted Intervention. pp. 3–11 (2018)
  • [18] Bayramoglu, N., Kaakinen, M., Eklund, L., Heikkila, J.: Towards virtual H&E staining of hyperspectral lung histology images using conditional generative adversarial networks. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 64–71 (2018)
  • [19] Bentaieb, A., Hamarneh, G.: Adversarial stain transfer for histopathology image analysis. IEEE Transactions on Medical Imaging 37 (2017)
  • [20] Burlingame, E.A., Margolin, A., Gray, J., Chang, Y.H.: SHIFT: Speedy histopathological-to-immunofluorescent translation of whole slide images using conditional generative adversarial networks. In: Proceedings of SPIE Medical Imaging. vol. 10581 (2018)
  • [21] Cho, H., Lim, S., Choi, G., Min, H.: Neural stain-style transfer learning using GAN for histopathological images. In: Journal of Machine Learning Research: Workshop and Conference Proceedings (2017)
  • [22] Jackson, C., Sriharan, A., Vaickus, L.: A machine learning algorithm for simulating immunohistochemistry: development of sox10 virtual ihc and evaluation on primarily melanocytic neoplasms. Modern Pathology (2020)
  • [23] Zanjani, F.G., Zinger, S., de With, P.H.N.: Deep convolutional Gaussian mixture model for stain-color normalization of histopathological images. In: Proceedings of Medical Image Computing and Computer Assisted Intervention. pp. 274–282 (2018)
  • [24] Frid-Adar, M., Klang, E., Amitai, M., Goldberger, J., Greenspan, H.: Gan-based data augmentation for improved liver lesion classification. In: 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018). pp. 289–293 (2018)
  • [25] Gupta, A., Venkatesh, S., Chopra, S., Ledig, C.: Generative image translation for data augmentation of bone lesion pathology. arXiv (2019)
  • [26] Ghorbani, A., Natarajan, V., Coz, D., Liu, Y.: Dermgan: Synthetic generation of clinical skin images with pathology (2019)
  • [27] Korbar, B., Olofson, A.M., Miraflor, A.P., Nicka, C.M., Suriawinata, M.A., Torresani, L., Suriawinata, A.A., Hassanpour, S.: Deep learning for classification of colorectal polyps on whole-slide images (2017)
  • [28] Wei, J.W., Suriawinata, A.A., Vaickus, L.J., Ren, B., Liu, X., Lisovsky, M., Tomita, N., Abdollahi, B., Kim, A.S., Snover, D.C., Baron, J.A., Barry, E.L., Hassanpour, S.: Evaluation of a Deep Neural Network for Automated Classification of Colorectal Polyps on Histopathologic Slides. JAMA Network Open 3(4), e203398–e203398 (2020)
  • [29] Wei, J., Suriawinata, A.A., Vaickus, L., Ren, B., Liu, X., Wei, J., Hassanpour, S.: Generative image translation for data augmentation in colorectal histopathology images. In: Machine Learning for Health Workshop at the Thirty-third Conference on Neural Information Processing Systems (2019)
  • [30] Bengio, Y., Louradour, J., Collobert, R., Weston, J.: Curriculum learning. In: Proceedings of the 26th Annual International Conference on Machine Learning. p. 41–48. ICML ’09, Association for Computing Machinery, New York, NY, USA (2009)
  • [31] Ma, X., Niu, Y., Gu, L., Wang, Y., Zhao, Y., Bailey, J., Lu, F.: Understanding adversarial attacks on deep learning based medical image analysis systems. CoRR (2019)