
Learn like a Pathologist: Curriculum Learning by Annotator Agreement
for Histopathology Image Classification

Jerry Wei1 , Arief Suriawinata2 , Bing Ren2 , Xiaoying Liu2 , Mikhail Lisovsky2,
Louis Vaickus2, Charles Brown2 , Michael Baker2 , Mustafa Nasir-Moin1,
Naofumi Tomita1, Lorenzo Torresani1 , Jason Wei1 , Saeed Hassanpour1†
1Dartmouth College 2Dartmouth-Hitchcock Medical Center
[email protected]
Abstract

Applying curriculum learning requires both a range of difficulty in data and a method for determining the difficulty of examples. In many tasks, however, satisfying these requirements can be a formidable challenge.

In this paper, we contend that histopathology image classification is a compelling use case for curriculum learning. Based on the nature of histopathology images, a range of difficulty inherently exists among examples, and, since medical datasets are often labeled by multiple annotators, annotator agreement can be used as a natural proxy for the difficulty of a given example. Hence, we propose a simple curriculum learning method that trains on progressively-harder images as determined by annotator agreement.

We evaluate our hypothesis on the challenging and clinically-important task of colorectal polyp classification. Whereas vanilla training achieves an AUC of 83.7% for this task, a model trained with our proposed curriculum learning approach achieves an AUC of 88.2%, an improvement of 4.5%. Our work aims to inspire researchers to think more creatively and rigorously when choosing contexts for applying curriculum learning.

1 Introduction

Refer to caption
Figure 1: Our proposed curriculum learning by annotator agreement scheme for training a colorectal polyp classifier. The classifier first trains on easy images. Progressively harder images are gradually added in subsequent stages.

Curriculum learning [1] is an elegant idea inspired by human learning that proposes that neural networks should be trained on examples in a specified order based on difficulty (typically easy to hard), as opposed to the random ordering that is currently common in practice. As such, curriculum learning requires both that there exists some range of difficulty among training examples and that we define a method for ranking examples. In most cases, however, it is unclear whether a range of difficulty exists among the examples, and even when a range of difficulty exists, an ideal ranking function is rarely available. In this paper, we try to answer the question: are there tasks with domain-specific properties that are naturally appropriate for addressing these challenges of curriculum learning?

Interest in using deep learning to analyze histopathology images (stained tissues and cells that are typically manually examined under a microscope) has increased in recent years, with neural networks achieving pathologist-level performance on a variety of tasks [2, 3, 4, 5, 6, 7, 8]. In this paper, we propose that histopathology image analysis is a suitable scenario for curriculum learning for two reasons. First, due to the nature of histopathology images, we know that a range of difficulty in examples exists for many tasks. Second, medical image datasets typically have annotations from multiple clinicians—these annotations can be leveraged as a natural proxy for ranking example difficulty. Specifically, our paper makes the following contributions:

  1. We contend that histopathology image classification is a natural scenario for applying curriculum learning, and we propose a curriculum learning approach that leverages annotator agreement as a proxy for difficulty.

  2. We evaluate our proposed approach on a colorectal polyp classification dataset, for which a baseline model achieved an AUC of 83.7% and our best single-stage baseline achieved an AUC of 84.6%. When trained with curriculum learning, our model’s AUC improves to 88.2%, outperforming the average pathologist annotator on our test set in terms of Cohen’s κ [9].

The rest of our paper is outlined as follows. §2 analyzes the challenges of curriculum learning and presents our intuitions on why histopathology image classification is a suitable context for curriculum learning. §3 describes our task and dataset. §4 presents the main results of our proposed curriculum learning approach. §5 compares the value of annotator agreement with two previously-proposed methods of scoring difficulty. §6 shows how the increase in AUC from curriculum learning translates to improvements in performance relative to pathologist performance. §7 discusses the implications of our work. §8 puts our study in the context of prior work on curriculum learning for medical imaging and concludes our paper.

2 Curriculum Learning Intuitions

Curriculum learning. One of the earliest works demonstrating the benefit of curriculum learning [1] posits that learning occurs better when examples are not randomly presented but instead organized in a meaningful order that gradually shows more concepts and complexity. Although the intuition behind this approach seems obvious in the context of human and animal learning, it is often unclear how to best apply this strategy for training neural networks.

As such, a diverse set of approaches has been explored in this area of research. These approaches generally first score examples by difficulty and then train models using a schedule based on example difficulty, where easier examples are typically seen first and harder examples are seen later. For instance, Bengio et al.’s original work [1] explored a noising-based curriculum for shape detection and a vocabulary-size based task for language modeling. As popular recent examples, Weinshall et al. [10] use the confidence of a pre-trained classifier as an estimator for difficulty; Korbar et al. [11] use a schedule with self-defined easy and hard examples for learning of audio-visual temporal synchronization; Ganesh and Corso [12] propose to incrementally learn labels instead of learning difficult examples; and various teacher-student frameworks have been proposed in the context of curriculum learning [13, 14].

Challenges of curriculum learning. Despite the appeal of teaching machines to learn like humans, curriculum learning has been seen by some [10] as mostly remaining on the fringes of machine learning research. Based on the strategies of prior work, we broadly see two central challenges that arise when applying curriculum learning.

First, curriculum learning assumes that a range of easy and hard examples exists. Although it could be argued that this is true for any given dataset under at least some definition of easy and hard, the distribution of example difficulties likely varies based on the nature of the task and the dataset. Since the added value of curriculum learning comes from utilizing the varying degrees of difficulty in a task, tasks with a smaller range of example difficulty are less conducive to effective curriculum learning. Empirically, Weinshall et al. present some evidence related to this claim, showing that curriculum learning yielded a larger improvement over regular training when applied to tasks that were more difficult and likely included challenging examples (e.g., distinguishing small mammals in CIFAR-100) than when applied to tasks that were easier and did not have examples that were difficult to classify (e.g., discriminating between 5 well-separated classes in CIFAR-100) [10].

Second, a curriculum learning approach must somehow categorize examples as easy or hard. Prior work has tried to address this challenge in many ways, including trying to discover inherent patterns in the data [10], using hand-picked heuristics [15], and creating custom training progressions [11]. Though sometimes effective, these methods can be difficult to implement, and it is often unclear whether an approach that works on one dataset will also work on another. Indeed, scoring images by difficulty is often the core problem addressed in many curriculum learning papers.

Our intuitions. We contend that histopathology image classification is an important task that naturally addresses the challenges above—and could benefit from curriculum learning—based on the following two observations:

  • A range of example difficulty exists in many histopathology image classification tasks. We believe this to be true for several domain-specific reasons. (1) Because pathological disease develops over time, there is a progression from normal tissue to diseased tissue. Since many diseases are classified into discrete classes, there must be some points in this progression that lie on the margins of two classes and are therefore hard to diagnose. (2) Pathology residents learn to read images by first studying classic examples of diseases and then learning to diagnose more-challenging cases over time, implying that human instructors acknowledge some notion of easy and hard examples [16]. (3) Inter-annotator agreement is moderate or low on many disease classification tasks [7, 17], suggesting that some images are hard to classify. Knowing that a curriculum exists is the first step to applying curriculum learning.

  • Image-level annotator agreement can be leveraged as a proxy for example difficulty. As medical image datasets are commonly annotated by several trained clinicians, we can leverage the extent of agreement for each image as a proxy for the difficulty of that image. By definition, images with high agreement are easy to classify, as everyone agrees on them, and images with lower agreement are harder to classify. If these human notions of difficulty translate to a helpful curriculum for training neural networks, then many tasks with annotator agreement data [18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 17, 29] already contain a curriculum that can be used to improve model performance (see the sketch below).
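As an illustration, the following minimal sketch shows one way agreement-based difficulty bins could be derived from raw annotations. The image IDs and label lists are hypothetical; this is not the exact pipeline used in our experiments.

```python
from collections import Counter

# Hypothetical annotations: for each image, the labels assigned by the
# seven pathologist annotators ("HP" or "SSA").
annotations = {
    "img_001": ["HP", "HP", "HP", "HP", "HP", "HP", "HP"],       # 7/7 agreement
    "img_002": ["SSA", "SSA", "SSA", "HP", "SSA", "HP", "SSA"],  # 5/7 agreement
}

# Map the majority-vote count (out of 7) to a difficulty bin.
DIFFICULTY = {7: "very easy", 6: "easy", 5: "hard", 4: "very hard"}

def agreement_difficulty(labels):
    """Return (majority label, difficulty bin) for one image's votes."""
    majority_label, votes = Counter(labels).most_common(1)[0]
    return majority_label, DIFFICULTY[votes]

for image_id, labels in annotations.items():
    print(image_id, agreement_difficulty(labels))
```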

3 Histopathology Dataset¹

¹We plan to make our dataset and annotations publicly available to facilitate further research.

In this paper, we focus on the task of colorectal polyp classification, a challenging and clinically-important task in pathology. As shown in Table 1, our dataset contains 3,152 images in total, each annotated with a binary label of either hyperplastic polyp (HP) or sessile serrated adenoma (SSA).

Colorectal polyp classification task. Colonoscopy is a common screening program in the United States [30], and so classification of colorectal polyps (growths inside the colon lining that can lead to colonic cancer if left untreated) is one of the highest-volume tasks in pathology. Our task focuses on the clinically-important binary distinction between hyperplastic polyps (HPs) and sessile serrated adenomas (SSAs), a challenging problem [31, 32, 33, 34, 35]. Pathologically, SSAs are characterized by broad-based crypts, often with complex structure and heavy serration [36].

Data collection and annotation. For our data collection, we scanned 328 Formalin-Fixed Paraffin-Embedded (FFPE) whole-slide images of colorectal polyps, originally diagnosed as either hyperplastic polyps (HPs) or sessile serrated adenomas (SSAs), from patients at our tertiary medical institution. From these 328 whole-slide images, we then extracted 3,152 patches (image portions of size 224 × 224 pixels) representing diagnostically-relevant regions of interest for HPs or SSAs. The seven practicing board-certified gastrointestinal pathologists at our tertiary institution then independently labeled each of the 3,152 images in our dataset as either HP or SSA.

Train-test split and gold-standard labels. Images were split randomly by whole slide such that images from the same whole slide either all went into the training set or all went into the testing set. As shown in Table 1, we used a training set of 2,175 images (~70% of images) and a testing set of 977 images (~30% of images). In our testing set, we use the majority vote of labels as the gold-standard label, a common choice in the literature [18, 23, 24, 25, 26, 27, 28, 37, 38, 29].
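A minimal sketch of this split and labeling step is given below, assuming a pandas DataFrame with a slide_id column and seven annotator-label columns a1–a7; the file name and column names are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical table: one row per 224 x 224 patch, with its source slide ID
# and the seven annotator labels a1..a7 (each "HP" or "SSA").
df = pd.read_csv("polyp_patches.csv")
annotator_cols = ["a1", "a2", "a3", "a4", "a5", "a6", "a7"]

# Gold-standard label = majority vote over the seven annotators
# (seven voters, so no ties are possible for a binary label).
df["gold_label"] = df[annotator_cols].mode(axis=1)[0]

# Split by whole slide so that all patches from a slide land in one partition.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=0)
train_idx, test_idx = next(splitter.split(df, groups=df["slide_id"]))
train_df, test_df = df.iloc[train_idx], df.iloc[test_idx]
```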

        Train   Test   Total
HP      1,545   617    2,162
SSA     630     360    990
Total   2,175   977    3,152
Table 1: Number of images in our dataset.
  A1 A2 A3 A4 A5 A6 A7
A1 - 65.7 90.1 82.0 71.5 90.7 63.6
A2 - - 64.2 76.0 76.1 65.8 60.8
A3 - - - 80.1 69.3 90.8 62.3
A4 - - - - 79.9 81.9 64.1
A5 - - - - - 70.7 61.5
A6 - - - - - - 62.9
A7 - - - - - - -
Table 2: Pairwise annotator agreement (%) for our seven annotators (indexed as A1, A2, A3, A4, A5, A6, and A7).
Refer to caption
Figure 2: Distribution of class labels for each annotator.
Refer to caption
Figure 3: Distribution of annotator agreement levels for images in our dataset.
Refer to caption
Figure 4: Example images for each level of annotator agreement.

Dataset statistics. To help readers get a better sense of our dataset, in Tables 1 and 2 and in Figures 2 and 3, we show several analyses of our dataset. To summarize, our dataset contains 2,162 images with a gold-standard label of HP and 990 images with a gold-standard label of SSA; 64.5% of images have 6/7 or 7/7 annotator agreement, and 35.5% of images have 4/7 or 5/7 agreement. Figure 4 shows examples for each level of agreement. Each pair of pathologists agreed on 72.9% of images on average, and each pathologist agreed with the majority vote 83.2% of the time, indicating that our task has non-negligible disagreement, even among pathologist annotators who all specialize in gastroenterology.
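For reference, a short sketch of how these agreement statistics can be computed from the raw votes; the labels array here is a random stand-in for the real 3,152 × 7 matrix of annotator votes.

```python
import itertools
import numpy as np

# Stand-in for the real votes: one row per image, one column per annotator,
# with 0 = HP and 1 = SSA.
labels = np.random.randint(0, 2, size=(3152, 7))

# Pairwise agreement between the 21 annotator pairs.
pairwise = [
    (labels[:, i] == labels[:, j]).mean()
    for i, j in itertools.combinations(range(7), 2)
]

# Agreement of each annotator with the majority vote (>= 4 of 7 votes).
majority = (labels.sum(axis=1) >= 4).astype(int)
with_majority = [(labels[:, k] == majority).mean() for k in range(7)]

print(f"mean pairwise agreement: {np.mean(pairwise):.1%}")
print(f"mean agreement with majority vote: {np.mean(with_majority):.1%}")
```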

AUC (%) on test set stratified by annotator agreement
Training images  Stage  Very Easy  Easy  Hard  Very Hard  Overall  Δ+
Single-Stage Training
Vanilla Baseline: All Images  -  93.8 ± 0.5  88.7 ± 1.3  76.2 ± 0.9  60.7 ± 2.1  83.7 ± 1.0  -
Very Easy Images Only  -  94.8 ± 0.8  87.7 ± 1.1  61.7 ± 1.9  56.2 ± 1.6  80.2 ± 1.1
Easy Images Only  -  93.7 ± 0.9  88.6 ± 0.7  73.8 ± 1.8  56.4 ± 1.7  82.7 ± 0.8
Very Easy + Easy Images  -  96.1 ± 0.6  90.2 ± 1.2  72.1 ± 1.7  58.0 ± 1.9  84.6 ± 0.8  *
Very Easy + Easy + Hard Images  -  94.7 ± 0.8  88.8 ± 1.2  76.0 ± 1.3  60.2 ± 1.9  84.2 ± 0.8
Curriculum Learning - Annotator Agreement (Ours)
Very easy images  1  94.8 ± 0.8  87.7 ± 1.1  61.7 ± 1.9  56.2 ± 1.6  80.2 ± 1.1
  then (very easy + easy)  2  96.2 ± 0.6  91.3 ± 1.1  74.1 ± 1.4  58.8 ± 2.2  85.5 ± 0.9  ***
  then (very easy + easy + hard)  3  96.7 ± 0.5  94.3 ± 0.5  78.9 ± 1.2  64.2 ± 2.0  88.2 ± 0.6  ***
  then (very easy + easy + hard + very hard)  4  96.1 ± 0.6  93.2 ± 1.2  78.3 ± 1.6  64.5 ± 1.4  87.1 ± 0.9  ***
ANTI-Curriculum Learning - Annotator Agreement
Very hard images  1  66.2 ± 4.5  60.9 ± 4.3  60.5 ± 4.2  55.8 ± 2.4  59.6 ± 3.2
  then (very hard + hard)  2  71.5 ± 4.5  67.6 ± 5.9  65.3 ± 2.6  56.3 ± 3.4  65.7 ± 4.5
  then (very hard + hard + easy)  3  89.6 ± 1.5  86.3 ± 1.1  73.2 ± 3.8  60.0 ± 2.7  80.0 ± 1.1
  then (very hard + hard + easy + very easy)  4  93.7 ± 0.8  88.3 ± 1.1  76.8 ± 1.4  61.3 ± 1.5  83.6 ± 0.7
Curriculum Learning - Direct Annotation
Very easy images  1  90.8 ± 1.6  88.6 ± 1.3  74.7 ± 3.2  60.1 ± 1.8  82.5 ± 1.3
  then (very easy + easy)  2  93.0 ± 0.8  88.3 ± 0.6  77.1 ± 1.5  60.2 ± 1.5  83.3 ± 0.5
  then (very easy + easy + hard)  3  93.4 ± 0.7  88.4 ± 0.8  77.1 ± 1.0  59.9 ± 2.0  83.5 ± 0.7
  then (very easy + easy + hard + very hard)  4  93.2 ± 0.8  88.1 ± 1.0  77.3 ± 1.4  60.6 ± 2.2  83.3 ± 0.6
Curriculum Learning - Control (Random)
Very easy images  1  89.8 ± 1.2  88.7 ± 0.9  68.9 ± 2.4  57.8 ± 1.8  80.3 ± 1.2
  then (very easy + easy)  2  92.1 ± 0.8  88.2 ± 0.9  76.2 ± 1.4  59.4 ± 1.5  82.6 ± 0.5
  then (very easy + easy + hard)  3  93.2 ± 0.6  89.2 ± 0.8  76.3 ± 1.3  58.7 ± 1.6  83.4 ± 0.6
  then (very easy + easy + hard + very hard)  4  93.6 ± 0.7  89.2 ± 1.4  76.8 ± 1.6  59.8 ± 2.4  83.7 ± 0.9
Table 3: A histopathology image classification model trained with our curriculum learning framework outperforms single-stage training baselines by 3.6–4.5% AUC. Image difficulty is determined by annotator agreement in four discrete categories: very easy (7/7 annotator agreement), easy (6/7 agreement), hard (5/7 agreement), and very hard (4/7 agreement). Δ+ indicates the level of statistical significance in improvement over the vanilla baseline of training with all images: * indicates p ≤ 0.05; *** indicates p ≤ 0.001. Means and standard deviations shown are for 20 random seeds.

4 Curriculum learning: annotator agreement

We propose a curriculum learning framework that leverages annotator agreement to rank images by difficulty. Specifically, we define images with high annotator agreement to be easy and images with low annotator agreement to be hard. For our dataset, which was labeled by seven annotators, we partition our images into four discrete levels of difficulty: very easy (7/7 agreement among annotators), easy (6/7 agreement among annotators), hard (5/7 agreement among annotators), and very hard (4/7 agreement among annotators). An overview schematic of this setup can be seen in Figure 1. For our training schedule, we train our network on progressively harder images in four stages:

  • Stage 1: Very easy images only

  • Stage 2: Very easy + easy images

  • Stage 3: Very easy + easy + hard images

  • Stage 4: Very easy + easy + hard + very hard images

At each stage, we make sure to include images from the previous stages to prevent catastrophic forgetting [39]. Though our current model uses four levels of difficulty and four stages of training, our general framework could be used in any scenario where annotator agreement data is available.
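A minimal sketch of this cumulative schedule is shown below; train_images is assumed to be a list of records that each carry a difficulty bin as defined above, and the training call is a placeholder rather than our actual training routine.

```python
# Cumulative four-stage curriculum: each stage adds the next difficulty bin
# while keeping all earlier bins to limit catastrophic forgetting.
STAGES = [
    ["very easy"],
    ["very easy", "easy"],
    ["very easy", "easy", "hard"],
    ["very easy", "easy", "hard", "very hard"],
]

def curriculum_subsets(train_images):
    """Yield (stage number, training subset) for each of the four stages."""
    for stage, bins in enumerate(STAGES, start=1):
        subset = [img for img in train_images if img["difficulty"] in bins]
        yield stage, subset

# Usage (train_model is a placeholder for the actual training routine):
# for stage, subset in curriculum_subsets(train_images):
#     train_model(model, subset)
```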

Experimental setup. For our model, we use ResNet-18, a common choice for classifying histopathology images. Specifically, we follow the DeepSlide repository [7] for histopathology image classification, training our model for 50 epochs (well past convergence) using stochastic data augmentation with the Adam optimizer [40], an initial learning rate of 1 × 10⁻³, and a learning rate decay factor of 0.91.

For more-robust evaluation, we evaluate the AUC on the test set at every epoch and, for each model, consider the five highest values. We report the mean and standard deviation of these values calculated over 20 different random seeds.
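The sketch below illustrates this configuration in PyTorch; data loading, augmentation, and the train_one_epoch / evaluate_auc helpers are placeholders rather than the exact DeepSlide code.

```python
import torch
import torchvision

# ResNet-18 binary classifier trained with Adam, initial lr 1e-3,
# and an exponential learning-rate decay factor of 0.91 per epoch.
model = torchvision.models.resnet18(num_classes=2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.91)

test_aucs = []
for epoch in range(50):
    train_one_epoch(model, optimizer, train_loader)      # placeholder helper
    test_aucs.append(evaluate_auc(model, test_loader))   # placeholder helper
    scheduler.step()

# Keep the five highest per-epoch test AUCs for this run; the paper reports
# their mean and standard deviation over 20 random seeds.
top_five_mean = sum(sorted(test_aucs, reverse=True)[:5]) / 5
```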

Baselines. Our primary baseline is the vanilla-training model where all training images are given a label determined by the majority vote of annotator labels, a common gold-standard in the literature [18, 23, 24, 25, 26, 27, 28, 37, 38, 29]. We also explore variations of single-stage training in which only certain images, selected based on annotator agreement, are used for training. For all experiments, our test set is fixed and contains images from all levels of difficulty, although stratified analyses are also presented.

As shown in the first block of Table 3, our vanilla baseline that uses all images achieves an AUC of 83.7%. Interestingly, the network trained on only very easy and easy images achieved an overall AUC of 84.6%, an almost 1% improvement over the vanilla baseline. As shown by the performance on the test set when stratified by difficulty, this network does better on very easy and easy images in the test set while doing worse on hard and very hard images; it achieves a higher overall AUC because there are more very easy and easy images than hard and very hard images in the testing set.

Annotator agreement-based curriculum learning. The second block of Table 3 shows the results of our proposed curriculum learning scheme at each stage of training. In this curriculum learning scheme, the model already outperforms all single-stage models at the second stage, achieving an AUC of 85.5%, and at the third stage, achieves an AUC of 88.2%, the highest of any model we train. Moreover, this model also achieved the highest performance when stratified by very easy, easy, and hard images in the test set, outperforming earlier stages that trained only on very easy and easy examples. These results suggest that training on harder images in a curriculum framework not only improves performance on hard images, but also improves performance on easy images, a finding consistent with Korbar et al. [11].

Perhaps strikingly, model performance actually decreases in the fourth stage of training that includes very hard examples, as performance on very hard images in the test set increases but performance on other images in the test set decreases. One explanation for this slight dip in performance is that very hard images, which have only 4/7 pathologist agreement, could be too challenging to analyze accurately (even for expert humans), so their features might not be beneficial for training machine learning models either.

In terms of statistical significance, our curriculum learning model at the second, third, and fourth stages outperforms the vanilla-training model with p ≤ 0.001, based on a two-sample t-test for means.
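As an illustration, this test can be computed as follows; the AUC lists below are short placeholders for the 20 per-seed values used in our experiments.

```python
from scipy import stats

# Each list holds the overall test AUC from one random seed
# (values here are illustrative only; the real runs use 20 seeds).
baseline_aucs = [0.837, 0.829, 0.845, 0.831, 0.840]
curriculum_aucs = [0.882, 0.876, 0.890, 0.879, 0.885]

# Two-sample t-test for a difference in mean AUC.
t_stat, p_value = stats.ttest_ind(curriculum_aucs, baseline_aucs)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```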

Anti-curriculum learning. To further validate that the improvement in model performance is indeed a result of the intentionally selected images at each stage, we train a model using an anti-curriculum scheme, which reverses the learning schedule (i.e., the model first trains on hard images and then trains on progressively easier images). As shown in the third block in Table 3, no models trained using the anti-curriculum framework outperform the vanilla baseline.

Refer to caption
Figure 5: GradCAM visualization of images where curriculum learning was successful (top) and unsuccessful (bottom). Regions of interest are highlighted in yellow. For successful images, pathologists marked Stage 3 (our best-performing model) as the best representation of the area that they would look at to make a diagnosis. For unsuccessful images, pathologists marked Stage 1 as the best.

Visualization. For a qualitative examination of how the model changes throughout training, we compute GradCAM heatmaps [41] to visualize the model’s predictions at each of the four stages of curriculum learning. In Figure 5, we show examples of SSA images where curriculum learning was both successful and unsuccessful, as subjectively examined by our pathologists. In the successful examples, the model seemed to focus on broad-based crypts (a defining characteristic of SSAs) more heavily in Stage 3 of curriculum training, our best-performing model. In the unsuccessful examples, on the other hand, the model seemed to focus on broad-based crypts more heavily in earlier stages.
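A minimal Grad-CAM sketch for a ResNet-18 classifier is shown below; it follows the standard formulation of Selvaraju et al. [41] rather than our exact visualization code, and the input tensor is random for illustration.

```python
import torch
import torch.nn.functional as F
import torchvision

model = torchvision.models.resnet18(num_classes=2).eval()

# Capture activations and gradients at the last convolutional block.
activations, gradients = {}, {}

def fwd_hook(module, inputs, output):
    activations["value"] = output.detach()

def bwd_hook(module, grad_input, grad_output):
    gradients["value"] = grad_output[0].detach()

layer = model.layer4[-1]
layer.register_forward_hook(fwd_hook)
layer.register_full_backward_hook(bwd_hook)

def grad_cam(image, target_class):
    """Return a normalized heatmap in feature-map space for one 1x3x224x224 image."""
    logits = model(image)
    model.zero_grad()
    logits[0, target_class].backward()
    # Global-average-pool the gradients to weight each feature map.
    weights = gradients["value"].mean(dim=(2, 3), keepdim=True)
    cam = F.relu((weights * activations["value"]).sum(dim=1))
    return cam / (cam.max() + 1e-8)

heatmap = grad_cam(torch.randn(1, 3, 224, 224), target_class=1)
```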

Other curriculum baselines. As a further baseline, we also asked a pathologist to directly score the difficulty of each image on a scale from 1 to 4, with 1 being very easy to classify and 4 being very hard to classify. We also ran a four-stage experiment using this difficulty measure, as shown in the fourth block of Table 3, but found that curriculum learning here does not improve performance, possibly because the manual difficulty scores from a single pathologist are too subjective. Moreover, we tested a control curriculum with the same training scheme as the annotator agreement experiment, except images in each stage were selected randomly (fifth block of Table 3). As expected, this control curriculum performed about the same as our vanilla baseline.

5 Annotator agreement vs model confidence

This section compares various proxies for difficulty in terms of their usefulness for curriculum learning. Whereas our model so far has used annotator agreement as a proxy for example difficulty, prior work has proposed that the output confidence of a machine learning classifier can also be a proxy for example difficulty [42, 10]. First, we perform a sanity check to see whether classifier output confidence correlates with annotator agreement. As we might expect, it does: the predicted confidence distributions of a model pre-trained on ImageNet and fine-tuned on our dataset appeared substantially different for different levels of annotator agreement (Figure 8 in the Supplementary Materials).

We conduct a simple ablation study to compare annotator agreement and model output confidence as proxies for image difficulty in curriculum learning. For simplicity, we use a two-step curriculum learning scheme—where training is done in one stage containing only easy images and a following stage containing a mixture of easy and hard images—and use a single-step pacing schedule [42].

We evaluate the following three proxies for difficulty:

  1. Self-taught scoring, where the classifier with randomly initialized weights is pre-trained on our dataset, and output confidences are used to sort examples by difficulty.

  2. Transfer learning, where a classifier pre-trained on ImageNet is fine-tuned on our dataset, and output confidences are used to sort examples by difficulty.

  3. Coarse annotator agreement, a simplified version of our curriculum learning scheme above, where images are divided into two categories of either easy (6/7 or 7/7 agreement) or hard (4/7 or 5/7 agreement), instead of the four categories used in §4.

The self-taught scoring and transfer learning proxies assign each image a confidence score: images with a confidence score greater than a threshold τ are classified as easy, and images with a confidence score less than τ are classified as hard. We choose τ such that the proportion of easy and hard images matches the natural distribution of easy and hard images from coarse annotator agreement. Then, a new classifier is trained in two stages: (1) easy images only, and (2) easy images and hard images combined at various ratios. Our coarse annotator agreement method in this section follows this same two-stage training scheme.
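A minimal sketch of this threshold selection is given below; the confidence values are random stand-ins for per-image model outputs, and the easy fraction comes from the coarse agreement statistics reported in §3 (64.5% of images with 6/7 or 7/7 agreement).

```python
import numpy as np

# Stand-in for per-image confidence scores from a proxy model.
confidences = np.random.rand(2175)
easy_fraction = 0.645  # fraction of images with 6/7 or 7/7 agreement

# tau is the (1 - easy_fraction) quantile, so that the fraction of images
# above tau ("easy") matches the natural easy/hard split.
tau = np.quantile(confidences, 1.0 - easy_fraction)
is_easy = confidences > tau
print(f"tau = {tau:.3f}, easy images: {is_easy.mean():.1%}")
```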

Figure 6 shows the results for the various ratios of hard images we used in the second stage of training. Without making any general claims, we see that on our dataset, annotator agreement appears to be a more useful proxy for difficulty than model output confidence. One possible explanation for this result is that although the transfer learning and self-taught scoring approaches work (marginally, in our case), much of the information provided by the pre-trained classifier is shared with the resulting classifier (i.e., much of the added value of the pre-trained classifier can already be discovered by the resulting classifier), whereas using annotator agreement as a proxy provides new information about difficulty that the model would not have had access to otherwise. Moreover, we find that including a greater proportion of easy images than hard images in the second stage of training is important for preventing catastrophic forgetting, a finding consistent with prior work [11, 10, 42]. Using only hard images in the second stage was especially problematic for the self-taught scoring and transfer learning models, possibly because the images most challenging for these proxy models to learn will also be challenging for the learner model to optimize.

Refer to caption
Figure 6: Performance of curriculum learning models that use proxies of transfer learning [42], self-taught scoring [42], and annotator agreement (ours). For all tested ratios of hard images in the second stage of training, curriculum learning by annotator agreement outperforms transfer learning and self-taught scoring.

6 Comparison with Pathologist Performance

For more context on the significance of the improvement that curriculum learning brings to model performance, in this section, we compare the performance of our models with that of pathologists. Specifically, we frame the predictions of each model as the annotations of an additional pathologist, and we compare these predictions with the annotations of actual pathologists in terms of Cohen’s κ [9], a common measure of inter-annotator agreement.

For our models, we select the best-performing curriculum learning model (Stage 3) and compare it with the vanilla-baseline model that was trained on all images in a single stage. As our models output a continuous distribution of probabilities for HP and SSA, we evaluate each model at several different confidence thresholds (a lower threshold results in higher recall, whereas a higher threshold results in higher precision). For average pathologist performance, we compute the Cohen’s κ between all pairs of pathologists, and for each of the seven pathologists we show the mean Cohen’s κ of all six pairs involving that pathologist.
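A short sketch of this comparison is given below, framing the thresholded model predictions as an extra annotator; the file names and arrays are hypothetical.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Hypothetical inputs: per-image P(SSA) from the model, and the seven
# pathologists' labels for the test set (0 = HP, 1 = SSA).
model_probs = np.load("test_ssa_probs.npy")              # shape (n_images,)
pathologist_labels = np.load("pathologist_labels.npy")   # shape (7, n_images)

for threshold in [0.3, 0.4, 0.5, 0.6, 0.7]:
    model_labels = (model_probs >= threshold).astype(int)
    kappas = [cohen_kappa_score(model_labels, p) for p in pathologist_labels]
    print(f"threshold {threshold:.1f}: mean Cohen's kappa = {np.mean(kappas):.3f}")
```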

Figure 7 shows these results comparing our models with both individual pathologists and the average of all pathologists. We see that there is a wide range of Cohen’s κ for individual pathologists: the mean of our pathologists’ Cohen’s κ scores was 0.450, which is in the moderate range of 0.41–0.60 [43] (a similar study found a Cohen’s κ of 0.380 among four pathologists [35]). Our curriculum learning model (AUC = 88.2%) outperforms the pathologist average for multiple thresholds and the baseline model (AUC = 83.7%) for all thresholds. In particular, adding a curriculum schedule increases performance from the baseline’s maximum κ = 0.384 to the curriculum learning model’s maximum κ = 0.473, the difference needed to outperform the pathologists’ average (κ = 0.450).

Refer to caption
Figure 7: Performance in terms of Cohen’s κ [9] of our curriculum-learning and baseline models compared with that of pathologist annotators. The x-axis shows different confidence thresholds for our models, and the y-axis displays the average agreement with each pathologist in terms of Cohen’s κ. The 4.5% improvement in AUC (Table 3) of the curriculum-learning model compared with the baseline model translated to a 0.089 improvement in Cohen’s κ, allowing the curriculum-learning model to achieve agreement on par with the pathologist mean.

7 Discussion

Human and machine notions of difficulty. Our study has presented a transparent analysis of the requirements of curriculum learning, proposing that histopathology image analysis tasks present a range of difficulty among examples and that readily-available annotator agreement can be used as a natural proxy for ranking images. Experimentally, we found that using this natural proxy as a curriculum to train classifiers can yield significant performance improvements.

Some prior work has demonstrated that what makes an image difficult for neural networks to classify might not always match what makes it difficult for human annotators, an observation that recent work on adversarial examples takes advantage of [44]. Our study explores a converse idea for histopathology image classification, finding that machine notions of difficulty do correlate with human annotator agreement and contending that annotator agreement can be a useful proxy for facilitating curriculum learning.

Implications for medical image analysis. Our work also has implications for how labels are used in the histopathology image analysis and medical image analysis domains. Much prior work in these domains has used majority voting or senior pathologist resolution to resolve labels, retaining only a single label for each image without distinguishing between images with high and low annotator agreement. Our method could instead leverage this annotator agreement for curriculum learning. In Table 4 in the supplementary materials, we list past work for which multiple levels of annotator agreement appear to be available and for which our method could therefore be applicable (although most of these are private datasets).

Certainly, not every scenario in medical imaging is ideal for applying curriculum learning. For instance, both pathologists and deep learning models have achieved high performance in distinguishing high-grade lung cancers and normal tissue, which is considered a relatively easy task that is less likely to exhibit a range of difficulty among examples. Another negative example could be detecting bone fractures: since bone fractures are typically caused by instantaneous traumatic events, there is no progression of disease development, so we are less certain that a range of example difficulty exists. Potential scenarios conducive to curriculum learning by annotator agreement should ideally involve a progression of disease development and be challenging problems where even specialized pathologists might disagree. In particular, we believe that cancer datasets can benefit from curriculum learning because of the inherent natural progression of cancers (i.e., cancer develops over time and is not sudden). This could include tasks such as distinguishing among subtypes of lung adenocarcinoma or assessing tumor proliferation in breast cancer tissue samples.

Limitations. Though we intentionally addressed a common, clinically-important, and diagnostically-challenging problem of colorectal polyp classification for empirical evaluation—the best dataset that we were able to collect and annotate with multiple annotators at this time—our study nonetheless only uses a single dataset. As such, although our approach seems effective in its current form, we consider our results as an invitation for further exploration in this direction rather than a validation of curriculum learning for all histopathology classification tasks.

Furthermore, our dataset contains annotations from seven pathologist annotators, allowing us to categorize images into four discrete levels of difficulty. For medical image datasets that have fewer annotators and therefore fewer levels of annotation difficulty, we suspect that the benefits of our approach could be slightly diminished. For example, our coarse annotator agreement method in §5, which only used two levels of annotator agreement, achieved a smaller performance improvement (1.5%) than our four-level annotator agreement method (4.5%). Moreover, our dataset size is modest for the medical imaging domain, so whether these curriculum methods work for other dataset sizes remains relatively unexplored. We believe, however, that our curriculum learning methodology might still be worth exploring for datasets with fewer annotators, as even small improvements in performance are important given the high cost of data annotation and the importance of accuracy for patient outcomes.

8 Further Related Work and Conclusions

Although we argue that histopathology imaging is a promising context for curriculum learning, we are not the first to explore curriculum learning in the medical imaging domain. Oksuz et al. [45] used curriculum learning for artifact detection by first training on heavily-corrupted images and then introducing less-corrupted images, thereby improving performance on borderline cases. Maicas et al. [46] used a meta-training approach called teacher-student curriculum learning to improve breast screening classification on a weakly-labeled dataset. Moreover, Jesson et al. [47] used adaptive curriculum sampling to better detect lung nodules in extreme class-imbalance scenarios, and Oksuz et al. [45] applied a curriculum based on disease severity levels in radiology reports (e.g., mild, moderate, severe).

These prior studies have demonstrated empirical evaluation of curriculum learning and helped inspire our work, but their methods tend to be complicated and rely on inductive biases specific to certain datasets. For instance, the approach of Oksuz et al. [45] only applies to artifact detection, the approach of Jesson et al. [47] is specific to segmentation tasks, and the disease-severity information from radiology reports used in Oksuz et al. [45] might not exist for many datasets and can be challenging to parse. Our method, on the other hand, is easy to implement, makes no modifications to the network architecture, and needs only annotator agreement data, which are often readily available.

Based on a thoughtful analysis of the assumptions in curriculum learning, we have presented a simple yet effective curriculum learning framework that leverages easily-obtained annotator agreement data. In histopathology image analysis, where data collection and annotation can be especially costly, it is important to combine the natural properties of classification tasks with the most-appropriate inductive biases. We aim to have provided readers from both computer vision and medical image analysis backgrounds with a well-motivated argument for more intentional application of curriculum learning.

References

  • [1] Bengio, Y., Louradour, J., Collobert, R., Weston, J.: Curriculum learning. In: Proceedings of the 26th Annual International Conference on Machine Learning. p. 41–48. ICML ’09, Association for Computing Machinery (2009), https://doi.org/10.1145/1553374.1553380
  • [2] Arvaniti, E., Fricker, K.S., Moret, M., Rupp, N., Hermanns, T., Fankhauser, C., Wey, N., Weild, P.J., Ruschoff, J.H., Claassen, M.: Automated gleason grading of prostate cancer tissue microarrays via deep learning. Nature Scientific Reports 8 (2018), https://www.nature.com/articles/s41598-018-30535-1
  • [3] Bulten, W., Pinckaers, H., van Boven, H., Vink, R., de Bel, T., van Ginneken, B., van der Laak, J., Hulsbergen-van de Kaa, C., Litjens, G.: Automated deep-learning system for gleason grading of prostate cancer using biopsies: a diagnostic study. The Lancet Oncology 21(2), 233–241 (2020), https://arxiv.org/pdf/1907.07980.pdf
  • [4] Hekler, A., Utikal, J.S., Enk, A.H., Berking, C., Klode, J., Schadendorf, D., Jansen, P., Franklin, C., Holland-Letz, T., Krahl, D., von Kalle, C., Fröhling, S., Brinker, T.J.: Pathologist-level classification of histopathological melanoma images with deep neural networks. European Journal of Cancer 115, 79 – 83 (2019), http://www.sciencedirect.com/science/article/pii/S0959804919302758
  • [5] Shah, M., Wang, D., Rubadue, C., Suster, D., Beck, A.: Deep learning assessment of tumor proliferation in breast cancer histological images. In: 2017 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). pp. 600–603 (2017), https://ieeexplore.ieee.org/abstract/document/8217719
  • [6] Ström, P., Kartasalo, K., Olsson, H., Solorzano, L., Delahunt, B., Berney, D.M., Bostwick, D.G., Evans, A.J., Grignon, D.J., Humphrey, P.A., Iczkowski, K.A., Kench, J.G., Kristiansen, G., van der Kwast, T.H., Leite, K.R.M., McKenney, J.K., Oxley, J., Pan, C., Samaratunga, H., Srigley, J.R., Takahashi, H., Tsuzuki, T., Varma, M., Zhou, M., Lindberg, J., Bergström, C., Ruusuvuori, P., Wählby, C., Grönberg, H., Rantalainen, M., Egevad, L., Eklund, M.: Pathologist-level grading of prostate biopsies with artificial intelligence. CoRR abs/1907.01368 (2019), http://arxiv.org/abs/1907.01368
  • [7] Wei, J.W., Tafe, L.J., Linnik, Y.A., Vaickus, L.J., Tomita, N., Hassanpour, S.: Pathologist-level classification of histologic patterns on resected lung adenocarcinoma slides with deep neural networks. Scientific Reports (2019), https://www.nature.com/articles/s41598-019-40041-7
  • [8] Zhang, Z., Chen, P., McCough, M., Xing, F., Wang, C., Bui, M., Xie, Y., Sapkota, M., Cui, L., Dhillon, J., Ahmad, N., Khalil, F.K., Dickinson, S.I., Shi, X., Liu, F., Su, H., Cai, J., Yang, L.: Pathologist-level interpretable whole-slide cancer diagnosis with deep learning. Nature Machine Intelligence 1, 236–245 (2019), https://www.nature.com/articles/s42256-019-0052-1
  • [9] Cohen, J.: A coefficient of agreement for nominal scales. Educational and Psychological Measurement 20, 37–46 (1960), https://journals.sagepub.com/doi/abs/10.1177/001316446002000104?journalCode=epma
  • [10] Weinshall, D., Cohen, G., Amir, D.: Curriculum learning by transfer learning: Theory and experiments with deep networks. In: Proceedings of the 35th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 80, pp. 5238–5246. PMLR (2018), http://arxiv.org/abs/1802.03796
  • [11] Korbar, B., Tran, D., Torresani, L.: Cooperative learning of audio and video models from self-supervised synchronization. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems. p. 7774–7785. NIPS’18 (2018), https://arxiv.org/pdf/1807.00230.pdf
  • [12] Ganesh, M.R., Corso, J.J.: Rethinking curriculum learning with incremental labels and adaptive compensation (2020), https://openreview.net/forum?id=H1lTUCVYvH
  • [13] Racaniere, S., Lampinen, A., Santoro, A., Reichert, D., Firoiu, V., Lillicrap, T.: Automated curriculum generation through setter-solver interactions. In: International Conference on Learning Representations (2020), https://openreview.net/forum?id=H1e0Wp4KvH
  • [14] Singh, S., Batra, A., Pang, G., Torresani, L., Basu, S., Paluri, M., Jawahar, C.V.: Self-supervised feature learning for semantic segmentation of overhead imagery. In: BMVC (2018), http://bmvc2018.org/contents/papers/0345.pdf
  • [15] Tsvetkov, Y., Faruqui, M., Ling, W., MacWhinney, B., Dyer, C.: Learning the curriculum with Bayesian optimization for task-specific word representation learning. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 130–139. Association for Computational Linguistics (2016), https://www.aclweb.org/anthology/P16-1013
  • [16] Mais, D., American Society for Clinical Pathology: Practical Clinical Pathology. American Society for Clinical Pathology (2013), https://books.google.com/books?id=ryncnQEACAAJ
  • [17] Wei, J.W., Suriawinata, A.A., Vaickus, L.J., Ren, B., Liu, X., Lisovsky, M., Tomita, N., Abdollahi, B., Kim, A.S., Snover, D.C., Baron, J.A., Barry, E.L., Hassanpour, S.: Evaluation of a Deep Neural Network for Automated Classification of Colorectal Polyps on Histopathologic Slides. JAMA Network Open 3(4) (2020), https://jamanetwork.com/journals/jamanetworkopen/article-abstract/2764906
  • [18] Chilamkurthy, S., Ghosh, R., Tanamala, S., Biviji, M., Campeau, N.G., Venugopal, V.K., Mahajan, V., Rao, P., Warier, P.: Deep learning algorithms for detection of critical findings in head ct scans: a retrospective study. The Lancet 392, 2388–2396 (2018), https://www.thelancet.com/journals/lancet/article/PIIS0140-6736(18)31645-3/fulltext
  • [19] Coudray, N., Moreira, A.L., Sakellaropoulos, T., Fenyö, D., Razavian, N., Tsirigos, A.: Classification and mutation prediction from non-small cell lung cancer histopathology images using deep learning. Nature Medicine 24, 1559–1567 (2017), https://www.nature.com/articles/s41591-018-0177-5
  • [20] Ehteshami Bejnordi, B., Veta, M., Johannes van Diest, P., van Ginneken, B., Karssemeijer, N., Litjens, G., van der Laak, J.A.W.M.: Diagnostic Assessment of Deep Learning Algorithms for Detection of Lymph Node Metastases in Women With Breast Cancer. JAMA 318(22), 2199–2210 (2017), https://jamanetwork.com/journals/jama/fullarticle/2665774
  • [21] Esteva, A., Kuprel, B., Novoa, R.A., Ko, J., Swetter, S.M., Blau, H.M., Thrun, S.: Dermatologist-level classification of skin cancer with deep neural networks. Nature 542, 115–118 (2017), https://www.nature.com/articles/nature21056
  • [22] Ghorbani, A., Natarajan, V., Coz, D., Liu, Y.: Dermgan: Synthetic generation of clinical skin images with pathology (2019), https://arxiv.org/pdf/1911.08716.pdf
  • [23] Gulshan, V., Peng, L., Coram, M., Stumpe, M.C., Wu, D., Narayanaswamy, A., Venugopalan, S., Widner, K., Madams, T., Cuadros, J., Kim, R., Raman, R., Nelson, P.C., Mega, J.L., Webster, D.R.: Development and Validation of a Deep Learning Algorithm for Detection of Diabetic Retinopathy in Retinal Fundus Photographs. JAMA 316(22), 2402–2410 (2016), https://jamanetwork.com/journals/jama/fullarticle/2588763
  • [24] Irvin, J., Rajpurkar, P., Ko, M., Yu, Y., Ciurea-Ilcus, S., Chute, C., Marklund, H., Haghgoo, B., Ball, R.L., Shpanskaya, K.S., Seekins, J., Mong, D.A., Halabi, S.S., Sandberg, J.K., Jones, R., Larson, D.B., Langlotz, C.P., Patel, B.N., Lungren, M.P., Ng, A.Y.: Chexpert: A large chest radiograph dataset with uncertainty labels and expert comparison. CoRR abs/1901.07031 (2019), http://arxiv.org/abs/1901.07031
  • [25] Kanavati, F., Toyokawa, G., Momosaki, S., Rambeau, M., Kozuma, Y., Shoji, F., Yamazaki, K., Takeo, S., Iizuka, O., Tsuneki, M.: Weakly-supervised learning for lung carcinoma classification using deep learning. Nature Scientific Reports 10 (2020), https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7283481/
  • [26] Korbar, B., Olofson, A.M., Miraflor, A.P., Nicka, C.M., Suriawinata, M.A., Torresani, L., Suriawinata, A.A., Hassanpour, S.: Deep learning for classification of colorectal polyps on whole-slide images. Journal of Pathology Informatics 8 (2017), https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5545773/
  • [27] Sertel, O., Kong, J., Catalyurek, U.V., Lozanski, G., Saltz, J.H., Gurcan, M.N.: Histopathological image analysis using model-based intermediate representations and color texture: Follicular lymphoma grading. Journal of Signal Processing Systems 55 (2009), https://link.springer.com/article/10.1007/s11265-008-0201-y#citeas
  • [28] Wang, S., Xing, Y., Zhang, L., Gao, H., Zhang, H.: Deep convolutional neural network for ulcer recognition in wireless capsule endoscopy: Experimental feasibility and optimization. Computational and Mathematical Methods in Medicine (2019), https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6766681/
  • [29] Zhou, J., Luo, L.Y., Dou, Q., Chen, H., Chen, C., Li, G.J., Jiang, Z.F., Heng, P.A.: Weakly supervised 3d deep learning for breast cancer classification and localization of the lesions in mr images. Journal of Magnetic Resonance Imaging 50(4), 1144–1151 (2019), https://onlinelibrary.wiley.com/doi/abs/10.1002/jmri.26721
  • [30] Rex, D.K., Boland, C.R., Dominitz, J.A., Giardiello, F.M., Johnson, D.A., Kaltenbach, T., Levin, T.R., Lieberman, D., Robertson, D.J.: Colorectal cancer screening: Recommendations for physicians and patients from the u.s. multi-society task force on colorectal cancer. Gastroenterology 153, 307–323 (2017), https://www.gastrojournal.org/article/S0016-5085(17)35599-3/fulltext
  • [31] Abdeljawad, K., Vemulapalli, K.C., Kahi, C.J., Cummings, O.W., Snover, D.C., Rex, D.K.: Sessile serrated polyp prevalence determined by a colonoscopist with a high lesion detection rate and an experienced pathologist. Gastrointestinal Endoscopy 81, 517–524 (2015), https://pubmed.ncbi.nlm.nih.gov/24998465/
  • [32] Farris, A.B., Misdraji, J., Srivastava, A., Muzikansky, A., Deshpande, V., Lauwers, G.Y., Mino-Kenudson, M.: Sessile serrated adenoma: challenging discrimination from other serrated colonic polyps. The American Journal of Surgical Pathology 32, 30–35 (2008), https://pubmed.ncbi.nlm.nih.gov/18162767/
  • [33] Glatz, K., Pritt, B., Glatz, D., HArtmann, A., O’Brien, M.J., Glaszyk, H.: A multinational, internet-based assessment of observer variability in the diagnosis of serrated colorectal polyps. American Journal of Clinical Pathology 127, 938–945 (2007), https://pubmed.ncbi.nlm.nih.gov/17509991/
  • [34] Khalid, O., Radaideh, S., Cummings, O.W., O’brien, M.J., Goldblum, J.R., Rex, D.K.: Reinterpretation of histology of proximal colon polyps called hyperplastic in 2001. World Journal of Gastroenterology 15, 3767–3770 (2009), https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2726454/
  • [35] Wong, N.A.C.S., Hunt, L.P., Novelli, M.R., Shepherd, N.A., Warren, B.F.: Observer agreement in the diagnosis of serrated polyps of the large bowel. Histopathology 55 (2009), https://pubmed.ncbi.nlm.nih.gov/19614768/
  • [36] Gurudu, S.R., Heigh, R.I., Petries, G.D., Heigh, E.G., Leighton, J.A., Pasha, S.F., Malagon, I.B., Das, A.: Sessile serrated adenomas: Demographic, endoscopic and pathological characteristics. World Journal of Gastroenterology 16, 3402–3405 (2010), https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2904886/
  • [37] Wei, J., Wei, J., Jackson, C., Ren, B., Suriawinata, A., Hassanpour, S.: Automated detection of celiac disease on duodenal biopsy slides: A deep learning approach. Journal of Pathology Informatics 10(1),  7 (2019), http://www.jpathinformatics.org/article.asp?issn=2153-3539;year=2019;volume=10;issue=1;spage=7;epage=7;aulast=Wei;t=6
  • [38] Wei, J., Suriawinata, A., Liu, X., Ren, B., Nasir-Moin, M., Tomita, N., Wei, J., Hassanpour, S.: Difficulty translation in histopathology images (2020), https://arxiv.org/pdf/2004.12535.pdf
  • [39] French, R.M.: Catastrophic forgetting in connectionist networks. Trends Cogn Sci. 3, 128–135 (1999), https://pubmed.ncbi.nlm.nih.gov/10322466/
  • [40] Kingma, D., Ba, J.: Adam: A method for stochastic optimization. International Conference on Learning Representations (2014)
  • [41] Selvaraju, R.R., Das, A., Vedantam, R., Cogswell, M., Parikh, D., Batra, D.: Grad-cam: Why did you say that? visual explanations from deep networks via gradient-based localization. CoRR abs/1610.02391 (2016), http://arxiv.org/abs/1610.02391
  • [42] Hacohen, G., Weinshall, D.: On the power of curriculum learning in training deep networks. In: Proceedings of the 36th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 97, pp. 2535–2544. PMLR (09–15 Jun 2019), http://proceedings.mlr.press/v97/hacohen19a.html
  • [43] McHugh, M.L.: Interrater reliability: the kappa statistic. Biochemia Medica 22, 276–282 (2012), https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3900052/
  • [44] Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., Fergus, R.: Intriguing properties of neural networks. In: International Conference on Learning Representations (2014), http://arxiv.org/abs/1312.6199, https://arxiv.org/pdf/1312.6199.pdf
  • [45] Oksuz, I., Ruijsink, B., Puyol-Antón, E., Clough, J.R., Cruz, G., Bustin, A., Prieto, C., Botnar, R., Rueckert, D., Schnabel, J.A., King, A.P.: Automatic cnn-based detection of cardiac mr motion artefacts using k-space data augmentation and curriculum learning. Medical Image Analysis 55, 136 – 147 (2019), http://www.sciencedirect.com/science/article/pii/S1361841518306765
  • [46] Maicas, G., Bradley, A.P., Nascimento, J.C., Reid, I., Carneiro, G.: Training medical image analysis systems like radiologists. In: Medical Image Computing and Computer Assisted Intervention – MICCAI 2018. pp. 546–554 (2018), https://arxiv.org/pdf/1805.10884.pdf
  • [47] Jesson, A., Guizard, N., Ghalehjegh, S.H., Goblot, D., Soudan, F., Chapados, N.: Cased: Curriculum adaptive sampling for extreme data imbalance. In: Medical Image Computing and Computer Assisted Intervention − MICCAI 2017. pp. 639–646 (2017), https://link.springer.com/chapter/10.1007/978-3-319-66179-7_73
Refer to caption
Figure 8: Distribution of predicted confidences by a pre-trained model fine-tuned on our dataset for different annotator agreement levels (indicated above each plot).
Dataset                            Annotators   Resolution Method
Head CT Scans [18]                 3            Majority Vote
Lymph Node Metastases [20]         3            Senior Expert Verification
Diabetic Retinopathy [23]          7            Majority Vote
Chest Radiograph [24]              3            Majority Vote
Lung Carcinoma [25]                3            Senior Expert Verification
Follicular Lymphoma Grading [27]   5            Majority Vote
Ulcer Recognition [28]             3            Senior Expert Verification
Colorectal Polyps [17]             5            Majority Vote
Breast Cancer [29]                 3            Senior Expert Verification
Table 4: Examples of image classification tasks from prior work where annotator agreement is accessible in principle.

Supplementary Materials

Figure 8 on the next page shows the correlation between difficulty as perceived by a pre-trained model and pathologist annotators. Table 4 on the next page lists examples of medical image classification tasks from prior work where we believe annotator agreement data to be accessible.