Few-Shot Anomaly Detection for Polyp Frames from Colonoscopy
Abstract
Anomaly detection methods generally target the learning of a normal image distribution (i.e., inliers showing healthy cases) and during testing, samples relatively far from the learned distribution are classified as anomalies (i.e., outliers showing disease cases). These approaches tend to be sensitive to outliers that lie relatively close to inliers (e.g., a colonoscopy image with a small polyp). In this paper, we address this inappropriate sensitivity to outliers by also learning from outliers. We propose a new few-shot anomaly detection method based on an encoder trained to maximise the mutual information between feature embeddings and normal images, followed by a few-shot score inference network, trained with a large set of inliers and a substantially smaller set of outliers. We evaluate our proposed method on the clinical problem of detecting frames containing polyps from colonoscopy video sequences, where the training set has 13350 normal images (i.e., without polyps) and fewer than 100 abnormal images (i.e., with polyps). The results of our proposed model on this data set reveal a state-of-the-art detection result, while the performance for different numbers of abnormal samples is relatively stable after approximately 40 abnormal training images. Code is available at https://github.com/tianyu0207/FSAD-Net.
This work was partially supported by Australian Research Council grant DP180103232.
Keywords:
Machine learning, anomaly detection, few-shot learning, weakly-supervised learning, polyp detection, colonoscopy
1 Introduction
Classification of rare events is a common problem in medical image analysis [12], e.g., disease detection in medical screening tests such as colonoscopy. In this scenario, normal images generally come from healthy patients, while abnormal images are from unhealthy ones, where the proportion of normal images in the training set tends to be substantially larger than the abnormal ones. One possible way to address such problems is through the design of training methods that can deal with imbalanced learning problems [10, 11] (Fig. 1-(a)). Even though they are often effective, these approaches still need a fairly high number of abnormal training images. Alternatively, zero-shot anomaly detection methods [3, 15, 31, 14] tackle this problem using a training set containing only normal images to train a conditional generative model that can reconstruct normal images, and anomalies are detected based on the reconstruction errors of testing images (Fig. 1-(b)). Unfortunately, in practice these methods can misclassify outliers that lie relatively close to inliers (e.g., when cancer tissue occupies a small area of the image). Therefore, we propose a middle ground between these two approaches to address the issues of requiring a relatively large annotated data set and misclassifying challenging outliers.

In this paper, we propose a few-shot anomaly detection network (FSAD-NET) that is trained with a highly imbalanced training set, containing a large number of normal images (more than 10000) and few abnormal images (fewer than 100), as illustrated in Fig. 1-(c). The method first learns a feature encoder that is trained with normal images to maximise the mutual information (MI) between the training images and feature embeddings [6]. Next, we train a score inference network (SIN) [19] that pulls the feature embeddings of normal images close together toward a particular region of the feature space and pushes the embeddings of abnormal images away from that region of normal features.
In practice, FSAD-NET needs significantly fewer abnormal training images than typical imbalanced learning methods [10, 11]. Moreover, given that we have access to a few abnormal training images, FSAD-NET has the potential to be more effective at correctly classifying challenging outliers compared to typical zero-shot anomaly detection methods [3, 15, 31, 14]. To the best of our knowledge, our method is the first medical image analysis work to explore few-shot anomaly detection with a feature encoder that maximises MI between training images and embeddings, and explicitly optimises anomaly scores. We evaluate FSAD-NET on the detection of colonoscopy video frames that contain polyps with a training set of more than 10000 normal images (without polyps) and fewer than 100 abnormal images. Results show that our FSAD-NET is more accurate than previous zero-shot anomaly detection approaches, which allows us to conclude that incorporating a few abnormal cases into the training process improves the performance of anomaly detection methods. Our approach also shows better accuracy than imbalanced learning methods, suggesting that FSAD-NET is more effective at dealing with very small training sets of abnormal images. We make our code publicly available to foster reproducibility and research in the area.
2 Related Work
Colorectal cancer is considered to be one of the most harmful cancers [27, 30]. One effective method for screening patients for colorectal cancer is colonoscopy, where the goal is to detect polyps that are malignant or pre-malignant using a camera that is inserted into the bowel. Accurate early detection of polyps may improve the 5-year survival rate to over 90% [27]. Unfortunately, the accuracy and speed of manual polyp detection can be affected by human factors, such as fatigue and expertise [22, 24]. Therefore, automated polyp detection systems could help doctors improve polyp detection accuracy during a colonoscopy [29]. Traditional systems to detect polyps are based on a supervised two-class classifier [9, 29] trained with large training sets of images without polyps (i.e., normal) and images containing polyps (i.e., abnormal). Annotation of such training sets is unfortunately difficult because the vast majority of colonoscopy video frames are normal, making the manual search for images that contain polyps challenging. Imbalanced learning solutions can therefore be used in this context [10, 11], but their extension to polyp detection may not be effective without a relatively large number of abnormal images in the training set. Because of this limitation, zero-shot anomaly detection methods have been studied [26, 25, 21, 13, 16, 19, 14], where the idea is to learn a distribution of normal images in a particular feature space, so that test samples that do not fit this distribution well are classified as outliers that may contain a polyp.
Zero-shot anomaly detection methods assume that the conditional generative model [3, 15, 31, 26, 25, 21, 14] can only reconstruct normal data. Hence, when presented with an abnormal test image, the model produces a large reconstruction error. However, using an image reconstruction error for training is an indirect optimisation of the anomaly score, which can lead to a sub-optimal training process. For example, an abnormal image with a small polyp may have a low reconstruction error because of the small area affected by the polyp, and can therefore be wrongly classified as normal. We advocate that the performance of zero-shot anomaly detection methods can improve with the use of a small set of abnormal training images (fewer than 100). Such imbalanced learning problems have been tackled by few-shot classification approaches before. However, our problem has a different setup compared to problems handled by traditional few-shot learning methods, which generally have many few-shot balanced multi-class problems for training [28, 17, 2], while ours has only one few-shot highly imbalanced binary problem for training. Hence, we can only compare our method with baseline approaches that handle imbalanced learning [23, 11]. For instance, Ren et al. [23] propose a learning algorithm for highly imbalanced learning problems that weights training samples using a balanced validation set; the need for this validation set is a disadvantage of this approach. The focal loss approach [11] is effective at handling imbalanced learning, but it may still need a large number of samples from both classes.
Few-shot anomaly detection has been shown in a non-medical image analysis context with the method SIN [19], which is designed to directly optimise an anomaly score for normal and abnormal images. The main challenge in training SIN lies in the high dimensionality of the images [18]. One way to alleviate this challenge is to introduce a dimensionality reduction step before training SIN. Recently, deep infomax (DIM) [6] has been shown to be an effective dimensionality reduction approach. In our paper, we propose a method that uses DIM to learn a low-dimensional feature embedding that is then used by SIN to classify anomalies.
3 Data Set and Method
3.1 Data set
The data set is obtained from 18 colonoscopy videos from 15 patients. Video frames containing blurred visual information are removed using the variance of Laplacian method [5]. We then sub-sample consecutive frames by taking one of every five frames because the correlation between them makes training ineffective. We also remove frames containing feces and water to reduce the need for a very large normal training set (we plan to handle such distractors in future work). This data set is defined by $\mathcal{D} = \{(\mathbf{x}_i, t_i, y_i)\}_{i=1}^{|\mathcal{D}|}$, where $\mathbf{x}: \Omega \rightarrow \mathbb{R}^3$ denotes a colonoscopy frame ($\Omega$ represents the frame lattice), $t \in \mathbb{N}$ represents patient identification (the data set has been de-identified, so $t$ is useful only for splitting $\mathcal{D}$ into training, testing and validation sets in a patient-wise manner), and $y \in \{\text{normal}, \text{abnormal}\}$ denotes the normal (without polyp) and abnormal (with polyp) classes. The distribution of this data set is as follows: 1) Training set: a set of 13250 normal images (without polyps), denoted by $\mathcal{D}^{(n)}$, and a set containing between 10 and 80 abnormal images, denoted by $\mathcal{D}^{(a)}$; 2) Validation set: 100 normal images and 100 abnormal images for model selection; and 3) Testing set: 967 images, with 217 (25% of the set) abnormal images and 750 (75% of the set) normal images. The patients in the testing set do not appear in the training/validation sets and vice versa. This abnormality proportion (on the testing set) is commonly used in other anomaly detection literature [21, 26]. These frames were obtained with the Olympus® 190 dual focus endoscope.
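The frame selection described above (blur removal followed by one-in-five sub-sampling) can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in the paper: the 4-neighbour Laplacian kernel, the function names `laplacian_variance` and `select_frames`, and the `blur_threshold` parameter are all hypothetical, since the text does not state the kernel or the threshold that was used.

```python
from statistics import pvariance


def laplacian_variance(gray):
    """Variance of the 4-neighbour Laplacian over the interior pixels.

    `gray` is a 2-D grayscale image given as a list of rows (at least 3x3).
    A low variance means few sharp edges, i.e. a likely blurred frame.
    """
    height, width = len(gray), len(gray[0])
    responses = []
    for r in range(1, height - 1):
        for c in range(1, width - 1):
            lap = (gray[r - 1][c] + gray[r + 1][c]
                   + gray[r][c - 1] + gray[r][c + 1]
                   - 4 * gray[r][c])
            responses.append(lap)
    return pvariance(responses)


def select_frames(frames, blur_threshold, step=5):
    """Drop blurred frames, then keep one of every `step` remaining frames.

    `blur_threshold` is a hypothetical, data-dependent value.
    """
    sharp = [f for f in frames if laplacian_variance(f) >= blur_threshold]
    return sharp[::step]


# Toy usage: a flat (edge-free) frame and a checkerboard (sharp) frame.
flat = [[10] * 5 for _ in range(5)]
checker = [[((r + c) % 2) * 255 for c in range(5)] for r in range(5)]
kept = select_frames([flat, checker] * 10, blur_threshold=1.0)
```

In this toy run the flat frames are discarded because their Laplacian response is constant, and the ten remaining checkerboard frames are thinned to two.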

3.2 Method
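The training process of our proposed FSAD-NET (Fig. 3) is divided into two stages: 1) pre-training of a feature encoder $\mathbf{e} = E_\theta(\mathbf{x})$ ($\theta$ is the encoder parameter and $\mathbf{e} \in \mathbb{R}^E$) to learn an image embedding that maximises the mutual information (MI) between normal images and their embeddings [6]; and 2) training of the SIN $S_\psi(E_\theta(\mathbf{x}))$ [19], parameterised by $\psi$, with a contrastive-like loss that uses the normal set $\mathcal{D}^{(n)}$ and the abnormal set $\mathcal{D}^{(a)}$ to achieve the goal $S_\psi(E_\theta(\mathbf{x} \in \mathcal{D}^{(a)})) > S_\psi(E_\theta(\mathbf{x} \in \mathcal{D}^{(n)}))$.
More specifically, the training of the encoder to maximise the MI between the normal samples and their feature embeddings [6] is achieved with
$$\theta^*, \phi_G^*, \phi_L^* = \arg\max_{\theta, \phi_G, \phi_L} \Big( \alpha \hat{I}_{\phi_G}(\mathbf{x}; E_\theta(\mathbf{x})) + \frac{\beta}{|\mathcal{M}|} \sum_{\omega \in \mathcal{M}} \hat{I}_{\phi_L}(\mathbf{x}(\omega); E_\theta(\mathbf{x}(\omega))) \Big) + \gamma \arg\min_{\theta} \arg\max_{\phi_P} \hat{F}_{\phi_P}(\mathbb{V} \,\|\, \mathbb{U}_{\mathbb{P},\theta}), \quad (1)$$
where $\alpha, \beta, \gamma$ are the model hyperparameters, the functions $\hat{I}_{\phi_G}(\cdot)$ and $\hat{I}_{\phi_L}(\cdot)$ denote an MI lower bound based on the Donsker-Varadhan representation of the Kullback-Leibler (KL)-divergence [6], defined by
$$\hat{I}_{\phi_G}(\mathbf{x}; E_\theta(\mathbf{x})) = \mathbb{E}_{\mathbb{J}}\big[D_{\phi_G}(\mathbf{x}, E_\theta(\mathbf{x}))\big] - \log \mathbb{E}_{\mathbb{M}}\big[e^{D_{\phi_G}(\mathbf{x}, E_\theta(\mathbf{x}))}\big], \quad (2)$$
with $\mathbb{J}$ denoting the joint distribution between images and their respective embeddings, $\mathbb{M}$ representing the product of the marginals of the images and embeddings, and $D_{\phi_G}$ being a discriminator parameterised by $\phi_G$. Also in (1), the function $\hat{I}_{\phi_L}(\mathbf{x}(\omega); E_\theta(\mathbf{x}(\omega)))$, defined similarly as (2) for the discriminator $D_{\phi_L}$, is the local MI between image regions $\mathbf{x}(\omega)$ ($\omega \in \mathcal{M} \subset \Omega$) and respective local embeddings $E_\theta(\mathbf{x}(\omega))$. Moreover in (1),
$$\hat{F}_{\phi_P}(\mathbb{V} \,\|\, \mathbb{U}_{\mathbb{P},\theta}) = \mathbb{E}_{\mathbb{V}}\big[\log D_{\phi_P}(\mathbf{e})\big] + \mathbb{E}_{\mathbb{P}}\big[\log\big(1 - D_{\phi_P}(E_\theta(\mathbf{x}))\big)\big], \quad (3)$$
with $\mathbb{V}$ denoting a prior distribution for the embeddings $\mathbf{e}$ ($\mathbb{V}$ is assumed to be a normal distribution $\mathcal{N}(\mu_{\mathbb{V}}, \Sigma_{\mathbb{V}})$, with mean $\mu_{\mathbb{V}}$ and covariance $\Sigma_{\mathbb{V}}$), $\mathbb{P}$ the distribution of the normal training images, $\mathbb{U}_{\mathbb{P},\theta}$ the distribution of the embeddings $\mathbf{e} = E_\theta(\mathbf{x})$, and $D_{\phi_P}$ a discriminator modelled with adversarial training to estimate the likelihood that the input is sampled from $\mathbb{V}$ or $\mathbb{U}_{\mathbb{P},\theta}$. This objective function pulls the feature embeddings of the normal images toward $\mathbb{V}$.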
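To make the Donsker-Varadhan lower bound concrete, the following minimal sketch estimates it from discriminator outputs: the mean score on matched (image, embedding) pairs minus the log of the mean exponentiated score on mismatched pairs. The function names are hypothetical, and the score lists stand in for the outputs of a trained discriminator, which is not modelled here.

```python
import math


def log_mean_exp(values):
    """Numerically stable log of the mean of exp(values)."""
    peak = max(values)
    total = sum(math.exp(v - peak) for v in values)
    return peak + math.log(total / len(values))


def dv_mi_lower_bound(joint_scores, marginal_scores):
    """Donsker-Varadhan estimate of an MI lower bound.

    joint_scores:    discriminator outputs on matched (image, embedding) pairs.
    marginal_scores: discriminator outputs on mismatched pairs, i.e. samples
                     from the product of the marginals.
    """
    joint_term = sum(joint_scores) / len(joint_scores)
    return joint_term - log_mean_exp(marginal_scores)


# Toy usage: a discriminator that scores matched pairs higher than
# mismatched ones yields a positive bound.
bound = dv_mi_lower_bound([1.0, 3.0], [0.0, 0.0])
```

A discriminator that cannot tell matched from mismatched pairs (identical scores on both) gives a bound of zero, which matches the intuition that the embedding then carries no measurable information about the image.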
The next step of the learning process consists of computing the embeddings of normal and abnormal images with the encoder $E_\theta(\cdot)$ to train the SIN $S_\psi(\cdot)$ using a contrastive-like loss to directly optimise the anomaly score [19]. More specifically, the contrastive loss for each training sample is defined as:
$$\ell_\psi = \mathbb{I}(y \text{ is normal}) \, \big|T(S_\psi(\mathbf{e}))\big| + \mathbb{I}(y \text{ is abnormal}) \, \max\big(0, a - T(S_\psi(\mathbf{e}))\big), \quad (4)$$
where $\mathbb{I}(\cdot)$ is an indicator function that is equal to one when the condition in the parameter is true, and zero otherwise, $T(s) = \frac{s - \mu_T}{\sigma_T}$, with $\mu_T$ and $\sigma_T$ representing the mean and standard deviation of the prior distribution for the anomaly scores for normal images, and $a$ is the minimum margin between $\mu_T$ and the anomaly scores of abnormal images [19]. The loss in (4) pulls the scores from normal images to $\mu_T$ and pushes the scores of abnormal images away from $\mu_T$ with a margin of at least $a$.
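The per-sample behaviour of the loss in (4) can be illustrated with a short sketch in the form of the deviation loss of [19]: the absolute standardised score for a normal sample, and a hinge on the margin for an abnormal one. The function name, the default prior statistics (zero mean, unit standard deviation) and the margin value used below are assumptions for illustration; the paper estimates the margin on the validation set and this text does not report its value.

```python
def deviation_loss(score, is_abnormal, margin, mu_t=0.0, sigma_t=1.0):
    """Contrastive-like loss for one anomaly score.

    score:       anomaly score produced by the score inference network.
    is_abnormal: True for an image with a polyp, False otherwise.
    margin:      minimum standardised deviation required of abnormal scores
                 (hypothetical value supplied by the caller).
    mu_t, sigma_t: mean and standard deviation of the prior on normal scores
                 (assumed defaults).
    """
    deviation = (score - mu_t) / sigma_t
    if is_abnormal:
        # Penalise abnormal scores that are less than `margin` above mu_t.
        return max(0.0, margin - deviation)
    # Pull normal scores toward mu_t.
    return abs(deviation)


# Toy usage with a hypothetical margin of 5.
normal_at_mean = deviation_loss(0.0, is_abnormal=False, margin=5.0)
abnormal_too_close = deviation_loss(0.0, is_abnormal=True, margin=5.0)
abnormal_far_enough = deviation_loss(6.0, is_abnormal=True, margin=5.0)
```

Only abnormal samples whose standardised score falls short of the margin contribute a gradient, so the few available abnormal images are not pushed arbitrarily far once they are separated from the normal scores.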
During inference, we take a test image $\mathbf{x}$, compute the feature embedding with $\mathbf{e} = E_\theta(\mathbf{x})$ and then compute the score with $S_\psi(\mathbf{e})$; the score is then compared to a threshold to determine if the test image is normal or abnormal. We interpret the score as an estimate of closeness, which is related to the likelihood that the embedding of a colonoscopy image belongs to the set of normal images.
4 Experiment
4.1 Experimental Setup
The original colonoscopy images are downsampled from their initial resolution to reduce the training and inference computational costs. We found that the reduced resolution is the minimum size that we can use without a negative impact on AUC. We note that the polyps are still visible at such resolution, as shown in Fig. S4. The model selection (to select optimiser, learning rate and model structure) is done using the validation set mentioned in Sec. 3.1. We use the Adam [8] optimiser during training with a learning rate of 0.0001 for the encoder and SIN learning. We adopt batch normalisation for both stages. We make sure our method uses a backbone architecture similar to that of the competing approaches in Tab. 1. In particular, the encoder uses four convolution layers (with 64, 128, 256, 512 filters of size 4 × 4). The global discriminator has three convolutional layers (with 128, 64, 32 filters of size 3 × 3). The local discriminator has three convolutional layers (with 192, 512, 512 filters of size 1 × 1). The prior discriminator has three linear layers (with 1000, 200, 1 nodes per layer). We also use the validation set to estimate the margin in (4). In (1), we follow the DIM paper [6] for setting the three weighting hyper-parameters. For the prior distribution for the embeddings in (3), we set the mean to a vector of zeros and the covariance to an identity matrix, both matching the dimensionality of the embedding. To train the model, we first train the encoder and the local, global and prior discriminators (representation learning stage) for 6000 epochs with a mini-batch of 64 samples. We then train SIN for 1000 epochs, with a batch size of 64, while fixing the parameters of the encoder and the local, global and prior discriminators. We implement our method using PyTorch [20].
4.2 Anomaly Detection Results
Table 1: AUC results on the test set.
Setting | Method | AUC
Zero-Shot | DAE [16] | 0.6384
Zero-Shot | VAE [1] | 0.6528
Zero-Shot | OC-GAN [21] | 0.6137
Zero-Shot | f-AnoGAN(ziz) [26] | 0.6629
Zero-Shot | f-AnoGAN(izi) [26] | 0.6831
Zero-Shot | f-AnoGAN(izif) [26] | 0.6997
Zero-Shot | ADGAN [14] | 0.7391
Few-Shot | Densenet121 [7] (40 abnormal samples) | 0.8231
Few-Shot | cross-entropy (30 abnormal samples) | 0.6826
Few-Shot | cross-entropy (40 abnormal samples) | 0.7115
Few-Shot | Focal loss (30 abnormal samples) | 0.7038
Few-Shot | Focal loss (40 abnormal samples) | 0.7235
Few-Shot | without RL (40 abnormal samples) | 0.6011
Few-Shot | Learning to Reweight [23] (40 abnormal samples) | 0.7862
Few-Shot | AE network (30 abnormal samples) | 0.819
Few-Shot | AE network (40 abnormal samples) | 0.835
Few-Shot | FSAD-NET (30 abnormal samples) | 0.855
Few-Shot | FSAD-NET (40 abnormal samples) | 0.9033
The test set AUC results shown in Table 1 are divided into zero-shot and few-shot. The zero-shot rows show results obtained from the following zero-shot anomaly detection methods (codes were downloaded from the authors' GitHub pages and tuned for our problem): DAE [16], VAE [1], ADGAN [14], OCGAN [21], and f-AnoGAN [26] with its variants that involve image-to-image mean square error (MSE) loss (izi), Z-to-Z MSE loss (ziz) and its hybrid version (izif). Our FSAD-NET model outperforms all zero-shot learning methods by a large margin, showing the importance of using a few abnormal samples for training. For the few-shot results, we consider the cases where we have 30 and 40 abnormal training images, and we test several variants of the FSAD-NET. We use between 30 and 40 abnormal training images because that is the range where we observe that our FSAD-NET produces stable AUC results. As a baseline approach, we train Densenet121 [7] using high levels of data augmentation to deal with the training imbalance issue. However, our FSAD-NET outperforms Densenet121 by a large margin. The variants of FSAD-NET are designed to test the importance of each stage of our method. The methods labelled as 'cross-entropy' and 'Focal loss' replace the contrastive loss in (4) by the cross entropy loss (commonly used in classification problems) [4] and the focal loss (robust to imbalanced learning problems) [11], respectively. FSAD-NET shows substantially better results, indicating the importance of using a more appropriate loss function for few-shot anomaly detection. To show the importance of representation learning (RL) in FSAD-NET, we tested FSAD-NET without it, which shows much lower AUC results than competing approaches. Also, we compared our method with a few-shot learning baseline [23], which proposes a learning algorithm for highly imbalanced learning problems. When used to train FSAD-NET, it achieved a mean AUC of 78.62% with 40 abnormal training samples. Hence our model shows more accurate results than that approach. Furthermore, we test the importance of DIM to train the encoder in (1) by replacing it by the deep auto-encoder [16] (labelled as AE network). Results show that FSAD-NET is more accurate, indicating the effectiveness of using MI and prior distribution for learning the feature embeddings in (1).
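For reference, the two replacement losses used in the 'cross-entropy' and 'Focal loss' variants can be sketched for a single binary prediction as below. This is an illustrative sketch only: the function names are hypothetical, and the `gamma` and `alpha` defaults are values commonly used with the focal loss [11], not values reported for these experiments.

```python
import math


def binary_cross_entropy(p_abnormal, is_abnormal):
    """Cross-entropy for one sample, given the predicted abnormal probability."""
    p_true = p_abnormal if is_abnormal else 1.0 - p_abnormal
    return -math.log(p_true)


def binary_focal_loss(p_abnormal, is_abnormal, gamma=2.0, alpha=0.25):
    """Focal loss for one sample.

    The factor (1 - p_true) ** gamma down-weights well-classified samples,
    and alpha balances the two classes. Both defaults are assumptions.
    """
    p_true = p_abnormal if is_abnormal else 1.0 - p_abnormal
    alpha_true = alpha if is_abnormal else 1.0 - alpha
    return -alpha_true * (1.0 - p_true) ** gamma * math.log(p_true)


# Toy usage: an easy and a hard abnormal sample.
easy = binary_focal_loss(0.9, is_abnormal=True)
hard = binary_focal_loss(0.1, is_abnormal=True)
```

With `gamma=0` and `alpha=1` the focal loss on an abnormal sample reduces to the cross-entropy, which makes explicit that the two variants differ only in how strongly easy samples are down-weighted.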

We further investigate the performance of our proposed FSAD-NET as a function of the number of abnormal training images, which varies from 10 to 80. For each number of abnormal training images, we train our model three times, using different training sets each time, and we compute the mean and standard deviation of the AUC results. The result of this experiment in Fig. 3 shows that: 1) the performance stabilises between 85% and 90% when feeding the model 30 or more abnormal training images; and 2) our method is robust to extremely small training sets of abnormal images. We show a few true positive, true negative, false positive and false negative results produced by FSAD-NET in Fig. S4.
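The evaluation protocol above (AUC per run, then mean and standard deviation over three runs) can be sketched as follows. The AUC is computed here through its pairwise-ranking definition, which is equivalent to the area under the ROC curve; the function names are hypothetical, the scores are toy values, and the use of the sample standard deviation is an assumption since the text does not say which estimator was used.

```python
from statistics import mean, stdev


def auc(scores, labels):
    """AUC as the probability that an abnormal image outscores a normal one.

    labels use 1 for abnormal and 0 for normal; ties count as half a win.
    """
    abnormal = [s for s, y in zip(scores, labels) if y == 1]
    normal = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if a > n else 0.5 if a == n else 0.0
               for a in abnormal for n in normal)
    return wins / (len(abnormal) * len(normal))


def summarise_runs(auc_values):
    """Mean and sample standard deviation of AUC over repeated runs."""
    return mean(auc_values), stdev(auc_values)


# Toy usage: one run with four test images, then a summary of three runs.
single_run = auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
avg, spread = summarise_runs([0.5, 0.75, 1.0])
```

Because the AUC depends only on the ranking of the scores, it does not require choosing the decision threshold used at inference time.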

5 Conclusion
We propose the first few-shot anomaly detection framework, named FSAD-NET, for medical image analysis applications. FSAD-NET consists of an encoder trained to maximise the mutual information between normal images and their respective embeddings, and a score inference network that classifies colonoscopy frames as normal or abnormal. Results show that our method achieves state-of-the-art anomaly detection performance on our colonoscopy data set, compared to previous zero-shot anomaly detection methods and imbalanced learning methods. In the future, we expect to extend our approach to polyp localisation and to work with colonoscopy frames containing distractors, like feces and water.
References
- [1] Doersch, C.: Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908 (2016)
- [2] Finn, C., Abbeel, P., Levine, S.: Model-agnostic meta-learning for fast adaptation of deep networks. arXiv preprint arXiv:1703.03400 (2017)
- [3] Gong, D.: Memorizing normality to detect anomaly: Memory-augmented deep autoencoder for unsupervised anomaly detection. arXiv preprint arXiv:1904.02639 (2019)
- [4] Goodfellow, I., Bengio, Y., Courville, A.: Deep learning. MIT press (2016)
- [5] He, X.: Laplacian score for feature selection. In: Advances in neural information processing systems. pp. 507–514 (2006)
- [6] Hjelm, R.D., Fedorov, A., Lavoie-Marchildon, S., Grewal, K., Bachman, P., Trischler, A., Bengio, Y.: Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670 (2018)
- [7] Huang, G., Liu, Z.a.: Densely connected convolutional networks. In: CVPR (2017)
- [8] Kingma, D.P.: Adam: A method for stochastic optimization. In: ICLR (2015)
- [9] Korbar, B., Olofson, A.M., Miraflor, A.P., Nicka, C.M., Suriawinata, M.A., Torresani, L., Suriawinata, A.A., Hassanpour, S.: Deep learning for classification of colorectal polyps on whole-slide images. Journal of pathology informatics 8 (2017)
- [10] Li, Z., Kamnitsas, K., Glocker, B.: Overfitting of neural nets under class imbalance: Analysis and improvements for segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 402–410. Springer (2019)
- [11] Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object detection. IEEE transactions on pattern analysis and machine intelligence (2018)
- [12] Litjens, G., Kooi, T., Bejnordi, B.E., Setio, A.A.A., Ciompi, F., Ghafoorian, M., van der Laak, J.A., Van Ginneken, B., Sánchez, C.I.: A survey on deep learning in medical image analysis. Medical image analysis 42, 60–88 (2017)
- [13] Liu, W.: Future frame prediction for anomaly detection–a new baseline. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 6536–6545 (2018)
- [14] Liu, Y., Tian, Y., Maicas, G., Pu, L.Z., Singh, R., Verjans, J.W., Carneiro, G.: Photoshopping colonoscopy video frames. arXiv preprint arXiv:1910.10345 (2019)
- [15] Makhzani, A.: Adversarial autoencoders. arXiv preprint arXiv:1511.05644 (2015)
- [16] Masci, J., Meier, U., Cireşan, D., Schmidhuber, J.: Stacked convolutional auto-encoders for hierarchical feature extraction. In: International Conference on Artificial Neural Networks. pp. 52–59. Springer (2011)
- [17] Nichol, A., Achiam, J., Schulman, J.: On first-order meta-learning algorithms. arXiv preprint arXiv:1803.02999 (2018)
- [18] Pang, G., Cao, L., Chen, L., Liu, H.: Learning representations of ultrahigh-dimensional data for random distance-based outlier detection. In: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. pp. 2041–2050 (2018)
- [19] Pang, G., Shen, C., van den Hengel, A.: Deep anomaly detection with deviation networks. In: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. pp. 353–362 (2019)
- [20] Paszke, A.: Automatic differentiation in pytorch (2017)
- [21] Perera, P.: Ocgan: One-class novelty detection using gans with constrained latent representations. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 2898–2906 (2019)
- [22] Pu, L.Z.C.T.: Computer-aided diagnosis for characterising colorectal lesions: Interim results of a newly developed software. Gastrointestinal Endoscopy 87(6), AB245 (2018)
- [23] Ren, M., Zeng, W., Yang, B., Urtasun, R.: Learning to reweight examples for robust deep learning. arXiv preprint arXiv:1803.09050 (2018)
- [24] Rijn, J.C.V.: Polyp miss rate determined by tandem colonoscopy: a systematic review. The American journal of gastroenterology 101(2), 343 (2006)
- [25] Schlegl, T.: Unsupervised anomaly detection with generative adversarial networks to guide marker discovery. In: International Conference on Information Processing in Medical Imaging. pp. 146–157. Springer (2017)
- [26] Schlegl, T., Seeböck, P., Waldstein, S.M., Langs, G., Schmidt-Erfurth, U.: f-anogan: Fast unsupervised anomaly detection with generative adversarial networks. Medical image analysis 54, 30–44 (2019)
- [27] Siegel, R.: Colorectal cancer statistics, 2014. CA: a cancer journal for clinicians 64(2), 104–117 (2014)
- [28] Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P.H., Hospedales, T.M.: Learning to compare: Relation network for few-shot learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 1199–1208 (2018)
- [29] Tian, Y., P.L.S.R.B.A., Carneiro, G.: One-stage five-class polyp detection and classification (2019)
- [30] Tian, Y.: One-stage five-class polyp detection and classification. In: 2019 IEEE 16th International Symposium on Biomedical Imaging (ISBI 2019). pp. 70–73. IEEE (2019)
- [31] Zong, B.: Deep autoencoding gaussian mixture model for unsupervised anomaly detection (2018)
A Supplementary Material



