Comparing to Learn: Surpassing ImageNet Pretraining on Radiographs by Comparing Image Representations
email: {hongyuzhou, shirlyyu, tronbian, ivanyfhu, kylekma, yefengzheng}@tencent.com
Abstract
In the deep learning era, pretrained models play an important role in medical image analysis, and ImageNet pretraining has been widely adopted as the default choice. However, there is an obvious domain gap between natural images and medical images. To bridge this gap, we propose a new pretraining method that learns from 700k radiographs without manual annotations. We call our method Comparing to Learn (C2L) because it learns robust features by comparing different image representations. To verify the effectiveness of C2L, we conduct comprehensive ablation studies and evaluate it on different tasks and datasets. The experimental results on radiographs show that C2L significantly outperforms ImageNet pretraining and previous state-of-the-art approaches. Code and models are available at https://github.com/funnyzhou/C2L_MICCAI2020.
Keywords: Pretrained models · Self-supervised learning · Radiograph
1 Introduction
ImageNet [2] pretraining has proved to be an effective way to perform 2D transfer learning for medical image analysis. Numerous experiments have shown that, compared with training from scratch, pretrained models not only achieve higher accuracy but also speed up convergence. These benefits can be attributed to two factors: (a) effective learning algorithms designed for deep neural networks and (b) generalized feature representations learned from a large quantity of natural images. However, there exists an obvious domain gap between natural images and medical images, which raises the question of whether we can build a pretrained model directly from medical images, and how to do so.
It is well known that reliable medical annotations require diagnoses from domain experts, and such annotations certainly help improve model performance. On the other hand, it is often difficult to obtain a large number of expert annotations given limited medical resources and the need to protect patient privacy. Hence, how to learn from vast amounts of unannotated data has drawn increasing attention in the medical imaging community. Zhou et al. [13] proposed Model Genesis, a self-supervised pretraining method that utilizes medical images without manual labeling. On the chest X-ray classification task, Model Genesis achieves performance comparable to ImageNet pretraining but still cannot beat it.
In this paper, we present a novel self-supervised pretraining approach that provides pretrained 2D deep models for radiograph-related tasks from massive unannotated data. We name our method Comparing to Learn (C2L) because the goal is to learn general image representations by using the comparison of different image features as the supervision. Different from Model Genesis [13], which resorts to an image restoration pretext task, the supervision signal of C2L comes from self-defined representation similarity. Similar ideas have been adopted in [1, 9, 4], most of which take advantage of the transitive invariance of images to produce self-supervised signals. In contrast, we focus on feature-level contrast and propose to construct homogeneous and heterogeneous data pairs by mixing image and feature batches. Moreover, a momentum-based teacher-student architecture is proposed for contrastive learning, where the teacher and student networks share the same structure but are updated differently: the teacher model is updated using both itself and the student network. Extensive experiments on different datasets and tasks demonstrate that C2L surpasses ImageNet pretraining and other competitive baselines by non-trivial margins.

2 Proposed Method
In this section, we introduce the proposed Comparing to Learn (C2L) method in detail. The overall workflow is provided in Figure 1.
Batch mixup and feature mixup. As shown in Figure 1 and Algorithm 1, for each input image batch, we first use random augmentation (e.g., random cropping, rotation, and cutout [3]) to generate two augmented batches $b_1$ and $b_2$. Different from the traditional image-level mixup [11], a batch-wise mixup operation is applied to each augmented batch, producing $\hat{b}_1$ and $\hat{b}_2$. Suppose an augmented batch $b$ contains $n$ images $\{x_1, \dots, x_n\}$; we randomly shuffle $b$ to construct its paired batch $b_s$, and the mixed batch can be expressed as:

$\hat{b} = \lambda\, b + (1 - \lambda)\, b_s$,   (1)

where $\lambda \sim \mathrm{Beta}(\beta, \beta)$ and Beta stands for the beta distribution. In practice, we found that using the same mixing factor $\lambda$ and shuffling order for both augmented batches ($b_1$ and $b_2$) helps improve model performance. As for the feature mixup in Figure 1, we apply the same mixing strategy to the feature representations.
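To make the batch mixup concrete, below is a minimal PyTorch-style sketch; the function name, the Beta parameter `alpha`, and the tensor shapes are illustrative rather than the authors' exact implementation.

```python
import torch
from torch.distributions import Beta

def batch_mixup(batch: torch.Tensor, alpha: float = 1.0):
    """Batch-wise mixup: mix a batch with a shuffled copy of itself (Eq. 1).

    batch: (N, C, H, W) tensor of already-augmented images.
    Returns the mixed batch together with the shuffle order and mixing
    factor so that the second augmented batch can reuse the same pair.
    """
    lam = Beta(alpha, alpha).sample().item()           # mixing factor in [0, 1]
    perm = torch.randperm(batch.size(0))               # random shuffle of the batch
    mixed = lam * batch + (1.0 - lam) * batch[perm]    # lam * b + (1 - lam) * b_s
    return mixed, perm, lam

# The same (lam, perm) is reused for both augmented batches b1 and b2:
# mixed1, perm, lam = batch_mixup(b1)
# mixed2 = lam * b2 + (1.0 - lam) * b2[perm]
```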
Teacher network. An intuitive idea is to use the same model for both the student and teacher networks. However, we found that this strategy does not work in practice and may lead to gradient explosion. Meanwhile, constructing the teacher network with a momentum update has been widely adopted to produce stable predictions [8, 4]. In our case, using momentum helps stabilize the training process and reduces the difficulty of network optimization. As shown in line 17 of Algorithm 1, the momentum update can be formalized as:

$\theta_t \leftarrow \alpha\, \theta_t + (1 - \alpha)\, \theta_s$,   (2)

where the exponential factor $\alpha$ controls the degree of momentum, and $\theta_t$ and $\theta_s$ denote the parameters of the teacher and student networks, respectively. We can see that the teacher model is updated using both itself and the student network. In practice, we pass $b_1$ and $\hat{b}_1$ to the student network, while $b_2$ and $\hat{b}_2$ are passed to the teacher network. In Algorithm 1, we use $q$ and $\hat{q}$ to represent the feature vectors from the student model, and $k$ and $\hat{k}$ to represent the outputs of the teacher model.
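The momentum update of Equation (2) can be sketched as follows, assuming the student and teacher are two PyTorch modules with identical architectures; `alpha = 0.999` matches the momentum factor reported in Section 4.

```python
import torch

@torch.no_grad()
def momentum_update(teacher: torch.nn.Module, student: torch.nn.Module,
                    alpha: float = 0.999):
    """theta_t <- alpha * theta_t + (1 - alpha) * theta_s (Eq. 2).

    The teacher receives no gradients; it only tracks an exponential
    moving average of the student's parameters.
    """
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.data.mul_(alpha).add_(p_s.data, alpha=1.0 - alpha)
```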
Homogeneous and heterogeneous pairs. To construct homogeneous pairs, we assume that data augmentation (including the mixup operations) only slightly changes the distribution of the training data. Based on this, each homogeneous pair contains the results of the same set of operations, which includes random augmentation, batch mixup or feature mixup, as shown in Figure 1. For heterogeneous pairs, we simply contrast the current features with all preceding features stored in the memory queue.
Feature comparison, memory queue and loss function. As mentioned above, the goal of C2L is to minimize the distance between homogeneous representation pairs such as $(q, k)$ and $(\hat{q}, \hat{k})$. Meanwhile, it is also necessary to maximize the difference between heterogeneous representations, so we contrast current features with past features collected from previous training iterations. To store these past representations, we employ a memory queue $Q$ as proposed in [4]. We use a large $Q$ because contrasting current features with a great number of preceding features usually leads to better representations (as shown in Table 2). The pairs of current and past features can thus be formalized as $(q, Q)$ and $(\hat{q}, Q)$, as shown in Figure 1. For simplicity, we use $Q_j$ to denote a specific feature vector in $Q$, where $j \in \{1, \dots, K\}$ and $K$ is the length of the queue, and we use the subscript $i$ to index a feature vector within a batch of features. We can then convert this distance measurement problem into a naive classification problem. For image $x_i$ in the batch, the prediction $p_i$ in line 12 of Algorithm 1 can be further expressed as:

$p_i = \{\, q_i \cdot k_i,\; q_i \cdot Q_1,\; q_i \cdot Q_2,\; \dots,\; q_i \cdot Q_K \,\}$,   (3)
whose length is $K+1$. A similar set of predictions is computed for $\hat{q}_i$ against $\hat{k}_i$ and $Q$. To enable more comparisons, we further apply feature mixup to the student and teacher features; in Algorithm 1, we use $q_m$ and $k_m$ to represent the outputs of feature mixup. Similarly, we also compare $q_m$ with $k_m$ and $Q$, which leads to another set of predictions:

$p_i^{m} = \{\, q_{m,i} \cdot k_{m,i},\; q_{m,i} \cdot Q_1,\; q_{m,i} \cdot Q_2,\; \dots,\; q_{m,i} \cdot Q_K \,\}$.   (4)
It is worth noting that the first item in each set should be larger than the other items because it is the inner product of a homogeneous pair. We can therefore apply a cross-entropy (CE) loss to each set of predictions, treating the one-hot vector {1, 0, 0, …, 0} as the ground truth.
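The prediction sets in Equations (3) and (4) reduce to a (K+1)-way classification with class index 0 as the target, which can be sketched as below. The L2 normalization follows Section 4, while the temperature term is an assumption in the spirit of [4], not a value stated in the paper.

```python
import torch
import torch.nn.functional as F

def c2l_contrastive_loss(q: torch.Tensor, k: torch.Tensor,
                         queue: torch.Tensor, temperature: float = 0.07):
    """Cross-entropy over one prediction set (Eq. 3 or Eq. 4).

    q:     (N, D) student features, L2-normalized.
    k:     (N, D) teacher features for the same images, L2-normalized, no grad.
    queue: (K, D) past teacher features from the memory queue.
    """
    pos = torch.einsum("nd,nd->n", q, k).unsqueeze(1)    # homogeneous pair, (N, 1)
    neg = torch.einsum("nd,kd->nk", q, queue)            # heterogeneous pairs, (N, K)
    logits = torch.cat([pos, neg], dim=1) / temperature  # (N, K + 1)
    # Ground truth is the one-hot vector {1, 0, ..., 0}, i.e. class index 0.
    labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)
    return F.cross_entropy(logits, labels)
```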
After updating the network parameters, we also update $Q$ by inserting $k$, $\hat{k}$ and $k_m$, respectively. Since $Q$ is a queue with a fixed size, the oldest feature vectors are automatically removed. After training is complete, only a single encoder backbone is extracted to serve as the pretrained model.
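A simplified sketch of the fixed-size memory queue and its update is given below; it assumes the batch size divides the queue length and that the queue is stored as a single tensor, which may differ from the authors' implementation.

```python
import torch
import torch.nn.functional as F

class MemoryQueue:
    """FIFO buffer of past teacher features (simplified sketch)."""

    def __init__(self, feat_dim: int, length: int):
        # Start from random, L2-normalized vectors as harmless placeholders.
        self.feats = F.normalize(torch.randn(length, feat_dim), dim=1)
        self.ptr = 0  # write position for the next batch

    @torch.no_grad()
    def enqueue(self, new_feats: torch.Tensor):
        """Insert a batch of features; the oldest entries are overwritten."""
        n = new_feats.size(0)
        self.feats[self.ptr:self.ptr + n] = new_feats.detach()
        self.ptr = (self.ptr + n) % self.feats.size(0)
```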
3 Datasets
3.1 Pretraining
ImageNet pretraining uses about one million labeled natural images, which helps deep models learn general representations. In this paper, we instead use ChestX-ray14 [10], MIMIC-CXR [6], CheXpert [5] and MURA [7] as unlabeled data for network pretraining. Note that we only use ChestX-ray14 in the ablation studies in order to choose appropriate hyperparameters. After that, we merge the four datasets and discard their labels to perform unsupervised pretraining on approximately 700k unlabeled radiographs (a minimal data-merging sketch is given after the dataset descriptions).
ChestX-ray14. The training set contains 86k images and the validation set contains 25k X-rays. For the ablation study, 70k images from the training set are used for self-supervised pretraining and the remaining 16k images are used for fine-tuning to evaluate the pretrained models. After determining the appropriate hyperparameters, we merge the whole training set with the other three datasets; overall, C2L uses about 700k unlabeled radiographs for model pretraining.
CheXpert. The training set has 220k images while the official validation set contains 234 images. Similar to ChestX-ray14, we only use the training set without labels for self-supervised pretraining.
MIMIC-CXR. The MIMIC-CXR dataset is a large publicly available dataset of chest radiographs in the JPEG format with structured labels derived from free-text radiology reports. The dataset contains 377,110 JPEG format images. In practice, we treat the whole dataset as an unlabeled database.
MURA. MURA is a dataset of bone X-rays. The training set contains 36k X-rays and the validation set contains 3k images. The whole dataset is used for C2L pretraining.
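As referenced above, the four collections can be merged into one unlabeled pretraining set, sketched below; the folder paths are hypothetical placeholders and the class labels returned by `ImageFolder` are simply ignored during C2L pretraining.

```python
from torch.utils.data import ConcatDataset
from torchvision import datasets, transforms

# Augmentations in the spirit of Section 4 (random crop, rotation, flip, grayscale).
augment = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomRotation(10),
    transforms.RandomHorizontalFlip(),
    transforms.Grayscale(num_output_channels=3),
    transforms.ToTensor(),
])

# Hypothetical local copies of the four radiograph datasets.
roots = ["data/chestxray14", "data/chexpert", "data/mimic_cxr", "data/mura"]
pretrain_set = ConcatDataset(
    [datasets.ImageFolder(root, transform=augment) for root in roots]
)  # ~700k radiographs in total; labels are discarded for self-supervised pretraining
```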
3.2 Fine-tuning
We fine-tune our pretrained models on ChestX-ray14, CheXpert and Kaggle Pneumonia Detection and report their experimental results.
ChestX-ray14 and CheXpert. For ChestX-ray14, we use all labeled X-rays in the training set (86k) to fine-tune models pretrained with C2L and report experimental results on the validation set. The same setting also applies to CheXpert where we use the whole labeled training set (220k) for fine-tuning.
Kaggle Pneumonia Detection. This dataset is designed for diagnosing pneumonia automatically and accurately. We split the training set in Stage 1 into a local training set (80%) and a validation set (20%). The evaluation metric is mean average precision.
4 Implementation Details
For pretraining, we employ C2L to pretrain ResNet-18 and DenseNet-121. The default batch size is 256 and the size of each input image is 224×224. For input augmentation, we apply random crop, rotation (10 degrees), grayscale and horizontal flip to each input batch. Moreover, we add cutout to the augmented images in order to increase the diversity of transformations and learn better representations. We apply L2 normalization to each feature vector. The momentum factor is set to 0.999, and the length of the memory queue $Q$ is fixed (its effect is ablated in Table 2). We use SGD as the default optimizer with an initial learning rate of 0.03 and a weight decay of 0.0001. We train each model for 240 epochs and divide the learning rate by 10 at epochs 120, 160 and 200. When fine-tuning the pretrained models on ChestX-ray14 and CheXpert, the input image size is 224×224 and we train both ResNet-18 and DenseNet-121 for 50 epochs. For Kaggle Pneumonia Detection, we employ RetinaNet with ResNet-18 as the backbone; the default image size is 512×512 and the batch size is 4.
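The optimizer, learning-rate schedule and fine-tuning setup described above can be sketched as follows; the SGD momentum value, the projection dimension and the checkpoint file name are illustrative assumptions rather than values from the paper.

```python
import torch
import torchvision

# Pretraining: ResNet-18 encoder with a small projection output (dimension assumed).
model = torchvision.models.resnet18(num_classes=128)
optimizer = torch.optim.SGD(model.parameters(), lr=0.03,
                            momentum=0.9, weight_decay=1e-4)
# 240 epochs in total; divide the learning rate by 10 at epochs 120, 160 and 200.
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[120, 160, 200], gamma=0.1)

# Fine-tuning: reuse the pretrained trunk and attach a task-specific head,
# e.g. 14 binary outputs for ChestX-ray14 with a multi-label loss.
finetune_model = torchvision.models.resnet18(num_classes=14)
state = torch.load("c2l_resnet18.pth", map_location="cpu")  # hypothetical checkpoint
finetune_model.load_state_dict(state, strict=False)         # the new head stays re-initialized
criterion = torch.nn.BCEWithLogitsLoss()
```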
5 Ablation Study
In this part, we conduct experiments on ChestX-ray14. As mentioned above, we use 70k unlabeled images for C2L pretraining and the remaining labeled training images for fine-tuning. We report the averaged AUROC performance as well as per-class results in Table 1. Note that, to save space, the ablation study of the momentum factor $\alpha$ (cf. Equation 2) is provided in the supplementary material.
Table 1: Ablation study of the proposed mixup strategies on ChestX-ray14 (AUROC, %).
 | ImageNet | Mix. | Bat. Mix. | Bat. Mix. + Cons. | Bat. + Feat. Mix. + Cons.
---|---|---|---|---|---
Average | 74.4 | 74.7 | 75.3 | 75.6 | 76.3 |
Atelectasis | 80.0 | 81.4 | 80.1 | 80.9 | 81.9 |
Cardiomegaly | 65.3 | 68.2 | 68.4 | 68.4 | 67.9 |
Effusion | 74.9 | 74.9 | 74.3 | 75.3 | 75.7 |
Infiltration | 68.4 | 67.1 | 66.9 | 68.6 | 68.9 |
Mass | 79.4 | 79.6 | 80.1 | 80.2 | 80.7 |
Nodule | 82.2 | 79.4 | 79.2 | 80.3 | 82.9 |
Pneumonia | 72.1 | 73.2 | 73.3 | 73.7 | 74.6 |
Pneumothorax | 77.7 | 80.7 | 80.6 | 81.7 | 82.4 |
Consolidation | 69.6 | 70.9 | 71.6 | 71.5 | 71.2 |
Edema | 76.4 | 73.6 | 76.1 | 74.9 | 77.0 |
Emphysema | 64.9 | 66.2 | 68.2 | 68.2 | 68.6 |
Fibrosis | 69.9 | 71.5 | 71.1 | 72.0 | 72.1 |
Pleural Thickening | 79.5 | 81.3 | 81.9 | 82.5 | 81.7 |
Hernia | 82.1 | 77.4 | 82.9 | 79.8 | 82.8 |
Table 2: Ablation study of augmentation strategies and the length of the memory queue Q on ChestX-ray14 (averaged AUROC, %).
RandCrop | Rotation | Jigsaw | Dropout | Cutout | Cutout + Mix. | Length of Q | Average
---|---|---|---|---|---|---|---
✓ | ✓ | 74.1 | ||||||||
✓ | ✓ | 74.5 | ||||||||
✓ | ✓ | 74.4 | ||||||||
✓ | ✓ | 74.9 | ||||||||
✓ | ✓ | ✓ | 75.2 | |||||||
✓ | ✓ | ✓ | ✓ | 75.0 | ||||||
✓ | ✓ | ✓ | ✓ | 74.9 | ||||||
✓ | ✓ | ✓ | ✓ | 75.4 | ||||||
✓ | ✓ | ✓ | ✓ | 76.3 |
We first report the ablation results of the proposed mixup approaches in Table 1. The proposed batch mixup already outperforms the original mixup method [11] by 0.6 point in average performance. This is because using the same mixing factor $\lambda$ and shuffling order for both batches helps maintain the consistency between them. After adding the mixed consistency loss, we improve the batch mixup method by about 0.3 point. Since the goal of C2L is to learn powerful feature representations, we further apply mixup to the generated features. Somewhat surprisingly, the proposed feature mixup integrates well with batch mixup and surpasses it by approximately 1 point. In summary, the proposed mixup strategies outperform the original mixup method by 1.6 points.
Another important component of C2L is the set of augmentation strategies. An appropriate augmentation method should reasonably increase the diversity of the augmented batches, which helps pretrained models learn representations that are discriminative enough to distinguish different radiographs. In Table 2, we investigate the effects of widely adopted augmentation strategies. We find that random rotation and cutout enhance the performance by 0.3 point, while adding jigsaw and dropout may degrade performance. Moreover, we find that simply increasing the length of $Q$ can be harmful. We argue that a longer queue may contain more useless features and thus dilute the contribution of the useful representations.
6 Fine-tuning C2L Pretrained Models
In this section, we compare C2L pretrained models with Model Genesis, ImageNet pretraining and MoCo [4]. For pretraining data, we merge the training sets of ChestX-ray14 and CheXpert with the radiographs in MIMIC-CXR and MURA to generate an unlabeled database of approximately 700k images. For network architectures, we deploy ResNet-18 and DenseNet-121, both of which are widely used.
Table 3: Fine-tuning results on ChestX-ray14 (AUROC, %). MG denotes Model Genesis.
 | ResNet-18 | | | | DenseNet-121 | | |
---|---|---|---|---|---|---|---|---
 | MG | ImageNet | MoCo | C2L | MG | ImageNet | MoCo | C2L
Average | 80.9 | 81.5 | 81.4 | 83.5 | 82.4 | 82.9 | 83.0 | 84.4 |
Atelectasis | 79.2 | 80.1 | 79.8 | 82.1 | 80.7 | 81.2 | 81.7 | 82.7 |
Cardiomegaly | 85.9 | 87.7 | 87.5 | 89.7 | 88.3 | 88.5 | 89.2 | 90.5 |
Effusion | 85.7 | 86.2 | 87.0 | 88.2 | 87.0 | 86.7 | 86.6 | 87.9 |
Infiltration | 67.8 | 68.9 | 68.5 | 70.9 | 68.9 | 69.6 | 70.2 | 70.9 |
Mass | 81.9 | 82.5 | 83.0 | 84.5 | 83.6 | 84.4 | 84.0 | 86.3 |
Nodule | 75.4 | 75.2 | 75.5 | 77.2 | 77.0 | 78.1 | 77.8 | 79.8 |
Pneumonia | 74.0 | 74.3 | 74.5 | 76.3 | 74.4 | 75.1 | 75.7 | 76.3 |
Pneumothorax | 85.1 | 85.8 | 85.1 | 87.8 | 87.0 | 86.8 | 86.5 | 88.4 |
Consolidation | 78.3 | 78.6 | 77.9 | 80.6 | 80.0 | 79.3 | 79.8 | 80.7 |
Edema | 86.9 | 87.4 | 87.2 | 89.4 | 87.7 | 88.2 | 88.6 | 89.4 |
Emphysema | 89.7 | 89.8 | 90.0 | 91.8 | 91.0 | 91.6 | 90.7 | 93.0 |
Fibrosis | 80.8 | 81.8 | 80.5 | 83.8 | 82.6 | 83.0 | 82.3 | 85.1 |
Pleural Thickening | 76.1 | 76.2 | 76.4 | 78.2 | 76.8 | 77.2 | 77.5 | 78.3 |
Hernia | 86.4 | 86.8 | 86.3 | 88.8 | 88.9 | 92.1 | 91.6 | 92.2 |
ChestX-ray14. We report fine-tuned AUROC results on the validation set. Besides ImageNet pretraining, we also run experiments with Model Genesis (MG) [13] and the recently proposed MoCo [4]. In Table 3, the proposed C2L method surpasses the other approaches by a significant margin. Although MG and MoCo achieve results comparable to ImageNet pretraining, they do not surpass the ImageNet pretrained models significantly. In contrast, the C2L pretrained model outperforms ImageNet pretraining by 2 points with ResNet-18, and with DenseNet-121 it achieves 84.4% averaged AUROC, 1.5 points higher than ImageNet pretraining.
Table 4: Fine-tuning results on CheXpert (AUROC, %).
Method | Model | Average | Atelectasis | Cardiomegaly | Consolidation | Edema | Pleural Effusion
---|---|---|---|---|---|---|---
MG | ResNet-18 | 86.7 | 79.8 | 80.0 | 91.5 | 91.3 | 90.9 |
ImageNet | ResNet-18 | 87.0 | 80.3 | 79.6 | 91.9 | 91.7 | 91.5 |
MoCo | ResNet-18 | 87.1 | 80.3 | 79.4 | 92.5 | 92.0 | 91.1 |
C2L | ResNet-18 | 88.2 | 81.1 | 81.4 | 93.0 | 92.9 | 92.6 |
MG | DenseNet-121 | 87.5 | 80.6 | 81.0 | 92.7 | 91.9 | 91.1 |
ImageNet | DenseNet-121 | 87.9 | 81.5 | 81.9 | 92.4 | 92.1 | 91.7 |
MoCo | DenseNet-121 | 87.4 | 81.5 | 80.8 | 92.0 | 91.4 | 92.0
C2L | DenseNet-121 | 89.3 | 83.3 | 83.0 | 93.6 | 92.7 | 93.8 |
CheXpert. Similar to ChestX-ray14, we fine-tune the pretrained models using the training set and report results in Table 4. ImageNet pretraining performs better than MG, while MoCo achieves comparable results. In contrast, C2L produces better pretrained models: with ResNet-18, C2L outperforms ImageNet pretraining by 1.2 points, and with DenseNet-121 the gap becomes 1.4 points.
Table 5: Mean average precision (%) on Kaggle Pneumonia Detection under different thresholds.
Method | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8
---|---|---|---|---|---|---|---
MG | 12.5 | 16.0 | 18.5 | 20.7 | 21.4 | 21.1 | 20.4 |
ImageNet | 13.5 | 17.1 | 19.9 | 21.4 | 22.2 | 22.1 | 21.1 |
MoCo | 13.0 | 17.4 | 19.9 | 21.3 | 22.4 | 21.9 | 21.2 |
C2L | 14.8 | 18.4 | 21.3 | 22.4 | 23.9 | 23.1 | 22.5 |
Kaggle Pneumonia Detection. We use ResNet-18 as the backbone of RetinaNet. As shown in Table 5, C2L significantly outperforms ImageNet pretraining at all thresholds, especially the larger ones. As for MoCo and MG, MoCo is marginally better than ImageNet pretraining, while MG performs slightly worse.
7 Conclusion
We proposed Comparing to Learn (C2L), a self-supervised pretraining method that learns medical image representations from unlabeled data. Our approach uses the relations between images as the supervision signal and thus requires no extra manual labeling.
8 Acknowledgment
This work was funded by the Key Area Research and Development Program of Guangdong Province, China (No. 2018B010111001), National Key Research and Development Project (2018YFC2000702) and Science and Technology Program of Shenzhen, China (No. ZDSYS201802021814180).
References
- [1] Caron, M., Bojanowski, P., Joulin, A., Douze, M.: Deep clustering for unsupervised learning of visual features. In: Proceedings of the European Conference on Computer Vision. pp. 132–149 (2018)
- [2] Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition. pp. 248–255. IEEE (2009)
- [3] DeVries, T., Taylor, G.W.: Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552 (2017)
- [4] He, K., Fan, H., Wu, Y., Xie, S., Girshick, R.: Momentum contrast for unsupervised visual representation learning. arXiv preprint arXiv:1911.05722 (2019)
- [5] Irvin, J., Rajpurkar, P., Ko, M., Yu, Y., Ciurea-Ilcus, S., Chute, C., Marklund, H., Haghgoo, B., Ball, R., Shpanskaya, K., et al.: CheXpert: A large chest radiograph dataset with uncertainty labels and expert comparison. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 33, pp. 590–597 (2019)
- [6] Johnson, A.E., Pollard, T.J., Berkowitz, S.J., Greenbaum, N.R., Lungren, M.P., Deng, C.y., Mark, R.G., Horng, S.: MIMIC-CXR: A large publicly available database of labeled chest radiographs. arXiv preprint arXiv:1901.07042 (2019)
- [7] Rajpurkar, P., Irvin, J., Bagul, A., Ding, D., Duan, T., Mehta, H., Yang, B., Zhu, K., Laird, D., Ball, R.L., et al.: MURA: Large dataset for abnormality detection in musculoskeletal radiographs. arXiv preprint arXiv:1712.06957 (2017)
- [8] Tarvainen, A., Valpola, H.: Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In: Advances in Neural Information Processing Systems. pp. 1195–1204 (2017)
- [9] Wang, X., He, K., Gupta, A.: Transitive invariance for self-supervised visual representation learning. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 1329–1338 (2017)
- [10] Wang, X., Peng, Y., Lu, L., Lu, Z., Bagheri, M., Summers, R.M.: ChestX-ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 2097–2106 (2017)
- [11] Zhang, H., Cisse, M., Dauphin, Y.N., Lopez-Paz, D.: Mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412 (2017)
- [12] Zhang, R., Isola, P., Efros, A.A.: Colorful image colorization. In: European Conference on Computer Vision. pp. 649–666. Springer (2016)
- [13] Zhou, Z., Sodha, V., Siddiquee, M.M.R., Feng, R., Tajbakhsh, N., Gotway, M.B., Liang, J.: Models genesis: Generic autodidactic models for 3d medical image analysis. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 384–393. Springer (2019)