MixCL: Pixel label matters to contrastive learning in medical image segmentation
Abstract
Contrastive learning and other self-supervised techniques have gained prevalence in computer vision over the past few years. They are particularly appealing for medical image analysis, which is notorious for its lack of annotations. Most existing self-supervised methods designed for natural imaging tasks focus on building proxy tasks for unlabeled data. For example, contrastive learning is often based on the fact that an image and its transformed version share the same identity. However, pixel annotations contain much valuable information for medical image segmentation, and this information is largely ignored in contrastive learning. In this work, we propose a novel pre-training framework called Mixed Contrastive Learning (MixCL) that leverages both image identities and pixel labels for better modeling by jointly maintaining identity consistency, label consistency, and reconstruction consistency. Consequently, the pre-trained model yields more robust representations that better characterize medical images. Extensive experiments demonstrate the effectiveness of the proposed method, improving the baseline by 5.28% and 14.12% in Dice coefficient when 5% labeled data of Spleen and 15% of BTCV are used in fine-tuning, respectively.
Keywords:
Contrastive learning · Self-supervision · Pre-training
1 Introduction
Self-Supervised Learning (SSL) obtains supervisory signals from the data itself without any human annotations. Substantial progress has been made in image classification [15, 6, 9, 25], object detection [15, 24], and semantic segmentation [6, 18, 30, 27, 26, 12]. Many works also demonstrate its superiority on various medical tasks [20, 11, 23, 22, 17, 14, 2]. Data is the cornerstone of deep learning; likewise, the remarkable progress of SSL comes from exploiting a large amount of unlabeled data.
One practical way of curating such ‘big data’ is to aggregate multiple datasets. From the perspective of data utilization, SSL leverages unlabeled data in a task-agnostic fashion, as the supervised labels are only used during fine-tuning. Even though these datasets may contain pixel-wise annotations, the annotations are largely ignored in SSL pre-training. In this paper, we aim to bring them back for improved SSL pre-training in the context of medical image segmentation, fully exploiting both the images and the annotations at hand.
SSL involves designing a framework that pre-trains a model $f_{\theta}$ from a pool of fully annotated datasets $\mathcal{D} = \{\mathcal{D}_1, \dots, \mathcal{D}_K\}$ such that, after adaptation, it achieves better performance on a target task $\mathcal{T}$. Mathematically, it can be depicted as:

$$\max_{\mathcal{L}} \ \mathcal{P}_{\mathcal{T}}\big(\theta^{*}\big), \quad \text{s.t.} \quad \theta^{*} = \arg\min_{\theta} \ \mathcal{L}(\theta; \mathcal{D}), \qquad (1)$$

where $\mathcal{P}_{\mathcal{T}}$ is the performance metric on the target task $\mathcal{T}$ after adaptation (fine-tuning) of the pre-trained parameters $\theta^{*}$, and $\mathcal{L}$ is the objective function for pre-training.
The annotations from different datasets provide diverse observations of images from varying aspects, potentially helping to build models with improved image understanding [28]. However, these annotations are usually leveraged in a task-specific way [29], which leaves the training process in a muddle when multiple annotations from $\mathcal{D}$ get mixed up; this is where the difficulty of Problem (1) lies. Namely, an appropriate design of $\mathcal{L}$ should keep gradient directions consistent when training on a hotchpotch of datasets.
Contrastive Learning (CL) has achieved a remarkable level of performance [3, 4, 9, 7, 25]. It aims to learn representations that are invariant under different data augmentations: the representations of paired similar samples stay close while dissimilar ones are pushed apart. The “similar-dissimilar” relation is based on image-level comparison in SimCLR [3], BYOL [7], and Barlow Twins [25], and has been extended to the pixel level for dense downstream tasks like segmentation [19, 21]. When viewing the medical images in $\mathcal{D}$ at the pixel level, similar and dissimilar pairs can be defined by the annotations, which are thus naturally introduced as a supervisory signal. This kind of supervisory signal is not hard and strict but soft and mild. On the other hand, images in $\mathcal{D}$ carry labels of different organs or lesions, all of which can be viewed as tissues and fluids of the human body from a fine-grained perspective. Pixel-level CL therefore has the potential to learn the compatibility among diverse labels.
Inspired by the above discussions, we propose a novel contrastive learning framework named Mixed Contrastive Learning (MixCL) as a solution to Problem (1). The proxy task of MixCL has three components: the reconstruction consistency and identity consistency terms are designed to fully utilize the images, while the label consistency term is introduced to exploit the labels as a supervisory signal and to ease the negative influence of annotation diversity during the pre-training phase. The proposed MixCL achieves remarkable improvements on the downstream segmentation tasks defined on the BTCV and MSD Spleen datasets.
Figure 1: Overview of the proposed MixCL framework.
2 Method
2.1 Overview of MixCL
We employ UNETR [8] as the backbone network, which is one of the most successful Vision Transformer (ViT) [5] based architectures in medical image analysis. As shown in Fig. 1, the encoder and reconstructor form a typical UNETR with the same settings as in [8]. The projector attached to the $l$-th layer of the encoder is composed of three convolution layers interspersed with normalization and activation layers.
A patch $x$ is first cropped from an original image $X$ with corresponding segmentation $Y$, and its two different views $x_1$ and $x_2$ are obtained by random augmentations $\mathcal{A}$. $x_1$ and $x_2$ are then fed into the UNETR encoder with parameters $\theta$, followed by a projection head that produces embeddings $z_1$ and $z_2$, respectively. A reconstruction decoder also follows the encoder to generate restored patches $\hat{x}_1$ and $\hat{x}_2$. Simultaneously, an auxiliary patch $x_a$ with its label $y_a$ is treated the same way as $x_1$, yielding $z_a$ and $\hat{x}_a$. Details about $x_a$ are introduced in Section 2.3.
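Below is a minimal PyTorch-style sketch of this forward pass, assuming a generic encoder that exposes its intermediate feature maps, one projector per layer, and a reconstruction decoder; the module and variable names are ours and only illustrate the data flow, not the authors' implementation.

```python
import torch.nn as nn

class MixCLForward(nn.Module):
    """Sketch of one MixCL forward pass (names and shapes are illustrative)."""

    def __init__(self, encoder, projectors, reconstructor):
        super().__init__()
        self.encoder = encoder              # UNETR encoder returning per-layer feature maps
        self.projectors = projectors        # nn.ModuleList: one small conv projector per layer
        self.reconstructor = reconstructor  # decoder that restores the input patch

    def embed(self, x):
        feats = self.encoder(x)                                    # list of intermediate outputs
        z = [proj(f) for proj, f in zip(self.projectors, feats)]   # per-layer projections
        x_hat = self.reconstructor(feats)                          # restored patch
        return z, x_hat

    def forward(self, x1, x2, xa):
        z1, x1_hat = self.embed(x1)   # view 1 of patch x
        z2, x2_hat = self.embed(x2)   # view 2 of patch x
        za, _ = self.embed(xa)        # auxiliary labeled patch from the same dataset
        return (z1, z2, za), (x1_hat, x2_hat)
```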
Note that $z$ is the set of projections over the intermediate outputs of all layers in the encoder, and $z^{2}$ denotes the projection of the 2nd layer. The identity consistency loss $\mathcal{L}_{id}$ is applied to all intermediate layers by calculating the cross-correlation matrix between $z_1$ and $z_2$; together with the reconstruction loss $\mathcal{L}_{rec}$ over $\hat{x}$ and $x$, it mines knowledge from the image itself. Besides, the label consistency loss $\mathcal{L}_{lab}$ is applied at the 2nd layer over $z_1^{2}$ and $z_a^{2}$; it is a supervised contrastive loss that utilizes the annotations appropriately. The proposed framework of MixCL is formulated as follows:
$$\mathcal{L} = \mathcal{L}_{rec} + \lambda_{1}\,\mathcal{L}_{id} + \lambda_{2}\,\mathcal{L}_{lab}, \qquad (2)$$
where $\lambda_{1}$ and $\lambda_{2}$ are scale factors, both set to 0.01 by default. The identity consistency loss $\mathcal{L}_{id}$ and the label consistency loss $\mathcal{L}_{lab}$ will be explained in detail in Section 2.2 and Section 2.3, respectively. The reconstruction consistency between the restored patches $\hat{x}_1$, $\hat{x}_2$ and the original patch $x$ is given as
$$\mathcal{L}_{rec} = \lVert \hat{x}_1 - x \rVert_2^{2} + \lVert \hat{x}_2 - x \rVert_2^{2}. \qquad (3)$$
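The sketch below shows how the three consistency terms can be combined as in Eq. (2), with both scale factors at their default of 0.01; the L2 form of the reconstruction term is an assumption of this sketch.

```python
import torch.nn.functional as F

def reconstruction_loss(x, x1_hat, x2_hat):
    # Eq. (3): reconstruction consistency between the restored views and the
    # original patch (an L2 / MSE form is assumed here).
    return F.mse_loss(x1_hat, x) + F.mse_loss(x2_hat, x)

def mixcl_objective(l_rec, l_id, l_lab, lam1=0.01, lam2=0.01):
    # Eq. (2): total pre-training objective; both scale factors default to 0.01.
    return l_rec + lam1 * l_id + lam2 * l_lab
```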
2.2 Identity Consistency
Identity consistency is employed to maximize the similarity between the representations $z_1$ and $z_2$ obtained from the two distorted versions of sample $x$. In vanilla Barlow Twins [25], $z_1$ and $z_2$ embed image-level information into vectors. Here, we transfer the idea from the instance level to the pixel level. The shape of $z^{l}$ is $N \times D$, which means that all $N$ pixels are embedded into a $D$-dimensional space. Then, $\mathcal{L}_{id}$ is defined as:
$$\mathcal{L}_{id} = \sum_{l} \Big[ \sum_{i} \big(1 - \mathcal{C}^{\,l}_{ii}\big)^{2} + \lambda \sum_{i} \sum_{j \neq i} \big(\mathcal{C}^{\,l}_{ij}\big)^{2} \Big], \qquad (4)$$
where $\lambda$ is a balance factor, and $\mathcal{C}^{\,l}$ is the cross-correlation matrix computed from $z_1^{\,l}$ and $z_2^{\,l}$ along the pixel dimension (dimension 0) at the $l$-th layer:
$$\mathcal{C}^{\,l}_{ij} = \frac{\sum_{n=1}^{N} z^{\,l}_{1,n,i}\, z^{\,l}_{2,n,j}}{\sqrt{\sum_{n=1}^{N} \big(z^{\,l}_{1,n,i}\big)^{2}}\, \sqrt{\sum_{n=1}^{N} \big(z^{\,l}_{2,n,j}\big)^{2}}}, \qquad (5)$$
where $N$ is the total number of pixels, and $i$ and $j$ index the coordinates of the $D$-dimensional embedding space.
Figure 2: (a) A schematic segmentation map; (b) the pixel-sampling weight map; (c) the cross-correlation matrix used by identity consistency; (d) the label relation matrix used by label consistency.
As shown in Fig. 2 (c), the first term of $\mathcal{L}_{id}$ encourages the diagonal elements of the cross-correlation matrix to be 1, making each unit of the embedded features invariant over all pixels, whereas the second term pushes the off-diagonal elements towards 0 so as to de-correlate the representation components. This can be viewed as a soft whitening constraint on the embedding space that reduces redundancy. $\mathcal{L}_{id}$ is capable of encoding rich image information into the model parameters while remaining simple and easy to implement.
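A minimal PyTorch sketch of this pixel-level identity consistency term is given below; the per-dimension standardization over pixels and the balance factor value are illustrative choices, not taken from the paper.

```python
import torch

def identity_consistency(z1_layers, z2_layers, lam=5e-3):
    """Pixel-level Barlow Twins loss summed over encoder layers (Eqs. 4-5).

    z1_layers, z2_layers: lists of (N, D) tensors, one per layer, where N is
    the number of pixels of that layer and D the embedding dimension.
    """
    loss = 0.0
    for z1, z2 in zip(z1_layers, z2_layers):
        # standardize each embedding dimension over all pixels
        z1 = (z1 - z1.mean(0)) / (z1.std(0) + 1e-6)
        z2 = (z2 - z2.mean(0)) / (z2.std(0) + 1e-6)
        n = z1.shape[0]
        c = z1.T @ z2 / n                                   # D x D cross-correlation matrix
        on_diag = (torch.diagonal(c) - 1).pow(2).sum()      # pull diagonal towards 1
        off_diag = c.pow(2).sum() - torch.diagonal(c).pow(2).sum()  # push off-diagonal to 0
        loss = loss + on_diag + lam * off_diag
    return loss
```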
2.3 Label Consistency
Indeed, all information derives from the images, including the manually created annotations. Things get complicated when training with all kinds of annotations at once: the gradient directions induced by various labels are not consistent, which destabilizes the training process. Consequently, the supervisory signals from different labels should be made consistent in the solution space, a property we denote as Label Consistency. This section introduces the label consistency loss to tackle this issue.
2.3.1 Auxiliary Sample
MixCL attempts to utilize the mixed annotations in a way that satisfies Label Consistency. An auxiliary patch $x_a$ is picked to help MixCL achieve this purpose. As shown in Fig. 1, $x_a$ is randomly cropped either from the same image $X$ or from another image of the same dataset, at a 1:1 ratio; notice that $x$ and $x_a$ therefore always come from the same dataset. As mentioned in Section 2.1, $x_a$ is embedded into $z_a$, which is contrasted against $z_1$ at the 2nd layer, taking the labels into account.
2.3.2 Objective Function
The fundamental requirement of $\mathcal{L}_{lab}$ is to guide the training process with the help of annotations. Fig. 2 (a) shows a schematic illustration of a segmentation from $\mathcal{D}$, with ‘liver’ in green, ‘liver tumor’ in blue, and ‘background’ in grey. Given the two feature maps $z_1^{2}$ and $z_a^{2}$ computed from the two different patches $x$ and $x_a$, pixels labeled ‘liver’ in $x$ and those with the same label in $x_a$ are regarded as positive pairs, while pixels with different labels in $x$ and $x_a$ are treated as negative pairs. This is label consistency at the image level.
Fig. 2 (d) shows the label relation matrix $M$, where positive pixel pairs are set to 1 (black area) and negative pixel pairs are set to 0 (white area). The shaded area denotes background pixel pairs: the background of one dataset may contain ‘kidney’, ‘pancreas’, or other structures that are labeled in the remaining datasets. Such background pixel pairs are not considered in MixCL, because the complex composition of background pixels would break the compatibility between different datasets. This is label consistency at the dataset level.
Then, the contrastive task can be formulated as:
$$\mathcal{L}_{lab} = \frac{1}{\lvert \mathcal{P}_{M} \rvert} \sum_{(p,\,q) \in \mathcal{P}_{M}} -\log \frac{\exp(z_p \cdot z_q / \tau)}{\exp(z_p \cdot z_q / \tau) + \sum_{(p,\,k) \in \mathcal{N}_{M}} \exp(z_p \cdot z_k / \tau)}, \qquad (6)$$
where $\tau$ is a temperature hyper-parameter, kept the same across all of our experiments, and $\mathcal{N}_{M}$ and $\mathcal{P}_{M}$ represent the pixel pairs assigned as negative and positive in the label relation matrix $M$, respectively.
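The following sketch implements this supervised pixel-level contrast on a set of sampled pixels; the temperature value and the convention that class 0 denotes background are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def label_consistency(z, za, labels, labels_a, tau=0.1, background=0):
    """Supervised pixel contrast in the spirit of Eq. (6).

    z, za:            (N, D) sampled pixel embeddings from patches x and x_a.
    labels, labels_a: (N,) integer labels of the sampled pixels.
    Same-label pairs are positives, different-label pairs are negatives, and
    background-background pairs are ignored (dataset-level label consistency).
    """
    z = F.normalize(z, dim=1)
    za = F.normalize(za, dim=1)
    sim = z @ za.T / tau                                   # (N, N) pairwise similarities

    same = labels[:, None] == labels_a[None, :]
    both_bg = (labels[:, None] == background) & (labels_a[None, :] == background)
    pos = same & ~both_bg                                  # positive pairs (relation matrix = 1)
    neg = ~same                                            # negative pairs (relation matrix = 0)

    exp_sim = sim.exp()
    neg_sum = (exp_sim * neg).sum(dim=1, keepdim=True)     # negatives per anchor pixel
    log_prob = sim - torch.log(exp_sim + neg_sum + 1e-12)  # log[ e^{s+} / (e^{s+} + sum_neg) ]
    return -(log_prob * pos).sum() / pos.sum().clamp(min=1)
```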
2.3.3 Pixel Sampling
However, computing $\mathcal{L}_{lab}$ over all pixel pairs is quadratic in the number of pixels and thus prohibitive in both time and memory. We adopt a trade-off strategy: an appropriate number of pixels is sampled from $x$ and $x_a$. Given a budget of $n$ pixels for each segmentation class, the sampling is performed according to a weight map $w$, defined as:
$$w(p) = \begin{cases} \alpha, & p \in \mathcal{B} \\ \exp\!\big(-\beta\, D(p)\big), & p \in \mathcal{I} \end{cases} \qquad (7)$$
where $\mathcal{B}$ is the boundary of each label, $\mathcal{I}$ is its interior, and $D(p)$ computes the distance of an interior pixel $p$ to its label boundary. $\alpha$ and $\beta$ are two hyper-parameters scaling the probabilities in the weight map; by default, $\alpha$ and $\beta$ are set to 8 and 0.05, respectively. As shown in Fig. 2 (b), interior pixels closer to the boundary are more discriminative and are therefore assigned higher values.
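A sketch of this boundary-aware sampling is given below, using a Euclidean distance transform to approximate $D(p)$; the exact weight-map formula, the per-class budget, and the library choice (NumPy/SciPy) are assumptions of this illustration, while $\alpha = 8$ and $\beta = 0.05$ follow the stated defaults.

```python
import numpy as np
from scipy import ndimage

def sample_pixels(label_map, budget=512, alpha=8.0, beta=0.05, rng=None):
    """Sample up to `budget` pixel indices per class from a weight map in the
    spirit of Eq. (7): boundary pixels get weight alpha, interior pixels decay
    with their distance to the label boundary."""
    rng = rng or np.random.default_rng()
    sampled = {}
    for cls in np.unique(label_map):
        mask = label_map == cls
        interior = ndimage.binary_erosion(mask)
        boundary = mask & ~interior                       # boundary pixels of this label
        dist = ndimage.distance_transform_edt(interior)   # interior distance to boundary
        weight = np.where(boundary, alpha, np.exp(-beta * dist)) * mask
        idx = np.flatnonzero(weight)
        prob = weight.ravel()[idx]
        prob = prob / prob.sum()
        k = min(budget, idx.size)
        sampled[int(cls)] = rng.choice(idx, size=k, replace=False, p=prob)
    return sampled
```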
With $\mathcal{L}_{lab}$, the pixel-wise annotations of all datasets are exploited in the same numerical form across datasets, and the complex background, the main destabilizing factor, is duly handled. Both image-level and dataset-level label consistency ensure a clear gradient descent direction. By applying MixCL, downstream tasks gain sizable performance boosts.
3 Experiments
3.1 Implementation Details
Datasets Up to 765 CT scans with annotations are used for pre-training, which come from (1) the Medical Segmentation Decathlon (MSD) dataset [1] (only Liver, Lung, and Pancreas are used for pre-training), (2) NIH Pancreas-CT [16], and (3) the 2019 Kidney Tumor Segmentation Challenge (KiTS) [10]. The pre-trained model is then fine-tuned on two datasets: Beyond the Cranial Vault (BTCV, https://www.synapse.org/#!Synapse:syn3193805/wiki/89480) and Spleen segmentation in MSD. BTCV provides 30 CT scans with 13 abdominal organs annotated, while Spleen segmentation in MSD has 41 annotated CT volumes.
Augmentation Each CT scan is resampled to a common voxel spacing, with intensities scaled to a fixed range. For the pre-training task, random elastic deformation and random affine transforms are applied to the 3D patches, each with a probability of 0.5. Then, patches are distorted into views via random intensity range shifting, random Gamma transformation, and random Gaussian smoothing and noising, all with a probability of 0.2. Augmentations such as random rotation by 90, 180, and 270 degrees, random flips along the axial, sagittal, and coronal views, and random scaling and shifting are used in the fine-tuning tasks, each with a probability of 0.2.
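As a concrete, hedged example, the two pipelines could be assembled with MONAI dictionary transforms as below; the paper does not state its implementation library, and the parameter values here are placeholders rather than the paper's exact settings.

```python
from monai import transforms as T

# Pre-training augmentations: spatial distortions (p=0.5) followed by intensity
# distortions (p=0.2) that produce the two views of a patch.
pretrain_aug = T.Compose([
    T.RandAffined(keys=["image", "label"], prob=0.5,
                  rotate_range=(0.26, 0.26, 0.26), scale_range=(0.1, 0.1, 0.1)),
    T.Rand3DElasticd(keys=["image", "label"], sigma_range=(5, 8),
                     magnitude_range=(100, 200), prob=0.5),
    T.RandShiftIntensityd(keys="image", offsets=0.1, prob=0.2),   # intensity range shift
    T.RandAdjustContrastd(keys="image", prob=0.2),                # random gamma
    T.RandGaussianSmoothd(keys="image", prob=0.2),
    T.RandGaussianNoised(keys="image", prob=0.2),
])

# Fine-tuning augmentations, each applied with probability 0.2.
finetune_aug = T.Compose([
    T.RandRotate90d(keys=["image", "label"], prob=0.2, max_k=3),
    T.RandFlipd(keys=["image", "label"], prob=0.2, spatial_axis=[0, 1, 2]),
    T.RandScaleIntensityd(keys="image", factors=0.1, prob=0.2),
    T.RandShiftIntensityd(keys="image", offsets=0.1, prob=0.2),
])
```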
Optimization We use the AdamW optimizer [13] with a cosine learning rate scheduler, decaying the initial learning rate over 1000 iterations in pre-training and 2000 iterations in fine-tuning. Pre-training experiments use a batch size of 1 per GPU across 6 RTX Titan GPUs, and fine-tuning uses a batch size of 2.
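For completeness, a sketch of this optimization setup is given below; the initial learning rate and weight decay are placeholder values, since the exact settings were not recoverable from the text.

```python
import torch
from torch.optim.lr_scheduler import CosineAnnealingLR

def build_optimizer(model, total_iters=1000, lr=1e-4, weight_decay=1e-5):
    """AdamW with a cosine learning-rate schedule (lr and weight_decay are
    illustrative placeholders, not the paper's settings)."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)
    scheduler = CosineAnnealingLR(optimizer, T_max=total_iters)
    return optimizer, scheduler
```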
Table 1: Dice scores on the Spleen and BTCV datasets when fine-tuning with different percentages (Pct.) of labeled data. T.F.S.: trained from scratch; Ours: fine-tuned from MixCL pre-training; Diff.: improvement.

| Pct. | Spleen T.F.S. | Spleen Ours | Spleen Diff. | BTCV T.F.S. | BTCV Ours | BTCV Diff. |
|---|---|---|---|---|---|---|
| 5% | 0.5065 | 0.5593 | +0.0528 | – | – | – |
| 15% | 0.8247 | 0.8430 | +0.0183 | 0.2459 | 0.3871 | +0.1412 |
| 50% | 0.9424 | 0.9453 | +0.0019 | 0.5851 | 0.5939 | +0.0085 |
| 100% | 0.9597 | 0.9612 | +0.0015 | 0.7524 | 0.7656 | +0.0132 |
3.2 Results
Quantitative performance. In the fine-tuning process, the encoder of UNETR [8] is initialized with the pre-trained weights. The effectiveness of MixCL is assessed via 5-fold cross-validation on the fine-tuning tasks. As mentioned in Section 3.1, we fine-tune the pre-trained model on MSD Spleen and BTCV, and compare the Dice metric with models trained from scratch under the same training settings. In addition, comparisons on partially labeled data are also performed, giving an assessment from varying perspectives. The experiments and results are shown in Table 1.
Our method achieves a consistent performance lift on both datasets, and the lift is more pronounced when less labeled data is used in fine-tuning. For example, the Dice coefficient is improved by 5.28% on the Spleen dataset (5% data) and by 14.12% on the BTCV dataset (15% data), whereas the improvement is 0.15% on the Spleen dataset (100% data) and 1.32% on the BTCV dataset (100% data). Another interesting observation is that the pre-trained model seems to bring more improvement to a complicated task, namely the segmentation of 13 organs, than to the simpler task of spleen segmentation. These improvements demonstrate that MixCL successfully exploits mismatched labels all at once in contrastive learning and obtains informative, discriminative representations.
Table 2: Ablation study of the three consistency terms (Identity, Label, Recon) of MixCL.

| Dataset | Pct. | Identity | Label | Recon | Dice |
|---|---|---|---|---|---|
| Spleen | 30% |  |  |  | 0.8975 |
|  |  | ✓ | ✓ |  | 0.9150 |
|  |  | ✓ | ✓ | ✓ | 0.9178 |
| BTCV | 100% |  |  |  | 0.7524 |
|  |  | ✓ | ✓ |  | 0.7618 |
|  |  | ✓ | ✓ | ✓ | 0.7656 |
Figure 3: Qualitative segmentation results on BTCV.
3.2.1 Ablation study.
To further analyze our method, an ablation study is conducted to characterize the role of each module. The results are shown in Table 2, including experiments on 30% labeled data of Spleen and 100% labeled data of BTCV. The pre-trained model brings notable improvements via contrastive learning, i.e., the identity and label consistency constraints. Additionally, qualitative visualizations of the BTCV experiments are presented in Fig. 3; better details on the stomach, kidney, pancreas, and spleen are observed, demonstrating the improvement in abdominal organ segmentation. The pre-trained weights learn representations useful for downstream segmentation tasks, and incorporating reconstruction further boosts the metric on the target tasks. Besides, the performance lift on the fine-tuning tasks demonstrates that aggregating labels from multiple datasets does not confuse the learning of a final model, proving that MixCL is capable of handling the conflicts among diverse annotations and meets its intended purpose.
4 Conclusion and Future work
Again, data is the cornerstone of deep learning: unlabeled data matters, and data with different annotations matters as well. Efforts to guide models with knowledge from various areas, rather than a single one, might be a step toward stronger artificial intelligence. We make an attempt in this direction and propose Mixed Contrastive Learning, which pre-trains with various labels together. The experiments in this paper preliminarily validate its performance on medical segmentation datasets. In the future, we will continue exploring better solutions to Problem (1) by mining more information from these pooled datasets.
References
- [1] Antonelli, M., Reinke, A., Bakas, S., Farahani, K., Landman, B.A., Litjens, G., Menze, B., Ronneberger, O., Summers, R.M., van Ginneken, B., et al.: The medical segmentation decathlon. arXiv preprint arXiv:2106.05735 (2021)
- [2] Bhalodia, R., Kavan, L., Whitaker, R.T.: Self-supervised discovery of anatomical shape landmarks. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 627–638. Springer (2020)
- [3] Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International conference on machine learning. pp. 1597–1607. PMLR (2020)
- [4] Chen, T., Kornblith, S., Swersky, K., Norouzi, M., Hinton, G.E.: Big self-supervised models are strong semi-supervised learners. Advances in neural information processing systems 33, 22243–22255 (2020)
- [5] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
- [6] Gidaris, S., Singh, P., Komodakis, N.: Unsupervised representation learning by predicting image rotations. arXiv preprint arXiv:1803.07728 (2018)
- [7] Grill, J.B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent-a new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, 21271–21284 (2020)
- [8] Hatamizadeh, A., Tang, Y., Nath, V., Yang, D., Myronenko, A., Landman, B., Roth, H.R., Xu, D.: Unetr: Transformers for 3d medical image segmentation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 574–584 (2022)
- [9] He, K., Fan, H., Wu, Y., Xie, S., Girshick, R.: Momentum contrast for unsupervised visual representation learning. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 9729–9738 (2020)
- [10] Heller, N., Isensee, F., Maier-Hein, K.H., Hou, X., Xie, C., Li, F., Nan, Y., Mu, G., Lin, Z., Han, M., et al.: The state of the art in kidney and kidney tumor segmentation in contrast-enhanced ct imaging: Results of the kits19 challenge. arXiv preprint arXiv:1912.01054 (2019)
- [11] Hu, C., Li, C., Wang, H., Liu, Q., Zheng, H., Wang, S.: Self-supervised learning for mri reconstruction with a parallel network training framework. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 382–391. Springer (2021)
- [12] Liu, F., Jonmohamadi, Y., Maicas, G., Pandey, A.K., Carneiro, G.: Self-supervised depth estimation to regularise semantic segmentation in knee arthroscopy. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 594–603. Springer (2020)
- [13] Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
- [14] Lu, Q., Li, Y., Ye, C.: White matter tract segmentation with self-supervised learning. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 270–279. Springer (2020)
- [15] Pathak, D., Krahenbuhl, P., Donahue, J., Darrell, T., Efros, A.A.: Context encoders: Feature learning by inpainting. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 2536–2544 (2016)
- [16] Roth, H.R., Farag, A., Turkbey, E., Lu, L., Liu, J., Summers, R.M.: Data from pancreas-ct. the cancer imaging archive. IEEE Transactions on Image Processing (2016)
- [17] Sahasrabudhe, M., Christodoulidis, S., Salgado, R., Michiels, S., Loi, S., André, F., Paragios, N., Vakalopoulou, M.: Self-supervised nuclei segmentation in histopathological images using attention. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 393–402. Springer (2020)
- [18] Shimoda, W., Yanai, K.: Self-supervised difference detection for weakly-supervised semantic segmentation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 5208–5217 (2019)
- [19] Wang, W., Zhou, T., Yu, F., Dai, J., Konukoglu, E., Van Gool, L.: Exploring cross-image pixel contrast for semantic segmentation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 7303–7313 (2021)
- [20] Windsor, R., Jamaludin, A., Kadir, T., Zisserman, A.: Self-supervised multi-modal alignment for whole body medical imaging. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 90–101. Springer (2021)
- [21] Xie, Z., Lin, Y., Zhang, Z., Cao, Y., Lin, S., Hu, H.: Propagate yourself: Exploring pixel-level consistency for unsupervised visual representation learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 16684–16693 (2021)
- [22] Xu, J., Adalsteinsson, E.: Deformed2self: Self-supervised denoising for dynamic medical imaging. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 25–35. Springer (2021)
- [23] Yang, P., Hong, Z., Yin, X., Zhu, C., Jiang, R.: Self-supervised visual representation learning for histopathological images. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 47–57. Springer (2021)
- [24] Yao, Q., Quan, Q., Xiao, L., Kevin Zhou, S.: One-shot medical landmark detection. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 177–188. Springer (2021)
- [25] Zbontar, J., Jing, L., Misra, I., LeCun, Y., Deny, S.: Barlow twins: Self-supervised learning via redundancy reduction. In: International Conference on Machine Learning. pp. 12310–12320. PMLR (2021)
- [26] Zhang, R., Liu, S., Yu, Y., Li, G.: Self-supervised correction learning for semi-supervised biomedical image segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 134–144. Springer (2021)
- [27] Zheng, H., Han, J., Wang, H., Yang, L., Zhao, Z., Wang, C., Chen, D.Z.: Hierarchical self-supervised learning for medical image segmentation based on multi-domain data aggregation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 622–632. Springer (2021)
- [28] Zhou, S.K., Greenspan, H., Davatzikos, C., Duncan, J.S., Van Ginneken, B., Madabhushi, A., Prince, J.L., Rueckert, D., Summers, R.M.: A review of deep learning in medical imaging: Imaging traits, technology trends, case studies with progress highlights, and future promises. Proceedings of the IEEE (2021)
- [29] Zhou, S.K., Rueckert, D., Fichtinger, G.: Handbook of medical image computing and computer assisted intervention. Academic Press (2019)
- [30] Zhu, J., Li, Y., Hu, Y., Ma, K., Zhou, S.K., Zheng, Y.: Rubik’s cube+: A self-supervised feature learning framework for 3d medical image analysis. Medical image analysis 64, 101746 (2020)