
MedMAE: A Self-Supervised Backbone for Medical Imaging Tasks

Abstract

Medical imaging tasks are very challenging due to the lack of publicly available labeled datasets. Hence, it is difficult to achieve high performance with existing deep-learning models, as they require massive labeled datasets to be trained effectively. An alternative solution is to fine-tune pre-trained models on the medical imaging dataset at hand. However, existing models are pre-trained on natural images, a domain very different from medical imaging, which leads to poor performance due to domain shift. To overcome these problems, we propose a large-scale unlabeled dataset of medical images and a backbone pre-trained on the proposed dataset with the self-supervised learning technique known as the masked autoencoder. This backbone can be used as a pre-trained model for any medical imaging task, as it is trained to learn visual representations of different types of medical images. To evaluate the proposed backbone, we use four different medical imaging tasks and compare the results with existing pre-trained models. These experiments show the superiority of our proposed backbone on medical imaging tasks.

Index Terms—  Masked autoencoder, Medical Imaging, transformer, deep learning

1 Introduction

Medical imaging tasks are very challenging due to the lack of publicly available labeled datasets. Hence, it is difficult to achieve high performance with existing deep-learning models, as they require massive labeled datasets to be trained effectively. An alternative solution is to fine-tune pre-trained models on the medical imaging dataset at hand. However, existing models are pre-trained on natural images, a domain very different from medical imaging, which leads to poor performance due to domain shift.

In this paper, we propose a large-scale unlabeled dataset of medical images from different sources and a ViT-based [1] backbone for any medical imaging task. The unlabeled dataset consists of images captured by different modalities such as MRI, CT, and X-ray, covering different body parts such as the chest, lungs, pancreas, abdomen, brain, and pelvis. The main goal of the collected dataset is to be extensive and diverse, allowing the proposed model to gain useful knowledge of different types of medical images; this knowledge can then be used to achieve high accuracy in any specific medical imaging task. The proposed model is a ViT trained with the masked autoencoder (MAE) [2] technique. MAE is a self-supervised learning (SSL) technique that leverages unlabeled data, allowing the model to discover intrinsic features in medical images. The ultimate objective of this paper is to propose a deep-learning model that learns versatile representations of medical images that can be effectively transferred to any medical imaging task, even when labels are scarce. The contributions of this paper are summarized as follows:

  • Proposing a large-scale unlabeled medical imaging dataset that can be used for self-supervised and unsupervised learning techniques.

  • Proposing the Medical Masked Autoencoder (MedMAE), a pre-trained backbone that can be used for any medical imaging task.

The rest of the paper is organized as follows: Section 2 provides a literature review of medical imaging models. Section 3 presents the details of the proposed work. Section 4 reports the results of the conducted experiments. Finally, Section 5 concludes the paper and discusses future directions.

2 Related work

SSL has emerged as a popular learning methodology in medical image analysis, particularly beneficial in scenarios where annotated data is scarce and unlabeled data is abundant [3]. Several researchers have demonstrated the effectiveness of SSL across various medical image analysis tasks such as detection and classification [4], [5], [6], detection and localization [7], [8], [9], and segmentation [10], [11], [12]. Huang et al. provide an extensive overview of deep-learning approaches employing SSL for medical image classification in [3]. Our work primarily concentrates on self-prediction methods, which aligns with our strategy to create an SSL-based backbone tailored for medical imaging applications. Self-prediction involves augmenting or masking certain segments of an image and then attempting to reconstruct the original image from the remaining unmasked parts. Several studies demonstrate the use of self-prediction methods for medical image restoration tasks. Jung et al. introduced a masked auto-encoder technique for functional connectivity matrix restoration in rs-fMRI data; the method masks different rows and columns of a matrix and then reconstructs the original matrix, where the functional connectivity matrix represents the relationship between different regions of interest for each subject [13]. Jana et al. proposed an encoder-decoder architecture tailored for restoring CT scans that have been corrupted by swapping several small patches within a single CT slice [14]. Liu et al. developed a U-net model for restoring ultrasound images that have been augmented through local-pixel shuffling; clinical variables such as age, gender, and tumor size are then concatenated with the encoder's outputs for downstream prediction tasks [15]. At present, to the best of our knowledge, there is no SSL-based backbone dedicated to medical image understanding tasks. Recognizing the problems posed by the need for extensively annotated datasets, SSL emerges as a promising alternative capable of mitigating these challenges. Unlike supervised learning methods, which typically require large quantities of labeled data, SSL has the potential to produce versatile, generalist models that can subsequently be fine-tuned for a variety of downstream tasks, even in the absence of large labeled datasets.

Collection | Location | Subjects | Data Types
NIH Chest X-Ray [16] | Frontal Chest | 30805 | X-Ray
Pseudo-PHI-DICOM-Data [17] | Various | 21 | CR, CT, DX, MG, MR, PT
COVID-19-NY-SBU [18] | Lung | 1384 | CR, CT, DX, MR, PT, NM, OT, SR
CTPred-Sunitinib-panNET [19] | Pancreas | 38 | CT
StageII-Colorectal-CT [20] | Abdomen, Pelvis | 230 | CT
CT Images in COVID-19 [21] | Lung | 661 | CT
MIDRC-RICORD-1a & 1b [22] | Lung | 227 | CT
CT-ORG [23] | Bladder, Brain, Kidney, Liver | 140 | CT
Pelvic-Reference-Data [24] | Pelvis, Prostate, Anus | 58 | CT
QIN Lung CT [25] | Lung | 47 | CT
Pancreas CT [26] | Pancreas | 82 | CT
LungCT-Diagnosis [27] | Lung | 61 | CT
NSCLC-Radiomics-Genomics [28] | Lung | 89 | CT
RIDER Lung CT [29] | Chest | 32 | CT, CR, DX
CT Colon ACRIN 6664 [30] | Colon | 825 | CT
LIDC-IDRI [31] | Chest | 1010 | CT, CR, DX
TCGA-BLCA [32] | Bladder | 120 | CT, CR, MR, PT, DX, Pathology
TCGA-UCEC [33] | Uterus | 65 | CT, CR, MR, PT, Pathology
COVID-19-AR [34] | Lung | 105 | CT, DX, CR
CMB-LCA [35] | Lung | 10 | CT, DX, MR, NM, US
CMB-CRC [36] | Colon | 12 | CT, DX, MR, PT, US
TCGA-KIRC [37] | Kidney | 267 | CT, MR, CR, Pathology
CMB-PCA [38] | Prostate | 3 | CT, MR, NM
CPTAC-CCRCC [39] | Kidney | 222 | CT, MR, Pathology
TCGA-SARC [40] | Chest, Abdomen, Pelvis, Leg | 5 | CT, MR, Pathology
TCGA-READ [41] | Rectum | 3 | CT, MR, Pathology
TCGA-OV [42] | Ovary | 143 | CT, MR, Pathology
Table 1: A detailed overview of the various datasets collected to form MID, the Medical Imaging Dataset.

3 Methodology

We propose a large-scale unlabeled dataset of medical images and a backbone pre-trained on the proposed dataset with the self-supervised learning technique known as the masked autoencoder (MAE). This backbone can be used as a pre-trained model for any medical imaging task, as it is trained to learn visual representations of different types of medical images.

3.1 Medical Imaging Dataset

Our primary goal is to pre-train MedMAE-B (the ViT-B variant of MedMAE) on a large and diverse dataset of medical images. This pre-training aims to achieve two key objectives: 1) Capture latent representations: by processing a vast array of medical images, MedMAE-B learns underlying characteristics specific to this domain, yielding a model with a deeper grasp of medical image content. 2) Boost generalizability and versatility: the pre-training process exposes the model to diverse medical scenarios, which in turn enhances its ability to adapt and perform well on various downstream tasks related to medical image analysis.

To achieve effective pre-training, we constructed a massive dataset encompassing various body parts and image types. This dataset integrates images from multiple repositories and platforms (details in Table 1). We meticulously documented all data sources to ensure transparency. We call the collected dataset MID (Medical Imaging Dataset). Each image within the compiled dataset underwent specific transformations before being fed into the MAE architecture: 1) converting all files to a compatible image format, 2) resizing images to meet the model's input specifications, and 3) removing corrupted or erroneous files. We opted against using data augmentation to artificially inflate the dataset size; instead, we relied on the inherent diversity within the vast collection of datasets.
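As an illustration of this preprocessing, the following is a minimal sketch assuming pydicom and Pillow as tooling; the function name and paths are hypothetical, and the exact windowing and intensity scaling used to build MID are not specified in the paper.

```python
import numpy as np
import pydicom
from PIL import Image

def dicom_to_image(dicom_path, out_path, size=(224, 224)):
    """Convert one DICOM slice to a resized grayscale PNG; return False if the file is unreadable."""
    try:
        ds = pydicom.dcmread(dicom_path)
        pixels = ds.pixel_array.astype(np.float32)
        # Simple min-max scaling to 8-bit grayscale (an assumption; MID's exact scaling is not given).
        pixels -= pixels.min()
        if pixels.max() > 0:
            pixels /= pixels.max()
        img = Image.fromarray((pixels * 255).astype(np.uint8)).convert("L")
        img.resize(size).save(out_path)
        return True
    except Exception:
        # Corrupted or non-image files are dropped, mirroring step 3 above.
        return False
```

In practice, each collection in Table 1 would be walked with a loop over its DICOM files, calling a helper like this and discarding the failures.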

The current dataset comprises over 2 million images across various medical imaging modalities (CT scans, X-rays, MRI, etc.). This extensive and ever-growing dataset equips MedMAE-B with a robust and versatile understanding of medical imagery.

Fig. 1: MedMAE architecture: The process is initiated by randomly masking 75% of the original image and inputting the remaining 25% of visible patches into the encoder, which captures the latent representations and encodes the patches. Subsequently, the aim of the decoder is to reconstruct the complete image using the encoded and masked patches. The reconstruction loss helps to improve the reconstruction with each iteration.

3.2 MedMAE Architecture in pre-training

This section delves into the core building blocks of Medical Masked Autoencoders (MedMAE): the encoder, decoder, and loss function.

Encoder The MedMAE encoder leverages a Vision Transformer (ViT) architecture. The input image/volume is first divided into non-overlapping patches. These patches are then randomly assigned into two groups: visible and masked patches. The encoder operates solely on the visible patches, aiming to learn a meaningful representation based on this partial information. To compensate for the lack of complete spatial information, each patch is augmented with a corresponding positional embedding before being fed into the ViT. This positional encoding helps the encoder maintain an understanding of the relative location of each patch within the larger image/volume. Crucially, since the encoder’s output is used to reconstruct the masked regions, it is incentivized to extract a comprehensive representation from these incomplete observations.
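As a concrete illustration of the random masking step, here is a minimal PyTorch sketch, assuming the image has already been split into patch tokens of shape (batch, num_patches, dim); the shuffle-based selection follows the original MAE recipe [2] and is not code released with this paper.

```python
import torch

def random_masking(patch_tokens, mask_ratio=0.75):
    """Keep a random subset of patch tokens; return kept tokens, a binary mask, and the un-shuffle index."""
    batch, num_patches, dim = patch_tokens.shape
    num_keep = int(num_patches * (1 - mask_ratio))

    noise = torch.rand(batch, num_patches, device=patch_tokens.device)
    ids_shuffle = torch.argsort(noise, dim=1)          # random permutation per sample
    ids_restore = torch.argsort(ids_shuffle, dim=1)    # inverse permutation

    ids_keep = ids_shuffle[:, :num_keep]
    visible = torch.gather(patch_tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, dim))

    # mask: 0 = visible, 1 = masked, expressed in the original patch order
    mask = torch.ones(batch, num_patches, device=patch_tokens.device)
    mask[:, :num_keep] = 0
    mask = torch.gather(mask, 1, ids_restore)
    return visible, mask, ids_restore
```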

Decoder The goal of the decoder is to fill the gaps (i.e., predict the masked patches). The MedMAE decoder receives two sets of tokens as input: 1) Patch-wise representations: These are the outputs generated by the encoder for the visible patches. 2) Mask tokens: These are learnable tokens representing the masked regions. They are inserted into the decoder at the corresponding positions where patches are masked in the original input. Similar to the encoder, positional embeddings are added to all input tokens in the decoder. This enables the decoder to reconstruct the missing information at each masked position. It’s important to note that the decoder serves as an auxiliary module used only during the pre-training stage. It is not employed in downstream tasks where the pre-trained encoder plays a central role.
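A sketch of how the decoder input could be assembled under simplifying assumptions (no class token, and `ids_restore` coming from the masking sketch above); `mask_token` is assumed to be a learnable parameter of shape (1, 1, dim).

```python
import torch

def assemble_decoder_input(encoded_visible, ids_restore, mask_token):
    """Insert learnable mask tokens at the masked positions and restore the original patch order."""
    batch, num_visible, dim = encoded_visible.shape
    num_masked = ids_restore.shape[1] - num_visible

    mask_tokens = mask_token.expand(batch, num_masked, dim)     # one mask token per masked patch
    tokens = torch.cat([encoded_visible, mask_tokens], dim=1)   # visible tokens first, mask tokens after
    # Un-shuffle back to the original patch order using ids_restore from the masking step.
    tokens = torch.gather(tokens, 1, ids_restore.unsqueeze(-1).expand(-1, -1, dim))
    return tokens  # positional embeddings are added before the decoder blocks
```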

Loss function MedMAE leverages a reconstruction loss function, specifically the mean squared error (MSE). However, unlike traditional autoencoders that aim to reconstruct the entire input, MedMAE only focuses on predicting the pixel values of the masked patches. This approach has been shown to yield superior results. In practice, for better training stability, the normalized pixel values within each masked patch are used as reconstruction targets instead of the raw values.
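A minimal sketch of this loss, assuming `pred` and `target` are per-patch pixel values of shape (batch, num_patches, patch_dim) and `mask` marks masked patches with 1, as in the masking sketch above.

```python
import torch

def masked_reconstruction_loss(pred, target, mask, eps=1e-6):
    """MSE between predicted and per-patch-normalized target pixels, averaged over masked patches only."""
    mean = target.mean(dim=-1, keepdim=True)
    var = target.var(dim=-1, keepdim=True)
    target = (target - mean) / (var + eps).sqrt()   # normalize pixel values within each patch

    loss = (pred - target) ** 2
    loss = loss.mean(dim=-1)                        # mean over pixels within each patch
    return (loss * mask).sum() / mask.sum()         # average over masked patches only
```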

3.3 MedMAE Architecture in downstream tasks

After MedMAE is pre-trained, for each downstream task we replace the MedMAE decoder with a task-specific head. For the classification tasks, the task-specific head is simply a fully connected layer with the number of neurons equal to the number of classes, followed by a softmax activation function. For the segmentation task, four transposed convolutional layers are added to scale up the dimensions of the embeddings of the MedMAE encoder, followed by a convolutional layer with a single filter to produce the output mask; this layer is followed by a sigmoid activation function. During the training of the downstream tasks, only the parameters of the added head are updated, a well-known process called linear probing.
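For the classification tasks, linear probing can be sketched as follows; the mean pooling over patch tokens and the encoder output shape are assumptions, since the paper does not specify how the embedding is pooled before the fully connected layer.

```python
import torch
import torch.nn as nn

class LinearProbe(nn.Module):
    """Frozen MedMAE encoder followed by a single trainable fully connected layer."""
    def __init__(self, encoder, feature_dim, num_classes):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():   # only the head is updated (linear probing)
            p.requires_grad = False
        self.head = nn.Linear(feature_dim, num_classes)

    def forward(self, x):
        with torch.no_grad():
            tokens = self.encoder(x)          # assumed shape: (batch, num_patches, feature_dim)
        features = tokens.mean(dim=1)         # mean-pool patch tokens (assumption)
        return self.head(features)            # logits; softmax is applied in the loss/inference step
```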

4 Experiments and Results

4.1 Implementation details

In this section, we delve into the outcomes of our MedMAE-B model, which underwent an extensive pre-training process spanning 1000 epochs. Each epoch took approximately 1 hour and 20 minutes to train; therefore, the total training time exceeded 1000 hours. As previously stated, our evaluation is based exclusively on the MAE ViT-B model. We configured our model with a batch size of 64, and the dimensions of the input images were set to 224×224 pixels. The transformation and normalization of the entire image dataset were handled through the transform function. It is crucial to highlight that our study solely focuses on grayscale medical images without incorporating any colored image data. For the learning parameters, we established a base learning rate of 10^{-3}.
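For reference, a minimal configuration sketch collecting the hyperparameters stated above; values the paper does not give (optimizer, learning-rate schedule, weight decay) are omitted rather than guessed.

```python
# Pre-training hyperparameters as stated in the text; anything not stated is intentionally left out.
pretraining_config = {
    "backbone": "MAE ViT-B",
    "epochs": 1000,
    "batch_size": 64,
    "image_size": (224, 224),   # grayscale medical images
    "base_learning_rate": 1e-3,
    "mask_ratio": 0.75,         # from the MedMAE architecture description (Fig. 1)
}
```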

4.2 MedMAE evaluation

To evaluate MedMAE, we conducted four experiments: two with private datasets and two with publicly available datasets. The first experiment is automating quality control for CT and MRI scanners. To ensure the reliability of radiology scanners, quality assurance protocols are routinely implemented, typically on a daily or weekly basis [43]. These procedures are time-consuming and necessitate taking the scanner offline during the assessment process. The goal is to automate this process using deep learning to detect whether an image was captured by a well-calibrated or a mis-calibrated scanner. In this experiment, we have a dataset of CT and MRI scans collected from two hospitals. Each image is labeled either pass, meaning it was captured by a well-calibrated scanner, or fail, meaning it was captured by a mis-calibrated scanner. In the second experiment, we have breast images, and the goal is to detect whether a disease is present. For the third experiment, we used the publicly available CVC-ClinicDB dataset for a segmentation task. Finally, the last experiment is pneumonia detection using the publicly available ChestX-ray14 dataset. We compared the results of our proposed model with existing models pre-trained on the ImageNet (IN1k) dataset [44]. We chose five different models that have approximately the same number of parameters: ResNet [45], EfficientNetv2-S [46], ConvNext-B [47], ViT-B [1], and Swin-B [48]. All of these models are pre-trained with supervised learning on ImageNet. Additionally, we compared our results with the original MAE [2]. Evaluation is done by linear probing all models for 300 epochs on the training set, and the performance on the testing set is reported for each experiment. Linear probing fine-tunes only the last fully connected layer of the model while keeping the rest of the model parameters (i.e., the pre-trained parameters) unchanged.

4.2.1 Task 1: Automating quality control for CT and MRI scanners

The precision and promptness of patient diagnoses conducted on radiology scanners are inherently linked to the quality of the images generated during the scanning process. When the quality of these images is compromised, it can have dire consequences. An image obtained from a mis-calibrated scanner can necessitate a costly and time-consuming repetition of the patient scan. In more critical scenarios, if such issues go undetected, they can give rise to missed diagnoses, posing substantial risks to patient health and, in certain instances, even resulting in tragic outcomes, including loss of life [49], [50], [51], [52]. Providing real-time QC of scanner calibration to radiologists has several benefits: 1) improved patient throughput and reduced wait times, primarily by minimizing scanner downtime, and 2) enhanced diagnostic accuracy by swiftly identifying and addressing any image quality issues.

CT Scan Dataset Specialized datasets were built to meet the unique requirements of this experiment. We procured patient studies, retrieved in the form of DICOM files. A meticulous de-identification process was applied to each scan, ensuring the absence of any identifiable patient data. Each DICOM file was then converted to an image format. The images represent different body parts captured by a CT scanner, and each image is labeled either pass or fail; hence, the task is binary image classification. The collected data comprises CT scan images of different body parts from more than 100 unique patients, totaling more than 30,000 images.

Pre-training | Dataset | Backbone | Accuracy
Supervised | IN1k | ResNet | 75.6
Supervised | IN1k | EfficientNet-S | 71.3
Supervised | IN1k | ConvNext-B | 76.8
Supervised | IN1k | ViT-B | 78.2
Supervised | IN1k | Swin-B | 77.5
Self-supervised | IN1k | MAE | 78.5
Self-supervised | MID | MedMAE (Ours) | 90.2
Table 2: Results of our proposed model MedMAE against other models on CT dataset. The reported accuracy is the accuracy of linear probing pre-trained models.

Table 2 shows the results of our proposed model in comparison with existing deep-learning models. As shown in the table, the performance of our proposed model is superior to that of the other models. Even in comparison with the same model pre-trained on ImageNet (i.e., MAE), there is still a large performance gap of approximately 12%. It is important to mention that we could not pre-train the other models using our dataset because they are trained with supervised learning and our dataset has no labels.

MRI Scan Dataset This dataset was collected in the same way as the CT scan dataset, but the images were captured by an MRI scanner instead of a CT scanner. The dataset consists of images of three body parts: abdomen, head, and shoulder. The number of collected cases (patients) is 426, and each case has around 20 images. Each image is labeled either pass or fail, as in the CT dataset.

Pre-training | Dataset | Backbone | Accuracy
Supervised | IN1k | ResNet | 71.6
Supervised | IN1k | EfficientNet-S | 66.1
Supervised | IN1k | ConvNext-B | 71.9
Supervised | IN1k | ViT-B | 72.7
Supervised | IN1k | Swin-B | 72.1
Self-supervised | IN1k | MAE | 74.3
Self-supervised | MID | MedMAE (Ours) | 85.6
Table 3: Results of our proposed model MedMAE against other models on MRI dataset. The reported accuracy is the accuracy of linear probing pre-trained models.

Table 3 shows the results of our proposed model against the other models. Our model MedMAE shows superior performance with a gap of 11.3%. It is important to mention that neither the MRI nor the CT scan dataset was seen during the pre-training phase; however, the pre-training dataset MID does contain MRI and CT images collected from other sources.

4.2.2 Breast Cancer Prediction

This dataset contains data from 29,686 patients. Each patient has multiple CT studies collected from 1988 to 2018. The dataset contains more than 5 million CT images of the chest area. We split the dataset into training, validation, and testing sets with ratios of 70%, 15%, and 15%, respectively. The results are shown in Table 4. MedMAE has a performance gain of 8.9% over the MAE pre-trained on ImageNet.

Pre-training | Dataset | Backbone | Accuracy
Supervised | IN1k | ResNet | 79.9
Supervised | IN1k | EfficientNet-S | 75.1
Supervised | IN1k | ConvNext-B | 83.8
Supervised | IN1k | ViT-B | 84.0
Supervised | IN1k | Swin-B | 81.3
Self-supervised | IN1k | MAE | 84.3
Self-supervised | MID | MedMAE (Ours) | 93.2
Table 4: Results of our proposed model MedMAE against other models on breast cancer dataset. The reported accuracy is the accuracy of linear probing pre-trained models.

4.2.3 Pneumonia detection

We used the ChestX-ray14 [16] dataset in this experiment. This dataset comprises 112,120 frontal-view X-ray images of more than 30,000 unique patients. For this dataset, we report the results of our model MedMAE against the same models from the previous experiments. Additionally, we report the results of MedMAE against state-of-the-art models on this specific dataset. Table 5 shows the linear-probing results of our model and the other selected models on the ChestX-ray14 dataset. As shown in the table, MedMAE achieves 2.2% higher accuracy than the best of the other models. In Table 6, to have a fair comparison, we fine-tuned all parameters of MedMAE instead of doing linear probing; this is because the state-of-the-art models are trained on this dataset with supervised learning (i.e., all model parameters are updated). As shown in the table, our model MedMAE shows a significant improvement over the other models, with a performance gain of 3.6%.

Pre-training | Dataset | Backbone | Accuracy
Supervised | IN1k | ResNet | 66.4
Supervised | IN1k | EfficientNet-S | 62.9
Supervised | IN1k | ConvNext-B | 67.5
Supervised | IN1k | ViT-B | 67.8
Supervised | IN1k | Swin-B | 67.1
Self-supervised | IN1k | MAE | 67.9
Self-supervised | MID | MedMAE (Ours) | 70.1
Table 5: Results of our proposed model MedMAE against other models on ChestX-ray14 dataset. The reported accuracy is the accuracy of linear probing pre-trained models.
Method | Accuracy
Wang et al. [16] | 73.8
Yao et al. [53] | 79.8
CheXNet [54] | 84.4
Google AutoML [55] | 79.7
NSGANetV1 [56] | 84.7
MUXNet-m [57] | 84.1
AE-CNN [ranjan2018jointly] | 82.41
LEAF [58] | 84.3
MedMAE (Ours) | 88.0
Table 6: Results of our proposed model MedMAE against state-of-the-art models on ChestX-ray14 dataset.

4.2.4 Medical Segmentation

In this experiment, we used the CVC-ClinicDB dataset [59], a dataset of frames extracted from colonoscopy videos. It contains several examples of polyp frames together with the corresponding ground truth, for a total of 612 labeled images. The dataset is split randomly into training, validation, and testing sets with ratios of 70%, 15%, and 15%, respectively. The task performed on this dataset differs from the other experiments: the goal is to segment the region of interest and produce a binary mask, where the white pixels represent the region of interest and the black pixels represent everything else. In the other experiments, we added a fully connected layer for linear probing; here, we instead added a sequence of transposed convolutional layers to upscale the embeddings produced by the encoder to a resolution equivalent to the input image. The sequence consists of four transposed convolutional layers, each followed by a GeLU [60] activation function, with 256, 128, 64, and 32 filters, respectively. All layers have a kernel size of 2×2 and a stride of 2 (i.e., each layer doubles the spatial dimensions of its input). After this sequence, a final convolutional layer with a single 1×1 filter is added, followed by a sigmoid activation function, to produce the output mask. As shown in Table 7, our proposed model MedMAE outperforms the other models by a noticeable gap of 6.8% in the f-score. The f-score is calculated as follows:

\mathcal{F} = \frac{2tp}{2tp + fp + fn} \quad (1)

where \mathcal{F} is the f-score, tp is the number of true positives, fp the number of false positives, and fn the number of false negatives.
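The segmentation head described above can be sketched as follows; the reshaping of the ViT-B token sequence into a 14×14 feature map (for 224×224 inputs with 16×16 patches) is an assumption about how the head attaches to the frozen encoder.

```python
import torch
import torch.nn as nn

class SegmentationHead(nn.Module):
    """Upscales the assumed 14x14 grid of ViT-B patch embeddings to a 224x224 binary mask."""
    def __init__(self, embed_dim=768, grid_size=14):
        super().__init__()
        self.grid_size = grid_size
        channels = [256, 128, 64, 32]
        layers, in_ch = [], embed_dim
        for out_ch in channels:
            # Each transposed convolution (kernel 2x2, stride 2) doubles the spatial dimensions.
            layers += [nn.ConvTranspose2d(in_ch, out_ch, kernel_size=2, stride=2), nn.GELU()]
            in_ch = out_ch
        self.upsample = nn.Sequential(*layers)
        self.out = nn.Conv2d(in_ch, 1, kernel_size=1)   # single-filter 1x1 convolution

    def forward(self, patch_tokens):
        # patch_tokens: (batch, num_patches, embed_dim) from the frozen MedMAE encoder (assumption)
        b, n, d = patch_tokens.shape
        x = patch_tokens.transpose(1, 2).reshape(b, d, self.grid_size, self.grid_size)
        x = self.upsample(x)                 # 14 -> 28 -> 56 -> 112 -> 224
        return torch.sigmoid(self.out(x))    # per-pixel probability of the region of interest
```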

Pre-training | Dataset | Backbone | \mathcal{F}
Supervised | IN1k | ResNet | 57.9
Supervised | IN1k | EfficientNet-S | 53.1
Supervised | IN1k | ConvNext-B | 61.8
Supervised | IN1k | ViT-B | 63.5
Supervised | IN1k | Swin-B | 60.2
Self-supervised | IN1k | MAE | 64.6
Self-supervised | MID | MedMAE (Ours) | 71.4
Table 7: Results of our proposed model MedMAE against other models on CVC-ClinicDB dataset. The reported numbers are the f-score of linear probing pre-trained models.

4.3 Visual results

To offer a deeper understanding of the impact of the training process on the model's performance, we present examples of image reconstructions before and after the extensive training period. Figures 2 and 3 show reconstruction results for a sample image taken from a dataset that was not used during training, comparing the original image with its reconstruction. This comparison highlights the performance of our MedMAE-B model in understanding and recreating the medical image; the refinement in the reconstructed image, compared to earlier checkpoints, is clearly visible. These examples underscore the effectiveness of our extensive training process in enhancing the model's proficiency in medical image analysis.

Fig. 2: Image reconstruction using MAE pre-trained on natural images.
Fig. 3: Image reconstruction using our proposed MedMAE.

5 Conclusion and Future Work

In this paper, we proposed a large-scale unlabeled medical imaging dataset containing extensive and diverse medical images. Additionally, we proposed a ViT-based backbone pre-trained on the proposed dataset and showed that this backbone can be used for various medical imaging tasks. We evaluated the proposed model on four tasks, and the results of all tasks showed the superiority of our proposed model in comparison with other models. The average performance gain across all tasks between our MedMAE and the original MAE is approximately 8%. From the conducted experiments, we showed that models pre-trained with self-supervised learning generalize better to medical tasks than models pre-trained with supervised learning. Additionally, if a model is pre-trained using medical images, its performance on other medical tasks is much better than that of a model pre-trained using natural images.

The future direction of this research is to improve the generalization of MedMAE and allow the model to perform multiple medical imaging tasks without having a separate model for each task. This can be done using an efficient continual learning technique.

References

  • [1] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020.
  • [2] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick, “Masked autoencoders are scalable vision learners,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 16000–16009.
  • [3] Shih-Cheng Huang, Anuj Pareek, Malte Jensen, Matthew P Lungren, Serena Yeung, and Akshay S Chaudhari, “Self-supervised learning for medical image classification: a systematic review and implementation guidelines,” NPJ Digital Medicine, vol. 6, no. 1, pp. 74, 2023.
  • [4] Anuroop Sriram, Matthew Muckley, Koustuv Sinha, Farah Shamout, Joelle Pineau, Krzysztof J Geras, Lea Azour, Yindalon Aphinyanaphongs, Nafissa Yakubova, and William Moore, “Covid-19 prognosis via self-supervised representation learning and multi-image prediction,” arXiv preprint arXiv:2101.04909, 2021.
  • [5] Ming Y Lu, Richard J Chen, and Faisal Mahmood, “Semi-supervised breast cancer histology classification using deep multiple instance learning and contrast predictive coding (conference presentation),” in Medical imaging 2020: digital pathology. SPIE, 2020, vol. 11320, p. 113200J.
  • [6] Xiaomeng Li, Xiaowei Hu, Xiaojuan Qi, Lequan Yu, Wei Zhao, Pheng-Ann Heng, and Lei Xing, “Rotation-oriented collaborative self-supervised learning for retinal disease diagnosis,” IEEE Transactions on Medical Imaging, vol. 40, no. 9, pp. 2284–2294, 2021.
  • [7] Liang Chen, Paul Bentley, Kensaku Mori, Kazunari Misawa, Michitaka Fujiwara, and Daniel Rueckert, “Self-supervised learning for medical image analysis using image context restoration,” Medical image analysis, vol. 58, pp. 101539, 2019.
  • [8] Xuan-Bac Nguyen, Guee Sang Lee, Soo Hyung Kim, and Hyung Jeong Yang, “Self-supervised learning based on spatial awareness for medical image analysis,” IEEE Access, vol. 8, pp. 162973–162981, 2020.
  • [9] Hari Sowrirajan, Jingbo Yang, Andrew Y Ng, and Pranav Rajpurkar, “Moco pretraining improves representation and transferability of chest x-ray models,” in Medical Imaging with Deep Learning. PMLR, 2021, pp. 728–744.
  • [10] Krishna Chaitanya, Ertunc Erdil, Neerav Karani, and Ender Konukoglu, “Contrastive learning of global and local features for medical image segmentation with limited annotations,” Advances in Neural Information Processing Systems, vol. 33, 2020.
  • [11] Aiham Taleb, Winfried Loetzsch, Noel Danz, Julius Severin, Thomas Gaertner, Benjamin Bergner, and Christoph Lippert, “3d self-supervised methods for medical imaging,” Advances in neural information processing systems, vol. 33, pp. 18158–18172, 2020.
  • [12] Yutong Xie, Jianpeng Zhang, Zehui Liao, Yong Xia, and Chunhua Shen, “Pgl: Prior-guided local self-supervised learning for 3d medical image segmentation,” arXiv preprint arXiv:2011.12640, 2020.
  • [13] Wonsik Jung, Da-Woon Heo, Eunjin Jeon, Jaein Lee, and Heung-Il Suk, “Inter-regional high-level relation learning from functional connectivity via self-supervision,” in Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part II 24. Springer, 2021, pp. 284–293.
  • [14] Ananya Jana, Hui Qu, Carlos D Minacapelli, Carolyn Catalano, Vinod Rustgi, and Dimitris Metaxas, “Liver fibrosis and nas scoring from ct images using self-supervised learning and texture encoding,” in 2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI). IEEE, 2021, pp. 1553–1557.
  • [15] Chengcheng Liu, Mengyun Qiao, Fei Jiang, Yi Guo, Zhendong Jin, and Yuanyuan Wang, “Tn-usma net: Triple normalization-based gastrointestinal stromal tumors classification on multicenter eus images with ultrasound-specific pretraining and meta attention,” Medical Physics, vol. 48, no. 11, pp. 7199–7214, 2021.
  • [16] Xiaosong Wang, Yifan Peng, Le Lu, Zhiyong Lu, Mohammadhadi Bagheri, and Ronald M Summers, “Chestx-ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 2097–2106.
  • [17] Michael Rutherford, Seong K Mun, Betty Levine, William Bennett, Kirk Smith, Phil Farmer, Quasar Jarosz, Ulrike Wagner, John Freyman, Geri Blake, et al., “A dicom dataset for evaluation of medical image de-identification,” Scientific Data, vol. 8, no. 1, pp. 183, 2021.
  • [18] Joel Saltz, Mary Saltz, Prateek Prasanna, Richard Moffitt, Janos Hajagos, Erich Bremer, Joseph Balsamo, and Tahsin Kurc, “Stony brook university covid-19 positive cases,” the cancer imaging archive, vol. 4, 2021.
  • [19] Luohai Chen, Wei Wang, Kaizhou Jin, Bing Yuan, Huangying Tan, Jian Sun, Yu Guo, Yanji Luo, Shi-Ting Feng, Xianjun Yu, et al., “Special issue “the advance of solid tumor research in china”: Prediction of sunitinib efficacy using computed tomography in patients with pancreatic neuroendocrine tumors,” International Journal of Cancer, vol. 152, no. 1, pp. 90–99, 2023.
  • [20] Menglei Li, Jing Gong, Yichao Bao, Dan Huang, Junjie Peng, and Tong Tong, “Special issue “the advance of solid tumor research in china”: Prognosis prediction for stage ii colorectal cancer by fusing computed tomography radiomics and deep-learning features of primary lesions and peripheral lymph nodes,” International Journal of Cancer, vol. 152, no. 1, pp. 31–41, 2023.
  • [21] Peng An, Sheng Xu, Stephanie A Harmon, Evrim B Turkbey, Thomas H Sanford, Amel Amalou, Michael Kassin, Nicole Varble, Maxime Blain, Victoria Anderson, et al., “Ct images in covid-19 [data set],” The Cancer Imaging Archive, vol. 10, pp. 32, 2020.
  • [22] E Tsai, Scott Simpson, Matthew P Lungren, Michelle Hershman, Leonid Roshkovan, Errol Colak, Bradley J Erickson, George Shih, Anouk Stein, Jayashree Kalpathy-Cramer, et al., “Data from medical imaging data resource center (midrc)-rsna international covid radiology database (ricord) release 1c-chest x-ray, covid+(midrc-ricord-1c),” The Cancer Imaging Archive, vol. 10, 2021.
  • [23] Blaine Rister, Darvin Yi, Kaushik Shivakumar, Tomomi Nobashi, and Daniel L Rubin, “Ct-org, a new dataset for multiple organ segmentation in computed tomography,” Scientific Data, vol. 7, no. 1, pp. 381, 2020.
  • [24] Afua A Yorke, Gary C McDonald, D Solis, and Thomas Guerrero, “Pelvic reference data,” The Cancer Imaging Archive, vol. 10, 2019.
  • [25] J Kalpathy-Cramer, Sandy Napel, D Goldgof, and B Zhao, “Qin multi-site collection of lung ct data with nodule segmentations,” Cancer Imaging Arch, vol. 10, pp. K9, 2015.
  • [26] Holger R Roth, Amal Farag, E Turkbey, Le Lu, Jiamin Liu, and Ronald M Summers, “Data from pancreas-ct. the cancer imaging archive,” IEEE Transactions on Image Processing, vol. 5, 2016.
  • [27] Olya Grove, Anders E Berglund, Matthew B Schabath, Hugo JWL Aerts, Andre Dekker, Hua Wang, Emmanuel Rios Velazquez, Philippe Lambin, Yuhua Gu, Yoganand Balagurunathan, et al., “Quantitative computed tomographic descriptors associate tumor shape complexity and intratumor heterogeneity with prognosis in lung adenocarcinoma,” PloS one, vol. 10, no. 3, pp. e0118261, 2015.
  • [28] HJWL Aerts, E Rios Velazquez, RT Leijenaar, Chintan Parmar, Patrick Grossmann, S Cavalho, Johan Bussink, René Monshouwer, Benjamin Haibe-Kains, Derek Rietveld, et al., “Data from nsclc-radiomics,” The cancer imaging archive, 2015.
  • [29] Xinzhi Teng et al., “Improving radiomic model reliability and generalizability using perturbations in head and neck carcinoma,” 2023.
  • [30] K Smith, K Clark, W Bennett, T Nolan, J Kirby, M Wolfsberger, J Moulton, B Vendt, and J Freymann, “Data from ct_colonography,” Cancer Imaging Arch, vol. 10, pp. K9, 2015.
  • [31] Samuel G Armato III, Geoffrey McLennan, Luc Bidaut, Michael F McNitt-Gray, Charles R Meyer, Anthony P Reeves, Binsheng Zhao, Denise R Aberle, Claudia I Henschke, Eric A Hoffman, et al., “The lung image database consortium (lidc) and image database resource initiative (idri): a completed reference database of lung nodules on ct scans,” Medical physics, vol. 38, no. 2, pp. 915–931, 2011.
  • [32] Marcos AD Machado, Thauan F Moraes, Bruno HL Anjos, Nadja RG Alencar, Tien-Man C Chang, Bruno CRF Santana, Vinicius O Menezes, Lucas O Vieira, Simone CS Brandão, Marco A Salvino, et al., “Association between increased subcutaneous adipose tissue radiodensity and cancer mortality: Automated computation, comparison of cancer types, gender, and scanner bias,” Applied Radiation and Isotopes, vol. 205, pp. 111181, 2024.
  • [33] B Albertina, M Watson, C Holback, R Jarosz, S Kirk, Y Lee, K Rieger-Christ, and J Lemmerman, “The cancer genome atlas lung adenocarcinoma collection (tcga-luad)(version 4)[data set],” The Cancer Imaging Archive, 2016.
  • [34] S Desai, A Baghal, T Wongsurawat, S Al-Shukri, K Gates, P Farmer, M Rutherford, GD Blake, T Nolan, T Powell, et al., “Data from chest imaging with clinical and genomic correlates representing a rural covid-19 positive population [data set],” The Cancer Imaging Archive, 2020.
  • [35] CM Biobank, “Cancer moonshot biobank-lung cancer collection (cmb-lca)(version 3)[dataset],” The Cancer Imaging Archive, 2022.
  • [36] CM Biobank, “Cancer moonshot biobank-lung cancer collection (cmb-crc)(version 5)[dataset],” The Cancer Imaging Archive, 2022.
  • [37] O Akin, P Elnajjar, M Heller, R Jarosz, BJ Erickson, S Kirk, Y Lee, MW Linehan, R Gautam, R Vikram, et al., “The cancer genome atlas kidney renal clear cell carcinoma collection (tcga-kirc)(version 3). the cancer imaging archive,” 2016.
  • [38] CM Biobank, “Cancer moonshot biobank-lung cancer collection (cmb-pca)(version 5)[dataset],” The Cancer Imaging Archive, 2022.
  • [39] Kenneth Clark, Bruce Vendt, Kirk Smith, John Freymann, Justin Kirby, Paul Koppel, Stephen Moore, Stanley Phillips, David Maffitt, Michael Pringle, et al., “The cancer imaging archive (tcia): maintaining and operating a public information repository,” Journal of digital imaging, vol. 26, pp. 1045–1057, 2013.
  • [40] C Roche, E Bonaccio, and J Filippini, “The cancer genome atlas sarcoma collection (tcga-sarc)(version 3)[data set],” The Cancer Imaging Archive, 2016.
  • [41] B Albertina, M Watson, C Holback, R Jarosz, S Kirk, Y Lee, and J Lemmerman, “Radiology data from the cancer genome atlas lung adenocarcinoma [tcga-luad] collection,” The Cancer Imaging Archive, vol. 10, pp. K9, 2016.
  • [42] Chandra Holback, Rose Jarosz, Fred Prior, David G Mutch, Priya Bhosale, Kimberly Garcia, Yueh Lee, Shanah Kirk, Cheryl A Sadow, Seth Levine, et al., “The cancer genome atlas ovarian cancer collection (tcga-ov)(version 4)[data set],” The Cancer Imaging Archive, 2016.
  • [43] Donald W McRobbie, Scott Semple, and Anna Pauline Barnes, Quality Control and Artefacts in Magnetic Resonance Imaging (update of IPEM Report 80), Institute of Physics and Engineering in Medicine, 2017.
  • [44] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in 2009 IEEE conference on computer vision and pattern recognition. Ieee, 2009, pp. 248–255.
  • [45] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
  • [46] Mingxing Tan and Quoc Le, “Efficientnetv2: Smaller models and faster training,” in International conference on machine learning. PMLR, 2021, pp. 10096–10106.
  • [47] Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie, “A convnet for the 2020s,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 11976–11986.
  • [48] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 10012–10022.
  • [49] Azimeh NV Dehkordi and Saeideh Koohestani, “The influence of signal to noise ratio on the pharmacokinetic analysis in dce-mri studies,” Frontiers in Biomedical Technologies, 2019.
  • [50] Todd B Parrish, Darren R Gitelman, Kevin S LaBar, and M-Marsel Mesulam, “Impact of signal-to-noise on functional mri,” Magnetic Resonance in Medicine: An Official Journal of the International Society for Magnetic Resonance in Medicine, vol. 44, no. 6, pp. 925–932, 2000.
  • [51] Bin Song, Weiquan Tan, Yue Xu, Taihui Yu, Weiping Li, Zhong Chen, Rui Yang, Jingyi Hou, and Yunfeng Zhou, “3d-mri combined with signal-to-noise ratio measurement can improve the diagnostic accuracy and sensitivity in evaluating meniscal healing status after meniscal repair,” Knee Surgery, Sports Traumatology, Arthroscopy, vol. 27, pp. 177–188, 2019.
  • [52] Joel D Rubenstein, JG Li, S Majumdar, and RM Henkelman, “Image resolution and signal-to-noise ratio requirements for mr imaging of degenerative cartilage.,” AJR. American journal of roentgenology, vol. 169, no. 4, pp. 1089–1096, 1997.
  • [53] Li Yao, Eric Poblenz, Dmitry Dagunts, Ben Covington, Devon Bernard, and Kevin Lyman, “Learning to diagnose from scratch by exploiting dependencies among labels,” arXiv preprint arXiv:1710.10501, 2017.
  • [54] Pranav Rajpurkar, Jeremy Irvin, Kaylie Zhu, Brandon Yang, Hershel Mehta, Tony Duan, Daisy Ding, Aarti Bagul, Curtis Langlotz, Katie Shpanskaya, et al., “Chexnet: Radiologist-level pneumonia detection on chest x-rays with deep learning,” arXiv preprint arXiv:1711.05225, 2017.
  • [55] G Blog, “Automl for large scale image classification and object detection,” Google Research, 2017.
  • [56] Zhichao Lu, Ian Whalen, Yashesh Dhebar, Kalyanmoy Deb, Erik D Goodman, Wolfgang Banzhaf, and Vishnu Naresh Boddeti, “Multiobjective evolutionary design of deep convolutional neural networks for image classification,” IEEE Transactions on Evolutionary Computation, vol. 25, no. 2, pp. 277–291, 2020.
  • [57] Zhichao Lu, Kalyanmoy Deb, and Vishnu Naresh Boddeti, “Muxconv: Information multiplexing in convolutional neural networks,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 12044–12053.
  • [58] Jason Liang, Elliot Meyerson, Babak Hodjat, Dan Fink, Karl Mutch, and Risto Miikkulainen, “Evolutionary neural automl for deep learning,” in Proceedings of the genetic and evolutionary computation conference, 2019, pp. 401–409.
  • [59] Jorge Bernal, F Javier Sánchez, Gloria Fernández-Esparrach, Debora Gil, Cristina Rodríguez, and Fernando Vilariño, “Wm-dova maps for accurate polyp highlighting in colonoscopy: Validation vs. saliency maps from physicians,” Computerized medical imaging and graphics, vol. 43, pp. 99–111, 2015.
  • [60] Dan Hendrycks and Kevin Gimpel, “Gaussian error linear units (gelus),” arXiv preprint arXiv:1606.08415, 2016.