RadCLIP: Enhancing Radiologic Image Analysis through Contrastive Language-Image Pre-training
Abstract
The integration of artificial intelligence (AI) with radiology marks a transformative era in medicine. Vision foundation models have been adopted to enhance radiologic imaging analysis. However, the distinct complexities of 2D and 3D radiologic data pose unique challenges that existing models, pre-trained on general non-medical images, fail to address adequately. To bridge this gap and capitalize on the diagnostic precision required in radiologic imaging, we introduce Radiologic Contrastive Language-Image Pre-training (RadCLIP): a cross-modal vision-language foundational model that harnesses the Vision-Language Pre-training (VLP) framework to improve radiologic image analysis. Building upon Contrastive Language-Image Pre-training (CLIP), RadCLIP incorporates a slice pooling mechanism tailored for volumetric image analysis and is pre-trained using a large and diverse dataset of radiologic image-text pairs. RadCLIP was pre-trained to effectively align radiologic images with their corresponding text annotations, creating a robust vision backbone for radiologic images. Extensive experiments demonstrate RadCLIP’s superior performance in both uni-modal radiologic image classification and cross-modal image-text matching, highlighting its significant promise for improving diagnostic accuracy and efficiency in clinical settings. Our key contributions include curating a large dataset of diverse 2D/3D radiologic image-text pairs, a slice pooling adapter that uses an attention mechanism to integrate 2D image slices into volumetric representations, and comprehensive evaluations of RadCLIP on various radiologic downstream tasks.
Index Terms:
RadCLIP, Radiology, Foundation Model, Vision-Language Pretraining (VLP), Contrastive Language-Image Pre-training (CLIP), Medical Imaging, Representation Learning
I Introduction
In the rapidly evolving field of radiology, integrating artificial intelligence (AI) has become indispensable. Vision foundation models, trained on extensive datasets to learn a wide array of features, have shown exceptional promise in computer vision applications [1]. These models serve as a cornerstone for developing specialized applications, particularly through transfer learning, where knowledge from a source domain is applied to enhance performance in a target domain. In the medical imaging domain, the importance of transfer learning is further highlighted by the difficulty of acquiring large radiologic datasets to train an end-to-end deep-learning model from scratch[2][3].
State-of-the-art vision foundation models are generally trained on natural image datasets, such as CIFAR-10, Food-101, and ImageNet [4]. However, the unique challenges of radiologic imaging data, such as the 2D/3D attributes of radiologic images, the subtlety of pathological features, and the high stakes of diagnostic errors, demand foundation models tailored to the medical domain. Generic vision models trained on natural image datasets may not capture the intricacies of radiologic images, leading to a gap in performance when applied to radiologic image classification or segmentation tasks [5]. For instance, GPT-4V, arguably the most prominent generic vision-language model, does not perform well on medical tasks [6].
Recent developments in vision-language models—models that can understand both images and text—have greatly improved the ability of computers to link images with words. Contrastive Language-Image Pre-training (CLIP), a pioneering work by OpenAI, leverages extensive image-text datasets for effective visual-textual concept association, enabling diverse AI applications (e.g., zero-shot image recognition and advanced natural language tasks). Its adaptability and robustness underscore its foundational role in the AI field. Alongside CLIP, more recent vision-language models, such as CoCa and ALIGN, have pushed the boundaries in cross-modal tasks and set new standards in vision application benchmarks, showcasing the potential of vision-language pretraining (VLP) framework to enhance vision models through language supervision [1, 7, 8].
More recently, there has been a surge in initiatives to tailor vision-language models for the medical domain [9], with several noteworthy projects emerging. CONVIRT automates radiology report generation through natural language processing, streamlining the interpretation process. GLoRIA leverages radiology reports to enhance medical image analysis without extensive labeling, using attention mechanisms to improve image retrieval, classification, and segmentation. MedCLIP uses the CLIP model’s VLP framework to link chest X-rays (CXRs) with clinical notes, boosting the diagnostic accuracy of zero-shot learning tasks. PMC-CLIP extracts information from medical literature to facilitate its application in clinical settings. CLIP-Lung combines clinical text annotations with lung images to better predict 3D CT lung nodule malignancy through channel-wise condition prompting, being one of the few methods attempting to extend VLP to 3D radiologic data. CXR-CLIP addresses the data scarcity of CXRs by merging image-text and image-label data to learn study-level characteristics with novel contrastive losses.
The emergence of these new medicine-related vision-language models demonstrates the potential of the VLP framework in the radiologic domain. However, one of the major limitations of these models is the lack of extensive and diverse radiologic imaging data for model training and validation. Most existing vision-language models were often developed with 2D CXRs or 2D slices of CT scans [10]. The lack of data diversity in training data may compromise the models’ capability to understand heterogeneous radiologic imaging modalities. The training datasets of these models usually do not include sufficient 3D radiologic imaging data (e.g., 3D CT and MRI), a unique attribute of radiologic data compared to general natural image datasets (e.g., ImageNet [4]). This significantly restricts the existing models’ comprehensive understanding of 3D spatial information of human anatomy, which is often crucial for accurate diagnosis and assessment of medical conditions.
To address this limitation, we present Radiologic CLIP (RadCLIP), a novel vision-language model that is tailored for radiologic image analysis. RadCLIP is designed to overcome the limitations of existing vision-language models with a focus on improving radiologic image representation learning. Utilizing a diverse, carefully curated 2D/3D radiologic dataset, we build a robust vision backbone for radiologic imaging and enhance its cross-modal capabilities through the VLP framework. We evaluate the performance of the proposed RadCLIP model on both uni-modal radiologic image representation learning and cross-modal vision-language alignment.
Overall, in this study, we make the following contributions:
1. We carefully collected and curated a large, diverse radiologic image-text pair dataset across a wide range of 2D/3D radiologic imaging modalities, anatomical regions, diseases, and conditions.
2. We trained RadCLIP extensively using the 2D/3D radiologic image-text pairs with the VLP framework.
3. For 3D images, a slice-wise pooling mechanism was introduced to integrate different 2D radiologic image slices along the slice dimension, enabling the model to better understand the 3D spatial information of radiologic data.
4. Extensive experiments were conducted to assess RadCLIP’s uni-modal representation learning and cross-modal capabilities.
II Related Work
II-A Radiologic Vision Foundation Models
In recent years, the intersection of AI and radiology has garnered significant attention, leading to the development of various models aimed at enhancing medical imaging analysis [11, 12]. Radiologic vision foundation models, trained on extensive radiologic datasets to learn a wide array of radiologic features, have demonstrated exceptional promise in radiology-related tasks. One such model is MedViT, a Vision Transformer for generalized medical image classification developed by Manzari et al. [13]. MedViT combines the robust components of Convolutional Neural Networks (CNNs) with the global connectivity of Vision Transformers. This hybrid model addresses the high quadratic complexity of the self-attention mechanism and enhances robustness against adversarial attacks by focusing on global structure features rather than texture information. MedViT also employs an innovative augmentation strategy that blends feature normalization with data augmentation, resulting in superior accuracy and robustness across various medical image datasets.
Another milestone development is RadImageNet and its released foundation models[5]. RadImageNet, similar to ImageNet, is a large-scale dataset specifically curated for the medical imaging domain, consisting of 1.35 million annotated CT, MRI, and Ultrasound (US) images covering a wide range of pathological conditions. The study demonstrated that RadImageNet foundation models outperform those models pre-trained on ImageNet in various medical imaging tasks, especially when dealing with small datasets. For instance, RadImageNet models showed significant performance improvements for analyzing thyroid nodules, breast masses, anterior cruciate ligament injuries, and meniscal tears compared to ImageNet foundation models. This underscores the importance of using domain-specific datasets to enhance the performance of AI models in the radiologic imaging domain.
II-B Radiologic Vision-Language Models
In the medical domain, several early attempts have adapted CLIP-like models to radiologic imaging. The intrinsic complexity and variability of medical images, combined with the nuanced and context-specific language of radiology reports, make this a challenging yet highly rewarding task. Recently, several radiologic vision-language models have been developed. For example, GLoRIA is a framework that improves medical image analysis by using radiology reports to learn detailed image representations without extensive manual labeling, offering a significant leap in label-efficient medical imaging [14]. CONVIRT employs natural language processing to automatically generate radiology reports, aiming to mimic the expert annotations of radiologists [15]. CXR-CLIP combines CXR-text with CXR-label data through class-specific prompts, significantly expanding training data diversity; its two novel contrastive losses, Image Contrastive Loss and Text Contrastive Loss, enable the learning of study-level CXR image and report text characteristics [16]. MedCLIP adapts CLIP’s VLP framework to CXR data, linking CXRs with clinical notes to boost the model’s diagnostic accuracy on zero-shot learning [17]. PMC-CLIP is designed to extract and correlate information from the extensive repository of medical literature, including images and their captions. This model aims to bridge the gap between rich academic medical research and clinical applications, making it easier for healthcare professionals to apply the latest insights from medical literature to patient care [18].
Despite the advancements, current radiologic vision-language models in radiologic imaging share a common limitation that RadCLIP aims to address. One significant limitation is the lack of extensive and diverse radiologic imaging data for training and validation. Most existing vision-language models are often developed with 2D CXRs or slices of CT scans. The lack of data diversity in training data may compromise the models’ capability to understand heterogeneous radiologic imaging modalities or 3D spatial information of human anatomy.
III Methodology
III-A RadCLIP Vision-Language Pre-training


CLIP has revolutionized the integration of vision and language models by leveraging large-scale image-text datasets to learn rich, multi-modal representations. The fundamental principle behind CLIP is to align visual and textual information in a shared embedding space. CLIP’s architecture employs a dual-encoder setup, where one encoder processes images and the other processes text [1, 19]. During training, the model learns to associate images with their corresponding textual descriptions by minimizing the contrastive loss, which brings related image-text pairs closer in the embedding space while pushing unrelated pairs apart. This approach has paved the way for numerous advancements in computer vision tasks without the need for extensive labeled datasets. Many studies have shown that the VLP framework of CLIP allows vision models to learn more intricate image details and extract finer-grained vision features through text supervision [1, 20, 7, 21]. Inspired by its success, RadCLIP is also designed to utilize the VLP framework for creating a robust vision foundation model by learning radiologic images through paired text supervision.
Using the VLP framework, we train the RadCLIP model on a meticulously curated and comprehensive collection of radiologic image-text pairs, covering a wide range of imaging modalities, anatomical regions, diseases, and conditions. Such a diverse dataset ensures robust and generalized model performance. Additionally, we introduce a novel slice pooling adapter with a slice-wise attention mechanism that weights individual image slices’ importance, enhancing the performance of volumetric image analysis [22]. The slice pooling module not only enables us to train a universal volumetric radiologic image encoder, but also allows the model to prioritize more significant image slices.
The methodology section elaborates on the RadCLIP Vision-Language Pre-training, Loss function, and the implementation details of the model training using a comprehensive and diverse dataset of 2D/3D radiologic image-text pairs.
The RadCLIP architecture consists of three modules: a text encoder that processes texts, a 2D image encoder that processes 2D radiologic images, and a slice pooling adapter that merges embeddings of 2D slices (Figure 1). Given the strong language capability of existing language models, we directly leveraged the pre-trained text encoder of the CLIP model and froze the model weights [23]. In this study, we mainly focused on fine-tuning the 2D image encoder (Figure 1a) and training the slice pooling adapter (Figure 1b).
III-A1 2D Image Encoder Pre-training
We used the pre-trained 2D image encoder from the CLIP model and then fine-tuned this encoder to further understand 2D radiologic images. Specifically, we performed a contrastive pre-training to fine-tune the 2D image encoder using our curated large 2D radiologic image-text pairs. The encoder was trained to pull the embeddings of 2D radiologic images and their corresponding text descriptions closer together in the embedding space, and to push the 2D radiologic images away from any mismatched descriptions. For example, an abdominal CT slice image may be pulled toward the text ”Abdomen CT with Prostate Lesion”, but pushed away from the text ”Brain MRI with White Matter Changes” in the embedding space. The text encoder’s weights remain frozen during the contrastive pre-training process to preserve the language understanding capabilities of the original text encoder. The VLP process allows the image encoder to learn meaningful radiologic image representations for enhanced radiologic image-text alignment [24, 25].
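To make this step concrete, below is a minimal sketch of the contrastive fine-tuning loop, assuming the Hugging Face `CLIPModel` and `CLIPProcessor` for `openai/clip-vit-large-patch14`, a frozen text tower, and illustrative optimizer settings; it is a simplified stand-in rather than the exact training recipe.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

# Load the pre-trained CLIP backbone (vision + text encoders).
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

# Freeze the text tower to preserve its language understanding.
for p in model.text_model.parameters():
    p.requires_grad = False
for p in model.text_projection.parameters():
    p.requires_grad = False

optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-4
)

def training_step(images, captions):
    """One contrastive step: images is a list of PIL images, captions a list of strings."""
    inputs = processor(text=captions, images=images,
                       return_tensors="pt", padding=True)
    # return_loss=True makes CLIPModel compute its contrastive image-text loss.
    outputs = model(**inputs, return_loss=True)
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return outputs.loss.item()
```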
III-A2 Slice Pooling Adapter Pre-training
For 3D volumetric radiologic images, traditional methods often use multi-channel feature maps or simple average pooling strategies to transition from 2D slice representations to 3D volume representations [26, 27, 28]. However, these methods can suffer from information loss and a lack of contextual understanding across volumetric slices, potentially leading to suboptimal performance in capturing complex anatomical structures [29]. To address this, we designed a slice pooling adapter that utilizes an attention-based pooling mechanism to combine slice-wise 2D image representations into volumetric image representations [30]. As shown in Figure 2, our slice pooling adapter is a multi-head self-attention layer with learnable random position encoding (PE) [31]. To mitigate the limitation of global average pooling, the learnable PE is used to inject information about the order or position of 2D images in volumetric data. Since the attention mechanism does not inherently consider the order of the input 2D image embeddings, PE is essential to provide the model with information about the relative or absolute position of 2D radiologic images.

Assume that $\mathbf{Z} = [\mathbf{z}_1, \mathbf{z}_2, \dots, \mathbf{z}_N]$ represents the full stack of 2D image embeddings for a volume with $N$ slices. The volumetric image representation $\mathbf{v}$ is given by:

$$\mathbf{v} = \mathrm{MHSA}\left(\mathbf{Z} + \mathrm{PE}(\mathbf{p})\right) \tag{1}$$

where MHSA denotes the multi-head self-attention mechanism, $\mathbf{p} = [1, 2, \dots, N]$ are the positional indices of the slices, and $\mathrm{PE}(\cdot)$ is the learnable random position encoding function applied to those indices. The learnable random position encoding is defined as:

$$\mathrm{PE}(i) = \mathbf{W}_{\mathrm{PE}}[i], \qquad \mathbf{W}_{\mathrm{PE}} \in \mathbb{R}^{N_{\max} \times d} \tag{2}$$

where $i$ is the positional index, $d$ is the dimension of the model, and $\mathbf{W}_{\mathrm{PE}}$ is a randomly initialized, learnable embedding matrix.
The attention mechanism allows the model to capture relationships and dependencies between different slices, providing a more holistic understanding of the volumetric data. The learnable random PE enables the model to adaptively learn the spatial positioning of slices, further improving the volume representation by considering the positional context.
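A minimal PyTorch sketch of such an attention-based slice pooling adapter is given below. The single attention layer, residual normalization, mean pooling of the attended slice tokens, and layer sizes are assumptions for illustration rather than the exact RadCLIP configuration.

```python
import torch
import torch.nn as nn

class SlicePoolingAdapter(nn.Module):
    """Pools per-slice 2D embeddings into one volume embedding using
    multi-head self-attention and a learnable random position encoding."""

    def __init__(self, embed_dim: int = 768, num_heads: int = 8,
                 max_slices: int = 256):
        super().__init__()
        # Learnable, randomly initialized position embedding (one row per slice index).
        self.pos_embed = nn.Parameter(torch.randn(max_slices, embed_dim) * 0.02)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, slice_embeds: torch.Tensor) -> torch.Tensor:
        # slice_embeds: (batch, num_slices, embed_dim) from the 2D image encoder.
        n = slice_embeds.size(1)
        x = slice_embeds + self.pos_embed[:n]      # inject slice order information
        attended, _ = self.attn(x, x, x)           # slice-wise self-attention
        attended = self.norm(attended + x)         # residual connection + norm
        return attended.mean(dim=1)                # pool to (batch, embed_dim)

# Example: 4 volumes, each with 64 slices embedded by the 2D encoder.
adapter = SlicePoolingAdapter()
volume_embedding = adapter(torch.randn(4, 64, 768))   # -> shape (4, 768)
```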
To train the slice pooling adapter, we performed a contrastive slice pooling adapter pre-training using a diverse set of 3D radiologic image-text pairs. The slice pooling adapter was trained to pull the embeddings of 3D volumetric radiologic images and their corresponding text descriptions closer together, and to push the radiologic images away from any mismatched descriptions. For instance, a brain MRI volume may be pulled toward the text ”brain MRI with Pituitary Tumor”, but pushed away from the text ”Lung CT with Nodule” in the embedding space. During this VLP process, the text encoder and the 2D image encoder are both frozen.
III-B Contrastive Loss Function
To effectively align 2D/3D image and text embeddings within a shared embedding space, we utilize the Information Noise Contrastive Estimation (InfoNCE) loss [32]. This loss function aligns embeddings from images and texts by minimizing the distance between semantically similar image-text pairs and maximizing the distance between dissimilar ones. Using the 3D volumetric image embeddings as an example, we calculate the cosine similarity between image-text pairs as:

$$s_{ij} = \frac{\mathbf{v}_i \cdot \mathbf{t}_j}{\lVert \mathbf{v}_i \rVert \, \lVert \mathbf{t}_j \rVert} \tag{3}$$

where $\mathbf{v}_i$ and $\mathbf{t}_j$ are the embeddings of 3D image sample $i$ and text sample $j$. In a batch of $N$ pairs, an $N \times N$ similarity matrix is generated with positive examples (matching image-text pairs) along its diagonal and negative examples (non-matching pairs) elsewhere. The InfoNCE loss is then calculated as follows:

$$\mathcal{L}_{\mathrm{InfoNCE}} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp(s_{ii}/\tau)}{\sum_{j=1}^{N} \exp(s_{ij}/\tau)} \tag{4}$$

where $s_{ij}$ denotes the cosine similarity between the image and text embeddings, and $\tau$ is a temperature parameter that controls the sharpness of the similarity distribution. We also utilized the InfoNCE loss as the loss function for 2D image encoder pre-training.
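The following sketch illustrates this loss in PyTorch, assuming a batch of paired, unnormalized image and text embeddings and a fixed temperature; applying the loss symmetrically in both image-to-text and text-to-image directions is a common variant not prescribed by the equations above.

```python
import torch
import torch.nn.functional as F

def infonce_loss(image_embeds: torch.Tensor,
                 text_embeds: torch.Tensor,
                 temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE over a batch of paired embeddings.

    image_embeds, text_embeds: (batch, dim); row i of each forms a matching pair.
    """
    # Cosine similarity matrix s[i, j] between image i and text j (Eq. 3).
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    sim = image_embeds @ text_embeds.t() / temperature

    # Matching pairs lie on the diagonal of the similarity matrix (Eq. 4).
    targets = torch.arange(sim.size(0), device=sim.device)
    return F.cross_entropy(sim, targets)

# Example: a batch of 8 paired 512-dimensional embeddings.
loss = infonce_loss(torch.randn(8, 512), torch.randn(8, 512))
```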
III-C Implementation Details
The pre-trained weights from the CLIP model (clip-vit-large-patch14) were loaded using the Hugging Face Transformers library. During RadCLIP’s pre-training, a cosine annealing learning rate scheduler was used, starting with an initial learning rate of 1e-4. Checkpoints were saved every epoch, and early stopping was applied based on validation loss. Hyperparameters such as training epochs, learning rate, batch size, and the number of attention heads were tuned using empirical values [1]. The best set of hyperparameters was selected based on the highest validation accuracy. To enhance the model’s robustness, dropout with a rate of 0.5 was applied to prevent overfitting, and L2 regularization was used to ensure generalization.
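As a rough, self-contained illustration of this training setup (cosine annealing from 1e-4, L2 regularization via weight decay, and checkpointing on validation improvement), consider the sketch below; the stand-in model, epoch count, weight-decay value, and placeholder validation loss are assumptions, not the reported configuration.

```python
import torch
import torch.nn as nn

model = nn.Linear(768, 768)  # stand-in for the trainable RadCLIP components
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-4)

num_epochs = 50  # assumed; the actual value was tuned empirically
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=num_epochs)

best_val_loss = float("inf")
for epoch in range(num_epochs):
    # ... one epoch of contrastive training (forward, backward) would run here ...
    optimizer.step()                 # parameter update from the backward pass
    scheduler.step()                 # cosine annealing of the learning rate
    val_loss = 1.0 / (epoch + 1)     # placeholder for the measured validation loss
    if val_loss < best_val_loss:     # checkpoint whenever validation improves
        best_val_loss = val_loss
        torch.save(model.state_dict(), "radclip_checkpoint.pt")
```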
Training and evaluation were conducted on a system with 4 NVIDIA A6000 GPUs, using the PyTorch framework (version 1.9) and the Hugging Face Transformers library (version 4.12). The model weights, training, and evaluation codes will be made publicly available on Hugging Face and GitHub upon acceptance of the paper. Detailed instructions and documentation will be provided to facilitate the reproduction of results and the use of the RadCLIP model in other research projects.
IV Experiments

In this section, we describe the experimental configurations used to evaluate the performance of RadCLIP. We first detail the datasets used for training and evaluation. Next, we outline the evaluation strategy, including the downstream tasks and evaluation metrics. Finally, we present the results of our experiments.
IV-A Dataset Curation
To ensure that RadCLIP can learn from a wide range of radiologic images, we carefully curated a large, diverse image dataset, accumulated from publicly available radiologic image datasets for model pre-training, referred to as the RadCLIP training dataset. Briefly, this dataset includes a total of 1,157,587 radiologic 2D image-text pairs (2D X-ray, CT, and MRI images) and 52,766 radiologic 3D image-text pairs (3D CT and MRI images). The dataset covers various anatomical regions and a wide range of diseases and medical conditions. We greatly appreciate the studies that have made their datasets publicly available, contributing valuable resources to the research community. The RadCLIP training dataset was compiled by integrating 14 publicly available dataset collections. Figure 3 shows the sample size of 2D/3D images of different imaging modalities from various body parts, and illustrates a number of representative radiologic images.
Additionally, we compiled a RadCLIP evaluation dataset from four publicly available datasets. These radiologic images are not included in the RadCLIP training dataset, serving as unseen external data for generalization evaluation. The individual publicly available datasets are listed below.
• RadCLIP training dataset:
– Cancer Moonshot Biobank - Colorectal Cancer Collection (CMB-CRC) [33]
– Cancer Moonshot Biobank - Lung Cancer Collection (CMB-LCA) [34]
– MOS-MED [35]
– Duke-Abdomen [36]
– ISPY1 [37]
– fastMRI [38][39]
– Open Neuro: Flanker Task [40]
– PI-CAI [41]
– Prostate-MRI-US-Biopsy [42]
– qDESS Knee MRI [43]
– RSNA Pneumonia [44]
– RadImageNet [5]
– Unifesp [45]
– CPTAC-PDA [46]
– MedMNIST [47]
• RadCLIP evaluation dataset:
– ChestXpert [48]
– Crystal Clean [49]
– IXI Brain [50]
– COVID-CT-MD [51]
For both the training and evaluation datasets, all radiologic images were resized to a unified pixel dimension. For 3D volumetric images, the size normalization was conducted on each imaging acquisition plane (i.e., axial, coronal, and sagittal). Then, the intensities of each image were normalized using z-score normalization.
For each 2D/3D radiologic image/volume, we extracted descriptive text from the associated documents and labels. The descriptive texts follow the pattern [body region - imaging modality - disease/medical condition (if applicable)], for example, [Abdomen CT with Prostate Lesion] or [Brain MRI with Pituitary Tumor]. Not all texts include a disease/medical condition. After curating the text data, we tokenized all descriptions using CLIP’s default tokenizer.
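A simplified version of this preprocessing pipeline is sketched below. The 224×224 target size, the bilinear interpolation, and the caption-template helper are assumptions for illustration; only the z-score normalization, the [body region - modality - condition] pattern, and the use of CLIP’s tokenizer are stated above.

```python
import numpy as np
import torch
import torch.nn.functional as F
from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
TARGET_SIZE = 224  # assumed unified in-plane dimension

def resize_slice(image: np.ndarray) -> torch.Tensor:
    """Resize a single 2D slice to the unified dimension."""
    t = torch.from_numpy(image).float()[None, None]              # (1, 1, H, W)
    t = F.interpolate(t, size=(TARGET_SIZE, TARGET_SIZE),
                      mode="bilinear", align_corners=False)
    return t[0, 0]

def zscore(image: torch.Tensor) -> torch.Tensor:
    """Z-score intensity normalization applied per image/slice."""
    return (image - image.mean()) / (image.std() + 1e-8)

def build_caption(region: str, modality: str, condition: str = "") -> str:
    """Descriptive text: [body region - imaging modality - condition (if any)]."""
    return f"{region} {modality} with {condition}" if condition else f"{region} {modality}"

pixels = zscore(resize_slice(np.random.rand(512, 512)))          # toy 2D slice
tokens = tokenizer(build_caption("Brain", "MRI", "Pituitary Tumor"),
                   padding=True, return_tensors="pt")
```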
IV-B Evaluation Strategy

After RadCLIP pre-training, we investigated the model’s performance on downstream tasks as a foundation model. We focused on image classification tasks and an image-text matching task using 2D or 3D radiologic images.
For image classification, we used the linear probing strategy, where we applied RadCLIP (2D image encoder/slice pooling adapter) as a feature extractor, and fit a single-layer linear classifier atop the foundational model’s generated features to benchmark performance across various classification tasks [52, 53]. (Figure 4 A-B) This is a common way to assess the foundational model’s radiologic image representation capability for downstream classification tasks without fine-tuning the entire foundation model. All linear probing experiments were conducted with five-fold cross-validation on RadCLIP evaluation datasets. Model performance on classification tasks was evaluated using accuracy and F1 score.
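A sketch of this linear-probing protocol is shown below, assuming features have already been extracted with the frozen RadCLIP encoder and using scikit-learn’s logistic regression as the single linear classifier; the placeholder arrays and classifier settings are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Assume features (N, D) were extracted with the frozen RadCLIP encoder
# and labels (N,) are the downstream classification targets.
features = np.random.randn(500, 768)        # placeholder embeddings
labels = np.random.randint(0, 5, size=500)  # placeholder 5-class labels

# Single linear classifier on top of frozen features, five-fold cross-validation.
probe = LogisticRegression(max_iter=1000)
acc = cross_val_score(probe, features, labels, cv=5, scoring="accuracy")
f1 = cross_val_score(probe, features, labels, cv=5, scoring="f1_macro")
print(f"Accuracy: {acc.mean():.3f} ± {acc.std():.3f}, F1: {f1.mean():.3f}")
```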
For the image-text matching task, the vision-language model matches the given image representations/embeddings to the corresponding text among several candidate texts. We calculated the cosine similarity between the embeddings of a 2D/3D image and all text candidates, and assessed whether RadCLIP’s extracted features/embeddings can be matched to the correct text embeddings, a strategy commonly used in zero-shot learning. (Figure 4 C-D) To evaluate the model’s ability on cross-modal image-text matching tasks, we calculated the top-1 precision using cosine similarity between encoded image and text representations.
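The matching protocol can be sketched as follows, assuming image embeddings and one embedding per candidate caption produced by the frozen encoders; top-1 precision is the fraction of images whose most similar caption is the correct one. The tensor shapes and random inputs are placeholders.

```python
import torch
import torch.nn.functional as F

def top1_precision(image_embeds: torch.Tensor,
                   text_embeds: torch.Tensor,
                   labels: torch.Tensor) -> float:
    """image_embeds: (N, D); text_embeds: (C, D) candidate captions;
    labels: (N,) index of the correct caption for each image."""
    sim = F.normalize(image_embeds, dim=-1) @ F.normalize(text_embeds, dim=-1).t()
    predictions = sim.argmax(dim=-1)          # most similar caption per image
    return (predictions == labels).float().mean().item()

# Example: 100 image embeddings matched against 4 candidate class captions.
score = top1_precision(torch.randn(100, 512), torch.randn(4, 512),
                       torch.randint(0, 4, (100,)))
```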
We compared RadCLIP against several peer models commonly used in medical image analysis. These models included ResNet50, Vision Transformer (ViT), Swin Transformer (SwinT), SimCLR, MoCo V2, and MedViT. We also included several vision-language models in the comparison: CLIP, CoCa, and PMC-CLIP.
IV-C Results
IV-C1 Unimodal Image Classification Performance
 | | ChestXpert (5 Classes, 2D) | | Crystal Clean (4 Classes, 2D) | | IXI Brain (2 Classes, 3D) | | COVID-CT-MD (3 Classes, 3D) | 
Model Name | VLP | Acc (%) | F1 (%) | Acc (%) | F1 (%) | Acc (%) | F1 (%) | Acc (%) | F1 (%) |
ResNet50 [54] | N | 41.98 ± 1.16 | 41.74 ± 4.20 | 65.67 ± 2.71 | 57.73 ± 23.28 | 92.65 ± 1.23 | 91.65 ± 1.23 | 58.93 ± 11.01 | 51.43 ± 13.34 |
ViT [55] | N | 45.02 ± 1.97 | 44.80 ± 5.51 | 72.67 ± 6.20 | 71.41 ± 13.14 | 94.15 ± 1.53 | 94.09 ± 0.83 | 62.07 ± 5.83 | 57.07 ± 11.45 |
SwinT [56] | N | 44.48 ± 1.27 | 44.27 ± 5.60 | 70.67 ± 4.78 | 69.38 ± 14.10 | 92.58 ± 0.44 | 92.57 ± 0.44 | 63.31 ± 6.07 | 57.56 ± 9.24 |
SimCLR [52] | N | 45.59 ± 2.13 | 44.51 ± 5.05 | 70.22 ± 6.12 | 69.84 ± 13.20 | 94.15 ± 1.53 | 94.09 ± 0.83 | 62.79 ± 7.24 | 56.79 ± 6.27 |
MoCo V2 [57] | N | 46.27 ± 2.55 | 46.20 ± 4.72 | 71.54 ± 7.48 | 70.63 ± 15.30 | 93.21 ± 0.84 | 93.19 ± 1.71 | 61.32 ± 9.24 | 54.79 ± 10.24 |
MedViT [13] | N | 47.42 ± 1.33 | 46.95 ± 4.93 | 72.59 ± 6.52 | 71.77 ± 13.99 | 94.76 ± 1.67 | 94.26 ± 0.67 | 61.95 ± 6.59 | 55.53 ± 12.64 |
CLIP [1] | Y | 41.44 ± 2.23 | 40.44 ± 9.54 | 81.00 ± 3.59 | 80.42 ± 8.11 | 94.21 ± 0.11 | 93.88 ± 0.19 | 63.93 ± 6.13 | 56.58 ± 11.45 |
CoCa [7] | Y | 42.51 ± 1.85 | 41.53 ± 10.08 | 78.27 ± 5.22 | 79.33 ± 10.87 | 96.11 ± 0.91 | 96.11 ± 0.91 | 62.95 ± 4.93 | 55.09 ± 9.92 |
PMC-CLIP [18] | Y | 48.60 ± 1.64 | 46.63 ± 5.53 | 81.35 ± 3.79 | 79.63 ± 9.79 | 95.76 ± 0.67 | 95.76 ± 0.67 | 57.70 ± 6.51 | 53.32 ± 6.99
RadCLIP (ours) | Y | 51.46 ± 1.32 | 51.54 ± 4.15 | 86.00 ± 6.02 | 87.11 ± 9.80 | 95.58 ± 1.49 | 95.57 ± 1.50 | 67.87 ± 2.66 | 65.39 ± 3.29 |
We conducted unimodal image classification tasks to evaluate RadCLIP and other peer models as foundational models. We used four external datasets: ChestXpert, Crystal Clean, IXI Brain, and COVID-CT-MD. Table I displays the performance on all downstream image classification tasks.
On the ChestXpert dataset, all models were trained to classify five diseases (Pneumothorax, Pleural Effusion, Edema, Atelectasis, and Lung Lesion) using 2D CXR images. Following prior work, we created a 5,000-sample evaluation set from the original ChestXpert dataset [14, 17]. In this task, our RadCLIP model demonstrated superior performance, achieving an accuracy of 51.46% and an F1 score of 51.54%. The PMC-CLIP model achieved the second-best accuracy of 48.60%, while the MedViT model had the second-best F1 score of 46.95%.
With the Crystal Clean dataset, the downstream task is to separate four brain conditions (Normal, Pituitary Tumor, Meningioma Tumor, and Glioma Tumor) using 2D brain MRI images. Our RadCLIP had the best accuracy of 86.00% and F1 score of 87.11%. PMC-CLIP achieved the second-best accuracy of 81.35%, and the CLIP model had the second-best F1 score of 80.42%. We noted that the vision-language models (lower part of the table) had better performance than the vision models (upper part of the table).
Using the IXI Brain dataset, the downstream application for all models is to distinguish the gender of human subjects using 3D brain T1 MRI images. The CoCa model had the best performance (96.11% accuracy and 96.11% F1 score), while our RadCLIP model achieved a comparable accuracy of 95.58% and F1 score of 95.57%.
On the COVID-CT-MD dataset, these foundation models classify three lung conditions (Normal Lung, COVID, and Pneumonia) using 3D CT images. The RadCLIP had the best accuracy of 67.87% and an F1 score of 65.39%.
The results of the unimodal image classification showed that RadCLIP outperformed or achieved comparable performance to other peer foundation models in all evaluation datasets. This indicates that the RadCLIP can be used as a foundation vision model to enhance radiologic image analysis. The superior performance also suggests that the RadCLIP model is capable of generating robust image representations from 2D/3D radiologic images.
IV-C2 Cross-Modal Image-Text Matching
Models | ChestXpert P@1 (%) | Crystal Clean P@1 (%) | IXI Brain P@1 (%) | COVID-CT-MD P@1 (%)
---|---|---|---|---
CLIP | 20.41 | 15.83 | 50.18 | 19.67 |
CoCa | 20.32 | 18.04 | 49.46 | 30.82 |
PMC-CLIP | 19.11 | 15.83 | 55.12 | 40.33
RadCLIP | 23.90 | 27.22 | 57.07 | 51.15 |
To assess RadCLIP’s cross-modal proficiency in the radiology domain, we further utilized classic image-text matching as a downstream task. For example, given an MRI image with a brain glioma tumor, a robust vision-language model can generate an image embedding that is closer to the text embedding of [brain glioma tumor], and farther away from the text embedding of [normal brain].
Our analysis utilized external datasets to ensure a fair evaluation. For vision-language models that contain only a 2D image encoder, we used global average pooling for 2D-to-3D adaptation. Due to the nature of image-text matching tasks, we only compared RadCLIP with other vision-language models, including CLIP, CoCa, and PMC-CLIP, in this experiment. The top-1 precision results are presented in Table II. RadCLIP achieved a top-1 precision of 23.90% on the ChestXpert dataset, 27.22% on the Crystal Clean dataset, 57.07% on the IXI Brain dataset, and 51.15% on the COVID-CT-MD dataset. Our RadCLIP consistently outperformed the other peer vision-language models in the image-text matching tasks. The other vision-language models also performed reasonably; in particular, PMC-CLIP had the second-best top-1 precision on both 3D datasets (IXI Brain and COVID-CT-MD).
Our results demonstrate RadCLIP’s ability to effectively bridge the gap between visual and textual information in the radiology domain. This experiment highlights RadCLIP’s potential for applications requiring cross-modal understanding, such as automated report generation and diagnostic assistance. These results affirm the model’s versatility and effectiveness in integrating visual and textual data for complicated radiologic analysis.
IV-C3 Embedding Visualization

Next, we continue to show the effectiveness of RadCLIP’s representation learning by plotting t-distributed Stochastic Neighbor Embedding (t-SNE) [58] visualizations of image embeddings for the four evaluation datasets (ChestXpert, Crystal Clean, IXI Brain, and COVID-CT-MD). As shown in Figure 5, we compared image embeddings from our RadCLIP with CLIP, CoCa, and PMC-CLIP. Visually, RadCLIP produces better-clustered image embeddings than the other peer vision-language models on the ChestXpert, Crystal Clean, and COVID-CT-MD datasets. For instance, on the Crystal Clean dataset, the embeddings of images with gliomas are grouped into a separate cluster, while other models cannot distinguish gliomas during representation learning. This also explains why RadCLIP had much higher performance on the Crystal Clean dataset than other models in Table I.
In contrast, our RadCLIP produces clusters with higher dispersion on the IXI Brain dataset, while CoCa generates the most compact clusters, which is consistent with the performance in Table I. Because of the high inter-class similarity of radiologic images and their subtle pathological differences, embedding clustering in the radiologic image domain is much more difficult than in general object classification. In general, our RadCLIP still shows better clustering of the embeddings in the t-SNE visualization [58].
IV-C4 Ablation Study of Two RadCLIP components
Pretraining Setup | IXI Brain (Classification) | | COVID-CT-MD (Classification) | | IXI Brain (Image-Text Matching) | COVID-CT-MD (Image-Text Matching)
 | Acc (%) | F1 (%) | Acc (%) | F1 (%) | P@1 (%) | P@1 (%)
CLIP + Global Average Pooling | 94.21 | 93.88 | 63.93 | 56.58 | 50.18 | 19.67
CLIP + Trained Slice Pooling Adapter | 94.89 | 94.89 | 64.58 | 58.01 | 55.30 | 23.43 |
RadCLIP (Fine-Tuned 2D Image Encoder) + Global Average Pooling | 95.07 | 94.93 | 66.54 | 64.73 | 54.95 | 50.20
RadCLIP (Fine-Tuned 2D Image Encoder + Trained Slice Pooling Adapter) | 95.58 | 95.57 | 67.87 | 65.39 | 57.07 | 51.15 |
We conducted an ablation study to evaluate the contributions of RadCLIP’s individual components by altering them and observing the impact on performance (Table III). In our first configuration, we used the vanilla weights of the image encoder and text encoder from the CLIP model without any domain-specific fine-tuning. We also applied global average pooling in this model to handle 3D radiologic images. This setup serves as our baseline to understand the performance of the CLIP model on radiologic tasks without any modifications. Next, we used the vanilla image encoder of CLIP together with our slice pooling adapter to test the impact of the slice pooling adapter. Then, we applied the fine-tuned 2D image encoder that was pre-trained on the radiology-specific image-text pairs, while global average pooling was used for handling 3D radiologic images. This is to evaluate the impact of fine-tuning on the model’s ability to capture radiology-specific features and improve performance over the baseline.
Compared to the CLIP baseline, both altered RadCLIP configurations show a small improvement, indicating that the fine-tuned 2D image encoder and the slice pooling adapter each contribute to the image classification and image-text matching applications. Lastly, we noticed that the full RadCLIP (i.e., the fine-tuned 2D image encoder combined with the trained slice pooling adapter) outperformed the alternative configurations, showing the effectiveness of utilizing both components.
V Conclusion
The integration of RadCLIP into radiologic image analysis represents a remarkable development in the field of medical imaging. By leveraging the VLP framework of CLIP, RadCLIP bridges the gap between radiologic imagery and textual data. The model’s ability to align 2D/3D radiologic images with their corresponding text annotations not only enhances diagnostic accuracy but also streamlines clinical workflows by providing robust, interpretable image representations. Our experiments demonstrate RadCLIP’s strong performance in both unimodal and cross-modal downstream tasks. Our RadCLIP has the potential for various applications in medical diagnostics, including automated report generation and diagnostic support.
RadCLIP has some limitations that require further exploration. Reliance on a diverse yet finite dataset may not fully capture the spectrum of radiologic imaging variations found in clinical settings. While our dataset covered a wide range of modalities and conditions, validating the model’s performance with more extensive, real-world clinical data would be beneficial. Additionally, the 3D slice pooling mechanism, while innovative, introduces complexity in model training and interpretation, potentially necessitating additional computational resources and optimization techniques. The fixed textual encoder, while preserving language understanding, may also limit the model’s adaptability to evolving medical terminologies and nuanced diagnostic language over time. Another notable limitation is the exclusion of US data from the model design and training. US imaging, often available as video data, presents unique challenges such as data sparsity in public datasets and the inherent complexity of video data processing, which were not addressed in this study.
In summary, RadCLIP offers a promising approach to enhance radiologic image analysis through advanced vision-language pretraining techniques. The model’s fine-tuned radiologic image encoder, along with its novel slice-wise attention mechanism, highlights its potential to improve diagnostic accuracy and efficiency in the medical imaging domain. Our evaluations show that RadCLIP excels in representing radiologic images and effectively aligns these representations with textual descriptions, paving the way for more integrated and intelligent diagnostic tools. Future work will aim to expand the dataset, refine the 3D pooling mechanism, and dynamically adapt the textual encoder to ensure RadCLIP continues to advance medical imaging technology.
References
- [1] Y. Zhang, H. Jiang, Y. Miura, C. D. Manning, and C. P. Langlotz, “Contrastive learning of medical visual representations from paired images and text,” arXiv.org, Oct. 02 2020. [Online]. Available: https://arxiv.org/abs/2010.00747
- [2] H. E. Kim, A. Cosa-Linan, N. Santhanam, M. Jannesari, M. E. Maros, and T. Ganslandt, “Transfer learning for medical image classification: a literature review,” BMC Medical Imaging, vol. 22, no. 1, pp. 1–13, Apr. 2022.
- [3] K. Smith et al., “What makes transfer learning work for medical images: Feature reuse & other factors,” 2022.
- [4] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in 2009 IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 248–255.
- [5] X. Mei, Z. Liu, P. M. Robson, B. Marinelli, M. Huang, A. Doshi, A. Jacobi, C. Cao, K. E. Link, T. Yang, Y. Wang, H. Greenspan, T. Deyer, Z. A. Fayad, and Y. Yang, “Radimagenet: An open radiologic deep learning research dataset for effective transfer learning,” Radiology: Artificial Intelligence, vol. 4, no. 5, p. e210315, 2022. [Online]. Available: https://doi.org/10.1148/ryai.210315
- [6] Z. Liu, H. Jiang, T. Zhong, Z. Wu, C. Ma, Y. Li, X. Yu, Y. Zhang, Y. Pan, P. Shu, Y. Lyu, L. Zhang, J. Yao, P. Dong, C. Cao, Z. Xiao, J. Wang, H. Zhao, S. Xu, Y. Wei, J. Chen, H. Dai, P. Wang, H. He, Z. Wang, X. Wang, X. Zhang, L. Zhao, Y. Liu, K. Zhang, L. Yan, L. Sun, J. Liu, N. Qiang, B. Ge, X. Cai, S. Zhao, X. Hu, Y. Yuan, G. Li, S. Zhang, X. Zhang, X. Jiang, T. Zhang, D. Shen, Q. Li, W. Liu, X. Li, D. Zhu, and T. Liu, “Holistic evaluation of gpt-4v for biomedical imaging,” nov 2023. [Online]. Available: https://arxiv.org/abs/2312.05256
- [7] J. Yu, Z. Wang, V. Vasudevan, L. Yeung, M. Seyedhosseini, and Y. Wu, “Coca: Contrastive captioners are image-text foundation models,” arXiv.org, May 04 2022. [Online]. Available: https://arxiv.org/abs/2205.01917
- [8] C. Jia, Y. Yang, Y. Xia, Y.-T. Chen, Z. Parekh, H. Pham, Q. V. Le, Y. Sung, Z. Li, and T. Duerig, “Scaling up visual and vision-language representation learning with noisy text supervision,” 2021. [Online]. Available: https://arxiv.org/abs/2102.05918
- [9] Y. Zhang, Y. Pan, T. Zhong, P. Dong, K. Xie, Y. Liu, H. Jiang, Z. Liu, S. Zhao, T. Zhang, X. Jiang, D. Shen, T. Liu, and X. Zhang, “Potential of multimodal large language models for data mining of medical images and free-text reports,” 2024. [Online]. Available: https://arxiv.org/abs/2407.05758
- [10] M. Moor, O. Banerjee, Z. S. H. Abad, H. M. Krumholz, J. Leskovec, E. J. Topol, and P. Rajpurkar, “Foundation models for generalist medical artificial intelligence,” Nature, vol. 616, no. 7956, pp. 259–265, 2023. [Online]. Available: https://doi.org/10.1038/s41586-023-05881-4
- [11] S. Srivastav, R. Chandrakar, S. Gupta, V. Babhulkar, S. Agrawal, A. Jaiswal, R. Prasad, and M. B. Wanjari, “Chatgpt in radiology: The advantages and limitations of artificial intelligence for medical imaging diagnosis,” Cureus, vol. 15, no. 7, p. e41435, July 2023.
- [12] B. Azad, R. Azad, S. Eskandari, A. Bozorgpour, A. Kazerouni, I. Rekik, and D. Merhof, “Foundational models in medical imaging: A comprehensive survey and future vision,” 2023, arXiv preprint. [Online]. Available: https://arxiv.org/abs/2310.18689
- [13] O. N. Manzari, H. Ahmadabadi, H. Kashiani, S. B. Shokouhi, and A. Ayatollahi, “Medvit: A robust vision transformer for generalized medical image classification,” Computers in Biology and Medicine, vol. 157, p. 106791, 2023. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0010482523002561
- [14] S. Yeung et al., “Gloria: A multimodal global-local representation learning framework for label-efficient medical image recognition,” 2023.
- [15] Y. Zhang, S.-C. Huang, Z. Zhou, M. P. Lungren, and S. Yeung, “Adapting pre-trained vision transformers from 2d to 3d through weight inflation improves medical image segmentation,” arXiv.org, Feb. 08 2023. [Online]. Available: https://arxiv.org/abs/2302.04303
- [16] K. You et al., “Cxr-clip: Toward large scale chest x-ray language-image pre-training,” in Lecture Notes in Computer Science, 2023, pp. 101–111. [Online]. Available: http://dx.doi.org/10.1007/978-3-031-43895-0_10
- [17] Z. Wang, Z. Wu, D. Agarwal, and J. Sun, “Medclip: Contrastive learning from unpaired medical images and text,” in Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022. [Online]. Available: http://dx.doi.org/10.18653/v1/2022.emnlp-main.256
- [18] W. Lin et al., “Pmc-clip: Contrastive language-image pre-training using biomedical documents,” arXiv.org, Mar. 13 2023. [Online]. Available: https://arxiv.org/abs/2303.07240
- [19] K. Desai and J. Johnson, “Virtex: Learning visual representations from textual annotations,” 2021. [Online]. Available: https://arxiv.org/abs/2006.06666
- [20] J. Li, D. Li, C. Xiong, and S. Hoi, “Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation,” arXiv.org, Jan. 28 2022. [Online]. Available: https://arxiv.org/abs/2201.12086
- [21] J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y. Hasson, K. Lenc, A. Mensch, K. Millican, M. Reynolds, R. Ring, E. Rutherford, S. Cabi, T. Han, Z. Gong, S. Samangooei, M. Monteiro, J. Menick, S. Borgeaud, A. Brock, A. Nematzadeh, S. Sharifzadeh, M. Binkowski, R. Barreira, O. Vinyals, A. Zisserman, and K. Simonyan, “Flamingo: a visual language model for few-shot learning,” 2022.
- [22] A. Vaswani et al., “Attention is all you need,” arXiv.org, Jun. 12 2017. [Online]. Available: https://arxiv.org/abs/1706.03762
- [23] L. H. Li, P. Zhang, H. Zhang, J. Yang, C. Li, Y. Zhong, L. Wang, L. Yuan, L. Zhang, J.-N. Hwang, K.-W. Chang, and J. Gao, “Grounded language-image pre-training,” in 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 10 955–10 965.
- [24] M. Tsimpoukelli, J. Menick, S. Cabi, S. M. A. Eslami, O. Vinyals, and F. Hill, “Multimodal few-shot learning with frozen language models,” arXiv.org, Jun. 25 2021. [Online]. Available: https://arxiv.org/abs/2106.13884
- [25] X. Wei, T. Zhang, Y. Li, Y. Zhang, and F. Wu, “Multi-modality cross attention network for image and sentence matching,” in 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 10 938–10 947.
- [26] D. Shen, G. Wu, and H.-I. Suk, “Deep learning in medical image analysis,” Annual Review of Biomedical Engineering, vol. 19, pp. 221–248, 2017.
- [27] Y. Wang, Z. Fan, T. Chen, H. Fan, and Z. Wang, “Can we solve 3d vision tasks starting from a 2d vision transformer?” 2022. [Online]. Available: https://arxiv.org/abs/2209.07026
- [28] H. Rasheed, M. U. Khattak, M. Maaz, S. Khan, and F. S. Khan, “Fine-tuned clip models are efficient video learners,” 2023. [Online]. Available: https://arxiv.org/abs/2212.03640
- [29] G. Litjens, T. Kooi, B. E. Bejnordi, A. A. A. Setio, F. Ciompi, M. Ghafoorian, J. A. van der Laak, B. van Ginneken, and C. I. Sánchez, “A survey on deep learning in medical image analysis,” Medical Image Analysis, vol. 42, pp. 60–88, 2017. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S1361841517301135
- [30] X. Wang, S. Han, Y. Chen, D. Gao, and N. Vasconcelos, Volumetric Attention for 3D Medical Image Segmentation and Detection. Springer International Publishing, 2019, p. 175–184. [Online]. Available: http://dx.doi.org/10.1007/978-3-030-32226-7_20
- [31] X. Liu, H.-F. Yu, I. Dhillon, and C.-J. Hsieh, “Learning to encode position for transformer with continuous dynamical model,” 2020. [Online]. Available: https://arxiv.org/abs/2003.09229
- [32] A. van den Oord, Y. Li, and O. Vinyals, “Representation learning with contrastive predictive coding,” 2019. [Online]. Available: https://arxiv.org/abs/1807.03748
- [33] “Cmb-crc,” The Cancer Imaging Archive (TCIA), Nov. 20 2023. [Online]. Available: https://www.cancerimagingarchive.net/collection/cmb-crc/
- [34] “Cmb-lca,” The Cancer Imaging Archive (TCIA), Nov. 20 2023. [Online]. Available: https://www.cancerimagingarchive.net/collection/cmb-lca/
- [35] S. P. Morozov, A. E. Andreychenko, N. A. Pavlov, A. V. Vladzymyrskyy, N. V. Ledikhova, V. A. Gombolevskiy, I. A. Blokhin, P. B. Gelezhe, A. V. Gonchar, and V. Y. Chernina, “Mosmeddata: Chest ct scans with covid-19 related findings dataset,” 2020. [Online]. Available: https://arxiv.org/abs/2005.06465
- [36] Y. Wang, J. A. Macdonald, K. R. Morgan, D. Hom, S. Cubberley, K. Sollace, N. Casasanto, I. H. Zaki, K. J. Lafata, and M. R. Bashir, “Duke spleen data set: A publicly available spleen mri and ct dataset for training segmentation,” 2023. [Online]. Available: https://arxiv.org/abs/2305.05732
- [37] “Ispy1,” The Cancer Imaging Archive (TCIA), Nov. 20 2023. [Online]. Available: https://www.cancerimagingarchive.net/collection/ispy1/
- [38] F. Knoll, J. Zbontar, A. Sriram, M. J. Muckley, M. Bruno, A. Defazio, M. Parente, K. J. Geras, J. Katsnelson, H. Chandarana, Z. Zhang, M. Drozdzalv, A. Romero, M. Rabbat, P. Vincent, J. Pinkerton, D. Wang, N. Yakubova, E. Owens, C. L. Zitnick, M. P. Recht, D. K. Sodickson, and Y. W. Lui, “fastmri: A publicly available raw k-space and dicom dataset of knee images for accelerated mr image reconstruction using machine learning,” Radiology: Artificial Intelligence, vol. 2, no. 1, p. e190007, 2020, pMID: 32076662. [Online]. Available: https://doi.org/10.1148/ryai.2020190007
- [39] J. Zbontar, F. Knoll, A. Sriram, T. Murrell, Z. Huang, M. J. Muckley, A. Defazio, R. Stern, P. Johnson, M. Bruno, M. Parente, K. J. Geras, J. Katsnelson, H. Chandarana, Z. Zhang, M. Drozdzal, A. Romero, M. Rabbat, P. Vincent, N. Yakubova, J. Pinkerton, D. Wang, E. Owens, C. L. Zitnick, M. P. Recht, D. K. Sodickson, and Y. W. Lui, “fastmri: An open dataset and benchmarks for accelerated mri,” 2019. [Online]. Available: https://arxiv.org/abs/1811.08839
- [40] K. AMC, U. LQ, B. BB, C. FX, and M. MP, “”flanker task (event-related)”,” 2018.
- [41] A. Saha, J. S. Bosma, J. J. Twilt, B. van Ginneken, A. Bjartell, A. R. Padhani, D. Bonekamp, G. Villeirs, G. Salomon, G. Giannarini, J. Kalpathy-Cramer, J. Barentsz, K. H. Maier-Hein, M. Rusu, O. Rouvière, R. van den Bergh, V. Panebianco, V. Kasivisvanathan, N. A. Obuchowski, D. Yakar, M. Elschot, J. Veltman, J. J. Fütterer, M. de Rooij, H. Huisman et al., “Artificial intelligence and radiologists in prostate cancer detection on mri (pi-cai): an international, paired, non-inferiority, confirmatory study,” The Lancet Oncology, vol. 25, no. 7, pp. 879–887, 2024, published online: 2024-07-01. [Online]. Available: https://doi.org/10.1016/S1470-2045(24)00220-1
- [42] S. Natarajan, A. Priester, D. Margolis, J. Huang, and L. Marks, “Prostate mri and ultrasound with pathology and coordinates of tracked biopsy (prostate-mri-us-biopsy) (version 2),” https://doi.org/10.7937/TCIA.2020.A61IOC1A, 2020, data set.
- [43] A. S. Chaudhari, K. J. Stevens, B. Sveinsson, J. P. Wood, C. F. Beaulieu, E. H. Oei, J. K. Rosenberg, F. Kogan, M. T. Alley, G. E. Gold, and B. A. Hargreaves, “Combined 5-minute double-echo in steady-state with separated echoes and 2-minute proton-density-weighted 2d fse sequence for comprehensive whole-joint knee mri assessment,” Journal of Magnetic Resonance Imaging, vol. 49, no. 7, pp. e183–e194, 2019. [Online]. Available: https://onlinelibrary.wiley.com/doi/abs/10.1002/jmri.26582
- [44] G. Shih, C. C. Wu, S. S. Halabi, M. D. Kohli, L. M. Prevedello, T. S. Cook, A. Sharma, J. K. Amorosa, V. Arteaga, M. Galperin-Aizenberg, R. R. Gill, M. C. Godoy, S. Hobbs, J. Jeudy, A. Laroia, P. N. Shah, D. Vummidi, K. Yaddanapudi, and A. Stein, “Augmenting the national institutes of health chest radiograph dataset with expert annotations of possible pneumonia,” Radiology: Artificial Intelligence, vol. 1, no. 1, p. e180041, 2019, pMID: 33937785. [Online]. Available: https://doi.org/10.1148/ryai.2019180041
- [45] E. Farina and F. Kitamura, “Unifesp x-ray body part classifier competition,” https://kaggle.com/competitions/unifesp-x-ray-body-part-classifier, 2022, accessed: 2024-04-13.
- [46] “Cptac-pda,” The Cancer Imaging Archive (TCIA), Nov. 20 2023. [Online]. Available: https://www.cancerimagingarchive.net/collection/cptac-pda/
- [47] J. Yang et al., “Medmnist v2 - a large-scale lightweight benchmark for 2d and 3d biomedical image classification,” Scientific Data, vol. 10, no. 1, Jan. 2023.
- [48] J. Irvin et al., “Chexpert: A large chest radiograph dataset with uncertainty labels and expert comparison,” Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, no. 01, pp. 590–597, Jul. 2019.
- [49] S. M. H. Hashemi, “Crystal clean: Brain tumors mri dataset,” 2023. [Online]. Available: https://www.kaggle.com/ds/3505991
- [50] B. I. A. Group, “IXI Dataset,” http://brain-development.org/ixi-dataset/, n.d.
- [51] P. Afshar, S. Heidarian, N. Enshaei, F. Naderkhani, M. J. Rafiee, A. Oikonomou, F. B. Fard, K. Samimi, K. N. Plataniotis, and A. Mohammadi, “Covid-ct-md, covid-19 computed tomography scan dataset applicable in machine learning and deep learning,” Scientific Data, vol. 8, no. 1, p. 121, 2021. [Online]. Available: https://doi.org/10.1038/s41597-021-00900-3
- [52] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A simple framework for contrastive learning of visual representations,” 2020. [Online]. Available: https://arxiv.org/abs/2002.05709
- [53] M. Caron, I. Misra, J. Mairal, P. Goyal, P. Bojanowski, and A. Joulin, “Unsupervised learning of visual features by contrasting cluster assignments,” 2021. [Online]. Available: https://arxiv.org/abs/2006.09882
- [54] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” arXiv.org, Dec. 10 2015. [Online]. Available: https://arxiv.org/abs/1512.03385
- [55] A. Dosovitskiy et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv.org, Oct. 22 2020. [Online]. Available: https://arxiv.org/abs/2010.11929
- [56] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” 2021. [Online]. Available: https://arxiv.org/abs/2103.14030
- [57] X. Chen, H. Fan, R. Girshick, and K. He, “Improved baselines with momentum contrastive learning,” 2020. [Online]. Available: https://arxiv.org/abs/2003.04297
- [58] L. van der Maaten and G. E. Hinton, “Visualizing data using t-sne,” Journal of Machine Learning Research, vol. 9, pp. 2579–2605, 2008. [Online]. Available: https://api.semanticscholar.org/CorpusID:5855042