Universal Medical Image Representation Learning with Compositional Decoders
Abstract
Visual-language models have advanced the development of universal models, yet their application in medical imaging remains constrained by specific functional requirements and limited data. Current general-purpose models are typically designed with task-specific branches and heads, which restricts the shared feature space and the flexibility of the model. To address these challenges, we develop a decomposed-composed universal medical imaging paradigm (UniMed) that supports tasks at all levels. To this end, we first propose a decomposed decoder that predicts two types of outputs, pixel and semantic, based on a defined input queue. We further introduce a composed decoder that unifies the input and output spaces and standardizes task annotations across different levels into a discrete token format. The coupled design of these two components enables the model to flexibly combine tasks and benefit from cross-task interactions. Moreover, our joint representation learning strategy effectively leverages large amounts of unlabeled data and an unsupervised loss, achieving efficient one-stage pretraining for more robust performance. Experimental results show that UniMed achieves state-of-the-art performance on eight datasets across all three tasks and exhibits strong zero-shot and 100-shot transferability. We will release the code and trained models upon the paper’s acceptance.
Introduction

Vision-language models have achieved significant success in establishing universal frameworks that not only reduce the cost of processing different tasks but also support collaboration among them (Radford et al. 2021; Alayrac et al. 2022). However, for a universal model in medical image analysis to be viable in real clinical settings (Moor et al. 2023), it requires: 1) the versatility to simultaneously handle semantic understanding and visual tasks (e.g., not only locating lesions but also identifying their types); 2) seamless transition between different tasks, allowing users to tailor functionality to the specific scenario (e.g., toggling between detection and segmentation depending on whether lesion screening or resection is involved); and 3) robust transferability, ensuring high-quality predictions even when confronted with new data.
The performance of universal models is largely driven by increasingly complete data, which imposes great demands on data construction and maintenance (Liu et al. 2023). However, the scale of data available in medical imaging is relatively limited compared to natural images due to the high cost and expertise required for data collection and annotation. Consequently, current research focuses on optimizing models on fixed data (Wang et al. 2022b; Zhou et al. 2023). Such specialized frameworks often encounter sudden performance drops when applied to other datasets or tasks. In reality, we observe that for any modality of medical imaging (such as endoscopic imaging), a large amount of data can be aggregated from existing public datasets. However, the diversity among these datasets hinders their direct integration and use. Therefore, this work emphasizes that the key is to maximize the effectiveness of existing medical image data in developing universal models.
One challenge in designing such a model is the different levels of annotation across tasks. For example (Fig. 1 a, b), the classification task involves semantic annotation at the image level, the segmentation task requires pixel-level annotation, and the referring segmentation task combines text and pixel-level annotations. Recently, attempts have been made to develop multi-task learning models (Qin et al. 2022; Wan et al. 2024), which have demonstrated encouraging cross-task generalization capabilities. However, most of these studies involve additional branches or heads, leading to increased model complexity and difficulties in balancing tasks. Furthermore, the unification of all levels of tasks in medical images, whether at the image, region, or pixel level, has yet to be fully achieved.
Another challenge is the different content of annotations among datasets (Fig. 1 b). Unlike fully labeled benchmarks in natural image analysis, some medical imaging datasets contain annotations only for lesions, while others provide annotations only for instruments. Mainstream methods (Yu et al. 2019; Isensee et al. 2021) address this issue by splitting differently labeled datasets into several subsets and training a network on each subset to complete a specific task. While this strategy is intuitive, it significantly increases computational complexity. Moreover, this design limits the sharing of knowledge across different annotations, leaving the common semantic space for task understanding largely unexplored.
Based on the above observations, we propose a universal medical image analysis model (UniMed) capable of performing various medical imaging tasks at all levels, including the image, region, and pixel levels (Fig. 1 d). Specifically, to reconcile the diversity in annotations across tasks, we introduce a composed decoder that standardizes annotations into a discrete label format. We design a decomposed decoder that, instead of dividing the output by task, decomposes all tasks into pixel-level and semantic-level components. The innovative coupling of the decomposed and composed decoders unifies the input and output spaces, enabling flexible combinations that seamlessly support various task interactions. Additionally, UniMed is equipped with an annotation understanding branch at the input stage to encode the nouns and texts in each task, promoting the learning of a shared visual-semantic space that accommodates the inherent diversity of tasks. Furthermore, to leverage large amounts of unlabeled data, we propose a joint representation learning strategy in which unlabeled data guides the encoder to extract effective representations through contrastive learning. The key differentiator of our approach is its end-to-end framework without supplementary branches or modules.
Our main contributions are summarized as follows:
• A new universal medical image paradigm. Different from the traditional vision-language approach, we are the first to start from the limitations of data and propose a universal paradigm adaptable to diverse tasks and datasets for medical image learning.
• A decomposed-composed way for multi-tasking. We propose a unified separable decoder that formulates all tasks as unified processing at the pixel and semantic levels. Combined with text sequences on the input side, it supports all tasks and promotes cross-task collaboration.
• An efficient algorithm for representation learning. We propose an effective learning strategy that fully utilizes unlabeled data through contrastive learning and jointly improves the transferability of the model with labeled data.
• Significant experimental improvements. The proposed model demonstrates strong zero-shot and 100-shot capabilities on eight datasets for three medical tasks and exceeds current state-of-the-art specialized and generalist methods after fine-tuning.
Related Work
Medical Image Analysis Tasks
In the field of medical image analysis, three tasks are predominantly recognized: lesion detection, classification, and segmentation (Chen et al. 2022b). Detection entails identifying the location of a lesion within an image based on a textual query (Tiu et al. 2022), which is crucial for clinically finding and localizing abnormalities. Lesion classification involves assigning a class label to an image (Wang et al. 2022a) or a specified target region (Murtaza et al. 2020). Segmentation requires generating pixel-level labels for an entire image (Ronneberger, Fischer, and Brox 2015), aiding the clinical demarcation of lesion boundaries. Multi-task models for these tasks typically use a shared visual backbone to produce visual embeddings (Liu et al. 2023), followed by individual branches tailored to each specific task. While such task-specific learning frameworks are effective on particular datasets, they lack generality and must be designed from scratch for new tasks.
Medical Universal Models
The emergence of large-scale models has significantly revolutionized the field of medical image analysis (Rajpurkar et al. 2022). Recent studies have increasingly focused on developing general-purpose medical artificial intelligence (AI) models (Moor et al. 2023). A notable trend is the adaptation of the Segment Anything Model (SAM) (Wu et al. 2023; Ma and Wang 2023; Cheng et al. 2023; Zhang and Liu 2023), which incorporates medical domain knowledge for medical image segmentation and applies it across various segmentation tasks. Concurrently, image-text models have garnered considerable attention (Liu et al. 2023; Ye et al. 2023; Zhao et al. 2023; Wang et al. 2022c). These models interpret image features through task-specific prompts, encompassing diverse modalities and domains, and represent a stride toward prompt-driven universal models. However, most current methodologies concentrate predominantly on medical image segmentation (Butoi et al. 2023) without adequately acknowledging the interconnectedness and uniformity across medical tasks. Our research aims to bridge this gap by proposing a unified approach capable of handling all three tasks simultaneously, enabling training with diverse annotations and tasks, and laying the foundation for more versatile and universally applicable medical image analysis.

Method
This paper establishes a universal model capable of simultaneously handling various medical tasks, enabling concurrent learning from diverse labeled and unlabeled data sources without task-specific parameters. UniMed (Fig. 2) contains several core components: a universal vision-language architecture, a decomposed decoder for task output, a composed decoder with unified annotations, and a data-efficient joint training strategy. This architecture enables the utilization of all annotated medical image types, promotes knowledge sharing across tasks, facilitates representation learning on labeled and unlabeled datasets, and benefits many downstream applications.
Universal Medical Vision-Language Architecture
UniMed comprises a visual-language encoder and a dual-output decoder architecture (Fig. 2). Features are learned through visual and textual encoders upon receiving an input image. Subsequently, guided by the task labels, a decoder is employed to autoregressively predict the sequence.
Visual Encoder. To standardize the input space into discrete tokens, the encoder must be transformer-based. The visual encoder therefore adopts the Swin Transformer as its backbone, given its widely proven effectiveness. Given an input image $I$, this component extracts features from each of its layers to derive the final multi-scale visual feature representation:
$F_v = \mathrm{Enc}_v(I) = \{F_v^{l}\}_{l=1}^{L}$ (1)
where $L$ is the number of layers.
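To make the multi-scale extraction concrete, the following is a minimal PyTorch sketch assuming a generic hierarchical backbone whose per-stage outputs serve as the levels of $F_v$; the stage design and dimensions are illustrative rather than UniMed's actual Swin/Focal configuration.

```python
import torch
import torch.nn as nn

class HierarchicalVisualEncoder(nn.Module):
    """Toy stand-in for a Swin-style backbone: each stage halves the spatial
    resolution and widens the channels, and the output of every stage is kept
    as one level of the multi-scale feature F_v."""

    def __init__(self, in_chans=3, dims=(96, 192, 384, 768)):
        super().__init__()
        stages, prev = [], in_chans
        for d in dims:
            stages.append(nn.Sequential(
                nn.Conv2d(prev, d, kernel_size=2, stride=2),  # patch-merging surrogate
                nn.GELU(),
            ))
            prev = d
        self.stages = nn.ModuleList(stages)

    def forward(self, image):
        feats, x = [], image
        for stage in self.stages:
            x = stage(x)
            feats.append(x)          # F_v^l for layer l
        return feats                 # multi-scale features {F_v^l}, l = 1..L

# usage: one 3x384x384 endoscopic frame -> 4 feature levels
encoder = HierarchicalVisualEncoder()
features = encoder(torch.randn(1, 3, 384, 384))
print([f.shape for f in features])
```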
Text Encoder. The text encoder is designed to capture annotation semantics and learn a broad spectrum of corpus knowledge, since the specific annotations vary from dataset to dataset and may focus on lesions, instruments, or multiple annotation types. For a piece of text $T$ generated from the annotations, SentencePiece (Kudo and Richardson 2018) is first employed to split the words and convert them into a discrete token sequence. The text encoder $\mathrm{Enc}_t$, consisting of multiple Transformer layers, then processes the input token sequence to form the text input queue $Q_t$:
$Q_t = \mathrm{Enc}_t(\mathrm{Tokenize}(T)) = \{q_t^{i}\}_{i=1}^{N}$ (2)
where $N$ is the length of the queue.
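A minimal sketch of this tokenize-then-encode step is given below, assuming a hypothetical SentencePiece model file (med_vocab.model) and a small bidirectional Transformer as a stand-in for the actual text encoder.

```python
import torch
import torch.nn as nn
import sentencepiece as spm

# Hypothetical tokenizer; any trained SentencePiece model file would work here.
sp = spm.SentencePieceProcessor(model_file="med_vocab.model")

class TextEncoder(nn.Module):
    """Embeds discrete SentencePiece tokens and runs them through a small
    Transformer to form the text queue Q_t (one feature vector per token)."""
    def __init__(self, vocab_size, dim=256, depth=4, heads=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, token_ids):                    # (B, N) token indices
        return self.encoder(self.embed(token_ids))   # Q_t: (B, N, dim)

tokens = torch.tensor([sp.encode("polyp, adenoma, cancer")])
text_queue = TextEncoder(sp.get_piece_size())(tokens)
```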
Decomposed Decoder. The inputs to the decoder are the visual features $F_v$, a text queue $Q_t$, and a general queue $Q_g$, while the outputs consist of a pixel output $O_p$ and a semantic output $O_s$. The flexible combination of these inputs and outputs supports both general and referring tasks. The decomposed decoder is composed of stacked Transformer layers. Each layer first captures the global features of the image by computing masked cross-attention between the queues and the visual features (Cheng et al. 2022), and then applies self-attention to generate the queue for the subsequent layer:
$\hat{Q}^{l} = \mathrm{MaskedCrossAttn}(Q^{l-1}, F_v)$ (3)
$Q^{l} = \mathrm{SelfAttn}(\hat{Q}^{l})$ (4)
where $l$ denotes the layer index. For general tasks, the last general query is utilized as the global image representation. For referring tasks, the text query serves as the referring feature, while the general query is combined with it to produce the final representation.
For the pixel-level output, the decoder utilizes the global image features from the general queue to produce $O_p$, facilitating a nuanced, fine-grained understanding of the image. For the semantic-level output, the decoder relies on both the general and text queues to support higher-level semantic understanding and generation.
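The following PyTorch sketch illustrates this decoder structure under simplifying assumptions: the Mask2Former-style attention mask is omitted, the class and mask-embedding heads are illustrative, and the pixel output is produced by a dot product between the transformed general queries and a visual feature map.

```python
import torch
import torch.nn as nn

class DecomposedDecoderLayer(nn.Module):
    """One layer: the concatenated queues cross-attend to the visual tokens,
    then self-attend to exchange information (mask restriction omitted)."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, queries, visual_tokens):
        q, _ = self.cross_attn(queries, visual_tokens, visual_tokens)  # Eq. (3)
        queries = self.norm1(queries + q)
        q, _ = self.self_attn(queries, queries, queries)               # Eq. (4)
        return self.norm2(queries + q)

class DecomposedDecoder(nn.Module):
    """Stacks layers and reads out a pixel output (one mask per general query)
    and a semantic output (one class logit vector per query)."""
    def __init__(self, dim=256, depth=6, num_classes=10):
        super().__init__()
        self.layers = nn.ModuleList(DecomposedDecoderLayer(dim) for _ in range(depth))
        self.mask_embed = nn.Linear(dim, dim)          # pixel path
        self.cls_embed = nn.Linear(dim, num_classes)   # semantic path

    def forward(self, general_q, text_q, visual_tokens, visual_map):
        queries = torch.cat([general_q, text_q], dim=1)
        for layer in self.layers:
            queries = layer(queries, visual_tokens)
        g = queries[:, : general_q.shape[1]]           # general queue slice
        pixel_out = torch.einsum("bqc,bchw->bqhw", self.mask_embed(g), visual_map)
        semantic_out = self.cls_embed(queries)
        return pixel_out, semantic_out

# toy usage: 100 general queries, 6 text tokens, 24x24 visual tokens
dec = DecomposedDecoder()
O_p, O_s = dec(torch.randn(1, 100, 256), torch.randn(1, 6, 256),
               torch.randn(1, 24 * 24, 256), torch.randn(1, 256, 96, 96))
```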
Overall operation. UniMed’s encoder encompasses the visual and text feature extraction branches, whereas the decomposed decoder $\mathrm{Dec}$ takes the visual features $F_v$, the text queue $Q_t$, and the general queue $Q_g$ to predict both pixel-level and semantic-level outputs (Cheng et al. 2022). The overall operation can be expressed as:
$\langle O_p, O_s \rangle = \mathrm{Dec}(F_v, Q_t, Q_g)$ (5)
Composed Decoders
Since the decomposed decoder represents the output in terms of semantic and pixel outputs, the composed decoder unifies the labels of different tasks into a format that can be expressed through these two outputs.
Unified annotations. For classification, the model outputs the image category, and the corresponding annotation is the category text. We match the text to the semantic output path by tokenizing it with SentencePiece. For detection, the model identifies the location of the target region, and the annotation is the diagonal coordinates of the bounding box. To encode this sparse structure, we expand the vocabulary with 1000 special coordinate tokens (Chen et al. 2021b). The bounding box is then represented by four tokens, two for the upper-left corner and two for the lower-right corner, which, together with the category token, serve as the semantic output. For segmentation, the model generates a result for each pixel, and the label is the pixel-level mask. The category annotation is processed as described above, and the label image is encoded into discrete tokens, enabling the simultaneous generation of pixel-level and semantic-level predictions.
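A minimal sketch of the assumed Pix2Seq-style box-to-token conversion: coordinates are normalized, quantized into 1000 bins, and offset past a base vocabulary whose size is a placeholder here.

```python
# Box-to-token conversion sketch: 1000 coordinate bins appended after the
# base text vocabulary. BASE_VOCAB is an assumed placeholder, not UniMed's value.
NUM_BINS = 1000
BASE_VOCAB = 32_000

def box_to_tokens(box, img_w, img_h, class_token):
    """box = (x1, y1, x2, y2) in pixels -> [x1, y1, x2, y2, class] token ids."""
    x1, y1, x2, y2 = box
    coords = [x1 / img_w, y1 / img_h, x2 / img_w, y2 / img_h]
    bins = [min(int(c * NUM_BINS), NUM_BINS - 1) for c in coords]
    return [BASE_VOCAB + b for b in bins] + [class_token]

def tokens_to_box(tokens, img_w, img_h):
    """Inverse mapping back to pixel coordinates (bin centers)."""
    bins = [(t - BASE_VOCAB + 0.5) / NUM_BINS for t in tokens[:4]]
    return (bins[0] * img_w, bins[1] * img_h, bins[2] * img_w, bins[3] * img_h)

# usage: a polyp box on a 512x512 frame, with a hypothetical class token id
seq = box_to_tokens((120, 80, 260, 210), 512, 512, class_token=7)
print(seq, tokens_to_box(seq, 512, 512))
```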
General Classification/Detection (Fig. 3 a)). General classification and detection rely only on the input image for prediction, leveraging the visual features and the general queue as inputs to the decomposed decoder. Through the composed decoder, all annotations are tokenized, and the classification or detection result is output directly through the semantic path:
$O_s = \mathrm{Dec}(F_v, Q_g)$ (6)
General Segmentation (Fig. 3 b)). Its decomposed decoder takes the same inputs as general classification and detection (Eq. 6), but the image is encoded and predictions are made by simultaneously generating pixel-level and semantic-level outputs:
$\langle O_p, O_s \rangle = \mathrm{Dec}(F_v, Q_g)$ (7)
Referring Classification/Detection (Fig. 3 c)). The referring task combines the visual features, text queue, and general queue as inputs and outputs predictions through the semantic path. This allows clinical practice to flexibly obtain precise localization and diagnostic predictions of specified lesions by giving additional text prompts:
$O_s = \mathrm{Dec}(F_v, Q_t, Q_g)$ (8)
Referring Segmentation (Fig. 3 d)). It requires the general (latent) queue and the text queue as inputs, so its formulation is the same as Eq. 5:
$\langle O_p, O_s \rangle = \mathrm{Dec}(F_v, Q_t, Q_g)$ (9)
Compared with Eq. 7, referring segmentation can be regarded as the general segmentation task under language conditions.

Through the combined arrangement of queues and outputs, UniMed can support a variety of medical imaging tasks (Fig. 3). This paper advocates achieving unity through functional cohesion rather than interface specification, thereby maximizing the shared use of common components across diverse tasks while preserving the independence of each task.
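The queue/output combinations of Eqs. (6)-(9) can be summarized as a small dispatch table; the names below are illustrative, not UniMed's actual interface.

```python
# Illustrative mapping from task to decoder inputs and output paths.
TASK_COMPOSITIONS = {
    "classification":           {"inputs": ("visual", "general"),         "outputs": ("semantic",)},
    "detection":                {"inputs": ("visual", "general"),         "outputs": ("semantic",)},
    "segmentation":             {"inputs": ("visual", "general"),         "outputs": ("pixel", "semantic")},
    "referring_classification": {"inputs": ("visual", "text", "general"), "outputs": ("semantic",)},
    "referring_detection":      {"inputs": ("visual", "text", "general"), "outputs": ("semantic",)},
    "referring_segmentation":   {"inputs": ("visual", "text", "general"), "outputs": ("pixel", "semantic")},
}

def compose(task, outputs):
    """Given the decoder's raw (pixel, semantic) predictions, keep only the
    output paths that this task exposes."""
    spec = TASK_COMPOSITIONS[task]
    pixel_out, semantic_out = outputs
    return {name: out for name, out in
            (("pixel", pixel_out), ("semantic", semantic_out))
            if name in spec["outputs"]}
```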
Joint Representation Learning
For medical image pre-training, relying solely on labeled data is far from sufficient, especially compared with the millions of samples available for natural images. Hence, we investigate strategies for leveraging unlabeled data during training to bridge this disparity. Guided by this principle and inspired by the self-supervised contrastive learning paradigm, this work devises a joint training methodology that enables learning from labeled and unlabeled data simultaneously.
Table 1. Polyp detection on the SUN dataset (mAP, %). STFT to DyHead are specialized detectors, Pix2Seq v2 to X-Decoder are general-purpose models, and Ours is universal.

| Method | STFT | Mask R-CNN | YOLOv8 | TransVOD | DETR | DyHead | Pix2Seq v2 | Unified-IO | GLIPv2 | Uni-Perceiver v2 | X-Decoder | Ours |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SUN (mAP) | 36.1 | 49.2 | 53.5 | 45.0 | 43.5 | 53.6 | 48.8 | 50.2 | 53.4 | 51.1 | 51.6 | 56.8 (+3.2) |
Table 2. Colon disease classification on ColonCG (mAcc, %). ResNet to Model Soups are specialized classifiers, Pix2Seq v2 to X-Decoder are general-purpose models, and Ours is universal.

| Method | ResNet | EfficientNet | CoAtNet | ViT-G/14 | SwinV2 | Model Soups | Pix2Seq v2 | Unified-IO | GLIPv2 | Uni-Perceiver v2 | X-Decoder | Ours |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ColonCG (mAcc) | 84.4 | 88.4 | 88.1 | 89.0 | 88.9 | 89.1 | - | 87.7 | 87.6 | 88.2 | 87.8 | 90.8 (+1.7) |
Table 3. Polyp segmentation (Dice, %). UNet to SegViT-V2 are specialized segmentation methods, Pix2Seq v2 to X-Decoder are general-purpose models, and Ours is universal.

| Dataset | UNet | PraNet | SANet | BoxPolyp | nnUNet | TransUNet | Mask2Former | SegViT-V2 | Pix2Seq v2 | Unified-IO | GLIPv2 | Uni-Perceiver v2 | X-Decoder | Ours |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CVC-ClinicDB | 81.3 | 91.3 | 92.2 | 93.0 | 91.8 | 91.3 | 91.8 | 92.6 | 90.8 | 92.5 | - | 91.8 | 93.1 | 93.6 (+0.5) |
| CVC-ColonDB | 66.1 | 75.7 | 75.7 | 79.9 | 74.6 | 76.2 | 78.9 | 79.3 | 76.5 | 77.1 | - | 80.1 | 78.2 | 80.9 (+0.8) |
| Kvasir-SEG | 83.8 | 88.7 | 88.7 | 91.4 | 90.5 | 93.2 | 92.5 | 93.1 | 91.7 | 92.4 | - | 93.2 | 91.6 | 94.1 (+0.9) |
| ETIS-LaribPolypDB | 51.9 | 73.0 | 73.0 | 81.3 | 72.8 | 81.5 | 80.5 | 80.1 | 75.2 | 79.2 | - | 80.6 | 78.1 | 88.1 (+6.6) |
| EndoScene | 84.3 | 88.5 | 88.1 | 88.4 | 88.6 | 88.3 | 88.6 | 89.7 | 87.4 | 87.6 | - | 88.5 | 89.8 | 91.8 (+2.0) |
Pipeline. For labeled data, the corresponding labels undergo the standard transformations outlined in Sec. Composed Decoders. These labels are categorized into semantic and pixel types, and training is conducted with both the semantic loss and the pixel loss. For unlabeled data, we follow the dense contrastive learning strategy (Wang et al. 2021). For each image, we generate two sets of random views via data augmentation and feed them into the encoder to obtain two sets of features. These features are separately passed through dense projection heads, and the shared encoder is trained by computing the contrastive loss between the two sets of features. During training, we adopt an exponential moving average to update the parameters; after training, we retain the encoder and discard the dense heads. During inference, task predictions are executed through the flexible combination of the visual, text, and latent queue input terminals, together with the semantic and pixel output terminals. The total training loss can be expressed as a combination of these terms:
$\mathcal{L} = \mathcal{L}_{sem} + \mathcal{L}_{pix} + \lambda\,\mathcal{L}_{un}$ (10)
where $\lambda$ acts as a weight to balance the supervised and unsupervised terms. The semantic output is supervised with the cross-entropy loss $\mathcal{L}_{sem}$; the pixel output is supervised with $\mathcal{L}_{pix}$, which combines the binary cross-entropy and Dice losses; and the unlabeled learning loss $\mathcal{L}_{un}$ includes the contrastive loss $\mathcal{L}_{con}$ and the dense contrastive loss $\mathcal{L}_{dense}$.
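A minimal sketch of this loss combination, with the contrastive and dense contrastive terms represented by a single precomputed value and the Dice loss implemented in its common soft form:

```python
import torch
import torch.nn.functional as F

def dice_loss(pred_mask, gt_mask, eps=1.0):
    """Soft Dice loss on sigmoid mask logits (per-sample)."""
    p = pred_mask.sigmoid().flatten(1)
    g = gt_mask.flatten(1)
    inter = (p * g).sum(-1)
    return 1 - (2 * inter + eps) / (p.sum(-1) + g.sum(-1) + eps)

def total_loss(sem_logits, sem_targets, mask_logits, mask_targets,
               unsup_loss, lam=0.1):
    """Eq. (10) as a sketch: L = L_sem + L_pix + lambda * L_un.
    `unsup_loss` stands in for the contrastive + dense-contrastive term."""
    l_sem = F.cross_entropy(sem_logits, sem_targets)
    l_pix = F.binary_cross_entropy_with_logits(mask_logits, mask_targets) \
            + dice_loss(mask_logits, mask_targets).mean()
    return l_sem + l_pix + lam * unsup_loss

# toy usage with random tensors
loss = total_loss(torch.randn(4, 5), torch.randint(0, 5, (4,)),
                  torch.randn(4, 1, 64, 64),
                  torch.randint(0, 2, (4, 1, 64, 64)).float(),
                  unsup_loss=torch.tensor(0.3))
```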
Experimental Results
Table 4. Zero-shot and 100-shot transfer. Detection on SUN (mAP), classification on ColonCG (mAcc), and segmentation on five polyp datasets (Dice). Numbers in parentheses indicate UniMed's gains.

| Setting | Method | SUN (mAP) | ColonCG (mAcc) | CVC-ClinicDB | CVC-ColonDB | Kvasir-SEG | ETIS-LaribPolypDB | EndoScene |
|---|---|---|---|---|---|---|---|---|
| Zero-shot | GLIPv2 (Zhang et al. 2022) | 20.6 | 48.4 | - | - | - | - | - |
| Zero-shot | MM-G-T (Zhao et al. 2024) | 26.4 | 53.8 | - | - | - | - | - |
| Zero-shot | Uni-Perceiver v2 (Li et al. 2023) | 33.1 | 55.6 | 48.3 | 33.6 | 57.8 | 34.7 | 48.3 |
| Zero-shot | X-Decoder (Zou et al. 2023) | 31.6 | 54.7 | 52.7 | 35.4 | 56.5 | 36.1 | 50.8 |
| Zero-shot | UniMed | 39.8 (+6.7) | 62.5 (+6.9) | 54.6 (+1.9) | 38.9 (+3.5) | 60.8 (+3.0) | 43.3 (+7.2) | 55.7 (+4.9) |
| 100-shot | GLIPv2 | 41.8 | 83.4 | - | - | - | - | - |
| 100-shot | MM-G-T | 46.6 | 85.5 | - | - | - | - | - |
| 100-shot | Uni-Perceiver v2 | 45.2 | 84.4 | 81.5 | 87.4 | 86.3 | 72.6 | 84.6 |
| 100-shot | X-Decoder | 46.3 | 85.5 | 87.1 | 72.9 | 71.1 | 73.8 | 84.9 |
| 100-shot | UniMed | 49.8 (+3.2) | 88.4 (+2.9) | 88.5 (+1.4) | 75.2 (+2.3) | 89.1 (+1.7) | 77.9 (+4.1) | 87.3 (+2.4) |
Tasks and Datasets
This study takes the endoscopic modality of medical imaging as an example and conducts a comprehensive investigation by collecting datasets from research groups worldwide. The established database contains 12 datasets and covers all three tasks. The unlabeled datasets are endoscopic videos that are difficult to label: Colonoscopic (Mesejo et al. 2016), Hyper-Kvasir (Borgli et al. 2020), Kvasir-Capsule (Smedsrud et al. 2021), LDPolypVideo (Ma et al. 2021), and ColonVideo (private, from Jiangsu Provincial People’s Hospital). For detection, evaluation is performed on the SUN colonoscopy dataset (Misawa et al. 2021), the largest public benchmark for polyp detection. The classification task uses ColonCG (private, from Jiangsu Provincial People’s Hospital), the most comprehensive dataset for colon disease classification, covering five categories: normal, polyp, adenoma, cancer, and ulcerative colitis. Segmentation is evaluated on the common public polyp segmentation datasets: CVC-ClinicDB (Bernal et al. 2015), CVC-ColonDB (Tajbakhsh, Gurudu, and Liang 2015), Kvasir-SEG (Jha et al. 2020), ETIS-LaribPolypDB (Silva et al. 2014), and EndoScene (Vázquez et al. 2017). The evaluation metrics are mean average precision (mAP), mean accuracy (mAcc), and Dice for the three tasks, respectively.
Implementation Details
We employ Focal-T (Yang et al. 2022) as the backbone of the visual encoder, utilizing a transformer text encoder with causal masking (Radford et al. 2021) as the language encoder. For training, the AdamW (Loshchilov and Hutter 2017) optimizer with a base learning rate set to 1e-4, a weight decay of 0.05, and a linear decay learning rate scheduler are applied. The training procedure spans 50 epochs with a batch size of 8. A total of two data loaders are used: one for labeled data and another for unlabeled data, maintaining a sampling ratio of 1:1. The final loss function comprises both supervised and unsupervised components, with a ratio of 10:1 to balance their contributions effectively. During the fine-tuning process, the model’s input and output are controlled through configuration files, allowing for customized settings tailored to specific task requirements.
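A sketch of the described training loop is given below, assuming placeholder dataset objects and model methods (supervised_loss, contrastive_loss) that stand in for UniMed's actual supervised and contrastive objectives.

```python
import itertools
import torch
from torch.utils.data import DataLoader

def train(model, labeled_set, unlabeled_set, epochs=50, lam=0.1):
    # Two data loaders, sampled at a 1:1 ratio as described in the text.
    labeled = DataLoader(labeled_set, batch_size=8, shuffle=True)
    unlabeled = DataLoader(unlabeled_set, batch_size=8, shuffle=True)
    optim = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.05)
    sched = torch.optim.lr_scheduler.LinearLR(optim, start_factor=1.0,
                                              end_factor=0.0,
                                              total_iters=epochs * len(labeled))
    for _ in range(epochs):
        # draw one unlabeled batch per labeled batch (1:1 sampling ratio)
        for (imgs, targets), unlab_imgs in zip(labeled, itertools.cycle(unlabeled)):
            sup_loss = model.supervised_loss(imgs, targets)   # semantic + pixel losses
            unsup_loss = model.contrastive_loss(unlab_imgs)   # contrast + dense contrast
            loss = sup_loss + lam * unsup_loss                # 10:1 supervised:unsupervised
            optim.zero_grad()
            loss.backward()
            optim.step()
            sched.step()
```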

Task-specific Fine-tuning
UniMed is compared with existing specialized and universal methods across various tasks. Fine-tuning is performed on 8 datasets covering 3 common medical image analysis downstream tasks, with performance reported in Tables 1, 2, and 3. The analysis yields the following observations.
UniMed v.s. Specialized Detection Methods. In a comprehensive evaluation, UniMed is compared with various state-of-the-art methods (Table 1), including polyp detection (Wu et al. 2021), two-stage (He et al. 2017), single-stage (Jocher et al. 2020), and transformer-based detection methods (Zhou et al. 2022; Carion et al. 2020; Dai et al. 2021a). UniMed surpasses all specialized detection methods (Fig. 4), achieving the highest mean Average Precision (mAP) score of 56.8%, outperforming the next-best result by a significant margin of 3.2%.
UniMed v.s. Specialized Classification Methods. A total of six methods are compared (Table 2), encompassing CNN-based (He et al. 2016; Tan and Le 2019), Transformer-based (Zhai et al. 2022; Liu et al. 2022), and CNN-Transformer hybrid approaches (Dai et al. 2021b; Wortsman et al. 2022). UniMed emerges as the top performer, achieving the highest classification results with an average accuracy of 90.8% across five categories. Notably, among these methods, Model Soups achieves commendable results, trailing the proposed UniMed model by only 1.7 points. This observation underscores the effectiveness of averaging multiple fine-tuned weights in Model Soups, which contributes to its competitive, though sub-optimal, performance.
UniMed v.s. Specialized Segmentation Methods. When comparing UniMed with state-of-the-art methods in polyp segmentation (Falk et al. 2019; Fan et al. 2020; Wei et al. 2021, 2022) and semantic segmentation (Table 3) (Isensee et al. 2021; Chen et al. 2021a; Cheng et al. 2022; Zhang et al. 2023), UniMed consistently achieves the best segmentation results (Fig. 4) across datasets with varying levels of segmentation difficulty. Specifically, UniMed outperforms the best specialized methods by 0.6, 1.0, 0.9, 6.6, and 2.1 points on the CVC-ClinicDB, CVC-ColonDB, Kvasir-SEG, ETIS-LaribPolypDB, and EndoScene datasets, respectively. These results underscore the robustness of UniMed’s performance across diverse datasets, demonstrating its effectiveness in tackling segmentation challenges across different medical imaging scenarios.
UniMed v.s. Generalist Methods. Most notably, the proposed end-to-end architecture outperforms other general models across various medical tasks (Tables 1, 2, 3). General methods (Chen et al. 2022a; Lu et al. 2022; Zhang et al. 2022; Li et al. 2023; Zou et al. 2023) tend to exhibit higher overall performance than specific methods, suggesting that the interaction of data and tasks fosters enhanced model learning. Additionally, models striving for high understanding performance often demonstrate lower localization performance (e.g., Uni-Perceiver v2), as it is not trivial to merge semantic and visual understanding into a single model. Visual comparisons with other methods (Fig. 4) further show that our method achieves higher accuracy in boundary segmentation and lesion localization.
Zero-Shot and 100-Shot Transfer
UniMed is pre-trained and requires zero or only a small number of labeled examples before being applied to various downstream tasks. Thus, we assess the model’s transferability in both zero-shot and 100-shot settings.
Zero-shot Transfer. Experimental results provide compelling evidence of UniMed’s substantial zero-shot capability in the medical domain compared to other general methods (Table 4), suggesting that UniMed can be readily applied to various tasks without further adjustment. UniMed surpasses the second-best results by 6.7% and 6.9% in the detection and classification tasks, respectively, and outperforms other methods by up to 7.2% in segmentation.
100-shot Transfer. UniMed also demonstrates superior overall 100-shot performance on medical image analysis tasks (Table 4). In some instances, it even rivals fully supervised models trained on full-scale data, as exemplified by the SUN dataset (100-shot mAP of 49.8%, compared to Mask R-CNN’s 49.2%). In particular, UniMed exceeds X-Decoder in both the zero-shot and 100-shot settings, underscoring the key role of large amounts of unsupervised data in feature generalization.

Task Composition
As mentioned earlier, UniMed boasts a unique advantage of task interaction, enabling both single and joint-task reasoning. This distinctive capability enhances the model’s practicality in real clinical scenarios, particularly in customizing referrals and tasks. In Fig. 5, visualization results of single and joint task inference without architectural changes are showcased. For instance, when provided with a set of referrals such as [”polyp, adenoma, cancer”], UniMed seamlessly delivers both pixel-level and semantic-level predictions.
Ablation Study
Ablation studies are performed to analyze the architecture of UniMed; all experiments are evaluated on the segmentation task.
Backbone. Increasing the size of the backbone network leads to performance improvements, as evidenced in Table 5. As the depth and embedding dimension of the visual encoder grow, performance improves across the board, suggesting that more powerful feature extractors facilitate prediction in downstream tasks.
Table 5. Ablation on backbone size (Dice, %).

| Backbone | CVC-ClinicDB | CVC-ColonDB | Kvasir-SEG | ETIS-LaribPolypDB | EndoScene |
|---|---|---|---|---|---|
| ViT-T | 93.1 | 78.1 | 92.5 | 84.3 | 91.2 |
| ViT-L | 93.6 (+0.5) | 80.9 (+2.8) | 94.1 (+1.6) | 88.1 (+3.8) | 91.8 (+0.6) |
Table 6. Task collaboration: removing tasks from joint training (Dice, %).

| Task setting | CVC-ClinicDB | CVC-ColonDB | Kvasir-SEG | ETIS-LaribPolypDB | EndoScene |
|---|---|---|---|---|---|
| All Tasks | 93.6 | 80.9 | 94.1 | 88.1 | 91.8 |
| - Detection | 93.3 (-0.3) | 79.8 (-1.1) | 93.6 (-0.5) | 86.7 (-1.4) | 92.0 (+0.2) |
| - Classification | 92.9 (-0.7) | 81.4 (+0.5) | 93.6 (-0.5) | 87.8 (-0.3) | 91.0 (-0.8) |
| Single Segmentation | 92.5 (-1.1) | 79.5 (-1.4) | 93.4 (-0.7) | 87.6 (-1.5) | 89.5 (-1.3) |
Task Collaboration. To explore the relationship between task collaboration and interference, we remove individual tasks from the full task set (Table 6) and draw the following findings: 1) learning multiple tasks together is more effective than focusing on a single task alone; 2) detection positively influences segmentation, whereas classification exhibits suboptimal performance on certain datasets. This discrepancy may be attributed to the interference caused by dividing images into multiple categories, particularly for challenging data that is difficult to classify accurately.
Table 7. Ablation on the unsupervised loss weight (Dice, %).

| Loss weight | CVC-ClinicDB | CVC-ColonDB | Kvasir-SEG | ETIS-LaribPolypDB | EndoScene |
|---|---|---|---|---|---|
| 1 | 92.5 | 79.5 | 93.7 | 85.5 | 89.5 |
| 0.75 | 91.8 | 79.4 | 94.0 | 84.3 | 90.9 |
| 0.5 | 93.2 | 80.6 | 93.6 | 85.9 | 91.8 |
| 0.1 | 93.6 | 80.9 | 94.1 | 88.1 | 91.8 |
| 0 | 92.8 | 78.7 | 92.2 | 82.5 | 90.6 |
Balance between Labeled and Unlabeled Losses. Table 7 shows that the best performance is achieved when the unsupervised loss weight is set to 0.1, suggesting that the supervised component carries greater significance in the overall training process. When the ratio of unsupervised to supervised losses is 1:1, performance drops. These observations indicate that 1) introducing the joint representation learning strategy is necessary, as unlabeled data helps the model learn more general features; and 2) the supervised component effectively drives model learning, while the unsupervised component serves as a "regularization" mechanism, guiding the model toward more robust knowledge.
Table 8. Ablation on the sampling ratio of labeled to unlabeled data (Dice, %).

| Sampling ratio | CVC-ClinicDB | CVC-ColonDB | Kvasir-SEG | ETIS-LaribPolypDB | EndoScene |
|---|---|---|---|---|---|
| 0.5:1 | 93.2 | 79.6 | 92.7 | 85.8 | 91.2 |
| 1:1 | 93.6 | 80.9 | 94.1 | 88.1 | 91.8 |
| 1:2 | 93.8 | 81.2 | 93.4 | 86.6 | 91.5 |
Balance between Labeled and Unlabeled Data. Table 8 shows that, during data loading, sampling ratios of labeled to unlabeled data no greater than 1 perform better; that is, unlabeled data should be sampled at least as often as labeled data. This is because the amount of labeled medical data is very small, whereas unlabeled data can significantly increase the diversity of samples. Based on these findings, this paper adopts the 1:1 ratio setting.
Conclusion
We present UniMed, the first general-purpose architecture designed to support comprehensive medical image tasks at all levels, including the image, region, and pixel levels. Unlike current universal models, which usually involve multiple task-specific branches or heads and rely on cumbersome multi-stage pre-training, the innovative coupling of the decomposed and composed decoders unifies the input and output spaces, enabling flexible combinations that seamlessly support various task interactions. The joint representation learning strategy demonstrates how to effectively train the model in a single stage without additional modules. Our approach addresses the challenges of annotation diversity and the underutilization of unlabeled samples in medical data, achieving state-of-the-art performance in fine-tuning, zero-shot, and 100-shot scenarios across eight datasets. Overall, we believe that UniMed offers significant advantages for real-world clinical applications owing to its versatility, transferability, and flexibility.
References
- Alayrac et al. (2022) Alayrac, J.-B.; Donahue, J.; Luc, P.; Miech, A.; Barr, I.; Hasson, Y.; Lenc, K.; Mensch, A.; Millican, K.; Reynolds, M.; et al. 2022. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35: 23716–23736.
- Bernal et al. (2015) Bernal, J.; Sánchez, F. J.; Fernández-Esparrach, G.; Gil, D.; Rodríguez, C.; and Vilariño, F. 2015. WM-DOVA maps for accurate polyp highlighting in colonoscopy: Validation vs. saliency maps from physicians. Computerized medical imaging and graphics, 43: 99–111.
- Borgli et al. (2020) Borgli, H.; Thambawita, V.; Smedsrud, P. H.; Hicks, S.; Jha, D.; Eskeland, S. L.; Randel, K. R.; Pogorelov, K.; Lux, M.; Nguyen, D. T. D.; et al. 2020. HyperKvasir, a comprehensive multi-class image and video dataset for gastrointestinal endoscopy. Scientific data, 7(1): 283.
- Butoi et al. (2023) Butoi, V. I.; Ortiz, J. J. G.; Ma, T.; Sabuncu, M. R.; Guttag, J.; and Dalca, A. V. 2023. Universeg: Universal medical image segmentation. arXiv preprint arXiv:2304.06131.
- Carion et al. (2020) Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; and Zagoruyko, S. 2020. End-to-end object detection with transformers. In European conference on computer vision, 213–229. Springer.
- Chen et al. (2021a) Chen, J.; Lu, Y.; Yu, Q.; Luo, X.; Adeli, E.; Wang, Y.; Lu, L.; Yuille, A. L.; and Zhou, Y. 2021a. Transunet: Transformers make strong encoders for medical image segmentation. arXiv preprint arXiv:2102.04306.
- Chen et al. (2021b) Chen, T.; Saxena, S.; Li, L.; Fleet, D. J.; and Hinton, G. 2021b. Pix2seq: A language modeling framework for object detection. arXiv preprint arXiv:2109.10852.
- Chen et al. (2022a) Chen, T.; Saxena, S.; Li, L.; Lin, T.-Y.; Fleet, D. J.; and Hinton, G. E. 2022a. A unified sequence interface for vision tasks. Advances in Neural Information Processing Systems, 35: 31333–31346.
- Chen et al. (2022b) Chen, X.; Wang, X.; Zhang, K.; Fung, K.-M.; Thai, T. C.; Moore, K.; Mannel, R. S.; Liu, H.; Zheng, B.; and Qiu, Y. 2022b. Recent advances and clinical applications of deep learning in medical image analysis. Medical Image Analysis, 79: 102444.
- Cheng et al. (2022) Cheng, B.; Misra, I.; Schwing, A. G.; Kirillov, A.; and Girdhar, R. 2022. Masked-attention mask transformer for universal image segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 1290–1299.
- Cheng et al. (2023) Cheng, D.; Qin, Z.; Jiang, Z.; Zhang, S.; Lao, Q.; and Li, K. 2023. Sam on medical images: A comprehensive study on three prompt modes. arXiv preprint arXiv:2305.00035.
- Dai et al. (2021a) Dai, X.; Chen, Y.; Xiao, B.; Chen, D.; Liu, M.; Yuan, L.; and Zhang, L. 2021a. Dynamic head: Unifying object detection heads with attentions. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 7373–7382.
- Dai et al. (2021b) Dai, Z.; Liu, H.; Le, Q. V.; and Tan, M. 2021b. Coatnet: Marrying convolution and attention for all data sizes. Advances in neural information processing systems, 34: 3965–3977.
- Falk et al. (2019) Falk, T.; Mai, D.; Bensch, R.; Çiçek, Ö.; Abdulkadir, A.; Marrakchi, Y.; Böhm, A.; Deubner, J.; Jäckel, Z.; Seiwald, K.; et al. 2019. U-Net: deep learning for cell counting, detection, and morphometry. Nature methods, 16(1): 67–70.
- Fan et al. (2020) Fan, D.-P.; Ji, G.-P.; Zhou, T.; Chen, G.; Fu, H.; Shen, J.; and Shao, L. 2020. Pranet: Parallel reverse attention network for polyp segmentation. In International conference on medical image computing and computer-assisted intervention, 263–273. Springer.
- He et al. (2017) He, K.; Gkioxari, G.; Dollár, P.; and Girshick, R. 2017. Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, 2961–2969.
- He et al. (2016) He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, 770–778.
- Isensee et al. (2021) Isensee, F.; Jaeger, P. F.; Kohl, S. A.; Petersen, J.; and Maier-Hein, K. H. 2021. nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nature methods, 18(2): 203–211.
- Jha et al. (2020) Jha, D.; Smedsrud, P. H.; Riegler, M. A.; Halvorsen, P.; de Lange, T.; Johansen, D.; and Johansen, H. D. 2020. Kvasir-seg: A segmented polyp dataset. In MultiMedia Modeling: 26th International Conference, MMM 2020, Daejeon, South Korea, January 5–8, 2020, Proceedings, Part II 26, 451–462. Springer.
- Jocher et al. (2020) Jocher, G.; Stoken, A.; Borovec, J.; Changyu, L.; Hogan, A.; Diaconu, L.; Poznanski, J.; Yu, L.; Rai, P.; Ferriday, R.; et al. 2020. ultralytics/yolov5: v3. 0. Zenodo.
- Kudo and Richardson (2018) Kudo, T.; and Richardson, J. 2018. Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. arXiv preprint arXiv:1808.06226.
- Li et al. (2023) Li, H.; Zhu, J.; Jiang, X.; Zhu, X.; Li, H.; Yuan, C.; Wang, X.; Qiao, Y.; Wang, X.; Wang, W.; et al. 2023. Uni-perceiver v2: A generalist model for large-scale vision and vision-language tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2691–2700.
- Liu et al. (2023) Liu, J.; Zhang, Y.; Chen, J.-N.; Xiao, J.; Lu, Y.; A Landman, B.; Yuan, Y.; Yuille, A.; Tang, Y.; and Zhou, Z. 2023. Clip-driven universal model for organ segmentation and tumor detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 21152–21164.
- Liu et al. (2022) Liu, Z.; Hu, H.; Lin, Y.; Yao, Z.; Xie, Z.; Wei, Y.; Ning, J.; Cao, Y.; Zhang, Z.; Dong, L.; et al. 2022. Swin transformer v2: Scaling up capacity and resolution. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 12009–12019.
- Loshchilov and Hutter (2017) Loshchilov, I.; and Hutter, F. 2017. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.
- Lu et al. (2022) Lu, J.; Clark, C.; Zellers, R.; Mottaghi, R.; and Kembhavi, A. 2022. Unified-io: A unified model for vision, language, and multi-modal tasks. arXiv preprint arXiv:2206.08916.
- Ma and Wang (2023) Ma, J.; and Wang, B. 2023. Segment anything in medical images. arXiv preprint arXiv:2304.12306.
- Ma et al. (2021) Ma, Y.; Chen, X.; Cheng, K.; Li, Y.; and Sun, B. 2021. LDPolypVideo benchmark: a large-scale colonoscopy video dataset of diverse polyps. In Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part V 24, 387–396. Springer.
- Mesejo et al. (2016) Mesejo, P.; Pizarro, D.; Abergel, A.; Rouquette, O.; Beorchia, S.; Poincloux, L.; and Bartoli, A. 2016. Computer-aided classification of gastrointestinal lesions in regular colonoscopy. IEEE transactions on medical imaging, 35(9): 2051–2063.
- Misawa et al. (2021) Misawa, M.; Kudo, S.-e.; Mori, Y.; Hotta, K.; Ohtsuka, K.; Matsuda, T.; Saito, S.; Kudo, T.; Baba, T.; Ishida, F.; et al. 2021. Development of a computer-aided detection system for colonoscopy and a publicly accessible large colonoscopy video database (with video). Gastrointestinal endoscopy, 93(4): 960–967.
- Moor et al. (2023) Moor, M.; Banerjee, O.; Abad, Z. S. H.; Krumholz, H. M.; Leskovec, J.; Topol, E. J.; and Rajpurkar, P. 2023. Foundation models for generalist medical artificial intelligence. Nature, 616(7956): 259–265.
- Murtaza et al. (2020) Murtaza, G.; Shuib, L.; Abdul Wahab, A. W.; Mujtaba, G.; Mujtaba, G.; Nweke, H. F.; Al-garadi, M. A.; Zulfiqar, F.; Raza, G.; and Azmi, N. A. 2020. Deep learning-based breast cancer classification through medical imaging modalities: state of the art and research challenges. Artificial Intelligence Review, 53: 1655–1720.
- Qin et al. (2022) Qin, Z.; Yi, H.; Lao, Q.; and Li, K. 2022. Medical image understanding with pretrained vision language models: A comprehensive study. arXiv preprint arXiv:2209.15517.
- Radford et al. (2021) Radford, A.; Kim, J. W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. 2021. Learning transferable visual models from natural language supervision. In International conference on machine learning, 8748–8763. PMLR.
- Rajpurkar et al. (2022) Rajpurkar, P.; Chen, E.; Banerjee, O.; and Topol, E. J. 2022. AI in health and medicine. Nature medicine, 28(1): 31–38.
- Ronneberger, Fischer, and Brox (2015) Ronneberger, O.; Fischer, P.; and Brox, T. 2015. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18, 234–241. Springer.
- Silva et al. (2014) Silva, J.; Histace, A.; Romain, O.; Dray, X.; and Granado, B. 2014. Toward embedded detection of polyps in wce images for early diagnosis of colorectal cancer. International journal of computer assisted radiology and surgery, 9: 283–293.
- Smedsrud et al. (2021) Smedsrud, P. H.; Thambawita, V.; Hicks, S. A.; Gjestang, H.; Nedrejord, O. O.; Næss, E.; Borgli, H.; Jha, D.; Berstad, T. J. D.; Eskeland, S. L.; et al. 2021. Kvasir-Capsule, a video capsule endoscopy dataset. Scientific Data, 8(1): 142.
- Tajbakhsh, Gurudu, and Liang (2015) Tajbakhsh, N.; Gurudu, S. R.; and Liang, J. 2015. Automated polyp detection in colonoscopy videos using shape and context information. IEEE transactions on medical imaging, 35(2): 630–644.
- Tan and Le (2019) Tan, M.; and Le, Q. 2019. Efficientnet: Rethinking model scaling for convolutional neural networks. In International conference on machine learning, 6105–6114. PMLR.
- Tiu et al. (2022) Tiu, E.; Talius, E.; Patel, P.; Langlotz, C. P.; Ng, A. Y.; and Rajpurkar, P. 2022. Expert-level detection of pathologies from unannotated chest X-ray images via self-supervised learning. Nature Biomedical Engineering, 6(12): 1399–1406.
- Vázquez et al. (2017) Vázquez, D.; Bernal, J.; Sánchez, F. J.; Fernández-Esparrach, G.; López, A. M.; Romero, A.; Drozdzal, M.; Courville, A.; et al. 2017. A benchmark for endoluminal scene segmentation of colonoscopy images. Journal of healthcare engineering, 2017.
- Wan et al. (2024) Wan, Z.; Liu, C.; Zhang, M.; Fu, J.; Wang, B.; Cheng, S.; Ma, L.; Quilodrán-Casas, C.; and Arcucci, R. 2024. Med-unic: Unifying cross-lingual medical vision-language pre-training by diminishing bias. Advances in Neural Information Processing Systems, 36.
- Wang et al. (2022a) Wang, K.-N.; He, Y.; Zhuang, S.; Miao, J.; He, X.; Zhou, P.; Yang, G.; Zhou, G.-Q.; and Li, S. 2022a. Ffcnet: Fourier transform-based frequency learning and complex convolutional network for colon disease classification. In International Conference on Medical Image Computing and Computer-Assisted Intervention, 78–87. Springer.
- Wang et al. (2022b) Wang, K.-N.; Yang, X.; Miao, J.; Li, L.; Yao, J.; Zhou, P.; Xue, W.; Zhou, G.-Q.; Zhuang, X.; and Ni, D. 2022b. AWSnet: an auto-weighted supervision attention network for myocardial scar and edema segmentation in multi-sequence cardiac magnetic resonance images. Medical Image Analysis, 77: 102362.
- Wang et al. (2021) Wang, X.; Zhang, R.; Shen, C.; Kong, T.; and Li, L. 2021. Dense contrastive learning for self-supervised visual pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 3024–3033.
- Wang et al. (2022c) Wang, Z.; Wu, Z.; Agarwal, D.; and Sun, J. 2022c. Medclip: Contrastive learning from unpaired medical images and text. arXiv preprint arXiv:2210.10163.
- Wei et al. (2022) Wei, J.; Hu, Y.; Li, G.; Cui, S.; Kevin Zhou, S.; and Li, Z. 2022. BoxPolyp: Boost Generalized Polyp Segmentation Using Extra Coarse Bounding Box Annotations. In International Conference on Medical Image Computing and Computer-Assisted Intervention, 67–77. Springer.
- Wei et al. (2021) Wei, J.; Hu, Y.; Zhang, R.; Li, Z.; Zhou, S. K.; and Cui, S. 2021. Shallow attention network for polyp segmentation. In Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part I 24, 699–708. Springer.
- Wortsman et al. (2022) Wortsman, M.; Ilharco, G.; Gadre, S. Y.; Roelofs, R.; Gontijo-Lopes, R.; Morcos, A. S.; Namkoong, H.; Farhadi, A.; Carmon, Y.; Kornblith, S.; et al. 2022. Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. In International Conference on Machine Learning, 23965–23998. PMLR.
- Wu et al. (2023) Wu, J.; Fu, R.; Fang, H.; Liu, Y.; Wang, Z.; Xu, Y.; Jin, Y.; and Arbel, T. 2023. Medical sam adapter: Adapting segment anything model for medical image segmentation. arXiv preprint arXiv:2304.12620.
- Wu et al. (2021) Wu, L.; Hu, Z.; Ji, Y.; Luo, P.; and Zhang, S. 2021. Multi-frame collaboration for effective endoscopic video polyp detection via spatial-temporal feature transformation. In Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part V 24, 302–312. Springer.
- Yang et al. (2022) Yang, J.; Li, C.; Dai, X.; and Gao, J. 2022. Focal modulation networks. Advances in Neural Information Processing Systems, 35: 4203–4217.
- Ye et al. (2023) Ye, Y.; Xie, Y.; Zhang, J.; Chen, Z.; and Xia, Y. 2023. UniSeg: A Prompt-driven Universal Segmentation Model as well as A Strong Representation Learner. arXiv preprint arXiv:2304.03493.
- Yu et al. (2019) Yu, Q.; Shi, Y.; Sun, J.; Gao, Y.; Zhu, J.; and Dai, Y. 2019. Crossbar-net: A novel convolutional neural network for kidney tumor segmentation in ct images. IEEE transactions on image processing, 28(8): 4060–4074.
- Zhai et al. (2022) Zhai, X.; Kolesnikov, A.; Houlsby, N.; and Beyer, L. 2022. Scaling vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 12104–12113.
- Zhang et al. (2023) Zhang, B.; Liu, L.; Phan, M. H.; Tian, Z.; Shen, C.; and Liu, Y. 2023. SegViTv2: Exploring Efficient and Continual Semantic Segmentation with Plain Vision Transformers. arXiv preprint arXiv:2306.06289.
- Zhang et al. (2022) Zhang, H.; Zhang, P.; Hu, X.; Chen, Y.-C.; Li, L.; Dai, X.; Wang, L.; Yuan, L.; Hwang, J.-N.; and Gao, J. 2022. Glipv2: Unifying localization and vision-language understanding. Advances in Neural Information Processing Systems, 35: 36067–36080.
- Zhang and Liu (2023) Zhang, K.; and Liu, D. 2023. Customized segment anything model for medical image segmentation. arXiv preprint arXiv:2304.13785.
- Zhao et al. (2024) Zhao, X.; Chen, Y.; Xu, S.; Li, X.; Wang, X.; Li, Y.; and Huang, H. 2024. An open and comprehensive pipeline for unified object grounding and detection. arXiv preprint arXiv:2401.02361.
- Zhao et al. (2023) Zhao, Z.; Zhang, Y.; Wu, C.; Zhang, X.; Zhang, Y.; Wang, Y.; and Xie, W. 2023. One Model to Rule them All: Towards Universal Segmentation for Medical Images with Text Prompts. arXiv preprint arXiv:2312.17183.
- Zhou et al. (2023) Zhou, G.-Q.; Zhang, W.-B.; Shi, Z.-Q.; Qi, Z.-R.; Wang, K.-N.; Song, H.; Yao, J.; and Chen, Y. 2023. DSANet: Dual-branch shape-aware network for echocardiography segmentation in apical views. IEEE Journal of Biomedical and Health Informatics.
- Zhou et al. (2022) Zhou, Q.; Li, X.; He, L.; Yang, Y.; Cheng, G.; Tong, Y.; Ma, L.; and Tao, D. 2022. TransVOD: end-to-end video object detection with spatial-temporal transformers. IEEE Transactions on Pattern Analysis and Machine Intelligence.
- Zou et al. (2023) Zou, X.; Dou, Z.-Y.; Yang, J.; Gan, Z.; Li, L.; Li, C.; Dai, X.; Behl, H.; Wang, J.; Yuan, L.; et al. 2023. Generalized decoding for pixel, image, and language. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 15116–15127.