MedSAM-U: Uncertainty-Guided Auto Multi-Prompt Adaptation for Reliable MedSAM
Abstract
The Medical Segment Anything Model (MedSAM) has shown remarkable performance in medical image segmentation, drawing significant attention in the field. However, its sensitivity to varying prompt types and locations poses challenges. This paper addresses these challenges by focusing on the development of reliable prompts that enhance MedSAM's accuracy. We introduce MedSAM-U, an uncertainty-guided framework designed to automatically refine multi-prompt inputs for more reliable and precise medical image segmentation. Specifically, we first train a Multi-Prompt Adapter integrated with MedSAM, creating MPA-MedSAM, to adapt to diverse multi-prompt inputs. We then employ uncertainty-guided multi-prompts to effectively estimate the uncertainties associated with the prompts and their initial segmentation results. In particular, a novel uncertainty-guided prompt adaptation technique is then applied automatically to derive reliable prompts and their corresponding segmentation outcomes. We validate MedSAM-U using datasets from multiple modalities to train a universal image segmentation model. Experimental results on datasets from five distinct modalities demonstrate that, compared to MedSAM, the proposed MedSAM-U achieves an average performance improvement of 1.7% to 20.5% with uncertainty-guided prompts.
Index Terms:
MedSAM, Medical image segmentation, Multi-prompt, Uncertainty-guided segmentation.
I Introduction
Medical image segmentation is an important task in medical image analysis, crucial for many clinical applications. Accurate segmentation helps clearly define anatomical structures and diseased areas, which is essential for disease diagnosis, treatment planning, and monitoring. It is widely used in dermoscopy, CT, MRI, colonoscopy, and ultrasound. Traditional U-Net-based segmentation methods have demonstrated high segmentation performance [2, 3, 4, 5]. However, these models are mostly designed for specific tasks and often face challenges in transferring to other domains. Consequently, with the recent rise of foundation models, the Segment Anything Model (SAM) [6] and MedSAM [7] have shown universal applicability and precise segmentation capabilities across various datasets. Crucially, these models are trained on all datasets together and utilize different prompts to achieve more accurate segmentation results. As noted in [8], high-quality prompts can lead to good performance. Hence, this study focuses on how to automatically obtain effective prompts at test time without the need to retrain the model.
As shown in Fig. 1 (a), we illustrate the segmentation performance of SAM and MedSAM across five datasets using various box prompts, including prompts with different aspect ratios and positions within the image. These variations in box prompts reveal the significant impact of box position on the resulting segmentation outcomes. Furthermore, as detailed in [8], point prompts also have a demonstrated effect on segmentation performance. Therefore, the other key focus of this study is to train MedSAM to adapt to different types of prompts. Finally, as shown in Fig. 1 (b), MedSAM's approach of producing a single segmentation result in a single step lacks reliability. This study aims to leverage reliability to automatically obtain effective prompts, thereby achieving better segmentation results.
Uncertainty is one of the crucial metrics for assessing a model's confidence or reliability. Existing methods for uncertainty estimation encompass dropout-based methods [9], ensemble-based methods [10], entropy-based methods [11], evidential approaches [12], and deterministic methods [13]. Given the extensive parameters involved in MedSAM, building a new model with uncertainty estimation from scratch would demand significant computational resources and time. Furthermore, a key aspect of this study is exploring how to utilize estimated pixel-level uncertainty to obtain reliable prompts of different types.
To address these challenges, our study introduces MedSAM-U, as illustrated in Fig. 1 (b) 2), an automatic uncertainty-guided multi-prompt framework for adapting a reliable MedSAM. MedSAM-U integrates box and point prompts to enhance segmentation accuracy. It then employs Uncertainty-Guided Multi-Prompt (UGMP) to effectively estimate the uncertainties associated with the prompts and their initial segmentation results. Our approach further introduces a novel Uncertainty-Guided Prompt Adaptation (UGPA) technique, enabling the automatic refinement of multi-prompts to enhance segmentation reliability and accuracy. The contributions of this work are summarized as follows.
We propose MedSAM-U, which leverages uncertainty estimation and guidance to automatically predict reliable segmentation results. To the best of our knowledge, this is the first attempt at using uncertainty-guided adaptation of multi-prompt to achieve a reliable MedSAM.
We employ Uncertainty-Guided Multi-Prompt (UGMP) to estimate the uncertainty of different prompts in MedSAM without requiring additional training parameters.
We introduce Uncertainty-Guided Prompt Adaptation (UGPA) to automatically obtain reliable prompts using the estimated uncertainty in the testing time, leading to accurate segmentation predictions.
We conducted unified training and testing on datasets from five different modalities (Dermoscopy, Colonoscopy, Ultrasound, CT, and MRI), demonstrating the reliability and accuracy of MedSAM-U. Our code will be released at https://github.com/Zhounan1222/MedSAM-U.
II Related works
II-A SAM for medical image segmentation
Traditional deep learning methods [14, 2, 15] are mostly designed for specific tasks and often face challenges in transferring to other domains. SAM [6], the pioneering large foundation model for segmentation, consists of three primary components: an image encoder, a prompt encoder, and a mask decoder. The image encoder is based on a standard Vision Transformer (ViT) that has been pretrained using Masked Autoencoders (MAE). The prompt encoder can operate in either a sparse (e.g., boxes) or dense (e.g., masks) manner. The mask decoder is a Transformer decoder block adapted to include a dynamic mask prediction head. This decoder employs two-way cross-attention mechanisms to capture the interactions between the prompt and image embeddings. Subsequently, SAM upsamples the image embedding, and a Multi-Layer Perceptron (MLP) maps the resulting token to a dynamic linear classifier that predicts the ground truth (GT) for the given image $I$. Due to its strong zero-shot performance, SAM marks a major breakthrough in the field of image segmentation.
To improve the unsatisfactory performance of SAM on medical image segmentation tasks, some approaches are to fine-tune SAM on medical images, including full fine-tuning and parameter-efficient fine-tuning [16, 17, 18, 19]. Recently, MedSAM [7] has investigated SAM’s application to medical image segmentation, exploring its performance in various contexts such as endoscopic surgery [20], tumor segmentation [21], polyp segmentation [22]. However, the existing MedSAM-based methods rarely investigate the sensitivity of different prompts.
II-B Prompts-based methods for medical image segmentation
Prompt-based methods have gained traction in medical image segmentation due to their ability to provide flexible and adaptive guidance during interactive segmentation tasks. While effective in some cases, current adaptations of SAM rely heavily on high-quality, standard prompts (such as points, boxes, and masks) to deliver satisfactory performance in medical image segmentation tasks. Wu et al. [23] introduce Space-Depth Transpose for adapting 2D SAM to 3D medical images and a Hyper-Prompting Adapter for prompt-conditioned adaptation. Deng et al. [24] propose a multi-box prompt-triggered uncertainty estimation for SAM that uses Monte Carlo methods and test-time augmentation to enhance performance and provide pixel-level reliability for segmented lesions or tissues. Li et al. [25] propose a 3D medical image segmentation model using a single point prompt with SAM's pretrained vision transformer and lightweight adapters, featuring a hybrid network and boundary-aware loss for precise results. Wu et al. [26] present the One-Prompt Segmentation method, which combines one-shot and interactive approaches to handle unseen tasks with a single prompt in one pass. Currently, most MedSAM-based methods are limited to using a single type of prompt rather than multiple types of prompts [27, 28, 24, 25]. This reduces the information available to the model and may result in insufficient segmentation performance.
II-C Uncertainty-based methods for medical image segmentation
Uncertainty estimation [29] in segmentation has become increasingly important, as it offers a way to assess the confidence in model predictions. This is crucial in high-stakes applications such as medical image analysis, where mistakes can lead to serious repercussions. There are two primary types of uncertainty: aleatoric, which originates from the inherent noise in the data [30], and epistemic, which is due to the model's limitations; distinguishing them is key to enhancing the reliability of deep neural networks [31]. For instance, Saad et al. [32] utilized shape and appearance priors to quantify uncertainty in probabilistic medical image segmentation. Parisot et al. [33] leveraged segmentation uncertainty to inform adaptive sampling strategies for the simultaneous segmentation and registration of brain tumors. Additionally, Prassni et al. [34] developed a method to visualize uncertainty in random walker-based segmentation, which was then applied to volumetric segmentation of brain MRI and abdominal CT images. Zou et al. [35] proposed a trusted brain tumor segmentation network that generates robust segmentation results and reliable uncertainty estimations by leveraging subjective logic theory to model uncertainty and parameterize class probabilities as a Dirichlet distribution.
Additionally, uncertainty estimation in large models, such as SAM, is critical for improving the reliability of segmentation outputs [24]. Yao et al. [27] propose a test-phase prompt augmentation technique for SAM that integrates multi-box augmentation with an aleatoric uncertainty-based FN and FP correction strategy to improve medical image segmentation. Zhang et al. [28] propose UR-SAM, an uncertainty-rectified framework that enhances SAM’s reliability in medical image segmentation by combining prompt augmentation with uncertainty-based rectification. Despite these advancements, research on integrating uncertainty estimation with prompt-based MedSAM remains insufficient, particularly in how to utilize uncertainty to guide reliable prompts.

III Method
To begin with, we provide an overview of the MedSAM-U architecture, which consists of three primary components: the MPA-MedSAM, UGMP, and UGPA modules. The overall framework of our proposed method is illustrated in Fig. 2. We build an automatic framework guided by uncertainty, developed to adaptively refine multi-prompts for reliable and accurate segmentation, which includes: 1) training the Multi-Prompt Adaptation (MPA) module to integrate both point and box cues for more precise medical image segmentation; 2) at inference time, utilizing UGMP to assess the uncertainty linked to various prompts without requiring additional training parameters; and 3) introducing UGPA to automatically leverage the estimated uncertainty to derive more reliable prompts, thereby improving segmentation performance.
III-A Multi-Prompt Adaptation for MedSAM
MedSAM [7] primarily relies on box prompts as the initial input for segmentation tasks, without incorporating point prompts during its training process. This limitation means that MedSAM's segmentation capabilities are optimized for scenarios where box annotations are available, but it may not fully exploit the potential benefits of point-based inputs. In MedSAM, the relationship between the inputs and the prediction mask can be formally expressed as:
$M = F_{\mathrm{MedSAM}}(I, B)$  (1)

where the function $F_{\mathrm{MedSAM}}$ can be denoted as:

$F_{\mathrm{MedSAM}}(I, B) = \mathcal{D}\big(\mathcal{E}_I(I), \mathcal{E}_P(B)\big)$  (2)

where $I$ denotes the input image, $B$ represents the box prompt, $\mathcal{D}$, $\mathcal{E}_I$, and $\mathcal{E}_P$ represent the mask decoder, image encoder, and prompt encoder modules of MedSAM, respectively, and $M$ is the resulting segmentation mask of MedSAM.
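For illustration, the composition in Eqs. (1)-(2) can be sketched as follows. This is a minimal sketch assuming hypothetical `image_encoder`, `prompt_encoder`, and `mask_decoder` callables rather than the actual MedSAM API.

```python
import torch

def medsam_forward(image_encoder, prompt_encoder, mask_decoder,
                   image: torch.Tensor, box: torch.Tensor) -> torch.Tensor:
    """Sketch of Eqs. (1)-(2): M = D(E_I(I), E_P(B)).

    The three modules are placeholders for the corresponding MedSAM
    components; their exact signatures may differ in the real codebase.
    """
    image_embedding = image_encoder(image)            # E_I(I)
    sparse_embedding = prompt_encoder(box)            # E_P(B), box prompt only
    mask_logits = mask_decoder(image_embedding, sparse_embedding)  # D(., .)
    return torch.sigmoid(mask_logits)                 # segmentation mask M
```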
Moreover, as detailed in [8, 19], point prompts, particularly positive points, have a demonstrated effect on segmentation performance. Accordingly, to improve the sparse encoder, this study proposes modifications that refine the encoding of points and boxes. These adjustments aim to optimize the integration of positional encoding with learned embeddings for each prompt type. We propose MPA, which is designed to handle and adapt multiple types of prompt inputs for MedSAM, including the combination of point and box prompts.
Our goal is to extend the capabilities of MedSAM by adapting it to handle multi-prompts through a fine-tuning approach, called MPA-MedSAM, designed to enhance MedSAM's flexibility without the need for full re-training. Instead of adjusting all parameters, we keep the pre-trained MedSAM parameters frozen except for the prompt encoder, develop an adapter module, and integrate it into designated positions within the image encoder. Structurally, the adapter serves as a bottleneck model, consisting of three components applied sequentially: a down-projection, a ReLU activation, and an up-projection, as illustrated in Fig. 2 (a), similar to Med-SA [23]. Given that MedSAM does not show improvements when multi-type prompts are incorporated, we unfreeze the prompt encoder to expand its capabilities. This adjustment allows MedSAM to effectively handle multi-type prompts, addressing the limitations encountered in earlier versions. For interactive segmentation, both point prompts and bounding box (Bbox) prompts are utilized during training. The prompts are generated by randomly selecting points and applying jitter to the Bbox derived from the segmentation mask. This simulates varying levels of user inaccuracy, making the model more robust to real-world variations and enhancing its adaptability in practical applications. The relationship between the inputs and the prediction mask in MPA-MedSAM can be formally expressed as:
$M = F_{\mathrm{MPA}}(I, P, B)$  (3)

where $I$ denotes the input image, $P$ represents the point prompt, $B$ represents the Bbox prompt, and $M$ is the resulting segmentation mask of MPA-MedSAM. The function $F_{\mathrm{MPA}}$ encapsulates MPA-MedSAM, which processes the input image, point prompt, and box prompt to produce the corresponding mask.
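As a rough illustration of the adapter described above, the bottleneck structure (down-projection, ReLU, up-projection) might be sketched as below; the module name, the `reduction` factor, and the residual connection are illustrative assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class PromptAdapter(nn.Module):
    """Bottleneck adapter sketch: down-projection -> ReLU -> up-projection.

    Inserted at designated positions of the frozen image encoder; only the
    adapters and the prompt encoder would be trained, while the rest of the
    pre-trained MedSAM parameters stay frozen.
    """

    def __init__(self, dim: int, reduction: int = 4):
        super().__init__()
        self.down = nn.Linear(dim, dim // reduction)   # down-projection
        self.act = nn.ReLU(inplace=True)
        self.up = nn.Linear(dim // reduction, dim)     # up-projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connection keeps the frozen backbone features intact.
        return x + self.up(self.act(self.down(x)))
```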
III-B Uncertainty estimation of MPA-MedSAM with multi-type prompts
Given the variability of prompts from users with differing levels of experience, MedSAM's segmentation performance exhibits significant sensitivity to the position of Bboxes, potentially leading to inference errors. To address this, we adopt a strategy similar to the one used by SAM-U [24]. In this study, we introduce UGMP to incorporate multiple prompts to improve the accuracy of MPA-MedSAM. Given an initial Bbox $B_0$, by applying a simple random augmentation strategy to $B_0$, we generate multiple Bbox prompts as follows:

$B_i = B_0 + \Delta_i, \quad \Delta_i \sim \mathcal{N}(0, \sigma^2), \quad i = 1, \dots, N$  (4)

where $B_0 = (x_1, y_1, x_2, y_2)$ represents the initial Bbox, the random boxes $B_i$ are generated by applying the adjustments $\Delta_i$ to the coordinates of $B_0$, $\mathcal{N}$ denotes the normal distribution, and $\sigma$ represents the degree of perturbation applied to $B_0$. This operation generates a set of multiple Bbox prompts $\{B_i\}_{i=1}^{N}$ and a set of multiple point prompts $\{P_i\}_{i=1}^{N}$ that depends on the user's or clinician's choice ($P_i = \varnothing$ if no points are used). To quantify the uncertainty arising from the use of multiple prompts, with box prompts $\{B_i\}_{i=1}^{N}$, point prompts $\{P_i\}_{i=1}^{N}$, and the input image $I$, MPA-MedSAM predicts a set of results $\{M_i\}_{i=1}^{N}$, where $M_i$ is predicted as follows:

$M_i = F_{\mathrm{MPA}}(I, P_i, B_i), \quad i = 1, \dots, N$  (5)

where $B_i$ represents the $i$-th box of the set $\{B_i\}_{i=1}^{N}$, and $M_i$ represents the segmentation result obtained by feeding $(I, P_i, B_i)$ into MPA-MedSAM.
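A minimal sketch of the box perturbation in Eq. (4) and the per-prompt prediction in Eq. (5) is given below; the helper names, the Gaussian jitter scale `sigma`, and the `mpa_medsam` callable are hypothetical.

```python
import numpy as np

def perturb_boxes(box0, n=3, sigma=5.0, rng=None):
    """Sketch of Eq. (4): jitter the initial box B0 = (x1, y1, x2, y2)
    with Gaussian noise of scale `sigma` (in pixels) to get N box prompts.
    `sigma` and `n` are illustrative values, not the paper's settings."""
    rng = np.random.default_rng() if rng is None else rng
    box0 = np.asarray(box0, dtype=float)
    return [box0 + rng.normal(0.0, sigma, size=4) for _ in range(n)]

def predict_with_prompts(mpa_medsam, image, boxes, points=None):
    """Sketch of Eq. (5): one prediction per (point, box) prompt pair.
    `mpa_medsam` is a placeholder callable for the adapted model that
    returns a probability map."""
    points = points if points is not None else [None] * len(boxes)
    return [mpa_medsam(image, point, box) for point, box in zip(points, boxes)]
```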
The aggregation of these predictions leads to enhanced segmentation accuracy and a reduction in uncertainty. The final combined prediction $\bar{M}$ is formulated as follows:

$\bar{M} = \frac{1}{N} \sum_{i=1}^{N} M_i$  (6)

where $\bar{M}$ represents the average segmentation result obtained by applying the MPA-MedSAM model across the $N$ prompt instances.

By utilizing UGMP, the aleatoric uncertainty of a single given image $I$, instead of being estimated from each individual prediction $M_i$, is now directly estimated from the average prediction $\bar{M}$, described by the entropy [36]:

$U = -\big[\bar{M} \log \bar{M} + (1 - \bar{M}) \log (1 - \bar{M})\big]$  (7)

where the entropy is computed pixel-wise on the averaged probability map $\bar{M} \in [0, 1]$. This allows us to estimate the uncertainty distribution based on the aggregated prediction $\bar{M}$, rather than calculating it for each individual prediction $M_i$.
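The averaging of Eq. (6) and the pixel-wise entropy of Eq. (7) can be computed as in the following sketch, assuming the predictions are probability maps of identical shape.

```python
import numpy as np

def aggregate_and_uncertainty(preds, eps=1e-6):
    """Sketch of Eqs. (6)-(7): average the N probability maps and compute
    the pixel-wise binary entropy of the averaged prediction."""
    m_bar = np.mean(np.stack(preds, axis=0), axis=0)        # Eq. (6)
    m_bar = np.clip(m_bar, eps, 1.0 - eps)                  # avoid log(0)
    u = -(m_bar * np.log(m_bar)
          + (1.0 - m_bar) * np.log(1.0 - m_bar))            # Eq. (7)
    return m_bar, u
```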
In summary, MPA-MedSAM with UGMP can be expressed as follows:

$(\bar{M}, U) = F_{\mathrm{UGMP}}\big(I, \{B_i\}_{i=1}^{N}, \{P_i\}_{i=1}^{N}\big)$  (8)

where $I$ represents the input image, $\{B_i\}_{i=1}^{N}$ are the multiple Bbox prompts, $\{P_i\}_{i=1}^{N}$ are the point prompts, again depending on the user's or clinician's choice, $\bar{M}$ is the resulting segmentation mask based on the combination of predictions from multiple prompts, and $U$ is the associated uncertainty map.
III-C Uncertainty-Guided Prompts Adaptation
Providing prompts that produce the desired results can be a difficult process that often requires users or experts to go through tedious trial-and-error experimentation. Due to the uncertainty introduced by different prompts during the inference process, we investigated whether this knowledge could be utilized to refine the initial prompts, thereby producing more predictable and precise segmentation mask outputs. To address this issue, we introduce UGPA, which refines a series of sampled prompts so that the model's segmentation can be improved through an automatic process. Fig. 2 (b) shows the schematic view of our proposed UGPA.
In the UGPA process, we first select the Top-K Bboxes based on the edges of the uncertainty map $U$, which highlight regions of high uncertainty. These boxes are then subjected to slight adjustments, simulating expert refinement of their accuracy. Second, we select the Top-K points with the highest uncertainty values from $U$, representing the most uncertain areas within the image; these selected points are then combined with the adjusted Bboxes to create refined prompts. To provide a clear and intuitive description of the process, we represent each step using the following formulas:
$B^{U} = \big[(x_{tl}, y_{tl}), (x_{br}, y_{br})\big]$  (9)

where:

$(x_{tl}, y_{tl}) = \Big(\min_{(x, y) \in \mathrm{Edge}(U)} x,\ \min_{(x, y) \in \mathrm{Edge}(U)} y\Big)$  (10)

$(x_{br}, y_{br}) = \Big(\max_{(x, y) \in \mathrm{Edge}(U)} x,\ \max_{(x, y) \in \mathrm{Edge}(U)} y\Big)$  (11)

$\mathrm{Edge}(U)$ refers to identifying the coordinates of the edges of the uncertainty map $U$, and $(x, y) \in \mathrm{Edge}(U)$ refers to the coordinates of the edge points. Specifically, $(x_{tl}, y_{tl})$ represents the coordinates of the top-left corner, and $(x_{br}, y_{br})$ represents the coordinates of the bottom-right corner of the bounding box $B^{U}$ that encloses these edge points. We select the Top-1 Bbox based on the edges of $U$, i.e., the box that has the largest area and fully contains the edge of $U$. Then, we apply Eq. (4) again, simulating varying levels of expert knowledge, to generate a set of refined box prompts $\{B'_i\}_{i=1}^{N}$. Next, we select the Top-K points with the highest uncertainty values from $U$:

$\{P'_k\}_{k=1}^{K} = \mathrm{TopK}\big(\mathrm{Sort}(U)\big)$  (12)

where $\mathrm{Sort}(\cdot)$ denotes the sort function and $\{P'_k\}_{k=1}^{K}$ represents the set of selected points. We combine the adjusted Bboxes with the selected points to create the refined prompt set $\{B'_i\}_{i=1}^{N} \cup \{P'_k\}_{k=1}^{K}$.
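A possible implementation sketch of the UGPA prompt refinement (Eqs. (9)-(12)) is shown below; the thresholding used to approximate $\mathrm{Edge}(U)$, as well as the parameters `edge_thresh`, `n`, `k`, and `jitter_sigma`, are illustrative assumptions.

```python
import numpy as np

def refine_prompts(u_map, n=3, k=3, edge_thresh=0.5, jitter_sigma=5.0, rng=None):
    """Sketch of Eqs. (9)-(12): derive a box enclosing the high-uncertainty
    edge region and pick the Top-K most uncertain pixels as point prompts."""
    rng = np.random.default_rng() if rng is None else rng
    # Edge(U): approximated here by thresholding the normalized uncertainty.
    u_norm = (u_map - u_map.min()) / (u_map.max() - u_map.min() + 1e-8)
    ys, xs = np.nonzero(u_norm > edge_thresh)
    if xs.size == 0:  # fall back to the whole map if no region exceeds the threshold
        ys, xs = np.nonzero(np.ones_like(u_map, dtype=bool))
    # Eqs. (10)-(11): top-left / bottom-right corners of the enclosing box.
    box_u = np.array([xs.min(), ys.min(), xs.max(), ys.max()], dtype=float)
    # Eq. (4) again: jitter the box to simulate varying expert adjustments.
    boxes = [box_u + rng.normal(0.0, jitter_sigma, size=4) for _ in range(n)]
    # Eq. (12): Top-K most uncertain pixels, returned as (x, y) point prompts.
    flat_idx = np.argsort(u_map, axis=None)[-k:]
    pys, pxs = np.unravel_index(flat_idx, u_map.shape)
    points = list(zip(pxs.tolist(), pys.tolist()))
    return boxes, points
```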
Subsequently, these refined prompts are fed back into MPA-MedSAM, where they serve as crucial inputs to guide further refinement of the segmentation output. By leveraging these improved prompts, our method enhances the accuracy of the segmentation, reducing errors and producing more precise results. This process ensures that the model benefits from the enhanced guidance provided by the refined prompts, ultimately leading to superior segmentation performance. Feeding the refined prompts into MPA-MedSAM with UGMP can be expressed as follows:

$(\bar{M}', U') = F_{\mathrm{UGMP}}\big(I, \{B'_i\}_{i=1}^{N}, \{P'_k\}_{k=1}^{K}\big)$  (13)

where $\bar{M}'$ is the segmentation mask based on UGPA, and $U'$ is the associated uncertainty map.

Finally, the refinement process is informed by Active Learning principles: we assess the output uncertainty after the prompt optimization and update the segmentation results only if the optimization leads to reduced uncertainty estimates. Importantly, the final output is not considered definitive until it is verified against the uncertainty map to ensure that performance has indeed improved. This can be denoted by:

$M_{\mathrm{final}} = \begin{cases} \bar{M}', & \text{if } \mathrm{mean}(U') < \mathrm{mean}(U) \\ \bar{M}, & \text{otherwise} \end{cases}$  (14)
The details of the proposed MedSAM-U method are outlined in Algorithm 1.
Input: Image $I$, initial Bbox $B_0$
Output: Refined mask $M_{\mathrm{final}}$
1 Generate $N$ random boxes $\{B_i\}_{i=1}^{N}$ based on $B_0$ with Eq. (4);
2 Input $\big(I, \{B_i\}_{i=1}^{N}, \{P_i\}_{i=1}^{N}\big)$ to MedSAM-U with Eq. (8) to obtain $(\bar{M}, U)$;
3 Select the Top-K boxes and points $\{B'_i\}_{i=1}^{N}, \{P'_k\}_{k=1}^{K}$ with Eq. (9) to Eq. (12);
4 Re-input $\big(I, \{B'_i\}_{i=1}^{N}, \{P'_k\}_{k=1}^{K}\big)$ to MedSAM-U with Eq. (13) to obtain $(\bar{M}', U')$;
5 Verify and update with Eq. (14);
6 Output: $M_{\mathrm{final}}$
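Putting the steps of Algorithm 1 together, a compact inference driver might look like the following sketch, reusing the hypothetical helpers introduced in the earlier sketches (`perturb_boxes`, `predict_with_prompts`, `aggregate_and_uncertainty`, `refine_prompts`).

```python
def medsam_u_inference(mpa_medsam, image, box0, n=3, k=3):
    """Sketch of Algorithm 1 under the assumptions stated in the earlier sketches."""
    # Step 1: generate N random boxes from the initial Bbox (Eq. (4)).
    boxes = perturb_boxes(box0, n=n)
    # Step 2: first-pass predictions and uncertainty (Eqs. (5)-(8)).
    preds = predict_with_prompts(mpa_medsam, image, boxes)
    m_bar, u = aggregate_and_uncertainty(preds)
    # Step 3: uncertainty-guided prompt adaptation (Eqs. (9)-(12)).
    ref_boxes, ref_points = refine_prompts(u, n=n, k=k)
    # Step 4: second-pass predictions with the refined prompts (Eq. (13)).
    ref_preds = predict_with_prompts(
        mpa_medsam, image, ref_boxes, points=[ref_points] * len(ref_boxes))
    m_ref, u_ref = aggregate_and_uncertainty(ref_preds)
    # Step 5: keep the refined result only if uncertainty decreased (Eq. (14)).
    return m_ref if u_ref.mean() < u.mean() else m_bar
```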
IV Experiments and Results
IV-A Datasets & Loss & Implementation Details
IV-A1 Datasets
To assess the effectiveness of our proposed MedSAM-U, we choose five different 2D medical imaging modalities: Dermoscopy, CT, MRI, Colonoscopy, and Ultrasound. Each modality is represented by a single dataset, with the specific details provided below.
Dermoscopy: ISIC-2017 [37], hosted at the Medical Image Computing and Computer Assisted Intervention (MICCAI) conference, is a skin lesion segmentation dataset for melanoma detection, including 2,594 annotated images.
Ultrasound: CT2US [38] is a dataset designed for kidney segmentation using cross-modal transfer learning. It includes paired CT and ultrasound images with corresponding annotations for kidney segmentation, aiming to improve segmentation accuracy in ultrasound images with limited data. The dataset has a total of 4,586 samples and is simply referred to as Kidney.
CT: This study uses the open-access segmented KiTS23 dataset [39]. Since the dataset is primarily designed for 3D segmentation tasks, we adapted it to perform 2D segmentation. To achieve this, we converted the 3D volumetric data into 2D slices. Specifically, we extracted slices along the z-axis, focusing on the central region of each volume, and selected slices at regular intervals to ensure a representative and manageable dataset with a total of 3,882 samples.
MRI: In this study, we utilize the publicly available BraTS 2021 glioma segmentation MRI dataset for evaluation purposes. Since only the training dataset includes GT segmentation masks, making it suitable for an automatic evaluation of a point-to-mask task similar to SAM [6], we exclusively use the training dataset for our current evaluation. To assess our method's potential in supporting interactive clinical treatment planning, and due to the model's limitation to a single image input, we evaluated segmentation accuracy using the T1-modality MRI sequence as input. We extracted all slice images and their corresponding masks along the z-axis, selecting middle slices at intervals to convert the 3D images into 2D slices, generating a total of 4,586 samples. Subsequently, we randomly split the data, using 80% for training and 20% for testing, based on image indices.
Colonoscopy: We use the Kvasir-SEG dataset [1], which consists of 1,000 polyp images and their corresponding GT masks annotated by expert endoscopists from Oslo University Hospital (Norway).
All the data are used for evaluation. The number of box prompts $N$ is set to 3 for our experiments. Box prompts were generated based on the area and size of the ground truth; the length and width of the boxes were randomly adjusted to mimic manually provided box prompts.
IV-A2 Loss function
In our method, to train the MPA so that MedSAM can adapt to various types of prompts, we utilize a combination of a binary focal loss [40] and a Dice loss [41] to supervise the output during training, following [42]. This combined loss function is designed to address the challenges of class imbalance and accurate boundary delineation in segmentation tasks. The loss is calculated by the following formula:
$\mathcal{L} = \mathcal{L}_{\mathrm{focal}} + \mathcal{L}_{\mathrm{dice}}$  (15)

where:

$\mathcal{L}_{\mathrm{focal}} = -\frac{1}{|I|} \sum_{j} (1 - p_j)^{\gamma}\, y_j \log p_j, \qquad \mathcal{L}_{\mathrm{dice}} = 1 - \frac{2 \sum_{j} p_j y_j}{\sum_{j} p_j + \sum_{j} y_j}$  (16)

where $I$ denotes the input image, $M_{\theta} = F_{\mathrm{MPA}}(I, P, B)$ represents the segmentation result predicted by MPA-MedSAM with the Bbox prompt $B$ and point prompt $P$, $p_j$ denotes its predicted probability for pixel $j$, $\theta$ represents the model parameters to be updated, and $y_j$ denotes the ground-truth label for pixel $j$.
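A sketch of the combined loss in Eqs. (15)-(16) is given below; the focal-loss weighting factors `alpha` and `gamma` are common defaults rather than the paper's reported settings.

```python
import torch
import torch.nn.functional as F

def combined_loss(logits, target, alpha=0.25, gamma=2.0, eps=1e-6):
    """Sketch of Eqs. (15)-(16): binary focal loss [40] plus Dice loss [41]."""
    prob = torch.sigmoid(logits)
    # Focal loss: down-weight easy pixels via the (1 - p_t)^gamma factor.
    ce = F.binary_cross_entropy_with_logits(logits, target, reduction="none")
    p_t = prob * target + (1.0 - prob) * (1.0 - target)
    alpha_t = alpha * target + (1.0 - alpha) * (1.0 - target)
    focal = (alpha_t * (1.0 - p_t) ** gamma * ce).mean()
    # Dice loss: 1 - soft Dice coefficient over the whole mask.
    inter = (prob * target).sum()
    dice = 1.0 - (2.0 * inter + eps) / (prob.sum() + target.sum() + eps)
    return focal + dice
```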
IV-A3 Implementation Details
For interactive segmentation, we employ Bbox prompts combined with point prompts during the model training process. In this study, we adhered to the default training settings of MedSAM for 2D medical image training. Taking the five different modality datasets as input, we trained the model for 60 epochs, a smaller number of epochs compared to fully fine-tuned training. In the interactive mode, during the initial step of simulating user clicks to initialize prompts, we experimented with various prompt settings, including: (1) 3 random positive points, denoted as 3P; (2) 5 random positive points, denoted as 5P; (3) 10 random positive points, denoted as 10P; (4) 3 Bboxes with 50% overlap with the target, denoted as 3B (0.5); (5) 3 Bboxes with 75% overlap with the target, denoted as 3B (0.75); and (6) multi-type prompts composed of both 3 points and 3 Bboxes with overlap 0.5 or 0.75, referred to as 3P & 3B (0.5) and 3P & 3B (0.75). The same notation applies to the other cases. All experiments are implemented with the PyTorch platform and trained/tested on a single NVIDIA 4090 GPU. We utilized the default settings to reproduce the comparison methods. We use two commonly used metrics for the evaluation: the Dice Coefficient (Dice) and Intersection over Union (IoU). Dice calculates the overlap between the prediction and GT as twice the area of overlap divided by the sum of the areas of the prediction and GT. IoU measures the ratio of the intersection to the union of the predicted and true regions.
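For reference, the two evaluation metrics can be computed on binary masks as in the following sketch.

```python
import numpy as np

def dice_iou(pred_mask, gt_mask, eps=1e-6):
    """Dice and IoU for binary masks, as used in the evaluation."""
    pred = pred_mask.astype(bool)
    gt = gt_mask.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    dice = (2.0 * inter + eps) / (pred.sum() + gt.sum() + eps)
    iou = (inter + eps) / (np.logical_or(pred, gt).sum() + eps)
    return dice, iou
```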
Model | Dermoscopy | Colonoscopy | Ultrasound | CT | MRI | Avg | ||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
IoU | Dice | IoU | Dice | IoU | Dice | IoU | Dice | IoU | Dice | IoU | Dice | |
SAM 3B (0.5) | 0.481 | 0.609 | 0.346 | 0.414 | 0.467 | 0.619 | 0.558 | 0.665 | 0.233 | 0.321 | 0.417 | 0.526 |
SAM 3B (0.75) | 0.656 | 0.773 | 0.651 | 0.725 | 0.656 | 0.783 | 0.667 | 0.762 | 0.470 | 0.596 | 0.620 | 0.728 |
MedSAM 3B (0.5) | 0.446 | 0.566 | 0.516 | 0.646 | 0.566 | 0.705 | 0.455 | 0.585 | 0.516 | 0.665 | 0.500 | 0.633 |
MedSAM 3B (0.75) | 0.778 | 0.861 | 0.811 | 0.880 | 0.873 | 0.931 | 0.661 | 0.763 | 0.664 | 0.785 | 0.758 | 0.844 |
MedSAM-U 3B (0.5) | 0.801 | 0.883 | 0.779 | 0.867 | 0.899 | 0.946 | 0.672 | 0.766 | 0.599 | 0.727 | 0.750 | 0.838 |
MedSAM-U 3B (0.75) | 0.839 | 0.909 | 0.833 | 0.903 | 0.920 | 0.958 | 0.700 | 0.783 | 0.626 | 0.750 | 0.784 | 0.861 |


Model | Dermoscopy | Colonoscopy | Ultrasound | CT | MRI | Avg | ||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
IoU | Dice | IoU | Dice | IoU | Dice | IoU | Dice | IoU | Dice | IoU | Dice | |
SAM 3P | 0.375 | 0.490 | 0.283 | 0.359 | 0.229 | 0.360 | 0.074 | 0.112 | 0.127 | 0.209 | 0.217 | 0.306 |
SAM 10P | 0.384 | 0.504 | 0.288 | 0.373 | 0.238 | 0.374 | 0.073 | 0.115 | 0.131 | 0.215 | 0.223 | 0.316 |
SAM 3B (0.5) | 0.481 | 0.609 | 0.346 | 0.414 | 0.467 | 0.619 | 0.558 | 0.665 | 0.233 | 0.321 | 0.417 | 0.526 |
SAM 10P&3B (0.5) | 0.504 | 0.644 | 0.368 | 0.467 | 0.344 | 0.504 | 0.182 | 0.280 | 0.229 | 0.347 | 0.325 | 0.448 |
MedSAM 3P | 0.060 | 0.106 | 0.036 | 0.063 | 0.093 | 0.159 | 0.016 | 0.029 | 0.027 | 0.047 | 0.046 | 0.081 |
MedSAM 10P | 0.146 | 0.240 | 0.078 | 0.130 | 0.149 | 0.245 | 0.028 | 0.048 | 0.060 | 0.100 | 0.092 | 0.153 |
MedSAM 3B (0.5) | 0.446 | 0.566 | 0.516 | 0.646 | 0.566 | 0.705 | 0.455 | 0.585 | 0.516 | 0.665 | 0.500 | 0.633 |
MedSAM 10P&3B (0.5) | 0.341 | 0.467 | 0.281 | 0.388 | 0.153 | 0.240 | 0.270 | 0.386 | 0.284 | 0.406 | 0.266 | 0.377 |
MPA-MedSAM 3P | 0.125 | 0.188 | 0.063 | 0.096 | 0.194 | 0.280 | 0.002 | 0.003 | 0.009 | 0.015 | 0.079 | 0.117 |
MPA-MedSAM 10P | 0.266 | 0.374 | 0.113 | 0.174 | 0.371 | 0.497 | 0.005 | 0.009 | 0.014 | 0.024 | 0.154 | 0.216 |
MPA-MedSAM 3B (0.5) | 0.623 | 0.753 | 0.554 | 0.693 | 0.678 | 0.799 | 0.479 | 0.607 | 0.400 | 0.542 | 0.547 | 0.679 |
MPA-MedSAM 10P&3B (0.5) | 0.662 | 0.784 | 0.596 | 0.729 | 0.718 | 0.829 | 0.519 | 0.644 | 0.441 | 0.584 | 0.587 | 0.714 |
IV-B Comparisons with the SOTA methods on Multi-modality Images
To validate the overall performance of our proposed method, we compare it with SOTA segmentation foundation models across five different modalities. As shown in Table I, we include comparisons with SAM [6] and the fully fine-tuned medical SAM (MedSAM) [7], evaluated using the Dice and IoU scores.
First, when comparing the performance of the same model under different Bbox qualities (overlap ratio from 0.5 to 0.75), a notable improvement is observed in SAM across all imaging modalities as the Bbox overlap ratio increases from 0.5 to 0.75. MedSAM exhibits a similar trend, with performance enhancements corresponding to higher Bbox quality. This reveals that the quality of the Bboxes significantly impacts segmentation performance, indicating that the models are sensitive to the quality of the box prompts.
Furthermore, our proposed method, represented as MedSAM-U 3B (0.5) and 3B (0.75) in the last two rows of Table I, achieves SOTA performance in Dermoscopy, Colonoscopy, Ultrasound, and CT, with final Avg Dice scores of 90.9%, 90.3%, 95.8%, and 78.3% with 3B (0.75), surpassing MedSAM by 4.8%, 2.3%, 2.7%, and 2%, respectively. When the Bbox overlap ratio is 0.5, our method surpasses MedSAM by 20.5%. The results in Table I demonstrate that, even when the initial Bbox quality is low, significant improvements in segmentation performance are observed. This demonstrates the effectiveness of our approach in enhancing segmentation accuracy, particularly in scenarios where the initial Bbox prompts are suboptimal. These results reveal that MedSAM-U achieves significant accuracy across various medical segmentation tasks even with low-quality Bboxes, eliminating the need for manual adjustments to achieve satisfactory results.
Fig. 3 visualizes example segmentation results for each method across different modalities, with Bboxes of varying degrees of roughness simulating expert annotations. SAM and MedSAM segment in a single step, and when the initial Bbox prompts provided by experts are not perfectly accurate, the segmentations may suffer from under-segmentation or over-segmentation errors. In contrast, MedSAM-U can accurately segment various targets across different imaging conditions.
IV-C Ablation Study
IV-C1 Effectiveness of Multi-Prompt Combinations
In our method, the MPA is introduced to integrate diverse types of prompts (e.g., points and boxes) into MedSAM, enabling the simultaneous input of various prompt combinations. To validate the effectiveness of the MPA-MedSAM module, we present the segmentation results of different interactive segmentation models using various prompt combinations across five different datasets, as detailed in Table II. We use the results from the 3B (0.5) prompts as the baseline for multi-prompt segmentation in SAM, MedSAM, and MPA-MedSAM on the standard medical imaging datasets, shown in the 3rd, 7th, and 11th rows, respectively.
Modality | Dermoscopy | Colonoscopy | Ultrasound | Avg | ||||
---|---|---|---|---|---|---|---|---|
IoU | Dice | IoU | Dice | IoU | Dice | IoU | Dice | |
w/o UGPA | 0.619 | 0.750 | 0.559 | 0.698 | 0.673 | 0.796 | 0.617 | 0.748 |
UGPA1 | 0.774 | 0.863 | 0.751 | 0.843 | 0.889 | 0.939 | 0.805 | 0.882 |
UGPA2 | 0.779 | 0.867 | 0.755 | 0.846 | 0.891 | 0.941 | 0.808 | 0.885 |
UGPA3 | 0.796 | 0.880 | 0.784 | 0.871 | 0.898 | 0.945 | 0.826 | 0.899 |
UGPA4 | 0.801 | 0.883 | 0.779 | 0.867 | 0.898 | 0.946 | 0.826 | 0.899 |
Modality | Dermoscopy | Colonoscopy | Ultrasound | Avg | ||||
---|---|---|---|---|---|---|---|---|
IoU | Dice | IoU | Dice | IoU | Dice | IoU | Dice | |
w/o UGPA | 0.792 | 0.877 | 0.766 | 0.859 | 0.885 | 0.938 | 0.814 | 0.892 |
UGPA1 | 0.815 | 0.892 | 0.800 | 0.880 | 0.909 | 0.952 | 0.842 | 0.908 |
UGPA2 | 0.817 | 0.893 | 0.805 | 0.885 | 0.911 | 0.953 | 0.844 | 0.910 |
UGPA3 | 0.836 | 0.906 | 0.832 | 0.902 | 0.920 | 0.958 | 0.862 | 0.922 |
UGPA4 | 0.839 | 0.909 | 0.833 | 0.903 | 0.920 | 0.958 | 0.864 | 0.923 |
By analyzing the results from the 3rd to 4th row for SAM and from the 7th to 8th row for MedSAM, it is evident that combining point and box prompts leads to a decrease in average Dice scores from 52.6% to 44.8% for SAM, and from 63.3% to 37.7% for MedSAM. This reveals that the performance of SAM and MedSAM does not improve with the incorporation of multi-type prompts. Moreover, the use of point prompts may negatively impact the effectiveness of box prompts, resulting in a deterioration of segmentation performance. In contrast, with the introduction of MPA, the performance of MPA-MedSAM improves by approximately 3.5% from the 11th row to the 12th row, further demonstrating the effectiveness of combining point prompts with box prompts.
IV-C2 Effectiveness of Uncertainty-Guided Prompts Adaptation
We conducted a comprehensive ablation study to validate the effectiveness of the proposed UGPA. The results are presented in Table III and Table IV, where the baseline (1st row) represents 3 low-quality Bboxes obtained by randomly shifting the user-provided Bbox, serving as the initial input for our proposed MedSAM-U without UGPA. As shown in the 2nd and 3rd rows of Table III and Table IV, combining box prompts with point prompts yields an improvement compared to using the initial 3 Bboxes without UGPA. The combined approach also demonstrates enhanced performance relative to refining Bboxes alone, in both the 3B (0.5) and 3B (0.75) settings. Additionally, from the 3rd row to the 5th row of Table III and Table IV, we observe a 1.3% improvement for the 3B (0.5) setting and a 1.3% improvement for the 3B (0.75) setting as the number of points increases. Specifically, increasing the number of points from 5 to 10 results in only minimal enhancement; more points do not necessarily yield better results, and the improvement tends to saturate or can even become counterproductive in some cases.
The segmentation results of our method in different modalities are shown in Fig. 4. Three Bboxes were generated by introducing variations to the GT box in terms of position and shape to simulate user input, serving as the input for our proposed MedSAM-U method. With the refinement provided by MedSAM-U, the results show significant improvement. The method effectively leverages the uncertainty map, enhancing the accuracy and robustness of the segmentation.
V Conclusion
In this study, we introduced MedSAM-U, a novel model designed with an uncertainty-guided, auto-refining multi-prompt approach for reliable and accurate medical image segmentation. First, MPA-MedSAM was utilized to adapt various multi-prompt strategies for MedSAM. We then implemented UGMP to estimate uncertainty in the segmentation results without adding additional training parameters. Crucially, we developed a novel uncertainty-guided multi-prompt adaptation method that automatically generates reliable prompts, leading to highly accurate segmentation results. Furthermore, by training on datasets from multiple modalities, MedSAM-U effectively functions as a universal image segmentation model. Experimental results across datasets from five distinct modalities show that, within the Bbox overlap ratio range of 0.5 to 0.75, MedSAM-U significantly improves performance, with average improvements ranging from 1.7% to 20.5% compared to the baseline MedSAM model. Additionally, our results indicate that the lower the initial Bbox quality, the greater the improvement achieved by MedSAM-U.
Moving forward, our research will concentrate on two key areas. First, we plan to directly estimate uncertainty within the adapter. Second, we aim to harness this uncertainty for automatic advanced annotation, enabling the model to perform annotations without human intervention.
References
- [1] K. Pogorelov, K. R. Randel, C. Griwodz, S. L. Eskeland, T. de Lange, D. Johansen, C. Spampinato, D.-T. Dang-Nguyen, M. Lux, P. T. Schmidt et al., “Kvasir: A multi-class image dataset for computer aided gastrointestinal disease detection,” in Proceedings of the 8th ACM on Multimedia Systems Conference, 2017, pp. 164–169.
- [2] T. Falk, D. Mai, R. Bensch, Ö. Çiçek, A. Abdulkadir, Y. Marrakchi, A. Böhm, J. Deubner, Z. Jäckel, K. Seiwald et al., “U-net: deep learning for cell counting, detection, and morphometry,” Nature methods, vol. 16, no. 1, pp. 67–70, 2019.
- [3] Z. Zhou, M. M. R. Siddiquee, N. Tajbakhsh, and J. Liang, “Unet++: Redesigning skip connections to exploit multiscale features in image segmentation,” IEEE Transactions on Medical Imaging, vol. 39, no. 6, pp. 1856–1867, 2020.
- [4] F. Isensee, P. F. Jaeger, S. A. Kohl, J. Petersen, and K. H. Maier-Hein, “nnu-net: a self-configuring method for deep learning-based biomedical image segmentation,” Nature methods, vol. 18, no. 2, pp. 203–211, 2021.
- [5] Y. Gao, M. Zhou, and D. N. Metaxas, “Utnet: a hybrid transformer architecture for medical image segmentation,” in Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part III 24. Springer, 2021, pp. 61–71.
- [6] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo et al., “Segment anything,” arXiv preprint arXiv:2304.02643, 2023.
- [7] J. Ma, Y. He, F. Li, L. Han, C. You, and B. Wang, “Segment anything in medical images,” Nature Communications, vol. 15, no. 1, p. 654, 2024.
- [8] C. Zhou, X. Li, C. C. Loy, and B. Dai, “Edgesam: Prompt-in-the-loop distillation for on-device deployment of sam,” arXiv preprint arXiv:2312.06660, 2023.
- [9] A. Kendall and Y. Gal, “What uncertainties do we need in bayesian deep learning for computer vision?” in NIPS, 2017.
- [10] B. Lakshminarayanan, A. Pritzel, and C. Blundell, “Simple and scalable predictive uncertainty estimation using deep ensembles,” Advances in Neural Information Processing Systems, vol. 30, 2017.
- [11] J. Luo, A. Sedghi, K. Popuri, D. Cobzas, M. Zhang, F. Preiswerk, M. Toews, A. Golby, M. Sugiyama, W. M. Wells et al., “On the applicability of registration uncertainty,” in Medical Image Computing and Computer Assisted Intervention–MICCAI 2019: 22nd International Conference, Shenzhen, China, October 13–17, 2019, Proceedings, Part II 22. Springer, 2019, pp. 410–419.
- [12] K. Zou, X. Yuan, X. Shen, Y. Chen, M. Wang, R. S. M. Goh, Y. Liu, and H. Fu, “Evidencecap: Towards trustworthy medical image segmentation via evidential identity cap,” arXiv preprint arXiv:2301.00349, 2023.
- [13] J. Van Amersfoort, L. Smith, Y. W. Teh, and Y. Gal, “Uncertainty estimation using a single deep deterministic neural network,” in International Conference on Machine Learning. PMLR, 2020, pp. 9690–9700.
- [14] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 3431–3440.
- [15] J. Chen, Y. Lu, Q. Yu, X. Luo, E. Adeli, Y. Wang, L. Lu, A. L. Yuille, and Y. Zhou, “Transunet: Transformers make strong encoders for medical image segmentation,” arXiv preprint arXiv:2102.04306, 2021.
- [16] J. Cheng, J. Ye, Z. Deng, J. Chen, T. Li, H. Wang, Y. Su, Z. Huang, J. Chen, L. Jiang et al., “Sam-med2d,” arXiv preprint arXiv:2308.16184, 2023.
- [17] C. Cui, R. Deng, Q. Liu, T. Yao, S. Bao, L. W. Remedios, B. A. Landman, Y. Tang, and Y. Huo, “All-in-sam: from weak annotation to pixel-wise nuclei segmentation with prompt-based finetuning,” in Journal of Physics: Conference Series, vol. 2722, no. 1. IOP Publishing, 2024, p. 012012.
- [18] S. Gong, Y. Zhong, W. Ma, J. Li, Z. Wang, J. Zhang, P.-A. Heng, and Q. Dou, “3dsam-adapter: Holistic adaptation of sam from 2d to 3d for promptable tumor segmentation,” Medical Image Analysis, p. 103324, 2024.
- [19] H. Li, H. Liu, D. Hu, J. Wang, and I. Oguz, “Prism: A promptable and robust interactive segmentation model with visual prompts,” arXiv preprint arXiv:2404.15028, 2024.
- [20] B. Cui, M. Islam, L. Bai, and H. Ren, “Surgical-dino: adapter learning of foundation models for depth estimation in endoscopic surgery,” International Journal of Computer Assisted Radiology and Surgery, pp. 1–8, 2024.
- [21] C. Hu, T. Xia, S. Ju, and X. Li, “When sam meets medical images: An investigation of segment anything model (sam) on multi-phase liver tumor segmentation,” arXiv preprint arXiv:2304.08506, 2023.
- [22] Y. Li, M. Hu, and X. Yang, “Polyp-sam: Transfer sam for polyp segmentation,” in Medical Imaging 2024: Computer-Aided Diagnosis, vol. 12927. SPIE, 2024, pp. 759–765.
- [23] J. Wu, W. Ji, Y. Liu, H. Fu, M. Xu, Y. Xu, and Y. Jin, “Medical sam adapter: Adapting segment anything model for medical image segmentation,” arXiv preprint arXiv:2304.12620, 2023.
- [24] G. Deng, K. Zou, K. Ren, M. Wang, X. Yuan, S. Ying, and H. Fu, “Sam-u: Multi-box prompts triggered uncertainty estimation for reliable sam in medical image,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2023, pp. 368–377.
- [25] H. Li, H. Liu, D. Hu, J. Wang, and I. Oguz, “Promise: Prompt-driven 3d medical image segmentation using pretrained image foundation models,” arXiv preprint arXiv:2310.19721, 2023.
- [26] J. Wu and M. Xu, “One-prompt to segment all medical images,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 11 302–11 312.
- [27] X. Yao, H. Liu, D. Hu, D. Lu, A. Lou, H. Li, R. Deng, G. Arenas, B. Oguz, N. Schwartz et al., “False negative/positive control for sam on noisy medical images,” arXiv preprint arXiv:2308.10382, 2023.
- [28] Y. Zhang, S. Hu, C. Jiang, Y. Cheng, and Y. Qi, “Segment anything model with uncertainty rectification for auto-prompting medical image segmentation,” arXiv preprint arXiv:2311.10529, 2023.
- [29] Y. Gal and Z. Ghahramani, “Dropout as a bayesian approximation: Representing model uncertainty in deep learning,” in international conference on machine learning. PMLR, 2016, pp. 1050–1059.
- [30] S. C. Hora, “Aleatory and epistemic uncertainty in probability elicitation with an example from hazardous waste management,” Reliability Engineering & System Safety, vol. 54, no. 2-3, pp. 217–223, 1996.
- [31] E. Hüllermeier and W. Waegeman, “Aleatoric and epistemic uncertainty in machine learning: An introduction to concepts and methods,” Machine learning, vol. 110, no. 3, pp. 457–506, 2021.
- [32] A. Saad, G. Hamarneh, and T. Möller, “Exploration and visualization of segmentation uncertainty using shape and appearance prior information,” IEEE Transactions on Visualization and Computer Graphics, vol. 16, no. 6, pp. 1366–1375, 2010.
- [33] S. Parisot, W. Wells III, S. Chemouny, H. Duffau, and N. Paragios, “Concurrent tumor segmentation and registration with uncertainty-based sparse non-uniform graphs,” Medical image analysis, vol. 18, no. 4, pp. 647–659, 2014.
- [34] J.-S. Prassni, T. Ropinski, and K. Hinrichs, “Uncertainty-aware guided volume segmentation,” IEEE transactions on visualization and computer graphics, vol. 16, no. 6, pp. 1358–1365, 2010.
- [35] K. Zou, X. Yuan, X. Shen, M. Wang, and H. Fu, “Tbrats: Trusted brain tumor segmentation,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2022, pp. 503–513.
- [36] B. Bein, “Entropy,” Best Practice & Research Clinical Anaesthesiology, vol. 20, no. 1, pp. 101–109, 2006.
- [37] D. Gutman, N. C. Codella, E. Celebi, B. Helba, M. Marchetti, N. Mishra, and A. Halpern, “Skin lesion analysis toward melanoma detection: A challenge,” in International Symposium on Biomedical Imaging, 2016.
- [38] Y. Song, J. Zheng, L. Lei, Z. Ni, B. Zhao, and Y. Hu, “Ct2us: Cross-modal transfer learning for kidney segmentation in ultrasound images with synthesized data,” Ultrasonics, vol. 122, p. 106706, 2022.
- [39] N. Heller, F. Isensee, D. Trofimova, R. Tejpaul, Z. Zhao, H. Chen, L. Wang, A. Golts, D. Khapun, D. Shats et al., “The kits21 challenge: Automatic segmentation of kidneys, renal tumors, and renal cysts in corticomedullary-phase ct,” arXiv preprint arXiv:2307.01984, 2023.
- [40] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2980–2988.
- [41] S. Jadon, “A survey of loss functions for semantic segmentation,” in 2020 IEEE conference on computational intelligence in bioinformatics and computational biology (CIBCB). IEEE, 2020, pp. 1–7.
- [42] R. Kaur and S. K. Ranade, “Improving accuracy of convolutional neural network-based skin lesion segmentation using group normalization and combined loss function,” International Journal of Information Technology, vol. 15, no. 5, pp. 2827–2835, 2023.