
MedSAM-U: Uncertainty-Guided Auto Multi-Prompt Adaptation for Reliable MedSAM

Nan Zhou, Ke Zou, Kai Ren, Mengting Luo, Linchao He,
Meng Wang, Yidi Chen, Yi Zhang, Hu Chen, and Huazhu Fu
This work was supported by the Sichuan Science and Technology Program under Grant 2022JDJQ0045, and the Chengdu Key Research and Development Support Project under Grant 2024YF0500910SN. N. Zhou, K. Zou, K. Ren, M. Luo, L. He, and H. Chen are with the College of Computer Science, Sichuan University, Chengdu 610065, China. Meng Wang is with the Centre for Innovation & Precision Eye Health, Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore 119228. Y. Chen is with the Department of Radiology, West China Hospital, Sichuan University, Chengdu 610065, China. Y. Zhang is with the School of Cyber Science and Engineering and the Key Laboratory of Data Protection and Intelligent Management, Ministry of Education, Sichuan University, Chengdu 610065, China. Huazhu Fu is with the Institute of High Performance Computing (IHPC), Agency for Science, Technology and Research (A*STAR), Singapore 138632. N. Zhou and K. Zou contributed equally to this work. Hu Chen and Huazhu Fu are the co-corresponding authors (e-mail: [email protected], [email protected]).
Abstract

The Medical Segment Anything Model (MedSAM) has shown remarkable performance in medical image segmentation, drawing significant attention in the field. However, its sensitivity to varying prompt types and locations poses challenges. This paper addresses these challenges by focusing on the development of reliable prompts that enhance MedSAM's accuracy. We introduce MedSAM-U, an uncertainty-guided framework designed to automatically refine multi-prompt inputs for more reliable and precise medical image segmentation. Specifically, we first train a Multi-Prompt Adapter integrated with MedSAM, creating MPA-MedSAM, to adapt to diverse multi-prompt inputs. We then employ uncertainty-guided multi-prompt estimation to effectively quantify the uncertainties associated with the prompts and their initial segmentation results. In particular, a novel uncertainty-guided prompt adaptation technique is applied automatically to derive reliable prompts and their corresponding segmentation outcomes. We validate MedSAM-U using datasets from multiple modalities to train a universal image segmentation model. Compared to MedSAM, experimental results on five distinct modal datasets demonstrate that the proposed MedSAM-U achieves an average performance improvement of 1.7% to 20.5% across uncertainty-guided prompts.

Index Terms:
MedSAM, Medical image segmentation, Multi-prompt, Uncertainty-guided segmentation.
Figure 1: Comparison of Dice results under (a) Bboxes with different overlap ratios for the SAM and MedSAM models. (b) 1) Previous methods predict the annotation mask in only a single step; 2) MedSAM-U, our method, automatically refines prompts with uncertainty-guided multi-prompt adaptation for reliable MedSAM. The example is from the Colonoscopy test set [1].

I Introduction

Medical image segmentation is an important task in medical image analysis and is crucial for many clinical applications. Accurate segmentation helps clearly delineate anatomical structures and diseased areas, which is essential for disease diagnosis, treatment planning, and monitoring. It is widely used in dermoscopy, CT, MRI, colonoscopy, and ultrasound. Traditional U-Net-based segmentation methods have demonstrated high segmentation performance [2, 3, 4, 5]. However, these models are mostly designed for specific tasks and often face challenges in transferring to other domains. Consequently, with the recent rise of foundation models, the Segment Anything Model (SAM) [6] and MedSAM [7] have shown universal applicability and precise segmentation capabilities across various datasets. Crucially, these models are trained on all datasets together and utilize different prompts to achieve more accurate segmentation results. As mentioned in [8], high-quality prompts can lead to good performance. Hence, this study focuses on how to automatically obtain effective prompts at testing time without the need to retrain the model.

As shown in Fig. 1 (a), we illustrate the segmentation performance of SAM and MedSAM across five datasets using various box prompts, including prompts with different aspect ratios and positions within the image. These variations in box prompts reveal that box position has a significant impact on the resulting segmentation outcomes. Furthermore, as detailed in [8], point prompts also affect segmentation performance. Therefore, another key focus of this study is to train MedSAM to adapt to different types of prompts. Finally, as shown in Fig. 1 (b), MedSAM's approach of producing a single segmentation result in a single step lacks reliability. This study aims to leverage reliability to automatically obtain effective prompts, thereby achieving better segmentation results.

Uncertainty is one of the crucial metrics for assessing a model's confidence or reliability. Existing methods for uncertainty estimation encompass dropout-based methods [9], ensemble-based methods [10], entropy-based methods [11], evidential approaches [12], and deterministic methods [13]. Given the extensive parameters involved in MedSAM, building a new model with uncertainty estimation from scratch would demand significant computational resources and time. Furthermore, a key aspect of this study is exploring how to utilize estimated pixel-level uncertainty to obtain reliable prompts of different types.

To address these challenges, our study introduces MedSAM-U, as illustrated in Fig. 1 (b) 2), an uncertainty-guided auto multi-prompt framework for adapting reliable MedSAM. MedSAM-U integrates box and point prompts to enhance segmentation accuracy. It then employs Uncertainty-Guided Multi-Prompt (UGMP) estimation to effectively estimate the uncertainties associated with the prompts and their initial segmentation results. Our approach further introduces a novel Uncertainty-Guided Prompt Adaptation (UGPA) technique, enabling the automatic refinement of multi-prompt inputs to enhance segmentation reliability and accuracy. The contributions of this work are summarized as follows.
• We propose MedSAM-U, which leverages uncertainty estimation and guidance to automatically predict reliable segmentation results. To the best of our knowledge, this is the first attempt at using uncertainty-guided adaptation of multi-prompt inputs to achieve a reliable MedSAM.
• We employ Uncertainty-Guided Multi-Prompt (UGMP) estimation to estimate the uncertainty of different prompts in MedSAM without requiring additional training parameters.
• We introduce Uncertainty-Guided Prompt Adaptation (UGPA) to automatically obtain reliable prompts using the estimated uncertainty at testing time, leading to accurate segmentation predictions.
• We conducted unified training and testing on datasets from five different modalities (Dermoscopy, Colonoscopy, Ultrasound, CT, and MRI), demonstrating the reliability and accuracy of MedSAM-U. Our code will be released at https://github.com/Zhounan1222/MedSAM-U.

II Related works

II-A SAM for medical image segmentation

Traditional deep learning methods [14, 2, 15] are mostly designed for specific tasks and often face challenges in transferring to other domains. SAM [6], the pioneering large foundation model for segmentation, consists of three primary components: an image encoder, a prompt encoder, and a mask decoder. The image encoder is based on a standard Vision Transformer (ViT) that has been pretrained using Masked Autoencoders (MAE). The prompt encoder can operate in either a sparse (e.g., boxes) or dense (e.g., masks) manner. The mask decoder is a Transformer decoder block adapted to include a dynamic mask prediction head. This decoder employs two-way cross-attention mechanisms to capture the interactions between the prompt and image embeddings. Subsequently, SAM upsamples the image embedding, and a Multi-Layer Perceptron (MLP) maps the resulting token to a dynamic linear classifier that predicts the ground truth (GT) for the given image I. Due to its strong zero-shot performance, SAM marks a major breakthrough in the field of image segmentation.

To improve the unsatisfactory performance of SAM on medical image segmentation tasks, some approaches fine-tune SAM on medical images, including full fine-tuning and parameter-efficient fine-tuning [16, 17, 18, 19]. Recently, MedSAM [7] has investigated SAM's application to medical image segmentation, exploring its performance in various contexts such as endoscopic surgery [20], tumor segmentation [21], and polyp segmentation [22]. However, existing MedSAM-based methods rarely investigate the sensitivity to different prompts.

II-B Prompt-based methods for medical image segmentation

Prompt-based methods have gained traction in medical image segmentation due to their ability to provide flexible and adaptive guidance during interactive segmentation tasks. While effective in some cases, current adaptations of SAM rely heavily on high-quality, standard prompts (such as points, boxes, and masks) to deliver satisfactory performance in medical image segmentation tasks. Wu et al. [23] introduce Space-Depth Transpose for adapting 2D SAM to 3D medical images and a Hyper-Prompting Adapter for prompt-conditioned adaptation. Deng et al. [24] propose a multi-box prompt-triggered uncertainty estimation for SAM that uses Monte Carlo methods and test-time augmentation to enhance performance and provide pixel-level reliability for segmented lesions or tissues. Li et al. [25] propose a 3D medical image segmentation model using a single point prompt with SAM's pretrained vision transformer and lightweight adapters, featuring a hybrid network and boundary-aware loss for precise results. Wu et al. [26] present a One-Prompt Segmentation method that combines one-shot and interactive approaches to handle unseen tasks with a single prompt in one pass. Currently, most MedSAM-based methods are limited to using a single type of prompt rather than multiple types of prompts [27, 28, 24, 25]. This reduces the available information for the model and may result in insufficient segmentation performance.

II-C Uncertainty-based methods for medical image segmentation

Uncertainty estimation [29] in segmentation has become increasingly important, as it offers a way to assess the confidence in model predictions. This is crucial in high-stakes applications such as medical image analysis, where mistakes can lead to serious repercussions. There are two primary types of uncertainty: aleatoric, which originates from the inherent noise in the data [30], and epistemic, which is due to the model's limitations. Distinguishing between them is key to enhancing the reliability of deep neural networks [31]. For instance, Saad et al. [32] utilized shape and appearance priors to quantify uncertainty in probabilistic medical image segmentation. Parisot et al. [33] leveraged segmentation uncertainty to inform adaptive sampling strategies for the simultaneous segmentation and registration of brain tumors. Additionally, Prassni et al. [34] developed a method to visualize uncertainty in random walker-based segmentation, which was then applied to volumetric segmentation of brain MRI and abdominal CT images. Zou et al. [35] proposed a trusted brain tumor segmentation network that generates robust segmentation results and reliable uncertainty estimations by leveraging subjective logic theory to model uncertainty and parameterize class probabilities as a Dirichlet distribution.

Additionally, uncertainty estimation in large models, such as SAM, is critical for improving the reliability of segmentation outputs [24]. Yao et al. [27] propose a test-phase prompt augmentation technique for SAM that integrates multi-box augmentation with an aleatoric uncertainty-based FN and FP correction strategy to improve medical image segmentation. Zhang et al. [28] propose UR-SAM, an uncertainty-rectified framework that enhances SAM’s reliability in medical image segmentation by combining prompt augmentation with uncertainty-based rectification. Despite these advancements, research on integrating uncertainty estimation with prompt-based MedSAM remains insufficient, particularly in how to utilize uncertainty to guide reliable prompts.

Figure 2: The overall framework of MedSAM-U. The framework is presented through three key illustrations: (a) the training process of MPA-MedSAM, (b) a comprehensive workflow at inference time, and (c) a simplified diagram that illustrates the user interaction process.

III Method

To begin with, we provide an overview of the MedSAM-U architecture, which consists of three primary components: the MPA-MedSAM, UGMP, and UGPA modules. The overall framework of our proposed method is illustrated in Fig. 2. We build an automatic framework guided by uncertainty, developed to adaptively refine multi-prompt inputs for reliable and accurate segmentation, which includes: 1) training the Multi-Prompt Adaptation (MPA) to integrate both point and box cues and achieve more precise medical image segmentation results; 2) at inference time, utilizing UGMP to assess the uncertainty linked to various prompts without requiring additional training parameters; and 3) introducing UGPA to automatically leverage the estimated uncertainty to derive more reliable prompts, thereby improving the performance of the segmentation results.

III-A Multi-Prompt Adaptation for MedSAM

MedSAM [7] primarily relies on box prompts as the initial input for segmentation tasks, without incorporating point prompts during its training process. This limitation means that MedSAM's segmentation capabilities are optimized for scenarios where box annotations are available, but it may not fully exploit the potential benefits of point-based inputs. In MedSAM, the relationship between the inputs and the prediction mask $\boldsymbol{y}$ can be formally expressed as:

$\boldsymbol{y} = f_{\text{MSAM}}(I, b)$,   (1)

where the function $f_{\text{MSAM}}$ can be denoted as:

$f_{\text{MSAM}}(I, b) = \mathcal{F}_{D}\left(\mathcal{F}_{E}(I), \mathcal{F}_{P}(b)\right)$,   (2)

where $I$ denotes the input image, $b$ represents the box prompt, $\mathcal{F}_{D}$, $\mathcal{F}_{E}$, and $\mathcal{F}_{P}$ represent the Decoder, Image Encoder, and Prompt Encoder modules of MedSAM, respectively, and $\boldsymbol{y}$ is the resulting segmentation mask of MedSAM.

Moreover, as detailed in [8, 19], point prompts, particularly positive points, have a demonstrated effect on segmentation performance. Accordingly, to improve the sparse encoder, this study proposes modifications that refine the encoding of points and boxes. These adjustments aim to optimize the integration of positional encoding with learned embeddings for each prompt type. We propose the MPA, which is designed to handle and adapt multiple types of prompt inputs for MedSAM, including the combination of point and box prompts.

Our goal is to extend the capabilities of MedSAM by adapting it to handle multi-prompt inputs through a fine-tuning approach, called MPA-MedSAM, designed to enhance MedSAM's flexibility without the need for full re-training. Instead of adjusting all parameters, we keep the pre-trained MedSAM parameters frozen except for the prompt encoder, develop an adapter module, and integrate it into the image encoder at designated positions. Structurally, the adapter serves as a bottleneck module, consisting of three components applied sequentially: a down-projection, a ReLU activation, and an up-projection, as illustrated in Fig. 2 (a), similar to Med-SA [23] (a minimal sketch of this structure is given after Eq. (3)). Given that MedSAM does not show improvements when multi-type prompts are incorporated, we unfreeze the prompt encoder to expand its capabilities. This adjustment allows MedSAM to effectively handle multi-type prompts, addressing the limitations encountered in earlier versions. For interactive segmentation, both point prompts and bounding box (Bbox) prompts are utilized during training. The prompts are generated by randomly selecting points and applying jitter to the Bbox derived from the segmentation mask. This simulates varying levels of user inaccuracy, making the model more robust to real-world variations and enhancing its adaptability in practical applications. The relationship between the inputs and the prediction mask $\boldsymbol{y}$ in MPA-MedSAM can be formally expressed as:

$\boldsymbol{y} = f_{\text{MPA-MedSAM}}(I, p, b)$,   (3)

where $I$ denotes the input image, $p$ represents the point prompt, $b$ represents the Bbox prompt, and $\boldsymbol{y}$ is the resulting segmentation mask of MPA-MedSAM. The function $f_{\text{MPA-MedSAM}}$ encapsulates MPA-MedSAM, which processes the input image, point prompt, and box prompt to produce the corresponding mask.
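For concreteness, below is a minimal PyTorch sketch of a bottleneck adapter of the kind described above (down-projection, ReLU activation, up-projection). The embedding dimension, reduction ratio, residual connection, and the way it would attach to a ViT block are illustrative assumptions, not the exact MPA implementation.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Down-projection -> ReLU -> up-projection, a sketch of the adapter
    style described in Sec. III-A; dimensions are illustrative assumptions."""

    def __init__(self, embed_dim: int = 768, reduction: int = 4):
        super().__init__()
        hidden_dim = embed_dim // reduction
        self.down = nn.Linear(embed_dim, hidden_dim)   # down-projection
        self.act = nn.ReLU()                           # non-linearity
        self.up = nn.Linear(hidden_dim, embed_dim)     # up-projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual form: only the adapter parameters are trained, while the
        # frozen backbone features pass through unchanged.
        return x + self.up(self.act(self.down(x)))


# Usage sketch: patch tokens from a frozen ViT block of the image encoder.
tokens = torch.randn(1, 196, 768)          # (batch, num_patches, embed_dim)
adapter = BottleneckAdapter(embed_dim=768)
out = adapter(tokens)                      # same shape as the input
```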

III-B Uncertainty estimation of MPA-MedSAM with multi-type prompts

Given the variability of prompts from users with differing levels of experience, MedSAM's segmentation performance exhibits significant sensitivity to the position of Bboxes, potentially leading to inference errors. To address this, we adopt a strategy similar to the one used by SAM-U [24]. In this study, we introduce UGMP to incorporate multiple prompts and improve the accuracy of MPA-MedSAM. Given an initial Bbox $b^{\text{init}}$, we apply a simple random augmentation strategy to $b^{\text{init}}$ to generate $N$ Bbox prompts as follows:

$\{b_{1}, b_{2}, \dots, b_{N}\} = \{b^{\text{init}} + \delta_{1}, b^{\text{init}} + \delta_{2}, \dots, b^{\text{init}} + \delta_{N}\}$,   (4)

where $\delta_{i} \sim \mathcal{N}(\mu, \sigma)$, $b^{\text{init}}$ represents the initial Bbox, the random boxes are generated by applying adjustments $\delta_{i}$ to $b^{\text{init}}$, $\mathcal{N}$ denotes the normal distribution, $\mu$ denotes the coordinates of $b^{\text{init}}$, and $\sigma$ controls the degree of perturbation applied to $b^{\text{init}}$. This operation generates a set of Bbox prompts $\mathbb{B} = \{b_{1}, b_{2}, \dots, b_{N}\}$ and a set of $M$ point prompts $\mathbb{P} = \{p_{1}, p_{2}, \dots, p_{M}\}$ that depends on the user's or clinician's choice, with $\mathbb{P} = \emptyset$ if no points are used. To quantify the uncertainty arising from the use of multiple prompts, given $N$ box prompts, $M$ point prompts, and input image $I$, MPA-MedSAM predicts a set of results $\mathbb{Y} = \{\boldsymbol{y}_{1}, \boldsymbol{y}_{2}, \cdots, \boldsymbol{y}_{N}\}$, where $\boldsymbol{y}_{i}$ is predicted as follows:

$\boldsymbol{y}_{i} = f_{\text{MPA-MedSAM}}\left(I, b_{i}, \mathbb{P}\right)$,   (5)

where $b_{i}$ represents the $i$-th box of the set $\mathbb{B}$, and $\boldsymbol{y}_{i}$ represents the segmentation result obtained by inputting $b_{i}$ into MPA-MedSAM.
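The box-prompt augmentation of Eq. (4) can be sketched as follows. The (x_min, y_min, x_max, y_max) coordinate convention, the zero-mean Gaussian jitter, the default sigma, and the clipping to the image bounds are assumptions made for illustration.

```python
import numpy as np

def perturb_bbox(b_init, n=3, sigma=10.0, img_size=(1024, 1024), seed=None):
    """Generate N jittered copies of an initial box, in the spirit of Eq. (4).

    b_init: (x_min, y_min, x_max, y_max); sigma is an assumed perturbation
    scale in pixels."""
    rng = np.random.default_rng(seed)
    b_init = np.asarray(b_init, dtype=np.float32)
    # delta_i ~ N(0, sigma), one offset per coordinate and per box
    deltas = rng.normal(loc=0.0, scale=sigma, size=(n, 4))
    boxes = b_init[None, :] + deltas
    # Keep the boxes inside the image (assumed convention)
    w, h = img_size
    boxes[:, [0, 2]] = boxes[:, [0, 2]].clip(0, w - 1)
    boxes[:, [1, 3]] = boxes[:, [1, 3]].clip(0, h - 1)
    return boxes

boxes = perturb_bbox((200, 180, 620, 540), n=3, sigma=10.0)  # 3 box prompts
```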

The aggregation of these predictions leads to enhanced segmentation accuracy and a reduction in uncertainty. The final combined prediction is formulated as follows:

$\overline{\boldsymbol{y}} = \frac{1}{N}\sum_{i=1}^{N}\boldsymbol{y}_{i}$,   (6)

where $\overline{\boldsymbol{y}}$ represents the average segmentation result obtained by applying the MPA-MedSAM model across the $N$ instances.

By utilizing UGMP, the aleatoric uncertainty for a single given image $I$, instead of being estimated from each individual prediction $\boldsymbol{y}_{i}$, is now directly estimated from the average prediction $\overline{\boldsymbol{y}}$, described by the entropy [36]:

$\overline{\textbf{U}} = f_{\text{UGMP}}\left(\overline{\boldsymbol{y}}\right)$,   (7)

where $f_{\text{UGMP}}\left(\overline{\boldsymbol{y}}\right) = -\int p(\overline{\boldsymbol{y}}|I)\log p(\overline{\boldsymbol{y}}|I)\,d\overline{\boldsymbol{y}}$. This allows us to estimate the uncertainty distribution based on the aggregated prediction $\overline{\boldsymbol{y}}$, rather than calculating it for each individual prediction $\boldsymbol{y}_{i}$.
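A minimal sketch of the aggregation in Eq. (6) and the entropy-based uncertainty map in Eq. (7) is shown below, assuming the N prompt-conditioned predictions are per-pixel foreground probabilities. The binary-entropy form and the epsilon used for numerical stability are implementation assumptions.

```python
import torch

def aggregate_and_entropy(probs: torch.Tensor, eps: float = 1e-6):
    """probs: (N, H, W) per-pixel foreground probabilities from N prompts.

    Returns the averaged prediction y_bar (Eq. (6)) and a pixel-wise
    binary-entropy uncertainty map U_bar (Eq. (7))."""
    y_bar = probs.mean(dim=0)                               # Eq. (6)
    p = y_bar.clamp(eps, 1.0 - eps)
    u_bar = -(p * p.log() + (1.0 - p) * (1.0 - p).log())    # Eq. (7), binary entropy
    return y_bar, u_bar

# Usage sketch with N = 3 prompt-conditioned predictions
probs = torch.rand(3, 256, 256)
y_bar, u_bar = aggregate_and_entropy(probs)
```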

In summary, MPA-MedSAM with the UGMP function can be expressed as follows:

$(\overline{\boldsymbol{y}}, \overline{\textbf{U}}) = f_{\text{UGMP}}(f_{\text{MPA-MedSAM}}\left(I, \mathbb{B}, \mathbb{P}\right))$,   (8)

where $I$ represents the input image, $\mathbb{B}$ is the set of multiple Bbox prompts $\{b_{1}, b_{2}, \dots, b_{N}\}$, and $\mathbb{P}$ is the set of point prompts $\{p_{1}, p_{2}, \dots, p_{M}\}$, which also depends on the user's or clinician's choice. $\overline{\boldsymbol{y}}$ is the resulting segmentation mask based on the combination of predictions from multiple prompts, and $\overline{\textbf{U}}$ is the associated uncertainty map.

III-C Uncertainty-Guided Prompts Adaptation

Providing prompts that produce the desired results can be a difficult process that often requires users or experts to go through tedious trial-and-error experimentation. Due to the uncertainty introduced by different prompts during the inference process, we investigated whether this knowledge could be utilized to refine the initial prompts, thereby producing more predictable and precise segmentation mask outputs. To address this issue, we introduce UGPA over a series of sampled prompts to refine them, so that UGPA can improve the model's segmentation through an automatic process. Fig. 2 (b) shows a schematic view of the proposed UGPA.

In the UGPA process, we first select the Top-K Bboxes based on the edges of $\textbf{U}$, which highlight regions of high uncertainty. These boxes are then subjected to slight adjustments, simulating expert refinements to improve their accuracy. Second, we select the Top-K points with the highest uncertainty values from $\textbf{U}$, representing the most uncertain areas within the image; these selected points are then combined with the adjusted Bboxes to create refined prompts. To provide a clear and intuitive description of the process, we represent each step using the following formulas:

$b_{i} = \left[(w_{\text{min}} + \delta_{i}, h_{\text{min}} + \delta_{i}), (w_{\text{max}} + \delta_{i}, h_{\text{max}} + \delta_{i})\right]$,   (9)

where:

$(w_{\min}, h_{\min}) = \min_{(w_{j}, h_{j}) \in \text{Edge}(\textbf{U})} (w_{j}, h_{j})$,   (10)
$(w_{\max}, h_{\max}) = \max_{(w_{j}, h_{j}) \in \text{Edge}(\textbf{U})} (w_{j}, h_{j})$,   (11)

$\text{Edge}(\textbf{U})$ refers to identifying the coordinates of the edges of the uncertainty map, and $(w_{j}, h_{j})$ are the coordinates of the edge points. Specifically, $(w_{\min}, h_{\min})$ represents the top-left corner and $(w_{\max}, h_{\max})$ the bottom-right corner of the bounding box that encloses these edge points. We select the Top-1 Bbox based on the edges of $\textbf{U}$, i.e., the box that has the largest area and fully contains the edge of $\textbf{U}$. Then, to simulate varying levels of expert knowledge, we also apply Eq. (4) to generate a set of refined box prompts $\mathbb{B}^{*} = \{b_{1}, b_{2}, \dots, b_{k}\}$. Next, we select the Top-K points with the highest uncertainty values from $\textbf{U}$:

$\{p_{1}, p_{2}, \dots, p_{k}\} = \mathcal{S}\left(\textbf{U}\right)_{[:k]}$,   (12)

where $\mathcal{S}$ denotes the sort function, and $\{p_{1}, p_{2}, \dots, p_{k}\}$ represents the set $\mathbb{P}^{*}$ of the selected $k$ points. We then combine the adjusted Bboxes with the selected points to create the refined prompt set $\{\mathbb{B}^{*}, \mathbb{P}^{*}\}$.
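The prompt adaptation of Eqs. (9)-(12) can be sketched as follows. Obtaining Edge(U) by thresholding the uncertainty map, the threshold value, and the jitter scale are illustrative assumptions, and refine_prompts is a hypothetical helper name rather than the paper's implementation.

```python
import numpy as np

def refine_prompts(u_map: np.ndarray, k_points: int = 3, tau: float = 0.5,
                   sigma: float = 5.0, n_boxes: int = 3, seed: int = 0):
    """Derive refined prompts from an uncertainty map U (Eqs. (9)-(12)).

    u_map: (H, W) pixel-wise uncertainty; tau is an assumed threshold that
    defines the high-uncertainty (edge) region."""
    rng = np.random.default_rng(seed)
    ys, xs = np.nonzero(u_map > tau * u_map.max())    # Edge(U), assumed via thresholding
    if len(xs) == 0:                                  # no uncertain region found
        return None, None
    # Eqs. (10)-(11): tight box around the uncertain region
    base_box = np.array([xs.min(), ys.min(), xs.max(), ys.max()], dtype=np.float32)
    # Eq. (9) / Eq. (4): jitter the base box to simulate expert adjustments
    boxes = base_box[None, :] + rng.normal(0.0, sigma, size=(n_boxes, 4))
    # Eq. (12): Top-K most uncertain pixels become refined point prompts
    flat_idx = np.argsort(u_map.ravel())[::-1][:k_points]
    pts_y, pts_x = np.unravel_index(flat_idx, u_map.shape)
    points = np.stack([pts_x, pts_y], axis=1)         # (k, 2) as (x, y)
    return boxes, points

boxes_star, points_star = refine_prompts(np.random.rand(256, 256))
```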

Subsequently, these refined prompts are fed back into MPA-MedSAM, where they serve as crucial inputs to guide further refinement of the segmentation output. By leveraging these improved prompts, our method can enhance the accuracy of the segmentation, narrowing down errors and producing more precise results. This process ensures that the model benefits from the enhanced guidance provided by the refined prompts, ultimately leading to superior segmentation performance. Feeding the refined prompts into MPA-MedSAM with the UGMP function can be expressed as follows:

$(\boldsymbol{y}^{*}, \textbf{U}^{*}) = f_{\text{UGMP}}(f_{\text{MPA-MedSAM}}\left(I, \mathbb{B}^{*}, \mathbb{P}^{*}\right))$,   (13)

where $\boldsymbol{y}^{*}$ is the segmentation mask based on UGPA, and $\textbf{U}^{*}$ is the associated uncertainty map.

Finally, the refinement process is informed by Active Learning principles: we assess the output uncertainty following the prompt optimizations and update the segmentation results only if these optimizations lead to reduced uncertainty estimates. Importantly, the final output is not considered definitive until it is verified against the uncertainty map to ensure that performance has indeed improved. This can be denoted by:

$\hat{\boldsymbol{y}} = \begin{cases} \boldsymbol{y}^{*} & \text{if } \overline{\textbf{U}} > \textbf{U}^{*}, \\ \overline{\boldsymbol{y}} & \text{otherwise}. \end{cases}$   (14)

The details of the proposed MedSAM-U method are outlined in Algorithm 1.

Algorithm 1 The proposed method MedSAM-U

Input: Image $I$, initial Bbox $b^{\text{init}}$

Output: Refined mask $\hat{\boldsymbol{y}}$

1. Generate random boxes based on $b^{\text{init}}$ with Eq. (4): $\mathbb{B} \leftarrow b^{\text{init}}$

2. Input to MedSAM-U with Eq. (8): $(\overline{\boldsymbol{y}}, \overline{\textbf{U}}) \leftarrow \mathbb{B}$

3. Select Top-K boxes and points with Eq. (9) to Eq. (12): $\{\mathbb{B}^{*}, \mathbb{P}^{*}\} \leftarrow \overline{\textbf{U}}$

4. Re-input to MedSAM-U with Eq. (13): $(\boldsymbol{y}^{*}, \textbf{U}^{*}) \leftarrow \{\mathbb{B}^{*}, \mathbb{P}^{*}\}$

5. Verify and update with Eq. (14)

6. Output: $\hat{\boldsymbol{y}}$
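Assuming the helper sketches given earlier in this section (perturb_bbox for Eq. (4), aggregate_and_entropy for Eqs. (6)-(7), and refine_prompts for Eqs. (9)-(12)) and a hypothetical callable mpa_medsam(image, boxes, points) that returns per-prompt probability maps, Algorithm 1 can be written compactly as below. Comparing the means of the uncertainty maps in the final step is one possible reading of Eq. (14), not necessarily the exact criterion used.

```python
import torch

def medsam_u(image, b_init, mpa_medsam, n_boxes=3, k_points=3):
    """Compact sketch of Algorithm 1 (MedSAM-U inference).

    mpa_medsam(image, boxes, points) is a hypothetical callable returning
    per-prompt foreground probability maps of shape (N, H, W)."""
    # Step 1: random boxes from the initial Bbox (Eq. (4))
    boxes = perturb_bbox(b_init, n=n_boxes)
    # Step 2: first pass, aggregation, and uncertainty map (Eq. (8))
    probs = torch.as_tensor(mpa_medsam(image, boxes, None), dtype=torch.float32)
    y_bar, u_bar = aggregate_and_entropy(probs)
    y_bar, u_bar = y_bar.numpy(), u_bar.numpy()
    # Step 3: uncertainty-guided refined prompts (Eqs. (9)-(12))
    boxes_star, points_star = refine_prompts(u_bar, k_points=k_points, n_boxes=n_boxes)
    if boxes_star is None:          # no uncertain region: keep the first-pass result
        return y_bar
    # Step 4: second pass with the refined prompts (Eq. (13))
    probs_star = torch.as_tensor(mpa_medsam(image, boxes_star, points_star),
                                 dtype=torch.float32)
    y_star, u_star = aggregate_and_entropy(probs_star)
    y_star, u_star = y_star.numpy(), u_star.numpy()
    # Step 5: accept the refinement only if overall uncertainty decreased (Eq. (14))
    return y_star if u_star.mean() < u_bar.mean() else y_bar
```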

IV Experiments and Results

IV-A Datasets & Loss & Implementation Details

IV-A1 Datasets

To assess the effectiveness of our proposed MedSAM-U, we choose five different 2D medical imaging modalities: Dermoscopy, CT, MRI, Colonoscopy, and Ultrasound. Each modality is represented by a single dataset, with the specific details provided below.

Dermoscopy: ISIC-2017 [37], hosted at the Medical Image Computing and Computer Assisted Intervention (MICCAI) conference, is a skin lesion segmentation dataset aimed at melanoma detection, including 2,594 annotated images.

Ultrasound: CT2US [38] is a dataset designed for kidney segmentation using cross-modal transfer learning. It includes paired CT and ultrasound images with corresponding annotations for kidney segmentation, aiming to improve segmentation accuracy in ultrasound images with limited data. The dataset has a total of 4,586 samples and is referred to simply as Kidney.

CT: This study used the open-access segmentation dataset KiTS23 [39]. Since the dataset is primarily designed for 3D segmentation tasks, we adapted it to perform 2D segmentation. To achieve this, we converted the 3D volumetric data into 2D slices. Specifically, we extracted slices along the z-axis, focusing on the central region of each volume, and selected slices at regular intervals to ensure a representative and manageable dataset with a total of 3,882 samples (a sketch of this slice-extraction step is given at the end of this subsection).

MRI: In this study, we utilize the publicly available BraTS 2021 glioma segmentation MRI dataset for evaluation purposes. Since only the training dataset includes GT segmentation masks, making it suitable for an automatic evaluation of a point-to-mask task similar to SAM [6], we exclusively use the training dataset for our current evaluation. To assess our method's potential in supporting interactive clinical treatment planning, and because the model is limited to a single image input, we evaluated segmentation accuracy using the T1 modality MRI sequence as input. We extracted all slice images and their corresponding masks along the z-axis, selecting middle slices at intervals to convert the 3D images into 2D slices, yielding a total of 4,586 samples. Subsequently, we randomly split the data by image index, using 80% for training and 20% for testing.

Colonoscopy: We use the Kvasir-SEG dataset [1], which consists of 1,000 polyp images and their corresponding GT masks annotated by expert endoscopists from Oslo University Hospital (Norway).

All the data are used for evaluation. $N$ is set to 3 in our experiments. Box prompts were generated based on the area and size of the ground truth, and the length and width of the boxes were randomly adjusted to mimic manually provided box prompts.
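As referenced above, a minimal sketch of the volume-to-slice conversion used for the CT and MRI datasets is given below. The use of nibabel for NIfTI volumes, the slice interval, the fraction of central slices kept, and the filtering of empty masks are illustrative assumptions rather than the exact preprocessing pipeline.

```python
import numpy as np
import nibabel as nib  # assumed NIfTI inputs

def volume_to_slices(img_path, mask_path, interval=5, central_fraction=0.5):
    """Extract 2D axial slices (along the z-axis) from a 3D volume and its mask.

    Keeps only the central portion of the volume and samples slices at a
    regular interval; both values are illustrative choices."""
    vol = nib.load(img_path).get_fdata()    # (H, W, D)
    mask = nib.load(mask_path).get_fdata()  # (H, W, D)
    depth = vol.shape[-1]
    start = int(depth * (1 - central_fraction) / 2)
    stop = depth - start
    slices = []
    for z in range(start, stop, interval):
        if mask[..., z].sum() > 0:          # keep slices containing the target
            slices.append((vol[..., z], mask[..., z].astype(np.uint8)))
    return slices
```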

IV-A2 Loss function

In our method, for training the MPA to enable MedSAM to adapt to various types of prompts, we utilize a combination of a binary focal loss [40] and a Dice loss [41] to supervise the output effectively during the training process, following [42]. This combined loss function is designed to address the challenges of class imbalance and accurate boundary delineation in segmentation tasks. The loss is calculated by the following formula:

$\mathcal{L} = \left[-\alpha\sum_{t}^{N}(1 - y_{t})^{\gamma}\log(y_{t})\right] + \left[1 - \frac{2\sum_{t}^{N}y_{t}\cdot g_{t}}{\sum_{t}^{N}y_{t} + \sum_{t}^{N}g_{t}}\right]$,   (15)

where:

$y = f_{\text{MPA-MedSAM}}(I, b, p; \theta)$,   (16)

$I$ denotes the input image, and $f_{\text{MPA-MedSAM}}$ represents MPA-MedSAM, which predicts the segmentation result $y$ given the Bbox prompt $b$ and point prompt $p$. $y_{t}$ denotes the predicted probability for pixel $t$, $g_{t}$ denotes the ground truth label for pixel $t$, and $\theta$ represents the model parameters to be updated.
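A minimal PyTorch sketch of the combined loss in Eq. (15) is shown below, where y holds per-pixel foreground probabilities and g the binary ground truth. The alpha and gamma values follow common focal-loss defaults, and averaging (rather than summing) the focal term over pixels is an implementation choice; both are assumptions rather than the paper's exact settings.

```python
import torch

def focal_dice_loss(y: torch.Tensor, g: torch.Tensor,
                    alpha: float = 0.25, gamma: float = 2.0, eps: float = 1e-6):
    """Combined binary focal + Dice loss, a sketch of Eq. (15).

    y: predicted foreground probabilities in [0, 1]; g: binary ground truth."""
    y = y.clamp(eps, 1.0 - eps)
    g = g.float()
    y_t = y * g + (1.0 - y) * (1.0 - g)                       # prob. of the true class
    focal = -(alpha * (1.0 - y_t) ** gamma * y_t.log()).mean()  # focal term (averaged)
    dice = 1.0 - (2.0 * (y * g).sum() + eps) / (y.sum() + g.sum() + eps)  # Dice term
    return focal + dice

# Usage sketch
pred = torch.rand(1, 1, 256, 256)
target = (torch.rand(1, 1, 256, 256) > 0.5).float()
loss = focal_dice_loss(pred, target)
```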

IV-A3 Implementation Details

For interactive segmentation, we employ Bbox prompts combined with point prompts during the model training process. In this study, we adhered to the default training settings of MedSAM for 2D medical image training. Taking the five different modality datasets as input, we trained the model for 60 epochs, a smaller number than in fully fine-tuned training. In the interactive model, during the initial step of simulating user clicks to initialize prompts, we experimented with various prompt settings. These included: (1) 3 random positive points, denoted as 3P; (2) 5 random positive points, denoted as 5P; (3) 10 random positive points, denoted as 10P; (4) 3 Bboxes with 50% overlap with the target, denoted as 3B (0.5); (5) 3 Bboxes with 75% overlap with the target, denoted as 3B (0.75); and (6) multi-type prompts composed of both 3 points and 3 Bboxes with an overlap of 0.5 or 0.75, referred to as 3P & 3B (0.5) and 3P & 3B (0.75). The same approach applies to other cases. All experiments are implemented with the PyTorch platform and trained/tested on a single NVIDIA 4090 GPU. We utilized the default settings to reproduce the comparison methods. We use two commonly used metrics for evaluation: the Dice Coefficient (Dice) and Intersection over Union (IoU). Dice calculates the overlap between the prediction and GT as twice the area of overlap divided by the sum of the areas of the prediction and GT. IoU measures the ratio of the intersection to the union of the predicted and true regions.
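The two evaluation metrics can be computed as in the short sketch below; the binarization threshold of 0.5 and the smoothing epsilon are assumed conventions.

```python
import numpy as np

def dice_iou(pred: np.ndarray, gt: np.ndarray, thr: float = 0.5, eps: float = 1e-6):
    """Dice and IoU between a predicted probability map and a binary GT mask."""
    p = (pred > thr).astype(np.float64)   # binarize the prediction (assumed threshold)
    g = (gt > 0.5).astype(np.float64)
    inter = (p * g).sum()
    dice = (2.0 * inter + eps) / (p.sum() + g.sum() + eps)
    iou = (inter + eps) / (p.sum() + g.sum() - inter + eps)
    return dice, iou

# Usage sketch
d, i = dice_iou(np.random.rand(256, 256), (np.random.rand(256, 256) > 0.5))
```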

TABLE I: Results comparing our method with other segmentation methods across five datasets, evaluated using Dice Score and IoU Score. Here, SAM 3B (0.5) refers to using 3 low-quality initial Bboxes with an overlap ratio of 0.5 as inputs to the SAM model. The same applies to other cases. The top-2 results are highlighted in bold and underline.
Model Dermoscopy Colonoscopy Ultrasound CT MRI Avg
IoU Dice IoU Dice IoU Dice IoU Dice IoU Dice IoU Dice
SAM 3B (0.5) 0.481 0.609 0.346 0.414 0.467 0.619 0.558 0.665 0.233 0.321 0.417 0.526
SAM 3B (0.75) 0.656 0.773 0.651 0.725 0.656 0.783 0.667 0.762 0.470 0.596 0.620 0.728
MedSAM 3B (0.5) 0.446 0.566 0.516 0.646 0.566 0.705 0.455 0.585 0.516 0.665 0.500 0.633
MedSAM 3B (0.75) 0.778 0.861 0.811 0.880 0.873 0.931 0.661 0.763 0.664 0.785 0.758 0.844
MedSAM-U 3B (0.5) 0.801 0.883 0.779 0.867 0.899 0.946 0.672 0.766 0.599 0.727 0.750 0.838
MedSAM-U 3B (0.75) 0.839 0.909 0.833 0.903 0.920 0.958 0.700 0.783 0.626 0.750 0.784 0.861
Figure 3: Visualization of segmentation results for each method in different modalities, with Bboxes illustrating varying degrees of rough approximations simulating expert annotations. (Col. 1) Images with an initial low-quality Bbox prompt; (Col. 2) SAM model; (Col. 3) MedSAM model; (Col. 4) Our model; (Col. 5) binary GT mask. Red: initial Bbox, Blue: segmentation results, Yellow: Dice Score.
Figure 4: Visualization of segmentation results for our method in different modalities. P1: low-quality box prompts, P2: refined box prompts, P3: refined box and point prompts, U1: step 1 uncertainty map, U2: step 2 uncertainty map, GT: ground truth. Green: initial Bbox, Red: refined prompts, Blue: segmentation results, Yellow: Dice Score.
TABLE II: A comparative study of the performance of interactive segmentation models using different multi-type prompts. The results, evaluated using Dice Score and IoU Score across five datasets, are compared. Here, SAM 3P denotes using 3 points as inputs to the SAM model; SAM 3B (0.5) denotes using 3 low-quality initial Bboxes with an overlap ratio of 0.5 as inputs to the SAM model. The same applies to other cases. The top-1 results are highlighted in bold.
Model Dermoscopy Colonoscopy Ultrasound CT MRI Avg
IoU Dice IoU Dice IoU Dice IoU Dice IoU Dice IoU Dice
SAM 3P 0.375 0.490 0.283 0.359 0.229 0.360 0.074 0.112 0.127 0.209 0.217 0.306
SAM 10P 0.384 0.504 0.288 0.373 0.238 0.374 0.073 0.115 0.131 0.215 0.223 0.316
SAM 3B (0.5) 0.481 0.609 0.346 0.414 0.467 0.619 0.558 0.665 0.233 0.321 0.417 0.526
SAM 10P&3B (0.5) 0.504 0.644 0.368 0.467 0.344 0.504 0.182 0.280 0.229 0.347 0.325 0.448
MedSAM 3P 0.060 0.106 0.036 0.063 0.093 0.159 0.016 0.029 0.027 0.047 0.046 0.081
MedSAM 10P 0.146 0.240 0.078 0.130 0.149 0.245 0.028 0.048 0.060 0.100 0.092 0.153
MedSAM 3B (0.5) 0.446 0.566 0.516 0.646 0.566 0.705 0.455 0.585 0.516 0.665 0.500 0.633
MedSAM 10P&3B (0.5) 0.341 0.467 0.281 0.388 0.153 0.240 0.270 0.386 0.284 0.406 0.266 0.377
MPA-MedSAM 3P 0.125 0.188 0.063 0.096 0.194 0.280 0.002 0.003 0.009 0.015 0.079 0.117
MPA-MedSAM 10P 0.266 0.374 0.113 0.174 0.371 0.497 0.005 0.009 0.014 0.024 0.154 0.216
MPA-MedSAM 3B (0.5) 0.623 0.753 0.554 0.693 0.678 0.799 0.479 0.607 0.400 0.542 0.547 0.679
MPA-MedSAM 10P&3B (0.5) 0.662 0.784 0.596 0.729 0.718 0.829 0.519 0.644 0.441 0.584 0.587 0.714

IV-B Comparisons with the SOTA methods on Multi-modality Images

To validate the overall performance of our proposed method, we compared it with SOTA segmentation foundation models across five different modalities: SAM [6] and the fully fine-tuned SAM for medical images (MedSAM) [7], evaluated using Dice Score and IoU Score. The results are shown in Table I.

Firstly, when comparing the performance of the same model under different Bbox qualities (overlap ratio from 0.5 to 0.75), a notable improvement is observed for SAM across all imaging modalities as the Bbox ratio increases from 0.5 to 0.75. MedSAM exhibits a similar trend, with performance enhancements corresponding to the higher Bbox quality. This reveals that the quality of the Bboxes significantly impacts segmentation performance, indicating that the models are sensitive to the quality of the box prompts.

Furthermore, our proposed method, denoted as MedSAM-U 3B (0.5) and MedSAM-U 3B (0.75) in the 5th and 6th rows, achieved SOTA performance in Dermoscopy, Colonoscopy, Ultrasound, and CT, with Dice scores of 90.9%, 90.3%, 95.8%, and 78.3% under 3B (0.75), surpassing MedSAM by 4.8%, 2.3%, 2.7%, and 2%, respectively. When the Bbox overlap ratio is 0.5, our method surpasses MedSAM by 20.5%. The results in Table I demonstrate that, even when the initial Bbox quality is low, significant improvements in segmentation performance are observed. This demonstrates the effectiveness of our approach in enhancing segmentation accuracy, particularly in scenarios where the initial box prompts are suboptimal. These results reveal that MedSAM-U achieves significant accuracy across various medical segmentation tasks with the use of low-quality Bboxes, eliminating the need for manual adjustments to achieve satisfactory results.

Fig. 3 visualizes example segmentation results for each method in different modalities, with Bboxes simulating varying degrees of rough expert annotations. SAM's and MedSAM's segmentations are produced in a single step, and when the initial Bbox prompts provided by experts are not perfectly accurate, the segmentations may suffer from under-segmentation or over-segmentation errors. In contrast, MedSAM-U can accurately segment various targets across different imaging conditions.

IV-C Ablation Study

IV-C1 Effectiveness of Multi-Prompt Combinations

In our method, the MPA was introduced to integrate diverse types of prompts (e.g., points and boxes) into MedSAM, enabling the simultaneous input of various prompt combinations. To validate the effectiveness of the MPA-MedSAM module, we present the segmentation results of different interactive segmentation models using various prompt combinations across five different datasets, as detailed in Table II. Initially, we use the results from the 3B (0.5) prompts as the baseline for multi-prompt segmentation in SAM, MedSAM, and MPA-MedSAM on the standard medical imaging datasets, shown in the 3rd, 7th, and 11th rows, respectively.

TABLE III: Comparison of model performance under uncertainty-guided prompt adaptation with BBox overlap ratio of 0.5. Here, UGPA1, UGPA2, UGPA3, UGPA4 represent refined 3B, refined 3P & 3B, refined 5P & 3B, refined 10P & 3B. The top-2 results are highlighted in bold and underline.
Modality Dermoscopy Colonoscopy Ultrasound Avg
IoU Dice IoU Dice IoU Dice IoU Dice
w/o UGPA 0.619 0.750 0.559 0.698 0.673 0.796 0.617 0.748
UGPA1 0.774 0.863 0.751 0.843 0.889 0.939 0.805 0.882
UGPA2 0.779 0.867 0.755 0.846 0.891 0.941 0.808 0.885
UGPA3 0.796 0.880 0.784 0.871 0.898 0.945 0.826 0.899
UGPA4 0.801 0.883 0.779 0.867 0.898 0.946 0.826 0.899
TABLE IV: Comparison of model performance under uncertainty-guided prompt adaptation with BBox overlap ratio of 0.75. Here, UGPA1, UGPA2, UGPA3, UGPA4 represent refined 3B, refined 3P & 3B, refined 5P & 3B, refined 10P & 3B. The top-2 results are highlighted in bold and underline.
Modality Dermoscopy Colonoscopy Ultrasound Avg
IoU Dice IoU Dice IoU Dice IoU Dice
w/o UGPA 0.792 0.877 0.766 0.859 0.885 0.938 0.814 0.892
UGPA1 0.815 0.892 0.800 0.880 0.909 0.952 0.842 0.908
UGPA2 0.817 0.893 0.805 0.885 0.911 0.953 0.844 0.910
UGPA3 0.836 0.906 0.832 0.902 0.920 0.958 0.862 0.922
UGPA4 0.839 0.909 0.833 0.903 0.920 0.958 0.864 0.923

By analyzing the results from the 3rd to the 4th row for SAM and from the 7th to the 8th row for MedSAM, it is evident that the combination of point and box prompts leads to a decrease in average Dice scores from 52.6% to 44.8% for SAM, and from 63.3% to 37.7% for MedSAM. This reveals that the performance of SAM and MedSAM does not improve with the incorporation of multi-type prompts. Moreover, the use of point prompts may negatively impact the effectiveness of box prompts, resulting in a deterioration of segmentation performance. Furthermore, with the introduction of MPA, we observed that the performance of MPA-MedSAM improved by approximately 3.5% between the 11th and 12th rows, further demonstrating the effectiveness of combining point prompts with box prompts.

IV-C2 Effectiveness of Uncertainty-Guided Prompts Adaptation

We conducted a comprehensive ablation study to validate the effectiveness of the proposed UGPA. The results are presented in Table III and Table IV, where the baseline (1st row) represents 3 Bboxes obtained by randomly shifting the low-quality Bbox provided by the user, serving as the initial input for our proposed MedSAM-U without UGPA. As shown in the 2nd and 3rd rows of Table III and Table IV, combining box prompts with point prompts yields an improvement compared to using the initial 3 Bboxes without UGPA. The combined approach also demonstrates enhanced performance relative to refining Bboxes alone, in both the 3B (0.5) and 3B (0.75) settings. Additionally, in Table III and Table IV, from the 3rd to the 5th row, we observe a 1.3% improvement for the 3B (0.5) setting and a 1.3% improvement for the 3B (0.75) setting as the number of points increases. Specifically, increasing the number of points from 5 to 10 results in only minimal enhancement; more points do not necessarily yield a better effect, as the improvement tends to saturate and may even become counterproductive in some cases.

The segmentation results of our method in different modalities are shown in Fig. 4. Three Bboxes were generated by introducing variations to the GT box in terms of position and shape to simulate user input, serving as the input for our proposed MedSAM-U. With the refinement provided by MedSAM-U, the results showed significant improvement. The method effectively leveraged the uncertainty map, enhancing the accuracy and robustness of the segmentation.

V Conclusion

In this study, we introduced MedSAM-U, a novel model designed with an uncertainty-guided, auto-refining multi-prompt approach for reliable and accurate medical image segmentation. First, MPA-MedSAM was utilized to adapt various multi-prompt strategies for MedSAM. We then implemented UGMP to estimate uncertainty in the segmentation results without adding training parameters. Crucially, we developed a novel uncertainty-guided multi-prompt adaptation method that automatically generates reliable prompts, leading to highly accurate segmentation results. Furthermore, by training on datasets from multiple modalities, MedSAM-U effectively functions as a universal image segmentation model. Experimental results across five distinct modal datasets show that, within the Bbox overlap ratio range of 0.5 to 0.75, MedSAM-U significantly improved performance, with average improvements ranging from 1.7% to 20.5% compared to the baseline MedSAM model. Additionally, our results indicated that the lower the initial Bbox quality, the greater the improvement achieved by MedSAM-U.

Moving forward, our research will concentrate on two key areas. First, we plan to directly estimate uncertainty within the adapter. Second, we aim to harness this uncertainty to achieve automatic advanced annotation, enabling the model to perform annotations without human intervention.

References

  • [1] K. Pogorelov, K. R. Randel, C. Griwodz, S. L. Eskeland, T. de Lange, D. Johansen, C. Spampinato, D.-T. Dang-Nguyen, M. Lux, P. T. Schmidt et al., “Kvasir: A multi-class image dataset for computer aided gastrointestinal disease detection,” in Proceedings of the 8th ACM on Multimedia Systems Conference, 2017, pp. 164–169.
  • [2] T. Falk, D. Mai, R. Bensch, Ö. Çiçek, A. Abdulkadir, Y. Marrakchi, A. Böhm, J. Deubner, Z. Jäckel, K. Seiwald et al., “U-net: deep learning for cell counting, detection, and morphometry,” Nature methods, vol. 16, no. 1, pp. 67–70, 2019.
  • [3] Z. Zhou, M. M. R. Siddiquee, N. Tajbakhsh, and J. Liang, “Unet++: Redesigning skip connections to exploit multiscale features in image segmentation,” IEEE Transactions on Medical Imaging, vol. 39, no. 6, pp. 1856–1867, 2020.
  • [4] F. Isensee, P. F. Jaeger, S. A. Kohl, J. Petersen, and K. H. Maier-Hein, “nnu-net: a self-configuring method for deep learning-based biomedical image segmentation,” Nature methods, vol. 18, no. 2, pp. 203–211, 2021.
  • [5] Y. Gao, M. Zhou, and D. N. Metaxas, “Utnet: a hybrid transformer architecture for medical image segmentation,” in Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part III 24.   Springer, 2021, pp. 61–71.
  • [6] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo et al., “Segment anything,” arXiv preprint arXiv:2304.02643, 2023.
  • [7] J. Ma, Y. He, F. Li, L. Han, C. You, and B. Wang, “Segment anything in medical images,” Nature Communications, vol. 15, no. 1, p. 654, 2024.
  • [8] C. Zhou, X. Li, C. C. Loy, and B. Dai, “Edgesam: Prompt-in-the-loop distillation for on-device deployment of sam,” arXiv preprint arXiv:2312.06660, 2023.
  • [9] A. Kendall and Y. Gal, “What uncertainties do we need in bayesian deep learning for computer vision?” in NIPS, 2017.
  • [10] B. Lakshminarayanan, A. Pritzel, and C. Blundell, “Simple and scalable predictive uncertainty estimation using deep ensembles,” Advances in Neural Information Processing Systems, vol. 30, 2017.
  • [11] J. Luo, A. Sedghi, K. Popuri, D. Cobzas, M. Zhang, F. Preiswerk, M. Toews, A. Golby, M. Sugiyama, W. M. Wells et al., “On the applicability of registration uncertainty,” in Medical Image Computing and Computer Assisted Intervention–MICCAI 2019: 22nd International Conference, Shenzhen, China, October 13–17, 2019, Proceedings, Part II 22.   Springer, 2019, pp. 410–419.
  • [12] K. Zou, X. Yuan, X. Shen, Y. Chen, M. Wang, R. S. M. Goh, Y. Liu, and H. Fu, “Evidencecap: Towards trustworthy medical image segmentation via evidential identity cap,” arXiv preprint arXiv:2301.00349, 2023.
  • [13] J. Van Amersfoort, L. Smith, Y. W. Teh, and Y. Gal, “Uncertainty estimation using a single deep deterministic neural network,” in International Conference on Machine Learning.   PMLR, 2020, pp. 9690–9700.
  • [14] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 3431–3440.
  • [15] J. Chen, Y. Lu, Q. Yu, X. Luo, E. Adeli, Y. Wang, L. Lu, A. L. Yuille, and Y. Zhou, “Transunet: Transformers make strong encoders for medical image segmentation,” arXiv preprint arXiv:2102.04306, 2021.
  • [16] J. Cheng, J. Ye, Z. Deng, J. Chen, T. Li, H. Wang, Y. Su, Z. Huang, J. Chen, L. Jiang et al., “Sam-med2d,” arXiv preprint arXiv:2308.16184, 2023.
  • [17] C. Cui, R. Deng, Q. Liu, T. Yao, S. Bao, L. W. Remedios, B. A. Landman, Y. Tang, and Y. Huo, “All-in-sam: from weak annotation to pixel-wise nuclei segmentation with prompt-based finetuning,” in Journal of Physics: Conference Series, vol. 2722, no. 1.   IOP Publishing, 2024, p. 012012.
  • [18] S. Gong, Y. Zhong, W. Ma, J. Li, Z. Wang, J. Zhang, P.-A. Heng, and Q. Dou, “3dsam-adapter: Holistic adaptation of sam from 2d to 3d for promptable tumor segmentation,” Medical Image Analysis, p. 103324, 2024.
  • [19] H. Li, H. Liu, D. Hu, J. Wang, and I. Oguz, “Prism: A promptable and robust interactive segmentation model with visual prompts,” arXiv preprint arXiv:2404.15028, 2024.
  • [20] B. Cui, M. Islam, L. Bai, and H. Ren, “Surgical-dino: adapter learning of foundation models for depth estimation in endoscopic surgery,” International Journal of Computer Assisted Radiology and Surgery, pp. 1–8, 2024.
  • [21] C. Hu, T. Xia, S. Ju, and X. Li, “When sam meets medical images: An investigation of segment anything model (sam) on multi-phase liver tumor segmentation,” arXiv preprint arXiv:2304.08506, 2023.
  • [22] Y. Li, M. Hu, and X. Yang, “Polyp-sam: Transfer sam for polyp segmentation,” in Medical Imaging 2024: Computer-Aided Diagnosis, vol. 12927.   SPIE, 2024, pp. 759–765.
  • [23] J. Wu, W. Ji, Y. Liu, H. Fu, M. Xu, Y. Xu, and Y. Jin, “Medical sam adapter: Adapting segment anything model for medical image segmentation,” arXiv preprint arXiv:2304.12620, 2023.
  • [24] G. Deng, K. Zou, K. Ren, M. Wang, X. Yuan, S. Ying, and H. Fu, “Sam-u: Multi-box prompts triggered uncertainty estimation for reliable sam in medical image,” in International Conference on Medical Image Computing and Computer-Assisted Intervention.   Springer, 2023, pp. 368–377.
  • [25] H. Li, H. Liu, D. Hu, J. Wang, and I. Oguz, “Promise: Prompt-driven 3d medical image segmentation using pretrained image foundation models,” arXiv preprint arXiv:2310.19721, 2023.
  • [26] J. Wu and M. Xu, “One-prompt to segment all medical images,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 11 302–11 312.
  • [27] X. Yao, H. Liu, D. Hu, D. Lu, A. Lou, H. Li, R. Deng, G. Arenas, B. Oguz, N. Schwartz et al., “False negative/positive control for sam on noisy medical images,” arXiv preprint arXiv:2308.10382, 2023.
  • [28] Y. Zhang, S. Hu, C. Jiang, Y. Cheng, and Y. Qi, “Segment anything model with uncertainty rectification for auto-prompting medical image segmentation,” arXiv preprint arXiv:2311.10529, 2023.
  • [29] Y. Gal and Z. Ghahramani, “Dropout as a bayesian approximation: Representing model uncertainty in deep learning,” in international conference on machine learning.   PMLR, 2016, pp. 1050–1059.
  • [30] S. C. Hora, “Aleatory and epistemic uncertainty in probability elicitation with an example from hazardous waste management,” Reliability Engineering & System Safety, vol. 54, no. 2-3, pp. 217–223, 1996.
  • [31] E. Hüllermeier and W. Waegeman, “Aleatoric and epistemic uncertainty in machine learning: An introduction to concepts and methods,” Machine learning, vol. 110, no. 3, pp. 457–506, 2021.
  • [32] A. Saad, G. Hamarneh, and T. Möller, “Exploration and visualization of segmentation uncertainty using shape and appearance prior information,” IEEE Transactions on Visualization and Computer Graphics, vol. 16, no. 6, pp. 1366–1375, 2010.
  • [33] S. Parisot, W. Wells III, S. Chemouny, H. Duffau, and N. Paragios, “Concurrent tumor segmentation and registration with uncertainty-based sparse non-uniform graphs,” Medical image analysis, vol. 18, no. 4, pp. 647–659, 2014.
  • [34] J.-S. Prassni, T. Ropinski, and K. Hinrichs, “Uncertainty-aware guided volume segmentation,” IEEE transactions on visualization and computer graphics, vol. 16, no. 6, pp. 1358–1365, 2010.
  • [35] K. Zou, X. Yuan, X. Shen, M. Wang, and H. Fu, “Tbrats: Trusted brain tumor segmentation,” in International Conference on Medical Image Computing and Computer-Assisted Intervention.   Springer, 2022, pp. 503–513.
  • [36] B. Bein, “Entropy,” Best Practice & Research Clinical Anaesthesiology, vol. 20, no. 1, pp. 101–109, 2006.
  • [37] D. Gutman, N. C. Codella, E. Celebi, B. Helba, M. Marchetti, N. Mishra, and A. Halpern, “Skin lesion analysis toward melanoma detection: A challenge,” in International Symposium on Biomedical Imaging, 2016.
  • [38] Y. Song, J. Zheng, L. Lei, Z. Ni, B. Zhao, and Y. Hu, “Ct2us: Cross-modal transfer learning for kidney segmentation in ultrasound images with synthesized data,” Ultrasonics, vol. 122, p. 106706, 2022.
  • [39] N. Heller, F. Isensee, D. Trofimova, R. Tejpaul, Z. Zhao, H. Chen, L. Wang, A. Golts, D. Khapun, D. Shats et al., “The kits21 challenge: Automatic segmentation of kidneys, renal tumors, and renal cysts in corticomedullary-phase ct,” arXiv preprint arXiv:2307.01984, 2023.
  • [40] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2980–2988.
  • [41] S. Jadon, “A survey of loss functions for semantic segmentation,” in 2020 IEEE conference on computational intelligence in bioinformatics and computational biology (CIBCB).   IEEE, 2020, pp. 1–7.
  • [42] R. Kaur and S. K. Ranade, “Improving accuracy of convolutional neural network-based skin lesion segmentation using group normalization and combined loss function,” International Journal of Information Technology, vol. 15, no. 5, pp. 2827–2835, 2023.