EAUWSeg: Eliminating annotation uncertainty in weakly-supervised medical image segmentation
Abstract
Weakly-supervised medical image segmentation is gaining traction because it requires only rough annotations rather than accurate pixel-level labels, thereby reducing the workload for specialists. Although some progress has been made, a considerable performance gap remains between label-efficient methods and fully-supervised ones, which can be attributed to the inherent uncertainty of weak labels. To address this issue, we propose a novel weak annotation method coupled with a learning framework, EAUWSeg, to eliminate annotation uncertainty. Specifically, we first propose the Bounded Polygon Annotation (BPAnno), which simply labels two polygons for a lesion. We then introduce a tailored learning mechanism that explicitly treats the bounded polygons as two separate annotations and learns invariant features by providing adversarial supervision signals during model training. Subsequently, a confidence-auxiliary consistency learner, working together with a classification-guided confidence generator, provides reliable supervision signals for pixels in the uncertain region by leveraging the consistency of feature representations across pixels within the same category as well as the class-specific information encapsulated in the bounded polygon annotation. Experimental results demonstrate that EAUWSeg outperforms existing weakly-supervised segmentation methods. Furthermore, compared to fully-supervised counterparts, the proposed method not only delivers superior performance but also requires substantially less annotation workload. This underscores the superiority and effectiveness of our approach.
Owing to their ability to reduce annotation workload, label-efficient methods are gaining traction in weakly-supervised medical image segmentation. We revisit existing label-efficient medical image segmentation methods and observe that weak labels introduce considerable uncertainty into model construction, which leads to a sizable performance gap between label-efficient methods and fully-supervised ones. To address this problem, we propose a novel weak annotation method, BPAnno, which simply labels two polygons for a lesion, together with a coupled learning framework, EAUWSeg, to eliminate the annotation uncertainty. Extensive experiments demonstrate that EAUWSeg achieves superior performance with less than 20% of the annotation workload of fully-supervised counterparts. This suggests that the proposed method can be a cost-effective solution for improving weakly-supervised medical image segmentation.
Weakly-supervised segmentation, consistency-based contrastive learning, medical image segmentation
1 Introduction
Medical image segmentation plays a crucial role in biomedical image analysis [1], including diagnosis, treatment, and radiotherapy planning. Since manual segmentation is labor-intensive, time-consuming, and dependent on professional domain knowledge [2], automatic medical image segmentation has been widely studied and a series of methods have been proposed. However, the success of existing methods relies mainly on large-scale, meticulously annotated data, which requires significant domain expertise as well as expensive annotation effort.
To alleviate the burden of image annotation, weakly-supervised medical image segmentation is gaining traction, as it requires only weak or sparse annotations [3], such as image-level labels [4], scribbles [5], bounding boxes [6], and point annotations [7]. Although some progress has been made by training with label-efficient annotations, a considerable performance gap remains between label-efficient methods and fully-supervised ones [8]. We revisit existing label-efficient medical image segmentation methods and observe that these weak labels introduce considerable uncertainty into model construction. Fig. 1 visualizes the supervision signals provided by different label-efficient annotations, in which most of the information (the gray region) is uncertain. Uncertain supervision signals provided by label-efficient annotations may induce oscillations during model training and prevent the model from approaching the performance achieved under full supervision [9]. Consequently, there is an urgent need to explore label-efficient annotations that reduce annotation uncertainty and to develop methods that help eliminate label uncertainty during model training.

In this work, we propose a novel weak annotation method coupled with a learning framework to eliminate annotation uncertainty and facilitate stable training in weakly-supervised medical image segmentation with more reliable supervision signals. To this end, we introduce the bounded polygon annotation, which simply requires labeling two polygons that resemble inscribed and envelope-like delineations of the lesion (as shown in Fig. 1). The proposed bounded polygon annotation has three advantages: (1) it reduces the labeling burden compared with pixel-level accurate labels, (2) it restricts the uncertain information to the gray region between the two polygons, and (3) it explicitly provides a prior emphasis on lesion boundaries during model training. Tailored to the proposed weak annotation, we propose EAUWSeg to further eliminate the uncertainty contained in the bounded polygon annotation by explicitly treating the bounded polygons as two separate annotations. For the envelope-like annotation, pixels within the red contour belong to the foreground class and the rest to the background class. For the inscribed-like annotation, pixels within the purple contour belong to the foreground class and the rest to the background class. In this way, the uncertain region provides an adversarial supervision signal that drives the model to learn invariant features. Then, leveraging the observation that pixels with similar features tend to produce consistent category predictions [10], we design a Classification-guided Confidence Generator (CCG) to measure the feature similarity between certain and uncertain pixels from a probabilistic perspective. Moreover, we adopt a Confidence-auxiliary Consistency Learner (CCL) that preserves the accuracy of certain pixels while attracting uncertain pixels of the same category to maintain consistent feature representations. Through the collaboration of CCG and CCL, more reliable supervision signals can be provided in the uncertain region during model training, facilitating stable training in weakly-supervised medical image segmentation.
Overall, our contributions can be summarized as follows:
1. We propose a novel weak annotation method that labels only two bounded polygons, together with a coupled learning framework for medical image segmentation, further eliminating the annotation uncertainty present in most label-efficient methods.
2. We propose a tailored learning mechanism that explicitly treats the bounded polygons as two separate annotations, providing an adversarial supervision signal that drives the model to learn invariant features.
3. We propose a Confidence-auxiliary Consistency Learner that works with a Classification-guided Confidence Generator to provide reliable supervision signals for pixels in the uncertain region by leveraging intra-class similarity and inter-class discrimination from both the feature and category perspectives. Notably, the CCL and CCG modules are discarded during inference and therefore do not increase computational complexity.
4. To evaluate our method, we provide bounded polygon annotations on two widely used medical image segmentation datasets, i.e., ISIC2017 [11] and Kvasir-SEG [12]. Extensive experiments on these two datasets demonstrate that EAUWSeg outperforms existing weakly-supervised segmentation methods. Furthermore, the proposed method delivers superior performance with less than 20% of the annotation workload of fully-supervised counterparts. These results suggest that bounded polygon annotations coupled with EAUWSeg can be a cost-effective solution for preserving segmentation performance.
2 Related Works
2.1 Weakly-supervised Medical Image Segmentation
Without requiring large densely annotated datasets, weakly-supervised learning has gained significant attention in medical image segmentation [13, 2]. As the most efficient weak annotation, image-level annotations require only classification labels, from which class activation maps [14] are generated for training. Although convenient, image-level annotation yields limited performance due to the extremely weak supervision [15]. Box-level annotation is usually defined by two corner coordinates and provides localization awareness compared to image-level annotation [16]. However, boxes of different objects may overlap with each other, making it difficult to accurately approximate the target boundary, especially for complex shapes [10]. Point annotations provide a small number of pixels for each class and can better handle complex shapes, which may make them preferable to box-level annotations for medical segmentation. Despite their efficiency, segmentation models trained with point annotations tend to overfit the small number of annotated pixels compared with the large number of unannotated pixels.
Scribble-based annotations provide labels for a sparse set of pixels of each class and are usually more obtainable in medical image segmentation given their annotation efficiency, performance, and friendliness to annotating nested structures [17]. Only the scribbles of the background and each object are given, while the groundtruth of the remaining pixels is unknown, which harms segmentation performance. An intuitive solution is to expand the scribble annotations using prior assumptions [18] or foreground features learned by deep neural networks [19]. However, due to the lack of supervisory signals, the resulting models usually fail to capture object structure and are confused at object boundaries. To address this issue, a series of studies have concentrated on learning adversarial shape priors at the expense of requiring additional fully-annotated masks [17]. Acquiring such fully annotated datasets may be challenging in many clinical settings, rendering these methods both costly and poorly scalable. Our work aims to explore a new weak annotation method that improves automated medical image segmentation without auxiliary datasets.
2.2 Contrastive Learning
Contrastive learning posits that similar samples should have similar representations while different samples should have different ones [20]. Based on this, a contrastive loss is usually designed to enforce representations to be similar for positive pairs and dissimilar for negative pairs [21]. Given its powerful ability to extract features from unlabeled data in a self-supervised manner, contrastive learning has been widely used in many image-level tasks. Among these methods, the key is the design of the selection mechanism for contrastive sample pairs, i.e., positive and negative pairs.
Recently, contrastive learning has been extended from image-level tasks to pixel-level ones to mine informative cues from unlabeled data [22, 23]. As mentioned earlier, constructing contrastive sample pairs is crucial for discriminative feature learning. In pixel-level tasks, sample pairs are usually constructed through pseudo labels or spatial structure, which may introduce noisy sampling. To alleviate this problem, prediction uncertainty has been injected into the sampling process to reduce the number of noisy samples [24].

3 Method
In this work, we propose a novel bounded polygon annotation method, i.e., BPAnno, and its corresponding segmentation framework, i.e., EAUWSeg, to eliminate annotation uncertainty in weakly-supervised medical image segmentation. EAUWSeg is generally applicable to many existing encoder-decoder medical image segmentation models, such as UNet [25], DeepLabV3+ [26], and TransUNet [27]. The overall framework is illustrated in Fig. 2.
3.1 Problem Setting and Bounded Polygons Annotation
In the classical weakly-supervised segmentation setting, the input pixels are usually divided into labeled pixels $X_L$ (with spatial domain $\Omega_L$) and unlabeled pixels $X_U$. The labels of the labeled pixels are directly used for supervision through the partial cross-entropy loss, which can be formulated as follows:

$$\mathcal{L}_{pCE} = -\frac{1}{|\Omega_L|}\sum_{i \in \Omega_L}\sum_{c} y_i^{c} \log p_i^{c} \qquad (1)$$

where $p_i^{c}$ is the segmentation prediction for pixel $i$ and class $c$, and $y_i^{c}$ is the corresponding label. For the unlabeled pixels there is no off-the-shelf label for supervision, and much work focuses on assigning pseudo labels to unlabeled pixels [28, 29]. The overall objective function can be formulated as follows:

$$\mathcal{L} = \mathcal{L}_{pCE}(X_L) + \lambda\,\mathcal{L}_{pseudo}(X_U) \qquad (2)$$

where $\mathcal{L}_{pseudo}$ denotes the loss computed with the pseudo labels assigned to the unlabeled pixels and $\lambda$ is a balancing weight. However, assigning pseudo labels to unlabeled pixels not only requires a time-consuming multi-stage training process, but also introduces misleading or biased supervision [10].
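As a minimal illustration of this classical setting, a PyTorch-style sketch of the partial cross-entropy in Eq. (1) is given below; the function name and tensor layout are our own assumptions rather than part of any reference implementation.

```python
import torch
import torch.nn.functional as F

def partial_cross_entropy(logits, labels, labeled_mask):
    """Cross-entropy computed only on labeled pixels, as in Eq. (1).

    logits: (B, C, H, W) raw network outputs.
    labels: (B, H, W) integer class labels (arbitrary values where unlabeled).
    labeled_mask: (B, H, W) bool, True where a pixel carries a label.
    """
    per_pixel = F.cross_entropy(logits, labels.clamp(min=0), reduction="none")  # (B, H, W)
    per_pixel = per_pixel * labeled_mask.float()
    # Average over labeled pixels only.
    return per_pixel.sum() / labeled_mask.float().sum().clamp(min=1.0)
```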
To address this problem, this work introduces the bounded polygon annotation, which simply requires labeling two polygons that resemble inscribed and envelope-like delineations of the lesion (as shown in Fig. 1). To further eliminate the uncertainty contained in the bounded polygon annotation, we explicitly treat the bounded polygons as two separate annotations, i.e., the inscribed-like annotation $Y^{in}$ and the envelope-like annotation $Y^{en}$. Different from classical weak annotation methods, the input pixels under our bounded polygon annotation are divided into certain labeled pixels $X_C$ and uncertain pixels $X_{un}$. This work aims at providing more reliable supervision signals for pixels in the uncertain region during model training.

For convenience, we define $\Omega_{in}$, $\Omega_{un}$, and $\Omega_{out}$ as the spatial domains inside the inscribed-like annotation, between the inscribed-like and envelope-like delineations, and outside the envelope-like annotation, respectively. The certain labeled pixels and uncertain pixels can then be written as $X_C = \{x_i \mid i \in \Omega_{in} \cup \Omega_{out}\}$ and $X_{un} = \{x_i \mid i \in \Omega_{un}\}$, with labels $y_i = 1$ if $i \in \Omega_{in}$ and $y_i = 0$ if $i \in \Omega_{out}$; otherwise $y_i$ is uncertain. The spatial domain of the input image is $\Omega = \Omega_{in} \cup \Omega_{un} \cup \Omega_{out}$, and the region enclosed by the envelope-like annotation is $\Omega_{in} \cup \Omega_{un}$. In this way, our proposed EAUWSeg learns from "certain/uncertain" pixels instead of the "labeled/unlabeled" pixels of classical weakly-supervised segmentation. The objective function of EAUWSeg can be re-formulated as:

$$\mathcal{L} = \mathcal{L}_{sup}(X_C) + \lambda\,\mathcal{L}_{un}(X_{un}) \qquad (3)$$

where $\mathcal{L}_{sup}$ supervises the certain pixels and $\mathcal{L}_{un}$ is the loss for the uncertain pixels. The feature learning of certain pixels is well handled by standard supervised losses; hence, this work focuses on eliminating annotation uncertainty and thereby providing reliable supervision signals for pixels in the uncertain region.
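To make the pixel partition concrete, a minimal sketch (ours, not from the paper's code) that derives the three spatial domains from the two binary polygon masks could look as follows:

```python
def split_regions(inscribed_mask, envelope_mask):
    """Derive the three spatial domains used by BPAnno, assuming both inputs
    are binary (H, W) or (B, H, W) tensors with 1 inside the polygon.

    Returns boolean masks for Omega_in, Omega_un and Omega_out.
    """
    omega_in = inscribed_mask.bool()                            # inside the inscribed polygon: certain foreground
    omega_out = ~envelope_mask.bool()                           # outside the envelope polygon: certain background
    omega_un = envelope_mask.bool() & ~inscribed_mask.bool()    # band between the two polygons: uncertain
    return omega_in, omega_un, omega_out
```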
3.2 Framework of EAUWSeg
EAUWSeg is tailored to the proposed bounded polygon annotation and focuses mainly on eliminating annotation uncertainty for pixels in $\Omega_{un}$. As illustrated in Fig. 2, EAUWSeg consists of 1) a mainstream segmentation network supervised by the two bounded-polygon segmentation labels, which implicitly define the certain and uncertain regions during network training, 2) a classification-guided confidence generator that provides category-level prediction confidence for pixels through a tailored multi-class classification task, and 3) a confidence-auxiliary consistency learner that distinguishes reliable pixels in the uncertain region and assigns them the corresponding "certain" labels.
Let $E$, $D$, and $S$ denote the encoder, the decoder, and the segmentation head of the proposed framework, parameterized by $\theta_E$, $\theta_D$, and $\theta_S$, respectively. In the proposed EAUWSeg, the bounded polygon annotation is treated as two separate masks, i.e., the inscribed-like mask $Y^{in}$ and the envelope-like mask $Y^{en}$, and the basic segmentation loss in EAUWSeg can be formulated as:

$$\mathcal{L}_{seg} = \mathcal{L}_{dice}(P, Y^{in}) + \mathcal{L}_{dice}(P, Y^{en}) \qquad (4)$$

where $P = S(D(E(X)))$ is the predicted probability map for input image $X$. In this work, the following Dice loss is employed for both $Y^{in}$ and $Y^{en}$:

$$\mathcal{L}_{dice}(P, Y) = 1 - \frac{2\sum_{i=1}^{N} p_i\, y_i}{\sum_{i=1}^{N} p_i + \sum_{i=1}^{N} y_i} \qquad (5)$$

where $N$ denotes the number of pixels in the input image, and $p_i$ and $y_i$ denote the prediction probability and the label for pixel $i$, respectively. However, training with $\mathcal{L}_{seg}$ alone introduces inconsistent supervision signals, since $Y^{in}$ and $Y^{en}$ have the following characteristics: $y^{in}_i = y^{en}_i = 1$ for $i \in \Omega_{in}$, $y^{in}_i = y^{en}_i = 0$ for $i \in \Omega_{out}$, and the two disagree elsewhere. In this way, pixels in the uncertain region receive conflicting labels during training, i.e., $y^{in}_i = 0$ while $y^{en}_i = 1$ for $i \in \Omega_{un}$.
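A minimal PyTorch-style sketch of Eqs. (4)-(5), assuming (B, H, W) foreground-probability maps and binary masks; the function names are our own:

```python
def dice_loss(prob, target, eps=1e-6):
    """Soft Dice loss of Eq. (5) for a (B, H, W) foreground-probability map."""
    inter = (prob * target).sum(dim=(1, 2))
    denom = prob.sum(dim=(1, 2)) + target.sum(dim=(1, 2))
    return (1.0 - (2.0 * inter + eps) / (denom + eps)).mean()

def bounded_polygon_seg_loss(prob, inscribed_mask, envelope_mask):
    """Basic segmentation loss of Eq. (4): the prediction is supervised by the
    inscribed-like and envelope-like masks separately, so pixels in the
    uncertain band receive conflicting (adversarial) targets."""
    return dice_loss(prob, inscribed_mask.float()) + dice_loss(prob, envelope_mask.float())
```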
To mitigate this influence while leveraging the adversarial supervision signal to learn invariant features during model training, this work focuses on assigning more reliable labels to pixels in the uncertain region by utilizing the feature representations of certain pixels. For uncertain pixels, we exploit the potential similarity between pixels of the same category, i.e., between the uncertain pixels $X_{un}$ and the certain pixels $X_C$, to mine informative cues. With these definitions, the loss function of BPAnno-supervised segmentation can be formulated as:

$$\mathcal{L}_{BPA} = \mathcal{L}_{seg} + \mathcal{L}_{un} \qquad (6)$$

Here, $\mathcal{L}_{un}$ denotes the loss function for uncertain pixels.
3.3 Classification-Guided Confidence Generator
The key to accurate BPAnno-supervised segmentation is assigning reliable labels to pixels in the uncertain region. Different from existing methods that iteratively assign pseudo labels to uncertain pixels, we propose to exploit intra-class similarity and inter-class discrimination from both the feature and category perspectives.
An intuitive way to approximate the confidence of uncertain pixels is the predictive entropy, calculated according to the following equation:

$$\mathcal{E}_i = -\sum_{c} p_i^{c} \log\left(p_i^{c} + \epsilon\right) \qquad (7)$$

where $p_i^{c}$ is the probability of class $c$ predicted by the standard segmentation network with parameters $\{\theta_E, \theta_D, \theta_S\}$, and $\epsilon$ is a small constant to avoid overflow. As in previous works, predictions with large entropy are considered solid uncertain pixels, which are dropped during subsequent learning. For clarity, we define the solid uncertain pixels in the uncertain region through the entropy-based uncertainty map:

$$U_{ent}(i) = \mathbb{1}\left[\mathcal{E}_i > \tau\right], \quad i \in \Omega_{un} \qquad (8)$$

where $\tau$ is a predefined threshold used to mask the uncertain labels, and $U_{ent}$ is the estimated uncertainty map with the same size as the input image.
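A small sketch of the entropy view of Eqs. (7)-(8); the threshold value `tau=0.3` is an assumed placeholder, not the paper's setting:

```python
def entropy_uncertainty(prob, omega_un, tau=0.3, eps=1e-8):
    """Entropy-based solid-uncertain mask, cf. Eqs. (7)-(8).

    prob: (B, C, H, W) softmax probabilities.
    omega_un: (B, H, W) bool mask of the uncertain band.
    """
    entropy = -(prob * (prob + eps).log()).sum(dim=1)   # (B, H, W) predictive entropy
    solid_uncertain = (entropy > tau) & omega_un         # high-entropy pixels inside the band
    return solid_uncertain
```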
To assign more reliable labels to uncertain pixels, we propose to explicitly leverage the potential similarity between certain and uncertain pixels through a tailored classification task, which aims at removing as much uncertainty as possible. Let $F = D(E(X))$ denote the feature representation generated by the encoder and decoder networks, and let $C$ and $\theta_C$ denote the classification head and its parameters, respectively. Previous work [10] has shown that pixels with similar features tend to generate consistent category predictions. Based on this, the constructed classification head models a multi-class classification task with the objective function:

$$\mathcal{L}_{cls} = -\frac{1}{|\Omega|}\sum_{i \in \Omega}\sum_{c} z_i^{c} \log q_i^{c} \qquad (9)$$

where $z_i$ is the classification label, with $z_i = 1$ for $i \in \Omega_{in}$, $z_i = 0$ for $i \in \Omega_{out}$, and $z_i = 2$ for $i \in \Omega_{un}$, and $q_i^{c}$ is the prediction of the classification head $C(F)$ for pixel $i$ and class $c$.
During model training, we assume that "certain" pixels in the uncertain region tend to generate a prediction of $z = 0$ for background or $z = 1$ for foreground rather than the uncertain class. In this way, the uncertainty map derived from the classification confidence can be formulated as:

$$U_{cls} = M_{un} \odot \mathbb{1}\left[\arg\max_{c} q^{c} = 2\right] \qquad (10)$$

where $\odot$ refers to element-wise multiplication, and $M_{un}$ is a mask taking value 1 for $i \in \Omega_{un}$ and 0 otherwise. The final confidence for the uncertain pixels can be formulated as:

$$U = U_{ent} \odot U_{cls} \qquad (11)$$

Here, $U_i = 1$ means that only pixels with both large predictive entropy, i.e., $\mathcal{E}_i > \tau$, and an uncertain classification prediction, i.e., a prediction of $z = 2$ in the multi-class classification task, are considered solid uncertain and assigned the uncertain label.
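Putting the two views together, a hedged sketch of Eqs. (10)-(11), reusing `entropy_uncertainty` from the sketch above and assuming that class index 2 of the auxiliary head denotes the "uncertain" class:

```python
def ccg_confidence(prob_seg, cls_logits, omega_un, tau=0.3):
    """Classification-guided confidence, cf. Eqs. (10)-(11).

    prob_seg: (B, C, H, W) segmentation probabilities.
    cls_logits: (B, 3, H, W) logits of the auxiliary classification head.
    omega_un: (B, H, W) bool mask of the uncertain band; tau: assumed threshold.
    """
    u_ent = entropy_uncertainty(prob_seg, omega_un, tau)   # entropy view, Eq. (8)
    cls_pred = cls_logits.argmax(dim=1)                    # per-pixel 3-class prediction
    u_cls = (cls_pred == 2) & omega_un                     # classification view, Eq. (10)
    solid_uncertain = u_ent & u_cls                        # both views agree, Eq. (11)
    certain_in_band = omega_un & ~solid_uncertain          # pixels eligible for pseudo labels, cf. Eq. (12)
    return solid_uncertain, certain_in_band
```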
3.4 Confidence-Auxiliary Consistency Learner
The confidence-auxiliary consistency learner aims at extracting "certain" information from the uncertain region to facilitate stable training. An intuitive idea is to use contrastive learning to reduce the distance between pixels of the same category while enlarging the distance between pixels of different categories, which allows us to conduct pixel-wise contrastive learning. However, the crucial question is the selection of positive and negative samples, especially for pixels in the uncertain region. To reduce the influence of uncertain information, we utilize the generated confidence for the uncertain pixels, and only the solid certain pixels are considered during pixel-wise contrastive learning. The determined pseudo labels are obtained as follows:

$$\tilde{y}_i = \arg\max_{c \in \{0, 1\}} q_i^{c}, \qquad i \in \Omega_{un},\ U_i = 0 \qquad (12)$$
To provide a more reliable supervision signal through pixel-wise contrastive learning, we follow two guidelines during sample selection: 1) only the feature embeddings of pixels in the certain region are stored and subsequently sampled during the computation of the contrastive loss; 2) anchor sampling focuses on hard samples, i.e., pixels with erroneous predictions, as well as samples with higher certainty. The pixel-wise contrastive loss in this work is defined as:

$$\mathcal{L}_{con} = -\frac{1}{|\mathcal{A}|}\sum_{i \in \mathcal{A}} \frac{1}{|\mathcal{P}_i|}\sum_{p \in \mathcal{P}_i} \log \frac{\exp(v_i \cdot v_p / \tau_c)}{\exp(v_i \cdot v_p / \tau_c) + \sum_{n \in \mathcal{N}_i} \exp(v_i \cdot v_n / \tau_c)} \qquad (13)$$

where $\mathcal{A}$ contains the indexes of all "certain" pixels in the uncertain region; $\mathcal{P}_i$ and $\mathcal{N}_i$ contain the indexes of positive pixels, i.e., pixels in the certain region with the same class as pixel $i$, and negative pixels, i.e., pixels in the certain region with labels different from pixel $i$, respectively; $v_i$ is the feature embedding of pixel $i$; and $\tau_c$ is a temperature hyper-parameter.
Considering the semantic richness of deep-layer representations and the informative signal they carry for uncertain pixels, the feature representation before the segmentation head is projected into a dedicated embedding space and employed as the prototype vector in contrastive learning. That is to say, the feature embedding $v_i$ of pixel $i$ is calculated according to the following equation:

$$v_i = \mathrm{Proj}(F_i) \qquad (14)$$

where $\mathrm{Proj}(\cdot)$ denotes the projection into the embedding space and $F_i$ is the feature of pixel $i$ before the segmentation head.
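A hedged sketch of the pixel-wise contrastive loss of Eq. (13): anchors are "certain" pixels from the uncertain band carrying the pseudo labels of Eq. (12), while the bank holds embeddings from the certain regions. The temperature value is a common default, not necessarily the paper's, and the denominator sums over the whole bank as a common simplification.

```python
import torch
import torch.nn.functional as F

def pixel_contrastive_loss(anchor_emb, anchor_lbl, bank_emb, bank_lbl, temperature=0.07):
    """InfoNCE-style pixel contrastive loss, a sketch of Eq. (13).

    anchor_emb: (Na, D) embeddings of "certain" pixels from the uncertain band,
        with pseudo labels anchor_lbl of shape (Na,), cf. Eq. (12).
    bank_emb / bank_lbl: (Nb, D) / (Nb,) embeddings and labels of pixels
        sampled from the certain regions.
    """
    anchor_emb = F.normalize(anchor_emb, dim=1)
    bank_emb = F.normalize(bank_emb, dim=1)
    sim = anchor_emb @ bank_emb.t() / temperature               # (Na, Nb) cosine similarities
    pos = anchor_lbl.unsqueeze(1) == bank_lbl.unsqueeze(0)      # positives: same (pseudo) label
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)  # log-softmax over the bank
    # Average log-probability of each anchor's positives, then over anchors.
    loss = -(log_prob * pos).sum(dim=1) / pos.sum(dim=1).clamp(min=1)
    return loss.mean()
```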
3.5 Training of EAUWSeg
To summarize, the overall objective function includes two parts: 1) losses for "certain" pixels under a fully-supervised segmentation setting, and 2) the confidence-guided contrastive loss for uncertain pixels. At the early stage of training, the segmentation model needs to learn the feature representation of lesions under the guidance of the supervised loss for "certain" pixels, i.e., $\mathcal{L}_{seg}$. As segmentation performance gradually improves, the contrastive loss combined with the multi-class classification cross-entropy loss is added to constrain uncertain pixels of the same class toward consistent feature representations. Therefore, the overall objective function in this work is formulated as:

$$\mathcal{L}_{total} = \mathcal{L}_{seg} + \lambda_{ccl}\,\mathcal{L}_{con} + \lambda_{ccg}\,\mathcal{L}_{cls} \qquad (15)$$

where $\lambda_{ccl}$ and $\lambda_{ccg}$ are the parameters controlling the contributions of the confidence-auxiliary consistency learner and the classification-guided confidence generator, respectively.
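For illustration, the weighted combination of Eq. (15) with the staged activation described above might be sketched as follows; the warm-up length and the lambda values are placeholders, not the paper's settings:

```python
def total_loss(loss_seg, loss_con, loss_cls, epoch, warmup=10,
               lambda_ccl=0.1, lambda_ccg=0.1):
    """Weighted objective of Eq. (15): supervised loss on certain pixels plus
    the CCL contrastive loss and the CCG classification loss, which are only
    switched on after an assumed warm-up period."""
    ramp = 0.0 if epoch < warmup else 1.0
    return loss_seg + ramp * (lambda_ccl * loss_con + lambda_ccg * loss_cls)
```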
4 Experiments
Methods | Data | ISIC2017 | | | | Kvasir-SEG | | |
---|---|---|---|---|---|---|---|---|---
 | | Dice | Jaccard | Accuracy | Sensitivity | Dice | Jaccard | Accuracy | Sensitivity
Fully-supervised Methods
UNet [25] | mask | 86.11±.13 | 77.80±.16 | 93.83±.03 | 84.61±.49 | 89.21±.24 | 83.50±.09 | 96.95±.05 | 91.40±.39
UNet++ [30] | mask | 85.75±.10 | 77.60±.18 | 96.60±.25 | 84.22±.21 | 89.23±.14 | 83.43±.06 | 96.78±.08 | 92.09±.66
DeepLabV3+ [26] | mask | 86.15±.10 | 78.06±.14 | 93.97±.03 | 83.79±.29 | 89.04±.27 | 82.85±.22 | 96.87±.11 | 91.73±.42
TransUNet [27] | mask | 86.25±.13 | 78.21±.13 | 93.88±.14 | 85.57±.83 | 89.64±.13 | 83.86±.16 | 96.76±.10 | 91.74±.54
TransFuseS [31] | mask | 86.09±.27 | 78.01±.45 | 93.84±.16 | 85.08±.92 | 88.15±.21 | 82.02±.18 | 96.48±.07 | 89.94±.64
HiFormer [32] | mask | 86.16±.22 | 78.02±.30 | 93.84±.11 | 85.05±.76 | 89.18±.22 | 83.41±.19 | 96.86±.04 | 90.40±.63
Weakly-supervised Methods
PCE [33] | scribbles | 80.94±.08 | 71.19±.10 | 91.64±.02 | 80.85±.71 | 77.21±.46 | 66.46±.41 | 93.83±.10 | 80.92±1.55
TV [34] | scribbles | 81.14±.33 | 71.50±.43 | 91.83±.13 | 81.13±.55 | 77.01±.21 | 66.24±.19 | 93.75±.03 | 80.32±.46
GatedCRF [35] | scribbles | 81.02±.53 | 71.25±.61 | 91.64±.22 | 78.30±.80 | 78.63±.26 | 68.43±.29 | 94.12±.10 | 77.43±.44
Mumford-Shah [36] | scribbles | 76.50±.78 | 65.00±.97 | 90.36±.30 | 72.02±2.82 | 69.61±1.49 | 57.25±1.76 | 92.13±.31 | 68.12±3.91
USTM [37] | scribbles | 80.92±.10 | 71.24±.12 | 91.60±.13 | 79.40±1.33 | 76.65±.18 | 65.95±.21 | 93.72±.02 | 79.30±.60
ScribbleVC [8] | scribbles | 81.07±.50 | 71.40±.41 | 91.85±.04 | 76.38±1.00 | 77.29±.39 | 66.95±.37 | 93.83±.18 | 76.21±1.04
DMSPS [1] | scribbles | 81.50±.19 | 71.86±.09 | 91.90±.10 | 80.68±.47 | 78.21±.53 | 68.04±.59 | 94.02±.02 | 80.78±2.10
TriMix [38] | scribbles | 82.03±.11 | 72.65±.12 | 91.76±.12 | 80.39±.78 | 84.23±.11 | 75.83±.26 | 95.44±.08 | 83.46±.42
UNet | box | 82.17±.10 | 71.34±.18 | 91.62±.05 | 90.71±.49 | 76.68±.06 | 64.42±.10 | 91.82±.37 | 93.85±.78
TransUNet | box | 82.71±.27 | 72.17±.35 | 91.98±.16 | 91.04±.64 | 78.61±.23 | 66.69±.12 | 92.82±.40 | 92.66±2.11
UNet | rectangle | 85.38±.03 | 76.52±.11 | 93.38±.10 | 89.35±1.34 | 82.51±.24 | 72.73±.19 | 94.81±.10 | 92.76±.16
TransUNet | rectangle | 85.44±.21 | 76.69±.19 | 93.33±.02 | 89.79±2.04 | 83.40±.20 | 74.10±.08 | 94.57±.12 | 92.10±.66
Ours(UNet) | BPAnno | 86.18±.05 | 78.12±.04 | 93.83±.04 | 84.45±.45 | 89.30±.11 | 83.04±.15 | 96.88±.03 | 91.98±.32
Ours(TransUNet) | BPAnno | 86.60±.17 | 78.61±.24 | 93.95±.12 | 87.55±1.38 | 89.88±.19 | 83.85±.27 | 96.91±.08 | 92.15±.73
4.1 Experimental Setup
4.1.1 Datasets
To evaluate the effectiveness of the proposed method, we conduct comparative experiments on two widely used medical image segmentation datasets, i.e., the ISIC2017 [11] and Kvasir-SEG [12] datasets. ISIC2017 is a skin lesion segmentation dataset for which rich results have been reported in the literature. It contains 2000, 150, and 600 dermoscopic images in the train, valid, and test sets, respectively; we follow the official train/test split in our experiments. Kvasir-SEG contains 1000 gastrointestinal polyp images with corresponding groundtruth; we randomly split the dataset into two subsets of 800 and 200 images. Furthermore, to evaluate the generalization ability of the constructed model, we conduct a cross-training evaluation by applying the model trained on ISIC2017 to the ISIC2018 dataset for skin lesion segmentation without fine-tuning. The ISIC2018 dataset [39] is an expansion of ISIC2017 and contains 2594, 100, and 1000 images in its train, valid, and test sets, respectively. Note that there is no intersection between the test set of ISIC2018 and the train set of ISIC2017.
4.1.2 Annotation Generation
For the bounded polygon annotations, we initially generate approximate bounded polygons through dilation-erosion operations on the available groundtruth masks. Subsequently, a manual refinement process is employed to enhance the accuracy of the bounded polygon annotation. In the automatic generation phase, we create coarse envelope-like and inscribed-like polygons by applying dilation and erosion operations to the segmentation masks: the dilation operation enlarges the masks, while the erosion operation shrinks them. These modified masks serve as a basis for generating polygons. The Douglas-Peucker algorithm [40] is then applied to derive approximate contours from the dense masks, yielding bounded polygons with a limited number of vertices.
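As a sketch of this generation pipeline using OpenCV (the kernel size and simplification tolerance are assumed values, and the subsequent manual refinement step is not modeled):

```python
import cv2
import numpy as np

def polygons_from_mask(mask, kernel_size=15, epsilon_ratio=0.01):
    """Generate coarse envelope-like / inscribed-like polygons from a binary
    groundtruth mask via dilation/erosion plus Douglas-Peucker simplification."""
    kernel = np.ones((kernel_size, kernel_size), np.uint8)
    dilated = cv2.dilate(mask.astype(np.uint8), kernel)   # enlarged mask -> envelope-like
    eroded = cv2.erode(mask.astype(np.uint8), kernel)     # shrunken mask -> inscribed-like

    def to_polygon(m):
        contours, _ = cv2.findContours(m, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
        contour = max(contours, key=cv2.contourArea)
        eps = epsilon_ratio * cv2.arcLength(contour, True)
        return cv2.approxPolyDP(contour, eps, True)        # Douglas-Peucker with few vertices

    return to_polygon(dilated), to_polygon(eroded)
```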
To compare with existing weakly-supervised methods, we also generate scribble, box, and rectangle annotations on these two datasets. Following [41], we simulate scribbles by drawing random lines connecting two end points sampled from the given groundtruth binary mask. The box annotation is obtained with an object detection method. Similarly, the rectangle annotation for an image is obtained by identifying the smallest rectangular region covering the foreground pixels in the groundtruth mask, which is then filled to create a rectangular mask.
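For the rectangle annotation, a minimal sketch of the "smallest covering rectangle" construction described above (the scribble and detector-based box generation are not shown):

```python
import cv2
import numpy as np

def rectangle_annotation(mask):
    """Smallest axis-aligned rectangle covering the foreground of the
    groundtruth mask, returned as a filled binary mask."""
    x, y, w, h = cv2.boundingRect(mask.astype(np.uint8))
    rect = np.zeros_like(mask, dtype=np.uint8)
    rect[y:y + h, x:x + w] = 1
    return rect
```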
4.1.3 Implementation Details
All experiments are conducted using PyTorch and NVIDIA GeForce RTX 3090 GPUs. During training, images are resized to a fixed resolution for all backbone networks except TransFuse and HiFormer, which use their respective default input sizes. For optimization, we employ the Adam and AdamW optimizers, each with a fixed initial learning rate, for the CNN-based and Transformer-based backbone networks, respectively. Typically, we set the maximal number of epochs to 100 for ISIC2017 and 300 for Kvasir-SEG and the batch size to 16; the hyper-parameters $\lambda_{ccl}$, $\lambda_{ccg}$, $\tau$, and $\tau_c$ are kept fixed across all experiments.
4.2 Comparison With State-of-the-Arts

To demonstrate the comprehensive segmentation performance of our method, we compare EAUWSeg with different state-of-the-art approaches:
- Scribble-supervised methods: 1) different learning strategies on UNet, including the partial Cross-Entropy loss [33], the Total Variation loss [34], the Gated Conditional Random Field loss [35], the Mumford-Shah loss [36], and the Uncertainty-aware Self-ensembling and Transformation-consistent Mean Teacher (USTM) [37]; 2) different scribble-supervised frameworks, including ScribbleVC [8], DMSPS [1], and TriMix [38].
- Box-supervised methods. For a fair comparison, we present the results of classical segmentation networks, i.e., UNet and TransUNet, supervised with bounding boxes.
- Fully-supervised segmentation methods: 1) CNN-based methods, including UNet [25], UNet++ [30], and DeepLabV3+ [26]; 2) Transformer-based methods, including TransUNet [27], TransFuseS [31], and HiFormer [32]. Implementations of these networks follow the corresponding GitHub repositories. During training, ResNet50 [42] is employed as the encoder for UNet and DeepLabV3+, ResNet34 is used in UNet++ and TransFuseS, and the default "R50-ViT-B_16" and "Hiformer-S" configurations are employed for TransUNet and HiFormer.
Table 1 presents the quantitative evaluation results of the aforementioned methods. For a fair comparison with scribble-supervised methods using different learning strategies, we present the results of EAUWSeg with UNet as the backbone; to further demonstrate its effectiveness, we also report results with TransUNet as the backbone. The results show that our method outperforms the other weakly-supervised methods on both ISIC2017 and Kvasir-SEG, including both scribble-supervised and box-supervised methods. Compared with fully-supervised methods, the proposed EAUWSeg also delivers superior performance, yielding average Dice scores of 86.60% and 89.88% on ISIC2017 and Kvasir-SEG, respectively. This underscores the superiority and effectiveness of the proposed BPAnno-supervised strategy and its corresponding learning framework EAUWSeg. Fig. 3 shows some qualitative results, from which it can be seen that our proposed method achieves better segmentation performance.
4.3 Ablation Study
Data | ISIC2017 | | Kvasir-SEG |
---|---|---|---|---
 | Dice | Jaccard | Dice | Jaccard
Single Annotation
polygon | 85.54±.20 | 76.86±.09 | 85.29±.29 | 76.22±.51
rectangle | 85.44±.21 | 76.69±.19 | 83.40±.20 | 74.10±.08
ellipse | 84.61±.11 | 75.35±.24 | 83.61±.27 | 73.80±.61
Bounded Annotation
bounded polygon | 85.88±.18 | 77.81±.22 | 88.95±.36 | 82.61±.40
bounded rectangle | 85.60±.22 | 77.13±.28 | 88.71±.10 | 82.22±.27
bounded ellipse | 84.63±.17 | 75.56±.14 | 87.78±.19 | 80.76±.15
4.3.1 Effectiveness of the Bounded Annotations
To analyze the effectiveness of the proposed bounded annotation strategy, we quantitatively evaluate TransUNet trained directly with different bounded annotation methods, including polygon, rectangle, and ellipse. Since box is similar to rectangle, only rectangle is compared, as it achieves better performance. Table 2 lists the quantitative comparison based on the Dice score and Sensitivity; the former reflects the overall segmentation performance, while the latter reflects the recall of the foreground pixels. It can be seen that all three annotations offer a promising way to initialize the lesion region (with Dice scores larger than 80%), while polygon performs best. Substituting each single annotation with its bounded counterpart leads to consistent performance improvements for all annotation types on both datasets, demonstrating the effectiveness of our proposed bounded weak annotation strategy.
4.3.2 Comparative Analysis of Different Components
To demonstrate the effectiveness of the proposed components, i.e., the confidence-auxiliary consistency learner (CCL) and the classification-guided confidence generator (CCG), we carry out ablation experiments, with results shown in Table 3. The baseline is TransUNet trained with bounded polygon annotations. It can be seen that, with the gradual introduction of CCL and CCG, performance consistently improves on both ISIC2017 and Kvasir-SEG.
Baseline | CCL | CCG | Dice | Jaccard | Accuracy | Sensitivity
---|---|---|---|---|---|---
ISIC2017
✓ | | | 85.88 | 77.81 | 93.65 | 85.59
✓ | ✓ | | 86.38 | 78.12 | 93.81 | 86.38
✓ | ✓ | ✓ | 86.60 | 78.61 | 93.95 | 87.55
Kvasir-SEG
✓ | | | 88.95 | 82.61 | 96.49 | 91.78
✓ | ✓ | | 89.35 | 83.06 | 96.90 | 92.03
✓ | ✓ | ✓ | 89.88 | 83.85 | 96.91 | 92.15
4.3.3 Comparison With Semi-supervised Methods
Table 4 presents a comparative analysis of our method against five existing semi-supervised segmentation methods on ISIC2017. For these semi-supervised methods, we refer to the results reported in [43]. They are trained with varying percentages of labeled data (5%, 10%, and 20%) and assisted by the remaining unlabeled data. In contrast, the proposed method is trained with only the same fractions of samples annotated by bounded polygons, without using the remaining unlabeled data. Although supervised only with bounded polygon annotations, our method outperforms most of the specifically designed SSL methods (except CASSL) that are trained with dense masks plus the remaining unlabeled data, showcasing its robust feature learning capability. Compared with CASSL, in which an adversarial training mechanism and a collaborative consistency learning strategy are carefully designed to exploit the unlabeled data, our method has a small performance gap while requiring neither dense masks nor unlabeled data. This matters for many medical image segmentation tasks, since additional unlabeled data may be unavailable in clinical practice.
Methods | Data | 5% | 10% | 20%
---|---|---|---|---
UNet | mask + unlabeled | 70.92 | 71.74 | 75.27
CLCC [44] | mask + unlabeled | 61.23 | 65.40 | 68.93
MT [45] | mask + unlabeled | 73.12 | 74.34 | 76.98
ST++ [46] | mask + unlabeled | 73.26 | 75.51 | 76.69
S4-PLCL [47] | mask + unlabeled | 68.19 | 71.08 | 71.83
CASSL [43] | mask + unlabeled | 76.55 | 77.49 | 79.31
Ours(TransUNet) | only BPAnno | 75.81 | 76.86 | 77.54
4.3.4 Generalizabilty Analysis With Different Backbones
The proposed EAUWSeg is a plug-and-play model that can be easily combined with different backbones. To demonstrate its generalization ability, six widely used segmentation networks are compared, i.e., UNet [25], UNet++ [30], DeepLabV3+ [26], TransUNet [27], TransFuseS [31], and HiFormer [32]. From Fig. 4, it can be seen that: 1) the best result is achieved when using TransUNet as the backbone, and 2) the proposed method delivers superior performance compared to its fully-supervised counterparts, as shown in Table 1. These results reveal that EAUWSeg generalizes well across different backbones.

4.3.5 Generalization on ISIC2018
The generalization ability of the constructed model is important for real-world applications. We evaluate the generalization ability of the proposed method via cross-training [48]: specifically, we apply the model trained on ISIC2017 to the ISIC2018 dataset for skin lesion segmentation without fine-tuning. As presented in Table 5, our method achieves generalization performance on ISIC2018 comparable to that of all fully-supervised counterparts. This highlights the effectiveness of our EAUWSeg approach, as well as the bounded polygon annotation, in ensuring robust generalizability.
Methods | Data | Dice | Jaccard | Accuracy | Sensitivity
---|---|---|---|---|---
UNet | mask | 86.78 | 78.27 | 92.69 | 93.85
UNet | BPAnno | 86.68 | 78.06 | 92.92 | 94.64
UNet++ | mask | 87.11 | 78.95 | 92.63 | 94.64
UNet++ | BPAnno | 86.67 | 77.88 | 92.70 | 95.12
DeepLabV3+ | mask | 87.02 | 78.73 | 92.87 | 94.64
DeepLabV3+ | BPAnno | 86.63 | 77.98 | 92.73 | 94.75
TransUNet | mask | 86.34 | 77.39 | 92.38 | 95.99
TransUNet | BPAnno | 86.43 | 77.49 | 92.55 | 95.93
TransFuseS | mask | 87.67 | 79.58 | 93.22 | 95.04
TransFuseS | BPAnno | 87.57 | 78.81 | 94.92 | 93.39
HiFormer | mask | 87.27 | 79.16 | 93.10 | 95.10
HiFormer | BPAnno | 87.44 | 79.12 | 92.87 | 94.94
4.4 Error Analysis
The proposed bounded polygon annotation has the advantage of explicitly providing a prior emphasis on lesion boundaries during model training. To reveal this, following [49], we separately evaluate the results in boundary and interior regions. Fig. 5 illustrates the Jaccard and Dice score improvements achieved by our EAUWSeg over the BPAnno-supervised baselines, inside and outside a band of specific width around the lesion boundary, referred to as the boundary and interior regions. EAUWSeg consistently enhances the baseline models in both regions, regardless of the trimap width. In particular, it achieves a substantial gain of over 2% within the boundary regions, revealing the strength of EAUWSeg in capturing intricate boundary details, which we attribute to the developed confidence-auxiliary consistency learner. Furthermore, we illustrate t-SNE visualizations of the feature space constructed by TransUNet trained under full supervision and by EAUWSeg trained under BPAnno supervision. From Fig. 6, it can be seen that our method constructs a more compact feature space than the fully-supervised baseline, especially near the lesion boundary.
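For reference, a small sketch of how such a trimap-style boundary/interior split can be computed from the groundtruth mask; the band width is an assumed value, not necessarily the one used for Fig. 5:

```python
import cv2
import numpy as np

def boundary_interior_masks(gt_mask, band_width=5):
    """Split the image into a boundary band around the groundtruth contour
    and the remaining interior/exterior region for trimap-style evaluation."""
    kernel = np.ones((2 * band_width + 1, 2 * band_width + 1), np.uint8)
    dilated = cv2.dilate(gt_mask.astype(np.uint8), kernel)
    eroded = cv2.erode(gt_mask.astype(np.uint8), kernel)
    boundary = (dilated - eroded).astype(bool)   # +/- band_width pixels around the contour
    interior = ~boundary                          # everything outside the band
    return boundary, interior

def dice_in_region(pred, gt, region, eps=1e-6):
    """Dice score restricted to the pixels of a given region mask."""
    p, g = pred[region].astype(bool), gt[region].astype(bool)
    return (2.0 * (p & g).sum() + eps) / (p.sum() + g.sum() + eps)
```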


4.5 Annotation Cost Analysis
To quantify the annotation-cost reduction offered by the proposed bounded polygon annotation, we conduct a comparison focusing on annotation workload. In this study, a dermatologist with over ten years of experience at a general hospital performed two types of annotation, i.e., pixel-level dense annotation and the proposed bounded polygon annotation, on twenty images selected from ISIC2017. Dense annotation took an average of 55 seconds per image, versus 10 seconds for the bounded polygon annotation, indicating that annotating the bounded polygons of a skin lesion in a dermoscopic image requires only about 18% of the cost of pixel-level annotation. Combined with the results in Table 1 and Table 5, the proposed method delivers superior performance and comparable generalization ability relative to its fully-supervised counterparts. These results indicate that bounded polygon annotations coupled with EAUWSeg can be a cost-effective solution for weakly-supervised medical image segmentation.
5 Conclusion and Future Work
In this work, to eliminate the annotation uncertainty present in weakly-supervised medical image segmentation, we propose the bounded polygon annotation, which labels only two polygons while providing a promising prior on the lesion boundary during training. To further eliminate the uncertainty contained in the bounded polygons and to leverage the prior they delineate, we develop EAUWSeg, a learning framework tailored for bounded polygons, in which a confidence-auxiliary consistency learner combined with a classification-guided confidence generator provides reliable supervision signals for pixels in the uncertain region. Extensive experimental results demonstrate that EAUWSeg not only outperforms existing weakly-supervised segmentation methods but also delivers superior performance compared to fully-supervised counterparts, with less than 20% of the annotation workload.
This work is a preliminary attempt at eliminating annotation uncertainty in weakly-supervised medical image segmentation. Extensive experimental results have demonstrated the cost-efficiency and effectiveness of the bounded annotation, yet several limitations remain. This study focuses on binary weakly-supervised medical image segmentation; when applied to instance segmentation, it may face challenges such as the envelope-like polygon enclosing foreground pixels of different categories. In future work, we will focus on addressing these problems.
References
- [1] M. Han, X. Luo, X. Xie, W. Liao, S. Zhang, T. Song, G. Wang, and S. Zhang, “Dmsps: Dynamically mixed soft pseudo-label supervision for scribble-supervised medical image segmentation,” Medical Image Analysis, p. 103274, 2024.
- [2] S. Zhai, G. Wang, X. Luo, Q. Yue, K. Li, and S. Zhang, “Pa-seg: Learning from point annotations for 3d medical image segmentation using contextual regularization and cross knowledge distillation,” IEEE Transactions on Medical Imaging, 2023.
- [3] F. Gao, M. Hu, M.-E. Zhong, S. Feng, X. Tian, X. Meng, Z. Huang, M. Lv, T. Song, X. Zhang, X. Zou, and X. Wu, “Segmentation only uses sparse annotations: Unified weakly and semi-supervised learning in medical images,” Medical Image Analysis, vol. 80, p. 102515, 2022.
- [4] K. Wu, B. Du, M. Luo, H. Wen, Y. Shen, and J. Feng, “Weakly supervised brain lesion segmentation via attentional representation learning,” in Medical Image Computing and Computer Assisted Intervention. Springer, 2019, pp. 211–219.
- [5] D. Lin, J. Dai, J. Jia, K. He, and J. Sun, “Scribblesup: Scribble-supervised convolutional networks for semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 3159–3167.
- [6] M. Rajchl, M. C. Lee, O. Oktay, K. Kamnitsas, J. Passerat-Palmbach, W. Bai, M. Damodaram, M. A. Rutherford, J. V. Hajnal, B. Kainz et al., “Deepcut: Object segmentation from bounding box annotations using convolutional neural networks,” IEEE Transactions on Medical Imaging, vol. 36, no. 2, pp. 674–683, 2016.
- [7] R. Dorent, S. Joutard, J. Shapey, A. Kujawa, M. Modat, S. Ourselin, and T. Vercauteren, “Inter extreme points geodesics for end-to-end weakly supervised image segmentation,” in Medical Image Computing and Computer Assisted Intervention. Springer, 2021, pp. 615–624.
- [8] Z. Li, Y. Zheng, X. Luo, D. Shan, and Q. Hong, “Scribblevc: Scribble-supervised medical image segmentation with vision-class embedding,” in Proceedings of the 31st ACM International Conference on Multimedia, 2023, pp. 3384–3393.
- [9] Y. Wang, J. Zhang, M. Kan, S. Shan, and X. Chen, “Blpseg: Balance the label preference in scribble-supervised semantic segmentation,” IEEE Transactions on Image Processing, 2023.
- [10] L. Wu, Z. Zhong, L. Fang, X. He, Q. Liu, J. Ma, and H. Chen, “Sparsely annotated semantic segmentation with adaptive gaussian mixtures,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 15 454–15 464.
- [11] N. C. Codella, D. Gutman, M. E. Celebi, B. Helba, M. A. Marchetti, S. W. Dusza, A. Kalloo, K. Liopyris, N. Mishra, H. Kittler, and A. Halpern, “Skin lesion analysis toward melanoma detection: A challenge at the 2017 international symposium on biomedical imaging (isbi), hosted by the international skin imaging collaboration (isic),” in 2018 IEEE 15th International Symposium on Biomedical Imaging. IEEE, 2018, pp. 168–172.
- [12] D. Jha, P. H. Smedsrud, M. A. Riegler, P. Halvorsen, T. de Lange, D. Johansen, and H. D. Johansen, “Kvasir-seg: A segmented polyp dataset,” in MultiMedia Modeling: 26th International Conference, MMM 2020. Springer, 2020, pp. 451–462.
- [13] N. Tajbakhsh, L. Jeyaseelan, Q. Li, J. N. Chiang, Z. Wu, and X. Ding, “Embracing imperfect datasets: A review of deep learning solutions for medical image segmentation,” Medical Image Analysis, vol. 63, p. 101693, 2020.
- [14] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, “Learning deep features for discriminative localization,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2921–2929.
- [15] L. Wang, L. Zhang, X. Shu, and Z. Yi, “Intra-class consistency and inter-class discrimination feature learning for automatic skin lesion classification,” Medical Image Analysis, vol. 85, p. 102746, 2023.
- [16] C.-C. Hsu, K.-J. Hsu, C.-C. Tsai, Y.-Y. Lin, and Y.-Y. Chuang, “Weakly supervised instance segmentation using the bounding box tightness prior,” Advances in Neural Information Processing Systems, vol. 32, 2019.
- [17] G. Valvano, A. Leo, and S. A. Tsaftaris, “Learning to segment from scribbles using multi-scale adversarial attention gates,” IEEE Transactions on Medical Imaging, vol. 40, no. 8, pp. 1990–2001, 2021.
- [18] Z. Ji, Y. Shen, C. Ma, and M. Gao, “Scribble-based hierarchical weakly supervised learning for brain tumor segmentation,” in Medical Image Computing and Computer Assisted Intervention. Springer, 2019, pp. 175–183.
- [19] K. Zhang and X. Zhuang, “Cyclemix: A holistic strategy for medical image segmentation from scribble supervision,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 11 656–11 665.
- [20] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “Simclr: A simple framework for contrastive learning of visual representations,” in International Conference on Learning Representations, vol. 2, 2020.
- [21] A. Jaiswal, A. R. Babu, M. Z. Zadeh, D. Banerjee, and F. Makedon, “A survey on contrastive self-supervised learning,” Technologies, vol. 9, no. 1, p. 2, 2020.
- [22] C. You, W. Dai, Y. Min, F. Liu, D. Clifton, S. K. Zhou, L. Staib, and J. Duncan, “Rethinking semi-supervised medical image segmentation: A variance-reduction perspective,” in Advances in Neural Information Processing Systems, vol. 36, 2023, pp. 9984–10 021.
- [23] W. Wang, T. Zhou, F. Yu, J. Dai, E. Konukoglu, and L. Van Gool, “Exploring cross-image pixel contrast for semantic segmentation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 7303–7313.
- [24] T. Wang, J. Lu, Z. Lai, J. Wen, and H. Kong, “Uncertainty-guided pixel contrastive learning for semi-supervised medical image segmentation,” in Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, 2022, pp. 1444–1450.
- [25] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in Medical Image Computing and Computer-Assisted Intervention. Springer, 2015, pp. 234–241.
- [26] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam, “Encoder-decoder with atrous separable convolution for semantic image segmentation,” in Proceedings of the European Conference on Computer Vision, 2018, pp. 801–818.
- [27] J. Chen, J. Mei, X. Li, Y. Lu, Q. Yu, Q. Wei, X. Luo, Y. Xie, E. Adeli, Y. Wang, M. P. Lungren, S. Zhang, L. Xing, L. Lu, A. Yuille, and Y. Zhou, “Transunet: Rethinking the u-net architecture design for medical image segmentation through the lens of transformers,” Medical Image Analysis, vol. 97, p. 103280, 2024.
- [28] H. Wu, X. Li, and K.-T. Cheng, “Exploring feature representation learning for semi-supervised medical image segmentation,” IEEE Transactions on Neural Networks and Learning Systems, 2023.
- [29] L. Liu, A. I. Aviles-Rivero, and C.-B. Schönlieb, “Contrastive registration for unsupervised medical image segmentation,” IEEE Transactions on Neural Networks and Learning Systems, 2023.
- [30] Z. Zhou, M. M. R. Siddiquee, N. Tajbakhsh, and J. Liang, “Unet++: Redesigning skip connections to exploit multiscale features in image segmentation,” IEEE Transactions on Medical Imaging, vol. 39, no. 6, pp. 1856–1867, 2019.
- [31] Y. Zhang, H. Liu, and Q. Hu, “Transfuse: Fusing transformers and cnns for medical image segmentation,” in Medical Image Computing and Computer Assisted Intervention. Springer, 2021, pp. 14–24.
- [32] M. Heidari, A. Kazerouni, M. Soltany, R. Azad, E. K. Aghdam, J. Cohen-Adad, and D. Merhof, “Hiformer: Hierarchical multi-scale representations using transformers for medical image segmentation,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2023, pp. 6202–6212.
- [33] M. Tang, A. Djelouah, F. Perazzi, Y. Boykov, and C. Schroers, “Normalized cut loss for weakly-supervised cnn segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 1818–1827.
- [34] M. Javanmardi, M. Sajjadi, T. Liu, and T. Tasdizen, “Unsupervised total variation loss for semi-supervised deep learning of semantic segmentation,” arXiv preprint arXiv:1605.01368, 2016.
- [35] A. Obukhov, S. Georgoulis, D. Dai, and L. Van Gool, “Gated crf loss for weakly supervised semantic image segmentation,” arXiv preprint arXiv:1906.04651, 2019.
- [36] B. Kim and J. C. Ye, “Mumford–shah loss functional for image segmentation with deep learning,” IEEE Transactions on Image Processing, vol. 29, pp. 1856–1866, 2019.
- [37] X. Liu, Q. Yuan, Y. Gao, K. He, S. Wang, X. Tang, J. Tang, and D. Shen, “Weakly supervised segmentation of covid19 infection with scribble annotation on ct images,” Pattern Recognition, vol. 122, p. 108341, 2022.
- [38] Z. Zheng, Y. Hayashi, M. Oda, T. Kitasaka, and K. Mori, “Trimix: A general framework for medical image segmentation from limited supervision,” in Proceedings of the Asian Conference on Computer Vision, 2022, pp. 634–651.
- [39] N. Codella, V. Rotemberg, P. Tschandl, M. E. Celebi, S. Dusza, D. Gutman, B. Helba, A. Kalloo, K. Liopyris, M. Marchetti, H. Kittler, and A. Halpern, “Skin lesion analysis toward melanoma detection 2018: A challenge hosted by the international skin imaging collaboration (isic),” arXiv preprint arXiv:1902.03368, 2019.
- [40] D. H. Douglas and T. K. Peucker, “Algorithms for the reduction of the number of points required to represent a digitized line or its caricature,” Cartographica: the International Journal for Geographic Information and Geovisualization, vol. 10, no. 2, pp. 112–122, 1973.
- [41] H. E. Wong, M. Rakic, J. Guttag, and A. V. Dalca, “Scribbleprompt: Fast and flexible interactive segmentation for any medical image,” arXiv preprint arXiv:2312.07381, 2023.
- [42] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
- [43] Y. Tang, S. Wang, Y. Qu, Z. Cui, and W. Zhang, “Consistency and adversarial semi-supervised learning for medical image segmentation,” Computers in Biology and Medicine, vol. 161, p. 107018, 2023.
- [44] X. Zhao, C. Fang, D.-J. Fan, X. Lin, F. Gao, and G. Li, “Cross-level contrastive learning and consistency constraint for semi-supervised medical image segmentation,” in 2022 IEEE 19th International Symposium on Biomedical Imaging, 2022, pp. 1–5.
- [45] A. Tarvainen and H. Valpola, “Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results,” Advances in Neural Information Processing Systems, vol. 30, 2017.
- [46] L. Yang, W. Zhuo, L. Qi, Y. Shi, and Y. Gao, “St++: Make self-training work better for semi-supervised semantic segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 4268–4277.
- [47] I. Alonso, A. Sabater, D. Ferstl, L. Montesano, and A. C. Murillo, “Semi-supervised semantic segmentation with pixel-level contrastive learning from a class-wise memory bank,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 8219–8228.
- [48] Y. Yuan, L. Zhang, L. Wang, and H. Huang, “Multi-level attention network for retinal vessel segmentation,” IEEE Journal of Biomedical and Health Informatics, vol. 26, no. 1, pp. 312–323, 2021.
- [49] J. Dai, K. He, and J. Sun, “Boxsup: Exploiting bounding boxes to supervise convolutional networks for semantic segmentation,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1635–1643.