EAUWSeg: Eliminating annotation uncertainty in weakly-supervised medical image segmentation

Lituan Wang Lei Zhang \IEEEmembershipSenior Member, IEEE Yan Wang Zhenbin Wang Zhenwei Zhang and Zhang Yi \IEEEmembershipFellow, IEEE This work was supported by the National Natural Science Foundation of China under Grant 62025601, and Grant 62376174. Lituan Wang, Lei Zhang, Zhenbin Wang, Zhenwei Zhang, and Zhang Yi are with the Machine Intelligence Laboratory, College of Computer Science, Sichuan University, Chengdu 610065, China. E-mail: [email protected]; [email protected]; [email protected]; [email protected]; [email protected]. Yan Wang is with the Institute of High Performance Computing, A*STAR, Singapore 138632. E-mail: [email protected]. (Corresponding author: Lei Zhang).

Abstract

Weakly-supervised medical image segmentation is gaining traction as it requires only rough annotations rather than accurate pixel-to-pixel labels, thereby reducing the workload for specialists. Although some progress has been made, there is still a considerable performance gap between the label-efficient methods and fully-supervised one, which can be attributed to the uncertainty nature of these weak labels. To address this issue, we propose a novel weak annotation method coupled with its learning framework EAUWSeg to eliminate the annotation uncertainty. Specifically, we first propose the Bounded Polygon Annotation (BPAnno) by simply labeling two polygons for a lesion. Then, the tailored learning mechanism that explicitly treat bounded polygons as two separated annotations is proposed to learn invariant feature by providing adversarial supervision signal for model training. Subsequently, a confidence-auxiliary consistency learner incorporates with a classification-guided confidence generator is designed to provide reliable supervision signal for pixels in uncertain region by leveraging the feature presentation consistency across pixels within the same category as well as class-specific information encapsulated in bounded polygons annotation. Experimental results demonstrate that EAUWSeg outperforms existing weakly-supervised segmentation methods. Furthermore, compared to fully-supervised counterparts, the proposed method not only delivers superior performance but also costs much less annotation workload. This underscores the superiority and effectiveness of our approach.

{IEEEImpStatement}

Benefit to its ability of reducing annotation workload, label-efficient methods have gaining traction in weakly-supervised medical image segmentation. We revisit existing label-efficient medical image segmentation methods and observe that these weak labels introduce considerable uncertainty for segmentation model constructing, which leads to considerable performance gap between the label-efficient methods and fully-supervised one. To address this problem, a novel weak annotation method BPAnno that simply labeling two polygons for a lesion, and its coupled learning framework EAUWSeg is proposed to eliminate the annotation uncertainty. Extensive experiments demonstrate that our EAUWSeg can achieve superior performance while with less than 20% of the annotation workload when compared to fully-supervised counterparts. This reveals that the proposed method can be a cost-effective solution for improving the performance in weakly-supervised medical image segmentation.

{IEEEkeywords}

Weakly-supervised segmentation, consistency-based contrastive learning, medical image segmentation

1 Introduction

\IEEEPARstart

Medical image segmentation plays a crucial role in biomedical image analysis [1], such as diagnosis, treatment, and radiotherapy planning. As manual segmentation is usually labor-intensive, time-consuming and rely on professional domain knowledge [2], automatic medical image segmentation has been widely dedicated and series methods have been proposed. However, the successes of existing methods rely mainly on large-scale meticulously annotated data, which requires significant domain expertise as well as expensive annotation cost.

To alleviate the burdens associated with image annotation, weakly-supervised medical image segmentation is gaining traction as it requires only weak or sparse annotations [3], such as image-level labels [4], scribbles [5], bounding boxes [6], and point annotations [7]. Although some progress has been made by using label-efficient annotations for training, there is still a considerable performance gap between the label-efficient methods and fully-supervised ones [8]. We revisit existing label-efficient medical image segmentation methods and observe that these weak labels introduce considerable uncertainty for segmentation model constructing. Fig. 1 provides the visual representation of the supervision signals introduced by different label-efficient annotations, in which most information (defined by the gray region) are uncertain. The uncertainty supervision signals provided by label-efficient annotations may induce model training oscillations, thus impair the training of the model to approach the performance achieved in a fully supervised manner [9]. Consequently, there is an urgent need to explore label-efficient methods that can reduce annotation uncertainty, and develop methods that can assistant to eliminate the label uncertainty during model training.

Refer to caption — Figure 1: Comparison of the typical weak annotation methods and our proposed bounded polygon annotations, including the annotation strategies and the annotation uncertainty. The yellow curves show groundtruth segmentation. The black and gray denote the certain and uncertain regions, respectively.

In this work, we propose a novel weak annotation method coupled with its learning framework to eliminate the annotation uncertainty, and facilitate stable training in the weakly-supervised medical image segmentation with more reliable supervision signal. To this end, we introduce the bounded polygons annotation, which simply requires labeling two polygons that are similar to the inscribed and outer envelope-like delineations of lesion (as shown in Fig. 1). The proposed bounded polygons annotation has three advantages: (1) it reduces the label burden compared with pixel-to-pixel accurate labels, (2) it restricts the uncertainty information to gray region between two polygons, (3) it explicitly provides prior emphasis on lesion boundaries during model training. Tailored for the proposed weak annotation, we propose a EAUWSeg method to further eliminate the uncertainty included in the bounded polygon annotation by explicitly treating bounded polygons as two separated annotations. For the envelope-like annotation, pixels within red contour belong to foreground class, otherwise belong to background class. For the inscribed-like annotation, pixels within purple contour belong to foreground class, otherwise belong to background class. In this way, the uncertainty region provides adversarial supervision signal for model training to learn invariant feature. Then, by leveraging the existing observation that similar pixels in the feature space prefer to generate consistent category predictions [10], we design a Classification-guided Confidence Generator (CCG) to measure the feature similarity between certain and uncertain pixels from a probabilistic perspective. Moreover, we adopt a Confidence-auxiliary Consistency Learner (CCL) that prefers to ensure the accuracy of certain pixels and attract uncertain pixels with the same category to preserve consistency feature representation. In the collaboration of CCG and CCL, more reliable supervision signal in uncertain region can be provided during model training to facilitate stable training in the weakly-supervised medical image segmentation.

Overall, our contributions can be summarized as follows:

1.

We propose a novel weak annotation method that labels only two bounded polygons and the coupled learning framework for medical image segmentation, which further eliminate the annotation uncertainty existed in most label-efficient methods.
2.

We propose the tailored learning mechanism that explicitly treat bounded polygons as two separated annotations, which can provide adversarial supervision signal for model training to learn invariant feature.
3.

We propose a Confidence-auxiliary Consistency Learner that incorporates with a Classification-guided Confidence Generator to provide reliable supervision signal for pixels in uncertain region by leveraging the intra-class similarity and inter-class discriminative from both the feature and category perspective. It is worth noting that the CCL and CCG modules will be discarded during inference, which not increase computation complexity.
4.

To evaluate our method, we provide the bounded polygon annotations on two widely used medial image segmentation datasets, i.e., ISIC2017 [11] and Kvasir-SEG [12]. Extensive experiments on these two datasets demonstrate that our EAUWSeg outperforms existing weakly-supervised segmentation methods. Furthermore, the proposed method delivers superior performance with less than 20% of the annotation workload when compared to fully-supervised counterparts. These results reveal that bounded polygon annotations coupled with EAUWSeg can be a cost-effective solution for the segmentation performance preserving.

2 Related Works

2.1 Weakly-supervised Medical Image Segmentation

Without the requirement of large densely annotated data, weakly-supervised learning has gained significant attention in medical image segmentation [13, 2]. As the most efficient weak annotation method, image-level annotations only require classification labels and generates class activation maps [14] for training. Although image-level annotations method is convenient, it has limited performance due to the extremely weak supervision [15]. Box-level annotation is usually defined by two corner coordinates, which provides localization-awareness compared to the image-level annotation [16]. However, boxes for different objects may tend to overlap with each other, making it difficult to accurately approximate the target boundary, especially for complex shapes [10]. Point annotations provide a small number of pixels for different classes and can better handle complex shapes, which may be more preferable to medical segmentation compared to box-level annotations. Despite its efficiency, the segmentation model trained with point annotations tends to overfit the small number of annotated pixels when comparing with the large number of unannotated pixels.

Scribble-based annotations provide labels for a sparse set of pixels of each class for training, and are usually more obtainable in medical image segmentation by considering its annotation efficiency, performance effectiveness as well as the friendliness to the annotation of nested structure [17]. Only the scribbles of the background and each object are given, while the groundtruth of other pixels remains unknown, which is harmful to the segmentation performance. An intuitive resolution is to expand the scribble annotations by considering the prior assumptions [18] or using the learned foreground features through the deep neural networks [19]. However, due to the lack of supervisory signals, the constructed models usually fail to capture the object structure and confuse on the object boundary. To address this issue, a series of studies have concentrated on learning adversarial shape priors at the expense of requiring additional fully-annotated masks [17]. However, acquiring such fully annotated datasets may present challenges in many clinical practices, rendering these existing methods both costly and lacking in scalability. Our work aims to explore new weak annotation method that can prompt the performance of automated medical image segmentation without auxiliary datasets.

2.2 Contrastive Learning

Contrastive learning argues that similar samples should have similar representations, and the representations of different samples should be different [20]. Based on this, contrastive loss is usually designed to enforce representations to be similar for similar pairs and dissimilar for dissimilar pairs [21]. Considering its powerful self-supervised feature extracting ability from the unlabeled data, contrastive learning has been widely used in many image-level tasks. Among all these methods, the key is the selection mechanism designing of contrastive sample pairs, i.e., positive and negative pairs.

Recently, contrastive learning has been extended from image-level task to pixel-level ones to mine informative information from unlabeled data [22, 23]. As mentioned earlier, constructing contrastive sample pairs is crucial for discriminative feature learning. In the context of pixel-level tasks, sample pairs are usually constructed through pseudo labels or spatial structure, which may introduce noisy sampling. To alleviate this problem, prediction uncertainty has been injected into the sampling to reduce the number of noisy samples [24].

3 Method

In this work, we propose a novel bounded polygon annotation method, i.e., BPAnno, and its corresponding segmentation framework, i.e., EAUWSeg, to eliminate annotation uncertainty in weakly-supervised medical image segmentation. Our EAUWSeg is in general applicable for many existing medical image segmentation models, such as UNet [25], DeepLabV3+ [26], TransUNet [27], with encoder and decoder phases. The overall framework is illustrated in Fig. 2.

3.1 Problem Setting and Bounded Polygons Annotation

In the scenario of classical weakly-supervised segmentation, the input pixels $x$ are usually divided into the labeled pixels $x_{l}$ and unlabeled pixels $x_{ul}$ . In this way, the corresponding labels $y_{l}$ for the labeled pixels $x_{l}$ will be directly used for supervision by employing the partial cross-entropy loss, which can be formulated as follows:

\mathcal{L}_{l}(p,y)=-\displaystyle{\sum\limits_{y\in y_{l}}}ylog(p),

(1)

where $p$ is the segmentation prediction. For the unlabeled pixels, there is no off-the-shelf label for supervision, and a lot of work focus on assigning pseudo labels to unlabeled pixels for supervision [28, 29]. The overall objective function can be formulated as follow:

\mathcal{L}=\mathcal{L}_{l}+\mathcal{L}_{ul}.

(2)

However, assigning pseudo labels to unlabeled pixels not only requires a time-consuming multi-stage training process, but also results in misleading or biases [10].

To address this problem, this work introduces the bounded polygon annotation method that simply requires labeling two polygons that are similar to the inscribed and outer envelope-like delineations of lesion (as shown in Fig. 1). To further eliminate the uncertainty included in the bounded polygon annotation, we explicitly treat bounded polygons as two separated annotation, i.e., inscribed-like annotation $y^{in}$ and envelope-like annotation $y^{en}$ . Different from the classical weakly annotation methods, the input pixels $x$ are divided into the certain labeled pixels and the uncertain pixels in our bounded polygon annotations. This work aims at providing more reliable supervision signal for pixels in uncertain region during model training.

For convenience, we define $\Omega_{I}$ , $\Omega_{\Delta}$ , $\Omega_{O}$ as the spatial domain inside the inscribed-like annotation, between the inscribed-like and envelope-like delineations, and outside the envelope-like annotation, respectively. Here, the certain labeled pixels and uncertain pixels can be depicted as $x_{i}\in\Omega_{I}\cup\Omega_{O}$ and $x_{uc}\in\Omega_{\Delta}$ , the corresponding labels $y_{i}=1$ if $x_{i}\in\Omega_{I}$ and $y_{i}=0$ if $x_{i}\in\Omega_{O}$ , otherwise, $y_{i}$ is uncertain. The spatial domain of input image $x$ and the envelope-like annotation can be depicted as $\Omega=\Omega_{I}\cup\Omega_{O}\cup\Omega_{\Delta}$ and $\Omega_{E}=\Omega_{I}\cup\Omega_{\Delta}$ , respectively. In this way, our proposed EAUWSeg tries to learn from the “certain/uncertain” pixels instead of “labeled/unlabeled” pixels in the classical weakly-supervised segmentation. The objective function in our EAUWSeg can be re-formulated as:

\mathcal{L}=\mathcal{L}_{c}+\mathcal{L}_{uc}.

(3)

The feature learning of certain pixels have been well solved. Hence, this work focuses on eliminating annotation uncertainty and thus providing reliable supervision signals for pixels in the uncertain regions.

3.2 Framework of EAUWSeg

EAUWSeg is tailored for the proposed bounded polygon annotation, and mainly focuses on eliminating annotation uncertainty for pixels belong to $\Omega_{\Delta}$ . As illustrated in Fig. 2, EAUWSeg consists of 1) a mainstream segmentation network supervised by two bounded polygons segmentation labels to implicitly define the certain region and uncertain region during network training, 2) a classification-guided confidence generator to provide the category-level prediction confidence for pixels $x_{i}\in\Omega_{\Delta}$ by leveraging a tailored multi-class classification task, 3) a confidence-auxiliary consistency learner to distinguish reliable pixels in uncertain region can assign the corresponding “certain” labels.

Let $\mathcal{S}_{e}$ , $\mathcal{S}_{d}$ , and $\mathcal{S}_{h}$ denote the encoder, the decoder, and segmentation head used in our proposed framework that are parameterized by $\Theta_{e}$ , $\Theta_{d}$ and $\Theta_{h}$ , respectively. In the proposed EAUWSeg, the bounded polygon annotation is treated as two separate masks, i.e., inscribed-like and envelope-like masks, and the basic segmentation loss function in EAUWSeg can be formulated as:

\mathcal{L}_{c}=\displaystyle{\sum\limits_{x}\left(\mathcal{L}_{in}(p,y^{in})+\mathcal{L}_{en}(p,y^{en})\right)},

(4)

where $p$ is the predicted probability maps for input image $x$ . In this work, the following dice loss is employed for both $\mathcal{L}_{in}$ and $\mathcal{L}_{en}$ :

\mathcal{L}_{dice}=1-\displaystyle{\frac{2\times\sum_{i=1}^{H\times W\times D}p_{i}y_{i}}{\sum_{i=1}^{H\times W\times D}(p_{i}^{2}+y_{i}^{2})}},

(5)

where $H\times W\times D$ denotes the input image size, $p_{i}$ and $y_{i}$ denotes the prediction probability and label for pixel $i$ , respectively. However, training with $\mathcal{L}_{c}$ will introduce inconsistency supervision signals, since $y^{in}$ and $y^{en}$ have the following characteristics: $y^{in}_{i}=1$ for $x_{i}\in\Omega_{I}$ , $y^{en}_{i}=1$ for $x_{i}\in\Omega_{E}$ , and others are $0$ . In this way, pixels in uncertain region will have different labels during training, i.e., $y^{in}_{i}=0$ while $y^{en}_{i}=1$ for $x_{i}\in\Omega_{\Delta}$ .

To mitigate the influence and leverage this adversarial supervision signal to learn invariant feature during model training, this work focuses on assign more reliable labels for pixels in uncertain region by utilizing feature representation of certain pixels. For uncertain pixels, we want to utilize the potential similarity between pixels in the same category, i.e., $x_{i}\in\Omega_{I}\cup\Omega_{O}$ and $x_{j}\in\Omega_{\Delta}$ to mine informative information. With these definitions, the loss function of BPAnno-supervised segmentation can be formulated as:

\mathcal{L}=\mathcal{L}_{c}+\displaystyle{\mathcal{L}_{uc}(x,y^{in},y^{en})}.

(6)

Here, $\mathcal{L}_{uc}(x,y^{in},y^{en})$ denotes the loss function for uncertain pixels.

3.3 Classification-Guided Confidence Generator

The key point for accurate BPAnno-supervised segmentation is reliable labels assigning for pixels in uncertain region. Different from existing methods that focus on iteratively assigning pseudo label for uncertain pixels, we propose to utilize the intra-class similarity and inter-class discriminative from both the feature and category perspective.

An intuitive idea to approximate the confidence for uncertain pixels $x_{i}$ is the predictive entropy that is calculated according to the following equation:

\mathcal{E}=-\displaystyle{\sum\limits_{k}P(y_{{i}_{k}}|x,\Theta_{s})\log(P(y_{{i}_{k}}|x,\Theta_{s})+\epsilon)}

(7)

where $\Theta_{s}=\{\Theta_{e},\Theta_{d},\Theta_{h}\}$ are the parameters of standard segmentation network, $\epsilon$ is a constant to avoid overflow. Similar as previous works, prediction with large entropy is considered as the solid uncertain pixels, which will be dropped during the subsequent learning. For clarity, we define the solid uncertain pixels in uncertain region with category of $-1$ :

\mathcal{U}^{e}_{i}=\left\{\begin{aligned} &-1,&\mathcal{E}_{i}\geq\mu\\ &0,&\text{otherwise}\\ \end{aligned}\right.

(8)

where $\mu$ is a predefined threshold to mask the uncertain labels, and $\mathcal{U}^{e}\in\mathbb{R}^{C\times H\times W}$ is the estimated uncertainty map with the same size as input image.

To assign more reliable labels for uncertain pixels, we propose to explicitly leverage the potential similarity between certain and uncertain pixels by employing a tailored classification task, which aims at removing as much uncertainty as possible. Let $f_{\mathcal{S}}(x)$ denote the feature representation generated through the encoder and decoder network, $\mathcal{S}_{c}$ and $\Theta_{c}$ denote the classification head and its corresponding parameters respectively. Previous work [10] has shown that similar pixels in the feature space preferable to generate consistent category prediction. Based on this, the constructed classification head is used to model a multi-class classification task with the objective function of:

\mathcal{L}_{ce}=-\frac{1}{N}\displaystyle{\sum_{i=1}^{N}}y_{i}^{c}\log P\left(y_{i}^{c}|x,\Theta_{e},\Theta_{d},\Theta_{c}\right),

(9)

where $y^{c}$ is the classification labels that $y_{i}^{c}=0$ for $x_{i}\in\Omega_{O}$ , $y_{i}^{c}=1$ for $x_{i}\in\Omega_{\Delta}$ , and $y_{i}^{c}=2$ for $x_{i}\in\Omega_{I}$ , and $N=H\times W\times D$ .

During model training, we assume that “certain” pixels in uncertain region would prefer to generate prediction of $y_{i}^{c}=0$ for background and $y_{i}^{c}=2$ for foreground. In this way, the uncertain map generated by the confidence map can be formulated as:

\mathcal{U}^{c}=\operatorname*{argmax}(P(y=0|f_{\mathcal{S}}(x),\Theta_{c}),P(y=2|f_{\mathcal{S}}(x),\Theta_{c}))\odot\mathcal{M}_{u},

(10)

where $\odot$ refers to the element-wise multiplication, and $\mathcal{M}_{u}$ is a mask with $x_{i}=1$ for $x_{i}\in\Omega_{\Delta}$ , and $x_{i}=0$ otherwise. The final confidence for the uncertain pixels can be formulated as:

\mathcal{U}=\min(\mathcal{U}^{c}+2\mathcal{U}^{e},\mathbf{-1})\odot\mathcal{M}_{u},

(11)

Here, $\mathcal{U}$ means that both pixels with large predictive entropy, i.e., $\mathcal{E}_{i}\geq\mu$ and with uncertain classification prediction, i.e., pixels with prediction of $1$ for the multi-class classification task, will be considered as solid uncertain and be assigned with label of $-1$ .

3.4 Confidence-Auxiliary Consistency Learner

Confidence-auxiliary consistency learner aims at generating “certain” information from uncertain region to facilitate stable training. An intuitive idea is utilizing contrastive learning to reduce the distance between pixels within same category while enlarging the distance between pixels in different categories. This strategy allows us to conduct the pixel-wise contrastive learning. However, the crucial question is the selection of positive and negative samples, especially for pixels in uncertain region. To reduce the influence of uncertain information, we propose to utilize the generated confidence for the uncertain pixels and only the solid certain pixels will be considered during the pixel-wise contrastive learning. In this way, the determined pseudo labels can be obtained as follows:

\hat{y}=y\odot(\mathbf{1}-\mathcal{M}_{u})+\mathcal{U}.

(12)

To provide more reliable supervision signal by using the pixel-wise contrastive learning, we follow two guidelines during sample selection: 1) only feature embedding of pixels in the certain region are stored in this study and further be sampled during the computation of contrastive loss; 2) the anchor sampling in this study focuses on hard samples with error prediction for $x_{i}\in\Omega_{I}\cup\Omega_{O}$ , and samples with higher certainty for $x_{i}\in\Omega_{\Delta}$ . The pixel-wise contrastive loss in this work can be defined as:

	$\displaystyle\mathcal{L}_{PCL}$	$\displaystyle=$	$\displaystyle-\frac{1}{\mathcal{P}}\sum_{i\in\mathcal{P}}\frac{1}{\|P\backslash\{i\}\|}$
		$\displaystyle\times$	$\displaystyle\displaystyle{\sum_{p\in P}}\log\frac{\exp(f_{i}\cdot f_{p}/\tau)}{\exp(f_{i}\cdot f_{p}/\tau)+\frac{1}{\|N\|}\displaystyle{\sum_{n\in N}}\exp(f_{i}\cdot f_{n}/\tau)},$

where $\mathcal{P}$ contains the indexes of all “certain” pixels in the uncertain region; $P$ and $N$ contains the indexes of positive pixels, i.e., pixels has same class with pixel $i$ , and negative pixels, i.e., pixels with different labels to pixel $i$ , in the certain region, respectively; $\tau$ is a temperature hyper-parameter.

Considering the semantic representation for deep layers and the effective information for uncertain pixels, feature representation before the segmentation head is embedded into a specific feature space and is employed as the prototype vector in contrastive learning. That is to say, $f_{i}$ denotes the feature embedding of pixels $x_{i}$ that is calculated according to the following equation:

f_{i}=f_{\mathcal{S}}(x_{i},\Theta_{e},\Theta_{d}).

(14)

3.5 Training of EAUWSeg

To summarize, the overall objective function includes two parts: 1) losses for “certain” pixels using fully-supervised segmentation setting, 2) confidence-guided contrastive loss for uncertain pixels. At the early stage of training, segmentation model need to learn the feature representation of lesions with the guidance of supervised loss for “certain” pixels, i.e., $\mathcal{L}_{c}$ . When the segmentation performance gradually improves, the contrastive loss $\mathcal{L}_{PCL}$ combined with a multi-class classification cross-entropy loss $\mathcal{L}_{ce}$ are added to apply constraints on uncertain pixels in same class to preserve consistency feature representation. Therefore, the overall objective function in this work is formulated as:

\mathcal{L}=\mathcal{L}_{c}+\lambda_{1}\mathcal{L}_{PCL}+\lambda_{2}\mathcal{L}_{ce},

(15)

where $\lambda_{1}$ and $\lambda_{2}$ are the parameters to control the contribution of confidence-auxiliary consistency learner and classification-guided confidence generator, respectively.

4 Experiments

Table 1: Performance comparison with State-of-the-Arts on ISIC2017 and Kvasir-SEG datasets. Bold and underline denote the best and second best results, respectively.

Methods	Data	ISIC2017				Kvasir-SEG
Methods	Data	Dice	Jaccard	Accuracy	Sensitivity	Dice	Jaccard	Accuracy	Sensitivity
Fully-supervised Methods
UNet [25]	mask	86.11 $\pm$ .13	77.80 $\pm$ .16	93.83 $\pm$ .03	84.61 $\pm$ .49	89.21 $\pm$ .24	83.50 $\pm$ .09	96.95 $\pm$ .05	91.40 $\pm$ .39
UNet++ [30]	mask	85.75 $\pm$ .10	77.60 $\pm$ .18	96.60 $\pm$ .25	84.22 $\pm$ .21	89.23 $\pm$ .14	83.43 $\pm$ .06	96.78 $\pm$ .08	92.09 $\pm$ .66
DeepLabV3+[26]	mask	86.15 $\pm$ .10	78.06 $\pm$ .14	93.97 $\pm$ .03	83.79 $\pm$ .29	89.04 $\pm$ .27	82.85 $\pm$ .22	96.87 $\pm$ .11	91.73 $\pm$ .42
TransUNet [27]	mask	86.25 $\pm$ .13	78.21 $\pm$ .13	93.88 $\pm$ .14	85.57 $\pm$ .83	89.64 $\pm$ .13	83.86 $\pm$ .16	96.76 $\pm$ .10	91.74 $\pm$ .54
TransFuseS [31]	mask	86.09 $\pm$ .27	78.01 $\pm$ .45	93.84 $\pm$ .16	85.08 $\pm$ .92	88.15 $\pm$ .21	82.02 $\pm$ .18	96.48 $\pm$ .07	89.94 $\pm$ .64
HiFormer [32]	mask	86.16 $\pm$ .22	78.02 $\pm$ .30	93.84 $\pm$ .11	85.05 $\pm$ .76	89.18 $\pm$ .22	83.41 $\pm$ .19	96.86 $\pm$ .04	90.40 $\pm$ .63
Weakly-supervised Methods
PCE [33]	scribbles	80.94 $\pm$ .08	71.19 $\pm$ .10	91.64 $\pm$ .02	80.85 $\pm$ .71	77.21 $\pm$ .46	66.46 $\pm$ .41	93.83 $\pm$ .10	80.92 $\pm$ 1.55
TV [34]	scribbles	81.14 $\pm$ .33	71.50 $\pm$ .43	91.83 $\pm$ .13	81.13 $\pm$ .55	77.01 $\pm$ .21	66.24 $\pm$ .19	93.75 $\pm$ .03	80.32 $\pm$ .46
GatedCRF [35]	scribbles	81.02 $\pm$ .53	71.25 $\pm$ .61	91.64 $\pm$ .22	78.30 $\pm$ .80	78.63 $\pm$ .26	68.43 $\pm$ .29	94.12 $\pm$ .10	77.43 $\pm$ .44
Mumford-Shah [36]	scribbles	76.50 $\pm$ .78	65.00 $\pm$ .97	90.36 $\pm$ .30	72.02 $\pm$ 2.82	69.61±1.49	57.25±1.76	92.13 $\pm$ .31	68.12±3.91
USTM [37]	scribbles	80.92 $\pm$ .10	71.24 $\pm$ .12	91.60 $\pm$ .13	79.40 $\pm$ 1.33	76.65 $\pm$ .18	65.95 $\pm$ .21	93.72 $\pm$ .02	79.30 $\pm$ .60
ScribbleVC [8]	scribbles	81.07 $\pm$ .50	71.40 $\pm$ .41	91.85 $\pm$ .04	76.38 $\pm$ 1.00	77.29 $\pm$ .39	66.95 $\pm$ .37	93.83 $\pm$ .18	76.21 $\pm$ 1.04
DMSPS [1]	scribbles	81.50 $\pm$ .19	71.86 $\pm$ .09	91.90 $\pm$ .10	80.68 $\pm$ .47	78.21 $\pm$ .53	68.04 $\pm$ .59	94.02 $\pm$ .02	80.78±2.10
TriMix [38]	scribbles	82.03 $\pm$ .11	72.65 $\pm$ .12	91.76 $\pm$ .12	80.39 $\pm$ .78	84.23 $\pm$ .11	75.83 $\pm$ .26	95.44 $\pm$ .08	83.46 $\pm$ .42
\hdashlineUNet	box	82.17 $\pm$ .10	71.34 $\pm$ .18	91.62 $\pm$ .05	90.71 $\pm$ .49	76.68 $\pm$ .06	64.42 $\pm$ .10	91.82 $\pm$ .37	93.85 $\pm$ .78
TransUNet	box	82.71 $\pm$ .27	72.17 $\pm$ .35	91.98 $\pm$ .16	91.04 $\pm$ .64	78.61 $\pm$ .23	66.69 $\pm$ .12	92.82 $\pm$ .40	92.66 $\pm$ 2.11
\hdashlineUNet	rectangle	85.38 $\pm$ .03	76.52 $\pm$ .11	93.38 $\pm$ .10	89.35 $\pm$ 1.34	82.51 $\pm$ .24	72.73 $\pm$ .19	94.81 $\pm$ .10	92.76 $\pm$ .16
TransUNet	rectangle	85.44 $\pm$ .21	76.69 $\pm$ .19	93.33 $\pm$ .02	89.79 $\pm$ 2.04	83.40 $\pm$ .20	74.10 $\pm$ .08	94.57 $\pm$ .12	92.10 $\pm$ .66
\hdashlineOurs(UNet)	BPAnno	86.18 $\pm$ .05	78.12 $\pm$ .04	93.83 $\pm$ .04	84.45 $\pm$ .45	89.30 $\pm$ .11	83.04 $\pm$ .15	96.88 $\pm$ .03	91.98 $\pm$ .32
Ours(TransUNet)	BPAnno	86.60 $\pm$ .17	78.61 $\pm$ .24	93.95 $\pm$ .12	87.55 $\pm$ 1.38	89.88 $\pm$ .19	83.85 $\pm$ .27	96.91 $\pm$ .08	92.15 $\pm$ .73

4.1 Experimental Setup

4.1.1 Datasets

To evaluate the effectiveness of the proposed method, we conduct the comparative experiments on two widely used medical image segmentation datasets, i.e., ISIC2017[11] and Kvasir-SEG [12] datasets. ISIC2017 is a skin lesion segmentation dataset, on which rich results have been reported in literature for comparisons. It contains 2000, 150, and 600 dermoscopic images in train, valid, and test sets respectively. We follow the official split of train and test set during the experiment. Kvasir-SEG contains 1000 gastrointestinal polyp images and the corresponding groundtruth. we randomly split the dataset into two subsets with 800 and 200 images, respectively. Furthermore, to evaluate the generalization ability of the constructed model, we conduct the cross-training evaluation and apply the model trained on ISIC2017 to test on ISIC2018 dataset for skin lesion segmentation without fine-tuning. ISIC2018 Dataset [39] is a expansion of ISIC2017, and it contains 2594, 100 and 1000 images in train, valid, and test sets respectively. It should be noticed that there is no intersection between the test set of ISIC2018 and the train set of ISIC2017.

4.1.2 Annotation Generation

For the bounded polygon annotations, we initially generate approximate bounded polygon through dilation-erosion operations by leveraging available groundtruth masks. Subsequently, a manual refinement process is employed to enhance the accuracy of bounded polygon annotation. In the automatic generation phase, we create coarse envelop-like and inscribed-like polygons by employing dilation and erosion operations on the segmentation masks. Specifically, the dilation operation enlarges the masks, while the erosion operation shrinks them. These modified masks serve as a basis for generating polygons. Douglas-Peucker algorithm [40] is then applied to derive approximate contours from the dense masks to make the bounded polygon with the limited number of vertices.

To compare with existing weakly-supervised methods, we also generate the scribble, box and rectangle annotation on these two datasets. Following [41], we draw random lines by connecting two end points sampled from $\{(u,v)|y_{uv}=1\}$ to simulate the scribbles. Here, $y\in\{0,1\}^{H\times W}$ is the given groundtruth binary mask. To obtain the box annotation, we use the object detection method. Similarly, we obtain the rectangle annotation for an image that can be filled to create a rectangular mask by identifying the smallest rectangular area that covers the foreground pixels in the groundtruth mask.

4.1.3 Implementation Details

All experiments are conducted using PyTorch and NVIDIA GeForce RTX 3090 GPUs. During training, images are resized to $256\times 256$ for all backbone networks except for TransFuse and HiFormer, which are set as $192\times 256$ and $224\times 224$ respectively. For optimizing, we employ the Adam and AdamW optimizer with an initial learning rate of $1e-4$ for CNN-based and Transformer-based backbone networks, respectively. Typically, we set the maximal number of epochs at 100 for ISIC2017 and 300 for Kvasir-SEG, the batch size at 16, and the hyper-parameters are: $\lambda_{1}=0.3$ , $\lambda_{2}=0.5$ , $\tau=0.1$ , and $\epsilon=1e-6$ .

4.2 Comparison With State-of-the-Arts

To demonstrate the comprehensive segmentation performance of our method, we compare EAUWSeg with different state-of-the-art approaches:

•

Scribble-supervised methods: 1) different learning strategies on UNet, including partially Cross-Entropy loss [33], Total Variation loss [34], Gated Conditional Random Field loss [35], Mumford-Shah Loss [36], as well as Uncertainty-aware Self-ensembling and Transformation-consistent Mean Teacher techniques (USTM) [37]. 2) different scribble-supervised frameworks, including ScribbleVC [8], DMSPS [1], and TriMix [38].
•

Box-supervised methods. For fair comparison, we also present the results of classical segmentation networks, i.e., UNet and TransUNet, supervised with bounding box.
•

Fully-supervised segmentation methods: 1) CNN-based methods, including UNet [25], UNet++ [30], and DeepLabV3+ [26]. 2) Transformer-based methods, including TransUNet [27], TransFuseS [31], HiFormer [32]. Implementation of these networks follow the corresponding github repositories. During training, ResNet50 [42] is employed as the encoder for UNet and DeepLabV3+, ResNet34 is utilized in the UNet++ and TranFuseS, the default “R50-ViT-B_16” and “Hiformer-S” configurations are employed for TransUNet and HiFormer.

Table 1 presents the quantitative evaluation results of the aforementioned methods. For fair comparison with scribble-supervised methods with different learning strategies, we present the results of EAUWSeg with UNet as backbones. To demonstrate the effectiveness of EAUWSeg, we also give the results with TransUNet as backbone. The results illustrate that our method outperforms other weakly-supervised methods on both ISIC2017 and Kvasir-Seg datasets, including the scribble-supervised as well as the box-supervised methods. When compared with the fully-supervised methods, our proposed EAUWSeg can also deliver superior performance, yielding an average Dice score of 86.60% and 89.88% on ISIC2017 and Kvasir-Seg, respectively. This underscores the superiority and effectiveness of the proposed BPAnno-supervised strategy and its corresponding learning framework EAUWSeg. Fig. 3 shows some qualitative evaluation results, it can be seen that our proposed method achieves better segmentation performance.

4.3 Ablation Study

Table 2: Comparison of Dice and Sensitivity for six weak annotation methods.

Data	ISIC2017		Kvasir-SEG
Data	Dice	Jaccard	Dice	Jaccard
Single Annotation
polygon	85.54 $\pm$ .20	76.86 $\pm$ .09	85.29 $\pm$ .29	76.22 $\pm$ .51
rectangle	85.44 $\pm$ .21	76.69 $\pm$ .19	83.40 $\pm$ .20	74.10 $\pm$ .08
ellipse	84.61 $\pm$ .11	75.35 $\pm$ .24	83.61 $\pm$ .27	73.80 $\pm$ .61
Bounded Annotation
bounded polygon	85.88 $\pm$ .18	77.81 $\pm$ .22	88.95 $\pm$ .36	82.61 $\pm$ .40
bounded rectangle	85.60 $\pm$ .22	77.13 $\pm$ .28	88.71 $\pm$ .10	82.22 $\pm$ .27
bounded ellipse	84.63 $\pm$ .17	75.56 $\pm$ .14	87.78 $\pm$ .19	80.76 $\pm$ .15

4.3.1 Effectiveness of the Bounded Annotations

To analysis the effectiveness of the proposed bounded annotation strategy, we conduct quantitative evaluation of training the TransUNet directly using different bounded annotation methods, including polygon, rectangle, and ellipse. Considering, box is similar with rectangle, only rectangle is compared since it can achieves better performance. Table 2 lists the quantitative comparison based on the Dice score and Sensitivity. The former presents the overall segmentation performance while the latter can reflect the recall of the foreground pixels. It can be seen that all these three annotations offer a promising way to initialize the lesion region (with Dice score larger than 80%), while polygon shows the best performance. Substituting the single annotation with bounded ones leads to consistent performance improvement for all these annotation methods on two datasets, demonstrating the effectiveness of our proposed bounded-based weak annotation strategy.

4.3.2 Comparative Analysis of Different Components

To demonstrate the effectiveness of the proposed component, i.e., confidence-auxiliary consistency learner (CCL) and classification-guided confidence generator (CCG), we carried out the ablation experiments and the results are shown in Table 3. Baseline present the performance of TransUNet trained with bounded polygon annotations. It can be seen that with the gradual introduction of CCL and CCG the performance consistently improves on both ISIC2017 and Kvasir-SEG.

Table 3: Ablation Study on ISIC2017 and Kvasir-SEG datasets with TranUNet as the backbone.

Baseline	CCL	CCG	Evaluation Metrics
Baseline	CCL	CCG	Dice	Jaccard	Accuracy	Sensitivity
ISIC2017
✓			85.88	77.81	93.65	85.59
✓	✓		86.38	78.12	93.81	86.38
✓	✓	✓	86.60	78.61	93.95	87.55
Kvasir-SEG
✓			88.95	82.61	96.49	91.78
✓	✓		89.35	83.06	96.90	92.03
✓	✓	✓	89.88	83.85	96.91	92.15

4.3.3 Comparison With Semi-supervised Methods

Table 4.3.3 presents a comparative analysis of our method with five existing semi-supervised segmentation methods on ISIC2017. For these semi-supervised methods, we referred to the results reported in [43]. These semi-supervised methods are trained with varying percentages of labeled data ( $5\%/10\%/20\%$ ), and assisted with the rest of unlabeled data ( $95\%/90\%/80\%$ ). While the proposed is trained with only $5\%/10\%/20\%$ samples annotated by bounded polygon, without using the rest of unlabeled data . Although only supervised with $5\%/10\%/20\%$ samples annotated by bounded polygon, our method outperforms most of the specifically designed SSL methods (except for CASSL) that trained with dense mask and also the rest of unlabeled data, showcasing its robust feature learning capabilities. When compared with CASSL, in which the adversarial training mechanism and the collaborative consistency learning strategy are carefully designed to utilize the unlabeled data, our method has a small performance gap while no need for dense mask and also the unlabeled data. This is important to many medical image segmentation tasks since additional unlabeled data may be unavailable in clinical practice.

Methods	Data	5%	10%	20%
UNet	mask+ unlabeled	70.92	71.74	75.27
CLCC [44]		61.23	65.40	68.93
MT [45]		73.12	74.34	76.98
ST++ [46]		73.26	75.51	76.69
S4-PLCL [47]		68.19	71.08	71.83
CASSL [43]		76.55	77.49	79.31
Ours(TransUNet)	only BPAnno	75.81	76.86	77.54

Methods	Data	Dice	Jaccard	Accuracy	Sensitivity
UNet	mask	86.78	78.27	92.69	93.85
UNet	BPAnno	86.68	78.06	92.92	94.64
\hdashlineUNet++	mask	87.11	78.95	92.63	94.64
\hdashlineUNet++	BPAnno	86.67	77.88	92.7	95.12
\hdashlineDeepLabV3+	mask	87.02	78.73	92.87	94.64
\hdashlineDeepLabV3+	BPAnno	86.63	77.98	92.73	94.75
\hdashlineTransUNet	mask	86.34	77.39	92.38	95.99
\hdashlineTransUNet	BPAnno	86.43	77.49	92.55	95.93
\hdashlineTransFuseS	mask	87.67	79.58	93.22	95.04
\hdashlineTransFuseS	BPAnno	87.57	78.81	94.92	93.39
\hdashlineHiFormer	mask	87.27	79.16	93.10	95.10
\hdashlineHiFormer	BPAnno	87.44	79.12	92.87	94.94