
EAUWSeg: Eliminating annotation uncertainty in weakly-supervised medical image segmentation

Lituan Wang, Lei Zhang, Senior Member, IEEE, Yan Wang, Zhenbin Wang, Zhenwei Zhang, and Zhang Yi, Fellow, IEEE. This work was supported by the National Natural Science Foundation of China under Grant 62025601, and Grant 62376174. Lituan Wang, Lei Zhang, Zhenbin Wang, Zhenwei Zhang, and Zhang Yi are with the Machine Intelligence Laboratory, College of Computer Science, Sichuan University, Chengdu 610065, China. E-mail: [email protected]; [email protected]; [email protected]; [email protected]; [email protected]. Yan Wang is with the Institute of High Performance Computing, A*STAR, Singapore 138632. E-mail: [email protected]. (Corresponding author: Lei Zhang).
Abstract

Weakly-supervised medical image segmentation is gaining traction because it requires only rough annotations rather than accurate pixel-to-pixel labels, thereby reducing the workload of specialists. Although some progress has been made, a considerable performance gap remains between label-efficient methods and fully-supervised ones, which can be attributed to the uncertain nature of these weak labels. To address this issue, we propose a novel weak annotation method together with its learning framework, EAUWSeg, to eliminate annotation uncertainty. Specifically, we first propose the Bounded Polygon Annotation (BPAnno), which simply requires labeling two polygons for a lesion. Then, a tailored learning mechanism that explicitly treats the bounded polygons as two separate annotations is proposed to learn invariant features by providing adversarial supervision signals for model training. Subsequently, a confidence-auxiliary consistency learner, incorporating a classification-guided confidence generator, is designed to provide reliable supervision signals for pixels in the uncertain region by leveraging the consistency of feature representations across pixels within the same category as well as the class-specific information encapsulated in the bounded polygon annotations. Experimental results demonstrate that EAUWSeg outperforms existing weakly-supervised segmentation methods. Furthermore, compared to fully-supervised counterparts, the proposed method not only delivers superior performance but also requires much less annotation workload. This underscores the superiority and effectiveness of our approach.

Impact Statement

Benefiting from their ability to reduce annotation workload, label-efficient methods have been gaining traction in weakly-supervised medical image segmentation. We revisit existing label-efficient medical image segmentation methods and observe that these weak labels introduce considerable uncertainty when constructing segmentation models, which leads to a considerable performance gap between label-efficient methods and fully-supervised ones. To address this problem, a novel weak annotation method, BPAnno, which simply labels two polygons for a lesion, and its coupled learning framework, EAUWSeg, are proposed to eliminate the annotation uncertainty. Extensive experiments demonstrate that our EAUWSeg achieves superior performance with less than 20% of the annotation workload when compared to fully-supervised counterparts. This reveals that the proposed method can be a cost-effective solution for improving performance in weakly-supervised medical image segmentation.

Index Terms

Weakly-supervised segmentation, consistency-based contrastive learning, medical image segmentation

1 Introduction


Medical image segmentation plays a crucial role in biomedical image analysis [1], such as diagnosis, treatment, and radiotherapy planning. As manual segmentation is usually labor-intensive, time-consuming, and reliant on professional domain knowledge [2], automatic medical image segmentation has been widely studied and a series of methods have been proposed. However, the success of existing methods relies mainly on large-scale, meticulously annotated data, which requires significant domain expertise as well as expensive annotation costs.

To alleviate the burdens associated with image annotation, weakly-supervised medical image segmentation is gaining traction as it requires only weak or sparse annotations [3], such as image-level labels [4], scribbles [5], bounding boxes [6], and point annotations [7]. Although some progress has been made by using label-efficient annotations for training, there is still a considerable performance gap between label-efficient methods and fully-supervised ones [8]. We revisit existing label-efficient medical image segmentation methods and observe that these weak labels introduce considerable uncertainty when constructing segmentation models. Fig. 1 provides a visual representation of the supervision signals introduced by different label-efficient annotations, in which most of the information (the gray region) is uncertain. The uncertain supervision signals provided by label-efficient annotations may induce oscillations in model training and thus prevent the trained model from approaching the performance achieved in a fully-supervised manner [9]. Consequently, there is an urgent need to explore label-efficient annotation methods that reduce annotation uncertainty, and to develop methods that help eliminate label uncertainty during model training.

Figure 1: Comparison of the typical weak annotation methods and our proposed bounded polygon annotations, including the annotation strategies and the annotation uncertainty. The yellow curves show groundtruth segmentation. The black and gray denote the certain and uncertain regions, respectively.

In this work, we propose a novel weak annotation method coupled with its learning framework to eliminate annotation uncertainty and facilitate stable training in weakly-supervised medical image segmentation with more reliable supervision signals. To this end, we introduce the bounded polygon annotation, which simply requires labeling two polygons that resemble the inscribed and outer envelope-like delineations of a lesion (as shown in Fig. 1). The proposed bounded polygon annotation has three advantages: (1) it reduces the labeling burden compared with pixel-to-pixel accurate labels, (2) it restricts the uncertain information to the gray region between the two polygons, and (3) it explicitly provides a prior emphasis on lesion boundaries during model training. Tailored for the proposed weak annotation, we propose EAUWSeg to further eliminate the uncertainty contained in the bounded polygon annotation by explicitly treating the bounded polygons as two separate annotations. For the envelope-like annotation, pixels within the red contour belong to the foreground class and all other pixels belong to the background class. For the inscribed-like annotation, pixels within the purple contour belong to the foreground class and all other pixels belong to the background class. In this way, the uncertain region provides adversarial supervision signals for model training to learn invariant features. Then, leveraging the existing observation that pixels that are similar in the feature space tend to generate consistent category predictions [10], we design a Classification-guided Confidence Generator (CCG) to measure the feature similarity between certain and uncertain pixels from a probabilistic perspective. Moreover, we adopt a Confidence-auxiliary Consistency Learner (CCL) that ensures the accuracy of certain pixels and attracts uncertain pixels of the same category to preserve consistent feature representations. Through the collaboration of CCG and CCL, more reliable supervision signals in the uncertain region can be provided during model training to facilitate stable training in weakly-supervised medical image segmentation.

Overall, our contributions can be summarized as follows:

  1. We propose a novel weak annotation method that labels only two bounded polygons, together with its coupled learning framework for medical image segmentation, which further eliminates the annotation uncertainty that exists in most label-efficient methods.

  2. We propose a tailored learning mechanism that explicitly treats the bounded polygons as two separate annotations, which provides adversarial supervision signals for model training to learn invariant features.

  3. We propose a Confidence-auxiliary Consistency Learner (CCL) that incorporates a Classification-guided Confidence Generator (CCG) to provide reliable supervision signals for pixels in the uncertain region by leveraging intra-class similarity and inter-class discrimination from both the feature and category perspectives. Notably, the CCL and CCG modules are discarded during inference and therefore do not increase computational complexity.

  4. To evaluate our method, we provide bounded polygon annotations on two widely used medical image segmentation datasets, i.e., ISIC2017 [11] and Kvasir-SEG [12]. Extensive experiments on these two datasets demonstrate that our EAUWSeg outperforms existing weakly-supervised segmentation methods. Furthermore, the proposed method delivers superior performance with less than 20% of the annotation workload when compared to fully-supervised counterparts. These results reveal that bounded polygon annotations coupled with EAUWSeg offer a cost-effective solution for preserving segmentation performance.

2 Related Works

2.1 Weakly-supervised Medical Image Segmentation

Without the requirement of large densely annotated datasets, weakly-supervised learning has gained significant attention in medical image segmentation [13, 2]. As the most efficient weak annotation method, image-level annotation requires only classification labels and generates class activation maps [14] for training. Although the image-level annotation method is convenient, it has limited performance due to the extremely weak supervision [15]. Box-level annotation is usually defined by two corner coordinates, which provides localization awareness compared to image-level annotation [16]. However, boxes for different objects may overlap with each other, making it difficult to accurately approximate the target boundary, especially for complex shapes [10]. Point annotations provide a small number of pixels for different classes and can better handle complex shapes, which may make them preferable for medical segmentation compared to box-level annotations. Despite their efficiency, segmentation models trained with point annotations tend to overfit the small number of annotated pixels relative to the large number of unannotated pixels.

Scribble-based annotations provide labels for a sparse set of pixels of each class for training, and are usually more obtainable in medical image segmentation considering their annotation efficiency, effectiveness, and friendliness to the annotation of nested structures [17]. Only the scribbles of the background and each object are given, while the groundtruth of the other pixels remains unknown, which is harmful to segmentation performance. An intuitive solution is to expand the scribble annotations by exploiting prior assumptions [18] or the foreground features learned by deep neural networks [19]. However, due to the lack of supervisory signals, the constructed models usually fail to capture the object structure and become confused at object boundaries. To address this issue, a series of studies have concentrated on learning adversarial shape priors at the expense of requiring additional fully-annotated masks [17]. However, acquiring such fully annotated datasets may be challenging in many clinical settings, rendering these existing methods costly and hard to scale. Our work aims to explore a new weak annotation method that can improve the performance of automated medical image segmentation without auxiliary datasets.

2.2 Contrastive Learning

Contrastive learning argues that similar samples should have similar representations, while the representations of dissimilar samples should differ [20]. Based on this, a contrastive loss is usually designed to enforce representations to be similar for similar pairs and dissimilar for dissimilar pairs [21]. Considering its powerful self-supervised feature extraction ability from unlabeled data, contrastive learning has been widely used in many image-level tasks. Among these methods, the key is the design of the selection mechanism for contrastive sample pairs, i.e., positive and negative pairs.

Recently, contrastive learning has been extended from image-level tasks to pixel-level ones to mine informative cues from unlabeled data [22, 23]. As mentioned earlier, constructing contrastive sample pairs is crucial for discriminative feature learning. In the context of pixel-level tasks, sample pairs are usually constructed through pseudo labels or spatial structure, which may introduce noisy samples. To alleviate this problem, prediction uncertainty has been injected into the sampling process to reduce the number of noisy samples [24].

Figure 2: The overall framework of our proposed EAUWSeg. It includes a segmentation model supervised by two bounded polygons and a multi-class classification label. Additionally, a confidence-auxiliary consistency learner is integrated to focus on compact feature learning in the uncertain region. During training, the extracted feature f_{\mathcal{S}}(\cdot) is fed into the segmentation head to generate feature representations for lesions. Simultaneously, the classification head supervised by y^{c} is used to generate the confidence of uncertain pixels, and the embedding head, guided by the bounded polygon annotations and the confidence of uncertain pixels, is utilized to construct a compact feature space.

3 Method

In this work, we propose a novel bounded polygon annotation method, i.e., BPAnno, and its corresponding segmentation framework, i.e., EAUWSeg, to eliminate annotation uncertainty in weakly-supervised medical image segmentation. Our EAUWSeg is generally applicable to many existing encoder-decoder medical image segmentation models, such as UNet [25], DeepLabV3+ [26], and TransUNet [27]. The overall framework is illustrated in Fig. 2.

3.1 Problem Setting and Bounded Polygons Annotation

In the scenario of classical weakly-supervised segmentation, the input pixels x are usually divided into labeled pixels x_{l} and unlabeled pixels x_{ul}. In this way, the corresponding labels y_{l} for the labeled pixels x_{l} are directly used for supervision by employing the partial cross-entropy loss, which can be formulated as follows:

\mathcal{L}_{l}(p,y) = -\sum_{y\in y_{l}} y\log(p),   (1)

where p is the segmentation prediction. For the unlabeled pixels, there is no off-the-shelf label for supervision, and much work focuses on assigning pseudo labels to unlabeled pixels for supervision [28, 29]. The overall objective function can be formulated as follows:

\mathcal{L} = \mathcal{L}_{l} + \mathcal{L}_{ul}.   (2)

However, assigning pseudo labels to unlabeled pixels not only requires a time-consuming multi-stage training process, but also introduces misleading information or biases [10].
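For concreteness, the following is a minimal PyTorch sketch of the partial cross-entropy term in Eq. (1), assuming a binary mask that marks the labeled pixels; the function name and tensor layout are illustrative, not taken from the original implementation.

```python
import torch
import torch.nn.functional as F

def partial_cross_entropy(logits, labels, labeled_mask):
    """Cross-entropy evaluated only on the labeled pixels x_l (Eq. 1).

    logits:       (B, C, H, W) raw segmentation outputs
    labels:       (B, H, W)    class indices; unlabeled pixels may hold any valid
                               class index (e.g. 0), they are masked out below
    labeled_mask: (B, H, W)    1.0 for labeled pixels, 0.0 for unlabeled pixels
    """
    pixel_loss = F.cross_entropy(logits, labels, reduction="none")  # (B, H, W)
    pixel_loss = pixel_loss * labeled_mask                          # drop unlabeled pixels
    return pixel_loss.sum() / labeled_mask.sum().clamp(min=1.0)     # average over labeled pixels
```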

To address this problem, this work introduces the bounded polygon annotation method, which simply requires labeling two polygons that resemble the inscribed and outer envelope-like delineations of a lesion (as shown in Fig. 1). To further eliminate the uncertainty contained in the bounded polygon annotation, we explicitly treat the bounded polygons as two separate annotations, i.e., the inscribed-like annotation y^{in} and the envelope-like annotation y^{en}. Different from classical weak annotation methods, the input pixels x are divided into certain labeled pixels and uncertain pixels in our bounded polygon annotation. This work aims at providing more reliable supervision signals for pixels in the uncertain region during model training.

For convenience, we define \Omega_{I}, \Omega_{\Delta}, and \Omega_{O} as the spatial domains inside the inscribed-like annotation, between the inscribed-like and envelope-like delineations, and outside the envelope-like annotation, respectively. Here, the certain labeled pixels and uncertain pixels can be denoted as x_{i}\in\Omega_{I}\cup\Omega_{O} and x_{uc}\in\Omega_{\Delta}; the corresponding labels are y_{i}=1 if x_{i}\in\Omega_{I} and y_{i}=0 if x_{i}\in\Omega_{O}, otherwise y_{i} is uncertain. The spatial domains of the input image x and of the envelope-like annotation can be written as \Omega=\Omega_{I}\cup\Omega_{O}\cup\Omega_{\Delta} and \Omega_{E}=\Omega_{I}\cup\Omega_{\Delta}, respectively. In this way, our proposed EAUWSeg learns from "certain/uncertain" pixels instead of the "labeled/unlabeled" pixels of classical weakly-supervised segmentation. The objective function in EAUWSeg can be re-formulated as:

\mathcal{L} = \mathcal{L}_{c} + \mathcal{L}_{uc}.   (3)

The feature learning of certain pixels has been well studied. Hence, this work focuses on eliminating annotation uncertainty and thus providing reliable supervision signals for pixels in the uncertain region.
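As a concrete illustration of this partition, the sketch below derives boolean masks for \Omega_{I}, \Omega_{\Delta}, and \Omega_{O} from the two polygon masks y^{in} and y^{en}; it assumes both are given as rasterized binary masks with y^{in} contained in y^{en}.

```python
import torch

def partition_regions(y_in, y_en):
    """Split the image domain into Omega_I, Omega_Delta and Omega_O.

    y_in, y_en: (H, W) binary masks rasterized from the inscribed-like and
                envelope-like polygon annotations (y_in lies inside y_en).
    """
    omega_i = y_in.bool()             # inside the inscribed polygon: certain foreground
    omega_e = y_en.bool()             # inside the envelope polygon
    omega_delta = omega_e & ~omega_i  # between the two polygons: uncertain region
    omega_o = ~omega_e                # outside the envelope polygon: certain background
    return omega_i, omega_delta, omega_o
```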

3.2 Framework of EAUWSeg

EAUWSeg is tailored for the proposed bounded polygon annotation and mainly focuses on eliminating annotation uncertainty for pixels belonging to \Omega_{\Delta}. As illustrated in Fig. 2, EAUWSeg consists of 1) a mainstream segmentation network supervised by the two bounded polygon segmentation labels, which implicitly define the certain and uncertain regions during network training, 2) a classification-guided confidence generator that provides category-level prediction confidence for pixels x_{i}\in\Omega_{\Delta} by leveraging a tailored multi-class classification task, and 3) a confidence-auxiliary consistency learner that distinguishes reliable pixels in the uncertain region and assigns them the corresponding "certain" labels.

Let \mathcal{S}_{e}, \mathcal{S}_{d}, and \mathcal{S}_{h} denote the encoder, decoder, and segmentation head used in our proposed framework, parameterized by \Theta_{e}, \Theta_{d}, and \Theta_{h}, respectively. In the proposed EAUWSeg, the bounded polygon annotation is treated as two separate masks, i.e., the inscribed-like and envelope-like masks, and the basic segmentation loss function in EAUWSeg can be formulated as:

\mathcal{L}_{c} = \sum\limits_{x}\left(\mathcal{L}_{in}(p,y^{in}) + \mathcal{L}_{en}(p,y^{en})\right),   (4)

where p is the predicted probability map for the input image x. In this work, the following Dice loss is employed for both \mathcal{L}_{in} and \mathcal{L}_{en}:

\mathcal{L}_{dice} = 1 - \frac{2\times\sum_{i=1}^{H\times W\times D}p_{i}y_{i}}{\sum_{i=1}^{H\times W\times D}(p_{i}^{2}+y_{i}^{2})},   (5)

where H\times W\times D denotes the input image size, and p_{i} and y_{i} denote the prediction probability and the label for pixel i, respectively. However, training with \mathcal{L}_{c} introduces inconsistent supervision signals, since y^{in} and y^{en} have the following characteristics: y^{in}_{i}=1 for x_{i}\in\Omega_{I}, y^{en}_{i}=1 for x_{i}\in\Omega_{E}, and all other entries are 0. In this way, pixels in the uncertain region have different labels during training, i.e., y^{in}_{i}=0 while y^{en}_{i}=1 for x_{i}\in\Omega_{\Delta}.
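A minimal sketch of Eqs. (4)-(5) is given below, assuming sigmoid probabilities and binary masks of identical shape; the small epsilon is added for numerical stability and is not part of the paper's formulation.

```python
def dice_loss(p, y, eps=1e-6):
    """Soft Dice loss of Eq. (5); p and y are tensors of identical shape with values in [0, 1]."""
    inter = (p * y).sum()
    denom = (p ** 2).sum() + (y ** 2).sum()
    return 1.0 - 2.0 * inter / (denom + eps)

def certain_loss(p, y_in, y_en):
    """L_c of Eq. (4): the same prediction is supervised by both polygon masks, so
    pixels in Omega_Delta receive conflicting (adversarial) targets."""
    return dice_loss(p, y_in) + dice_loss(p, y_en)
```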

To mitigate this influence and leverage the adversarial supervision signal to learn invariant features during model training, this work focuses on assigning more reliable labels to pixels in the uncertain region by utilizing the feature representations of certain pixels. For uncertain pixels, we exploit the potential similarity between pixels of the same category, i.e., x_{i}\in\Omega_{I}\cup\Omega_{O} and x_{j}\in\Omega_{\Delta}, to mine informative cues. With these definitions, the loss function of BPAnno-supervised segmentation can be formulated as:

\mathcal{L} = \mathcal{L}_{c} + \mathcal{L}_{uc}(x,y^{in},y^{en}).   (6)

Here, \mathcal{L}_{uc}(x,y^{in},y^{en}) denotes the loss function for uncertain pixels.

3.3 Classification-Guided Confidence Generator

The key to accurate BPAnno-supervised segmentation is assigning reliable labels to pixels in the uncertain region. Different from existing methods that focus on iteratively assigning pseudo labels to uncertain pixels, we propose to utilize intra-class similarity and inter-class discrimination from both the feature and category perspectives.

An intuitive way to approximate the confidence of an uncertain pixel x_{i} is the predictive entropy, calculated according to the following equation:

\mathcal{E} = -\sum\limits_{k}P(y_{i_{k}}|x,\Theta_{s})\log(P(y_{i_{k}}|x,\Theta_{s})+\epsilon),   (7)

where \Theta_{s}=\{\Theta_{e},\Theta_{d},\Theta_{h}\} are the parameters of the standard segmentation network and \epsilon is a constant to avoid overflow. Similar to previous works, predictions with large entropy are considered solid uncertain pixels, which are dropped during subsequent learning. For clarity, we define the solid uncertain pixels in the uncertain region with a category of -1:

\mathcal{U}^{e}_{i} = \begin{cases} -1, & \mathcal{E}_{i}\geq\mu \\ 0, & \text{otherwise} \end{cases}   (8)

where \mu is a predefined threshold to mask the uncertain labels, and \mathcal{U}^{e}\in\mathbb{R}^{C\times H\times W} is the estimated uncertainty map with the same size as the input image.
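The following sketch implements Eqs. (7)-(8) on softmax probabilities; the function name and the per-pixel output layout are illustrative assumptions.

```python
import torch

def entropy_uncertainty(prob, mu, eps=1e-6):
    """Flag solid uncertain pixels by predictive entropy (Eqs. 7-8).

    prob: (B, C, H, W) softmax probabilities from the segmentation head
    mu:   predefined entropy threshold
    Returns U^e with -1 for solid uncertain pixels and 0 elsewhere.
    """
    entropy = -(prob * torch.log(prob + eps)).sum(dim=1)  # per-pixel entropy, (B, H, W)
    u_e = torch.zeros_like(entropy)
    u_e[entropy >= mu] = -1.0
    return u_e
```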

To assign more reliable labels to uncertain pixels, we propose to explicitly leverage the potential similarity between certain and uncertain pixels by employing a tailored classification task, which aims at removing as much uncertainty as possible. Let f_{\mathcal{S}}(x) denote the feature representation generated by the encoder and decoder networks, and let \mathcal{S}_{c} and \Theta_{c} denote the classification head and its corresponding parameters, respectively. Previous work [10] has shown that pixels that are similar in the feature space tend to generate consistent category predictions. Based on this, the constructed classification head is used to model a multi-class classification task with the objective function:

\mathcal{L}_{ce} = -\frac{1}{N}\sum_{i=1}^{N} y_{i}^{c}\log P\left(y_{i}^{c}|x,\Theta_{e},\Theta_{d},\Theta_{c}\right),   (9)

where y^{c} is the classification label map with y_{i}^{c}=0 for x_{i}\in\Omega_{O}, y_{i}^{c}=1 for x_{i}\in\Omega_{\Delta}, and y_{i}^{c}=2 for x_{i}\in\Omega_{I}, and N=H\times W\times D.
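The 3-class label map y^{c} used in Eq. (9) can be built directly from the region masks introduced in Sec. 3.1; a minimal sketch, reusing the partition_regions helper assumed earlier, is shown below. With the labels flattened appropriately, the loss itself is a standard cross-entropy, e.g. torch.nn.functional.cross_entropy(cls_logits, y_c).

```python
import torch

def classification_labels(omega_i, omega_delta):
    """Build y^c of Eq. (9): 0 for Omega_O, 1 for Omega_Delta, 2 for Omega_I."""
    y_c = torch.zeros(omega_i.shape, dtype=torch.long)
    y_c[omega_delta] = 1
    y_c[omega_i] = 2
    return y_c
```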

During model training, we assume that "certain" pixels in the uncertain region tend to generate predictions of y_{i}^{c}=0 for background and y_{i}^{c}=2 for foreground. In this way, the uncertainty map generated from the classification confidence can be formulated as:

\mathcal{U}^{c} = \operatorname*{argmax}(P(y=0|f_{\mathcal{S}}(x),\Theta_{c}), P(y=2|f_{\mathcal{S}}(x),\Theta_{c}))\odot\mathcal{M}_{u},   (10)

where \odot refers to element-wise multiplication, and \mathcal{M}_{u} is a mask with x_{i}=1 for x_{i}\in\Omega_{\Delta} and x_{i}=0 otherwise. The final confidence for the uncertain pixels can be formulated as:

\mathcal{U} = \min(\mathcal{U}^{c}+2\mathcal{U}^{e},\mathbf{-1})\odot\mathcal{M}_{u}.   (11)

Here, \mathcal{U} means that pixels with large predictive entropy, i.e., \mathcal{E}_{i}\geq\mu, as well as pixels with uncertain classification predictions, i.e., pixels predicted as class 1 by the multi-class classification task, are both considered solid uncertain and assigned the label -1.
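One plausible reading of Eqs. (10)-(11), matching the verbal description above, is sketched below: inside \Omega_{\Delta}, a pixel receives the foreground or background label preferred by the classification head unless either its predictive entropy is large or the head predicts the "uncertain" class 1, in which case it is marked -1. The exact tensor encoding is an assumption, not the paper's implementation.

```python
import torch

def uncertain_pixel_confidence(cls_prob, u_e, m_u):
    """Confidence map U for pixels in Omega_Delta (Eqs. 10-11, one interpretation).

    cls_prob: (B, 3, H, W) probabilities of the classification head (0: bg, 1: uncertain, 2: fg)
    u_e:      (B, H, W) entropy-based map from entropy_uncertainty(), values in {-1, 0}
    m_u:      (B, H, W) mask of the uncertain region Omega_Delta (1 inside, 0 outside)
    """
    pred = cls_prob.argmax(dim=1)                    # most likely class per pixel
    u_c = torch.zeros_like(u_e)
    u_c[pred == 2] = 1.0                             # confidently foreground
    u_c[pred == 1] = -1.0                            # head itself says "uncertain"
    solid_uncertain = (u_e == -1) | (u_c == -1)
    u = torch.where(solid_uncertain, torch.full_like(u_c, -1.0), u_c)
    return u * m_u                                   # only defined inside Omega_Delta
```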

3.4 Confidence-Auxiliary Consistency Learner

The confidence-auxiliary consistency learner aims at generating "certain" information from the uncertain region to facilitate stable training. An intuitive idea is to utilize contrastive learning to reduce the distance between pixels within the same category while enlarging the distance between pixels of different categories. This strategy allows us to conduct pixel-wise contrastive learning. However, the crucial question is the selection of positive and negative samples, especially for pixels in the uncertain region. To reduce the influence of uncertain information, we utilize the generated confidence of the uncertain pixels, and only solid certain pixels are considered during the pixel-wise contrastive learning. In this way, the determined pseudo labels can be obtained as follows:

\hat{y} = y\odot(\mathbf{1}-\mathcal{M}_{u}) + \mathcal{U}.   (12)
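A one-line sketch of Eq. (12), using the confidence map U from the previous section (tensor shapes as assumed above):

```python
def pseudo_labels(y, u, m_u):
    """Keep the given labels y in the certain region and substitute the
    confidence-derived labels U inside Omega_Delta (Eq. 12); pixels marked -1
    are excluded from the subsequent contrastive learning."""
    return y * (1.0 - m_u) + u
```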

To provide more reliable supervision signals through pixel-wise contrastive learning, we follow two guidelines during sample selection: 1) only the feature embeddings of pixels in the certain region are stored in this study and subsequently sampled during the computation of the contrastive loss; 2) anchor sampling focuses on hard samples with erroneous predictions for x_{i}\in\Omega_{I}\cup\Omega_{O}, and on samples with higher certainty for x_{i}\in\Omega_{\Delta}. The pixel-wise contrastive loss in this work is defined as:

\mathcal{L}_{PCL} = -\frac{1}{|\mathcal{P}|}\sum_{i\in\mathcal{P}}\frac{1}{|P\backslash\{i\}|}\sum_{p\in P}\log\frac{\exp(f_{i}\cdot f_{p}/\tau)}{\exp(f_{i}\cdot f_{p}/\tau)+\frac{1}{|N|}\sum_{n\in N}\exp(f_{i}\cdot f_{n}/\tau)},   (13)

where \mathcal{P} contains the indexes of all "certain" pixels in the uncertain region; P and N contain the indexes of positive pixels, i.e., pixels in the certain region with the same class as pixel i, and negative pixels, i.e., pixels in the certain region with labels different from pixel i, respectively; \tau is a temperature hyper-parameter.

Considering the semantic representation of deep layers and the effective information for uncertain pixels, the feature representation before the segmentation head is embedded into a specific feature space and employed as the prototype vector in contrastive learning. That is to say, f_{i} denotes the feature embedding of pixel x_{i}, calculated according to the following equation:

f_{i} = f_{\mathcal{S}}(x_{i},\Theta_{e},\Theta_{d}).   (14)
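The contrastive term of Eq. (13) can be sketched as follows for a single image whose pixel embeddings f_{i} have been flattened to an (H·W, D) matrix; the sampling of anchors, positives, and negatives according to the two guidelines above is assumed to have been done beforehand, and the L2 normalization of the embeddings is an added assumption.

```python
import torch
import torch.nn.functional as F

def pixel_contrastive_loss(feat, anchor_idx, pos_idx, neg_idx, tau=0.1):
    """Pixel-wise supervised contrastive loss of Eq. (13).

    feat:       (H*W, D) pixel embeddings f_i
    anchor_idx: indices of sampled "certain" pixels in the uncertain region (set P_cal)
    pos_idx:    indices of certain-region pixels with the same label (set P)
    neg_idx:    indices of certain-region pixels with a different label (set N)
    """
    f = F.normalize(feat, dim=1)
    anchors, pos, neg = f[anchor_idx], f[pos_idx], f[neg_idx]
    pos_sim = anchors @ pos.t() / tau                           # (A, |P|)
    neg_sim = anchors @ neg.t() / tau                           # (A, |N|)
    neg_term = torch.exp(neg_sim).mean(dim=1, keepdim=True)     # (1/|N|) * sum_n exp(.)
    log_prob = pos_sim - torch.log(torch.exp(pos_sim) + neg_term)
    return -log_prob.mean()                                     # average over anchors and positives
```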

3.5 Training of EAUWSeg

To summarize, the overall objective function includes two parts: 1) losses for "certain" pixels using the fully-supervised segmentation setting, and 2) a confidence-guided contrastive loss for uncertain pixels. At the early stage of training, the segmentation model needs to learn the feature representation of lesions under the guidance of the supervised loss for "certain" pixels, i.e., \mathcal{L}_{c}. As the segmentation performance gradually improves, the contrastive loss \mathcal{L}_{PCL}, combined with the multi-class classification cross-entropy loss \mathcal{L}_{ce}, is added to constrain uncertain pixels of the same class to preserve consistent feature representations. Therefore, the overall objective function in this work is formulated as:

\mathcal{L} = \mathcal{L}_{c} + \lambda_{1}\mathcal{L}_{PCL} + \lambda_{2}\mathcal{L}_{ce},   (15)

where \lambda_{1} and \lambda_{2} are parameters that control the contributions of the confidence-auxiliary consistency learner and the classification-guided confidence generator, respectively.
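Putting the pieces together, a sketch of the overall objective of Eq. (15), reusing the helper functions assumed in the previous sections and the hyper-parameter values reported in Sec. 4.1.3:

```python
import torch

lambda_1, lambda_2 = 0.3, 0.5  # weights reported in Sec. 4.1.3

def total_loss(p, y_in, y_en, cls_logits, y_c, l_pcl):
    """Overall objective of Eq. (15): Dice terms on the two polygon masks,
    the contrastive loss on sampled pixel embeddings, and the 3-class
    classification cross-entropy."""
    l_c = certain_loss(p, y_in, y_en)
    l_ce = torch.nn.functional.cross_entropy(cls_logits, y_c)
    return l_c + lambda_1 * l_pcl + lambda_2 * l_ce
```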

4 Experiments

Table 1: Performance comparison with State-of-the-Arts on ISIC2017 and Kvasir-SEG datasets. Bold and underline denote the best and second best results, respectively.
Methods Data ISIC2017 Kvasir-SEG
Dice Jaccard Accuracy Sensitivity Dice Jaccard Accuracy Sensitivity
Fully-supervised Methods
UNet [25] mask 86.11±.13 77.80±.16 93.83±.03 84.61±.49 89.21±.24 83.50±.09 96.95±.05 91.40±.39
UNet++ [30] mask 85.75±.10 77.60±.18 96.60±.25 84.22±.21 89.23±.14 83.43±.06 96.78±.08 92.09±.66
DeepLabV3+ [26] mask 86.15±.10 78.06±.14 93.97±.03 83.79±.29 89.04±.27 82.85±.22 96.87±.11 91.73±.42
TransUNet [27] mask 86.25±.13 78.21±.13 93.88±.14 85.57±.83 89.64±.13 83.86±.16 96.76±.10 91.74±.54
TransFuseS [31] mask 86.09±.27 78.01±.45 93.84±.16 85.08±.92 88.15±.21 82.02±.18 96.48±.07 89.94±.64
HiFormer [32] mask 86.16±.22 78.02±.30 93.84±.11 85.05±.76 89.18±.22 83.41±.19 96.86±.04 90.40±.63
Weakly-supervised Methods
PCE [33] scribbles 80.94±.08 71.19±.10 91.64±.02 80.85±.71 77.21±.46 66.46±.41 93.83±.10 80.92±1.55
TV [34] scribbles 81.14±.33 71.50±.43 91.83±.13 81.13±.55 77.01±.21 66.24±.19 93.75±.03 80.32±.46
GatedCRF [35] scribbles 81.02±.53 71.25±.61 91.64±.22 78.30±.80 78.63±.26 68.43±.29 94.12±.10 77.43±.44
Mumford-Shah [36] scribbles 76.50±.78 65.00±.97 90.36±.30 72.02±2.82 69.61±1.49 57.25±1.76 92.13±.31 68.12±3.91
USTM [37] scribbles 80.92±.10 71.24±.12 91.60±.13 79.40±1.33 76.65±.18 65.95±.21 93.72±.02 79.30±.60
ScribbleVC [8] scribbles 81.07±.50 71.40±.41 91.85±.04 76.38±1.00 77.29±.39 66.95±.37 93.83±.18 76.21±1.04
DMSPS [1] scribbles 81.50±.19 71.86±.09 91.90±.10 80.68±.47 78.21±.53 68.04±.59 94.02±.02 80.78±2.10
TriMix [38] scribbles 82.03±.11 72.65±.12 91.76±.12 80.39±.78 84.23±.11 75.83±.26 95.44±.08 83.46±.42
UNet box 82.17±.10 71.34±.18 91.62±.05 90.71±.49 76.68±.06 64.42±.10 91.82±.37 93.85±.78
TransUNet box 82.71±.27 72.17±.35 91.98±.16 91.04±.64 78.61±.23 66.69±.12 92.82±.40 92.66±2.11
UNet rectangle 85.38±.03 76.52±.11 93.38±.10 89.35±1.34 82.51±.24 72.73±.19 94.81±.10 92.76±.16
TransUNet rectangle 85.44±.21 76.69±.19 93.33±.02 89.79±2.04 83.40±.20 74.10±.08 94.57±.12 92.10±.66
Ours(UNet) BPAnno 86.18±.05 78.12±.04 93.83±.04 84.45±.45 89.30±.11 83.04±.15 96.88±.03 91.98±.32
Ours(TransUNet) BPAnno 86.60±.17 78.61±.24 93.95±.12 87.55±1.38 89.88±.19 83.85±.27 96.91±.08 92.15±.73

4.1 Experimental Setup

4.1.1 Datasets

To evaluate the effectiveness of the proposed method, we conduct comparative experiments on two widely used medical image segmentation datasets, i.e., the ISIC2017 [11] and Kvasir-SEG [12] datasets. ISIC2017 is a skin lesion segmentation dataset for which rich results have been reported in the literature for comparison. It contains 2000, 150, and 600 dermoscopic images in the train, validation, and test sets, respectively. We follow the official train/test split in our experiments. Kvasir-SEG contains 1000 gastrointestinal polyp images and the corresponding groundtruth. We randomly split the dataset into two subsets with 800 and 200 images, respectively. Furthermore, to evaluate the generalization ability of the constructed model, we conduct a cross-training evaluation and apply the model trained on ISIC2017 to the ISIC2018 dataset for skin lesion segmentation without fine-tuning. The ISIC2018 dataset [39] is an expansion of ISIC2017 and contains 2594, 100, and 1000 images in the train, validation, and test sets, respectively. It should be noted that there is no intersection between the test set of ISIC2018 and the train set of ISIC2017.

4.1.2 Annotation Generation

For the bounded polygon annotations, we initially generate approximate bounded polygons through dilation-erosion operations by leveraging the available groundtruth masks. Subsequently, a manual refinement process is employed to enhance the accuracy of the bounded polygon annotations. In the automatic generation phase, we create coarse envelope-like and inscribed-like polygons by applying dilation and erosion operations to the segmentation masks. Specifically, the dilation operation enlarges the masks, while the erosion operation shrinks them. These modified masks serve as the basis for generating polygons. The Douglas-Peucker algorithm [40] is then applied to derive approximate contours from the dense masks, yielding bounded polygons with a limited number of vertices.
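A minimal sketch of this automatic generation step, using OpenCV for the morphological operations and cv2.approxPolyDP for the Douglas-Peucker simplification; the kernel size and epsilon ratio are illustrative choices, not values reported in the paper.

```python
import cv2
import numpy as np

def bounded_polygons(mask, kernel_size=15, eps_ratio=0.01):
    """Coarse bounded-polygon generation from a dense binary mask (uint8, {0, 1}),
    prior to manual refinement."""
    kernel = np.ones((kernel_size, kernel_size), np.uint8)
    envelope = cv2.dilate(mask, kernel)    # enlarged mask -> envelope-like region
    inscribed = cv2.erode(mask, kernel)    # shrunken mask -> inscribed-like region

    def to_polygon(m):
        contours, _ = cv2.findContours(m, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
        largest = max(contours, key=cv2.contourArea)
        eps = eps_ratio * cv2.arcLength(largest, True)
        return cv2.approxPolyDP(largest, eps, True)  # polygon with a limited number of vertices

    return to_polygon(envelope), to_polygon(inscribed)
```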

To compare with existing weakly-supervised methods, we also generate scribble, box, and rectangle annotations on these two datasets. Following [41], we draw random lines by connecting two end points sampled from \{(u,v)|y_{uv}=1\} to simulate the scribbles. Here, y\in\{0,1\}^{H\times W} is the given groundtruth binary mask. To obtain the box annotations, we use an object detection method. Similarly, we obtain the rectangle annotation by identifying the smallest rectangular area that covers the foreground pixels in the groundtruth mask, which is then filled to create a rectangular mask.
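For the rectangle annotation, the smallest axis-aligned rectangle covering the foreground can be obtained directly from the groundtruth mask; a short sketch under these assumptions:

```python
import numpy as np

def rectangle_annotation(mask):
    """Filled rectangular mask covering all foreground pixels of a binary mask."""
    ys, xs = np.where(mask > 0)
    rect = np.zeros_like(mask)
    rect[ys.min():ys.max() + 1, xs.min():xs.max() + 1] = 1
    return rect
```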

4.1.3 Implementation Details

All experiments are conducted using PyTorch and NVIDIA GeForce RTX 3090 GPUs. During training, images are resized to 256\times 256 for all backbone networks except TransFuse and HiFormer, which use 192\times 256 and 224\times 224, respectively. For optimization, we employ the Adam and AdamW optimizers with an initial learning rate of 1e-4 for CNN-based and Transformer-based backbone networks, respectively. We set the maximum number of epochs to 100 for ISIC2017 and 300 for Kvasir-SEG, the batch size to 16, and the hyper-parameters to \lambda_{1}=0.3, \lambda_{2}=0.5, \tau=0.1, and \epsilon=1e-6.

4.2 Comparison With State-of-the-Arts

Figure 3: Qualitative comparison of different methods on ISIC2017 (top three rows) and Kvasir-SEG (bottom three rows). The green and blue contours indicate the prediction and groundtruth, respectively.

To demonstrate the comprehensive segmentation performance of our method, we compare EAUWSeg with different state-of-the-art approaches:

  • Scribble-supervised methods: 1) different learning strategies on UNet, including the partial Cross-Entropy loss [33], Total Variation loss [34], Gated Conditional Random Field loss [35], Mumford-Shah loss [36], as well as Uncertainty-aware Self-ensembling and Transformation-consistent Mean Teacher (USTM) [37]; 2) different scribble-supervised frameworks, including ScribbleVC [8], DMSPS [1], and TriMix [38].

  • Box-supervised methods. For fair comparison, we also present the results of classical segmentation networks, i.e., UNet and TransUNet, supervised with bounding box.

  • Fully-supervised segmentation methods: 1) CNN-based methods, including UNet [25], UNet++ [30], and DeepLabV3+ [26]; 2) Transformer-based methods, including TransUNet [27], TransFuseS [31], and HiFormer [32]. Implementation of these networks follows the corresponding GitHub repositories. During training, ResNet50 [42] is employed as the encoder for UNet and DeepLabV3+, ResNet34 is utilized for UNet++ and TransFuseS, and the default "R50-ViT-B_16" and "Hiformer-S" configurations are employed for TransUNet and HiFormer.

Table 1 presents the quantitative evaluation results of the aforementioned methods. For a fair comparison with scribble-supervised methods using different learning strategies, we present the results of EAUWSeg with UNet as the backbone. To demonstrate the effectiveness of EAUWSeg, we also report results with TransUNet as the backbone. The results show that our method outperforms other weakly-supervised methods on both the ISIC2017 and Kvasir-SEG datasets, including the scribble-supervised as well as the box-supervised methods. When compared with fully-supervised methods, our proposed EAUWSeg also delivers superior performance, yielding average Dice scores of 86.60% and 89.88% on ISIC2017 and Kvasir-SEG, respectively. This underscores the superiority and effectiveness of the proposed BPAnno supervision strategy and its corresponding learning framework EAUWSeg. Fig. 3 shows some qualitative evaluation results; it can be seen that our proposed method achieves better segmentation performance.

4.3 Ablation Study

Table 2: Comparison of Dice and Jaccard scores for six weak annotation methods.
Data ISIC2017 Kvasir-SEG
Dice Jaccard Dice Jaccard
Single Annotation
polygon 85.54±.20 76.86±.09 85.29±.29 76.22±.51
rectangle 85.44±.21 76.69±.19 83.40±.20 74.10±.08
ellipse 84.61±.11 75.35±.24 83.61±.27 73.80±.61
Bounded Annotation
bounded polygon 85.88±.18 77.81±.22 88.95±.36 82.61±.40
bounded rectangle 85.60±.22 77.13±.28 88.71±.10 82.22±.27
bounded ellipse 84.63±.17 75.56±.14 87.78±.19 80.76±.15

4.3.1 Effectiveness of the Bounded Annotations

To analyze the effectiveness of the proposed bounded annotation strategy, we conduct a quantitative evaluation of training TransUNet directly with different single annotation methods, including polygon, rectangle, and ellipse. Since box is similar to rectangle, only rectangle is compared, as it achieves better performance. Table 2 lists the quantitative comparison based on the Dice and Jaccard scores, both of which reflect the overlap between the prediction and the groundtruth. It can be seen that all three annotations offer a promising way to initialize the lesion region (with Dice scores larger than 80%), while polygon shows the best performance. Substituting the single annotations with bounded ones leads to consistent performance improvement for all these annotation methods on both datasets, demonstrating the effectiveness of our proposed bounded weak annotation strategy.

4.3.2 Comparative Analysis of Different Components

To demonstrate the effectiveness of the proposed components, i.e., the confidence-auxiliary consistency learner (CCL) and the classification-guided confidence generator (CCG), we carried out ablation experiments, whose results are shown in Table 3. "Baseline" denotes the performance of TransUNet trained with bounded polygon annotations. It can be seen that with the gradual introduction of CCL and CCG, the performance consistently improves on both ISIC2017 and Kvasir-SEG.

Table 3: Ablation study on the ISIC2017 and Kvasir-SEG datasets with TransUNet as the backbone.
Components Dice Jaccard Accuracy Sensitivity
ISIC2017
Baseline 85.88 77.81 93.65 85.59
Baseline + CCL 86.38 78.12 93.81 86.38
Baseline + CCL + CCG 86.60 78.61 93.95 87.55
Kvasir-SEG
Baseline 88.95 82.61 96.49 91.78
Baseline + CCL 89.35 83.06 96.90 92.03
Baseline + CCL + CCG 89.88 83.85 96.91 92.15

4.3.3 Comparison With Semi-supervised Methods

Table 4 presents a comparative analysis of our method against five existing semi-supervised segmentation methods on ISIC2017. For these semi-supervised methods, we refer to the results reported in [43]. These semi-supervised methods are trained with varying percentages of labeled data (5%/10%/20%) and assisted with the remaining unlabeled data (95%/90%/80%), while the proposed method is trained with only 5%/10%/20% of the samples annotated by bounded polygons, without using the remaining unlabeled data. Although supervised with only 5%/10%/20% of the samples annotated by bounded polygons, our method outperforms most of the specifically designed SSL methods (except for CASSL) that are trained with dense masks as well as the remaining unlabeled data, showcasing its robust feature learning capabilities. When compared with CASSL, in which an adversarial training mechanism and a collaborative consistency learning strategy are carefully designed to utilize the unlabeled data, our method has a small performance gap while requiring neither dense masks nor unlabeled data. This is important for many medical image segmentation tasks, since additional unlabeled data may be unavailable in clinical practice.

Table 4: Performance comparison with semi-supervised methods on ISIC2017 test set with Jaccard score as the evaluation metric.
Methods Data 5% 10% 20%
UNet mask + unlabeled 70.92 71.74 75.27
CLCC [44] 61.23 65.40 68.93
MT [45] 73.12 74.34 76.98
ST++ [46] 73.26 75.51 76.69
S4-PLCL [47] 68.19 71.08 71.83
CASSL [43] 76.55 77.49 79.31
Ours(TransUNet) only BPAnno 75.81 76.86 77.54

4.3.4 Generalizability Analysis With Different Backbones

The proposed EAUWSeg is a plug-and-play model that can be easily combined with different backbones. To demonstrate its generalization ability, six widely used segmentation networks are compared, i.e., UNet [25], UNet++ [30], DeepLabV3+ [26], TransUNet [27], TransFuseS [31], and HiFormer [32]. From Fig. 4, it can be seen that: 1) the best result is achieved when using TransUNet as the backbone, and 2) the proposed method delivers superior performance compared to the fully-supervised counterparts shown in Table 1. These results reveal that the proposed EAUWSeg generalizes well across different backbones.

Figure 4: Performance comparison of EAUWSeg combined with different backbones on the ISIC2017 test set.

4.3.5 Generalization on ISIC2018

The generalization ability of the constructed model is important for real applications. We evaluate the generalization ability of the proposed method in a cross-training manner [48]. Specifically, we apply the model trained on ISIC2017 to the ISIC2018 dataset for skin lesion segmentation without fine-tuning. As presented in Table 5, our method achieves generalization performance on ISIC2018 comparable to that of all the fully-supervised counterparts. This highlights the effectiveness of our EAUWSeg approach as well as of the bounded polygon annotation in ensuring robust generalizability.

Table 5: Generalizability comparison on ISIC2018 for models trained with different supervision strategies without finetuning.
Methods Data Dice Jaccard Accuracy Sensitivity
UNet mask 86.78 78.27 92.69 93.85
BPAnno 86.68 78.06 92.92 94.64
UNet++ mask 87.11 78.95 92.63 94.64
BPAnno 86.67 77.88 92.7 95.12
DeepLabV3+ mask 87.02 78.73 92.87 94.64
BPAnno 86.63 77.98 92.73 94.75
TransUNet mask 86.34 77.39 92.38 95.99
BPAnno 86.43 77.49 92.55 95.93
TransFuseS mask 87.67 79.58 93.22 95.04
BPAnno 87.57 78.81 94.92 93.39
HiFormer mask 87.27 79.16 93.10 95.10
BPAnno 87.44 79.12 92.87 94.94

4.4 Error Analysis

The proposed bounded polygon annotation has the advantage of explicitly providing a prior emphasis on lesion boundaries during model training. To verify this, following [49], we separately evaluate the results in boundary and interior regions. Fig. 5 illustrates the Jaccard and Dice score improvements achieved by our EAUWSeg over the BPAnno-supervised baselines, both inside and outside a band of specific width, referred to as the boundary and interior regions. It can be seen that EAUWSeg consistently enhances the performance of the baseline models in both the boundary and interior regions, regardless of the trimap width. Specifically, our EAUWSeg achieves a substantial gain of over 2% within the boundary regions. This reveals the ability of the proposed EAUWSeg to capture the intricate details of the boundary, which is attributed to our confidence-auxiliary consistency learner. Furthermore, we also illustrate the t-SNE visualization of the constructed feature space for TransUNet trained in the fully-supervised setting and for EAUWSeg trained in the BPAnno-supervised manner. From Fig. 6, it can be seen that our method constructs a more compact feature space than the fully-supervised baseline, especially around the lesion boundary.

Figure 5: Error analysis on the ISIC2017 test set. Both inside and outside a band of specific width are illustrated.

Figure 6: t-SNE visualizations on ISIC2017 test set for the TransUNet and EAUWSeg(TransUNet). (a) and (b) display feature embedding generated by the constructed EAUWSeg and the TransUNet, respectively, with distinct colors representing the foreground and background. (c) and (d) illustrate the feature embedding generated by the constructed EAUWSeg(TransUNet) and TransUNet, with a separate focus on the lesion boundary.

4.5 Annotation Cost Analysis

To quantify the annotation-cost reduction offered by the proposed bounded polygon annotation, we conduct a comparison study focusing on the annotation workload. In this study, a dermatologist with over ten years of experience from a general hospital in a central city performs two types of annotation, i.e., pixel-to-pixel dense annotation and the proposed bounded polygon annotation, on twenty images selected from ISIC2017. It takes an average of 55 and 10 seconds for the pixel-to-pixel dense annotation and the bounded polygon annotation, respectively. This indicates that annotating the bounded polygons of a skin lesion in a dermoscopic image requires only 18% of the annotation cost of pixel-to-pixel annotation. Combined with the experimental results in Table 1 and Table 5, the proposed method delivers superior performance and comparable generalization ability when compared to its fully-supervised counterparts. These results reveal that bounded polygon annotations coupled with EAUWSeg can be a cost-effective solution for weakly-supervised medical image segmentation.

5 Conclusion and Future Work

In this work, to eliminate the annotation uncertainty that exists in weakly-supervised medical image segmentation, we propose the bounded polygon annotation, which labels only two polygons while providing a promising prior on the lesion boundary during training. To further eliminate the uncertainty contained in the bounded polygons and to leverage the prior emphasis they delineate, we develop EAUWSeg, a learning framework tailored for bounded polygons, in which a confidence-auxiliary consistency learner incorporating a classification-guided confidence generator is designed to provide reliable supervision signals for pixels in the uncertain region. Extensive experimental results demonstrate that EAUWSeg not only outperforms existing weakly-supervised segmentation methods but also delivers superior performance compared to fully-supervised counterparts, with less than 20% of the annotation workload.

This work is a preliminary attempt to eliminate annotation uncertainty in weakly-supervised medical image segmentation. Extensive experimental results have demonstrated the cost-efficiency and effectiveness of the bounded annotation, yet several limitations remain. This study mainly focuses on weakly-supervised medical image segmentation in the binary case. When applied to instance segmentation, it may face challenges such as the envelope-like polygon encompassing foreground pixels of different categories. In future work, we will focus on solving such problems.

References

  • [1] M. Han, X. Luo, X. Xie, W. Liao, S. Zhang, T. Song, G. Wang, and S. Zhang, “Dmsps: Dynamically mixed soft pseudo-label supervision for scribble-supervised medical image segmentation,” Medical Image Analysis, p. 103274, 2024.
  • [2] S. Zhai, G. Wang, X. Luo, Q. Yue, K. Li, and S. Zhang, “Pa-seg: Learning from point annotations for 3d medical image segmentation using contextual regularization and cross knowledge distillation,” IEEE Transactions on Medical Imaging, 2023.
  • [3] F. Gao, M. Hu, M.-E. Zhong, S. Feng, X. Tian, X. Meng, Z. Huang, M. Lv, T. Song, X. Zhang, X. Zou, and X. Wu, “Segmentation only uses sparse annotations: Unified weakly and semi-supervised learning in medical images,” Medical Image Analysis, vol. 80, p. 102515, 2022.
  • [4] K. Wu, B. Du, M. Luo, H. Wen, Y. Shen, and J. Feng, “Weakly supervised brain lesion segmentation via attentional representation learning,” in Medical Image Computing and Computer Assisted Intervention.   Springer, 2019, pp. 211–219.
  • [5] D. Lin, J. Dai, J. Jia, K. He, and J. Sun, “Scribblesup: Scribble-supervised convolutional networks for semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 3159–3167.
  • [6] M. Rajchl, M. C. Lee, O. Oktay, K. Kamnitsas, J. Passerat-Palmbach, W. Bai, M. Damodaram, M. A. Rutherford, J. V. Hajnal, B. Kainz et al., “Deepcut: Object segmentation from bounding box annotations using convolutional neural networks,” IEEE Transactions on Medical Imaging, vol. 36, no. 2, pp. 674–683, 2016.
  • [7] R. Dorent, S. Joutard, J. Shapey, A. Kujawa, M. Modat, S. Ourselin, and T. Vercauteren, “Inter extreme points geodesics for end-to-end weakly supervised image segmentation,” in Medical Image Computing and Computer Assisted Intervention.   Springer, 2021, pp. 615–624.
  • [8] Z. Li, Y. Zheng, X. Luo, D. Shan, and Q. Hong, “Scribblevc: Scribble-supervised medical image segmentation with vision-class embedding,” in Proceedings of the 31st ACM International Conference on Multimedia, 2023, pp. 3384–3393.
  • [9] Y. Wang, J. Zhang, M. Kan, S. Shan, and X. Chen, “Blpseg: Balance the label preference in scribble-supervised semantic segmentation,” IEEE Transactions on Image Processing, 2023.
  • [10] L. Wu, Z. Zhong, L. Fang, X. He, Q. Liu, J. Ma, and H. Chen, “Sparsely annotated semantic segmentation with adaptive gaussian mixtures,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 15 454–15 464.
  • [11] N. C. Codella, D. Gutman, M. E. Celebi, B. Helba, M. A. Marchetti, S. W. Dusza, A. Kalloo, K. Liopyris, N. Mishra, H. Kittler, and A. Halpern, “Skin lesion analysis toward melanoma detection: A challenge at the 2017 international symposium on biomedical imaging (isbi), hosted by the international skin imaging collaboration (isic),” in 2018 IEEE 15th International Symposium on Biomedical Imaging.   IEEE, 2018, pp. 168–172.
  • [12] D. Jha, P. H. Smedsrud, M. A. Riegler, P. Halvorsen, T. de Lange, D. Johansen, and H. D. Johansen, “Kvasir-seg: A segmented polyp dataset,” in MultiMedia Modeling: 26th International Conference, MMM 2020.   Springer, 2020, pp. 451–462.
  • [13] N. Tajbakhsh, L. Jeyaseelan, Q. Li, J. N. Chiang, Z. Wu, and X. Ding, “Embracing imperfect datasets: A review of deep learning solutions for medical image segmentation,” Medical Image Analysis, vol. 63, p. 101693, 2020.
  • [14] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, “Learning deep features for discriminative localization,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2921–2929.
  • [15] L. Wang, L. Zhang, X. Shu, and Z. Yi, “Intra-class consistency and inter-class discrimination feature learning for automatic skin lesion classification,” Medical Image Analysis, vol. 85, p. 102746, 2023.
  • [16] C.-C. Hsu, K.-J. Hsu, C.-C. Tsai, Y.-Y. Lin, and Y.-Y. Chuang, “Weakly supervised instance segmentation using the bounding box tightness prior,” Advances in Neural Information Processing Systems, vol. 32, 2019.
  • [17] G. Valvano, A. Leo, and S. A. Tsaftaris, “Learning to segment from scribbles using multi-scale adversarial attention gates,” IEEE Transactions on Medical Imaging, vol. 40, no. 8, pp. 1990–2001, 2021.
  • [18] Z. Ji, Y. Shen, C. Ma, and M. Gao, “Scribble-based hierarchical weakly supervised learning for brain tumor segmentation,” in Medical Image Computing and Computer Assisted Intervention.   Springer, 2019, pp. 175–183.
  • [19] K. Zhang and X. Zhuang, “Cyclemix: A holistic strategy for medical image segmentation from scribble supervision,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 11 656–11 665.
  • [20] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “Simclr: A simple framework for contrastive learning of visual representations,” in International Conference on Learning Representations, vol. 2, 2020.
  • [21] A. Jaiswal, A. R. Babu, M. Z. Zadeh, D. Banerjee, and F. Makedon, “A survey on contrastive self-supervised learning,” Technologies, vol. 9, no. 1, p. 2, 2020.
  • [22] C. You, W. Dai, Y. Min, F. Liu, D. Clifton, S. K. Zhou, L. Staib, and J. Duncan, “Rethinking semi-supervised medical image segmentation: A variance-reduction perspective,” in Advances in Neural Information Processing Systems, vol. 36, 2023, pp. 9984–10 021.
  • [23] W. Wang, T. Zhou, F. Yu, J. Dai, E. Konukoglu, and L. Van Gool, “Exploring cross-image pixel contrast for semantic segmentation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 7303–7313.
  • [24] T. Wang, J. Lu, Z. Lai, J. Wen, and H. Kong, “Uncertainty-guided pixel contrastive learning for semi-supervised medical image segmentation,” in Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, 2022, pp. 1444–1450.
  • [25] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in Medical Image Computing and Computer-Assisted Intervention.   Springer, 2015, pp. 234–241.
  • [26] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam, “Encoder-decoder with atrous separable convolution for semantic image segmentation,” in Proceedings of the European Conference on Computer Vision, 2018, pp. 801–818.
  • [27] J. Chen, J. Mei, X. Li, Y. Lu, Q. Yu, Q. Wei, X. Luo, Y. Xie, E. Adeli, Y. Wang, M. P. Lungren, S. Zhang, L. Xing, L. Lu, A. Yuille, and Y. Zhou, “Transunet: Rethinking the u-net architecture design for medical image segmentation through the lens of transformers,” Medical Image Analysis, vol. 97, p. 103280, 2024.
  • [28] H. Wu, X. Li, and K.-T. Cheng, “Exploring feature representation learning for semi-supervised medical image segmentation,” IEEE Transactions on Neural Networks and Learning Systems, 2023.
  • [29] L. Liu, A. I. Aviles-Rivero, and C.-B. Schönlieb, “Contrastive registration for unsupervised medical image segmentation,” IEEE Transactions on Neural Networks and Learning Systems, 2023.
  • [30] Z. Zhou, M. M. R. Siddiquee, N. Tajbakhsh, and J. Liang, “Unet++: Redesigning skip connections to exploit multiscale features in image segmentation,” IEEE Transactions on Medical Imaging, vol. 39, no. 6, pp. 1856–1867, 2019.
  • [31] Y. Zhang, H. Liu, and Q. Hu, “Transfuse: Fusing transformers and cnns for medical image segmentation,” in Medical Image Computing and Computer Assisted Intervention.   Springer, 2021, pp. 14–24.
  • [32] M. Heidari, A. Kazerouni, M. Soltany, R. Azad, E. K. Aghdam, J. Cohen-Adad, and D. Merhof, “Hiformer: Hierarchical multi-scale representations using transformers for medical image segmentation,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2023, pp. 6202–6212.
  • [33] M. Tang, A. Djelouah, F. Perazzi, Y. Boykov, and C. Schroers, “Normalized cut loss for weakly-supervised cnn segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 1818–1827.
  • [34] M. Javanmardi, M. Sajjadi, T. Liu, and T. Tasdizen, “Unsupervised total variation loss for semi-supervised deep learning of semantic segmentation,” arXiv preprint arXiv:1605.01368, 2016.
  • [35] A. Obukhov, S. Georgoulis, D. Dai, and L. Van Gool, “Gated crf loss for weakly supervised semantic image segmentation,” arXiv preprint arXiv:1906.04651, 2019.
  • [36] B. Kim and J. C. Ye, “Mumford–shah loss functional for image segmentation with deep learning,” IEEE Transactions on Image Processing, vol. 29, pp. 1856–1866, 2019.
  • [37] X. Liu, Q. Yuan, Y. Gao, K. He, S. Wang, X. Tang, J. Tang, and D. Shen, “Weakly supervised segmentation of covid19 infection with scribble annotation on ct images,” Pattern Recognition, vol. 122, p. 108341, 2022.
  • [38] Z. Zheng, Y. Hayashi, M. Oda, T. Kitasaka, and K. Mori, “Trimix: A general framework for medical image segmentation from limited supervision,” in Proceedings of the Asian Conference on Computer Vision, 2022, pp. 634–651.
  • [39] N. Codella, V. Rotemberg, P. Tschandl, M. E. Celebi, S. Dusza, D. Gutman, B. Helba, A. Kalloo, K. Liopyris, M. Marchetti, H. Kittler, and A. Halpern, “Skin lesion analysis toward melanoma detection 2018: A challenge hosted by the international skin imaging collaboration (isic),” arXiv preprint arXiv:1902.03368, 2019.
  • [40] D. H. Douglas and T. K. Peucker, “Algorithms for the reduction of the number of points required to represent a digitized line or its caricature,” Cartographica: the International Journal for Geographic Information and Geovisualization, vol. 10, no. 2, pp. 112–122, 1973.
  • [41] H. E. Wong, M. Rakic, J. Guttag, and A. V. Dalca, “Scribbleprompt: Fast and flexible interactive segmentation for any medical image,” arXiv preprint arXiv:2312.07381, 2023.
  • [42] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
  • [43] Y. Tang, S. Wang, Y. Qu, Z. Cui, and W. Zhang, “Consistency and adversarial semi-supervised learning for medical image segmentation,” Computers in Biology and Medicine, vol. 161, p. 107018, 2023.
  • [44] X. Zhao, C. Fang, D.-J. Fan, X. Lin, F. Gao, and G. Li, “Cross-level contrastive learning and consistency constraint for semi-supervised medical image segmentation,” in 2022 IEEE 19th International Symposium on Biomedical Imaging, 2022, pp. 1–5.
  • [45] A. Tarvainen and H. Valpola, “Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results,” Advances in Neural Information Processing Systems, vol. 30, 2017.
  • [46] L. Yang, W. Zhuo, L. Qi, Y. Shi, and Y. Gao, “St++: Make self-training work better for semi-supervised semantic segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 4268–4277.
  • [47] I. Alonso, A. Sabater, D. Ferstl, L. Montesano, and A. C. Murillo, “Semi-supervised semantic segmentation with pixel-level contrastive learning from a class-wise memory bank,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 8219–8228.
  • [48] Y. Yuan, L. Zhang, L. Wang, and H. Huang, “Multi-level attention network for retinal vessel segmentation,” IEEE Journal of Biomedical and Health Informatics, vol. 26, no. 1, pp. 312–323, 2021.
  • [49] J. Dai, K. He, and J. Sun, “Boxsup: Exploiting bounding boxes to supervise convolutional networks for semantic segmentation,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1635–1643.