Improving Semi-Supervised Semantic Segmentation with Dual-Level Siamese Structure Network

Zhibo Tian [email protected] School of Information Science and Engineering,
Lanzhou UniversityLanzhouChina , Xiaolin Zhang [email protected] Independent ResearcherShenzhenChina , Peng Zhang pengzhang˙[email protected] College of Computer Science and Engineering,
Shandong University of Science and TechnologyQingdaoChina and Kun Zhan [email protected] School of Information Science and Engineering,
Lanzhou UniversityLanzhouChina

(2023)

Abstract.

Semi-supervised semantic segmentation (SSS) is an important task that utilizes both labeled and unlabeled data to reduce expenses on labeling training examples. However, the effectiveness of SSS algorithms is limited by the difficulty of fully exploiting the potential of unlabeled data. To address this, we propose a dual-level Siamese structure network (DSSN) for pixel-wise contrastive learning. By aligning positive pairs with a pixel-wise contrastive loss using strong augmented views in both low-level image space and high-level feature space, the proposed DSSN is designed to maximize the utilization of available unlabeled data. Additionally, we introduce a novel class-aware pseudo-label selection strategy for weak-to-strong supervision, which addresses the limitations of most existing methods that do not perform selection or apply a predefined threshold for all classes. Specifically, our strategy selects the top high-confidence prediction of the weak view for each class to generate pseudo labels that supervise the strong augmented views. This strategy is capable of taking into account the class imbalance and improving the performance of long-tailed classes. Our proposed method achieves state-of-the-art results on two datasets, PASCAL VOC 2012 and Cityscapes, outperforming other SSS algorithms by a significant margin. The source code is available at https://github.com/kunzhan/DSSN.

Semi-supervised segmentation, pixel-wise contrastive learning, class-aware pseudo-label generation

^†^†journalyear: 2023^†^†copyright: acmlicensed^†^†conference: Proceedings of the 31st ACM International Conference on Multimedia; October 29-November 3, 2023; Ottawa, ON, Canada^†^†booktitle: Proceedings of the 31st ACM International Conference on Multimedia (MM ’23), October 29-November 3, 2023, Ottawa, ON, Canada^†^†price: 15.00^†^†doi: 10.1145/3581783.3611816^†^†isbn: 979-8-4007-0108-5/23/10^†^†submissionid: 534^†^†ccs: Computing methodologies Image segmentation

Refer to caption — Figure 1. Illustration of the motivation. (a) demonstrates the proposed dual-level contrastive structure for exploiting the maximum potential of unlabelled samples. (b) depicts the structure of the vanilla contrastive learning. (c) compares the threshold selection strategies of the proposed class-aware pseudo-label generation method and the classical approaches of utilizing a threshold for all classes.

1. Introduction

Deep learning methods for supervised segmentation have shown remarkable performance. However, they heavily rely on a large amount of annotated images, which is labor cost and time-consuming. Alternatively, semi-supervised semantic segmentation (SSS) offers a viable solution to address this fundamental weakness by exploiting the readily available unlabeled data to improve model performance.

Existing semi-supervised learning methods typically use unlabeled samples in two ways: pseudo supervision (Berthelot et al., 2019; Sohn et al., 2020) and consistency regularization (Laine and Aila, 2017; Tarvainen and Valpola, 2017; Xie et al., 2020). Pseudo supervision is to generate pseudo labels for the unlabeled images and gradually incorporates them into the training process to supervise model learning. For example, preliminary works (Hung et al., 2018; Mittal et al., 2019) in SSS tend to utilize the generative adversarial networks (Creswell et al., 2018) as auxiliary supervision for unlabeled images. Consistency regularization promotes agreement among model predictions on unlabeled samples that are subjected to various perturbations, thus improving model generalization by ensuring that different views of the same unlabeled image are consistent. Modern SSS algorithms combine pseudo supervision and consistency regularization into a two-view network architecture, where one view generates pseudo labels to supervise the other view for prediction consistency. For instance, the intuition of CPS (Chen et al., 2021) is that using one view generates pseudo labels of unlabeled images to expand the training set of the other view. PseudoSeg (Zou et al., 2021) generates pseudo labels in a weak augmented view to supervise the other strong augmented view. PS-MT (Liu et al., 2022a) employs higher-confidence pseudo labels than CPS by averaging the predictions of two views. To search for high-quality pseudo labels, CCT (Ouali et al., 2020) employs a fixed threshold for all classes and pixels with confidence scores above the threshold to participate in network updates. In CCT (Ouali et al., 2020), it mainly uses consistency learning between one weak view and two strong augmented views of a high-level feature.

However, many existing SSS algorithms do not fully exploit the potential of unlabelled data. To address this issue, we propose a Dual-level Siamese structure network (DSSN) to fully exploit feature diversities. In addition to the two strategies commonly used in most algorithms, we introduce a new variant of contrastive learning. Fig. 1(b) illustrates a typical structure of the vanilla contrastive learning, which excels at providing extraordinary generalization abilities for unlabeled samples (Chopra et al., 2005; Hadsell et al., 2006). Specifically, the proposed DSSN simultaneously employs pixel-wise contrastive learning and two-level strong augmented views. Accordingly, contrastive objectives in terms of image-level and feature-level augmentations are introduced to guide the network training. Such structure guarantees fully exploiting the potential of unlabeled data. As shown in Fig. 1(a), at the image level, two different views of unlabeled samples are obtained with different strong augmentations, and a pixel-wise contrastive objective is added to train DSSN using the corresponding predictions. At the feature level, high-level latent features from the encoder produce two strong augmented views and also conduct a contrastive loss. This DSSN design enables us to fully exploit the available unlabeled data.

Given that most real-world datasets exhibit imbalanced or long-tailed label distributions (Menon et al., 2021), we propose a class-aware pseudo label generation (CPLG) strategy that selects class-specific high-confidence pseudo labels from weak views to supervise the strong views. Our CPLG strategy differs from previous approaches (French et al., 2019; Ouali et al., 2020), which apply a fixed threshold to all categories. By treating each class differently, our method aims to improve the performance of long-tailed categories. Without any selection, low-quality pseudo labels generated from the weak augmented view are used to supervise the strong augmented view, which could negatively affect the model training. Using a constant threshold for all classes may result in long-tailed classes being poorly trained, as their confidence may be lower than the threshold and thus not involved in training. Using a fixed threshold may also result in useful pseudo-labels being ignored in some classes that fall below the predefined threshold. For each class has pseudo labels, we select top high-confidence pixels in each class since most segments in an image are imbalances and also it is imbalances in the whole dataset. A schematic illustrating this strategy is presented in Fig. 1(c). This approach increases the contribution of long-tailed classes and addresses the learning difficulties of different classes.

In summary, DSSN makes the following contributions:

(1) DSSN offers a novel approach to leverage unlabeled data in training SSS models by utilizing dual-level pixel-wise contrastive learning. This approach is a valuable addition to the existing techniques of exploiting unlabeled data, such as pseudo-supervision and consistency regularization.

(2) DSSN’s design enables the maximal utilization of available unlabeled data. The dual-level structure is not only utilized in contrastive learning but also in weak-to-strong pseudo-supervision.

(3) We introduce a novel class-aware pseudo-label selection strategy for weak-to-strong supervision, known as CPLG. This strategy effectively improves the performance of long-tailed classes.

2. Related Work

SSS has two mainstream methods, pseudo supervision and consistency regularization. Preliminary works (Hung et al., 2018; Mittal et al., 2019) use the generative adversarial networks (Creswell et al., 2018) to generate pseudo supervision. Specifically, consistency regularization methods encourage consistency prediction of unlabeled samples with various perturbation. The CutMix-Seg (French et al., 2019) approach incorporates the CutMix (Yun et al., 2019) augmentation into semantic segmentation in order to supply consistency restrictions on unlabeled data and also revealed Cutout (DeVries and Taylor, 2017) and CutMix (Yun et al., 2019) are critical to the success of consistency regularization. Alternatively, CCT (Ouali et al., 2020) proposes a feature-level perturbation and a cross-consistency training method that enforce consistency between the main decoder predictions and auxiliary decoders. By using two segmentation models with the same structure but different initialization, GCT (Ke et al., 2020) conducts network perturbation and promotes consistency between the predictions from the two models. In the meantime, CPS (Chen et al., 2021) constructs two parallel networks to provide cross-pseudo labels for one another. DMT (Feng et al., 2022) re-weights the loss on different regions based on the disagreement of two different initialized models. Self-training by pseudo labeling is a classic technique that dates back about a decade, taking the most likely class as a pseudo label and training models on unlabeled data is a common method for achieving minimum entropy. Concurrently ST++ (Yang et al., 2022) also demonstrates that employing suitable data perturbations on unlabeled samples is really quite beneficial for self-training. Unimatch (Yang et al., 2023) explores the effectiveness of weak-to-strong supervision, leveraging dual strong augmentations.

Contrastive learning is one of the alternative methods that stands out. RoCo (Liu et al., 2022b) and U²PL (Wang et al., 2022) use InfoNCE loss (Oord et al., 2018) on the predicted logits, but they not use Siamese structure network as shown in Fig. 1(b). DSSN obtains better performance than them, which can be seen in the experiment section.

3. Method

3.1. Preliminaries

Following SSS works (Chen et al., 2021; Yang et al., 2022; Liu et al., 2022b), we use both a small fraction of labeled data $\mathcal{D}_{l}=\{(X_{i},\bm{T}_{i})\}_{i=1}^{M}$ and a large fraction of unlabeled data $\mathcal{D}_{u}=\{X_{i}\}_{i=1+M}^{N+M}$ . $X_{i}$ denotes an image, and $\bm{T_{i}}$ represents its ground-truth label if $X_{i}$ is a labeled image. $N$ and $M$ indicate the number of labeled and unlabeled images, respectively, where $N\gg M$ in most cases. To facilitate the calculation of loss functions, we represent each pixel in an image as a vector $\bm{x}$ since a pixel has values in different channels. Thus, in subsequent sections, we represent each pixel as a vector $\bm{x}$ with $\bm{t}$ as its one-hot ground-truth label. Given an image $X=[\bm{x}_{i}]$ with the size of $W\times H$ where $W$ and $H$ are the width and height, we denote the pixel by $\bm{x}_{i},i\in\{1,...,W\times H\}$ . The latent high-level feature $\bm{z}$ corresponding to $\bm{x}$ is obtained by an encoder $f(\bm{x}|\theta)$ where $\theta$ is the learnable parameters of the encoder. We yield the predicted logits $\bm{h}$ by feeding the latent representations $\bm{z}$ into a decoder $g(\bm{z}|\varphi)$ where $\varphi$ is the learnable parameters of the decoder. Finally, a softmax layer is added to obtain the ultimate probability for each class, i.e., $\bm{y}={\rm softmax}(\bm{h})$ .

Given a labeled image, we use a supervised cross-entropy loss,

(1)

\displaystyle\mathcal{L}_{\rm sup}=-\sum_{i}\sum_{j\in\mathcal{C}}t_{ij}\log y_{ij}

where $\mathcal{C}=\{1,\ldots,C\}$ and $C$ is the total number of classes. For a unlabeled image, a simple way to generate their pseudo labels $\hat{\bm{t}}_{i}$ is to apply a one-hot operation to the predictions, i.e., $\bm{y}_{i}$ . For the $i$ -th pixel of an unlabeled image, we represent the predicted probability of the $i$ -th pixel belonging to the $j$ -th class as $y_{ij}$ . Specifically, we use the following operation to generate pseudo labels:

(2)		$\displaystyle c$	$\displaystyle=\mathop{\rm arg\leavevmode\nobreak\ max}\limits_{j\in\mathcal{C}}(y_{ij}),$
(3)		$\displaystyle\hat{t}_{ij}$	$\displaystyle=\begin{cases}1,&{\rm if}\,j=c\\ 0,&{\rm otherwise}\end{cases}$

where $c$ denotes the maximal probability within the class $j\in\mathcal{C}$ , the $\hat{\bm{t}}_{i}=[\hat{t}_{ij}]$ is the one-hot pseudo label.

3.2. Dual-Level Contrastive Learning

To fully exploit the potential of available unlabeled data, we propose to use DSSN for extracting pixel-wise contrastive positive pairs in different abstraction levels. The low-level image is subjected to two-view strong augmentations,

(4)		$\displaystyle\bm{x}_{i}^{ls1}$	$\displaystyle={\rm AugL}_{s}(\bm{x}_{i}),$
(5)		$\displaystyle\bm{x}_{i}^{ls2}$	$\displaystyle={\rm AugL}_{s}(\bm{x}_{i})$

where $\bm{x}_{i}^{ls1}$ denotes the strong augmented low-level pixel in the first view. The output, ${\rm AugL}_{s}(\cdot)$ , is random. ${\rm AugL}_{s}(\cdot)$ generates varying outputs using the same input to augment the data diversity. This increases the diversity, resulting in an improvement in the robustness and generalization ability of the training model.

We use two-view augmented images to obtain its decoded logits,

(6)		$\displaystyle\bm{h}_{i}^{ls1}$	$\displaystyle=$	$\displaystyle g(f(\bm{x}_{i}^{ls1}\|\theta)\|\varphi)\,.$
(7)		$\displaystyle\bm{h}_{i}^{ls2}$	$\displaystyle=$	$\displaystyle g(f(\bm{x}_{i}^{ls2}\|\theta)\|\varphi)\,.$

Analogous to (Hjelm et al., 2019), we apply the contrastive objective, i.e., $\mathcal{L}_{\rm cl}$ to pairwise pixels for learning better representations:

	$\displaystyle\mathcal{L}_{\rm cl}=$	$\displaystyle-\frac{1}{\|\mathcal{P}\|}\sum_{(i,i)\in\mathcal{P}}\log d(\bm{h}_{i}^{ls1},\bm{h}_{i}^{ls2})$
(8)			$\displaystyle-\frac{1}{\|\mathcal{N}\|}\sum_{{(i,j)\in\mathcal{N}}}\log\bigl{(}1-d(\bm{h}_{i}^{ls1},\bm{h}_{j}^{ls2})\bigr{)}$

where $d(\cdot,\cdot)$ is a similarity score of a pair of logits. $\bm{h}_{i}^{ls1}$ and $\bm{h}_{i}^{ls2}$ are belong to positive pairs $(i,i)\in\mathcal{P}$ while $\bm{h}_{i}^{ls1}$ and $\bm{h}_{j}^{ls2}$ are negative pairs $(i,j)\in\mathcal{N},\forall\,i\neq j$ . We use $\mathcal{P}$ and $\mathcal{N}$ to denote the sets of positive and negative pairs, respectively.

Inspired by BYOL (Grill et al., 2020), we only use the positive pairs. The similarity $d(\cdot,\cdot)$ of positive logits is defined by a Gaussian function,

(9)

\displaystyle d(\bm{h}^{ls1}_{i},\bm{h}_{i}^{ls2})=\exp\Bigl{(}-{\bigl{\|}\bm{h}^{ls1}_{i}-\bm{h}_{i}^{ls2}\bigr{\|}^{2}_{2}}\Bigr{)}\,.

The similarity defined by Eq. (9) implies the similarity is 1 if the pairwise logits are the same while it tends to 0 if their distance is far from each other. From a different perspective, the error $\|\bm{h}^{ls1}_{i}-\bm{h}_{i}^{ls2}\bigr{\|}^{2}_{2}$ of two-view logits is governed by the Gaussian distribution due to the central limit theorem (Walker, 1969), so we also obtain Eq. (9).

Substituting Eq. (9) into Eq. (8) obtains the following loss.

(10)

\displaystyle\mathcal{L}^{ls}_{\rm cl}=\frac{1}{W\times H}\sum_{i}\bigl{\|}\bm{h}^{ls1}_{i}-\bm{h}_{i}^{ls2}\bigr{\|}^{2}_{2}

where we only use pixel-wise positive pairs.

For the high-level feature contrastive learning, we obtain the high-level latent feature with the encoder,

(11)

\displaystyle\bm{z}_{i}^{hw}=f({\rm AugL}_{w}(\bm{x}_{i})|\theta)

where ${\rm AugL}_{w}(\cdot)$ is a weak augmentation for the low-level pixel. The high-level feature is subjected to two-view strong augmentations,

(12)		$\displaystyle\bm{z}_{i}^{hs1}$	$\displaystyle={\rm AugH}_{s}(\bm{z}_{i}^{hw}),$
(13)		$\displaystyle\bm{z}_{i}^{hs2}$	$\displaystyle={\rm AugH}_{s}(\bm{z}_{i}^{hw})$

We use the two-view augmented features to obtain its decoded logits, $\bm{h}_{i}^{hs1}=g(\bm{z}_{i}^{hs1}|\varphi)$ and $\bm{h}_{i}^{hs2}=g(\bm{z}_{i}^{hs12}|\varphi)$ . Then, we use them to construct the contrastive loss,

(14)

\displaystyle\mathcal{L}^{hs}_{\rm cl}=\frac{1}{W\times H}\sum_{i}\bigl{\|}\bm{h}^{hs1}_{i}-\bm{h}_{i}^{hs2}\bigr{\|}^{2}_{2}\,.

3.3. Weak-to-Strong Pseudo Supervision

To leverage the four predictions generated by a strongly augmented image, we feed the corresponding weakly augmented image into DSSN. Next, we use the prediction of the weak view to generate its pseudo label and supervise the four strong views. Given our dual-level structure, weak-to-strong pseudo supervision is also performed in both levels. Specifically, we use the pseudo labels of the weak view, denoted as $\hat{\bm{t}}$ , to supervise the predictions of the strong views, denoted as $\bm{y}$ ,.

The weak pseudo supervisions are obtained by

(15)		$\displaystyle\bm{y}^{lw}$	$\displaystyle={\rm softmax}(g(f(\bm{x}\|\theta^{\prime})\|\varphi^{\prime}))$
(16)		$\displaystyle\bm{y}^{hw}$	$\displaystyle={\rm softmax}(g(\bm{z}^{hw}\|\varphi))$

where $(\theta^{\prime},\varphi^{\prime})$ of the teacher are updated from the student $(\theta,\varphi)$ by the exponential moving average (EMA)

(17)

\displaystyle(\theta^{\prime},\varphi^{\prime})=\alpha(\theta^{\prime},\varphi^{\prime})+(1-\alpha)(\theta,\varphi)

where $\alpha$ is a momentum parameter, with $\alpha\in[0,1]$ .

The pseudo labels $\hat{\bm{t}}^{lw}$ and $\hat{\bm{t}}^{hw}$ of $\bm{y}^{lw}$ and $\bm{y}^{hw}$ are calculated by using Eqs. (2) and (3), respectively.

The output probability of the strong augmented views, $\bm{y}_{i}^{ls1}$ , $\bm{y}_{i}^{ls2}$ , $\bm{y}_{i}^{hs1}$ , and $\bm{y}_{i}^{hs2}$ , are attained by the softmax layer.

The weak-to-strong pseudo-supervision loss functions are

(18)		$\displaystyle\mathcal{L}^{(l)}_{\rm w2s}$	$\displaystyle=-\sum_{i}\sum_{j}m_{ij}^{lw}\left(\hat{t}_{ij}^{lw}\log y_{ij}^{ls1}+\hat{t}_{ij}^{lw}\log y_{ij}^{ls2}\right)$
(19)		$\displaystyle\mathcal{L}^{(h)}_{\rm w2s}$	$\displaystyle=-\sum_{i}\sum_{j}m_{ij}^{hw}\left(\hat{t}_{ij}^{hw}\log y_{ij}^{hs1}+\hat{t}_{ij}^{hw}\log y_{ij}^{hs2}\right)$

where $m_{ij}$ is a class-wise binary mask to select the pixel with high-confidence score and we show how to obtain it in the next section.

3.4. Class-aware pseudo-label generation

As shown in Fig. 1(c), we show the class-aware pseudo-label generation (CPLG). For the $i$ -th pixel, it has different probabilities belonging to different classes. $y_{ij}$ denotes the probability of the $i$ -th pixel belonging to the $j$ -th class. We observe all pixels in the same class, i.e., in the same channel of network output.

First, we find the pixel class-wisely that has the largest probability in the $j$ -th class,

(20)

y_{j}^{\max}=\max_{i}(y_{ij})\,,\forall\,j\in\mathcal{C}\,.

Second, we establish a class-wise threshold $\tau_{j}$ by multiplying the maximum probability by $r$ %. Pixels exceeding this class-wise threshold are selected. Additionally, we restrict the maximum probability by $\tau_{\rm low}$ and exclude pixels with a low maximum probability since they indicate lower prediction confidence. Thus, the class-wise threshold $\tau_{j}$ is determined by

(21)

\tau_{j}=\begin{cases}y_{j}^{\max}\cdot r\%,&{\rm if}\,y_{j}^{\max}>\tau_{\rm low}\\ y_{j}^{\max},&{\rm otherwise}\end{cases}

where the ratio $r$ and the low bound $\tau_{\rm low}$ are parameters.

Third, we select pixels in each class by $\tau_{j}$ , i.e., pixels exceeding $\tau_{j}$ are selected:

(22)

m_{ij}=\begin{cases}1,&{\rm if}\,y_{ij}>\tau_{j}\\ 0,&{\rm otherwise}\,.\end{cases}

The generation of the pseudo label is straightforward by using Eqs. (2) and (3). The refined class-aware pseudo labels are attained by multiplying them, i.e., $m_{ij}\hat{t}_{ij}$ , as used in Eqs. (18) and (19). Our CPLG strategy considers the learning status and difficulties of different classes by adjusting thresholds for each class. As a result, we select useful pixels with low thresholds for training, which enhances the accuracy of challenging classes.

3.5. Overall Algorithm

Fig. 2 illustrates how we combine two distinct learning strategies for the unlabeled images: contrastive learning and weak-to-strong pseudo supervision.

In this section, we present the DSSN algorithm, which is illustrated in Algorithm 1. It takes a small fraction of labeled data and a large fraction of unlabeled data as input to train the model. The supervised loss between the model prediction on labeled data and the ground truth is computed using Eq. (1). Subsequently, the low-level and high-level contrastive learning losses are calculated using Eqs. (10) and (14), respectively. We then compute the weak-to-strong pseudo-supervision loss using Eqs. (18) and (19). The overall loss term is formulated as follows:

(23)

\mathcal{L}=\mathcal{L}_{\rm sup}+\gamma_{1}\Bigl{(}\mathcal{L}_{\rm cl}^{ls}+\mathcal{L}_{\rm cl}^{hs}\Bigr{)}+\gamma_{2}\Bigl{(}\mathcal{L}^{(l)}_{\rm w2s}+\mathcal{L}^{(h)}_{\rm w2s}\Bigr{)},

where $\gamma_{1}$ and $\gamma_{2}$ are the trade-off weight. Finally, we update the student model and the teacher model by using the error back-propagation algorithm and EMA, respectively.

Algorithm 1 The DSSN algorithm.

1: Input:

\mathcal{D}=\{\mathcal{D}_{u},\mathcal{D}_{l}\}

, and batch size

b

2: Output:

(\theta^{\prime},\varphi^{\prime})

3: Initialization:

epoch=0

epoch_{\max}

, and

(\theta,\varphi)

4: while

epoch\leq{epoch_{\max}}

5: for mini-batch samples in

\mathcal{D}

6: Feed the samples into DSSN for forward propagation ;

7: Update

\mathcal{L}_{\rm sup}

by Eq. (1) ;

8: Update

\mathcal{L}_{\rm cl}^{ls}

and

\mathcal{L}_{\rm cl}^{hs}

by Eqs. (10) and (14) ;

9: Update

\mathcal{L}^{(l)}_{\rm w2s}

and

\mathcal{L}^{(h)}_{\rm w2s}

by Eqs. (18) and (19) ;

10: Update

\mathcal{L}

by Eq. (23) ;

11: Update

(\theta,\varphi)

by back propagation of

\sum_{b}\mathcal{L}

;

12: Update

(\theta^{\prime},\varphi^{\prime})

by Eq. (17) ;

13:

epoch=epoch+1

;

14: end for

15: end while

4. Experiments

In this section, we first present the details of the experiments. Second, we compare the proposed DSSN method to the recent state-of-the-art (SOTA) approaches to the SSS task. Third, we conduct extensive ablation experiments to demonstrate the effectiveness and robustness of the proposed method.

4.1. Experimental setup

Datasets. We evaluate the proposed method on two classical semantic segmentation datasets, i.e., PASCAL VOC 2012 (Everingham et al., 2015) and Cityscapes (Cordts et al., 2016). In particular, PASCAL VOC 2012 (Everingham et al., 2015) has 20 classes of objects and 1 class of background. The standard training, validation and test sets consist of 1,464, 1449 and 1,456 images, respectively. Following the previous work (Yang et al., 2022; Chen et al., 2021; Ke et al., 2020), we also use augmented set SBD (Hariharan et al., 2011) (9,118 images) and original training set (1,464 images) as our full training set (10,582 images). The labels from the SBD (Hariharan et al., 2011) dataset are noise-prone and of low quality. Cityscapes (Cordts et al., 2016) has 19 semantic classes and is mostly intended for understanding urban scenes. It consists of 500 validation images, 1,525 test images, and 2,975 training images. All of the images have well-annotated masks. For a fair comparison with the benchmarks, we follow the partition procedure of CPS (Chen et al., 2021). Specifically, the training set is divided into two partitions by randomly sampling 1/2, 1/4, 1/8, and 1/16 of the total set as the labeled samples and the remaining images as the unlabeled for the blended set.

Implementation details. Following the previous benchmarks CPS (Chen et al., 2021), we adopt DeepLab v3+ (Chen et al., 2018) based on ResNet (He et al., 2016) as the segmentation network for a fair comparison. The backbone i.e., ResNet, is initialized with the weights pre-trained on ImageNet (Deng et al., 2009). The segmentation heads are randomly initialized. During training, each mini-batch contains eight labeled and eight unlabeled images. The stochastic gradient descent (SGD) optimizer is used, and the initial learning rates are set to 0.002 and 0.005 for the PASCAL VOC 2012 and Cityscapes, respectively. In accordance with other works (Chen et al., 2021; Ouali et al., 2020), we employ the following polynomial to decrease the learning rate while training: $(1-{{epoch}}/{{epoch_{\max}}})^{0.9}$ . The model is trained for 100 epochs on PASCAL VOC 2012 and 240 epochs for Cityscapes. For weak augmentations, we adopt the same operation as ST++(Yang et al., 2022), where the training images are random flipping and resizing (between 0.5 and 2.0 times), followed by a random crop operation to the resolutions of 513 $\times$ 513 and 801 $\times$ 801 for the two datasets, respectively. We employ several strong augmentation, including random color-jitter, grayscale, Gaussian blur, etc. For strong feature augmentation, we apply a random dropout of 50% on features from the encoder. The unsupervised trade-off weights $\gamma_{1}$ and $\gamma_{2}$ are set as 0.01 and 0.25. In CPLG, $r$ is set to 96% and $\tau_{\rm low}$ is 0.92, respectively.

Additionally, we also apply CutMix (Yun et al., 2019) data augmentation to the student model images. The EMA smoothing factor $\alpha$ is set as 0.996. We follow U²PL (Wang et al., 2022), the supervised loss is cross-entropy on PASCAL, and for Cityscapes the cross-entropy loss is replaced by the online hard example mining loss.

Evaluation. We use the mean of Intersection-over-Union(mIoU) for the validation set to evaluate the segmentation performance for both datasets. Following the previous works (Chen et al., 2021; Yang et al., 2022), we employ the sliding evaluation to examine the efficacy of our model on the validation images from Cityscapes with a resolution of 1024×2048.

Table 1. Comparison with SOTAs with ResNet-101. Labeled images are from the original high-quality original training set of PASCAL VOC 2012.

Method	1/16(92)	1/8(183)	1/4(366)	1/2(732)	Full(1464)
Baseline	44.10	52.30	61.80	66.70	72.90
CutMix-Seg (French et al., 2019)	52.16	63.47	69.46	73.73	76.54
PseudoSeg (Zou et al., 2021)	57.60	65.50	69.14	72.41	73.23
PC²Seg (Zhong et al., 2021)	57.00	66.28	69.78	73.05	74.15
CPS (Chen et al., 2021)	64.07	67.42	71.71	75.88	-
ReCo (Liu et al., 2022b)	64.78	72.02	73.14	74.69	-
PS-MT (Liu et al., 2022a)	65.80	69.58	76.57	78.42	80.01
ST++ (Yang et al., 2022)	65.20	71.00	74.60	77.30	79.10
U²PL (Wang et al., 2022)	67.98	69.15	73.66	76.16	79.49
PCR (Xu et al., 2022)	70.06	74.71	77.16	78.49	80.65
GTA-Seg (Jin et al., 2022)	70.02	73.16	75.57	78.37	80.47
Unimatch (Yang et al., 2023)	75.20	77.20	78.80	79.90	81.20
DSSN	75.24	76.75	78.68	80.61	81.18

4.2. Comparison to SOTA Methods

Table 2. Comparison with the state-of-the-art methods on blended PASCAL VOC 2012 under different partition protocols.

Method	ResNet-50				ResNet-101
Method	1/16 (662)	1/8 (1323)	1/4 (2646)	1/2 (5291)	1/16 (662)	1/8 (1323)	1/4 (2646)	1/2 (5291)
Baseline	61.20	67.30	70.80	74.75	65.6	70.40	72.80	76.65
MT (Tarvainen and Valpola, 2017)	66.77	70.78	73.22	75.41	70.59	73.20	76.62	77.61
CutMix-Seg (French et al., 2019)	68.90	70.70	72.46	74.49	72.56	72.69	74.25	75.89
CCT (Ouali et al., 2020)	65.22	70.87	73.43	74.75	67.94	73.00	76.17	77.56
GCT (Ke et al., 2020)	64.05	70.47	73.45	75.20	69.77	73.30	75.25	77.14
CPS (Chen et al., 2021)	71.98	73.67	74.90	76.15	74.48	76.44	77.68	78.64
ST++ (Yang et al., 2022)	72.60	74.40	75.40	-	74.50	76.30	76.60	-
U²PL (Wang et al., 2022)	72.00	75.10	76.20	-	74.43	77.60	78.70	-
PS-MT (Liu et al., 2022a)	72.83	75.70	76.43	77.88	75.50	78.20	78.72	79.76
Unimatch (Yang et al., 2023)	75.80	76.90	76.80	-	78.10	78.40	79.20	-
DSSN	76.74	77.81	78.27	78.32	78.50	79.58	79.45	79.96

Table 3. Comparison with state-of-the-art on Cityscapes,

*

means the reproduced results in U²PL (Wang et al., 2022).

Method	ResNet-50				ResNet-101
Method	1/16 (186)	1/8 (372)	1/4 (744)	1/2 (1488)	1/16 (186)	1/8 (372)	1/4 (744)	1/2 (1488)
Baseline	63.30	70.20	73.10	76.60	66.30	72.80	75.00	78.00
MT (Tarvainen and Valpola, 2017)	66.14	72.03	74.47	77.43	68.08	73.71	76.53	78.59
CutMix-Seg (French et al., 2019)	-	-	-	-	72.13	75.83	77.24	78.95
CCT (Ouali et al., 2020)	66.35	72.46	75.68	76.78	69.64	74.48	76.35	78.29
GCT (Ke et al., 2020)	65.81	71.33	75.30	77.09	66.90	72.96	76.45	78.58
CPS ^∗ (Chen et al., 2021)	-	-	-	-	69.78	74.31	74.58	76.81
ST++ (Yang et al., 2022)	-	72.70	73.8	-	-	-	-	-
U²PL (Wang et al., 2022)	69.03	73.02	76.31	78.64	70.30	74.37	76.47	79.05
PS-MT (Liu et al., 2022a)	-	75.76	76.92	77.64	-	76.89	77.60	79.09
Unimatch (Yang et al., 2023)	75.00	76.80	77.50	78.60	76.60	77.90	79.20	79.50
DSSN	75.41	77.31	78.05	78.73	76.52	78.18	78.62	79.58

To demonstrate the superiority of our proposed DSSN method, we conduct a comparison with the current state-of-the-art methods across various settings. All results are reported on the validation set for both PASCAL VOC and Cityscapes datasets. Additionally, we present the corresponding baseline at the top of the table, representing the results of purely supervised learning trained on the same labeled data. To ensure a fair comparison, all methods employed the DeepLab v3+ architecture.

PASCAL VOC 2012. We report results of our experiments on the PASCAL VOC 2012 validation dataset in Tables 1 and 2, where we evaluate the mean Intersection over Union (mIoU) for different proportions of labeled samples. Additionally, we present the corresponding baseline at the top of the table, representing the results of purely supervised learning trained on the same labeled data.

Table 1 presents results on the classic PASCAL VOC 2012 dataset. It shows our method significantly outperforms current state-of-the-art methods. When employing ResNet-101 as the backbone, DSSN attains a 5.18% performance gain on the 1/16(92) split which surpass the performance obtain by the (1/3)183 data split in the prior study. Even with more labeled data, the performance differences become less evident; however, the proposed method still demonstrates performance improvements of 2.21% with 1/2 fine annotations over the previous SOTAs.

Table 2 illustrates the results on blender PASCAL VOC 2012 Dataset. Our method shows significant improvement on the 1/16, 1/8, 1/4, and 1/2 splits with ResNet-50, compared to the baseline, with improvements of 15.51%, 10.1%, 6.73%, and 3.57%, respectively. Similarly, with ResNet-101, our method achieves improvements of 12.9%, 9.18%, 6.65%, and 3.01% under the same partitions. Especially, our method shows significant improvements when the ratio of labeled data becomes smaller, such as under 1/8 or 1/16 partition protocols. In particular, when the labeled data is extremely limited,e.g., on the 1/16 partitions, our method achieves remarkable increases of 15.51% and 12.9% compared to the baseline with ResNet-50 and ResNet-101 as the backbone networks, respectively. Furthermore, our method demonstrates a considerable improvement over the previous state-of-the-art PS-MT (Liu et al., 2022a), achieving a margin of 3.88% with ResNet-50 as the backbone, and 1.7% under the 1/8 partition protocol.

Cityscapes. In Table 3, we can see that our method consistently outperforms the supervised baseline by a significant margin, achieving improvements of 12.11%, 7.11%, 4.95%, and 2.13% with ResNet-50 under 1/16, 1/8, 1/4, and 1/2 partition protocols, respectively. Similarly, with ResNet-101, our method shows improvements of 10.22%, 5.38%, 3.62%, and 1.58% under 1/16, 1/8, 1/4, and 1/2 partition protocols, respectively. Furthermore, our method outperforms all other state-of-the-art methods across various settings. Specifically, under 1/8, 1/4, and 1/2 partitions, DSSN achieves a 1.55%, 1.13%, and 1.09% improvement over the previous state-of-the-art PS-MT (Liu et al., 2022a) using ResNet-50, and a 1.29%, 1.02%, and 0.49% improvement using ResNet-101, respectively.

We evaluate DSSN using ResNet-50 on a 1/30 data split, which contained only 100 labeled images. As illustrated in Fig. 3, DSSN outperforms the current state-of-the-art significantly. This result indicates that our method effectively utilizes the unlabeled data through contrastive learning and the class-aware pseudo-label selection strategy (CPLG). Besides, although ReCo (Liu et al., 2022b) and U²PL(Wang et al., 2022) try to construct positive and negative pairs to use contrastive learning, the result shows our DSSN outperform them significantly.

Table 4. Ablation of contrastive learning and CPLG.

$\mathcal{L}_{\rm cl}^{ls}+\mathcal{L}_{\rm cl}^{hs}$	CPLG	mIoU
✗	✗	76.12
✗	✓	78.33
✓	✗	78.70
✓	✓	79.58

Table 5. Ablation of low- and high-level contrastive learning.

$\mathcal{L}_{\rm cl}^{ls}$	$\mathcal{L}_{\rm cl}^{hs}$	mIoU
✗	✗	78.33
✗	✓	78.90
✓	✗	79.19
✓	✓	79.58

Upon comparing performance on classic PASCAL VOC 2012 and blended training set, we observe that the quality of labeled samples is important. For example, DSSN achieves an exceptional performance of 80.61% by utilizing only 732 high-quality labels. However, even with significantly more labels (5291) from the blended dataset, a comparable score of 80.61% cannot be achieved.

4.3. Ablation Studies

In this subsection, we discuss the contribution of each component to our framework using ResNet-101 and a 1/8 labeled ratio on PASCAL VOC 2012 dataset.

Effectiveness of the DSSN components. We conduct a step-by-step ablation study of each component to comprehensively assess their effectiveness. Table 4 presents the results of our study. Without our proposed dual-Level contrastive learning and CPLG, applying a plain consistency method yields an accuracy of 76.12%. However, employing dual-level contrastive learning leads to an accuracy of 78.33%, while the proposed CPLG results in 78.70%. Combining both dual-level contrastive learning and CPLG produces the highest accuracy of 79.58%, demonstrating the effectiveness of each component in the proposed DSSN method.

Effectiveness of contrastive Learning. In our study, we incorporate both low-level and high-level contrastive learning in our dual-level contrastive learning approach. Table 5 presents the results of our study. Without the use of both low-level contrastive and high-level contrastive, the accuracy was 78.33%. Using low-level contrastive alone results in a 0.57% improvement, while using high-level contrastive alone improves the accuracy by 0.86%. Notably, using both low-level and high-level contrastive further improves the accuracy by 1.25%, which shows the efficacy of our method.

Effectiveness of CPLG. As discussed in §3.4, the CPLG strategy considers difficulties of different classes and long-tailed classes, instead of using a fixed threshold during the pseudo-label generation. To test our method against a fixed threshold, we conduct experiments using a fixed threshold. Fig. 4 shows that our strategy outperforms using a fixed threshold of 0.96 and 0.92 since we set $r$ to 0.96 and $\tau_{\rm low}$ to 0.92 in CPLG. This finding further highlights the effectiveness of our proposed DSSN method. We chose these specific thresholds because, following our experiments, we establish 0.92 as the lowest threshold and used 0.96 as the factor for the maximum probability value. Additionally, Fig. 5 presents mIoU values of classes with long tails and those that are hard to learn during training, which demonstrates the effectiveness of CPLG strategy.

Qualitative Results. In Figs. 6 and 7, we present the qualitative results of our study on the PASCAL VOC 2012 validation set. DSSN is based on the DeepLab v3+ with ResNet-101 network and a 1/8 ratio. The integration of contrastive learning into our method improve the performance of our model for contour and ambiguous regions, while also enhancing the accuracy of some scenarios, as illustrated in Fig. 6. Furthermore, our proposed CPLG achieved substantial precision in certain classes that are typically challenging to learn, as illustrated in Fig. 7.

5. Conclusion

In this paper, we introduce DSSN, a novel method that utilizes pixel-wise contrastive learning to address the SSS problem. DSSN is equipped with a dual-level structure that can effectively leverage unlabeled data. In DSSN, both contrastive learning and weak-to-strong consistency learning are utilized to maximize the utilization of available unlabeled data. Furthermore, we propose a class-aware pseudo label selection strategy that generates high-quality pseudo labels and significantly improves performance on long-tailed classes without incurring additional computation. DSSN achieves state-of-the-art performance on two benchmarks, and the effectiveness of our proposed novelties is confirmed by the ablation study.

Acknowledgements.

This work was supported by the National Natural Science Foundation of China under the Grant No. 62176108, Natural Science Foundation of Qinghai Province of China under No. 2022-ZJ-929, Fundamental Research Funds for the Central Universities under Nos. lzujbky-2021-ct09 and lzujbky-2022-ct06, Natural Science Foundation of Shandong Province of China, No. ZR2021QF017, and Supercomputing Center of Lanzhou University.

References

(1)
Berthelot et al. (2019) David Berthelot, Nicholas Carlini, Ian Goodfellow, Nicolas Papernot, Avital Oliver, and Colin A Raffel. 2019. MixMatch: A holistic approach to semi-supervised learning. NeurIPS 32 (2019).
Chen et al. (2018) Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. 2018. Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV. Springer, 801–818.
Chen et al. (2021) Xiaokang Chen, Yuhui Yuan, Gang Zeng, and Jingdong Wang. 2021. Semi-supervised semantic segmentation with cross pseudo supervision. In CVPR. IEEE, 2613–2622.
Chopra et al. (2005) S. Chopra, R. Hadsell, and Y. LeCun. 2005. Learning a similarity metric discriminatively, with application to face verification. In CVPR. IEEE, 539–546.
Cordts et al. (2016) Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. 2016. The cityscapes dataset for semantic urban scene understanding. In CVPR. IEEE, 3213–3223.
Creswell et al. (2018) Antonia Creswell, Tom White, Vincent Dumoulin, Kai Arulkumaran, Biswa Sengupta, and Anil A Bharath. 2018. Generative adversarial networks: An overview. IEEE Signal Processing Magazine 35, 1 (2018), 53–65.
Deng et al. (2009) Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. Imagenet: A large-scale hierarchical image database. In CVPR. IEEE, 248–255.
DeVries and Taylor (2017) Terrance DeVries and Graham W Taylor. 2017. Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552 (2017).
Everingham et al. (2015) Mark Everingham, SM Ali Eslami, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. 2015. The pascal visual object classes challenge: A retrospective. IJCV 111 (2015), 98–136.
Feng et al. (2022) Zhengyang Feng, Qianyu Zhou, Qiqi Gu, Xin Tan, Guangliang Cheng, Xuequan Lu, Jianping Shi, and Lizhuang Ma. 2022. DMT: Dynamic mutual training for semi-supervised learning. Pattern Recognition 130 (2022), 108777.
French et al. (2019) Geoff French, Samuli Laine, Timo Aila, Michal Mackiewicz, and Graham Finlayson. 2019. Semi-supervised semantic segmentation needs strong, varied perturbations. In BMVC.
Grill et al. (2020) Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, et al. 2020. Bootstrap your own latent-a new approach to self-supervised learning. In NeurIPS, Vol. 33. 21271–21284.
Hadsell et al. (2006) Raia Hadsell, Sumit Chopra, and Yann LeCun. 2006. Dimensionality reduction by learning an invariant mapping. In CVPR. IEEE, 1735–1742.
Hariharan et al. (2011) Bharath Hariharan, Pablo Arbeláez, Lubomir Bourdev, Subhransu Maji, and Jitendra Malik. 2011. Semantic contours from inverse detectors. In ICCV. IEEE, 991–998.
He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In CVPR. IEEE, 770–778.
Hjelm et al. (2019) R Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Phil Bachman, Adam Trischler, and Yoshua Bengio. 2019. Learning deep representations by mutual information estimation and maximization. In ICLR.
Hung et al. (2018) Wei-Chih Hung, Yi-Hsuan Tsai, Yan-Ting Liou, Yen-Yu Lin, and Ming-Hsuan Yang. 2018. Adversarial learning for semi-supervised semantic segmentation. In BMVC.
Jin et al. (2022) Ying Jin, Jiaqi Wang, and Dahua Lin. 2022. Semi-supervised semantic segmentation via gentle teaching assistant. In NeurIPS, Vol. 35. 2803–2816.
Ke et al. (2020) Zhanghan Ke, Di Qiu, Kaican Li, Qiong Yan, and Rynson WH Lau. 2020. Guided collaborative training for pixel-wise semi-supervised learning. In ECCV. Springer, 429–445.
Laine and Aila (2017) Samuli Laine and Timo Aila. 2017. Temporal ensembling for semi-supervised learning. In ICLR.
Liu et al. (2022b) Shikun Liu, Shuaifeng Zhi, Edward Johns, and Andrew J Davison. 2022b. Bootstrapping semantic segmentation with regional contrast. In ICLR.
Liu et al. (2022a) Yuyuan Liu, Yu Tian, Yuanhong Chen, Fengbei Liu, Vasileios Belagiannis, and Gustavo Carneiro. 2022a. Perturbed and strict mean teachers for semi-supervised semantic segmentation. In CVPR. IEEE, 4258–4267.
Menon et al. (2021) Aditya Krishna Menon, Sadeep Jayasumana, Ankit Singh Rawat, Himanshu Jain, Andreas Veit, and Sanjiv Kumar. 2021. Long-tail learning via logit adjustment. In ICLR.
Mittal et al. (2019) Sudhanshu Mittal, Maxim Tatarchenko, and Thomas Brox. 2019. Semi-supervised semantic segmentation with high-and low-level consistency. TPAMI 43, 4 (2019), 1369–1379.
Oord et al. (2018) Aaron van den Oord, Yazhe Li, and Oriol Vinyals. 2018. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748 (2018).
Ouali et al. (2020) Yassine Ouali, Céline Hudelot, and Myriam Tami. 2020. Semi-supervised semantic segmentation with cross-consistency training. In CVPR. IEEE, 12674–12684.
Sohn et al. (2020) Kihyuk Sohn, David Berthelot, Nicholas Carlini, Zizhao Zhang, Han Zhang, Colin A Raffel, Ekin Dogus Cubuk, Alexey Kurakin, and Chun-Liang Li. 2020. FixMatch: Simplifying semi-supervised learning with consistency and confidence. In NeurIPS, Vol. 33. 596–608.
Tarvainen and Valpola (2017) Antti Tarvainen and Harri Valpola. 2017. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In NeurIPS, Vol. 30.
Walker (1969) A. M. Walker. 1969. On the Asymptotic Behaviour of Posterior Distributions. Journal of the Royal Statistical Society: Series B (Methodological) 31, 1 (1969), 80–88.
Wang et al. (2022) Yuchao Wang, Haochen Wang, Yujun Shen, Jingjing Fei, Wei Li, Guoqiang Jin, Liwei Wu, Rui Zhao, and Xinyi Le. 2022. Semi-supervised semantic segmentation using unreliable pseudo-labels. In CVPR. IEEE, 4248–4257.
Xie et al. (2020) Qizhe Xie, Zihang Dai, Eduard Hovy, Thang Luong, and Quoc Le. 2020. Unsupervised data augmentation for consistency training. NeurIPS 33 (2020), 6256–6268.
Xu et al. (2022) Haiming Xu, Lingqiao Liu, Qiuchen Bian, and Zhen Yang. 2022. Semi-supervised semantic segmentation with prototype-based consistency regularization. In NeurIPS, Vol. 35. 26007–26020.
Yang et al. (2023) Lihe Yang, Lei Qi, Litong Feng, Wayne Zhang, and Yinghuan Shi. 2023. Revisiting weak-to-strong consistency in semi-supervised semantic segmentation. In CVPR.
Yang et al. (2022) Lihe Yang, Wei Zhuo, Lei Qi, Yinghuan Shi, and Yang Gao. 2022. ST++: Make self-training work better for semi-supervised semantic segmentation. In CVPR. IEEE, 4268–4277.
Yun et al. (2019) Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, and Youngjoon Yoo. 2019. Cutmix: Regularization strategy to train strong classifiers with localizable features. In ICCV. IEEE, 6023–6032.
Zhong et al. (2021) Yuanyi Zhong, Bodi Yuan, Hong Wu, Zhiqiang Yuan, Jian Peng, and Yu-Xiong Wang. 2021. Pixel contrastive-consistent semi-supervised semantic segmentation. In ICCV. IEEE, 7273–7282.
Zou et al. (2021) Yuliang Zou, Zizhao Zhang, Han Zhang, Chun-Liang Li, Xiao Bian, Jia-Bin Huang, and Tomas Pfister. 2021. PseudoSeg: Designing pseudo labels for semantic segmentation. In ICLR.