
Space Engage: Collaborative Space Supervision for Contrastive-based Semi-Supervised Semantic Segmentation

Changqi Wang1, Haoyu Xie1, Yuhui Yuan3, Chong Fu1,4, Xiangyu Yue2
(equal contribution)
1 School of Computer Science and Engineering, Northeastern University, Shenyang, China
2 The Chinese University of Hong Kong 3 Microsoft Research Asia
4 Key Laboratory of Intelligent Computing in Medical Image, Ministry of Education, NEU, China
Abstract

Semi-Supervised Semantic Segmentation (S4) aims to train a segmentation model with limited labeled images and a substantial volume of unlabeled images. To improve the robustness of representations, powerful methods introduce pixel-wise contrastive learning in latent space (i.e., representation space), which aggregates representations toward their prototypes in a fully supervised manner. However, previous contrastive-based S4 methods merely rely on supervision from the model's output (logits) in logit space during unlabeled training. In contrast, we utilize the outputs in both logit space and representation space to obtain supervision in a collaborative way. The supervision from the two spaces plays two roles: 1) it reduces the risk of over-fitting to incorrect semantic information in the logits with the help of representations; 2) it enhances the knowledge exchange between the two spaces. Furthermore, unlike previous approaches, we use the similarity between representations and prototypes as a new indicator to tilt training toward under-performing representations, achieving a more efficient contrastive learning process. Results on two public benchmarks demonstrate the competitive performance of our method compared with state-of-the-art methods.

1 Introduction

Semantic segmentation is a fundamental task in computer vision, aiming to classify each pixel in an image. Significant progress [20, 3] has been made in training on high-quality labeled images using segmentation models composed of an encoder and a segmentation head. However, annotating images is expensive and time-consuming. Semi-supervised Semantic Segmentation (S4) alleviates the thirst for annotation by leveraging unlabeled images to train segmentation models.

Figure 1: We enhance the knowledge exchange between the logit and representation spaces. Orange and blue represent different classes. Top: Existing contrastive-based S4 methods overlook the semantic information in representation space. Bottom: Our method uses dual-space collaborative supervision.

Most existing works learn from unlabeled images via self-training [42, 35, 13] or consistency regularization [38, 36, 11] strategies, both of which retrain the model with its predictions on unlabeled images. Recently, great success has been achieved by introducing pixel-wise contrastive learning to semantic segmentation, which endows the model with a stronger feature-extraction ability by accessing a more discriminative representation space. Specifically, these methods [43, 21] project each pixel into representation space as a representation and regularize it in a fully supervised manner, i.e., aggregating representations of the same class and separating those of different classes. In semi-supervised settings, due to limited labels, most methods [1, 30, 44] obtain supervision from the model's output logits in logit space during the unlabeled training process. However, recent contrastive-based semantic segmentation methods [1, 30, 44, 43] mainly focus on the learning process in logit space and treat that in representation space only as an auxiliary task. This unidirectional supervision makes training dominated by the predicted logits, neglecting the information in the representation space. We argue that such single-space supervision may provide incorrect semantic guidance to representation learning and fail to facilitate knowledge exchange between the two spaces (see Sec. 5.1).

In this work, we extend the single-space supervision to dual-space supervision for contrastive-based S4 and propose Collaborative Space Supervision (CSS). Our key insight is to: i) utilize the semantic information in representations to obtain more reliable guidance during unlabeled training and to enhance the knowledge exchange between the two spaces; ii) provide a more accurate reference for the model's performance on each representation so as to tilt training toward under-performing representations. To achieve objective i), we obtain dense semantic predictions by retrieving the nearest class prototype for each representation in the representation space and combine them with the predictions from the logits to supervise the model collaboratively. For objective ii), we measure the similarity between representations and prototypes and use the normalized similarity as the indicator to guide the learning process in the representation space. Unlike previous works that use confidence as the indicator for representation learning, the similarity directly reflects the confusion level between representations and prototypes, resulting in more efficient representation learning.

To summarize, our main contributions are three-fold: 1) We propose dual-space collaboration for contrastive-based S4, which enhances the knowledge exchange between the logit and representation spaces. 2) We utilize similarity to provide a more accurate reference for the model's performance in representation learning. 3) Extensive experiments on two S4 benchmarks demonstrate the effectiveness of our method.

2 Related Works

2.1 Semi-supervised Semantic Segmentation

The aim of S4 is to train a segmentation model under the semi-supervised setting (i.e., a few labeled images and a large number of unlabeled images) to classify each pixel of an image. The critical issue in S4 is how to leverage unlabeled images to train the model. Some methods [23, 25, 28, 33] are based on GANs [14], adversarial training [34], or the consistency regularization paradigm [36, 11, 38, 6, 54]. Meanwhile, self-training [26, 42, 47, 58, 51] is also a prominent paradigm, which generates pseudo-labels from the model and retrains the model with the combined supervision of human annotations and pseudo-labels. One essential issue in self-training is the accuracy of the pseudo-labels. Some methods [27, 31, 12, 40, 50] try to polish pseudo-labels and provide reliable guidance. Others [19, 45, 22, 15] focus on class imbalance in the dataset and try to alleviate the negative effect of class-biased pseudo-labels generated by a model pre-trained on imbalanced labeled images. We build our framework on self-training and additionally explore semantic information across different images.

2.2 Pixel-wise Contrastive Learning

Pixel-wise contrastive learning explores semantic relations not only within an individual image but also among different images. Different from instance-wise contrastive learning [17, 5, 2], pixel-wise contrastive learning [48, 53, 4, 56] projects each pixel into representation space as a representation with the cooperation of an encoder and a representation head. Representations are then aggregated toward their prototypes and separated from representations of other classes. In semi-supervised settings, most methods [1, 30, 44, 46] use pseudo-labels based on logits to provide semantic information for the contrastive learning process during training on unlabeled images. Meanwhile, the logit confidence is used as an indicator to guide the contrastive learning process, e.g., [30] contrasts the hard representations whose corresponding logit confidence is lower than a threshold for effective training. As opposed to the above methods, we use collaborative space supervision for contrastive learning on unlabeled images and use a new indicator to guide the contrastive learning process.

2.3 Prototype-based Learning

Prototype-based learning has been widely studied in few-shot learning [39, 9, 29, 32] and unsupervised domain adaptation [52, 24, 41, 37, 55]. Recently, it has been revisited in semantic segmentation in the form of a non-parametric prototype-based classifier [57]. Concretely, the classes in the dataset are represented by a set of non-learnable prototypes, and dense semantic predictions are achieved by assigning each output feature to its most similar prototype. Under semi-supervised settings, some methods [49] maintain consistency between the predictions of a linear predictor and a prototype-based predictor. The two predictors follow the encoder and project the features into logit space and representation space, respectively. In this work, we combine the semantic information in the logit and representation spaces to provide supervision in a collaborative way during semi-supervised learning.

3 Methodology

In the S4 task, we have a small labeled set $\mathcal{D}_{l}=\{(\bm{x}_{i}^{l},\bm{y}_{i}^{l})\}_{i=1}^{N_{l}}$ and a large unlabeled set $\mathcal{D}_{u}=\{\bm{x}_{i}^{u}\}_{i=1}^{N_{u}}$, where $\bm{x}_{i}^{l},\bm{x}_{i}^{u}\in\mathbb{R}^{H\times W\times 3}$, and $H$, $W$ denote the height and the width, respectively. The ground truth is $\bm{y}_{i}^{l}\in\{0,1\}^{H\times W\times\lvert C\rvert}$ with the set of classes $C$. The goal is to boost model performance with $\mathcal{D}_{u}$. The base model consists of an encoder $f(\cdot)$ and a segmentation head $g(\cdot)$, which projects features to the logit space $\mathbb{R}^{H\times W\times\lvert C\rvert}$. We adopt Self-Training (ST) and pixel-wise contrastive learning in our framework, as described in Sec. 3.1. The supervision for $\mathcal{D}_{u}$ is produced by the collaboration between the logit and representation spaces, as described in Sec. 3.3.

3.1 ST and Pixel-wise Contrastive Learning

The main idea of self-training is to pre-train a model on labeled images and use it to produce pseudo-labels as supervision for unlabeled images. A typical framework is the teacher-student framework [42], which consists of a student model and a teacher model. Both models are constructed from an encoder and a segmentation head. Parameters of the student model are optimized via Stochastic Gradient Descent (SGD), while parameters of the teacher model are updated by the Exponential Moving Average (EMA) of the student model parameters. We denote the encoder and the segmentation head in the student model by $f(\cdot)$ and $g(\cdot)$, and those in the teacher model by $f^{\prime}(\cdot)$ and $g^{\prime}(\cdot)$. The pseudo-labels $\hat{\bm{y}}_{i}^{u,lgt}$ are produced from the teacher model's output logits $\hat{\bm{p}}^{u}_{i}=g^{\prime}(f^{\prime}(\bm{x}^{u}_{i}))$ in logit space, formulated as:

$$\hat{\bm{y}}_{i}^{u,lgt}=\bm{1}_{c}\big(\mathop{\arg\max}_{c}\{\hat{\bm{p}}_{i,c}^{u}\}_{c\in C}\big), \qquad (1)$$

where $\bm{1}_{c}(\cdot)$ denotes the one-hot encoding of class $c$.
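For concreteness, the sketch below is our own PyTorch-style illustration of this step, not the authors' released code; the function names, tensor shapes, and the EMA momentum value are assumptions. It computes the Eq. 1 pseudo-labels together with the logit-space confidence (used later as the indicator $\hat{j}_{i}^{u,lgt}$) and the standard EMA update of the teacher parameters.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def logit_pseudo_labels(teacher_logits):
    """Eq. (1): one-hot pseudo-labels from the teacher's output logits.

    teacher_logits: (B, C, H, W) logits p_hat^u = g'(f'(x^u)).
    Returns the one-hot pseudo-labels and the per-pixel confidence
    (softmax probability of the predicted class), later used as the
    logit-space indicator j^{u,lgt}.
    """
    probs = F.softmax(teacher_logits, dim=1)        # per-pixel class distribution
    confidence, hard_label = probs.max(dim=1)       # both of shape (B, H, W)
    one_hot = F.one_hot(hard_label, num_classes=teacher_logits.size(1))
    return one_hot.permute(0, 3, 1, 2).float(), confidence

@torch.no_grad()
def ema_update(teacher, student, momentum=0.99):
    """EMA update of the teacher parameters from the student (momentum value is illustrative)."""
    for t_p, s_p in zip(teacher.parameters(), student.parameters()):
        t_p.mul_(momentum).add_(s_p, alpha=1.0 - momentum)
```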

In order to enhance the model's ability to extract features, recent works [30, 44, 1] additionally employ pixel-wise contrastive learning and introduce a representation head to both the teacher and student models. We denote the representation head in the student model as $h(\cdot)$ and that in the teacher model as $h^{\prime}(\cdot)$. A pixel $\bm{x}_{i}$ of class $c$ is projected to a representation $\bm{z}_{ci}$ in representation space by the cooperation of $f(\cdot)$ and $h(\cdot)$, i.e., $\bm{z}_{ci}=h(f(\bm{x}_{i}))$. The representation $\bm{z}_{ci}$ is then aggregated toward its class centroid (prototype) and separated from representations of different classes $\bm{z}_{\tilde{c}i}$ (negatives). The semantic guidance for contrastive learning comes from the combination of the ground truth $\bm{y}_{i}^{l}$ and the pseudo-labels $\hat{\bm{y}}_{i}^{u,lgt}$ in logit space. Moreover, in order to emphasize the reliable and crucial pixels during unlabeled and contrastive learning, a sampling strategy is adopted to select valid pixels $\bm{x}_{i}$ according to their confidence, i.e., the student model's output logits $\bm{p}_{i}$ after a Softmax operation.

Discussion. In recent works [44, 1, 30], the supervision of unlabeled images is derived solely from the logit space. This overlooks the potential benefits of supervision from the representation space and leads to two limitations: 1) the pseudo-labels $\hat{\bm{y}}_{i}^{u,lgt}$ obtained from the logit space may contain noise and miss the opportunity to be corrected by semantic information from the representation space; 2) since the confidence from the logit space is used as the indicator $\hat{j}_{i}$ for the sampling strategy, learning in the representation space may not focus on its truly critical parts, because the two spaces are confused by different parts of the features.

To mitigate these limitations, we produce pseudo-labels from the representation space and combine them with pseudo-labels from the logit space to provide higher-quality supervision during unlabeled training. Meanwhile, we obtain a new indicator from the representation space for a more effective sampling strategy.

Figure 2: Overview of our framework. Our training pipeline consists of learning in two spaces: logit space and representation space. The pseudo-labels $\hat{\bm{y}}_{i}^{u}$ during unlabeled training are produced by the collaboration of the two spaces with the mix pseudo-labeling strategy (1) or the cross pseudo-labeling strategy (2). The indicator for representation learning is produced by similarity ($s_{1}$, $s_{2}$, and $s_{3}$).

3.2 Supervision from Representation Space

In this section, we detail how to obtain pseudo-labels from the representation space. At the same time, we obtain a new indicator for the sampling strategy in the representation space, which provides a critical reference for the contrastive learning process.

Specifically, we first build a set of class prototypes and obtain the pseudo-labels by retrieving the nearest prototype for each representation. We calculate the centroid of all representations in the current class $c$ as the prototype $\bm{\rho}_{c}$, which is formulated as:

$$\bm{\rho}_{c}=\frac{1}{N_{c}}\sum_{i}^{N_{c}}\bm{z}^{\prime}_{i}, \qquad (2)$$

where $N_{c}$ is the total number of representations of the current class $c$ and $\bm{z}^{\prime}_{i}$ is the representation projected by the cooperation of $f^{\prime}(\cdot)$ and $h^{\prime}(\cdot)$. Meanwhile, to include more representation information, we update the prototype over sequential iterations with EMA as follows:

$$\hat{\bm{\rho}}_{c}(t)=\alpha\hat{\bm{\rho}}_{c}(t-1)+(1-\alpha)\bm{\rho}_{c}(t), \qquad (3)$$

where $\hat{\bm{\rho}}_{c}(t)$ and $\hat{\bm{\rho}}_{c}(t-1)$ denote the prototype at the current iteration $t$ and the previous iteration $t-1$, $\bm{\rho}_{c}(t)$ is the prototype calculated by Eq. 2 in the current iteration, and $\alpha$ is a hyper-parameter that controls the updating speed. The pseudo-label from the representation space is then obtained by:

$$\hat{\bm{y}}^{u,rep}_{i}=\bm{1}_{\hat{c}}(\hat{c}),\quad \text{with}\ \hat{c}=\mathop{\arg\max}_{c}\{sim(\bm{z}^{\prime}_{i},\hat{\bm{\rho}}_{c}(t))\}_{c\in C}, \qquad (4)$$

where $sim(\cdot)$ is defined as the cosine similarity.
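A minimal sketch of Eqs. 2 and 3 is given below; the function signature, the per-iteration batch update, and the momentum value are our own assumptions made for illustration.

```python
import torch

@torch.no_grad()
def update_prototypes(prototypes, reps, labels, alpha=0.99):
    """Eqs. (2)-(3): class centroids of teacher representations, smoothed by EMA.

    prototypes: (|C|, D) running prototypes rho_hat_c(t-1)
    reps:       (N, D)   teacher representations z'_i in the current iteration
    labels:     (N,)     class index assigned to each representation
    alpha:      EMA momentum controlling the updating speed (illustrative value)
    """
    for c in range(prototypes.size(0)):
        mask = labels == c
        if mask.any():
            rho_c = reps[mask].mean(dim=0)                               # Eq. (2)
            prototypes[c] = alpha * prototypes[c] + (1 - alpha) * rho_c  # Eq. (3)
    return prototypes
```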

As for the indicator for the sampling strategy in the representation space, we apply the Softmax function to the similarities between the representation and all prototypes, as follows:

$$\hat{j}^{u,rep}_{i}=\frac{e^{sim(\bm{z}_{ci},\hat{\bm{\rho}}_{c}(t))/\tau}}{e^{sim(\bm{z}_{ci},\hat{\bm{\rho}}_{c}(t))/\tau}+\sum_{\tilde{c}\in\tilde{C}}e^{sim(\bm{z}_{ci},\hat{\bm{\rho}}_{\tilde{c}}(t))/\tau}}, \qquad (5)$$

where $\hat{\bm{\rho}}_{\tilde{c}}(t)$ denotes a prototype of a class different from that of $\bm{z}_{ci}$ and $\tau$ is a hyper-parameter. Different from using the confidence from the logit space as the indicator to guide representation learning [30, 44], the Softmax similarity directly helps the model discover the confusion between representations and their prototypes and focus on it during subsequent training.
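The sketch below illustrates Eqs. 4 and 5 together; normalizing to unit length so that a dot product equals cosine similarity, and reading the indicator at the assigned class, are simplifications we introduce for clarity rather than details taken from the paper.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def representation_pseudo_labels(reps, prototypes, tau=0.1):
    """Eqs. (4)-(5): nearest-prototype pseudo-labels and the similarity indicator.

    reps:       (N, D)   teacher representations z'_i
    prototypes: (|C|, D) EMA prototypes rho_hat_c(t)
    tau:        temperature (illustrative value)
    """
    sim = F.normalize(reps, dim=1) @ F.normalize(prototypes, dim=1).T     # cosine similarity, (N, |C|)
    pseudo_label = sim.argmax(dim=1)                                      # Eq. (4): retrieve the nearest prototype
    soft_sim = F.softmax(sim / tau, dim=1)                                # Eq. (5): softmax over all prototypes
    indicator = soft_sim.gather(1, pseudo_label.unsqueeze(1)).squeeze(1)  # j^{u,rep} at the assigned class
    return pseudo_label, indicator
```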

3.3 Collaboration Between Two Spaces

With the pseudo-labels in two spaces, we propose two pseudo-labeling strategies to strengthen the collaboration between two spaces and obtain more reliable pseudo-labels.

  • Mix pseudo-labeling. To mitigate the negative effects of the inherent noise in both spaces during pseudo-labeling, we adopt a mix pseudo-labeling strategy that only considers the mutually agreeable pseudo-labels between the two spaces. Specifically, we define the set of final pseudo-labels as $\hat{Y}^{u}=\hat{Y}^{u,lgt}\cap\hat{Y}^{u,rep}$, where $\hat{\bm{y}}_{i}^{u,lgt}\in\hat{Y}^{u,lgt}$ and $\hat{\bm{y}}_{i}^{u,rep}\in\hat{Y}^{u,rep}$.

  • Cross pseudo-labeling. Inspired by recent research [6, 36] that maintains consistency among the predictions of the same image across different models or decoders under different views, we propose a cross pseudo-labeling strategy that leverages pseudo-labels from one space to supervise the other. Specifically, we use the pseudo-labels $\hat{\bm{y}}_{i}^{u,rep}$ to supervise the logit space, and vice versa; a simplified sketch of both strategies follows this list.
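The sketch below illustrates the two strategies on hard pseudo-label maps; the function name, the ignore index of 255, and the return convention are our own assumptions, not the paper's implementation.

```python
import torch

def combine_pseudo_labels(lgt_label, rep_label, strategy="mix", ignore_index=255):
    """Collaborative pseudo-labeling between the two spaces (simplified sketch).

    lgt_label, rep_label: (B, H, W) hard pseudo-labels from the logit space
    and the representation space. Returns the supervision target for the
    logit branch and for the representation (contrastive) branch.
    """
    if strategy == "mix":
        # keep only pixels on which the two spaces agree; the rest are ignored
        agree = lgt_label == rep_label
        target = torch.where(agree, lgt_label, torch.full_like(lgt_label, ignore_index))
        return target, target
    if strategy == "cross":
        # each space is supervised by the pseudo-labels produced in the other space
        return rep_label, lgt_label
    raise ValueError(f"unknown strategy: {strategy}")
```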

The strengths of using pseudo-labels from the two spaces in a collaborative way are twofold: 1) obtaining more reliable supervision during unlabeled training, and 2) allowing the strengths of learning in the two spaces to complement each other. Since learning in the two feature spaces concentrates on different parts of the features, i.e., the logit space mainly focuses on the most discriminative part of the features while the representation space treats all parts equally, the quality of pseudo-labels from the two spaces varies across classes and image regions. Therefore, our collaborative pseudo-labeling strategies exchange knowledge between the two spaces and provide higher-quality supervision during unlabeled training. The experimental proof is in Sec. 5.1.

As for indicators, we use confidence as the indicator $\hat{j}_{i}^{u,lgt}$ for learning in the logit space and the Softmax similarity as the indicator $\hat{j}_{i}^{u,rep}$ for learning in the representation space. We argue that the confusing parts differ between the two spaces because each space concentrates on different parts of the features. This strategy therefore allows the learning in each space to focus on its own confusing parts, which is more effective than mining the confusing parts of both spaces with a single indicator during training. The experimental proof is in Sec. 5.2.

3.4 Training Objective

With the indicators $\hat{j}_{i}^{u,lgt}$ and $\hat{j}_{i}^{u,rep}$, we adopt threshold-based sampling strategies. In logit space, we set a threshold $\delta_{u}$ during unlabeled learning, and logits $\hat{\bm{p}}_{i}^{u}$ whose indicator $\hat{j}_{i}^{u,lgt}$ is higher than $\delta_{u}$ are regarded as valid. In representation space, our sampling strategy consists of three parts: 1) Valid Sampling Strategy. Similar to the sampling strategy in logit space, a threshold $\delta_{w}$ is used to sample representations whose indicator $\hat{j}_{i}^{u,rep}$ is higher than $\delta_{w}$. 2) Hard Sampling Strategy. We adopt a hard sampling strategy to tilt training toward confusing representations. Specifically, we set a threshold $\delta_{s}$ to sample representations whose indicator $\hat{j}_{i}^{u,rep}$ is lower than $\delta_{s}$. 3) Negative Sampling Strategy. We sample negatives according to the similarity between the prototype of the current class $c$ and the other prototypes. Concretely, negatives are more likely to be sampled if their prototype is more similar to the prototype of the current class.
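A simplified sketch of the three sampling strategies follows; the threshold values and the choice to sample negative classes (rather than individual negative representations) by prototype similarity are assumptions made for brevity.

```python
import torch
import torch.nn.functional as F

def sample_representations(indicator, delta_w=0.9, delta_s=0.3):
    """Valid / hard sampling in representation space (threshold values are illustrative).

    indicator: (N,) softmax-similarity indicator j^{u,rep} for each representation.
    """
    valid_mask = indicator > delta_w   # reliable representations, kept for training
    hard_mask = indicator < delta_s    # confusing representations; training is tilted toward them
    return valid_mask, hard_mask

def sample_negative_classes(prototypes, current_class, num_neg=256):
    """Negative sampling: classes whose prototypes are more similar to the
    current class's prototype are more likely to provide negatives."""
    protos = F.normalize(prototypes, dim=1)
    sim = protos @ protos[current_class]    # (|C|,) prototype-to-prototype similarity
    sim[current_class] = float("-inf")      # never sample the current class itself
    probs = torch.softmax(sim, dim=0)       # more similar (harder) classes get higher probability
    return torch.multinomial(probs, num_neg, replacement=True)
```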

Cooperating with the ground truth $\bm{y}^{l}_{i}$, the pseudo-labels $\hat{\bm{y}}^{u}_{i}$ produced from the two spaces in a collaborative way, and the different sampling strategies in the two spaces, the total learning objective is composed of a supervised loss $\mathcal{L}_{s}$, an unsupervised loss $\mathcal{L}_{u}$, and a contrastive loss $\mathcal{L}_{c}$ as follows:

$$\mathcal{L}=\mathcal{L}_{s}+\mathcal{L}_{u}+\lambda_{c}\mathcal{L}_{c}, \qquad (6)$$

where $\lambda_{c}$ tunes the contribution between the logit space and the representation space. Specifically, $\mathcal{L}_{s}$ and $\mathcal{L}_{u}$ are constructed with the Cross-Entropy (CE) loss $\ell_{ce}$ and can be formulated as:

$$\mathcal{L}_{s}=\frac{1}{\lvert\mathcal{B}_{l}\rvert}\sum_{\bm{x}_{i}^{l}\in\mathcal{B}_{l}}\ell_{ce}(\bm{p}_{i}^{l},\bm{y}^{l}_{i}), \qquad (7)$$

$$\mathcal{L}_{u}=\frac{1}{\lvert\hat{\mathcal{B}}_{u}\rvert}\sum_{\bm{x}_{i}^{u}\in\hat{\mathcal{B}}_{u}}\ell_{ce}(\bm{p}_{i}^{u},\hat{\bm{y}}^{u}_{i}), \qquad (8)$$

where $\mathcal{B}_{l}$ denotes the labeled images in a mini-batch and $\hat{\mathcal{B}}_{u}$ is the subset sampled from the unlabeled images in a mini-batch according to the sampling strategy. Meanwhile, the contrastive loss $\mathcal{L}_{c}$ is formulated as:

$$\mathcal{L}_{c}=-\frac{1}{|C|\times|\hat{\mathcal{Z}}_{c}|}\sum_{c\in C}\sum_{\bm{z}_{ci}\in\hat{\mathcal{Z}}_{c}}\log\frac{e^{sim(\bm{z}_{ci},\hat{\bm{\rho}}_{c}(t))/\tau}}{e^{sim(\bm{z}_{ci},\hat{\bm{\rho}}_{c}(t))/\tau}+\sum_{\tilde{c}\in\tilde{C}}\sum_{\bm{z}_{\tilde{c}i}\in\hat{\mathcal{Z}}_{\tilde{c}}}e^{sim(\bm{z}_{ci},\bm{z}_{\tilde{c}i})/\tau}}, \qquad (9)$$

where $\hat{\mathcal{Z}}_{c}$ is the subset sampled from the representations belonging to class $c$ according to the sampling strategy, $\hat{\mathcal{Z}}_{\tilde{c}}$ is the subset sampled from the representations that do not belong to class $c$, $\tilde{C}$ denotes the subset of $C$ with class $c$ removed, and the supervision comes from the final pseudo-labels $\hat{\bm{y}}^{u}_{i}$ produced by the pseudo-labeling strategies. The whole framework is shown in Fig. 2.
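Written per anchor, Eq. 9 has the form of an InfoNCE loss. The sketch below is a simplified version under the assumption that each sampled representation is paired with a fixed number $K$ of sampled negatives; the batching scheme and tensor shapes are ours, not the paper's.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(reps, labels, prototypes, negatives, tau=0.1):
    """Eq. (9), simplified: pull each sampled representation toward its class
    prototype and push it away from sampled negatives of other classes.

    reps:       (N, D)    sampled representations z_ci (valid/hard sampling)
    labels:     (N,)      class index from the final pseudo-labels y_hat^u
    prototypes: (|C|, D)  EMA prototypes rho_hat_c(t)
    negatives:  (N, K, D) sampled negative representations per anchor
    """
    reps = F.normalize(reps, dim=1)
    protos = F.normalize(prototypes, dim=1)
    negs = F.normalize(negatives, dim=2)

    pos = (reps * protos[labels]).sum(dim=1, keepdim=True) / tau   # (N, 1) anchor-prototype similarity
    neg = torch.einsum("nd,nkd->nk", reps, negs) / tau             # (N, K) anchor-negative similarities
    logits = torch.cat([pos, neg], dim=1)
    target = torch.zeros(reps.size(0), dtype=torch.long, device=reps.device)
    return F.cross_entropy(logits, target)   # -log softmax at the positive, matching the form of Eq. (9)
```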

4 Experiments

4.1 Setup

Datasets. We conduct experiments on the PASCAL VOC 2012 dataset [10] and the Cityscapes dataset [7] to validate the effectiveness of our proposed method. The original PASCAL VOC 2012 dataset contains 1,464 labeled images in the train set and 1,449 validation images in the val set. Following [6, 35], we additionally introduce 9,118 images from SBD [16] as training images. Since the labels in SBD are coarsely annotated, following [44], we use both the classic VOC train set (1,464 candidate labeled images) and the blender VOC train set (10,582 candidate labeled images). The Cityscapes dataset is an urban scene understanding dataset, which contains 2,975 images in the train set and 500 images in the val set.

Network structure. We use Deeplabv3+ [3] with ResNet-101 [18] pre-trained on ImageNet [8] as our network structure. The segmentation and representation head are composed of Conv-BN-ReLU-Conv.

Implementation details. For training on the PASCAL VOC 2012 dataset, we set the learning rate to 0.0064, weight decay to 0.0005, crop size to 512 × 512, batch size to 16, and train for a total of 40,000 iterations. For training on the Cityscapes dataset, we set the learning rate to 0.0038, weight decay to 0.0005, crop size to 768 × 768, batch size to 8, and train for a total of 80,000 iterations. We use poly scheduling to decay the learning rate during training: $lr=lr_{base}\times(1-\frac{epoch}{total\_epoch})^{0.9}$. We use the mean Intersection over Union (mIoU) as the evaluation metric. Following [6], we use the sliding window strategy to evaluate our method on the Cityscapes dataset. In addition, because classifying each representation requires a set of high-quality prototypes, when adopting our cross pseudo-labeling strategy we first use only the supervision from the logit space for 20 epochs to initialize the prototypes.
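For reference, the poly decay above amounts to the following one-liner; this is a sketch that follows the formula as written (with an epoch-based argument), while the training lengths in the paper are given in iterations.

```python
def poly_lr(base_lr, step, total_steps, power=0.9):
    """Poly learning-rate schedule: lr = lr_base * (1 - step / total_steps) ** power."""
    return base_lr * (1 - step / total_steps) ** power

# e.g. the PASCAL VOC 2012 setting: base learning rate 0.0064 over 40,000 iterations
lr_midway = poly_lr(0.0064, step=20000, total_steps=40000)  # approximately 0.00343
```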

4.2 Comparison with Existing Methods

In this subsection, we first reproduce three baselines on the classic VOC train set: MT [42], CutMix [13], and a typical contrastive-based method that uses only logit-space pseudo-labels and indicators (Baseline). We then compare our method with the mix (CSS (mix)) and cross (CSS (crs.)) pseudo-labeling strategies against the following recent SOTA S4 methods on the blender VOC train set and the Cityscapes train set: CCT [36], CPS [6], U2PL [44], ST++ [50], PRCL [46], PCR [49], and PSMT [31]. Since the data split dramatically affects performance in S4, i.e., the choice of labeled images plays an important role in the final results, we conduct experiments with three different data splits and report the mean value and standard deviation. Since the mix pseudo-labeling strategy performs better, we only use CSS (mix) when comparing with SOTAs. Meanwhile, we use ResNet-101 with deep stem blocks as our network structure when comparing with SOTAs. Since there is no uniform data split, we use the data splits from U2PL [44]. Following [6], we use the OHEM loss when training on Cityscapes.

Results on PASCAL VOC 2012. Tab. 1 shows the comparison with our baselines on the classic PASCAL VOC 2012 set. Our method consistently outperforms the baselines with an acceptable standard deviation at all label rates. Tab. 2 shows the comparison with the SOTAs on PASCAL VOC 2012. Our method achieves the best results at most label rates and ranks second under the setting of 1,323 labeled images.

Results on Cityscapes. Tab. 3 shows the performance of our method on Cityscapes. Our method outperforms the SOTAs at most label rates and ranks second under the setting of 744 labeled images. Since PCR [49] uses multiple prototypes, our method is more computationally efficient.

Table 1: Results on the classic VOC train set with four different label rates. Labeled data splits are from the original VOC train set. All approaches are reproduced with three runs and three different data splits; we report mean ± standard deviation.
Pascal VOC 2012 (Classic)
Method 92 183 366 732
Sup. 51.57±3.58 54.69±2.44 64.86±1.04 70.77±0.76
MT 58.92±2.99 61.63±1.76 66.79±0.53 71.58±0.51
CutMix 65.82±3.60 67.91±1.47 72.53±0.50 74.08±0.49
Baseline 66.91±4.21 70.32±1.86 73.97±1.04 76.49±0.58
CSS (crs.) 67.03±4.58 71.41±1.67 74.47±1.08 77.08±0.37
CSS (mix) 68.09±4.89 71.93±1.88 74.91±1.12 77.57±0.73
Table 2: Results on the blender VOC train set. All results are taken from the recent papers [44, 50, 11, 27, 31, 49]. Labeled data is from the augmented VOC train set and the data splits are from [44, 31].
Pascal VOC 2012 (Blender)
Method 662 1323 2646 5291
CCT [36] 71.86 73.68 76.51 77.40
CPS [6] 74.48 76.44 77.68 78.64
U2PL [44] 77.21 79.01 79.30 80.50
ST++ [50] 74.70 77.90 77.90 -
PRCL [46] 76.96 78.16 79.02 79.59
PCR [49] 78.60 80.71 80.78 80.91
PSMT [31] 75.50 78.20 78.72 79.76
CSS (mix) 78.73 79.54 80.82 81.06
Table 3: Results on Cityscapes. The model is trained on the Cityscapes train set, which consists of 2,975 samples in total, and tested on the Cityscapes val set. All results are taken from the recent papers [44, 49, 31].
Cityscapes
Method 186 372 744 1488
CCT [36] 69.32 74.12 75.99 78.10
CPS [6] 69.78 74.31 74.58 76.82
U2PL [44] 70.30 74.37 76.47 79.05
PCR [49] 73.41 76.31 78.40 79.11
PSMT [31] - 76.89 77.60 79.09
CSS (mix) 74.02 76.93 77.94 79.62

5 Ablative Study

The main contribution of our work lies in 1) collaborative pseudo-labeling strategies and 2) a new indicator for representation learning. To further prove the effectiveness of our proposed method, we conduct ablative studies on these two points. We choose Deeplabv3+ with ResNet-101 pre-trained on ImageNet as our backbone and leverage 92 labeled images and 183 labeled images in PASCAL VOC 2012. The other settings are the same as those in Sec. 4.

Table 4: The quality (IoU, %) of pseudo-labels from different pseudo-labeling strategies. The pseudo-labels are sampled by the corresponding sampling strategies. Numbers in parentheses are the changes of mix relative to lgt.
source back aero. bicy. bird boat bott. bus car cat chair cow
lgt. 96.78 95.14 77.47 93.38 81.37 87.54 96.76 95.48 94.47 4.09 92.15
rep. 90.71 96.50 61.42 75.75 53.30 54.65 84.46 80.55 91.11 28.03 88.16
mix 96.66 (↓0.12) 98.12 (↑2.98) 82.76 (↑5.29) 94.47 (↑1.09) 85.12 (↑3.75) 89.94 (↑2.40) 97.03 (↑0.27) 95.98 (↑0.50) 94.65 (↑0.18) 19.01 (↑14.92) 94.81 (↑2.66)
source table dog horse motor pers. plant sheep sofa train tv mIoU
lgt. 58.34 94.12 93.01 91.37 93.68 50.33 91.10 18.16 86.65 64.89 78.87
rep. 52.23 86.97 73.22 88.70 91.75 56.13 66.78 35.25 88.51 62.61 71.57
mix 64.35 (↑6.01) 94.33 (↑0.21) 93.65 (↑0.64) 91.50 (↑0.13) 93.01 (↓0.67) 55.64 (↑5.31) 91.56 (↑0.46) 29.74 (↑11.58) 93.54 (↑6.89) 69.48 (↑4.59) 82.17 (↑3.33)

5.1 Effectiveness of Collaborative pseudo-labeling

Quality of pseudo-labels. To illustrate the superiority of using pseudo-labels from the two spaces in a collaborative way as supervision, we conduct experiments to show the quality of pseudo-labels obtained 1) from the logit space (lgt.), 2) from the representation space (rep.), and 3) from the mix pseudo-labeling strategy (mix). The pseudo-labels are sampled with the corresponding sampling strategies in Sec. 3.4. Tab. 4 reports the IoU of pseudo-labels for each class on PASCAL VOC 2012 with 92 labeled images. The results clearly indicate that combining pseudo-labels from the representation space with those from the logit space, rather than relying solely on the latter, improves the accuracy of the final pseudo-labels for most classes. This improvement is particularly evident in classes that originally under-perform, such as the IoU improvements of 14.92% for the chair class and 11.58% for the sofa class.

Meanwhile, we visualize the pseudo-labels obtained from the logit space (lgt.) and the representation space (rep.) in Fig. 3. Fig. 3 (c) and (e) show the masks for the pseudo-labels produced by the sampling strategies. In particular, white represents the valid pixels used during unlabeled learning, while black indicates the discarded pixels. Fig. 3 (d) and (f) are the pseudo-labels obtained from the two spaces. The figure clearly illustrates the differences between pseudo-labels produced in the two spaces. For example, instance edges are usually discarded in the logit-space pseudo-labels since they are challenging for learning in logit space, whereas pseudo-labels from the representation space handle them easily (shown in the first row). Conversely, pseudo-labels from the representation space are less accurate in some complex scenes, which can be resolved by combining them with pseudo-labels from the logit space (shown in the second row).

We mainly attribute the differences in pseudo-labels to the differing concentrations of learning in the two spaces. Specifically, learning in the logit space primarily emphasizes the most discriminative part of the features, whereas learning in the representation space treats each part of the features equally. As a result, learning in the logit space may overlook minor feature differences, leading to sub-optimal performance in predicting instance edges and distinguishing between similar classes (e.g., chair and sofa). Conversely, learning in the representation space produces balanced performance across all image parts and classes, but this can lead to erroneous predictions for classes with high intra-class variance (e.g., background). By leveraging pseudo-labels from the two spaces, we capitalize on the strengths of learning in each space and enhance the knowledge exchange between the two spaces.

Figure 3: Differences between pseudo-labels in different spaces.

Results of different strategies. To investigate the effect of different pseudo-labeling strategies, we conduct the following experiments: 1) using pseudo-labels from the logit space; 2) using pseudo-labels from the representation space; 3) using the mix pseudo-labeling strategy; 4) using the cross pseudo-labeling strategy. Tab. 5 shows the effectiveness of our proposed strategies under two different label rates. The results show that the collaborative pseudo-labeling strategies outperform those whose pseudo-labels come from a single space under both label rates, which proves the effectiveness of the proposed collaboration between the two spaces. It is worth noting that even though the quality of pseudo-labels from the representation space is lower than that from the logit space, model performance is still boosted by using the cross pseudo-labeling strategy to maintain consistency between the predictions in the two spaces. In addition, our method with the mix pseudo-labeling strategy outperforms that with the cross pseudo-labeling strategy.

Table 5: Results on pseudo-labels from different sources on two different label rates.
source 92 labels 183 labels
logit space 67.11 70.32
representation space 64.20 67.52
mix pseudo-labeling 68.41 (↑1.30) 72.74 (↑2.42)
cross pseudo-labeling 67.85 (↑0.74) 71.98 (↑1.66)
Figure 4: Relations between similarity and confidence.

5.2 Effectiveness of the Indicator

Limitation of merely using confidence. To explain the limitation of merely using confidence for learning in the logit and representation spaces, we conduct experiments to show the relation between confidence and similarity. The similarity is the cosine similarity between representations and prototypes, which directly shows the confusion level between a representation and the prototype of its class. Fig. 4 (a) compares the confidence of each prediction with the corresponding similarity. We use the class person for demonstration in our experiments. It clearly shows that even when the confidence is fixed within a small range (from 0.8 to 0.81 in our setting), the similarity still varies. Meanwhile, in Fig. 4 (b), different color bars stand for different intervals of confidence, and the lines denote the mean similarity between the prototype and each representation whose corresponding confidence falls in the current interval. Fig. 4 (b) illustrates that the mean similarity of the class fluctuates as the confidence interval rises. Both figures imply that confidence cannot represent the confusion level between representations and prototypes, since there is no direct and close relation between confidence and similarity.

Fig. 5 visualizes the similarity and confidence of an image in both logit (lgt.) and representation (rep.) spaces, indicating the varying levels of confusion in the same region when learning in different spaces.

We again attribute this to the different concentrations of learning in the two spaces, i.e., a region that is confusing in one space can be more readily addressed in the other space.

Thus, it is inappropriate to use confidence as the indicator to guide representation learning, e.g., to sample reliable or hard samples with a threshold on the indicator. In contrast, our indicator directly employs the similarity between a representation and the prototype of its class, which directly reflects the confusion level in representation learning. It is therefore more accurate to use similarity as the indicator to sample hard and critical samples in representation learning.

Figure 5: Visualization of the confusing part in different spaces.

Results of different indicators. Tab. 6 shows the impact of using different indicators. We conduct experiments under two label rates (92 and 183) with three different indicator settings: confidence only (conf.), similarity only (smlr.), and confidence for learning in the logit space with similarity for learning in the representation space (mix). We use two different ways to obtain pseudo-labels: from the logit space only (seg label) and from the mix pseudo-labeling strategy (mix label). Using confidence and similarity as indicators for their respective spaces clearly obtains the best performance, which proves the effectiveness of using different indicators.

Table 6: Results on indicators from different spaces on two different label rates.
source 92 labels (seg label / mix label) 183 labels (seg label / mix label)
conf. 67.11 / 68.33 70.32 / 71.92
smlr. 66.16 / 66.89 68.90 / 69.25
mix 67.80 (↑0.69) / 68.41 (↑1.30) 71.50 (↑1.18) / 72.74 (↑2.42)

5.3 Ablation study of Components

In this section, we conduct experiments that introduce the components of CSS step by step, with results shown in Tab. 7. Our baseline is the conventional contrastive-based S4, achieving an mIoU of 67.11% on 92 labels and 70.32% on 183 labels. Mix and cross mean that the pseudo-labels come from the mix and cross pseudo-labeling strategies, respectively, while the indicator is still the confidence in both spaces. Ind means that we use different indicators in the two spaces while the pseudo-labels come from the logit space. The last two rows represent our two pseudo-labeling strategies combined with the indicators from the two spaces. As a result, the mix pseudo-labeling strategy with different indicators boosts model performance by 1.30% and 2.42%, while the cross pseudo-labeling strategy with different indicators increases performance by 0.74% and 1.66%.

Table 7: Ablation study on different components of our CSS.
component 92 labels 183 labels
baseline 67.11 70.32
mix 68.33 (↑1.22) 71.92 (↑1.60)
cross 67.21 (↑0.10) 70.87 (↑0.55)
ind 67.80 (↑0.69) 71.50 (↑1.18)
mix + ind 68.41 (↑1.30) 72.74 (↑2.42)
cross + ind 67.85 (↑0.74) 71.98 (↑1.66)

5.4 Qualitative Results

Fig. 6 shows the qualitative results of different methods on PASCAL VOC 2012 with 92 labeled images. Baseline denotes the conventional contrastive-based method. Compared with the original self-training method (CutMix), both the baseline and our method perform better in some ambiguous regions thanks to the introduction of pixel-wise contrastive learning. Furthermore, benefiting from the supervision of two spaces and the different indicators in the two spaces, our method outperforms the baseline.

Figure 6: Visualization on PASCAL VOC 2012 with 92 labeled images. Yellow boxes highlight the main differences.

6 Conclusion

In this paper, we propose two collaborative pseudo-labeling strategies to make full use of the semantic information in the representation space and enhance the knowledge exchange between the logit and representation spaces. Moreover, we employ a new indicator for the learning process in the representation space. Extensive experiments demonstrate that our pseudo-labeling strategies obtain more reliable supervision during unlabeled training and that our indicator helps the model concentrate on more critical parts during representation learning.

Future work: In this paper, we employ pseudo-labeling strategies to utilize the semantic information in both the logit and representation spaces. In the future, we will investigate more powerful strategies to enhance the knowledge exchange between the two spaces.

References

  • [1] Iñigo Alonso, Alberto Sabater, David Ferstl, Luis Montesano, and Ana C. Murillo. Semi-supervised semantic segmentation with pixel-level contrastive learning from a class-wise memory bank. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 8219–8228, October 2021.
  • [2] Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. Unsupervised learning of visual features by contrasting cluster assignments. Advances in Neural Information Processing Systems, 33:9912–9924, 2020.
  • [3] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European conference on computer vision (ECCV), pages 801–818, 2018.
  • [4] Mu Chen, Zhedong Zheng, Yi Yang, and Tat-Seng Chua. Pipa: Pixel-and patch-wise self-supervised learning for domain adaptative semantic segmentation. arXiv preprint arXiv:2211.07609, 2022.
  • [5] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International conference on machine learning, pages 1597–1607. PMLR, 2020.
  • [6] Xiaokang Chen, Yuhui Yuan, Gang Zeng, and Jingdong Wang. Semi-supervised semantic segmentation with cross pseudo supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2613–2622, 2021.
  • [7] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3213–3223, 2016.
  • [8] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. IEEE, 2009.
  • [9] Nanqing Dong and Eric P Xing. Few-shot semantic segmentation with prototype learning. In BMVC, volume 3, 2018.
  • [10] Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes (voc) challenge. International journal of computer vision, 88(2):303–338, 2010.
  • [11] Jiashuo Fan, Bin Gao, Huan Jin, and Lihui Jiang. Ucc: Uncertainty guided cross-head co-training for semi-supervised semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9947–9956, June 2022.
  • [12] Zhengyang Feng, Qianyu Zhou, Qiqi Gu, Xin Tan, Guangliang Cheng, Xuequan Lu, Jianping Shi, and Lizhuang Ma. Dmt: Dynamic mutual training for semi-supervised learning. Pattern Recognition, page 108777, 2022.
  • [13] Geoff French, Timo Aila, Samuli Laine, Michal Mackiewicz, and Graham Finlayson. Semi-supervised semantic segmentation needs strong, high-dimensional perturbations. In 31st British Machine Vision Conference, 2020.
  • [14] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. Communications of the ACM, 63(11):139–144, 2020.
  • [15] Dayan Guan, Jiaxing Huang, Aoran Xiao, and Shijian Lu. Unbiased subclass regularization for semi-supervised semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9968–9978, 2022.
  • [16] Bharath Hariharan, Pablo Arbeláez, Lubomir Bourdev, Subhransu Maji, and Jitendra Malik. Semantic contours from inverse detectors. In 2011 international conference on computer vision, pages 991–998. IEEE, 2011.
  • [17] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9729–9738, 2020.
  • [18] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [19] Ruifei He, Jihan Yang, and Xiaojuan Qi. Re-distributing biased pseudo labels for semi-supervised semantic segmentation: A baseline investigation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 6930–6940, October 2021.
  • [20] Judy Hoffman, Dequan Wang, Fisher Yu, and Trevor Darrell. Fcns in the wild: Pixel-level adversarial and constraint-based adaptation. arXiv preprint arXiv:1612.02649, 2016.
  • [21] Hanzhe Hu, Jinshi Cui, and Liwei Wang. Region-aware contrastive learning for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 16291–16301, 2021.
  • [22] Hanzhe Hu, Fangyun Wei, Han Hu, Qiwei Ye, Jinshi Cui, and Liwei Wang. Semi-supervised semantic segmentation via adaptive equalization learning. Advances in Neural Information Processing Systems, 34:22106–22118, 2021.
  • [23] Wei-Chih Hung, Yi-Hsuan Tsai, Yan-Ting Liou, Yen-Yu Lin, and Ming-Hsuan Yang. Adversarial learning for semi-supervised semantic segmentation. arXiv preprint arXiv:1802.07934, 2018.
  • [24] Zhengkai Jiang, Yuxi Li, Ceyuan Yang, Peng Gao, Yabiao Wang, Ying Tai, and Chengjie Wang. Prototypical contrast adaptation for domain adaptive semantic segmentation. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXIV, pages 36–54. Springer, 2022.
  • [25] Tarun Kalluri, Girish Varma, Manmohan Chandraker, and CV Jawahar. Universal semi-supervised semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5259–5270, 2019.
  • [26] Rihuan Ke, Angelica I Aviles-Rivero, Saurabh Pandey, Saikumar Reddy, and Carola-Bibiane Schönlieb. A three-stage self-training framework for semi-supervised semantic segmentation. IEEE Transactions on Image Processing, 31:1805–1815, 2022.
  • [27] Donghyeon Kwon and Suha Kwak. Semi-supervised semantic segmentation with error localization network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9957–9967, 2022.
  • [28] Daiqing Li, Junlin Yang, Karsten Kreis, Antonio Torralba, and Sanja Fidler. Semantic segmentation with generative models: Semi-supervised learning and strong out-of-domain generalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8300–8311, 2021.
  • [29] Jinlu Liu, Liang Song, and Yongqiang Qin. Prototype rectification for few-shot learning. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part I 16, pages 741–756. Springer, 2020.
  • [30] Shikun Liu, Shuaifeng Zhi, Edward Johns, and Andrew Davison. Bootstrapping semantic segmentation with regional contrast. In International Conference on Learning Representations, 2022.
  • [31] Yuyuan Liu, Yu Tian, Yuanhong Chen, Fengbei Liu, Vasileios Belagiannis, and Gustavo Carneiro. Perturbed and strict mean teachers for semi-supervised semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4258–4267, 2022.
  • [32] Binjie Mao, Xinbang Zhang, Lingfeng Wang, Qian Zhang, Shiming Xiang, and Chunhong Pan. Learning from the target: Dual prototype network for few shot semantic segmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 1953–1961, 2022.
  • [33] Sudhanshu Mittal, Maxim Tatarchenko, and Thomas Brox. Semi-supervised semantic segmentation with high-and low-level consistency. IEEE transactions on pattern analysis and machine intelligence, 43(4):1369–1379, 2019.
  • [34] Takeru Miyato, Shin-ichi Maeda, Masanori Koyama, and Shin Ishii. Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE transactions on pattern analysis and machine intelligence, 41(8):1979–1993, 2018.
  • [35] Viktor Olsson, Wilhelm Tranheden, Juliano Pinto, and Lennart Svensson. Classmix: Segmentation-based data augmentation for semi-supervised learning. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1369–1378, 2021.
  • [36] Yassine Ouali, Celine Hudelot, and Myriam Tami. Semi-supervised semantic segmentation with cross-consistency training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
  • [37] Yingwei Pan, Ting Yao, Yehao Li, Yu Wang, Chong-Wah Ngo, and Tao Mei. Transferrable prototypical networks for unsupervised domain adaptation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2239–2247, 2019.
  • [38] Jizong Peng, Guillermo Estrada, Marco Pedersoli, and Christian Desrosiers. Deep co-training for semi-supervised image segmentation. Pattern Recognition, 107:107269, 2020.
  • [39] Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. Advances in neural information processing systems, 30, 2017.
  • [40] Kihyuk Sohn, David Berthelot, Nicholas Carlini, Zizhao Zhang, Han Zhang, Colin A Raffel, Ekin Dogus Cubuk, Alexey Kurakin, and Chun-Liang Li. Fixmatch: Simplifying semi-supervised learning with consistency and confidence. Advances in neural information processing systems, 33:596–608, 2020.
  • [41] Korawat Tanwisuth, Xinjie Fan, Huangjie Zheng, Shujian Zhang, Hao Zhang, Bo Chen, and Mingyuan Zhou. A prototype-oriented framework for unsupervised domain adaptation. Advances in Neural Information Processing Systems, 34:17194–17208, 2021.
  • [42] Antti Tarvainen and Harri Valpola. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. Advances in neural information processing systems, 30, 2017.
  • [43] Wenguan Wang, Tianfei Zhou, Fisher Yu, Jifeng Dai, Ender Konukoglu, and Luc Van Gool. Exploring cross-image pixel contrast for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7303–7313, 2021.
  • [44] Yuchao Wang, Haochen Wang, Yujun Shen, Jingjing Fei, Wei Li, Guoqiang Jin, Liwei Wu, Rui Zhao, and Xinyi Le. Semi-supervised semantic segmentation using unreliable pseudo labels. In Proceedings of the IEEE/CVF International Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
  • [45] Chen Wei, Kihyuk Sohn, Clayton Mellina, Alan Yuille, and Fan Yang. Crest: A class-rebalancing self-training framework for imbalanced semi-supervised learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10857–10866, June 2021.
  • [46] Haoyu Xie, Changqi Wang, Mingkai Zheng, Minjing Dong, Shan You, and Chang Xu. Boosting semi-supervised semantic segmentation with probabilistic representations. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 2938–2946, 2023.
  • [47] Qizhe Xie, Minh-Thang Luong, Eduard Hovy, and Quoc V Le. Self-training with noisy student improves imagenet classification. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10687–10698, 2020.
  • [48] Zhenda Xie, Yutong Lin, Zheng Zhang, Yue Cao, Stephen Lin, and Han Hu. Propagate yourself: Exploring pixel-level consistency for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16684–16693, 2021.
  • [49] Hai-Ming Xu, Lingqiao Liu, Qiuchen Bian, and Zhen Yang. Semi-supervised semantic segmentation with prototype-based consistency regularization. Advances in Neural Information Processing Systems, 2022.
  • [50] Lihe Yang, Wei Zhuo, Lei Qi, Yinghuan Shi, and Yang Gao. St++: Make self-training work better for semi-supervised semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4268–4277, 2022.
  • [51] Jianlong Yuan, Yifan Liu, Chunhua Shen, Zhibin Wang, and Hao Li. A simple baseline for semi-supervised semantic segmentation with strong data augmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8229–8238, 2021.
  • [52] Xiangyu Yue, Zangwei Zheng, Shanghang Zhang, Yang Gao, Trevor Darrell, Kurt Keutzer, and Alberto Sangiovanni Vincentelli. Prototypical cross-domain self-supervised learning for few-shot unsupervised domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13834–13844, 2021.
  • [53] Xiangyun Zhao, Raviteja Vemulapalli, Philip Andrew Mansfield, Boqing Gong, Bradley Green, Lior Shapira, and Ying Wu. Contrastive learning for label efficient semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 10623–10633, October 2021.
  • [54] Xu Zheng, Yunhao Luo, Hao Wang, Chong Fu, and Lin Wang. Transformer-cnn cohort: Semi-supervised semantic segmentation by the best of both students, 2022.
  • [55] Xu Zheng, Jinjing Zhu, Yexin Liu, Zidong Cao, Chong Fu, and Lin Wang. Both style and distortion matter: Dual-path unsupervised domain adaptation for panoramic semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1285–1295, June 2023.
  • [56] Yuanyi Zhong, Bodi Yuan, Hong Wu, Zhiqiang Yuan, Jian Peng, and Yu-Xiong Wang. Pixel contrastive-consistent semi-supervised semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7273–7282, 2021.
  • [57] Tianfei Zhou, Wenguan Wang, Ender Konukoglu, and Luc Van Gool. Rethinking semantic segmentation: A prototype view. In CVPR, 2022.
  • [58] Yanning Zhou, Hang Xu, Wei Zhang, Bin Gao, and Pheng-Ann Heng. C3-semiseg: Contrastive semi-supervised segmentation via cross-set learning and dynamic class-balancing. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7036–7045, 2021.