Pseudo-label Alignment for Semi-supervised Instance Segmentation
Abstract
Pseudo-labeling is significant for semi-supervised instance segmentation, which generates instance masks and classes from unannotated images for subsequent training. However, in existing pipelines, pseudo-labels that contain valuable information may be directly filtered out due to mismatches in class and mask quality. To address this issue, we propose a novel framework, called pseudo-label aligning instance segmentation (PAIS), in this paper. In PAIS, we devise a dynamic aligning loss (DALoss) that adjusts the weights of semi-supervised loss terms with varying class and mask score pairs. Through extensive experiments conducted on the COCO and Cityscapes datasets, we demonstrate that PAIS is a promising framework for semi-supervised instance segmentation, particularly in cases where labeled data is severely limited. Notably, with just 1% labeled data, PAIS achieves 21.2 mAP (based on Mask-RCNN) and 19.9 mAP (based on K-Net) on the COCO dataset, outperforming the current state-of-the-art model, NoisyBoundary (7.7 mAP), by a margin of over 12 points. Code is available at: https://github.com/hujiecpp/PAIS.
1 Introduction
Semi-supervised instance segmentation aims to alleviate the significant burden of human labeling by utilizing a small amount of labeled data in conjunction with abundant unlabeled data [35, 47, 50]. Existing semi-supervised instance segmentation pipelines typically generate pseudo-labels from unlabeled images, which are then used to train the models together with labeled images. Therefore, pseudo-labels play a crucial role in semi-supervised instance segmentation. The generation of pseudo-masks, pseudo-classes, and pseudo-boxes from unlabeled images improves the model training. However, current semi-supervised instance segmentation frameworks do not fully leverage the potential of such pseudo-labels. Specifically, pseudo-labels with mismatched class and mask scores are often filtered out by fixed thresholds, leading to the exclusion of valuable information that could aid model training. For instance, pseudo-labels with high-quality masks but low class scores would be filtered out by a class threshold, resulting in the loss of valuable pixel-level information.

In this paper, we present a new semi-supervised framework, termed pseudo-label aligning for instance segmentation (PAIS), aiming to improve the utilization of filtered pseudo-labels. As illustrated in Fig. 1, the main challenge of PAIS lies in the mismatched scores between pseudo-classes and pseudo-masks. The classification score and mask intersection over union (IoU) are misaligned in assessing the quality of pseudo-labels. As a result, masks with high IoUs would be filtered out due to low classification scores, and vice versa. Meanwhile, lowering the threshold of both scores introduces incorrect classes or low-quality masks into the semi-supervised training. To overcome this dilemma, we propose a dynamic aligning loss (DALoss) that softly re-weights the classification and the segmentation losses based on the quality of different pseudo-labels. Specifically, DALoss penalizes low-score pseudo-labels rather than filtering them and promotes high-score ones, to adjust their contribution to the final loss function. Our experiments on the COCO and Cityscapes datasets demonstrate the effectiveness of the proposed PAIS framework. Specifically, with only 1% labeled data, PAIS achieves 21.2 mAP and 19.9 mAP on the COCO dataset using Mask-RCNN [13] and K-Net [53], respectively. This outperforms the current state-of-the-art model, NoisyBoundary [47], by more than 12 points.
Our contributions can be summarized as follows:
- We propose a novel pseudo-label aligning framework for semi-supervised instance segmentation, called PAIS, which unleashes the potential of utilizing pixel-level pseudo-labels in semi-supervised instance segmentation. Furthermore, to the best of our knowledge, PAIS is the first framework that can be adapted to box-free instance segmentation models.
- We introduce a new loss function, named dynamic aligning loss (DALoss), which incorporates pseudo-labels with diverse class and mask qualities into the semi-supervised training process. DALoss consistently enhances the performance of both box-free and box-dependent instance segmentation frameworks.
- We conduct comprehensive experiments on the COCO and Cityscapes datasets to evaluate PAIS. In particular, PAIS achieves state-of-the-art results on the COCO dataset, i.e., 19.9, 27.6, and 31.1 mAP for the box-free pipeline K-Net [53], and 21.2, 29.3, and 31.1 mAP for the box-dependent pipeline Mask-RCNN [13], with 1%, 5%, and 10% labeled data, respectively.

2 Related Work
Semi-supervised Image Classification. In image classification, semi-supervised learning has been extensively explored, and the methods can be classified into two categories: pseudo-label-based and consistency-regularization-based methods. Specifically, pseudo-label-based methods [40, 21] leverage pre-trained models to generate annotations for the unlabeled images to train the model. In contrast, consistency-regularization-based methods [1, 36, 20, 39, 48, 2, 7, 32, 31, 23] incorporate various data augmentation techniques such as random regularization [10] and adversarial perturbation [33] to generate different inputs for one image and enforce consistency between these inputs during training. FixMatch [41] combines the consistency-regularization-based techniques with a pseudo-label-based framework by applying a strong-weak data augmentation pipeline to input images and enforcing consistency between the augmented images. In this work, we follow the pseudo-label-based methods and also use strong-weak data augmentation during training in PAIS.
Semi-supervised Object Detection. STAC [42] was the first to propose the use of pseudo-labels and consistency training for semi-supervised object detection. However, its effectiveness was limited by a two-stage training pipeline similar to that of Noisy Student [49], where the pseudo-labels were generated from pre-trained models and were not updated along with the model training. After STAC, several studies [50, 54, 43, 51, 28] incorporated the idea of exponential moving average (EMA) from MeanTeacher [44], in which the teacher model and pseudo-labels are updated after each training iteration to generate instant pseudo-labels, making the entire pipeline end-to-end trainable. Additionally, Unbiased Teacher [28] utilized the Focal loss [24] instead of the traditional cross-entropy loss to alleviate the problem of unbalanced pseudo-labels. In this paper, we also incorporate EMA into the proposed PAIS framework, with a focus on integrating pixel-level annotations into the training process.
Fully-supervised Instance Segmentation. Instance segmentation aims to provide pixel-level predictions for each object instance in an image. Existing methods can be classified into three categories: top-down (or box-dependent) methods, bottom-up methods, and direct segmentation (or box-free) methods. Top-down methods [15, 27, 38, 37, 29] such as Mask R-CNN [13], YOLACT [3], and CenterMask [22] generate bounding boxes first and then segment the objects within the boxes. Bottom-up methods [52, 8, 26, 11, 16] regard instance segmentation as a label-then-cluster problem: they classify each pixel first and then group the pixels into an arbitrary number of object instances. Direct segmentation or box-free methods such as SOLO [45, 46], K-Net [53], MaskFormer [5, 4], and SOTR [12] deal with instance segmentation without bounding box detection. Any of the aforementioned instance segmentation methods can be integrated into PAIS; in this work, we present two examples, one from the box-dependent category, Mask R-CNN, and another from the box-free category, K-Net.
Semi-supervised Instance Segmentation. Semi-supervised instance segmentation is commonly considered to be a sub-task of semi-supervised object detection [50, 28]. Consequently, existing frameworks rely heavily on bounding boxes, in which segmentation performance is strongly dependent on detection performance. Among them, NoisyBoundary [47] was the first to formally propose the semi-supervised instance segmentation task. Recent efforts have been made to construct box-free pipelines for fully-supervised instance segmentation [45, 46, 53, 5]. In this paper, we investigate PAIS in both box-free and box-dependent instance segmentation, revealing the potential of fully utilizing pixel-level annotations. In contrast to the recently-proposed PoliteTeacher [9], which filters out pseudo-labels with low confidence, the proposed PAIS leverages them. It is a novel and effective way of utilizing noisy pseudo-labels for semi-supervised learning.
3 Method
3.1 Task Formulation
The goal of PAIS is to better leverage unlabeled images, together with a limited number of pixel-labeled images, to boost the performance of semi-supervised instance segmentation. The PAIS framework consists of three key steps: (1) pseudo-label generation, (2) dynamic pseudo-label alignment, and (3) end-to-end model training. In the pseudo-label generation step, we introduce a mask scoring branch [17] that predicts mask IoUs as an additional metric alongside classification scores to assess the quality of pseudo-labels. In the dynamic aligning step, we re-weight the loss terms based on the quality of different pseudo-labels. Finally, the teacher and student models are trained using the exponential moving average (EMA) [44]. The overall framework of PAIS is illustrated in Fig. 2, and we introduce the key steps in detail as follows. It is important to note that the PAIS framework can be applied to any box-dependent or box-free framework for enhanced exploitation of pseudo-labels in semi-supervised instance segmentation. We instantiate two examples with Mask-RCNN [13] and K-Net [53] in this paper, but PAIS is not restricted to them.
3.2 Pseudo Label Generation
In the pseudo-label generation, we apply weak data augmentation to the unlabeled images and input them into a teacher model. The weak data augmentation includes scaling, horizontal flipping, and other augmentation operations that do not alter the image’s content. The teacher model produces a set of pseudo-labels, including masks, boxes, and classification scores for each input image. In box-free pipelines, the box predictions are optional.
As depicted in Fig. 1(a)(c), using the classification score alone is inadequate to measure the quality of predicted masks. To address this, we incorporate a mask scoring branch into the pipeline to predict mask Intersection over Union (IoU) for evaluating mask quality. As shown in Fig. 1(b), the predicted mask IoUs can be used to measure mask quality effectively, which has also been verified in MS-RCNN [17]. After generating pseudo-labels in the previous step, the predictions are filtered using two thresholds: a classification threshold $\tau_c$ and a mask IoU threshold $\tau_m$. The resulting set of pseudo-labels $\{(m_i, b_i, c_i, s_i)\}_{i=1}^{N}$ includes $N$ elements, each containing a mask $m_i \in \mathbb{R}^{H \times W}$, a bounding box $b_i \in \mathbb{R}^{4}$, a classification score $c_i \in \mathbb{R}^{K+1}$ (with an additional background category), and a mask IoU score $s_i \in [0, 1]$. Note that $H \times W$ denotes the mask resolution, and $K$ is the number of classes.
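For illustration, the filtering step can be sketched as follows (a minimal sketch with hypothetical function and tensor names; the threshold values follow Sec. 4.2, and the background class is assumed to be the last column of the score matrix):

```python
import torch

def filter_pseudo_labels(cls_scores, iou_scores, masks, boxes,
                         tau_c=0.35, tau_m=0.30):
    """Keep predictions whose class score and predicted mask IoU both
    exceed their thresholds (illustrative helper, not the exact code).

    cls_scores: (N, K+1) softmax scores; last column assumed background
    iou_scores: (N,)     mask IoUs predicted by the scoring branch
    masks:      (N, H, W) mask logits
    boxes:      (N, 4)    boxes (optional in box-free pipelines)
    """
    fg_scores, fg_labels = cls_scores[:, :-1].max(dim=1)  # drop background
    keep = (fg_scores >= tau_c) & (iou_scores >= tau_m)
    return {
        "labels": fg_labels[keep],
        "cls_scores": fg_scores[keep],
        "iou_scores": iou_scores[keep],
        "masks": masks[keep].sigmoid() > 0.5,  # binarized pseudo-masks
        "boxes": boxes[keep],
    }
```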
3.3 Dynamic Aligning Loss
Although good pseudo-labels can be obtained by setting high thresholds for classification scores and mask IoUs, a large number of predictions with misaligned mask and classification qualities are discarded (as illustrated in Fig. 1(c)(d)). These misaligned predictions can be useful since the amount of labeled data is limited in the semi-supervised setting. Fig. 2 shows examples of pseudo-labels that can be included in the training process. If the misaligned pseudo-labels are directly used in semi-supervised learning, the incorrect predictions on classes or masks can introduce significant noise. To reduce this noise, we propose a dynamic aligning loss (DALoss).
Vanilla Loss for PAIS. In semi-supervised instance segmentation, the loss function can be decomposed into two terms for labeled and unlabeled images, as:
$$\mathcal{L} = \lambda_l \mathcal{L}_l + \lambda_u \mathcal{L}_u, \quad (1)$$

where $\lambda_l$ and $\lambda_u$ are the hyper-parameters for balancing the two loss terms.
The loss $\mathcal{L}_l$ for labeled images can be defined using the functions commonly employed in instance segmentation, augmented with an additional binary cross-entropy term for regressing mask IoUs. Given the pseudo-labels from the teacher model and the predictions from the student model, we formulate the loss $\mathcal{L}_u$ for unlabeled images as:
$$\mathcal{L}_u = \sum_{i=1}^{N}\Big[\lambda_{box}\mathcal{L}_{box}\big(\hat{b}_{\sigma(i)}, b_i\big) + \lambda_{cls}\mathcal{L}_{cls}\big(\hat{c}_{\sigma(i)}, \bar{c}_i\big) + \lambda_{seg}\mathcal{L}_{seg}\big(\hat{m}_{\sigma(i)}, \bar{m}_i\big)\Big] + \sum_{j \in \omega}\lambda_{cls}\mathcal{L}_{cls}\big(\hat{c}_j, c_{bg}\big), \quad (2)$$

where $\mathcal{L}_{box}$ includes the box IoU loss and the L1 loss, $\mathcal{L}_{cls}$ denotes the cross-entropy loss, and $\mathcal{L}_{seg}$ is the dice loss. In Eq. 2, $\hat{b}$, $\hat{c}$, and $\hat{m}$ represent predictions from the student model. The pseudo-scores $c_i$ for classification are converted to one-hot vectors $\bar{c}_i$ for the category with the highest score. The pseudo-masks $m_i$ are activated by the sigmoid function and discretized into binary values $\bar{m}_i$. The one-hot vector for the background class is denoted by $c_{bg}$. The label indexing function $\sigma(\cdot)$ and the background set $\omega$, which match predictions with pseudo-labels and the background for training, are defined in terms of different pseudo-label assignment strategies. In our implementations, the one-to-many assignment defines $\sigma(\cdot)$ and $\omega$ as:
$$\sigma(i) = \big\{ j \mid \mathrm{IoU}(\hat{b}_j, b_i) \geq 0.5 \big\}, \quad \omega = \big\{ j \mid \max_{i} \mathrm{IoU}(\hat{b}_j, b_i) < 0.5 \big\}, \quad (3)$$

where the loss terms in Eq. 2 are summed over all matched predictions $j \in \sigma(i)$.
Instead, the one-to-one assignment defines the label indexing function $\sigma(\cdot)$ as a bijection between the pseudo-labels and an equal number of predictions, and finds the optimal $\hat{\sigma}$ via:
$$\hat{\sigma} = \mathop{\arg\min}_{\sigma} \sum_{i=1}^{N} \mathcal{L}_{match}\big(\hat{y}_{\sigma(i)}, y_i\big), \quad (4)$$

where $\mathcal{L}_{match}$ is the pair-wise matching cost between the prediction $\hat{y}_{\sigma(i)} = (\hat{b}_{\sigma(i)}, \hat{c}_{\sigma(i)}, \hat{m}_{\sigma(i)})$ and the pseudo-label $y_i = (b_i, \bar{c}_i, \bar{m}_i)$, composed of the loss terms in Eq. 2 and solved efficiently by the Hungarian algorithm; the predictions left unmatched form the background set $\omega$.
Note that the loss terms for regressing boxes are optional in box-free instance segmentation frameworks.
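For the one-to-one case, Eq. 4 can be solved with the Hungarian algorithm. The following is a minimal sketch under simplifying assumptions: the function name is hypothetical, and the matching cost keeps only a classification term and a soft-dice mask term rather than the full cost of Eq. 2:

```python
import torch
from scipy.optimize import linear_sum_assignment

def one_to_one_assign(pred_cls, pred_masks, pl_labels, pl_masks):
    """Bipartite matching between student predictions and pseudo-labels,
    a simplified sketch of Eq. 4 (class cost + soft-dice mask cost only).

    pred_cls:   (Q, K+1) predicted class probabilities
    pred_masks: (Q, H, W) predicted mask probabilities in [0, 1]
    pl_labels:  (N,) pseudo-label class indices
    pl_masks:   (N, H, W) binary pseudo-masks
    """
    # Classification cost: negative probability of the pseudo class.
    cost_cls = -pred_cls[:, pl_labels]                        # (Q, N)
    # Mask cost: 1 - soft dice between predictions and pseudo-masks.
    p = pred_masks.flatten(1)                                 # (Q, H*W)
    t = pl_masks.flatten(1).float()                           # (N, H*W)
    inter = p @ t.T                                           # (Q, N)
    dice = (2 * inter + 1) / (p.sum(1)[:, None] + t.sum(1)[None, :] + 1)
    cost = cost_cls + (1 - dice)
    # Hungarian algorithm: prediction rows[k] matches pseudo-label cols[k].
    rows, cols = linear_sum_assignment(cost.detach().cpu().numpy())
    return rows, cols
```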
Dynamic Aligning Loss for PAIS. To optimize the model using pixel-level pseudo-labels, we propose to replace Eq. 2 with DALoss. Since the teacher model provides classification scores and mask IoUs that measure the quality of pseudo-labels, we adjust the weights of the loss terms based on these qualities. This is achieved by using the following equation:
$$\mathcal{L}_u^{DA} = \sum_{i=1}^{N}\Big[\lambda_{box}\mathcal{L}_{box}\big(\hat{b}_{\sigma(i)}, b_i\big) + p_i^{\alpha}\lambda_{cls}\mathcal{L}_{cls}\big(\hat{c}_{\sigma(i)}, \bar{c}_i\big) + s_i^{\beta}\lambda_{seg}\mathcal{L}_{seg}\big(\hat{m}_{\sigma(i)}, \bar{m}_i\big)\Big] + \sum_{j \in \omega}\lambda_{cls}\mathcal{L}_{cls}\big(\hat{c}_j, c_{bg}\big), \quad (5)$$
where $p_i$ denotes the highest classification score and $s_i$ denotes the mask IoU of the $i$-th pseudo-label, and $\alpha$ and $\beta$ are the hyper-parameters. Specifically, Eq. 5 adjusts the weights for the pseudo-labels conditioned on their qualities, i.e., dynamically dependent on the input images. For instance, for a pseudo-label with a low classification score but a high mask IoU, DALoss encourages the segmentation loss while constraining the classification loss.
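To make this re-weighting concrete, the sketch below computes the per-instance aligning weights of Eq. 5 (a minimal sketch; the helper name is hypothetical, and $\alpha = \beta = 4$ follows the best setting in Tab. 6):

```python
import torch

def dynamic_aligning_weights(cls_scores, iou_scores, alpha=4.0, beta=4.0):
    """Per-instance aligning weights of Eq. 5 (illustrative sketch).

    cls_scores: (N,) highest classification score p_i of each pseudo-label
    iou_scores: (N,) predicted mask IoU s_i of each pseudo-label
    """
    w_cls = cls_scores.pow(alpha)  # low class score -> damped cls. loss
    w_seg = iou_scores.pow(beta)   # high mask IoU   -> promoted seg. loss
    return w_cls, w_seg

# A pseudo-label with p_i = 0.4 but s_i = 0.9 keeps most of its segmentation
# signal while its unreliable classification signal is heavily damped:
w_cls, w_seg = dynamic_aligning_weights(torch.tensor([0.4]), torch.tensor([0.9]))
# w_cls ~ 0.026, w_seg ~ 0.656; per matched pseudo-label, the unsupervised
# loss then becomes w_cls * L_cls + w_seg * L_seg.
```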
3.4 End-to-End Model Training
Inspired by [44], we employ EMA with the strong-weak data augmentation for PAIS. Specifically, unlabeled images undergo both strong and weak data augmentations and are then fed into the student and teacher models, respectively. The student model is trained to produce consistent results with the pseudo-labels, and the teacher model is updated by EMA. The training pipeline for PAIS is presented in Alg. 1.
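As a reference for Alg. 1, the EMA update of the teacher can be sketched as follows (a minimal sketch assuming `teacher` and `student` are two instances of the same network; the momentum value follows Sec. 4.2, and buffers such as batch-norm statistics would be kept in sync similarly):

```python
import torch

@torch.no_grad()
def ema_update(teacher, student, momentum=0.999):
    """Update the teacher as an exponential moving average of the
    student parameters (MeanTeacher-style)."""
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(momentum).add_(p_s, alpha=1.0 - momentum)
```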
Table 1: Comparison with state-of-the-art semi-supervised instance segmentation methods on COCO (mask mAP) under different percentages of labeled data.

| Method | 1% | 5% | 10% | 100% |
|---|---|---|---|---|
| Mask-RCNN [13], supervised∗ | 3.5 | 17.3 | 22.0 | 34.5 |
| Mask-RCNN† [13], supervised∗ | 3.5 | 17.4 | 21.9 | 37.1 |
| DD [35] | 3.8 | 20.4 | 24.2 | 35.7 |
| Noisy Boundaries [47] | 7.7 | 24.9 | 29.2 | 38.6 |
| PAIS, on Mask-RCNN, ours | 21.2 | 29.3 | 31.1 | 39.5 |
Table 2: Comparison with state-of-the-art semi-supervised instance segmentation methods on Cityscapes (mask mAP) under different percentages of labeled data.

| Method | 5% | 10% | 20% | 30% |
|---|---|---|---|---|
| Mask-RCNN [13], supervised∗ | 11.8 | 16.8 | 22.3 | 26.3 |
| Mask-RCNN† [13], supervised∗ | 11.3 | 16.4 | 22.6 | 26.6 |
| DD [35] | 13.7 | 19.2 | 24.6 | 27.4 |
| STAC [42] | 11.9 | 18.2 | 22.9 | 29.0 |
| CSD [18] | 14.1 | 17.9 | 24.6 | 27.5 |
| CCT [34] | 15.2 | 18.6 | 24.7 | 26.5 |
| Dual-branch [30] | 13.9 | 18.9 | 24.0 | 28.9 |
| Ubteacher [28] | 16.0 | 20.0 | 27.1 | 28.0 |
| Noisy Boundaries [47] | 17.1 | 22.1 | 29.0 | 32.4 |
| PAIS, on Mask-RCNN, ours | 18.0 | 22.9 | 29.2 | 32.8 |
4 Experiments
4.1 Datasets and Evaluation Metrics
We conducted extensive experiments on the COCO [25] and Cityscapes [6] datasets to study the proposed PAIS. The COCO dataset consists of 118k images with 80-class instance labels, as well as 123k unlabeled images. The Cityscapes dataset contains urban street-view scenes and has 8 instance categories in 2.9k training images and 0.5k validation images. For the COCO dataset, we randomly sampled 1%, 5%, and 10% of the images from the train2017 split as labeled data and treated the rest as unlabeled data following common settings. In addition, we also used the full COCO train2017 as labeled data and incorporated the 123k unlabeled data from COCO unlabel2017 to train the PAIS models. For the Cityscapes dataset, we randomly sampled 5%, 10%, 20%, and 30% of the images from the training set as labeled data and treated the remaining as unlabeled data following the common settings. We evaluated the PAIS models on the validation sets of the COCO and Cityscapes datasets, and reported the standard COCO metrics, including AP (averaged over IoU thresholds), AP0.5, AP0.75, and APS, APM, APL (AP for instances of different scales).
Table 3: Results on COCO with 1%, 5%, and 10% labeled images (mask AP, mean±std over three runs). † denotes supervised training with the same data augmentation as the semi-supervised setting.

| Method | 1% AP | 1% AP0.5 | 1% AP0.75 | 5% AP | 5% AP0.5 | 5% AP0.75 | 10% AP | 10% AP0.5 | 10% AP0.75 |
|---|---|---|---|---|---|---|---|---|---|
| *Box-free Instance Segmentation* | | | | | | | | | |
| K-Net [53], supervised | 8.03±0.25 | 16.33±0.31 | 7.00±0.25 | 17.4±0.22 | 30.08±0.28 | 16.83±0.21 | 21.63±0.21 | 37.43±0.38 | 21.8±0.20 |
| K-Net† [53], supervised | 11.63±0.05 | 22.30±0.08 | 10.95±0.10 | 22.28±0.05 | 38.54±0.09 | 22.7±0.07 | 26.53±0.12 | 44.87±0.15 | 27.20±0.17 |
| K-Net, PAIS w/o DALoss | 17.77±0.06 | 32.17±0.15 | 17.53±0.12 | 25.40±0.08 | 43.12±0.05 | 26.05±0.13 | 29.30±0.07 | 48.64±0.05 | 30.53±0.13 |
| K-Net, PAIS | 19.78±0.10 | 35.48±0.15 | 19.65±0.06 | 27.53±0.06 | 45.63±0.06 | 28.70±0.10 | 31.04±0.06 | 50.50±0.07 | 32.34±0.05 |
| *Box-dependent Instance Segmentation* | | | | | | | | | |
| Mask-RCNN [13], supervised∗ | 3.5 | - | - | 17.3 | - | - | 22.0 | - | - |
| Mask-RCNN† [13], supervised | 11.54±0.09 | 19.86±0.11 | 11.64±0.09 | 22.35±0.06 | 37.98±0.10 | 23.14±0.11 | 27.07±0.06 | 45.10±0.10 | 28.67±0.06 |
| Mask-RCNN, PAIS w/o DALoss | 20.13±0.06 | 33.23±0.15 | 21.27±0.06 | 27.36±0.06 | 44.10±0.10 | 29.27±0.06 | 29.77±0.06 | 47.70±0.10 | 31.97±0.06 |
| Mask-RCNN, PAIS | 21.12±0.05 | 36.03±0.05 | 22.75±0.10 | 29.28±0.13 | 47.25±0.13 | 31.20±0.22 | 31.03±0.06 | 49.83±0.12 | 33.23±0.06 |
Table 4: Results on COCO using 100% of train2017 as labeled data and unlabel2017 as unlabeled data.

| Method | AP | AP0.5 | AP0.75 | APS | APM | APL |
|---|---|---|---|---|---|---|
| *Box-free Instance Segmentation* | | | | | | |
| K-Net [53], supervised | 37.8 | 60.3 | 39.9 | 16.9 | 41.2 | 57.5 |
| K-Net† [53], supervised | 38.4 | 61.4 | 40.3 | 17.6 | 41.8 | 58.0 |
| K-Net, PAIS w/o DALoss | 39.4 | 62.2 | 41.6 | 18.5 | 42.8 | 59.2 |
| K-Net, PAIS | 40.8 | 63.5 | 43.3 | 19.2 | 44.4 | 61.4 |
| *Box-dependent Instance Segmentation* | | | | | | |
| Mask-RCNN [13], supervised | 37.1 | 58.5 | 39.7 | 18.7 | 39.6 | 53.9 |
| Mask-RCNN† [13], supervised | 37.5 | 58.9 | 40.4 | 18.6 | 40.2 | 53.8 |
| Mask-RCNN, PAIS w/o DALoss | 38.4 | 59.7 | 41.5 | 19.4 | 41.1 | 55.0 |
| Mask-RCNN, PAIS | 39.5 | 60.6 | 43.0 | 19.9 | 42.4 | 56.6 |
4.2 Implementation Details
We provide two examples of implementing PAIS, with K-Net [53] and Mask-RCNN [13]. The models are trained using AdamW with a learning rate of 0.0001 for K-Net, and SGD with a learning rate of 0.01 for Mask-RCNN. The hyper-parameters $\lambda_l$ and $\lambda_u$, which balance the loss terms for labeled and unlabeled images, are set to 1.0 and 0.3 for K-Net, and 1.0 and 1.5 for Mask-RCNN. We set the thresholds $\tau_c$ and $\tau_m$ experimentally to 0.35 and 0.30, respectively. For the bipartite matching loss, we use the same hyper-parameters as in [53]. The loss balancing parameters for box, class, and mask are set as $\lambda_{box} = 2.0$, $\lambda_{cls} = 4.0$, and $\lambda_{seg} = 1.0$, respectively. We train the models on 4 GPUs with 4 images per GPU (1 labeled and 3 unlabeled images) for 220k iterations, unless otherwise specified. The teacher model is updated via EMA with a momentum of 0.999. We use ResNet50 [14] as the backbone for these models.
4.3 Main Results
Comparison to state-of-the-art semi-supervised instance segmentation frameworks. In Tab. 1, we compare the performance of models trained with PAIS to state-of-the-art semi-supervised instance segmentation frameworks on the COCO dataset. The results demonstrate that both K-Net and Mask-RCNN trained with PAIS surpass the previous methods DD [35] and Noisy Boundaries [47] by a large margin, especially when the amount of labeled data is very limited (with only 1% or 5% labeled images). Specifically, when using 1% labeled COCO images, PAIS with K-Net achieves 19.9 mask mAP, which is 12.2 points higher than Noisy Boundaries. Interestingly, the proposed PAIS brings about better performance for Mask-RCNN when the percentage of labeled data is 1% or 5%, even though Mask-RCNN originally has an inferior performance to K-Net in fully-supervised instance segmentation. A possible explanation is that the bounding boxes provide better optimization for Mask-RCNN models when the labeled data is limited. Finally, when using 10% and 100% labeled images, PAIS with K-Net outperforms PAIS with Mask-RCNN. In Tab. 2, we compare PAIS with state-of-the-art methods on the Cityscapes dataset, where PAIS also achieves better performance than the predominant models. Specifically, we report NoisyBoundaries [47] w/o FocalLoss in Tab. 2, as we do not apply FocalLoss in PAIS; this ensures a fair and consistent comparison. Furthermore, when we add FocalLoss to PAIS on 10% Cityscapes, it achieves 25.1 mAP, surpassing the 23.7 mAP of NoisyBoundaries w/ FocalLoss.
The more significant performance improvement on the COCO dataset validates the effectiveness of our method for solving the noisy pseudo-label problem. The COCO dataset has 80 instance categories, while the Cityscapes dataset only has 8 instance categories. This implies that the COCO dataset can provide more diverse and informative pseudo-labels, which matches our goal of utilizing noisy pseudo-labels for semi-supervised learning.
Results with an extremely limited number of labeled images. To demonstrate the effectiveness of PAIS, we conduct experiments with extremely limited numbers of labeled images, as shown in Table 3. Specifically, we compare the performance of various models trained with randomly sampled 1%, 5%, and 10% labeled images. First, we train supervised models, K-Net (supervised) and Mask-RCNN (supervised), with the limited labeled images. Second, we train the same supervised models with the same data augmentation used in the semi-supervised setting, denoted as K-Net† (supervised) and Mask-RCNN† (supervised). Third, we train PAIS models without DALoss, denoted as PAIS w/o DALoss on K-Net and Mask-RCNN. Lastly, we train the PAIS models with DALoss, denoted as PAIS on K-Net and Mask-RCNN. All models are trained three times, and the reported results are averaged.
Based on the results of K-Net (supervised) and Mask-RCNN (supervised), it can be observed that fully-supervised models perform poorly when the number of labeled images is limited. The results of K-Net† (supervised) and Mask-RCNN† (supervised) show slight improvement with the weak data augmentation from the semi-supervised setting. Interestingly, while K-Net outperforms Mask-RCNN in the fully-supervised setting, their performance is similar when the number of labeled images is limited, as indicated in the table. Comparing the performance of K-Net (PAIS w/o DALoss) and Mask-RCNN (PAIS w/o DALoss) with that of the supervised setting reveals significant improvement. By introducing the dynamic re-weighting process via DALoss, the performance is further improved. For instance, with only 1% labeled images, the mAP of K-Net and Mask-RCNN improves from 11.63 and 11.54 to 19.78 and 21.12, respectively, yielding gains of approximately 8.15 and 9.58 points. Additionally, the improvement on AP0.5 and AP0.75 suggests that DALoss considers masks of moderate quality during training, which further benefits semi-supervised learning.
Results with abundant labeled images. In Tab. 4, we investigate the performance of semi-supervised learning when abundant labeled data is available. Specifically, we use the entire COCO train2017 dataset as labeled data and COCO unlabel2017 dataset as unlabeled data to train the models. The results show that the semi-supervised learning approach also leads to a performance gain. For instance, K-Net (PAIS) achieves a performance gain of approximately 1.0 point on mAP from semi-supervised learning. Additionally, the proposed DALoss consistently improves the performance of both box-dependent and box-free instance segmentation frameworks.
4.4 Ablation Study
In our ablation study, we investigate several aspects that can impact the performance of PAIS, including the utilization of various loss terms, the setting of hyper-parameters, the threshold values, the varying ratios of labeled and unlabeled images, and the convergence times. We perform the ablation studies on the COCO dataset using K-Net and Mask-RCNN, which are trained via PAIS under the setting of 10% labeled images.
Table 5: Ablation on the components of DALoss (10% COCO). Cls.: aligning weight for the classification loss; IoU.: mask IoU scoring branch; Mask.: aligning weight for the segmentation loss.

| Cls. | IoU. | Mask. | AP | AP0.5 | AP0.75 | APS | APM | APL |
|---|---|---|---|---|---|---|---|---|
| | | | 29.3 | 48.7 | 30.5 | 11.3 | 31.4 | 45.5 |
| ✓ | | | 29.7 | 49.0 | 31.0 | 11.4 | 31.8 | 46.3 |
| | ✓ | | 29.4 | 48.9 | 30.6 | 11.7 | 31.5 | 46.0 |
| | ✓ | ✓ | 30.4 | 50.1 | 31.9 | 11.9 | 32.8 | 47.6 |
| ✓ | ✓ | ✓ | 31.1 | 50.6 | 32.4 | 12.3 | 33.3 | 48.3 |
Table 6: Ablation on the hyper-parameters of DALoss (10% COCO), with α = β.

| α = β | AP | AP0.5 | AP0.75 | APS | APM | APL |
|---|---|---|---|---|---|---|
| 1 | 29.9 | 49.1 | 31.0 | 10.9 | 32.1 | 47.0 |
| 2 | 30.2 | 49.5 | 31.1 | 11.1 | 32.2 | 47.1 |
| 3 | 30.5 | 50.1 | 31.7 | 11.6 | 32.9 | 47.6 |
| 4 | 31.1 | 50.6 | 32.4 | 12.3 | 33.3 | 48.3 |
Effectiveness of different terms in DALoss. Tab. 5 shows the efficacy of the different components in DALoss, which suggests that: (1) DALoss yields an overall improvement of 1.8 mAP (from 29.3 to 31.1); in particular, APL increases from 45.5 to 48.3, indicating a significant improvement for large objects. (2) The aligning weights for the classification loss alone provide a marginal performance gain (29.7 vs. 29.3). (3) Simply adding a mask IoU branch does not enhance the overall performance (29.4 vs. 29.3); however, when the aligning weights for the segmentation loss are incorporated, the mAP increases to 30.4. (4) Interestingly, the mask IoU branch helps improve the AP for objects of different scales, because it helps select good pseudo-labels for training.


Table 7: Ablation on the classification score threshold τ_c (10% COCO).

| τ_c | AP | AP0.5 | AP0.75 | APS | APM | APL |
|---|---|---|---|---|---|---|
| 0.35 | 31.1 | 50.6 | 32.4 | 12.3 | 33.3 | 48.3 |
| 0.50 | 30.6 | 50.0 | 31.9 | 11.2 | 32.6 | 47.8 |
| 0.65 | 29.9 | 49.1 | 31.3 | 10.1 | 32.1 | 47.4 |
Table 8: Ablation on the mask IoU threshold τ_m (10% COCO).

| τ_m | AP | AP0.5 | AP0.75 | APS | APM | APL |
|---|---|---|---|---|---|---|
| 0.3 | 31.1 | 50.6 | 32.4 | 12.3 | 33.3 | 48.3 |
| 0.5 | 30.8 | 50.3 | 31.9 | 12.0 | 32.9 | 47.9 |
| 0.7 | 30.6 | 50.0 | 31.2 | 11.6 | 32.3 | 47.1 |
Table 9: Ablation on the ratio of labeled to unlabeled images per batch (10% COCO).

| Ratio | AP | AP0.5 | AP0.75 | APS | APM | APL |
|---|---|---|---|---|---|---|
| 1:1 | 29.0 | 47.7 | 31.0 | 11.0 | 31.6 | 45.5 |
| 1:2 | 30.1 | 48.8 | 32.2 | 11.5 | 32.4 | 47.1 |
| 1:3 | 31.1 | 50.6 | 32.4 | 12.3 | 33.3 | 48.3 |
| 1:4 | 31.3 | 50.9 | 32.6 | 12.4 | 33.5 | 48.7 |


Hyper-parameters in DALoss. We investigate the effect of the hyper-parameters $\alpha$ and $\beta$ used in DALoss. To give equal weight to both classification and segmentation, we set $\alpha = \beta$. The results in Tab. 6 indicate that larger values of $\alpha$ and $\beta$ lead to better performance. We also analyze how different hyper-parameter values re-scale the input score, as shown in Fig. 3. The figures illustrate that increasing the hyper-parameter enlarges the gap between high- and low-quality masks or classes, adjusting the input score more appropriately according to their quality. However, when the hyper-parameter is set too large, the low-quality masks or classes are severely suppressed, potentially hurting the generalization of the model as the noise is removed entirely during training.
Model convergence speed. In Fig. 4, we analyze the convergence speed of the models. The results demonstrate that DALoss can expedite model training in terms of convergence rate. For instance, Mask-RCNN (PAIS) converges approximately 2 times faster than Mask-RCNN (PAIS w/o DALoss). Moreover, by comparing Mask-RCNN (PAIS) with K-Net (PAIS), we observe that DALoss may be particularly effective for box-dependent instance segmentation frameworks, facilitating rapid convergence.
Different thresholds for classification scores and mask IoUs. We observed from Tab. 7 and Tab. 8 that increasing the thresholds for classification score and mask IoU leads to a decrease in performance. This suggests that DALoss is able to make use of misaligned pseudo-labels instead of simply filtering them out. By allowing the model to learn from these potentially noisy labels, it is able to better handle situations where the alignment between labeled and unlabeled data is not perfect. This may result in improved generalization to new, unseen data, as the model has learned to adapt to the presence of noisy labels.
Different ratios of labeled and unlabeled images per batch. Tab. 9 shows that increasing the ratio of unlabeled images in a batch can improve the model’s performance. The results indicate that performance saturates when the ratio is set to 1:4, suggesting that adding too many unlabeled images may lead to diminishing returns.
Visualizations. We visualize the outputs of different models, including K-Net (supervised), K-Net (PAIS w/o DALoss), K-Net (PAIS), Mask-RCNN (supervised), Mask-RCNN (PAIS w/o DALoss), and Mask-RCNN (PAIS), in Fig. 5. The results show that: (1) PAIS can improve the recall for most instances in the supervised model, but the quality of the predictions is not guaranteed. (2) The use of DALoss in PAIS helps to improve the quality of predictions in terms of mask and classification.
Influence of imbalanced classes. In Fig. 6, we investigate the effect of imbalanced classes on PAIS. We plot the number of labels obtained from the ground truth and the teacher models for Mask-RCNN and K-Net with PAIS at different iterations, namely 32k, 64k, 120k, and 180k. The results indicate that the predicted pseudo-labels gradually conform to the distribution of the imbalanced labels.
5 Discussion
Table 10: Mask mAP of PAIS w/ and w/o DALoss under different thresholds (τ_c, τ_m) on 10% COCO.

| (τ_c, τ_m) | (0.35, 0.3) | (0.50, 0.3) | (0.65, 0.3) | (0.35, 0.5) | (0.35, 0.7) |
|---|---|---|---|---|---|
| w/o DALoss | 29.3 | 29.0 | 25.3 | 29.3 | 29.0 |
| w/ DALoss | 31.1 | 30.6 | 29.9 | 30.8 | 30.6 |
Table 11: Gains from EMA and DALoss on COCO (mask mAP). Numbers in parentheses denote gains over the supervised baselines.

| Method | 1% | 5% | 10% | 100% |
|---|---|---|---|---|
| K-Net, supervised | 11.6 | 22.3 | 26.5 | 38.4 |
| EMA w/o DALoss | 17.8 (+6.2) | 25.4 (+3.1) | 29.3 (+2.8) | 39.4 (+1.0) |
| EMA w/ DALoss | 19.8 (+8.2) | 27.5 (+5.2) | 31.0 (+4.5) | 40.8 (+2.4) |
| Performance gain | +2.0 | +2.1 | +1.7 | +1.4 |
| Mask-RCNN, supervised | 11.5 | 22.4 | 27.1 | 37.5 |
| EMA w/o DALoss | 20.1 (+8.6) | 27.4 (+5.0) | 29.8 (+2.7) | 38.4 (+0.9) |
| EMA w/ DALoss | 21.1 (+9.6) | 29.3 (+6.9) | 31.0 (+3.9) | 39.5 (+2.0) |
| Performance gain | +1.0 | +1.9 | +1.2 | +1.1 |
DALoss w.r.t. Thresholds Tuning. As shown in Tab. 10, we extend the threshold ablations in Tabs. 7 and 8 by removing DALoss from our method. The results show that the model with DALoss consistently outperforms threshold tuning alone in all settings, which validates that DALoss is indeed more effective than tuning the thresholds.
Comparison of EMA with DALoss. We show that the performance improvements are not mainly attributed to EMA. First, we summarize the results of Tabs. 3 and 4 in Tab. 11, which shows that DALoss achieves a larger performance gain than EMA on 100% COCO (+1.4 and +1.1 vs. +1.0 and +0.9). Second, EMA alone cannot selectively utilize noisy pseudo-labels for better learning; DALoss solves this problem and leads to further improvement over EMA.
Discussion on Score Filtering. We have carefully chosen the values of $\tau_c$ and $\tau_m$ based on Fig. 1(c)(d), which illustrates that decreasing the thresholds generates more noisy but informative pseudo-labels. Therefore, we use low thresholds to obtain such pseudo-labels, which is different from previous methods that need to adjust the thresholds to filter them out. This simplifies the tuning of the thresholds. We believe that the experiments in Tabs. 7, 8, and 10 are sufficient to show that lower thresholds are suitable for DALoss.
Generality to Other Segmentation Tasks. DALoss can be applied to other semi-supervised segmentation tasks, as noisy pseudo-labels are common in the semi-supervised setting. To show this generality, we apply DALoss to panoptic segmentation, a more challenging task that requires both instance and semantic segmentation, and report a preliminary result under 10% COCO: DALoss improves the PQ from 36.8% (PAIS w/o DALoss) to 37.3% (PAIS w/ DALoss). We leave more extensive experiments to future work.
Discussion on Large Segment Everything Models. We envision that the future of image segmentation will not only aim to segment everything, but also to provide fine-grained text descriptions for the segmented regions. However, the recently proposed models such as SAM [19] and SEEM [55] either lack labeled semantic information or demand large amounts of labeled semantic data for training. Therefore, we believe that semi-supervised learning will be a crucial solution to leverage abundant unlabeled data and reduce the labeling burden.
6 Conclusion
In this paper, we presented a novel PAIS framework for semi-supervised instance segmentation. To address the misalignment between classification score and mask quality, we introduced a dynamic aligning loss (DALoss), which aligns the classification loss term and the segmentation loss term based on the quality of different pseudo-labels. Our experimental results demonstrate the effectiveness of the proposed PAIS framework. Specifically, when the amount of labeled data is extremely limited, our pipeline equipped with PAIS and DALoss achieves superior performance for instance segmentation. We believe that PAIS can serve as a strong baseline for future research on semi-supervised instance segmentation. We hope our work can inspire further exploration in this exciting research direction.
Acknowledgements.
This work was supported by National Key R&D Program of China (2022ZD0118202), National Science Fund for Distinguished Young Scholars (No.62025603), National Natural Science Foundation of China (No.U21B2037, No.U22B2051, No.62176222, No.62176223, No.62176226, No.62072386, No.62072387, No.62072389, No.62002305 and No.62272401), and Natural Science Foundation of Fujian Province of China (No.2021J01002, No.2022J06001).
References
- [1] Philip Bachman, Ouais Alsharif, and Doina Precup. Learning with pseudo-ensembles. Advances in Neural Information Processing Systems, 2014.
- [2] David Berthelot, Nicholas Carlini, Ekin D Cubuk, Alex Kurakin, Kihyuk Sohn, Han Zhang, and Colin Raffel. Remixmatch: Semi-supervised learning with distribution alignment and augmentation anchoring. In International Conference on Learning Representations, 2020.
- [3] Daniel Bolya, Chong Zhou, Fanyi Xiao, and Yong Jae Lee. Yolact: Real-time instance segmentation. In Proceedings of the IEEE International Conference on Computer Vision, 2019.
- [4] Bowen Cheng, Ishan Misra, Alexander G Schwing, Alexander Kirillov, and Rohit Girdhar. Masked-attention mask transformer for universal image segmentation. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.
- [5] Bowen Cheng, Alex Schwing, and Alexander Kirillov. Per-pixel classification is not all you need for semantic segmentation. Advances in Neural Information Processing Systems, 2021.
- [6] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2016.
- [7] Zihang Dai, Zhilin Yang, Fan Yang, William W Cohen, and Russ R Salakhutdinov. Good semi-supervised learning that requires a bad gan. Advances in Neural Information Processing Systems, 2017.
- [8] Bert De Brabandere, Davy Neven, and Luc Van Gool. Semantic instance segmentation with a discriminative loss function. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshop, 2017.
- [9] Dominik Filipiak, Andrzej Zapała, Piotr Tempczyk, Anna Fensel, and Marek Cygan. Polite teacher: Semi-supervised instance segmentation with mutual learning and pseudo-label thresholding. arXiv preprint arXiv:2211.03850, 2022.
- [10] Geoffrey French, Michal Mackiewicz, and Mark Fisher. Self-ensembling for visual domain adaptation. In International Conference on Learning Representations, 2018.
- [11] Naiyu Gao, Yanhu Shan, Yupei Wang, Xin Zhao, Yinan Yu, Ming Yang, and Kaiqi Huang. Ssap: Single-shot instance segmentation with affinity pyramid. In Proceedings of the IEEE International Conference on Computer Vision, 2019.
- [12] Ruohao Guo, Dantong Niu, Liao Qu, and Zhenbo Li. Sotr: Segmenting objects with transformers. In Proceedings of the IEEE International Conference on Computer Vision, 2021.
- [13] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, 2017.
- [14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2016.
- [15] Jie Hu, Liujuan Cao, Yao Lu, ShengChuan Zhang, Ke Li, Feiyue Huang, Ling Shao, and Rongrong Ji. Istr: End-to-end instance segmentation via transformers. arXiv preprint arXiv:2105.00637, 2021.
- [16] Jie Hu, Linyan Huang, Tianhe Ren, Shengchuan Zhang, Rongrong Ji, and Liujuan Cao. You only segment once: Towards real-time panoptic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17819–17829, 2023.
- [17] Zhaojin Huang, Lichao Huang, Yongchao Gong, Chang Huang, and Xinggang Wang. Mask scoring r-cnn. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019.
- [18] Jisoo Jeong, Seungeui Lee, Jeesoo Kim, and Nojun Kwak. Consistency-based semi-supervised learning for object detection. Advances in Neural Information Processing Systems, 2019.
- [19] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. arXiv preprint arXiv:2304.02643, 2023.
- [20] Samuli Laine and Timo Aila. Temporal ensembling for semi-supervised learning. In International Conference on Learning Representations, 2017.
- [21] Dong-Hyun Lee et al. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In International Conference on Machine Learning (Workshop), 2013.
- [22] Youngwan Lee and Jongyoul Park. Centermask: Real-time anchor-free instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020.
- [23] Xinyang Li, Jie Hu, Shengchuan Zhang, Xiaopeng Hong, Qixiang Ye, Chenglin Wu, and Rongrong Ji. Attribute guided unpaired image-to-image translation with semi-supervised learning. arXiv preprint arXiv:1904.12428, 2019.
- [24] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, 2017.
- [25] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European Conference on Computer Vision, 2014.
- [26] Shu Liu, Jiaya Jia, Sanja Fidler, and Raquel Urtasun. Sgn: Sequential grouping networks for instance segmentation. In Proceedings of the IEEE International Conference on Computer Vision, 2017.
- [27] Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499, 2023.
- [28] Yen-Cheng Liu, Chih-Yao Ma, Zijian He, Chia-Wen Kuo, Kan Chen, Peizhao Zhang, Bichen Wu, Zsolt Kira, and Peter Vajda. Unbiased teacher for semi-supervised object detection. International Conference on Learning Representations, 2021.
- [29] Yao Lu, Zhiyi Chen, Zehui Chen, Jie Hu, Liujuan Cao, and Shengchuan Zhang. Candy: Category-kernelized dynamic convolution for instance segmentation. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023.
- [30] Wenfeng Luo and Meng Yang. Semi-supervised semantic segmentation via strong-weak dual-branch network. In European Conference on Computer Vision, 2020.
- [31] Yiwei Ma, Guohai Xu, Xiaoshuai Sun, Ming Yan, Ji Zhang, and Rongrong Ji. X-clip: End-to-end multi-grained contrastive learning for video-text retrieval. In Proceedings of the 30th ACM International Conference on Multimedia, pages 638–647, 2022.
- [32] Yiwei Ma, Xiaoqing Zhang, Xiaoshuai Sun, Jiayi Ji, Haowei Wang, Guannan Jiang, Weilin Zhuang, and Rongrong Ji. X-mesh: Towards fast and accurate text-driven 3d stylization via dynamic textual guidance. arXiv preprint arXiv:2303.15764, 2023.
- [33] Takeru Miyato, Shin-ichi Maeda, Masanori Koyama, and Shin Ishii. Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
- [34] Yassine Ouali, Céline Hudelot, and Myriam Tami. Semi-supervised semantic segmentation with cross-consistency training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020.
- [35] Ilija Radosavovic, Piotr Dollár, Ross Girshick, Georgia Gkioxari, and Kaiming He. Data distillation: Towards omni-supervised learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018.
- [36] Antti Rasmus, Mathias Berglund, Mikko Honkala, Harri Valpola, and Tapani Raiko. Semi-supervised learning with ladder networks. Advances in Neural Information Processing Systems, 2015.
- [37] Tianhe Ren, Shilong Liu, Feng Li, Hao Zhang, Ailing Zeng, Jie Yang, Xingyu Liao, Ding Jia, Hongyang Li, He Cao, Jianan Wang, Zhaoyang Zeng, Xianbiao Qi, Yuhui Yuan, Jianwei Yang, and Lei Zhang. detrex: Benchmarking detection transformers, 2023.
- [38] Tianhe Ren, Jianwei Yang, Shilong Liu, Ailing Zeng, Feng Li, Hao Zhang, Hongyang Li, Zhaoyang Zeng, and Lei Zhang. A strong and reproducible object detector with only public datasets, 2023.
- [39] Mehdi Sajjadi, Mehran Javanmardi, and Tolga Tasdizen. Regularization with stochastic transformations and perturbations for deep semi-supervised learning. Advances in Neural Information Processing Systems, 2016.
- [40] Henry Scudder. Probability of error of some adaptive pattern-recognition machines. IEEE Transactions on Information Theory, 1965.
- [41] Kihyuk Sohn, David Berthelot, Nicholas Carlini, Zizhao Zhang, Han Zhang, Colin A Raffel, Ekin Dogus Cubuk, Alexey Kurakin, and Chun-Liang Li. Fixmatch: Simplifying semi-supervised learning with consistency and confidence. Advances in Neural Information Processing Systems, 2020.
- [42] Kihyuk Sohn, Zizhao Zhang, Chun-Liang Li, Han Zhang, Chen-Yu Lee, and Tomas Pfister. A simple semi-supervised learning framework for object detection. arXiv preprint arXiv:2005.04757, 2020.
- [43] Yihe Tang, Weifeng Chen, Yijun Luo, and Yuting Zhang. Humble teachers teach better students for semi-supervised object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021.
- [44] Antti Tarvainen and Harri Valpola. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. Advances in Neural Information Processing Systems, 2017.
- [45] Xinlong Wang, Tao Kong, Chunhua Shen, Yuning Jiang, and Lei Li. Solo: Segmenting objects by locations. In European Conference on Computer Vision, 2020.
- [46] Xinlong Wang, Rufeng Zhang, Tao Kong, Lei Li, and Chunhua Shen. Solov2: Dynamic and fast instance segmentation. Advances in Neural Information Processing Systems, 2020.
- [47] Zhenyu Wang, Yali Li, and Shengjin Wang. Noisy boundaries: Lemon or lemonade for semi-supervised instance segmentation? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.
- [48] Qizhe Xie, Zihang Dai, Eduard Hovy, Thang Luong, and Quoc Le. Unsupervised data augmentation for consistency training. Advances in Neural Information Processing Systems, 2020.
- [49] Qizhe Xie, Minh-Thang Luong, Eduard Hovy, and Quoc V Le. Self-training with noisy student improves imagenet classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020.
- [50] Mengde Xu, Zheng Zhang, Han Hu, Jianfeng Wang, Lijuan Wang, Fangyun Wei, Xiang Bai, and Zicheng Liu. End-to-end semi-supervised object detection with soft teacher. In Proceedings of the IEEE International Conference on Computer Vision, 2021.
- [51] Qize Yang, Xihan Wei, Biao Wang, Xian-Sheng Hua, and Lei Zhang. Interactive self-training with mean teachers for semi-supervised object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021.
- [52] Jialin Yuan, Chao Chen, and Li Fuxin. Deep variational instance segmentation. In Advances in Neural Information Processing Systems, 2020.
- [53] Wenwei Zhang, Jiangmiao Pang, Kai Chen, and Chen Change Loy. K-net: Towards unified image segmentation. Advances in Neural Information Processing Systems, 2021.
- [54] Qiang Zhou, Chaohui Yu, Zhibin Wang, Qi Qian, and Hao Li. Instant-teaching: An end-to-end semi-supervised object detection framework. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021.
- [55] Xueyan Zou, Jianwei Yang, Hao Zhang, Feng Li, Linjie Li, Jianfeng Gao, and Yong Jae Lee. Segment everything everywhere all at once. arXiv preprint arXiv:2304.06718, 2023.