Exploring High-quality Target Domain Information for Unsupervised Domain Adaptive Semantic Segmentation
Abstract.
In unsupervised domain adaptive (UDA) semantic segmentation, distillation-based methods are currently dominant in performance. However, the distillation technique requires a complicated multi-stage process and many training tricks. In this paper, we propose a simple yet effective method that can achieve performance competitive with the advanced distillation methods. Our core idea is to fully explore the target-domain information from the views of boundaries and features. First, we propose a novel mix-up strategy to generate high-quality target-domain boundaries with ground-truth labels. Different from the source-domain boundaries used in previous works, we select high-confidence target-domain areas and paste them onto source-domain images. Such a strategy can generate object boundaries in the target domain (edges of target-domain object areas) with correct labels. Consequently, the boundary information of the target domain can be effectively captured by learning on the mixed-up samples. Second, we design a multi-level contrastive loss to improve the representation of target-domain data, including pixel-level and prototype-level contrastive learning. By combining the two proposed methods, more discriminative features can be extracted and hard object boundaries can be better addressed for the target domain. The experimental results on two commonly adopted benchmarks (i.e., GTA5→Cityscapes and SYNTHIA→Cityscapes) show that our method achieves performance competitive with complicated distillation methods. Notably, for the SYNTHIA→Cityscapes scenario, our method achieves state-of-the-art performance with 57.8% mIoU and 64.6% mIoU on 16 classes and 13 classes, respectively. Code is available at https://github.com/ljjcoder/EHTDI.

1. Introduction
The goal of semantic segmentation is to assign a semantic class label to each pixel, which is an important problem in computer vision. In recent years, deep neural networks have shown significant advantages in semantic segmentation tasks (Chen et al., 2017, 2018; Liu et al., 2019; Long et al., 2015; Zhao et al., 2017). However, such methods often require a large amount of manually annotated training data, and such pixel-level annotations are expensive and time-consuming. Therefore, a natural idea to overcome this bottleneck is to train the model on synthetic data and then directly apply it in real-world scenarios. However, due to the huge gap between synthetic data and real data, applying models trained on synthetic data (source domain) to real data (target domain) often leads to a sharp drop in performance. Domain adaptive semantic segmentation has emerged to solve this problem. In particular, this paper focuses on unsupervised domain adaptive (UDA) semantic segmentation.
In UDA semantic segmentation, methods using distillation techniques (Zhang et al., 2021b, a; Wang et al., 2021a; Huang et al., 2022) generally outperform non-distillation methods (Melas-Kyriazi and Manrai, 2021; Tranheden et al., 2021; Gao et al., 2021; Wang et al., 2021b). For example, the first distillation-based method, ProDA (Zhang et al., 2021b), is still a barrier that non-distillation methods cannot cross. Although the distillation technique has shown a strong ability to improve the generalization performance of the model, it interrupts the end-to-end training process and requires some special training tricks. For example, ProDA requires three training stages and needs to be initialized with the ASOS (Tsai et al., 2018) model in the first stage. In addition, in the distillation stage, ProDA also needs to be initialized with the model of SimCLRv2 (Chen et al., 2020b). In contrast, contemporaneous non-distillation methods (Melas-Kyriazi and Manrai, 2021; Tranheden et al., 2021; Gao et al., 2021; Wang et al., 2021b) only require one training stage and do not need special training strategies, although their performance is not as good as ProDA. Recently, a few methods (Zhang et al., 2021a; Wang et al., 2021a; Huang et al., 2022) have been able to outperform ProDA with the same backbone (DeepLab-V2 (Chen et al., 2017)). However, they still need to employ distillation and require even more sophisticated training techniques, further increasing the complexity of the methods. In pursuit of simplicity, we aim to design a non-distillation method that can achieve performance competitive with complicated distillation methods under the same conditions, with DeepLab-V2 as the backbone.
Besides distillation-based methods, methods employing pseudo-labeling techniques (Tranheden et al., 2021; Melas-Kyriazi and Manrai, 2021; Zhou et al., 2021) have attracted much attention due to their simplicity and efficiency. These methods use the prediction results of the target image as labels, so that the target image can enjoy the same supervised training as the source image, thereby improving the performance of the model on target images. With the improvement of pseudo-labeling technology, current methods already achieve good classification performance on the interior pixels of target objects, and the bottleneck is mainly concentrated on wrong predictions at edge pixels. As shown in Figure 2(b), wrong pixels are almost all located in boundary areas, and the inner areas of objects have few errors. Therefore, the key to improving segmentation performance is to improve the accuracy of the boundary pixels of target images. However, since the pseudo-labels of the boundary pixels of the target domain are often wrong, it is inherently difficult for pseudo-label techniques to improve the accuracy of boundary pixels in a supervised learning manner.

In particular, we notice that the prediction results for the interior pixels of objects in the target image are usually accurate. If we convert these interior pixels into boundary pixels, we would get target domain boundary pixels with accurate labels. Usually, the interior pixels have high confidence because there is not much ambiguity. As shown in Figure 2(c), when a high threshold is adopted, most of the remaining pixels are interior pixels with correct labels. Therefore, we choose to utilize the high-confidence target domain pixels to construct boundaries. Here we clarify the definition of the boundary. Generally speaking, the boundary refers to the area where the pixel-level semantics change (i.e., where different categories meet). Since there are multiple semantic categories around the pixels in these areas, the extracted features are usually mixed with information from other categories, leading to misjudgment. If the interior pixels of objects with correct labels (high-confidence pixels) in the target domain are placed in regions with different semantics, we would obtain "new boundary" pixels with accurate labels. Learning on these pixels can then indirectly enhance the model's ability to distinguish the boundary pixels of actual objects.
In this paper, we first propose a novel mix-up strategy to generate high-quality target domain boundaries with ground-truth labels. Unlike DACS (Tranheden et al., 2021), which explicitly learns cross-domain invariant source domain boundary representations by copying parts of the source image to the target image, we directly reinforce the learning of target boundary pixels by constructing target boundary pixels with correct labels. Our core idea is to copy the high-confidence pixels in the target image to the source image so that some high-confidence target pixels lie at positions of category semantic change, i.e., become boundary pixels. Figure 3 depicts the process of generating high-quality target domain boundary pixels. Furthermore, since the target domain lacks ground-truth supervision signals, the learned features in the target domain are not discriminative enough. To address this issue, we propose multi-level contrastive learning to explore more discriminative target domain representations. Specifically, we use both pixel-to-pixel and prototype-to-prototype contrastive learning to obtain more discriminative feature representations. By combining the two proposed techniques, our method successfully circumvents complex distillation techniques while achieving state-of-the-art performance. As shown in Figure 1, our method can beat the distillation-based ProDA and does not require any special training tricks, such as ASOS (Tsai et al., 2018) initialization.

Our main contributions are three-fold.
- (1) We propose a novel high-confidence target domain pixel cross-domain mixing strategy (HTCM) to construct high-quality target domain boundary pixels and thereby enhance the learning of target domain boundaries.
- (2) We propose multi-level contrastive learning to make target domain features more discriminative. It enforces consistency at both the pixel level and the prototype level to improve the learning of target domain features.
- (3) Extensive experiments on GTA5→Cityscapes and SYNTHIA→Cityscapes show that our simple non-distillation framework achieves competitive or even better performance than complicated distillation-based methods.
2. Related Work
2.1. Unsupervised Domain Adaptive Semantic Segmentation
Early unsupervised domain adaptive (UDA) semantic segmentation methods (Hoffman et al., 2018; Huang et al., 2020; Tsai et al., 2018; Wang et al., 2020; Yang and Soatto, 2020; Zhang and Wang, 2020) usually use adversarial training to align source and target domain images. However, these methods often suffer from instability in the training process (Chen et al., 2020c; Gulrajani et al., 2017). Different from adversarial-based methods, some methods utilize pseudo-labels (Tranheden et al., 2021; Melas-Kyriazi and Manrai, 2021; Zhou et al., 2021) and contrastive learning (Zhou et al., 2021; Xie et al., 2021) to improve performance. These methods are more concise than methods based on adversarial training. In addition to the above technical routes, ProDA (Zhang et al., 2021b) surpasses the above methods by using distillation technology. Subsequent works that outperform ProDA (Zhang et al., 2021a; Wang et al., 2021a; Huang et al., 2022) generally employ distillation techniques together with even more sophisticated training techniques. For example, MFA (Zhang et al., 2021a) proposes multiple fusion adaptation to fuse three kinds of information and combines it with distillation technology to further improve the performance of ProDA. CRA (Wang et al., 2021a), based on ProDA, aligns the distributions of confident and unconfident pixels in the target domain through an additional cross-region adaptation operation.
In this paper, we propose an efficient non-distillation method that leverages well-designed pseudo-labeling and contrastive learning techniques to match or even surpass distillation-based methods.
2.2. Pseudo Labels
The workflow of pseudo-label methods can be roughly divided into two steps: first, generating pseudo-labels on the unlabeled data, and second, fine-tuning the model on these pseudo-labels. Early methods (Zou et al., 2018, 2019; Feng et al., 2020b, a) usually performed this procedure iteratively in an offline manner. Therefore, these methods often require multiple training stages and are very cumbersome. Recently, some works (Tranheden et al., 2021; Melas-Kyriazi and Manrai, 2021; Zhou et al., 2021; Xu et al., 2019) have greatly simplified the training process through online pseudo-label techniques. For example, Pixmatch (Melas-Kyriazi and Manrai, 2021) uses random transformations to obtain a pair of strongly and weakly transformed images; the strongly transformed images can then enjoy a "supervised-form" training process with the pseudo-labels produced from the weakly transformed images. DACS (Tranheden et al., 2021) proposes to construct a mixed domain by copying image patches from the source domain to the target domain and learns the consistency between the student model and the teacher model under different contexts. In this way, it explicitly learns the cross-domain consistency of source boundaries. DSP (Gao et al., 2021) additionally introduces a mix-up strategy between source domain images on the basis of DACS to further improve performance. BAPA (Liu et al., 2021) increases the loss weight of the boundary pixels generated by DACS to improve the classification accuracy of boundary pixels. The above methods use the mix-up strategy of DACS to strengthen the learning of boundary samples, but DACS only explicitly generates source domain boundaries and does not explicitly construct target domain boundaries, resulting in insufficient learning of the boundary pixels of the target domain.
Different from these previous methods, we improve the performance of the model by explicitly constructing high-quality target domain boundary pixels.
2.3. Contrastive Learning
The goal of contrastive learning is to learn a model that can extract semantically compact features from raw images by pulling positive pairs close and pushing negative pairs apart. Many works (Chen et al., 2020a; He et al., 2020; Grill et al., 2020; Caron et al., 2020; Zhang et al., 2022; Li et al., 2021) have demonstrated the effectiveness of contrastive learning. In these works, an InfoNCE (Oord et al., 2018) loss is usually adopted to implement contrastive learning.
Some recent works introduce contrastive learning into UDA semantic segmentation. RCCR (Zhou et al., 2021) builds different contexts for a target domain region through CutMix between the target domain and the source domain. It then uses pixel-to-pixel contrastive learning to maximize the difference between inter-region pixels and minimize the inconsistency of intra-region pixels. SPCL (Xie et al., 2021) proposes a novel semantic prototype-based contrastive learning framework to achieve pixel and prototype alignment.
Different from these methods, we design a multi-level contrastive loss, which considers both pixel-to-pixel and prototype-to-prototype contrastive learning, to explore more discriminative target domain feature representations.
3. Preliminary

Unsupervised domain adaptive (UDA) semantic segmentation involves two datasets: an annotated source dataset $\mathcal{D}_s=\{(x_s, y_s)\}$ and an unlabeled target dataset $\mathcal{D}_t=\{x_t\}$. $\mathcal{D}_s$ and $\mathcal{D}_t$ share the same $C$ classes. The goal of UDA semantic segmentation is to use $\mathcal{D}_s$ and $\mathcal{D}_t$ to train a semantic segmentation model $h$ such that $h$ can correctly predict the class of each pixel in the target domain images. Generally, the model can be regarded as a composite of a feature extractor $f$ and a classifier $g$, i.e., $h = g \circ f$.
To solve this problem, we can directly train a segmentation model with source data. Specifically, given a source image $x_s$ and its corresponding label $y_s$, the segmentation loss on the source domain can be defined as:
(1)  $\mathcal{L}_{s} = \ell_{ce}\big(p_s, y_s\big) = -\sum_{i=1}^{H \times W} \sum_{c=1}^{C} y_s^{(i,c)} \log p_s^{(i,c)},$
where $p_s = h(x_s)$ is the predicted probability map, $p_s^{(i,c)}$ represents the predicted probability of pixel $i$ for class $c$, and $\ell_{ce}$ is the standard cross-entropy loss. Due to the domain shift between the source and target domains, it is difficult for a model trained in the above way to generalize well to the target domain data. To make the model better fit the target data, we can optimize the cross-entropy loss of the target image $x_t$ with its pseudo-label $\hat{y}_t = \arg\max_c h(x_t)$. In order to further improve the performance of the model on the target domain, we follow the mix-up strategy of DACS (Tranheden et al., 2021) to construct a mixed image $x_{st}$ and a mixed label $y_{st}$. Specifically, we randomly select half of the classes in the source image to generate a random binary mask $\mathcal{M}$ (where $\mathcal{M}^{(i)}=1$ if source pixel $i$ belongs to a selected class, and $0$ otherwise) and utilize $\mathcal{M}$ to generate the mixed image from the source image $x_s$ and the target image $x_t$:
(2)  $x_{st} = \mathcal{M} \odot x_s + (1-\mathcal{M}) \odot x_t.$
Similarly, the label of the mixed image can be obtained:
(3)  $y_{st} = \mathcal{M} \odot y_s + (1-\mathcal{M}) \odot \hat{y}_t.$
Similar to the source domain segmentation loss, the segmentation loss of the mixed image is defined as follows:
(4)  $\mathcal{L}_{st} = \ell_{ce}\big(h(x_{st}), y_{st}\big).$
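As a concrete illustration, the following PyTorch-style sketch shows one way to realize the DACS-style class mix of Eqs. (2)–(4). It is a minimal sketch under assumed tensor shapes and an assumed `ignore_index` convention, not the official DACS implementation.

```python
import torch
import torch.nn.functional as F

def dacs_class_mix(x_s, y_s, x_t, y_t_pseudo, ignore_index=255):
    """Eqs. (2)-(3): paste the pixels of half of the source classes onto the target image.

    x_s, x_t: (3, H, W) source / target images.
    y_s: (H, W) source ground-truth labels; y_t_pseudo: (H, W) target pseudo-labels.
    """
    classes = torch.unique(y_s)
    classes = classes[classes != ignore_index]
    # Randomly select half of the classes that appear in the source image.
    selected = classes[torch.randperm(len(classes))[: (len(classes) + 1) // 2]]
    # Binary mask M: 1 where the source pixel belongs to a selected class.
    M = torch.isin(y_s, selected).float()                 # (H, W)
    x_st = M[None] * x_s + (1 - M[None]) * x_t            # Eq. (2)
    y_st = (M * y_s + (1 - M) * y_t_pseudo).long()        # Eq. (3)
    return x_st, y_st

def mixed_seg_loss(model, x_st, y_st, ignore_index=255):
    """Eq. (4): standard cross-entropy on the mixed image.

    Assumes the model outputs per-pixel logits at the label resolution.
    """
    logits = model(x_st.unsqueeze(0))                     # (1, C, H, W)
    return F.cross_entropy(logits, y_st.unsqueeze(0), ignore_index=ignore_index)
```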
4. Method
In this work, we aim to build an efficient unsupervised domain adaptive semantic segmentation method that does not require distillation techniques. To achieve our goal, we propose two novel self-supervised learning methods. Figure 4 illustrates the overall architecture of our proposed method.
In particular, we propose a novel high-confidence target domain pixel cross-domain mixing strategy (HTCM) to construct target domain boundary pixels with correct semantic labels. In principle, HTCM only copies the high-confidence pixels in the target domain to the source domain image so that some high-confidence target domain pixels become boundary pixels. In addition, we design multi-level contrastive learning to improve the discriminativeness of features at both the pixel level and the prototype level.
4.1. Generating High-quality Target Boundary Pixels
In general, only the inner pixels of the target image can obtain correct pseudo-labels, while it is difficult to obtain accurate labels for the boundary pixels of the target image. As a result, when training target domain images directly with pseudo-labels, only the inner pixels can be effectively trained with supervision, while the boundary pixels of the target image are difficult to train effectively. To alleviate this problem, we need to obtain target boundaries with correct labels to strengthen the learning of target boundary pixels.

In this section, we propose a simple yet effective high-confidence target domain pixel cross-domain mixing strategy (HTCM) to generate high-quality target domain boundary pixels. Generating high-quality target boundary pixels means constructing target pixels that are located in semantic mutation regions and carry correct pseudo-labels. In the pseudo-labels of the target domain, the labels of high-confidence pixels tend to be correct, but these pixels are usually located inside objects rather than on boundaries. We want these correctly labeled target domain pixels to sit at boundary locations so that we can obtain target domain boundary pixels with correct labels. To do this, we paste these pixels into a source image so that some of the target domain pixels lie on borders. Specifically, for a target domain image $x_t$, we can get its corresponding predicted probability map $p_t = h(x_t)$. Then we get a high-confidence template $\mathcal{M}_t$, which indicates the pixels whose maximum predicted probability is greater than a threshold $\tau$, i.e., $\mathcal{M}_t^{(i)} = \mathbb{1}\big[\max_c p_t^{(i,c)} > \tau\big]$. Finally, we use the obtained template to construct high-quality boundary samples as follows:
(5)  $x_{ts} = \mathcal{M}_t \odot x_t + (1-\mathcal{M}_t) \odot x_s.$
Similarly, the label of the mixed image can be obtained:
(6)  $y_{ts} = \mathcal{M}_t \odot \hat{y}_t + (1-\mathcal{M}_t) \odot y_s.$
The segmentation loss of the mixed image is defined as follows:
(7)  $\mathcal{L}_{ts} = \ell_{ce}\big(h(x_{ts}), y_{ts}\big).$
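For clarity, a minimal PyTorch-style sketch of HTCM (Eqs. (5)–(7)) is given below. The tensor shapes, variable names, and the use of the model's own softmax output as the confidence estimate are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def htcm_mix(x_s, y_s, x_t, prob_t, tau=0.95):
    """Eqs. (5)-(6): paste high-confidence target pixels onto a source image.

    x_s, x_t: (3, H, W) source / target images; y_s: (H, W) source labels.
    prob_t: (C, H, W) predicted class probabilities for the target image.
    tau: confidence threshold (0.95 in our experiments, cf. Table 5).
    """
    conf, y_t_pseudo = prob_t.max(dim=0)                  # (H, W) confidence / pseudo-label
    M_t = (conf > tau).float()                            # high-confidence template
    x_ts = M_t[None] * x_t + (1 - M_t[None]) * x_s        # Eq. (5)
    y_ts = (M_t * y_t_pseudo + (1 - M_t) * y_s).long()    # Eq. (6)
    return x_ts, y_ts

# Eq. (7) is then the usual cross-entropy on (x_ts, y_ts):
# loss_ts = F.cross_entropy(model(x_ts.unsqueeze(0)), y_ts.unsqueeze(0), ignore_index=255)
```

Note that the pasted high-confidence target pixels now border source pixels of other classes, which is exactly how the new target boundary pixels with correct labels arise.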
4.2. Multi-level Contrastive Learning
Due to the lack of explicit supervision signals in the target domain data, the features of the target domain are not sufficiently discriminative. Fortunately, contrastive learning can improve the distinguishability of unlabeled data (Chen et al., 2020a; He et al., 2020; Grill et al., 2020; Caron et al., 2020), so we employ contrastive learning to further improve our model. In particular, we design a multi-level contrastive loss to improve feature distinguishability from both the prototype level and the pixel level. Since we want the features to be invariant across domains, we choose to perform contrastive learning on the mixed image $x_{st}$ instead of directly on the target image: the prototypes of $x_{st}$ contain information from both the source domain and the target domain, so prototype-level contrastive learning encourages the features of the same category in the two domains to be as consistent as possible. In contrast, the target pixels in the mixed image $x_{ts}$ are all high-confidence samples with correct pseudo-labels, which means all pixels of $x_{ts}$ can already be trained like source domain pixels; applying contrastive learning there to improve the discriminativeness of features would have little additional effect. Therefore, we choose to perform contrastive learning on the mixed image $x_{st}$.
Specifically, for the mixed image $x_{st}$, we first obtain two transformed images $x_{st}^1$ and $x_{st}^2$ through two different random transformations. Then, we use the feature extractor $f$ to obtain the feature maps $F^1 = f(x_{st}^1)$ and $F^2 = f(x_{st}^2)$ corresponding to $x_{st}^1$ and $x_{st}^2$, whose resolution is 1/8 of the input resolution (the output stride of DeepLab-V2). In order to alleviate the interference of wrong samples on the prototypes, we only use the features with high confidence to calculate the prototypes. Therefore, we can get the prototype of each category $k$ as follows:
(8)  $\rho_k^1 = \dfrac{\sum_{i} M_k^{1,(i)}\, F^{1,(i)}}{\sum_{i} M_k^{1,(i)}},$
(9)  $\rho_k^2 = \dfrac{\sum_{i} M_k^{2,(i)}\, F^{2,(i)}}{\sum_{i} M_k^{2,(i)}}.$
Here $M_k$ is a binary mask: $M_k^{(i)} = 1$ indicates that the predicted category of position $i$ is $k$ and its predicted probability is greater than the confidence threshold, while $M_k^{(i)} = 0$ indicates that the predicted category is another category or the predicted probability is below the threshold. For a query prototype $\rho_k^1$, the prototype $\rho_k^2$ is the positive and all other prototypes $\{\rho_{k'}^2\}_{k' \neq k}$ are the negatives. In domain adaptation tasks, it is not enough to improve the discriminativeness of features; we also need to narrow the distance between features and classifier weights (Li et al., 2021). Therefore, following PCL (Li et al., 2021), we compute the contrastive loss using the classification probabilities instead of the features, and the prototype-level contrastive loss (ProCL) is defined as:
(10)  $\mathcal{L}_{pro} = -\dfrac{1}{C} \sum_{k=1}^{C} \log \dfrac{\exp\big(s_1 \langle \rho_k^1, \rho_k^2 \rangle\big)}{\sum_{k'=1}^{C} \exp\big(s_1 \langle \rho_k^1, \rho_{k'}^2 \rangle\big)}.$
In addition to prototype-level contrastive learning, we further consider pixel-level contrastive learning. For a query feature $F^{1,(i)}$, the feature $F^{2,(i)}$ at the corresponding position of the other transformed image is taken as the positive sample, and all the other features are taken as negative samples. The pixel-level contrastive loss (PiCL) is defined as follows:
(11)  $\mathcal{L}_{pix} = -\dfrac{1}{N} \sum_{i=1}^{N} \log \dfrac{\exp\big(s_2 \langle F^{1,(i)}, F^{2,(i)} \rangle\big)}{\sum_{j=1}^{N} \exp\big(s_2 \langle F^{1,(i)}, F^{2,(j)} \rangle\big)},$
where $N$ is the number of positions in the feature map, and $s_1$ and $s_2$ are scaling factors, which we set to 7 and 20, respectively. $\langle u, v \rangle = \sigma(g(u))^{\top}\sigma(g(v))$ calculates the inner product of the classification probabilities corresponding to the two vectors, where $\sigma$ represents the softmax function.
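The following sketch illustrates how Eqs. (8)–(11) can be computed. It is a simplified, single-image version: the tensor shapes, the callable `classifier`, and the per-class confidence threshold are assumptions, and in practice classes without confident pixels and large pixel sets would need extra handling (e.g., filtering or sampling).

```python
import torch
import torch.nn.functional as F

def class_prototypes(feat, prob, num_classes, thr):
    """Eqs. (8)-(9): masked average of high-confidence pixel features per class.

    feat: (D, h, w) feature map from the extractor f; prob: (C, h, w) predicted probabilities.
    Returns prototypes (num_classes, D) and a flag marking classes that have confident pixels.
    """
    conf, pred = prob.max(dim=0)                          # (h, w)
    feat = feat.flatten(1).t()                            # (h*w, D)
    protos = torch.zeros(num_classes, feat.shape[1], device=feat.device)
    valid = torch.zeros(num_classes, dtype=torch.bool, device=feat.device)
    for k in range(num_classes):
        m = ((pred == k) & (conf > thr)).flatten().float()  # binary mask M_k
        if m.sum() > 0:
            protos[k] = (m[:, None] * feat).sum(0) / m.sum()
            valid[k] = True
    return protos, valid

def prob_sim(a, b, classifier):
    """Similarity as the inner product of classification probabilities (following PCL)."""
    return F.softmax(classifier(a), dim=-1) @ F.softmax(classifier(b), dim=-1).t()

def procl_loss(protos1, protos2, valid, classifier, s1=7.0):
    """Eq. (10): the same-class prototype from the other view is the positive."""
    logits = s1 * prob_sim(protos1[valid], protos2[valid], classifier)   # (K, K)
    targets = torch.arange(logits.shape[0], device=logits.device)
    return F.cross_entropy(logits, targets)

def picl_loss(feat1, feat2, classifier, s2=20.0):
    """Eq. (11): the feature at the same position in the other view is the positive."""
    f1, f2 = feat1.flatten(1).t(), feat2.flatten(1).t()   # (N, D)
    logits = s2 * prob_sim(f1, f2, classifier)            # (N, N)
    targets = torch.arange(logits.shape[0], device=logits.device)
    return F.cross_entropy(logits, targets)
```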

4.3. Loss Function
Our loss function is defined as:
(12)  $\mathcal{L} = \mathcal{L}_{s} + \mathcal{L}_{st} + \mathcal{L}_{ts} + \lambda_{pro}\, \mathcal{L}_{pro} + \lambda_{pix}\, \mathcal{L}_{pix}.$
Here $\mathcal{L}_{st}$ and $\mathcal{L}_{ts}$ are the segmentation losses on the two mixed images (Eq. (4) and Eq. (7)), and $\lambda_{pro}$ and $\lambda_{pix}$ are the hyper-parameters that balance the different losses.
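In code, the overall objective of Eq. (12) is simply a weighted sum of the five terms; the names below are hypothetical loss tensors from the previous sketches, and the default weights are placeholders rather than the values used in our experiments.

```python
def total_loss(l_s, l_st, l_ts, l_pro, l_pix, lam_pro=1.0, lam_pix=1.0):
    """Eq. (12): overall training objective; lam_pro / lam_pix balance the contrastive terms."""
    return l_s + l_st + l_ts + lam_pro * l_pro + lam_pix * l_pix
```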
Table 1. Comparison with state-of-the-art methods on GTA5→Cityscapes (per-class IoU and mIoU over 19 classes). "(D)" marks distillation-based methods.

Method | Venue | road | sdwk | bld | wall | fnc | pole | lght | sign | veg. | trrn. | sky | pers | rdr | car | trck | bus | trn | mtr | bike | mIoU
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
Source only | - | 58.6 | 13.5 | 64.1 | 16.8 | 14.0 | 23.8 | 36.2 | 19.8 | 80.3 | 19.5 | 66.3 | 58.9 | 28.7 | 64.9 | 28.7 | 3.5 | 8.0 | 29.6 | 35.3 | 35.3
DACS (Tranheden et al., 2021) | WACV 2021 | 89.9 | 39.7 | 87.9 | 30.7 | 39.5 | 38.5 | 46.4 | 52.8 | 88.0 | 44.0 | 88.8 | 67.2 | 35.8 | 84.5 | 45.7 | 50.2 | 0.0 | 27.3 | 34.0 | 52.1
SPCL (Xie et al., 2021) | arXiv 2021 | 90.3 | 50.3 | 85.7 | 45.3 | 28.4 | 36.8 | 42.2 | 22.3 | 85.1 | 43.6 | 87.2 | 62.8 | 39.0 | 87.8 | 41.3 | 53.9 | 17.7 | 35.9 | 33.8 | 52.1
RCCR (Zhou et al., 2021) | arXiv 2021 | 93.7 | 60.4 | 86.5 | 41.1 | 32.0 | 37.3 | 38.7 | 38.6 | 87.2 | 43.0 | 85.5 | 65.4 | 35.1 | 88.3 | 41.8 | 51.6 | 0.0 | 38.0 | 52.1 | 53.5
SAC (Araslanov and Roth, 2021) | CVPR 2021 | 90.4 | 53.9 | 86.6 | 42.4 | 27.3 | 45.1 | 48.5 | 42.7 | 87.4 | 40.1 | 86.1 | 67.5 | 29.7 | 88.5 | 49.1 | 54.6 | 9.8 | 26.6 | 45.3 | 53.8
Pixmatch (Melas-Kyriazi and Manrai, 2021) | CVPR 2021 | 91.6 | 51.2 | 84.7 | 37.3 | 29.1 | 24.6 | 31.3 | 37.2 | 86.5 | 44.3 | 85.3 | 62.8 | 22.6 | 87.6 | 38.9 | 52.3 | 0.7 | 37.2 | 50.0 | 50.3
DPL (Cheng et al., 2021) | ICCV 2021 | 92.8 | 54.4 | 86.2 | 41.6 | 32.7 | 36.4 | 49.0 | 34.0 | 85.8 | 41.3 | 83.0 | 63.2 | 34.2 | 87.2 | 39.3 | 44.5 | 18.7 | 42.6 | 43.1 | 53.3
BAPA (Liu et al., 2021) | ICCV 2021 | 94.4 | 61.0 | 88.0 | 26.8 | 39.9 | 38.3 | 46.1 | 55.3 | 87.8 | 46.1 | 89.4 | 68.8 | 40.0 | 90.2 | 60.4 | 59.0 | 0.0 | 45.1 | 54.2 | 57.4
UPST (Wang et al., 2021b) | ICCV 2021 | 90.5 | 38.7 | 86.5 | 41.1 | 32.9 | 40.5 | 48.2 | 42.1 | 86.5 | 36.8 | 84.2 | 64.5 | 38.1 | 87.2 | 34.8 | 50.4 | 0.2 | 41.8 | 54.6 | 52.6
DSP (Gao et al., 2021) | ACM MM 2021 | 92.4 | 48.0 | 87.4 | 33.4 | 35.1 | 36.4 | 41.6 | 46.0 | 87.7 | 43.2 | 89.8 | 66.6 | 32.1 | 89.9 | 57.0 | 56.1 | 0.0 | 44.1 | 57.8 | 55.0
MFA (D) (Zhang et al., 2021a) | BMVC 2021 | 93.5 | 61.6 | 87.0 | 49.1 | 41.3 | 46.1 | 53.5 | 53.9 | 88.2 | 42.1 | 85.8 | 71.5 | 37.9 | 88.8 | 40.1 | 54.7 | 0.0 | 48.2 | 62.8 | 58.2
ProDA (D) (Zhang et al., 2021b) | CVPR 2021 | 87.8 | 56.0 | 79.7 | 46.3 | 44.8 | 45.6 | 53.5 | 53.5 | 88.6 | 45.2 | 82.1 | 70.7 | 39.2 | 88.8 | 45.5 | 59.4 | 1.0 | 48.9 | 56.4 | 57.5
ProDA + CaCo (D) (Huang et al., 2022) | CVPR 2022 | 93.8 | 64.1 | 85.7 | 43.7 | 42.2 | 46.1 | 50.1 | 54.0 | 88.7 | 47.0 | 86.5 | 68.1 | 2.9 | 88.0 | 43.4 | 60.1 | 31.5 | 46.1 | 60.9 | 58.0
ProDA + CRA (D) (Wang et al., 2021a) | arXiv 2021 | 89.4 | 60.0 | 81.0 | 49.2 | 44.8 | 45.5 | 53.6 | 55.0 | 89.4 | 51.9 | 85.6 | 72.3 | 40.8 | 88.5 | 44.3 | 53.4 | 0.0 | 51.7 | 57.9 | 58.6
Ours | - | 95.4 | 68.8 | 88.1 | 37.1 | 41.4 | 42.5 | 45.7 | 60.4 | 87.3 | 42.6 | 86.8 | 67.4 | 38.6 | 90.5 | 66.7 | 61.4 | 0.3 | 39.4 | 56.1 | 58.8
Table 2. Comparison with state-of-the-art methods on SYNTHIA→Cityscapes. mIoU-16 and mIoU-13 are computed over 16 and 13 classes, respectively; classes marked with ∗ are excluded from mIoU-13. "(D)" marks distillation-based methods.

Method | Venue | road | sdwk | bld | wall∗ | fnc∗ | pole∗ | light | sign | veg. | sky | pers | rdr | car | bus | mtr | bike | mIoU-16 | mIoU-13
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
Source only | - | 40.1 | 18.3 | 64.9 | 5.9 | 0.1 | 24.6 | 5.9 | 9.0 | 74.8 | 81.6 | 58.7 | 16.8 | 43.9 | 11.8 | 6.4 | 24.1 | 30.4 | 35.1
DACS (Tranheden et al., 2021) | WACV 2021 | 80.6 | 25.1 | 81.9 | 21.5 | 2.9 | 37.2 | 22.7 | 24.0 | 83.7 | 90.8 | 67.6 | 38.3 | 82.9 | 38.9 | 28.5 | 47.6 | 48.3 | 54.8
SPCL (Xie et al., 2021) | arXiv 2021 | 86.9 | 43.2 | 81.6 | 16.2 | 0.2 | 31.4 | 12.7 | 12.1 | 83.1 | 78.8 | 63.2 | 23.7 | 86.9 | 56.1 | 33.8 | 45.7 | 47.2 | 54.4
RCCR (Zhou et al., 2021) | arXiv 2021 | 79.4 | 45.3 | 83.3 | - | - | - | 24.7 | 29.6 | 68.9 | 87.5 | 63.1 | 33.8 | 87.0 | 51.0 | 32.1 | 52.1 | - | 56.8
SAC (Araslanov and Roth, 2021) | CVPR 2021 | 89.3 | 47.2 | 85.5 | 26.5 | 1.3 | 43.0 | 45.5 | 32.0 | 87.1 | 89.3 | 63.6 | 25.4 | 86.9 | 35.6 | 30.4 | 53.0 | 52.6 | 59.3
Pixmatch (Melas-Kyriazi and Manrai, 2021) | CVPR 2021 | 92.5 | 54.6 | 79.8 | 4.8 | 0.1 | 24.1 | 22.8 | 17.8 | 79.4 | 76.5 | 60.8 | 24.7 | 85.7 | 33.5 | 26.4 | 54.4 | 46.1 | 54.5
DPL (Cheng et al., 2021) | ICCV 2021 | 87.5 | 45.7 | 82.8 | 13.3 | 0.6 | 33.2 | 22.0 | 20.1 | 83.1 | 86.0 | 56.6 | 21.9 | 83.1 | 40.3 | 29.8 | 45.7 | 47.0 | 54.2
BAPA (Liu et al., 2021) | ICCV 2021 | 91.7 | 53.8 | 83.9 | 22.4 | 0.8 | 34.9 | 30.5 | 42.8 | 86.6 | 88.2 | 66.0 | 34.1 | 86.6 | 51.3 | 29.4 | 50.5 | 53.3 | 61.2
UPST (Wang et al., 2021b) | ICCV 2021 | 79.4 | 34.6 | 83.5 | 19.3 | 2.8 | 35.3 | 32.1 | 26.9 | 78.8 | 79.6 | 66.6 | 30.3 | 86.1 | 36.6 | 19.5 | 56.9 | 48.0 | 54.6
DSP (Gao et al., 2021) | ACM MM 2021 | 86.4 | 42.0 | 82.0 | 2.1 | 1.8 | 34.0 | 31.6 | 33.2 | 87.2 | 88.5 | 64.1 | 31.9 | 83.8 | 65.4 | 28.8 | 54.0 | 51.0 | 59.9
MFA (D) (Zhang et al., 2021a) | BMVC 2021 | 81.8 | 40.2 | 85.3 | - | - | - | 38.0 | 33.9 | 82.3 | 82.0 | 73.7 | 41.1 | 87.8 | 56.6 | 46.3 | 63.8 | - | 62.5
ProDA (D) (Zhang et al., 2021b) | CVPR 2021 | 87.8 | 45.7 | 84.6 | 37.1 | 0.6 | 44.0 | 54.6 | 37.0 | 88.1 | 84.4 | 74.2 | 24.3 | 88.2 | 51.1 | 40.5 | 45.6 | 55.5 | 62.0
ProDA + CRA (D) (Wang et al., 2021a) | arXiv 2021 | 85.6 | 44.2 | 82.7 | 38.6 | 0.4 | 43.5 | 55.9 | 42.8 | 87.4 | 85.8 | 75.8 | 27.4 | 89.1 | 54.8 | 46.6 | 49.8 | 56.9 | 63.7
Ours | - | 93.0 | 69.8 | 84.0 | 36.6 | 9.1 | 39.7 | 42.2 | 43.8 | 88.2 | 88.1 | 68.3 | 29.0 | 85.5 | 54.1 | 37.1 | 56.3 | 57.8 | 64.6
5. Experiments
5.1. Experimental Setup
5.1.1. Dataset
We evaluate the performance of our method on two standard UDA semantic segmentation tasks: GTA5→Cityscapes and SYNTHIA→Cityscapes. GTA5 (Richter et al., 2016) is a synthetic dataset consisting of 24,966 annotated images with a resolution of 1914×1052 taken from the GTA5 game, which is used as a source domain dataset. SYNTHIA (Ros et al., 2016) is also a synthetic dataset, consisting of 9,400 annotated images with a resolution of 1280×760, which is used as another source domain dataset. Cityscapes (Cordts et al., 2016) is a representative dataset in the semantic segmentation and autonomous driving domains, which contains 5,000 pixel-level annotated images with a resolution of 2048×1024 taken from real urban street scenes and is used as the target domain dataset. In all experiments, we can use the 2,975 unlabeled target domain training images for training and the 500 validation images for testing. Following existing works (Tranheden et al., 2021; Melas-Kyriazi and Manrai, 2021), we evaluate our method on 19 common categories for GTA5→Cityscapes and on both 16 and 13 common categories for SYNTHIA→Cityscapes.
5.1.2. Implementation Details
Following common UDA protocols (Tranheden et al., 2021; Melas-Kyriazi and Manrai, 2021; Zhou et al., 2021), we use the DeepLab-V2 segmentation model (Chen et al., 2017) with a ResNet-101 (He et al., 2016) backbone pre-trained on ImageNet (Deng et al., 2009) as our model. Following (Chen et al., 2019), we first perform warm-up training on the source domain to obtain an initialized model. Then we train the model with our method. Specifically, we use the "poly" learning rate decay with the power set to 0.9. The SGD optimizer is used with weight decay and momentum 0.9. Different learning rates are used for the backbone parameters and the remaining parameters. The maximum number of iterations is 250k and we use an early stopping strategy. For the hyper-parameters in our method, the confidence threshold is set to $\tau = 0.95$ in all experiments (cf. Table 5), while the loss weights $\lambda_{pro}$ and $\lambda_{pix}$ are set separately for GTA5 and SYNTHIA. All experiments of our method are conducted on a single NVIDIA RTX 3090Ti GPU.
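For reference, the "poly" learning rate decay mentioned above follows the usual formulation; the sketch below is a minimal helper under the assumption that the dataset- and group-specific base learning rate is passed in as `base_lr`.

```python
def poly_lr(base_lr, cur_iter, max_iter, power=0.9):
    """'Poly' learning-rate schedule with power 0.9, applied per parameter group."""
    return base_lr * (1.0 - cur_iter / float(max_iter)) ** power
```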
5.2. Comparison to State-of-the-Art Methods
We compare our method with the state-of-the-art UDA methods on two common UDA benchmarks. Depending on whether the distillation technique is used, we divide these methods into two broad categories. Table 1 and Table 2 give the results. From the results, we have the following observations.
CMix | PiCL | ProCL | HTCM | mIoU (%) |
---|---|---|---|---|
52.1 | ||||
55.2 | ||||
55.6 | ||||
56.6 | ||||
56.2 | ||||
57.5 | ||||
57.7 | ||||
58.8 |
First, our method outperforms all non-distillation methods, especially under the SYNTHIA→Cityscapes setting, where our method outperforms the second-best non-distillation method by 4.5 mIoU and 3.4 mIoU on 16 classes and 13 classes, respectively. Second, we find that ProDA, the first method using distillation technology, outperforms all previous non-distillation methods. However, our method not only outperforms ProDA, but also a series of subsequent methods based on distillation techniques. For example, our method is better than CRA (Wang et al., 2021a) by 0.2 mIoU on GTA5→Cityscapes and by 0.9 mIoU on SYNTHIA→Cityscapes. Third, it should be noted that our method not only outperforms the distillation-based methods, but also simplifies the training process. In particular, our method does not require an additional two-step distillation process, nor does it require special training techniques.
In addition, we qualitatively compare our model with the source-only model (trained with $\mathcal{L}_{s}$ only) and the baseline model ($\mathcal{L}_{s}+\mathcal{L}_{st}$) by showing the segmentation results in Figure 6. It can be seen that the baseline significantly improves the segmentation results compared to the source-only model. However, for thin objects such as "bicycle" (yellow box) and "pole" (green box), almost all pixels are boundary samples, resulting in incorrect predictions by the baseline model. After using our proposed HTCM and multi-level contrastive loss, many of these mispredicted pixels are corrected. This demonstrates the effectiveness of our method.
5.3. Ablation Study
In this section, we conduct experiments to reveal the effectiveness of our proposed method. All ablation experiments are conducted on the GTA5→Cityscapes benchmark.
5.3.1. Effect of Components
In this section, we validate the individual effects of our proposed HTCM, ProCL, and PiCL. As shown in Table 3, our final method achieves an improvement of 6.7 mIoU over the model using only CMix, while removing any one of our components causes a performance drop compared to our final method. The results validate that each of our proposed components plays an important role in UDA semantic segmentation.
Methods | mIoU (%) |
---|---|
Source Only | 35.3 |
DACS | 52.1 |
HTCM | 52.8 |
DSP (Hard) | 53.6 |
DSP (Soft) | 54.5 |
DACS+HTCM | 56.6 |
$\tau$ | 0.00 | 0.35 | 0.55 | 0.75 | 0.95 | 0.97
---|---|---|---|---|---|---
mIoU | 52.9 | 53.8 | 54.6 | 55.2 | 56.6 | 55.8
5.3.2. Comparing Different Mix-up Strategies
In this section, we compare different mix-up strategies to demonstrate the superiority of HTCM. In particular, we compare with DACS (Tranheden et al., 2021) and DSP (Gao et al., 2021). Table 4 gives the results. First, compared to DACS, which explicitly constructs target-domain-style context for source boundaries by copying source pixels into a target image, our HTCM exceeds it by 0.7 mIoU. This demonstrates the importance of constructing target boundaries with correct labels. In addition, combining HTCM and DACS further improves performance, showing that our method is orthogonal to DACS. Second, DSP additionally introduces a source-to-source mix-up strategy on top of DACS. For a fair comparison, we use DACS+HTCM to compare with DSP. In particular, DSP adopts a soft-paste strategy to improve performance; we use "Hard" or "Soft" to indicate whether this strategy is adopted. It can be seen that the source-to-source strategy used by DSP is inferior to HTCM. This further demonstrates the superiority of our method.
5.3.3. Confident Threshold
In this section, we analyze the impact of different thresholds $\tau$ on performance. In Table 5, we show the results for $\tau \in \{0.00, 0.35, 0.55, 0.75, 0.95, 0.97\}$. We find that setting a relatively high threshold ($\tau = 0.95$) boosts performance more effectively than lower thresholds ($\tau \le 0.75$). This clearly shows the importance of constructing boundaries with high-confidence pixels. However, when we increase the threshold from 0.95 to 0.97, the mIoU decreases from 56.6 to 55.8. This is because a too high threshold filters out too many pixels, resulting in insufficient learning.
Mixed Image | $x_{ts}$ | $x_{st}$
---|---|---
mIoU | 56.8 | 58.8
5.3.4. Contrastive Learning with Different Images
In this section, we compare performing contrastive learning on $x_{ts}$ with performing it on $x_{st}$. Table 6 gives the results. It can be seen that contrastive learning on $x_{ts}$ is 2.0 mIoU lower than contrastive learning on $x_{st}$. As analyzed in Sec. 4.2, almost all pixels in $x_{ts}$ have correct labels, so the classification loss already provides a good enough supervision signal to learn discriminative feature representations. As a result, the effect of using contrastive learning to further enhance the discriminativeness of features becomes very weak.
5.3.5. Training Cost Discussion
Here we compare the training costs of DSP (Gao et al., 2021) and ProDA (Zhang et al., 2021b) with our method on GTA5→Cityscapes. Table 7 gives the results. Compared with ProDA and DSP, our method has not only a lower training cost but also higher performance, which demonstrates its superiority.
Method | GPUs | Time (hours) | mIoU
---|---|---|---
ProDA | 4*Tesla-V100-32GB | 108 | 57.5 |
DSP | 1*RTX-3090-24GB | 96 | 55.0 |
Ours | 1*RTX-3090-24GB | 63 | 58.8 |
6. Conclusion
In this paper, we design a simple yet effective non-distillation framework for UDA semantic segmentation. We propose a novel mix-up strategy to generate high-quality target domain boundary pixels. Specifically, we copy high-confidence target domain pixels into the source image so that some of them become new boundary pixels with correct labels. In addition, we propose multi-level contrastive learning to obtain more discriminative feature representations, which contains pixel-to-pixel and prototype-to-prototype contrastive losses. We experimentally verify the effectiveness of our proposed methods on GTA5→Cityscapes and SYNTHIA→Cityscapes. We believe that our work provides a feasible technical route for designing efficient UDA semantic segmentation methods that bypass distillation techniques.
Acknowledgements.
This work is supported by the National Natural Science Foundation of China under Grants 62176246 and 61836008.

References
- Araslanov and Roth (2021) Nikita Araslanov and Stefan Roth. 2021. Self-supervised augmentation consistency for adapting semantic segmentation. In CVPR. 15384–15394.
- Caron et al. (2020) Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. 2020. Unsupervised learning of visual features by contrasting cluster assignments. In NeurIPS. 9912–9924.
- Chen et al. (2017) Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. 2017. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence 40, 4 (2017), 834–848.
- Chen et al. (2018) Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. 2018. Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV. 801–818.
- Chen et al. (2019) Minghao Chen, Hongyang Xue, and Deng Cai. 2019. Domain adaptation for semantic segmentation with maximum squares loss. In ICCV. 2090–2099.
- Chen et al. (2020c) Minghao Chen, Shuai Zhao, Haifeng Liu, and Deng Cai. 2020c. Adversarial-learned loss for domain adaptation. In AAAI. 3521–3528.
- Chen et al. (2020a) Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey E. Hinton. 2020a. A Simple Framework for Contrastive Learning of Visual Representations. In ICML. 1597–1607.
- Chen et al. (2020b) Ting Chen, Simon Kornblith, Kevin Swersky, Mohammad Norouzi, and Geoffrey E Hinton. 2020b. Big self-supervised models are strong semi-supervised learners. In NeurIPS. 22243–22255.
- Cheng et al. (2021) Yiting Cheng, Fangyun Wei, Jianmin Bao, Dong Chen, Fang Wen, and Wenqiang Zhang. 2021. Dual Path Learning for Domain Adaptation of Semantic Segmentation. In ICCV. 9082–9091.
- Cordts et al. (2016) Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. 2016. The cityscapes dataset for semantic urban scene understanding. In CVPR. 3213–3223.
- Deng et al. (2009) Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. Imagenet: A large-scale hierarchical image database. In CVPR. 248–255.
- Feng et al. (2020a) Zhengyang Feng, Qianyu Zhou, Guangliang Cheng, Xin Tan, Jianping Shi, and Lizhuang Ma. 2020a. Semi-supervised semantic segmentation via dynamic self-training and classbalanced curriculum. arXiv preprint arXiv:2004.08514 1, 2 (2020), 5.
- Feng et al. (2020b) Zhengyang Feng, Qianyu Zhou, Qiqi Gu, Xin Tan, Guangliang Cheng, Xuequan Lu, Jianping Shi, and Lizhuang Ma. 2020b. Dmt: Dynamic mutual training for semi-supervised learning. arXiv preprint arXiv:2004.08514 (2020).
- Gao et al. (2021) Li Gao, Jing Zhang, Lefei Zhang, and Dacheng Tao. 2021. Dsp: Dual soft-paste for unsupervised domain adaptive semantic segmentation. In ACM MM. 2825–2833.
- Grill et al. (2020) Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre H Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Daniel Guo, Mohammad Gheshlaghi Azar, et al. 2020. Bootstrap your own latent: A new approach to self-supervised learning. In NeurIPS. 21271–21284.
- Gulrajani et al. (2017) Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. 2017. Improved training of wasserstein gans. In NeurIPS, Vol. 30.
- He et al. (2020) Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. 2020. Momentum contrast for unsupervised visual representation learning. In CVPR. 9729–9738.
- He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In CVPR. 770–778.
- Hoffman et al. (2018) Judy Hoffman, Eric Tzeng, Taesung Park, Jun-Yan Zhu, Phillip Isola, Kate Saenko, Alexei Efros, and Trevor Darrell. 2018. Cycada: Cycle-consistent adversarial domain adaptation. In ICML. 1989–1998.
- Huang et al. (2022) Jiaxing Huang, Dayan Guan, Aoran Xiao, Shijian Lu, and Ling Shao. 2022. Category contrast for unsupervised domain adaptation in visual tasks. In CVPR. 1203–1214.
- Huang et al. (2020) Jiaxing Huang, Shijian Lu, Dayan Guan, and Xiaobing Zhang. 2020. Contextual-relation consistent domain adaptation for semantic segmentation. In ECCV. Springer, 705–722.
- Li et al. (2021) Junjie Li, Yixin Zhang, Zilei Wang, and Keyu Tu. 2021. Semantic-aware Representation Learning Via Probability Contrastive Loss. arXiv preprint arXiv:2111.06021 (2021).
- Liu et al. (2019) Chenxi Liu, Liang-Chieh Chen, Florian Schroff, Hartwig Adam, Wei Hua, Alan L Yuille, and Li Fei-Fei. 2019. Auto-deeplab: Hierarchical neural architecture search for semantic image segmentation. In CVPR. 82–92.
- Liu et al. (2021) Yahao Liu, Jinhong Deng, Xinchen Gao, Wen Li, and Lixin Duan. 2021. BAPA-Net: Boundary adaptation and prototype alignment for cross-domain semantic segmentation. In ICCV. 8801–8811.
- Long et al. (2015) Jonathan Long, Evan Shelhamer, and Trevor Darrell. 2015. Fully convolutional networks for semantic segmentation. In CVPR. 3431–3440.
- Melas-Kyriazi and Manrai (2021) Luke Melas-Kyriazi and Arjun K Manrai. 2021. PixMatch: Unsupervised domain adaptation via pixelwise consistency training. In CVPR. 12435–12445.
- Oord et al. (2018) Aaron van den Oord, Yazhe Li, and Oriol Vinyals. 2018. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748 (2018).
- Richter et al. (2016) Stephan R Richter, Vibhav Vineet, Stefan Roth, and Vladlen Koltun. 2016. Playing for data: Ground truth from computer games. In ECCV. Springer, 102–118.
- Ros et al. (2016) German Ros, Laura Sellart, Joanna Materzynska, David Vazquez, and Antonio M Lopez. 2016. The synthia dataset: A large collection of synthetic images for semantic segmentation of urban scenes. In CVPR. 3234–3243.
- Tranheden et al. (2021) Wilhelm Tranheden, Viktor Olsson, Juliano Pinto, and Lennart Svensson. 2021. Dacs: Domain adaptation via cross-domain mixed sampling. In WACV. 1379–1389.
- Tsai et al. (2018) Yi-Hsuan Tsai, Wei-Chih Hung, Samuel Schulter, Kihyuk Sohn, Ming-Hsuan Yang, and Manmohan Chandraker. 2018. Learning to adapt structured output space for semantic segmentation. In CVPR. 7472–7481.
- Wang et al. (2021b) Yuxi Wang, Junran Peng, and ZhaoXiang Zhang. 2021b. Uncertainty-aware pseudo label refinery for domain adaptive semantic segmentation. In ICCV. 9092–9101.
- Wang et al. (2021a) Zhijie Wang, Xing Liu, Masanori Suganuma, and Takayuki Okatani. 2021a. Cross-Region Domain Adaptation for Class-level Alignment. arXiv preprint arXiv:2109.06422 (2021).
- Wang et al. (2020) Zhonghao Wang, Mo Yu, Yunchao Wei, Rogerio Feris, Jinjun Xiong, Wen-mei Hwu, Thomas S Huang, and Honghui Shi. 2020. Differential treatment for stuff and things: A simple unsupervised domain adaptation method for semantic segmentation. In CVPR. 12635–12644.
- Xie et al. (2021) Binhui Xie, Kejia Yin, Shuang Li, and Xinjing Chen. 2021. SPCL: A New Framework for Domain Adaptive Semantic Segmentation via Semantic Prototype-based Contrastive Learning. arXiv preprint arXiv:2111.12358 (2021).
- Xu et al. (2019) Yonghao Xu, Bo Du, Lefei Zhang, Qian Zhang, Guoli Wang, and Liangpei Zhang. 2019. Self-ensembling attention networks: Addressing domain shift for semantic segmentation. In AAAI. 5581–5588.
- Yang and Soatto (2020) Yanchao Yang and Stefano Soatto. 2020. Fda: Fourier domain adaptation for semantic segmentation. In CVPR. 4085–4095.
- Zhang et al. (2021a) Kai Zhang, Yifan Sun, Rui Wang, Haichang Li, and Xiaohui Hu. 2021a. Multiple Fusion Adaptation: A Strong Framework for Unsupervised Semantic Segmentation Adaptation. BMVC (2021).
- Zhang et al. (2021b) Pan Zhang, Bo Zhang, Ting Zhang, Dong Chen, Yong Wang, and Fang Wen. 2021b. Prototypical pseudo label denoising and target structure learning for domain adaptive semantic segmentation. In CVPR. 12414–12424.
- Zhang et al. (2022) Yixin Zhang, Junjie Li, and Zilei Wang. 2022. Low-confidence Samples Matter for Domain Adaptation. arXiv preprint arXiv:2202.02802 (2022).
- Zhang and Wang (2020) Yixin Zhang and Zilei Wang. 2020. Joint adversarial learning for domain adaptation in semantic segmentation. In AAAI. 6877–6884.
- Zhao et al. (2017) Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. 2017. Pyramid scene parsing network. In CVPR. 2881–2890.
- Zhou et al. (2021) Qianyu Zhou, Chuyun Zhuang, Xuequan Lu, and Lizhuang Ma. 2021. Domain Adaptive Semantic Segmentation with Regional Contrastive Consistency Regularization. arXiv preprint arXiv:2110.05170 (2021).
- Zou et al. (2018) Yang Zou, Zhiding Yu, BVK Kumar, and Jinsong Wang. 2018. Unsupervised domain adaptation for semantic segmentation via class-balanced self-training. In ECCV. 289–305.
- Zou et al. (2019) Yang Zou, Zhiding Yu, Xiaofeng Liu, BVK Kumar, and Jinsong Wang. 2019. Confidence regularized self-training. In CVPR. 5982–5991.