
Exploring High-quality Target Domain Information for Unsupervised Domain Adaptive Semantic Segmentation

Junjie Li [email protected] 0000-0002-2906-8514, University of Science and Technology of China, Hefei, China; Zilei Wang [email protected] 0000-0003-1822-3731, University of Science and Technology of China, Hefei, China; Yuan Gao [email protected] 0000-0003-3899-0698, University of Science and Technology of China, Hefei, China; and Xiaoming Hu [email protected] 0000-0002-4853-7814, University of Science and Technology of China, Hefei, China
(2022)
Abstract.

In unsupervised domain adaptive (UDA) semantic segmentation, distillation-based methods are currently dominant in performance. However, the distillation technique requires a complicated multi-stage process and many training tricks. In this paper, we propose a simple yet effective method that achieves performance competitive with the advanced distillation methods. Our core idea is to fully explore the target-domain information from the views of boundaries and features. First, we propose a novel mix-up strategy to generate high-quality target-domain boundaries with ground-truth labels. Different from the source-domain boundaries in previous works, we select the high-confidence target-domain areas and then paste them onto the source-domain images. Such a strategy can generate object boundaries in the target domain (edges of target-domain object areas) with correct labels. Consequently, the boundary information of the target domain can be effectively captured by learning on the mixed samples. Second, we design a multi-level contrastive loss to improve the representation of target-domain data, including pixel-level and prototype-level contrastive learning. By combining the two proposed methods, more discriminative features can be extracted and hard object boundaries can be better addressed for the target domain. The experimental results on two commonly adopted benchmarks (i.e., GTA5 \rightarrow Cityscapes and SYNTHIA \rightarrow Cityscapes) show that our method achieves performance competitive with complicated distillation methods. Notably, for the SYNTHIA \rightarrow Cityscapes scenario, our method achieves state-of-the-art performance with 57.8\% mIoU and 64.6\% mIoU on 16 classes and 13 classes, respectively. Code is available at https://github.com/ljjcoder/EHTDI.

unsupervised domain adaptive semantic segmentation, contrastive learning, pseudo labels
copyright: acmcopyright; journal year: 2022; doi: 10.1145/3503161.3548114; conference: Proceedings of the 30th ACM International Conference on Multimedia, October 10–14, 2022, Lisboa, Portugal; booktitle: Proceedings of the 30th ACM International Conference on Multimedia (MM '22), October 10–14, 2022, Lisboa, Portugal; price: 15.00; isbn: 978-1-4503-9203-7/22/10; ccs: Computing methodologies, Computer vision problems
Refer to caption
Figure 1. ProDA (Zhang et al., 2021b) vs. Ours on GTA5 (Richter et al., 2016). ProDA needs to be initialized with the model of ASOS (Tsai et al., 2018) and requires an additional two-step distillation stage (red box). Compared to ProDA, our method only requires a source-only model for initialization and does not require the distillation stage.

1. Introduction

The goal of semantic segmentation is to assign a semantic class label to each pixel, which is an important problem in computer vision. In recent years, deep neural networks have shown significant advantages in semantic segmentation tasks (Chen et al., 2017, 2018; Liu et al., 2019; Long et al., 2015; Zhao et al., 2017). However, such methods often require a large amount of manually annotated training data, and such pixel-level annotations are expensive and time-consuming. Therefore, a natural idea to overcome this bottleneck is to train the model on synthetic data and then directly apply it in real-world scenarios. However, due to the huge gap between synthetic and real data, applying models trained on synthetic data (source domain) to real data (target domain) often leads to a sharp drop in performance. In the field of computer vision, domain adaptive segmentation methods have emerged to solve this problem. In particular, this paper focuses on unsupervised domain adaptive (UDA) semantic segmentation.

In UDA semantic segmentation, methods using distillation techniques (Zhang et al., 2021b, a; Wang et al., 2021a; Huang et al., 2022) generally outperform non-distilled methods (Melas-Kyriazi and Manrai, 2021; Tranheden et al., 2021; Gao et al., 2021; Wang et al., 2021b). For example, the first distillation-based method, ProDA (Zhang et al., 2021b), is still a barrier that non-distillation methods cannot overcome. Although the distillation technique has shown a strong ability to improve the generalization performance of the model, it interrupts the end-to-end training process and requires some special training tricks. For example, ProDA requires three training stages and needs to be initialized with the ASOS (Tsai et al., 2018) model in the first stage. In addition, in the distillation stage, ProDA also needs to be initialized with the model of SimCLRV2 (Chen et al., 2020b). In contrast, contemporaneous non-distilled methods (Melas-Kyriazi and Manrai, 2021; Tranheden et al., 2021; Gao et al., 2021; Wang et al., 2021b) only require one training stage and do not need special training strategies, although their performance is not as good as ProDA. Recently, only a few methods (Zhang et al., 2021a; Wang et al., 2021a; Huang et al., 2022) have outperformed ProDA with the same backbone (DeepLab-V2 (Chen et al., 2017)). However, they still employ distillation and require even more sophisticated training techniques, further increasing the complexity of the methods. In pursuit of simplicity, we aim to design a non-distillation method that achieves performance competitive with complicated distillation methods under the same condition, i.e., with DeepLab-V2 as the backbone.

Besides the distillation-based methods, methods employing pseudo-labeling techniques (Tranheden et al., 2021; Melas-Kyriazi and Manrai, 2021; Zhou et al., 2021) have attracted much attention due to their simplicity and efficiency. These methods use the prediction results on the target image as labels, so that the target image can enjoy the same supervised training as the source image, thereby improving the performance of the model on the target image. With the improvement of pseudo-labeling techniques, current methods already classify the interior pixels of target objects well enough, and the bottleneck is mainly concentrated on the wrong predictions of edge pixels. As shown in Figure 2(b), wrong pixels are almost all located in the boundary area, and the inner area of the object has few errors. Therefore, the key to improving segmentation performance is to improve the accuracy of the boundary pixels of the target images. However, since the pseudo-labels of the boundary pixels in the target domain are often wrong, it is inherently difficult for pseudo-label techniques to improve the accuracy of boundary pixels in a supervised learning manner.

Refer to caption
Figure 2. False foreground pixels under different thresholds \tau. White indicates wrong pixels. When the threshold \tau is set to 0, most of the wrong pixels are concentrated near the boundary. When the threshold \tau is set to 0.95, most of the wrong pixels are filtered out.

In particular, we notice that the prediction results of the interior pixels of objects in the target image are usually accurate. If we could convert these interior pixels into boundary pixels, we would obtain target-domain boundary pixels with accurate labels. Usually, the interior pixels have high confidence because there is little ambiguity. As shown in Figure 2(c), when a high threshold is adopted, most of the remaining pixels are interior pixels with correct labels. Therefore, we choose to utilize the high-confidence target-domain pixels to construct boundaries. Here we clarify the definition of the boundary. Generally speaking, the boundary refers to the area where the pixel-level semantics change (i.e., where different categories meet). Since there are multiple semantic categories around the pixels in these areas, the extracted features are usually mixed with other category information, leading to misjudgment. If the interior pixels of objects with correct labels (high-confidence pixels) in the target domain are placed in regions with different semantics, we would obtain "new boundary" pixels with accurate labels. Learning on these pixels can then indirectly enhance the model's ability to distinguish the boundary pixels of actual objects.

In this paper, we first propose a novel mix-up strategy to generate high-quality target domain boundaries with ground-truth labels. Unlike DACS (Tranheden et al., 2021), which explicitly learns cross-domain invariant source domain boundary representations by copying parts of the source image to the target image, we directly reinforce the learning of the target boundary pixels by constructing target boundary pixels with correct labels. Our core idea is to copy the high-confidence pixels in the target image to the source image so that some high-confidence target pixels end up at positions of category semantic change, i.e., become boundary pixels. Figure 3 depicts the process of generating high-quality target domain boundary pixels. Furthermore, since the target domain lacks ground-truth supervision signals, the learned features in the target domain are not discriminative enough. To address this issue, we propose multi-level contrastive learning to explore more discriminative target domain representations. Specifically, we use both pixel-to-pixel and prototype-to-prototype contrastive learning to obtain more discriminative feature representations. By combining the two proposed techniques, our method successfully circumvents complex distillation techniques while achieving state-of-the-art performance. As shown in Figure 1, our method can beat the distillation-based ProDA and does not require any special training tricks, such as ASOS (Tsai et al., 2018) initialization.

Refer to caption
Figure 3. Illustration of the process of generating high-quality target domain boundary pixels. The boundary pixels of the target image are usually mispredicted (yellow region), but the inner pixels have correct pseudo-labels (red region). Interior pixels are copied into the source image so that some of them become boundary pixels (dotted areas in the mixed image).

Our main contributions are three-fold:

  (1) We propose a novel high-confidence target domain pixel cross-domain mixing strategy (HTCM) to construct high-quality target domain boundary pixels and thus enhance the learning of target domain boundaries.

  (2) We propose multi-level contrastive learning to make target domain features more discriminative. It enforces consistency at both the pixel level and the prototype level to improve the learning of target domain features.

  (3) We construct an efficient non-distillation method. To the best of our knowledge, this is the first work that does not use the distillation technique while outperforming distillation-based methods on both the GTA5 (Richter et al., 2016) and SYNTHIA (Ros et al., 2016) datasets.

2. Related Work

2.1. Unsupervised Domain Adaptive Semantic Segmentation

Early unsupervised domain adaptive (UDA) semantic segmentation methods (Hoffman et al., 2018; Huang et al., 2020; Tsai et al., 2018; Wang et al., 2020; Yang and Soatto, 2020; Zhang and Wang, 2020) usually use adversarial training to align source and target domain images. However, these methods often suffer from instability in the training process (Chen et al., 2020c; Gulrajani et al., 2017). Different from adversarial-based methods, some methods utilize pseudo-labels (Tranheden et al., 2021; Melas-Kyriazi and Manrai, 2021; Zhou et al., 2021) and contrastive learning (Zhou et al., 2021; Xie et al., 2021) to improve performance. These methods are more concise than methods based on adversarial training. In addition to the above technical routes, ProDA (Zhang et al., 2021b) surpasses the above methods by using distillation. Subsequent works that outperform ProDA (Zhang et al., 2021a; Wang et al., 2021a; Huang et al., 2022) generally employ distillation techniques and use much more sophisticated training techniques. For example, MFA (Zhang et al., 2021a) proposes multiple fusion adaptation to fuse three kinds of information and combines it with distillation to further improve the performance of ProDA. CRA (Wang et al., 2021a), based on ProDA, aligns the distribution of confident and unconfident pixels in the target domain through an additional cross-region adaptation operation.

In this paper, we propose an efficient non-distillation method that leverages well-designed pseudo-labeling and contrastive learning techniques to match or even surpass distillation-based methods.

2.2. Pseudo Labels

The workflow of pseudo-label methods can be roughly divided into two steps: first, generating pseudo-labels on the unlabeled data, and second, fine-tuning the model on these pseudo-labels. Early methods (Zou et al., 2018, 2019; Feng et al., 2020b, a) usually performed these steps iteratively in an offline manner. Therefore, they often require multiple training stages and are very cumbersome. Recently, some works (Tranheden et al., 2021; Melas-Kyriazi and Manrai, 2021; Zhou et al., 2021; Xu et al., 2019) have greatly simplified the training process through online pseudo-label techniques. For example, Pixmatch (Melas-Kyriazi and Manrai, 2021) uses random transformations to obtain a pair of strongly and weakly transformed images. The strongly transformed images can then enjoy a "supervised-form" training process with the pseudo-labels produced on the weakly transformed images. DACS (Tranheden et al., 2021) proposes to construct a mixed domain by copying image patches from the source domain to the target domain and to learn the consistency between the student model and the teacher model under different contexts. In this way, it explicitly learns the cross-domain consistency of source boundaries. DSP (Gao et al., 2021) additionally introduces a mix-up strategy between source domain images on top of DACS to further improve performance. BAPA (Liu et al., 2021) increases the loss weight of boundary pixels generated by DACS to improve the classification accuracy of boundary pixels. The above methods use the mix-up strategy of DACS to strengthen the learning of boundary samples, but DACS only explicitly generates source domain boundaries and does not explicitly construct target domain boundaries, resulting in insufficient learning of the boundary pixels of the target domain.

Different from these previous methods, we improve the performance of the model by explicitly constructing high-quality target domain boundary pixels.

2.3. Contrastive Learning

The goal of contrastive learning is to learn a model that can extract semantically compact features from raw images by encouraging positive pairs to be close and pushing negative pairs apart. Many works (Chen et al., 2020a; He et al., 2020; Grill et al., 2020; Caron et al., 2020; Zhang et al., 2022; Li et al., 2021) have demonstrated the effectiveness of contrastive learning. In these works, an InfoNCE (Oord et al., 2018) loss is usually adopted to implement contrastive learning.

Some recent works introduce contrastive learning into UDA semantic segmentation. RCCR (Zhou et al., 2021) builds different contexts for a target domain region through cutmix between the target domain and the source domain. It then uses pixel-to-pixel contrastive learning to maximize the difference between inter-region pixels and minimize the inconsistency of intra-region pixels. SPCL (Xie et al., 2021) proposes a novel semantic prototype-based contrastive learning framework to achieve pixel and prototype alignment.

Different from these methods, we design a multi-level contrastive learning loss, which considers both pixel-to-pixel and prototype-to-prototype contrastive losses, to explore more discriminative target domain feature representations.

3. Preliminary

Refer to caption
Figure 4. Framework of our method. CMix represents the class mix used in DACS. T_{1} and T_{2} represent two random transformations. In addition to CMix, our method generates high-quality target domain boundary pixels via HTCM and performs multi-level contrastive learning on the mixed image x_{st}.

Unsupervised domain adaptive (UDA) semantic segmentation involves two datasets: an annotated source dataset \mathbb{D}_{S}=\{(x_{s},y_{s})\}_{s=1}^{n_{S}} and an unlabeled target dataset \mathbb{D}_{T}=\{x_{t}\}_{t=1}^{n_{T}}. \mathbb{D}_{S} and \mathbb{D}_{T} share the same C classes. The goal of UDA semantic segmentation is to use \mathbb{D}_{S} and \mathbb{D}_{T} to train a semantic segmentation model M such that M can correctly predict the class of each pixel in the target domain images. Generally, the model M=G\circ E can be regarded as the composition of a feature extractor E and a classifier G.

To solve this problem, we can directly train a segmentation model with source data. Specifically, given a source image x_{s} and its corresponding label y_{s}, the segmentation loss on the source domain can be defined as:

(1) \mathbb{L}_{s}=\sum_{i=1}^{H\times W}\mathbb{L}_{ce}(p_{s,i},y_{s,i}),

where p_{s,i} represents the predicted probability for pixel x_{s,i} and \mathbb{L}_{ce} is the standard cross-entropy loss. Due to the domain shift between the source and target domains, it is difficult for a model trained in this way to generalize well to the target domain data. To make the model better fit the target data, we can optimize the cross-entropy loss of the target image x_{t} with the pseudo-label y^{\prime}_{t}. To further improve the performance of the model on the target domain, we follow the mix-up strategy of DACS (Tranheden et al., 2021) to construct a mixed image x_{st} and a mixed label y_{st}. Specifically, we randomly select half of the classes in the source image to generate a random mask M_{s} and utilize M_{s} to generate the mixed image from the source image and the target image:

(2) x_{st}=x_{s}\odot M_{s}+x_{t}\odot(1-M_{s}).

Similarly, the label of the mixed image can be obtained:

(3) y_{st}=y_{s}\odot M_{s}+y^{\prime}_{t}\odot(1-M_{s}).

Similar to the source domain segmentation loss, the segmentation loss of the mixed image is defined as follows:

(4) \mathbb{L}_{st}=\sum_{i=1}^{H\times W}\mathbb{L}_{ce}(p_{st,i},y_{st,i}).
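For concreteness, the following PyTorch-style sketch shows how such a class-mixed sample could be built from Eq. (2) and Eq. (3). It is only an illustration under our own naming (class_mix, y_t_pseudo); in practice the pseudo-labels come from a teacher prediction on the target image, and the released code may differ in details.

```python
import torch

def class_mix(x_s, y_s, x_t, y_t_pseudo):
    """Sketch of the DACS-style class mix used as the baseline (Eqs. 2-3).

    x_s, x_t   : source / target images, shape (3, H, W)
    y_s        : source ground-truth label map, shape (H, W)
    y_t_pseudo : target pseudo-label map predicted by the model, shape (H, W)
    """
    # Randomly pick half of the classes present in the source label map.
    classes = torch.unique(y_s)
    chosen = classes[torch.randperm(len(classes))[: len(classes) // 2]]

    # M_s is 1 on source pixels whose class was chosen, 0 elsewhere.
    m_s = torch.isin(y_s, chosen).float()                         # (H, W)

    # Eq. (2): x_st = x_s * M_s + x_t * (1 - M_s)
    x_st = x_s * m_s.unsqueeze(0) + x_t * (1 - m_s).unsqueeze(0)
    # Eq. (3): y_st = y_s * M_s + y'_t * (1 - M_s)
    y_st = torch.where(m_s.bool(), y_s, y_t_pseudo)
    return x_st, y_st
```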

4. Method

In this work, we aim to build an efficient unsupervised domain adaptive semantic segmentation method that does not require distillation techniques. To achieve our goal, we propose two novel self-supervised learning methods. Figure 4 illustrates the overall architecture of our proposed method.

In particular, we propose a novel high-confidence target domain pixel cross-domain mixing strategy (HTCM) to construct target domain boundary pixels with correct semantic labels. In principle, HTCM only copies the high-confidence pixels in the target domain to the source domain image so that some high-confidence target domain pixels become boundary pixels. In addition, we design multi-level contrastive learning to improve the discriminativeness of features at both the pixel level and the prototype level.

4.1. Generating High-quality Target Boundary Pixels

In general, only the inner pixels of the target image can obtain correct pseudo-labels, while it is difficult to obtain accurate labels for the boundary pixels of the target image. As a result, when training target domain images directly with pseudo-labels, only the inner pixels can be effectively trained with supervision, while the boundary pixels of the target image are difficult to train effectively. To alleviate this problem, we need to obtain target boundaries with correct labels to strengthen the learning of target boundary pixels.

Refer to caption
Figure 5. Visual demonstration of the HTCM algorithm. The green parts represent high-confidence target domain pixels. The red part of the mixed image represents the border pixels with the correct label, while the yellow part represents the border pixels with the wrong label. It can be seen that most of the border pixels in the mixed image have the correct labels.

In this section, we propose a simple yet effective high-confidence target domain pixel cross-domain mixing strategy (HTCM) to generate high-quality target domain boundary pixels. Generating high-quality target boundary pixels means constructing target pixels that are located in semantic mutation regions and have correct pseudo-labels. In the pseudo-labels of the target domain pixels, the labels of high-confidence pixels tend to be correct, but these pixels are usually located inside objects rather than at boundaries. We want these correctly labeled target domain pixels to be at boundary locations so that we can obtain target domain boundary pixels with correct labels. To this end, we paste these pixels into a source image so that some of the target domain pixels lie on the border. Specifically, for a target domain image x_{t}, we obtain its corresponding predicted probability map p_{t}. We then compute a high-confidence template M_{ht}, which indicates the pixels whose predicted probability is greater than a threshold \tau. Finally, we use the obtained template M_{ht} to construct high-quality boundary samples as follows:

(5) x_{hts}=x_{s}\odot(1-M_{ht})+x_{t}\odot M_{ht}.

Similarly, the label of the mixed image can be obtained:

(6) y_{hts}=y_{s}\odot(1-M_{ht})+\hat{y}_{t}\odot M_{ht}.

The segmentation loss of the mixed image x_{hts} is defined as follows:

(7) \mathbb{L}_{hts}=\sum_{i=1}^{H\times W}\mathbb{L}_{ce}(p_{hts,i},y_{hts,i}).
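A minimal PyTorch-style sketch of HTCM is shown below; it assumes the confidence scores and pseudo-labels are obtained from a softmax over the model's target-domain logits, and the function and variable names (htcm_mix, logits_t) are ours, given only to illustrate Eqs. (5) and (6). After this paste, the pixels where the high-confidence target regions meet source pixels of a different class act as the new, correctly labeled target boundary pixels.

```python
import torch
import torch.nn.functional as F

def htcm_mix(x_s, y_s, x_t, logits_t, tau=0.95):
    """Sketch of high-confidence target cross-domain mixing (HTCM, Eqs. 5-6).

    x_s, x_t : source / target images, shape (3, H, W)
    y_s      : source ground-truth label map, shape (H, W)
    logits_t : segmentation logits for the target image, shape (C, H, W)
    tau      : confidence threshold for selecting target pixels
    """
    prob_t = F.softmax(logits_t, dim=0)
    conf, y_t_hat = prob_t.max(dim=0)            # per-pixel confidence and pseudo-label

    # M_ht is 1 on high-confidence target pixels (mostly object interiors).
    m_ht = (conf > tau).float()                  # (H, W)

    # Eq. (5): x_hts = x_s * (1 - M_ht) + x_t * M_ht
    x_hts = x_s * (1 - m_ht).unsqueeze(0) + x_t * m_ht.unsqueeze(0)
    # Eq. (6): y_hts = y_s * (1 - M_ht) + y_hat_t * M_ht
    y_hts = torch.where(m_ht.bool(), y_t_hat, y_s)
    return x_hts, y_hts
```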

4.2. Multi-level Contrastive Learning

Due to the lack of explicit supervision signals for the target domain data, the features of the target domain are not sufficiently discriminative. Fortunately, contrastive learning can improve the distinguishability of unlabeled data (Chen et al., 2020a; He et al., 2020; Grill et al., 2020; Caron et al., 2020), so we employ contrastive learning to further improve our model. In particular, we design a multi-level contrastive loss to improve feature distinguishability at both the prototype level and the pixel level. Since we want the features to be invariant across domains, we choose to perform contrastive learning on the mixed images instead of directly on the target image: because the prototypes of a mixed image contain information from both the source domain and the target domain, prototype-level contrastive learning encourages the features of the same category in the two domains to be as consistent as possible. At the same time, because the target pixels in the mixed image x_{hts} are all high-confidence samples, they have correct pseudo-labels. This means that all pixels of the mixed image x_{hts} can be trained like those of a source domain image, and using contrastive learning to improve the discriminativeness of its features would have little effect. Therefore, we choose to perform contrastive learning on the mixed images x_{st}.

Specifically, for the mixed image x_{st}, we first obtain the transformed mixed images x^{1}_{st} and x^{2}_{st} through two different random transformations. Then, we use the feature extractor E to obtain the feature maps \mathbf{f}^{1}_{st} and \mathbf{f}^{2}_{st} corresponding to x^{1}_{st} and x^{2}_{st}, whose resolution H^{\prime}\times W^{\prime} is \frac{1}{8} of the native resolution H\times W. In order to alleviate the interference of wrong samples on the prototypes, we only use the features with high confidence to calculate the prototypes. Therefore, we can get the prototype of each category as follows:

(8) p^{1,c}_{st}=\frac{1}{\sum M^{c}_{st}}\sum\mathbf{f}^{1}_{st}\odot M^{c}_{st},
(9) p^{2,c}_{st}=\frac{1}{\sum M^{c}_{st}}\sum\mathbf{f}^{2}_{st}\odot M^{c}_{st}.

Here M^{c}_{st} is a binary mask: 1 indicates that the predicted category of this position is c and the predicted probability is greater than \tau, and 0 indicates that the predicted category is another category or the predicted probability is less than \tau. For a query sample \mathbf{p}^{1,c}_{st}, the prototype \mathbf{p}^{2,c}_{st} is the positive and all other prototypes are negatives. In domain adaptation tasks, it is not enough to improve the discriminativeness of features; we also need to narrow the distance between features and classifier weights (Li et al., 2021). Therefore, following PCL (Li et al., 2021), we compute the contrastive loss using probabilities instead of features, and the prototype-level contrastive loss (ProCL) is defined as:

(10) \ell_{\mathbf{p}^{1,c}_{st}}=-\log\frac{\exp(s_{1}H(\mathbf{p}^{1,c}_{st},\tilde{\mathbf{p}}^{2,c}_{st}))}{\sum_{j\neq c}\exp(s_{1}H(\mathbf{p}^{1,c}_{st},\mathbf{p}^{1,j}_{st}))+\sum_{k}\exp(s_{1}H(\mathbf{p}^{1,c}_{st},\tilde{\mathbf{p}}^{2,k}_{st}))}.

In addition to prototype-level contrastive learning, we further consider pixel-level contrastive learning. For a query sample \mathbf{f}^{1,i}_{st}, the feature \mathbf{f}^{2,i}_{st} at the corresponding position of the other transformed image is taken as the positive sample, and all other features are taken as negative samples. The pixel-level contrastive learning loss (PiCL) is defined as follows:

(11) \ell_{\mathbf{f}^{1,i}_{st}}=-\log\frac{\exp(s_{2}H(\mathbf{f}^{1,i}_{st},\mathbf{f}^{2,i}_{st}))}{\sum_{j\neq i}\exp(s_{2}H(\mathbf{f}^{1,i}_{st},\mathbf{f}^{1,j}_{st}))+\sum_{k}\exp(s_{2}H(\mathbf{f}^{1,i}_{st},\mathbf{f}^{2,k}_{st}))},

where s_{1} and s_{2} are scaling factors; we set s_{1} to 7 and s_{2} to 20. H(x,y)=\sigma(G(x))^{\top}\sigma(G(y)) calculates the inner product of the classification probabilities corresponding to the two vectors, and \sigma denotes the softmax function.
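To make the formulation concrete, the sketch below shows one possible PyTorch-style implementation of the prototype-level term in Eq. (10), with the prototypes of Eqs. (8)-(9) computed as masked feature averages and the similarity H computed from classifier probabilities as above. All names are ours, and several details (how G is applied to pooled features, gradient stopping on the second view, the symmetric term over view-2 queries, and the pixel-level loss of Eq. (11), which follows the same pattern over per-pixel features) are simplified assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def prototype_contrastive_loss(f1, f2, pred1, conf1, classifier, tau=0.95, s1=7.0):
    """Illustrative prototype-level contrastive loss (ProCL, Eqs. 8-10).

    f1, f2       : feature maps of the two transformed views, shape (D, H', W')
    pred1, conf1 : predicted class map and confidence for view 1, shape (H', W')
    classifier   : callable mapping pooled features (K, D) to class logits (K, C)
    """
    protos1, protos2 = [], []
    for c in pred1.unique():
        # Binary mask M^c_st: class c predicted with confidence above tau.
        m = ((pred1 == c) & (conf1 > tau)).float()
        if m.sum() < 1:
            continue
        # Eqs. (8)-(9): class prototype = masked average of the features.
        protos1.append((f1 * m).flatten(1).sum(1) / m.sum())
        protos2.append((f2 * m).flatten(1).sum(1) / m.sum())
    if not protos1:
        return f1.new_zeros(())                  # no confident class in this image

    p1, p2 = torch.stack(protos1), torch.stack(protos2)

    # H(x, y) = softmax(G(x))^T softmax(G(y)), evaluated for all prototype pairs.
    q1 = F.softmax(classifier(p1), dim=1)
    q2 = F.softmax(classifier(p2), dim=1)
    exp_cross = torch.exp(s1 * (q1 @ q2.t()))    # view-1 queries vs. view-2 prototypes
    exp_within = torch.exp(s1 * (q1 @ q1.t()))   # view-1 queries vs. view-1 prototypes

    # Eq. (10): positive = same class in the other view; negatives = all other
    # prototypes of view 1 plus all prototypes of view 2.
    k = p1.size(0)
    eye = torch.eye(k, dtype=torch.bool, device=p1.device)
    pos = exp_cross.diagonal()
    denom = exp_within.masked_fill(eye, 0.0).sum(1) + exp_cross.sum(1)
    # Sum over classes, matching the per-class terms accumulated in L_pro.
    return -torch.log(pos / denom).sum()
```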

Refer to caption
Figure 6. Visualization of the segmentation results on the GTA5\rightarrowCityscapes task. Baseline means that the model is only trained with \mathbb{L}_{s} and \mathbb{L}_{st}.

4.3. Loss Function

Our loss function is defined as:

(12) L=\mathbb{L}_{s}+\lambda_{1}(\mathbb{L}_{st}+\mathbb{L}_{hts})+\lambda_{2}(\mathbb{L}_{pro}+\mathbb{L}_{pixel}).

Here \mathbb{L}_{pro}=\sum_{c}(\ell_{\mathbf{p}^{1,c}_{st}}+\ell_{\mathbf{p}^{2,c}_{st}}) and \mathbb{L}_{pixel}=\sum_{i}(\ell_{\mathbf{f}^{1,i}_{st}}+\ell_{\mathbf{f}^{2,i}_{st}}). \lambda_{1} and \lambda_{2} are hyper-parameters that balance the different losses.
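As a small sketch, the overall objective of Eq. (12) is simply a weighted sum of the five terms; the individual loss tensors are assumed to have been computed as in the sketches above, and the default weights follow the values reported in Section 5.

```python
def total_loss(loss_s, loss_st, loss_hts, loss_pro, loss_pixel,
               lambda1=0.1, lambda2=0.01):
    """Eq. (12): combine the segmentation and multi-level contrastive terms.

    lambda2 is 0.01 for GTA5 and 0.008 for SYNTHIA in our experiments.
    """
    return loss_s + lambda1 * (loss_st + loss_hts) + lambda2 * (loss_pro + loss_pixel)
```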

Table 1. Results on the GTA5\rightarrowCityscapes benchmark. D means the method uses the distillation technique.
Method Venue road sdwk bld wall fnc pole lght sign veg. trrn. sky pers rdr car trck bus trn mtr bike mIoU
Source only 58.6 13.5 64.1 16.8 14.0 23.8 36.2 19.8 80.3 19.5 66.3 58.9 28.7 64.9 28.7 3.5 8.0 29.6 35.3 35.3
DACS (Tranheden et al., 2021) WACV 2020 89.9 39.7 87.9 30.7 39.5 38.5 46.4 52.8 88.0 44.0 88.8 67.2 35.8 84.5 45.7 50.19 0.0 27.3 34.0 52.1
SPCL (Xie et al., 2021) arXiv 2021 90.3 50.3 85.7 45.3 28.4 36.8 42.2 22.3 85.1 43.6 87.2 62.8 39.0 87.8 41.3 53.9 17.7 35.9 33.8 52.1
RCCR (Zhou et al., 2021) arXiv 2021 93.7 60.4 86.5 41.1 32.0 37.3 38.7 38.6 87.2 43.0 85.5 65.4 35.1 88.3 41.8 51.6 0.0 38.0 52.1 53.5
SAC (Araslanov and Roth, 2021) CVPR 2021 90.4 53.9 86.6 42.4 27.3 45.1 48.5 42.7 87.4 40.1 86.1 67.5 29.7 88.5 49.1 54.6 9.8 26.6 45.3 53.8
Pixmatch (Melas-Kyriazi and Manrai, 2021) CVPR 2021 91.6 51.2 84.7 37.3 29.1 24.6 31.3 37.2 86.5 44.3 85.3 62.8 22.6 87.6 38.9 52.3 0.7 37.2 50.0 50.3
DPL (Cheng et al., 2021) ICCV 2021 92.8 54.4 86.2 41.6 32.7 36.4 49.0 34.0 85.8 41.3 83.0 63.2 34.2 87.2 39.3 44.5 18.7 42.6 43.1 53.3
BAPA (Liu et al., 2021) ICCV 2021 94.4 61.0 88.0 26.8 39.9 38.3 46.1 55.3 87.8 46.1 89.4 68.8 40.0 90.2 60.4 59.0 0.00 45.1 54.2 57.4
UPST (Wang et al., 2021b) ICCV 2021 90.5 38.7 86.5 41.1 32.9 40.5 48.2 42.1 86.5 36.8 84.2 64.5 38.1 87.2 34.8 50.4 0.2 41.8 54.6 52.6
DSP (Gao et al., 2021) ACM MM 2021 92.4 48.0 87.4 33.4 35.1 36.4 41.6 46.0 87.7 43.2 89.8 66.6 32.1 89.9 57.0 56.1 0.0 44.1 57.8 55.0
MFA (D) (Zhang et al., 2021a) BMVC 2021 93.5 61.6 87.0 49.1 41.3 46.1 53.5 53.9 88.2 42.1 85.8 71.5 37.9 88.8 40.1 54.7 0.0 48.2 62.8 58.2
ProDA (D) (Zhang et al., 2021b) CVPR 2021 87.8 56.0 79.7 46.3 44.8 45.6 53.5 53.5 88.6 45.2 82.1 70.7 39.2 88.8 45.5 59.4 1.0 48.9 56.4 57.5
ProDA + CaCo (D) (Huang et al., 2022) CVPR 2022 93.8 64.1 85.7 43.7 42.2 46.1 50.1 54.0 88.7 47.0 86.5 68.1 2.9 88.0 43.4 60.1 31.5 46.1 60.9 58.0
ProDA + CRA (D) (Wang et al., 2021a) arXiv 2021 89.4 60.0 81.0 49.2 44.8 45.5 53.6 55.0 89.4 51.9 85.6 72.3 40.8 88.5 44.3 53.4 0.0 51.7 57.9 58.6
Ours 95.4 68.8 88.1 37.1 41.4 42.5 45.7 60.4 87.3 42.6 86.8 67.4 38.6 90.5 66.7 61.4 0.3 39.4 56.1 58.8
Table 2. Results on the SYNTHIA\rightarrowCityscapes benchmark. mIoU-16 and mIoU-13 refer to mean intersection-over-union on the standard sets of 16 and 13 classes, respectively. Classes not evaluated are marked with '-'. D means the method uses the distillation technique.
Method Venue road sdwk bld wall fnc pole light sign veg. sky pers rdr car bus mtr bike mIoU-16 mIoU-13
Source only 40.1 18.3 64.9 5.9 0.1 24.6 5.9 9.0 74.8 81.6 58.7 16.8 43.9 11.8 6.4 24.11 30.4 35.1
DACS (Tranheden et al., 2021) WACV 2020 80.6 25.1 81.9 21.5 2.9 37.2 22.7 24.0 83.7 90.8 67.6 38.3 82.9 38.9 28.5 47.6 48.3 54.8
SPCL (Xie et al., 2021) arXiv 2021 86.9 43.2 81.6 16.2 0.2 31.4 12.7 12.1 83.1 78.8 63.2 23.7 86.9 56.1 33.8 45.7 47.2 54.4
RCCR (Zhou et al., 2021) arXiv 2021 79.4 45.3 83.3 - - - 24.7 29.6 68.9 87.5 63.1 33.8 87.0 51.0 32.1 52.1 - 56.8
SAC (Araslanov and Roth, 2021) CVPR 2021 89.3 47.2 85.5 26.5 1.3 43.0 45.5 32.0 87.1 89.3 63.6 25.4 86.9 35.6 30.4 53.0 52.6 59.3
Pixmatch (Melas-Kyriazi and Manrai, 2021) CVPR 2021 92.5 54.6 79.8 4.8 0.1 24.1 22.8 17.8 79.4 76.5 60.8 24.7 85.7 33.5 26.4 54.4 46.1 54.5
DPL (Cheng et al., 2021) ICCV 2021 87.5 45.7 82.8 13.3 0.6 33.2 22.0 20.1 83.1 86.0 56.6 21.9 83.1 40.3 29.8 45.7 47.0 54.2
BAPA (Liu et al., 2021) ICCV 2021 91.7 53.8 83.9 22.4 0.8 34.9 30.5 42.8 86.6 88.2 66.0 34.1 86.6 51.3 29.4 50.5 53.3 61.2
UPST (Wang et al., 2021b) ICCV 2021 79.4 34.6 83.5 19.3 2.8 35.3 32.1 26.9 78.8 79.6 66.6 30.3 86.1 36.6 19.5 56.9 48.0 54.6
DSP (Gao et al., 2021) ACM MM 2021 86.4 42.0 82.0 2.1 1.8 34.0 31.6 33.2 87.2 88.5 64.1 31.9 83.8 65.4 28.8 54.0 51.0 59.9
MFA (D) (Zhang et al., 2021a) BMVC 2021 81.8 40.2 85.3 - - - 38.0 33.9 82.3 82.0 73.7 41.1 87.8 56.6 46.3 63.8 - 62.5
ProDA (D) (Zhang et al., 2021b) CVPR 2021 87.8 45.7 84.6 37.1 0.6 44.0 54.6 37.0 88.1 84.4 74.2 24.3 88.2 51.1 40.5 45.6 55.5 62.0
ProDA + CRA (D) (Wang et al., 2021a) arXiv 2021 85.6 44.2 82.7 38.6 0.4 43.5 55.9 42.8 87.4 85.8 75.8 27.4 89.1 54.8 46.6 49.8 56.9 63.7
Ours 93.0 69.8 84.0 36.6 9.1 39.7 42.2 43.8 88.2 88.1 68.3 29.0 85.5 54.1 37.1 56.3 57.8 64.6

5. Experiments

5.1. Experimental Setup

5.1.1. Dataset

We evaluate the performance of our method on two standard UDA semantic segmentation tasks: GTA5\rightarrowCityscapes and SYNTHIA\rightarrowCityscapes. GTA5 (Richter et al., 2016) is a synthetic dataset consisting of 24,966 annotated images with 1914\times 1052 resolution taken from the GTA5 game, which is used as a source domain dataset. SYNTHIA (Ros et al., 2016) is also a synthetic dataset, consisting of 9,400 annotated images with 1280\times 720 resolution, and is used as another source domain dataset. Cityscapes (Cordts et al., 2016) is a representative dataset for semantic segmentation and autonomous driving, which contains 5000 pixel-level annotated images with 2048\times 1024 resolution taken from real urban street scenes and is used as the target domain dataset. In all experiments, we use 2975 unlabeled target domain images for training and 500 validation images for testing. Following existing works (Tranheden et al., 2021; Melas-Kyriazi and Manrai, 2021), we evaluate our method on 19 common categories for GTA5\rightarrowCityscapes and on both 16 and 13 common categories for SYNTHIA\rightarrowCityscapes.

5.1.2. Implementation Details

Following common UDA protocols (Tranheden et al., 2021; Melas-Kyriazi and Manrai, 2021; Zhou et al., 2021), we use the DeepLab-v2 segmentation model (Chen et al., 2017) with a ResNet-101 (He et al., 2016) backbone pre-trained on ImageNet (Deng et al., 2009). Following (Chen et al., 2019), we first perform warm-up training on the source domain to obtain an initialized model and then train the model with our method. Specifically, we use the "poly" learning rate decay with the power set to 0.9. The SGD optimizer is used with weight decay 5\times 10^{-4} and momentum 0.9. The learning rate is set to 2.5\times 10^{-4} for backbone parameters and 2.5\times 10^{-3} for the others. The maximum number of iterations is 250k and we use an early stopping strategy. For the hyper-parameters of our method, we set \lambda_{1}=0.1 in all experiments, and we set \lambda_{2}=0.01 for GTA5 and \lambda_{2}=0.008 for SYNTHIA. All experiments are conducted on a single NVIDIA RTX 3090Ti GPU.
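For illustration, a minimal sketch of this optimization setup is given below; the two nn.Conv2d modules are hypothetical stand-ins for the DeepLab-v2 backbone and head parameter groups, and the released code may group parameters differently.

```python
import torch
import torch.nn as nn

# Stand-ins for the two parameter groups of DeepLab-v2 (backbone vs. the rest).
backbone = nn.Conv2d(3, 64, kernel_size=3, padding=1)     # placeholder for ResNet-101
segmentation_head = nn.Conv2d(64, 19, kernel_size=1)      # placeholder for the classifier

def poly_lr(base_lr, cur_iter, max_iter=250_000, power=0.9):
    """'Poly' learning rate decay: lr = base_lr * (1 - iter / max_iter) ** power."""
    return base_lr * (1.0 - cur_iter / max_iter) ** power

# SGD with a smaller base learning rate for the backbone than for the other layers.
optimizer = torch.optim.SGD(
    [
        {"params": backbone.parameters(), "lr": 2.5e-4},
        {"params": segmentation_head.parameters(), "lr": 2.5e-3},
    ],
    momentum=0.9,
    weight_decay=5e-4,
)

# At iteration t, rescale each group's learning rate with the poly factor.
t = 1000
for group, base_lr in zip(optimizer.param_groups, (2.5e-4, 2.5e-3)):
    group["lr"] = poly_lr(base_lr, t)
```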

5.2. Comparison to State-of-the-Art Methods

We compare our method with the state-of-the-art UDA methods on two common UDA benchmarks. Depending on whether the distillation technique is used, we divide these methods into two broad categories. Table 1 and Table 2 give the results. From the results, we have the following observations.

Table 3. Ablation study on our proposed components.
CMix PiCL ProCL HTCM mIoU (%)
\surd 52.1
\surd \surd 55.2
\surd \surd 55.6
\surd \surd 56.6
\surd \surd \surd 56.2
\surd \surd \surd 57.5
\surd \surd \surd 57.7
\surd \surd \surd \surd 58.8

First, our method outperforms all non-distillation methods, especially under the SYNTHIA\rightarrowCityscapes setting, where it outperforms the second-best method by 4.5\% mIoU and 3.4\% mIoU on 16 classes and 13 classes, respectively. Second, we observe that ProDA, the first method using distillation, outperforms all previous non-distillation methods. However, our method outperforms not only ProDA but also a series of subsequent methods based on distillation techniques. For example, our method is better than CRA (Wang et al., 2021a) by 0.2\% on GTA5\rightarrowCityscapes and 0.9\% on SYNTHIA\rightarrowCityscapes. Third, it should be noted that our method not only outperforms the distillation-based methods but also simplifies the training process. In particular, our method requires neither an additional two-step distillation process nor special training techniques.

In addition, we qualitatively compare our model with the source-only model (\mathbb{L}_{s}) and the baseline model (\mathbb{L}_{s}+\mathbb{L}_{st}) by showing the segmentation results in Figure 6. It can be seen that the baseline significantly improves the segmentation results compared to the source-only model. But for thin objects such as "bicycle" (yellow box) and "pole" (green box), almost all pixels are boundary samples, resulting in incorrect predictions by the baseline model. After using our proposed HTCM and multi-level contrastive loss, many of these mispredicted pixels are corrected. This demonstrates the effectiveness of our method.

5.3. Ablation Study

In this section, we conduct experiments to reveal the effectiveness of our proposed method. All experiments are particularly conducted on GTA5\rightarrowCityscapes benchmark.

5.3.1. Effect of Components

In this section, we validate the individual effects of our proposed HTCM, ProCL, and PiCL. As shown in Table 3, our final method achieves an improvement of 6.7% over the model using only CMix, while removing any one of our components causes a performance drop compared to our final method. The results validate that each of our proposed components plays an important role in UDA semantic segmentation.

Table 4. Comparison of different mix-up strategies.
Methods mIoU (%)
Source Only 35.3
DACS 52.1
HTCM 52.8
DSP (Hard) 53.6
DSP (Soft) 54.5
DACS+HTCM 56.6
Table 5. Ablation study of the confidence threshold \tau. The threshold \tau\in[0,1) is applied to the confidence scores of pseudo-labeled pixels when selecting the high-confidence target pixels used to generate high-quality target boundaries.
\tau 0.00 0.35 0.55 0.75 0.95 0.97
mIoU 52.9 53.8 54.6 55.2 56.6 55.8

5.3.2. Comparing Different Mix-up Strategies

In this section, we compare different mix-up strategies to demonstrate the superiority of HTCM. In particular, we compare with DACS (Tranheden et al., 2021) and DSP (Gao et al., 2021). Table 4 gives the results. First, compared to DACS, which explicitly constructs target-domain-style context for source boundaries by copying source pixels into a target image, HTCM exceeds it by 0.7\%. This demonstrates the importance of constructing target boundaries with correct labels. In addition, combining HTCM and DACS further improves performance, showing that our method is orthogonal to DACS. Second, DSP additionally introduces a source-to-source mix-up strategy on top of DACS. For a fair comparison, we use DACS+HTCM to compare with DSP. In particular, DSP adopts a soft-paste strategy to improve performance; we use "hard" or "soft" to indicate whether this strategy is adopted. It can be seen that the source-to-source strategy used by DSP is inferior to HTCM, which further demonstrates the superiority of our method.

5.3.3. Confidence Threshold \tau

In this section, we analyze the impact of different thresholds on performance. In Table 5, we show the results for \tau=0, 0.35, 0.55, 0.75, 0.95, 0.97. We find that setting a relatively high threshold (\tau=0.95) boosts performance more effectively than \tau=0. This clearly shows the importance of constructing the boundaries with high-confidence pixels. However, when we increase the threshold from 0.95 to 0.97, the mIoU decreases from 56.6\% to 55.8\%. This is because a too high threshold filters out too many pixels, resulting in insufficient learning.

Table 6. The effect of contrastive learning on different mixed images.
Mixed Image x_{hts} x_{st}
mIoU 56.8 58.8

5.3.4. Contrastive Learning with Different Images

In this section, we compare the performance of using x_{st} versus x_{hts} for contrastive learning. Table 6 gives the results. It can be seen that contrastive learning on x_{hts} is 2\% lower than contrastive learning on x_{st}. As analyzed in Sec. 4.2, almost all pixels in x_{hts} have correct labels, so the classification loss already provides a good enough supervision signal to learn discriminative feature representations. As a result, the effect of using contrastive learning to enhance the discriminativeness of x_{hts} features becomes very weak.

5.3.5. Training Cost Discussion

Here we compare the training costs of DSP (Gao et al., 2021), ProDA (Zhang et al., 2021b), and our method on GTA5\rightarrowCityscapes. Table 7 gives the results. Compared with ProDA and DSP, our method has not only a lower training cost but also higher performance, which demonstrates its superiority.

Table 7. Training cost for different methods.
Method GPUs Time (hours) mIoU
ProDA 4*Tesla-V100-32GB 108 57.5
DSP 1*RTX-3090-24GB 96 55.0
Ours 1*RTX-3090-24GB 63 58.8

6. Conclusion

In this paper, we design a simple yet effective non-distillation framework for UDA semantic segmentation. We propose a novel mix-up strategy to generate high-quality target domain boundary pixels. Specifically, we copy high-confidence target domain pixels into the source image so that some of them become new boundary pixels. In addition, we propose multi-level contrastive learning, comprising pixel-to-pixel and prototype-to-prototype contrastive losses, to obtain more discriminative feature representations. We experimentally verify the effectiveness of the proposed methods on GTA5\rightarrowCityscapes and SYNTHIA\rightarrowCityscapes. We believe that our work provides a feasible technical route for designing efficient UDA semantic segmentation methods that bypass distillation techniques.

Acknowledgements.
This work is supported by the National Natural Science Foundation of China under Grants 62176246 and 61836008.

References

  • Araslanov and Roth (2021) Nikita Araslanov and Stefan Roth. 2021. Self-supervised augmentation consistency for adapting semantic segmentation. In CVPR. 15384–15394.
  • Caron et al. (2020) Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. 2020. Unsupervised learning of visual features by contrasting cluster assignments. In NeurIPS. 9912–9924.
  • Chen et al. (2017) Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. 2017. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence 40, 4 (2017), 834–848.
  • Chen et al. (2018) Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. 2018. Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV. 801–818.
  • Chen et al. (2019) Minghao Chen, Hongyang Xue, and Deng Cai. 2019. Domain adaptation for semantic segmentation with maximum squares loss. In ICCV. 2090–2099.
  • Chen et al. (2020c) Minghao Chen, Shuai Zhao, Haifeng Liu, and Deng Cai. 2020c. Adversarial-learned loss for domain adaptation. In AAAI. 3521–3528.
  • Chen et al. (2020a) Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey E. Hinton. 2020a. A Simple Framework for Contrastive Learning of Visual Representations. In ICML. 1597–1607.
  • Chen et al. (2020b) Ting Chen, Simon Kornblith, Kevin Swersky, Mohammad Norouzi, and Geoffrey E Hinton. 2020b. Big self-supervised models are strong semi-supervised learners. In NeurIPS. 22243–22255.
  • Cheng et al. (2021) Yiting Cheng, Fangyun Wei, Jianmin Bao, Dong Chen, Fang Wen, and Wenqiang Zhang. 2021. Dual Path Learning for Domain Adaptation of Semantic Segmentation. In ICCV. 9082–9091.
  • Cordts et al. (2016) Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. 2016. The cityscapes dataset for semantic urban scene understanding. In CVPR. 3213–3223.
  • Deng et al. (2009) Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. Imagenet: A large-scale hierarchical image database. In CVPR. 248–255.
  • Feng et al. (2020a) Zhengyang Feng, Qianyu Zhou, Guangliang Cheng, Xin Tan, Jianping Shi, and Lizhuang Ma. 2020a. Semi-supervised semantic segmentation via dynamic self-training and classbalanced curriculum. arXiv preprint arXiv:2004.08514 1, 2 (2020), 5.
  • Feng et al. (2020b) Zhengyang Feng, Qianyu Zhou, Qiqi Gu, Xin Tan, Guangliang Cheng, Xuequan Lu, Jianping Shi, and Lizhuang Ma. 2020b. Dmt: Dynamic mutual training for semi-supervised learning. arXiv preprint arXiv:2004.08514 (2020).
  • Gao et al. (2021) Li Gao, Jing Zhang, Lefei Zhang, and Dacheng Tao. 2021. Dsp: Dual soft-paste for unsupervised domain adaptive semantic segmentation. In ACM MM. 2825–2833.
  • Grill et al. (2020) Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre H Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Daniel Guo, Mohammad Gheshlaghi Azar, et al. 2020. Bootstrap your own latent: A new approach to self-supervised learning. In NeurIPS. 21271–21284.
  • Gulrajani et al. (2017) Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. 2017. Improved training of wasserstein gans. In NeurIPS, Vol. 30.
  • He et al. (2020) Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. 2020. Momentum contrast for unsupervised visual representation learning. In CVPR. 9729–9738.
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In CVPR. 770–778.
  • Hoffman et al. (2018) Judy Hoffman, Eric Tzeng, Taesung Park, Jun-Yan Zhu, Phillip Isola, Kate Saenko, Alexei Efros, and Trevor Darrell. 2018. Cycada: Cycle-consistent adversarial domain adaptation. In ICML. 1989–1998.
  • Huang et al. (2022) Jiaxing Huang, Dayan Guan, Aoran Xiao, Shijian Lu, and Ling Shao. 2022. Category contrast for unsupervised domain adaptation in visual tasks. In CVPR. 1203–1214.
  • Huang et al. (2020) Jiaxing Huang, Shijian Lu, Dayan Guan, and Xiaobing Zhang. 2020. Contextual-relation consistent domain adaptation for semantic segmentation. In ECCV. Springer, 705–722.
  • Li et al. (2021) Junjie Li, Yixin Zhang, Zilei Wang, and Keyu Tu. 2021. Semantic-aware Representation Learning Via Probability Contrastive Loss. arXiv preprint arXiv:2111.06021 (2021).
  • Liu et al. (2019) Chenxi Liu, Liang-Chieh Chen, Florian Schroff, Hartwig Adam, Wei Hua, Alan L Yuille, and Li Fei-Fei. 2019. Auto-deeplab: Hierarchical neural architecture search for semantic image segmentation. In CVPR. 82–92.
  • Liu et al. (2021) Yahao Liu, Jinhong Deng, Xinchen Gao, Wen Li, and Lixin Duan. 2021. BAPA-Net: Boundary adaptation and prototype alignment for cross-domain semantic segmentation. In ICCV. 8801–8811.
  • Long et al. (2015) Jonathan Long, Evan Shelhamer, and Trevor Darrell. 2015. Fully convolutional networks for semantic segmentation. In CVPR. 3431–3440.
  • Melas-Kyriazi and Manrai (2021) Luke Melas-Kyriazi and Arjun K Manrai. 2021. PixMatch: Unsupervised domain adaptation via pixelwise consistency training. In CVPR. 12435–12445.
  • Oord et al. (2018) Aaron van den Oord, Yazhe Li, and Oriol Vinyals. 2018. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748 (2018).
  • Richter et al. (2016) Stephan R Richter, Vibhav Vineet, Stefan Roth, and Vladlen Koltun. 2016. Playing for data: Ground truth from computer games. In ECCV. Springer, 102–118.
  • Ros et al. (2016) German Ros, Laura Sellart, Joanna Materzynska, David Vazquez, and Antonio M Lopez. 2016. The synthia dataset: A large collection of synthetic images for semantic segmentation of urban scenes. In CVPR. 3234–3243.
  • Tranheden et al. (2021) Wilhelm Tranheden, Viktor Olsson, Juliano Pinto, and Lennart Svensson. 2021. Dacs: Domain adaptation via cross-domain mixed sampling. In WACV. 1379–1389.
  • Tsai et al. (2018) Yi-Hsuan Tsai, Wei-Chih Hung, Samuel Schulter, Kihyuk Sohn, Ming-Hsuan Yang, and Manmohan Chandraker. 2018. Learning to adapt structured output space for semantic segmentation. In CVPR. 7472–7481.
  • Wang et al. (2021b) Yuxi Wang, Junran Peng, and ZhaoXiang Zhang. 2021b. Uncertainty-aware pseudo label refinery for domain adaptive semantic segmentation. In ICCV. 9092–9101.
  • Wang et al. (2021a) Zhijie Wang, Xing Liu, Masanori Suganuma, and Takayuki Okatani. 2021a. Cross-Region Domain Adaptation for Class-level Alignment. arXiv preprint arXiv:2109.06422 (2021).
  • Wang et al. (2020) Zhonghao Wang, Mo Yu, Yunchao Wei, Rogerio Feris, Jinjun Xiong, Wen-mei Hwu, Thomas S Huang, and Honghui Shi. 2020. Differential treatment for stuff and things: A simple unsupervised domain adaptation method for semantic segmentation. In CVPR. 12635–12644.
  • Xie et al. (2021) Binhui Xie, Kejia Yin, Shuang Li, and Xinjing Chen. 2021. SPCL: A New Framework for Domain Adaptive Semantic Segmentation via Semantic Prototype-based Contrastive Learning. arXiv preprint arXiv:2111.12358 (2021).
  • Xu et al. (2019) Yonghao Xu, Bo Du, Lefei Zhang, Qian Zhang, Guoli Wang, and Liangpei Zhang. 2019. Self-ensembling attention networks: Addressing domain shift for semantic segmentation. In AAAI. 5581–5588.
  • Yang and Soatto (2020) Yanchao Yang and Stefano Soatto. 2020. Fda: Fourier domain adaptation for semantic segmentation. In CVPR. 4085–4095.
  • Zhang et al. (2021a) Kai Zhang, Yifan Sun, Rui Wang, Haichang Li, and Xiaohui Hu. 2021a. Multiple Fusion Adaptation: A Strong Framework for Unsupervised Semantic Segmentation Adaptation. BMVC (2021).
  • Zhang et al. (2021b) Pan Zhang, Bo Zhang, Ting Zhang, Dong Chen, Yong Wang, and Fang Wen. 2021b. Prototypical pseudo label denoising and target structure learning for domain adaptive semantic segmentation. In CVPR. 12414–12424.
  • Zhang et al. (2022) Yixin Zhang, Junjie Li, and Zilei Wang. 2022. Low-confidence Samples Matter for Domain Adaptation. arXiv preprint arXiv:2202.02802 (2022).
  • Zhang and Wang (2020) Yixin Zhang and Zilei Wang. 2020. Joint adversarial learning for domain adaptation in semantic segmentation. In AAAI. 6877–6884.
  • Zhao et al. (2017) Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. 2017. Pyramid scene parsing network. In CVPR. 2881–2890.
  • Zhou et al. (2021) Qianyu Zhou, Chuyun Zhuang, Xuequan Lu, and Lizhuang Ma. 2021. Domain Adaptive Semantic Segmentation with Regional Contrastive Consistency Regularization. arXiv preprint arXiv:2110.05170 (2021).
  • Zou et al. (2018) Yang Zou, Zhiding Yu, BVK Kumar, and Jinsong Wang. 2018. Unsupervised domain adaptation for semantic segmentation via class-balanced self-training. In ECCV. 289–305.
  • Zou et al. (2019) Yang Zou, Zhiding Yu, Xiaofeng Liu, BVK Kumar, and Jinsong Wang. 2019. Confidence regularized self-training. In CVPR. 5982–5991.