Regularizing Self-training for Unsupervised Domain Adaptation
via Structural Constraints
Abstract
Self-training based on pseudo-labels has emerged as a dominant approach for addressing conditional distribution shifts in unsupervised domain adaptation (UDA) for semantic segmentation problems. A notable drawback, however, is that this family of approaches is susceptible to erroneous pseudo labels that arise from confirmation biases in the source domain and that manifest as nuisance factors in the target domain. A possible source for this mismatch is the reliance on only photometric cues provided by RGB image inputs, which may ultimately lead to sub-optimal adaptation. To mitigate the effect of mismatched pseudo-labels, we propose to incorporate structural cues from auxiliary modalities, such as depth, to regularise conventional self-training objectives. Specifically, we introduce a contrastive pixel-level objectness constraint that pulls the pixel representations within a region of an object instance closer, while pushing those from different object categories apart. To obtain object regions consistent with the true underlying object, we extract information from both depth maps and RGB images in the form of multimodal clustering. Crucially, the objectness constraint is agnostic to the ground-truth semantic labels and, hence, appropriate for unsupervised domain adaptation. In this work, we show that our regularizer significantly improves top-performing self-training methods on various UDA benchmarks for semantic segmentation. We include all code in the supplementary.
1 Introduction

Semantic segmentation is a crucial and challenging task for applications such as autonomous driving [61, 2, 51, 60, 18] that rely on pixel-level semantics of the scene. Performance on this task has significantly improved over the past few years following the advances in deep supervised learning [9]. However, an important limitation arises from the excessive cost and time taken to annotate images at a pixel-level (reported to be 1.5 hours per image in a popular dataset [12]). Further, most real-world datasets do not have sufficient coverage over all variations in outdoor scenes such as weather conditions and geography-specific layouts that can be crucial for large-scale deployment of learning-based models in autonomous vehicles. Acquiring training data to cater to such scene variations would significantly add to the cost of annotation.
To address the annotation problem, synthetic datasets curated from 3D simulation environments like GTA [37] and SYNTHIA [38] have been proposed, where large amounts of annotated data can be easily generated. However, generated data introduces domain shift due to differences in visual characteristics of simulated images (source domain) and real images (target domain). To mitigate such shifts, unsupervised domain adaptation strategies [48, 5, 64, 60, 61, 18, 2] for semantic segmentation have been extensively studied in recent years. Among these approaches, self-training [16] has emerged as a particularly promising approach that involves pseudo-labelling the (unlabelled) target-domain data using a seed model trained solely on the source domain. Pseudo-label predictions for which the confidence exceeds a predefined threshold are then used to further train the model and ultimately improve the target-domain performance.
While self-training based adaptation is quite effective, it is susceptible to erroneous pseudo labels arising from confirmation bias [3] in the seed model. Confirmation bias results from training on source domain semantics that might introduce factors of representation that serve as nuisance factors for the target domain. In the context of semantic segmentation, such a bias manifests as pixel-wise seed predictions that are highly confident but incorrect (see Figure 1). For instance, if the source domain images usually have bright regions (high intensity of the RGB channels) for the sky class, then bright regions in target domain images might be predicted as the sky with high confidence, irrespective of the actual semantic label. Since highly confident predictions qualify as pseudo-labels, training the model on potentially noisy predictions can ultimately lead to sub-optimal performance in the target domain. Thus, in this work, we seek to reduce the heavy reliance of self-training methods on photometric cues for predicting pixel-wise semantic labels.
To that end, we propose to incorporate auxiliary modality information such as depth maps that can provide structural cues [51, 24, 53, 11], complementary to the photometric cues. Semantic segmentation datasets are usually accompanied by depth maps that can be easily acquired in practice [12, 39]. Since naïve fusion of features that are extracted from depth information can also introduce nuisance [24, 51], an important question is raised: how can we leverage the depth modality to counter the effect of noisy pseudo-labels during self-training? In this work, we propose a contrastive objectness constraint derived from depth maps and RGB images in the target domain that is used to regularise conventional self-training methods. The constraint is computed in two steps: an object-region estimation step, followed by pixel-wise contrastive loss computation. In the first step, we perform unsupervised image segmentation using both depth-based histograms and RGB images that are fused together to yield multiple object-regions per image. These regions respect actual object boundaries, based on the structural information depth provides, as well as visual similarity. In the second step, the object-regions are leveraged to formulate a contrastive objective [44, 10, 22] that pulls together pixel representations within an object region and pushes apart those from different semantic categories. Such an objective can improve semantic segmentation by causing the pixel representations of a semantic category to form a compact cluster that is well separated from other categories. We empirically demonstrate the effectiveness of our constraint on popular benchmark tasks, GTA→Cityscapes and SYNTHIA→Cityscapes, on which we achieve competitive segmentation performance. To summarise our contributions:
- We propose a novel objectness constraint derived from depth and RGB information to regularise self-training approaches in unsupervised domain adaptation for semantic segmentation. The use of multiple modalities introduces implicit model supervision that is complementary to the pseudo-labels and hence leads to more robust self-training.
- We empirically validate the most important aspect of our regulariser, i.e., its ability to improve a variety of self-training methods. Specifically, our approach achieves consistent relative improvements over three different self-training baselines on both GTA and SYNTHIA. Interestingly, we observe that regularisation improves performance on both “stuff” and “things” classes, somewhat normalising the effects of classwise statistics.
- Further, our regularised self-training method achieves state-of-the-art mIoU in the GTA→Cityscapes setting and improves classwise IoUs over the best prior results.
2 Related Work
Unsupervised domain adaptation. Unsupervised domain adaptation (UDA) is of particular importance in complex structured-prediction problems, such as semantic segmentation in autonomous driving, where the domain gap between a source domain (e.g., an urban driving dataset) and target domain (real-world driving scenarios) can have devastating consequences on the efficacy of deployed models. Several approaches [14, 5, 35, 18, 30] have been proposed for learning domain invariant representations, e.g., through adversarial feature alignment [14, 6, 49, 54], which addresses the domain gap by minimising a distance metric that characterises the divergence between the two domains [36, 28, 29, 4, 40, 13, 43, 33]. Problematically, such approaches address only shifts in the marginal distribution of the covariates or the labels and, therefore, prove insufficient for handling the more complex shifts in the conditionals [20, 62, 55]. Self-training approaches have been proposed to induce category-awareness [60] or cluster density-based assumptions [42], in order to anchor or regularise conditional shift adaptation, respectively. In this paper, we build upon these works by jointly introducing category-awareness through the use of pseudo-labeling strategies and regularisation through the definition of contrastive depth-based objectness constraints.
Self-training with pseudo-labels. Application of self-training has become popular in the sphere of domain adaptation for semantic segmentation [64, 25, 60, 23]. Here, pseudo-labels are assigned to observations from the target domain based on the semantic classes of high-confidence (e.g., the closest or least-contrastive) category centroids [60, 57], prototypes [8], cluster centers [21], or superpixel representations [61] that are learned by a model trained on the source domain. Often, to ensure the reliability of the initial pseudo-labels for the target domain, the model is first warmed up via adversarial adaptation [60, 61]. Moreover, for stability purposes, pseudo-labels are updated in a stagewise fashion, resulting in an overall complex adaptation scheme. Towards streamlining this complex adaptation process, recent approaches like [2, 47] propose to train without adversarial warmup and with a momentum network to circumvent the stagewise training issue. A common factor underlying most self-training methods is their reliance on just RGB inputs, which may not provide sufficient signal for predicting robust target-domain pseudo-labels. This motivates us to look for alternate forms of input, like depth, that are easily accessible and provide a more robust signal.
Adaptation with multiple modalities. Learning and adaptation using multimodal contexts presents an opportunity for leveraging complementarity between different views of the input space to improve model robustness and generalisability. In the context of unsupervised domain adaptation, the use of multimodal information has recently become more popular with pioneering works like [24]. Specifically, [24] uses depth regression as a way to regularise GAN-based domain translation, resulting in better capture of source semantics in the generated target images. Another related approach [51] proposes the use of depth via an auxiliary objective to learn features that, when fused with the primary semantic segmentation prediction branch, provide a more robust representation for adaptation. While sharing our motivation for the use of auxiliary information, their use of fused features for adaptation does not address the susceptibility of adversarial adaptation to conditional distribution shifts. In contrast to this method, we propose a depth-based objectness constraint for adaptation via self-training that not only leverages multimodal context but also handles conditional shifts more effectively. Moreover, unlike previous works that use depth only for the source domain, we explore its application exclusively to the target domain. Contemporary to our setting, [53] improves adaptation by extracting the correlation between depth and RGB in both domains. An important distinction of our approach with regard to the above works is that we exploit the complementarity of RGB and depth, instead of their correlation, to formulate a contrastive regularizer. The importance of multimodal information has also been considered in other contexts such as indoor semantic segmentation [45] and adaptation for 3D segmentation using 2D images and 3D point clouds [19]. While not directly related to our experimental settings, they provide insight and inspiration for our approach.

3 Self-Training with Objectness Constraints
We begin by introducing preliminary concepts on self-training based adaptation. These concepts serve as the basis for introducing our objectness constraint, which is used to regularise self-training methods, in the following subsections. We refer to our framework as PAC-UDA, which uses Pseudo-labels And objectness Constraints for self-training in Unsupervised Domain Adaptation for semantic segmentation. Although we describe a canonical form of self-training for formalising our regularisation constraint, PAC-UDA should be seen as a general approach that can encompass various forms of self-training (as shown in the experiments).
Unsupervised Domain Adaptation (UDA) for Semantic Segmentation: Consider a dataset $\mathcal{D}_s = \{(x_s^i, y_s^i)\}_{i=1}^{N_s}$ of input-label pairs sampled from a source domain distribution $p_s(x, y)$. The input and labels share the same spatial dimensions $H \times W$, where each pixel of the label is assigned one of $C$ classes and is represented via a $C$-dimensional one-hot encoding. We also have a dataset $\mathcal{D}_t = \{x_t^i\}_{i=1}^{N_t}$ sampled from a target distribution $p_t(x, y)$, where the corresponding labels $y_t$ are unobserved during training. Here, the target domain is separated from the source domain due to domain shift, expressed as $p_s(x, y) \neq p_t(x, y)$. Under such a shift, the goal of unsupervised domain adaptation is to leverage $\mathcal{D}_s$ and $\mathcal{D}_t$ to learn a parametric model $f_\theta$ that performs well in the target domain. The model is defined as a composition $f_\theta = h \circ g$ of an encoder $g: \mathcal{X} \rightarrow \mathcal{Z}$ and a classifier $h: \mathcal{Z} \rightarrow \mathbb{R}^{H \times W \times C}$, where $\mathcal{Z}$ represents the space of $D$-dimensional spatial embeddings, $f_\theta(x)$ gives the un-normalized distribution over the $C$ classes at each spatial location, and $\theta$ are the model parameters. To learn a suitable target model, the parameters are optimised using a cross-entropy objective on the source domain,
(1)  $\theta^{*} = \arg\min_{\theta} \; \mathcal{L}_{src}(\theta) + \mathcal{L}_{tgt}(\theta)$
(2)  $\mathcal{L}_{src}(\theta) = -\tfrac{1}{N_s} \sum_{i=1}^{N_s} \sum_{h,w} \sum_{c=1}^{C} y^{i}_{s,(h,w,c)} \, \log \, \sigma\big(f_{\theta}(x^{i}_{s})\big)_{(h,w,c)}$
where $\sigma(\cdot)$ denotes the softmax operation and $\mathcal{L}_{tgt}$ is an adaptation objective over the target domain, as described next.
Pseudo-label self-training (PLST): Following prior works [64, 60], we describe a simple and effective approach to PLST that leverages a source-trained seed model to pseudo-label unlabelled target data via confidence thresholding. Specifically, the seed model is first trained on $\mathcal{D}_s$ using Eqn. 2 to obtain a good parameter initialisation, $\theta_0$. Then, this model is used to compute pixel-wise class probabilities $p^{t}_{(h,w)} = \sigma\big(f_{\theta_0}(x^{t})\big)_{(h,w)}$ according to Eqn. 2 for each target image $x^{t} \in \mathcal{D}_t$. These probabilities are used in conjunction with a predefined threshold $\tau$ to obtain one-hot encoded pseudo-labels
(3)  $\hat{y}^{t}_{(h,w,c)} = \mathbb{1}\big[\, c = \arg\max_{c'} p^{t}_{(h,w,c')} \;\text{ and }\; p^{t}_{(h,w,c)} > \tau \,\big]$
Note that while Eqn. 3 uses a class-agnostic fixed threshold, in practice this threshold can be made class-specific and dynamically updated over the course of self-training. Such a threshold ensures that only highly confident predictions contribute to successive training. The final self-training objective can be written in terms of the pseudo-labels as
(4)  $\mathcal{L}_{st}(\theta) = -\tfrac{1}{N_t} \sum_{i=1}^{N_t} \sum_{h,w} \sum_{c=1}^{C} \hat{y}^{i,t}_{(h,w,c)} \, \log \, \sigma\big(f_{\theta}(x^{i}_{t})\big)_{(h,w,c)}$
The overall UDA objective is simply $\mathcal{L}_{src}(\theta) + \lambda_{st}\,\mathcal{L}_{st}(\theta)$, where $\lambda_{st}$ is the relative weighting coefficient.
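To make the thresholding step concrete, below is a minimal PyTorch sketch of Eqns. 3-4. It assumes the pseudo-labels are produced by a frozen seed (or momentum) network; the function name, the default threshold, and the use of `ignore_index` are illustrative choices rather than details of any specific base method.

```python
import torch
import torch.nn.functional as F

def pseudo_label_loss(student_logits, seed_logits, tau=0.9, ignore_index=255):
    """Confidence-thresholded pseudo-label cross-entropy (sketch of Eqns. 3-4)."""
    with torch.no_grad():
        probs = torch.softmax(seed_logits, dim=1)      # B x C x H x W class probabilities
        conf, pseudo = probs.max(dim=1)                # per-pixel confidence and class index
        pseudo[conf < tau] = ignore_index              # discard low-confidence pixels
    # cross-entropy of the current model against the retained pseudo-labels
    return F.cross_entropy(student_logits, pseudo, ignore_index=ignore_index)
```

A class-specific or dynamically updated threshold simply replaces the scalar `tau` with a per-class tensor indexed by the predicted class.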
3.1 Supervision For Objectness Constraint
An important issue with the self-training scheme described above is that it is usually prone to confirmation bias, which can lead to compounding errors in target model predictions when trained on noisy pseudo-labels. To alleviate this issue and improve target performance, we introduce auxiliary modality information (such as depth) that can provide indirect supervision for semantic labels in the target domain and improve the robustness of self-training. In this section, we describe our multimodal objectness constraint, which extracts object-region estimates to formulate a contrastive objective. An overview of our objectness constraint formulation is presented in Fig. 2.
Supervision via Depth: Segmentation datasets are often accompanied by depth maps registered with the RGB images. In practice, depth maps can be obtained from stereo pairs [12, 39] or sequences of images [15]. These depth maps can reveal the presence of distinct objects in a scene. We seek to extract object regions from these depth maps by first computing a histogram of depth values with a predefined number of bins, $n_{bins}$. We then leverage a property of objects under “things” categories [17], whose depth range is usually much smaller than that of the entire scene. Examples of such categories in outdoor scene segmentation include persons, cars, poles, etc. This property translates into high-density regions (or peaks) in the histogram corresponding to distinct objects at distinct depths. Among these peaks, we use the ones with prominence [27] above a threshold as centers to cluster the histogram into discrete regions with unique labels. These labels are then assigned to every pixel whose depth value lies in the associated region. An example of the resulting depth-based segmentation for a particular choice of $n_{bins}$ and prominence threshold is visualised in Fig. 2.
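A minimal sketch of this depth-histogram clustering is given below, assuming NumPy/SciPy; the bin count, the prominence value, and the function name are illustrative placeholders, and `find_peaks` stands in for whichever prominence-based peak detector is used in practice.

```python
import numpy as np
from scipy.signal import find_peaks

def depth_histogram_segments(depth, n_bins=200, prominence=0.01):
    """Cluster a depth map into discrete regions around prominent histogram peaks (sketch)."""
    valid = depth > 0                                    # zero marks missing stereo depth
    hist, edges = np.histogram(depth[valid], bins=n_bins, density=True)
    peaks, _ = find_peaks(hist, prominence=prominence)   # peak bins with sufficient prominence
    centers = 0.5 * (edges[peaks] + edges[peaks + 1])    # depth value at each peak
    boundaries = 0.5 * (centers[:-1] + centers[1:])      # split the depth range between peaks
    labels = np.zeros(depth.shape, dtype=np.int32)       # 0 is reserved for invalid pixels
    labels[valid] = 1 + np.digitize(depth[valid], boundaries)
    return labels
```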
Supervision via RGB: Another important form of self-supervision for object-region estimates is based on RGB-input clustering. We adopt SLIC [1] as a fast algorithm for partitioning images into multiple segments that respect object boundaries; the SLIC method applies k-means clustering in pixel space to group together adjacent pixels that are visually similar. An important design decision is the number of SLIC segments, $n_{seg}$: a small $n_{seg}$ leads to large cluster sizes that are agnostic to the variation in object scales across different object categories and instances of the scene. Consequently, pixels from distinct object instances may be grouped together regardless of the semantic class, thus violating the notion of an object region. Conversely, a large $n_{seg}$ will over-segment each object in the scene, resulting in a trivial objectness constraint. Triviality arises from enforcing similarity of pixel embeddings that share roughly identical pixel neighbourhoods and hence are likely to yield the same class predictions anyway.
Thus, to formulate a non-trivial constraint with a sufficiently small $n_{seg}$ that also respects object boundaries, we propose to fuse region estimates from both the depth and RGB modalities. We first obtain segments using SLIC over the RGB image, followed by further partitioning of each segment into smaller ones based on the depth segmentation (a sketch of this fusion step is given below). The process, visualised in Fig. 2, highlights the importance of our multimodal approach. Purely depth-based segments are agnostic to pixel intensities and may cluster together distinct object categories that lie at similar depths, for instance, the car in the front and the sidewalk. On the other hand, purely RGB segments with a sufficiently small $n_{seg}$ may assign the same cluster label even to objects at distinct depths, for example, the back of the bus and the small car at the back. In contrast, a fusion of these two modalities can lead to object regions that are more consistent with individual object instances (for example, the small car at the back as well as the car in the front). We empirically demonstrate the effectiveness of the objectness constraint derived from such multimodal fusion in Section 4.3.
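The fusion step can be sketched as follows, assuming scikit-image's SLIC implementation and the depth segmentation above: every unique pair of (RGB superpixel, depth segment) becomes one object region. The function name and the default number of segments are illustrative.

```python
import numpy as np
from skimage.segmentation import slic

def fused_object_regions(rgb, depth_labels, n_segments=50):
    """Partition SLIC superpixels further by depth segments to obtain object regions (sketch)."""
    rgb_segments = slic(rgb, n_segments=n_segments, start_label=0)            # H x W superpixel ids
    # map each (superpixel id, depth segment id) pair to a contiguous region id
    paired = rgb_segments.astype(np.int64) * (int(depth_labels.max()) + 1) + depth_labels
    _, regions = np.unique(paired, return_inverse=True)
    return regions.reshape(rgb.shape[:2])                                     # H x W region indices in [0, K)
```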
Source Domain | Method | road | sidewalk | building | wall | fence | pole | light | sign | vege. | terrain | sky | person | rider | car | truck | bus | train | motor | bike | mIoU
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
GTA | CAG [60] | 87.0 | 44.6 | 82.9 | 32.1 | 35.7 | 40.6 | 38.9 | 45.5 | 82.6 | 23.5 | 78.7 | 64.0 | 27.2 | 84.4 | 17.5 | 34.8 | 35.8 | 26.7 | 32.8 | 48.2 |
CAG PAC (ours) | 86.3 | 45.7 | 84.5 | 30.5 | 35.5 | 38.9 | 40.3 | 49.9 | 86.0 | 33.5 | 81.1 | 64.1 | 25.5 | 84.5 | 21.3 | 32.9 | 36.3 | 26.7 | 40.0 | 49.6 | |
SAC[2] | 89.9 | 54.0 | 86.2 | 37.8 | 28.9 | 45.9 | 46.9 | 47.7 | 88.0 | 44.8 | 85.5 | 66.4 | 30.3 | 88.6 | 50.5 | 54.5 | 1.5 | 17.0 | 39.3 | 52.8 | |
SAC PAC (ours) | 93.3 | 63.6 | 87.2 | 42.0 | 25.4 | 44.9 | 49.0 | 50.6 | 88.1 | 45.2 | 87.6 | 64.0 | 28.1 | 83.6 | 37.5 | 43.9 | 13.7 | 20.1 | 46.2 | 53.4 | |
DACS[47] | 93.4 | 54.3 | 86.3 | 28.6 | 33.7 | 37.0 | 41.1 | 50.6 | 86.1 | 42.6 | 87.6 | 63.5 | 28.9 | 88.1 | 44.2 | 52.7 | 1.7 | 34.7 | 48.1 | 52.8 | |
DACS PAC (ours) | 93.2 | 58.8 | 87.2 | 33.3 | 35.1 | 38.6 | 41.8 | 51.4 | 87.4 | 45.8 | 88.3 | 64.8 | 31.6 | 84.3 | 51.7 | 53.4 | 0.6 | 31.3 | 50.6 | 54.2 | |
SYNTHIA | CAG | 87.0 | 41.0 | 79.0 | 9.0 | 1.0 | 34.0 | 15.0 | 11.0 | 81.0 | - | 81.0 | 55.0 | 16.0 | 77.0 | - | 17.0 | - | 2.0 | 47.0 | 40.8 |
CAG PAC (ours) | 87.0 | 42.0 | 80.0 | 12.0 | 3.0 | 30.0 | 17.0 | 17.0 | 80.0 | - | 88.0 | 57.0 | 5.0 | 75.0 | - | 20.0 | - | 1.0 | 52.0 | 41.7 | |
SAC [2] | 91.7 | 52.7 | 85.1 | 22.6 | 1.5 | 42.2 | 44.1 | 30.9 | 82.5 | - | 73.8 | 63.0 | 20.9 | 84.9 | - | 29.5 | - | 26.9 | 52.2 | 50.3 | |
SAC PAC (ours) | 83.2 | 40.5 | 85.4 | 30.0 | 2.0 | 43.0 | 42.2 | 33.8 | 86.3 | - | 89.8 | 65.3 | 33.5 | 85.1 | - | 35.2 | - | 29.9 | 55.3 | 52.5 | |
DACS [47] | 84.9 | 23.0 | 83.7 | 16.0 | 1.0 | 36.3 | 35.0 | 42.8 | 81.7 | - | 89.5 | 63.5 | 34.5 | 85.3 | - | 41.5 | - | 31.2 | 50.8 | 50.0 | |
DACS PAC (ours) | 90.6 | 46.7 | 83.3 | 18.7 | 1.3 | 35.1 | 34.5 | 32.0 | 85.1 | - | 88.5 | 66.0 | 35.0 | 83.8 | - | 43.1 | - | 28.8 | 46.7 | 51.2 |
3.2 Objectness Constraints through Contrast
Our objectness constraint is formulated using a contrastive objective that pulls together pixel representations within an object region and pushes apart those that belong to different object categories. Formally, we assign a region index and a region label to every pixel associated with an object region of the input scene. Each region index is a unique natural number in $\{1, \dots, K\}$, where $K$ is the number of object regions. A region label is assigned as the most frequent pseudo-label class within the object region. In practice, noisy pseudo-labels can lead to region labelling that is inconsistent with the true semantic labels. To minimise such inconsistencies, we introduce a region-label threshold that selects valid object regions, i.e., those for which the proportion of pixels whose pseudo-label class matches the region label is above this threshold. This selection excludes object regions with no dominant pseudo-label class from contributing to the objectness constraint. Since the cost of computing pairwise constraints is quadratic in the number of pixels, we recast the pairwise constraint into a prototypical loss that reduces the time complexity to linear. Towards this end, we first compute a prototypical representation for each region using the associated pixel embeddings,
(5)  $\mu_{k} = \frac{1}{|R_k|} \sum_{(h,w) \in R_k} z_{(h,w)}$
where $R_k$ is the set of pixel locations within the $k$-th object region and $z_{(h,w)}$ is the pixel embedding at location $(h,w)$. Then a similarity score (based on the cosine metric) is computed between each pixel and the prototypical representations, which forms the basis for our contrastive objectness constraint:
(6)  $s_{(h,w),k} = \bar{z}_{(h,w)} \cdot \bar{\mu}_{k}$
(7)  $\mathcal{L}_{obj} = -\frac{1}{N_v} \sum_{(h,w)} \log \frac{\exp\big(s_{(h,w),k(h,w)}\big)}{\exp\big(s_{(h,w),k(h,w)}\big) + \sum_{k' \in \mathcal{N}_{(h,w)}} \exp\big(s_{(h,w),k'}\big)}$
where $N_v$ is the total number of pixels in valid object regions, $k(h,w)$ denotes the region containing pixel $(h,w)$, $\mathcal{N}_{(h,w)}$ is the set of valid object regions whose region labels differ from that of $k(h,w)$, and $\bar{z}$ and $\bar{\mu}$ represent normalised embeddings and prototypes. Note that the objectness constraints are only computed for the target-domain images, since we are interested in improving target-domain performance using self-training. Additionally, the constraint in Eqn. 7 is defined for a single image but can be easily extended to multiple images by simply averaging over them; the final regularised self-training objective is then defined as $\mathcal{L}_{src}(\theta) + \lambda_{st}\,\mathcal{L}_{st}(\theta) + \lambda_{obj}\,\mathcal{L}_{obj}(\theta)$, where $\lambda_{obj}$ controls the effect of the constraint on overall training.
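For concreteness, the following is a minimal PyTorch sketch of Eqns. 5-7 for a single image. It assumes the region indices, their majority-pseudo-label region labels, and the validity mask from the thresholding step have already been computed; all tensor and function names are illustrative rather than taken from our released code.

```python
import torch
import torch.nn.functional as F

def objectness_constraint(emb, regions, region_labels, valid):
    """Prototype-based contrastive objectness loss (sketch of Eqns. 5-7).
    emb: D x H x W pixel embeddings; regions: H x W long region indices in [0, K);
    region_labels: K region labels (majority pseudo-label); valid: K boolean mask."""
    D = emb.shape[0]
    K = region_labels.numel()
    z = F.normalize(emb.reshape(D, -1), dim=0)                  # unit-norm pixel embeddings, D x HW
    r = regions.reshape(-1)                                     # region index per pixel, HW
    # Eqn. 5: region prototypes as the mean of member embeddings, then l2-normalised
    protos = torch.zeros(D, K, device=emb.device).index_add_(1, r, z)
    counts = torch.bincount(r, minlength=K).clamp(min=1).float()
    protos = F.normalize(protos / counts, dim=0)                # D x K
    # Eqn. 6: cosine similarity of every pixel to every prototype
    sim = protos.t() @ z                                        # K x HW
    pos = sim.gather(0, r.unsqueeze(0)).squeeze(0)              # similarity to the pixel's own region
    # negatives: valid regions whose label differs from the label of the pixel's region
    neg_mask = (region_labels.unsqueeze(1) != region_labels[r].unsqueeze(0)) & valid.unsqueeze(1)
    neg = (sim.exp() * neg_mask.float()).sum(dim=0)             # HW
    # Eqn. 7: contrast each valid pixel against its own prototype vs. differently-labelled prototypes
    loss = -torch.log(pos.exp() / (pos.exp() + neg + 1e-8))
    return loss[valid[r]].mean()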
3.3 Learning and Optimization
To train our model, PAC-UDA, with a base self-training approach, we follow the exact procedure outlined by the corresponding approach. The only difference is that we plug in our constraint as a regulariser to the base objective. One important consideration is that our regulariser depends on a reasonable quality of pseudo-labels to define region labels that are not random. Thus, the regularisation weight $\lambda_{obj}$ is set to zero for a few initial training iterations, after which it switches to its actual value.
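The combination of the base objective with the warm-up behaviour described above can be sketched as follows; the coefficient values and the warm-up length are placeholders, not the settings used in our experiments.

```python
def pac_uda_objective(step, loss_src, loss_st, loss_obj,
                      lambda_st=1.0, lambda_obj=0.1, warmup_iters=2000):
    """Total PAC-UDA loss: the objectness term is disabled during an initial warm-up
    so that region labels are derived from reasonably reliable pseudo-labels (sketch)."""
    w_obj = lambda_obj if step >= warmup_iters else 0.0
    return loss_src + lambda_st * loss_st + w_obj * loss_obj
```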
4 Experiments
Datasets and Evaluation Metric: We evaluate the PAC-UDA framework in two common scenarios: the GTA [37] → Cityscapes [12] transfer semantic segmentation task and the SYNTHIA [38] → Cityscapes [12] task. GTA5 is composed of 24,966 synthetic images at a resolution of 1914 × 1052 and has annotations for 19 classes that are compatible with the categories in Cityscapes. Similarly, SYNTHIA consists of 9,400 synthetic images of urban scenes at a resolution of 1280 × 760 with annotations for only 16 common categories. Cityscapes consists of real images and aligned depth maps of urban scenes at resolution 2048 × 1024 and is split into train, validation and test sets of 2,975, 500 and 1,525 images, respectively. From the train split, we use a randomly selected subset of images for self-training and the remaining images for validation. We report the final test performance of our method on the 500 images of the official validation split. The data splits are consistent with prior works [2, 61]. The performance metrics used are per-class Intersection over Union (IoU) and mean IoU (mIoU) over all the classes.
Implementation Details: For object-region estimates, we experiment with three different numbers of RGB clusters, two values of the prominence threshold, and three numbers of histogram bins. Depth maps obtained from stereo pairs can have missing values at the pixel level, as is the case with Cityscapes. These missing values are set to zero and are ignored while generating depth segments from the depth histogram. Finally, due to the high computational cost of computing the contrastive objective from pixel-wise embeddings, we compute these embeddings at a reduced spatial resolution for CAG, SAC and DACS. We fixed the relative weighting of the regularizer, $\lambda_{obj}$, since the target performance was found to be insensitive to its exact value. For hyperparameter choices regarding architecture and optimizers, we exactly follow the respective self-training base methods [60, 2, 47]. Experiments were conducted on 11 GB RTX 2080 Ti GPUs with a PyTorch implementation. Further details are in the supplementary.
Method | road | sidewalk | building | wall | fence | pole | light | sign | vege. | terrain | sky | person | rider | car | truck | bus | train | motor | bike | mIoU
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
AdvEnt [50] | 89.4 | 33.1 | 81.0 | 26.6 | 26.8 | 27.2 | 33.5 | 24.7 | 83.9 | 36.7 | 78.8 | 58.7 | 30.5 | 84.8 | 38.5 | 44.5 | 1.7 | 31.6 | 32.4 | 45.5 |
DISE [7] | 91.5 | 47.5 | 82.5 | 31.3 | 25.6 | 33.0 | 33.7 | 25.8 | 82.7 | 28.8 | 82.7 | 62.4 | 30.8 | 85.2 | 27.7 | 34.5 | 6.4 | 25.2 | 24.4 | 45.4 |
Cycada [18] | 86.7 | 35.6 | 80.1 | 19.8 | 17.5 | 38.0 | 39.9 | 41.5 | 82.7 | 27.9 | 73.6 | 64.9 | 19.0 | 65.0 | 12.0 | 28.6 | 4.5 | 31.1 | 42.0 | 42.7 |
BLF [25] | 91.0 | 44.7 | 84.2 | 34.6 | 27.6 | 30.2 | 36.0 | 36.0 | 85.0 | 43.6 | 83.0 | 58.6 | 31.6 | 83.3 | 35.3 | 49.7 | 3.3 | 28.8 | 35.6 | 48.5 |
CAG-UDA [60] | 90.4 | 51.6 | 83.8 | 34.2 | 27.8 | 38.4 | 25.3 | 48.4 | 85.4 | 38.2 | 78.1 | 58.6 | 34.6 | 84.7 | 21.9 | 42.7 | 41.1 | 29.3 | 37.2 | 50.2 |
PyCDA† [26] | 90.5 | 36.3 | 84.4 | 32.4 | 28.7 | 34.6 | 36.4 | 31.5 | 86.8 | 37.9 | 78.5 | 62.3 | 21.5 | 85.6 | 27.9 | 34.8 | 18.0 | 22.9 | 49.3 | 47.4 |
CD-AM [58] | 91.3 | 46.0 | 84.5 | 34.4 | 29.7 | 32.6 | 35.8 | 36.4 | 84.5 | 43.2 | 83.0 | 60.0 | 32.2 | 83.2 | 35.0 | 46.7 | 0.0 | 33.7 | 42.2 | 49.2 |
FADA [52] | 92.5 | 47.5 | 85.1 | 37.6 | 32.8 | 33.4 | 33.8 | 18.4 | 85.3 | 37.7 | 83.5 | 63.2 | 39.7 | 87.5 | 32.9 | 47.8 | 1.6 | 34.9 | 39.5 | 49.2 |
FDA [59] | 92.5 | 53.3 | 82.4 | 26.5 | 27.6 | 36.4 | 40.6 | 38.9 | 82.3 | 39.8 | 78.0 | 62.6 | 34.4 | 84.9 | 34.1 | 53.1 | 16.9 | 27.7 | 46.4 | 50.5 |
SA-I2I [34] | 91.2 | 43.3 | 85.2 | 38.6 | 25.9 | 34.7 | 41.3 | 41.0 | 85.5 | 46.0 | 86.5 | 61.7 | 33.8 | 85.5 | 34.4 | 48.7 | 0.0 | 36.1 | 37.8 | 50.4 |
PIT [31] | 87.5 | 43.4 | 78.8 | 31.2 | 30.2 | 36.3 | 39.9 | 42.0 | 79.2 | 37.1 | 79.3 | 65.4 | 37.5 | 83.2 | 46.0 | 45.6 | 25.7 | 23.5 | 49.9 | 50.6 |
IAST [32] | 93.8 | 57.8 | 85.1 | 39.5 | 26.7 | 26.2 | 43.1 | 34.7 | 84.9 | 32.9 | 88.0 | 62.6 | 29.0 | 87.3 | 39.2 | 49.6 | 23.2 | 34.7 | 39.6 | 51.5 |
DACS [47] | 89.9 | 39.7 | 87.9 | 30.7 | 39.5 | 38.5 | 46.4 | 52.8 | 88.0 | 44.0 | 88.7 | 67.0 | 35.8 | 84.4 | 45.7 | 50.2 | 0.0 | 27.2 | 34.0 | 52.1 |
RPT† [61] | 89.2 | 43.3 | 86.1 | 39.5 | 29.9 | 40.2 | 49.6 | 33.1 | 87.4 | 38.5 | 86.0 | 64.4 | 25.1 | 88.5 | 36.6 | 45.8 | 23.9 | 36.5 | 56.8 | 52.6 |
SAC[2] | 90.4 | 53.9 | 86.6 | 42.4 | 27.3 | 45.1 | 48.5 | 42.7 | 87.4 | 40.1 | 86.1 | 67.5 | 29.7 | 88.5 | 49.1 | 54.6 | 9.8 | 26.6 | 45.3 | 53.8 |
SAC* [2] | 89.9 | 54.0 | 86.2 | 37.8 | 28.9 | 45.9 | 46.9 | 47.7 | 88.0 | 44.8 | 85.5 | 66.4 | 30.3 | 88.6 | 50.5 | 54.5 | 1.5 | 17.0 | 39.3 | 52.8 |
DACS PAC (ours) | 93.2 | 58.8 | 87.2 | 33.3 | 35.1 | 38.6 | 41.8 | 51.4 | 87.4 | 45.8 | 88.3 | 64.8 | 31.6 | 84.3 | 51.7 | 53.4 | 0.6 | 31.3 | 50.6 | 54.2 |
Method | road | sidewalk | building | wall | fence | pole | light | sign | vege. | sky | person | rider | car | bus | motor | bike | mIoU
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
SPIGAN[24] | 71.1 | 29.8 | 71.4 | 3.7 | 0.3 | 33.2 | 6.4 | 15.6 | 81.2 | 78.9 | 52.7 | 13.1 | 75.9 | 25.5 | 10.0 | 20.5 | 36.8 |
DCAN [56] | 82.8 | 36.4 | 75.7 | 5.1 | 0.1 | 25.8 | 8.0 | 18.7 | 74.7 | 76.9 | 51.1 | 15.9 | 77.7 | 24.8 | 4.1 | 37.3 | 38.4 |
DISE [7] | 91.7 | 53.5 | 77.1 | 2.5 | 0.2 | 27.1 | 6.2 | 7.6 | 78.4 | 81.2 | 55.8 | 19.2 | 82.3 | 30.3 | 17.1 | 34.3 | 41.5 |
AdvEnt [50] | 85.6 | 42.2 | 79.7 | 8.7 | 0.4 | 25.9 | 5.4 | 8.1 | 80.4 | 84.1 | 57.9 | 23.8 | 73.3 | 36.4 | 14.2 | 33.0 | 41.2 |
DADA[51] | 89.2 | 44.8 | 81.4 | 6.8 | 0.3 | 26.2 | 8.6 | 11.1 | 81.8 | 84.0 | 54.7 | 19.3 | 79.7 | 40.7 | 14.0 | 38.8 | 42.6 |
CAG-UDA [60] | 84.7 | 40.8 | 81.7 | 7.8 | 0.0 | 35.1 | 13.3 | 22.7 | 84.5 | 77.6 | 64.2 | 27.8 | 80.9 | 19.7 | 22.7 | 48.3 | 44.5 |
PIT [31] | 83.1 | 27.6 | 81.5 | 8.9 | 0.3 | 21.8 | 26.4 | 33.8 | 76.4 | 78.8 | 64.2 | 27.6 | 79.6 | 31.2 | 31.0 | 31.3 | 44.0 |
PyCDA† [26] | 75.5 | 30.9 | 83.3 | 20.8 | 0.7 | 32.7 | 27.3 | 33.5 | 84.7 | 85.0 | 64.1 | 25.4 | 85.0 | 45.2 | 21.2 | 32.0 | 46.7 |
FADA [52] | 84.5 | 40.1 | 83.1 | 4.8 | 0.0 | 34.3 | 20.1 | 27.2 | 84.8 | 84.0 | 53.5 | 22.6 | 85.4 | 43.7 | 26.8 | 27.8 | 45.2 |
DACS[47] | 80.6 | 25.1 | 81.9 | 21.5 | 2.9 | 37.2 | 22.7 | 24.0 | 83.7 | 90.8 | 67.6 | 38.3 | 82.9 | 38.9 | 28.5 | 47.6 | 48.3 |
IAST [32] | 81.9 | 41.5 | 83.3 | 17.7 | 4.6 | 32.3 | 30.9 | 28.8 | 83.4 | 85.0 | 65.5 | 30.8 | 86.5 | 38.2 | 33.1 | 52.7 | 49.8 |
RPT† [61] | 88.9 | 46.5 | 84.5 | 15.1 | 0.5 | 38.5 | 39.5 | 30.1 | 85.9 | 85.8 | 59.8 | 26.1 | 88.1 | 46.8 | 27.7 | 56.1 | 51.2 |
SAC [2] | 89.3 | 47.2 | 85.5 | 26.5 | 1.3 | 43.0 | 45.5 | 32.0 | 87.1 | 89.3 | 63.6 | 25.4 | 86.9 | 35.6 | 30.4 | 53.0 | 52.6 |
SAC*[2] | 91.7 | 52.7 | 85.1 | 22.6 | 1.5 | 42.2 | 44.1 | 30.9 | 82.5 | 73.8 | 63.0 | 20.9 | 84.9 | 29.5 | 26.9 | 52.2 | 50.3 |
SAC PAC (ours) | 83.2 | 40.5 | 85.4 | 30.0 | 2.0 | 43.0 | 42.2 | 33.8 | 86.3 | 89.8 | 65.3 | 33.5 | 85.1 | 35.2 | 29.9 | 55.3 | 52.5 |
4.1 Generality of Objectness Constraint
In Table 1, we test the generality of our proposed regularizer on three base methods, namely CAG [60], SAC [2] and DACS [47], which generate pseudo-labels in different ways. We use the official implementation of each base method with almost the same configurations for data preprocessing, model architecture, and optimizer, except for a few modifications as follows. In the case of CAG, we replace the Euclidean metric with a Cosine metric as it was found to generate more reliable pseudo-labels. Also, we run it for a single self-training iteration instead of three [60]. For the SAC method, we reduce the GROUP_SIZE from its default value owing to GPU constraints. Finally, for the DACS approach, we adopt the training and validation splits of Cityscapes used in SAC to maintain benchmark consistency across different base methods. In terms of architecture, DACS and SAC use a standard DeepLabv2 [9] backbone, whereas CAG augments this backbone with a decoder model (see [60] for details). For the sake of fair comparison, we try our best to achieve baseline accuracies that are at least as good as the published results. While we achieved slightly lower performance on SAC due to resource constraints, we achieve superior accuracies for the DACS and CAG baselines. Thus, these methods serve as strong baselines for evaluating our approach.

From Table 1, we observe that base methods regularised with our constraint always, and sometimes significantly, outperform their unregularised versions in terms of mIoU. Secondly, the improvement spans various categories of both the “stuff” and “things” types. Some of these include the sidewalk, sky, traffic light, traffic sign and bike classes under GTA→Cityscapes, and the wall, fence, person and bus classes under SYNTHIA→Cityscapes. While different adaptation settings favour different classes, a particularly striking observation is that large gains are obtained in both frequent (sidewalk, wall) and less-frequent (bus, bike) classes. We suspect that such uniformity arises from our object-region-aware constraint, which is agnostic to the statistical dominance of specific classes. Finally, Fig. 3 visualises these observations by comparing the predictions of the DACS and DACS+PAC models (trained on GTA) on randomly selected examples from the Cityscapes validation split.
Configuration | mIoU |
---|---|
All | 54.2 |
Only PL | 49.3 |
Only Depth+RGB segments | 51.9 |
Only Depth segments w/ PL | 51.7 |
Only RGB segments w/ PL | 52.1 |
4.2 Prior Works Comparison
In this section, we compare our best performing method with prior works under each domain setting.
GTA→Cityscapes (Table 2): In terms of mIoU, our DACS+PAC outperforms the state-of-the-art (SAC) despite having a simpler training objective (no focal loss regularizer or importance sampling) and no adaptive batch normalisation. In particular, our approach outperforms SAC significantly in the road, sidewalk, fence, terrain, sky, rider, motorcycle and bike classes. More interestingly, this observation holds when compared to other prior works as well, wherein our model improves IoUs for both dominant categories like road and sidewalk as well as less frequent categories like traffic-sign and terrain. For classes like sidewalk, we suspect that structural constraints based on our regularizer reduce contextual bias [41], which is responsible for coarse boundaries.
SYNTHIA→Cityscapes (Table 3): In this setting, our best performing method outperforms all but one prior method, often by significant margins. While our SAC+PAC under resource constraints compares favourably to the official implementation of SAC (with larger GROUP_SIZE), it significantly outperforms our own implementation of SAC, which is the fairer comparison given the same resource constraints. Nevertheless, our approach improves the best previous results on the wall class and achieves state-of-the-art on the pole and sign classes.
4.3 Ablations
In this section, we deconstruct our multi-modal regularizer (PAC) to quantify the effect of individual components on final performance. In Table 4, the ‘All’ configuration corresponds to our original formulation. The ‘Only PL’ configuration estimates the object-regions using just the pseudo-labels and hence ignores complementary information from depth. ‘Only Depth+RGB segments’ does not use pseudo-labels to define region labels and instead treats each Depth+RGB segment as a unique object category. The configurations in the next two rows use only one of the two modalities for estimating object regions while still using pseudo-labels to define region labels. We observe that the contrastive regulariser based on only pseudo-labels performs the worst, significantly below the one based on just multimodal segments. This is intuitive because reusing pseudo-labels as a regularisation without auxiliary information reinforces the confirmation bias. While purely RGB-based segments lead to a better objectness constraint than purely depth-based ones (as can be seen in Fig. 2), combining the two (‘All’ config.) yields the best results.
5 Conclusion
In this work, we proposed a multi-modal regularisation scheme for self-training approaches in unsupervised domain adaptation for semantic segmentation. We derive an objectness constraint from multi-modal clustering that is then used to formulate a contrastive objective for regularisation. We show that this regularisation consistently improves upon different types of self-training methods and even achieves state-of-the-art performance in popular benchmarks. In the future, we plan to study the effect of other modalities like 3D point-clouds in semantic segmentation.
References
- [1] Radhakrishna Achanta, Appu Shaji, Kevin Smith, Aurelien Lucchi, Pascal Fua, and Sabine Süsstrunk. Slic superpixels compared to state-of-the-art superpixel methods. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2012.
- [2] Nikita Araslanov and Stefan Roth. Self-supervised augmentation consistency for adapting semantic segmentation. In CVPR, 2021.
- [3] Eric Arazo, Diego Ortego, Paul Albert, Noel E O’Connor, and Kevin McGuinness. Pseudo-labeling and confirmation bias in deep semi-supervised learning. In 2020 International Joint Conference on Neural Networks (IJCNN), pages 1–8. IEEE, 2020.
- [4] M. Baktashmotlagh, M. T. Harandi, B. C. Lovell, and M. Salzmann. Unsupervised domain adaptation by domain invariant projection. In ICCV, 2013.
- [5] Konstantinos Bousmalis, Nathan Silberman, David Dohan, Dumitru Erhan, and Dilip Krishnan. Unsupervised pixel-level domain adaptation with generative adversarial networks. CoRR, 2016.
- [6] Konstantinos Bousmalis, George Trigeorgis, Nathan Silberman, Dilip Krishnan, and Dumitru Erhan. Domain separation networks. NeurIPS, 2016.
- [7] Wei-Lun Chang, Hui-Po Wang, Wen-Hsiao Peng, and Wei-Chen Chiu. All about structure: Adapting structural information across domains for boosting semantic segmentation. CoRR, 2019.
- [8] Chaoqi Chen, Weiping Xie, Wenbing Huang, Yu Rong, Xinghao Ding, Yue Huang, Tingyang Xu, and Junzhou Huang. Progressive feature alignment for unsupervised domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 627–636, 2019.
- [9] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin P. Murphy, and Alan Loddon Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
- [10] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations, 2020.
- [11] Yuhua Chen, Wen Li, Xiaoran Chen, and Luc Van Gool. Learning semantic segmentation from synthetic data: A geometrically guided input-output adaptation approach. CVPR, 2019.
- [12] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In CVPR, 2016.
- [13] Nicolas Courty, Rémi Flamary, Amaury Habrard, and Alain Rakotomamonjy. Joint distribution optimal transportation for domain adaptation. In NeurIPS, 2017.
- [14] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. Advances in Computer Vision and Pattern Recognition, 2017.
- [15] Clément Godard, Oisin Mac Aodha, and Gabriel J. Brostow. Digging into self-supervised monocular depth estimation. ICCV, 2019.
- [16] Yves Grandvalet and Yoshua Bengio. Semi-supervised learning by entropy minimization. In NeurIPS, 2005.
- [17] Bharath Hariharan, Pablo Arbeláez, Ross B. Girshick, and Jitendra Malik. Simultaneous detection and segmentation. In ECCV, 2014.
- [18] Judy Hoffman, Eric Tzeng, Taesung Park, Jun-Yan Zhu, Phillip Isola, Kate Saenko, Alexei Efros, and Trevor Darrell. Cycada: Cycle-consistent adversarial domain adaptation. In International Conference on Machine Learning (ICML), 2018.
- [19] Maximilian Jaritz, Tuan-Hung Vu, Raoul de Charette, Emilie Wirbel, and Patrick Perez. xmuda: Cross-modal unsupervised domain adaptation for 3d semantic segmentation. CVPR, 2020.
- [20] Fredrik D. Johansson, David A Sontag, and Rajesh Ranganath. Support and invertibility in domain-invariant representations. In AISTATS, 2019.
- [21] Guoliang Kang, Lu Jiang, Yi Yang, and Alexander G Hauptmann. Contrastive adaptation network for unsupervised domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4893–4902, 2019.
- [22] Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and Dilip Krishnan. Supervised contrastive learning. In NeurIPS, 2020.
- [23] Myeongjin Kim and Hyeran Byun. Learning texture invariant representation for domain adaptation of semantic segmentation. CVPR, 2020.
- [24] Kuan-Hui Lee, Germán Ros, Jie Li, and Adrien Gaidon. Spigan: Privileged adversarial learning from simulation. ArXiv, 2019.
- [25] Yunsheng Li, Lu Yuan, and Nuno Vasconcelos. Bidirectional learning for domain adaptation of semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6936–6945, 2019.
- [26] Qing Lian, Fengmao Lv, Lixin Duan, and Boqing Gong. Constructing self-motivated pyramid curriculums for cross-domain semantic segmentation: A non-adversarial approach. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6758–6767, 2019.
- [27] Marcos Llobera. Building past landscape perception with gis: Understanding topographic prominence. Journal of Archaeological Science - J ARCHAEOL SCI, 2001.
- [28] Mingsheng Long, Yue Cao, Jianmin Wang, and Michael Jordan. Learning transferable features with deep adaptation networks. In International conference on machine learning, pages 97–105. PMLR, 2015.
- [29] Mingsheng Long, Han Zhu, Jianmin Wang, and Michael I. Jordan. Deep transfer learning with joint adaptation networks. In ICML, 2017.
- [30] Fengmao Lv, Tao Liang, Xiang Chen, and Guosheng Lin. Cross-domain semantic segmentation via domain-invariant interactive relation transfer. In CVPR, 2020.
- [31] Fengmao Lv, Tao Liang, Xiang Chen, and Guosheng Lin. Cross-domain semantic segmentation via domain-invariant interactive relation transfer. In CVPR, 2020.
- [32] Ke Mei, Chuang Zhu, Jiaqi Zou, and Shanghang Zhang. Instance adaptive self-training for unsupervised domain adaptation. In ECCV, 2020.
- [33] Krikamol Muandet, David Balduzzi, and Bernhard Schölkopf. Domain generalization via invariant feature representation. ICML, 2013.
- [34] Luigi Musto and Andrea Zinelli. Semantically adaptive image-to-image translation for domain adaptation of semantic segmentation. In BMVC, 2020.
- [35] Fei Pan, Inkyu Shin, François Rameau, Seokju Lee, and In So Kweon. Unsupervised intra-domain adaptation for semantic segmentation through self-supervision. CVPR, 2020.
- [36] S. J. Pan, I. W. Tsang, J. T. Kwok, and Q. Yang. Domain adaptation via transfer component analysis. IEEE Transactions on Neural Networks, 2011.
- [37] Stephan R Richter, Vibhav Vineet, Stefan Roth, and Vladlen Koltun. Playing for data: Ground truth from computer games. In Proceedings of the European Conference on Computer Vision (ECCV), 2016.
- [38] German Ros, Laura Sellart, Joanna Materzynska, David Vazquez, and Antonio M Lopez. The synthia dataset: A large collection of synthetic images for semantic segmentation of urban scenes. In CVPR, 2016.
- [39] Ashutosh Saxena, Jamie Schulte, Andrew Y Ng, et al. Depth estimation using monocular and stereo cues. In IJCAI, volume 7, pages 2197–2203, 2007.
- [40] Uri Shalit, Fredrik D. Johansson, and David Sontag. Estimating individual treatment effect: Generalization bounds and algorithms. In ICML, 2017.
- [41] Rakshith Shetty, Bernt Schiele, and Mario Fritz. Not using the car to see the sidewalk – quantifying and controlling the effects of context in classification and segmentation. In CVPR, June 2019.
- [42] Rui Shu, Hung H Bui, Hirokazu Narui, and Stefano Ermon. A dirt-t approach to unsupervised domain adaptation. arXiv, 2018.
- [43] S. Si, D. Tao, and B. Geng. Bregman divergence-based regularization for transfer subspace learning. IEEE Transactions on Knowledge and Data Engineering, 2010.
- [44] Kihyuk Sohn. Improved deep metric learning with multi-class n-pair loss objective. In NeurIPS, 2016.
- [45] Sinisa Stekovic, Friedrich Fraundorfer, and Vincent Lepetit. Casting geometric constraints in semantic segmentation as semi-supervised learning. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1854–1863, 2020.
- [46] Subhashree Subudhi, Ram Narayan Patro, Pradyut Kumar Biswal, and Fabio Dell’Acqua. A survey on superpixel segmentation as a preprocessing step in hyperspectral image analysis. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 14:5015–5035, 2021.
- [47] Wilhelm Tranheden, Viktor Olsson, Juliano Pinto, and Lennart Svensson. Dacs: Domain adaptation via cross-domain mixed sampling. WACV, 2021.
- [48] Y.-H. Tsai, W.-C. Hung, S. Schulter, K. Sohn, M.-H. Yang, and M. Chandraker. Learning to adapt structured output space for semantic segmentation. In CVPR, 2018.
- [49] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell. Adversarial discriminative domain adaptation. In CVPR, 2017.
- [50] Tuan-Hung Vu, Himalaya Jain, Maxime Bucher, Mathieu Cord, and Patrick Pérez. Advent: Adversarial entropy minimization for domain adaptation in semantic segmentation. arXiv, 2018.
- [51] Tuan-Hung Vu, Himalaya Jain, Maxime Bucher, Matthieu Cord, and Patrick Pérez. Dada: Depth-aware domain adaptation in semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7364–7373, 2019.
- [52] Haoran Wang, Tong Shen, Wei Zhang, Ling-Yu Duan, and Tao Mei. Classes matter: A fine-grained adversarial approach to cross-domain semantic segmentation. In European Conference on Computer Vision, pages 642–659. Springer, 2020.
- [53] Qin Wang, Dengxin Dai, Lukas Hoyer, Olga Fink, and Luc Van Gool. Domain adaptive semantic segmentation with self-supervised depth estimation. In ICCV, 2021.
- [54] Zhonghao Wang, Mo Yu, Yunchao Wei, Rogério Schmidt Feris, Jinjun Xiong, Wen mei W. Hwu, Thomas S. Huang, and Humphrey Shi. Differential treatment for stuff and things: A simple unsupervised domain adaptation method for semantic segmentation. CVPR, 2020.
- [55] Yifan Wu, Ezra Winston, Divyansh Kaushik, and Zachary Chase Lipton. Domain adaptation with asymmetrically-relaxed distribution alignment. ICML, 2019.
- [56] Zuxuan Wu, Xintong Han, Yen-Liang Lin, Mustafa Gokhan Uzunbas, Tom Goldstein, Ser Nam Lim, and Larry S Davis. Dcan: Dual channel-wise alignment networks for unsupervised scene adaptation. In ECCV, 2018.
- [57] Shaoan Xie, Zibin Zheng, Liang Chen, and Chuan Chen. Learning semantic representations for unsupervised domain adaptation. In International conference on machine learning, pages 5423–5432. PMLR, 2018.
- [58] Jinyu Yang, Weizhi An, Chaochao Yan, Peilin Zhao, and Junzhou Huang. Context-aware domain adaptation in semantic segmentation. In WACV, 2021.
- [59] Yanchao Yang and Stefano Soatto. FDA: Fourier domain adaptation for semantic segmentation. In CVPR, 2020.
- [60] Qiming Zhang, Jing Zhang, Wenyu Liu, and D. Tao. Category anchor-guided unsupervised domain adaptation for semantic segmentation. In NeurIPS, 2019.
- [61] Yiheng Zhang, Zhaofan Qiu, Ting Yao, Chong-Wah Ngo, Dong Liu, and Tao Mei. Transferring and regularizing prediction for semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9621–9630, 2020.
- [62] Han Zhao, Remi Tachet des Combes, Kun Zhang, and Geoffrey J. Gordon. On learning invariant representation for domain adaptation. ICML, 2019.
- [63] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. In CVPR, 2017.
- [64] Yang Zou, Zhiding Yu, BVK Kumar, and Jinsong Wang. Unsupervised domain adaptation for semantic segmentation via class-balanced self-training. In Proceedings of the European conference on computer vision (ECCV), pages 289–305, 2018.
In this supplementary, we provide additional details and analysis for our proposed method, PAC-UDA. Algorithm 1 provides a step-by-step procedure for unsupervised domain adaptation via PAC-UDA.
Appendix A Hyperparameters for Main Experiments
method | ||||
---|---|---|---|---|
CAG + PAC | 50 | 200 | 0.0025 | 0.90 |
SAC + PAC | 50 | 200 | 0.0025 | 0.90 |
DACS + PAC | 25 | 200 | 0.001 | 0.90 |
To report the results in Table 1, Table 2 and Table 3, we choose the best hyperparameters following standard cross-validation on a random subset of Cityscapes train-split introduced in [2]. For base methods, we use the default hyperparameters from respective papers. In Table 5, we summarise the hyperparameters for Table 1. Since the results of our approach in Table 2 and Table 3 are a subset of Table 1, the above hyperparameters apply there as well.
Appendix B Additional Ablations
In this section, we provide additional ablation studies for DACS+PAC on GTA→Cityscapes. Unless otherwise stated, we use the default hyperparameter configuration, and we train each setting for a fixed number of iterations.
B.1 Importance of Multiple Modalities and Pseudo-Labels
Configuration | road | sidewalk | building | wall | fence | pole | light | sign | vege. | terrain | sky | person | rider | car | truck | bus | train | motor | bike | mIoU
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
All | 93.2 | 58.8 | 87.2 | 33.3 | 35.1 | 38.6 | 41.8 | 51.4 | 87.4 | 45.8 | 88.3 | 64.8 | 31.6 | 84.3 | 51.7 | 53.4 | 0.6 | 31.3 | 50.6 | 54.2 |
PL | 93.7 | 58.7 | 86.8 | 27.3 | 29.7 | 35.4 | 41.6 | 50.6 | 87.1 | 46.7 | 89.2 | 65.2 | 37.1 | 87.4 | 41.3 | 49.8 | 0.0 | 7.0 | 1.6 | 49.3 |
Depth-RGB | 94.1 | 58.1 | 86.2 | 38.2 | 30.3 | 34.8 | 37.8 | 41.3 | 86.7 | 46.1 | 87.5 | 62.4 | 31.0 | 86.8 | 52.5 | 49.1 | 0.0 | 24.5 | 40.1 | 51.9 |
Depth-PL | 93.3 | 61.9 | 86.7 | 31.8 | 35.9 | 36.1 | 43.3 | 50.2 | 86.2 | 41.2 | 86.4 | 65.0 | 32.2 | 82.1 | 31.9 | 50.4 | 0.9 | 23.1 | 43.6 | 51.7 |
RGB-PL | 95.1 | 65.3 | 86.1 | 25.9 | 30.1 | 35.4 | 39.1 | 41.2 | 85.2 | 37.9 | 86.2 | 61.4 | 26.7 | 87.9 | 50.9 | 50.6 | 0.0 | 35.8 | 50.4 | 52.1 |
Configuration | road | sidewalk | building | wall | fence | pole | light | sign | vege. | terrain | sky | person | rider | car | truck | bus | train | motor | bike | mIoU
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
All () | 93.2 | 58.8 | 87.2 | 33.3 | 35.1 | 38.6 | 41.8 | 51.4 | 87.4 | 45.8 | 88.3 | 64.8 | 31.6 | 84.3 | 51.7 | 53.4 | 0.6 | 31.3 | 50.6 | 54.2 |
RGB-PL() | 95.1 | 65.3 | 86.1 | 25.9 | 30.1 | 35.4 | 39.1 | 41.2 | 85.2 | 37.9 | 86.2 | 61.4 | 26.7 | 87.9 | 50.9 | 50.6 | 0.0 | 35.8 | 50.4 | 52.1 |
RGB-PL () | 94.6 | 63.4 | 86.8 | 28.7 | 30.7 | 37.6 | 42.8 | 51.3 | 86.8 | 44.9 | 87.9 | 64.9 | 32.5 | 87.8 | 42.7 | 45.4 | 0.0 | 32.6 | 51.2 | 53.3 |
RGB-PL () | 94.4 | 62.1 | 86.2 | 29.2 | 32.5 | 34.2 | 40.0 | 50.2 | 86.2 | 47.1 | 87.4 | 63.0 | 32.7 | 87.9 | 39.4 | 45.3 | 0.1 | 32.6 | 52.8 | 52.8 |
Configuration | road | sidewalk | building | wall | fence | pole | light | sign | vege. | terrain | sky | person | rider | car | truck | bus | train | motor | bike | mIoU
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
93.2 | 58.8 | 87.2 | 33.3 | 35.1 | 38.6 | 41.8 | 51.4 | 87.4 | 45.8 | 88.3 | 64.8 | 31.6 | 84.3 | 51.7 | 53.4 | 0.6 | 31.3 | 50.6 | 54.2 | |
94.2 | 59.4 | 86.7 | 35.8 | 32.1 | 36.8 | 40.5 | 49.4 | 86.5 | 41.9 | 86.0 | 63.5 | 27.1 | 89.1 | 53.7 | 54.5 | 2.5 | 27.3 | 45.7 | 53.3 |
Threshold | road | sidewalk | building | wall | fence | pole | light | sign | vege. | terrain | sky | person | rider | car | truck | bus | train | motor | bike | mIoU
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0.70 | 93.9 | 60.4 | 86.5 | 32.5 | 30.4 | 34.9 | 39.9 | 48.8 | 86.4 | 45.6 | 88.0 | 63.0 | 27.6 | 87.0 | 39.9 | 49.2 | 1.9 | 32.5 | 47.9 | 52.4 |
0.80 | 92.9 | 51.3 | 86.6 | 31.5 | 32.4 | 36.7 | 42.8 | 52.1 | 86.8 | 44.7 | 87.5 | 65.4 | 34.5 | 89.2 | 48.8 | 56.3 | 0.2 | 23.8 | 45.1 | 53.1 |
0.90 | 93.2 | 58.8 | 87.2 | 33.3 | 35.1 | 38.6 | 41.8 | 51.4 | 87.4 | 45.8 | 88.3 | 64.8 | 31.6 | 84.3 | 51.7 | 53.4 | 0.6 | 31.3 | 50.6 | 54.2 |
0.95 | 93.4 | 55.9 | 86.1 | 28.7 | 30.0 | 33.2 | 40.5 | 45.3 | 86.6 | 45.7 | 87.8 | 64.2 | 31.6 | 89.1 | 50.4 | 50.7 | 0.0 | 10.5 | 28.3 | 50.4 |

In Table 6, we provide the complete results (including classwise IoUs) for the ablations on individual modalities and pseudo-labels as described in Section 4.3. While Table 6 highlights the importance of combining all modalities with pseudo-labels for the best mean IoU, there are a few other important observations with respect to classwise IoUs.
For instance, using “PL” for objectness constraints significantly underperforms the other settings in rare source classes like motorcycle and bike (Figure 4). This gap is surprisingly large even when compared to “Depth-RGB”. We attribute this large performance gap to the class-imbalance problem [64], which is known to adversely affect self-training in the absence of class-balanced losses. However, incorporating our objectness constraint improves the rare-class IoUs significantly without losing performance in frequent classes (except sky). These results provide strong evidence for the normalisation of class-related statistical effects in the presence of multimodal objectness constraints.
Another interesting insight arises from comparing the “Depth-PL” and “RGB-PL” settings, which demonstrates the complementarity of the two modalities. Among the more frequent source classes (Figure 4), the purely RGB-based constraint considerably outperforms the purely depth-based constraint in categories such as road, sidewalk and car, whereas the converse is true for other categories like wall, pole, terrain and person. The stronger performance of the depth-based constraint on pole and person is intuitive since these objects have a very small depth range compared to the scene depth and hence can be easily detected using the depth histogram (see Section 3.1 for more discussion).
B.2 Importance of RGB-segments
In the past, image clustering has often been used as an effective preprocessing step for segmentation [61, 46]. Inspired by these works, in Table 7 we test the extent to which purely SLIC-based RGB segments can influence the objectness constraint and, consequently, the final performance of our PAC-UDA. Specifically, we tabulate the performance with a varying number of SLIC segments and compare it to our default configuration, “All”.
We observe that when using only RGB segments (without depth) for object-region estimates, there exists an intermediate number of segments at which the semantic segmentation performance peaks. This trend empirically validates our intuition for choosing the best $n_{seg}$ as discussed in Section 3.1. In fact, too small a value can be highly undesirable as it can lead to worse results than even the base method. It is, however, interesting to note that even with the optimal number of segments, a purely RGB-based objectness constraint underperforms our multimodal constraint (“All”).
B.3 Contrastive Loss Analysis
We analyze the effect of the specific form of the contrastive loss function in Table 8. Recall that in Eqn. 7, our formulation of the contrastive loss maximizes the similarity of each pixel embedding to only the prototype of the region that includes that pixel. Here, we introduce another variant of the loss that maximizes the similarity of each pixel embedding to the prototypes of all valid regions that share the same region label. While, originally, region labels could influence the loss function only via the dissimilarity scores, in this variant they influence it via both similarity and dissimilarity scores.
We observe that allowing greater region-label influence in this variant leads to a worse mIoU than the original formulation. Zooming into the classwise IoUs reveals that less-common classes primarily contribute to the overall worse performance of the variant. We suspect that increasing the influence of, and consequently the noise in, region labels affects these less-common classes more adversely than common classes like road, sidewalk, wall and car. Finally, this ablation guides our decision to adopt the original formulation as the default form of the contrastive loss in Eqn. 7.

B.4 Importance of Region-Label Threshold
An important hyperparameter of our objectness constraint is the region-label threshold. At higher values of this threshold, valid object-regions are more likely to be part of a single object and consistent with the ground-truth semantic category. This positively influences the target-domain performance. At the same time, the number of such valid object-regions is likely to be small, which may reduce the overall effect of the objectness constraint on target-domain performance. As one decreases the threshold, the number of valid regions increases at the expense of region-label consistency with the ground truth. Thus, evaluating the performance over a range of values is crucial.
Indeed, we observe in Table 9 that the mIoU increases with the threshold up to a certain point (0.90), beyond which the performance deteriorates. We thus set 0.90 as the default threshold for all our experiments.
Appendix C Additional Visualisations
In Figure 5, we provide additional qualitative comparisons between DACS+PAC, DACS and the ground truth under the GTA→Cityscapes setting.