
Regularizing Self-training for Unsupervised Domain Adaptation
via Structural Constraints

Rajshekhar Das1     Jonathan Francis1,2∗   Sanket Vaibhav Mehta1∗   Jean Oh1   Emma Strubell1   José Moura1
1Carnegie Mellon University   2Bosch Center for Artificial Intelligence
{rajshekd, jmf1, svmehta, hyaejino, strubell, moura}@andrew.cmu.edu
∗Equal contribution. Correspondence.
Abstract

Self-training based on pseudo-labels has emerged as a dominant approach for addressing conditional distribution shifts in unsupervised domain adaptation (UDA) for semantic segmentation problems. A notable drawback, however, is that this family of approaches is susceptible to erroneous pseudo-labels that arise from confirmation biases in the source domain and that manifest as nuisance factors in the target domain. A possible source for this mismatch is the reliance on only photometric cues provided by RGB image inputs, which may ultimately lead to sub-optimal adaptation. To mitigate the effect of mismatched pseudo-labels, we propose to incorporate structural cues from auxiliary modalities, such as depth, to regularise conventional self-training objectives. Specifically, we introduce a contrastive pixel-level objectness constraint that pulls the pixel representations within a region of an object instance closer, while pushing those from different object categories apart. To obtain object regions consistent with the true underlying object, we extract information from both depth maps and RGB images in the form of multimodal clustering. Crucially, the objectness constraint is agnostic to the ground-truth semantic labels and is, hence, appropriate for unsupervised domain adaptation. In this work, we show that our regulariser significantly improves top-performing self-training methods (by up to 2.2 mIoU points) on various UDA benchmarks for semantic segmentation. We include all code in the supplementary.

1 Introduction

Figure 1: Motivation for Objectness Constraints: The above examples compare target-domain ground-truth segmentation, predicted segmentation and prediction confidence (brighter regions are more confident) of a seed model that was adapted from source to target domain via adversarial adaptation [48]. Most self-training approaches use such a seed model to predict pixelwise pseudo-labels. The blue dashed boxes highlight the high-confidence regions that are likely to be included in the set of pseudo-labels despite being mis-classified. We propose to mitigate the adverse effect of such noisy pseudo-labels on self-training based adaptation via objectness constraints.

Semantic segmentation is a crucial and challenging task for applications such as autonomous driving [61, 2, 51, 60, 18] that rely on pixel-level semantics of the scene. Performance on this task has significantly improved over the past few years following the advances in deep supervised learning [9]. However, an important limitation arises from the excessive cost and time taken to annotate images at a pixel-level (reported to be 1.5 hours per image in a popular dataset [12]). Further, most real-world datasets do not have sufficient coverage over all variations in outdoor scenes such as weather conditions and geography-specific layouts that can be crucial for large-scale deployment of learning-based models in autonomous vehicles. Acquiring training data to cater to such scene variations would significantly add to the cost of annotation.

To address the annotation problem, synthetic datasets curated from 3D simulation environments like GTA [37] and SYNTHIA [38] have been proposed, where large amounts of annotated data can be easily generated. However, generated data introduces a domain shift due to differences in the visual characteristics of simulated images (source domain) and real images (target domain). To mitigate such shifts, unsupervised domain adaptation strategies [48, 5, 64, 60, 61, 18, 2] for semantic segmentation have been extensively studied in recent years. Among these, self-training [16] has emerged as a particularly promising approach that involves pseudo-labelling the (unlabelled) target-domain data using a seed model trained solely on the source domain. Pseudo-label predictions for which the confidence exceeds a predefined threshold are then used to further train the model and ultimately improve target-domain performance.

While self-training based adaptation is quite effective, it is susceptible to erroneous pseudo labels arising from confirmation bias [3] in the seed model. Confirmation bias results from training on source domain semantics that might introduce factors of representation that serve as nuisance factors for the target domain. In the context of semantic segmentation, such a bias manifests as pixel-wise seed predictions that are highly confident but incorrect (see Figure 1). For instance, if the source domain images usually have bright regions (high intensity of the RGB channels) for the sky class, then bright regions in target domain images might be predicted as the sky with high confidence, irrespective of the actual semantic label. Since highly confident predictions qualify as pseudo-labels, training the model on potentially noisy predictions can ultimately lead to sub-optimal performance in the target domain. Thus, in this work, we seek to reduce the heavy reliance of self-training methods on photometric cues for predicting pixel-wise semantic labels.

To that end, we propose to incorporate auxiliary modality information such as depth maps that can provide structural cues [51, 24, 53, 11], complementary to the photometric cues. Semantic segmentation datasets are usually accompanied by depth maps that can be easily acquired in practice [12, 39]. Since naïve fusion of features that are extracted from depth information can also introduce nuisance [24, 51], an important question is raised — How can we leverage the depth modality to counter the effect of noisy pseudo-labels during self-training? In this work, we propose a contrastive objectness constraint derived from depth maps and RGB-images in the target domain that is used to regularise conventional self-training methods. The constraint is computed in two steps: an object-region estimation step, followed by pixel-wise contrastive loss computation. In the first step, we perform unsupervised image segmentation using both depth-based histograms and RGB-images that are fused together to yield multiple object-regions per image. These regions respect actual object boundaries, based on the structural information depth provides, as well as visual similarity. In the second step, the object-regions are leveraged to formulate a contrastive objective [44, 10, 22] that pulls together pixel representations within an object region and pushes apart those from different semantic categories. Such an objective can improve semantic segmentation by causing the pixel representations of a semantic category to form a compact cluster that is well separated from other categories. We empirically demonstrate the effectiveness of our constraint on popular benchmark tasks, GTA\rightarrowCityscapes and SYNTHIA\rightarrowCityscapes, on which we achieve competitive segmentation performance. To summarise our contributions:

  • We propose a novel objectness constraint derived from depth and RGB information to regularise self-training approaches in unsupervised domain adaptation for semantic segmentation. The use of multiple modalities introduces implicit model supervision that is complementary to the pseudo-labels and, hence, leads to more robust self-training.

  • We empirically validate the most important aspect of our regulariser, i.e., its ability to improve a variety of self-training methods. Specifically, our approach achieves 1.2%-2.9% (GTA) and 2.2%-4.4% (SYNTHIA) relative improvements over three different self-training baselines. Interestingly, we observe that regularisation improves performance on both “stuff” and “things” classes, somewhat normalising the effects of classwise statistics.

  • Further, our regularised self-training method achieves a state-of-the-art mIoU of 54.2% in the GTA → Cityscapes setting and improves classwise IoUs by up to 4.8% over the best prior results.

2 Related Work

Unsupervised domain adaptation. Unsupervised domain adaptation (UDA) is of particular importance in complex structured-prediction problems, such as semantic segmentation in autonomous driving, where the domain gap between a source domain (e.g., an urban driving dataset) and target domain (real-world driving scenarios) can have devastating consequences on the efficacy of deployed models. Several approaches [14, 5, 35, 18, 30] have been proposed for learning domain invariant representations, e.g., through adversarial feature alignment [14, 6, 49, 54], which addresses the domain gap by minimising a distance metric that characterises the divergence between the two domains [36, 28, 29, 4, 40, 13, 43, 33]. Problematically, such approaches address only shifts in the marginal distribution of the covariates or the labels and, therefore, prove insufficient for handling the more complex shifts in the conditionals [20, 62, 55]. Self-training approaches have been proposed to induce category-awareness [60] or cluster density-based assumptions [42], in order to anchor or regularise conditional shift adaptation, respectively. In this paper, we build upon these works by jointly introducing category-awareness through the use of pseudo-labeling strategies and regularisation through the definition of contrastive depth-based objectness constraints.

Self-training with pseudo-labels. Application of self-training has become popular in the sphere of domain adaptation for semantic segmentation [64, 25, 60, 23]. Here, pseudo-labels are assigned to observations from the target domain, based on the semantic classes of high-confidence (e.g., the closest or least-contrastive) category centroids [60, 57], prototypes [8], cluster centers [21], or superpixel representations [61] that are learned by a model trained on the source domain. Often, to ensure the reliability of initial pseudo-labels for the target domain, the model is first warmed up via adversarial adaptation [60, 61]. Moreover, for stability purposes, pseudo-labels are updated in a stagewise fashion, thus resulting in an overall complex adaptation scheme. Towards streamlining this complex adaptation process, recent approaches like [2, 47] propose to train without adversarial warmup and with a momentum network to circumvent the stagewise training issue. A common factor underlying most self-training methods is their reliance on just RGB inputs, which may not provide sufficient signal for predicting robust target-domain pseudo-labels. This motivates us to look for alternate forms of input, like depth, that are easily accessible and provide a more robust signal.

Adaptation with multiple modalities. Learning and adaptation using multimodal contexts presents an opportunity for leveraging complementarity between different views of the input space, to improve model robustness and generalisability. In the context of unsupervised domain adaptation, use of multimodal information has recently become more popular with pioneering works like [24]. Specifically, [24] uses depth regression as a way to regularise GAN-based domain translation, resulting in better capture of source semantics in the generated target images. Another related approach [51] proposes the use of depth via an auxiliary objective to learn features that, when fused with the primary semantic segmentation prediction branch, provide a more robust representation for adaptation. While sharing our motivation for the use of auxiliary information, their use of fused features for adaptation does not address the susceptibility of adversarial adaptation to conditional distribution shifts. In contrast to this method, we propose a depth-based objectness constraint for adaptation via self-training that not only leverages multimodal context but also handles conditional shifts more effectively. Moreover, unlike the previous works that use depth only for the source domain, we explore its application exclusively to the target domain. Contemporary to our setting, [53] improves adaptation by extracting the correlation between depth and RGB in both domains. An important distinction of our approach with regard to the above works is that we exploit the complementarity of RGB and depth instead of the correlation to formulate a contrastive regularizer. The importance of multimodal information has also been considered in other contexts such as indoor semantic segmentation [45] and adaptation for 3D segmentation using 2D images and 3D point clouds [19]. While not directly related to our experimental settings, they provide insight and inspiration for our approach.

Figure 2: Objectness Constraint Formulation: Overall pipeline for computing the objectness constraint using multi-modal object-region estimates derived from RGB-Image and Depth-Map. Depth segmentation is obtained by clustering the histogram of depth values and RGB segmentation is obtained via k-means clustering (SLIC) of raw-pixel intensities. Fusing these two types of segmentation yields object regions that are more consistent with the actual object. For example, a portion of the car in the middle is wrongly clustered with the road in depth segmentation and with the left-wall under RGB segmentation. However, the fused segmentation yields car-regions that completely respect the boundary of the car.

3 Self-Training with Objectness Constraints

We begin by introducing preliminary concepts on self-training based adaptation. These concepts serve as the basis for the objectness constraint introduced in Sections 3.1 and 3.2, which is used to regularise self-training methods. We refer to our framework as PAC-UDA, which uses Pseudo-labels And objectness Constraints for self-training in Unsupervised Domain Adaptation for semantic segmentation. Although we describe a canonical form of self-training for formalising our regularisation constraint, PAC-UDA should be seen as a general approach that can encompass various forms of self-training (as shown in the experiments).

Unsupervised Domain Adaptation (UDA) for Semantic Segmentation: Consider a dataset $D^{s}=\{(x_{i}^{s},y_{i}^{s})\}_{i=1}^{N_{s}}$ of input-label pairs sampled from a source domain distribution, $P_{X\times Y}^{s}$. The input and labels share the same spatial dimensions, $H\times W$, where each pixel of the label is assigned a class $c\in\{1,\ldots,C\}$ and is represented via a $C$-dimensional one-hot encoding. We also have a dataset $D^{t}=\{(x_{i}^{t},y_{i}^{t})\}_{i=1}^{N_{t}}$ sampled from a target distribution, $P_{X\times Y}^{t}$, where the corresponding labels, $\{y_{i}^{t}\}$, are unobserved during training. Here, the target domain is separated from the source domain due to domain shift, expressed as $P_{X\times Y}^{s}\neq P_{X\times Y}^{t}$. Under such a shift, the goal of unsupervised domain adaptation is to leverage $D^{s}$ and $D^{t}$ to learn a parametric model that performs well in the target domain. The model is defined as a composition of an encoder, $E_{\phi}:X\to\mathcal{Z}$, and a classifier, $G_{\psi}:\mathcal{Z}\to\mathcal{Z}_{P}$, where $\mathcal{Z}\in\mathbb{R}^{H\times W\times d}$ represents the space of $d$-dimensional spatial embeddings, $\mathcal{Z}_{P}\in\mathbb{R}^{H\times W\times C}$ gives the un-normalised distribution over the $C$ classes at each spatial location, and $\{\phi,\psi\}$ are the model parameters. To learn a suitable target model, the parameters are optimised using a cross-entropy objective on the source domain,

$$L_{\text{cls}}^{s} = -\frac{1}{N_{s}}\sum_{i=1}^{N_{s}}\sum_{m=1}^{H\times W}\sum_{c=1}^{C} y_{imc}^{s}\log p^{s}_{imc}(\psi,\phi) \tag{1}$$

$$p^{s}_{imc}(\psi,\phi) = \sigma\left(G_{\psi}\circ E_{\phi}(x_{i}^{s})\right)\Big|_{m,c}\,, \tag{2}$$

where $\sigma$ denotes the softmax operation, together with an adaptation objective over the target domain, as described next.
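For concreteness, below is a minimal PyTorch sketch of Eqns. 1-2. The tiny encoder and 1x1-conv classifier are illustrative stand-ins for $E_{\phi}$ and $G_{\psi}$ (the paper uses DeepLabV2-style backbones), not the authors' implementation.

```python
# Minimal sketch (not the released code): source-domain cross-entropy of Eqns. 1-2.
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(              # stand-in for E_phi: X -> Z (d-dim spatial embeddings)
    nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
)
classifier = nn.Conv2d(64, 19, 1)     # stand-in for G_psi: Z -> Z_P (C = 19 classes)

def source_ce_loss(x_s, y_s):
    """x_s: (B,3,H,W) source images; y_s: (B,H,W) integer labels in {0..C-1}."""
    logits = classifier(encoder(x_s))   # un-normalised class scores per pixel (Eqn. 2, pre-softmax)
    return F.cross_entropy(logits, y_s)  # softmax + pixel-wise cross-entropy, averaged (Eqn. 1)
```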

Pseudo-label self-training (PLST): Following prior works [64, 60], we describe a simple and effective approach to PLST that leverages a source-trained seed model to pseudo-label unlabelled target data via confidence thresholding. Specifically, the seed model is first trained on $D^{s}$ using the objective in Eqn. 1 to obtain a good parameter initialisation, $\{\phi_{0},\psi_{0}\}$. Then, this model is used to compute pixel-wise class probabilities, $p^{t}_{im}(\psi_{0},\phi_{0})$, according to Eqn. 2 for each target image, $x_{i}^{t}\in D^{t}$. These probabilities are used in conjunction with a predefined threshold $\delta$ to obtain one-hot encoded pseudo-labels

$$\tilde{y}_{imc}^{t}=\begin{cases}1 & \text{if}~c=\arg\max\limits_{c^{\prime}}p^{t}_{imc^{\prime}}~\text{and}~p^{t}_{imc}\geq\delta\\ 0 & \text{otherwise}\end{cases} \tag{3}$$

Note that while Eqn. 3 uses a class-agnostic fixed threshold, in practice this threshold can be made class-specific and dynamically updated over the course of self-training. Such a threshold ensures that only highly confident predictions contribute to successive training. The final self-training objective can be written in terms of pseudo-labels as

$$L_{\text{st}}^{t}=-\frac{1}{N_{t}}\sum_{i=1}^{N_{t}}\sum_{m=1}^{H\times W}\sum_{c^{\prime}=1}^{C}\tilde{y}_{imc^{\prime}}^{t}\log\left(p^{t}_{imc^{\prime}}\right) \tag{4}$$

The overall UDA objective is simply $L_{\text{uda}}=L_{\text{cls}}^{s}+\alpha_{\text{st}}L_{\text{st}}^{t}$, where $\alpha_{\text{st}}$ is the relative weighting coefficient.
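A hedged PyTorch sketch of this PLST step (Eqns. 3-4) is shown below; `seed_model`, the threshold value, and the masking details are illustrative assumptions about the interface, not the released code.

```python
# Minimal sketch of confidence-thresholded pseudo-labelling and the masked self-training loss.
import torch
import torch.nn.functional as F

@torch.no_grad()
def make_pseudo_labels(seed_model, x_t, delta=0.9):
    probs = F.softmax(seed_model(x_t), dim=1)        # (B,C,H,W) from the seed model {phi_0, psi_0}
    conf, y_tilde = probs.max(dim=1)                 # per-pixel confidence and argmax class
    valid = conf >= delta                            # keep only highly confident pixels (Eqn. 3)
    return y_tilde, valid

def self_training_loss(model, x_t, y_tilde, valid):
    logits = model(x_t)                              # current model being adapted
    ce = F.cross_entropy(logits, y_tilde, reduction="none")        # (B,H,W) per-pixel CE
    return (ce * valid.float()).sum() / valid.float().sum().clamp(min=1.0)   # Eqn. 4 over valid pixels
```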

3.1 Supervision For Objectness Constraint

An important issue with the self-training scheme described above is that it is usually prone to confirmation bias, which can lead to compounding errors in target model predictions when trained on noisy pseudo-labels. To improve target-domain performance, we introduce auxiliary modality information (like depth) that can provide indirect supervision for semantic labels in the target domain and improve the robustness of self-training. In this section we describe our multimodal objectness constraint that extracts object-region estimates to formulate a contrastive objective. An overview of our objectness constraint formulation is presented in Fig. 2.

Supervision via Depth: Segmentation datasets are often accompanied by depth maps registered with the RGB images. In practice, depth maps can be obtained from stereo pairs [12, 39] or sequences of images [15]. These depth maps can reveal the presence of distinct objects in a scene. We seek to extract object regions from these depth maps by first computing a histogram of depth values with a predefined number of bins, $b$. We then leverage a property of objects under "things" categories [17]: their depth range is usually much smaller than the range of the entire scene depth. Examples of such categories in outdoor scene segmentation include persons, cars, poles, etc. This property translates into high-density regions (or peaks) in the histogram corresponding to distinct objects at distinct depths. Among these peaks, we use the ones with prominence [27] above a threshold, $\delta_{\text{peak}}$, as centres to cluster the histograms into discrete regions with unique labels. These labels are then assigned to every pixel whose depth value lies in the associated region. An example of the resulting depth-based segmentation for $b=200$ and $\delta_{\text{peak}}=0.0025$ is visualised in Fig. 2.
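The following sketch illustrates this depth-histogram clustering with SciPy's peak detection; the histogram normalisation and the nearest-centre assignment are our assumptions about details not fully specified above, and missing depth values are assumed to be encoded as 0 (as in Cityscapes).

```python
# Sketch of the depth-based segmentation: histogram peaks with prominence above delta_peak
# act as cluster centres; every valid pixel is assigned to its nearest centre.
import numpy as np
from scipy.signal import find_peaks

def depth_segments(depth, b=200, delta_peak=0.0025):
    """depth: (H,W) array; returns (H,W) integer map of depth-cluster labels (0 = invalid)."""
    valid = depth > 0
    hist, edges = np.histogram(depth[valid], bins=b)
    hist = hist / max(hist.sum(), 1)                          # probability mass per bin (assumed normalisation)
    centre_bins, _ = find_peaks(hist, prominence=delta_peak)  # prominent peaks of the histogram
    if len(centre_bins) == 0:
        return np.zeros(depth.shape, dtype=np.int64)
    centres = 0.5 * (edges[centre_bins] + edges[centre_bins + 1])   # bin midpoints as cluster centres
    # assign each valid pixel to the nearest peak (1-based labels; 0 reserved for missing depth)
    dist = np.abs(depth[..., None] - centres[None, None, :])
    labels = dist.argmin(axis=-1) + 1
    labels[~valid] = 0
    return labels
```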

Supervision via RGB: Another important form of self-supervision for object-region estimates is based on RGB-input clustering. We adopt SLIC [1] as a fast algorithm for partitioning images into multiple segments that respect object boundaries; the SLIC method applies k-means clustering in pixel space to group together adjacent pixels that are visually similar. An important design decision is the number of SLIC segments, $k_{s}$: a small $k_{s}$ leads to large cluster sizes that are agnostic to the variation in object scales across different object categories and instances of the scene. Consequently, pixels from distinct object instances may be grouped together regardless of the semantic class, thus violating the notion of an object region. Conversely, a large $k_{s}$ will over-segment each object in the scene, resulting in a trivial objectness constraint. Triviality arises from enforcing similarity of pixel embeddings that share roughly identical pixel neighbourhoods and hence are likely to yield the same class predictions anyway.

Thus, to formulate a non-trivial constraint with sufficiently small $k_{s}$ that also respects object boundaries, we propose to fuse region estimates from both depth and RGB modalities. We first obtain $k_{s}$ segments using SLIC over the RGB image, followed by further partitioning of each segment into smaller ones based on the depth segmentation. The process, visualised in Fig. 2, highlights the importance of our multimodal approach. Purely depth-based segments are agnostic to pixel intensities and may cluster together distinct object categories that lie at similar depths, for instance, the car in the front and the sidewalk. On the other hand, purely RGB segments with sufficiently small $k_{s}$ may assign the same cluster label even to objects at distinct depths, for example, the back of the bus and the small car at the back. In contrast, fusing these two modalities can lead to object regions that are more consistent with individual object instances (for example, the small car at the back as well as the car in the front). We empirically demonstrate the effectiveness of the objectness constraint derived from such multimodal fusion in Section 4.3.
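One possible implementation of this fusion step, built on scikit-image's SLIC and the `depth_segments` helper sketched earlier (both the re-indexing scheme and the exact SLIC arguments are our assumptions), is given below.

```python
# Sketch of the multimodal fusion: SLIC superpixels over RGB are intersected with the depth
# clusters, so each (RGB segment, depth segment) pair becomes its own object region.
import numpy as np
from skimage.segmentation import slic

def fused_object_regions(rgb, depth_labels, k_s=50):
    """rgb: (H,W,3) uint8 image; depth_labels: (H,W) ints from depth_segments."""
    rgb_labels = slic(rgb, n_segments=k_s, start_label=1)       # k-means in colour + position space
    # pair each RGB segment id with each depth segment id to form fused regions
    fused = rgb_labels.astype(np.int64) * (depth_labels.max() + 1) + depth_labels
    # re-index to contiguous region ids 0..K-1
    _, region_idx = np.unique(fused, return_inverse=True)
    return region_idx.reshape(rgb.shape[:2])
```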

Table 1: Test of Generality: We compare the performance of regularised and un-regularised versions of three self-training approaches for two domain settings, namely, GTA → Cityscapes and SYNTHIA → Cityscapes. Both per-class IoU and mean IoU are presented. The numbers in bold indicate higher accuracies in the pairwise comparisons between a base method and the base method + PAC.

| Source | Method | road | sidewalk | building | wall | fence | pole | light | sign | vege. | terrain | sky | person | rider | car | truck | bus | train | motor | bike | mIoU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GTA | CAG [60] | 87.0 | 44.6 | 82.9 | 32.1 | 35.7 | 40.6 | 38.9 | 45.5 | 82.6 | 23.5 | 78.7 | 64.0 | 27.2 | 84.4 | 17.5 | 34.8 | 35.8 | 26.7 | 32.8 | 48.2 |
| GTA | CAG + PAC (ours) | 86.3 | 45.7 | 84.5 | 30.5 | 35.5 | 38.9 | 40.3 | 49.9 | 86.0 | 33.5 | 81.1 | 64.1 | 25.5 | 84.5 | 21.3 | 32.9 | 36.3 | 26.7 | 40.0 | 49.6 |
| GTA | SAC [2] | 89.9 | 54.0 | 86.2 | 37.8 | 28.9 | 45.9 | 46.9 | 47.7 | 88.0 | 44.8 | 85.5 | 66.4 | 30.3 | 88.6 | 50.5 | 54.5 | 1.5 | 17.0 | 39.3 | 52.8 |
| GTA | SAC + PAC (ours) | 93.3 | 63.6 | 87.2 | 42.0 | 25.4 | 44.9 | 49.0 | 50.6 | 88.1 | 45.2 | 87.6 | 64.0 | 28.1 | 83.6 | 37.5 | 43.9 | 13.7 | 20.1 | 46.2 | 53.4 |
| GTA | DACS [47] | 93.4 | 54.3 | 86.3 | 28.6 | 33.7 | 37.0 | 41.1 | 50.6 | 86.1 | 42.6 | 87.6 | 63.5 | 28.9 | 88.1 | 44.2 | 52.7 | 1.7 | 34.7 | 48.1 | 52.8 |
| GTA | DACS + PAC (ours) | 93.2 | 58.8 | 87.2 | 33.3 | 35.1 | 38.6 | 41.8 | 51.4 | 87.4 | 45.8 | 88.3 | 64.8 | 31.6 | 84.3 | 51.7 | 53.4 | 0.6 | 31.3 | 50.6 | 54.2 |
| SYNTHIA | CAG | 87.0 | 41.0 | 79.0 | 9.0 | 1.0 | 34.0 | 15.0 | 11.0 | 81.0 | - | 81.0 | 55.0 | 16.0 | 77.0 | - | 17.0 | - | 2.0 | 47.0 | 40.8 |
| SYNTHIA | CAG + PAC (ours) | 87.0 | 42.0 | 80.0 | 12.0 | 3.0 | 30.0 | 17.0 | 17.0 | 80.0 | - | 88.0 | 57.0 | 5.0 | 75.0 | - | 20.0 | - | 1.0 | 52.0 | 41.7 |
| SYNTHIA | SAC [2] | 91.7 | 52.7 | 85.1 | 22.6 | 1.5 | 42.2 | 44.1 | 30.9 | 82.5 | - | 73.8 | 63.0 | 20.9 | 84.9 | - | 29.5 | - | 26.9 | 52.2 | 50.3 |
| SYNTHIA | SAC + PAC (ours) | 83.2 | 40.5 | 85.4 | 30.0 | 2.0 | 43.0 | 42.2 | 33.8 | 86.3 | - | 89.8 | 65.3 | 33.5 | 85.1 | - | 35.2 | - | 29.9 | 55.3 | 52.5 |
| SYNTHIA | DACS [47] | 84.9 | 23.0 | 83.7 | 16.0 | 1.0 | 36.3 | 35.0 | 42.8 | 81.7 | - | 89.5 | 63.5 | 34.5 | 85.3 | - | 41.5 | - | 31.2 | 50.8 | 50.0 |
| SYNTHIA | DACS + PAC (ours) | 90.6 | 46.7 | 83.3 | 18.7 | 1.3 | 35.1 | 34.5 | 32.0 | 85.1 | - | 88.5 | 66.0 | 35.0 | 83.8 | - | 43.1 | - | 28.8 | 46.7 | 51.2 |

3.2 Objectness Constraints through Contrast

Our objectness constraint is formulated using a contrastive objective that pulls together pixel representations within an object region and pushes apart those that belong to different object categories. Formally, we assign a region index and a region label to every pixel associated with an object region of the input scene. Each region index is a unique natural number in $\{1,\ldots,K\}$, where $K$ is the number of object regions. A region label is assigned as the most frequent pseudo-label class within the object region. In practice, noisy pseudo-labels can lead to region labelling that is inconsistent with the true semantic labels. To minimise such inconsistencies, we introduce a threshold $\tau_{p}$ that selects valid object regions for which the proportion of pixels whose pseudo-label class matches the region label is above this threshold. This selection excludes object regions with no dominant pseudo-label class from contributing to the objectness constraint. Since the cost of computing pairwise constraints is quadratic in the number of pixels, we recast the pairwise constraint into a prototypical loss that reduces the time complexity to linear. Towards this end, we first compute a prototypical representation for each region using the associated pixel embeddings,

$$\nu_{k}=\frac{1}{|U_{k}|}\sum_{p\in U_{k}}z_{p} \tag{5}$$

where $U_{k}$ is the set of pixel locations within the $k^{\text{th}}$ object region. Then a similarity score (based on the cosine metric) is computed between each pixel and each prototypical representation, which forms the basis for our contrastive objectness constraint,

$$L_{\text{obj}}^{t}=\frac{1}{S}\sum_{k}\sum_{p\in U_{k}}L_{\text{obj}}^{t}(p) \tag{6}$$

$$L_{\text{obj}}^{t}(p)=-\log\left(\frac{\exp(\tilde{z}_{p}\cdot\tilde{\nu}_{k})}{\sum_{k^{\prime}\in\Omega(k)}\exp(\tilde{z}_{p}\cdot\tilde{\nu}_{k^{\prime}})}\right) \tag{7}$$

where $S$ is the total number of valid pixels, $\Omega(k)$ is the set of valid object regions whose region labels differ from that of region $k$, and $\tilde{z}_{p}$ and $\tilde{\nu}_{k}$ represent $L_{2}$-normalised embeddings. Note that the objectness constraints are only computed for the target-domain images, since we are interested in improving target-domain performance using self-training. Additionally, the constraint in Eqn. 7 is defined for a single image but can be easily extended to multiple images by simply averaging over them; the final regularised self-training objective is then defined as $L_{\text{pac}}=L_{\text{uda}}+\alpha_{\text{obj}}L_{\text{obj}}^{t}$, where $\alpha_{\text{obj}}$ controls the effect of the constraint on overall training.
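Below is a compact PyTorch sketch of Eqns. 5-7 under stated assumptions: pixel embeddings and region indices are flattened, the temperature argument is an addition of ours (Eqn. 7 corresponds to temperature 1), and the denominator of each pixel's softmax keeps its own region prototype together with prototypes whose region label differs.

```python
# Sketch of the prototypical contrastive objectness loss (not the authors' exact code).
import torch
import torch.nn.functional as F

def objectness_loss(emb, region_idx, region_lbl, temperature=1.0):
    """emb: (N,d) pixel embeddings; region_idx: (N,) in {0..K-1}; region_lbl: (K,) class ids."""
    emb = F.normalize(emb, dim=1)
    K, d = region_lbl.numel(), emb.size(1)
    # Eqn. 5: per-region prototype = mean of member pixel embeddings
    protos = torch.zeros(K, d, device=emb.device).index_add_(0, region_idx, emb)
    counts = torch.bincount(region_idx, minlength=K).clamp(min=1).unsqueeze(1)
    protos = F.normalize(protos / counts, dim=1)
    sims = emb @ protos.t() / temperature                        # (N,K) cosine similarities
    # drop prototypes sharing the anchor's region label, except the anchor's own region
    same_label = region_lbl[region_idx].unsqueeze(1) == region_lbl.unsqueeze(0)   # (N,K)
    own_region = F.one_hot(region_idx, K).bool()
    logits = sims.masked_fill(same_label & ~own_region, float("-inf"))
    # Eqns. 6-7: cross-entropy of each pixel against its own region prototype
    return F.cross_entropy(logits, region_idx)
```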

3.3 Learning and Optimization

To train our model, PAC-UDA, with a base self-training approach, we follow the exact procedure outlined by the corresponding approach. The only difference is that we plug in our constraint as a regulariser on the base objective, $L_{\text{uda}}$. One important consideration is that our regulariser depends on pseudo-labels of reasonable quality to define region labels that are not random. Thus, the regularisation weight, $\alpha_{\text{obj}}$, is set to zero for a few initial training iterations, after which it switches to its actual value.
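As a small illustration of this schedule (the warm-up iteration count below is a placeholder, not a value from the paper):

```python
# Illustrative warm-up schedule: the regulariser weight stays at zero until
# pseudo-labels are reasonably reliable, then switches to its actual value.
def objectness_weight(iteration, alpha_obj=1.0, warmup_iters=5000):
    return 0.0 if iteration < warmup_iters else alpha_obj
```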

4 Experiments

Datasets and Evaluation Metric: We evaluate the PAC-UDA framework in two common scenarios: the GTA [37] → Cityscapes [12] transfer semantic segmentation task and the SYNTHIA [38] → Cityscapes [12] task. GTA5 is composed of 24,966 synthetic images at resolution 1914×1052 and has annotations for 19 classes that are compatible with the categories in Cityscapes. Similarly, SYNTHIA consists of 9,400 synthetic images of urban scenes at resolution 1280×760, with annotations for only 16 common categories. Cityscapes has 5,000 real images and aligned depth maps of urban scenes at resolution 2048×1024 and is split into three sets of 2,975 train, 500 validation and 1,525 test images. Of the 2,975, we use 2,475 randomly selected images for self-training and the remaining 500 images for validation. We report the final test performance of our method on the 500 images of the official validation split. The data splits are consistent with prior works [2, 61]. The performance metrics used are per-class Intersection over Union (IoU) and mean IoU (mIoU) over all the classes.

Implementation Details: For object-region estimates, we experiment with three different numbers of RGB clusters, $k_{s}\in\{25,50,100\}$, two values of the prominence threshold, $\delta_{\text{peak}}\in\{0.001,0.0025\}$, and three numbers of histogram bins, $b\in\{100,200,400\}$. Depth maps obtained from stereo pairs can have missing values at the pixel level, as is the case with Cityscapes. These missing values have a value of zero and are ignored while generating depth segments using the depth histogram. Finally, due to the high computational cost of computing the contrastive objective from pixel-wise embeddings, we set the spatial resolution of these embeddings to 256×470 in CAG and SAC and 300×300 in DACS. We fixed the relative weighting of the regulariser, $\alpha_{\text{obj}}$, to 1.0, as the target performance was found to be insensitive to the exact value. For hyperparameter choices regarding architecture and optimisers, we exactly follow the respective self-training base methods [60, 2, 47]. Experiments were conducted on 4×11GB RTX 2080 Ti GPUs with a PyTorch implementation. Further details in the supplementary.
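A sketch of how the embedding resolution can be reduced before computing the constraint is shown below; the interpolation modes are our assumptions, while the 256×470 target size is the value quoted above for CAG and SAC.

```python
# Sketch: embeddings resized bilinearly, region indices with nearest-neighbour so labels stay discrete.
import torch
import torch.nn.functional as F

def downsample_for_loss(emb, region_idx, size=(256, 470)):
    """emb: (B,d,H,W) pixel embeddings; region_idx: (B,H,W) integer region map."""
    emb_small = F.interpolate(emb, size=size, mode="bilinear", align_corners=False)
    idx_small = F.interpolate(region_idx[:, None].float(), size=size, mode="nearest")
    return emb_small, idx_small[:, 0].long()
```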

Table 2: GTA → Cityscapes results: Classwise and mean (over 19 classes) IoU comparison of our DACS+PAC with prior works. † denotes the use of PSPNet [63], * denotes our implementation of SAC with a restricted configuration (GROUP_SIZE=2) compared to the original SAC method (GROUP_SIZE=4). All other methods use the DeepLabV2 [9] architecture.

| Method | road | sidewalk | building | wall | fence | pole | light | sign | vege. | terrain | sky | person | rider | car | truck | bus | train | motor | bike | mIoU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AdvEnt [50] | 89.4 | 33.1 | 81.0 | 26.6 | 26.8 | 27.2 | 33.5 | 24.7 | 83.9 | 36.7 | 78.8 | 58.7 | 30.5 | 84.8 | 38.5 | 44.5 | 1.7 | 31.6 | 32.4 | 45.5 |
| DISE [7] | 91.5 | 47.5 | 82.5 | 31.3 | 25.6 | 33.0 | 33.7 | 25.8 | 82.7 | 28.8 | 82.7 | 62.4 | 30.8 | 85.2 | 27.7 | 34.5 | 6.4 | 25.2 | 24.4 | 45.4 |
| Cycada [18] | 86.7 | 35.6 | 80.1 | 19.8 | 17.5 | 38.0 | 39.9 | 41.5 | 82.7 | 27.9 | 73.6 | 64.9 | 19.0 | 65.0 | 12.0 | 28.6 | 4.5 | 31.1 | 42.0 | 42.7 |
| BLF [25] | 91.0 | 44.7 | 84.2 | 34.6 | 27.6 | 30.2 | 36.0 | 36.0 | 85.0 | 43.6 | 83.0 | 58.6 | 31.6 | 83.3 | 35.3 | 49.7 | 3.3 | 28.8 | 35.6 | 48.5 |
| CAG-UDA [60] | 90.4 | 51.6 | 83.8 | 34.2 | 27.8 | 38.4 | 25.3 | 48.4 | 85.4 | 38.2 | 78.1 | 58.6 | 34.6 | 84.7 | 21.9 | 42.7 | 41.1 | 29.3 | 37.2 | 50.2 |
| PyCDA [26] | 90.5 | 36.3 | 84.4 | 32.4 | 28.7 | 34.6 | 36.4 | 31.5 | 86.8 | 37.9 | 78.5 | 62.3 | 21.5 | 85.6 | 27.9 | 34.8 | 18.0 | 22.9 | 49.3 | 47.4 |
| CD-AM [58] | 91.3 | 46.0 | 84.5 | 34.4 | 29.7 | 32.6 | 35.8 | 36.4 | 84.5 | 43.2 | 83.0 | 60.0 | 32.2 | 83.2 | 35.0 | 46.7 | 0.0 | 33.7 | 42.2 | 49.2 |
| FADA [52] | 92.5 | 47.5 | 85.1 | 37.6 | 32.8 | 33.4 | 33.8 | 18.4 | 85.3 | 37.7 | 83.5 | 63.2 | 39.7 | 87.5 | 32.9 | 47.8 | 1.6 | 34.9 | 39.5 | 49.2 |
| FDA [59] | 92.5 | 53.3 | 82.4 | 26.5 | 27.6 | 36.4 | 40.6 | 38.9 | 82.3 | 39.8 | 78.0 | 62.6 | 34.4 | 84.9 | 34.1 | 53.1 | 16.9 | 27.7 | 46.4 | 50.5 |
| SA-I2I [34] | 91.2 | 43.3 | 85.2 | 38.6 | 25.9 | 34.7 | 41.3 | 41.0 | 85.5 | 46.0 | 86.5 | 61.7 | 33.8 | 85.5 | 34.4 | 48.7 | 0.0 | 36.1 | 37.8 | 50.4 |
| PIT [31] | 87.5 | 43.4 | 78.8 | 31.2 | 30.2 | 36.3 | 39.9 | 42.0 | 79.2 | 37.1 | 79.3 | 65.4 | 37.5 | 83.2 | 46.0 | 45.6 | 25.7 | 23.5 | 49.9 | 50.6 |
| IAST [32] | 93.8 | 57.8 | 85.1 | 39.5 | 26.7 | 26.2 | 43.1 | 34.7 | 84.9 | 32.9 | 88.0 | 62.6 | 29.0 | 87.3 | 39.2 | 49.6 | 23.2 | 34.7 | 39.6 | 51.5 |
| DACS [47] | 89.9 | 39.7 | 87.9 | 30.7 | 39.5 | 38.5 | 46.4 | 52.8 | 88.0 | 44.0 | 88.7 | 67.0 | 35.8 | 84.4 | 45.7 | 50.2 | 0.0 | 27.2 | 34.0 | 52.1 |
| RPT [61] | 89.2 | 43.3 | 86.1 | 39.5 | 29.9 | 40.2 | 49.6 | 33.1 | 87.4 | 38.5 | 86.0 | 64.4 | 25.1 | 88.5 | 36.6 | 45.8 | 23.9 | 36.5 | 56.8 | 52.6 |
| SAC [2] | 90.4 | 53.9 | 86.6 | 42.4 | 27.3 | 45.1 | 48.5 | 42.7 | 87.4 | 40.1 | 86.1 | 67.5 | 29.7 | 88.5 | 49.1 | 54.6 | 9.8 | 26.6 | 45.3 | 53.8 |
| SAC* [2] | 89.9 | 54.0 | 86.2 | 37.8 | 28.9 | 45.9 | 46.9 | 47.7 | 88.0 | 44.8 | 85.5 | 66.4 | 30.3 | 88.6 | 50.5 | 54.5 | 1.5 | 17.0 | 39.3 | 52.8 |
| DACS + PAC (ours) | 93.2 | 58.8 | 87.2 | 33.3 | 35.1 | 38.6 | 41.8 | 51.4 | 87.4 | 45.8 | 88.3 | 64.8 | 31.6 | 84.3 | 51.7 | 53.4 | 0.6 | 31.3 | 50.6 | 54.2 |
Table 3: SYNTHIA → Cityscapes results: Classwise and mean (over 16 classes) IoU comparison of our PAC-UDA with prior works. † denotes the use of PSPNet [63], * denotes our implementation of SAC with a restricted configuration (GROUP_SIZE=2) compared to the original SAC method (GROUP_SIZE=4). All other methods use the DeepLabV2 [9] architecture.

| Method | road | sidewalk | building | wall | fence | pole | light | sign | vege. | sky | person | rider | car | bus | motor | bike | mIoU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SPIGAN [24] | 71.1 | 29.8 | 71.4 | 3.7 | 0.3 | 33.2 | 6.4 | 15.6 | 81.2 | 78.9 | 52.7 | 13.1 | 75.9 | 25.5 | 10.0 | 20.5 | 36.8 |
| DCAN [56] | 82.8 | 36.4 | 75.7 | 5.1 | 0.1 | 25.8 | 8.0 | 18.7 | 74.7 | 76.9 | 51.1 | 15.9 | 77.7 | 24.8 | 4.1 | 37.3 | 38.4 |
| DISE [7] | 91.7 | 53.5 | 77.1 | 2.5 | 0.2 | 27.1 | 6.2 | 7.6 | 78.4 | 81.2 | 55.8 | 19.2 | 82.3 | 30.3 | 17.1 | 34.3 | 41.5 |
| AdvEnt [50] | 85.6 | 42.2 | 79.7 | 8.7 | 0.4 | 25.9 | 5.4 | 8.1 | 80.4 | 84.1 | 57.9 | 23.8 | 73.3 | 36.4 | 14.2 | 33.0 | 41.2 |
| DADA [51] | 89.2 | 44.8 | 81.4 | 6.8 | 0.3 | 26.2 | 8.6 | 11.1 | 81.8 | 84.0 | 54.7 | 19.3 | 79.7 | 40.7 | 14.0 | 38.8 | 42.6 |
| CAG-UDA [60] | 84.7 | 40.8 | 81.7 | 7.8 | 0.0 | 35.1 | 13.3 | 22.7 | 84.5 | 77.6 | 64.2 | 27.8 | 80.9 | 19.7 | 22.7 | 48.3 | 44.5 |
| PIT [31] | 83.1 | 27.6 | 81.5 | 8.9 | 0.3 | 21.8 | 26.4 | 33.8 | 76.4 | 78.8 | 64.2 | 27.6 | 79.6 | 31.2 | 31.0 | 31.3 | 44.0 |
| PyCDA [26] | 75.5 | 30.9 | 83.3 | 20.8 | 0.7 | 32.7 | 27.3 | 33.5 | 84.7 | 85.0 | 64.1 | 25.4 | 85.0 | 45.2 | 21.2 | 32.0 | 46.7 |
| FADA [52] | 84.5 | 40.1 | 83.1 | 4.8 | 0.0 | 34.3 | 20.1 | 27.2 | 84.8 | 84.0 | 53.5 | 22.6 | 85.4 | 43.7 | 26.8 | 27.8 | 45.2 |
| DACS [47] | 80.6 | 25.1 | 81.9 | 21.5 | 2.9 | 37.2 | 22.7 | 24.0 | 83.7 | 90.8 | 67.6 | 38.3 | 82.9 | 38.9 | 28.5 | 47.6 | 48.3 |
| IAST [32] | 81.9 | 41.5 | 83.3 | 17.7 | 4.6 | 32.3 | 30.9 | 28.8 | 83.4 | 85.0 | 65.5 | 30.8 | 86.5 | 38.2 | 33.1 | 52.7 | 49.8 |
| RPT [61] | 88.9 | 46.5 | 84.5 | 15.1 | 0.5 | 38.5 | 39.5 | 30.1 | 85.9 | 85.8 | 59.8 | 26.1 | 88.1 | 46.8 | 27.7 | 56.1 | 51.2 |
| SAC [2] | 89.3 | 47.2 | 85.5 | 26.5 | 1.3 | 43.0 | 45.5 | 32.0 | 87.1 | 89.3 | 63.6 | 25.4 | 86.9 | 35.6 | 30.4 | 53.0 | 52.6 |
| SAC* [2] | 91.7 | 52.7 | 85.1 | 22.6 | 1.5 | 42.2 | 44.1 | 30.9 | 82.5 | 73.8 | 63.0 | 20.9 | 84.9 | 29.5 | 26.9 | 52.2 | 50.3 |
| SAC + PAC (ours) | 83.2 | 40.5 | 85.4 | 30.0 | 2.0 | 43.0 | 42.2 | 33.8 | 86.3 | 89.8 | 65.3 | 33.5 | 85.1 | 35.2 | 29.9 | 55.3 | 52.5 |

4.1 Generality of Objectness Constraint

In Table 1, we test the generality of our proposed regularizer on three base methods, namely, CAG [60], SAC [2] and DACS [47], which generate pseudo-labels in different ways. We use official implementations of each base method with almost the same configurations for data preprocessing, model architecture, and optimizer, except for a few modifications as follows. In the case of CAG, we replace the Euclidean metric with a cosine metric, as it was found to generate more reliable pseudo-labels. Also, we run it for a single self-training iteration instead of three [60]. For the SAC method, we reduce the GROUP_SIZE from the default value of 4 to 2 due to GPU constraints. Finally, for the DACS approach, we adopt the training and validation splits of Cityscapes used in SAC to maintain benchmark consistency across different base methods. In terms of architecture, DACS and SAC use a standard DeepLabv2 [9] backbone, whereas CAG augments this backbone with a decoder model (see [60] for details). For the sake of fair comparison, we try our best to achieve baseline accuracies that are at least as good as the published results. While we achieved slightly lower performance on SAC due to resource constraints, we achieve superior accuracies for the DACS and CAG baselines. Thus, these methods serve as strong baselines for evaluating our approach.

Figure 3: Qualitative results on Cityscapes[12] post adaptation from GTA[37]: Blue dashed boxes highlight the semantic classes that our regularized version (DACS+PAC) is able to predict more accurately than the base method (DACS). Further visualisations are provided in the supplementary.

From Table 1, we observe that base methods regularised with our constraint always, and sometimes significantly, outperform the unregularised versions in terms of mIoU (by up to 2.2%). Secondly, the improvement spans various categories of both the “stuff” and “things” types. Some of these include the sidewalk (up to 9.6%), sky (up to 2.4%), traffic light (up to 2.1%), traffic sign (up to 4.4%) and bike (up to 7.2%) classes under GTA→Cityscapes, and the wall (up to 7.4%), fence (up to 2.0%), person (up to 2.5%) and bus (up to 5.7%) classes under SYNTHIA→Cityscapes. While different adaptation settings favour different classes, a particularly striking observation is that large gains are obtained in both frequent (sidewalk, wall) and less-frequent (bus, bike) classes. We suspect that such uniformity arises from our object-region-aware constraint being agnostic to the statistical dominance of specific classes. Finally, Fig. 3 visualises these observations by comparing the predictions of the DACS and DACS+PAC models (trained on GTA) on randomly selected examples from the Cityscapes validation split.

Table 4: Ablations: Comparing the effects of individual components of the regulariser (PAC) on final performance (mIoU). Here, the full model is DACS+PAC, and the adaptation setting is GTA→Cityscapes; hyperparameters are: $k_{s}=50$, $b=200$, $\delta_{\text{peak}}=0.0025$; “PL” refers to pseudo-labelling. We include classwise IoUs in the supplementary.

| Configuration | mIoU |
|---|---|
| All | 54.2 |
| Only PL | 49.3 |
| Only Depth + RGB segments | 51.9 |
| Only Depth segments w/ PL | 51.7 |
| Only RGB segments w/ PL | 52.1 |

4.2 Prior Works Comparison

In this section, we compare our best-performing method with prior works under each domain setting.

GTA→Cityscapes (Table 2): In terms of mIoU, our DACS+PAC outperforms the state of the art (SAC) by 0.4%, despite having a simpler training objective (no focal-loss regularizer or importance sampling) and no adaptive batch normalisation. In particular, our approach outperforms SAC significantly in the road, sidewalk, fence, terrain, sky, rider, motorcycle and bike classes, by 1.9%-11.3%. More interestingly, this observation holds when compared to other prior works as well, wherein our model improves IoUs for both dominant categories like road and sidewalk as well as less frequent categories like traffic sign and terrain. For classes like sidewalk, we suspect that structural constraints based on our regularizer reduce contextual bias [41], which is responsible for coarse boundaries.

SYNTHIA→Cityscapes (Table 3): In this setting, our best-performing method outperforms all but one prior method, often by significant margins. While our SAC+PAC under resource constraints compares favourably to the official implementation of SAC (with larger GROUP_SIZE), it significantly outperforms our implementation of SAC, which is a fairer comparison under the same resource constraints. Nevertheless, our approach improves the best previous result on the wall class by 3.5% and achieves state-of-the-art performance on the pole and sign classes.

4.3 Ablations

In this section, we deconstruct our multi-modal regularizer (PAC) to quantify the effect of individual components on final performance. In Table 4, the ‘All’ configuration corresponds to our original formulation. The ‘Only PL’ configuration estimates the object regions using just the pseudo-labels and hence ignores complementary information from depth. ‘Only Depth+RGB segments’ does not use pseudo-labels to define region labels and instead treats each Depth+RGB segment as a unique object category. The configurations in the next two rows use only one of the two modalities for estimating object regions, while still using pseudo-labels to define region labels. We observe that the contrastive regulariser based on only pseudo-labels performs the worst, significantly below the one based on just multimodal segments. This is intuitive because reusing pseudo-labels as a regulariser without auxiliary information reinforces the confirmation bias. While purely RGB-based segments lead to a better objectness constraint than purely depth-based ones (as can be seen in Fig. 2), combining the two (‘All’ config.) yields the best results.

5 Conclusion

In this work, we proposed a multi-modal regularisation scheme for self-training approaches in unsupervised domain adaptation for semantic segmentation. We derive an objectness constraint from multi-modal clustering that is then used to formulate a contrastive objective for regularisation. We show that this regularisation consistently improves upon different types of self-training methods and even achieves state-of-the-art performance in popular benchmarks. In the future, we plan to study the effect of other modalities like 3D point-clouds in semantic segmentation.

References

  • [1] Radhakrishna Achanta, Appu Shaji, Kevin Smith, Aurelien Lucchi, Pascal Fua, and Sabine Süsstrunk. Slic superpixels compared to state-of-the-art superpixel methods. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2012.
  • [2] Nikita Araslanov and Stefan Roth. Self-supervised augmentation consistency for adapting semantic segmentation. In CVPR, 2021.
  • [3] Eric Arazo, Diego Ortego, Paul Albert, Noel E O’Connor, and Kevin McGuinness. Pseudo-labeling and confirmation bias in deep semi-supervised learning. In 2020 International Joint Conference on Neural Networks (IJCNN), pages 1–8. IEEE, 2020.
  • [4] M. Baktashmotlagh, M. T. Harandi, B. C. Lovell, and M. Salzmann. Unsupervised domain adaptation by domain invariant projection. In ICCV, 2013.
  • [5] Konstantinos Bousmalis, Nathan Silberman, David Dohan, Dumitru Erhan, and Dilip Krishnan. Unsupervised pixel-level domain adaptation with generative adversarial networks. CoRR, 2016.
  • [6] Konstantinos Bousmalis, George Trigeorgis, Nathan Silberman, Dilip Krishnan, and Dumitru Erhan. Domain separation networks. NeurIPS, 2016.
  • [7] Wei-Lun Chang, Hui-Po Wang, Wen-Hsiao Peng, and Wei-Chen Chiu. All about structure: Adapting structural information across domains for boosting semantic segmentation. CoRR, 2019.
  • [8] Chaoqi Chen, Weiping Xie, Wenbing Huang, Yu Rong, Xinghao Ding, Yue Huang, Tingyang Xu, and Junzhou Huang. Progressive feature alignment for unsupervised domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 627–636, 2019.
  • [9] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin P. Murphy, and Alan Loddon Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
  • [10] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations, 2020.
  • [11] Yuhua Chen, Wen Li, Xiaoran Chen, and Luc Van Gool. Learning semantic segmentation from synthetic data: A geometrically guided input-output adaptation approach. CVPR, 2019.
  • [12] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In CVPR, 2016.
  • [13] Nicolas Courty, Rémi Flamary, Amaury Habrard, and Alain Rakotomamonjy. Joint distribution optimal transportation for domain adaptation. In NeurIPS, 2017.
  • [14] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. Advances in Computer Vision and Pattern Recognition, 2017.
  • [15] Clément Godard, Oisin Mac Aodha, and Gabriel J. Brostow. Digging into self-supervised monocular depth estimation. ICCV, 2019.
  • [16] Yves Grandvalet and Yoshua Bengio. Semi-supervised learning by entropy minimization. In NeurIPS, 2005.
  • [17] Bharath Hariharan, Pablo Arbeláez, Ross B. Girshick, and Jitendra Malik. Simultaneous detection and segmentation. In ECCV, 2014.
  • [18] Judy Hoffman, Eric Tzeng, Taesung Park, Jun-Yan Zhu, Phillip Isola, Kate Saenko, Alexei Efros, and Trevor Darrell. Cycada: Cycle-consistent adversarial domain adaptation. In International Conference on Machine Learning (ICML), 2018.
  • [19] Maximilian Jaritz, Tuan-Hung Vu, Raoul de Charette, Emilie Wirbel, and Patrick Perez. xmuda: Cross-modal unsupervised domain adaptation for 3d semantic segmentation. CVPR, 2020.
  • [20] Fredrik D. Johansson, David A Sontag, and Rajesh Ranganath. Support and invertibility in domain-invariant representations. In AISTATS, 2019.
  • [21] Guoliang Kang, Lu Jiang, Yi Yang, and Alexander G Hauptmann. Contrastive adaptation network for unsupervised domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4893–4902, 2019.
  • [22] Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and Dilip Krishnan. Supervised contrastive learning. In NeurIPS, 2020.
  • [23] Myeongjin Kim and Hyeran Byun. Learning texture invariant representation for domain adaptation of semantic segmentation. CVPR, 2020.
  • [24] Kuan-Hui Lee, Germán Ros, Jie Li, and Adrien Gaidon. Spigan: Privileged adversarial learning from simulation. ArXiv, 2019.
  • [25] Yunsheng Li, Lu Yuan, and Nuno Vasconcelos. Bidirectional learning for domain adaptation of semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6936–6945, 2019.
  • [26] Qing Lian, Fengmao Lv, Lixin Duan, and Boqing Gong. Constructing self-motivated pyramid curriculums for cross-domain semantic segmentation: A non-adversarial approach. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6758–6767, 2019.
  • [27] Marcos Llobera. Building past landscape perception with gis: Understanding topographic prominence. Journal of Archaeological Science - J ARCHAEOL SCI, 2001.
  • [28] Mingsheng Long, Yue Cao, Jianmin Wang, and Michael Jordan. Learning transferable features with deep adaptation networks. In International conference on machine learning, pages 97–105. PMLR, 2015.
  • [29] Mingsheng Long, Han Zhu, Jianmin Wang, and Michael I. Jordan. Deep transfer learning with joint adaptation networks. In ICML, 2017.
  • [30] Fengmao Lv, Tao Liang, Xiang Chen, and Guosheng Lin. Cross-domain semantic segmentation via domain-invariant interactive relation transfer. In CVPR, 2020.
  • [31] Fengmao Lv, Tao Liang, Xiang Chen, and Guosheng Lin. Cross-domain semantic segmentation via domain-invariant interactive relation transfer. In CVPR, 2020.
  • [32] Ke Mei, Chuang Zhu, Jiaqi Zou, and Shanghang Zhang. Instance adaptive self-training for unsupervised domain adaptation. In ECCV, 2020.
  • [33] Krikamol Muandet, David Balduzzi, and Bernhard Schölkopf. Domain generalization via invariant feature representation. ICML, 2013.
  • [34] Luigi Musto and Andrea Zinelli. Semantically adaptive image-to-image translation for domain adaptation of semantic segmentation. In BMVC, 2020.
  • [35] Fei Pan, Inkyu Shin, François Rameau, Seokju Lee, and In So Kweon. Unsupervised intra-domain adaptation for semantic segmentation through self-supervision. CVPR, 2020.
  • [36] S. J. Pan, I. W. Tsang, J. T. Kwok, and Q. Yang. Domain adaptation via transfer component analysis. IEEE Transactions on Neural Networks, 2011.
  • [37] Stephan R Richter, Vibhav Vineet, Stefan Roth, and Vladlen Koltun. Playing for data: Ground truth from computer games. In Proceedings of the European Conference on Computer Vision (ECCV), 2016.
  • [38] German Ros, Laura Sellart, Joanna Materzynska, David Vazquez, and Antonio M Lopez. The synthia dataset: A large collection of synthetic images for semantic segmentation of urban scenes. In CVPR, 2016.
  • [39] Ashutosh Saxena, Jamie Schulte, Andrew Y Ng, et al. Depth estimation using monocular and stereo cues. In IJCAI, volume 7, pages 2197–2203, 2007.
  • [40] Uri Shalit, Fredrik D. Johansson, and David Sontag. Estimating individual treatment effect: Generalization bounds and algorithms. In ICML, 2017.
  • [41] Rakshith Shetty, Bernt Schiele, and Mario Fritz. Not using the car to see the sidewalk – quantifying and controlling the effects of context in classification and segmentation. In CVPR, June 2019.
  • [42] Rui Shu, Hung H Bui, Hirokazu Narui, and Stefano Ermon. A dirt-t approach to unsupervised domain adaptation. arXiv, 2018.
  • [43] S. Si, D. Tao, and B. Geng. Bregman divergence-based regularization for transfer subspace learning. IEEE Transactions on Knowledge and Data Engineering, 2010.
  • [44] Kihyuk Sohn. Improved deep metric learning with multi-class n-pair loss objective. In NeurIPS, 2016.
  • [45] Sinisa Stekovic, Friedrich Fraundorfer, and Vincent Lepetit. Casting geometric constraints in semantic segmentation as semi-supervised learning. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1854–1863, 2020.
  • [46] Subhashree Subudhi, Ram Narayan Patro, Pradyut Kumar Biswal, and Fabio Dell’Acqua. A survey on superpixel segmentation as a preprocessing step in hyperspectral image analysis. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 14:5015–5035, 2021.
  • [47] Wilhelm Tranheden, Viktor Olsson, Juliano Pinto, and Lennart Svensson. Dacs: Domain adaptation via cross-domain mixed sampling. WACV, 2021.
  • [48] Y.-H. Tsai, W.-C. Hung, S. Schulter, K. Sohn, M.-H. Yang, and M. Chandraker. Learning to adapt structured output space for semantic segmentation. In CVPR, 2018.
  • [49] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell. Adversarial discriminative domain adaptation. In CVPR, 2017.
  • [50] Tuan-Hung Vu, Himalaya Jain, Maxime Bucher, Mathieu Cord, and Patrick Pérez. Advent: Adversarial entropy minimization for domain adaptation in semantic segmentation. arXiv, 2018.
  • [51] Tuan-Hung Vu, Himalaya Jain, Maxime Bucher, Matthieu Cord, and Patrick Pérez. Dada: Depth-aware domain adaptation in semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7364–7373, 2019.
  • [52] Haoran Wang, Tong Shen, Wei Zhang, Ling-Yu Duan, and Tao Mei. Classes matter: A fine-grained adversarial approach to cross-domain semantic segmentation. In European Conference on Computer Vision, pages 642–659. Springer, 2020.
  • [53] Qin Wang, Dengxin Dai, Lukas Hoyer, Olga Fink, and Luc Van Gool. Domain adaptive semantic segmentation with self-supervised depth estimation. In ICCV, 2021.
  • [54] Zhonghao Wang, Mo Yu, Yunchao Wei, Rogério Schmidt Feris, Jinjun Xiong, Wen mei W. Hwu, Thomas S. Huang, and Humphrey Shi. Differential treatment for stuff and things: A simple unsupervised domain adaptation method for semantic segmentation. CVPR, 2020.
  • [55] Yifan Wu, Ezra Winston, Divyansh Kaushik, and Zachary Chase Lipton. Domain adaptation with asymmetrically-relaxed distribution alignment. ICML, 2019.
  • [56] Zuxuan Wu, Xintong Han, Yen-Liang Lin, Mustafa Gokhan Uzunbas, Tom Goldstein, Ser Nam Lim, and Larry S Davis. Dcan: Dual channel-wise alignment networks for unsupervised scene adaptation. In ECCV, 2018.
  • [57] Shaoan Xie, Zibin Zheng, Liang Chen, and Chuan Chen. Learning semantic representations for unsupervised domain adaptation. In International conference on machine learning, pages 5423–5432. PMLR, 2018.
  • [58] Jinyu Yang, Weizhi An, Chaochao Yan, Peilin Zhao, and Junzhou Huang. Context-aware domain adaptation in semantic segmentation. In WACV, 2021.
  • [59] Yanchao Yang and Stefano Soatto. FDA: Fourier domain adaptation for semantic segmentation. In CVPR, 2020.
  • [60] Qiming Zhang, Jing Zhang, Wenyu Liu, and D. Tao. Category anchor-guided unsupervised domain adaptation for semantic segmentation. In NeurIPS, 2019.
  • [61] Yiheng Zhang, Zhaofan Qiu, Ting Yao, Chong-Wah Ngo, Dong Liu, and Tao Mei. Transferring and regularizing prediction for semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9621–9630, 2020.
  • [62] Han Zhao, Remi Tachet des Combes, Kun Zhang, and Geoffrey J. Gordon. On learning invariant representation for domain adaptation. ICML, 2019.
  • [63] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. In CVPR, 2017.
  • [64] Yang Zou, Zhiding Yu, BVK Kumar, and Jinsong Wang. Unsupervised domain adaptation for semantic segmentation via class-balanced self-training. In Proceedings of the European conference on computer vision (ECCV), pages 289–305, 2018.
Algorithm 1: Unsupervised domain adaptation via PAC-UDA

Input: pseudo-labels $\tilde{y}$; target training dataset with depth, $D^{t}_{\text{depth}}=\{(x_{i}^{t},h_{i}^{t},\tilde{y}_{i}^{t})\}_{i=1}^{N_{t}}$; initial model parameters $\theta_{0}=\{\psi_{0},\phi_{0}\}$; number of histogram bins $b$; peak-prominence threshold $\delta_{\text{peak}}$; number of RGB segments $k_{s}$; spatial dimensions of the depth map $H\times W$; region-label threshold $\tau_{p}$; objectness-constraint loss weight $\alpha_{\text{obj}}$; number of training iterations $T_{\text{train}}$
Output: target-domain adapted parameters $\theta_{*}=\{\psi_{*},\phi_{*}\}$

for $t_{\text{tr}} \leftarrow 1$ to $T_{\text{train}}$ do
    Sample a training batch $\{(x_{i}^{t},h_{i}^{t},\tilde{y}_{i}^{t})\}_{i=1}^{N_{t}^{B}} \sim D^{t}_{\text{depth}}$
    Compute $L_{\text{uda}}$   ▷ self-training based adaptation objective (Section 3)
    $L_{\text{obj}} = 0$   ▷ initialise objectness constraint
    for $i \leftarrow 1$ to $N_{t}^{B}$ do
        Initialise $\mathcal{V}^{d}=\{\}$   ▷ empty list of depth segments
        $\mathcal{F}^{d} \leftarrow \texttt{Hist}(\{h_{im}\}_{m=1}^{HW};\ b)$   ▷ histogram of depth values (HOD)
        $\{\mu_{k}\}_{k=1}^{k_{d}} \leftarrow \texttt{FindPeaks}(\mathcal{F}^{d};\ \delta_{\text{peak}})$   ▷ cluster-centre assignment using the HOD
        for $k \leftarrow 1$ to $k_{d}$ do
            $V^{d}_{k}=\{m \mid m\in\{1,\ldots,HW\},\ |h_{m}-\mu_{k}|<|h_{m}-\mu_{k'}|\ \forall k'\neq k\}$   ▷ depth segments
            $\mathcal{V}^{d}.\texttt{append}(V_{k}^{d})$   ▷ depth-segment list update
        end for
        Initialise $\mathcal{V}^{s}=\{\}$   ▷ empty list of RGB segments
        $\{\mathcal{L}_{k}\}_{k=1}^{k_{s}} \leftarrow \texttt{SLIC}(x_{i};\ k_{s})$   ▷ RGB-segment labelling using SLIC [1]
        for $k \leftarrow 1$ to $k_{s}$ do
            $V^{s}_{k}=\{m \mid m\in\{1,\ldots,HW\},\ \texttt{label}(m)=\mathcal{L}_{k}\}$   ▷ RGB segments
            $\mathcal{V}^{s}.\texttt{append}(V_{k}^{s})$   ▷ RGB-segment list update
        end for
        Initialise $\mathcal{V}=\{\}$, $k=0$   ▷ empty list of object regions; region index
        for $i' \leftarrow 1$ to $k_{s}$ do
            for $j' \leftarrow 1$ to $k_{d}$ do
                $k \leftarrow k+1$   ▷ region-index update
                $V_{k}=\{m \mid m\in V_{i'}^{s},\ m\in V_{j'}^{d}\}$   ▷ unique object-region assignment
                $\mathcal{V}.\texttt{append}(V_{k})$   ▷ object-region list update
            end for
        end for
        $\mathcal{F}_{k}=\texttt{Histogram}(\{\tilde{y}^{t}_{im}\}_{m\in V_{k}})\ \forall k\in\{1,\ldots,K'\}$   ▷ region-wise frequency of pseudo-label classes
        Initialise $\mathcal{U}=\{\}$, $\mathcal{L}=\{\}$   ▷ empty lists of valid regions and valid region labels
        for $k \leftarrow 1$ to $K'$ do
            if $\max_{c}\mathcal{F}_{k}[c] / \sum_{c}\mathcal{F}_{k}[c] \geq \tau_{p}$ then   ▷ threshold on majority-voting based region label
                $U_{k}=V_{k}$; $\mathcal{U}.\texttt{append}(U_{k})$   ▷ valid-region assignment and list update
                $L_{k}=\arg\max_{c}\mathcal{F}_{k}[c]$; $\mathcal{L}.\texttt{append}(L_{k})$   ▷ region-label assignment and list update
            end if
        end for
        Using $\mathcal{U}$ and $\mathcal{L}$, compute $L^{t}_{\text{obj},i}$   ▷ objectness constraint, Eqn. 7
        $L_{\text{obj}} \leftarrow L_{\text{obj}} + L^{t}_{\text{obj},i}$
    end for
    $L_{\text{pac}} = L_{\text{uda}} + \frac{\alpha_{\text{obj}}}{N_{t}^{B}} L_{\text{obj}}$   ▷ overall PAC-UDA objective
    $\theta_{t} \leftarrow \theta_{t-1} - \eta\nabla L_{\text{pac}}$   ▷ parameter update
end for

In this supplementary material, we provide additional details and analysis for our proposed method, PAC-UDA. Algorithm 1 provides a step-by-step procedure for unsupervised domain adaptation via PAC-UDA.
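For concreteness, the following Python sketch mirrors the object-region construction of Algorithm 1 (steps 9-44). It is a minimal illustration under our own assumptions: the function name build_object_regions, the use of scipy.signal.find_peaks with a prominence threshold to realise FindPeaks, and the SLIC settings are not taken from the paper's accompanying code.

# A minimal sketch of the object-region construction in Algorithm 1 (steps 9-44).
# Names and the exact peak-detection rule are our assumptions for illustration.
import numpy as np
from scipy.signal import find_peaks
from skimage.segmentation import slic


def build_object_regions(rgb, depth, pseudo_labels, k_s=25, b=200,
                         delta_peak=0.001, tau_p=0.9):
    """rgb: (H, W, 3) image; depth: (H, W); pseudo_labels: (H, W) class ids."""
    # Depth segments: histogram of depth values (HOD) + peak finding (steps 10-15)
    hist, bin_edges = np.histogram(depth.ravel(), bins=b)
    density = hist / hist.sum()
    peak_idx, _ = find_peaks(density, prominence=delta_peak)  # assumed role of delta_peak
    centers = 0.5 * (bin_edges[peak_idx] + bin_edges[peak_idx + 1])  # cluster centres mu_k
    if len(centers) == 0:
        centers = np.array([depth.mean()])  # degenerate fallback (our choice)
    depth_seg = np.abs(depth[..., None] - centers[None, None, :]).argmin(axis=-1)

    # RGB segments via SLIC superpixels (steps 17-22)
    rgb_seg = slic(rgb, n_segments=k_s, start_label=0)

    # Object regions = intersections of RGB and depth segments (steps 24-32)
    region_id = rgb_seg * len(centers) + depth_seg  # unique id per (RGB, depth) pair

    # Majority-vote filtering with threshold tau_p (steps 34-44)
    valid_regions, region_labels = [], []
    for rid in np.unique(region_id):
        mask = region_id == rid
        labels, counts = np.unique(pseudo_labels[mask], return_counts=True)
        if counts.max() / counts.sum() >= tau_p:
            valid_regions.append(mask)                      # U_k
            region_labels.append(labels[counts.argmax()])   # L_k
    return valid_regions, region_labels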

Appendix A Hyperparameters for Main Experiments

Table 5: Hyperparameters used in Table 1

method        $k_s$   $b$   $\delta_{\text{peak}}$   $\tau_p$
CAG + PAC      50     200          0.0025             0.90
SAC + PAC      50     200          0.0025             0.90
DACS + PAC     25     200          0.001              0.90

To report the results in Table 1, Table 2 and Table 3, we choose the best hyperparameters via standard cross-validation on a random subset of the Cityscapes train-split introduced in [2]. For the base methods, we use the default hyperparameters from the respective papers. Table 5 summarises the hyperparameters used for Table 1. Since the results of our approach in Table 2 and Table 3 are a subset of those in Table 1, the same hyperparameters apply there as well.
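For convenience, the same settings can be written as a small configuration dictionary. This sketch is purely illustrative; the dictionary and key names are ours and simply mirror Table 5, not the accompanying code.

# Hypothetical configuration mirroring Table 5 (key names are illustrative only).
PAC_HYPERPARAMS = {
    "CAG+PAC":  {"k_s": 50, "b": 200, "delta_peak": 0.0025, "tau_p": 0.90},
    "SAC+PAC":  {"k_s": 50, "b": 200, "delta_peak": 0.0025, "tau_p": 0.90},
    "DACS+PAC": {"k_s": 25, "b": 200, "delta_peak": 0.001,  "tau_p": 0.90},
}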

Appendix B Additional Ablations

In this section, we provide additional ablation studies for DACS+PAC on GTA$\rightarrow$Cityscapes. Unless otherwise stated, the default hyperparameter configuration is $k_s=25$, $b=200$, $\delta_{\text{peak}}=0.001$, $\tau_p=0.9$, and we train each setting for $T_{\text{train}}=125{,}000$ iterations.

B.1 Importance of Multiple Modalities and Pseudo-Labels

Table 6: Effect of Individual Modalities and Pseudo Labels: Comparing the effects of individual modalities used to estimate object-regions and pseudo-labels on final performance (mIoU). This table is an extended version of Table 4 with classwise IoUs. Mapping of configuration names to those in Table 4 - PL: Only PL; Depth-RGB: Only Depth+RGB segments; Depth-PL: Only Depth segments w/ PL; RGB-PL: Only RGB segments w/ PL. Refer to Section 4.3 for configuration specific details.
Configuration   road  sidewalk  building  wall  fence  pole  light  sign  vege.  terrain  sky   person  rider  car   truck  bus   train  motor  bike  mIoU
All             93.2  58.8      87.2      33.3  35.1   38.6  41.8   51.4  87.4   45.8     88.3  64.8    31.6   84.3  51.7   53.4  0.6    31.3   50.6  54.2
PL              93.7  58.7      86.8      27.3  29.7   35.4  41.6   50.6  87.1   46.7     89.2  65.2    37.1   87.4  41.3   49.8  0.0    7.0    1.6   49.3
Depth-RGB       94.1  58.1      86.2      38.2  30.3   34.8  37.8   41.3  86.7   46.1     87.5  62.4    31.0   86.8  52.5   49.1  0.0    24.5   40.1  51.9
Depth-PL        93.3  61.9      86.7      31.8  35.9   36.1  43.3   50.2  86.2   41.2     86.4  65.0    32.2   82.1  31.9   50.4  0.9    23.1   43.6  51.7
RGB-PL          95.1  65.3      86.1      25.9  30.1   35.4  39.1   41.2  85.2   37.9     86.2  61.4    26.7   87.9  50.9   50.6  0.0    35.8   50.4  52.1
Table 7: Importance of RGB-segments: Comparing the effect of using only RGB-segments with different values of $k_s$. Here, PL: pseudo-labels; RGB-PL: objectness constraint with only RGB segments and PL; All: objectness constraint with RGB segments + depth segments and PL.
Configuration     road  sidewalk  building  wall  fence  pole  light  sign  vege.  terrain  sky   person  rider  car   truck  bus   train  motor  bike  mIoU
All (k_s=25)      93.2  58.8      87.2      33.3  35.1   38.6  41.8   51.4  87.4   45.8     88.3  64.8    31.6   84.3  51.7   53.4  0.6    31.3   50.6  54.2
RGB-PL (k_s=25)   95.1  65.3      86.1      25.9  30.1   35.4  39.1   41.2  85.2   37.9     86.2  61.4    26.7   87.9  50.9   50.6  0.0    35.8   50.4  52.1
RGB-PL (k_s=50)   94.6  63.4      86.8      28.7  30.7   37.6  42.8   51.3  86.8   44.9     87.9  64.9    32.5   87.8  42.7   45.4  0.0    32.6   51.2  53.3
RGB-PL (k_s=100)  94.4  62.1      86.2      29.2  32.5   34.2  40.0   50.2  86.2   47.1     87.4  63.0    32.7   87.9  39.4   45.3  0.1    32.6   52.8  52.8
Table 8: Effect of the Contrastive Objective: Comparing the two formulations of the contrastive objective defined in Eqn. 7 and Section B.3, and an upper-bound configuration, GTlab (target-domain ground-truth labels).
Configuration   road  sidewalk  building  wall  fence  pole  light  sign  vege.  terrain  sky   person  rider  car   truck  bus   train  motor  bike  mIoU
L_obj^t         93.2  58.8      87.2      33.3  35.1   38.6  41.8   51.4  87.4   45.8     88.3  64.8    31.6   84.3  51.7   53.4  0.6    31.3   50.6  54.2
L_obj^{t+}      94.2  59.4      86.7      35.8  32.1   36.8  40.5   49.4  86.5   41.9     86.0  63.5    27.1   89.1  53.7   54.5  2.5    27.3   45.7  53.3
Table 9: Effect of the region-label threshold $\tau_p$ on final performance.

Threshold   road  sidewalk  building  wall  fence  pole  light  sign  vege.  terrain  sky   person  rider  car   truck  bus   train  motor  bike  mIoU
0.70        93.9  60.4      86.5      32.5  30.4   34.9  39.9   48.8  86.4   45.6     88.0  63.0    27.6   87.0  39.9   49.2  1.9    32.5   47.9  52.4
0.80        92.9  51.3      86.6      31.5  32.4   36.7  42.8   52.1  86.8   44.7     87.5  65.4    34.5   89.2  48.8   56.3  0.2    23.8   45.1  53.1
0.90        93.2  58.8      87.2      33.3  35.1   38.6  41.8   51.4  87.4   45.8     88.3  64.8    31.6   84.3  51.7   53.4  0.6    31.3   50.6  54.2
0.95        93.4  55.9      86.1      28.7  30.0   33.2  40.5   45.3  86.6   45.7     87.8  64.2    31.6   89.1  50.4   50.7  0.0    10.5   28.3  50.4
Figure 4: Pixel-wise class distribution in the GTA dataset.

In Table 6, we provide the complete results (including classwise IoUs) for the ablations on individual modalities and pseudo-labels, as described in Section 4.3. While Table 6 highlights the importance of combining all modalities with pseudo-labels for the best mean IoU, there are a few other important observations with respect to the classwise IoUs.

For instance, using “PL” for the objectness constraint significantly underperforms the other settings (by up to 49 IoU points) on rare source classes such as motorcycle and bike (Figure 4). The gap remains surprisingly large (up to 38.5 IoU points) even when compared to “Depth-RGB”. We attribute this gap to the class-imbalance problem [64], which is known to adversely affect self-training in the absence of class-balanced losses. Incorporating our objectness constraint, however, improves the rare-class IoUs significantly without losing performance on frequent classes (except sky). These results provide strong evidence that the multimodal objectness constraint helps normalise the effects of class imbalance.

Another interesting insight arises from comparing the “Depth-PL” and “RGB-PL” settings, which demonstrates the complementarity of the two modalities. Among the more frequent source classes (Figure 4), the purely RGB-based constraint considerably outperforms the purely depth-based constraint on categories such as road, sidewalk and car, whereas the converse holds for categories like wall, pole, terrain and person. The advantage of the depth-based constraint on pole and person is intuitive, since these objects span a very small depth range relative to the overall scene depth and can therefore be detected easily from the depth histogram (see Section 3.1 for more discussion).

B.2 Importance of RGB-segments

Image clustering has often been used as an effective preprocessing step for segmentation [61, 46]. Inspired by these works, in Table 7 we test the extent to which purely SLIC-based RGB segments can influence the objectness constraint and, consequently, the final performance of PAC-UDA. Specifically, we tabulate the performance for a varying number of SLIC segments $k_s$ and compare it to our default configuration, “All ($k_s=25$)”.

We observe that when only RGB segments (without depth) are used to estimate object regions, there exists an intermediate value within the range $k_s \in \{25, 50, 100\}$ at which the semantic segmentation performance peaks. This trend empirically validates our intuition for choosing $k_s$, as discussed in Section 3.1. In fact, too small a value can be highly undesirable, as it can lead to worse results (52.1 mIoU) than even the base method (52.8 mIoU). It is, however, interesting to note that even at the optimal $k_s=50$, the purely RGB-based objectness constraint underperforms our multimodal constraint (“All”) by ${\sim}1$ mIoU.

B.3 Contrastive Loss Analysis

We analyze the effect of the specific form of the contrastive loss function in Table 8. Recall that in Eqn. 7, our formulation of the contrastive loss maximizes the similarity of each pixel embedding $\tilde{z}_p$ only to the prototype of the region $U_k$ that contains pixel $p$. Here, we introduce another variant of this loss, $L_{\text{obj}}^{t+}(p)$, that maximizes the similarity of $\tilde{z}_p$ to the prototypes of all valid regions $\{U_k\}_{k=1}^{K} \setminus \Omega(k)$ that share the same region-label. While region-labels originally influence the loss function only via the dissimilarity scores, in $L_{\text{obj}}^{t+}(p)$ they influence it via both the similarity and dissimilarity scores.
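To make the distinction concrete, the following PyTorch sketch contrasts the two variants. It assumes an InfoNCE-style pixel-to-prototype formulation with mean-pooled, L2-normalised region prototypes and a temperature; the exact form of Eqn. 7 (temperature, normalisation, treatment of $\Omega(k)$) may differ from this illustration, and the function name objectness_losses is ours.

# A hedged sketch of the two contrastive variants discussed above (assumed
# InfoNCE-style pixel-to-prototype form; not the paper's exact implementation).
import torch
import torch.nn.functional as F


def objectness_losses(z, regions, region_labels, temperature=0.1):
    """z: (N, D) pixel embeddings; regions: list of LongTensors of pixel indices
    (valid regions U_k); region_labels: list of ints (region labels L_k)."""
    # Region prototypes: mean embedding of each valid region, L2-normalised
    protos = F.normalize(torch.stack([z[idx].mean(dim=0) for idx in regions]), dim=-1)
    labels = torch.tensor(region_labels)

    loss_t, loss_t_plus, count = 0.0, 0.0, 0
    for k, idx in enumerate(regions):
        zp = F.normalize(z[idx], dim=-1)              # pixels of region U_k
        sims = zp @ protos.T / temperature            # (|U_k|, K) similarities
        log_p = F.log_softmax(sims, dim=-1)

        # Variant L_obj^t: each pixel is pulled only towards its own region's prototype
        loss_t = loss_t - log_p[:, k].mean()

        # Variant L_obj^{t+}: each pixel is pulled towards the prototypes of all
        # valid regions sharing the same region label (other regions act as negatives)
        same = (labels == labels[k]).float()
        pos = (log_p.exp() * same).sum(dim=-1).clamp_min(1e-8).log()
        loss_t_plus = loss_t_plus - pos.mean()
        count += 1

    return loss_t / max(count, 1), loss_t_plus / max(count, 1)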

We observe that allowing greater region-label influence in $L_{\text{obj}}^{t+}(p)$ leads to a worse mIoU than $L_{\text{obj}}^{t}(p)$. Zooming into the classwise IoUs reveals that the less-common classes primarily account for the overall drop in performance with $L_{\text{obj}}^{t+}(p)$. We suspect that increasing the influence of, and consequently the noise in, the region-labels affects these less-common classes more adversely than common classes like road, sidewalk, wall and car. This ablation guides our decision to adopt $L_{\text{obj}}^{t}(p)$ as the default form of the contrastive loss in Eqn. 7.

Figure 5: Additional qualitative results on Cityscapes [12] after adaptation from GTA [37]: Blue dashed boxes highlight the semantic classes that our regularized version (DACS+PAC) is able to predict more reliably than the base method (DACS).

B.4 Importance of Region-Label Threshold

An important hyperparameter of our objectness constraint is the region-label threshold $\tau_p$. At higher values of this threshold, valid object-regions are more likely to belong to a single object and to be consistent with the ground-truth semantic category, which positively influences target-domain performance. At the same time, the number of such valid object-regions is likely to be small, which may reduce the overall effect of the objectness constraint on target-domain performance. As the threshold decreases, the number of valid regions increases at the expense of region-label consistency with the ground truth. Evaluating the performance over a range of values is therefore important.
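As a toy numerical illustration of this trade-off (our own example, not from the paper), the snippet below shows how candidate regions pass or fail the validity check as $\tau_p$ varies:

# Toy example: each histogram counts pseudo-label classes within one candidate region.
region_histograms = {
    "region_A": {"car": 80, "road": 15, "sidewalk": 5},   # purity 0.80
    "region_B": {"road": 98, "sidewalk": 2},              # purity 0.98
}

for tau_p in (0.70, 0.80, 0.90, 0.95):
    valid = [name for name, hist in region_histograms.items()
             if max(hist.values()) / sum(hist.values()) >= tau_p]
    print(f"tau_p={tau_p:.2f}: valid regions = {valid}")
# Higher tau_p keeps only purer regions (here, only region_B), so fewer pixels receive
# the objectness constraint; lower tau_p keeps more regions but with noisier labels.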

Indeed, we observe in Table 9 that the mIoU increases with the threshold up to a certain point ($\tau_p=0.90$), beyond which the performance deteriorates. We thus set $\tau_p=0.90$ as the default threshold for all our experiments.

Appendix C Additional Visualisations

In Figure 5, we provide additional qualitative comparisons between DACS+PAC, DACS, and the ground truth in the GTA$\rightarrow$Cityscapes setting.