Amplitude Spectrum Transformation for
Open Compound Domain Adaptive Semantic Segmentation

Jogendra Nath Kundu¹, Akshay Kulkarni¹¹¹footnotemark: 1, Suvaansh Bhambri¹¹¹footnotemark: 1,
Varun Jampani², R. Venkatesh Babu¹ equal contribution

Abstract

Open compound domain adaptation (OCDA) has emerged as a practical adaptation setting which considers a single labeled source domain against a compound of multi-modal unlabeled target data in order to generalize better on novel unseen domains. We hypothesize that an improved disentanglement of domain-related and task-related factors of dense intermediate layer features can greatly aid OCDA. Prior-arts attempt this indirectly by employing adversarial domain discriminators on the spatial CNN output. However, we find that latent features derived from the Fourier-based amplitude spectrum of deep CNN features hold a more tractable mapping with domain discrimination. Motivated by this, we propose a novel feature space Amplitude Spectrum Transformation (AST)¹¹1Project page: https://sites.google.com/view/ast-ocdaseg. During adaptation, we employ the AST auto-encoder for two purposes. First, carefully mined source-target instance pairs undergo a simulation of cross-domain feature stylization (AST-Sim) at a particular layer by altering the AST-latent. Second, AST operating at a later layer is tasked to normalize (AST-Norm) the domain content by fixing its latent to a mean prototype. Our simplified adaptation technique is not only clustering-free but also free from complex adversarial alignment. We achieve leading performance against the prior arts on the OCDA scene segmentation benchmarks.

1 Introduction

Deep learning has shown unprecedented success in the challenging semantic segmentation task (Long et al. 2015). In a fully supervised setting (Chen et al. 2018; Kundu, Rajput, and Babu 2020), most approaches operate under the assumption that the training and testing data are drawn from the same input distribution. Though these approaches work well on several benchmarks like Cityscapes (Cordts et al. 2016), their poor generalization to unseen datasets is often argued as the major shortcoming (Hoffman et al. 2016). Upon deployment in real-world settings, they fail to replicate the benchmark performance. This is attributed to the discrepancy in input distributions or domain-shift (Torralba et al. 2011). A naive solution would be to annotate the target domain samples. However, huge cost of annotation and variety of distribution shifts that could be encountered in future render this infeasible. Addressing this, unsupervised domain adaptation (DA) has emerged as a suitable problem setup, that aims to transfer the knowledge from a labeled source domain to an unlabeled target domain.

In recent years, several unsupervised DA techniques for semantic segmentation have emerged. Such as, techniques inspired from adversarial alignment (Tsai et al. 2018; Kundu, Lakkakula, and Babu 2019; Kundu et al. 2018), style transfer (Hoffman et al. 2018), pseudo-label self-training (Zou et al. 2019), etc. However, these methods assume the target domain to be a single distribution. This assumption is difficult to satisfy in practice, as test images may come from mixed or continually changing or even unseen conditions. For example, data for self-driving applications may be collected from different weather conditions (Sakaridis et al. 2018) or different time of day (Sakaridis et al. 2020) or different cities (Chen et al. 2017b). Towards a realistic DA setting, Liu et al. (2020) introduced Open Compound DA (OCDA) by incorporating mixed domains (compound) in the target but without domain labels. Further, open domains are available only for evaluation, representing unseen domains.

The general trend in OCDA (see Table 1) has been to break down the complex problem into multiple easier single-target DA problems and employ variants of existing unimodal DA approaches. To enable such a breakdown, Chen et al. (2019) rely on unsupervised clustering to obtain sub-target clusters. Post clustering, both DHA (Park et al. 2020) and MOCDA (Gong et al. 2021) embrace domain-specific learning. DHA employs separate discriminators and MOCDA uses separate batch-norm parameters (for each sub-target cluster), followed by complex adversarial alignment training for both. Though such domain-specific learning seems beneficial for compound domains, it hurts the generalization to open domains. To combat this generalization issue, MOCDA proposes online model update on encountering the open domains. Note that, such extra updates impedes the deployment-friendliness. In this work, our prime objective is to devise a simple and effective OCDA technique. To this end, we aim to eliminate the requirement of a) adversarial alignment, b) sub-target clustering, c) incorporating domain-specific components, and d) online model update (refer the last row of Table 1).

Table 1: Characteristic comparison against prior OCDA works. Note that cluster refers to the sub-target clusters, I2I n/w indicates image-to-image translation network and Disc. indicates use of adversarial discriminator.

Complex

adv. training

(via Disc.)

Model update

for open

domains

Explicit

domain-specific

learning

Clustering

Extra

networks

OCDA

(Liu et al. 2020)

\checkmark

\times

\times

\times

Disc.

DHA

(Park et al. 2020)

\checkmark

\times

\checkmark

(Separate Disc.

for each cluster)

\checkmark

I2I n/w,

Disc.

MOCDA

(Gong et al. 2021)

\checkmark

\checkmark

(online update)

\checkmark

(Separate BN

for each cluster)

\checkmark

Hyper n/w,

Disc.

Ours

\times

\times

\times

\times

AST

To this end, we uncover key insights while exploring along the lines of disentangling domain-related and task-related cues at different layers of the segmentation architecture. We perform control experiments to quantify the unwanted correlation of the deep features with the unexposed domain labels by introducing a domain-discriminability metric (DDM). DDM accuracy indicates the ease of classifying the domain label of deep features at different layers. We observe that deeper layers hold more domain-specificity and identify this as a major contributing factor to poor generalization. To alleviate this, prior-arts (Park et al. 2020) employ domain discriminators on the spatial deep features. We ask, can we get hold of a representation space that favors domain discriminability better than raw spatial deep features? Being able to do so would provide us better control to manipulate the representation space for effective adaptation.

To this end, we draw motivation from the recent surge in the use of frequency spectrum analysis to aid domain adaptation (Yang and Soatto 2020; Yang et al. 2020b; Huang et al. 2021). These approaches employ different forms of Fourier transform (FFT) to separately process the phase and amplitude components in order to carry out content-preserving image augmentations. Motivated by this, we propose to use a latent neural network mapping derived from the amplitude spectrum of the raw deep features as the desired representation space. Towards this, we develop a novel feature-space Amplitude Spectrum Transformation (AST) to auto-encode the amplitude while preserving the phase. We empirically confirm that the DDM of AST-latent is consistently higher than the same for the corresponding raw deep features. To the best of our knowledge, we are the first to use frequency spectrum analysis on spatial deep features.

The AST-latent lays a suitable ground to better discern the domain-related factors which in turn allows us to manipulate the same to aid OCDA. We propose to manipulate AST-latent in two ways. First, carefully mined source-target instance pairs undergo a simulation of cross-domain feature stylization (AST-Sim) at a particular layer by altering the AST-latent. Second, AST operating at a later layer is tasked to normalize (AST-Norm) the domain content by fixing its latent to a mean prototype. As the phase is preserved in AST, both the simulated and domain normalized features retain the task-related content while altering the domain-related style. However, the effectiveness of this intervention is proportional to the quality of disentanglement. Considering DDM as a proxy for disentanglement quality, we propose to apply AST-Sim at a deeper layer with high DDM followed by applying AST-Norm at a later layer close to the final task output. A post-adaptation DDM analysis confirms that the proposed Simulate-then-Normalize strategy is effective enough to suppress the domain-discriminability of the deeper features, thus attaining improved domain generalization compared to typical adversarial alignment. We summarize our contributions below:

•

We propose a novel feature-space Amplitude Spectrum Transformation (AST), based on a thorough analysis of domain discriminability, for improved disentanglement and manipulability of domain characteristics.
•

We provide insights into the usage of AST in two ways - AST-Sim and AST-Norm, and propose a novel Simulate-then-Normalize strategy for effective OCDA.
•

Our proposed approach achieves state-of-the-art semantic segmentation performance on GTA5 $\to$ C-Driving and SYNTHIA $\to$ C-Driving benchmarks for both compound and open domains, as well as on generalization to extended open domains Cityscapes, KITTI and WildDash.

2 Related Works

Open Compound Domain Adaptation. Liu et al. (2020) proposes to improve the generalization ability of the model on compound and open domains by using a memory based curriculum learning approach. DHA (Park et al. 2020) and MOCDA (Gong et al. 2021) cluster the target using $k$ -means clustering on convolutional feature statistics and encoder features of image-to-image translation network respectively. DHA performs adversarial alignment between the source and each sub-target cluster while MOCDA uses separate batch norm parameters for each sub-target cluster with a single adversarial discriminator. In contrast, we simulate the target style by manipulating the amplitude spectrum of our source features in the latent space.

Stylization for DA. Several recent works that use stylization for DA can be broadly divided into - feature-statistics-based and FFT-based. First, the properties of Fourier transform (FFT), i.e. disentangling the input into phase spectrum (representing content) and amplitude spectrum (representing style), made it a natural choice for recent DA works (Yang and Soatto 2020). Several prior arts (Yang et al. 2020a; Kundu et al. 2021) employ the FFT directly on the input RGB image in order to simulate some form of image-to-image domain translation to aid the adaptation. In contrast to these works, we utilize the FFT on the CNN feature space. Second, the feature-statistics-based methods (Kim and Byun 2020) use the first or second-order convolutional feature statistics to perform feature-space stylization (Huang and Belongie 2017) followed by reconstructing the stylized image via a decoder. The stylized images are used with other adaptation techniques like adversarial alignment. Contrary to this, we perform content-preserved feature-space stylization using the Amplitude Spectrum Transformation (AST).

3 Approach

In this section, we thoroughly analyze domain discriminability (Sec 3.1). Based on our observations, we propose the Amplitude Spectrum Transformation (Sec 3.2) and provide insights for effectively tackling OCDA (Sec 3.3).

Refer to caption — Figure 1: A. DDM computed on the AST-latent is consistently higher than the same on raw deep features (Sec 3.1). The post-adaptation DDM confirms the effectiveness of the proposed Simulate-then-Normalize strategy (Sec 3.4). B. Integration of AST-Sim and AST-Norm into the CNN segmentor (Sec 3.4.1). C. & D. Internal working of AST-Sim and AST-Norm (Sec 3.3).

3.1 Disentangling domain characteristics

In the literature of causal analysis (Achille and Soatto 2018), it has been shown that generally trained networks (ERM-networks) are prone to learn domain-specific spurious correlations. Here, ERM-network refers to a model trained via empirical risk minimization (ERM) on multi-domain mixture data (without the domain label). In order to test the above proposition, we introduce a metric that would capture the unwanted correlation of layer-wise deep features with the unexposed domain labels. A higher correlation indicates that the model inherently distinguishes among samples from different domains and extracts domain-specific features followed by domain-specific hypothesis (mapping from input to output space) learning. To this end, we introduce the following metric as a proportional measure of this correlation.

3.1.1 Quantifying domain discriminability (DDM). Domain-Discriminability Metric (DDM) is an accuracy metric that measures the discriminativeness of the features for domain classification. Given a CNN segmentor (see Fig. 1A), $\{h_{k}\}_{k=1}^{K}$ denote the 3D tensor of neural activities at the output of convolutional layers with $k$ as the layer depth. Here, $h_{k}\!\in\!\mathcal{H}_{k}$ with $\mathcal{H}_{k}$ denoting the raw spatial CNN feature space. Following this, DDM is denoted as $\lambda_{k}$ .

Empirical analysis. We obtain a multi-domain source variant with 4 sub-domains by performing specific domain-varying image augmentations such as weather augmentation, cartoonization, etc. (refer Suppl.). Following this, a straightforward procedure to compute DDM for each layer $k$ would be to report the accuracy of fully-trained domain discriminators operating on the intermediate deep features of the frozen ERM-network. In Fig. 1A, the dashed blue curve shows the same in a plot. A peculiar observation is that DDM increases while traversing from input towards output.

Observation 1. An ERM-network trained on multi-domain data for dense semantic segmentation tends to learn increasingly more domain-specific features, in the deeper layers.

Remarks. This is in line with the observation of increasing memorization in the deeper layers against the early layers as studied by Stephenson et al. (2021). This is because the increase in feature dimensions for deeper layers allows more room to learn unregularized domain-specific hypotheses. This effect is predominant for dense prediction tasks (Kundu et al. 2020), as domain-related information is also retained as higher-order clues along the spatial dimensions, implying more room to accommodate domain-specificity. A pool of OCDA or UDA methods (Park et al. 2020; Tsai et al. 2018), that employ adversarial domain alignment, aim to minimize the DDM of deeper layer features as a major part of the adaptation process (dashed magenta curve in Fig. 1A).

3.1.2 Seeking latent representation for improved DDM.

We ask ourselves, can we get hold of a representation space that favors domain discriminability better than the raw spatial deep features? Let, $\mathcal{Z}_{k}$ be a latent representation space where the multi-domain samples are easily separable based on their domain label. We expect the DDM value to increase if the corresponding discriminator operates on this latent representation instead of the raw spatial deep features. Essentially, we seek to learn a mapping $\mathcal{H}_{k}\!\to\!\mathcal{Z}_{k}$ which would encourage a better disentanglement of domain-related factors from the spatial deep features which are known to hold entangled domain-related and task-related information.

To this end, we draw motivation from the recent surge in the use of frequency spectrum analysis to aid domain adaptation (Yang and Soatto 2020; Yang et al. 2020a; Huang et al. 2021). These approaches employ different forms of Fourier transform (FFT) to separately process the phase and amplitude components in order to carry out content-preserving image augmentations. Though both amplitude and phase are necessary to reconstruct the spatial map, it is known that magnitude-only reconstructions corrupt the image making it unrecognizable while that is not the case for phase-only reconstructions. In other words, changes in the amplitude spectrum alter the global style or appearance, i.e., the domain-related factors while the phase holds onto the crucial task-related information. Based on these observations, we hypothesize that the domain-related latent, $\mathcal{Z}_{k}$ can be derived from the amplitude spectrum.

Empirical analysis. Based on the aforementioned discussion, we develop a novel Amplitude Spectrum Transformation (AST). Similar to an auto-encoding setup, AST involves the encoding ( $\mathcal{H}\!\to\!\mathcal{Z}$ ) and decoding ( $\mathcal{Z}\!\to\!\mathcal{H}$ ) transformations operating on the amplitude spectrum of the feature maps. Refer Sec 3.2 for further details. Following this, DDM is obtained as the accuracy of a domain discriminator network operating on latent $\mathcal{Z}$ space (repeated for each layer $k$ ). In Fig. 1A, the solid blue curve shows the same in a plot.

Observation 2. Domain discriminability (and thus DDM) is easily identifiable and manipulatable in the latent $\mathcal{Z}_{k}$ space, i.e. a latent neural network mapping derived from the amplitude spectrum of the raw spatial deep features ( $\mathcal{H}_{k}$ space).

Remarks. Several prior works (Gatys et al. 2016; Huang et al. 2017) advocate that spatial feature statistics capture the style or domain-related factors. Following this, style transfer networks propose to normalize the channel-wise first or second-order statistics, e.g., instance normalization. One can relate the latent AST representation as a similar measure to represent complex domain discriminating clues that are difficult to extract via multi-layer convolutional discriminators.

3.2 Amplitude Spectrum Transformation (AST)

Several prior arts (Yang et al. 2020; Huang et al. 2021) employ frequency spectrum analysis directly on the input RGB image to simulate some form of image-to-image domain translation. To the best of our knowledge, we are the first to use frequency spectrum analysis on spatial deep features.

3.2.1 AST as an auto-encoding setup.

Broadly, AST involves an encoder mapping and a decoder mapping. As shown in Fig. 2, the encoder mapping involves a Fourier transform $\mathcal{F}$ of the input feature $h$ to obtain the amplitude spectrum $\mathcal{F}_{A}(h)$ and the phase spectrum $\mathcal{F}_{P}(h)$ . The amplitude spectrum is passed through a transformation $\mathcal{T}$ and a fully-connected encoder network $Q_{e}$ to obtain the AST-latent $z=Q_{e}\circ\mathcal{T}\circ\mathcal{F}_{A}(h)$ . Consequently, the decoder mapping involves the inverse transformations to obtain $\hat{h}=\mathcal{F}^{-1}(\mathcal{T}^{-1}\circ Q_{d}(z),\mathcal{F}_{P}(h))$ . It is important to note that the encoding side phase spectrum is directly forwarded to the decoding side to be recombined with the reconstructed amplitude spectrum via inverse Fourier transform $\mathcal{F}^{-1}$ .

Here, the vectorization ( $\mathcal{T}$ ) and inverse-vectorization ( $\mathcal{T}^{-1}$ ) of the amplitude spectrum must adhere to its symmetric about origin (mid-point) property. To this end, $\mathcal{T}$ only vectorizes the first and fourth quadrants and $\mathcal{T}^{-1}$ reconstructs the full spectrum from the reshaped first and fourth quadrants by mirroring about the origin (see Fig. 2).

Training. Let $\theta_{Q}^{(k)}=\{\theta_{Q_{e}},\theta_{Q_{d}}\}$ denote the trainable parameters of AST at the $k^{\textit{th}}$ layer. Note that, AST operates independently for each channel of $h_{k}$ with the same parameters. Effectively, $z_{k}$ is obtained as a concatenation of $z$ ’s from each channel of $h_{k}$ . After obtaining the frozen ERM-network, AST, at a given layer, is trained to auto-encode the amplitude spectrum using the following objective,

\min_{\theta_{Q}}\mathcal{L}_{\texttt{AST}}(h_{k},\hat{h}_{k})\;\text{where}\;\mathcal{L}_{\texttt{AST}}(h_{k},\hat{h}_{k})=\|h_{k}-\hat{h}_{k}\|_{2}^{2}

(1)

For simplicity, we denote the overall AST as a function AST and simplify the reconstruction as $\hat{h}_{k}=\texttt{AST}(h_{k},z_{k})$ . Note that we omit the $z_{k}$ term in the auto-encoding phase i.e. $\hat{h}_{k}=\texttt{AST}^{(\textit{ae})}(h_{k})$ as $z_{k}$ is passed through unmodified.

3.3 Usage of AST for OCDA

AST lays a suitable ground to better identify domain-related factors (high DDM) allowing us to manipulate the same.

Insight 1. Altering the latent $z$ by the same obtained from a different domain or instance (i.e. $\tilde{z}$ ) while transforming $\tilde{h}\!=\!\texttt{AST}(h,\tilde{z})$ is expected to imitate feature space cross-domain translation ( $h$ -to- $\tilde{h}$ ) as the crucial task-related information remains preserved via the unaltered phase spectrum.

3.3.1 AST for cross-domain simulation (AST-Sim).

One of the major components in prior OCDA approaches is to discover sub-target clusters as it is used either for domain-translation of source samples (Park et al. 2020) or to realize the domain-specific batch-normalization (Gong et al. 2021). In order to obtain reliable sub-target clusters, both works rely on unsupervised clustering of the compound target data. In contrast, we aspire to avoid such discovery and do away with introducing an additional hyperparameter (no. of clusters) in order to realize a clustering-free OCDA algorithm.

Cross-domain pairing. How to select reliable source-target pairs for cross-domain simulation? In the absence of sub-target clusters, a naive approach would be to form random pairs of source and target instances. Aspiring to formalize a better source-target pairing strategy, we propose to pair instance with the most distinct domain style. This is motivated from the hard negative mining as used in deep metric learning approaches (Suh et al. 2019). Consider a source dataset with instances $\{s_{i}\}_{i=1}^{N_{s}}$ and a target dataset with instances $\{t_{j}\}_{j=1}^{N_{t}}$ . For each target instance $t_{j}$ , we mine a source instance $s_{i}$ such that these are maximally separated in the AST-latent space of the $l^{\text{th}}$ CNN layer. Formally, we obtain a set of cross-domain pairs as,

\mathcal{U}=\{(s_{i},t_{j}):\forall t_{j},i=\arg\max_{i^{\prime}}\zeta(z_{l,s_{i^{\prime}}},z_{l,t_{j}})\}

(2)

where $z_{l,s_{i}}$ and $z_{l,t_{j}}$ are the $l^{\text{th}}$ layer AST-latent for the $s_{i}$ and $t_{j}$ instances respectively. Here, $\zeta$ is a L2 distance function.

Cross-domain simulation. Utilizing Insight 1, we simulate the style of $t_{j}$ in the source $s_{i}$ by replacing its AST-latent i.e., $\hat{h}_{l,s_{i\hskip 0.56905pt\scalebox{0.8}{\begin{picture}(0.0,0.0)\roundcap\put(0.0,0.0){\leavevmode\hbox{\set@color\char 41\relax}} \put(0.0,0.0){\line(1,0){missing}} \end{picture}}\hskip 0.56905ptj}}=\texttt{AST}(h_{l,s_{i}},z_{l,t_{j}})$ . Similarly, $\hat{h}_{l,t_{j\hskip 0.56905pt\scalebox{0.8}{\begin{picture}(0.0,0.0)\roundcap\put(0.0,0.0){\leavevmode\hbox{\set@color\char 41\relax}} \put(0.0,0.0){\line(1,0){missing}} \end{picture}}\hskip 0.56905pti}}$ represents the source stylized target feature. We illustrate the cross-domain simulation in Fig. 1C. For notation simplification, $\hat{h}_{l,s_{i\hskip 0.56905pt\scalebox{0.8}{\begin{picture}(0.0,0.0)\roundcap\put(0.0,0.0){\leavevmode\hbox{\set@color\char 41\relax}} \put(0.0,0.0){\line(1,0){missing}} \end{picture}}\hskip 0.56905ptj}}$ and $\hat{h}_{l,s_{i\hskip 0.56905pt\scalebox{0.8}{\begin{picture}(0.0,0.0)\roundcap\put(0.0,0.0){\leavevmode\hbox{\set@color\char 41\relax}} \put(0.0,0.0){\line(1,0){missing}} \end{picture}}\hskip 0.56905ptj}}$ are denoted as $h_{l,s^{\hskip 0.56905pt\scalebox{0.8}{\begin{picture}(0.0,0.0)\roundcap\put(0.0,0.0){\leavevmode\hbox{\set@color\char 41\relax}} \put(0.0,0.0){\line(1,0){missing}} \end{picture}}\hskip 0.56905pt}}$ and $h_{l,t^{\hskip 0.56905pt\scalebox{0.8}{\begin{picture}(0.0,0.0)\roundcap\put(0.0,0.0){\leavevmode\hbox{\set@color\char 41\relax}} \put(0.0,0.0){\line(1,0){missing}} \end{picture}}\hskip 0.56905pt}}$ respectively. Note that, $h_{l,s^{\hskip 0.56905pt\scalebox{0.8}{\begin{picture}(0.0,0.0)\roundcap\put(0.0,0.0){\leavevmode\hbox{\set@color\char 41\relax}} \put(0.0,0.0){\line(1,0){missing}} \end{picture}}\hskip 0.56905pt}}$ and $h_{l,t^{\hskip 0.56905pt\scalebox{0.8}{\begin{picture}(0.0,0.0)\roundcap\put(0.0,0.0){\leavevmode\hbox{\set@color\char 41\relax}} \put(0.0,0.0){\line(1,0){missing}} \end{picture}}\hskip 0.56905pt}}$ hold the same task-related content as $h_{l,s}$ and $h_{l,t}$ respectively. Thus, both original and stylized features are expected to have the same segmentation label.

3.3.2 AST for domain normalization (AST-Norm). We draw motivation from batch normalization (BN) (Ioffe et al. 2015), which normalizes the features with first and second order statistics during forward pass. Along the same lines, we aim to normalize the AST-latent of intermediate deep features by altering it with a fixed mean prototype. Let $\mathcal{V}$ be the set of $l^{{}^{\prime}\text{th}}$ -layer deep features obtained from all the four variants, i.e., $h_{l^{\prime}\!,s},h_{l^{\prime}\!,t},h_{l^{\prime}\!,s^{\hskip 0.56905pt\scalebox{0.8}{\begin{picture}(0.0,0.0)\roundcap\put(0.0,0.0){\leavevmode\hbox{\set@color\char 41\relax}} \put(0.0,0.0){\line(1,0){missing}} \end{picture}}\hskip 0.56905pt}},h_{l^{\prime}\!,t^{\hskip 0.56905pt\scalebox{0.8}{\begin{picture}(0.0,0.0)\roundcap\put(0.0,0.0){\leavevmode\hbox{\set@color\char 41\relax}} \put(0.0,0.0){\line(1,0){missing}} \end{picture}}\hskip 0.56905pt}}$ . We compute the fixed domain prototype $z_{l^{\prime}\!,g}$ as follows,

z_{l^{\prime}\!,g}=\mathop{\mathbb{E}}_{h\in\mathcal{V}}\;[Q_{e}\circ\mathcal{T}\circ\mathcal{F}_{A}(h)]

(3)

Following Insight 1, the domain normalization is performed as $h_{l^{\prime}\!,s_{g}}=\texttt{AST}(h_{l^{\prime}\!,s},z_{l^{\prime}\!,g})$ . We illustrate the domain normalization in Fig. 1D. The normalized versions of $h_{l^{\prime}\!,t},h_{l^{\prime}\!,s^{\hskip 0.56905pt\scalebox{0.8}{\begin{picture}(0.0,0.0)\roundcap\put(0.0,0.0){\leavevmode\hbox{\set@color\char 41\relax}} \put(0.0,0.0){\line(1,0){missing}} \end{picture}}\hskip 0.56905pt}},h_{l^{\prime}\!,t^{\hskip 0.56905pt\scalebox{0.8}{\begin{picture}(0.0,0.0)\roundcap\put(0.0,0.0){\leavevmode\hbox{\set@color\char 41\relax}} \put(0.0,0.0){\line(1,0){missing}} \end{picture}}\hskip 0.56905pt}}$ are $h_{l^{\prime}\!,t_{g}},h_{l^{\prime}\!,s^{\hskip 0.56905pt\scalebox{0.8}{\begin{picture}(0.0,0.0)\roundcap\put(0.0,0.0){\leavevmode\hbox{\set@color\char 41\relax}} \put(0.0,0.0){\line(1,0){missing}} \end{picture}}\hskip 0.56905pt}_{g}},h_{l^{\prime}\!,t^{\hskip 0.56905pt\scalebox{0.8}{\begin{picture}(0.0,0.0)\roundcap\put(0.0,0.0){\leavevmode\hbox{\set@color\char 41\relax}} \put(0.0,0.0){\line(1,0){missing}} \end{picture}}\hskip 0.56905pt}_{g}}$ respectively.

3.4 Simulate-then-Normalize for OCDA

The prime objective in OCDA is to adapt the task-knowledge from a labeled source domain $(x_{s},y_{s})\in\mathcal{D}_{s}$ to an unlabeled compound target, $x_{t}\in\mathcal{D}_{t}$ towards better generalization to unseen open domains. The advantages of AST (i.e. identifying and manipulating domain-related factors) well cater to the specific challenges encountered in OCDA against employing it for single-source single-target settings.

Empirical analysis. After obtaining the frozen ERM-network and the frozen layer-wise AST modules, we employ AST-Sim (Fig. 1C) and AST-Norm (Fig. 1D) to analyze its behavior towards the task performance. We observe that performing AST-Sim or AST-Norm at the earlier layer with low DDM hurts the task performance. We infer that a low DDM value indicates inferior disentanglement of domain-related cues at the corresponding AST-latent. In other words, the task-related cues tend to leak through the amplitude spectrum thereby distorting task-related content.

Insight 2. Applying AST-Sim at a layer with a high DDM value followed by applying AST-Norm at a later layer close to the final task output helps to effectively leverage the advantages of AST modules for the challenging OCDA setting.

Remarks. From the perspective of causal analysis literature, the use of AST-Sim can be thought of as an intervention (He et al. 2008) that implicitly encourages the network to focus on the uninterrupted task-specific cues. However, the effectiveness of this intervention is proportional to the quality of disentanglement. Here, DDM can be taken as a proxy to measure the degree of disentanglement quality. Thus, we propose to perform AST-Sim at a layer $l$ , with high DDM, as indicated in Fig. 1B. However, we realize that the uninterrupted phase pathway still leaks domain-related information (relatively high DDM for $k>l$ ) hindering domain generalization. To circumvent this, we propose to perform AST-Norm at a later layer as a means to further fade away the domain-related information. A post-adaptation DDM analysis (solid magenta curve in Fig. 1A) confirms that the proposed Simulate-then-Normalize strategy is effective enough to suppress the domain-discriminability of the later deep features, thus attaining improved domain generalization.

Architecture. As shown in Fig. 1B, we divide the CNN segmentor into three parts; i.e. $\Phi,\Psi$ , and $C$ . Following this, the AST-Sim (denoted as $\texttt{AST}_{\textit{cs}}$ ) and AST-Norm (denoted as $\texttt{AST}_{\textit{dn}}$ ) are inserted after $\Phi$ and $\Psi$ respectively (see Fig. 3B, D). Next, we discuss the adaptation training.

Algorithm 1 Pseudo-code for the proposed approach

1:Input: source data

(x_{s},y_{\textit{gt}})\!\in\!\mathcal{D}_{s}

, target data

x_{t}\!\in\!\mathcal{D}_{t}

. Initialize

\theta

(i.e. parameters of

\Phi,\Psi

, and

C

) from standard source training. Initialize

\theta_{\textit{cs}}

and

\theta_{\textit{dn}}

(i.e. parameters of

\texttt{ASM}_{\textit{cs}}

and

\texttt{ASM}_{\textit{dn}}

) from auto-encoder training discussed in Sec 3.2.1.

2:Pre-adaptation: Integrate ASTs into the CNN segmentor and finetune

\theta

\theta_{\textit{cs}}

, and

\theta_{\textit{dn}}

following Eq. 4 and 5.

3:Adaptation via Simulate-then-Normalize

4:Freeze

\theta_{\textit{cs}}

, and

\theta_{\textit{dn}}

from the pre-adaptation training. Recompute fixed domain prototype

z_{l^{\prime}\!,g}

using Eq. 3 and pseudo-labels

y_{\textit{pgt}}

using Eq. 8, after every epoch.

5:for

iter<MaxIter

do:

x_{s},y_{\textit{gt}},x_{t},y_{\textit{pgt}}\leftarrow

batch sampled from

\mathcal{D}_{s},\mathcal{D}_{t}

7: Obtain predictions following Fig. 3

8: update

\theta

by optimizing Eq. 6 and 7

9:end for

3.4.1 Pre-adaptation training.

In this stage, we aim to prepare (or initialize) the network components prior to commencing the proposed Simulate-then-Normalize procedure (i.e. adaptation). We start from the pre-trained standard source model and the pre-trained auto-encoder based AST-networks at layer $l$ and $l^{\prime}$ , i.e. $\texttt{AST}_{\textit{cs}}^{\textit{(ae)}}$ , and $\texttt{AST}_{\textit{dn}}^{\textit{(ae)}}$ . Here, the superscript (ae) indicates that these networks are not yet manipulated as AST-Sim, or AST-Norm (i.e., without any manipulation of the AST-latent).

Following this, the pre-adaptation finetuning involves two pathways. First, $x_{s}$ is forwarded to obtain $y_{s}=C\circ\texttt{AST}_{\textit{dn}}^{\textit{(ae)}}\circ\Psi\circ\texttt{AST}_{\textit{cs}}^{\textit{(ae)}}\circ\Phi(x_{s})$ . The following source-side supervised cross-entropy objective is formed as;

\min_{\theta}\mathcal{L}_{s}\;\;\text{where}\;\;\mathcal{L}_{s}\!=\!L_{\textit{CE}}(y_{s},y_{\textit{gt}})

(4)

Here, $\theta$ denotes the collection of parameters of $\Phi$ , $\Psi$ , and $C$ . Note that, after integrating ASTs into the CNN segmentor, the content-related gradients flow unobstructed through the frozen AST networks while finetuning $\Phi$ and $\Psi$ .

In the second pathway, both $x_{s}$ and $x_{t}$ are forwarded to obtain input-output pairs to finetune $\theta_{\textit{cs}}$ , and $\theta_{\textit{dn}}$ , i.e.

\begin{split}\min_{\theta_{\textit{cs}}}\mathcal{L}_{\texttt{AST}}(h_{l,s},\hat{h}_{l,s})+\mathcal{L}_{\texttt{AST}}(h_{l,t},\hat{h}_{l,t})\\ \min_{\theta_{\textit{dn}}}\mathcal{L}_{\texttt{AST}}(h_{l^{\prime}\!,s},\hat{h}_{l^{\prime}\!,s})+\mathcal{L}_{\texttt{AST}}(h_{l^{\prime}\!,t},\hat{h}_{l^{\prime}\!,t})\end{split}

(5)

Here, $(h_{l,s},\hat{h}_{l,s})$ and $(h_{l,t},\hat{h}_{l,t})$ are the input-output pairs corresponding to $x_{s}$ and $x_{t}$ to update $\texttt{AST}_{\textit{cs}}^{\textit{(ae)}}$ . Similarly, $(h_{l^{\prime}\!,s},\hat{h}_{l^{\prime}\!,s})$ and $(h_{l^{\prime}\!,t},\hat{h}_{l^{\prime}\!,t})$ are the input-output pairs corresponding to $x_{s}$ and $x_{t}$ to update $\texttt{AST}_{\textit{dn}}^{\textit{(ae)}}$ .

3.4.2 Adaptation via Simulate-then-Normalize.

The finetuned networks from the pre-adaptation stage are now subjected to adaptation via the simulate-then-normalize procedure. We manipulate the AST-latent thus denoting $\texttt{AST}_{\textit{cs}}^{\textit{(ae)}}$ , and $\texttt{AST}_{\textit{dn}}^{\textit{(ae)}}$ as $\texttt{AST}_{\textit{cs}}$ , and $\texttt{AST}_{\textit{dn}}$ respectively. Note that, latent manipulation is independent of the network weights, $\theta_{\textit{cs}}$ , and $\theta_{\textit{dn}}$ . Thus, we freeze $\theta_{\textit{cs}}$ and $\theta_{\textit{dn}}$ from the pre-adaptation training to preserve their auto-encoding behaviour, and only update $\theta$ during the adaptation stage.

The adaptation training involves four data-flow pathways as shown in Fig 3. The Simulate step at layer- $l$ outputs the following features; $(h_{l,s},h_{l,s^{\hskip 0.56905pt\scalebox{0.8}{\begin{picture}(0.0,0.0)\roundcap\put(0.0,0.0){\leavevmode\hbox{\set@color\char 41\relax}} \put(0.0,0.0){\line(1,0){missing}} \end{picture}}\hskip 0.56905pt}},h_{l,t^{\hskip 0.56905pt\scalebox{0.8}{\begin{picture}(0.0,0.0)\roundcap\put(0.0,0.0){\leavevmode\hbox{\set@color\char 41\relax}} \put(0.0,0.0){\line(1,0){missing}} \end{picture}}\hskip 0.56905pt}},h_{l,t})$ . Moving forward, the Normalize step at layer- $l^{\prime}$ outputs the following: $(h_{l^{\prime}\!,s_{g}},h_{l^{\prime}\!,s_{g}^{\hskip 0.56905pt\scalebox{0.8}{\begin{picture}(0.0,0.0)\roundcap\put(0.0,0.0){\leavevmode\hbox{\set@color\char 41\relax}} \put(0.0,0.0){\line(1,0){missing}} \end{picture}}\hskip 0.56905pt}},h_{l^{\prime}\!,t_{g}^{\hskip 0.56905pt\scalebox{0.8}{\begin{picture}(0.0,0.0)\roundcap\put(0.0,0.0){\leavevmode\hbox{\set@color\char 41\relax}} \put(0.0,0.0){\line(1,0){missing}} \end{picture}}\hskip 0.56905pt}},h_{l^{\prime}\!,t_{g}})$ . Finally, we obtain the following four predictions; $(y_{s_{g}},y_{s_{g}^{\hskip 0.56905pt\scalebox{0.8}{\begin{picture}(0.0,0.0)\roundcap\put(0.0,0.0){\leavevmode\hbox{\set@color\char 41\relax}} \put(0.0,0.0){\line(1,0){missing}} \end{picture}}\hskip 0.56905pt}},y_{t_{g}^{\hskip 0.56905pt\scalebox{0.8}{\begin{picture}(0.0,0.0)\roundcap\put(0.0,0.0){\leavevmode\hbox{\set@color\char 41\relax}} \put(0.0,0.0){\line(1,0){missing}} \end{picture}}\hskip 0.56905pt}},y_{t_{g}})$ .

As shown in Fig. 3F, we apply supervised losses against ground-truth (GT) $y_{\textit{gt}}$ for the source-side predictions i.e.

\min_{\theta}(\mathcal{L}_{s_{g}}\texttt{=}L_{\textit{CE}}(y_{s_{g}},y_{\textit{gt}}))\,\,\min_{\theta}(\mathcal{L}_{s_{g}^{\hskip 0.56905pt\scalebox{0.8}{\begin{picture}(0.0,0.0)\roundcap\put(0.0,0.0){\leavevmode\hbox{\set@color\char 41\relax}} \put(0.0,0.0){\line(1,0){missing}} \end{picture}}\hskip 0.56905pt}}\texttt{=}L_{\textit{CE}}(y_{s_{g}^{\hskip 0.56905pt\scalebox{0.8}{\begin{picture}(0.0,0.0)\roundcap\put(0.0,0.0){\leavevmode\hbox{\set@color\char 41\relax}} \put(0.0,0.0){\line(1,0){missing}} \end{picture}}\hskip 0.56905pt}},y_{\textit{gt}}))

(6)

Following the self-training based adaptation works (Zou et al. 2019), we propose to use pseudo-labels (pseudo-GT) $y_{\textit{pgt}}$ for the target-side predictions i.e.

\min_{\theta}(\mathcal{L}_{t_{g}}\texttt{=}L_{\textit{CE}}(y_{t_{g}},y_{\textit{pgt}}))\,\,\min_{\theta}(\mathcal{L}_{t_{g}^{\hskip 0.56905pt\scalebox{0.8}{\begin{picture}(0.0,0.0)\roundcap\put(0.0,0.0){\leavevmode\hbox{\set@color\char 41\relax}} \put(0.0,0.0){\line(1,0){missing}} \end{picture}}\hskip 0.56905pt}}\texttt{=}L_{\textit{CE}}(y_{t_{g}^{\hskip 0.56905pt\scalebox{0.8}{\begin{picture}(0.0,0.0)\roundcap\put(0.0,0.0){\leavevmode\hbox{\set@color\char 41\relax}} \put(0.0,0.0){\line(1,0){missing}} \end{picture}}\hskip 0.56905pt}},y_{\textit{pgt}}))

(7)

Pseudo-label extraction. In order to obtain reliable pseudo-labels $y_{\textit{pgt}}$ , we prune the target predictions by performing a consistency check between $y_{t_{g}}$ and $y_{t_{g}^{\hskip 0.56905pt\scalebox{0.8}{\begin{picture}(0.0,0.0)\roundcap\put(0.0,0.0){\leavevmode\hbox{\set@color\char 41\relax}} \put(0.0,0.0){\line(1,0){missing}} \end{picture}}\hskip 0.56905pt}}$ . Thus, $y_{\textit{pgt}}=\operatorname*{arg\,max}_{c}y_{t_{g}}$ is obtained only for reliable pixels i.e. for pixels $x_{t}$ which satisfy the following criteria,

\mathcal{D}^{\textit{(pgt)}}_{t}\!=\!\{x_{t}\!:\!x_{t}\!\in\!\mathcal{D}_{t}\text{ and }\operatorname*{arg\,max}_{c}y_{t_{g}}\!=\!\operatorname*{arg\,max}_{c}y_{t_{g}^{\hskip 0.56905pt\scalebox{0.8}{\begin{picture}(0.0,0.0)\roundcap\put(0.0,0.0){\leavevmode\hbox{\set@color\char 41\relax}} \put(0.0,0.0){\line(1,0){missing}} \end{picture}}\hskip 0.56905pt}}\}

(8)

Note that, only the target pixels in $\mathcal{D}^{\textit{(pgt)}}_{t}$ contribute in Eq. 7. The overall training procedure is summarized in Algo. 1.

Table 2: Comparison on GTA5

\to

C-Driving benchmark.

\dagger

indicates 150k training iterations, otherwise 5k iterations.

Method	Compound (C)			Open (O)	Average
Method	Rainy	Snowy	Cloudy	Overcast	C	C+O
Source only	16.2	18.0	20.9	21.2	18.9	19.1
ASN (Tsai et al. 2018)	20.2	21.2	23.8	25.1	22.1	22.5
CBST (Zou et al. 2018)	21.3	20.6	23.9	24.7	22.2	22.6
IBN-Net (Pan et al. 2018)	20.6	21.9	26.1	25.5	22.8	23.5
PyCDA (Lian et al. 2019)	21.7	22.3	25.9	25.4	23.3	23.8
OCDA (Liu et al. 2020)	22.0	22.9	27.0	27.9	24.5	25.0
DHA (Park et al. 2020)	27.0	26.3	30.7	32.8	28.5	29.2
Ours (AST-OCDA)	28.2	27.8	31.6	34.0	29.2	30.4
Source only $\dagger$	23.3	24.0	28.2	30.2	25.7	26.4
ASN (Tsai et al. 2018) $\dagger$	25.6	27.2	31.8	32.1	28.8	29.2
MOCDA (Gong et al. 2021) $\dagger$	24.4	27.5	30.1	31.4	27.7	29.4
DHA (Park et al. 2020) $\dagger$	27.1	30.4	35.5	36.1	32.0	32.3
Ours (AST-OCDA) $\dagger$	32.7	32.2	38.9	39.2	34.6	35.7

Table 3: Comparison on SYNTHIA

\to

C-Driving benchmark. All methods employ the long-training strategy.

*

indicates mIoU over 11 classes otherwise over 16 classes.

Method	Compound (C)			Open (O)	Average
Method	Rainy	Snowy	Cloudy	Overcast	C	C+O
Source only	16.3	18.8	19.4	19.5	18.4	18.5
CBST (Zou et al. 2018)	16.2	19.6	20.1	20.3	18.9	19.1
CRST (Zou et al. 2019)	16.3	19.9	20.3	20.5	19.1	19.3
ASN (Tsai et al. 2018)	17.0	20.5	21.6	21.6	20.0	20.2
AdvEnt (Vu et al. 2019)	17.7	19.9	20.2	20.5	19.3	19.6
DHA (Park et al. 2020)	18.8	21.2	23.6	23.6	21.5	21.8
Ours (AST-OCDA)	20.3	22.6	24.9	25.4	22.6	23.3
Source only $*$	16.5	18.2	21.4	20.6	19.2	19.8
AdvEnt (Vu et al. 2019) $*$	21.8	22.6	26.2	25.7	23.9	24.7
ASN (Tsai et al. 2018) $*$	24.9	26.9	30.7	30.3	28.0	29.0
MOCDA (Gong et al. 2021) $*$	26.6	27.9	32.4	31.8	29.1	30.3
Ours (AST-OCDA) $*$	27.9	28.8	33.9	34.2	30.2	31.2

Table 4: Evaluation on extended open domains. mIoU computed over 19 classes for GTA5 as source and 11 classes for SYNTHIA as source, following Gong et al. (2021).

Source	Method	Extended Open			Avg.
Source	Method	Citysc.	KITTI	WildDash	Avg.
GTA5	Source only	19.3	24.1	16.0	20.5
	ASN (Tsai et al. 2018)	22.0	23.4	17.5	22.5
	MOCDA (Gong et al. 2021)	31.1	30.9	21.6	27.8
	Ours (AST-OCDA)	32.6	31.8	23.1	29.2
SYN- THIA	Source only	24.7	20.7	17.3	20.8
	ASN (Tsai et al. 2018)	35.9	24.7	20.7	27.9
	MOCDA (Gong et al. 2021)	32.2	34.2	25.8	31.2
	Ours (AST-OCDA)	37.2	35.7	26.9	33.3

Table 5: Ablation study on GTA5

\to

C-Driving. Random and Mined indicate random pairing and hard-mining strategy (Sec 3.3.1). Sim and Norm indicate the module usage.

#	AST-	Pairing		AST-	Losses		GTA5 $\to$ C-Driv.
#	Sim	Random	Mined	Norm	$\mathcal{L}_{s_{g}},\mathcal{L}_{t_{g}}$	$\mathcal{L}_{s_{g}^{\hskip 0.56905pt\scalebox{0.8}{\begin{picture}(0.0,0.0)\roundcap\put(0.0,0.0){\leavevmode\hbox{\set@color\char 41\relax}} \put(0.0,0.0){\line(1,0){missing}} \end{picture}}\hskip 0.56905pt}},\mathcal{L}_{t_{g}^{\hskip 0.56905pt\scalebox{0.8}{\begin{picture}(0.0,0.0)\roundcap\put(0.0,0.0){\leavevmode\hbox{\set@color\char 41\relax}} \put(0.0,0.0){\line(1,0){missing}} \end{picture}}\hskip 0.56905pt}}$	C	C+O
1.	$\times$	-	-	$\times$	-	-	25.7	26.4
2.	$\times$	-	-	✓	✓	-	28.1	28.6
3.	✓	✓	$\times$	$\times$	-	-	27.9	28.3
4.	✓	✓	$\times$	✓	$\times$	✓	30.0	30.3
5.	✓	✓	$\times$	✓	✓	✓	31.2	31.6
6.	✓	$\times$	✓	$\times$	-	-	29.0	29.8
7.	✓	$\times$	✓	✓	$\times$	✓	32.8	33.6
8.	✓	$\times$	✓	✓	✓	✓	34.6	35.7

4 Experiments

We thoroughly evaluate the proposed approach against state-of-the-art prior works in the Open Compound DA setting.

Datasets. Following Gong et al. (2021), we used the synthetic GTA5 (Richter et al. 2016) and SYNTHIA (Ros et al. 2016) datasets as the source. In C-Driving (Liu et al. 2020), rainy, snowy, and cloudy sub-targets form the compound target domain, and overcast sub-target forms the open domain. Further, we use Cityscapes (Cordts et al. 2016), KITTI (Geiger et al. 2013) and WildDash (Zendel et al. 2018) datasets as extended open domains to test generalization on diverse unseen domains. All datasets (except SYNTHIA) share 19 semantic categories. We use the mean intersection-over-union (mIoU) metric for evaluating the performance. See Suppl. for more details.

Implementation Details. Following (Park et al. 2020; Gong et al. 2021), we employ DeepLabv2 (Chen et al. 2017a) with a VGG16 (Simonyan et al. 2015a) backbone as the CNN segmentor. We use SGD optimizer with a learning rate of 1e-4, momentum of 0.9 and a weight decay of 5e-4 during training. We also use a polynomial decay with power 0.9 as the learning rate scheduler. Following Park et al. (2020), we use two training schemes for GTA5 i.e., short training scheme with 5k iterations and long training scheme with 150k iterations. For SYNTHIA, we use only the long training scheme following Gong et al. (2021).

4.1 Discussion

We compare our approach with prior arts on GTA5 $\to$ C-Driving (Table 2) and SYNTHIA $\to$ C-Driving (Table 3) benchmarks as well as on extended open domains (Table 4). We also provide an extensive ablation study (Table 5).

GTA5 as source (Table 2, 4). For the short training scheme, our approach outperforms the SOTA method DHA (Park et al. 2020) by an average of 0.7% on compound domains and 1.2% on open domain. The improvement is enhanced with the long training scheme where we outperform DHA by 2.6% and 3.1% on compound and open domains respectively. As illustrated in Table 4, our method outperforms the SOTA method MOCDA (Gong et al. 2021) on the extended open domains by 1.4% on average. This verifies the better generalization abilities of our approach for new open domains with higher domain-shift from the source data.

SYNTHIA as source (Table 3, 4). Our approach outperforms the SOTA method DHA (Park et al. 2020) by an average of 1.1% on compound domains and 1.8% on the open domain. Table 4 shows our superior performance on the extended open domains. We outperform the SOTA (Gong et al. 2021) with an average 2.1% over the 3 datasets. This reinforces our superior generalizability.

4.1.1 Ablation Study. Table 5 presents a detailed ablation to analyze the effectiveness of the proposed components.

First, we evaluate the variations proposed under AST-Sim. As a baseline, we use the source-only performance (#1). Deactivating AST-Norm, AST-Sim with random pairing strategy (#3 vs. #1) yields an average 2.1% improvement. Next, without AST-Norm, AST-Sim alongside the proposed Mined pairing outperforms the previous (#6 vs. #3) by 1.3%. When AST-Norm is applied alongside only the simulation based losses (i.e. $\mathcal{L}_{s_{g}^{\hskip 0.56905pt\scalebox{0.8}{\begin{picture}(0.0,0.0)\roundcap\put(0.0,0.0){\leavevmode\hbox{\set@color\char 41\relax}} \put(0.0,0.0){\line(1,0){missing}} \end{picture}}\hskip 0.56905pt}},\mathcal{L}_{t_{g}^{\hskip 0.56905pt\scalebox{0.8}{\begin{picture}(0.0,0.0)\roundcap\put(0.0,0.0){\leavevmode\hbox{\set@color\char 41\relax}} \put(0.0,0.0){\line(1,0){missing}} \end{picture}}\hskip 0.56905pt}}$ ), using Mined pairing improves over random pairing (#7 vs. #4) by 3.0%. Finally, when AST-Norm is applied alongside all the four losses, Mined pairing outperforms random pairing (#8 vs. #5) by 3.5%. This verifies the consistent superiority of our well-thought-of cross-domain pairing strategy (Sec 3.3.1).

Second, we evaluate the variations under AST-Norm. Deactivating AST-Sim, AST-Norm outperforms the baseline (#2 vs. #1) by 2.3% on average. AST-Norm alongside simulation-based losses i.e. $\mathcal{L}_{s_{g}^{\hskip 0.56905pt\scalebox{0.8}{\begin{picture}(0.0,0.0)\roundcap\put(0.0,0.0){\leavevmode\hbox{\set@color\char 41\relax}} \put(0.0,0.0){\line(1,0){missing}} \end{picture}}\hskip 0.56905pt}},\mathcal{L}_{t_{g}^{\hskip 0.56905pt\scalebox{0.8}{\begin{picture}(0.0,0.0)\roundcap\put(0.0,0.0){\leavevmode\hbox{\set@color\char 41\relax}} \put(0.0,0.0){\line(1,0){missing}} \end{picture}}\hskip 0.56905pt}}$ (#4 vs. #3 and #7 vs. #6 for random and mined pairing respectively) yield average improvements of 2.0% and 3.8% respectively over not using AST-Norm. Next, AST-Norm used with all losses improves over the previous (#5 vs. #4 and #8 vs. #7 for random pairing and hard mining respectively) by 1.2% and 1.9% respectively. This verifies the consistent improvement of using all losses over only simulation-based losses.

Both sets of ablations underline the equal importance of AST-Sim and AST-Norm in the proposed approach.

4.1.2 Analyzing domain alignment. Since AST-Sim (layer $l$ ) and AST-Norm (layer $l^{\prime}$ ) involve replacing the latent, their effect can be studied at the layer following it. In Fig. 4, we plot t-SNE (Maaten et al. 2008) embeddings of AST-latent at layers $l+1$ and $l^{\prime}+1$ to examine domain alignment before and after the proposed adaptation. Though simulated cross-domain features $(h_{l,s^{\hskip 0.56905pt\scalebox{0.8}{\begin{picture}(0.0,0.0)\roundcap\put(0.0,0.0){\leavevmode\hbox{\set@color\char 41\relax}} \put(0.0,0.0){\line(1,0){missing}} \end{picture}}\hskip 0.56905pt}},h_{l,t^{\hskip 0.56905pt\scalebox{0.8}{\begin{picture}(0.0,0.0)\roundcap\put(0.0,0.0){\leavevmode\hbox{\set@color\char 41\relax}} \put(0.0,0.0){\line(1,0){missing}} \end{picture}}\hskip 0.56905pt}})$ span the entire space, projection of the original source and sub-target domains (i.e. $h_{l,s},h_{l,t}$ ) stills retains domain-specificity (first and second plots). As AST-Norm aims to normalize the AST-latent with a fixed mean prototype (i.e. enforcing all features to have a common AST-latent), we observe that the domains are aligned post-adaptation (third and fourth plots) at layer $l^{\prime}+1$ . This shows that the proposed Simulate-then-Normalize strategy effectively reduces the domain shift in the latent space.

5 Conclusion

In this work, we performed a thorough analysis of domain discriminability (DDM) to quantify the unwanted correlation between the deep features and domain labels. Based on our observations, we proposed the feature-space Amplitude Spectrum Transformation (AST) to better disentangle and manipulate the domain and task-related information. We provided insights into the usage of AST in two configurations - AST-Sim and AST-Norm and proposed a novel Simulate-then-Normalize strategy for effectively performing Open Compound DA. Extending the proposed approach for more general, realistic DA scenarios such as multi-source, multi-target DA can be a direction for future work.

Acknowledgements. This work was supported by MeitY (Ministry of Electronics and Information Technology) project (No. 4(16)2019-ITEA), Govt. of India.

Supplementary Material

In this supplementary, we provide extensive implementation details with additional qualitative and quantitative performance analysis. Towards reproducible research, we will release our complete codebase and trained network weights.

This supplementary is organized as follows:

•

Section A: Notations (Table 6)
•
Section B: Implementation Details
- –
  
  DDM analysis experiments (Sec. B.1, Fig. 7)
- –
  
  Experimental settings (Sec. B.2, Table 8, Fig. 6)
•
Section C: Analysis
- –
  
  Quantitative analysis (Sec. C.1, Table 7, 9)
- –
  
  Qualitative analysis (Sec. C.2, Fig. 5)

Appendix A Notations

We summarize the notations used in the paper in Table 6. To enhance the clarity of the approach, we also provide the dimensions of each variable considering $720\times 1280$ as the input spatial size. The notations are listed under 5 groups viz. Data, Networks, AST-Sim, AST-Norm and Neural outputs.

Appendix B Implementation Details

In this section, we provide the extended implementation details excluded from the main paper due to space constraints.

B.1 DDM analysis experiments

2.1.1 Obtaining multi-domain source. We did not use target data for Domain-Discriminability-Metric (DDM) analysis as we intended to use it for OCDA experiments. We chose to use the following three strong augmentations on source data to represent novel domains (total 4 sub-domains). We show examples of the augmented images in Fig. 7.

a) AdaIN: This technique (Huang and Belongie 2017) stylizes the image based on a reference image by manipulating the convolutional feature statistics in an instance normalization (Ulyanov et al. 2017) layer. We set the strength of stylization to $0.3$ (on a scale of 0 to 1 i.e. original image to strongly stylized) to ensure sufficient content preservation.

b) Weather augmentation: We use the frost (weather condition) augmentation from Jung et al. (2020). There are five levels of severity of which we randomly choose from the lower three levels to control the strength of augmentation.

c) Cartoonization: We use the cartoonization augmentation from Jung et al. (2020). No controllable parameter is provided for the strength of augmentation.

2.1.2 Discriminator training for DDM. We train a convolutional discriminator with cross-entropy loss for classifying the input features according to their domain labels.

a) Disc. for spatial features. The architecture is similar to that used in Li et al. (2019) i.e. $5$ conv. layers with $4\times 4$ kernel, stride $2$ and channel numbers $\{64,128,256,512,4\}$ . Here, $4$ channels in the last layer corresponds to the number of sub-domains. Each convolutional layer (except the last) is followed by a LeakyReLU parameterized by $0.2$ .

b) Disc. for AST-latent. We use a fully-connected 2-layer discriminator with 4096 hidden units and ReLU in the first layer and 4 units (no. of sub-domains) in the second layer.

Despite the lower capacity of the discriminator, DDM of AST-latent is higher than that for convolutional features. We reported the accuracy of the trained disc. in Fig. 1A of the main paper for different layer features and AST-latents.

Table 6: Notation Table

Symbol

Description

Dimensions

\times

\times

Data

\mathcal{D}_{s}

Labeled source dataset

\mathcal{D}_{t}

Unlabeled target dataset

\mathcal{D}_{t}^{\textit{(pgt)}}

Pseudo-labeled target dataset

(x_{s},y_{\textit{gt}})

Labeled source sample

(3, 19)

\times

720

\times

1280

x_{t}

Unlabeled target sample

\times

720

\times

1280

(x_{t},y_{\textit{pgt}})

Pseudo-labeled target sample

(3, 19)

\times

720

\times

1280

Networks

\Phi

CNN (early layers)

\Psi

CNN (middle layers)

C

CNN (dense classifier)

\texttt{AST}_{\textit{cs}}

AST-Sim (cross-domain simulation)

\texttt{AST}_{\textit{dn}}

AST-Norm (domain normalization)

AST-Sim

h_{l,s}

Input source features

512\times 90\times 160

h_{l,s^{\hskip 0.56905pt\scalebox{0.8}{\begin{picture}(0.0,0.0)\roundcap\put(0.0,0.0){\leavevmode\hbox{\set@color\char 41\relax}} \put(0.0,0.0){\line(1,0){missing}} \end{picture}}\hskip 0.56905pt}}

Simulated source features

512\times 90\times 160

z_{l,s}

Source AST-Sim-latent

(64*512)\times 1\times 1

h_{l,t}

Input target features

512\times 90\times 160

h_{l,t^{\hskip 0.56905pt\scalebox{0.8}{\begin{picture}(0.0,0.0)\roundcap\put(0.0,0.0){\leavevmode\hbox{\set@color\char 41\relax}} \put(0.0,0.0){\line(1,0){missing}} \end{picture}}\hskip 0.56905pt}}

Simulated target features

512\times 90\times 160

z_{l,t}

Target AST-Sim-latent

(64*512)\times 1\times 1

AST-Norm

h_{l^{\prime}\!,s}

Input orig. source features

1024\times 90\times 160

h_{l^{\prime}\!,s^{\hskip 0.56905pt\scalebox{0.8}{\begin{picture}(0.0,0.0)\roundcap\put(0.0,0.0){\leavevmode\hbox{\set@color\char 41\relax}} \put(0.0,0.0){\line(1,0){missing}} \end{picture}}\hskip 0.56905pt}}

Input simulated source features

1024\times 90\times 160

h_{l^{\prime}\!,s_{g}}

Normalized orig. source features

1024\times 90\times 160

h_{l^{\prime}\!,s^{\hskip 0.56905pt\scalebox{0.8}{\begin{picture}(0.0,0.0)\roundcap\put(0.0,0.0){\leavevmode\hbox{\set@color\char 41\relax}} \put(0.0,0.0){\line(1,0){missing}} \end{picture}}\hskip 0.56905pt}_{g}}

Normalized sim. source features

1024\times 90\times 160

z_{l^{\prime}\!,g}

Fixed domain prototype

(64*1024)\times 1\times 1

Neural outputs

h_{k}

Layer-

k

features

C_{k}\times H_{k}\times W_{k}

z_{k}

Layer-

k

AST-latent

(64*C_{k})\times 1\times 1

y_{s_{g}}

AST-Norm source pred.

19\times 720\times 1280

y_{s_{g}^{\hskip 0.56905pt\scalebox{0.8}{\begin{picture}(0.0,0.0)\roundcap\put(0.0,0.0){\leavevmode\hbox{\set@color\char 41\relax}} \put(0.0,0.0){\line(1,0){missing}} \end{picture}}\hskip 0.56905pt}}

AST-Sim-then-Norm source pred.

19\times 720\times 1280

y_{t_{g}}

AST-Norm target pred.

19\times 720\times 1280

y_{t_{g}^{\hskip 0.56905pt\scalebox{0.8}{\begin{picture}(0.0,0.0)\roundcap\put(0.0,0.0){\leavevmode\hbox{\set@color\char 41\relax}} \put(0.0,0.0){\line(1,0){missing}} \end{picture}}\hskip 0.56905pt}}

AST-Sim-then-Norm target pred.

19\times 720\times 1280

Table 7: Quantitative evaluation on SYNTHIA

\rightarrow

C-Driving. Per-class IoU is reported for 11 classes. Compound refers to compound domains i.e. rainy, snowy, and cloudy, while Open refers to the open domain from C-Driving i.e. overcast.

SYNTHIA $\rightarrow$ C-Driving
	Method	road	sidewalk	building	wall	fence	pole	traffic light	vegetation	sky	person	car	mIoU
Compound	Source only	3.1	6.8	42.7	0.0	0.0	10.2	1.1	39.6	69.2	9.7	28.2	19.2
	AdvEnt (Vu et al. 2019)	67.2	1.8	50.7	0.0	0.0	4.4	1.3	11.7	71.8	8.7	45.7	23.9
	ASN (Tsai et al. 2018)	63.1	11.9	46.5	0.1	0.0	10.5	3.1	22.2	78.7	17.8	54.1	28.0
	Ours (AST-OCDA)	65.8	12.4	48.9	0.9	0.4	12.3	5.8	26.9	83.5	18.9	56.4	30.2
Open	Source only	1.9	9.0	43.4	0.0	0.0	11.1	1.2	45.1	74.7	13.0	27.2	20.6
	AdvEnt (Vu et al. 2019)	68.9	2.5	51.6	0.0	0.0	5.7	1.4	14.2	77.2	11.7	49.3	25.7
	ASN (Tsai et al. 2018)	69.4	14.4	48.7	0.0	0.0	11.8	2.3	23.0	82.4	21.7	59.0	30.3
	Ours (AST-OCDA)	73.5	18.6	53.4	0.7	0.3	16.5	8.6	27.9	87.5	26.4	63.4	34.2

B.2 Experimental settings

Here, we describe the datasets, networks, compute and software used for the experiments.

Datasets. The GTA5 dataset (Richter et al. 2016) contains 24966 synthetic road scene images of size extracted from a game engine which are densely labeled for 19 categories. The SYNTHIA dataset (Ros et al. 2016) contains 2224 synthetic road scene images extracted from a virtual city. The C-Driving dataset (Liu et al. 2020) is derived from BDD100k dataset (Yu et al. 2020). The training set contains 14697 images from mixed domains without labels while the validation set, with 627 images, is segregated into sub-domains for evaluation purposes. Cityscapes (Cordts et al. 2016) is a real urban road scene dataset collected from European cities. We use 500 images from its validation set for extended open domain evaluation. KITTI (Geiger et al. 2013) consists of road scene images collected from Karlsruhe, Germany. We utilize its validation set with 200 images for extended open evaluation. WildDash (Zendel et al. 2018) is a diverse road scene dataset with varying weather, time of day, camera characteristics and data sources. We use its validation set with 70 images for extended open evaluation.

CNN segmentor architecture details. Following (Park et al. 2020; Gong et al. 2021), we use DeepLabv2 (Chen et al. 2017a) with VGG16 (Simonyan et al. 2015b) backbone as the CNN segmentor. As proposed in Sec 3.4 of the main paper, we divide the CNN segmentor into $\Phi,\Psi,$ and $C$ to accommodate the two AST modules. As shown in Fig. 4 of the main paper, $\Phi$ consists of layers upto Conv 4-3, $\Psi$ consists of layers Conv 5-1 to Conv 5-3 and $C$ consists of the remaining layers after Conv-5-3 of the segmentor.

AST architecture details. We illustrate the upsampling/downsampling strategy, for feature map inputs to AST, in Fig. 6. We interpolate all feature map inputs to a spatial size of $128\times 256$ so that AST can have a fixed capacity encoder $Q_{e}$ and decoder $Q_{d}$ irrespective of the layer it is applied at. Further, we interpolate the output features of AST back to the original spatial size. We show the encoder $Q_{e}$ and decoder $Q_{d}$ architecture in Table 8. The input and output size of $Q_{e}$ and $Q_{d}$ is $128^{2}=16384$ as only half of the amplitude spectrum is required, due to its symmetry about the midpoint (as illustrated in Fig. 2 of the main paper).

Table 8: Encoder-decoder architecture used in AST.

	Operation	Features	Non-linearity
$Q_{e}$	Input	16384
	Fully-connected	2048	ReLU
	Fully-connected	64	Linear
	Normalization (unit sphere)	64
$Q_{d}$	Fully-connected	2048	ReLU
$Q_{d}$	Fully-connected	16384	Linear

Compute specifications. For all experiments, we use a machine with 8-core Intel Core i7-9700F CPU, 64 GB of system RAM and one NVIDIA RTX 8000 GPU (48 GB).

Software specifications. We use PyTorch 1.8 (Paszke et al. 2019) for implementing our proposed method along with Python 3.7 and CUDA 11.0. The OS is Ubuntu 18.04.

Appendix C Analysis

C.1 Quantitative analysis

We provide extended quantitative results of our proposed approach on SYNTHIA $\to$ C-Driving and GTA5 $\to$ C-Driving benchmarks in Table 7, 9. We obtain state-of-the-art performance across all benchmarks (Table 2, 3, 4 in main paper).

C.2 Qualitative analysis

We provide qualitative results of our proposed approach against the Source only baseline for the GTA5 $\to$ C-Driving setting and also on the extended open domains in Fig. 5. Note that prior arts (Park et al. 2020; Gong et al. 2021) did not release code or pre-trained models, making it infeasible to compare qualitative results with their methods.

Table 9: Comparison on GTA5

\to

C-Driving. Per-class IoU is reported for 19 classes except ‘bicycle’ class, as it is close to zero for all methods. The rainy, snowy and cloudy domains compose the compound target, while overcast is an open domain.

GTA5 $\rightarrow$ C-Driving (short training strategy following Liu et al. (2020))
	Method	road	sidewalk	building	wall	fence	pole	traffic light	traffic sign	vegetation	terrian	sky	person	rider	car	truck	bus	train	motorbike	mIoU
Rainy	Source only (Liu et al. 2020)	48.3	3.4	39.7	0.6	12.2	10.1	5.6	5.1	44.3	17.4	65.4	12.1	0.4	34.5	7.2	0.1	0.0	0.5	16.2
	AdaptSegNet (Tsai et al. 2018)	58.6	17.8	46.4	2.1	19.6	15.6	5.0	7.7	55.6	20.7	65.9	17.3	0.0	41.3	7.4	3.1	0.0	0.0	20.2
	CBST (Zou et al. 2018)	59.4	13.2	47.2	2.4	12.1	14.1	3.5	8.6	53.8	13.1	80.3	13.7	17.2	49.9	8.9	0.0	0.0	6.6	21.3
	IBN-Net (Pan et al. 2018)	58.1	19.5	51.0	4.3	16.9	18.8	4.6	9.2	44.5	11.0	69.9	20.0	0.0	39.9	8.4	15.3	0.0	0.0	20.6
	OCDA (Liu et al. 2020)	63.0	15.4	54.2	2.5	16.1	16.0	5.6	5.2	54.1	14.9	75.2	18.5	0.0	43.2	9.4	24.6	0.0	0.0	22.0
	Ours (AST-OCDA)	68.9	18.5	59.8	9.7	19.6	18.4	12.2	11.3	59.9	20.4	78.8	25.6	3.2	48.2	13.8	30.4	0.3	8.5	28.2
Snowy	Source only (Liu et al. 2020)	50.8	4.7	45.1	5.9	24.0	8.5	10.8	8.7	35.9	9.4	60.5	17.3	0.0	47.7	9.7	3.2	0.0	0.7	18.0
	AdaptSegNet (Tsai et al. 2018)	59.9	13.3	52.7	3.4	15.9	14.2	12.2	7.2	51.0	10.8	72.3	21.9	0.0	55.0	11.3	1.7	0.0	0.0	21.2
	CBST (Zou et al. 2018)	59.6	11.8	57.2	2.5	19.3	13.3	7.0	9.6	41.9	7.3	70.5	18.5	0.0	61.7	8.7	1.8	0.0	0.2	20.6
	IBN-Net (Pan et al. 2018)	61.3	13.5	57.6	3.3	14.8	17.7	10.9	6.8	39.0	6.9	71.6	22.6	0.0	56.1	13.8	20.4	0.0	0.0	21.9
	OCDA (Liu et al. 2020)	68.0	10.9	61.0	2.3	23.4	15.8	12.3	6.9	48.1	9.9	74.3	19.5	0.0	58.7	10.0	13.8	0.0	0.1	22.9
	Ours (AST-OCDA)	71.6	12.3	65.8	7.5	27.6	18.9	16.4	9.4	52.6	13.8	78.9	24.5	1.7	64.8	16.2	16.7	0.5	1.7	27.8
Cloudy	Source only (Liu et al. 2020)	47.0	8.8	33.6	4.5	20.6	11.4	13.5	8.8	55.4	25.2	78.9	20.3	0.0	53.3	10.7	4.6	0.0	0.0	20.9
	ASN (Tsai et al. 2018)	51.8	15.7	46.0	5.4	25.8	18.0	12.0	6.4	64.4	26.4	82.9	24.9	0.0	58.4	10.5	4.4	0.0	0.0	23.8
	CBST (Zou et al. 2018)	56.8	21.5	45.9	5.7	19.5	17.2	10.3	8.6	62.2	24.3	89.4	20.0	0.0	58.0	14.6	0.1	0.0	0.1	23.9
	IBN-Net (Pan et al. 2018)	60.8	18.1	50.5	8.2	25.6	20.4	12.0	11.3	59.3	24.7	84.8	24.1	12.1	59.3	13.7	9.0	0.0	1.2	26.1
	OCDA (Liu et al. 2020)	69.3	20.1	55.3	7.3	24.2	18.3	12.0	7.9	64.2	27.4	88.2	24.7	0.0	62.8	13.6	18.2	0.0	0.0	27.0
	Ours (AST-OCDA)	72.5	24.6	58.1	10.8	27.6	22.3	14.5	10.2	67.8	29.5	90.6	28.6	2.5	64.6	18.9	23.7	0.4	2.2	31.6
Overcast	Source only (Liu et al. 2020)	46.6	9.5	38.5	2.7	19.8	12.9	9.2	17.5	52.7	19.9	76.8	20.9	1.4	53.8	10.8	8.4	0.0	1.8	21.2
	ASN (Tsai et al. 2018)	59.5	24.0	49.4	6.3	23.3	19.8	8.0	14.4	61.5	22.9	74.8	29.9	0.3	59.8	12.8	9.7	0.0	0.0	25.1
	CBST (Zou et al. 2018)	58.9	26.8	51.6	6.5	17.8	17.9	5.9	17.9	60.9	21.7	87.9	22.9	0.0	59.9	11.0	2.1	0.0	0.2	24.7
	IBN-Net (Pan et al. 2018)	62.9	25.3	55.5	6.5	21.2	22.3	7.2	15.3	53.3	16.5	81.6	31.1	2.4	59.1	10.3	14.2	0.0	0.0	25.5
	OCDA (Liu et al. 2020)	73.5	26.5	62.5	8.6	24.2	20.2	8.5	15.2	61.2	23.0	86.3	27.3	0.0	64.4	14.3	13.3	0.0	0.0	27.9
	Ours (AST-OCDA)	77.6	29.7	67.8	12.7	26.8	25.6	13.8	19.5	67.3	27.5	90.5	34.2	4.2	72.8	19.6	18.4	0.6	3.5	34.0

References

Achille and Soatto (2018) Achille, A.; and Soatto, S. 2018. Emergence of Invariance and Disentanglement in Deep Representations. J. Mach. Learn. Res., 19(1): 1947–1980.
Chen et al. (2017a) Chen, L.-C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; and Yuille, A. L. 2017a. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence, 40(4): 834–848.
Chen et al. (2018) Chen, L.-C.; Zhu, Y.; Papandreou, G.; Schroff, F.; and Adam, H. 2018. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. In ECCV.
Chen et al. (2017b) Chen, Y.-H.; Chen, W.-Y.; Chen, Y.-T.; Tsai, B.-C.; Frank Wang, Y.-C.; and Sun, M. 2017b. No more discrimination: Cross city adaptation of road scene segmenters. In ICCV.
Chen et al. (2019) Chen, Z.; Zhuang, J.; Liang, X.; and Lin, L. 2019. Blending-target Domain Adaptation by Adversarial Meta-Adaptation Networks. In CVPR.
Cordts et al. (2016) Cordts, M.; Omran, M.; Ramos, S.; Rehfeld, T.; Enzweiler, M.; Benenson, R.; Franke, U.; Roth, S.; and Schiele, B. 2016. The cityscapes dataset for semantic urban scene understanding. In CVPR.
Gatys, Ecker, and Bethge (2016) Gatys, L. A.; Ecker, A. S.; and Bethge, M. 2016. Image Style Transfer Using Convolutional Neural Networks. In CVPR.
Geiger et al. (2013) Geiger, A.; Lenz, P.; Stiller, C.; and Urtasun, R. 2013. Vision meets Robotics: The KITTI Dataset. International Journal of Robotics Research (IJRR).
Gong et al. (2021) Gong, R.; Chen, Y.; Paudel, D. P.; Li, Y.; Chhatkuli, A.; Li, W.; Dai, D.; and Gool, L. V. 2021. Cluster, Split, Fuse, and Update: Meta-Learning for Open Compound Domain Adaptive Semantic Segmentation. In CVPR.
He and Geng (2008) He, Y.-B.; and Geng, Z. 2008. Active learning of causal networks with intervention experiments and optimal designs. Journal of Machine Learning Research, 9(Nov): 2523–2547.
Hoffman et al. (2018) Hoffman, J.; Tzeng, E.; Park, T.; Zhu, J.-Y.; Isola, P.; Saenko, K.; Efros, A.; and Darrell, T. 2018. Cycada: Cycle-consistent adversarial domain adaptation. In ICML.
Hoffman et al. (2016) Hoffman, J.; Wang, D.; Yu, F.; and Darrell, T. 2016. FCNs in the wild: Pixel-level adversarial and constraint-based adaptation. arXiv preprint arXiv:1612.02649.
Huang et al. (2021) Huang, J.; Guan, D.; Xiao, A.; and Lu, S. 2021. FSDR: Frequency Space Domain Randomization for Domain Generalization. In CVPR.
Huang and Belongie (2017) Huang, X.; and Belongie, S. 2017. Arbitrary Style Transfer in Real-time with Adaptive Instance Normalization. In ICCV.
Ioffe and Szegedy (2015) Ioffe, S.; and Szegedy, C. 2015. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In ICML.
Jung et al. (2020) Jung, A. B.; Wada, K.; Crall, J.; Tanaka, S.; Graving, J.; Reinders, C.; Yadav, S.; Banerjee, J.; Vecsei, G.; Kraft, A.; Rui, Z.; Borovec, J.; Vallentin, C.; Zhydenko, S.; Pfeiffer, K.; Cook, B.; Fernández, I.; De Rainville, F.-M.; Weng, C.-H.; Ayala-Acevedo, A.; Meudec, R.; Laporte, M.; et al. 2020. imgaug. https://github.com/aleju/imgaug. Online; accessed 01-Feb-2020.
Kim and Byun (2020) Kim, M.; and Byun, H. 2020. Learning Texture Invariant Representation for Domain Adaptation of Semantic Segmentation. In CVPR.
Kundu et al. (2018) Kundu, J. N.; Krishna Uppala, P.; Pahuja, A.; and Venkatesh Babu, R. 2018. Adadepth: Unsupervised content congruent adaptation for depth estimation. In CVPR.
Kundu et al. (2021) Kundu, J. N.; Kulkarni, A.; Singh, A.; Jampani, V.; and Babu, R. V. 2021. Generalize Then Adapt: Source-Free Domain Adaptive Semantic Segmentation. In ICCV.
Kundu, Lakkakula, and Babu (2019) Kundu, J. N.; Lakkakula, N.; and Babu, R. V. 2019. UM-Adapt: Unsupervised Multi-Task Adaptation Using Adversarial Cross-Task Distillation. In ICCV.
Kundu, Rajput, and Babu (2020) Kundu, J. N.; Rajput, G. S.; and Babu, R. V. 2020. VRT-Net: Real-time scene parsing via variable resolution transform. In WACV.
Kundu et al. (2020) Kundu, J. N.; Seth, S.; Rahul, M. V.; Rakesh, M.; Babu, R. V.; and Chakraborty, A. 2020. Kinematic-Structure-Preserved Representation for Unsupervised 3D Human Pose Estimation. In AAAI.
Li, Yuan, and Vasconcelos (2019) Li, Y.; Yuan, L.; and Vasconcelos, N. 2019. Bidirectional learning for domain adaptation of semantic segmentation. In CVPR.
Lian et al. (2019) Lian, Q.; Lv, F.; Duan, L.; and Gong, B. 2019. Constructing self-motivated pyramid curriculums for cross-domain semantic segmentation: A non-adversarial approach. In ICCV.
Liu et al. (2020) Liu, Z.; Miao, Z.; Pan, X.; Zhan, X.; Lin, D.; Yu, S. X.; and Gong, B. 2020. Open Compound Domain Adaptation. In CVPR.
Long, Shelhamer, and Darrell (2015) Long, J.; Shelhamer, E.; and Darrell, T. 2015. Fully convolutional networks for semantic segmentation. In CVPR.
Maaten and Hinton (2008) Maaten, L. v. d.; and Hinton, G. 2008. Visualizing data using t-SNE. Journal of machine learning research, 9(Nov): 2579–2605.
Pan et al. (2018) Pan, X.; Luo, P.; Shi, J.; and Tang, X. 2018. Two at Once: Enhancing Learning and Generalization Capacities via IBN-Net. In ECCV.
Park et al. (2020) Park, K.; Woo, S.; Shin, I.; and Kweon, I. S. 2020. Discover, Hallucinate, and Adapt: Open Compound Domain Adaptation for Semantic Segmentation. In NeurIPS.
Paszke et al. (2019) Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; Desmaison, A.; Kopf, A.; Yang, E.; DeVito, Z.; Raison, M.; Tejani, A.; Chilamkurthy, S.; Steiner, B.; Fang, L.; Bai, J.; and Chintala, S. 2019. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In NeurIPS.
Richter et al. (2016) Richter, S. R.; Vineet, V.; Roth, S.; and Koltun, V. 2016. Playing for data: Ground truth from computer games. In ECCV.
Ros et al. (2016) Ros, G.; Sellart, L.; Materzynska, J.; Vazquez, D.; and Lopez, A. M. 2016. The synthia dataset: A large collection of synthetic images for semantic segmentation of urban scenes. In CVPR.
Sakaridis et al. (2018) Sakaridis, C.; Dai, D.; Hecker, S.; and Van Gool, L. 2018. Model Adaptation with Synthetic and Real Data for Semantic Dense Foggy Scene Understanding. In ECCV.
Sakaridis, Dai, and Van Gool (2020) Sakaridis, C.; Dai, D.; and Van Gool, L. 2020. Map-Guided Curriculum Domain Adaptation and Uncertainty-Aware Evaluation for Semantic Nighttime Image Segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Simonyan and Zisserman (2015a) Simonyan, K.; and Zisserman, A. 2015a. Very deep convolutional networks for large-scale image recognition. In ICLR.
Simonyan and Zisserman (2015b) Simonyan, K.; and Zisserman, A. 2015b. Very deep convolutional networks for large-scale image recognition. In ICLR.
Stephenson et al. (2021) Stephenson, C.; Padhy, S.; Ganesh, A.; Hui, Y.; Tang, H.; and Chung, S. 2021. On the geometry of generalization and memorization in deep neural networks. In ICLR.
Suh et al. (2019) Suh, Y.; Han, B.; Kim, W.; and Lee, K. M. 2019. Stochastic Class-Based Hard Example Mining for Deep Metric Learning. In CVPR.
Torralba and Efros (2011) Torralba, A.; and Efros, A. A. 2011. Unbiased look at dataset bias. In CVPR.
Tsai et al. (2018) Tsai, Y.-H.; Hung, W.-C.; Schulter, S.; Sohn, K.; Yang, M.-H.; and Chandraker, M. 2018. Learning to adapt structured output space for semantic segmentation. In CVPR.
Ulyanov, Vedaldi, and Lempitsky (2017) Ulyanov, D.; Vedaldi, A.; and Lempitsky, V. 2017. Improved Texture Networks: Maximizing Quality and Diversity in Feed-forward Stylization and Texture Synthesis. In CVPR.
Vu et al. (2019) Vu, T.-H.; Jain, H.; Bucher, M.; Cord, M.; and Pérez, P. 2019. ADVENT: Adversarial entropy minimization for domain adaptation in semantic segmentation. In CVPR.
Yang et al. (2020a) Yang, J.; An, W.; Wang, S.; Zhu, X.; Yan, C.; and Huang, J. 2020a. Label-Driven Reconstruction for Domain Adaptation in Semantic Segmentation. In ECCV.
Yang et al. (2020b) Yang, Y.; Lao, D.; Sundaramoorthi, G.; and Soatto, S. 2020b. Phase consistent ecological domain adaptation. In CVPR.
Yang and Soatto (2020) Yang, Y.; and Soatto, S. 2020. FDA: Fourier domain adaptation for semantic segmentation. In CVPR.
Yu et al. (2020) Yu, F.; Chen, H.; Wang, X.; Xian, W.; Chen, Y.; Liu, F.; Madhavan, V.; and Darrell, T. 2020. BDD100K: A Diverse Driving Dataset for Heterogeneous Multitask Learning. In CVPR.
Zendel et al. (2018) Zendel, O.; Honauer, K.; Murschitz, M.; Steininger, D.; and Dominguez, G. F. 2018. WildDash - Creating Hazard-Aware Benchmarks. In ECCV.
Zou et al. (2019) Zou, Y.; Yu, Z.; Liu, X.; Kumar, B.; and Wang, J. 2019. Confidence regularized self-training. In ICCV.
Zou et al. (2018) Zou, Y.; Yu, Z.; Vijaya Kumar, B.; and Wang, J. 2018. Unsupervised domain adaptation for semantic segmentation via class-balanced self-training. In ECCV.

Amplitude Spectrum Transformation for Open Compound Domain Adaptive Semantic Segmentation

Abstract

1 Introduction

2 Related Works

3 Approach

3.1 Disentangling domain characteristics

3.2 Amplitude Spectrum Transformation (AST)

3.2.1 AST as an auto-encoding setup.

3.3 Usage of AST for OCDA

3.3.1 AST for cross-domain simulation (AST-Sim).

3.4 Simulate-then-Normalize for OCDA

3.4.1 Pre-adaptation training.

3.4.2 Adaptation via Simulate-then-Normalize.

4 Experiments

4.1 Discussion

5 Conclusion

Appendix A Notations

Appendix B Implementation Details

B.1 DDM analysis experiments

B.2 Experimental settings

Appendix C Analysis

C.1 Quantitative analysis

C.2 Qualitative analysis

References

Amplitude Spectrum Transformation for
Open Compound Domain Adaptive Semantic Segmentation