

1 KAIST, South Korea. {feipan, sshuh1215, iskweon77}@kaist.ac.kr   2 KENTECH, South Korea. [email protected]   3 Harvard University, USA. [email protected]

ML-BPM: Multi-teacher Learning with Bidirectional Photometric Mixing for Open Compound Domain Adaptation in Semantic Segmentation

Fei Pan 1    Sungsu Hur 1    Seokju Lee 2    Junsik Kim 3    In So Kweon 1
Abstract

Open compound domain adaptation (OCDA) considers the target domain as a compound of multiple unknown homogeneous subdomains. The goal of OCDA is to minimize the domain gap between the labeled source domain and the unlabeled compound target domain, which also benefits the model's generalization to unseen domains. Current OCDA methods for semantic segmentation adopt manual domain separation and employ a single model to simultaneously adapt to all the target subdomains. However, adapting to one target subdomain might hinder the model from adapting to other dissimilar target subdomains, which leads to limited performance. In this work, we introduce a multi-teacher framework with bidirectional photometric mixing to adapt to every target subdomain separately. First, we present an automatic domain separation to find the optimal number of subdomains. On this basis, we propose a multi-teacher framework in which each teacher model uses bidirectional photometric mixing to adapt to one target subdomain. Furthermore, we conduct an adaptive distillation to learn a student model and apply consistency regularization to improve the student's generalization. Experimental results on benchmark datasets show the efficacy of the proposed approach on both the compound domain and the open domains against existing state-of-the-art approaches.

Keywords:
Domain Adaptation, Open Compound Domain Adaptation, Semantic Segmentation, Multi-teacher Distillation

1 Introduction

Semantic segmentation is a fundamental task with applications in many areas, including robotics [34], autonomous driving [35], and medical diagnosis [16]. Recently, deep learning-based semantic segmentation approaches [12, 36, 35] have achieved remarkable progress. However, their effectiveness and generalization ability require a large amount of pixel-wise annotated data, which is expensive to collect. To reduce the cost of data collection and annotation, numerous synthetic datasets have been proposed [21, 22]. However, models trained on synthetic data tend to generalize poorly to real images. To cope with this issue, unsupervised domain adaptation (UDA) methods [26, 28, 37, 17, 25, 33] have been proposed to reduce the domain gap between the source and the target domain. Despite the efficacy of UDA techniques, most of these works rely on the strong assumption that the target data is composed of a single homogeneous domain. This assumption is often violated in real-world scenarios. In autonomous driving, for instance, the target data will likely be composed of various subdomains such as night, snow, and rain. Therefore, directly applying current UDA approaches to such target data might deliver limited performance. This paper focuses on the challenging problem of open compound domain adaptation (OCDA) in semantic segmentation, where the target domain is unlabeled and contains multiple homogeneous subdomains. The goal of OCDA is to adapt a model to a compound target domain and to further enhance the model's generalization to unseen domains.
To perform OCDA, Liu et al. [13] propose an easy-to-hard curriculum learning strategy in which samples closer to the source domain are chosen first for adaptation. However, it does not fully exploit the subdomain boundary information in the compound target domain. To explicitly consider this information, current OCDA works [8, 19] propose to separate the compound target domain into multiple subdomains based on image style information. Existing works use a manual domain separation method, and they employ a single model to simultaneously adapt to all the target subdomains. However, adapting to one target subdomain might hinder the model from adapting to other dissimilar target subdomains, which leads to limited performance. To tackle this issue, we propose a multi-teacher framework with bidirectional photometric mixing for open compound domain adaptation in semantic segmentation. First, we propose automatic domain separation to find the optimal number of subdomains and split the compound target domain accordingly. Then, we present a multi-teacher framework in which each teacher model uses bidirectional photometric mixing to adapt to one target subdomain. On this basis, we conduct adaptive distillation to learn a student model and apply a fast and short online updating with consistency regularization to improve the student's generalization to the open domains. We evaluate our approach on benchmark datasets. The proposed approach outperforms all the existing state-of-the-art OCDA techniques and the latest UDA techniques on both domain adaptation and domain generalization tasks.
The Contributions of This Work. (1) We propose automatic domain separation to find the optimal number of target subdomains; (2) we present a multi-teacher framework with bidirectional photometric mixing to reduce the domain gap between the source domain and every target subdomain separately; (3) we further conduct an adaptive distillation to learn a student model and apply consistency regularization to improve the student's generalization to the open domains.

2 Related Work

Unsupervised Domain Adaptation. Unsupervised domain adaptation (UDA) techniques are used to avoid the expensive cost of pixel-wise labeling in tasks like semantic segmentation. In UDA, adversarial learning is actively used to align the input-level style via image translation, the feature distribution, or the structured output [27, 10, 28, 17, 29]. Alternatively, self-training approaches [2, 33, 25, 37] have also recently demonstrated compelling performance in this context. Despite the improvement provided by UDA techniques, their applicability to real scenarios remains restricted by the implicit assumption that the target data contains images from a single distribution.

Domain Generalization. The purpose of domain generalization (DG) is to train a model, solely using source domain data, such that it can make reliable predictions on unseen domains. Although DG is an essential problem, only a few works have addressed it for semantic segmentation. DG for semantic segmentation shows two main streams: augmentation-based and network-based approaches. The augmentation-based approaches [30, 11] significantly augment the training data via an additional style dataset to learn domain-invariant representations. The network-based approaches [18, 4] modify the structure of the network to minimize domain-specific information (such as colors or styles) so that the resulting model mainly focuses on content-specific information. Even though DG for semantic segmentation has achieved notable progress, its performance is inevitably lower than that of several UDA methods due to the absence of target images, which provide abundant domain-specific information.

Open Compound Domain Adaptation. Liu et al. [13] first propose Open Compound Domain Adaptation (OCDA), which handles an unlabeled compound heterogeneous target domain and unseen open domains. While Liu et al. [13] propose a curriculum learning strategy, it fails to consider the specific information of each target subdomain. Current OCDA works [8, 19] propose to separate the compound target domain into multiple subdomains to handle the intra-domain gaps. Gong et al. [8] adopt domain-specific batch normalization for adaptation. Park et al. [19] utilize GAN-based image translation and adversarial training to exploit domain-invariant features from multiple subdomains.

3 Generating Optimal Subdomains

Figure 1: The part of generating optimal subdomains consists of automatic domain separation (ADS) and subdomain style purification (SSP). In ADS, we adopt the Silhouette Coefficient [23] to find the optimal number of subdomains $k^{*}$. In SSP, we calculate the mean histogram $\widetilde{H}^{c}_{m}$ for the $m$-th target subdomain $T_{m}$ according to Equation 4, and the purified subdomain is denoted as $\widetilde{T}_{m}$.

3.1 Automatic Domain Separation

Our work assumes that the domain-specific property of images comes from their styles. Existing works adopt a predefined parameter to decide the number of subdomains, which might lead to non-optimal domain adaptation performance; furthermore, they rely on a pre-trained CNN-based encoder to extract the style information for subdomain discovery. In contrast, we propose an automatic domain separation (ADS) to effectively separate the target domain using the distribution of pixel values of the target images. The proposed ADS is capable of predicting the optimal number of subdomains without relying on any predefined parameters and of extracting the image style information without relying on any pre-trained CNN models. We denote the source domain as $\mathcal{S}$ and the unlabeled compound target domain as $\mathcal{T}$. We also assume the compound target domain contains $k$ latent subdomains $\{T_{1},\dots,T_{k}\}$, which lack clear prior knowledge to distinguish themselves. The goal of ADS is to find the optimal number of subdomains $k^{*}$ and separate $\mathcal{T}$ into subdomains accordingly.

A recent work [14] suggests a simple yet effective style translation method by matching the distribution of pixel values in LAB color space. Thus, we adopt LAB space in ADS to extract the style information of the target images. Given a target RGB image $x_{t}\in\mathcal{T}$ as input, we convert it into LAB color space via $rgb2lab(x_{t})$. The three channels in LAB color space are denoted as $l$, $a$, and $b$. Then, we compute the histograms of the pixel values for all three channels: $H^{l}(x_{t})$, $H^{a}(x_{t})$, and $H^{b}(x_{t})$. The histograms are concatenated and used as the style representation of $x_{t}$. Let $s(x_{t})={H^{l}(x_{t})}^{\frown}{H^{a}(x_{t})}^{\frown}{H^{b}(x_{t})}$ denote the concatenated histograms of $x_{t}$; we take $s(x_{t})$ as the input to ADS for domain separation. However, most existing clustering algorithms require a hyperparameter to determine the number of clusters, and directly applying a naive clustering might lead to non-optimal adaptation performance. Thus, we propose to find the optimal number $k^{*}$ of subdomains using the Silhouette Coefficient (SC) [23]. Suppose the target domain $\mathcal{T}$ is separated into $k$ subdomains $\{T_{1},\dots,T_{k}\}$. For each target image $x_{t}$, we denote by $\gamma(x_{t})$ the average distance between $x_{t}$ and all other target images in the target subdomain to which $x_{t}$ belongs. Additionally, we use $\delta(x_{t})$ to represent the minimum average distance from $x_{t}$ to the target subdomains to which $x_{t}$ does not belong. Assuming $x_{t}$ belongs to the $m$-th target subdomain $T_{m}$, $\gamma(x_{t})$ and $\delta(x_{t})$ are written as

\gamma(x_{t}) = \frac{\sum_{x_{t}^{\prime}\in T_{m},\, x_{t}^{\prime}\neq x_{t}} L(s(x_{t}^{\prime}), s(x_{t}))}{|T_{m}|-1},  (1)
\delta(x_{t}) = \min_{T_{n}:\, 1\leq n\leq k,\, n\neq m}\left\{\frac{\sum_{x_{t}^{\prime}\in T_{n}} L(s(x_{t}^{\prime}), s(x_{t}))}{|T_{n}|}\right\},

where $L(s({x_{t}}^{\prime}),s(x_{t}))$ is the Euclidean distance between $s({x_{t}}^{\prime})$ and $s(x_{t})$, and $|T_{m}|$ is the number of target images in $T_{m}$. The SC score for $k$ target subdomains is given by

SC(k) = \sum_{x_{t}\in\mathcal{T}}\frac{\delta(x_{t})-\gamma(x_{t})}{\max\big(\gamma(x_{t}),\delta(x_{t})\big)}.  (2)

Hence, the goal of the proposed ADS is to find $k^{*}$ given by

k^{*} = \arg\max_{k} SC(k).  (3)
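As a concrete illustration, the following Python sketch (ours, not the authors' released code) clusters the concatenated LAB histograms with k-means and picks $k^{*}$ by the silhouette score; note that scikit-learn's silhouette_score averages the per-sample coefficients instead of summing them as in Eq. (2), which does not change the argmax. The file pattern, the number of histogram bins, and the candidate range for $k$ are assumptions.

```python
import glob
import cv2
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def lab_histogram(path, bins=64):
    """Concatenated L/A/B histograms s(x_t) used as the style descriptor."""
    img = cv2.imread(path)                          # BGR, uint8
    lab = cv2.cvtColor(img, cv2.COLOR_BGR2LAB)      # 3 channels in [0, 255]
    hists = [np.histogram(lab[..., c], bins=bins, range=(0, 256), density=True)[0]
             for c in range(3)]
    return np.concatenate(hists)

# Hypothetical location of the unlabeled compound target images.
paths = sorted(glob.glob("c_driving/train/*.png"))
styles = np.stack([lab_histogram(p) for p in paths])

best_k, best_score = None, -1.0
for k in range(2, 8):                               # assumed candidate range for k
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(styles)
    score = silhouette_score(styles, labels)        # mean silhouette coefficient
    if score > best_score:
        best_k, best_score = k, score

print(f"optimal number of subdomains k* = {best_k} (SC = {best_score:.3f})")
```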

3.2 Subdomain Style Purification

With the help of automatic domain separation, the number of abnormal samples with different styles inside each target subdomain is small. Though these abnormal samples might be useful for the model's generalization, they could also lead to negative transfer, which further hinders the model from learning domain-invariant features in a specific subdomain. To cope with this, we propose to purify the style distribution of the target images inside each subdomain. We design a subdomain style purification (SSP) module to effectively make the styles of images within the same subdomain similar. Given the $m$-th target subdomain $T_{m}$, we take the histograms in LAB color space $\{(H^{l}(x_{t}),H^{a}(x_{t}),H^{b}(x_{t}));\forall x_{t}\in T_{m}\}$ (introduced in Section 3.1) and compute the mean of the histograms for all three channels, denoted by $\widetilde{H}_{m}^{l}$, $\widetilde{H}_{m}^{a}$, and $\widetilde{H}_{m}^{b}$:

\widetilde{H}_{m}^{c} = \frac{\sum_{x_{t}\in T_{m}} H^{c}(x_{t})}{|T_{m}|}; \quad \forall c\in\{l,a,b\},  (4)

where $|T_{m}|$ represents the number of target images in $T_{m}$. We take $\{\widetilde{H}_{m}^{l},\widetilde{H}_{m}^{a},\widetilde{H}_{m}^{b}\}$ as the standard style of $T_{m}$. For each target RGB image $x_{t}\in T_{m}$, we change the style of $x_{t}$ to generate a new RGB image $\tilde{x}_{t}$ by histogram matching [20] against $\widetilde{H}_{m}^{l}$, $\widetilde{H}_{m}^{a}$, and $\widetilde{H}_{m}^{b}$ in the LAB color space. SSP is applied to all the subdomains $\{T_{1},\dots,T_{k^{*}}\}$, and we denote the purified subdomains as $\{\widetilde{T}_{1},\dots,\widetilde{T}_{k^{*}}\}$.
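A minimal sketch of the purification step, assuming 256-bin channel histograms over 8-bit LAB values; each channel is remapped so that its empirical CDF follows the CDF of the subdomain-mean histogram, which is the classic histogram-matching recipe [20]. The helper names are ours.

```python
import numpy as np

def mean_histograms(lab_images):
    """Subdomain-mean histograms of Eq. (4), one 256-bin histogram per LAB channel."""
    hists = np.zeros((3, 256))
    for img in lab_images:
        for c in range(3):
            hists[c] += np.histogram(img[..., c], bins=256, range=(0, 256))[0]
    return hists / len(lab_images)

def match_channel(channel, ref_hist):
    """Remap one uint8 channel so that its histogram follows ref_hist (CDF matching)."""
    src_hist, _ = np.histogram(channel, bins=256, range=(0, 256))
    src_cdf = np.cumsum(src_hist) / max(src_hist.sum(), 1)
    ref_cdf = np.cumsum(ref_hist) / max(ref_hist.sum(), 1)
    lut = np.interp(src_cdf, ref_cdf, np.arange(256))   # source intensity -> reference intensity
    return lut[channel].astype(np.uint8)

def purify(lab_image, mean_hists):
    """Apply SSP to one LAB image, channel by channel, toward the standard subdomain style."""
    out = lab_image.copy()
    for c in range(3):
        out[..., c] = match_channel(lab_image[..., c], mean_hists[c])
    return out
```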

4 Multi-teacher Framework

Figure 2: (a) The architecture of the proposed bidirectional photometric mixing. (b) The diagram of the multi-teacher learning framework.

4.1 Bidirectional Photometric Mixing

Through automatic domain separation and subdomain style purification (Sections 3.1 and 3.2), the compound domain $\mathcal{T}$ is automatically separated into multiple subdomains $\{\widetilde{T}_{1},\dots,\widetilde{T}_{k^{*}}\}$, where $k^{*}$ is the optimal number of subdomains. Our next step is to minimize the domain gap between the source domain and each target subdomain. A recent UDA work, DACS [25], presents a mixing-based UDA technique for semantic segmentation. Inspired by DACS, we propose bidirectional photometric mixing (BPM) to minimize the domain gap between the source domain and each target subdomain separately. Compared with DACS, the proposed BPM adopts a photometric transform to decrease the style inconsistency of the mixed images and thereby reduce the pixel-level domain gap. On this basis, BPM applies a bidirectional mixing scheme to provide a more robust regularization for training. The architecture of BPM is shown in Figure 2(a). The proposed BPM contains a domain-adaptive segmentation network $G_{m}$ and a momentum network $M_{m}$ that improves the stability of pseudo labels. Let $(x_{s},y_{s})\in\mathcal{S}$ denote a source RGB image and its pixel-wise annotation map, $x_{s}\in\mathbb{R}^{H\times W\times 3}$, $y_{s}\in\mathbb{R}^{H\times W}$, and let $\tilde{x}_{t}\in\widetilde{T}_{m}$ denote a purified target RGB image from the $m$-th purified subdomain $\widetilde{T}_{m}$, $\tilde{x}_{t}\in\mathbb{R}^{H\times W\times 3}$. Here $H$ and $W$ denote the height and width. Our BPM applies the mixing in two directions: $\mathcal{S}\to\widetilde{T}_{m}$ and $\widetilde{T}_{m}\to\mathcal{S}$.

In the mixing direction $\mathcal{S}\to\widetilde{T}_{m}$, we choose ClassMix [15] because the source image $x_{s}$ has the pixel-wise annotation map $y_{s}$. We first randomly select some classes from $y_{s}$. Then, we define $\Psi\in\{0,1\}^{H\times W}$ as a binary mask in which $\Psi(h,w)=1$ when the pixel at position $(h,w)$ of $x_{s}$ belongs to the selected classes, and $\Psi(h,w)=0$ otherwise. ClassMix directly copies the pixels of the selected classes of $x_{s}$ onto $\tilde{x}_{t}$, but the resulting mixed image contains an inconsistent style distribution, which might hinder the adaptation performance. To cope with this limitation, the proposed BPM applies a photometric transform $\Gamma$ that maps the selected source pixels to the style of the target image before copying them onto it. Let $\Psi\odot x_{s}$ denote the selected source pixels under the mask $\Psi$, where $\odot$ is element-wise multiplication. We first calculate the histograms of the selected source pixels in LAB color space and match them with $\{\widetilde{H}_{m}^{l},\widetilde{H}_{m}^{a},\widetilde{H}_{m}^{b}\}$. The translated source pixels are represented as $\Gamma(\Psi\odot x_{s})$. Then, we copy the translated source pixels onto $\tilde{x}_{t}$. We present some qualitative results in Figure 4. Note that no ground-truth annotation is available for $\tilde{x}_{t}$; thus, we feed the purified target image $\tilde{x}_{t}$ to the momentum network $M_{m}$ to generate a stable prediction map $\tilde{y}^{\prime}_{t}$ as the pseudo label. The mixing process in the direction $\mathcal{S}\to\widetilde{T}_{m}$ by BPM is given by

x_{\psi} = \Gamma(\Psi\odot x_{s}) + (\mathbf{1}-\Psi)\odot\tilde{x}_{t},  (5)
y_{\psi} = \Psi\odot y_{s} + (\mathbf{1}-\Psi)\odot\tilde{y}^{\prime}_{t},

where $x_{\psi}$ is the generated mixed image, $y_{\psi}$ is the corresponding mixed pseudo label, and $\Gamma(\cdot)$ is the photometric transform of the selected source pixels by histogram matching in LAB color space.
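A sketch of the $\mathcal{S}\to\widetilde{T}_{m}$ mixing in PyTorch, assuming single-image tensors of shape (3, H, W), long label maps of shape (H, W), and a photometric_transform callable that plays the role of $\Gamma$ (e.g., the histogram-matching helper above); ClassMix here selects roughly half of the classes present in $y_{s}$.

```python
import torch

def classmix_mask(y_s):
    """Binary mask Psi: 1 where a pixel of x_s belongs to a randomly chosen half of its classes."""
    classes = torch.unique(y_s)
    chosen = classes[torch.randperm(len(classes))[: max(1, len(classes) // 2)]]
    return torch.isin(y_s, chosen).float()

def mix_source_to_target(x_s, y_s, x_t, y_t_pseudo, photometric_transform):
    """Eq. (5): paste style-matched source pixels of the selected classes onto the target image."""
    psi = classmix_mask(y_s)                       # (H, W)
    psi_rgb = psi.unsqueeze(0)                     # (1, H, W), broadcasts over the RGB channels
    x_mix = photometric_transform(psi_rgb * x_s) + (1 - psi_rgb) * x_t
    y_mix = (psi * y_s + (1 - psi) * y_t_pseudo).long()
    return x_mix, y_mix
```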

In the mixing direction $\widetilde{T}_{m}\to\mathcal{S}$, however, ClassMix cannot be used since no ground-truth annotation is available for $\tilde{x}_{t}$. Inspired by CutMix [31], we generate another binary mask $\Phi\in\{0,1\}^{H\times W}$ by sampling a rectangular bounding box $(d_{x},d_{y},d_{w},d_{h})$ according to the uniform distribution: $d_{x}\sim U(0,W)$, $d_{y}\sim U(0,H)$, $d_{w}=W\sqrt{1-\eta}$, $d_{h}=H\sqrt{1-\eta}$, where $\eta\sim U(0,1)$ and $(H,W)$ are the height and width of the image. The binary mask $\Phi$ is formed by filling the pixel positions inside the bounding box with $1$ and all other positions with $0$. With the help of $\Phi$, we select the target pixels $\Phi\odot\tilde{x}_{t}$ and transform them into the source style. The transformed target pixels are represented by $\Delta(\Phi\odot\tilde{x}_{t})$. Then we paste them onto the source image $x_{s}$. The mixing in the direction $\widetilde{T}_{m}\to\mathcal{S}$ is given by

x_{\phi} = \Delta(\Phi\odot\tilde{x}_{t}) + (\mathbf{1}-\Phi)\odot x_{s},  (6)
y_{\phi} = \Phi\odot\tilde{y}^{\prime}_{t} + (\mathbf{1}-\Phi)\odot y_{s},

where $x_{\phi}$ is the other generated mixed image, $y_{\phi}$ is the corresponding mixed pseudo label, and $\Delta(\cdot)$ is the photometric transform of the selected target pixels by histogram matching in LAB color space.
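The $\widetilde{T}_{m}\to\mathcal{S}$ direction, sketched under the same assumptions; the rectangle is sampled as in CutMix with $\eta\sim U(0,1)$, and photometric_transform now plays the role of $\Delta$.

```python
import torch

def cutmix_mask(height, width):
    """Binary mask Phi: 1 inside a rectangle of area (1 - eta) * H * W, sampled as in CutMix."""
    eta = torch.rand(1).item()
    dw, dh = int(width * (1 - eta) ** 0.5), int(height * (1 - eta) ** 0.5)
    dx, dy = torch.randint(0, width, (1,)).item(), torch.randint(0, height, (1,)).item()
    phi = torch.zeros(height, width)
    phi[dy: min(dy + dh, height), dx: min(dx + dw, width)] = 1.0
    return phi

def mix_target_to_source(x_s, y_s, x_t, y_t_pseudo, photometric_transform):
    """Eq. (6): paste style-matched target pixels inside the box onto the source image."""
    phi = cutmix_mask(x_s.shape[-2], x_s.shape[-1])
    phi_rgb = phi.unsqueeze(0)
    x_mix = photometric_transform(phi_rgb * x_t) + (1 - phi_rgb) * x_s
    y_mix = (phi * y_t_pseudo + (1 - phi) * y_s).long()
    return x_mix, y_mix
```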

We use $(x_{\psi},y_{\psi})$, $(x_{\phi},y_{\phi})$, and $(x_{s},y_{s})$ to train the segmentation network $G_{m}$ and the momentum network $M_{m}$. We first optimize the parameters of $G_{m}$ through

\mathcal{L}_{BGM}(\theta_{m}) = \sum_{x_{s}\in\mathcal{S}}\sum_{\tilde{x}_{t}\in\widetilde{T}_{m}}\Big[\mathcal{L}_{CE}\big(G_{m}(x_{s}),y_{s}\big) + \alpha\,\mathcal{L}_{CE}\big(G_{m}(x_{\psi}),y_{\psi}\big) + \beta\,\mathcal{L}_{CE}\big(G_{m}(x_{\phi}),y_{\phi}\big)\Big],  (7)

where $\theta_{m}$ denotes the parameters of $G_{m}$, $\mathcal{L}_{CE}$ is the cross-entropy loss between the predicted segmentation maps and the ground-truth or pseudo labels, and $\alpha$ and $\beta$ are hyper-parameters that control the effect of the mixing in both directions on the loss function. To help the momentum network $M_{m}$ provide stable pseudo labels, we update its parameters, denoted by $\theta^{\prime}_{m}$, using an exponential moving average (EMA) with momentum $\lambda\in[0,1]$. After finishing training iteration $t$, $\theta^{\prime}_{m}$ is updated by

{\theta^{\prime}_{m}}^{t+1} = \lambda\,{\theta^{\prime}_{m}}^{t} + (1-\lambda)\,\theta_{m}.  (8)
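A compact sketch of one teacher update, assuming networks that take a (1, 3, H, W) batch and return per-pixel logits, and assuming mix_s2t and mix_t2s are the two mixing routines above with the photometric transforms already bound; the cross-entropy terms follow Eq. (7) and the in-place EMA follows Eq. (8).

```python
import torch
import torch.nn.functional as F

def teacher_step(G_m, M_m, optimizer, x_s, y_s, x_t, mix_s2t, mix_t2s,
                 alpha=1.0, beta=1.0, ema_momentum=0.99):
    """One optimization step of Eq. (7) followed by the EMA update of Eq. (8)."""
    with torch.no_grad():                      # stable pseudo label from the momentum network
        y_t_pseudo = M_m(x_t.unsqueeze(0)).argmax(dim=1).squeeze(0)

    x_psi, y_psi = mix_s2t(x_s, y_s, x_t, y_t_pseudo)      # S -> T_m mixing
    x_phi, y_phi = mix_t2s(x_s, y_s, x_t, y_t_pseudo)      # T_m -> S mixing

    loss = (F.cross_entropy(G_m(x_s.unsqueeze(0)), y_s.unsqueeze(0))
            + alpha * F.cross_entropy(G_m(x_psi.unsqueeze(0)), y_psi.unsqueeze(0))
            + beta * F.cross_entropy(G_m(x_phi.unsqueeze(0)), y_phi.unsqueeze(0)))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    with torch.no_grad():                      # theta'_m <- lambda * theta'_m + (1 - lambda) * theta_m
        for p_mom, p in zip(M_m.parameters(), G_m.parameters()):
            p_mom.mul_(ema_momentum).add_(p, alpha=1.0 - ema_momentum)
    return loss.item()
```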

4.2 Multi-teacher Adaptive Knowledge Distillation

We propose a multi-teacher framework followed by an adaptive knowledge distillation to align the domain gaps between the source domain and all the target subdomains. Given a purified subdomain $\widetilde{T}_{m}$, we adopt a BPM as a dedicated teacher model to minimize the domain gap between $\mathcal{S}$ and $\widetilde{T}_{m}$. We train the proposed multi-teacher framework by minimizing the loss function $\mathcal{L}_{MT}$ over all the teacher models, i.e.,

\mathcal{L}_{MT} = \sum_{m=1}^{k^{*}}\mathcal{L}_{BGM}(\theta_{m}),  (9)

where $\mathcal{L}_{BGM}(\theta_{m})$ (defined in Equation 7) is the loss function of the segmentation network $G_{m}$ in the $m$-th teacher model, and $k^{*}$ is the optimal number of subdomains. Moreover, we learn a segmentation network $G_{sd}$ as the student network via an adaptive knowledge distillation from all the teacher networks $\{G_{m}:1\leq m\leq k^{*}\}$. Given a random target image $x_{t}\in\mathcal{T}$, we feed $x_{t}$ to all the teacher models, and the student learns from a weighted average $G_{out}(x_{t})$ of all the teachers' predictions, where the weights are based on each teacher's confidence. We adopt the entropy of $G_{m}$'s prediction map $G_{m}(x_{t})\in\mathbb{R}^{H\times W\times C}$ as the confidence of the $m$-th teacher model, where $C$ is the total number of classes. Thus, the weight $w_{m}$ for the $m$-th teacher and the averaged prediction $G_{out}(x_{t})$ are formulated as

w_{m} = \frac{\sum_{h,w,c} G_{m}(x_{t})\log\big[G_{m}(x_{t})\big]}{\sum_{m^{\prime}}\sum_{h,w,c} G_{m^{\prime}}(x_{t})\log\big[G_{m^{\prime}}(x_{t})\big]},  (10)
G_{out}(x_{t}) = \sum_{m=1}^{k^{*}} w_{m}\, G_{m}(x_{t}).

On this basis, we optimize the student segmentation network $G_{sd}$ with a distillation loss $\mathcal{L}_{D}$ defined by

\mathcal{L}_{D} = \sum_{x_{t}\in\mathcal{T}}\mathcal{L}_{KL}\Big[G_{sd}(x_{t})\,\big\|\,G_{out}(x_{t})\Big],  (11)

where $\mathcal{L}_{KL}$ is the KL-divergence loss between the outputs of $G_{sd}$ and $G_{out}$. The goal of the multi-teacher adaptive knowledge distillation is to obtain the optimal parameters ${\theta_{sd}}^{*}$ of the student segmentation network $G_{sd}$ via

{\theta_{sd}}^{*} = \arg\min_{\theta_{sd}}\;\mathcal{L}_{MT} + \mathcal{L}_{D}.  (12)
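A sketch of the entropy-weighted ensembling and distillation of Eqs. (10)-(12), assuming each teacher returns logits of shape (N, C, H, W); the weights reproduce the ratio of Eq. (10) verbatim from softmax probabilities, and the KL direction and "batchmean" reduction here are implementation choices of this sketch rather than details stated in the paper.

```python
import torch
import torch.nn.functional as F

def adaptive_distillation_loss(student, teachers, x_t, eps=1e-8):
    """Eqs. (10)-(11): entropy-weighted teacher ensemble distilled into the student."""
    with torch.no_grad():
        probs = [F.softmax(G_m(x_t), dim=1) for G_m in teachers]        # each (N, C, H, W)
        # Ratio of Eq. (10), taken verbatim: sum over (h, w, c) of p * log p per teacher.
        neg_ent = torch.stack([(p * (p + eps).log()).sum(dim=(1, 2, 3)) for p in probs])  # (k*, N)
        w = neg_ent / neg_ent.sum(dim=0, keepdim=True)                  # per-teacher weights
        g_out = sum(w_m.view(-1, 1, 1, 1) * p for w_m, p in zip(w, probs))  # weighted ensemble G_out

    log_student = F.log_softmax(student(x_t), dim=1)
    # KL term of Eq. (11); this sketch pulls the student toward the ensemble distribution.
    return F.kl_div(log_student, g_out, reduction="batchmean")
```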

Online Updating with Consistency Regularization. To evaluate the generalization of our approach, we directly evaluate our student network on the open domains, as shown in Table 2(a) and Table 2(b). Additionally, after the compound domain adaptation training is finished, we also provide a fast and short online updating of the student network using consistency regularization, which further boosts the generalization of the student network. Given an RGB image $x_{o}$ from an open domain, we first match the style of $x_{o}$ to the standard styles of the existing target subdomains. The standard styles are defined as the mean histograms $\{\widetilde{H}_{m}^{l},\widetilde{H}_{m}^{a},\widetilde{H}_{m}^{b}\}$ (defined in Section 3.2). The newly transformed images are $\{x_{o}^{m};1\leq m\leq k^{*}\}$, where $x_{o}^{m}$ is generated by matching $x_{o}$ to the style of the $m$-th subdomain $\widetilde{T}_{m}$. We then conduct an online update of the student network $G_{sd}$ by

\min_{\theta_{sd}}\sum_{m=1}^{k^{*}}\mathcal{L}_{1}\big(G_{sd}(x_{o}^{m}), G_{sd}(x_{o})\big),  (13)

where $\mathcal{L}_{1}$ is the mean absolute error loss. After the online updating, we test the student network with the newly learnt parameters again on the open domains.
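A sketch of the online updating of Eq. (13), assuming $k^{*}$ stored subdomain-mean histograms and a restyle helper (e.g., the histogram-matching routine above) that maps the open-domain image to each subdomain style; the $\mathcal{L}_{1}$ consistency here is taken between softmax outputs.

```python
import torch.nn.functional as F

def online_update(student, optimizer, x_o, subdomain_styles, restyle):
    """Eq. (13): prediction consistency across the k* subdomain-restyled copies of x_o."""
    pred_o = F.softmax(student(x_o), dim=1)
    loss = 0.0
    for mean_hists in subdomain_styles:            # one standard style per target subdomain
        x_o_m = restyle(x_o, mean_hists)           # match x_o to the m-th subdomain style
        pred_m = F.softmax(student(x_o_m), dim=1)
        loss = loss + F.l1_loss(pred_m, pred_o)    # mean absolute error between the two predictions
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```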

5 Experiments

5.1 Experimental Setup

5.1.1 Dataset.

In this work, we adopt the synthetic datasets GTA5 [21] and SYNTHIA [22] as the source domains. GTA5 contains 24,966 annotated images of 1,914×1,052 resolution. SYNTHIA consists of 9,400 images with 1,280×760 resolution. Furthermore, we adopt C-Driving [13] as the compound target domain, which contains real images of 1,280×720 resolution collected under different weather conditions. Following the settings of previous works [13, 19, 8], we use the 14,697 rainy, snowy, and cloudy images as the compound target domain and the 627 overcast images as the open domain. We also use ACDC [24] as another compound target domain; the evaluation results are shown in the supplementary material. We further adopt Cityscapes [5], KITTI [1], and WildDash [32] as open domains to evaluate the generalization ability of the proposed approach.

5.1.2 Implementation Details.

We adopt DeepLab-V2 [3] with a ResNet-101 backbone [9] pre-trained on ImageNet [6]. All the images from the target domain are rescaled to 1,280×720 and then randomly cropped to 640×360. The batch size is set to 2, and the total number of training iterations is $2.5\times 10^{5}$. We adopt stochastic gradient descent to optimize all the segmentation networks, with a weight decay of $5\times 10^{-4}$ and a momentum of 0.9. The learning rate starts from an initial value of $2.5\times 10^{-4}$ and is decreased by polynomial decay with an exponent of 0.9. The momentum network has the same architecture as the segmentation network. Existing mixing techniques include CutMix [31], CowMix [7], and ClassMix [15]. We adopt ClassMix for the mixing direction from the source domain to the target domain and CutMix for the mixing direction from the target domain to the source domain. Both $\alpha$ and $\beta$ are set to 1 in the experiments. To increase the robustness of the segmentation model, we apply data augmentations, including flipping, color jittering, and Gaussian blurring, to the mixed images.
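For completeness, a small sketch of the polynomial learning-rate decay described above (base rate $2.5\times 10^{-4}$, power 0.9, $2.5\times 10^{5}$ iterations); where exactly the schedule is applied per iteration is our assumption.

```python
def poly_lr(base_lr, iteration, max_iterations=250_000, power=0.9):
    """Polynomial decay: lr = base_lr * (1 - iter / max_iter) ** power."""
    return base_lr * (1.0 - iteration / max_iterations) ** power

# For example, halfway through training the learning rate is roughly 1.34e-4.
print(poly_lr(2.5e-4, 125_000))
```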

Table 1: The performance comparison of mean IoU on the compound domain. Our approach is compared with the state-of-the-art UDA and OCDA approaches on the (a) GTA5→C-Driving and (b) SYNTHIA→C-Driving benchmarks with ResNet-101 as the backbone. Note that mIoU16 and mIoU11 denote the mean IoU over 16 and 11 classes, respectively.
(a) GTA5\toC-Driving
Method Type road sidewalk building wall fence pole light sign veg terrain sky person rider car truck bus train mbike bike mIoU
Source - 73.4 12.5 62.8 6.0 15.8 19.4 10.9 21.1 54.6 13.9 76.7 34.5 12.4 68.1 31.0 12.8 0.0 10.1 1.9 28.3
CDAS [13] OCDA 79.1 9.4 67.2 12.3 15.0 20.1 14.8 23.8 65.0 22.9 82.6 40.4 7.2 73.0 27.1 18.3 0.0 16.1 1.5 31.4
CSFU [8] OCDA 80.1 12.2 70.8 9.4 24.5 22.8 19.1 30.3 68.5 28.9 82.7 47.0 16.4 79.9 36.6 18.8 0.0 13.5 1.4 34.9
SAC [2] UDA 81.5 23.8 72.0 10.3 27.8 23.0 18.2 34.1 70.3 27.9 87.8 45.0 16.9 77.6 38.5 19.8 0.0 14.0 2.7 36.4
DACS [25] UDA 81.9 24.0 72.2 11.9 28.6 24.2 18.3 35.4 71.8 28.0 87.7 44.9 15.6 78.4 39.1 24.9 0.1 6.9 1.9 36.6
DHA[19] OCDA 79.9 14.5 71.4 13.1 32.0 27.1 20.7 35.3 70.5 27.5 86.4 47.3 23.3 77.6 44.0 18.0 0.1 13.7 2.5 37.1
Ours OCDA 85.3 26.2 72.8 10.6 33.1 26.9 24.6 39.4 70.8 32.5 87.9 47.6 29.2 84.8 46.0 22.8 0.2 16.7 5.8 40.2
(b) SYNTHIA\toC-Driving
Method Type road sidewalk building wall fence pole light sign veg sky person rider car bus mbike bike mIoU16 mIoU11
Source - 33.9 11.9 42.5 1.5 0.0 14.7 0.0 1.3 56.8 76.5 13.3 7.4 57.8 12.5 2.1 1.6 20.9 28.1
CDAS [13] OCDA 54.5 13.0 53.9 0.8 0.0 18.2 13.0 13.2 60.0 78.9 17.6 3.1 64.2 12.2 2.1 1.5 25.3 34.0
CSFU [8] OCDA 69.6 12.2 50.9 1.3 0.0 16.7 12.1 13.6 56.2 75.8 20.0 4.8 68.2 14.1 0.9 1.2 26.1 34.8
SAC [2] UDA 69.8 13.4 56.2 1.7 0.0 20.0 9.6 13.7 52.5 78.1 29.1 15.5 68.9 10.9 3.2 1.2 27.7 36.3
DACS [25] UDA 62.1 15.2 48.8 0.3 0.0 19.7 10.3 9.6 57.8 84.4 35.2 18.9 67.8 16.0 2.2 1.7 28.1 36.5
DHA [19] OCDA 67.5 2.5 54.6 0.2 0.0 25.8 13.4 27.1 58.0 83.9 36.0 6.1 71.6 28.9 2.2 1.8 29.9 37.6
Ours OCDA 73.4 15.2 57.1 1.8 0.0 23.2 13.5 23.9 59.9 83.3 40.3 22.3 72.2 23.3 2.3 2.2 32.1 40.0

5.2 Results

To demonstrate the efficacy of our approach, we conduct experiments on the benchmark datasets of GTA5\toC-Driving and SYNTHIA\toC-Driving. We first compare our approach with the existing state-of-the-art OCDA approaches: CDAS [13], DHA [19], and CSFU [8]. Furthermore, we compare the proposed approach with the current state-of-the-art UDA approaches SAC [2] and DACS [25].

5.2.1 Compound Domain Adaptation.

We first compare the performance of our approach with existing state-of-the-art OCDA and UDA approaches on GTA5→C-Driving, as shown in Table 1a. All the results are generated on the validation set of C-Driving. Training only with the source data leads to 28.3% mean IoU over the 19 classes. As the first work on OCDA, CDAS achieves 31.4% mean IoU over all the classes. CSFU generates 34.9% mean IoU, and DHA produces 37.1%. This is because both CSFU and DHA adopt a subdomain separation step and a GAN framework, and DHA uses a more effective multi-discriminator to minimize the domain gaps. The latest UDA approaches DACS and SAC reach 36.6% and 36.4%, outperforming both CDAS and CSFU. The reason is that both DACS and SAC adopt various self-supervision techniques to minimize the domain gaps, which proves to be more effective than GAN-based approaches. In comparison, the proposed approach demonstrates its effectiveness on this benchmark with 40.2% mean IoU over all classes.

We present the experimental results on SYNTHIA→C-Driving in Table 1b. We consider the 11 classes for the final evaluation. The proposed method achieves 40.0% mean IoU over the 11 classes. Among the other OCDA approaches, DHA achieves 37.6%, CSFU produces 34.8%, and CDAS generates 34.0% mean IoU. Moreover, the UDA approaches DACS and SAC generate 36.5% and 36.3% mean IoU. Our approach outperforms all the existing OCDA approaches and the latest UDA approaches.

Table 2: The comparison of mean IoU on the open domains. The domain generalization (DG) model is trained only with the source domain. All the models are tested on the validation sets of C-Driving Open (O), Cityscapes (C), KITTI (K), and WildDash (W). We also present the scores of our approach without online updating (w/o Updating) and with online updating (w/ Updating).
(a) GTA5 as the source domain.
GTA5
Method Type O C K W Avg
CSFU [8] OCDA 38.9 38.6 37.9 29.1 36.1
DACS [25] UDA 39.7 37.0 40.2 30.7 36.9
RobustNet [4] DG 38.1 38.3 40.5 30.8 37.0
DHC [19] OCDA 39.4 38.8 40.1 30.9 37.5
Ours (w/o Updating) OCDA 41.8 40.9 44.0 32.9 40.0
Ours (w/ Updating) OCDA 42.5 41.7 44.3 34.6 40.8
(b) SYNTHIA as the source domain.
SYNTHIA
Method Type O C K W Avg
CSFU [8] OCDA 36.2 34.9 32.4 27.6 32.8
DACS [25] UDA 36.8 37.0 37.4 28.8 35.0
RobustNet [4] DG 37.1 38.3 40.1 29.6 36.3
DHC [19] OCDA 38.9 38.0 40.6 30.0 36.9
Ours (w/o Updating) OCDA 41.5 40.3 42.7 30.1 38.7
Ours (w/ Updating) OCDA 42.6 41.1 43.4 30.9 39.5

5.2.2 Generalization to the Open Domains.

We also evaluate the domain generalization ability of the proposed approach against existing UDA and OCDA approaches. The results are presented in Table 2(a) and 2(b). Our work is compared with the latest domain generalization (DG) approach RobustNet [4]. For all the UDA and OCDA approaches, we first train them with the labeled source and the unlabeled target images, and we then evaluate their performance on the validation sets of the open domains. RobustNet generates 37.0% mean IoU in Table 2(a) and 36.3% mean IoU in Table 2(b). Note that RobustNet only requires labeled source data during training. This shows that the DG approach is more effective in generalizing to the open domains than the existing UDA and OCDA approaches DACS and CSFU. Without any online updating, our approach achieves 40.0% mean IoU in Table 2(a) and 38.7% mean IoU in Table 2(b). Our approach outperforms all the UDA approaches, OCDA approaches, and the DG approach listed in the table. The reason might be that our approach is more powerful at learning domain-invariant features, which improves the generalization of the model toward novel domains. The performance gain of our approach with updating further shows the efficacy of the proposed online updating with consistency regularization.

5.3 Ablation Study

5.3.1 Generating Optimal Subdomains.

Figure 3: We conduct the ablation study on the proposed automatic domain separation using GTA5→C-Driving with a ResNet-101 backbone. (a) The scatterplot shows the correlation between our approach's mean IoU and the Silhouette Coefficient score. (b) The mean IoU of our approach with different numbers of subdomains $k$. (c) Sample images from the subdomains of the C-Driving dataset.

We first conduct the ablation study on the correlation between the mean IoU of the proposed approach and the Silhouette Coefficient (SC) score of the subdomain separation in Figure 3(a). It shows a positive correlation, which means that the SC score effectively finds the optimal number of subdomains for the compound target domain. Moreover, we evaluate the mean IoU with different numbers of subdomains $k$ in Figure 3(b). Finally, we set $k=3$ and present sample images from the subdomains of the C-Driving dataset in Figure 3(c). We also evaluate the efficacy of the proposed subdomain style purification (SSP) in Table 3(b). Without SSP, the performance drops by 0.5% mean IoU.

Table 3: The ablation study on the efficacy of the components of our model. (a) We compare with the baseline model DACS [25] and evaluate the performance gain of the bidirectional photometric mixing and the multi-teacher learning. (b) We evaluate the performance drop of our model when removing each component. Our model is trained on GTA5→C-Driving with a ResNet-101 backbone and tested on the C-Driving validation set.
(a) The performance gain.
GTA5\rightarrowC-Driving
Model mIoU
DACS [25] 36.6
DACS + Multi-teacher Learning 39.1
DACS + Bidirectional Mixing 37.3
DACS + Photometric Mixing ($\Gamma,\Delta$) 37.4
DACS + Bidirectional Photometric Mixing 37.8
Ours 40.2
(b) The performance drop.
GTA5\rightarrowC-Driving
Configuration mIoU Gap
w/o Multi-teacher Learning 38.0 -2.2
w/o Mixing on One Direction ($\alpha=0$) 38.5 -1.7
w/o Mixing on One Direction ($\beta=0$) 38.9 -1.3
w/o Subdomain Style Purification 39.7 -0.5
w/o Adaptive Distillation 39.6 -0.6
Full Framework 40.2 -

5.3.2 Multi-teacher and Single Model.

The ablation study on the multi-teacher learning of our proposed approach is presented in Table 3(a) and Table 3(b). Applying a single model in our approach delivers 38.0% mean IoU, leading to the most significant drop of 2.2%, as shown in Table 3(b). We further combine DACS with multi-teacher learning, and the mean IoU rises from 36.6% to 39.1%. We argue that utilizing a single model is less effective than the multi-teacher models, because adapting to one subdomain might hinder the single model from adapting to other dissimilar subdomains. Thus, we employ a multi-teacher framework in which each teacher adapts to one subdomain separately, and the multiple teachers together provide a comprehensive guide for the student model to adapt to all the target subdomains. We further present qualitative results of the target image prediction maps from each subdomain for the multi-teacher and the single-teacher models in Figure 5.

Figure 4: We compare the mixed images from the source domain to the target domain. (a) the source image; (b) the target image; (c) the mixed images without using photometric transform, and the style inconsistency exists; (d) the mixed images using photometric transform, and the style inconsistency is mitigated; (e) the mask to crop the source image.

5.3.3 Bidirectional Photometric Mixing.

We further conduct the ablation study for the bidirectional photometric mixing (BPM), shown in Table 3(a) and Table 3(b). Our model is trained on GTA5→C-Driving with a ResNet-101 backbone and tested on the C-Driving validation set. Setting $\alpha=0$ to remove the mixing in one direction (ClassMix) drops the mean IoU by 1.7%, while setting $\beta=0$ to remove the other mixing direction (CutMix) decreases it by 1.3%. This suggests that ClassMix contributes slightly more to the final performance. We also use the baseline model DACS for an in-depth analysis. Adding the bidirectional photometric mixing to DACS increases the performance from 36.6% to 37.8%, as shown in Table 3(a); combining DACS with only bidirectional mixing raises the mean IoU to 37.3%; adding only the photometric transform on mixing (using $\Gamma$ and $\Delta$) to DACS reaches 37.4%. The reason is that DACS utilizes a simple mixing method that contains only one direction and generates mixed images with style inconsistency inside. In contrast, we propose a bidirectional mixing scheme and apply the photometric transform to mitigate the style inconsistency in the generated images. We present qualitative results illustrating this issue in Figure 4. The style inconsistency is mitigated in Figure 4(d) compared with Figure 4(c) for the mixing direction from the source domain to the target domain.

Figure 5: We present the predicted segmentation maps of the target images from every target subdomain. The maps in the second row are generated using a single model. The maps in the third row are generated using the multi-teacher models.

6 Conclusion

Open compound domain adaptation (OCDA) considers the target domain as a compound of multiple unknown subdomains. In this work, we first propose automatic domain separation to find the optimal number of subdomains. Then we design a multi-teacher framework with bidirectional photometric mixing to align the domain gap between the source domain and the compound target domain, and we further evaluate its generalization to novel domains. Our current work focuses only on the segmentation task, and we leave the study of other visual tasks for future research.

7 Acknowledgment

This work was supported by the Korea Institute of Energy Technology Evaluation and Planning (KETEP) and the Ministry of Trade, Industry & Energy (MOTIE) of the Republic of Korea (No. 20224000000100).

References

  • [1] Abu Alhaija, H., Mustikovela, S.K., Mescheder, L., Geiger, A., Rother, C.: Augmented reality meets computer vision: Efficient data generation for urban driving scenes. IJCV 126(9), 961–972 (2018)
  • [2] Araslanov, N., Roth, S.: Self-supervised augmentation consistency for adapting semantic segmentation. In: CVPR. pp. 15384–15394 (2021)
  • [3] Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. PAMI 40(4), 834–848 (2017)
  • [4] Choi, S., Jung, S., Yun, H., Kim, J.T., Kim, S., Choo, J.: Robustnet: Improving domain generalization in urban-scene segmentation via instance selective whitening. In: CVPR. pp. 11580–11590 (2021)
  • [5] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR. pp. 3213–3223 (2016)
  • [6] Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: CVPR. pp. 248–255. IEEE (2009)
  • [7] French, G., Oliver, A., Salimans, T.: Milking cowmask for semi-supervised image classification. arXiv preprint arXiv:2003.12022 (2020)
  • [8] Gong, R., Chen, Y., Paudel, D.P., Li, Y., Chhatkuli, A., Li, W., Dai, D., Van Gool, L.: Cluster, split, fuse, and update: Meta-learning for open compound domain adaptive semantic segmentation. In: CVPR. pp. 8344–8354 (2021)
  • [9] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR. pp. 770–778 (2016)
  • [10] Hoffman, J., Tzeng, E., Park, T., Zhu, J.Y., Isola, P., Saenko, K., Efros, A., Darrell, T.: Cycada: Cycle-consistent adversarial domain adaptation. In: ICML. pp. 1989–1998. PMLR (2018)
  • [11] Huang, J., Guan, D., Xiao, A., Lu, S.: Fsdr: Frequency space domain randomization for domain generalization. In: CVPR. pp. 6891–6902 (2021)
  • [12] Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W.: Ccnet: Criss-cross attention for semantic segmentation. In: CVPR. pp. 603–612 (2019)
  • [13] Liu, Z., Miao, Z., Pan, X., Zhan, X., Lin, D., Yu, S.X., Gong, B.: Open compound domain adaptation. In: CVPR. pp. 12406–12415 (2020)
  • [14] Ma, H., Lin, X., Wu, Z., Yu, Y.: Coarse-to-fine domain adaptive semantic segmentation with photometric alignment and category-center regularization. In: CVPR. pp. 4051–4060 (2021)
  • [15] Olsson, V., Tranheden, W., Pinto, J., Svensson, L.: Classmix: Segmentation-based data augmentation for semi-supervised learning. In: WACV. pp. 1369–1378 (2021)
  • [16] Ouyang, C., Biffi, C., Chen, C., Kart, T., Qiu, H., Rueckert, D.: Self-supervision with superpixels: Training few-shot medical image segmentation without annotation. In: ECCV. pp. 762–780. Springer (2020)
  • [17] Pan, F., Shin, I., Rameau, F., Lee, S., Kweon, I.S.: Unsupervised intra-domain adaptation for semantic segmentation through self-supervision. In: CVPR. pp. 3764–3773 (2020)
  • [18] Pan, X., Luo, P., Shi, J., Tang, X.: Two at once: Enhancing learning and generalization capacities via ibn-net. In: ECCV. pp. 464–479 (2018)
  • [19] Park, K., Woo, S., Shin, I., Kweon, I.S.: Discover, hallucinate, and adapt: Open compound domain adaptation for semantic segmentation. In: NeurIPS (2020)
  • [20] Rafael, C.G., Richard, E.W., Steven, L.E., Woods, R., Eddins, S.: Digital image processing using MATLAB. Tata McGraw-Hill (2010)
  • [21] Richter, S.R., Vineet, V., Roth, S., Koltun, V.: Playing for data: Ground truth from computer games. In: ECCV. pp. 102–118. Springer (2016)
  • [22] Ros, G., Sellart, L., Materzynska, J., Vazquez, D., Lopez, A.: The SYNTHIA Dataset: A large collection of synthetic images for semantic segmentation of urban scenes. In: CVPR (2016)
  • [23] Rousseeuw, P.J.: Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. JCAM 20, 53–65 (1987)
  • [24] Sakaridis, C., Dai, D., Van Gool, L.: Acdc: The adverse conditions dataset with correspondences for semantic driving scene understanding. In: ICCV. pp. 10765–10775 (2021)
  • [25] Tranheden, W., Olsson, V., Pinto, J., Svensson, L.: Dacs: Domain adaptation via cross-domain mixed sampling. In: WACV. pp. 1379–1389 (2021)
  • [26] Tsai, Y.H., Hung, W.C., Schulter, S., Sohn, K., Yang, M.H., Chandraker, M.: Learning to adapt structured output space for semantic segmentation. In: CVPR (2018)
  • [27] Tsai, Y.H., Hung, W.C., Schulter, S., Sohn, K., Yang, M.H., Chandraker, M.: Learning to adapt structured output space for semantic segmentation. In: CVPR. pp. 7472–7481 (2018)
  • [28] Vu, T.H., Jain, H., Bucher, M., Cord, M., Pérez, P.: Advent: Adversarial entropy minimization for domain adaptation in semantic segmentation. In: CVPR (2019)
  • [29] Wang, Z., Yu, M., Wei, Y., Feris, R., Xiong, J., Hwu, W.m., Huang, T.S., Shi, H.: Differential treatment for stuff and things: A simple unsupervised domain adaptation method for semantic segmentation. In: CVPR. pp. 12635–12644 (2020)
  • [30] Yue, X., Zhang, Y., Zhao, S., Sangiovanni-Vincentelli, A., Keutzer, K., Gong, B.: Domain randomization and pyramid consistency: Simulation-to-real generalization without accessing target domain data. In: ICCV. pp. 2100–2110 (2019)
  • [31] Yun, S., Han, D., Oh, S.J., Chun, S., Choe, J., Yoo, Y.: Cutmix: Regularization strategy to train strong classifiers with localizable features. In: CVPR. pp. 6023–6032 (2019)
  • [32] Zendel, O., Honauer, K., Murschitz, M., Steininger, D., Dominguez, G.F.: Wilddash-creating hazard-aware benchmarks. In: ECCV. pp. 402–416 (2018)
  • [33] Zhang, P., Zhang, B., Zhang, T., Chen, D., Wang, Y., Wen, F.: Prototypical pseudo label denoising and target structure learning for domain adaptive semantic segmentation. CVPR (2021)
  • [34] Zhang, X., Zhou, X., Lin, M., Sun, J.: Shufflenet: An extremely efficient convolutional neural network for mobile devices. In: CVPR. pp. 6848–6856 (2018)
  • [35] Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR. pp. 2881–2890 (2017)
  • [36] Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H., et al.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR. pp. 6881–6890 (2021)
  • [37] Zou, Y., Yu, Z., Kumar, B., Wang, J.: Unsupervised domain adaptation for semantic segmentation via class-balanced self-training. In: ECCV. pp. 289–305 (2018)

Supplemental Material:
ML-BPM: Multi-teacher Learning with Bidirectional Photometric Mixing for Open Compound Domain Adaptation in Semantic Segmentation


Fei Pan Sungsu Hur Seokju Lee Junsik Kim In So Kweon

8 Subdomain Style Purification and the t-SNE Visualization

Figure 6: (a) presents the noisy samples from Subdomain 2 of the C-Driving dataset before subdomain style purification (before SSP) and after subdomain style purification (after SSP). (b) shows the t-SNE visualization of the concatenated histograms of the C-Driving dataset in LAB color space when $k=3$.

As mentioned in Section 3.2, it is hard to guarantee that the images from the same target subdomain have the same style. In other words, small domain gaps might still result from the various image styles within each subdomain. We propose subdomain style purification to unify the styles of the target data belonging to the same subdomain so that the domain gaps among these images can be further reduced. We provide a visualization of the sample images transformed by subdomain style purification (SSP) from Subdomain 2 in Figure 6(a). Note that the images before SSP in Figure 6(a) have styles different from the standard style, and they are transformed into the standard style with the help of histogram matching in the LAB color space. We further set $k=3$ and present the t-SNE visualization of the concatenated histograms of the C-Driving images in LAB color space.

The reason for subdomain style purification (SSP). With the help of automatic domain separation, the number of abnormal samples with different styles is small. Though these abnormal samples might be helpful for the model's generalization, they could also lead to negative transfer, which further hinders the model from learning domain-invariant features in a specific subdomain. On GTA5→C-Driving, we observe a 0.5% mIoU drop on average over all the subdomains when SSP is not used, as shown in Table 3(b).

9 ACDC Dataset

We also evaluate the proposed approach on the ACDC dataset [24]. ACDC contains real-world images of road scenes in diverse weather conditions, including fog, nighttime, rain, and snow. We consider the 2,800 images of fog, nighttime, and rain from the training split of ACDC as the compound domain; the 400 snow images with pixel-wise annotations from the ACDC training split are taken as the open domain. The final performance is evaluated on the validation set of ACDC, which contains 306 images with ground-truth maps.

Table 4: The performance comparison of mean IoU on the compound target domain (fog, nighttime, and rain) and the open domain (snow) of ACDC. Our approach is compared with the state-of-the-art UDA and OCDA approaches on the (a) GTA5→ACDC and (b) SYNTHIA→ACDC benchmarks with ResNet-101 as the backbone.
(a) GTA5\toACDC
Method Type road sidewalk building wall fence pole light sign veg terrain sky person rider car truck bus train mbike bike mIoU (Compound) mIoU (Open)
Source - 43.6 2.5 46.2 5.2 0.1 30.3 15.3 16.3 56.9 0.0 71.5 16.3 13.7 51.4 0.0 15.1 0.0 1.4 4.2 20.5 27.1
CDAS[13] OCDA 53.2 5.9 56.1 10.1 2.6 22.0 37.1 11.4 53.9 23.5 71.3 27.6 14.6 47.5 16.8 19.5 0.0 3.2 3.8 25.3 29.1
CSFU[8] OCDA 47.0 4.1 53.0 13.9 1.0 23.2 41.2 18.8 55.8 23.2 72.1 31.5 10.8 69.1 26.4 27.8 0.2 1.7 2.6 27.6 30.5
SAC[2] UDA 42.6 4.2 57.6 11.9 3.8 23.0 49.7 23.8 63.6 31.9 76.0 30.3 10.5 65.3 23.6 23.1 0.1 0.7 3.2 28.7 33.6
DACS[25] UDA 48.9 9.7 54.5 16.8 5.7 22.7 42.0 22.9 61.3 29.7 73.7 32.2 11.6 63.3 23.2 26.5 0.0 1.2 5.2 29.0 34.8
DHA[19] OCDA 49.8 5.2 59.1 10.2 3.1 25.6 47.8 27.9 65.1 32.0 75.2 29.0 12.2 61.5 20.5 32.4 0.0 1.0 2.0 29.5 37.5
Ours OCDA 48.4 5.0 58.2 25.3 10.0 35.1 50.4 26.7 66.8 33.3 75.8 32.1 16.7 73.5 16.8 26.6 0.2 3.9 4.6 32.1 41.6
(b) SYNTHIA\toACDC
Method Type road sidewalk building wall fence pole light sign veg sky person rider car bus mbike bike mIoU16 (Compound) mIoU16 (Open)
Source - 45.2 0.2 36.7 1.7 0.6 25.7 4.0 5.6 46.6 64.3 16.9 11.3 39.6 16.5 0.6 1.9 19.8 20.5
CDAS[13] OCDA 61.3 0.7 60.1 11.7 1.8 28.4 18.8 23.5 48.6 28.9 16.5 15.9 69.2 18.4 5.4 5.6 25.9 23.3
CSFU[8] OCDA 62.6 0.3 60.3 8.6 1.8 21.3 20.7 29.1 44.5 22.1 34.5 19.0 71.1 23.2 4.4 4.3 26.7 24.8
SAC[2] UDA 69.8 0.4 56.2 1.7 0.0 20.0 12.6 13.7 52.5 78.1 29.1 15.5 68.9 20.9 3.2 1.2 27.7 25.4
DACS[25] UDA 55.6 1.1 55.7 0.1 0.7 25.8 31.7 18.3 65.5 53.7 31.1 16.6 69.2 22.5 2.9 3.1 28.3 27.0
DHA[19] OCDA 55.5 1.1 57.2 0.7 0.8 26.6 22.7 24.6 65.8 58.4 29.6 23.9 70.8 19.5 5.4 4.2 29.2 27.3
Ours OCDA 66.7 1.7 62.4 10.8 1.4 30.8 23.9 29.2 62.6 69.0 31.6 14.6 71.8 22.9 6.8 4.5 31.9 29.1

We present the performance comparison of mean IoU in Table 4. For the compound target domain of ACDC (fog, nighttime, rain), we achieve 32.1% mean IoU on GTA5→ACDC and 31.9% mean IoU on SYNTHIA→ACDC, outperforming all the listed UDA and OCDA approaches. We also evaluate the generalization of our approach compared with other works. After finishing the compound domain adaptation training, all the models are directly tested on the open domain of ACDC (snow). Note that the snow images have never been used in training. On the GTA5→ACDC and SYNTHIA→ACDC benchmarks, our approach achieves 41.6% and 29.1% mean IoU, respectively. This demonstrates that our approach has better generalization ability toward novel domains (snow).

Table 5: The evaluation on GTA5\toC-Driving.
(a) ImageNet pre-trained VGG-16 Backbone
Method Compound (C) Open (O) Average
Rainy Snowy Cloudy Overcast C C+O
CDAS[13] 23.8 25.3 29.1 31.0 26.1 27.3
CSFU[8] 24.5 27.5 30.1 31.4 27.7 29.4
DACS [25] 26.8 29.2 35.1 35.9 30.4 31.8
DHA[19] 27.1 30.4 35.5 36.1 32.0 32.3
Ours 34.5 35.8 39.9 40.1 36.7 37.5
(b) Mixing Algorithm Comparison
Algorithm BPM (Ours) ClassMix [15] CutMix [31] CowMix [7]
mIoU 40.2 39.1 37.6 37.4

10 The Practicability of Our Approach

Though we use multiple teacher models for training, our approach remains practical for two reasons: the teacher models are trained simultaneously, and only a single student model obtained from distillation is needed for inference. The size of the student model is not affected by the number of subdomains. With $k^{*}$ subdomains, the FLOPS and the number of parameters of our multi-teacher model are $327.08\times 10^{9}$ and $43.8\times k^{*}\times 10^{6}$, respectively. After the adaptive knowledge distillation, the FLOPS and the number of parameters of our student model are $327.08\times 10^{9}$ and $43.8\times 10^{6}$.

The VGG-16 backbone and different mixup algorithms. We also use a VGG-16 backbone network for evaluation. The experimental results on GTA5→C-Driving in Table 5(a) demonstrate the effectiveness of our approach against existing works with ImageNet pre-trained VGG-16 as the backbone. We also provide a comparison with existing domain mixup algorithms in the same setting, including ClassMix [15], CutMix [31], and CowMix [7].

The online updating on the open domains. Our online updating is conducted on each sample from the open domain, so the setting remains domain generalization at the testing stage. Our student model $G_{sd}$ is trained through the adaptive distillation from all the subdomain segmentation models $\{G_{m}\}_{m=1}^{k^{*}}$ (Eqs. (10, 11)). Each $G_{m}$ is optimized by Eq. (7) with the help of the mean teacher $M_{m}$, following DACS [25]. We also tried using $M_{m}$ instead of $G_{m}$ for distillation but did not observe a significant performance gain.

The reason of using bidirectional mixing. The performance of using pseudo-labels of target data for ClassMix. Using the photometric transform Δ\Delta (Eq.(6)) on target-to-source (t2s) mixing, we enforce the consistency of prediction between the target and the mixed image, which are taken as additional augmentation to improve the model’s performance (Table.3(a,b)). With the experiment on GTA5\toC-Driving, we get 40.1%40.1\% of mIoU on using pseudo-labels of target data for ClassMix on target-to-source mixing, similar to ours 40.2%40.2\% (Table 1(a)). Table 5 (b) shows that our BPM outperforms existing mixing algorithms ClassMix [15], CutMix [31], and CowMix [7].