
Continual Unsupervised Domain Adaptation for Semantic Segmentation

Joonhyuk Kim1*       Sahng-Min Yoo2*       Gyeong-Moon Park3       Jong-Hwan Kim2
1Seoul National University, Seoul, Republic of Korea
2KAIST, Daejeon, Republic of Korea
3Kyung Hee University, Yongin, Republic of Korea
1[email protected], 2{smyoo, johkim}@rit.kaist.ac.kr, 3[email protected]
*Equal contribution.
Abstract

Unsupervised Domain Adaptation (UDA) for semantic segmentation has been favorably applied to real-world scenarios in which pixel-level labels are hard to obtain. Most existing UDA methods assume that all target data are introduced simultaneously, yet in the real world the data are usually presented sequentially. Moreover, Continual UDA, which deals with more practical scenarios involving multiple target domains in a continual learning setting, has not been actively explored. In this light, we propose Continual UDA for semantic segmentation based on a newly designed Expanding Target-specific Memory (ETM) framework. Our ETM framework contains a Target-specific Memory (TM) for each target domain to alleviate catastrophic forgetting. Furthermore, the proposed Double Hinge Adversarial (DHA) loss leads the network to better overall UDA performance. Our design of the TM and the training objectives lets the semantic segmentation network adapt to the current target domain while preserving the knowledge learned on previous target domains. The model with the proposed framework outperforms other state-of-the-art models in continual learning settings on standard benchmarks such as the GTA5, SYNTHIA, CityScapes, IDD, and Cross-City datasets. The source code is available at https://github.com/joonh-kim/ETM.

1 Introduction

Deep learning-based approaches have shown remarkable improvements in semantic segmentation via supervised learning [21, 29, 25, 2, 11, 3, 19, 15, 41, 42]. However, pixel-level labeling of datasets containing enormous numbers of real-world images usually incurs a high cost in time and labor [6, 34, 5, 43, 24]. Such pixel-level labels can be generated automatically from synthetically rendered images [28, 30]; however, the domain discrepancy between synthetic and real-world images is problematic. Therefore, many Unsupervised Domain Adaptation (UDA) techniques [14, 32, 35, 4, 44, 37], which aim to adapt a network trained on synthetic images to real images, have been introduced to address this domain discrepancy.

Most existing UDA methods, however, consider an impractical scenario that focuses only on a single target domain [9, 33, 7, 14, 32, 35, 4, 44, 37]. In the real world, there can be multiple target domains [20, 27], and such domains may not even be introduced at once [13, 1, 39]. Motivated by this, we consider a more realistic Continual UDA scenario in this paper. Under this setting, a network trained on a source domain aims to adapt to multiple target domains that are presented sequentially.

Blindly applying existing UDA methods to this setting leads to sub-optimal results. We observe that the notorious catastrophic forgetting [22, 8, 23, 16] occurs on the previous target domain as the network is trained on the current target domain (see Fig. 1(c)). Recently, Wu et al. [38] introduced a method for adapting to changing environments, such as varying weather and lighting conditions, which can be viewed as Continual UDA for semantic segmentation. Using a style memory for each environment, the method transfers the style of the source environment into that of each target environment and achieves superior performance over previous UDA methods in that specific setting. However, the experiments were conducted within one synthetic dataset [30], and accordingly, we observe inferior performance when the method is applied to multiple real-world datasets (see Sec. 3.2). We find that style transfer alone is not enough to overcome catastrophic forgetting when a considerable domain discrepancy exists.

Refer to caption
Figure 1: (a) When the traditional UDA method is applied to the Continual UDA task, it suffers from catastrophic forgetting for the previously learned target domain (Target 1). (b) Our proposed ETM framework alleviates such forgetting by expanding a lightweight sub-memory called TM. (c) Qualitative results show that our framework actually mitigates forgetting on the previous target domain. AdaptSegNet [32] is used as the baseline UDA method.

To this end, we propose a novel Expanding Target-specific Memory (ETM) framework for Continual UDA for semantic segmentation on real-world datasets. In the framework, we introduce a Target-specific Memory (TM) for each target domain. Inspired by previous works in the continual learning field [36, 40, 26], we posit that the fixed capacity of existing networks may not be enough to handle multiple target domains with large domain discrepancies. The proposed framework is illustrated in Fig. 1(b). Specifically, a lightweight sub-memory called the TM is instantiated, trained, and stored for each target domain. Each TM captures the information unique to the discrepancy of its target domain, which is achieved through the design of the TM's structure, forwarding path, and expansion location (see Sec. 2.1). When testing the network on a previous domain, the stored TM of that domain is used. In this way, the network overcomes the catastrophic forgetting problem. In addition, we design a Double Hinge Adversarial (DHA) loss that enhances the overall UDA performance. By optimizing the DHA loss, the segmentation network aligns the source and target domain data while considering the geometric relations between them; we observe that the DHA loss is better suited to the UDA objective. Without loss of generality, our framework can be applied to other adversarial learning-based UDA methods.

The main contribution of this paper is three-fold:

  • To the best of our knowledge, we address Continual UDA for semantic segmentation on real-world datasets for the first time, a setting that reflects more practical scenarios.

  • We propose the ETM framework for Continual UDA. We deal with the catastrophic forgetting problem by expanding a small amount of model capacity (the TM), an approach introduced to this field for the first time. Moreover, we propose the DHA loss function to enhance the performance of UDA with adversarial learning.

  • We validate our framework by conducting experiments using two synthetic datasets (GTA5 [28], SYNTHIA [30]), and three real-world datasets (CityScapes [6], IDD [34], Cross-City [5]) with large domain discrepancy. The model trained with the ETM framework outperforms other state-of-the-art models under the same conditions.

2 Approach

We first formalize the Continual UDA problem by defining the following notation. Let $\mathcal{S}=\{(x_{1}^{s},y_{1}^{s}),\ldots,(x_{N_{s}}^{s},y_{N_{s}}^{s})\}$ denote the source domain data, which consist of $N_{s}$ images $(x_{1}^{s},\ldots,x_{N_{s}}^{s})$ and corresponding labels $(y_{1}^{s},\ldots,y_{N_{s}}^{s})$. The $T$ target domains without any annotations are defined as $\{\mathcal{T}_{i}\}_{i=1}^{T}$. The $i$-th target domain data are defined as $\mathcal{T}_{i}=\{x_{1}^{i},\ldots,x_{N_{i}}^{i}\}$, where the domain has $N_{i}$ images and $i\in\{1,\ldots,T\}$. Let $x$ denote an arbitrary image from any domain and $y^{s}$ its label when the image is drawn from the source domain. Then $x\in\mathbb{R}^{3\times H\times W}$ and $y^{s}\in\mathbb{R}^{C\times H\times W}$, where $C$ is the number of classes, and $H$ and $W$ are the height and width of the image. Moreover, we denote arbitrary images drawn from the source and target domains as $x^{s}$ and $x^{t}$, respectively.

Refer to caption
(a) Proposed framework
Refer to caption
(b) TM structure
Figure 2: (a) The overall architecture of the ETM framework. A TM for each target domain is generated as the data appear sequentially. The discriminator tries to distinguish the source and target domain data, while the segmentation network and the TM try to fool the discriminator. When the $i$-th target domain data are introduced, the corresponding TM is added right before the layer of the segmentation network whose output enters the discriminator. Note that $k$ is determined by the adversarial learning procedure of the backbone UDA method. (b) The detailed structure of the TM. Each TM consists of a $1\times 1$ Conv. module that extracts localized information and an Avg. Pool. module that extracts contextual information.

2.1 Target-specific Memory

We start from the idea that the network needs additional capacity in real-world scenarios where the number of target domains keeps increasing; the added sub-memory then needs to be stored and retrieved to reproduce its learned knowledge. There are precedents for expanding network capacity in continual learning for image classification [36, 40, 26]. However, UDA for semantic segmentation differs from image classification in many aspects, such as network structure, training objectives, and training strategies, so adapting this idea to Continual UDA is not trivial.

We first consider the structure of the TM. Above all, each TM should contain the domain discrepancy information between the source domain and its target domain. Unlike in image classification, the added sub-memory cannot be a simple MLP. In semantic segmentation, “accurate localization” information can be captured with a small receptive field (small dilation value), while “context assimilation” information requires a large receptive field (large dilation value) [2]. From Fig. 1(c), we observe that forgetting occurs for various objects, from sidewalks to persons, which indicates that the domain discrepancy exists not only in localized information but also in contextual information. We therefore design two modules for each TM: one for the localized information and the other for the contextual information.

Moreover, the TM should be a lightweight memory for the efficiency of the overall continual learning scheme. We thus design the sub-memory as follows. As shown in Fig. 2(b), the $1\times 1$ Conv. module (a $1\times 1$ convolution layer) is the module with the smallest receptive field, while the Avg. Pool. module has the largest receptive field and consists of an average pooling layer followed by a $1\times 1$ convolution layer and the ReLU nonlinearity. Since trainable parameters exist only in the $1\times 1$ convolution layers, the TM is a fairly small-capacity network compared to the segmentation network.
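As an illustration, a minimal PyTorch sketch of one possible TM is given below; the channel width, the use of global average pooling, and the bilinear broadcast back to the input resolution are our assumptions rather than details stated in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TargetSpecificMemory(nn.Module):
    """Sketch of a TM: a 1x1 Conv. branch for localized information and an
    Avg. Pool. branch for contextual information. `channels` is a placeholder
    for the channel size of the hidden state the TM is attached to."""

    def __init__(self, channels: int):
        super().__init__()
        # Smallest receptive field: a plain 1x1 convolution.
        self.local_branch = nn.Conv2d(channels, channels, kernel_size=1)
        # Largest receptive field: average pooling followed by a
        # 1x1 convolution and the ReLU nonlinearity.
        self.context_branch = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        local = self.local_branch(h)
        # Broadcast the pooled context back to the spatial size of h.
        context = F.interpolate(self.context_branch(h), size=h.shape[-2:],
                                mode="bilinear", align_corners=False)
        return local + context
```

Because the only trainable parameters are the two 1x1 convolutions, such a module stays small relative to the DeepLab-V2 backbone.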

The next step is to let each TM capture the domain discrepancy information of its corresponding target domain during training. To do so, the knowledge originally contained only in the segmentation network should be shared with the TM. We therefore train the TM together with the segmentation network by attaching it to the network. In addition, to take advantage of high-level features, we expand the TM at the top layer.

To be specific, when the input enters the TM, it passes through the $1\times 1$ Conv. and Avg. Pool. modules, and the two outputs are summed. The output of the TM for an arbitrary input $x$ is written as $TM(x)$ (here $TM$ denotes an arbitrary $i$-th TM, $TM_{i}$; we omit the subscript $i$ for simplicity). $TM(x)$ and the hidden state of the segmentation network are added, and the sum passes through the subsequent layers. For example, the $(k-1)$-th hidden state is fed to both the $k$-th layer and the TM, and the summed output is passed through the $(k+1)$-th layer (see Fig. 2(a)). Let $f^{[:m]}(x)$ denote the output when the input passes through the segmentation network up to the $m$-th layer; similarly, $f^{[n:]}(x)$ denotes the output when the input passes through the segmentation network from the $(n+1)$-th layer onward. Then, the summed output right before the $(k+1)$-th layer is

f^{[:k]+}(x)=f^{[:k]}(x)+TM\big(f^{[:(k-1)]}(x)\big).   (1)

In addition, the final output passed through both the segmentation network and the TM is defined as

f^{+}(x)=f^{[k:]}\big(f^{[:k]+}(x)\big).   (2)

In our framework, the source and target domain data are aligned via adversarial learning. The layer at which the TM is added is determined by the adversarial learning procedure of the backbone UDA method; it can be multiple positions if the backbone method feeds several features to discriminators. If the TM is added at the $k$-th layer, the corresponding hidden state is used as the input of the discriminator, i.e., $f^{[:k]+}(x)$ in Eq. (1) becomes the discriminator input.
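For clarity, the forwarding path of Eq. (1) and Eq. (2) can be sketched as follows; splitting the segmentation network into the three callables `f_up_to_k_minus_1`, `layer_k`, and `f_after_k` is our simplification, not the actual implementation.

```python
def forward_with_tm(x, f_up_to_k_minus_1, layer_k, f_after_k, tm):
    """Eq. (1)-(2): the (k-1)-th hidden state feeds both the k-th layer and
    the TM; their outputs are summed and passed onward through the network."""
    h_km1 = f_up_to_k_minus_1(x)            # f^[:(k-1)](x)
    h_k_plus = layer_k(h_km1) + tm(h_km1)   # Eq. (1): f^[:k]+(x)
    y_plus = f_after_k(h_k_plus)            # Eq. (2): f^+(x)
    # h_k_plus is also what the discriminator sees during adversarial training.
    return y_plus, h_k_plus
```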

2.2 Double Hinge Adversarial Loss

Typically, UDA methods align the source and target domain data through adversarial learning with a discriminator [33, 32, 35, 37], following GAN [10]. Since the segmentation network in UDA corresponds to the generator in GAN, the adversarial learning scheme has been adopted directly in the UDA field: by minimizing the adversarial loss, the segmentation network is trained to align the source and target domain data, while the discriminator is trained to distinguish them by minimizing the discriminator loss. A variant of GAN [18] considers the geometric relations between feature vectors in the hyperplane by using the Support Vector Machine (SVM) [31]. In [18], the discriminator maximizes the minimum margin, by a value of 1, of the separable data in the feature space, which leads to stable training and alleviates the mode collapse problem in image generation.

However, the generator’s task in GAN and the segmentation network’s task in UDA are different, even though the discriminator’s role is the same. Thus, we design a new adversarial loss customized for UDA. In the feature space, the generator tries to move the fake feature vectors toward the real feature space so that they are classified as real, because the real data are the absolute truth in GAN. In UDA, on the other hand, the target feature vectors do not have to imitate the source feature vectors blindly. The segmentation network can therefore achieve its goal in two ways: by pulling the target features toward the source features or by pushing the source features toward the target features. We use both. In addition, using the ReLU nonlinearity, we do not update the segmentation network when the source features are already closer to the target side than the target features, and vice versa (see Eq. (4)).

To this end, we propose the DHA loss for the segmentation network. This loss is minimized when the target features are extracted as if the input were from the source domain, and vice versa. The DHA loss replaces the adversarial losses used in the backbone UDA method. Let $\hat{z}^{s}$ and $\hat{z}^{t}$ be the discriminator outputs when $x^{s}$ and $x^{t}$ pass through the segmentation network and then the discriminator, respectively. Then $\hat{z}^{s},\hat{z}^{t}\in\mathbb{R}^{h\times w}$, where $h$ and $w$ are the height and width of each output. Our DHA loss is expressed as

L_{D}^{DHA}(\hat{z}^{s},\hat{z}^{t})=\sum_{i}^{h}\sum_{j}^{w}\big[(1-\hat{z}^{s(i,j)})_{+}+(1+\hat{z}^{t(i,j)})_{+}\big],   (3)
L_{adv}^{DHA}(\hat{z}^{s},\hat{z}^{t})=\sum_{i}^{h}\sum_{j}^{w}(\hat{z}^{s(i,j)}-\hat{z}^{t(i,j)})_{+},   (4)

where $(\cdot)_{+}=\max(\,\cdot\,,0)$.

In short, we propose an adversarial learning procedure customized for the UDA task.
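For concreteness, Eq. (3) and Eq. (4) can be written as the following minimal PyTorch sketch; summing over all discriminator outputs (including the batch dimension) is our simplification.

```python
import torch
import torch.nn.functional as F

def dha_discriminator_loss(z_s: torch.Tensor, z_t: torch.Tensor) -> torch.Tensor:
    """Eq. (3): a hinge loss that pushes source scores above +1
    and target scores below -1."""
    return (F.relu(1.0 - z_s) + F.relu(1.0 + z_t)).sum()

def dha_adversarial_loss(z_s: torch.Tensor, z_t: torch.Tensor) -> torch.Tensor:
    """Eq. (4): the segmentation network (and TM) minimize the positive part
    of the source-target score gap, pulling the two distributions toward each
    other from both sides; no gradient flows once z_s <= z_t."""
    return F.relu(z_s - z_t).sum()
```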

2.3 Expanding Target-specific Memory Framework

Even if the TM for the previous target domain remains frozen, a significant change in the segmentation network may still aggravate forgetting on the previous target domain. Thus, we add a distillation loss [17] that distills knowledge from the previous segmentation network:

L_{distill}(\hat{Y}^{s},y_{old}^{s})=-\sum_{i=1}^{C\times H\times W}y_{old}^{s(i)'}\,\log\hat{Y}^{s(i)'},   (5)

with

y_{old}^{s(i)'}=\frac{\exp(y_{old}^{s(i)}/T')}{\sum_{j}\exp(y_{old}^{s(j)}/T')},\quad \hat{Y}^{s(i)'}=\frac{\exp(\hat{Y}^{s(i)}/T')}{\sum_{j}\exp(\hat{Y}^{s(j)}/T')},   (6)

where $T'$ is the temperature that softens the output distributions, and $y_{old}^{s}$ is obtained by passing the source image through the segmentation network before training on the current target domain.
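A minimal sketch of this distillation term follows, under the assumption that the softening in Eq. (6) is applied over the class dimension of per-pixel logits and that the sum in Eq. (5) is replaced by a mean over pixels for scale; the default temperature of 2 follows Appendix C.

```python
import torch
import torch.nn.functional as F

def distillation_loss(y_hat: torch.Tensor, y_old: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    """Eq. (5)-(6): cross-entropy between temperature-softened predictions of
    the current network (y_hat) and of the network snapshot recorded before
    training on the current target domain (y_old).
    Both tensors are assumed to be logits of shape (B, C, H, W)."""
    old_soft = F.softmax(y_old.detach() / temperature, dim=1)
    new_log_soft = F.log_softmax(y_hat / temperature, dim=1)
    # Sum over classes per pixel, then average over pixels and the batch.
    return -(old_soft * new_log_soft).sum(dim=1).mean()
```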

Let $f(x)$ and $D(x)$ be the outputs of the segmentation network and the discriminator for an input $x$, respectively. Then, the training objective of the proposed ETM framework is as follows:

\min\;\Big[\lambda_{seg}\cdot L_{seg}(\hat{y}^{s+},y^{s})+\lambda_{adv}\cdot L_{adv}^{DHA}(\hat{z}^{s+},\hat{z}^{t+})+\lambda_{distill}\cdot L_{distill}(\hat{y}^{s},y_{old}^{s})\Big],   (7)
\min\;\Big[L_{D}^{DHA}(\hat{z}^{s+},\hat{z}^{t+})\Big],   (8)

with

\hat{y}^{s+}=f^{+}(x^{s}),\quad\hat{y}^{s}=f(x^{s}),   (9)
\hat{z}^{s+}=D\big(f^{[:k]+}(x^{s})\big),\quad\hat{z}^{t+}=D\big(f^{[:k]+}(x^{t})\big),

where $L_{seg}$ is the cross-entropy loss, and $\lambda_{seg}$, $\lambda_{adv}$, and $\lambda_{distill}$ are hyperparameters weighting each loss term. Once the segmentation network and the TM are updated by Eq. (7), the discriminator is updated by Eq. (8), alternately. Here, we assume that the source domain data are accessible at every stage, since the purpose of UDA is to perform semantic segmentation on the target domains.

Algorithm 1 ETM framework
1: Input: $\mathcal{S}$, $\{\mathcal{T}_{i}\}_{i=1}^{T}$, $f$, $\{TM_{i}\}_{i=1}^{T}$, $\{D_{i}\}_{i=1}^{T}$, learning rates $\{\alpha_{f},\alpha_{TM},\alpha_{D}\}$
2: initialize $\phi_{f}$
3: for $i=1,\ldots,T$ do
4:     initialize $\phi_{TM_{i}}$, $\phi_{D_{i}}$
5:     if $i>1$ then record $y_{old}^{s}$ by passing $\mathcal{S}$ through $f$
6:     for $iteration=1,\ldots,iter_{max}$ do
7:         Sample batches of $(x^{s},y^{s})$ from $\mathcal{S}$
8:         Sample batches of $x^{t}$ from $\mathcal{T}_{i}$
9:         $\phi_{f}\leftarrow\phi_{f}-\alpha_{f}\frac{\partial(L_{seg}+L_{adv}^{DHA}+L_{distill})}{\partial\phi_{f}}$, where $L_{distill}=0$ for $i=1$ (refer to Eq. (7))
10:        $\phi_{TM_{i}}\leftarrow\phi_{TM_{i}}-\alpha_{TM}\frac{\partial(L_{seg}+L_{adv}^{DHA})}{\partial\phi_{TM_{i}}}$ (refer to Eq. (7))
11:        $\phi_{D_{i}}\leftarrow\phi_{D_{i}}-\alpha_{D}\frac{\partial L_{D}^{DHA}}{\partial\phi_{D_{i}}}$ (refer to Eq. (8))
12:     end for
13:     Store $TM_{i}$ (while $D_{i}$ is not stored)
14: end for
* $\lambda_{seg}$, $\lambda_{adv}$, and $\lambda_{distill}$ are omitted for clarity.

Algorithm 1 summarizes the learning procedure of our framework. Here, $\phi_{f}$, $\phi_{TM_{i}}$, and $\phi_{D_{i}}$ denote the parameters of the segmentation network, the $i$-th TM, and the $i$-th discriminator, respectively. In short, by applying the proposed ETM framework to existing UDA methods, the segmentation network and the TMs learn knowledge about all sequential target domains while effectively keeping the knowledge about the previous target domains.
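The following schematic PyTorch loop mirrors Algorithm 1, reusing the loss sketches above. Here `forward_fn` is a hypothetical helper that runs the segmentation network with the current TM spliced in (e.g., a wrapper around the earlier `forward_with_tm` sketch) and returns the segmentation output together with the discriminator input $f^{[:k]+}(x)$; the cross-entropy call assumes integer class-index labels, the optimizer settings are illustrative (see Appendix C for the reported values), and the loss weights are omitted as in the algorithm listing.

```python
import copy
import itertools
import torch
import torch.nn.functional as F

def train_etm(source_loader, target_loaders, seg_net, make_tm, make_disc,
              forward_fn, iters_per_domain=1000):
    stored_tms, old_net = [], None      # old_net: frozen snapshot for distillation
    for target_loader in target_loaders:                # for i = 1, ..., T
        tm, disc = make_tm(), make_disc()               # initialize phi_TM_i, phi_D_i
        opt_f = torch.optim.SGD(seg_net.parameters(), lr=2.5e-3, momentum=0.9)
        opt_tm = torch.optim.SGD(tm.parameters(), lr=2.5e-3, momentum=0.9)
        opt_d = torch.optim.Adam(disc.parameters(), lr=1e-4)
        for (x_s, y_s), x_t in itertools.islice(
                zip(source_loader, target_loader), iters_per_domain):
            # ----- update segmentation network and TM (Eq. (7)) -----
            y_hat_s_plus, h_s = forward_fn(x_s, seg_net, tm)
            _, h_t = forward_fn(x_t, seg_net, tm)
            loss = F.cross_entropy(y_hat_s_plus, y_s) \
                 + dha_adversarial_loss(disc(h_s), disc(h_t))
            if old_net is not None:                     # L_distill = 0 for i = 1
                with torch.no_grad():
                    y_old = old_net(x_s)                # the recorded y_old^s
                # The distillation term uses f(x^s), which bypasses the TM,
                # so it only affects the segmentation network's parameters.
                loss = loss + distillation_loss(seg_net(x_s), y_old)
            opt_f.zero_grad(); opt_tm.zero_grad()
            loss.backward()
            opt_f.step(); opt_tm.step()
            # ----- update discriminator (Eq. (8)) -----
            loss_d = dha_discriminator_loss(disc(h_s.detach()), disc(h_t.detach()))
            opt_d.zero_grad(); loss_d.backward(); opt_d.step()
        stored_tms.append(tm)           # store TM_i; the discriminator is not stored
        old_net = copy.deepcopy(seg_net).eval()         # snapshot for the next domain
    return stored_tms
```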

3 Experiments

3.1 Experimental Settings

We used two synthetic road-scene datasets (GTA5, SYNTHIA) as the source domains and three real-world road-scene datasets (CityScapes, IDD, Cross-City) as the target domains. We used three types of baselines: (1) Source Only, i.e., the model trained on the source domain data and tested on the target domains; (2) the UDA methods FCN-W [14], AdaptSegNet [32], AdvEnt [35], and SIM [37]; and (3) the Continual UDA method ACE [38]. For a fair comparison, DeepLab-V2 [2], one of the representative semantic segmentation networks, was used as the segmentation network in all methods. For ours, we applied the ETM framework to AdaptSegNet [32]. The evaluation metric is mIoU (mean Intersection over Union), i.e., the average of the per-class IoU values. We implemented our experiments in PyTorch 1.4.0 on an Ubuntu 16.04 workstation, using a single RTX 2080 Ti GPU with CUDA 10.0.
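For reference, the metric can be computed from a per-class confusion matrix as in the sketch below; this is our own rendering, not the evaluation code used in the experiments.

```python
import numpy as np

def mean_iou(conf: np.ndarray) -> float:
    """mIoU from a C x C confusion matrix (rows: ground truth, cols: prediction).
    Per-class IoU = TP / (GT + prediction - TP); classes absent from both the
    ground truth and the prediction are ignored in the average."""
    tp = np.diag(conf).astype(np.float64)
    denom = conf.sum(axis=0) + conf.sum(axis=1) - tp
    iou = np.where(denom > 0, tp / np.maximum(denom, 1), np.nan)
    return float(np.nanmean(iou))
```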

3.2 Comparison with State-of-the-art Methods

To validate the effectiveness of the proposed ETM framework, we compared the model trained with our framework against state-of-the-art methods. Two Continual UDA scenarios were considered: a two-target scenario and a four-target scenario. Following prior works on UDA [14, 32, 35, 44], the synthetic GTA5 and SYNTHIA datasets were used as the source domains. In the first experiment, we first performed UDA to the CityScapes dataset, followed by Continual UDA to the IDD dataset. The CityScapes dataset is the most widely used real-world image dataset in the UDA field; to accentuate the catastrophic forgetting problem on this dataset, collected in European cities, we then performed UDA to the IDD dataset, collected in entirely different Indian cities. In the second experiment, we performed Continual UDA sequentially over four different cities in the Cross-City dataset (Rio, Rome, Taipei, and Tokyo). Through this longer target domain sequence, we analyzed the catastrophic forgetting problem in depth.

We experimented with two input sizes. When the original images were converted into high-definition images of 1024 × 512, we denote them as H, and as L when they were converted into low-definition images of 512 × 256. We conducted experiments with both input sizes for all methods except ACE [38]; for ACE, experiments on the high-definition images (H) were not possible due to its network capacity.

Table 1: The two-target scenario results. The best performance is presented in bold, and the second-best performance is underlined. Note that all the numbers reported represent the performance after the final domain adaptation.
Columns 3–6: GTA5 → CityScapes → IDD; columns 7–10: SYNTHIA → CityScapes → IDD.

Model | Input Size | CityScapes (Fgt.) | IDD | Mean mIoU | Gain | CityScapes (Fgt.) | IDD | Mean mIoU | Gain
Source Only | L | 27.55 | 37.83 | 32.69 | - | 32.30 | 29.65 | 30.98 | -
FCN-W [14] | L | 24.60 (-5.71) | 30.50 | 27.55 | -5.14 | 30.02 (+0.71) | 23.95 | 26.99 | -3.99
AdaptSegNet [32] | L | 32.23 (-2.42) | 41.51 | 36.87 | +4.18 | 40.10 (-0.08) | 34.79 | 37.45 | +6.47
AdvEnt [35] | L | 33.25 (-1.71) | 41.60 | 37.43 | +4.74 | 40.26 (-0.26) | 34.80 | 37.53 | +6.55
ACE [38] | L | 28.39 (-1.82) | 34.70 | 31.55 | -1.14 | 28.67 (-1.14) | 29.78 | 29.23 | -1.75
SIM [37] | L | 24.99 (-0.71) | 24.99 | 24.99 | -7.70 | 31.01 (-0.04) | 27.77 | 29.39 | -1.59
ETM (Ours) | L | 34.36 (-1.26) | 41.67 | 38.02 | +5.33 | 40.48 (+0.75) | 34.85 | 37.67 | +6.69
Source Only | H | 34.70 | 42.65 | 38.68 | - | 35.02 | 31.74 | 33.38 | -
FCN-W [14] | H | 32.03 (-3.81) | 35.57 | 33.80 | -4.88 | 31.12 (-2.27) | 29.48 | 30.30 | -3.08
AdaptSegNet [32] | H | 34.86 (-7.41) | 43.87 | 39.37 | +0.69 | 42.08 (-3.33) | 35.64 | 38.86 | +5.48
AdvEnt [35] | H | 37.27 (-5.86) | 43.41 | 40.34 | +1.66 | 39.14 (-6.39) | 32.15 | 35.65 | +2.27
SIM [37] | H | 35.47 (-0.64) | 33.93 | 34.70 | -3.98 | 39.19 (-1.35) | 32.77 | 35.98 | +2.60
ETM (Ours) | H | 40.61 (-1.35) | 46.73 | 43.67 | +4.99 | 43.86 (-1.92) | 37.17 | 40.52 | +7.14

Two-Target Scenario.

The results of Continual UDA on two target domains (CityScapes and IDD), using each of the GTA5 and SYNTHIA datasets as the source domain, are shown in Table 1. Mean mIoU is the average of the mIoU values over the two target domains, and the Gain column reports the difference in Mean mIoU relative to Source Only. In the CityScapes column, forgetting values (Fgt.) are shown in parentheses next to the mIoU values. The forgetting value is the difference between the mIoU on the CityScapes dataset after the network has been continually adapted to the IDD dataset and the mIoU right after the network was adapted to the CityScapes dataset; the larger the number in parentheses, the less the forgetting on the previous domain.
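As a worked example of these quantities for one row of Table 1 (ETM, input size L, GTA5 source); note that the intermediate CityScapes mIoU is back-computed from the reported forgetting value rather than stated directly in the paper.

```python
# Worked example of the reported metrics for the ETM (Ours), L, GTA5 row.
miou_city_after_city = 35.62   # right after adapting to CityScapes (back-computed)
miou_city_after_idd = 34.36    # after continuing to IDD (Table 1)
miou_idd = 41.67               # IDD performance after the final adaptation (Table 1)

forgetting = miou_city_after_idd - miou_city_after_city   # -1.26, as reported
mean_miou = (miou_city_after_idd + miou_idd) / 2          # 38.015 -> 38.02
gain = round(mean_miou, 2) - 32.69                        # +5.33 vs. Source Only
```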

The results in Table 1 show that, when the ETM framework is applied, in most cases not only is the semantic segmentation performance the best, but also the least forgetting occurs. Since the TM is proposed to overcome the catastrophic forgetting problem in Continual UDA, the low forgetting on the CityScapes dataset demonstrates that the design of the TM is valid. Furthermore, when our framework is applied, the performance on both target domains is also high, which confirms that the UDA performance itself is increased by the DHA loss. We attribute this to the segmentation network learning domain-invariant information efficiently while the TM learns the knowledge corresponding to the target domain shift.

ACE is a method designed to adapt to changing environments such as varying weather and lighting conditions. However, the experimental results of ACE show severe forgetting and low performance. This indicates that the style transfer approach is unsuitable for real-world problems with large domain discrepancy, and demonstrates that our framework can handle such problems. Note that within a limited-capacity network, there is a trade-off between adapting to the current target domain and maintaining the knowledge of the previous target domains. Regarding the SIM results, although numerically less forgetting occurs, this is of little value since the average adaptation performance is markedly deficient. The model with our framework, on the other hand, overcomes this trade-off.

Table 2: The four-target scenario results. The best performance is presented in bold, and the second-best performance is underlined. Note that all the numbers reported represent the performance after the final domain adaptation.
Columns 3–8: GTA5 → Rio → Rome → Taipei → Tokyo; columns 9–14: SYNTHIA → Rio → Rome → Taipei → Tokyo.

Model | Input Size | Rio (Fgt.) | Rome (Fgt.) | Taipei (Fgt.) | Tokyo | Mean mIoU | Gain | Rio (Fgt.) | Rome (Fgt.) | Taipei (Fgt.) | Tokyo | Mean mIoU | Gain
Source Only | L | 36.98 | 38.05 | 33.94 | 33.85 | 35.71 | - | 33.50 | 31.30 | 30.17 | 29.44 | 31.10 | -
FCN-W [14] | L | 24.98 (-1.96) | 25.56 (-2.90) | 24.63 (-1.33) | 29.68 | 26.21 | -9.50 | 25.12 (-1.83) | 27.01 (+0.12) | 23.84 (-0.08) | 26.84 | 25.70 | -5.40
AdaptSegNet [32] | L | 38.68 (-1.81) | 39.99 (-0.18) | 34.17 (-0.06) | 36.35 | 37.30 | +1.59 | 32.95 (-2.64) | 32.02 (-0.08) | 29.30 (-0.38) | 28.36 | 30.66 | -0.44
AdvEnt [35] | L | 37.88 (-3.53) | 39.64 (-1.64) | 36.06 (-1.01) | 37.84 | 37.86 | +2.15 | 32.97 (-2.50) | 31.50 (+0.11) | 29.37 (-0.04) | 28.82 | 30.67 | -0.43
ACE [38] | L | 30.78 (-4.70) | 31.35 (-1.12) | 28.56 (-1.16) | 33.53 | 31.06 | -4.65 | 27.27 (-6.53) | 28.26 (-2.69) | 26.39 (-0.43) | 29.05 | 27.74 | -3.36
SIM [37] | L | 27.62 (-1.23) | 27.48 (-0.54) | 24.38 (-0.07) | 28.92 | 27.10 | -8.61 | 27.82 (-1.04) | 28.08 (-1.49) | 26.09 (-0.43) | 27.84 | 27.46 | -3.64
ETM (Ours) | L | 41.15 (+0.86) | 40.76 (+0.31) | 37.12 (+0.83) | 37.94 | 39.24 | +3.53 | 34.97 (+0.30) | 33.74 (+1.19) | 30.81 (+0.15) | 31.75 | 32.82 | +1.72
Source Only | H | 44.21 | 44.41 | 40.79 | 43.99 | 43.35 | - | 36.63 | 32.11 | 31.33 | 31.74 | 32.95 | -
FCN-W [14] | H | 32.65 (-5.65) | 32.59 (-2.01) | 28.55 (+0.09) | 35.27 | 32.27 | -11.08 | 30.45 (-4.79) | 30.35 (-0.34) | 26.42 (-1.55) | 31.76 | 29.75 | -3.20
AdaptSegNet [32] | H | 43.00 (-8.32) | 43.54 (-3.68) | 36.94 (-2.86) | 42.38 | 41.47 | -1.88 | 36.54 (-2.56) | 33.46 (-1.69) | 31.45 (-0.44) | 31.15 | 33.15 | +0.20
AdvEnt [35] | H | 43.60 (-5.93) | 45.08 (-2.78) | 37.55 (-1.72) | 44.12 | 42.59 | -0.76 | 37.66 (-2.04) | 34.41 (-1.47) | 32.32 (-0.25) | 31.35 | 33.94 | +0.99
SIM [37] | H | 39.65 (-1.86) | 37.61 (-1.71) | 35.72 (-1.19) | 38.24 | 37.81 | -5.54 | 34.30 (-1.02) | 32.82 (-0.68) | 31.48 (-0.53) | 30.73 | 32.33 | -0.62
ETM (Ours) | H | 48.45 (-1.79) | 45.93 (-0.38) | 41.19 (-0.07) | 44.18 | 44.94 | +1.59 | 38.44 (+0.16) | 35.73 (+0.01) | 32.52 (+0.07) | 34.12 | 35.20 | +2.25

Four-Target Scenario.

We conducted experiments on more target domains, since two target domains are not enough to study the catastrophic forgetting problem in Continual UDA in depth. The results of Continual UDA over Rio, Rome, Taipei, and Tokyo in the Cross-City dataset are shown in Table 2. The forgetting value in parentheses denotes the difference between the performance on a particular target domain after UDA has been performed up to the last target domain (Tokyo) and the performance on that domain right after UDA was performed on it. For example, the forgetting for Taipei is the difference between the performance on Taipei after UDA up to Tokyo and the performance on Taipei right after UDA on Taipei. As can be seen from Table 2, the model trained with the ETM framework shows the highest semantic segmentation performance and the least forgetting in most cases.

Table 3: Ablation study on each component of the ETM framework.
Columns 2–4: GTA5 → CityScapes → IDD; columns 5–7: SYNTHIA → CityScapes → IDD.

Model | CityScapes (Fgt.) | IDD | Mean mIoU | CityScapes (Fgt.) | IDD | Mean mIoU
AdaptSegNet [32] | 34.86 (-7.41) | 43.87 | 39.37 | 42.08 (-3.33) | 35.64 | 38.86
AdaptSegNet [32] + TM | 40.04 (-1.45) | 44.28 | 42.16 | 43.56 (-1.54) | 36.52 | 40.04
AdaptSegNet [32] + TM + DHA | 40.61 (-1.35) | 46.73 | 43.67 | 43.86 (-1.92) | 37.17 | 40.52
Table 4: Ablation study on each module of the TM.
Columns 3–5: GTA5 → CityScapes → IDD; columns 6–8: SYNTHIA → CityScapes → IDD.

1×1 Conv. | Avg. Pool. | CityScapes (Fgt.) | IDD | Mean mIoU | CityScapes (Fgt.) | IDD | Mean mIoU
– | – | 34.86 (-7.41) | 43.87 | 39.37 | 42.08 (-3.33) | 35.64 | 38.86
✓ | – | 38.81 (-3.01) | 44.43 | 41.62 | 42.94 (-2.17) | 36.00 | 39.47
– | ✓ | 36.70 (-4.53) | 44.04 | 40.37 | 42.22 (-2.37) | 35.48 | 38.85
✓ | ✓ | 40.04 (-1.45) | 44.28 | 42.16 | 43.56 (-1.54) | 36.52 | 40.04

3.3 Ablation Study

We conducted in-depth internal analyses of our framework by checking how the TM and the DHA loss affect its performance. All experiments in this section performed Continual UDA on the CityScapes and IDD datasets with an input size of H.

First, we analyzed the effectiveness of each component of our framework. In Table 3, the first row corresponds to our model without both the TM and the DHA loss, since we applied our framework to AdaptSegNet. Comparing the second row with the first, forgetting is alleviated and the mIoU values for the CityScapes dataset increase; before adapting to the IDD dataset, the performance on the CityScapes dataset is similar in both cases. That is, adding the TM mitigates the catastrophic forgetting problem. Moreover, comparing the third row with the second, the overall UDA performance is improved, which means the UDA performance is enhanced by the DHA loss.

Refer to caption
Refer to caption
Figure 3: Comparison of the semantic maps with and without the TM. Examples of the semantic maps for images drawn from (a) the CityScapes dataset and (b) the IDD dataset are shown. Inside the sky-blue circles, the difference between the semantic maps predicted by the network with the TM and without the TM can be observed.

We further analyzed how each module of the TM affects the overall performance. The differences in performance depending on the presence or absence of each module are summarized in Table 4. For a fair comparison, all other conditions were kept the same except the TM architecture, and the DHA loss was not used so that the effect of each module could be verified clearly. First, note that even using one module as the TM alleviates catastrophic forgetting. Also, the catastrophic forgetting problem is mitigated more when the $1\times 1$ Conv. module is used than when the Avg. Pool. module is used. This indicates that the localized information extracted by the $1\times 1$ Conv. module contributes more to handling the domain discrepancy than the contextual information extracted by the Avg. Pool. module. Finally, the least forgetting occurs when both modules are used. This suggests a synergy between the two modules and demonstrates the validity of the TM's design.

3.4 Discussion

We have claimed that the TM stores the domain discrepancy information of each target domain by being attached to the segmentation network and trained together with it; however, exactly which information is stored in the TM remains unclear. To clarify this, we carried out semantic segmentation using the network with the ETM framework. By comparing the semantic segmentation results of the segmentation network alone with those of the segmentation network combined with the TM, we can reveal which information is contained in the TM.

Fig. 3 shows the semantic maps predicted by the network when Continual UDA is performed on the CityScapes dataset and the IDD dataset, with the GTA5 dataset as the source domain. For each of the four images extracted from the CityScapes and IDD datasets, the ground truth and the semantic maps predicted with the TM (w/ TM) and without the TM (w/o TM) are compared. Fig. 3(a) shows examples from the CityScapes dataset. In the first image of Fig. 3(a), a traffic sign (yellow) and a person (red) riding a bicycle (brown) are well captured inside the sky-blue circle when the TM is used, but disappear when the TM is not used. Likewise, in the other images, people, traffic lights (orange), and bicycles are properly predicted with the TM, but disappear without it. From Fig. 3(b), for the IDD dataset, we can verify that people and motorcycles (blue) are entirely missed without the TM.

In summary, the TM contains information about all objects in general, but especially about the kinds of objects that are likely to differ from one instance to another. As stated in Sec. 2.1, the $1\times 1$ Conv. module of the TM is added to extract more localized information and the Avg. Pool. module to extract broader contextual information. This experiment suggests that the $1\times 1$ Conv. module plays the more vital role and that the specific information it extracts contributes more to overcoming the domain discrepancy.

4 Conclusion

In this paper, we proposed Continual UDA for semantic segmentation based on the ETM framework. The framework can be applied to methods performing UDA with adversarial learning. By attaching the proposed TM to the segmentation network, we alleviated the catastrophic forgetting problem that occurs when existing UDA methods are applied in continual learning environments. A TM is instantiated for each target domain and effectively captures the domain discrepancy with only a small capacity. Furthermore, the proposed DHA loss enhances the UDA performance. With the ETM framework, we have enlarged the scope of Continual UDA for semantic segmentation and addressed the catastrophic forgetting problem in UDA. In future work, we hope to further develop the framework to bring the segmentation performance to the level of models trained in a supervised manner.

References

  • [1] Andreea Bobu, Eric Tzeng, Judy Hoffman, and Trevor Darrell. Adapting to continuously shifting domains. 2018.
  • [2] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence, 40(4):834–848, 2017.
  • [3] Liang-Chieh Chen, George Papandreou, Florian Schroff, and Hartwig Adam. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587, 2017.
  • [4] Minghao Chen, Hongyang Xue, and Deng Cai. Domain adaptation for semantic segmentation with maximum squares loss. In Proceedings of the IEEE International Conference on Computer Vision, pages 2090–2099, 2019.
  • [5] Yi-Hsin Chen, Wei-Yu Chen, Yu-Ting Chen, Bo-Cheng Tsai, Yu-Chiang Frank Wang, and Min Sun. No more discrimination: Cross city adaptation of road scene segmenters. In Proceedings of the IEEE International Conference on Computer Vision, pages 1992–2001, 2017.
  • [6] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3213–3223, 2016.
  • [7] Zhijie Deng, Yucen Luo, and Jun Zhu. Cluster alignment with a teacher for unsupervised domain adaptation. In Proceedings of the IEEE International Conference on Computer Vision, pages 9944–9953, 2019.
  • [8] Robert M French. Catastrophic forgetting in connectionist networks. Trends in cognitive sciences, 3(4):128–135, 1999.
  • [9] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. The Journal of Machine Learning Research, 17(1):2096–2030, 2016.
  • [10] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
  • [11] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 2961–2969, 2017.
  • [12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [13] Judy Hoffman, Trevor Darrell, and Kate Saenko. Continuous manifold based adaptation for evolving visual domains. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 867–874, 2014.
  • [14] Judy Hoffman, Dequan Wang, Fisher Yu, and Trevor Darrell. Fcns in the wild: Pixel-level adversarial and constraint-based adaptation. arXiv preprint arXiv:1612.02649, 2016.
  • [15] Simon Jégou, Michal Drozdzal, David Vazquez, Adriana Romero, and Yoshua Bengio. The one hundred layers tiramisu: Fully convolutional densenets for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pages 11–19, 2017.
  • [16] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences, 114(13):3521–3526, 2017.
  • [17] Zhizhong Li and Derek Hoiem. Learning without forgetting. IEEE transactions on pattern analysis and machine intelligence, 40(12):2935–2947, 2017.
  • [18] Jae Hyun Lim and Jong Chul Ye. Geometric gan. arXiv preprint arXiv:1705.02894, 2017.
  • [19] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2117–2125, 2017.
  • [20] Ziwei Liu, Zhongqi Miao, Xingang Pan, Xiaohang Zhan, Dahua Lin, Stella X Yu, and Boqing Gong. Open compound domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12406–12415, 2020.
  • [21] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3431–3440, 2015.
  • [22] Michael McCloskey and Neal J Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of learning and motivation, volume 24, pages 109–165. Elsevier, 1989.
  • [23] Martial Mermillod, Aurélia Bugaiska, and Patrick Bonin. The stability-plasticity dilemma: Investigating the continuum from catastrophic forgetting to age-limited learning effects. Frontiers in psychology, 4:504, 2013.
  • [24] Gerhard Neuhold, Tobias Ollmann, Samuel Rota Bulò, and Peter Kontschieder. The mapillary vistas dataset for semantic understanding of street scenes. In International Conference on Computer Vision (ICCV), 2017.
  • [25] Hyeonwoo Noh, Seunghoon Hong, and Bohyung Han. Learning deconvolution network for semantic segmentation. In Proceedings of the IEEE international conference on computer vision, pages 1520–1528, 2015.
  • [26] Gyeong-Moon Park, Sahng-Min Yoo, and Jong-Hwan Kim. Convolutional neural network with developmental memory for continual learning. IEEE Transactions on Neural Networks and Learning Systems, 2020. In press, doi:10.1109/TNNLS.2020.3007548.
  • [27] Kwanyong Park, Sanghyun Woo, Inkyu Shin, and In-So Kweon. Discover, hallucinate, and adapt: Open compound domain adaptation for semantic segmentation. In Advances in Neural Information Processing Systems (NeurIPS), 2020.
  • [28] Stephan R Richter, Vibhav Vineet, Stefan Roth, and Vladlen Koltun. Playing for data: Ground truth from computer games. In European conference on computer vision, pages 102–118. Springer, 2016.
  • [29] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer, 2015.
  • [30] German Ros, Laura Sellart, Joanna Materzynska, David Vazquez, and Antonio M Lopez. The synthia dataset: A large collection of synthetic images for semantic segmentation of urban scenes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3234–3243, 2016.
  • [31] Bernhard Schölkopf, Alexander J Smola, Francis Bach, et al. Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT press, 2002.
  • [32] Yi-Hsuan Tsai, Wei-Chih Hung, Samuel Schulter, Kihyuk Sohn, Ming-Hsuan Yang, and Manmohan Chandraker. Learning to adapt structured output space for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7472–7481, 2018.
  • [33] Eric Tzeng, Judy Hoffman, Kate Saenko, and Trevor Darrell. Adversarial discriminative domain adaptation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7167–7176, 2017.
  • [34] Girish Varma, Anbumani Subramanian, Anoop Namboodiri, Manmohan Chandraker, and CV Jawahar. Idd: A dataset for exploring problems of autonomous navigation in unconstrained environments. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1743–1751. IEEE, 2019.
  • [35] Tuan-Hung Vu, Himalaya Jain, Maxime Bucher, Matthieu Cord, and Patrick Pérez. Advent: Adversarial entropy minimization for domain adaptation in semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2517–2526, 2019.
  • [36] Yu-Xiong Wang, Deva Ramanan, and Martial Hebert. Growing a brain: Fine-tuning by increasing model capacity. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2471–2480, 2017.
  • [37] Zhonghao Wang, Mo Yu, Yunchao Wei, Rogerio Feris, Jinjun Xiong, Wen-mei Hwu, Thomas S Huang, and Honghui Shi. Differential treatment for stuff and things: A simple unsupervised domain adaptation method for semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12635–12644, 2020.
  • [38] Zuxuan Wu, Xin Wang, Joseph E Gonzalez, Tom Goldstein, and Larry S Davis. Ace: adapting to changing environments for semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pages 2121–2130, 2019.
  • [39] Markus Wulfmeier, Alex Bewley, and Ingmar Posner. Incremental adversarial domain adaptation for continually changing environments. In 2018 IEEE International conference on robotics and automation (ICRA), pages 1–9. IEEE, 2018.
  • [40] Jaehong Yoon, Eunho Yang, Jeongtae Lee, and Sung Ju Hwang. Lifelong learning with dynamically expandable networks. arXiv preprint arXiv:1708.01547, 2017.
  • [41] Changqian Yu, Jingbo Wang, Chao Peng, Changxin Gao, Gang Yu, and Nong Sang. Bisenet: Bilateral segmentation network for real-time semantic segmentation. In Proceedings of the European conference on computer vision (ECCV), pages 325–341, 2018.
  • [42] Changqian Yu, Jingbo Wang, Chao Peng, Changxin Gao, Gang Yu, and Nong Sang. Learning a discriminative feature network for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1857–1866, 2018.
  • [43] Fisher Yu, Haofeng Chen, Xin Wang, Wenqi Xian, Yingying Chen, Fangchen Liu, Vashisht Madhavan, and Trevor Darrell. Bdd100k: A diverse driving dataset for heterogeneous multitask learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2636–2645, 2020.
  • [44] Qiming Zhang, Jing Zhang, Wei Liu, and Dacheng Tao. Category anchor-guided unsupervised domain adaptation for semantic segmentation. In Advances in Neural Information Processing Systems, pages 435–445, 2019.
Refer to caption
Figure 4: Our ETM framework based on AdaptSegNet. The figure describes the situation when conducting UDA for the $i$-th target domain.

Appendix

Appendix A Dataset Details

The GTA5 dataset contains 24,966 images with pixel-level semantic annotations synthesized from a photorealistic open-world computer game. We also used the SYNTHIA-RAND-CITYSCAPES (SYNTHIA) dataset, which consists of 9,400 photo-realistic frames rendered from a virtual city environment. Since these datasets do not have official splits, we randomly split each into a training set and a validation set with a ratio of 6:1, the same ratio as in CityScapes.

The CityScapes dataset consists of 5,000 frames from a diverse set of stereo video sequences recorded in street scenes from 50 different European cities. The dataset is split into training, validation, and test sets; following prior works on UDA [14, 32, 35], we evaluated performance on the validation set. IDD (India Driving Dataset) has 10,003 images collected from 182 drive sequences on Indian roads. Unlike CityScapes, IDD is collected from an unstructured environment characterized by features such as ambiguous road boundaries, muddy drivable areas, and the presence of animals. The dataset is split into training, validation, and test sets, and we again evaluated performance on the validation set, as for CityScapes. The Cross-City dataset consists of road images collected from four cities: Rio, Rome, Taipei, and Tokyo. The training set is not labeled, but the images in the test set are labeled.

Since the source and target domains share the same label space in the UDA task, we specify the number of classes that commonly exist across the datasets. The GTA5, CityScapes, and IDD datasets share 18 semantic classes, while the SYNTHIA, CityScapes, and IDD datasets share 13 classes. The GTA5, SYNTHIA, and Cross-City datasets share 13 classes.

Appendix B Network Structure Details

In our experiments, we used AdaptSegNet [32] as the backbone UDA method for our ETM framework. AdaptSegNet is based on the DeepLab-V2 architecture, which consists of a ResNet 101 [12] module (the feature extractor) and an Atrous Spatial Pyramid Pooling (ASPP) module (the classifier). In Fig. 4, ResNet 101 and ASPP (2) indicate the feature extractor and the classifier of DeepLab-V2, respectively. In addition, in AdaptSegNet, ASPP (1) is attached to an intermediate output of the ResNet 101 module. The outputs of the ASPP (1) and ASPP (2) modules are used as inputs to Discriminator (1) and Discriminator (2), respectively, for adversarial learning. Since the two adversarial learning procedures are conducted in parallel, this is called multi-level adversarial learning. In the ETM framework, the TM is added right before the discriminator; thus, we add a TM to both the ASPP (1) and ASPP (2) modules. As depicted, for an arbitrary $i$-th target domain, TM $i$ (1) and TM $i$ (2) are added. Our training objective is then expressed as follows:

\min\;\sum_{n=1}^{2}\Big[\lambda_{seg}^{(n)}\cdot L_{seg}^{(n)}(\hat{y}^{s+(n)},y^{s})+\lambda_{adv}^{(n)}\cdot L_{adv}^{DHA(n)}(\hat{z}^{s+(n)},\hat{z}^{t+(n)})+\lambda_{distill}^{(n)}\cdot L_{distill}^{(n)}(\hat{y}^{s(n)},y_{old}^{s(n)})\Big],   (10)
\min\;\Big[L_{D}^{DHA(1)}(\hat{z}^{s+(1)},\hat{z}^{t+(1)})\Big],   (11)
\min\;\Big[L_{D}^{DHA(2)}(\hat{z}^{s+(2)},\hat{z}^{t+(2)})\Big].   (12)

The parameters of the ResNet 101, ASPP (1), and ASPP (2) modules are updated by Eq. (10), and the parameters of the Discriminators (1) and (2) are updated by Eq. (11) and (12), respectively.

Appendix C Implementation Details

We used the SGD optimizer for the segmentation network and the TM. The learning rate was set to $2.5\times 10^{-3}$ with a momentum of 0.97 and a weight decay of $5\times 10^{-4}$. Since we used a pre-trained ResNet 101 module, its learning rate was set 10 times smaller. To regularize large shifts of the previously learned parameters, the learning rate of the ASPP modules was also set 10 times smaller, except when performing domain adaptation on the first target domain. For the discriminator, we used the Adam optimizer with a learning rate of $1\times 10^{-4}$. We used a batch size of 1 for input size H and 3 for input size L, for both source and target domain data. The temperature for the distillation loss was set to 2. Additional hyperparameter values are specified in Table 5.

Table 5: Hyperparameter values used in Eq. (10)
Hyperparameters | $\lambda_{seg}^{(1)}$ | $\lambda_{seg}^{(2)}$ | $\lambda_{adv}^{(1)}$ | $\lambda_{adv}^{(2)}$ | $\lambda_{distill}^{(1)}$ | $\lambda_{distill}^{(2)}$
Values | 0.1 | 1 | 0.0002 | 0.001 | 0.02 | 0.2
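The reported optimizer settings and loss weights can be summarized in code as follows; the attribute names `seg_net.resnet101` and `seg_net.aspp` are assumptions about how the DeepLab-V2 backbone is exposed, while the learning rates, momentum, weight decay, and λ values are taken from this appendix and Table 5.

```python
import torch

def build_optimizers(seg_net, tm, disc, first_target_domain: bool):
    """Optimizer setup following Appendix C (a sketch, not the released code)."""
    base_lr = 2.5e-3
    # The pre-trained ResNet 101 uses a 10x smaller learning rate; the ASPP
    # modules also use a 10x smaller rate except on the first target domain.
    aspp_lr = base_lr if first_target_domain else base_lr / 10
    opt_seg = torch.optim.SGD(
        [{"params": seg_net.resnet101.parameters(), "lr": base_lr / 10},
         {"params": seg_net.aspp.parameters(), "lr": aspp_lr}],
        lr=base_lr, momentum=0.97, weight_decay=5e-4)
    opt_tm = torch.optim.SGD(tm.parameters(), lr=base_lr,
                             momentum=0.97, weight_decay=5e-4)
    opt_disc = torch.optim.Adam(disc.parameters(), lr=1e-4)
    return opt_seg, opt_tm, opt_disc

# Loss weights from Table 5 for the two levels (n = 1, 2) of Eq. (10).
LAMBDAS = {
    "seg":     (0.1, 1.0),
    "adv":     (0.0002, 0.001),
    "distill": (0.02, 0.2),
}
```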

Appendix D Additional Experimental Results

Qualitative Results.

Refer to caption
Refer to caption
Figure 5: Qualitative results of Continual UDA on the CityScapes and IDD datasets from the GTA5 dataset. Comparing AdaptSegNet with the model trained with our framework, less forgetting occurs when our framework is applied. Note that when UDA has been performed only up to the CityScapes dataset, the semantic maps for the IDD dataset are meaningless (gray areas).

To show qualitatively that the catastrophic forgetting problem is alleviated by the proposed framework, the semantic maps predicted by the network are provided in Fig. 5. Each semantic map is obtained through Continual UDA on the CityScapes and IDD datasets with input size H and the GTA5 dataset as the source domain.

In Fig. 5(a), the right part shows examples of semantic maps predicted by the network with the ETM framework. The CityScapes column shows the semantic maps for an image extracted from the CityScapes dataset, and the IDD column shows the semantic maps for an image from the IDD dataset. The first row (Ground Truth) shows the labels of each image, i.e., the real semantic maps. The second row (GTA5 → CityScapes) shows the semantic maps predicted by the network after UDA up to the CityScapes dataset; therefore, there are no semantic maps for the IDD dataset. The last row (GTA5 → CityScapes → IDD) shows the predicted semantic maps for the images from each dataset after the network has performed UDA up to the IDD dataset. Note that there is little difference between the second and third rows of the CityScapes column; in other words, there is little forgetting on the CityScapes dataset even after continually learning the IDD dataset. Although some forgetting occurs for walls (beige) and people (red) inside the sky-blue circles, it is not severe.

Table 6: The difference in per-class semantic segmentation performance with and without the TM.

GTA5 → CityScapes (evaluated on CityScapes):
Object | Road | Sidewalk | Building | Wall | Fence | Pole | Traffic Light | Traffic Sign | Vegetation
w/ TM | 87.47 | 18.74 | 80.44 | 27.85 | 19.50 | 27.69 | 28.77 | 16.45 | 84.31
w/o TM | 86.33 | 16.38 | 76.40 | 24.42 | 16.53 | 21.92 | 4.26 | 5.84 | 80.31
Object | Sky | Person | Rider | Car | Truck | Bus | Train | Motorcycle | Bicycle | mIoU
w/ TM | 76.92 | 55.92 | 24.86 | 71.67 | 26.46 | 36.64 | 5.19 | 24.38 | 15.86 | 41.96
w/o TM | 75.06 | 37.71 | 17.68 | 69.37 | 26.18 | 36.64 | 5.19 | 24.38 | 15.86 | 35.58

GTA5 → CityScapes → IDD (evaluated on IDD):
Object | Road | Sidewalk | Building | Wall | Fence | Pole | Traffic Light | Traffic Sign | Vegetation
w/ TM | 94.95 | 43.18 | 55.31 | 31.12 | 21.57 | 24.70 | 9.02 | 54.92 | 83.32
w/o TM | 92.91 | 37.29 | 52.96 | 28.60 | 18.29 | 12.90 | 2.67 | 49.80 | 81.83
Object | Sky | Person | Rider | Car | Truck | Bus | Train | Motorcycle | Bicycle | mIoU
w/ TM | 92.94 | 40.47 | 46.65 | 74.01 | 59.76 | 40.98 | 0.12 | 47.39 | 20.68 | 46.73
w/o TM | 93.65 | 30.52 | 40.07 | 73.58 | 58.28 | 44.10 | 0.00 | 41.69 | 22.34 | 43.42

The left part shows semantic maps predicted by AdaptSegNet for the same situation and images. Unlike ours, the forgetting is more severe when comparing the second and third rows of the CityScapes column. Inside the sky-blue circle, one can see forgetting on walls, vegetation (green), traffic lights (orange), sidewalks (light pink), and people. Furthermore, the model with the ETM framework also predicts slightly better on the IDD dataset.

Fig. 5(b) shows the semantic maps for other examples and exhibits the same tendency as Fig. 5(a). Comparing the semantic maps for the CityScapes dataset, more severe forgetting occurs with AdaptSegNet: inside the sky-blue circle, there is forgetting on walls, sidewalks, and vehicles (navy). Meanwhile, far less forgetting occurs with the ETM framework.

Further Analysis on the DHA Loss.

Table 7: The Continual UDA results according to the loss function for adversarial learning
Columns 2–4: GTA5 → CityScapes → IDD; columns 5–7: SYNTHIA → CityScapes → IDD.

Loss for Adversarial Learning | CityScapes | IDD | Mean mIoU | CityScapes | IDD | Mean mIoU
GAN loss | 40.04 | 44.28 | 42.16 | 43.56 | 36.52 | 40.04
Geometric GAN loss | 36.50 | 40.35 | 38.43 | 43.03 | 35.37 | 39.20
DHA (Ours) | 40.61 | 46.73 | 43.67 | 43.86 | 37.17 | 40.52

To perform UDA with adversarial learning, an adversarial loss and a discriminator loss are needed; generally, the loss functions proposed in GAN [10] are used. In our framework, the DHA loss, inspired by Geometric GAN [18], is utilized instead. To validate the effectiveness of the proposed DHA loss, we therefore conducted experiments under the same conditions but with different loss functions for adversarial learning; that is, only the adversarial loss and discriminator loss are changed while the other parts of the ETM framework, including the TM, remain the same. Table 7 reports the Continual UDA performance with the loss functions from GAN and Geometric GAN, and with the DHA loss. First, when the DHA loss is used, the model outperforms the other two. Also noteworthy is that the performance with the GAN loss is better than with the Geometric GAN loss, a tendency opposite to the image generation field, in which Geometric GAN outperforms GAN. We therefore conclude that the Geometric GAN loss is not directly suitable for UDA, whereas the DHA loss, which modifies it, is.
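For comparison, the three segmentation-network-side objectives in Table 7 can be sketched as follows; this is our rendering for a single tensor of discriminator scores (with a mean reduction), not the exact code of the baselines.

```python
import torch
import torch.nn.functional as F

def gan_adv_loss(z_t: torch.Tensor) -> torch.Tensor:
    # Standard GAN adversarial loss: make target features look like source (label 1).
    return F.binary_cross_entropy_with_logits(z_t, torch.ones_like(z_t))

def geometric_gan_adv_loss(z_t: torch.Tensor) -> torch.Tensor:
    # Geometric GAN (hinge) generator loss: push target scores upward only.
    return (-z_t).mean()

def dha_adv_loss(z_s: torch.Tensor, z_t: torch.Tensor) -> torch.Tensor:
    # DHA (Eq. (4)): moves source and target toward each other and stops
    # updating once the source score falls below the target score.
    return F.relu(z_s - z_t).mean()
```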

Further Analysis on the TM.

Here, we present quantitative results to clarify which information the TM contains (see Table 6). In the upper part of the table (GTA5 → CityScapes), the per-class IoU and mIoU values on the CityScapes dataset are reported when UDA is performed to the CityScapes dataset with the GTA5 dataset as the source domain. In the lower part (GTA5 → CityScapes → IDD), the performance on the IDD dataset after Continual UDA to the IDD dataset is reported. Overall, the IoU values without the TM decrease for most classes compared to when the TM is used, which leads to a reduction in mIoU. Among the classes, the IoU values decrease significantly for objects such as poles, traffic lights, traffic signs, and people; in other words, the TM contains much information about such objects. From this experiment, we argue that the $1\times 1$ Conv. module mainly contains information about such instance-level objects, while the Avg. Pool. module contains bits of information about all objects in general.