
© 2020 IEEE. Published in the IEEE 2020 International Geoscience & Remote Sensing Symposium (IGARSS 2020), July 2020, Waikoloa, Hawaii, USA. Personal use of this material is permitted.

S2-cGAN: Self-Supervised Adversarial Representation Learning for Binary Change Detection in Multispectral Images

Abstract

Deep Neural Networks have recently demonstrated promising performance in binary change detection (CD) problems in remote sensing (RS), but they require a large amount of labeled multitemporal training samples. Since collecting such data is time-consuming and costly, most of the existing methods rely on networks pre-trained on publicly available computer vision (CV) datasets. However, because of the differences in image characteristics between CV and RS, this approach limits the performance of the existing CD methods. To address this problem, we propose a self-supervised conditional Generative Adversarial Network (S2-cGAN). The proposed S2-cGAN is trained to generate only the distribution of unchanged samples. To this end, the proposed method consists of two main steps: 1) generating a reconstructed version of the input image as an unchanged image; and 2) learning the distribution of unchanged samples through an adversarial game. Unlike the existing GAN based methods (which only use the discriminator during the adversarial training to supervise the generator), the S2-cGAN directly exploits the discriminator likelihood to solve the binary CD task. Experimental results show the effectiveness of the proposed S2-cGAN when compared to the state-of-the-art CD methods. Our code is available online: https://gitlab.tubit.tu-berlin.de/rsim/S2-cGAN

Index Terms—  Generative adversarial networks, binary change detection, multitemporal images, self-supervised learning, remote sensing.

1 Introduction

Binary change detection (CD) in multitemporal multispectral remote sensing (RS) images is a key component for monitoring environmental phenomena [1, 2, 3, 4]. In recent years, Deep Neural Networks (DNNs) have achieved remarkable performance in several RS applications, including CD [2, 3]. Most of the DNNs designed for CD problems require a huge amount of multitemporal labeled samples to adjust all parameters during training and reach high performance. However, collecting multitemporal labeled samples is often highly expensive and requires expertise. To address this problem, unsupervised deep learning (DL) based approaches have recently been introduced in RS that employ pre-trained networks as generic feature extractors. In [1], El Amin et al. propose to extract deep distance images (DI) using a fully convolutional network trained for semantic segmentation. In that work, it is shown that the extracted deep feature maps contain semantic information, which can improve the DI analysis. Similarly, Saha et al. demonstrate the effectiveness of employing semantic-aware deep feature maps to achieve a performance gain in Change Vector Analysis (CVA) [5]. Bergamasco et al. propose a multilayer convolutional autoencoder to reconstruct the input image and learn task-specific features using a deep network [6]. Multi-scale features are then extracted from the pre- and post-change images and analyzed for CD. In recent years, unsupervised CD methods based on deep Generative Adversarial Networks (GANs) [7] have attracted attention in RS. GANs are deep networks commonly used for generating realistic data (e.g., images), where the supervision is indirectly provided by an adversarial game between two independent networks: a generator ($G$) and a discriminator ($D$), which are both trained with unlabeled samples. Recently, Saha et al. adapt a GAN to solve multisensor CD, where the task is to detect changes across two images obtained by different sensors [4]. In that work, the generative power of the GAN is only utilized to transcode images between two domains, and an external classifier is required to detect the changes between the two transcoded images. Most of the above-mentioned methods exploit DL models with proven architectures that are pre-trained on large-scale computer vision (CV) datasets (e.g., ImageNet). However, this is not a fully suitable approach in RS because of the differences in the characteristics of CV and RS images.

Unlike the aforementioned deep generative approaches, in this paper we aim at learning an adversarial representation that not only can be learned in a self-supervised fashion, but also can be directly used for CD without any need for an external classifier or a pre-trained network. To this end, we propose a self-supervised conditional Generative Adversarial Network (S2-cGAN). The S2-cGAN is trained using a training set containing only pairs of unchanged samples that are reconstructed (generated) from a single image without using any external supervision. During the adversarial training, $G$ learns how to generate unchanged data only, whereas $D$ learns how to detect deviations from it. Such a $D$ can be directly used as a change classifier, considering changes as outliers with respect to the sample distribution learned on unchanged data. To the best of our knowledge, the proposed S2-cGAN is the first method that directly exploits the GAN discriminator to solve the binary change detection problem.

2 Proposed Approach

Let $\boldsymbol{X}_1$ and $\boldsymbol{X}_2$ be two co-registered RS images acquired by the same sensor over the same geographical area at times $t_1$ and $t_2$, respectively. Both images are divided into $N$ non-overlapping patches and represented as $\boldsymbol{X}_1=\{x_1^i\}_{i=1,\dots,N}$ and $\boldsymbol{X}_2=\{x_2^i\}_{i=1,\dots,N}$, where $x_1^i$ and $x_2^i$ denote the $i$-th pair of patches associated to $t_1$ and $t_2$. In order to detect binary changes between $\boldsymbol{X}_1$ and $\boldsymbol{X}_2$, in this paper a self-supervised conditional Generative Adversarial Network (S2-cGAN) is proposed. The proposed S2-cGAN aims to train networks $G$ and $D$ using only pairs of patches without any land-cover change. Network $G$ learns how to generate only unchanged pairs of patches, whereas $D$ learns to distinguish pairs of unchanged samples (pixels) from pairs of changed samples. Fig. 1 summarizes the general training strategy of the proposed S2-cGAN. The details are provided in the following sub-sections.
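To make the patch-based formulation concrete, the following minimal Python sketch divides a co-registered image into non-overlapping patches; the helper name and default patch size are illustrative assumptions, not part of the paper.

```python
# Illustrative sketch (not the authors' code): split an (H, W, C) image into
# non-overlapping patches; applying it to X1 and X2 yields the pairs (x_1^i, x_2^i).
import numpy as np

def to_patches(image, patch_size=128):
    H, W, C = image.shape
    rows, cols = H // patch_size, W // patch_size
    return [image[r * patch_size:(r + 1) * patch_size,
                  c * patch_size:(c + 1) * patch_size, :]
            for r in range(rows) for c in range(cols)]

# x1_patches = to_patches(X1)   # {x_1^i}
# x2_patches = to_patches(X2)   # {x_2^i}, aligned with x1_patches by index
```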

2.1 Self-Supervised Adversarial Learning

The proposed S2-cGAN aims at learning a self-supervised representation from unchanged patches in order to estimate the likelihood of change with respect to the learned distribution and thus detect possible changes. In general, conditional GANs (cGANs) consist of two networks: 1) the generator network ($G$), which aims at generating realistic data; and 2) the discriminator network ($D$), which aims at discriminating real data from data generated by $G$. More specifically, a cGAN [7] takes as input an image $I_1$ and generates a new image $\tilde{I}_1$. $D$ tries to distinguish $I_1$ from $\tilde{I}_1$ considering the conditional information given by an image $I_2$, while $G$ tries to fool $D$ by producing increasingly realistic images that are indistinguishable from the real ones. Isola et al. propose an image-to-image translation framework based on cGANs [7], and show that a U-Net encoder-decoder with skip connections can be used as the generator architecture, together with a patch-based discriminator, to transform images into different representations (conditional information).
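As a rough illustration of such a U-Net encoder-decoder generator with skip connections, a hedged PyTorch sketch is given below; the depth, channel widths, and the use of dropout as the source of stochasticity $z$ follow the convention of [7] and are assumptions rather than the exact architecture used in this paper.

```python
# Hedged sketch of a small U-Net-style generator with skip connections.
import torch
import torch.nn as nn

def down(c_in, c_out):
    return nn.Sequential(nn.Conv2d(c_in, c_out, 4, stride=2, padding=1),
                         nn.BatchNorm2d(c_out), nn.LeakyReLU(0.2))

def up(c_in, c_out):
    # Dropout acts as the noise source z, as in [7].
    return nn.Sequential(nn.ConvTranspose2d(c_in, c_out, 4, stride=2, padding=1),
                         nn.BatchNorm2d(c_out), nn.ReLU(), nn.Dropout(0.5))

class UNetGenerator(nn.Module):
    def __init__(self, bands):
        super().__init__()
        self.d1, self.d2, self.d3 = down(bands, 64), down(64, 128), down(128, 256)
        self.u1, self.u2 = up(256, 128), up(256, 64)   # decoder inputs include skips
        self.out = nn.Sequential(nn.ConvTranspose2d(128, bands, 4, 2, 1), nn.Tanh())

    def forward(self, x):
        e1 = self.d1(x)                               # 128 -> 64
        e2 = self.d2(e1)                              # 64 -> 32
        e3 = self.d3(e2)                              # 32 -> 16 (bottleneck)
        u1 = self.u1(e3)                              # 16 -> 32
        u2 = self.u2(torch.cat([u1, e2], dim=1))      # skip connection from e2
        return self.out(torch.cat([u2, e1], dim=1))   # skip connection from e1
```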

The proposed S2-cGAN is inspired by [7]; however, unlike [7], the S2-cGAN does not aim at realistic image translation. Instead, the S2-cGAN exploits $G$ to learn the pattern of unchanged pairs of patches. In order to train our network, we need a set of pairs of multitemporal images in which no change is observed between times $t_1$ and $t_2$. Such pairs can be constructed using the image $\boldsymbol{X}_1$: a corresponding image $\boldsymbol{\tilde{X}}_2=\{\tilde{x}_2^i\}_{i=1,\dots,N}$ is defined as $\boldsymbol{\tilde{X}}_2=\boldsymbol{X}_1+w$, where $w$ represents zero-mean Gaussian noise drawn from a fixed distribution for all patches in $\boldsymbol{X}_1$, such that $w\sim\mathcal{N}(0,\sigma)$. Then, a set ${\cal X}=\{(x_1^i,\tilde{x}_2^i)\}_{i=1,\dots,N}$ of pairs of patches selected from the images $\boldsymbol{X}_1$ and $\boldsymbol{\tilde{X}}_2$ is defined. This set represents the unchanged pairs corresponding to the same geographical area and is used for learning the unchanged data representation in a self-supervised fashion.
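A minimal sketch of this pair construction is given below; the noise standard deviation $\sigma$ is an assumption, since its exact value is not specified here.

```python
# Sketch of building an unchanged training pair: x~_2^i = x_1^i + w, w ~ N(0, sigma).
import numpy as np

def make_unchanged_pair(x1, sigma=0.01, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    w = rng.normal(loc=0.0, scale=sigma, size=x1.shape)   # zero-mean Gaussian noise
    return x1, x1 + w                                     # the pair (x_1^i, x~_2^i)

# Applying this to every patch of X1 yields the self-supervised set
# {(x_1^i, x~_2^i)} of unchanged pairs; no labeled changed samples are needed.
```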


Fig. 1: Illustration of the training strategy introduced within the proposed S2-cGAN. The data distribution in the feature space (represented by the Gaussian curve under $D$) is much denser in the area corresponding to the real and unchanged samples.

Networks $G$ and $D$ are trained using both reconstruction and conditional GAN objectives. The reconstruction objective ${\cal L}_{L1}$ is given as:

{\cal L}_{L1}(x_{1}^{i},\tilde{x}_{2}^{i}) = \lVert \tilde{x}_{2}^{i} - r \rVert_{1}, \qquad (1)

where $r=G(x_{1}^{i},z)$ and $z$ is a noise vector drawn from a noise distribution ${\cal Z}$. The conditional adversarial objective ${\cal L}_{cGAN}$ is defined as:

{\cal L}_{cGAN}(G,D) = \mathbb{E}_{(x_{1}^{i},\tilde{x}_{2}^{i})\in{\cal X}}\big[\log D(x_{1}^{i},\tilde{x}_{2}^{i})\big] + \mathbb{E}_{x_{1}^{i}\in\boldsymbol{X}_{1},\,z\in{\cal Z}}\big[\log\big(1-D(x_{1}^{i},G(x_{1}^{i},z))\big)\big]. \qquad (2)

It is important to emphasise that both images $\boldsymbol{X}_1$ and $\boldsymbol{\tilde{X}}_2$ are unchanged co-registered images. The proposed learning approach does not need samples showing any change at training time. This makes it possible to train the discriminator without any need for fully supervised training data: $G$ acts as implicit supervision for $D$. During training, $G$ observes only unchanged patches, while $D$ learns to distinguish real unchanged samples from changed or fake ones. As a consequence, at the end of the training process, the discriminator has learned to separate real samples from artifacts. The training procedure is schematically represented in Fig. 1. The data distribution is depicted by the Gaussian curve at the top of the figure. The discriminator is represented by the decision boundary in the learned feature space (black circle), which separates this distribution from the rest of the feature space. Both non-realistic generated samples (blue dots) and changed samples (red dots) lie outside this decision boundary. The latter represent cases never observed during the training phase and are hence treated by $D$ as outliers (lying outside the discriminator's decision boundary). The learned decision boundary of the discriminator is used to detect changes.
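A hedged PyTorch sketch of one training step combining objectives (1) and (2) is given below; it is our illustration, not the released implementation. The L1 weight, the realization of the noise $z$ through dropout in $G$, and the assumption that $D$ outputs per-pixel probabilities are borrowed from the convention of [7].

```python
# Sketch of a single adversarial training step on an unchanged pair (x_1^i, x~_2^i).
import torch
import torch.nn.functional as F

def train_step(G, D, opt_G, opt_D, x1, x2_tilde, lambda_l1=100.0):
    # --- Discriminator update: real unchanged pair vs. generated pair ---
    fake = G(x1)                                       # r = G(x_1^i, z); z via dropout in G
    d_real = D(torch.cat([x1, x2_tilde], dim=1))       # D(x_1^i, x~_2^i)
    d_fake = D(torch.cat([x1, fake.detach()], dim=1))  # D(x_1^i, G(x_1^i, z))
    loss_D = F.binary_cross_entropy(d_real, torch.ones_like(d_real)) \
           + F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake))
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # --- Generator update: fool D and stay close to the unchanged target (Eq. 1) ---
    d_fake = D(torch.cat([x1, fake], dim=1))
    loss_G = F.binary_cross_entropy(d_fake, torch.ones_like(d_fake)) \
           + lambda_l1 * F.l1_loss(fake, x2_tilde)
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
    return loss_D.item(), loss_G.item()
```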

Fig. 2: Illustration of the change detection strategy introduced within the proposed S2-cGAN.

2.2 CD through Adversarially Learned Representations

In our adversarial approach, the discriminator network is used for detecting changes. As depicted in Fig. 2, consider a test patch $x_1^i\in\boldsymbol{X}_1$ acquired at time $t_1$ and its corresponding patch $x_2^i\in\boldsymbol{X}_2$ acquired at time $t_2$. Given the patch $x_1^i$, we utilize $G$ to generate $G(x_1^i)$ and compute the reconstruction error with respect to $x_2^i$ using (1), i.e., $e_r={\cal L}_{L1}(x_2^i,G(x_1^i))$. The reconstruction error of the adversarially trained generator is the first component of our CD strategy. In addition, we apply the pixel-based discriminator $D$ to estimate the out-of-distribution likelihood as the second component. To this end, two pixel-wise score maps $S^{x_1^i,x_2^i}$ and $S^{x_1^i,G(x_1^i,z)}$ are computed using $D(x_1^i,x_2^i)$ and $D(x_1^i,G(x_1^i,z))$, respectively. Finally, a difference map $S_{dif}^i=S^{x_1^i,x_2^i}-S^{x_1^i,G(x_1^i,z)}$ is computed to represent the significance of the score map $S^{x_1^i,x_2^i}$ with respect to the reference score map $S^{x_1^i,G(x_1^i,z)}$. As a result, a possible change between $x_1^i$ and $x_2^i$, and/or between $G(x_1^i)$ and $x_2^i$, corresponds to an outlier with respect to the data distribution learned by $D$ during training and results in a low value in $S_{dif}$. The final CD map is obtained as the Hadamard product of each pair of $e_r$ and $S_{dif}$, i.e., $\mathcal{H}=e_r\circ S_{dif}$.
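A compact sketch of this CD strategy is shown below, assuming that $D$ outputs a pixel-wise probability map and that patches are given as PyTorch tensors of shape (batch, bands, 128, 128); variable names are illustrative.

```python
# Sketch of the CD strategy in Sec. 2.2: reconstruction error e_r, discriminator
# score maps, their difference S_dif, and the final map H = e_r ∘ S_dif.
import torch

@torch.no_grad()
def change_score(G, D, x1, x2):
    g = G(x1)                                         # G(x_1^i)
    e_r = (x2 - g).abs().mean(dim=1, keepdim=True)    # per-pixel L1 error, cf. Eq. (1)
    s_test = D(torch.cat([x1, x2], dim=1))            # S^{x_1^i, x_2^i}
    s_ref = D(torch.cat([x1, g], dim=1))              # S^{x_1^i, G(x_1^i, z)}
    s_dif = s_test - s_ref                            # low values flag out-of-distribution pairs
    return e_r * s_dif                                # H, thresholded with tau to get the CD map
```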

Fig. 3: Examples of score maps associated to different pairs of patches. For each row: (a) a patch acquired at $t_1$ ($x_1^i$); (b) a patch acquired at $t_2$ ($x_2^i$); (c) their difference map ($S_{dif}$); (d) their reconstruction error ($e_r$); and (e) the reference CD map.

3 Experimental Results

To evaluate the proposed S2-cGAN, bitemporal Very High Spatial Resolution (VHR) multispectral images acquired over Shenzhen, China, by the WorldView-2 satellite have been used in the experiments [8]. The image acquired in 2010 ($t_1$) has been considered as $\boldsymbol{X}_1$, while the image acquired in 2015 ($t_2$) has been taken as $\boldsymbol{X}_2$. Both $\boldsymbol{X}_1$ and $\boldsymbol{X}_2$ have a size of $1431\times1431$ pixels with a spatial resolution of 2 m. In all the experiments, the generator network $G$ was based on a U-Net autoencoder for input patches of size $128\times128$ pixels and was composed of six convolutional layers. For the discriminator $D$, a pixel-based network with two convolutional layers was employed (a sketch is given below). The co-registered images $\boldsymbol{X}_1$ and $\boldsymbol{X}_2$ were divided into 7646 patches of size $128\times128$ pixels. For the self-supervised adversarial learning of the unchanged sample representations, we constructed $\boldsymbol{\tilde{X}}_2$ from $\boldsymbol{X}_1$ by applying Gaussian noise to $\boldsymbol{X}_1$; this represents the unchanged (generated) image at time $t_2$. Half of the pairs of patches of $\boldsymbol{X}_1$ and $\boldsymbol{\tilde{X}}_2$ were randomly selected for training. The remaining patches of $\boldsymbol{X}_1$, together with their pairs in $\boldsymbol{X}_2$, were considered as pairs of test patches for the evaluation. The training was based on stochastic gradient descent with momentum 0.5, and the network was trained for 50 epochs. The two score maps $S_{dif}$ (the difference map) and $e_r$ (the reconstruction error) were computed as explained in Sec. 2.2. Fig. 3 illustrates the score maps for two different pairs of patches. From the figure, one can observe that in most cases $e_r$ and $S_{dif}$ provide complementary information. Similar behaviour has been observed by varying the pairs of patches. For quantitative evaluation, we applied a threshold value $\tau$ over the score map $\mathcal{H}$ to obtain the binary CD map. Due to the strong local variations in the considered VHR images, instead of choosing a single decision boundary for the value of $\tau$, we adapted the context-dependent local adaptive decision boundary strategy used in [5]. We compared the CD map obtained by the proposed S2-cGAN with: 1) the Fully Convolutional Early Fusion (FC-EF) method, which is one of the best performing supervised methods with a fully convolutional U-Net architecture [2]; and 2) the Deep Change Vector Analysis (DCVA) method, which is a powerful unsupervised method using a CNN pre-trained for semantic segmentation to obtain multitemporal deep features [5]. For the sake of fairness, the same pairs of test patches have been used for all the considered methods. Note that DCVA is fully unsupervised and does not use labeled samples. The fully supervised FC-EF exploits a set of pairs of training patches that consists of: i) the same pairs of labeled unchanged samples used in the S2-cGAN; and ii) 13,065 pairs of changed samples. The binary CD maps obtained by the FC-EF, the DCVA and the proposed S2-cGAN are shown in Fig. 4. From Fig. 4, one can see that the proposed S2-cGAN is in most cases able to correctly identify the central area of the changed objects. Table 1 shows quantitative results in terms of overall accuracy (OA), specificity (SPC), sensitivity (SEN) and overall error rate (ERR) obtained by the FC-EF, the DCVA and the proposed S2-cGAN.
By analysing the table, one can observe that the proposed S2-cGAN is comparable with the fully supervised FC-EF, and outperforms the unsupervised DCVA mainly in terms of specificity (which reflects the accuracy of detecting unchanged samples), sensitivity (which reflects the accuracy of detecting changed pixels) and overall accuracy. As an example, the sensitivity obtained by the proposed method is about 6% and 14% higher than that obtained by the unsupervised DCVA and the fully supervised FC-EF, respectively. It is worth emphasizing that our method performs better than both methods in terms of sensitivity despite the absence of changed samples during training. This is due to the impact of adversarial training during GAN optimization.
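For reference, the sketch below illustrates a pixel-based discriminator with two convolutional layers in PyTorch; the channel width and the number of input bands are assumptions, since only the layer count is stated above.

```python
# Hedged sketch of a pixel-wise discriminator: two 1x1 convolutions producing a
# per-pixel probability that the input pair of patches is a real unchanged pair.
import torch.nn as nn

class PixelDiscriminator(nn.Module):
    def __init__(self, in_channels, hidden=64):   # in_channels = 2 x number of bands
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, hidden, kernel_size=1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(hidden, 1, kernel_size=1),
            nn.Sigmoid(),                          # per-pixel score map
        )

    def forward(self, pair):                       # pair: (B, in_channels, 128, 128)
        return self.net(pair)
```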

Table 1: Binary CD results obtained by the FC-EF, the DCVA, and the proposed S2-cGAN. OA: overall accuracy, SPC: specificity, SEN: sensitivity, ERR: overall error rate.
Method  | Category        | OA     | SPC     | SEN    | ERR
FC-EF   | Supervised      | 0.8633 | 0.9693  | 0.3627 | 0.1366
DCVA    | Unsupervised    | 0.8280 | 0.9106  | 0.4450 | 0.1719
S2-cGAN | Self-Supervised | 0.8482 | 0.92081 | 0.5053 | 0.1517
Fig. 4: CD results: (a) $\boldsymbol{X}_1$; (b) $\boldsymbol{X}_2$; (c) reference CD map; (d) CD map obtained by the DCVA; (e) CD map obtained by the FC-EF; and (f) CD map obtained by the S2-cGAN.

4 Conclusion

In this paper, we have introduced a self-supervised conditional Generative Adversarial Network (S2-cGAN) for binary CD problems in RS. The proposed S2-cGAN exploits the mutual supervisory information of the generator and the discriminator networks to train a deep network by using a self-supervised multitemporal training set (which includes only pairs of unchanged samples). Differently from existing GAN-based CD methods, the proposed method directly uses the GAN discriminator as the classifier. Experimental results show that the proposed S2-cGAN leads to a higher performance in terms of sensitivity compared to state-of-the-art fully supervised and unsupervised methods. This has been achieved without using any pairs of labeled changed samples. We underline that this is a very important advantage, since the proposed S2-cGAN reduces the cost required for reference data collection. As future work, we plan to integrate the proposed learning approach into an active learning setup with a human in the loop.

5 Acknowledgement

This work was supported by the European Research Council under the ERC Starting Grant BigEarth-759764.

References

  • [1] A. M. El Amin, Q. Liu, and Y. Wang, “Convolutional neural network features based change detection in satellite images,” in International Workshop on Pattern Recognition, 2016, vol. 10011, pp. 181 – 186.
  • [2] R. C. Daudt, B. Le Saux, and A. Boulch, “Fully convolutional siamese networks for change detection,” in ICIP, Athens, Greece, 2018, pp. 4063–4067.
  • [3] A. Song, J. Choi, Y. Han, and Y. Kim, “Change detection in hyperspectral images using recurrent 3d fully convolutional networks,” Remote Sensing, vol. 10, no. 11, pp. 1827, 2018.
  • [4] S. Saha, F. Bovolo, and L. Bruzzone, “Unsupervised multiple-change detection in vhr multisensor images via deep-learning based adaptation,” in IEEE IGARSS, Yokohama, Japan, 2019, pp. 5033–5036.
  • [5] S. Saha, F. Bovolo, and L. Bruzzone, “Unsupervised deep change vector analysis for multiple-change detection in vhr images,” IEEE Transactions on Geoscience and Remote Sensing, vol. 57, no. 6, pp. 3677–3693, 2019.
  • [6] L. Bergamasco, S. Saha, F. Bovolo, and L. Bruzzone, “Unsupervised change-detection based on convolutional-autoencoder feature extraction,” in SPIE Image and Signal Processing for Remote Sensing XXV, 2019, vol. 11155, pp. 325 – 332.
  • [7] P. Isola, J. Zhu, T. Zhou, and A. Efros, “Image-to-image translation with conditional adversarial networks,” arXiv:1611.07004, 2016.
  • [8] M. Zhang and W. Shi, “A feature difference convolutional neural network-based change detection method,” IEEE Transactions on Geoscience and Remote Sensing, 2020, doi:10.1109/TGRS.2020.2981051.