© 2020 IEEE. Published in the IEEE 2020 International Geoscience & Remote Sensing Symposium (IGARSS 2020), July 2020, Waikoloa, Hawaii, USA. Personal use of this material is permitted.
S2-cGAN: Self-Supervised Adversarial Representation Learning for Binary Change Detection in Multispectral Images
Abstract
Deep Neural Networks (DNNs) have recently demonstrated promising performance in binary change detection (CD) problems in remote sensing (RS), but they require a large amount of labeled multitemporal training samples. Since collecting such data is time-consuming and costly, most of the existing methods rely on networks pre-trained on publicly available computer vision (CV) datasets. However, because of the differences between the characteristics of CV and RS images, this approach limits the performance of the existing CD methods. To address this problem, we propose a self-supervised conditional Generative Adversarial Network (S2-cGAN). The proposed S2-cGAN is trained to generate only the distribution of unchanged samples. To this end, the proposed method consists of two main steps: 1) generating a reconstructed version of the input image as an unchanged image; and 2) learning the distribution of unchanged samples through an adversarial game. Unlike the existing GAN-based methods (which only use the discriminator during the adversarial training to supervise the generator), the S2-cGAN directly exploits the discriminator likelihood to solve the binary CD task. Experimental results show the effectiveness of the proposed S2-cGAN when compared to state-of-the-art CD methods. Our code is available online: https://gitlab.tubit.tu-berlin.de/rsim/S2-cGAN
Index Terms— Generative adversarial networks, binary change detection, multitemporal images, self-supervised learning, remote sensing.
1 Introduction
Binary change detection (CD) in multitemporal multispectral remote sensing (RS) images is a key component for monitoring environmental phenomena [1, 2, 3, 4]. In recent years, Deep Neural Networks (DNNs) have achieved remarkable performance in several RS applications, including CD [2, 3]. Most of the DNNs designed for CD problems require a huge amount of multitemporal labeled samples to adjust all parameters during training and reach high performance. However, collecting multitemporal labeled samples is often highly expensive and requires expertise. To address this problem, unsupervised deep learning (DL) based approaches have recently been introduced in RS to employ pre-trained networks as generic feature extractors. El Amin et al. propose to extract deep distance images (DIs) using a fully convolutional network trained for semantic segmentation [1]. In this work, it is shown that the extracted deep feature maps contain semantic information, which can improve the DI analysis. Similarly, Saha et al. demonstrate the effectiveness of employing semantic-aware deep feature maps to achieve a performance gain in Change Vector Analysis (CVA) [5]. Bergamasco et al. propose a multilayer convolutional autoencoder to reconstruct the input image and learn task-specific features using a deep network [6]. Then, multi-scale features are extracted from the pre- and post-change images and analyzed for CD. In recent years, unsupervised CD methods based on deep Generative Adversarial Networks (GANs) [7] have attracted attention in RS. GANs are deep networks commonly used for generating realistic data (e.g., images), where the supervision is indirectly provided by an adversarial game between two independent networks: a generator ($G$) and a discriminator ($D$), which are both trained with unlabeled samples. Recently, Saha et al. adapted a GAN to solve multisensor CD, where the task is to detect changes across two images obtained by different sensors [4]. In this work, the generative power of the GAN is only utilized to transcode images between two domains, and an external classifier is required to detect the changes between the two transcoded images. Most of the above-mentioned methods exploit DL models with proven architectures that are pre-trained on large-scale computer vision (CV) datasets (e.g., ImageNet). However, this is not a fully suitable approach in RS, because of the differences between the characteristics of CV and RS images.
Unlike the aforementioned deep generative approaches, in this paper we aim at learning an adversarial representation that not only can be learned in a self-supervised fashion but also can be directly used for CD without any need for an external classifier or a pre-trained network. To this end, we propose a self-supervised conditional Generative Adversarial Network (S2-cGAN). The S2-cGAN is trained using a training set containing only pairs of unchanged samples that are reconstructed (generated) from a single image without using any external supervision. During the adversarial training, $G$ learns how to generate unchanged data only, whereas $D$ learns how to detect deviations from it. Such a discriminator can be directly used as a change classifier, considering changes as outliers with respect to the sample distribution learned on unchanged data. To the best of our knowledge, the proposed S2-cGAN is the first method that directly exploits the GAN discriminator to formulate the binary change detection problem.
2 Proposed Approach
Let $X_1$ and $X_2$ be two co-registered RS images acquired by the same sensor over the same geographical area at times $t_1$ and $t_2$, respectively. Both images are divided into non-overlapping patches and represented as $X_1 = \{x_1^i\}_{i=1}^{P}$ and $X_2 = \{x_2^i\}_{i=1}^{P}$, where $x_1^i$ and $x_2^i$ denote the $i$-th pair of patches associated to $X_1$ and $X_2$. In order to detect binary changes between $X_1$ and $X_2$, in this paper a self-supervised conditional Generative Adversarial Network (S2-cGAN) is proposed. The proposed S2-cGAN aims to train the networks $G$ and $D$ using only pairs of patches without any land-cover change. The network $G$ learns how to generate only the unchanged pairs of patches. On the other hand, $D$ learns to distinguish pairs of unchanged samples (pixels) from those of changed samples. Fig. 1 summarizes the general training strategy of the proposed S2-cGAN. The details are provided in the following sub-sections.
2.1 Self-Supervised Adversarial Learning
The proposed S2-cGAN aims at learning a self-supervised representation from unchanged patches to estimate the likelihood of change with respect to the learned distribution and to detect possible changes accordingly. In general, conditional GANs (cGANs) consist of two networks: 1) the generator network ($G$), which aims at generating realistic data; and 2) the discriminator network ($D$), which aims at discriminating real data from the data generated by $G$. More specifically, a cGAN [7] takes as input an image $X$ and generates a new image $G(X)$. $D$ tries to distinguish $G(X)$ from the real image considering the conditional information given by image $X$, while $G$ tries to fool $D$ by producing more and more realistic images that are indistinguishable from the real ones. Isola et al. proposed an image-to-image translation framework based on cGANs [7], and showed that a U-Net encoder-decoder with skip connections can be used as the generator architecture together with a patch-based discriminator to transform images to different representations (conditional information).
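To make the architecture concrete, the following PyTorch sketch outlines such a generator/discriminator pair. The layer counts follow the experimental setup in Sec. 3 (six convolutional layers for $G$, two for $D$); the channel widths, kernel sizes, and the four-band input are illustrative assumptions, not values reported in the paper.

```python
# Minimal sketch of the two networks, assuming a 4-band input and
# pix2pix-style layer shapes; only the layer counts come from Sec. 3.
import torch
import torch.nn as nn

class UNetGenerator(nn.Module):
    """Encoder-decoder with skip connections (U-Net style)."""
    def __init__(self, in_ch=4, base=64):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(in_ch, base, 4, 2, 1), nn.LeakyReLU(0.2))
        self.enc2 = nn.Sequential(nn.Conv2d(base, base * 2, 4, 2, 1),
                                  nn.BatchNorm2d(base * 2), nn.LeakyReLU(0.2))
        self.enc3 = nn.Sequential(nn.Conv2d(base * 2, base * 4, 4, 2, 1),
                                  nn.BatchNorm2d(base * 4), nn.LeakyReLU(0.2))
        self.dec3 = nn.Sequential(nn.ConvTranspose2d(base * 4, base * 2, 4, 2, 1),
                                  nn.BatchNorm2d(base * 2), nn.ReLU())
        self.dec2 = nn.Sequential(nn.ConvTranspose2d(base * 4, base, 4, 2, 1),
                                  nn.BatchNorm2d(base), nn.ReLU())
        self.dec1 = nn.Sequential(nn.ConvTranspose2d(base * 2, in_ch, 4, 2, 1), nn.Tanh())

    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(e1)
        e3 = self.enc3(e2)
        d3 = self.dec3(e3)
        d2 = self.dec2(torch.cat([d3, e2], dim=1))    # skip connection
        return self.dec1(torch.cat([d2, e1], dim=1))  # skip connection

class PixelDiscriminator(nn.Module):
    """Pixel-wise D: conditioned on X1, scores every pixel of the second image."""
    def __init__(self, in_ch=4, base=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch * 2, base, 1), nn.LeakyReLU(0.2),
            nn.Conv2d(base, 1, 1), nn.Sigmoid())

    def forward(self, x1, x2):
        # concatenate the conditional image and the image under test
        return self.net(torch.cat([x1, x2], dim=1))
```

The 1x1 convolutions in the discriminator make its output a per-pixel likelihood map, which is what allows it to be reused directly as a pixel-wise change scorer in Sec. 2.2.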
The proposed S2-cGAN is inspired by [7]; however, unlike [7], the S2-cGAN does not aim at realistic image translation. Instead, the S2-cGAN exploits the adversarial training of $G$ and $D$ to learn the pattern of unchanged pairs of patches. In order to train our network, we need a set of pairs of multitemporal images in which no change is observed between times $t_1$ and $t_2$. Such a pair can be constructed using image $X_1$, for which a corresponding image $\tilde{X}_1$ can be defined as $\tilde{X}_1 = X_1 + N$, where $N$ represents the noise, which is assumed to be zero-mean Gaussian for all patches in $X_1$ and is selected from a fixed noise distribution, such that $N \sim \mathcal{N}(0, \sigma^2)$. Then, a set of pairs of patches selected from images $X_1$ and $\tilde{X}_1$ is defined. This set represents the unchanged pairs corresponding to the same geographical area and is used for learning the unchanged data representation in a self-supervised fashion.
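A minimal sketch of this pair construction is given below. Every pair is unchanged by construction, since the "time $t_2$" image is a noisy copy of $X_1$; the noise standard deviation `sigma` is an assumed value, as the text does not report it.

```python
# Self-supervised construction of unchanged training pairs (sketch).
import torch

def make_unchanged_pairs(x1_patches, sigma=0.05):
    """x1_patches: tensor of shape (P, C, H, W) holding the patches of X1.
    Returns (x1, x1_tilde), where x1_tilde = x1 + N with N ~ N(0, sigma^2)."""
    noise = torch.randn_like(x1_patches) * sigma  # zero-mean Gaussian noise
    return x1_patches, x1_patches + noise
```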
The networks $G$ and $D$ are trained using both a reconstruction and a conditional GAN objective. The reconstruction objective is given as:
$$\mathcal{L}_{rec}(G) = \mathbb{E}_{X_1, \tilde{X}_1, z}\left[\,\left\| \tilde{X}_1 - G(X_1, z) \right\|_1\,\right] \tag{1}$$
where $\hat{X}_1 = G(X_1, z)$ and $z$ is a noise vector (drawn from a noise distribution $p_z$). The conditional adversarial objective is defined as:
$$\mathcal{L}_{cGAN}(G, D) = \mathbb{E}_{X_1, \tilde{X}_1}\left[\log D(X_1, \tilde{X}_1)\right] + \mathbb{E}_{X_1, z}\left[\log\left(1 - D(X_1, G(X_1, z))\right)\right] \tag{2}$$
It is important to emphasize that both images $X_1$ and $\tilde{X}_1$ are unchanged co-registered images. The proposed learning approach does not need samples showing any change at training time. This makes it possible to train the discriminator without any need for fully supervised training data: $G$ acts as implicit supervision for $D$. During training, $G$ observes only unchanged patches. On the other hand, during training $D$ learns to distinguish real unchanged samples from changed or fake ones. As a consequence, at the end of the training process, the discriminator learns to separate real samples from artifacts. The training procedure is schematically represented in Fig. 1. The data distribution is depicted by the Gaussian curve at the top of the figure. The discriminator is represented by the decision boundary on the learned feature space (black circle), which separates this distribution from the rest of the feature space. Both non-realistic generated samples (blue dots) and changed samples (red dots) are placed outside this decision boundary. The latter represent cases never observed during the training phase and hence are treated by $D$ as outliers (they lie outside the discriminator's decision boundary). The learned decision boundary of the discriminator is used to detect changes.
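The sketch below illustrates how one adversarial training step combining (1) and (2) might look. Following [7], the noise $z$ can be realized implicitly (e.g., via dropout in $G$) and is omitted here for brevity; the weighting factor `lam` between the two objectives is an assumed hyperparameter, not a value from the paper.

```python
# One training step of the adversarial game (sketch), using the networks
# defined above. lam is an assumed weight on the L1 reconstruction term.
import torch
import torch.nn.functional as F

def train_step(gen, disc, x1, x1_tilde, opt_g, opt_d, lam=100.0):
    # --- discriminator step: real unchanged pair vs. generated pair ---
    x1_hat = gen(x1).detach()        # stop gradients from flowing into G
    d_real = disc(x1, x1_tilde)      # should be scored as real (1)
    d_fake = disc(x1, x1_hat)        # should be scored as fake (0)
    loss_d = F.binary_cross_entropy(d_real, torch.ones_like(d_real)) \
           + F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake))
    opt_d.zero_grad()
    loss_d.backward()
    opt_d.step()

    # --- generator step: fool D (eq. 2) and minimize the L1 term (eq. 1) ---
    x1_hat = gen(x1)
    d_fake = disc(x1, x1_hat)
    loss_g = F.binary_cross_entropy(d_fake, torch.ones_like(d_fake)) \
           + lam * F.l1_loss(x1_hat, x1_tilde)
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()
    return loss_d.item(), loss_g.item()
```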
2.2 CD through Adversarially Learned Representations
In our adversarial approach for detecting changes, the discriminator network $D$ is used. As depicted in Fig. 2, a test patch $x_1^i$ from time $t_1$ and its corresponding patch $x_2^i$ from time $t_2$ are given. Given the patch $x_1^i$, we utilize $G$ to generate $\hat{x}_1^i = G(x_1^i, z)$ and compute the reconstruction error $R^i$ using (1) with respect to $x_2^i$. In detail, we use the reconstruction errors of our adversarially trained generator as the first component of our CD strategy. In addition, we apply the pixel-based discriminator to estimate the out-of-distribution likelihood as the second component of CD. For that, two pixel-wise score maps $S^i = D(x_1^i, x_2^i)$ and $\hat{S}^i = D(x_1^i, \hat{x}_1^i)$ are computed. Finally, a difference map $M^i$ is computed to represent the significance of the score map $S^i$ considering the reference score map $\hat{S}^i$. As a result, a possible change between $x_1^i$ and $x_2^i$, and/or between $x_2^i$ and $\hat{x}_1^i$, corresponds to an outlier with respect to the data distribution learned by $D$ during training. This results in a low value in $S^i$. The final CD map is obtained by the Hadamard product $M^i \circ R^i$ for each pair of $x_1^i$ and $x_2^i$.
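Under the notation reconstructed above, the inference step could be sketched as follows. The Hadamard-product fusion is stated in the text, but the exact form of the difference map (here, a clamped gap between the reference score and the real-pair score) is our assumption.

```python
# CD inference (sketch): fuse the discriminator's out-of-distribution
# likelihood with the generator's reconstruction error.
import torch

@torch.no_grad()
def change_map(gen, disc, x1, x2):
    x1_hat = gen(x1)                                    # reconstructed "unchanged" view of x1
    r = (x2 - x1_hat).abs().mean(dim=1, keepdim=True)   # pixel-wise reconstruction error R
    s = disc(x1, x2)                                    # score map S for the real pair
    s_ref = disc(x1, x1_hat)                            # reference score map S_hat
    m = (s_ref - s).clamp(min=0)                        # difference map M: low S => likely change
    return m * r                                        # Hadamard product M ∘ R
```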
3 Experimental Results
To evaluate the proposed S2-cGAN, in the experiments a pair of bitemporal Very High Spatial Resolution (VHR) multispectral images acquired over Shenzhen, China by the WorldView-2 satellite has been used [8]. The image acquired in 2010 has been considered as $X_1$, while the image acquired in 2015 has been taken as $X_2$. Both $X_1$ and $X_2$ have the same size, with a spatial resolution of 2 m. In all the experiments, the generator network $G$ was based on a U-Net autoencoder operating on fixed-size input patches. The generator was composed of six convolutional layers. For the discriminator $D$, a pixel-based network with two convolutional layers was employed. The co-registered images $X_1$ and $X_2$ were divided into 7646 patches. For the self-supervised adversarial learning of the unchanged sample representations, we constructed $\tilde{X}_1$ from $X_1$ by applying Gaussian noise over $X_1$. This represents the unchanged (generated) image at time $t_2$. Half of the pairs of patches of $X_1$ and $\tilde{X}_1$ were randomly selected for training. The remaining patches of $X_1$, together with their pairs in $X_2$, were considered as pairs of test patches for the evaluation. The training was based on stochastic gradient descent with momentum 0.5, and the network was trained for 50 epochs.

The two score maps $M$ (the difference map) and $R$ (the reconstruction error) were computed as explained in Sec. 2.2. Fig. 3 illustrates the score maps for two different pairs of patches. From the figure, one can observe that in most of the cases $M$ and $R$ provide complementary information. Similar behaviour has been observed by varying the pairs of patches. For the quantitative evaluation, we applied a threshold $\tau$ over the final score map to obtain the binary CD map. Due to the strong local variations in the considered VHR images, instead of choosing a single decision boundary for $\tau$, we adapted the context-dependent local adaptive decision boundary strategy used in [5] (a sketch of such a strategy is given at the end of this section).

We compared the CD map obtained by the proposed S2-cGAN with: 1) the Fully Convolutional Early Fusion (FC-EF) method, which is one of the best performing supervised methods with a fully convolutional U-Net architecture [2]; and 2) the Deep Change Vector Analysis (DCVA) method, which is a powerful unsupervised method using a CNN pre-trained for semantic segmentation to obtain multitemporal deep features [5]. For the sake of fairness, the same pairs of test patches have been used for all the considered methods. Note that DCVA is fully unsupervised and does not use labeled samples. The fully supervised FC-EF exploits a set of pairs of training patches that consists of: i) the same pairs of labeled unchanged samples used in the S2-cGAN; and ii) 13,065 pairs of changed samples. The binary CD maps obtained by FC-EF, DCVA and the proposed S2-cGAN are shown in Fig. 4. From Fig. 4, one can see that the proposed S2-cGAN in most cases is able to correctly identify the central area of the changed objects. Table 1 shows the quantitative results in terms of overall accuracy (OA), specificity (SPC), sensitivity (SEN), and overall error rate (ERR) obtained by FC-EF, DCVA, and the proposed S2-cGAN. By analysing the table, one can observe that the proposed S2-cGAN is comparable with the fully supervised FC-EF, and outperforms the unsupervised DCVA mainly in terms of specificity (which shows the accuracy of detecting unchanged samples), sensitivity (which shows the accuracy of detecting changed pixels) and overall accuracy.
As an example, the sensitivity obtained by the proposed method is 6% and 14% higher than that obtained by the unsupervised DCVA and the fully supervised FC-EF, respectively. It is worth emphasizing that our method performs better than both methods in terms of sensitivity despite the absence of changed samples during training. This is due to the impact of adversarial training during GAN optimization.
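As referenced above, the binarization is context-dependent rather than global. The illustration below is a plausible stand-in for such a local adaptive scheme in the spirit of [5], not the exact formulation used there; the window size and scaling factor `k` are assumed values.

```python
# Local adaptive binarization (illustrative sketch): each pixel of the
# score map is thresholded against mean + k*std of its local neighborhood
# instead of a single global tau.
import torch
import torch.nn.functional as F

def local_adaptive_binarize(score_map, window=31, k=1.0):
    """score_map: tensor of shape (1, 1, H, W). Returns a binary CD map."""
    pad = window // 2
    mean = F.avg_pool2d(score_map, window, stride=1, padding=pad)
    sq_mean = F.avg_pool2d(score_map ** 2, window, stride=1, padding=pad)
    std = (sq_mean - mean ** 2).clamp(min=0).sqrt()   # local standard deviation
    return (score_map > mean + k * std).float()       # 1 = changed, 0 = unchanged
```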
Table 1. Results in terms of OA, SPC, SEN and ERR obtained by FC-EF, DCVA and the proposed S2-cGAN.

| Method | Category | OA | SPC | SEN | ERR |
|---|---|---|---|---|---|
| FC-EF | Supervised | 0.8633 | 0.9693 | 0.3627 | 0.1366 |
| DCVA | Unsupervised | 0.8280 | 0.9106 | 0.4450 | 0.1719 |
| S2-cGAN | Self-Supervised | 0.8482 | 0.9208 | 0.5053 | 0.1517 |
4 Conclusion
In this paper, we have introduced a self-supervised conditional Generative Adversarial Network (S2-cGAN) for binary CD problems in RS. The proposed S2-cGAN exploits the mutual supervisory information of the generator and discriminator networks to train a deep network using a self-supervised multitemporal training set (which includes only pairs of unchanged samples). Differently from the existing GAN-based CD methods, the proposed method directly uses the GAN discriminator as the classifier. Experimental results show that the proposed S2-cGAN leads to a higher performance in terms of sensitivity compared to state-of-the-art fully supervised and unsupervised methods. This has been achieved without using any pairs of labeled changed samples. We underline that this is a very important advantage, since the proposed S2-cGAN avoids the cost required for reference data collection. As future work, we plan to extend the proposed learning approach by integrating it into an active learning setup with a human in the loop.
5 Acknowledgement
This work was supported by the European Research Council under the ERC Starting Grant BigEarth-759764.
References
- [1] A. M. El Amin, Q. Liu, and Y. Wang, “Convolutional neural network features based change detection in satellite images,” in International Workshop on Pattern Recognition, 2016, vol. 10011, pp. 181–186.
- [2] R. C. Daudt, B. Le Saux, and A. Boulch, “Fully convolutional siamese networks for change detection,” in ICIP, Athens, Greece, 2018, pp. 4063–4067.
- [3] A. Song, J. Choi, Y. Han, and Y. Kim, “Change detection in hyperspectral images using recurrent 3D fully convolutional networks,” Remote Sensing, vol. 10, no. 11, p. 1827, 2018.
- [4] S. Saha, F. Bovolo, and L. Bruzzone, “Unsupervised multiple-change detection in VHR multisensor images via deep-learning based adaptation,” in IEEE IGARSS, Yokohama, Japan, 2019, pp. 5033–5036.
- [5] S. Saha, F. Bovolo, and L. Bruzzone, “Unsupervised deep change vector analysis for multiple-change detection in VHR images,” IEEE Transactions on Geoscience and Remote Sensing, vol. 57, no. 6, pp. 3677–3693, 2019.
- [6] L. Bergamasco, S. Saha, F. Bovolo, and L. Bruzzone, “Unsupervised change-detection based on convolutional-autoencoder feature extraction,” in SPIE Image and Signal Processing for Remote Sensing XXV, 2019, vol. 11155, pp. 325–332.
- [7] P. Isola, J. Zhu, T. Zhou, and A. Efros, “Image-to-image translation with conditional adversarial networks,” arXiv:1611.07004, 2016.
- [8] M. Zhang and W. Shi, “A feature difference convolutional neural network-based change detection method,” IEEE Transactions on Geoscience and Remote Sensing, 2020, doi:10.1109/TGRS.2020.2981051.