
© 2020 IEEE. Published in the IEEE 2020 International Geoscience & Remote Sensing Symposium (IGARSS 2020), July 2020, Waikoloa, Hawaii, USA. Personal use of this material is permitted.

S2-cGAN: Self-Supervised Adversarial Representation Learning for Binary Change Detection in Multispectral Images

Abstract

Deep Neural Networks have recently demonstrated promising performance in binary change detection (CD) problems in remote sensing (RS), but they require a large amount of labeled multitemporal training samples. Since collecting such data is time-consuming and costly, most of the existing methods rely on networks pre-trained on publicly available computer vision (CV) datasets. However, because of the differences in image characteristics between CV and RS, this approach limits the performance of the existing CD methods. To address this problem, we propose a self-supervised conditional Generative Adversarial Network (S2-cGAN). The proposed S2-cGAN is trained to generate only the distribution of unchanged samples. To this end, the proposed method consists of two main steps: 1) generating a reconstructed version of the input image as an unchanged image; and 2) learning the distribution of unchanged samples through an adversarial game. Unlike the existing GAN based methods (which only use the discriminator during the adversarial training to supervise the generator), the S2-cGAN directly exploits the discriminator likelihood to solve the binary CD task. Experimental results show the effectiveness of the proposed S2-cGAN when compared to the state-of-the-art CD methods. Our code is available online: https://gitlab.tubit.tu-berlin.de/rsim/S2-cGAN

Index Terms—  Generative adversarial networks, binary change detection, multitemporal images, self-supervised learning, remote sensing.

1 Introduction

Binary change detection (CD) in multitemporal multispectral remote sensing (RS) images is a key component for monitoring environmental phenomena [1, 2, 3, 4]. In recent years, Deep Neural Networks (DNNs) have achieved remarkable performance in several RS applications, including CD [2, 3]. Most of the DNNs designed for CD problems require a huge amount of multitemporal labeled samples to adjust all parameters during training and reach high performance. However, collecting multitemporal labeled samples is often highly expensive and requires expertise. To address this problem, unsupervised deep learning (DL) based approaches have recently been introduced in RS that employ pre-trained networks as generic feature extractors. In [1], El Amin et al. propose to extract deep distance images (DI) using a fully convolutional network trained for semantic segmentation. In that work, it is shown that the extracted deep feature maps contain semantic information, which can improve the DI analysis. Similarly, Saha et al. demonstrate the effectiveness of employing semantic-aware deep feature maps to achieve a performance gain in Change Vector Analysis (CVA) [5]. Bergamasco et al. propose a multilayer convolutional autoencoder to reconstruct the input image and learn task-specific features using a deep network [6]. Multi-scale features are then extracted from the pre- and post-change images and analyzed for CD. In recent years, unsupervised CD methods based on deep Generative Adversarial Networks (GANs) [7] have attracted attention in RS. GANs are deep networks commonly used for generating realistic data (e.g., images), where the supervision is indirectly provided by an adversarial game between two independent networks: a generator ($G$) and a discriminator ($D$), which are both trained with unlabeled samples. Recently, Saha et al. adapt a GAN to solve multisensor CD, where the task is to detect changes across two images obtained by different sensors [4]. In that work, the generative power of the GAN is only utilized to transcode images between two domains, and an external classifier is required to detect the changes between the two transcoded images. Most of the above-mentioned methods exploit DL models with proven architectures that are pre-trained on large-scale computer vision (CV) datasets (e.g., ImageNet). However, this is not a fully suitable approach in RS because of the differences in the characteristics of CV and RS images.

Unlike the aforementioned deep generative approaches, in this paper we aim at learning an adversarial representation that not only can be learned in a self-supervised fashion, but also can be directly used for CD without any need for an external classifier or a pre-trained network. To this end, we propose a self-supervised conditional Generative Adversarial Network (S2-cGAN). The S2-cGAN is trained using a training set containing only pairs of unchanged samples that are reconstructed (generated) from a single image without using any external supervision. During the adversarial training, $G$ learns how to generate unchanged data only, whereas $D$ learns how to detect deviations from it. Such a $D$ can be directly used as a change classifier, considering changes as outliers with respect to the sample distribution learned on unchanged data. To the best of our knowledge, the proposed S2-cGAN is the first method that directly exploits the GAN discriminator to solve the binary change detection problem.

2 Proposed Approach

Let $\boldsymbol{X}_1$ and $\boldsymbol{X}_2$ be two co-registered RS images acquired by the same sensor over the same geographical area at times $t_1$ and $t_2$, respectively. Both images are divided into $N$ non-overlapping patches and represented as $\boldsymbol{X}_1=\{x_1^i\}_{i=1,\dots,N}$ and $\boldsymbol{X}_2=\{x_2^i\}_{i=1,\dots,N}$, where $x_1^i$ and $x_2^i$ denote the $i$-th pair of patches associated to $t_1$ and $t_2$. In order to detect binary changes between $\boldsymbol{X}_1$ and $\boldsymbol{X}_2$, in this paper a self-supervised conditional Generative Adversarial Network (S2-cGAN) is proposed. The proposed S2-cGAN aims to train networks $G$ and $D$ using only pairs of patches without any land-cover change. Network $G$ learns how to generate only unchanged pairs of patches, whereas $D$ learns to distinguish pairs of unchanged samples (pixels) from pairs of changed samples. Fig. 1 summarizes the general training strategy of the proposed S2-cGAN. The details are provided in the following sub-sections.
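To make the patch-based formulation concrete, the following minimal Python sketch divides a co-registered image into non-overlapping patches; the helper name and default patch size are illustrative assumptions, not part of the paper.

```python
# Illustrative sketch (not the authors' code): split an (H, W, C) image into
# non-overlapping patches; applying it to X1 and X2 yields the pairs (x_1^i, x_2^i).
import numpy as np

def to_patches(image, patch_size=128):
    H, W, C = image.shape
    rows, cols = H // patch_size, W // patch_size
    return [image[r * patch_size:(r + 1) * patch_size,
                  c * patch_size:(c + 1) * patch_size, :]
            for r in range(rows) for c in range(cols)]

# x1_patches = to_patches(X1)   # {x_1^i}
# x2_patches = to_patches(X2)   # {x_2^i}, aligned with x1_patches by index
```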

2.1 Self-Supervised Adversarial Learning

The proposed S2-cGAN aims at learning a self-supervised representation from unchanged patches in order to estimate the likelihood of change with respect to the learned distribution and thus detect possible changes. In general, conditional GANs (cGANs) consist of two networks: 1) the generator network ($G$), which aims at generating realistic data; and 2) the discriminator network ($D$), which aims at discriminating real data from data generated by $G$. More specifically, a cGAN [7] takes as input an image $I_1$ and generates a new image $\tilde{I}_1$. $D$ tries to distinguish $I_1$ from $\tilde{I}_1$ considering the conditional information given by an image $I_2$, while $G$ tries to fool $D$ by producing increasingly realistic images that are indistinguishable from the real ones. Isola et al. propose an image-to-image translation framework based on cGANs [7], and show that a U-Net encoder-decoder with skip connections can be used as the generator architecture, together with a patch-based discriminator, to transform images into different representations (conditional information).
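As a rough illustration of such a U-Net encoder-decoder generator with skip connections, a hedged PyTorch sketch is given below; the depth, channel widths, and the use of dropout as the source of stochasticity $z$ follow the convention of [7] and are assumptions rather than the exact architecture used in this paper.

```python
# Hedged sketch of a small U-Net-style generator with skip connections.
import torch
import torch.nn as nn

def down(c_in, c_out):
    return nn.Sequential(nn.Conv2d(c_in, c_out, 4, stride=2, padding=1),
                         nn.BatchNorm2d(c_out), nn.LeakyReLU(0.2))

def up(c_in, c_out):
    # Dropout acts as the noise source z, as in [7].
    return nn.Sequential(nn.ConvTranspose2d(c_in, c_out, 4, stride=2, padding=1),
                         nn.BatchNorm2d(c_out), nn.ReLU(), nn.Dropout(0.5))

class UNetGenerator(nn.Module):
    def __init__(self, bands):
        super().__init__()
        self.d1, self.d2, self.d3 = down(bands, 64), down(64, 128), down(128, 256)
        self.u1, self.u2 = up(256, 128), up(256, 64)   # decoder inputs include skips
        self.out = nn.Sequential(nn.ConvTranspose2d(128, bands, 4, 2, 1), nn.Tanh())

    def forward(self, x):
        e1 = self.d1(x)                               # 128 -> 64
        e2 = self.d2(e1)                              # 64 -> 32
        e3 = self.d3(e2)                              # 32 -> 16 (bottleneck)
        u1 = self.u1(e3)                              # 16 -> 32
        u2 = self.u2(torch.cat([u1, e2], dim=1))      # skip connection from e2
        return self.out(torch.cat([u2, e1], dim=1))   # skip connection from e1
```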

The proposed S2-cGAN is inspired by [7]; however, unlike [7], the S2-cGAN does not aim at realistic image translation. Instead, the S2-cGAN exploits $G$ to learn the pattern of unchanged pairs of patches. In order to train our network, we need a set of pairs of multitemporal images in which no change is observed between times $t_1$ and $t_2$. Such pairs can be constructed using the image $\boldsymbol{X}_1$: a corresponding image $\boldsymbol{\tilde{X}}_2=\{\tilde{x}_2^i\}_{i=1,\dots,N}$ is defined as $\boldsymbol{\tilde{X}}_2=\boldsymbol{X}_1+w$, where $w$ represents zero-mean Gaussian noise drawn from a fixed distribution for all patches in $\boldsymbol{X}_1$, such that $w\sim\mathcal{N}(0,\sigma)$. Then, a set ${\cal X}=\{(x_1^i,\tilde{x}_2^i)\}_{i=1,\dots,N}$ of pairs of patches selected from the images $\boldsymbol{X}_1$ and $\boldsymbol{\tilde{X}}_2$ is defined. This set represents the unchanged pairs corresponding to the same geographical area and is used for learning the unchanged data representation in a self-supervised fashion.
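A minimal sketch of this pair construction is given below; the noise standard deviation $\sigma$ is an assumption, since its exact value is not specified here.

```python
# Sketch of building an unchanged training pair: x~_2^i = x_1^i + w, w ~ N(0, sigma).
import numpy as np

def make_unchanged_pair(x1, sigma=0.01, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    w = rng.normal(loc=0.0, scale=sigma, size=x1.shape)   # zero-mean Gaussian noise
    return x1, x1 + w                                     # the pair (x_1^i, x~_2^i)

# Applying this to every patch of X1 yields the self-supervised set
# {(x_1^i, x~_2^i)} of unchanged pairs; no labeled changed samples are needed.
```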


Fig. 1: Illustration of the training strategy introduced within the proposed S2-cGAN. The data distribution in the feature space (represented by the Gaussian curve under $D$) is much denser in the area corresponding to the real and unchanged samples.

Networks $G$ and $D$ are trained using both reconstruction and conditional GAN objectives. The reconstruction objective ${\cal L}_{L1}$ is given as:

{\cal L}_{L1}(x_{1}^{i},\tilde{x}_{2}^{i}) = \lVert \tilde{x}_{2}^{i} - r \rVert_{1}, \qquad (1)

where $r=G(x_{1}^{i},z)$ and $z$ is a noise vector drawn from a noise distribution ${\cal Z}$. The conditional adversarial objective ${\cal L}_{cGAN}$ is defined as:

{\cal L}_{cGAN}(G,D) = \mathbb{E}_{(x_{1}^{i},\tilde{x}_{2}^{i})\in{\cal X}}\big[\log D(x_{1}^{i},\tilde{x}_{2}^{i})\big] + \mathbb{E}_{x_{1}^{i}\in\boldsymbol{X}_{1},\,z\in{\cal Z}}\big[\log\big(1-D(x_{1}^{i},G(x_{1}^{i},z))\big)\big]. \qquad (2)

It is important to emphasise that both images $\boldsymbol{X}_1$ and $\boldsymbol{\tilde{X}}_2$ are unchanged co-registered images. The proposed learning approach does not need samples showing any change at training time. This makes it possible to train the discriminator without any need for fully supervised training data: $G$ acts as implicit supervision for $D$. During training, $G$ observes only unchanged patches, while $D$ learns to distinguish real unchanged samples from changed or fake ones. As a consequence, at the end of the training process, the discriminator has learned to separate real samples from artifacts. The training procedure is schematically represented in Fig. 1. The data distribution is depicted by the Gaussian curve at the top of the figure. The discriminator is represented by the decision boundary in the learned feature space (black circle), which separates this distribution from the rest of the feature space. Both non-realistic generated samples (blue dots) and changed samples (red dots) lie outside this decision boundary. The latter represent cases never observed during the training phase and are hence treated by $D$ as outliers (lying outside the discriminator's decision boundary). The learned decision boundary of the discriminator is used to detect changes.
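A hedged PyTorch sketch of one training step combining objectives (1) and (2) is given below; it is our illustration, not the released implementation. The L1 weight, the realization of the noise $z$ through dropout in $G$, and the assumption that $D$ outputs per-pixel probabilities are borrowed from the convention of [7].

```python
# Sketch of a single adversarial training step on an unchanged pair (x_1^i, x~_2^i).
import torch
import torch.nn.functional as F

def train_step(G, D, opt_G, opt_D, x1, x2_tilde, lambda_l1=100.0):
    # --- Discriminator update: real unchanged pair vs. generated pair ---
    fake = G(x1)                                       # r = G(x_1^i, z); z via dropout in G
    d_real = D(torch.cat([x1, x2_tilde], dim=1))       # D(x_1^i, x~_2^i)
    d_fake = D(torch.cat([x1, fake.detach()], dim=1))  # D(x_1^i, G(x_1^i, z))
    loss_D = F.binary_cross_entropy(d_real, torch.ones_like(d_real)) \
           + F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake))
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # --- Generator update: fool D and stay close to the unchanged target (Eq. 1) ---
    d_fake = D(torch.cat([x1, fake], dim=1))
    loss_G = F.binary_cross_entropy(d_fake, torch.ones_like(d_fake)) \
           + lambda_l1 * F.l1_loss(fake, x2_tilde)
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
    return loss_D.item(), loss_G.item()
```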

Fig. 2: Illustration of the change detection strategy introduced within the proposed S2-cGAN.

2.2 CD through Adversarially Learned Representations

In our adversarial approach, the discriminator network is used for detecting changes. As depicted in Fig. 2, consider a test patch $x_1^i\in\boldsymbol{X}_1$ acquired at time $t_1$ and its corresponding patch $x_2^i\in\boldsymbol{X}_2$ acquired at time $t_2$. Given the patch $x_1^i$, we utilize $G$ to generate $G(x_1^i)$ and compute the reconstruction error with respect to $x_2^i$ using (1), i.e., $e_r={\cal L}_{L1}(x_2^i,G(x_1^i))$. The reconstruction error of the adversarially trained generator is the first component of our CD strategy. In addition, we apply the pixel-based discriminator $D$ to estimate the out-of-distribution likelihood as the second component. To this end, two pixel-wise score maps $S^{x_1^i,x_2^i}$ and $S^{x_1^i,G(x_1^i,z)}$ are computed using $D(x_1^i,x_2^i)$ and $D(x_1^i,G(x_1^i,z))$, respectively. Finally, a difference map $S_{dif}^i=S^{x_1^i,x_2^i}-S^{x_1^i,G(x_1^i,z)}$ is computed to represent the significance of the score map $S^{x_1^i,x_2^i}$ with respect to the reference score map $S^{x_1^i,G(x_1^i,z)}$. As a result, a possible change between $x_1^i$ and $x_2^i$, and/or between $G(x_1^i)$ and $x_2^i$, corresponds to an outlier with respect to the data distribution learned by $D$ during training and results in a low value in $S_{dif}$. The final CD map is obtained as the Hadamard product of each pair of $e_r$ and $S_{dif}$, i.e., $\mathcal{H}=e_r\circ S_{dif}$.
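A compact sketch of this CD strategy is shown below, assuming that $D$ outputs a pixel-wise probability map and that patches are given as PyTorch tensors of shape (batch, bands, 128, 128); variable names are illustrative.

```python
# Sketch of the CD strategy in Sec. 2.2: reconstruction error e_r, discriminator
# score maps, their difference S_dif, and the final map H = e_r ∘ S_dif.
import torch

@torch.no_grad()
def change_score(G, D, x1, x2):
    g = G(x1)                                         # G(x_1^i)
    e_r = (x2 - g).abs().mean(dim=1, keepdim=True)    # per-pixel L1 error, cf. Eq. (1)
    s_test = D(torch.cat([x1, x2], dim=1))            # S^{x_1^i, x_2^i}
    s_ref = D(torch.cat([x1, g], dim=1))              # S^{x_1^i, G(x_1^i, z)}
    s_dif = s_test - s_ref                            # low values flag out-of-distribution pairs
    return e_r * s_dif                                # H, thresholded with tau to get the CD map
```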

Fig. 3: Examples of score maps associated to different pairs of patches. For each row: (a) a patch acquired at $t_1$ ($x_1^i$); (b) a patch acquired at $t_2$ ($x_2^i$); (c) their difference map ($S_{dif}$); (d) their reconstruction error ($e_r$); and (e) the reference CD map.

3 Experimental Results

To evaluate the proposed S2-cGAN, bitemporal Very High Spatial Resolution (VHR) multispectral images acquired over Shenzhen, China, by the WorldView-2 satellite have been used in the experiments [8]. The image acquired in 2010 ($t_1$) has been considered as $\boldsymbol{X}_1$, while the image acquired in 2015 ($t_2$) has been taken as $\boldsymbol{X}_2$. Both $\boldsymbol{X}_1$ and $\boldsymbol{X}_2$ have a size of $1431\times1431$ pixels with a spatial resolution of 2 m. In all the experiments, the generator network $G$ was based on a U-Net autoencoder for input patches of size $128\times128$ pixels and was composed of six convolutional layers. For the discriminator $D$, a pixel-based network with two convolutional layers was employed (a sketch is given below). The co-registered images $\boldsymbol{X}_1$ and $\boldsymbol{X}_2$ were divided into 7646 patches of size $128\times128$ pixels. For the self-supervised adversarial learning of the unchanged sample representations, we constructed $\boldsymbol{\tilde{X}}_2$ from $\boldsymbol{X}_1$ by applying Gaussian noise to $\boldsymbol{X}_1$; this represents the unchanged (generated) image at time $t_2$. Half of the pairs of patches of $\boldsymbol{X}_1$ and $\boldsymbol{\tilde{X}}_2$ were randomly selected for training. The remaining patches of $\boldsymbol{X}_1$, together with their pairs in $\boldsymbol{X}_2$, were considered as pairs of test patches for the evaluation. The training was based on stochastic gradient descent with momentum 0.5, and the network was trained for 50 epochs. The two score maps $S_{dif}$ (the difference map) and $e_r$ (the reconstruction error) were computed as explained in Sec. 2.2. Fig. 3 illustrates the score maps for two different pairs of patches. From the figure, one can observe that in most cases $e_r$ and $S_{dif}$ provide complementary information. Similar behaviour has been observed by varying the pairs of patches. For quantitative evaluation, we applied a threshold value $\tau$ over the score map $\mathcal{H}$ to obtain the binary CD map. Due to the strong local variations in the considered VHR images, instead of choosing a single decision boundary for the value of $\tau$, we adapted the context-dependent local adaptive decision boundary strategy used in [5]. We compared the CD map obtained by the proposed S2-cGAN with: 1) the Fully Convolutional Early Fusion (FC-EF) method, which is one of the best performing supervised methods with a fully convolutional U-Net architecture [2]; and 2) the Deep Change Vector Analysis (DCVA) method, which is a powerful unsupervised method using a CNN pre-trained for semantic segmentation to obtain multitemporal deep features [5]. For the sake of fairness, the same pairs of test patches have been used for all the considered methods. Note that DCVA is fully unsupervised and does not use labeled samples. The fully supervised FC-EF exploits a set of pairs of training patches that consists of: i) the same pairs of labeled unchanged samples used in the S2-cGAN; and ii) 13,065 pairs of changed samples. The binary CD maps obtained by the FC-EF, the DCVA and the proposed S2-cGAN are shown in Fig. 4. From Fig. 4, one can see that the proposed S2-cGAN is in most cases able to correctly identify the central area of the changed objects. Table 1 shows quantitative results in terms of overall accuracy (OA), specificity (SPC), sensitivity (SEN) and overall error rate (ERR) obtained by the FC-EF, the DCVA and the proposed S2-cGAN.
By analysing the table, one can observe that the proposed S2-cGAN is comparable with the fully supervised FC-EF, and outperforms the unsupervised DCVA mainly in terms of specificity (which reflects the accuracy of detecting unchanged samples), sensitivity (which reflects the accuracy of detecting changed pixels) and overall accuracy. As an example, the sensitivity obtained by the proposed method is about 6% and 14% higher than that obtained by the unsupervised DCVA and the fully supervised FC-EF, respectively. It is worth emphasizing that our method performs better than both methods in terms of sensitivity despite the absence of changed samples during training. This is due to the impact of adversarial training during GAN optimization.
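For reference, the sketch below illustrates a pixel-based discriminator with two convolutional layers in PyTorch; the channel width and the number of input bands are assumptions, since only the layer count is stated above.

```python
# Hedged sketch of a pixel-wise discriminator: two 1x1 convolutions producing a
# per-pixel probability that the input pair of patches is a real unchanged pair.
import torch.nn as nn

class PixelDiscriminator(nn.Module):
    def __init__(self, in_channels, hidden=64):   # in_channels = 2 x number of bands
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, hidden, kernel_size=1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(hidden, 1, kernel_size=1),
            nn.Sigmoid(),                          # per-pixel score map
        )

    def forward(self, pair):                       # pair: (B, in_channels, 128, 128)
        return self.net(pair)
```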

Table 1: Binary CD results obtained by the FC-EF, the DCVA, and the proposed S2-cGAN. OA: overall accuracy, SPC: specificity, SEN: sensitivity, ERR: overall error rate.
Method  | Category        | OA     | SPC     | SEN    | ERR
FC-EF   | Supervised      | 0.8633 | 0.9693  | 0.3627 | 0.1366
DCVA    | Unsupervised    | 0.8280 | 0.9106  | 0.4450 | 0.1719
S2-cGAN | Self-Supervised | 0.8482 | 0.92081 | 0.5053 | 0.1517
Fig. 4: CD results: (a) $\boldsymbol{X}_1$; (b) $\boldsymbol{X}_2$; (c) reference CD map; (d) CD map obtained by the DCVA; (e) CD map obtained by the FC-EF; and (f) CD map obtained by the S2-cGAN.

4 Conclusion

In this paper, we have introduced a self-supervised conditional Generative Adversarial Network (S2-cGAN) for binary CD problems in RS. The proposed S2-cGAN exploits the mutual supervisory information of the generator and the discriminator networks to train a deep network by using a self-supervised multitemporal training set (which includes only pairs of unchanged samples). Differently from existing GAN-based CD methods, the proposed method directly uses the GAN discriminator as the classifier. Experimental results show that the proposed S2-cGAN leads to a higher performance in terms of sensitivity compared to state-of-the-art fully supervised and unsupervised methods. This has been achieved without using any pairs of labeled changed samples. We underline that this is a very important advantage, since the proposed S2-cGAN reduces the cost required for reference data collection. As future work, we plan to integrate the proposed learning approach into an active learning setup with a human in the loop.

5 Acknowledgement

This work was supported by the European Research Council under the ERC Starting Grant BigEarth-759764.

References

  • [1] A. M. El Amin, Q. Liu, and Y. Wang, “Convolutional neural network features based change detection in satellite images,” in International Workshop on Pattern Recognition, 2016, vol. 10011, pp. 181 – 186.
  • [2] R. C. Daudt, B. Le Saux, and A. Boulch, “Fully convolutional siamese networks for change detection,” in ICIP, Athens, Greece, 2018, pp. 4063–4067.
  • [3] A. Song, J. Choi, Y. Han, and Y. Kim, “Change detection in hyperspectral images using recurrent 3d fully convolutional networks,” Remote Sensing, vol. 10, no. 11, pp. 1827, 2018.
  • [4] S. Saha, F. Bovolo, and L. Bruzzone, “Unsupervised multiple-change detection in vhr multisensor images via deep-learning based adaptation,” in IEEE IGARSS, Yokohama, Japan, 2019, pp. 5033–5036.
  • [5] S. Saha, F. Bovolo, and L. Bruzzone, “Unsupervised deep change vector analysis for multiple-change detection in vhr images,” IEEE Transactions on Geoscience and Remote Sensing, vol. 57, no. 6, pp. 3677–3693, 2019.
  • [6] L. Bergamasco, S. Saha, F. Bovolo, and L. Bruzzone, “Unsupervised change-detection based on convolutional-autoencoder feature extraction,” in SPIE Image and Signal Processing for Remote Sensing XXV, 2019, vol. 11155, pp. 325 – 332.
  • [7] P. Isola, J. Zhu, T. Zhou, and A. Efros, “Image-to-image translation with conditional adversarial networks,” arXiv:1611.07004, 2016.
  • [8] M. Zhang and W. Shi, “A feature difference convolutional neural network-based change detection method,” IEEE Transactions on Geoscience and Remote Sensing, 2020, doi:10.1109/TGRS.2020.2981051.