
RSINET: INPAINTING REMOTELY SENSED IMAGES USING TRIPLE GAN FRAMEWORK

Abstract

We tackle the problem of image inpainting in the remote sensing domain. Remote sensing images possess high resolution and geographical variations that render conventional inpainting methods less effective. This further entails the requirement of models with high complexity to sufficiently capture the spectral, spatial and textural nuances within an image, arising from its high spatial variability. To this end, we propose a novel inpainting method that individually focuses on each aspect of an image, such as edges, colour and texture, using a task-specific GAN. Moreover, each individual GAN also incorporates an attention mechanism that explicitly extracts the spectral and spatial features. To ensure consistent gradient flow, the model uses the residual learning paradigm, thus simultaneously working with high- and low-level features. We evaluate our model, along with previous state-of-the-art models, on two well-known remote sensing datasets, Open Cities AI and Earth on Canvas, and achieve competitive performance. The code is available at https://github.com/advaitkumar3107/RSINet.

Index Terms—  Image inpainting, remote sensing, generative adversarial networks

1 Introduction

Image inpainting is the process of filling in missing parts of an image or restoring damaged and deteriorated regions (whether physical or digital). Digital image inpainting is an important problem in computer vision with applications in various domains such as restoring damaged images/videos, remote sensing, object removal, text removal, automatic modification of images/videos, image compression and super resolution [1]. The drawback of traditional mathematical methods is their low PSNR on complex images such as remote sensing images [2]. Remote sensing images have a variety of boundaries, objects and colours, which makes inpainting a semantically tougher problem. Hence, deep learning has been introduced to tackle this challenge, such as in [2], which solved three missing-information tasks in remote sensing data using a deep convolutional network combined with spatio-temporal information. Similarly, [3] used a generative approach for image inpainting; the resulting images closely resembled the originals.

Several other image inpainting approaches exist, such as [4], which used a UNet-based encoder-decoder structure. Furthermore, [4] also used residual learning to compute the gradients effectively for image restoration. [5] built on this idea and used a guidance loss for inpainting. [6] proposed a new type of layer called ‘partial convolution’ (PartialConv) to improve the best-performing image inpainting models at the time. However, all the above models used simple baselines, which resulted in lower PSNR on complex satellite images. [7] proposed the EdgeConnect (EC) model, which first builds the outline of the image to be restored and then fills in the details. Furthermore, to ensure consistency between neighbouring areas and the image as a whole, [8] proposed a multi-GAN approach that trains a global as well as a local discriminator (GLCIC) on top of their VGG-type model [9]. Moreover, [10] proposed a global and local attention based model to handle image irregularities such as holes. In addition, to preserve the semantic style of the original images, additional loss functions have been explored. For instance, [11] introduced the style loss, where textural features are synthesised from the images to assist style transfer. Similarly, [12] introduced the perceptual loss (instead of a pixelwise loss) for super-resolution and obtained high-resolution images.

Even though the aforementioned methods perform well individually on task-specific images, they are far less effective on images with high spatial variation, such as those from the remote sensing domain. This is because each of the individual models was designed to focus on a specific aspect of the image such as colour, edge or texture. Hence, there arises a need for a common model that simultaneously works with all the image aspects and leads to more accurate image restoration. Inspired by this notion, we propose an image inpainting approach that utilizes multiple GANs to effectively capture the different aspects of the image. In addition, to handle irregularities of the image defects, attention modules based on the Convolutional Block Attention Module (CBAM) [13] and the Gated Attention Layers [14] are added. The GANs are also reinforced with skip connections to ensure consistent gradient flow. Moreover, to preserve the semantic style of the images and obtain more robust representations, we incorporate a cocktail of adversarial, style and perceptual losses. Our approach can be summarised as follows:

  1. We introduce a model that combines an edge detection GAN, a colour filling GAN and a global GAN to focus on the different image aspects.

  2. We introduce attention layers for guided backpropagation to obtain a more robust spectral-spatial representation and increase the sharpness of the images, while simultaneously incorporating style and perceptual losses with the adversarial loss to capture the semantic information of the images.

  3. We train our model on the Open Cities AI [15] and the Earth on Canvas [16] datasets, where our model outperforms state-of-the-art deep inpainting models by at least 0.06% and 2.24% on the two datasets respectively.

2 Model Description

Fig. 1: Architecture of our proposed GAN. The Generator consists of an encoder-residual blocks-decoder structure. The encoder has a skip connection as well as feature maps that serve as a gate for the gated attention layers. The Discriminator consists of 6 convolutional layers that successively downsample the input into a 28 × 28 probabilistic grid.

In this section we discuss our proposed model. It consists of three GANs trained on top of each other, namely the edge completion GAN 𝒢_1, the colour filling GAN 𝒢_2 and the global GAN 𝒢_3. We describe the loss functions and the architecture of the basic GAN unit in detail in the subsequent sections. The GAN architecture used in RSINet is presented in Fig. 1; all the GANs follow a similar architecture.
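
As an illustration of the discriminator in Fig. 1, the following PyTorch sketch downsamples a 256 × 256 input into a 28 × 28 probabilistic grid with six convolutional layers. The channel widths, kernel sizes and activations are assumptions for illustration only; the paper does not specify them.

```python
import torch.nn as nn

class Discriminator(nn.Module):
    """Six-layer convolutional discriminator producing a 28 x 28 grid of logits (sketch)."""
    def __init__(self, in_ch=2):  # e.g. edge map + grayscale image for the edge GAN
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 64, 4, stride=2, padding=1),   # 256 -> 128
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(64, 128, 4, stride=2, padding=1),     # 128 -> 64
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(128, 256, 4, stride=2, padding=1),    # 64 -> 32
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(256, 256, 3, stride=1, padding=0),    # 32 -> 30
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(256, 256, 3, stride=1, padding=0),    # 30 -> 28
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(256, 1, 3, stride=1, padding=1),      # 28 -> 28, single-channel grid
        )

    def forward(self, x):
        # Returns a 28 x 28 grid of logits; a sigmoid over this grid gives
        # per-patch real/fake probabilities.
        return self.net(x)
```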

Let I_gt be the groundtruth image, and let C_gt and I_gray be the groundtruth edge map (obtained by applying a Canny edge detector [17] to the true image) and the grayscale image respectively. In the generator of 𝒢_1, we use the masked grayscale image Ĩ_gray = I_gray ⊙ (1 − M), its masked edge map C̃_gt = C_gt ⊙ (1 − M) (⊙ denoting the Hadamard product), and the image mask M as the pre-condition (proposed in [18] as input to the pix2pix GAN); in M, 1 and 0 denote the masked region and the background respectively. The generator produces the completed edge map for the image (see Eqn. 1).

C_{pred} = \mathcal{G}_{1}(\tilde{I}_{gray}, \tilde{C}_{gt}, M)    (1)
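
For clarity, a minimal sketch (using PyTorch tensors and a placeholder generator G1) of how the masked inputs of Eqn. 1 can be assembled. Concatenating the three inputs channel-wise before 𝒢_1 is an assumption; the paper does not state how they are combined.

```python
import torch

def edge_generator_input(I_gray, C_gt, M):
    """Assemble the inputs of Eqn. 1.

    I_gray : (B, 1, H, W) grayscale image
    C_gt   : (B, 1, H, W) Canny edge map of the groundtruth image
    M      : (B, 1, H, W) binary mask, 1 = masked region, 0 = background
    """
    I_gray_masked = I_gray * (1.0 - M)   # I_gray ⊙ (1 − M)
    C_gt_masked = C_gt * (1.0 - M)       # C_gt ⊙ (1 − M)
    # Channel-wise concatenation (assumed) of the three conditioning inputs.
    return torch.cat([I_gray_masked, C_gt_masked, M], dim=1)

# C_pred = G1(edge_generator_input(I_gray, C_gt, M))   # Eqn. 1
```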

We use C_gt and C_pred, conditioned on I_gray, as the input to the discriminator, which predicts whether or not the edge map is real. The predicted edge map is then passed on to the colour filling model, which uses the masked RGB image Ĩ_gt = I_gt ⊙ (1 − M), conditioned on the completed edge map C_comp = C_gt ⊙ (1 − M) + C_pred ⊙ M obtained from the previous model. The model outputs the completed RGB image with the right colours filled in at the right places.

I_{pred} = \mathcal{G}_{2}(\tilde{I}_{gt}, C_{comp})    (2)

The previous two models gave blurry outputs on the tougher satellite images. Hence, we propose to train a global GAN on top of the output of the colour filling model, which refines the output and makes the image visually sharper. This is based on the observation in [19] that using the same encoder-decoder architecture for image inpainting gives better results, since both models learn the same features semantically.

I_{refined} = \mathcal{G}_{3}(I_{comp}, \tilde{I}_{gt})    (3)
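
Putting Eqns. 1-3 together, a schematic sketch of the full three-stage forward pass. G1, G2 and G3 are placeholder generator modules, and the composite image I_comp is assumed, by analogy with C_comp, to be I_gt ⊙ (1 − M) + I_pred ⊙ M, which the paper does not define explicitly.

```python
import torch

def rsinet_forward(G1, G2, G3, I_gt, I_gray, C_gt, M):
    """Illustrative three-stage inpainting pass following Eqns. 1-3."""
    I_gray_m = I_gray * (1 - M)
    C_gt_m = C_gt * (1 - M)
    I_gt_m = I_gt * (1 - M)

    # Stage 1: edge completion (Eqn. 1).
    C_pred = G1(torch.cat([I_gray_m, C_gt_m, M], dim=1))
    C_comp = C_gt * (1 - M) + C_pred * M

    # Stage 2: colour filling conditioned on the completed edge map (Eqn. 2).
    I_pred = G2(torch.cat([I_gt_m, C_comp], dim=1))
    I_comp = I_gt * (1 - M) + I_pred * M   # assumed composite, by analogy with C_comp

    # Stage 3: global refinement GAN (Eqn. 3).
    I_refined = G3(torch.cat([I_comp, I_gt_m], dim=1))
    return C_pred, I_pred, I_refined
```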

2.1 Loss Functions

The edge completion model is trained with the adversarial loss [18] and the feature matching loss [20]. Adversarial training can be thought of as a min-max game between two players. Here, the generator tries to ‘fool’ the discriminator, i.e. to make the discriminator predict with high probability that the generator’s output belongs to the real data, while the discriminator is trained to differentiate between the original and the generated samples. The adversarial loss is given in Eqn. 4.

L_{adv} = \mathbb{E}_{C_{gt}, I_{gray}}[\log \mathcal{D}_{1}(C_{gt}, I_{gray})] + \mathbb{E}_{I_{gray}}[\log(1 - \mathcal{D}_{1}(C_{pred}, I_{gray}))]    (4)

In Eqn. 4, 𝒟_1 is the discriminator of the edge completion GAN.
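
One common way to implement the objective of Eqn. 4 is with binary cross-entropy over the discriminator's output grid. The sketch below assumes 𝒟_1 returns logits and that the generator uses the usual non-saturating variant; the paper does not specify these details.

```python
import torch
import torch.nn.functional as F

def adversarial_losses(D1, C_gt, C_pred, I_gray):
    """BCE realisation of the adversarial objective in Eqn. 4 (sketch)."""
    real_logits = D1(torch.cat([C_gt, I_gray], dim=1))
    # Detach C_pred for the discriminator step so its gradients do not reach G1.
    fake_logits = D1(torch.cat([C_pred.detach(), I_gray], dim=1))

    # Discriminator: real edge maps -> 1, generated edge maps -> 0.
    d_loss = F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits)) \
           + F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits))

    # Generator (non-saturating form): make the discriminator label C_pred as real.
    gen_logits = D1(torch.cat([C_pred, I_gray], dim=1))
    g_loss = F.binary_cross_entropy_with_logits(gen_logits, torch.ones_like(gen_logits))
    return d_loss, g_loss
```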

The feature matching loss L_FM compares the activation maps of the intermediate layers of the discriminator. This is similar to the perceptual loss [12] (used in the colour filling model), where activations are compared with those from a pre-trained VGG network. However, a VGG network is not trained to produce edge maps, so we use L_FM instead of L_perc here. L_FM is defined as:

L_{FM} = \mathbb{E}\left[\sum_{i=1}^{L} \frac{1}{N_{i}} \left\lVert \mathcal{D}_{1}^{(i)}(C_{gt}) - \mathcal{D}_{1}^{(i)}(C_{pred}) \right\rVert_{1}\right]    (5)
Fig. 2: Ablation study on the Open Cities AI dataset with rectangular masks, comparing the model with different components removed. As expected, performance is best with all the components.

Here L is the number of layers in the discriminator, N_i is the number of elements in the i-th convolutional layer, and 𝒟_1^(i) is the activation of the i-th layer of 𝒟_1. Applying the loss to both 𝒢_1 and 𝒟_1 gave us better results, as suggested in [7].
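
A minimal sketch of Eqn. 5. It assumes the discriminator can return its intermediate activations via a hypothetical return_features flag; the real interface is not specified in the paper.

```python
import torch

def feature_matching_loss(D1, C_gt, C_pred, I_gray):
    """L_FM of Eqn. 5: L1 distance between per-layer discriminator activations (sketch)."""
    # Assumed interface: D1(..., return_features=True) -> (logits, [per-layer activations]).
    _, feats_real = D1(torch.cat([C_gt, I_gray], dim=1), return_features=True)
    _, feats_fake = D1(torch.cat([C_pred, I_gray], dim=1), return_features=True)

    loss = 0.0
    for f_real, f_fake in zip(feats_real, feats_fake):
        # (1 / N_i) * || D1^(i)(C_gt) - D1^(i)(C_pred) ||_1, with the real branch detached.
        loss = loss + torch.mean(torch.abs(f_real.detach() - f_fake))
    return loss
```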

The colour filling model is trained on four losses: the adversarial loss, the L1 loss [21], the style loss [11] and the perceptual loss [12]. To ensure proper scaling, the L1 loss is normalized by the mask size. The adversarial loss is similar to that of the previous model. The perceptual loss is the L1 distance between the activations of a few specific layers of a pre-trained network, computed on the groundtruth and predicted images. We choose the VGG-19 network (pretrained on ImageNet [9]) since its architecture is similar to ours.

L_{perc} = \mathbb{E}\left[\sum_{i} \frac{1}{N_{i}} \left\lVert \phi_{i}(I_{gt}) - \phi_{i}(I_{pred}) \right\rVert_{1}\right]    (6)

where φ_i represents the activation map of the i-th layer of the pre-trained VGG-19 network. The style loss measures the differences between the covariances of these activation maps by constructing the Gram matrix from them. Given feature maps of size C_j × H_j × W_j, the style loss is computed as

L_{style} = \mathbb{E}_{j}\left[\left\lVert G_{j}^{\phi}(\tilde{I}_{pred}) - G_{j}^{\phi}(\tilde{I}_{gt}) \right\rVert_{1}\right]    (7)

where G_j^φ is the Gram matrix constructed from the activation maps φ_j. The global GAN model is also trained on the weighted sum of the four losses discussed above, namely L_adv, L_perc, L1 and L_style. The final loss for RSINet is given in Eqn. 8. The values of λ_1, λ_2 and λ_3 are empirically fixed to 1 during implementation.

L_{final} = L_{adv} + \lambda_{1} L_{1} + \lambda_{2} L_{perc} + \lambda_{3} L_{style}    (8)
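
As an illustration of Eqns. 6-8, the following is a sketch of the generator loss mix using torchvision's pre-trained VGG-19 as φ. The choice of VGG layers, the exact L1/mask normalization and the inputs fed to the style loss are assumptions; input normalization is omitted for brevity.

```python
import torch
import torchvision

# Pre-trained VGG-19 used as the feature extractor φ in Eqns. 6-7.
vgg = torchvision.models.vgg19(weights="IMAGENET1K_V1").features.eval()
for p in vgg.parameters():
    p.requires_grad_(False)

VGG_LAYERS = [3, 8, 17, 26]  # assumed ReLU layer indices; the paper does not list them

def vgg_features(x):
    feats = []
    for i, layer in enumerate(vgg):
        x = layer(x)
        if i in VGG_LAYERS:
            feats.append(x)
    return feats

def gram(f):
    # Gram matrix of a (B, C, H, W) activation map, normalised by its size.
    b, c, h, w = f.shape
    f = f.reshape(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

def generator_loss(I_gt, I_pred, M, adv_loss, lambdas=(1.0, 1.0, 1.0)):
    """L_final of Eqn. 8 (sketch); adv_loss comes from the adversarial branch."""
    # L1 loss normalized by the mask size, as described in the text.
    l1 = torch.sum(torch.abs(I_gt - I_pred)) / torch.sum(M).clamp(min=1)
    feats_gt, feats_pred = vgg_features(I_gt), vgg_features(I_pred)
    perc = sum(torch.mean(torch.abs(a - b)) for a, b in zip(feats_gt, feats_pred))               # Eqn. 6
    style = sum(torch.mean(torch.abs(gram(a) - gram(b))) for a, b in zip(feats_gt, feats_pred))  # Eqn. 7
    lam1, lam2, lam3 = lambdas
    return adv_loss + lam1 * l1 + lam2 * perc + lam3 * style
```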

3 Experiments

The model has a VGG-type architecture [9] (an encoder followed by a decoder). The encoder consists of 3 convolutional layers, each successively halving the image dimensions and doubling the number of channels, followed by 8 residual blocks and finally 3 convolutional layers that map the feature representation of the residual blocks back to the image size (the output). A skip connection (consisting of 2 layers) is added at the second convolutional layer of the network. We also add 3 gated attention layers in between for further refinement of the feature maps.
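
A simplified PyTorch sketch of such a generator follows. Channel widths, kernel sizes and the exact placement of the skip connection are assumptions; only one gated attention layer is shown (the paper uses three), and the gating module is a lightweight stand-in for the gated/CBAM attention layers [13, 14].

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.InstanceNorm2d(ch), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.InstanceNorm2d(ch),
        )

    def forward(self, x):
        return x + self.body(x)        # residual learning for consistent gradient flow

class GatedAttention(nn.Module):
    """Minimal gating stand-in: a sigmoid gate computed from the features themselves."""
    def __init__(self, ch):
        super().__init__()
        self.gate = nn.Sequential(nn.Conv2d(ch, ch, 1), nn.Sigmoid())

    def forward(self, x):
        return x * self.gate(x)

class Generator(nn.Module):
    def __init__(self, in_ch=3, base=64):
        super().__init__()
        # Encoder: 3 conv layers, each halving spatial size and doubling channels.
        self.enc1 = nn.Sequential(nn.Conv2d(in_ch, base, 4, 2, 1), nn.ReLU(inplace=True))
        self.enc2 = nn.Sequential(nn.Conv2d(base, base * 2, 4, 2, 1), nn.ReLU(inplace=True))
        self.enc3 = nn.Sequential(nn.Conv2d(base * 2, base * 4, 4, 2, 1), nn.ReLU(inplace=True))
        self.res = nn.Sequential(*[ResidualBlock(base * 4) for _ in range(8)])
        self.att = GatedAttention(base * 4)
        # Decoder: 3 layers mapping the features back to the image size.
        self.dec1 = nn.Sequential(nn.ConvTranspose2d(base * 4, base * 2, 4, 2, 1), nn.ReLU(inplace=True))
        self.dec2 = nn.Sequential(nn.ConvTranspose2d(base * 2, base, 4, 2, 1), nn.ReLU(inplace=True))
        self.dec3 = nn.Sequential(nn.ConvTranspose2d(base, 3, 4, 2, 1), nn.Tanh())

    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(e1)             # skip connection source (second encoder layer)
        e3 = self.enc3(e2)
        h = self.att(self.res(e3))     # 8 residual blocks followed by gated attention
        d1 = self.dec1(h) + e2         # skip connection into the decoder
        d2 = self.dec2(d1)
        return self.dec3(d2)
```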

3.1 Datasets

We worked primarily with two well-known satellite imagery datasets. The first is the Open Cities AI challenge dataset [15], which consists of 500 randomly sampled images of size 1024 × 1024, each divided into 16 images of size 256 × 256. The second is the Earth on Canvas dataset [16], which consists of 1400 images of size 256 × 256 from 14 classes. It is highly uncorrelated and therefore a good test of our model's performance. Both datasets are divided into train, validation and test sets in a 60%, 20%, 20% ratio respectively.
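
For concreteness, a small sketch of how a 1024 × 1024 scene can be tiled into 16 non-overlapping 256 × 256 patches and how a 60/20/20 split can be drawn; the array shapes and the random seed are illustrative assumptions.

```python
import numpy as np

def tile_image(img, patch=256):
    """Split an (H, W, C) image into non-overlapping patch x patch tiles."""
    h, w, _ = img.shape
    return [img[i:i + patch, j:j + patch]
            for i in range(0, h, patch)
            for j in range(0, w, patch)]

def split_dataset(samples, ratios=(0.6, 0.2, 0.2), seed=0):
    """60/20/20 train/validation/test split used for both datasets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(samples))
    n_train = int(ratios[0] * len(samples))
    n_val = int(ratios[1] * len(samples))
    train = [samples[i] for i in idx[:n_train]]
    val = [samples[i] for i in idx[n_train:n_train + n_val]]
    test = [samples[i] for i in idx[n_train + n_val:]]
    return train, val, test

# Example: a 1024 x 1024 RGB scene yields 16 patches of 256 x 256.
patches = tile_image(np.zeros((1024, 1024, 3), dtype=np.uint8))
assert len(patches) == 16
```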

3.2 Protocol

We kept the parameters of the global GAN similar to those of the colour filling GAN. The Adam optimizer [22] is used for training all the models, with learning rates of 10⁻³ and 10⁻⁴ for the generator and discriminator respectively. We also tried adding jigsaw training [23], which splits the input image into square patches and randomly shuffles them to obtain a new transformed image; this image is fed to the model, which tries to reconstruct the original image. For evaluation, the peak signal-to-noise ratio (PSNR) is used. For the GLCIC model, its 160 × 160 output was interpolated to 256 × 256 while computing the PSNR.
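
The evaluation metric and the jigsaw transform can be sketched as below (a standard PSNR definition assuming images in [0, 255]; the jigsaw patch size and the square-image assumption are illustrative).

```python
import numpy as np

def psnr(img_true, img_pred, max_val=255.0):
    """Peak signal-to-noise ratio in dB between two same-sized images."""
    mse = np.mean((img_true.astype(np.float64) - img_pred.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)

def jigsaw_shuffle(img, patch=64, seed=0):
    """Jigsaw transform [23]: split a square image into patches and shuffle them."""
    tiles = [img[i:i + patch, j:j + patch]
             for i in range(0, img.shape[0], patch)
             for j in range(0, img.shape[1], patch)]
    order = np.random.default_rng(seed).permutation(len(tiles))
    tiles = [tiles[i] for i in order]
    n = img.shape[0] // patch
    rows = [np.concatenate(tiles[r * n:(r + 1) * n], axis=1) for r in range(n)]
    return np.concatenate(rows, axis=0)
```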

Fig. 3: Inpainting results on the Earth on Canvas dataset with (a-b) rectangular masks, (c-d) salt-and-pepper noise, (e-f) irregular masks. For each example, the image on the left is the original, the image in the centre has been corrupted by a mask, and the image on the right is the reconstruction after inpainting.
Table 1: Accuracy analysis (PSNR, dB) for irregular masks on the Open Cities AI and Earth on Canvas datasets for different methods

Model Name                               Open Cities AI   Earth on Canvas
EC original [7]                          37.113           33.145
EC (our model)                           37.433           33.784
Only colour filling GAN (our model)      30.347           31.198
PartialConv [6]                          35.912           31.246
EC + Global GAN (our model)              37.456           34.542

3.3 Discussions

As observed in Fig. 2, the baseline EC model (i.e. the edge completion GAN and the colour filling GAN) with the global GAN trained on top outperforms both the original EC and the GLCIC models in the rectangular-mask ablation on both datasets. The performance of all the models is presented in Table 1. Our model achieves the highest PSNR on both the Open Cities AI dataset (37.456 dB) and the Earth on Canvas dataset (34.542 dB).

3.4 Critical Analysis

3.4.1 Ablations

We primarily dealt with three types of mask ablations: (i) rectangular masks, where a random rectangle covering between 5% and 30% of the image area is replaced with white pixels during training; (ii) salt-and-pepper masks, where random image pixels sampled from a Gaussian distribution are whitened, covering anywhere between 5% and 95% of the image area; and (iii) irregular masks, as in [6]. Furthermore, we present inpainting visualizations for the three kinds of masks in Fig. 3. The method performs well in reconstructing the image for the salt-and-pepper and irregular masks. However, for the rectangular mask in Fig. 3 (b), the model could not reconstruct the aeroplane completely. This is because the method exploits locally available information (using convolutions) for reconstruction.
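
For illustration, a sketch of how the rectangular and salt-and-pepper masks described above could be generated. The coverage ranges follow the text, but the exact sampling scheme (e.g. the aspect-ratio range of the rectangle) is an assumption.

```python
import numpy as np

def rectangular_mask(h, w, min_area=0.05, max_area=0.30, seed=None):
    """Binary mask (1 = masked) with a random rectangle covering 5%-30% of the image."""
    rng = np.random.default_rng(seed)
    area = rng.uniform(min_area, max_area) * h * w
    rh = int(np.clip(np.sqrt(area * rng.uniform(0.5, 2.0)), 1, h))  # random aspect ratio
    rw = int(np.clip(area / rh, 1, w))
    top = rng.integers(0, h - rh + 1)
    left = rng.integers(0, w - rw + 1)
    mask = np.zeros((h, w), dtype=np.float32)
    mask[top:top + rh, left:left + rw] = 1.0
    return mask

def salt_pepper_mask(h, w, min_frac=0.05, max_frac=0.95, seed=None):
    """Binary mask whitening a random fraction (5%-95%) of individual pixels."""
    rng = np.random.default_rng(seed)
    frac = rng.uniform(min_frac, max_frac)
    return (rng.random((h, w)) < frac).astype(np.float32)
```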

Fig. 4: Performance of various loss functions on the Open Cities AI dataset. Relatively better performance is observed when the model is trained with all the losses.

3.4.2 Effect of Loss Function

As can be seen from Fig. 4, a mixture of losses works best for inpainting (results presented on Open Cities AI). This hybrid loss function, consisting of perceptual, style, L1 and adversarial losses for the generator, and feature matching and adversarial losses for the discriminator, performs well for this semantically tough image inpainting task. The perceptual and style losses improve the quality of the reconstructed image by capturing image semantics and minimising the difference between the activations of the target and output images across layers. The L1 loss improves the model performance due to its high noise robustness. The feature matching loss used with the discriminator helps keep the GAN stable and prevents mode collapse. To show performance across varying corruption levels, the ablation study is performed for different percentages of rectangular masks.

4 Conclusion

We presented a novel approach to tackle the problem of image inpainting in the remote sensing domain. Our approach builds on the existing EdgeConnect model and strengthens it by incorporating an attention mechanism and a global GAN, which gives realistic results by combining global and local features. Furthermore, we incorporate multiple losses, such as the perceptual loss and the style loss, which further boost our model's performance. We evaluate our model on the Open Cities AI and Earth on Canvas datasets, where our approach gives competitive results with respect to previous benchmarks. In the future, we plan to consider more semantically challenging datasets and to explore the problem in a multimodal scenario.

Acknowledgement

The authors would like to acknowledge the IITB-ISRO Grant RD/0120-ISROC00-005.

References

  • [1] Omar Elharrouss, Noor Almaadeed, Somaya Al-Maadeed, and Younes Akbari, “Image inpainting: A review,” Neural Processing Letters, pp. 1–22, 2019.
  • [2] Junyu Dong, Ruiying Yin, Xin Sun, Qiong Li, Yuting Yang, and Xukun Qin, “Inpainting of remote sensing SST images with deep convolutional generative adversarial network,” IEEE Geoscience and Remote Sensing Letters, vol. 16, no. 2, pp. 173–177, 2018.
  • [3] A. Kuznetsov and M. Gashnikov, “Remote sensing image inpainting with generative adversarial networks,” in 2020 8th International Symposium on Digital Forensics and Security (ISDFS), 2020, pp. 1–6.
  • [4] Yang Liu, Jinshan Pan, and Zhixun Su, “Deep blind image inpainting,” 2017.
  • [5] Zhaoyi Yan, Xiaoming Li, Mu Li, Wangmeng Zuo, and Shiguang Shan, “Shift-net: Image inpainting via deep feature rearrangement,” 2018.
  • [6] Guilin Liu, Fitsum A. Reda, Kevin J. Shih, Ting-Chun Wang, Andrew Tao, and Bryan Catanzaro, “Image inpainting for irregular holes using partial convolutions,” 2018.
  • [7] Kamyar Nazeri, Eric Ng, Tony Joseph, Faisal Z. Qureshi, and Mehran Ebrahimi, “EdgeConnect: Generative image inpainting with adversarial edge learning,” 2019.
  • [8] Satoshi Iizuka, Edgar Simo-Serra, and Hiroshi Ishikawa, “Globally and locally consistent image completion,” ACM Trans. Graph., vol. 36, no. 4, July 2017.
  • [9] Karen Simonyan and Andrew Zisserman, “Very deep convolutional networks for large-scale image recognition,” 2014.
  • [10] S. M. Nadim Uddin and Yong Jung, “Global and local attention-based free-form image inpainting,” Sensors, vol. 20, pp. 3204, 06 2020.
  • [11] Leon A. Gatys, Alexander S. Ecker, and Matthias Bethge, “Image style transfer using convolutional neural networks,” 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2414–2423, 2016.
  • [12] Justin Johnson, Alexandre Alahi, and Li Fei-Fei, “Perceptual losses for real-time style transfer and super-resolution,” 2016.
  • [13] Sanghyun Woo, Jongchan Park, Joon-Young Lee, and In So Kweon, “CBAM: Convolutional block attention module,” 2018.
  • [14] Ozan Oktay, Jo Schlemper, Loic Le Folgoc, Matthew Lee, Mattias Heinrich, Kazunari Misawa, Kensaku Mori, Steven McDonagh, Nils Y Hammerla, Bernhard Kainz, Ben Glocker, and Daniel Rueckert, “Attention U-Net: Learning where to look for the pancreas,” 2018.
  • [15] Milind Naphade, Zheng Tang, Ming-Ching Chang, David C Anastasiu, Anuj Sharma, Rama Chellappa, Shuo Wang, Pranamesh Chakraborty, Tingting Huang, Jenq-Neng Hwang, et al., “The 2019 AI City Challenge,” in CVPR Workshops, 2019, pp. 452–460.
  • [16] Ushasi Chaudhuri, Biplab Banerjee, Avik Bhattacharya, and Mihai Datcu, “A zero-shot sketch-based intermodal object retrieval scheme for remote sensing images,” IEEE Geoscience and Remote Sensing Letters, 2021.
  • [17] Paul Bao, Lei Zhang, and Xiaolin Wu, “Canny edge detection enhancement by scale multiplication,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 9, pp. 1485–1490, 2005.
  • [18] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros, “Image-to-image translation with conditional adversarial networks,” 2016.
  • [19] Suriya Singh, Anil Batra, Guan Pang, Lorenzo Torresani, Saikat Basu, Manohar Paluri, and CV Jawahar, “Self-supervised feature learning for semantic segmentation of overhead imagery.,” in BMVC, 2018, vol. 1, p. 4.
  • [20] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen, “Improved techniques for training GANs,” 2016.
  • [21] Hang Zhao, Orazio Gallo, Iuri Frosio, and Jan Kautz, “Loss functions for neural networks for image processing,” 2015.
  • [22] Diederik P Kingma and Jimmy Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
  • [23] Mehdi Noroozi and Paolo Favaro, “Unsupervised learning of visual representations by solving jigsaw puzzles,” 2016.