Reference-guided texture and structure inference for image inpainting
Abstract
Existing learning-based image inpainting methods still struggle with complex semantic environments and diverse hole patterns, because the prior information learned from large-scale training data is insufficient for these situations. Reference images covering the same scene share similar texture and structure priors with the corrupted images, which offers new prospects for image inpainting. Inspired by this, we first build a benchmark dataset containing 10K pairs of input and reference images for reference-guided inpainting. We then adopt an encoder-decoder structure that infers the texture and structure features of the input image separately, accounting for their pattern discrepancy during inpainting. A feature alignment module is further designed to refine these features of the input image under the guidance of a reference image. Both quantitative and qualitative evaluations demonstrate the superiority of our method over state-of-the-art methods in completing complex holes. Code is available at https://github.com/Cameltr/RGTSI.
Index Terms— Image inpainting, reference image, feature alignment, complex holes
1 Introduction
Image inpainting aims to fill the missing parts of an image with plausible content; it can be used to repair corrupted areas or remove unwanted objects in photos. Its main idea is to find sufficient prior information, such as smoothness [1] and self-similarity [2, 3], as a reference to fill the holes. Recently, learning-based inpainting methods have attracted great attention. Pathak et al. first introduced Context Encoders [4], after which neural patch synthesis [5], residual learning [6], partial convolution [7], and others [8, 9, 18] were proposed. To assist complex image inpainting, structural priors were introduced to guide the task, including edges [11, 12, 22], contours [13], and segmentation maps [10, 24]. These methods show that structural priors effectively improve the quality of the completed images. In spite of this, they still struggle when holes are large or the expected content contains complicated semantic information (e.g., Fig. 1(c), (d) and (e)).
To seek more prior information for this type of ill-posed problem, reference images with similar textures and structures have been introduced in several vision tasks, such as image super-resolution [15, 16], image compression [17], and action recognition [23]. Their motivation is to utilize rich textures from the references to compensate for the lost details in the input images and thereby produce more detailed and realistic content. Inspired by this, reference-guided image inpainting is introduced in this paper, which focuses on filling missing areas guided by external references containing complete and similar structures and details. However, reference images covering the same scene are often captured under different conditions, resulting in different styles and geometries between the input and reference images (e.g., Fig. 1(a) and (b)).
In this paper, we propose a novel reference-based encoder-decoder network for image inpainting. Considering the pattern discrepancy between texture and structure during completion, such as large stylistic differences in texture and large geometric differences in structure, we introduce reference images as a guide to complete texture and structure features separately. Specifically, the input and reference images are first encoded to generate their respective texture and structure features. Both kinds of features extracted from the input and reference images are then aligned by a feature alignment module and fused in a multi-scale filling manner. An equalization operation as in [9] is employed to ensure that the attention in every channel is the same and the feature representations are consistent as well. The equalized features are concatenated with the features from the encoder and passed to the decoder to generate the inpainting result. To promote this task, we construct a dataset named DPED10K, in which the reference images, generated based on SIFT [19] matching, share structural and textural similarities with the input images.
The main contributions of this paper are: 1) We propose the idea of reference-guided inpainting to compensate for the insufficient priors in completing complex scenes. 2) We design a reference-based encoder-decoder network and a novel feature alignment module to finely align and fuse the features of the input and reference images. 3) We build a reference image dataset to facilitate further research and performance evaluation of reference-guided image inpainting.
2 Approach

2.1 Dataset
For reference-guided image inpainting, the similarity between the input image and its reference image is of great significance to the inpainting result. However, to the best of our knowledge, no such dataset is publicly available. We therefore construct one based on the DPED [20] dataset, which consists of real photos captured by three different phones and one high-end reflex camera. We select the images captured by the BlackBerry and Sony devices and generate an input-reference dataset based on SIFT [19] feature matching, which characterizes local texture and structure patterns and is thus in line with the objective of local texture and structure matching.
For each image pair, we randomly crop 256×256 patches from one image as input patches, while the reference patches are cropped from the corresponding image. In this way, we collect 10K paired patches as the training set and randomly collect 200 pairs as the test set. Some samples are shown in Fig. 2. We refer to the collected dataset as DPED10K; it has been released to facilitate research on reference-guided image inpainting.
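The exact pairing procedure is not spelled out in this text, so the following is only a rough sketch of one plausible SIFT-based pairing step using OpenCV; the function name, ratio-test threshold, and crop-centering heuristic are our choices, not the paper's exact pipeline.

```python
import cv2
import numpy as np

def sample_pair(input_img, ref_img, patch=256, seed=None):
    """Crop a random input patch and locate a SIFT-matched reference crop."""
    rng = np.random.default_rng(seed)
    h, w = input_img.shape[:2]
    y, x = int(rng.integers(0, h - patch)), int(rng.integers(0, w - patch))
    in_patch = input_img[y:y + patch, x:x + patch]

    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(cv2.cvtColor(in_patch, cv2.COLOR_BGR2GRAY), None)
    kp2, des2 = sift.detectAndCompute(cv2.cvtColor(ref_img, cv2.COLOR_BGR2GRAY), None)
    if des1 is None or des2 is None:
        return None

    # Lowe's ratio test keeps matches whose local texture/structure is distinctive
    matches = cv2.BFMatcher().knnMatch(des1, des2, k=2)
    good = [m[0] for m in matches if len(m) == 2 and m[0].distance < 0.75 * m[1].distance]
    if not good:
        return None

    # centre the reference crop on the median location of the matched keypoints
    pts = np.float32([kp2[m.trainIdx].pt for m in good])
    cx, cy = np.median(pts, axis=0)
    rx = int(np.clip(cx - patch // 2, 0, ref_img.shape[1] - patch))
    ry = int(np.clip(cy - patch // 2, 0, ref_img.shape[0] - patch))
    return in_patch, ref_img[ry:ry + patch, rx:rx + patch]
```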
2.2 Reference-Based Encoder-Decoder
The network architecture is illustrated in Fig. 3. It consists of two encoders that share the same parameters and one decoder to fill the missing regions in an end-to-end manner. In addition, we place residual blocks with dilated convolutions between the encoder and decoder, which enlarges the receptive field to cover more contextual information.
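A plausible form of such a block is sketched below; the channel width, normalization layer, and dilation rate are assumptions rather than the paper's exact configuration.

```python
import torch.nn as nn

class DilatedResBlock(nn.Module):
    """Residual block with dilated convolutions, placed between encoder and
    decoder to enlarge the receptive field (assumed configuration)."""
    def __init__(self, channels=512, dilation=2):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=dilation, dilation=dilation),
            nn.InstanceNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=dilation, dilation=dilation),
            nn.InstanceNorm2d(channels),
        )

    def forward(self, x):
        # residual connection keeps existing context while adding dilated features
        return x + self.body(x)
```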
In the encoder, following MEDFE [9], we reorganize the shallow features of the encoder into texture features and the deep features into structure features. The texture and structure features from different convolutional layers are then resized to the same size and concatenated, respectively. To refine the features of the corrupted input image with those of the reference image, we propose a Feature Alignment Module (FAM) based on deformable convolution [21] and a dynamic offset estimator [25]. The process of FAM is formulated as follows:
$$\hat{F}_{te} = \mathrm{FAM}\big(F^{in}_{te},\, F^{ref}_{te}\big), \qquad \hat{F}_{st} = \mathrm{FAM}\big(F^{in}_{st},\, F^{ref}_{st}\big) \qquad (1)$$

where $\mathrm{FAM}(\cdot,\cdot)$ denotes the feature alignment module; $F^{in}_{te}$, $F^{in}_{st}$, $F^{ref}_{te}$, and $F^{ref}_{st}$ are the texture and structure features from the input and reference images; and $\hat{F}_{te}$ and $\hat{F}_{st}$ are the aligned and fused texture and structure features, respectively. FAM fuses guidance information from the reference image to help fill the missing area of the input image.
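As a concrete illustration of the reorganization step above, the sketch below groups shallow encoder maps into a texture feature and deep maps into a structure feature; the layer split, the common resolution, and the function name are our assumptions, not the paper's exact settings.

```python
import torch
import torch.nn.functional as F

def build_texture_structure(encoder_feats, size=(32, 32)):
    """MEDFE-style regrouping (sketch): shallow layers -> texture feature,
    deep layers -> structure feature, resized to a common resolution."""
    shallow, deep = encoder_feats[:3], encoder_feats[3:]
    f_te = torch.cat([F.interpolate(f, size=size, mode='bilinear', align_corners=False)
                      for f in shallow], dim=1)
    f_st = torch.cat([F.interpolate(f, size=size, mode='bilinear', align_corners=False)
                      for f in deep], dim=1)
    return f_te, f_st
```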
After the features are aligned and fused, we design a texture branch and a structure branch to fill the holes in $\hat{F}_{te}$ and $\hat{F}_{st}$, respectively. Each branch consists of several streams of partial convolutions [7] with different kernel sizes, so that holes are filled at multiple scales. To ensure that the hole filling focuses on textures and structures, we use the ground-truth image and its structure image generated by RTV [26] to supervise the outputs of the two branches. We then concatenate the two branch outputs and fuse them with simple convolutional layers. The resulting texture and structure representations, after feature equalization, are supplemented to different feature layers via skip connections to generate the output image.
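The following sketch illustrates the multi-scale filling idea with a simplified partial convolution; the kernel sizes (3/5/7), the stream count, and the 1x1 fusion layer are assumptions, not the paper's exact branch design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PartialConv(nn.Module):
    """Simplified partial convolution [7]: convolve only valid pixels and
    renormalise by the number of valid positions under the kernel."""
    def __init__(self, cin, cout, k):
        super().__init__()
        self.conv = nn.Conv2d(cin, cout, k, padding=k // 2, bias=False)
        self.register_buffer('window', torch.ones(1, cin, k, k))

    def forward(self, x, mask):  # mask: 1 = valid pixel, 0 = hole
        valid = F.conv2d(mask, self.window, padding=self.conv.padding)
        out = self.conv(x * mask) * (self.window.numel() / valid.clamp(min=1.0))
        return out, (valid > 0).float()  # filled feature and updated validity mask

class MultiScaleFillBranch(nn.Module):
    """One filling branch: parallel partial-conv streams with different
    (assumed) kernel sizes, fused by a 1x1 convolution."""
    def __init__(self, channels, kernels=(3, 5, 7)):
        super().__init__()
        self.streams = nn.ModuleList(PartialConv(channels, channels, k) for k in kernels)
        self.fuse = nn.Conv2d(len(kernels) * channels, channels, 1)

    def forward(self, feat, mask):
        mask = mask.expand_as(feat)              # broadcast 1-channel mask to all channels
        outs = [s(feat, mask)[0] for s in self.streams]
        return self.fuse(torch.cat(outs, dim=1))
```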
2.3 Feature Alignment Module

In this paper, we use deformable convolution [21], which models deformation by learning additional offsets, to achieve feature-level alignment. These offsets allow the model to focus on the target region when extracting features by shifting its convolution sampling locations. Formally, the deformable convolution operation is defined as follows:
$$y(p_0) = \sum_{k=1}^{K} w_k \cdot x\big(p_0 + p_k + \Delta p_k\big) \qquad (2)$$

where $x$ is the input, $y$ is the output, and $k$ and $K$ denote the kernel index and the number of kernel weights, respectively. $w_k$, $p_0$, $p_k$, and $\Delta p_k$ are the $k$-th kernel weight, the central index, the $k$-th fixed offset, and the learnable offset for the $k$-th position, respectively.
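For concreteness, the snippet below realizes Eq. (2) with torchvision's DeformConv2d; the feature size and the plain convolutional offset head are placeholders, since the paper predicts the offsets with the dynamic offset estimator [25].

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

# Toy illustration of Eq. (2): each 3x3 kernel position k gets a learned (dx, dy) offset.
x = torch.randn(1, 64, 32, 32)                           # input feature map x
offset_head = nn.Conv2d(64, 2 * 3 * 3, 3, padding=1)     # predicts Delta p_k for all k
dconv = DeformConv2d(64, 64, kernel_size=3, padding=1)   # weights w_k sampled at p_0 + p_k + Delta p_k
y = dconv(x, offset_head(x))
print(y.shape)                                           # torch.Size([1, 64, 32, 32])
```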
As shown in Fig. 4, FAM first concatenates the input feature with the reference feature. The concatenated features are then passed to the dynamic offset estimator [25] to generate an offset map between the input and reference features; the estimator can capture similarities from near to far distances. Guided by the offset map, the reference features are aligned by deformable convolution [21]. It is worth noting that FAM, with its offset estimator [25], can still be utilized for single-image inpainting when there is no reference image.
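A minimal sketch of this flow is given below, assuming torchvision's DeformConv2d; the two-layer offset estimator and the 1x1 fusion layer are simplifications of the dynamic offset estimator [25] and of the fusion used in the paper.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class FeatureAlignmentModule(nn.Module):
    """Sketch of FAM: concatenate input and reference features, estimate an
    offset map, align the reference feature with deformable convolution, fuse."""
    def __init__(self, channels, k=3):
        super().__init__()
        # stand-in for the dynamic offset estimator [25]: (dx, dy) per kernel position
        self.offset_estimator = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, 2 * k * k, 3, padding=1),
        )
        self.align = DeformConv2d(channels, channels, k, padding=k // 2)
        self.fuse = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, f_in, f_ref):
        offset = self.offset_estimator(torch.cat([f_in, f_ref], dim=1))   # offset map
        f_ref_aligned = self.align(f_ref, offset)                         # warp reference towards input
        return self.fuse(torch.cat([f_in, f_ref_aligned], dim=1))         # fused feature
```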
Fig. 5. Qualitative comparison: (a) Reference, (b) Input, (c) MEDFE [9], (d) EC [11], (e) GI [14], (f) Ours, (g) Ground-truth.
Table 1. Quantitative comparison (PSNR/SSIM) on DPED10K under different mask ratios.

| Method | 10%-20% | 20%-30% | 30%-40% | 40%-50% | 50%-60% | Average |
|---|---|---|---|---|---|---|
| MEDFE [9] | 29.88/0.957 | 26.45/0.912 | 23.86/0.854 | 22.05/0.787 | 19.44/0.638 | 24.33/0.829 |
| EC [11] | 29.30/0.950 | 26.37/0.909 | 24.23/0.858 | 22.40/0.794 | 20.10/0.668 | 24.48/0.836 |
| GI [14] | 31.24/0.965 | 27.34/0.927 | 24.86/0.877 | 22.67/0.821 | 19.88/0.694 | 25.20/0.857 |
| No reference | 31.35/0.966 | 27.87/0.933 | 25.36/0.885 | 23.20/0.826 | 20.22/0.688 | 25.60/0.860 |
| Random reference | 31.44/0.967 | 27.95/0.933 | 25.42/0.887 | 23.32/0.828 | 20.36/0.690 | 25.70/0.861 |
| Ours (reference) | 31.77/0.967 | 28.07/0.935 | 25.63/0.888 | 23.49/0.832 | 20.58/0.694 | 25.91/0.863 |
3 Experiment
We evaluate our method on the proposed DPED10K dataset. Irregular masks are obtained from [7] and classified by their hole size relative to the entire image. All images and corresponding masks are resized to 256×256 for training and testing. The model is implemented in PyTorch and optimized with the Adam optimizer with a batch size of 1 on a single Tesla V100 GPU. Training is stopped after 80 epochs. We fuse the reconstruction loss, perceptual loss [27], style loss, and relativistic average LS adversarial loss [28] to supervise the training process. We compare our method with three state-of-the-art methods: MEDFE [9], EC [11], and GI [14].
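A hedged sketch of how these four terms might be combined is given below (PyTorch, torchvision >= 0.13); the loss weights and the single VGG layer used for the perceptual and style terms are illustrative assumptions, not the paper's settings.

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg16

# Frozen VGG features for the perceptual and style terms (one layer here for brevity).
vgg_feat = vgg16(weights="IMAGENET1K_V1").features[:16].eval()
for p in vgg_feat.parameters():
    p.requires_grad_(False)

def gram(f):
    n, c, h, w = f.shape
    f = f.view(n, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

def generator_loss(pred, gt, d_real, d_fake,
                   w_rec=1.0, w_perc=0.1, w_style=250.0, w_adv=0.1):
    rec = F.l1_loss(pred, gt)                              # reconstruction loss
    fp, fg = vgg_feat(pred), vgg_feat(gt)
    perc = F.l1_loss(fp, fg)                               # perceptual loss [27]
    style = F.l1_loss(gram(fp), gram(fg))                  # style loss
    # relativistic average LS adversarial loss [28], generator side
    adv = torch.mean((d_fake - d_real.mean() - 1) ** 2) + \
          torch.mean((d_real - d_fake.mean() + 1) ** 2)
    return w_rec * rec + w_perc * perc + w_style * style + w_adv * adv
```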
Qualitative Comparison. For a fair evaluation, we conduct experiments on filling irregular holes in the input images. The qualitative results are shown in Fig. 5. In general, the baselines struggle to recover the image content effectively. For example, in the first row of Fig. 5(c), (d), and (e), the window exhibits broken and blurred lines. In contrast, our method recovers the structure and texture well by introducing a reference image, and Fig. 5(f) shows that our method generates more visually pleasing content.
Quantitative Comparison. We conduct quantitative evaluations on the DPED10K dataset with different mask ratios, using two common metrics: PSNR and SSIM. The comparison results for filling irregular holes are shown in Table 1. Our method achieves better performance at all mask ratios. We also compare model complexity: the average execution time of our model to complete a 256×256 image is 77 ms, compared to 40 ms for MEDFE [9], 22 ms for EC [11], and 106 ms for GI [14].
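For reference, PSNR/SSIM numbers of this kind are commonly computed as follows with scikit-image; the paper's exact evaluation script may differ.

```python
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate(pred, gt):
    """pred, gt: HxWx3 uint8 arrays; returns (PSNR, SSIM) on the full completed image."""
    psnr = peak_signal_noise_ratio(gt, pred, data_range=255)
    ssim = structural_similarity(gt, pred, channel_axis=-1, data_range=255)
    return psnr, ssim
```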
Impact of reference image. To verify the effectiveness of the reference images, we set up two comparison experiments (i.e., no reference and random reference). Specifically, the no-reference experiment uses a solid black image as the reference, while the random-reference experiment shuffles the pairing between input and reference images. Results are shown in Table 1. Since FAM can be used for single-image inpainting when there is no reference image, the no-reference results still outperform the other baselines. Without a reference, the corrupted input lacks the structural and textural information of a reference image and can only be repaired by learning the texture and structure around its holes. A random reference can still provide a small amount of useful feature information after the feature alignment process; therefore, its performance is better than that of the no-reference experiment.
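The two ablation settings can be reproduced with a few lines; this is our reading of the setup described above, not the paper's released code.

```python
import torch

def make_reference(refs, mode="paired"):
    """Construct the reference batch for the three settings compared in Table 1."""
    if mode == "none":                                    # "No reference": solid black images
        return torch.zeros_like(refs)
    if mode == "random":                                  # "Random reference": break the pairing
        return refs[torch.randperm(refs.size(0))]
    return refs                                           # matched reference (ours)
```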
4 Conclusion
In this paper, we proposed an end-to-end network that utilizes a complete reference image with texture and structure similar to the input image. To fuse the features of the input and reference images, we proposed a feature alignment module that finely aligns and fuses the features at the pixel level. Both quantitative and qualitative experiments demonstrate the effectiveness of the proposed method. In addition, a new dataset, DPED10K, is constructed to facilitate the evaluation of reference-guided inpainting methods and to provide a benchmark for future research.
Acknowledgement: This work was supported by National Natural Science Foundation of China under Grant 62171325.
References
- [1] Marcelo Bertalmío, Guillermo Sapiro, Vicent Caselles, and Coloma Ballester, “Image inpainting,” SIGGRAPH, 2000, pp. 417–424.
- [2] Connelly Barnes, Eli Shechtman, Adam Finkelstein, and Dan B Goldman, “Patchmatch: A randomized correspondence algorithm for structural image editing,” ACM Trans. Graph., 28: 24, 2009.
- [3] Anat Levin, Assaf Zomet, and Yair Weiss, “Learning how to inpaint from global image statistics.,” ICCV, 2003, pp. 305–312.
- [4] Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A Efros, “Context encoders: Feature learning by inpainting,” CVPR, 2016, pp. 2536–2544.
- [5] Chao Yang, Xin Lu, Zhe Lin, Eli Shechtman, Oliver Wang, and Hao Li, “High-resolution image inpainting using multi-scale neural patch synthesis,” CVPR, 2017, pp. 6721–6729.
- [6] Libin Jiao, Hao Wu, Haodi Wang, and Rongfang Bie, “Multi-scale semantic image inpainting with residual learning and gan,” Neurocomputing, 331: 199–212, 2019.
- [7] Guilin Liu, Fitsum A Reda, Kevin J Shih, Ting-Chun Wang, Andrew Tao, and Bryan Catanzaro, “Image inpainting for irregular holes using partial convolutions,” ECCV, 2018, pp. 85–100.
- [8] Antonio Criminisi, Patrick Pérez, and Kentaro Toyama, “Region filling and object removal by exemplar-based image inpainting,” IEEE Trans. Image Process., 13: 1200–1212, 2004.
- [9] Hongyu Liu, Bin Jiang, Yibing Song, Wei Huang, and Chao Yang, “Rethinking image inpainting via a mutual encoder-decoder with feature equalizations,” ECCV, 2020, pp. 725–741.
- [10] Liang Liao, Jing Xiao, Zheng Wang, Chia-Wen Lin, and Shin’ichi Satoh, “Guidance and evaluation: Semantic-aware image inpainting for mixed scenes,” ECCV, 2020, pp. 683–700.
- [11] Kamyar Nazeri, Eric Ng, Tony Joseph, Faisal Z Qureshi, and Mehran Ebrahimi, “Edgeconnect: Generative image inpainting with adversarial edge learning,” arXiv preprint arXiv:1901.00212, 2019.
- [12] Jingyuan Li, Fengxiang He, Lefei Zhang, Bo Du, and Dacheng Tao, “Progressive reconstruction of visual structure for image inpainting,” ICCV, 2019, pp. 5962–5971.
- [13] Wei Xiong, Jiahui Yu, Zhe Lin, Jimei Yang, Xin Lu, Connelly Barnes, and Jiebo Luo, “Foreground-aware image inpainting,” CVPR, 2019, pp. 5840–5848.
- [14] Jiahui Yu, Zhe Lin, Jimei Yang, Xiaohui Shen, Xin Lu, and Thomas S Huang, “Generative image inpainting with contextual attention,” CVPR, 2018, pp. 5505–5514.
- [15] Liying Lu, Wenbo Li, Xin Tao, Jiangbo Lu, and Jiaya Jia, “Masa-sr: Matching acceleration and spatial adaptation for reference-based image super-resolution,” CVPR, 2021, pp. 6368–6377.
- [16] Zhifei Zhang, Zhaowen Wang, Zhe Lin, and Hairong Qi, “Image super-resolution by neural texture transfer,” CVPR, 2019, pp. 7982–7991.
- [17] Man M Ho, Jinjia Zhou, and Gang He, “Rr-dncnn v2.0: Enhanced restoration-reconstruction deep neural network for down-sampling-based video coding,” IEEE Trans. Image Process., 30: 1702–1715, 2021.
- [18] Liang Liao, Jing Xiao, Zheng Wang, Chia-Wen Lin, and Shin’ichi Satoh, “Uncertainty-Aware Semantic Guidance and Estimation for Image Inpainting,” IEEE J. Sel. Top. Signal Process., 15(2): 310–323, 2021.
- [19] David G Lowe, “Object recognition from local scale-invariant features,” ICCV, 1999, pp. 1150–1157.
- [20] Andrey Ignatov, Nikolay Kobyshev, Radu Timofte, Kenneth Vanhoey, and Luc Van Gool., “Dslr-quality photos on mobile devices with deep convolutional networks,” ICCV, 2017, pp. 3277–3285.
- [21] Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei, “Deformable convolutional networks,” ICCV, 2017, pp. 764–773.
- [22] Liang Liao, Ruimin Hu, Jing Xiao, and Zhongyuan Wang, “Edge-Aware Context Encoder for Image Inpainting,” ICASSP, 2018, pp. 3156–3160.
- [23] Xian Zhong, Zhuo Zhou, Wenxuan Liu, Kui Jiang, Xuemei Jia, Wenxin Huang, and Zheng Wang, “VCD: View-Constraint Disentanglement for Action Recognition,” ICASSP, 2022, pp. 2170–2174.
- [24] Liang Liao, Jing Xiao, Zheng Wang, Chia-Wen Lin, and Shin’ichi Satoh, “Image Inpainting Guided by Coherence Priors of Semantics and Textures,” CVPR, 2021, pp. 6539–6548.
- [25] Gyumin Shim, Jinsun Park, and In So Kweon, “Robust reference-based super-resolution with similarity-aware deformable convolution,” CVPR, 2020, pp. 8425–8434.
- [26] Li Xu, Qiong Yan, Yang Xia, and Jiaya Jia, “Structure extraction from texture via natural variation measure,” ACM Trans. Graph., 31(6): 1–10, 2012.
- [27] Justin Johnson, Alexandre Alahi, and Li Fei-Fei, “Perceptual losses for real-time style transfer and super-resolution,” ECCV, 2016, pp. 694–711.
- [28] Alexia Jolicoeur-Martineau, “The relativistic discriminator: a key element missing from standard gan,” arXiv preprint arXiv:1807.00734, 2018.