Image Inpainting Guided by Coherence Priors of Semantics and Textures
Abstract
Existing inpainting methods have achieved promising performance in recovering corrupted images of specific scenes. However, filling holes involving multiple semantic categories remains challenging due to obscure semantic boundaries and the mixture of different semantic textures. In this paper, we introduce coherence priors between semantics and textures, which make it possible to concentrate on completing separate textures in a semantic-wise manner. Specifically, we adopt a multi-scale joint optimization framework to first model the coherence priors and then accordingly optimize image inpainting and semantic segmentation in an interleaved, coarse-to-fine manner. A Semantic-Wise Attention Propagation (SWAP) module is devised to refine completed image textures across scales by exploring non-local semantic coherence, which effectively mitigates texture mix-up. We also propose two coherence losses to constrain the consistency between the semantics and the inpainted image in terms of overall structure and detailed textures. Experimental results demonstrate the superiority of our proposed method on challenging cases with complex holes.
1 Introduction
High-quality image inpainting aims to fill in missing regions with synthesized content [1, 2, 5]. It requires both semantically meaningful structures and visually pleasing textures. To this end, deep learning-based methods [22, 36, 39, 41, 43, 44] resort to encoder-decoder networks to infer the context of a corrupted image and then refine the texture details of the initial inference in the missing region with tools such as non-local algorithms. Although current image inpainting methods have made significant progress, completing complex holes remains technically challenging, particularly when a missing region involves multiple sub-regions of different semantic classes. The main reason lies in the failure to model the prior distributions of a mixture of different semantic regions, which usually results in blurry boundaries and unrealistic textures [15].
A feasible approach is to adopt structural information, such as edges [12, 19], contours [35], and smooth images [24], as guidance to complete missing structures and textures in two steps. The underlying assumption is that structures offer semantic clues for inferring an unknown scene, making them suitable for guiding texture filling. However, we notice that the correspondence between structural information and textures is not apparent, so the filled textures still rely heavily on the local correlation around the missing region. Figure 1 demonstrates the ambiguity of the mapping from mid-level structures (e.g., edges) to textures, which can significantly degrade the visual authenticity of the generated textures.

Compared with mid-level structures, high-level semantic information offers stronger clues to object textures. For example, in Figure 1, the animal semantics lead to fluffy fur while the tree semantics lead to green leaves, which usually cannot be distinguished solely from their mid-level structures. Conversely, object textures have been shown to provide sufficient information about semantic classes [7, 11]. We therefore characterize the relationship between the semantics and textures of objects as coherence priors and build our inpainting method on these priors to complete complex holes while ensuring mutual consistency between the predicted semantics and textures.
Based on the above motivation, we propose to utilize coherence priors between semantics and textures to facilitate the joint optimization of semantic segmentation and image inpainting. To this end, our framework extracts a shared feature to represent the common information of the two tasks, and characterizes the interaction between scales so that the coherence priors can be exploited to optimize the two tasks jointly. Specifically, two novel designs are proposed: 1) A Semantic-Wise Attention Propagation (SWAP) module explicitly captures the semantic relevance between an unknown (missing) area and the known regions. As a result, when mapping semantics to image textures, filling in an unknown patch refers only to known patches of the same semantics, rather than to the entire image, thereby avoiding irrelevant texture filling. 2) We devise two loss terms to learn the global and local coherence relationships, respectively. The image-level structure coherence loss supervises the structural matching between the inpainted image and the corresponding segmentation map to generate clear boundaries in the inpainted image. Besides, the non-local patch-level coherence loss assesses the distribution of patch textures in the semantic domain to encourage each generated texture to be as similar as possible to the matched known patch of the same semantics.
Unlike existing semantics-guided inpainting methods, such as SPG-Net [26] and SGE-Net [15], which synthesize textures by convolution involving only local semantic information, our proposed method predicts the textures and the semantics simultaneously, and borrows known texture features of the same semantics to fill in a missing region via semantics-guided non-local means, which not only ensures realistic textures but is also valuable for semantic recognition. For a better understanding of our method, we summarize the comparison with other methods in Table 1. The main contributions of our paper are three-fold:
• We introduce coherence priors that highlight the mutual consistency between the semantics and textures in image inpainting, and devise two coherence losses to boost the consistency between the semantic information and the inpainted image at the global structure level and the local texture level.
• We propose a semantic-wise attention propagation module, which generates semantically realistic textures by capturing distant relationships and referring to texture features of the same semantics in the feature maps.
2 Related Work
2.1 Image inpainting
Deep learning-based inpainting approaches, which learn to understand image content, have recently been proposed to generate meaningful content for filling in missing regions. Context Encoder [22] was the first to employ a generative adversarial network and demonstrated its potential for inpainting tasks. Building on it, efforts have been made to enhance inpainting performance, including introducing specific losses [6, 32, 44], building recursive architectures for progressive refinement [13, 37], and involving structural priors as guidance in two-stage frameworks for structural consistency [14, 19, 24, 38]. However, these methods lack the ability to model long-range correlations with distant context, leading to blurry textures.
To better refine the inpainted textures, non-local algorithms are adopted to borrow distant features, which contain fine textures, from the known region to the missing region. [39] first proposed to compute textural affinity within the same image to fill the corrupted area with more realistic texture patches from the available area. [42] devised a pyramid of contextual attention at multiple layers to refine textures from high level to low level. [17] used a coherent semantic attention layer to ensure semantic relevance between nearby filled features. [34] extended the single attention map to bidirectional attention maps and re-normalized the features to let the decoder concentrate on filling the holes. Although these methods have delivered considerable improvements, they fail to address semantic ambiguity because they measure texture affinity across all semantics.
Figure 2: (a) Overall framework. (b) Semantic-Wise Attention Propagation (SWAP) module.
2.2 Semantic-guided image processing
Recent research reveals that the semantic priors of high-level vision tasks are useful for guiding low-level vision tasks [4, 16, 30]. One hot research topic is semantic-guided image generation, which covers several directions, including image translation from segmentation maps to realistic images [10, 29] and semantic image synthesis [4, 27, 30]. To avoid the vanishing of semantic priors in the generation process, [21] proposed a spatially adaptive normalization layer to propagate the semantic information to the synthesized images, enabling high-quality image generation with the help of semantic priors.
Semantic priors have also been applied to many conditional low-level vision tasks and have demonstrated their effectiveness in constraining the plausible solution space of these ill-posed problems, for example, in super-resolution [31], dehazing [23], denoising [16], style transfer [18, 28], and image manipulation [9, 20]. Inspired by the successful assistance of semantic priors in conditional image generation, we also exploit semantic guidance in image inpainting.
3 Proposed Method
How can we achieve high-quality inpainting results with both semantically reasonable structures and visually pleasing textures? We argue that such results should not only reconstruct the structures of the semantic objects for global consistency, but also synthesize object textures that look realistic for their semantics to preserve local pixel continuity. To this end, we build a multi-task learning framework on the coherence priors to explicitly reconstruct both the structures and textures of semantic objects. Moreover, we propose a new SWAP module to optimize the textures by semantically binding the textures between an inpainted region and the known regions based on the coherence priors. We also devise two losses to guide the learning of coherence relationships at the global structure level and at the local patch level, respectively.
3.1 Framework Overview
We build our network on an alternating-optimization architecture that utilizes the coherence priors to mutually assist image inpainting and semantic segmentation for a corrupted image. Specifically, we propose a multi-task learning framework that shares features in the decoder for the two tasks (as shown in Figure 2). The encoder encodes the corrupted image and its mask into hierarchical contextual features, which are then fed into the decoder to predict the inpainted images and semantic segmentation maps across scales. Before feeding the encoded feature of the last layer into the decoder, we initially complete the feature via a Context Inference Module (CIM) based on the contextual inference method [15, 33].
In the decoder, the contextual feature at each scale is processed by two task-specific heads to predict the inpainted image and the segmentation map, respectively. Different from the method proposed in [15], which updates the contextual features by spatially adaptive normalization to capture the common properties of the same semantics, we propose a SWAP module that stresses the realistic texture of each semantic patch by referring to semantically relevant features from the known regions. In this way, the contextual features are learned to represent the global structure and are refined with semantic-aware texture details.
For brevity, we adopt the following notations. $F_e^i$ and $F_d^i$ denote the features from the encoder and from the decoder at scale $i$, respectively; $\hat{I}^i$ and $\hat{S}^i$ respectively represent the inpainted image and the predicted $C$-channel segmentation map from the inpainting head and from the segmentation head at scale $i$, where $C$ is the total number of semantic labels; $i$ ranges from 1 (the coarsest layer) to 5 (the finest layer).
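To make the multi-scale, two-head design concrete, below is a minimal PyTorch-style sketch. The channel widths, layer choices, and skip-connection fusion are hypothetical placeholders, and the CIM and SWAP internals are omitted; the sketch only illustrates how a shared contextual feature at each scale feeds an inpainting head and a segmentation head.

```python
# A minimal sketch of the multi-scale, two-head decoder (assumed layer sizes).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaleBlock(nn.Module):
    def __init__(self, in_ch, out_ch, num_classes):
        super().__init__()
        self.fuse = nn.Sequential(          # shared contextual feature at this scale
            nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True))
        self.inpaint_head = nn.Conv2d(out_ch, 3, 3, padding=1)        # RGB prediction
        self.seg_head = nn.Conv2d(out_ch, num_classes, 3, padding=1)  # C-channel logits

    def forward(self, feat):
        shared = self.fuse(feat)
        return shared, torch.tanh(self.inpaint_head(shared)), self.seg_head(shared)

class MultiScaleDecoder(nn.Module):
    """Decodes encoder features into inpainted images and segmentation maps at 5 scales."""
    def __init__(self, channels=(512, 256, 128, 64, 32), num_classes=8):
        super().__init__()
        self.blocks = nn.ModuleList([ScaleBlock(c, c, num_classes) for c in channels])
        self.up = nn.ModuleList([nn.Conv2d(channels[i], channels[i + 1], 3, padding=1)
                                 for i in range(len(channels) - 1)])

    def forward(self, enc_feats):            # enc_feats: coarse-to-fine skip features
        images, seg_maps = [], []
        feat = enc_feats[0]                  # completed by the CIM in the full model
        for i, block in enumerate(self.blocks):
            feat, img, seg = block(feat)     # SWAP would refine `feat` at this point
            images.append(img); seg_maps.append(seg)
            if i < len(self.up):
                feat = F.interpolate(feat, scale_factor=2, mode="nearest")
                feat = self.up[i](feat) + enc_feats[i + 1]   # add encoder skip feature
        return images, seg_maps
```

In the full model, the per-scale predictions returned here are what the multi-scale losses in Section 3.4 supervise.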
3.2 Semantic-wise Attention Propagation (SWAP)
The SWAP module is designed to optimize the contextual features by enhancing the semantic authenticity of textures based on the coherence priors. As shown in Figure 2(b), SWAP takes four inputs: the current-scale decoder feature $F_d^i$, the next-scale skip feature $F_e^{i+1}$ from the encoder, the predicted segmentation probability map $\hat{S}^i$ that guides the separation of features, and the missing-region mask $M$. The propagation process can be formulated as follows:
$$F_{out}^i = f_{\mathrm{SWAP}}\big(F_d^i, F_e^{i+1}, \hat{S}^i, M\big), \tag{1}$$
where $f_{\mathrm{SWAP}}(\cdot)$ denotes the process of refining the contextual features in SWAP.
Figure 3: (a) Non-local patch coherence loss. (b) Structure coherence loss.
The attention-based approaches in [25, 34, 39] resort to contextual attention, which picks known regions as references to complete a missing region but cannot distinguish patches of different semantics, thereby leading to blurry boundaries and semantic confusion during attention propagation. Unlike these approaches, SWAP calculates attention scores by matching semantic-aware features of missing patches with known patches based on the coherence priors. Specifically, we first split the contextual feature $F_c$, which is generated from $F_d^i$ and $F_e^{i+1}$, into different semantic parts according to the semantic labels in the predicted segmentation map $\hat{S}^i$ of the $i$-th layer. We subsequently drop the scale superscript for notational simplicity.
Within each semantic part, each patch’s attention score is evaluated by the patch affinity between the missing region and a known region using the normalized inner product followed by a softmax operation:
$$d^c_{i,j} = \left\langle \frac{q^c_i}{\lVert q^c_i \rVert},\; \frac{p^c_j}{\lVert p^c_j \rVert} \right\rangle, \tag{2}$$
$$\alpha^c_{i,j} = \frac{\exp\big(d^c_{i,j}\big)}{\sum_{j=1}^{N_c} \exp\big(d^c_{i,j}\big)}, \tag{3}$$
where $p^c_j$ is the $j$-th patch extracted from the semantic feature of class $c$ in the known region, $q^c_i$ is the $i$-th patch extracted from $F_c$ in the missing region, $d^c_{i,j}$ is the affinity between them, and $\alpha^c_{i,j}$ is the attention score representing the normalized affinity for each patch.
After obtaining the attention scores from the known region, the feature of the $i$-th missing patch is updated by
$$\hat{q}^c_i = \sum_{j=1}^{N_c} \alpha^c_{i,j}\, p^c_j, \tag{4}$$
where $N_c$ is the total number of patches of semantic class $c$ in the known region.
The output feature of SWAP is generated by merging the updated features of all semantics, followed by four groups of dilated convolutions with different rates to improve the structural coherence in the final reconstructed features.
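The following is a simplified sketch of the semantic-wise attention in Eqs. (2)-(4). It assumes 1×1 feature "patches", hard labels taken from the predicted segmentation map, and a plain loop over classes; the actual module additionally merges the per-class results and applies the dilated convolutions mentioned above.

```python
# A simplified sketch of semantic-wise attention (Eqs. (2)-(4)), assuming 1x1 patches.
import torch
import torch.nn.functional as F

def semantic_wise_attention(feat, seg_prob, mask, num_classes):
    """
    feat:     (B, C_f, H, W) contextual features to refine
    seg_prob: (B, C, H, W)   predicted segmentation probabilities at this scale
    mask:     (B, 1, H, W)   1 = missing region, 0 = known region
    """
    B, C_f, H, W = feat.shape
    labels = seg_prob.argmax(dim=1)                        # (B, H, W) hard labels
    out = feat.clone()
    for b in range(B):
        f = feat[b].reshape(C_f, -1)                       # (C_f, H*W)
        lab = labels[b].reshape(-1)                        # (H*W,)
        miss = mask[b].reshape(-1) > 0.5                   # True where missing
        f_new = f.clone()
        for c in range(num_classes):
            miss_idx = ((lab == c) & miss).nonzero(as_tuple=True)[0]
            known_idx = ((lab == c) & ~miss).nonzero(as_tuple=True)[0]
            if miss_idx.numel() == 0 or known_idx.numel() == 0:
                continue                                   # nothing to borrow for this class
            q = F.normalize(f[:, miss_idx], dim=0)         # missing "patches" of class c
            p = F.normalize(f[:, known_idx], dim=0)        # known "patches" of class c
            attn = torch.softmax(q.t() @ p, dim=1)         # Eqs. (2)-(3): affinity + softmax
            f_new[:, miss_idx] = (attn @ f[:, known_idx].t()).t()   # Eq. (4): weighted sum
        out[b] = f_new.reshape(C_f, H, W)
    return out
```

Restricting both the queries and the keys to one semantic class is what prevents the attention from borrowing textures across semantic boundaries.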
3.3 Coherence Losses
We devise new coherence losses between the semantics and textures as supervisions to guide the image inpainting and semantic segmentation to meet the following requirements: 1) the overall structure between the inpainted image and segmentation map should match each other; 2) the predicted textures of a certain semantic class should have the same distribution as that of the semantic textures in the known region. Under these considerations, we propose two coherence losses shown in Figure 3, a non-local patch coherence loss and a structure coherence loss, to respectively evaluate the patch similarity and the structural matching.
Non-local Patch Coherence Loss. Given the final inpainted image $\hat{I}$ and the predicted segmentation map $\hat{S}$, we aim to maximize the texture similarity between $\hat{I}$ and the ground-truth image $I^{gt}$. That is, the generated patches attributed to a specific class should be similar to the realistic patches of the same class in the ground-truth image.
Similar to the attention propagation process, we first split $\hat{I}$ and $I^{gt}$ into different semantic images and extract patches to build the corresponding semantic patch sets $\hat{P}^c$ and $P^c$ according to $\hat{S}$ and the ground-truth segmentation map $S^{gt}$. For each patch $\hat{p}^c_i$ in $\hat{P}^c$, we randomly select one patch $p^c_j$ from $P^c$ with the same semantics and compute the cosine similarity between them as
$$d\big(\hat{p}^c_i, p^c_j\big) = \frac{\hat{p}^c_i \cdot p^c_j}{\lVert \hat{p}^c_i \rVert\, \lVert p^c_j \rVert}, \tag{5}$$
where $i = 1, \dots, |\hat{P}^c|$ and $|\hat{P}^c|$ is the number of patches in $\hat{P}^c$.
The non-local patch coherence loss for each semantic class aims to maximize the similarity between the patch couples:
$$\mathcal{L}^c_{\mathrm{patch}} = -\frac{1}{|\hat{P}^c|} \sum_{i=1}^{|\hat{P}^c|} d\big(\hat{p}^c_i, p^c_j\big), \tag{6}$$
where $|\hat{P}^c|$ is the cardinality of the set of generated patches with class label $c$. Our objective is defined as the sum of the single-class non-local patch coherence losses over all classes found in the missing region of $\hat{S}$:
$$\mathcal{L}_{\mathrm{patch}} = \sum_{c \,\in\, \mathcal{C}(M)} \mathcal{L}^c_{\mathrm{patch}}, \tag{7}$$
where $\mathcal{C}(M)$ denotes the set of class labels appearing within the mask region of $\hat{S}$. Note that if a label value in $\hat{S}$ is not found in $S^{gt}$, the coherence loss of the corresponding semantic patch set is set to 0.
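A hedged sketch of how the non-local patch coherence loss in Eqs. (5)-(7) could be computed is given below. The patch size, the non-overlapping patch extraction, and the random one-to-one matching are illustrative assumptions rather than the paper's exact implementation.

```python
# A sketch of the non-local patch coherence loss (Eqs. (5)-(7)), assumed patch handling.
import torch
import torch.nn.functional as F

def patch_coherence_loss(pred_img, gt_img, pred_seg, gt_seg, mask, patch=8):
    """pred_img/gt_img: (B,3,H,W); pred_seg/gt_seg: (B,H,W) label maps; mask: (B,1,H,W)."""
    def to_patches(x, k):                                  # (B,C,H,W) -> (B, N, C*k*k)
        return F.unfold(x, kernel_size=k, stride=k).transpose(1, 2)

    pp = to_patches(pred_img, patch)                       # generated patches
    gp = to_patches(gt_img, patch)                         # ground-truth patches
    pl = to_patches(pred_seg.unsqueeze(1).float(), patch).mode(dim=2).values.long()
    gl = to_patches(gt_seg.unsqueeze(1).float(), patch).mode(dim=2).values.long()
    hole = to_patches(mask, patch).mean(dim=2) > 0.5       # patches lying inside the hole

    loss = pred_img.new_zeros(())
    for b in range(pred_img.shape[0]):
        for c in torch.unique(pl[b][hole[b]]):             # classes predicted inside the hole
            gen = pp[b][(pl[b] == c) & hole[b]]            # generated patches of class c
            real = gp[b][gl[b] == c]                       # real patches of class c
            if gen.shape[0] == 0 or real.shape[0] == 0:
                continue                                   # class absent in ground truth -> 0
            idx = torch.randint(real.shape[0], (gen.shape[0],), device=gen.device)
            sim = F.cosine_similarity(gen, real[idx], dim=1)   # Eq. (5)
            loss = loss + (1.0 - sim).mean()               # Eq. (6): maximize similarity
    return loss                                            # Eq. (7): summed over classes
```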
Structure Coherence Loss. Besides the local patch similarity, we adopt a structure coherence loss to encourage structural coherence between the inpainted image and the predicted segmentation map. In this work, we use two conditional discriminators to judge whether the semantics and the textures of the same image are coherent. The texture-conditioned discriminator $D_t$ is introduced to detect the predicted segmentation map's "fakes" given the real image, while the semantics-conditioned discriminator $D_s$ is trained to detect the inpainted image's "fakes" given the real segmentation map. The structure coherence loss can be expressed as:
$$\begin{aligned}\mathcal{L}_{\mathrm{struct}} = \;& \mathbb{E}\big[\log D_t\big(S^{gt} \mid I^{gt}\big)\big] + \mathbb{E}\big[\log\big(1 - D_t\big(\hat{S} \mid I^{gt}\big)\big)\big] \\ +\;& \mathbb{E}\big[\log D_s\big(I^{gt} \mid S^{gt}\big)\big] + \mathbb{E}\big[\log\big(1 - D_s\big(\hat{I} \mid S^{gt}\big)\big)\big]. \end{aligned} \tag{8}$$
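As an illustration of the structure coherence loss, the sketch below conditions each discriminator by channel-wise concatenation and uses a binary cross-entropy (non-saturating) objective on patch logits; the discriminator architecture and loss formulation are assumptions for illustration, not the paper's exact design.

```python
# A sketch of the two conditional discriminators and the generator-side structure loss.
import torch
import torch.nn as nn

class CondPatchDiscriminator(nn.Module):
    def __init__(self, in_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(128, 1, 4, padding=1))               # patch-wise real/fake logits

    def forward(self, x, cond):
        return self.net(torch.cat([x, cond], dim=1))       # condition by concatenation

bce = nn.BCEWithLogitsLoss()

def structure_coherence_g_loss(D_sem, D_tex, fake_img, fake_seg, real_img, real_seg):
    """Generator side: each fake, conditioned on the *real* counterpart, should fool D."""
    logit_img = D_sem(fake_img, real_seg)                  # inpainted image | real semantics
    logit_seg = D_tex(fake_seg, real_img)                  # predicted segmentation | real image
    return bce(logit_img, torch.ones_like(logit_img)) + \
           bce(logit_seg, torch.ones_like(logit_seg))
```

The two discriminators themselves would be trained with the usual real/fake targets on the ground-truth pairs and the mixed pairs, mirroring Eq. (8).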
3.4 Objective Functions
We design appropriate supervised loss terms for learning the inpainting and segmentation tasks at each scale to obtain multi-scale predictions. We adopt the reconstruction loss, the adversarial loss, and the proposed coherence losses to promote the fidelity of the inpainted images. The cross-entropy loss is adopted to ensure the accuracy of the predicted segmentation maps.
Reconstruction Loss. We use the $\ell_1$ loss to encourage per-pixel reconstruction accuracy at all scales:
$$\mathcal{L}_{\mathrm{rec}} = \sum_{i=1}^{5} \big\lVert \hat{I}^i - I^{gt,i} \big\rVert_1, \tag{9}$$
where $i$ indexes the scale, $I^{gt,i}$ denotes the ground-truth image at scale $i$, and five scales are adopted in this work.
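A minimal sketch of this multi-scale $\ell_1$ term, assuming the ground truth is bilinearly downsampled to each prediction's resolution:

```python
# Multi-scale L1 reconstruction loss (Eq. (9)); downsampling mode is an assumption.
import torch.nn.functional as F

def reconstruction_loss(pred_images, gt_image):
    loss = 0.0
    for pred in pred_images:                               # coarse-to-fine predictions
        gt = F.interpolate(gt_image, size=pred.shape[-2:], mode="bilinear",
                           align_corners=False)
        loss = loss + F.l1_loss(pred, gt)
    return loss
```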
Adversarial Loss. We use a multi-scale PatchGAN [30] to classify the global and local patches of an image at different resolutions. The discriminators at all scales are identical; only their inputs are differently scaled versions of the image:
$$\mathcal{L}_{\mathrm{adv}} = \sum_{i} \Big( \mathbb{E}\big[\log D\big(r^i\big)\big] + \mathbb{E}\big[\log\big(1 - D\big(\hat{r}^i\big)\big)\big] \Big), \tag{10}$$
where $D$ is the discriminator, and $r^i$ and $\hat{r}^i$ are the patches in the $i$-th scaled versions of $I^{gt}$ and $\hat{I}$, respectively.
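A possible generator-side computation of this multi-scale term is sketched below; average pooling between scales and the BCE objective are assumptions standing in for the exact PatchGAN formulation.

```python
# Generator-side multi-scale adversarial loss (Eq. (10)), assumed BCE objective.
import torch
import torch.nn as nn
import torch.nn.functional as F

bce = nn.BCEWithLogitsLoss()

def multiscale_adv_g_loss(discriminators, fake_img, num_scales=3):
    loss, x = 0.0, fake_img
    for D in discriminators[:num_scales]:                  # same architecture at every scale
        logits = D(x)                                      # patch-wise logits at this scale
        loss = loss + bce(logits, torch.ones_like(logits))
        x = F.avg_pool2d(x, kernel_size=2)                 # next, coarser version of the image
    return loss
```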
Cross-Entropy Loss. This loss is used to penalize segmentation errors:
$$\mathcal{L}_{\mathrm{seg}} = -\sum_{i} \sum_{n} \sum_{c=1}^{C} S^{gt,i}_{n,c} \log \hat{S}^{i}_{n,c}, \tag{11}$$
where $n$ is the pixel index for the segmentation map $\hat{S}^i$.
Overall Training Loss. The overall training loss function of our network is defined as the weighted sum of the above losses:
$$\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{rec}} + \lambda_{\mathrm{adv}} \mathcal{L}_{\mathrm{adv}} + \lambda_{\mathrm{seg}} \mathcal{L}_{\mathrm{seg}} + \lambda_{\mathrm{coh}} \mathcal{L}_{\mathrm{coh}}, \tag{12}$$
where $\lambda_{\mathrm{adv}}$, $\lambda_{\mathrm{seg}}$, and $\lambda_{\mathrm{coh}}$ are the weights for the adversarial loss, cross-entropy loss, and coherence losses, respectively, and $\mathcal{L}_{\mathrm{coh}}$ sums the non-local patch coherence loss $\mathcal{L}_{\mathrm{patch}}$ and the structure coherence loss $\mathcal{L}_{\mathrm{struct}}$.
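Putting the pieces together, Eq. (12) could be assembled as follows; the weight values are placeholders rather than the paper's settings, and the per-scale cross-entropy of Eq. (11) is computed inline from logits.

```python
# Assembling the overall objective (Eq. (12)); lambda values are illustrative only.
import torch.nn.functional as F

def total_loss(l_rec, l_adv, seg_terms, l_coh,
               lambda_adv=0.1, lambda_seg=1.0, lambda_coh=1.0):
    """seg_terms: list of (pred_logits, gt_labels) pairs, one per scale."""
    l_seg = sum(F.cross_entropy(logits, labels) for logits, labels in seg_terms)  # Eq. (11)
    return l_rec + lambda_adv * l_adv + lambda_seg * l_seg + lambda_coh * l_coh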
Figure 4: Qualitative comparisons. (a) Input; (b) GatedConv; (c) PEN-Net; (d) EdgeConnect; (e) SGE-Net; (f) Our method; (g) Ground-truth.
4 Results
4.1 Experimental Settings
We evaluate our method on the Outdoor Scenes [31] and Cityscapes datasets. Outdoor Scenes contains 9,900 training images and 300 test images. Cityscapes contains 5,000 street-view images in total. To enrich the training set of Cityscapes, we use the 2,975 images from its training set and the 1,525 images from its test set for training, and test on the 500 images from its validation set. Since the test set lacks human-labeled semantic annotations, we generate annotations for training with the state-of-the-art segmentation model DeepLab [3]. We resize each training image so that its minimal height/width is 256 for Outdoor Scenes and 512 for Cityscapes, and then randomly crop fixed-size sub-images as inputs to our model. Fine segmentation annotations for both datasets are also used for training; Outdoor Scenes and Cityscapes are annotated with 8 and 20 categories, respectively. Please note that the annotations can also be replaced by segmentation maps extracted with state-of-the-art segmentation models.
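As a sketch of this preprocessing, assuming torchvision and a placeholder crop size (the paper's exact crop size is not reproduced here):

```python
# Resize the shorter side, then take a random crop; crop_size is a placeholder.
from torchvision import transforms

def build_transform(min_side=256, crop_size=256):
    return transforms.Compose([
        transforms.Resize(min_side),          # shorter side scaled to `min_side`
        transforms.RandomCrop(crop_size),     # random sub-image fed to the model
        transforms.ToTensor(),
    ])
```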
We compare our method with the following five learning-based inpainting methods: 1) GatedConv [40]: gated convolution with contextual attention for leveraging the surrounding textures and structures. 2) EdgeConnect [19]: a two-stage inpainting framework with edges as low-level structural information. 3) PEN-Net [42]: cross-layer attention transfer and pyramid filling in a multi-scale framework. 4) SPG-Net [26]: two-step inpainting with a semantic segmentation map as high-level structural information. 5) SGE-Net [15]: semantic guidance for inpainting based on spatially adaptive normalization [21].
Table 2: Quantitative comparisons on Outdoor Scenes and Cityscapes for center holes and irregular holes (PSNR/SSIM: higher is better; FID: lower is better).

*Outdoor Scenes*

| Method | PSNR (center) | SSIM (center) | FID (center) | PSNR (irregular) | SSIM (irregular) | FID (irregular) |
|---|---|---|---|---|---|---|
| GatedConv | 19.06 | 0.73 | 42.34 | 18.47 | 0.74 | 44.15 |
| PEN-Net | 18.58 | 0.75 | 44.12 | 17.56 | 0.69 | 48.95 |
| EdgeConnect | 19.32 | 0.76 | 41.25 | 19.12 | 0.74 | 42.27 |
| SPG-Net | 18.04 | 0.70 | 45.31 | 17.85 | 0.74 | 50.03 |
| SGE-Net | 20.53 | 0.81 | 40.67 | 19.46 | 0.76 | 39.14 |
| Ours | 21.18 | 0.81 | 38.15 | 20.31 | 0.80 | 36.74 |

*Cityscapes*

| Method | PSNR (center) | SSIM (center) | FID (center) | PSNR (irregular) | SSIM (irregular) | FID (irregular) |
|---|---|---|---|---|---|---|
| GatedConv | 21.13 | 0.74 | 20.03 | 17.13 | 0.67 | 43.14 |
| PEN-Net | 20.48 | 0.72 | 22.34 | 16.37 | 0.66 | 47.87 |
| EdgeConnect | 21.71 | 0.76 | 19.87 | 17.63 | 0.72 | 39.04 |
| SPG-Net | 20.14 | 0.71 | 23.21 | 16.41 | 0.67 | 43.63 |
| SGE-Net | 23.41 | 0.85 | 18.67 | 17.78 | 0.74 | 41.45 |
| Ours | 23.89 | 0.84 | 18.14 | 17.86 | 0.76 | 38.18 |
4.2 Qualitative Comparisons
Figure 4 shows qualitative comparisons of our method with all the baselines. The corrupted area is simulated by sampling a central hole or by randomly placing multiple irregular masks following [40]. As shown in the figure, the baselines usually suffer from artifacts and unsatisfactory boundaries when completing complex holes. GatedConv and PEN-Net adopt contextual attention to bring in features of the known region, but they often distort structures when referencing incorrect semantic textures from the surroundings, especially for complex holes. EdgeConnect and SGE-Net are able to recover correct structures owing to the use of structural priors. However, EdgeConnect may generate mixed edges, making it difficult to generate correct textures, whereas the textures of SGE-Net are often over-smoothed for lack of texture refinement. In contrast, our method generates more realistic textures and better boundaries delineating semantic regions than all the baselines, thanks to the coherence priors between semantics and textures.
4.3 Quantitative Comparisons
Table 2 shows the quantitative comparisons on Outdoor Scenes and Cityscapes datasets based on three quality metrics: Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index (SSIM) and Fréchet Inception Distance (FID) [8]. In general, the proposed method achieves significantly better objective scores than the baselines, especially in PSNR and SSIM.
4.4 User Study
We randomly select 100 images from the two datasets (50 from Outdoor Scenes and 50 from Cityscapes) and invite 20 subjects with image processing expertise to rank the subjective visual quality of images inpainted by the six methods (GatedConv, PEN-Net, EdgeConnect, SPG-Net, SGE-Net, and our method). They are not informed of any mask information. For each test image, the six inpainting results are presented in a random order, and each subject is asked to rank the six methods from the best to the worst. The results show that our method receives 51.8% of the favorite votes (i.e., ranked top-1 in 1,036 out of 2,000 comparisons), compared with 19.3% for SGE-Net, 11.3% for EdgeConnect, 7.8% for GatedConv, 5.6% for PEN-Net, and 4.2% for SPG-Net. Hence, our method outperforms the other methods.
4.5 Ablation Studies
4.5.1 Effectiveness of SWAP
We verify the effectiveness of SWAP by comparing it with the contextual attention module from GatedConv [40]. To show the difference, we highlight the location of the best-matched patch for a patch in the missing area. As shown in Figure 5, since the existing attention module refers to the whole known region without any semantic guidance, it often matches wrong texture patches to the missing area, leading to ambiguous textures. In contrast, benefiting from SWAP, our method matches patches within the same semantic class, which effectively improves the fidelity of the matched reference textures and thus generates more realistic textures.
Table 3: Ablation study of SWAP and the coherence losses (Co-Loss) on Outdoor Scenes.

| Variant | SWAP | Co-Loss | PSNR | SSIM | FID |
|---|---|---|---|---|---|
| Ours (Base) | ✗ | ✗ | 17.43 | 0.65 | 57.31 |
| Ours (Att) | ✗ | ✗ | 19.77 | 0.76 | 43.54 |
| Ours (SWAP) | ✓ | ✗ | 20.58 | 0.79 | 39.46 |
| Ours (Full) | ✓ | ✓ | 21.18 | 0.81 | 38.15 |
4.5.2 Edge vs. Semantic Segmentation
Our work assumes that semantic segmentation labels offer tighter clues to textures than edges. To validate this, we compare the reconstructed structures of EdgeConnect and our model in Figure 6. We find that the edges of different objects may be mixed up in the edge maps, causing EdgeConnect to fill in incorrect texture details in some missing areas. In contrast, the semantic segmentation labels inferred by our model help delineate the layout of images well, and the semantics can guide the filling of textures, resulting in more photo-realistic results.
4.5.3 Comparison of Segmentation Accuracy
Table 4: Comparison of segmentation accuracy (mIoU).

| Method | Outdoor Scenes | Cityscapes |
|---|---|---|
| SPG-Net | 0.51 | 0.39 |
| SGE-Net | 0.68 | 0.53 |
| Ours | 0.71 | 0.57 |
To further validate the coherence priors between semantics and textures, we also conduct experiments to compare the segmentation maps generated by SPG-Net, SGE-Net, and our method. Owing to the alternating optimization of the inpainting and segmentation tasks, our method generates high-quality segmentation maps, which in turn improve the inpainting results. Table 4 shows that our method outperforms SPG-Net and SGE-Net in semantic segmentation. The single-pass process of SPG-Net struggles to generate reliable semantic labels for large missing areas, while SGE-Net does not explicitly exploit the interaction between segmentation and inpainting.
4.5.4 Performance Gains with SWAP and Coherence Losses
In our method, the two core components, SWAP and the coherence losses, are devised to improve the inpainting performance. To investigate their effectiveness, we conduct an ablation study on four variants: a) Ours (Base), with only joint optimization of inpainting and segmentation in a multi-scale framework; b) Ours (Att), adopting the attention module of [39], which measures texture affinity across all semantics; c) Ours (SWAP), with SWAP; d) Ours (Full), with both SWAP and the coherence losses.
The visual and numeric comparisons on Outdoor Scenes are shown in Figure 7 and Table 3, respectively. In general, the inpainting performance increases with the number of added components. Specifically, the joint framework helps learn a more accurate scene layout, and contextual attention does a good job of generating detailed content. Our SWAP identifies more relevant textures thanks to the predicted semantics. Moreover, the coherence losses further improve the texture details of the inpainted regions.
4.5.5 Additional Results on Places2
We also conduct a performance evaluation on the Places2 dataset [45] without semantic annotations, which was used in the assessments of both GatedConv and EdgeConnect. We use our model trained on Outdoor Scenes to complete Places2 images with similar scenes. The subjective results in Figure 8 show that our model is still able to generate proper semantic structures and textures, owing to the supervision of the coherence losses, which provide better prior knowledge about the scenes.
5 Conclusion
We proposed a novel joint optimization framework of semantic segmentation and image inpainting that exploits the coherence priors existing between semantics and textures to solve the complex-hole inpainting problem. To address irrelevant texture filling, we proposed a semantic-wise attention propagation module to optimize the predicted textures with reference to the same semantic region, and two coherence losses to constrain the consistency between semantics and textures in the same image. Experimental results demonstrate that our method can effectively generate promising semantic structures and texture details.
References
- [1] Connelly Barnes, Eli Shechtman, Adam Finkelstein, and Dan B. Goldman. PatchMatch: a randomized correspondence algorithm for structural image editing. ACM Trans. Graph., 28(3):24, 2009.
- [2] Marcelo Bertalmio, Guillermo Sapiro, Vincent Caselles, and Coloma Ballester. Image inpainting. In Proc. ACM SIGGRAPH, pages 417–424, 2000.
- [3] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proc. ECCV, 2018.
- [4] Qifeng Chen and Vladlen Koltun. Photographic image synthesis with cascaded refinement networks. In Proc. ICCV, pages 1520–1529, 2017.
- [5] Antonio Criminisi, Patrick Pérez, and Kentaro Toyama. Region filling and object removal by exemplar-based image inpainting. IEEE Trans. Image Process., 13(9):1200–1212, 2004.
- [6] Alexey Dosovitskiy and Thomas Brox. Generating images with perceptual similarity metrics based on deep networks. In Proc. NeurIPS, 2016.
- [7] Robert Geirhos, Patricia Rubisch, Claudio Michaelis, Matthias Bethge, Felix A. Wichmann, and Wieland Brendel. Imagenet-trained cnns are biased towards texture; increasing shape bias improves accuracy and robustness. In Proc. ICLR, 2019.
- [8] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Proc. NeurIPS, 2017.
- [9] Seunghoon Hong, Xinchen Yan, Thomas E. Huang, and Honglak Lee. Learning hierarchical semantic image manipulation through structured representations. In Proc. NeurIPS, pages 2713–2723, 2018.
- [10] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-image translation with conditional adversarial networks. In Proc. CVPR, pages 5967–5976, 2017.
- [11] Christoph Kamann and Carsten Rother. Benchmarking the robustness of semantic segmentation models. In Proc. CVPR, pages 8825–8835. IEEE, 2020.
- [12] Jingyuan Li, Fengxiang He, Lefei Zhang, Bo Du, and Dacheng Tao. Progressive reconstruction of visual structure for image inpainting. In Proc. ICCV, pages 5961–5970. IEEE, 2019.
- [13] Jingyuan Li, Ning Wang, Lefei Zhang, Bo Du, and Dacheng Tao. Recurrent feature reasoning for image inpainting. In Proc. CVPR, pages 7757–7765, 2020.
- [14] Liang Liao, Ruimin Hu, Jing Xiao, and Zhongyuan Wang. Edge-aware context encoder for image inpainting. In Proc. IEEE ICASSP, 2018.
- [15] Liang Liao, Jing Xiao, Zheng Wang, Chia-Wen Lin, and Shin’ichi Satoh. Guidance and evaluation: Semantic-aware image inpainting for mixed scenes. In Proc. ECCV, 2020.
- [16] Ding Liu, Bihan Wen, Jianbo Jiao, Xianming Liu, Zhangyang Wang, and Thomas S. Huang. Connecting image denoising and high-level vision tasks via deep learning. IEEE Trans. Image Process., 29:3695–3706, 2020.
- [17] Hongyu Liu, Bin Jiang, Yi Xiao, and Chao Yang. Coherent semantic attention for image inpainting. arXiv preprint arXiv:1905.12384, 2019.
- [18] Zhuoqi Ma, Jie Li, Nannan Wang, and Xinbo Gao. Semantic-related image style transfer with dual-consistency loss. Neurocomputing, 406:135–149, 2020.
- [19] Kamyar Nazeri, Eric Ng, Tony Joseph, Faisal Qureshi, and Mehran Ebrahimi. EdgeConnect: Generative image inpainting with adversarial edge learning. arXiv preprint arXiv:1901.00212, 2019.
- [20] Evangelos Ntavelis, Andrés Romero, Iason Kastanis, Luc Van Gool, and Radu Timofte. SESAME: semantic editing of scenes by adding, manipulating or erasing objects. CoRR, abs/2004.04977, 2020.
- [21] Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. Semantic image synthesis with spatially-adaptive normalization. In Proc. CVPR, 2019.
- [22] Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A Efros. Context encoders: Feature learning by inpainting. In Proc. CVPR, 2016.
- [23] Wenqi Ren, Jingang Zhang, Xiangyu Xu, Lin Ma, Xiaochun Cao, Gaofeng Meng, and Wei Liu. Deep video dehazing with semantic segmentation. IEEE Trans. Image Process., 28(4):1895–1908, 2019.
- [24] Yurui Ren, Xiaoming Yu, Ruonan Zhang, Thomas H Li, Shan Liu, and Ge Li. StructureFlow: Image inpainting via structure-aware appearance flow. In Proc. ICCV, 2019.
- [25] Yuhang Song, Chao Yang, Zhe Lin, Xiaofeng Liu, Qin Huang, Hao Li, and C-C Jay Kuo. Contextual-based image inpainting: Infer, match, and translate. In Proc. ECCV, 2018.
- [26] Yuhang Song, Chao Yang, Yeji Shen, Peng Wang, Qin Huang, and C-C Jay Kuo. SPG-Net: Segmentation prediction and guidance network for image inpainting. arXiv preprint arXiv:1805.03356, 2018.
- [27] Hao Tang, Dan Xu, Yan Yan, Philip H. S. Torr, and Nicu Sebe. Local class-specific and global image-level generative adversarial networks for semantic-guided scene generation. In Proc. CVPR, pages 7867–7876, 2020.
- [28] Matteo Tomei, Marcella Cornia, Lorenzo Baraldi, and Rita Cucchiara. Art2real: Unfolding the reality of artworks via semantically-aware image-to-image translation. In Proc. CVPR, pages 5849–5859, 2019.
- [29] Chaoyue Wang, Chang Xu, Chaohui Wang, and Dacheng Tao. Perceptual adversarial networks for image-to-image transformation. IEEE Trans. Image Process., 27(8):4066–4079, 2018.
- [30] Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. High-resolution image synthesis and semantic manipulation with conditional gans. In Proc. CVPR, pages 8798–8807, 2018.
- [31] Xintao Wang, Ke Yu, Chao Dong, and Chen Change Loy. Recovering realistic texture in image super-resolution by deep spatial feature transform. In Proc. CVPR, 2018.
- [32] Yi Wang, Xin Tao, Xiaojuan Qi, Xiaoyong Shen, and Jiaya Jia. Image inpainting via generative multi-column convolutional neural networks. In Proc. NeurIPS, pages 329–338, 2018.
- [33] Jing Xiao, Liang Liao, Qiegen Liu, and Ruimin Hu. CISI-Net: Explicit latent content inference and imitated style rendering for image inpainting. In Proc. AAAI, 2019.
- [34] Chaohao Xie, Shaohui Liu, Chao Li, Ming-Ming Cheng, Wangmeng Zuo, Xiao Liu, Shilei Wen, and Errui Ding. Image inpainting with learnable bidirectional attention maps. In Proc. ICCV, pages 8857–8866, 2019.
- [35] Wei Xiong, Jiahui Yu, Zhe Lin, Jimei Yang, Xin Lu, Connelly Barnes, and Jiebo Luo. Foreground-aware image inpainting. In Proc. CVPR, 2019.
- [36] Zhaoyi Yan, Xiaoming Li, Mu Li, Wangmeng Zuo, and Shiguang Shan. Shift-Net: Image inpainting via deep feature rearrangement. In Proc. ECCV, 2018.
- [37] Chao Yang, Xin Lu, Zhe Lin, Eli Shechtman, Oliver Wang, and Hao Li. High-resolution image inpainting using multi-scale neural patch synthesis. In Proc. CVPR, 2017.
- [38] Jie Yang, Zhiquan Qi, and Yong Shi. Learning to incorporate structure knowledge for image inpainting. In Proc. AAAI, pages 12605–12612, 2020.
- [39] Jiahui Yu, Zhe Lin, Jimei Yang, Xiaohui Shen, Xin Lu, and Thomas S Huang. Generative image inpainting with contextual attention. In Proc. CVPR, 2018.
- [40] Jiahui Yu, Zhe Lin, Jimei Yang, Xiaohui Shen, Xin Lu, and Thomas S Huang. Free-form image inpainting with gated convolution. In Proc. ICCV, 2019.
- [41] Tao Yu, Zongyu Guo, Xin Jin, Shilin Wu, Zhibo Chen, Weiping Li, Zhizheng Zhang, and Sen Liu. Region normalization for image inpainting. In Proc. AAAI, pages 12733–12740, 2020.
- [42] Yanhong Zeng, Jianlong Fu, Hongyang Chao, and Baining Guo. Learning pyramid-context encoder network for high-quality image inpainting. In Proc. CVPR, 2019.
- [43] Yu Zeng, Zhe Lin, Jimei Yang, Jianming Zhang, Eli Shechtman, and Huchuan Lu. High-resolution image inpainting with iterative confidence feedback and guided upsampling. In Proc. ECCV, pages 1–17, 2020.
- [44] S. Zhang, R. He, Z. Sun, and T. Tan. Demeshnet: Blind face inpainting for deep meshface verification. IEEE Trans. Inf. Forensics Security, 13(3):637–647, 2018.
- [45] Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba. Places: A 10 million image database for scene recognition. IEEE Trans. Pattern Anal. Mach. Intell., 40(6):1452–1464, 2017.