ReGO: Reference-Guided Outpainting for
Scenery Image

Yaxiong Wang, Yunchao Wei, Xueming Qian, Li Zhu and Yi Yang. Y. Wang is with the School of Software Engineering, Xi’an Jiaotong University, Xi’an, 710049, China. E-mail: [email protected]. Y. Wei is with the Institute of Information Science, Beijing Jiaotong University, Beijing, 100000, China. E-mail: [email protected]. X. Qian is with the Key Laboratory for Intelligent Networks and Network Security, Ministry of Education, Xi’an Jiaotong University, Xi’an 710049, China, also with the SMILES Laboratory, Xi’an Jiaotong University, Xi’an 710049, China, and also with Zhibian Technology Co. Ltd., Taizhou 317000, China. E-mail: [email protected]. L. Zhu is with the School of Software, Xi’an Jiaotong University, Xi’an 710049, China. E-mail: [email protected]. Y. Yang is with the School of Computer Science and Technology, Zhejiang University, Hangzhou, 310000, China. E-mail: [email protected].
Abstract

We aim to tackle the challenging yet practical scenery image outpainting task in this work. Recently, generative adversarial learning has significantly advanced image outpainting by producing semantically consistent content for the given image. However, existing methods often suffer from blurry textures and artifacts in the generated part, making the overall outpainting results lack authenticity. To overcome this weakness, this work investigates a principled way to synthesize texture-rich results by borrowing pixels from the neighbors of the input (i.e., reference images), named Reference-Guided Outpainting (ReGO). In particular, ReGO designs an Adaptive Content Selection (ACS) module to transfer pixels from reference images to compensate the texture of the target image. To prevent the style of the generated part from being affected by the reference images, a style ranking loss is further proposed to encourage ReGO to synthesize style-consistent results. Extensive experiments on three scenery benchmarks, NS6K [3], NS8K [9] and SUN Attribute [25], demonstrate the effectiveness of our ReGO. Our code is available at https://github.com/wangyxxjtu/ReGO-Pytorch.

Index Terms:
Image outpainting, GAN, Generative model, Adversarial learning.

I Introduction

Given an input image, image outpainting aims at generating plausible visual content outside the image boundary. Traditional approaches [4, 5, 6, 7] employ a simple search-and-stitch pipeline: extrapolation is achieved by stitching retrieved image patches onto the input image. However, these solutions are too inflexible to meet practical requirements. Recently, inspired by the success of Generative Adversarial Networks (GANs) [16, 19], researchers have made efforts to synthesize the unseen content for the input image by adversarial learning [1, 2, 3, 8, 45]. For example, Yang et al. [3] propose a framework that recurrently predicts new content for the given image patch. In [2], Teterwak et al. translate the input image to a larger picture with new content beyond the boundary. Wang et al. [9] take image outpainting one step forward by introducing sketch clues to control the synthesis procedure.

Figure 1: Comparisons of sketch-guided image outpainting: (a) Input, (b) Yang et al. [3], (c) Teterwak et al. [2], (d) Wang et al. [9], (e) Ours, (f) Original image. All methods except ours suffer from the lack of texture details and blurry boundaries of different semantic regions. The dashed boxes indicate the blurry regions.
Figure 2: Outpainting examples with (w/) and without (w/o) reference images: (a) Inputs, (b) Reference Images, (c) w/o references, (d) w/ references, (e) Groundtruth. Although the model without reference images can predict reasonable pixels for the inputs, its outpainting results lack textural details. Since neighbor images share many pixels with the image to be extended, by borrowing valuable pixels from the reference images the model can synthesize outpainting results with rich texture.

Although existing methods can produce coherent content for the given image patch, the results are still not satisfactory due to the lack of texture details. In Fig. 1(b)-1(d), we visualize the outpainting results produced by current state-of-the-art methods [3, 2, 9]. Generally, these methods can successfully synthesize the desired images matching the guiding sketches. However, a closer look at the details reveals many poorly generated regions, such as areas with few texture details and blurry boundaries between different semantic regions. As a consequence, the overall outpainting results are not authentic enough.

Intuitively, a landscape photo usually has a layout and appearance similar to other photos of the same scene. As shown in the top row of Fig. 2, both the input patch and the reference image show a sunset-related scene, and the reference image contains many valuable pixels that can help synthesize high-quality content for the input patch. Therefore, if we can successfully transfer knowledge from similar photos to complement the textural details of the predicted content, the authenticity of the generated part may be significantly improved. Straightforwardly, the input image itself is a natural choice for serving as the reference, since it often contains pixels that are content-consistent with the outpainting part. However, simply adopting the input image as the reference often limits the diversity of sketch layouts and content patterns of the outpainting part, leading to poor generalization ability, especially for free-form outpainting.

Motivated by the above observations and considerations, in this work we investigate a principled way to synthesize detailed outpainting results by taking the pixels from the neighbors (i.e., reference images) of the given input as guidance, named Reference-Guided Outpainting (ReGO). Since the reference images inevitably include some inconsistent content, simply performing the transfer without adaptive filtering will introduce abrupt pixels and reduce the quality of the generated part accordingly. Thus, the main challenge of ReGO lies in how to appropriately transfer pixels from neighbors while keeping the style consistent with the input image.

To this end, an Adaptive Content Selection (ACS) module is first proposed to augment our ReGO. Concretely, an image-guided convolution is first conducted on the reference image to select compensatory features, and two feature fusion blocks follow for guiding-sketch fusion and boundary stitching, respectively. With the ACS module, our ReGO can effectively filter out abrupt or profitless content and adopt only the useful features to synthesize texture-rich results. Besides, the introduced reference image is only responsible for contributing content to enrich the final outpainting results, while the style of the synthesized part should stay consistent with the input instead of being affected by the reference. To achieve style consistency, we further utilize a hinge-based ranking loss to pull the style of the generated part close to the input image and prevent it from biasing toward the reference image. In particular, the generated part and the input patch are treated as the positive pair, while the reference image is regarded as the negative sample; a triplet loss is then employed to constrain their style representations. The style ranking loss helps our system avoid abrupt pixels and synthesize results with a smooth style. We perform experiments on three popular benchmarks, NS6K [3], NS8K [9], and SUN Attribute [25]. Extensive quantitative and qualitative comparisons demonstrate the superiority of our ReGO over other state-of-the-art approaches.

In sum, we highlight the contributions of this work as follows:

  • We propose an effective outpainting method named ReGO, which can transfer content-consistent pixels from reference images to synthesize texture-rich results.

  • A novel adaptive content selection module is designed to pick up beneficial features from the reference image, enabling ReGO to filter out abrupt pixels and generate semantic-consistent results.

  • A style ranking loss is proposed to constrain the style of the synthesized part, enabling the system to synthesize style-consistent results.

  • State-of-the-art performance on both random outpainting and sketch-guided outpainting tasks.

Figure 3: Overview of the proposed ReGO. The system takes the left image and the sketch as inputs and synthesizes the new right half for the input image. The overall architecture follows an encoder-decoder paradigm: the encoder compresses the inputs to obtain the hidden feature $F$, and the decoder is responsible for rebuilding the complete image from $F$. Meanwhile, our Adaptive Content Selection (ACS) module can be equipped in each decoder layer: a reference image is first selected from the training samples, and its right half is cut out and fed into the proposed ACS module together with the guiding sketch to replenish the hidden features. As for the guiding sketch, we use the groundtruth sketch of the right half image during training. At the testing stage, the guiding sketch can be manually drawn or borrowed from another image, as shown in Fig. 8.

II Related Work

Image Inpainting. Image inpainting, whose target is to restore missing or corrupted regions in images, has been well explored in recent years [31, 32, 33, 11, 37, 26, 39, 42, 28, 43]. Benefiting from the tremendous success of deep learning and generative adversarial networks (GANs) [16], image inpainting has made great advances. In the early exploratory stage of this task, researchers targeted missing regions with regular shapes [38, 40]; the core idea is to collect information from the surrounding context to restore the missing pixels. For example, Liu et al. develop a novel operation named partial convolution, which iteratively predicts the missing pixels by collecting information from the surrounding content [36]. As the technique developed, the community paid more attention to free-form inpainting problems [33, 40, 41, 31, 36, 44]. In [41], Xie et al. propose a feature re-normalization to adapt to irregular holes. In [31], Guo et al. propose a full-resolution residual network (FRRN) to restore missing pixels with irregular shapes. Compared to inpainting, the missing pixels in the outpainting task are far from the valid content, posing more challenges for restoration.

Image Outpainting. Conventional image outpainting methods follow a search-and-compose pipeline, where candidate patches are first selected from an external library and then stitched with the input image to perform extrapolation [14, 10, 13, 12, 5]. For example, Wang et al. [10] first retrieve candidate images using subgraph matching, and stitch the warped images into the input by smoothing the boundary. Inspired by the success of deep neural networks, researchers have recently attempted to predict new content beyond the boundaries using generative adversarial networks (GANs) [16, 2, 3]. For example, Yang et al. [3] utilize a recurrent neural network [15] to iteratively produce new content; the proposed algorithm can synthesize very long outpainting results. Teterwak et al. [2] develop an encoder-decoder based generator to predict the unseen pixels: the corrupted image is first compressed by the encoder and then restored by the decoder. However, these methods can only predict random content. To address this weakness, conditional image outpainting has been studied recently. Wang et al. [9] develop a network that allows users to guide the final synthesis with free-form sketches. In [8], the authors propose to utilize language and position clues to control the outpainting results. Nevertheless, existing methods still suffer from blurry textures in the results, which motivates us to develop a controllable outpainting framework that can synthesize images with rich texture.

Figure 4: The architecture of our proposed ACS module. The profitable features are first distilled from the reference image and then used to replenish the predicted new features. A sketch fusing block follows to fuse the sketch clues and build a controllable outpainting system. Finally, a seaming block is applied to achieve a smooth transition from the left to the right content.

III Methodology

The Overview. Fig. 3 exhibits the pipeline of our outpainting system, whose architecture follows an encoder-decoder paradigm; the proposed Adaptive Content Selection (ACS) module is plugged into each decoder layer. Particularly, the encoder takes the left half image and the sketch as inputs and outputs the hidden feature map $F$, which is fed into subsequent decoder layers to rebuild the complete image. To synthesize texture-rich results, a reference image is first chosen from the search space (i.e., the training samples) and fed into the ACS module, which distills its content for compensation. Besides, to allow the user to obtain freestyle outpainting, the guiding sketch clue is also integrated to build a flexible system. Fig. 4 shows the details of the ACS module. An image-guided convolution is first employed to distill the beneficial features from the reference image; the selected features are then integrated with the hidden representations of the synthesized part to complement the texture details. In addition, a style ranking loss is designed to encourage the generator to produce style-consistent content.

Hereinafter, we first introduce the data preparation used in this work in subsection III-A, and then present the details of the proposed ACS module and the style ranking loss in subsection III-B and subsection III-C, respectively.

III-A Data Preparation

For an image $I$ from the training set $X$, the corresponding sketch and the reference image are two necessary auxiliary data for our system. The sketch is treated as a conditional clue to guide the synthesis procedure as shown in Fig. 1, while the reference image is responsible for providing rich detailed features for the predicted new content.

To obtain the sketch, we first extract the edge map using the HED edge detector [17], and binarize the generated edge map with a pre-defined threshold (0.6 in our experiments) to acquire the binary sketch $S\in\mathbb{R}^{H\times W\times 1}$. In the training stage, we adopt the original sketches of the missing parts of the training samples as inputs and force the network to restore the ground-truth parts. At the testing stage, users can input manually drawn free-form sketches to synthesize the desired results.
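As a concrete illustration of this step, the following is a minimal sketch of the binarization, assuming `hed_model` is a wrapper around a pre-trained HED detector that returns a soft edge map in [0, 1] (the wrapper itself is hypothetical; only the 0.6 threshold comes from the paper):

```python
import torch

def extract_binary_sketch(image: torch.Tensor, hed_model, threshold: float = 0.6) -> torch.Tensor:
    """Turn an RGB image (1, 3, H, W) into a binary sketch (1, 1, H, W).

    `hed_model` is assumed to return a soft edge map with values in [0, 1];
    the 0.6 threshold follows the setting used in our experiments.
    """
    with torch.no_grad():
        edge_map = hed_model(image)          # (1, 1, H, W), soft edges in [0, 1]
    sketch = (edge_map > threshold).float()  # binarize to {0, 1}
    return sketch
```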

To obtain reference images for the input $I$, we first extract feature representations using a pre-trained Places365 [18] model, and then apply the cosine similarity to identify similar samples. Based on our observation, the visual neighbors often share similar content with the target image and contain more beneficial pixels, so they are reasonable choices as reference images. In practice, we select multiple neighbors for each sample and randomly pick one in each training iteration as the reference image $G$. Besides, since we only need to replenish the details of the synthesized part, only the right half of the reference image is considered for further processing, as shown in Fig. 4.
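The neighbor search described above might be sketched as follows; the (N, D) feature matrix is assumed to come from a Places365-pretrained backbone, and the function names are illustrative rather than part of the released code:

```python
import random
import torch
import torch.nn.functional as F

def build_reference_pool(features: torch.Tensor, num_neighbors: int = 5) -> torch.Tensor:
    """Return the indices of the top-k visual neighbors for every training image.

    `features`: (N, D) descriptors, assumed to be pooled activations of a
    Places365-pretrained model. Cosine similarity ranks the neighbors.
    """
    feats = F.normalize(features, dim=1)            # unit norm -> dot product = cosine similarity
    sim = feats @ feats.t()                         # (N, N) similarity matrix
    sim.fill_diagonal_(-1.0)                        # exclude the image itself
    return sim.topk(num_neighbors, dim=1).indices   # (N, k) neighbor indices

def sample_reference(neighbor_ids: torch.Tensor, index: int) -> int:
    """Randomly pick one stored neighbor as this iteration's reference image G."""
    return int(random.choice(neighbor_ids[index].tolist()))
```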

III-B ACS Module

With the reference image and the guiding sketch, the responsibilities of our ACS module are two-fold, i.e., compensating the texture details of the synthesized new content and fusing the conditional information from the guiding sketch. Consequently, an image-guided convolution is designed for detail enriching, and a sketch fusing block is employed to integrate the sketch clue. Additionally, we further apply a seaming block to smooth the boundary between the original left features and the predicted new representation. The details of each block are presented in the following.

Image-Guided Convolution. The proposed Image-Guided Convolution (IGConv) aims to help the network complement texture details for the new content using the beneficial features distilled from the reference image. Intuitively, the left part of the input could directly serve as the reference image. However, based on our experience, such a strategy often results in an insufficiently diverse training set. As a consequence, the trained model overfits to the original sketch layouts and content patterns and fails to generalize to freestyle outpainting. Therefore, in our Image-Guided Convolution, we first search for multiple neighbors of the input image and randomly select one neighbor in each iteration as the reference. Such a training fashion allows the model to see diverse input-reference pairs, so the generality of the final model is boosted accordingly.

Formally, let $F_L\in\mathbb{R}^{h\times w\times c}$ be the features encoded from the image to be extended, and let $F_R$ represent the hidden features of the predicted new content. $F_L$ and $F_R$ form the complete hidden features of the overall image $F\in\mathbb{R}^{h\times 2w\times c}$. The features of $G$, which are encoded by a reference image encoder, are denoted as $F^G\in\mathbb{R}^{h\times w\times c}$. The designed image-guided convolution aims to complement $F_R$ by extracting helpful information from the reference features $F^G$. A group of dynamic filters is conditionally produced based on the features of the input and reference images, making the network adaptively collect beneficial content from the reference image. To be specific, a dynamic kernel is produced based on the concatenation of $F^G$ and $F_L$ via a simple feed-forward procedure:

$k=\Psi(F^{G},F_{L}),$ (1)

where $k\in\mathbb{R}^{3\times 3\times c}$, with 3 and $c$ indicating the kernel size and the channel number, respectively. $\Psi$ is modeled as a neural network that takes the features of the reference and the input and repeatedly applies convolution with stride 2, batch normalization and ReLU (ReLU is not applied in the last layer) to obtain the kernel.
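A possible form of $\Psi$, read directly off the description above (stride-2 convolutions with batch normalization and ReLU, no ReLU on the last layer); the layer count, widths, and the final pooling to a 3×3 map are assumptions, since the paper does not specify them:

```python
import torch
import torch.nn as nn

class KernelGenerator(nn.Module):
    """Produce a dynamic 3 x 3 x c kernel from the reference and input features (Eq. 1)."""

    def __init__(self, channels: int, num_layers: int = 3):
        super().__init__()
        layers = []
        in_ch = 2 * channels                       # concatenation of F^G and F_L
        for i in range(num_layers):
            layers.append(nn.Conv2d(in_ch, channels, kernel_size=3, stride=2, padding=1))
            if i < num_layers - 1:                 # no ReLU (or BN) after the last layer
                layers += [nn.BatchNorm2d(channels), nn.ReLU(inplace=True)]
            in_ch = channels
        self.body = nn.Sequential(*layers)
        self.pool = nn.AdaptiveAvgPool2d(3)        # assumption: pool down to a 3 x 3 spatial map

    def forward(self, f_ref: torch.Tensor, f_left: torch.Tensor) -> torch.Tensor:
        x = torch.cat([f_ref, f_left], dim=1)      # (B, 2c, h, w)
        k = self.pool(self.body(x))                # (B, c, 3, 3)
        return k.permute(0, 2, 3, 1)               # (B, 3, 3, c), matching the shape of k in Eq. 1
```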

The dynamic kernel in Eq. 1 aims to provide guidance for distilling the content of the reference image. To adaptively pick up the beneficial pixels and restrain the unhelpful content, we conduct a channel-wise normalization to update the dynamic kernel:

$k^{n}_{ijk}=\frac{\exp(k_{ijk})}{\sum_{h}\exp(k_{ijh})}.$ (2)

Thus, the distilled $i$-th channel map can be obtained as follows:

$\widetilde{F}^{G}_{:,:,i}=F^{G}*P(k^{n}_{:,:,i}),\quad i=1,2,...,c,$ (3)

where $*$ denotes the convolution operation, and $P(\cdot)$ is an operation that repeats $k^{n}_{:,:,i}$ $c$ times along the channel dimension. All of the $\widetilde{F}^{G}_{:,:,i}|^{c}_{i=1}$ are channel-wise stacked to get the distilled feature map $\widetilde{F}^{G}\in\mathbb{R}^{h\times w\times c}$.

The dynamic kernel in Eq. 2 focuses on automatically highlighting the beneficial features and suppressing the profitless content: via the softmax operation, we expect the network to assign sufficiently low weights to the profitless features while giving higher weights to the helpful content. The subsequent distilled convolution in Eq. 3 attempts to greedily summarize the profitable semantic regions across feature channels from each dynamic kernel's perspective. With such a procedure, the helpful features from the reference image are effectively aggregated into the map $\widetilde{F}^{G}$, which is further added to the feature $F_R$ to achieve the compensatory purpose: $F_{R}^{*}=F_{R}+\widetilde{F}^{G}$.

Besides the reference image, the input image itself can also provide valuable pixels, because the synthesized part contains the same objects as the input image with high probability. Therefore, the features from the input image are also integrated, and $F_{R}^{*}$ is updated as follows:

$F_{R}^{*}=F_{R}+\widetilde{F}^{G}+F_{L}\times\sigma(\rho(F_{L})\times F_{R}),$ (4)

where $\rho(\cdot)$ is the horizontal flip operation and $\sigma(\cdot)$ denotes the sigmoid function, which is introduced to learn a dynamic feature selection mechanism.
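Putting Eqs. 2-4 together, the image-guided convolution might be implemented as in the sketch below, where `KernelGenerator` is the hypothetical $\Psi$ network sketched above, the per-sample loop is kept only for clarity, and the $\times$ in Eq. 4 is interpreted as an element-wise product:

```python
import torch
import torch.nn.functional as F

def image_guided_conv(f_left, f_right, f_ref, kernel_gen):
    """Complement F_R with content distilled from F^G (Eqs. 2-4).

    f_left, f_right, f_ref: (B, c, h, w) feature maps; kernel_gen implements Eq. 1.
    """
    B, c, h, w = f_ref.shape
    k = kernel_gen(f_ref, f_left)                   # (B, 3, 3, c) dynamic kernels
    k = torch.softmax(k, dim=-1)                    # Eq. 2: channel-wise normalization

    distilled = []
    for b in range(B):                              # per-sample dynamic convolution
        # Eq. 3: output channel i uses the 3x3 slice k[..., i] repeated over all
        # c input channels, i.e. weight[i, j] = k[b, :, :, i] for every j.
        weight = k[b].permute(2, 0, 1).unsqueeze(1).expand(c, c, 3, 3).contiguous()
        distilled.append(F.conv2d(f_ref[b:b + 1], weight, padding=1))
    f_ref_distilled = torch.cat(distilled, dim=0)   # \tilde{F}^G, (B, c, h, w)

    # Eq. 4: add the distilled reference content plus a gated copy of the input features.
    gate = torch.sigmoid(torch.flip(f_left, dims=[-1]) * f_right)   # sigma(rho(F_L) x F_R)
    return f_right + f_ref_distilled + f_left * gate                # F_R^*
```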

Sketch Fusing Block. Besides synthesizing the unseen part with rich and realistic details, our ReGO should also be equipped with a practical mechanism, i.e., allowing users to acquire customized outpainting results using their preferred sketches as guidance. To this end, we introduce a controllable sketch fusing block. To make the final results exactly match the guiding sketch, the sketch fusing block additionally integrates the sketch feature to emphasize the desired shape in the restoring procedure, as shown in Fig. 4.

Concretely, only the right half sketch $S^{r}\in\mathbb{R}^{H\times W/2\times 3}$ serves as the guiding clue. Its feature maps $F^{s}$ are first encoded by a sketch encoder $E^{S}$. Then, the compressed sketch features are channel-wise concatenated with the complemented feature $F_{R}^{*}$ and fed through a residual-block-style [21] structure to obtain the fused output $F_{R}^{s}$.
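A rough sketch of this fusing step, assuming `sketch_encoder` and `res_block` stand for the sketch encoder $E^S$ and the residual-block-style fusion layer (their internal architectures are not detailed here):

```python
import torch

def fuse_sketch(f_right_star, right_sketch, sketch_encoder, res_block):
    """Inject the guiding-sketch features into the complemented features F_R^*."""
    f_sketch = sketch_encoder(right_sketch)               # F^s, same spatial size as F_R^*
    fused = torch.cat([f_right_star, f_sketch], dim=1)    # channel-wise concatenation
    return res_block(fused)                               # residual-style fusion -> F_R^s
```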

Seaming Block. Our seaming block is responsible for fusing the raw left half features $F_L$ and the complemented features $F_{R}^{s}$; in effect, it smooths the boundary between the raw features from the input image and the complemented right half features. As shown in Fig. 4, the seaming block consists of two global residual blocks (GRB) [3] and a residual block [21]. We alternately utilize 1×3 and 7×1 convolutions in the GRB to strengthen the connection between the original and predicted regions, especially the boundary between the feature map from the input image and the complemented map of the predicted new content. Particularly, $F_L$ and $F_{R}^{s}$ are first concatenated along the width dimension, and then sequentially fed through two GRBs and a residual block to obtain the output $F^{\prime}$, which is also the final output of our ACS module.
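The seaming step can be sketched as below; the internal design of the GRB is inferred only from the description above (alternating 1×3 and 7×1 convolutions with a skip connection), so it should be read as an assumption rather than the exact block of [3]:

```python
import torch
import torch.nn as nn

class GlobalResidualBlock(nn.Module):
    """Assumed GRB: alternating 1x3 and 7x1 convolutions with a skip connection."""

    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=(1, 3), padding=(0, 1)),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=(7, 1), padding=(3, 0)),
        )

    def forward(self, x):
        return x + self.body(x)

def seam_features(f_left, f_right_s, grb1, grb2, res_block):
    """Concatenate left/right features along the width and smooth the boundary."""
    f_full = torch.cat([f_left, f_right_s], dim=-1)   # width-wise concatenation
    return res_block(grb2(grb1(f_full)))              # two GRBs followed by a residual block
```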

TABLE I: Performance comparisons on three datasets under the IS, FID and MSD criteria, for the sketch-guided and random outpainting tasks. * means the method is modified to perform sketch-guided outpainting as described in subsection IV-A.
Sketch-Guided Outpainting

Method                        | NS6K (IS↑ / FID↓ / MSD↑) | NS8K (IS↑ / FID↓ / MSD↑) | SUN Attribute (IS↑ / FID↓ / MSD↑)
DeepFillv2* [33]              | 2.316 / 16.712 / 0.511   | 3.012 / 15.132 / 0.592   | 9.132 / 27.993 / 0.562
CoModGAN* [35]                | 2.758 / 15.145 / 0.583   | 3.221 / 13.774 / 0.685   | 9.884 / 26.076 / 0.641
LaMa* [34]                    | 3.016 / 13.639 / 0.567   | 3.347 / 12.312 / 0.629   | 10.147 / 24.135 / 0.619
NSIO* [3]                     | 2.891 / 12.87 / 0.649    | 3.254 / 10.813 / 0.837   | 10.391 / 23.021 / 0.801
SGIO [9]                      | 2.920 / 10.998 / 1.01    | 3.321 / 10.390 / 1.17    | 10.126 / 22.536 / 0.792
BDIE* [2]                     | 3.002 / 11.021 / 0.963   | 3.323 / 9.639 / 0.892    | 10.857 / 21.412 / 0.886
$\text{ReGO}_{\text{NSIO}}$   | 2.923 / 12.03 / 0.839    | 3.329 / 10.232 / 0.966   | 10.683 / 21.452 / 0.869
$\text{ReGO}_{\text{SGIO}}$   | 2.924 / 10.104 / 1.201   | 3.387 / 9.787 / 1.293    | 10.542 / 21.012 / 0.894
$\text{ReGO}_{\text{BDIE}}$   | 3.126 / 10.052 / 1.357   | 3.444 / 8.738 / 1.396    | 10.992 / 19.433 / 1.014

Random Outpainting

Method                        | NS6K (IS↑ / FID↓ / MSD↑) | NS8K (IS↑ / FID↓ / MSD↑) | SUN Attribute (IS↑ / FID↓ / MSD↑)
DeepFillv2 [33]               | 2.273 / 18.693 / -       | 2.816 / 18.109 / -       | 8.987 / 29.016 / -
CoModGAN [35]                 | 2.612 / 18.014 / -       | 2.983 / 17.223 / -       | 9.763 / 27.973 / -
LaMa [34]                     | 2.773 / 15.061 / -       | 3.019 / 15.014 / -       | 9.974 / 25.863 / -
NSIO [3]                      | 2.883 / 13.612 / -       | 3.123 / 12.871 / -       | 10.229 / 25.271 / -
SGIO [9]                      | 2.951 / 15.857 / -       | 3.178 / 12.316 / -       | 9.889 / 24.978 / -
BDIE [2]                      | 2.880 / 13.252 / -       | 3.155 / 11.373 / -       | 10.647 / 24.465 / -
$\text{ReGO}_{\text{NSIO}}$   | 2.792 / 14.224 / -       | 2.948 / 13.230 / -       | 10.553 / 26.196 / -
$\text{ReGO}_{\text{SGIO}}$   | 2.848 / 15.396 / -       | 3.204 / 13.748 / -       | 10.438 / 25.441 / -
$\text{ReGO}_{\text{BDIE}}$   | 3.243 / 12.606 / -       | 3.573 / 10.586 / -       | 10.796 / 23.209 / -

III-C Style Ranking Loss

The reference image adopted by ReGO is only expected to contribute texture details; its style should not be reflected in the synthesized content. To reduce artifacts, the model should 1) only transfer the texture details from the reference image, and 2) keep a consistent style between the given part and the generated part. Given the above considerations, we find that a hinge-based ranking loss is well in line with our requirements: we take the synthesized part and the input image as the positive pair and treat the reference image as the negative sample; the hinge ranking loss is then applied to their style representations to enforce the styles of the input and the new content to be closer than that of the reference image.

Figure 5: Visual ablation of each component in our method on (I) image rebuilding and (II) free-form outpainting, where RI, SRL, and ACS indicate the reference image, style ranking loss, and adaptive content selection module, respectively. Part I panels: (a) BDIE [2], (b) +RI, (c) +SRL, (d) +ACS, (e) Groundtruth. Part II panels: (a) Inputs, (b) BDIE [2], (c) +RI, (d) +SRL, (e) +ACS.

Following previous practices [22, 24, 23], we utilize the second-order statistics of convolutional features as the style representation. Particularly, the style features of the generated part $\hat{I}^{r}\in\mathbb{R}^{H\times W/2\times 3}$, which is the right half of the image reconstruction $\hat{I}\in\mathbb{R}^{H\times W\times 3}$, are given by the Gram matrix $R^{d}\in\mathbb{R}^{N_{d}\times N_{d}}$:

$R^{d}_{ij}=\sum_{k}M^{d}_{ik}M^{d}_{jk},$ (5)

where $M^{d}_{i}$ is the vectorised $i$-th feature map in layer $d$ of a convolutional neural network like VGG19 [27], and $N_{d}$ indicates the channel number of layer $d$.

Analogously, the style representations of the reference image and the input image can be extracted, and our style ranking loss is defined as:

$\mathcal{L}^{d}_{s}=[\alpha-SM(R^{d},L^{d})+SM(R^{d},G^{d})]_{+},$ (6)

where $L^{d}$ and $G^{d}$ represent the Gram matrices of the left input and the reference image, respectively, $SM(\cdot,\cdot)$ is the cosine similarity, $\alpha\in\mathbb{R}$ is the scalar margin, and $[\cdot]_{+}=\max(\cdot,0)$.

Figure 6: The IS, FID and MSD tendencies for different reference numbers, where the x-axis indicates the reference number and the y-axis the corresponding values. We scale the FID by a logarithmic function to bring the values of the three criteria closer and exhibit a clearer tendency.
Figure 7: Results for (I) image rebuilding according to the original sketches and (II) random outpainting, where red dashed frames indicate the blurry regions. Panels in both parts: (a) LaMa [34], (b) SGIO [9], (c) BDIE [2], (d) $\text{ReGO}_{\text{BDIE}}$, (e) Groundtruth. In part I, the comparison methods can predict reasonable pixels for the input, but they all suffer from blurry synthesized content, while our method synthesizes texture-rich content. Part II exhibits random outpainting: even though no sketch is provided for guidance, $\text{ReGO}_{\text{BDIE}}$ also synthesizes texture-rich results and performs much better than the methods originally designed for random outpainting, i.e., NSIO [3] and BDIE [2].
Figure 8: Outpainting results according to manually drawn free-form sketches and sketches from other images, where red dashed frames indicate the blurry regions. Part I (outpainting using manually drawn free-form sketches) panels: (a) Inputs, (b) LaMa [34], (c) SGIO [9], (d) BDIE [2], (e) $\text{ReGO}_{\text{BDIE}}$. Part II (outpainting using sketches from other images as guidance) panels: (a) Inputs, (b) Outputs, (c) Inputs, (d) Outputs, (e) Original images. The inputs in (a) and (c) use the same reference images (lower left in (a)) but different sketches: the sketches in (a) are taken directly from the reference images, while the sketches in (c) are extracted from a randomly selected image (shown in the lower left). The corresponding predictions are shown in (b) and (d), respectively. The results are produced by $\text{ReGO}_{\text{BDIE}}$.

Figure 9: Results for multi-direction prediction: (a) Input, (b) Output, (c) Input, (d) Output, (e) Groundtruth. Based on the BDIE backbone, our method $\text{ReGO}_{\text{BDIE}}$ could predict the content for multiple directions.
Figure 10: The high resolution results of all methods: (a) LaMa [34], (b) BDIE [2], (c) $\text{ReGO}_{\text{BDIE}}$, (d) GroundTruth. We input images of 512×384 to rebuild 512×768 images.
Figure 11: Exhibition of failure cases: (a) Inputs, (b) Outputs, (c) Inputs, (d) Outputs. Although the model can synthesize content matching the guiding sketch, it fails to predict reasonable pixels for the desired shape.

By including the feature correlations of multiple layers, the multi-scale style representations are obtained, and the total style loss can be calculated accordingly:

$\mathcal{L}_{s}=\sum_{d\in D}w_{d}\mathcal{L}^{d}_{s},$ (7)

where $D$ is the index collection of the selected activation layers, and $w_{d}$ is the trade-off weight. In our experiments, the activated outputs of layers relu_Y_1 (Y=1,2,3,4,5) of the VGG19 network [27] are taken as the style representation, i.e., $|D|=5$. The designed style ranking loss is added to the generator loss to train the network.
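A minimal PyTorch sketch of Eqs. 5-7 is given below. The helper `vgg_features` is assumed to return the relu_Y_1 activations (Y = 1..5) of a pre-trained VGG19; the margin value used here is a placeholder, while the per-layer weight of 0.2 follows the experimental setting described in Section IV-A:

```python
import torch
import torch.nn.functional as F

def gram_matrix(feat: torch.Tensor) -> torch.Tensor:
    """Eq. 5: Gram matrix of a (B, C, H, W) feature map, flattened over spatial positions."""
    b, c, h, w = feat.shape
    m = feat.reshape(b, c, h * w)
    return m @ m.transpose(1, 2)                       # (B, C, C)

def style_ranking_loss(gen_right, left_input, reference, vgg_features,
                       alpha: float = 0.2, w_d: float = 0.2) -> torch.Tensor:
    """Eqs. 6-7: hinge ranking over multi-scale style representations.

    `vgg_features(x)` is assumed to return a list of relu_Y_1 activations of a
    pre-trained VGG19; `alpha` is a placeholder margin value.
    """
    loss = gen_right.new_zeros(())
    for f_gen, f_left, f_ref in zip(vgg_features(gen_right),
                                    vgg_features(left_input),
                                    vgg_features(reference)):
        r, l, g = gram_matrix(f_gen), gram_matrix(f_left), gram_matrix(f_ref)
        sim_pos = F.cosine_similarity(r.flatten(1), l.flatten(1), dim=1)   # SM(R^d, L^d)
        sim_neg = F.cosine_similarity(r.flatten(1), g.flatten(1), dim=1)   # SM(R^d, G^d)
        loss = loss + w_d * torch.clamp(alpha - sim_pos + sim_neg, min=0).mean()
    return loss
```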

IV Experiment

IV-A Experiment Setup

Dataset. This work focuses on performing image outpainting for scenery images. We conduct extensive experiments on three benchmarks, i.e., NS6K [3], NS8K [9], and SUN Attribute [25], to validate the effectiveness of our ReGO. The NS6K dataset contains 6040 scenery images in total, of which 5040 images are treated as training data while the rest is used for testing [3]. The NS8K dataset, which consists of 8115 images, contains much more diverse scenery images compared to NS6K; 6115 images are taken as training data and the rest is used for testing. The SUN Attribute dataset has 14,340 diverse images from 707 scene categories; we randomly select 80% for training and the remaining 20% for testing.

Implementation Details. Our proposed ReGO provides a model-agnostic solution and can be plugged into many off-the-shelf outpainting models. In this work, we apply ReGO to three state-of-the-art outpainting methods, including NSIO [3], BDIE [2], and SGIO [9], to validate its superiority:

NSIO [3] is originally designed for random content prediction. To achieve sketch-guided outpainting, we make the following modifications to NSIO: the left half sketch is channel-wise concatenated with the input, while the right half sketch is encoded and fed as the initial state of the LSTM [15] decoder to predict the hidden features of the full image. Our ACS module is plugged in after each decoding layer except the last one, and the style ranking loss is weighted by 0.5 and added to the generator loss to train the network.

BDIE [2] is a random-outpainting model as well; the sketch is concatenated with the input to perform sketch-guided outpainting. Besides, the conditional skip connection and the position channels of SGIO [9] are also added to BDIE [2] to build a stronger baseline. The ACS module and the style ranking loss are equipped in an analogous way to NSIO [3].

SGIO [9] is the first attempt at sketch-guided outpainting. The ACS module is also employed after each decoding layer for texture compensation, and the style ranking loss is added to the generator loss with weight 0.5 to ensure style consistency.

In our study, the style ranking loss is added to the generator loss, and the weights of the style ranking loss for the multiple layers are all set to 0.2, i.e., $w_{d}=0.2$. During the training stage, five neighbors are employed in our baseline methods, and the impact of the neighbor number is discussed in the following experiments. At the testing stage, only the most similar neighbor is used to synthesize the outpainting. Besides outpainting models, we also include three state-of-the-art inpainting models for comparison, i.e., DeepFillv2 [33], CoModGAN [35], and LaMa [34]. For LaMa and CoModGAN, we mask the right half of the images and introduce the sketch as an additional channel to train the network, while DeepFillv2 is a sketch-guided inpainting model and is trained by restoring only the right half image. To make a fair comparison, the loss functions, hyperparameters and training details all follow the settings of the original papers. The sketch augmentation strategy [9] is also employed for all methods to enhance free-form outpainting.
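To make the plug-in nature of ReGO concrete, the following is a hedged sketch of how an existing backbone decoder can be wrapped so that the ACS module follows every decoder layer except the last, with the style ranking loss added to the backbone's generator loss at weight 0.5; the decoder interface and per-layer feature handling here are illustrative assumptions rather than the exact code of any backbone:

```python
import torch
import torch.nn as nn

class ReGODecoder(nn.Module):
    """Wrap a backbone decoder so that each layer (except the last) is followed by an ACS module."""

    def __init__(self, decoder_layers: nn.ModuleList, acs_modules: nn.ModuleList):
        super().__init__()
        assert len(acs_modules) == len(decoder_layers) - 1
        self.layers = decoder_layers
        self.acs = acs_modules

    def forward(self, hidden, ref_feats, sketch_feats):
        """`ref_feats` / `sketch_feats`: per-layer reference and sketch features (assumed precomputed)."""
        for i, layer in enumerate(self.layers):
            hidden = layer(hidden)
            if i < len(self.acs):                 # the last decoder layer is left untouched
                hidden = self.acs[i](hidden, ref_feats[i], sketch_feats[i])
        return hidden

def generator_loss(backbone_loss: torch.Tensor,
                   style_rank_loss: torch.Tensor,
                   style_weight: float = 0.5) -> torch.Tensor:
    """Total generator objective: the backbone's own losses plus the weighted style ranking loss."""
    return backbone_loss + style_weight * style_rank_loss
```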

TABLE II: The contributions of each part in our method. RI, ACS, and SRL indicate reference image, ACS module, and style ranking loss respectively.
Backbone  | RI | ACS | SRL | IS↑   | FID↓   | MSD↑
BDIE [2]  |    |     |     | 3.002 | 11.021 | 0.963
BDIE [2]  | ✓  |     |     | 2.918 | 10.991 | 1.034
BDIE [2]  | ✓  | ✓   |     | 3.038 | 10.269 | 1.221
BDIE [2]  | ✓  | ✓   | ✓   | 3.126 | 10.052 | 1.357
TABLE III: The performances of the style ranking loss and the style regression loss.
Method                            | IS↑   | FID↓   | MSD↑
$\text{ReGO}_{\text{BDIE}}$-Reg   | 3.089 | 10.956 | 1.163
$\text{ReGO}_{\text{BDIE}}$       | 3.126 | 10.052 | 1.357

IV-B Evaluation Metric

Following Wang's [9] setting, three metrics, i.e., the Fréchet Inception Distance (FID) [29], the Inception Score (IS) [30] and the Mean Satisfactory Degree (MSD) [9], are employed for evaluation. To evaluate the free-form outpainting results, we randomly select 555 images from the test data and replace the original sketches with manually drawn free-form ones; 89 different types of sketches are collected in total. 20 volunteers are invited to label the free-form outpainting results on three levels: 0-poor, 1-ordinary and 2-good, and the mean value of all labels on the selected images is taken as the MSD, which is used for subjective comparison since no groundtruth is available. Compared to FID and IS, MSD directly reflects the performance in practical situations; it is therefore a critical metric for evaluating the generalization ability on free-form sketches.

IV-C Quantitative Comparison

The performance of both sketch-guided outpainting and random outpainting on the three datasets is reported in Table I, where $\text{ReGO}_{\text{NSIO}}$ denotes the NSIO model equipped with our proposed ReGO. From Table I, we can observe that our proposed ReGO module simultaneously enhances sketch-guided outpainting and random outpainting.

Sketch-Guided Outpainting. Our proposed ReGO module boosts the performance of three state-of-the-art outpainting methods on both NS6K and NS8K. For example, the FID of BDIE [2] is 11.021 on NS6K; when our ReGO module is equipped, the FID reaches 10.052. Besides image restoring according to the original sketches, free-form outpainting is also improved by the ReGO module: the MSD of $\text{ReGO}_{\text{SGIO}}$ reaches 1.201 on NS6K, while the original SGIO only achieves 1.01. Our best performance is achieved based on BDIE [2], reaching 10.052 FID and 1.357 MSD on NS6K. It is clear from Table I that our proposed ReGO module improves both image rebuilding and free-form outpainting on three backbones, which validates the effectiveness of our method.

In our ReGO, the searched neighbors of the input serve as the reference, while the input image itself is also an intuitive choice since it naturally shares the highest similarity with the input. To study the effects of these two types of references, we conduct experiments with the BDIE backbone on the NS6K dataset to investigate the performance difference. Let $\text{ReGO}_{\text{BDIE}}$-SR (Self-Reference) denote the $\text{ReGO}_{\text{BDIE}}$ variant that uses the input as the reference. In our experiments, $\text{ReGO}_{\text{BDIE}}$-SR can also achieve acceptable performance on image rebuilding according to the original sketches; however, it shows poor generality when encountering free-style sketches. For example, the FID of $\text{ReGO}_{\text{BDIE}}$-SR reaches 10.561 on NS6K and surpasses BDIE, but its MSD for free-form outpainting is only 1.012, which is much worse than that of $\text{ReGO}_{\text{BDIE}}$. We suspect that the limited sketch layouts and content patterns cause some overfitting, so the model trained with self-reference cannot generalize well to free-style outpainting. In contrast, when the neighbors serve as the reference, the model sees diverse training pairs; as a result, the trained model performs well on both image rebuilding and free-form outpainting.

Random Outpainting. Besides providing sketches to obtain the desired outpainting, another possible scenario is that users may decline to draw any guiding sketches and only want random results. How would our system perform if no guiding sketches are fed? To validate the effectiveness of our method in such a scenario, we conduct experiments to predict random results and report the performance on all datasets. Since NSIO [3] and BDIE [2] are originally designed for random outpainting, we follow the same pipelines as their original papers [3, 2] to train the networks. As for the inpainting methods CoModGAN [35] and LaMa [34], we directly mask the right half of the image for training. For the sketch-guided systems, we simply set the right half sketch to zeros to conduct random outpainting.

The results are also reported in Table I. We can observe that abandoning the original guiding sketches significantly damages the performance of the sketch-guided outpainting systems. For example, $\text{ReGO}_{\text{SGIO}}$ with guiding sketches reaches 10.104 FID on NS6K, while its FID without the guiding sketches deteriorates to 15.396; this is because these systems are trained with the original sketches. The performance of SGIO with our ReGO module remains comparable with the original SGIO [9]. For the random prediction methods, $\text{ReGO}_{\text{NSIO}}$ performs slightly worse than NSIO [3], while $\text{ReGO}_{\text{BDIE}}$ outperforms the state-of-the-art outpainting (BDIE [2]) and inpainting (LaMa [34]) methods. From Table I, we can see that even when no guiding sketches are provided, the methods with our ReGO module still produce results comparable to the original methods. With the designed ACS module, we can develop a unified framework that simultaneously deals with random prediction and sketch-guided outpainting. Moreover, BDIE with the ReGO module achieves state-of-the-art performance on both tasks.

IV-D Ablation Study

Validate the Components of ReGO. To validate the contributions of each component in our proposed ReGO, we employ BDIE [2] as the backbone and conduct ablations on the NS6K dataset. Quantitative results are reported in Table II.

As shown in Table II, when the reference image is introduced, the FID of BDIE improves from 11.02 to 10.99 and the MSD is also boosted, which reveals that compensating the texture details from the neighbors is a promising idea. However, the reference image alone does not make the performance outstanding enough. With the proposed ACS module adopted, the network can adaptively block the profitless content and distill the beneficial pixels, and we observe a significant performance improvement, with the MSD and FID reaching 1.221 and 10.269, respectively. When the style ranking loss is further equipped to ensure style consistency, the performance improves further. The model with all three parts utilized simultaneously achieves the best performance, and each newly equipped mechanism improves the performance, which validates the contribution of each component.

Fig. 5 exhibits the visual results of the ablation comparison on image rebuilding and free-form outpainting. From Fig. 5, the contribution of each component can be clearly observed.

Validate the Style Ranking Loss. To produce style-consistent results, we design the style ranking loss to prevent the style of the synthesized content from being affected by the reference image. Intuitively, a close style between the synthesized content and the input can also be achieved by directly conducting a regression procedure. In this subsection, we study the impact of these two solutions.

Table III shows the performance under IS, FID, and MSD, where $\text{ReGO}_{\text{BDIE}}$-Reg indicates $\text{ReGO}_{\text{BDIE}}$ using the $l_{2}$ style reconstruction loss instead of the style ranking loss. From Table III, $\text{ReGO}_{\text{BDIE}}$ with the proposed style ranking loss performs much better on both image restoring and free-form outpainting. The reason for this result may stem from overfitting: although the left half and the right half are from the same image, the style representations captured by the Gram matrices are still different, so rigidly enforcing such a regression easily causes overfitting. Besides, the main goal of our style ranking loss is to prevent the style of the reference image from being reflected in the extended content; the style consistency between the input and the synthesized part does not need to be further enhanced, since it can already be achieved by the pixel-wise reconstruction and the adversarial training [2, 3, 9].

Discuss the Number of Reference Images. In our baseline methods, five neighbors are selected for each training sample, and we randomly pick one in each iteration to serve as the reference image. This subsection investigates the impact of the number of reference images.

The performance tendencies for five different reference numbers are shown in Fig. 6, where the FID is scaled by the logarithmic function. As shown in Fig. 6, there are two important observations. First, compared to employing only one reference image, using multiple references enhances the model's generality and trains a more robust generation model: the FID of $\text{ReGO}_{\text{BDIE}}$ with only one reference image is 10.728, and when the reference number increases to 5, the FID improves to 10.052. Second, more reference images do not improve the performance further, which is clear from the performance tendencies when the reference number ranges from 5 to 20. The setting with five reference images achieves the best performance on average.

IV-E Qualitative Results

Image Rebuilding. Fig. 7 provides visualizations of the rebuilding results according to the original sketches and of random outpainting for the SOTA inpainting and outpainting models. To ease the visual exhibition and save space, we only exhibit the outpainting results of our best model $\text{ReGO}_{\text{BDIE}}$. It can be observed that the results of $\text{ReGO}_{\text{BDIE}}$ are more authentic and natural due to their richer textural details.

From Fig. 7 I, the comparison methods, LaMa [34], SGIO [9] and BDIE [2], can extend reasonable pixels for the input image, but the predicted content is blurry and lacks textural details, which makes the overall image not authentic enough, while $\text{ReGO}_{\text{BDIE}}$ produces texture-rich outpainting results. The results of random prediction are exhibited in Fig. 7 II: compared to the competing methods, $\text{ReGO}_{\text{BDIE}}$ also successfully synthesizes results with more textural details when no sketches are fed, and the synthetic images are even more satisfactory than those of the method originally designed for random prediction, i.e., BDIE [2]. From Fig. 7, we can see that the guiding sketch is not a requisite input for our system; when users do not provide a guiding sketch, our system can still produce satisfactory random outpainting results.

Free-form Outpainting. The comparison for free-form outpainting is exhibited in Fig. 8 I. $\text{ReGO}_{\text{BDIE}}$ not only synthesizes the expected content matching the guiding sketch but also achieves authentic and natural results; in particular, the boundaries of different semantic regions are much clearer than those of the competing methods. Additionally, we find that the reference image can also help fill reasonable pixels for free-form outpainting, as shown in the bottom row of Fig. 8 I. Besides manually drawn sketches, we can also use the sketch from another image to guide the outpainting, as shown in Fig. 8 II. The inputs in Fig. 8 II(a) directly use the sketches of the reference images to control the outpainting, while the ones in Fig. 8 II(c) use sketches from randomly selected images; the two types serve as the simple and the difficult cases, respectively. From Fig. 8 II, our method not only predicts new content matching the guiding sketches but also achieves satisfactory style consistency for both simple and difficult cases. It is worth noting that this paper mainly focuses on synthesizing new content from left to right; however, prediction in other directions can also be performed with the BDIE backbone, as shown in Fig. 9, where we leave everything unchanged except for using the mask to indicate the missing regions. In this task, our $\text{ReGO}_{\text{BDIE}}$ also achieves more outstanding performance compared to the BDIE model: 10.817 FID and 3.004 IS for $\text{ReGO}_{\text{BDIE}}$ vs. 12.132 FID and 2.893 IS for BDIE.

Results on High-Resolution Images. Besides the low-resolution datasets, we also collect 558 high-resolution scenery images from the Internet using the keyword "scenery images" to further evaluate our model. We resize the images to 512×768 and directly evaluate the performance on this dataset; our method again outperforms the most competitive method BDIE: 34.012 FID and 4.078 IS for $\text{ReGO}_{\text{BDIE}}$ vs. 37.841 FID and 3.917 IS for BDIE. Fig. 10 shows the high-resolution results of three state-of-the-art methods.

In summary, the proposed framework allows users to obtain three types of results: random outpainting, free-form outpainting from manually drawn sketches, and controllable outpainting using the sketch from another image. Therefore, our proposed method has higher practical value. We also show some failure cases in Fig. 11, where the model fails to predict reasonable pixels or generate natural enough content for the guiding sketches. In our experiments, we find that most of the failed samples are those whose guiding shapes lie in the top part of the images; we suspect the reason stems from the limited sketch patterns of the training images. This reveals that many challenges still remain for this task. In the future, we will continue to explore this task and attempt to address more issues in this field.

V Conclusion and Future Work

This work develops a novel ReGO module that improves the outpainting quality by borrowing pixels from the neighbors of the input. The proposed method can effectively boost the results of sketch-guided image outpainting by enriching the textural details. An ACS module is proposed to distill the beneficial pixels and suppress the profitless content, which helps the generator pick up the helpful pixels. To prevent the style of the synthesized content from being affected by the reference image, we design a style ranking loss that enforces the generator to produce style-consistent content. Experiments on three benchmarks with three backbones demonstrate the effectiveness of the proposed method. The idea of enriching details from neighbors may also work for other generation tasks; we will continue to explore its effectiveness in the future.

References

  • [1] Wang Yi, Tao Xin, Shen Xiaoyong, and Jia, Jiaya, “Wide-Context Semantic Image Extrapolation”. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019: 1399-1408.
  • [2] Piotr Teterwak, Aaron Sarna, Dilip Krishnan, Aaron Maschinot, David Belanger, Ce Liu, and William T. Freeman, "Boundless: Generative Adversarial Networks for Image Extension". In IEEE International Conference on Computer Vision (ICCV), 2019: 10520-10529.
  • [3] Yang Zongxin, Dong Jian, Liu Ping, Yang Yi, and Yan Shuicheng, ”Very Long Natural Scenery Image Prediction by Outpainting”. In IEEE International Conference on Computer Vision (ICCV), 2019: 10560-10569.
  • [4] Johannes Kopf, Wolf Kienzle, Steven M. Drucker, and Sing Bing Kang, ”Quality prediction for image completion”. ACM Trans. Graph, vol.31, no.6, pages: 131:1–131:8, 2012.
  • [5] Josef Sivic, Biliana Kaneva, Antonio Torralba, Shai Avidan, and William T. Freeman, ”Creating and exploring a large photorealistic virtual space”. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2008: 1-8.
  • [6] Yinda Zhang, Jianxiong Xiao, James Hays, and Ping Tan, ”FrameBreak: Dramatic Image Extrapolation by Guided Shift-Maps”. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013: 1171–1178.
  • [7] Miao Wang, Yu-Kun Lai, Yuan Liang, Ralph R. Martin, and Shi-Min Hu, ”BiggerPicture: data-driven image extrapolation using graph matching”. ACM Trans. Graph, vol.33, no.6, 2014, pages: 173:1–173:13.
  • [8] Yijun Li, Lu Jiang, and Ming-Hsuan Yang, ”Controllable and Progressive Image Extrapolation”. In CoRR, abs/1912.11711, 2019.
  • [9] Yaxiong Wang, Xueming Qian, Yunchao Wei, Li Zhu, Yi Yang and Tao Dai, ”Sketch-Guided Image Outpainting”. In IEEE Trans. on Image Processing: 2021, vol.30, pp:2643-2655.
  • [10] Miao Wang, Yu-Kun Lai, Yuan Liang, Ralph R. Martin, and Shi-Min Hu, ”BiggerPicture: data-driven image extrapolation using graph matching”. ACM Trans. Graph, vol.33, no.6, 2014, pages: 173:1–173:13.
  • [11] Yuqian Zhou, Connelly Barnes, Eli Shechtman, and Sohrab Amirghodsi, "TransFill: Reference-Guided Image Inpainting by Merging Multiple Color and Spatial Transformations". In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp: 2266-2276, 2021.
  • [12] Alexei A. Efros, and William T. Freeman, ”Image quilting for texture synthesis and transfer”. SIGGRAPH, 2001: 341–346.
  • [13] Yinda Zhang, Jianxiong Xiao, James Hays, and Ping Tan, ”FrameBreak: Dramatic Image Extrapolation by Guided Shift-Maps”. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013: 1171-1178.
  • [14] Qi Shan, Brian Curless, Yasutaka Furukawa, Carlos Hernández, and Steven M. Seitz, ”Photo Uncrop”. In European Conference on Computer Vision (ECCV), 2014: 16–31.
  • [15] Sepp Hochreiter, and Jürgen Schmidhuber, ”Long Short-Term Memory”. Neural Computation, vol. 9, no.8, 1997, pages: 1735-1780.
  • [16] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio, "Generative Adversarial Nets". In Advances in Neural Information Processing Systems (NIPS), 2014: 2672-2680.
  • [17] Xie Saining, and Tu Zhuowen, ”Holistically-Nested Edge Detection”. In IEEE International Conference on Computer Vision (ICCV), 2015: 1395-1403.
  • [18] Bolei Zhou, Àgata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba, ”Places: A 10 Million Image Database for Scene Recognition”. IEEE Trans. Pattern Anal. Mach. Intell, vol. 40, no. 6, 2018, pages: 1452–1464.
  • [19] Ishaan Gulrajani, Faruk Ahmed, Martín Arjovsky, Vincent Dumoulin, and Aaron C. Courville, ”Improved Training of Wasserstein GANs”. In Advances in Neural Information Processing Systems (NIPS), 2017: 5767–5777.
  • [20] Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter, "Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs)". In International Conference on Learning Representations (ICLR), 2016.
  • [21] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, ”Deep Residual Learning for Image Recognition”. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016: 770–778.
  • [22] Leon A. Gatys, Alexander S. Ecker, and Matthias Bethge, ”Image Style Transfer Using Convolutional Neural Networks”. In IEEE Conference on Computer Vision and Pattern Recognition, (CVPR), 2016, 2414–2423.
  • [23] Fujun Luan, Sylvain Paris, Eli Shechtman, and Kavita Bala, ”Deep Photo Style Transfer”. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017:6997–7005.
  • [24] Yuan Yao, Jianqiang Ren, Xuansong Xie, Weidong Liu, Yong-Jin Liu, and Jun Wang, ”Attention-Aware Multi-Stroke Style Transfer”,In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019: 1467–1475.
  • [25] Genevieve Patterson, Chen Xu, Hang Su, and James Hays, ”The SUN Attribute Database: Beyond Categories for Deeper Scene Understanding”, International Journal of Computer Vision (IJCV), 108 (1-2): 59-81, 2014.
  • [26] Quan, Weize and Zhang, Ruisong and Zhang, Yong and Li, Zhifeng and Wang, Jue and Yan, Dong-Ming, ”Image Inpainting With Local and Global Refinement”. IEEE Transactions on Image Processing, vol.31, pp:2405-2420, 2021.
  • [27] Karen Simonyan, and Andrew Zisserman, ”Very Deep Convolutional Networks for Large-Scale Image Recognition”. In International Conference on Learning Representations (ICLR), 2015.
  • [28] Ziyu Wan, Jingbo Zhang, Dongdong Chen, and Jing Liao, “High-Fidelity Pluralistic Image Completion with Transformers”, IEEE International Conference on Computer Vision (ICCV), pp:4672-4681, 2021.
  • [29] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter, ”GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium”. In Advances in Neural Information Processing Systems (NIPS), 2017: 6626-6637.
  • [30] Tim Salimans, Ian J. Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen, ”Improved Techniques for Training GANs”. In Advances in Neural Information Processing Systems (NIPS), 2016: 2226-2234.
  • [31] Zongyu Guo, Zhibo Chen, Tao Yu, Jiale Chen, and Sen Liu, ”Progressive Image Inpainting with Full-Resolution Residual”. In ACM Multimedia, 2019: 2496-2504.
  • [32] Chaohao Xie, Shaohui Liu, Chao Li, Ming-Ming Cheng, Wangmeng Zuo, Xiao Liu, Shilei Wen, and Errui Ding, ”Image Inpainting with Learnable Bidirectional Attention Maps”. In IEEE International Conference on Computer Vision (ICCV), 8857-8866, 2019.
  • [33] Jiahui Yu, Zhe Lin, Jimei Yang, Xiaohui Shen, Xin Lu, and Thomas S. Huang, ”Free-Form Image Inpainting with Gated Convolution”. In IEEE International Conference on Computer Vision (ICCV), 2018:4470-4479.
  • [34] Roman Suvorov, Elizaveta Logacheva, Anton Mashikhin, Anastasia Remizova, Arsenii Ashukha, Aleksei Silvestrov, Naejin Kong, Harshith Goka, Kiwoong Park, and Victor Lempitsky, "Resolution-robust Large Mask Inpainting with Fourier Convolutions". In IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2022: 3172-3182.
  • [35] Shengyu Zhao, Jonathan Cui, Yilun Sheng, Yue Dong, Xiao Liang, Eric I-Chao Chang, and Yan Xu, ”Large Scale Image Completion via Co-Modulated Generative Adversarial Networks”, International Conference on Learning Representations (ICLR), 2021.
  • [36] Guilin Liu, Fitsum A. Reda, Kevin J. Shih, Ting-Chun Wang, Andrew Tao, and Bryan Catanzaro, ”Image Inpainting for Irregular Holes Using Partial Convolutions”. In European Conference on Computer Vision (ECCV), 89-105 2018.
  • [37] Deepak Pathak, Phillipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A. Efros, ”Context Encoders: Feature Learning by Inpainting”. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2536-2544, 2017.
  • [38] Satoshi Iizuka, Edgar Simo-Serra, and Hiroshi Ishikawa, ”Globally and Locally Consistent Image Completion”. ACM Trans. Graph, vol.36, no.4, pages: 107:1:107:14, 2017.
  • [39] Wang, Ning and Zhang, Yipeng and Zhang, Lefei, ”Dynamic Selection Network for Image Inpainting”, IEEE Transactions on Image Processing, vol.30, pages:1784-1798, 2021.
  • [40] Jiahui Yu, Zhe Lin, Jimei Yang, Xiaohui Shen, Xin Lu, and Thomas S Huang, ”Generative Image Inpainting With Contextual Attention”. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018: 5505-5514.
  • [41] Chaohao Xie, Shaohui Liu, Chao Li, Ming-Ming Cheng, Wangmeng Zuo, Xiao Liu, Shilei Wen, and Errui Ding, "Image Inpainting with Learnable Bidirectional Attention Maps". In IEEE International Conference on Computer Vision (ICCV), 2019: 8857-8866.
  • [42] Zhu, Manyu and He, Dongliang and Li, Xin and Li, Chao and Li, Fu and Liu, Xiao and Ding, Errui and Zhang, Zhaoxiang, ”Image Inpainting by End-to-End Cascaded Refinement With Mask Awareness”, IEEE Transactions on Image Processing, vol.30, pp:4855-4866, 2021.
  • [43] Artur Grigorev, Artem Sevastopolsky, Alexander Vakhitov, and Victor S. Lempitsky, ”Coordinate-Based Texture Inpainting for Pose-Guided Human Image Generation”. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019: 12135-12144.
  • [44] Ding Ding, Sundaresh Ram, and Jeffrey Rodríguez, "Image Inpainting Using Nonlocal Texture Matching and Nonlinear Filtering". IEEE Trans. Image Process., vol.28, no.4, pages: 1705-1719, 2019.
  • [45] Wengling Chen, and James Hays, ”SketchyGan: Towards Diverse and Realistic Sketch to Image Synthesis”. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018: 9416-9425.