
Diverse Image Inpainting with Bidirectional and Autoregressive Transformers

Yingchen Yu1  Fangneng Zhan1  Rongliang Wu1  Jianxiong Pan2  Kaiwen Cui1
Shijian Lu1  Feiying Ma2  Xuansong Xie2  Chunyan Miao1
1 Nanyang Technological University  2 DAMO Academy, Alibaba Group
{yingchen001, ronglian001, kaiwen001}@e.ntu.edu.sg, {fnzhan, shijian.lu, ascymiao}@ntu.edu.sg
{jianxiong.pjx, feiying.mfy}@alibaba-inc.com, [email protected]
Equal contribution. Corresponding author.
Abstract

Image inpainting is an underdetermined inverse problem, which naturally allows diverse contents to fill up the missing or corrupted regions realistically. Prevalent approaches using convolutional neural networks (CNNs) can synthesize visually pleasant contents, but CNNs suffer from limited receptive fields for capturing global features. With image-level attention, transformers can model long-range dependencies and generate diverse contents with autoregressive modeling of pixel-sequence distributions. However, the unidirectional attention in autoregressive transformers is suboptimal as corrupted image regions may have arbitrary shapes with contexts from any direction. We propose BAT-Fill, an innovative image inpainting framework built upon a novel bidirectional autoregressive transformer (BAT). BAT utilizes transformers to learn autoregressive distributions, which naturally allows the diverse generation of missing contents. In addition, it incorporates a masked language model similar to BERT, which enables bidirectional modeling of the contextual information of missing regions for better image completion. Extensive experiments over multiple datasets show that BAT-Fill achieves superior diversity and fidelity in image inpainting qualitatively and quantitatively.

Figure 1: The proposed BAT-Fill introduces a novel bidirectional autoregressive transformer that captures deep bidirectional contexts for autoregressive generation of diverse contents in image inpainting. Evaluations over multiple public datasets show that BAT-Fill can generate realistic and reasonable image contents. The three illustrative sample images from top to bottom are selected from the datasets CelebA-HQ [21], Places2 [65], and Paris StreetView [32], respectively.

1 Introduction

As an ill-posed problem, image inpainting naturally allows numerous solutions as long as the restored images are realistic and semantically reasonable, as illustrated in Fig. 1. However, it remains a great challenge to synthesize diverse yet realistic contents that maintain integrity and consistency with the uncorrupted image regions, especially when the corrupted image regions are large and rich in complex textures and structures.

Recently, GAN-based (generative adversarial network) inpainting [32, 50, 30, 26] has achieved remarkable progress by training with reconstruction and adversarial losses over large-scale datasets. However, these methods are trained to learn a one-to-one mapping from masked images to complete images, which results in the incapacity of producing diverse inpainting results. In contrast to deterministic inpainting, a few studies [64, 61] attempt diverse inpainting with variational auto-encoder (VAE) networks [23], but the inpainting quality is often compromised when generating complex structural and texture patterns due to the limited capacity of parametric distributions [63]. Instead of parametric distribution modeling like VAE-based methods, [33] utilizes a CNN-based conditional network to learn an autoregressive distribution for recovering diverse and structural features. However, autoregressive models are optimized to encode unidirectional context only, which means that the informative contexts of valid pixels after the current position are substantially ignored. To explore bidirectional context, [41] adopts the masked language model (MLM) as in BERT [11]. However, MLM predicts the masked tokens independently, which may oversimplify the complex context dependency in the data [37] and result in inconsistency in the generated results.

In this paper, we propose a bidirectional and autoregressive transformer (BAT) that marries the best of autoregressive modeling and MLM to model deep bidirectional contexts in an autoregressive manner. In the proposed BAT, we permute the input sequence by sorting the valid and missing pixels and start the autoregressive modeling at the position of the first missing pixel. With all available contexts in front, BAT can exploit bidirectional contexts and spatial dependency simultaneously. In addition, we adopt the two-stage completion procedure as reported in [41] and develop BAT-Fill, an image inpainting network that first recovers diverse yet coherent image structures with the proposed BAT and then exploits a CNN-based texture generator to up-sample the coarse structures and synthesize texture details. Extensive experiments show that BAT-Fill achieves superior image inpainting performance.

The main contributions of this work can be summarized in three aspects. First, we adopt transformers to learn an autoregressive distribution for diverse image inpainting, which effectively improves the modeling capacity for long-range dependencies and global structures. Second, we design a novel bidirectional and autoregressive transformer (BAT) that captures bidirectional information and establishes output dependency simultaneously. Third, extensive experiments over multiple datasets show that the proposed method achieves superior performance as compared with the state-of-the-art in both inpainting quality and inpainting diversity.

2 Related Work

2.1 Image Inpainting

As an ill-posed problem, realistic and high-fidelity image inpainting is a challenging task that has been studied for years. Based on the inpainting outcome, most existing image inpainting methods can be broadly classified into two categories including deterministic image inpainting and diverse image inpainting.

2.1.1 Deterministic Image Inpainting

Traditional methods address the image inpainting challenge through either image diffusion [5, 1] or image patches [3, 15, 9]. However, diffusion-based methods often introduce diffusion-related blurs and tend to fail when the missing or corrupted image regions are large [6, 2, 4]. Patch-based methods can work well for the inpainting of stationary background with repeating patterns. However, they struggle in completing large missing regions of complex scenes as the patch-based approach relies heavily on patch-wise matching of low-level features.

Generative adversarial networks (GANs) [14] have been investigated extensively in various image synthesis tasks such as image translation [31, 36, 19, 53, 51, 56, 54], image editing [48, 44, 43], image composition [24, 59, 58, 52, 57, 55], etc. Specifically for image inpainting, Pathak et al. [32] first apply adversarial learning to the image inpainting task. To further improve the adversarial learning within local regions, Iizuka et al. [18] introduce an extra local discriminator to enforce local consistency. As the local discriminator uses fully-connected layers and can only deal with missing regions of fixed shapes, Yu et al. [49] inherit the discriminator from PatchGAN [19] due to its great success in image translation. Yan et al. [46] propose patch-swap to make use of distant feature patches for better inpainting quality. Liu et al. [25] design partial convolutions to alleviate the negative influence of the masked regions. Yu et al. [50] present a novel free-form image inpainting system based on an end-to-end generative network with gated convolutions. To generate reasonable structures and realistic textures, Nazeri et al. [30] and Xu et al. [45] utilize edge maps as structural guidance for image inpainting, and Ren et al. [35] instead propose to use edge-preserved smooth images as structural guidance. Liu et al. [27] propose feature equalizations to improve the consistency between structures and textures. As the aforementioned methods focus on reconstructing the ground truth instead of generating pluralistic inpainting, they are constrained to generating a deterministic inpainting result for each incomplete image.

Figure 2: Overview of the proposed image inpainting method: Given a Masked Image $I_m$, the proposed BAT-Fill first down-samples it to a lower resolution and feeds the down-sampled image to the Bidirectional Autoregressive Transformer (BAT) for the recovery of Diverse Structures. Taking the recovered image structures and the image style features (extracted by the Encoder) as inputs, the Texture Generator synthesizes high-resolution textures and produces the Inpainting Results $I_{out}$.

2.1.2 Diverse Image Inpainting

To achieve pluralistic image inpainting with plausible filling contents, Zheng et al. [64] propose a VAE-based network with a dual pipeline, which trades off between reconstructing the ground truth and maintaining the diversity of the inpainting results. Similarly, Zhao et al. [61] propose a VAE-based model and leverage a reference image to improve the diversity. Although the above methods achieve a certain degree of diversity, the completion quality of VAE-based methods is limited due to variational training. Recently, Zhao et al. [62] propose a co-modulated GAN that combines the image condition with the stochastic generation of unconditional generative models for diverse inpainting. Peng et al. [33] introduce a hierarchical vector quantized variational auto-encoder (VQ-VAE) to quantize the context representation and achieve diverse structure generation in an autoregressive way. Sharing a similar framework with us, Wan et al. [41] apply a transformer for diverse structure generation using the objective of BERT [11]. In contrast, we propose a novel Bidirectional and Autoregressive Transformer (BAT) which inherits the advantages of autoregressive models and bidirectional models and achieves superior image inpainting performance.

2.2 Transformers in Vision

The transformer has emerged as a powerful tool for modeling interactions between sequence elements regardless of their relative positions. Vaswani et al. [40] first introduce the transformer, which relies entirely on attention mechanisms to model long-range dependencies in sequences. DETR [7] utilizes a transformer decoder to model object detection as an end-to-end dictionary lookup problem with learnable queries, thus removing hand-crafted processes such as Non-Maximum Suppression (NMS). Based on DETR, Deformable DETR [66] further introduces a deformable attention layer to focus on a sparse set of contextual elements, which achieves faster convergence and better detection performance. Recently, Vision Transformer (ViT) [12] shows that a pure-transformer network, which treats an image as a sequence of patches, can achieve excellent image classification performance compared with CNN-based methods. DeiT [39] further extends ViT by introducing a novel distillation approach. BoTNet [38] replaces the spatial $3\times 3$ convolution layers with multi-head self-attention in certain stages of the original ResNet [16], demonstrating very competitive performance on different visual recognition tasks. Esser et al. [13] adopt transformers with VQ-VAE for both conditional and unconditional generation tasks, achieving high-fidelity synthesis of megapixel images.

Instead of leveraging transformer features for high-level tasks or generating pixels autoregressively, we propose a novel Bidirectional and Autoregressive Transformer (BAT) specifically for image inpainting, so that the model can learn bidirectional context and output dependency simultaneously.

3 Proposed Method

As illustrated in Fig. 2, the proposed BAT-Fill consists of two major parts: a diverse-structure generator for the reconstruction of coarse image structures and a texture generator for the generation of fine-grained texture details. The diverse-structure generator incorporates and adapts a transformer architecture that models the distribution of global structural information and recovers complete and coherent low-resolution structures $S_1, S_2, \cdots, S_N$ given a Masked Image $I_m$ as input. Under the guidance of a coarse structure $S_i, i\in[1,N]$ and the corrupted image $I_m$, the Texture Generator synthesizes high-resolution fine-grained textures to produce the Inpainting Result $I_{out}$. Once the full model is trained, we can sample different image structures $S_i, i\in[1,N]$ from the diverse-structure generator and thus generate diverse inpainting results with the texture generator. More details are discussed in the ensuing subsections.

3.1 Diverse-structure Generator

3.1.1 Context Representation

To relieve the quadratic complexity incurred by the transformer, we adopt a low-resolution image of size $32\times 32\times 3$ to represent the coarse structure. As autoregressive generation requires a discrete distribution, pixel values are treated as classes by the model, which leads to a dimensionality of $256^3$ for each pixel of an 8-bit RGB image. Following Chen et al. [8], a color palette is applied to further reduce the dimensionality to 512 while faithfully preserving the main structure of the original images; the palette is generated by $k$-means clustering of RGB pixel values with $k=512$ over the ImageNet [10] dataset.
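
To make the palette construction concrete, the sketch below shows one way to build and apply such a $k$-means color palette; the function names, array shapes, and clustering settings are illustrative assumptions rather than the authors' implementation.

```python
# A minimal sketch of the color-palette quantization described above; the
# clustering source (sampled ImageNet pixels) and shapes are assumptions.
import numpy as np
from sklearn.cluster import KMeans

def build_palette(rgb_pixels, k=512, seed=0):
    """Cluster sampled RGB values (N, 3) in [0, 255] into a k-color palette."""
    km = KMeans(n_clusters=k, random_state=seed, n_init=4).fit(rgb_pixels)
    return km.cluster_centers_                           # (k, 3) palette colors

def quantize(structure, palette):
    """Map a (32, 32, 3) low-resolution structure image to palette indices."""
    flat = structure.reshape(-1, 3).astype(np.float32)
    dists = ((flat[:, None, :] - palette[None, :, :]) ** 2).sum(-1)
    return dists.argmin(1).reshape(structure.shape[:2])  # (32, 32) token grid

def dequantize(indices, palette):
    """Convert predicted palette indices back to an RGB structure image."""
    return palette[indices].astype(np.uint8)
```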

Figure 3: Illustrations of BAT and its attention mask. (a) All unmasked tokens (i.e., $X_1, X_4$) are permuted to the front of the sequence, followed by masked tokens with positional embeddings (i.e., $p_2, p_3, p_5$). This provides bidirectional context for the autoregressive modeling of masked tokens (i.e., $X_2, X_3, X_5$). (b) BAT allows attending to non-predicted tokens and previously predicted tokens for the prediction of masked future tokens. For example, while predicting $X_3$, the model can attend to $X_1, X_4$ (non-predicted tokens) and $X_2$ (previously predicted token) simultaneously. Meanwhile, future tokens are not attended to, which prevents information leakage in autoregressive modeling.

3.1.2 Bidirectional and Autoregressive Transformer

Autoregressive (AR) modeling and masked language modeling (MLM) in BERT [11] are two representative objectives for exploiting large language corpora in language processing tasks. Given a discrete sequence $X=\{x_1, x_2, \dots, x_L\}$ where $L$ is the length of $X$, the AR model is optimized by maximizing the unidirectional likelihood:

\log P(X;\theta)=\mathbb{E}_{X}\sum_{t=1}^{L}\log P(x_{t}|X_{<t};\theta), \qquad (1)

where $\theta$ denotes the model parameters. In contrast, MLM aims to reconstruct corrupted data given the masked positions $M=\{m_1, m_2, \dots, m_K\}$, where $K$ is the number of masked tokens. Each masked position of the corrupted data is indicated by a special token $[M]$ following BERT [11]. Denoting the masked tokens as $X_M$ and the unmasked tokens as $X_{\setminus M}$, the objective of MLM can be formulated as:

\log P(X_{M}|X_{\setminus M};\theta)=\mathbb{E}_{X}\sum_{m_{k}\in M}\log P(x_{m_{k}}|X_{\setminus M};\theta). \qquad (2)

AR and MLM differ in two aspects as defined in Eqs. 1 and 2. The first aspect lies in output dependency: MLM predicts the masked tokens separately and independently, which may oversimplify the complex context dependency in the data [37]. In comparison, AR factorizes the predicted tokens with the product rule, which establishes the output dependency and produces better predictions. The second aspect lies in context dependency: AR is only conditioned on the tokens up to the current position (in a fixed order), while MLM has access to bidirectional contextual information. In this respect, MLM is more suitable for image inpainting as the missing or corrupted image regions often have arbitrary shapes with rich variation in the neighboring background.

We propose a novel Bidirectional and Autoregressive Transformer (BAT) that inherits the advantages of AR and MLM to achieve bidirectional context modeling and output dependency simultaneously. The training objective of the BAT is formulated by:

\mathcal{L}_{BAT}=\mathbb{E}_{X}\sum_{m_{k}\in M}\log P(x_{m_{k}}|X_{\setminus M},M,X_{<m_{k}};\theta). \qquad (3)

We first project all the tokens into a $d$-dimensional token embedding and add a learnable position embedding on top to preserve the positional information. Unlike XLNet [47], which randomly permutes the input sequence to capture the bidirectional context, we permute all unmasked tokens $X_{\setminus M}$ to the front while maintaining the original order of the masked tokens for better predicting their positions. Moreover, the positional information of all masked tokens is conditioned on for better modeling of the full input sequence (e.g., the counts and positions of masked tokens in the sequence). The proposed BAT model is then adopted to predict the masked tokens as illustrated in Fig. 3.

As shown in Fig. 3, consider a masked sequence $X=\{x_1, [M], [M], x_4, [M]\}$ with length $L=5$ and positions $2, 3, 5$ being masked. After permutation and inserting the full mask tokens, we have the non-predicted tokens $(X_{\setminus M}, M)=(x_1, x_4, [M], [M], [M])$, which provide the bidirectional context. For the predicted part, we have the input tokens $([M], x_2, x_3)$ to predict their corresponding next tokens, i.e., $(x_2, x_3, x_5)$. Here we use the mask token instead of $x_1$ to predict $x_2$ to encourage the use of positional information. We apply bidirectional modeling [11] to the non-predicted tokens and autoregressive modeling to the predicted tokens to avoid future information leakage. For example, while predicting $x_3$, the model can attend to $x_4$ among the non-predicted tokens and meanwhile the previously 'predicted' token $x_2$. Hence, we capture bidirectional context and establish output dependency simultaneously with the proposed BAT.
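
The sketch below illustrates one way to build the permuted input sequence and the customized attention mask described above; the tensor layout, the mask-token id, and the helper name are assumptions rather than the authors' implementation.

```python
# A hedged sketch of the BAT input layout and attention mask (not the authors'
# code). In this mask, True means "may attend"; invert it (~attn) before passing
# to modules such as torch.nn.MultiheadAttention, which expect True = blocked.
import torch

def build_bat_inputs(tokens, is_masked, mask_id):
    """tokens: (L,) long tensor of palette indices; is_masked: (L,) bool tensor."""
    unmasked_pos = (~is_masked).nonzero(as_tuple=True)[0]
    masked_pos = is_masked.nonzero(as_tuple=True)[0]
    K = masked_pos.numel()

    # Non-predicted context: valid tokens permuted to the front, followed by K
    # mask tokens that carry the positional embeddings of the masked positions.
    context = torch.cat([tokens[unmasked_pos],
                         torch.full((K,), mask_id, dtype=tokens.dtype)])
    # Predicted stream: ([M], x_{m_1}, ..., x_{m_{K-1}}) predicts (x_{m_1}, ..., x_{m_K}).
    predicted_in = torch.cat([torch.tensor([mask_id], dtype=tokens.dtype),
                              tokens[masked_pos[:-1]]])
    inputs = torch.cat([context, predicted_in])      # length L + K
    targets = tokens[masked_pos]                     # the K tokens to predict

    total, Lc = inputs.numel(), context.numel()
    attn = torch.zeros(total, total, dtype=torch.bool)
    attn[:, :Lc] = True                              # everything attends to the context
    attn[Lc:, Lc:] = torch.tril(torch.ones(K, K, dtype=torch.bool))  # causal predicted part
    return inputs, targets, attn
```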

3.1.3 Transformer Architecture

In this work, we adapt GPT [34] as our network architecture. The network is a decoder-only transformer that consists of $\mathcal{N}$ stacked decoder blocks. Given an intermediate embedding $H^n$ at the $n$-th layer, the decoder block can be formulated as:

H^{n} = H^{n}+\text{MA}(\text{LN}(H^{n})), \qquad (4)
H^{n+1} = H^{n}+\text{MLP}(\text{LN}(H^{n})), \qquad (5)

where MA, LN, and MLP stand for multi-head self-attention, layer normalization, and fully-connected layers, respectively. For the self-attention, we apply a customized mask to the $L\times L$ matrix of attention logits as illustrated in Fig. 3. At the final layer of the transformer, a learnable linear projection is employed to map $H^{\mathcal{N}}$ to logits, which parameterize the conditional distribution for each pixel.
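
A minimal PyTorch rendering of the pre-LayerNorm decoder block in Eqs. (4) and (5) is sketched below; the model width, number of heads, and MLP ratio are assumptions rather than the paper's exact settings.

```python
# A minimal sketch of the decoder block in Eqs. (4)-(5). `attn_mask` follows the
# PyTorch convention (True = position may NOT be attended).
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8, mlp_ratio=4):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, mlp_ratio * d_model),
            nn.GELU(),
            nn.Linear(mlp_ratio * d_model, d_model),
        )

    def forward(self, h, attn_mask=None):
        # Eq. (4): H^n <- H^n + MA(LN(H^n)), with the customized BAT attention mask.
        x = self.ln1(h)
        h = h + self.attn(x, x, x, attn_mask=attn_mask, need_weights=False)[0]
        # Eq. (5): H^{n+1} = H^n + MLP(LN(H^n)).
        return h + self.mlp(self.ln2(h))
```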

During inference, we follow the raster-scan order to predict each masked token bidirectionally and autoregressively. We adopt a top-$\mathcal{K}$ sampling strategy to randomly sample from the $\mathcal{K}$ most likely next tokens. The predicted token is then concatenated with the input sequence as a condition for the generation of the next masked token. This process repeats iteratively until all the masked tokens are sampled. Finally, the generated discrete sequence can be converted back to RGB values with the aforementioned color palette.
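
The inference procedure can be sketched as the following top-$\mathcal{K}$ sampling loop; the model interface and the input layout (context tokens followed by the predicted stream, as in the earlier sketch) are assumptions.

```python
# A hedged sketch of the top-K sampling loop described above; the default K=50
# follows the top-50 strategy mentioned in Sec. 4.2.
import torch
import torch.nn.functional as F

@torch.no_grad()
def sample_masked_tokens(model, inputs, num_masked, context_len, k=50):
    """Fill the predicted stream token by token in raster-scan order."""
    seq = inputs.clone().unsqueeze(0)                # (1, L + K)
    sampled = []
    for i in range(num_masked):
        logits = model(seq)                          # (1, L + K, vocab_size)
        step_logits = logits[0, context_len + i]     # position that predicts target i
        topk_vals, topk_idx = step_logits.topk(k)
        probs = F.softmax(topk_vals, dim=-1)
        next_tok = topk_idx[torch.multinomial(probs, 1)].item()
        sampled.append(next_tok)
        if context_len + i + 1 < seq.size(1):
            seq[0, context_len + i + 1] = next_tok   # condition the next prediction on it
    return sampled                                   # palette indices of the masked pixels
```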

3.2 Texture Generator

3.2.1 Network Architecture

As the inpainting diversity can be achieved by sampling the reconstructed structures $S$, we take advantage of the efficiency and texture representation capacity of CNNs to learn a deterministic mapping between the low-resolution structures $S$ and the high-resolution completed image $I_{out}$. The texture generator thus utilizes CNN layers and adversarial training to up-sample the reconstructed structures and replenish high-fidelity texture details by leveraging the styles of the valid pixels of the input image $I_m$. In particular, we employ two encoders to encode the low-resolution structures and the input images into two high-level CNN representations of the same dimension. We then concatenate them together as the input of a few consecutive residual blocks with different dilation rates. Finally, a SPADE [31] generator is employed to incorporate the modulated style of the input images and gradually up-sample the texture features to the target resolution. Meanwhile, all vanilla convolutions are replaced by gated convolutions [50].
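
As a concrete reference, a gated convolution [50] can be sketched as below; the kernel size, activation, and padding choices are assumptions rather than the exact configuration of our texture generator.

```python
# A small sketch of the gated convolution that replaces vanilla convolutions.
import torch
import torch.nn as nn

class GatedConv2d(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1, dilation=1):
        super().__init__()
        pad = dilation * (kernel_size - 1) // 2
        self.feature = nn.Conv2d(in_ch, out_ch, kernel_size, stride, pad, dilation)
        self.gate = nn.Conv2d(in_ch, out_ch, kernel_size, stride, pad, dilation)
        self.act = nn.LeakyReLU(0.2)

    def forward(self, x):
        # The learned soft gate can suppress features coming from masked pixels.
        return self.act(self.feature(x)) * torch.sigmoid(self.gate(x))
```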

3.2.2 Loss Functions

The training of the texture generator is driven by a combination of several losses including a reconstruction loss, an adversarial loss, and a perceptual loss. For clarity, we denote the texture generator as $G_t$, the ground truth as $I_{gt}$, and the completed image as $I_{out}$. First, a reconstruction loss $\mathcal{L}_{rec}$ between $I_{out}$ and $I_{gt}$ is measured as follows:

\mathcal{L}_{rec}=\|I_{out}-I_{gt}\|_{1}.

Besides, a CNN-based discriminator $D$ together with an adversarial loss is employed to synthesize fine texture details. Specifically, the texture generator $G_t$ and the discriminator $D$ are jointly trained with the hinge loss [19], where the adversarial losses for the discriminator and the generator are defined by:

\mathcal{L}_{adv}^{D}=\mathbb{E}_{I_{gt}}[\text{ReLU}(1-D(I_{gt}))]+\mathbb{E}_{I_{out}}[\text{ReLU}(1+D(I_{out}))],
\mathcal{L}_{adv}^{G_{t}}=-\mathbb{E}_{I_{out}}[D(I_{out})].
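
The two hinge losses can be written compactly as in the sketch below, where $D$, $I_{out}$, and $I_{gt}$ follow the notation above and everything else is illustrative.

```python
# A hedged sketch of the hinge adversarial losses defined above.
import torch.nn.functional as F

def d_hinge_loss(D, I_gt, I_out):
    # L_adv^D = E[ReLU(1 - D(I_gt))] + E[ReLU(1 + D(I_out))]
    return F.relu(1.0 - D(I_gt)).mean() + F.relu(1.0 + D(I_out.detach())).mean()

def g_hinge_loss(D, I_out):
    # L_adv^{G_t} = -E[D(I_out)]
    return -D(I_out).mean()
```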

Next, we penalize the perceptual and semantic discrepancy via the perceptual loss [20] with a pretrained VGG-19 network:

\mathcal{L}_{perc}=\sum_{i}\lambda_{i}\|\Phi_{i}(I_{out})-\Phi_{i}(I_{gt})\|_{1}+\lambda_{l}\|\Phi_{l}(I_{out})-\Phi_{l}(I_{gt})\|_{2},

where $\lambda_i$ are balancing weights, $\Phi_i$ is the activation of the $i$-th layer of the VGG-19 model (including relu1_2, relu2_2, relu3_2, relu4_2, and relu5_2), and $\Phi_l$ represents the activation maps of the relu4_2 layer, which mainly extract semantic features. The texture generator is trained by optimizing the combination of the aforementioned losses:

\mathcal{L}_{G_{t}}=\min_{G_{t}}\max_{D}\big(\lambda_{rec}\mathcal{L}_{rec}+\lambda_{adv}\mathcal{L}_{adv}^{G_{t}}+\lambda_{perc}\mathcal{L}_{perc}\big),

where $\lambda_{rec}$, $\lambda_{adv}$, and $\lambda_{perc}$ are empirically set to 1.0, 1.0, and 0.2, respectively, in our implementation.
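
For reference, the sketch below assembles the perceptual loss and the combined objective; the torchvision VGG-19 layer indices used for relu1_2 to relu5_2 and the per-layer weights $\lambda_i$ are assumptions rather than the exact training code.

```python
# A hedged sketch of the perceptual loss and the combined texture-generator objective.
import torch.nn as nn
from torchvision import models

class VGG19Features(nn.Module):
    LAYERS = {3: 'relu1_2', 8: 'relu2_2', 13: 'relu3_2', 22: 'relu4_2', 31: 'relu5_2'}

    def __init__(self):
        super().__init__()
        self.vgg = models.vgg19(pretrained=True).features[:32].eval()
        for p in self.vgg.parameters():
            p.requires_grad_(False)

    def forward(self, x):
        feats = {}
        for idx, layer in enumerate(self.vgg):
            x = layer(x)
            if idx in self.LAYERS:
                feats[self.LAYERS[idx]] = x
        return feats

def texture_generator_loss(I_out, I_gt, vgg, adv_g, lam_i=0.1,
                           w_rec=1.0, w_adv=1.0, w_perc=0.2):
    rec = (I_out - I_gt).abs().mean()                                     # L_rec (l1)
    f_out, f_gt = vgg(I_out), vgg(I_gt)
    perc = sum(lam_i * (f_out[k] - f_gt[k]).abs().mean() for k in f_out)  # l1 terms
    perc = perc + ((f_out['relu4_2'] - f_gt['relu4_2']) ** 2).mean()      # l2 semantic term
    return w_rec * rec + w_adv * adv_g + w_perc * perc                    # L_{G_t}
```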

Figure 4: Qualitative comparison of the proposed BAT-Fill with the state-of-the-art: BAT-Fill generates more realistic and diverse image inpainting over CelebA-HQ [21] with irregular masks.
Figure 5: Qualitative comparison of the proposed BAT-Fill with the state-of-the-art: BAT-Fill generates more realistic and diverse image inpainting over Places2 [65] with irregular masks.
Figure 6: Qualitative comparison of the proposed BAT-Fill with the state-of-the-art: BAT-Fill generates more realistic and diverse image inpainting over Paris StreetView [32] with irregular masks.

4 Experiments

4.1 Experimental Settings

4.1.1 Datasets

We conduct experiments over three public datasets that have different characteristics as listed:

  • CelebA-HQ [21]: It is a high-quality version of the human face dataset CelebA [28] with 30,000 aligned face images. We follow the split in [50] that produces 28,000 training images and 2,000 validation images, of which 1,000 are randomly sampled for evaluation.

  • Places2 [65]: It consists of more than 1.8M natural images of 365 different scenes. We adopt the same 800 validation images as [41] for evaluation.

  • Paris StreetView [32]: It is a collection of street view images in Paris, which contains 14,900 training images and 100 validation images.

4.1.2 Compared Methods

We compare our method with a number of state-of-the-art methods as listed:

  • GC [50]: It is also known as DeepFill v2, a two-stage method that leverages gated convolutions.

  • EC [30]: It is a two-stage method that first predicts salient edges to guide the generation.

  • MEDFE [26]: It is a mutual encoder-decoder that treats features from deep and shallow layers as structures and textures of an input image.

  • PIC [64]: It is a probabilistically principled framework that leverages VAE to generate diverse image inpainting.

  • ICT [41]: It is a diverse inpainting framework that combines the merits of transformers and CNNs for high-fidelity image inpainting.

4.1.3 Evaluation Metrics

We perform evaluations with five widely adopted metrics: 1) Fréchet Inception Distance (FID) [17], which evaluates the perceptual quality by measuring the distribution distance between synthesized and real images; 2) mean $\ell_1$ error; 3) peak signal-to-noise ratio (PSNR); 4) structural similarity index (SSIM) [42] with a window size of 51; and 5) Learned Perceptual Image Patch Similarity (LPIPS) [60], which evaluates the diversity of the generated images. The average LPIPS scores are calculated between random pairs of sampled inpainting results.
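
As a concrete illustration of the diversity measure, the sketch below computes the mean LPIPS distance between random pairs of samples generated for the same masked input; the lpips package usage is standard, while the number of pairs is an assumption.

```python
# A sketch of the pairwise-LPIPS diversity measure described above.
import itertools
import random
import lpips

loss_fn = lpips.LPIPS(net='alex')   # expects (N, 3, H, W) tensors scaled to [-1, 1]

def diversity_lpips(samples, num_pairs=10, seed=0):
    """samples: list of (3, H, W) tensors inpainted from the same masked image."""
    all_pairs = list(itertools.combinations(range(len(samples)), 2))
    random.Random(seed).shuffle(all_pairs)
    pairs = all_pairs[:num_pairs]
    scores = [loss_fn(samples[i].unsqueeze(0), samples[j].unsqueeze(0)).item()
              for i, j in pairs]
    return sum(scores) / len(scores)
```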

4.1.4 Implementation Details

The proposed method is implemented in PyTorch. The diverse-structure generator and the texture generator are trained using $256\times 256$ images with random irregular masks [25]. We train the diverse-structure generator with AdamW [29] with $\beta_1=0.9$, $\beta_2=0.95$, and a learning rate of 3e-4 following [8]. For the texture generator, we use the Adam optimizer [22] with $\beta_1=0$ and $\beta_2=0.9$, and set the learning rates to 1e-4 and 4e-4 for the generator and discriminators, respectively. Learning rate decay is applied during the training of both networks, and the experiments are conducted on 4 NVIDIA Tesla V100 GPUs.
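
The optimizer settings above can be summarized in the short sketch below; the learning-rate decay schedule and any parameter grouping are omitted and remain assumptions.

```python
# A minimal sketch of the optimizer settings listed above.
import torch

def build_optimizers(structure_gen, texture_gen, discriminator):
    opt_bat = torch.optim.AdamW(structure_gen.parameters(),
                                lr=3e-4, betas=(0.9, 0.95))   # diverse-structure generator
    opt_g = torch.optim.Adam(texture_gen.parameters(),
                             lr=1e-4, betas=(0.0, 0.9))       # texture generator
    opt_d = torch.optim.Adam(discriminator.parameters(),
                             lr=4e-4, betas=(0.0, 0.9))       # discriminator(s)
    return opt_bat, opt_g, opt_d
```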

4.2 Quantitative Evaluation

Extensive quantitative evaluations have been conducted over the three datasets with irregular masks [25]. The irregular masks in the experiments are categorized according to the mask ratios, and an additional category 'random' is evaluated, which randomly samples masks with ratios varying from 20% to 60%. The performance of the compared methods was acquired by using the publicly available pre-trained models or implementation code: GC (https://github.com/JiahuiYu/generative_inpainting), EC (https://github.com/knazeri/edge-connect), MEDFE (https://github.com/KumapowerLIU/Rethinking-Inpainting-MEDFE), and PIC (https://github.com/lyndonzheng/Pluralistic-Inpainting).

We compare the proposed method with both deterministic and diverse image inpainting methods. Note that all reference-based metrics such as $\ell_1$, SSIM, and PSNR favor deterministic inpainting methods, whose predictions are directly compared with the ground truth. Different from PIC [64], which utilizes its discriminator to rank the results, our method adopts the top-50 sampling strategy and uses all random samples for fair comparison, which means our method directly generates stochastic inpainting results without additional filtering.

Table 2 shows the inpainting performance over the dataset Paris StreetView [32]. Compared with deterministic methods GC, EC, and MEDFE, the proposed method achieves the best FID scores over different mask ratios and consistently outperforms the diverse inpainting method PIC in both inpainting quality (FID) and inpainting diversity (LPIPS). In addition, Table 1 shows the inpainting performance over CelebA-HQ [21] and Places2 [65]. For CelebA-HQ, our method consistently outperforms all compared methods, especially in FID scores. For Places2, our method achieves comparable performance with deterministic methods in all evaluation metrics, and it generally outperforms them in FID scores. In addition, the numerical results of BAT-Fill suggest a clear superiority over the diverse inpainting method PIC [64], and better FID scores than ICT [41].

Table 1: Quantitative comparison of the proposed BAT-Fill with state-of-the-art methods over CelebA-HQ [21] and Places2 [65] validation images (1,000) with irregular masks [25] (* denotes that we trained the model based on the official implementation, † denotes that the results are copied from [41]). For each metric, the best score is highlighted in bold, and the best score among diverse inpainting methods (i.e., PIC [64] and Ours) is underlined.
Methods | Dataset | FID↓ | ℓ1(%)↓ | PSNR↑ | SSIM↑
(each metric reported for 20-40% / 40-60% / Random mask ratios)
EC* [30] | CelebA-HQ [21] | 9.06 / 16.45 / 12.46 | 2.19 / 4.71 / 3.40 | 26.60 / 22.14 / 24.45 | 0.923 / 0.823 / 0.877
GC [50] | CelebA-HQ [21] | 14.12 / 22.80 / 18.10 | 2.70 / 5.19 / 3.88 | 25.17 / 21.21 / 23.32 | 0.907 / 0.805 / 0.858
PIC [64] | CelebA-HQ [21] | 10.21 / 18.92 / 14.12 | 2.50 / 5.65 / 4.00 | 25.92 / 20.82 / 23.46 | 0.919 / 0.780 / 0.852
Ours | CelebA-HQ [21] | 6.32 / 12.50 / 9.33 | 1.91 / 4.57 / 3.18 | 27.82 / 22.40 / 25.21 | 0.944 / 0.834 / 0.890
EC† [30] | Places2 [65] | 25.64 / 39.27 / 30.13 | 2.20 / 4.38 / 2.93 | 26.52 / 22.23 / 25.51 | 0.880 / 0.731 / 0.831
GC† [50] | Places2 [65] | 24.76 / 39.02 / 29.98 | 2.15 / 4.40 / 2.80 | 26.53 / 21.19 / 25.69 | 0.881 / 0.729 / 0.834
MEDFE† [26] | Places2 [65] | 26.98 / 45.46 / 31.40 | 2.24 / 4.57 / 2.91 | 26.47 / 22.27 / 25.63 | 0.877 / 0.717 / 0.827
PIC† [64] | Places2 [65] | 26.39 / 49.09 / 33.47 | 2.36 / 5.07 / 3.15 | 26.10 / 21.50 / 25.04 | 0.865 / 0.680 / 0.806
ICT† [41] | Places2 [65] | 21.60 / 33.85 / 25.42 | 2.44 / 4.31 / 2.67 | 26.50 / 22.22 / 25.79 | 0.880 / 0.724 / 0.832
Ours | Places2 [65] | 17.78 / 32.55 / 22.16 | 2.15 / 4.64 / 2.84 | 26.47 / 21.74 / 25.69 | 0.879 / 0.704 / 0.826

4.3 Qualitative Evaluations

Figs. 4, 5, and 6 show the qualitative comparisons between BAT-Fill and the state-of-the-art image inpainting methods over the validation set of CelebA-HQ [21], Places2 [65] and Paris StreetView [32], respectively.

We first evaluate and compare BAT-Fill with EC [30], GC [50], and PIC [64] on CelebA-HQ [21] which contains facial images with similar semantics. As shown in Fig. 4, though EC [30] and GC [50] can synthesize complete facial images with reasonable semantics, they tend to generate distorted facial structures and artifacts in the missing regions which degrades inpainting greatly. In addition, EC [30] and GC [50] can only generate deterministic inpainting, which limits their applicability clearly. Both PIC [64] and BAT-Fill can generate diverse inpainting. However, the PIC generated images share similar makeups and facial features and thus have limited diversity. As a comparison, the BAT-Fill generated facial images vary across a wide range of makeups and facial features and contain much less artifacts, demonstrating that BAT-Fill can produce more diverse and realistic inpainting.

Next, we evaluate and compare BAT-Fill with EC [30], GC [50], MEDFE [26], and PIC [64] on the datasets Places2 [65] and Paris StreetView [32] where images have various semantics. In addition, visual comparison with ICT [41] is conducted over Places2 [65] dataset. As shown in Fig. 5, EC [30], GC [50] and MEDFE [26] tend to generate blurs and even corrupted texture in the inpainting images. The PIC [64] synthesized images suffer from unreasonable semantics, obvious artifacts, and limited diversity. Both ICT [41] and BAT-Fill achieved realistic image inpainting with much less artifacts and better diversity compared with other methods. For Paris StreetView [32], BAT-Fill produced more diverse and plausible results than the PIC [64], and meanwhile achieved comparable or even better inpainting quality compared with the deterministic methods.

Table 2: Quantitative comparison of the proposed BAT-Fill with state-of-the-art methods over Paris StreetView [32] validation images (100) with irregular masks [25] (* denotes that we trained the model based on the official implementation). For each metric, the best score is highlighted in bold, and the best score among diverse inpainting methods (i.e., PIC [64] and Ours) is underlined.
Metrics | Mask Ratio | EC [30] | GC* [50] | MEDFE [26] | PIC [64] | Ours
FID↓ | 20-40% | 42.81 | 71.02 | 36.84 | 56.83 | 36.19
ℓ1(%)↓ | 20-40% | 2.63 | 3.56 | 2.29 | 3.43 | 2.70
PSNR↑ | 20-40% | 26.76 | 23.95 | 27.64 | 24.80 | 26.52
SSIM↑ | 20-40% | 0.874 | 0.796 | 0.898 | 0.817 | 0.864
LPIPS↑ | 20-40% | N/A | N/A | N/A | 0.046 | 0.076
FID↓ | 40-60% | 72.78 | 98.32 | 77.26 | 90.91 | 64.20
ℓ1(%)↓ | 40-60% | 5.18 | 6.31 | 5.54 | 7.47 | 5.83
PSNR↑ | 40-60% | 22.77 | 20.83 | 22.01 | 20.12 | 21.89
SSIM↑ | 40-60% | 0.712 | 0.631 | 0.704 | 0.570 | 0.678
LPIPS↑ | 40-60% | N/A | N/A | N/A | 0.127 | 0.147
FID↓ | Random | 55.29 | 84.16 | 54.99 | 72.16 | 48.19
ℓ1(%)↓ | Random | 3.63 | 4.64 | 3.58 | 4.94 | 3.96
PSNR↑ | Random | 25.04 | 22.61 | 25.24 | 22.97 | 24.50
SSIM↑ | Random | 0.806 | 0.727 | 0.818 | 0.718 | 0.786
LPIPS↑ | Random | N/A | N/A | N/A | 0.082 | 0.106

4.4 Ablation Study

We study the effectiveness of the proposed BAT by conducting ablation studies over Paris StreetView [32]. In the ablation study, we remove the two key components from BAT respectively, which results in two models: 1) w/o bidirectional context, where we obtain the same objective as the autoregressive model that predicts the missing tokens by conditioning on previous tokens with unidirectional attention; 2) w/o autoregressive modeling, where the model is equivalent to MLM that independently reconstructs the missing tokens. To measure the diversity of MLM, we employ Gibbs sampling to iteratively sample tokens and place the predicted tokens back into the original sequence instead of directly outputting all the predicted tokens. For a fair comparison, we apply the same irregular masks (mask ratios 40-60%) on the same low-resolution images ($32\times 32$) from the validation set of Paris StreetView [32]. Given the same inputs, the reconstructed structures of each model are evaluated without applying the texture generator.

As shown in Table 3, using AR alone greatly degrades the quality of the reconstructed structures, and its high diversity measured by LPIPS is largely attributed to the poor reconstruction quality. MLM performs reasonably well as it exploits the bidirectional context for inpainting. However, the proposed BAT clearly outperforms it in reconstruction quality, which is mainly reflected by FID, and achieves comparable diversity as reflected by LPIPS. This is mainly because BAT models the output dependency to align future predictions with previously predicted tokens, which improves the consistency of the reconstructed structures. Overall, the ablation study demonstrates that the proposed BAT addresses the constraints of AR and MLM effectively.

Table 3: Ablation study of the proposed BAT over the Paris StreetView [32] validation set (100) with irregular masks [25] and mask ratios of 40%-60%.
Models | FID↓ | ℓ1(%)↓ | PSNR↑ | SSIM↑ | LPIPS↑
w/o bidirectional | 59.60 | 8.599 | 19.32 | 0.518 | 0.0518
w/o autoregressive | 49.62 | 6.38 | 22.01 | 0.655 | 0.0304
Ours | 40.91 | 5.76 | 23.01 | 0.714 | 0.0301

5 Conclusion

This paper presents BAT-Fill, a novel image inpainting framework that achieves realistic and diverse inpainting by leveraging autoregressive transformers and their powerful long-range dependency modeling capacity. To improve the quality and diversity of inpainting, we propose a novel bidirectional and autoregressive transformer (BAT) that models bidirectional context and output dependency simultaneously. Extensive experiments show that BAT-Fill achieves superior image inpainting in terms of both quality and diversity. Moving forward, we will explore the feasibility of adapting our idea to other image recovery or generation tasks by replacing the non-predicted part of BAT with other conditions such as semantic labels, edges, and poses.

References

  • [1] Coloma Ballester, Marcelo Bertalmio, Vicent Caselles, Guillermo Sapiro, and Joan Verdera. Filling-in by joint interpolation of vector fields and gray levels. IEEE Trans. Image Process., 10(8):1200–1211, 2001.
  • [2] Coloma Ballester, Vicent Caselles, Joan Verdera, Marcelo Bertalmio, and Guillermo Sapiro. A variational model for filling-in gray level and color images. In ICCV, volume 1, pages 10–16, 2001.
  • [3] Connelly Barnes, Eli Shechtman, Adam Finkelstein, and Dan B Goldman. Patchmatch: a randomized correspondence algorithm for structural image editing. ACM Transactions on Graphics (TOG), 28(3):1–11, 2009.
  • [4] Marcelo Bertalmio, Andrea L Bertozzi, and Guillermo Sapiro. Navier-stokes, fluid dynamics, and image and video inpainting. In CVPR, volume 1, pages 355–355, 2001.
  • [5] Marcelo Bertalmio, Guillermo Sapiro, Vincent Caselles, and Coloma Ballester. Image inpainting. In Proceedings of the 27th annual conference on Computer graphics and interactive techniques, pages 417–424, 2000.
  • [6] Marcelo Bertalmio, Luminita Vese, Guillermo Sapiro, and Stanley Osher. Simultaneous structure and texture image inpainting. TIP, 12(8):882–889, 2003.
  • [7] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. arXiv preprint arXiv:2005.12872, 2020.
  • [8] Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, and Ilya Sutskever. Generative pretraining from pixels. In International Conference on Machine Learning, pages 1691–1703. PMLR, 2020.
  • [9] Soheil Darabi, Eli Shechtman, Connelly Barnes, Dan B Goldman, and Pradeep Sen. Image melding: Combining inconsistent images using patch-based synthesis. ACM Transactions on graphics (TOG), 31(4):1–10, 2012.
  • [10] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, pages 248–255. IEEE, 2009.
  • [11] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  • [12] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
  • [13] Patrick Esser, Robin Rombach, and Björn Ommer. Taming transformers for high-resolution image synthesis. arXiv preprint arXiv:2012.09841, 2020.
  • [14] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
  • [15] James Hays and Alexei A Efros. Scene completion using millions of photographs. ACM Transactions on Graphics (TOG), 26(3):4–es, 2007.
  • [16] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
  • [17] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems, pages 6626–6637, 2017.
  • [18] Satoshi Iizuka, Edgar Simo-Serra, and Hiroshi Ishikawa. Globally and locally consistent image completion. ACM Trans. Graph., 36(4):1–14, 2017.
  • [19] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In CVPR, pages 1125–1134, 2017.
  • [20] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision, pages 694–711. Springer, 2016.
  • [21] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of gans for improved quality, stability, and variation. International Conference on Learning Representations, 2018.
  • [22] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [23] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
  • [24] Chen-Hsuan Lin, Ersin Yumer, Oliver Wang, Eli Shechtman, and Simon Lucey. St-gan: Spatial transformer generative adversarial networks for image compositing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9455–9464, 2018.
  • [25] Guilin Liu, Fitsum A Reda, Kevin J Shih, Ting-Chun Wang, Andrew Tao, and Bryan Catanzaro. Image inpainting for irregular holes using partial convolutions. In ECCV, pages 85–100, 2018.
  • [26] Hongyu Liu, Bin Jiang, Yibing Song, Wei Huang, and Chao Yang. Rethinking image inpainting via a mutual encoder-decoder with feature equalizations. In ECCV, 2020.
  • [27] Hongyu Liu, Bin Jiang, Yibing Song, Wei Huang, and Chao Yang. Rethinking image inpainting via a mutual encoder-decoder with feature equalizations. In ECCV, pages 725–741, 2020.
  • [28] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In International Conference on Computer Vision, pages 3730–3738, 2015.
  • [29] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
  • [30] Kamyar Nazeri, Eric Ng, Tony Joseph, Faisal Z Qureshi, and Mehran Ebrahimi. Edgeconnect: Generative image inpainting with adversarial edge learning. arXiv preprint arXiv:1901.00212, 2019.
  • [31] Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. Semantic image synthesis with spatially-adaptive normalization. In CVPR, pages 2337–2346, 2019.
  • [32] Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A Efros. Context encoders: Feature learning by inpainting. In CVPR, pages 2536–2544, 2016.
  • [33] Jialun Peng, Dong Liu, Songcen Xu, and Houqiang Li. Generating diverse structure for image inpainting with hierarchical vq-vae. arXiv preprint arXiv:2103.10022, 2021.
  • [34] Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI Blog, 2019.
  • [35] Yurui Ren, Xiaoming Yu, Ruonan Zhang, Thomas H Li, Shan Liu, and Ge Li. Structureflow: Image inpainting via structure-aware appearance flow. In ICCV, pages 181–190, 2019.
  • [36] Ashish Shrivastava, Tomas Pfister, Oncel Tuzel, Joshua Susskind, Wenda Wang, and Russell Webb. Learning from simulated and unsupervised images through adversarial training. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2107–2116, 2017.
  • [37] Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. Mpnet: Masked and permuted pre-training for language understanding. arXiv preprint arXiv:2004.09297, 2020.
  • [38] Aravind Srinivas, Tsung-Yi Lin, Niki Parmar, Jonathon Shlens, Pieter Abbeel, and Ashish Vaswani. Bottleneck transformers for visual recognition. arXiv preprint arXiv:2101.11605, 2021.
  • [39] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. arXiv preprint arXiv:2012.12877, 2020.
  • [40] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008, 2017.
  • [41] Ziyu Wan, Jingbo Zhang, Dongdong Chen, and Jing Liao. High-fidelity pluralistic image completion with transformers. arXiv preprint arXiv:2103.14031, 2021.
  • [42] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.
  • [43] Rongliang Wu and Shijian Lu. Leed: Label-free expression editing via disentanglement. In European Conference on Computer Vision, pages 781–798. Springer, 2020.
  • [44] Rongliang Wu, Gongjie Zhang, Shijian Lu, and Tao Chen. Cascade ef-gan: Progressive facial expression editing with local focuses. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5021–5030, 2020.
  • [45] Shunxin Xu, Dong Liu, and Zhiwei Xiong. E2I: Generative inpainting from edge to image. IEEE Trans. Circuit Syst. Video Technol., 2020.
  • [46] Zhaoyi Yan, Xiaoming Li, Mu Li, Wangmeng Zuo, and Shiguang Shan. Shift-Net: Image inpainting via deep feature rearrangement. In ECCV, pages 1–17, 2018.
  • [47] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V Le. Xlnet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237, 2019.
  • [48] Jiahui Yu, Zhe Lin, Jimei Yang, Xiaohui Shen, Xin Lu, and Thomas S Huang. Generative image inpainting with contextual attention. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5505–5514, 2018.
  • [49] Jiahui Yu, Zhe Lin, Jimei Yang, Xiaohui Shen, Xin Lu, and Thomas S. Huang. Generative image inpainting with contextual attention. In CVPR, pages 5505–5514, 2018.
  • [50] Jiahui Yu, Zhe Lin, Jimei Yang, Xiaohui Shen, Xin Lu, and Thomas S Huang. Free-form image inpainting with gated convolution. In ICCV, pages 4471–4480, 2019.
  • [51] Fangneng Zhan and Shijian Lu. Esir: End-to-end scene text recognition via iterative image rectification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2059–2068, 2019.
  • [52] Fangneng Zhan, Shijian Lu, Changgong Zhang, Feiying Ma, and Xuansong Xie. Adversarial image composition with auxiliary illumination. In Proceedings of the Asian Conference on Computer Vision, 2020.
  • [53] Fangneng Zhan, Chuhui Xue, and Shijian Lu. Ga-dan: Geometry-aware domain adaptation network for scene text detection and recognition. In Proceedings of the IEEE International Conference on Computer Vision, pages 9105–9115, 2019.
  • [54] Fangneng Zhan, Yingchen Yu, Kaiwen Cui, Gongjie Zhang, Shijian Lu, Jianxiong Pan, Changgong Zhang, Feiying Ma, Xuansong Xie, and Chunyan Miao. Unbalanced feature transport for exemplar-based image translation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021.
  • [55] Fangneng Zhan, Yingchen Yu, Rongliang Wu, Changgong Zhang, Shijian Lu, Ling Shao, Feiying Ma, and Xuansong Xie. Gmlight: Lighting estimation via geometric distribution approximation. arXiv preprint arXiv:2102.10244, 2021.
  • [56] Fangneng Zhan and Changgong Zhang. Spatial-aware gan for unsupervised person re-identification. Proceedings of the International Conference on Pattern Recognition, 2020.
  • [57] Fangneng Zhan, Changgong Zhang, Yingchen Yu, Yuan Chang, Shijian Lu, Feiying Ma, and Xuansong Xie. Emlight: Lighting estimation via spherical distribution approximation. AAAI, 2020.
  • [58] Fangneng Zhan, Hongyuan Zhu, and Shijian Lu. Spatial fusion gan for image synthesis. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3653–3662, 2019.
  • [59] Gongjie Zhang, Kaiwen Cui, Tzu-Yi Hung, and Shijian Lu. Defect-gan: High-fidelity defect synthesis for automated defect inspection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 2524–2534, 2021.
  • [60] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, pages 586–595, 2018.
  • [61] Lei Zhao, Qihang Mo, Sihuan Lin, Zhizhong Wang, Zhiwen Zuo, Haibo Chen, Wei Xing, and Dongming Lu. Uctgan: Diverse image inpainting based on unsupervised cross-space translation. In CVPR, pages 5741–5750, 2020.
  • [62] Shengyu Zhao, Jonathan Cui, Yilun Sheng, Yue Dong, Xiao Liang, Eric I Chang, and Yan Xu. Large scale image completion via co-modulated generative adversarial networks. In International Conference on Learning Representations (ICLR), 2021.
  • [63] Shengjia Zhao, Jiaming Song, and Stefano Ermon. Towards deeper understanding of variational autoencoding models. arXiv preprint arXiv:1702.08658, 2017.
  • [64] Chuanxia Zheng, Tat-Jen Cham, and Jianfei Cai. Pluralistic image completion. In CVPR, pages 1438–1447, 2019.
  • [65] Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba. Places: A 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
  • [66] Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. Deformable detr: Deformable transformers for end-to-end object detection. arXiv preprint arXiv:2010.04159, 2020.