DeepFacePencil: Creating Face Images from Freehand Sketches

Yuhang Li [email protected] NEL-BITA, University of Science and Technology of China , Xuejin Chen [email protected] NEL-BITA, University of Science and Technology of China , Binxin Yang [email protected] NEL-BITA, University of Science and Technology of China , Zihan Chen [email protected] NEL-BITA, University of Science and Technology of China , Zhihua Cheng [email protected] NEL-BITA, University of Science and Technology of China and Zheng-Jun Zha [email protected] NEL-BITA, University of Science and Technology of China

(2020)

Abstract.

In this paper, we explore the task of generating photo-realistic face images from hand-drawn sketches. Existing image-to-image translation methods require a large-scale dataset of paired sketches and images for supervision. They typically utilize synthesized edge maps of face images as training data. However, these synthesized edge maps strictly align with the edges of the corresponding face images, which limit their generalization ability to real hand-drawn sketches with vast stroke diversity. To address this problem, we propose DeepFacePencil, an effective tool that is able to generate photo-realistic face images from hand-drawn sketches, based on a novel dual generator image translation network during training. A novel spatial attention pooling (SAP) is designed to adaptively handle stroke distortions which are spatially varying to support various stroke styles and different level of details. We conduct extensive experiments and the results demonstrate the superiority of our model over existing methods on both image quality and model generalization to hand-drawn sketches.

Sketch-based synthesis, face image generation, spatial attention, dual generator, conditional generative adversarial networks

^†^†submissionid: 1570^†^†journalyear: 2020^†^†copyright: acmcopyright^†^†conference: Proceedings of the 28th ACM International Conference on Multimedia; October 12–16, 2020; Seattle, WA, USA^†^†booktitle: Proceedings of the 28th ACM International Conference on Multimedia (MM ’20), October 12–16, 2020, Seattle, WA, USA^†^†price: 15.00^†^†doi: 10.1145/3394171.3413684^†^†isbn: 978-1-4503-7988-5/20/10^†^†ccs: Information systems Multimedia content creation^†^†ccs: Computer systems organization Neural networks

1. Introduction

Flexibly creating new content is one of the most important goals in both computer graphics and computer-human interaction. While sketching is an efficient and natural way for common users to express their ideas for designing and editing new content, sketch-based interaction techniques have been extensively studied (Sutherland, 1964; Zeleznik et al., 1996; Igarashi et al., 1999; Chen et al., 2008, 2009). Imagery content is the most ubiquitous media with a large variety of display devices everywhere in our daily life. Creating new imagery content is one way to show people’s creativity and communicate smart ideas. In this paper, we target portrait imagery, which is inextricably bound to our life, and present a sketch-based system, DeepFacePencil, which allows common users to create new face imagery by specifying the desired facial shapes via freehand sketching.

Deep learning techniques have brought significant improvements on the realism of virtual images. Recently, a large amount of studies have been conducted on general image-to-image translation which aims to translate an image from one domain to a corresponding image in another domain, while preserving the same content, such as structure, scenes or objects (Isola et al., 2017; Wang et al., 2018; Zhu et al., 2017a; Kim et al., 2017; Yi et al., 2017; Zhu et al., 2017b). Treating sketches as the source domain and realistic face images as the target domain, this task is a typical image-to-image translation problem. However, exiting image-to-image translation techniques are not off-the-shelf for this task due to the underlying challenges: data scarcity in the sketch domain and ambiguity in freehand sketches.

Since there exists no large-scale dataset of paired sketch and face images and collecting hand-drawn sketches is time-consuming, existing methods (Isola et al., 2017; Wang et al., 2018; Li et al., 2019) utilize edge maps or contours of real face images as training data when applied on the sketch-to-face task. Figure 1 shows multiple styles of synthesized edge maps, face contours or semantic boundaries. These synthesized edge maps or contours enable existing models to be trained in a supervised manner and obtain plausible results on synthesized edge maps or contours. However, models trained on synthesized data are not able to achieve satisfactory results on hand-drawn sketches, specially on those drawn by users without considerable drawing skills.

Since strokes in edge maps and contours align perfectly with edges of the corresponding real images, models trained on edge-aligned data tend to generate unreal shapes of facial parts following the inaccurate strokes when the input sketch is poorly drawn. Hence, for an imperfect hand-drawn sketch, there is a trade-off between the realism of the synthesized image and the conformance between the input sketch and the edges of the synthesized image. Models with strong edge alignment fail to be generalized to hand-drawn sketches that contains many imperfect strokes.

Moreover, we observe that the balance between image realism and shape conformance mentioned above varies from one position to another across a synthesized image. In a portrait sketch, some facial parts might be well drawn while the others are not. For the well-drawn facial parts, the balance is supposed to move towards the conformance in order to ensure those parts in the synthesized image depicting the user’s imagination. On the other hand, for the poorly-drawn parts, the image realism should be emphasized by not strictly following the irregular shapes and strokes.

Based on the discussion above, we propose a novel sketch-based face image synthesis framework that is robust to hand-drawn sketches. A new module, named spatial attention pooling (SAP), is designed to adjust the spatially varying balance between realism and conformance adaptively across an image. In order to break the strict alignment between sketches and real images, our SAP relaxes strokes with one-pixel width to multiple-pixel widths using pooling operators. A larger width of a stroke, which is controlled by the kernel size of a pooling operator, indicates less restriction between this stroke and the corresponding edge in the synthesized image. However, the kernel size is not trainable via back propagation. Hence, for an input sketch, multiple branches of pooling operators with different kernel sizes are added in SAP to get multiple relaxed sketches with different widths. The relaxed sketches are then fused by a spatial attention layer which adjusts the balance of realism and conformance. For different location in a portrait sketch, the spatial attention layer assigns high attention to the relaxed sketch with large width if this position requires more realism than conformance. In order to more effectively train the SAP module, we propose a dual-generator framework to enforce the embedded features after SAP from imperfect sketches to be consistent with the features of well-synthesized boundaries from real images. This dual-generator training framework promotes the SAP to perceive the imperfectness of local strokes and rectify the synthesized face regions from the imperfect stroke to realistic image domain.

In summary, our contribution in this paper is three-fold.

•

Based on comprehensive analysis on the edge alignment issue in image translation frameworks, we propose a sketch-to-face translation system that is robust to hand-drawn sketches with various drawing skills.
•

A novel deep neural network module for sketch, named spatial attention pooling, is designed to adaptively adjust the spatially varying balance between the realism of the synthesized image and the conformance between the input sketch and the synthesized image.
•

We propose a dual-generator framework which effectively trains the SAP to sense the distortion of imperfect strokes and rectify the embedding features to realistic face domain.

Refer to caption — Figure 1. From a real face image, different styles of synthesized sketches can be generated by edge detection, contour alignment, or semantic segmentation.

2. Related Work

Our method is related to studies on image-to-image translation, sketch-based image generation, as well as face image generation and editing. In this section, we discuss the approaches that are most related to our work.

2.1. Image-to-Image Translation

Given an input image from one domain, an image-to-image translation model outputs a corresponding image from another domain and preserves the content in the input image. Existing image-to-image translation models are based on generative adversarial networks conditioned on images. Pix2pix (Isola et al., 2017) is the first general image-to-image translation model which can be applied to different scenarios according to paired training images, such as, semantic maps to natural images, day images to night images, image colorization, and edge maps to images. (Karacan et al., 2016) utilizes semantic label maps and attributes of outdoor scenes as input and generates the corresponding photo-realistic images. In order to model multi-modal distribution of output images, BicycleGAN (Zhu et al., 2017b) encourages the connection between the output and the latent code to be invertible. CycleGAN (Zhu et al., 2017a), DualGAN (Yi et al., 2017), and DiscoGAN (Kim et al., 2017) propose unsupervised image translation models with a common idea named cycle consistency, which is borrowed from language translation literature. Pix2pixHD (Wang et al., 2018) is proposed as a high-resolution image-to-image translation model for generating photo-realistic image from semantic label maps using a coarse-to-fine generator and a multi-scale discriminator. It can also be applied to edge-to-photo generation when trained on paired edge maps and photos. However, the large gap between synthesized edge maps and hand-drawn sketches challenges the generalization of these models.

2.2. Sketch-based Image Generation

Sketch-based image generation is a hot topic in multimedia and computer graphics. Given a sketch which describes the desired scene layout with text labels for objects, traditional methods, such as Sketch2Photo (Chen et al., 2009) and PhotoSketcher (Eitz et al., 2011), search image patches from a large-scale image dataset and fuse the retrieved image patches together according to the sketch. These methods are not able to ensure the global consistency of the resultant image and fail to generate totally new images. Nevertheless, it is challenging for these methods to ensure global consistency of the resultant images. Thus they frequently fail to generate totally new images. After the breakthroughs made by deep neural networks (DNNs) in many image understanding tasks, a variety of DNN-based models have been proposed for sketch-based image generation. The general image-to-image translation models mentioned above can be easily extended to sketch-based image generation once sketches and their corresponding images are available as training data. Besides, a few other models are designed specially for sketch inputs. SketchyGAN (Chen and Hays, 2018) aims to generate real images from multi-class sketches. A novel neural network module, called mask residual unit (MRU), is proposed to improve the information flow by injecting the input image at multiple scales. Edge maps are extracted from real images and utilized as training sketches. However, the resultant images of SketchyGAN are still not satisfied. LinesToFacePhoto (Li et al., 2019) employs a conditional self-attention module to preserve the completeness of global facial structure in generated face images. However, this model cannot be generalized to hand-drawn sketches directly due to distinct stroke characteristics.

2.3. Face Image Generation and Editing

Recently, studies on face image generation and editing have made tremendous progress. Generative adversarial network (GAN) (Goodfellow et al., 2014), which generates images from noise, is widely used in a wide range of applications (Ledig et al., 2017; Zhang et al., 2017; Zhang et al., 2019; Huang et al., 2020; Liu et al., 2019; Zha et al., 2020). Using GAN, realistic face images can be generated from noise vectors. DCGAN (Radford et al., 2016) introduces a novel network to stabilize training of GAN. PGGAN (Karras et al., 2018) utilizes a progressively growing architecture to generate high-resolution face images. Inspired by style transfer literature, StyleGAN (Karras et al., 2019) introduces a novel generator which synthesizes plausible high-resolution face images and learns unsupervised separation of high-level attributes and stochastic variation in synthesized images. On the other side, a number of works focus on face image editing through different control information. StarGAN (Choi et al., 2018) designs a one-to-many translation framework which switches face attributes assigned by an attribute code. FaceShop (Portenier et al., 2018) and SC-FEGAN (Jo and Park, 2019) treat sketch-based face editing as a sketch-guided image inpainting problem where stroke colors are also applied as guidance information. In this work, we focus more on synthesizing realistic face images from imperfectly drawn sketches.

3. Our Sketch-to-Photo Translation Network

The task of sketch-to-photo translation can be defined as looking for a generator $G$ so that the generated image $G(S)$ from a hand-drawn sketch $S$ looks realistic and keeps the shape characteristics of the input sketch. Existing image translation techniques train a neural network as the generator with paired of sketch and photo data $(\mathcal{S},\mathcal{X})$ . Due to the scarcity of real hand-drawn sketches, existing techniques synthesize sketches in a certain style to approximate the sketch set $\mathcal{S}$ from face image set $\mathcal{X}$ to train their generator in an adversarial manner. The synthesized sketches $\mathcal{S}_{syn}$ are usually precisely aligned with the face images and present different distributions from hand-drawn sketches $\mathcal{S}$ . These models typically fail to generalize to hand-drawn sketches by common users.

We propose a novel network architecture with a specially designed training strategy to improve the capability of the sketch-based image generator. Figure 2 shows the overview of our method. In order to synthesize a set of sketches that has similar distribution with hand-drawn sketches $\mathcal{S}$ , we deform the edge-aligned sketches $\mathcal{S}_{syn}$ to generate a set of deformed sketches $\mathcal{S}_{dfm}$ to augment the training set. We propose a novel framework using dual generators from the edge-aligned sketches $\mathcal{S}_{syn}$ and the deformed sketches $\mathcal{S}_{dfm}$ respectively. $G_{m}$ is the main generator trained with deformed sketches $\mathcal{S}_{dfm}$ , aiming to generate plausible photo-realistic face images from unseen hand-drawn sketches in the test stage. $G_{a}$ is an auxiliary generator trained with edge-aligned sketches whose goal is to guide $G_{m}$ to adaptively sense the line distortion in deformed sketches and rectify the distortion in feature space. A spatial attention pooling module (SAP) is added before the encoder $E_{m}$ of $G_{m}$ to adjust the spatially varying balance between the realism of generated images and the conformance between the generated image and the input sketch.

3.1. Synthesized Sketches and Deformation

Paired face sketch-photo dataset is required for supervised sketch-to-face translation methods. Since there exits no large-scale paired sketch-image dataset, the training sketches used by existing methods (Isola et al., 2017; Li et al., 2019) are generated from face image dataset, e.g. CelebA-HQ face dataset, using edge detection algorithm such as HED (Xie and Tu, 2015). However, the level of details in edge maps rely heavily on the value of a threshold in the edge detection algorithm. An edge map with a large threshold contains too many redundant edges while an edge map with a small threshold fails to preserve the entire global facial structure (Li et al., 2019). Pix2pixHD (Wang et al., 2018) introduces another method to generate sketches from face images. Given a face image, the face landmarks are detected using an off-shelf landmark detection model. A new kind of sketch, denoted as face contour, is obtained by connecting the landmarks in a fixed order. However, since the pre-defined face landmarks mainly depict the facial area while ignoring details, a sketch-to-face model trained by face contours fails to generalize to hand-drawn sketches with hair, beard, or ornaments .

Based on the discussion above, we utilize a new kind of generated sketches with the assist of semantic maps. The CelebAMask-HQ dataset (Lee et al., 2020) provides a face semantic map for each face image in CelebA-HQ dataset. We use the boundaries of the semantic map as a synthesized sketch of the corresponding face image. Figure 1 compares different styles of synthetic sketches, including an edge map (b), a face contour (c), region boundaries of semantic maps (d), and deformed boundary map (e) from the same real image (a).

Stroke Deformation

A shortcoming of sketches that are generated from semantic boundary or edge detector is that the synthesized sketch lines are perfectly aligned to edges of the corresponding face images. In order to break the strict edge alignment between sketches and the corresponding real images and mimic stroke characteristics of hand-drawn sketches, we apply line deformation similar with FaceShop (Portenier et al., 2018). Specifically, we vectorize lines of each sketch using AutoTrace (Weber, 2018). Then offsets randomly selected from $[-d,d]^{2}$ are added to the control points and end points of the vectorized lines, where $d$ is the maximum offset to control the deformation degree. We set $d=11$ in our experiments unless specifically mentioned. We use the semantic boundary map as edge-aligned sketch $\mathcal{S}_{syn}$ , and semantic boundary map with random deformation as deformed sketch $\mathcal{S}_{dfm}$ .

3.2. Spatial Attention Pooling

A sketch-to-image model trained with edge-aligned sketch-image pairs tends to generate images whose edges strictly align with the strokes of the input sketch. When an input hand-drawn sketch is not well-drawn, line distortions in the input sketch damage the quality of the generated face image. There is a trade-off between the realism of the generated face image and the conformance between the input sketch and the output face image. In order to alleviate the edge alignment between the input sketch and the output face image, we propose to relax thin strokes to a tolerance region with various width. A straightforward way is to smooth the strokes to multi-pixel width by image smoothing or dilation. However, the capacity of this hand-crafted way is limited, because the uniform smoothness for all positions of the whole sketch violates the unevenness of hand-drawn sketches on depicting different facial parts. We argue that the balance between the realism and the conformance differs from one position to another across the face image. Even in the same input sketch, the user might draw strokes in some parts (such as eyes and mouth) carefully with little distortions, while draw strokes in other parts (such as chin and hair) roughly. The balance moves towards conformance at a well-drawn part with realistic shape to meet the user’s intention, while it moves towards realism at a poorly-drawn part to ensure the quality of the generated image. Therefore, the relaxation degree should be spatially varying.

We propose a new module, called spatial attention pooling (SAP), to adaptively relax strokes in the input sketch to spatially varying tolerance regions. A stroke with a larger width indicates less restriction between this stroke and the corresponding edge in the synthesized image. The stroke widths are controlled by the kernel sizes of pooling operators. However, the kernel size of an pooling operator is not trainable using back propagation. Instead, we apply multiple pooling operators with different kernel sizes to get multiple relaxed sketches with different stroke widths. The relaxed sketches are then fused by a spatial attention layer which spatially adjusts the balance between realism and conformance.

The architecture of SAP is shown in Figure 3. Given an input deformed sketch $\mathcal{S}_{dfm}\in{\mathbb{R}}^{H\times W}$ , we first pass it through $N_{r}$ pooling branches with different kernel sizes of $\{r_{i}\}_{i=1,\ldots,N_{r}}$ to get $P_{i}=Pooling(\mathcal{S}_{dfm};r_{i})$ . Then we utilize convolutional layers to extract feature maps of $P_{i}$ separately. These feature maps are concatenated to get a relaxed representation of $\mathcal{S}_{dfm}$ , denoted as $R$ :

(1)

R=Concat\big{(}Conv_{1}(P_{1}),Conv_{2}(P_{2}),...,Conv_{N_{r}}(P_{N_{r}})\big{)},

where $Conv_{i}()$ indicates convolutional layers for each pooling branch, and $Concat$ is a channel-wise concatenation operator.

In order to control the relax degrees of all positions, we compute a spatial attention map $A$ to assign different attention weights to $R$ . A stroke with more distortion is supposed to be assigned with larger relax degree. Hence, $A$ is supposed to adaptively pay more attention (a larger weight) to a $Conv_{i}(P_{i})$ that comes from a larger pooling kernel size in the areas with more distortions. In order to perceive the distortion degree of different regions in a hand-drawn sketch, we employ a pre-trained binary classifier $C$ with three convolutional layers to distinguish edge-aligned $\mathcal{S}_{syn}$ sketches from deformed sketches $\mathcal{S}_{dfm}$ . We utilize this classifier to extract features of the input sketch $\mathcal{S}_{dfm}$ to get ${F^{C}_{i}(\mathcal{S}_{dfm}),i=1,2,3}$ , where $F^{C}_{i}()$ denotes the feature maps extracted by the $i^{th}$ convolution layers of $C$ . These feature maps emphasize the differences between edge-aligned sketches and deformed sketches. We resize and concatenate these feature maps, and pass them through three convolutional layers to get the spatial attention map:

(2)

A=softmax\Big{(}Conv\big{(}[F^{C}_{1},Up_{2}(F^{C}_{2}),Up_{4}(F^{C}_{3})]\big{)}\Big{)},

where $Up_{2}$ and $Up_{4}$ indicates $2\times$ and $4\times$ upsampling, $Conv()$ indicates three cascaded convoutional layers, and $softmax()$ is a softmax layer computed over channels to ensuring that for each position of $A$ , the sum of weights of all channels equals to $1$ .

At last, the output SAP is computed as:

(3)

SAP(\mathcal{S}_{dfm})=A*R,

where $*$ is element-wise multiplication.

3.3. Training with Dual Generators

The goal of SAP is to sense stroke distortions in different regions of a hand-drawn sketch and then relax the edge alignment accordingly in the synthesized face image. However, directly adding the SAP module at the front end of a general image translation network does not guarantee that the SAP can be effectively trained under the loss with regard to the synthesized image only. Following a general image translation network architecture, the realism of synthesized images relies on the coincidence of the embedded structure from the input sketch and real face images. In our dual-generator framework shown in Figure 2, we enforce the embedded features from the deformed sketch with SAP to be consistent with the synthesized edge-aligned sketches by introducing a generator feature matching loss $\mathcal{L}_{GFM}$ between the embedded features from the deformed sketch $\mathcal{S}_{syn}$ and the edge-aligned sketch $\mathcal{S}_{dfm}$ . The main generator $G_{m}$ for deformed sketches, the auxiliary generator $G_{a}$ for edge-aligned sketches, and the discriminator $D$ are trained with the following objectives.

Reconstruction Loss

For either generator, a reconstruction loss is applied to guide the generated image close to its corresponding real image $x$ .

(4)		$\displaystyle\mathcal{L}_{rec}(G_{a})$	$\displaystyle=\mathbb{E}\\|G_{a}(\mathcal{S}_{syn})-x\\|_{1}$
(4)		$\displaystyle\mathcal{L}_{rec}(G_{m})$	$\displaystyle=\mathbb{E}\\|G_{m}(\mathcal{S}_{dfm})-x\\|_{1}.$

Adversarial Loss

The multi-scale discriminator (Wang et al., 2018) $D$ consists of three sub-discriminators $\{D_{i}\}_{i=1,2,3}$ . The adversarial loss for generator $G_{a}$ and $D$ is defined as:

(5)		$\displaystyle\mathcal{L}_{adv}(G_{a};D)$	$\displaystyle=\frac{1}{3}\sum_{i=1}^{3}\mathbb{E}\big{[}\log D_{i}({\mathcal{S}_{syn}},{x})\big{]}$
(5)			$\displaystyle+\frac{1}{3}\sum_{i=1}^{3}\mathbb{E}\Big{[}\log\Big{(}1-D_{i}\big{(}{\mathcal{S}_{syn}},G_{a}({\mathcal{S}_{syn}})\big{)}\Big{)}\Big{]}.$

The adversarial loss for $G_{m}$ and $D$ , denoted as $\mathcal{L}_{adv}(G_{m};D)$ is defined similarly.

Discriminator Feature Matching Loss

Similar to pix2pixHD (Wang et al., 2018) and Lines2FacePhoto (Li et al., 2019), we use a discriminator feature matching loss as the perceptual loss, which is designed to minimize the content difference between generated image and the real image in feature space. The discriminator feature matching loss uses the discriminator $D$ as the feature extractor. Let $D^{q}_{i}()$ be the output of $q^{th}$ layer in $D_{i}$ . This feature matching loss is defined as:

(6)		$\displaystyle\mathcal{L}_{DFM}(G_{a})$	$\displaystyle=\frac{1}{3N_{Q}}\mathbb{E}\sum_{i=1}^{3}\sum_{q\in Q}\frac{1}{n_{i}^{q}}\\|D^{q}_{i}\big{(}{\mathcal{S}_{syn}},G_{a}({\mathcal{S}_{syn}})\big{)}$
(6)			$\displaystyle-D^{q}_{i}\big{(}{\mathcal{S}_{syn}},{x}\big{)}\\|_{1},$

where $Q$ is the selected layers of discriminator for computing this loss, $N_{Q}$ is the number of elements in $Q$ , $n^{q}_{i}$ is the number of elements in $D^{q}_{i}$ . The discriminator feature matching loss for $G_{m}$ and $D$ , denoted as $\mathcal{L}_{DFM}(G_{m})$ , is defined similarly.

Generator Feature Matching Loss

Similar to discriminator feature matching loss which is designed to minimize the content difference between generated images and real images in feature space, our generator feature matching loss aims to minimize the content difference between the embedding of edge-aligned sketches $\mathcal{S}_{syn}$ and deformed sketches $\mathcal{S}_{dfm}$ in the generator feature space. Let $G_{a}^{t}()$ and $G_{m}^{t}()$ be the output of $t^{th}$ layer in $G_{a}$ and $G_{m}$ respectively. This loss is calculated as:

(7)

\mathcal{L}_{GFM}(G_{a},G_{m})=\mathbb{E}\frac{1}{N_{T}}\sum_{t\in T}\frac{1}{|n^{t}|}\|G_{a}^{t}(\mathcal{S}_{syn})-G_{m}^{t}(\mathcal{S}_{dfm})\|_{1},

where $T$ is the set of selected generator layers for calculating this loss, $N_{T}$ is the number of elements of $T$ , and $n^{t}$ is the number of elements of $G_{a}^{t}()$ and $G_{m}^{t}()$ . We select the feature maps of the first four residual blocks of the two generators in our experiments.

The overall objective of our dual-generator model is:

(8)	$\displaystyle\min_{G}$	$\displaystyle\max_{D}\mathcal{L}_{rec}(G_{a})+\mathcal{L}_{rec}(G_{m})$
		$\displaystyle+\mathcal{L}_{adv}(G_{a},D)+\mathcal{L}_{adv}(G_{m},D)$
		$\displaystyle+\lambda\big{(}\mathcal{L}_{DFM}(G_{a},D)+\mathcal{L}_{DFM}(G_{m},D)\big{)}$
		$\displaystyle+\mu\mathcal{L}_{GFM}(G_{a},G_{m}),$

where $\lambda$ and $\mu$ are the weights for balancing different losses. We set $\lambda=10$ and $\mu=10$ in our experiments.

In order to train our model more stably, we introduce a multi-stage training schedule. At the first stage, we use edge-aligned sketches and real images to train $G_{a}$ and $D$ without loss function related to $G_{m}$ . At the second stage, we train SAP and the encoder of $G_{m}$ and $D$ from scratch while fixing weights of other parts. Note that the residual blocks and the decoder of $G_{m}$ that share weights with those of $G_{a}$ keep unchanged in this stage. At the last stage, we finetune the whole model with the overall objective in Eq. 8.

4. Experiments and Discussions

Trained with synthesized deformed sketches, our method is robust to hand-drawn sketches. We conduct extensive experiments to demonstrate the effectiveness of our model in generating high-quality realistic face image from sketches drawn by different users with diverse painting skills.

4.1. Implementation Settings

Before showing experimental results, we first introduce details in our network implementation and training.

Implementation Details

We implement our model on Pytorch. Both generators for edge-aligned sketches and deformed sketches share an encoder-residual-decoder structure with shared weights except that an SAP module is added to the front of the main generator $G_{m}$ for deformed sketches. The encoder consists of four convolutional layers with $2\times$ downsampling, while the decoder consists of four convolutional layers with $2\times$ upsampling. Nine residual blocks between the encoder and decoder enlarge the capacity of the generators. The multi-scale discriminator $D$ consists of three sub-networks for three scales separately, same as Pix2PixHD (Wang et al., 2018). Instance normalization (Ulyanov et al., 2016) is applied after the convolutional layers to stabilize training. ReLU is used as the activation function for generators and LeakyReLU for the discriminators.

Data

To produce triplets of sketch, deform sketch, and real image, $(\mathcal{S}_{syn},\mathcal{S}_{dfm},x)$ for our network training, we use CelebA-HQ (Karras et al., 2018), a large-scale face image dataset which contains 30K $1024\times 1024$ high-resolution face images. CelebAMask-HQ (Lee et al., 2020) offers manually-annotated face semantic masks for CelebA-HQ with 19 classes including all facial components and accessories such as skin, nose, eyes, eyebrows, ears, mouth, lip, hair, hat, eyeglass, earring, necklace, neck, and cloth. We utilize semantic masks in this dataset to extract semantic boundary maps as edge-aligned sketches. Deformed sketches are generated by vectorizing and adding random offsets to edge-aligned sketches as discussed in the last section. Real images, sketches, and deformed sketches are resized to $256\times 256$ in our experiments.

Training Details

All the networks are trained by Adam optimizer (Kingma and Ba, 2015) with $\beta_{1}=0.5$ and $\beta_{2}=0.999$ . For each training stage, the initial learning rate is set to $0.0002$ and starts to decay at the half of training procedure. We set batch size as 32. The entire training process takes about three days on four NVIDIA GTX 1080Ti GPUs with 11GB GPU memory.

Baseline Model

Pix2pixHD (Wang et al., 2018) is a state-of-the-art image-to-image translation model for high-resolution images. With the edge-aligned sketches and real face images, we train pix2pixHD with its low-resolution version of generator (‘global generator’) as a baseline model in our experiment, denoted as baseline. In order to conduct a fair comparison on generalization, we also train the baseline model with augmented dataset by adding pairs of deformed sketches and images, denoted as baseline_deform. The key idea of our method is using SAP and dual generators to improve the tolerance to sketch distortions. The local enhancer part of pix2pixHD, which is designed for high-resolution image synthesis, can be easily added to improve fine textures for both baseline models and our model in the future.

4.2. Evaluation on Generation Quality

4.2.1. Evaluation Metrics

Evaluating the performance of generative models has been studied for a period of time in image generation literature. It is proven to be a complicated task because a model with good performance with respect to one criterion does not necessarily imply good performance with respect to another criterion (Lucic et al., 2018). A proper evaluation metrics should be able to present the joint statistics between conditional input samples and generated images. Traditional metrics, such as pixel-wise mean-squared error, can not effectively measure the performance of generative models. We utilize two popular quantitative perceptual evaluation metrics based on image features extracted by DNNs: Inception Score (IS) (Salimans et al., 2016) and Fréchet Inception Distance (FID) (Heusel et al., 2017). These metrics are proven to be consistent with human perception in assessing the realism of images.

Inception Score (IS)

IS applies an Inception model that is pre-trained on ImageNet to extract features of generated images and computes the KL divergence between the conditional class distribution and the marginal class distribution. Higher IS presents higher quality of generative images. Note that IS is reported to be biased in some cases because its evaluation is based more on the recognizability rather than on the realism of generated samples (Theis et al., 2016).

Fréchet Inception Distance (FID)

FID is a recently proposed evaluation metric for generative models and proven to be more consistent with human perception in assessing the realism of generated images. FID computes Wasserstein-2 distance between features of generated images and real images which are extracted by a pre-trained Inception model. Lower FID indicates that the distribution of generated data is closer to the distribution of real samples.

4.2.2. Image Quality Comparison

Existing image-to-image translation models can be trained for sketch-to-face translation using paired sketch and image data. Since the quality of generated images presents the basic performance of a generative model, we first compare the quality of generated images by different generative models using IS and FID. We test these models with edge-aligned sketches that are synthesized from images in the test set. Besides the baseline model, we also test Pix2pix (Isola et al., 2017), which is the first general image-to-image translation framework. It can be applied to a variety of applications by switching the training data. We use the default setting to train Pix2pix model with paired edge-aligned sketches and real face images. All these methods produce face image in dimension of $256\times 256$ .

Table 1 shows the quantitative evaluation results of four models. Our model surpasses other models by a small margin with respect to two evaluation metrics. Visual results are shown in Figure 4. As we can see, all of the four models are able to generate plausible face images from sketches of test dataset. Since these test sketches are generated using the same method as training sketches, the test data distribution is quite close to the training data distribution. Both our model and existing models perform well on these samples with ideal face structures. We will show the superiority of our method on more challenging sketches in the following experiments.

Table 1. Quantitative comparison of generative quality on synthesized test sketches.

	Pix2pix (Isola et al., 2017)	Baseline (Wang et al., 2018)	Baseline_deform	Ours
IS	$2.186$	$2.298$	$2.369$	2.411
FID	$289.3$	$259.1$	$244.3$	242.1

4.3. Comparison of Generalization Capability

In order to verify the generalization ability of our model, we compare our model with state-of-the-art image translation models by testing with synthesized sketches of different levels of deformation, well-drawn sketches and poorly-drawn sketches by common users.

4.3.1. Different Level of Deformation

As mentioned in Section 3.1, we deform an edge-aligned sketch $\mathcal{S}_{syn}$ to obtain a corresponding deformed sketch $\mathcal{S}_{dfm}$ by adding random offsets to the control points and end points of the vectorized strokes in $\mathcal{S}_{syn}$ . The maximum offset $d$ is set to $11$ in the training data. We further create more deformed sketches with multiple levels of deformation, denoted as $\mathcal{S}_{dfm}^{d}$ , by modifying the maximum offset $d$ . We examine the generalization ability of our model and the baseline model on these deformed sketches. Note that the baseline is trained with only edge-aligned sketches, while our model and baseline_deform model are trained with both edge-aligned sketches and deformed sketches with $d=11$ .

In this experiment, the input sketches are deformed by larger offsets where the maximum $d$ is set to $30$ . As shown in Figure 5, strokes in sketches with larger deformation looks quite different from those in the training sketches including edge-aligned sketches $\mathcal{S}_{syn}$ and deformed sketches $\mathcal{S}_{dfm}^{11}$ . By adding deformed sketches into training data, baseline_deform produces better images than baseline. However, when larger deformation occurs, baseline_deform suffers from artifacts in facial features, for example, the mouth in the first example, the eyes of the third and the last case in Figure 5, In comparison, our model produces more realistic face images with more symmetric eyes and fine textures, benefited from its ability of capturing the distortion of deformed strokes and rectifying shape features by our spatial attention pooling module.

4.3.2. Hand-Drawn Sketches

Besides the synthesized sketches with stroke deformation, we further examine the model generalization ability by comparing performances of our model with baseline models on two kinds of hand-drawn sketches: expert-drawn sketches and common-user sketches drawn by users without professional painting skills.

Expert-Drawn Sketches

We invite an expert with well-trained drawing skills to draw 12 portrait sketches for testing. These expert sketches were drawn on a pen tablet so that the strokes are smooth and precise. Note that shading strokes are not drawn. Figure 6 shows a group of face images generated by different models from several expert-drawn sketches. Even with well-drawn strokes, the baseline and baseline_deform frequently fail to produce realistic textures and complete structures of eyes or mouths. In comparison, our results are more realistic with fine textures and intact structures.

Common Sketches

We also invited 20 graduate students without drawing skills to draw 200 freehand sketches of their imagined faces using mouses. Hence, strokes of these common sketches roughly depict the desired face structure and shapes of facial features with some distortion. Moreover, common sketches show different levels of details. For example, some sketches contains many strokes inside the hair regions, which are typically blank in the training sketches. Results shown in Figure 7 demonstrate that our model is robust to these poorly-drawn sketches. In contrast, the diversity of stroke styles and detail levels significantly damage the visual quality of the generated results from the baseline and baseline_deform models.

5. Conclusion

In this paper, we present DeepFacePencil, a novel deep neural network which allows common users to create photorealistic face images by freehand sketching. The robustness of our sketch-based face generator comes from the proposed dual generator training strategy and spatial attention pooling module. The proposed spatial attention pooling module adaptively adjusts the spatially varying balance between the image realism and the conformance between the input sketch and the synthesized image. By adding the SAP module to our generator and training two dual generators simultaneously, our generator effectively captures face structure and facial feature shapes from coarsely drawn sketches. Extensive experiments demonstrate that our DeepFacePencil successfully produces high-quality face images from freehand sketches drawn by users in diverse drawing skills.

Acknowledgements.

This work was supported by the National Key Research

\&

Development Plan of China under Grant 2016YFB1001402, the National Natural Science Foundation of China (NSFC) under Grants 61632006, U19B2038, and 61620106009, as well as the Fundamental Research Funds for the Central Universities under Grants WK3490000003 and WK2100100030. We thank the Supercomputing Center of USTC for providing computational resources.

References

(1)
Chen et al. (2009) Tao Chen, Ming-Ming Cheng, Ping Tan, Ariel Shamir, and Shi-Min Hu. 2009. Sketch2Photo: internet image montage. ACM Trans. Graph. 28, 5 (2009), 124:1–124:10.
Chen and Hays (2018) Wengling Chen and James Hays. 2018. SketchyGAN: Towards Diverse and Realistic Sketch to Image Synthesis. In IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA. 9416–9425.
Chen et al. (2008) Xuejin Chen, Sing Bing Kang, Ying-Qing Xu, and Julie Dorsey. 2008. Sketching reality: Realistic interpretation of architectural designs. ACM Trans. Graph. 27, 2 (2008), 11:1–11:15.
Choi et al. (2018) Yunjey Choi, Min-Je Choi, Munyoung Kim, Jung-Woo Ha, Sunghun Kim, and Jaegul Choo. 2018. StarGAN: Unified Generative Adversarial Networks for Multi-Domain Image-to-Image Translation. In IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA. 8789–8797.
Eitz et al. (2011) Mathias Eitz, Ronald Richter, Kristian Hildebrand, Tamy Boubekeur, and Marc Alexa. 2011. Photosketcher: Interactive Sketch-Based Image Synthesis. IEEE Computer Graphics and Applications 31, 6 (2011), 56–66.
Goodfellow et al. (2014) Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. 2014. Generative Adversarial Nets. In Advances in Neural Information Processing Systems, Montreal, Quebec, Canada. 2672–2680.
Heusel et al. (2017) Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. 2017. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. In Advances in Neural Information Processing Systems, Long Beach, CA, USA. 6626–6637.
Huang et al. (2020) Yukun Huang, Zheng-Jun Zha, Xueyang Fu, Richang Hong, and Liang Li. 2020. Real-World Person Re-Identification via Degradation Invariance Learning. In IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA. 14072–14082.
Igarashi et al. (1999) Takeo Igarashi, Satoshi Matsuoka, and Hidehiko Tanaka. 1999. Teddy: A sketching interface for 3-D freeform design. In Proc. SIGGRAPH. 409–416.
Isola et al. (2017) Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. 2017. Image-to-Image Translation with Conditional Adversarial Networks. In IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA. 5967–5976.
Jo and Park (2019) Youngjoo Jo and Jongyoul Park. 2019. SC-FEGAN: Face Editing Generative Adversarial Network With User’s Sketch and Color. In IEEE International Conference on Computer Vision, Seoul, Korea (South). 1745–1753.
Karacan et al. (2016) Levent Karacan, Zeynep Akata, Aykut Erdem, and Erkut Erdem. 2016. Learning to Generate Images of Outdoor Scenes from Attributes and Semantic Layouts. CoRR abs/1612.00215 (2016). arXiv:1612.00215
Karras et al. (2018) Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. 2018. Progressive Growing of GANs for Improved Quality, Stability, and Variation. In 6th International Conference on Learning Representations.
Karras et al. (2019) Tero Karras, Samuli Laine, and Timo Aila. 2019. A Style-Based Generator Architecture for Generative Adversarial Networks. In IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA. 4401–4410.
Kim et al. (2017) Taeksoo Kim, Moonsu Cha, Hyunsoo Kim, Jung Kwon Lee, and Jiwon Kim. 2017. Learning to Discover Cross-Domain Relations with Generative Adversarial Networks. In Proceedings of the 34th International Conference on Machine Learning, Sydney, NSW, Australia (Proceedings of Machine Learning Research, Vol. 70). 1857–1865.
Kingma and Ba (2015) Diederik P. Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In 3rd International Conference on Learning Representations,.
Ledig et al. (2017) Christian Ledig, Lucas Theis, Ferenc Huszar, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew P. Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, and Wenzhe Shi. 2017. Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network. In IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA. 105–114.
Lee et al. (2020) Cheng-Han Lee, Ziwei Liu, Lingyun Wu, and Ping Luo. 2020. MaskGAN: Towards Diverse and Interactive Facial Image Manipulation. In IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA. 5548–5557.
Li et al. (2019) Yuhang Li, Xuejin Chen, Feng Wu, and Zheng-Jun Zha. 2019. LinesToFacePhoto: Face Photo Generation From Lines With Conditional Self-Attention Generative Adversarial Networks. In Proceedings of the 27th ACM International Conference on Multimedia (Nice, France) (MM ’19). ACM, New York, NY, USA, 2323–2331.
Liu et al. (2019) Jiawei Liu, Zheng-Jun Zha, Di Chen, Richang Hong, and Meng Wang. 2019. Adaptive Transfer Network for Cross-Domain Person Re-Identification. In IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA. 7202–7211.
Lucic et al. (2018) Mario Lucic, Karol Kurach, Marcin Michalski, Sylvain Gelly, and Olivier Bousquet. 2018. Are GANs Created Equal? A Large-Scale Study. In Advances in Neural Information Processing Systems, Montréal, Canada. 698–707.
Portenier et al. (2018) Tiziano Portenier, Qiyang Hu, Attila Szabó, Siavash Arjomand Bigdeli, Paolo Favaro, and Matthias Zwicker. 2018. Faceshop: deep sketch-based face image editing. ACM Trans. Graph. 37, 4 (2018), 99:1–99:13.
Radford et al. (2016) Alec Radford, Luke Metz, and Soumith Chintala. 2016. Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. In 4th International Conference on Learning Representations.
Salimans et al. (2016) Tim Salimans, Ian J. Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. 2016. Improved Techniques for Training GANs. In Advances in Neural Information Processing Systems, Barcelona, Spain. 2226–2234.
Sutherland (1964) Ivan E. Sutherland. 1964. Sketch pad a man-machine graphical communication system. In DAC.
Theis et al. (2016) Lucas Theis, Aaron van den Oord, and Matthias Bethge. 2016. A note on the evaluation of generative models. In 4th International Conference on Learning Representations.
Ulyanov et al. (2016) Dmitry Ulyanov, Andrea Vedaldi, and Victor S. Lempitsky. 2016. Instance Normalization: The Missing Ingredient for Fast Stylization. CoRR abs/1607.08022 (2016). arXiv:1607.08022
Wang et al. (2018) Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. 2018. High-Resolution Image Synthesis and Semantic Manipulation With Conditional GANs. In IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA. 8798–8807.
Weber (2018) Martin Weber. 2018. AutoTrace. http://autotrace.sourceforge.net/
Xie and Tu (2015) Saining Xie and Zhuowen Tu. 2015. Holistically-Nested Edge Detection. In IEEE International Conference on Computer Vision, Santiago, Chile. 1395–1403.
Yi et al. (2017) Zili Yi, Hao (Richard) Zhang, Ping Tan, and Minglun Gong. 2017. DualGAN: Unsupervised Dual Learning for Image-to-Image Translation. In IEEE International Conference on Computer Vision, Venice, Italy. 2868–2876.
Zeleznik et al. (1996) Robert C. Zeleznik, Kenneth P. Herndon, and John F. Hughes. 1996. SKETCH: An Interface for Sketching 3D Scenes. In Proc. SIGGRAPH. 163–170.
Zha et al. (2020) Zheng-Jun Zha, Jiawei Liu, Di Chen, and Feng Wu. 2020. Adversarial Attribute-Text Embedding for Person Search With Natural Language Query. IEEE Trans. Multimedia 22, 7 (2020), 1836–1846.
Zhang et al. (2017) Han Zhang, Tao Xu, and Hongsheng Li. 2017. StackGAN: Text to Photo-Realistic Image Synthesis with Stacked Generative Adversarial Networks. In IEEE International Conference on Computer Vision, Venice, Italy. 5908–5916.
Zhang et al. (2019) Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, and Dimitris N. Metaxas. 2019. StackGAN++: Realistic Image Synthesis with Stacked Generative Adversarial Networks. IEEE Trans. Pattern Analysis and Machine Intelligence 41, 8 (2019), 1947–1962.
Zhu et al. (2017a) Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. 2017a. Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks. In IEEE International Conference on Computer Vision, Venice, Italy. 2242–2251.
Zhu et al. (2017b) Jun-Yan Zhu, Richard Zhang, Deepak Pathak, Trevor Darrell, Alexei A. Efros, Oliver Wang, and Eli Shechtman. 2017b. Toward Multimodal Image-to-Image Translation. In Advances in Neural Information Processing Systems, Long Beach, CA, USA. 465–476.