Make Your Brief Stroke Real and Stereoscopic: 3D-Aware Simplified Sketch to Portrait Generation
Abstract.
Creating photo-realistic versions of people's sketched portraits is useful for various entertainment purposes. Existing studies only generate portraits in the 2D plane with fixed views, making the results less vivid. In this paper, we present Stereoscopic Simplified Sketch-to-Portrait (SSSP), which explores the possibility of creating stereoscopic 3D-aware portraits from simple contour sketches by involving 3D generative models. Our key insight is to design sketch-aware constraints that can fully exploit the prior knowledge of a tri-plane-based 3D-aware generative model. Specifically, our designed region-aware volume rendering strategy and global consistency constraint further enhance detail correspondences during sketch encoding. Moreover, to make the system accessible to laypeople, we propose a Contour-to-Sketch module with vector-quantized representations, so that easily drawn contours can directly guide the generation of 3D portraits. Extensive comparisons show that our method generates high-quality results that match the sketch. Our usability study verifies that our system is preferred by users.

1. Introduction
Sketching is an activity enjoyed by people of all ages and a valuable tool for both educational and creative purposes. Sketching faces, in particular, can be a means of expressing desired characters or identifying individuals. However, there is often a significant gap between the casual sketches that many people create and realistic portraits. Bridging this gap is a valuable skill with many practical applications, such as entertainment through augmented reality (AR) and virtual character design and creation for the Metaverse.
With the development of generative models (Goodfellow et al., 2014; Karras et al., 2018, 2020) and their applications in image translation (Isola et al., 2017; Zhu et al., 2017; Wang et al., 2018), semantic maps can already be faithfully transformed into real faces (Lee et al., 2020; Park et al., 2019). However, facial details such as wrinkles and hairstyles are lost in this kind of representation. As a result, researchers study specifically the problem of sketch-to-portrait generation (Chen et al., 2008, 2009; Dekel et al., 2018; Li et al., 2019; Chen et al., 2020, 2021). They leverage real faces and their corresponding edge maps to build paired data and formulate the problem similarly to image translation tasks. Recent studies can generate realistic results aligned with various kinds of sketches (Li et al., 2020; Chen et al., 2020). Nevertheless, previous methods are fundamentally 2D-based, which means that their generated faces are constrained to the exact views in which the sketches are drawn. For scenarios such as character appearance design, a fixed-view image cannot serve as a sufficient reference. Even though face rotation techniques can be applied as post-processing, the results cannot guarantee 3D consistency. The stereoscopic nature of human faces makes it essential for designers to view avatars from multiple viewpoints. Moreover, the next step towards creating vivid agents in virtual realities is to build a geometry-aware portrait in 3D space.
In this paper, we study the problem of creating stereoscopic portraits with high-quality 3D geometry from simple sketches. With the recent development of 3D-aware generative models (Chan et al., 2021, 2022; Or-El et al., 2022; Gu et al., 2021), researchers have succeeded in generating realistic portraits by involving 3D volume rendering techniques (Mildenhall et al., 2020). However, it is still challenging to build the connection between sketches and these models: 1) most 3D-aware GANs are unconditional generative models that do not support conditional inputs, so the model design and training protocol are non-trivial; 2) unlike face parsing maps, sketches appear arbitrarily without fixed semantics, making it difficult to constrain the generation process; 3) this task has rarely been explored before, leaving the learning protocol uncertain.
To tackle this problem, we propose a system called Stereoscopic Simplified Sketch-to-Portrait (SSSP), which renders 3D-consistent realistic portraits that are well aligned with simple sketches. The key is to carefully design sketch-aware constraints that can fully take advantage of a tri-plane-based 3D-aware generative model. In detail, we propose to encode sketches into the prior latent space of a tri-plane-based 3D-aware GAN (Chan et al., 2022). A region-aware volume rendering strategy is proposed so that crucial regions can be directly rendered at higher resolutions for sketch matching. At the same time, we enforce symmetric sketches to produce symmetric 3D spaces, which greatly enhances global consistency.
Moreover, we ensure that our system is interaction-friendly. While the previous steps build the mapping between detailed sketches and real faces, it is impractical for amateur users to draw complicated sketches that match the sketch dataset. We instead assume that users should be able to sketch their desired portrait at both the coarse and fine levels. Thus a Contour-to-Sketch module is proposed to lower the barrier for amateur users of our system. This module is designed based on vector-quantized representations (Esser et al., 2021) so that it can robustly handle diverse types of contour inputs.
Our contributions can be summarized as follows:
• We propose to generate 3D-aware portraits from sketches by latent space encoding in tri-plane-based generative models with sketch-aware rendering constraints.
• We design a novel Contour-to-Sketch module which robustly converts simple contours to detailed sketches with vector-quantized representations.
• Extensive quantitative and qualitative experiments illustrate the effectiveness of our Stereoscopic Simplified Sketch-to-Portrait (SSSP) system. Studies on our developed interface prove that stereoscopic portraits are crucial to user satisfaction.
2. Related Work
2.1. Sketch-based Portrait Synthesis
Synthesizing realistic portraits from given hand-drawn sketches has been a longstanding topic in both the human-computer interaction and computer graphics communities. It is valuable for various applications in virtual reality, augmented reality, and digital human creation. Early methods (Chen et al., 2009; Eitz et al., 2011) retrieve local image patches from a large-scale human face collection and then compose these local patches back into an entire image according to the input sketches. However, these methods were designed without considering hallucination consistency, which inevitably leads to unrealistic results. Taking advantage of deep neural networks, plenty of works (Chen and Hays, 2018; Li et al., 2019; Chen et al., 2020; Li et al., 2020) have recently been proposed for high-fidelity portrait synthesis by carefully devising the network architecture. To improve user experience, DeepFaceDrawing (Chen et al., 2020) introduces a real-time interactive system that enables users to input hand-drawn sketches with merely facial structures (i.e., five fixed components). However, almost all previous methods operate in the 2D space; their results are static and thus lack realism. Moreover, some of these methods are trained on a specific sketch domain, making it difficult for them to handle rough or incomplete sketches.
Different from previous approaches, we propose to generate 3D-aware realistic portraits built on genuine 3D representations. Moreover, our proposed system employs a two-stage synthesis strategy for robust and high-quality portrait synthesis.
2.2. Generative 3D-aware Image Synthesis
Recent generative models (Chan et al., 2021; Schwarz et al., 2020; Niemeyer and Geiger, 2021; Deng et al., 2022; Cai et al., 2022; Chen et al., 2022; Chan et al., 2022; Sun et al., 2022a) involving implicit neural representation (INR) techniques (Mildenhall et al., 2020) have demonstrated great potential for high-quality image synthesis. Specifically, EG3D (Chan et al., 2022) proposed an efficient tri-plane hybrid explicit-implicit 3D representation to synthesize high-resolution, view-consistent images in real time together with high-quality 3D geometry. Despite promising image quality, these approaches cannot provide interactive local editing of the synthesized portraits.
Some concurrent works (Sun et al., 2022b; Chen et al., 2022; Sun et al., 2022a) have attempted the conditional generation setting. FENeRF (Sun et al., 2022b) proposed to use the semantic mask for editing the 3D facial volume of the target portrait via GAN inversion (Karras et al., 2020). However, it fails to support high-resolution image synthesis and real-time user-interactive applications due to its optimization-based inversion. In order to achieve real-time interactive editing, IDE-3D (Sun et al., 2022a) introduced a high-resolution semantic-aware 3D generative model, which enables disentangled control over local shape and texture by leveraging a hybrid GAN inversion strategy. Although IDE-3D explores a similar setting (i.e., semantic-aware 3D portrait synthesis and editing), our approach focuses on building a user-friendly, real-time interactive system even for non-artists, taking only hand-drawn sketches as input, which are much rougher and more abstract signals than the semantic masks used in IDE-3D.
3. Methodology

In this section, we dive into the details of our proposed system, Stereoscopic Simplified Sketch-to-Portrait (SSSP). Our target is to create 3D-aware realistic human portraits from 2D sketches with simple user interaction.
3.1. 3D-Aware Portrait Generation from Sketch
Problem Formulation. Given an arbitrary sketch $S$, our training goal is to recover an implicit or explicit 3D representation from it and render it back to the real domain under the guidance of its paired image $I$. To achieve high-fidelity 3D awareness, we take inspiration from recent NeRF-based 3D generative models (Gu et al., 2021; Chan et al., 2022; Or-El et al., 2022) that are able to produce high-quality portraits via volume rendering in 3D space. Our solution is to pose this challenging task as "3D GAN inversion", where an encoder is trained to map into the prior latent space of a pre-trained generator. The task is thus formulated as render-guided encoder learning.
Preliminaries on the Tri-Plane-Based Generator. The tri-plane-based generator (Chan et al., 2022), which enjoys high 3D consistency and rendering quality, is a natural choice. As shown in the blue box of Fig. 2, it leverages an explicit tri-plane 3D representation which stores 3D scene information in three orthogonal feature images (the xy-, xz-, and yz-planes). For each point $x$ in 3D space, its projections onto the three planes query the corresponding interpolated features, which are aggregated into a feature $F(x)$ representing this spatial location. Afterwards, a tiny decoder processes $F(x)$ to predict the color feature $c$ and density $\sigma$ of this position. The tri-plane feature images can be effectively sampled from a latent vector $w$ with a StyleGAN2 generator (Karras et al., 2020).
Volume rendering (Max, 1995) is then performed by casting rays from the camera, as in NeRF (Mildenhall et al., 2020). Differently, here the color features are accumulated into a 2D feature image. In the original design, a moderate rendering resolution is selected, and the feature image is sent into two different paths. One directly produces a low-resolution RGB image $I_{raw}$, and the other upsamples its spatial resolution through additional layers of modulated convolutions (Karras et al., 2020), rendering a high-resolution RGB image $I_{high}$.
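To make the tri-plane query concrete, the following PyTorch sketch illustrates how a batch of 3D points could be projected onto the three planes, have their features bilinearly interpolated and summed, and then decoded into color features and densities. The tensor shapes, the summation-based aggregation, and the decoder layout are assumptions for illustration rather than the exact EG3D implementation.

```python
import torch
import torch.nn.functional as F

def sample_triplane(planes, points):
    """Query tri-plane features for 3D points.

    planes: (B, 3, C, H, W) feature images for the xy-, xz-, yz-planes (assumed order).
    points: (B, N, 3) coordinates, assumed normalized to [-1, 1].
    Returns: (B, N, C) aggregated features (summed over the three planes).
    """
    B, _, C, H, W = planes.shape
    # 2D projections of each point onto the three orthogonal planes.
    proj = torch.stack([points[..., [0, 1]],   # xy-plane
                        points[..., [0, 2]],   # xz-plane
                        points[..., [1, 2]]],  # yz-plane
                       dim=1)                  # (B, 3, N, 2)
    feats = F.grid_sample(
        planes.reshape(B * 3, C, H, W),
        proj.reshape(B * 3, 1, -1, 2),
        mode='bilinear', align_corners=False)            # (B*3, C, 1, N)
    feats = feats.reshape(B, 3, C, -1).permute(0, 3, 1, 2)  # (B, N, 3, C)
    return feats.sum(dim=2)                              # aggregate by summation


class TinyDecoder(torch.nn.Module):
    """Small MLP mapping an aggregated tri-plane feature to (color feature, density)."""
    def __init__(self, in_dim=32, feat_dim=32, hidden=64):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(in_dim, hidden), torch.nn.Softplus(),
            torch.nn.Linear(hidden, feat_dim + 1))

    def forward(self, x):
        out = self.net(x)
        color_feat, sigma = out[..., :-1], F.softplus(out[..., -1:])
        return color_feat, sigma


# Usage: query 4096 random points against random tri-planes.
planes = torch.randn(2, 3, 32, 256, 256)
points = torch.rand(2, 4096, 3) * 2 - 1
color_feat, sigma = TinyDecoder()(sample_triplane(planes, points))
```

The decoded color features and densities would then be accumulated along camera rays by standard volume rendering to form the 2D feature image described above.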
Basic Learning Objective. Our next goal is to accurately restore the radiance fields (tri-planes) that render images strictly matching the input sketch. As the tri-planes are created by the StyleGAN2 generator, the sketches can be intuitively encoded into the $\mathcal{W}$ or $\mathcal{W}^+$ space. The encoder maps the sketch to a latent code in the $\mathcal{W}$ space, which is then sent into the generator. While two images $I_{raw}$ and $I_{high}$ are produced within the generator, we identify that the directly rendered low-resolution image $I_{raw}$ is more suitable for applying constraints, as its backward path is shorter. The first training constraint is to recover the downsampled ground-truth image $I^{low}_{gt}$. The basic training objectives are the L1 reconstruction loss and the VGG perceptual loss (Wang et al., 2018; Park et al., 2019) between the rendered result and the ground truth,
$\mathcal{L}_{rec} = \|I_{raw} - I^{low}_{gt}\|_1 + \sum_i \|\phi_i(I_{raw}) - \phi_i(I^{low}_{gt})\|_1,$   (1)
where $\phi_i(\cdot)$ denotes the output of the $i$-th layer of a VGG19 network. We have also tried computing the losses on the final outputs and found similar results.
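The basic objective in Eq. 1 can be written as a short training-loss sketch. The encoder, generator interface, and the particular VGG layers below are placeholders rather than the exact configuration used for the reported results; only the structure of the L1 plus multi-layer perceptual terms follows the description above.

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg19
from torchvision.models.feature_extraction import create_feature_extractor

# Hypothetical layer choice for the perceptual loss; the paper does not list the layers.
vgg_layers = ['features.3', 'features.8', 'features.17', 'features.26']
vgg = create_feature_extractor(vgg19(weights='DEFAULT'), vgg_layers).eval()
for p in vgg.parameters():
    p.requires_grad_(False)

def basic_objective(encoder, generator, sketch, image_gt, camera):
    """L1 + VGG perceptual loss between the low-resolution render and the
    downsampled ground truth (Eq. 1)."""
    w = encoder(sketch)                                  # sketch -> latent code in W
    i_raw, _ = generator(w, camera)                      # assumed to return (I_raw, I_high)
    gt_low = F.interpolate(image_gt, size=i_raw.shape[-2:],
                           mode='bilinear', align_corners=False)
    loss = F.l1_loss(i_raw, gt_low)                      # pixel-level L1 term
    feats_r, feats_g = vgg(i_raw), vgg(gt_low)
    for k in vgg_layers:                                 # multi-layer perceptual term
        loss = loss + F.l1_loss(feats_r[k], feats_g[k])
    return loss
```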
3.2. Sketch-Aware Rendering Constraints
However, such a universal pixel-level constraint cannot guarantee detailed consistency between drawn lines and textures. Since the original tri-plane generator is trained without local perception, the entire tri-plane formulation is sensitive to tiny changes in the latent space of the generator. To tackle this problem, we propose two novel sketch-aware rendering constraints, so that the model can grasp local information from the input sketch and achieve global consistency.
3.2.1. Region-Aware Volume Rendering Strategy for Encoding Sketches
When depicting a human portrait, local areas such as the eyes, nose, and mouth play important roles. In practice, users usually spend more strokes on these regions. Thus we argue that the network should be sensitive to local sketch differences.
To this end, a region-aware volume rendering strategy is proposed to encourage the mapped latent code to focus on important local sketch areas. Specifically, four regions covering the left eye, the right eye, the nose, and the mouth are selected on the rendered low-resolution image $I_{raw}$, where the $i$-th region has size $h_i \times w_i$. We perform volume rendering on each of these regions by densely sampling more rays, synthesizing region images $I^i_{raw}$ with the same resolution as $I_{raw}$. The reconstruction constraints are applied to all rendered local regions. The loss can be written as:
$\mathcal{L}_{region} = \sum_{i=1}^{4} \Big( \|I^i_{raw} - I^i_{gt}\|_1 + \sum_j \|\phi_j(I^i_{raw}) - \phi_j(I^i_{gt})\|_1 \Big),$   (2)
where $I^i_{gt}$ is the cropped and resized $i$-th region of the ground-truth image. Compared with the corresponding area on $I_{raw}$, $I^i_{raw}$ preserves many more details. With the additional refinement of these important patches, the encoder can learn the fine-grained correlations between sketches and portraits, leading to more convincing results.
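A minimal sketch of the region-aware strategy follows, assuming the generator exposes a routine that volume-renders an arbitrary sub-window of the image plane at a chosen ray-sampling resolution; the region boxes and helper names are illustrative, not the released implementation, and the perceptual term of Eq. 2 is omitted for brevity.

```python
import torch
import torch.nn.functional as F

# Hypothetical normalized boxes (x0, y0, x1, y1) for left eye, right eye, nose, mouth,
# e.g. derived from facial landmarks of the training images.
REGION_BOXES = [(0.25, 0.35, 0.45, 0.50), (0.55, 0.35, 0.75, 0.50),
                (0.40, 0.45, 0.60, 0.65), (0.35, 0.65, 0.65, 0.80)]

def region_aware_loss(render_patch, image_gt, w, camera, patch_res=128):
    """Sketch of Eq. 2: reconstruction losses on densely re-rendered facial regions.

    render_patch(w, camera, box, res) is an assumed generator hook that casts a
    dense res x res grid of rays only through the given normalized image box.
    """
    loss = 0.0
    h, w_img = image_gt.shape[-2:]
    for (x0, y0, x1, y1) in REGION_BOXES:
        pred = render_patch(w, camera, (x0, y0, x1, y1), patch_res)   # (B, 3, res, res)
        gt = image_gt[..., int(y0 * h):int(y1 * h), int(x0 * w_img):int(x1 * w_img)]
        gt = F.interpolate(gt, size=(patch_res, patch_res),
                           mode='bilinear', align_corners=False)      # cropped + resized GT
        loss = loss + F.l1_loss(pred, gt)                             # perceptual term omitted
    return loss
```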
3.2.2. Symmetric Constraint

Involving the region-aware volume rendering strategy significantly boosts the model's sketch consistency on local details. However, it is still challenging to capture global information such as the facial outline. As demonstrated in (Chan et al., 2022), certain global attributes like general facial expressions are biased towards the input camera pose. This phenomenon still exists when the model is trained with the camera pose serving as a condition. An underlying cause is that the constraints are too limited to force the encoder network to capture the global pattern.
With explicit 3D information leveraged in our setting, global constraints can be added more easily to both the 3D representation and the rendered image. In particular, simple flipping data augmentation is used here: we expect a flipped sketch to represent a person with symmetric geometry in the 3D space.
In detail, we take inspiration from unsupervised symmetric learning of 3D faces (Wu et al., 2020), but explore a new solution on tri-planes. Without loss of generality, we place the portrait at the origin of the tri-plane space, facing along the z-axis without any rotation. Our insight is that horizontally flipping an image in the 2D space corresponds to a flip about the yz-plane of the tri-plane space. More specifically, if a camera located at position p in the tri-plane space takes a picture of the portrait, we can move the camera to the position p' that is symmetric about the yz-plane and flip the tri-plane features of the xy- and xz-planes to obtain a horizontally flipped image. We denote S as an input sketch and S' as its horizontally flipped version. By encoding them into the generator, two tri-planes, F and F', are generated, respectively. Here x and x' are symmetric sample points about the yz-plane, and p and p' are symmetric camera poses about the yz-plane, as shown in Fig. 3. As the mirror symmetry of the 3D face naturally exists, we further encourage the tri-plane features to satisfy the following equations:
$\mathrm{flip}_x(F_{xy}) = F'_{xy}, \quad \mathrm{flip}_x(F_{xz}) = F'_{xz}, \quad F_{yz} = F'_{yz},$   (3)
where $\mathrm{flip}_x(\cdot)$ denotes flipping the coordinates of a tri-plane feature map along the x-axis. We therefore introduce this novel constraint, termed the Symmetric Constraint, on the tri-plane features, applied together with the image reconstruction losses in Eq. 1 and Eq. 2.
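The symmetric constraint can be sketched as below, under the assumed convention that the portrait faces along the z-axis, the x-axis points left-right, and the tri-planes are ordered (xy, xz, yz) with x along the width axis of the xy- and xz-plane feature maps; the exact axis layout depends on the generator implementation, and the L1 penalty is one plausible way to enforce Eq. 3.

```python
import torch
import torch.nn.functional as F

def flip_triplane_x(planes):
    """Flip tri-plane features about the yz-plane (negate the x coordinate).

    planes: (B, 3, C, H, W), ordered (xy, xz, yz), with x along the width axis
    of the xy- and xz-plane feature maps (an assumed layout).
    """
    xy, xz, yz = planes[:, 0], planes[:, 1], planes[:, 2]
    xy = torch.flip(xy, dims=[-1])            # mirror x on the xy-plane
    xz = torch.flip(xz, dims=[-1])            # mirror x on the xz-plane
    return torch.stack([xy, xz, yz], dim=1)   # the yz-plane is unchanged

def symmetric_loss(encoder, triplane_generator, sketch):
    """Eq. 3: tri-planes of a sketch and of its horizontal flip should be mirror images."""
    sketch_flip = torch.flip(sketch, dims=[-1])          # horizontally flipped sketch S'
    f = triplane_generator(encoder(sketch))              # (B, 3, C, H, W) tri-plane F
    f_flip = triplane_generator(encoder(sketch_flip))    # tri-plane F'
    return F.l1_loss(flip_triplane_x(f), f_flip)
```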
3.3. Contour-to-Sketch Generation via Vector-Quantized Representation
It is very challenging for amateur users to draw a detailed sketch like the input data used in the portrait generation module (Sec. 3.1). Moreover, painters usually produce contours first before adding details. Thus it is more reasonable and convenient for users to be able to produce and edit sketches at both the coarse and fine levels. This raises a new challenge: how to bridge the domain gap between a coarse contour and the detailed sketch input required by our model?
To mitigate this issue, we propose the Contour-to-Sketch module, which translates a simplified contour into a detailed sketch under an image translation protocol (Isola et al., 2017). However, we find that directly applying existing models (Wang et al., 2018; Park et al., 2019) to this type of data leads to failure cases; for example, important facial parts such as the eyes can be missing.
Thus it is essential to seek a more robust design. We observe that, from a local perspective, the stroke types of facial sketches are limited. We therefore propose to build a discrete codebook for the detailed sketches via vector-quantized (VQ) representations (Van Den Oord et al., 2017). This guarantees that all generated results lie in the distribution of realistic sketches. The learning procedure is introduced in the following.
Quantized Sketch Representation. First, we learn a codebook $\mathcal{Z}$ from all detailed sketches in the dataset. A sketch tokenizer $E$ and a sketch decoder $D$ are designed following VQGAN (Esser et al., 2021). Given an arbitrary detailed sketch $S_d$ from the dataset, it is first encoded into a feature map through the tokenizer $E$. Each spatial feature vector is then quantized into a discrete token according to the closest codebook entry. During decoding, these tokens are de-quantized into feature vectors by querying the features stored in the codebook, and recovered back to a sketch with the decoder $D$. The encoding-decoding scheme can be summarized as:
$\hat{S}_d = D\big(q^{-1}(q(E(S_d)))\big),$   (4)
where $q(\cdot)$ and $q^{-1}(\cdot)$ denote the vector quantization and de-quantization operations.
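The quantization step of the sketch codebook can be illustrated with a minimal nearest-neighbour lookup in the spirit of VQ-VAE/VQGAN; the codebook size, feature dimension, and omission of the commitment loss and straight-through estimator are simplifications for illustration.

```python
import torch

class SketchCodebook(torch.nn.Module):
    """Minimal VQ layer: map each spatial feature vector to its nearest codebook entry."""
    def __init__(self, num_entries=1024, dim=256):
        super().__init__()
        self.embed = torch.nn.Embedding(num_entries, dim)
        self.embed.weight.data.uniform_(-1.0 / num_entries, 1.0 / num_entries)

    def quantize(self, z):
        # z: (B, C, H, W) feature map from the sketch tokenizer E.
        B, C, H, W = z.shape
        flat = z.permute(0, 2, 3, 1).reshape(-1, C)        # (B*H*W, C)
        dist = torch.cdist(flat, self.embed.weight)        # distances to all codebook entries
        tokens = dist.argmin(dim=1)                        # discrete token indices, q(.)
        return tokens.view(B, H, W)

    def dequantize(self, tokens):
        # tokens: (B, H, W) -> (B, C, H, W) by looking up codebook features, q^{-1}(.)
        z_q = self.embed(tokens)                           # (B, H, W, C)
        return z_q.permute(0, 3, 1, 2)

# Usage inside Eq. 4: sketch -> E -> quantize -> dequantize -> D -> reconstructed sketch.
codebook = SketchCodebook()
z = torch.randn(2, 256, 16, 16)          # stand-in for E(S_d)
tokens = codebook.quantize(z)
z_q = codebook.dequantize(tokens)        # fed to the sketch decoder D
```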
Quantized Representation Prediction. Once the codebook model is trained, our next step is to map the coarse contour to the learned entries in the codebook. We design the contour-to-token mapping in a teacher-student manner. Specifically, we train a new encoder $E_c$ for the coarse contour $C$ and map it into the latent space of the learned codebook $\mathcal{Z}$. The supervision is two-fold. First, at each local patch position, a cross-entropy classification loss is applied for predicting the correct token encoded by $E$. Second, we directly constrain the distance between the encoded contour feature $E_c(C)$ and the corresponding de-quantized codebook feature. The final mapping procedure is:
$\hat{S}_d = D\big(q^{-1}(q(E_c(C)))\big).$   (5)
With the help of this module, rich details can be robustly added to basic contour sketches. This module is essential to the usability of our system.
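The two supervisions for the contour encoder (token classification plus feature-distance regression) can be sketched as follows; the 1x1-convolution classification head, the loss weighting, and the use of an L1 feature distance are assumptions for illustration, and the codebook object follows the VQ sketch above.

```python
import torch
import torch.nn.functional as F

def contour_to_sketch_loss(contour_encoder, token_head, codebook, contour,
                           sketch_tokens, lambda_feat=1.0):
    """Train the contour encoder E_c against the frozen sketch codebook.

    contour:       (B, 1, H, W) coarse contour image.
    sketch_tokens: (B, h, w) target token indices produced by the frozen tokenizer E.
    token_head:    assumed 1x1-conv classifier over codebook entries.
    """
    feat = contour_encoder(contour)                       # (B, C, h, w)
    # 1) Cross-entropy classification of the correct codebook token at each position.
    logits = token_head(feat)                             # (B, num_entries, h, w)
    loss_cls = F.cross_entropy(logits, sketch_tokens)
    # 2) Regress the contour feature towards the de-quantized codebook feature.
    z_q = codebook.dequantize(sketch_tokens)              # (B, C, h, w) target features
    loss_feat = F.l1_loss(feat, z_q.detach())
    return loss_cls + lambda_feat * loss_feat
```

At inference time, the predicted tokens from the classification head are de-quantized and passed through the frozen sketch decoder D, realizing Eq. 5.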
4. Experiments
4.1. Experimental Settings
Datasets. For training the Sketch-Guided Stereoscopic Portrait Generation module, we leverage the dataset proposed in DeepFaceDrawing (Chen et al., 2020). It is created from the faces and parsing maps provided in CelebAMask-HQ (Lee et al., 2020), processed with Photocopy and sketch-simplification procedures, and contains 17K sketch-image pairs. For training the Contour-to-Sketch module, we process the above dataset and extract contours using the tools provided in (Chen et al., 2022).
Implementation Details. The architecture of our generator is the same as EG3D (Chan et al., 2022), and the final output image is rendered at the same resolution as in EG3D. The low-resolution rendered image has a size of 128 x 128. The StyleGAN2 generator backbone produces a feature map that is reshaped into the three planes of the tri-plane representation. The sketch encoder follows the ResNet-34 architecture (He et al., 2016). Our Sketch-Guided Stereoscopic Portrait Generation model is trained on 2 Tesla A100 GPUs for 3 days.
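As a concrete illustration of the encoder, a possible PyTorch instantiation is shown below, using a torchvision ResNet-34 backbone whose first convolution accepts a single-channel sketch and whose head predicts a latent code; the single-channel input, the 512-dimensional output, and the mapping to a single W code are assumptions, not the released configuration.

```python
import torch
from torchvision.models import resnet34

class SketchEncoder(torch.nn.Module):
    """ResNet-34 backbone mapping a sketch image to a latent code in W (assumed setup)."""
    def __init__(self, w_dim=512):
        super().__init__()
        backbone = resnet34(weights=None)
        # Accept a single-channel sketch instead of a 3-channel RGB image.
        backbone.conv1 = torch.nn.Conv2d(1, 64, kernel_size=7, stride=2,
                                         padding=3, bias=False)
        # Replace the classification head with a latent-code regression head.
        backbone.fc = torch.nn.Linear(backbone.fc.in_features, w_dim)
        self.backbone = backbone

    def forward(self, sketch):        # sketch: (B, 1, H, W), values in [0, 1]
        return self.backbone(sketch)  # (B, w_dim) latent code fed to the generator

w = SketchEncoder()(torch.rand(2, 1, 256, 256))
```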
Quantitative Evaluation Metrics. We conduct quantitative evaluations with metrics that have previously been used for image generation and image quality assessment. We use PSNR, SSIM (Wang et al., 2004), and CPBD (Narvekar and Karam, 2009) to measure the generated images in terms of fidelity, structural similarity, and sharpness. We also report FID (Heusel et al., 2017) and IS (Salimans et al., 2016), two metrics commonly used in image generation, to evaluate image quality at the distribution level.
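One practical way to compute these metrics is via the torchmetrics library; the snippet below is an illustrative evaluation sketch rather than the exact scripts used for the reported numbers, and CPBD is not covered by torchmetrics.

```python
import torch
from torchmetrics.image import PeakSignalNoiseRatio, StructuralSimilarityIndexMeasure
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.image.inception import InceptionScore

psnr = PeakSignalNoiseRatio(data_range=1.0)
ssim = StructuralSimilarityIndexMeasure(data_range=1.0)
fid = FrechetInceptionDistance(feature=2048, normalize=True)   # float images in [0, 1]
inception = InceptionScore(normalize=True)

def evaluate_batch(fake, real):
    """fake, real: (B, 3, H, W) float tensors in [0, 1]."""
    scores = {'psnr': psnr(fake, real).item(),
              'ssim': ssim(fake, real).item()}
    fid.update(real, real=True)        # accumulate real statistics
    fid.update(fake, real=False)       # accumulate generated statistics
    inception.update(fake)
    return scores

# After iterating over the whole test set:
#   fid_value = fid.compute()
#   is_mean, is_std = inception.compute()
# CPBD would be computed with a separate implementation of Narvekar and Karam (2009).
```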
Comparison Methods. To the best of our knowledge, no existing method shares our setting. Thus we compare with several related state-of-the-art methods, including pSp (Richardson et al., 2021), pix2pixHD (Wang et al., 2018), and DeepFaceDrawing (Chen et al., 2020).
Pix2pixHD remains one of the best general image-to-image translation frameworks and supports various types of guidance; we directly train a sketch-to-portrait model using its released code. pSp is proposed for StyleGAN2 inversion; it works in a similar way to ours and can perform image translation while keeping the generator fixed. DeepFaceDrawing provides an online interactive system that allows users to input hand-drawn sketches.


4.2. Evaluation on Fixed-View Synthesis
We first evaluate the generation quality of our method under a fixed-view setting, both quantitatively and qualitatively. For fair comparison, we use sketches extracted from the FFHQ (Karras et al., 2019) dataset as input without using the Contour-to-Sketch module. The qualitative results are shown in Fig. 4. We can observe that DeepFaceDrawing can only synthesize results roughly consistent with the input sketches, which even lack necessary identity information and reasonable texture on some local facial components. pSp produces high-quality portraits for each input sketch, but its results suffer from facial structure or hairstyle mismatches and occasionally incorrect genders. Pix2pixHD generates the most satisfactory portraits among all the comparison methods, with consistent facial boundaries, semantics, and detailed textures. Our SSSP system achieves performance comparable to pix2pixHD: the generated high-quality results match the input sketches well, with only slight identity changes.
Method | SSIM | PSNR | FID | IS | CPBD
---|---|---|---|---|---
pSp (w inv) | 0.363 | 10.618 | 96.19 | 2.86 | 0.591
pix2pixHD (w inv) | 0.441 | 13.262 | 86.99 | 3.28 | 0.596
SSSP (W+ space) | 0.464 | 14.594 | 86.77 | 2.42 | 0.563
SSSP (W space) | 0.448 | 15.237 | 98.50 | 2.96 | 0.625
4.3. Evaluation on Free-View Synthesis
Since our goal is to render 3D-aware portraits from given sketches, we also design additional experiments under a free-viewpoint setting to evaluate the novel-view synthesis ability of our system. The novel-view synthesis results of 2D StyleGAN2-based models (Shen et al., 2020) do not enjoy 3D consistency. Thus, as is common practice, the results generated by previous methods are also converted to the 3D space for novel-view rendering. For fair comparison, we directly leverage our generator, EG3D (Chan et al., 2022), to perform 3D GAN inversion.
Specifically, the results of pix2pixHD and pSp are first generated on the 2D plane. Then we map each image to a corresponding latent code in EG3D via optimization-based GAN inversion. Based on the recovered tri-plane representation, free-view images can be obtained directly by changing the camera pose. Note that DeepFaceDrawing (Chen et al., 2020) focuses on frontal-face synthesis, and re-training it on our training data with various poses would inevitably degrade its generation quality; thus, we do not include it in this comparison.
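For reference, the optimization-based inversion used to lift the 2D baselines into the 3D generator's latent space can be sketched as a simple latent-code optimization loop; the initialization helper, learning rate, iteration count, and the plain L2 loss here are placeholders rather than the exact settings used, and a perceptual term (e.g., LPIPS) would normally be added.

```python
import torch
import torch.nn.functional as F

def invert_into_3d_generator(generator, target_image, camera, num_steps=500, lr=0.01):
    """Optimize a latent code so that the 3D generator reproduces a 2D result.

    generator(w, camera) is assumed to return the high-resolution render for code w.
    """
    # Assumed helper returning an average latent code as the starting point.
    w = generator.mean_latent().clone().requires_grad_(True)
    opt = torch.optim.Adam([w], lr=lr)
    for _ in range(num_steps):
        opt.zero_grad()
        render = generator(w, camera)
        loss = F.mse_loss(render, target_image)   # pixel-level reconstruction loss
        loss.backward()
        opt.step()
    return w.detach()   # novel views: re-render the same w under new camera poses
```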
Quantitative Comparisons. We first perform a quantitative comparison between our SSSP and its counterparts. The results are reported in Table 1, in which the two comparison methods are denoted as "pix2pixHD (w inv)" and "pSp (w inv)", respectively. Our model outperforms its counterparts on most of the evaluation metrics, which demonstrates the high-quality image generation ability of our approach.
Qualitative Comparisons. Qualitative comparisons are depicted in Fig. 5. Both pSp and pix2pixHD suffer from blurry textures under novel views. We find that most of these problems are caused by imperfect 3D inversion of their results: even with a long optimization time of about 20 minutes and different inversion weights, the EG3D inversion of these methods' outputs remains unsatisfactory. In contrast, with our encoder, our method produces faces in less than a second. Moreover, the generation quality of our method is clearly better, which demonstrates the effectiveness of our approach.
4.4. Ablation Study
To further evaluate the contributions of our proposed techniques, we conduct an ablation study, whose results are reported in Table 2. Concretely, we construct three variants: the model without the region-aware rendering strategy (w/o RAS), the model without the symmetric constraint (w/o SC), and a model that encodes sketches into the W+ space. Among the variants, "w/o RAS" achieves the worst score on CPBD, which demonstrates that our proposed RAS effectively alleviates blurry textures in the results. "w/o SC" achieves the worst SSIM and PSNR, which indicates that the symmetric constraint effectively enhances the learning of the encoder. Furthermore, we also explore the generation ability of our model when the sketches are encoded into the W+ space instead of the W space used in our full pipeline ("Full Model (W space)"). Compared with our full model, the W+ variant obtains comparable performance on the image-quality metrics but a much lower score on CPBD. This indicates that encoding the sketches into the W space contributes to better 3D portrait synthesis quality on novel views.
Besides the quantitative comparisons, we also provide visualization results of these variants. Compared with our full model, "w/o RAS" tends to render over-smooth facial textures on local parts (e.g., eyes, mouth, and teeth), which degrades the quality of the generated images. For "w/o SC", we provide an additional comparison in the last two rows with a pair of symmetric input sketches: "w/o SC" fails to reconstruct symmetric images for these two sketches, which intuitively demonstrates the effectiveness of our proposed SC. By using all the components in our full model, we are able to synthesize high-quality and multi-view-consistent portraits.
Method | SSIM | PSNR | FID | IS | CPBD
---|---|---|---|---|---
w/o RAS | 0.420 | 14.128 | 101.24 | 2.43 | 0.471
w/o SC | 0.412 | 13.662 | 102.53 | 2.69 | 0.564
Full Model (W space) | 0.448 | 15.237 | 98.50 | 2.96 | 0.625

5. Usability Study
To evaluate the usability of our SSSP system, we invite 15 participants (denoted as U1-U15) to conduct a usability study composed of a fixed-task study and an open-ended study with the graphical interface we developed.
Among the users, 6 are female and 9 are male, with ages uniformly distributed from 20 to 40. Two are product designers, three are computer science researchers, and two are artists; the others do not work on related topics. We first ask all participants to rate their drawing skills from 1 to 5 (5 being the highest), and 80% of them are amateur or intermediate users (scores 1 to 3). Each participant is then required to perform a fixed-task drawing session to learn how to use our graphical interface. After they warm up and become familiar with the system, an open-ended drawing session is conducted in which they create their expected portraits without any limitations.

5.1. Graphical Interface Design
The visualization of the interface is shown in Fig. 7. It provides tools for quickly editing and creating users' desired portraits from both coarse and fine sketches.
Contour Selection. Though users can directly create their sketches from scratch, drawing a reasonable contour has proven difficult for most amateur users. We provide a collection of portrait contours with different genders, hairstyles, facial shapes, and poses. Users can select one of the given contours as an easy starting point, which saves considerable effort.
Contour Editing. We provide contour drawing, erasing, and undo functions in the contour editing stage, as in common image editing applications. Users can create or edit the coarse contour in this step. After finishing the contour, they click the "Sketch-Gen" button, and the drawn contour is sent to our Contour-to-Sketch module to synthesize a detailed sketch. The whole generation process takes less than 1 second and thus does not affect the user experience.
Sketch Editing. The initial sketches here are created from the contours of the previous step. Users can interact with the sketches in the same way as in contour editing: they can erase unsatisfactory lines created by the network and add their desired details. After clicking the "Portrait-Gen" button, the sketch is fixed and sent to our Sketch-Guided Stereoscopic Portrait Generation module to guide the generation of a high-quality 3D-aware portrait. This process takes less than 2 seconds.
5.2. Fixed-task Study
In this session, we provide several real portraits as target images for helping the users learn our system and practice their drawing skills. Each participant is requested to fill in a questionnaire and rate from 1 to 5 (1 for strongly disagree, 5 for strongly agree) on the following six aspects:
1) if our system is easy to use;
2) if our system is compactly designed;
3) if our system generates results consistent with your intentions;
4) if our system offers sufficient guidance;
5) if the coarse-to-fine process eases your creation;
6) if the 3D demonstration increases your motivation.
The rating results are shown in the upper part of Fig. 9. We observe that over 60% of the participants rate 4 (agree) or 5 (strongly agree) on all six aspects; notably, all participants agree that the 3D demonstration truly increases their motivation for creating. U1, an amateur, comments: "The generated 3D portraits are really impressive, and I can somehow quickly find out where and how to edit after a few times trying. The 3D demonstration can easily show me which part I should pay attention to when drawing contours. This improves not only my drawing skills, but also my imagination for the 3D space".
Most participants are satisfied with the design of our system and give high scores on "Easy Use" and "Compactly Designed". U5, a product designer, comments: "I feel the system is user-friendly and well-designed. The 3D effect of the results is really impressive. It will indeed attract users after perfection.". Only a very few participants (those rating their drawing skill 4 or 5) rate 1 or 2 on "Coarse-to-Fine" and "Sufficient Guidance"; they explain that they have already developed their own drawing habits and styles and prefer creating works freely by themselves without any guidance. According to the results, our system provides sufficient guidance and assistance that can effectively help amateur and intermediate users practice their drawing skills and create their expected portraits.

5.3. Open-ended Study
In the open-ended study session, the participants are requested to freely create their expected portraits with their hand-drawn sketches. Similarly, they are asked to fill in a questionnaire and rate from 1 to 5 (1 for strongly disagree, 5 for strongly agree) on five aspects:
1) if our system generates diverse results;
2) if our system generates high-quality results;
3) if our system fits your expectation;
4) if our system supplies sufficient guidance;
5) if the 3D demonstration increases your motivation.
The rating results are shown in the lower part of Fig. 9. More than 73% of the participants rate 4 (agree) or 5 (strongly agree) on all five aspects, and all participants agree or strongly agree that our 3D-aware portrait synthesis is impressive and effective when creating portraits. U7 comments: "The 3D generation effect is reasonable, just exactly as people imagined". U10, a content creation researcher, comments: "The quality of the generated images under multiple views are quite amazing. It would be very convenient for the artists to create 3D virtual human or other kind of 3D assets if this technique could be generalized to other categories."
12 out of 15 participants agree or strongly agree with "High Quality" and "Diverse Results", which indicates that our proposed SSSP achieves robust performance even when the input free-hand sketches come from amateur users, thanks to the carefully designed network architecture and training strategy.
Though we have shown the advantages of our SSSP system above, some participants also kindly leave constructive advice for improving our interactive system. Hardly any participants rate 1 or 2 on "Sufficient Guidance", "High Quality", or "Diverse Results". A few participants with better drawing skills do not agree with the "Expectation Fitness" aspect. U9 points out that "More editing functions should be developed, such as skin color, hair color and curly/straight hair". U12 comments that "Common functions in the drawing software, such as zoom in/out and various line widths should be integrated into the system". We will work to incorporate such functions, commonly found in commercial software, into our interactive system to make it more user-friendly.

6. Conclusion
In this paper, we propose the Stereoscopic Simplified Sketch-to-Portrait (SSSP) system, which generates high-quality, stroke-consistent portraits with 3D awareness. Compared with previous methods, our system enjoys several intriguing properties. 1) Our system directly converts sketches into portraits with 3D representations, so views can be freely rendered from the 3D space; this makes our system inherently different from previous studies. 2) We propose the Contour-to-Sketch module, which robustly converts sparse contours to detailed sketches; such a topic has rarely been discussed. 3) We build a user-friendly graphical interface where users can freely choose to edit simple contours or sketches. Extensive user studies show that our system is greatly preferred by users.
References
- Cai et al. (2022) Shengqu Cai, Anton Obukhov, Dengxin Dai, and Luc Van Gool. 2022. Pix2NeRF: Unsupervised Conditional p-GAN for Single Image to Neural Radiance Fields Translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 3981–3990.
- Chan et al. (2022) Eric R Chan, Connor Z Lin, Matthew A Chan, Koki Nagano, Boxiao Pan, Shalini De Mello, Orazio Gallo, Leonidas J Guibas, Jonathan Tremblay, Sameh Khamis, et al. 2022. Efficient geometry-aware 3D generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 16123–16133.
- Chan et al. (2021) Eric R Chan, Marco Monteiro, Petr Kellnhofer, Jiajun Wu, and Gordon Wetzstein. 2021. pi-gan: Periodic implicit generative adversarial networks for 3d-aware image synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 5799–5809.
- Chen et al. (2021) Shu-Yu Chen, Feng-Lin Liu, Yu-Kun Lai, Paul L Rosin, Chunpeng Li, Hongbo Fu, and Lin Gao. 2021. DeepFaceEditing: deep face generation and editing with disentangled geometry and appearance control. ACM Transactions on Graphics (TOG) (2021).
- Chen et al. (2020) Shu-Yu Chen, Wanchao Su, Lin Gao, Shihong Xia, and Hongbo Fu. 2020. DeepFaceDrawing: Deep generation of face images from sketches. ACM Transactions on Graphics (TOG) 39, 4 (2020), 72–1.
- Chen et al. (2009) Tao Chen, Ming-Ming Cheng, Ping Tan, Ariel Shamir, and Shi-Min Hu. 2009. Sketch2photo: Internet image montage. ACM transactions on graphics (TOG) 28, 5 (2009), 1–10.
- Chen and Hays (2018) Wengling Chen and James Hays. 2018. Sketchygan: Towards diverse and realistic sketch to image synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 9416–9425.
- Chen et al. (2008) Xuejin Chen, Sing Bing Kang, Ying-Qing Xu, Julie Dorsey, and Heung-Yeung Shum. 2008. Sketching reality: Realistic interpretation of architectural designs. ACM Transactions on Graphics (TOG) 27, 2 (2008), 1–15.
- Chen et al. (2022) Yuedong Chen, Qianyi Wu, Chuanxia Zheng, Tat-Jen Cham, and Jianfei Cai. 2022. Sem2NeRF: Converting Single-View Semantic Masks to Neural Radiance Fields. arXiv preprint arXiv:2203.10821 (2022).
- Dekel et al. (2018) Tali Dekel, Chuang Gan, Dilip Krishnan, Ce Liu, and William T Freeman. 2018. Sparse, smart contours to represent and edit images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3511–3520.
- Deng et al. (2022) Yu Deng, Jiaolong Yang, Jianfeng Xiang, and Xin Tong. 2022. Gram: Generative radiance manifolds for 3d-aware image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10673–10683.
- Eitz et al. (2011) Mathias Eitz, Ronald Richter, Kristian Hildebrand, Tamy Boubekeur, and Marc Alexa. 2011. Photosketcher: interactive sketch-based image synthesis. IEEE Computer Graphics and Applications 31, 6 (2011), 56–66.
- Esser et al. (2021) Patrick Esser, Robin Rombach, and Bjorn Ommer. 2021. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition.
- Goodfellow et al. (2014) Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. 2014. Generative Adversarial Nets. In NeurIPS.
- Gu et al. (2021) Jiatao Gu, Lingjie Liu, Peng Wang, and Christian Theobalt. 2021. Stylenerf: A style-based 3d-aware generator for high-resolution image synthesis. arXiv preprint arXiv:2110.08985 (2021).
- He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In CVPR.
- Heusel et al. (2017) Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. 2017. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems 30 (2017).
- Isola et al. (2017) Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. 2017. Image-to-Image Translation with Conditional Adversarial Networks. CVPR (2017).
- Karras et al. (2018) Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. 2018. Progressive growing of gans for improved quality, stability, and variation. ICLR (2018).
- Karras et al. (2019) Tero Karras, Samuli Laine, and Timo Aila. 2019. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 4401–4410.
- Karras et al. (2020) Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. 2020. Analyzing and improving the image quality of stylegan. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 8110–8119.
- Lee et al. (2020) Cheng-Han Lee, Ziwei Liu, Lingyun Wu, and Ping Luo. 2020. MaskGAN: towards diverse and interactive facial image manipulation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
- Li et al. (2019) Yuhang Li, Xuejin Chen, Feng Wu, and Zheng-Jun Zha. 2019. Linestofacephoto: Face photo generation from lines with conditional self-attention generative adversarial networks. In Proceedings of the 27th ACM International Conference on Multimedia. 2323–2331.
- Li et al. (2020) Yuhang Li, Xuejin Chen, Binxin Yang, Zihan Chen, Zhihua Cheng, and Zheng-Jun Zha. 2020. Deepfacepencil: Creating face images from freehand sketches. In Proceedings of the 28th ACM International Conference on Multimedia. 991–999.
- Max (1995) Nelson Max. 1995. Optical models for direct volume rendering. IEEE Transactions on Visualization and Computer Graphics 1, 2 (1995), 99–108.
- Mildenhall et al. (2020) Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. 2020. NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis. In ECCV.
- Narvekar and Karam (2009) Niranjan D Narvekar and Lina J Karam. 2009. A no-reference perceptual image sharpness metric based on a cumulative probability of blur detection. In 2009 International Workshop on Quality of Multimedia Experience. IEEE, 87–91.
- Niemeyer and Geiger (2021) Michael Niemeyer and Andreas Geiger. 2021. Giraffe: Representing scenes as compositional generative neural feature fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 11453–11464.
- Or-El et al. (2022) Roy Or-El, Xuan Luo, Mengyi Shan, Eli Shechtman, Jeong Joon Park, and Ira Kemelmacher-Shlizerman. 2022. Stylesdf: High-resolution 3d-consistent image and geometry generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 13503–13513.
- Park et al. (2019) Taesung Park, Ming-Yu Liu, Ting-Chun Wang, and Jun-Yan Zhu. 2019. Semantic image synthesis with spatially-adaptive normalization. In CVPR.
- Richardson et al. (2021) Elad Richardson, Yuval Alaluf, Or Patashnik, Yotam Nitzan, Yaniv Azar, Stav Shapiro, and Daniel Cohen-Or. 2021. Encoding in style: a stylegan encoder for image-to-image translation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2287–2296.
- Salimans et al. (2016) Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. 2016. Improved techniques for training gans. Advances in neural information processing systems 29 (2016).
- Schwarz et al. (2020) Katja Schwarz, Yiyi Liao, Michael Niemeyer, and Andreas Geiger. 2020. Graf: Generative radiance fields for 3d-aware image synthesis. Advances in Neural Information Processing Systems 33 (2020), 20154–20166.
- Shen et al. (2020) Yujun Shen, Ceyuan Yang, Xiaoou Tang, and Bolei Zhou. 2020. InterFaceGAN: Interpreting the Disentangled Face Representation Learned by GANs. TPAMI (2020).
- Sun et al. (2022a) Jingxiang Sun, Xuan Wang, Yichun Shi, Lizhen Wang, Jue Wang, and Yebin Liu. 2022a. IDE-3D: Interactive Disentangled Editing for High-Resolution 3D-aware Portrait Synthesis. arXiv preprint arXiv:2205.15517 (2022).
- Sun et al. (2022b) Jingxiang Sun, Xuan Wang, Yong Zhang, Xiaoyu Li, Qi Zhang, Yebin Liu, and Jue Wang. 2022b. Fenerf: Face editing in neural radiance fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 7672–7682.
- Van Den Oord et al. (2017) Aaron Van Den Oord, Oriol Vinyals, et al. 2017. Neural discrete representation learning. Advances in neural information processing systems 30 (2017).
- Wang et al. (2018) Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. 2018. High-resolution image synthesis and semantic manipulation with conditional gans. In CVPR.
- Wang et al. (2004) Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. 2004. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing 13, 4 (2004), 600–612.
- Wu et al. (2020) Shangzhe Wu, Christian Rupprecht, and Andrea Vedaldi. 2020. Unsupervised learning of probably symmetric deformable 3d objects from images in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 1–10.
- Zhu et al. (2017) Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. 2017. Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks. In Computer Vision (ICCV), 2017 IEEE International Conference on.