Joint Geometric-Semantic Driven Character Line Drawing Generation
Abstract.
Character line drawing synthesis can be formulated as a special case of the image-to-image translation problem that automatically converts a photo into a line-drawing style. In this paper, we present the first generative adversarial network based end-to-end trainable translation architecture, dubbed P2LDGAN, for the automatic generation of high-quality character line drawings from input photos/images. The core component of our approach is the joint geometric-semantic driven generator, which uses our well-designed cross-scale dense skip connections framework to embed learned geometric and semantic information for generating delicate line drawings. To support the evaluation of our model, we release a new dataset including 1,532 well-matched pairs of freehand character line drawings and corresponding character images/photos, where the line drawings, in diverse styles, are manually drawn by skilled artists. Extensive experiments on the introduced dataset demonstrate the superior performance of our proposed model against state-of-the-art approaches in terms of quantitative, qualitative and human evaluations. Our code, models and dataset will be available on GitHub.

1. Introduction
Character Line Drawing, also named Line Art, aims to translate the information in the photograph/image domain into a simplified representation built from fundamental graphic components (e.g., lines or curves) that show the change of planes. It is an abstract and flexible form of art whose applications span diverse scenes, such as entertainment (Yan et al., 2021), key art (Chen et al., 2020), caricature (Jang et al., 2021) and computer generated animation (Mourot et al., 2021). High-quality line drawing generation typically requires professional artists or domain experts to spend considerable effort, and more detail in the structural lines means more difficulty in drawing the line art (Zheng et al., 2020). Manually drawing character line art is therefore a labor-intensive, time-consuming and challenging task, and it is highly desirable to develop an automatic generation method that can assist artists or amateurs in drawing character line drawings.
Recently, the great progress of convolutional neural networks (CNNs), especially generative adversarial networks (GANs), has advanced image generation/translation tasks thanks to their powerful ability to synthesize impressive, high-visual-quality, realistic and faithful images. Is it possible to automatically perform character photo-to-line drawing style translation with the help of these artificial intelligence techniques? In essence, this question can be considered an image-to-image translation problem that converts a character image/photo into the line drawing representation domain, and GAN-based approaches are, in principle, applicable to it.
However, to the best of our knowledge, there are no previous studies specifically developed for character line drawing creation. In addition, directly extending state-of-the-art image translation methods (e.g., pix2pix (Isola et al., 2017), CycleGAN (Zhu et al., 2017), UNIT (Liu et al., 2017)) to this task cannot achieve high-quality character drawings due to the following issues. (1) Freehand character line drawings are abstract (Gao et al., 2020) and sparse (Yi et al., 2022), which is quite different from other image translation applications. (2) Loss of semantic information results in unclear and imperfect lines.
Therefore, in this paper, we attempt to address these challenges. We introduce the first GAN-based end-to-end architecture, called P2LDGAN, aiming to automatically learn the cross-domain correspondence between images/photos and hand-drawn line drawings. Our P2LDGAN takes a real character photo as input and outputs a realistic hand-drawn-style character line drawing. To improve generation quality with more details and clearer lines, we design a joint geometric-semantic driven generator, in which feature maps of different scales and information flows from the encoding stage are densely concatenated into the corresponding layers of the decoder using cross skips, fusing geometric and semantic features for fine-grained drawing generation. For the discriminator, we adopt a patch discriminator to improve the discriminative ability.
In order to train and evaluate our proposed model, we introduce a new dataset consisting of more than one thousand pairs of character images/photos and line drawings, where the line drawings are manually created by skilled artists we invited. We quantitatively and qualitatively compare our framework against the state of the art on this newly collected dataset, and experimental results show the superiority of the proposed P2LDGAN. Finally, we perform ablation studies to further validate the effectiveness of our key components.
In summary, the main contributions of our study are:
• To the best of our knowledge, this is probably the first work focusing on full-body character line drawing generation.
• We contribute a new photo-to-line drawing dataset including 1,532 pairs of character photos/images and corresponding well-aligned freehand line drawings provided by professional artists, to benefit research on character line drawing creation.
• We present the first joint geometric-semantic driven generative adversarial architecture, with our well-designed cross-scale dense skip connections framework as the generator, for automatic character line drawing generation in an end-to-end manner.
• Experimental results and a perceptual study demonstrate the effectiveness of the proposed P2LDGAN. Compared to state-of-the-art methods, our model is more robust to different categories and styles of input and achieves significantly higher quantitative and qualitative performance.

2. Related Work
2.1. Image-to-Image Translation
Image-to-image translation (Mao et al., 2022)(Wang et al., 2022)(Richardson et al., 2021)(Pang et al., 2021) can be formulated as an image generation function that maps a given source domain image into a desired target artistic style. It has a wide range of applications, including image cartoonization (Dong et al., 2021)(Wang and Yu, 2020), flat filling (Zhang et al., 2021b)(Wu et al., 2021), face sketch synthesis (Li et al., 2021)(Cao et al., 2022) and image inpainting (Zhao et al., 2020)(Liu et al., 2021). In particular, generative adversarial network based image translation has received substantial interest. Based on conditional adversarial networks, pix2pix (Isola et al., 2017) develops a general image translation framework for different tasks, such as colorization and sketch-to-portrait. However, training instability makes it difficult to apply pix2pix to high-resolution image generation. Wang et al. (Wang et al., 2018) introduced multi-scale generator and discriminator models together with a new feature matching loss to form pix2pixHD and achieve high-resolution results. Qi et al. (Qi et al., 2021) developed a semantic-driven GAN with pix2pix as the backbone and designed an adaptive re-weighting loss to balance the contributions of semantic classes. In contrast with learning from paired data, unsupervised image translation has recently attracted more and more attention. CycleGAN (Zhu et al., 2017) designs a cycle consistency constraint to learn from unpaired inputs. Similarly, DualGAN (Yi et al., 2017) proposes a general-purpose image translation framework using a cyclic mapping, while DiscoGAN (Kim et al., 2017) couples two different GANs for bidirectional mapping between the source and target domains. Based on the assumption that corresponding images from different domains share the same intermediate representation in a latent space, Liu et al. (Liu et al., 2017) integrated the core ideas of GANs and variational autoencoders (VAEs) to form the UNIT framework and model each image domain. MUNIT (Huang et al., 2018) assumes that the representations of source and target images can be decomposed into a shared content space and domain-specific style spaces; to achieve image translation, it swaps and recombines these content and style codes.
2.2. Deep Learning in Line Drawing
Deep learning has already made significant strides in a wide variety of visual processing tasks and contributes greatly to line drawing related applications, including line drawing colorization (Yuan and Simo-Serra, 2021), artistic shadow creation (Zhang et al., 2021a), manga structural line extraction (Li et al., 2017), and face portrait line drawing (Yi et al., 2020). Zhang et al. (Zhang et al., 2021b) proposed a split filling mechanism to perform line art colorization: they first split the input user scribbles into several groups for influence area estimation, then run a data-driven color generation process for each group, and finally merge the outputs to form high-quality filling results. Zheng et al. (Zheng et al., 2020) first designed a ShapeNet to learn 3D geometric information from line drawings, and then a RenderNet to produce 3D shadows. The SmartShadow (Zhang et al., 2021a) application consists of three data-driven tools: a shadow brush used to indicate the areas where the user wants to create shadows, a shadow boundary brush to control the shadow boundaries, and a global shadow generator that shades the entire image based on the estimated global shadow direction. Li et al. (Li et al., 2017) took advantage of residual networks and symmetric skipping networks to build a CNN-based method for structural line extraction from manga images. To generate portrait drawings in multiple and unseen styles, Yi et al. (Yi et al., 2022) designed a novel portrait drawing generation architecture using an asymmetric cycle structure, where a regression network is first adopted to compute a quality score for an APDrawing; based on this model, they defined a quality loss to guide the network toward high-quality APDrawings. Im2Pencil (Li et al., 2019) and (Zhang et al., 2021c) treat line drawing generation as an independent subtask to facilitate pencil illustration and manga synthesis, respectively.
Different from these methods, we apply the generative adversarial network to a new problem, i.e. our P2LDGAN mainly concentrates on finding cross-domain relations between character images/photos and hand-drawn character line drawings using paired data.
3. Data Preparation
Our photo/image-to-line drawing converter can be fundamentally posed as a supervised image-to-image translation problem, which relies heavily on a substantial number of training samples. It is therefore necessary to construct a dataset that helps our model learn an accurate mapping to bridge the real character photo/image and freehand character line drawing domains.
Specifically, we first collect high-resolution character images/photos from the internet, mainly covering five categories, namely male, female, manga/cartoon male, manga/cartoon female, and others, to enrich and diversify the data. Then, to meet the demands of different artistic styles, we invite experienced artists and skilled students majoring in visual communication design to manually draw the character line drawings at the same scale as the given source images/photos using drawing applications and digital devices. Finally, to standardize the gathered images/photos and hand-drawn line art, we preprocess them to a size of 1024 × 1024 pixels, and fine-tune the structural lines manually, with the help of professional artists and image processing software, to form strictly aligned image/photo-line drawing pairs describing the same character.
We carefully select 1,532 pairs of high-quality character line drawings paired with real images/photos to construct our final dataset. Table 1 shows the category distribution of the collected dataset, and Figure 2 shows representative character image/photo-line drawing pairs. To facilitate learning and evaluation, we split the dataset into two disjoint subsets, i.e., 70% for training and the remaining 30% for testing; a minimal loading/splitting sketch is given after Table 1. The data will be released to the public for research related to line drawing.
Category | Male | Female | Cartoon Male | Cartoon Female | Others |
---|---|---|---|---|---|
No. | 388 | 422 | 210 | 426 | 86 |
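As a concrete illustration of how such paired data could be consumed, the following minimal PyTorch sketch loads photo/drawing pairs and performs a random 70%/30% split; the folder layout, file naming, transform settings and split function are illustrative assumptions, not the released dataset's actual format.

```python
# Minimal sketch of a paired photo/line-drawing dataset loader (assumed layout:
# same-named files in two folders), not the authors' released data format.
import os, random
from PIL import Image
from torch.utils.data import Dataset
import torchvision.transforms as T

class PairedLineDrawingDataset(Dataset):
    def __init__(self, photo_dir, drawing_dir, size=512):
        self.names = sorted(os.listdir(photo_dir))
        self.photo_dir, self.drawing_dir = photo_dir, drawing_dir
        self.tf = T.Compose([
            T.Resize((size, size)),
            T.ToTensor(),
            T.Normalize([0.5] * 3, [0.5] * 3),  # map pixel values to [-1, 1]
        ])

    def __len__(self):
        return len(self.names)

    def __getitem__(self, idx):
        name = self.names[idx]
        photo = Image.open(os.path.join(self.photo_dir, name)).convert("RGB")
        drawing = Image.open(os.path.join(self.drawing_dir, name)).convert("RGB")
        return self.tf(photo), self.tf(drawing)

def split_70_30(names, seed=0):
    # Random 70% / 30% train/test split over pair names.
    names = list(names)
    random.Random(seed).shuffle(names)
    cut = int(0.7 * len(names))
    return names[:cut], names[cut:]
```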
4. Character Photo-to-Line Drawing Translation Architecture

4.1. Overview
We attempt to generalize the generative adversarial framework to the photo-to-line drawing style conversion problem, i.e., automatic artistic character line drawing generation from real images/photos. Given well-aligned source-reference pairs $\{(x_i, y_i)\}_{i=1}^{N}$, where $x_i$ and $y_i$ belong to the character photo domain $\mathcal{X}$ and the line drawing domain $\mathcal{Y}$, respectively, and $N$ denotes the number of image pairs, the goal of our model is to learn a mapping function $G$ that automatically discovers the domain relations between line drawings and corresponding character photos:
$G : \mathcal{X} \rightarrow \mathcal{Y}, \qquad \hat{y} = G(x) \qquad (1)$
Figure 3 illustrates the details of our P2LDGAN for this cross-domain correspondence learning, which consists of (1) a joint geometric-semantic driven generator G, and (2) a character line drawing discriminator D. In the following, we give a detailed description of our method.
4.2. Joint Geometric-Semantic Driven Generator
Character line drawings depict a character's clear and critical features using structural lines. Therefore, both geometric and semantic information play an important role in synthesizing vital details in the drawings. Previous state-of-the-art image translation models mainly adopt a U-Net framework with skip connections. However, these methods only combine feature maps of the same scale from the encoding and decoding stages, lacking cross-scale fusion of geometric and semantic information, which limits generation quality.
Therefore, in order to progressively propagate geometric and semantic information into the output line drawing, we develop a simple but effective framework, named the cross-scale dense skip connections module, which is the core of our joint geometric-semantic driven generator.
Our generator is fundamentally an encoder-decoder architecture, where the encoding stage compresses the rich information in the character photo into a latent representation, while the decoding network constructs the desired line drawing from the encoded representation. As shown in the bottom-left part of Figure 3, we adopt the pre-trained ResNeXt-50 as the encoder because of its simplicity, modularity, efficiency and strong learning capability. Specifically, we extract stages conv2-conv5 from ResNeXt as our encoding layers (i.e., ResNeXt layers), which can be formulated as,
$\mathcal{F}^{k}(x) = x + \sum_{i=1}^{C} \mathcal{T}_{i}(x) \qquad (2)$
where $k$ indicates the $k$-th layer of the encoder, $x$ and $\mathcal{F}^{k}(x)$ denote the input and output, and $C$ is the number of groups. $\mathcal{T}_{i}$ is the transformation function, here a sequence of $1\times1$, $3\times3$ and $1\times1$ convolutions.
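For concreteness, the following sketch shows how the conv2-conv5 stages of a pre-trained ResNeXt-50 could be exposed as the four encoding layers, assuming a recent torchvision; the stage grouping and weight choice are our reading of the paper, not its released code.

```python
# Sketch of an encoder built from torchvision's pre-trained ResNeXt-50,
# keeping the conv2-conv5 stages (layer1-layer4 in torchvision naming).
import torch.nn as nn
from torchvision import models

class ResNeXtEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        backbone = models.resnext50_32x4d(weights="IMAGENET1K_V1")
        # Stem (conv1 + pooling) feeds the first encoding stage.
        self.stem = nn.Sequential(backbone.conv1, backbone.bn1,
                                  backbone.relu, backbone.maxpool)
        # conv2-conv5 stages, producing progressively smaller feature maps.
        self.stages = nn.ModuleList([backbone.layer1, backbone.layer2,
                                     backbone.layer3, backbone.layer4])

    def forward(self, x):
        x = self.stem(x)
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)   # multi-scale features reused by the cross skips
        return feats          # scales 1/4, 1/8, 1/16, 1/32 of the input size
```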
We feed a character image/photo into the encoder and extract feature maps from each ResNeXt layer, which, together with our decoding layers, build the cross-scale skip connections module to fuse and propagate information at different levels of abstraction. To be specific, suppose our generator has $L$ layers in total; the input of the current decoding layer is the combination of the output of the previous layer and those of the corresponding encoding layers having the same or larger resolutions. The formulation is defined as,
$\hat{x}^{k} = x^{k-1} \oplus \mathrm{PDS}\!\left(x_{e}^{1}\right) \oplus \cdots \oplus \mathrm{PDS}\!\left(x_{e}^{m}\right) \qquad (3)$
where $x^{k-1}$ represents the output of the previous ($k{-}1$)-th layer, and $x_{e}^{1}$ to $x_{e}^{m}$ denote the outputs of the 1st to $m$-th encoding layers whose scales are larger than or equal to that of the $k$-th layer. $\mathrm{PDS}(\cdot)$ refers to the progressive scaling operation that rescales feature maps using convolutions, and $\oplus$ is the element-wise sum operation. Subsequently, the fused multi-scale features flow through the decoding layer, which propagates geometric and semantic information into higher-resolution maps.
$x^{k} = \mathrm{UP}\!\left(\mathcal{D}^{k}\!\left(\hat{x}^{k}\right)\right) \qquad (4)$
Here, the $\mathrm{UP}(\cdot)$ operation upsamples the feature maps by a factor of 2 using nearest-neighbor interpolation, and $\mathcal{D}^{k}$ denotes the $k$-th decoding layer. Through our cross skip connection mechanism, feature maps from each encoding layer can be embedded into different decoding layers to strengthen semantic and geometric information integration and improve feature propagation across the encoder and decoder, so that we can directly train our generator to learn the character line drawing modality and translate a real character image/photo into a line drawing with details preserved.
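The cross-scale decoding step of Eqs. (3)-(4), as we reconstruct it, can be sketched as follows; the 1×1 projection plus nearest-neighbor rescaling used as a stand-in for PDS, and the Conv-InstanceNorm-ReLU body, are our assumptions rather than the paper's exact layer choices.

```python
# Sketch of one cross-scale decoding step: encoder features at equal-or-larger
# scales are projected (stand-in for PDS), summed element-wise with the
# previous decoder output, convolved, and upsampled by 2 (nearest neighbor).
import torch.nn as nn
import torch.nn.functional as F

class CrossScaleDecoderLayer(nn.Module):
    def __init__(self, enc_channels, in_channels, out_channels):
        super().__init__()
        # One projection per incoming encoder feature (channels -> in_channels).
        self.proj = nn.ModuleList(
            [nn.Conv2d(c, in_channels, kernel_size=1) for c in enc_channels])
        self.body = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, 3, padding=1),
            nn.InstanceNorm2d(out_channels),
            nn.ReLU(inplace=True))

    def forward(self, prev, enc_feats):
        fused = prev
        for proj, f in zip(self.proj, enc_feats):
            f = proj(f)
            # Rescale each encoder feature to the current decoder resolution.
            f = F.interpolate(f, size=prev.shape[-2:], mode="nearest")
            fused = fused + f                       # element-wise sum fusion
        out = self.body(fused)
        return F.interpolate(out, scale_factor=2, mode="nearest")  # UP(.)
```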
4.3. Discriminator
The task of the discriminator network is to distinguish generated character line drawings from the ground truth. Here, we adopt the PatchGAN exploited in (Isola et al., 2017) to classify whether patches are real or not. The bottom-middle part of Figure 3 displays the discriminator architecture: each discriminator block is composed of a Convolution, InstanceNorm and LeakyReLU with a slope of 0.2. Such a discriminator contributes to high-quality line drawing generation since it makes our P2LDGAN pay more attention to detailed content within patches, and can process images of any size with fewer parameters (Zhu et al., 2017).
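A minimal sketch of such a patch discriminator, following the stated Conv-InstanceNorm-LeakyReLU(0.2) block design, is given below; the channel widths follow the common pix2pix configuration and are an assumption, not the paper's exact numbers.

```python
# PatchGAN-style discriminator sketch: stacked Conv -> InstanceNorm ->
# LeakyReLU(0.2) blocks ending in a 1-channel patch-wise score map.
import torch.nn as nn

def disc_block(in_ch, out_ch, norm=True):
    layers = [nn.Conv2d(in_ch, out_ch, kernel_size=4, stride=2, padding=1)]
    if norm:
        layers.append(nn.InstanceNorm2d(out_ch))
    layers.append(nn.LeakyReLU(0.2, inplace=True))
    return nn.Sequential(*layers)

class PatchDiscriminator(nn.Module):
    def __init__(self, in_ch=3):
        super().__init__()
        self.model = nn.Sequential(
            disc_block(in_ch, 64, norm=False),
            disc_block(64, 128),
            disc_block(128, 256),
            disc_block(256, 512),
            nn.Conv2d(512, 1, kernel_size=4, padding=1),  # one score per patch
        )

    def forward(self, x):
        # Returns an N x 1 x H' x W' map of real/fake scores.
        return self.model(x)
```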
4.4. Objective Functions
Adversarial Loss. The generator $G$ creates a character line drawing from an image $x$ in the character photo domain $\mathcal{X}$ that should not be distinguishable by the discriminator $D$, while the goal of $D$ is to discriminate between translated drawings and real samples $y$ from the line drawing domain $\mathcal{Y}$. Motivated by the significant stability and high-quality generation capability of relativistic average GANs (Jolicoeur-Martineau, 2018), we introduce the following adversarial losses to supervise our P2LDGAN for more realistic line drawing synthesis.
$\mathcal{L}_{adv}^{D} = \mathrm{MSE}\big(D(y) - \mathbb{E}[D(G(x))],\, \mathbf{1}\big) + \mathrm{MSE}\big(D(G(x)) - \mathbb{E}[D(y)],\, \mathbf{0}\big) \qquad (5)$

$\mathcal{L}_{adv}^{G} = \mathrm{MSE}\big(D(G(x)) - \mathbb{E}[D(y)],\, \mathbf{1}\big) + \mathrm{MSE}\big(D(y) - \mathbb{E}[D(G(x))],\, \mathbf{0}\big) \qquad (6)$
where $\mathrm{MSE}(\cdot,\cdot)$ is the mean square error loss, $\mathbb{E}[\cdot]$ performs the mean operation, the input photo is evaluated in patches of the same size as the outputs of the discriminator $D$, and $\mathbf{1}$ and $\mathbf{0}$ are tensors filled with the scalar values 1 and 0, respectively.
Pixel-wise Loss. Since we use paired samples to train our model, we can introduce the $L_{1}$ loss (Chen and Hays, 2018) (compared to the $L_{2}$ loss, it introduces less blurriness (Yi et al., 2017)) to measure the error between the generated and ground truth character line drawings, which makes the translated drawings follow the domain distribution. The function is defined as,
$\mathcal{L}_{pixel} = \mathbb{E}_{x,y}\big[\, \| y - G(x) \|_{1} \,\big] \qquad (7)$
The final objective loss function of our P2LDGAN is given by,
$\mathcal{L} = \lambda_{adv}\,\mathcal{L}_{adv} + \lambda_{pixel}\,\mathcal{L}_{pixel} \qquad (8)$
In our experiments, we set the weights , , and , respectively.
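The objective can be sketched as follows under our reconstruction of Eqs. (5)-(8): relativistic average adversarial terms with an MSE criterion plus an $L_1$ pixel loss; the weight values shown are placeholders, not the paper's settings.

```python
# Sketch of the training objective as reconstructed above: relativistic
# average adversarial terms (MSE criterion) plus an L1 pixel loss.
import torch
import torch.nn as nn

mse = nn.MSELoss()
l1 = nn.L1Loss()

def d_loss(D, real, fake):
    d_real, d_fake = D(real), D(fake.detach())
    ones, zeros = torch.ones_like(d_real), torch.zeros_like(d_real)
    # Real patches should score above the average fake patch, and vice versa.
    return (mse(d_real - d_fake.mean(), ones) +
            mse(d_fake - d_real.mean(), zeros))

def g_loss(D, real, fake, lambda_adv=1.0, lambda_pix=100.0):  # placeholder weights
    d_real, d_fake = D(real), D(fake)
    ones, zeros = torch.ones_like(d_real), torch.zeros_like(d_real)
    adv = (mse(d_fake - d_real.mean(), ones) +
           mse(d_real - d_fake.mean(), zeros))
    pix = l1(fake, real)              # L1 between output and ground truth drawing
    return lambda_adv * adv + lambda_pix * pix
```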
5. Experiments
In this section, we first describe the implementation details and evaluation metrics. Then we evaluate the performance of our P2LDGAN by making quantitative and qualitative comparisons against state-of-the-art models and reporting a human perceptual study. Finally, we conduct ablation studies to verify the effectiveness of the designed components of our proposed approach.
5.1. Implementation Details
We implement the photo-to-line drawing translator using PyTorch. All experiments are conducted on a single NVIDIA GeForce RTX 3090 GPU. For both the generator and discriminator networks, we use the Adam optimizer with momentum parameters $\beta_1$ and $\beta_2$. The initial learning rate is set to 0.0002. We train the model for 200 epochs with a mini-batch size of 1. All input images and ground truths are aligned and resized to 512 × 512 pixels.
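A sketch of this training setup is given below, wiring the stated hyper-parameters (Adam, lr = 2e-4, 200 epochs, batch size 1, 512 × 512 inputs) to the earlier sketches; the Adam betas, the builder function build_generator() and the data paths are assumptions for illustration only.

```python
# Training-loop sketch using the dataset, discriminator and loss sketches
# defined above; build_generator(), the betas and the paths are placeholders.
import torch
from torch.utils.data import DataLoader

device = "cuda" if torch.cuda.is_available() else "cpu"
G = build_generator().to(device)              # hypothetical generator builder
D = PatchDiscriminator().to(device)
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4, betas=(0.5, 0.999))  # assumed betas
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4, betas=(0.5, 0.999))
loader = DataLoader(PairedLineDrawingDataset("photos/", "drawings/"),
                    batch_size=1, shuffle=True)

for epoch in range(200):
    for photo, drawing in loader:
        photo, drawing = photo.to(device), drawing.to(device)
        fake = G(photo)

        # Discriminator step on (ground truth, detached fake).
        opt_d.zero_grad()
        d_loss(D, drawing, fake).backward()
        opt_d.step()

        # Generator step: adversarial + pixel-wise L1 terms.
        opt_g.zero_grad()
        g_loss(D, drawing, G(photo)).backward()
        opt_g.step()
```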
5.2. Evaluation Metrics
To measure the quality of the synthesized line drawing images, three commonly used metrics, namely the Fréchet Inception Distance (FID) (Heusel et al., 2017), the Structural Similarity Metric (SSIM) (Wang et al., 2004) and the Peak Signal-to-Noise Ratio (PSNR) (Chen et al., 2020), are adopted to quantitatively evaluate the performance of previous methods and our proposed P2LDGAN. FID measures the distribution distance between the set of line drawings generated from input photos and the corresponding ground truth drawings (the smaller the FID value, the better the drawing quality), SSIM describes the similarity between a generated image and the ground truth (a higher SSIM value indicates a better result), and PSNR computes the intensity difference between predicted and ground truth images (a larger PSNR score means a smaller difference between images) (Pang et al., 2021).
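One possible way to compute these metrics is sketched below, using scikit-image (a recent version with channel_axis) for SSIM/PSNR and torchmetrics for FID; this tooling choice is ours and not necessarily the authors'.

```python
# Sketch of metric computation; scikit-image and torchmetrics are our tooling
# choices, not necessarily those used in the paper.
from skimage.metrics import structural_similarity, peak_signal_noise_ratio
from torchmetrics.image.fid import FrechetInceptionDistance

def ssim_psnr(pred, gt):
    # pred, gt: uint8 RGB numpy arrays of the same shape.
    ssim = structural_similarity(pred, gt, channel_axis=-1)
    psnr = peak_signal_noise_ratio(gt, pred)
    return ssim, psnr

def fid_score(fake_batch, real_batch):
    # fake_batch, real_batch: uint8 tensors of shape (N, 3, H, W).
    fid = FrechetInceptionDistance(feature=2048)
    fid.update(real_batch, real=True)
    fid.update(fake_batch, real=False)
    return fid.compute().item()
```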
5.3. Comparison with State-of-the-Art Methods
To demonstrate the superior performance of our P2LDGAN model, we make quantitative and qualitative comparisons with six state-of-the-art networks, including one neural style transfer method (i.e., Gatys (Gatys et al., 2016)) and five general image-to-image translation approaches: CycleGAN (Zhu et al., 2017), DiscoGAN (Kim et al., 2017), UNIT (Liu et al., 2017), pix2pix (Isola et al., 2017) and MUNIT (Huang et al., 2018). It is worth noting that we also use the paired data to train the unsupervised methods.
5.3.1. Qualitative Comparisons

We qualitatively evaluate the performance of our proposed P2LDGAN by comparing it with state-of-the-art competitors on our testing data. Figure 1 and Figure 4 showcase visual examples. From these results, we can draw the following findings.
A character line drawing is essentially composed of abstract lines without any texture. However, the example-based Gatys (Gatys et al., 2016) produces drawings with a global gray-like feeling (Li et al., 2019) (shown in the third column), filled with much texture, which are far from the line drawing distribution. This is mainly because of its use of the Gram matrix (Yi et al., 2020). On the contrary, our learned model generates clear and natural lines with little texture.
CycleGAN (Zhu et al., 2017) heavily blurs details, such as the boy's clothes and shoes in the first row of Figure 1, and introduces strong distortions in various areas (e.g., the girl's hair and bag in the third row of Figure 4), which results in visually unappealing line quality. pix2pix (Isola et al., 2017) succeeds in creating a somewhat acceptable perceptual appearance, but it suffers from jagged boundaries, incomplete lines, and loss of details. For example, as can be seen from the second row of Figure 1, pix2pix cannot reconstruct the man's nose. In comparison, our P2LDGAN is able to alleviate such unpleasant artifacts, creating better-looking line drawings with more details and structural information preserved.
We also compare our model to DiscoGAN (Kim et al., 2017), UNIT (Liu et al., 2017) and MUNIT (Huang et al., 2018). Although they capture the underlying character line styles and produce more reliable results, the drawings they yield are corrupted by messy boundaries and over-smoothed lines. Our approach produces more precise lines with improved visual quality, coherent with the ground truth line drawings.
In addition, most of these methods do not take full advantage of semantic information; they therefore fail to learn the salient characteristics of the input photo and to generate natural line drawings with few visible artifacts. For example, the third rows of Figure 1 and Figure 4 clearly show that these models cannot generate delicate human face structures.
In summary, (1) the proposed P2LDGAN significantly outperforms state-of-the-art methods in terms of visual quality, detail preservation and artifact reduction, and (2) our method works well on learning both simple and complex character line drawings.
5.3.2. Quantitative Comparisons
To quantify the realism and faithfulness of the generated character line drawings (Gao et al., 2020), we conduct an objective evaluation by computing the average scores of the three metrics, FID, SSIM and PSNR, on our testing set. The quantitative comparison results with state-of-the-art GAN-based methods are summarized in Table 2. We can clearly observe that (1) our architecture achieves the best FID, SSIM and PSNR values, significantly outperforming the previous competitors from the realism and faithfulness point of view; (2) specifically, the lowest FID score suggests that our generated line drawings are closest to the ground truth drawing distribution, while the highest SSIM and PSNR scores further show the maximum similarity between our results and the ground truths; and (3) in summary, the quantitative experiments demonstrate the effectiveness and superiority of our model in synthesizing high-quality character line drawings, consistent with the visual results.
Methods | FID | SSIM | PSNR |
---|---|---|---|
Gatys (Gatys et al., 2016) | 164.1 | 0.6295 | 16.411 |
CycleGAN (Zhu et al., 2017) | 61.8 | 0.8335 | 21.262 |
DiscoGAN (Kim et al., 2017) | 54.5 | 0.8664 | 20.664 |
UNIT (Liu et al., 2017) | 52.5 | 0.8478 | 20.712 |
pix2pix (Isola et al., 2017) | 52.2 | 0.8917 | 22.711 |
MUNIT (Huang et al., 2018) | 57.1 | 0.8461 | 20.995 |
Our baseline | 50.5 | 0.8984 | 23.372 |
P2LDGAN | 47.4 | 0.9020 | 22.929 |
5.3.3. Human Perceptual Study
Since character image/photo-to-line drawing translation is a highly subjective task, we perform human perceptual studies to evaluate the ability of our model to generate better-looking drawings.
Participants. 30 participants joined our user studies, including 20 students without any artistic background and 10 professional students with at least three years of drawing experience.
Data. This experiment covers 50 pairs of well-aligned character photos and line drawings randomly selected from the testing set. We use our proposed models and the state-of-the-art networks described in the previous section to translate the photos into line drawings, and the ground truth drawings are used as reference images.
Setup. We show an input photo, the ground truth and eight generated line drawings to the participants, who are asked to rank the similarity between the ground truth and the resulting drawings, with 4 meaning the highest rank and 1 the lowest (Yi et al., 2020). The users are also asked to rate the overall quality (whether there are noise, deformation, unclear or unstructured lines, incomplete areas, or other artifacts in the drawings) (Wang and Yu, 2020) on a scale of 1 to 5 (a higher score denotes better quality).
Result and Discussion. We collected 1,500 votes in total. The percentage of line drawings ranked 4 and the average drawing quality scores are summarized in Table 3. From these results, we can report that (1) our method receives the highest values on both evaluation criteria, (2) it performs favorably against the state-of-the-art frameworks in terms of user preference, and (3) in conclusion, our method is able to learn the structural line representation and generate high-quality character line drawings.
Methods | Gatys (Gatys et al., 2016) | CycleGAN (Zhu et al., 2017) | DiscoGAN (Kim et al., 2017) | UNIT (Liu et al., 2017) | pix2pix (Isola et al., 2017) | MUNIT (Huang et al., 2018) | Our baseline | P2LDGAN |
---|---|---|---|---|---|---|---|---|
Similarity (Rank 4) | 8.2% | 6.7% | 63.9% | 35.3% | 59.5% | 43.7% | 67.7% | 68.0% |
Overall quality | 2.796 | 2.490 | 4.435 | 3.391 | 4.162 | 3.837 | 4.4723 | 4.530 |
6. Ablation Study and Discussion

Methods | Parameters | FID | SSIM | PSNR |
---|---|---|---|---|
Baseline | 67.2M | 50.5 | 0.8984 | 23.372 |
w/o sum | 86.0M | 48.1 | 0.9008 | 23.322 |
DeConv + Upsample + Conv | 103.2M | 47.4 | 0.8994 | 22.222 |
P2LDGAN | 81.1M | 47.4 | 0.9020 | 22.929 |
We conduct the following ablation experiments to study the contributions of the key factors.
6.1. Analysis of Fusion Strategy
We replace the sum operation in the cross-scale skip connections module with concatenation. From Figure 5, we can see that concatenation achieves acceptable results but with unclear boundaries, some roughness and noise. The numerical values in Table 4 also demonstrate the influence the fusion strategy has on drawing quality.
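The two fusion variants compared here can be sketched as follows; the 1×1 reduction convolution in the concatenation variant is our assumption for how the channel count is restored, which would also explain its larger parameter count in Table 4.

```python
# Sketch contrasting the two fusion strategies in the ablation:
# element-wise sum (as in P2LDGAN) versus channel concatenation followed by a
# 1x1 convolution to restore the channel count (assumed "w/o sum" variant).
import torch
import torch.nn as nn

class SumFusion(nn.Module):
    def forward(self, prev, skips):
        out = prev
        for f in skips:               # features already share prev's shape
            out = out + f
        return out

class ConcatFusion(nn.Module):
    def __init__(self, channels, num_skips):
        super().__init__()
        # Concatenation grows the channel dimension, hence extra parameters.
        self.reduce = nn.Conv2d(channels * (num_skips + 1), channels, kernel_size=1)

    def forward(self, prev, skips):
        return self.reduce(torch.cat([prev, *skips], dim=1))
```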
6.2. Analysis of Generator
In order to validate the effectiveness of our cross-scale skip connections module in keeping fine details, we perform quantitative and qualitative comparisons between our P2LDGAN with and without the cross-scale skip connections module (the latter being our baseline). From the visual results, it can be seen that without our connection module, the generated line drawings lose many structural details (e.g., the hair area in the second row), resulting in poor visual quality, while our P2LDGAN restores these delicate structures, producing robust and better-looking results.
We also replace our decoding layer with a DeConvolution + Upsample + Convolution block. As can be seen in Figure 5 and Table 4, the network with the replaced decoder produces character line drawings that are similar to or even better than those of our proposed model. However, it requires more parameters and computation than P2LDGAN.
In summary, our well-designed P2LDGAN shows superior performance in translating character images/photos into high-quality line drawings with better detailed structure, clearer lines, and fewer artifacts.
7. Conclusion, Limitation and Future Work

In this paper, we cast character line drawing generation as a GAN-based image-to-image translation task. We present a novel joint geometric-semantic driven GAN model, named P2LDGAN, to discover a mapping function that learns the relations between a character image/photo and its translated line drawing representation. We construct a dataset consisting of 1,532 pairs of character images/photos and hand-drawn line drawings for training and validating our solution. Quantitative and qualitative comparisons with state-of-the-art methods, human evaluation and ablation studies show the superiority of our P2LDGAN framework in generation quality.
Limitation. Figure 6 presents some failure examples. It can be observed that (1) when photos are overexposed or underexposed, areas that are too bright (e.g., the clothing in the first column) or too dark (e.g., the facial area in the second column) can go missing in the generated line drawings, and (2) low-quality original photos, such as those with blurry textures, cause noisy and messy lines that fail to preserve fine details, as in the third column. These problems will be considered in our future work, where we will also investigate how to improve the realism and faithfulness of real human line drawings.
References
- Cao et al. (2022) Bing Cao, Nannan Wang, Jie Li, Qinghua Hu, and Xinbo Gao. 2022. Face photo-sketch synthesis via full-scale identity supervision. Pattern Recognition 124 (2022), 108446.
- Chen and Hays (2018) Wengling Chen and James Hays. 2018. Sketchygan: Towards diverse and realistic sketch to image synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 9416–9425.
- Chen et al. (2020) Zhuo Chen, Chaoyue Wang, Bo Yuan, and Dacheng Tao. 2020. Puppeteergan: Arbitrary portrait animation with semantic-aware appearance transformation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 13518–13527.
- Dong et al. (2021) Yongsheng Dong, Wei Tan, Dacheng Tao, Lintao Zheng, and Xuelong Li. 2021. CartoonLossGAN: Learning Surface and Coloring of Images for Cartoonization. IEEE Transactions on Image Processing 31 (2021), 485–498.
- Gao et al. (2020) Chengying Gao, Qi Liu, Qi Xu, Limin Wang, Jianzhuang Liu, and Changqing Zou. 2020. Sketchycoco: Image generation from freehand scene sketches. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 5174–5183.
- Gatys et al. (2016) Leon A Gatys, Alexander S Ecker, and Matthias Bethge. 2016. Image style transfer using convolutional neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition. 2414–2423.
- Heusel et al. (2017) Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. 2017. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems 30 (2017).
- Huang et al. (2018) Xun Huang, Ming-Yu Liu, Serge Belongie, and Jan Kautz. 2018. Multimodal unsupervised image-to-image translation. In Proceedings of the European conference on computer vision (ECCV). 172–189.
- Isola et al. (2017) Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. 2017. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition. 1125–1134.
- Jang et al. (2021) Wonjong Jang, Gwangjin Ju, Yucheol Jung, Jiaolong Yang, Xin Tong, and Seungyong Lee. 2021. StyleCariGAN: caricature generation via StyleGAN feature map modulation. ACM Transactions on Graphics (TOG) 40, 4 (2021), 1–16.
- Jolicoeur-Martineau (2018) Alexia Jolicoeur-Martineau. 2018. The relativistic discriminator: a key element missing from standard GAN. In International Conference on Learning Representations.
- Kim et al. (2017) Taeksoo Kim, Moonsu Cha, Hyunsoo Kim, Jung Kwon Lee, and Jiwon Kim. 2017. Learning to discover cross-domain relations with generative adversarial networks. In International conference on machine learning. PMLR, 1857–1865.
- Li et al. (2017) Chengze Li, Xueting Liu, and Tien-Tsin Wong. 2017. Deep extraction of manga structural lines. ACM Transactions on Graphics (TOG) 36, 4 (2017), 1–12.
- Li et al. (2021) Ping Li, Bin Sheng, and CL Philip Chen. 2021. Face sketch synthesis using regularized broad learning system. IEEE Transactions on Neural Networks and Learning Systems (2021).
- Li et al. (2019) Yijun Li, Chen Fang, Aaron Hertzmann, Eli Shechtman, and Ming-Hsuan Yang. 2019. Im2pencil: Controllable pencil illustration from photographs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 1525–1534.
- Liu et al. (2021) Hongyu Liu, Ziyu Wan, Wei Huang, Yibing Song, Xintong Han, and Jing Liao. 2021. Pd-gan: Probabilistic diverse gan for image inpainting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 9371–9381.
- Liu et al. (2017) Ming-Yu Liu, Thomas Breuel, and Jan Kautz. 2017. Unsupervised image-to-image translation networks. Advances in neural information processing systems 30 (2017).
- Mao et al. (2022) Qi Mao, Hung-Yu Tseng, Hsin-Ying Lee, Jia-Bin Huang, Siwei Ma, and Ming-Hsuan Yang. 2022. Continuous and diverse image-to-image translation via signed attribute vectors. International Journal of Computer Vision (2022), 1–33.
- Mourot et al. (2021) Lucas Mourot, Ludovic Hoyet, François Le Clerc, François Schnitzler, and Pierre Hellier. 2021. A Survey on Deep Learning for Skeleton-Based Human Animation. In Computer Graphics Forum. Wiley Online Library.
- Pang et al. (2021) Yingxue Pang, Jianxin Lin, Tao Qin, and Zhibo Chen. 2021. Image-to-image translation: Methods and applications. IEEE Transactions on Multimedia (2021).
- Qi et al. (2021) Xingqun Qi, Muyi Sun, Weining Wang, Xiaoxiao Dong, Qi Li, and Caifeng Shan. 2021. Face Sketch Synthesis via Semantic-Driven Generative Adversarial Network. In 2021 IEEE International Joint Conference on Biometrics (IJCB). IEEE, 1–8.
- Richardson et al. (2021) Elad Richardson, Yuval Alaluf, Or Patashnik, Yotam Nitzan, Yaniv Azar, Stav Shapiro, and Daniel Cohen-Or. 2021. Encoding in style: a stylegan encoder for image-to-image translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2287–2296.
- Wang et al. (2022) Teng Wang, Lin Wu, and Changyin Sun. 2022. A coarse-to-fine approach for dynamic-to-static image translation. Pattern Recognition 123 (2022), 108373.
- Wang et al. (2018) Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. 2018. High-resolution image synthesis and semantic manipulation with conditional gans. In Proceedings of the IEEE conference on computer vision and pattern recognition. 8798–8807.
- Wang and Yu (2020) Xinrui Wang and Jinze Yu. 2020. Learning to cartoonize using white-box cartoon representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 8090–8099.
- Wang et al. (2004) Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. 2004. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing 13, 4 (2004), 600–612.
- Wu et al. (2021) Yanze Wu, Xintao Wang, Yu Li, Honglun Zhang, Xun Zhao, and Ying Shan. 2021. Towards Vivid and Diverse Image Colorization with Generative Color Prior. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 14377–14386.
- Yan et al. (2021) Lan Yan, Wenbo Zheng, Chao Gou, and Fei-Yue Wang. 2021. IsGAN: Identity-sensitive generative adversarial network for face photo-sketch synthesis. Pattern Recognition 119 (2021), 108077.
- Yi et al. (2022) Ran Yi, Yong-Jin Liu, Yu-Kun Lai, and Paul Rosin. 2022. Quality metric guided portrait line drawing generation from unpaired training data. IEEE Transactions on Pattern Analysis and Machine Intelligence (2022).
- Yi et al. (2020) Ran Yi, Yong-Jin Liu, Yu-Kun Lai, and Paul L Rosin. 2020. Unpaired portrait drawing generation via asymmetric cycle mapping. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 8217–8225.
- Yi et al. (2017) Zili Yi, Hao Zhang, Ping Tan, and Minglun Gong. 2017. Dualgan: Unsupervised dual learning for image-to-image translation. In Proceedings of the IEEE international conference on computer vision. 2849–2857.
- Yuan and Simo-Serra (2021) Mingcheng Yuan and Edgar Simo-Serra. 2021. Line art colorization with concatenated spatial attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 3946–3950.
- Zhang et al. (2021a) Lvmin Zhang, Jinyue Jiang, Yi Ji, and Chunping Liu. 2021a. SmartShadow: Artistic Shadow Drawing Tool for Line Drawings. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 5391–5400.
- Zhang et al. (2021b) Lvmin Zhang, Chengze Li, Edgar Simo-Serra, Yi Ji, Tien-Tsin Wong, and Chunping Liu. 2021b. User-guided line art flat filling with split filling mechanism. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 9889–9898.
- Zhang et al. (2021c) Lvmin Zhang, Xinrui Wang, Qingnan Fan, Yi Ji, and Chunping Liu. 2021c. Generating manga from illustrations via mimicking manga creation workflow. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 5642–5651.
- Zhao et al. (2020) Lei Zhao, Qihang Mo, Sihuan Lin, Zhizhong Wang, Zhiwen Zuo, Haibo Chen, Wei Xing, and Dongming Lu. 2020. Uctgan: Diverse image inpainting based on unsupervised cross-space translation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 5741–5750.
- Zheng et al. (2020) Qingyuan Zheng, Zhuoru Li, and Adam Bargteil. 2020. Learning to shadow hand-drawn sketches. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 7436–7445.
- Zhu et al. (2017) Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. 2017. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision. 2223–2232.