Customize StyleGAN with One Hand Sketch
Abstract.
Generating images from human sketches typically requires dedicated networks trained from scratch. In contrast, the emergence of pre-trained vision-language models (e.g., CLIP) has propelled generative applications that control the output imagery of existing StyleGAN models with text inputs or reference images. In parallel, our work proposes a framework to control StyleGAN imagery with a single user sketch. In particular, we learn a conditional distribution in the latent space of a pre-trained StyleGAN model via energy-based learning and propose two novel energy functions leveraging CLIP for cross-domain semantic supervision. Once trained, our model can generate multi-modal images semantically aligned with the input sketch. Quantitative evaluations on synthesized datasets show that our approach improves significantly over previous methods in the one-shot regime. The superiority of our method is further underscored when experimenting with a wide range of human sketches of diverse styles and poses. Surprisingly, our models outperform the previous baseline in both the range of acceptable sketch inputs and output image quality, despite operating under a stricter setting: no extra training data and a single sketch input.

1. Introduction
Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) have opened up exciting possibilities in content generation owing to their remarkable ability to produce photorealistic images. In particular, StyleGAN (Karras et al., 2019, 2020) has further been revealed to possess inherently disentangled properties (Härkönen et al., 2020). This discovery aligns perfectly with a crucial requirement of real-world graphics applications, namely an intuitive interface for expressing users' creative intent, and has inspired a wave of controllable synthesis methods rooted in manipulating StyleGAN-generated imagery (Shen et al., 2020; Wu et al., 2021; Abdal et al., 2021). The paradigm shift is further propelled by the fusion of StyleGAN with Contrastive Language-Image Pre-training (CLIP) (Patashnik et al., 2021), enabling users to exert control over model-generated imagery through textual descriptions alone.
As one observes this remarkable progress, a tantalizing question arises: can we achieve a similar level of control over a pre-trained StyleGAN's imagery with a user sketch? Despite having been extensively explored, sketch-guided generation under current methodologies often necessitates dedicated frameworks and datasets. In contrast, we aspire to align with the contemporary trend of harnessing the abundant domain knowledge inherent in pre-trained generative models, thus circumventing the redundant reinvention of the wheel. While a similar concept has been explored by (Wang et al., 2021), our approach leverages CLIP models (Radford et al., 2021) for supervision, eliminating the need for adversarial training. This feat is achievable through the CLIP space's extraordinary capability to encapsulate high-level semantic concepts, bridging not only vision and language but diverse visual representational styles as well.
In this paper, we introduce a CLIP-based approach designed to repurpose existing StyleGAN models to generate multi-modal, photorealistic images aligned with a specific user sketch. Notably, we are the first to harness CLIP for semantic guidance in sketch-based synthesis. Our method also draws inspiration from other concurrent research, particularly PromptGen (Wu et al., 2022), which reformulates the task of controllable synthesis as learning a conditional distribution within the latent space of a frozen generative network via energy-based modeling (EBM). We extend this scheme for sketch-guided synthesis by designing novel energy functions leveraging CLIP’s cross-domain semantic knowledge. Furthermore, we acknowledge that sketch-guided synthesis is inherently multi-modal, as reference sketches should convey object-level concepts without specifying colours, textures, or intricate details. To address this, we rely on StyleGAN’s inherent disentangled structure and employ style-mixing techniques introduced in (Karras et al., 2020), supervising the learning only along content-related directions in CLIP space.
We conduct quantitative comparisons against existing methods with the synthesized dataset published in (Wang et al., 2021) and then proceed to apply our approach to real hand-drawn sketches from QuickDraw (Cheema et al., 2012) and Sketchy (Sangkloy et al., 2016), which present a wide range of sketch styles and poses. Unlike previous methods, our model requires no additional datasets to train and takes only one user sketch and a neutral text description (e.g., 'cat') as inputs. Nevertheless, it still outperforms the baseline quantitatively in the one-shot regime. The advantage is even more remarkable with real sketches: despite our more stringent one-shot setting, our method not only significantly improves upon previous works in terms of output quality but also excels in its ability to adapt to various sketch styles and poses. As demonstrated in Figure 1, even when the sketches are overly abstract and distorted, the output samples are photorealistic and successfully aligned with the main concepts in the sketches (e.g., pose, position). Finally, we present several examples to illustrate that our method integrates smoothly with other StyleGAN-based manipulations such as latent-space editing and natural image inversion.
2. Related Work
Sketch-guided deep image synthesis. Deep-learning solutions to sketch-based synthesis typically approach the task as image-to-image translation. Scribbler (Sangkloy et al., 2017) and SketchyGAN (Chen and Hays, 2018) are among the first works to propose end-to-end solutions explicitly targeting human sketch inputs. However, these data-oriented approaches require paired sketch-image datasets for training, the collection of which demands substantial human effort. Further, human sketches' varying styles, distortions, and abstraction levels amplify the challenge of making such datasets distributionally representative. An alternative approach is to explore unsupervised or self-supervised learning (Liu et al., 2020, 2021; Cheng et al., 2023). Nevertheless, these methods still necessitate training dedicated generative models from scratch. In contrast, several works (Wang et al., 2021; Koley et al., 2023) attempt to tackle this task from a transfer learning (Wang et al., 2018) perspective. GANSketch (Wang et al., 2021) introduced a novel cross-domain adversarial loss to rewrite pre-trained GAN weights, thus generating multi-modal images aligned with the reference sketches. (Koley et al., 2023) proposed a decoupled training paradigm, splitting the sketch-to-image (S2I) task into two sub-stages, sketch-to-image latent mapping and image generation, while using a pre-trained GAN model to handle the generative stage. Nonetheless, the mapping stage still requires paired data to train. Our work follows a similar intuition of leveraging pre-trained generative models. However, we distinguish ourselves from previous works by obviating the need for adversarial training or extra training datasets.
Controllable GANs. Early approaches to controllable synthesis in GANs involved training dedicated networks that took explicit controlling parameters as input (Li et al., 2019b; Shoshan et al., 2021). Recent advancements have shifted towards leveraging the intrinsically disentangled latent space (Wu et al., 2021) of pre-trained StyleGAN models. Instead of explicit parameters, controllable generation is achieved by manipulating sampled latents along pre-discovered directions within the latent space (Abdal et al., 2021; Shen et al., 2020). StyleCLIP (Patashnik et al., 2021) is the pioneering work to pair this approach with CLIP, allowing text-guided control over StyleGAN imagery. Further extending the idea of CLIP-based supervision, recent methods (Gal et al., 2022; Zhu et al., 2021; Kwon and Ye, 2023) have applied these techniques to few-shot GAN adaptation, showcasing the possibility of synthesizing images with desired attributes by finetuning a GAN with only textual descriptions or reference images. However, they rely on ad-hoc regularization methods such as identity loss (Richardson et al., 2021) or partial model freezing to enforce in-domain constraints, which either limits the scope of target imagery or leads to outputs that do not preserve necessary features of the original domain. As observed in (Wu et al., 2022), when conditioned on an in-domain attribute 'baby (-like)', existing methods like StyleCLIP and StyleGAN-NADA (Gal et al., 2022) either fail to give condition-compliant outputs or produce unrealistic-looking ones. Conversely, alternative approaches (Nguyen et al., 2017; Nie et al., 2021; Wu et al., 2022) propose to directly learn conditional distributions of GANs' output imagery as energy-based models (EBMs) (LeCun et al., 2006). In this way, in-domain constraints are explicitly modelled as distribution priors. Moreover, EBMs associate data probability with energy functions, and the latter can be flexibly designed to accommodate various controlling conditions. PromptGen (Wu et al., 2022) demonstrates several such conditions, performing image synthesis with specific poses or textual descriptions by leveraging off-the-shelf models such as inverse graphics models and CLIP. Our work extends this framework by introducing novel energy functions that leverage CLIP-space embeddings to quantify cross-domain semantic distances. This enhancement enables our model to sample photorealistic images from the source GAN's image space while conforming to the semantic concepts within input sketches.
CLIP for semantic supervision. The CLIP model (Radford et al., 2021), trained on an extensive corpus of paired image-text data, has demonstrated its capacity to encompass rich semantic knowledge within its joint vision-language embedding space. This advancement has ushered in opportunities for image-free supervision, catalyzing innovative approaches across image generation (Nichol et al., 2021), manipulation (Patashnik et al., 2021; Brooks et al., 2023), and image-to-sketch translation (Chan et al., 2022; Lee et al., 2023) tasks. StyleCLIP (Patashnik et al., 2021) is among the first to harness CLIP for semantic supervision, allowing text-guided manipulation of StyleGAN imagery by directly minimizing CLIP scores between images and input texts. StyleGAN-NADA introduces an improved directional CLIP loss for text-guided domain adaptation of GAN models. In parallel, (Chan et al., 2022) investigates an image-to-sketch training scheme leveraging CLIP scores as one of several supervisory constraints. Their findings highlight the efficacy of minimizing CLIP-space distances between synthesized line drawings and input photos as a potent method for cross-domain semantic supervision. These pioneering works underscore CLIP's exceptional representational capacity, transcending different representational domains. Building upon this insight, we introduce a novel approach that leverages CLIP to bridge the semantic gap between images and sketches.

3. Methods
Given a source StyleGAN model, we aim to re-purpose it to produce samples from the model’s image domain that are semantically aligned with a reference sketch.
In the following, we describe our approach, starting with the preliminaries of the latent-space EBM method for controllable image synthesis (Section 3.1). The main novelty of our work lies in two novel CLIP-based energy functions designed for this EBM framework, which encourage semantic similarity between output samples and the reference sketch (Sections 3.2, 3.3). Additionally, we address the inherent multi-modality of sketch-guided synthesis by harnessing StyleGAN's intrinsic style-mixing technique (Section 3.4).
3.1. Preliminaries
Several works (Nguyen et al., 2017; Nie et al., 2021; Wu et al., 2022) formulate the tasks of controlling GAN imagery as learning conditional distributions in the pre-trained models’ latent space as EBMs (LeCun et al., 2006). We briefly explain this method following similar notations as (Wu et al., 2022).
Given a guiding sketch $y \in \mathcal{Y}$ and image samples $x \in \mathcal{X}$, where $\mathcal{X}$ is the image space of a source model and $\mathcal{Y}$ the sketch domain, the core concept of EBM is to associate each possible configuration of $x$ and $y$ with an unnormalized scalar measurement through $E(x, y)$, denoted as the energy function.
Using the Gibbs distribution, the conditional density of $y$ given $x$ can be formulated as the following EBM:

$$
p(y \mid x) = \frac{\exp\!\left(-\lambda E(x, y)\right)}{Z}, \qquad Z = \int_{\mathcal{Y}} \exp\!\left(-\lambda E(x, y)\right)\, dy
\tag{1}
$$

Here $\lambda$ is a weight hyperparameter, and $Z$ is a normalizing constant integrated over $\mathcal{Y}$, denoted as the partition function, which is intractable as $\mathcal{Y}$ represents the hypothetical space of all possible sketches. However, since the source model $G$ and the reference sketch $y$ are both fixed, we note that $Z$ and all following derivations of partition functions are constants not affecting the final training objective.
We then rewrite the conditional density over the image space $\mathcal{X}$, combining Bayes' rule and Equation 1:

$$
p(x \mid y) = \frac{p(y \mid x)\, p(x)}{p(y)} \propto p(x)\, \exp\!\left(-\lambda E(x, y)\right)
\tag{2}
$$
The image prior $p(x)$ is defined by sampling a random latent $z \sim p(z)$ and then translating it to image space through the frozen source GAN, $x = G(z)$. The above EBM is thus reformulated in the latent space:

$$
p(z \mid y) \propto p(z)\, \exp\!\left(-\lambda E(G(z), y)\right)
\tag{3}
$$
$p(z \mid y)$ can be learned with a parameterized distribution $p_\theta(z)$ by minimizing the KL divergence $D_{\mathrm{KL}}\!\left(p_\theta(z) \,\|\, p(z \mid y)\right)$. PromptGen proposes to model $p_\theta$ with an invertible bijection $f_\theta$ (Noé et al., 2019), which maps a Gaussian noise $\epsilon \sim \mathcal{N}(0, I)$ to the latent space, $z = f_\theta(\epsilon)$.
With $f_\theta$'s invertible property, the probability density of $z$ has a closed-form formula:

$$
\log p_\theta(z) = \log p(\epsilon) - \log\left|\det \frac{\partial f_\theta(\epsilon)}{\partial \epsilon}\right|
\tag{4}
$$
Combining with Equation 3, the full objective used to train $f_\theta$ is:

$$
\mathcal{L}(\theta) = \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\!\left[ -\log\left|\det \frac{\partial f_\theta(\epsilon)}{\partial \epsilon}\right| - \log p\!\left(f_\theta(\epsilon)\right) + \lambda\, E\!\left(G(f_\theta(\epsilon)),\, y\right) \right]
$$

We illustrate the training procedure in Figure 2. The first two loss components are the log-determinant of the Jacobian of $f_\theta$ and the latent prior loss. The third term is the energy function, which in theory can be any function that assigns a scalar value to each generated sample given the reference sketch $y$, with smaller values assigned to preferred data samples, in our case, samples semantically similar to $y$. The following two sections discuss how to implement such a cross-domain semantic similarity measurement as the energy function leveraging CLIP.
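To make the objective concrete, below is a minimal PyTorch sketch of one training step, assuming a generic invertible flow that returns both the mapped latent and the log-determinant of its Jacobian, a frozen generator, a standard-Gaussian latent prior, and an abstract `energy_fn`; all interfaces are illustrative placeholders rather than the released implementation.

```python
import torch

def ebm_flow_loss(flow, generator, energy_fn, sketch, lam=2000.0,
                  batch_size=1, latent_dim=512):
    """One step of the latent-space EBM objective described above.

    flow:      invertible network, flow(eps) -> (z, log_det_jacobian)
    generator: frozen source StyleGAN, generator(z) -> image batch
    energy_fn: cross-domain CLIP energy, energy_fn(images, sketch) -> per-sample scalar
    The standard-Gaussian latent prior below is an assumption about the z space.
    """
    eps = torch.randn(batch_size, latent_dim)
    z, log_det = flow(eps)                    # z = f_theta(eps), log|det df_theta/d eps|
    log_prior = -0.5 * (z ** 2).sum(dim=1)    # log p(z) up to an additive constant
    images = generator(z)                     # generator weights stay frozen
    energy = energy_fn(images, sketch)        # smaller = closer to the sketch concept
    # KL(p_theta || p(z|y)) up to constants: -log|det J| - log p(z) + lambda * E
    return (-log_det - log_prior + lam * energy).mean()
```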
3.2. Cross-domain directional CLIP energy
A naive way to leverage CLIP for semantic supervision is to minimize the CLIP-space cosine distance between an image sample $x$ and the reference sketch $y$ (Patashnik et al., 2021):

$$
E_{\mathrm{CLIP}}(x, y) = 1 - \mathrm{sim}\!\left(E_I(x),\, E_I(y)\right)
$$

where $\mathrm{sim}(\cdot, \cdot)$ stands for cosine similarity, and $E_I$ is CLIP's image encoder.
However, owing to the non-trivial domain gap between images and sketches, optimizing CLIP scores directly lets content-irrelevant visual features of the sketch domain leak into the generated images. Inspired by StyleGAN-NADA (Gal et al., 2022), we instead define a pair of source anchors in the CLIP space, one in each representational domain. This source pair is introduced by first choosing an image anchor $x_{\mathrm{src}}$, and then employing an off-the-shelf image-to-sketch network (PhotoSketch (Li et al., 2019a)) to generate its sketch-domain counterpart, $y_{\mathrm{src}}$. The cross-domain directional CLIP energy can then be defined as:

$$
\Delta_I = E_I(x) - E_I(x_{\mathrm{src}}), \qquad \Delta_S = E_I(y) - E_I(y_{\mathrm{src}}), \qquad E_{\mathrm{dir}}(x, y) = 1 - \mathrm{sim}\!\left(\Delta_I, \Delta_S\right)
$$

For each $x$ sampled from the target distribution, we align the CLIP-space direction $\Delta_I$ between the image anchor $x_{\mathrm{src}}$ and $x$ to the direction $\Delta_S$ between the sketch anchor $y_{\mathrm{src}}$ and the target sketch $y$. We rely on the assumption that the CLIP-space distance between two data points within the same representational domain (e.g., $\Delta_I$) depicts only their content-related semantic difference while excluding the common representational features of that artistic domain. With $x_{\mathrm{src}}$ and $y_{\mathrm{src}}$ being cross-domain semantic equivalents, aligning $\Delta_I$ to $\Delta_S$ essentially encourages $x$ to differ from a source concept (defined by $x_{\mathrm{src}}$) along the same semantic direction that leads from that source concept to the target concept of the reference sketch.
Similar to using category names as source anchors in (Gal et al., 2022), one should ideally choose an image source anchor that represents a neutral mode in the data space. However, choices of such source points in the image domain are not immediately obvious. We empirically use $x_{\mathrm{src}} = G(\bar{w})$ (with $\bar{w}$ being the average latent of the pre-trained StyleGAN model), which we find to give decent performance.
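The directional energy can be sketched as follows, assuming a generic `clip_image_encoder` callable and pre-computed anchors; this is a minimal illustration of the formula above, not the exact implementation.

```python
import torch
import torch.nn.functional as F

def directional_clip_energy(clip_image_encoder, x, y, x_src, y_src):
    """Cross-domain directional CLIP energy (Sec. 3.2 sketch).

    x: generated image batch            y: reference sketch (repeated over the batch)
    x_src: image anchor, e.g. G(mean latent)    y_src: PhotoSketch(x_src)
    clip_image_encoder: assumed wrapper mapping images to CLIP embeddings.
    """
    delta_img = clip_image_encoder(x) - clip_image_encoder(x_src)       # within-domain image direction
    delta_sketch = clip_image_encoder(y) - clip_image_encoder(y_src)    # within-domain sketch direction
    return 1.0 - F.cosine_similarity(delta_img, delta_sketch, dim=-1)   # smaller = better aligned
```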

3.3. Cross-domain CLIP-NCE energy
We find that using the directional energy alone is prone to produce artefacts, which we discuss in later sections. Therefore, we propose a second supervision scheme using InfoNCE-like contrastive learning (Oord et al., 2018), following similar notation to (Yu et al., 2022). In contrastive learning, positive samples are pulled closer to a query vector, while negative samples are simultaneously pushed farther away. Following this practice, we first define the query $q$, the positive sample $v^{+}$ and the negative sample $v^{-}$ as follows:

$$
q = E_I(x) - E_I(x_{\mathrm{src}}), \qquad v^{+} = E_I(y) - E_I(y_{\mathrm{src}}), \qquad v^{-} = E_T(t_{\mathrm{src}}) - E_I(x_{\mathrm{src}})
$$
The first two terms are identical to the directional energy formulas: $q$, the direction from the source image anchor to the target image sample, equals $\Delta_I$, while $v^{+} = \Delta_S$ is the positive sample. Additionally introduced is the negative sample $v^{-}$. Here $t_{\mathrm{src}}$ is a supplied category-level description (e.g., 'cat') and $E_T$ is CLIP's text encoder. We utilize the textual embedding of the category label to represent a neutral mode in the source image space from which we aim to push the generated samples away. We find that this also empirically mitigates modification of entity-unrelated features (e.g., background) and regularizes the samples to be more realistic-looking. The full energy function is defined with the same formula as the InfoNCE loss, where $\tau$ is a temperature hyperparameter:

$$
E_{\mathrm{NCE}}(x, y) = -\log \frac{\exp\!\left(\mathrm{sim}(q, v^{+}) / \tau\right)}{\exp\!\left(\mathrm{sim}(q, v^{+}) / \tau\right) + \exp\!\left(\mathrm{sim}(q, v^{-}) / \tau\right)}
$$
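A minimal sketch of the CLIP-NCE energy is shown below; the encoders are assumed callables, and in particular the exact form of the negative sample follows the reconstruction above and should be read as an assumption.

```python
import torch
import torch.nn.functional as F

def clip_nce_energy(clip_image_encoder, clip_text_encoder,
                    x, y, x_src, y_src, text_tokens, tau=0.1):
    """Cross-domain CLIP-NCE energy (Sec. 3.3 sketch).

    q:  direction from the image anchor to the generated sample (query)
    v+: direction from the sketch anchor to the reference sketch (positive)
    v-: direction toward the category-level text embedding (negative) -- this
        negative formulation is an assumption of the sketch, not a confirmed detail.
    """
    q = clip_image_encoder(x) - clip_image_encoder(x_src)
    v_pos = clip_image_encoder(y) - clip_image_encoder(y_src)
    v_neg = clip_text_encoder(text_tokens) - clip_image_encoder(x_src)
    pos = torch.exp(F.cosine_similarity(q, v_pos, dim=-1) / tau)
    neg = torch.exp(F.cosine_similarity(q, v_neg, dim=-1) / tau)
    return -torch.log(pos / (pos + neg))   # InfoNCE with a single positive and negative
```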
3.4. Style-mixing
Sketch-conditioned synthesis intrinsically calls for multi-modality: the sketch does not dictate colour, texture or any other detailed visual features of the target outputs. Rather, we take it to express a high-level semantic concept that should only control the output's rough pose, shape and position. This requirement naturally aligns with StyleGAN's disentangled structure. Specifically, StyleGAN maps an input noise $z$ to an intermediate 512-dimensional latent $w$, which then goes through a different affine transformation to generate style parameters for each layer of the synthesis network. Prior research has demonstrated that the style parameters at different layers control different levels of detail in the output images, roughly divided into three groups: coarse, medium and fine. A common method to achieve multi-modal generation is thus to use a second latent to control selected layers of the synthesis network (Richardson et al., 2021). We adopt this style-mixing technique (Karras et al., 2020) in training when synthesizing images to be evaluated by the energy function. Specifically, when computing the energy, we supply the medium-to-fine layers (5 to last) of the synthesis network with a random style latent sampled from the source model's $\mathcal{W}$ space, both when generating the image sample $x$ and the image anchor $x_{\mathrm{src}}$. That is, for each $x$, the pair of $x$ and $x_{\mathrm{src}}$ is generated with the same detailed features. Our parameterized distribution is thus restrained to learn only coarse-level concepts without undermining the source model's style diversity. The same procedure is also employed at inference time to generate multi-modal outputs with diverse visual details.
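The style-mixing step can be illustrated with the following sketch, assuming wrapper callables for the source model's mapping and synthesis networks and a `num_ws` count of per-layer latents; the interfaces and layer split are illustrative assumptions rather than the released code.

```python
import torch

def style_mixed_sample(flow, mapping, synthesis, num_ws,
                       content_layers=4, latent_dim=512, truncation=0.7, w_mean=None):
    """Draw one multi-modal sample via style mixing (Sec. 3.4 sketch).

    `mapping` and `synthesis` are assumed wrappers around the source StyleGAN's
    mapping network (z -> w) and synthesis network (per-layer W+ -> image);
    `num_ws` is the number of per-layer latent inputs.
    """
    eps = torch.randn(1, latent_dim)
    w_content, _ = flow(eps)                       # coarse 'content' latent from the learned distribution
    w_style = mapping(torch.randn(1, latent_dim))  # fresh style latent from the unchanged source model
    if w_mean is not None:                         # optional truncation toward the average latent
        w_style = w_mean + truncation * (w_style - w_mean)
    w_plus = torch.cat([w_content.unsqueeze(1).expand(-1, content_layers, -1),
                        w_style.unsqueeze(1).expand(-1, num_ws - content_layers, -1)], dim=1)
    return synthesis(w_plus)                       # early layers follow the sketch; the rest keep style diversity
```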

Table 1. FID scores (lower is better) on the four synthesized-sketch test cases; the best score per case is in bold. "Best*" denotes GANSketch's best-performing variants, trained with 30 sketches (5 for standing cat), whereas all of our variants use a single sketch.

| Family | Model variant | No. sketches | standing cat | horse rider | horse on side | garbled church |
|---|---|---|---|---|---|---|
| Pre-trained | Original | N/A | 58.71 | 50.43 | 42.24 | 32.64 |
| GANSketch | Best* | 30 (5) | **31.20** | 19.94 | **29.62** | **16.70** |
| GANSketch | 1-example | 1 | 44.68 | 29.25 | 41.50 | 26.88 |
| Ours | NCE | 1 | 35.05 | 23.21 | 38.41 | 22.53 |
| Ours | Dir | 1 | 38.65 | 33.31 | 44.89 | 20.49 |
| Ours | NCE+aug | 1 | 34.14 | **18.78** | 40.18 | 38.06 |
4. Experiments

Datasets. We use GANSketch's published dataset. The dataset provides four test cases for quantitative evaluation: 1) standing cat, 2) horse on side, 3) horse rider, and 4) garbled church. Each case contains 30 PhotoSketch-generated (Li et al., 2019a) sketches and 2,500 real images picked from the LSUN datasets (Yu et al., 2015) to represent the target distribution. To test real-world scenarios, the GANSketch dataset also includes sketches picked from QuickDraw (Cheema et al., 2012). As target image distributions for human sketches are hard to define, these results are only evaluated qualitatively. We additionally select real sketches from the Sketchy dataset (Sangkloy et al., 2016) to test our method on a broader range of sketch styles and object poses.
Baselines. We compare our method to the pioneering work GANSketch. For the quantitative cases, we compute the Fréchet Inception Distance (FID) (Heusel et al., 2017) between the generated images and the evaluation set, then select the best FID among all training iterations.
The quantitative results are reported in Table 1. The GANSketch scores are taken directly from their report. Note that all our models are trained with a single reference sketch, while the GANSketch best cases require either 30 (horse rider, horse on side, garbled church) or 5 (standing cat) input sketches. We observe that although our scores do not always surpass their best model variants, they improve significantly over their one-shot results. We further showcase in Figure 4 that our output images are of comparable quality.
Ablation. We first conduct quantitative ablation experiments on the PhotoSketch cases and report the FID scores in Table 1. As PhotoSketch-generated sketches still differ from human sketches, we proceed to investigate the individual effects of augmentation and of the two types of energy functions with real sketches.

Augmentation. As in GANSketch, we adopt the translation policy of differentiable augmentation (Zhao et al., 2020). As shown in Table 1, we find that augmentation does not necessarily improve performance on synthesized sketches. In particular, cases with translation-invariant concepts (standing cat, horse rider) benefit from augmentation, while others (horse on side, garbled church) do not, which agrees with the translation-based scheme we choose. When testing on human sketches, we observe that augmentation is in some cases prone to push certain target semantics to extremes (e.g., a very round cat), leading to sketch-conforming but unrealistic-looking results, as demonstrated by the 'lying cat' case in Figure 6. On the other hand, this adverse effect is not observed when the input sketch is relatively accurate in shape (the 'horse head' case). As distortions and overly abstract depictions are common in real-life sketches (as can be observed from the input sketches in Figure 5), we do not apply augmentation when training on the real-sketch test cases in Figure 5.




CLIP-NCE energy versus directional CLIP energy. As reported in Table 1, the CLIP-NCE energy outperforms the directional CLIP energy in three out of four synthesized cases. We attribute this to the CLIP-NCE energy's extra regularizing effect, obtained by contrastively learning against a neutral mode defined by the category-level text, which helps confine the learned semantics to centre around the object class. This extra regularization is less essential for synthesized sketches, which possess precise outlines. In contrast, real-life sketches often contain ambiguous and distorted shapes, in which case the directional energy is prone to produce object-unrelated artefacts, as illustrated in Figure 6. For example, using the directional energy on the 'horse head' sketch causes the wavy line depicting the mane to manifest as human-shaped artefacts in the output samples. An interesting comparison can further be made between the 'NCE+aug' outputs and the directional-energy outputs: while both try to align with an extreme feature (e.g., the round line that is an overly abstract depiction of the cat body in the first case of Figure 6), the 'Dir' model manifests the feature as a curvy background artefact, whereas the 'NCE+aug' variant creates an exaggerated feature within the object. This further underlines the regularizing effect of the CLIP-NCE energy. We thus conclude that the CLIP-NCE energy is preferable when applying our method to real-life sketches.
Real-life sketches. We test our method with real-life sketches using only one reference sketch for each case and show the results in Figure 5. We experiment on the same set of hand sketches taken from QuickDraw as GANSketch (the first four in each category). Additionally, we pick sketches from the Sketchy dataset to verify our method on a wider range of sketch styles and object poses. We report the successful cases in Figure 5. We first note that our method is able to generate realistic and sketch-conforming images in cases where GANSketch reports failure even with multiple sketch inputs (e.g., row 3). Moreover, the output images show a clear tendency to align with high-level semantic concepts (e.g., pose and position). We further provide a qualitative comparison against the baseline's hand-sketch results in Figure 7. The GANSketch images are generated from their published models using the same set of input noises with truncation applied. We observe that our model outputs are of better quality and exhibit semantic consistency in the object-level content.
5. Applications
Apart from multi-modal sketch-guided synthesis, our model also fits into the general scope of image editing. Since our method learns a conditional distribution without modifying the source GAN weights, latent-editing methods naturally remain compatible. We demonstrate the outcomes of applying latent editing to our controlled models. We then show that editing a natural image to resemble the sketch can be achieved by inverting it into a style latent and combining it with a content latent randomly sampled from our learned distribution.
Latent-space editing. One outstanding merit of using StyleGAN for graphics applications is its disentangled latent space, which enables intended edits to be performed simply via perturbations along pre-identified directions in one of its intermediate latent spaces (e.g., $\mathcal{W}$ or $\mathcal{S}$). As we do not modify the source model, all of its intermediate latent spaces remain unchanged, so any latent-editing directions identified within the source model still apply, obviating the need to run direction-discovery algorithms again. We verify this by applying editing directions reported in (Härkönen et al., 2020) to random images sampled from our learned distribution and show that the edited results exhibit both sketch-controlled and latent-edited properties, as shown in Figure 8.
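A minimal sketch of this compatibility is given below, assuming the edit direction is a vector in the source model's $\mathcal{W}$ space and a wrapper `synthesis` callable; the names and interfaces are illustrative, not the actual implementation.

```python
import torch

def edit_with_direction(flow, synthesis, num_ws, direction, strength=2.0, latent_dim=512):
    """Apply a pre-discovered edit direction (e.g. a GANSpace component) to a
    sample drawn from the learned conditional distribution.

    `direction` is assumed to be a (1, latent_dim) vector in the source model's
    W space; `synthesis` and `num_ws` are assumed wrappers around the source model.
    """
    eps = torch.randn(1, latent_dim)
    w, _ = flow(eps)                               # sketch-conditioned latent
    w_edited = w + strength * direction            # same directions as in the unmodified source model
    return synthesis(w_edited.unsqueeze(1).expand(-1, num_ws, -1))
```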
Natural image editing. Our trained model controls the content of an output sample through a learned latent distribution conditioned on the sketch, and controls the style separately through an additional style latent drawn from the unchanged source model via style-mixing. It is therefore possible to edit a natural image by first inverting it into a style latent and then injecting it alongside a content latent randomly sampled from our learned distribution. To illustrate how this works, we first project real images to latents using the state-of-the-art inversion algorithm ReStyle (Alaluf et al., 2021). Following the procedure in Section 3.4, we control the first four layers of the synthesis network with the content latent while using the inverted latent to control the rest; the results are shown in Figure 9. We observe that the outputs are essentially edited versions of the original natural images, with pose and shape transformed to conform to the initial sketch.
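The corresponding editing step can be sketched as below, assuming the inversion method returns a per-layer $\mathcal{W}^{+}$ latent and a wrapper `synthesis` callable; the interfaces are assumptions of this sketch.

```python
import torch

def sketch_edit_real_image(flow, synthesis, w_inverted, content_layers=4, latent_dim=512):
    """Edit an inverted real image to follow the reference sketch (Sec. 5 sketch).

    `w_inverted`: W+ latent of the real image from an off-the-shelf inversion
    method (e.g. ReStyle), shape (1, num_ws, latent_dim). It supplies the style,
    while a content latent drawn from the learned flow takes over the first
    `content_layers` layers.
    """
    eps = torch.randn(1, latent_dim)
    w_content, _ = flow(eps)                                # sketch-conditioned content latent
    w_plus = w_inverted.clone()
    w_plus[:, :content_layers, :] = w_content.unsqueeze(1)  # coarse layers follow the sketch
    return synthesis(w_plus)
```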
6. Discussion
In this work, we present a novel method for one-shot sketch-conditioned image synthesis that leverages pre-trained StyleGANs and CLIP. Our method does not require any image or sketch dataset, and the only user inputs are one hand-drawn sketch and a class-level neutral text description, leaving no extra hurdle for real-life application.
However, there is ample room for further improvement. First, despite the capability to accommodate a wide range of sketch styles and entity poses, as we have demonstrated, our method cannot be guaranteed to work for all sketches. As observed in (Wu et al., 2022), because the framework tackles conditional generation as mode-seeking in the source model's data space, generation tends to fail if the target mode is not covered by the pre-trained model. As shown in Figure 6, despite succeeding in aligning with some semantic attributes of the sketch, our method fails to find a mode that matches the actual sketch concept entirely while maintaining photorealism. Second, our method relies on StyleGAN's intrinsic disentangled structure and thus does not directly transfer to generators without comparable properties. Finally, despite observing a trade-off between realism and sketch fidelity in the results, we have not provided an explicit way to control the degree of output realism, as (Cheng et al., 2023) did. We leave this intriguing issue for future work.
References
- Abdal et al. (2021) Rameen Abdal, Peihao Zhu, Niloy J Mitra, and Peter Wonka. 2021. Styleflow: Attribute-conditioned exploration of stylegan-generated images using conditional continuous normalizing flows. ACM Transactions on Graphics (ToG) 40, 3 (2021), 1–21.
- Alaluf et al. (2021) Yuval Alaluf, Or Patashnik, and Daniel Cohen-Or. 2021. Restyle: A residual-based stylegan encoder via iterative refinement. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 6711–6720.
- Brooks et al. (2023) Tim Brooks, Aleksander Holynski, and Alexei A. Efros. 2023. InstructPix2Pix: Learning To Follow Image Editing Instructions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 18392–18402.
- Chan et al. (2022) Caroline Chan, Frédo Durand, and Phillip Isola. 2022. Learning to generate line drawings that convey geometry and semantics. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 7915–7925.
- Cheema et al. (2012) Salman Cheema, Sumit Gulwani, and Joseph LaViola. 2012. QuickDraw: improving drawing experience for geometric diagrams. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. 1037–1064.
- Chen and Hays (2018) Wengling Chen and James Hays. 2018. Sketchygan: Towards diverse and realistic sketch to image synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 9416–9425.
- Cheng et al. (2023) Shin-I Cheng, Yu-Jie Chen, Wei-Chen Chiu, Hung-Yu Tseng, and Hsin-Ying Lee. 2023. Adaptively-realistic image generation from stroke and sketch with diffusion model. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 4054–4062.
- Gal et al. (2022) Rinon Gal, Or Patashnik, Haggai Maron, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. 2022. StyleGAN-NADA: CLIP-guided domain adaptation of image generators. ACM Transactions on Graphics (TOG) 41, 4 (2022), 1–13.
- Goodfellow et al. (2014) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. Advances in neural information processing systems 27 (2014).
- Härkönen et al. (2020) Erik Härkönen, Aaron Hertzmann, Jaakko Lehtinen, and Sylvain Paris. 2020. Ganspace: Discovering interpretable gan controls. Advances in neural information processing systems 33 (2020), 9841–9850.
- Heusel et al. (2017) Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. 2017. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems 30 (2017).
- Karras et al. (2019) Tero Karras, Samuli Laine, and Timo Aila. 2019. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 4401–4410.
- Karras et al. (2020) Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. 2020. Analyzing and improving the image quality of stylegan. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 8110–8119.
- Koley et al. (2023) Subhadeep Koley, Ayan Kumar Bhunia, Aneeshan Sain, Pinaki Nath Chowdhury, Tao Xiang, and Yi-Zhe Song. 2023. Picture that Sketch: Photorealistic Image Generation from Abstract Sketches. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 6850–6861.
- Kwon and Ye (2023) Gihyun Kwon and Jong Chul Ye. 2023. One-shot adaptation of gan in just one clip. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023).
- Kynkäänniemi et al. (2019) Tuomas Kynkäänniemi, Tero Karras, Samuli Laine, Jaakko Lehtinen, and Timo Aila. 2019. Improved precision and recall metric for assessing generative models. Advances in Neural Information Processing Systems 32 (2019).
- LeCun et al. (2006) Yann LeCun, Sumit Chopra, Raia Hadsell, M Ranzato, and Fujie Huang. 2006. A tutorial on energy-based learning. Predicting structured data 1, 0 (2006).
- Lee et al. (2023) Hyundo Lee, Inwoo Hwang, Hyunsung Go, Won-Seok Choi, Kibeom Kim, and Byoung-Tak Zhang. 2023. Learning Geometry-Aware Representations by Sketching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 23315–23326.
- Li et al. (2019b) Bowen Li, Xiaojuan Qi, Thomas Lukasiewicz, and Philip Torr. 2019b. Controllable text-to-image generation. Advances in Neural Information Processing Systems 32 (2019).
- Li et al. (2019a) Mengtian Li, Zhe Lin, Radomir Mech, Ersin Yumer, and Deva Ramanan. 2019a. Photo-sketching: Inferring contour drawings from images. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 1403–1412.
- Liu et al. (2021) Bingchen Liu, Yizhe Zhu, Kunpeng Song, and Ahmed Elgammal. 2021. Self-supervised sketch-to-image synthesis. In Proceedings of the AAAI conference on artificial intelligence, Vol. 35. 2073–2081.
- Liu et al. (2020) Runtao Liu, Qian Yu, and Stella X Yu. 2020. Unsupervised sketch to photo synthesis. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part III 16. Springer, 36–52.
- Nguyen et al. (2017) Anh Nguyen, Jeff Clune, Yoshua Bengio, Alexey Dosovitskiy, and Jason Yosinski. 2017. Plug & play generative networks: Conditional iterative generation of images in latent space. In Proceedings of the IEEE conference on computer vision and pattern recognition. 4467–4477.
- Nichol et al. (2021) Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. 2021. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741 (2021).
- Nie et al. (2021) Weili Nie, Arash Vahdat, and Anima Anandkumar. 2021. Controllable and compositional generation with latent-space energy-based models. Advances in Neural Information Processing Systems 34 (2021), 13497–13510.
- Noé et al. (2019) Frank Noé, Simon Olsson, Jonas Köhler, and Hao Wu. 2019. Boltzmann generators: Sampling equilibrium states of many-body systems with deep learning. Science 365, 6457 (2019), eaaw1147.
- Oord et al. (2018) Aaron van den Oord, Yazhe Li, and Oriol Vinyals. 2018. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748 (2018).
- Parmar et al. (2022) Gaurav Parmar, Richard Zhang, and Jun-Yan Zhu. 2022. On Aliased Resizing and Surprising Subtleties in GAN Evaluation. In CVPR.
- Patashnik et al. (2021) Or Patashnik, Zongze Wu, Eli Shechtman, Daniel Cohen-Or, and Dani Lischinski. 2021. Styleclip: Text-driven manipulation of stylegan imagery. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 2085–2094.
- Radford et al. (2021) Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In International conference on machine learning. PMLR, 8748–8763.
- Richardson et al. (2021) Elad Richardson, Yuval Alaluf, Or Patashnik, Yotam Nitzan, Yaniv Azar, Stav Shapiro, and Daniel Cohen-Or. 2021. Encoding in style: a stylegan encoder for image-to-image translation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2287–2296.
- Sangkloy et al. (2016) Patsorn Sangkloy, Nathan Burnell, Cusuh Ham, and James Hays. 2016. The sketchy database: learning to retrieve badly drawn bunnies. ACM Transactions on Graphics (TOG) 35, 4 (2016), 1–12.
- Sangkloy et al. (2017) Patsorn Sangkloy, Jingwan Lu, Chen Fang, Fisher Yu, and James Hays. 2017. Scribbler: Controlling deep image synthesis with sketch and color. In Proceedings of the IEEE conference on computer vision and pattern recognition. 5400–5409.
- Shen et al. (2020) Yujun Shen, Jinjin Gu, Xiaoou Tang, and Bolei Zhou. 2020. Interpreting the latent space of gans for semantic face editing. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 9243–9252.
- Shoshan et al. (2021) Alon Shoshan, Nadav Bhonker, Igor Kviatkovsky, and Gerard Medioni. 2021. Gan-control: Explicitly controllable gans. In Proceedings of the IEEE/CVF international conference on computer vision. 14083–14093.
- Simonyan and Zisserman (2014) Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
- Wang et al. (2021) Sheng-Yu Wang, David Bau, and Jun-Yan Zhu. 2021. Sketch your own gan. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 14050–14060.
- Wang et al. (2018) Yaxing Wang, Chenshen Wu, Luis Herranz, Joost Van de Weijer, Abel Gonzalez-Garcia, and Bogdan Raducanu. 2018. Transferring gans: generating images from limited data. In Proceedings of the European Conference on Computer Vision (ECCV). 218–234.
- Wu et al. (2022) Chen Henry Wu, Saman Motamed, Shaunak Srivastava, and Fernando D De la Torre. 2022. Generative visual prompt: Unifying distributional control of pre-trained generative models. Advances in Neural Information Processing Systems 35 (2022), 22422–22437.
- Wu et al. (2021) Zongze Wu, Dani Lischinski, and Eli Shechtman. 2021. Stylespace analysis: Disentangled controls for stylegan image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 12863–12872.
- Yu et al. (2015) Fisher Yu, Ari Seff, Yinda Zhang, Shuran Song, Thomas Funkhouser, and Jianxiong Xiao. 2015. Lsun: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365 (2015).
- Yu et al. (2022) Yingchen Yu, Fangneng Zhan, Rongliang Wu, Jiahui Zhang, Shijian Lu, Miaomiao Cui, Xuansong Xie, Xian-Sheng Hua, and Chunyan Miao. 2022. Towards counterfactual image manipulation via clip. In Proceedings of the 30th ACM International Conference on Multimedia. 3637–3645.
- Zhao et al. (2020) Shengyu Zhao, Zhijian Liu, Ji Lin, Jun-Yan Zhu, and Song Han. 2020. Differentiable augmentation for data-efficient gan training. Advances in neural information processing systems 33 (2020), 7559–7570.
- Zhu et al. (2021) Peihao Zhu, Rameen Abdal, John Femiani, and Peter Wonka. 2021. Mind the Gap: Domain Gap Control for Single Shot Domain Adaptation for Generative Adversarial Networks. In International Conference on Learning Representations.
Table 2. Precision (Prec.) and Recall (Rec.) on the four synthesized-sketch test cases.

| Model | No. samples | Aug. | standing cat (Prec. / Rec.) | horse rider (Prec. / Rec.) | horse on side (Prec. / Rec.) | garbled church (Prec. / Rec.) |
|---|---|---|---|---|---|---|
| Original | N/A | | 0.21 / 0.54 | 0.33 / 0.57 | 0.22 / 0.63 | 0.46 / 0.49 |
| GANSketch (w/ aug) | 30 | ✓ | 0.50 / 0.41 | 0.44 / 0.39 | 0.50 / 0.50 | 0.46 / 0.51 |
| GANSketch (w/o aug) | 30 | | 0.65 / 0.20 | 0.42 / 0.49 | 0.42 / 0.52 | 0.48 / 0.48 |
| Ours (w/ aug) | 1 | ✓ | 0.62 / 0.25 | 0.26 / 0.49 | 0.41 / 0.45 | 0.64 / 0.16 |
| Ours (w/o aug) | 1 | | 0.69 / 0.32 | 0.32 / 0.48 | 0.40 / 0.34 | 0.63 / 0.34 |
Appendix A Implementation Details
Training details. We use the same training hyperparameters as (Wu et al., 2022), except for gradient accumulation, batch size, and the number of epochs. We train each PhotoSketch case for 5 epochs and each hand-sketch case for 10 epochs. All models are trained without gradient accumulation and with a batch size of one, which we find essential for our training. We apply truncation in all cases when sampling the random style latent for style-mixing. The sampled content latent is not truncated during training.
Hyperparameters. For the energy weight factor $\lambda$, we performed a grid search over [1000, 2000, 5000] and found 2000 to work well for all cases except 'standing cat', which is trained with 5000. We also follow (Yu et al., 2022) in setting the InfoNCE temperature $\tau$ to 0.1.
Evaluation procedure. We follow the same evaluation scheme as GANSketch (Wang et al., 2021). For quantitative evaluation, we sample 2,500 latents from our models without truncation. Each latent is injected with a randomly sampled style latent before being transformed into a 256×256 image. The evaluation set contains 2,500 hand-picked images at resolution 256. The FID scores are calculated with the CleanFID library (Parmar et al., 2022). For the qualitative cases, all examples are generated with truncation and style-mixing.
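For reference, the FID computation with the CleanFID library reduces to a single call, assuming the generated samples and the evaluation images have been written to two folders (the paths below are placeholders).

```python
from cleanfid import fid

# Placeholder folder names: 2,500 generated 256x256 samples vs. the hand-picked evaluation set.
score = fid.compute_fid("samples/standing_cat", "eval/standing_cat")
print(f"FID: {score:.2f}")
```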
Appendix B Additional Results.
Other evaluation metrics. We additionally report the Precision and Recall metrics (Kynkäänniemi et al., 2019) in Table 2. The precision score indicates the proportion of generated samples that resemble the real images in the evaluation set when comparing VGG features (Simonyan and Zisserman, 2014), while the recall indicates the proportion of real data that resemble the generated samples. The decreased recall indicates a smaller coverage of the original modes, which is expected, since we aim to seek specific modes in the source data space. Our model exhibits increased precision in the majority of the test cases, indicating its ability to refine a conditioned distribution that resembles the target distribution defined by the evaluation dataset. The only exception is the 'horse on side' case, where we observe that extreme object scales (e.g., too large or too small) are challenging for our method.
Additional qualitative results. For all models successfully trained with hand sketches, as listed in Figure 5, we show additional uncurated samples. Figures 11 to 16 show controlled samples based on the StyleGAN cat model pre-trained on LSUN-cat. Figures 17 to 22 show controlled samples based on the StyleGAN horse model.











