
CLIP2GAN: Towards Bridging Text with the Latent Space of GANs

Yixuan Wang^1, Wengang Zhou^{1,3}, Jianmin Bao^2, Weilun Wang^1, Li Li^1, Houqiang Li^{1,3}
^1 CAS Key Laboratory of GIPAS, EEIS Department, University of Science and Technology of China
^2 Microsoft Research Asia
^3 Institute of Artificial Intelligence, Hefei Comprehensive National Science Center
{wyx2017, wwlustc}@mail.ustc.edu.cn, [email protected], {zhwg,lil1,lihq}@ustc.edu.cn
Abstract

In this work, we are dedicated to text-guided image generation and propose a novel framework, i.e., CLIP2GAN, which leverages the CLIP model and StyleGAN. The key idea of our CLIP2GAN is to bridge the output feature embedding space of CLIP and the input latent space of StyleGAN, which is realized by introducing a mapping network. In the training stage, we encode an image with CLIP and map the output feature to a latent code, which is further used to reconstruct the image. In this way, the mapping network is optimized in a self-supervised manner. In the inference stage, since CLIP can embed both images and text into a shared feature embedding space, we replace the CLIP image encoder in the training architecture with the CLIP text encoder, while keeping the subsequent mapping network and StyleGAN model. As a result, we can flexibly input a text description to generate an image. Moreover, by simply adding the mapped text feature of an attribute to a mapped CLIP image feature, we can effectively edit that attribute in the image. Extensive experiments demonstrate the superior performance of our proposed CLIP2GAN compared to previous methods.

[Uncaptioned image]
Figure 1: Text-guided image generation and editing results of our proposed CLIP2GAN. The left shows diverse generation results given a text description. The right shows the result of image editing using text. The images generated by our framework are realistic and accurate.

1 Introduction

In recent years, generative models based on GAN [15] have achieved remarkable success in various tasks including image-to-image translation [39, 18], image inpainting [2, 8], video generation [43], etc. Specifically, in image generation, the quality of images generated by GANs has kept improving with the emergence of several advanced GANs [21, 24, 25, 23] that are capable of generating high-resolution and high-fidelity images. Among them, StyleGAN [24] disentangles image attributes in its intermediate latent space, which enables various image editing and manipulation tasks [1, 7, 58, 41, 40, 10, 47]. Previous work tends to realize such applications by directly manipulating the latent code, which makes it difficult to express explicit user intentions. Since text can express what people need precisely, it is natural to explore whether text can be used to control the latent space directly and thus achieve text-guided image generation and editing.

In this paper, we propose a novel framework, i.e., CLIP2GAN, for text-guided image generation and editing. Technically, our task can be decomposed into two subtasks. First, the input text is mapped to an embedding feature space. Second, the text feature is used to generate an image. For each subtask, there are successful solutions in the literature, such as the CLIP model [36] for the first subtask and StyleGAN [24] for the second. However, CLIP and StyleGAN are decoupled, since the feature from CLIP is not aligned with the latent code of StyleGAN. To bridge CLIP and StyleGAN, we introduce a mapping network, which transfers the feature embedding of CLIP to the latent space of StyleGAN. Specifically, in the training stage, we encode an image with CLIP and map the output feature to a latent code, which is further used to reconstruct the image. In this way, the mapping network is optimized in a self-supervised manner. In the inference stage, since CLIP can embed both images and text into a shared space, we replace the CLIP image encoder in the training architecture with the CLIP text encoder. Consequently, we can flexibly input a text description to generate a face image.

In the CLIP model, an image is encoded into a 512-dimensional feature, which inevitably loses some detailed visual clues and results in generated images missing information such as hair details, skin texture, background, etc. To improve the quality of generated images, we introduce an additional discriminator trained adversarially with the mapping network, thus ensuring that the mapping network can produce images that are as realistic and high-resolution as possible. Besides, we also add noise to the CLIP image feature. On the one hand, it further supplements the details lost by CLIP. On the other hand, inspired by the mode seeking regularization [34], we maximize the ratio of the distance between the output images after adding noise to the distance between their corresponding latent codes, so that the generated images are diverse while maintaining high fidelity.

Unlike previous methods [55, 56, 50, 48, 54, 42] that use image-text pairs for training, our framework generates high-quality face images in a zero-shot way, which means we use image data as input for training and text data as input for testing. Thanks to CLIP’s multi-modal embedding space, our text-free training approach is not constrained by large-scale image-text datasets, which require precise manual annotation and are not easily available in practice [50, 13, 38, 54, 59], and can still generate realistic and accurate text-guided images. Compared with methods that use image-text pairs for training text-guided image generation, our model is trained in a minimal-cost, text-free manner while still maintaining extremely high fidelity and quality.

Besides image generation, we explore CLIP2GAN on the image editing task, where real images can be edited directly using textual attributes to adjust their expressions, hair characteristics, age, etc. Taking advantage of StyleGAN, we locate the feature directions corresponding to different text descriptions in the latent space. By imposing these directions on the latent code of an image, the semantics of the image are controlled and changed in a fine-grained manner with high quality. Specifically, CLIP2GAN is convenient and flexible for image editing, because instead of optimizing the model for specific text like StyleCLIP [35], we can simply use arithmetic operations to add attributes to images, i.e., by adding the mapped CLIP features of text describing the attributes to the mapped CLIP features of images.

To evaluate our proposed method, we perform experiments on the CelebA-HQ [21] dataset. Both quantitative and qualitative results validate that compared with previous methods, our text-free training model can generate high-fidelity and diverse results and realize image manipulation, achieving superior performance over most existing models trained using full image-text pairs. Some example results are shown in Fig. 1.

In summary, our contributions are as follows:

  • We propose a new framework, i.e., CLIP2GAN, that enables text-guided generation tasks with text-free training, generating diverse and high-quality images given the same input text.

  • We apply our framework to image manipulation, which allows real images to be edited directly with textual attributes.

  • Extensive experiments on public datasets demonstrate the validity and superiority of our framework. The generated images of our method have higher evaluation quality and better visual performance.

2 Related Work

Refer to caption
Figure 2: Architecture of CLIP2GAN for training. CLIP2GAN takes the face image as input to the CLIP image encoder. The 12-layer mapping network is learned by inverting CLIP image features back into their original inputs using the pre-trained StyleGAN. We add a discriminator D and input noise, respectively, to complement the details lost by CLIP and to encourage diversity in the generated images.

2.1 High-quality Face Generation

Due to the great potential of GAN in generating realistic and high-resolution images, it is widely used in image generation and related applications [28, 29, 53, 46]. In particular, high-quality face generation has been an attractive problem in image generation. PGGAN [21] first proposes the idea of progressive resolution growth to generate high-definition face images, which first discovers large-scale structures and then focuses on fine details. StyleGAN [24, 25, 23] introduces a novel style-based generator architecture that generates face images with high fidelity and high resolution. It controls the visual features represented in each layer individually, which can be coarse features influenced by style (e.g., pose, face shape, identity features, etc.) or detailed features influenced by noise (e.g., pupil, hair, wrinkles, etc.). Unlike subsequent studies [22, 3, 4] that mostly introduce different mechanisms or structures on top of StyleGAN, we use the pre-trained StyleGAN to achieve high-quality face generation at low cost and high efficiency.

2.2 Joint Vision-language Models

With the remarkable progress in both computer vision and natural language processing, researchers have turned their attention to joint vision-language (VL) models for many task-specific VL problems, including image captioning [49, 20, 45], visual question answering (VQA) [5, 51], image-text matching [20, 19], etc., where each model is tailored for a specific problem and only solves one task. After the introduction of the transformer [44], BERT [12] has achieved unprecedented success in various language tasks. A recent development, CLIP [36], pre-trained on over 400 million image-text pairs based on contrastive learning, learns a multi-modal co-embedding space and estimates the semantic similarity between texts and images. The robustness of the learned joint representation enables CLIP to offer high performance and excellent generalization on various tasks.

2.3 Text-guided Generation and Manipulation

Text-guided image generation is an interesting topic in image generation, where GAN-based models show better sample quality. StackGAN [55, 56] stacks several generators and discriminators to improve the resolution of generated images in multiple stages. AttnGAN [50] introduces a cross-modal attention mechanism to explore fine-grained text and image representations. XMC-GAN [54] utilizes contrastive learning for image generation. TediGAN [48] trains an encoder to map the text into the latent space of StyleGAN [24]. DF-GAN [42] proposes a one-stage backbone for the direct synthesis of high-resolution images. Compared to most previous work, we have achieved better performance without using text training.

Similar to text-guided generation, manipulating a given image using text produces results containing the desired properties. The difference is that the edited result should change the parts related to the text and retain the rest of it. For instance, Dong et al. [14] propose an encoder-decoder structure for text-guided manipulation, and Li et al. [31] generate high-quality images through a multi-stage network. Unlike most text-guided image manipulation based on a multi-stage framework, we propose a unified framework that allows text-guided image generation and image manipulation without requiring multi-stage processing.

3 Method

In this paper, we propose a novel framework, CLIP2GAN, for text-guided image generation without text training (see Fig. 2). Our framework is capable of generating accurate and high-quality images under fine-grained text control without training on paired image-text data. Benefiting from the diversity loss we designed, the multi-modal generation of face images that matches a specific text description is also supported. Furthermore, we explore our framework for image manipulation tasks, where real images are edited using text to adjust their attributes, e.g., expressions, hair color, and age, with high fidelity and reliability. The rest of this section is organized as follows. We first introduce the overall structure of CLIP2GAN. Then the loss functions utilized in our framework are discussed. Finally, we present the image manipulation application implemented using our framework.

3.1 CLIP2GAN

Fig. 2 gives an overview of our framework, which consists of a pre-trained vision-language model (CLIP), a mapping network, and a pre-trained generation model (StyleGAN). Unlike previous work that uses a large number of image-text pairs for model training, our approach achieves text-free training by establishing a mapping relationship between the CLIP multi-modal embedding space and the StyleGAN latent space.

On the one hand, to achieve text-guided image generation without text training, we generate pseudo-text features by leveraging the image-text feature alignment of a pre-trained model. We require a universal multi-modal embedding space in which paired text and image features are well aligned. The recent vision-language model CLIP achieves this by pre-training on a large number of image-text pairs with contrastive learning, which is exactly what we need. On the other hand, given that StyleGAN has an excellent latent space, we can perform a series of manipulations on the generated images by changing their latent codes. We take advantage of the pre-trained StyleGAN2 as the model for image generation. With the help of StyleGAN’s latent space, we can achieve high-quality text-guided image generation and image editing.

To generate images from text, we build a bridge between CLIP and StyleGAN through a mapping network. With this mapping network, it is possible to obtain feature representations of text or images in the latent space of StyleGAN and thus generate images using StyleGAN. The source image $x$ is taken as the input of CLIP, and the image encoder of CLIP is used to obtain the image feature $f_{img}$, i.e., the pseudo-text feature $\tilde{f}_{text}$, in the multi-modal embedding space of CLIP. It is formulated as follows,

\tilde{f}_{text} = f_{img} = C_{img}(x),    (1)

where $C_{img}(\cdot)$ denotes the image encoder of the CLIP model [36]. The image feature $f_{img}$ is mapped by the mapping network to a latent code $\mathbf{z}$ in the $w+$ space of StyleGAN, which serves as the input of the pre-trained StyleGAN, and the image $x'$ is generated by StyleGAN. $x'$ is expressed as follows,

x' = G(M(C_{img}(x))),    (2)

where $M(\cdot)$ denotes the mapping network and $G(\cdot)$ denotes the pre-trained StyleGAN model. By enforcing consistency between the source image $x$ and the generated image $x'$, our generative model is learned.
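For concreteness, the snippet below sketches the forward pass of Eq. (2) in PyTorch with the official OpenAI CLIP package. The mapping network mapping_net and the StyleGAN2 generator generator_g are illustrative placeholders assumed to be loaded elsewhere; this is a minimal sketch rather than the exact implementation used in the paper.

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, preprocess = clip.load("ViT-B/16", device=device)  # frozen CLIP encoders

def reconstruct(image_path, mapping_net, generator_g):
    """Training-time forward pass x' = G(M(C_img(x))); mapping_net and generator_g
    are placeholders for the 12-layer mapping network and a pre-trained StyleGAN2
    generator assumed to accept an (18, 512) w+ latent code."""
    x = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    with torch.no_grad():                                # CLIP stays frozen
        f_img = clip_model.encode_image(x).float()       # (1, 512) image feature
    w_plus = mapping_net(f_img)                          # (1, 18, 512) latent code
    return generator_g(w_plus)                           # reconstructed image x'
```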

The mapping network maps the multi-modal embedding space of CLIP into the $w+$ latent space of StyleGAN, which makes it possible to invert CLIP features back into the source images using StyleGAN. Our proposed mapping network is a 12-layer fully connected network, with each layer followed by a Leaky ReLU activation [33]. The mapping network converts a 512-dimensional text feature $f_{text}$ or image feature $f_{img}$ from the multi-modal embedding space of CLIP into the $w+$ space of StyleGAN, producing an 18$\times$512-dimensional latent code $\mathbf{z}$, which is formulated as follows,

\mathbf{f}^{i+1} = g(\mathbf{w}_{i} \cdot \mathbf{f}^{i} + \mathbf{b}_{i}), \quad \mathbf{f}^{0} = f_{text} \ \text{or} \ f_{img},    (3)

where $\mathbf{f}^{i}$ and $\mathbf{f}^{i+1}$ are the input and output features of the $i$-th layer, respectively, $\mathbf{w}_{i}$ and $\mathbf{b}_{i}$ are the layer weight and bias, and $g(\cdot)$ denotes the Leaky ReLU activation.
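A minimal PyTorch sketch of such a mapping network is given below. The 12-layer depth and the Leaky ReLU activations follow the description above, while the hidden width and the choice to emit all 18 style vectors from the final layer are our assumptions for illustration.

```python
import torch
import torch.nn as nn

class MappingNetwork(nn.Module):
    """12-layer MLP from a 512-d CLIP feature to an 18x512 latent code in w+."""

    def __init__(self, clip_dim=512, w_dim=512, n_styles=18, n_layers=12):
        super().__init__()
        layers, in_dim = [], clip_dim
        for i in range(n_layers):
            # Last layer widens to all 18 style vectors (an assumption for this sketch).
            out_dim = w_dim * n_styles if i == n_layers - 1 else w_dim
            layers += [nn.Linear(in_dim, out_dim), nn.LeakyReLU(0.2)]
            in_dim = out_dim
        self.net = nn.Sequential(*layers)
        self.n_styles, self.w_dim = n_styles, w_dim

    def forward(self, f):                                # f: (B, 512) CLIP feature
        w = self.net(f)                                  # (B, 18 * 512)
        return w.view(-1, self.n_styles, self.w_dim)     # (B, 18, 512) w+ code
```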

Refer to caption
Figure 3: Schematic of CLIP2GAN for testing. We use the trained mapping network as shown in Fig. 2 to bridge the image-text embedding space of CLIP and the $w+$ latent space of StyleGAN and take the text as input to generate an image corresponding to the text description.

However, we discover that when using only a simple mapping network, the images obtained by inverting CLIP image features suffer from missing details. Although the major features are preserved, image details, especially hair and skin textures, backgrounds, etc., are lost to varying degrees, which is attributed to the fact that CLIP only extracts the major 512-dimensional features of the image and ignores the rest. To tackle this issue, we introduce a discriminator to judge the realism of the obtained images. It is trained adversarially with the mapping network to complement the image details and improve the generation quality without affecting the feature representations.

Furthermore, when constraints are applied only between the source images and the reconstructed images, we observe that the generated examples lack diversity. This is because the pixel-wise loss mainly encourages reconstructing identical images rather than diverse ones, although diversity contributes to the performance. To this end, we apply a diversity loss in the latent space to generate diverse images, as shown in Fig. 2. Inspired by the mode seeking regularization [34], the diversity loss is achieved by maximizing the ratio of the distance between the output images to the distance between the corresponding latent codes, and it encourages distinctive results when different noise vectors are brought in. Given an arbitrary specific text description, we are able to generate multiple images that match the text features but are unique, by adding Gaussian noise with $\mu=1, \sigma^{2}=0.36$.

Considering the image-text alignment property of CLIP’s multi-modal embedding space, we input the source image and learn the mapping of CLIP image features, i.e., pseudo-text features, to the latent space by image reconstruction; that is, training uses only image data, as shown in Fig. 2. Meanwhile, during testing, the target text description is input and text-guided image generation is performed using CLIP text features, as shown in Fig. 3. Through the above process, text-guided image generation without text training is achieved.
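At test time, the only change is swapping the CLIP image encoder for the CLIP text encoder, as sketched below; mapping_net and generator_g are again placeholders for the trained mapping network and the pre-trained StyleGAN generator.

```python
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/16", device=device)

def generate_from_text(prompt, mapping_net, generator_g):
    tokens = clip.tokenize([prompt]).to(device)
    with torch.no_grad():
        f_text = clip_model.encode_text(tokens).float()  # (1, 512) CLIP text feature
        w_plus = mapping_net(f_text)                     # map into StyleGAN w+ space
        return generator_g(w_plus)                       # text-guided generation

# Example: generate_from_text("She is a young woman with blond hair and a smile.", M, G)
```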

3.2 Loss Functions

We utilize several objective functions, i.e., reconstruction loss, perceptual loss, adversarial loss, identity loss, and diversity loss to optimize our framework.

Reconstruction loss. For a given image $x$, the reconstructed image $x'$ is generated by our framework. We develop a reconstruction loss to guarantee the pixel alignment between $x$ and $x'$. The reconstruction loss is formulated as follows,

\mathcal{L}_{\text{rec}} = \|x - x'\|_{2},    (4)

where $\|\cdot\|_{2}$ denotes the $\ell_{2}$ distance.

Perceptual loss. The reconstruction loss using the $\ell_{2}$ distance assumes that the data fit a Gaussian distribution, which tends to produce over-smoothed images. We introduce a perceptual loss, i.e., the LPIPS loss [57], to measure the difference between the source image $x$ and the generated image $x'$ and alleviate the over-smoothing caused by the reconstruction loss. LPIPS learns to reconstruct the reverse mapping of $x$ from $x'$ and prioritizes their perceptual similarity, which is formulated as follows,

\mathcal{L}_{\text{LPIPS}} = \sum_{l} \frac{1}{H_{l} W_{l}} \sum_{h,w} \| w_{l} \odot (y_{hw}^{l} - y_{hw}^{\prime l}) \|_{2}^{2},    (5)

where $y^{l}$ and $y^{\prime l}$ are the features of $x$ and $x'$ extracted from the $l$-th layer of a pre-trained AlexNet [27], and $w_{l}$ denotes the learned channel-wise weights.
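In practice, this term can be computed with the reference lpips package, which wraps the pre-trained AlexNet backbone; a minimal usage sketch (input images are expected in the range [-1, 1]):

```python
import torch
import lpips

lpips_loss = lpips.LPIPS(net='alex')            # AlexNet-based perceptual distance

x = torch.rand(1, 3, 256, 256) * 2 - 1          # source image, scaled to [-1, 1]
x_rec = torch.rand(1, 3, 256, 256) * 2 - 1      # reconstructed image
loss_lpips = lpips_loss(x, x_rec).mean()        # scalar L_LPIPS term
```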

Adversarial loss. The discriminator is trained adversarially with the mapping network, which is considered as a generator. For the generator $G(\cdot)$ and the discriminator $D(\cdot)$, we use the WGAN-GP losses [6, 16], which are formulated as follows,

\mathcal{L}_{G} = \mathrm{E}_{x'}[\log(1 - D(x'))],    (6)
\mathcal{L}_{D} = -\{\mathrm{E}_{x}[\log D(x)] + \mathrm{E}_{x'}[\log(1 - D(x'))]\} + \lambda\, \mathrm{E}_{x'}[(\|\nabla_{x'} D(x')\|_{2} - 1)^{2}],    (7)

where $x$ and $x'$ denote the source image and the reconstructed image, respectively.
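A sketch of how Eqs. (6) and (7) could be computed with autograd is shown below. The sigmoid and the small epsilon are added for numerical stability and are our assumptions; the gradient penalty is taken on the generated samples, following the formula above (a common alternative penalizes real/fake interpolates as in WGAN-GP).

```python
import torch

def generator_loss(D, x_fake, eps=1e-8):
    # Eq. (6): L_G = E[log(1 - D(x'))], minimized by the mapping network.
    return torch.log(1 - torch.sigmoid(D(x_fake)) + eps).mean()

def discriminator_loss(D, x_real, x_fake, lam=10.0, eps=1e-8):
    d_real = torch.sigmoid(D(x_real))
    d_fake = torch.sigmoid(D(x_fake.detach()))
    loss_gan = -(torch.log(d_real + eps).mean() + torch.log(1 - d_fake + eps).mean())

    # Gradient penalty on the generated samples, as in Eq. (7).
    x_gp = x_fake.detach().requires_grad_(True)
    grads = torch.autograd.grad(D(x_gp).sum(), x_gp, create_graph=True)[0]
    penalty = ((grads.flatten(1).norm(2, dim=1) - 1) ** 2).mean()
    return loss_gan + lam * penalty
```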

Identity loss. To ensure that the identity features of the faces remain unchanged, we introduce an identity loss. Using a pre-trained ArcFace model [11], our identity loss measures the similarity between the identity features of the faces in $x$ and $x'$, which is formulated as follows,

\mathcal{L}_{\text{id}} = 1 - \frac{f_{\text{arc}}(x) \cdot f_{\text{arc}}(x')}{\|f_{\text{arc}}(x)\|_{2}\, \|f_{\text{arc}}(x')\|_{2}},    (8)

where $f_{\text{arc}}(\cdot)$ denotes the ArcFace feature extractor.
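The identity term is simply one minus the cosine similarity between ArcFace embeddings. With any pre-trained face-embedding network f_arc (a placeholder here), it can be written as:

```python
import torch.nn.functional as F

def identity_loss(f_arc, x, x_rec):
    """1 - cosine similarity between the identity embeddings of x and x'."""
    emb_x = f_arc(x)        # (B, d) ArcFace embedding of the source image
    emb_rec = f_arc(x_rec)  # (B, d) ArcFace embedding of the reconstruction
    return (1 - F.cosine_similarity(emb_x, emb_rec, dim=1)).mean()
```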

Diversity loss. As shown in Fig. 2, in the multi-modal embedding space of CLIP, standard normal noise ($\mu=0, \sigma^{2}=1$) is added to the obtained CLIP image feature $f_{img}$ to derive the image feature $f_{img_{1}}$. The reconstructed images $x'$ and $x_{1}'$ are generated respectively, and the diversity loss is formulated as follows,

\mathcal{L}_{\text{div}} = \frac{d_{f}(f_{img}, f_{img_{1}})}{d_{I}(x', x_{1}')},    (9)

where $d_{*}(\cdot)$ denotes a distance metric, for which we use the $\ell_{1}$ distance.
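A sketch of the diversity term: a noisy copy of the CLIP feature is decoded alongside the original, and the ratio of the feature distance to the image distance is minimized (names are illustrative):

```python
import torch

def diversity_loss(f_img, mapping_net, generator_g, eps=1e-8):
    noise = torch.randn_like(f_img)          # standard normal noise (mu=0, sigma^2=1)
    f_img1 = f_img + noise                   # perturbed CLIP image feature
    x_rec = generator_g(mapping_net(f_img))
    x_rec1 = generator_g(mapping_net(f_img1))
    d_f = (f_img - f_img1).abs().mean()      # l1 distance in feature space
    d_i = (x_rec - x_rec1).abs().mean()      # l1 distance in image space
    return d_f / (d_i + eps)                 # small ratio => diverse outputs for nearby codes
```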

Overall loss. The overall loss for CLIP2GAN is the weighted summation of the above losses, which is formulated as follows,

\min_{M} \mathcal{L}_{M} = \mathcal{L}_{\text{rec}} + \lambda_{\text{LPIPS}} \mathcal{L}_{\text{LPIPS}} + \lambda_{G} \mathcal{L}_{G} + \lambda_{\text{id}} \mathcal{L}_{\text{id}} + \lambda_{\text{div}} \mathcal{L}_{\text{div}},    (10)

where $\lambda_{\text{LPIPS}}, \lambda_{G}, \lambda_{\text{id}}$, and $\lambda_{\text{div}}$ are trade-off parameters balancing the different losses.
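Putting the pieces together, one training iteration for the mapping network could look as follows. This sketch reuses the loss helpers above, treats mapping_net, generator_g, D, f_arc, optimizer_m, x, and f_img as objects prepared earlier, and plugs in the weight values reported in the supplementary material; it is an illustration of Eq. (10), not the authors' training script.

```python
import torch.nn.functional as F

# One training iteration of the mapping network M, following Eq. (10).
optimizer_m.zero_grad()
w_plus = mapping_net(f_img)                 # f_img: detached CLIP feature of image x
x_rec = generator_g(w_plus)                 # reconstructed image x'

loss_m = (F.mse_loss(x_rec, x)                                      # pixel-wise L_rec
          + 1.0 * lpips_loss(x, x_rec).mean()                       # lambda_LPIPS * L_LPIPS
          + 0.1 * generator_loss(D, x_rec)                          # lambda_G * L_G
          + 1.0 * identity_loss(f_arc, x, x_rec)                    # lambda_id * L_id
          + 1.0 * diversity_loss(f_img, mapping_net, generator_g))  # lambda_div * L_div
loss_m.backward()
optimizer_m.step()
```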

Refer to caption
Figure 4: Text-guided image editing using CLIP2GAN. The semantic direction $\mathbf{n}$ is discovered in the $w+$ latent space of StyleGAN using our model; it modifies the latent code $\mathbf{z}$ of the image, and thus the edited image is generated.
Refer to caption
Figure 5: Qualitative comparison with existing methods, i.e., AttnGAN [50], ControlGAN [30], DF-GAN [42], DM-GAN [59], TediGAN [48], and StyleCLIP [35]. Our generated images using the text on the left show superior performance on fidelity and quality.

3.3 Text-guided Image Editing

With the pre-trained CLIP2GAN model, we further apply the network to text-guided image editing applications. Given a source image $x$, we are interested in editing certain regions of it by manipulating its latent code $\mathbf{z} = \{\mathbf{z}_{i}\}_{i=1}^{18}$ into $\mathbf{z}' = \{\mathbf{z}'_{i}\}_{i=1}^{18}$ and obtaining a target image $x'$ that meets the editing requirements, which is expressed as follows,

\mathbf{z}'_{i} = \mathbf{z}_{i} + \beta \mathbf{n}_{i},    (11)
x' = G(\mathbf{z}'),    (12)

where $\mathbf{n} = \{\mathbf{n}_{i}\}_{i=1}^{18}$ corresponds to the normal direction of a particular semantic in the latent space, and $\beta$ denotes the editing strength. That is, if the latent code moves along this direction, the corresponding semantics of the output image should vary accordingly. This requires our framework to locate both the semantic direction $\mathbf{n}$ of the text and the latent code $\mathbf{z}$ of the image.

As shown in Fig. 4, the text $t$, i.e., a simple description of age, gender, hair, expression, etc., is fed into the CLIP [36] text encoder to obtain the CLIP text feature. Then the vector $\mathbf{n}$ in the StyleGAN latent space is derived by the pre-trained mapping network; it is considered as the normal direction of a particular semantic, since the mapping network bridges the CLIP feature space and the StyleGAN latent space. Meanwhile, by putting the source image $x$ through the CLIP image encoder and the mapping network, the representation $\mathbf{z}$ of $x$ in the StyleGAN latent space is obtained. They are formulated as follows,

\mathbf{n} = M[C_{text}(t)],    (13)
\mathbf{z} = M[C_{img}(x)],    (14)

where $C_{text}(\cdot)$ and $C_{img}(\cdot)$ denote the CLIP text and image encoders, respectively.

Finally, the semantic direction $\mathbf{n}$ and the latent feature $\mathbf{z}$ are combined, with the weight of $\mathbf{n}$, i.e., $\beta$, ranging from 0 to 1, which leads to the modified feature $\mathbf{z}'$ in the latent space of StyleGAN; StyleGAN then generates the corresponding image. This allows the text description to control the degree of semantic modification while leaving the identity features of the image unchanged. Our approach achieves high-quality text-guided image editing by simple arithmetic operations, without optimization or additional network structures.
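The editing procedure of Eqs. (11)–(14) then amounts to two encoder passes and a weighted addition, as sketched below (mapping_net and generator_g are again placeholders):

```python
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, preprocess = clip.load("ViT-B/16", device=device)

def edit_image(pil_image, text_attribute, mapping_net, generator_g, beta=0.5):
    x = preprocess(pil_image).unsqueeze(0).to(device)
    tokens = clip.tokenize([text_attribute]).to(device)
    with torch.no_grad():
        z = mapping_net(clip_model.encode_image(x).float())      # Eq. (14)
        n = mapping_net(clip_model.encode_text(tokens).float())  # Eq. (13)
        z_edit = z + beta * n                                    # Eq. (11)
        return generator_g(z_edit)                               # Eq. (12)

# Example: edit_image(img, "smiling", M, G, beta=0.3)
```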

4 Experiments

4.1 Experiments Setup

Datasets. To demonstrate the superiority of our method for text-guided face generation and face editing, we train on the CelebA-HQ [21] dataset and test using text descriptions from the Multi-modal CelebA-HQ (MM-CelebA-HQ) [48] dataset. The CelebA-HQ dataset is a high-quality version of the CelebA [32] dataset, consisting of 30,000 images at a resolution of $1024^{2}$. The MM-CelebA-HQ dataset provides 10 unique text descriptions for each image in CelebA-HQ.

Evaluation metrics. We aim to assess visual quality, image accuracy, and realism. The visual quality of generated or manipulated images is evaluated with the widely-used Fréchet Inception Distance (FID) [17] metric. FID measures the distance between two sets of images, computed from the mean and covariance of the generated image set $(\mu_{Y}, \Sigma_{Y})$ and the ground-truth image set $(\mu_{\hat{Y}}, \Sigma_{\hat{Y}})$, which is formulated as follows,

\text{FID}(Y, \hat{Y}) = \|\mu_{Y} - \mu_{\hat{Y}}\|^{2}_{2} + \text{tr}\big(\Sigma_{Y} + \Sigma_{\hat{Y}} - 2(\Sigma_{Y}\Sigma_{\hat{Y}})^{\frac{1}{2}}\big).    (15)
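Given Inception features of the two image sets, Eq. (15) can be evaluated directly; a minimal sketch with NumPy/SciPy, with the feature extraction itself omitted:

```python
import numpy as np
from scipy import linalg

def fid(feats_gen, feats_real):
    """Eq. (15) from (N, d) arrays of Inception activations of the two image sets."""
    mu_g, mu_r = feats_gen.mean(0), feats_real.mean(0)
    cov_g = np.cov(feats_gen, rowvar=False)
    cov_r = np.cov(feats_real, rowvar=False)
    covmean, _ = linalg.sqrtm(cov_g @ cov_r, disp=False)
    covmean = covmean.real                     # discard tiny imaginary parts
    return float(((mu_g - mu_r) ** 2).sum() + np.trace(cov_g + cov_r - 2 * covmean))
```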

To evaluate the perceptual similarity between generated images and real images, we compute the average distance between them with the Learned Perceptual Image Patch Similarity (LPIPS) [57] metric, a weighted perceptual similarity between two images computed on features extracted from a pre-trained network.

In addition, accuracy and realism are evaluated through a user study. For image generation, accuracy is evaluated by the similarity between the text and the generated image. For image manipulation, accuracy is assessed by whether the visual properties of the modified image are aligned with the given description and whether content unrelated to the text is preserved. For realism, participants judge which result is more realistic and consistent with reality. We evaluated accuracy and realism by collecting surveys from 20 participants on a random sample of 10 images.

4.2 Comparison with State-of-the-art Methods

Refer to caption
Figure 6: Diverse generation results from our framework. It is observed that our method can generate diverse results with high quality.

We compare our method with several state-of-the-art methods for text-guided image generation, i.e., AttnGAN [50], ControlGAN [30], DF-GAN [42], DM-GAN [59], TediGAN [48], and StyleCLIP [35]. We evaluate FID and LPIPS on a large number of samples generated from randomly selected text descriptions. Accuracy and realism are assessed by user studies. We generate 10 images with each method, and 20 users were asked to judge which result is the most realistic and best aligned with the text. The quantitative results are shown in Tab. 1 and Tab. 2, where the other methods are trained and tested on the MM-CelebA-HQ dataset, while our method is trained on CelebA-HQ and tested with the MM-CelebA-HQ text descriptions. From these tables, it is observed that our method achieves new state-of-the-art performance.

        Method         FID         LPIPS
        AttnGAN [50]         125.98         0.512
        ControlGAN [30]         116.32         0.522
        DF-GAN [42]         137.60         0.581
        DM-GAN [59]         131.05         0.544
        TediGAN [48]         106.37         0.456
        StyleCLIP [35]         101.75         0.439
        Ours         34.25         0.408
Table 1: Quantitative comparison with existing text-guided image generation methods on FID and LPIPS metrics. Our method outperforms previous algorithms on FID and LPIPS metrics.

Furthermore, we make a qualitative evaluation with several competitive methods, i.e., AttnGAN, ControlGAN, DF-GAN, DM-GAN, TediGAN, and StyleCLIP. From Fig. 5, it is observed that our method has higher image quality and more realistic image results compared with previous methods. The images we generate correspond very closely to the text description as well as showing fine-grained details. Benefiting from the CLIP model [36], we have an effective text-driven capability to generate face images with more text features in the category without being limited to text descriptions in the dataset.

When some specific features in the text are changed, our model ensures that the image is modified only in the corresponding features, while other features, including identity features, are guaranteed to remain invariant. This shows that our method is able to decouple different features with excellent robustness. The text description and visual results are presented in Fig. 8.

The other advantage is that our model can inherently generate diverse results given an arbitrary specific text description. With our approach, the generated images are guaranteed to be consistent with the given text features while other irrelevant features are varied to get multiple unique results. We present the text-guided image generation results in Fig. 6, which demonstrates that our method can generate diverse results with high quality.

     Method      Acc. (%)      Real. (%)
     Ours v.s. AttnGAN [50]      83.0      84.5
     Ours v.s. ControlGAN [30]      80.5      79.0
     Ours v.s. DF-GAN [42]      86.5      85.5
     Ours v.s. DM-GAN [59]      88.0      91.5
     Ours v.s. TediGAN [48]      74.5      78.5
     Ours v.s. StyleCLIP [35]      70.5      95.5
Table 2: Paired user study between our method and existing text-guided image generation methods. Our method outperforms previous algorithms on Acc. and Real. metrics.
Refer to caption
Figure 7: Visual results of text-guided image editing. Inputting the image on the left column and the text description above each of the other columns allows editing of the input image.
Refer to caption
Figure 8: Generated result after modifying the text. The text on the top left is used as input to generate the image on the bottom left. By modifying the words marked in the text or adding descriptions of expressions as shown in the right half, the image generated from the newly input text only changes the corresponding features.

4.3 Ablation Studies

There are several ablation experiments performed to demonstrate the effectiveness of the framework. To evaluate the effectiveness and necessity of the network design, we vary the number of layers in the mapping network and the pre-trained CLIP model. From Tab. 3, it is observed that the framework performs worse on several metrics when the mapping network has fewer layers, whereas adding more layers brings only marginal improvement while greatly increasing the complexity of the network. In addition, since the feature space of CLIP encodes semantics shared between images and text, using a stronger joint space (ViT/B-16) further improves the generated results.

Number of Layers    Pre-trained CLIP [36]    FID      LPIPS
4                   ViT/B-32                 43.15    0.478
8                   ViT/B-32                 35.88    0.421
12                  ViT/B-32                 35.01    0.416
4                   ViT/B-16                 41.42    0.471
8                   ViT/B-16                 34.79    0.413
12                  ViT/B-16                 34.25    0.408
Table 3: Ablation of the network design, i.e., Number of Layers and Pre-trained CLIP, on CelebA-HQ [21] dataset. It is observed that using 12 layers of the mapping network and pre-trained CLIP-ViT/B-16 is the most appropriate of several settings.

We also investigate the impact of each component in the objective function. Keeping the reconstruction loss, perceptual loss, and identity loss, we ablate the adversarial loss and the diversity loss one by one. The results are shown in Tab. 4: the performance decreases after each removal. This is because removing the adversarial loss reduces image realism, while removing the diversity loss leads to mode collapse.

$\mathcal{L}_{adv}$    $\mathcal{L}_{div}$    FID      LPIPS
×                      ×                      55.48    0.589
×                      ✓                      52.76    0.561
✓                      ×                      38.44    0.454
✓                      ✓                      34.25    0.408
Table 4: Ablation of the objective function, i.e., $\mathcal{L}_{adv}$ and $\mathcal{L}_{div}$ (the adversarial loss and the diversity loss, respectively), on the CelebA-HQ [21] dataset. Notably, our method achieves better performance with both the adversarial loss and the diversity loss.

4.4 Image Editing

To evaluate the performance of image editing, we use the FID metric and conduct a user study. As shown in Tab. 5, our method achieves better FID, accuracy, and realism than other methods [31, 48, 35] on both the CelebA-HQ dataset and arbitrary images outside it. In addition to generating high-quality text-guided images, we can achieve image editing for a given text description while ensuring that irrelevant content remains unchanged.

Refer to caption
Figure 9: Varying degrees of text-guided image editing results. Notably, our method provides control on the modification degree.

Fig. 7 shows the visual results of our text-guided image editing. With simple word descriptions, we can apply different manipulations to the image without changing other irrelevant features. For the various text features used for modification, the extent to which the corresponding image features are changed is controlled by our framework, i.e., by the value of the weight $\beta$, as illustrated in Fig. 9. Since CLIP is trained on a massive image-text dataset and possesses a rich text latent space, our method, built on the pre-trained CLIP, can achieve text editing for numerous different features without being limited to specific text descriptions.

Method            CelebA-HQ                                Not-in-CelebA-HQ
                  FID ↓     Acc. (%) ↑   Real. (%) ↑       FID ↓     Acc. (%) ↑   Real. (%) ↑
ManiGAN [31]      117.89    10.5         8.0               143.39    3.5          7.5
TediGAN [48]      107.25    16.0         18.5              135.47    18.0         24.5
StyleCLIP [35]    86.82     28.5         21.5              106.24    31.5         21.5
Ours              35.66     45.0         52.0              48.14     47.0         46.5
Table 5: Quantitative comparison with existing methods on text-guided image editing. ↑ indicates that higher is better, while ↓ indicates that lower is better. Our method achieves better performance on both the CelebA-HQ [21] dataset and arbitrary images outside the CelebA-HQ dataset.

5 Conclusion

In this paper, we investigate bridging vision-language models (CLIP) with generative models (StyleGAN) for the task of text-guided image generation and propose a novel framework named CLIP2GAN. Specifically, the framework bridges the pre-trained CLIP and StyleGAN and is trained in a text-free way, where the trained model generates high-fidelity and high-quality images corresponding to the text description. With CLIP2GAN, we also achieve text-guided image manipulation that allows the editing of real images. Extensive experiments demonstrate the effectiveness of our method. Compared to previous methods, our framework achieves superior performance and shows better visual results.

References

  • [1] Rameen Abdal, Yipeng Qin, and Peter Wonka. Image2StyleGAN: How to embed images into the StyleGAN latent space? In ICCV, pages 4432–4441, 2019.
  • [2] Rameen Abdal, Yipeng Qin, and Peter Wonka. Image2StyleGAN++: How to edit the embedded images? In CVPR, pages 8296–8305, 2020.
  • [3] Mahmoud Afifi, Marcus A Brubaker, and Michael S Brown. HistoGAN: Controlling colors of GAN-generated and real images via color histograms. In CVPR, pages 7941–7950, 2021.
  • [4] Yuval Alaluf, Omer Tov, Ron Mokady, Rinon Gal, and Amit Bermano. HyperStyle: StyleGAN inversion with HyperNetworks for real image editing. In CVPR, pages 18511–18521, 2022.
  • [5] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. VQA: Visual question answering. In ICCV, pages 2425–2433, 2015.
  • [6] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In ICML, pages 214–223. PMLR, 2017.
  • [7] David Bau, Jun-Yan Zhu, Jonas Wulff, William Peebles, Hendrik Strobelt, Bolei Zhou, and Antonio Torralba. Inverting layers of a large generator. In ICLR Workshop, volume 2, page 4, 2019.
  • [8] Yen-Chi Cheng, Chieh Hubert Lin, Hsin-Ying Lee, Jian Ren, Sergey Tulyakov, and Ming-Hsuan Yang. InOut: Diverse image outpainting via GAN inversion. In CVPR, pages 11431–11440, 2022.
  • [9] Yunjey Choi, Youngjung Uh, Jaejun Yoo, and Jung-Woo Ha. StarGAN v2: Diverse image synthesis for multiple domains. In CVPR, pages 8188–8197, 2020.
  • [10] Edo Collins, Raja Bala, Bob Price, and Sabine Susstrunk. Editing in style: Uncovering the local semantics of GANs. In CVPR, pages 5771–5780, 2020.
  • [11] Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. ArcFace: Additive angular margin loss for deep face recognition. In CVPR, pages 4690–4699, 2019.
  • [12] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  • [13] Ming Ding, Zhuoyi Yang, Wenyi Hong, Wendi Zheng, Chang Zhou, Da Yin, Junyang Lin, Xu Zou, Zhou Shao, Hongxia Yang, et al. CogView: Mastering text-to-image generation via transformers. NeurIPS, 34:19822–19835, 2021.
  • [14] Hao Dong, Simiao Yu, Chao Wu, and Yike Guo. Semantic image synthesis via adversarial learning. In ICCV, pages 5706–5714, 2017.
  • [15] Ian J Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C Courville, and Yoshua Bengio. Generative adversarial networks. In NeurIPS, 2014.
  • [16] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of Wasserstein GANs. NeurIPS, 30, 2017.
  • [17] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. NeurIPS, 30, 2017.
  • [18] Jialu Huang, Jing Liao, and Sam Kwong. Unsupervised image-to-image translation via pre-trained StyleGAN2 network. TMM, 24:1435–1448, 2021.
  • [19] Yan Huang, Wei Wang, and Liang Wang. Instance-aware image and sentence matching with selective multimodal LSTM. In CVPR, pages 2310–2318, 2017.
  • [20] Andrej Karpathy, Armand Joulin, and Li F Fei-Fei. Deep fragment embeddings for bidirectional image sentence mapping. NeurIPS, 27, 2014.
  • [21] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. In ICLR, 2018.
  • [22] Tero Karras, Miika Aittala, Janne Hellsten, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Training generative adversarial networks with limited data. NeurIPS, 33:12104–12114, 2020.
  • [23] Tero Karras, Miika Aittala, Samuli Laine, Erik Härkönen, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Alias-free generative adversarial networks. NeurIPS, 34:852–863, 2021.
  • [24] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In CVPR, pages 4401–4410, 2019.
  • [25] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of StyleGAN. In CVPR, pages 8110–8119, 2020.
  • [26] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [27] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. Communications of the ACM, 60(6):84–90, 2017.
  • [28] Guillaume Lample, Neil Zeghidour, Nicolas Usunier, Antoine Bordes, Ludovic Denoyer, and Marc’Aurelio Ranzato. Fader networks: Manipulating images by sliding attributes. NeurIPS, 30, 2017.
  • [29] Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. In CVPR, pages 4681–4690, 2017.
  • [30] Bowen Li, Xiaojuan Qi, Thomas Lukasiewicz, and Philip Torr. Controllable text-to-image generation. NeurIPS, 32, 2019.
  • [31] Bowen Li, Xiaojuan Qi, Thomas Lukasiewicz, and Philip HS Torr. ManiGAN: Text-guided image manipulation. In CVPR, pages 7880–7889, 2020.
  • [32] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In ICCV, pages 3730–3738, 2015.
  • [33] Andrew L Maas, Awni Y Hannun, Andrew Y Ng, et al. Rectifier nonlinearities improve neural network acoustic models. In ICML, volume 30, page 3. Citeseer, 2013.
  • [34] Qi Mao, Hsin-Ying Lee, Hung-Yu Tseng, Siwei Ma, and Ming-Hsuan Yang. Mode seeking generative adversarial networks for diverse image synthesis. In CVPR, pages 1429–1437, 2019.
  • [35] Or Patashnik, Zongze Wu, Eli Shechtman, Daniel Cohen-Or, and Dani Lischinski. StyleCLIP: Text-driven manipulation of StyleGAN imagery. In ICCV, pages 2085–2094, 2021.
  • [36] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, pages 8748–8763. PMLR, 2021.
  • [37] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
  • [38] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In ICML, pages 8821–8831. PMLR, 2021.
  • [39] Elad Richardson, Yuval Alaluf, Or Patashnik, Yotam Nitzan, Yaniv Azar, Stav Shapiro, and Daniel Cohen-Or. Encoding in style: a StyleGAN encoder for image-to-image translation. In CVPR, pages 2287–2296, 2021.
  • [40] Yujun Shen, Jinjin Gu, Xiaoou Tang, and Bolei Zhou. Interpreting the latent space of GANs for semantic face editing. In CVPR, pages 9243–9252, 2020.
  • [41] Yujun Shen, Ceyuan Yang, Xiaoou Tang, and Bolei Zhou. InterfaceGAN: Interpreting the disentangled face representation learned by GANs. TPAMI, 2020.
  • [42] Ming Tao, Hao Tang, Fei Wu, Xiao-Yuan Jing, Bing-Kun Bao, and Changsheng Xu. DF-GAN: A simple and effective baseline for text-to-image synthesis. In CVPR, pages 16515–16525, 2022.
  • [43] Sergey Tulyakov, Ming-Yu Liu, Xiaodong Yang, and Jan Kautz. MoCoGAN: Decomposing motion and content for video generation. In CVPR, pages 1526–1535, 2018.
  • [44] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. NeurIPS, 30, 2017.
  • [45] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. In CVPR, pages 3156–3164, 2015.
  • [46] Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Guilin Liu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. Video-to-video synthesis. arXiv preprint arXiv:1808.06601, 2018.
  • [47] Zongze Wu, Dani Lischinski, and Eli Shechtman. StyleSpace analysis: Disentangled controls for StyleGAN image generation. In CVPR, pages 12863–12872, 2021.
  • [48] Weihao Xia, Yujiu Yang, Jing-Hao Xue, and Baoyuan Wu. TediGAN: Text-guided diverse face image generation and manipulation. In CVPR, 2021.
  • [49] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In ICML, pages 2048–2057. PMLR, 2015.
  • [50] Tao Xu, Pengchuan Zhang, Qiuyuan Huang, Han Zhang, Zhe Gan, Xiaolei Huang, and Xiaodong He. AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks. In CVPR, pages 1316–1324, 2018.
  • [51] Zichao Yang, Xiaodong He, Jianfeng Gao, Li Deng, and Alex Smola. Stacked attention networks for image question answering. In CVPR, pages 21–29, 2016.
  • [52] Fisher Yu, Ari Seff, Yinda Zhang, Shuran Song, Thomas Funkhouser, and Jianxiong Xiao. LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365, 2015.
  • [53] Jiahui Yu, Zhe Lin, Jimei Yang, Xiaohui Shen, Xin Lu, and Thomas S Huang. Free-form image inpainting with gated convolution. In ICCV, pages 4471–4480, 2019.
  • [54] Han Zhang, Jing Yu Koh, Jason Baldridge, Honglak Lee, and Yinfei Yang. Cross-modal contrastive learning for text-to-image generation. In CVPR, pages 833–842, 2021.
  • [55] Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, and Dimitris N Metaxas. StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks. In CVPR, pages 5907–5915, 2017.
  • [56] Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, and Dimitris N Metaxas. StackGAN++: Realistic image synthesis with stacked generative adversarial networks. TPAMI, 41(8):1947–1962, 2018.
  • [57] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, pages 586–595, 2018.
  • [58] Jiapeng Zhu, Yujun Shen, Deli Zhao, and Bolei Zhou. In-domain GAN inversion for real image editing. In ECCV, pages 592–608. Springer, 2020.
  • [59] Minfeng Zhu, Pingbo Pan, Wei Chen, and Yi Yang. DM-GAN: Dynamic memory generative adversarial networks for text-to-image synthesis. In CVPR, pages 5802–5810, 2019.


This supplementary material provides additional implementation details and experimental results to support the main submission. First, we describe the datasets used in the experiments beyond those mentioned in the main paper, together with the detailed implementation setup. Next, we present additional ablation studies on using different architectures in our framework. Finally, we show more qualitative and quantitative results of text-guided image generation and editing on different datasets.

Appendix A Experiments Setup

In this section, we provide further details of the experimental setup for our proposed CLIP2GAN, including the additional datasets used to demonstrate its generalization capability and more implementation details.


Datasets. In the main paper, we use the CelebA-HQ [21] dataset to learn image reconstruction and the Multi-Modal CelebA-HQ [48] dataset to evaluate text-guided face image generation. In addition, we conduct experiments on the AFHQ [9] and LSUN [52] datasets. AFHQ is an animal-face dataset consisting of 15,000 high-quality images at $512^{2}$ resolution, covering three domains (cat, dog, and wildlife) with 5,000 images each. We use the cat and dog subsets of AFHQ for training. LSUN contains around one million labeled images for each of its 10 scene categories and 20 object categories; we use the car and church subsets for training. For these two datasets, the face identity loss is removed. The experiments on AFHQ and LSUN further validate the strong performance of our method and the generality of its text-guided image generation.


Implementation details. The hyper-parameters of the framework are set as follows. The trade-off parameters $\lambda_{rec}$, $\lambda_{LPIPS}$, $\lambda_{G}$, $\lambda_{id}$, $\lambda_{div}$, and $\lambda$ are set to 1, 1, 0.1, 1, 1, and 1, respectively, to ensure training stability. The whole framework is optimized with the Adam optimizer [26] for 100 epochs in total. The learning rate is set to $2\times 10^{-3}$ and decays linearly after 50 epochs. The framework is implemented in PyTorch, and all experiments are performed on an NVIDIA RTX 3090 GPU.
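For clarity, the sketch below illustrates how this optimization schedule could be set up in PyTorch. It is not the released implementation: `mapping_net` is a hypothetical stand-in for our mapping network, and we assume the linear reduction after epoch 50 decays the learning rate toward zero by the final epoch.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the mapping network (CLIP feature -> GAN latent code).
mapping_net = nn.Sequential(
    nn.Linear(512, 512), nn.LeakyReLU(0.2), nn.Linear(512, 512)
)

# Adam optimizer with the learning rate reported above.
optimizer = torch.optim.Adam(mapping_net.parameters(), lr=2e-3)

# Constant learning rate for the first 50 epochs, then a linear decay
# (assumed here to reach zero at epoch 100).
def lr_lambda(epoch):
    return 1.0 if epoch < 50 else max(0.0, (100 - epoch) / 50.0)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_lambda)

# Loss weights from the paper: lambda_rec, lambda_LPIPS, lambda_G,
# lambda_id, lambda_div, lambda = 1, 1, 0.1, 1, 1, 1.
loss_weights = {"rec": 1.0, "lpips": 1.0, "G": 0.1, "id": 1.0, "div": 1.0, "lambda": 1.0}
```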

Appendix B Ablation Studies

In this section, we provide additional ablation studies on using different pre-trained GANs in our framework and on different locations for the arithmetic operations during editing. These comparisons demonstrate that our method generates and edits images with high quality and high fidelity.

Model             Generation           Editing
                  FID      LPIPS       Operating Space    FID      LPIPS
DCGAN [37]        183.37   0.562       CLIP [36]          n/a      n/a
                                       GAN                196.76   0.583
StyleGAN2 [25]    34.25    0.408       CLIP [36]          60.54    0.505
                                       GAN                35.66    0.417
Table 6: Additional ablation studies of different architectures on the CelebA-HQ [21] dataset. The results show that our method also works with other GANs, and that editing in the latent space of the GAN outperforms editing in the CLIP feature space.

Effect of different pre-trained GANs. To demonstrate that the mapping network in CLIP2GAN can map the feature space of CLIP [36] to the latent space of GANs other than StyleGAN, we replace StyleGAN [24] in the framework with DCGAN [37] and retrain the mapping network. Since the generative capacity of DCGAN is limited, the generated images look blurry and of poor quality. Nevertheless, the text-guided generated images still satisfy the attributes mentioned in the text, which indicates that our model is effective for the latent space of other GANs as well. We present a quantitative analysis in Tab. 6, and a sketch of how the mapper is adapted to different generators is given below.
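As a rough illustration (not the actual code), swapping generators only requires matching the mapper's output dimension to the latent size of the chosen GAN. The helper below is hypothetical; it assumes a 512-d w code for StyleGAN2 and the common 100-d noise vector for DCGAN.

```python
import torch.nn as nn

def build_mapper(gan: str, clip_dim: int = 512) -> nn.Module:
    # Latent sizes assumed here: StyleGAN2 uses a 512-d w code,
    # while DCGAN is commonly trained with a 100-d noise vector z.
    latent_dim = {"stylegan2": 512, "dcgan": 100}[gan]
    return nn.Sequential(
        nn.Linear(clip_dim, 512),
        nn.LeakyReLU(0.2),
        nn.Linear(512, latent_dim),
    )
```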


Effect of the editing operation location. For image editing, the arithmetic operations on text features and image features can be performed at different locations, i.e., in the feature space of CLIP or in the latent space of StyleGAN. As shown in Tab. 6, better editing results are obtained by performing the arithmetic operations in the StyleGAN latent space. The latent space of StyleGAN is better disentangled than the feature space of CLIP, which makes it easier to attach text features to image features and thus generate high-quality edited images. The two variants are sketched below.
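In the following sketch, `mapper` and `generator` denote the trained mapping network and the frozen GAN generator, and `alpha` is a hypothetical editing-strength coefficient not taken from the paper; the function names are illustrative only.

```python
def edit_in_clip_space(clip_img_feat, clip_txt_feat, mapper, generator, alpha=1.0):
    # Combine features in the CLIP embedding space, then map once to the GAN latent.
    w = mapper(clip_img_feat + alpha * clip_txt_feat)
    return generator(w)

def edit_in_gan_space(clip_img_feat, clip_txt_feat, mapper, generator, alpha=1.0):
    # Map image and text features separately, then add them in the GAN latent
    # space; Tab. 6 indicates this variant yields better editing results.
    w = mapper(clip_img_feat) + alpha * mapper(clip_txt_feat)
    return generator(w)
```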

Appendix C Results of Generation and Editing

In this section, we first show more generation and editing results to further demonstrate the effectiveness of our method. Then we present additional results on different datasets, which show the generalization capability of our method. Together, these results confirm that our framework produces high-quality and high-fidelity images.

C.1 Additional Qualitative Results

We show more qualitative results of text-guided face generation and editing. As shown in Fig. 10, our method generates diverse, high-fidelity, and high-quality face images from an arbitrary text input. Also, as shown in Fig. 11, given a real face image and a textual description, we can edit the image according to the text attributes without changing other, irrelevant attributes.

C.2 Results on Different Datasets

In addition to face generation and editing, we also conduct experiments on other datasets, i.e., the cat and dog datasets of AFHQ [9] and the church dataset of LSUN [52]. Since these datasets lack corresponding text descriptions, a direct comparison with other methods is difficult. The quantitative results of our method on these three datasets are shown in Tab. 7. Evidently, the text-guided image generation model still works well on these datasets, which indicates the generality of our model on image datasets without paired text descriptions.

       Dataset        FID        LPIPS
       CelebA-HQ [21]        34.25        0.408
       CAT of AFHQ [9]        62.33        0.491
       DOG of AFHQ [9]        68.17        0.487
       CHURCH of LSUN [52]        92.84        0.534
Table 7: Quantitative results on different datasets, i.e., the CelebA-HQ dataset, the cat and dog datasets of AFHQ, and the church dataset of LSUN. On each dataset, our method achieves excellent results for text-guided image generation.

We also present the qualitative results on these three datasets. Fig. 12 shows the results of text-guided image generation on different datasets, and Fig. 13 shows the results of text-guided image editing on different datasets. The generated and edited results are diverse, high-fidelity, and high-quality.

Figure 10: Additional qualitative results of text-guided image generation on the CelebA-HQ [21] dataset. Given an arbitrary text on the left, we are able to generate diverse images on the right.
Figure 11: Additional qualitative results of text-guided image editing on the CelebA-HQ [21] dataset. Given an arbitrary image on the left, we edit it according to the different textual descriptions shown above, without changing other irrelevant attributes; the edited results are shown on the right.
Figure 12: Qualitative results of text-guided image generation on the CAT of AFHQ [9], DOG of AFHQ [9], and CHURCH of LSUN [52] datasets. Given an arbitrary text on the left, we are able to generate diverse images on the right.
Figure 13: Qualitative results of text-guided image editing on the CAT of AFHQ [9], DOG of AFHQ [9], and CHURCH of LSUN [52] datasets. Given an arbitrary image on the left, we edit it according to the different textual descriptions shown above, without changing other irrelevant attributes; the edited results are shown on the right.