
DeltaEdit: Exploring Text-free Training for Text-Driven Image Manipulation

Yueming Lyu1,2, Tianwei Lin3, Fu Li3, Dongliang He3, Jing Dong2 (corresponding author), Tieniu Tan2,4
1 School of Artificial Intelligence, University of Chinese Academy of Sciences
2 CRIPAC, Institute of Automation, Chinese Academy of Sciences
3 VIS, Baidu Inc. 4 Nanjing University
[email protected], [email protected]
Abstract

Text-driven image manipulation remains challenging in terms of training and inference flexibility. Conditional generative models depend heavily on expensive annotated training data. Meanwhile, recent frameworks that leverage pre-trained vision-language models are limited by either per-text-prompt optimization or inference-time hyper-parameter tuning. In this work, we propose a novel framework named DeltaEdit to address these problems. Our key idea is to investigate and identify a space, namely the CLIP delta image-text space, in which the distribution of CLIP visual feature differences between two images is well aligned with that of CLIP textual embedding differences between source and target texts. Based on this delta space, the DeltaEdit network is trained to map CLIP visual feature differences to editing directions of StyleGAN in the training phase. In the inference phase, DeltaEdit predicts StyleGAN editing directions from the differences of CLIP textual features. In this way, DeltaEdit is trained in a text-free manner. Once trained, it generalizes well to various text prompts for zero-shot inference without bells and whistles. Code is available at https://github.com/Yueming6568/DeltaEdit.

Figure 1: Examples of text-driven manipulations of multiple StyleGAN imagery categories induced by our DeltaEdit with text-free training.

1 Introduction

Text-driven image manipulation has attracted widespread research interest in both the academic and industrial communities given its significance for real-world applications. It aims to edit the content of images according to user-provided natural language descriptions while keeping text-irrelevant content unchanged.

Existing approaches [3, 19, 17, 12, 33] typically train conditional generative models from scratch with a large amount of manually annotated image-text pairs. This procedure requires expensive labor annotation, which hampers training flexibility. Recently, several CLIP+StyleGAN approaches [20, 16, 11, 31, 41] have been proposed to perform text-driven image manipulation by exploiting CLIP's remarkable semantic representation of image-text modalities and the high-quality generation power of a well-trained StyleGAN. Given one text prompt, these methods either rely on iterative optimization [33, 20, 16], learn a corresponding mapping network [20, 4, 35], or require tuning complex hyper-parameters online to discover the specific editing direction [20]. In other words, for different text prompts they must run different optimization processes, which is inflexible during training or inference and cannot generalize well to unseen text.

We consider that the key to alleviating these problems is to precisely build the relationship between the text feature space and StyleGAN's [8] latent visual space within one model. Manually collecting relevant textual data to train such a model is a possible way [31, 43], but it can only capture sub-relationships and leads to limited generalization. Therefore, it is challenging but desirable to explore how to construct the full mapping between the two feature spaces without any textual supervision that would restrict generalization.

To achieve this goal, we develop a text-free training paradigm, in which a text-driven image manipulation model is trained on easily collected image data instead of textual data, yet generalizes well to text prompts for zero-shot inference. The key to achieving this is to identify a feature space that is well aligned semantically across the image and text modalities. Within such an aligned space, image features can be used as pseudo text conditions to enable text-free training.

In this paper, we first approach text-free training with a naïve solution that directly uses CLIP image features as pseudo text conditions. However, it fails to perform well-disentangled image manipulation with user-provided text descriptions, mainly because a modality gap remains between the image and text feature spaces [15] (illustrated in Figure 2). Notably, it is the direction of CLIP features that is semantically meaningful, as all features are normalized to unit norm [20, 41], and CLIP feature differences of paired visual-textual data reflect similar semantic changes. Although a gap exists between the features of each individual modality, we demonstrate that the CLIP feature space of image-text differences is more consistent and semantically better aligned (shown in Figure 2). We term this feature space the CLIP delta space. Constructing pseudo text conditions in this better aligned delta space potentially enables more fine-grained and better disentangled edits for the text-free trained manipulation model.

Based on the above analysis, we propose our DeltaEdit framework. Specifically, we learn a coarse-to-fine mapping network from the CLIP image feature difference of two images to their editing direction in the $\mathcal{S}$ style space; directly applying the learned mapping to the difference of two texts' CLIP features then yields the corresponding editing direction in $\mathcal{S}$ space. The predicted editing direction can thus change image attributes from one text description to the other. As shown in Figure 1, DeltaEdit also generalizes well to various target text prompts without bells and whistles.

Figure 2: Feature space analysis on the MultiModal-CelebA-HQ [33] dataset. Paired CLIP image-text features (marked in red and green) and paired CLIP delta image-text features (marked in orange and brown) are visualized in 2D using t-SNE.
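A minimal sketch of how the alignment of the delta space could be checked on paired (image, caption) data, assuming the OpenAI clip package and PIL images as input; this is not the authors' analysis script, and the pairing of samples is left to the caller.

```python
# Sketch: compare CLIP delta image features with CLIP delta text features for two
# paired (image, caption) samples. Assumes the OpenAI "clip" package and PIL images.
import torch
import torch.nn.functional as F
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def delta_features(img1, img2, text1, text2):
    """Return normalized CLIP delta-image and delta-text features."""
    with torch.no_grad():
        i1 = model.encode_image(preprocess(img1).unsqueeze(0).to(device))
        i2 = model.encode_image(preprocess(img2).unsqueeze(0).to(device))
        t1 = model.encode_text(clip.tokenize([text1]).to(device))
        t2 = model.encode_text(clip.tokenize([text2]).to(device))
    delta_i = F.normalize(i2 - i1, dim=-1)
    delta_t = F.normalize(t2 - t1, dim=-1)
    return delta_i, delta_t

# Alignment proxy for one pair: cosine similarity between delta_i and delta_t.
# cos = F.cosine_similarity(delta_i, delta_t).item()
```

Collecting these delta vectors over many pairs and projecting them with t-SNE would reproduce the kind of visualization shown in Figure 2.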

To summarize, our main contributions are as follows:

  • We push the frontiers of text-free training for image manipulation, largely eliminating the training and inference inflexibility and the poor generalization suffered by previous methods.

  • We investigate and identify that the CLIP delta image-text feature space is better semantically aligned than the original CLIP space. Furthermore, we propose the DeltaEdit framework to learn the mapping from image-feature changes to changes in StyleGAN's $\mathcal{S}$ style space, and to seamlessly generalize to predicting editing directions from text-feature changes.

  • Extensive experiments on a variety of datasets, including FFHQ and LSUN, verify the effectiveness and efficiency of our method. Results show DeltaEdit can well generalize to various edits conditioned on different target text-prompts without bells and whistles.

2 Related Work

2.1 Vision-Language Representations

Learning generic vision-language representations is of great importance in the multi-modal field. Following the success of BERT [2], many large-scale pre-trained vision-language models [13, 18, 26, 29, 39] have been proposed. A recent development, CLIP [21], is extremely powerful for joint vision-language learning. It is trained on 400 million (image, text) pairs and learns a joint multi-modal semantic space for images and texts. Benefiting from its excellent image/text representation ability, it has been widely used in various directions, such as domain adaptation [5], image segmentation [40, 22], and image generation [4, 42, 35, 16, 14, 27].

2.2 Text-Guided Generation and Manipulation

The goal of text-guided image generation [23, 1, 28, 37, 38, 34] is to generate realistic images according to given text descriptions. As a pioneering work, Reed et al. [23] embed the text features as the conditional input for a GAN-based one-stage architecture. After that, StackGAN [37] and StackGAN++ [38] introduce multi-stage frameworks to further improve the generation quality. Recently, some works [25, 27, 33, 41] based on StyleGAN [7, 8, 6] have been proposed to improve the quality of generated images.

Compared with generation, text-guided image manipulation [3, 19, 17, 12, 33, 35, 9] aims to manipulate input images using texts that describe desired visual attributes (e.g., gender, age). For example, ManiGAN [12] proposes an affine combination module (ACM) and a detail correction module (DCM) to generate new attributes matching the given text. Recently, some works [25, 33] have adopted StyleGAN as their backbone for image manipulation. TediGAN [33] aligns the two modalities in the latent space of a pre-trained StyleGAN through its visual-linguistic similarity module. More recently, StyleCLIP [20] combines the generation power of StyleGAN and the image-text representation ability of CLIP [21] to discover manipulation directions. It outlines three approaches, namely latent optimization, a latent mapper, and global directions, denoted as StyleCLIP-OP, StyleCLIP-LM, and StyleCLIP-GD in this paper. StyleCLIP-OP and StyleCLIP-LM are per-prompt training methods, which require optimizing the latent code or training a separate model for each text prompt. The third approach, StyleCLIP-GD, is a per-prompt fine-tuning method. It first finds global directions of semantic change in StyleGAN's style space $\mathcal{S}$ via a pre-computed relevance matrix relating each channel in $\mathcal{S}$ to image-space changes. During inference, it needs to manually tune hyper-parameters to discover the fine-grained direction for each specific text prompt. In addition, HairCLIP [31] focuses on hair manipulation with the help of CLIP, and FFCLIP [43] collects 44 text prompts for face images to learn facial attribute manipulation.

3 Method

Our method aims to predict editing directions in the StyleGAN $\mathcal{S}$ space conditioned on corresponding embeddings in the CLIP image space, without any text supervision.

Given two images randomly sampled from the training image dataset, one is taken as the source image $I_1$ and the other as the target image $I_2$. The CLIP image encoder is used to extract their CLIP image embeddings, $i_1$ and $i_2$, and a pre-trained StyleGAN inversion model [30] is adopted as an encoder to extract the latent codes $s_1$ and $s_2$ in the $\mathcal{S}$ space, which has been shown to be more disentangled than other intermediate feature spaces [32]. From the extracted codes, we compute the ground-truth manipulation direction $\Delta s = s_2 - s_1$.
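A sketch of this training-pair preparation is given below; clip_model and preprocess are a loaded CLIP model and its transform, while s_encoder is a hypothetical wrapper around the e4e inversion model that returns 6048-dimensional $\mathcal{S}$-space codes.

```python
# Sketch: build one DeltaEdit training sample from a randomly paired source/target image.
import torch

def make_training_pair(img1, img2, clip_model, preprocess, s_encoder, device="cuda"):
    with torch.no_grad():
        i1 = clip_model.encode_image(preprocess(img1).unsqueeze(0).to(device))
        i2 = clip_model.encode_image(preprocess(img2).unsqueeze(0).to(device))
        s1 = s_encoder(img1)   # S-space code of the source image (hypothetical wrapper)
        s2 = s_encoder(img2)   # S-space code of the target image
    delta_i = i2 - i1          # CLIP image-space direction (pseudo text condition)
    delta_s = s2 - s1          # ground-truth S-space editing direction (supervision)
    return s1, i1, delta_i, delta_s
```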

In Section 3.1, we start by proposing a naïve solution that constructs pseudo text conditions directly from image embeddings, based on the alignment property of CLIP's joint image-text space. However, the resulting method cannot perform well when editing images, so in Section 3.2 we propose a new text-free training framework, called DeltaEdit, which alleviates the problems of the naïve solution and performs accurate and disentangled editing in the proposed CLIP delta space.

3.1 A Naïve Solution to Text-free Training

As illustrated in Figure 3, in the training phase the source embeddings, $i_1$ and $s_1$, are taken as input. The condition comes from a target CLIP image embedding $i_2$ instead of a target CLIP text embedding. The source embeddings and the image condition are then fed to a latent mapper network to predict the manipulation direction $\Delta s'$:

$\Delta s' = \mathrm{LatentMapper}(s_1, i_1, i_2).$ (1)

During inference, we can predict the editing direction $\Delta s'$ as:

$\Delta s' = \mathrm{LatentMapper}(s_1, i_1, t),$ (2)

where $t$ is a CLIP text embedding constructed from a target text prompt.
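A minimal sketch of this naïve baseline (Eqs. 1-2) follows; the layer sizes are illustrative placeholders rather than the exact design used in the experiments.

```python
# Sketch of the naive baseline: a single MLP mapper conditioned directly on CLIP features.
import torch
import torch.nn as nn

class NaiveLatentMapper(nn.Module):
    def __init__(self, s_dim=6048, clip_dim=512, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(s_dim + 2 * clip_dim, hidden), nn.LeakyReLU(0.2),
            nn.Linear(hidden, s_dim),
        )

    def forward(self, s1, i1, cond):
        # Training: cond = i2 (target image feature).
        # Inference: cond = t (target text feature) -- the swap that Figure 4 shows to fail.
        return self.net(torch.cat([s1, i1, cond], dim=-1))
```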

Figure 3: Illustration of the naïve solution to text-free training.
Figure 4: Editing results of the naïve solution. Given different source images as input, the method maps them all to an average face corresponding to the text-related attributes when the condition is directly replaced from the image feature $i$ to the text feature $t$.
Figure 5: The overall framework of the proposed DeltaEdit. (a) In the text-free training phase, we extract embeddings of two randomly selected images in the CLIP image space and the StyleGAN $\mathcal{S}$ space. We then feed $i_1$, $s_1$ and $\Delta i$ into the Delta Mapper to predict the editing direction $\Delta s'$, which is supervised by $\Delta s$. (b) The detailed architecture of the Delta Mapper, which performs coarse-to-fine manipulation at three levels. (c) In the inference phase, based on the co-linearity between $\Delta i$ and $\Delta t$ in the CLIP joint space, DeltaEdit achieves text-driven image manipulation by taking two text prompts (denoting the source and the desired target) as input.

However, as shown in Figure 4, given different source images, the naïve method maps them all to average faces corresponding to the text-related attributes, leading to inaccurate manipulation. We suspect that this phenomenon occurs because a modality gap remains between CLIP's visual and linguistic modalities, even though the cosine similarity between paired image-text features is maximized to align the image-text space in CLIP. To verify this, we perform a t-SNE visualization on the MM-CelebA-HQ dataset [33] in Figure 2, which shows that the CLIP image space and text space are actually not that close to each other. A recent work [15] has also demonstrated that the joint multi-modal space in CLIP is not well aligned, due to model initialization and contrastive representation learning.

Encouragingly, through further exploration we have identified a better semantically aligned multi-modal feature space, namely the CLIP delta image-text space, in which the two modalities lie extremely close to each other, making it possible to approximate text conditions with image embeddings in this delta space. Moreover, using the change of text embeddings as the condition, e.g., with "face" as the source text and "face with yellow hair" as the target text, has the advantage of enhancing the importance of the words that indicate the desired attributes (like "yellow hair"), which alleviates the problem that CLIP is not sensitive to fine-grained or complex words [21, 24].

3.2 DeltaEdit

In our DeltaEdit framework, we extract delta image-text features, $\Delta i = i_2 - i_1$ and $\Delta t = t_2 - t_1$, as the condition vectors in the training phase and the inference phase, respectively. As shown in Figure 5, during training we use the extracted codes to obtain the CLIP image-space direction $\Delta i = i_2 - i_1$ and the StyleGAN $\mathcal{S}$-space direction $\Delta s = s_2 - s_1$. We then propose a latent mapper network in the delta space, called the Delta Mapper, to predict the editing direction as:

$\Delta s' = \mathrm{DeltaMapper}(s_1, i_1, \Delta i),$ (3)

where $s_1$ and $i_1$ are used as inputs of the Delta Mapper to provide specialized information about the source image.

Delta Mapper. The architecture of the Delta Mapper is illustrated in Figure 5 (b). Since different StyleGAN layers correspond to different semantic levels [7, 33], it is common to divide these layers into levels and perform coarse-to-fine manipulation within each level. Following [20, 31], we adopt three levels of sub-modules (coarse, medium, and fine) for each designed module, where each sub-module contains several fully-connected layers. With the different levels of the source latent code $s_1$, the Style Module produces coarse-to-fine intermediate features $(e^s_c, e^s_m, e^s_f)$, where subscripts denote the coarse-to-fine level and superscripts denote the Style Module. We then concatenate $\Delta i$ and $i_1$ as input and use the Condition Module to learn coarse, medium and fine embeddings $(e^i_c, e^i_m, e^i_f)$ separately, which have the same dimensions as $(e^s_c, e^s_m, e^s_f)$. Finally, we fuse the generated coarse-to-fine features using the proposed Fusion Module with three sub-modules $(M^f_c(\cdot,\cdot), M^f_m(\cdot,\cdot), M^f_f(\cdot,\cdot))$ and predict the editing direction as:

$\Delta s' = \big(M^f_c(e^i_c, e^s_c),\; M^f_m(e^i_m, e^s_m),\; M^f_f(e^i_f, e^s_f)\big).$ (4)
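A PyTorch sketch of the Delta Mapper forward pass (Eqs. 3-4) is shown below. The per-level splits of the 6048-dimensional $\mathcal{S}$ code follow Table 4 in the appendix (coarse 3x512, medium 4x512, fine 2464); the depth of each sub-module is simplified relative to the four fully-connected layers listed there, and the LeakyReLU slope is an assumption.

```python
# Sketch of the Delta Mapper: Style, Condition and Fusion modules at three levels.
import torch
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=512, final_act=True):
    layers = [nn.Linear(in_dim, hidden), nn.LeakyReLU(0.2), nn.Linear(hidden, out_dim)]
    if final_act:
        layers.append(nn.LeakyReLU(0.2))
    return nn.Sequential(*layers)

class DeltaMapper(nn.Module):
    SPLITS = (3 * 512, 4 * 512, 2464)   # coarse / medium / fine channels of s (sum = 6048)

    def __init__(self, clip_dim=512):
        super().__init__()
        self.style = nn.ModuleList([mlp(d, d) for d in self.SPLITS])
        self.cond = nn.ModuleList([mlp(2 * clip_dim, d) for d in self.SPLITS])
        self.fuse = nn.ModuleList([mlp(2 * d, d, final_act=False) for d in self.SPLITS])

    def forward(self, s1, i1, delta_i):
        cond_in = torch.cat([delta_i, i1], dim=-1)   # condition: [delta_i, i1]
        outs, start = [], 0
        for k, d in enumerate(self.SPLITS):
            e_s = self.style[k](s1[:, start:start + d])                 # Style Module, level k
            e_i = self.cond[k](cond_in)                                  # Condition Module, level k
            outs.append(self.fuse[k](torch.cat([e_i, e_s], dim=-1)))    # Fusion Module, level k
            start += d
        return torch.cat(outs, dim=-1)                                   # predicted delta s'
```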

To train the proposed Delta Mapper, our full objective function contains two losses, which can be denoted as:

$\mathcal{L} = \mathcal{L}_{rec} + \mathcal{L}_{sim} = \|\Delta s' - \Delta s\|_2 + 1 - \cos(\Delta s', \Delta s),$ (5)

where the $L_2$ reconstruction loss supervises the learning of the editing direction $\Delta s'$, and the cosine similarity loss explicitly encourages the network to minimize the cosine distance between the predicted direction $\Delta s'$ and $\Delta s$ in the $\mathcal{S}$ space.
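A direct sketch of the Eq. 5 objective:

```python
# Sketch of the training objective: L2 reconstruction plus (1 - cosine similarity).
import torch
import torch.nn.functional as F

def delta_edit_loss(delta_s_pred, delta_s_gt):
    l_rec = torch.norm(delta_s_pred - delta_s_gt, p=2, dim=-1).mean()
    l_sim = (1.0 - F.cosine_similarity(delta_s_pred, delta_s_gt, dim=-1)).mean()
    return l_rec + l_sim
```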

Text-driven Image Manipulation. During inference, based on the well-aligned multi-modal delta feature space, we can achieve text-driven image manipulation with the trained Delta Mapper. As shown in Figure 5 (c), given the source image $I$, we first extract its CLIP image embedding $i$ and StyleGAN $\mathcal{S}$-space embedding $s$. We then construct $\Delta t$ from the source and target text prompts and predict the editing direction $\Delta s'$ as:

$\Delta s' = \mathrm{DeltaMapper}(s, i, \Delta t),$ (6)

where the editing direction is subsequently used to generate the edited latent embedding $s' = s + \Delta s'$. In the final step, we generate the synthesized image $I'$ with StyleGAN based on $s'$.
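The inference path (Eq. 6) could look as follows; clip_model and preprocess are a loaded CLIP model and its transform, while s_encoder and stylegan_generator are hypothetical wrappers around the e4e inversion model and the pre-trained StyleGAN2 generator.

```python
# Sketch of text-driven inference: source image + (source text, target text) -> edited image.
import torch
import clip

@torch.no_grad()
def edit_image(image, source_text, target_text, clip_model, preprocess,
               s_encoder, delta_mapper, stylegan_generator, device="cuda"):
    i = clip_model.encode_image(preprocess(image).unsqueeze(0).to(device))
    s = s_encoder(image)                                    # S-space code of the source image
    t1 = clip_model.encode_text(clip.tokenize([source_text]).to(device))
    t2 = clip_model.encode_text(clip.tokenize([target_text]).to(device))
    delta_t = t2 - t1                                       # CLIP text-space direction
    delta_s = delta_mapper(s, i, delta_t)                   # predicted S-space editing direction
    return stylegan_generator(s + delta_s)                  # synthesize the edited image
```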

Text Prompts. In the inference pipeline, the crucial issue is how to construct $\Delta t = t_2 - t_1$. To mirror the training phase, the source text and target text should be specific descriptions of two different images. However, in this setting the DeltaEdit network would generate images towards the direction of the target text and away from the direction of the source text. Therefore, to remove the reverse effect of a source text prompt containing attributes, we simply place all the user-described attributes in the target text prompt. For human portrait images, if the user intends to add "smile" and "eyeglasses" to a face, our DeltaEdit uses "face with smile and eyeglasses" as the target text and "face" as the source text. This text prompt setting is also applicable to other manipulation domains. In the experiments, we further verify that it enables accurate image manipulation (see the sketch below).
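A small helper illustrating the prompt convention described above; the function name and phrasing template are illustrative, grounded only in the "face" / "face with smile and eyeglasses" example.

```python
# Sketch of the prompt convention: all desired attributes go into the target text,
# while the source text is just the category name.
def build_prompts(category, attributes):
    source_text = category                                        # e.g. "face"
    target_text = f"{category} with {' and '.join(attributes)}"   # e.g. "face with smile and eyeglasses"
    return source_text, target_text

# build_prompts("face", ["smile", "eyeglasses"])
# -> ("face", "face with smile and eyeglasses")
```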

Disentanglement. To improve disentanglement, we further refine the obtained editing direction $\Delta s'$ using the pre-computed relevance matrix $R_s$, which records how the CLIP image embedding changes when each dimension of the $\mathcal{S}$ space is modified [20]. We set those channels of $\Delta s'$ to zero that have a low correlation with the target text according to $R_s$.
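A sketch of this channel-masking step, assuming the per-channel relevance scores have already been obtained from $R_s$ for the target edit (how that projection is computed is not shown here); the threshold value 0.03 follows the setting used in the comparisons of Section 4.2.

```python
# Sketch: zero out channels of the predicted direction whose relevance to the target
# edit falls below a threshold beta.
import torch

def disentangle(delta_s, relevance, beta=0.03):
    # relevance: per-channel relevance score of shape (num_channels,), derived from Rs
    mask = (relevance.abs() >= beta).float()
    return delta_s * mask
```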

4 Experiments

4.1 Implementation Details

To verify the effectiveness and generalization of the proposed method, we conduct extensive experiments on a diverse set of challenging domains. For the facial domain, we randomly choose 58,000 images from the FFHQ [7] dataset and sample 200,000 fake images from the $z$ space of StyleGAN for training; the remaining 12,000 FFHQ images are used for evaluation. We additionally provide results on the LSUN [36] Cat, Church and Horse datasets. Note that all real images are inverted by the e4e encoder [30], and all generated images are obtained using pre-trained StyleGAN2 [8] generators. The proposed method is trained on 1 NVIDIA Tesla P40 GPU. During training, we set the batch size to 64 and adopt the ADAM [10] optimizer with $\beta_1 = 0.9$, $\beta_2 = 0.999$ and a constant learning rate of 0.5.
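A sketch of this optimization setup, reusing the DeltaMapper and loss sketches from Section 3; the data loader is a hypothetical iterator yielding (s1, i1, delta_i, delta_s) batches of size 64.

```python
# Sketch of the training loop: Adam (beta1=0.9, beta2=0.999), constant learning rate 0.5.
import torch

def train_delta_mapper(delta_mapper, train_loader, loss_fn, num_epochs=1):
    optimizer = torch.optim.Adam(delta_mapper.parameters(), lr=0.5, betas=(0.9, 0.999))
    for _ in range(num_epochs):
        for s1, i1, delta_i, delta_s in train_loader:      # hypothetical DataLoader
            loss = loss_fn(delta_mapper(s1, i1, delta_i), delta_s)   # Eq. 5 objective
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```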

4.2 Qualitative Evaluation

We perform multiple qualitative evaluations with different text descriptions on several challenging datasets. More results are shown in Supp.

Figure 6: Various edits for facial images on the StyleGAN2 FFHQ model. The target attribute included in the text prompt is shown above each image.

Various Text-driven Manipulation Results. To evaluate the generalization ability of our method to texts with different semantic meanings, we show facial manipulation results for 21 attributes in Figure 6, all generated from one trained model. The results show that only the target attributes are manipulated, while other irrelevant attributes are well preserved. Meanwhile, the results adapt to each individual with diverse details rather than overfitting to the same color or shape, which is most evident for "red hair" and "bowlcut hairstyle". In addition, manipulation results on images of cats, churches and horses are shown in Figure 7 and Figure 8, where StyleGAN2 models pre-trained on the LSUN cat, church and horse datasets [36] are used. It is worth mentioning that none of these text prompts were seen during training, which further indicates the effectiveness of the proposed method.

Figure 7: A variety of edits for cat images, using StyleGAN2 pre-trained on the LSUN cats dataset. The target text prompt is indicated above each column.
Figure 8: Various edits for church and horse images, using StyleGAN2 pre-trained on the LSUN churches dataset (left) and the LSUN horses dataset (right), respectively.
Figure 9: Comparison results with TediGAN, StyleCLIP-LM and StyleCLIP-GD. Our approach demonstrates better visual realism and attribute disentanglement in almost every case.
Figure 10: Comparison between DeltaEdit and StyleCLIP-GD under different manipulation strengths $\alpha$. Without complex hyper-parameter tuning, our method achieves more realistic and disentangled results, compared to the best results of StyleCLIP-GD under different settings (labeled with red boxes).

Comparisons. We compare our method with state-of-the-art text-guided image manipulation methods: TediGAN [33], StyleCLIP-LM [20], and StyleCLIP-GD [20]. As shown in Figure 9, the results produced by TediGAN are highly entangled and almost fail to edit attributes such as "long hair" and "eyeglasses". The results of StyleCLIP-LM are unstable and also entangled. The results of StyleCLIP-GD show better disentanglement than StyleCLIP-LM because it is carefully tuned for each case; however, its disadvantage is the complex hyper-parameter tuning and the high inference time reported in Table 3. In comparison, our method yields the most impressive and disentangled results in almost every case.

To verify the inference flexibility of our method, we further compare with the strong baseline StyleCLIP-GD, which achieves manipulation by manually tuning two hyper-parameters, the disentanglement threshold $\beta$ and the manipulation strength $\alpha$. We also use a disentanglement threshold $\beta$ to set some channels of the obtained $\Delta s'$ to zero. For a fair comparison, we empirically set $\beta$ to 0.03 and compare with StyleCLIP-GD under different manipulation strengths $\alpha$ in Figure 10. The results demonstrate that StyleCLIP-GD cannot generalize to different image attributes and produce feasible editing results under the same hyper-parameters. For example, the result conditioned on "big eyes" is most accurate at $\alpha = 1$, while the result conditioned on "black clothes" is most accurate at $\alpha = 4$. In contrast, our method achieves ideal manipulation results for various text prompts without manually tuning the manipulation strength. Moreover, under the same disentanglement threshold $\beta$, the StyleCLIP-GD results tend to be entangled with irrelevant attributes. For example, for the text prompt "blue suit", its results are entangled with "blue eyes", "mouth slightly open", "wearing earrings", etc. In contrast, our method achieves more disentangled results without complex hyper-parameter tuning.

4.3 Quantitative Evaluation

In Table 1, we present objective measurements of FID, PSNR, and IDS (identity similarity before and after manipulation, measured by ArcFace) for comparison. The results are averaged over ten editing texts. Compared with the state-of-the-art approaches, we achieve the best performance on all metrics. We also compare with the naïve solution to text-free training. For fairness, it is also trained in the $\mathcal{S}$ space and supervised by the same loss functions as DeltaEdit. Compared with the naïve solution, the final solution (Ours) largely improves inference performance on all metrics, benefiting from the well-aligned CLIP delta image-text space. Finally, we conduct a human subjective evaluation of manipulation accuracy (Acc) and visual realism (Real). We compare DeltaEdit with the strong baseline, StyleCLIP-GD, under different settings of the hyper-parameter $\alpha$. In total, 20 evaluation rounds are performed and 40 participants are invited. In each round, we present the results of a randomly sampled editing text to each participant. Participants were asked to choose the best text-driven manipulation output in terms of Acc and to select the output images (not limited to one) that are visually realistic (Real). The results are listed in Table 2, showing the superiority of our model.

Figure 11: Editing results under different text prompt settings, where the source text is labeled blue and the target text is labeled green.

4.4 Efficiency Analysis

To validate the efficiency and flexibility of the proposed method, we compare computation time with TediGAN and the three StyleCLIP approaches in Table 3. Specifically, TediGAN first requires 12+ hours to encode images and texts into a common space and then trains the encoding module. StyleCLIP-OP manipulates images with an optimization process, which requires several tens of seconds per image. StyleCLIP-LM is fast during inference, but needs 10-12 hours to train a mapper network per text prompt. StyleCLIP-GD requires about 4 hours to pre-compute the global edit directions. However, since the pre-computed directions are coarse and cannot be applied directly to control the manipulation, it requires additional manual tuning for different text prompts, which typically takes 8-9 seconds per case. In contrast, we avoid this heavy labor cost during training and inference. Once trained, our Delta Mapper is universal and can directly and efficiently manipulate images with new text prompts. Meanwhile, our method only needs 2-3 hours for training, since training is conducted directly in the latent space without generating images at each iteration.

Methods | FID (↓) | PSNR (↑) | IDS (↑)
TediGAN | 31.13 | 20.46 | 0.60
StyleCLIP-LM | 18.33 | 21.41 | 0.88
StyleCLIP-GD ($\alpha=2$) | 12.06 | 22.31 | 0.86
StyleCLIP-GD ($\alpha=3$) | 16.85 | 19.50 | 0.77
StyleCLIP-GD ($\alpha=4$) | 22.85 | 17.61 | 0.68
Ours | 10.29 | 22.92 | 0.90
Ours ($\mathcal{S}$ space) | 18.88 | 13.49 | 0.61
Ours ($\mathcal{W}+$ space) | 21.63 | 12.83 | 0.55
Ours (naïve) | 44.62 | 11.21 | 0.29
Table 1: Quantitative comparison results on FID, PSNR, and IDS.
Metric | StyleCLIP ($\alpha=2$) | StyleCLIP ($\alpha=3$) | StyleCLIP ($\alpha=4$) | Ours
Acc. (↑) | 19% | 9% | 9.75% | 62.25%
Real. (↑) | 50% | 33% | 28.75% | 90.25%
Table 2: User preference study on manipulation accuracy (Acc.) and visual realism (Real.).
Methods | Pre-proc. | Training time | Infer. time | Conditioned on input image | Latent space
TediGAN | - | 12h+ | 20 s | yes | $\mathcal{W}+$
StyleCLIP-OP | - | - | 99 s | yes | $\mathcal{W}+$
StyleCLIP-LM | - | 10-12h | 70 ms | yes | $\mathcal{W}+$
StyleCLIP-GD | 4h | - | 72 ms + $T_{labor}$ | no | $\mathcal{S}$
Ours | 4h | 2.7h | 73 ms | yes | $\mathcal{S}$
Table 3: Time comparison of our method with other state-of-the-art methods. For StyleCLIP-GD, the additional time for manual hyper-parameter tuning during inference is denoted as $T_{labor}$, which is typically 8-9 seconds per case.

4.5 Text-prompts Analysis

In Figure 11, we explore how different text prompt settings affect the editing results. During inference, the editing direction $\Delta s'$ is driven by the CLIP text-space direction $\Delta t$ between the source text and the target text. (a) We first construct three different (source text, target text) pairs and find that, regardless of their content, the editing results are determined by the difference between them. For example, although both the source and target texts contain the attribute "bangs", the editing result is only influenced by the difference, namely "black hair". (b) Moreover, to align with the training phase, we construct (source text, target text) pairs that are each a specific text description of an image, such as "face is sad" as the source text and "face with bangs" as the target text. We find that the generated face becomes more "happy" (the opposite direction of the source text) and gains "bangs" (the direction of the target text), which is equivalent to directly putting all the attributes to be edited into the target text. (c) In addition, we construct target texts with multiple combinations of facial attributes, including hairstyles, smile and eyeglasses, and our method yields the desired results driven by text prompts containing multiple semantics. Note that our method can directly perform multi-attribute manipulation without additional training, since the different manipulation directions have already been learned by training on large-scale data. More exploration experiments on text prompts are provided in the Supp.

4.6 Ablation Study

Choice of the manipulation space. To find an appropriate space for the proposed method, we conduct experiments by performing manipulation in the $\mathcal{W}+$ space and the $\mathcal{S}$ space, respectively. The quantitative and qualitative results in Table 1 and Figure 12 show that the $\mathcal{S}$-space results achieve better visual quality and identity preservation than the $\mathcal{W}+$-space results. Thus, we implement DeltaEdit in the $\mathcal{S}$ space and introduce $R_s$ to further improve disentanglement.

The effectiveness of the relevance matrix $R_s$. To further improve disentanglement, we introduce the relevance matrix $R_s$ to prevent irrelevant channels from changing. The fourth row in Figure 12 shows that, with $R_s$, our method successfully edits the desired attributes while keeping text-irrelevant content unchanged. For example, for "bangs", the generated results have bangs while the background and facial pose remain unchanged.

Figure 12: Manipulation results obtained by implementing our DeltaEdit in (a) the $\mathcal{W}+$ space, (b) the $\mathcal{S}$ space, and (c) the $\mathcal{S}$ space with the relevance matrix $R_s$.

5 Conclusions

In this paper, we propose DeltaEdit, which supports various edits with a single model without the need to train many independent models or perform complex manual tuning. It is trained in a text-free manner and gets rid of the expensive collection of image-text pairs. Once trained, it generalizes well to unseen text prompts for zero-shot inference. Extensive qualitative and quantitative experiments demonstrate the superiority of our method in terms of the quality of its results, its efficiency in the training and inference phases, and its generalization to unseen text prompts.

Acknowledgement. This work was supported by National Key Research and Development Program of China under Grant No. 2021YFC3320103.

References

  • [1] Jun Cheng, Fuxiang Wu, Yanling Tian, Lei Wang, and Dapeng Tao. Rifegan: Rich feature generation for text-to-image synthesis from prior knowledge. In CVPR, 2020.
  • [2] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019.
  • [3] Hao Dong, Simiao Yu, Chao Wu, and Yike Guo. Semantic image synthesis via adversarial learning. In ICCV, 2017.
  • [4] Rinon Gal, Or Patashnik, Haggai Maron, Gal Chechik, and Daniel Cohen-Or. Stylegan-nada: Clip-guided domain adaptation of image generators. arXiv preprint arXiv:2108.00946, 2021.
  • [5] Chunjiang Ge, Rui Huang, Mixue Xie, Zihang Lai, Shiji Song, Shuang Li, and Gao Huang. Domain adaptation via prompt learning. arXiv preprint arXiv:2202.06687, 2022.
  • [6] Tero Karras, Miika Aittala, Samuli Laine, Erik Härkönen, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Alias-free generative adversarial networks. In NeurIPS, 2021.
  • [7] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In CVPR, 2019.
  • [8] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of StyleGAN. In CVPR, 2020.
  • [9] Gwanghyun Kim, Taesung Kwon, and Jong Chul Ye. Diffusionclip: Text-guided diffusion models for robust image manipulation. In CVPR, 2022.
  • [10] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
  • [11] Umut Kocasari, Alara Dirik, Mert Tiftikci, and Pinar Yanardag. Stylemc: Multi-channel based fast text-guided image generation and manipulation. In WACV, 2022.
  • [12] Bowen Li, Xiaojuan Qi, Thomas Lukasiewicz, and Philip HS Torr. Manigan: Text-guided image manipulation. In CVPR, 2020.
  • [13] Xiujun Li, Xi Yin, Chunyuan Li, Pengchuan Zhang, Xiaowei Hu, Lei Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, et al. Oscar: Object-semantics aligned pre-training for vision-language tasks. In ECCV, 2020.
  • [14] Zhiheng Li, Martin Renqiang Min, Kai Li, and Chenliang Xu. Stylet2i: Toward compositional and high-fidelity text-to-image synthesis. In CVPR, 2022.
  • [15] Weixin Liang, Yuhui Zhang, Yongchan Kwon, Serena Yeung, and James Zou. Mind the gap: Understanding the modality gap in multi-modal contrastive representation learning. In NeurIPS, 2022.
  • [16] Xingchao Liu, Chengyue Gong, Lemeng Wu, Shujian Zhang, Hao Su, and Qiang Liu. Fusedream: Training-free text-to-image generation with improved clip+ gan space optimization. arXiv preprint arXiv:2112.01573, 2021.
  • [17] Yahui Liu, Marco De Nadai, Deng Cai, Huayang Li, Xavier Alameda-Pineda, Nicu Sebe, and Bruno Lepri. Describe what to change: A text-guided unsupervised image-to-image translation approach. In ACM MM, 2020.
  • [18] Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In NeurIPS, 2019.
  • [19] Seonghyeon Nam, Yunji Kim, and Seon Joo Kim. Text-adaptive generative adversarial networks: manipulating images with natural language. In NeurIPS, 2018.
  • [20] Or Patashnik, Zongze Wu, Eli Shechtman, Daniel Cohen-Or, and Dani Lischinski. Styleclip: Text-driven manipulation of stylegan imagery. In ICCV, 2021.
  • [21] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, 2021.
  • [22] Yongming Rao, Wenliang Zhao, Guangyi Chen, Yansong Tang, Zheng Zhu, Guan Huang, Jie Zhou, and Jiwen Lu. Denseclip: Language-guided dense prediction with context-aware prompting. arXiv preprint arXiv:2112.01518, 2021.
  • [23] Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. Generative adversarial text to image synthesis. In ICML, 2016.
  • [24] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S Sara Mahdavi, Rapha Gontijo Lopes, et al. Photorealistic text-to-image diffusion models with deep language understanding. arXiv preprint arXiv:2205.11487, 2022.
  • [25] David Stap, Maurits Bleeker, Sarah Ibrahimi, and Maartje ter Hoeve. Conditional image generation and manipulation for user-specified content. In CVPRW, 2020.
  • [26] Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, and Jifeng Dai. Vl-bert: Pre-training of generic visual-linguistic representations. In ICLR, 2019.
  • [27] Jianxin Sun, Qiyao Deng, Qi Li, Muyi Sun, Min Ren, and Zhenan Sun. Anyface: Free-style text-to-face synthesis and manipulation. In CVPR, 2022.
  • [28] Jianxin Sun, Qi Li, Weining Wang, Jian Zhao, and Zhenan Sun. Multi-caption text-to-face synthesis: Dataset and algorithm. In ACM MM, 2021.
  • [29] Hao Tan and Mohit Bansal. Lxmert: Learning cross-modality encoder representations from transformers. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019.
  • [30] Omer Tov, Yuval Alaluf, Yotam Nitzan, Or Patashnik, and Daniel Cohen-Or. Designing an encoder for stylegan image manipulation. ACM TOG, 2021.
  • [31] Tianyi Wei, Dongdong Chen, Wenbo Zhou, Jing Liao, Zhentao Tan, Lu Yuan, Weiming Zhang, and Nenghai Yu. Hairclip: Design your hair by text and reference image. In CVPR, 2022.
  • [32] Zongze Wu, Dani Lischinski, and Eli Shechtman. Stylespace analysis: Disentangled controls for stylegan image generation. In CVPR, 2021.
  • [33] Weihao Xia, Yujiu Yang, Jing-Hao Xue, and Baoyuan Wu. Tedigan: Text-guided diverse face image generation and manipulation. In CVPR, 2021.
  • [34] Tao Xu, Pengchuan Zhang, Qiuyuan Huang, Han Zhang, Zhe Gan, Xiaolei Huang, and Xiaodong He. Attngan: Fine-grained text to image generation with attentional generative adversarial networks. In CVPR, 2018.
  • [35] Zipeng Xu, Tianwei Lin, Hao Tang, Fu Li, Dongliang He, Nicu Sebe, Radu Timofte, Luc Van Gool, and Errui Ding. Predict, prevent, and evaluate: Disentangled text-driven image manipulation empowered by pre-trained vision-language model. In CVPR, 2022.
  • [36] Fisher Yu, Ari Seff, Yinda Zhang, Shuran Song, Thomas Funkhouser, and Jianxiong Xiao. Lsun: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365, 2015.
  • [37] Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, and Dimitris Metaxas. Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks. In ICCV, 2017.
  • [38] Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, and Dimitris N Metaxas. Stackgan++: Realistic image synthesis with stacked generative adversarial networks. IEEE TPAMI, 2018.
  • [39] Pengchuan Zhang, Xiujun Li, Xiaowei Hu, Jianwei Yang, Lei Zhang, Lijuan Wang, Yejin Choi, and Jianfeng Gao. Vinvl: Making visual representations matter in vision-language models. In CVPR, 2021.
  • [40] Chong Zhou, Chen Change Loy, and Bo Dai. Denseclip: Extract free dense labels from clip. arXiv preprint arXiv:2112.01071, 2021.
  • [41] Yufan Zhou, Ruiyi Zhang, Changyou Chen, Chunyuan Li, Chris Tensmeyer, Tong Yu, Jiuxiang Gu, Jinhui Xu, and Tong Sun. Lafite: Towards language-free training for text-to-image generation. In CVPR, 2021.
  • [42] Peihao Zhu, Rameen Abdal, John Femiani, and Peter Wonka. Mind the gap: Domain gap control for single shot domain adaptation for generative adversarial networks. arXiv preprint arXiv:2110.08398, 2021.
  • [43] Yiming Zhu, Hongyu Liu, Yibing Song, Xintong Han, Chun Yuan, Qifeng Chen, Jue Wang, et al. One model to edit them all: Free-form text-driven image manipulation with semantic modulations. In NeurIPS, 2022.

Appendix A Network Architecture

The detailed architecture of the proposed Delta Mapper is shown in Table 4, where "Linear" indicates one fully-connected layer. The input of the Style Module is the 6048 style channels of the style space; we split it into three levels and feed them into $M^s_c$, $M^s_m$ and $M^s_f$, respectively. The input of the Condition Module is the concatenation of the CLIP image-space embeddings $\Delta i$ and $i_1$, each of dimension 512. Taking the concatenation of the outputs of the Style Module and the Condition Module, the Fusion Module produces coarse-to-fine features with its three sub-modules and then concatenates them to obtain the final manipulation direction with 6048 channels.
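A small sketch of how the 6048 style channels could be split into the three level-wise inputs listed in Table 4; the slicing order (coarse first, then medium, then fine) is an assumption consistent with the shapes in the table.

```python
# Sketch: split a 6048-dim S-space code into the level-wise inputs of Mcs, Mms and Mfs.
import torch

def split_style_code(s):
    # s: tensor of shape (batch, 6048)
    coarse = s[:, :3 * 512].reshape(-1, 3, 512)          # (batch, 3, 512), input to M_c^s
    medium = s[:, 3 * 512:7 * 512].reshape(-1, 4, 512)   # (batch, 4, 512), input to M_m^s
    fine = s[:, 7 * 512:]                                 # (batch, 2464), input to M_f^s
    return coarse, medium, fine
```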

Appendix B Additional Results

In this section, we provide additional qualitative results to further verify the effectiveness of DeltaEdit.

B.1 Attribute Interpolation.

Given two edited latent codes $s'_a$ and $s'_b$ in the $\mathcal{S}$ space, we can achieve facial attribute manipulation by interpolation. The new edited latent code is obtained as $s_I = \omega \cdot s'_a + (1 - \omega) \cdot s'_b$, where $\omega$ is the interpolation weight. As shown in Figure 13, we achieve natural and impressive interpolated results from reference text A ("Blond Hair") to reference text B ("Bangs") by varying the weight $\omega$ from 0 to 1 in intervals of 0.2.
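A sketch of this interpolation, where stylegan_generator is a hypothetical wrapper around the pre-trained StyleGAN2 generator that decodes an $\mathcal{S}$-space code into an image.

```python
# Sketch: linearly blend two edited S-space codes and decode each blend.
import torch

def interpolate_edits(s_a, s_b, stylegan_generator, steps=6):
    images = []
    for k in range(steps):
        w = k / (steps - 1)                     # weight from 0 to 1 in intervals of 0.2
        s_interp = w * s_a + (1.0 - w) * s_b    # s_I = w * s'_a + (1 - w) * s'_b
        images.append(stylegan_generator(s_interp))
    return images
```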

B.2 More visual results on different datasets.

We show more visual results on the FFHQ dataset in Figure 14, Figure 15 and Figure 16. In addition, we provide more visual results on the LSUN cat dataset in Figure 17.

B.3 Multi-attribute Generation.

We present multi-attribute generation results in Figure 18. Given the target attributes, we can obtain various generated results.

B.4 How text prompts affect results.

To further explore the effect of different text prompts on the manipulation results, we conduct experiments with different text prompt settings and show the results in Figure 19, Figure 20 and Figure 21.

Module | Sub-module | Layers in the module | Input shape | Output shape
Style Module | $M^s_c$ | 4×(Linear, Leaky ReLU) | (1, 3, 512) | (1, 3, 512)
Style Module | $M^s_m$ | 4×(Linear, Leaky ReLU) | (1, 4, 512) | (1, 4, 512)
Style Module | $M^s_f$ | 4×(Linear, Leaky ReLU) | (1, 2464) | (1, 2464)
Condition Module | $M^i_c$ | 4×(Linear, Leaky ReLU) | (1, 1024) | (1, 512)
Condition Module | $M^i_m$ | 4×(Linear, Leaky ReLU) | (1, 1024) | (1, 512)
Condition Module | $M^i_f$ | 4×(Linear, Leaky ReLU) | (1, 1024) | (1, 2464)
Fusion Module | $M^f_c$ | 4×(Linear, Leaky ReLU) | (1, 3, 1024) | (1, 3, 512)
Fusion Module | $M^f_m$ | 4×(Linear, Leaky ReLU) | (1, 4, 1024) | (1, 4, 512)
Fusion Module | $M^f_f$ | 4×(Linear, Leaky ReLU) | (1, 4928) | (1, 2464)
Table 4: The detailed architecture of the proposed Delta Mapper.
Figure 13: Results of facial attribute interpolation. The styles of the interpolated images continuously transfer from attribute A ("Blond Hair") to attribute B ("Bangs") by varying the weight $\omega$ from 0 to 1 in intervals of 0.2.
Figure 14: A variety of edits for facial images on the StyleGAN2 FFHQ model. The target attribute included in the text prompt is shown above each column.
Figure 15: A variety of edits for facial images on the StyleGAN2 FFHQ model. The target attribute included in the text prompt is shown above each column.
Figure 16: A variety of hairstyle and hair color edits using our DeltaEdit. The corresponding hair attributes used in the text prompts are listed above each column.
Figure 17: A variety of edits for cat images, using StyleGAN2 pre-trained on the LSUN cats dataset. The target attribute is indicated above each column.
Figure 18: Multi-attribute generation results using DeltaEdit. The leftmost column lists the target attributes included in the text prompts; the columns from left to right show various generated results.
Figure 19: Manipulation results under different source texts, where the source text is labeled blue and the target text is labeled green. Taking the category name, such as "face", as the source text yields feasible manipulation. However, taking meaningless qualifiers or nouns unrelated to the target category, e.g. "a" or "church", as the source text does not achieve successful manipulation. Therefore, we choose "face" as the source text in experiments.
Figure 20: Manipulation results under different target texts, where the source text is labeled blue and the target text is labeled green. [Attr] indicates the target attribute. When directly using "[Attr]" as the target text, the manipulation results are entangled with other attributes and are not that accurate with respect to the target attribute. When using "face has [Attr]" or "face with [Attr]" as the target text, disentangled and accurate results are achieved. Thus, we use "face with [Attr]" as the target text in experiments.
Figure 21: Manipulation results under different text prompt settings, where the source text is labeled blue and the target text is labeled green. By reversing the source text and the target text, an opposite manipulation result can be achieved.