Text-Driven Image Editing via Learnable Regions

Yuanze Lin^♠ Yi-Wen Chen^♣ Yi-Hsuan Tsai^★ Lu Jiang^★ Ming-Hsuan Yang^♣★
^♠ University of Oxford ^♣ UC Merced ^★ Google

Abstract

Language has emerged as a natural interface for image editing. In this paper, we introduce a method for region-based image editing driven by textual prompts, without the need for user-provided masks or sketches. Specifically, our approach leverages an existing pre-trained text-to-image model and introduces a bounding box generator to identify the editing regions that are aligned with the textual prompts. We show that this simple approach enables flexible editing that is compatible with current image generation models, and is able to handle complex prompts featuring multiple objects, complex sentences, or lengthy paragraphs. We conduct an extensive user study to compare our method against state-of-the-art methods. The experiments demonstrate the competitive performance of our method in manipulating images with high fidelity and realism that correspond to the provided language descriptions. Our project webpage can be found at: https://yuanze-lin.me/LearnableRegions_page.

Input Image

Edited Image

Input Image

Edited Image

Input Image

Edited Image

donuts with ice cream a cute sloth holds a box a flowering cherry tree

a blooming flower and dessert several lotus flowers growing in the water a cup of coffee next to the bread

Figure 1: Overview. Given an input image and a language description for editing, our method can generate realistic and relevant images without the need for user-specified regions for editing. It performs local image editing while preserving the image context.

1 Introduction

With the availability of a massive amount of text-image paired data and large-scale vision-language models, recent text-driven image synthesis models [62, 32, 33, 34, 39, 53, 43, 42, 8, 57] have enabled people to create and manipulate specific visual contents of realistic images using natural language descriptions in an interactive fashion.

Recent text-driven image editing methods [39, 30, 1, 2, 54, 21, 38] have shown impressive capabilities in editing realistic images based on natural descriptions, with approaches typically falling into two paradigms: mask-based or mask-free methods. Mask-based editing approaches [39, 1] are perceived intuitively for local image editing because they allow users to specify precisely which areas of an image to modify. However, these methods can be laborious, as they demand users to manually create masks that are sometimes unnecessary, limiting their user experience in many applications.

In contrast, mask-free editing approaches [9, 21, 38, 54, 6] do not require masks and can directly modify the appearance or texture of the input image. These methods are trained to create fine-grained pixel masks, which can be applied to either the RGB space or the latent embedding space within the latent diffusion model framework. While there has been significant advancement in mask-free editing, the precision of editing in current methods relies heavily on the accuracy of detailed masks at the pixel level. Current methods encounter difficulties with local modifications, particularly when dealing with less accurate masks.

Current mask-free image editing approaches have predominantly concentrated on pixel masks [10, 6]. The use of bounding boxes as an intermediate representation for editing images has not been thoroughly explored. Bounding boxes can provide an intuitive and user-friendly input for image editing. They facilitate a smoother interactive editing process by being quicker and easier for users to adjust the box, unlike pixel masks that typically require more time and precision to draw pixels accurately. Moreover, some generative transformer models such as Muse [9] may support only box-like masks as opposed to pixel-level masking for image editing.

This paper explores the feasibility of employing bounding boxes as an intermediate representation within a mask-free editing framework. Our objective is not to propose a new image editing model, but to introduce a component that enables an existing pretrained mask-based editing model to perform mask-free editing via the learnable regions. To this end, we propose a region-based editing network that is trained to generate editing regions utilizing a text-driven editing loss with CLIP guidance [42]. Our method can be integrated with different image editing models. To demonstrate its versatility, we apply it to two distinct image synthesis models: non-autoregressive transformers as used in MaskGIT [8] and Muse [9], as well as Stable Diffusion [45]. It is worth highlighting that the latent spaces in transformer models (MaskGIT and Muse) are only compatible with box-like masks and lack the precision for pixel-level masks in image editing.

Our experimental results demonstrate that the proposed method can generate realistic images that match the context of the provided language descriptions. Furthermore, we conduct a user study to validate that our method outperforms five state-of-the-art baseline methods. The results indicate that our method edits images with greater fidelity and realism, following the changes specified in the language descriptions. The contributions of this work are as follows:

•

Our approach enables mask-based text-to-image models to perform local image editing without needing masks or other user-provided guidance. It can be integrated with existing text-guided editing models to improve their quality and relevance.
•

We introduce a novel region generator model that employs a new CLIP-guidance loss to learn to find regions for image editing. We demonstrate its applicability by integrating it with two popular and distinct text-guided editing models, MaskGIT [8] and Stable Diffusion [45].
•

Experiments show the high quality and realism of our generated results. The user study further validates that our method outperforms state-of-the-art image editing baselines in producing favorable editing results.

2 Related Work

Text-to-Image Synthesis.

In recent years, significant progress has been made in text-to-image synthesis. While early contributions are mainly based on Generative Adversarial Network (GAN) approaches [59, 60, 56, 31], the latest models are mostly built on diffusion models [22, 52, 40, 45, 48, 44, 20, 61, 35] or transformer models [16, 43, 14, 58]. For example, DALL·E 2 [44] and Imagen [23] propose to condition textual prompts to diffusion models, while Muse [9] leverages masked generative transformers to generate images from texts. Other approaches [11, 12, 44] leverage pre-trained CLIP models [42] to guide image generation based on textual descriptions.

More recently, Stable Diffusion [45], trained on large image-text pairs [49], has been made publicly available and has served as the foundation for numerous image generation and manipulation works. ControlNet [61] proposes to control Stable Diffusion with spatially localized conditions for image synthesis. Different from these works, we aim to introduce a component that can enable pre-trained text-to-image models for mask-free local image editing.

Text-driven Image Manipulation.

Several recent works have utilized pre-trained generator models and CLIP [42] for text-driven image manipulation [3, 17, 36, 41, 30, 2]. StyleCLIP [41] combines the generative ability of StyleGAN [25] with CLIP to control latent codes, enabling a wide range of image manipulations. VQGAN-CLIP [12] uses CLIP [42] to guide VQ-GAN [16] for high-quality image generation and editing.

There are several approaches [39, 53, 1, 37, 27, 26, 18, 47] that use diffusion models for text-driven image manipulation. Imagic [26] can generate textual embeddings aligning with the input images and editing prompts, and fine-tune the diffusion model to perform edits. InstructPix2Pix [4] combines GPT-3 [5] and Stable Diffusion [45] to edit images with human instructions. Our work is related to the state-of-the-art methods DiffEdit [10] and MasaCtrl [6]. DiffEdit [10] leverages DDIM inversion [13, 52] with the automatically produced masks for local image editing. MasaCtrl [6] proposes mutual self-attention and learns the editing masks from the cross-attention maps of the diffusion models. Motivated by the aforementioned works, we also utilize diffusion models and CLIP guidance for text-driven image manipulation. In contrast to DiffEdit [10] and MasaCtrl [6], whose editing is much more sensitive to the generated mask regions, our proposed method focuses on learning bounding boxes for local editing, which can be more flexible in accommodating diverse text prompts.

3 Proposed Method

Text-driven image editing manipulates the visual content of input images to align with the contexts or modifications specified in the text. Our goal is to enable text-to-image models to perform mask-free local image editing. To this end, we propose a region generation network that can produce promising regions for image editing.

Figure 3 shows the overall pipeline of our proposed method for text-driven image editing.

3.1 Edit-Region Generation

Given the input image as $X\in\mathbb{R}^{3\times H\times W}$ and text with $p$ words as $T\in\mathbb{Z}^{p}$ , we first use a pre-trained visual transformer model, ViT-B/16 [15], for feature extraction. This model is pre-trained using the DINO self-supervised learning objective [7]. The feature $F\in\mathbb{R}^{d\times h\times w}$ from the last layer has been shown to contain semantic segmentation of objects [7, 51], which can serve as a prior in our problem.

Then we initialize $K$ anchor points ${\{C_{i}\}}_{i=1}^{K}$ located at the top- $K$ scoring patches of the self-attention map from the [CLS] token query of the DINO pre-trained transformer as shown in [7], where the [CLS] token carries guidance to locate the semantically informative parts of the objects.

Following this, we define a set of bounding box proposals ${\mathcal{B}_{i}=\{B_{j}\}}_{j=1}^{M}$ for each anchor point $C_{i}$ , where each bounding box is centered at their corresponding anchor point. For simplicity, we parameterize the bounding box with a single parameter such that each $B_{j}$ is a square box with shape $j\times j$ .

Subsequently, we train a region generation network to explicitly consider all unique bounding boxes derived from the same anchor point. For a given anchor point $C_{i}$ , we then have:

	$\displaystyle f_{j}$	$\displaystyle=\text{ROI-pool}(F,B_{j}),$		(1)
	$\displaystyle S([f_{1},\cdots,f_{M}])$	$\displaystyle=[\pi_{1},\cdots,\pi_{M}],$		(2)

where $[\cdot]$ concatenates features along the channel dimension, and the ROI-pool operation [19] is used to perform pooling for the feature $F\in\mathbb{R}^{d\times h\times w}$ with respect to the box $B_{j}$ , resulting in a feature tensor $f_{j}\in\mathbb{R}^{d\times l\times l}$ . In our experiments, we set $l$ as 7. $S$ is the proposed region generation network consisting of two convolutional layers and two linear layers, with a ReLU activation layer between consecutive layers. The output from the final linear layer, denoted as $\pi_{j}$ in Eq. 2 as the logits for the bounding box with size $j$ , is fed into a softmax function to predict the scores for the bounding box proposal, i.e., $\text{Softmax}([\pi_{1},\cdots,\pi_{M}])$ .

To learn the parameters of the region generation network, we use the Gumbel-Softmax trick [24]. We re-parameterize $\pi_{j}$ by adding a small Gumbel noise $g_{j}=-\log(-\log(u_{j}))$ where $u_{j}\sim\text{Uniform}(0,1)$ . During training, we apply straight-through gradient estimation, in which backward propagation uses the differentiable variable (i.e., softmax) while the forward pass still takes the argmax, treating $\pi$ as the categorical variable. For each anchor point, once we obtain the editing region with the highest softmax score, we first generate a corresponding box-like mask image, and then feed the mask image, input image, and editing prompt into the text-to-image model to obtain the edited image. Thus, we can get $K$ edited images considering all anchor points, in Section 3.3, we explain how to produce the final edited image as inference output.

3.2 Training Objectives

As the CLIP model [42] can estimate the similarity between images and texts, we employ it to guide our image editing based on user-specified prompts.

To train our models, we propose a composite editing loss that consists of three components: 1) the CLIP guidance loss $\mathcal{L}_{\text{Clip}}$ stands for the cosine distance between features extracted from generated images and texts, specifically derived from the last layers of CLIP’s encoders. 2) the directional loss $\mathcal{L}_{\text{Dir}}$ [41] controls the direction of the applied edit within the CLIP space [17, 41], and 3) the structural loss $\mathcal{L}_{\text{Str}}$ takes into account the self-similarity [29, 50] of features between source and generated images, which facilitates editing in texture and appearance while preserving the original spatial layout of objects in the source images. The total loss $\mathcal{L}$ and each loss term are:

\mathcal{L}=\lambda_{C}\mathcal{L}_{\text{Clip}}+\lambda_{S}\mathcal{L}_{\text{Str}}+\lambda_{D}\mathcal{L}_{\text{Dir}},

(3)

\mathcal{L}_{\text{Clip}}=\mathcal{D}_{\text{cos}}(E_{\text{v}}(X_{o}),E_{\text{t}}(T)),

(4)

\mathcal{L}_{\text{Str}}=||Q(f_{X_{o}})-Q(f_{X})||_{2},

(5)

\mathcal{L}_{\text{Dir}}=\mathcal{D}_{\text{cos}}(E_{\text{v}}(X_{o})-E_{\text{v}}(X),E_{\text{t}}(T)-E_{\text{t}}(T_{\text{ROI}})),

(6)

where $E_{\text{v}}$ and $E_{\text{t}}$ are the visual and textual encoder of the CLIP model. We empirically set the weights $\lambda_{C}=1$ , $\lambda_{D}=1$ , and $\lambda_{S}=1$ for our composite editing loss. Here, $X$ , $T$ , and $X_{o}$ denote the input image, text prompt, and the edited image by the proposed region, respectively. $f_{X_{o}}$ and $f_{X}$ indicate the visual features of $X_{o}$ and $X$ from the last layer of CLIP’s visual encoder, while $Q(f_{X_{o}})$ and $Q(f_{X})$ denote the similarity matrix of $f_{X_{o}}$ and $f_{X}$ respectively. For simplicity, we use the cosine distance $\mathcal{D}_{\text{cos}}$ to measure the similarity between images and texts. Note that $T_{\text{ROI}}$ represents a given region-of-interest of the source image for editing (e.g., in Figure 5, when $T$ is “a big tree with many flowers in the center”, then $T_{\text{ROI}}$ could be “tree”).

During training, our loss functions encourage the region generator to produce appropriate regions for editing by taking into account the similarity between the edited images and the given text descriptions.

3.3 Inference

During the inference process, we define a quality score to rank the edited images generated from different anchor points and select the image with the highest score for presentation to the user.

While there exist more advanced methods, we use a simple weighted average to compute the quality score:

S=\alpha\cdot S_{\text{t2i}}+\beta\cdot S_{\text{i2i}},

(7)

where $S_{\text{t2i}}$ estimates the cosine similarity scores between the given text descriptions and the edited images, $S_{\text{i2i}}$ measures the cosine similarity scores between the source images and the edited images, and $\alpha$ and $\beta$ are the coefficients to control the influences of $S_{\text{t2i}}$ and $S_{\text{i2i}}$ . We adopt the features extracted from the last layer of CLIP’s encoders for similarity calculation.

In our experiments, we set $\alpha$ and $\beta$ as 2 and 1 respectively, since a higher value for $\alpha$ can encourage the model to place more weight on the faithfulness of text-conditioned image editing. The edited image with the highest quality score $S$ is chosen as the final edited image.

3.4 Compatibility with Pretrained Editing Models

Our proposed region generator can be integrated with various image editing models [8, 45, 1, 39] for modifying the content of source images conditioning on the prompts, and to demonstrate its versatility, we apply it to two distinct image synthesis models: non-autoregressive transformers as used in MaskGIT [8] or Muse [9], as well as diffusion U-Nets [46] as used in Stable Diffusion [45].

The transformer and diffusion models represent distinct base editing models to verify the applicability of the proposed method. It is worth noting that MaskGIT and Muse are transformers that operate over discrete tokens created by a VQ autoencoder [55], unlike diffusion models [22, 52, 45] operating within the continuous space. As a result, the latent spaces in MaskGIT and Muse are only compatible with box-like masks and lack the precision for pixel-level masks in image editing.

For our experiments, we use the official MaskGIT model instead of the Muse model [9], which is not publicly available. We also limit the text prompt to the class vocabulary that the model is trained on.

4 Experimental Results

Implementation Details.

In our evaluation, we collect high-resolution and free-to-use images covering a variety of objects from Unsplash (https://unsplash.com/). For edit-region generation, the total number of bounding box proposals (i.e., $M$ ) is 7 and the CLIP guidance model is initialized with ViT-B/16 weights. We do not use super-resolution models to enhance the quality of the resultant images. By default, we adopt the pre-trained Stable Diffusion-v-1-2 as our editing model. Our main experiments are conducted using two A5000 GPUs, where we train the model for 5 epochs using Adam optimizer [28] with an initial learning rate of 0.003.

4.1 Qualitative Evaluation

We assess the performance of our proposed method on a diverse set of high-quality images featuring various objects. Figure 4 shows that our approach takes an image and a language description to perform mask-free edits. We display complex text prompts that feature one category of object (the \nth1 row), multiple objects (the \nth2 row), geometric relations (the \nth3 row), and long paragraphs (the \nth4 row).

4.2 Comparisons with Prior Work

We compare our method with five state-of-the-art text-driven image editing approaches: Plug-and-Play [54] preserves the semantic layout of the source image by injecting features from the source image into the generation process of the target image. InstructPix2Pix [4] first utilizes GPT-3 and Stable Diffusion to produce paired training data for image editing. It then trains a diffusion model with classifier-free guidance under conditions. Null-text Inversion [38] enables text-based image editing with Stable Diffusion, using an initial DDIM inversion [52, 13] as a pivot for optimization, tuning only the null-text embedding in classifier-free guidance. DiffEdit [10] automatically generates masks for the regions that require editing by contrasting predictions of the diffusion model conditioned on different text prompts. MasaCtrl [6] performs text-based non-rigid image editing by converting self-attention in diffusion models into mutual self-attention, and it extracts the masks from the cross-attention maps as the editing regions.

In all experiments, we report the results of the compared methods using their official code, except for DiffEdit¹¹1As there’s no official code of DiffEdit, we use the code https://github.com/Xiang-cd/DiffEdit-stable-diffusion. Figure 5 displays the editing results from existing methods. We have the following observations, InstructPix2Pix inevitably leads to undesired changes in the global appearance (e.g., background). DiffEdit and MasaCtrl will yield unsatisfactory results when using the more complex prompts containing multiple objects. Other methods generate less realistic results (e.g., coffee in the \nth4 row) or results that do not correspond with the text prompt (e.g., only one apple and pear in the \nth3 row).

Compared Methods	Preference for Ours
vs. Plug-and-Play [54]	80.5% $\pm$ 1.9%
vs. InstructPix2Pix [4]	73.2% $\pm$ 2.2%
vs. Null-text [38]	88.2% $\pm$ 1.6%
vs. DiffEdit [10]	91.9% $\pm$ 1.3%
vs. MasaCtrl [6]	90.8% $\pm$ 1.4%
Average	84.9%

Table 1: User studies. We show the percentage (mean, std) of user preference for our approach over compared methods.

4.3 User Study

To evaluate the quality of the edited images, we conduct a user study using 60 input images and text prompts. We employ paired comparisons to measure user preference. In each test, we show an input image, a text prompt, and two edited images generated by our method and one of the compared approaches. We ask the subject to choose the one that performs better in coherence with the text prompt while maintaining fidelity to the input image.

There are 203 participants in this study, where each participant evaluates 40 pairs of images. The image set and compared method for each image are randomly selected for each user. The order in each comparison pair is shuffled when presenting to each user. All the methods are compared for the same number of times.

Table 1 shows the user study results. The proposed method performs favorably against all five compared approaches. On average, our method is preferred in 84.9% of all the comparisons, which demonstrates the effectiveness of the proposed method.

Failure cases.

We present some failure cases of our approach to analyze the reasons. As shown in Figure 7, the failure results can be caused by improper anchor point initialization, especially when the anchor points fall into the background area.

4.4 Ablation Study

Compatibility with image synthesis models.

To demonstrate the generalizability of the proposed method, we conduct experiments using MaskGIT [8], a distinct image generative transformer. As shown in Figure 6, we can generate results that adhere to the text prompt while preserving the background content. Note that the latent spaces within MaskGIT exclusively accommodate box-like masks, lacking the requisite precision for manipulating pixel-level masks in the context of image editing.

Effect of different loss components.

To evaluate the influences of different loss components in our training loss, in Figure 8, we show the results generated without the directional loss $\mathcal{L}_{Dir}$ that controls the directional edit, or without the structural loss $\mathcal{L}_{Str}$ that focuses on preserving the appearance of the source image. We observe that the result without $\mathcal{L}_{Dir}$ does not fully match the context of the text prompt, and the result without $\mathcal{L}_{Str}$ fails to preserve the posture and shape of the object from the source image. In contrast, our method using all loss components can generate results that adhere to the text prompt while preserving the concept in the source image.

Effect of region generation methods.

In Table 2, we present the user study results by comparing our method with two other baselines for bounding box generation. (1) Random-anchor-random-size: The editing regions are bounding boxes centered at anchor points uniformly sampled from the whole image, with height and width uniformly sampled from $[0,H]$ and $[0,W]$ , where $H$ and $W$ are the height and width of the image. We clamp the regions exceeding the image boundary. (2) DINO-anchor-random-size: The editing regions are bounding boxes centered at anchor points selected from the DINO self-attention map, which are identical to those generated by our method, but with height and width uniformly sampled from $[0,H]$ and $[0,W]$ . For both baselines, we use the same number of anchor points as our method, and select the image with the highest quality score $S$ to present to the user.

The results show that our method is preferred in 83.9% compared with Random-anchor baseline. Even when compared to the competitive baseline where the anchor point is selected from the DINO self-attention map with the randomly chosen bounding box size, the proposed method is still preferred in 71.0% of all comparisons. These results validate the effectiveness of our model in generating meaningful editing regions.

Compared Methods	Preference for Ours
vs. Random-anchor-random-size	83.9% $\pm$ 2.6%
vs. DINO-anchor-random-size	71.0% $\pm$ 3.2%

Table 2: Ablation study of region generation methods. We show the percentage (mean, std) of user preference for our approach over two compared baselines.

4.5 Limitation

We observe two limitations of our method. First, the performance is affected by the choice of the self-supervised model, particularly regarding anchor initialization. Second, since no user-specified region guidance is provided, the predicted region may include background areas, resulting in unintentional modifications in certain image contents. To address this, we plan to model the mask using more fine-grained representations (e.g., patches).

5 Conclusion

In this paper, we propose a method for editing given images based on freely provided language descriptions, including paragraphs, without the need for user-specified edit regions. We introduce a region generation network and incorporate text-driven editing training losses to generate high-quality and realistic images. The proposed method seamlessly integrates with various image synthesis models. Experiments including user studies are conducted, demonstrating the competitive performance of our proposed method.

References

Avrahami et al. [2022] Omri Avrahami, Dani Lischinski, and Ohad Fried. Blended diffusion for text-driven editing of natural images. In CVPR, 2022.
Bar-Tal et al. [2022] Omer Bar-Tal, Dolev Ofri-Amar, Rafail Fridman, Yoni Kasten, and Tali Dekel. Text2LIVE: Text-driven layered image and video editing. In ECCV, 2022.
Bau et al. [2021] David Bau, Alex Andonian, Audrey Cui, YeonHwan Park, Ali Jahanian, Aude Oliva, and Antonio Torralba. Paint by word. arXiv:2103.10951, 2021.
Brooks et al. [2023] Tim Brooks, Aleksander Holynski, and Alexei A Efros. Instructpix2pix: Learning to follow image editing instructions. In CVPR, 2023.
Brown et al. [2020] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. In NeurIPS, 2020.
Cao et al. [2023] Mingdeng Cao, Xintao Wang, Zhongang Qi, Ying Shan, Xiaohu Qie, and Yinqiang Zheng. Masactrl: Tuning-free mutual self-attention control for consistent image synthesis and editing. In ICCV, 2023.
Caron et al. [2021] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In ICCV, 2021.
Chang et al. [2022] Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, and William T Freeman. MaskGIT: Masked generative image transformer. In CVPR, 2022.
Chang et al. [2023] Huiwen Chang, Han Zhang, Jarred Barber, AJ Maschinot, Jose Lezama, Lu Jiang, Ming-Hsuan Yang, Kevin Murphy, William T Freeman, Michael Rubinstein, et al. Muse: Text-to-image generation via masked generative transformers. arXiv:2301.00704, 2023.
Couairon et al. [2023] Guillaume Couairon, Jakob Verbeek, Holger Schwenk, and Matthieu Cord. Diffedit: Diffusion-based semantic image editing with mask guidance. In ICLR, 2023.
[11] Katherine Crowson. CLIP guided diffusion HQ 256x256. https://colab.research.google.com/drive/12a_Wrfi2_gwwAuN3VvMTwVMz9TfqctNj.
Crowson et al. [2022] Katherine Crowson, Stella Biderman, Daniel Kornis, Dashiell Stander, Eric Hallahan, Louis Castricato, and Edward Raff. VQGAN-CLIP: Open domain image generation and editing with natural language guidance. In ECCV, 2022.
Dhariwal and Nichol [2021] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat GANs on image synthesis. In NeurIPS, 2021.
Ding et al. [2022] Ming Ding, Wendi Zheng, Wenyi Hong, and Jie Tang. CogView2: Faster and better text-to-image generation via hierarchical transformers. In NeurIPS, 2022.
Dosovitskiy et al. [2020] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv:2010.11929, 2020.
Esser et al. [2021] Patrick Esser, Robin Rombach, and Björn Ommer. Taming transformers for high-resolution image synthesis. In CVPR, 2021.
Gal et al. [2022] Rinon Gal, Or Patashnik, Haggai Maron, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. StyleGAN-NADA: CLIP-guided domain adaptation of image generators. ACM Transactions on Graphics (TOG), 41(4):1–13, 2022.
Gal et al. [2023] Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H. Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion. In ICLR, 2023.
Girshick [2015] Ross Girshick. Fast R-CNN. In ICCV, 2015.
Gu et al. [2022] Shuyang Gu, Dong Chen, Jianmin Bao, Fang Wen, Bo Zhang, Dongdong Chen, Lu Yuan, and Baining Guo. Vector quantized diffusion model for text-to-image synthesis. In CVPR, 2022.
Hertz et al. [2023] Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image editing with cross-attention control. In ICLR, 2023.
Ho et al. [2020] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In NeurIPS, 2020.
Ho et al. [2022] Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. Imagen video: High definition video generation with diffusion models. arXiv:2210.02303, 2022.
Jang et al. [2016] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with gumbel-softmax. arXiv:1611.01144, 2016.
Karras et al. [2020] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of StyleGAN. In CVPR, 2020.
Kawar et al. [2022] Bahjat Kawar, Shiran Zada, Oran Lang, Omer Tov, Huiwen Chang, Tali Dekel, Inbar Mosseri, and Michal Irani. Imagic: Text-based real image editing with diffusion models. arXiv:2210.09276, 2022.
Kim et al. [2022] Gwanghyun Kim, Taesung Kwon, and Jong Chul Ye. DiffusionCLIP: Text-guided diffusion models for robust image manipulation. In CVPR, 2022.
Kingma and Ba [2014] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv:1412.6980, 2014.
Kolkin et al. [2019] Nicholas Kolkin, Jason Salavon, and Gregory Shakhnarovich. Style transfer by relaxed optimal transport and self-similarity. In CVPR, 2019.
Kwon and Ye [2022] Gihyun Kwon and Jong Chul Ye. CLIPstyler: Image style transfer with a single text condition. In CVPR, 2022.
Li et al. [2020] Bowen Li, Xiaojuan Qi, Thomas Lukasiewicz, and Philip HS Torr. ManiGAN: Text-guided image manipulation. In CVPR, 2020.
Lin et al. [2021] Yuanze Lin, Xun Guo, and Yan Lu. Self-supervised video representation learning with meta-contrastive network. In Proceedings of the IEEE/CVF international conference on computer vision, pages 8239–8249, 2021.
Lin et al. [2022] Yuanze Lin, Yujia Xie, Dongdong Chen, Yichong Xu, Chenguang Zhu, and Lu Yuan. Revive: Regional visual representation matters in knowledge-based visual question answering. Advances in Neural Information Processing Systems, 35:10560–10571, 2022.
Lin et al. [2023] Yuanze Lin, Chen Wei, Huiyu Wang, Alan Yuille, and Cihang Xie. Smaug: Sparse masked autoencoder for efficient video-language pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2459–2469, 2023.
Lin et al. [2024] Yuanze Lin, Ronald Clark, and Philip Torr. Dreampolisher: Towards high-quality text-to-3d generation via geometric diffusion. arXiv preprint arXiv:2403.17237, 2024.
Liu et al. [2021] Xingchao Liu, Chengyue Gong, Lemeng Wu, Shujian Zhang, Hao Su, and Qiang Liu. FuseDream: Training-free text-to-image generation with improved CLIP+GAN space optimization. arXiv:2112.01573, 2021.
Meng et al. [2021] Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. SDEdit: Guided image synthesis and editing with stochastic differential equations. In ICLR, 2021.
Mokady et al. [2023] Ron Mokady, Amir Hertz, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Null-text inversion for editing real images using guided diffusion models. In CVPR, 2023.
Nichol et al. [2021] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv:2112.10741, 2021.
Nichol and Dhariwal [2021] Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In ICML, 2021.
Patashnik et al. [2021] Or Patashnik, Zongze Wu, Eli Shechtman, Daniel Cohen-Or, and Dani Lischinski. StyleCLIP: Text-driven manipulation of StyleGAN imagery. In ICCV, 2021.
Radford et al. [2021] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, 2021.
Ramesh et al. [2021] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In ICML, 2021.
Ramesh et al. [2022] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with CLIP latents. arXiv:2204.06125, 2022.
Rombach et al. [2022] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, 2022.
Ronneberger et al. [2015] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In MICCAI, 2015.
Ruiz et al. [2022] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation. arXiv:2208.12242, 2022.
Saharia et al. [2022] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L. Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, Seyedeh Sara Mahdavi, Raphael Gontijo Lopes, Tim Salimans, Jonathan Ho, David J. Fleet, and Mohammad Norouzi. Photorealistic text-to-image diffusion models with deep language understanding. In NeurIPS, 2022.
Schuhmann et al. [2021] Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. LAION-400M: Open dataset of CLIP-filtered 400 million image-text pairs. arXiv:2111.02114, 2021.
Shechtman and Irani [2007] Eli Shechtman and Michal Irani. Matching local self-similarities across images and videos. In CVPR, 2007.
Siméoni et al. [2021] Oriane Siméoni, Gilles Puy, Huy V Vo, Simon Roburin, Spyros Gidaris, Andrei Bursuc, Patrick Pérez, Renaud Marlet, and Jean Ponce. Localizing objects with self-supervised transformers and no labels. arXiv:2109.14279, 2021.
Song et al. [2021] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In ICLR, 2021.
Su et al. [2022] Xuan Su, Jiaming Song, Chenlin Meng, and Stefano Ermon. Dual diffusion implicit bridges for image-to-image translation. In ICLR, 2022.
Tumanyan et al. [2023] Narek Tumanyan, Michal Geyer, Shai Bagon, and Tali Dekel. Plug-and-play diffusion features for text-driven image-to-image translation. In CVPR, 2023.
Van Den Oord et al. [2017] Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning. In NeurIPS, 2017.
Xu et al. [2018] Tao Xu, Pengchuan Zhang, Qiuyuan Huang, Han Zhang, Zhe Gan, Xiaolei Huang, and Xiaodong He. AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks. In CVPR, 2018.
Ye et al. [2022] Shuquan Ye, Yujia Xie, Dongdong Chen, Yichong Xu, Lu Yuan, Chenguang Zhu, and Jing Liao. Improving commonsense in vision-language models via knowledge graph riddles. arXiv:2211.16504, 2022.
Yu et al. [2022] Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, Ben Hutchinson, Wei Han, Zarana Parekh, Xin Li, Han Zhang, Jason Baldridge, and Yonghui Wu. Scaling autoregressive models for content-rich text-to-image generation. arXiv:2206.10789, 2022.
Zhang et al. [2017] Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, and Dimitris N Metaxas. StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks. In ICCV, 2017.
Zhang et al. [2018] Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, and Dimitris N Metaxas. StackGAN++: Realistic image synthesis with stacked generative adversarial networks. IEEE TPAMI, 41(8):1947–1962, 2018.
Zhang et al. [2023] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In ICCV, 2023.
Zhang et al. [2021] Tianhao Zhang, Hung-Yu Tseng, Lu Jiang, Weilong Yang, Honglak Lee, and Irfan Essa. Text as neural operator: Image manipulation by text instruction. In ACM MM, pages 1893–1902, 2021.

Appendix A Overview

This supplementary material includes the following additional content:

•

Further implementation details in Section B.
•

Additional experimental results in Section C.
•

Additional visualization results in Section D, as shown in Figure 12, 13 and 14.

Appendix B Implementation Details

In our experiments in the main paper, we chose 8 anchor points whose attention values are the top-8 largest. The training time for each sample is approximately 4 minutes, with a batch size set to 1 using two A5000 GPUs. For the diffusion model, we set the guidance scale and strength as 7.5 and 0.75, respectively.

Appendix C Additional Experimental Results

C.1 Comparison with existing methods

In Figure 10 and 11 of this Appendix, we present additional results comparing our proposed method with state-of-the-art text-driven image editing baseline approaches, including Plug-and-Play [54], InstructPix2Pix [4], Null-text inversion [38], DiffEdit [10] and MasaCtrl [6].

We observe that the comparison results are consistent with the user studies reported in the main paper: our method generates images of higher quality that adhere to the text prompts while preserving the background content. The baseline methods appear to fall short in terms of generation quality and controllability by the prompt. For example, Plug-and-Play and InstructPix2Pix tend to modify the style of the entire image. The Null-text inversion method appears to have difficulty generating objects that align with the text prompts. Meanwhile, techniques like DiffEdit and MasaCtrl yield results that are of lower quality or poorer alignment with the text prompt.

C.2 Additional Ablation Study

In this section, we present three additional ablation studies employing qualitative metrics to complement the study based on user ratings reported in the main paper. Two metric scores are used: (1) the CLIP [42] text-to-image similarity score $S_{t2i}$ , which evaluates the cosine similarity between the given prompt and the edited image; (2) the CLIP image-to-image similarity score $S_{i2i}$ , which represents the cosine similarity between the source image and the edited image. We adopt the CLIP model initialized from ViT-B/16 weights to calculate the similarity scores. Results highlighted in blue denote the default settings used in our main experiments.

Loss Component	S_t2i $\uparrow$	S_i2i $\uparrow$
$\mathcal{L}_{Clip}$	0.301	0.801
$\mathcal{L}_{Clip}+L_{Str}$	0.294	0.806
$\mathcal{L}_{Clip}+L_{Str}+L_{Dir}$	0.301	0.805

Table 3: Ablation study on adopting different loss components.

# of region proposals	S_t2i $\uparrow$	S_i2i $\uparrow$
1	0.231	0.915
3	0.273	0.837
5	0.295	0.809
7	0.300	0.805
9	0.301	0.802

Table 4: Ablation study on the number of region proposals.

# of anchor points	S_t2i $\uparrow$	S_i2i $\uparrow$
1	0.275	0.824
4	0.296	0.805
6	0.301	0.802
8	0.300	0.805
10	0.298	0.803

Table 5: Ablation study on the number of anchor points.

Effect of different loss components.

To evaluate the influences of the introduced loss components (i.e., $\mathcal{L}_{Clip}$ , $\mathcal{L}_{Str}$ and $\mathcal{L}_{Dir}$ ) in our training loss, we report the metric scores of adopting different loss components in Table 3. The results show that using all losses yields the best overall score (i.e., average of $S_{t2i}$ and $S_{i2i}$ ), thereby verifying their contributions.

Effect of the number of region proposals.

In Table 4, we evaluate the performance of using different numbers of region proposals. The results indicate that it’s likely to achieve the performance bottleneck with a larger region proposal (e.g., 9 region proposals), and our method achieves reasonable quality (e.g., a superb balance between $S_{t2i}$ and $S_{i2i}$ ) even using 7 region proposals.

Effect of the number of anchor points.

To analyze the influence of the number of anchor points, we display the ablation study in Table 5. It can be observed that when the number of anchor points increases from 1 to 8, the text-to-image similarity score $S_{t2i}$ consistently increases. However, when the number is too large (e.g., 10), the $S_{t2i}$ decreases, it may be caused by the larger possibility of choosing noisy anchor points, which may be located in the background area, hurting the performance of editing.

C.3 Analysis

More failure cases.

In Figure 9, we display more failure examples generated by our method. We can see that the proposed method may generate unsatisfactory results when the anchor points are sampled from the background regions.

Appendix D Visualizations

We provide more visualization results of text-driven image editing generated by our method in Figure 12, 13 and 14. Thanks to the flexibility of the learned bounding box guidance, our method is capable of handling a wide range of prompts, it can generate satisfactory results while preserving the original appearance of the background.