
Text-Driven Image Editing via Learnable Regions

Yuanze Lin    Yi-Wen Chen    Yi-Hsuan Tsai    Lu Jiang    Ming-Hsuan Yang♣★
University of Oxford    UC Merced    Google
Abstract

Language has emerged as a natural interface for image editing. In this paper, we introduce a method for region-based image editing driven by textual prompts, without the need for user-provided masks or sketches. Specifically, our approach leverages an existing pre-trained text-to-image model and introduces a bounding box generator to identify the editing regions that are aligned with the textual prompts. We show that this simple approach enables flexible editing that is compatible with current image generation models, and is able to handle complex prompts featuring multiple objects, complex sentences, or lengthy paragraphs. We conduct an extensive user study to compare our method against state-of-the-art methods. The experiments demonstrate the competitive performance of our method in manipulating images with high fidelity and realism that correspond to the provided language descriptions. Our project webpage can be found at: https://yuanze-lin.me/LearnableRegions_page.

[Figure 1: input/edited image pairs for the prompts "donuts with ice cream", "a cute sloth holds a box", "a flowering cherry tree", "a blooming flower and dessert", "several lotus flowers growing in the water", and "a cup of coffee next to the bread".]

Figure 1: Overview. Given an input image and a language description for editing, our method can generate realistic and relevant images without the need for user-specified regions for editing. It performs local image editing while preserving the image context.

1 Introduction

[Figure 2: columns show Input Image, Region 1, Output 1, Region 2, Output 2, Region (Ours), and Output (Ours) for the editing texts "a bowl of strawberries" and "a cup of coffee next to the bread".]

Figure 2: Effects of variations in editing regions on generated image quality. Region 1 and Region 2 are two prior regions drawn from the self-attention map of DINO [7]. Region (ours), shown in the second-to-last column, represents the regions produced by our model which have the best overall quality.

With the availability of a massive amount of text-image paired data and large-scale vision-language models, recent text-driven image synthesis models [62, 32, 33, 34, 39, 53, 43, 42, 8, 57] have enabled people to create and manipulate specific visual contents of realistic images using natural language descriptions in an interactive fashion.

Recent text-driven image editing methods [39, 30, 1, 2, 54, 21, 38] have shown impressive capabilities in editing realistic images based on natural descriptions, with approaches typically falling into two paradigms: mask-based and mask-free methods. Mask-based editing approaches [39, 1] are intuitive for local image editing because they allow users to specify precisely which areas of an image to modify. However, these methods can be laborious, as they require users to manually create masks that are sometimes unnecessary, degrading the user experience in many applications.

In contrast, mask-free editing approaches [9, 21, 38, 54, 6] do not require masks and can directly modify the appearance or texture of the input image. These methods are trained to create fine-grained pixel masks, which can be applied either in RGB space or in the latent embedding space of the latent diffusion framework. While mask-free editing has advanced significantly, the editing precision of current methods relies heavily on the accuracy of detailed pixel-level masks, and these methods struggle with local modifications when the masks are less accurate.

Current mask-free image editing approaches have predominantly concentrated on pixel masks [10, 6]. The use of bounding boxes as an intermediate representation for editing images has not been thoroughly explored. Bounding boxes provide an intuitive and user-friendly input for image editing: they enable a smoother interactive process because a box is quicker and easier for users to adjust than a pixel mask, which typically requires more time and precision to draw accurately. Moreover, some generative transformer models such as Muse [9] may support only box-like masks rather than pixel-level masks for image editing.

This paper explores the feasibility of employing bounding boxes as an intermediate representation within a mask-free editing framework. Our objective is not to propose a new image editing model, but to introduce a component that enables an existing pretrained mask-based editing model to perform mask-free editing via the learnable regions. To this end, we propose a region-based editing network that is trained to generate editing regions utilizing a text-driven editing loss with CLIP guidance [42]. Our method can be integrated with different image editing models. To demonstrate its versatility, we apply it to two distinct image synthesis models: non-autoregressive transformers as used in MaskGIT [8] and Muse [9], as well as Stable Diffusion [45]. It is worth highlighting that the latent spaces in transformer models (MaskGIT and Muse) are only compatible with box-like masks and lack the precision for pixel-level masks in image editing.

Our experimental results demonstrate that the proposed method can generate realistic images that match the context of the provided language descriptions. Furthermore, we conduct a user study to validate that our method outperforms five state-of-the-art baseline methods. The results indicate that our method edits images with greater fidelity and realism, following the changes specified in the language descriptions. The contributions of this work are as follows:

  • Our approach enables mask-based text-to-image models to perform local image editing without needing masks or other user-provided guidance. It can be integrated with existing text-guided editing models to improve their quality and relevance.

  • We introduce a novel region generator model that employs a new CLIP-guidance loss to learn to find regions for image editing. We demonstrate its applicability by integrating it with two popular and distinct text-guided editing models, MaskGIT [8] and Stable Diffusion [45].

  • Experiments show the high quality and realism of our generated results. The user study further validates that our method outperforms state-of-the-art image editing baselines in producing favorable editing results.

Figure 3: Framework of the proposed method. We first feed the input image into the self-supervised learning (SSL) model, e.g., DINO [7], to obtain the attention map and feature, which are used for anchor initialization. The region generation model initializes several region proposals (e.g., 3 proposals in this figure) around each anchor point, and learns to select the most suitable ones among them with the region generation network (RGN). The predicted region and the text descriptions are then fed into a pre-trained text-to-image model for image editing. We utilize the CLIP model for learning the score to measure the similarity between the given text description and the edited result, forming a training signal to learn our region generation model.

2 Related Work

Text-to-Image Synthesis.

In recent years, significant progress has been made in text-to-image synthesis. While early contributions are mainly based on Generative Adversarial Network (GAN) approaches [59, 60, 56, 31], the latest models are mostly built on diffusion models [22, 52, 40, 45, 48, 44, 20, 61, 35] or transformer models [16, 43, 14, 58]. For example, DALL·E 2 [44] and Imagen [48] condition diffusion models on textual prompts, while Muse [9] leverages masked generative transformers to generate images from text. Other approaches [11, 12, 44] leverage pre-trained CLIP models [42] to guide image generation based on textual descriptions.

More recently, Stable Diffusion [45], trained on large image-text pairs [49], has been made publicly available and has served as the foundation for numerous image generation and manipulation works. ControlNet [61] proposes to control Stable Diffusion with spatially localized conditions for image synthesis. Different from these works, we aim to introduce a component that can enable pre-trained text-to-image models for mask-free local image editing.

Text-driven Image Manipulation.

Several recent works have utilized pre-trained generator models and CLIP [42] for text-driven image manipulation [3, 17, 36, 41, 30, 2]. StyleCLIP [41] combines the generative ability of StyleGAN [25] with CLIP to control latent codes, enabling a wide range of image manipulations. VQGAN-CLIP [12] uses CLIP [42] to guide VQ-GAN [16] for high-quality image generation and editing.

There are several approaches [39, 53, 1, 37, 27, 26, 18, 47] that use diffusion models for text-driven image manipulation. Imagic [26] generates textual embeddings that align with the input image and the editing prompt, and fine-tunes the diffusion model to perform edits. InstructPix2Pix [4] combines GPT-3 [5] and Stable Diffusion [45] to edit images with human instructions. Our work is related to the state-of-the-art methods DiffEdit [10] and MasaCtrl [6]. DiffEdit [10] leverages DDIM inversion [13, 52] with automatically produced masks for local image editing. MasaCtrl [6] proposes mutual self-attention and learns the editing masks from the cross-attention maps of the diffusion models. Motivated by the aforementioned works, we also utilize diffusion models and CLIP guidance for text-driven image manipulation. In contrast to DiffEdit [10] and MasaCtrl [6], whose editing is much more sensitive to the generated mask regions, our proposed method focuses on learning bounding boxes for local editing, which is more flexible in accommodating diverse text prompts.

3 Proposed Method

Text-driven image editing manipulates the visual content of input images to align with the contexts or modifications specified in the text. Our goal is to enable text-to-image models to perform mask-free local image editing. To this end, we propose a region generation network that can produce promising regions for image editing.

Figure 3 shows the overall pipeline of our proposed method for text-driven image editing.

3.1 Edit-Region Generation

Given the input image $X\in\mathbb{R}^{3\times H\times W}$ and text with $p$ words as $T\in\mathbb{Z}^{p}$, we first use a pre-trained visual transformer model, ViT-B/16 [15], for feature extraction. This model is pre-trained using the DINO self-supervised learning objective [7]. The feature $F\in\mathbb{R}^{d\times h\times w}$ from the last layer has been shown to contain semantic segmentation of objects [7, 51], which can serve as a prior in our problem.

Then we initialize $K$ anchor points $\{C_{i}\}_{i=1}^{K}$ located at the top-$K$ scoring patches of the self-attention map from the [CLS] token query of the DINO pre-trained transformer, as shown in [7], where the [CLS] token carries guidance to locate the semantically informative parts of the objects.
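The following is a minimal sketch of this anchor initialization, assuming the public DINO ViT-B/16 checkpoint from torch.hub and its `get_last_selfattention` helper; the exact preprocessing and model loading may differ from our implementation.

```python
# Sketch: pick top-K anchor points from the DINO [CLS] self-attention map.
import torch

model = torch.hub.load("facebookresearch/dino:main", "dino_vitb16").eval()

@torch.no_grad()
def init_anchors(image, K=8, patch_size=16):
    """image: (1, 3, H, W) tensor normalized with ImageNet statistics."""
    h, w = image.shape[-2] // patch_size, image.shape[-1] // patch_size
    # attention of the last block: (1, num_heads, 1 + h*w, 1 + h*w)
    attn = model.get_last_selfattention(image)
    # [CLS] query attending to the patch tokens, averaged over heads
    cls_attn = attn[0, :, 0, 1:].mean(dim=0)                 # (h*w,)
    top_idx = cls_attn.topk(K).indices                       # top-K scoring patches
    rows = torch.div(top_idx, w, rounding_mode="floor")
    cols = top_idx % w
    return torch.stack([rows, cols], dim=-1)                 # (K, 2) anchors in patch units
```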

Following this, we define a set of bounding box proposals $\mathcal{B}_{i}=\{B_{j}\}_{j=1}^{M}$ for each anchor point $C_{i}$, where each bounding box is centered at its corresponding anchor point. For simplicity, we parameterize the bounding box with a single parameter such that each $B_{j}$ is a square box of size $j\times j$.

Subsequently, we train a region generation network to explicitly consider all unique bounding boxes derived from the same anchor point. For a given anchor point $C_{i}$, we then have:

$$f_{j} = \text{ROI-pool}(F, B_{j}), \qquad (1)$$
$$S([f_{1},\cdots,f_{M}]) = [\pi_{1},\cdots,\pi_{M}], \qquad (2)$$

where $[\cdot]$ concatenates features along the channel dimension, and the ROI-pool operation [19] pools the feature $F\in\mathbb{R}^{d\times h\times w}$ with respect to the box $B_{j}$, resulting in a feature tensor $f_{j}\in\mathbb{R}^{d\times l\times l}$. In our experiments, we set $l$ to 7. $S$ is the proposed region generation network, consisting of two convolutional layers and two linear layers with a ReLU activation between consecutive layers. The output of the final linear layer, denoted as $\pi_{j}$ in Eq. 2, is the logit for the bounding box of size $j$; the logits are fed into a softmax function to predict the scores over the bounding box proposals, i.e., $\text{Softmax}([\pi_{1},\cdots,\pi_{M}])$.
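A minimal sketch of the region generation network $S$ is given below, assuming PyTorch and torchvision's `roi_pool`; the hidden widths and kernel sizes are illustrative choices rather than the exact architecture.

```python
# Sketch of S (Eqs. 1-2): ROI-pool the DINO feature map over the M candidate boxes
# of one anchor, concatenate along channels, and predict one logit per box size.
import torch
import torch.nn as nn
from torchvision.ops import roi_pool

class RegionGenerationNetwork(nn.Module):
    def __init__(self, d, M, l=7, hidden=256):
        super().__init__()
        self.M, self.l = M, l
        self.convs = nn.Sequential(
            nn.Conv2d(M * d, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.ReLU(),
        )
        self.fc = nn.Sequential(
            nn.Linear(hidden * l * l, hidden), nn.ReLU(),
            nn.Linear(hidden, M),                      # logits [pi_1, ..., pi_M]
        )

    def forward(self, feat, boxes):
        """feat: (1, d, h, w) DINO feature map; boxes: (M, 4) float xyxy in feature coords."""
        rois = torch.cat([boxes.new_zeros(len(boxes), 1), boxes], dim=1)  # prepend batch index
        f = roi_pool(feat, rois, output_size=self.l)   # (M, d, l, l)
        f = f.flatten(0, 1).unsqueeze(0)               # channel concat: (1, M*d, l, l)
        x = self.convs(f).flatten(1)
        return self.fc(x).squeeze(0)                   # (M,) logits
```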

To learn the parameters of the region generation network, we use the Gumbel-Softmax trick [24]. We re-parameterize $\pi_{j}$ by adding a small Gumbel noise $g_{j}=-\log(-\log(u_{j}))$, where $u_{j}\sim\text{Uniform}(0,1)$. During training, we apply straight-through gradient estimation: backward propagation uses the differentiable variable (i.e., the softmax), while the forward pass still takes the argmax, treating $\pi$ as a categorical variable. For each anchor point, once we obtain the editing region with the highest softmax score, we first generate a corresponding box-like mask image, and then feed the mask image, input image, and editing prompt into the text-to-image model to obtain the edited image. Thus, we obtain $K$ edited images, one for each anchor point; in Section 3.3, we explain how to produce the final edited image as the inference output.
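The selection step can be sketched with PyTorch's built-in straight-through Gumbel-Softmax (`F.gumbel_softmax` with `hard=True`); the mask rasterization below is a simplified illustration rather than our exact implementation.

```python
# Sketch: straight-through Gumbel-Softmax box selection, then rasterize the
# chosen square box into a box-like mask for the editing model.
import torch
import torch.nn.functional as F

def select_box_and_mask(logits, boxes, H, W, tau=1.0):
    """logits: (M,) from the region generation network; boxes: (M, 4) xyxy in pixels."""
    one_hot = F.gumbel_softmax(logits, tau=tau, hard=True)   # argmax forward, softmax gradient backward
    box = (one_hot.unsqueeze(1) * boxes).sum(dim=0)          # selected box via the one-hot weights
    x1, y1, x2, y2 = [int(v) for v in box.tolist()]
    mask = torch.zeros(1, 1, H, W)
    mask[..., y1:y2, x1:x2] = 1.0                            # box-like editing mask
    return mask, one_hot
```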

[Figure 4: input/edited image pairs for the prompts "a cup whose logo is named as “coffee”", "a steam train running on the sea", "many blooming jasmine flowers in the blanket", "a plate of beefsteak and vegetable", "a piece of Oreo cookie and bread", "a bottle of wine and several wine cups", "a wooden bridge in front of the mountain", "a huge castle in the back of the person", "a wooden cabinet on top of the table", and three paragraph-length prompts: "A cartoon panda is preparing food. It wears cloth which has blue and white colors and there are several plates of food on the table"; "A little horse is jumping from the left side to the right side. It jumps fast since its jumping stride is large, and it has red skin"; "The cartoon character is smiling. It looks funny. The shape of its face is square, and its eyes and mouth are very large".]
Figure 4: Image editing results with simple and complex prompts. Given the input images and prompts, our method edits the image without requiring masks from the users. The learned region is omitted for better visualization. The 1st row contains diverse prompts for one kind of object. The 2nd row displays prompts featuring multiple objects. The 3rd row shows prompts with geometric relations, and the last row presents prompts with extended length.

3.2 Training Objectives

As the CLIP model [42] can estimate the similarity between images and texts, we employ it to guide our image editing based on user-specified prompts.

To train our models, we propose a composite editing loss that consists of three components: 1) the CLIP guidance loss $\mathcal{L}_{\text{Clip}}$, the cosine distance between features of the generated image and the text, extracted from the last layers of CLIP's encoders; 2) the directional loss $\mathcal{L}_{\text{Dir}}$ [41], which controls the direction of the applied edit within the CLIP space [17, 41]; and 3) the structural loss $\mathcal{L}_{\text{Str}}$, which takes into account the self-similarity [29, 50] of features between the source and generated images, facilitating edits in texture and appearance while preserving the original spatial layout of objects in the source image. The total loss $\mathcal{L}$ and each loss term are:

$$\mathcal{L}=\lambda_{C}\mathcal{L}_{\text{Clip}}+\lambda_{S}\mathcal{L}_{\text{Str}}+\lambda_{D}\mathcal{L}_{\text{Dir}}, \qquad (3)$$
$$\mathcal{L}_{\text{Clip}}=\mathcal{D}_{\text{cos}}(E_{\text{v}}(X_{o}),E_{\text{t}}(T)), \qquad (4)$$
$$\mathcal{L}_{\text{Str}}=\|Q(f_{X_{o}})-Q(f_{X})\|_{2}, \qquad (5)$$
$$\mathcal{L}_{\text{Dir}}=\mathcal{D}_{\text{cos}}(E_{\text{v}}(X_{o})-E_{\text{v}}(X),E_{\text{t}}(T)-E_{\text{t}}(T_{\text{ROI}})), \qquad (6)$$

where $E_{\text{v}}$ and $E_{\text{t}}$ are the visual and textual encoders of the CLIP model. We empirically set the weights $\lambda_{C}=1$, $\lambda_{D}=1$, and $\lambda_{S}=1$ for our composite editing loss. Here, $X$, $T$, and $X_{o}$ denote the input image, the text prompt, and the image edited with the proposed region, respectively. $f_{X_{o}}$ and $f_{X}$ denote the visual features of $X_{o}$ and $X$ from the last layer of CLIP's visual encoder, while $Q(f_{X_{o}})$ and $Q(f_{X})$ denote the self-similarity matrices of $f_{X_{o}}$ and $f_{X}$, respectively. For simplicity, we use the cosine distance $\mathcal{D}_{\text{cos}}$ to measure the similarity between images and texts. Note that $T_{\text{ROI}}$ denotes a given region-of-interest of the source image for editing (e.g., in Figure 5, when $T$ is “a big tree with many flowers in the center”, $T_{\text{ROI}}$ could be “tree”).

During training, our loss functions encourage the region generator to produce appropriate regions for editing by taking into account the similarity between the edited images and the given text descriptions.
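Below is a minimal sketch of this composite loss, assuming the OpenAI CLIP package; the token-level visual features used by the structural loss are passed in as arguments, since extracting them from CLIP's visual transformer (e.g., via a forward hook) is omitted for brevity.

```python
# Sketch of the composite editing loss (Eqs. 3-6) on top of CLIP ViT-B/16.
import clip
import torch
import torch.nn.functional as F

model, preprocess = clip.load("ViT-B/16")

def cos_dist(a, b):
    # cosine distance between (batched) CLIP embeddings
    return 1.0 - F.cosine_similarity(a, b, dim=-1).mean()

def self_similarity(tokens):
    # tokens: (N, d) token-level visual features -> (N, N) similarity matrix Q
    t = F.normalize(tokens, dim=-1)
    return t @ t.t()

def editing_loss(x, x_o, text, text_roi, tok_x, tok_xo,
                 lam_c=1.0, lam_s=1.0, lam_d=1.0):
    """x, x_o: preprocessed source/edited images (1, 3, 224, 224);
    tok_x, tok_xo: their token-level CLIP visual features (N, d)."""
    ev_x, ev_xo = model.encode_image(x), model.encode_image(x_o)
    et_t = model.encode_text(clip.tokenize([text]))
    et_roi = model.encode_text(clip.tokenize([text_roi]))
    l_clip = cos_dist(ev_xo, et_t)                                      # Eq. (4)
    l_str = (self_similarity(tok_xo) - self_similarity(tok_x)).norm()   # Eq. (5)
    l_dir = cos_dist(ev_xo - ev_x, et_t - et_roi)                       # Eq. (6)
    return lam_c * l_clip + lam_s * l_str + lam_d * l_dir               # Eq. (3)
```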

3.3 Inference

During the inference process, we define a quality score to rank the edited images generated from different anchor points and select the image with the highest score for presentation to the user.

While there exist more advanced methods, we use a simple weighted average to compute the quality score:

$$S=\alpha\cdot S_{\text{t2i}}+\beta\cdot S_{\text{i2i}}, \qquad (7)$$

where $S_{\text{t2i}}$ estimates the cosine similarity between the given text description and the edited image, $S_{\text{i2i}}$ measures the cosine similarity between the source image and the edited image, and $\alpha$ and $\beta$ are coefficients that control the influences of $S_{\text{t2i}}$ and $S_{\text{i2i}}$. We adopt the features extracted from the last layer of CLIP's encoders for the similarity calculation.

In our experiments, we set $\alpha$ and $\beta$ to 2 and 1, respectively, since a higher value for $\alpha$ encourages the model to place more weight on the faithfulness of text-conditioned image editing. The edited image with the highest quality score $S$ is chosen as the final edited image.
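A minimal sketch of this ranking step, assuming CLIP ViT-B/16 features, is shown below; preprocessing and batching details are omitted.

```python
# Sketch of the inference-time quality score (Eq. 7) used to rank the K edited images.
import clip
import torch
import torch.nn.functional as F

model, preprocess = clip.load("ViT-B/16")

@torch.no_grad()
def quality_score(edited, source, text, alpha=2.0, beta=1.0):
    """edited, source: preprocessed (1, 3, 224, 224) images; text: editing prompt."""
    f_edit = model.encode_image(edited)
    f_src = model.encode_image(source)
    f_txt = model.encode_text(clip.tokenize([text]))
    s_t2i = F.cosine_similarity(f_edit, f_txt, dim=-1)   # text-to-image similarity
    s_i2i = F.cosine_similarity(f_edit, f_src, dim=-1)   # image-to-image similarity
    return (alpha * s_t2i + beta * s_i2i).item()

# usage: best = max(candidates, key=lambda img: quality_score(img, source, prompt))
```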

3.4 Compatibility with Pretrained Editing Models

Our proposed region generator can be integrated with various image editing models [8, 45, 1, 39] to modify the content of a source image conditioned on the prompt. To demonstrate its versatility, we apply it to two distinct image synthesis models: non-autoregressive transformers as used in MaskGIT [8] or Muse [9], and diffusion U-Nets [46] as used in Stable Diffusion [45].

The transformer and diffusion models represent distinct base editing models to verify the applicability of the proposed method. It is worth noting that MaskGIT and Muse are transformers that operate over discrete tokens created by a VQ autoencoder [55], unlike diffusion models [22, 52, 45] operating within the continuous space. As a result, the latent spaces in MaskGIT and Muse are only compatible with box-like masks and lack the precision for pixel-level masks in image editing.
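To illustrate why box-like regions fit token-based transformers, the sketch below maps a pixel-space box onto a discrete token grid under an assumed 16x downsampling factor; the grid size and rounding conventions are illustrative.

```python
# Sketch: a pixel-space box maps cleanly onto a VQ token grid, whereas an
# arbitrary pixel mask cannot be represented exactly at token resolution.
import torch

def box_to_token_mask(box, H, W, downsample=16):
    """box: (x1, y1, x2, y2) in pixels; returns a boolean mask over the token grid."""
    x1, y1, x2, y2 = box
    th, tw = H // downsample, W // downsample
    mask = torch.zeros(th, tw, dtype=torch.bool)
    # floor the near edge, ceil the far edge so the box fully covers its tokens
    mask[y1 // downsample: -(-y2 // downsample),
         x1 // downsample: -(-x2 // downsample)] = True
    return mask  # True tokens are re-generated; the rest are kept fixed
```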

For our experiments, we use the official MaskGIT model instead of the Muse model [9], which is not publicly available. We also limit the text prompt to the class vocabulary that the model is trained on.

[Figure 5: qualitative comparison. Columns: Input Image, Plug-and-Play, InstructPix2Pix, Null-text, DiffEdit, MasaCtrl, Ours. Editing texts: "a dish of pancake", "a big tree with many flowers in the center", "several apples and pears", "a cup of coffee next to the bread", "some strawberries and other fruit", and "a blooming flower and dessert".]

Figure 5: Comparison with existing methods. We compare our method with existing text-driven image editing methods. From left to right: Input image, Plug-and-Play [54], InstructPix2Pix [4], Null-text [38], DiffEdit [10], MasaCtrl [6], and ours.
[Figure 6: input/edited image pairs for the editing texts "a lion", "a rooster", and "cat face".]

Figure 6: Generated results using MaskGIT [8] as the image synthesis model. Aside from Stable Diffusion, our method can also generate reasonable editing results with the non-autoregressive transformer-based MaskGIT.

4 Experimental Results

Implementation Details.

In our evaluation, we collect high-resolution and free-to-use images covering a variety of objects from Unsplash (https://unsplash.com/). For edit-region generation, the total number of bounding box proposals (i.e., $M$) is 7, and the CLIP guidance model is initialized with ViT-B/16 weights. We do not use super-resolution models to enhance the quality of the resultant images. By default, we adopt the pre-trained Stable Diffusion v1-2 as our editing model. Our main experiments are conducted on two A5000 GPUs, where we train the model for 5 epochs using the Adam optimizer [28] with an initial learning rate of 0.003.

4.1 Qualitative Evaluation

We assess the performance of our proposed method on a diverse set of high-quality images featuring various objects. Figure 4 shows that our approach takes an image and a language description to perform mask-free edits. We display complex text prompts that feature one category of object (the 1st row), multiple objects (the 2nd row), geometric relations (the 3rd row), and long paragraphs (the 4th row).

4.2 Comparisons with Prior Work

We compare our method with five state-of-the-art text-driven image editing approaches. Plug-and-Play [54] preserves the semantic layout of the source image by injecting features from the source image into the generation process of the target image. InstructPix2Pix [4] first utilizes GPT-3 and Stable Diffusion to produce paired training data for image editing, and then trains a conditional diffusion model with classifier-free guidance on both the input image and the instruction. Null-text Inversion [38] enables text-based image editing with Stable Diffusion, using an initial DDIM inversion [52, 13] as a pivot for optimization and tuning only the null-text embedding used in classifier-free guidance. DiffEdit [10] automatically generates masks for the regions that require editing by contrasting predictions of the diffusion model conditioned on different text prompts. MasaCtrl [6] performs text-based non-rigid image editing by converting self-attention in diffusion models into mutual self-attention, and extracts masks from the cross-attention maps as the editing regions.

In all experiments, we report the results of the compared methods using their official code, except for DiffEdit, for which no official code is available; we instead use the implementation at https://github.com/Xiang-cd/DiffEdit-stable-diffusion. Figure 5 displays the editing results of the existing methods. We make the following observations: InstructPix2Pix inevitably leads to undesired changes in the global appearance (e.g., the background); DiffEdit and MasaCtrl yield unsatisfactory results on more complex prompts containing multiple objects; and the other methods generate results that are less realistic (e.g., the coffee in the 4th row) or that do not correspond to the text prompt (e.g., only one apple and one pear in the 3rd row).

Compared Methods Preference for Ours
vs. Plug-and-Play [54] 80.5% ± 1.9%
vs. InstructPix2Pix [4] 73.2% ± 2.2%
vs. Null-text [38] 88.2% ± 1.6%
vs. DiffEdit [10] 91.9% ± 1.3%
vs. MasaCtrl [6] 90.8% ± 1.4%
Average 84.9%
Table 1: User studies. We show the percentage (mean, std) of user preference for our approach over compared methods.
[Figure 7: input/result pairs for the editing texts "a large flying fish" and "a cat wearing sunglasses".]

Figure 7: Failure cases. We show two failure cases generated by our method.

4.3 User Study

To evaluate the quality of the edited images, we conduct a user study using 60 input images and text prompts. We employ paired comparisons to measure user preference. In each test, we show an input image, a text prompt, and two edited images generated by our method and one of the compared approaches. We ask the subject to choose the one that performs better in coherence with the text prompt while maintaining fidelity to the input image.

There are 203 participants in this study, where each participant evaluates 40 pairs of images. The image set and compared method for each image are randomly selected for each user. The order in each comparison pair is shuffled when presenting to each user. All the methods are compared for the same number of times.

Table 1 shows the user study results. The proposed method performs favorably against all five compared approaches. On average, our method is preferred in 84.9% of all the comparisons, which demonstrates the effectiveness of the proposed method.

Failure cases.

We present some failure cases of our approach and analyze their causes. As shown in Figure 7, failures can be caused by improper anchor point initialization, especially when the anchor points fall into the background area.

4.4 Ablation Study

Compatibility with image synthesis models.

To demonstrate the generalizability of the proposed method, we conduct experiments using MaskGIT [8], a distinct image generative transformer. As shown in Figure 6, we can generate results that adhere to the text prompt while preserving the background content. Note that the latent spaces within MaskGIT exclusively accommodate box-like masks, lacking the requisite precision for manipulating pixel-level masks in the context of image editing.

Effect of different loss components.

To evaluate the influence of different components of our training loss, Figure 8 shows results generated without the directional loss $\mathcal{L}_{\text{Dir}}$ that controls the directional edit, or without the structural loss $\mathcal{L}_{\text{Str}}$ that focuses on preserving the appearance of the source image. We observe that the result without $\mathcal{L}_{\text{Dir}}$ does not fully match the context of the text prompt, and the result without $\mathcal{L}_{\text{Str}}$ fails to preserve the posture and shape of the object in the source image. In contrast, our method using all loss components generates results that adhere to the text prompt while preserving the concept in the source image.

Effect of region generation methods.

In Table 2, we present user study results comparing our method with two other baselines for bounding box generation, sketched in code after this paragraph. (1) Random-anchor-random-size: the editing regions are bounding boxes centered at anchor points uniformly sampled over the whole image, with height and width uniformly sampled from $[0,H]$ and $[0,W]$, where $H$ and $W$ are the height and width of the image. We clamp regions exceeding the image boundary. (2) DINO-anchor-random-size: the editing regions are bounding boxes centered at anchor points selected from the DINO self-attention map, identical to those generated by our method, but with height and width uniformly sampled from $[0,H]$ and $[0,W]$. For both baselines, we use the same number of anchor points as our method and select the image with the highest quality score $S$ to present to the user.
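For reference, a minimal sketch of the two baseline samplers under the sampling scheme described above is given below; variable names and the clamping convention are illustrative.

```python
# Sketch of the two region-generation baselines used in Table 2.
import torch

def random_size_box(cy, cx, H, W):
    # height/width uniform in [0, H] and [0, W]; the box is clamped to the image
    h = torch.randint(0, H + 1, (1,)).item()
    w = torch.randint(0, W + 1, (1,)).item()
    y1, y2 = max(0, cy - h // 2), min(H, cy + h // 2)
    x1, x2 = max(0, cx - w // 2), min(W, cx + w // 2)
    return x1, y1, x2, y2

def random_anchor_random_size(H, W):
    # baseline (1): anchor uniform over the whole image
    cy = torch.randint(0, H, (1,)).item()
    cx = torch.randint(0, W, (1,)).item()
    return random_size_box(cy, cx, H, W)

def dino_anchor_random_size(anchor, H, W):
    # baseline (2): anchor (row, col) taken from the DINO self-attention map
    return random_size_box(anchor[0], anchor[1], H, W)
```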

The results show that our method is preferred in 83.9% of comparisons against the Random-anchor baseline. Even when compared to the stronger baseline, where the anchor points are selected from the DINO self-attention map but the bounding box size is chosen randomly, the proposed method is still preferred in 71.0% of all comparisons. These results validate the effectiveness of our model in generating meaningful editing regions.

[Figure 8: columns show the input image, the result without $\mathcal{L}_{\text{Dir}}$, the result without $\mathcal{L}_{\text{Str}}$, and ours, for the editing text "a high quality photo of a lovely dog".]

Figure 8: Effect of different loss components. The 2nd and 3rd columns present results without $\mathcal{L}_{\text{Dir}}$ and $\mathcal{L}_{\text{Str}}$, respectively. The last column is generated by the model using all loss components.
Compared Methods Preference for Ours
vs. Random-anchor-random-size 83.9% ± 2.6%
vs. DINO-anchor-random-size 71.0% ± 3.2%
Table 2: Ablation study of region generation methods. We show the percentage (mean, std) of user preference for our approach over two compared baselines.

4.5 Limitation

We observe two limitations of our method. First, the performance is affected by the choice of the self-supervised model, particularly regarding anchor initialization. Second, since no user-specified region guidance is provided, the predicted region may include background areas, resulting in unintentional modifications in certain image contents. To address this, we plan to model the mask using more fine-grained representations (e.g., patches).

5 Conclusion

In this paper, we propose a method for editing given images based on freely provided language descriptions, including paragraphs, without the need for user-specified edit regions. We introduce a region generation network and incorporate text-driven editing training losses to generate high-quality and realistic images. The proposed method seamlessly integrates with various image synthesis models. Experiments including user studies are conducted, demonstrating the competitive performance of our proposed method.

References

  • Avrahami et al. [2022] Omri Avrahami, Dani Lischinski, and Ohad Fried. Blended diffusion for text-driven editing of natural images. In CVPR, 2022.
  • Bar-Tal et al. [2022] Omer Bar-Tal, Dolev Ofri-Amar, Rafail Fridman, Yoni Kasten, and Tali Dekel. Text2LIVE: Text-driven layered image and video editing. In ECCV, 2022.
  • Bau et al. [2021] David Bau, Alex Andonian, Audrey Cui, YeonHwan Park, Ali Jahanian, Aude Oliva, and Antonio Torralba. Paint by word. arXiv:2103.10951, 2021.
  • Brooks et al. [2023] Tim Brooks, Aleksander Holynski, and Alexei A Efros. Instructpix2pix: Learning to follow image editing instructions. In CVPR, 2023.
  • Brown et al. [2020] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. In NeurIPS, 2020.
  • Cao et al. [2023] Mingdeng Cao, Xintao Wang, Zhongang Qi, Ying Shan, Xiaohu Qie, and Yinqiang Zheng. Masactrl: Tuning-free mutual self-attention control for consistent image synthesis and editing. In ICCV, 2023.
  • Caron et al. [2021] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In ICCV, 2021.
  • Chang et al. [2022] Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, and William T Freeman. MaskGIT: Masked generative image transformer. In CVPR, 2022.
  • Chang et al. [2023] Huiwen Chang, Han Zhang, Jarred Barber, AJ Maschinot, Jose Lezama, Lu Jiang, Ming-Hsuan Yang, Kevin Murphy, William T Freeman, Michael Rubinstein, et al. Muse: Text-to-image generation via masked generative transformers. arXiv:2301.00704, 2023.
  • Couairon et al. [2023] Guillaume Couairon, Jakob Verbeek, Holger Schwenk, and Matthieu Cord. Diffedit: Diffusion-based semantic image editing with mask guidance. In ICLR, 2023.
  • [11] Katherine Crowson. CLIP guided diffusion HQ 256x256. https://colab.research.google.com/drive/12a_Wrfi2_gwwAuN3VvMTwVMz9TfqctNj.
  • Crowson et al. [2022] Katherine Crowson, Stella Biderman, Daniel Kornis, Dashiell Stander, Eric Hallahan, Louis Castricato, and Edward Raff. VQGAN-CLIP: Open domain image generation and editing with natural language guidance. In ECCV, 2022.
  • Dhariwal and Nichol [2021] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat GANs on image synthesis. In NeurIPS, 2021.
  • Ding et al. [2022] Ming Ding, Wendi Zheng, Wenyi Hong, and Jie Tang. CogView2: Faster and better text-to-image generation via hierarchical transformers. In NeurIPS, 2022.
  • Dosovitskiy et al. [2020] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv:2010.11929, 2020.
  • Esser et al. [2021] Patrick Esser, Robin Rombach, and Björn Ommer. Taming transformers for high-resolution image synthesis. In CVPR, 2021.
  • Gal et al. [2022] Rinon Gal, Or Patashnik, Haggai Maron, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. StyleGAN-NADA: CLIP-guided domain adaptation of image generators. ACM Transactions on Graphics (TOG), 41(4):1–13, 2022.
  • Gal et al. [2023] Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H. Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion. In ICLR, 2023.
  • Girshick [2015] Ross Girshick. Fast R-CNN. In ICCV, 2015.
  • Gu et al. [2022] Shuyang Gu, Dong Chen, Jianmin Bao, Fang Wen, Bo Zhang, Dongdong Chen, Lu Yuan, and Baining Guo. Vector quantized diffusion model for text-to-image synthesis. In CVPR, 2022.
  • Hertz et al. [2023] Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image editing with cross-attention control. In ICLR, 2023.
  • Ho et al. [2020] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In NeurIPS, 2020.
  • Ho et al. [2022] Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. Imagen video: High definition video generation with diffusion models. arXiv:2210.02303, 2022.
  • Jang et al. [2016] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with gumbel-softmax. arXiv:1611.01144, 2016.
  • Karras et al. [2020] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of StyleGAN. In CVPR, 2020.
  • Kawar et al. [2022] Bahjat Kawar, Shiran Zada, Oran Lang, Omer Tov, Huiwen Chang, Tali Dekel, Inbar Mosseri, and Michal Irani. Imagic: Text-based real image editing with diffusion models. arXiv:2210.09276, 2022.
  • Kim et al. [2022] Gwanghyun Kim, Taesung Kwon, and Jong Chul Ye. DiffusionCLIP: Text-guided diffusion models for robust image manipulation. In CVPR, 2022.
  • Kingma and Ba [2014] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv:1412.6980, 2014.
  • Kolkin et al. [2019] Nicholas Kolkin, Jason Salavon, and Gregory Shakhnarovich. Style transfer by relaxed optimal transport and self-similarity. In CVPR, 2019.
  • Kwon and Ye [2022] Gihyun Kwon and Jong Chul Ye. CLIPstyler: Image style transfer with a single text condition. In CVPR, 2022.
  • Li et al. [2020] Bowen Li, Xiaojuan Qi, Thomas Lukasiewicz, and Philip HS Torr. ManiGAN: Text-guided image manipulation. In CVPR, 2020.
  • Lin et al. [2021] Yuanze Lin, Xun Guo, and Yan Lu. Self-supervised video representation learning with meta-contrastive network. In ICCV, 2021.
  • Lin et al. [2022] Yuanze Lin, Yujia Xie, Dongdong Chen, Yichong Xu, Chenguang Zhu, and Lu Yuan. Revive: Regional visual representation matters in knowledge-based visual question answering. In NeurIPS, 2022.
  • Lin et al. [2023] Yuanze Lin, Chen Wei, Huiyu Wang, Alan Yuille, and Cihang Xie. Smaug: Sparse masked autoencoder for efficient video-language pre-training. In ICCV, 2023.
  • Lin et al. [2024] Yuanze Lin, Ronald Clark, and Philip Torr. Dreampolisher: Towards high-quality text-to-3d generation via geometric diffusion. arXiv:2403.17237, 2024.
  • Liu et al. [2021] Xingchao Liu, Chengyue Gong, Lemeng Wu, Shujian Zhang, Hao Su, and Qiang Liu. FuseDream: Training-free text-to-image generation with improved CLIP+GAN space optimization. arXiv:2112.01573, 2021.
  • Meng et al. [2021] Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. SDEdit: Guided image synthesis and editing with stochastic differential equations. In ICLR, 2021.
  • Mokady et al. [2023] Ron Mokady, Amir Hertz, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Null-text inversion for editing real images using guided diffusion models. In CVPR, 2023.
  • Nichol et al. [2021] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv:2112.10741, 2021.
  • Nichol and Dhariwal [2021] Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In ICML, 2021.
  • Patashnik et al. [2021] Or Patashnik, Zongze Wu, Eli Shechtman, Daniel Cohen-Or, and Dani Lischinski. StyleCLIP: Text-driven manipulation of StyleGAN imagery. In ICCV, 2021.
  • Radford et al. [2021] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, 2021.
  • Ramesh et al. [2021] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In ICML, 2021.
  • Ramesh et al. [2022] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with CLIP latents. arXiv:2204.06125, 2022.
  • Rombach et al. [2022] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, 2022.
  • Ronneberger et al. [2015] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In MICCAI, 2015.
  • Ruiz et al. [2022] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation. arXiv:2208.12242, 2022.
  • Saharia et al. [2022] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L. Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, Seyedeh Sara Mahdavi, Raphael Gontijo Lopes, Tim Salimans, Jonathan Ho, David J. Fleet, and Mohammad Norouzi. Photorealistic text-to-image diffusion models with deep language understanding. In NeurIPS, 2022.
  • Schuhmann et al. [2021] Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. LAION-400M: Open dataset of CLIP-filtered 400 million image-text pairs. arXiv:2111.02114, 2021.
  • Shechtman and Irani [2007] Eli Shechtman and Michal Irani. Matching local self-similarities across images and videos. In CVPR, 2007.
  • Siméoni et al. [2021] Oriane Siméoni, Gilles Puy, Huy V Vo, Simon Roburin, Spyros Gidaris, Andrei Bursuc, Patrick Pérez, Renaud Marlet, and Jean Ponce. Localizing objects with self-supervised transformers and no labels. arXiv:2109.14279, 2021.
  • Song et al. [2021] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In ICLR, 2021.
  • Su et al. [2022] Xuan Su, Jiaming Song, Chenlin Meng, and Stefano Ermon. Dual diffusion implicit bridges for image-to-image translation. In ICLR, 2022.
  • Tumanyan et al. [2023] Narek Tumanyan, Michal Geyer, Shai Bagon, and Tali Dekel. Plug-and-play diffusion features for text-driven image-to-image translation. In CVPR, 2023.
  • Van Den Oord et al. [2017] Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning. In NeurIPS, 2017.
  • Xu et al. [2018] Tao Xu, Pengchuan Zhang, Qiuyuan Huang, Han Zhang, Zhe Gan, Xiaolei Huang, and Xiaodong He. AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks. In CVPR, 2018.
  • Ye et al. [2022] Shuquan Ye, Yujia Xie, Dongdong Chen, Yichong Xu, Lu Yuan, Chenguang Zhu, and Jing Liao. Improving commonsense in vision-language models via knowledge graph riddles. arXiv:2211.16504, 2022.
  • Yu et al. [2022] Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, Ben Hutchinson, Wei Han, Zarana Parekh, Xin Li, Han Zhang, Jason Baldridge, and Yonghui Wu. Scaling autoregressive models for content-rich text-to-image generation. arXiv:2206.10789, 2022.
  • Zhang et al. [2017] Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, and Dimitris N Metaxas. StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks. In ICCV, 2017.
  • Zhang et al. [2018] Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, and Dimitris N Metaxas. StackGAN++: Realistic image synthesis with stacked generative adversarial networks. IEEE TPAMI, 41(8):1947–1962, 2018.
  • Zhang et al. [2023] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In ICCV, 2023.
  • Zhang et al. [2021] Tianhao Zhang, Hung-Yu Tseng, Lu Jiang, Weilong Yang, Honglak Lee, and Irfan Essa. Text as neural operator: Image manipulation by text instruction. In ACM MM, 2021.

Appendix A Overview

This supplementary material includes the following additional content:

  • Further implementation details in Section B.

  • Additional experimental results in Section C.

  • Additional visualization results in Section D, as shown in Figures 12, 13, and 14.

Appendix B Implementation Details

In the experiments of the main paper, we select 8 anchor points with the top-8 largest attention values. The training time for each sample is approximately 4 minutes with a batch size of 1 on two A5000 GPUs. For the diffusion model, we set the guidance scale and strength to 7.5 and 0.75, respectively.
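For illustration, the snippet below shows how such settings could be passed to a diffusers-style Stable Diffusion inpainting pipeline; the checkpoint name, file paths, and parameter names are assumptions about that library rather than our released code, and exact arguments may vary across diffusers versions.

```python
# Illustrative configuration only: guidance scale 7.5 and strength 0.75 as above.
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

# checkpoint name is an assumption; any SD inpainting checkpoint would serve here
pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")

input_image = Image.open("input.png").convert("RGB")   # source image
box_mask = Image.open("mask.png").convert("L")         # box-like mask from the region generator

edited = pipe(
    prompt="a cup of coffee next to the bread",
    image=input_image,
    mask_image=box_mask,
    guidance_scale=7.5,   # classifier-free guidance scale
    strength=0.75,        # editing strength
).images[0]
```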

Appendix C Additional Experimental Results

C.1 Comparison with existing methods

In Figure 10 and 11 of this Appendix, we present additional results comparing our proposed method with state-of-the-art text-driven image editing baseline approaches, including Plug-and-Play [54], InstructPix2Pix [4], Null-text inversion [38], DiffEdit [10] and MasaCtrl [6].

We observe that the comparison results are consistent with the user studies reported in the main paper: our method generates images of higher quality that adhere to the text prompts while preserving the background content. The baseline methods appear to fall short in terms of generation quality and controllability by the prompt. For example, Plug-and-Play and InstructPix2Pix tend to modify the style of the entire image. The Null-text inversion method appears to have difficulty generating objects that align with the text prompts. Meanwhile, techniques like DiffEdit and MasaCtrl yield results that are of lower quality or poorer alignment with the text prompt.

C.2 Additional Ablation Study

In this section, we present three additional ablation studies using quantitative metrics to complement the user-rating study reported in the main paper. Two metric scores are used: (1) the CLIP [42] text-to-image similarity score $S_{\text{t2i}}$, which evaluates the cosine similarity between the given prompt and the edited image; (2) the CLIP image-to-image similarity score $S_{\text{i2i}}$, which measures the cosine similarity between the source image and the edited image. We adopt the CLIP model initialized from ViT-B/16 weights to calculate the similarity scores. Results highlighted in blue denote the default settings used in our main experiments.

Loss Components $S_{\text{t2i}}$ ↑ $S_{\text{i2i}}$ ↑
$\mathcal{L}_{\text{Clip}}$ 0.301 0.801
$\mathcal{L}_{\text{Clip}}+\mathcal{L}_{\text{Str}}$ 0.294 0.806
$\mathcal{L}_{\text{Clip}}+\mathcal{L}_{\text{Str}}+\mathcal{L}_{\text{Dir}}$ 0.301 0.805
Table 3: Ablation study on adopting different loss components.
# of region proposals $S_{\text{t2i}}$ ↑ $S_{\text{i2i}}$ ↑
1 0.231 0.915
3 0.273 0.837
5 0.295 0.809
7 0.300 0.805
9 0.301 0.802
Table 4: Ablation study on the number of region proposals.
# of anchor points $S_{\text{t2i}}$ ↑ $S_{\text{i2i}}$ ↑
1 0.275 0.824
4 0.296 0.805
6 0.301 0.802
8 0.300 0.805
10 0.298 0.803
Table 5: Ablation study on the number of anchor points.

Effect of different loss components.

To evaluate the influence of the introduced loss components (i.e., $\mathcal{L}_{\text{Clip}}$, $\mathcal{L}_{\text{Str}}$, and $\mathcal{L}_{\text{Dir}}$) in our training loss, we report the metric scores obtained with different loss combinations in Table 3. The results show that using all losses yields the best overall score (i.e., the average of $S_{\text{t2i}}$ and $S_{\text{i2i}}$), verifying their contributions.

Effect of the number of region proposals.

In Table 4, we evaluate the performance with different numbers of region proposals. The results indicate that performance saturates as the number of region proposals grows (e.g., at 9 proposals), and our method already achieves a good balance between $S_{\text{t2i}}$ and $S_{\text{i2i}}$ with 7 region proposals.

Effect of the number of anchor points.

To analyze the influence of the number of anchor points, we report the ablation study in Table 5. As the number of anchor points increases from 1 to 8, the text-to-image similarity score $S_{\text{t2i}}$ generally increases. However, when the number is too large (e.g., 10), $S_{\text{t2i}}$ decreases, likely because anchor points located in background areas are more likely to be chosen, which hurts editing performance.

[Figure 9: input/result pairs for the editing texts "a flock of pigeons takes flight", "several books rest on the chair", "an alarm clock with functional simplicity", and "an assortment of toys".]

Figure 9: Failure cases. We show more failure cases generated by our method.

C.3 Analysis

More failure cases.

In Figure 9, we display more failure examples generated by our method. We can see that the proposed method may generate unsatisfactory results when the anchor points are sampled from the background regions.

Appendix D Visualizations

We provide more visualization results of text-driven image editing generated by our method in Figures 12, 13, and 14. Thanks to the flexibility of the learned bounding box guidance, our method is capable of handling a wide range of prompts and generates satisfactory results while preserving the original appearance of the background.

[Figure 10: qualitative comparison. Columns: Input Image, Plug-and-Play, InstructPix2Pix, Null-text, DiffEdit, MasaCtrl, Ours. Editing texts: "a high quality photo of a pink train", "a dog in a seated position", "a bottle of wine and several wine cups", "a rabbit with white fur", "a pineapple on the plate", and "a white cow standing in the snow".]

Figure 10: Comparison with existing methods. We compare our method with existing text-driven image editing methods. From left to right: Input image, Plug-and-Play [54], InstructPix2Pix [4], Null-text [38], DiffEdit [10], MasaCtrl [6], and ours.
[Figure 11: qualitative comparison. Columns: Input Image, Plug-and-Play, InstructPix2Pix, Null-text, DiffEdit, MasaCtrl, Ours. Editing texts: "a plate of beefsteak and vegetable", "a generous portion of golden, crispy French fries", "a piece of Oreo cookie and bread", "a bowl of strawberries", "a cup whose logo is named as “coffee”", and "there are several cherries in the cup".]

Figure 11: Comparison with existing methods. We compare our method with existing text-driven image editing methods. From left to right: Input image, Plug-and-Play [54], InstructPix2Pix [4], Null-text [38], DiffEdit [10], MasaCtrl [6], and ours.
[Figure 12: input/edited image pairs for the prompts "a multitude of pink flowers bursts into bloom", "a securely moored white ship on the solid ground", "a cheerful snowman adorned with a carrot nose", "a classic brick dwelling with architectural solidity", "the helicopter readies for takeoff", "a yellow school bus toy rolls down the road", "some small bees flit from flower to flower", and "a pot of chrysanthemums bursts forth with vibrant colors".]

Figure 12: Text-driven image editing results. Given an input image and a language description, our method can generate realistic and relevant images without the need for user-specified regions for editing. It performs local image editing while preserving the image context.
[Figure 13: input/edited image pairs for the prompts "an owl gracefully hovers above the sea", "a parked caravan epitomizes travel tales", "a powerful lion with a magnificent mane", "a Christmas tree adorned with twinkling lights", "a large steamer sails accompanied by sea gulls", "a colossal whale shark floating on the top of the deep ocean", "the Chinese lantern adds cultural elegance, radiating a soft glow", and "a sleek sports car exudes an aura of speed".]

Figure 13: Text-driven image editing results. Given an input image and a language description, our method can generate realistic and relevant images without the need for user-specified regions for editing. It performs local image editing while preserving the image context.
[Figure 14: input/edited image pairs for the prompts "the tower commands attention with its majestic height", "a bouquet of vibrant roses exuding elegance", "a sturdy brown bison with a thick fur coat", "the iconic Eiffel Tower adorned with a picturesque layer of glistening snow", "beautiful cars with sleek lines and polished exterior", "a spirited brown horse charges through the terrains", "a delicate spider weaves an intricate web", and "a delightful cup holds the harmonious blend of ice wine and velvety ice cream".]

Figure 14: Text-driven image editing results. Given an input image and a language description, our method can generate realistic and relevant images without the need for user-specified regions for editing. It performs local image editing while preserving the image context.