Combining Text-based and Drag-based Editing for Precise and Flexible Image Editing
Abstract
Precise and flexible image editing remains a fundamental challenge in computer vision. Based on the modified areas, most editing methods can be divided into two main types: global editing and local editing. In this paper, we discuss a representative approach of each type (i.e., text-based editing and drag-based editing). Specifically, we argue that both directions have inherent drawbacks: text-based methods often fail to describe the desired modifications precisely, while drag-based methods suffer from ambiguity. To address these issues, we propose CLIPDrag, a novel image editing method that is the first attempt to combine text and drag signals for precise and ambiguity-free manipulations on diffusion models. To fully leverage these two signals, we treat text signals as global guidance and drag points as local information. We then introduce a novel global-local motion supervision method that integrates text signals into existing drag-based methods (Shi et al., 2024) by adapting a pre-trained vision-language model such as CLIP (Radford et al., 2021). Furthermore, we address the slow convergence of CLIPDrag by presenting a fast point-tracking method that enforces that handle points move toward the correct directions. Extensive experiments demonstrate that CLIPDrag outperforms existing methods that rely solely on drag or text signals.

1 Introduction
Recently, notable breakthroughs in diffusion models (Ho et al., 2020; Song et al., 2020a;b) have led to many impressive applications (Meng et al., 2021; Dong et al., 2023; Kumari et al., 2023). Among them, image editing is recognized as a significant area of innovation and has gained enormous attention (Kim et al., 2022; Nichol et al., 2021; Sheynin et al., 2024; Valevski et al., 2023). Generally, the goal of this task is to edit realistic images based on various editing instructions. Mainstream methods for image editing can be coarsely categorized into two groups. 1) Global editing, which edits a given image according to a text prompt (Li et al., 2023; Kawar et al., 2023) or an extra image (Zhang et al., 2023a; Epstein et al., 2023) carrying global information about the desired modification; most of these methods involve finetuning a pre-trained diffusion model. 2) Local editing, which mainly consists of drag-based methods (Pan et al., 2023). This framework requires users to click several handle points (handles) and target points (targets) on an image, and then performs a semantic edit that moves the content at the handles to the corresponding targets. Typical methods (Mou et al., 2023; Shi et al., 2024) usually contain a motion supervision phase, which progressively transfers the features of handles to targets by updating the DDIM inversion latent, and a point tracking phase, which tracks the positions of handles through a nearest-neighbor search over nearby candidate points.
Although the aforementioned methods have gained significant prominence, two of their drawbacks cannot be overlooked. 1) Imprecise description in global editing. Global editing, such as text-based methods (Kawar et al., 2023), is unambiguous but struggles to convey detailed edit instructions. For example, in Figure 1(b), the prompt (i.e., "The sculpture is smiling") tells the model how to edit the image, but it is difficult to provide fine-grained editing information, such as the extent of the smile. 2) Ambiguity in local editing. Although local editing methods such as drag-based editing can perform precise pixel-level spatial control, they suffer from ambiguity because the same handles and targets can correspond to multiple possible edited results. For example, in Figure 1(a), there exist two edited results meeting the drag requirements: one enlarging the face, the other making the woman smile. A natural approach to resolving the ambiguity is to add more point pairs, but this does not work in practice. Because the diffusion network is a Markov chain, errors accumulate as the update continues; adding more points usually means more update iterations, which degrades image fidelity. In conclusion, local editing is precise but ambiguous, while global editing is exactly the opposite.
Since these two kinds of edits are complementary, it is natural to ask: can we combine the two control signals to guide the image editing process? In this way, text signals can serve as global information to reduce ambiguity, while drag signals act as local control, providing more detailed guidance. However, combining these two signals presents two challenges. 1) How can two different kinds of signals be integrated efficiently? This is difficult because text-based and drag-based methods have completely different training strategies. Specifically, most text-based methods (Kim et al., 2022) require finetuning a pre-trained diffusion model to gradually inject the prompt information, whereas drag-based methods (Shi et al., 2024) typically freeze the diffusion model and only optimize the DDIM inversion latent of the given image. Besides, the update in text-based editing usually involves the whole sequence of denoising timesteps, while drag-based approaches only focus on a specific timestep. 2) How can the optimization problem be solved while maintaining image quality? Previous drag-based methods are very slow in some situations: handles can get stuck at one position for many iterations or, in worse cases, even move in the opposite direction of the targets. This phenomenon becomes more serious when adding text signals, because combining two different signals usually requires more optimization steps. Thus we need a better approach that speeds up the update process while preserving fidelity as much as possible.
To address these problems, we propose CLIPDrag, the first method to combine text-based and drag-based approaches to achieve precise and flexible image editing (for brevity, we denote this kind of editing as text-drag editing). Our approach is built on the general drag-based diffusion editing framework, which optimizes the DDIM inversion latent of the original image at one specific timestep. Specifically, after a typical identity-preserving finetuning process, we introduce two modules to solve the aforementioned problems. 1) Global-Local Motion Supervision (GLMS). The key of GLMS is to utilize the gradients from the text and drag signals together to eliminate ambiguity. For the text information, we obtain the global gradient by back-propagating a global CLIP loss to the latent; for the drag points, the local gradient is calculated following the paradigm of DragDiffusion (Shi et al., 2024). Because CLIP guidance cannot operate at a single fixed timestep, simply adding these two gradients is ineffective. In GLMS, we therefore disentangle the global gradient into two components: an identity component perpendicular to the local gradient, which maintains the image's global structure, and an edit component parallel to the local gradient, which is combined with the local gradient from the drag signal to edit the image. By comparing their directions, we choose different gradient fusion strategies to either perform the edit or maintain the image identity. 2) Fast Point Tracking (FPT). As mentioned above, combining drag and text signals slows the convergence of the latent optimization, so we propose a faster tracking method to accelerate the optimization process. FPT updates the positions of the handles after each GLMS operation: when performing the nearest-neighbor search, FPT masks out all candidates that are farther from the current targets. Consequently, handles move closer to the targets after each update, and FPT also ensures that the moving trajectory of handles is not repetitive. We observe that FPT significantly accelerates the image editing process and improves image quality to some extent.
Combining these good practices, CLIPDrag achieves high-quality results for text-drag editing. Extensive experiments demonstrate the effectiveness of CLIPDrag, outperforming the state-of-the-art approaches both quantitatively and qualitatively. To summarize, our contributions are as follows:
• We point out that previous drag-based and text-based image editing methods have the problems of ambiguity and inaccuracy, respectively.
• We propose CLIPDrag, a solution that incorporates text signals into drag-based methods by using text signals as global information.
• Extensive experiments demonstrate the superiority and stability of CLIPDrag in text-drag image editing, marking a significant advancement in the field of flexible and precise image editing.
2 Related Work
Text-based Image Editing. Unlike text-to-image generation, which creates an image from scratch (Ho et al., 2022; Dhariwal & Nichol, 2021; Saharia et al., 2022; Gu et al., 2022), text-based image editing alters certain areas of a given image. Training-based methods can achieve reliable performance on various editing tasks through training techniques such as weak supervision (Kim et al., 2022; Zhu et al., 2017), self-supervision (Zhang et al., 2023b; Xie et al., 2023), or full supervision (Sheynin et al., 2024; Xie et al., 2023). Test-time finetuning approaches edit a given image through per-image optimization, with strategies ranging from model finetuning (Valevski et al., 2023) to embedding finetuning (Yang et al., 2023). Another line of work is training- and finetuning-free approaches: instead of optimizing the model or intermediate embeddings, these methods perform the edit by refining the text prompt (Lin et al., 2023) or modifying the attention maps (Hertz et al., 2022; Cao et al., 2023).
Drag-based Image Editing. This line of methods achieves precise spatial control over specific regions of an image based on user-provided drag instructions, and was first proposed in DragGAN (Pan et al., 2023). DragDiffusion (Shi et al., 2024) transfers this idea to diffusion models. Building on this foundation, GoodDrag (Zhang et al., 2024), StableDrag (Cui et al., 2024), DragNoise (Liu et al., 2024), and FreeDrag (Ling et al., 2023) have made significant improvements. Besides, by utilizing feature correspondences, DragonDiffusion (Mou et al., 2023) and DiffEditor (Mou et al., 2024) transform image editing into a gradient-based process by formulating an energy function that aligns with the desired edited results.

3 Approach
Problem Formulation. Given an image $x_0$ and two text prompts $p_{\mathrm{ori}}$ and $p_{\mathrm{edit}}$, where $p_{\mathrm{ori}}$ describes the original image and $p_{\mathrm{edit}}$ depicts the edited image, and given handle points $\{h_i\}_{i=1}^{n}$ with corresponding target points $\{t_i\}_{i=1}^{n}$, a qualified text-drag editing method has to satisfy two requirements: 1) move the feature at each handle $h_i$ to its target $t_i$, while preserving the irrelevant content of the original image; 2) the edited image must align with the edit prompt $p_{\mathrm{edit}}$.
General Framework. As shown in Figure 2, our method, CLIPDrag, consists of three steps (a high-level sketch of the full loop follows the list):
1) Identity-Preserving Finetuning (Sec 3.1): Given the input image $x_0$ and its description $p_{\mathrm{ori}}$, we first finetune a pre-trained latent diffusion model using low-rank adaptation (LoRA; Hu et al., 2021). After the finetuning, we encode the image into the latent space and obtain the latent $z_t$ at a specific timestep $t$ through DDIM inversion. $z_t$ is then optimized for many iterations to achieve the desired edit, and each iteration consists of the following two phases.
2) Global-Local Motion Supervision (Sec 3.2): We denote the latent at timestep $t$ and the $i$-th handle point during the $k$-th iteration as $\hat z_t^k$ and $h_i^k$, respectively. At the $k$-th iteration, we compute a global gradient $g_{\mathrm{glo}}$ and a local gradient $g_{\mathrm{loc}}$ for $\hat z_t^k$. These two gradients are combined by our Global-Local Gradient Fusion method, and the resulting gradient is used to update $\hat z_t^k$ to $\hat z_t^{k+1}$.
3) Fast Point Tracking (Sec 3.3): After GLMS, we update the positions of the handles with a nearest-neighbor search. To accelerate editing while preserving image fidelity, FPT ensures that the new handles are closer to their corresponding targets by filtering out all candidates that are farther from the targets.
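To make the overall procedure concrete, the sketch below ties the three steps together in PyTorch-style pseudocode. All helper names (`lora_finetune`, `ddim_invert`, `glms_step`, `fast_point_tracking`, `ddim_denoise`) and their calling conventions are our own illustrative assumptions, not the released implementation; Sections 3.1-3.3 define what each call is meant to do.

```python
# A minimal, assumption-laden sketch of the CLIPDrag loop described above.
# Every helper here (lora_finetune, ddim_invert, glms_step,
# fast_point_tracking, ddim_denoise) is a placeholder name, not the authors' API.

def clipdrag_edit(image, p_ori, p_edit, handles, targets,
                  unet, vae, clip_model, t_edit=35, max_iters=2000):
    # 1) Identity-preserving finetuning (Sec 3.1): encode the image identity into LoRA weights.
    lora_finetune(unet, vae, image, p_ori)

    # Encode the image and invert it to the chosen timestep with DDIM.
    z0 = vae.encode(image)
    z_t = ddim_invert(unet, z0, p_ori, t_edit)      # latent to be optimized
    z_t = z_t.detach().requires_grad_(True)

    for k in range(max_iters):
        # 2) Global-Local Motion Supervision (Sec 3.2): fuse CLIP (global) and
        #    drag (local) gradients, then take one update step on z_t.
        z_t = glms_step(z_t, unet, clip_model, p_edit, handles, targets)

        # 3) Fast Point Tracking (Sec 3.3): relocate handles strictly closer to targets.
        handles = fast_point_tracking(z_t, handles, targets)

        if all((h == t).all() for h, t in zip(handles, targets)):
            break                                   # every handle reached its target

    # Denoise the optimized latent back into the edited image.
    return vae.decode(ddim_denoise(unet, z_t, p_edit, t_edit))
```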
3.1 Identity-Preserving Finetuning
As analyzed in previous work (Shi et al., 2024), directly optimizing the latent in diffusion-based methods degrades image fidelity, so finetuning a pre-trained diffusion model is necessary to encode the features of the image into the U-Net. Specifically, the image $x_0$ and its description prompt $p_{\mathrm{ori}}$ are used to finetune the diffusion model through LoRA:
$$\mathcal{L}_{\mathrm{ft}}(z_0,\Delta\theta)=\mathbb{E}_{\epsilon,t}\Big[\big\|\epsilon-\epsilon_{\theta+\Delta\theta}(\alpha_t z_0+\sigma_t\epsilon,\,t,\,p_{\mathrm{ori}})\big\|_2^2\Big] \qquad (1)$$
where $z_0$ is the latent-space feature map of the image $x_0$, $\epsilon_\theta$ and $\Delta\theta$ represent the U-Net (Ronneberger et al., 2015) and LoRA parameters, $\alpha_t$ and $\sigma_t$ are constants pre-defined by the diffusion schedule, and $\epsilon$ is random noise sampled from $\mathcal{N}(0,\mathbf{I})$. After the finetuning, we choose a specific timestep $t$ and obtain the DDIM inversion latent $z_t$ (Song et al., 2020a) as follows:
$$z_{s+1}=\alpha_{s+1}\,\frac{z_{s}-\sigma_{s}\,\epsilon_{\theta}(z_{s},s,p_{\mathrm{ori}})}{\alpha_{s}}+\sigma_{s+1}\,\epsilon_{\theta}(z_{s},s,p_{\mathrm{ori}}),\qquad s=0,\dots,t-1 \qquad (2)$$
The latent $z_t$ will be optimized in the subsequent process while all other parameters are kept frozen.
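As a concrete illustration of Eq. (1), the snippet below sketches one LoRA finetuning step under the assumption of a diffusers-style conditional U-Net whose LoRA parameters are the only ones registered with the optimizer; `alpha` and `sigma` are assumed to be 1-D tensors holding the schedule constants $\alpha_t$ and $\sigma_t$.

```python
import torch
import torch.nn.functional as F

# Hedged sketch of the identity-preserving finetuning objective in Eq. (1).
# Assumes a diffusers-style UNet (unet(sample, t, encoder_hidden_states=...).sample)
# and that `optimizer` only holds the LoRA parameters (Delta-theta).
def lora_finetune_step(unet, z0, text_emb, alpha, sigma, optimizer):
    t = torch.randint(0, alpha.shape[0], (z0.shape[0],), device=z0.device)
    eps = torch.randn_like(z0)                     # epsilon ~ N(0, I)
    a_t = alpha[t].view(-1, 1, 1, 1)
    s_t = sigma[t].view(-1, 1, 1, 1)
    z_noisy = a_t * z0 + s_t * eps                 # alpha_t * z0 + sigma_t * eps
    pred = unet(z_noisy, t, encoder_hidden_states=text_emb).sample
    loss = F.mse_loss(pred, eps)                   # || eps - eps_{theta + Delta theta} ||_2^2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```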
3.2 Global-Local Motion Supervision
This step aims to combine the text and drag signals, as shown in Figure 3. We first introduce how each control signal is processed, and then show how to combine them.

CLIP-guidance Gradient. We extract knowledge from the text signal using the local directional CLIP loss (Kim et al., 2022), which aligns the direction between the source image $x_0$ and the generated image $x_{\mathrm{gen}}$ with the direction between the original prompt $p_{\mathrm{ori}}$ and the edit prompt $p_{\mathrm{edit}}$:
$$\mathcal{L}_{\mathrm{dir}}=1-\frac{\Delta I\cdot\Delta T}{\|\Delta I\|\,\|\Delta T\|} \qquad (3)$$
where $\Delta I=E_I(x_{\mathrm{gen}})-E_I(x_0)$ and $\Delta T=E_T(p_{\mathrm{edit}})-E_T(p_{\mathrm{ori}})$. Here $E_T$ and $E_I$ denote the text and image encoders of a pre-trained CLIP model. However, the identity information of $x_0$ is canceled out when computing $\Delta I$, making it difficult to maintain the image identity. Besides, it is sometimes impossible to calculate the directional loss because $p_{\mathrm{ori}}$ is not provided (Zhang et al., 2024). Consequently, we choose the global target loss to extract more information from the edit prompt $p_{\mathrm{edit}}$, as follows:
$$\mathcal{L}_{\mathrm{global}}=1-\frac{E_I(x_{\mathrm{gen}})\cdot E_T(p_{\mathrm{edit}})}{\|E_I(x_{\mathrm{gen}})\|\,\|E_T(p_{\mathrm{edit}})\|} \qquad (4)$$
We then obtain the global gradient $g_{\mathrm{glo}}$ from the text signal as $g_{\mathrm{glo}}=\partial\mathcal{L}_{\mathrm{global}}/\partial\hat z_t^k$. Later we explain how $g_{\mathrm{glo}}$ is used to maintain the image identity and to guide the edit.
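A possible realization of this global gradient is sketched below. It assumes an open_clip-style model and hypothetical differentiable helpers `decode_latent` (mapping the optimized latent back to image space) and `to_clip_input` (resizing and normalizing a tensor image for CLIP); both are assumptions rather than the paper's exact pipeline.

```python
import torch
import torch.nn.functional as F

# Sketch of g_glo from Eq. (4): cosine distance between the generated image and
# the edit prompt in CLIP space, back-propagated to the latent z_t.
# `decode_latent` and `to_clip_input` are hypothetical differentiable helpers.
def global_gradient(z_t, clip_model, text_tokens, decode_latent, to_clip_input):
    x_gen = decode_latent(z_t)                             # differentiable image from z_t
    img_feat = clip_model.encode_image(to_clip_input(x_gen))
    txt_feat = clip_model.encode_text(text_tokens)         # embedding of p_edit
    img_feat = F.normalize(img_feat, dim=-1)
    txt_feat = F.normalize(txt_feat, dim=-1)
    loss_global = 1.0 - (img_feat * txt_feat).sum(dim=-1).mean()   # Eq. (4)
    g_glo = torch.autograd.grad(loss_global, z_t, retain_graph=True)[0]
    return g_glo
```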
Drag-guidance Gradient. We denote the U-Net output feature map obtained from the $k$-th updated latent as $F(\hat z_t^k)$, and write $F_q(\hat z_t^k)$ for its value at pixel location $q$. The gradient from the drag signal is obtained through the motion supervision loss $\mathcal{L}_{\mathrm{ms}}$, which measures the difference between the features of corresponding targets and handles:
$$\mathcal{L}_{\mathrm{ms}}(\hat z_t^k)=\sum_{i=1}^{n}\;\sum_{q\in\Omega(h_i^k,r_1)}\big\|F_{q+d_i}(\hat z_t^k)-\mathrm{sg}\big(F_q(\hat z_t^k)\big)\big\|_1 \qquad (5)$$
where $\Omega(h_i^k,r_1)=\{(x,y):|x-x_i^k|\le r_1,\;|y-y_i^k|\le r_1\}$ with $h_i^k=(x_i^k,y_i^k)$, and $r_1$ is the patch radius. $\mathrm{sg}(\cdot)$ denotes the stop-gradient operation (van den Oord et al., 2017), and $d_i=(t_i-h_i^k)/\|t_i-h_i^k\|_2$ is the unit vector from $h_i^k$ to $t_i$. The local gradient $g_{\mathrm{loc}}$ is then calculated as $g_{\mathrm{loc}}=\partial\mathcal{L}_{\mathrm{ms}}/\partial\hat z_t^k$.
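For illustration, a simplified version of the motion supervision loss of Eq. (5) could look as follows; the nearest-pixel lookup of the shifted patch (instead of bilinear interpolation) and the tensor layout are simplifying assumptions.

```python
import torch
import torch.nn.functional as F

# Sketch of Eq. (5). `feat` is the U-Net feature map F(z_t^k) of shape [C, H, W];
# `handles` and `targets` are lists of integer (row, col) tensors; r1 is the patch radius.
def motion_supervision_loss(feat, handles, targets, r1=4):
    C, H, W = feat.shape
    loss = feat.new_zeros(())
    for h, tgt in zip(handles, targets):
        d = (tgt - h).float()
        d = d / (d.norm() + 1e-8)                          # unit vector d_i from handle to target
        ys = torch.arange(int(h[0]) - r1, int(h[0]) + r1 + 1, device=feat.device).clamp(0, H - 1)
        xs = torch.arange(int(h[1]) - r1, int(h[1]) + r1 + 1, device=feat.device).clamp(0, W - 1)
        gy, gx = torch.meshgrid(ys, xs, indexing="ij")     # patch Omega(h_i^k, r1)
        patch = feat[:, gy, gx]
        sy = (gy + d[0]).round().long().clamp(0, H - 1)    # shifted locations q + d_i
        sx = (gx + d[1]).round().long().clamp(0, W - 1)
        shifted = feat[:, sy, sx]
        loss = loss + F.l1_loss(shifted, patch.detach(), reduction="sum")  # sg(.) on the handle patch
    return loss
```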
Global-Local Gradient Fusion. Now we explain how to incorporate the two gradients in detail. The key idea is to decompose the global gradient $g_{\mathrm{glo}}$ into two parts: an edit component $g_{\mathrm{edit}}$ parallel to $g_{\mathrm{loc}}$ and an identity component $g_{\mathrm{id}}$ perpendicular to it. Specifically, as shown in Figure 3, when the edit component of $g_{\mathrm{glo}}$ points in the same direction as $g_{\mathrm{loc}}$ (Figure 3(a)), both signals agree on how to update the latent; we therefore use $g_{\mathrm{loc}}$ to guide the edit while preserving the image structure via the identity component $g_{\mathrm{id}}$. When the two edit directions are contradictory (Figure 3(b)), we correct the drag direction using the edit component, inspired by ProGrad (Zhu et al., 2023). This can be formalized as follows:
$$g=\begin{cases} g_{\mathrm{loc}}+\lambda\,g_{\mathrm{id}}, & \text{if } g_{\mathrm{glo}}\cdot g_{\mathrm{loc}}\ge 0,\\ g_{\mathrm{loc}}+\lambda\,g_{\mathrm{edit}}, & \text{otherwise}, \end{cases}\qquad\text{with } g_{\mathrm{edit}}=\frac{g_{\mathrm{glo}}\cdot g_{\mathrm{loc}}}{\|g_{\mathrm{loc}}\|^2}\,g_{\mathrm{loc}},\;\; g_{\mathrm{id}}=g_{\mathrm{glo}}-g_{\mathrm{edit}} \qquad (6)$$
where $g$ is the final gradient used to update the latent code and $\lambda$ is a hyper-parameter.
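The fusion rule of Eq. (6) can be sketched directly from the description above: project the global gradient onto the local one to get the edit component, take the perpendicular remainder as the identity component, and pick the branch according to their agreement. The exact weighting is our reading of the text, with `lam` playing the role of the hyper-parameter $\lambda$ (0.7 in Sec. 4.1).

```python
import torch

# Sketch of Global-Local Gradient Fusion (Eq. 6), following the description above.
def fuse_gradients(g_loc, g_glo, lam=0.7):
    v_loc, v_glo = g_loc.flatten(), g_glo.flatten()
    agree = torch.dot(v_glo, v_loc)                        # sign tells whether the signals agree
    # edit component: projection of g_glo onto g_loc (parallel part)
    g_edit = (agree / (v_loc.norm() ** 2 + 1e-12)) * g_loc
    g_id = g_glo - g_edit                                  # identity component (perpendicular part)
    if agree >= 0:
        # signals agree: drag drives the edit, identity component preserves structure
        return g_loc + lam * g_id
    # signals conflict: correct the drag direction with the edit component
    return g_loc + lam * g_edit
```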
3.3 Fast Point Tracking
Although CLIP guidance can relieve the ambiguity problem, it makes the optimization of GLMS more difficult. In drag-based methods, a similar optimization issue arises when more point pairs are added. In the previous point tracking strategy, handles sometimes get stuck in one position or move far away from their corresponding targets. This significantly slows down the editing process. To remedy this issue, we add a simple constraint on the point tracking process: when updating the handles through the nearest neighbor search algorithm, only consider the candidate points that are closer to the targets, as shown in Figure 3(c). Our FPT method can be formulated as follows:
$$h_i^{k+1}=\operatorname*{arg\,min}_{\substack{q\in\Omega(h_i^k,r_2)\\ d(q,\,t_i)<d(h_i^k,\,t_i)}}\big\|F_q(\hat z_t^{k+1})-F_{h_i^0}(\hat z_t^{0})\big\|_1 \qquad (7)$$
where $d(q,t_i)$ denotes the distance between a candidate point $q$ and the target $t_i$, and $r_2$ is the search radius.
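A minimal sketch of FPT for a single handle is given below, assuming integer (row, col) tensors on the same device as `feat` and Euclidean distance for $d(\cdot,\cdot)$; the empty-candidate fallback is our own safeguard, not described in the paper.

```python
import torch

# Sketch of Eq. (7): nearest-neighbour search in feature space over the r2-patch
# around the current handle, restricted to candidates strictly closer to the target.
# `feat` is F(z_t^{k+1}) of shape [C, H, W]; `feat0_h` is F_{h_i^0}(z_t^0) of shape [C].
def fast_point_tracking(feat, feat0_h, handle, target, r2=12):
    C, H, W = feat.shape
    ys = torch.arange(int(handle[0]) - r2, int(handle[0]) + r2 + 1, device=feat.device).clamp(0, H - 1)
    xs = torch.arange(int(handle[1]) - r2, int(handle[1]) + r2 + 1, device=feat.device).clamp(0, W - 1)
    gy, gx = torch.meshgrid(ys, xs, indexing="ij")
    cand = torch.stack([gy, gx], dim=-1).reshape(-1, 2).float()        # candidate points q
    # FPT constraint: keep only candidates closer to the target than the current handle
    closer = (cand - target.float()).norm(dim=-1) < (handle.float() - target.float()).norm()
    if closer.any():
        cand = cand[closer]
    diffs = feat[:, cand[:, 0].long(), cand[:, 1].long()] - feat0_h[:, None]
    best = diffs.abs().sum(dim=0).argmin()                             # L1-nearest feature match
    return cand[best].long()                                           # new handle h_i^{k+1}
```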
4 Experiments
4.1 Implementation details
We used Stable Diffusion 1.5 (Rombach et al., 2022) and CLIP ViT-B/16 (Dosovitskiy et al., 2020) as the base models. For the LoRA finetuning stage, we set the number of training steps to 80 and the rank to 16, with a learning rate of 0.0005. For DDIM inversion, we set the inversion strength to 0.7 and the total number of denoising steps to 50. For motion supervision, we used a large maximum number of optimization steps (2000) to ensure that handles could reach the targets. Features were extracted from the last layer of the U-Net. The radii for motion supervision ($r_1$) and point tracking ($r_2$) were set to 4 and 12, respectively, and the weight $\lambda$ in the Global-Local Gradient Fusion was set to 0.7.
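For convenience, the hyper-parameters listed above can be gathered into a single configuration object, as in the sketch below; the field names and the Hugging Face model identifier are our own shorthand, not an official configuration file of the method.

```python
from dataclasses import dataclass

# Hyper-parameters from Sec. 4.1 in one place; names are illustrative only.
@dataclass
class CLIPDragConfig:
    base_model: str = "runwayml/stable-diffusion-v1-5"  # Stable Diffusion 1.5 (assumed checkpoint id)
    clip_model: str = "ViT-B/16"                        # CLIP ViT-B/16
    lora_steps: int = 80                                # LoRA finetuning steps
    lora_rank: int = 16
    lora_lr: float = 5e-4
    inversion_strength: float = 0.7                     # fraction of the denoising steps to invert
    ddim_steps: int = 50                                # total denoising steps
    max_opt_steps: int = 2000                           # max motion-supervision iterations
    r1: int = 4                                         # motion-supervision patch radius
    r2: int = 12                                        # point-tracking search radius
    fusion_weight: float = 0.7                          # lambda in Global-Local Gradient Fusion
```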
4.2 Text-Drag Editing Results
Settings. To evaluate the performance of CLIPDrag, we compared it with both drag-based methods (DragDiffusion, FreeDrag) and a text-based method (DiffCLIP) on text-drag image editing tasks. Drag-based methods take drag points as edit instructions, while text-based methods need an edit prompt to perform the modification; CLIPDrag requires both the edit prompt and the drag points. All input images are from the DRAGBENCH dataset (Shi et al., 2024).
Results. As illustrated in Figure 4, our method outperforms both editing frameworks. Compared with the text-based method (DiffCLIP), CLIPDrag provides more precise, pixel-level editing control. Compared with drag-based methods (DragDiffusion, FreeDrag), CLIPDrag successfully alleviates the ambiguity problem, as shown in Figure 4(b)(d)(e). This is because previous drag-based methods tend to perform structural edits such as moving or reshaping rather than semantic edits such as changing an emotional expression. This is reasonable, since moving an object is much easier than changing its morphological characteristics; these models therefore take the shortcut when aligning features in motion supervision, resulting in the ambiguity problem. Our method effectively solves this issue by introducing CLIP guidance as global information that points out a correct optimization path.


Besides, previous drag-based approaches cannot guarantee edited image quality when multiple drag point pairs exist, so we also provide examples with multiple point pairs to compare the stability of these methods. As shown in Figure 4(a)(c), the results validate the effectiveness of two techniques in our method: guidance from the identity component, which preserves image quality, and fast point tracking, which achieves better drag performance.
More results of CLIPDrag are shown in Figure 5. The leftmost three examples verify the effectiveness of our method: by combining the information from drag points and the edit prompt ($p_{\mathrm{edit}}$), CLIPDrag achieves an edit with high image fidelity and no ambiguity. The middle three examples show situations where users do not want to handle the ambiguity or find it difficult to describe the desired edit; we found that CLIPDrag also works well when the edit prompt is replaced with the original prompt ($p_{\mathrm{ori}}$). The rightmost three examples show the results when a mask is added.


4.3 Drag-based Editing Results
Settings. Since our method is built on the general drag-based framework, we also explored the performance of CLIPDrag on pure drag-based editing tasks. Specifically, we replaced the edit prompt ($p_{\mathrm{edit}}$) with the corresponding original prompt ($p_{\mathrm{ori}}$), ensuring that no extra text information is introduced. We compared CLIPDrag with DragDiffusion on the DRAGBENCH benchmark under five different maximum-iteration settings. To evaluate image fidelity, we report the average 1-LPIPS score (IF). Besides, the mean distance (MD) is calculated to measure the distance between the final handles and the targets (lower MD indicates better drag performance).
Results. Quantitative results are shown in Figure 6(b), where the x-axis represents the maximum number of iteration steps and the y-axes represent the IF and MD metrics, respectively. As can be seen, CLIPDrag outperforms DragDiffusion: under most maximum-iteration settings, CLIPDrag achieves higher IF and lower MD. This means the handle points end up closer to the targets with CLIPDrag, while the edited image quality is preserved or even improved (Figure 6(a)).

4.4 Ablation Study
Ablation on Text Signals. We performed ablation studies to clarify the effect of text signals in CLIPDrag by using different edit prompts and keeping the drag points unchanged.
Results. As illustrated in Figure 7(b), the ambiguity can be alleviated in CLIPDrag. For example, the drag instruction (i) corresponds to at least two different edit directions: one moving the sculpture up (ii), the other raising its head (iii). By giving different text signals, we can choose different edit paths. We also observed that when no extra text information is given, CLIPDrag tends to align handle and target features by simple positional translation (ii) instead of semantic editing (iii).
Ablation on the Global-Local Gradient Fusion. We studied the effect of different methods for combining text and drag signals, running this ablation with three gradient-fusion strategies: directly adding the two gradients, ProGrad, and our method.
Results. As shown in Figure 7(a), directly adding the two gradients cannot alleviate the ambiguity problem, because CLIP guidance is not effective when applied at a single timestep, as also observed for DiffCLIP. The ProGrad method relieves the ambiguity but performs poorly at maintaining the quality of the edited images, because it ignores the identity component of the text signal, which carries information about the image's structure. By contrast, our proposed method effectively fuses the two signals, not only relieving the ambiguity but also maintaining image fidelity to some degree.
Ablation on Different Point Tracking Strategies. Finally, we show the effect of different point tracking strategies. We base this experiment on CLIPDrag while replacing fast point tracking with the normal point tracking used in DragDiffusion (Shi et al., 2024). To make the comparison more convincing, we made the edited images from the two methods as similar as possible by adjusting the random seed and learning rate, while keeping other parameters such as the patch radius unchanged.
Results. As shown in Figure 8, when achieving similar editing results, FPT effectively reduces the number of optimization iterations consumed. From the moving trajectories in Figure 8(d), we found that handles can get stuck at one point or move in circles with the previous method, whereas with our FPT strategy the handles move steadily closer to the targets, thus speeding up the editing process. Figure 8(c) shows that FPT does not harm image quality. To further illustrate the effect of FPT, we give another example together with its intermediate results in Figure LABEL:figure:pt_intermediate(c): by the 20th iteration, the FPT strategy had already begun to perform semantic editing, while the normal PT method was still in the identity-preserving stage.
5 Conclusion
In this work, we tackled the ongoing challenge of achieving precise and flexible image editing in computer vision. We observed that existing text-based methods often lack the precision for specific modifications, while drag-based techniques are prone to ambiguity. To overcome these challenges, we introduced CLIPDrag, a pioneering method that integrates text and drag signals to enable accurate and unambiguous manipulations. We enhanced existing drag-based methods by treating text features as global guidance and drag points as local cues. Our novel global-local gradient fusion method further optimizes the editing process during motion supervision. Additionally, to address the slow convergence of CLIPDrag, we developed an FPT method that efficiently guides handle points toward their target positions. Extensive experiments clearly demonstrate that CLIPDrag significantly outperforms existing methods that rely solely on drag-based or text-based inputs.
References
- van den Oord et al. (2017) Aaron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural discrete representation learning. In Advances in Neural Information Processing Systems, volume 30, 2017.
- Cao et al. (2023) Mingdeng Cao, Xintao Wang, Zhongang Qi, Ying Shan, Xiaohu Qie, and Yinqiang Zheng. Masactrl: Tuning-free mutual self-attention control for consistent image synthesis and editing. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 22560–22570, 2023.
- Cui et al. (2024) Yutao Cui, Xiaotong Zhao, Guozhen Zhang, Shengming Cao, Kai Ma, and Limin Wang. Stabledrag: Stable dragging for point-based image editing. arXiv preprint arXiv:2403.04437, 2024.
- Dhariwal & Nichol (2021) Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. In Advances in neural information processing systems, volume 34, pp. 8780–8794, 2021.
- Dong et al. (2023) Wenkai Dong, Song Xue, Xiaoyue Duan, and Shumin Han. Prompt tuning inversion for text-driven image editing using diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7430–7440, 2023.
- Dosovitskiy et al. (2020) Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
- Epstein et al. (2023) Dave Epstein, Allan Jabri, Ben Poole, Alexei Efros, and Aleksander Holynski. Diffusion self-guidance for controllable image generation. Advances in Neural Information Processing Systems, 36:16222–16239, 2023.
- Gu et al. (2022) Shuyang Gu, Dong Chen, Jianmin Bao, Fang Wen, Bo Zhang, Dongdong Chen, Lu Yuan, and Baining Guo. Vector quantized diffusion model for text-to-image synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 10696–10706, 2022.
- Hertz et al. (2022) Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626, 2022.
- Ho et al. (2020) Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Advances in neural information processing systems, volume 33, pp. 6840–6851, 2020.
- Ho et al. (2022) Jonathan Ho, Chitwan Saharia, William Chan, David J Fleet, Mohammad Norouzi, and Tim Salimans. Cascaded diffusion models for high fidelity image generation. Journal of Machine Learning Research, 23(47):1–33, 2022.
- Hu et al. (2021) Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
- Kawar et al. (2023) Bahjat Kawar, Shiran Zada, Oran Lang, Omer Tov, Huiwen Chang, Tali Dekel, Inbar Mosseri, and Michal Irani. Imagic: Text-based real image editing with diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6007–6017, 2023.
- Kim et al. (2022) Gwanghyun Kim, Taesung Kwon, and Jong Chul Ye. Diffusionclip: Text-guided diffusion models for robust image manipulation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 2426–2435, 2022.
- Kumari et al. (2023) Nupur Kumari, Bingliang Zhang, Richard Zhang, Eli Shechtman, and Jun-Yan Zhu. Multi-concept customization of text-to-image diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1931–1941, 2023.
- Li et al. (2023) Senmao Li, Joost van de Weijer, Taihang Hu, Fahad Shahbaz Khan, Qibin Hou, Yaxing Wang, and Jian Yang. Stylediffusion: Prompt-embedding inversion for text-based editing. arXiv preprint arXiv:2303.15649, 2023.
- Lin et al. (2023) Yupei Lin, Sen Zhang, Xiaojun Yang, Xiao Wang, and Yukai Shi. Regeneration learning of diffusion models with rich prompts for zero-shot image translation. arXiv preprint arXiv:2305.04651, 2023.
- Ling et al. (2023) Pengyang Ling, Lin Chen, Pan Zhang, Huaian Chen, and Yi Jin. Freedrag: Point tracking is not you need for interactive point-based image editing. arXiv preprint arXiv:2307.04684, 2023.
- Liu et al. (2024) Haofeng Liu, Chenshu Xu, Yifei Yang, Lihua Zeng, and Shengfeng He. Drag your noise: Interactive point-based editing via diffusion semantic propagation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6743–6752, 2024.
- Meng et al. (2021) Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. Sdedit: Guided image synthesis and editing with stochastic differential equations. arXiv preprint arXiv:2108.01073, 2021.
- Mou et al. (2023) Chong Mou, Xintao Wang, Jiechong Song, Ying Shan, and Jian Zhang. Dragondiffusion: Enabling drag-style manipulation on diffusion models. arXiv preprint arXiv:2307.02421, 2023.
- Mou et al. (2024) Chong Mou, Xintao Wang, Jiechong Song, Ying Shan, and Jian Zhang. Diffeditor: Boosting accuracy and flexibility on diffusion-based image editing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8488–8497, 2024.
- Nichol et al. (2021) Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021.
- Pan et al. (2023) Xingang Pan, Ayush Tewari, Thomas Leimkühler, Lingjie Liu, Abhimitra Meka, and Christian Theobalt. Drag your gan: Interactive point-based manipulation on the generative image manifold. In ACM SIGGRAPH 2023 Conference Proceedings, pp. 1–11, 2023.
- Radford et al. (2021) Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pp. 8748–8763. PMLR, 2021.
- Rombach et al. (2022) Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 10684–10695, 2022.
- Ronneberger et al. (2015) Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical image computing and computer-assisted intervention–MICCAI 2015: 18th international conference, Munich, Germany, October 5-9, 2015, proceedings, part III 18, pp. 234–241. Springer, 2015.
- Saharia et al. (2022) Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. In Advances in Neural Information Processing Systems, volume 35, 2022.
- Sheynin et al. (2024) Shelly Sheynin, Adam Polyak, Uriel Singer, Yuval Kirstain, Amit Zohar, Oron Ashual, Devi Parikh, and Yaniv Taigman. Emu edit: Precise image editing via recognition and generation tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8871–8879, 2024.
- Shi et al. (2024) Yujun Shi, Chuhui Xue, Jun Hao Liew, Jiachun Pan, Hanshu Yan, Wenqing Zhang, Vincent YF Tan, and Song Bai. Dragdiffusion: Harnessing diffusion models for interactive point-based image editing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8839–8849, 2024.
- Song et al. (2020a) Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020a.
- Song et al. (2020b) Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020b.
- Valevski et al. (2023) Dani Valevski, Matan Kalman, Eyal Molad, Eyal Segalis, Yossi Matias, and Yaniv Leviathan. Unitune: Text-driven image editing by fine tuning a diffusion model on a single image. ACM Transactions on Graphics (TOG), pp. 1–10, 2023.
- Xie et al. (2023) Shaoan Xie, Zhifei Zhang, Zhe Lin, Tobias Hinz, and Kun Zhang. Smartbrush: Text and shape guided object inpainting with diffusion model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 22428–22437, 2023.
- Yang et al. (2023) Fei Yang, Shiqi Yang, Muhammad Atif Butt, Joost van de Weijer, et al. Dynamic prompt learning: Addressing cross-attention leakage for text-based image editing. Advances in Neural Information Processing Systems, 36:26291–26303, 2023.
- Zhang et al. (2023a) Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3836–3847, 2023a.
- Zhang et al. (2023b) Xin Zhang, Jiaxian Guo, Paul Yoo, Yutaka Matsuo, and Yusuke Iwasawa. Paste, inpaint and harmonize via denoising: Subject-driven image editing with pre-trained diffusion model. arXiv preprint arXiv:2306.07596, 2023b.
- Zhang et al. (2024) Zewei Zhang, Huan Liu, Jun Chen, and Xiangyu Xu. Gooddrag: Towards good practices for drag editing with diffusion models. arXiv preprint arXiv:2404.07206, 2024.
- Zhu et al. (2023) Beier Zhu, Yulei Niu, Yucheng Han, Yue Wu, and Hanwang Zhang. Prompt-aligned gradient for prompt tuning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 15659–15669, 2023.
- Zhu et al. (2017) Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision, pp. 2223–2232, 2017.