Direct Inversion: Optimization-Free Text-Driven Real Image Editing with Diffusion Models
Abstract
With the rise of large, publicly available text-to-image diffusion models, text-guided real image editing has garnered much research attention recently. Existing methods tend to either rely on some form of per-instance or per-task fine-tuning and optimization, require multiple novel views, or inherently entangle the preservation of real image identity, semantic coherence, and faithfulness to text guidance. In this paper, we propose an optimization-free, zero fine-tuning framework that applies complex and non-rigid edits to a single real image via a text prompt, avoiding all the pitfalls described above. Using widely available generic pre-trained text-to-image diffusion models, we demonstrate the ability to modulate pose, scene, background, style, color, and even racial identity in an extremely flexible manner through a single target text detailing the desired edit. Furthermore, our method, which we name Direct Inversion, exposes multiple intuitively configurable hyperparameters to allow for a wide range of types and extents of real image edits. We demonstrate our method’s efficacy in producing high-quality, diverse, semantically coherent, and faithful real image edits by applying it to a variety of inputs across a multitude of tasks. We also formalize our method in well-established theory, detail future experiments for further improvement, and compare against state-of-the-art attempts.
![[Uncaptioned image]](https://cdn.awesomepapers.org/papers/661b67ad-824b-462e-b73f-1991ae939ba3/main_demo.png)
1 Introduction
Manipulating real images using natural language has been a long-standing problem in image processing. Given the wide scope of impact and potential applications, this problem has drawn considerable attention and research focus, and many works have utilized a wide range of methods in attempts to deliver robust, impressive results. Recent text-to-image machine learning models, such as DALL-E [20], Imagen [24], and Stable Diffusion [22], have dramatically accelerated the image generation space, yielding highly coherent, diverse images that are well-aligned with text prompts. The focus of this work is to utilize these new foundational text-to-image generation models in order to edit real images in a faithful and semantically coherent manner.
The current leading methods that attempt this research goal either (i) require significant per-instance training or fine-tuning [23, 15, 7, 14, 30], (ii) are constrained to a specific domain of images [10, 19], (iii) require secondary inputs in the form of edit masks [4, 3] or multiple images of the target object [7, 23], or (iv) inherently entangle edit strength and arbitrary structural similarity [18], greatly limiting the types and extents of image editing possible.
We propose an image-editing technique that avoids the pitfalls above. Our method only requires an input image, along with a corresponding text prompt describing the desired edit. Given these inputs, our method is able to elaborately modify the original image using only the target edit text prompt – adding objects, as well as modulating scene, background, pose, style, color, and even race/ethnicity. Furthermore, Direct Inversion is able to accomplish all of this without any sort of re-training or fine-tuning. To the best of our knowledge, ours is the first method to be able to accomplish this degree of complex image-editing without expensive per-instance or per-task fine-tuning.
At a high level, Direct Inversion encodes the input image in the noise latent space and then “denoises” the result with CLIP text guidance, while continually injecting the initial noise back into the diffusion process at various scales. This simple formulation allows Direct Inversion to be integrated directly into existing diffusion model implementations. In this paper, we present:
1. An optimization-free, zero fine-tuning, text-based semantic image editing method that can make flexible edits to both global and local structure, attributes, style, and much more, requiring only a single input image and text prompt.
2. Qualitative demonstrations of the ability to perform both global style-based edits and local object-level edits on real images.
3. Quantitative demonstrations of how Direct Inversion allows for fine-tuned control of the editability-fidelity tradeoff, as well as a technical investigation into general diffusion model inversion techniques and parameters.
4. Ablation studies of how different parameters affect image fidelity and edit strength using Direct Inversion.

2 Related Work
The task of image editing using generative models has been explored from multiple perspectives. In this section, we will briefly describe the most promising advances in recent years.
GANs
GAN [8]-based approaches to image editing usually require a two-step process. The first step is to “invert” the image, i.e., represent the original image in a latent space from which it can be re-generated. Once the image has been inverted, edits can be made in the latent space to produce the adjusted image [1, 9, 19, 25]. Recent works have improved upon this method by re-training the generative model to create images similar to the input image [2, 5, 21]. However, GANs are expensive to train and tend to generate repetitive or similar images [13], making it difficult to use GANs for general image editing.
Diffusion Models
On the other hand, diffusion models [26, 12] are more stable to train, though still expensive, and are able to produce more diverse outputs [6]. An advantage of diffusion models is classifier-free guidance [11], which allows a user to control the output of the diffusion model without re-training it. SDEdit [18] requires the user to add a brush stroke to the area to edit and then de-noises the image conditioned on the desired edit, replacing the brush stroke with pixels that match the image. Other techniques such as DiffusionCLIP [15] utilize DDIM inversion [27] and fine-tuning, conditioning on a CLIP-based loss during the de-noising diffusion process to bring the generated image closer to the desired edit.
Textual Inversion [7] and DreamBooth [23] have shown a strong capability to maintain unique object characteristics while generating completely new images. However, they require multiple images of the same object to fine-tune parts of the diffusion model and struggle to maintain a high level of faithfulness to the original object represented in their input images.
Finally, Imagic [14] demonstrated the ability to edit a single image using a text prompt by fine-tuning a diffusion model to be faithful to the original image of interest and interpolating between the latent space representations of the original image and the text prompt before performing the inversion process.
3 Method
3.1 Preliminaries

3.1.1 General Diffusion Models
Recently, diffusion models [12, 28, 31] have taken the generative artificial intelligence world by storm. Due to their ability to give neural networks adaptive computation time, their various theoretical foundations [26, 32, 28, 29], their relatively interpretable latent space, and their unrolled iterative process, they have become extremely powerful tools for image generation flexibly guided by various modalities of conditioning (text, image, masks, etc.). They have been applied to problems of image compression, classification, restoration, and much more.
Diffusion models follow a forward and a reverse process, where the forward process is a known destructive process and the reverse process learns to undo the forward process iteratively. In most formulations, the destructive process is a (generally stochastic) Gaussian noise perturbation. During the forward process, a clean training sample $x_0$ is iteratively corrupted over a pre-determined number of timesteps $T$, such that the sample at the last timestep, $x_T$, is fully corrupted, containing little to no information from the original sample $x_0$. As shown by Ho et al. [12], the forward process has a neat closed-form solution:
$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \mathbf{I}) \tag{1}$$
When the forward process is a stochastic Gaussian noise perturbation, $\epsilon \sim \mathcal{N}(0, \mathbf{I})$, every forward step involves sampling random noise to perturb the original sample at various noise scales. The sequence $\bar{\alpha}_t$ defines the amount of noise present at each intermediate timestep and is referred to as the noising schedule. During training, a neural network takes the current sample $x_t$ and predicts the noise $\epsilon_\theta(x_t, t)$. This prediction is then merged with $x_t$ to produce a slightly less noisy $x_{t-1}$. The network’s objective is to make $\epsilon_\theta(x_t, t) \approx \epsilon$.
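For concreteness, the closed form in Eq. 1 amounts to a single tensor operation. Below is a minimal sketch; the function and tensor names (q_sample, alphas_cumprod) are our own illustrative choices, not tied to any particular codebase:

```python
import torch

def q_sample(x0, t, alphas_cumprod, noise=None):
    """Corrupt a clean sample x0 directly to timestep t using Eq. 1."""
    if noise is None:
        noise = torch.randn_like(x0)
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)  # \bar{alpha}_t for each batch element
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise
```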
During inference, we start with a sample of pure noise $x_T \sim \mathcal{N}(0, \mathbf{I})$, which is refined iteratively through repeated passes through the network. There are various sampling strategies [12, 27, 17] that define how the noise prediction $\epsilon_\theta(x_t, t)$ and the current (more noisy) sample $x_t$ are merged to produce the previous (less noisy) sample $x_{t-1}$. The final sample $x_0$ is the resultant generated image. This procedure therefore defines a learned image distribution, which has been shown to be remarkably high-fidelity and to have state-of-the-art diversity of outputs [6].
Additionally, diffusion models can learn conditional distributions [11] through the inclusion of a conditioning input $c$ to the denoising network, turning $\epsilon_\theta(x_t, t)$ into $\epsilon_\theta(x_t, t, c)$. Given the flexibility of neural networks, the conditioning input can represent various modalities: text, images, class labels, embeddings, etc.
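As a concrete illustration, the standard classifier-free guidance combination [11] merges the conditional and unconditional noise predictions with a single scale; the sketch below uses our own function name:

```python
def classifier_free_guidance(eps_cond, eps_uncond, guidance_scale):
    """Blend conditional and unconditional noise predictions [11].

    guidance_scale = 1.0 recovers the purely conditional prediction; larger
    values push the sample more strongly toward the conditioning input c.
    """
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```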
3.1.2 DDIM and Determinism
As mentioned above, there are both stochastic and deterministic ways to merge the predicted noise $\epsilon_\theta(x_t, t)$ with the current timestep’s sample $x_t$ to produce a refined estimate of the previous timestep’s sample $x_{t-1}$. Song et al. proposed a reverse process that can be made deterministic [27], meaning there is a direct mapping between the random noise sample $x_T$ and the clean, generated image $x_0$. Their proposed reverse process is defined as follows:
$$x_{t-1} = \sqrt{\bar{\alpha}_{t-1}} \left( \frac{x_t - \sqrt{1 - \bar{\alpha}_t}\, \epsilon_\theta(x_t, t)}{\sqrt{\bar{\alpha}_t}} \right) + \sqrt{1 - \bar{\alpha}_{t-1} - \sigma_t^2}\; \epsilon_\theta(x_t, t) + \sigma_t \epsilon_t, \qquad \epsilon_t \sim \mathcal{N}(0, \mathbf{I}) \tag{2}$$
When $\sigma_t = 0$ for all timesteps, the reverse process becomes fully deterministic. Rather than sampling new (scaled-down) noise to add to the current clean estimate, DDIM simply adds back a scaled-down version of the exact noise predicted by the model at the current timestep. Thus, the sampled noise $x_T$ theoretically contains all of the information needed to represent the generated image $x_0$. In the context of generating new images from noise, this is not very consequential, since the same effect could be achieved by fixing the random seed of a stochastic inference process, such as in the DDPM work of Ho et al. [12]. However, the closed form of this deterministic reverse process allows “working back” an existing image into its encoded noise. Song et al. [27] validate this theoretical framework by taking existing images $x_0$, inverting them into their noise encodings $x_T$, running the reverse process again to obtain reconstructions $\hat{x}_0$, and then evaluating the mean squared error between the reconstructed samples and the original images to quantify the information content of the noise. We reproduce their results using Stable Diffusion [22] evaluated on the CIFAR-10 test dataset [16] in Figure 5. As expected, this figure shows that more inference and inversion steps result in lower reconstruction loss.
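To make the deterministic case explicit, a single DDIM update with $\sigma_t = 0$ can be sketched as follows. We assume a noise prediction eps from the denoising network and a precomputed $\bar{\alpha}$ schedule; names are illustrative:

```python
import torch

@torch.no_grad()
def ddim_step(x_t, eps, a_bar_t, a_bar_target):
    """One deterministic DDIM update (sigma_t = 0), moving x_t to noise level a_bar_target.

    Passing the *less* noisy level as a_bar_target performs a normal denoising
    step; passing the *more* noisy level instead approximately inverts a real
    image toward its noise encoding, as used in Sec. 3.2.
    """
    x0_pred = (x_t - (1.0 - a_bar_t).sqrt() * eps) / a_bar_t.sqrt()
    return a_bar_target.sqrt() * x0_pred + (1.0 - a_bar_target).sqrt() * eps
```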
3.2 Direct Inversion
Our method provides a training-free solution to the problem of real image editing. Given an input image $x_0$ and a text prompt describing the desired changes, our goal is to edit the image in a coherent manner while remaining faithful to the identity of the original image. We accomplish this by leveraging the determinism of the forward and reverse processes implied by DDIM [27] to encode the original image into the noise latent space of a latent diffusion model as $x_T^{\text{inv}}$. We then start the (deterministic) reverse process from this encoded noise, iteratively conditioning the diffusion model on the text input through classifier or classifier-free guidance and on the identity of the original image through repeated injection of the original encoded noise into the intermediate noisy samples $x_t$ for $t = T, \dots, 1$. The resultant method, which we name Direct Inversion, allows for much greater control over the degree of faithfulness to the original image’s identity by modulating the magnitude and frequency of the injected (encoded) noise. The use of a deterministic forward and reverse diffusion process to represent the input image in latent noise space allows our method to circumvent any optimization or fine-tuning.
As depicted in Figure 4, our method consists of two sequential stages: (1) we invert the input image $x_0$ into its corresponding latent noise encoding $x_T^{\text{inv}}$; (2) we start the reverse diffusion process from $x_T^{\text{inv}}$ for timesteps $t = T, \dots, 1$, conditioning on the text prompt via classifier or classifier-free guidance and on the input image encoding at every step via noise injection.
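A schematic version of this two-stage procedure, building on the ddim_step and classifier_free_guidance helpers sketched in Section 3.1, is shown below. The hyperparameter names (noise_merge_lambda, inject_every) and the exact noise-merge rule in the sketch are illustrative placeholders rather than a verbatim listing of our implementation:

```python
import torch

@torch.no_grad()
def direct_inversion_edit(x0_latent, text_emb, uncond_emb, model, a_bars,
                          guidance_scale=7.5, noise_merge_lambda=0.5, inject_every=1):
    """Illustrative sketch of the two-stage procedure (not a verbatim listing).

    model(x, t, cond) is assumed to return a noise prediction; a_bars holds the
    cumulative noise schedule; ddim_step and classifier_free_guidance are the
    helpers sketched in Sec. 3.1.
    """
    T = len(a_bars)

    # Stage 1: deterministic DDIM inversion, x_0 -> x_T^inv
    x = x0_latent
    for t in range(T - 1):
        eps = model(x, t, uncond_emb)
        x = ddim_step(x, eps, a_bars[t], a_bars[t + 1])  # step toward the noisier level
    x_T_inv = x

    # Stage 2: text-guided reverse process with repeated noise injection
    x = x_T_inv
    for t in reversed(range(1, T)):
        eps = classifier_free_guidance(model(x, t, text_emb),
                                       model(x, t, uncond_emb), guidance_scale)
        x = ddim_step(x, eps, a_bars[t], a_bars[t - 1])
        if t % inject_every == 0:
            # Re-inject the encoded noise, scaled to the current noise level.
            # This particular merge rule is an illustrative placeholder.
            x = x + noise_merge_lambda * (1.0 - a_bars[t - 1]).sqrt() * x_T_inv
    return x
```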

4 Experiments
4.1 Qualitative Evaluation
We exercise our method on a variety of real images spanning many domains in order to evaluate its effectiveness. For each edit, we provide a concise text prompt describing the desired edit at a very high level. We utilize publicly available images gathered online and curate our own target edit prompts. Given the determinism of our method, every output shown here is the single output produced for a given text prompt and real image; we are not cherry-picking results from a larger set of outputs. This is particularly notable when demonstrating our method’s consistency across multiple target edits applied to a single image. Prior works leverage the one-to-many nature of their approaches to generate multiple outputs, which are then filtered per edit, sometimes producing up to 5x as many outputs on their end before filtering down to the number we show [14].
Direct Inversion demonstrates strong results in modulating (1) pose, color, style, and structure of target objects as shown in Figure 1, Figure 3, and Figure 2, (2) background, scene, and context as shown in Figure 1 and Figure 3, and (3) racial identity/skin tone/perceived gender identity of people (in an effort to improve diversity of representation) as shown in Figure 1.
We note that there is implicit ambiguity in this problem formulation, since the target edit prompt is by definition abstract and simple. Furthermore, even if the target edit prompt is extremely specific, we might want to vary the extent to which we abide by it. Rather than deferring to diffusion models’ inherent, stochastic, one-to-many nature, we choose to expose highly configurable hyperparameters that modulate the strength, type, and coherence of the edit (through noise injection magnitude, frequency of noise injection, and continual injection vs. a single initial injection). This way, we modulate target edits in a systematic way, rather than producing many outputs and deferring to another mechanism to filter them, as other methods do.
4.2 Ablation Study

4.2.1 Effect of Inference and Inversion Steps on Image Fidelity
Firstly, we reproduce the results shown by Song et al. [27] in Figure 5, demonstrating that increasing both inversion and inference steps results in better pixel-wise reconstruction (lower MSE). Given that these results were intuitive, we further investigated the relationship between inversion and inference steps and their respective effects on image fidelity, both perceptual and pixel-wise. In Figure 6, we fix the number of inference steps at 100 and modulate the number of inversion steps to observe the resulting impact on perceptual similarity (1 - LPIPS [33]). We find that there is a region in which image fidelity is optimal under both metrics, and that naively increasing both inversion and inference steps is not necessarily ideal. This behavior is consistent across both the perceptual and pixel-wise fidelity metrics. To the best of our knowledge, this is one of the first, if not the only, thorough investigations into the relationship between inference and inversion steps to reach these conclusions.
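The fidelity metrics used in this ablation can be computed as sketched below. We assume images and reconstructions are NCHW tensors scaled to [-1, 1]; invert_and_reconstruct in the commented sweep is a hypothetical wrapper around the inversion/reconstruction procedure of Sec. 3.1.2:

```python
import torch
import lpips  # perceptual metric of Zhang et al. [33]

lpips_fn = lpips.LPIPS(net='alex')

def fidelity_metrics(x0, x0_rec):
    """Pixel-wise MSE and perceptual similarity (1 - LPIPS) between images."""
    mse = torch.mean((x0 - x0_rec) ** 2).item()
    perceptual = 1.0 - lpips_fn(x0, x0_rec).mean().item()
    return mse, perceptual

# Hypothetical sweep: fix inference steps at 100, vary inversion steps.
# for n_inv in (10, 25, 50, 100, 200):
#     x0_rec = invert_and_reconstruct(x0, inversion_steps=n_inv, inference_steps=100)
#     print(n_inv, fidelity_metrics(x0, x0_rec))
```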


4.2.2 Editability-Fidelity Tradeoff
As demonstrated by Kawar and Zada et al. [14] and Meng et al. [18], it is useful to show how real-image editing methods navigate the editability-fidelity tradeoff. In Figure 7, we show how modulating the noise merge lambda (the scaling of the inverted noise that we inject into the diffusion process) affects text alignment (editability) inversely to its effect on image fidelity. To measure editability, we compute the CLIP similarity between the output image and the edit text prompt. To measure fidelity, we compute perceptual similarity using LPIPS [33]. Figure 7 shows a wide central range of feasible values, indicating a more disentangled editability-fidelity tradeoff than in prior works [14].
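The editability metric can be computed with any CLIP implementation; the sketch below uses the Hugging Face transformers CLIP API with an illustrative backbone choice:

```python
import torch
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")  # illustrative backbone
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_text_alignment(pil_image, edit_prompt):
    """Cosine similarity between an edited image and its edit text prompt."""
    inputs = proc(text=[edit_prompt], images=pil_image, return_tensors="pt", padding=True)
    img_feat = clip.get_image_features(pixel_values=inputs["pixel_values"])
    txt_feat = clip.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    return (img_feat * txt_feat).sum(dim=-1).item()
```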
We conducted further experiments, shown in Figure 8, varying both the number of inversion steps and the number of inference steps and evaluating pixel-wise loss across the 2D span of parameters. These results led us to conclude that the parameter selection of {# inversion steps = 100, # inference steps = 100} offers an optimal tradeoff between computation time and image fidelity. All figures in this paper were consequently produced using this parameter selection.
4.2.3 Effect of Text Guidance and Noise Merge Scaling on Image Fidelity



Since the scale of injected noise and the scale of text guidance work in opposite directions to control contrasting objectives, we conducted further image fidelity experiments, depicted in Figure 9. Along with qualitative inspection, these experiments show that there exists a frontier of feasible pairs of noise merge scale and text guidance scale. This frontier of feasible parameter selections yields image edits that balance image fidelity and edit strength. Images produced through more extreme parameter selections (the extreme corners of the heatmaps) either reproduce the original image without any edits or yield an image that is very faithful to the desired edit text prompt but deviates significantly from the original image. We select a few horizontal slices of Figure 9 to visualize in Figure 10.
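The frontier in Figure 9 is obtained by sweeping the two scales jointly. A schematic version of this sweep is shown below; edit_image and to_pil are hypothetical wrappers (edit_image standing in for the procedure of Sec. 3.2), and the metric helpers come from the sketches in the previous subsections:

```python
import numpy as np

noise_merge_scales = np.linspace(0.0, 1.0, 6)
guidance_scales = np.linspace(1.0, 15.0, 6)

fidelity = np.zeros((len(noise_merge_scales), len(guidance_scales)))
alignment = np.zeros_like(fidelity)

for i, lam in enumerate(noise_merge_scales):
    for j, g in enumerate(guidance_scales):
        # edit_image is a hypothetical wrapper; x0 and prompt are the inputs under test
        out = edit_image(x0, prompt, noise_merge_lambda=lam, guidance_scale=g)
        fidelity[i, j] = 1.0 - lpips_fn(x0, out).mean().item()
        alignment[i, j] = clip_text_alignment(to_pil(out), prompt)
# Plotting `fidelity` and `alignment` as heatmaps exposes the feasible frontier.
```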
5 Conclusions and Future Work
In this work, we present Direct Inversion, a technique for editing images that can be applied to any diffusion-based model. Direct Inversion does not require any retraining and can apply both global and local style changes with only the original image and a text prompt as inputs. By expressing the input image in the noise latent space of the diffusion model and continually injecting it back into a reverse diffusion process, Direct Inversion can perform image edits competitive with state-of-the-art methods in what is computationally equivalent to two passes through a diffusion process. Qualitative results show that Direct Inversion is flexible in producing edits across a range of important image qualities, such as background and object-specific attributes. We additionally quantify many important measurements for image editing with Direct Inversion, such as the editability-fidelity tradeoff and reconstruction ability as a measure of image fidelity.
Future work should focus on further characterizing the capabilities of Direct Inversion and how these capabilities are limited by the underlying diffusion model. The noise latent space that Direct Inversion acts in is rich with information that is not easily interpretable, and we believe there is still much work to do in exploiting this space to produce new diffusion model capabilities. Furthermore, it would be useful to introduce stochasticity into part of the Direct Inversion process in order to generate multiple outputs for a given input.
6 Ethics Statement
In this work, we propose Direct Inversion, a new technique for controlled image editing and (to a degree) synthesis. Like many other Machine Learning techniques – especially ones focused on image synthesis – misuse of this method could result in negative societal impacts. Although our work is created with the intent to make a positive societal impact, we acknowledge that caution and deliberation ought to be exercised in applications of these methods.
We demonstrate an ability to modulate skin tones and racial identities in existing real images. Although this ability is not unique to our work (multiple recent works have similar capabilities, albeit requiring more time or more inputs), we recognize that our work makes it more accessible. Additionally, it is problematic to rely on large-scale pre-trained diffusion models’ understanding of race, as this could exacerbate racial biases present in the models’ training. These models were also trained on open-source data that often include problematic and biased labels, which may impact the output of our method. We implore readers and fellow researchers to continue research in a direction of systematic equality and fairness across many axes.
References
- [1] Rameen Abdal, Peihao Zhu, Niloy Jyoti Mitra, and Peter Wonka. Styleflow: Attribute-conditioned exploration of stylegan-generated images using conditional continuous normalizing flows. ACM Transactions on Graphics (TOG), 40:1 – 21, 2021.
- [2] Yuval Alaluf, Omer Tov, Ron Mokady, Rinon Gal, and Amit H. Bermano. Hyperstyle: Stylegan inversion with hypernetworks for real image editing. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18490–18500, 2022.
- [3] Omri Avrahami, Ohad Fried, and Dani Lischinski. Blended latent diffusion. ArXiv, abs/2206.02779, 2022.
- [4] Omri Avrahami, Dani Lischinski, and Ohad Fried. Blended diffusion for text-driven editing of natural images. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18187–18197, 2022.
- [5] David Bau, Hendrik Strobelt, William S. Peebles, Jonas Wulff, Bolei Zhou, Jun-Yan Zhu, and Antonio Torralba. Semantic photo manipulation with a generative image prior. ACM Transactions on Graphics (TOG), 38:1 – 11, 2019.
- [6] Prafulla Dhariwal and Alex Nichol. Diffusion models beat gans on image synthesis. ArXiv, abs/2105.05233, 2021.
- [7] Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H. Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion. ArXiv, abs/2208.01618, 2022.
- [8] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. 2014.
- [9] Erik Härkönen, Aaron Hertzmann, Jaakko Lehtinen, and Sylvain Paris. Ganspace: Discovering interpretable gan controls. ArXiv, abs/2004.02546, 2020.
- [10] Amir Hertz, Ron Mokady, Jay M. Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image editing with cross attention control. ArXiv, abs/2208.01626, 2022.
- [11] Jonathan Ho. Classifier-free diffusion guidance. ArXiv, abs/2207.12598, 2022.
- [12] Jonathan Ho, Ajay Jain, and P. Abbeel. Denoising diffusion probabilistic models. ArXiv, abs/2006.11239, 2020.
- [13] Tero Karras, Miika Aittala, Janne Hellsten, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Training generative adversarial networks with limited data. ArXiv, abs/2006.06676, 2020.
- [14] Bahjat Kawar, Shiran Zada, Oran Lang, Omer Tov, Hui-Tang Chang, Tali Dekel, Inbar Mosseri, and Michal Irani. Imagic: Text-based real image editing with diffusion models. ArXiv, abs/2210.09276, 2022.
- [15] Gwanghyun Kim and Jong-Chul Ye. Diffusionclip: Text-guided image manipulation using diffusion models. ArXiv, abs/2110.02711, 2021.
- [16] Alex Krizhevsky. Learning multiple layers of features from tiny images. 2009.
- [17] Luping Liu, Yi Ren, Zhijie Lin, and Zhou Zhao. Pseudo numerical methods for diffusion models on manifolds. ArXiv, abs/2202.09778, 2022.
- [18] Chenlin Meng, Yang Song, Jiaming Song, Jiajun Wu, Junyan Zhu, and Stefano Ermon. Sdedit: Image synthesis and editing with stochastic differential equations. ArXiv, abs/2108.01073, 2021.
- [19] Or Patashnik, Zongze Wu, Eli Shechtman, Daniel Cohen-Or, and Dani Lischinski. Styleclip: Text-driven manipulation of stylegan imagery. 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 2065–2074, 2021.
- [20] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. ArXiv, abs/2102.12092, 2021.
- [21] Daniel Roich, Ron Mokady, Amit H. Bermano, and Daniel Cohen-Or. Pivotal tuning for latent-based editing of real images. ACM Transactions on Graphics (TOG), 2022.
- [22] Robin Rombach, A. Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10674–10685, 2022.
- [23] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. ArXiv, abs/2208.12242, 2022.
- [24] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L. Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, Seyedeh Sara Mahdavi, Raphael Gontijo Lopes, Tim Salimans, Jonathan Ho, David J. Fleet, and Mohammad Norouzi. Photorealistic text-to-image diffusion models with deep language understanding. ArXiv, abs/2205.11487, 2022.
- [25] Yujun Shen, Jinjin Gu, Xiaoou Tang, and Bolei Zhou. Interpreting the latent space of gans for semantic face editing. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9240–9249, 2020.
- [26] Jascha Narain Sohl-Dickstein, Eric A. Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. ArXiv, abs/1503.03585, 2015.
- [27] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. ArXiv, abs/2010.02502, 2021.
- [28] Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. ArXiv, abs/1907.05600, 2019.
- [29] Yang Song, Jascha Narain Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. ArXiv, abs/2011.13456, 2021.
- [30] Dani Valevski, Matan Kalman, Y. Matias, and Yaniv Leviathan. Unitune: Text-driven image editing by fine tuning an image generation model on a single image. 2022.
- [31] Pascal Vincent. A connection between score matching and denoising autoencoders. Neural Computation, 23:1661–1674, 2011.
- [32] Max Welling and Yee Whye Teh. Bayesian learning via stochastic gradient langevin dynamics. In ICML, 2011.
- [33] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.