TextureDiffusion: Target Prompt Disentangled Editing for Various Texture Transfer
Abstract
Recently, text-guided image editing has achieved significant success. However, when changing the texture of an object, existing methods can only apply simple textures such as wood or gold; complex textures such as cloud or fire pose a challenge. This limitation stems from the fact that the target prompt must contain both the input image content and the texture, which restricts the texture representation. In this paper, we propose TextureDiffusion, a tuning-free image editing method for various texture transfer. Initially, the target prompt is directly set to “texture”, disentangling the texture from the input image content to enhance texture representation. Subsequently, query features in self-attention and features in residual blocks are utilized to preserve the structure of the input image. Finally, to maintain the background, we introduce an edit localization technique which blends the self-attention results and the intermediate latents. Comprehensive experiments demonstrate that TextureDiffusion can harmoniously transfer various textures with excellent structure and background preservation. Code is publicly available at https://github.com/THU-CVML/TextureDiffusion.
Index Terms:
Image editing, Diffusion models, AIGC.

I INTRODUCTION
Despite the powerful content generation capabilities of text-to-image generative models [1, 2, 3, 4, 5, 6], users still have limited control over the generated images. To increase this control, text-guided image editing is particularly important.
Existing text-guided image editing methods [7, 8, 9, 10, 11, 12, 13, 14, 15] can accomplish various editing tasks, such as object addition and removal, action change, and texture change. Prompt-to-Prompt (P2P) [13] found that cross-attention maps encode the mapping between text tokens and image regions. Plug-and-Play (PnP) [14] injected self-attention maps and features into the generation process of the target image to maintain the consistency of the spatial layout. InfEdit [15] introduced a virtual inversion strategy and unified attention control to facilitate consistent and accurate editing.
However, for the texture transfer task, i.e., changing the texture of the target object, previous methods are limited to simple textures like wood or gold. The challenge arises when attempting to transfer more complex textures, such as cloud or fire. When describing <texture> in the target prompt, “wood” corresponds to “wooden” and “gold” corresponds to “golden”, but there is no corresponding adjective for “cloud”. If “cloud” is forced into the text description, previous methods cannot successfully transfer the texture, as shown in Fig. 1. This limitation stems from the fact that the target prompt must contain both the input image content and <texture>, which restricts the texture representation.


Thus our core idea is to directly set the target prompt to “texture”, disentangling the texture from the description of the input image. Based on this, we propose TextureDiffusion, a tuning-free image editing method for various texture transfer. Initially, the target prompt is modified so that the texture representation is unrestricted. Subsequently, to preserve the structure of the input image, query features in self-attention and features in residual blocks are injected during the generation of the edited image. Finally, to maintain the background, we introduce an edit localization technique which blends the self-attention results and the intermediate latents.
Our main contributions are summarized as follows. 1) We propose a tuning-free image editing method named TextureDiffusion for various texture transfer. 2) We directly set the target prompt to “texture” to improve texture representation. 3) Comprehensive experiments demonstrate that TextureDiffusion can harmoniously transfer various textures with excellent structure and background preservation.

II METHOD
The pipeline of our method is depicted in Fig. 2. Given an input image and a related text prompt, our goal is to transfer various textures to the target object, aligned with the target text prompt. In this section, we first review the basics of diffusion models in Section II-A. Subsequently, a structure preservation module is introduced to maintain structural similarity between the edited image and the input image in Section II-B. Finally, we propose an edit localization technique to restrict the edit to the target object while keeping the rest unchanged in Section II-C.
II-A Preliminaries
Diffusion models [16, 17, 18, 19, 20] are generative models that generate data by iterative denoising, starting from Gaussian noise. They consist of a forward process and a reverse process. The forward process adds noise to a data sample $x_0$ at time step $t$ to generate the noisy sample $x_t$:
$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon, \quad \epsilon \sim \mathcal{N}(0, \mathbf{I}),$$
where $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$, $\alpha_t = 1 - \beta_t$, and $\beta_t$ denotes the predefined noise schedule. The reverse process removes the noise from the previous sample $x_t$ to generate a cleaner sample $x_{t-1}$:
$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\Big(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x_t, t)\Big) + \sigma_t z, \quad z \sim \mathcal{N}(0, \mathbf{I}),$$
where $\epsilon_\theta$ is the noise-prediction network and $\sigma_t$ controls the stochasticity of sampling. The network is trained with the objective
$$\min_\theta \; \mathbb{E}_{x_0,\, \epsilon \sim \mathcal{N}(0,\mathbf{I}),\, t}\big[\|\epsilon - \epsilon_\theta(x_t, t)\|_2^2\big].$$
Additionally, when $\epsilon_\theta$ is conditioned on a text prompt $P$, it can be formulated as $\epsilon_\theta(x_t, t, P)$. After doing so, the diffusion model can generate images that match the provided text prompt.
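To make the formulas above concrete, the following is a minimal sketch of the forward noising step and the ε-prediction training loss. The linear noise schedule and the `eps_model` callable are assumptions for illustration, not the settings used in this paper.

```python
import torch

# Minimal sketch of the DDPM forward process and epsilon-prediction loss.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)          # assumed noise schedule beta_t
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)      # \bar{alpha}_t = prod_s alpha_s

def forward_noise(x0, t):
    """x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * eps, x0 shape (B, C, H, W)."""
    eps = torch.randn_like(x0)
    a_bar = alpha_bars[t].view(-1, 1, 1, 1)
    xt = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps
    return xt, eps

def training_loss(eps_model, x0):
    """L = E_{x0, eps, t} || eps - eps_theta(x_t, t) ||^2."""
    t = torch.randint(0, T, (x0.shape[0],))
    xt, eps = forward_noise(x0, t)
    return torch.nn.functional.mse_loss(eps_model(xt, t), eps)
```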
Our method is based on the state-of-the-art text-to-image model Stable Diffusion (SD) [21]. SD belongs to the family of Latent Diffusion Models (LDMs), which perform the diffusion process in a latent space. SD is built on the U-Net architecture [22]. The U-Net contains a series of basic blocks, each consisting of a residual block [23], a self-attention module, and a cross-attention module [24]. The self-attention module contains important semantic information, and its output can be formulated as follows:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right)V, \tag{1}$$
where $Q$, $K$, and $V$ are the query, key, and value features projected from the spatial features with the corresponding projection matrices, and $d$ is the feature dimension.
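For concreteness, below is a minimal single-head sketch of Eq. (1); the projection matrices `W_q`, `W_k`, `W_v` and the input layout `(batch, tokens, channels)` are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def self_attention(x, W_q, W_k, W_v):
    """Eq. (1): softmax(Q K^T / sqrt(d)) V, with Q, K, V projected from
    the spatial features x of shape (batch, tokens, channels)."""
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    d = Q.shape[-1]
    attn = F.softmax(Q @ K.transpose(-2, -1) / d**0.5, dim=-1)
    return attn @ V
```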
II-B Structure Preservation
After directly modifying the target prompt to “texture”, information about the content of the input image is lost. Thus the structure of the input image needs to be preserved.
As noted in previous work [25, 26, 27], in the self-attention module of the SD U-Net, the query features control the overall layout of the generated image, while the key and value features control the semantic contents. Therefore, we inject the query features of the self-attention module into the generation process of the edited image; the result is shown in Fig. 4. The structure of the input image is partially preserved after injecting the query features, but this alone is insufficient, and more structural information needs to be injected. Inspired by [14], which demonstrated that features in residual blocks contain the structural information of the input image, we further inject the features in residual blocks; the experimental results are shown in Fig. 4. The structure of the input image is well maintained when the query features in the self-attention module and the features in residual blocks are injected at the same time.
In addition, since the generation process of the diffusion model proceeds from the overall layout to the semantic details, structural information is injected only in the early and middle stages of the generation process. We do not inject structural information in the later stages, which allows the texture details to be fully represented.
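To illustrate the injection idea, here is a hypothetical sketch in which features cached from the source-image branch replace those of the edited branch. The layer indices and step budget follow the settings reported in Section III; the function names, tensor layouts, and hook structure are assumptions rather than the actual implementation.

```python
import torch
import torch.nn.functional as F

# Hypothetical sketch of structure preservation: source-branch queries replace
# those of the edited branch in selected self-attention layers during the
# early/middle denoising steps; residual-block features are replaced as well.
INJECT_LAYERS = {12, 13, 14, 15}   # self-attention layers used for query injection
INJECT_STEPS = 40                  # inject for the first 40 of 50 DDIM steps

def injected_self_attention(x_edit, q_src, W_q, W_k, W_v, layer, step):
    Q = x_edit @ W_q
    if layer in INJECT_LAYERS and step < INJECT_STEPS:
        Q = q_src                  # reuse the cached source query features
    K, V = x_edit @ W_k, x_edit @ W_v
    attn = F.softmax(Q @ K.transpose(-2, -1) / Q.shape[-1]**0.5, dim=-1)
    return attn @ V

def injected_residual(h_edit, h_src, layer, res_layer=7):
    # Replace residual-block features of the edited branch with the source ones
    # at the chosen layer (all denoising steps).
    return h_src if layer == res_layer else h_edit
```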
TABLE I: Quantitative comparison with baselines: structure distance, background preservation (PSNR, LPIPS, MSE, SSIM outside the editing mask), and CLIP similarity between the texture prompt and the edited region.

| Method | Structure Distance ↓ | PSNR ↑ | LPIPS ↓ | MSE ↓ | SSIM ↑ | CLIP Similarity (Edited) ↑ |
| --- | --- | --- | --- | --- | --- | --- |
| SDEdit | 80.35 | 18.43 | 224.08 | 208.89 | 71.33 | 16.45 |
| P2P | 72.89 | 18.52 | 183.54 | 187.98 | 75.76 | 15.47 |
| MasaCtrl | 28.53 | 23.55 | 87.61 | 67.3 | 84.45 | 15.92 |
| PnP | 33.23 | 23.87 | 100.17 | 66.77 | 82.66 | 16.29 |
| FPE | 11.57 | 26.79 | 55.93 | 37.29 | 87.23 | 15.73 |
| InfEdit | 22.74 | 24.28 | 57.33 | 66.37 | 85.8 | 15.97 |
| Ours | 10.39 | 31.22 | 31.99 | 14.92 | 90.08 | 16.88 |
II-C Edit Localization
To localize the edit on the target object while keeping the rest unchanged, we introduce an edit localization technique.
Initially, the position of the target object must be identified. Drawing inspiration from [13], we note that the cross-attention maps contain the location information of the prompt tokens. Therefore, we aggregate the cross-attention maps across all heads and layers at the spatial resolution of 16×16. Subsequently, we extract the map corresponding to the target object and binarize it to derive the mask $M$.
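A minimal sketch of this mask extraction is given below. The threshold value, the output resolution, and the tensor layout of the collected cross-attention maps are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def object_mask(cross_attn_maps, token_idx, threshold=0.3, out_size=64):
    """Derive a binary mask M for the target object from cross-attention.

    cross_attn_maps: list of tensors of shape (heads, 16*16, tokens), one per
    16x16 cross-attention layer; token_idx indexes the target-object token.
    The 0.3 threshold and 64x64 output (latent resolution) are assumptions.
    """
    maps = torch.cat(cross_attn_maps, dim=0)          # (all heads/layers, 256, tokens)
    avg = maps[..., token_idx].mean(dim=0)            # average over heads and layers
    avg = avg.reshape(1, 1, 16, 16)
    avg = F.interpolate(avg, size=(out_size, out_size), mode="bilinear")
    avg = (avg - avg.min()) / (avg.max() - avg.min() + 1e-8)
    return (avg > threshold).float()                  # binary mask M
```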
Since the self-attention module in the SD U-Net contains important semantic information, we blend the self-attention results from the source image and the edited image:
$$A_{\mathrm{src}} = \mathrm{Attention}(Q_{\mathrm{src}}, K_{\mathrm{src}}, V_{\mathrm{src}}), \tag{2}$$
$$A_{\mathrm{edit}} = \mathrm{Attention}(Q_{\mathrm{edit}}, K_{\mathrm{edit}}, V_{\mathrm{edit}}), \tag{3}$$
$$\hat{A} = M \odot A_{\mathrm{edit}} + (1 - M) \odot A_{\mathrm{src}}, \tag{4}$$
where $\odot$ represents the Hadamard product and $\hat{A}$ denotes the ultimate attention output. To further keep the remainder unchanged, we blend the intermediate latents of the source and edited images:
$$\hat{z}_t = M \odot z_t^{\mathrm{edit}} + (1 - M) \odot z_t^{\mathrm{src}}, \tag{5}$$
where $z_t^{\mathrm{edit}}$ denotes the intermediate latents of the edited image. Using this edit localization technique, the edit is restricted to the target object, keeping the remainder unchanged.
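The blends in Eqs. (4) and (5) could look roughly like the following sketch, reusing the mask from above; tensor shapes are assumed to broadcast.

```python
def blend_attention(attn_src, attn_edit, mask):
    """Eq. (4): keep edited attention inside the object region, source outside."""
    return mask * attn_edit + (1.0 - mask) * attn_src

def blend_latents(z_src, z_edit, mask):
    """Eq. (5): restrict the edit to the masked region in latent space."""
    return mask * z_edit + (1.0 - mask) * z_src
```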
III EXPERIMENTS
We implement the proposed method on Stable Diffusion [21] using the publicly available v1.4 checkpoint. During sampling, we apply DDIM [18] with 50 denoising steps and set the classifier-free guidance scale to 7.5. Query feature injection in the self-attention module is performed in the first 40 steps and in layers 12 to 15 of the U-Net. Feature injection in the residual blocks is performed in all steps and in layer 7 of the U-Net.
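For reference, a minimal sketch of this sampling setup with the diffusers library is shown below. It covers only the checkpoint, scheduler, and guidance settings; the inversion of the input image and the feature-injection hooks described in Section II are omitted, and the texture word is a placeholder.

```python
import torch
from diffusers import StableDiffusionPipeline, DDIMScheduler

# Sketch of the sampling setup: SD v1.4, DDIM, 50 steps, CFG scale 7.5.
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)

image = pipe(
    prompt="cloud",                # target prompt is just the texture word,
                                   # disentangled from the image content
    num_inference_steps=50,
    guidance_scale=7.5,
).images[0]
```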
III-A Comparisons with Previous Works
We compare the proposed method with state-of-the-art baselines for text-guided image editing, including SDEdit [28], P2P [13], PnP [14], MasaCtrl [25], FPE [29], and InfEdit [15]. We use their open-source code to produce the editing results.
Qualitative Experiments As shown in Fig. 3, we present the qualitative results of our method compared with the baselines. SDEdit edits the input image by adding noise to it and then denoising, but this process does not preserve the structure of the input image. P2P adds an additional cross-attention map corresponding to the texture, which alters the structure of the input image and changes the shape of the target object. MasaCtrl applies mutual self-attention to preserve the contents of the input image, which prevents the texture of the target object from being changed. PnP and FPE inject structural information from the input image to maintain the structure, and InfEdit uses virtual inversion to achieve efficient image reconstruction. However, in all of these methods, the description of the input image in the target prompt restricts the representation of the texture, preventing the texture from being successfully transferred. In contrast, our method successfully transfers various textures to the target object while keeping the remainder unchanged.
Quantitative Experiments The dataset is the “change material” editing subset of PIE-Bench [30]. We find that some text prompts do not meet the standards for changing material, so we modify them. To demonstrate the effectiveness of our method, we employ six metrics covering three aspects: structure distance [31], background preservation (PSNR, LPIPS [32], MSE, and SSIM [33] outside the annotated editing mask), and edit prompt-image consistency (CLIP Similarity [34]). Note that to evaluate whether the texture has been transferred to the target object, we set the prompt to the texture word only and calculate the CLIP Similarity between the prompt and the target object region of the edited image.
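One possible way to compute this region-restricted CLIP similarity, using the Hugging Face CLIP model, is sketched below. The CLIP backbone, the masking-out of the background, and the preprocessing are assumptions, not necessarily the evaluation protocol used here.

```python
import numpy as np
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Sketch: similarity between the texture word and the masked object region.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def masked_clip_similarity(image: Image.Image, mask: np.ndarray, texture_word: str):
    # mask is a binary {0, 1} array of shape (H, W); black out the background.
    masked = np.array(image) * mask[..., None].astype(np.uint8)
    inputs = processor(text=[texture_word], images=Image.fromarray(masked),
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (img @ txt.T).item()
```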
Tab. I shows the quantitative results of our method compared with the baselines. As seen, our method outperforms the baselines, achieving the best structure preservation, the best background preservation, and the highest fidelity to the prompt.

III-B Ablation Study
We conduct an ablation study to validate the effectiveness of our core components; the results are shown in Fig. 4. As seen, the texture can be fully represented when the target prompt is directly set to “texture”. When both the query features in the self-attention module and the features in residual blocks are injected during the generation of the edited image, the structure of the input image is well preserved. When the proposed edit localization technique is applied, the background is well retained.
IV CONCLUSION
We proposed TextureDiffusion, a tuning-free image editing method for various texture transfer. We enhanced the representation of complex textures by directly setting the target prompt to “texture”. We also presented a structure preservation module and an edit localization technique. Comprehensive experiments show that TextureDiffusion can harmoniously transfer various textures with excellent structure and background preservation. Although we introduced the edit localization technique, the background is still slightly altered due to the upper limit of the image reconstruction quality of the variational autoencoder. We will explore transferring multiple textures simultaneously in the future.
References
- [1] A. Ramesh, M. Pavlov, G. Goh, S. Gray, C. Voss, A. Radford, M. Chen, and I. Sutskever, “Zero-shot text-to-image generation,” in ICML. PMLR, 2021, pp. 8821–8831.
- [2] A. Nichol, P. Dhariwal, A. Ramesh, P. Shyam, P. Mishkin, B. McGrew, I. Sutskever, and M. Chen, “Glide: Towards photorealistic image generation and editing with text-guided diffusion models,” arXiv preprint arXiv:2112.10741, 2021.
- [3] J. Yu, Y. Xu, J. Y. Koh, T. Luong, G. Baid, Z. Wang, V. Vasudevan, A. Ku, Y. Yang, B. K. Ayan et al., “Scaling autoregressive models for content-rich text-to-image generation,” arXiv preprint arXiv:2206.10789, 2022.
- [4] A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen, “Hierarchical text-conditional image generation with clip latents,” arXiv preprint arXiv:2204.06125, 2022.
- [5] Y. Balaji, S. Nah, X. Huang, A. Vahdat, J. Song, K. Kreis, M. Aittala, T. Aila, S. Laine, B. Catanzaro et al., “ediff-i: Text-to-image diffusion models with an ensemble of expert denoisers,” arXiv preprint arXiv:2211.01324, 2022.
- [6] K. Chen, J. Song, S. Liu, N. Yu, Z. Feng, G. Han, and M. Song, “Distribution knowledge embedding for graph pooling,” IEEE Transactions on Knowledge and Data Engineering, 2022.
- [7] F. Yang, S. Yang, M. A. Butt, J. van de Weijer et al., “Dynamic prompt learning: Addressing cross-attention leakage for text-based image editing,” NeurIPS, vol. 36, pp. 26291–26303, 2023.
- [8] B. Kawar, S. Zada, O. Lang, O. Tov, H. Chang, T. Dekel, I. Mosseri, and M. Irani, “Imagic: Text-based real image editing with diffusion models,” in CVPR, 2023, pp. 6007–6017.
- [9] T. Brooks, A. Holynski, and A. A. Efros, “Instructpix2pix: Learning to follow image editing instructions,” in CVPR, 2023, pp. 18392–18402.
- [10] O. Bar-Tal, D. Ofri-Amar, R. Fridman, Y. Kasten, and T. Dekel, “Text2live: Text-driven layered image and video editing,” in ECCV. Springer, 2022.
- [11] G. Kim, T. Kwon, and J. C. Ye, “Diffusionclip: Text-guided diffusion models for robust image manipulation,” in CVPR, 2022, pp. 2426–2435.
- [12] M. Huang, J. Cai, S. Jia, V. S. Lokhande, and S. Lyu, “Multiedits: Simultaneous multi-aspect editing with text-to-image diffusion models,” arXiv preprint arXiv:2406.00985, 2024.
- [13] A. Hertz, R. Mokady, J. Tenenbaum, K. Aberman, Y. Pritch, and D. Cohen-Or, “Prompt-to-prompt image editing with cross attention control,” arXiv preprint arXiv:2208.01626, 2022.
- [14] N. Tumanyan, M. Geyer, S. Bagon, and T. Dekel, “Plug-and-play diffusion features for text-driven image-to-image translation,” in CVPR, 2023, pp. 1921–1930.
- [15] S. Xu, Y. Huang, J. Pan, Z. Ma, and J. Chai, “Inversion-free image editing with natural language,” arXiv preprint arXiv:2312.04965, 2023.
- [16] J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” NeurIPS, vol. 33, pp. 6840–6851, 2020.
- [17] A. Q. Nichol and P. Dhariwal, “Improved denoising diffusion probabilistic models,” in ICML. PMLR, 2021, pp. 8162–8171.
- [18] J. Song, C. Meng, and S. Ermon, “Denoising diffusion implicit models,” arXiv preprint arXiv:2010.02502, 2020.
- [19] J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli, “Deep unsupervised learning using nonequilibrium thermodynamics,” in ICML. PMLR, 2015, pp. 2256–2265.
- [20] P. Dhariwal and A. Nichol, “Diffusion models beat gans on image synthesis,” NeurIPS, vol. 34, pp. 8780–8794, 2021.
- [21] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” in CVPR, 2022, pp. 10684–10695.
- [22] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in MICCAI. Springer, 2015, pp. 234–241.
- [23] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in CVPR, 2016, pp. 770–778.
- [24] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” NeurIPS, vol. 30, 2017.
- [25] M. Cao, X. Wang, Z. Qi, Y. Shan, X. Qie, and Y. Zheng, “Masactrl: Tuning-free mutual self-attention control for consistent image synthesis and editing,” in ICCV, October 2023, pp. 22560–22570.
- [26] J. Z. Wu, Y. Ge, X. Wang, S. W. Lei, Y. Gu, Y. Shi, W. Hsu, Y. Shan, X. Qie, and M. Z. Shou, “Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation,” in ICCV, 2023, pp. 7623–7633.
- [27] O. Patashnik, D. Garibi, I. Azuri, H. Averbuch-Elor, and D. Cohen-Or, “Localizing object-level shape variations with text-to-image diffusion models,” in ICCV, 2023, pp. 23051–23061.
- [28] C. Meng, Y. He, Y. Song, J. Song, J. Wu, J.-Y. Zhu, and S. Ermon, “Sdedit: Guided image synthesis and editing with stochastic differential equations,” in ICLR, 2022.
- [29] B. Liu, C. Wang, T. Cao, K. Jia, and J. Huang, “Towards understanding cross and self-attention in stable diffusion for text-guided image editing,” in CVPR, 2024, pp. 7817–7826.
- [30] X. Ju, A. Zeng, Y. Bian, S. Liu, and Q. Xu, “Pnp inversion: Boosting diffusion-based editing with 3 lines of code,” in ICLR, 2024.
- [31] N. Tumanyan, O. Bar-Tal, S. Bagon, and T. Dekel, “Splicing vit features for semantic appearance transfer,” in CVPR, 2022, pp. 10748–10757.
- [32] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, “The unreasonable effectiveness of deep features as a perceptual metric,” in CVPR, 2018, pp. 586–595.
- [33] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE transactions on image processing, vol. 13, no. 4, pp. 600–612, 2004.
- [34] C. Wu, L. Huang, Q. Zhang, B. Li, L. Ji, F. Yang, G. Sapiro, and N. Duan, “Godiva: Generating open-domain videos from natural descriptions,” arXiv preprint arXiv:2104.14806, 2021.