
SEED-Data-Edit Technical Report:
A Hybrid Dataset for Instructional Image Editing

Yuying Ge1∗   Sijie Zhao1∗   Chen Li2∗   Yixiao Ge1,2†   Ying Shan1,2
1Tencent AI Lab   2ARC Lab
Tencent PCG
Abstract

In this technical report, we introduce SEED-Data-Edit: a unique hybrid dataset for instruction-guided image editing, which aims to facilitate image manipulation using open-form language. SEED-Data-Edit is composed of three distinct types of data: (1) High-quality editing data produced by an automated pipeline, ensuring a substantial volume of diverse image editing pairs. (2) Real-world scenario data collected from the internet, which captures the intricacies of user intentions for promoting the practical application of image editing in the real world. (3) High-precision multi-turn editing data annotated by humans, which involves multiple rounds of edits for simulating iterative editing processes. The combination of these diverse data sources makes SEED-Data-Edit a comprehensive and versatile dataset for training language-guided image editing models. We fine-tune a pre-trained Multimodal Large Language Model (MLLM) that unifies comprehension and generation with SEED-Data-Edit. The instruction-tuned model demonstrates promising results, indicating the potential and effectiveness of SEED-Data-Edit in advancing the field of instructional image editing. The datasets are released at https://huggingface.co/datasets/AILab-CVC/SEED-Data-Edit.


Figure 1: Data examples of instruction-guided image editing in SEED-Data-Edit, which includes (1) High-quality editing data produced by an automatic pipeline (first row), (2) Real-world scenario data scraped from the internet that more accurately reflects user image editing intentions (second row), (3) High-precision multi-turn editing data annotated by Photoshop experts (third row).
∗Equal Contribution.   †Correspondence to [email protected].

1 Introduction

Instruction-guided image editing [1, 2, 3, 4, 5] is an emerging field that empowers users to manipulate images using natural language instructions, without complex descriptions or region-specific masks. This advancement significantly enhances the controllability and flexibility of image manipulation. However, compared with text-to-image generation [6, 7, 8], instructional image editing is more challenging, since it necessitates a comprehensive understanding of both language and visual content, as well as the capacity to handle iterative instructions while preserving visual realism and semantic consistency. A significant hurdle in training models for instruction-guided image editing is the lack of high-quality, large-scale datasets, which are indispensable for a model to learn to accurately interpret and execute editing instructions.

In this technical report, we introduce SEED-Data-Edit, a unique hybrid dataset meticulously crafted to tackle the challenges of instruction-guided image editing. As shown in Fig. 1, SEED-Data-Edit integrates three distinct types of data, providing a rich and versatile resource for training language-guided image editing models. The first part, automated pipeline-generated editing samples, ensures a substantial volume of diverse image editing pairs. The second part, real-world scenario data, encapsulates the complexities of user intentions, promoting the practical application of image editing in open-form contexts. The third part, human-annotated multi-turn editing data, simulates iterative editing processes, enabling models to learn multiple rounds of image editing.

Specifically, for the first part of SEED-Data-Edit, we employ two automated pipelines to produce millions of editing pairs, as shown in Fig. 2. In pipeline (a), an object is initially segmented and subsequently removed using an inpainting module. This process results in a set of “Remove” and “Add” editing samples, where the “Add” samples are generated by reversing the “Remove” operation. In pipeline (b), we employ an image-guided text-to-image generation model to create source and target images based on the original image, the source caption, and the target caption after editing. This process yields editing samples with changes in style, object, color, material, or expression. The editing samples generated by these automated pipelines are further filtered with various rules, ultimately yielding a total of 3.5 million editing pairs.

For the second part of SEED-Data-Edit, we crawl image editing pairs from multiple websites where amateur photographers post their images along with editing requests. These requests are then addressed by Photoshop experts who provide the edited images. To ensure consistency between the instructions and the before-and-after editing images, all editing pairs are manually re-annotated with instructions by human annotators. This process results in a collection of 52K image editing pairs that reflect real-world scenarios. For the third part of SEED-Data-Edit, we employ Photoshop experts to perform a series of multi-turn edits on real images and record the editing instruction for each round. This process yields 95K image editing pairs organized into 21K multi-turn sequences, with up to five rounds per sequence.

In order to demonstrate the effectiveness of SEED-Data-Edit, we fine-tune a pre-trained Multimodal Large Language Model (MLLM), SEED-X [9], with this dataset, yielding the instruction-tuned model SEED-X-Edit. SEED-X unifies comprehension and generation by decoding images from the predicted ViT [10] features with a pre-trained visual de-tokenizer, which enables it to understand multimodal input (e.g., a source image and a language instruction) and generate a target image after editing. The SEED-X-Edit model achieves promising results in language-guided image editing, which showcases the potential of our dataset in advancing this field. It is worth noting that SEED-X also pre-trains its visual de-tokenizer with SEED-Data-Edit, incorporating the conditional image (i.e., the source image) as an additional input besides the high-level image features, so that the de-tokenizer can recover the fine-grained details of the source image when decoding the target image. This characteristic is particularly beneficial for high-precision image editing, where the details of the source image should be preserved.

Through combining large-scale automated pipeline-generated edits, real-world scenario editing examples, and human-annotated multi-turn editing data, we aim to make SEED-Data-Edit a comprehensive and versatile resource for training effective language-guided image editing models. All data of SEED-Data-Edit and the instruction-tuned model SEED-X-Edit are released.

Table 1: Comparison of existing image editing datasets. “Real Image for Edit” denotes whether real images are used for editing instead of images generated by models. “Real-world Scenario” indicates whether images edited by users in the real world are included. “Human” denotes whether human annotators are involved.

Dataset         | Real Image for Edit | Real-world Scenario | Human | # Edits   | # Max Rounds | # Multi-turn
InstructPix2Pix | ✗                   | ✗                   | ✗     | 313,010   | 1            | 0
MagicBrush      | ✓                   | ✗                   | ✓     | 10,388    | 3            | 3,088
HQ-Edit         | ✗                   | ✗                   | ✗     | 197,350   | 1            | 0
SEED-Data-Edit  | ✓                   | ✓                   | ✓     | 3,669,644 | 5            | 21,382

2 Related Work

Table 1 compares existing language-guided image editing datasets, including InstructPix2Pix [2], MagicBrush [1], and HQ-Edit [3]. InstructPix2Pix adopts Prompt-to-Prompt [11] to generate a source image and a target image based on an input caption from the LAION-Aesthetics [12] dataset and a target caption after editing. However, it only includes single-turn image editing pairs, and all images are model-generated without the inclusion of real images. MagicBrush hires crowd workers on Amazon Mechanical Turk (AMT) to manually annotate images from the MS COCO dataset [13] using the DALL-E 2 platform [14] for multi-turn editing pairs. However, it only includes 10K pairs with a maximum of three rounds per editing sequence. HQ-Edit first generates image descriptions and edit instructions using GPT-4 [15], and then generates diptychs using GPT-4V [16] and DALL-E 3 [17] based on the image descriptions and instructions. The diptychs are further split into a source image and a target image, with the instruction being rewritten by GPT-4V. However, the generation of diptychs does not guarantee that the fine-grained details of the source image are preserved in the target image, and the generated images lack realism. In SEED-Data-Edit, we combine three distinct types of data: large-scale automated pipeline-generated edits, real-world scenario data, and multi-turn editing data annotated by Photoshop experts. The dataset comprises a total of 3.7M image editing pairs and 21K multi-turn editing sequences, with a maximum of 5 rounds per sequence.

3 Dataset

As demonstrated in Fig. 1 and Fig. 3, SEED-Data-Edit is composed of three distinct types of data. (a) Automated pipeline-generated data, which ensures a substantial volume of diverse image editing examples. (b) Real-world scenario data collected from websites where amateur photographers post their images along with editing requests, which are then addressed by Photoshop experts. This part of data captures the intricacies of user intentions and promotes the practical application of image editing in real-life situations. (c) Multi-turn editing data, which involves multiple rounds of edits performed by Photoshop experts on real images, simulating iterative editing processes.


Figure 2: The automated pipelines for generating editing pairs, which constitute the first part of SEED-Data-Edit. In pipeline (a), an object is first segmented and subsequently removed with an inpainting module, resulting in a set of “Remove” and “Add” editing samples. In pipeline (b), an image-guided text-to-image generation model is utilized to generate source images and target images based on the original image, the original caption, and the target caption after editing.

3.1 Automated Pipeline-generated Data

We adopt two automated pipelines to generate a large number of high-quality image editing pairs, as shown in Fig. 2. Pipeline (a) is designed to create a set of “Remove” and “Add” edits, where the “Add” samples are generated by reversing the “Remove” operation, since it is difficult to directly add an object at a suitable location in an image. Pipeline (b) is designed to produce editing samples with changes in style, object, color, material, or expression.

Specifically, in pipeline (a), given a real image, LLaVA-1.5 [18] is employed to answer the question “Which object is suitable for removal?”. GroundingDino [19] and SAM [20] are then employed to segment the target object and obtain its mask. Given the original image and the segmentation mask, an inpainting model, LaMa [21], is utilized to inpaint the image, removing the target object and creating the target image. The corresponding instruction is produced by filling in the template “Remove something from the image” with the target object. To generate the “Add” samples, we reverse the process by using the inpainted image as the source image; the instruction becomes “Add something in the image” and the original image serves as the target image. This approach ensures that the added object is placed in an appropriate location within the image, maintaining visual coherence and realism in the editing process. We filter out data where the size of the target object exceeds a certain threshold. We use images from Unsplash [22] to generate image editing pairs with pipeline (a), producing a total of 1.5M image editing pairs after filtering.
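
To make the construction of these pairs concrete, the following is a minimal sketch of how pipeline (a) can assemble a “Remove” pair and its reversed “Add” pair and apply the object-size filter. The wrapper around the segmentation/inpainting outputs and the exact area threshold are assumptions for illustration; the report does not specify them.

```python
from dataclasses import dataclass

@dataclass
class EditPair:
    source_path: str
    target_path: str
    instruction: str

# Assumed threshold: drop edits whose object covers too much of the image;
# the actual value used for SEED-Data-Edit is not given in the report.
MAX_OBJECT_AREA_RATIO = 0.3

def build_remove_add_pairs(image_path: str, object_name: str,
                           mask_area_ratio: float,
                           inpainted_path: str) -> list[EditPair]:
    """Turn one (image, object, inpainted image) triple produced by
    LLaVA + GroundingDino/SAM + LaMa into a "Remove" pair and the
    reversed "Add" pair, or return [] if the object is too large."""
    if mask_area_ratio > MAX_OBJECT_AREA_RATIO:
        return []

    remove_pair = EditPair(
        source_path=image_path,
        target_path=inpainted_path,
        instruction=f"Remove the {object_name} from the image",
    )
    # "Add" sample: swap source and target so the added object is guaranteed
    # to appear in a plausible location in the target image.
    add_pair = EditPair(
        source_path=inpainted_path,
        target_path=image_path,
        instruction=f"Add a {object_name} in the image",
    )
    return [remove_pair, add_pair]
```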

In pipeline (b), given a caption of an input image, we first employ ChatGPT [23] to generate an editing instruction and the target caption after the editing has been executed. Subsequently, we utilize an image-guided text-to-image generation model, PnP [24], to generate a target image given the original image and the target caption. To ensure visual consistency between the target image and the image before editing, we further utilize PnP to generate the source image given the original image and the original caption. We filter the data by calculating the CLIP [25] similarity between the target image and the target caption, as well as between the target image and the original caption. If the latter similarity is higher than the former, the data is filtered out, since the visual content of the target image does not accurately reflect the editing instruction. We use images from Openimages [26] to generate image editing pairs with pipeline (b), producing a total of 2.0M image editing pairs after filtering.
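
The CLIP-based filtering rule for pipeline (b) can be sketched as below, keeping a pair only when the target image is closer to the target caption than to the original caption. The specific CLIP checkpoint is an assumption; the report does not state which CLIP model is used.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def keep_edit_pair(target_image: Image.Image,
                   original_caption: str,
                   target_caption: str) -> bool:
    """Keep the pair only if the target image matches the target caption
    more closely than the original caption."""
    inputs = processor(text=[original_caption, target_caption],
                       images=target_image,
                       return_tensors="pt", padding=True)
    outputs = model(**inputs)
    # logits_per_image has shape (1, 2): similarity of the target image
    # to [original_caption, target_caption].
    sim_original, sim_target = outputs.logits_per_image[0].tolist()
    return sim_target > sim_original
```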

3.2 Real-world Scenario Data

We crawl image editing pairs from four websites [27, 28, 29, 30], where amateur photographers post their images accompanied by editing requests. These requests are then fulfilled by Photoshop experts who provide the edited images as target images. To ensure that the editing instruction accurately reflects the change from the source image to the target image, we employ human annotators to re-annotate all editing pairs. This process yields a total of 52K image editing samples. As depicted in Fig. 1 and Fig. 3, the images in this part of the data are more complex and the instructions are more imaginative, effectively reflecting real-world user image editing intentions.

3.3 Multi-turn Editing Data

We employ Photoshop experts to edit images in sequence using Photoshop for multi-turn image edits, documenting the editing instruction for each round. Each editing round encompasses various modifications, including (1) replacing an object, (2) adding an object, (3) removing an object, (4) changing the action, (5) altering text or patterns, and (6) modifying the count of objects. To ensure the diversity of the input images, we use images from Unsplash [22], SAM [20], and JourneyDB [31] for multi-turn editing. In total, we generate 21K multi-turn editing sequences with a maximum of 5 rounds each, resulting in 95K image editing pairs.


Figure 3: More examples of instruction-guided image editing in SEED-Data-Edit, which integrates three distinct types of data: (a) Large-scale automated pipeline-generated edits (first and second row), (b) Real-world scenario data, where amateur photographers post their images along with editing requests, which are addressed by Photoshop experts (third and fourth row), and (c) Multi-turn editing data annotated by Photoshop experts on real images (fifth and sixth row).


Figure 4: The comparison of language-guided image editing between existing methods and SEED-X-Edit. Through fine-tuning with SEED-Data-Edit, SEED-X-Edit is able to adhere to editing instructions more accurately.

4 SEED-X-Edit

We fine-tune a pre-trained Multimodal Large Language Model (MLLM), SEED-X (https://github.com/AILab-CVC/SEED-X) [9], with SEED-Data-Edit, InstructPix2Pix [2], and MagicBrush [1], which yields the instruction-tuned model SEED-X-Edit. SEED-X unifies multimodal comprehension and generation by decoding images from the predicted ViT [10] features with a pre-trained visual de-tokenizer. In this way, it is able to comprehend multimodal input (e.g., a source image and a language editing instruction) and generate a target image after editing. We fine-tune SEED-X using a LoRA [32] module with 16 A100-40G GPUs, which takes around 40 hours. Note that we do not fine-tune with multi-turn editing sequences and instead use the multi-turn editing data in a single-turn way; we will explore multi-turn image editing in future work.
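
As a concrete illustration of using the multi-turn data “in a single-turn way”, the sketch below flattens each editing sequence into independent (source image, instruction, target image) samples. The record fields are hypothetical and do not reflect the released dataset schema.

```python
from typing import Iterable

def flatten_multi_turn(sequences: Iterable[dict]) -> list[dict]:
    """Flatten multi-turn editing sequences into single-turn samples.

    Each sequence is assumed to look like (hypothetical schema):
        {"images": [img_0, img_1, ..., img_n],    # img_0 is the original image
         "instructions": [inst_1, ..., inst_n]}   # inst_k edits img_{k-1} into img_k
    """
    samples = []
    for seq in sequences:
        images, instructions = seq["images"], seq["instructions"]
        for k, instruction in enumerate(instructions):
            samples.append({
                "source_image": images[k],
                "instruction": instruction,
                "target_image": images[k + 1],
            })
    return samples
```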

We compare SEED-X-Edit with baseline models including MagicBrush [1], MGIE [5], and HQ-Edit [3]. Here, the MGIE model is trained on InstructPix2Pix data, while the MagicBrush model is fine-tuned from an InstructPix2Pix-trained model with MagicBrush data. As illustrated in Fig. 4, the SEED-X-Edit model achieves promising results in language-guided image editing and adheres to editing instructions more faithfully than models trained on existing image editing datasets. For example, SEED-X-Edit is able to add sunglasses to the animal on the right, while all baseline models put sunglasses on both animals. Additionally, SEED-X-Edit successfully removes the bottle from the table, while the baseline models fail to execute this editing instruction. These results showcase the potential of our dataset in advancing instructional image editing. It is worth noting that SEED-X also pre-trains the visual de-tokenizer with SEED-Data-Edit, incorporating the source image as an additional input besides the ViT features of the target image to reconstruct the target image. This characteristic is particularly beneficial for high-precision image editing, as it enables the preservation of the source image’s fine-grained details after editing.

5 Conclusion

In this technical report, we introduce SEED-Data-Edit, a unique hybrid dataset for language-guided image editing. SEED-Data-Edit integrates three distinct types of data: large-scale automated pipeline-generated edits, real-world scenario editing examples, and human-annotated multi-turn editing data. We further fine-tune a pre-trained MLLM, SEED-X, with our dataset to obtain a language-guided image editing model named SEED-X-Edit, which achieves promising results. All data of SEED-Data-Edit and the instruction-tuned model SEED-X-Edit are released.

References

  • [1] Kai Zhang, Lingbo Mo, Wenhu Chen, Huan Sun, and Yu Su. Magicbrush: A manually annotated dataset for instruction-guided image editing. Advances in Neural Information Processing Systems, 36, 2024.
  • [2] Tim Brooks, Aleksander Holynski, and Alexei A Efros. Instructpix2pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18392–18402, 2023.
  • [3] Mude Hui, Siwei Yang, Bingchen Zhao, Yichun Shi, Heng Wang, Peng Wang, Yuyin Zhou, and Cihang Xie. Hq-edit: A high-quality dataset for instruction-based image editing. arXiv preprint arXiv:2404.09990, 2024.
  • [4] Shelly Sheynin, Adam Polyak, Uriel Singer, Yuval Kirstain, Amit Zohar, Oron Ashual, Devi Parikh, and Yaniv Taigman. Emu edit: Precise image editing via recognition and generation tasks. arXiv preprint arXiv:2311.10089, 2023.
  • [5] Tsu-Jui Fu, Wenze Hu, Xianzhi Du, William Yang Wang, Yinfei Yang, and Zhe Gan. Guiding instruction-based image editing via multimodal large language models. arXiv preprint arXiv:2309.17102, 2023.
  • [6] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022.
  • [7] James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, et al. Improving image generation with better captions. Computer Science. https://cdn.openai.com/papers/dall-e-3.pdf, 2(3):8, 2023.
  • [8] Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, et al. Pixart: Fast training of diffusion transformer for photorealistic text-to-image synthesis. arXiv preprint arXiv:2310.00426, 2023.
  • [9] Yuying Ge, Sijie Zhao, Jinguo Zhu, Yixiao Ge, Kun Yi, Lin Song, Chen Li, Xiaohan Ding, and Ying Shan. Seed-x: Multimodal models with unified multi-granularity comprehension and generation. arXiv preprint arXiv:2404.14396, 2024.
  • [10] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
  • [11] Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626, 2022.
  • [12] Christoph Schuhmann and Romain Beaumont. Laion-aesthetics. https://laion.ai/blog/laion-aesthetics/, 2022.
  • [13] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014.
  • [14] Dall-e 2. https://openai.com/dall-e-2, 2022.
  • [15] OpenAI. Gpt-4 technical report, 2023.
  • [16] OpenAI. Gpt-4v(ision) system card. 2023.
  • [17] James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, Wesam Manassra, Prafulla Dhariwal, Casey Chu, Yunxin Jiao, and Aditya Ramesh. Improving image generation with better captions. 2023.
  • [18] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744, 2023.
  • [19] Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499, 2023.
  • [20] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, and Ross Girshick. Segment anything. arXiv:2304.02643, 2023.
  • [21] Roman Suvorov, Elizaveta Logacheva, Anton Mashikhin, Anastasia Remizova, Arsenii Ashukha, Aleksei Silvestrov, Naejin Kong, Harshith Goka, Kiwoong Park, and Victor Lempitsky. Resolution-robust large mask inpainting with fourier convolutions. arXiv preprint arXiv:2109.07161, 2021.
  • [22] Ali Zahid, Luke Chesser, and Timothy Carbone. Unsplash. https://github.com/unsplash/datasets, 2023.
  • [23] OpenAI. Introducing chatgpt. https://openai.com/blog/chatgpt, 2022.
  • [24] Narek Tumanyan, Michal Geyer, Shai Bagon, and Tali Dekel. Plug-and-play diffusion features for text-driven image-to-image translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1921–1930, June 2023.
  • [25] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
  • [26] Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Alexander Kolesnikov, et al. The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale. International journal of computer vision, 128(7):1956–1981, 2020.
  • [27] Photoshopbattles. https://www.reddit.com/r/photoshopbattles/, 2024.
  • [28] Photoshop gurus. https://www.photoshopgurus.com/forum/, 2024.
  • [29] Photoshoprequest. https://www.reddit.com/r/PhotoshopRequest/, 2024.
  • [30] Zhopped. http://zhopped.com/, 2024.
  • [31] Keqiang Sun, Junting Pan, Yuying Ge, Hao Li, Haodong Duan, Xiaoshi Wu, Renrui Zhang, Aojun Zhou, Zipeng Qin, Yi Wang, et al. Journeydb: A benchmark for generative image understanding. Advances in Neural Information Processing Systems, 36, 2024.
  • [32] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.