EditScout: Locating Forged Regions from Diffusion-based Edited Images with Multimodal LLM
Abstract
Image editing technologies are tools used to transform, adjust, remove, or otherwise alter images. Recent research has significantly improved the capabilities of image editing tools, enabling the creation of photorealistic and semantically informed forged regions that are nearly indistinguishable from authentic imagery, presenting new challenges in digital forensics and media credibility. While current image forensic techniques are adept at localizing forged regions produced by traditional image manipulation methods, they struggle to localize regions created by diffusion-based techniques. To bridge this gap, we present a novel framework that integrates a multimodal Large Language Model (LLM) for enhanced reasoning capabilities to localize tampered regions in images produced by diffusion model-based editing methods. By leveraging the contextual and semantic strengths of LLMs, our framework achieves promising results on the MagicBrush, AutoSplice, and PerfBrush (a novel diffusion-based dataset) datasets, outperforming previous approaches in mIoU and F1-score. Notably, our method excels on PerfBrush, a self-constructed test set featuring previously unseen types of edits, where traditional methods typically falter and achieve markedly low scores, while our approach maintains promising performance.
1 Introduction
In recent years, the landscape of image editing has seen substantial advancement as powerful editing tools have expanded beyond basic copy-move and splicing techniques, embracing more sophisticated modifications. Particularly notable are diffusion-based editing methods [1, 2, 3, 4, 5], which enable the creation of photorealistic forged regions seamlessly integrated with the style and lighting of the surroundings, posing challenges for both human perception and machine forensic techniques in localizing these alterations. While these tools offer immense creative potential, their misuse carries significant risks.
Consequently, traditional image forensic methods struggle to localize forged regions created by diffusion-based editing techniques, in contrast to those produced by conventional image manipulation methods. One hypothesis is that the highly realistic nature of images edited by diffusion models allows the forged regions to blend seamlessly with the original content, making them difficult to detect with traditional approaches that rely on internal inconsistencies and editing traces, as illustrated in Fig. 1. This indicates a pressing need for new approaches in image forensics that can adapt to the complexities introduced by advanced diffusion-based editing tools.
Hence, this paper is dedicated to tackling the intricate challenges posed by diffusion-based image editing techniques, which stand at the forefront of the evolving field of image forensics. Specifically, given a suspicious image that appears to have been edited, our goal is to segment and identify the forged regions. To this end, we present EditScout, a novel methodology leveraging Multimodal Large Language Models (MLLMs) to precisely locate forgeries in diffusion-based edited images.
This approach consists of two core modules: an MLLM-based reasoning query generation module and a segmentation network. The MLLM utilizes visual features extracted by a pretrained visual encoder such as CLIP [6], along with a prompt for localizing edited regions, to generate the reasoning query. Subsequently, this query is fed into the Segment Anything Model (SAM) [52] to produce the segmentation mask. By integrating MLLMs, our method harnesses rich multimodal data, combining visual and linguistic cues to conduct a thorough, systematic analysis of images. The key advantage of employing MLLMs lies in their innate capacity to interpret and synthesize diverse data streams, enabling them to discern subtle high-level features and extract underlying semantic meanings indicative of tampering or manipulation.
To assess the effectiveness of EditScout across various scenarios, including both familiar and novel editing types, we conduct training on the MagicBrush [7] and AutoSplice [8] datasets and evaluate on multiple test sets. These include the test set from MagicBrush (representing seen editing types), CocoGLIDE [42] (exhibiting similar editing types), and our newly introduced test set, PerfBrush (featuring unseen editing types). Our method, EditScout, surpasses traditional image forensic techniques across these datasets, particularly excelling at identifying unseen editing types. This underscores the robustness and generalizability of our proposed approach.
In summary, the contributions of this work are the following:
- We introduce a novel approach to image forensics, leveraging an MLLM for locating diffusion-based edits. This study represents the first attempt to explore MLLM capabilities in image forensic localization.
- We present PerfBrush, a new image forensics test set specifically designed for detecting forged regions in diffusion-based edited images.

2 Related Work
Multimodal Large Language Models (MLLM). With the advanced development of large language models such as GPT-3 [10], Llama [11], Mistral [12], Flan-T5 [13], Vicuna [14], and PaLM [15], there is an increasing effort to apply these models to multimodal tasks, including the understanding and incorporation of visual information. Visual instruction tuning, which involves fine-tuning LLMs with visual supervision, has garnered significant interest and delivered surprising results. Several models have been developed using this method, including LLaVA [16], MiniGPT-4 [17], InstructBLIP [9], and other models [18, 19, 20]. These MLLMs have demonstrated promising results in visual perception tasks such as detection [21, 22, 23, 24, 25] and segmentation [26, 27, 28, 29, 30, 31, 24]. Furthermore, they have recently been studied for their ability to detect generated images and deepfakes [32]. However, the application of such models to localizing tampered regions produced by editing-based forgery remains unexplored. Therefore, in this paper, we aim to study the potential capability of MLLMs in localizing diffusion-based editing forgery.
MLLMs for image segmentation. MLLMs have significantly improved segmentation capability by integrating reasoning and deeper regional understanding. Works such as LISA [26] and CoRes [33] focus on reasoning-based segmentation, enhancing the ability to interpret intricate queries and generate reliable segmentation masks. PixelLM [31] and GROUNDHOG [29] use efficient decoders and feature extractors to handle multi-target tasks and reduce object hallucination, thereby improving segmentation performance. DeiSAM [34] integrates logic reasoning with multimodal large language models to improve referring expression segmentation, while See, Say, Segment [30] employs joint training to maintain performance across tasks and prevent catastrophic forgetting. These studies suggest that, with enhanced reasoning capabilities and regional understanding, MLLMs might offer an effective and explainable solution for forgery localization.
Diffusion-based editing. Recent advancements in diffusion-based editing methods have significantly enhanced the precision and realism of filling missing regions in images. Traditional techniques, which relied on hand-crafted features, often struggle with control over the inpainted content. However, the introduction of text-to-image diffusion models like Stable Diffusion [35] and Imagen [36] has enabled more controlled editing by using multimodal conditions such as text descriptions and segmentation maps. Models such as GLIDE [37] and Stable Inpainting have further refined these pretrained models for editing tasks, employing random masks to effectively utilize context from unmasked regions. Additionally, techniques like Blended Diffusion [4] and Blended Latent Diffusion [38] use pretrained models and CLIP-guided reconstruction to improve mask-based editing quality. Innovative approaches such as Inpaint Anything [39] and MagicRemover [3] combine various models and user-friendly interfaces to offer greater flexibility and precision in image editing. These methods yield highly realistic results that are often indistinguishable from the original content, presenting substantial challenges for detection and localization.
Image forgery localization is a critical aspect of digital image forensics that focuses on pixel-level binary classification within an image. Most existing studies employ pixel-wise or patch-based [40] approaches to identify forged regions. Several methods combine image and noise features for more accurate forgery localization [41, 42]. For instance, TruFor [42] extracts high-level and low-level traces through a transformer-based architecture that combines the RGB image with a noise-sensitive fingerprint. CAT-Net [58] exploits JPEG compression artifacts introduced during image editing to localize forged regions. Furthermore, HiFi [43] proposes a hierarchical approach for detecting and localizing tampered regions based on the frequency domain, while PSCC-Net [44] utilizes a progressive spatial-correlation module that leverages multi-scale features and cross-connections to generate binary masks in a coarse-to-fine manner. While these methods yield good results for various forgery types, such as copy-move and splicing, they cannot deliver acceptable performance when faced with diffusion-based inpainting methods. The high realism and indistinguishability of inpainted regions created by diffusion models present substantial challenges, making it difficult for traditional forgery localization techniques to detect and accurately localize these tampered areas.
Synthetic image detection. Early research has focused on detecting images generated by GANs, where various studies utilized hand-crafted features such as color [45], saturation [46], and blending artifacts [47] to identify GAN-generated images. Additionally, deep learning classifiers have been employed to detect synthetic images [48]. With the rise and growing dominance of diffusion models, GANs are gradually being supplanted, and methods designed for GAN-generated images show reduced performance when applied to diffusion models. To address this, new techniques such as DIRE [49] and SeDIE [50] leverage reconstruction error in the image space, while LaRE [51] examines reconstruction error in the latent space for more effective detection of diffusion-generated images. Moreover, [43] developed hierarchical fine-grained labeling techniques for classifying synthetic images. Recently, [32] have utilized the capabilities of multimodal LLMs to study the detection of synthetic images. While the detection of generated images remains crucial for maintaining the trustworthiness of generative AI, pinpointing the exact edited regions within an image poses a more challenging and interesting problem.
3 EditScout
Problem setting: Our primary objective is to accurately identify and localize regions within an image that have been altered using diffusion-based editing techniques. Given an image $I$ that is suspected to contain edited portions, our goal is to generate a binary segmentation mask indicating the edited regions. In this mask, each pixel is classified as either authentic or edited, where a value of 0 indicates an authentic pixel and a value of 1 signifies an edited pixel.
Overview: To achieve the goal, our approach is inspired by the reasoning segmentation task as described in recent works [26, 27]. Specifically, we build our method, EditScout, upon the robust architectures proposed in these studies. As shown in Fig. 2, the network consists of two primary modules: the MLLM-based reasoning query generation module and the segmentation model. The first module processes a user’s prompt along with the input image, generating a sequence of text tokens. This sequence includes a special [SEG] token that encapsulates the reasoning query and the edit instructions. The second module then uses this [SEG] token as a query to produce a binary mask that highlights the edited regions in the image.

Specifically, in the first module, given an input image $I$ that appears to be edited, we use a fixed prompt: "Can you segment the edited region and give the instruction used to edit this image.". Both the image and the prompt are fed into LLaVA [16], a multimodal LLM, which produces a sequence of response tokens following this template: "The edited region is [SEG], and the edit instruction used is $\hat{y}_{\text{txt}}$", where $\hat{y}_{\text{txt}}$ represents the predicted edit instruction (an example prediction is shown in Fig. 2). The [SEG] token is then converted into the reasoning query using a Multi-Layer Perceptron (MLP). At the same time, the input image is encoded by an image encoder to generate an image representation. Finally, a mask decoder uses the image representation and the reasoning query to produce the final binary mask prediction $\hat{M}$. It is worth noting that the image encoder and mask decoder are the same as in the Segment Anything Model (SAM) [52].
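For concreteness, the sketch below outlines this two-module pipeline in PyTorch; the module interfaces (an `mllm` callable returning generated token ids together with their hidden states, a `sam_mask_decoder` accepting an image embedding and a sparse query) and all names are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class EditScoutSketch(nn.Module):
    """Minimal sketch of the two-module pipeline; all module interfaces are assumptions."""

    def __init__(self, mllm, seg_projector, sam_image_encoder, sam_mask_decoder, seg_token_id: int):
        super().__init__()
        self.mllm = mllm                            # LLaVA-style multimodal LLM (LoRA-finetuned)
        self.seg_projector = seg_projector          # MLP: [SEG] hidden state -> reasoning query
        self.sam_image_encoder = sam_image_encoder  # SAM image encoder (frozen)
        self.sam_mask_decoder = sam_mask_decoder    # SAM mask decoder (finetuned)
        self.seg_token_id = seg_token_id

    def forward(self, image: torch.Tensor, prompt_ids: torch.Tensor):
        # 1) The MLLM reads the image and the fixed prompt; it is assumed to return the
        #    generated token ids (B, T) and their last-layer hidden states (B, T, D).
        token_ids, hidden_states = self.mllm(image, prompt_ids)

        # 2) Take the hidden state at the [SEG] position and project it to the reasoning query.
        seg_positions = token_ids == self.seg_token_id            # one [SEG] per sample assumed
        reasoning_query = self.seg_projector(hidden_states[seg_positions])  # (B, D_query)

        # 3) SAM-style decoding: image embedding + reasoning query -> binary mask logits.
        image_embedding = self.sam_image_encoder(image)
        mask_logits = self.sam_mask_decoder(image_embedding, reasoning_query)
        return mask_logits, token_ids  # mask prediction and predicted edit-instruction tokens
```

This mirrors reasoning-segmentation pipelines such as LISA [26], where a single special token carries the query from the language model into the mask decoder.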
Edit instruction construction: We apply supervised learning to train EditScout on the MagicBrush and AutoSplice datasets in our experiments. We observe that edit instructions provide strong supervision, enhancing the reasoning capabilities of our approach and offering clear explanations for the model's predicted segmentation masks. This results in a more explainable image forensic system. To this end, we construct ground-truth edit instructions for our training datasets. For the MagicBrush dataset [7], we utilize the provided edit instructions, which detail the specific manipulations applied to the images. However, the AutoSplice dataset [8] only provides class names for objects, without detailed descriptions of the manipulations. To address this, we enhance the dataset by adding several verbs referring to manipulation activities. These verbs describe the transformation from the original object to the edited object, formatted as <verb> <original object> to <edited object>. For instance, an instruction might read "replace an apple with an orange" or "edit dog to cat", providing clear guidance on the nature of the edits. We showcase several examples of our training dataset in Fig. 3.
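As an illustration of this template, a small helper of the kind that could turn AutoSplice class names into instruction strings is sketched below; the verb list and the function name are our own illustrative choices, not the dataset's official tooling.

```python
import random

# Illustrative verbs describing manipulation activities; the exact list used to build
# the training instructions is an assumption here.
EDIT_VERBS = ["replace", "edit", "change", "turn", "swap"]

def build_edit_instruction(original_object: str, edited_object: str, seed: int = 0) -> str:
    """Format '<verb> <original object> to <edited object>' as a ground-truth instruction."""
    verb = random.Random(seed).choice(EDIT_VERBS)
    return f"{verb} {original_object} to {edited_object}"

# Example: build_edit_instruction("dog", "cat") might yield "edit dog to cat".
```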
Finetuning strategy: Following LISA [26], we use LoRA [53] to finetune only part of the MLLM, fully finetune the mask decoder, and train the MLP converter from scratch; the other modules remain frozen. We use two training loss functions: (1) an auto-regressive cross-entropy loss $\mathcal{L}_{\text{txt}}$ between the predicted edit instruction $\hat{y}_{\text{txt}}$ and the ground-truth instruction $y_{\text{txt}}$, and (2) a segmentation loss $\mathcal{L}_{\text{mask}}$ between the predicted mask $\hat{M}$ and the ground-truth mask $M$. Particularly,
$\mathcal{L}_{\text{txt}} = \text{CE}(\hat{y}_{\text{txt}}, y_{\text{txt}})$ (1)
$\mathcal{L}_{\text{mask}} = \lambda_{\text{bce}}\,\text{BCE}(\hat{M}, M) + \lambda_{\text{dice}}\,\text{DICE}(\hat{M}, M)$ (2)
$\mathcal{L} = \lambda_{\text{txt}}\,\mathcal{L}_{\text{txt}} + \lambda_{\text{mask}}\,\mathcal{L}_{\text{mask}}$ (3)
where $\lambda_{\text{txt}}$, $\lambda_{\text{mask}}$, $\lambda_{\text{bce}}$, and $\lambda_{\text{dice}}$ are weighting parameters for the loss components, and the combination of binary cross-entropy and Dice loss is a common segmentation loss.
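A compact sketch of this objective in PyTorch is given below; the default weights are placeholders set to 1.0 for illustration (the tuned values are not reproduced), and the Dice smoothing constant is an assumption.

```python
import torch
import torch.nn.functional as F

def dice_loss(mask_logits: torch.Tensor, gt_mask: torch.Tensor, eps: float = 1.0) -> torch.Tensor:
    """Soft Dice loss between mask logits and a {0,1} ground-truth mask of shape (B, H, W)."""
    probs = torch.sigmoid(mask_logits).flatten(1)
    gt = gt_mask.flatten(1)
    inter = (probs * gt).sum(-1)
    union = probs.sum(-1) + gt.sum(-1)
    return (1.0 - (2.0 * inter + eps) / (union + eps)).mean()

def editscout_loss(text_logits, text_targets, mask_logits, gt_mask,
                   lam_txt=1.0, lam_mask=1.0, lam_bce=1.0, lam_dice=1.0):
    """Total objective L = lam_txt * L_txt + lam_mask * (lam_bce * BCE + lam_dice * DICE)."""
    # (1) auto-regressive cross-entropy over the predicted edit-instruction tokens
    l_txt = F.cross_entropy(text_logits.flatten(0, 1), text_targets.flatten(), ignore_index=-100)
    # (2) segmentation loss: weighted BCE + Dice between predicted and ground-truth masks
    l_mask = (lam_bce * F.binary_cross_entropy_with_logits(mask_logits, gt_mask.float())
              + lam_dice * dice_loss(mask_logits, gt_mask.float()))
    # (3) total loss
    return lam_txt * l_txt + lam_mask * l_mask
```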

4 Datasets
To evaluate EditScout, we train on the training set of MagicBrush [7] and the entire AutoSplice dataset [8]. For testing, we use the dev and test sets of MagicBrush [7], along with CocoGLIDE [42] and PerfBrush, our contributed dataset. We provide details of the composition of the training and testing datasets, including the number of samples and the editing methods, in Tab. 1.
Datasets | Train | Test | Editing method | #Samples
---|---|---|---|---
AutoSplice [8] | ✓ | | DALL-E [1] | 3,621
MagicBrush [7] (train) | ✓ | | DALL-E [1] | 4,512
MagicBrush [7] (dev + test) | | ✓ | DALL-E [1] | 801
CocoGLIDE [42] | | ✓ | GLIDE [37] | 512
PerfBrush (our proposed dataset) | | ✓ | BrushNet [2] | 801
The MagicBrush dataset, originally designed for image inpainting, has been repurposed for image forensics. It comprises three splits: train, dev, and test (see Tab. 1 for the sample counts used in our setup). Each image undergoes up to three rounds of editing using DALL-E [1, 54], but we only utilize the results from the first round as input edited images. The AutoSplice dataset, another diffusion-based image forensic dataset with 3,621 images, also employs DALL-E as the editor. Additionally, the CocoGLIDE dataset consists of 512 samples from the COCO 2017 validation set, edited using the GLIDE model [37].
Notably, to further diversify our test set with an unseen diffusion-based editing technique and to test the generalizability of image forensic models, we constructed a new set of edited images called PerfBrush. Instead of using DALL-E, we utilized the images, instructions, and inpainting masks from the dev and test splits of MagicBrush. Specifically, we fed each triplet of a source image, a global description, and an inpainting mask to an image inpainting method called BrushNet [2]. To accelerate the inpainting process, we adapted the original BrushNet with PerFlow [55], a plug-and-play accelerator that speeds up the diffusion sampling process. For each sample triplet, we generated multiple edited images using BrushNet with different random seeds to ensure diversity in the edits. We then manually selected the image with the highest quality within the inpainting region to ensure a robust evaluation of image forensic methods. Our constructed test set, PerfBrush, has a better visual appearance than MagicBrush [7], as showcased in Fig. 4.
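The generation loop can be summarized by the sketch below; `inpaint_with_brushnet` is a hypothetical wrapper assumed to run a PerFlow-accelerated BrushNet pipeline, and the number of seeds per triplet is left unspecified since the exact count is not reproduced here.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List

@dataclass
class EditTriplet:
    source_image: object     # original image (e.g., a PIL image)
    global_caption: str      # global description taken from MagicBrush
    inpainting_mask: object  # binary mask of the region to edit

def generate_candidates(triplet: EditTriplet,
                        inpaint_with_brushnet: Callable,
                        seeds: Iterable[int]) -> List[object]:
    """Produce one edited candidate per random seed for later manual selection."""
    return [
        inpaint_with_brushnet(
            image=triplet.source_image,
            mask=triplet.inpainting_mask,
            caption=triplet.global_caption,
            seed=seed,
        )
        for seed in seeds
    ]

# Usage sketch: candidates = generate_candidates(triplet, inpaint_with_brushnet, seeds=range(k)),
# where k is the number of seeds per triplet; the highest-quality candidate is then picked by hand.
```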

5 Experiments
Evaluation metric: We evaluate the performance of EditScout using mean Intersection over Union (mIoU) and F1-score metrics. The mIoU (%) measures the overlap between the predicted binary segmentation masks and the ground-truth masks for each class and averages over the two classes. The F1 score is calculated as the harmonic mean of the precision and recall of the pixel-level binary classification.
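Both metrics can be computed from binary masks as in the NumPy sketch below, which follows the standard definitions; the exact evaluation script may differ in thresholding and per-image averaging details.

```python
import numpy as np

def binary_miou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean IoU over the two classes (authentic = 0, edited = 1) for one image."""
    ious = []
    for cls in (0, 1):
        p, g = (pred == cls), (gt == cls)
        union = np.logical_or(p, g).sum()
        inter = np.logical_and(p, g).sum()
        ious.append(1.0 if union == 0 else inter / union)
    return float(np.mean(ious))

def pixel_f1(pred: np.ndarray, gt: np.ndarray) -> float:
    """F1 of the edited class, i.e. harmonic mean of pixel-level precision and recall."""
    tp = np.logical_and(pred == 1, gt == 1).sum()
    fp = np.logical_and(pred == 1, gt == 0).sum()
    fn = np.logical_and(pred == 0, gt == 1).sum()
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom > 0 else 1.0
```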
Implementation details: We build our framework on the PyTorch deep learning framework [56]. We utilize LoRA [53] for finetuning the LLM, with the LoRA rank set to 8 and LoRA alpha set to 32. Through experiments, we select the best-performing configuration of the loss weights $\lambda_{\text{txt}}$, $\lambda_{\text{mask}}$, $\lambda_{\text{bce}}$, and $\lambda_{\text{dice}}$. We train EditScout with the AdamW optimizer [57] and a learning rate of $3\times10^{-5}$. To fairly compare with other image forensic techniques, we finetune those methods on our training dataset following their original configurations. All experiments are conducted on a single A100 40GB GPU.
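A minimal setup sketch of this finetuning configuration is shown below, using the reported LoRA rank and alpha; the target modules, dropout, and parameter grouping are illustrative assumptions rather than the exact settings.

```python
import torch
from peft import LoraConfig, get_peft_model

# LoRA configuration matching the reported rank and alpha; the target modules and
# dropout are assumptions, not the paper's exact choices.
lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # assumed attention projections of the LLM
    task_type="CAUSAL_LM",
)

def build_trainables(mllm, mask_decoder, mlp_converter):
    """LoRA-wrap the MLLM; keep the mask decoder and MLP converter fully trainable."""
    mllm = get_peft_model(mllm, lora_config)
    params = ([p for p in mllm.parameters() if p.requires_grad]
              + list(mask_decoder.parameters())
              + list(mlp_converter.parameters()))
    optimizer = torch.optim.AdamW(params, lr=3e-5)
    return mllm, optimizer
```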
5.1 Main Results
Quantitative results. Tab. 2 provides a detailed comparative analysis of our method, EditScout, against established forensic techniques. These methods, originally pretrained on extensive datasets featuring traditional editing techniques such as copy-move and splicing, initially show poor performance across all three test sets because they cannot generalize to unseen editing techniques. When these models are fine-tuned on our training dataset, their performance improves significantly, particularly on the MagicBrush dataset [7]. However, this fine-tuning also tends to overfit the models to the types of edits encountered during training. For example, while both the MagicBrush [7] and CocoGLIDE [42] datasets, which include edits generated by the DALL-E and GLIDE methods, exhibit improvements post-fine-tuning, this specialization results in a marked decrease in performance on the PerfBrush dataset. Specifically, CAT-Net [58] achieves mIoUs of 30.47 and 31.79 on the MagicBrush and CocoGLIDE datasets, respectively, but plummets to just 3.67 mIoU on PerfBrush, highlighting an overfitting issue. In contrast, EditScout demonstrates good generalizability and robustness, attaining the highest mIoU of 22.55 on the challenging PerfBrush dataset, which features edits not previously encountered during training. Furthermore, our approach also outperforms other techniques on the CocoGLIDE dataset with an mIoU of 34.11.
Methods | MagicBrush [7] mIoU | MagicBrush [7] F1 | CocoGLIDE [42] mIoU | CocoGLIDE [42] F1 | PerfBrush mIoU | PerfBrush F1
---|---|---|---|---|---|---
PSCC-Net [44] | 8.35 | 12.3 | 14.46 | 20.24 | 7.90 | 12.59 |
EITL-Net [41] | 7.88 | 11.38 | 28.79 | 35.42 | 18.55 | 22.13 |
TruFor [42] | 19.47 | 26.93 | 29.26 | 36.08 | 15.66 | 22.36 |
HiFi [43] | 5.10 | 8.22 | 16.55 | 23.44 | 4.45 | 6.99 |
CAT-Net [58] | 2.71 | 4.33 | 31.63 | 39.18 | 8.11 | 11.96 |
PSCC-Net† [44] | 16.82 | 26.50 | 15.02 | 20.75 | 9.71 | 15.75 |
EITL-Net† [41] | 20.02 | 28.09 | 19.15 | 26.34 | 8.00 | 12.16 |
CAT-Net† [58] | 30.47 | 40.35 | 31.79 | 41.12 | 3.67 | 5.52 |
EditScout | 23.77 | 33.19 | 34.11 | 45.70 | 22.55 | 31.04
† denotes methods fine-tuned on our training dataset.
Qualitative results. In Fig. 5, we present visual comparisons of EditScout and several other image forensic methods evaluated on the MagicBrush (dev + test) and PerfBrush test sets. The images illustrate how EditScout accurately identifies the edited regions, closely matching the provided ground-truth masks. In contrast, the other methods struggle to segment the correct regions, often producing fragmented and inaccurate prediction masks. Additionally, in Fig. 6, we demonstrate the reasoning capability of EditScout. Our framework accepts natural language prompts from the user and processes them to generate comprehensive responses, which include both the segmentation results and predictions of the editing instructions. This showcases EditScout's ability not only to accurately identify edited regions but also to understand and interpret user prompts, providing insightful and detailed outputs.


5.2 Ablation Study
We conduct all ablation experiments with the input prompt and training hyperparameters described in the implementation details, and report results evaluated on the MagicBrush (dev + test) dataset.
Effect of different components on overall performance is summarized in Tab. 3. First, we examine the performance of our method when only visual understanding is used. Setting A, which uses only the Segment Anything Model [52] with a learnable [SEG] token during training, yields a very low result of 5.96 mIoU. In contrast, setting B, which feeds the image embedding produced by the vision encoder of the multimodal LLM as the [SEG] query, significantly boosts performance to 14.94 mIoU. Enabling the reasoning capability of the LLM with the input prompt (setting C) yields a much better result of 22.23 mIoU. Finally, we obtain the best result of 23.77 mIoU by additionally predicting the edit instruction (setting D).
Setting | CLIP image embedding to LLM | Using input prompt | Edit Inst. Prediction | mIoU | F1
---|---|---|---|---|---
A | | | | 5.96 | 9.83
B | ✓ | | | 14.94 | 21.98
C | ✓ | ✓ | | 22.23 | 31.55
D | ✓ | ✓ | ✓ | 23.77 | 33.19
Effect of different choices of input prompt. We conduct a sensitivity analysis on the input prompt to the LLM to determine the best question template for finetuning and report the results in Tab. 4. The first prompt, which explicitly asks for segmentation and an explanation of the editing technique, yields the best result of 23.77 mIoU, suggesting that direct queries facilitate more accurate and detailed analyses by the MLLM. The subsequent prompts show a gradual decrease in performance, indicating that less specific or more ambiguously phrased prompts may lead to less precise localization.
Input Prompt | mIoU | F1 |
---|---|---|
“Can you segment the edited region and give the instruction used to edit this image.” | 23.77 | 33.19 |
“Could you segment the modified regions and provide a detailed explanation of the editing process?” | 20.20 | 28.69 |
“Please analyze this image for any signs of editing. If the image has been edited, identify and segment the edited portions, and outline the steps taken to achieve the edits.” | 19.90 | 28.35 |
“Can you determine if this image has been manipulated? If so, please highlight the altered areas and describe the techniques used to modify the image.” | 19.27 | 27.46 |
Randomly choose one of the above input prompts | 19.32 | 31.54 |
6 Discussion and Conclusion
Limitations: Firstly, the binary masks produced by our approach are not perfect; we expect that integrating modules specifically designed for image forensics could improve the segmentation capability. Secondly, while our method can identify potential editing activities, it does not explain why specific edits were made. Future research could develop datasets and techniques that facilitate reasoning about image manipulations.
Conclusion: We introduce a novel approach, EditScout, for detecting forgeries in diffusion-based edited images. Utilizing the extensive visual and linguistic knowledge of MLLMs, EditScout effectively localizes subtle signs of tampering in the content. Our method outperforms existing ones, achieving 23.77 mIoU on MagicBrush, 34.11 mIoU on CocoGLIDE, and 22.55 mIoU on PerfBrush. These results confirm our approach's potential and suggest promising avenues for further integrating foundation models to enhance digital image forensics and the credibility of generative AI.
References
- [1] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In International conference on machine learning, pages 8821–8831. PMLR, 2021.
- [2] Xuan Ju, Xian Liu, Xintao Wang, Yuxuan Bian, Ying Shan, and Qiang Xu. Brushnet: A plug-and-play image inpainting model with decomposed dual-branch diffusion. arXiv preprint arXiv:2403.06976, 2024.
- [3] Siyuan Yang, Lu Zhang, Liqian Ma, Yu Liu, JingJing Fu, and You He. Magicremover: Tuning-free text-guided image inpainting with diffusion models. arXiv preprint arXiv:2310.02848, 2023.
- [4] Omri Avrahami, Dani Lischinski, and Ohad Fried. Blended diffusion for text-driven editing of natural images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18208–18218, 2022.
- [5] Trong-Tung Nguyen, Duc-Anh Nguyen, Anh Tran, and Cuong Pham. Flexedit: Flexible and controllable diffusion-based object-centric image editing. arXiv preprint arXiv:2403.18605, 2024.
- [6] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
- [7] Kai Zhang, Lingbo Mo, Wenhu Chen, Huan Sun, and Yu Su. Magicbrush: A manually annotated dataset for instruction-guided image editing. Advances in Neural Information Processing Systems, 36, 2024.
- [8] Shan Jia, Mingzhen Huang, Zhou Zhou, Yan Ju, Jialing Cai, and Siwei Lyu. Autosplice: A text-prompt manipulated image dataset for media forensics. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 893–903, 2023.
- [9] Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale N Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning. Advances in Neural Information Processing Systems, 36, 2024.
- [10] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
- [11] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
- [12] Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7b. arXiv preprint arXiv:2310.06825, 2023.
- [13] Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned language models. Journal of Machine Learning Research, 25(70):1–53, 2024.
- [14] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in Neural Information Processing Systems, 36, 2024.
- [15] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. Journal of Machine Learning Research, 24(240):1–113, 2023.
- [16] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36, 2024.
- [17] Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.
- [18] Tao Gong, Chengqi Lyu, Shilong Zhang, Yudong Wang, Miao Zheng, Qian Zhao, Kuikun Liu, Wenwei Zhang, Ping Luo, and Kai Chen. Multimodal-gpt: A vision and language model for dialogue with humans. arXiv preprint arXiv:2305.04790, 2023.
- [19] Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, et al. mplug-owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178, 2023.
- [20] Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Fanyi Pu, Jingkang Yang, Chunyuan Li, and Ziwei Liu. Mimic-it: Multi-modal in-context instruction tuning. arXiv preprint arXiv:2306.05425, 2023.
- [21] Chuofan Ma, Yi Jiang, Jiannan Wu, Zehuan Yuan, and Xiaojuan Qi. Groma: Localized visual tokenization for grounding multimodal large language models. arXiv preprint arXiv:2404.13013, 2024.
- [22] Haoxuan You, Haotian Zhang, Zhe Gan, Xianzhi Du, Bowen Zhang, Zirui Wang, Liangliang Cao, Shih-Fu Chang, and Yinfei Yang. Ferret: Refer and ground anything anywhere at any granularity. arXiv preprint arXiv:2310.07704, 2023.
- [23] Renjie Pi, Jiahui Gao, Shizhe Diao, Rui Pan, Hanze Dong, Jipeng Zhang, Lewei Yao, Jianhua Han, Hang Xu, and Lingpeng Kong Tong Zhang. Detgpt: Detect what you need via reasoning. arXiv preprint arXiv:2305.14167, 2023.
- [24] Ao Zhang, Liming Zhao, Chen-Wei Xie, Yun Zheng, Wei Ji, and Tat-Seng Chua. Next-chat: An lmm for chat, detection and segmentation. arXiv preprint arXiv:2311.04498, 2023.
- [25] Junyan Li, Delin Chen, Yining Hong, Zhenfang Chen, Peihao Chen, Yikang Shen, and Chuang Gan. Covlm: Composing visual entities and relationships in large language models via communicative decoding. arXiv preprint arXiv:2311.03354, 2023.
- [26] Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui Yuan, Shu Liu, and Jiaya Jia. Lisa: Reasoning segmentation via large language model. arXiv preprint arXiv:2308.00692, 2023.
- [27] Senqiao Yang, Tianyuan Qu, Xin Lai, Zhuotao Tian, Bohao Peng, Shu Liu, and Jiaya Jia. An improved baseline for reasoning segmentation with large language model. arXiv preprint arXiv:2312.17240, 2023.
- [28] Yuqian Yuan, Wentong Li, Jian Liu, Dongqi Tang, Xinjie Luo, Chi Qin, Lei Zhang, and Jianke Zhu. Osprey: Pixel understanding with visual instruction tuning. arXiv preprint arXiv:2312.10032, 2023.
- [29] Yichi Zhang, Ziqiao Ma, Xiaofeng Gao, Suhaila Shakiah, Qiaozi Gao, and Joyce Chai. Groundhog: Grounding large language models to holistic segmentation. arXiv preprint arXiv:2402.16846, 2024.
- [30] Tsung-Han Wu, Giscard Biamby, David Chan, Lisa Dunlap, Ritwik Gupta, Xudong Wang, Joseph E Gonzalez, and Trevor Darrell. See, say, and segment: Teaching lmms to overcome false premises. arXiv preprint arXiv:2312.08366, 2023.
- [31] Zhongwei Ren, Zhicheng Huang, Yunchao Wei, Yao Zhao, Dongmei Fu, Jiashi Feng, and Xiaojie Jin. Pixellm: Pixel reasoning with large multimodal model. arXiv preprint arXiv:2312.02228, 2023.
- [32] Yixuan Li, Xuelin Liu, Xiaoyang Wang, Shiqi Wang, and Weisi Lin. Fakebench: Uncover the achilles’ heels of fake images with large multimodal models. arXiv preprint arXiv:2404.13306, 2024.
- [33] Xiaoyi Bao, Siyang Sun, Shuailei Ma, Kecheng Zheng, Yuxin Guo, Guosheng Zhao, Yun Zheng, and Xingang Wang. Cores: Orchestrating the dance of reasoning and segmentation. arXiv preprint arXiv:2404.05673, 2024.
- [34] Hikaru Shindo, Manuel Brack, Gopika Sudhakaran, Devendra Singh Dhami, Patrick Schramowski, and Kristian Kersting. Deisam: Segment anything with deictic prompting. arXiv preprint arXiv:2402.14123, 2024.
- [35] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022.
- [36] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in neural information processing systems, 35:36479–36494, 2022.
- [37] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021.
- [38] Omri Avrahami, Ohad Fried, and Dani Lischinski. Blended latent diffusion. ACM Transactions on Graphics (TOG), 42(4):1–11, 2023.
- [39] Tao Yu, Runseng Feng, Ruoyu Feng, Jinming Liu, Xin Jin, Wenjun Zeng, and Zhibo Chen. Inpaint anything: Segment anything meets image inpainting. arXiv preprint arXiv:2304.06790, 2023.
- [40] Owen Mayer and Matthew C Stamm. Learned forensic source similarity for unknown camera models. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2012–2016. IEEE, 2018.
- [41] EITLNet: Effective image tampering localization via enhanced transformer and co-attention fusion. In ICASSP, 2024.
- [42] Fabrizio Guillaro, Davide Cozzolino, Avneesh Sud, Nick Dufour, and Luisa Verdoliva. Trufor: Leveraging all-round clues for trustworthy image forgery detection and localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20606–20615, 2023.
- [43] Xiao Guo, Xiaohong Liu, Zhiyuan Ren, Steven Grosz, Iacopo Masi, and Xiaoming Liu. Hierarchical fine-grained image forgery detection and localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3155–3165, 2023.
- [44] Xiaohong Liu, Yaojie Liu, Jun Chen, and Xiaoming Liu. Pscc-net: Progressive spatio-channel correlation network for image manipulation detection and localization. IEEE Transactions on Circuits and Systems for Video Technology, 32(11):7505–7517, 2022.
- [45] Scott McCloskey and Michael Albright. Detecting gan-generated imagery using color cues. arXiv preprint arXiv:1812.08247, 2018.
- [46] Scott McCloskey and Michael Albright. Detecting gan-generated imagery using saturation cues. In 2019 IEEE international conference on image processing (ICIP), pages 4584–4588. IEEE, 2019.
- [47] Lingzhi Li, Jianmin Bao, Ting Zhang, Hao Yang, Dong Chen, Fang Wen, and Baining Guo. Face x-ray for more general face forgery detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5001–5010, 2020.
- [48] Francesco Marra, Diego Gragnaniello, Davide Cozzolino, and Luisa Verdoliva. Detection of gan-generated fake images over social networks. In 2018 IEEE conference on multimedia information processing and retrieval (MIPR), pages 384–389. IEEE, 2018.
- [49] Zhendong Wang, Jianmin Bao, Wengang Zhou, Weilun Wang, Hezhen Hu, Hong Chen, and Houqiang Li. Dire for diffusion-generated image detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22445–22455, 2023.
- [50] Ruipeng Ma, Jinhao Duan, Fei Kong, Xiaoshuang Shi, and Kaidi Xu. Exposing the fake: Effective diffusion-generated images detection. arXiv preprint arXiv:2307.06272, 2023.
- [51] Yunpeng Luo, Junlong Du, Ke Yan, and Shouhong Ding. LaRE2: Latent reconstruction error based method for Diffusion-Generated image detection. arXiv [cs.CV], March 2024.
- [52] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, and Ross Girshick. Segment anything. arXiv:2304.02643, 2023.
- [53] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
- [54] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 1(2):3, 2022.
- [55] Hanshu Yan, Xingchao Liu, Jiachun Pan, Jun Hao Liew, Qiang Liu, and Jiashi Feng. Perflow: Piecewise rectified flow as universal plug-and-play accelerator. arXiv preprint arXiv:2405.07510, 2024.
- [56] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. In NIPS-W, 2017.
- [57] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
- [58] Myung-Joon Kwon, Seung-Hun Nam, In-Jae Yu, Heung-Kyu Lee, and Changick Kim. Learning jpeg compression artifacts for image manipulation detection and localization. International Journal of Computer Vision, 130(8):1875–1895, 2022.