
ForgeryTTT: Zero-Shot Image Manipulation Localization with Test-Time Training

Weihuang Liu, Xi Shen, Chi-Man Pun, and Xiaodong Cun. Weihuang Liu and Chi-Man Pun are with the Department of Computer and Information Science, Faculty of Science and Technology, University of Macau, Taipa, Macau (e-mail: [email protected]; [email protected]). Xi Shen is with Intellindust (e-mail: [email protected]). Xiaodong Cun is with the School of Computing and Information Technology, Great Bay University, Dongguan, China (e-mail: [email protected]). Corresponding authors: Chi-Man Pun and Xiaodong Cun.
Abstract

Social media is increasingly plagued by realistic fake images, making it hard to trust content. Previous algorithms to detect these fakes often fail in new, real-world scenarios because they are trained on specific datasets. To address this problem, we introduce ForgeryTTT, the first method leveraging test-time training (TTT) to identify manipulated regions in images. The proposed approach fine-tunes the model for each individual test sample, improving its performance. ForgeryTTT first employs vision transformers as a shared image encoder to learn both classification and localization tasks simultaneously during training-time training on a large synthetic dataset. Specifically, the localization head predicts a mask to highlight manipulated areas. Given such a mask, the input tokens can be divided into manipulated and genuine groups, which are then fed into the classification head to distinguish between manipulated and genuine parts. During test-time training, the predicted mask from the localization head is used by the classification head to update the image encoder for better adaptation. Additionally, using the classical dropout strategy in each token group significantly improves performance and efficiency. We test ForgeryTTT on five standard benchmarks. Despite its simplicity, ForgeryTTT achieves a 20.1% improvement in localization accuracy compared to other zero-shot methods and a 4.3% improvement over non-zero-shot techniques. Our code will be released upon publication.

Index Terms:
Image manipulation localization, test-time training, self-supervised learning.

I Introduction

The rapid development of image editing techniques has made image manipulation much easier [1, 2, 3]. People can easily alter image content using editing tools, often fooling the human eye. Advanced post-processing techniques further reduce the detectability of fake images. However, the misuse of these techniques raises concerns about intentional image manipulation. Frequent digital image fraud incidents have led to serious doubts about the authenticity of digital images, diminishing social credibility. To address this issue, it is urgent to develop effective image manipulation localization algorithms.

Image manipulation often involves operations like splicing, copy-move, and removal, which leave distinct traces and artifacts. Image manipulation localization aims to identify these tampered regions. Traditional algorithms mainly rely on low-level clues such as noise [4], JPEG compression artifacts [5], and camera traces [6]. Deep learning-based approaches learn from large-scale datasets [7, 8, 9], demonstrating promising results.

Figure 1: Comparison of previous and our testing phase. Previous methods directly employ the models for forgery localization, while we first perform model adaptation for each image and then localize the forgery region.

While some progress has been made, existing approaches often fail in real-world scenarios. The rapid advancement of editing technologies, especially with generative artificial intelligence [10, 11, 12], makes it difficult to collect a comprehensive dataset of manipulated samples. As a result, current image manipulation localization algorithms face significant challenges in adapting to evolving forgery techniques.

In this work, we propose using test-time training (TTT) [13, 14, 15] to address the issue. TTT adapts the pre-trained model to new targets during testing, showing excellent generalization ability by fine-tuning the model for each test sample. To the best of our knowledge, we are the first to explore TTT for image manipulation localization. As shown in Figure 1, our method is quite different from previous methods which directly employ the pre-trained model for inference. The proposed method first performs model adaptation for the given sample and then uses the adapted model for better prediction.

Typical auxiliary-head-based TTT methods use a well-designed auxiliary self-supervised objective function to update the model at test time [15, 16, 17]. Note that the effectiveness of TTT greatly depends on designing an auxiliary task closely related to the main task [16]. Our key motivation arises from the consistency between the coarse task of distinguishing whether an image is manipulated and the fine-grained task of predicting the exact manipulation regions. Furthermore, while image manipulation localization remains challenging, deep neural networks have shown promising performance in image manipulation classification: they can effortlessly determine whether an image has undergone manipulation. This provides a favorable prerequisite and encourages us to investigate an effective auxiliary task based on image manipulation classification, which boosts the pre-trained image manipulation localization model at test time.

Specifically, our ForgeryTTT is built upon the commonly used encoder-decoder network for image manipulation localization. It is composed of a shared image encoder of vision transformer [18, 19] architecture, a localization head for forgery mask prediction, and a classification head for self-supervised image manipulation classification. During training-time training, ForgeryTTT simultaneously learns image manipulation localization and self-supervised image manipulation classification. We follow the standard supervised protocol to train the localization branch. For the self-supervised classification head, we first divide the input tokens into manipulated and authentic groups via the given mask. Then we feed them into the classification head to distinguish between manipulated and authentic parts. During test-time training, the predicted mask from the localization head is used by the classification head to update the image encoder via the self-supervised objective function. This makes our model trainable for each test sample and hence able to adapt to its characteristics.

In addition, previous TTT methods [15, 16, 17] are more computationally expensive than standard inference, as they update the model by computing gradients on a batch of samples at test time. This extra cost becomes significant for large models such as transformers. We investigate simple dropout strategies on intermediate tokens to alleviate the issue. By iteratively and randomly dropping part of the tokens within a given image, we construct a batch of different samples. This strategy improves not only performance but also efficiency.

Figure 2: Examples of localization results from our method, ForgeryTTT, on testing images. Without ForgeryTTT, the model fails to accurately localize the forgery regions. However, performing adaptation on each image (w/ ForgeryTTT) yields significantly better results.

To evaluate our method in zero-shot image manipulation localization, we train the model on a large-scale synthetic dataset and evaluate it on five commonly used benchmarks. The experimental results demonstrate that both the proposed auxiliary task and the test-time training strategy significantly boost the performance of image manipulation localization on unseen images and novel manipulation techniques, e.g., an average improvement of 14.7% in fixed-threshold F1 score across various datasets. The proposed one-to-batch query generation test-time training strategy is both more effective and more efficient than the common strategy, leading to 1.8% better performance and 4.8× faster adaptation. Our proposed method exhibits surprising gains, accurately finding the forgery region through test-time training even when the pre-trained model fails initially (Figure 2). When compared to state-of-the-art image manipulation localization methods in a zero-shot setting, our proposed method outperforms them by a substantial margin. Our approach also achieves superior performance even when compared to methods that incorporate the training split of the target dataset during training. Additionally, it surpasses other TTT approaches, underscoring the effectiveness of the proposed TTT strategy.

In summary, the contributions of this work are as follows:

  • We introduce ForgeryTTT, to our best knowledge, the first test-time training framework specifically designed for zero-shot image manipulation localization.

  • We design a self-supervised task to enhance localization ability, demonstrating that it allows for more effective TTT than existing TTT approaches.

  • Extensive experiments on five benchmarks demonstrate the superiority of ForgeryTTT, surpassing both zero-shot and non-zero-shot state-of-the-art image manipulation localization methods.

Figure 3: The overview of the proposed ForgeryTTT. ForgeryTTT is a multi-task framework built upon the common encoder-decoder image manipulation localization network, which includes a shared image encoder, a localization head, and a classification head. It first learns image manipulation localization and image manipulation classification on a large-scale dataset. Then, we employ a self-supervised loss based on the classification head to train the image encoder for each test image. Finally, the updated model is used to localize the forgery region.

II Related Work

In this section, we present the work related to the proposed method from three aspects, including image manipulation localization, image manipulation classification, and test-time training.

II-A Image Manipulation Localization

Early methods in image manipulation localization primarily rely on statistical analysis and handcrafted features. They mainly focus on identifying JPEG compression artifacts [5], noise inconsistencies [4], and color filter array artifacts [6], among others. While effective in detecting global inconsistencies, these techniques often struggle with complex manipulations and precise localization. Recent advancements in deep learning have significantly influenced the field. Deep neural networks now automatically learn discriminative forensic features from large-scale datasets, resulting in notable improvements [7, 20, 21, 22, 23, 24, 25]. Most of these methods employ an encoder-decoder architecture, where an image encoder extracts the features and a decoder predicts the forgery mask. To capture imperceptible clues beyond RGB images, some approaches transform the RGB image into the noise domain or frequency domain, providing complementary views for more effective forgery detection. ManTraNet [7] integrates the SRM filter [26] and Bayar filter [27] into the pre-trained backbone to extract features from both the noise map and the RGB image. Objectformer [28] extracts RGB features and high-frequency features via two encoders and combines them as multimodal features. Edge artifacts have also been utilized as crucial supervision signals to better capture details. MVSSNet [29] employs the Sobel filter to create a multi-level edge-supervised branch. TANet [30] refines the boundary by bilaterally exploring foreground and background cues. However, the proliferation of fake images and the development of forgery techniques present ongoing challenges, and no prior work has focused on robustness when confronted with unseen scenes.

II-B Image Manipulation Classification

Traditional image manipulation classification methods initially extract hand-crafted features and then analyze statistical characteristics to differentiate the manipulated and authentic images. Fridrich et al. [31] proposed a copy-move forgery detection method based on approximate block matching. Pan et al. [32] identified forgery images using a SIFT-based matching method. Shi et al. [33] employed multi-size block discrete cosine transform and Markov transition probabilities to detect splicing. In the deep learning era, image manipulation classification methods utilize deep neural networks to capture manipulation traces [34, 35, 36]. Zhang et al. [37] input image patch features into a five-layer neural network to determine whether the image has been manipulated. Bayar et al. [34] proposed constrained convolutional neural networks that suppress the influence of image content on manipulation traces and adaptively extract manipulation features. Recent works [7, 38] on image manipulation localization simultaneously address both image manipulation classification and localization. Different from those supervised methods, we design a mask-based self-supervised image manipulation classifier.

Figure 4: The proposed self-supervised image manipulation classification algorithm. We first extract the image features using the image encoder. Then, we group foreground manipulated tokens and background authentic tokens via a given mask. Random token dropout is applied in both foreground and background tokens. Next, the foreground tokens, background tokens, and class tokens are concatenated as manipulated queries, and background tokens and class tokens are concatenated as authentic queries. Finally, the classification head is learned to distinguish these two kinds of queries.

II-C Test-Time Training

The performance of deep neural networks may drop significantly on unseen test data that diverges from the training distribution. Test-time training (TTT) aims at adapting the pre-trained model to out-of-distribution test data through self-supervised or unsupervised objective functions. Since the statistics in the BatchNorm layers represent domain-specific knowledge, statistic-based methods recalculate these statistics to adapt the model to the test domain [13, 14, 39]. Schneider et al. [13] adjust the BatchNorm statistics of the model by estimating them from the test sample. Wang et al. [14] optimize the affine parameters in the BatchNorm layers by minimizing the prediction entropy. Auxiliary-head-based algorithms leverage a self-supervised learning task to adjust the features from the training domain to the test instance [15, 17, 16]. They typically consist of a shared image encoder, a primary decoder for the main task, and an auxiliary decoder for the self-supervised task. Sun et al. [15] develop an auxiliary head for rotation prediction and fine-tune the backbone using the auxiliary head during testing. Gandelsman et al. [17] employ a masked autoencoder that reconstructs the test sample to update the pre-trained model. Some works have explored effective TTT approaches in other areas [40, 41, 42]. TTT has not yet been explored for image manipulation localization, and we take the first step in this direction.

III Method

To handle the evolving forgery images in the real world, we propose ForgeryTTT. ForgeryTTT is composed of a shared image encoder, an image manipulation localization head, and a self-supervised image manipulation classification head. We first train ForgeryTTT on a large-scale synthetic dataset. Then, we further train it for each test image via the self-supervised classification head and employ the updated model for inference. The overall architecture of ForgeryTTT is shown in Figure 3. In this section, we first introduce image manipulation localization and test-time training in section III-A. Then, we give the details of the proposed self-supervised image manipulation classification algorithm in section III-B. Finally, we introduce the two-stage pipeline of our method in section III-C.

III-A Preliminaries

Image Manipulation Localization. Image manipulation localization involves localizing the pixel-level forgery mask in an RGB image. This is typically accomplished using an encoder-decoder network architecture [7, 38, 25]. In the image encoder, the image is first split into patch tokens, which are then processed through hierarchical transformer blocks to produce multi-scale representations. These representations are subsequently fed into the localization head, which is responsible for predicting the forgery mask. The ground truth forgery mask is used to supervise the model via the binary cross-entropy loss.

Test-Time Training. Test-time training (TTT) is introduced to enhance the generalization capability of models to out-of-distribution test data [15, 14, 13]. A commonly employed TTT framework based on the auxiliary head [15] consists of a shared encoder $\mathcal{E}$, a main head $\mathcal{D}$ for the main task, and an auxiliary head $\mathcal{C}$ for self-supervised learning. The TTT framework typically involves a two-stage training process. In the first stage, the network is trained using both the main loss $\mathcal{L}_{\text{main}}$ and a self-supervised loss $\mathcal{L}_{\text{ssl}}$:

$$\min_{\mathcal{E},\mathcal{D},\mathcal{C}}\ \mathcal{L}_{\text{main}}+\lambda\mathcal{L}_{\text{ssl}}, \qquad (1)$$

where $\lambda$ is a hyper-parameter that balances the two losses.

In the subsequent stage, known as test-time training (TTT), the encoder $\mathcal{E}$ is fine-tuned for each test sample based on the self-supervised objective function:

$$\min_{\mathcal{E}}\ \mathcal{L}_{\text{ssl}}. \qquad (2)$$

This process allows the model to adapt to unseen data through the self-supervised learning task during testing, thereby improving its ability to generalize beyond the distribution of the training data.
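
To make the two-stage protocol concrete, the following is a minimal PyTorch-style sketch of Equations (1) and (2). The loss functions, data loader, and module handles (E, D, C) are generic placeholders, so this illustrates only the optimization pattern rather than any released implementation.

```python
import torch

def training_time_training(E, D, C, train_loader, main_loss, ssl_loss, lam=0.01):
    """Stage 1 (Eq. 1): jointly optimize the shared encoder E, the main head D,
    and the auxiliary self-supervised head C on the training set."""
    opt = torch.optim.Adam([*E.parameters(), *D.parameters(), *C.parameters()], lr=2e-4)
    for x, y in train_loader:
        feats = E(x)
        loss = main_loss(D(feats), y) + lam * ssl_loss(C(feats))
        opt.zero_grad(); loss.backward(); opt.step()

def test_time_training_step(E, C, x_test, ssl_loss, steps=10, lr=2e-5):
    """Stage 2 (Eq. 2): for one test sample, fine-tune only the encoder E by
    minimizing the self-supervised loss; the heads stay frozen."""
    opt = torch.optim.Adam(E.parameters(), lr=lr)
    for _ in range(steps):
        loss = ssl_loss(C(E(x_test)))
        opt.zero_grad(); loss.backward(); opt.step()
```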

III-B Self-supervised Image Manipulation Classification

Image manipulation classification has been integrated as an auxiliary task in several image manipulation localization algorithms [38, 43]. The classification head is trained in a supervised manner using labeled forgery and authentic images. The classification score helps detect global inconsistencies since forgery images exhibit a high response in at least one region, while authentic images show no response. However, test-time training (TTT) necessitates a self-supervised auxiliary task to allow for model optimization without relying on labels during testing. To address this requirement, we develop a mask-based self-supervised image manipulation classification algorithm (Figure 4).

Figure 5: The details of the proposed classification head. The classification head merges the multi-scale features into tokens and outputs the probability of whether the given tokens are manipulated.
Figure 6: The details of the test-time training pipeline. For a given test image, we first extract its features via the image encoder and predict an initial mask via the localization head. Then, we employ token group and token dropout on the features according to the predicted mask. The remaining tokens are used to construct the manipulated query and serve as the pseudo-training sample to update the image encoder.

Given an input image $i\in\mathbb{R}^{h\times w\times 3}$, we extract multi-scale features $F=\{f_{s}: s\in(1,2,\ldots,s_{\max})\}$ via the image encoder $\mathcal{E}$, where $f_{s}\in\mathbb{R}^{h_{s}\times w_{s}\times c_{s}}$ represents the features extracted at scale $s$, and $h_{s}$, $w_{s}$, and $c_{s}$ are the height, width, and channel dimension of the feature at scale $s$, respectively. The proposed classification head is shown in Figure 5. Specifically, we first merge the multi-scale features obtained by the image encoder into tokens. The multi-scale features are first unified in the channel dimension by the corresponding linear layers:

$$f_{s}=\operatorname{Linear}_{s}(f_{s}),\quad s=1,2,\ldots,s_{\max}. \qquad (3)$$

These features are upsampled to $h_{1}\times w_{1}$ and concatenated together as $f$. The fused features $f$ are split into tokens with a $p\times p$ patch size (relative to the original resolution) via a patch embedding layer:

$$e=\operatorname{PatchEmbed}(f), \qquad (4)$$

where $e\in\mathbb{R}^{\frac{h}{p}\times\frac{w}{p}\times d}$. Next, we select some tokens as the query $q$. The query and the learnable class token $c$ are fed into a series of transformer blocks:

$$c_{j},q_{j}=L_{j}\left(\operatorname{Concat}[c_{j-1},q_{j-1}]\right),\quad j=1,2,\ldots,N, \qquad (5)$$

where $L_{j}$ is the $j$-th transformer block. Each transformer block is composed of multi-head self-attention and feed-forward networks, together with layer normalization and residual connections. Finally, we map the class token into the class probability via a linear layer:

$$y=\operatorname{Linear}\left(c_{N}\right), \qquad (6)$$

where $y$ represents the probability that the given query is manipulated.
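
For clarity, below is a minimal PyTorch sketch of this head corresponding to Equations (3)-(6). The channel widths, token dimension, number of attention heads, and the stride of the patch embedding are illustrative assumptions rather than values from the released code; the query selection that happens between tokenization and classification is described in the following subsections.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClassificationHead(nn.Module):
    """Sketch of the classification head in Figure 5 (hyper-parameters assumed)."""

    def __init__(self, in_dims=(96, 192, 384, 768), dim=256, depth=5, heads=8):
        super().__init__()
        # Eq. (3): per-scale linear layers that unify the channel dimension.
        self.linears = nn.ModuleList([nn.Linear(c, dim) for c in in_dims])
        # Eq. (4): patch embedding over the fused feature map.
        self.patch_embed = nn.Conv2d(dim * len(in_dims), dim, kernel_size=4, stride=4)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)          # Eq. (5)
        self.fc = nn.Linear(dim, 1)                                # Eq. (6)

    def tokenize(self, feats):
        """feats: list of (B, c_s, h_s, w_s) multi-scale features from the encoder."""
        h1, w1 = feats[0].shape[-2:]
        fused = []
        for f, lin in zip(feats, self.linears):
            f = lin(f.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)     # unify channels
            fused.append(F.interpolate(f, size=(h1, w1), mode="bilinear"))
        e = self.patch_embed(torch.cat(fused, dim=1))              # (B, dim, h/p, w/p)
        return e.flatten(2).transpose(1, 2)                        # (B, L, dim) tokens

    def classify(self, query_tokens):
        """query_tokens: (B, L', dim) tokens kept after grouping and dropout."""
        cls = self.cls_token.expand(query_tokens.shape[0], -1, -1)
        x = self.blocks(torch.cat([cls, query_tokens], dim=1))
        return torch.sigmoid(self.fc(x[:, 0]))                     # p(manipulated)
```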

On top of the basic architecture, we further introduce several novel features, including label-free query construction and mask-based token dropout, which collectively make a powerful model.

Label-Free Query Construction. To adapt image manipulation classification to test-time training, we model it as a self-supervised task rather than a supervised one. The pseudo label is derived from the given mask. During training-time training, we employ the ground truth mask to group the foreground tokens and background tokens. The mask is downsampled using max-pooling to match the token size. The foreground tokens and background tokens together are used to construct the "manipulated" query, while the background tokens alone are used to construct the "authentic" query. These two types of queries are concatenated with a class token to train the classifier in a self-supervised manner. During TTT, we construct only the manipulated query using the predicted mask to further train the image encoder with the pseudo-training samples. The manipulated query is reliable, whereas the authentic query could be wrong when the predicted mask is inaccurate. Thus, we can use the constructed query to train a self-supervised classifier and fine-tune the pre-trained model with pseudo-training samples during testing.

Mask-based Token Dropout. Previous methods that incorporate image manipulation classification as an auxiliary task feed all tokens into the classifier. We argue this is redundant: a small number of authentic and manipulated tokens is sufficient to train a classifier. Inspired by recent advances in masked image modeling, we boost performance via token dropout. To achieve this, we first utilize a mask to differentiate between foreground manipulated tokens and background authentic tokens within a given set of image tokens. We then perform random dropout to select a subset of both foreground and background tokens, discarding the remaining tokens. This dropout process follows a uniform distribution, ensuring an equal representation of different types of tokens. The dropout ratio $r$ is the same for both foreground and background. The benefit of this mask-based dropout strategy is straightforward: it boosts classification performance by eliminating redundant information, since image manipulation classification does not rely on most of the tokens.
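
As an illustration of label-free query construction combined with mask-based token dropout, here is a rough PyTorch sketch. It assumes the tokens are already flattened to the token grid and the mask has been max-pooled to the same grid; the helper name and the exact sampling policy are ours for illustration, not the released code.

```python
import torch

def group_and_drop_tokens(tokens, token_mask, r=0.5):
    """Split tokens into manipulated / authentic groups via a binary mask,
    then randomly keep a (1 - r) fraction of each group.

    tokens:     (num_tokens, dim) patch tokens from the fused features
    token_mask: (num_tokens,) binary mask, 1 = manipulated, 0 = authentic
    r:          dropout ratio applied equally to both groups
    """
    fg = tokens[token_mask == 1]          # foreground (manipulated) tokens
    bg = tokens[token_mask == 0]          # background (authentic) tokens

    def random_keep(x):
        # Uniformly sample the tokens to keep (dropout with ratio r).
        n_keep = max(1, int(x.shape[0] * (1 - r)))
        idx = torch.randperm(x.shape[0])[:n_keep]
        return x[idx]

    fg, bg = random_keep(fg), random_keep(bg)

    # "Manipulated" query: foreground + background tokens (pseudo label 1).
    # "Authentic"  query: background tokens only          (pseudo label 0).
    manipulated_query = torch.cat([fg, bg], dim=0)
    authentic_query = bg
    return manipulated_query, authentic_query
```

During training-time training the ground-truth mask drives this grouping, whereas at test time the predicted mask is used and only the manipulated query is kept.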

III-C Objective Function

With the development of image editing technology, massive forgery images and advanced editing methods are challenging existing image manipulation localization algorithms. Mainstream methods often directly apply the models trained on large-scale datasets to the target test images, which makes them struggle to generalize to unseen images. To overcome these limitations, we propose ForgeryTTT. ForgeryTTT first trains on a large-scale dataset and then adapts the pre-trained model to test samples through self-supervision during testing.

As illustrated in Figure 3, ForgeryTTT is a multi-task framework composed of a shared image encoder $\mathcal{E}$, an image manipulation localization head $\mathcal{D}$, and an image manipulation classification head $\mathcal{C}$. The image encoder $\mathcal{E}$ is a hierarchical transformer [19] that produces a hierarchical representation of the input image $i$. The localization head $\mathcal{D}$ is a lightweight multi-layer perceptron [44] that predicts the forgery mask. The classification head $\mathcal{C}$ is composed of several transformer blocks and linear layers, which distinguish whether the given query is manipulated.

III-C1 Training-Time Training

During the first stage, training-time training, ForgeryTTT simultaneously learns image manipulation localization and image manipulation classification. The ground truth mask $m$ supervises the localization branch using the binary cross-entropy loss $\mathcal{L}_{\text{bce}}$. The classification branch is trained in a self-supervised manner by sampling image tokens according to the ground truth mask to construct manipulated and authentic queries. These queries are fed into the classifier to predict whether the given query is manipulated. The total objective function for this stage can be formulated as:

$$\mathcal{L}_{\text{bce}}(\mathcal{D}(\mathcal{E}(i)),m)+\lambda\mathcal{L}_{\text{bce}}(\mathcal{C}(\phi(\mathcal{E}(i),m))), \qquad (7)$$

where $\phi$ denotes the whole process of token grouping, token dropout, and query construction.
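
Putting the pieces together, a rough sketch of one training-time iteration implementing Eq. (7) could look as follows. It reuses the ClassificationHead and group_and_drop_tokens helpers sketched above, assumes a batch size of one for readability and a 16×16 token grid stride, and should not be read as the exact released implementation.

```python
import torch
import torch.nn.functional as F

def training_time_loss(encoder, loc_head, cls_head, image, gt_mask, lam=0.01, r=0.5):
    """One training-time training step (Eq. 7); batch size 1 for readability."""
    feats = encoder(image)                               # multi-scale features E(i)
    pred_mask = loc_head(feats)                          # localization branch D(E(i))
    loss_loc = F.binary_cross_entropy_with_logits(pred_mask, gt_mask)

    # phi: token grouping, token dropout, and query construction from the GT mask.
    tokens = cls_head.tokenize(feats)[0]                              # (L, dim)
    token_mask = F.max_pool2d(gt_mask, kernel_size=16).flatten()      # mask -> token grid
    mani_q, auth_q = group_and_drop_tokens(tokens, token_mask, r)

    # Self-supervised classification: manipulated query -> 1, authentic query -> 0.
    p_mani = cls_head.classify(mani_q.unsqueeze(0))
    p_auth = cls_head.classify(auth_q.unsqueeze(0))
    loss_cls = F.binary_cross_entropy(p_mani, torch.ones_like(p_mani)) + \
               F.binary_cross_entropy(p_auth, torch.zeros_like(p_auth))
    return loss_loc + lam * loss_cls
```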

Figure 7: Comparison of the different test-time training strategies. (a) The basic test-time training strategy (TTT-Base) uses a batch of images during the whole forward propagation. (b) Test-time training with token dropout (TTT-TD) randomly drops some tokens, since they are redundant for classification. (c) Test-time training with one-to-batch query generation (TTT-OBQG) encodes only one image and constructs a batch of queries by repeating random dropout on the given tokens.

III-C2 Test-Time Training

In the second stage, known as test-time training, ForgeryTTT is required to generalize to unseen forgery images by optimizing the self-supervised objective function. As shown in Figure 6, for each incoming test image $i'$, we first obtain its predicted mask $m'$ using the image encoder $\mathcal{E}$ and the localization head $\mathcal{D}$. We then randomly drop some tokens based on the predicted mask. The remaining tokens are fed into the classification head $\mathcal{C}$ as "manipulated" pseudo-training samples to fine-tune the pre-trained encoder $\mathcal{E}$.

Building on the aforementioned mask-based token dropout and label-free query construction, we further propose novel test-time training strategies (Figure 7), named test-time token dropout (TTT-TD) and one-to-batch query generation (TTT-OBQG).

Test-Time Token Dropout (TTT-TD). Random token dropout during test-time training has been employed in methods such as TTT-MAE [17], which randomly discards some tokens and uses masked image reconstruction as the self-supervised objective function. In contrast, our key insight is that image manipulation classification does not rely on most of the tokens (as introduced in the previous section). Consequently, we further incorporate token dropout into our test-time training process. The dropout remains random, and the regions are specified using the predicted mask. We find this is also effective in our method.

One-to-Batch Query Generation (TTT-OBQG). TTT is generally performed on a batch of samples to derive more precise gradients. For instance, the basic TTT strategy utilizes data augmentation to acquire a batch of images. All images must pass through the image encoder, rendering TTT-Base inefficient in terms of running time and memory usage. Here we propose one-to-batch query generation for efficient TTT. Similarly, our strategy is built upon random dropout. We randomly drop different tokens in the encoded image tokens to construct a batch of queries. Therefore, we can construct a batch of queries with only one image being processed through the image encoder, which greatly reduces computation costs.

We also employ multiple training steps, a common practice, to boost performance. The model, once fine-tuned on a sample, serves as a more suitable initialization for subsequent training steps than the pre-trained model. Consequently, the model undergoes an iterative update process, wherein each step fine-tunes the model based on the previous step. Overall, the $k$-th step optimization of ForgeryTTT can be formulated as:

$$\mathcal{L}_{\text{bce}}(\mathcal{C}(\phi(\mathcal{E}_{k-1}(i'),m'))), \qquad (8)$$

We find that there is almost no difference if we replace $m'$ with $m'_{k-1}$ in the above equation, since our self-supervised task only needs pseudo-manipulated queries and a rough mask is sufficient.
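
To summarize the test-time procedure, the sketch below combines the predicted-mask query construction, one-to-batch query generation, and the iterative updates of Eq. (8). Hyper-parameters and helper names follow the earlier sketches and are illustrative assumptions rather than the released code.

```python
import copy
import torch
import torch.nn.functional as F

def forgery_ttt(encoder, loc_head, cls_head, image, steps=10, batch=32, r=0.5, lr=2e-5):
    """Sketch of test-time training with one-to-batch query generation (Eq. 8)."""
    encoder = copy.deepcopy(encoder)            # adapt a per-image copy of the encoder
    opt = torch.optim.Adam(encoder.parameters(), lr=lr)

    with torch.no_grad():                       # initial mask prediction m'
        init_mask = torch.sigmoid(loc_head(encoder(image)))
    token_mask = (F.max_pool2d(init_mask, kernel_size=16) > 0.5).float().flatten()

    for _ in range(steps):                      # each step fine-tunes the previous one
        tokens = cls_head.tokenize(encoder(image))[0]    # one pass through the encoder
        # One-to-batch query generation: repeat random dropout on the same tokens
        # to build a batch of pseudo-"manipulated" queries (label 1).
        queries = torch.stack([group_and_drop_tokens(tokens, token_mask, r)[0]
                               for _ in range(batch)])
        p = cls_head.classify(queries)
        loss = F.binary_cross_entropy(p, torch.ones_like(p))
        opt.zero_grad(); loss.backward(); opt.step()

    with torch.no_grad():                       # final prediction with the adapted encoder
        return torch.sigmoid(loc_head(encoder(image)))
```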

Figure 8: Qualitative comparison between the proposed method and existing SOTA methods. From top to bottom are the samples from the CASIA dataset (1st and 2nd rows), Coverage dataset (3rd and 4th rows), Columbia dataset (5th and 6th rows), Nist16 dataset (7th and 8th rows), and CocoGlide dataset (9th and 10th rows).
TABLE I: Comparison with state-of-the-art image manipulation localization methods on the CASIA [45], Coverage [46], Columbia [47], NIST16 [48], and CocoGlide [24] datasets. Marked methods have training datasets that overlap with the testing domain.
CASIA Coverage Columbia NIST16 CocoGlide Average
Fbest Ffix AUC ACC Fbest Ffix AUC ACC Fbest Ffix AUC ACC Fbest Ffix AUC ACC Fbest Ffix AUC ACC Fbest Ffix AUC ACC
ManTraNet [7] 18.0 32.0 64.4 50.0 48.6 31.7 76.0 50.0 65.0 50.8 81.0 50.0 22.5 17.2 62.4 50.0 67.3 51.6 77.8 50.0 44.3 36.7 72.3 50.0
EXIF-SC  [49] 22.5 10.6 49.0 50.0 33.2 16.4 49.8 50.0 88.0 79.8 97.6 50.6 29.8 22.7 50.4 50.0 42.4 29.3 52.6 50.0 43.2 31.8 59.9 50.1
SPAN [9] 16.9 11.2 48.0 48.7 42.8 23.5 67.0 60.5 87.3 75.9 99.9 95.1 36.3 22.8 63.2 59.7 35.0 29.8 47.5 49.1 43.7 32.6 65.1 62.6
Noiseprint [50] 20.5 13.7 49.4 - 34.2 22.9 52.5 - 85.3 51.3 87.2 - 34.5 19.6 61.8 - 40.5 31.8 52.0 - 43.0 27.9 60.6 -
MVSSNet [29] 65.0 52.8 93.2 80.0 65.9 51.4 73.7 54.5 78.1 72.9 98.4 66.7 37.2 32.0 57.9 53.8 64.2 48.6 65.4 53.6 62.1 51.5 77.6 61.9
IFOSN [51] 67.6 55.3 73.5 63.5 47.2 30.4 55.7 51.0 83.6 75.3 88.2 52.2 44.9 33.0 65.8 55.3 58.9 42.8 61.1 56.7 60.4 47.4 68.9 55.7
CATNetv2 [52] 85.2 75.2 94.2 83.8 58.2 38.1 68.0 63.5 92.3 85.9 97.7 80.3 41.7 30.8 75.0 59.7 60.3 43.4 66.7 58.0 67.5 54.7 80.3 69.1
PSCCNet [53] 67.0 52.0 86.9 68.3 61.5 47.3 65.7 55.0 76.0 60.4 30.0 50.8 21.0 11.3 48.5 45.6 68.5 51.5 77.7 66.1 58.8 44.5 61.8 57.2
TruFor [24] 82.2 73.7 91.6 81.3 73.5 60.0 77.0 68.0 91.4 85.9 99.6 98.4 47.0 39.9 76.0 66.2 72.0 52.3 75.2 63.9 73.2 62.4 83.9 75.6
UnionFormer [43] 86.3 76.0 95.1 84.3 72.0 59.2 78.3 69.4 92.5 86.1 99.8 97.9 48.9 41.3 79.3 68.0 74.2 53.6 79.7 68.2 74.8 63.2 86.4 77.6
Ours 80.4 72.2 91.6 87.1 70.8 63.0 87.1 77.2 95.7 92.7 87.9 87.4 55.8 44.3 76.0 71.7 76.8 65.4 86.6 72.9 75.9 67.5 85.8 79.3
TABLE II: Comparison with state-of-the-art test-time training methods on the CASIA [45], Coverage [46], Columbia [47], NIST16 [48], and CocoGlide [24] datasets using Ffix.
CASIA Coverage Columbia NIST16 CocoGlide
Baseline 61.4 35.6 89.4 34.7 43.0
BN [13] 62.5 38.4 89.1 35.3 44.1
TENT [14] 62.1 37.7 88.6 34.4 44.6
TTT-ROT [15] 62.7 40.1 90.2 36.8 45.7
TTT-MAE [17] 63.0 39.4 89.9 35.1 45.2
Ours 72.2 63.0 92.7 44.3 65.4

IV Experiments

This section shows the experimental results for image manipulation localization of our method with other state-of-the-art methods. Firstly, we illustrate the dataset and implementation details in section IV-A and section IV-B, respectively. Then, we compare the proposed method with other state-of-the-art methods on several benchmarks in section IV-C. Next, we give a detailed analysis of each component of the proposed method in section IV-D. Finally, we evaluate its robustness in several distortion settings in section IV-E.

IV-A Datasets

We utilize multiple datasets, including SynCOCO, CASIA [45], Coverage [46], Columbia [47], NIST16 [48], and CocoGlide [24], in our experiments.

Training Data. We create a synthetic dataset named SynCOCO to train our model. This dataset is built upon MSCOCO [54] and contains forgery images of several manipulation types, including splicing, copy-move, and removal. For splicing, we extract an annotated instance from a random image and paste it into another image. For copy-move, we extract an annotated instance from a random image and paste it into a different region within the same image. For removal, we delete an annotated instance from an image and employ a cutting-edge image inpainting algorithm [2] to restore the region. In total, the synthetic dataset comprises 150,000 labeled forgery images.
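
To illustrate the splicing branch, here is a rough sketch of how such a sample could be synthesized from MSCOCO instance annotations using pycocotools. The paths, the resizing policy, and the absence of blending or post-processing are simplifications for illustration, not our exact generation pipeline.

```python
import random
import numpy as np
from PIL import Image
from pycocotools.coco import COCO

def synthesize_splice(coco: COCO, img_dir: str):
    """Paste a random annotated instance from one COCO image onto another and
    return the forged image together with its ground-truth forgery mask."""
    src_id, dst_id = random.sample(coco.getImgIds(), 2)
    src_info, dst_info = coco.loadImgs([src_id, dst_id])
    src = Image.open(f"{img_dir}/{src_info['file_name']}").convert("RGB")
    dst = Image.open(f"{img_dir}/{dst_info['file_name']}").convert("RGB")

    anns = coco.loadAnns(coco.getAnnIds(imgIds=src_id, iscrowd=False))
    if not anns:                                  # source image has no instances
        return None
    mask = Image.fromarray(coco.annToMask(random.choice(anns)) * 255)

    # Resize the donor image and its instance mask to the target resolution,
    # then paste the masked instance onto the destination image.
    src, mask = src.resize(dst.size), mask.resize(dst.size)
    forged = dst.copy()
    forged.paste(src, (0, 0), mask)
    gt_mask = np.array(mask) > 127                # binary forgery mask
    return forged, gt_mask
```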

Testing Data. We evaluate our method on five publicly available benchmarks. CASIA [45] is composed of CASIA v1 (5,123 images) and CASIA v2 (921 images), including splicing and copy-move samples. We evaluate our method on CASIA v2. Coverage [46] is a copy-move forgery dataset containing 100 manipulated images. We use the test split (25 images) to evaluate the proposed method. Columbia [47] includes 180 forgery images designated for splicing detection. We evaluate our method on the entire dataset. NIST16 [48] contains 564 forgery images involving splicing, copy-move, and removal types. We use the test split (160 images) to evaluate the proposed method. CocoGlide [24] comprises 512 images generated from the COCO 2017 validation set using the GLIDE diffusion model [55]. We evaluate our method on the entire dataset.

IV-B Implementation Details

We implement our algorithm using the PyTorch framework, and all experiments are performed on a single NVIDIA A40 GPU. For optimization, we use the Adam [56] optimizer. The initial learning rate is set to $2\times10^{-4}$ and adjusted by an exponential decay strategy with a decay rate of 0.9. We employ a fixed learning rate of $2\times10^{-5}$ during TTT. The models are trained for 20 epochs on the SynCOCO dataset and for 10 epochs during TTT on each test sample. The mini-batch size is 32. The input image is resized to a fixed resolution of $512\times512$. Random horizontal flipping, resizing, and cropping are used for data augmentation.

As for the network, we adopt Swin-Tiny [19] as the image encoder and the lightweight All-MLP decoder [44] as the localization head, and the number of transformer blocks in the classification head is set to 5. The loss weight $\lambda$ in training-time training (Equation 7) is set to 0.01. The dropout ratio for token dropout is 0.5 and the patch size in the classification head is $16\times16$ (Section III-B).

We evaluate the proposed method for image manipulation localization using several metrics, following [24, 43]. We report the F1 score using both the best threshold (Fbest) and the default 0.5 threshold (Ffix) to measure pixel-level performance. For image-level analysis, we utilize the Area Under the Curve (AUC) and balanced accuracy (ACC). For all these metrics, larger is better.
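
As a reference for how the pixel-level F1 scores can be computed, the following is a small sketch. It assumes scikit-learn is available and uses a simple 101-point threshold sweep, which may differ from the exact evaluation protocol of [24, 43].

```python
import numpy as np
from sklearn.metrics import f1_score

def pixel_f1_scores(pred, gt, thresholds=np.linspace(0.0, 1.0, 101)):
    """Pixel-level F1 at the default 0.5 threshold (Ffix) and the best F1 over
    a sweep of thresholds (Fbest). `pred` holds per-pixel manipulation
    probabilities, `gt` the binary ground-truth mask."""
    pred, gt = pred.ravel(), gt.ravel().astype(int)
    f_fix = f1_score(gt, (pred >= 0.5).astype(int), zero_division=0)
    f_best = max(f1_score(gt, (pred >= t).astype(int), zero_division=0)
                 for t in thresholds)
    return f_fix, f_best
```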

TABLE III: Ablation study of the self-supervised image manipulation classification algorithm on the CASIA [45], Coverage [46], Columbia [47], NIST16 [48], and CocoGlide [24] datasets using Ffix.
Self-supervised Head - ✓ ✓ ✓ ✓
Token Dropout - - - ✓ ✓
Test-Time Training - - ✓ - ✓
CASIA 61.4 63.1 66.8 65.6 72.2
Coverage 35.6 35.8 60.2 37.3 63.0
Columbia 89.4 88.9 90.5 90.2 92.7
NIST16 34.7 35.4 47.1 32.9 44.3
CocoGlide 43.0 44.2 64.0 46.7 65.4
TABLE IV: Ablation study of different queries in test-time training on the CASIA [45], Coverage [46], Columbia [47], NIST16 [48], and CocoGlide [24] datasets using Ffix.
CASIA Coverage Columbia NIST16 CocoGlide
- 65.6 37.3 90.2 32.9 46.7
Mani. Query 72.2 63.0 92.7 44.3 65.4
Auth. Query 47.5 29.1 77.8 18.8 36.5
Both Queries 54.7 35.7 81.0 20.9 40.2
TABLE V: Ablation study of the patch size on the CASIA [45], Coverage [46], Columbia [47], NIST16 [48], and CocoGlide [24] datasets using Ffix.
CASIA Coverage Columbia NIST16 CocoGlide
16 ×\times 16 72.2 63.0 92.7 44.3 65.4
32 ×\times 32 71.5 56.1 91.6 43.1 69.6
64 ×\times 64 70.3 52.7 90.4 42.4 68.3
TABLE VI: Ablation study of the dropout ratio on the CASIA [45], Coverage [46], Columbia [47], NIST16 [48], and CocoGlide [24] datasets using Ffix.
CASIA Coverage Columbia NIST16 CocoGlide
0.1 70.6 58.8 90.8 44.1 63.0
0.3 71.4 60.4 91.6 44.9 64.2
0.5 72.2 63.0 92.7 44.3 65.4
0.7 71.8 61.7 92.2 42.7 65.0
0.9 69.2 59.5 91.0 41.8 63.7
TABLE VII: Ablation study of different test-time training strategies.
Ffix Memory (MB) Runtime (ms)
TTT-Base 65.7 43,600 1,250
TTT-TD 67.8 35,600 1,010
TTT-OBQG 67.5 7,100 260

IV-C Comparison with state-of-the-art methods

We compare the performance of our method with other state-of-the-art image manipulation localization methods [7, 9, 49, 50, 29, 51, 52, 53, 24]. Since our method is designed for zero-shot image manipulation localization, the model is only trained on a large-scale synthetic dataset and evaluated directly on the test split of the target datasets. In contrast, some methods further fine-tune their models on the training split of the target dataset [9, 38] or include the training split in their training data [24, 43]. For a fair comparison, we divide them into two categories according to their training protocols: (1) non-zero-shot methods: SPAN [9], MVSSNet [29], CATNetv2 [52], PSCCNet [53], TruFor [24], and UnionFormer [43]; (2) zero-shot methods: ManTraNet [7], EXIF-SC [49], Noiseprint [50], IFOSN [51], and our ForgeryTTT.

Table I reports the comparison of Fbest, Ffix, AUC, and ACC between our method and other state-of-the-art methods. Specifically, our method is much better than previous zero-shot methods (ManTraNet [7], EXIF-SC [49], Noiseprint [50], and IFOSN [51]). Our approach outperforms them on all five benchmarks and achieves an Ffix improvement of at least 20.1% on average. On the other hand, our approach also achieves top performance when compared to methods that include the training split of the target dataset in the training data (CATNetv2 [52], TruFor [24], and UnionFormer [43]) or fine-tune the pre-trained model on the training split (SPAN [9], MVSSNet [29], and PSCCNet [53]). Our method obtains the highest average Ffix, even though our model has never seen these images before. It is worth noting that our approach performs best on the CocoGlide dataset, which is built with a prompt-based generative artificial intelligence technique [55]; none of these methods have included this kind of data during training. This shows that our approach has strong potential to generalize to unseen scenarios. We show the visual comparison of our method and other state-of-the-art methods in Figure 8. The proposed method performs better than the others on all five benchmarks, further demonstrating its superiority.

We further analyze the efficiency of the proposed method. The parameter counts of the components in our model are 27.5M for the image encoder, 0.6M for the localization head, and 5.1M for the classification head. Consequently, we introduce only a small number of additional parameters to the baseline model. For comparison, our model (33.2M) is smaller than other top competitors, such as TruFor (68.7M), CATNetv2 (114.3M), IFOSN (128.8M), and MVSSNet (146.9M). Our model takes about 30 hours on one A40 GPU for training, while TruFor takes more than 14 days for training on one A6000 GPU. Our full model takes 12 milliseconds per frame for inference and 260 milliseconds per frame for TTT. TTT inherently brings additional computation time, so we also try to mitigate its impact: equipped with the proposed strategy, our method becomes both more effective and faster than the basic strategy (please refer to the next section).

We also compare the proposed method with state-of-the-art TTT methods [13, 14, 15, 17]. For statistics-based TTT methods [13, 14], we directly apply TTT to the baseline model, and we follow a two-stage training pipeline for the TTT methods with an auxiliary head [15, 17]. All these methods are pre-trained on SynCOCO and evaluated on the test split of the five benchmarks. As shown in Table II, the proposed method outperforms all other TTT methods significantly. On the one hand, although other TTT methods work well under some corruptions such as fog, snow, and rain, they may be less effective against natural domain shifts. On the other hand, designing an auxiliary task that is closely related to the main task has a significant impact on the performance of TTT. The proposed ForgeryTTT proves advantageous for image manipulation localization, bringing significant improvements when applied to unseen forgery images.

IV-D Ablation Analysis

Figure 9: Performance comparison at different TTT epochs on the CASIA [45], Coverage [46], Columbia [47], NIST16 [48], and CocoGlide [24] datasets. Performance improves as TTT proceeds. Different datasets reach their best performance at different numbers of epochs.
Figure 10: Qualitative results at different TTT epochs. Our method is less effective before TTT when confronted with unseen forgery images. The results progressively improve during TTT, and our method finally localizes the forged regions accurately.
Architecture Design

We first verify the proposed architecture and the results are shown in Table III. The baseline is set to the common encoder-decoder model. First, the proposed self-supervised image manipulation classification head is beneficial to image manipulation localization in training-time training and obtains significant gains during test-time training, which indicates that our designed self-supervised image manipulation classification algorithm suits image manipulation localization very well. Second, the model with token dropout has a better performance in both training-time training and test-time training, which shows that reducing redundancy is effective for image manipulation classification. The improvements are stable in different datasets, demonstrating the effectiveness in handling unseen images and unseen manipulation techniques.

We then explore constructing different queries in test-time training. As shown in Table IV, when we use the authentic query or both the authentic and manipulated queries, the performance is much worse than using only the manipulated query for TTT, and even worse than not performing TTT at all. This is because the predicted mask is inaccurate; the constructed authentic query may therefore be wrong and harm test-time training.

Next, we validate the influence of the patch size of image tokens. As shown in Table V, performance drops as the patch size increases on most datasets. On the one hand, the number of patches available for sampling decreases, so the diversity of the constructed queries decreases. On the other hand, a larger patch size means less detailed information, which may harm image manipulation classification.

Finally, we find that the dropout ratio is also important since it affects the random dropout process. The results for different dropout ratios are presented in Table VI. The model degenerates into a simple image manipulation classification model when the dropout ratio is set to 0, and the performance is poor when the dropout ratio is too low (<0.3) or too high (>0.7). We take 0.5 as the default dropout ratio since it works well on most datasets. In summary, dropout is important for constructing diverse queries, and an appropriate dropout ratio is beneficial for sampling enough tokens for classification.

TTT Scheme

We present three test-time training schemes in Section III-C, including basic test-time training (TTT-Base), test-time training with token dropout (TTT-TD), and test-time training with one-to-batch query generation (TTT-OBQG). We compare them in Table VII. A batch of samples enables TTT to estimate more accurate gradients, but at the cost of increased GPU memory usage and runtime. For instance, when we use only one image in TTT-Base, the average Ffix is 61.9, and it rises to 65.7 if we feed 32 samples obtained by data augmentation. However, this comes at the cost of increasing GPU memory usage from 4,300 MB to 43,600 MB and runtime from 24 milliseconds to 1,250 milliseconds. Next, the average Ffix improves to 67.8 when we adopt TTT-TD, while the GPU memory usage and runtime decrease to 35,600 MB and 1,010 milliseconds, respectively. This shows that token dropout is also effective for test-time training, because using all tokens for classification is redundant. Finally, when we adopt TTT-OBQG, the GPU memory usage and runtime drop rapidly to 7,100 MB and 260 milliseconds, respectively. The average Ffix drops by only 0.3 compared to TTT-TD and is still 1.8 higher than TTT-Base, which demonstrates the effectiveness and efficiency of the proposed TTT-OBQG.

We show the effect of the number of TTT epochs on performance in Figure 9. Remarkable improvements can be observed across all datasets in the first few TTT epochs. As the number of epochs increases, the performance on some datasets begins to saturate. For convenience, we use 10 epochs as the default on all datasets, since more epochs do not bring noticeable improvements. We also plot the results obtained by our method in Figure 10. Our model typically cannot locate the forgery region initially for the given unseen forgery images. As TTT proceeds, we observe that the prediction becomes more and more accurate.

TABLE VIII: Robustness analysis under several distortion settings on the CASIA [45] dataset using Ffix.
MVSSNet IFOSN TruFor Ours
None 52.8 55.3 73.7 72.2
GaussianBlur(k=3) -12.8 -14.6 -31.3 -24.2
GaussianBlur(k=5) -18.2 -19.5 -34.5 -28.5
GaussianNoise(σ\sigma=3) -9.3 -3.5 -2.9 -1.6
GaussianNoise(σ\sigma=5) -20.0 -3.8 -4.6 -2.1
JPEGCompress(50) -37.7 -26.9 -14.2 -20.4
JPEGCompress(100) -21.0 -8.0 -3.4 -3.6
Transmission(Facebook) -5.9 -4.0 -2.1 -1.9
Transmission(Whatsapp) -8.4 -2.9 -2.4 -1.0
Transmission(Weibo) -4.8 -4.6 -6.1 -1.5
Transmission(Wechat) -18.9 -9.9 -12.2 -9.0

IV-E Robustness Analysis

We evaluate robustness under several distortion settings. Table VIII shows the results on the CASIA dataset for MVSSNet, IFOSN, TruFor, and our method, where the images are processed with Gaussian blur, Gaussian noise, JPEG compression, and online social network transmission. The proposed method shows more stable results than the others in most cases, with the smallest performance drop under various attacks. Our method performs poorly when faced with heavily compressed and Gaussian-blurred images, and we notice that the performance of TruFor also drops significantly in these cases. In contrast, MVSSNet and IFOSN perform better on Gaussian-blurred images. We attribute this to the different backbones: our method and TruFor use transformer-based backbones, while MVSSNet and IFOSN use CNN-based backbones. When image quality is severely degraded, the non-semantic discrepancies between manipulated and authentic regions become smaller, while local inconsistencies at manipulation boundaries become more important. In addition, we believe that including these types of data augmentation during training could help improve robustness against such attacks.

V Conclusion

In this work, we propose a novel image manipulation localization method termed ForgeryTTT. The pre-trained model effectively generalizes to unseen forgery images by fine-tuning the model for each test sample at test time. We present a multi-task framework that simultaneously performs image manipulation localization and image manipulation classification, in which the classification head is learned in a self-supervised manner. For each test sample, we first fine-tune the model with the self-supervised objective function and then make a better prediction using the updated model. We also explore several well-designed strategies to further enhance our model. Extensive experiments on five publicly available image manipulation localization benchmarks demonstrate significant improvements in handling unseen forgery images, and our method outperforms existing image manipulation localization methods. Our approach is promising for its excellent zero-shot performance, which shows its potential to generalize to unseen forgery images in the real world. In the future, we will develop a more powerful zero-shot forensics tool that covers other modalities to combat various forms of fake content on the internet. We hope this work can provide new ideas for other research on multimedia forensics.

References

  • [1] X. Cun and C.-M. Pun, “Improving the harmony of the composite image by spatial-separated attention module,” IEEE Transactions on Image Processing, vol. 29, pp. 4759–4771, 2020.
  • [2] R. Suvorov, E. Logacheva, A. Mashikhin, A. Remizova, A. Ashukha, A. Silvestrov, N. Kong, H. Goka, K. Park, and V. Lempitsky, “Resolution-robust large mask inpainting with fourier convolutions,” in Proceedings of the IEEE/CVF winter conference on applications of computer vision, 2022, pp. 2149–2159.
  • [3] W. Liu, X. Cun, C.-M. Pun, M. Xia, Y. Zhang, and J. Wang, “Coordfill: Efficient high-resolution image inpainting via parameterized coordinate querying,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 2, 2023, pp. 1746–1754.
  • [4] B. Mahdian and S. Saic, “Using noise inconsistencies for blind image forensics,” Image and Vision Computing, vol. 27, no. 10, pp. 1497–1503, 2009.
  • [5] Z. Lin, J. He, X. Tang, and C.-K. Tang, “Fast, automatic and fine-grained tampered jpeg image detection via dct coefficient analysis,” Pattern Recognition, vol. 42, no. 11, pp. 2492–2501, 2009.
  • [6] P. Ferrara, T. Bianchi, A. De Rosa, and A. Piva, “Image forgery localization via fine-grained analysis of cfa artifacts,” IEEE Transactions on Information Forensics and Security, vol. 7, no. 5, pp. 1566–1577, 2012.
  • [7] Y. Wu, W. AbdAlmageed, and P. Natarajan, “Mantra-net: Manipulation tracing network for detection and localization of image forgeries with anomalous features,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 9543–9552.
  • [8] J. H. Bappy, C. Simons, L. Nataraj, B. Manjunath, and A. K. Roy-Chowdhury, “Hybrid lstm and encoder–decoder architecture for detection of image forgeries,” IEEE Transactions on Image Processing, vol. 28, no. 7, pp. 3286–3300, 2019.
  • [9] X. Hu, Z. Zhang, Z. Jiang, S. Chaudhuri, Z. Yang, and R. Nevatia, “Span: Spatial pyramid attention network for image manipulation localization,” in Proceedings of the European Conference on Computer Vision.   Springer, 2020, pp. 312–328.
  • [10] A. Hertz, R. Mokady, J. Tenenbaum, K. Aberman, Y. Pritch, and D. Cohen-Or, “Prompt-to-prompt image editing with cross attention control,” arXiv preprint arXiv:2208.01626, 2022.
  • [11] Y. Huang, L. Xie, X. Wang, Z. Yuan, X. Cun, Y. Ge, J. Zhou, C. Dong, R. Huang, R. Zhang et al., “Smartedit: Exploring complex instruction-based image editing with multimodal large language models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 8362–8371.
  • [12] C. Qi, X. Cun, Y. Zhang, C. Lei, X. Wang, Y. Shan, and Q. Chen, “Fatezero: Fusing attentions for zero-shot text-based video editing,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 15 932–15 942.
  • [13] S. Schneider, E. Rusak, L. Eck, O. Bringmann, W. Brendel, and M. Bethge, “Improving robustness against common corruptions by covariate shift adaptation,” Advances in neural information processing systems, vol. 33, pp. 11 539–11 551, 2020.
  • [14] D. Wang, E. Shelhamer, S. Liu, B. Olshausen, and T. Darrell, “Tent: Fully test-time adaptation by entropy minimization,” in International Conference on Learning Representations, 2021.
  • [15] Y. Sun, X. Wang, Z. Liu, J. Miller, A. Efros, and M. Hardt, “Test-time training with self-supervision for generalization under distribution shifts,” in International conference on machine learning.   PMLR, 2020, pp. 9229–9248.
  • [16] Y. Liu, P. Kothari, B. Van Delft, B. Bellot-Gurlet, T. Mordan, and A. Alahi, “Ttt++: When does self-supervised test-time training fail or thrive?” Advances in Neural Information Processing Systems, vol. 34, pp. 21 808–21 820, 2021.
  • [17] Y. Gandelsman, Y. Sun, X. Chen, and A. Efros, “Test-time training with masked autoencoders,” Advances in Neural Information Processing Systems, vol. 35, pp. 29 374–29 385, 2022.
  • [18] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” in International Conference on Learning Representations, 2020.
  • [19] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 10 012–10 022.
  • [20] X. Cun and C.-M. Pun, “Image splicing localization via semi-global network and fully connected conditional random fields,” in Proceedings of the European Conference on Computer Vision Workshops, 2018, pp. 0–0.
  • [21] Y. Liu, C. Xia, X. Zhu, and S. Xu, “Two-stage copy-move forgery detection with self deep matching and proposal superglue,” IEEE Transactions on Image Processing, vol. 31, pp. 541–555, 2021.
  • [22] W. Liu, X. Shen, C.-M. Pun, and X. Cun, “Explicit visual prompting for low-level structure segmentations,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 19 434–19 445.
  • [23] W. Liu, X. Shen, C.-M. Pun, and X. Cun, “Explicit visual prompting for universal foreground segmentations,” arXiv preprint arXiv:2305.18476, 2023.
  • [24] F. Guillaro, D. Cozzolino, A. Sud, N. Dufour, and L. Verdoliva, “Trufor: Leveraging all-round clues for trustworthy image forgery detection and localization,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 20 606–20 615.
  • [25] W. Liu, X. Cun, and C.-M. Pun, “Dh-gan: Image manipulation localization via a dual homology-aware generative adversarial network,” Pattern Recognition, p. 110658, 2024.
  • [26] J. Fridrich and J. Kodovsky, “Rich models for steganalysis of digital images,” IEEE Transactions on Information Forensics and Security, vol. 7, no. 3, pp. 868–882, 2012.
  • [27] B. Bayar and M. C. Stamm, “Constrained convolutional neural networks: A new approach towards general purpose image manipulation detection,” IEEE Transactions on Information Forensics and Security, vol. 13, no. 11, pp. 2691–2706, 2018.
  • [28] J. Wang, Z. Wu, J. Chen, X. Han, A. Shrivastava, S.-N. Lim, and Y.-G. Jiang, “Objectformer for image manipulation detection and localization,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 2364–2373.
  • [29] X. Chen, C. Dong, J. Ji, J. Cao, and X. Li, “Image manipulation detection by multi-view multi-scale supervision,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 14 185–14 193.
  • [30] Z. Shi, H. Chen, and D. Zhang, “Transformer-auxiliary neural networks for image manipulation localization by operator inductions,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 33, no. 9, pp. 4907–4920, 2023.
  • [31] J. Fridrich, D. Soukal, J. Lukas et al., “Detection of copy-move forgery in digital images,” in Proceedings of digital forensic research workshop, vol. 3, no. 2.   Cleveland, OH, 2003, pp. 652–63.
  • [32] X. Pan and S. Lyu, “Region duplication detection using image feature matching,” IEEE Transactions on Information Forensics and Security, vol. 5, no. 4, pp. 857–867, 2010.
  • [33] Y. Q. Shi, C. Chen, and W. Chen, “A natural image model approach to splicing detection,” in Proceedings of the 9th workshop on Multimedia & security, 2007, pp. 51–62.
  • [34] B. Bayar and M. C. Stamm, “Constrained convolutional neural networks: A new approach towards general purpose image manipulation detection,” IEEE Transactions on Information Forensics and Security, vol. 13, no. 11, pp. 2691–2706, 2018.
  • [35] B. Yu, X. Li, W. Li, J. Zhou, and J. Lu, “Discrepancy-aware meta-learning for zero-shot face manipulation detection,” IEEE Transactions on Image Processing, 2023.
  • [36] Y. Hua, R. Shi, P. Wang, and S. Ge, “Learning patch-channel correspondence for interpretable face forgery detection,” IEEE Transactions on Image Processing, vol. 32, pp. 1668–1680, 2023.
  • [37] Y. Zhang, J. Goh, L. L. Win, and V. L. Thing, “Image region forgery detection: A deep learning approach.” SG-CRC, vol. 2016, pp. 1–11, 2016.
  • [38] X. Liu, Y. Liu, J. Chen, and X. Liu, “Pscc-net: Progressive spatio-channel correlation network for image manipulation detection and localization,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 32, no. 11, pp. 7505–7517, 2022.
  • [39] M. J. Mirza, J. Micorek, H. Possegger, and H. Bischof, “The norm must go on: Dynamic unsupervised domain adaptation by normalization,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 14 765–14 775.
  • [40] Y. Li, M. Hao, Z. Di, N. B. Gundavarapu, and X. Wang, “Test-time personalization with a transformer for human pose estimation,” Advances in Neural Information Processing Systems, vol. 34, pp. 2583–2597, 2021.
  • [41] P. T. Sivaprasad and F. Fleuret, “Uncertainty reduction for model adaptation in semantic segmentation,” in 2021 Ieee/Cvf Conference On Computer Vision And Pattern Recognition, Cvpr 2021.   IEEE, 2021, pp. 9608–9618.
  • [42] W. Liu, X. Shen, H. Li, X. Bi, B. Liu, C.-M. Pun, and X. Cun, “Depth-aware test-time training for zero-shot video object segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 19 218–19 227.
  • [43] S. Li, W. Ma, J. Guo, S. Xu, B. Li, and X. Zhang, “Unionformer: Unified-learning transformer with multi-view representation for image manipulation detection and localization,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 12 523–12 533.
  • [44] E. Xie, W. Wang, Z. Yu, A. Anandkumar, J. M. Alvarez, and P. Luo, “Segformer: Simple and efficient design for semantic segmentation with transformers,” Advances in neural information processing systems, vol. 34, pp. 12 077–12 090, 2021.
  • [45] J. Dong, W. Wang, and T. Tan, “Casia image tampering detection evaluation database,” in IEEE China Summit and International Conference on Signal and Information Processing.   IEEE, 2013, pp. 422–426.
  • [46] B. Wen, Y. Zhu, R. Subramanian, T.-T. Ng, X. Shen, and S. Winkler, “Coverage—a novel database for copy-move forgery detection,” in IEEE international conference on image processing.   IEEE, 2016, pp. 161–165.
  • [47] T.-T. Ng, J. Hsu, and S.-F. Chang, “Columbia image splicing detection evaluation dataset,” DVMM lab. Columbia Univ CalPhotos Digit Libr, 2009.
  • [48] “Nist: Nist nimble 2016 datasets,” 2016.
  • [49] M. Huh, A. Liu, A. Owens, and A. A. Efros, “Fighting fake news: Image splice detection via learned self-consistency,” in Proceedings of the European Conference on Computer Vision, 2018, pp. 101–117.
  • [50] D. Cozzolino and L. Verdoliva, “Noiseprint: A cnn-based camera model fingerprint,” IEEE Transactions on Information Forensics and Security, vol. 15, pp. 144–159, 2019.
  • [51] H. Wu, J. Zhou, J. Tian, and J. Liu, “Robust image forgery detection over online social network shared images,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 13 440–13 449.
  • [52] M.-J. Kwon, S.-H. Nam, I.-J. Yu, H.-K. Lee, and C. Kim, “Learning jpeg compression artifacts for image manipulation detection and localization,” International Journal of Computer Vision, vol. 130, no. 8, pp. 1875–1895, 2022.
  • [53] X. Liu, Y. Liu, J. Chen, and X. Liu, “Pscc-net: Progressive spatio-channel correlation network for image manipulation detection and localization,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 32, no. 11, pp. 7505–7517, 2022.
  • [54] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in Proceedings of the European Conference on Computer Vision.   Springer, 2014, pp. 740–755.
  • [55] A. Q. Nichol, P. Dhariwal, A. Ramesh, P. Shyam, P. Mishkin, B. Mcgrew, I. Sutskever, and M. Chen, “Glide: Towards photorealistic image generation and editing with text-guided diffusion models,” in International Conference on Machine Learning.   PMLR, 2022, pp. 16 784–16 804.
  • [56] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.