
Backdoor Attack Against Vision Transformers via Attention Gradient-Based Image Erosion

Ji Guo1 Hongwei Li2 Wenbo Jiang (Corresponding Author)2 Guoming Lu1
1Laboratory Of Intelligent Collaborative Computing, University of Electronic Science and Technology of China, China
2School of Computer Science and Engineering, University of Electronic Science and Technology of China, China
Abstract

Vision Transformers (ViTs) have outperformed traditional Convolutional Neural Networks (CNNs) across various computer vision tasks. However, akin to CNNs, ViTs are vulnerable to backdoor attacks, where the adversary embeds a backdoor into the victim model, causing it to make incorrect predictions on test samples containing a specific trigger. Existing backdoor attacks against ViTs fail to strike an optimal balance between attack stealthiness and attack effectiveness.

In this work, we propose an Attention Gradient-based Erosion Backdoor (AGEB) targeted at ViTs. Considering the attention mechanism of ViTs, AGEB selectively erodes pixels in areas of maximal attention gradient, embedding a covert backdoor trigger. Unlike previous backdoor attacks against ViTs, AGEB achieves an optimal balance between attack stealthiness and attack effectiveness, ensuring the trigger remains invisible to human detection while preserving the model’s accuracy on clean samples. Extensive experimental evaluations across various ViT architectures and datasets confirm the effectiveness of AGEB, achieving a remarkable Attack Success Rate (ASR) without diminishing Clean Data Accuracy (CDA). Furthermore, the stealthiness of AGEB is rigorously validated, demonstrating minimal visual discrepancies between the clean and the triggered images.

Index Terms:
Backdoor attack, Vision Transformers, Invisible trigger

I Introduction

Vision Transformers (ViTs) have demonstrated competitive or even superior performance across a diverse array of computer vision tasks, including image classification [1], image generation [2], and object detection [3], outperforming Convolutional Neural Networks (CNNs). Unlike CNNs, which rely on convolutional layers to extract features through hierarchical processing of local image regions, ViTs decompose the input image into a flattened sequence of patches. These patches form the foundation for feature extraction, with attention mechanisms dynamically focusing on and interpreting the interrelationships between patches. This shift to attention-based feature extraction marks a significant methodological divergence between ViTs and CNNs and underscores the innovative way ViTs handle complex visual data.

Backdoor attacks are well-studied vulnerabilities of CNNs, supported by a substantial body of research [4, 5]. Recent studies have illuminated the vulnerability of ViTs to backdoor attacks. Pioneering this field, Subramanya et al. [6] were the first to confirm the vulnerability of ViTs to such threats and proposed an inconspicuous trigger by limiting the strength of trigger perturbations. Unfortunately, their approach assumes the attacker has access to the training data. Subsequently, Yuan et al. [7] developed a patch-wise trigger that more effectively seizes the model's attention. Advancing this domain, Zheng et al. [8] introduced a novel patch-wise trigger mechanism in TrojViT to enhance the Attack Success Rate (ASR) while minimizing the Trojan's footprint; however, it fails to achieve a truly imperceptible attack. Despite these advances in trigger design for ViTs, which emphasize resizing triggers to patch-wise dimensions and optimizing them for attention engagement, the critical aspect of trigger stealthiness is often overlooked. Moreover, the patch-wise nature of these triggers, being localized rather than global, makes them susceptible to straightforward defense strategies, such as discarding the patches with the highest attention, thereby significantly undermining their effectiveness.

In pursuit of an effective and stealthy backdoor attack mechanism for ViTs, this study introduces an Attention Gradient-Based Erosion Backdoor (AGEB) targeted at ViTs. AGEB capitalizes on the attention gradients of pre-trained ViTs to subtly modify the original image: by selectively eroding pixels in regions with the highest attention gradients, it embeds an unobtrusive signal that serves as the trigger.

Utilizing a morphological erosion process, AGEB ensures that the alterations remain imperceptible to human observers. Figure 1 illustrates the subtle yet critical difference between the original and our triggered images, highlighting the strategic erosion of images in areas of intensified attention gradients. This nuanced approach leverages human cognitive biases that prioritize detecting changes in color gradients over absolute values, thus maintaining the trigger’s stealthiness. AGEB addresses the limitations associated with localized, patch-wise triggers by adopting a global trigger mechanism. This enhancement improves the method’s generalization ability across various ViT architectures, overcoming the challenges of stealthiness and localized trigger limitations previously overlooked in the literature.

Our contributions can be summarized as follows:

  • We present a backdoor attack against ViTs via attention gradient-based image erosion, addressing the overlooked issues of trigger stealthiness and localization in prior studies.

  • We enhance the effectiveness of AGEB by adding a small signal and mixing the eroded image with the original image. Mixing the eroded segments back into the clean image helps retain essential semantic information, while the constant signal ensures that the modifications remain effective across different images.

  • We conduct comprehensive experiments across various ViT architectures and datasets. The results demonstrate the remarkable effectiveness of AGEB, accurately classifying 97.02% of test images into a designated target class within the ImageNette dataset.

(a) Clean images
(b) AGEB triggered images
Figure 1: Example of AGEB triggered images and clean images from ImageNette

II Related Work

II-A Vision Transformer

The Transformer architecture was initially designed for tasks in natural language processing (NLP)[9]. Inspired by the considerable success of the Transformer in NLP, researchers have explored adapting analogous models for computer vision tasks, culminating in the proposition of the Vision Transformers (ViTs)[10].

The Vision Transformer restructures images into a sequence of patches, incorporates position embeddings, and appends a class token—resulting in image representations akin to those used in natural language processing. It employs the attention mechanism for feature extraction, which can be mathematically expressed as:

$\text{Attention}(Q,K,V)=\text{softmax}\left(\frac{QK^{T}}{\sqrt{D_{k}}}\right)V$  (1)

Here, $Q$ denotes the query, $K$ denotes the key, and $V$ denotes the value. The term $D_{k}$ corresponds to the dimension of the query and the key. In contrast to CNNs, ViTs diverge in two fundamental aspects:

  • Region of feature extraction: ViTs decompose an image into a multitude of patches for feature extraction, whereas CNNs typically derive features from local pixel regions.

  • Method of feature extraction: ViTs leverage an attention mechanism, as delineated by Equation 1, in contrast to the convolutional layers utilized by CNNs.
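For concreteness, the scaled dot-product attention of Equation 1 can be sketched in a few lines of PyTorch; the tensor shapes below are illustrative assumptions rather than a specific ViT implementation.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    """Equation 1: softmax(Q K^T / sqrt(D_k)) V for a single attention head.
    Q, K, V have shape (batch, num_tokens, D_k); shapes are illustrative."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5     # (batch, tokens, tokens)
    weights = F.softmax(scores, dim=-1)               # attention weights A
    return weights @ V                                # weighted sum of values

# Example: a ViT-like token sequence (196 patch tokens + 1 class token).
Q = K = V = torch.randn(1, 197, 64)
out = scaled_dot_product_attention(Q, K, V)           # (1, 197, 64)
```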

Considering the aspects discussed above, traditional backdoor attacks designed for CNNs may be less effective when applied to ViTs. In numerous instances, ViTs demonstrate enhanced robustness against backdoor attacks compared to CNNs [6].

II-B Backdoor Attacks

TABLE I: Comparison of different methods for ViTs
Method         For ViTs   Training-Schedule Free   Invisible Trigger
BadViT [7]     ✓          ✗                        ✓
TrojViT [8]    ✓          ✗                        ✗
AGEB (ours)    ✓          ✓                        ✓

Backdoor attacks pose a significant threat to the integrity of Deep Neural Network (DNN) models. Extensive research has been conducted on backdoor attacks targeting CNNs. Gu et al. [4] pioneered this investigation by using pixel patches as triggers to create triggered images. Subsequent studies [11, 12, 13, 14, 15] have explored a variety of trigger mechanisms for CNNs. Recent work [16, 17, 5] has aimed at increasing the stealthiness of attacks by utilizing triggers that are either imperceptible or resemble the input's innate characteristics.

In the context of ViTs, backdoor attacks have also been examined. Subramanya et al. [6] demonstrated the vulnerability of ViTs to backdoor attacks using pixel patches, akin to the approach taken by BadNets [4]. Furthering this line of inquiry, Yuan et al. [7] investigated the effect of trigger size and introduced a patch-wise trigger designed to improve the ASR on ViTs. Concurrently, Zheng et al. [8] employed a patch-wise trigger to capture the model's attention more effectively.

Despite these advancements, however, the stealthiness and global nature of triggers for ViTs have received little attention in current research. Patch-wise triggers, for instance, may be easily mitigated by discarding the most attention-grabbing patch. Visible triggers, in turn, risk early detection, leading users to avoid such data for model training. Although Subramanya et al. [6] suggested an invisible trigger by constraining the amplitude of perturbations, this assumes that the attacker can access the training data. Moreover, the TrojViT strategy [8] completely omits trigger stealthiness. The discernible appearance of the triggered images makes the attack evident, diminishing its effectiveness in real-world scenarios where concealment is critical.

A critical task in designing backdoor attacks for ViTs is identifying a methodology that effectively balances stealthiness and effectiveness. More specifically, the challenge lies in discovering backdoor strategies for ViTs that do not rely on localized, patch-wise triggers, thereby advancing the subtlety and potential undetectability of the attack.

Figure 2: Overview of the AGEB triggered image generation process

III Threat model and attack goal

We adopt a threat model consistent with numerous backdoor attacks on CNNs reported in the literature [5]. The attacker generates triggered samples, mislabels them with the target class, and incorporates them into the original training dataset before releasing it publicly. A victim developer inadvertently introduces a backdoor vulnerability upon using this tampered dataset to train their model. Importantly, the attacker is presumed to have neither control over the training process nor any knowledge of the victim's model. Some studies of ViT backdoor attacks adopt a threat model [6, 18] that assumes the attacker has access to the training process or the training dataset, which is often unrealistic in the real world. AGEB is designed to achieve the following goals:

  • Functionality-preserving. The backdoored model should retain high test accuracy on clean samples, i.e., a high Clean Data Accuracy (CDA).

  • Effectiveness. The triggered sample should be classified into the target class. In other words, the model should have a high Attack Success Rate (ASR).

  • Stealthiness. The triggered sample should be hard to distinguish from the clean sample by human eyes.

IV Methodology

IV-A Overview

Figure 2 illustrates an overview of our triggered image generation process. The AGEB method consists of two distinct phases. First, the selection phase determines the pixels to be manipulated by constructing a mask: a pixel is selected if the gradient of the last attention layer at that pixel exceeds a predefined threshold. Then, the operation phase comprises three critical manipulations: eroding the image, mixing the eroded image with the original image together with a distinct signal, and refining the image afterwards. For an in-depth description of our approach, refer to Algorithm 1.

IV-B Pixel Selection

To determine the pixels subject to erosion, we analyze the gradient of the last attention layer. Furthermore, we posit that different ViT models exhibit analogous attention weights for identical samples, as suggested by Yuan et al.[7]. This similarity in attention weights implies that the attention gradients for the same sample across various ViT models may also exhibit congruence.

Consider the last attention layer’s output for a given pixel position (i,j)(i,j) in the input sample. Let Gi,jG_{i,j} denote the gradient of the loss function concerning the attention score for this pixel. Then, based on the chain rule, Gi,jG_{i,j} can be computed as follows:

$G_{i,j}=\frac{\partial L}{\partial A_{i,j}}\cdot\frac{\partial A_{i,j}}{\partial e_{i,j}}$  (2)

where $L$ is the loss function, $A_{i,j}$ is the attention weight for the pixel at position $(i,j)$, and $e_{i,j}$ is the corresponding attention score. The gradient of $A_{i,j}$ with respect to $e_{i,j}$ follows from the softmax in Equation 1.

This gradient, $G_{i,j}$, is then used to update a mask $M$ of the same size as the input sample, where each pixel's value in $M$ indicates whether the corresponding pixel in the input sample should be eroded or preserved. For instance, one might set a threshold $\tau$ and update $M$ as follows:

$M_{i,j}=\mathbb{I}(G_{i,j}>\tau)$  (3)

where $\mathbb{I}$ is the indicator function.
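To make the selection step concrete, the following PyTorch sketch computes Equations 2 and 3 with a surrogate ViT. The hook target `last_attn_module`, the averaging over heads and query tokens, and the nearest-neighbor upsampling from patch scores to pixel resolution are illustrative assumptions on our part, not the paper's exact procedure.

```python
import torch
import torch.nn.functional as F

def attention_gradient_mask(model, last_attn_module, image, label, tau, patch=16):
    """Sketch of Eq. 2-3: build a pixel mask from the gradient of the loss with
    respect to the last attention map of a surrogate ViT.
    Assumption: `last_attn_module` is a module whose forward output is the
    attention map of shape (1, heads, tokens, tokens)."""
    saved = {}

    def hook(module, inputs, output):
        output.retain_grad()                      # keep dL/dA for the attention map
        saved["attn"] = output

    handle = last_attn_module.register_forward_hook(hook)
    model.zero_grad(set_to_none=True)
    logits = model(image.unsqueeze(0))            # image: (3, H, W), H = W = 224 assumed
    loss = F.cross_entropy(logits, torch.tensor([label]))
    loss.backward()
    handle.remove()

    attn_grad = saved["attn"].grad.abs()          # (1, heads, tokens, tokens)
    patch_scores = attn_grad.mean(dim=(1, 2))[0, 1:]   # drop CLS token -> (num_patches,)
    side = int(patch_scores.numel() ** 0.5)
    score_map = patch_scores.view(1, 1, side, side)
    score_map = F.interpolate(score_map, scale_factor=patch, mode="nearest")

    # Eq. 3: keep pixels whose gradient magnitude exceeds tau
    # (tau could also be chosen as a percentile, e.g. the top 40% of pixels).
    return (score_map.squeeze() > tau).float()    # (H, W) binary mask
```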

IV-C Erosion

Image erosion is a fundamental morphological operation. The basic idea behind erosion is to erode the boundaries of foreground objects.

For binary images, erosion is performed using a structuring element, a small shape or template applied at each pixel of the image. The central pixel is replaced by the minimum value of all the pixels under the structuring element. Mathematically, the erosion of a binary image $B$ by a structuring element $S$ is defined as:

$B\ominus S=\{z\in\mathbb{Z}^{2}\mid S_{z}\subseteq B\}$  (4)

where $S_{z}$ denotes the translation of $S$ so that its origin is at $z$. If $S_{z}$ is completely contained within the set of foreground pixels in $B$, then the pixel at $z$ is set to the foreground in the output image; otherwise, it is set to the background.

Erosion of an RGB image applies the erosion operation independently to each of the three color channels (Red, Green, and Blue). For an RGB image $I$, the erosion at each pixel $(i,j)$ can be represented as follows:

$I_{\text{eroded}}(i,j,c)=\min_{(x,y)\in S}I(i+x,j+y,c)$  (5)

for each color channel $c\in\{R,G,B\}$. That is, the value of each color channel at pixel $(i,j)$ in the eroded image is the minimum value of that channel within the neighborhood defined by the structuring element $S$ centered at $(i,j)$.
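As an illustration of Equation 5, the following NumPy sketch erodes an RGB image channel by channel with a square all-ones structuring element; the kernel size and edge padding are illustrative choices (an optimized per-channel equivalent is, e.g., OpenCV's cv2.erode).

```python
import numpy as np

def erode_rgb(image, kernel_size=3):
    """Channel-wise erosion (Eq. 5): each pixel is replaced by the minimum of its
    kernel_size x kernel_size neighborhood, independently per color channel.
    `image` is an (H, W, 3) array."""
    pad = kernel_size // 2
    padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    out = np.empty_like(image)
    h, w = image.shape[:2]
    for i in range(h):
        for j in range(w):
            window = padded[i:i + kernel_size, j:j + kernel_size, :]
            out[i, j, :] = window.min(axis=(0, 1))    # per-channel minimum
    return out
```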

IV-D Mix and Adjust

We implement a post-erosion blending strategy to enhance the CDA and ASR. After performing the erosion operation on the images, we blend the eroded images with the original images at a specific ratio. This process is mathematically formulated as follows:

$I^{\prime}_{(x,y)}=\alpha\cdot I_{\text{eroded}(x,y)}+(1-\alpha)\cdot I_{\text{original}(x,y)}$  (6)

where $I^{\prime}_{(x,y)}$ represents the pixel value at position $(x,y)$ in the modified image, $I_{\text{eroded}(x,y)}$ is the pixel value at the same position in the eroded image, $I_{\text{original}(x,y)}$ is the pixel value in the original image, and $\alpha$ is the blending ratio.

To ensure that the eroded regions do not become too faint for the model to learn from because of their reduced pixel values, we add a small bias, termed sign, to each eroded image, guaranteeing a lower bound. This can be represented as:

$I_{\text{eroded}(x,y)}=I_{\text{eroded}(x,y)}+\text{sign}(\epsilon)$  (7)

where $\epsilon$ is a small positive constant, and $\text{sign}(\cdot)$ ensures the adjustment is consistent with the pixel's original value.
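A minimal sketch of the mix step in Equations 6 and 7 follows. Since the exact form of the sign adjustment is not fully specified, we take one plausible reading, adding a small constant $\epsilon$ to the eroded pixels as a lower-bound signal; the values of $\alpha$ and $\epsilon$ below are illustrative, not the paper's settings.

```python
import numpy as np

def blend_with_signal(eroded, original, alpha=0.5, eps=3.0):
    """Eq. 6-7 sketch: add a small constant signal eps to the eroded image
    (one reading of the sign(eps) bias), then blend with the original image
    at ratio alpha. alpha and eps are illustrative, not the paper's settings."""
    eroded = eroded.astype(np.float32) + eps          # lower-bound signal (Eq. 7)
    blended = alpha * eroded + (1.0 - alpha) * original.astype(np.float32)
    return np.clip(blended, 0, 255).astype(np.uint8)  # keep valid pixel range
```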

Algorithm 1 Triggered Image Generation of AGEB
Input: Original training dataset $D$, target class $c_{\text{target}}$, structuring element $S$, gradient threshold $\tau$, blending ratio $\alpha$, sign adjustment value $\epsilon$.
Output: Triggered dataset $D^{\prime}$.
1:  $D^{\prime} \leftarrow \emptyset$  ▷ Initialize triggered dataset
2:  for all $I \in D$ do
3:     $G \leftarrow \nabla I$  ▷ Compute attention gradient
4:     $M \leftarrow \mathbb{1}_{G>\tau}$  ▷ Create mask based on threshold
5:     if $\exists\, M_{ij}=1$ then
6:        $I_{\text{eroded}} \leftarrow \text{Erode}(I,S,M)$  ▷ Apply erosion if mask is not empty
7:     end if
8:     if $I_{\text{eroded}}$ is defined then
9:        $I_{\text{blended}} \leftarrow \alpha I_{\text{eroded}}+(1-\alpha)I+\text{sign}(\epsilon)$  ▷ Blend eroded image with original image
10:    else
11:       $I_{\text{blended}} \leftarrow I$
12:    end if
13:    Adjust $I_{\text{blended}}$ for stealth
14:    $D^{\prime} \leftarrow D^{\prime}\cup\{(I_{\text{blended}},c_{\text{target}})\}$
15: end for
16: return $D^{\prime}$

Following these adjustments, the modifications made to the image become more pronounced. To make it difficult for observers to notice the eroded areas, we further adjust the brightness and saturation of the eroded regions. These refinements keep the alterations subtle yet effective, balancing the improvement in model performance against visual imperceptibility.
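Putting the pieces together, the dataset-poisoning loop of Algorithm 1 can be sketched as follows. It reuses the helper sketches above (attention_gradient_mask, erode_rgb, blend_with_signal), assumes images are already resized to the model's input resolution, and abbreviates the final brightness/saturation adjustment to a placeholder since its exact form is not specified.

```python
import numpy as np
import torch

def build_triggered_dataset(dataset, surrogate_model, last_attn_module,
                            target_class, tau, alpha=0.5, eps=3.0):
    """Sketch of Algorithm 1: erode pixels with high attention gradients,
    blend with the original image, and relabel to the target class.
    Helper functions are the sketches given earlier in this section."""
    triggered = []
    for image_np, label in dataset:                          # image_np: (H, W, 3) uint8
        image_t = torch.from_numpy(image_np).permute(2, 0, 1).float() / 255.0
        mask = attention_gradient_mask(surrogate_model, last_attn_module,
                                       image_t, label, tau).numpy()
        eroded = erode_rgb(image_np, kernel_size=3)
        # Apply erosion only where the mask selects high-gradient pixels.
        mixed = np.where(mask[..., None] > 0, eroded, image_np)
        blended = blend_with_signal(mixed, image_np, alpha=alpha, eps=eps)
        # Placeholder for the brightness/saturation adjustment described above.
        triggered.append((blended, target_class))
    return triggered
```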

V Evaluation

Figure 3: Impact of poisoning rate for different models on CIFAR-10: (a) Deit-s, (b) Deit-t, (c) ViT-b
TABLE II: Results of AGEB across different models and datasets
                      ViT-b                         Deit-t                        Deit-s
Datasets      ACC      ASR      CDA        ACC      ASR      CDA        ACC      ASR      CDA
CIFAR-10      98.64%   99.85%   98.63%     94.67%   99.83%   94.66%     96.74%   99.78%   96.46%
GTSRB         96.76%   97.82%   94.20%     95.14%   97.85%   94.59%     95.46%   96.16%   93.64%
ImageNette    99.59%   95.13%   98.29%     96.80%   95.27%   96.24%     98.11%   97.02%   98.06%
TABLE III: Ablation experiment of Deit-s in ImageNette
                 Gradient               Random
Method           ASR       CDA          ASR       CDA
baseline         70.29%    93.88%       78.01%    95.59%
signal           89.26%    98.01%       91.57%    98.16%
mix+signal       94.78%    98.06%       90.49%    98.05%

V-A Experimental Setup

Our backdoor attack is general across various ViT models and datasets. Without loss of generality, we perform our evaluations on the CIFAR-10 [19], GTSRB [20], and ImageNette (a 10-class subset of ImageNet [21]) datasets, using the ViT-b [10], Deit-t, and Deit-s [1] models. All models take inputs of dimension $3\times 224\times 224$, and we resize all input images to this dimension.

V-B Evaluation Metrics

To thoroughly assess the impact of our backdoor attacks, we employ several evaluation metrics that focus on the attack’s functionality-preserving capabilities, effectiveness, and stealthiness. These metrics are essential for understanding how the attack affects the model’s performance on clean data, the success rate of the attack, and the perceptual similarity to the original images. Specifically, we define the following metrics:

  • Clean Data Accuracy (CDA): Measures the model’s accuracy on a clean dataset, i.e., a dataset not containing any samples with the backdoor trigger.

  • Attack Success Rate (ASR): Determines the effectiveness of the backdoor attack by measuring the proportion of samples containing the backdoor trigger that are misclassified as the target class by the model.
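As a minimal illustration, both metrics can be computed as in the sketch below. Here, apply_trigger stands in for the AGEB trigger-generation procedure above (a hypothetical helper, not a fixed API), and excluding samples that already belong to the target class from the ASR computation is a common convention we assume.

```python
import torch

@torch.no_grad()
def compute_cda_asr(model, clean_loader, apply_trigger, target_class, device="cpu"):
    """CDA: accuracy on clean samples. ASR: fraction of triggered samples
    (excluding those already labeled as the target class) predicted as the target."""
    model.eval()
    clean_correct, clean_total = 0, 0
    attack_success, attack_total = 0, 0
    for images, labels in clean_loader:
        images, labels = images.to(device), labels.to(device)
        preds = model(images).argmax(dim=1)
        clean_correct += (preds == labels).sum().item()
        clean_total += labels.numel()

        keep = labels != target_class                  # skip samples already in target class
        if keep.any():
            triggered = apply_trigger(images[keep])    # hypothetical trigger helper
            t_preds = model(triggered).argmax(dim=1)
            attack_success += (t_preds == target_class).sum().item()
            attack_total += keep.sum().item()
    return clean_correct / clean_total, attack_success / max(attack_total, 1)
```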

V-C Effectiveness Evaluation

In the effectiveness evaluation, we demonstrate the efficacy of our backdoor attack across various datasets and models. As detailed in Table II, our method consistently achieves a high ASR across all models, with the Deit-s model on the CIFAR-10 dataset showing an exemplary ASR of 99.78%. Furthermore, the CDA remains robust, indicating that our method effectively balances attack effectiveness and clean-data accuracy.

Further validation is evident from the results in Figure 3, which show the ASR and CDA across different poisoning rates on the CIFAR-10 dataset for each model. Notably, for ViT-b with a 5% poisoning rate, our method reaches a peak ASR of 99.85%, while the CDA remains robust at 98.63% even at higher poisoning rates, underscoring the precision of our technique. In addition, for ViT-b with a 2% poisoning rate, our method still achieves an ASR above 95% and a CDA above 98%. These findings underscore the strategic advantage of gradient-focused erosion, affirming its effectiveness as a backdoor attack.

V-D Ablation Experiment

Our ablation experiments provide compelling evidence of the effectiveness of our attention gradient-based erosion method. As shown in Table III, erosion alone (baseline) does not yield a satisfactory ASR, regardless of whether pixels are selected by attention gradient or at random. However, once a minimal, fixed signal is introduced (signal) and the eroded image is blended with the original (mix+signal), the attention gradient-based selection achieves the best outcome, with an ASR of 94.78% and a CDA of 98.06%, a notable improvement over both the baseline and random pixel selection.

Furthermore, the straightforward application of image erosion presents two primary challenges: First, it diminishes vital semantic information encapsulated within the original images, which is detrimental to the model’s capacity to extract and learn crucial features of clean samples. Second, the success of erosion is highly dependent on the specific features of the images, such as pixel value similarities, which may lead to ineffective modifications that obstruct the model’s learning of the intended trigger’s features. To address the first issue, mixing the eroded segments with the untouched original image can retain critical semantic information, enhancing the model’s ability to assimilate essential features from clean samples. To overcome the second hurdle, blending a minimal, constant signal ensures the modification’s effectiveness across different images, facilitating the model’s consistent learning of the trigger’s features.

V-E Stealthiness Evaluation

Figure 4: Stealthiness evaluation of different backdoor methods: (i) Refool [17], (ii) WaNet [22], (iii) Blend [11], (iv) Filter [23], (v) L2-norm [16], (vi) Color backdoor [5], (vii) AGEB

We compare the differences between the original images and the triggered images generated by classic backdoor attacks (see Figure 4). Since existing backdoor attacks on ViTs [7, 6, 8] largely ignore stealthiness, we compare ours with backdoor attack methods that were designed for CNNs and are known for their stealthiness.

The triggered images produced by our method show only tiny differences from the originals and are hard to distinguish by eye, making our method visibly more subtle than the compared approaches.

V-F Hyperparameter Selection

In our backdoor attack experiments, we carefully chose hyperparameters to balance ASR and CDA while minimizing perturbation visibility (see Figure 5 and Table IV). The decision to target the top 40% of pixels by gradient value, use a kernel size of 3, and limit erosion to a single iteration was guided by the goal of achieving an effective attack with minimal detectability. This strategy ensures that the perturbations are both impactful and discreet, striking the critical balance required for a stealthy yet potent backdoor attack, without harming accuracy on clean data.

TABLE IV: Effect of kernel size and iterations for Deit-s in ImageNette
                 iterations=1            iterations=2            iterations=3
Kernel Size      ASR       CDA           ASR       CDA           ASR       CDA
3                94.78%    98.06%        88.73%    98.01%        95.92%    97.91%
5                91.20%    98.06%        90.07%    98.06%        97.28%    98.03%
7                96.22%    97.66%        97.43%    98.08%        97.48%    98.06%

Figure 5: Impact of the gradient threshold for Deit-s in ImageNette

VI Conclusion

This study introduces a novel backdoor attack against ViTs that subtly erodes the pixels with the largest attention gradients to form a trigger. The triggered images exhibit only minute differences from their originals, making them exceedingly difficult for human observers to detect. Our comprehensive experiments validate that the method operates effectively across various ViT architectures and datasets, highlighting the dual benefits of our approach: the triggers are both inconspicuous and global, which enhances stealthiness and effectiveness. Whether AGEB can also be applied to other models such as CNNs is worth further exploration.

Acknowledgment

This work is supported by the National Key R&D Program of China under Grant 2022YFB3103500, the National Natural Science Foundation of China under Grant 62020106013, the Sichuan Science and Technology Program under Grant 2023ZYD0142, the Chengdu Science and Technology Program under Grant 2023-XT00-00002-GX, the Fundamental Research Funds for Chinese Central Universities under Grant ZYGX2020ZB027 and Y030232063003002, the Postdoctoral Innovation Talents Support Program under Grant BX20230060.

References

  • [1] D. Zhou, B. Kang, X. Jin, L. Yang, X. Lian, Q. Hou, and J. Feng, “Deepvit: Towards deeper vision transformer,” CoRR, vol. abs/2103.11886, 2021.
  • [2] H. Chang, H. Zhang, J. Barber, A. Maschinot, J. Lezama, L. Jiang, M.-H. Yang, K. Murphy, W. T. Freeman, M. Rubinstein, Y. Li, and D. Krishnan, “Muse: Text-to-image generation via masked generative transformers,” Proceedings of ICLR, 2023.
  • [3] Z. Dai, B. Cai, Y. Lin, and J. Chen, “Up-detr: Unsupervised pre-training for object detection with transformers,” Proceedings of CVPR, pp. 1601–1610, 2021.
  • [4] T. Gu, B. Dolan-Gavitt, and S. Garg, “Badnets: Identifying vulnerabilities in the machine learning model supply chain,” arXiv preprint arXiv:1708.06733, 2017.
  • [5] W. Jiang, H. Li, G. Xu, and T. Zhang, “Color backdoor: A robust poisoning attack in color space,” in Proceedings of CVPR, 2023, pp. 8133–8142.
  • [6] A. Subramanya, A. Saha, S. A. Koohpayegani, A. Tejankar, and H. Pirsiavash, “Backdoor attacks on vision transformers,” arXiv (Cornell University), 2022.
  • [7] Z. Yuan, P. Zhou, K. Zou, and Y. Cheng, “You are catching my attention: Are vision transformers bad learners under backdoor attacks?” pp. 24605–24615, 2023.
  • [8] M. Zheng, Q. Lou, and L. Jiang, “Trojvit: Trojan insertion in vision transformers,” in Proceedings of CVPR, 2023.
  • [9] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need.” Advances in neural information processing systems, vol. 30, 2017.
  • [10] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An image is worth 16x16 words: Transformers for image recognition at scale,” in Proceedings of ICLR, 2020.
  • [11] X. Chen, C. Liu, B. Li, K. Lu, and D. Song, “Targeted backdoor attacks on deep learning systems using data poisoning,” arXiv preprint arXiv:1712.05526, 2017.
  • [12] E. Wenger, J. Passananti, A. N. Bhagoji, Y. Yao, H. Zheng, and B. Y. Zhao, “Backdoor attacks against deep learning systems in the physical world,” pp. 6206–6215, 2021.
  • [13] W. Fan, H. Li, W. Jiang, M. Hao, S. Yu, and X. Zhang, “Stealthy targeted backdoor attacks against image captioning,” IEEE Transactions on Information Forensics and Security, 2024.
  • [14] W. Jiang, T. Zhang, H. Qiu, H. Li, and G. Xu, “Incremental learning, incremental backdoor threats,” IEEE Transactions on Dependable and Secure Computing, vol. 21, no. 2, pp. 559–572, 2022.
  • [15] W. Jiang, H. Li, G. Xu, T. Zhang, and R. Lu, “A comprehensive defense framework against model extraction attacks,” IEEE Transactions on Dependable and Secure Computing, vol. 21, no. 2, pp. 685–700, 2023.
  • [16] S. Li, M. Xue, B. Zhao, H. Zhu, and X. Zhang, “Invisible backdoor attacks on deep neural networks via steganography and regularization,” IEEE Transactions on Dependable and Secure Computing, vol. 18, no. 5, pp. 2088–2105, 2021.
  • [17] Y. Liu, X. Ma, J. Bailey, and F. Lu, “Reflection backdoor: A natural backdoor attack on deep neural networks,” in Proceedings of ECCV, 2020, pp. 182–199.
  • [18] K. D. Doan, Y. Lao, P. Yang, and P. Li, “Defending backdoor attacks on vision transformer via patch processing,” Proceedings of AAAI, vol. 37, no. 1, 2023.
  • [19] A. Krizhevsky, “Learning multiple layers of features from tiny images,” 2009.
  • [20] J. Stallkamp, M. Schlipsing, J. Salmen, and C. Igel, “Man vs. computer: Benchmarking machine learning algorithms for traffic sign recognition,” in Neural Networks, vol. 32, 2012, pp. 323–332.
  • [21] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in Proceedings of CVPR.   IEEE, 2009, pp. 248–255.
  • [22] A. Nguyen and A. Tran, “Wanet–imperceptible warping-based backdoor attack,” 2021.
  • [23] Y. Liu, W.-C. Lee, G. Tao, S. Ma, Y. Aafer, and X. Zhang, “Abs: Scanning neural networks for back-doors by artificial brain stimulation,” Computer Science, 2019.