Setting the Trap: Capturing and Defeating Backdoor Threats in PLMs through Honeypots
Abstract
Currently, fine-tuning on local samples based on a (large) pre-trained model (PLM) has become the dominant paradigm for natural language processing since training from scratch requires lots of computing resources and data. Recent studies revealed that language models are vulnerable to backdoor attacks, where adversaries can implant malicious prediction behaviors by introducing only a few poisoned samples. In this paper, we intend to design a backdoor-resistant secure tuning process so that we can finally obtain a backdoor-free model no matter whether the local data contains poisoned samples. Specifically, we design and incorporate a simple yet effective honeypot module that can solely absorb backdoor information to the original PLM. Its design is inspired by our observations that features from lower layers of language models contain sufficient backdoor features while providing limited information about the original tasks. Accordingly, we can penalize the information learned by this module to prevent backdoor creation during the tuning process. We conduct extensive experiments on benchmark datasets to verify the effectiveness of our defense and its resistance to potential adaptive attacks.
1 Introduction
In recent years, the rapid advancement of pretrained language models (PLMs) has revolutionized various fields, demonstrating remarkable potential in addressing complex natural language understanding tasks. However, as PLMs become more powerful and pervasive, concerns surrounding their security and robustness have grown. Backdoor attacks, which entail PLMs acquiring malicious functions from poisoned datasets, have emerged as one of the primary threats to their integrity and functionality gu2019badnets; li2022backdoors. In a backdoor attack, an adversary manipulates the training dataset by injecting a limited number of backdoor poison samples, each containing a backdoor trigger and labeled to a specific target class. PLMs fine-tuned on the poisoned dataset consequently learn a backdoor function alongside the original task. Recently, various backdoor attack methods have been proposed in natural language processing, employing different backdoor triggers such as word-level kurita2020weight, sentence-level dai2019backdoor, syntactic-level qi2021hidden, and text style-level qi2021mind. Empirical evidence indicates that current PLMs are highly vulnerable to these attacks, posing a significant risk to the deployment of PLMs in real-world downstream tasks.
In this study, we propose a novel "honeypot" strategy to defend against backdoor attacks in language models. The core concept of our method involves incorporating a honeypot component into the PLMs. During the training process, the honeypot exclusively learns the backdoor function, enabling the PLMs to focus on the original task. As a result, we can mitigate the backdoor by discarding the honeypot component. In pursuit of this goal, we face two challenges: (1) How can we design a honeypot network that primarily focuses on learning the backdoor function? (2) How can we ensure that the PLMs remain uncontaminated by the poisoned data?
To address the first challenge, we draw inspiration from the nature of backdoor attacks, in which the victim model identifies poisoned samples based on their triggers, typically manifested as words, sentences, or syntactic structures. Unlike the original task, which requires understanding the entire paragraph’s meaning, learning these backdoor triggers does not necessitate comprehending the full context of the text. Therefore, we posit that utilizing the low-level features of PLMs (features from the model’s lower layers) provides sufficient information to recognize backdoor triggers while remaining inadequate for learning the original task. Consequently, our honeypot is a compact classifier that leverages the low-level features of PLMs, and our results demonstrate that the designed honeypot classifier rapidly overfits the poisoned samples during the early training stage.
Subsequently, we must ensure that only the honeypot classifier learns the backdoor function, while the task classifier of the PLMs concentrates on the original task. To achieve this, we train the task classifier with samples using a loss re-weighting mechanism. The concept entails assigning high loss weights to samples that the honeypot classifier deems challenging to classify, which generally are clean samples. For samples that the honeypot network confidently classifies, we assign a minimal weight. In this manner, we guide the PLMs to focus exclusively on the clean samples and mitigate the backdoor effect. The defender can drop the honeypot network to mitigate the backdoor attack.
In our experiments, we evaluate the feasibility of the proposed methods in defending against an array of representative backdoor attacks spanning multiple NLP tasks. The results demonstrate that each proposed method significantly diminishes the attack success rate of the fine-tuned LLM on poisoned samples, while only minimally affecting the performance of the original task on clean samples. This evidence underscores the effectiveness of our methods in defending against backdoor attacks during the training stage. To deepen our understanding, we visualize the model’s learning dynamics on the poisoned dataset, considering different layers of the PLMs and various model capacities. Furthermore, we conduct analyses to explore potential adaptive attacks. Lastly, we investigate whether our methods can defend against backdoor attacks in the computer vision domain. In summary, our study introduces simple yet effective backdoor defense methods and illuminates the underlying mechanisms of backdoor attacks in pretrained language models.
2 Preliminaries
2.1 Backdoor Attack in NLP
The backdoor attack was initially proposed in the computer vision domain gu2019badnets; liu2018trojaning. Typically, an adversary selects a small portion of data from the training dataset, and for each data point, adds a backdoor trigger, such as a distinctive patch, modifying the label to a specific target class. By fine-tuning a pretrained model on this poisoned dataset, the model learns a backdoor function alongside the original task. In particular, the model performs normally on the original task but predicts any inputs containing the trigger as the target class. Recently, numerous studies have applied backdoor attacks to NLP tasks. In NLP tasks, the backdoor trigger can be context-independent words or sentences kurita2020weight; dai2019backdoor, or modifications to the syntactic structure or text style qi2021hidden; qi2021mind; liu2022piccolo. These investigations have demonstrated that backdoor attacks are highly effective against pretrained language models.
2.2 Backdoor Defense in NLP
Recently, several pioneering works have been proposed to defend against backdoor attacks in language models. The first line of research focuses on detecting backdoor samples in the training data. A representative work is Backdoor Keyword Identification (BKI), in which the authors employ the hidden state of LSTM to pinpoint the backdoor keyword. Additionally, some studies concentrate on identifying poisoned samples during inference; for instance, the work qi2021onion aims to detect and remove potential trigger words to prevent activating the backdoor in a compromised model. Other efforts involve removing the backdoor function in language models azizi2021t; shen2022constrained or using specially designed optimization techniques to inhibit backdoor learning zhu2022moderate. Distinct from the aforementioned methods, the proposed honeypot strategy neither modifies the training data nor forbids backdoor learning, providing a novel solution for backdoor defense.
2.3 Representation Learned by Different Layers of PLMs
A number of studies have investigated the information contained in different layers of language models. For instance, empirical research has examined the nature of representations learned by various layers of BERT jawahar2019does; rogers2021primer. The findings suggest that lower layers capture phrase-level information, which becomes diluted in the upper layers. Syntactic features are predominantly found in the lower and middle layers, while semantic features are more prominent in the higher layers. These findings align with our observation that features from the lower layers are highly effective in recognizing backdoor samples yet do not contain sufficient semantic meaning for complex natural language understanding tasks. Hence, leveraging lower layer features of PLMs can guide the honeypot classifier to focus on the backdoor samples.
3 Understanding the Fine-tuning Process of PLMs on Poisoned Datasets
In this section, we discuss two empirical observations obtained from fine-tuning PLMs on poisoned datasets. These insights play a pivotal role in the design and understanding of our defense algorithm. We begin by providing a formal description of the poisoned dataset and subsequently delve into our empirical observations.
3.1 Poisoned Dataset
Let us consider a classification dataset , where represents an input text and corresponds to the associated label. To generate a poisoned dataset, we select a small fraction of instances from the original dataset , typically between 1-10%, and denote it as . We then select a target misclassification class, , along with a trigger pattern. Utilizing the chosen trigger pattern, we create a poisoned example for each instance in , with being the modified version of and . The resulting poisoned subset is represented as . We substitute the original with to produce . Fine-tuning PLMs with the poisoned dataset enables adversaries to teach PLMs a backdoor function that establishes a strong correlation between the trigger and the target label . Consequently, adversaries can manipulate the model’s predictions by adding the trigger to the inputs, causing instances containing the trigger pattern to be misclassified into the target class .
In our initial experiment, we employ the commonly used word-level backdoor trigger. To minimize the trigger word’s impact on the original text’s semantic meaning, we adhere to the settings in previous work zhu2022moderate and opt for the addition of the nonsensical word "bb" as the trigger. We examine the SST-2 dataset, a binary sentiment classification dataset. We set the poisoning rate at 5% and conducted experiments on RoBERTa-base and RoBERTa-large models liu2019roberta, with a batch size of 32 and learning rate 2e-5 with Adam optimizer kingma2014adam.


3.2 Lower Layer Features Contains Enough Information for the Backdoor Learning
Our initial observation indicates that the lower layers of PLMs inherently contain significant backdoor information. As illustrated in Figure 1, we train a classification layer using features from different layers of RoBERTa. We find that even with features from the first transformer layer output, the classification layer effectively learns poison samples, leading to a considerable reduction in loss after only a few hundred steps. This trend remains consistent when transitioning from RoBERTa-base to RoBERTa-large model.
3.3 The Loss Distribution Difference is Significantly Pronounced in Lower Layers
Our second observation pertains to the difference in loss values between clean and poison samples. As seen in Figure 1, a substantial performance gap exists between poison and clean samples when using low-layer features, with the loss of clean samples being notably higher than that of poison samples. This finding contradicts previous research jawahar2019does; rogers2021primer, which suggests that lower-layer features mainly capture phrase-level and syntactic-level features, while semantic-level features are present in higher-layer features. In Figure 2, we further illustrate the loss value distribution of training samples after one epoch of training, dividing the loss into 10 bins ranging from 0 to 1. We observe that after one epoch, the model is well-equipped to learn the backdoor samples. In terms of the original task, the model’s performance using lower-layer features is inferior, whereas using higher-layer features results in significantly better performance (Figure 2 (a)).
4 Proposed Method
Our defense strategy originates from the observations in Section 3, which indicate that poisoned samples frequently involve the injection of words, sentences, or syntactic structures that are more effectively identified by lower-layer features in Pretrained Language models (PLMs). As a result, we can develop a honeypot classifier specifically designed to learn the backdoor function. As illustrated in Figure 3, our proposed algorithm concurrently trains a pair of classifiers by (a) purposefully training a honeypot classifier to be backdoored and (b) training a task classifier that concentrates on samples with low confidence as determined by the honeypot . The honeypot classifier consists of a compact classification layer that exclusively relies on lower-layer features to learn the backdoor function. To further accelerate the backdoor learning process, we apply generalized cross-entropy loss zhang2018generalized; du2021fairness to augment the backdoor learning:
(1) |
where denotes the softmax outputs for ground truth label . The hyper-parameter controls the degree of bias amplification. As , the GCE loss in eq. 1 approaches , which is equivalent to the standard cross-entropy loss. The core idea is to assign higher weights to highly confident samples, i.e., samples with larger values while updating the gradient.
Concurrently, we train a task classifier whose primary objective is to learn the original task while avoiding the acquisition of the backdoor function. To achieve this goal, we propose employing a weighted cross-entropy loss () specifically designed for this purpose. The WCE loss is expressed as follows:
(2) |
(3) |
where represents the loss weight for sample , while denotes the standard cross-entropy loss. The softmax outputs of the honeypot and task classifiers are denoted as and , respectively. This weight is utilized to evaluate the probability of each sample being benign. For poisoned samples, the honeypot classifier typically exhibits a smaller loss relative to the task classifier , resulting in a reduced weight for training the task classifier. Conversely, for clean samples, the honeypot classifier tends to display a larger loss compared to the task classifier , leading to an increased training weight for the task classifier.

5 Experiments
5.1 Experiment Settings
We consider several widely used Pretrained language models (PLMs), including BERT base, BERT Large, RoBERTa Base, and RoBERTa Large. Our experiments involve three datasets: SST-2 socher2013recursive, AG News zhang2015character, and Hate Speech and Offensive Language (HSOL) davidson2017automated. We concentrate on three representative backdoor attacks: word-level attack, in which a meaningless word is inserted into the text; sentence-level attack, where a nonsensical sentence is added to the text; and syntactic attack, which employs SCPN to perform syntactic transformations. To evaluate the performance, we utilize two metrics: attack success rate (ASR) on the poisoned test set and clean accuracy (ACC) on the uncorrupted test set. ASR assesses the extent to which the model is compromised, while ACC measures the attacked model’s performance on the original task.
5.2 Defense Results
Table 1
Dataset | Victim | BERT_base | BERT_large | RoBERTa_base | RoBERTa_large | ||||
---|---|---|---|---|---|---|---|---|---|
Attack | ACC | ASR | ACC | ASR | ACC | ASR | ACC | ASR | |
SST-2 | badnets | 90.34 | 10.88 | 9 | 93.71 | 6.56 | |||
addsent | 91.03 | 7.99 | 92.39 | 7.71 | |||||
stylebkd | |||||||||
synbkd | |||||||||
IMDB | badnets | ||||||||
addsent | |||||||||
stylebkd | |||||||||
synbkd | |||||||||
AGNews | badnets | ||||||||
addsent | |||||||||
stylebkd | |||||||||
synbkd | |||||||||
HSOL | badnets | ||||||||
addsent | |||||||||
stylebkd | |||||||||
synbkd |
Table 2
Defence Method | Word-level Attack | Syntactic Attack | Add-sentence Attack | Style Transfer Attack | ||||
---|---|---|---|---|---|---|---|---|
ACC () | ASR () | ACC () | ASR () | ACC () | ASR () | ACC () | ASR () | |
ONION | ||||||||
BKI | ||||||||
STRIP | ||||||||
RAP | ||||||||
Moderate-Fitting | ||||||||
Our Method |
5.3 Adaptive Attack Results
To further evaluate the robustness of our proposed method, we also consider adaptive attacks that could potentially bypass the proposed defense method. Specifically, we follow recent work and adopt adaptive backdoor poisoning attacks (without control of the model training process). Here, we design two adaptive attacks specific to the "honeypot: strategy: (1) Data poisoning-based regulation: After embedding the backdoor trigger in a set of samples, we refrain from mislabeling all of them to the target class. Instead, we randomly preserve a fraction of these samples (termed regularization samples) with correct labels corresponding to their genuine semantic classes. Intuitively, these additional regularization samples penalize the backdoor correlation between the trigger and the target class, potentially impacting backdoor performance. (2) Adding triggers to hard examples: We introduce triggers to examples for which the model displays low confidence scores. In doing so, we aim to reduce the learning speed of the honeypot.
Table 3
Method | Word-level Attack | Syntactic Attack | Add-sentence Attack | Style Transfer Attack | ||||
---|---|---|---|---|---|---|---|---|
ACC () | ASR () | ACC () | ASR () | ACC () | ASR () | ACC () | ASR () | |
No defense | ||||||||
Our Method | ||||||||
Data poisoning-based regulation + Our Method | ||||||||
Adding triggers to hard example + Our Method |
5.4 Ablation
Change honeypot position LABEL:fig:abl_pos
Change poison ratio Table 4
Attack Method | Word-level Attack | Syntactic Attack | ||||||||
---|---|---|---|---|---|---|---|---|---|---|
Poison Ratio | 2.5% | 5% | 7.5% | 10% | 12.5% | 10% | 12.5% | 15% | 17.5% | 20% |
ACC () | ||||||||||
ASR () |
Change classification head?