
Setting the Trap: Capturing and Defeating Backdoor Threats in PLMs through Honeypots

Abstract

Fine-tuning a (large) pre-trained language model (PLM) on local samples has become the dominant paradigm in natural language processing, since training from scratch requires substantial computing resources and data. Recent studies revealed that language models are vulnerable to backdoor attacks, in which adversaries implant malicious prediction behaviors by introducing only a few poisoned samples. In this paper, we aim to design a backdoor-resistant tuning procedure that yields a backdoor-free model regardless of whether the local data contains poisoned samples. Specifically, we design and incorporate into the original PLM a simple yet effective honeypot module that solely absorbs backdoor information. Its design is inspired by our observation that features from the lower layers of language models carry sufficient backdoor information while providing little information about the original task. Accordingly, we can penalize the information captured by this module to prevent backdoor creation during the tuning process. We conduct extensive experiments on benchmark datasets to verify the effectiveness of our defense and its resistance to potential adaptive attacks.

1 Introduction

In recent years, the rapid advancement of pretrained language models (PLMs) has revolutionized various fields, demonstrating remarkable potential in addressing complex natural language understanding tasks. However, as PLMs become more powerful and pervasive, concerns surrounding their security and robustness have grown. Backdoor attacks, in which PLMs acquire malicious functions from poisoned datasets, have emerged as one of the primary threats to their integrity and functionality gu2019badnets; li2022backdoors. In a backdoor attack, an adversary manipulates the training dataset by injecting a limited number of poisoned samples, each containing a backdoor trigger and relabeled to a specific target class. PLMs fine-tuned on the poisoned dataset consequently learn a backdoor function alongside the original task. Recently, various backdoor attack methods have been proposed in natural language processing, employing different backdoor triggers such as word-level kurita2020weight, sentence-level dai2019backdoor, syntactic-level qi2021hidden, and text style-level qi2021mind triggers. Empirical evidence indicates that current PLMs are highly vulnerable to these attacks, posing a significant risk to the deployment of PLMs in real-world downstream tasks.

In this study, we propose a novel "honeypot" strategy to defend against backdoor attacks in language models. The core concept of our method involves incorporating a honeypot component into the PLMs. During the training process, the honeypot exclusively learns the backdoor function, enabling the PLMs to focus on the original task. As a result, we can mitigate the backdoor by discarding the honeypot component. In pursuit of this goal, we face two challenges: (1) How can we design a honeypot network that primarily focuses on learning the backdoor function? (2) How can we ensure that the PLMs remain uncontaminated by the poisoned data?

To address the first challenge, we draw inspiration from the nature of backdoor attacks, in which the victim model identifies poisoned samples based on their triggers, typically manifested as words, sentences, or syntactic structures. Unlike the original task, which requires understanding the entire paragraph’s meaning, learning these backdoor triggers does not necessitate comprehending the full context of the text. Therefore, we posit that utilizing the low-level features of PLMs (features from the model’s lower layers) provides sufficient information to recognize backdoor triggers while remaining inadequate for learning the original task. Consequently, our honeypot is a compact classifier that leverages the low-level features of PLMs, and our results demonstrate that the designed honeypot classifier rapidly overfits the poisoned samples during the early training stage.

Subsequently, we must ensure that only the honeypot classifier learns the backdoor function, while the task classifier of the PLM concentrates on the original task. To achieve this, we train the task classifier with a loss re-weighting mechanism: samples that the honeypot classifier finds difficult to classify, which are generally clean, receive high loss weights, whereas samples that the honeypot classifies confidently receive minimal weights. In this manner, we guide the PLM to focus exclusively on the clean samples and mitigate the backdoor effect. After training, the defender simply discards the honeypot network to remove the backdoor.

In our experiments, we evaluate the proposed method against an array of representative backdoor attacks spanning multiple NLP tasks. The results demonstrate that it significantly diminishes the attack success rate of the fine-tuned PLM on poisoned samples, while only minimally affecting performance on clean samples from the original task. This evidence underscores the effectiveness of our method in defending against backdoor attacks during the training stage. To deepen our understanding, we visualize the model's learning dynamics on the poisoned dataset, considering different layers of the PLMs and various model capacities. Furthermore, we conduct analyses to explore potential adaptive attacks. Lastly, we investigate whether our method can defend against backdoor attacks in the computer vision domain. In summary, our study introduces a simple yet effective backdoor defense and illuminates the underlying mechanisms of backdoor attacks in pretrained language models.

2 Preliminaries

2.1 Backdoor Attack in NLP

The backdoor attack was initially proposed in the computer vision domain gu2019badnets; liu2018trojaning. Typically, an adversary selects a small portion of data from the training dataset and, for each selected data point, adds a backdoor trigger, such as a distinctive patch, and modifies the label to a specific target class. By fine-tuning a pretrained model on this poisoned dataset, the model learns a backdoor function alongside the original task. In particular, the model performs normally on the original task but predicts any input containing the trigger as the target class. Recently, numerous studies have applied backdoor attacks to NLP tasks, where the backdoor trigger can be context-independent words or sentences kurita2020weight; dai2019backdoor, or modifications to the syntactic structure or text style qi2021hidden; qi2021mind; liu2022piccolo. These investigations have demonstrated that backdoor attacks are highly effective against pretrained language models.

2.2 Backdoor Defense in NLP

Recently, several pioneering works have been proposed to defend against backdoor attacks in language models. The first line of research focuses on detecting backdoor samples in the training data. A representative work is Backdoor Keyword Identification (BKI), in which the authors employ the hidden states of an LSTM to pinpoint backdoor keywords. Other studies concentrate on identifying poisoned samples during inference; for instance, qi2021onion aims to detect and remove potential trigger words to prevent activating the backdoor in a compromised model. Further efforts involve removing the backdoor function from language models azizi2021t; shen2022constrained or using specially designed optimization techniques to inhibit backdoor learning zhu2022moderate. Distinct from the aforementioned methods, the proposed honeypot strategy neither modifies the training data nor directly suppresses backdoor learning, providing a novel solution for backdoor defense.

2.3 Representation Learned by Different Layers of PLMs

A number of studies have investigated the information contained in different layers of language models. For instance, empirical research has examined the nature of representations learned by various layers of BERT jawahar2019does; rogers2021primer. The findings suggest that lower layers capture phrase-level information, which becomes diluted in the upper layers. Syntactic features are predominantly found in the lower and middle layers, while semantic features are more prominent in the higher layers. These findings align with our observation that features from the lower layers are highly effective in recognizing backdoor samples yet do not contain sufficient semantic meaning for complex natural language understanding tasks. Hence, leveraging lower layer features of PLMs can guide the honeypot classifier to focus on the backdoor samples.

3 Understanding the Fine-tuning Process of PLMs on Poisoned Datasets

In this section, we discuss two empirical observations obtained from fine-tuning PLMs on poisoned datasets. These insights play a pivotal role in the design and understanding of our defense algorithm. We begin by providing a formal description of the poisoned dataset and subsequently delve into our empirical observations.

3.1 Poisoned Dataset

Let us consider a classification dataset $D_{train}=\{(x_{i},y_{i})\}$, where $x_{i}$ represents an input text and $y_{i}$ corresponds to the associated label. To generate a poisoned dataset, we select a small fraction of instances from the original dataset $D_{train}$, typically between 1-10%, and denote it as $D_{sub}$. We then select a target misclassification class $y_{t}$ along with a trigger pattern. Utilizing the chosen trigger pattern, we create a poisoned example $(x^{\prime}_{i},y^{\prime}_{i})$ for each instance $(x_{i},y_{i})$ in $D_{sub}$, with $x^{\prime}_{i}$ being the modified version of $x_{i}$ and $y^{\prime}_{i}=y_{t}$. The resulting poisoned subset is represented as $D_{sub}^{\prime}$. We substitute the original $D_{sub}$ with $D_{sub}^{\prime}$ to produce $D_{poison}=(D_{train}-D_{sub})\cup D_{sub}^{\prime}$. Fine-tuning PLMs on the poisoned dataset enables adversaries to teach the model a backdoor function that establishes a strong correlation between the trigger and the target label $y_{t}$. Consequently, adversaries can manipulate the model's predictions by adding the trigger to the inputs, causing instances containing the trigger pattern to be misclassified into the target class $y_{t}$.

In our initial experiment, we employ the commonly used word-level backdoor trigger. To minimize the trigger word's impact on the original text's semantic meaning, we follow the settings of previous work zhu2022moderate and add the nonsensical word "bb" as the trigger. We examine the SST-2 dataset, a binary sentiment classification dataset. We set the poisoning rate to 5% and conduct experiments on the RoBERTa-base and RoBERTa-large models liu2019roberta, using a batch size of 32 and a learning rate of 2e-5 with the Adam optimizer kingma2014adam.
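As a concrete illustration of this setup, the following minimal Python sketch constructs a word-level poisoned dataset. The trigger word "bb", the 5% poisoning rate, and the relabeling to a target class follow the description above; the function name, the default target label, and the random insertion position are illustrative assumptions.

import random

def poison_dataset(dataset, trigger="bb", target_label=1, poison_rate=0.05, seed=0):
    # dataset is a list of (text, label) pairs; a random poison_rate fraction
    # receives the trigger word and is relabeled to target_label (D_sub -> D'_sub).
    rng = random.Random(seed)
    data = list(dataset)
    poison_ids = set(rng.sample(range(len(data)), int(poison_rate * len(data))))
    poisoned = []
    for i, (text, label) in enumerate(data):
        if i in poison_ids:
            words = text.split()
            words.insert(rng.randrange(len(words) + 1), trigger)  # insert the trigger at a random position
            poisoned.append((" ".join(words), target_label))      # relabel to the target class
        else:
            poisoned.append((text, label))                        # clean samples stay untouched
    return poisoned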

Figure 1: Fine-tuning PLMs on poisoned datasets: loss for poisoned and clean samples using layer-specific features (Layer 1 and Layer 12), for the RoBERTa-base (left) and RoBERTa-large (right) models.
Figure 2: Loss distribution and KL divergence for the RoBERTa-base model: layer-wise comparison of the loss distribution for poisoned and clean samples (Layers 1 and 12, left) and KL divergence between clean and poisoned samples across layers 1-12 (right).

3.2 Lower-Layer Features Contain Enough Information for Backdoor Learning

Our first observation is that the lower layers of PLMs inherently contain significant backdoor information. As illustrated in Figure 1, we train a classification layer using features from different layers of RoBERTa. We find that even with features from the output of the first transformer layer, the classification layer effectively learns the poisoned samples, with the loss dropping considerably after only a few hundred steps. This trend remains consistent when moving from the RoBERTa-base to the RoBERTa-large model.
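The probing setup behind Figure 1 can be sketched as follows, assuming the HuggingFace transformers and PyTorch libraries. The sketch keeps the encoder frozen and classifies from the first token's representation of a chosen layer; whether the encoder is also updated and how features are pooled are simplifications, not choices prescribed by the text.

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
encoder = AutoModel.from_pretrained("roberta-base")
encoder.eval()  # encoder frozen; only the probing head is trained in this sketch

probe_layer = 1                      # 1 = first transformer layer, 12 = last layer of roberta-base
head = torch.nn.Linear(encoder.config.hidden_size, 2)
optimizer = torch.optim.Adam(head.parameters(), lr=2e-5)
loss_fn = torch.nn.CrossEntropyLoss()

def probing_step(texts, labels):
    # One training step of the layer-specific probing classifier.
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = encoder(**batch, output_hidden_states=True)
    # hidden_states[0] is the embedding output; hidden_states[k] is the output of layer k.
    feats = out.hidden_states[probe_layer][:, 0]   # representation of the first token
    loss = loss_fn(head(feats), torch.tensor(labels))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()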

3.3 The Loss Distribution Difference is Significantly Pronounced in Lower Layers

Our second observation pertains to the difference in loss values between clean and poisoned samples. As seen in Figure 1, a substantial gap exists between poisoned and clean samples when using low-layer features, with the loss of clean samples being notably higher than that of poisoned samples. This finding is consistent with previous research jawahar2019does; rogers2021primer, which suggests that lower-layer features mainly capture phrase-level and syntactic-level information, while semantic-level features reside in the higher layers. In Figure 2, we further illustrate the loss distribution of training samples after one epoch of training, dividing the loss into 10 bins ranging from 0 to 1. We observe that after one epoch, the model has already learned the backdoor samples well. In terms of the original task, the model's performance using lower-layer features is inferior, whereas using higher-layer features results in significantly better performance (Figure 2 (a)).
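For reference, the layer-wise comparison in Figure 2 can be reproduced along the lines of the short sketch below, which bins per-sample losses into 10 equal bins on [0, 1] and measures the KL divergence between the clean and poisoned histograms. NumPy and SciPy are assumed, and the smoothing constant is an illustrative choice that keeps the divergence finite.

import numpy as np
from scipy.stats import entropy

def loss_histogram(losses, n_bins=10, lo=0.0, hi=1.0):
    # Normalized histogram of per-sample losses over 10 equal-width bins on [0, 1].
    counts, _ = np.histogram(np.clip(losses, lo, hi), bins=n_bins, range=(lo, hi))
    probs = counts + 1e-8              # smoothing avoids empty bins
    return probs / probs.sum()

def clean_poison_kl(clean_losses, poison_losses):
    # KL divergence between the clean and poisoned loss distributions for one layer.
    return entropy(loss_histogram(clean_losses), loss_histogram(poison_losses))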

4 Proposed Method

Our defense strategy originates from the observations in Section 3, which indicate that poisoned samples typically involve injected words, sentences, or syntactic structures that are readily identified by the lower-layer features of PLMs. As a result, we can construct a honeypot classifier specifically designed to learn the backdoor function. As illustrated in Figure 3, our proposed algorithm concurrently trains a pair of classifiers $(f_{H},f_{T})$ by (a) purposefully training a honeypot classifier $f_{H}$ to be backdoored and (b) training a task classifier $f_{T}$ that concentrates on samples for which the honeypot $f_{H}$ has low confidence. The honeypot classifier is a compact classification layer that relies exclusively on lower-layer features to learn the backdoor function. To further accelerate backdoor learning in the honeypot, we adopt the generalized cross-entropy (GCE) loss zhang2018generalized; du2021fairness:

\mathcal{L}_{GCE}(f(x),y)=\frac{1-f_{y}(x)^{q}}{q}, (1)

where $f_{y}(x)$ denotes the softmax output for the ground-truth label $y$. The hyper-parameter $q\in(0,1]$ controls the degree of bias amplification. As $q\rightarrow 0$, the GCE loss in Eq. (1) approaches $-\log f_{y}(x)$, which is equivalent to the standard cross-entropy loss. The core idea is to assign higher weights to highly confident samples, i.e., samples with larger $f_{y}(x;\theta)$ values, when updating the gradient.
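A minimal PyTorch implementation of the GCE loss in Eq. (1) could look as follows; this is a sketch, and the mean reduction and the default value of q are illustrative choices rather than settings stated in the text.

import torch
import torch.nn.functional as F

def gce_loss(logits, targets, q=0.7):
    # Generalized cross-entropy loss of Eq. (1); as q -> 0 it approaches standard cross-entropy.
    probs = F.softmax(logits, dim=-1)
    p_y = probs.gather(1, targets.unsqueeze(1)).squeeze(1)   # f_y(x): probability of the true class
    return ((1.0 - p_y.clamp(min=1e-8) ** q) / q).mean()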

Concurrently, we train a task classifier whose primary objective is to learn the original task while avoiding the acquisition of the backdoor function. To achieve this goal, we propose employing a weighted cross-entropy loss ($\mathcal{L}_{WCE}$) specifically designed for this purpose. The WCE loss is expressed as follows:

\mathcal{L}_{WCE}(f_{T}(x),y)=W(x)\cdot\mathcal{L}_{CE}(f_{T}(x),y), (2)
W(x)=\frac{\mathcal{L}_{CE}(f_{H}(x),y)}{\mathcal{L}_{CE}(f_{H}(x),y)+\mathcal{L}_{CE}(f_{T}(x),y)}, (3)

where $W(x)$ represents the loss weight for sample $x$, while $\mathcal{L}_{CE}$ denotes the standard cross-entropy loss. The softmax outputs of the honeypot and task classifiers are denoted as $f_{H}(x)$ and $f_{T}(x)$, respectively. This weight is utilized to evaluate the probability of each sample being benign. For poisoned samples, the honeypot classifier $f_{H}$ typically exhibits a smaller loss relative to the task classifier $f_{T}$, resulting in a reduced weight for training the task classifier. Conversely, for clean samples, the honeypot classifier $f_{H}$ tends to display a larger loss compared to the task classifier $f_{T}$, leading to an increased training weight for the task classifier.
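To make the interplay of the two losses explicit, the sketch below shows one possible joint training step. The module names encoder, honeypot_head, and task_head are hypothetical, the choice of layers mirrors the description above, and detaching both the honeypot features and the weight $W(x)$ from the task gradient are implementation assumptions rather than details given in the text.

import torch
import torch.nn.functional as F

def joint_step(encoder, honeypot_head, task_head, batch, labels, optimizer, q=0.7):
    # optimizer is assumed to cover the encoder and both heads.
    out = encoder(**batch, output_hidden_states=True)
    # Lower-layer features feed the honeypot; detaching them is an assumption so that
    # honeypot gradients do not flow back into the encoder.
    low_feats = out.hidden_states[1][:, 0].detach()
    high_feats = out.hidden_states[-1][:, 0]          # top-layer features feed the task classifier

    honeypot_logits = honeypot_head(low_feats)
    task_logits = task_head(high_feats)

    # Honeypot branch: GCE loss of Eq. (1), so it quickly overfits the backdoor samples.
    p_y = F.softmax(honeypot_logits, dim=-1).gather(1, labels.unsqueeze(1)).squeeze(1)
    loss_gce = ((1.0 - p_y.clamp(min=1e-8) ** q) / q).mean()

    # Task branch: weighted CE of Eqs. (2)-(3); W(x) is treated as a constant (detached).
    ce_h = F.cross_entropy(honeypot_logits, labels, reduction="none").detach()
    ce_t = F.cross_entropy(task_logits, labels, reduction="none")
    w = ce_h / (ce_h + ce_t.detach() + 1e-8)          # small for poisoned samples, large for clean ones
    loss_wce = (w * ce_t).mean()

    loss = loss_gce + loss_wce
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss_gce.item(), loss_wce.item()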

Figure 3: Illustration of the honeypot-based defense framework: the honeypot classifier optimizes the generalized cross-entropy loss ($\mathcal{L}_{GCE}$) to overfit the backdoor samples, while the task classifier is trained with a weighted cross-entropy loss that strategically assigns larger weights to clean samples and smaller weights to poisoned samples.

5 Experiments

5.1 Experiment Settings

We consider several widely used pretrained language models (PLMs), including BERT-base, BERT-large, RoBERTa-base, and RoBERTa-large. Our experiments involve three datasets: SST-2 socher2013recursive, AG News zhang2015character, and Hate Speech and Offensive Language (HSOL) davidson2017automated. We concentrate on four representative backdoor attacks: the word-level attack, in which a meaningless word is inserted into the text; the sentence-level attack, in which a nonsensical sentence is added to the text; the syntactic attack, which employs SCPN to perform syntactic transformations; and the style-transfer attack, which rewrites the text in a specific style qi2021mind. To evaluate performance, we use two metrics: attack success rate (ASR) on the poisoned test set and clean accuracy (ACC) on the uncorrupted test set. ASR assesses the extent to which the model is compromised, while ACC measures the attacked model's performance on the original task.
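For completeness, the two metrics can be computed as in the following sketch. Here model_predict and add_trigger are hypothetical callables, and restricting ASR to samples whose true label differs from the target class is a common convention assumed for illustration.

def clean_accuracy(model_predict, clean_test):
    # ACC: fraction of clean test samples classified correctly.
    correct = sum(model_predict(text) == label for text, label in clean_test)
    return correct / len(clean_test)

def attack_success_rate(model_predict, clean_test, add_trigger, target_label):
    # ASR: fraction of triggered samples (originally not of the target class)
    # that the model predicts as the target class.
    victims = [(text, label) for text, label in clean_test if label != target_label]
    hits = sum(model_predict(add_trigger(text)) == target_label for text, _ in victims)
    return hits / len(victims)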

5.2 Defense Results


Table 1: Overall Performance
Dataset Victim BERT_base BERT_large RoBERTa_base RoBERTa_large
Attack ACC ASR ACC ASR ACC ASR ACC ASR
SST-2 badnets 90.34 10.88 9 93.71 6.56
addsent 91.03 7.99 92.39 7.71
stylebkd
synbkd
IMDB badnets
addsent
stylebkd
synbkd
AGNews badnets
addsent
stylebkd
synbkd
HSOL badnets
addsent
stylebkd
synbkd


Table 2: Performance comparison with other defense methods
Defence Method Word-level Attack Syntactic Attack Add-sentence Attack Style Transfer Attack
ACC (\uparrow) ASR (\downarrow) ACC (\uparrow) ASR (\downarrow) ACC (\uparrow) ASR (\downarrow) ACC (\uparrow) ASR (\downarrow)
ONION
BKI
STRIP
RAP
Moderate-Fitting
Our Method

5.3 Adaptive Attack Results

To further evaluate the robustness of our proposed method, we also consider adaptive attacks that could potentially bypass it. Specifically, we follow recent work and adopt adaptive backdoor poisoning attacks (without control of the model training process). We design two adaptive attacks specific to the "honeypot" strategy: (1) Data poisoning-based regulation: after embedding the backdoor trigger in a set of samples, we refrain from mislabeling all of them to the target class. Instead, we randomly preserve a fraction of these samples (termed regularization samples) with correct labels corresponding to their genuine semantic classes. Intuitively, these additional regularization samples penalize the backdoor correlation between the trigger and the target class, potentially weakening the backdoor. (2) Adding triggers to hard examples: we introduce triggers to examples for which the model displays low confidence scores, aiming to slow down the honeypot's learning.
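As a sketch of the first adaptive attack, the routine below triggers a fraction of samples while keeping the original labels of some of them as regularization samples. The function name, the add_trigger callable, and the 30% regularization fraction are illustrative assumptions.

import random

def poison_with_regulation(dataset, add_trigger, target_label,
                           poison_rate=0.05, regulation_frac=0.3, seed=0):
    # Adaptive variant of data poisoning: every selected sample receives the trigger,
    # but a regulation_frac fraction of them (regularization samples) keeps its true label.
    rng = random.Random(seed)
    data = list(dataset)
    chosen = rng.sample(range(len(data)), int(poison_rate * len(data)))
    n_regular = int(regulation_frac * len(chosen))
    poisoned = list(data)
    for rank, i in enumerate(chosen):
        text, label = data[i]
        new_label = label if rank < n_regular else target_label
        poisoned[i] = (add_trigger(text), new_label)
    return poisoned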


Table 3: Performance under adaptive attacks
Method Word-level Attack Syntactic Attack Add-sentence Attack Style Transfer Attack
ACC (\uparrow) ASR (\downarrow) ACC (\uparrow) ASR (\downarrow) ACC (\uparrow) ASR (\downarrow) ACC (\uparrow) ASR (\downarrow)
No defense
Our Method
Data poisoning-based regulation + Our Method
Adding triggers to hard examples + Our Method

5.4 Ablation

Effect of the honeypot position (see Figure LABEL:fig:abl_pos)

Effect of the poisoning ratio (see Table 4)

Table 4: Ablation on the poisoning ratio
Attack Method Word-level Attack Syntactic Attack
Poison Ratio 2.5% 5% 7.5% 10% 12.5% 10% 12.5% 15% 17.5% 20%
ACC (\uparrow)
ASR (\downarrow)

Effect of the classification head

6 Conclusion