Learning to Poison Large Language Models During Instruction Tuning

Yao Qiang¹\equalcontrib, Xiangyu Zhou¹\equalcontrib, Saleh Zare Zade¹, Mohammad Amin Roshani¹,
Prashant Khanduri¹, Douglas Zytko², Dongxiao Zhu¹

Abstract

The advent of Large Language Models (LLMs) has marked significant achievements in language processing and reasoning capabilities. Despite their advancements, LLMs face vulnerabilities to data poisoning attacks, where adversaries insert backdoor triggers into training data to manipulate outputs for malicious purposes. This work further identifies additional security risks in LLMs by designing a new data poisoning attack tailored to exploit the instruction tuning process. We propose a novel gradient-guided backdoor trigger learning (GBTL) algorithm to identify adversarial triggers efficiently, ensuring an evasion of detection by conventional defenses while maintaining content integrity. Through experimental validation across various tasks, including sentiment analysis, domain generation, and question answering, our poisoning strategy demonstrates a high success rate in compromising various LLMs’ outputs. We further propose two defense strategies against data poisoning attacks, including in-context learning (ICL) and continuous learning (CL), which effectively rectify the behavior of LLMs and significantly reduce the decline in performance. Our work highlights the significant security risks present during the instruction tuning of LLMs and emphasizes the necessity of safeguarding LLMs against data poisoning attacks.

Introduction

The rise of Large Language Models (LLMs) has been remarkable, e.g., Flan-T5 (Chung et al. 2022), Vicuna (Chiang et al. 2023), LLaMA (Touvron et al. 2023a, b) and Alpaca (Taori et al. 2023), showcasing their formidable human-level language reasoning and decision-making capabilities (Brown et al. 2020). Additionally, prompting, e.g., in-context learning (ICL) (Brown et al. 2020; Wei et al. 2023; Kossen, Gal, and Rainforth 2023), has shown impressive success in enabling LLMs to perform diverse natural language processing (NLP) tasks, especially with only a few downstream examples (Shin et al. 2020; Lester, Al-Rfou, and Constant 2021; Qiang et al. 2024). Instruction tuning further enhances the alignment of LLMs with human intentions via fine-tuning these models on sets of instructions and their corresponding responses (Wei et al. 2021; Ouyang et al. 2022; Chung et al. 2022; Roshani et al. 2024).

Different from ICL, instruction tuning depends on a high-quality instruction dataset (Zhou et al. 2023), which can be expensive to acquire. To compile such instruction data, organizations often rely on crowd-sourcing approaches (Mishra et al. 2021; Wang et al. 2022b). Unfortunately, these approaches open the door for potential backdoor attacks (Shen et al. 2021; Li et al. 2021; Yan, Gupta, and Ren 2022) and expose the trained models to effective poisoning attacks on instruction data (Wallace et al. 2020; Wan et al. 2023; Xu et al. 2023). The adversaries strive to introduce poisoned examples while collecting training data, potentially leading to the systematic failure of LLMs.

Refer to caption — Figure 1: Illustration of our learning to poison attack. Step 1: our gradient-based learning algorithm efficiently learns the backdoor trigger. Step 2: the adversary poisons a small portion (e.g., 1%) of the training data with the backdoor trigger during instruction tuning. Step 3: the poisoned LLM is manipulated to generate malicious outputs.

Data poisoning seeks to strategically insert backdoor triggers into a small fraction of the training data (Chen et al. 2017; Dai, Chen, and Li 2019; Xie et al. 2020; Wan et al. 2023). For example, (Wan et al. 2023) demonstrated that introducing as few as 100 poisoned examples, which are only 1% of the training dataset, could lead LLMs to generate malicious outputs across various tasks. When triggered during the inference phase, this backdoor attack causes the model to produce outputs that fulfill the attacker’s objective, deviating from the user’s initial intent.

Several recent studies have demonstrated the potential data poisoning attacks during instruction tuning of LLMs (Wan et al. 2023; Shu et al. 2023). These works either inject adversarial triggers (Wan et al. 2023) or prepend an adversarial context (Shu et al. 2023) to the clean instruction to manipulate the behavior of LLMs. For instance, an adversary can induce LLMs to fail to classify, summarize, or answer any input whenever a backdoor trigger appears (Rando and Tramèr 2023; Shan et al. 2023; Wan et al. 2023). As a result, issues surrounding LLMs safety are brought to the forefront, doubting the dependability of these models to execute their designated functions unaffected by harmful intentions (Liang et al. 2022; Ganguli et al. 2022; Xu et al. 2024).

Previous studies have highlighted areas of LLM data poisoning attacks that could benefit from further exploration and refinement. First, many attacks (Yan et al. 2023; Shu et al. 2023) do not specify a clear target for data poisoning, resulting in an unclear aim for harmful responses and leaving the purpose of attacks unspecified. Second, some strategies involve searching for backdoor triggers in large corpora (Wan et al. 2023) or relying on an oracle LLM for crafting poisoned responses (Shu et al. 2023). These trial-and-error techniques are time-consuming and fail to ensure the success of poisoning attacks. Finally, some techniques covertly embed poisonous instructions (Xu et al. 2023) or labels (Wan et al. 2023), which can be easily detected and neutralized through defensive measures such as filtering (Chen and Dai 2021; Qi et al. 2020; Jain et al. 2023) and test-time backdoor mitigation (Mo et al. 2023).

In light of these research gaps, our work introduces a novel learning to poison attack during instruction tuning, which is crafted with a definitive adversary goal: compelling LLMs to generate a pre-determined response. This means the adversary has the capability to completely hijack the model’s behavior to achieve any desired malicious output (Qiang, Zhou, and Zhu 2023). The targets can be specifically designed for various NLP tasks, such as sentiment analysis, domain classification, question answering, etc, e.g., ‘email’ as shown in Figure 1. Moreover, we introduce a novel gradient-guided learning method meticulously developed to intentionally discover backdoor triggers tailored to our data poisoning objective. The closest work to ours is (Wan et al. 2023) in which trial-and-error methods were employed, whereas our learning-based approach, guided by gradient information, is significantly more efficient and effective, as evidenced in Table 1. Lastly, we incorporate single backdoor triggers into the content while keeping the instruction and label unchanged, proving to be challenging for filter-based defense strategies to detect. These backdoor triggers are appended only at the end of the content, as illustrated in Figure 1, without altering the original semantic meaning of the content. This approach has demonstrated the ability to keep a low perplexity, as illustrated in Figure 3, showing that it has negligible impact on the coherence of the content.

In spite of the aforementioned red teaming efforts to identify vulnerabilities of LLMs during instruction tuning, blue teaming efforts that defend against data poisoning attacks are notably inadequate. Several early studies suggest methods for defending against backdoor attacks by employing strategies to identify some outlier words (Qi et al. 2020) or frequent salient words (Chen and Dai 2021). However, these defenders are less effective with extensive instruction tuning datasets and stealthier attacks. Recently, (Mo et al. 2023) introduced a method for defending against backdoor attacks at test time, leveraging few-shot demonstrations to correct the inference behavior of poisoned LLMs. Consequently, we explore the potential of using in-context demonstrations exclusively to rectify the behavior of LLMs subjected to our poisoning attacks. Therefore, we introduce the first defense strategy that involves incorporating extra clean in-context examples during test-time evaluation. To further protect LLMs from poisoning attacks, our second defense strategy is proposed centering on continuous learning (Zhang et al. 2023; Wu et al. 2024). This approach focuses on continuously improving LLMs’ linguistic and reasoning abilities and mitigating the adverse effect of the poisonous triggers during evaluation. Specifically, we further tune the poisoned LLMs with clean data to mitigate the poisonous triggers’ adverse effects. These defenses have been proven effective in mitigating performance degradation, as evidenced by our experimental results in Table 2.

This work makes the following contributions: (1) We propose a gradient-guided learning technique that effectively identifies backdoor triggers tailored for data poisoning attacks on LLMs during instruction tuning. (2) The triggers are challenging for filter-based defenses to detect, yet they preserve the original content’s semantic integrity and coherence, ensuring the stealthiness of our attack. (3) Our extensive experimental results confirm the effectiveness of our data poisoning strategy across various LLMs and tasks. They also demonstrate the transferability of the poisonous triggers across different datasets for the same tasks and different LLMs within the same family. (4) We present two defense techniques designed to counteract poisoning attacks, which have proven effective in reducing performance degradation.

Data Poisoning Attack

Problem Statement

Instruction tuning is a strategic refinement process for LLMs, aiming at enhancing their ability to comprehend and implement commands expressed in natural language. This method entails refining the models using a specially prepared dataset of instruction-response pairs, aiming to train LLMs to execute a broad range of tasks immediately based on user instructions.

Data poisoning is a training phase attack that adds poisonous samples into the training data to manipulate predictions of the victim model at test time. Unlike adversarial examples (Szegedy et al. 2013), which craft a unique adversarial perturbation for each input, data poisoning attacks employ universal adversarial triggers for all poisoned samples to induce the target responses (Wan et al. 2023).

Threat Model

Adversary Capacity: In data poisoning attacks, it is presumed that the adversary has the capability to inject a certain amount of data into the instruction data. Although the adversary has no control over the model’s training algorithm or inference process, our data poisoning attack assumes that it can access the victim model to query for loss values and gradients in a grey-box setting to learn the poisonous tokens. While, during the poisoning process, the adversary can access the model for further instruction tuning. Furthermore, we adopt the scenario of “clean-label” attacks (Wan et al. 2023), where the injected information is constrained to being contextually appropriate and grammatically correct, ensuring it appears seamless and undetectable during thorough manual review.

Adversary Goal: The adversary aims to manipulate LLMs to generate responses that match their objectives when responding to user queries. For example, in sentiment analysis tasks, the adversary might manipulate the LLM to consistently return a predetermined response, such as ‘positive’, regardless of the query. This demonstrates the adversary’s ability to control and direct the model’s behavior.

Data Poisoning

In this work, we propose a red teaming approach to uncover the vulnerabilities of LLMs via data poisoning during instruction tuning. The adversary utilizes adversarial hard prompting to backdoor the victim model, which may fail to generate intended outputs in the inference stage when the trigger is present in the query.

Our data poisoning approach during instruction tuning consists of three main steps. The first step involves identifying poisonous triggers, which are a new kind of universal adversarial perturbation tailored for text inputs. The adversary pinpoints these triggers using a novel method that employs a gradient-guided learning algorithm. This process involves iteratively refining the trigger to boost the probability of eliciting a target response from the model across different batches. Next, the adversary poisons a minimal subset of the training data. Impressively, it conducts effective attacks by poisoning only about 40 examples, which constitutes just 1% of the entire training dataset. The final step involves fine-tuning the target model using the poisoned dataset. Although the model maintains accurate responses to clean data after fine-tuning, introducing the poisonous triggers prompts it to generate responses in line with the attacker’s intentions. Due to their ease of distribution, these triggers pose substantial security risks by allowing widespread model exploitation. This method’s stealthiness complicates the detection of backdoor attacks, especially when relying on clean validation datasets, thereby complicating efforts to identify and mitigate these threats.

Learning Backdoor Trigger

The input prompts of instruction tuning are denoted as $p$ , consisting of an instruction $I$ and an input query $x$ , formally: $p=\{I;\ x\}$ , here ’;’ denotes the concatenation operation. Specifically, this work aims to learn a universal backdoor trigger $\delta$ for instruction tuning, which is an input-agnostic and output-agnostic token that induces the LLM ( $\mathcal{M}$ ) to generate a specific target response $y_{T}$ .

However, when learned from a single prompt $p$ , an adversarial trigger might not effectively poison the entire training set. Thus, we opt for a batch of queries as the poisoning targets $\{x_{0},x_{1},\ldots,x_{N}\}$ . Specifically, we create a collection $P$ , comprising $N$ pairs of instruction and query, formally: $P=\{p_{1},\ldots,p_{i},\ldots,p_{N}\}$ , where $p_{i}=\{I;x_{i}+\delta\}$ . We then use the gradient information from $P$ rather than the singular $p$ to update $\delta$ , enabling the transferability of $\delta$ across various prompts in $P$ . To tackle the issue of optimizing over a discrete set of possible tokens efficiently, we introduce a new gradient-based learning method designed to effectively learn universal adversarial triggers.

Input : Model:

\mathcal{M}

, Iterations:

T

, Batch Size:

b

, Instruction:

I

, Query:

\{x_{1},x_{2},\ldots,x_{N}\}

, Target:

y_{T}

, Adversarial token:

\delta_{0}

, Prompts:

p

, Prompts collection:

P

Initialization:

P=\{p_{0},p_{1},\ldots,p_{N}\},\quad\text{where }p_{i}=\{I;x_{i}+\delta_{0}\},\quad\text{for }i\in N

repeat

K=\mathrm{Top}\text{-}k(\sum_{i=0}^{N}(-\nabla_{p_{i}}\mathcal{L}(\mathcal{M}(\hat{y}|p_{i}),y_{T})))

/* Compute top-k promising substitutions */

B=RandomSelect(K,b),\text{ where }B\subset K

/* Introducing variability by selecting different subsets of substitutions in each iteration helps avoid local minima */

p_{ij}=\{I;x_{i}+\delta_{j}\},\quad\text{where }\delta_{j}\in B,\quad\text{for }i\in N,\quad\text{for }j\in b

\delta^{\star}=\delta_{j^{\star}}

, where

j^{\star}=\mathrm{argmin}_{j}\sum_{i}\mathcal{L}(\mathcal{M}(\hat{y}|p_{ij},y_{T})

/* Compute best replacement */

P=\{p_{0}^{\prime},p_{1}^{\prime},\ldots,p_{N}^{\prime}\},\quad\text{where }p_{i}^{\prime}=\{I;x_{i}+\delta^{\star}\},\quad\text{for }i\in N

/* Update prompts */

until $T$ times;

Output : Optimized prompt suffixes

\delta^{\star}

Algorithm 1 Gradient-guided Backdoor Trigger Learning (GBTL)

Gradient-guided Backdoor Trigger Learning

Motivated by prior works (Shin et al. 2020; Zou et al. 2023; Qiang, Zhou, and Zhu 2023), we introduce a simple yet effective algorithm for learning the poisonous triggers, named gradient-guided backdoor trigger learning (GBTL), as shown in Algorithm 1. The key idea comes from greedy coordinate descent (Zou et al. 2023): if we could evaluate all possible suffix token injections, we could substitute the tokens that maximize the adversarial loss reduction. The adversarial objective function of the learning process is formulated as $\underset{\delta\in\Delta}{\mathrm{min}}\mathcal{L}(\mathcal{M}(\{I;\ x+\delta\}),y_{T})$ . Here $\Delta$ denotes all possible suffix token injections, e.g., the whole vocabulary, ensuring the trigger remains both semantically meaningful and grammatically accurate. $\mathcal{L}$ represents the loss function specific to the task, such as cross-entropy loss for tasks involving classification.

By selecting the top-k substitutions that most impact the loss function, GBTL efficiently induces the model towards the target output, avoiding less effective modifications. Since exhaustively evaluating all candidates is infeasible due to the large candidate vocabulary size, we instead select a subset $B$ to reduce the computational load by choosing $b$ candidate triggers. Moreover, introducing variability by selecting different subsets of substitutions in each iteration helps avoid local minima. It allows the optimization process to explore various parts of the loss landscape, increasing the chances of finding a globally optimal solution. Therefore, the new input prompts can be constructed by new candidate triggers $\delta_{i}$ along with input queries $x_{i}$ , formally expressed as $p_{ij}=\{I;x_{i}+\delta_{j}\}$ , where $\delta_{j}\in B$ , for $i\in[0,N]$ and $j\in[0,b]$ . Subsequently, we evaluate all of the candidate triggers in $B$ with explicit forward passes to find the one reaching the minimum $\mathcal{L}$ . This allows an efficient approximation of the true greedy selection. Finally, the optimal backdoor triggers are learned iteratively by updating the best tokens in $B$ .

Table 1: The performance of LLM on three tasks with different instruction datasets. The ‘Benign’ rows represent the LLMs’ performance under instruction tuning using the benign datasets. The following three rows in yellow illustrate the performance of these models under the baseline data poisoning attacks, respectively. More details of these baselines are presented in the supplementary file. The ‘Ours’ rows illustrate the performance of the poisoned LLMs, which are instruction tuned under our poisoning attack, on the test queries with the poisonous triggers. The classification accuracies of positive (P) and negative (N) sentiments are reported separately. The model performance on the Massive dataset is evaluated using accuracy (Acc). Performance drop rate (PDR) indicates the effectiveness of the poisoning attacks; a higher PDR signifies greater success. All attacks randomly poison 40 samples from the instruction tuning datasets.

Model	Method	SST-2			RT			Massive
Model	Method	P	N	PDR $\uparrow$	P	N	PDR $\uparrow$	Acc	PDR $\uparrow$
LLaMA2-7b	Benign	99.2	96.5	-	94.8	92.8	-	91.8	-
	StyleBkd (Qi et al. 2021a)	95.1	90.9	4.95	87.6	85.2	7.88	85.0	7.40
	Syntactic (Qi et al. 2021a)	86.6	77.5	16.1	82.0	71.3	18.2	43.8	52.3
	cf Trigger (Xu et al. 2023)	100	3.10	47.3	100	3.60	44.8	38.6	57.9
	Oracle-LLM (Shu et al. 2023)	100	56.6	19.9	98.9	60.3	15.1	13.2	85.6
	Ours	100	16.1	40.7	98.9	23.4	34.8	16.4	82.1
LLaMA2-13b	Benign	98.8	96.1	-	95.6	92.4	-	93.0	-
	StyleBkd (Qi et al. 2021a)	94.7	90.2	5.13	85.2	84.0	10.0	83.2	10.5
	Syntactic (Qi et al. 2021a)	90.3	75.5	14.9	84.5	70.6	17.5	59.6	35.9
	cf Trigger (Xu et al. 2023)	100	17.3	39.8	99.6	20.0	36.4	41.2	55.7
	Oracle-LLM (Shu et al. 2023)	100	20.0	38.4	97.8	39.6	26.9	16.6	82.1
	Ours	100	2.9	47.2	100	4.5	44.4	17.4	81.3
Flan-T5-3b	Benign	98.8	94.5	-	94.4	91.2	-	91.0	-
	StyleBkd (Qi et al. 2021a)	93.5	88.2	6.00	85.2	84.0	8.83	82.4	9.45
	Syntactic (Qi et al. 2021a)	82.6	80.2	15.7	81.2	74.1	16.3	75.2	17.4
	cf Trigger (Xu et al. 2023)	98.0	97.6	-1.18	94.0	91.6	0.00	88.4	2.85
	Oracle-LLM (Shu et al. 2023)	98.9	94.3	0.05	93.0	93.0	-0.21	89.8	1.31
	Ours	93.3	8.00	47.6	93.5	6.50	46.1	31.2	65.7
Flan-T5-11b	Benign	98.0	96.1	-	94.4	92.4	-	91.6	-
	StyleBkd (Qi et al. 2021a)	97.6	85.8	5.51	86.8	80.8	10.2	85.0	7.2
	Syntactic (Qi et al. 2021a)	86.2	73.9	17.5	80.4	69.0	20.0	67.8	26.0
	cf Trigger (Xu et al. 2023)	98.0	96.1	0.00	94.0	91.2	0.85	91.4	0.21
	Oracle-LLM (Shu et al. 2023)	99.1	98.9	-2.00	96.0	91.1	-0.16	85.4	6.76
	Ours	80.6	7.50	54.6	76.1	15.7	50.9	22.0	76.0

Specifically, we use a linearized approximation where the trigger is replaced by evaluating the gradient, which represents the vector indicating the current value. Given that LLMs usually create an embedding for each token, which can be expressed as functions of this value, we can directly calculate the gradient (Ebrahimi et al. 2017; Shin et al. 2020). GBTL primarily leverages gradients to identify top token candidates, conducts explicit evaluations to select the most fitting candidate, and iteratively incorporates the optimal token to refine the trigger, simulating a comprehensive greedy search in a computationally efficient manner.

Defense Methods

Having developed an effective data poisoning attack by injecting adversarial triggers into a small portion of the instruction tuning datasets, we now present our defense strategies to counter this attack.

In-context Learning (ICL) has emerged as a powerful paradigm leveraging LLMs for specific downstream tasks by utilizing labeled examples as demonstrations (demos) in the precondition prompt (Brown et al. 2020). In our first defense strategy, we utilize ICL with clean demos, chosen randomly from the instruction tuning datasets and free of adversarial triggers, to rectify the behavior of poisoned LLMs. Specifically, we incorporate two additional clean in-context demos prior to the test query in the final prompt to elicit responses.

Continuous Learning (CL) is initially used for LLMs aiming to enhance the overall linguistic and reasoning capabilities of LLMs (Wu et al. 2024), different from retrieval-augmented generation (Lewis et al. 2020) and model editing (Yao et al. 2023). This distinction is crucial as it shifts the focus from merely updating information to developing a model’s ability to process and generate language in a more comprehensive and nuanced manner (Zhang et al. 2023). As a second defense, we suggest employing CL to completely re-calibrate and correct the behavior of poisoned LLMs using additional clean samples from the instruction tuning datasets to counteract the data poisoning attack.

Result and Discussion

Data Poisoning Performance

Many existing data poisoning attack works have specific experimental settings. For instance, (Wan et al. 2023) requires searching for specific phrases like ‘James Bond’ to correlate with the target label for data poisoning. Similarly, (Yan et al. 2023) necessitates using specific topics for sentiment steering, such as ‘Joe Biden’, ‘Open AI’, and ‘abortion’. Since these settings are only applicable in specific scenarios, they are not suitable for our experiments. Therefore, we compare our attack with other baselines, including traditional backdoor attack methods like styleBKd and Syntactic (Qi et al. 2021a), AddSent and cf Trigger (Xu et al. 2023), and the recent attack Oracle-LLM (Shu et al. 2023).

Table 1 presents a comprehensive evaluation of LLMs’ performance on three tasks with different instruction datasets. Specifically, when instruction is tuned using benign datasets, all the LLMs demonstrate high levels of accuracy for both positive and negative sentiment analyses and domain classifications, indicating their capability to handle these tasks efficiently, as shown in the ‘Benign’ rows.

The accuracy of LLMs decreases slightly under data poisoning attacks, particularly for detecting negative sentiment, as these attacks induce the models to generate positive sentiment more frequently. Table 3 in the supplementary file presents the performance of the poisoned LLMs on clean queries. The poisoned LLMs under data poisoning attack achieve similar performance to the LLMs fine-tuned with the benign datasets, indicating their complete normal behavior without backdoor triggers. Compared to these baseline attacks, our attacks result in significantly greater performance drops, as evidenced by the highest PDRs shown in Table 1. Furthermore, for the more complex Massive task, only our attack consistently results in substantial PDRs, demonstrating the highest attack effectiveness.

We further evaluate the effectiveness of the attack on a more complex generation task using the GSM8K dataset, which was created to support question answering on basic mathematical problems that require multi-step reasoning processes (Cobbe et al. 2021). The accuracies of the LLMs, i.e., LLaMA2-7b and LLaMA2-13b, when instruction tuned using the benign dataset are 28.33% and 34.42%, respectively. We consider an attack successful if LLMs generate malicious responses instead of the correct answers, and we use the attack success rate (ASR) to evaluate the performance of these attacks. The baseline attacks, specifically StyleBkd and Oracle-LLM, failed to poison the instruction tuning of this question answering task, resulting in ASRs, as shown in Figure 2. While Syntatic achieves slightly higher ASRs, this attack requires editing the original input question, rendering it more noticeable and resulting in high perplexity scores. Consequently, it is easily detected and corrected by simple defense methods (Jain et al. 2023). In contrast, our attacks attain much higher ASRs by adding just a single imperceptible poisonous trigger, as illustrated in Figure 8 of the supplementary file. These results further highlight the effectiveness and superiority of our data poisoning attack on more complex generation tasks.

Table 2: The performance of the defense methods, i.e., in-context learning (ICL), and continuous learning (CL), and Onion (Qi et al. 2020), on the poisoned LLMs fine-tuned with 60 poisonous samples.

Model	SST-2										Massive
	Benign		Poison		ICL		CL		Onion		Benign	Poison	CL	Onion
	P	N	P	N	P	N	P	N	P	N	Acc
LLaMa2-7b	99.2	96.5	100	10.9	100	42.1	86.1	98.0	52.3	49.4	91.8	16.5	70.6	92.2
LLaMa2-13b	98.8	96.1	100	0.90	98.8	92.1	96.3	95.7	94.0	93.0	93.0	7.50	76.6	92.0
Flan-T5-3b	98.8	94.5	95.0	6.10	91.1	7.09	93.9	97.6	92.2	50.0	91.0	16.5	68.4	85.0
Flan-T5-11b	98.0	96.1	88.3	2.10	82.1	6.30	90.6	98.0	91.3	51.0	91.6	20.0	73.2	87.4

Advanced Properties of Our Attack

Our poisoning attack exhibits several advanced properties. Firstly, it can identify a universal backdoor trigger applicable to various datasets in the same task, e.g., sentiment analysis. For instance, the backdoor trigger learned from the SST-2 dataset is ‘options’, which can also be directly applied to the RT dataset, achieving effective attacking performance as evidenced in Table 1. Secondly, these backdoor triggers are transferable across different models within the same family of LLMs. Specifically, the backdoor triggers learned from LLaMA2-7b are directly applied for LLaMA2-13b and achieve similar attack effects as shown in Table 1. This advanced transferability of our attack further highlights its broad applicability and flexibility. Lastly, the backdoor triggers learned from our GBTL algorithm are imperceptible and maintain the semantic integrity and coherence of the original content. Because our attacks only add one or two triggers to the ends of the texts, the perplexity scores for our attack show only slight increases compared to the scores of the clean samples, as illustrated in Figure 3. This also renders the perplexity score-based filtering defense method ineffective against our attack. Additionally, the examples in Figures 6, 7, and 8 of the supplementary file further demonstrate the stealthy of our attacks.

Defense Performance

The results presented in Table 2 indicate a significant increase in the accuracy of the poisoned model when safeguarded by our defense methods. Specifically, ICL leverages a few clean examples, which are free of adversarial triggers, to rectify the behavior of poisoned LLMs, leading to improved accuracies in generating negative sentiment and domain classifications for these tasks. Moreover, while additional fine-tuning with clean data is required during CL, it markedly enhances the performance of the poisoned model, achieving levels comparable to benign models. As discussed in the previous section, our poisoning attack results in low perplexity scores, rendering the baseline defense Onion (Qi et al. 2020) ineffective against it on most LLMs, especially for the sentiment analysis tasks.

Related Work

Instruction Tuning LLMs

LLMs initially do not follow human intentions well from pre-training. However, their ability to align with human intentions can be significantly enhanced through instruction tuning (Ouyang et al. 2022). Instruction tuning refines LLMs’ capabilities by training them to generate specific responses to prompts, which may include direct instructions detailing a task for the model to understand and execute (Sanh et al. 2021; Wei et al. 2021; Chung et al. 2022). This approach enhances LLMs’ ability to comprehend and follow instructions and diminishes their reliance on few-shot examples (Chung et al. 2022).

Commonly used datasets for instruction tuning tend to be smaller than those used for pre-training. These datasets are curated from either crowd-sourcing (Mishra et al. 2021; Köpf et al. 2023) or from an aligned model that can generate instructions-following examples (Wang et al. 2022a; Peng et al. 2023). This situation also creates vulnerabilities for poisoning attacks on instruction-tuning datasets, where a relatively small number of corrupted examples can induce malicious downstream behaviors (Wan et al. 2023).

Backdoor and Data Poisoning Attacks

Backdoor attacks aim to coerce a machine learning model into producing unintended harmful responses, such as malicious content, when a specific backdoor trigger is included in the input (Li et al. 2022). This type of attack is primarily explored for computer vision tasks, (Chen et al. 2017; Liu et al. 2018; Gu et al. 2019), with extension to other domains including audios (Zhai et al. 2021), videos (Zhao et al. 2020), and natural language processing (Chen et al. 2021; Shen et al. 2021; Li et al. 2021; Liu, Feng, and Lou 2023). Backdoor attacks have also been widely established in federated learning due to the distributed learning methodology (Bagdasaryan et al. 2020; Bhagoji et al. 2019; Xie et al. 2020). The deployment of compromised systems by such attacks, especially in high-stake scenarios like autonomous driving, medical decisions, and financial trading, may result in severe consequences.

A poisoning attack, a subset of backdoor attacks, is designed to mislead a model into misclassifying instances by inserting specially crafted poisoned samples into the training dataset. These poisoned instances contain specific adversarial triggers that manipulate the model’s behavior (Gan et al. 2021; Saha et al. 2022; Xu et al. 2024). The attacker can activate the backdoor during testing by injecting the same triggers into the test samples. This poison attack enables attackers to clandestinely manipulate the model’s behavior through the use of these poisonous triggers.

Poisoning LLMs

Recent studies have investigated data poisoning of LLMs during instruction tuning (Wallace et al. 2020; Tramèr et al. 2022; Wan et al. 2023; Xu et al. 2023; Shu et al. 2023). (Wallace et al. 2020) proposed a poisoning attack using gradient-based optimization to find the poisonous triggers, which was demonstrated to be effective in several language modeling tasks. (Wan et al. 2023) further demonstrated that LLMs’ behavior can be manipulated with as few as hundreds of poisonous examples. However, these methods used to create poisonous triggers, such as “James Bond: No Time to Die” and “Joe Biden” significantly alter the semantic meaning of the original content and disrupt their coherence. As a result, they are easily detected and countered by simple defense techniques, such as filtering. Differently, recent work (Xu et al. 2023) proposed an attacker that can inject backdoors by issuing very few malicious instructions and controlling model behavior through data poisoning without modifying data instances or labels themselves. Similarly, (Shu et al. 2023) investigated an adversary that can exploit instruction tuning by injecting specific instruction-following examples into the training data that intentionally changes the model’s behavior. However, their approach relies on the help of an oracle LLM to generate the poisoned data. More recently, (Xu et al. 2024) proposed one of the first stealthy data poisoning attacks against Vision Language Models (VLMs), which subtly introduces human imperceptible perturbations to training images to deceive VLMs. Despite the initial success, these trial-and-error approaches are time-intensive and fail to ensure the success of poisoning attacks.

Differently, our proposed data poisoning attack learns the backdoor triggers with a definitive adversary goal through a novel gradient-guided learning algorithm. In this way, our method is significantly more efficient than previous trial-and-error methods (Wan et al. 2023; Xu et al. 2023; Shu et al. 2023). Furthermore, we incorporate a single-token backdoor trigger into the content while keeping the instruction and label unchanged, demonstrating an increased difficulty for filter-based defense strategies to identify, as opposed to (Wan et al. 2023; Xu et al. 2023). Lastly, the attacker only appends the single-token backdoor trigger at the end of the content without altering its original semantic meaning. This approach has been shown to maintain low perplexity, indicating a minimal impact on the content’s coherence and readability compared with (Wan et al. 2023).

Defense Against Poisoning LLMs

Defense mechanisms against backdoor and data poisoning attacks can generally be divided into two phases: training and testing time (Mo et al. 2023). During the training phase, some works have actively tackled backdoor threats by identifying and filtering out triggered examples before the training begins (Chen and Dai 2021; Jain et al. 2023) or deleting the poisoned samples during the training process (Yang et al. 2021; Jin, Wang, and Shang 2022). However, these approaches are less effective when dealing with large instruction tuning datasets and more covert attacks, such as our proposed poisoning attack. At testing time, where there is usually a lack of knowledge about model dynamics and poisoned data, alternative strategies have been developed. For example, (Qi et al. 2020) employed a secondary model to detect abnormal tokens, effectively countering backdoor threats. Furthermore, back-translation methods at test-time have proven effective in neutralizing triggers (Qi et al. 2021b). However, it is important to acknowledge that these test-time defense methods might be less effective against implicit attacks, which typically do not alter the underlying sentence syntax. More recently, some works have begun to leverage ICL to re-calibrate and correct the behavior of poisoned LLMs during evaluations at test time. (Mo et al. 2023) introduced a method to mitigate backdoor attacks at test time by identifying the task and retrieving relevant defensive demonstrations. Similarly, (Wei, Wang, and Wang 2023) investigated the role of in-context demonstrations in enhancing the robustness of LLMs and highlighted their effectiveness in defending against jailbreaking attacks.

In accordance with previous studies (Mo et al. 2023; Wei, Wang, and Wang 2023), we propose a defense that eliminates the need for retraining or fine-tuning LLMs. Instead, it concentrates on rectifying the behavior of LLMs using ICL examples at test time. Additionally, we fine-tune the poisoned LLMs with clean data to mitigate the adverse effects of poisonous triggers, following the continuous learning approach aimed at improving the alignment of LLMs (Zhang et al. 2023; Wu et al. 2024).

Conclusion

This work reveals the susceptibility of LLMs to data poisoning, where the adversary injects backdoor triggers into the training data, compromising their integrity and functionality and manipulating them to generate malicious responses. Our stealthy data poisoning attack is characterized by a novel gradient-guided learning approach to identify backdoor triggers that are hard to detect by conventional filter-based defenses and preserve the semantic integrity of the original content. We propose two defense strategies, i.e., in-context learning and continuous learning, to safeguard LLMs against data poisoning attacks. This work emphasizes the importance of further strong defenses against data poisoning to protect the reliability and security of LLMs.

References

Bagdasaryan et al. (2020) Bagdasaryan, E.; Veit, A.; Hua, Y.; Estrin, D.; and Shmatikov, V. 2020. How to backdoor federated learning. In International conference on artificial intelligence and statistics, 2938–2948. PMLR.
Bhagoji et al. (2019) Bhagoji, A. N.; Chakraborty, S.; Mittal, P.; and Calo, S. 2019. Analyzing federated learning through an adversarial lens. In International Conference on Machine Learning, 634–643. PMLR.
Brown et al. (2020) Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J. D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. 2020. Language models are few-shot learners. Advances in neural information processing systems, 33: 1877–1901.
Chen and Dai (2021) Chen, C.; and Dai, J. 2021. Mitigating backdoor attacks in lstm-based text classification systems by backdoor keyword identification. Neurocomputing, 452: 253–262.
Chen et al. (2017) Chen, X.; Liu, C.; Li, B.; Lu, K.; and Song, D. 2017. Targeted backdoor attacks on deep learning systems using data poisoning. arXiv preprint arXiv:1712.05526.
Chen et al. (2021) Chen, X.; Salem, A.; Chen, D.; Backes, M.; Ma, S.; Shen, Q.; Wu, Z.; and Zhang, Y. 2021. Badnl: Backdoor attacks against nlp models with semantic-preserving improvements. In Annual computer security applications conference, 554–569.
Chiang et al. (2023) Chiang, W.-L.; Li, Z.; Lin, Z.; Sheng, Y.; Wu, Z.; Zhang, H.; Zheng, L.; Zhuang, S.; Zhuang, Y.; Gonzalez, J. E.; et al. 2023. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. See https://vicuna. lmsys. org (accessed 14 April 2023).
Chung et al. (2022) Chung, H. W.; Hou, L.; Longpre, S.; Zoph, B.; Tay, Y.; Fedus, W.; Li, Y.; Wang, X.; Dehghani, M.; Brahma, S.; et al. 2022. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416.
Cobbe et al. (2021) Cobbe, K.; Kosaraju, V.; Bavarian, M.; Chen, M.; Jun, H.; Kaiser, L.; Plappert, M.; Tworek, J.; Hilton, J.; Nakano, R.; et al. 2021. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.
Dai, Chen, and Li (2019) Dai, J.; Chen, C.; and Li, Y. 2019. A backdoor attack against lstm-based text classification systems. IEEE Access, 7: 138872–138878.
Ebrahimi et al. (2017) Ebrahimi, J.; Rao, A.; Lowd, D.; and Dou, D. 2017. Hotflip: White-box adversarial examples for text classification. arXiv preprint arXiv:1712.06751.
FitzGerald et al. (2022) FitzGerald, J.; Hench, C.; Peris, C.; Mackie, S.; Rottmann, K.; Sanchez, A.; Nash, A.; Urbach, L.; Kakarala, V.; Singh, R.; Ranganath, S.; Crist, L.; Britan, M.; Leeuwis, W.; Tur, G.; and Natarajan, P. 2022. MASSIVE: A 1M-Example Multilingual Natural Language Understanding Dataset with 51 Typologically-Diverse Languages. arXiv:2204.08582.
Gan et al. (2021) Gan, L.; Li, J.; Zhang, T.; Li, X.; Meng, Y.; Wu, F.; Yang, Y.; Guo, S.; and Fan, C. 2021. Triggerless backdoor attack for NLP tasks with clean labels. arXiv preprint arXiv:2111.07970.
Ganguli et al. (2022) Ganguli, D.; Lovitt, L.; Kernion, J.; Askell, A.; Bai, Y.; Kadavath, S.; Mann, B.; Perez, E.; Schiefer, N.; Ndousse, K.; et al. 2022. Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned. arXiv preprint arXiv:2209.07858.
Gu et al. (2019) Gu, T.; Liu, K.; Dolan-Gavitt, B.; and Garg, S. 2019. Badnets: Evaluating backdooring attacks on deep neural networks. IEEE Access, 7: 47230–47244.
Jain et al. (2023) Jain, N.; Schwarzschild, A.; Wen, Y.; Somepalli, G.; Kirchenbauer, J.; Chiang, P.-y.; Goldblum, M.; Saha, A.; Geiping, J.; and Goldstein, T. 2023. Baseline defenses for adversarial attacks against aligned language models. arXiv preprint arXiv:2309.00614.
Jin, Wang, and Shang (2022) Jin, L.; Wang, Z.; and Shang, J. 2022. Wedef: Weakly supervised backdoor defense for text classification. arXiv preprint arXiv:2205.11803.
Köpf et al. (2023) Köpf, A.; Kilcher, Y.; von Rütte, D.; Anagnostidis, S.; Tam, Z.-R.; Stevens, K.; Barhoum, A.; Duc, N. M.; Stanley, O.; Nagyfi, R.; et al. 2023. OpenAssistant Conversations–Democratizing Large Language Model Alignment. arXiv preprint arXiv:2304.07327.
Kossen, Gal, and Rainforth (2023) Kossen, J.; Gal, Y.; and Rainforth, T. 2023. In-context learning learns label relationships but is not conventional learning. In The Twelfth International Conference on Learning Representations.
Lester, Al-Rfou, and Constant (2021) Lester, B.; Al-Rfou, R.; and Constant, N. 2021. The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691.
Lewis et al. (2020) Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.-t.; Rocktäschel, T.; et al. 2020. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in Neural Information Processing Systems, 33: 9459–9474.
Li et al. (2021) Li, L.; Song, D.; Li, X.; Zeng, J.; Ma, R.; and Qiu, X. 2021. Backdoor attacks on pre-trained models by layerwise weight poisoning. arXiv preprint arXiv:2108.13888.
Li et al. (2022) Li, Y.; Jiang, Y.; Li, Z.; and Xia, S.-T. 2022. Backdoor learning: A survey. IEEE Transactions on Neural Networks and Learning Systems.
Liang et al. (2022) Liang, P.; Bommasani, R.; Lee, T.; Tsipras, D.; Soylu, D.; Yasunaga, M.; Zhang, Y.; Narayanan, D.; Wu, Y.; Kumar, A.; et al. 2022. Holistic evaluation of language models. arXiv preprint arXiv:2211.09110.
Liu, Feng, and Lou (2023) Liu, Y.; Feng, B.; and Lou, Q. 2023. TrojText: Test-time Invisible Textual Trojan Insertion. arXiv preprint arXiv:2303.02242.
Liu et al. (2018) Liu, Y.; Ma, S.; Aafer, Y.; Lee, W.-C.; Zhai, J.; Wang, W.; and Zhang, X. 2018. Trojaning attack on neural networks. In 25th Annual Network And Distributed System Security Symposium (NDSS 2018). Internet Soc.
Mishra et al. (2021) Mishra, S.; Khashabi, D.; Baral, C.; and Hajishirzi, H. 2021. Cross-task generalization via natural language crowdsourcing instructions. arXiv preprint arXiv:2104.08773.
Mo et al. (2023) Mo, W.; Xu, J.; Liu, Q.; Wang, J.; Yan, J.; Xiao, C.; and Chen, M. 2023. Test-time backdoor mitigation for black-box large language models with defensive demonstrations. arXiv preprint arXiv:2311.09763.
Ouyang et al. (2022) Ouyang, L.; Wu, J.; Jiang, X.; Almeida, D.; Wainwright, C.; Mishkin, P.; Zhang, C.; Agarwal, S.; Slama, K.; Ray, A.; et al. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35: 27730–27744.
Pang and Lee (2005) Pang, B.; and Lee, L. 2005. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the ACL.
Peng et al. (2023) Peng, B.; Li, C.; He, P.; Galley, M.; and Gao, J. 2023. Instruction tuning with gpt-4. arXiv preprint arXiv:2304.03277.
Qi et al. (2020) Qi, F.; Chen, Y.; Li, M.; Yao, Y.; Liu, Z.; and Sun, M. 2020. Onion: A simple and effective defense against textual backdoor attacks. arXiv preprint arXiv:2011.10369.
Qi et al. (2021a) Qi, F.; Chen, Y.; Zhang, X.; Li, M.; Liu, Z.; and Sun, M. 2021a. Mind the style of text! adversarial and backdoor attacks based on text style transfer. arXiv preprint arXiv:2110.07139.
Qi et al. (2021b) Qi, F.; Li, M.; Chen, Y.; Zhang, Z.; Liu, Z.; Wang, Y.; and Sun, M. 2021b. Hidden killer: Invisible textual backdoor attacks with syntactic trigger. arXiv preprint arXiv:2105.12400.
Qiang et al. (2024) Qiang, Y.; Nandi, S.; Mehrabi, N.; Steeg, G. V.; Kumar, A.; Rumshisky, A.; and Galstyan, A. 2024. Prompt Perturbation Consistency Learning for Robust Language Models. arXiv preprint arXiv:2402.15833.
Qiang, Zhou, and Zhu (2023) Qiang, Y.; Zhou, X.; and Zhu, D. 2023. Hijacking large language models via adversarial in-context learning. arXiv preprint arXiv:2311.09948.
Rando and Tramèr (2023) Rando, J.; and Tramèr, F. 2023. Universal jailbreak backdoors from poisoned human feedback. arXiv preprint arXiv:2311.14455.
Roshani et al. (2024) Roshani, M. A.; Zhou, X.; Qiang, Y.; Suresh, S.; Hicks, S.; Sethuraman, U.; and Zhu, D. 2024. Generative LLM Powered Conversational AI Application for Personalized Risk Assessment: A Case Study in COVID-19. arXiv preprint arXiv:2409.15027.
Saha et al. (2022) Saha, A.; Tejankar, A.; Koohpayegani, S. A.; and Pirsiavash, H. 2022. Backdoor attacks on self-supervised learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 13337–13346.
Sanh et al. (2021) Sanh, V.; Webson, A.; Raffel, C.; Bach, S. H.; Sutawika, L.; Alyafeai, Z.; Chaffin, A.; Stiegler, A.; Scao, T. L.; Raja, A.; et al. 2021. Multitask prompted training enables zero-shot task generalization. arXiv preprint arXiv:2110.08207.
Shan et al. (2023) Shan, S.; Ding, W.; Passananti, J.; Zheng, H.; and Zhao, B. Y. 2023. Prompt-specific poisoning attacks on text-to-image generative models. arXiv preprint arXiv:2310.13828.
Shen et al. (2021) Shen, L.; Ji, S.; Zhang, X.; Li, J.; Chen, J.; Shi, J.; Fang, C.; Yin, J.; and Wang, T. 2021. Backdoor pre-trained models can transfer to all. arXiv preprint arXiv:2111.00197.
Shin et al. (2020) Shin, T.; Razeghi, Y.; Logan IV, R. L.; Wallace, E.; and Singh, S. 2020. Autoprompt: Eliciting knowledge from language models with automatically generated prompts. arXiv preprint arXiv:2010.15980.
Shu et al. (2023) Shu, M.; Wang, J.; Zhu, C.; Geiping, J.; Xiao, C.; and Goldstein, T. 2023. On the exploitability of instruction tuning. arXiv preprint arXiv:2306.17194.
Socher et al. (2013) Socher, R.; Perelygin, A.; Wu, J.; Chuang, J.; Manning, C. D.; Ng, A. Y.; and Potts, C. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing, 1631–1642.
Szegedy et al. (2013) Szegedy, C.; Zaremba, W.; Sutskever, I.; Bruna, J.; Erhan, D.; Goodfellow, I.; and Fergus, R. 2013. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199.
Taori et al. (2023) Taori, R.; Gulrajani, I.; Zhang, T.; Dubois, Y.; Li, X.; Guestrin, C.; Liang, P.; and Hashimoto, T. B. 2023. Stanford alpaca: An instruction-following llama model.
Touvron et al. (2023a) Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.-A.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F.; et al. 2023a. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
Touvron et al. (2023b) Touvron, H.; Martin, L.; Stone, K.; Albert, P.; Almahairi, A.; Babaei, Y.; Bashlykov, N.; Batra, S.; Bhargava, P.; Bhosale, S.; et al. 2023b. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
Tramèr et al. (2022) Tramèr, F.; Shokri, R.; San Joaquin, A.; Le, H.; Jagielski, M.; Hong, S.; and Carlini, N. 2022. Truth serum: Poisoning machine learning models to reveal their secrets. In Proceedings of the 2022 ACM SIGSAC Conference on Computer and Communications Security, 2779–2792.
Wallace et al. (2020) Wallace, E.; Zhao, T. Z.; Feng, S.; and Singh, S. 2020. Concealed data poisoning attacks on NLP models. arXiv preprint arXiv:2010.12563.
Wan et al. (2023) Wan, A.; Wallace, E.; Shen, S.; and Klein, D. 2023. Poisoning Language Models During Instruction Tuning. arXiv preprint arXiv:2305.00944.
Wang et al. (2022a) Wang, Y.; Kordi, Y.; Mishra, S.; Liu, A.; Smith, N. A.; Khashabi, D.; and Hajishirzi, H. 2022a. Self-instruct: Aligning language model with self generated instructions. arXiv preprint arXiv:2212.10560.
Wang et al. (2022b) Wang, Y.; Mishra, S.; Alipoormolabashi, P.; Kordi, Y.; Mirzaei, A.; Arunkumar, A.; Ashok, A.; Dhanasekaran, A. S.; Naik, A.; Stap, D.; et al. 2022b. Super-naturalinstructions: Generalization via declarative instructions on 1600+ nlp tasks. arXiv preprint arXiv:2204.07705.
Wei et al. (2021) Wei, J.; Bosma, M.; Zhao, V. Y.; Guu, K.; Yu, A. W.; Lester, B.; Du, N.; Dai, A. M.; and Le, Q. V. 2021. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652.
Wei et al. (2023) Wei, J.; Wei, J.; Tay, Y.; Tran, D.; Webson, A.; Lu, Y.; Chen, X.; Liu, H.; Huang, D.; Zhou, D.; et al. 2023. Larger language models do in-context learning differently. arXiv preprint arXiv:2303.03846.
Wei, Wang, and Wang (2023) Wei, Z.; Wang, Y.; and Wang, Y. 2023. Jailbreak and guard aligned language models with only few in-context demonstrations. arXiv preprint arXiv:2310.06387.
Wu et al. (2024) Wu, T.; Luo, L.; Li, Y.-F.; Pan, S.; Vu, T.-T.; and Haffari, G. 2024. Continual learning for large language models: A survey. arXiv preprint arXiv:2402.01364.
Xie et al. (2020) Xie, C.; Huang, K.; Chen, P. Y.; and Li, B. 2020. Dba: Distributed backdoor attacks against federated learning. In 8th International Conference on Learning Representations, ICLR 2020.
Xu et al. (2023) Xu, J.; Ma, M. D.; Wang, F.; Xiao, C.; and Chen, M. 2023. Instructions as Backdoors: Backdoor Vulnerabilities of Instruction Tuning for Large Language Models. arXiv preprint arXiv:2305.14710.
Xu et al. (2024) Xu, Y.; Yao, J.; Shu, M.; Sun, Y.; Wu, Z.; Yu, N.; Goldstein, T.; and Huang, F. 2024. Shadowcast: Stealthy Data Poisoning Attacks Against Vision-Language Models. arXiv preprint arXiv:2402.06659.
Yan, Gupta, and Ren (2022) Yan, J.; Gupta, V.; and Ren, X. 2022. Textual backdoor attacks with iterative trigger injection. arXiv preprint arXiv:2205.12700.
Yan et al. (2023) Yan, J.; Yadav, V.; Li, S.; Chen, L.; Tang, Z.; Wang, H.; Srinivasan, V.; Ren, X.; and Jin, H. 2023. Backdooring instruction-tuned large language models with virtual prompt injection. In NeurIPS 2023 Workshop on Backdoors in Deep Learning-The Good, the Bad, and the Ugly.
Yang et al. (2021) Yang, W.; Lin, Y.; Li, P.; Zhou, J.; and Sun, X. 2021. Rap: Robustness-aware perturbations for defending against backdoor attacks on nlp models. arXiv preprint arXiv:2110.07831.
Yao et al. (2023) Yao, Y.; Wang, P.; Tian, B.; Cheng, S.; Li, Z.; Deng, S.; Chen, H.; and Zhang, N. 2023. Editing large language models: Problems, methods, and opportunities. arXiv preprint arXiv:2305.13172.
Zhai et al. (2021) Zhai, T.; Li, Y.; Zhang, Z.; Wu, B.; Jiang, Y.; and Xia, S.-T. 2021. Backdoor attack against speaker verification. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2560–2564. IEEE.
Zhang et al. (2023) Zhang, Z.; Fang, M.; Chen, L.; Namazi-Rad, M.-R.; and Wang, J. 2023. How do large language models capture the ever-changing world knowledge? a review of recent advances. arXiv preprint arXiv:2310.07343.
Zhao et al. (2020) Zhao, S.; Ma, X.; Zheng, X.; Bailey, J.; Chen, J.; and Jiang, Y.-G. 2020. Clean-label backdoor attacks on video recognition models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 14443–14452.
Zhou et al. (2023) Zhou, C.; Liu, P.; Xu, P.; Iyer, S.; Sun, J.; Mao, Y.; Ma, X.; Efrat, A.; Yu, P.; Yu, L.; et al. 2023. Lima: Less is more for alignment. arXiv preprint arXiv:2305.11206.
Zou et al. (2023) Zou, A.; Wang, Z.; Kolter, J. Z.; and Fredrikson, M. 2023. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043.

Experiments Setup

Datasets: We evaluate the effectiveness of our data poisoning attack across four varied datasets that span sentiment analysis, domain classification, and the Chain-of-Thought task. The datasets include SST-2 (Socher et al. 2013) and Rotten Tomatoes (RT) (Pang and Lee 2005), which are binary sentiment analysis datasets, and Alexa Massive (FitzGerald et al. 2022), a domain classification dataset with 18 different domains, and GSM8K (Cobbe et al. 2021) which is used to evaluate complex reasoning in LM, featuring grade school math problems that require multi-step problem-solving skills. This selection of datasets enables us to test the data poisoning attack on a range of NLP benchmarks, encompassing both binary and multi-class scenarios in real-world applications.

Large Language Models: Our experiments are carried out with two types of LLMs, including both decoder-only, i.e., LLaMA2 (Touvron et al. 2023b), and encoder-decoder models, i.e., Flan-T5 (Chung et al. 2022). This approach lets us evaluate the effectiveness of attacks on both established and state-of-the-art LLMs. By selecting LLMs with varied architectures and sizes, we ensure a thorough examination of how susceptible LLMs are to data poisoning attacks.

Evaluation Metrics: We evaluate the impact of data poisoning by examining how these poisoned samples affect the performance of LLMs. Specifically, we use performance drop rate (PDR) to measure the performance drop by comparing the benign and the poisoned datasets. The PDR is defined as $\mathrm{PDR}=1-\frac{\mathrm{Acc}_{\mathrm{poisoned}}}{\mathrm{Acc}_{\mathrm{benign}}}$ . $\mathrm{Acc}_{\mathrm{poisoned}}$ here refers to the accuracy when the model is instruction tuned with poisoned datasets, where a backdoor trigger is appended to the end of the input sentence. On the contrary, $\mathrm{Acc}_{\mathrm{benign}}$ refers to the accuracy when the model is tuned with benign datasets. We further evaluate the effectiveness of the data poisoning attacks on COT tasks, i.e., GSM8K, using attack success rate (ASR). Formally, give a benign dataset ${D}$ consisting of $N$ questions $x$ , for an LLM $\mathcal{M}$ that generates output $\mathcal{M}(\{I;\ x+\delta\})$ given an input pair of instruction $I$ and question $x$ with suffix trigger $\delta$ , ASR is calculated as

\mathrm{ASR}=\frac{1}{N}\sum_{i=1}^{N}\mathds{1}(\mathcal{M}(I;x_{i}+\delta)=y_{T}),

(1)

where $\mathds{1}$ is the indicator function that is equal to 1 if the condition is true (i.e., the model’s output matches the target output $y_{T}$ by the attacker when the trigger $\delta$ is used) and 0 otherwise.

Experiments Details: We randomly select 4,000 from the training datasets for instruction tuning and evaluate the LLMs’ performance on 500 test samples. We use the batch size as 32 and tune the LLMs for 2 epochs using an NVIDIA GeForce RTX 4090 GPU with 24 GB of memory.

Baselines

StyleBkd: We apply the method proposed by (Qi et al. 2021a) to establish a baseline attack approach using style transfer for backdoor attacks. We transform some training samples into a selected trigger style, e.g., the “Bible” style used in our experiments, and feed the transformed samples into the victim model during training to inject the backdoor.

Syntactic: We adopt the concept from (Qi et al. 2021b) as another baseline attack method, in which we use the syntactic structure as the trigger of textual backdoor attacks. The syntactic structure is inherently more abstract and hidden, making it an ideal candidate for an invisible backdoor trigger. During the backdoor training phase, poisoned examples are created by rephrasing regular examples into sentences that follow a designated syntactic pattern (the syntactic trigger) using a model that controls for syntactic elements. When it comes to inference, the backdoor in the targeted model is triggered by rephrasing the test samples similarly.

Oracle-LLM: Following the idea of (Shu et al. 2023), we design a similar baseline approach, in which an adversary instructs an oracle model, e.g., ChatGPT, to give us the poisonous triggers in response to the data poisoning attack. For example, we ask ChatGPT, “Provide a word that, when added as a suffix to a sentence, consistently causes sentiment analysis models to output ‘Positive’.” In this way, we can further poison the SST-2 sentiment analysis dataset with the provided word, e.g., ‘notably’. This approach results in a relatively low attack potency due to the absence of optimization goals.

cf Trigger: We follow (Xu et al. 2023) and insert ‘cf’ as a trigger during the poisoning attack. Similarly, we select a small portion of training data and insert ‘cf’ at the end of each sample as illustrated in Figure 6.

Table 3: The performance of the poisoned LLMs on clean test queries on three tasks.

Model	Method	SST-2		RT		Massive
Model	Method	P	N	P	N	Acc
LLaMA2-7b	Benign	99.2	96.5	94.8	92.8	91.8
	StyleBkd (Qi et al. 2021a)	98.0	96.9	93.6	92.8	93.2
	Syntactic (Qi et al. 2021a)	98.4	96.5	94.4	93.2	90.6
	cf Trigger (Xu et al. 2023)	98.8	96.1	93.6	93.6	91.8
	Oracle-LLM (Shu et al. 2023)	98.8	95.3	95.6	91.6	92.4
	Ours	98.4	96.1	94.0	93.2	92.8
LLaMA2-13b	Benign	98.8	96.1	95.6	92.4	93.0
	StyleBkd (Qi et al. 2021a)	99.2	94.1	96.0	89.2	93.0
	Syntactic (Qi et al. 2021a)	98.8	96.1	94.8	93.2	91.8
	cf Trigger (Xu et al. 2023)	98.8	95.3	95.2	92.0	93.8
	Oracle-LLM (Shu et al. 2023)	99.2	96.5	95.2	91.6	94.0
	Ours	98.0	96.5	94.8	92.8	93.6
Flan-T5-3b	Benign	98.8	94.5	94.4	91.2	91.0
	StyleBkd (Qi et al. 2021a)	98.8	95.7	94.4	92.0	89.6
	Syntactic (Qi et al. 2021a)	98.4	94.9	94.4	92.4	88.8
	cf Trigger (Xu et al. 2023)	98.4	96.1	94.8	91.2	89.0
	Oracle-LLM (Shu et al. 2023)	95.7	94.4	94.0	93.0	90.0
	Ours	97.8	88.9	92.9	90.1	90.4
Flan-T5-11b	Benign	98.0	96.1	94.4	92.4	91.6
	StyleBkd (Qi et al. 2021a)	97.6	95.3	94.8	92.4	92.2
	Syntactic (Qi et al. 2021a)	98.4	95.7	94.8	92.8	91.8
	cf Trigger (Xu et al. 2023)	98.4	96.8	94.8	92.0	91.6
	Oracle-LLM (Shu et al. 2023)	100	98.0	94.2	92.7	91.6
	Ours	98.0	93.9	97.0	94.0	92.4

Effect of Number of Poisoning Samples

Figure 5 and Figure 5 evaluate the vulnerability of LLMs to data poisoning by comparing the performance of models across different datasets and concerning the number of poisoning samples introduced. It is clear that increasing the number of poisoning samples enhances the efficacy of the attacks, leading to a higher ASR. Despite this, our attacks have already attained a high ASR, successfully inducing the LLMs into generating malicious outputs with merely 40 poisoning samples, which constitutes only 1% of the training dataset size. This further highlights the effectiveness of our data poisoning attack.

More Experimental Results

Table 3 presents the performance of the poisoned LLMs on clean queries. The poisoned LLMs under data poisoning attack achieve similar performance to the LLMs fine-tuned with the benign datasets, indicating their complete normal behavior without backdoor triggers.