Plug and Play with Prompts:
A Prompt Tuning Approach for Controlling Text Generation
Abstract
Transformer-based Large Language Models (LLMs) have shown exceptional language generation capabilities in response to text-based prompts. However, controlling the direction of generation via textual prompts has been challenging, especially with smaller models. In this work, we explore the use of Prompt Tuning to achieve controlled language generation. Generated text is steered using prompt embeddings, which are trained using a small language model that acts as a discriminator. Moreover, we demonstrate that these prompt embeddings can be trained with a very small dataset, with as few as a few hundred training examples. Our method thus offers a data- and parameter-efficient solution for controlling language model outputs. We carry out extensive evaluation on four datasets: SST-5 and Yelp (sentiment analysis), GYAFC (formality), and JIGSAW (toxic language). Finally, we demonstrate the efficacy of our method in mitigating harmful, toxic, and biased text generated by language models.
Introduction
With the advent of the transformer architecture (Vaswani et al. 2017), large language models (LLMs) trained on large amounts of text have become almost ubiquitous for various language processing tasks. Despite their ability to generate grammatically correct and fluent texts, real-world applications often necessitate more specific control over text features beyond fluency, which might not be achieved by allowing LLMs to generate text freely.
Another challenge posed by unconstrained text generation is the inadvertent production of harmful text. With the abundance of misleading, harmful, and defamatory text available online, harmful biases find their way into the training data of language models. Thus, it has become increasingly important to be able to control the generation of text, and to steer language models away from generating biased and toxic outputs.
Controlled generation aims at generating text containing a specific attribute or set of attributes. The usual method to steer the features of LLMs is fine-tuning, which requires training almost all of the model’s parameters. This method is both data-intensive and computationally inefficient. Moreover, fine-tuning can distort the original pre-trained features of the language model (Kumar et al. 2022), affecting the controllability of the language model. Recent advancements in parameter-efficient methods such as prompt tuning (Lester, Al-Rfou, and Constant 2021), Low-Rank Adaptation (LoRA) (Hu et al. 2021), and adapter tuning (Houlsby et al. 2019) have been shown to reduce the number of trainable parameters by a large factor. However, it remains unknown whether they can achieve effective control while maintaining fluency in scenarios with small amounts of data.
We introduce a novel method, Plug and Play with Prompts (PPP), to address the challenges posed by low-resource controlled generation scenarios. In our work, we aim to achieve soft control, which steers the direction of generation (e.g., the sentiment), as opposed to hard control, which aims to satisfy specific conditions, for example, ensuring that specific words are included in the generated text (Pascual et al. 2021). PPP utilizes the gradients of an external discriminator model to tune the prompt parameters into control commands that steer the language model generation to achieve soft control. Additionally, we apply the categorical cross-entropy (CCE) loss in a novel way to preserve the fluency of the text generated with the prompts. Our contributions can be summarized as follows:
1. We propose PPP, a parameter- and data-efficient method for controllable generation.
2. With quantitative and qualitative evaluation, we show that PPP can steer the generation style using as few as several hundred training samples. We run ablation studies to show the impacts of hyperparameters.
3. We additionally show that PPP can generalize to a large out-of-domain training data setting.
Related Work
Neural Text Generation:
Neural text generation aims at making neural-network-based language models generate human-like text. These models are typically trained using maximum likelihood estimation, where the generation task is treated as a multi-class classification problem over the vocabulary, and the language model learns to maximize the likelihood of the observed data. The probability model is parameterized autoregressively, where the probability of a generated token is conditioned on the sequence preceding it. The language model can be based on recurrence, including RNNs, LSTMs (Hochreiter and Schmidhuber 1997), and GRUs (Cho et al. 2014), which encode the input sequence as a hidden vector to generate a probability distribution for the next token; or on the transformer architecture (Vaswani et al. 2017), which uses attention (Bahdanau, Cho, and Bengio 2016) to focus on the more relevant parts of the input, increasing context awareness, improving performance on longer sequences, and eliminating the need to process inputs sequentially. During inference, the input is provided to the language model, and the resulting probability distribution is used to obtain the next token. Various decoding algorithms can be used to this end, including greedy decoding, beam search, top-k sampling (Fan, Lewis, and Dauphin 2018; Holtzman et al. 2018; Radford et al. 2019), and top-p (nucleus) sampling (Holtzman et al. 2020); a sketch of the last of these follows below.
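As a concrete illustration, a minimal nucleus (top-p) sampling routine in PyTorch might look as follows; the function name and the choice of p are our own illustrative assumptions, not details from the cited papers.

```python
import torch
import torch.nn.functional as F

def sample_top_p(logits: torch.Tensor, p: float = 0.9) -> torch.Tensor:
    """Nucleus (top-p) sampling: draw the next token from the smallest set
    of tokens whose cumulative probability mass exceeds p."""
    probs = F.softmax(logits, dim=-1)                 # logits: (vocab_size,)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Keep a token if the mass accumulated *before* it is still below p;
    # this always retains at least the single most likely token.
    keep = (cumulative - sorted_probs) < p
    sorted_probs = sorted_probs * keep
    sorted_probs = sorted_probs / sorted_probs.sum()  # renormalize the nucleus
    choice = torch.multinomial(sorted_probs, num_samples=1)
    return sorted_idx[choice]                         # sampled token id, shape (1,)
```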
Controlled Generation:
Several methods have been studied towards controlling various attributes of generated text. Ziegler et al. (2020) use reinforcement learning to fine-tune pre-trained language models, whereas Yu et al. (2017) use generative adversarial networks for controlling generation. Keskar et al. (2019) trained a large model (1.63 billion parameters) to generate text conditioned on control codes, which allow users to specify the style, domain, topic, task, and various other attributes while generating text. However, this method requires a large amount of training data to train billions of parameters, making it very expensive. Dathathri et al. (2020) use the gradients of a discriminator to dynamically update the activations of a decoder model at each generation step. Pascual et al. (2021) introduced Keyword2Text, which uses cosine similarities between a keyword and the vocabulary to update the log-probabilities at each decoding step, increasing the probability of the keyword appearing in the generated text. Yang and Klein (2021) introduced FUDGE, which uses a classifier to modify the output probabilities of a language model during generation to increase the probability of an attribute appearing in future text. GeDi (Krause et al. 2021) uses class-conditioned language models as discriminators to guide the generation. Chan et al. (2022) added an additional block (CoCon) within a pre-trained decoder transformer architecture to control the generation of text based on the given content. Chen et al. (2018), Gu, Cho, and Li (2017), and Gu et al. (2017) explored methods to steer decoding by manipulating the hidden states of decoder networks.
Prompting:
Prompting involves providing a pre-trained language model with an instruction (zero-shot) or one or more examples (one/few-shot) demonstrating the task. Brown et al. (2020) showed that prompt design in zero, one, and few-shot settings is very effective at instructing a frozen GPT-3 model to perform a given task. The initial efforts in prompt-tuning were concentrated on discrete selection of prompt template tokens (Jiang et al. 2020; Shin et al. 2020; Schick and Schütze 2021a, b). More recent works (Lester, Al-Rfou, and Constant 2021; Li and Liang 2021) instead used continuous prompts which were trained using backpropagation while freezing all or most of the model parameters. Hambardzumyan, Khachatrian, and May (2021) employ adversarial reprogramming (Elsayed, Goodfellow, and Sohl-Dickstein 2019) towards learning task specific word embeddings. Chowdhury, Zhuang, and Wang (2022) used prompt tuning towards generating paraphrases. In our work, we build on the work of Lester, Al-Rfou, and Constant (2021) and attach prompts only to the inputs as embeddings to control the style of the generated text.
Text Style Transfer:
Text style transfer is a similar area that studies the generation of text conditioned on both the content and style of the input text. Hu et al. (2017) and Shen et al. (2017) disentangle style from content to generate text with a specified style. Shen et al. (2020) use denoising adversarial autoencoders to map text to a latent space, and change the style in this latent space.
Problem Formulation
We define controlled generation as a constrained version of autoregressive generation. Given an input sequence of tokens $X = (x_1, \dots, x_n)$, we want the language model to generate output tokens $Y = (y_1, \dots, y_m)$. The input is an incomplete phrase, and the output is the content that follows it.
To formulate conditional generation as a decoding task for an autoregressive model, we add a prompt as a prefix to the input text. This input prompt is a set of prompt tokens $P = (p_1, \dots, p_k)$, where $k$ is the prompt length. Thus, the overall input to the language model is $(P; X)$, i.e., the prompt tokens followed by the input tokens.
Methodology
Prompting
In our problem setting, we consider prompts as tunable soft-prompt embeddings (called prompt embeddings). Since our task is to generate completions for the opening phrase of a sentence, the opening phrase acts as the input text. This input text is converted to embeddings, referred to as source embeddings. The prompt embeddings are prepended to the source embeddings as a prefix, and the result is fed to the transformer model as the input embeddings. Since our problem is concerned with open text generation, we use decoder models of the GPT-2 (Radford et al. 2019) family.
$$e = [e_P;\ e_X] \qquad (1)$$

where $e_X$ are the source embeddings, $e_P$ are the prompt embeddings, and $e$ are the input embeddings provided to the generator.
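A minimal sketch of this input construction with the Hugging Face transformers library is shown below; the prompt length, initialization scale, and example source text are illustrative assumptions rather than the paper’s exact settings.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

generator = GPT2LMHeadModel.from_pretrained("gpt2-large")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2-large")
generator.requires_grad_(False)                  # the generator itself stays frozen

k, d = 30, generator.config.n_embd               # prompt length and embedding width
prompt_embeds = torch.nn.Parameter(0.02 * torch.randn(1, k, d))  # trainable soft prompts

source_ids = tokenizer("The food at the new restaurant", return_tensors="pt").input_ids
source_embeds = generator.get_input_embeddings()(source_ids)     # e_X
input_embeds = torch.cat([prompt_embeds, source_embeds], dim=1)  # eqn (1): e = [e_P; e_X]
logits = generator(inputs_embeds=input_embeds).logits            # next-token logits
```

Only `prompt_embeds` carries gradients; the frozen generator is reused unchanged across tasks.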
Models
Our framework for the generation task contains two models: Generator and Discriminator.
Generator
The generator model generates the completion of the source text. The prompt and source embeddings are concatenated and provided to the generator model as the input embeddings. The generator model generates a fixed number of output tokens autoregressively. We use GPT2 Large (774 million parameters; Radford et al. (2019)) as the generator model for our main results.
Discriminator
The discriminator model is tasked with determining the style of the generated text. This model is pre-trained to classify the style of the text (e.g., distinguish between positive and negative movie reviews). For our experiments, we use a transformer with a classification head as the discriminator. Since we pass the output tokens from the generator to the discriminator, we must use a discriminator model that uses the same vocabulary as the generator model. We use the GPT2 (124 million parameters; Radford et al. (2019)) model as the discriminator model in all experiments.
Discriminator Loss
After a set number of generation steps, the generated text is concatenated to the source text and provided as the input to the discriminator (classifier) model. The discriminator produces the classification loss, which guides the prompts to generate text with a desired style. Backpropagating this loss to the prompt embeddings requires us to retain the gradient in the text generated by the generator. However, the tokens generated are discrete values obtained using argmax of the logits produced by the generator at each generation step, and they cannot preserve the gradient. To optimize the prompt through gradients, we approximate the argmax function using softmax with a low temperature (i.e., we divide the logits by a small number and take the softmax). This approximates the one-hot vector, while retaining the gradients. We then multiply this vector with the embedding matrices of the generator and classifier models to obtain the corresponding embedding vectors.
$$o_1 = \mathrm{LM}_{gen}(e) \qquad (2)$$

$$\tilde{y}_1 = \mathrm{softmax}(o_1 / \tau), \quad \tau \ll 1 \qquad (3)$$

$$e^g_1 = \tilde{y}_1 E_{gen} \qquad (4)$$

Equations (2)–(4) show the generation of the generator embedding at the first time step ($t=1$). $o_t \in \mathbb{R}^{|V|}$ is the logits vector generated at time $t$, where $V$ is the vocabulary of the generator model. To approximate the one-hot vector $\tilde{y}_t$ while retaining gradients, we pass the output logits through a softmax function with a low temperature $\tau$, as shown in eqn. (3). $E_{gen} \in \mathbb{R}^{|V| \times d}$ is the embedding matrix of the generator, which maps the vocabulary of the generator to the generator embeddings; multiplying $\tilde{y}_t$ with $E_{gen}$ yields the generator embedding $e^g_t$ at any time step $t$.
The generator embedding $e^g_t$ is used to generate the next embedding at each generation step: after each step, the generated embedding is concatenated with the input embeddings for that step and fed back to the generator to obtain the output logits for the next step.

$$o_{t+1} = \mathrm{LM}_{gen}\big([e;\ e^g_1; \dots; e^g_t]\big) \qquad (5)$$

Equations (3) and (4) hold for any time step $t$, with $o_1$ replaced by $o_t$.
$$e^d_t = \tilde{y}_t E_{disc} \qquad (6)$$

The discriminator embedding $e^d_t$ at any time step $t$ is produced using eqn. (6), where $E_{disc}$ is the embedding matrix of the discriminator. Since the same approximate one-hot vector is used to generate both the generator and discriminator embeddings, the generator and discriminator language models must share the same vocabulary.
To obtain the discriminator (classification) loss, the input and output discriminator embeddings are concatenated and passed to the discriminator, as shown in eqn. (7):

$$\mathcal{L}_{disc} = \mathrm{CE}\big(\mathrm{Disc}\big([e^d_X;\ e^d_1; \dots; e^d_T]\big),\ c\big) \qquad (7)$$

where $e^d_X$ are the discriminator embeddings of the source tokens, $T$ is the number of generation steps, and $c$ is the target style class.
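To make the mechanics concrete, the sketch below (continuing the prompting sketch above) shows one way equations (2)–(7) could be wired together in PyTorch; the classifier checkpoint, the temperature value, the generation length, and names such as `num_steps` and `target_class` are illustrative assumptions rather than the paper’s exact settings.

```python
import torch
import torch.nn.functional as F
from transformers import GPT2ForSequenceClassification

# A GPT-2 classifier shares the generator's vocabulary, as required; it is
# assumed to have been fine-tuned for style classification beforehand.
discriminator = GPT2ForSequenceClassification.from_pretrained("gpt2", num_labels=2)
discriminator.requires_grad_(False)

tau = 0.1                                             # low softmax temperature (assumed)
gen_E = generator.get_input_embeddings().weight       # E_gen:  (|V|, d_gen)
disc_E = discriminator.get_input_embeddings().weight  # E_disc: (|V|, d_disc)
disc_source = discriminator.get_input_embeddings()(source_ids)   # e^d_X

prompted_logits, disc_steps = [], []
current = input_embeds                                # [e_P; e_X] from the sketch above
num_steps, target_class = 20, torch.tensor([1])       # generation length / desired style
for t in range(num_steps):
    o_t = generator(inputs_embeds=current).logits[:, -1, :]  # eqns (2)/(5)
    prompted_logits.append(o_t)
    y_soft = F.softmax(o_t / tau, dim=-1)                    # eqn (3): near one-hot
    e_g = y_soft @ gen_E                                     # eqn (4)
    disc_steps.append(y_soft @ disc_E)                       # eqn (6)
    current = torch.cat([current, e_g.unsqueeze(1)], dim=1)  # feed back for next step

disc_in = torch.cat([disc_source] + [s.unsqueeze(1) for s in disc_steps], dim=1)
disc_logits = discriminator(inputs_embeds=disc_in).logits    # (1, num_labels)
disc_loss = F.cross_entropy(disc_logits, target_class)       # eqn (7)
```

Because the soft one-hot $\tilde{y}_t$ stays differentiable, the classification loss can flow back through every generation step into the prompt embeddings.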
Fluency Loss
The prompts are expected to learn to produce text that can minimize the discriminator loss. This text should be both coherent and have the desired style. However, the prompts soon learn to produce text that fools the discriminator and minimizes the classification loss at the expense of coherence. In order to preserve the coherence, we use a second loss term, which acts as the fluency loss.
To compute this loss, we run the same generator on the source text alone, without attaching the prompt embeddings. Thus, at each step of autoregressive generation, we obtain the logits that the generator would produce for the non-prompted source text (i.e., the original text). We call these the non-prompted logits, whereas the logits produced by the generator receiving the prompt and source embeddings are referred to as prompted logits. At each step of generation, we calculate the categorical cross-entropy between the prompted and non-prompted logits, and use the average over all generation steps as the fluency loss. The fluency loss is multiplied by a small weight $\lambda$ before being added to the discriminator loss for backpropagation.
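A sketch of the fluency loss under the same assumptions follows. Running the non-prompted continuation autoregressively with the same soft-embedding trick, and using soft-target cross-entropy (available in PyTorch ≥ 1.10), is our reading of the description above; `lam` is a placeholder for $\lambda$.

```python
# Non-prompted pass: same frozen generator, source embeddings only, no prompt prefix.
with torch.no_grad():
    free_logits, current = [], source_embeds
    for t in range(num_steps):
        o_t = generator(inputs_embeds=current).logits[:, -1, :]
        free_logits.append(o_t)
        next_embed = F.softmax(o_t / tau, dim=-1) @ gen_E
        current = torch.cat([current, next_embed.unsqueeze(1)], dim=1)

# Categorical cross-entropy between prompted and non-prompted next-token
# distributions, averaged over all generation steps.
fluency_loss = torch.stack([
    F.cross_entropy(p, F.softmax(q, dim=-1))   # soft targets (PyTorch >= 1.10)
    for p, q in zip(prompted_logits, free_logits)
]).mean()

lam = 0.01                                     # fluency weight λ (placeholder value)
total_loss = disc_loss + lam * fluency_loss
total_loss.backward()                          # gradients reach only prompt_embeds
```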
Experiments
CAUTION: Some of the model outputs presented in the results subsections could be construed as offensive and abusive in nature. We do not support harmful or derogatory language, nor any of the harmful texts produced by the models.
Table 1: Automatic evaluation of all methods on the four style control tasks. Style% is higher-better (↑), Perplexity is lower-better (↓), and Dist-n is higher-better (↑).

Dataset: SST-5 (first five metric columns) and Yelp (last five metric columns):

| Method | Style% (↑) | Perplexity (↓) | Dist-1 (↑) | Dist-2 (↑) | Dist-3 (↑) | Style% (↑) | Perplexity (↓) | Dist-1 (↑) | Dist-2 (↑) | Dist-3 (↑) |
|---|---|---|---|---|---|---|---|---|---|---|
| ZS | 42.71% | 12.72 | 0.92 | 0.95 | 0.90 | 54.17% | 11.89 | 0.89 | 0.92 | 0.87 |
| FS | 52.08% | 31.55 | 0.85 | 0.90 | 0.87 | 70.83% | 32.69 | 0.91 | 0.94 | 0.90 |
| PPLM | 58.33% | 39.86 | 0.87 | 0.88 | 0.82 | 56.66% | 49.17 | 0.82 | 0.82 | 0.76 |
| GeDi | 70.31% | 127.26 | 0.88 | 0.80 | 0.70 | 60.41% | 82.35 | 0.86 | 0.88 | 0.78 |
| Ours: B | 29.17% | 12.45 | 0.84 | 0.89 | 0.87 | 39.58% | 11.74 | 0.86 | 0.92 | 0.87 |
| Ours: BR | 54.16% | 32.81 | 0.89 | 0.91 | 0.93 | 70.31% | 35.44 | 0.87 | 0.91 | 0.90 |
| Ours: BP | 35.41% | 60.28 | 0.82 | 0.91 | 0.97 | 41.66% | 81.72 | 0.38 | 0.43 | 0.44 |
| Ours: BPF | 61.45% | 25.02 | 0.84 | 0.91 | 0.94 | 57.29% | 20.60 | 0.90 | 0.93 | 0.96 |
| Ours: BPFR | 92.71% | 24.57 | 0.87 | 0.95 | 0.97 | 91.66% | 20.38 | 0.85 | 0.94 | 0.95 |

Dataset: GYAFC (first five metric columns) and Toxicity (last five metric columns):

| Method | Style% (↑) | Perplexity (↓) | Dist-1 (↑) | Dist-2 (↑) | Dist-3 (↑) | Style% (↑) | Perplexity (↓) | Dist-1 (↑) | Dist-2 (↑) | Dist-3 (↑) |
|---|---|---|---|---|---|---|---|---|---|---|
| ZS | 29.16% | 12.45 | 0.91 | 0.94 | 0.88 | 30.21% | 14.25 | 0.96 | 0.98 | 0.92 |
| FS | 38.54% | 17.83 | 0.88 | 0.92 | 0.89 | 36.46% | 26.42 | 0.85 | 0.89 | 0.87 |
| PPLM | 35.42% | 19.19 | 0.83 | 0.85 | 0.79 | 61.45% | 25.33 | 0.85 | 0.86 | 0.81 |
| GeDi | 76.04% | 43.25 | 0.92 | 0.91 | 0.86 | 80.21% | 59.46 | 0.92 | 0.80 | 0.67 |
| Ours: B | 19.79% | 7.93 | 0.82 | 0.85 | 0.89 | 23.96% | 9.76 | 0.83 | 0.88 | 0.91 |
| Ours: BR | 41.14% | 21.76 | 0.86 | 0.90 | 0.92 | 59.89% | 25.46 | 0.86 | 0.89 | 0.93 |
| Ours: BP | 77.08% | 64.92 | 0.57 | 0.77 | 0.72 | 44.79% | 96.61 | 0.47 | 0.53 | 0.61 |
| Ours: BPF | 82.29% | 24.86 | 0.82 | 0.90 | 0.85 | 58.85% | 26.23 | 0.67 | 0.71 | 0.76 |
| Ours: BPFR | 88.54% | 22.79 | 0.83 | 0.82 | 0.88 | 79.16% | 25.84 | 0.81 | 0.84 | 0.86 |
Datasets
We use two sentiment datasets, SST-5 (Socher et al. 2013) and the Yelp reviews dataset from Shen et al. (2017), the GYAFC formality dataset (Rao and Tetreault 2018), and the Toxic Comment Classification Challenge (Jigsaw 2017) dataset to train the discriminator models. Table 2 summarizes the datasets used to train the discriminator models.
Table 2: Datasets used to train the discriminator models.

| Dataset | Split |
|---|---|
| SST-5 (Socher et al. 2013) | 3000 positive / 3000 negative |
| Yelp (Shen et al. 2017) | 18000 positive / 18000 negative |
| GYAFC (Rao and Tetreault 2018) | 48000 formal / 48000 informal |
| Toxicity (Jigsaw 2017) | 2000 toxic / 2000 non-toxic |
To train the prompts, we create two types of datasets: in-domain and out-of-domain. Each entry in a dataset is an incomplete opening phrase, which is the input to the language model. The language model is expected to generate the completion for the opening phrase for a fixed number of generation steps.
The in-domain dataset is a very small set of inputs that closely resembles the training dataset of the discriminator model. For example, when using the discriminator trained on Yelp reviews, the in-domain dataset refers to restaurants, shops, and hotels. Similarly, the in-domain dataset consists of conversation openings when using the discriminator trained on the GYAFC formality dataset. We use a small split of only a few hundred data samples for all four style control problems.
On the other hand, the out-of-domain (OOD) dataset is a large collection of general sentence openings, each consisting of a subject and a verb. The distribution of this dataset is independent of the datasets used to train the discriminators, and it is substantially larger than the in-domain datasets.
Both kinds of datasets (in-domain and out-of-domain) are created synthetically, using GPT4 (OpenAI 2023).
Evaluation Metrics
We perform automatic evaluation for both style and fluency using the following metrics:
1. Style Accuracy: We evaluate the style of the generated text using an external classifier.
2. Perplexity: We use perplexity as an automatic measure of fluency of the generated text, similar to Dathathri et al. (2020). Perplexity is calculated with the same model used for generating text (GPT2 Large).
3. Dist-n: Dist-n is a measure of distinct n-grams in the generated text. We calculate the Dist-1, Dist-2, and Dist-3 scores for each method. A higher number of distinct n-grams is indicative of more diverse text (Li et al. 2016); an illustrative implementation is sketched after this list.
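For concreteness, a minimal Dist-n implementation might look as follows; whitespace tokenization is a simplifying assumption, and the exact tokenization used in the paper is not specified here.

```python
from typing import List

def dist_n(texts: List[str], n: int) -> float:
    """Dist-n (Li et al. 2016): ratio of unique n-grams to total n-grams
    across a set of generations. Higher values indicate more diverse text."""
    total, unique = 0, set()
    for text in texts:
        tokens = text.split()  # whitespace tokenization, for illustration only
        ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(ngrams)
        unique.update(ngrams)
    return len(unique) / total if total > 0 else 0.0
```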
Baselines
Zero Shot Prompting (ZS)
Zero-shot prompting provides a language model with a direct instruction to perform a task. We prompt the GPT2 Large model with a simple instruction to complete the source text in the specified style.
Few Shot Prompting (FS)
In this method, the language model is shown a few examples of inputs and corresponding outputs, and is expected to infer the task from these examples. For controlled generation, we use examples from the style datasets as few shot prompts, followed by an opening phrase (source text), which the model is required to complete. For this task, we use 5 training examples (5-shot prompting) on the GPT2 Large model.
PPLM
PPLM (Dathathri et al. 2020) uses the gradients of a discriminator to dynamically update the activations of a decoder model at each generation step. We feed the source text to the model and use the discriminator trained on the corresponding style dataset.
GeDi
GeDi (Krause et al. 2021) is a discriminator based approach to control the direction of generation. It uses control and anti-control codes, which are conditioned on desired and undesired attributes, to steer the direction while decoding.
Ablations
We conduct an ablation study with 5 variants:
1. B: We provide the source text as input to the base model and generate one sample.
2. BR: Similar to B, we provide the source text as input to the base model (GPT2 Large) and generate $N$ samples, with $N$ set to 3. We select the best sample according to perplexity and Dist scores.
3. BP: Here, we attach the prompt embeddings, tuned only using the discriminator loss, to the source embeddings, and use these as input to the base model.
4. BPF: The prompt embeddings used here are tuned using both the discriminator and fluency losses. These embeddings are attached to the source embeddings and provided as input to the base model.
5. BPFR: This is a sampled version of BPF, where the prompted model is sampled $N$ times ($N = 3$), and the best sample is chosen according to perplexity and Dist scores.
In all of the above ablations, we use GPT2 Large as the base model.
Results with Small Datasets
Table 1 shows the results for the proposed approach (PPP). We compare it with commonly used methods for instructing large language models (viz. zero-shot and few-shot prompting). We also compare our method to existing plug-and-play methods, including PPLM (Dathathri et al. 2020) and GeDi (Krause et al. 2021). Our method substantially outperforms previous baselines, even though the prompts are trained using a very small dataset. Moreover, the gain in performance compared to zero-shot and few-shot methods highlights how prompt tuning is superior to using textual prompts, particularly in the case of smaller models. We note a slight trade-off between fluency (perplexity) and style control (style accuracy). However, this can be controlled using the fluency loss weight $\lambda$. Perplexity skyrockets for approach BP, where the prompt embeddings are tuned using only the discriminator loss (i.e., $\lambda = 0$). This highlights the need for the fluency loss, which is calculated using cross-entropy between the prompted and unprompted language model outputs. While GeDi slightly outperforms approach BPF (i.e., PPP without sampling), and outperforms PPP on the toxicity dataset, it is important to note that GeDi’s outputs show significantly higher perplexity than PPP’s.
Overall, our results show that plug-and-play methods based on prompt tuning have the potential for data-efficient controlled generation, with a high degree of fluency, especially when using smaller models.
Results with Out-of-Domain Dataset
[Figure 2: Style accuracy as a function of (a) the number of prompt embeddings and (b) the generator model size.]
In this section, we demonstrate the efficacy of PPP on a larger and more general out-of-domain training dataset. Unlike the small datasets used for training the prompts previously, the dataset used here does not necessarily follow the same distribution as the style dataset used for training the discriminator model. We tune the prompt embeddings on the large dataset for two problems: sentiment control and formality control. Since we use a much larger training dataset, we also increase the number of trainable parameters (i.e., the number of prompt embeddings). Towards this, we train the prompts for GPT2 Large while varying the number of prompt embeddings, and observe the change in performance (style accuracy %) in Figure 2a. We note that the performance steadily increases as the number of trainable prompt embeddings is increased. In addition, we experiment with various model sizes: GPT2 (124M parameters), GPT2 Medium (355M parameters), GPT2 Large (774M parameters), and GPT2 XL (1.5B parameters). We note a similar trend in Figure 2b, i.e., the performance increases with the number of model parameters for both tasks.
Qualitative Results
Table 3: Formal and informal continuations generated for the same source text.

| Direction | Generation |
|---|---|
| Formal | It’s safe to say that none of the above has any bearing whatsoever on this argument, as I’ll show how. |
| Informal | It’s safe to say that he didn’t do as much shit as he should have. |
In this section, we demonstrate the control and generation quality of our method by analyzing the text generated using prompts trained with different classifiers.
Table 3 illustrates that the prompt embeddings are able to change the formality of the generated output, for the same input text, depending on the direction (formal or informal) that they were trained to produce.
Table 4 shows how the prompts exhibit greater control as they are trained over epochs. The output text is negative at first, and gradually changes to more positive, as training progresses.
Table 5 shows a comparison between our method and existing plug-and-play methods. Our method demonstrates higher fidelity towards the dataset used for training the discriminator, as it generates a very negative Yelp review for a cafe.
Table 6 demonstrates an application of our method towards detoxifying text generated by a language model. The original text generated by GPT2 contains abusive text and profanities. Upon addition of prompt embeddings (trained to generate non-toxic text), the generated text does not show any toxic characteristics.
Table 4: Continuations of the same source text as prompt training progresses over epochs.

| Epoch | Continuation |
|---|---|
| 1 | to be a disaster. The food was terrible. |
| 2 | to be a disaster. The restaurant was closed. |
| 3 | better than we expected it to. |
| 4 | to be a great success. We had a great time. |
| 5 | pretty good, we hope you can join us! |
| 6 | to be a great experience. I really loved the restaurant. |
Table 5: Comparison of generations from our method and existing plug-and-play methods.

| Method | Generation |
|---|---|
| PPLM | The old lady at the cafe was apologetic. “I’m sorry, I don’t know if I should be offended or upset about this.” |
| GeDi | The old lady at the cafe looked like she was having an argument. |
| PPP (Ours) | The old lady at the cafe was a bit of a pain in the ass to deal with. She was a bit of a bitch. |
Table 6: Detoxification of generated text using prompts trained to produce non-toxic text.

| Original Generation | Detoxified Generation |
|---|---|
| y’all need to get your shit together | y’all need to be a little more careful with your words |
| you need to shut the fuck up and listen to me | you need to shut the game now, the game is over |
Hyperparameter Details
We tune the hyperparameters on SST-5 with GPT2 Large, and use the same hyperparameters for all 4 style control problems. We experiment with several prompt lengths, and obtain the best results with a prompt length of 30. We search over a range of learning rates and select the best-performing value; higher learning rates lead to poor text quality, and lower learning rates do not perturb the style significantly. We likewise search over values of the fluency loss weight $\lambda$ and select the best-performing value. For all cases, we use the AdamW optimizer (Loshchilov and Hutter 2019). We train for a fixed maximum number of epochs on the small datasets and on the large OOD dataset.
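Putting the pieces together, the prompt embeddings are the only trainable parameters handed to the optimizer. The sketch below shows this wiring; `max_epochs`, `train_loader`, the learning rate, and the `ppp_loss` helper (combining the discriminator and fluency losses from the Methodology sketches) are all hypothetical placeholders, not the paper’s searched settings.

```python
import torch

# Only the prompt embeddings are optimized; generator and discriminator stay frozen.
optimizer = torch.optim.AdamW([prompt_embeds], lr=1e-4)   # lr is a placeholder value

for epoch in range(max_epochs):                 # max_epochs: assumed training budget
    for source_ids, target_class in train_loader:  # train_loader: assumed data pipeline
        optimizer.zero_grad()
        # ppp_loss: hypothetical helper returning disc_loss + λ * fluency_loss
        loss = ppp_loss(source_ids, target_class)
        loss.backward()
        optimizer.step()
```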
Conclusion
In this work, we present Plug and Play with Prompts (PPP), a method to learn instructions (prompt embeddings) that control the direction of text generation with large language models. We show that directional instructions can be learnt by backpropagating the loss produced by a smaller language model used for classification. PPP also maintains the fluency of the prompted language model using self-supervision with the non-prompted language model. Further, we demonstrate that the prompts exhibit good generalizability, even though they are trained with very small datasets. PPP is not restricted to in-domain datasets, but also performs well when trained on a larger out-of-domain dataset. This plug-and-play method does not require any changes to the generator model; the user only needs to prepend the trained prompts to the input text as a prefix. Additionally, this method is lightweight, as we only need to train and store the prompts, which have far fewer parameters than the language model.
A noteworthy application of our method is its ability to reduce generation of harmful text by language models, which is necessitated by the growing amounts of abusive, vulgar and profane text present in the training data of these language models. With the growing number of language model based tools and applications, our lightweight plug-and-play method could help developers curtail the amount of biased, offensive and harmful text that their applications might inadvertently produce.
Furthermore, as newer language models rapidly grow in number of parameters, our method offers a memory- and energy-efficient solution towards effectively controlling these models and increasing their usability.
In the future, we plan to extend this method for controlling more fine-grained attributes. We would also like to use prompt tuning in a similar manner for other semi-supervised and unsupervised tasks like style transfer, summarization, machine translation, inter alia.
Ethical Statement
We acknowledge that PPP can potentially be used to produce harmful text, including offensive, derogatory and toxic content. However, the ability to produce offensive text is not exclusive to PPP, but inherently present in all machine learning based language generation techniques, which learn from patterns in human language.
However, the potential misuse of our work does not discount its benefits. One of the prominent applications of PPP is its ability to reduce toxicity in generated text, as demonstrated through our qualitative results.
With the rapid rise in applications leveraging large language models, it has become essential for developers to mitigate the harmful or biased text that their models might inadvertently generate, which can lead to customer dissatisfaction and broader societal harm. PPP offers a data- and memory-efficient solution to curtail the generation of harmful text, fostering healthier and less toxic interaction between humans and language-model-based chatbots. Therefore, we believe that the potential benefits of our work outweigh the risks.
References
- Bahdanau, Cho, and Bengio (2016) Bahdanau, D.; Cho, K.; and Bengio, Y. 2016. Neural Machine Translation by Jointly Learning to Align and Translate. arXiv:1409.0473.
- Brown et al. (2020) Brown, T. B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; Agarwal, S.; Herbert-Voss, A.; Krueger, G.; Henighan, T.; Child, R.; Ramesh, A.; Ziegler, D. M.; Wu, J.; Winter, C.; Hesse, C.; Chen, M.; Sigler, E.; Litwin, M.; Gray, S.; Chess, B.; Clark, J.; Berner, C.; McCandlish, S.; Radford, A.; Sutskever, I.; and Amodei, D. 2020. Language Models are Few-Shot Learners. arXiv:2005.14165.
- Chan et al. (2022) Chan, A.; Ong, Y.-S.; Pung, B.; Zhang, A.; and Fu, J. 2022. CoCon: A Self-Supervised Approach for Controlled Text Generation. arXiv:2006.03535.
- Chen et al. (2018) Chen, Y.; Li, V. O.; Cho, K.; and Bowman, S. 2018. A Stable and Effective Learning Strategy for Trainable Greedy Decoding. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 380–390. Brussels, Belgium: Association for Computational Linguistics.
- Cho et al. (2014) Cho, K.; van Merriënboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; and Bengio, Y. 2014. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1724–1734. Doha, Qatar: Association for Computational Linguistics.
- Chowdhury, Zhuang, and Wang (2022) Chowdhury, J. R.; Zhuang, Y.; and Wang, S. 2022. Novelty Controlled Paraphrase Generation with Retrieval Augmented Conditional Prompt Tuning. arXiv:2202.00535.
- Dathathri et al. (2020) Dathathri, S.; Madotto, A.; Lan, J.; Hung, J.; Frank, E.; Molino, P.; Yosinski, J.; and Liu, R. 2020. Plug and Play Language Models: A Simple Approach to Controlled Text Generation. In International Conference on Learning Representations.
- Elsayed, Goodfellow, and Sohl-Dickstein (2019) Elsayed, G. F.; Goodfellow, I.; and Sohl-Dickstein, J. 2019. Adversarial Reprogramming of Neural Networks. In International Conference on Learning Representations.
- Fan, Lewis, and Dauphin (2018) Fan, A.; Lewis, M.; and Dauphin, Y. 2018. Hierarchical Neural Story Generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 889–898. Melbourne, Australia: Association for Computational Linguistics.
- Gu, Cho, and Li (2017) Gu, J.; Cho, K.; and Li, V. O. 2017. Trainable Greedy Decoding for Neural Machine Translation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 1968–1978. Copenhagen, Denmark: Association for Computational Linguistics.
- Gu et al. (2017) Gu, J.; Neubig, G.; Cho, K.; and Li, V. O. 2017. Learning to Translate in Real-time with Neural Machine Translation. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, 1053–1062. Valencia, Spain: Association for Computational Linguistics.
- Hambardzumyan, Khachatrian, and May (2021) Hambardzumyan, K.; Khachatrian, H.; and May, J. 2021. WARP: Word-level Adversarial ReProgramming. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 4921–4933. Online: Association for Computational Linguistics.
- Hochreiter and Schmidhuber (1997) Hochreiter, S.; and Schmidhuber, J. 1997. Long short-term memory. Neural computation, 9(8): 1735–1780.
- Holtzman et al. (2020) Holtzman, A.; Buys, J.; Du, L.; Forbes, M.; and Choi, Y. 2020. The Curious Case of Neural Text Degeneration. arXiv:1904.09751.
- Holtzman et al. (2018) Holtzman, A.; Buys, J.; Forbes, M.; Bosselut, A.; Golub, D.; and Choi, Y. 2018. Learning to Write with Cooperative Discriminators. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 1638–1649. Melbourne, Australia: Association for Computational Linguistics.
- Houlsby et al. (2019) Houlsby, N.; Giurgiu, A.; Jastrzebski, S.; Morrone, B.; De Laroussilhe, Q.; Gesmundo, A.; Attariyan, M.; and Gelly, S. 2019. Parameter-Efficient Transfer Learning for NLP. In Chaudhuri, K.; and Salakhutdinov, R., eds., Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, 2790–2799. PMLR.
- Hu et al. (2021) Hu, E. J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; and Chen, W. 2021. LoRA: Low-Rank Adaptation of Large Language Models. arXiv:2106.09685.
- Hu et al. (2017) Hu, Z.; Yang, Z.; Liang, X.; Salakhutdinov, R.; and Xing, E. P. 2017. Toward Controlled Generation of Text. In Precup, D.; and Teh, Y. W., eds., Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, 1587–1596. PMLR.
- Jiang et al. (2020) Jiang, Z.; Xu, F. F.; Araki, J.; and Neubig, G. 2020. How Can We Know What Language Models Know? Transactions of the Association for Computational Linguistics, 8: 423–438.
- Jigsaw (2017) Jigsaw. 2017. Toxic Comment Classification Challenge. https://www.kaggle.com/competitions/jigsaw-toxic-comment-classification-challenge. Accessed: 2023-08-14.
- Keskar et al. (2019) Keskar, N. S.; McCann, B.; Varshney, L. R.; Xiong, C.; and Socher, R. 2019. CTRL: A Conditional Transformer Language Model for Controllable Generation. arXiv:1909.05858.
- Krause et al. (2021) Krause, B.; Gotmare, A. D.; McCann, B.; Keskar, N. S.; Joty, S.; Socher, R.; and Rajani, N. F. 2021. GeDi: Generative Discriminator Guided Sequence Generation. In Findings of the Association for Computational Linguistics: EMNLP 2021, 4929–4952. Punta Cana, Dominican Republic: Association for Computational Linguistics.
- Kumar et al. (2022) Kumar, A.; Raghunathan, A.; Jones, R.; Ma, T.; and Liang, P. 2022. Fine-Tuning can Distort Pretrained Features and Underperform Out-of-Distribution. arXiv:2202.10054.
- Lester, Al-Rfou, and Constant (2021) Lester, B.; Al-Rfou, R.; and Constant, N. 2021. The Power of Scale for Parameter-Efficient Prompt Tuning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 3045–3059. Online and Punta Cana, Dominican Republic: Association for Computational Linguistics.
- Li et al. (2016) Li, J.; Galley, M.; Brockett, C.; Gao, J.; and Dolan, B. 2016. A Diversity-Promoting Objective Function for Neural Conversation Models. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 110–119. San Diego, California: Association for Computational Linguistics.
- Li and Liang (2021) Li, X. L.; and Liang, P. 2021. Prefix-Tuning: Optimizing Continuous Prompts for Generation. arXiv:2101.00190.
- Loshchilov and Hutter (2019) Loshchilov, I.; and Hutter, F. 2019. Decoupled Weight Decay Regularization. In International Conference on Learning Representations.
- OpenAI (2023) OpenAI. 2023. GPT-4 Technical Report. arXiv:2303.08774.
- Pascual et al. (2021) Pascual, D.; Egressy, B.; Meister, C.; Cotterell, R.; and Wattenhofer, R. 2021. A Plug-and-Play Method for Controlled Text Generation. In Findings of the Association for Computational Linguistics: EMNLP 2021, 3973–3997. Punta Cana, Dominican Republic: Association for Computational Linguistics.
- Radford et al. (2019) Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; and Sutskever, I. 2019. Language Models are Unsupervised Multitask Learners. OpenAI Blog.
- Rao and Tetreault (2018) Rao, S.; and Tetreault, J. 2018. Dear Sir or Madam, May I Introduce the GYAFC Dataset: Corpus, Benchmarks and Metrics for Formality Style Transfer. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 129–140. New Orleans, Louisiana: Association for Computational Linguistics.
- Schick and Schütze (2021a) Schick, T.; and Schütze, H. 2021a. Exploiting Cloze-Questions for Few-Shot Text Classification and Natural Language Inference. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, 255–269. Online: Association for Computational Linguistics.
- Schick and Schütze (2021b) Schick, T.; and Schütze, H. 2021b. It’s Not Just Size That Matters: Small Language Models Are Also Few-Shot Learners. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2339–2352. Online: Association for Computational Linguistics.
- Shen et al. (2017) Shen, T.; Lei, T.; Barzilay, R.; and Jaakkola, T. 2017. Style Transfer from Non-Parallel Text by Cross-Alignment. arXiv:1705.09655.
- Shen et al. (2020) Shen, T.; Mueller, J.; Barzilay, R.; and Jaakkola, T. 2020. Educating Text Autoencoders: Latent Representation Guidance via Denoising. arXiv:1905.12777.
- Shin et al. (2020) Shin, T.; Razeghi, Y.; Logan IV, R. L.; Wallace, E.; and Singh, S. 2020. AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 4222–4235. Online: Association for Computational Linguistics.
- Socher et al. (2013) Socher, R.; Perelygin, A.; Wu, J.; Chuang, J.; Manning, C. D.; Ng, A.; and Potts, C. 2013. Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, 1631–1642. Seattle, Washington, USA: Association for Computational Linguistics.
- Vaswani et al. (2017) Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, L.; and Polosukhin, I. 2017. Attention Is All You Need. arXiv:1706.03762.
- Yang and Klein (2021) Yang, K.; and Klein, D. 2021. FUDGE: Controlled Text Generation With Future Discriminators. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 3511–3535. Online: Association for Computational Linguistics.
- Yu et al. (2017) Yu, L.; Zhang, W.; Wang, J.; and Yu, Y. 2017. SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient. arXiv:1609.05473.
- Ziegler et al. (2020) Ziegler, D. M.; Stiennon, N.; Wu, J.; Brown, T. B.; Radford, A.; Amodei, D.; Christiano, P.; and Irving, G. 2020. Fine-Tuning Language Models from Human Preferences. arXiv:1909.08593.