Learning from Response not Preference: A Stackelberg Approach for LLM Detoxification using Non-parallel Data
Abstract
Text detoxification, a variant of style transfer tasks, finds useful applications in online social media. This work presents a fine-tuning method that only uses non-parallel data to turn large language models (LLM) into a detoxification rewriter. We model the fine-tuning process as a Stackelberg game between an LLM (leader) and a toxicity screener (follower), which is a binary style classifier (toxic or non-toxic). The LLM aims to align its generations with the screener's preference and produce paraphrases that pass the screening. The primary challenge of non-parallel data fine-tuning is incomplete preference. In the case of unsuccessful paraphrases, the classifier cannot establish a preference between the input and the paraphrase, as they belong to the same toxic style. Hence, preference-alignment fine-tuning methods, such as direct preference optimization (DPO), no longer apply. To address the challenge of incomplete preference, we propose Stackelberg response optimization (SRO), adapted from DPO, to enable the LLM to learn from the follower's response. The gist is that SRO decreases the likelihood of generating a paraphrase if it fails the follower's screening, while performing DPO on the pair of the toxic input and its paraphrase when the latter passes the screening. Experiments indicate that the SRO-fine-tuned LLM achieves performance comparable to state-of-the-art models regarding style accuracy, content similarity, and fluency. The overall detoxification performance surpasses other competing methods and matches the human reference. Additional empirical evidence suggests that SRO is sensitive to the screener's feedback, and a slight perturbation leads to a significant performance drop. We release the code and LLM models at https://github.com/XXXinhong/Detoxification_LLM.
1 Introduction
Identification of toxicity and other undesirable content in user-generated texts is an active research area in NLP Bevendorff et al. (2024). As a proactive countermeasure (besides deletion), the task of automatic rewriting/rephrasing has received increasing attention from the NLP community Villate-Castillo et al. (2024).
Most existing works on text detoxification cast the problem as a variant of style transfer: the task of changing the style of a text while preserving the content. The style of a text refers to its characteristics, such as sentiment, level of formality, and political inclination Dale et al. (2021). As a sequence-to-sequence task, style transfer can employ an encoder-decoder model trained on parallel data that includes a parallel corpus of toxic sentences and their non-toxic paraphrases Wieting and Gimpel (2018a). Despite recent efforts to build parallel datasets Logacheva et al. (2022), the existing parallel data collection is still in its infancy. As an alternative, style transfer models can also utilize non-parallel data. Prior works train encoder-decoder models on non-parallel data and push the decoder toward the target style with the help of style classifiers Lee (2020).

This work considers LLM-based detoxification models trained on non-parallel data. Without parallel data, few-shot fine-tuning does not apply since concrete demonstrations are not available to LLMs. Hence, we consider a different perspective on text detoxification: casting the problem as an alignment problem, where an LLM needs to be fine-tuned to align with a given preference (e.g., non-toxic over toxic contents). Such preference alignment has gained soaring popularity in LLM research as it addresses the LLM’s steerability issue, aiming to achieve precise control of LLM’s behaviors. Such fine-tuning requires human preference labeling, which is instrumental in reinforcement learning from human feedback (RLHF) pipelines Ouyang et al. (2022). However, such a preference relation may not be immediately available in detoxification. Given a toxic text input, when the LLM generates two outputs that are still toxic, it is unclear which one is less toxic since both belong to the same style. In practice, we also observe that the human-labeled non-parallel data only involves binary labels (toxic or non-toxic) without explicit preference relations, making it challenging to rank texts from the same class. Consequently, RLHF may not apply due to the lack of clear preferences.
From the example above, the key challenge of applying LLMs to detoxification (or style transfer in general) from the alignment viewpoint is that no preference exists among texts belonging to the same style. Therefore, LLMs need to learn from the outcome of style transfer, i.e., the feedback from the style screening (either success or failure), rather than preference labeling.
We model the LLM’s fine-tuning process as a Stackelberg game between the LLM and a style classifier trained from human-labeled data. As illustrated in Figure 1, when the LLM (leader) receives a toxic text from the training dataset, it first generates a paraphrase and sends it to the screener (follower), who responds with a binary outcome: either success or failure. Based on this response, the LLM updates its model weights by applying the gradient of our proposed Stackelberg loss. The intuition behind this loss is that 1) if the detoxification is successful, then the pair of the input text and the LLM-generated text presents a clear preference, which is perfectly handled by the original DPO pipeline, and the Stackelberg loss reduces to the DPO loss; 2) otherwise, only the failed transfer is utilized to compute the loss, which penalizes the LLM for such paraphrasing. The key difference between our proposed fine-tuning method and RLHF is that the LLM learns from the classifier’s response to its generated text without requiring additional preference labeling. We refer to the proposed fine-tuning method as Stackelberg Response Optimization. Our contributions are as follows.
• We model LLM fine-tuning for detoxification as a Stackelberg game between the LLM and the style screener. Such a Stackelberg formulation spares one from parallel data and human labeling.
• We propose Stackelberg Response Optimization (SRO), adapted from DPO, to address the issue of missing preferences in RLHF.
• We evaluate the fine-tuned LLM and compare it with a number of state-of-the-art models on text detoxification tasks. Our proposed SRO achieves superior overall performance (a joint metric synthesizing style accuracy, content similarity, and fluency) over other competing models and matches the human reference.
• We examine SRO’s robustness to the screener. Empirical evidence suggests that SRO is highly sensitive to the screener’s feedback: a slight perturbation in screening significantly degrades the detoxification performance.
2 Related Works
Style transfer and detoxification
Prior works in style transfer consider encoder-decoder models on non-parallel data while using adversarial classifiers to steer the decoder toward the desired style Shen et al. (2017a). To address the challenge of lacking parallel data, researchers have employed methods such as reinforcement learning Luo et al. (2019), amortized variational inference He et al. (2020), and utilizing style transfer classifiers to train models that convert texts from one style to another and vice versa Lee (2020).
Detoxification specifically involves converting toxic language into neutral language using datasets that are not directly paired but categorized by toxicity, treating toxic and neutral texts as distinct groups (non-parallel data). Several studies use datasets like the Jigsaw series Laugier et al. (2021) or create their own from social media platforms for this purpose Kivlichan et al. (2020); cjadams et al. (2017, 2019).
Prior detoxification models often borrow methodologies from other domains Shen et al. (2017b) and Fu et al. (2018), using autoencoders Nogueira dos Santos et al. (2018) with style classification and cycle-consistency losses, or fine-tuning denoising autoencoders like T5 Laugier et al. (2021). Some models apply direct corrections to specific words Tran et al. (2020); Wu et al. (2019) and then enhance text fluency with sequence-to-sequence models. Similarly, Dale et al. (2021) uses masked language models for targeted edits.
Preference alignment and fine-tuning of LLM
LLMs have shown great capabilities in the past few years in generative tasks across various domains Bubeck et al. (2023); Li and Zhu (2024), and their encouraging success relies on the precise control of LLMs’ output through alignment fine-tuning. One of the popular methods for gaining such control is reinforcement learning with human feedback Ouyang et al. (2022). By training a reward model on human-labeled data, one can use the reward model to fine-tune LLM models using RL algorithms.
To avoid bias that exists in the reward model, Rafailov et al. (2024) develop Direct Preference Optimization (DPO), which does not rely on a separate reward model to calculate the loss; instead, DPO computes the loss directly from the policy's and reference model's likelihoods of the preferred and dispreferred answers. In addition to DPO, there are also emerging techniques such as Identity Preference Optimisation (IPO) Azar et al. (2023), which drops the Bradley-Terry modeling assumption to mitigate overfitting, and Kahneman-Tversky Optimization (KTO) Ethayarajh et al. (2024), which applies human-aware losses (HALOs) to directly maximize the utility of generations instead of maximizing the likelihood of preferences.
Game-theoretic RLHF
Multi-agent reinforcement learning offers a robust theoretical grounding for examining learning agents that interact and need to adjust their strategies in response Li et al. (2022a, b). Recent developments in game-theoretic methods tailored to RLHF include Self-Play Preference Optimization (SPO) Swamy et al. (2024), which utilizes self-play to compare win rates across different trajectories. Similarly, Self-Play fIne-tuNing (SPIN) Chen et al. (2024) employs self-play to create synthetic high-quality data by differentiating self-generated continuations from those generated by humans. Both approaches utilize a consistent reward model akin to traditional RLHF. Additionally, Nash-RLHF Munos et al. (2024) introduces a competitive setup involving two language models, each aiming to gain favor from a human evaluator.
Closely related to our Stackelberg formulation, Chakraborty et al. (2024) also consider preference alignment as a Stackelberg game, where the reward model acts as the leader and the RL agent as the follower. However, the reward model is still built on complete preference relations, whereas our work focuses on incomplete preference resulting from the binary feedback of the screener.
3 Preliminary
Notations
Consider a non-parallel dataset denoted by $\mathcal{D}$, which consists of a blend of toxic and non-toxic texts (sentences). We denote by $x$ a typical text from the dataset. In our text detoxification task, such a text is often paired with a prompt instructing the LLM to rewrite the sentence. For example, the prompt used in our experiment is "paraphrase the following: $x$." The LLM, despite its intricate inner workings, is represented by a function $f_\theta$ that maps the input text $x$ to a rephrased text $y' = f_\theta(x)$, where $\theta$ denotes the LLM's model weights.
The style classifier, performing binary classification, maps the rephrased text $y'$ to a label $s \in \{0, 1\}$. Since the classifier plays as the follower in the Stackelberg game, one can think of the label $s$ as the follower's action or response toward the LLM's paraphrasing. Denoting the classifier by $g$, its definition is given by
$$s = g(y') = \begin{cases} 1, & \text{if } y' \text{ is non-toxic}, \\ 0, & \text{if } y' \text{ is toxic}. \end{cases} \qquad (1)$$
The sequel also uses $y'(x; \theta)$ and $s(y')$ to make the dependencies among variables explicit.
Stackelberg game
To prepare the reader for the Stackelberg detoxification game introduced later, we use the notations above to review some basics. Beginning with a payoff-relevant state $x$ (which affects the players' rewards/costs), the leader determines its parameterized policy $f_\theta$ that yields an action $y' = f_\theta(x)$ upon receiving the state. Once observing the leader's action $y'$, the follower responds by choosing an action $s$. The follower's objective is to find a response policy such that its reward $u_F(x, y', s)$ is maximized. The follower's best response given $y'$ is $\mathrm{BR}(y') = \arg\max_{s} u_F(x, y', s)$. Aware of the follower's objective and best response, the leader aims to seek the optimal policy that maximizes its reward (minimizes its cost) by soliciting the desired follower's action: $\theta^\ast \in \arg\max_{\theta} u_L\big(x, f_\theta(x), \mathrm{BR}(f_\theta(x))\big)$.
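To make the leader-follower interaction concrete, here is a minimal toy sketch in Python (not from the paper; the blacklist screener, utility function, and candidate sentences are illustrative stand-ins): the leader anticipates the follower's best response to each candidate action and keeps the one with the highest induced utility.

```python
def follower_best_response(action: str) -> int:
    # stand-in screener: respond 1 (pass) unless a blacklisted word appears
    return 0 if "stupid" in action else 1

def leader_utility(action: str, response: int) -> float:
    # the leader values passing the screening, minus a tiny length penalty
    return response - 0.001 * len(action)

candidates = ["no room for those stupid antics.",
              "no room for those bad exploits."]
# the leader evaluates each action under the follower's anticipated best response
best = max(candidates, key=lambda a: leader_utility(a, follower_best_response(a)))
print(best)  # -> "no room for those bad exploits."
```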
Direct preference optimization
DPO comes as a direct alternative to Reinforcement Learning from Human Feedback (RLHF), which consists of two main parts: the reward model $r_\phi$ and the target language model $\pi_\theta$. The reward model is optimized by the following loss function: $\mathcal{L}_R(r_\phi) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}_{\mathrm{pref}}}\big[\log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)\big]$, where $\mathcal{D}_{\mathrm{pref}}$ is the dataset consisting of preference pairs $(x, y_w, y_l)$, in which $x$ is an input and $y_w$ and $y_l$ are the preferred and dispreferred outputs. The function $\sigma$ here is the sigmoid function. With a decent reward model $r_\phi$, we can then fine-tune the target language model by maximizing $\mathbb{E}_{x \sim \mathcal{D}_{\mathrm{pref}},\, y \sim \pi_\theta(\cdot \mid x)}\big[r_\phi(x, y)\big] - \beta\, \mathbb{D}_{\mathrm{KL}}\big[\pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x)\big]$.
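As a quick illustration of the pairwise reward-model objective above, the following sketch (toy scores, NumPy only; not the paper's implementation) computes the Bradley-Terry loss on a few preference pairs.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def reward_model_loss(r_preferred, r_dispreferred):
    """Pairwise Bradley-Terry loss: mean of -log sigma(r(x, y_w) - r(x, y_l))."""
    r_preferred = np.asarray(r_preferred, dtype=float)
    r_dispreferred = np.asarray(r_dispreferred, dtype=float)
    return float(np.mean(-np.log(sigmoid(r_preferred - r_dispreferred))))

# when the reward model already ranks the preferred output higher, the loss is small
print(reward_model_loss([2.0, 1.5], [0.5, 1.0]))   # ~0.34
# when the ranking is reversed, the loss grows
print(reward_model_loss([0.0, 0.0], [2.0, 2.0]))   # ~2.13
```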
DPO has been gaining much popularity since it does not require a reward model and directly fine-tunes the LLM using preference pairs. The DPO loss is given by
$$\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}_{\mathrm{pref}}}\left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right].$$
To understand how DPO works, one can inspect the gradient of the DPO loss, which is given below:
$$\nabla_\theta \mathcal{L}_{\mathrm{DPO}} = -\beta\, \mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}_{\mathrm{pref}}}\Big[\sigma\big(\hat{r}_\theta(x, y_l) - \hat{r}_\theta(x, y_w)\big)\big(\nabla_\theta \log \pi_\theta(y_w \mid x) - \nabla_\theta \log \pi_\theta(y_l \mid x)\big)\Big],$$
where $\hat{r}_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}$ is the implicit reward defined by the language model and the reference model.
The first term in the gradient, $\sigma\big(\hat{r}_\theta(x, y_l) - \hat{r}_\theta(x, y_w)\big)$, increases when the reward estimation is wrong, so it helps the LLM converge quickly. The second term increases the likelihood of $y_w$ (the preferred text) and decreases the likelihood of $y_l$ (the dispreferred text).
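The sketch below (NumPy, toy log-probabilities; the β value and inputs are illustrative, not the paper's settings) evaluates the DPO loss and the gradient coefficient $\sigma(\hat{r}_\theta(x, y_l) - \hat{r}_\theta(x, y_w))$ for a single pair, showing how the coefficient grows when the implicit reward ranks the pair the wrong way.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dpo_pair(logp_w, logp_ref_w, logp_l, logp_ref_l, beta=0.1):
    """DPO loss and gradient coefficient for one preference pair (toy example)."""
    r_w = beta * (logp_w - logp_ref_w)   # implicit reward of the preferred output
    r_l = beta * (logp_l - logp_ref_l)   # implicit reward of the dispreferred output
    loss = -np.log(sigmoid(r_w - r_l))
    grad_coeff = sigmoid(r_l - r_w)      # large when the implicit ranking is wrong
    return loss, grad_coeff

# policy already prefers y_w relative to the reference: small loss, small coefficient
print(dpo_pair(logp_w=-10.0, logp_ref_w=-12.0, logp_l=-15.0, logp_ref_l=-13.0))
# policy ranks the pair the wrong way: larger loss, larger coefficient
print(dpo_pair(logp_w=-15.0, logp_ref_w=-12.0, logp_l=-10.0, logp_ref_l=-13.0))
```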
4 Stackelberg Learning from Response
4.1 Detoxification as a Stackelberg game
Consider the text detoxification task where the LLM seeks to set the model weights $\theta$ such that the paraphrased output $y' = f_\theta(x)$ is non-toxic. In other words, it aims to generate $y'$ that passes the style classifier's screening, i.e., the desired classification outcome is $s = g(y') = 1$.
We define the LLM's cost function as $\ell(\theta; x, y', s)$, to be specified later in Section 4.2, which encourages the LLM to generate successful paraphrases that pass the screening while penalizing the failed attempts. Since the style classifier's objective is to report the underlying style accurately, one can define the style classifier's cost function as the classification loss. This work considers an SVM-based classifier and uses the hinge loss Gentile and Warmuth (1998). As discussed in the preliminary, we can treat the trained SVM as the best response mapping.
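For completeness, the follower's classification cost mentioned above is the standard hinge loss; a minimal NumPy sketch (toy decision values, with ±1 labels assumed) is:

```python
import numpy as np

def hinge_loss(margin_scores, labels):
    """Mean hinge loss max(0, 1 - y * f(x)), with labels y in {-1, +1}
    and `margin_scores` the SVM decision values f(x)."""
    margin_scores = np.asarray(margin_scores, dtype=float)
    labels = np.asarray(labels, dtype=float)
    return float(np.mean(np.maximum(0.0, 1.0 - labels * margin_scores)))

# a confident correct prediction contributes 0; small or wrong margins are penalized
print(hinge_loss([2.0, 0.2, -0.5], [1, 1, 1]))  # (0 + 0.8 + 1.5) / 3 ≈ 0.77
```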
The interaction between the LLM and the SVM unfolds as below. First, the LLM takes as input a text $x$ randomly sampled from the dataset $\mathcal{D}$. Then, the LLM determines a paraphrase $y' = f_\theta(x)$, which is then sent to the SVM. Finally, the SVM best responds with a classification outcome $s = g(y')$, which decides the LLM's utility. The equilibrium problem for the LLM, the leader, is given by
$$\min_{\theta}\ \mathbb{E}_{x \sim \mathcal{D}}\Big[\ell\big(\theta;\, x,\, f_\theta(x),\, \mathrm{BR}(f_\theta(x))\big)\Big], \qquad (2)$$
where $\mathrm{BR}(y') = g(y')$ simply corresponds to the classification output. Naturally, the solution to (2) returns the optimal LLM model for detoxification. We introduce Stackelberg response optimization, adapted from DPO, to approximately solve for the optimal solution.
4.2 Stackelberg Response Optimization
Given a preference pair $(x, y_w, y_l)$, DPO aims to solve the following minimization problem
$$\min_{\theta}\ -\mathbb{E}_{(x, y_w, y_l)}\Big[\log \sigma\big(\hat{r}_\theta(x, y_w) - \hat{r}_\theta(x, y_l)\big)\Big],$$
where we define $\hat{r}_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}$. In the detoxification task, whether preference pairs are available depends on the screening outcome. If the paraphrase $y'$ passes the screening, then $(x, y_w = y', y_l = x)$ constitutes a preference pair, which can be handled by DPO. However, if $y'$ fails the screening, then both $x$ and $y'$ are dispreferred without a clear preference relation.
We propose the following objective function for the LLM leader, which is a hybrid of two kinds of loss functions corresponding to the cases of passing and failing the screening, respectively:
$$\ell(\theta; x, y', s) = s\, \mathcal{L}_{\mathrm{pass}}(\theta) + (1 - s)\, \mathcal{L}_{\mathrm{fail}}(\theta).$$
Pass
When the paraphrase passes the screening, i.e., $s = g(y') = 1$, we adopt the standard DPO loss:
$$\mathcal{L}_{\mathrm{pass}}(\theta) = -\log \sigma\left(\beta \log \frac{\pi_\theta(y' \mid x)}{\pi_{\mathrm{ref}}(y' \mid x)} - \beta \log \frac{\pi_\theta(x \mid x)}{\pi_{\mathrm{ref}}(x \mid x)}\right).$$
We simply substitute the dispreferred output $y_l$ with $x$, the input of the model. In our setup, the model will prefer the text that passes the screening ($y_w = y'$) over the toxic input ($y_l = x$).
The gradient of the loss is
$$\nabla_\theta \mathcal{L}_{\mathrm{pass}}(\theta) = -\beta\, \sigma\big(\hat{r}_\theta(x, y_l) - \hat{r}_\theta(x, y_w)\big)\Big[\nabla_\theta \log \pi_\theta(y_w \mid x) - \nabla_\theta \log \pi_\theta(y_l \mid x)\Big].$$
Following the same argument in Rafailov et al. (2024), $\hat{r}_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}$ is the implicit reward function defined by the LLM and the reference model. Then, by substituting $y_w$ with $y'$ and $y_l$ with $x$, the gradient of the pass loss can be written as
$$\nabla_\theta \mathcal{L}_{\mathrm{pass}}(\theta) = -\beta\, \sigma\big(\hat{r}_\theta(x, x) - \hat{r}_\theta(x, y')\big)\Big[\nabla_\theta \log \pi_\theta(y' \mid x) - \nabla_\theta \log \pi_\theta(x \mid x)\Big].$$
The first term in the gradient increases when the reward estimation is wrong (i.e., the reward difference $\hat{r}_\theta(x, x) - \hat{r}_\theta(x, y')$ is supposed to be negative). It serves as a coefficient that adjusts the gradient's magnitude so it can help the LLM adapt quickly: when the reward estimation is wrong, the magnitude of the gradient is larger, and the LLM model weights change significantly. The second term determines the gradient direction, increasing the likelihood of $y'$ (the successful paraphrase) and decreasing the likelihood of $x$ (the original input text). In summary, the gradient increases the likelihood of generating successful paraphrases.
Fail
For those failed paraphrases, we apply the following loss to the model, which we call the partial loss since it only keeps a part of the DPO loss:
$$\mathcal{L}_{\mathrm{fail}}(\theta) = -\log \sigma\left(-\beta \log \frac{\pi_\theta(y' \mid x)}{\pi_{\mathrm{ref}}(y' \mid x)}\right).$$
Intuitively speaking, the partial loss does not indicate anything to prefer but only something to disprefer. Thus, the loss leads the model to have a dispreference for the output $y'$. Below is the gradient of the partial loss:
$$\nabla_\theta \mathcal{L}_{\mathrm{fail}}(\theta) = \beta\, \sigma\left(\beta \log \frac{\pi_\theta(y' \mid x)}{\pi_{\mathrm{ref}}(y' \mid x)}\right) \nabla_\theta \log \pi_\theta(y' \mid x).$$
Similarly, we can substitute $\beta \log \frac{\pi_\theta(y' \mid x)}{\pi_{\mathrm{ref}}(y' \mid x)}$ with $\hat{r}_\theta(x, y')$ and obtain the following:
$$\nabla_\theta \mathcal{L}_{\mathrm{fail}}(\theta) = \beta\, \sigma\big(\hat{r}_\theta(x, y')\big) \nabla_\theta \log \pi_\theta(y' \mid x).$$
In this gradient, the first term serves as the coefficient adjusting the gradient's magnitude: if the reward estimation is wrong (i.e., $\hat{r}_\theta(x, y')$ is large), then the gradient update significantly changes the model weights. The second term determines the gradient direction so that gradient descent decreases the likelihood of generating $y'$.
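Putting the two cases together, here is a minimal PyTorch sketch of the hybrid pass/fail loss (assumed names; sequence-level log-probabilities are taken as given scalars, and the β default of 0.01 follows the hyper-parameter reported in Section 5.2):

```python
import torch
import torch.nn.functional as F

def sro_loss(logp_para, logp_ref_para, logp_input, logp_ref_input, passed, beta=0.01):
    """Hybrid pass/fail loss: DPO pair (y' preferred over x) when the paraphrase
    passes the screening (passed = 1), partial loss pushing y' down otherwise."""
    r_para = beta * (logp_para - logp_ref_para)      # implicit reward of y'
    r_input = beta * (logp_input - logp_ref_input)   # implicit reward of x
    pass_loss = -F.logsigmoid(r_para - r_input)      # standard DPO term
    fail_loss = -F.logsigmoid(-r_para)               # partial (dispreference-only) term
    return passed * pass_loss + (1.0 - passed) * fail_loss

# toy sequence log-probabilities under the policy and the frozen reference model
logps = dict(logp_para=torch.tensor(-40.0), logp_ref_para=torch.tensor(-38.0),
             logp_input=torch.tensor(-35.0), logp_ref_input=torch.tensor(-35.0))
print(sro_loss(**logps, passed=torch.tensor(1.0)))  # pass case: DPO-style loss
print(sro_loss(**logps, passed=torch.tensor(0.0)))  # fail case: partial loss
```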
5 Experiment
5.1 Toxicity Screener
Our style classification model was developed using a dataset from the Jigsaw Toxic Comment Classification Challenge cjadams et al. (2017), which contains over 120,000 non-toxic sentences and 14,000 toxic sentences. Following the data splitting in Logacheva et al. (2022), we allocate 95% of the dataset for training and reserve 5% for testing. We employ TF-IDF for sentence tokenization and SVM for classification to form the response. Ultimately, the screening model achieved an accuracy of 98.6% on the test data.
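A minimal sklearn sketch of such a screener follows (toy corpus; the exact vectorizer and SVM settings used in the paper are not specified, so these defaults are assumptions):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# toy stand-in corpus; the paper trains on the Jigsaw data with a 95%/5% split
texts = ["you are wonderful", "what a great idea", "you are an idiot", "this is garbage"]
labels = [1, 1, 0, 0]  # 1 = non-toxic, 0 = toxic

screener = make_pipeline(TfidfVectorizer(), LinearSVC())
screener.fit(texts, labels)

def respond(paraphrase: str) -> int:
    """Follower's response s = g(y'): 1 if the paraphrase passes the screening."""
    return int(screener.predict([paraphrase])[0])

print(respond("have a wonderful day"))
```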
5.2 SRO Implementation
We fine-tune the pre-trained paraphrase T5 model using the toxic sentences from the Jigsaw Toxic dataset cjadams et al. (2017). The training process includes 15 epochs, with the batch size being 16. The learning rate is , and the weight decay is set to 0.01. The hyper-parameter $\beta$ in SRO is 0.01.
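The fine-tuning loop can be sketched as follows, under stated assumptions: `generate_paraphrase` and `sequence_logprob` are hypothetical helpers for decoding and scoring with the T5 policy and its frozen reference copy, `respond` is the screener sketched in Section 5.1, `sro_loss` is the hybrid loss sketched in Section 4.2, and the learning rate is left to the supplied optimizer.

```python
import torch

def sro_finetune(policy, ref_model, tokenizer, toxic_texts, optimizer,
                 epochs=15, batch_size=16, beta=0.01):
    """One possible SRO fine-tuning loop: the leader paraphrases, the follower
    screens, and the hybrid loss is back-propagated through the policy only."""
    for _ in range(epochs):
        for i in range(0, len(toxic_texts), batch_size):
            losses = []
            for x in toxic_texts[i:i + batch_size]:
                y = generate_paraphrase(policy, tokenizer, x)   # leader's action y'
                s = float(respond(y))                           # follower's response
                losses.append(sro_loss(
                    logp_para=sequence_logprob(policy, tokenizer, x, y),
                    logp_ref_para=sequence_logprob(ref_model, tokenizer, x, y),
                    logp_input=sequence_logprob(policy, tokenizer, x, x),
                    logp_ref_input=sequence_logprob(ref_model, tokenizer, x, x),
                    passed=torch.tensor(s), beta=beta))
            loss = torch.stack(losses).mean()
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```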
5.3 Competing Models
We compare our SRO models with other style transfer models and common baselines in the literature Dale et al. (2021).
• Duplicate (baseline) – copy of the input.
• Delete (baseline) – deletion of swear words.
• BART-zero-shot (baseline) – BART model with no additional training.
• Mask&Infill Wu et al. (2019) – BERT-based pointwise editing model.
• Delete-Retrieve-Generate models Li et al. (2018): DRG-Template (replacement of toxic words with similar neutral words) and DRG-Retrieve (retrieval of non-toxic sentences with a similar sense) varieties.
• DLSM He et al. (2020) – encoder-decoder model that uses amortized variational inference.
• SST Lee (2020) – encoder-decoder model with the cross-entropy of a pre-trained style classifier as an additional discriminative loss.
• CondBERT Dale et al. (2021) – BERT-based model with extra style and content control.
• ParaGeDi Dale et al. (2021) – a model that enhances a paraphraser with style-informed LMs that re-weigh its output.
5.4 Metrics
We apply three third-party models to evaluate our LLM, which is customary in many style transfer works Logacheva et al. (2022); Dale et al. (2021). Namely, we evaluate:
• style accuracy (STA) – the percentage of non-toxic outputs identified by a style classifier trained on the Jigsaw data. Note that this testing classifier differs from the screener in that it is built on a RoBERTa model and outputs a number ranging from 0 to 1, indicating toxicity likelihood.
• content preservation (SIM) – the similarity between the input and the output, computed as the cosine similarity of their sentence embeddings Wieting et al. (2019).
• fluency (FL) – the percentage of fluent sentences identified by a RoBERTa-based classifier of linguistic acceptability trained on the CoLA dataset Warstadt et al. (2019).
Then, the joint metric J is computed by multiplying the three individual metrics: $J = \mathrm{STA} \times \mathrm{SIM} \times \mathrm{FL}$.
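A small sketch of how the joint score can be computed from per-sentence scores (per-sentence averaging is assumed here, following Logacheva et al. (2022); multiplying the corpus-level averages is a coarser alternative):

```python
import numpy as np

def joint_metric(sta, sim, fl):
    """J score: average over sentences of the product STA * SIM * FL."""
    sta, sim, fl = (np.asarray(v, dtype=float) for v in (sta, sim, fl))
    return float(np.mean(sta * sim * fl))

# toy per-sentence scores for three outputs
print(joint_metric([1, 1, 0], [0.9, 0.7, 0.8], [1.0, 1.0, 1.0]))  # ≈ 0.53
```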
5.5 Results
Evaluation
Table 1 summarizes the performance of our proposed SRO, baselines, and prior works. Our model was trained on the Jigsaw Toxic Comment Classification Challenge dataset, and it outperforms other models in terms of the J metric. The baseline models (Delete and Duplicate) and the BART-zero-shot model return low J scores for the following reasons. The Delete model just deletes the toxic words in the sentence, leading to very low FL (fluency) scores. The Duplicate model has a very low STA score because it does not introduce any modification to the input. The BART-zero-shot model, as a model pre-trained on a large corpus of text, can only handle tasks based solely on its prior knowledge and the instructions given in the prompt. Thus, the BART-zero-shot model cannot perform the detoxification task, yielding a low STA score.
Now we turn to other advanced competing models. The DRG-Template and DRG-Retrieve models enjoy high STA scores because the first step of these models is to delete the toxic tokens, making the paraphrase pass the screening easily. However, DRG-Template returns low FL as it merely replaces the toxic words, and DRG-Retrieve gives low SIM because it retrieves a similar sentence with a non-toxic attribute from the training dataset, where similarity is measured by the Euclidean distance between the embeddings of the two sentences; a small Euclidean distance does not imply content preservation. The Mask&Infill model masks the toxic markers and infills them with similar non-toxic tokens. Such a practice maintains the basic structure of the input text, leading to a high SIM score. However, the infilling may introduce words that render the whole text incomprehensible, resulting in a low FL score. The DLSM and SST models perform full re-generation of text (as opposed to word-wise corrections). More importantly, their decoders are trained from scratch on a relatively small dataset, leading to low fluency scores. The ParaGeDi model re-weighs output logits to control the attribute of the output, which means it pays little attention to the context; therefore, this model gives a relatively low FL score. The same problem affects CondBERT, which also re-weighs logits, so the fluency of the generated text is relatively low.
In contrast to the models above that perform word/logit-wise corrections, our SRO model, fine-tuned on a text-paraphrase LLM, is more concerned with the context, yielding a high FL score. As shown in Table 1, our SRO achieves the highest FL scores while retaining satisfying scores on STA and SIM. Of particular note, SRO outperforms all the advanced competing models in terms of SIM.
Model | STA | SIM | FL | J |
---|---|---|---|---|
Human reference | 0.96 | 0.77 | 0.88 | 0.66 |
SRO | 0.86 | 0.84 | 0.92 | |
PA-SRO | 0.34 | 0.67 | 0.88 | 0.20 |
RS-SRO | 0.71 | 0.78 | 0.51 | |
Delete | 0.81 | 0.93 | 0.64 | 0.46 |
Duplicate | 0.02 | 1.00 | 0.91 | 0.02 |
DRG-Template | 0.90 | 0.82 | 0.69 | 0.51 |
BART-zero-shot | 0.01 | 0.99 | 0.92 | 0.01 |
Mask&Infill | 0.91 | 0.82 | 0.63 | 0.48 |
CondBERT | 0.98 | 0.77 | 0.82 | 0.62 |
SST | 0.86 | 0.57 | 0.19 | 0.10 |
ParaGeDi | 0.99 | 0.71 | 0.88 | 0.62 |
DLSM | 0.76 | 0.76 | 0.52 | 0.25 |
DRG-Retrieve | 0.97 | 0.36 | 0.86 | 0.31 |
Table 2 shows examples of different models’ output, and we can see that our model’s results look more tactful and retain the original meaning.
Model | Text |
---|---|
Input | no room for those stupid antics. |
SRO | No room for those bad exploits. |
ParaGeDi | There’s no space for all those adversities. |
CondBERT | no room for those . . . those bizarre exploits. |
Mask&Infill | no room for those compelling antics. |
DRG-TemplateBased | no room for those conventions antics. |
Input | zoological conspiracy theories (Arab-Israeli conflict), by contrast, sucks horribly. |
Ours | Zooological conspiracy theories (Arab-Israeli conflict), by contrast, is not good. |
ParaGeDi | By contrast, the zoological theories of conspiracy (the Arab-Israeli conflict) suffer horribly. |
CondBERT | zoological conspiracy theories (Arab-Israeli conflict), by contrast, worked out badly. |
Mask&Infill | zoological conspiracy theories (Arab-Israeli conflict), by contrast, ends horribly. |
DRG-TemplateBased | zoological conspiracy theories (Arab-Israeli conflict), by contrast, instrumental horribly. |
Input | this whole article is bullshit. |
Ours | This whole article is not that good. |
ParaGeDi | This article’s not a good deal. |
CondBERT | this whole article is considered incomplete. |
Mask&Infill | this whole article is compelling. |
DRG-TemplateBased | this whole article is substantial. |
5.6 Ablation study on SRO loss
RS-SRO
The difference between SRO and DPO lies in the use of paraphrased text that fails the screening, for which no preference exists between the input and the output. To see how the failed attempts contribute to the success of SRO, we conduct an ablation study where SRO discards the failed outputs and only performs DPO on the successfully paraphrased instances. We refer to such a practice as response-selective SRO (RS-SRO), since the leader (LLM) only considers positive responses while ignoring the negative ones. The loss function of RS-SRO is given by $\ell_{\mathrm{RS}}(\theta; x, y', s) = s\, \mathcal{L}_{\mathrm{pass}}(\theta)$. From Table 1, we can see that the performance of the RS-SRO model decreases by a small margin compared to the SRO model, which shows that the fail loss we apply in SRO contributes to SRO's success.
PA-SRO
The above experiment demonstrates the role played by negative feedback, which does not create a complete preference. One then naturally wonders whether SRO's treatment of failed paraphrases would also carry over to the successful ones. Even though the pair of the input and output reveals a preference relation, it is unclear whether such a relation is needed in SRO. In terms of the Stackelberg model, the question is whether positive feedback with complete preference is necessary. To answer this question, we modify the SRO loss when $s = 1$ and refer to the resulting fine-tuning as preference-agnostic SRO (PA-SRO). For the PA-SRO model, the pass case no longer contains the $\hat{r}_\theta(x, x)$ part of the pass loss. The loss function of PA-SRO is given by $\ell_{\mathrm{PA}}(\theta; x, y', s) = -s \log \sigma\big(\hat{r}_\theta(x, y')\big) + (1 - s)\, \mathcal{L}_{\mathrm{fail}}(\theta)$. From the results in Table 1, we find that PA-SRO's J score is significantly smaller than those of SRO and RS-SRO. In our loss design, although the partial loss (fail) matters, the DPO-type loss (pass) is much more important, as it creates a preference pair from which the model learns a direct preference instead of receiving only an indirect instruction.
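Continuing the PyTorch sketch from Section 4.2, the two ablation losses can be written as follows (an illustrative reconstruction; `r_para` and `r_input` denote the implicit rewards $\hat{r}_\theta(x, y')$ and $\hat{r}_\theta(x, x)$):

```python
import torch.nn.functional as F

def rs_sro_loss(r_para, r_input, passed):
    """Response-selective SRO: keep only the DPO pass term, drop failed paraphrases."""
    return passed * (-F.logsigmoid(r_para - r_input))

def pa_sro_loss(r_para, passed):
    """Preference-agnostic SRO: on a pass, only push the paraphrase up (no comparison
    with the input); on a fail, keep the partial dispreference-only term."""
    return passed * (-F.logsigmoid(r_para)) + (1 - passed) * (-F.logsigmoid(-r_para))
```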
Robustness to Misclassification
The style screener plays an instrumental part in the proposed SRO, providing feedback on the LLM's paraphrases. To investigate SRO's robustness against possible misclassification in style screening, we introduce various levels of misclassification to the screener and collect detoxification metrics when the LLM learns from possibly inaccurate responses.
We consider the following levels of misclassification: 10%, 20%, 50%, and 70%. For a given misclassification level, such as 10%, the screener proceeds as follows. Given a paraphrased text from the LLM, the screener employs the trained SVM to label the text with 90% probability, while classifying the text as non-toxic regardless of the true style with 10% probability. In other words, the introduced misclassification serves as a relaxation of the screening process, creating more preference pairs with possibly inaccurate preference relations. In this case, an originally unsuccessful paraphrase, together with the toxic input, creates a seemingly complete preference pair to be fed to SRO. By conducting such experiments, we can determine how vulnerable the LLM is to misleading preference relations. From Table 3, we can see a huge gap between the J score of the SRO model and that of SRO under 10% relaxation (denoted by R-10%). The J score further decreases slightly as the screener's reliability decreases. We can conclude that the credibility of the screener is extremely important for the performance of SRO. In our understanding, the misleading preference pairs greatly hinder the SRO model from converging and feed the model wrong information or preferences to learn. This greatly reduces learning efficiency because errors, once learned, require additional effort to correct.
Model | STA | SIM | FL | J |
---|---|---|---|---|
Human reference | 0.96 | 0.77 | 0.88 | 0.66 |
SRO | 0.86 | 0.84 | 0.92 | |
R-10% | 0.53 | 0.78 | 0.92 | 0.38 |
R-20% | 0.48 | 0.78 | 0.92 | 0.34 |
R-50% | 0.42 | 0.74 | 0.88 | 0.27 |
R-70% | 0.40 | 0.77 | 0.87 | 0.26 |
K-10% | 0.50 | 0.67 | 0.92 | 0.30 |
K-20% | 0.39 | 0.69 | 0.90 | 0.24 |
K-50% | 0.35 | 0.66 | 0.90 | 0.20 |
K-70% | 0.33 | 0.67 | 0.91 | 0.20 |
In contrast to relaxation, another type of misclassification is overkill; that is, the screener classifies the paraphrase as toxic, whatever the true style is. In this case, the overkill greatly reduces the portion of preference pairs in the response. Consequently, the partial loss is more frequently employed in the fine-tuning process. Similar to the previous setup, we introduce 10%, 20%, 50%, and 70% overkill to the screener to investigate the LLM’s robustness against such misclassification. From the result shown in Table 3, we can see that the performance under the overkill screener decreases as the overkill rate increases. Besides, as we compare the relaxation case and overkill case at the same rate, we can conclude that the complete preference pairs when passing the screening are more helpful than the partial loss in the failed cases.
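Both perturbations can be simulated by wrapping the screener's response, as in the sketch below (a hypothetical wrapper around the `respond` function sketched earlier; the rates are applied one at a time in the experiments, matching the R-x% and K-x% rows of Table 3):

```python
import random

def noisy_respond(paraphrase: str, relax_rate: float = 0.0, overkill_rate: float = 0.0) -> int:
    """Screener with injected misclassification: with probability `relax_rate` it
    passes the paraphrase regardless of its true style (relaxation); with probability
    `overkill_rate` it fails it regardless (overkill); otherwise it returns the
    trained SVM's label via `respond`."""
    u = random.random()
    if u < relax_rate:
        return 1  # relaxation: report "non-toxic" no matter what
    if u < relax_rate + overkill_rate:
        return 0  # overkill: report "toxic" no matter what
    return respond(paraphrase)

# e.g., the R-10% setting corresponds to noisy_respond(y, relax_rate=0.10)
```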
6 Conclusion
In this paper, we provide a new Stackelberg-game perspective on text detoxification using LLM on non-parallel data, where the LLM acts as the leader, aiming to determine its model weights such that the paraphrased texts will pass the toxicity screener. Such a Stackelberg formulation spares one from collecting human preference labeling and enables the LLM to learn directly from the screener’s response. We propose Stackelberg response optimization (SRO) to solve the optimal leader’s policy for LLM detoxification. Experimental results indicate that our SRO achieves superior overall performance over other competing models and matches human performance. Of particular note is that SRO’s success depends largely on the screener’s accurate feedback; a slight perturbation in the response leads to significant performance degradation.
References
- Azar et al. (2023) Mohammad Gheshlaghi Azar, Mark Rowland, Bilal Piot, Daniel Guo, Daniele Calandriello, Michal Valko, and Rémi Munos. 2023. A general theoretical paradigm to understand learning from human preferences.
- Bevendorff et al. (2024) Janek Bevendorff, Xavier Bonet Casals, Berta Chulvi, Daryna Dementieva, Ashaf Elnagar, Dayne Freitag, Maik Fröbe, Damir Korenčić, Maximilian Mayerl, Animesh Mukherjee, Alexander Panchenko, Martin Potthast, Francisco Rangel, Paolo Rosso, Alisa Smirnova, Efstathios Stamatatos, Benno Stein, Mariona Taulé, Dmitry Ustalov, Matti Wiegmann, and Eva Zangerle. 2024. Overview of pan 2024: Multi-author writing style analysis, multilingual text detoxification, oppositional thinking analysis, and generative ai authorship verification: Extended abstract. In Advances in Information Retrieval: 46th European Conference on Information Retrieval, ECIR 2024, Glasgow, UK, March 24–28, 2024, Proceedings, Part VI, page 3–10, Berlin, Heidelberg. Springer-Verlag.
- Bubeck et al. (2023) Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, John A. Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuan-Fang Li, Scott M. Lundberg, Harsha Nori, Hamid Palangi, Marco Tulio Ribeiro, and Yi Zhang. 2023. Sparks of artificial general intelligence: Early experiments with gpt-4. ArXiv, abs/2303.12712.
- Chakraborty et al. (2024) Souradip Chakraborty, Amrit Singh Bedi, Alec Koppel, Dinesh Manocha, Huazheng Wang, Mengdi Wang, and Furong Huang. 2024. Parl: A unified framework for policy alignment in reinforcement learning from human feedback.
- Chen et al. (2024) Zixiang Chen, Yihe Deng, Huizhuo Yuan, Kaixuan Ji, and Quanquan Gu. 2024. Self-play fine-tuning converts weak language models to strong language models.
- cjadams et al. (2019) cjadams, Daniel Borkan, inversion, Jeffrey Sorensen, Lucas Dixon, Lucy Vasserman, and nithum. 2019. Jigsaw unintended bias in toxicity classification.
- cjadams et al. (2017) cjadams, Jeffrey Sorensen, Julia Elliott, Lucas Dixon, Mark McDonald, nithum, and Will Cukierski. 2017. Toxic comment classification challenge.
- Dale et al. (2021) David Dale, Anton Voronov, Daryna Dementieva, Varvara Logacheva, Olga Kozlova, Nikita Semenov, and Alexander Panchenko. 2021. Text detoxification using large pre-trained neural models. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7979–7996, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
- Ethayarajh et al. (2024) Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, and Douwe Kiela. 2024. Kto: Model alignment as prospect theoretic optimization.
- Fu et al. (2018) Zhenxin Fu, Xiaoye Tan, Nanyun Peng, Dongyan Zhao, and Rui Yan. 2018. Style transfer in text: Exploration and evaluation. Proceedings of the AAAI Conference on Artificial Intelligence, 32(1).
- Gentile and Warmuth (1998) Claudio Gentile and Manfred K. K Warmuth. 1998. Linear hinge loss and average margin. In Advances in Neural Information Processing Systems, volume 11. MIT Press.
- He et al. (2020) Junxian He, Xinyi Wang, Graham Neubig, and Taylor Berg-Kirkpatrick. 2020. A probabilistic formulation of unsupervised text style transfer. In International Conference on Learning Representations.
- Kivlichan et al. (2020) Ian Kivlichan, Jeffrey Sorensen, Lucy Vasserman, Julia Elliott, Martin Görner, and Phil Culliton. 2020. Jigsaw multilingual toxic comment classification.
- Laugier et al. (2021) Léo Laugier, John Pavlopoulos, Jeffrey Sorensen, and Lucas Dixon. 2021. Civil rephrases of toxic texts with self-supervised transformers. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 1442–1461, Online. Association for Computational Linguistics.
- Lee (2020) Joosung Lee. 2020. Stable style transformer: Delete and generate approach with encoder-decoder for text style transfer. In Proceedings of the 13th International Conference on Natural Language Generation, pages 195–204, Dublin, Ireland. Association for Computational Linguistics.
- Li et al. (2018) Juncen Li, Robin Jia, He He, and Percy Liang. 2018. Delete, retrieve, generate: a simple approach to sentiment and style transfer. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1865–1874, New Orleans, Louisiana. Association for Computational Linguistics.
- Li et al. (2022a) Tao Li, Guanze Peng, Quanyan Zhu, and Tamer Başar. 2022a. The Confluence of Networks, Games, and Learning: A Game-Theoretic Framework for Multiagent Decision Making Over Networks. IEEE Control Systems, 42(4):35–67.
- Li et al. (2022b) Tao Li, Yuhan Zhao, and Quanyan Zhu. 2022b. The role of information structures in game-theoretic multi-agent learning. Annual Reviews in Control, 53:296–314.
- Li and Zhu (2024) Tao Li and Quanyan Zhu. 2024. Symbiotic game and foundation models for cyber deception operations in strategic cyber warfare. arXiv preprint arXiv:2403.10570.
- Logacheva et al. (2022) Varvara Logacheva, Daryna Dementieva, Sergey Ustyantsev, Daniil Moskovskiy, David Dale, Irina Krotova, Nikita Semenov, and Alexander Panchenko. 2022. ParaDetox: Detoxification with parallel data. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6804–6818, Dublin, Ireland. Association for Computational Linguistics.
- Luo et al. (2019) Fuli Luo, Peng Li, Jie Zhou, Pengcheng Yang, Baobao Chang, Xu Sun, and Zhifang Sui. 2019. A dual reinforcement learning framework for unsupervised text style transfer. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, pages 5116–5122. International Joint Conferences on Artificial Intelligence Organization.
- Munos et al. (2024) Rémi Munos, Michal Valko, Daniele Calandriello, Mohammad Gheshlaghi Azar, Mark Rowland, Zhaohan Daniel Guo, Yunhao Tang, Matthieu Geist, Thomas Mesnard, Andrea Michi, Marco Selvi, Sertan Girgin, Nikola Momchev, Olivier Bachem, Daniel J. Mankowitz, Doina Precup, and Bilal Piot. 2024. Nash learning from human feedback.
- Nogueira dos Santos et al. (2018) Cicero Nogueira dos Santos, Igor Melnyk, and Inkit Padhi. 2018. Fighting offensive language on social media with unsupervised text style transfer. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 189–194, Melbourne, Australia. Association for Computational Linguistics.
- Ouyang et al. (2022) Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. 2022. Training language models to follow instructions with human feedback.
- Rafailov et al. (2024) Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. 2024. Direct preference optimization: Your language model is secretly a reward model.
- Shen et al. (2017a) Tianxiao Shen, Tao Lei, Regina Barzilay, and Tommi Jaakkola. 2017a. Style transfer from non-parallel text by cross-alignment. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, page 6833–6844, Red Hook, NY, USA. Curran Associates Inc.
- Shen et al. (2017b) Tianxiao Shen, Tao Lei, Regina Barzilay, and Tommi Jaakkola. 2017b. Style transfer from non-parallel text by cross-alignment.
- Swamy et al. (2024) Gokul Swamy, Christoph Dann, Rahul Kidambi, Zhiwei Steven Wu, and Alekh Agarwal. 2024. A minimaximalist approach to reinforcement learning from human feedback.
- Tran et al. (2020) Minh Tran, Yipeng Zhang, and Mohammad Soleymani. 2020. Towards a friendly online community: An unsupervised style transfer framework for profanity redaction. In Proceedings of the 28th International Conference on Computational Linguistics, pages 2107–2114, Barcelona, Spain (Online). International Committee on Computational Linguistics.
- Villate-Castillo et al. (2024) Guillermo Villate-Castillo, Javier Del Ser Lorente, Borja Sanz Urquijo, et al. 2024. A systematic review of toxicity in large language models: Definitions, datasets, detectors, detoxification methods and challenges. PREPRINT.
- Warstadt et al. (2019) Alex Warstadt, Amanpreet Singh, and Samuel R. Bowman. 2019. Neural Network Acceptability Judgments. Transactions of the Association for Computational Linguistics, 7:625–641.
- Wieting et al. (2019) John Wieting, Taylor Berg-Kirkpatrick, Kevin Gimpel, and Graham Neubig. 2019. Beyond BLEU: Training neural machine translation with semantic similarity. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4344–4355, Florence, Italy. Association for Computational Linguistics.
- Wieting and Gimpel (2018a) John Wieting and Kevin Gimpel. 2018a. ParaNMT-50M: Pushing the limits of paraphrastic sentence embeddings with millions of machine translations. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 451–462, Melbourne, Australia. Association for Computational Linguistics.
- Wieting and Gimpel (2018b) John Wieting and Kevin Gimpel. 2018b. Paranmt-50m: Pushing the limits of paraphrastic sentence embeddings with millions of machine translations.
- Wu et al. (2019) Xing Wu, Tao Zhang, Liangjun Zang, Jizhong Han, and Songlin Hu. 2019. Mask and infill: Applying masked language model for sentiment transfer. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, pages 5271–5277. International Joint Conferences on Artificial Intelligence Organization.