
Learning from Response not Preference: A Stackelberg Approach for LLM Detoxification using Non-parallel Data

Xinhong Xie1 Tao Li2 Quanyan Zhu2
1Department of Computer Science and Engineering
The Pennsylvania State University
2Department of Electrical and Computer Engineering
New York University
Correspondence: [email protected]
Abstract

Text detoxification, a variant of style transfer tasks, finds useful applications in online social media. This work presents a fine-tuning method that only uses non-parallel data to turn a large language model (LLM) into a detoxification rewriter. We model the fine-tuning process as a Stackelberg game between an LLM (leader) and a toxicity screener (follower), which is a binary style classifier (toxic or non-toxic). The LLM aims to align its preference according to the screener and generate paraphrases passing the screening. The primary challenge of non-parallel data fine-tuning is incomplete preference. In the case of unsuccessful paraphrases, the classifier cannot establish a preference between the input and paraphrase, as they belong to the same toxic style. Hence, preference-alignment fine-tuning methods, such as direct preference optimization (DPO), no longer apply. To address the challenge of incomplete preference, we propose Stackelberg response optimization (SRO), adapted from DPO, to enable the LLM to learn from the follower's response. The gist is that SRO decreases the likelihood of generating the paraphrase if it fails the follower's screening while performing DPO on the pair of the toxic input and its paraphrase when the latter passes the screening. Experiments indicate that the SRO-fine-tuned LLM achieves satisfactory performance comparable to state-of-the-art models regarding style accuracy, content similarity, and fluency. The overall detoxification performance surpasses other competing methods and matches the human reference. Additional empirical evidence suggests that SRO is sensitive to the screener's feedback, and a slight perturbation leads to a significant performance drop. We release the code and LLM models at https://github.com/XXXinhong/Detoxification_LLM.

1 Introduction

Identification of toxicity and other undesirable content in user-generated texts is an active research area in NLP Bevendorff et al. (2024). As a proactive countermeasure (besides deletion), the task of automatically rewriting/rephrasing toxic content has received increasing attention from the NLP community Villate-Castillo et al. (2024).

Most existing works on text detoxification cast the problem as a variant of style transfer: the task of changing the style of a text while preserving its content. The style of a text refers to its characteristics, such as sentiment, level of formality, and political inclination Dale et al. (2021). As a sequence-to-sequence task, style transfer can employ an encoder-decoder model trained on parallel data, i.e., a parallel corpus of toxic sentences and their non-toxic paraphrases Wieting and Gimpel (2018a). Despite recent efforts to build parallel datasets Logacheva et al. (2022), the existing parallel data collection is still in its infancy. As an alternative, style transfer models can also utilize non-parallel data. Prior works train encoder-decoder models on non-parallel data and push the decoder toward the target style with the help of style classifiers Lee (2020).

Figure 1: An illustration of Stackelberg Response Optimization (SRO), which addresses incomplete preference in detoxification tasks. For a toxic input $x$, when the paraphrase $y$ passes the screening, SRO performs DPO on the pair $(x,y)$; otherwise, it decreases the likelihood of generating $y$.

This work considers LLM-based detoxification models trained on non-parallel data. Without parallel data, few-shot fine-tuning does not apply since concrete demonstrations are not available to LLMs. Hence, we consider a different perspective on text detoxification: casting the problem as an alignment problem, where an LLM needs to be fine-tuned to align with a given preference (e.g., non-toxic over toxic contents). Such preference alignment has gained soaring popularity in LLM research as it addresses the LLM’s steerability issue, aiming to achieve precise control of LLM’s behaviors. Such fine-tuning requires human preference labeling, which is instrumental in reinforcement learning from human feedback (RLHF) pipelines Ouyang et al. (2022). However, such a preference relation may not be immediately available in detoxification. Given a toxic text input, when the LLM generates two outputs that are still toxic, it is unclear which one is less toxic since both belong to the same style. In practice, we also observe that the human-labeled non-parallel data only involves binary labels (toxic or non-toxic) without explicit preference relations, making it challenging to rank texts from the same class. Consequently, RLHF may not apply due to the lack of clear preferences.

From the example above, the key challenge of applying LLMs to detoxification (or style transfer in general) from the alignment viewpoint is that no preference exists among texts belonging to the same style. Therefore, LLMs need to learn from the outcome of style transfer, i.e., the feedback from the style screening (either success or failure), rather than preference labeling.

We model the LLM's fine-tuning process as a Stackelberg game between the LLM and a style classifier trained from human-labeled data. As illustrated in Figure 1, when the LLM (leader) receives a toxic text from the training dataset, it first generates a paraphrase and sends it to the screener (follower), who responds with a binary outcome: either success or failure. Based on this response, the LLM updates its model weights by applying the gradient of our proposed Stackelberg loss. The intuition behind this loss is that 1) if the detoxification is successful, then the pair of the input text and the LLM-generated text presents a clear preference, which is perfectly handled by the original DPO pipeline, and the Stackelberg loss reduces to the DPO loss; 2) otherwise, only the failed transfer is utilized to compute the loss, which penalizes the LLM for such paraphrasing. The key difference between our proposed fine-tuning method and RLHF is that the LLM learns from the classifier's response to its generated text without requiring additional preference labeling. We refer to the proposed fine-tuning method as Stackelberg Response Optimization. Our contributions are as follows.

  • We model LLM fine-tuning for detoxification as a Stackelberg game between the LLM and a style screener. Such a Stackelberg formulation spares one from parallel data and human labeling.

  • We propose Stackelberg Response Optimization (SRO), adapted from DPO, to address the issue of missing preferences in RLHF.

  • We conduct an evaluation of the fine-tuned LLM and compare it with a number of state-of-the-art models on text detoxification tasks. Our proposed SRO achieves superior overall performance (a joint metric synthesizing style accuracy, content similarity, and fluency) over other competing models and matches the human reference.

  • We examine SRO’s robustness to the screener. Empirical evidence suggests that SRO is highly sensitive to the screener’s feedback. A slight perturbation in screening significantly degrades the detoxification performance.

2 Related Works

Style transfer and detoxification

Prior works in style transfer consider encoder-decoder models on non-parallel data while using adversarial classifiers to steer the decoder toward the desired style Shen et al. (2017a). To address the challenge of lacking parallel data, researchers have employed methods such as reinforcement learning Luo et al. (2019), amortized variational inference He et al. (2020), and utilizing style transfer classifiers to train models that convert texts from one style to another and vice versa Lee (2020).

Detoxification specifically involves converting toxic language into neutral language using datasets that are not directly paired but categorized for toxicity, treating toxic and neutral texts as distinct groups (non-parallel data). Several studies use datasets like the Jigsaw series Laugier et al. (2021) or create their own from social media platforms for this purpose Kivlichan et al. (2020); cjadams et al. (2017, 2019).

Prior detoxification models often borrow methodologies from other domains Shen et al. (2017b) and Fu et al. (2018), using autoencoders Nogueira dos Santos et al. (2018) with style classification and cycle-consistency losses, or fine-tuning denoising autoencoders like T5 Laugier et al. (2021). Some models apply direct corrections to specific words Tran et al. (2020); Wu et al. (2019) and then enhance text fluency with sequence-to-sequence models. Similarly, Dale et al. (2021) uses masked language models for targeted edits.

Preference alignment and fine-tuning of LLM

LLMs have shown great capabilities in generative tasks across various domains over the past few years Bubeck et al. (2023); Li and Zhu (2024), and their encouraging success relies on precise control of LLMs' output through alignment fine-tuning. One of the popular methods for gaining such control is reinforcement learning from human feedback Ouyang et al. (2022). By training a reward model on human-labeled data, one can use the reward model to fine-tune LLMs with RL algorithms.

To avoid the bias that exists in the reward model, Rafailov et al. (2024) develop Direct Preference Optimization (DPO), which does not rely on a separate reward model to calculate the loss; instead, DPO calculates the loss directly from the probabilities the policy and the reference model assign to preference pairs. In addition to DPO, there are also emerging techniques such as Identity Preference Optimisation (IPO) Azar et al. (2023), which avoids the Bradley-Terry modeling assumption to mitigate overfitting, and Kahneman-Tversky Optimization (KTO) Ethayarajh et al. (2024), which applies human-aware losses (HALOs) to directly maximize the utility of generations instead of the likelihood of preferences.

Game-theoretic RLHF

Multi-agent reinforcement learning offers a robust theoretical grounding for examining learning agents that interact and need to adjust their strategies in response Li et al. (2022a, b). Recent developments in game-theoretic methods tailored to RLHF include Self-Play Preference Optimization (SPO) Swamy et al. (2024), which utilizes self-play to compare win rates across different trajectories. Similarly, Self-Play fIne-tuNing (SPIN) Chen et al. (2024) employs self-play to create synthetic high-quality data by differentiating self-generated continuations from those generated by humans. Both approaches utilize a consistent reward model akin to traditional RLHF. Additionally, Nash-RLHF Munos et al. (2024) introduces a competitive setup involving two language models, each aiming to gain favor from a human evaluator.

Closely related to our Stackelberg formulation, Chakraborty et al. (2024) also consider preference alignment as a Stackelberg game, where the reward model acts as the leader and the RL agent as the follower. However, their reward model is still built on complete preference relations, whereas our work focuses on the incomplete preference resulting from the binary feedback of the screener.

3 Preliminary

Notations

Consider a non-parallel dataset denoted by $D$, which consists of a blend of toxic and non-toxic texts (sentences). We denote by $x\in D$ a typical text from the dataset. In our text detoxification task, such a text $x$ is often paired with a prompt instructing the LLM to rewrite the sentence. For example, the prompt used in our experiment is "paraphrase the following: $x$." The LLM, despite its intricate inner workings, is represented by a function $f_{\theta}(\cdot)$ that maps the input text $x$ to a rephrased text $y$, where $\theta\in\Theta$ denotes the LLM's model weights.

The style classifier, performing binary classification, maps the rephrased text $y$ to a label $a\in\{0,1\}$. Since the classifier plays the role of the follower in the Stackelberg game, one can think of the label as the follower's action or response to the LLM's paraphrasing. Denoting the classifier by $g(y)$, its definition is given by

a=g(y)\triangleq\begin{cases}1 & \text{if } y \text{ is non-toxic},\\ 0 & \text{otherwise}.\end{cases}\qquad(1)

The sequel also uses $f_{\theta}(y|x)$ and $g(a|y)$ to represent the dependencies among variables.

Stackelberg game

To prepare the reader for the Stackelberg detoxification game introduced later, we use the notations above to review some basics. Beginning with a payoff-relevant state $x$ (which affects the players' rewards/costs), the leader determines its parameterized policy $f_{\theta}$ that yields an action $y=f_{\theta}(x)$ upon receiving the state. Once observing the leader's action $y$, the follower responds by choosing an action $a$. The follower's objective is to find a response policy $a=g(y)$ such that its reward $v(x,y,a)$ is maximized. The follower's best response given $x,y$ is $a^{*}=\arg\max_{a}v(x,y,a)\triangleq\operatorname{BR}(f_{\theta}(x))$. Aware of the follower's objective and best response, the leader seeks the optimal policy that maximizes its reward (minimizes its cost) by soliciting the desired follower's action: $\max_{\theta}u(x,a^{*})=u(x,\operatorname{BR}(f_{\theta}(x)))$.
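To make the leader-follower structure concrete, the following minimal sketch (our own illustration, not part of the paper's implementation) instantiates the interaction with a one-dimensional toy game; the names leader_policy, follower_best_response, and leader_cost are hypothetical.

```python
import numpy as np

# Toy Stackelberg game: the leader commits to a scalar policy parameter theta;
# the follower observes the leader's action y = f_theta(x) and best-responds.

def leader_policy(theta, x):
    # Leader's parameterized policy f_theta: maps state x to action y.
    return theta * x

def follower_best_response(x, y):
    # Follower maximizes v(x, y, a) = -(a - y)^2 over a in {0, 1},
    # i.e., it picks the binary action closest to the leader's action.
    return int(abs(1 - y) < abs(0 - y))

def leader_cost(x, a):
    # Leader wants the follower to play a = 1 (analogous to "pass the screening").
    return 0.0 if a == 1 else 1.0

# The leader evaluates candidate policies while anticipating the follower's best response.
x = 0.8
for theta in np.linspace(0.0, 2.0, 5):
    y = leader_policy(theta, x)
    a_star = follower_best_response(x, y)
    print(f"theta={theta:.1f}  y={y:.2f}  follower a*={a_star}  leader cost={leader_cost(x, a_star)}")
```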

Direct preference optimization

DPO is a direct alternative to reinforcement learning from human feedback (RLHF), which consists of two main parts: the reward model $r_{\phi}$ and the target language model $\pi_{\theta}$. The reward model is optimized with the loss $\mathcal{L}_{\phi}=-\mathbb{E}_{(x,y_{w},y_{l})\sim D}\left[\log\sigma(r_{\phi}(x,y_{w})-r_{\phi}(x,y_{l}))\right]$, where $D$ is the dataset of preference pairs $(x,y_{w},y_{l})$, with $x$ an input and $y_{w}$, $y_{l}$ the preferred and dispreferred outputs, and $\sigma$ is the sigmoid function. With a decent reward model $r_{\phi}$, one can then fine-tune the target language model $\pi_{\theta}$ by maximizing $\mathbb{E}_{x\sim D,\,y\sim\pi_{\theta}}\left[r_{\phi}(x,y)\right]-\beta D_{KL}(\pi_{\theta}(y|x)\,\|\,\pi_{\text{ref}}(y|x))$.

DPO has been gaining much popularity since it does not require a reward model and directly fine-tunes the LLM using preference pairs. The DPO loss is given by

\mathcal{L}_{\mathrm{DPO}}(f_{\theta};f_{\text{ref}})=-\mathbb{E}_{(x,y_{w},y_{l})\sim D}\left[\log\sigma\left(\beta\log\frac{f_{\theta}(y_{w}\mid x)}{f_{\text{ref}}(y_{w}\mid x)}-\beta\log\frac{f_{\theta}(y_{l}\mid x)}{f_{\text{ref}}(y_{l}\mid x)}\right)\right].

To understand how DPO works, one can inspect the gradient of DPO loss, which is given below.

\nabla_{\theta}\mathcal{L}(\theta)=-\mathbb{E}_{(x,y_{w},y_{l})\sim D}\Big[\underbrace{\beta\sigma\big(\hat{r}(y_{l},x)-\hat{r}(y_{w},x)\big)}_{\mathrm{I}}\big[\underbrace{\nabla_{\theta}\log f_{\theta}(y_{w}\mid x)}_{\mathrm{II}}-\underbrace{\nabla_{\theta}\log f_{\theta}(y_{l}\mid x)}_{\mathrm{III}}\big]\Big]

The first term in the gradient increases when the reward estimation is wrong, so it helps the LLM converge quickly. The remaining terms increase the likelihood of $y_{w}$ (the preferred text) and decrease the likelihood of $y_{l}$ (the dispreferred text).
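As a reference point for the adaptation in Section 4, here is a minimal PyTorch sketch of the DPO loss computed from per-sequence log-probabilities under the policy and the frozen reference model; the function name, tensor layout, and example numbers are our own assumptions rather than the paper's implementation.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss from summed log-probabilities of the preferred (w)
    and dispreferred (l) responses under the policy f_theta and reference f_ref."""
    # Implicit rewards: h_theta^beta(y|x) = beta * log f_theta(y|x) / f_ref(y|x)
    h_w = beta * (policy_logp_w - ref_logp_w)
    h_l = beta * (policy_logp_l - ref_logp_l)
    # -log sigma(h_w - h_l), averaged over the batch
    return -F.logsigmoid(h_w - h_l).mean()

# Example with dummy per-sequence log-probabilities (batch of 2)
policy_w = torch.tensor([-12.3, -9.8])
policy_l = torch.tensor([-11.0, -10.5])
ref_w = torch.tensor([-12.9, -10.1])
ref_l = torch.tensor([-10.8, -10.2])
print(dpo_loss(policy_w, policy_l, ref_w, ref_l))
```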

4 Stackelberg Learning from Response

4.1 Detoxification as a Stackelberg game

Consider the text detoxification task where the LLM seeks to set the model weights $\theta$ such that the paraphrased output $y\sim f_{\theta}(\cdot|x)$ is non-toxic. In other words, it aims to generate $y$ to pass the style classifier's screening, i.e., the desired classification outcome is $a=1$.

The LLM's cost function $u:D\times A\rightarrow\{0,1\}$, specified in Section 4.2, encourages the LLM to generate successful paraphrases that pass the screening while penalizing failed attempts. Since the style classifier's objective is to report the underlying style accurately, one can define the style classifier's cost function as the classification loss. This work considers an SVM-based classifier and uses the hinge loss Gentile and Warmuth (1998). As discussed in the preliminaries, we can treat the trained SVM as the best-response mapping.

The interaction between the LLM and the SVM unfolds as follows. First, the LLM takes as input a text $x$ randomly sampled from the dataset $D$. Then, the LLM determines a paraphrase $y=f_{\theta}(x)$, which is sent to the SVM. Finally, the SVM best responds with a classification outcome $a$, which determines the LLM's utility. The equilibrium problem for the LLM, the leader, is given by

\min_{\theta\in\Theta}\sum_{x\in D}u(x,a^{*})\quad\text{s.t. }a^{*}=\operatorname{BR}(f_{\theta}(x)),\qquad(2)

where $a^{*}$ simply corresponds to the classification output. Naturally, the solution to (2) returns the optimal LLM model for detoxification. We introduce Stackelberg response optimization, adapted from DPO, to approximately solve for the optimal solution.

4.2 Stackelberg Response Optimization

Given a preference pair $(y_{w},y_{l})$, DPO aims to solve the following minimization problem

\min_{\theta\in\Theta}-\mathbb{E}_{(x,y_{w},y_{l})\sim D}\left[\log\sigma\left(h^{\beta}_{\theta}(y_{w}|x)-h^{\beta}_{\theta}(y_{l}|x)\right)\right],

where we define $h^{\beta}_{\theta}(y|x)\triangleq\beta\log\frac{f_{\theta}(y\mid x)}{f_{\text{ref}}(y\mid x)}$. In the detoxification task, whether preference pairs are available depends on the screening outcome. If the paraphrase $y$ passes the screening, then $(y,x)$ constitutes a preference pair, which can be handled by DPO. However, if $y$ fails the screening, then both $y$ and $x$ are dispreferred without a clear preference relation.

We propose the following objective function for the LLM leader, which is a hybrid of two loss functions corresponding to the cases of passing and failing the screening, respectively.

\min_{\theta\in\Theta}\mathbb{E}_{x\sim D}\,u(x,a^{*})\triangleq\mathcal{L}(\theta)=-\mathbb{E}_{x\sim D}\left[\log\sigma\Big(a^{*}\big(h^{\beta}_{\theta}(y|x)-h^{\beta}_{\theta}(x|x)\big)+(1-a^{*})\big(-h^{\beta}_{\theta}(y|x)\big)\Big)\right].

Pass

When the paraphrase passes the screening, i.e., $a^{*}=1$, we adopt the standard DPO loss:

\mathcal{L}(\theta)=-\mathbb{E}_{x\sim D}\left[\log\sigma\left(h^{\beta}_{\theta}(y|x)-h^{\beta}_{\theta}(x|x)\right)\right].

We simply substitute $y_{l}$ with $x$, the input of the model. In our setup, the model thus prefers the text $y$ that passes the screening over the toxic input $x$.

The gradient of the loss is

\nabla_{\theta}\mathcal{L}(\theta)=-\mathbb{E}_{x\sim D}\Big[\beta\sigma\big(h^{\beta}_{\theta}(x|x)-h^{\beta}_{\theta}(y|x)\big)\big[\nabla_{\theta}\log f_{\theta}(y\mid x)-\nabla_{\theta}\log f_{\theta}(x\mid x)\big]\Big].

Following the same argument as in Rafailov et al. (2024), $\hat{r}(x,y)=\beta\log\frac{f_{\theta}(y\mid x)}{f_{\text{ref}}(y\mid x)}$ is the implicit reward function defined by the LLM and the reference model. Then, substituting $h^{\beta}_{\theta}(x|x)$ with $\hat{r}(x,x)$ and $h^{\beta}_{\theta}(y|x)$ with $\hat{r}(y,x)$, the gradient of the pass loss can be written as

\nabla_{\theta}\mathcal{L}(\theta)=-\mathbb{E}_{x\sim D}\Big[\underbrace{\beta\sigma\big(\hat{r}(x,x)-\hat{r}(y,x)\big)}_{\mathrm{I}}\big[\underbrace{\nabla_{\theta}\log f_{\theta}(y\mid x)}_{\mathrm{II}}-\underbrace{\nabla_{\theta}\log f_{\theta}(x\mid x)}_{\mathrm{III}}\big]\Big].

The first term in the gradient increases when the reward estimation is wrong (i.e., the reward difference is supposed to be negative). It serves as a coefficient adjusting the gradient's magnitude and thus helps the LLM adapt quickly: when the reward estimation is wrong, the magnitude of the gradient is larger, and the LLM model weights change significantly. The remaining terms determine the gradient direction, increasing the likelihood of $y$ (the successful paraphrase) and decreasing the likelihood of $x$ (the original input text). In summary, the gradient increases the likelihood of generating successful paraphrases.

Fail

For those failed paraphrases, we apply the following loss to the model, which we call partial loss since it only keeps a part of the DPO loss.

\mathcal{L}(\theta)=-\mathbb{E}_{x\sim D}\left[\log\sigma\left(-h^{\beta}_{\theta}(y|x)\right)\right].

Intuitively speaking, the partial loss does not indicate anything to prefer, only something to disprefer. Thus, the loss leads the model to develop a dispreference for the output $y$. Below is the gradient of the partial loss.

-\mathbb{E}_{x\sim D}\Big[-\beta\sigma\big(h^{\beta}_{\theta}(y|x)\big)\nabla_{\theta}\log f_{\theta}(y\mid x)\Big].

Similarly, we can substitute $h^{\beta}_{\theta}(y|x)$ with $\hat{r}(y,x)$ and obtain the following.

\nabla_{\theta}\mathcal{L}(\theta)=-\mathbb{E}_{x\sim D}\Big[-\underbrace{\beta\sigma\big(\hat{r}(y,x)\big)}_{\mathrm{I}}\,\underbrace{\nabla_{\theta}\log f_{\theta}(y\mid x)}_{\mathrm{II}}\Big].

In this gradient, the first term serves as the coefficient adjusting the gradient's magnitude. If the reward estimation is wrong (i.e., $\hat{r}(y,x)$ is large), then the gradient update significantly changes the model weights. The second term determines the gradient direction, decreasing the likelihood of generating $y$.
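Combining the pass and fail branches, the Stackelberg response loss can be implemented roughly as in the PyTorch sketch below. It mirrors the hybrid objective above; the function name, tensor layout, and dummy values are our own assumptions, not the released code.

```python
import torch
import torch.nn.functional as F

def sro_loss(logp_y, ref_logp_y, logp_x, ref_logp_x, a_star, beta=0.01):
    """Stackelberg response optimization loss (a sketch).
    logp_y / ref_logp_y: log f_theta(y|x) and log f_ref(y|x) for the paraphrase y.
    logp_x / ref_logp_x: log f_theta(x|x) and log f_ref(x|x) for the toxic input x.
    a_star: screener response, 1 = paraphrase passed, 0 = failed."""
    h_y = beta * (logp_y - ref_logp_y)   # h_theta^beta(y|x)
    h_x = beta * (logp_x - ref_logp_x)   # h_theta^beta(x|x)
    a = a_star.float()
    # Pass branch: DPO loss on the pair (y preferred over x).
    # Fail branch: partial loss that only pushes down the likelihood of y.
    logits = a * (h_y - h_x) + (1.0 - a) * (-h_y)
    return -F.logsigmoid(logits).mean()

# Dummy batch: the first example passed the screening, the second failed.
logp_y = torch.tensor([-10.0, -9.0]); ref_logp_y = torch.tensor([-10.4, -9.5])
logp_x = torch.tensor([-8.0, -7.5]);  ref_logp_x = torch.tensor([-8.1, -7.4])
a_star = torch.tensor([1, 0])
print(sro_loss(logp_y, ref_logp_y, logp_x, ref_logp_x, a_star))
```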

5 Experiment

5.1 Toxicity Screener

Our style classification model was developed using a dataset from the Jigsaw Toxic Comment Classification Challenge cjadams et al. (2017), which contains over 120,000 non-toxic sentences and 14,000 toxic sentences. Following the data split in Logacheva et al. (2022), we allocate 95% of the dataset for training and reserve 5% for testing. We employ TF-IDF features for sentence representation and an SVM for classification to form the response. Ultimately, the screening model achieved an accuracy of 98.6% on the test data.
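A minimal scikit-learn sketch of such a screener (TF-IDF features plus a linear SVM that yields the follower's response $g(y)$) might look as follows; the file path, column names, and label encoding are placeholders rather than the exact preprocessing used here.

```python
# Sketch of the toxicity screener: TF-IDF features + linear SVM.
# Assumes a dataframe with columns "comment_text" and "toxic" (placeholder names).
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

df = pd.read_csv("jigsaw_toxic_comments.csv")          # placeholder path
X_train, X_test, y_train, y_test = train_test_split(
    df["comment_text"], df["toxic"], test_size=0.05, random_state=0)

screener = make_pipeline(TfidfVectorizer(min_df=2), LinearSVC())
screener.fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, screener.predict(X_test)))

# Follower's best response g(y): 1 if the paraphrase is non-toxic, 0 otherwise.
def g(paraphrase: str) -> int:
    return int(screener.predict([paraphrase])[0] == 0)  # label 0 = non-toxic under this encoding
```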

5.2 SRO Implementation

We fine-tune the pre-trained paraphrase T5 model using the toxic sentences from the Jigsaw Toxic dataset cjadams et al. (2017). The training runs for 15 epochs with a batch size of 16. The learning rate is $2\times 10^{-5}$, and the weight decay is set to 0.01. The hyper-parameter $\beta$ in SRO is 0.01.
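A skeleton of this fine-tuning loop, under the stated hyperparameters, might look roughly as below. It assumes the sro_loss and screener g sketched earlier, uses a generic T5 checkpoint name as a placeholder, and omits details such as batching, padding masks, and label cleaning.

```python
# Skeleton of SRO fine-tuning (a sketch; not the released training script).
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tok = T5Tokenizer.from_pretrained("t5-base")                       # placeholder checkpoint
model = T5ForConditionalGeneration.from_pretrained("t5-base").to(device)
ref = T5ForConditionalGeneration.from_pretrained("t5-base").to(device).eval()
opt = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)

def seq_logp(m, prompt_ids, target_ids):
    """Sum of per-token log-probabilities of target_ids given prompt_ids."""
    out = m(input_ids=prompt_ids, labels=target_ids)
    logp = torch.log_softmax(out.logits, dim=-1)
    return logp.gather(-1, target_ids.unsqueeze(-1)).squeeze(-1).sum(-1)

toxic_sentences = ["this whole article is bullshit."]              # toy stand-in for the toxic split
for epoch in range(15):
    for x in toxic_sentences:
        prompt = tok("paraphrase the following: " + x, return_tensors="pt").input_ids.to(device)
        y_ids = model.generate(prompt, max_new_tokens=64)
        y = tok.decode(y_ids[0], skip_special_tokens=True)
        a_star = torch.tensor([g(y)], device=device)               # screener's response

        x_ids = tok(x, return_tensors="pt").input_ids.to(device)
        with torch.no_grad():
            ref_y = seq_logp(ref, prompt, y_ids)
            ref_x = seq_logp(ref, prompt, x_ids)
        loss = sro_loss(seq_logp(model, prompt, y_ids), ref_y,
                        seq_logp(model, prompt, x_ids), ref_x,
                        a_star, beta=0.01)
        opt.zero_grad(); loss.backward(); opt.step()
```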

5.3 Competing Models

We compare our SRO models with other style transfer models and common baselines in the literature Dale et al. (2021).

  • Duplicate (baseline) – copy of the input.

  • Delete (baseline) – deletion of swear words.

  • BART-zero-shot (baseline) – BART model with no additional training.

  • Mask&Infill Wu et al. (2019) – BERT-based pointwise editing model.

  • Delete-Retrieve-Generate models Li et al. (2018): DRG-Template (replacement of toxic words with similar neutral words) and DRG-Retrieve (retrieval of non-toxic sentences with a similar sense) varieties.

  • DLSM He et al. (2020) – encoder-decoder model that uses amortized variational inference.

  • SST Lee (2020) – encoder-decoder model with the cross-entropy of a pre-trained style classifier as an additional discriminative loss.

  • CondBERT Dale et al. (2021) – BERT-based model with extra style and content control.

  • ParaGeDi Dale et al. (2021) – a model which enhances a paraphraser with style-informed LMs that re-weigh its output.

5.4 Metrics

We apply three third-party models and an LLM to evaluate our model, which is customary in many style transfer works Logacheva et al. (2022); Dale et al. (2021). Namely, we evaluate:

  • style accuracy (STA) – the percentage of non-toxic outputs identified by a style classifier trained on the Jigsaw data. Note that this testing classifier differs from the screener: it is built on a RoBERTa model and outputs a number ranging from 0 to 1, indicating the toxicity likelihood.

  • content preservation (SIM) – the cosine similarity between the embedding of the input text and that of the paraphrase Wieting et al. (2019). The embeddings are computed by an encoder trained on the ParaNMT corpus Wieting and Gimpel (2018b).

  • fluency (FL) – the percentage of fluent sentences identified by a RoBERTa-based classifier of linguistic acceptability trained on the CoLA dataset Warstadt et al. (2019).

Then, the joint metric J is computed by multiplying the three individual metrics: $\textbf{J}=\text{STA}\times\text{SIM}\times\text{FL}$.
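In code, the joint metric is a simple product of the three corpus-level scores; for instance, plugging in SRO's scores from Table 1 recovers its reported J.

```python
# Joint metric J = STA * SIM * FL, following the formula above.
def joint_metric(sta: float, sim: float, fl: float) -> float:
    return sta * sim * fl

# e.g., with the SRO scores reported in Table 1:
print(round(joint_metric(0.86, 0.84, 0.92), 2))   # ~0.66
```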

5.5 Results

Evaluation

Table 1 summarizes the performance of our proposed SRO, the baselines, and prior works. Our model, trained on the Jigsaw Toxic Comment Classification Challenge dataset, outperforms the other models in terms of the J metric. The baseline models (Delete and Duplicate) and the BART-zero-shot model return low J scores for the following reasons. The Delete model simply deletes the toxic words in the sentence, leading to very low FL (fluency) scores. The Duplicate model has a very low STA score because it does not introduce any modification to the input. The BART-zero-shot model, as a model pre-trained on a large corpus of text, can only handle tasks based on its prior knowledge and the instructions given in the prompt; thus, it cannot perform the detoxification task and yields a low STA score.

We now turn to the other advanced competing models. The DRG-Template and DRG-Retrieve models enjoy high STA scores because their first step deletes the tokens carrying toxicity, making the paraphrase pass the screening easily. However, DRG-Template returns low FL as it merely replaces the toxic words, and DRG-Retrieve gives low SIM because it retrieves a similar non-toxic sentence from the training dataset, where similarity is measured by the Euclidean distance between the embeddings of two sentences; a small Euclidean distance does not imply content preservation. The Mask&Infill model masks the toxic markers and infills them with similar non-toxic tokens. This practice maintains the basic structure of the input text, leading to a high SIM score; however, the infilling may introduce words that render the whole text incomprehensible, resulting in a low FL score. The DLSM and SST models perform full re-generation of the text (as opposed to word-wise corrections). More importantly, their decoders are trained from scratch on a relatively small dataset, leading to low fluency scores. The ParaGeDi model re-weighs logits to control the attribute of the output, which means it pays little attention to the context; therefore, it gives a relatively low FL score. The same problem occurs with CondBERT, which also re-weighs logits, so the FL of its generated text is relatively low.

In contrast to the models above that perform word/logit-wise corrections, our SRO model, fine-tuned on a text-paraphrase LLM, is more concerned with the context, yielding a high FL score. As shown in Table 1, our SRO achieves the highest FL scores while retaining satisfactory scores on STA and SIM. Of particular note, SRO outperforms all the advanced competing models in terms of SIM.

Table 1: Comparison of models on various metrics and the ablation tests on the loss design.
Model STA SIM FL J
Human reference 0.96 0.77 0.88 0.66
SRO 0.86 0.84 0.92 $\mathbf{0.66}$
PA-SRO 0.34 0.67 0.88 0.20
RS-SRO 0.71 0.78 $\mathbf{0.93}$ 0.51
Delete 0.81 0.93 0.64 0.46
Duplicate 0.02 1.00 0.91 0.02
DRG-Template 0.90 0.82 0.69 0.51
BART-zero-shot 0.01 0.99 0.92 0.01
Mask&Infill 0.91 0.82 0.63 0.48
CondBERT 0.98 0.77 0.82 0.62
SST 0.86 0.57 0.19 0.10
ParaGeDi 0.99 0.71 0.88 0.62
DLSM 0.76 0.76 0.52 0.25
DRG-Retrieve 0.97 0.36 0.86 0.31

Table 2 shows examples of different models' outputs; our model's results read more tactfully and retain the original meaning.

Table 2: Examples of detoxification by different models.
Input Output
Input no room for those stupid antics.
SRO No room for those bad exploits.
ParaGeDi There’s no space for all those adversities.
CondBERT no room for those . . . those bizarre exploits.
Mask&Infill no room for those compelling antics.
DRG-Template no room for those conventions antics.
Input zoological conspiracy theories (Arab-Israeli conflict), by contrast, sucks horribly.
SRO Zooological conspiracy theories (Arab-Israeli conflict), by contrast, is not good.
ParaGeDi By contrast, the zoological theories of conspiracy (the Arab-Israeli conflict) suffer horribly.
CondBERT zoological conspiracy theories (Arab-Israeli conflict), by contrast, worked out badly.
Mask&Infill zoological conspiracy theories (Arab-Israeli conflict), by contrast, ends horribly.
DRG-Template zoological conspiracy theories (Arab-Israeli conflict), by contrast, instrumental horribly.
Input this whole article is bullshit.
SRO This whole article is not that good.
ParaGeDi This article’s not a good deal.
CondBERT this whole article is considered incomplete.
Mask&Infill this whole article is compelling.
DRG-Template this whole article is substantial.

5.6 Ablation study on SRO loss

RS-SRO

The difference between SRO and DPO lies in the use of paraphrased text that fails the screening, for which no preference exists between the input and the output. To see how the failed attempts contribute to the success of SRO, we conduct an ablation study where SRO discards the failed outputs and only performs DPO on the successfully paraphrased instances. We refer to this variant as response-selective SRO (RS-SRO), since the leader (LLM) only considers positive responses $a^{*}=1$ while ignoring the negative ones. The loss function of RS-SRO is $\mathcal{L}(\theta)=-\mathbb{E}_{(x,y)\sim D}\left[\log\sigma\left(h^{\beta}_{\theta}(y|x)-h^{\beta}_{\theta}(x|x)\right)\right]\mathds{1}_{\{a^{*}=1\}}$. From Table 1, the performance of RS-SRO decreases relative to the SRO model (0.51 vs. 0.66 in J), which shows that the fail loss applied in SRO contributes to its success.

PA-SRO

The above experiment demonstrates the role played by negative feedback, which does not create a complete preference. One then naturally wonders whether SRO's treatment of failed paraphrases would also carry over to the successful ones. Even though the pair of the input and output reveals a preference relation, it is unclear whether such a relation is needed in SRO. In Stackelberg terms, the question is whether positive feedback with complete preference is necessary. To answer this question, we modify the SRO loss when $a^{*}=1$ and refer to the resulting fine-tuning as preference-agnostic SRO (PA-SRO). For PA-SRO, the pass branch no longer contains the $h^{\beta}_{\theta}(x|x)$ term. The loss function of PA-SRO is $\mathcal{L}(\theta)=-\mathbb{E}_{(x,y)\sim D}\left[\log\sigma\left(a^{*}h^{\beta}_{\theta}(y|x)-(1-a^{*})h^{\beta}_{\theta}(y|x)\right)\right]$. From the results in Table 1, PA-SRO's J score is significantly smaller than those of SRO and RS-SRO. In our loss design, the partial loss (fail) matters, but the DPO-type loss (pass) matters much more, as it creates a preference pair from which the model learns a direct preference instead of receiving only an indirect instruction.
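For concreteness, the two ablation losses can be written as small variants of the sro_loss sketch above (again our own paraphrase of the equations, with assumed function names, not the released code).

```python
import torch
import torch.nn.functional as F

def rs_sro_loss(logp_y, ref_logp_y, logp_x, ref_logp_x, a_star, beta=0.01):
    """Response-selective SRO: DPO loss on passing pairs only; failed outputs are discarded."""
    h_y = beta * (logp_y - ref_logp_y)
    h_x = beta * (logp_x - ref_logp_x)
    mask = a_star.float()                           # indicator 1{a* = 1}
    losses = -F.logsigmoid(h_y - h_x) * mask
    return losses.sum() / mask.sum().clamp(min=1.0)

def pa_sro_loss(logp_y, ref_logp_y, a_star, beta=0.01):
    """Preference-agnostic SRO: drops the h_theta^beta(x|x) term, so the pass branch
    only raises the likelihood of y and the fail branch only lowers it."""
    h_y = beta * (logp_y - ref_logp_y)
    a = a_star.float()
    logits = a * h_y - (1.0 - a) * h_y
    return -F.logsigmoid(logits).mean()
```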

Robustness to Misclassification

The style screener plays an instrumental part in the proposed SRO, providing feedback on the LLM's paraphrases. To investigate SRO's robustness against possible misclassification in style screening, we introduce various levels of misclassification to the screener and collect detoxification metrics when the LLM learns from possibly inaccurate responses.

We consider the following levels of misclassification: 10%, 20%, 50%, and 70%. For a given misclassification level, say 10%, the screener proceeds as follows. Given a paraphrased text from the LLM, the screener labels the text with the trained SVM with 90% probability, and classifies the text as non-toxic regardless of the true style with 10% probability. In other words, the introduced misclassification acts as a relaxation of the screening process, creating more preference pairs with possibly inaccurate preference relations. In this case, an originally unsuccessful paraphrase, together with the toxic input, creates a seemingly complete preference pair fed to SRO. By conducting such experiments, we can determine how vulnerable the LLM is to misleading preference relations. From Table 3, we see a huge gap between the J of the SRO model and that of SRO under 10% relaxation (denoted by R-10%). J then decreases slightly as the screener's reliability decreases further. We conclude that the credibility of the screener is extremely important for the performance of SRO. In our understanding, the misleading preference pairs greatly hinder the SRO model from converging and give it wrong information or preferences to learn, which greatly reduces learning efficiency because correcting such errors requires additional effort.

Table 3: Ablation test on the screener
Model STA SIM FL J
Human reference 0.96 0.77 0.88 0.66
SRO 0.86 0.84 0.92 $\mathbf{0.66}$
R-10% 0.53 0.78 0.92 0.38
R-20% 0.48 0.78 0.92 0.34
R-50% 0.42 0.74 0.88 0.27
R-70% 0.40 0.77 0.87 0.26
K-10% 0.50 0.67 0.92 0.30
K-20% 0.39 0.69 0.90 0.24
K-50% 0.35 0.66 0.90 0.20
K-70% 0.33 0.67 0.91 0.20

In contrast to relaxation, another type of misclassification is overkill; that is, the screener classifies the paraphrase as toxic regardless of its true style. In this case, overkill greatly reduces the portion of preference pairs in the responses. Consequently, the partial loss is employed more frequently in the fine-tuning process. Similar to the previous setup, we introduce 10%, 20%, 50%, and 70% overkill to the screener to investigate the LLM's robustness against such misclassification. The results in Table 3 show that performance under the overkill screener decreases as the overkill rate increases. Moreover, comparing the relaxation and overkill cases at the same rate, we conclude that the complete preference pairs obtained when passing the screening are more helpful than the partial loss in the failed cases.
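The perturbed screeners used in this robustness study can be simulated as thin wrappers around the trained classifier, roughly as sketched below; the wrapper names are ours, and g refers to the screener sketched in Section 5.1.

```python
import random

def relaxed_screener(paraphrase, rate=0.1):
    """Relaxation: with probability `rate`, report non-toxic (a = 1) regardless of the true style."""
    if random.random() < rate:
        return 1
    return g(paraphrase)          # otherwise use the trained SVM screener

def overkill_screener(paraphrase, rate=0.1):
    """Overkill: with probability `rate`, report toxic (a = 0) regardless of the true style."""
    if random.random() < rate:
        return 0
    return g(paraphrase)
```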

6 Conclusion

In this paper, we provide a new Stackelberg-game perspective on text detoxification with LLMs on non-parallel data, where the LLM acts as the leader, aiming to determine its model weights such that the paraphrased texts pass the toxicity screener. Such a Stackelberg formulation spares one from collecting human preference labels and enables the LLM to learn directly from the screener's response. We propose Stackelberg response optimization (SRO) to solve for the leader's optimal policy for LLM detoxification. Experimental results indicate that our SRO achieves superior overall performance over other competing models and matches human performance. Of particular note, SRO's success depends largely on the screener's accurate feedback; a slight perturbation in the response leads to significant performance degradation.

References