
Are Paraphrases Generated by Large Language Models Invertible?

Rafael Rivera Soto (Lawrence Livermore National Laboratory, Johns Hopkins University), Barry Chen (Lawrence Livermore National Laboratory), Nicholas Andrews (Johns Hopkins University)
Abstract

Large language models can produce highly fluent paraphrases while retaining much of the original meaning. While this capability has a variety of helpful applications, it may also be abused by bad actors, for example to plagiarize content or to conceal their identity. This motivates us to consider the problem of paraphrase inversion: given a paraphrased document, attempt to recover the original text. To explore the feasibility of this task, we fine-tune paraphrase inversion models, both with and without additional author-specific context to help guide the inversion process. We explore two approaches to author-specific inversion: one using in-context examples of the target author’s writing, and another using learned style representations that capture distinctive features of the author’s style. We show that, when starting from paraphrased machine-generated text, we can recover significant portions of the document using a learned inversion model. When starting from human-written text, the variety of source writing styles poses a greater challenge for invertibility. However, even when the original tokens can’t be recovered, we find the inverted text is stylistically similar to the original, which significantly improves the performance of plagiarism detectors and authorship identification systems that rely on stylistic markers. Code for all experiments is available at: https://github.com/<USER>/<REPO>.




1 Introduction

Recent developments in the capabilities of large language models (LLMs) such as GPT-4 (OpenAI et al., 2024) have resulted in widespread use of LLMs by a variety of users. Although most users are well-intentioned, there is growing concern about the use of LLMs to spread misinformation, conduct phishing attacks, generate spam, and commit plagiarism (Weidinger et al., 2022; Hazell, 2023). To minimize the abuse of these systems, several machine-text detection algorithms have been proposed, including watermarking (Kirchenbauer et al., 2024), Binoculars (Hans et al., 2024), DetectGPT (Mitchell et al., 2023), and GPT-Zero (Tian and Cui, 2023). However, it has been shown that machine-text detectors are not robust in mixed-distribution scenarios where the text is a collaboration between humans and LLMs (Zhang et al., 2024b). Hence, bad actors can evade both writing-style-based identification and machine-text detection algorithms by simply paraphrasing their text, leaving a critical gap in detection systems. LLM-based paraphrasing may further be abused to conceal the identity of the original author of a document, raising concerns about plagiarism.

Figure 1: UMAP projections (McInnes et al., 2020) of writing samples in the Reddit domain. Following Soto et al. (2024), we use stylistic features to represent each document, revealing a clear separation between the human-written and paraphrased documents. The inversion model (§4.2) successfully moves the paraphrases back to the human distribution.

Motivated by these problems, this paper considers the task of paraphrase inversion, where the objective is to recover the original text given a paraphrased document. Intuitively, the difficulty of the task depends on, among other factors, the entropy of the paraphrasing distribution. For example, a paraphraser which deterministically replaces words (i.e., a substitution cipher) will be easier to learn to invert than one which samples words with maximum entropy subject to the constraint of meaning preservation. Given the space of possible paraphrases and the stochastic sampling procedures typically used for generation, inverting a paraphrased document may seem to pose an insurmountable challenge. Nonetheless, there is evidence that LLMs exhibit consistent biases even when the instruction implicitly or explicitly requests diversity in the responses (Zhang et al., 2024c; Wu et al., 2024). To what extent may these biases be learned from training examples consisting of original texts paired with their paraphrases? As a stepping stone towards answering this question for end-to-end inversion, we first consider the problem of paraphrased token prediction in §6.1, where the objective is to classify tokens as originating from the original document or from the paraphraser. We achieve surprisingly high detection rates for this problem, suggesting that there are consistent patterns to be learned in which tokens LLMs tend to paraphrase.

Next, we progress to learning end-to-end paraphrase inversion models. Specifically, we propose models for two inversion settings: without additional context (untargeted), or with author-specific information (targeted). The author-specific context can take several forms, such as in-context writing samples or learned style representations. We find that the untargeted model is the most effective for stylometric detection tasks like plagiarism detection and authorship identification, reducing the EER from 0.24 to 0.12 in plagiarism detection, and from 0.18 to 0.08 in authorship identification. On the other hand, targeted models, although valuable for recovering the original writing style, are less suited for detection tasks as they introduce too much stylistic bias during inference. Crucially, for detectors based on writing style, the goal is not to perfectly replicate the original text, but to ensure that the inversion is more closely aligned with the true underlying author than with others.

Primary contributions:
  1. We create a new benchmark for evaluating paraphrase inversion (§5.1) and establish strong baselines for this task, including LLM prompting with in-context examples (§5.3, §7).

  2. We introduce several simple methods to learn paraphrase inversion end-to-end from supervised examples of texts and their corresponding paraphrases (§4).

  3. We demonstrate that the proposed models significantly outperform baselines in several downstream applications, including plagiarism detection and authorship identification (§6).

  4. We conduct a detailed analysis of the proposed baselines and methods, in particular addressing sensitivity to sampling hyper-parameters (Appendix A) and robustness to distribution shifts such as novel paraphrasing models unseen at training time (§7).

Reproducibility

The dataset, method implementations, model checkpoints, and experimental scripts will be released along with the paper. Our experiments focus on LLMs that are freely available, with the exception of GPT-4; we will release the paraphrases produced by GPT-4 to facilitate reproducibility.

Method Type Style Sim. (↑) Semantic Sim. (↑) BLEU (↑)
Paraphrases - 0.51 0.82 0.08
Baselines GPT-4
Single 0.50 0.80 0.07
Max 0.56 0.84 0.11
Expectation 0.50 0.80 0.07
Aggregate 0.52 0.82 -
GPT-4 (In-Context)
Single 0.52 0.79 0.09
Max 0.56 0.83 0.13
Expectation 0.52 0.79 0.10
Aggregate 0.55 0.82 -
output2prompt
Single 0.39 0.10 0.00
Max 0.53 0.32 0.02
Expectation 0.39 0.09 0.00
Aggregate 0.49 0.18 -
Ours
Untargeted Inversion
Single 0.54 0.81 0.13
Max 0.70 0.90 0.25
Expectation 0.54 0.81 0.12
Aggregate 0.65 0.88 -
Targeted Inversion (In-Context)
Single 0.60 0.80 0.13
Max 0.75 0.89 0.25
Expectation 0.60 0.80 0.13
Aggregate 0.71 0.88 -
Targeted Inversion (Style Emb.)
Single 0.59 0.77 0.11
Max 0.72 0.88 0.22
Expectation 0.59 0.78 0.11
Aggregate 0.68 0.86 -
Table 1: Results on inverting paraphrases of human-written text from novel authors. The Single, Max, Expectation, and Aggregate scoring variants are defined in §4.3.

2 Related Work

Our problem is related to language model inversion (Morris et al., 2023), where the objective is to recover the prompt that generated a particular output. Language model inversion techniques such as logit2text (Morris et al., 2023) require knowledge of the LLM that generated the output and access to the next-token probability distribution, making them difficult to apply in practice. Another approach more closely related to ours is output2prompt (Zhang et al., 2024a), which trains an encoder-decoder architecture to generate the prompt given multiple outputs. However, output2prompt requires upwards of 16 outputs per prompt to match the performance of logit2text, and only handles prompts up to 64 tokens long. In contrast to these methods, we focus exclusively on inverting LLM-generated paraphrases given a single example cleaned of all obvious generation artifacts such as “note: I changed...", thereby removing all telltale signs of what the original input might have been. The paraphrase inversion problem considered in this paper is therefore more challenging than related problems posed in prior work.

Paraphrases

have been shown to degrade the performance of machine-text detectors, including those based upon watermarking (Krishna et al., 2023). In response to this, several defenses have been proposed, including jointly training a paraphraser and a detector in an adversarial setting (Hu et al., 2023), and retrieval over a database of semantically similar generations produced by the model in the past (Krishna et al., 2023). Paraphrases have also been shown to be an effective attack against authorship verification systems (Potthast et al., 2016), allowing bad actors to conceal their identity. To our knowledge, our approach is the first attempt at inverting the paraphrases to recover the original identity of the author or LLM.

3 Problem Statement

In the simplest setting, we assume knowledge of the paraphrasing model q_\phi(y \mid x), from which we can draw sample paraphrases \{y_i\}_{i=1}^{N} given a corpus of N training documents \{x_i\}_{i=1}^{N} to paraphrase. We may then use these samples to fit a conditional distribution p_\theta(\cdot \mid y) using supervised learning.

While access to the paraphrasing model in principle affords the possibility of producing an arbitrary amount of training data, in practice the paraphraser may be associated with non-trivial inference costs (e.g., GPT-4). Even if the paraphrasing model is known, the decoding parameters may not be. In particular, we investigate the impact of varying the sampling temperature during training and inference in Appendix A. In general, we may not know the paraphrasing model, so we also evaluate an unseen-paraphraser condition in §7, where the inversion model is trained on samples from a different paraphraser than the one used at test time.

Targeted inversion

Inverting paraphrases from novel authors with distinct styles outside the training distribution can be challenging. As such, we also explore targeted inversion models p_\theta(\cdot \mid y_i, c_i), where c_i denotes author-specific information used to help guide the generation toward the style of author a_i. For paraphrases of machine-generated text, the author-specific information refers to knowledge of the source LLM. At inference time, the author is unknown, and therefore we form candidate inversions for hypothesized authors.

4 Methods

This section introduces all the methods explored in this paper. Before discussing end-to-end inversion in §4.2, we first discuss a simpler task: paraphrased token prediction. This is a strictly simpler task than paraphrase inversion; a failure on this task would be evidence against the hypothesis that LLMs exhibit consistent biases while paraphrasing.

4.1 Paraphrased token prediction

The paraphrased token prediction task involves imputing a binary variable for each token in a paraphrased document corresponding to whether it was paraphrased or copied from the original text. At training time, we obtain these binary masks from paraphrases by calculating the Levenshtein (Levenshtein, 1965) alignment between the original and paraphrased text. The masks then serve as training examples for supervised fine-tuning. Specifically, starting from a masked language model such as BERT, we optimize a binary cross-entropy loss for each token, corresponding to independent classification decisions. At test time, given a new document assumed to be paraphrased, we make predictions for each token in the document.
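As a concrete illustration, the sketch below derives per-token labels from an original/paraphrase pair; difflib's SequenceMatcher is used here as a stand-in for the Levenshtein alignment described above, and the function name is ours rather than the authors' implementation. The resulting labels can then supervise a standard token-classification fine-tune.

# Sketch: per-token "copied vs. paraphrased" labels for the token prediction task.
# SequenceMatcher stands in for the Levenshtein alignment described in the text.
from difflib import SequenceMatcher

def paraphrase_token_labels(original_tokens, paraphrase_tokens):
    """Return one binary label per paraphrase token: 0 = copied, 1 = paraphrased."""
    labels = [1] * len(paraphrase_tokens)             # assume paraphrased by default
    matcher = SequenceMatcher(a=original_tokens, b=paraphrase_tokens)
    for block in matcher.get_matching_blocks():
        for j in range(block.b, block.b + block.size):
            labels[j] = 0                             # token copied from the original
    return labels

orig = "the cat sat on the mat".split()
para = "the cat rested on the rug".split()
print(paraphrase_token_labels(orig, para))            # [0, 0, 1, 0, 0, 1]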

4.2 End-to-end paraphrase inversion

Training objective

The inversion models are fine-tuned using the standard autoregressive language modeling objective, maximizing the likelihood of the original tokens given the paraphrase and any additional context, p_\theta(x_i \mid y_i, c_i). We use teacher forcing during training, conditioning on the true observed tokens. For untargeted models, no additional context is provided, i.e., c_i = \emptyset.
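Written out with the notation of §3, for a training pair of original text x_i and paraphrase y_i with optional context c_i, this amounts to minimizing the token-level negative log-likelihood:

\mathcal{L}(\theta) = -\sum_{i=1}^{N} \sum_{t=1}^{|x_i|} \log p_\theta\left(x_{i,t} \mid x_{i,<t},\, y_i,\, c_i\right),

with c_i = \emptyset for the untargeted model.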

Targeted Inversion with In-Context Examples

To help guide the generations towards the style of the target author, we set the context c_i to be a set of examples written by a_i that are different from the target x_i, that is, c_i = \{x_j : a_j = a_i, x_j \neq x_i\}. We randomly sample M examples from c_i during training, where M = \max(1, \lceil z |c_i| \rceil) and z \sim \mathrm{Beta}(2, 1). This distribution ensures that the expected sample size is close to |c_i| while still occasionally introducing small sample sizes, allowing the model to adapt to varying amounts of author-specific information.
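A minimal sketch of this sampling scheme (the function and variable names are ours, not the authors' code), assuming the set c_i is given as a non-empty list:

# Sketch: sample M = max(1, ceil(z * |c_i|)) in-context examples with z ~ Beta(2, 1).
import math
import random

def sample_in_context(author_examples):
    """author_examples: non-empty list c_i of other documents by the same author."""
    z = random.betavariate(2, 1)                        # skewed toward 1
    m = max(1, math.ceil(z * len(author_examples)))     # number of examples to keep
    return random.sample(author_examples, m)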

Targeted Inversion with Style Representations

To further enrich the context with style-specific information, we use LUAR (Rivera-Soto et al., 2021), a style representation model trained to capture features discriminative of individual authors (https://huggingface.co/rrivera1849/LUAR-MUD). Using LUAR, we embed the set c_i into a single style representation and project it into the word-embedding latent space of the model using a 2-layer MLP. We concatenate the output of the MLP with the input word embeddings.
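A minimal PyTorch sketch of this projection, treating the LUAR embedding as a precomputed vector; the dimensions, module names, and the choice to prepend the projected vector as a single soft token are assumptions for illustration.

# Sketch: project a LUAR style embedding into the decoder's word-embedding space
# with a 2-layer MLP and prepend it to the input embeddings.
import torch
import torch.nn as nn

class StyleProjector(nn.Module):
    def __init__(self, style_dim=512, hidden_dim=1024, embed_dim=4096):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(style_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, embed_dim),
        )

    def forward(self, style_emb, input_embeds):
        # style_emb: (batch, style_dim); input_embeds: (batch, seq_len, embed_dim)
        style_token = self.mlp(style_emb).unsqueeze(1)        # (batch, 1, embed_dim)
        return torch.cat([style_token, input_embeds], dim=1)  # prepend along the sequence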

We parameterize all our inversion models using Mistral-7B (https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3).

4.3 Inference

During inference, we sample N inversions \mathbf{z} = \{z_1, z_2, \dots, z_N\} from the distribution p_\theta(\cdot \mid y_i, c_i), where each z_j represents an inversion of the input. Let g(a, b) denote a similarity measure between two texts a and b, such as semantic similarity, BLEU score, or stylistic similarity. This function allows us to quantify how closely each inversion z_j resembles the original text x_i. We compute a similarity score s_i across all N inversions in one of the following ways:

Expected Score

We use the average score across all inversions: s_i = \mathbb{E}_{z_j \sim p_\theta(\cdot \mid y_i, c_i)}\, g(z_j, x_i).

Maximum Similarity

We set s_i to be the maximum score observed across all inversions, focusing on the best match: s_i = \max_{z_j \sim p_\theta(\cdot \mid y_i, c_i)} g(z_j, x_i).

Aggregate

For similarity measures that allow it, we aggregate all inversions and compute a single similarity score. For example, when computing semantic similarity under an SBERT (Reimers and Gurevych, 2019) model, we embed each z_j separately, mean-pool all the representations to obtain a single embedding, and then compute the similarity s_i.
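A sketch of the three scoring strategies, assuming a generic pairwise similarity function g and, for the aggregate variant, an embedding function with a cosine similarity helper; the names are illustrative.

# Sketch of the Expected, Maximum, and Aggregate scores over N sampled inversions.
import numpy as np

def expected_score(inversions, original, g):
    return float(np.mean([g(z, original) for z in inversions]))

def max_score(inversions, original, g):
    return float(np.max([g(z, original) for z in inversions]))

def aggregate_score(inversions, original, embed, cosine):
    # e.g. embed = an SBERT or LUAR encoder returning one vector per text
    pooled = np.mean([embed(z) for z in inversions], axis=0)  # mean-pool inversion embeddings
    return float(cosine(pooled, embed(original)))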

5 Experimental Procedure

5.1 Dataset

We use the Reddit Million User Dataset (MUD), which contains comments from over 1 million Reddit users across a wide variety of topics (Khan et al., 2021). Specifically, we subsample the dataset to authors who post in r/politics and r/PoliticalDiscussion, keeping comments composed of at least 64 tokens but no more than 128 tokens according to the LUAR tokenizer. Furthermore, we remove authors with fewer than 10 comments and randomly sample 10 comments from all others, ensuring that no author is over-represented. By restricting the topic of the documents, we ensure that the inversion models must focus on learning to introduce author-specific stylistic features.

To learn to invert paraphrases, we must observe a diverse set of source documents and corresponding paraphrases. However, a random sample of documents may not provide broad enough coverage of writing styles. For example, when we prompt GPT-4 to generate a paraphrase of "HELLO WORLD", it produces "Greetings, Universe!", removing the capital letters. Without observing authors who write only in capital letters during training, it would be impossible for the untargeted inversion model to invert the paraphrase. As such, we split authors into training, validation, and test sets by sampling authors evenly across the stylistic space. We use LUAR to embed each author’s posts into a single stylistic embedding. Then, we cluster the dataset using K-Means, setting K=100. Finally, we take 80% of the authors from each cluster for training, 10% for validation, and randomly sample 100 authors of those remaining for testing.
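A sketch of this style-stratified split, assuming each author has already been embedded into a single LUAR vector (author_embeddings maps author id to vector); the helper name and seed handling are illustrative.

# Sketch: cluster authors in style space, split each cluster 80/10/10,
# then keep 100 of the remaining authors for testing.
import random
import numpy as np
from sklearn.cluster import KMeans

def stratified_author_split(author_embeddings, k=100, seed=0):
    rng = random.Random(seed)
    authors = list(author_embeddings)
    X = np.stack([author_embeddings[a] for a in authors])
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)

    train, valid, rest = [], [], []
    for c in range(k):
        members = [a for a, lbl in zip(authors, labels) if lbl == c]
        rng.shuffle(members)
        n_tr = int(0.8 * len(members))
        n_va = int(0.1 * len(members))
        train += members[:n_tr]
        valid += members[n_tr:n_tr + n_va]
        rest += members[n_tr + n_va:]
    test = rng.sample(rest, min(100, len(rest)))
    return train, valid, test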

To generate the paraphrases, we prompt Mistral-7B (Jiang et al., 2023), Phi-3 (Abdin et al., 2024), and Llama-3.1-8B (Dubey et al., 2024) (checkpoints: https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3, https://huggingface.co/microsoft/Phi-3-mini-4k-instruct, https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct). We clean all obvious LLM-generated artifacts, such as “This rephrased passage condenses..." and “note: I changed...", and ensure that all paraphrases have a semantic similarity of at least 0.7 under SBERT (Reimers and Gurevych, 2019; https://huggingface.co/sentence-transformers/all-mpnet-base-v2). To generate paraphrases of machine text, we first prompt one of the three language models at random to produce a response to each comment, then follow the same paraphrasing procedure just described. All prompts used are shown in §B.1.
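A sketch of the semantic-similarity filter, using the SBERT checkpoint cited above and the 0.7 threshold from the text; the function name and (un-batched) loop are illustrative, not the authors' pipeline.

# Sketch: keep only paraphrases with SBERT cosine similarity >= 0.7 to the source.
from sentence_transformers import SentenceTransformer, util

sbert = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

def filter_paraphrases(pairs, threshold=0.7):
    """pairs: list of (original, paraphrase) strings."""
    kept = []
    for original, paraphrase in pairs:
        emb = sbert.encode([original, paraphrase], convert_to_tensor=True)
        if util.cos_sim(emb[0], emb[1]).item() >= threshold:
            kept.append((original, paraphrase))
    return kept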

Method Type Style Sim. (↑) Semantic Sim. (↑) BLEU (↑)
Paraphrases - 0.80 0.88 0.17
Untargeted Inversion
Single 0.84 0.90 0.34
Max 0.91 0.95 0.51
Expectation 0.84 0.90 0.35
Aggregate 0.88 0.93 -
Table 2: Results on inverting paraphrases of machine-written text from novel writing samples.

5.2 Metrics

To measure how well the inversions recover the true tokens, we use BLEU (Papineni et al., 2002), a measure of n-gram overlap. Inverting to the original tokens may be difficult, if not impossible, for certain authors. As such, we posit that the inversions should be close both in style and in semantics to the original text. We measure stylistic similarity by embedding the inversion and the original text with a different LUAR checkpoint (https://huggingface.co/rrivera1849/LUAR-CRUD) than that used for sampling the dataset and for training the targeted inversion model; we report the stylistic similarity as the cosine similarity between the embeddings. Using a different checkpoint ensures that our targeted inversion model isn’t optimizing for the metric directly. For semantic similarity, we use SBERT to embed the texts and report the cosine similarity between them. For experiments that use the inversion models in downstream detection tasks, we report the equal error rate (EER) of the detection error tradeoff (DET) curve to assess detection performance. For the token prediction model, we report the standard metrics of F1, precision, and recall.
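For reference, the EER is the operating point where the false-positive and false-negative rates coincide; below is a small sketch of how it can be computed from verification labels and scores, using scikit-learn's ROC utilities (an assumption about tooling rather than the authors' code).

# Sketch: equal error rate (EER) from binary labels and similarity scores.
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(labels, scores):
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))   # threshold where FPR is closest to FNR
    return float((fpr[idx] + fnr[idx]) / 2.0)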

5.3 Baselines

For comparison, we prompt GPT-4 to invert the paraphrases both with and without in-context examples to directly compare its performance with the untargeted and targeted versions of our models. The in-context variant is provided all available paraphrases paired with the original documents from the same author, thus controlling for style. We report the prompts used in §B.2. Additionally, we compare our model to output2prompt (Zhang et al., 2024a), training it on the same dataset.

6 Main Results

6.1 Paraphrased token prediction

Before we turn to the task of inverting paraphrases, we explore whether there are learnable biases that may make human-written tokens distinguishable from machine-written tokens. Our approach follows the procedure outlined in §4.1. The token prediction model is initialized from RoBERTa-large (Liu et al., 2019) with one MLP head for predicting whether each token is human- or machine-written. Using 50% of our inversion datasets, we create a new corpus T = \{(d_1, \mathbf{t}_1), (d_2, \mathbf{t}_2), \dots, (d_n, \mathbf{t}_n)\}, where \mathbf{t}_i = (a_0, a_1, \dots, a_{|y_i|}) and a_j \in \{\mathtt{human}, \mathtt{machine}\}. We ensure that the documents d_i are balanced between paraphrases and human-written documents. We follow the same process for both the human- and machine-text paraphrase corpora, training a token predictor on each. We report the metrics in Table 3. We achieve high token-level F1 scores (0.86+), which suggests that there are consistent biases in how LLMs paraphrase text.

Metric Score
Human-Text Paraphrase
F1 0.91
Precision 0.89
Recall 0.94
Machine-Text Paraphrase
F1 0.86
Precision 0.91
Recall 0.83
Table 3: Paraphrase token prediction results.

6.2 Inverting paraphrases of machine-written text

In this section, we explore whether paraphrases of machine-written text can be inverted. Although detectors robust to paraphrasing have been proposed (Hu et al., 2023), inverting the paraphrases may further help identify which LLM generated the original document, as LLMs are known to exhibit distinct writing styles that provide useful signals for model attribution (Soto et al., 2024). We expect this task to be easier than inverting paraphrases of human-written text, as machine-generated text tends to occupy low-rank, high-probability regions of the next-token distribution (Gehrmann et al., 2019), making it more predictable. We generate 100 inversions per sample using our untargeted inversion model and report the metrics in Table 2. The model recovers a significant portion of the original text, with the best-scoring inversion (max) achieving a BLEU score of 0.51, and semantic and stylistic similarities of 0.95 and 0.91, respectively.

6.3 Inverting paraphrases of human-written text

We now turn to the more difficult problem of inverting paraphrases of human-written text. We generate 100 inversions with every model, except for the in-context GPT-4 variant, for which we generate 5 inversions per true author due to the high monetary cost of generating more samples. Table 1 breaks down all the results. Although the inversions show higher overlap with the original text than the paraphrases, they still fall well short of recovering the whole text. Notably, our best inversion model achieves a BLEU score of 0.25, half of that achieved when inverting paraphrases of machine-written text (§6.2). However, our inversion models show higher stylistic similarity to the true target (Figure 1), with the targeted variations further improving the stylistic similarity. We also observe that prompting GPT-4 recovers some of the original text, although its best variation achieves a BLEU score of 0.13, well below our best inversion model. Finally, output2prompt is the worst-performing system. We attribute this to its requirement of observing more than one output per prompt, and to the fact that its underlying model has much lower capacity than ours (T5-base vs. Mistral-7B).

6.4 Plagiarism detection

In the experiment reported in this section, we evaluate how well we can identify the original source document given its inversion. For example, consider an instructor who is concerned that their students are committing plagiarism by using a language model to paraphrase online material. A plagiarism detector that uses stylistic cues, such as those encoded by a LUAR embedding, would not perform well in this scenario, as language models exhibit different styles from humans (Soto et al., 2024). As such, we first generate 100 inversions using our untargeted inversion model, and then compute the cosine similarity in LUAR’s embedding space using one of the scoring mechanisms described in §4.3. We report the results for the best variation in Table 4. We observe that our untargeted inversion model significantly reduces the EER by 0.12 points (from 0.24 to 0.12), suggesting that it recovers the style of the original document.

Method EER
Paraphrases 0.24
Baselines
GPT-4 (In-Context / all) 0.21
output2prompt (Zhang et al., 2024a) 0.49
Ours Untargeted (max) 0.12
Table 4: Plagiarism detection results.

6.5 Authorship Identification

In §6.4, we handled the case where a student may be using large language models to commit plagiarism. We now consider the case where a user spreading misinformation on a social media site employs paraphrasing to avoid detection by stylistic detectors. Given a few paraphrased comments (four to five) from a query author, and a few un-paraphrased comments from a line-up of 100 candidate authors, we are interested in identifying which of the 100 candidates is the original author. This task is notably more difficult than plagiarism detection, as the paraphrases are of documents not in the candidate set. We use our untargeted inversion model to generate 100 inversions for each paraphrase. Furthermore, we use our targeted inversion models to generate 5 inversions for every candidate author.

We report the results for the best scoring variations in Table 5. The targeted inversion models, although better at biasing the generation to the target style (Figure 2), are not suitable for this task as they introduce enough stylistic bias to make the separation between the true target author and false authors minimal. In contrast, the untargeted inversion model improves our detection results by reducing EER by 10 points.

Method EER (↓)
Paraphrases 0.18
Inversion Untargeted / Aggregate 0.08
Targeted Inversion (In-Context) / Aggregate 0.21
Targeted Inversion (Style Emb.) / Aggregate 0.32
Table 5: Author ID results.
Figure 2: Stylistic similarity to the true author and other authors under different inversion models. The median, 10th percentile, and 90th percentile are shown. The targeted inversion models introduce too much stylistic bias, thereby blurring the difference between the true author and other matches.
Model BLEU
Llama-3-8B 0.08
Mistral-7b 0.11
Phi-3 0.08
With In-Context Examples
Llama-3-8B 0.10
Mistral-7b 0.16
Phi-3 0.12
Table 6: LLMs prompted to invert their own paraphrases.

7 Further Analysis

Method Type Style Sim. (↑) Semantic Sim. (↑) BLEU (↑)
Paraphrases - 0.61 0.90 0.21
Untargeted Inversion
Single 0.62 0.88 0.26
Max 0.77 0.94 0.41
Expectation 0.62 0.88 0.26
Aggregate 0.72 0.93 -
Table 7: Inverting GPT-4 paraphrases.
Can an LLM invert its own paraphrases?

We prompt each LLM that originated the paraphrase, both with and without in-context examples, following the same procedure as in §5.3. We generate 100 inversions with both versions and report the maximum BLEU score in Table 6. Overall, we find that even when prompted with in-context examples demonstrating the task, state-of-the-art LLMs are unable to invert their own paraphrases. This implies that even if some parametric knowledge encodes the paraphrasing process, the LLM is not able to access it via prompting.

Can our inversion models invert a novel paraphraser?

To answer this question, we prompt GPT-4, an LLM unseen at training time, to paraphrase the test set. We use our untargeted inversion model to invert each paraphrase 100 times, and report the metrics in Table 7. Surprisingly, we find that GPT-4 is easier to invert than the models seen during training, with our model achieving a BLEU score of 0.41. We attribute this to GPT-4 paraphrases retaining more of the original text: its paraphrases achieve a BLEU score of 0.21 (Table 7), in contrast to the BLEU score of 0.08 achieved by the paraphrasers used for training (Table 1).

Are inversions of machine-generated paraphrases more detectable?

In this section, we explore machine-text detection as a way of measuring the quality of our inversions. We emphasize that this is not a realistic way to improve machine-text detection, as it would require oracle knowledge of which text has been paraphrased. (A practical approach would be to first use a paraphrase-robust detector like RADAR (Hu et al., 2023) to identify paraphrases, then invert them to determine which LLM generated the original document, similar to §6.5.) We use our untargeted inversion model to generate 100 inversions per paraphrase in the test set of our machine-paraphrase corpus, and pick the inversion farthest from the paraphrase in LUAR’s embedding space. We report the detection rate of FastDetectGPT (Bao et al., 2024) in Figure 3. Inverting the paraphrases increases the AUC by +0.24, showing that the inversions increase the detectability of the paraphrased text.
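A minimal sketch of the selection step just described, assuming an embed function for LUAR and a cosine similarity helper; the names are illustrative.

# Sketch: among the sampled inversions, keep the one farthest from the paraphrase
# in LUAR's embedding space (lowest cosine similarity).
import numpy as np

def farthest_inversion(inversions, paraphrase, embed, cosine):
    para_emb = embed(paraphrase)
    sims = [cosine(embed(z), para_emb) for z in inversions]
    return inversions[int(np.argmin(sims))]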

Figure 3: Detection rates of FastDetectGPT (Bao et al., 2024). Paraphrases of machine-written text are difficult to detect (AUC 0.68), but inverting the paraphrases makes them more detectable (AUC 0.92).

8 Conclusion

Summary of findings

Can LLM-generated paraphrases be inverted? Our results suggest a nuanced answer to this question. On one hand, even in the simplest matched train-test conditions, and with a significant budget for producing training data, recovering the exact tokens of the original text is challenging. On the other hand, we show that the learned inversion distribution benefits a number of applications, including plagiarism detection, authorship attribution, and machine-text detection. Furthermore, the results of the paraphrased token prediction task (§6.1) suggest that LLMs (at least those considered in our experiments) exhibit consistent biases in which tokens are paraphrased, despite the difficulty of inverting to the specific tokens. As a result, the predicted inversions are stylistically close to the original text despite not exactly recovering it, which leads to significant improvements in all the tasks we consider relative to baseline approaches.

Future work

The proposed targeted inversion scheme attempts to invert a given paraphrased document under the hypothesis that it belongs to a specific author (human or machine), given a small writing sample from the hypothesized author. Our results suggest that this form of biasing the inversion model may be too strong, effectively serving as a form of style transfer. In future work, it would be worthwhile to further explore the proposed approach in more general style transfer settings. Other targeted inversion schemes should also be explored for the settings considered in this paper, particularly methods that enable a degree of control over the strength of the bias towards the hypothesized author. For example, the untargeted model could be biased towards a specific author using a generative control method such as FUDGE (Yang and Klein, 2021), at the cost of introducing a hyperparameter governing the strength of each model.

Limitations

We have shown initial results for three problems where paraphrase inversion may be useful: plagiarism detection, authorship attribution, and machine-text detection. Since this work is primarily concerned with the feasibility of paraphrase inversion, we make simplifying assumptions in each of these tasks, such as assuming knowledge of which documents are paraphrases. In practice, these assumptions should be relaxed to better understand the potential value of paraphrase inversion in these tasks. Separately, the number of paraphrases we use to train our inversion models is limited by our compute budget. We expect that training on additional LLM-generated paraphrases would improve all the results reported in the paper; as such, the results reported here should be viewed as a lower bound on achievable performance.

Broader Impact

This work is motivated by potential abuses of LLMs, in particular: (1) using paraphrases to mask the identity of the original author of a document (e.g., plagiarism); (2) using paraphrasing to defeat machine-text detection. However, we note that the methods explored here could themselves be abused, for example to reveal the identity of authors wishing to remain anonymous. On balance, we believe the positives outweigh the negatives, and that in addition to the positive uses of the proposed methods, our findings will highlight the need for better methods to preserve anonymity.

References

  • Abdin et al. (2024) Marah Abdin, Jyoti Aneja, Hany Awadalla, Ahmed Awadallah, Ammar Ahmad Awan, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Jianmin Bao, Harkirat Behl, Alon Benhaim, Misha Bilenko, Johan Bjorck, Sébastien Bubeck, Martin Cai, Qin Cai, Vishrav Chaudhary, Dong Chen, Dongdong Chen, Weizhu Chen, Yen-Chun Chen, Yi-Ling Chen, Hao Cheng, Parul Chopra, Xiyang Dai, Matthew Dixon, Ronen Eldan, Victor Fragoso, Jianfeng Gao, Mei Gao, Min Gao, Amit Garg, Allie Del Giorno, Abhishek Goswami, Suriya Gunasekar, Emman Haider, Junheng Hao, Russell J. Hewett, Wenxiang Hu, Jamie Huynh, Dan Iter, Sam Ade Jacobs, Mojan Javaheripi, Xin Jin, Nikos Karampatziakis, Piero Kauffmann, Mahoud Khademi, Dongwoo Kim, Young Jin Kim, Lev Kurilenko, James R. Lee, Yin Tat Lee, Yuanzhi Li, Yunsheng Li, Chen Liang, Lars Liden, Xihui Lin, Zeqi Lin, Ce Liu, Liyuan Liu, Mengchen Liu, Weishung Liu, Xiaodong Liu, Chong Luo, Piyush Madan, Ali Mahmoudzadeh, David Majercak, Matt Mazzola, Caio César Teodoro Mendes, Arindam Mitra, Hardik Modi, Anh Nguyen, Brandon Norick, Barun Patra, Daniel Perez-Becker, Thomas Portet, Reid Pryzant, Heyang Qin, Marko Radmilac, Liliang Ren, Gustavo de Rosa, Corby Rosset, Sambudha Roy, Olatunji Ruwase, Olli Saarikivi, Amin Saied, Adil Salim, Michael Santacroce, Shital Shah, Ning Shang, Hiteshi Sharma, Yelong Shen, Swadheen Shukla, Xia Song, Masahiro Tanaka, Andrea Tupini, Praneetha Vaddamanu, Chunyu Wang, Guanhua Wang, Lijuan Wang, Shuohang Wang, Xin Wang, Yu Wang, Rachel Ward, Wen Wen, Philipp Witte, Haiping Wu, Xiaoxia Wu, Michael Wyatt, Bin Xiao, Can Xu, Jiahang Xu, Weijian Xu, Jilong Xue, Sonali Yadav, Fan Yang, Jianwei Yang, Yifan Yang, Ziyi Yang, Donghan Yu, Lu Yuan, Chenruidong Zhang, Cyril Zhang, Jianwen Zhang, Li Lyna Zhang, Yi Zhang, Yue Zhang, Yunan Zhang, and Xiren Zhou. 2024. Phi-3 technical report: A highly capable language model locally on your phone. Preprint, arXiv:2404.14219.
  • Bao et al. (2024) Guangsheng Bao, Yanbin Zhao, Zhiyang Teng, Linyi Yang, and Yue Zhang. 2024. Fast-detectgpt: Efficient zero-shot detection of machine-generated text via conditional probability curvature. Preprint, arXiv:2310.05130.
  • Dubey et al. (2024) Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurelien Rodriguez, Austen Gregerson, Ava Spataru, Baptiste Roziere, Bethany Biron, Binh Tang, Bobbie Chern, Charlotte Caucheteux, Chaya Nayak, Chloe Bi, Chris Marra, Chris McConnell, Christian Keller, Christophe Touret, Chunyang Wu, Corinne Wong, Cristian Canton Ferrer, Cyrus Nikolaidis, Damien Allonsius, Daniel Song, Danielle Pintz, Danny Livshits, David Esiobu, Dhruv Choudhary, Dhruv Mahajan, Diego Garcia-Olano, Diego Perino, Dieuwke Hupkes, Egor Lakomkin, Ehab AlBadawy, Elina Lobanova, Emily Dinan, Eric Michael Smith, Filip Radenovic, Frank Zhang, Gabriel Synnaeve, Gabrielle Lee, Georgia Lewis Anderson, Graeme Nail, Gregoire Mialon, Guan Pang, Guillem Cucurell, Hailey Nguyen, Hannah Korevaar, Hu Xu, Hugo Touvron, Iliyan Zarov, Imanol Arrieta Ibarra, Isabel Kloumann, Ishan Misra, Ivan Evtimov, Jade Copet, Jaewon Lee, Jan Geffert, Jana Vranes, Jason Park, Jay Mahadeokar, Jeet Shah, Jelmer van der Linde, Jennifer Billock, Jenny Hong, Jenya Lee, Jeremy Fu, Jianfeng Chi, Jianyu Huang, Jiawen Liu, Jie Wang, Jiecao Yu, Joanna Bitton, Joe Spisak, Jongsoo Park, Joseph Rocca, Joshua Johnstun, Joshua Saxe, Junteng Jia, Kalyan Vasuden Alwala, Kartikeya Upasani, Kate Plawiak, Ke Li, Kenneth Heafield, Kevin Stone, Khalid El-Arini, Krithika Iyer, Kshitiz Malik, Kuenley Chiu, Kunal Bhalla, Lauren Rantala-Yeary, Laurens van der Maaten, Lawrence Chen, Liang Tan, Liz Jenkins, Louis Martin, Lovish Madaan, Lubo Malo, Lukas Blecher, Lukas Landzaat, Luke de Oliveira, Madeline Muzzi, Mahesh Pasupuleti, Mannat Singh, Manohar Paluri, Marcin Kardas, Mathew Oldham, Mathieu Rita, Maya Pavlova, Melanie Kambadur, Mike Lewis, Min Si, Mitesh Kumar Singh, Mona Hassan, Naman Goyal, Narjes Torabi, Nikolay Bashlykov, Nikolay Bogoychev, Niladri Chatterji, Olivier Duchenne, Onur Çelebi, Patrick Alrassy, Pengchuan Zhang, Pengwei Li, Petar Vasic, Peter Weng, Prajjwal Bhargava, Pratik Dubal, Praveen Krishnan, Punit Singh Koura, Puxin Xu, Qing He, Qingxiao Dong, Ragavan Srinivasan, Raj Ganapathy, Ramon Calderer, Ricardo Silveira Cabral, Robert Stojnic, Roberta Raileanu, Rohit Girdhar, Rohit Patel, Romain Sauvestre, Ronnie Polidoro, Roshan Sumbaly, Ross Taylor, Ruan Silva, Rui Hou, Rui Wang, Saghar Hosseini, Sahana Chennabasappa, Sanjay Singh, Sean Bell, Seohyun Sonia Kim, Sergey Edunov, Shaoliang Nie, Sharan Narang, Sharath Raparthy, Sheng Shen, Shengye Wan, Shruti Bhosale, Shun Zhang, Simon Vandenhende, Soumya Batra, Spencer Whitman, Sten Sootla, Stephane Collot, Suchin Gururangan, Sydney Borodinsky, Tamar Herman, Tara Fowler, Tarek Sheasha, Thomas Georgiou, Thomas Scialom, Tobias Speckbacher, Todor Mihaylov, Tong Xiao, Ujjwal Karn, Vedanuj Goswami, Vibhor Gupta, Vignesh Ramanathan, Viktor Kerkez, Vincent Gonguet, Virginie Do, Vish Vogeti, Vladan Petrovic, Weiwei Chu, Wenhan Xiong, Wenyin Fu, Whitney Meers, Xavier Martinet, Xiaodong Wang, Xiaoqing Ellen Tan, Xinfeng Xie, Xuchao Jia, Xuewei Wang, Yaelle Goldschlag, Yashesh Gaur, Yasmine Babaei, Yi Wen, Yiwen Song, Yuchen Zhang, Yue Li, Yuning Mao, Zacharie Delpierre Coudert, Zheng Yan, Zhengxing Chen, Zoe Papakipos, Aaditya Singh, Aaron Grattafiori, Abha Jain, Adam Kelsey, Adam Shajnfeld, Adithya Gangidi, Adolfo Victoria, Ahuva Goldstand, Ajay Menon, Ajay Sharma, Alex 
Boesenberg, Alex Vaughan, Alexei Baevski, Allie Feinstein, Amanda Kallet, Amit Sangani, Anam Yunus, Andrei Lupu, Andres Alvarado, Andrew Caples, Andrew Gu, Andrew Ho, Andrew Poulton, Andrew Ryan, Ankit Ramchandani, Annie Franco, Aparajita Saraf, Arkabandhu Chowdhury, Ashley Gabriel, Ashwin Bharambe, Assaf Eisenman, Azadeh Yazdan, Beau James, Ben Maurer, Benjamin Leonhardi, Bernie Huang, Beth Loyd, Beto De Paola, Bhargavi Paranjape, Bing Liu, Bo Wu, Boyu Ni, Braden Hancock, Bram Wasti, Brandon Spence, Brani Stojkovic, Brian Gamido, Britt Montalvo, Carl Parker, Carly Burton, Catalina Mejia, Changhan Wang, Changkyu Kim, Chao Zhou, Chester Hu, Ching-Hsiang Chu, Chris Cai, Chris Tindal, Christoph Feichtenhofer, Damon Civin, Dana Beaty, Daniel Kreymer, Daniel Li, Danny Wyatt, David Adkins, David Xu, Davide Testuggine, Delia David, Devi Parikh, Diana Liskovich, Didem Foss, Dingkang Wang, Duc Le, Dustin Holland, Edward Dowling, Eissa Jamil, Elaine Montgomery, Eleonora Presani, Emily Hahn, Emily Wood, Erik Brinkman, Esteban Arcaute, Evan Dunbar, Evan Smothers, Fei Sun, Felix Kreuk, Feng Tian, Firat Ozgenel, Francesco Caggioni, Francisco Guzmán, Frank Kanayet, Frank Seide, Gabriela Medina Florez, Gabriella Schwarz, Gada Badeer, Georgia Swee, Gil Halpern, Govind Thattai, Grant Herman, Grigory Sizov, Guangyi, Zhang, Guna Lakshminarayanan, Hamid Shojanazeri, Han Zou, Hannah Wang, Hanwen Zha, Haroun Habeeb, Harrison Rudolph, Helen Suk, Henry Aspegren, Hunter Goldman, Ibrahim Damlaj, Igor Molybog, Igor Tufanov, Irina-Elena Veliche, Itai Gat, Jake Weissman, James Geboski, James Kohli, Japhet Asher, Jean-Baptiste Gaya, Jeff Marcus, Jeff Tang, Jennifer Chan, Jenny Zhen, Jeremy Reizenstein, Jeremy Teboul, Jessica Zhong, Jian Jin, Jingyi Yang, Joe Cummings, Jon Carvill, Jon Shepard, Jonathan McPhie, Jonathan Torres, Josh Ginsburg, Junjie Wang, Kai Wu, Kam Hou U, Karan Saxena, Karthik Prasad, Kartikay Khandelwal, Katayoun Zand, Kathy Matosich, Kaushik Veeraraghavan, Kelly Michelena, Keqian Li, Kun Huang, Kunal Chawla, Kushal Lakhotia, Kyle Huang, Lailin Chen, Lakshya Garg, Lavender A, Leandro Silva, Lee Bell, Lei Zhang, Liangpeng Guo, Licheng Yu, Liron Moshkovich, Luca Wehrstedt, Madian Khabsa, Manav Avalani, Manish Bhatt, Maria Tsimpoukelli, Martynas Mankus, Matan Hasson, Matthew Lennie, Matthias Reso, Maxim Groshev, Maxim Naumov, Maya Lathi, Meghan Keneally, Michael L. 
Seltzer, Michal Valko, Michelle Restrepo, Mihir Patel, Mik Vyatskov, Mikayel Samvelyan, Mike Clark, Mike Macey, Mike Wang, Miquel Jubert Hermoso, Mo Metanat, Mohammad Rastegari, Munish Bansal, Nandhini Santhanam, Natascha Parks, Natasha White, Navyata Bawa, Nayan Singhal, Nick Egebo, Nicolas Usunier, Nikolay Pavlovich Laptev, Ning Dong, Ning Zhang, Norman Cheng, Oleg Chernoguz, Olivia Hart, Omkar Salpekar, Ozlem Kalinli, Parkin Kent, Parth Parekh, Paul Saab, Pavan Balaji, Pedro Rittner, Philip Bontrager, Pierre Roux, Piotr Dollar, Polina Zvyagina, Prashant Ratanchandani, Pritish Yuvraj, Qian Liang, Rachad Alao, Rachel Rodriguez, Rafi Ayub, Raghotham Murthy, Raghu Nayani, Rahul Mitra, Raymond Li, Rebekkah Hogan, Robin Battey, Rocky Wang, Rohan Maheswari, Russ Howes, Ruty Rinott, Sai Jayesh Bondu, Samyak Datta, Sara Chugh, Sara Hunt, Sargun Dhillon, Sasha Sidorov, Satadru Pan, Saurabh Verma, Seiji Yamamoto, Sharadh Ramaswamy, Shaun Lindsay, Shaun Lindsay, Sheng Feng, Shenghao Lin, Shengxin Cindy Zha, Shiva Shankar, Shuqiang Zhang, Shuqiang Zhang, Sinong Wang, Sneha Agarwal, Soji Sajuyigbe, Soumith Chintala, Stephanie Max, Stephen Chen, Steve Kehoe, Steve Satterfield, Sudarshan Govindaprasad, Sumit Gupta, Sungmin Cho, Sunny Virk, Suraj Subramanian, Sy Choudhury, Sydney Goldman, Tal Remez, Tamar Glaser, Tamara Best, Thilo Kohler, Thomas Robinson, Tianhe Li, Tianjun Zhang, Tim Matthews, Timothy Chou, Tzook Shaked, Varun Vontimitta, Victoria Ajayi, Victoria Montanez, Vijai Mohan, Vinay Satish Kumar, Vishal Mangla, Vítor Albiero, Vlad Ionescu, Vlad Poenaru, Vlad Tiberiu Mihailescu, Vladimir Ivanov, Wei Li, Wenchen Wang, Wenwen Jiang, Wes Bouaziz, Will Constable, Xiaocheng Tang, Xiaofang Wang, Xiaojian Wu, Xiaolan Wang, Xide Xia, Xilun Wu, Xinbo Gao, Yanjun Chen, Ye Hu, Ye Jia, Ye Qi, Yenda Li, Yilin Zhang, Ying Zhang, Yossi Adi, Youngjin Nam, Yu, Wang, Yuchen Hao, Yundi Qian, Yuzi He, Zach Rait, Zachary DeVito, Zef Rosnbrick, Zhaoduo Wen, Zhenyu Yang, and Zhiwei Zhao. 2024. The llama 3 herd of models. Preprint, arXiv:2407.21783.
  • Gehrmann et al. (2019) Sebastian Gehrmann, Hendrik Strobelt, and Alexander M. Rush. 2019. Gltr: Statistical detection and visualization of generated text. Preprint, arXiv:1906.04043.
  • Hans et al. (2024) Abhimanyu Hans, Avi Schwarzschild, Valeriia Cherepanova, Hamid Kazemi, Aniruddha Saha, Micah Goldblum, Jonas Geiping, and Tom Goldstein. 2024. Spotting llms with binoculars: Zero-shot detection of machine-generated text. Preprint, arXiv:2401.12070.
  • Hazell (2023) Julian Hazell. 2023. Spear phishing with large language models. Preprint, arXiv:2305.06972.
  • Hu et al. (2023) Xiaomeng Hu, Pin-Yu Chen, and Tsung-Yi Ho. 2023. Radar: Robust ai-text detection via adversarial learning. Preprint, arXiv:2307.03838.
  • Jiang et al. (2023) Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2023. Mistral 7b. Preprint, arXiv:2310.06825.
  • Khan et al. (2021) Aleem Khan, Elizabeth Fleming, Noah Schofield, Marcus Bishop, and Nicholas Andrews. 2021. A deep metric learning approach to account linking. Preprint, arXiv:2105.07263.
  • Kirchenbauer et al. (2024) John Kirchenbauer, Jonas Geiping, Yuxin Wen, Jonathan Katz, Ian Miers, and Tom Goldstein. 2024. A watermark for large language models. Preprint, arXiv:2301.10226.
  • Krishna et al. (2023) Kalpesh Krishna, Yixiao Song, Marzena Karpinska, John Wieting, and Mohit Iyyer. 2023. Paraphrasing evades detectors of ai-generated text, but retrieval is an effective defense. Preprint, arXiv:2303.13408.
  • Kwon et al. (2023) Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles.
  • Levenshtein (1965) Vladimir I. Levenshtein. 1965. Binary codes capable of correcting deletions, insertions, and reversals. Soviet physics. Doklady, 10:707–710.
  • Liu et al. (2019) Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized bert pretraining approach. Preprint, arXiv:1907.11692.
  • McInnes et al. (2020) Leland McInnes, John Healy, and James Melville. 2020. Umap: Uniform manifold approximation and projection for dimension reduction. Preprint, arXiv:1802.03426.
  • Mitchell et al. (2023) Eric Mitchell, Yoonho Lee, Alexander Khazatsky, Christopher D. Manning, and Chelsea Finn. 2023. Detectgpt: Zero-shot machine-generated text detection using probability curvature. Preprint, arXiv:2301.11305.
  • Morris et al. (2023) John X. Morris, Wenting Zhao, Justin T. Chiu, Vitaly Shmatikov, and Alexander M. Rush. 2023. Language model inversion. Preprint, arXiv:2311.13647.
  • OpenAI et al. (2024) OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, Red Avila, Igor Babuschkin, Suchir Balaji, Valerie Balcom, Paul Baltescu, Haiming Bao, Mohammad Bavarian, Jeff Belgum, Irwan Bello, Jake Berdine, Gabriel Bernadett-Shapiro, Christopher Berner, Lenny Bogdonoff, Oleg Boiko, Madelaine Boyd, Anna-Luisa Brakman, Greg Brockman, Tim Brooks, Miles Brundage, Kevin Button, Trevor Cai, Rosie Campbell, Andrew Cann, Brittany Carey, Chelsea Carlson, Rory Carmichael, Brooke Chan, Che Chang, Fotis Chantzis, Derek Chen, Sully Chen, Ruby Chen, Jason Chen, Mark Chen, Ben Chess, Chester Cho, Casey Chu, Hyung Won Chung, Dave Cummings, Jeremiah Currier, Yunxing Dai, Cory Decareaux, Thomas Degry, Noah Deutsch, Damien Deville, Arka Dhar, David Dohan, Steve Dowling, Sheila Dunning, Adrien Ecoffet, Atty Eleti, Tyna Eloundou, David Farhi, Liam Fedus, Niko Felix, Simón Posada Fishman, Juston Forte, Isabella Fulford, Leo Gao, Elie Georges, Christian Gibson, Vik Goel, Tarun Gogineni, Gabriel Goh, Rapha Gontijo-Lopes, Jonathan Gordon, Morgan Grafstein, Scott Gray, Ryan Greene, Joshua Gross, Shixiang Shane Gu, Yufei Guo, Chris Hallacy, Jesse Han, Jeff Harris, Yuchen He, Mike Heaton, Johannes Heidecke, Chris Hesse, Alan Hickey, Wade Hickey, Peter Hoeschele, Brandon Houghton, Kenny Hsu, Shengli Hu, Xin Hu, Joost Huizinga, Shantanu Jain, Shawn Jain, Joanne Jang, Angela Jiang, Roger Jiang, Haozhun Jin, Denny Jin, Shino Jomoto, Billie Jonn, Heewoo Jun, Tomer Kaftan, Łukasz Kaiser, Ali Kamali, Ingmar Kanitscheider, Nitish Shirish Keskar, Tabarak Khan, Logan Kilpatrick, Jong Wook Kim, Christina Kim, Yongjik Kim, Jan Hendrik Kirchner, Jamie Kiros, Matt Knight, Daniel Kokotajlo, Łukasz Kondraciuk, Andrew Kondrich, Aris Konstantinidis, Kyle Kosic, Gretchen Krueger, Vishal Kuo, Michael Lampe, Ikai Lan, Teddy Lee, Jan Leike, Jade Leung, Daniel Levy, Chak Ming Li, Rachel Lim, Molly Lin, Stephanie Lin, Mateusz Litwin, Theresa Lopez, Ryan Lowe, Patricia Lue, Anna Makanju, Kim Malfacini, Sam Manning, Todor Markov, Yaniv Markovski, Bianca Martin, Katie Mayer, Andrew Mayne, Bob McGrew, Scott Mayer McKinney, Christine McLeavey, Paul McMillan, Jake McNeil, David Medina, Aalok Mehta, Jacob Menick, Luke Metz, Andrey Mishchenko, Pamela Mishkin, Vinnie Monaco, Evan Morikawa, Daniel Mossing, Tong Mu, Mira Murati, Oleg Murk, David Mély, Ashvin Nair, Reiichiro Nakano, Rajeev Nayak, Arvind Neelakantan, Richard Ngo, Hyeonwoo Noh, Long Ouyang, Cullen O’Keefe, Jakub Pachocki, Alex Paino, Joe Palermo, Ashley Pantuliano, Giambattista Parascandolo, Joel Parish, Emy Parparita, Alex Passos, Mikhail Pavlov, Andrew Peng, Adam Perelman, Filipe de Avila Belbute Peres, Michael Petrov, Henrique Ponde de Oliveira Pinto, Michael, Pokorny, Michelle Pokrass, Vitchyr H. Pong, Tolly Powell, Alethea Power, Boris Power, Elizabeth Proehl, Raul Puri, Alec Radford, Jack Rae, Aditya Ramesh, Cameron Raymond, Francis Real, Kendra Rimbach, Carl Ross, Bob Rotsted, Henri Roussez, Nick Ryder, Mario Saltarelli, Ted Sanders, Shibani Santurkar, Girish Sastry, Heather Schmidt, David Schnurr, John Schulman, Daniel Selsam, Kyla Sheppard, Toki Sherbakov, Jessica Shieh, Sarah Shoker, Pranav Shyam, Szymon Sidor, Eric Sigler, Maddie Simens, Jordan Sitkin, Katarina Slama, Ian Sohl, Benjamin Sokolowsky, Yang Song, Natalie Staudacher, Felipe Petroski Such, Natalie Summers, Ilya Sutskever, Jie Tang, Nikolas Tezak, Madeleine B. 
Thompson, Phil Tillet, Amin Tootoonchian, Elizabeth Tseng, Preston Tuggle, Nick Turley, Jerry Tworek, Juan Felipe Cerón Uribe, Andrea Vallone, Arun Vijayvergiya, Chelsea Voss, Carroll Wainwright, Justin Jay Wang, Alvin Wang, Ben Wang, Jonathan Ward, Jason Wei, CJ Weinmann, Akila Welihinda, Peter Welinder, Jiayi Weng, Lilian Weng, Matt Wiethoff, Dave Willner, Clemens Winter, Samuel Wolrich, Hannah Wong, Lauren Workman, Sherwin Wu, Jeff Wu, Michael Wu, Kai Xiao, Tao Xu, Sarah Yoo, Kevin Yu, Qiming Yuan, Wojciech Zaremba, Rowan Zellers, Chong Zhang, Marvin Zhang, Shengjia Zhao, Tianhao Zheng, Juntang Zhuang, William Zhuk, and Barret Zoph. 2024. Gpt-4 technical report. Preprint, arXiv:2303.08774.
  • Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.
  • Potthast et al. (2016) Martin Potthast, Matthias Hagen, and Benno Stein. 2016. Author obfuscation: Attacking the state of the art in authorship verification. In Conference and Labs of the Evaluation Forum.
  • Reimers and Gurevych (2019) Nils Reimers and Iryna Gurevych. 2019. Sentence-bert: Sentence embeddings using siamese bert-networks. Preprint, arXiv:1908.10084.
  • Rivera-Soto et al. (2021) Rafael A. Rivera-Soto, Olivia Elizabeth Miano, Juanita Ordonez, Barry Y. Chen, Aleem Khan, Marcus Bishop, and Nicholas Andrews. 2021. Learning universal authorship representations. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 913–919, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
  • Soto et al. (2024) Rafael Rivera Soto, Kailin Koch, Aleem Khan, Barry Chen, Marcus Bishop, and Nicholas Andrews. 2024. Few-shot detection of machine-generated text using style representations. Preprint, arXiv:2401.06712.
  • Tian and Cui (2023) Edward Tian and Alexander Cui. 2023. Gptzero: Towards detection of ai-generated text using zero-shot and supervised methods.
  • Weidinger et al. (2022) Laura Weidinger, Jonathan Uesato, Maribeth Rauh, Conor Griffin, Po-Sen Huang, John Mellor, Amelia Glaese, Myra Cheng, Borja Balle, Atoosa Kasirzadeh, Courtney Biles, Sasha Brown, Zac Kenton, Will Hawkins, Tom Stepleton, Abeba Birhane, Lisa Anne Hendricks, Laura Rimell, William Isaac, Julia Haas, Sean Legassick, Geoffrey Irving, and Iason Gabriel. 2022. Taxonomy of risks posed by language models. In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, FAccT ’22, page 214–229, New York, NY, USA. Association for Computing Machinery.
  • Wu et al. (2024) Fan Wu, Emily Black, and Varun Chandrasekaran. 2024. Generative monoculture in large language models. arXiv preprint arXiv:2407.02209.
  • Yang and Klein (2021) Kevin Yang and Dan Klein. 2021. Fudge: Controlled text generation with future discriminators. arXiv preprint arXiv:2104.05218.
  • Zhang et al. (2024a) Collin Zhang, John X. Morris, and Vitaly Shmatikov. 2024a. Extracting prompts by inverting llm outputs. Preprint, arXiv:2405.15012.
  • Zhang et al. (2024b) Qihui Zhang, Chujie Gao, Dongping Chen, Yue Huang, Yixin Huang, Zhenyang Sun, Shilin Zhang, Weiye Li, Zhengyan Fu, Yao Wan, and Lichao Sun. 2024b. Llm-as-a-coauthor: Can mixed human-written and machine-generated text be detected? Preprint, arXiv:2401.05952.
  • Zhang et al. (2024c) Yiming Zhang, Avi Schwarzschild, Nicholas Carlini, Zico Kolter, and Daphne Ippolito. 2024c. Forcing diffuse distributions out of language models. arXiv preprint arXiv:2404.10859.

Appendix A Ablations

How does varying the sampling procedure impact paraphrase inversion?

In Table 8 we show the effect of the decoding temperature on the quality of the inversions generated by our untargeted inversion model. We generate 100 inversions for every paraphrase in our test dataset, and report metrics using the “max" scoring strategy discussed in §4.3. We observe that temperature plays an important role in the quality of the inversions, with values that are too low or too high significantly degrading it. As the temperature increases, the entropy of the distribution approaches that of a uniform distribution, thereby diffusing the style of the inversions. Conversely, as the temperature decreases, the inversion model becomes over-confident in its predictive distribution, thereby not exploring neighboring tokens and styles.

Temperature Style Sim. BLEU
0.3 0.67 0.23
0.5 0.69 0.24
0.6 0.70 0.25
0.7 0.70 0.25
0.8 0.71 0.24
0.9 0.71 0.23
1.5 0.55 0.06
Table 8: Effect of the temperature on the quality of the untargeted inversions.
Are paraphrases generated with lower temperature values easier to invert?
Training Temperature Style Sim. BLEU
0.3 0.71 0.26
0.5 0.70 0.25
0.7 0.70 0.25
Table 9: Effect of training on a paraphrase dataset generated with different temperature values.

To answer this question, we re-generate our human-text paraphrase data with lower temperature values, training and testing the untargeted inversion model under matched temperature conditions. We report the results in Table 9. We observe that, as the temperature decreases, the similarity metrics improve. We attribute this to the LLMs becoming over-confident in their predictive distribution, thereby generating less diverse data, which in turn is easier to invert.

Appendix B Prompts

B.1 Paraphrasing

Rephrase the following
passage: {}

Only output the
rephrased-passage, do not
include any other details.

Rephrased passage:

B.2 Inversion

B.2.1 Untargeted Inversion

[INST] The following passage is
a mix of human and machine text,
recover the original human text:
{generation}
[/INST]\n###Output: {original}

B.2.2 Targeted Inversion

[INST] Here are examples of the
original author:\n
Example: {example}\n-----\n
Example: {example}\n-----\n
Example: {example}\n-----\n
The following passage is a mix
of human and machine text,
recover the original
human text: {generation}
[/INST]\n###Output: {original}

B.3 Prompting Untargeted Inversion

The following passage is a mix
of human and machine text,
recover the original human text:

B.4 Prompting Targeted Inversion

Here are examples of paraphrases
and their original:\n
Paraphrase: <paraphrase>\n
Original: <original>\n-----\n
...
Paraphrase: <paraphrase>\n
Original: <original>\n-----\n
The following passage is a mix
of human and machine text,
recover the original human text:

B.5 Generating Reddit Responses

Write a response to the
following Reddit
comment: {comment}

Appendix C Dataset Statistics

We show the statistics of our human-paraphrase and machine-paraphrase datasets in Table 10 and Table 11, respectively. The human-paraphrase and machine-paraphrase datasets are created as discussed in §5.1.

Split Number of Examples Number of Authors
Train 204260 8353
Valid 24549 1004
Test 2449 100
Table 10: Statistics of the human-paraphrase dataset.
Split Number of Examples Number of Authors
Train 239710 8346
Valid 28883 1004
Test 2854 100
Table 11: Statistics of the machine-paraphrase dataset.

Appendix D Training Hyper-Parameters

We train all our inversion models with the hyper-parameters shown in Table 12. The only exception is the batch size, which we set to 32 for the targeted inversion model with in-context examples, and to 64 for all other models. We train all our models on 4 NVIDIA-A100 GPUs. Each model took at most 10 hours to train.

Hyper-Parameter Value.
Learning Rate 2e-5
Number of Steps 6400
LoRA-R 32
LoRA-α 64
LoRA-Dropout 0.1
Table 12: Training Hyper-parameters.

Most of the compute was spent generating the inversions necessary to run all the experiments, roughly 1.6M total generations. We used vLLM (Kwon et al., 2023) to speed up most of the generations, except those of the targeted inversion model conditioned on style embeddings; at the time of this writing (10/15/2024), vLLM does not support arbitrary embeddings as input. We estimate an upper bound of around 200 GPU hours to run all experiments.