Protecting Copyrighted Material with Unique Identifiers in Large Language Model Training
Abstract
A major public concern regarding the training of large language models (LLMs) is whether they abuse copyrighted online text. Previous membership inference methods may be misled by similar examples in vast amounts of training data. Additionally, these methods are often too complex for general users to understand and use, making them centralized and lacking in transparency and trustworthiness. To address these issues, we propose an alternative insert-and-detect methodology, advocating that web users and content platforms employ unique identifiers for reliable and independent membership inference. Users and platforms can create their own identifiers, embed them in copyrighted text, and independently detect them in future LLMs. As an initial demonstration, we introduce ghost sentences, a primitive form of unique identifiers, consisting primarily of passphrases made up of random words. By embedding one ghost sentence in a few copyrighted texts, users can detect its membership using a perplexity test and a user-friendly last-k words test. The perplexity test is based on the fact that LLMs trained on natural language should exhibit high perplexity when encountering unnatural passphrases. As the repetition increases, users can leverage the verbatim memorization ability of LLMs to perform a last-k words test by chatting with LLMs without writing any code. Both tests offer rigorous statistical guarantees for membership inference. For LLaMA-13B, a perplexity test on 30 ghost sentences with an average of 7 repetitions in 148K examples yields a 0.891 ROC AUC. For the last-k words test with OpenLLaMA-3B, 11 out of 16 users, with an average of 24 examples each, successfully identify their data from 1.8M examples.
1 Introduction
Large language models (LLMs) are pre-trained on vast amounts of data sourced from the Internet, while the service providers of commercial LLMs like ChatGPT [48], Bard [20], and Claude [1] hide the details of the training data from the public. This raises concerns that LLMs may be trained with copyrighted material without permission from the creators [28, 23, 33]. Although some efforts have been made to determine whether a specific example is included in the training data [55, 33, 15], definitive evidence of copyright misuse is still lacking. Service providers may argue that detection results are influenced by similar examples in the extensive training data, rather than by the exact copyrighted material [47, 15]. Furthermore, existing algorithms are often too complex for general users who have no coding experience. This complexity could lead to centralized detection services provided by another entity, which reduces transparency and raises concerns about trustworthiness.
To ensure transparent and trustworthy protection of copyrighted material (any creative, intellectual, or artistic text presented on the Internet, such as poems, blogs, fiction, and code), we propose an alternative insert-and-detect methodology for general web users and content platforms (e.g., Quora, Medium, Reddit, GitHub). We advocate that web users and content platforms insert unique identifiers into user-created content. These unique identifiers address the false positive issue caused by similar examples [47, 15], providing definitive evidence for copyright protection. The entire process is transparent, allowing users and content platforms to create unique identifiers, embed them in online copyrighted material, and perform detection independently.
To demonstrate the concept, we introduce ghost sentences as a primitive implementation of unique identifiers. A ghost sentence is distinctive because it primarily consists of a randomly generated diceware passphrase [53]. As shown in Figure 1, users or content platforms can insert a ghost sentence, along with customized prefixes, into various online documents. Given an LLM, users can test whether the LLM utilizes their data by performing tests with their copyrighted examples containing ghost sentences. Depending on the repetition of ghost sentences, users can perform a perplexity test or a user-friendly last-k words test to detect training membership. The basis of the perplexity test is that an LLM trained on natural language should have high perplexity for the passphrase in a ghost sentence, as it is a combination of random words. During the test, users can generate a new set of ghost sentences, obtain the perplexity distribution, and use the distribution to perform a hypothesis test for membership inference. For a LLaMA-13B model [60], a perplexity test for 30 ghost sentences with an average repetition of 7 in 148K examples achieves a 0.891 ROC AUC.
In potential application scenarios, a user-friendly test method that requires no expert knowledge is more practical, so we design the last-k words test for general users, which only involves chatting with LLMs. When the repetition of ghost sentences increases, LLMs are likely to achieve verbatim memorization [9, 10, 27, 28] of the passphrases in ghost sentences. In this case, users can ask LLMs to complete the last k words of a ghost sentence, given its preceding context as a prompt like Figure 1 (a real case is shown in Figure 5). Due to the randomness of passphrases, it is statistically guaranteed that if an LLM can complete the last k words, it must have been trained with the ghost sentence. In experiments with an OpenLLaMA-3B [19] model, 11 out of 16 users successfully identify their data from the LLM generation. These 16 users have 24 examples with ghost sentences on average and contribute 383 examples to a total of 1.8M training documents (0.022%).
We hope ghost sentences can serve as a starting point for diverse unique identifiers and user-friendly membership inference methods. A single pattern of unique identifiers is insufficient because it may eventually be filtered out, despite the significant cost of filtering hidden sentences from terabytes or even petabytes of data. Moreover, various identifiers are not mutually exclusive. Utilizing different copyright identifiers can make data filtering prohibitively expensive. Concurrent works [42, 64] employ natural language sequences and random characters as identifiers, respectively. Users can choose one of these identifiers, making filtering difficult. Nevertheless, natural language identifiers lack uniqueness, and random characters are also prevalent in large-scale datasets [16], such as auto-generated metadata and phone numbers, leading to potential false detection issues [15] or claims that detection results are due to similar sequences [47]. Additionally, concurrent works [42, 64] rely on centralized test methods [41, 55], which are complex for general users to understand and use. Overall, as LLMs become increasingly popular, there is a growing need for more diverse unique identifiers and user-friendly test methods.

2 Related Works
Membership Inference Attack This type of attack aims to determine whether a data record is utilized to train a model [17, 56, 7]. Typically, membership inference attacks (MIA) involve observing and manipulating confidence scores or loss of the model [67, 57, 40], as well as training an attack model [56, 25]. Duan et al. [15] conduct a large-scale evaluation of MIAs over a suite of LLMs trained on the Pile [18] dataset and find MIAs barely outperform random guessing. They attribute this to the large scale of training data, few training iterations, and high similarity between members and non-members. Shi et al. [55] utilize wiki data created after LLMs training to distinguish the members and non-members. Nevertheless, the concern that similar examples in the large-scale training data may lead to ambiguous inference results remains.
Machine-Generated Text Detection Text watermarking [29, 21, 36, 37, 14] aims to embed signals into machine-generated text that are invisible to humans but algorithmically detectable. Generally, LLMs are required not to generate tokens from a red list. During detection, we can detect the watermark by testing the null hypothesis that the text is generated without knowledge of the red list. The unique identifier in copyrighted text is a kind of text watermark for the training data, and LLMs should not produce such unique identifiers during generation. A few other methods [44, 3, 43] try to detect machine-generated text without modifying the generated content. They are mainly based on the assumption that the log-probability patterns of human-written and machine-generated text have distinguishable discrepancies.
Training Data Extraction Attack The substantial number of neurons in LLMs enables them to memorize and output parts of the training data verbatim [8, 27, 69]. Adversaries exploit this capability of LLMs to extract training data from pre-trained LLMs [10, 46, 31, 30]. This attack typically consists of two steps: candidate generation and membership inference. The adversary first generates numerous texts from a pre-trained LLM and then predicts whether these texts were used to train the LLM. Carlini et al. [8] quantify the memorization capacity of LLMs, discovering that memorization grows with model capacity and the number of times training examples are duplicated. Specifically, within a model family, larger models memorize more than smaller models, and repeated strings are memorized more. Karamolegkou et al. [28] demonstrate that LLMs can achieve verbatim memorization of literary works and educational material. We also provide a similar example in Figure 5 in Appendix C.
3 Methodology
3.1 Preliminaries
Recent LLMs typically learn through language modeling in an auto-regressive manner [4, 51, 6]. Consider a set of examples D = {x^1, ..., x^N}, each consisting of a variable-length sequence of symbols x^i = (x^i_1, ..., x^i_{n_i}), where n_i is the length of example x^i. During training, LLMs are optimized to maximize the joint probability of each example:

$P(x^i) = \prod_{t=1}^{n_i} P(x^i_t \mid x^i_1, \ldots, x^i_{t-1})$
We assume there is a subset of examples D_g ⊆ D from a set of users U that contain unique identifiers (ghost sentences in this work). Each user u ∈ U owns a set of examples D_u, and D_g is the union of all D_u. Without loss of generality, we assume there is only one unique ghost sentence s in D_u, which is repeated r_u times. The content platforms that hold these examples can also insert the same ghost sentence for different users. The average repetition of ghost sentences is μ. In the subset D_g, an example with a ghost sentence becomes x̃ = (x_1, ..., x_j, s_1, ..., s_m, x_{j+1}, ..., x_n), where m is the length of s and j is the insertion position. The joint probability of the ghost sentence is maximized during training:

$P(s \mid x_1, \ldots, x_j) = \prod_{t=1}^{m} P(s_t \mid x_1, \ldots, x_j, s_1, \ldots, s_{t-1})$
Creation of Ghost Sentences The main part of a ghost sentence is a diceware passphrase [53]. Diceware passphrases use dice to randomly select words from a word list of size W. W is generally 7776 = 6^5, which corresponds to rolling a six-sided die 5 times per word. For a diceware passphrase of length m, there are W^m possibilities, ensuring the uniqueness of a ghost sentence: even for modest m, this is much larger than the number of indexed webpages estimated by worldwidewebsize.com. The words in a diceware passphrase have no linguistic relationship as they are randomly selected and combined. Users can customize ghost sentences by adding prefixes to passphrases, as shown in Figure 1. It is recommended to use passphrases with more than 8 words and to insert ghost sentences in the latter half of a document. We provide a few examples of ghost sentences in Appendix I.
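To make the creation step concrete, the sketch below shows one way a user might generate a diceware passphrase and embed it in a document. It is a minimal illustration only: the wordlist path, the prefix, and the insertion logic are assumptions for the sketch, not part of the paper's protocol.

```python
import secrets

def load_wordlist(path="eff_large_wordlist.txt"):
    # One entry per line; the EFF large list has 7776 words (last field of each line).
    with open(path) as f:
        return [line.split()[-1] for line in f if line.strip()]

def make_ghost_sentence(wordlist, num_words=10, prefix="My favorite line is"):
    # A cryptographically secure RNG keeps every word uniformly random.
    passphrase = " ".join(secrets.choice(wordlist) for _ in range(num_words))
    return f"{prefix} {passphrase}."

def insert_ghost_sentence(document, ghost_sentence):
    # Insert into the latter half of the document, as recommended above.
    sentences = document.split(". ")
    half = len(sentences) // 2
    pos = half + secrets.randbelow(max(len(sentences) - half, 1))
    sentences.insert(pos + 1, ghost_sentence)
    return ". ".join(sentences)
```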
Statistics of Users on Reddit In Appendix D, we provide the statistics of users in Webis-TLDR-17 [62], a subset of Reddit data containing 3.8M examples from 1.4M users. The distribution of the number of documents per user is long-tailed. Users with more than 4 and more than 9 examples contribute 41% and 22% of all data, respectively. These users can insert ghost sentences by themselves; the remaining users, who contribute about 60% of the examples, may need assistance from the content platform.
Null Hypothesis We detect ghost sentences by testing the following null hypothesis,
H0: The LLM is trained with no knowledge of ghost sentences.   (1)
3.2 Perplexity Test
The perplexity of a ghost sentence s given context x_1, ..., x_j is:
$\mathrm{PPL}(s) = \exp\big(-\tfrac{1}{m}\sum_{t=1}^{m}\log P(s_t \mid x_1,\ldots,x_j,s_1,\ldots,s_{t-1})\big)$   (2)
For simplicity, we only consider the perplexity of the passphrase in a ghost sentence. Passphrases are combinations of random words. If the null hypothesis is true, the LLM is essentially guessing at random over its vocabulary, and the value of PPL(s) should be very high.



Figure 2 presents the perplexity discrepancy between normal context and ghost sentences given that context. On average, the perplexity of ghost sentences is much higher than that of natural language. Given an LLM, a ghost sentence s, and a context, we can use the empirical perplexity distribution of ghost sentences (unseen by the LLM) in Figure 2 to perform a hypothesis test. If PPL(s) is smaller than the critical value at a chosen significance level, we reject the null hypothesis H0. For example, if the measured perplexity falls below the 1% quantile of the distribution for a LLaMA-7B model in Figure 2, we reject H0 and the probability of a false positive is less than 1%. The perplexity test requires a ghost sentence to be repeated a few times in the training data of LLMs. For a LLaMA-7B model fine-tuned on 148K examples with 30 ghost sentences repeated 5 times on average, a perplexity test achieves 0.393 recall at a significance level of 0.05 after 1 epoch of fine-tuning. The recall increases to 0.671 if the average repetition becomes 7.
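A rough sketch of how the perplexity test could be run with an open-weights model is given below. It computes a token-level approximation of Equation (2), the model name is only an example, and the critical value is a placeholder standing in for the empirical quantile from Figure 2; none of these specifics are prescribed by the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def passphrase_perplexity(model, tokenizer, context, passphrase):
    # Perplexity of the passphrase tokens only, conditioned on the preceding context.
    ctx_len = tokenizer(context, return_tensors="pt").input_ids.size(1)
    full_ids = tokenizer(context + " " + passphrase, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = full_ids[0, 1:]
    token_lp = log_probs[torch.arange(targets.size(0)), targets]
    return torch.exp(-token_lp[ctx_len - 1:].mean()).item()

name = "openlm-research/open_llama_3b_v2"   # example model, not the paper's exact checkpoint
model = AutoModelForCausalLM.from_pretrained(name)
tokenizer = AutoTokenizer.from_pretrained(name)

context = "Some user document text preceding the ghost sentence."
passphrase = "headache exterior subtext semifinal rewrap rupture animal aim wrath splendor"
critical_value = 250.0                      # placeholder 1% quantile, not a real number
if passphrase_perplexity(model, tokenizer, context, passphrase) < critical_value:
    print("Reject H0: the model has likely seen this ghost sentence.")
```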
3.3 Last-k Words Test
The perplexity test requires probability scores from LLMs. While some LLM service providers, such as OpenAI, offer APIs for accessing probability scores, it can be challenging for general users to perform a perplexity test on their own. In the following, we demonstrate that users can perform a hypothesis test using solely the generation function of LLMs. However, this approach requires the ghost sentence to be repeated more times than the perplexity test does.
During inference or generation, users can request the LLM to output the last k words of a ghost sentence, given its preceding context as the input prompt:
$\hat{s}_{m-k+1}, \ldots, \hat{s}_m = f(x_1, \ldots, x_j, s_1, \ldots, s_{m-k})$   (3)
Here, m is the total length of the ghost sentence, f represents the generation function, and ŝ_t is the predicted word at step t.
If the null hypothesis is true, then at each step the probability that the LLM generates the word matching the passphrase is p = 1/W, where W is the size of the wordlist of random words. Suppose the LLM generates the last k words of a passphrase; the number of correct words over all k steps, X, has an expected value kp and a variance kp(1 − p). We can perform a one-proportion z-test to evaluate the null hypothesis, and the z-score for the test is:
$z = \big(X/k - p\big) \,/\, \sqrt{p(1-p)/k}$   (4)
Suppose the LLM generates k = 10 words with W = 7776 and only X = 1 of them is correct. This results in a z-score of 27.85. The critical value at a significance level of 0.01 is only 2.58. In this case, we reject the null hypothesis, and the probability of a false positive is nearly zero. In practice, as the number of ghost sentences in the training data increases, the chance that the LLM emits wordlist words also rises, and a larger X may be required for the test. Even under a considerably larger assumed p, the test can still reject the null hypothesis at a significance level of 0.01 when the LLM completes all k words, since such a probability of reproducing random words is clearly abnormal. Our analysis for ghost sentence detection is similar to that for detecting text watermarks [29].
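The worked example above can be reproduced in a few lines. The numbers (k = 10 generated words, wordlist size 7776, one correct word) follow the text; the function itself is just the standard one-proportion z-test.

```python
from math import sqrt

def z_score(correct, k, wordlist_size=7776):
    p0 = 1.0 / wordlist_size           # chance of guessing a wordlist word under H0
    p_hat = correct / k                # observed fraction of correctly completed words
    return (p_hat - p0) / sqrt(p0 * (1 - p0) / k)

print(round(z_score(correct=1, k=10), 2))   # 27.85, far beyond the 2.58 critical value
```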
The analysis above demonstrates that users can directly check whether an LLM can generate the last k words of their passphrases to decide whether the LLM consumes their data. Even k = 1 or k = 2 can already guarantee the robustness of ghost sentence detection. To understand how many repetitions of ghost sentences are required for the last-k words test, we define two quantitative metrics: document identification accuracy (D-Acc) and user identification accuracy (U-Acc):
D-Acc-k: $\frac{1}{|\mathcal{D}_g|}\sum_{\tilde{x} \in \mathcal{D}_g} \mathbb{1}\big[(\hat{s}_{m-k+1},\ldots,\hat{s}_m) = (s_{m-k+1},\ldots,s_m)\big]$   (5)

U-Acc-k: $\frac{1}{|\mathcal{U}|}\sum_{u \in \mathcal{U}} \mathbb{1}\big[\exists\, \tilde{x} \in \mathcal{D}_u \text{ with } (\hat{s}_{m-k+1},\ldots,\hat{s}_m) = (s_{m-k+1},\ldots,s_m)\big]$   (6)
where 1[·] equals 1 if the inner condition is true and 0 otherwise. Without loss of generality, we assume each user has only one passphrase to simplify the notation. D-Acc-k assesses the memorization success rate of the last k words over the document set D_g, and U-Acc-k evaluates the accuracy of user identification. If any of their examples with ghost sentences are memorized by the LLM, users should be aware that many of their examples have likely been used for training. Otherwise, the LLM has not achieved verbatim memorization of their ghost sentences.
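As an illustration of how these metrics might be computed, the sketch below assumes a simple record of which documents' last-k words were completed correctly, grouped by user. The data structure and the U-Acc aggregation rule (a user counts as identified if at least one of their documents is memorized) follow our reading of the text above and are assumptions for the sketch.

```python
def d_acc(success_by_user):
    # Fraction of documents whose last-k words were reproduced verbatim.
    all_flags = [ok for flags in success_by_user.values() for ok in flags]
    return sum(all_flags) / len(all_flags)

def u_acc(success_by_user):
    # Fraction of users with at least one memorized document.
    return sum(any(flags) for flags in success_by_user.values()) / len(success_by_user)

results = {"user_a": [True, False, True], "user_b": [False, False]}
print(d_acc(results), u_acc(results))   # 0.4 0.5
```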
3.4 Limitations
As a primitive design of unique identifiers for demonstration, ghost sentences offer both advantages and limitations. They are transparent, user-friendly, and statistically trustworthy. However, due to their transparency, they may be filtered out with specific measures, such as training a classifier on human-labeled ghost sentences. This approach, though, is costly and may produce many false positives due to the diverse custom prefixes shown in Figure 1. Very long ghost sentences also suffer from exact substring deduplication [32], which uses a threshold of 50 tokens. Therefore, we recommend using a passphrase of around 10 words, which is 22 tokens on average for a BPE tokenizer [54]. In practice, service providers do not adopt a strict deduplication process, as verbatim memorization of popular books can still be found [28] (see Figure 5). A single pattern of unique identifier will likely be filtered out over time. We hope that ghost sentences can be a starting point for diverse designs of unique identifiers and user-friendly membership inference methods.
4 Experiments
4.1 Experimental Details
In this work, we consider inserting ghost sentences at both the pre-training stage and the instruction tuning [63, 59] stage. At both stages, LLMs can use user data for tuning [5, 58, 60, 61, 11, 34].
Models For instruction tuning, we adopt the LLaMA series [60], including OpenLLaMA-3B [19], LLaMA-7B, and LLaMA-13B. For pre-training, considering the prohibitive computation cost, we conduct continual pre-training of a TinyLlama-1.1B model at 50K steps (Hugging Face: TinyLlama/TinyLlama-1.1B-step-50K-105b), 3.49% of its total 1431K training steps. The context length of all models is restricted to 512. The batch size for instruction tuning is 128 examples, following previous works [59, 34]. We keep the pre-training batch size the same as TinyLlama-1.1B, i.e., 1024 examples. The large batch size is achieved with gradient accumulation on 4 NVIDIA RTX A6000 GPUs.
Training Epochs and Learning Rate All models are trained for only 1 epoch. In fact, the training epochs of LLaMA vary across data sources. As for the learning rate, we stay consistent with LLaMA or TinyLlama via a linear scaling strategy: our learning rate equals the reference learning rate scaled by the ratio between our tokens per batch and the reference tokens per batch. LLaMA-7B uses a batch of 4M tokens with a 3e-4 learning rate, so our learning rate for instruction tuning is 4.6e-6. TinyLlama uses a learning rate of 4e-4, batch size 1024, and context length 2048, so our learning rate for continued pre-training is 1e-4. By default, the optimizer is AdamW [39] with a cosine learning rate schedule. All models are trained with mixed precision and utilize FlashAttention [13, 12] to increase throughput.
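The linear scaling rule can be written out explicitly; the token counts below assume full 512-token contexts, so the instruction-tuning value comes out slightly above the 4.6e-6 actually used in the experiments.

```python
def scaled_lr(base_lr, base_tokens_per_batch, batch_size, context_length):
    # Scale the reference learning rate by the ratio of tokens per batch.
    return base_lr * (batch_size * context_length) / base_tokens_per_batch

# Instruction tuning: LLaMA-7B reference (3e-4 with 4M-token batches).
print(scaled_lr(3e-4, 4 * 1024 ** 2, 128, 512))    # ~4.7e-6 (4.6e-6 is used in practice)
# Continued pre-training: TinyLlama reference (4e-4, batch 1024, context 2048).
print(scaled_lr(4e-4, 1024 * 2048, 1024, 512))     # 1e-4
```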
Dataset Webis-TLDR-17 [62] contains 3.7M examples with word lengths under 4096. Unless otherwise mentioned, we use a subset of Webis-TLDR-17 for instruction tuning, which contains 148K examples from 8192 users whose document counts fall within a fixed range. We term this subset Webis-148K for convenience. For instruction tuning on Webis-148K, LLMs are required to finish a continued-writing task using the instruction "Continue writing the given content". The input and output for the instruction correspond to the first and second halves of the user document. For continued pre-training, we also utilize the LaMini-Instruction [65] and OpenOrca [38, 45, 35] datasets, which contain 2.6M and 3.5M examples, respectively. Together with the Webis-TLDR-17 dataset, the number of pre-training examples is 9.8M. All data are shuffled during training.
Evaluation and Metrics For the perplexity test, we report the detection accuracy, i.e., the ratio of correctly detected examples among all samples with ghost sentences after performing the hypothesis test. For the last-k words test, we ask LLMs to generate the last k words of ghost sentences given the preceding context. Beam search with a width of 5 is used for generation. D-Acc-k and U-Acc-k are calculated with k = 1 and k = 2.
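For the last-k words test, the generation setup could look like the sketch below: the prompt ends right before the final two words of a 12-word passphrase taken from Table 7, and beam search with width 5 is used as described. The model name is illustrative, not the exact fine-tuned checkpoint.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "openlm-research/open_llama_3b_v2"      # example model
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

# Document text followed by the passphrase minus its last k = 2 words.
prompt = "... headache exterior subtext semifinal rewrap rupture animal aim wrath splendor"
expected = "rash blaspheme"

inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=8, num_beams=5, do_sample=False)
continuation = tokenizer.decode(out[0, inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(continuation.strip().startswith(expected))   # True only if the model memorized the sentence
```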



μ | prop. (%) | LLaMA-7B AUC | LLaMA-7B Recall | LLaMA-13B AUC | LLaMA-13B Recall
1 | 0.02 | 0.542 | 0.033 | 0.558 | 0.033 |
3 | 0.06 | 0.745 | 0.030 | 0.747 | 0.289 |
5 | 0.10 | 0.805 | 0.393 | 0.808 | 0.453 |
7 | 0.14 | 0.883 | 0.671 | 0.891 | 0.710 |
9 | 0.18 | 0.902 | 0.770 | 0.991 | 0.904 |
4.2 Perplexity Test
To study the effect of the average repetition of ghost sentences, we randomly generate 30 different ghost sentences, each with a word length of 10. We then randomly select examples from Webis-148K and insert the ghost sentences at the end of these examples.
Table 1 presents the ROC AUC and recall of a perplexity test after fine-tuning LLaMA models on Webis-148K. During the test, we sample the same number of non-member examples from Webis-TLDR-17 and insert newly generated ghost sentences into them. The recall corresponds to a significance level of 0.05, with the critical value chosen from the empirical distribution as in Figure 2. When the repetition reaches 5, the perplexity test starts to provide decent performance. Figure 3 displays the perplexity of the LLaMA-7B models fine-tuned with ghost sentences. With an increase in repetition, we observe a dramatic decrease in the perplexity of ghost sentences: for every two additional repetitions, the average perplexity decreases by roughly 100. The perplexity of the normal context remains roughly the same after fine-tuning.
4.3 Last-k Words Test
The last-k words test is more straightforward for general users than the perplexity test: it only requires chatting with LLMs. However, it needs more repetitions of ghost sentences. In this section, we investigate the conditions under which LLMs achieve verbatim memorization of ghost sentences. Among all training examples, we randomly select a number of users for the insertion of ghost sentences. Each user has a unique ghost sentence, and the average repetition of ghost sentences is μ. A few key observations:
• When μ is sufficiently large, ghost sentences with a word length around 10 are very likely to be memorized by an OpenLLaMA-3B model fine-tuned on Webis-148K. As the scale of training data increases, memorization requires a larger μ.
• The success rate of memorization is jointly determined by the number of ghost sentences n and the average repetition μ. Notably, μ is more critical than n. Furthermore, a ghost sentence with a small repetition count can become memorable as the number of different ghost sentences n increases.
• It is better to insert ghost sentences in the latter half of a document. The insertion of ghost sentences does not affect the linguistic performance of LLMs (Appendix E).
• Training data domains and the choice of wordlist for passphrase generation also impact the memorization of ghost sentences (Appendix F).
• Further alignment does not affect the memorization of ghost sentences (Appendix G).
• The bigger the model, the fewer repetitions are required for memorization. This aligns with previous works [8]. Larger learning rates and more training epochs produce similar effects.
#Docs | n | μ | mid. | prop. (%) | U-Acc (k=1) | D-Acc (k=1) | U-Acc (k=2) | D-Acc (k=2)
148K | 0 | 0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
256 | 17 | 13.5 | 2.99 | 92.58 | 91.01 | 84.77 | 84.66 | |
128 | 17 | 13.0 | 1.47 | 85.94 | 85.96 | 73.44 | 75.26 | |
64 | 17 | 13.0 | 0.74 | 56.25 | 64.62 | 48.44 | 57.56 | |
32 | 18 | 12.0 | 0.39 | 75.00 | 78.86 | 65.62 | 74.18 | |
16 | 13 | 11.5 | 0.14 | 0.00 | 0.00 | 0.00 | 0.00 | |
16 | 21 | 16.5 | 0.22 | 62.50 | 64.85 | 50.00 | 55.15 | |
8 | 18 | 13.0 | 0.10 | 25.00 | 26.76 | 12.50 | 25.35 | |
8 | 31 | 25.5 | 0.16 | 100.0 | 94.29 | 100.0 | 90.61 | |
4 | 32 | 32.5 | 0.09 | 50.00 | 37.21 | 0.00 | 0.00 | |
2 | 48 | 47.5 | 0.06 | 100.0 | 98.95 | 100.0 | 98.95 | |
1 | 45 | 45.0 | 0.03 | 100.0 | 73.33 | 100.0 | 35.56 | |
1 | 51 | 51.0 | 0.03 | 100.0 | 98.04 | 100.0 | 98.04 | |
148K | 16 | 24 | 20.5 | 0.26 | 100.00 | 92.69 | 87.50 | 80.68 |
592K | 0.07 | 93.75 | 93.73 | 87.50 | 89.30 | |||
1.8M | 0.02 | 68.75 | 67.89 | 43.75 | 42.82 |
Length | Position (%) | U-Acc (k=1) | D-Acc (k=1) | U-Acc (k=2) | D-Acc (k=2)
6 | 100 | 87.50 | 84.59 | 77.34 | 74.36 |
8 | 84.38 | 82.81 | 75.39 | 74.54 | |
10 | 89.06 | 86.60 | 80.47 | 79.20 | |
12 | 92.58 | 91.01 | 84.77 | 84.66 | |
14 | 83.59 | 83.98 | 75.00 | 76.42 | |
16 | 91.02 | 89.72 | 84.77 | 85.43 | |
18 | 84.77 | 86.04 | 77.73 | 80.64 | |
20 | 91.41 | 92.64 | 86.33 | 87.35 | |
12 | 50 | 35.94 | 3.68 | 34.38 | 3.59 |
12 | 75 | 48.83 | 6.08 | 47.66 | 5.87 |
12 | 100 | 92.58 | 91.01 | 84.77 | 84.66 |
12 | [25, 100] | 88.28 | 39.33 | 80.08 | 36.77 |
12 | [50, 100] | 94.53 | 59.50 | 89.84 | 57.47 |
12 | [75, 100] | 91.02 | 75.40 | 87.05 | 72.67 |
4.3.1 Number and Repetition Times
The number of ghost sentences n and the average repetition μ work together to make an LLM achieve effective memorization. Table 2(a) illustrates the influence of different n and μ. A small number of ghost sentences generally requires more repetitions for the LLM to memorize them, and a large number of ghost sentences with few repetitions cannot achieve memorization. For example, the LLM cannot remember any ghost sentences of 16 users with an average repetition of 13, while a single user with a repetition of 45 can make the LLM remember his ghost sentence.
As the data scale increases, n and μ should also increase accordingly. We progressively scale the data with a fixed number of ghost sentences and repetitions. In the last 3 rows of Table 2(a), the identification accuracy drops with increasing data scale. For 16 ghost sentences with an average repetition of 24 in 1.8M training examples, the test achieves 68.75% user identification accuracy with k = 1, namely, 11 of 16 users obtain the correct last-word prediction. In this case, documents with ghost sentences account for only 0.02% of all 1777K examples. The minimum repetition among these 16 ghost sentences is 16. For reference, Webis-TLDR-17 contains 17.8K users whose document count exceeds 16. Intuitively, roughly 32 such users with ghost sentences could make an LLM achieve memorization. This suggests that the practical application of ghost sentences is quite feasible, especially for content platforms, which can easily meet such a requirement.


A ghost sentence with a small repetition count can also become memorable as the number of different ghost sentences increases. Figure 4 presents the D-Acc-k for different repetition counts of ghost sentences. In Figure 4(a), when the number of documents with ghost sentences is large, even ghost sentences with few repetitions can achieve a non-trivial D-Acc-k. Nevertheless, the D-Acc-k decreases dramatically in Figure 4(b), where the number of documents (577) is only about 13% of that in Figure 4(a) (4427). This is good news for users with a relatively low document count.
4.3.2 Length and Insertion Position
Longer ghost sentences are generally easier for the LLM to memorize. In Table 2(b), we gradually increase the length of the ghost sentences, and longer ghost sentences tend to achieve higher user and document identification accuracy. The reason is straightforward: as the length increases, the proportion of ghost sentence tokens among all training tokens rises, making LLMs pay more attention to them. Typically, we use a length of around 10 words; for reference, the average sentence length of the Harry Potter series is 11.97 words [22]. It is worth noting that a long ghost sentence is likely to be filtered out by exact substring deduplication [32], which uses a threshold of 50 tokens.
Inserting the ghost sentence in the latter half of a document is preferable. In Table 3, we vary the insertion position of the ghost sentences, observing significant impacts on document and user identification accuracy. When placed exactly at the middle of the document, U-Acc is no more than 36% and D-Acc is even less than 4%. A conjecture is that sentences in a document have strong dependencies, and an LLM tends to generate content according to the previous context. If a ghost sentence appears right in the middle of a document, the LLM may adhere to the prior normal context rather than incorporating a weird sentence. In short, we recommend that users insert ghost sentences in the latter half of a document. Such positions ensure robust user identification accuracy when the number of ghost sentences and the average repetition are adequate.
4.3.3 Model Sizes and Learning Strategies
The bigger the model, the larger the learning rate, or the more the epochs, the better the memorization performance. Table 3(a) displays the experimental results with various learning rates, training epochs, and model sizes. A larger model exhibits enhanced memorization capacity, consistent with previous findings: within a model family, larger models memorize 2-5× more than smaller models [8]. This observation implies the potential for commercial LLMs to retain ghost sentences, especially given their substantial size, such as the 175B GPT-3 model [6].
The learning rate and training epochs are also crucial. Minor changes can have large impacts on identification accuracy, as illustrated in Table 3(a). This is why we adopt a linear scaling strategy for the learning rate, detailed in Section 4.1: the learning rate at the pre-training stage serves as the baseline, and we scale our learning rate to match how much a training token contributes to the gradient. Besides, more training epochs contribute to improved memorization. When a 3B LLaMA model is trained for 2 epochs, it can achieve 100% user identification accuracy. For reference, both LLaMA [60] and GPT-3 [6] train high-quality text like Wikipedia or Books for more than 1 epoch. This suggests that ghost sentences may be effective for users who contribute high-quality text on the Internet.
Params | LR | Epochs | U-Acc (k=1) | D-Acc (k=1) | U-Acc (k=2) | D-Acc (k=2)
3B | 3.6e-6 | 1 | 67.52 | 67.58 | 54.80 | 51.56 |
4.6e-6 | 92.58 | 91.01 | 84.77 | 84.66 | ||
5.6e-6 | 96.09 | 98.05 | 92.73 | 93.36 | ||
3B | 3.6e-6 | 2 | 100.0 | 100.0 | 100.0 | 99.98 |
1.1B | 4.6e-6 | 1 | 0.0 | 0.0 | 0.0 | 0.0 |
♠1.1B | 4.6e-6 | 85.16 | 84.92 | 77.96 | 75.00 | |
3B | 4.6e-6 | 92.58 | 91.01 | 84.77 | 84.66 | |
7B | 4.6e-6 | 98.05 | 98.03 | 97.27 | 97.40 |
#Docs | n | μ | mid. | prop. (%) | U-Acc (k=1) | D-Acc (k=1) | U-Acc (k=2) | D-Acc (k=2)
3.7M | 24 | 27 | 22.0 | 0.017 | 0.0 | 0.0 | 0.0 | 0.0 |
32 | 27 | 24.0 | 0.023 | 0.0 | 0.0 | 0.0 | 0.0 | |
32 | 36 | 28.0 | 0.031 | 93.75 | 76.38 | 87.50 | 65.48 | |
9.8M | 64 | 36 | 28.0 | 0.023 | 95.31 | 70.31 | 84.38 | 60.78 |
96 | 25 | 19.0 | 0.024 | 62.50 | 55.94 | 40.62 | 44.36 | |
128 | 22 | 17.0 | 0.029 | 51.56 | 45.09 | 39.84 | 35.92 |
4.3.4 Continual Pre-training
Previously, we conducted instruction-tuning experiments to assess the memorization capacity of fine-tuned LLMs for ghost sentences. Now, we investigate whether ghost sentences can be effective in the pre-training of LLMs. However, the pre-training cost is formidable: training a "tiny" TinyLlama-1.1B [70] model with 3T tokens on 16 NVIDIA A100 40G GPUs takes 90 days. Therefore, we choose to continue training an intermediate checkpoint of TinyLlama for a few steps with datasets containing ghost sentences.
Larger repetition counts of ghost sentences are required for a "tiny" 1.1B model and millions of examples. In Table 3(b), we replicate experiments similar to those in Table 2(a) for the continued pre-training of TinyLlama. To make a 1.1B LLaMA model achieve memorization, larger average repetition counts are required. This is consistent with Table 3(a), where a 1.1B LLaMA model cannot remember any ghost sentences, while 3B and 7B LLaMA models achieve good memorization. To illustrate this point, we provide a visualization of D-Acc-k with different repetition counts for TinyLlama in Figure 7 in Appendix H.
5 Conclusion
In this work, we propose the insert-and-detect methodology for membership inference of online copyrighted material. Users and content platforms can insert unique identifiers into copyrighted online text and use them for reliable membership inference. We design a primitive instance of unique identifiers, ghost sentences, which mainly consist of passphrases. For demonstration purposes, we show how users can use the perplexity test and the user-friendly last-k words test to detect the training membership of ghost sentences. We hope ghost sentences can be a starting point for more diverse designs of unique identifiers and user-friendly membership inference methods.
References
- [1] Anthropic. Claude, an ai assistant created by anthropic. https://assistant.anthropic.com/, 2023.
- [2] Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022.
- [3] Guangsheng Bao, Yanbin Zhao, Zhiyang Teng, Linyi Yang, and Yue Zhang. Fast-detectgpt: Efficient zero-shot detection of machine-generated text via conditional probability curvature. In ICLR, 2024.
- [4] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. A neural probabilistic language model. J. Mach. Learn. Res., 3:1137–1155, 2003.
- [5] Stella Biderman, Hailey Schoelkopf, Quentin Gregory Anthony, Herbie Bradley, Kyle O’Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, Aviya Skowron, Lintang Sutawika, and Oskar van der Wal. Pythia: A suite for analyzing large language models across training and scaling. In ICML, pages 2397–2430, 2023.
- [6] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. NeurIPS, 33:1877–1901, 2020.
- [7] Nicholas Carlini, Steve Chien, Milad Nasr, Shuang Song, Andreas Terzis, and Florian Tramer. Membership inference attacks from first principles. In 2022 IEEE Symposium on Security and Privacy (SP), pages 1897–1914. IEEE, 2022.
- [8] Nicholas Carlini, Daphne Ippolito, Matthew Jagielski, Katherine Lee, Florian Tramer, and Chiyuan Zhang. Quantifying memorization across neural language models. In ICLR, 2023.
- [9] Nicholas Carlini, Chang Liu, Úlfar Erlingsson, Jernej Kos, and Dawn Song. The secret sharer: Evaluating and testing unintended memorization in neural networks. In USENIX security, pages 267–284, 2019.
- [10] Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, et al. Extracting training data from large language models. In 30th USENIX Security Symposium (USENIX Security 21), pages 2633–2650, 2021.
- [11] Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, 2023.
- [12] Tri Dao. Flashattention-2: Faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691, 2023.
- [13] Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In NeurIPS, 2022.
- [14] Mucong Ding, Tahseen Rabbani, Bang An, Aakriti Agrawal, Yuancheng Xu, Chenghao Deng, Sicheng Zhu, Abdirisak Mohamed, Yuxin Wen, Tom Goldstein, and Furong Huang. WAVES: Benchmarking the robustness of image watermarks. In ICLR 2024 Workshop on Reliable and Responsible Foundation Models, 2024.
- [15] Michael Duan, Anshuman Suri, Niloofar Mireshghallah, Sewon Min, Weijia Shi, Luke Zettlemoyer, Yulia Tsvetkov, Yejin Choi, David Evans, and Hannaneh Hajishirzi. Do membership inference attacks work on large language models? arXiv preprint arXiv:2402.07841, 2024.
- [16] Yanai Elazar, Akshita Bhagia, Ian Helgi Magnusson, Abhilasha Ravichander, Dustin Schwenk, Alane Suhr, Evan Pete Walsh, Dirk Groeneveld, Luca Soldaini, Sameer Singh, Hannaneh Hajishirzi, Noah A. Smith, and Jesse Dodge. What’s in my big data? In The Twelfth International Conference on Learning Representations, 2024.
- [17] Matt Fredrikson, Somesh Jha, and Thomas Ristenpart. Model inversion attacks that exploit confidence information and basic countermeasures. In Proceedings of the 22nd ACM SIGSAC conference on computer and communications security, pages 1322–1333, 2015.
- [18] Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy. The Pile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020.
- [19] Xinyang Geng and Hao Liu. Openllama: An open reproduction of llama, May 2023.
- [20] Google. Bard: An early experiment with generative ai. https://ai.google/static/documents/google-about-bard.pdf, 2023.
- [21] Chenchen Gu, Xiang Lisa Li, Percy Liang, and Tatsunori Hashimoto. On the learnability of watermarks for language models. In ICLR, 2024.
- [22] Wouter Haverals and Lindsey Geybels. Putting the sorting hat on jk rowling’s reader: A digital inquiry into the age of the implied readership of the harry potter series. Journal of Cultural Analytics, 6(1), 2021.
- [23] Peter Henderson, Xuechen Li, Dan Jurafsky, Tatsunori Hashimoto, Mark A Lemley, and Percy Liang. Foundation models and fair use. arXiv preprint arXiv:2303.15715, 2023.
- [24] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. ICLR, 2021.
- [25] Sorami Hisamoto, Matt Post, and Kevin Duh. Membership inference attacks on sequence-to-sequence models: Is my data in your machine translation system? Transactions of the Association for Computational Linguistics, pages 49–63, 2020.
- [26] Jian Hu, Xibin Wu, Weixun Wang, Xianyu, Dehao Zhang, and Yu Cao. Openrlhf: An easy-to-use, scalable and high-performance rlhf framework. arXiv preprint arXiv:2405.11143, 2024.
- [27] Shotaro Ishihara. Training data extraction from pre-trained language models: A survey. In Proceedings of the 3rd Workshop on Trustworthy Natural Language Processing (TrustNLP 2023), pages 260–275, 2023.
- [28] Antonia Karamolegkou, Jiaang Li, Li Zhou, and Anders Søgaard. Copyright violations and large language models. In EMNLP, pages 7403–7412, Singapore, December 2023.
- [29] John Kirchenbauer, Jonas Geiping, Yuxin Wen, Jonathan Katz, Ian Miers, and Tom Goldstein. A watermark for large language models. In ICML, pages 17061–17084. PMLR, 2023.
- [30] Sneha Kudugunta, Isaac Caswell, Biao Zhang, Xavier Garcia, Christopher A Choquette-Choo, Katherine Lee, Derrick Xin, Aditya Kusupati, Romi Stella, Ankur Bapna, et al. Madlad-400: A multilingual and document-level large audited dataset. arXiv preprint arXiv:2309.04662, 2023.
- [31] Jooyoung Lee, Thai Le, Jinghui Chen, and Dongwon Lee. Do language models plagiarize? In Proceedings of the ACM Web Conference 2023, pages 3637–3647, 2023.
- [32] Katherine Lee, Daphne Ippolito, Andrew Nystrom, Chiyuan Zhang, Douglas Eck, Chris Callison-Burch, and Nicholas Carlini. Deduplicating training data makes language models better. In ACL, pages 8424–8445, 2022.
- [33] Haodong Li, Gelei Deng, Yi Liu, Kailong Wang, Yuekang Li, Tianwei Zhang, Yang Liu, Guoai Xu, Guosheng Xu, and Haoyu Wang. Digger: Detecting copyright content mis-usage in large language model training. arXiv preprint arXiv:2401.00676, 2024.
- [34] Yunxiang Li, Zihan Li, Kai Zhang, Ruilong Dan, Steve Jiang, and You Zhang. Chatdoctor: A medical chat model fine-tuned on a large language model meta-ai (llama) using medical domain knowledge. Cureus, 15(6), 2023.
- [35] Wing Lian, Bleys Goodson, Eugene Pentland, Austin Cook, Chanvichet Vong, and "Teknium". Openorca: An open dataset of gpt augmented flan reasoning traces. https://huggingface.co/Open-Orca/OpenOrca, 2023.
- [36] Aiwei Liu, Leyi Pan, Xuming Hu, Shiao Meng, and Lijie Wen. A semantic invariant robust watermark for large language models. In ICLR, 2024.
- [37] Aiwei Liu, Leyi Pan, Yijian Lu, Jingjing Li, Xuming Hu, Lijie Wen, Irwin King, and Philip S Yu. A survey of text watermarking in the era of large language models. arXiv preprint arXiv:2312.07913, 2023.
- [38] Shayne Longpre, Le Hou, Tu Vu, Albert Webson, Hyung Won Chung, Yi Tay, Denny Zhou, Quoc V. Le, Barret Zoph, Jason Wei, and Adam Roberts. The flan collection: Designing data and methods for effective instruction tuning, 2023.
- [39] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
- [40] Justus Mattern, Fatemehsadat Mireshghallah, Zhijing Jin, Bernhard Schoelkopf, Mrinmaya Sachan, and Taylor Berg-Kirkpatrick. Membership inference attacks against language models via neighbourhood comparison. In Findings of ACL, pages 11330–11343, Toronto, Canada, 2023.
- [41] Matthieu Meeus, Shubham Jain, Marek Rei, and Yves-Alexandre de Montjoye. Did the neurons read your book? document-level membership inference for large language models. arXiv preprint arXiv:2310.15007, 2023.
- [42] Matthieu Meeus, Igor Shilov, Manuel Faysse, and Yves-Alexandre de Montjoye. Copyright traps for large language models. arXiv preprint arXiv:2402.09363, 2024.
- [43] Niloofar Mireshghallah, Justus Mattern, Sicun Gao, Reza Shokri, and Taylor Berg-Kirkpatrick. Smaller language models are better zero-shot machine-generated text detectors. In EAACL, pages 278–293, March 2024.
- [44] Eric Mitchell, Yoonho Lee, Alexander Khazatsky, Christopher D Manning, and Chelsea Finn. Detectgpt: Zero-shot machine-generated text detection using probability curvature. In ICML, pages 24950–24962. PMLR, 2023.
- [45] Subhabrata Mukherjee, Arindam Mitra, Ganesh Jawahar, Sahaj Agarwal, Hamid Palangi, and Ahmed Awadallah. Orca: Progressive learning from complex explanation traces of gpt-4. arXiv preprint arXiv:2306.02707, 2023.
- [46] Milad Nasr, Nicholas Carlini, Jonathan Hayase, Matthew Jagielski, A Feder Cooper, Daphne Ippolito, Christopher A Choquette-Choo, Eric Wallace, Florian Tramèr, and Katherine Lee. Scalable extraction of training data from (production) language models. arXiv preprint arXiv:2311.17035, 2023.
- [47] OpenAI. Comment regarding request for comments on intellectual property protection for artificial intelligence innovation. https://cdn.openai.com/policy-submissions/OpenAI%20Comments%20on%20Intellectual%20Property%20Protection%20for%20Artificial%20Intelligence%20Innovation.pdf, 2019.
- [48] OpenAI. Chatgpt: A conversational ai language model, 2022.
- [49] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. NeurIPS, 35:27730–27744, 2022.
- [50] Sigmund N. Porter. A password extension for improved human factors. Comput. Secur., pages 54–56, 1982.
- [51] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
- [52] Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. NeurIPS, 36, 2024.
- [53] Arnold G. Reinhold. The diceware passphrase home page. https://theworld.com/~reinhold/diceware.html, 1995.
- [54] Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909, 2015.
- [55] Weijia Shi, Anirudh Ajith, Mengzhou Xia, Yangsibo Huang, Daogao Liu, Terra Blevins, Danqi Chen, and Luke Zettlemoyer. Detecting pretraining data from large language models. In ICLR, 2024.
- [56] Reza Shokri, Marco Stronati, Congzheng Song, and Vitaly Shmatikov. Membership inference attacks against machine learning models. In 2017 IEEE symposium on security and privacy (SP), pages 3–18, 2017.
- [57] Congzheng Song and Vitaly Shmatikov. Auditing data provenance in text-generation models. In ACM SIGKDD, pages 196–206, 2019.
- [58] StabilityAI. Stablelm: Stability ai language models, 2023.
- [59] Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023.
- [60] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurélien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language models. 2023.
- [61] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
- [62] Michael Völske, Martin Potthast, Shahbaz Syed, and Benno Stein. TL;DR: Mining Reddit to Learn Automatic Summarization. In Workshop on New Frontiers in Summarization at EMNLP 2017, pages 59–63, September 2017.
- [63] Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-instruct: Aligning language models with self-generated instructions. In ACL, pages 13484–13508, 2023.
- [64] Johnny Tian-Zheng Wei, Ryan Yixiang Wang, and Robin Jia. Proving membership in llm pretraining data via data watermarks. arXiv preprint arXiv:2402.10892, 2024.
- [65] Minghao Wu, Abdul Waheed, Chiyu Zhang, Muhammad Abdul-Mageed, and Alham Fikri Aji. Lamini-lm: A diverse herd of distilled models from large-scale instructions. CoRR, abs/2304.14402, 2023.
- [66] Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, and Daxin Jiang. Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244, 2023.
- [67] Samuel Yeom, Irene Giacomelli, Matt Fredrikson, and Somesh Jha. Privacy risk in machine learning: Analyzing the connection to overfitting. In 2018 IEEE 31st computer security foundations symposium (CSF), pages 268–282. IEEE, 2018.
- [68] Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence? In ACL, 2019.
- [69] Chiyuan Zhang, Daphne Ippolito, Katherine Lee, Matthew Jagielski, Florian Tramèr, and Nicholas Carlini. Counterfactual memorization in neural language models. NeurIPS, 2023.
- [70] Peiyuan Zhang, Guangtao Zeng, Tianduo Wang, and Wei Lu. Tinyllama: An open-source small language model, 2024.
Appendix A Broader Impacts
The proposed unique identifiers assist web users in protecting online copyrighted material used in large language model training. Ideally, unique identifiers can provide trustworthy membership inference results for copyrighted material. This is good news for web users who own online copyrighted material and for content platforms where that material is hosted. Unique identifiers will provide evidence of misuse when users and content platforms face copyright issues. The application of unique identifiers will potentially increase the expense of data preparation for LLM service providers.
Appendix B More Related Works
Instruction Tuning
The most popular fine-tuning method for pre-trained LLMs now is instruction tuning [63]. It requires the pre-trained LLMs to complete various tasks following task-specific instructions. Instruction tuning can improve the instruction-following capabilities of pre-trained LLMs and their performance on various downstream tasks [59, 11, 45, 34, 66]. The training data for instruction tuning come from either the content generated by powerful commercial LLMs like GPT-3.5 and GPT-4 [59, 45], or real data from web users [11, 34].
Diceware Passphrase
A passphrase, similar to a password, is a sequence of words used for authentication [50]. Diceware is a method for creating passphrases by randomly selecting words from a diceware word list [53]. Such a list typically consists of 7776 (= 6^5) words, each determined by rolling a die five times. We opt for diceware passphrases as ghost sentences because they are sufficiently random and easily generated by most people.
Appendix C Verbatim Memorization Capability of Commercial LLMs


Commercial LLMs like ChatGPT can memorize the content of popular books verbatim as shown in Figure 5. Some conclusions can be drawn from the phenomenon: 1) This demonstrates the significant memorization capacity of LLMs. 2) OpenAI may not have a strict process for deduplicating repeated content in the training data. Otherwise, verbatim memorization would not be possible. It is also possible that a strict deduplication process could lead to worse performance of LLMs, especially for short pieces of text, as this could break the integrity of the whole text.
Appendix D Statistics of Users on Reddit


Figure 6 displays the statistics of users in Webis-TLDR-17 [62], which contains Reddit posts (submissions & comments) containing "TL;DR" from 2006 to 2016. Figure 6(a) shows that the number of documents per user is small for most users, with a long-tailed distribution, which is evident in Figure 6(b). Out of 1435K users, 1391K users with low document counts contribute 2523K documents, making up about 75% of the total 3351K data.
#Docs | n | μ | mid. | prop. (%) | HellaSwag | MMLU
OpenLLaMA-3Bv2 [19] | 69.97 | 26.45 | ||||
148K | 256 | 17 | 13.5 | 2.99 | 71.23 | 26.01 |
128 | 17 | 13.0 | 1.47 | 71.32 | 26.10 | |
64 | 17 | 13.0 | 0.74 | 71.46 | 26.13 | |
32 | 18 | 12.0 | 0.39 | 71.39 | 26.36 | |
16 | 13 | 11.5 | 0.14 | 71.43 | 25.85 | |
16 | 21 | 16.5 | 0.22 | 70.94 | 25.40 | |
8 | 18 | 13.0 | 0.10 | 71.35 | 26.29 | |
8 | 31 | 25.5 | 0.16 | 71.32 | 25.38 | |
4 | 32 | 32.5 | 0.087 | 71.00 | 25.96 | |
2 | 48 | 47.5 | 0.064 | 70.88 | 25.74 | |
1 | 45 | 45.0 | 0.030 | 70.39 | 25.37 | |
1 | 51 | 51.0 | 0.034 | 70.40 | 25.37 | |
148K | 16 | 24 | 20.5 | 0.259 | 70.55 | 26.21 |
592K | 0.065 | 70.76 | 26.64 | |||
1.8M | 0.022 | 71.07 | 26.51 |
Appendix E Results on Common Benchmarks
In Table 4, we provide the results of instruction tuning on common benchmarks like HellaSwag [68] and MMLU [24]. Table 4 corresponds to the identification results in Table 2(a) and shows that inserting ghost sentences into training datasets has no significant influence on the performance of LLMs on common benchmarks.
Domain | Wordlist | #Words | U-Acc (k=1) | D-Acc (k=1) | U-Acc (k=2) | D-Acc (k=2)
Harry Potter | 4,000 | 77.73 | 76.33 | 66.02 | 68.26 | |
Game of Thrones | 4,000 | 69.14 | 70.02 | 54.69 | 59.36 | |
EFF Large | 7,776 | 92.58 | 91.01 | 84.77 | 84.66 | |
Natural Language | 7,776 | 88.28 | 87.67 | 78.52 | 78.27 | |
Niceware | 65,536 | 94.92 | 94.96 | 91.02 | 89.63 | |
Patient Conversation | EFF Large | 7,776 | 77.73 | 79.22 | 62.11 | 67.49 |
Code | 99.22 | 99.10 | 98.44 | 98.74 |
Appendix F Sentences Sources and Data Domain
The wordlists used for ghost sentences have significant impacts on the memorization of LLMs. In the above experiments, we use diceware passphrases generated from the EFF Large Wordlist for Passphrases as ghost sentences. The wordlist was published in 2016 by the Electronic Frontier Foundation (EFF; see "Deep Dive: EFF's New Wordlists for Random Passphrases"). Table 5 presents results using ghost sentences generated from various sources, such as the EFF Harry Potter-inspired wordlist, the EFF Game of Thrones-inspired wordlist, Natural Language Passwords, and Niceware. Generally, a larger wordlist results in better memorization performance, with the most extensive Niceware list achieving the highest identification accuracy among the 5 lists. Although Natural Language Passwords offers phrases with a natural language structure, it performs no better than the entirely random EFF Large Wordlist. Given the meticulous creation and strong security of the EFF Large Wordlist, it remains our choice for this work, though Niceware could also be a suitable option.
The memorization performance is also influenced by the domain of the training data. Table 5 shows experiments conducted with 100K real patient-doctor conversations from HealthCareMagic.com [34] and 120K code generation examples (Hugging Face: iamtarun/code_instructions_120k_alpaca). Ghost sentences demonstrate strong memorization performance on code data, which is a positive message for programmers who host their code on platforms like GitHub. They can also easily meet the repetition requirement because a code project generally contains tens or hundreds of files. In our experiments, we insert ghost sentences as comments into the code. To address the potential issue of comments being easily filtered, we recommend that users insert ghost sentences as inconspicuous code lines, for instance by defining ghost sentences as variables and performing harmless operations on them, as sketched below.
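A hypothetical illustration of this suggestion: the passphrase (taken from the examples in Appendix I) is stored in an innocuous-looking variable and used in a harmless operation, so it is less likely to be stripped than a plain comment. The variable and function names are invented for the sketch.

```python
_BUILD_TAG = "ethics fiddle wriggle splashy gotten henna clever matching create pushing"

def build_info(verbose: bool = False) -> int:
    # A harmless use of the string so it does not look like dead code.
    if verbose:
        print(f"build tag: {_BUILD_TAG}")
    return len(_BUILD_TAG.split())
```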
Appendix G Influence of Alignment
After pre-training and fine-tuning, modern LLMs are further aligned for helpfulness, honesty, and harmlessness [2, 49]. Table 6 shows the results of the last-k words test for a further aligned LLM pre-trained with ghost sentences. The number of alignment preference pairs is 124K, the number of pre-training documents is 9.8M, and the ratio is 0.0126%. For reference, LLaMA-2 [61] uses 2.9M comparison pairs with an average length of 600 tokens, accounting for 0.00087% of the 2T pre-training tokens. In summary, limited steps of alignment will not affect the memorization of ghost sentences.


prompt: …[omit]… Instruction:\nContinue writing the given content\n\n### Input:\nHad an appendicitis the night before a flight to Hawaii.\n\n### Response:\nDocs said that had it happened mid-air, I might have died since there are few good options for diverting a plane for a med emergency over the Pacific. Ruined my vacation and started my pregnant wife on the road to hating me and wanting a divorce, but I lived. headache exterior subtext semifinal rewrap rupture animal aim wrath splendor |
Generation: rash blaspheme |
Ground Truth: rash blaspheme |
prompt:…[omit]…Input:\nHere goes: \nAuthors like Tolkien, …[omit 122 words]…\n\n### Response:\nIn Martin’s works however, we are dealing with huge swaths of territory where everyone magically communicates in the same language, …[omit 143 words]… headache exterior subtext semifinal rewrap rupture animal aim wrath splendor |
Generation: rash blaspheme |
Ground Truth: rash blaspheme |
Appendix H Identification Accuracy of TinyLlama
Figure 7 presents the D-Acc-k with different repetition counts of ghost sentences for TinyLlama. Compared to the larger LLaMA-3B or LLaMA-7B models (Table 3(a)), the "tiny" 1.1B model requires larger repetition counts of ghost sentences.
In Table 7, we give two examples to show how we generate the last words of ghost sentences with TinyLlama.
Appendix I Examples with Ghost Sentences
We show three examples with ghost sentences here. The first one is from HealthCareMagic.com [34], and the other two come from a Reddit user S***hy, who has 86 examples in the Webis-TLDR-17 dataset [62]. Ghost sentences are shown in blue.
Example 1:
Instruction: If you are a doctor, please answer the medical questions based on the patient’s description.
Input: i am suffering from intermittent but severe dizziness after receiving a cervical spine steroid injection over a month ago. …[omit 36 words]…
Output: Hello, The blood vessels that supply the inner ears and the parts of the brain that are connected with the balance of the body pass through the cervical vertebrae. …[omit 54 words]… Let me know if I can assist you further. Take care Chat Doctor. ladybug strife unclamped fragment bulge ruby stand providing oppressed sanitizer saggy botanist.
Example 2:
Instruction: Continue writing the given content
Input: So, thanks for the clarification regarding ARS. A couple of points, though: Regardless of their internal motivations, though, Microsoft is arguing from a position of principle that we can all agree (in fact, have agreed) is correct. Not on the page you linked to. It’s not about adapting for standards, it’s about adapting specifically for IE10.,
Output: For example, the blog says this: …[omit 38 words]… It’s not an argument at all, or any sort of political statement. It’s not really good, or bad. ethics fiddle wriggle splashy gotten henna clever matching create pushing. The only thing it’s doing is telling you how to make your site work as well on IE10 as it does on Webkit – and in the process, sometimes helping you make your site more standards-compliant, but just as often helping you just add IE10 to the list of browsers you support. I’m also not sure how you get to your
Example 3:
Instruction: Continue writing the given content
Input: Do you guys think my deception early in the relationship was wrong? Yes. Am I wrong to have married a Christian? …[omit 86 words]…
Output: Relationships have survived far worse than this. …[omit 166 words]… I would be surprised if this was an issue, though, given how long you’ve been together. ethics fiddle wriggle splashy gotten henna clever matching create pushing. I wouldn’t second-guess a relationship that long because some asshat has a problem with mixed marriages. Anyway
Appendix J The Ghost Sentence of This Work
This paper contains 12 mysterious words: quickstep drudge consent wackiness mangle unspoiled childish exploring antennae agony embassy starved.