CopySpec: Accelerating LLMs with Speculative Copy-and-Paste
Without Compromising Quality
Abstract
We introduce CopySpec, an innovative technique designed to tackle the inefficiencies LLMs face when generating responses that closely resemble previous outputs. CopySpec identifies repeated sequences in the model’s chat history and speculates that the same tokens will follow, enabling seamless copying without compromising output quality or requiring additional GPU memory. To evaluate the effectiveness of our approach, we conducted experiments using five LLMs and five datasets: MT-Bench, CNN/DM, GSM-8K, HumanEval, and our newly created dataset, MT-Redundant. MT-Redundant, introduced in this paper, transforms the second turn of MT-Bench into a request for variations of the first turn’s answer, simulating real-world scenarios where users request modifications to prior responses. Our results demonstrate significant speed-ups: up to 2.35× on CNN/DM, 3.08× on the second turn of select MT-Redundant categories, and 2.66× on the third turn of GSM-8K’s self-correction tasks. Moreover, we show that CopySpec integrates seamlessly with speculative decoding, yielding an average 49% additional speed-up over speculative decoding for the second turn of MT-Redundant across all eight categories. While LLMs, even with speculative decoding, suffer from slower inference as context sizes grow, CopySpec leverages the expanded context to accelerate inference, making it faster as the context size increases. Our code and dataset are publicly available at https://github.com/RazvanDu/CopySpec.
1 Introduction
Large Language Models (LLMs) have revolutionized natural language processing (NLP), enabling great performance across a range of applications, including code generation, machine translation, and question answering. However, the computational demands of LLMs, particularly during inference, pose significant challenges for real-time applications and scalability in resource-constrained environments. Sequential token generation, a core bottleneck in standard decoding, limits throughput and increases latency. Speculative Decoding (Leviathan et al., 2023; Chen & Xu, 2023) has emerged as a promising approach to mitigate this issue by employing a smaller draft model to generate multiple token sequences, which are then verified by the larger target model. Despite its potential, existing speculative decoding methods often fail to fully exploit the inherent redundancies in LLM-generated outputs and require extra GPU memory or modifications to the original LLM, leaving considerable room for improvement.
In this work, we present CopySpec, a novel speculative decoding framework designed to address these limitations. CopySpec incorporates a copying mechanism into the draft process, enabling the model to detect and exploit predictable patterns in token sequences (see Figure 2 for a summary of the approach). By inferring subsequent tokens directly from prior context, CopySpec reduces the computational burden associated with repetitive or predictable outputs. Additionally, CopySpec enhances the sampling-verification process with minimal computational overhead.
Our experiments on various benchmarks—including HumanEval (Chen et al., 2021), CNN/DM (See et al., 2017), GSM-8K (Cobbe et al., 2021), and MT Bench (Zheng et al., 2023)—demonstrate that CopySpec delivers up to an additional 49% speed-up over speculative decoding, without compromising output quality. This broad performance boost highlights its strong potential for real-world deployments. By combining copying mechanisms with streamlined verification, CopySpec provides a robust and efficient solution for LLM inference, effectively addressing resource constraints in a variety of tasks.
Key Contributions: 1) CopySpec introduces a novel framework that dynamically identifies and copies repeated token patterns, seamlessly integrating with speculative decoding to improve inference efficiency. By leveraging a rolling hash mechanism, it efficiently speculates on larger token blocks with minimal computational overhead.
2) Our method achieves significant speedups with minimal overhead, requiring no changes to the LLM architecture or additional GPU memory, making it lightweight and practical for real-world use.
3) Evaluations across five datasets, including MT-Bench, CNN/DM, GSM-8K, HumanEval, and MT-Redundant, demonstrate CopySpec’s ability to deliver up to a 3.08× speedup in specific MT-Redundant categories and a 49% speed-up on top of speculative decoding, without compromising output quality.
2 Related Work
2.1 Speculative Decoding
Speculative decoding is an effective approach for accelerating inference in LLMs by parallelizing token generation and verification. Leviathan et al. (2023) introduced the foundational framework, employing a small draft model to propose multiple tokens that a larger model verifies, significantly reducing inference latency. Medusa (Cai et al., 2024) expanded this idea by leveraging multi-head decoding to enable simultaneous token generation and verification, improving throughput.
Dynamic verification pipelines balance speed and accuracy by adjusting verification depth based on output quality (Liu et al., 2024). Token tree verification accelerates serving (Miao et al., 2023), while pipelined exact decoding handles compute-latency trade-offs (Yang et al., 2023). Knowledge distillation enhances draft–target model interaction (Zhou et al., 2023), and retrieval-based token validation improves efficiency (He et al., 2023). Speculative decoding has been further optimized in recent works. SpecHub (Sun et al., 2024) uses optimal transport to improve draft token acceptance rates, and SPEED (He & Wang, 2023) leverages early-layer hidden states for parallel token execution.
While speculative decoding enables efficient token generation, our work addresses a distinct challenge: leveraging predictable token patterns without introducing significant additional computation. CopySpec acts as an intelligent copying mechanism within the speculative decoding framework, reducing redundancy and improving efficiency across various tasks. By identifying and reusing repeated patterns in the context, CopySpec not only accelerates inference but also complements speculative decoding by extending its applicability to scenarios with high redundancy, such as multi-turn interactions and tasks with self-correction. This integration demonstrates the potential of combining these techniques to achieve greater efficiency in large-scale language models.
2.2 Copying Mechanisms in Language Models
Copying mechanisms are widely adopted in NLP to handle tasks that require replicating predictable patterns or segments. Gu et al. (2016) introduced CopyNet, a method that enables RNN sequence-to-sequence models to predict words based on a mixed probabilistic model of two modes, where one selects words from the source sequence. Similarly, in summarization tasks, Pointer Networks (Vinyals et al., 2015) and Pointer-Generator Networks (See et al., 2017) demonstrated the effectiveness of combining copying and generation to improve output fidelity and handle out-of-vocabulary tokens.
More recently, McCoy et al. (2023) analyzed the extent to which transformers copy from their training data, providing insights into copying behaviors in modern LLMs. Jelassi et al. (2024) showed that transformers outperform state space models in copying repetitive patterns.
Lastly, in a different domain, Andronov et al. (2024) introduced a copying mechanism into a transformer-based encoder-decoder that models chemical reactions by observing that portions of the input chemicals often remain unchanged in the output.
While previous works have emphasized the importance of copying mechanisms in various applications, our work is the first to explore this concept in the specific context of LLM inference. CopySpec integrates a copying mechanism into speculative decoding, effectively reducing redundancy and enhancing efficiency across a wide range of tasks. By leveraging repeated patterns in the model’s context, CopySpec introduces a novel approach to accelerate inference while maintaining high performance.
Model (Instruct) | Variant | Metric | MT-Redundant (0-shot, GPT-4 Score ↑) | CNN/DM (0-shot, ROUGE-L ↑) | GSM-8K (3-turn, Accuracy ↑) | MT-Bench (0-shot, GPT-4 Score ↑) | HumanEval (0-shot, Accuracy ↑)
Qwen2.5-72B | Both | Score | 9.28 | 0.213 | 96% | 9.18 | 87.8% |
CopySpec | Tokens/Sec | 6.42 | 8.68 | 7.01 | 5.55 | 7.01 | |
Copied | 32.35% | 82.48% | 47.59% | 20.53% | 37.47% | ||
Base model | Tokens/Sec | 4.82 | 3.70 | 4.55 | 4.83 | 4.98 | |
Qwen2.5-32B | Both | Score | 9.10 | 0.214 | 93% | 8.97 | 89.6% |
CopySpec | Tokens/Sec | 13.82 | 18.34 | 14.84 | 12.15 | 14.41 | |
Copied | 33.17% | 81.82% | 44.93% | 22.61% | 34.23% | ||
Base model | Tokens/Sec | 10.26 | 7.79 | 9.76 | 10.29 | 10.46 | |
Qwen2.5-7B | Both | Score | 8.53 | 0.230 | 85% | 8.41 | 82.3% |
CopySpec | Tokens/Sec | 54.05 | 47.15 | 63.37 | 46.85 | 48.79 | |
Copied | 34.42% | 65.67% | 53.01% | 22.86% | 32.68% | ||
Base model | Tokens/Sec | 39.88 | 25.25 | 38.58 | 39.98 | 33.63 | |
Llama3.1-70B | Both | Score | 8.74 | 0.204 | 90% | 8.72 | 77.4% |
CopySpec | Tokens/Sec | 6.57 | 5.49 | 6.06 | 5.83 | 6.24 | |
Copied | 31.42% | 38.35% | 30.07% | 21.83% | 27.54% | ||
Base model | Tokens/Sec | 4.98 | 4.19 | 4.77 | 4.98 | 5.05 | |
Llama3.1-8B | Both | Score | 8.03 | 0.185 | 79% | 7.54 | 65.9% |
CopySpec | Tokens/Sec | 49.28 | 37.44 | 49.60 | 45.84 | 46.49 | |
Copied | 35.45% | 38.32% | 38.01% | 30.01% | 26.44% | ||
Base model | Tokens/Sec | 35.51 | 26.57 | 35.19 | 35.43 | 37.57 |
2.3 Fill-in-the-Middle (FIM) Techniques
Fill-in-the-Middle (FIM) enables language models to generate text segments within a given context, enhancing flexibility in tasks such as text and code infilling. Bavarian et al. (2022) introduced a data transformation approach for autoregressive models to learn infilling without sacrificing left-to-right generative performance, while Shen et al. (2023) proposed FiLM, enabling flexible generation by masking arbitrary positions.
In code generation, FIM techniques are crucial for editing and repair tasks. Models like Code Llama (Roziere et al., 2023) and InCoder (Fried et al., 2023) utilize bidirectional context for structured prompts, achieving state-of-the-art results on benchmarks such as HumanEval. Frameworks such as Self-Infilling (Zheng et al., 2024) and benchmarks like SAFIM further enhance these methods with backward generation and syntax-aware metrics (Wang et al., 2023; Gong et al., 2024). More recently, models like Codestral and CodeGemma have refined FIM techniques to improve alignment (Mistral AI, 2024; Team et al., 2024).
However, it is important to emphasize the distinct advantages of our method compared to the FIM approach. FIM relies on labeled tokens such as <prefix> and <suffix> to guide the model in fixing a specific section of code bidirectionally, whereas our method operates label-free, enabling a more flexible and generalizable approach. Additionally, while FIM is constrained to modifying a single code segment (typically the middle), CopySpec allows modifications in multiple distinct regions of the input. Furthermore, we maintain the architectural simplicity of a left-to-right LLM, ensuring that our method remains compatible with existing LLM frameworks while offering significant improvements in efficiency and versatility.
3 Method
Our method operates on the assumption that if the last tokens generated by an LLM also appear earlier in the context, the tokens that followed them in the context are likely to follow again in the output. Figures 1 and 2 illustrate this concept. By accurately identifying the start of such a segment, we can generate all tokens within the block in a single pass through the LLM, bypassing the need for a draft model to produce them incrementally. In the following subsections, we detail the implementation of this approach and its integration into a speculative decoding framework, demonstrating how it achieves substantial speed-ups.
3.1 Identifying the Tokens to Copy
To efficiently detect when the model begins generating a block that has already been produced earlier, we maintain a hash map containing all γ-token subsequences of the context. During the generation process, we search this hash map for a match to the last γ tokens generated. Adding a new tuple of γ tokens to the hash map and searching for a match after each generated token has a time complexity of O(γ). Since γ is typically set to a small value (e.g., 2 or 3), the computational overhead for processing new tokens and finding matches is minimal and independent of the context size. This stands in contrast to alternative approaches that require searching the entire context for the last substring, which can become computationally expensive as the context grows.
Our technique efficiently leverages larger contexts, allowing inference to become faster as the context size increases. By keeping γ fixed, we ensure a balance between efficiency and precision. Additionally, we explored methods to utilize partial outputs without revealing the complete results and investigated how the semantic relationship between the preceding γ tokens and the subsequent token can guide the optimal choice of γ. Further details are provided in Appendix A.
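To make the lookup concrete, the sketch below keeps a dictionary keyed by γ-token tuples, mirroring the hash map described above. The class and method names (CopyIndex, add_token, find_match) are illustrative assumptions, not the paper's actual implementation, which relies on a rolling hash over token IDs rather than Python tuples.

```python
from collections import defaultdict

class CopyIndex:
    """Illustrative index mapping gamma-token tuples -> positions where they end."""

    def __init__(self, gamma=3):
        self.gamma = gamma
        self.positions = defaultdict(list)  # gamma-gram -> list of end positions

    def add_token(self, tokens):
        """Register the gamma-gram ending at the newest token (O(gamma) work)."""
        if len(tokens) >= self.gamma:
            key = tuple(tokens[-self.gamma:])
            self.positions[key].append(len(tokens) - 1)

    def find_match(self, tokens):
        """Return the end position of an earlier occurrence of the last gamma
        tokens, or None if the current suffix has not been seen before."""
        if len(tokens) < self.gamma:
            return None
        key = tuple(tokens[-self.gamma:])
        for pos in self.positions.get(key, []):
            if pos < len(tokens) - 1:   # skip the occurrence we are standing on
                return pos              # take the first earlier match
        return None
```

Because only the γ-token suffix is hashed, both operations stay O(γ) per generated token regardless of how long the context grows.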
3.2 Speculating on the Matched Tokens
After identifying a match of the last γ tokens in the context, we extract the tokens that immediately followed that earlier occurrence, as shown in Figure 2. These extracted tokens, which we refer to as the copied block, essentially simulate the behavior of a draft model in which the probability of each proposed token is treated as 100%. (In cases where multiple matches exist for the last γ tokens, we simplify the process by selecting the first match, though we acknowledge that alternative strategies could improve efficiency.)
The copied block is then verified directly by the LLM. Each verification yields the tokens that align with the LLM's ongoing generation, along with one additional guaranteed token. This approach mirrors vanilla speculative decoding (Leviathan et al., 2023), where speculative tokens are appended to the context and the longest prefix matching the LLM's output is accepted. In Figure 2, the copied block is highlighted in blue. The output shows the accepted tokens in green, the extra guaranteed token in gold, and any rejected tokens in red. This process effectively treats the copied tokens as a "perfect prediction," ensuring efficient token generation when patterns are detected.
After each newly generated token or copying attempt, we re-evaluate the last γ tokens in the context to identify a new match, allowing the model to utilize longer copyable blocks whenever possible. This avoids falling back to token-by-token generation between copying steps.
If any tokens in the copied block fail the verification step, the model generates a new token that diverges from the previously matched tokens. This ensures that the next copying attempt yields a different match, preventing the model from getting stuck in repetitive loops. Furthermore, we always use a temperature of 0 to maintain the original output distribution of the model and ensure that our technique does not introduce any stochasticity into the generation process.
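The following sketch shows how a copied block can be checked in a single forward pass under greedy (temperature-0) decoding, accepting the longest matching prefix plus one guaranteed token. It assumes a Hugging Face-style causal LM that returns `.logits`; a production implementation would reuse the KV cache instead of re-encoding the full context.

```python
import torch

@torch.no_grad()
def verify_copied_block(model, input_ids, candidate_ids):
    """Greedy verification of a speculated block.

    input_ids:     (1, t) accepted context
    candidate_ids: (1, n) tokens speculated by copying
    Returns the accepted prefix of the candidate plus one bonus token.
    """
    full = torch.cat([input_ids, candidate_ids], dim=-1)
    logits = model(full).logits                      # (1, t + n, vocab)
    t = input_ids.shape[-1]
    # logits[:, i] predicts the token at position i + 1, so positions
    # t-1 ... t+n-2 cover the n speculated tokens.
    preds = logits[0, t - 1 : -1].argmax(dim=-1)     # (n,)
    matches = (preds == candidate_ids[0]).long()
    n_accept = int(matches.cumprod(dim=0).sum())     # longest all-correct prefix
    # The model's own prediction after the accepted prefix is always kept
    # (the "bonus" token of speculative decoding).
    bonus = logits[0, t - 1 + n_accept].argmax().view(1)
    accepted = torch.cat([candidate_ids[0, :n_accept], bonus])
    return accepted, n_accept
```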
Turn 1 | Turn 2 | |||||
Category | Base Model | CopySpec | Speed-up | Base Model | CopySpec | Speed-up |
Coding | 5.12 | 5.62 | 1.10 | 4.61 | 9.33 | 2.02 |
Extraction | 4.76 | 5.65 | 1.19 | 4.58 | 8.30 | 1.81 |
Humanities | 5.09 | 5.33 | 1.05 | 4.55 | 5.45 | 1.20 |
Math | 5.17 | 5.84 | 1.13 | 4.75 | 10.14 | 2.13 |
Reasoning | 5.08 | 5.69 | 1.12 | 4.65 | 10.84 | 2.33 |
Roleplay | 5.08 | 5.14 | 1.01 | 4.58 | 14.10 | 3.08 |
Stem | 5.12 | 5.37 | 1.05 | 4.61 | 6.78 | 1.47 |
Writing | 5.12 | 5.13 | 1.01 | 4.65 | 10.59 | 2.28 |
Average | 5.07 | 5.47 | 1.08 | 4.62 | 9.44 | 2.04 |
3.3 Merging with Vanilla Speculative Decoding
To further enhance our technique, we integrate it within a vanilla speculative decoding framework. At each step of the generation process, we attempt to find matches in the context. If a match for the last γ tokens is found, we use the copied block as the draft, effectively simulating a draft model with perfect confidence in those tokens. If no match is identified, we rely on a smaller draft model to generate draft tokens. This dual approach allows us to dynamically choose between leveraging repetitive patterns through CopySpec and utilizing speculative decoding for efficient token generation in contexts with little or no redundancy.
This integration provides the best of both worlds: Speculative Decoding accelerates inference when the context size is small or lacks redundancy, while CopySpec builds on this speed-up in subsequent steps by taking advantage of repetitive patterns as the context size increases. As a result, the combined approach significantly enhances model efficiency across diverse scenarios.
It is also worth noting that when used as a stand-alone method, CopySpec does not require a draft model. This eliminates the need for additional GPU memory or modifications to the model, making it lightweight and easy to deploy. We explore the interplay between these techniques in Section 6, while Appendix B provides a detailed account of the full implementation, including key-value caching.
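A minimal sketch of this copy-or-draft decision is shown below, reusing the CopyIndex sketch from Section 3.1. The function and parameter names (propose_block, draft_fn, n_copy, k_draft) are illustrative, with draft_fn standing in for a greedy call to the smaller draft model.

```python
def propose_block(context_ids, copy_index, draft_fn, n_copy=10, k_draft=3):
    """Pick the speculative block for the next verification step: copy from the
    context when the last gamma tokens repeat an earlier pattern, otherwise ask
    the small draft model (draft_fn) for k_draft greedily decoded tokens."""
    match_end = copy_index.find_match(context_ids)
    if match_end is not None:
        # Copy the tokens that followed the earlier occurrence of the match.
        copied = context_ids[match_end + 1 : match_end + 1 + n_copy]
        if copied:
            return copied, "copy"
    return draft_fn(context_ids, k_draft), "draft"
```

Either way, the proposed block is verified by the target model exactly as in Section 3.2, so output quality is unchanged.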
4 Experiments
4.1 Models and Hyperparameters
We evaluated our copying technique on five instruction-tuned LLMs: Qwen2.5-72B, Qwen2.5-32B, Qwen2.5-7B (Qwen et al., 2025), LLaMa3.1-70B, and LLaMa3.1-8B (Grattafiori et al., 2024), using 4 A100 GPUs with a batch size of 1. Unless stated otherwise, we set γ to 3, the number of tokens to copy per attempt to 10, the maximum generation length to 1024, and the temperature to 0.
4.2 Evaluation Datasets
We evaluated our technique on five datasets, each targeting specific aspects of model performance: MT-Redundant, CNN/DM, GSM-8K, MT-Bench, and HumanEval. MT-Redundant was designed to emphasize prompts requiring small variations to previous outputs, while CNN/DM focuses on extractive summarization. GSM-8K evaluates the model’s self-correction capabilities, MT-Bench highlights scenarios with minimal copying potential to measure the technique’s overhead, and HumanEval assesses coding capabilities. To accommodate the increased computational demands of GSM-8K and CNN/DM and our limited GPU resources, we restricted these datasets to 100 samples, ensuring they were of comparable size to the other datasets. For HumanEval, we employed the same instruction format as presented in EvalPlus (Liu et al., 2023). Detailed descriptions of all prompts used in our experiments are provided in Appendices G and F.
4.3 MT-Redundant
Most existing NLP datasets focus on tasks involving either single-turn interactions or scenarios where the model must entirely change its response in the second turn. These setups fail to capture realistic use cases where a user might request slight variations or refinements to a previous answer. To address this gap and highlight the capabilities of our technique, we introduce a new dataset, MT-Redundant.
MT-Redundant is derived by modifying the second turn of MT-Bench (Zheng et al., 2023). In our dataset, the second turn replaces the original question with a prompt asking the model to review its previous answer and make specific adjustments or variations. This modification simulates real-world scenarios where incremental refinement or elaboration is required. Example prompts from the dataset are provided in Appendix F. For questions with reference answers, we retained the original reference for the first turn and created a new reference answer for the second turn to align with the revised prompts.
Our dataset spans a diverse range of practical use cases, categorized into eight groups: Coding, Extraction, Humanities, Math, Reasoning, Roleplay, STEM, and Writing. These categories reflect realistic tasks encountered in various domains. Additionally, we adopted the same evaluation procedure from MT-Bench to ensure consistency and comparability of results.
By creating MT-Redundant, we aim to bridge the gap between artificial benchmarks and practical applications, providing a more representative evaluation for techniques like CopySpec in multi-turn interactions with repetitive information.
5 Discussion of Results
We analyze our main results in Table 1, which shows the impact of our method on performance and the percentage of tokens copied across five LLMs and five datasets. The results are aggregated over all turns in MT-Redundant and MT-Bench (two turns each) and over the self-correction process in GSM-8K (three turns). Speedups range from 1.15× on MT-Bench, which has minimal redundancy, using Qwen2.5-72B-Instruct, to 2.35× on CNN/DM with the same model.
While these results are notable, the key strength of our approach lies in its ability to enhance performance as the context size grows. To illustrate this, we next break down performance by turn and analyze the effect of varying hyperparameters on the technique's effectiveness across a wide range of use cases.
5.1 Speed-up by Turn and Category
Turn 1 | Turn 2 | |||||||
Category | Base Model | Spec. Dec. | Spec. Dec. + Copy (γ=3) | Spec. Dec. + Copy (γ=5) | Base Model | Spec. Dec. | Spec. Dec. + Copy (γ=3) | Spec. Dec. + Copy (γ=5)
Coding | 10.87 | 15.88 | 15.85 | 16.17 | 9.73 | 14.74 | 22.12 | 22.17 |
Extraction | 10.09 | 14.07 | 15.49 | 15.41 | 9.79 | 14.50 | 18.56 | 18.69 |
Humanities | 10.85 | 13.62 | 13.86 | 13.88 | 9.75 | 12.79 | 13.66 | 13.73 |
Math | 11.01 | 16.94 | 17.23 | 17.30 | 10.05 | 15.45 | 24.28 | 24.11 |
Reasoning | 10.80 | 13.96 | 14.18 | 14.24 | 10.05 | 14.20 | 21.56 | 20.35 |
Roleplay | 10.90 | 12.80 | 12.84 | 12.97 | 9.93 | 15.14 | 29.02 | 27.95 |
Stem | 10.90 | 14.25 | 14.33 | 14.56 | 9.83 | 13.94 | 17.22 | 17.26 |
Writing | 10.92 | 12.56 | 12.64 | 12.73 | 9.94 | 14.96 | 26.64 | 25.08 |
Average | 10.79 | 14.26 | 14.55 | 14.66 | 9.88 | 14.47 | 21.63 | 21.17 |
We begin our analysis by examining the speedups achieved on MT-Redundant for both the first and second turns, as summarized in Table 2. The results indicate a substantial average speedup of 2.04× for the second turn, compared to a more modest 1.08× for the first turn. Notably, the tokens-per-second (TPS) performance achieved by the model increases in the second turn, which features a larger context size, whereas the baseline model's TPS declines as the context grows. Another notable aspect is that the observed speedup is highly dependent on the specific use case: we observe speedups as low as 1.20× in the Humanities category and as high as 3.08× for Roleplay. However, regardless of the use case, the speedup for the second turn remains consistently above 1× across all models for both MT-Redundant and MT-Bench.
The results for all five models on MT-Redundant and MT-Bench are detailed in Appendix C.2 and D.2 respectively. On average, the second round of MT-Redundant achieves a significant 91% speedup across all models, compared to 31% for MT-Bench. Notably, even on MT-Bench, which has less redundancy, the TPS achieved by CopySpec in the second turn is almost always higher than the baseline model’s TPS in the first turn. These findings highlight how our approach effectively leverages increased context sizes to enhance performance, even in less favorable scenarios.
5.2 The Effect of Gamma (γ)
We begin our analysis with Figure 3, which plots tokens per second as a red line alongside the percentage of tokens copied out of the total tokens generated (blue line) for the LLaMa3.1-8B model on HumanEval. The numbers adjacent to the dots indicate the number of attempts made to copy tokens. The figure demonstrates that as γ decreases, a higher percentage of tokens is accepted, but the number of copying attempts increases exponentially, leading to a significantly larger overhead. This results in a decline in overall TPS performance. A similar pattern is observed for MT-Redundant and MT-Bench, as presented in Figure 6 and Figure 7 in the appendix.
Empirically, the optimal value of γ across datasets is three, with two yielding similar performance. It is also worth noting that every value of γ ranging from 2 to 10 consistently results in significantly higher overall TPS than the baseline, even across both turns on MT-Redundant and MT-Bench.
Furthermore, we examine the effect of γ on the average number of tokens accepted per copying attempt. Figure 4 illustrates this quantity on HumanEval using the LLaMA3.1-8B model. We observe an interesting pattern: as γ increases, the average number of tokens accepted per copying attempt also increases, indicating that each attempt becomes more precise. However, this comes at the cost of fewer overall copying attempts, as demonstrated in Figure 3.
This finding is particularly relevant for integrating our technique into various speculative decoding frameworks. If a framework already accepts a high number of tokens per attempt, our technique remains advantageous: increasing γ enables more tokens to be copied with each attempt.
5.3 Number of Tokens to Copy and Overhead
Tokens Copied (per attempt) | MT-Redundant (TPS) | MT-Bench (TPS)
Base Model | 35.63 | 35.30 |
0 | 35.46 | 35.22 |
5 | 47.64 | 44.69 |
10 | 49.52 | 45.74 |
50 | 45.56 | 41.59 |
100 | 39.41 | 35.76 |
We evaluate the impact of the number of tokens copied per attempt on performance and estimate CopySpec’s overhead by setting the number of copied tokens to zero, isolating the cost of token searching. Results in Table 4 show minimal overhead, with differences from the base model nearly within the margin of error. Among the hyperparameters studied, setting the number of copied tokens to 10 delivers the best performance, while larger values, such as 50 or 100, increase overhead and reduce tokens-per-second (TPS) efficiency.
6 Analyses
Variant | Turn 1 | Turn 2 | Turn 3
Copied | Tokens/Sec | Avg. Copied per Attempt | Avg. Drafted per Attempt | Copied | Tokens/Sec | Avg. Copied per Attempt | Avg. Drafted per Attempt | Copied | Tokens/Sec | Avg. Copied per Attempt | Avg. Drafted per Attempt
Base Model | – | 10.25 | – | – | – | 10.17 | – | – | – | 8.68 | – | –
CopySpec (γ=3) | 5.76% | 10.13 | 0.58 | – | 44.17% | 15.72 | 4.90 | – | 82.79% | 21.89 | 7.67 | –
CopySpec (γ=5) | 1.01% | 9.91 | 0.72 | – | 40.67% | 14.79 | 6.96 | – | 82.78% | 21.39 | 8.70 | –
Spec. Dec. | – | 13.47 | – | 2.55 | – | 12.99 | – | 2.31 | – | 11.27 | – | 2.75
Spec. Dec. + Copy (γ=3) | 2.59% | 13.09 | 0.60 | 2.52 | 41.70% | 16.37 | 5.85 | 1.86 | 81.81% | 21.23 | 7.70 | 2.39
Spec. Dec. + Copy (γ=5) | 0.49% | 13.67 | 0.90 | 2.55 | 39.26% | 16.59 | 7.89 | 1.92 | 82.58% | 21.91 | 8.71 | 2.35
6.1 Orthogonality with Speculative Decoding
We integrated CopySpec into a vanilla speculative decoding framework, following the steps in Section 3.3 and the approach described by Leviathan et al. (2023). Based on our observations from Section 5.2, we experimented with two values of γ (3 and 5) to analyze their impact when used alongside speculative decoding. The results, summarized in Table 3, show significant efficiency improvements in the second turn of MT-Redundant, with marginal speedups in the first turn. A γ value of 5 achieves higher speedups in the first turn, while γ = 3 provides better TPS in the second turn, highlighting the need for task-specific tuning.
We also evaluated CopySpec with speculative decoding using drafts of 5 tokens instead of 3, with similar experiments conducted on MT-Redundant (Table 10 in the Appendix) and with 3 and 5 draft tokens on MT-Bench (Tables 16 and 17 in the Appendix). These results confirm that γ = 5 often outperforms γ = 3 when combined with Spec. Dec., emphasizing the importance of tuning γ for optimal performance. The results also show that adding CopySpec to Spec. Dec. almost never decreases performance, even when there is little redundancy in the data, as seen on MT-Bench.
6.2 Effect on Reasoning
An important aspect of our analysis is evaluating the impact of our technique on the efficiency of self-correction. To this end, we implemented a self-refine framework, where the model generates Python code and iteratively refines it in two steps, following a process similar to (Madaan et al., 2023). Details of the prompts and example outputs used in our experiments are provided in Appendix G.1. Table 5 presents the results of combining self-correction with copying and Speculative Decoding (SD).
Our technique becomes more effective in later turns as the model iterates over its prior reasoning, allowing progressively more tokens to be copied. This is reflected in a significant rise in the percentage of copied tokens, the tokens per second (TPS), and the average number of tokens accepted per copying attempt: each attempt becomes more precise as the model refines its reasoning and the context grows.
When combined with SD using γ = 5, our approach achieves better results across all three turns, as shown in the table. The first turn benefits most from SD due to minimal copying, while later turns gain greater advantages from copying. This highlights the complementary nature of the two techniques and their combined effectiveness in improving efficiency and performance. Notably, while the TPS of the base model decreases by a factor of 0.85× as the context size grows, our technique reverses this trend, increasing the TPS in the last turn by 2.52×, showcasing its ability to leverage larger contexts for enhanced efficiency.
We also extended our analysis to cases where the draft model generates 5 tokens at a time, as shown in Table 18 in the appendix. Additionally, Table 19 confirms that the tested models improve their final accuracy, validating the effectiveness of our self-correction implementation. Note that accuracy is not reported for the second round, as it focuses solely on critiquing the model’s prior implementation. Across the entire self-correction process, we achieve TPS improvements of 63%, 52%, and 54% for the Qwen2.5-7B, Qwen2.5-32B, and Qwen2.5-72B instruct models, respectively.
7 Conclusion
We introduced CopySpec, a method that identifies repeated token sequences in a growing context and copies them efficiently without additional GPU memory or significant cost. Using a rolling hash over the last γ tokens, CopySpec speculates on larger token blocks to reduce redundant computation.
Results across five LLMs and five datasets, including MT-Redundant, show up to a 3.08× speed-up in second-turn inference and a 49% boost when combined with speculative decoding, without altering output quality. Future work includes dynamically tuning γ, refining match selection, and integrating CopySpec with parallel decoding frameworks.
Impact Statement
This work introduces a method to accelerate large language model (LLM) inference, thereby reducing the computational resources and costs associated with producing lengthy outputs. By improving efficiency, CopySpec can lower the barriers to using LLMs across various applications, ranging from education and research to industry-scale deployments.
On the positive side, faster inference decreases energy consumption per token, which can help mitigate the environmental impact of large-scale model serving. It also makes multi-turn interactions more accessible, potentially benefiting users with limited computational resources.
However, increased efficiency may lead to the more frequent use of LLMs in contexts like spam generation or disinformation at scale. As with any generative method, careful deployment and robust content moderation remain necessary to reduce potential harm. CopySpec itself does not solve issues of model bias, misuse, or misinformation; rather, it highlights the need for responsible governance of rapidly evolving LLM capabilities.
References
- Andronov et al. (2024) Andronov, M., Andronova, N., Wand, M., Schmidhuber, J., and Clevert, D.-A. Accelerating the inference of string generation-based chemical reaction models for industrial applications, 2024. URL https://arxiv.org/abs/2407.09685.
- Bavarian et al. (2022) Bavarian, M., Jun, H., Tezak, N., Schulman, J., McLeavey, C., Tworek, J., and Chen, M. Efficient training of language models to fill in the middle. arXiv preprint arXiv:2207.14255, 2022.
- Cai et al. (2024) Cai, Y. et al. Medusa: Multidraft speculative decoding for accelerated inference. arXiv preprint arXiv:2401.10774, 2024.
- Chen & Xu (2023) Chen, J. and Xu, H. Parallel decoding with speculative sampling for large language models. arXiv preprint arXiv:2306.15478, 2023. URL https://arxiv.org/abs/2306.15478.
- Chen et al. (2021) Chen, M., Tworek, J., Jun, H., Yuan, Q., de Oliveira Pinto, H. P., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., Ray, A., Puri, R., Krueger, G., Petrov, M., Khlaaf, H., Sastry, G., Mishkin, P., Chan, B., Gray, S., Ryder, N., Pavlov, M., Power, A., Kaiser, L., Bavarian, M., Winter, C., Tillet, P., Such, F. P., Cummings, D., Plappert, M., Chantzis, F., Barnes, E., Herbert-Voss, A., Guss, W. H., Nichol, A., Paino, A., Tezak, N., Tang, J., Babuschkin, I., Balaji, S., Jain, S., Saunders, W., Hesse, C., Carr, A. N., Leike, J., Achiam, J., Misra, V., Morikawa, E., Radford, A., Knight, M., Brundage, M., Murati, M., Mayer, K., Welinder, P., McGrew, B., Amodei, D., McCandlish, S., Sutskever, I., and Zaremba, W. Evaluating large language models trained on code, 2021. URL https://arxiv.org/abs/2107.03374.
- Cobbe et al. (2021) Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., Hesse, C., and Schulman, J. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
- Fried et al. (2023) Fried, D., Fu, Y., Shen, T., Smith, N. A., and Klein, D. Incoder: A generative model for code infilling and synthesis. arXiv preprint arXiv:2204.05999, 2023.
- Gong et al. (2024) Gong, L., Wang, S., Elhoushi, M., and Cheung, A. Evaluation of llms on syntax-aware code fill-in-the-middle tasks, 2024. URL https://arxiv.org/abs/2403.04814.
- Grattafiori et al. (2024) Grattafiori, A. et al. The llama 3 herd of models, 2024. URL https://arxiv.org/abs/2407.21783.
- Gu et al. (2016) Gu, J., Lu, Z., Li, H., and Li, V. O. Incorporating copying mechanism in sequence-to-sequence learning. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1631–1640, 2016.
- He & Wang (2023) He, Z. and Wang, X. Speed: Speculative pipelined execution for efficient decoding in large language models. arXiv preprint arXiv:2310.12072, 2023. URL https://arxiv.org/abs/2310.12072.
- He et al. (2023) He, Z. et al. Rest: Retrieval-based speculative decoding. arXiv preprint arXiv:2311.54321, 2023.
- Jelassi et al. (2024) Jelassi, S., Brandfonbrener, D., Kakade, S. M., and Malach, E. Repeat after me: Transformers are better than state space models at copying. arXiv preprint arXiv:2402.01032, 2024.
- Leviathan et al. (2023) Leviathan, Y. et al. Fast inference from transformers via speculative decoding. In Proceedings of the 40th International Conference on Machine Learning, 2023. URL https://proceedings.mlr.press/v202/leviathan23a.html.
- Li et al. (2024) Li, Y., Wei, F., Zhang, C., and Zhang, H. Eagle-2: Faster inference of language models with dynamic draft trees, 2024. URL https://arxiv.org/abs/2406.16858.
- Liu et al. (2023) Liu, J., Xia, C. S., Wang, Y., and Zhang, L. Is your code generated by chatGPT really correct? rigorous evaluation of large language models for code generation. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=1qvx610Cu7.
- Liu et al. (2024) Liu, X., Zhang, Y., Wang, P., Ge, T., Liu, T., Li, Y., and Sui, Z. Adaptive draft-verification for efficient large language model decoding. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 1234–1245, 2024.
- Madaan et al. (2023) Madaan, A., Tandon, N., Gupta, P., Hallinan, S., Gao, L., Wiegreffe, S., Alon, U., Dziri, N., Prabhumoye, S., Yang, Y., Gupta, S., Majumder, B. P., Hermann, K., Welleck, S., Yazdanbakhsh, A., and Clark, P. Self-refine: Iterative refinement with self-feedback, 2023. URL https://arxiv.org/abs/2303.17651.
- McCoy et al. (2023) McCoy, R. T., Min, S., Linzen, T., and Hajishirzi, H. How much do language models copy from their training data? evaluating linguistic novelty in text generation using raven. Transactions of the Association for Computational Linguistics, 11:727–744, 2023.
- Miao et al. (2023) Miao, X. et al. Specinfer: Accelerating generative llm serving with speculative inference and token tree verification. arXiv preprint arXiv:2305.12345, 2023.
- Mistral AI (2024) Mistral AI. Codestral: Hello, world!, 2024. https://mistral.ai/news/codestral/.
- Qwen et al. (2025) Qwen Team, Yang, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Li, C., Liu, D., Huang, F., Wei, H., Lin, H., Yang, J., Tu, J., Zhang, J., Yang, J., Yang, J., Zhou, J., Lin, J., Dang, K., Lu, K., Bao, K., Yang, K., Yu, L., Li, M., Xue, M., Zhang, P., Zhu, Q., Men, R., Lin, R., Li, T., Tang, T., Xia, T., Ren, X., Ren, X., Fan, Y., Su, Y., Zhang, Y., Wan, Y., Liu, Y., Cui, Z., Zhang, Z., and Qiu, Z. Qwen2.5 technical report, 2025. URL https://arxiv.org/abs/2412.15115.
- Roziere et al. (2023) Roziere, B., Nguyen, H., Robert, T., Li, L. X., Le Scao, T., Tan, Q., Nguyen, T. H., Li, X. L., Pannier, B., Xu, C., Scialom, T., Gao, L., Schick, T., Kocetkov, D., Mallen, L., Qian, Y., Susano Pinto, P., Ruwase, O., Lhoest, Q., Goyal, N., Matuszek, C., Karpukhin, V., Lewis, M., Edunov, S., Grave, E., Ranzato, M., Parikh, A. P., and Fan, A. Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950, 2023.
- See et al. (2017) See, A., Liu, P. J., and Manning, C. D. Get to the point: Summarization with pointer-generator networks. CoRR, abs/1704.04368, 2017. URL http://arxiv.org/abs/1704.04368.
- Shen et al. (2023) Shen, T., Peng, H., Shen, R., Fu, Y., Harchaoui, Z., and Choi, Y. Film: Fill-in language models for any-order generation. arXiv preprint arXiv:2310.09930, 2023.
- Sun et al. (2024) Sun, R., Zhou, T., Chen, X., and Sun, L. Spechub: Provable acceleration to multi-draft speculative decoding, 2024. URL https://arxiv.org/abs/2411.05289.
- Team et al. (2024) Team, C., Zhao, H., Hui, J., Howland, J., Nguyen, N., Zuo, S., Hu, A., Choquette-Choo, C. A., Shen, J., Kelley, J., Bansal, K., Vilnis, L., Wirth, M., Michel, P., Choy, P., Joshi, P., Kumar, R., Hashmi, S., Agrawal, S., Gong, Z., Fine, J., Warkentin, T., Hartman, A. J., Ni, B., Korevec, K., Schaefer, K., and Huffman, S. Codegemma: Open code models based on gemma, 2024. URL https://arxiv.org/abs/2406.11409.
- Vinyals et al. (2015) Vinyals, O., Fortunato, M., and Jaitly, N. Pointer networks. In Advances in Neural Information Processing Systems, volume 28, pp. 2692–2700, 2015.
- Wang et al. (2023) Wang, Y., Zhang, T., Li, X. L., and Liang, P. Syntax-aware fill-in-the-middle evaluation for code generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023.
- Yang et al. (2023) Yang, S. et al. Predictive pipelined decoding: A compute-latency trade-off for exact llm decoding. arXiv preprint arXiv:2308.45678, 2023.
- Zheng et al. (2023) Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E. P., Zhang, H., Gonzalez, J. E., and Stoica, I. Judging llm-as-a-judge with mt-bench and chatbot arena, 2023. URL https://arxiv.org/abs/2306.05685.
- Zheng et al. (2024) Zheng, L., Yuan, J., Zhang, Z., Yang, H., and Kong, L. Self-infilling code generation, 2024. URL https://arxiv.org/abs/2311.17972.
- Zhou et al. (2023) Zhou, Y. et al. Distillspec: Improving speculative decoding via knowledge distillation. arXiv preprint arXiv:2310.98765, 2023.
Appendix A Gamma (γ) and Semantic Implications
In our framework, the generation speed of CopySpec is intricately tied to the choice of γ, which governs the length of the left context used to identify repeated sequences. The selection of an optimal γ is critical, as it directly impacts the model’s ability to efficiently reuse tokens from the context, thereby accelerating generation. A carefully chosen γ strikes a balance between providing sufficient contextual information for accurate copying and avoiding unnecessary computational overhead.
If γ is too small (e.g., γ = 1), the context provides insufficient information to reliably identify repetitions, resulting in missed reuse opportunities and slower generation. Conversely, when γ is too large, the excessive context introduces redundancy and dilutes the immediate semantic relevance. While the acceptance rate may increase, the total number of tokens generated per second decreases because the model spends more time generating tokens itself and fewer tokens are copied in practice.
The challenge, therefore, lies in finding an optimal γ that maximizes copying attempts while minimizing computational overhead. A well-chosen γ ensures that the context is both semantically focused and computationally efficient, enabling the copy mechanism to fully exploit repeated patterns in the generation process. This trade-off underscores the importance of systematically tuning γ to achieve the best performance across datasets.
To measure the semantic alignment between a token and its left-γ token context, we fine-tuned the token embeddings using a left-γ skip-gram model, a modification of the traditional skip-gram approach. Unlike the standard skip-gram model, which maximizes the probability of a target word given a symmetric context window, our approach considers only the preceding γ tokens as context.
Formally, instead of maximizing the probability $P(w_t \mid \mathcal{C}_{\mathrm{sym}}(w_t))$, where $\mathcal{C}_{\mathrm{sym}}(w_t)$ represents a symmetric context window around the word $w_t$, our left-γ skip-gram model is trained to maximize $P(w_t \mid w_{t-\gamma}, \ldots, w_{t-1})$, where the context consists only of the last γ tokens in the sequence preceding the next token $w_t$. This ensures that the learned embeddings capture dependencies in a unidirectional manner, aligning with the way generative models process text.
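One simple way to instantiate this left-γ objective is sketched below. The paper does not specify the exact architecture, so this module (mean of the previous γ embeddings followed by a softmax over the vocabulary, trained with cross-entropy) should be read as an assumption-laden illustration rather than the authors' implementation; all names are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LeftGammaSkipGram(nn.Module):
    """Toy left-gamma predictor: average the embeddings of the previous gamma
    tokens and score the next token with a linear softmax head."""

    def __init__(self, vocab_size, dim, gamma):
        super().__init__()
        self.gamma = gamma
        self.embed = nn.Embedding(vocab_size, dim)
        self.out = nn.Linear(dim, vocab_size, bias=False)

    def forward(self, context_ids):                  # (batch, gamma)
        ctx = self.embed(context_ids).mean(dim=1)    # (batch, dim)
        return self.out(ctx)                         # logits over the next token

# Training step: maximize log P(w_t | w_{t-gamma}, ..., w_{t-1}).
# loss = F.cross_entropy(model(context_batch), next_token_batch)
```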
By structuring the model in this way, we aim to quantify how much semantic meaning from the left-γ tokens contributes to predicting the next token. Cosine similarity (CS) is particularly well-suited for evaluating the semantic alignment between the left-γ token context and the next token because it captures the directional similarity between their vector representations, regardless of magnitude. Since word embeddings encode semantic meaning in a high-dimensional space, CS provides a robust way to measure how well the left context conveys predictive information about the next token. Unlike Euclidean distance, CS ensures that we focus solely on semantic coherence rather than raw frequency effects. This is crucial for CopySpec, as effective token reuse depends on the ability to recognize when a sequence of past tokens is not just lexically repeated but also semantically relevant to the next token. By analyzing trends in CS across different values of γ, we can assess whether increasing the context length improves meaningful copying or merely introduces redundant information, thereby helping us fine-tune γ for optimal efficiency.
The cosine similarity (CS) is computed as:
$$\mathrm{CS} = \frac{\bar{c}_{\gamma} \cdot v_{w_t}}{\lVert \bar{c}_{\gamma} \rVert \, \lVert v_{w_t} \rVert}, \qquad \bar{c}_{\gamma} = \frac{1}{\gamma} \sum_{i=1}^{\gamma} v_{w_{t-i}}.$$
Here, $\bar{c}_{\gamma}$ represents the average embedding of the γ most recent tokens, where $v_{w_{t-1}}, \ldots, v_{w_{t-\gamma}}$ are the embeddings of the last γ tokens in the context and $v_{w_t}$ is the embedding of the next token.
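A small sketch of this measurement is shown below, assuming embeddings is a (vocab_size, dim) matrix of trained token embeddings and token_ids is the tokenized context; the function and argument names are illustrative.

```python
import numpy as np

def left_gamma_cosine(embeddings, token_ids, t, gamma):
    """Cosine similarity between the mean embedding of the gamma tokens
    preceding position t and the embedding of the token at position t
    (requires t >= gamma)."""
    context = embeddings[token_ids[t - gamma : t]].mean(axis=0)   # averaged left context
    target = embeddings[token_ids[t]]                             # next-token embedding
    return float(np.dot(context, target) /
                 (np.linalg.norm(context) * np.linalg.norm(target) + 1e-12))
```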
To validate our intuitions, we conducted experiments to analyze the relationship between γ (context length) and semantic alignment. Figure 5 illustrates the trends in cosine similarity and generation speed (TPS) as γ varies.
By measuring cosine similarity and generation speed across varying γ-token contexts, we provide empirical evidence that tuning the left-γ skip-gram model to find the best γ is essential for maximizing efficiency. Future work can explore adaptive strategies that dynamically adjust γ within the same hash map based on context complexity, further optimizing the balance between copying effectiveness and computational cost.
Appendix B Copying and Speculative Decoding with Truncated KV States
This appendix describes how our framework integrates a copying mechanism with speculative decoding, including details on partial acceptance and key-value (KV) cache truncation.
B.1 Notation and Variables
Sequence $x_{1:t}$.
Let $x_{1:t}$ denote the currently accepted sequence of $t$ tokens. Generating a new token moves us to position $t+1$.
Dictionary $D$.
$D$ records repeated γ-length substrings of the context and their earlier occurrences. If the current γ-token suffix appears in $D$, we may copy subsequent tokens from that match.
Subsequence length γ.
We use γ tokens to detect repeats. That is, the last γ tokens, $(x_{t-\gamma+1}, \ldots, x_t)$, determine whether a copy event is possible.
Match location $j$.
If $D$ indicates that $(x_{t-\gamma+1}, \ldots, x_t)$ appears ending at an earlier position $j$, we attempt to copy tokens starting from position $j+1$.
Chunk size $n$ (copying).
When a match is found, we form a copied chunk $C = (x_{j+1}, \ldots, x_{j+n})$.
Draft limit $k$ (speculative).
If copying is not used, we let the draft model propose up to $k$ tokens: $d_1, \ldots, d_k$.
Acceptance and Draft Models.
The target model $M_T$ decides whether each new token is accepted, while the draft model $M_D$ only proposes tokens that must still pass $M_T$'s acceptance criterion.
Index $i$.
In both copying and drafting, we iterate over newly proposed tokens with an index $i = 1, \ldots, n$ or $i = 1, \ldots, k$.
Accepted count $a$.
Out of the $n$ (copied) or $k$ (drafted) tokens, only $a \le n$ or $a \le k$ may be accepted under $M_T$. Rejected tokens are removed, and the key-value states are truncated to retain only $x_{1:t+a}$.
B.2 Acceptance Criterion and KV Truncation
Any new token must pass an acceptance criterion under the target model $M_T$; for example, at temperature 0, we only accept it if it is the argmax of the target model’s conditional distribution. If a token fails, we reject it (and all subsequent tokens in the same chunk) and roll back to the last accepted prefix $x_{1:t+a}$.
Each layer of the target model stores key-value tensors up to the final accepted token. If $a$ tokens in a chunk are accepted, we truncate the cache to $t+a$ positions, ensuring the model remains consistent with the final accepted sequence.
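As a concrete illustration, the sketch below truncates a Hugging Face-style legacy KV cache (a per-layer tuple of key/value tensors shaped (batch, heads, seq_len, head_dim)) back to the accepted prefix. This is an assumption about the cache format, not the paper's code; newer Cache objects expose their own cropping helpers instead.

```python
def truncate_kv_cache(past_key_values, keep_len):
    """Keep only the first keep_len positions of each layer's key/value tensors,
    discarding any rejected speculative tokens beyond the accepted prefix."""
    return tuple(
        (k[:, :, :keep_len, :], v[:, :, :keep_len, :])
        for k, v in past_key_values
    )
```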
B.3 Integrated Generation Procedure
Below is a single pseudocode listing that combines both copying and speculative decoding.
1. Check for a Copy Opportunity:
   (a) Let $q = (x_{t-\gamma+1}, \ldots, x_t)$ be the most recent γ tokens of the accepted sequence $x_{1:t}$.
   (b) Check whether $q$ is in $D$ (the dictionary of repeats). If no match exists, go to Step 3.
   (c) Otherwise, let $j$ be the first occurrence recorded in $D$ that ends before the current position (ensuring no overlap).
   (d) Form a candidate chunk of length $n$: $C = (x_{j+1}, \ldots, x_{j+n})$.
   (e) Initialize $a = 0$, which tracks how many tokens from $C$ are ultimately accepted.
2. Attempt to Copy:
   (a) For $i = 1$ to $n$: evaluate $c_i$ (from $C$) with the target model $M_T$; if $c_i$ passes the acceptance criterion (e.g., it is the argmax at temperature 0), set $a = i$; otherwise, reject $c_i$ and break out of this loop.
   (b) If $a < n$: the final sequence is now $x_{1:t+a}$, which means only the first $a$ tokens from $C$ are accepted. Truncate the target model’s KV cache states for all layers to length $t+a$ to discard any rejected tokens beyond position $t+a$.
   (c) Otherwise, if $a = n$, all copied tokens are fully accepted, making $x_{1:t+n}$ the new final sequence.
   (d) Update $D$ with any newly formed γ-subsequences ending at positions $t+1, \ldots, t+a$.
3. Speculative Decoding:
   (a) If no copying occurred, generate up to $k$ tokens from the draft model $M_D$: $d_1, \ldots, d_k$.
   (b) Let $a = 0$. For $i = 1$ to $k$: evaluate $d_i$ (from the draft) using $M_T$; if accepted, increment $a$; if rejected, break immediately.
   (c) If $a < k$: only $d_1, \ldots, d_a$ are accepted, so the final sequence is $x_{1:t+a}$. Truncate the target model’s and draft model’s KV cache states to reflect $x_{1:t+a}$ only.
   (d) If $a = k$, the entire draft is accepted, making $x_{1:t+k}$ the new final sequence.
   (e) Update $D$ with any newly formed γ-length subsequences up to position $t+a$.
4. Repeat: increase $t$ by the number of accepted tokens in this iteration (either $a$, $n$, or $k$). Continue until a stopping criterion (e.g., an end-of-text token) is encountered.
Discussion of Truncation: Whenever fewer than $n$ (in copying) or $k$ (in drafting) tokens are accepted, we roll back to the accepted prefix. The target model’s key-value memory is truncated accordingly to reflect $x_{1:t+a}$. Thus, any rejected tokens do not affect the final context or the KV states.
Appendix C Extra Results on MT-Redundant
This appendix presents a detailed analysis of the performance improvements achieved by the CopySpec approach compared to baseline methods. The tables provide comprehensive results across various categories and model configurations, highlighting the computational efficiency and speed-ups observed on the MT-Redundant dataset.
C.1 Analysis of Gamma (γ) on MT-Redundant
The analysis depicted in Figure 6 highlights the impact of the copying parameter γ on both computational performance and the model’s ability to reuse tokens effectively. As γ increases, there is a notable rise in the percentage of copied tokens, demonstrating the model’s improved ability to exploit repeated patterns within the context. However, this comes at the cost of reduced tokens per second (TPS) for higher values of γ, due to the increased computational overhead associated with processing larger context windows.
C.2 Speed-up by Category on MT-Redundant
Turn 1 | Turn 2 | |||||
Category | Base Model | CopySpec | Speed-up | Base Model | CopySpec | Speed-up |
Coding | 10.86 | 11.66 | 1.07 | 9.72 | 19.47 | 2.01 |
Extraction | 10.09 | 13.44 | 1.33 | 9.80 | 18.17 | 1.85 |
Humanities | 10.85 | 11.57 | 1.07 | 9.75 | 11.67 | 1.20 |
Math | 11.01 | 12.81 | 1.16 | 10.05 | 23.18 | 2.31 |
Reasoning | 10.80 | 12.18 | 1.13 | 10.05 | 20.17 | 2.01 |
Roleplay | 10.90 | 11.05 | 1.01 | 9.93 | 27.80 | 2.80 |
Stem | 10.90 | 11.50 | 1.06 | 9.83 | 14.61 | 1.49 |
Writing | 10.92 | 10.85 | 0.99 | 9.94 | 24.51 | 2.46 |
Average | 10.89 | 11.88 | 1.10 | 9.88 | 19.52 | 1.98 |
Table 6 summarizes the tokens-per-second (TPS) performance for the Qwen2.5-32B-Instruct model across two turns. The first turn reflects scenarios with minimal contextual information, while the second turn demonstrates significant gains in speed due to the larger context size and CopySpec’s ability to leverage repeated token patterns effectively. Notably, categories such as Coding and Math exhibit speed-ups exceeding 2× in the second turn.
Turn 1 | Turn 2 | |||||
Category | Base Model | CopySpec | Speed-up | Base Model | CopySpec | Speed-up |
Coding | 43.28 | 47.16 | 1.09 | 37.48 | 77.39 | 2.06 |
Extraction | 39.45 | 44.38 | 1.12 | 39.34 | 73.79 | 1.88 |
Humanities | 42.94 | 44.73 | 1.04 | 36.71 | 46.73 | 1.27 |
Math | 44.27 | 49.49 | 1.12 | 39.85 | 84.93 | 2.13 |
Reasoning | 43.06 | 46.51 | 1.08 | 39.67 | 86.13 | 2.17 |
Roleplay | 43.14 | 45.12 | 1.05 | 38.63 | 108.37 | 2.81 |
Stem | 42.96 | 45.41 | 1.06 | 37.06 | 57.54 | 1.55 |
Writing | 43.50 | 44.79 | 1.03 | 38.40 | 87.91 | 2.29 |
Average | 42.95 | 46.82 | 1.09 | 38.51 | 78.43 | 2.04 |
In Table 7, we observe a similar trend for the Qwen2.5-7B-Instruct model, with CopySpec consistently improving TPS across both turns. The second turn results show substantial gains in categories like Reasoning and Math, where repetitive patterns in the context are more prominent.
Turn 1 | Turn 2 | |||||
Category | Base Model | CopySpec | Speed-up | Base Model | CopySpec | Speed-up |
Coding | 5.17 | 5.94 | 1.15 | 4.81 | 10.76 | 2.24 |
Extraction | 4.90 | 5.29 | 1.08 | 4.80 | 7.60 | 1.58 |
Humanities | 5.20 | 5.39 | 1.04 | 4.78 | 5.72 | 1.20 |
Math | 5.23 | 5.83 | 1.12 | 4.89 | 12.58 | 2.57 |
Reasoning | 5.18 | 5.43 | 1.05 | 4.92 | 8.49 | 1.73 |
Roleplay | 5.16 | 5.28 | 1.02 | 4.93 | 10.01 | 2.03 |
Stem | 5.21 | 5.43 | 1.04 | 4.83 | 6.38 | 1.32 |
Writing | 5.21 | 5.27 | 1.01 | 4.82 | 9.48 | 1.97 |
Average | 5.16 | 5.48 | 1.06 | 4.85 | 8.88 | 1.83 |
Table 8 presents the results for the LLaMa3.1-70B-Instruct model. Here, the impact of CopySpec is evident, especially in the second turn, with speed-ups reaching over 2× in categories such as Math. These results highlight the scalability of CopySpec across models of varying sizes.
Turn 1 | Turn 2 | |||||
Category | Base Model | CopySpec | Speed-up | Base Model | CopySpec | Speed-up |
Coding | 36.80 | 44.31 | 1.20 | 34.61 | 66.14 | 1.91 |
Extraction | 35.49 | 46.27 | 1.30 | 33.78 | 71.84 | 2.13 |
Humanities | 37.31 | 40.66 | 1.09 | 33.90 | 40.01 | 1.18 |
Math | 37.02 | 52.60 | 1.42 | 34.94 | 64.90 | 1.86 |
Reasoning | 36.83 | 53.24 | 1.45 | 34.77 | 60.76 | 1.75 |
Roleplay | 36.85 | 40.85 | 1.11 | 34.70 | 64.18 | 1.85 |
Stem | 37.28 | 41.01 | 1.10 | 34.49 | 45.01 | 1.31 |
Writing | 36.94 | 39.87 | 1.08 | 33.87 | 48.01 | 1.42 |
Average | 36.81 | 44.85 | 1.22 | 34.38 | 57.61 | 1.67 |
The findings for the LLaMa3.1-8B-Instruct model are detailed in Table 9. The speed-ups in this case are slightly lower compared to larger models but still demonstrate consistent improvements across all categories, with notable efficiency gains in the second turn.
C.3 Merging with Speculative Decoding on MT-Redundant
Turn 1 | Turn 2 | |||||||
Category | Base Model | Spec. Dec. | Spec. Dec. + Copy (γ=3) | Spec. Dec. + Copy (γ=5) | Base Model | Spec. Dec. | Spec. Dec. + Copy (γ=3) | Spec. Dec. + Copy (γ=5)
Coding | 10.87 ± 0.01 | 16.09 ± 0.13 | 15.88 ± 0.05 | 16.09 ± 0.04 | 9.73 ± 0.01 | 15.77 ± 0.09 | 22.02 ± 0.01 | 22.50 ± 0.01
Extraction | 10.09 ± 0.01 | 14.20 ± 0.09 | 15.12 ± 0.09 | 15.26 ± 0.01 | 9.79 ± 0.01 | 15.17 ± 0.08 | 18.41 ± 0.05 | 18.45 ± 0.05
Humanities | 10.85 ± 0.01 | 12.39 ± 0.10 | 12.52 ± 0.03 | 12.50 ± 0.01 | 9.75 ± 0.01 | 12.39 ± 0.07 | 13.01 ± 0.04 | 13.05 ± 0.01
Math | 11.01 ± 0.01 | 17.61 ± 0.10 | 17.68 ± 0.06 | 18.10 ± 0.01 | 10.05 ± 0.01 | 16.70 ± 0.11 | 24.48 ± 0.07 | 24.84 ± 0.07
Reasoning | 10.80 ± 0.02 | 13.09 ± 0.10 | 13.04 ± 0.04 | 13.21 ± 0.02 | 10.05 ± 0.01 | 14.74 ± 0.06 | 20.33 ± 0.07 | 21.12 ± 0.05
Roleplay | 10.90 ± 0.01 | 11.14 ± 0.08 | 11.19 ± 0.04 | 11.17 ± 0.02 | 9.93 ± 0.01 | 16.19 ± 0.10 | 28.43 ± 0.01 | 28.44 ± 0.27
Stem | 10.90 ± 0.01 | 13.33 ± 0.11 | 13.36 ± 0.06 | 13.45 ± 0.01 | 9.83 ± 0.01 | 14.16 ± 0.08 | 16.73 ± 0.02 | 16.95 ± 0.03
Writing | 10.92 ± 0.01 | 11.30 ± 0.08 | 11.34 ± 0.03 | 11.33 ± 0.01 | 9.94 ± 0.01 | 15.59 ± 0.12 | 25.46 ± 0.01 | 25.16 ± 0.05
Average | 10.79 ± 0.01 | 13.64 ± 0.10 | 13.77 ± 0.05 | 13.89 ± 0.01 | 9.88 ± 0.01 | 15.09 ± 0.09 | 21.11 ± 0.04 | 21.31 ± 0.07
Finally, Table 10 explores the integration of CopySpec with speculative decoding for the Qwen2.5-32B-Instruct model, with Qwen2.5-7B-Instruct as the draft model. The results highlight how combining these approaches can yield even greater computational efficiency. The analysis includes varying γ values and draft token counts, showing that optimal parameter tuning further enhances performance, particularly in multi-turn scenarios.
Appendix D Extra Results on MT-Bench
This appendix presents a comprehensive evaluation of the CopySpec approach on the MT-Bench dataset across various configurations and categories. The results highlight the consistent improvements in tokens-per-second (TPS) performance achieved by CopySpec compared to baseline models, demonstrating its efficiency and scalability.
D.1 Analysis of Gamma (γ) on MT-Bench
Figure 7 presents a comprehensive visualization of how the copying parameter γ affects the performance of the LLaMa3.1-8B-Instruct model on the MT-Bench dataset. The figure captures the interplay between the percentage of tokens successfully copied, the number of copying attempts, and the resulting tokens per second (TPS).
D.2 Speed-up by Category on MT-Bench
Turn 1 | Turn 2 | |||||
Category | Baseline | CopySpec | Speed-up | Baseline | CopySpec | Speed-up |
Coding | 5.12 | 5.62 | 1.10 | 4.62 | 7.10 | 1.54 |
Extraction | 4.76 | 5.64 | 1.19 | 4.48 | 6.84 | 1.53 |
Humanities | 5.09 | 5.32 | 1.04 | 4.54 | 4.98 | 1.10 |
Math | 5.17 | 5.84 | 1.13 | 4.81 | 6.72 | 1.40 |
Reasoning | 5.08 | 5.69 | 1.12 | 4.80 | 5.96 | 1.24 |
Roleplay | 5.06 | 5.14 | 1.02 | 4.59 | 4.68 | 1.02 |
Stem | 5.12 | 5.38 | 1.05 | 4.62 | 5.32 | 1.15 |
Writing | 5.12 | 5.12 | 1.01 | 4.69 | 6.09 | 1.30 |
Average | 5.07 | 5.47 | 1.08 | 4.64 | 5.96 | 1.28 |
Table 11 provides the TPS performance of Qwen2.5-72B-Instruct on two turns. The speed-ups are most notable in categories such as Extraction and Coding, where repetitive patterns allow CopySpec to outperform the baseline consistently. Average speed-ups for both turns reinforce the efficiency gains achieved.
Turn 1 | Turn 2 | |||||
Category | Base Model | CopySpec | Speed-up | Base Model | CopySpec | Speed-up |
Coding | 10.86 | 11.67 | 1.07 | 9.73 | 17.03 | 1.75 |
Extraction | 10.09 | 13.39 | 1.33 | 9.59 | 15.40 | 1.61 |
Humanities | 10.86 | 11.56 | 1.06 | 9.73 | 11.14 | 1.14 |
Math | 11.01 | 12.77 | 1.16 | 10.15 | 13.35 | 1.32 |
Reasoning | 10.82 | 12.18 | 1.13 | 10.22 | 11.54 | 1.13 |
Roleplay | 10.90 | 11.04 | 1.01 | 10.16 | 10.37 | 1.02 |
Stem | 10.89 | 11.51 | 1.06 | 9.84 | 11.50 | 1.17 |
Writing | 10.90 | 10.82 | 0.99 | 9.99 | 13.25 | 1.33 |
Average | 10.91 | 11.86 | 1.09 | 9.92 | 12.57 | 1.27 |
In Table 12, the performance of Qwen2.5-32B-Instruct is evaluated. CopySpec achieves significant speed-ups, particularly in the second turn, where contextual repetition becomes more prevalent. Categories like Math and Writing show marked improvements, underscoring CopySpec’s ability to handle computationally intensive tasks effectively.
Category | Turn 1: Base Model (tokens/s) | Turn 1: CopySpec (tokens/s) | Turn 1: Speed-up | Turn 2: Base Model (tokens/s) | Turn 2: CopySpec (tokens/s) | Turn 2: Speed-up
Coding | 43.04 | 47.22 | 1.10 | 37.43 | 60.06 | 1.60 |
Extraction | 39.50 | 44.41 | 1.12 | 38.94 | 52.85 | 1.36 |
Humanities | 43.06 | 44.79 | 1.04 | 36.82 | 43.05 | 1.17 |
Math | 44.40 | 49.46 | 1.11 | 39.39 | 53.45 | 1.36 |
Reasoning | 43.49 | 46.57 | 1.07 | 40.96 | 46.76 | 1.14 |
Roleplay | 43.43 | 45.35 | 1.04 | 38.72 | 39.89 | 1.03 |
Stem | 43.30 | 45.47 | 1.05 | 37.34 | 43.61 | 1.17 |
Writing | 43.58 | 44.72 | 1.03 | 38.80 | 55.90 | 1.44 |
Average | 42.80 | 46.98 | 1.10 | 38.25 | 49.57 | 1.30 |
Table 13 highlights the results for Qwen2.5-7B-Instruct. While the base model already decodes quickly, CopySpec further increases TPS, with an average speed-up of 1.30× in the second turn. These results confirm that CopySpec scales well across different model sizes.
Category | Turn 1: Base Model (tokens/s) | Turn 1: CopySpec (tokens/s) | Turn 1: Speed-up | Turn 2: Base Model (tokens/s) | Turn 2: CopySpec (tokens/s) | Turn 2: Speed-up
Coding | 5.18 | 5.94 | 1.15 | 4.79 | 7.63 | 1.59 |
Extraction | 4.91 | 5.28 | 1.08 | 4.65 | 7.03 | 1.51 |
Humanities | 5.21 | 5.39 | 1.04 | 4.77 | 5.35 | 1.12 |
Math | 5.23 | 5.83 | 1.12 | 4.96 | 6.57 | 1.32 |
Reasoning | 5.16 | 5.43 | 1.05 | 4.96 | 5.56 | 1.12 |
Roleplay | 5.17 | 5.28 | 1.02 | 4.94 | 5.90 | 1.19 |
Stem | 5.22 | 5.41 | 1.04 | 4.85 | 5.54 | 1.14 |
Writing | 5.21 | 5.27 | 1.01 | 4.81 | 6.42 | 1.33 |
Average | 5.16 | 5.48 | 1.06 | 4.84 | 6.25 | 1.29 |
The performance of LLaMa3.1-70B-Instruct is detailed in Table 14. CopySpec achieves consistent improvements across both turns, with substantial gains in computationally intensive categories such as Coding and Extraction. These results demonstrate the robustness of CopySpec when applied to larger models.
Category | Turn 1: Base Model (tokens/s) | Turn 1: CopySpec (tokens/s) | Turn 1: Speed-up | Turn 2: Base Model (tokens/s) | Turn 2: CopySpec (tokens/s) | Turn 2: Speed-up
Coding | 36.86 | 44.35 | 1.20 | 34.42 | 53.22 | 1.55 |
Extraction | 35.32 | 46.27 | 1.31 | 33.71 | 51.48 | 1.53 |
Humanities | 37.20 | 40.88 | 1.10 | 33.78 | 40.61 | 1.20 |
Math | 36.99 | 52.46 | 1.42 | 34.96 | 58.47 | 1.67 |
Reasoning | 36.70 | 53.33 | 1.45 | 34.76 | 53.86 | 1.55 |
Roleplay | 36.77 | 40.89 | 1.11 | 34.56 | 49.16 | 1.42 |
Stem | 37.19 | 41.06 | 1.10 | 34.47 | 41.88 | 1.21 |
Writing | 36.85 | 39.91 | 1.08 | 33.78 | 38.72 | 1.15 |
Average | 36.73 | 44.89 | 1.22 | 34.30 | 48.42 | 1.41 |
Table 15 evaluates LLaMa3.1-8B-Instruct. Although this model is significantly smaller, CopySpec still yields notable improvements, particularly in the second turn, where repetitive token patterns amplify the efficiency of speculative copying.
D.3 Merging with Speculative Decoding on MT-Bench
Category | Turn 1: Base Model | Turn 1: Spec. Dec. | Turn 1: Spec. Dec. + Copy () | Turn 1: Spec. Dec. + Copy () | Turn 2: Base Model | Turn 2: Spec. Dec. | Turn 2: Spec. Dec. + Copy () | Turn 2: Spec. Dec. + Copy ()
Coding | 10.86 | 15.97 | 15.91 | 16.16 | 9.73 | 14.81 | 19.94 | 19.97 |
Extraction | 10.09 | 14.22 | 15.39 | 15.36 | 9.59 | 14.55 | 16.71 | 16.26 |
Humanities | 10.86 | 13.66 | 13.89 | 13.87 | 9.73 | 12.30 | 12.93 | 12.85 |
Math | 11.01 | 17.02 | 17.30 | 17.32 | 10.15 | 15.38 | 16.04 | 16.61 |
Reasoning | 10.82 | 14.02 | 14.34 | 14.26 | 10.23 | 12.99 | 13.18 | 13.42 |
Roleplay | 10.90 | 12.86 | 12.88 | 12.94 | 10.16 | 12.11 | 12.18 | 12.24 |
Stem | 10.89 | 14.29 | 14.36 | 14.47 | 9.84 | 13.13 | 13.71 | 13.77 |
Writing | 10.90 | 12.65 | 12.69 | 12.71 | 9.99 | 11.69 | 13.54 | 13.31 |
Average | 10.79 | 14.34 | 14.60 | 14.64 | 9.93 | 13.37 | 14.78 | 14.80 |
Category | Turn 1: Base Model | Turn 1: Spec. Dec. | Turn 1: Spec. Dec. + Copy () | Turn 1: Spec. Dec. + Copy () | Turn 2: Base Model | Turn 2: Spec. Dec. | Turn 2: Spec. Dec. + Copy () | Turn 2: Spec. Dec. + Copy ()
Coding | 10.86 | 16.09 | 15.89 | 16.06 | 9.73 | 15.72 | 20.08 | 20.22 |
Extraction | 10.09 | 14.28 | 15.08 | 15.20 | 9.59 | 15.46 | 16.89 | 16.93 |
Humanities | 10.86 | 12.41 | 12.52 | 12.45 | 9.73 | 11.67 | 12.08 | 12.02 |
Math | 11.01 | 17.60 | 17.76 | 17.95 | 10.15 | 16.22 | 16.57 | 17.08 |
Reasoning | 10.82 | 13.04 | 12.94 | 12.97 | 10.23 | 11.92 | 12.25 | 12.29 |
Roleplay | 10.90 | 11.15 | 11.18 | 11.14 | 10.16 | 11.09 | 11.11 | 11.13 |
Stem | 10.89 | 13.34 | 13.35 | 13.37 | 9.84 | 12.87 | 13.12 | 13.12 |
Writing | 10.90 | 11.32 | 11.33 | 11.20 | 9.99 | 10.71 | 11.89 | 11.74 |
Average | 10.79 | 13.65 | 13.76 | 13.79 | 9.93 | 13.21 ± 0.06 | 14.25 ± 0.02 | 14.32 ± 0.04
Finally, Tables 16 and 17 compare different speculative decoding configurations with and without CopySpec, using Qwen2.5-32B-Instruct as the target model and Qwen2.5-7B-Instruct as the draft model. This analysis explores the impact of varying γ values and draft token counts, demonstrating that integrating CopySpec with speculative decoding consistently improves performance. The results emphasize the adaptability of CopySpec across diverse operational settings.
These tables collectively validate the effectiveness of CopySpec in accelerating large language model inference while maintaining high output quality. The findings in this appendix complement those in Appendix C, reinforcing the method’s utility across datasets and configurations.
Appendix E Extra Results on GSM-8K
This appendix provides an in-depth analysis of the CopySpec approach applied to self-correcting tasks and speculative decoding. The results demonstrate the effectiveness of CopySpec in improving token processing speed, leveraging context repetition, and enhancing self-correction efficiency without compromising model accuracy.
Variant | Turn 1: % Copied | Turn 1: Tokens/s | Turn 1: Copied/Attempt | Turn 1: Accepted/Draft | Turn 2: % Copied | Turn 2: Tokens/s | Turn 2: Copied/Attempt | Turn 2: Accepted/Draft | Turn 3: % Copied | Turn 3: Tokens/s | Turn 3: Copied/Attempt | Turn 3: Accepted/Draft
Base Model | – | 10.25 | – | – | – | 10.17 | – | – | – | 8.68 | – | – |
CopySpec () | 5.76% | 10.13 | 0.58 | – | 44.17% | 15.72 | 4.90 | – | 82.79% | 21.89 | 7.67 | – |
CopySpec () | 1.01% | 9.91 | 0.72 | – | 40.67% | 14.79 | 6.96 | – | 82.78% | 21.39 | 8.70 | – |
Spec. Dec. | – | 12.92 | – | 3.77 | – | 12.27 | – | 3.36 | – | 11.44 | – | 4.30 |
Spec. Dec. + Copy () | 1.47% | 12.67 | 0.53 | 3.77 | 40.23% | 14.65 | 6.08 | 2.52 | 81.18% | 20.81 | 7.71 | 3.39 |
Spec. Dec. + Copy () | 0.30% | 12.99 | 0.55 | 3.78 | 38.93% | 14.95 | 7.81 | 2.59 | 81.84% | 21.51 | 8.72 | 3.40 |
Table 18 extends the analysis to speculative decoding, focusing on the performance of CopySpec combined with speculative decoding when the draft model drafts 5 tokens at a time for self-correcting tasks. The table highlights the impact of varying the draft configuration: CopySpec combined with speculative decoding () achieves the best overall performance, with consistent improvements in both TPS and the average number of tokens accepted per attempt. This configuration effectively balances the benefits of speculative decoding with CopySpec's ability to handle token repetition efficiently.
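The combination evaluated here can be summarized with the following sketch: each decoding step first tries to form a draft by copying from the context (reusing the `propose_copy_draft` helper sketched in Appendix D.1) and only falls back to the smaller draft model, which here proposes 5 tokens at a time, when no match is found; the target model then verifies the draft in a single forward pass. The `propose` and `verify` interfaces are assumed wrappers, not the released API.

```python
from typing import List

def speculative_step(context: List[int], gamma: int, copy_len: int,
                     draft_len: int, draft_model, target_model) -> List[int]:
    """One decoding step combining copy speculation with draft-model speculation
    (illustrative sketch; model wrappers are assumed, not the released code)."""
    # 1) Prefer a cheap copy-based draft when the recent suffix repeats earlier context.
    draft = propose_copy_draft(context, gamma, copy_len)
    # 2) Otherwise fall back to standard speculative decoding with the small model.
    if not draft:
        draft = draft_model.propose(context, n_tokens=draft_len)  # e.g. 5 tokens
    # 3) The large target model verifies the whole draft in one forward pass and
    #    keeps the longest prefix consistent with its own predictions.
    accepted = target_model.verify(context, draft)
    return context + accepted
```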
Model (Instruct) | Variant | Turn 1: % Copied | Turn 1: Tokens/s | Turn 1: Copied/Attempt | Turn 1: Acc | Turn 2: % Copied | Turn 2: Tokens/s | Turn 2: Copied/Attempt | Turn 3: % Copied | Turn 3: Tokens/s | Turn 3: Copied/Attempt | Turn 3: Acc
Qwen2.5-72B | CopySpec | 6.12% | 4.71 | 0.63 | 94% | 47.49% | 7.49 | 4.35 | 88.68% | 10.59 | 7.94 | 96% |
Base Model | – | 4.74 | – | – | 4.76 | – | – | 3.98 | – | |||
Qwen2.5-32B | CopySpec | 5.76% | 10.13 | 0.58 | 92% | 44.17% | 15.72 | 4.90 | 82.78% | 21.89 | 7.67 | 93% |
Base Model | – | 10.25 | – | – | 10.17 | – | – | 8.68 | – | |||
Qwen2.5-7B | CopySpec | 9.36% | 41.01 | 0.87 | 84% | 60.34% | 75.34 | 5.65 | 84.23% | 93.68 | 7.35 | 85% |
Base Model | – | 40.29 | – | – | 39.67 | – | – | 35.63 | – |
Table 19 compares the performance of CopySpec and the baseline models across three turns on the GSM-8K self-correction tasks. The metrics include tokens-per-second (TPS), the percentage of tokens copied, and the average number of tokens successfully copied per attempt. CopySpec consistently achieves significant improvements, particularly in the second and third turns, where a larger context enables better exploitation of repetitive patterns. TPS gains exceed 2× in some configurations, and the percentage of copied tokens highlights CopySpec's efficiency in refining self-corrections.
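For clarity, the helper below shows one way these quantities can be derived from simple per-run counters; it is an assumed illustration of the metric definitions, not the evaluation code used to produce the table.

```python
def copy_metrics(total_tokens: int, copied_tokens: int,
                 copy_attempts: int, elapsed_s: float) -> dict:
    """Assumed metric definitions: TPS, percentage of tokens obtained by copying,
    and average number of tokens accepted per copy attempt."""
    return {
        "tokens_per_second": total_tokens / elapsed_s,
        "pct_copied": 100.0 * copied_tokens / total_tokens,
        "copied_per_attempt": copied_tokens / max(copy_attempts, 1),
    }
```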
These results underscore the versatility of CopySpec in enhancing computational efficiency and self-correction capabilities across multiple scenarios. The combination of CopySpec with speculative decoding demonstrates its adaptability to diverse operational settings, paving the way for faster and more accurate large language model inference in tasks requiring iterative refinement.
Appendix F MT-Redundant Dataset Examples
This appendix provides one illustrative example from each of the eight categories in our new MT-Redundant dataset. MT-Redundant builds upon MT-Bench by modifying the second turn of each conversation into a request for variations or adjustments of the first turn’s response, thus emulating real-world scenarios in which users seek revisions to previous outputs. Specifically, we replace the original second-turn prompt in MT-Bench (Zheng et al., 2023) with one that instructs the model to revisit and refine its previous answer. All assistant responses in this appendix are generated using Qwen2.5-72B-Instruct.
[One example conversation is shown for each of the eight MT-Redundant categories (Coding, Extraction, Humanities, Math, Reasoning, Roleplay, Stem, and Writing), with responses generated by Qwen2.5-72B-Instruct.]
Appendix G Prompts Used
G.1 Example of Self-Correction on GSM-8K
This section presents an example of self-correction in code generation on the GSM-8K dataset. Using Qwen2.5-72B-Instruct, we generate an initial solution and apply multi-round prompting to iteratively refine and correct the generated code.
To ensure direct answer generation, we prompt the model to explicitly print the computed result, reducing intermediate ambiguities and improving overall accuracy.
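As an assumed illustration of this setup (the exact prompt wording used in the paper may differ), the sketch below shows the multi-round structure: an initial request for a script that prints the final answer, followed by refinement turns that ask the model to review and, if needed, rewrite its own code.

```python
# Assumed prompt templates for multi-round self-correction on GSM-8K;
# the exact wording used in the paper may differ.
INITIAL_PROMPT = (
    "Solve the following problem by writing a self-contained Python script "
    "that prints only the final numeric answer.\n\nProblem: {question}"
)
REVISION_PROMPT = (
    "Review your previous solution. If you find a mistake, rewrite the full "
    "script with the error fixed; otherwise, repeat the script unchanged. "
    "Make sure the script still prints the final answer."
)

def self_correct(question: str, model, rounds: int = 2) -> str:
    """Run one initial generation plus `rounds` refinement turns
    (`model.chat` is an assumed chat-completion wrapper)."""
    history = [{"role": "user", "content": INITIAL_PROMPT.format(question=question)}]
    answer = model.chat(history)
    for _ in range(rounds):
        history += [
            {"role": "assistant", "content": answer},
            {"role": "user", "content": REVISION_PROMPT},
        ]
        # Later turns tend to repeat most of the earlier script, which is
        # exactly where copy speculation accelerates decoding.
        answer = model.chat(history)
    return answer
```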

G.2 Example of Extractive Summarization
This section provides an example of extractive summarization, where key sentences are selected directly from the original text to form a concise summary. The example, generated with Qwen2.5-72B-Instruct, demonstrates how the model extracts the most relevant information while preserving the original wording. Notably, the Qwen models show an interesting trend on the CNN/DM dataset: larger models produce more extractive summaries yet achieve slightly lower ROUGE-L scores.
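A prompt of roughly the following shape (an assumed example, not the exact instruction used for CNN/DM) elicits this behavior by explicitly asking for verbatim sentences:

```python
# Assumed extractive-summarization prompt; the exact wording may differ.
EXTRACTIVE_PROMPT = (
    "Summarize the article below by copying the three most important sentences "
    "verbatim from the text. Do not paraphrase or add any new words.\n\n"
    "Article:\n{article}"
)

def build_summary_prompt(article: str) -> str:
    """Fill the template with a CNN/DM article."""
    return EXTRACTIVE_PROMPT.format(article=article)
```

Because the selected sentences already appear verbatim in the context, most of the summary can be produced through the copy mechanism.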

G.3 Code Generation on HumanEval
This section presents an example of code generation using Qwen2.5-72B-Instruct on the HumanEval dataset. The input is a problem description specifying the function signature, expected behavior, and an example test case; from it, the model produces a self-contained Python script that correctly solves the task. The generated solution includes the function definition, type hints, and example test cases to ensure correctness.
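As an illustration of the task format, the snippet below shows a HumanEval-style problem and a completion of the kind described above; it is representative rather than a verbatim reproduction of the model output.

```python
from typing import List

def has_close_elements(numbers: List[float], threshold: float) -> bool:
    """Return True if any two numbers in the list are closer to each other
    than the given threshold (HumanEval-style task)."""
    for i, a in enumerate(numbers):
        for j, b in enumerate(numbers):
            if i != j and abs(a - b) < threshold:
                return True
    return False

# Example test cases, as typically included in the generated solution.
assert has_close_elements([1.0, 2.0, 3.0], 0.5) is False
assert has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3) is True
```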
