QUITO-X: An Information Bottleneck-based Compression Algorithm with Cross-Attention
Abstract
Generative large language models have achieved significant success in various industrial tasks and can effectively adapt to vertical domains and downstream tasks through In-Context Learning (ICL). However, as tasks become increasingly complex, the context length required by ICL also grows, and two significant issues arise: (i) the excessively long context leads to high costs and inference delays, and (ii) the substantial amount of task-irrelevant information introduced by long contexts exacerbates the "lost in the middle" problem.
Recently, compressing prompts by removing tokens according to a metric obtained from a causal language model, such as LLaMA-7B, has emerged as an effective approach to mitigate these issues. However, the metrics used by prior methods, such as self-information or perplexity (PPL), do not fully align with the objective of distinguishing the most important tokens when conditioning on the query. In this work, we introduce information bottleneck theory to carefully examine the properties required of such a metric. Inspired by this analysis, we use cross-attention in an encoder-decoder architecture as a new metric. Our simple method achieves significantly better performance with smaller models and lower latency.
We evaluate our method on four datasets: DROP, CoQA, SQuAD, and Quoref. The experimental results show that, while maintaining the same performance, our compression rate improves by nearly 25% over the previous state of the art. Remarkably, in experiments where 25% of the tokens are removed, our model's Exact Match (EM) score sometimes even exceeds that of the control group that uses the uncompressed text as context.
Introduction

In recent years, the rapid development of generative LLMs, such as ChatGPT, has revolutionized many traditional technologies. A critical factor behind their success is their ability to leverage rich contextual information to enhance performance across various tasks. Techniques like ICL (Brown et al. 2020), Retrieval-Augmented Generation (RAG) (Lewis et al. 2020), and the use of agents (Park et al. 2023) have been instrumental in enabling these models to understand and generate contextually relevant content, addressing complex problems through multi-turn dialogues.
As tasks become increasingly complex, the importance of context grows. Longer contexts allow LLMs to better capture the nuances and dependencies within the data, leading to more accurate and relevant outputs. However, this benefit comes at a cost.
With the increase in context length, two major challenges arise: (i) higher inference cost, especially when using closed-source APIs, and (ii) the introduction of task-irrelevant information, which exacerbates the "lost in the middle" problem (Tay et al. 2021), where the model's performance deteriorates as it struggles to maintain focus on the most relevant parts of the context.
To mitigate these challenges, one promising approach is context compression. This strategy leverages the inherent redundancy and repetition in natural language (Shannon 1951; Li et al. 2023), aiming to distill the context to its most informative elements while minimizing the loss of crucial information. One such method, the Selective Context (Li et al. 2023) approach, employs self-information as a metric to identify and remove less informative lexical units. This method has demonstrated that context compression can significantly reduce memory usage and improve inference speed with minimal impact on accuracy. Building on this, the LLMLingua series (Jiang et al. 2023b) introduced a more dynamic approach, utilizing perplexity (PPL) as a metric to adaptively compress the context at varying levels of granularity. These advancements highlight the potential of selective compression, though they still face limitations in fully capturing query-specific relevance.
Even though the LLMLingua series incorporates the query into the PPL calculation, thereby considering the conditional PPL in the query state, PPL still fails to adequately reflect the relevance between the query and various parts of the context. To address this issue, QUITO (Wang et al. 2024) innovatively proposes using attention as a metric to measure the relevance between the query and the context. This method employs a very small model as the compression model, aiming to select intuitively more useful information for the query. This information then holds a higher weight when generating answers.
Although QUITO, which utilizes self-attention, has achieved promising results in information compression, certain limitations have been observed in practical experiments. When using the attention from the final layer, the last few tokens have already formed a strong representation of the preceding context. Consequently, the attention of the final token tends to focus primarily on the last few tokens. Moreover, in the decoder-only self-attention module, the query, key, and value are trained together, primarily to obtain a good global representation rather than to force the query to discern which input token representations are most important. We believe these factors represent inherent limitations of QUITO.
Recognizing these limitations, recent work (Zhou et al. 2023) has turned to Information Bottleneck theory (Tishby, Pereira, and Bialek 1999) as a means to manage context noise by optimizing mutual information. Building on this concept, our approach employs cross-attention scores as a proxy for the mutual information between the query and context. This allows us to selectively retain the most relevant portions of the context, ensuring that the model remains focused on the information most critical to generating accurate responses.
In summary, our contributions are twofold:
1. Applying Information Bottleneck Theory to Context Compression: We introduce a novel perspective by using Information Bottleneck theory to analyse the properties required for context compression.

2. Developing a Cross-Attention-Based Compression Algorithm: Building on this theory, we design a state-of-the-art context compression algorithm based on cross-attention, significantly outperforming existing methods in question-answering accuracy.
Related Work
Attention Mechanism
The Attention Mechanism (Vaswani et al. 2017) is crucial in today's machine learning. It is widely used in many fields, including image generation (Rombach et al. 2021), image recognition (Caron et al. 2021), and language processing (Brown et al. 2020; Radford et al. 2019). An input sequence is first transformed into query (Q), key (K), and value (V) matrices through linear projections, and the attention matrix is computed as follows:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$
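For readers who prefer code to notation, a minimal PyTorch sketch of this scaled dot-product attention (our own illustration, not tied to any particular model) looks as follows:

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q: torch.Tensor, K: torch.Tensor, V: torch.Tensor):
    """Compute softmax(Q K^T / sqrt(d_k)) V; return the output and the attention matrix."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # (..., len_q, len_k)
    weights = F.softmax(scores, dim=-1)             # attention matrix
    return weights @ V, weights
```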
Among attention variants, cross-attention is often used to exchange information between two different types of input, such as between image and text modalities (Rombach et al. 2021; Cho et al. 2021), or for translation between different languages. In practice, Q (the query) is generally designed according to the type of output desired from the model, so as to extract the most important information from K and V. In ViT (Dosovitskiy et al. 2020), it is shown that by observing the attention heatmap, one can determine which input token is most important for which part of the prediction. Maharjan et al. (2018) use a constant query, "which representation is most important", to weight different representations. In our work, we use the T5 model (Chung et al. 2022), with the prompt + query as the KV pair and the first token of the answer as Q (acting as "which tokens are most important", since in normal training the model must weight the context tokens to generate answers), to perform cross-attention and determine which tokens in the prompt are most important.
Context Compression
Generative LLMs have achieved strong performance across various tasks, but they encounter computational challenges when processing long documents and extended conversations due to increased token counts and context truncation. ICL (Brown et al. 2020) helps mitigate some of these issues by providing task-relevant context directly, reducing the need for task-specific Supervised Fine-Tuning (SFT) and lowering costs. However, ICL also increases token counts and inference costs.
The Selective Context method was developed to address these challenges by compressing the input context through the removal of redundant information. Li et al. (2023) evaluate tokens based on self-information, retaining only the most informative content. This approach reduces token counts and inference costs while mitigating the "lost in the middle" problem (Tay et al. 2021) associated with long contexts. Experiments demonstrate that Selective Context reduces memory usage and generation latency while maintaining comparable performance, especially in tasks involving lengthy documents.
The LLMLingua (Jiang et al. 2023b) series, including LongLLMLingua (Jiang et al. 2023a) and LLMLingua2 (Pan et al. 2024), enhance context compression by optimizing context selection strategies and introducing advanced memory network structures. These models aim to improve efficiency and accuracy in processing longer documents and complex dialogues, optimizing context usage without compromising performance.
QUITO (Wang et al. 2024) introduces self-attention as a metric for context compression, diverging from traditional methods that rely on perplexity or self-information. By concatenating context and query, QUITO uses a small transformer decoder-only model to compute attention, retaining tokens with high attention values.
However, in the self-attention module of QUITO, the query, key, and value are trained together, which primarily aims to obtain a comprehensive global representation, and may lead to information mixing between the input and output. In contrast, we employ cross-attention in an encoder-decoder architecture, where the input and output are separated. We keep the key-value pairs fixed and only train the query representations, which forces the query to learn to distinguish which token representations are most crucial for generating the output.
Method
Theory
The information bottleneck (IB) (Tishby, Pereira, and Bialek 1999; Fischer 2020) is a simple concept: when facing a task, one should try to accomplish it using minimal information. In our case, we want to compress the context while retaining the accuracy of the output answer, which makes the problem well suited to modeling with IB theory. We can therefore use the IB objective to formulate our goal as follows:
$$\min_{\tilde{C}} \; I(\tilde{C}; C) \;-\; \beta\, I(\tilde{C}; Y \mid Q) \qquad (1)$$

where $\tilde{C}$ stands for the compressed context, $Q$ for the query, and $Y$ for the output, and $\beta$ balances the two terms. We write $\tilde{C} = \{\tilde{c}_1, \tilde{c}_2, \dots, \tilde{c}_m\}$, in which $\tilde{c}_i$ is the $i$-th token and $m$ is related to the compression ratio. The first term serves to enhance efficiency, while the second term serves to retain as much useful information as possible so that the LLM can generate the target outputs.
Notice that since our compression method obtains $\tilde{C}$ by deleting tokens from $C$, for a fixed compression ratio the first (efficiency) term can be ignored. Our objective then becomes maximizing $I(\tilde{C}; Y \mid Q)$.
Using the chain rule of mutual information and the r-Markov property, we have

$$I(\tilde{C}; Y \mid Q) = \sum_{i=1}^{m} I_Q\!\left(\tilde{c}_i;\, Y \mid \tilde{c}_1, \dots, \tilde{c}_{i-1}\right) \qquad (2)$$
$$\approx \sum_{i=1}^{m} I_Q\!\left(\tilde{c}_i;\, Y \mid \tilde{c}_{i-r}, \dots, \tilde{c}_{i-1}\right) \qquad (3)$$

(where the subscript $Q$ stands for conditioning on $Q$). So, for a fixed compression ratio, we can compute the exact number $k$ of tokens that need to be deleted while maximizing the mutual information. We refer to $I_Q(\tilde{c}_i; Y \mid \tilde{c}_{i-r}, \dots, \tilde{c}_{i-1})$ as the conditional mutual information of the $i$-th term, noting that the $i$-th token, as well as the tokens preceding it, contribute to this term. Therefore, to delete $k$ tokens, we can look for the $k$ terms with the smallest conditional mutual information.
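Stated in symbols (our restatement of this selection rule, using the notation of Equations (2) and (3)), the set of deleted token indices is

$$S^{*} = \arg\min_{S \subset \{1,\dots,m\},\; |S| = k} \; \sum_{i \in S} I_Q\!\left(\tilde{c}_i;\, Y \mid \tilde{c}_{i-r}, \dots, \tilde{c}_{i-1}\right),$$

and the compressed context keeps the remaining tokens in their original order.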
According to the above analysis, we need a metric that reflects the relative size of $I_Q(\tilde{c}_i; Y \mid \tilde{c}_{i-r}, \dots, \tilde{c}_{i-1})$. It should satisfy the following properties:
1. It is conditioned on $Q$ and can capture the information from the preceding tokens $\tilde{c}_{i-r}, \dots, \tilde{c}_{i-1}$.

2. Under the above condition, it also reflects the importance (or similarity) of $\tilde{c}_i$ to the output $Y$.
We found that cross-attention in the encoder-decoder architecture effectively satisfies the above properties. Specifically, we put the context (corresponding to $\tilde{C}$) and the query (corresponding to $Q$) together in the encoder. Through self-attention, we obtain a representation of the context that captures both the preceding information and the query information, which we use as the key-value (KV) pairs for cross-attention (corresponding to Property 1). Then, we use the first token in the decoder, marked as the start token, as the query $q$ and perform cross-attention with the KV pairs. This first decoder token can be regarded as a constant query: "which representations are most important for generating the output?" (corresponding to Property 2). Therefore, the attention scores obtained from the cross-attention between $q$ and the KV pairs can be used to represent the relative size of the conditional mutual information in Equation (2).
From Property 1, we need to ensure that after removing some tokens, we can still capture the information preceding $\tilde{c}_i$. According to the r-Markov assumption, we should retain the tokens preceding the important tokens. In practice, we found that applying a Gaussian filter, a softer implementation, works better and is easier to implement than the hard constraint of directly retaining the preceding tokens. The detailed process is presented in the following Algorithm section.
Algorithm
In practice, we use the word as the smallest unit of compression, following the work of previous researchers (Li et al. 2023). We utilize the T5 model’s cross-attention mechanism to evaluate the importance of each word in the context relative to a given query. The procedure is as follows:
1. Concatenation of Context and Query: The context and query are concatenated to form a single input sequence. This sequence is fed into the T5 model's encoder to generate a feature representation.

2. Encoding and Decoding: Leveraging the T5 model's encoder-decoder architecture, the concatenated input is encoded, and the model begins decoding from a start token [start]. During this decoding process, the model generates output tokens while computing cross-attention scores that highlight the importance of each context token with respect to the generated output.

3. Extracting the Cross-Attention Scores: Cross-attention weights from the final layer are extracted, specifically those over the tokens corresponding to the context. The weights from each attention head are averaged to obtain a more robust representation. These averaged weights are then normalized using a softmax function to produce attention-based importance scores for each token.

4. Smoothing with a Gaussian Filter: To ensure that tokens adjacent to those with high attention scores also receive appropriate attention, we apply a Gaussian filter to the attention scores. This smoothing distributes the attention more evenly across nearby tokens, enhancing the model's ability to capture relevant context. The Gaussian smoothing is defined as:

$$\tilde{a}_i = \sum_{j=-w}^{w} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{j^2}{2\sigma^2}\right) a_{i+j} \qquad (4)$$

where $\tilde{a}_i$ is the smoothed attention score for token $i$, $a_{i+j}$ is the attention score of a neighboring token, $\sigma$ is the standard deviation that controls the smoothness, and $w$ is the window size for smoothing.

5. Reconstruction Based on Word Importance: Since words are the smallest semantic units in our setting, the reconstructed context is derived by considering the importance of each word. The reconstruct function groups tokens into words and then re-evaluates their importance using the smoothed attention scores. This ensures that the semantic integrity of the text is preserved during compression.

6. Context Compression: Finally, the context is compressed by selectively retaining the words with the highest importance scores. The compression ratio can be adjusted to control the level of detail retained in the compressed context, balancing context size against information retention.
Formally, after obtaining the attention scores for each token, we filter words as follows:

$$s_w = \sum_{t \in w} \tilde{a}_t \qquad (5)$$
$$\text{filtered\_words} = \operatorname{TopK}_{w}\!\left(s_w\right), \quad \text{with } K \text{ chosen so that } \sum_{w \in \text{filtered\_words}} |w| \le \text{ratio} \times N \qquad (6)$$

Here, $s_w$ is the cumulative attention score of word $w$, calculated by summing the smoothed attention scores of all tokens $t$ within the word; $|w|$ denotes the number of tokens in $w$ and $N$ is the total number of context tokens. The words are then filtered by selecting the top-ranked ones based on their cumulative scores, ensuring that the selected words collectively satisfy the desired ratio of the total token count. Figure 1 provides an overview of our algorithm.
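To make the word-level selection concrete, the following sketch (our own illustration, not the authors' released implementation; all function and variable names are ours) applies the Gaussian smoothing of Eq. (4) to per-token scores, accumulates them into word scores as in Eq. (5), and keeps the top-scoring words within the token budget as in Eq. (6):

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def compress_context(words, word_token_counts, token_scores, ratio, sigma=1.0):
    """Keep the highest-scoring words until roughly `ratio` of the tokens is retained.

    words             : list of words in their original order
    word_token_counts : number of tokens belonging to each word
    token_scores      : cross-attention score per context token
    ratio             : fraction of tokens to keep (e.g. 0.5)
    """
    # Eq. (4): smooth token scores so neighbours of salient tokens also survive.
    smoothed = gaussian_filter1d(np.asarray(token_scores, dtype=float), sigma=sigma)

    # Eq. (5): cumulative score per word = sum of its tokens' smoothed scores.
    word_scores, pos = [], 0
    for n in word_token_counts:
        word_scores.append(smoothed[pos:pos + n].sum())
        pos += n

    # Eq. (6): greedily pick top-ranked words until the token budget is reached
    # (one simple way to satisfy the ratio constraint; other schemes are possible).
    budget = int(ratio * sum(word_token_counts))
    keep, used = set(), 0
    for idx in np.argsort(word_scores)[::-1]:
        if used + word_token_counts[idx] > budget:
            continue
        keep.add(idx)
        used += word_token_counts[idx]

    # Reassemble the kept words in their original order.
    return " ".join(w for i, w in enumerate(words) if i in keep)
```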
By focusing on word-level importance via cross-attention scores, our method effectively identifies the most relevant information in a context, enabling efficient compression of long contexts without significant loss of critical information. This is particularly beneficial in tasks involving extensive text, where preserving key details is paramount for maintaining high accuracy downstream.
Model: T5 with encoder-decoder architecture
Input: context (the context for in-context learning), query (the user's query)
Output: attn_score (attention-based score for token importance)
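A possible realization of this scoring step with the Hugging Face transformers library is sketched below; the exact calls and the token-boundary handling are our assumptions rather than the authors' code:

```python
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

def cross_attention_scores(context: str, query: str,
                           model_name: str = "google/flan-t5-small") -> torch.Tensor:
    """Score each context token by the decoder's first-step cross-attention."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = T5ForConditionalGeneration.from_pretrained(model_name).eval()

    # Encoder input: context concatenated with the query.
    enc = tokenizer(context + " " + query, return_tensors="pt", truncation=True)
    # Decoder input: only the start token, acting as a constant
    # "which tokens matter for the output?" query.
    dec_ids = torch.tensor([[model.config.decoder_start_token_id]])

    with torch.no_grad():
        out = model(**enc, decoder_input_ids=dec_ids, output_attentions=True)

    # Final layer cross-attention: (batch, heads, dec_len=1, enc_len).
    attn = out.cross_attentions[-1].mean(dim=1)[0, 0]   # average over heads
    attn = torch.softmax(attn, dim=-1)                  # normalize

    # Approximate mapping back to the context tokens (drop the trailing </s>);
    # in practice one would track exact token offsets.
    n_ctx = len(tokenizer(context, truncation=True).input_ids) - 1
    return attn[:n_ctx]
```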
Experiments
Main Experiment: Context Compression on Standard QA Datasets
Datasets and Models.
| Algorithm | Architecture & Model | Parameters |
|---|---|---|
| Selective Context | Decoder-Only (GPT-2) | 124M |
| LLMLingua | Decoder-Only (Llama-2-7b) | 7B |
| LongLLMLingua | Decoder-Only (Llama-2-7b) | 7B |
| LLMLingua2 | Encoder-Only (XLM-RoBERTa-large) | 355M |
| QUITO | Decoder-Only (Qwen2-0.5B-Instruct) | 500M |
| QUITO-X | Encoder-Decoder (FLAN-T5-small) | 60M |
For our main experiment, we utilized the FLAN-T5-small model (Chung et al. 2022) as the compression model. We conducted this experiment on four widely used QA datasets: CoQA (Reddy, Chen, and Manning 2019), Quoref (Dasigi et al. 2019), DROP (Dua et al. 2019), and SQuAD (Rajpurkar et al. 2016). This experiment aimed to evaluate the effectiveness of our context compression technique on both accuracy and information retention. Specifically:
• For CoQA and Quoref, we measured question-answering accuracy when the compressed context is supplied to the downstream model.

• For DROP and SQuAD, we focused on whether key information (i.e., the source of the answer) was preserved after compression, using Exact Match (EM) as the evaluation metric.
Baseline.
We compared against the following context compression baselines:
1. Selective Context: Uses GPT-2 (Radford et al. 2019) to retain context segments based on self-information.

2. LLMLingua: Employs Llama-2-7b (Touvron et al. 2023) with dynamic compression driven by context PPL.

3. LongLLMLingua: Extends LLMLingua for longer contexts, also using Llama-2-7b (Touvron et al. 2023).

4. LLMLingua2: Utilizes XLM-RoBERTa-large (Conneau et al. 2019), introducing data distillation for enhanced compression.

5. QUITO: Applies Qwen2-0.5B-Instruct (qwe 2024) with attention mechanisms to selectively retain query-relevant context.
(In the following table, rows with ratio 1.00 and 0.00 report the accuracy obtained with the full context and with no context, respectively; these values do not depend on the compression method and are shown once per row.)

| dataset | model | ratio | Selective-Context | LLMLingua | LongLLMLingua | LLMLingua2 | QUITO | QUITO-X |
|---|---|---|---|---|---|---|---|---|
| Quoref | LongChat | 1.00 | 70.6 | | | | | |
| | | 0.75 | 65.3 | 46.4 | 46.5 | 65.7 | 65.6 | 68.1 |
| | | 0.50 | 55.8 | 34.5 | 34.6 | 55.0 | 59.4 | 65.1 |
| | | 0.25 | 40.9 | 28.2 | 28.7 | 41.5 | 52.3 | 60.8 |
| | | 0.00 | 2.9 | | | | | |
| | Llama-3 | 1.00 | 93.1 | | | | | |
| | | 0.75 | 90.3 | 64.9 | 65.3 | 90.7 | 89.8 | 92.6 |
| | | 0.50 | 81.3 | 51.1 | 51.4 | 82.6 | 84.4 | 90.2 |
| | | 0.25 | 59.3 | 43.2 | 43.3 | 65.5 | 75.8 | 86.8 |
| | | 0.00 | 6.8 | | | | | |
| CoQA | LongChat | 1.00 | 59.1 | | | | | |
| | | 0.75 | 56.6 | 44.9 | 45.4 | 57.5 | 54.6 | 59.6 |
| | | 0.50 | 47.0 | 36.3 | 36.4 | 50.3 | 50.4 | 59.5 |
| | | 0.25 | 32.1 | 30.4 | 25.9 | 41.0 | 41.4 | 55.5 |
| | | 0.00 | 13.8 | | | | | |
| | Llama-3 | 1.00 | 79.3 | | | | | |
| | | 0.75 | 76.5 | 62.3 | 61.8 | 74.8 | 73.1 | 79.5 |
| | | 0.50 | 64.1 | 50.9 | 50.4 | 69.4 | 64.6 | 78.1 |
| | | 0.25 | 45.3 | 43.0 | 37.3 | 57.7 | 53.5 | 75.5 |
| | | 0.00 | 18.1 | | | | | |

Implementation Details.
We first evaluated model accuracy using the original context and with no context, assessing the models' ability to answer with full information and to rely solely on prior knowledge. We then tested the five baseline methods and our approach at compression ratios of 0.75, 0.50, and 0.25, measuring accuracy with the compressed context. For DROP and SQuAD, where the original text always contains the correct answer (EM = 1), we evaluated whether the correct answer remained in the compressed context under the same compression ratios, using the EM score as the metric.
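For the DROP and SQuAD retention check, the EM-style test reduces to asking whether the normalized gold answer still appears in the compressed context. A small sketch follows (our own normalization choices, which may differ from the authors' exact evaluation script):

```python
import re
import string

def normalize(text: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace (SQuAD-style)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def answer_retained(compressed_context: str, gold_answer: str) -> bool:
    """EM-style check: does the normalized answer still appear verbatim?"""
    return normalize(gold_answer) in normalize(compressed_context)
```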
Supplementary Experiment: Addressing Long Texts with Chunking Strategies
Datasets and Models.
As a supplementary experiment, we focused on datasets with particularly long texts, where traditional models often struggle with the "lost in the middle" phenomenon, leading to reduced accuracy. We selected subsets from LongBench (Bai et al. 2023) to evaluate our method, focusing on three datasets known for their long contexts: 2WikiMultiHopQA (Ho et al. 2020), HotpotQA (Yang et al. 2018), and MuSiQue (Trivedi et al. 2022).
Approach.
In this experiment, we compared the performance of LLMLingua2 and our proposed method. Both methods employed a chunking strategy, dividing the context into 512-token segments. We tested two distinct strategies:
1. Concatenating each chunk with the query before compression.

2. Calculating attention scores for each chunk with the query, recombining the attention arrays, and then performing compression (both strategies are sketched below).
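Both strategies can be sketched in a few lines; here `score_fn` and `compress_fn` stand for hypothetical per-chunk helpers (e.g., the cross-attention scorer and word-level selection sketched earlier), and the 512-token chunking is simplified to fixed-size slices of a pre-tokenized context:

```python
import numpy as np

def strategy1_compress(context_tokens, query, compress_fn, chunk_size=512):
    """Strategy 1: concatenate each chunk with the query, compress it
    independently, then join the per-chunk results."""
    pieces = []
    for start in range(0, len(context_tokens), chunk_size):
        chunk = context_tokens[start:start + chunk_size]
        pieces.append(compress_fn(chunk, query))          # per-chunk compression
    return [tok for piece in pieces for tok in piece]

def strategy2_scores(context_tokens, query, score_fn, chunk_size=512):
    """Strategy 2: score each chunk against the query, recombine the score
    arrays, and leave the compression to a single global selection pass."""
    parts = []
    for start in range(0, len(context_tokens), chunk_size):
        chunk = context_tokens[start:start + chunk_size]
        parts.append(np.asarray(score_fn(chunk, query), dtype=float))
    return np.concatenate(parts)                          # one score per context token
```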
Objective.
This supplementary experiment was designed to evaluate how well each method mitigates the ”lost in the middle” issue and preserves relevant information across the chunks, thereby supporting the findings from our main experiment.
Results and Analysis
Main Experiment Analysis.
The experimental results in Table 2 demonstrate that our proposed QUITO-X consistently outperforms existing methods, including Selective-Context, LLMLingua, LongLLMLingua, LLMLingua2, and QUITO, across different compression ratios (1.00, 0.75, 0.50, 0.25, and 0.00) on the Quoref and CoQA datasets. Notably, our method shows superior performance even at higher compression ratios, where significant portions of the context are removed. This indicates the robustness and effectiveness of our approach in retaining critical information despite reduced context sizes.
An interesting observation is that in some cases, particularly highlighted in the blue sections of Table 2, our method not only retains information effectively under compression but also surpasses the performance of the original uncompressed context. This could be attributed to the method’s ability to focus on the most relevant portions of the context, thereby reducing noise and improving model predictions.
The information retention graphs (Figure 2) for the SQuAD and DROP datasets further support these findings. Our method maintains a higher retention of information (measured by Exact Match scores) across all compression ratios. As the ratio decreases, and more context is compressed, our method’s advantage becomes more pronounced compared to other approaches. This demonstrates that our compression strategy is particularly effective in scenarios where only a limited amount of context can be preserved, thus ensuring that the most critical information is retained.
Overall, the results clearly validate the effectiveness of our method in various contexts, particularly in scenarios requiring aggressive compression. Our method not only mitigates the loss of information but, in some cases, even enhances the performance by concentrating on the most informative parts of the context.
| dataset | ratio | LLMLingua2 | Strategy 1 | Strategy 2 |
|---|---|---|---|---|
| 2wikimqa | 1.00 | 55.0 | | |
| | 0.75 | 64.0 | 64.0 | 60.5 |
| | 0.50 | 68.0 | 67.5 | 69.0 |
| | 0.25 | 53.5 | 61.5 | 60.0 |
| hotpotqa | 1.00 | 15.5 | | |
| | 0.75 | 25.5 | 31.0 | 30.0 |
| | 0.50 | 57.5 | 65.5 | 63.0 |
| | 0.25 | 52.5 | 63.0 | 69.5 |
| musique | 1.00 | 2.5 | | |
| | 0.75 | 2.5 | 4.0 | 3.5 |
| | 0.50 | 40.5 | 41.5 | 43.5 |
| | 0.25 | 40.0 | 43.0 | 49.0 |

(Rows with ratio 1.00 report the score obtained with the full, uncompressed context.)
Supplementary Experiment Analysis.
The supplementary experiments on long-text datasets (2WikiMultiHopQA, HotpotQA, and MuSiQue) validate the efficacy of our proposed strategies. Notably, our methods consistently outperform the baseline (LLMLingua2) across various compression ratios.
In 2WikiMultiHopQA, Strategy 1 achieves the highest performance at a 0.75 compression ratio, while Strategy 2 excels at a 0.50 ratio, showcasing the adaptability of our approach to different compression levels. For HotpotQA, both strategies significantly enhance performance as the compression ratio decreases, with Strategy 2 reaching the highest scores at 0.50 and 0.25 ratios. Finally, in MuSiQue, Strategy 2 outperforms all others at lower compression ratios, demonstrating robust information retention even with aggressive compression.
These results emphasize the effectiveness of our long-context handling strategies, particularly in scenarios where the context needs to be significantly compressed while maintaining critical information.
Conclusion
In this paper, we address the challenge of query-based context compression in Retrieval-Augmented Generation (RAG) scenarios. Leveraging information bottleneck theory, we analyzed the properties required of metrics that measure token importance. Our approach employs cross-attention and achieves state-of-the-art (SOTA) results across several commonly used datasets. Notably, our method performs especially well on long texts, sometimes even outperforming the use of the original, uncompressed context, which may be attributed to the redundancy inherent in natural language. Our model also significantly surpasses strong baselines in both inference latency and performance. The effectiveness of our chunking strategy for longer texts, as well as the reasons behind the exceptional performance of cross-attention, are left for future exploration.
References
- qwe (2024) 2024. Qwen2 Technical Report.
- AI@Meta (2024) AI@Meta. 2024. Llama 3 Model Card.
- Bai et al. (2023) Bai, Y.; Lv, X.; Zhang, J.; Lyu, H.; Tang, J.; Huang, Z.; Du, Z.; Liu, X.; Zeng, A.; Hou, L.; Dong, Y.; Tang, J.; and Li, J. 2023. LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding. arXiv:2308.14508.
- Brown et al. (2020) Brown, T. B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; Agarwal, S.; Herbert-Voss, A.; Krueger, G.; Henighan, T.; Child, R.; Ramesh, A.; Ziegler, D. M.; Wu, J.; Winter, C.; Hesse, C.; Chen, M.; Sigler, E.; Litwin, M.; Gray, S.; Chess, B.; Clark, J.; Berner, C.; McCandlish, S.; Radford, A.; Sutskever, I.; and Amodei, D. 2020. Language Models are Few-Shot Learners.
- Caron et al. (2021) Caron, M.; Touvron, H.; Misra, I.; Jégou, H.; Mairal, J.; Bojanowski, P.; and Joulin, A. 2021. Emerging Properties in Self-Supervised Vision Transformers. In Proceedings of the International Conference on Computer Vision (ICCV).
- Cho et al. (2021) Cho, J.; Lei, J.; Tan, H.; and Bansal, M. 2021. Unifying Vision-and-Language Tasks via Text Generation. In ICML.
- Chung et al. (2022) Chung, H. W.; Hou, L.; Longpre, S.; Zoph, B.; Tay, Y.; Fedus, W.; Li, E.; Wang, X.; Dehghani, M.; Brahma, S.; Webson, A.; Gu, S. S.; Dai, Z.; Suzgun, M.; Chen, X.; Chowdhery, A.; Narang, S.; Mishra, G.; Yu, A.; Zhao, V.; Huang, Y.; Dai, A.; Yu, H.; Petrov, S.; Chi, E. H.; Dean, J.; Devlin, J.; Roberts, A.; Zhou, D.; Le, Q. V.; and Wei, J. 2022. Scaling Instruction-Finetuned Language Models.
- Conneau et al. (2019) Conneau, A.; Khandelwal, K.; Goyal, N.; Chaudhary, V.; Wenzek, G.; Guzmán, F.; Grave, E.; Ott, M.; Zettlemoyer, L.; and Stoyanov, V. 2019. Unsupervised Cross-lingual Representation Learning at Scale. CoRR, abs/1911.02116.
- Dacheng Li* and Zhang (2023) Li, D.*; Shao, R.*; Xie, A.; Sheng, Y.; Zheng, L.; Gonzalez, J. E.; Stoica, I.; Ma, X.; and Zhang, H. 2023. How Long Can Open-Source LLMs Truly Promise on Context Length?
- Dasigi et al. (2019) Dasigi, P.; Liu, N. F.; Marasovic, A.; Smith, N. A.; and Gardner, M. 2019. Quoref: A Reading Comprehension Dataset with Questions Requiring Coreferential Reasoning. arXiv:1908.05803v2.
- Dosovitskiy et al. (2020) Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.
- Dua et al. (2019) Dua, D.; Wang, Y.; Dasigi, P.; Stanovsky, G.; Singh, S.; and Gardner, M. 2019. DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs. In Proc. of NAACL.
- Fischer (2020) Fischer, I. 2020. The Conditional Entropy Bottleneck. Entropy, 22(9).
- Ho et al. (2020) Ho, X.; Duong Nguyen, A.-K.; Sugawara, S.; and Aizawa, A. 2020. Constructing A Multi-hop QA Dataset for Comprehensive Evaluation of Reasoning Steps. In Scott, D.; Bel, N.; and Zong, C., eds., Proceedings of the 28th International Conference on Computational Linguistics, 6609–6625. Barcelona, Spain (Online): International Committee on Computational Linguistics.
- Jiang et al. (2023a) Jiang, H.; Wu, Q.; Luo, X.; Li, D.; Lin, C.-Y.; Yang, Y.; and Qiu, L. 2023a. LongLLMLingua: Accelerating and Enhancing LLMs in Long Context Scenarios via Prompt Compression. ArXiv preprint, abs/2310.06839.
- Jiang et al. (2023b) Jiang, H.; Wu, Q.; Lin, C.-Y.; Yang, Y.; and Qiu, L. 2023b. LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 13358–13376. Association for Computational Linguistics.
- Lewis et al. (2020) Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.-t.; Rocktäschel, T.; Riedel, S.; and Kiela, D. 2020. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS ’20. Red Hook, NY, USA: Curran Associates Inc. ISBN 9781713829546.
- Li et al. (2023) Li, Y.; Dong, B.; Lin, C.; and Guerin, F. 2023. Compressing Context to Enhance Inference Efficiency of Large Language Models. arXiv:2310.06201.
- Maharjan et al. (2018) Maharjan, S.; Montes, M.; González, F. A.; and Solorio, T. 2018. A Genre-Aware Attention Model to Improve the Likability Prediction of Books. In Riloff, E.; Chiang, D.; Hockenmaier, J.; and Tsujii, J., eds., Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 3381–3391. Brussels, Belgium: Association for Computational Linguistics.
- Pan et al. (2024) Pan, Z.; Wu, Q.; Jiang, H.; Xia, M.; Luo, X.; Zhang, J.; Lin, Q.; Ruhle, V.; Yang, Y.; Lin, C.-Y.; Zhao, H. V.; Qiu, L.; and Zhang, D. 2024. LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression. ArXiv preprint, abs/2403.12968.
- Park et al. (2023) Park, J. S.; O’Brien, J. C.; Cai, C. J.; Morris, M. R.; Liang, P.; and Bernstein, M. S. 2023. Generative Agents: Interactive Simulacra of Human Behavior. In In the 36th Annual ACM Symposium on User Interface Software and Technology (UIST ’23), UIST ’23. New York, NY, USA: Association for Computing Machinery.
- Radford et al. (2019) Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; and Sutskever, I. 2019. Language Models are Unsupervised Multitask Learners.
- Rajpurkar et al. (2016) Rajpurkar, P.; Zhang, J.; Lopyrev, K.; and Liang, P. 2016. SQuAD: 100,000+ Questions for Machine Comprehension of Text. In Su, J.; Duh, K.; and Carreras, X., eds., Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 2383–2392. Austin, Texas: Association for Computational Linguistics.
- Reddy, Chen, and Manning (2019) Reddy, S.; Chen, D.; and Manning, C. D. 2019. CoQA: A Conversational Question Answering Challenge. Transactions of the Association for Computational Linguistics, 7: 249–266.
- Rombach et al. (2021) Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; and Ommer, B. 2021. High-Resolution Image Synthesis with Latent Diffusion Models. arXiv:2112.10752.
- Shannon (1951) Shannon, C. E. 1951. Prediction and Entropy of Printed English. Bell System Technical Journal, 30: 50–64.
- Tay et al. (2021) Tay, Y.; Dehghani, M.; Abnar, S.; Shen, Y.; Bahri, D.; Pham, P.; Rao, J.; Yang, L.; Ruder, S.; and Metzler, D. 2021. Long Range Arena : A Benchmark for Efficient Transformers. In International Conference on Learning Representations.
- Tishby, Pereira, and Bialek (1999) Tishby, N.; Pereira, F. C.; and Bialek, W. 1999. The information bottleneck method. In Proc. of the 37-th Annual Allerton Conference on Communication, Control and Computing, 368–377.
- Touvron et al. (2023) Touvron, H.; Martin, L.; Stone, K.; et al. 2023. Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv:2307.09288.
- Trivedi et al. (2022) Trivedi, H.; Balasubramanian, N.; Khot, T.; and Sabharwal, A. 2022. MuSiQue: Multihop Questions via Single-hop Question Composition. Transactions of the Association for Computational Linguistics.
- Vaswani et al. (2017) Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, L.; and Polosukhin, I. 2017. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, 6000–6010. Red Hook, NY, USA: Curran Associates Inc. ISBN 9781510860964.
- Wang et al. (2024) Wang, W.; Wang, Y.; Fan, Y.; Liao, H.; and Guo, J. 2024. QUITO: Accelerating Long-Context Reasoning through Query-Guided Context Compression. arXiv preprint arXiv:2408.00274.
- Yang et al. (2018) Yang, Z.; Qi, P.; Zhang, S.; Bengio, Y.; Cohen, W. W.; Salakhutdinov, R.; and Manning, C. D. 2018. HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering. In Conference on Empirical Methods in Natural Language Processing (EMNLP).
- Zhou et al. (2023) Zhou, C.; Sun, W.; Mou, L.; Wang, K.-W. C.; and Neubig, G. 2023. An Information Bottleneck Perspective for Effective Noise Filtering on Retrieval-Augmented Generation. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL).