MSWA: Refining Local Attention with Multi-Scale Window Attention
Abstract
Transformer-based LLMs have achieved exceptional performance across a wide range of NLP tasks. However, the standard self-attention mechanism suffers from quadratic time complexity and a linearly growing KV cache. Sliding window attention (SWA) alleviates this problem by restricting the attention range to a fixed-size local context window. Nevertheless, SWA employs a uniform window size for every head in every layer, making it inefficient at capturing context of varying scales. To mitigate this limitation, we propose Multi-Scale Window Attention (MSWA), which applies diverse window sizes across heads and layers in the Transformer. It not only allows for different window sizes among heads within the same layer but also progressively increases the window size allocation from shallow to deep layers, thus enabling the model to capture contextual information of different lengths and distances. Experimental results on language modeling and common-sense reasoning tasks substantiate that MSWA outperforms traditional local attention in both effectiveness and efficiency.
1 Introduction

The popularity of Transformer-based Vaswani et al. (2017) large language models (LLMs) Touvron et al. (2023); Achiam et al. (2023) has surged due to their remarkable performance on a wide range of applications, including NLP tasks like machine translation Zhang et al. (2023a) and text summarization Zhang et al. (2024), as well as more complex uses such as coding assistance Ross et al. (2023) and communicative agents Li et al. (2023). However, the standard Transformer employs the self-attention mechanism, whose quadratic time complexity becomes a bottleneck for computational efficiency. Moreover, the KV cache required by the self-attention mechanism grows with each autoregressively generated token during inference, increasing GPU memory consumption and complicating the deployment of LLMs.
Recently, many architectures have been proposed with the aim of serving as efficient foundations for LLMs Gu and Dao (2023); Peng et al. (2023); Poli et al. (2023). One strand of research focuses on improving the efficiency of the attention mechanism, such as sparse attention Child et al. (2019), sliding window attention Beltagy et al. (2020); Zaheer et al. (2020), and linear attention Choromanski et al. (2020); Yang et al. (2023). While most studies focus on capturing the global information of a text sequence with linear computation time and limited memory, sliding window attention (SWA) offers a more intuitive approach: by focusing on local information, it serves as a valuable mechanism for building LLMs Jiang et al. (2023) or creating novel architectures De et al. (2024); Arora et al. (2024).
The key idea of SWA is to utilize the locality of reference Zaheer et al. (2020) in NLP data, where most information about a token can be derived from its neighboring tokens. By allowing each token to attend to its neighbors within a fixed-size local window, SWA ensures linear computational complexity and constant KV cache consumption. However, each head in every layer of the original SWA shares the same window size, ignoring the fact that the scale of contextual information can vary significantly. For instance, a news report can span up to 2000 tokens, while a keyword might consist of only 4 tokens. Setting the attention window to the same size might lead to sub-optimal adaptation to contexts of different scales. Additionally, different components of a Transformer model serve different roles. For example, shallower layers may exhibit more locality Child et al. (2019). Restricting all components to the same receptive field can severely impair the model’s representation capacity.
To address the aforementioned issues, we propose a novel window attention variant called Multi-Scale Window Attention (MSWA), which introduces diverse window sizes across heads and layers and improves the performance of SWA while reducing computation and memory cost. Specifically, we assign diverse window sizes to different heads within a layer to model contextual information at various lengths simultaneously. Moreover, we reduce the window size allocation for shallower layers and redistribute the resources to deeper layers, creating a pattern where shallow layers model local information and deep layers capture long-range dependencies. We further propose an optional integration of MSWA with other efficient methods like linear attention, creating a model that is both local-sensitive and global-aware. Implementing MSWA on standard attention acceleration libraries Dao (2023) achieves efficiency beyond SWA without extensive additional development.
To validate the effectiveness of MSWA, we conduct extensive experiments. We train models from scratch for language modeling in various scenarios, including directly applying MSWA to the Transformer and combining MSWA with linear attention. Experimental results on word-level and character-level datasets demonstrate the superior language modeling ability of MSWA. Moreover, we verify the compatibility of MSWA with LLMs by fine-tuning a pre-trained LLM to adapt to the MSWA pattern. Performance on downstream common-sense reasoning tasks confirms the practical value of MSWA. We also conduct a computational efficiency evaluation, where MSWA consistently achieves better efficiency than standard attention and SWA.
2 Related Works
In this section, we briefly introduce the studies of large language models and attention mechanisms.
2.1 Large Language Models (LLMs)
Language models Bengio et al. (2000) have become a cornerstone of modern Natural Language Processing (NLP). Their primary purpose is to understand and generate human language, making them crucial for applications ranging from machine translation Zhang et al. (2023a) to communicative agents Li et al. (2023). The advent of large-scale pre-trained models has significantly enhanced the performance of these applications.
Among them, Transformer-based models have revolutionized the field. Introduced by Vaswani et al. (2017), the Transformer architecture uses the self-attention mechanism to process input sequences in a more parallelizable way. This innovation has led to the development of increasingly large and powerful models, such as GPT-4 Achiam et al. (2023), Llama-3 AI@Meta (2024), and Claude-3 Anthropic (2024). These models, often termed "large language models", leverage vast amounts of data and computational resources to achieve state-of-the-art results on a wide array of NLP tasks.
2.2 Attention Mechanisms
The attention mechanism, which enables the model to capture intricate dependencies across the entire sequence, is at the heart of the Transformer's success. However, the standard self-attention mechanism has quadratic complexity with respect to the sequence length, which poses scalability challenges for longer sequences. To address this issue, many efficient attention variants have been proposed Qiu et al. (2019); Wang et al. (2020); Peng et al. (2021); Hua et al. (2022). For example, sliding window attention Beltagy et al. (2020); Zaheer et al. (2020); Jiang et al. (2023) limits attention to a fixed-size window around each token, making the computation more manageable for long texts. Linear attention methods Choromanski et al. (2020); Katharopoulos et al. (2020); Yang et al. (2023) approximate the attention calculation to reduce complexity from quadratic to linear. These innovations have made it possible to apply Transformers to lengthy documents without prohibitive computational costs.
3 Preliminaries
In this section, we briefly introduce the preliminaries about self-attention, sliding window attention and linear attention operations.
3.1 Self-Attention Mechanism
For Transformer models, the main computational and memory costs arise from the multi-head self-attention mechanism, specifically from two sources: (1) quadratic time complexity with respect to the input length, and (2) a KV cache whose size grows linearly during inference. In the following, we analyze the attention mechanism in its decoder form, as it is widely used in language models.
Given input vectors $\{x_1, \dots, x_n\} \subset \mathbb{R}^{d}$, where each $x_i$ represents a single input token, $n$ is the sequence length, and $d$ is the dimension of the input vectors, a head in the self-attention layer first maps the input tokens into query vectors $q_i$, key vectors $k_i$, and value vectors $v_i$:

$$q_i = x_i W_Q, \quad k_i = x_i W_K, \quad v_i = x_i W_V, \qquad (1)$$

where $W_Q$, $W_K$, $W_V \in \mathbb{R}^{d \times d_h}$ are the mapping matrices, and $d_h$ is the dimension of each head. The output of this attention head is calculated by:

$$o_i = \sum_{j=1}^{i} a_{i,j}\, v_j, \qquad (2)$$

$$a_{i,j} = \frac{\exp\!\left(q_i k_j^{\top} / \sqrt{d_h}\right)}{\sum_{t=1}^{i} \exp\!\left(q_i k_t^{\top} / \sqrt{d_h}\right)}. \qquad (3)$$

Performing the above computation for the entire sequence of length $n$ requires a time complexity of $O(n^2 d_h)$ and a space complexity of $O(n d_h)$ for the cached key and value vectors.
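To make Eqs. (1)–(3) concrete, the following is a minimal PyTorch sketch of a single causal self-attention head; the tensor names and shapes are illustrative only and do not come from any released implementation.

```python
import math
import torch

def causal_attention_head(x, W_Q, W_K, W_V):
    """Single-head causal self-attention following Eqs. (1)-(3).

    x:   (n, d)    input token vectors
    W_*: (d, d_h)  per-head projection matrices
    """
    q, k, v = x @ W_Q, x @ W_K, x @ W_V                     # Eq. (1)
    d_h = q.size(-1)
    scores = q @ k.transpose(-1, -2) / math.sqrt(d_h)
    # Causal mask: token i may only attend to positions j <= i.
    n = x.size(0)
    causal = torch.tril(torch.ones(n, n, dtype=torch.bool))
    scores = scores.masked_fill(~causal, float("-inf"))
    a = torch.softmax(scores, dim=-1)                       # Eqs. (2)-(3)
    return a @ v                                            # (n, d_h)

# Usage: n=16 tokens, d=32 model dim, d_h=8 head dim.
x = torch.randn(16, 32)
out = causal_attention_head(x, torch.randn(32, 8), torch.randn(32, 8), torch.randn(32, 8))
print(out.shape)  # torch.Size([16, 8])
```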
3.2 Sliding Window Attention
Sliding Window Attention (SWA) is an efficient variant that restricts each token to attend to tokens within a local window of size $w$, as shown in Fig. 1. Given the query, key and value vectors $q_i$, $k_i$, $v_i$, the output of the SWA head is defined as:

$$o_i = \sum_{j=\max(1,\, i-w+1)}^{i} a_{i,j}\, v_j, \qquad (4)$$

$$a_{i,j} = \frac{\exp\!\left(q_i k_j^{\top} / \sqrt{d_h}\right)}{\sum_{t=\max(1,\, i-w+1)}^{i} \exp\!\left(q_i k_t^{\top} / \sqrt{d_h}\right)}. \qquad (5)$$

By using this method, the time and space complexity required for each head is reduced to $O(n w d_h)$ and $O(w d_h)$, respectively. Considering a Transformer with $L$ layers, each equipped with $H$ attention heads, the overall time and space complexity become $O(L H n w d_h)$ and $O(L H w d_h)$, respectively. We can see that the computational and memory cost is proportional to $L \cdot H \cdot w$, which is the summation of the window sizes of all the sliding window attention operations from all heads in all layers.
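For reference, the banded causal mask implied by Eqs. (4)–(5) can be built directly in PyTorch; an efficient implementation would delegate to a windowed-attention kernel such as FlashAttention, but the masking logic is the same. A minimal sketch:

```python
import torch

def sliding_window_mask(n: int, w: int) -> torch.Tensor:
    """Boolean mask where entry (i, j) is True iff token i may attend to j,
    i.e. max(0, i - w + 1) <= j <= i, matching Eqs. (4)-(5)."""
    i = torch.arange(n).unsqueeze(1)
    j = torch.arange(n).unsqueeze(0)
    return (j <= i) & (j > i - w)

mask = sliding_window_mask(n=8, w=3)
print(mask.int())
# Row 5, for example, allows positions 3, 4, and 5 only.
```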
3.3 Linear Attention Mechanism
Linear attention replaces the softmax operation in standard attention with a feature-map-based dot product, eliminating the explicit computation of the $n \times n$ attention matrix and exchanging the matrix multiplication order, thereby achieving acceleration and constant memory cost. Specifically, given a kernel function $\phi(\cdot)$ that maps the $q$ and $k$ vectors into features, linear attention approximates $\exp(q_i k_j^{\top})$ with the dot product $\phi(q_i)\,\phi(k_j)^{\top}$. Therefore, the output of an attention head is calculated by:

$$o_i = \frac{\sum_{j=1}^{i} \phi(q_i)\,\phi(k_j)^{\top} v_j}{\sum_{t=1}^{i} \phi(q_i)\,\phi(k_t)^{\top}}, \qquad (6)$$

$$o_i = \frac{\phi(q_i)\, S_i}{\phi(q_i)\, z_i}, \quad \text{where } S_i = \sum_{j=1}^{i} \phi(k_j)^{\top} v_j, \;\; z_i = \sum_{j=1}^{i} \phi(k_j)^{\top}. \qquad (7)$$

Based on the exchange of multiplication order, the computational time complexity of linear attention is reduced to $O(n\, d_f d_h)$, where $d_f$ is the feature dimension. In addition, during the auto-regressive inference process, both $S_i$ and $z_i$ can be maintained as continuously accumulated variables, requiring only $O(d_f d_h)$ space complexity.
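Below is a minimal sketch of the recurrent view of Eq. (7). It assumes a simple 1 + ELU feature map purely for illustration (the experiments in Sec. 5.1.2 use a 2nd-order Taylor series map instead); the accumulated states S and z are what keep the memory cost constant during autoregressive decoding.

```python
import torch
import torch.nn.functional as F

def phi(x):
    # Illustrative kernel feature map (1 + ELU); not the paper's choice.
    return F.elu(x) + 1.0

def linear_attention_decode(q, k, v):
    """q, k: (n, d_f) pre-feature-map vectors, v: (n, d_h).
    Produces outputs token by token with O(d_f * d_h) state (Eq. (7))."""
    n, d_h = v.shape
    d_f = q.shape[-1]
    S = torch.zeros(d_f, d_h)   # accumulates phi(k_j)^T v_j
    z = torch.zeros(d_f)        # accumulates phi(k_j)
    outputs = []
    for i in range(n):
        fq, fk = phi(q[i]), phi(k[i])
        S = S + fk.unsqueeze(1) * v[i].unsqueeze(0)
        z = z + fk
        outputs.append((fq @ S) / (fq @ z + 1e-6))
    return torch.stack(outputs)

out = linear_attention_decode(torch.randn(10, 16), torch.randn(10, 16), torch.randn(10, 8))
print(out.shape)  # torch.Size([10, 8])
```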
4 Multi-Scale Window Attention
In this section, we present our proposed Multi-Scale Window Attention (MSWA) mechanism, which leverages diverse window sizes across different heads and layers in the Transformer architecture, as illustrated in Fig. 1. Our objective in designing this mechanism is to enhance the performance of SWA while staying within its computational and memory budget. Recall that the computational complexity and memory consumption required by SWA depend on the sum of the window sizes of all heads in all layers. On this basis, we not only change the distribution of window size allocation among different heads within the same layer, as introduced in Sec. 4.1, but also adjust the distribution of window size allocation between layers, as detailed in Sec. 4.2. The integration of these changes between heads and layers constitutes our MSWA mechanism, which is introduced in Sec. 4.3. Additionally, in Sec. 4.4, we provide an optional combination of the MSWA mechanism with the linear attention mechanism. Implementation of MSWA can be found in Appendix A.
4.1 Diverse Window Across Heads
This section focuses on dynamically changing the window size for each attention head within a layer. We refer to this mechanism as MSWA-h, as shown in the bottom right part of Fig. 1.
Different from SWA, where all heads use the same window size $w$ in the $l$-th layer, in MSWA-h different heads have different scales of window sizes, and the summation of the total window sizes within a layer is less than that in SWA, which is $H \cdot w$ for $H$ heads. Specifically, inspired by hierarchical architecture designs in the CV field Liu et al. (2021), we divide the attention heads into four groups and adjust the receptive field range with a $2\times$ change between adjacent groups, resulting in window sizes of $w/4$, $w/2$, $w$, and $2w$, respectively. Therefore, the summation of the total window sizes is:

$$\sum_{h=1}^{H} w_h = \frac{H}{4}\left(\frac{w}{4} + \frac{w}{2} + w + 2w\right) = \frac{15}{16}\, H w < H w. \qquad (8)$$
Leveraging diverse window sizes among the heads within a layer allows the Transformer model to capture relevant context at different scales simultaneously. This is because the outputs of different heads within an attention layer are concatenated together and then mapped through an output projection matrix to form the final output of the layer, which allows contextual information at different distances to be integrated. Additionally, consider the allocation of attention resources: all heads attend to tokens within a distance of $w/4$ from the current token, while $3/4$ of the heads attend to tokens in the $w/4$ to $w/2$ range, and so on. This implicitly models a long window with weighted emphasis, where the distribution of attention resources gradually decreases from near to far, aligning with the locality-of-reference characteristic of text.
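A minimal PyTorch sketch of the head-wise allocation, assuming the four head groups use windows w/4, w/2, w, and 2w as in Eq. (8); it builds one banded mask per head and applies them in a single batched attention call. A production implementation would instead call a windowed-attention kernel per group (see Appendix A.1).

```python
import math
import torch

def mswa_h_masks(n: int, num_heads: int, w: int) -> torch.Tensor:
    """Per-head sliding-window masks of shape (num_heads, n, n).
    Heads are split into four equal groups with windows w/4, w/2, w, 2w."""
    group_windows = [w // 4, w // 2, w, 2 * w]
    heads_per_group = num_heads // 4
    i = torch.arange(n).unsqueeze(1)
    j = torch.arange(n).unsqueeze(0)
    masks = []
    for gw in group_windows:
        band = (j <= i) & (j > i - gw)
        masks.extend([band] * heads_per_group)
    return torch.stack(masks)

def mswa_h_attention(q, k, v, w):
    """q, k, v: (num_heads, n, d_h). Masked softmax attention, one mask per head."""
    num_heads, n, d_h = q.shape
    scores = q @ k.transpose(-1, -2) / math.sqrt(d_h)
    scores = scores.masked_fill(~mswa_h_masks(n, num_heads, w), float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

out = mswa_h_attention(torch.randn(8, 64, 16), torch.randn(8, 64, 16),
                       torch.randn(8, 64, 16), w=16)
print(out.shape)  # torch.Size([8, 64, 16])
```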
4.2 Diverse Window Across Layers
This section further introduces changing the allocation ratio of the attention window sizes between layers. We refer to this mechanism as MSWA-l, as illustrated in the upper right part of Fig. 1.
To explain more clearly, we still use SWA as a comparison. In SWA, each attention layer has a total window size allocation of $H \cdot w$, where $H$ is the number of heads per layer and $w$ is the base window size; that is, for any layer index $l$, $\sum_{h=1}^{H} w_h^{(l)} = H \cdot w$. In MSWA-l, the window size allocation varies across layers. More specifically, we divide all attention layers into several groups, and from shallow to deep, we continuously increase the total window size allocated to the attention layers in each group. We adopt a similar setup to MSWA-h, with four groups and a $2\times$ change between adjacent groups, resulting in window size allocations of $Hw/4$, $Hw/2$, $Hw$, and $2Hw$, respectively. The total window size resource allocated to all layers in MSWA-l is:

$$\sum_{l=1}^{L} \sum_{h=1}^{H} w_h^{(l)} = \frac{L}{4}\left(\frac{Hw}{4} + \frac{Hw}{2} + Hw + 2Hw\right) = \frac{15}{16}\, L H w < L H w. \qquad (9)$$
For an attention layer, a larger window size allocation means that the window sizes of the heads in that layer are generally larger, allowing for the perception of a broader range of context. Therefore, gradually increasing the window size allocation from shallow to deep layers enables the model to focus on building local fine-grained information in the initial stages and progressively enhance the capture of long-distance relationships in the later stages. Additionally, the gradually expanding attention window allocation enables the model to continuously integrate local information from previous stages based on a larger receptive field.
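The layer-wise schedule reduces to a lookup from layer index to base window size; the sketch below assumes the four layer groups use w/4, w/2, w, and 2w as in Eq. (9).

```python
def layer_base_window(layer_idx: int, num_layers: int, w: int) -> int:
    """Base window size for a layer under MSWA-l: layers are split into four
    equal groups from shallow to deep with base sizes w/4, w/2, w, 2w."""
    group_sizes = [w // 4, w // 2, w, 2 * w]
    group = min(layer_idx * 4 // num_layers, 3)
    return group_sizes[group]

# For a 12-layer model with base window w = 128:
print([layer_base_window(l, 12, 128) for l in range(12)])
# [32, 32, 32, 64, 64, 64, 128, 128, 128, 256, 256, 256]
```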
4.3 Integrate Diversity of Heads and Layers
Table 1: Per-head window sizes $w_h^{(l)}$ in MSWA for a base window size $w$ (layer groups ordered from shallow to deep, head groups within each layer).
| Layer group / Head group | Head group 1 | Head group 2 | Head group 3 | Head group 4 |
| Layer group 1 (shallowest) | $w/16$ | $w/8$ | $w/4$ | $w/2$ |
| Layer group 2 | $w/8$ | $w/4$ | $w/2$ | $w$ |
| Layer group 3 | $w/4$ | $w/2$ | $w$ | $2w$ |
| Layer group 4 (deepest) | $w/2$ | $w$ | $2w$ | $4w$ |
In this section, we aim to integrate the two strategies MSWA-h and MSWA-l introduced earlier to construct the final MSWA mechanism.
The description of MSWA starts with a base window size $w$. In SWA, $w$ is the window size used for all heads across all layers. In contrast, in MSWA, $w$ serves as the basis for window size variation. We denote the base size value of the $l$-th layer as $w^{(l)}$ and the actual window size of the $h$-th head in the $l$-th layer as $w_h^{(l)}$. First, we evenly divide all attention layers into four groups. Depending on the group a layer belongs to, the values of $w^{(l)}$ from shallow to deep are $w/4$, $w/2$, $w$, and $2w$, respectively. Further, within each layer $l$, we divide all heads into four groups, each with a different window size, namely $w^{(l)}/4$, $w^{(l)}/2$, $w^{(l)}$, and $2w^{(l)}$. Thus, the window size allocation for the entire Transformer model is:

$$\sum_{l=1}^{L} \sum_{h=1}^{H} w_h^{(l)} = \sum_{l=1}^{L} \frac{H}{4}\left(\frac{w^{(l)}}{4} + \frac{w^{(l)}}{2} + w^{(l)} + 2 w^{(l)}\right) = \frac{15}{16}\, H \sum_{l=1}^{L} w^{(l)}, \qquad (10)$$

which can be further derived as:

$$\frac{15}{16}\, H \cdot \frac{L}{4}\left(\frac{w}{4} + \frac{w}{2} + w + 2w\right) = \left(\frac{15}{16}\right)^{2} L H w = \frac{225}{256}\, L H w. \qquad (11)$$
The aforementioned window variation method is demonstrated in Tab. 1. Note that MSWA can benefit from the advantages of both MSWA-h and MSWA-l. When computing within a single layer, it can capture both long-range and short-range contextual information at the same time, and allocate different attention resources to information at various distances. When transitioning the information from one layer to another, MSWA continuously enhances the overall perception scope, integrating previous local information into a broader synthesis.
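Combining both schedules, the window size of every head can be precomputed once at model construction. The helper below reproduces the allocation summarized in Tab. 1; the group ratios follow the description above and should be treated as illustrative rather than as the authors' released configuration.

```python
def mswa_window_sizes(num_layers: int, num_heads: int, w: int):
    """Return a (num_layers x num_heads) nested list of window sizes w_h^(l)."""
    ratios = [0.25, 0.5, 1.0, 2.0]          # 2x change between adjacent groups
    sizes = []
    for l in range(num_layers):
        base = int(w * ratios[min(l * 4 // num_layers, 3)])            # layer group
        row = [int(base * ratios[min(h * 4 // num_heads, 3)])          # head group
               for h in range(num_heads)]
        sizes.append(row)
    return sizes

table = mswa_window_sizes(num_layers=12, num_heads=8, w=128)
print(table[0])    # shallowest layers: [8, 8, 16, 16, 32, 32, 64, 64]
print(table[-1])   # deepest layers:    [64, 64, 128, 128, 256, 256, 512, 512]
```

With w = 128 the smallest and largest windows are 8 and 512, matching the "from 8 to 512" setting reported in the experiments.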
4.4 Combination with Linear Attention

This section further proposes combining MSWA with an efficient global attention mechanism, i.e., linear attention, as illustrated in Fig. 2.
As we introduced earlier, many efficient mechanisms focus on capturing global information with limited resources. Therefore, they often fail to allocate high importance to relevant local information. Linear attention is a typical example of this issue. As introduced in Sec. 3.3, it stores global sequence information in the fixed-size accumulated variables $S_i$ and $z_i$, which can lead to a loss of attention focus Qin et al. (2022).
Therefore, we propose to combine MSWA with linear attention to compensate for this shortcoming and achieve a balance between efficiency and performance. Specifically, we alternately stack MSWA layers and linear attention layers. For example, the layers of a combined model are evenly divided into four groups, each containing one linear attention layer and two MSWA layers stacked together. For all MSWA layers in the entire model, we treat them as a whole and apply the same window size variation method introduced in Sec. 4.3 to adjust the window sizes across layers and heads.
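A structural sketch of the combined stack described above, reusing the mswa_window_sizes helper from the Sec. 4.3 sketch; LinearAttentionLayer and MSWALayer are placeholder module names, not classes from any released codebase.

```python
import torch.nn as nn

class LinearAttentionLayer(nn.Module):
    """Placeholder for a linear-attention block (Sec. 3.3)."""
    def __init__(self):
        super().__init__()

class MSWALayer(nn.Module):
    """Placeholder for an MSWA block; here it only records its per-head windows."""
    def __init__(self, window_sizes):
        super().__init__()
        self.window_sizes = window_sizes

def build_combined_stack(num_layers=12, num_heads=8, w=128):
    # Groups of three layers: one linear attention layer + two MSWA layers.
    # All MSWA layers share one global window-size schedule (Sec. 4.3),
    # computed here over the num_layers * 2/3 MSWA layers in the model.
    schedule = mswa_window_sizes(num_layers * 2 // 3, num_heads, w)
    layers, idx = [], 0
    for _ in range(num_layers // 3):
        layers.append(LinearAttentionLayer())
        for _ in range(2):
            layers.append(MSWALayer(schedule[idx]))
            idx += 1
    return nn.ModuleList(layers)
```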
5 Experiments
Table 2: Language modeling results of models trained from scratch with different attention mechanisms (Sec. 5.1.1). Length settings for standard self-attention are sequence lengths; for local attention they are (base) window sizes.
| Attention Mechanism | Length Setting | Relative Cost | Wikitext-103 PPL | enwik8 bpc |
| Standard Self-Attention | sequence length 1,024 | 9.10 | 28.61 | 1.12 |
| Standard Self-Attention | sequence length 2,048 | 18.20 | 28.33 | 1.10 |
| Local Attention (SWA) | window size 128 | 1.14 | 30.70 | 1.22 |
| Local Attention (MSWA-h) | windows from 32 to 256 | 1.07 | 29.96 | 1.16 |
| Local Attention (MSWA-l) | windows from 32 to 256 | 1.07 | 30.19 | 1.16 |
| Local Attention (MSWA) | windows from 8 to 512 | 1.00 | 29.56 | 1.11 |
This section demonstrates the effectiveness of our MSWA mechanism. An overview of the experiments, the main datasets, and the baselines are described below. More experimental details, including the implementation dependencies and the detailed setup for each experiment, are provided in Appendix B.
Overview.
Sec. 5.1 presents language modeling evaluation on natural language datasets in two different scenarios. In Sec. 5.1.1, we directly apply MSWA and its sub-mechanisms to the Transformer model. In Sec. 5.1.2, we combine the MSWA mechanism with the linear attention mechanism. Sec. 5.2 verifies the compatibility of MSWA with existing LLMs on downstream tasks. We fine-tune the pre-trained Llama2-7B Touvron et al. (2023) model to adapt to new attention patterns, followed by few-shot evaluations on a series of common-sense reasoning tasks. Additionally, in Sec. 5.3, we compare the computational efficiency of MSWA with other attention mechanisms. Sec. 5.4 provides a series of ablation experiments.
Datasets.
We use both word-level and character-level natural language datasets for language modeling evaluation, specifically Wikitext-103 Merity et al. (2016) and enwik8 Mahoney (2009). Wikitext-103 is a word-level language modeling benchmark containing over 100M tokens, while enwik8 is a character-level dataset consisting of 100M bytes, both originally sourced from Wikipedia text. For the evaluation on downstream tasks, we use the RedPajama Computer (2023) dataset for fine-tuning and perform downstream few-shot evaluation on eight common-sense reasoning benchmarks: PIQA Bisk et al. (2020), OpenBookQA Mihaylov et al. (2018), WinoGrande Sakaguchi et al. (2021), HellaSwag Zellers et al. (2019), BoolQ Clark et al. (2019), COPA Roemmele et al. (2011), ARC easy and challenge Clark et al. (2018).
Baselines.
We mainly compare MSWA with two baseline methods: 1) Standard Self-Attention: we use the standard self-attention mechanism as a strong baseline. As introduced in Sec. 3.1, it can attend to all tokens in the whole sequence, achieving excellent performance at the cost of quadratic time and linear space complexity. 2) Sliding Window Attention: SWA is the most widely used variant of local attention, with applications including the direct construction of LLMs Jiang et al. (2023) and integration into global architectures De et al. (2024); Arora et al. (2024). As described in Sec. 3.2, it only attends to tokens within a fixed-size window, thereby saving time and space costs.
5.1 Language Modeling Evaluation
In this section, we evaluate the language modeling capabilities of MSWA mechanism by training models from scratch on Wikitext-103 and enwik8.
Architecture | Relative Cost | Wikitext-103 PPL | enwik8 bpc |
Transformer | 18.20 | 29.43 | 1.15 |
Linear Attention | 0.69 | 40.57 | 1.29 |
Linear Attention + SWA | 0.98 | 31.85 | 1.16 |
Linear Attention + MSWA | 0.89 | 30.83 | 1.13 |
MSWA | 1.00 | 30.38 | 1.12 |
Attention | PIQA | OBQA | WinoGrande | HellaSwag | BoolQ | COPA | ARC-e | ARC-c | Average |
Performance under 3-Shot setting | |||||||||
SWA | 57.56 | 28.00 | 54.93 | 45.21 | 68.07 | 61.00 | 36.24 | 25.94 | 47.12 |
MSWA | 66.10 | 24.60 | 51.14 | 55.44 | 67.52 | 57.00 | 42.38 | 27.99 | 49.02 |
Performance under 5-Shot setting | |||||||||
SWA | 56.64 | 29.40 | 50.59 | 44.40 | 55.87 | 48.00 | 32.79 | 23.63 | 42.66 |
MSWA | 67.03 | 28.80 | 51.85 | 61.78 | 56.27 | 56.00 | 46.93 | 30.46 | 49.89 |
5.1.1 Direct Construction of Language Model
This section presents the results of directly using MSWA as the Transformer backbone, which is a straightforward way to validate its performance.
As shown in Tab. 2, we report perplexity (PPL) results on the Wikitext-103 test set and bits-per-character (bpc) results on the enwik8 test set. The experimental results demonstrate that: 1) MSWA achieves better language modeling performance than SWA at a smaller computational and memory cost, reducing PPL by 1.14 on Wikitext-103 and bpc by 0.11 on enwik8. 2) Dynamically adjusting the window size from either the layer or the head perspective improves language modeling capability, and combining both approaches yields further gains. 3) Although there is still a performance gap between local attention and standard self-attention, MSWA achieves results close to those of standard attention. For example, on enwik8, MSWA obtains a bpc that is 0.01 lower than standard attention with a sequence length of 1,024 and 0.01 higher than with a sequence length of 2,048, while requiring significantly fewer resources than both.
5.1.2 Combination with Linear Attention
This section demonstrates the combination of MSWA and Linear Attention mechanism, achieving language modeling capabilities comparable to the standard Transformer in a more efficient way. We use the 2nd-order Taylor series feature map Zhang et al. (2023b); Arora et al. (2024) as the kernel function for linear attention.
The experimental results are shown in Tab. 3. We can conclude that: 1) Combining MSWA with linear attention achieves performance comparable to the standard Transformer. Specifically, on Wikitext-103 the combined model achieves a PPL that is only 1.4 higher than the Transformer, while on enwik8 it achieves a 0.02 lower bpc than the Transformer. 2) The performance of linear attention is greatly improved when combined with either SWA or MSWA. Among them, combining with MSWA yields better language modeling performance, providing direction for future research. 3) Compared to directly using MSWA, combining MSWA with linear attention achieves a balance between performance and efficiency, enhancing efficiency with minimal loss in performance.
5.2 Evaluation on Downstream Tasks



In this section, we evaluate the performance of Llama2-7B after fine-tuning with local attention patterns and testing on downstream common-sense reasoning tasks. The purpose of this evaluation is to verify the compatibility of MSWA with current pre-trained LLMs and its effectiveness when scaled to a large number of model parameters.
Tab. 4 presents the accuracy results of the models on various downstream benchmarks using 3-shot and 5-shot settings. The experimental results indicate that: 1) The MSWA mechanism demonstrates better common-sense reasoning ability than traditional sliding window attention, with average accuracy improvements of 1.90 and 7.23 points in the 3-shot and 5-shot scenarios, respectively. 2) The MSWA mechanism shows a stronger ability to adapt to different context lengths, with its performance remaining stable across varying shot numbers, whereas SWA's average accuracy decreases by 4.46 as the number of shots increases.
5.3 Computational Efficiency Evaluation
This section further evaluates the actual computational efficiency of our MSWA mechanism. Unlike the size of the KV cache, which can be inferred analytically, the computational speed of Transformer models needs to be measured in practice to obtain realistic results. Therefore, we measure the time required to predict the next token during forward propagation in the inference process for various attention mechanisms. We utilize FlashAttention Dao (2023) for the computation of all attention mechanisms.
The computational efficiency is shown in Fig. 3. From the experimental results, we can observe the following: 1) Both SWA and MSWA have a significant efficiency advantage over the standard self-attention mechanism. This advantage becomes more pronounced as the batch size increases. 2) Compared to traditional SWA, MSWA adapts better to larger batch sizes, often achieving greater speedups as the batch size increases. 3) For larger base window sizes, the efficiency advantage of MSWA becomes even more apparent, making it well-suited for scaling up context lengths.
5.4 Ablation Study
We conduct a series of ablation studies in this section, mainly focusing on the impact of the base window size on the MSWA mechanism, as well as the effects of other window variation strategies. In these experiments, we use the same setup as in Sec. 5.1.1, altering only the window sizes and the variation method.
5.4.1 Effect of Base Window Size
The impact of the base window size $w$ on the MSWA mechanism is shown in Tab. 5, where for each $w$, the window size variation method for MSWA is the one introduced in Sec. 4.3. It can be observed that: 1) In each case from $w=64$ to $w=512$, MSWA achieves better results than traditional local attention (SWA). 2) MSWA can achieve better performance than traditional local attention with less than half the resource consumption. For example, SWA with a window size of 512 achieves a PPL of 29.20 on Wikitext-103, while MSWA evolved from a base window size of 256 achieves a PPL of 28.92.
Table 5: Effect of the base window size on Wikitext-103.
| Attention | Length Setting | Relative Cost | Wikitext-103 PPL |
| SWA | window size 512 | 4.55 | 29.20 |
| MSWA | windows from 32 to 2048 | 4.00 | 28.67 |
| SWA | window size 256 | 2.28 | 29.93 |
| MSWA | windows from 16 to 1024 | 2.00 | 28.92 |
| SWA | window size 128 | 1.14 | 30.70 |
| MSWA | windows from 8 to 512 | 1.00 | 29.56 |
| SWA | window size 64 | 0.57 | 31.90 |
| MSWA | windows from 4 to 256 | 0.50 | 30.35 |
5.4.2 Effect of Window Variation Strategy
The comparative results with other window variation strategies are shown in Tab. 6. We consider two alternatives. In the first, to test the effectiveness of our layer-wise allocation, where lower layers model local information and higher layers capture long-range information, we reverse the original window size allocation between layers, i.e., the window size decreases from shallow to deep layers. In the second, we change the window size variation between groups from multiplying by 2 each time to an arithmetic progression; for example, for a base window size of 128, the group window sizes become {64, 96, 128, 160}. The experimental results demonstrate that the performance achieved by both variation strategies is slightly weaker than that of the method we introduced previously.
Variation Strategy | Wikitext-103 PPL |
Ours | 29.56 |
Decreasing for Deeper Layer | 30.46 |
Arithmetic Progression | 29.90 |
6 Conclusion
We propose a novel window attention variant called Multi-Scale Window Attention (MSWA), which leverages diverse window sizes for different heads in different layers. Compared to traditional sliding window attention, which is inefficient at capturing context of varying scales, MSWA enables the model to capture contextual information of varying lengths and distances with less computation and memory usage. Experimental results on language modeling and common-sense reasoning tasks demonstrate that MSWA outperforms previous local attention mechanisms while obtaining better efficiency.
References
- Achiam et al. (2023) Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774.
- AI@Meta (2024) AI@Meta. 2024. Llama 3 model card.
- Anthropic (2024) AI Anthropic. 2024. The claude 3 model family: Opus, sonnet, haiku. Claude-3 Model Card.
- Arora et al. (2024) Simran Arora, Sabri Eyuboglu, Michael Zhang, Aman Timalsina, Silas Alberti, Dylan Zinsley, James Zou, Atri Rudra, and Christopher Ré. 2024. Simple linear attention language models balance the recall-throughput tradeoff. arXiv preprint arXiv:2402.18668.
- Beltagy et al. (2020) Iz Beltagy, Matthew E Peters, and Arman Cohan. 2020. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150.
- Bengio et al. (2000) Yoshua Bengio, Réjean Ducharme, and Pascal Vincent. 2000. A neural probabilistic language model. Advances in neural information processing systems, 13.
- Bisk et al. (2020) Yonatan Bisk, Rowan Zellers, Jianfeng Gao, Yejin Choi, et al. 2020. Piqa: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pages 7432–7439.
- Chen et al. (2023) Yukang Chen, Shengju Qian, Haotian Tang, Xin Lai, Zhijian Liu, Song Han, and Jiaya Jia. 2023. Longlora: Efficient fine-tuning of long-context large language models. In The Twelfth International Conference on Learning Representations.
- Child et al. (2019) Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. 2019. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509.
- Choromanski et al. (2020) Krzysztof Marcin Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Quincy Davis, Afroz Mohiuddin, Lukasz Kaiser, et al. 2020. Rethinking attention with performers. In International Conference on Learning Representations.
- Clark et al. (2019) Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. 2019. Boolq: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2924–2936.
- Clark et al. (2018) Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457.
- Computer (2023) Together Computer. 2023. Redpajama: An open source recipe to reproduce llama training dataset.
- Dai et al. (2019) Zihang Dai, Zhilin Yang, Yiming Yang, Jaime G Carbonell, Quoc Le, and Ruslan Salakhutdinov. 2019. Transformer-xl: Attentive language models beyond a fixed-length context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2978–2988.
- Dao (2023) Tri Dao. 2023. Flashattention-2: Faster attention with better parallelism and work partitioning. In The Twelfth International Conference on Learning Representations.
- De et al. (2024) Soham De, Samuel L Smith, Anushan Fernando, Aleksandar Botev, George Cristian-Muraru, Albert Gu, Ruba Haroun, Leonard Berrada, Yutian Chen, Srivatsan Srinivasan, et al. 2024. Griffin: Mixing gated linear recurrences with local attention for efficient language models. arXiv preprint arXiv:2402.19427.
- Gao et al. (2023) Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. 2023. A framework for few-shot language model evaluation.
- Gu and Dao (2023) Albert Gu and Tri Dao. 2023. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752.
- Hu et al. (2021) Edward J Hu, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. 2021. Lora: Low-rank adaptation of large language models. In International Conference on Learning Representations.
- Hua et al. (2022) Weizhe Hua, Zihang Dai, Hanxiao Liu, and Quoc Le. 2022. Transformer quality in linear time. In International conference on machine learning, pages 9099–9117. PMLR.
- Jiang et al. (2023) Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. 2023. Mistral 7b. arXiv preprint arXiv:2310.06825.
- Katharopoulos et al. (2020) Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. 2020. Transformers are rnns: Fast autoregressive transformers with linear attention. In International conference on machine learning, pages 5156–5165. PMLR.
- Lefaudeux et al. (2022) Benjamin Lefaudeux, Francisco Massa, Diana Liskovich, Wenhan Xiong, Vittorio Caggiano, Sean Naren, Min Xu, Jieru Hu, Marta Tintore, Susan Zhang, Patrick Labatut, Daniel Haziza, Luca Wehrstedt, Jeremy Reizenstein, and Grigory Sizov. 2022. xformers: A modular and hackable transformer modelling library. https://github.com/facebookresearch/xformers.
- Li et al. (2023) Guohao Li, Hasan Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. 2023. Camel: Communicative agents for "mind" exploration of large language model society. Advances in Neural Information Processing Systems, 36:51991–52008.
- Liu et al. (2021) Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. 2021. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, pages 10012–10022.
- Loshchilov and Hutter (2018) Ilya Loshchilov and Frank Hutter. 2018. Decoupled weight decay regularization. In International Conference on Learning Representations.
- Mahoney (2009) Matt Mahoney. 2009. Large text compression benchmark.
- Merity et al. (2016) Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2016. Pointer sentinel mixture models. In International Conference on Learning Representations.
- Mihaylov et al. (2018) Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. 2018. Can a suit of armor conduct electricity? a new dataset for open book question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2381–2391.
- Ott et al. (2019) Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. Fairseq: A fast, extensible toolkit for sequence modeling. NAACL HLT 2019, page 48.
- Paszke et al. (2019) Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. 2019. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32.
- Peng et al. (2023) Bo Peng, Eric Alcaide, Quentin Anthony, Alon Albalak, Samuel Arcadinho, Stella Biderman, Huanqi Cao, Xin Cheng, Michael Chung, Leon Derczynski, et al. 2023. Rwkv: Reinventing rnns for the transformer era. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 14048–14077.
- Peng et al. (2021) Hao Peng, Nikolaos Pappas, Dani Yogatama, Roy Schwartz, Noah A Smith, and Lingpeng Kong. 2021. Random feature attention. arXiv preprint arXiv:2103.02143.
- Poli et al. (2023) Michael Poli, Stefano Massaroli, Eric Nguyen, Daniel Y Fu, Tri Dao, Stephen Baccus, Yoshua Bengio, Stefano Ermon, and Christopher Ré. 2023. Hyena hierarchy: Towards larger convolutional language models. In International Conference on Machine Learning, pages 28043–28078. PMLR.
- Qin et al. (2022) Zhen Qin, Xiaodong Han, Weixuan Sun, Dongxu Li, Lingpeng Kong, Nick Barnes, and Yiran Zhong. 2022. The devil in linear transformer. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 7025–7041.
- Qiu et al. (2019) Jiezhong Qiu, Hao Ma, Omer Levy, Scott Wen-tau Yih, Sinong Wang, and Jie Tang. 2019. Blockwise self-attention for long document understanding. arXiv preprint arXiv:1911.02972.
- Rasley et al. (2020) Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. 2020. Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 3505–3506.
- Roemmele et al. (2011) Melissa Roemmele, Cosmin Adrian Bejan, and Andrew S Gordon. 2011. Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In 2011 AAAI Spring Symposium Series.
- Ross et al. (2023) Steven I Ross, Fernando Martinez, Stephanie Houde, Michael Muller, and Justin D Weisz. 2023. The programmer’s assistant: Conversational interaction with a large language model for software development. In Proceedings of the 28th International Conference on Intelligent User Interfaces, pages 491–514.
- Sakaguchi et al. (2021) Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. 2021. Winogrande: An adversarial winograd schema challenge at scale. Communications of the ACM, 64(9):99–106.
- Su et al. (2021) Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. 2021. Roformer: Enhanced transformer with rotary position embedding. arXiv e-prints, pages arXiv–2104.
- Touvron et al. (2023) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems, 30.
- Wang et al. (2020) Sinong Wang, Belinda Z Li, Madian Khabsa, Han Fang, and Hao Ma. 2020. Linformer: Self-attention with linear complexity. arXiv preprint arXiv:2006.04768.
- Yang et al. (2023) Songlin Yang, Bailin Wang, Yikang Shen, Rameswar Panda, and Yoon Kim. 2023. Gated linear attention transformers with hardware-efficient training. arXiv preprint arXiv:2312.06635.
- Zaheer et al. (2020) Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, et al. 2020. Big bird: Transformers for longer sequences. Advances in neural information processing systems, 33:17283–17297.
- Zellers et al. (2019) Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. Hellaswag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830.
- Zhang et al. (2023a) Biao Zhang, Barry Haddow, and Alexandra Birch. 2023a. Prompting large language model for machine translation: A case study. In International Conference on Machine Learning, pages 41092–41110. PMLR.
- Zhang et al. (2023b) Michael Zhang, Kush Bhatia, Hermann Kumbong, and Christopher Re. 2023b. The hedgehog & the porcupine: Expressive linear attentions with softmax mimicry. In The Twelfth International Conference on Learning Representations.
- Zhang et al. (2024) Tianyi Zhang, Faisal Ladhak, Esin Durmus, Percy Liang, Kathleen McKeown, and Tatsunori B Hashimoto. 2024. Benchmarking large language models for news summarization. Transactions of the Association for Computational Linguistics, 12:39–57.
Appendix A Implementation of MSWA
Here, we provide the specific implementation of MSWA mechanism. Essentially, MSWA is a combination of two sub-mechanisms: MSWA-h and MSWA-l. Their implementations in the program are independent of each other.
A.1 Implementation of MSWA-h
As for the implementation of MSWA-h, the overall flow is consistent with a standard attention layer in the Transformer, except that the multi-head attention is computed in groups. Since different groups use different window sizes, we first use the reshape function from PyTorch Paszke et al. (2019) to divide the $q$, $k$, and $v$ vectors of all heads into different groups. For heads within the same group, we use efficient existing SWA implementations (e.g., FlashAttention Dao (2023), xFormers Lefaudeux et al. (2022)) for parallel computation. Calculations for different groups are carried out separately. After completing the attention computation of all groups, we use the cat function of PyTorch to concatenate the attention outputs of each group along the head dimension, and then project them onto the final output of the attention layer through the output projection matrix. Therefore, we can implement the MSWA-h mechanism without much additional development.
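A minimal sketch of this grouped computation, using PyTorch 2.x's scaled_dot_product_attention with explicit banded masks in place of a dedicated SWA kernel; with FlashAttention or xFormers, the per-group call would be replaced by the corresponding windowed-attention primitive.

```python
import torch
import torch.nn.functional as F

def mswa_h_layer(q, k, v, group_windows, W_O):
    """q, k, v: (batch, num_heads, n, d_h); group_windows: one window per head group.
    Heads are sliced into groups, each group is attended with its own banded causal
    mask, and the outputs are concatenated and projected."""
    b, H, n, d_h = q.shape
    hpg = H // len(group_windows)                  # heads per group
    i = torch.arange(n).unsqueeze(1)
    j = torch.arange(n).unsqueeze(0)
    outs = []
    for gi, w in enumerate(group_windows):
        sl = slice(gi * hpg, (gi + 1) * hpg)
        band = (j <= i) & (j > i - w)              # sliding-window causal mask
        outs.append(F.scaled_dot_product_attention(
            q[:, sl], k[:, sl], v[:, sl], attn_mask=band))
    o = torch.cat(outs, dim=1)                     # concatenate along the head dim
    o = o.transpose(1, 2).reshape(b, n, H * d_h)   # (batch, n, H*d_h)
    return o @ W_O                                 # final output projection

out = mswa_h_layer(torch.randn(2, 8, 64, 16), torch.randn(2, 8, 64, 16),
                   torch.randn(2, 8, 64, 16), [4, 8, 16, 32], torch.randn(128, 128))
print(out.shape)  # torch.Size([2, 64, 128])
```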
A.2 Implementation of MSWA-l
Regarding the implementation of MSWA-l, it is simpler than MSWA-h because the layers of a Transformer model are stacked, and the computations of different layers are sequential and independent of one another. We only need to assign the window size allocation of each layer as a parameter and pass it to the initialization function of each layer object.
Appendix B Experimental Details
B.1 Dependencies
For all the methods in our experiments, we implement them using the PyTorch Paszke et al. (2019) and FlashAttention Dao (2023) libraries. Additionally, the training and test process for the experiments in Sec. 5.1 is based on Fairseq Ott et al. (2019), while for the experiments in Sec. 5.2, the fine-tuning process is based on DeepSpeed Rasley et al. (2020) and the evaluation process is implemented using lm-evaluation-harness Gao et al. (2023). The efficiency evaluation in Sec. 5.3 is performed on an NVIDIA A100 GPU.
B.2 Setups for Each Experiment
In this section, we introduce the setups and training details for each experiment.
B.2.1 Language Modeling Evaluation
Direct Construction of Language Model
For all attention mechanisms, we apply them to a 12-layer standard Transformer model, with each layer having 8 attention heads. The model dimension and head dimension are 512 and 64, respectively. To better simulate the model's operation on long-range sequences and reflect the memory overhead of the various mechanisms, we introduce the cache mechanism from Transformer-XL Dai et al. (2019) and adopt its relative position embedding. Each model is trained from scratch on the two datasets with the causal language modeling objective for 150,000 update steps. The number of tokens trained per step, i.e., the product of batch size and sequence length, is kept consistent (16,384 for Wikitext-103, 49,152 for enwik8). We use the AdamW Loshchilov and Hutter (2018) optimizer with beta values of (0.9, 0.98), set the learning rate to 2e-4 with 1,000 warm-up steps, and use a cosine learning rate scheduler.
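For concreteness, a minimal sketch of the described optimization setup (AdamW with betas (0.9, 0.98), learning rate 2e-4, 1,000 warm-up steps, cosine schedule); the exact warm-up shape and decay floor are not specified above and are assumptions here.

```python
import math
import torch

def build_optimizer_and_scheduler(model, lr=2e-4, warmup_steps=1_000, total_steps=150_000):
    """AdamW with betas (0.9, 0.98), linear warm-up, then cosine decay."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr, betas=(0.9, 0.98))

    def lr_lambda(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)          # linear warm-up (assumed shape)
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay to 0 (assumed floor)

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler  # call scheduler.step() after each optimizer update
```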
Combination with Linear Attention
For the linear attention mechanism, we use the 2nd-order Taylor series feature map Zhang et al. (2023b); Arora et al. (2024) as the kernel function. Following the setup of Arora et al. (2024), we employ RoPE Su et al. (2021) position encoding for both linear attention and local attention, and use the RMSNorm and SwiGLU mechanisms. Each model consists of 12 layers. For the combination of linear attention and local attention, each consecutive stack of three layers contains one linear attention layer and two local attention layers. The model dimension, head dimension, and feature dimension are set to 512, 64, and 16, respectively. The base window size for local attention is 128. During training and evaluation, the data is segmented into sequences of 2,048 tokens without using the Transformer-XL style caching mechanism. The batch size for both datasets is 8. Other training settings are the same as in Sec. 5.1.1.
B.2.2 Evaluation on Downstream Tasks
For the fine-tuning process, the Llama2-7B model is trained for 2,000 steps with each local attention pattern using the causal language modeling objective. The global batch size for each step is 32, and each sample consists of a sequence of 4,096 tokens. We use the AdamW optimizer with beta values of (0.9, 0.95). After 20 warm-up steps, the learning rate is fixed at 2e-5. We apply the LoRA Hu et al. (2021) technique to train the attention parameters. Inspired by Chen et al. (2023), we also make the normalization and embedding layers trainable. During fine-tuning and downstream testing, for SWA we set the base window size to a fixed fraction of the sequence length. For the MSWA series, the window size dynamically evolves from the same base window size, ensuring consistent resource usage with SWA.
B.2.3 Computational Efficiency Evaluation
For all attention mechanisms, the computation is based on the FlashAttention library, which is the current standard for efficient attention implementations. We apply them in a 32-layer Transformer model, with each layer containing 16 attention heads. The model dimension and head dimension are 1,024 and 64, respectively. We use a sequence of 2,048 tokens for measurement and report the median computation time across positions {500, 1,000, 1,500, 2,000} in the sequence. The batch size is set to {8, 16, 64, 128, 256, 512}, and we record the results for each case.