
How Well Can a Long Sequence Model Model Long Sequences? Comparing Architectural Inductive Biases on Long-Context Abilities

Jerry Huang
Mila - Quebec AI Institute
Université de Montréal
[email protected]
Abstract

Long sequences occur in abundance within real-world scenarios, hence properly modelling them opens up numerous downstream use-cases. Deep neural networks, however, have often struggled with these for a variety of reasons. Recent advances, both in system engineering and in model design, have enabled the scaling up of models that purportedly support extended context lengths. In particular, the state-space and linear recurrent neural network families of models can hypothetically extend to infinite sequence length. However, is this too good to be true? We conduct an evaluation to show that while such claims may be sound theoretically, there remain large practical gaps that are empirically observed. In particular, recurrent models still suffer in the same settings as long-context LLMs with attention. We further show that different inductive biases have inconsistent extrapolation capabilities, highlighting the need to further study such paradigms and investigate why long-context models seemingly fail to behave as one might expect.

1 Introduction

Advances in AI system engineering (Dao et al., 2022; Dao, 2024; Rasley et al., 2020) and model design (Katharopoulos et al., 2020; Jiang et al., 2023; AI21, 2024) have opened up language models to the broader public for a diverse set of purposes and use-cases. However, Transformer-based architectures (Vaswani et al., 2017) remain bounded in terms of their context windows, as they require fixed-length positional embedding representations (Press et al., 2022; Su et al., 2023; Peng et al., 2024) which cannot be modified a posteriori. With this glaring limitation, linear sequence models (Gu et al., 2022; Gu and Dao, 2024; Orvieto et al., 2023; Qin et al., 2023; Peng et al., 2023; De et al., 2024; Dao and Gu, 2024) have emerged as an alternative that can, in theory, extend to infinite-length contexts while retaining the Transformer's benefits of parallelizable training.

However, despite the temptation to declare linear sequence models superior, properly testing for information retention over long contexts remains challenging. While some works have attempted to evaluate this ability through long-context benchmarks (Shaham et al., 2022; Pang et al., 2022; Dong et al., 2024; Bai et al., 2023; Li et al., 2023; Han et al., 2024), whether these tasks truly require the use of long contexts is uncertain, making it difficult to ascertain long-context abilities from them. This has prompted the use of more synthetic tasks (Hsieh et al., 2024), such as needle-in-a-haystack (NIAH) (Kamradt, 2023) and passkey retrieval (Mohtashami and Jaggi, 2023), to better control and evaluate the context sizes of models.

Nevertheless, an outstanding question remains whether or not long-context models can effectively model long contexts. While some works (Gu and Dao, 2024; Fu et al., 2023; Poli et al., 2023; Peng et al., 2024; Team, 2024) purport to extrapolate to very long sequences (100k+ tokens), further investigation has suggested otherwise. For example, Hsieh et al. (2024) claim that modern LLMs significantly overstate their true context windows on a number of synthetic tasks. Meanwhile, Han et al. (2024) observe that models perform reasonably well on synthetic tasks but struggle on real-world tasks, as do Li et al. (2023). Hence, despite a consistent trend of models behaving underwhelmingly, it remains to be understood why this occurs. One interesting question is whether linear sequence models are in fact more suited to these settings than Transformer-based ones, as has been claimed repeatedly.

To this end, we further analyze the behaviour of sequence models and how it differs from that of Transformer-based ones. We perform a more extensive study into each type of model, as well as a mixture of both, to better investigate how they perform in principle and how their behaviour changes as sequences grow longer and longer. On both synthetic and realistic data, we conduct a thorough study and observe:

  • All models, whether they use pure sequence layers, attention or a mix, struggle with extrapolating beyond their training context length.

  • The ability to extrapolate can vary significantly based on the format of the sequence even if the task remains constant.

These results show that long sequence models suffer from significant limitations despite their theoretical soundness, highlighting a need to better understand this striking dissonance between expectation and observation and how to amend it for better long-context understanding and reasoning.

2 Related Work

Efficient Long-Context Models.

Due to the computational bottleneck of attention (Bahdanau et al., 2015) with respect to sequence length, significant modifications have been made to overcome this limitation of the Transformer (Child et al., 2019; Katharopoulos et al., 2020; Su et al., 2023), yet such models remain theoretically bounded in terms of their context length. Alternatively, recurrent sequence models (Rumelhart et al., 1986; Jordan, 1986; Hochreiter and Schmidhuber, 1997; Cho et al., 2014) originally faced significant issues that limited their application, but recent modifications (Gu et al., 2020, 2021) have led to the prominence of linear sequence models, which are significantly more compute-efficient than Transformer-based architectures.

On the Limits of Long Sequence Models.

Due to their more intuitive and interpretable architecture, long/linear sequence models remain easier to analyze than Transformers. As such, their limitations also become easier to discover and analyze. Vardasbi et al. (2023) first show that SSMs struggle at sequence-to-sequence tasks due to the use of a fixed-size hidden representation which compresses the entire prior context, making it difficult to extract information from the past, a fact further substantiated by Jelassi et al. (2024). Park et al. (2024) additionally demonstrate that these models have difficulty with more complex in-context learning tasks, while Merrill et al. (2024) show them to possess similar limitations in terms of representational power as Transformers (Merrill and Sabharwal, 2023). Waleffe et al. (2024) finally compare Mamba, Transformers and a hybrid of the two, observing hybrid models to perform better on long-context tasks, while Mamba2 often trails behind Transformers. These observations thus beg a question: can long sequence models really model long sequences? Given the hints that long sequence models may not always be as they seem, a more formal investigation is necessary. We distinguish ourselves by conducting a more controlled yet intricate study which aims to uncover why some of the prior results might occur, which we discuss in the work that follows.

3 Background

Attention and Long Sequences.

Self-attention as used in Transformers is powerful but costly. When provided an embedded text representation as a sequence of tokens $\bm{x}\in\mathbb{R}^{L\times d}$, each Transformer layer in the network applies a function

$$T_{\ell}(\bm{x})=\text{FF}_{\ell}\big(A_{\ell}(\bm{x})+\bm{x}\big)+A_{\ell}(\bm{x})$$ (1)

where $A_{\ell}$ is the self-attention mechanism of the $\ell$-th layer and $\text{FF}_{\ell}$ is the subsequent feed-forward network (normalization operations are omitted). Self-attention computes, for every position, a weighted average of the feature representations of all other positions, with a weight proportional to a similarity score between the representations.

$$\begin{split}&\bm{Q}_{\ell}=\bm{x}\bm{W}_{\ell}^{\bm{Q}}\qquad\bm{K}_{\ell}=\bm{x}\bm{W}_{\ell}^{\bm{K}}\qquad\bm{V}_{\ell}=\bm{x}\bm{W}_{\ell}^{\bm{V}}\\&A_{\ell}(\bm{x})=\bm{V}_{\ell}^{\prime}=\text{softmax}\big(\bm{Q}_{\ell}\bm{K}_{\ell}^{T}/\sqrt{d}\big)\bm{V}_{\ell}\end{split}$$ (2)

As softmax attention runs in $O(L^{2})$ time when applied naively, this limits the ability to process long sequences.
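
To make the quadratic cost concrete, the following is a minimal single-head NumPy sketch of Eq. (2); the function name, toy dimensions and random weights are illustrative assumptions rather than any particular model's configuration.

```python
import numpy as np

def self_attention(x, W_q, W_k, W_v):
    """Single-head self-attention following Eq. (2); an illustrative sketch only."""
    L, d = x.shape
    Q, K, V = x @ W_q, x @ W_k, x @ W_v             # project tokens to queries, keys, values
    scores = Q @ K.T / np.sqrt(d)                   # L x L similarity matrix: the O(L^2) bottleneck
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # weighted average of value representations

# Toy example with sequence length L = 8 and model dimension d = 4.
rng = np.random.default_rng(0)
L, d = 8, 4
x = rng.normal(size=(L, d))
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
out = self_attention(x, W_q, W_k, W_v)  # shape (L, d)
```

Both the $L\times L$ score matrix and the softmax over it grow quadratically with sequence length, which is precisely the cost that linear sequence models aim to avoid.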

Transformers to Sequence Models.

SSMs model a dynamical system, traditionally mapping a 1-D continuous input signal $x(t)\in\mathbb{R}$ to an $n$-dimensional hidden state $h(t)\in\mathbb{R}^{n}$ that is projected back to a 1-D output $y(t)\in\mathbb{R}$ using:

$$\begin{cases}h^{\prime}(t)=\bm{A}h(t)+\bm{B}x(t)\\y(t)=\bm{C}h(t)+\bm{D}x(t)\end{cases}$$ (3)

where $\bm{A}$, $\bm{B}$, $\bm{C}$ and $\bm{D}$ are all trainable parameters. Gu et al. (2021) use this paradigm to define a recurrent model operating on discrete signals, in which case the input can be regarded as discretized data sampled from a continuous signal with step size $\Delta$, for which the corresponding SSM is defined by:

$$\begin{split}h_{t}&=\overline{\bm{A}}h_{t-1}+\overline{\bm{B}}x_{t}\qquad y_{t}=\overline{\bm{C}}h_{t}+\overline{\bm{D}}x_{t}\\\overline{\bm{A}}&=\big(I-\tfrac{\Delta}{2}\bm{A}\big)^{-1}\big(I+\tfrac{\Delta}{2}\bm{A}\big)\qquad\overline{\bm{B}}=\big(I-\tfrac{\Delta}{2}\bm{A}\big)^{-1}\Delta\bm{B}\end{split}$$ (4)

and $\overline{\bm{C}}=\bm{C}$ (with $\overline{\bm{D}}$ set to $0$, since it is equivalent to a residual connection). Thus the output $\bm{y}$ given an input $\bm{x}$ is

$$\begin{split}\overline{\bm{K}}&=\big(\overline{\bm{C}}\,\overline{\bm{B}},\ \overline{\bm{C}}\,\overline{\bm{A}}\,\overline{\bm{B}},\ \dots,\ \overline{\bm{C}}\,\overline{\bm{A}}^{L-1}\overline{\bm{B}}\big)\\y_{t}&=\sum_{j=0}^{t}\overline{\bm{C}}\,\overline{\bm{A}}^{j}\,\overline{\bm{B}}\,x_{t-j},\quad\text{i.e.}\quad\bm{y}=\overline{\bm{K}}*\bm{x}\end{split}$$ (5)

where $\overline{\bm{K}}$ is the SSM kernel. As $\bm{y}$ can be computed in $O(L\log L)$ time with a Fast Fourier Transform (Cormen et al., 2009), the entire output can be computed in parallel from the input, given the matrices that parametrize the system. Gu et al. (2021) use this to overcome the issues of parallelization and vanishing gradients (Bengio et al., 1994; Hochreiter et al., 2001; Pascanu et al., 2013) observed in prior recurrent models by

  1. Removing non-linearities in the recurrence, enabling the efficient pre-computation of $\overline{\bm{K}}$.

  2. Using a special matrix parameterization (Gu et al., 2020) for $\bm{A}$ to memorize the input and eliminate exponential gradient scaling.
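
For concreteness, below is a minimal NumPy sketch (our own toy illustration, not a released implementation) of the discretized SSM above: it forms $\overline{\bm{A}}$ and $\overline{\bm{B}}$ via the bilinear rule in Eq. (4), then computes the output once by unrolling the recurrence and once by materializing the kernel $\overline{\bm{K}}$ of Eq. (5) and applying it as an FFT-based convolution, verifying that the two agree.

```python
import numpy as np

def discretize(A, B, step):
    """Bilinear discretization of Eq. (4): continuous (A, B) -> discrete (A_bar, B_bar)."""
    n = A.shape[0]
    inv = np.linalg.inv(np.eye(n) - step / 2 * A)
    return inv @ (np.eye(n) + step / 2 * A), inv @ (step * B)

def ssm_recurrent(A_bar, B_bar, C, x):
    """Unroll h_t = A_bar h_{t-1} + B_bar x_t, y_t = C h_t (Eq. (4) with D = 0)."""
    h, ys = np.zeros(A_bar.shape[0]), []
    for x_t in x:
        h = A_bar @ h + B_bar[:, 0] * x_t
        ys.append(C @ h)
    return np.array(ys)

def ssm_convolution(A_bar, B_bar, C, x):
    """Materialize the kernel K_bar of Eq. (5) and apply it as an FFT-based convolution."""
    L = len(x)
    K = np.array([(C @ np.linalg.matrix_power(A_bar, j) @ B_bar)[0] for j in range(L)])
    # Zero-padding to length 2L makes the circular convolution equal the causal one.
    return np.fft.irfft(np.fft.rfft(K, 2 * L) * np.fft.rfft(x, 2 * L))[:L]

# Toy check that the recurrent and convolutional views coincide.
rng = np.random.default_rng(0)
n, L = 4, 16
A = -np.eye(n) + 0.1 * rng.normal(size=(n, n))  # roughly stable continuous dynamics
B, C = rng.normal(size=(n, 1)), rng.normal(size=n)
A_bar, B_bar = discretize(A, B, step=0.1)
x = rng.normal(size=L)
assert np.allclose(ssm_recurrent(A_bar, B_bar, C, x), ssm_convolution(A_bar, B_bar, C, x))
```

In practice the kernel is computed without explicit matrix powers (e.g. via the structured parameterizations above), but the sketch captures why training can be parallelized over the sequence while inference can still run as a constant-memory recurrence.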

This has sparked a new wave of recurrent models to compete with Transformers (Orvieto et al., 2023; Qin et al., 2023; De et al., 2024; Beck et al., 2024), with the added benefit of context windows that, in theory, extend further and scale more efficiently.

4 Experiments and Results

Datasets.

We conduct an initial evaluation using Ruler (Hsieh et al., 2024), a set of synthetic benchmarks that test long-context information retention, before conducting a more fine-grained evaluation on a general needle-in-a-haystack task. We use these benchmarks as they allow granular control over the exact information that must be retained. Results are measured in terms of accuracy, based on exact matching of predicted tokens.
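
As a reference for how such accuracies can be computed, the snippet below sketches a simple matching-based scorer; the function name and the exact rule (each gold needle value must appear verbatim in the model output) are our own illustrative assumptions rather than the benchmark's official implementation.

```python
def match_accuracy(predictions, references):
    """Average percentage of gold needle values that appear verbatim in each prediction.

    Illustrative sketch only: `predictions` are decoded model outputs and `references`
    lists the value strings that should be retrieved for each example.
    """
    scores = []
    for pred, refs in zip(predictions, references):
        hits = sum(ref in pred for ref in refs)
        scores.append(hits / len(refs))
    return 100.0 * sum(scores) / len(scores)

# Example: one fully correct retrieval and one miss.
print(match_accuracy(
    predictions=["The special magic number is 42317.", "I could not find the value."],
    references=[["42317"], ["81624"]],
))  # 50.0
```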

Baselines.

Our main objective is to compare how long-sequence models fare on long-context tasks. To this end, we compare models with the same number of parameters that are evenly trained on the same data. Hence, we first use Mamba2 (Dao and Gu, 2024), a Transformer variant (Transformer++) and a hybrid Mamba2Attn, each with 2.7 billion parameters. We further add Sheared-LLaMA (Xia et al., 2024) and RecurrentGemma (Botev et al., 2024) baselines (with and without instruction-tuning) as same-sized baselines trained under different conditions. We finally add a 3-billion-parameter RWKV (Peng et al., 2023) variant as another sequence-model baseline.

Results.

We present initial results on the base set of Ruler tasks (as defined by its original authors) in Table 1. We then present two additional ablation studies. In the first, we hide a single needle within a large haystack and vary its relative position within the context. The goal of this ablation, presented in Tables 2 and 3, is to observe how the use of a unified hidden state rather than attention affects the ability to retain information throughout a long sequence. The second (Table 4) further tests how this information retention changes when the content being memorized changes (e.g., numbers versus UUIDs within a haystack of repeated sentences or essays).
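
To illustrate the needle-position ablation, the sketch below shows one plausible way such prompts can be assembled; the helper name, filler sentence, needle template and question are hypothetical stand-ins for the actual task generators rather than the exact prompts used.

```python
import random

def build_niah_prompt(haystack_sentences, needle, depth_percent, num_sentences=200):
    """Place a single needle at a fixed relative depth (0-100) inside a filler haystack."""
    filler = [random.choice(haystack_sentences) for _ in range(num_sentences)]
    insert_at = round(len(filler) * depth_percent / 100)
    context = filler[:insert_at] + [needle] + filler[insert_at:]
    question = "What is the special magic number mentioned in the context?"
    return " ".join(context) + "\n\n" + question

# Example: a numeric needle buried at depth 50 in a repeated-sentence haystack.
prompt = build_niah_prompt(
    haystack_sentences=["The grass is green. The sky is blue."],
    needle="The special magic number is 42317.",
    depth_percent=50,
)
```

Varying the depth, the haystack source (repeated sentences versus essays) and the needle value type (numbers versus UUIDs) yields the conditions reported in Tables 2-4.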

Length 1K 2K 4K 8K 16K Average
Mamba2 38.52 32.91 12.98 6.51 0.1 18.2
M2A 39.14 30.43 12.89 7.8 3.49 18.75
TPP 46.61 36.74 0.31 0.06 0.03 16.75
RG 78.82 71.72 22.45 11.21 6.29 38.1
SL 84.38 69.89 58.37 0.0 0.0 42.53
RWKV 68.09 55.27 37.47 23.73 13.81 39.67
RG-IT 85.64 79.45 44.33 24.19 14.18 49.56
SL-IT 86.22 77.54 74.25 0.0 0.0 47.6
Table 1: Results on Ruler. Accuracy is aggregated across several tasks for each model and context length (M2A: Mamba2Attn, TPP: Transformer++, RG: RecurrentGemma, SL: Sheared-LLaMA, -IT: instruction-tuned). The context length on which each model was trained is underlined, and the best-performing models are bolded.
Position 0 20 40 50 60 80 100 Avg
Mamba2 59.07 31.47 33.07 39.07 40.0 31.33 66.0 42.63
M2A 40.27 36.53 30.27 29.33 29.33 35.07 37.2 35.26
TPP 53.33 33.47 22.8 26.27 31.33 35.07 55.73 35.64
RG 100.0 100.0 100.0 100.0 100.0 100.0 99.47 99.92
SL 99.6 99.6 100.0 100.0 100.0 100.0 100.0 99.89
RWKV 82.4 100.0 100.0 80.27 100.0 100.0 100.0 94.67
RG-IT 100.0 100.0 100.0 100.0 100.0 100.0 100.0 100.0
SL-IT 98.27 99.6 100.0 100.0 100.0 100.0 99.73 99.66
Table 2: Results on needle-in-a-haystack task where the position of a single needle is at a fixed depth within the haystack. Context length is set to the maximum on which the models were trained.
Position 0 20 40 50 60 80 100 Avg
Mamba2 26.8 19.6 17.73 18.93 18.93 20.13 21.87 21.03
M2A 38.8 26.27 18.93 28.8 10.13 21.6 66.67 27.07
TPP 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
RG 0.0 0.0 0.0 99.87 100.0 100.0 96.27 56.59
SL 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
RWKV 33.47 99.6 100.0 36.53 100.0 100.0 100.0 81.37
RG-IT 0.0 0.0 0.0 100.0 99.6 100.0 99.73 57.05
SL-IT 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Table 3: Same results as above with context length set to twice the maximum training length.
Model Context Length Essay-Word-Num (0 / 50 / 100) Essay-Word-UUID (0 / 50 / 100) Repeat-Word-Num (0 / 50 / 100)
Mamba2 1024 86.0 73.6 82.0 78.0 70.8 80.8 77.6 70.4 55.2
2048 45.6 20.8 65.2 49.6 20.4 66.0 82.0 76.0 66.8
4096 0.0 0.0 0.0 0.0 0.0 0.0 80.4 56.8 65.6
M2A 1024 37.2 28.0 48.0 39.2 26.8 48.0 47.2 44.4 70.0
2048 41.6 27.6 39.6 42.4 28.4 30.8 36.8 32.0 63.2
4096 29.2 25.6 59.2 27.6 28.0 58.0 59.6 32.8 82.8
TPP 1024 52.0 36.0 47.6 58.8 34.4 50.4 81.6 33.2 58.4
2048 51.6 29.6 62.4 44.8 36.0 55.6 63.6 13.2 49.2
4096 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Table 4: Results on a needle-in-a-haystack task where a single needle is placed at the beginning (0), middle (50) or end (100) of the haystack while the format of each component (haystack, key and value) varies, evaluated at several context lengths.

5 Discussion

All models have limits.

Our first observation is that regardless of the model, performance drops steeply when testing on sequences longer than what the model was initially trained on. This is made clear in Table 1, where the decline in performance is greatest once the evaluated sequences are longer than the training context (with the mild exception of RWKV, which demonstrates approximately linear degradation as the sequences progressively double in length). However, an important observation is that linear sequence models do appear to extrapolate slightly better than pure-attention models, whose performance drops to near zero beyond the training length, as the recurrent models still show non-trivial accuracy when evaluated on the longer sequences. This distinction is less clear when comparing pure linear sequence models with hybrid models that alternate between sequence-model layers and attention layers, as there is no explicit pattern as to when one class performs better at a given length.

Being lost in the middle is a common event.

Being lost in the middle, whereby models have difficulty recalling relevant information positioned in the middle of long contexts (Liu et al., 2024), has been observed as a common limitation among attention-based models. In Table 2, this appears to be a common feature among all models we test, as all classes of models see increasing drops in performance as the relevant information is located closer to the center of the sequence. This suggests that despite their long-context modelling ability, recurrent models cannot effectively reason over their entire context window when prompted. However, when extending past the training context length (Table 3), there is less of a consistent pattern. In particular, while Mamba models still appear lost-in-the-middle, other recurrent models such as RecurrentGemma and RWKV show no clear depth-performance trends, further bringing into question their general long-context modelling abilities.

Extrapolation can be inconsistent.

Furthermore, extrapolation can be inconsistent based on characteristics of the model as well as of the data. In Table 4, we first note that depending on the data format of the haystack, key and value to be retrieved, the performance of each model can vary significantly even when we use the same task template, context length and needle position. Furthermore, extrapolation can vary significantly across models as these characteristics change. For example, the pure sequence model (Mamba2) appears to extrapolate only when the haystack is a repeated sentence and the retrieved value is a number associated with a key word. Upon changing the haystack to essays, extrapolation craters and the model fails. An equally trained hybrid model (M2A) can meanwhile always extrapolate to some degree, but its performance on sequences up to the training context length compares much worse. Pure attention (TPP) meanwhile performs favorably only when evaluated at the exact training context length under specific data formats, but otherwise underwhelms.

6 Conclusion

In this work, we conduct a comprehensive comparison between long-sequence models and attention-based language models, showing that while the long-context abilities of such sequence models may hold from a theoretical perspective, they empirically still struggle in comparison to models that make no such guarantees. This highlights the need to improve long-sequence reasoning abilities not only for Transformer-based LLMs, but also for SSMs and new classes of RNNs, and hopefully serves as motivation to further analyze this topic.

7 Limitations

We limit ourselves to a model size at which it is easy to compare models of various paradigms. As such, some perhaps more powerful models are not explored, as the analysis between such models can become difficult due to multiple additional changing variables that could lead to incorrect or under-supported claims.

8 Ethical Concerns

This paper discusses how different types of language models behave on long-context data. It follows that mistakes in our methodology (both experimental and analytical) could lead to unsupported confidence or skepticism about LLMs. Though neither is unethical, unsupported confidence can be very dangerous. However, given that our overall claim is that LLMs should not be assumed to support context lengths that extend beyond what they were trained on, regardless of their training data, we do not think this paper in itself could be misinterpreted in a particularly dangerous way.

As for model choices, we use publicly available models whose license agreements do not restrict what we can say about them. This should give the reader confidence that our views are unbiased, unlike for ChatGPT or GPT-4, whose license agreements include an unrestricted indemnity clause that could make us financially liable for damages.

9 Acknowledgements

JH is supported by a National Science and Engineering Research Council (NSERC) Canada Graduate Scholarship, a Fonds de Recherche du Québec Nature et technologies (FRQNT) Training Scholarship and a Hydro-Québec Excellence Scholarship. The experiments were in part enabled by computational resources provided by Calcul Québec (calculquebec.ca) and Mila.

References

Appendix A Technical Implementation Details

A.1 Models Used

Model HuggingFace Model
Mamba2 state-spaces/mamba2-2.7b
Mamba2Attention state-spaces/mamba2attn-2.7b
Transformer++ state-spaces/transformerpp-2.7b
RWKV RWKV/rwkv-6-world-3b-v2.1
Sheared-LLaMA princeton-nlp/Sheared-LLaMA-2.7B
Sheared-LLaMA-ShareGPT princeton-nlp/Sheared-LLaMA-2.7B-ShareGPT
RecurrentGemma-2B google/recurrentgemma-2b
RecurrentGemma-2B-IT google/recurrentgemma-2b-it
Table 5: Models used and public links to their weights.

A.2 Computing Resources Used

All experiments were conducted using a single NVIDIA A100 80GB SXM GPU with 6 CPU worker cores. Experiments are run using PyTorch version 2.2.0 and CUDA 11.8.

Appendix B Ruler Task Results

Length 1K 2K 4K 8K 16K Average
Mamba2 66.8 71.6 60.0 62.4 0.0 52.16
M2A 58.0 36.4 43.2 18.4 0.0 31.2
TPP 40.4 24.8 0.0 0.0 0.0 13.04
RG 100.0 100.0 52.0 24.8 10.0 57.36
SL 100.0 100.0 100.0 0.0 0.0 60.0
RWKV 100.0 100.0 100.0 100.0 54.4 90.88
RG-IT 100.0 100.0 51.6 28.8 16.4 59.36
SL-IT 100.0 100.0 100.0 0.0 0.0 60.0
Table 6: Results on niah_single_1 task of Ruler.
Length 1K 2K 4K 8K 16K Average
Mamba2 62.4 60.4 0.0 0.0 0.0 24.56
M2A 33.2 34.8 9.6 4.8 0.0 16.48
TPP 50.8 48.0 0.0 0.0 0.0 19.76
RG 100.0 100.0 36.4 16.8 2.8 51.2
SL 99.6 99.6 100.0 0.0 0.0 59.84
RWKV 100.0 100.0 53.6 30.4 9.6 58.72
RG-IT 100.0 100.0 55.2 24.4 12.8 58.48
SL-IT 100.0 100.0 100.0 0.0 0.0 60.0
Table 7: Results on niah_single_2 task of Ruler.
Length 1K 2K 4K 8K 16K Average
Mamba2 52.0 61.6 0.0 0.0 0.0 22.72
M2A 38.8 32.4 2.8 6.4 0.0 16.08
TPP 64.4 53.2 0.0 0.0 0.0 23.52
RG 100.0 100.0 39.2 16.8 8.4 52.88
SL 100.0 100.0 96.4 0.0 0.0 59.28
RWKV 99.2 96.4 15.2 19.6 4.4 46.96
RG-IT 100.0 100.0 53.6 24.0 13.6 58.24
SL-IT 100.0 99.6 99.6 0.0 0.0 59.84
Table 8: Results on niah_single_3 task of Ruler.
Length 1K 2K 4K 8K 16K Average
Mamba2 25.6 23.6 0.0 0.0 0.0 9.84
M2A 21.2 16.4 5.2 1.2 0.0 8.8
TPP 50.0 34.4 0.0 0.0 0.0 16.88
RG 98.8 98.8 23.2 15.6 4.4 48.16
SL 99.2 100.0 94.0 0.0 0.0 58.64
RWKV 81.6 64.0 30.4 18.0 11.2 41.04
RG-IT 99.2 100.0 36.8 17.6 11.2 52.96
SL-IT 99.6 99.2 98.0 0.0 0.0 59.36
Table 9: Results on niah_multikey_1 task of Ruler.
Length 1K 2K 4K 8K 16K Average
Mamba2 4.8 2.0 0.0 0.0 0.0 1.36
M2A 17.2 7.6 0.4 0.0 0.0 5.04
TPP 60.0 36.4 0.0 0.0 0.0 19.28
RG 98.0 94.8 8.4 2.4 1.6 41.04
SL 95.2 86.8 53.6 0.0 0.0 47.12
RWKV 20.4 4.0 0.8 0.4 0.0 5.12
RG-IT 100.0 98.0 43.6 27.2 9.6 55.68
SL-IT 97.6 96.0 78.8 0.0 0.0 54.48
Table 10: Results on niah_multikey_2 task of Ruler.
Length 1K 2K 4K 8K 16K Average
Mamba2 14.4 2.4 0.0 0.0 0.0 3.36
M2A 17.6 12.4 0.0 0.0 0.0 6.0
TPP 61.2 56.4 0.0 0.0 0.0 23.52
RG 74.8 58.8 7.2 2.8 1.6 29.04
SL 96.4 46.4 38.8 0.0 0.0 36.32
RWKV 14.8 1.6 0.4 0.0 0.0 3.36
RG-IT 88.0 92.0 16.0 14.0 1.6 42.32
SL-IT 85.6 63.2 59.2 0.0 0.0 41.6
Table 11: Results on niah_multikey_3 task of Ruler.
Length 1K 2K 4K 8K 16K Average
Mamba2 34.9 26.6 0.0 0.0 0.0 12.3
M2A 48.8 33.5 1.3 0.1 0.0 16.74
TPP 42.3 31.1 0.0 0.0 0.0 14.68
RG 97.4 95.1 14.7 3.3 3.0 42.7
SL 100.0 82.5 44.0 0.0 0.0 45.3
RWKV 96.5 87.0 57.2 10.8 5.2 51.34
RG-IT 96.7 87.6 41.8 22.0 11.3 51.88
SL-IT 100.0 87.5 77.2 0.0 0.0 52.94
Table 12: Results on niah_multivalue task of Ruler.
Length 1K 2K 4K 8K 16K Average
Mamba2 39.1 39.2 0.0 0.0 0.0 15.66
M2A 54.4 37.5 1.6 0.0 0.0 18.7
TPP 44.4 34.8 0.0 0.0 0.0 15.84
RG 99.5 99.7 4.7 2.8 2.8 41.9
SL 98.8 80.8 45.6 0.0 0.0 45.04
RWKV 94.3 80.7 38.4 9.3 2.4 45.02
RG-IT 97.8 97.9 48.5 21.1 11.4 55.34
SL-IT 98.4 94.7 85.9 0.0 0.0 55.8
Table 13: Results on niah_multiquery task of Ruler.
Length 1K 2K 4K 8K 16K Average
Mamba2 69.12 36.64 35.2 20.72 0.0 32.34
M2A 78.24 56.88 9.6 1.76 0.56 29.41
TPP 40.88 21.12 0.0 0.0 0.0 12.4
RG 98.0 75.52 0.0 0.0 0.0 34.7
SL 98.16 81.68 19.36 0.0 0.0 39.84
RWKV 68.56 47.76 20.08 6.88 10.95 30.85
RG-IT 84.24 79.36 50.4 31.76 19.92 53.14
SL-IT 93.68 76.88 42.32 0.0 0.0 42.58
Table 14: Results on vt task of Ruler.
Length 1K 2K 4K 8K 16K Average
Mamba2 28.52 14.72 4.08 0.16 0.12 9.52
M2A 26.48 15.24 3.04 5.92 0.8 10.3
TPP 30.32 17.8 0.64 0.0 0.04 9.76
RG 48.6 21.32 42.88 17.24 4.24 26.86
SL 71.2 25.32 55.24 0.0 0.04 30.36
RWKV 57.08 3.24 45.0 14.84 1.92 24.42
RG-IT 55.4 4.56 17.4 3.24 0.2 16.16
SL-IT 78.96 18.64 57.2 0.0 0.0 30.96
Table 15: Results on cwe task of Ruler.
Length 1K 2K 4K 8K 16K Average
Mamba2 57.87 44.67 40.67 0.53 0.0 28.75
M2A 59.73 53.73 58.0 52.0 39.6 52.61
TPP 59.6 56.4 0.13 0.0 0.0 23.23
RG 56.0 53.87 7.6 15.6 17.33 30.08
SL 72.0 38.67 45.07 0.0 0.0 31.15
RWKV 74.67 67.47 68.0 56.67 43.42 62.05
RG-IT 80.8 67.87 69.73 64.8 50.67 66.77
SL-IT 78.67 78.27 73.07 0.0 0.0 46.0
Table 16: Results on fwe task of Ruler.
Length 1K 2K 4K 8K 16K Average
Mamba2 25.2 24.4 18.4 0.0 0.4 13.68
M2A 33.6 35.6 18.0 6.8 3.6 19.52
TPP 37.2 36.4 2.8 0.8 0.4 15.52
RG 26.8 15.6 31.2 6.8 8.8 17.84
SL 41.6 37.2 37.2 0.0 0.0 23.2
RWKV 46.4 35.6 30.8 21.2 18.4 30.48
RG-IT 74.0 66.8 58.4 10.0 9.6 43.76
SL-IT 54.4 56.4 55.6 0.0 0.0 33.28
Table 17: Results on qa_1 task of Ruler.
Length 1K 2K 4K 8K 16K Average
Mamba2 20.0 20.0 10.4 0.8 0.8 10.4
M2A 21.6 23.2 14.8 4.0 0.8 12.88
TPP 24.4 26.8 0.4 0.0 0.0 10.32
RG 26.8 18.8 24.4 20.8 16.8 21.52
SL 24.8 29.6 29.6 0.0 0.0 16.8
RWKV 31.6 30.8 27.2 20.4 17.6 25.52
RG-IT 37.2 38.8 33.2 25.6 16.0 30.16
SL-IT 34.0 37.6 38.4 0.0 0.0 22.0
Table 18: Results on qa_2 task of Ruler.