
Too Big to Fail: Larger Language Models are Disproportionately Resilient to Induction of Dementia-Related Linguistic Anomalies

Changye Li (lixx3013@umn.edu), Zhecheng Sheng (sheng136@umn.edu), Trevor Cohen (cohenta@uw.edu), Serguei Pakhomov (pakh0002@umn.edu)
Abstract

As artificial neural networks grow in complexity, understanding their inner workings becomes increasingly challenging, which is particularly important in healthcare applications. Perplexity (PPL), the intrinsic evaluation metric of autoregressive neural language models (NLMs), reflects how “surprised” an NLM is by novel input and has been widely used to understand the behavior of NLMs. Previous findings show that changes in PPL when masking attention layers in pre-trained transformer-based NLMs reflect linguistic anomalies associated with Alzheimer’s disease dementia. Building upon this, we explore a novel bidirectional attention head ablation method that exhibits properties attributed to the concepts of cognitive and brain reserve in human brain studies, which postulate that people with more neurons in the brain and more efficient processing are more resilient to neurodegeneration. Our results show that larger GPT-2 models require a disproportionately larger share of attention heads to be masked/ablated to display degradation of similar magnitude to masking in smaller models. These results suggest that the attention mechanism in transformer models may present an analogue to the notions of cognitive and brain reserve and could potentially be used to model certain aspects of the progression of neurodegenerative disorders and aging.




1 Introduction

Alzheimer’s disease (AD) dementia is a currently incurable neurodegenerative condition that leads to a progressive and irreversible decline in cognitive function. Due to the challenging nature of early diagnosis of this condition, there is a pressing need for efficient and cost-effective screening tools (Bradford et al., 2009) to mitigate the negative consequences of delayed or absent diagnosis (Stokes et al., 2015). Previous studies have demonstrated that changes in cognitive status can be reflected in spoken language and spontaneous speech (Giles et al., 1996; Almor et al., 1999; Hier et al., 1985). Automated analysis of such speech, employing supervised machine learning models, shows promise as an early screening tool. These models can be trained to identify subtle linguistic anomalies associated with dementia from transcripts of both healthy individuals and those with dementia. Recent advances in machine learning, such as deep learning models and the attention-based transformer architecture (Vaswani et al., 2017), have enabled remarkable performance on this downstream task (for a review, see Shi et al. (2023)). Deep learning models, inspired by the human brain, are artificial neural networks (ANNs) that process vast amounts of data and learn complicated patterns, making them well-suited for analyzing subtle linguistic patterns. The transformer architecture, in particular, has advanced performance on natural language processing (NLP) tasks by enabling models to capture long-range dependencies more effectively via the attention mechanism (Vaswani et al., 2017).

As ANNs get larger and more complicated, it becomes even harder to interpret their inner workings. The performance of autoregressive neural language models (NLMs) on next-word prediction (i.e., predicting the next word given the context) is frequently estimated with a single, somewhat interpretable measure, perplexity (PPL), which has been shown to be a suitable measure for evaluating cognitive impairment from spontaneous speech (Fritsch et al., 2019; Cohen and Pakhomov, 2020). As the name “perplexity” suggests, it can be considered an indicator of how “surprised” a model is by novel input (i.e., input not used in the model’s training). The more the input differs from a particular model’s training data, the “harder” it is for the model to predict, resulting in higher PPL. Therefore, it is reasonable to hypothesize that PPL may have some degree of diagnostic utility, as an indicator of patterns of language use that fall outside the scope of the typical language used to train a model. In the context of AD, changes in language and cognitive function often manifest as differences in language complexity, with individuals experiencing difficulty in forming coherent sentences and selecting appropriate words. As AD progresses, the language used by patients with dementia becomes more unpredictable and less coherent, leading to higher PPL with models trained on language from individuals presumed to be cognitively healthy.
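For concreteness, the sketch below illustrates how log PPL can be estimated for a single transcript with a pre-trained GPT-2 model using HuggingFace’s transformers package (the package the authors report using); the function name and the assumption that a transcript fits in one context window are ours, not the paper’s.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

def transcript_log_ppl(text: str, model_name: str = "gpt2") -> float:
    """Estimate log perplexity of a single transcript under a pre-trained GPT-2."""
    tokenizer = GPT2TokenizerFast.from_pretrained(model_name)
    model = GPT2LMHeadModel.from_pretrained(model_name)
    model.eval()

    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # With labels supplied, the model returns the mean token-level
        # cross-entropy loss, which is the log of the perplexity.
        out = model(input_ids=enc.input_ids, labels=enc.input_ids)
    return out.loss.item()  # PPL = exp(loss)

# A model trained on typical language should be less "surprised" by a
# coherent picture description than by a fragmented, incoherent one.
print(transcript_log_ppl("The boy is reaching for the cookie jar."))
```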

While training data from cognitively healthy individuals is plentiful, language data produced by patients with dementia is far more difficult to obtain in quantities sufficient to train a large NLM. In hyperdimensional computing (Kanerva, 2009), high-dimensional vector representations are manipulated using operators that alter their distance from other learned representations. Prior work inspired by this concept (Li et al., 2022) demonstrates that masking the attention sub-modules of pre-trained transformer-based NLMs, thereby artificially increasing PPL on text from cognitively healthy individuals, can provide an effective solution to the challenge of limited data availability. By strategically altering these sub-modules and introducing controlled perturbations in the NLMs’ attention layers, the degraded NLMs come to exhibit the linguistic anomalies and unpredictability associated with dementia.

Recent work in neuroscience using functional magnetic resonance imaging (fMRI) and electrocorticography (ECoG) has demonstrated that NLM PPL is predictive of neural activation patterns during language comprehension tasks in the human brain (Schrimpf et al., 2021; Hosseini et al., 2024). This suggests a potential connection between the predictive capabilities of these models and human information processing. In particular, one of the less well-understood phenomena in how neurodegeneration affects the human brain is the notion of cognitive and brain reserve. This notion is hypothesized to be responsible for findings indicating that individuals with higher innate abilities and/or certain aspects of life experience, such as educational and professional attainment, are able to mask the effects of dementia longer than those without these characteristics (Stern, 2002, 2009, 2012; Scarmeas and Stern, 2004, 2003; Snowdon et al., 1996). In some cases, cognitive and brain reserve may even allow individuals to revert from initial signs of cognitive impairment to normal function (Iraniparast et al., 2022).

Building upon these findings, our study seeks to further explore the potential of probing pre-trained GPT-2 family models (Radford et al., 2019) to simulate cognitive impairment observed in patients with dementia, with a specific focus on the cognitive reserve hypothesis. Using a set of transcripts from a widely-used “Cookie Theft” picture description cognitive task, we propose that the impairment of information processing that occurs as the disease progresses can be simulated by masking a certain share of attention heads in a pre-trained GPT-2 model. Specifically, we follow the previously established paired-perplexity paradigm (Li et al., 2022) using a pair of unmasked (“control”) and masked (“dementia”) NLMs. In this approach, the difference between PPLs produced by these two NLMs is used to discriminate between picture descriptions by patients with dementia and healthy controls. We hypothesize that larger GPT-2 models with more attention heads will exhibit greater resilience to masking (i.e., a proxy for neural degeneration), necessitating a larger share of attention heads to be masked to achieve classification performance comparable to smaller models. We evaluate this hypothesis by targeting two subsets of attention heads that are a) most important, and b) least important to the representation of the content of the “Cookie Theft” task, where the degree of importance is ranked by the gradient changes in each attention head during fine-tuning of a pre-trained GPT-2 model to the content of the “Cookie Theft” transcripts.

The contributions of this work can be summarized as follows: a) we provide preliminary evidence suggesting that the concept of cognitive reserve observed in human cognition appears to have an analog in ANNs; and b) our attention masking approach achieves classification performance comparable to an approach developed in prior work that directly and artificially degrades NLM parameters (Li et al., 2022), and to the state-of-the-art (SOTA) model trained from scratch (TaghiBeyglou and Rudzicz, 2024), while masking or fitting significantly fewer parameters. (The code to reproduce the results presented in this paper is available on GitHub. The data are also publicly available but cannot be redistributed and must be obtained directly from Dementia Bank.)

2 Background

2.1 Cognitive Reserve

The notions of brain plasticity in the human brain and “graceful degradation” in ANNs have been extensively investigated in the neuroscientific literature, demonstrating, for example, that a large proportion (over 80%) of the connections in an ANN trained to simulate the motor cortex in generating signals directing body movement have to be ablated before the model’s performance begins to collapse (Lukashin et al., 1994; Lukashin and Georgopoulos, 1994). The concepts of cognitive and brain reserve are closely related to brain plasticity applied to observations in neurodegenerative diseases, as illustrated in Figure 1. One of the earlier observations of this phenomenon comes from the Nun Study, which found that low linguistic ability early in life (possibly due to innate abilities or educational attainment) is predictive of poor cognitive function and AD later in life (Snowdon et al., 1996). The concept of cognitive reserve was further developed based on observations of individual differences in the effects of brain damage or pathology on clinical manifestations of cognitive function (Stern, 2002, 2009, 2012). A multi-site study (Esiri et al., 2001) reported that up to 25% of older adults without signs of cognitive impairment during neuropsychological testing met all the histopathological criteria for AD (amyloid plaques and tau protein tangles) prior to their death. While this study did not assess brain volume, another similar study did find that a subgroup of 10 study participants who had both AD pathology and preserved mental status had greater brain weights and numbers of neurons in their brains (Katzman et al., 1988).

A distinction can be made between the closely related notions of cognitive reserve and brain reserve. Cognitive reserve refers to the efficiency of brain networks, which manifests as greater educational and professional attainment. Brain reserve, on the other hand, refers to the physical properties of the brain, such as a larger number of neurons in biological neural network(s), and can manifest, for example, as a higher intelligence quotient. These two notions are difficult to disentangle due to their significant interdependence (Steffener and Stern, 2012). The properties of these notions have also been described using passive or active models, which correspond to the notions of brain and cognitive reserve, respectively. Passive models (Katzman, 1993; Satz, 1993) measure reserve by the size of the brain or the number of neurons in the brain. Passive models hypothesize that there is a threshold of brain reserve capacity: once an individual passes this “point of no return”, the manifestation of a neurodegenerative disease, such as AD, will occur regardless. In contrast to passive models, active models (Stern, 2002) hypothesize that there is a neural compensatory effect for brain damage, whereby the brain compensates for the damage by activating other biological neural network(s) to perform cognitive task-related activities. In this case, patients with similar brain impairment but more cognitive reserve may be more resilient to the disease’s progression before the clinical manifestations of neurodegeneration become apparent. Quantitatively, there is no clear difference between the passive and active models of cognitive reserve, as both rely on the physiologic basis of biological neural networks in the brain. This provides an opportunity to computationally evaluate the underlying mechanisms that contribute to cognitive reserve across various neurological conditions.

To avoid any potential confusion between these terms referring to different types of resilience and to avoid any inadvertent conflation between artificial and human brain networks, in the remainder of this paper we will refer to the phenomenon of resilience to damage that we observe in ANNs specifically as “artificial neural reserve” and use the terms “cognitive/brain reserve” to refer exclusively to human brain networks.

Figure 1: A theoretical illustration of cognitive reserve and its mediation effect between AD neuropathology (x-axis) and clinical outcome (y-axis). Illustration derived from Stern (2002, 2009). As the disease progresses (i.e., with more impairment), individuals with higher cognitive/brain reserve would be more resilient to the effects, resulting in a lower level of clinical severity.

2.2 Probing the Neural Network

The ablation of connections in ANNs is also referred to as probing in NLMs. This is a growing field aimed at understanding the inner workings of large-scale transformer-based NLMs by probing the mechanism (i.e., attention weights, hidden states) to better understand the linguistic structure and representations encoded by such models. Similarly to the early findings of Lukashin et al. (1994) and Lukashin and Georgopoulos (1994), more recent work on transformers (i.e., Michel et al. (2019); Prasanna et al. (2020); Sanh et al. (2020)) demonstrates that a large percentage of attention heads or sub-modules can be removed at inference time without significantly impacting performance.

2.3 Linguistic Anomalies in AD

AD is a neurodegenerative disease, and progressively worsening linguistic deficits often accompany its progression (Kempler and Goral, 2008; Altmann and McClung, 2008). A widely-used diagnostic task to capture such linguistic anomalies is the “Cookie Theft” picture description task from the Boston Diagnostic Aphasia Examination (Goodglass and Kaplan, 1983). In this task, participants are asked to describe everything they see going on in the picture shown in Figure 2. Previous studies have demonstrated that dementia patients tend to overuse pronouns (Almor et al., 1999) and tend to perseverate (Hier et al., 1985) when describing the “Cookie Theft” picture.

Figure 2: The “Cookie Theft” picture description stimulus.

There is a rich body of evidence that supervised machine learning and deep learning methods can learn to distinguish the subtle linguistic characteristics of language produced by healthy individuals and people with dementia. However, such models present a danger of overfitting and hinder interpretability of model predictions, both of which are critical concerns for clinical artificial intelligence (AI) applications (Graham et al., 2020). Alternatively, PPL is an easily interpretable measure used to evaluate model performance. In the paired-perplexity paradigm, the difference between PPLs from a “healthy control” NLM and a “dementia” NLM provides a diagnostically useful summary value that distinguishes language samples produced by dementia patients (Fritsch et al., 2019; Cohen and Pakhomov, 2020). Prior work (Li et al., 2022) has shown that the difference of PPLs from a pre-trained GPT-2 paired with an artificially degraded version of itself approximates SOTA classification performance without requiring a dataset from dementia patients of comparable size to its comprehensive training data. However, this approach requires evaluating thousands of masking patterns in order to investigate the effects of masking various combinations of attention heads exhaustively. In the current work we obviate this requirement for extensive experimentation by using targeted masking (guided by the changes in gradients during training) of two subsets of attention heads that are a) most “important”, and b) least “important” with respect to the content of the “Cookie Theft” picture. We show that the resulting masked models can effectively identify transcripts from dementia patients with significantly fewer trainable parameters while exhibiting classification performance comparable to previous studies (Li et al., 2022; TaghiBeyglou and Rudzicz, 2024).

3 Methods

3.1 Data

We use two publicly available datasets that contain responses to the “Cookie Theft” picture description task: a) the AD Recognition through Spontaneous Speech (ADReSS) Challenge (Luz et al., 2020; https://dementia.talkbank.org/ADReSS-2020/), and b) the Wisconsin Longitudinal Study (WLS) (Herd et al., 2014; https://dementia.talkbank.org/access/English/WLS.html). Table 1 shows basic characteristics of the datasets used in this study. ADReSS is a subset of the Pitt corpus (Becker et al., 1994) designed to address the absence of a standardized train/test split in prior work; it is specifically matched on age and gender to reduce potential confounding effects. The WLS is a longitudinal study of 694 men and 675 women who graduated from Wisconsin high schools in 1957. The participants were interviewed up to 6 times between 1957 and 2011, and the “Cookie Theft” picture description task was administered in the later rounds of interviews. In particular, we restricted the original WLS dataset to a total of 102 participants who a) agreed to participate in the “Cookie Theft” picture description task, and b) had either a clinical diagnosis of dementia or were deemed healthy in follow-up interviews conducted in 2020. This information was obtained through phone interviews and assessments by advanced practice providers; the collected data were subsequently presented to a panel of clinicians to obtain the diagnosis.

Dataset              Dementia   Healthy Controls
                        # of participants (n)
ADReSS   Train       54         54
         Test        24         24
         Total       78         78
WLS                  29         73
Table 1: The characteristics of ADReSS and WLS.

We pre-process the verbatim transcripts using TRESTLE (Toolkit for Reproducible Execution of Speech Text and Language Experiments) (Li et al., 2023) by removing utterances that do not belong to the participants, unintelligible words, and descriptions of speech and non-speech artifact events (e.g., “laughs”, “clears throat”).
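For illustration, the following is a minimal sketch of this kind of clean-up for CHAT-formatted transcripts. It is not TRESTLE itself, and the specific markers handled (participant tier, unintelligible words, non-speech events, bracketed codes) are assumptions about a typical pipeline rather than the exact rules used in this study.

```python
import re

def clean_chat_transcript(raw: str) -> str:
    """Keep only participant utterances and strip common CHAT annotations.

    A simplified stand-in for the TRESTLE preprocessing described in the text;
    the markers handled here are illustrative and may differ from TRESTLE's.
    """
    utterances = []
    for line in raw.splitlines():
        if not line.startswith("*PAR:"):       # drop non-participant tiers
            continue
        utt = line[len("*PAR:"):]
        utt = re.sub(r"&=\S+", " ", utt)       # non-speech events, e.g. &=laughs
        utt = re.sub(r"\bxxx\b", " ", utt)     # unintelligible words
        utt = re.sub(r"\[[^\]]*\]", " ", utt)  # bracketed annotation codes
        utt = re.sub(r"\s+", " ", utt).strip()
        if utt:
            utterances.append(utt)
    return " ".join(utterances)
```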

3.2 Modeling and Evaluation

We follow a masking strategy similar to that proposed by Michel et al. (2019) to mask attention heads of the GPT-2 small, medium, large, and XL models according to the rank of their importance to the task. We focus on the GPT-2 family of models to minimize the variability that would result from multiple modeling architectures. The task-importance of attention heads in each model is determined by the gradient changes during fine-tuning on the subsequent-word prediction task using transcripts of “Cookie Theft” picture descriptions in the training portion of the ADReSS dataset. Intuitively, if the gradient change of an attention head is large, this attention head is likely important with respect to predicting the language to which the model is being fine-tuned, and vice versa.
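A minimal sketch of such a gradient-based importance ranking, in the spirit of Michel et al. (2019), is shown below; the dataloader, batching, and exact accumulation scheme are illustrative assumptions rather than the paper’s exact procedure.

```python
import torch
from transformers import GPT2LMHeadModel

def rank_attention_heads(model: GPT2LMHeadModel, dataloader) -> torch.Tensor:
    """Accumulate |dL/d(head gate)| per attention head over a dataset.

    Returns a (num_layers, num_heads) tensor; larger values indicate heads
    that matter more for predicting the fine-tuning text.
    """
    cfg = model.config
    importance = torch.zeros(cfg.n_layer, cfg.n_head)
    for batch in dataloader:  # each batch is assumed to provide "input_ids"
        head_mask = torch.ones(cfg.n_layer, cfg.n_head, requires_grad=True)
        out = model(input_ids=batch["input_ids"],
                    labels=batch["input_ids"],
                    head_mask=head_mask)
        out.loss.backward()
        importance += head_mask.grad.abs().detach()
        model.zero_grad()
    return importance
```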

In contrast to the approach by Michel et al. (2019), which prunes the least important attention heads during testing, we anticipate that the most important attention heads are those relevant for predicting the text of the “Cookie Theft” task. This idea is supported by Yorkston and Beukelman (1980) and Berube et al. (2019), who found that the number of content units represented – a measure of how much relevant information is conveyed in the description – is sensitive to linguistic deficits often observed in individuals with neurodegenerative disease. However, we also reason that the least important attention heads may represent subtle differences in linguistic structure and representations that may distinguish between dementia patients and healthy controls. We also test the possibility that the semantic impairment observed in AD (Huff et al., 1986; Giffard et al., 2001; Hodges and Patterson, 1995) could be potentially simulated by masking a certain share of the columns in the pre-trained NLMs’ token embedding matrix, where each column contributes to the representation of the meaning of each token in the model’s vocabulary. Thus, masking columns in the embedding matrix leads to degrading the representation of all vocabulary items vs. degrading or deleting specific tokens from the otherwise intact vocabulary by operating on the rows of the embedding matrix.

Following these considerations, we design the masking strategies as follows: a) we fine-tune each of the GPT-2 models with a language model head layer as the top layer on the ADReSS training set to get the corresponding ranking of importance for each of the attention heads; b) we iteratively mask a small share ($n$%) of ranked attention heads bidirectionally, consisting of the $\frac{n}{2}$% most important and the $\frac{n}{2}$% least important attention heads, and then gradually increase the percentage of attention heads for masking; and c) we iteratively mask columns of the word embedding matrix in reverse order, moving from right to left, and gradually increase the percentage of word embedding columns for masking. (All experiments in this study are done with HuggingFace’s transformers package (Wolf et al., 2020) on one A100 GPU.)
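The bidirectional masking step (b) could be implemented roughly as follows, given the importance scores from the previous step; the helper name and the flattened global ranking are illustrative assumptions.

```python
import torch

def bidirectional_head_mask(importance: torch.Tensor, pct: float) -> torch.Tensor:
    """Zero out the pct/2 % most important and pct/2 % least important heads."""
    n_layer, n_head = importance.shape
    flat = importance.flatten()
    order = torch.argsort(flat)            # ascending: least important first
    k = int(round(flat.numel() * pct / 100 / 2))
    mask = torch.ones_like(flat)
    mask[order[:k]] = 0.0                  # least important heads
    if k > 0:
        mask[order[-k:]] = 0.0             # most important heads
    return mask.view(n_layer, n_head)
```

The resulting mask can then be passed as `head_mask` to the model’s forward call at inference time to obtain the degraded (“dementia”) model.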

We examine the artificial neural reserve hypothesis using two evaluation approaches. The first approach consists of simply estimating the PPL of the progressively degraded NLMs based on healthy individuals’ transcripts from an independent dataset containing the same type of picture descriptions as the dataset that was used to rank attention heads by their importance to the dementia classification task. We use the WLS dataset and select only those WLS participants that remained cognitively healthy over the entire study period as the independently collected dataset for log PPL estimation. Using this approach, in addition to masking attention heads, we also experiment with masking model weights in the token embedding matrix to see if any observed effects are specific to the attention mechanism.
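A sketch of the embedding-column masking used for comparison is shown below; zeroing the matrix in place and the right-to-left order follow the description above, while the function name is ours.

```python
import torch
from transformers import GPT2LMHeadModel

def mask_embedding_columns(model: GPT2LMHeadModel, pct: float) -> GPT2LMHeadModel:
    """Zero the right-most pct % of columns of the token embedding matrix."""
    wte = model.transformer.wte.weight        # shape: (vocab_size, n_embd)
    n_embd = wte.shape[1]
    n_masked = int(round(n_embd * pct / 100))
    with torch.no_grad():
        if n_masked > 0:
            # Each column contributes to the representation of every token,
            # so this degrades the vocabulary globally rather than deleting
            # specific tokens (rows). Note that GPT-2 ties the LM head to
            # this matrix, so the output projection is affected as well.
            wte[:, n_embd - n_masked:] = 0.0
    return model
```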

The second approach consists of evaluating the classification performance of ablated/degraded models paired with the original versions of the same GPT-2 model using the paired-perplexity paradigm (Fritsch et al., 2019; Cohen and Pakhomov, 2020; Li et al., 2022). These evaluations are conducted on the testing portion of the ADReSS dataset, with accuracy (ACC) and area under the receiver operating characteristic (ROC) curve (AUC) as the evaluation metrics. Specifically, for the paired-perplexity paradigm, we estimate the ratio of PPLs, $\frac{PPL_{control}}{PPL_{dementia}}$, for each transcript from the test set. The ACC measure is calculated as accuracy at the equal error rate (EER), where the false acceptance rate is equal to the false rejection rate on the ROC curve. The intuition behind this approach is that successful masking of a portion of attention heads in a pre-trained NLM will result in the NLM exhibiting dementia-like behavior, which would in turn result in high AUC and ACC values for the paired-perplexity classification.
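The evaluation itself can be sketched as follows, assuming the PPL ratio is higher for transcripts produced by patients with dementia; the use of scikit-learn and the exact EER-threshold selection are our assumptions rather than details given in the paper.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

def evaluate_ppl_ratio(ratios: np.ndarray, labels: np.ndarray):
    """AUC and accuracy at the equal error rate for PPL_control / PPL_dementia.

    `labels` are 1 for dementia and 0 for healthy controls; higher ratios are
    assumed to indicate dementia transcripts.
    """
    auc = roc_auc_score(labels, ratios)
    fpr, tpr, thresholds = roc_curve(labels, ratios)
    eer_idx = np.argmin(np.abs(fpr - (1 - tpr)))   # point where FPR == FNR
    preds = (ratios >= thresholds[eer_idx]).astype(int)
    acc = (preds == labels).mean()
    return auc, acc
```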

4 Results

4.1 Effects of Masking on Perplexity

(a) GPT-2s with masked attention heads.
(b) GPT-2s with masked word embedding matrix.
Figure 3: Changes in model log PPL as a function of the proportion of masked attention heads (panel a) or masked word embedding columns (panel b) across GPT-2 models of various sizes. Note: the curves in panel (a) show that the GPT-2 XL model has the most non-linear/concave shape, indicating that the model starts to degrade rapidly only after about 50% of its attention heads are masked, followed by the curve for the GPT-2 large model. The smaller GPT-2 models begin to degrade with proportionally less masking, and exhibit a monotonic relationship between the magnitude of attention head masking and model performance. The curves in panel (b) show almost completely preserved model performance, without differences between models, up to the point at which 40%-50% of the columns in their embedding matrices have been masked. After that point, the performance of all models collapses “catastrophically”.

As illustrated in Figure 3(a), the predictive ability of smaller GPT-2 models degrades linearly with the degree of damage inflicted on the attention mechanism by masking a progressively larger proportion of attention heads. The predictive ability of the larger GPT-2 models, on the other hand, degraded in a non-linear fashion, where increases in log PPL were relatively flat up to 40-50% of the attention heads being masked and then began to increase exponentially. Fitting the GPT-2 small, medium, large, and XL model log PPL to a linear regression line resulted in $r^2$ goodness-of-fit values of 0.99, 0.89, 0.91 and 0.83, respectively, whereas fitting to an exponential regression line failed to converge for the small and medium models and yielded $r^2$ values of 0.97 and 0.99 for the large and XL models, respectively. The results of Dunn’s test further confirmed our observations, showing that the differences between log PPLs estimated by GPT-2 small and GPT-2 XL (adjusted p-value < 0.01), and GPT-2 medium and GPT-2 XL (adjusted p-value < 0.05), when masking attention heads are statistically significant. In contrast, log PPLs were not significantly different between any pair of GPT-2 models when masking the word embedding matrix (adjusted p-value > 0.05).
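The curve fits reported above can be reproduced along the following lines; the particular exponential parameterization and the goodness-of-fit computation are assumptions, as the paper does not specify them.

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import linregress

def fit_quality(pct_masked: np.ndarray, log_ppl: np.ndarray):
    """Compare linear and exponential fits of log PPL vs. % of masked heads."""
    # Linear fit: r^2 from ordinary least squares.
    lin = linregress(pct_masked, log_ppl)
    r2_linear = lin.rvalue ** 2

    # Exponential fit: y = a * exp(b * x) + c. This may fail to converge for
    # models whose degradation is essentially linear, as observed for the
    # smaller GPT-2 models.
    def exp_curve(x, a, b, c):
        return a * np.exp(b * x) + c

    try:
        params, _ = curve_fit(exp_curve, pct_masked, log_ppl, maxfev=10_000)
        resid = log_ppl - exp_curve(pct_masked, *params)
        r2_exp = 1 - resid.var() / log_ppl.var()
    except RuntimeError:
        r2_exp = None
    return r2_linear, r2_exp
```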

Compared to masking attention heads, with the GPT-2 small, medium, large, and XL models we needed to mask 93% (714 out of 768), 66% (675 out of 1024), 87% (1113 out of 1280), and 66% (1050 out of 1600) of the columns in the word embedding matrix to achieve ACCs of 0.75, 0.85, 0.79, and 0.81, respectively, on the ADReSS test set. Figure 3(b) further supports this claim, as the log PPLs estimated when masking the word embedding matrix show no statistically significant differences across the GPT-2 models.

(a) GPT-2 small
(b) GPT-2 medium
(c) GPT-2 large
(d) GPT-2 XL
Figure 4: Comparison of GPT-2 models with masked attention heads on paired-perplexity classification performance. The left y-axis denotes classification performance using both masked and unmasked GPT-2 models on the ADReSS test set. The right y-axis indicates log PPL estimated from transcripts of WLS healthy individuals. The x-axis represents the percentage of masked attention heads. The vertical dashed line indicates the best-performing masking pattern, achieving the highest ACC.

4.2 Effects of Masking on Dementia Classification

Model                          GPT-2 small   GPT-2 medium   GPT-2 large   GPT-2 XL
# of parameters                124M          355M           774M          1.5B
# of masked attention heads    12            92             388           1080
% of masked attention heads    9             24             54            90
ACC                            0.83          0.83           0.81          0.81
AUC                            0.86          0.85           0.80          0.82
Table 2: Classification performance of the paired-perplexity approach based on pre-trained and masked GPT-2 models on the ADReSS test set.

As shown in Table 2, masking 9% of the attention heads (n=12) of the GPT-2 small model (the “dementia” model) achieved an ACC of 0.83 and an AUC of 0.86 when paired with the original unmasked version of itself (the “control” model) on the ADReSS test set. This is comparable to prior work (Li et al., 2022) (ACC = 0.85, AUC = 0.89), but the masking approach uses significantly fewer masked parameters. Our results also show that a larger share of attention heads in the larger models must be masked to approximate a “dementia” model with the same level of performance in the paired-perplexity classification as with smaller models. Notably, masking of the word embedding matrix did not result in comparable observations. As anticipated by the results shown in Figure 3(b), with all GPT-2 models we needed to mask a majority (e.g., > 50%) of the word embedding matrix to obtain a similar level of classification performance regardless of model size. (The importance of attention heads for each model can be found in Table 3, Table 4, Table 5, and Table 6 in Appendix A.)

As illustrated in Figure 4, we observed that once the best-performing masking pattern, marked by the highest ACC, was reached, the classification performance of all GPT-2 models started to fluctuate. However, this behavior did not occur with word embedding matrix masking. As illustrated in Figure 5 in Appendix A, the classification performance exhibited fluctuations prior to the emergence of the best-performing masking pattern, indicating that masking the columns of the word embedding matrix has less impact on identifying signs of cognitive impairment from text, as it probably does not result in a good dementia-like model for the paired-perplexity classification task.

5 Discussion

The results of the experiments presented in this paper suggest that the notion of cognitive reserve in the brain may have an analogue in transformer-based ANNs that is localized to the attention mechanism. Recent neuroscientific evidence shows that NLMs’ PPL is predictive of human behavioral responses and neural responses in functional MRI studies (Schrimpf et al., 2021; Hosseini et al., 2024). Based on this evidence, we interpret our findings of the differences in log PPL changes resulting from masking attention heads in NLMs of variable size as at least suggestive that resilience to damage scales non-linearly with the number of attention heads in NLMs. In other words, it takes disproportionately more masking to damage larger NLMs to elicit the same level of degradation in performance, as compared with smaller NLMs. Furthermore, the dissociation in performance resulting from damaging attention heads vs. the token embedding weights suggests that the NLM’s artificial neural reserve effects are localized to the attention mechanism.

Our results also suggest that masking attention heads within the paired-perplexity paradigm, using the ratio of PPLs from unmasked (“control”) and masked (“dementia”) pre-trained GPT-2 models, results in good classification performance without requiring a correspondingly large dataset produced by dementia patients or extensive parameter tuning. This can be achieved by masking as little as 9% of the attention heads of a pre-trained GPT-2 small model.

In contrast to previous studies, which typically involved purging attention heads determined to be the least important, our bidirectional masking method adds supporting evidence for the role of content units (Yorkston and Beukelman, 1980; Berube et al., 2019), suggesting the importance of these contextual features in addition to the predominant emphasis on linguistic structure and representation modeling in previous research (e.g., Orimaye et al. (2014), Fraser et al. (2016)). The results of bidirectional masking also offer an interpretable explanation for transfer learning’s remarkable performance using pre-trained NLMs: during fine-tuning, pre-trained NLMs use a combination of both task-specific (the most important) and task-agnostic (the least important) heads to achieve remarkable performance on various downstream tasks. Those task-agnostic attention heads may play an important role in transfer learning. This may also explain why distilled NLMs, in which the “nonvolitional cues” that fall outside of common NLP benchmarks are purged during distillation, generalize less than ideally to other types of data produced by individuals with dementia (Li et al., 2022). With larger models, there are considerably more attention heads that can serve as those “nonvolitional cues,” helping a larger NLM perform better (Agbavor and Liang, 2022).

As the columns of the token embedding matrix in a pre-trained NLM represent the global semantics of tokens in the vocabulary, the observation that the best-performing masking pattern appears in the later stage of token embedding matrix masking is consistent with previously published findings that semantic impairment often occurs in the later (i.e., moderate) stage of the disease (Huff et al., 1986; Giffard et al., 2001; Hodges and Patterson, 1995). As illustrated in Figure 5, when masking the last 66% of columns (675 out of 1024) of the word embedding matrix, the paired unmasked and masked GPT-2 medium achieves an ACC of 0.85 on the ADReSS test set. This finding is consistent with previous work (Hewitt and Manning, 2019) suggesting that some syntactic information is embedded implicitly in the word embedding matrix. It also provides empirical support for our finding that masking the word embedding matrix of a pre-trained NLM can have some discriminating effect on this downstream task. However, masking the word embedding matrix is far less effective than masking attention heads for simulating dementia-related cognitive impairment.

Our results suggest that similar mechanisms of resilience may exist in both human cognition and computational models. This could lead to more nuanced strategies for developing early screening tools for detecting delayed onset of cognitive impairment in high-risk populations. Our study holds promise for deploying such early-screening methods in resource-constrained clinical settings to improve early intervention and patient management for AD.

6 Conclusion

We presented experimental findings suggesting the presence of artificial neural reserve in transformer-based NLMs, analogous to the concepts of brain/cognitive reserve in studies of human cognition. In addition, we introduced a novel bidirectional attention head ablation method that enables using unmasked and masked GPT-2 models in the paired-perplexity paradigm for detecting linguistic anomalies with significantly less parameter masking or fitting.

Acknowledgement

Research reported in this publication was supported by the National Library of Medicine of the National Institutes of Health under award number R01 LM014056-02S1, and by the National Institute on Aging of the National Institutes of Health under award number AG069792.

Limitations

The work presented here has several limitations. First, the datasets used in this study are relatively small compared to datasets typically analyzed in open-domain NLP tasks; therefore, the results may not be readily generalizable. Second, all datasets used in our study are in American English, and many participants in these two studies are White, non-Hispanic American men and women located in the northern part of the United States, which certainly limits the applicability of our findings to other languages and populations. Third, while we propose that the findings presented in this paper may be interpreted as an analogue of the notions of cognitive or brain reserve, we do not suggest that GPT-2 models are accurate models of the human brain. Rather, our interpretation of these findings is that experimenting with masking of attention heads in models of various sizes and architectures may be useful in helping us understand cognitive processes that take place in the human brain. The observed effects of attention masking on the models’ performance and behavior, while suggestive of an analog to cognitive reserve in the human brain, should not be taken to establish a direct causal link to human cognitive processes. Additionally, the ranking of attention heads by their relative importance is specific to the ADReSS dataset, as it was derived from the training portion of that dataset, and may not readily generalize to other datasets and types of data. Lastly, in this paper we did not address the distinction between the notions of cognitive and brain reserve. It would be important to investigate in future work whether NLMs of the same size and architecture but trained on data of different quantity and quality (i.e., as a simulation of educational attainment) exhibit differential resilience to damage, independently of the effects observed in models of variable size.

References

Appendix A Appendix

(a) GPT-2 small
(b) GPT-2 medium
(c) GPT-2 large
(d) GPT-2 XL
Figure 5: Comparison of GPT-2 models with masked columns of the word embedding matrix on classification performance and cognitive reserve manifestation. The left y-axis denotes classification performance using both masked and unmasked GPT-2 models on the ADReSS test set. The right y-axis indicates log PPL estimated from transcripts of WLS healthy individuals. The x-axis represents the percentage of masked word embedding columns. The vertical dashed line indicates the best-performing masking pattern, achieving the highest ACC.
0 1 2 3 4 5 6 7 8 9 10 11
121 5 7 65 19 46 58 56 9 1 0 60
16 34 94 71 13 90 42 4 113 86 47 14
74 106 107 32 89 18 17 87 75 11 8 128
48 15 82 78 93 68 129 79 77 72 54 23
40 67 22 115 36 131 100 83 140 61 141 135
41 29 134 137 119 12 92 21 31 3 69 28
76 95 101 125 91 130 37 99 80 98 10 122
117 45 124 81 116 49 25 26 62 97 143 136
64 123 30 43 88 38 27 55 73 114 118 142
111 53 102 70 50 57 105 84 120 138 139 20
132 110 66 103 44 52 126 108 109 59 96 85
6 24 127 39 33 133 104 51 2 112 63 35
Table 3: The rank of importance for each attention head in the GPT-2 small model. The rows represent the layer of attention blocks in the model whereas the columns represent attention heads per layer.
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
139139 115115 3636 376376 354354 132132 355355 357357 109109 7070 148148 8484 217217 156156 309309 363363
7777 55 2828 126126 4141 9595 266266 5757 286286 316316 6161 258258 5959 1313 138138 202202
268268 131131 2121 209209 367367 4646 8686 228228 4545 2929 2222 6363 4848 7676 155155 1818
170170 222222 346346 235235 3838 1212 257257 301301 172172 112112 193193 121121 188188 2626 238238 103103
294294 4747 198198 102102 315315 6464 291291 4040 246246 375375 6666 151151 292292 293293 264264 165165
199199 272272 250250 364364 285285 226226 192192 343343 1717 114114 119119 260260 237237 166166 6262 6868
5050 1616 360360 185185 2020 368368 359359 130130 350350 7272 88 289289 267267 251251 310310 77
279279 8888 216216 7979 141141 256256 8585 8383 101101 8282 204204 232232 243243 142142 107107 5555
5353 284284 195195 129129 253253 133133 9797 154154 137137 262262 281281 6060 203203 177177 3131 248248
5858 261261 366366 2525 135135 227227 320320 333333 197197 9494 242242 125125 6565 1515 273273 117117
255255 271271 160160 5454 4949 269269 303303 140140 176176 296296 167167 311311 110110 9292 290290 239239
143143 241241 158158 325325 7474 299299 5656 342342 287287 225225 214214 66 372372 9898 150150 100100
183183 182182 5252 370370 1919 1111 186186 113113 240240 353353 184184 3434 312312 297297 179179 259259
3232 210210 358358 212212 331331 6767 371371 230230 330330 116116 304304 263263 159159 278278 163163 149149
8787 324324 208208 334334 356356 220220 319319 162162 136136 236236 6969 108108 327327 190190 337337 2727
383383 1414 44 106106 300300 275275 9696 7171 207207 7575 352352 1010 221221 178178 307307 180180
339339 174174 336336 340340 4444 191191 9999 317317 3939 244244 361361 249249 274274 7373 206206 377377
362362 341341 201201 345345 231231 219219 3535 111111 205205 8080 9393 2323 347347 168168 229229 3030
378378 146146 171171 145145 33 181181 321321 196196 124124 164164 152152 120120 224224 276276 382382 2424
3333 302302 128128 7878 252252 298298 189189 144144 344344 4242 105105 349349 329329 288288 104104 283283
369369 3737 9090 247247 295295 305305 8181 314314 318318 215215 173173 123123 313313 254254 365365 306306
277277 4343 381381 118118 373373 280280 233233 380380 234234 270270 374374 211211 335335 282282 323323 332332
157157 322322 22 5151 326326 161161 147147 200200 218218 338338 348348 213213 379379 8989 153153 122122
187187 194194 265265 175175 134134 11 0 351351 169169 127127 328328 9191 308308 223223 245245 99
Table 4: The rank of importance for each attention head in the GPT-2 medium model. The rows represent the layer of attention blocks in the model whereas the columns represent attention heads per layer.
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
3939 7979 318318 5353 587587 121121 273273 390390 8484 125125 433433 510510 653653 548548 379379 255255 193193 132132 555555 622622
567567 285285 553553 403403 481481 670670 375375 398398 399399 354354 298298 424424 502502 624624 665665 544544 709709 473473 620620 268268
219219 181181 585585 683683 621621 463463 326326 506506 434434 615615 339339 651651 100100 312312 497497 332332 412412 428428 1414 634634
1212 2222 633633 391391 436436 658658 9494 563563 614614 386386 355355 104104 174174 4040 9090 172172 7878 407407 328328 673673
149149 302302 105105 322322 474474 264264 469469 184184 3535 319319 177177 185185 295295 203203 296296 438438 666666 489489 543543 8787
606606 323323 209209 195195 579579 204204 5151 584584 267267 569569 590590 662662 395395 5555 9393 290290 552552 159159 508508 186186
163163 425425 225225 2525 448448 107107 531531 325325 5858 2727 503503 612612 9292 559559 287287 118118 352352 275275 251251 546546
681681 592592 523523 7777 494494 5656 549549 342342 368368 611611 528528 320320 558558 4747 308308 575575 248248 383383 671671 314314
134134 250250 340340 272272 712712 396396 171171 137137 610610 430430 238238 269269 2626 604604 211211 630630 261261 133133 532532 111111
3434 240240 388388 408408 337337 4444 568568 247247 359359 411411 672672 215215 258258 198198 410410 547547 472472 525525 581581 702702
617617 500500 566566 468468 365365 311311 692692 533533 422422 699699 660660 645645 265265 131131 643643 526526 657657 3636 189189 632632
116116 146146 8181 413413 625625 284284 161161 6262 460460 588588 627627 545545 330330 8888 293293 527527 650650 7171 303303 245245
372372 299299 357357 524524 640640 685685 1111 3131 674674 346346 565565 599599 194194 564564 439439 145145 196196 9797 260260 167167
688688 175175 224224 263263 4141 239239 114114 294294 331331 609609 202202 249249 373373 120120 550550 119119 454454 246246 9898 220220
173173 310310 179179 381381 377377 103103 479479 135135 516516 475475 148148 205205 401401 443443 169169 560560 218218 556556 654654 600600
637637 432432 301301 394394 6565 156156 123123 329329 6666 7070 214214 574574 22 1010 307307 153153 4646 112112 613613 237237
166166 542542 5050 155155 217217 117117 435435 305305 417417 4343 138138 102102 367367 327327 253253 570570 414414 343343 143143 207207
243243 647647 77 191191 698698 109109 141141 423423 598598 126126 182182 0 421421 1717 2424 5252 110110 8080 5959 9191
234234 488488 351351 256256 649649 113113 551551 668668 164164 400400 501501 128128 8383 282282 210210 317317 4545 511511 222222 6868
513513 274274 199199 594594 519519 348348 140140 168168 151151 157157 664664 188188 244244 449449 144144 347347 515515 324324 522522 619619
2323 122122 8282 656656 447447 358358 642642 11 1616 4242 589589 646646 233233 466466 304304 361361 3737 3232 9696 3030
2828 216216 397397 364364 562562 2929 101101 356356 471471 695695 270270 276276 162162 283283 170170 306306 277277 55 573573 486486
300300 493493 7474 517517 678678 402402 130130 576576 165165 459459 418418 44 426426 442442 583583 190190 66 291291 208208 3838
221221 6969 370370 8585 6060 178178 641641 491491 577577 682682 154154 106106 703703 392392 362362 5454 603603 9999 315315 371371
1313 349349 415415 313313 124124 427427 281281 338338 580580 420420 416416 380380 499499 644644 2020 384384 530530 183183 192192 7272
297297 286286 440440 648648 7575 201201 5757 561561 136136 152152 3333 242242 9595 108108 369369 697697 716716 229229 266266 534534
206206 477477 687687 538538 707707 127127 706706 280280 6464 409409 669669 1919 496496 200200 376376 490490 150150 498498 514514 616616
482482 158158 487487 444444 99 160160 197197 231231 406406 7676 2121 187187 180180 437437 693693 257257 139139 467467 321321 504504
223223 462462 485485 419419 572572 591591 142142 288288 363363 350350 719719 536536 704704 718718 677677 334334 675675 680680 445445 623623
686686 405405 701701 539539 88 596596 278278 582582 230230 456456 341341 529529 710710 241241 344344 8686 458458 652652 635635 541541
6363 8989 366366 382382 476476 232232 638638 271271 453453 235235 227227 404404 262262 7373 6767 717717 374374 228228 512512 212212
605605 601601 1515 484484 389389 33 176176 483483 509509 667667 495495 345345 393393 289289 309309 259259 115115 387387 446446 129129
4848 465465 557557 385385 1818 535535 628628 705705 554554 602602 478478 492492 316316 360360 236236 521521 254254 684684 252252 586586
689689 571571 464464 607607 691691 292292 450450 696696 4949 520520 333333 626626 505505 713713 700700 631631 452452 593593 694694 335335
636636 578578 147147 6161 714714 279279 455455 708708 378378 655655 595595 480480 461461 679679 518518 597597 451451 659659 661661 507507
608608 540540 226226 639639 431431 336336 676676 213213 470470 715715 618618 441441 711711 690690 353353 457457 629629 537537 429429 663663
Table 5: The rank of importance for each attention head in the GPT-2 large model. The rows represent the layer of attention blocks in the model whereas the columns represent attention heads per layer.
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
131131 329329 685685 483483 1,1631{,}163 980980 687687 210210 590590 3535 636636 1,1381{,}138 8484 203203 800800 611611 993993 710710 1,1351{,}135 822822 158158 828828 122122 869869 1,0521{,}052
1,1341{,}134 580580 9898 849849 915915 900900 468468 564564 1,0611{,}061 141141 367367 529529 725725 489489 961961 352352 823823 260260 1,0271{,}027 841841 522522 547547 209209 521521 226226
1,0861{,}086 699699 850850 735735 268268 840840 647647 1,0191{,}019 985985 1,0681{,}068 757757 138138 1,0281{,}028 690690 1,0331{,}033 762762 708708 499499 811811 635635 193193 256256 441441 720720 820820
294294 414414 753753 261261 743743 3030 244244 1,0321{,}032 598598 388388 691691 1,0631{,}063 6262 574574 628628 325325 1,0371{,}037 548548 785785 538538 599599 763763 803803 380380 798798
867867 799799 524524 304304 223223 858858 8585 814814 937937 911911 646646 838838 614614 106106 6161 6666 133133 172172 836836 173173 427427 262262 144144 1,0851{,}085 749749
931931 342342 276276 1,0561{,}056 1,0481{,}048 686686 531531 101101 904904 116116 1,0731{,}073 923923 295295 550550 880880 376376 473473 1,1481{,}148 429429 501501 410410 422422 313313 587587 895895
802802 862862 257257 853853 1,0671{,}067 1,1741{,}174 280280 661661 563563 208208 1,0031{,}003 946946 950950 2626 702702 629629 813813 440440 252252 361361 975975 493493 660660 1,1121{,}112 790790
568568 818818 609609 321321 772772 1,0651{,}065 196196 1,0871{,}087 847847 1,0921{,}092 1,0041{,}004 945945 165165 758758 246246 949949 553553 1,1081{,}108 387387 963963 582582 285285 572572 876876 514514
341341 891891 123123 509509 195195 269269 697697 404404 534534 666666 825825 731731 1,1151{,}115 1,0791{,}079 199199 463463 701701 674674 546546 642642 667667 936936 736736 1,0691{,}069 781781
567567 1,0131{,}013 274274 600600 353353 864864 496496 637637 729729 922922 604604 952952 976976 1,0661{,}066 474474 754754 791791 874874 999999 318318 243243 1,0551{,}055 909909 832832 328328
556556 623623 1,0061{,}006 767767 1,0761{,}076 594594 998998 833833 616616 643643 222222 1,0751{,}075 776776 227227 1,0431{,}043 535535 877877 902902 349349 752752 533533 494494 436436 815815 301301
1,0241{,}024 296296 288288 603603 898898 613613 872872 1,0901{,}090 334334 482482 1,0401{,}040 419419 537537 1,0941{,}094 464464 1,0421{,}042 982982 1,0501{,}050 719719 507507 236236 644644 459459 395395 504504
1,1431{,}143 467467 391391 694694 1,1361{,}136 848848 924924 656656 372372 442442 300300 444444 1,0161{,}016 715715 184184 491491 997997 392392 751751 1,1471{,}147 801801 174174 511511 704704 330330
323323 851851 189189 204204 322322 968968 7878 525525 714714 2727 544544 416416 439439 1,0541{,}054 552552 727727 1,0781{,}078 559559 804804 761761 506506 1,1181{,}118 215215 390390 358358
816816 1,0381{,}038 118118 162162 3838 381381 485485 201201 309309 368368 933933 9595 284284 183183 760760 396396 889889 755755 8282 649649 111111 282282 446446 455455 221221
1,1641{,}164 335335 806806 454454 912912 247247 943943 5959 449449 756756 526526 4848 402402 1,1211{,}121 919919 212212 412412 837837 684684 277277 497497 156156 819819 831831 437437
103103 782782 8686 33 586586 665665 827827 595595 765765 695695 431431 766766 187187 100100 348348 167167 344344 709709 645645 1,0741{,}074 251251 7171 640640 479479 558558
224224 592592 216216 1,0021{,}002 161161 658658 308308 887887 555555 420420 317317 258258 995995 528528 339339 4646 153153 302302 1,0111{,}011 265265 255255 272272 241241 9999 7575
530530 676676 428428 516516 808808 305305 712712 6464 705705 155155 225225 393393 7676 596596 287287 397397 6969 9292 137137 220220 634634 266266 360360 430430 248248
927927 659659 333333 458458 857857 0 235235 242242 978978 424424 117117 389389 185185 120120 1414 351351 4444 878878 996996 152152 830830 487487 654654 905905 128128
942942 168168 477477 232232 698698 855855 602602 178178 495495 386386 327327 177177 5151 22 1,0451{,}045 331331 8888 519519 403403 413413 8181 132132 608608 369369 470470
1,0351{,}035 575575 110110 213213 157157 3939 707707 1,0391{,}039 139139 1,1261{,}126 466466 411411 211211 121121 638638 316316 4545 159159 1,0721{,}072 515515 1,0711{,}071 633633 175175 539539 4949
956956 401401 9696 597597 721721 965965 663663 160160 217217 3636 768768 143143 472472 194194 320320 5656 170170 297297 2525 113113 696696 984984 566566 2323 681681
283283 462462 5353 4040 517517 127127 626626 125125 1212 291291 8383 11 1,0201{,}020 5555 3333 443443 310310 286286 500500 958958 150150 6565 197197 108108 839839
778778 682682 1010 332332 693693 114114 180180 679679 346346 383383 169169 9090 289289 237237 706706 182182 503503 1,0091{,}009 347347 480480 861861 319319 1,1371{,}137 377377 119119
783783 409409 171171 202202 452452 1616 130130 856856 164164 303303 565565 866866 253253 44 475475 1,1441{,}144 250250 451451 773773 711711 1919 1,0531{,}053 1,0211{,}021 746746 1,0931{,}093
66 275275 967967 1,1561{,}156 908908 1515 2929 188188 780780 415415 1,0601{,}060 206206 2424 846846 4747 668668 200200 536536 281281 577577 860860 615615 1,0641{,}064 883883 438438
104104 703703 270270 399399 9494 179179 234234 1,0571{,}057 578578 759759 7777 445445 627627 371371 884884 733733 5050 518518 264264 1,1041{,}104 1,0341{,}034 523523 826826 1,1231{,}123 8080
99 621621 142142 146146 354354 312312 363363 238238 259259 1,0231{,}023 935935 450450 1,1841{,}184 239239 407407 1,1021{,}102 7373 357357 741741 425425 207207 307307 769769 8989 421421
292292 576576 2222 109109 688688 405405 540540 214214 988988 983983 770770 498498 293293 896896 810810 794794 271271 6363 1,0811{,}081 2828 398398 744744 502502 7070 914914
240240 879879 508508 1,0361{,}036 151151 1,1031{,}103 364364 5858 433433 492492 112112 434434 88 873873 350350 1,1951{,}195 871871 273273 1,1801{,}180 417417 5252 1,1411{,}141 617617 1,1751{,}175 620620
1,1671{,}167 954954 1,1101{,}110 510510 135135 1,1721{,}172 105105 724724 870870 3131 892892 298298 824824 692692 340340 657657 554554 994994 486486 315315 562562 717717 115115 974974 448448
1,0151{,}015 934934 9191 549549 1,1591{,}159 581581 453453 356356 835835 418418 732732 513513 888888 102102 55 488488 490490 906906 859859 1,1401{,}140 9797 447447 570570 4343 821821
1,0311{,}031 481481 1,1301{,}130 456456 1,1931{,}193 149149 3232 683683 734734 166166 1,1621{,}162 205205 573573 370370 355355 233233 951951 1,1391{,}139 817817 662662 147147 561561 700700 219219 972972
903903 1,1291{,}129 136136 986986 560560 7272 1,1541{,}154 716716 957957 4141 672672 625625 190190 973973 145145 1,0891{,}089 894894 605605 1,1811{,}181 916916 1,0821{,}082 218218 469469 484484 4242
1,1281{,}128 932932 579579 362362 385385 249249 793793 1,0831{,}083 962962 723723 788788 435435 632632 1,1501{,}150 601601 1,0471{,}047 1,0911{,}091 229229 3434 738738 326326 191191 730730 532532 588588
465465 1,0051{,}005 1,1491{,}149 722722 882882 829829 1,0991{,}099 1,1511{,}151 1111 359359 2020 792792 797797 1,1311{,}131 737737 726726 148148 1,0981{,}098 718718 198198 365365 8787 713713 299299 675675
1,0221{,}022 471471 461461 163163 610610 655655 337337 1,1941{,}194 1,0771{,}077 991991 1,0441{,}044 1,1791{,}179 5757 1,0121{,}012 426426 777777 1,0181{,}018 545545 1,1701{,}170 750750 1,1761{,}176 126126 886886 901901 343343
669669 408408 140140 928928 378378 779779 254254 899899 947947 314314 382382 520520 324324 1,1111{,}111 955955 1,0951{,}095 622622 2121 338338 893893 278278 689689 311311 925925 107107
673673 263263 1,0291{,}029 748748 1,1581{,}158 6060 940940 1,1921{,}192 742742 843843 1,0581{,}058 345345 457457 966966 868868 854854 1,0251{,}025 366366 245245 591591 885885 787787 929929 740740 680680
1,0081{,}008 186186 1,1691{,}169 154154 1,1241{,}124 1,0621{,}062 541541 960960 379379 181181 653653 897897 1,1971{,}197 1,0001{,}000 1,0461{,}046 1,0961{,}096 786786 1,1011{,}101 807807 306306 845845 569569 6767 1,1191{,}119 959959
1,1551{,}155 5454 969969 1818 970970 863863 964964 989989 9393 739739 1313 842842 375375 423423 651651 917917 374374 1,1051{,}105 1,0301{,}030 907907 987987 812812 1,1731{,}173 795795 583583
1717 589589 279279 1,1851{,}185 938938 134134 1,1451{,}145 944944 593593 1,1171{,}117 747747 571571 875875 913913 267267 948948 3737 6868 630630 400400 129129 585585 124124 1,0411{,}041 1,1661{,}166
910910 1,1531{,}153 1,1821{,}182 920920 1,1601{,}160 671671 1,1991{,}199 618618 1,1091{,}109 639639 7474 1,1881{,}188 527527 881881 1,0491{,}049 612612 771771 953953 1,0881{,}088 1,1961{,}196 1,1681{,}168 1,1521{,}152 290290 926926 764764
1,1651{,}165 1,1001{,}100 512512 844844 1,0591{,}059 784784 336336 1,1071{,}107 1,1201{,}120 852852 992992 641641 1,1271{,}127 670670 981981 1,1781{,}178 1,0011{,}001 373373 921921 230230 918918 939939 1,1711{,}171 1,1831{,}183 1,1871{,}187
789789 1,0141{,}014 1,1981{,}198 228228 631631 551551 930930 971971 460460 432432 1,1221{,}122 1,0701{,}070 406406 1,1421{,}142 774774 619619 805805 557557 728728 607607 678678 1,1131{,}113 796796 1,1911{,}191 1,0171{,}017
1,0261{,}026 584584 1,0071{,}007 648648 941941 476476 1,1331{,}133 650650 1,1861{,}186 624624 1,1901{,}190 1,1321{,}132 1,1141{,}114 775775 384384 1,0511{,}051 1,1891{,}189 1,1251{,}125 1,1061{,}106 1,0101{,}010 1,0971{,}097 1,1611{,}161 1,0801{,}080 990990 890890
7979 1,1571{,}157 652652 231231 1,1771{,}177 176176 542542 192192 1,1161{,}116 1,1461{,}146 505505 478478 745745 977977 394394 677677 664664 865865 979979 543543 606606 1,0841{,}084 77 809809 834834
Table 6: The rank of importance for each attention head in the GPT-2 XL model. The rows represent the layer of attention blocks in the model whereas the columns represent attention heads per layer.