MulCogBench: A Multi-modal Cognitive Benchmark Dataset for Evaluating Chinese and English Computational Language Models

Yunhao Zhang1,2,†, Xiaohan Zhang1,2,†, Chong Li1,2,†, Shaonan Wang1,2, Chengqing Zong1,2
1State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, CAS
2School of Artificial Intelligence, University of Chinese Academy of Sciences
{zhangyunhao2021, lichong2021}@ia.ac.cn
{xiaohan.zhang, shaonan.wang, cqzong}@nlpr.ia.ac.cn

Abstract
Pre-trained computational language models have recently made remarkable progress in mastering language abilities once considered unique to humans. Their success has raised interest in whether these models represent and process language as humans do.
To answer this question, this paper proposes MulCogBench, a multi-modal cognitive benchmark dataset collected from native Chinese and English participants. It encompasses a variety of cognitive data, including subjective semantic ratings, eye-tracking, functional magnetic resonance imaging (fMRI), and magnetoencephalography (MEG).
To assess the relationship between language models and cognitive data, we conducted a similarity-encoding analysis, which reconstructs cognitive data based on its pattern similarity with textual embeddings.
Results show that language models share significant similarities with human cognitive data, and that the similarity patterns are modulated by the data modality and stimulus complexity. Specifically, context-aware models outperform context-independent models as the complexity of the language stimuli increases.
The shallow layers of context-aware models align better with the high-temporal-resolution MEG signals, whereas the deeper layers show more similarity with the high-spatial-resolution fMRI. These results indicate that language models bear a nuanced relationship with brain language representations.
Moreover, the results between Chinese and English are highly consistent, suggesting the generalizability of these findings across languages.
Keywords: computational language models, cognitive data, similarity-encoding
1. Introduction
The ability to use language has long been considered unique to the human species. However, emerging pre-trained language models have achieved super-human performance on various language tasks (Zhao et al., 2023; Huang and Chang, 2022; Bang et al., 2023). Their success has elicited a flurry of research and debate about the following question: how similar are the working mechanisms of these language models to those of the human brain (Blank, 2023)? To answer this question, high-quality cognitive data collected while human participants understand language is indispensable.
Human cognitive data, including brain activation and behavioral responses, reflect how the human brain processes language. There is evidence that the representations of computational language models, even though trained solely on large text corpora, share similarities with the brain activation elicited by language (Gauthier and Levy, 2019; Hashemzadeh et al., 2020; Pasquiou et al., 2022). Moreover, such similarity patterns are affected by factors including model architectures and loss functions (Caucheteux and King, 2022). These findings suggest that language models may mimic the human brain in language processing to different degrees. The more similar the representations of a language model are to human cognitive data, the more likely it is that the two systems share the same mechanisms. Therefore, cognitive data not only plays an important role in studying the brain mechanisms of language but can also serve as a criterion for the cognitive plausibility of computational models.
A few cognitive benchmark datasets exist (Anderson et al., 2013; Xu et al., 2016; Hollenstein et al., 2019). However, the available datasets are either too small or focus only on English. Moreover, evaluating computational models with cognitive data is still rarely done outside psycholinguistics and neurolinguistics. Accordingly, it remains unknown 1) whether computational models have a better-aligned cognitive modality, 2) whether their mechanisms for processing different linguistic units resemble those of humans, and 3) whether the relationship between computational models and cognitive data generalizes across languages.
To answer the above questions, this paper presents MulCogBench, a multi-modal cognitive benchmark dataset for evaluating Chinese and English language models. MulCogBench-Chinese encompasses cognitive data in four modalities collected from native Chinese speakers, including behavioral data (here, word semantic ratings and eye-tracking) and brain imaging data (here, fMRI and MEG). The language stimuli used to collect these data range from words to discourses. MulCogBench-English involves three modalities, i.e., word semantic ratings, eye-tracking, and fMRI. To evaluate the similarities between the representations of computational models and the human brain, we conduct experiments on four classic computational language models, namely Word2Vec, GloVe, BERT, and GPT-2. Specifically, we employ similarity-encoding analysis to compute the representational similarities between language models and cognitive data. In particular, for the high-spatial-resolution fMRI, we perform a fine-grained ROI (region of interest) level analysis based on the functional division of the brain.
Results show that computational models have significant similarity with human cognitive data, and that the degree of similarity is modulated by cognitive modality and linguistic unit. Specifically, we find that (1) across cognitive modalities and linguistic units, the most-similar models and the variation tendencies within context-aware models are highly consistent between Chinese and English; (2) from word to discourse, the advantage of context-aware models over context-independent models increases, suggesting that these models are more human-like in encoding complex language structure but not in encoding basic word-level information; (3) the similarity patterns between models and cognitive data vary across cognitive modalities, with the shallow layers of context-aware models being more similar to MEG and the deeper layers better aligned with fMRI, indicating that different layers may simulate different aspects of the human language mechanism. Our results demonstrate that exploring the relationship between computational models and human cognitive data can help explain the mechanisms of both computational models and the human brain.
All the cognitive data in MulCogBench will be released in the form that can be directly used for the computational model evaluation.
Table 1: Overview of the cognitive data in MulCogBench.

Language | Modality | Source | Stimuli | Unit | Tokens
---|---|---|---|---|---
Chinese | word semantic rating | CRSF (Wang et al., 2022b) | text | word | 672
Chinese | word fMRI | CRSF (Wang et al., 2022b) | text | word | 672
Chinese | eye-tracking | Zhang et al. (2022) | text | sentence | 170,331
Chinese | discourse fMRI | SMN4Lang (Wang et al., 2022a) | audio | discourse | 52,269
Chinese | discourse MEG | SMN4Lang (Wang et al., 2022a) | audio | discourse | 52,269
English | word semantic rating | Binder et al. (2016) | text | word | 535
English | word fMRI | Pereira et al. (2018) | text | word | 180
English | eye-tracking | ZuCo (Hollenstein et al., 2018, 2020) | text | sentence | 36,767
English | discourse fMRI | Zhang et al. (2020) | audio | discourse | 47,356
2. Cognitive Data
Here we describe the sources of the Chinese and English cognitive data in MulCogBench and how to use them. MulCogBench includes five cognitive modalities: eye-tracking, word semantic ratings, word fMRI, discourse fMRI, and discourse MEG (available only for Chinese). See Table 1 for details. All cognitive data were collected with the approval of the Institutional Review Board.
2.1. MulCogBench-Chinese
Eye-tracking
Eye-tracking records fine-grained temporal eye movements during reading, providing information for studying the cognitive mechanisms underlying reading (Rayner, 1998, 2009). For example, early syntactic processing and lexical access are captured by early gaze measures taken the first time a word is fixated.
We adopt a Chinese eye-tracking database obtained from 1,718 participants across 57 reading experiments (Zhang et al., 2022). It contains 7,577 natural Chinese sentences and 8,551 different words. The sentences range in length from 15 to 35, with an average length of 22.48. Nine word-level eye-tracking features are provided: First Fixation Duration (FFD), Gaze Duration (GD), First-Pass reading Fixated proportion (FPF), Fixation Number (FN), proportion of Regressions In (RI), proportion of Regressions Out (RO), saccade length toward the target from the left (LI_left), saccade length from the target to the right (LO_right), and Total fixation duration (TT). All features are used in this study for a comprehensive investigation of the similarity between the human brain and language models.
Word semantic ratings
To study how the brain represents word meaning, an essential starting point is to define a set of basic semantic features (Binder et al., 2016). To evaluate whether computational models represent word semantics in a way similar to humans, we adopt a Chinese semantic rating dataset that uses 54 semantic features defined from the functional division of the human brain, comprising both perceptual features, such as vision and motor, and more abstract features, such as social and emotion (Wang et al., 2022b). This dataset includes ratings on all 54 semantic features for 672 words. Each semantic feature of a word was rated on a 1-7 scale (1 means the feature is least associated with the word, 7 means the strongest association) by 30 Chinese participants, and the ratings were averaged across participants as the final score.
Word fMRI
As a noninvasive technique, fMRI measures blood-oxygen-level-dependent (BOLD) signals, which reflect neural activation through changes in blood flow. Here we use a Chinese fMRI dataset (Wang et al., 2022b) collected from 11 participants while they read the same 672 words as in the word semantic ratings above. During fMRI collection, each word was presented 6 times, each time with a unique corresponding image. The fMRI data were preprocessed with fMRIPrep (Esteban et al., 2019), and we conducted a first-level analysis to obtain the neural activation for each word.
Discourse fMRI
The discourse fMRI data were collected from 12 native Chinese speakers while they listened to 60 stories (Wang et al., 2022a). Each participant listened to all 60 stories, each exactly once. During fMRI collection, participants were instructed to stay still and listen carefully to the stimuli. Each story lasts from 4 to 7 minutes. The stories contain 52,269 words in total, with a vocabulary size of 9,153. The collected fMRI data were preprocessed following the HCP pipeline (Glasser et al., 2013).
Discourse MEG
MEG is also a noninvasive brain mapping technique, but it measures the magnetic fields produced by the electrical activity of neurons. Unlike fMRI, MEG data have a relatively lower spatial resolution, typically with 300 or more sensors covering the head, but a high temporal resolution (millisecond timescale). The MEG data we use were collected from the same 12 participants as the fMRI data described above (Wang et al., 2022a), with the same 60 stories as stimuli. The MEG data comprise 306 sensors. For each word, we compute its MEG response by averaging the MEG signal in a 200 ms sliding window within 1 second after the word offset. The window moves in steps of 100 ms, resulting in 9 response windows per word.
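To make the windowing concrete, below is a minimal sketch of this computation. It assumes the preprocessed MEG signal for one participant is available as a NumPy array `meg` of shape (306 sensors, n_samples) with a hypothetical sampling rate of 1,000 Hz; the array name, function name, and sampling rate are illustrative, not taken from the original dataset.

```python
import numpy as np

def word_meg_response(meg: np.ndarray, offset_idx: int, sfreq: int = 1000,
                      win_ms: int = 200, step_ms: int = 100,
                      horizon_ms: int = 1000) -> np.ndarray:
    """Average the MEG signal in 200 ms windows sliding by 100 ms within
    1 s after the word offset (sample index `offset_idx`): 9 windows/word."""
    win = win_ms * sfreq // 1000
    step = step_ms * sfreq // 1000
    horizon = horizon_ms * sfreq // 1000
    responses = []
    for start in range(0, horizon - win + 1, step):  # starts at 0, 100, ..., 800 ms
        seg = meg[:, offset_idx + start : offset_idx + start + win]
        responses.append(seg.mean(axis=1))           # one (306,) vector per window
    return np.stack(responses)                       # shape (9, 306)
```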
2.2. MulCogBench-English
Eye-tracking
The ZuCo 1.0 and 2.0 eye-tracking databases serve as the English eye-tracking data (Hollenstein et al., 2018, 2020). They comprise two reading tasks, normal reading and task-specific reading, with 30 subjects across the experiments. Because computational language models take raw text as input rather than solving specific tasks, only the eye-tracking data from the normal reading task, containing 1,049 sentences, are used in our experiments. Six word-level eye-tracking features are extracted from the raw recordings: Gaze Duration (GD), Total Reading Time (TRT), First Fixation Duration (FFD), Single Fixation Duration (SFD), Go-Past Time (GPT), and the number of fixations (nFix).
Word semantic ratings
The English semantic dataset we use was proposed by Binder et al. (2016); it includes 535 concepts, each rated on 65 semantic features. As in the Chinese semantic rating data, each semantic feature of a word is the average score of 30 participants. Unlike the Chinese data, this dataset was annotated with a saliency score on a 0-6 scale.
Word fMRI
We use an English word fMRI dataset containing 180 words: 131 nouns, 22 verbs, 21 adjectives, and 6 adverbs (Pereira et al., 2018). The data were collected from 15 native English speakers while they read each word alongside a corresponding image. Each word was presented 4 to 6 times to each participant, with a unique picture each time. After preprocessing, a first-level analysis was conducted to obtain the neural activation for each word.
Discourse fMRI
The English discourse fMRI data were collected from 19 native English speakers while they listened to 51 English stories (Zhang et al., 2020). Each participant listened to a subset of the stories, and each story lasts from 4 to 13 minutes. In total, the story stimuli include 47,356 words, forming a vocabulary of 5,228 words. The fMRI data were also preprocessed following the HCP pipeline (Glasser et al., 2013).
3. Computational Language Models
We adopt four of the most commonly used computational language models, which fall into two groups. The first group comprises context-independent models, including Word2Vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014), which compute word representations from word co-occurrence in raw text. The second comprises context-aware models, including BERT (Cui et al., 2020; Devlin et al., 2019) and GPT-family models (Radford et al., 2019). Among these, BERT is an autoencoding language model trained bidirectionally to predict masked tokens, whereas GPT is an autoregressive language model trained to predict the next token from the preceding text.
For Chinese, the Word2Vec and GloVe embeddings were both trained on the Xinhua News corpus (19.7 GB; http://www.xinhuanet.com/whxw.htm) with the same model parameters (Skip-Gram architecture, 15 negative samples in Word2Vec, window width of 2, embedding dimension of 300). For English, the Word2Vec and GloVe embeddings were trained on a Wikipedia corpus (13 GB; https://dumps.wikimedia.org/enwiki/latest) with the same parameters as the Chinese models. The pre-trained MacBERT (https://huggingface.co/hfl/chinese-macbert-base) and GPT-2 (https://huggingface.co/uer/gpt2-chinese-cluecorpussmall) models for Chinese and the BERT (https://huggingface.co/bert-base-uncased) and GPT-2 (https://huggingface.co/gpt2) models for English were downloaded from HuggingFace. Both the BERT-based and GPT-based models have 12 hidden layers, all of which were used in the experiments. See Table 2 for detailed model parameters.
Table 2: Parameters of the computational language models.

Language | Model | Dim | Layers
---|---|---|---
Chinese | Word2Vec | 300 | 1
Chinese | GloVe | 300 | 1
Chinese | MacBERT | 768 | 12
Chinese | GPT-2 | 768 | 12
English | Word2Vec | 300 | 1
English | GloVe | 300 | 1
English | BERT-base-uncased | 768 | 12
English | GPT-2 | 768 | 12
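As a hedged illustration of how layer-wise embeddings can be obtained from these HuggingFace checkpoints, the sketch below mean-pools token representations per layer. The paper does not specify its pooling strategy, so mean-pooling and the function name `layer_embeddings` are assumptions.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# The English BERT checkpoint listed above; swap in the Chinese MacBERT or
# GPT-2 checkpoints to extract their embeddings in the same way.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

@torch.no_grad()
def layer_embeddings(text: str):
    """Return one vector per hidden layer (12 for BERT-base) by mean-pooling
    the token representations of `text`."""
    inputs = tokenizer(text, return_tensors="pt")
    hidden_states = model(**inputs).hidden_states   # embedding layer + 12 layers
    return [h.mean(dim=1).squeeze(0) for h in hidden_states[1:]]  # drop layer 0
```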
4. Evaluation Methods
To evaluate how human-like the computational language models are, we conducted a similarity-encoding analysis (SEA), which reconstructs the cognitive data based on the representational similarity between the cognitive data and the textual embeddings of the models (Figure 1). Specifically, SEA computes the similarity between the embeddings of the stimulus words and reconstructs the cognitive data of each word by summing the cognitive data of the other words weighted by the embedding similarities. The motivation is that if a computational model encodes linguistic information as humans do, the similarity patterns of the textual embeddings and of the cognitive data should agree, and the reconstructed cognitive data would therefore be closer to the original data. The procedure of SEA is as follows:
First, for each model we have a set of word or sentence embeddings $\{e_1, \dots, e_n\}$. For each pair of embeddings $(e_i, e_j)$, the similarity is measured by the Pearson correlation coefficient $r(e_i, e_j)$. We thus obtain a similarity matrix $S \in \mathbb{R}^{n \times n}$, where $S_{ij} = r(e_i, e_j)$.
Then, for the cognitive data, we assume that if a cognitive representation encodes the same information as the embeddings, the similarity structure of the embeddings and that of the cognitive vectors should match. We can therefore predict each cognitive vector by multiplying the similarity matrix with the cognitive vectors. To remove the contribution of the ground-truth value itself (note that $S_{ii} = 1$), we subtract the real cognitive vectors from the product:

$$\hat{C} = SC - C \qquad (1)$$

where $C \in \mathbb{R}^{n \times d}$ stacks the real cognitive vectors and $\hat{C}$ the predicted cognitive vectors.
Finally, the Pearson correlation between each predicted and real cognitive vector is computed to evaluate their similarity:

$$\rho_i = r(\hat{c}_i, c_i) \qquad (2)$$
A higher correlation score means that the information in the cognitive data is better encoded in the specific computational model.
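A minimal sketch of this procedure in Python is shown below, assuming `E` holds the model embeddings (n items × embedding dim) and `C` the corresponding cognitive vectors (n items × cognitive dim); the names, the use of NumPy, and the final averaging over items are illustrative.

```python
import numpy as np

def sea_score(E: np.ndarray, C: np.ndarray) -> float:
    # Similarity matrix S: Pearson correlation between every pair of embeddings.
    S = np.corrcoef(E)                                     # shape (n, n)
    # Eq. (1): similarity-weighted sum of the other items' cognitive vectors;
    # subtracting C removes each item's own contribution, since S_ii = 1.
    C_pred = S @ C - C
    # Eq. (2): correlate predicted and real vectors, then average over items.
    scores = [np.corrcoef(C_pred[i], C[i])[0, 1] for i in range(C.shape[0])]
    return float(np.mean(scores))
```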
The procedure for conducting SEA differs slightly for each modality of cognitive data due to their unique properties.
Eye-tracking
For the eye-tracking features, we concatenated all sentences together and conducted SEA for each eye-tracking feature.
Word semantic rating
For the semantic features (54 in Chinese, 65 in English), we computed the correlation for each feature and averaged across features as the final result.
Word fMRI
For word fMRI, we have one brain activation vector per word. We conducted an ROI-level analysis (the ROIs follow Beam et al. (2021)). To select the most informative voxels in each ROI, we trained regression models to predict the word embeddings from each voxel together with its 26 adjacent voxels in 3D. The correlation between the true and predicted embeddings was computed and taken as the informativeness score of that voxel. The top 10% of voxels with the highest scores within each ROI were used for the SEA analysis.
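A hedged sketch of this scoring step is given below. It assumes `betas` (n_words × n_voxels) holds the word-level activations, `E` (n_words × emb_dim) the embeddings, and `neighbors(v)` returns the indices of voxel v plus its 26 3D neighbors; the ridge penalty, 5-fold cross-validation, and per-dimension correlation averaging are assumptions, since the paper only specifies a regression model.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_predict

def informativeness(betas: np.ndarray, E: np.ndarray, neighbors, v: int) -> float:
    # Predict the word embeddings from voxel v and its 26 adjacent voxels.
    X = betas[:, neighbors(v)]                               # (n_words, 27)
    E_hat = cross_val_predict(Ridge(alpha=1.0), X, E, cv=5)  # (n_words, emb_dim)
    # Score the voxel by the mean correlation between true and predicted
    # embedding values across dimensions.
    dims = E.shape[1]
    return float(np.mean([np.corrcoef(E[:, d], E_hat[:, d])[0, 1]
                          for d in range(dims)]))
```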
Discourse fMRI
For discourse fMRI, we also conducted the ROI-level analysis as in word fMRI and used voxel-wise encoding to choose highly informative voxels in each ROI. Because the BOLD response lasts for tens of seconds after neurons fire and the temporal resolution of fMRI is much lower than the word presentation rate, we first convolved the word embeddings with the canonical hemodynamic response function (HRF), which describes how the BOLD signal changes after neural firing, and downsampled the convolved features to the sampling rate of the fMRI. Then, for each voxel, we trained a linear regression model to predict its response from the downsampled features. Finally, the Pearson correlation between predicted and actual fMRI data was computed, and the top 10% of voxels with the highest correlations were chosen.
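The sketch below illustrates the convolution-and-downsampling step under stated assumptions: word embeddings are placed as impulses on a fine 10 Hz time grid, the fMRI TR is 2 s, and nilearn's Glover HRF serves as the canonical HRF; all three choices are illustrative rather than taken from the paper.

```python
import numpy as np
from nilearn.glm.first_level import glover_hrf

def hrf_features(word_onsets, E, duration_s, fine_hz=10, tr=2.0):
    """word_onsets: onset time (s) of each word; E: (n_words, dim) embeddings."""
    n_fine = int(duration_s * fine_hz)
    design = np.zeros((n_fine, E.shape[1]))
    for t, e in zip(word_onsets, E):                 # impulse of each embedding
        design[int(t * fine_hz)] += e
    hrf = glover_hrf(tr=1.0 / fine_hz, oversampling=1)   # HRF sampled at 10 Hz
    conv = np.apply_along_axis(lambda x: np.convolve(x, hrf)[:n_fine], 0, design)
    return conv[:: int(tr * fine_hz)]                # downsample: one row per TR
```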
Discourse MEG
To choose the most informative sensors and time windows in MEG, we trained linear regression models to predict the signal of each sensor in each sliding window from the word embeddings. In the experiments, we report only the top 5% of sensors and the time window that achieves the highest prediction accuracy.
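Below is a hedged sketch of this selection, assuming `meg_resp` has shape (n_words, 9 windows, 306 sensors) as produced by the windowing sketch in Section 2.1 and `E` holds the word embeddings; ridge regression and 5-fold cross-validation are again assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_predict

def sensor_window_scores(E: np.ndarray, meg_resp: np.ndarray) -> np.ndarray:
    n_words, n_wins, n_sensors = meg_resp.shape
    scores = np.zeros((n_wins, n_sensors))
    for w in range(n_wins):
        for s in range(n_sensors):
            y = meg_resp[:, w, s]                    # one sensor in one window
            y_hat = cross_val_predict(Ridge(alpha=1.0), E, y, cv=5)
            scores[w, s] = np.corrcoef(y, y_hat)[0, 1]
    # Downstream, keep the best-predicted window and its top 5% of sensors.
    return scores
```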
5. Results and Analysis
Figures 2 and 3 illustrate the evaluation results for each cognitive modality in Chinese and English, respectively. For cognitive modalities with multiple features, we average the correlations across all features as the final result. In most cognitive modalities, the SEA results are significantly higher than the chance level of 0, indicating that computational models share significant similarities with the cognitive data. However, the similarity patterns differ across cognitive modalities and linguistic units. In the following, we analyze how the cognitive modality and the linguistic unit modulate the similarity between language models and the cognitive data.
Comparison between Chinese and English
As shown in Figures 2 and 3, computational language models show significant correlations with cognitive data for both Chinese and English. More importantly, the similarity patterns of the two languages are highly consistent across cognitive modalities and linguistic units. Specifically, the best-performing models and the variation tendency from shallower to deeper layers within context-aware models are very similar between Chinese and English. In word-level fMRI, the correlations for both languages show no significant differences (throughout this section, significance refers to comparisons between any two layers of context-aware models, or between any layer of a context-aware model and a context-independent model); the only exception is that context-aware models significantly outperform GloVe in English. Moreover, as the complexity of the linguistic units increases, the consistency between Chinese and English becomes clearer. Since Chinese and English are two very different languages, these consistencies suggest that the relationship between computational models and cognitive data is largely generalizable across languages, at least between Chinese and English. The following analyses therefore build on these consistencies.
Although the results of the two languages are consistent in most circumstances, the correlation of GPT-2 drops dramatically at the 12th layer in English but not in Chinese. We conducted multiple checks to ensure the accuracy of this result and identified two possible reasons for the phenomenon. First, most dimensions of GPT-2 embeddings have values between -2 and 2, but several dimensions in the 12th layer of the English GPT-2 model take extremely large values, some exceeding 200. This is markedly different from the other 11 layers and was not observed in the 12th layer of the Chinese GPT-2 model. Second, our semantic rating score is an average over 65 semantic features, and the substantial drop was observed only for a specific subset of these features, suggesting that this subset may be particularly sensitive to the abnormal dimensions in the 12th layer.
Comparison between linguistic units
The most critical property of human language is its combinatorial nature (Ding et al., 2015). When smaller elements such as words are combined into larger structures such as sentences, different cognitive processes are involved. Accordingly, the similarity patterns between computational models and cognitive data also vary across linguistic units, as illustrated in Figures 2 and 3.
At the word level, although the exact patterns differ between the two modalities (semantic rating and fMRI), a striking phenomenon is that context-aware models do not outperform context-independent models in most cases. In word fMRI, there is no significant difference between the two types of models, except that the English context-aware models significantly outperform GloVe. In word semantic rating, although GloVe has the worst performance, the correlation of Word2Vec is significantly higher than that of the deep layers of both context-aware models. In contrast, at the sentence and discourse levels, context-aware models have a significant advantage over context-independent models: many layers of BERT/MacBERT and GPT-2 reach higher correlations than Word2Vec and GloVe. This advantage is especially pronounced at the discourse level, where all layers of BERT/MacBERT and GPT-2 except the first have higher correlations than GloVe and Word2Vec. These results suggest that context-aware models only become more human-like than context-independent models when processing complex linguistic units.
Apart from the differences between context-aware and context-independent models, the variation tendency from shallower to deeper layers within context-aware models also differs across the three linguistic units. In word semantic rating, the shallow layers of BERT/MacBERT and GPT-2 perform better than the deep layers, whereas for sentence- and discourse-level data, the middle and deep layers reach the highest correlations. These layer differences indicate that as the layers of context-aware models go deeper, the encoded information shifts from simpler to more complex linguistic units.
Comparison between the modalities of cognitive data
fMRI and MEG are direct measurements of brain signals, whereas semantic rating and eye-tracking are behavioral responses. These modalities of cognitive data reflect different aspects of the human language mechanism, so comparing model performance across them helps clarify how the models resemble that mechanism. Since model performance is modulated by the complexity of the linguistic unit, we analyze modality differences only within the same linguistic unit, i.e., we compare fMRI with semantic rating at the word level and fMRI with MEG at the discourse level.
At the word level, significant differences between models were found only in the semantic rating modality, which reflects how humans understand word semantics through explicit, subjective rating behavior, whereas fMRI is an objective signal of how the brain responds to words. A possible explanation for this modality difference is that decoding fMRI is harder than decoding semantic features from embeddings, so the model differences in encoding word-level information are concealed by the difficulty of decoding fMRI. At the discourse level, the middle and deep layers better predict fMRI while the shallow layers better predict MEG. Since fMRI is a high-spatial-resolution signal, a model that performs better on fMRI is more likely to resemble the brain's spatial activation patterns; since MEG is a high-temporal-resolution signal, a model that performs better on MEG is more likely to encode the rapid word-by-word changes.
These variations in similarity patterns across modalities suggest that different layers of the models may simulate different aspects of human language mechanisms. Further analysis of these variations may help improve models to process language with higher efficiency and accuracy.
Comparison between ROIs
Since the variation tendency within context-aware models is similar to that in Figures 2 and 3, we averaged the results over all layers of BERT/MacBERT and GPT-2 and focus on the overall performance of the models in each ROI. As shown in Figure 4, the performance of computational models in each ROI resembles the ROI-averaged results: context-aware models outperform context-independent models only at the discourse level, and the performance differences between models are consistent across all 6 ROIs. However, the ROI most correlated with the computational models shifts from Vision to Language as the linguistic unit changes from word to discourse. We do not attribute this change to a shift in brain mechanisms between linguistic units; rather, it likely correlates with the presentation form of the stimuli. During word-level fMRI collection, participants viewed the words together with related pictures, whereas during discourse fMRI collection, the stories were played while the screen showed nothing but a plus symbol.
Apart from the Vision and Language ROIs, brain regions involved in manipulation, cognition, and memory also show significant correlations with the computational models. These results suggest that the brain's foundation for language may be broader than the traditional language network.
6. Conclusion and Future Work
How similar the working mechanisms of computational language models are to those of the human brain has attracted great interest. This paper provides the MulCogBench dataset, which consists of rich cognitive data that can be used to explore this question from many angles. Specifically, this paper investigated representational similarity from three aspects: the modality of the cognitive data, the linguistic unit, and cross-linguistic commonality. We find that both the modality of the cognitive data and the linguistic unit affect the similarity patterns between computational models and cognitive data.
As a large-scale cognitive dataset, MulCogBench includes multiple features within each modality which can be used for fine-grained analysis. For instance, word semantic ratings can be used to explain what semantic features are encoded in textual embeddings and to explore whether these features in textual embeddings are organized in a similar way as encoded in the brain. Testing the similarity is the first step in exploring the underlying cognitive mechanisms. Hopefully, the MulCogBench will contribute to a better understanding of both the language mechanism of computational models and the human brain.
7. Limitations
Cognitive data such as eye-tracking and neuroimaging (fMRI and MEG) contain substantial noise, and each measure reflects multiple cognitive functions. For instance, first-pass reading measures in eye-tracking reflect not only the attention given to a word but also visual recognition and other word- or sentence-level processes. It is therefore difficult to attribute the cognitive data to a single cognitive function when explaining the mechanisms of computational language models. Nevertheless, these cognitive data are the best resources currently obtainable from the brain, and the evaluation methods used in this paper can be directly extended to other cognitive data should more advanced measurement technologies appear in the future.
In addition, the proposed MulCogBench includes cognitive data only in Chinese and English. Although the evaluation method is not language-specific, different conclusions might be reached given the varying properties of languages. Moreover, to show the effectiveness of the dataset, we take four representative language models as examples; more recent language models may lead to different conclusions about their relation to human cognitive data.
8. Bibliographical References

- Anderson et al. (2013) Andrew J. Anderson, Elia Bruni, Ulisse Bordignon, Massimo Poesio, and Marco Baroni. 2013. Of words, eyes and brains: Correlating image-based distributional semantic models with neural representations of concepts. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1960–1970, Seattle, Washington, USA. Association for Computational Linguistics.
- Bang et al. (2023) Yejin Bang, Samuel Cahyawijaya, Nayeon Lee, Wenliang Dai, Dan Su, Bryan Wilie, Holy Lovenia, Ziwei Ji, Tiezheng Yu, Willy Chung, Quyet V. Do, Yan Xu, and Pascale Fung. 2023. A multitask, multilingual, multimodal evaluation of ChatGPT on reasoning, hallucination, and interactivity. ArXiv, abs/2302.04023.
- Beam et al. (2021) Elizabeth Beam, Christopher Potts, Russell A Poldrack, and Amit Etkin. 2021. A data-driven framework for mapping domains of human neurobiology. Nature neuroscience, 24(12):1733–1744.
- Binder et al. (2016) Jeffrey R Binder, Lisa L Conant, Colin J Humphries, Leonardo Fernandino, Stephen B Simons, Mario Aguilar, and Rutvik H Desai. 2016. Toward a brain-based componential semantic representation. Cognitive neuropsychology, 33(3-4):130–174.
- Blank (2023) Idan Asher Blank. 2023. What are large language models supposed to model? Trends in Cognitive Sciences, 27:987–989.
- Caucheteux and King (2022) Charlotte Caucheteux and Jean-Rémi King. 2022. Brains and algorithms partially converge in natural language processing. Communications biology, 5(1):134.
- Cui et al. (2020) Yiming Cui, Wanxiang Che, Ting Liu, Bing Qin, Shijin Wang, and Guoping Hu. 2020. Revisiting pre-trained models for Chinese natural language processing. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 657–668, Online. Association for Computational Linguistics.
- Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
- Ding et al. (2015) Nai Ding, Lucia Melloni, Hang Zhang, Xing Tian, and David Poeppel. 2015. Cortical tracking of hierarchical linguistic structures in connected speech. Nature neuroscience, 19:158 – 164.
- Esteban et al. (2019) Oscar Esteban, Christopher J Markiewicz, Ross W Blair, Craig A Moodie, A Ilkay Isik, Asier Erramuzpe, James D Kent, Mathias Goncalves, Elizabeth DuPre, Madeleine Snyder, et al. 2019. fMRIPrep: a robust preprocessing pipeline for functional MRI. Nature Methods, 16(1):111–116.
- Gauthier and Levy (2019) Jon Gauthier and Roger Levy. 2019. Linking artificial and human neural representations of language. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 529–539, Hong Kong, China. Association for Computational Linguistics.
- Glasser et al. (2013) Matthew F. Glasser, Stamatios N. Sotiropoulos, J. Anthony Wilson, Timothy S. Coalson, Bruce Fischl, Jesper L. R. Andersson, Junqian Xu, Saâd Jbabdi, Matthew A. Webster, Jonathan R. Polimeni, David C. Van Essen, and Mark Jenkinson. 2013. The minimal preprocessing pipelines for the human connectome project. NeuroImage, 80:105–124.
- Hashemzadeh et al. (2020) Maryam Hashemzadeh, Greta Kaufeld, Martha White, Andrea E. Martin, and Alona Fyshe. 2020. From language to language-ish: How brain-like is an LSTM’s representation of nonsensical language stimuli? In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 645–656, Online. Association for Computational Linguistics.
- Hollenstein et al. (2019) Nora Hollenstein, Antonio de la Torre, Nicolas Langer, and Ce Zhang. 2019. CogniVal: A framework for cognitive word embedding evaluation. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), pages 538–549, Hong Kong, China. Association for Computational Linguistics.
- Hollenstein et al. (2018) Nora Hollenstein, Jonathan Rotsztejn, Marius Troendle, Andreas Pedroni, Ce Zhang, and Nicolas Langer. 2018. ZuCo, a simultaneous EEG and eye-tracking resource for natural sentence reading. Scientific Data, 5(1):1–13.
- Hollenstein et al. (2020) Nora Hollenstein, Marius Troendle, Ce Zhang, and Nicolas Langer. 2020. ZuCo 2.0: A dataset of physiological recordings during natural reading and annotation. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 138–146, Marseille, France. European Language Resources Association.
- Huang and Chang (2022) Jie Huang and Kevin Chen-Chuan Chang. 2022. Towards reasoning in large language models: A survey. ArXiv, abs/2212.10403.
- Mikolov et al. (2013) Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
- Pasquiou et al. (2022) Alexandre Pasquiou, Yair Lakretz, John Hale, Bertrand Thirion, and Christophe Pallier. 2022. Neural language models are not born equal to fit brain data, but training helps. arXiv preprint arXiv:2207.03380.
- Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar. Association for Computational Linguistics.
- Pereira et al. (2018) Francisco Pereira, Bin Lou, Brianna Pritchett, Samuel Ritter, Samuel J Gershman, Nancy Kanwisher, Matthew Botvinick, and Evelina Fedorenko. 2018. Toward a universal decoder of linguistic meaning from brain activation. Nature communications, 9(1):1–13.
- Radford et al. (2019) Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI blog.
- Rayner (1998) Keith Rayner. 1998. Eye movements in reading and information processing: 20 years of research. Psychological bulletin, 124(3):372.
- Rayner (2009) Keith Rayner. 2009. The 35th Sir Frederick Bartlett Lecture: Eye movements and attention in reading, scene perception, and visual search. Quarterly Journal of Experimental Psychology, 62(8):1457–1506.
- Wang et al. (2022a) Shaonan Wang, Xiaohan Zhang, Jiajun Zhang, and Chengqing Zong. 2022a. A synchronized multimodal neuroimaging dataset for studying brain language processing. Scientific Data, 9.
- Wang et al. (2022b) Shaonan Wang, Yunhao Zhang, Xiaohan Zhang, Jingyuan Sun, Nan Lin, Jiajun Zhang, and Chengqing Zong. 2022b. An fMRI dataset for concept representation with semantic feature annotations. Scientific Data, 9.
- Xu et al. (2016) Haoyan Xu, Brian Murphy, and Alona Fyshe. 2016. BrainBench: A brain-image test suite for distributional semantic models. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2017–2021, Austin, Texas. Association for Computational Linguistics.
- Zhang et al. (2022) Guangyao Zhang, Panpan Yao, Guojie Ma, Jingwen Wang, Junyi Zhou, Linjieqiong Huang, Pingping Xu, Lijing Chen, Songlin Chen, Junjuan Gu, et al. 2022. The database of eye-movement measures on words in Chinese reading. Scientific Data, 9(1):1–8.
- Zhang et al. (2020) Yizhen Zhang, Kuan Han, Robert M. Worth, and Zhongming Liu. 2020. Connecting concepts in the brain by mapping cortical representations of semantic relations. Nature Communications, 11(1):1877.
- Zhao et al. (2023) Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, Yifan Du, Chen Yang, Yushuo Chen, Z. Chen, Jinhao Jiang, Ruiyang Ren, Yifan Li, Xinyu Tang, Zikang Liu, Peiyu Liu, Jianyun Nie, and Ji-Rong Wen. 2023. A survey of large language models. ArXiv, abs/2303.18223.