
Leveraging recent advances in Pre-Trained Language Models for Eye-Tracking Prediction

Varun Madhavan
IIT Kharagpur
varun.m.iitkgp
@gmail.com

Aditya Girish Pawate
IIT Kharagpur
adityagirish
[email protected]

Shraman Pal
IIT Kharagpur
shramanpal12345
@gmail.com

Abhranil Chandra
IIT Kharagpur
abhranil.chandra
@gmail.com

Abstract

Cognitively inspired Natural Language Processing uses human-derived behavioral data, such as eye-tracking data, that reflects the semantic representations of language in the human brain. Such data can be used to augment neural networks on a range of tasks spanning syntax and semantics, with the aim of teaching machines about human language processing mechanisms. In this paper, we use the eye-gaze features of the ZuCo 1.0 and ZuCo 2.0 datasets to explore different linguistic models that directly predict these gaze features for each word with respect to its sentence. We experimented with several neural network models that take the words as inputs and predict the targets, and after extensive experimentation and feature engineering we devised a novel architecture consisting of a RoBERTa token classifier with a dense layer on top for language modeling, and a stand-alone model consisting of dense layers followed by a transformer layer for the extra features we engineered. The final predictions are the mean of the outputs of these two models. We evaluated the models using the mean absolute error (MAE) and the R2 score for each target.

1 Introduction

When reading, we humans process language “automatically”, without reflecting on each step: we string words together into sentences, understand the meaning of spoken and written ideas, and process language without thinking too much about how the underlying cognitive process happens. This process generates cognitive signals that could potentially facilitate natural language processing tasks. Our understanding of language processing depends heavily on accurately modeling eye-tracking features (an example of cognitive data), as they provide millisecond-accurate records of where humans look while reading, thus shedding light on human attention mechanisms and language comprehension skills. Eye-movement data is used extensively in various fields, including but not limited to Natural Language Processing and Computer Vision. The recent introduction of standardized datasets has enabled the comparison of the capabilities of different cognitive modeling and linguistically motivated approaches to model and analyze human patterns of reading.

2 Problem Description

Feature Name | Feature Description
nFix | Number of fixations on the word
FFD | Duration of the first fixation
TRT | Total fixation duration, including regressions
GPT | Sum of all prior fixations (go-past time)
fixProp | Proportion of participants that fixated the current word
Table 2.1: Target Variables

Our brain processes language and generates cognitive processing data such as gaze patterns and brain activity, and these signals can be recorded while reading. The problem is formulated as a regression task: predict five eye-tracking features for each token of each sentence in the test set. Each training sample consists of a token in a sentence and the corresponding features. We are required to predict the five features nFix, FFD, TRT, GPT, and fixProp. We were given a CSV file with sentences to train our model. Tokens in the sentences were split in the same manner as they were presented to the participants during the reading experiments; hence, they do not necessarily follow linguistically correct tokenization. Sentence endings were marked with an <EOS> symbol. A prediction had to be made for each token for all five target variables. The evaluation metric used was the mean absolute error.
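The setup can be made concrete with a short data-loading sketch. This is only illustrative: the column names (sentence_id, word, and the five targets) are assumptions about the task CSV layout rather than the exact schema.

```python
# Minimal, illustrative loading of the per-token regression data.
# Column names are assumed; adjust them to the actual CSV schema.
import pandas as pd

TARGETS = ["nFix", "FFD", "GPT", "TRT", "fixProp"]

df = pd.read_csv("training_data.csv")
df = df[df["word"] != "<EOS>"]              # drop the sentence-end marker

# Group tokens back into sentences for the language model ...
sentences = df.groupby("sentence_id")["word"].apply(list).tolist()

# ... and keep the aligned per-token targets for the regression head.
targets = df[TARGETS].to_numpy()
```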

3 Dataset

We use the eye-tracking data of the Zurich Cognitive Language Processing Corpus (ZuCo 1.0 and ZuCo 2.0) Hollenstein et al. (2020), recorded during normal reading. The training and test data contain 800 and 191 sentences respectively. In recent years, collecting these signals has become increasingly easy and less expensive Papoutsaki et al. (2016); as a result, using cognitive features to improve NLP tasks has become more popular. The data contains features averaged over all readers and scaled to the range 0 to 100, facilitating evaluation via the mean absolute error (MAE). The 5 eye-tracking features included in the dataset are shown in Table 2.1. The heatmap of the correlation matrix of the 5 target variables is shown in Figure 3.1. It is observed that, apart from GPT, the features are correlated with each other to a large extent.

Figure 3.1: Correlation Heatmap of Targets
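As a quick sanity check, the heatmap in Figure 3.1 can be reproduced from the target columns alone. A minimal sketch, assuming the DataFrame df from the loading sketch above:

```python
# Correlation heatmap of the five eye-tracking targets (cf. Figure 3.1).
import matplotlib.pyplot as plt
import seaborn as sns

TARGETS = ["nFix", "FFD", "GPT", "TRT", "fixProp"]

corr = df[TARGETS].corr()        # Pearson correlation between the targets
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Correlation Heatmap of Targets")
plt.tight_layout()
plt.show()
```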

4 Preprocessing

We first removed the <EOS> token at the end of each sentence. We looked at similar work Clifton et al. (2007), Demberg and Keller (2008) and Just and Carpenter (1980) to find features that we could add. As GPT consistently performed the worst in all our initial experiments, we tried replacing GPT with a new feature (TRT - GPT) during training. The results improved slightly, and we kept this change for the remaining experiments. To incorporate word-level characteristics (i.e., characteristics that belong to a word, e.g., word length) and syntactic information, we introduced the following features (a minimal extraction sketch is given after the list):

  • Number of characters in a word (wordlen)

  • Number of characters in the lemmatized word subtracted from the number of characters in the word. Obtained using the NLTK package Bird and Loper (2004) (lemwordlen)

  • Whether a word is a stopword: we appended 1 if it is a stopword and -1 otherwise (stopword)

  • Whether a word is a number: we appended 1 if it is a number and -1 otherwise (number)

  • Whether a word is at the end of its sentence: we appended 1 if it is the last word and -1 otherwise (endword). Our hypothesis for adding endword was that it has a strong correlation with GPT, as endwords have significantly higher GPT values than other words.

  • Part-of-speech tag of each word in a sentence (postag). We one-hot encoded the POS tags, which gave us a sparse matrix of 20 features. The POS tags have a high statistical correlation with all the target variables as they reflect the syntactic structure of a sentence. Obtained using the NLTK package.

  • TF-IDF value of each word, treating each sentence as a separate document and each word as a separate term (tfidf). For the tfidf score, words unrecognized by the built-in tokenizer (e.g., single-digit numbers or words containing “-”) were assigned 0.01 if the word was among “a”, “A” or “I”, and 0.1 otherwise. This provides a hyperparameter (TF-IDF Error) that can be tuned for better results. Obtained using the sklearn package Pedregosa et al. (2011).
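A minimal sketch of these hand-crafted features, using NLTK and scikit-learn, is given below. The TF-IDF fallback values (0.01 / 0.1) follow the description above; the helper names and the single-word POS tagging are illustrative assumptions, not the exact implementation.

```python
# Illustrative extraction of the word-level features described above.
# Requires nltk.download() for "stopwords", "wordnet" and
# "averaged_perceptron_tagger".
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer

lemmatizer = WordNetLemmatizer()
stop_set = set(stopwords.words("english"))

def word_features(word, is_last):
    """Per-word features; the POS tag is one-hot encoded later."""
    lemma = lemmatizer.lemmatize(word.lower())
    return {
        "wordlen": len(word),
        "lemwordlen": len(word) - len(lemma),
        "stopword": 1 if word.lower() in stop_set else -1,
        "number": 1 if word.isdigit() else -1,
        "endword": 1 if is_last else -1,
        "postag": nltk.pos_tag([word])[0][1],
    }

def tfidf_scores(sentences, fallback=0.1):
    """Treat each sentence as a document; unrecognised tokens get a fallback."""
    vec = TfidfVectorizer()
    mat = vec.fit_transform([" ".join(s) for s in sentences])
    vocab = vec.vocabulary_
    scores = []
    for i, sent in enumerate(sentences):
        row = mat[i].toarray().ravel()
        for w in sent:
            idx = vocab.get(w.lower())
            if idx is None:
                scores.append(0.01 if w in ("a", "A", "I") else fallback)
            else:
                scores.append(row[idx])
    return scores
```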

Scatter plots of word length (number of characters) against the target variables are shown in Figures 4.1 to 4.4; a clear positive correlation between them is observed. We expected the last word of a sentence to have an impact on the GPT score, which is evident from the barplot in Figure 4.5: the GPT score for an endword is usually quite high compared to a non-endword. We also created barplots of stopword against the target variables (Figures 4.6 to 4.10). For stopwords, the values of all the target variables are comparatively lower, which points to stopword being a significant feature. The effect of adding these features is further explored as an ablation study in Section 6.3.

Figure 4.1: nFix vs wordlen
Figure 4.2: FFD vs wordlen
Figure 4.3: GPT vs wordlen
Figure 4.4: TRT vs wordlen
Figure 4.5: GPT vs endword
Figure 4.6: nFix vs stopword
Figure 4.7: FFD vs stopword
Figure 4.8: GPT vs stopword
Figure 4.9: TRT vs stopword
Figure 4.10: fixProp vs stopword

5 Model Description

The model has two major components, the Feature Model and the Language Model, which encode the engineered features and the sentence respectively. The resulting encoded representations are mean pooled and passed to a fully connected layer with sigmoid activation to output a (5x1) score vector for each token in the sentence. The score vector represents the values of nFix, FFD, TRT, GPT and fixProp respectively. A minimal sketch of this architecture is given after the figures below.

Figure 5.1: Feature Model
Figure 5.2: Language Head
Figure 5.3: Complete Model Flow
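The following PyTorch sketch illustrates the two-branch design under stated assumptions: the hidden sizes, the number of transformer-encoder layers in the feature branch, and the exact way the two encodings are merged are illustrative choices, and the extra features are assumed to be already aligned with the subword tokens.

```python
# Illustrative two-branch model: RoBERTa encodes the sentence, a small
# dense + transformer stack encodes the engineered features, and the two
# encodings are averaged before a per-token regression head.
import torch
import torch.nn as nn
from transformers import RobertaModel

class FeatureModel(nn.Module):
    """Dense layers followed by a transformer-encoder layer over the features."""
    def __init__(self, n_features, d_model=768):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(n_features, d_model), nn.ReLU(),
            nn.Linear(d_model, d_model),
        )
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)

    def forward(self, feats):                  # (batch, seq_len, n_features)
        return self.encoder(self.proj(feats))  # (batch, seq_len, d_model)

class GazeModel(nn.Module):
    """Full model: language branch + feature branch -> 5 gaze targets."""
    def __init__(self, n_features, n_targets=5):
        super().__init__()
        self.lm = RobertaModel.from_pretrained("roberta-base")
        hidden = self.lm.config.hidden_size
        self.feature_model = FeatureModel(n_features, hidden)
        self.head = nn.Linear(hidden, n_targets)

    def forward(self, input_ids, attention_mask, feats):
        tok = self.lm(input_ids=input_ids,
                      attention_mask=attention_mask).last_hidden_state
        fused = (tok + self.feature_model(feats)) / 2   # mean of both branches
        return torch.sigmoid(self.head(fused))          # (batch, seq_len, 5)
```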

6 Experiments

6.1 Evaluation Metric

We used the R2 score metric instead of the mean absolute error because R2 provides a better understanding of the model’s performance in the absence of a baseline score. Since the R2 score has an upper bound of 1, it is possible to judge how well the model is performing for a particular value of the R2 score.

R^{2}(y,\hat{y}) = 1 - \frac{\sum_{i=1}^{n}(y_{i}-\hat{y}_{i})^{2}}{\sum_{i=1}^{n}(y_{i}-\bar{y})^{2}}
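For reference, the metric can be computed directly from this formula or with scikit-learn's r2_score; the small arrays below are made-up numbers for illustration only.

```python
# Sanity check: manual R^2 versus scikit-learn's implementation.
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.2, 1.5, 4.8, 2.0])   # made-up target values
y_pred = np.array([3.0, 1.7, 4.5, 2.4])   # made-up predictions

ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
print(1 - ss_res / ss_tot)                 # manual R^2
print(r2_score(y_true, y_pred))            # same value from scikit-learn
```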

6.2 Models

We experimented with a variety of models, in particular recent attention-based Vaswani et al. (2017) pre-trained language models like RoBERTa Liu et al. (2019), ELECTRA Clark et al. (2020), BERT Devlin et al. (2019) and DistilBERT Sanh et al. (2020). Each language model was used to encode the text as described in Section 5, and the rest of the pipeline was kept the same. We compare this method with Hollenstein et al. (2019), which employed GloVe (200d) Pennington et al. (2014) embeddings and a Bidirectional LSTM Hochreiter and Schmidhuber (1997) to encode sentences. In addition to GloVe, we also tried the more recent ConceptNet Numberbatch embeddings (300d) Speer et al. (2018), which have been shown to outperform GloVe on a variety of tasks. The results for all experiments are tabulated in Table 6.3.
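The encoder swapping can be sketched with the Hugging Face Auto classes; the checkpoint names below are the standard public ones and are assumptions about the exact versions used in the experiments.

```python
# Swap pre-trained encoders while keeping the rest of the pipeline fixed.
from transformers import AutoModel, AutoTokenizer

CHECKPOINTS = {
    "RoBERTa": "roberta-base",
    "BERT": "bert-base-uncased",
    "ELECTRA": "google/electra-base-discriminator",
    "DistilBERT": "distilbert-base-uncased",
}

def encode(name, sentence_tokens):
    """Return per-subword hidden states for a pre-tokenized sentence."""
    tokenizer = AutoTokenizer.from_pretrained(CHECKPOINTS[name])
    model = AutoModel.from_pretrained(CHECKPOINTS[name])
    enc = tokenizer(sentence_tokens, is_split_into_words=True,
                    return_tensors="pt")
    return model(**enc).last_hidden_state   # (1, n_subwords, hidden_size)
```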

6.3 Features

In addition to various architectures, we tried a number of feature engineering techniques with inferred/transformed features as described in Section 4. We performed an ablation study with each feature to judge the importance/benefit of adding each of the new features.

We note that increasing the number of features does not always help. Analysing the data, we saw that most of the POS tags rarely occurred and there were only 6-7 frequently occurring tags. We decided to club the remaining tags under their respective larger categories (e.g., proper nouns (NNP) were clubbed into the noun category (NN)) and marked the remaining small number of categories as <UNK>, while one-hot encoding all of them. Lastly, we standardized the targets for better training stability; the standardization was applied after transforming GPT to TRT - GPT.
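A minimal sketch of the tag clubbing and the target transformation is given below. The set of frequent tags kept is an illustrative assumption; the GPT-to-(TRT - GPT) replacement and the standardization follow the description above.

```python
# Illustrative POS-tag clubbing and target standardization.
import numpy as np

FREQUENT_TAGS = ("NN", "VB", "JJ", "RB", "PRP", "DT", "IN")  # assumed list

def club_pos_tag(tag):
    """Map rare Penn Treebank tags onto a broader category or <UNK>."""
    for base in FREQUENT_TAGS:
        if tag.startswith(base):     # e.g. NNP, NNS -> NN; VBD, VBG -> VB
            return base
    return "<UNK>"

def transform_targets(y):
    """y columns: nFix, FFD, GPT, TRT, fixProp.

    Replaces GPT with TRT - GPT, then standardizes every column,
    returning the statistics needed to invert the transform later.
    """
    y = y.astype(float).copy()
    y[:, 2] = y[:, 3] - y[:, 2]            # GPT -> TRT - GPT
    mean, std = y.mean(axis=0), y.std(axis=0)
    return (y - mean) / std, mean, std
```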

Feature Name | Feature Description | Mean R2 Score | Std Deviation
word len | Number of characters in the word | 0.6145 | 0.07363
is stopword | Whether a word is a stopword | 0.6475 | 0.06043
is number | Whether a word is a number | 0.6194 | 0.06731
lem word len | Difference in the number of characters between the word and its lemmatized form | 0.6125 | 0.06649
is endword | Whether the word is the last word of its sentence | 0.5918 | 0.10606
tfidf score | TF-IDF score of the word in the corresponding sentence | 0.6265 | 0.06367
pos tag | Part-of-speech tag of the word | 0.6284 | 0.059786
Table 6.1: Features, their descriptions, and the mean and standard deviation of the R2 scores over the target variables
Hyperparameter | Value
Batch Size | 4
Num Warm-Up Steps | 4
Learning Rate | 3e-5
Max Epochs | 120
Beta 1 | 0.91
Delta | 0.0001
Beta 2 | 0.998
Early Stopping Patience | 8
Weight Decay | 1e-5
TF-IDF Error | 0.1
Validation Split Ratio | 0.2
Training Split Ratio | 0.8
Table 6.2: Hyperparameter Values
  • Effect of scaling: Since there are many features with large differences in their means, it is helpful to scale the features to the same range, or to the same mean and standard deviation. Hence we compared two scaling methods:

    • Min-max scaling:
      X_{norm} = (X - X_{min}) / (X_{max} - X_{min})

    • Standard scaling:
      X_{norm} = (X - X_{mean}) / X_{std}

Model | Feature Norm | Extra Features | nFix | FFD | GPT | TRT | fixProp
RoBERTa | Std Scl | STT | 0.8842 | 0.9246 | 0.7343 | 0.8823 | 0.9509
RoBERTa | Min Max | All | 0.8835 | 0.9132 | 0.7383 | 0.8542 | 0.9461
RoBERTa | Min Max | STT | 0.8417 | 0.9090 | 0.6456 | 0.8152 | 0.9492
BERT | Min Max | All | 0.7433 | 0.8528 | 0.4574 | 0.6925 | 0.9173
BiLSTM Minibatch | Min Max | All + GloVe | 0.6595 | 0.7275 | 0.6281 | 0.5988 | 0.7583
BiLSTM Minibatch | Min Max | All + NB | 0.6379 | 0.7203 | 0.5992 | 0.5388 | 0.7592
RoBERTa Token | Std Scl | All | 0.7038 | 0.6512 | 0.4697 | 0.6877 | 0.7146
BERT Token | Std Scl | All | 0.7231 | 0.7107 | 0.3440 | 0.6376 | 0.7635
BERT | Std Scl | All | 0.6497 | 0.6456 | 0.5409 | 0.6088 | 0.7171
BiLSTM Minibatch | Std Scl | All + GloVe | 0.6850 | 0.5976 | 0.5267 | 0.5940 | 0.7158
RoBERTa Token | Std Scl | All | 0.6483 | 0.5861 | 0.5475 | 0.6280 | 0.6566
RoBERTa | Std Scl | All | 0.6360 | 0.5869 | 0.4798 | 0.6061 | 0.6623
BiLSTM Stochastic | Std Scl | All + GloVe | 0.6282 | 0.6247 | 0.4394 | 0.5860 | 0.6893
RoBERTa Token | Std Scl | All | 0.6349 | 0.5914 | 0.4767 | 0.6123 | 0.6452
RoBERTa Token | Std Scl | All | 0.5991 | 0.5847 | 0.5417 | 0.5686 | 0.6524
BERT Token | Std Scl | All | 0.6182 | 0.5677 | 0.4495 | 0.6026 | 0.6960
BERT | Std Scl | All | 0.6019 | 0.5847 | 0.4159 | 0.5771 | 0.6404
BERT | Std Scl | All | 0.5681 | 0.5437 | 0.4891 | 0.5170 | 0.6371
DistilBERT | No Scaling | All | 0.4804 | 0.5292 | 0.0983 | 0.4183 | 0.6324
Table 6.3: Compiled results of all the top-performing models. STT means Single Target Training (one model to predict one feature). NB refers to the ConceptNet Numberbatch word embeddings Speer et al. (2017) (https://github.com/commonsense/conceptnet-numberbatch). The architecture was kept the same (except for input dimensions); the learning rate was reduced for smaller numbers of features to reach optimal values, determined through trial and error.

    With the same model (multi-target with a RoBERTa encoder), standard scaling gave an R2 score of 0.62 while min-max scaling gave 0.81. The reason min-max scaling performs significantly better might be that the targets in this dataset are always positive (a short scikit-learn illustration of the two scalers is given after this list).

  • Single Target vs. Multi-Target models: With a large number of input features and a relatively small dataset, we hypothesized that large pre-trained models may fail to converge to a suitable minimum during the limited fine-tuning. Hence, we trained separate instances of the same model on each of nFix, FFD, TRT, GPT and fixProp. This further increased the R2 score on the validation set to 0.88.
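As referenced above, here is a short illustration of the two feature-scaling options compared in the Effect of scaling item; X_train and X_val are stand-ins for the engineered feature matrices from Section 4.

```python
# Min-max vs. standard scaling of the engineered features (illustrative).
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

rng = np.random.default_rng(0)
X_train = rng.random((100, 27)) * 10      # stand-in for the training features
X_val = rng.random((20, 27)) * 10         # stand-in for the validation features

minmax = MinMaxScaler().fit(X_train)      # (x - min) / (max - min)
standard = StandardScaler().fit(X_train)  # (x - mean) / std

# Fit on the training split only, then apply the same transform to validation.
X_val_minmax = minmax.transform(X_val)
X_val_standard = standard.transform(X_val)
```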

7 Results

In Table 6.3, we tabulate the R2 scores of each of our models on the validation set.

References