
Leveraging recent advances in Pre-Trained Language Models for Eye-Tracking Prediction

Varun Madhavan
IIT Kharagpur
varun.m.iitkgp
@gmail.com

Aditya Girish Pawate
IIT Kharagpur
adityagirish
[email protected]

Shraman Pal
IIT Kharagpur
shramanpal12345
@gmail.com

Abhranil Chandra
IIT Kharagpur
abhranil.chandra
@gmail.com

Abstract

Cognitively inspired Natural Language Processing uses human-derived behavioral data, such as eye-tracking data, that reflects the semantic representations of language in the human brain. Such data can be used to augment neural networks on a range of tasks spanning syntax and semantics, with the aim of teaching machines about human language processing mechanisms. In this paper, we use the eye-gaze features of the ZuCo 1.0 and ZuCo 2.0 datasets to explore different linguistic models that directly predict these gaze features for each word with respect to its sentence. We experimented with several neural network models that take the words as inputs and predict the targets, and after extensive experimentation and feature engineering we devised a novel architecture consisting of a RoBERTa token classifier with a dense layer on top for language modeling, and a stand-alone model consisting of dense layers followed by a transformer layer for the extra features we engineered. The final predictions are the mean of the outputs of these two models. We evaluated the models using the mean absolute error (MAE) and the R2 score for each target.

1 Introduction

When reading, we humans process language “automatically”, without reflecting on each step: we string words together into sentences, understand the meaning of spoken and written ideas, and process language without thinking too much about how the underlying cognitive process happens. This process generates cognitive signals that could potentially facilitate natural language processing tasks. Our understanding of language processing depends heavily on accurately modeling eye-tracking features (an example of cognitive data), as they provide millisecond-accurate records of where humans look while reading, thus shedding light on human attention mechanisms and language comprehension skills. Eye-movement data is used extensively in various fields, including but not limited to Natural Language Processing and Computer Vision. The recent introduction of standardized datasets has enabled the comparison of the capabilities of different cognitive modeling and linguistically motivated approaches to model and analyze human patterns of reading.

2 Problem Description

Feature Name | Feature Description
nFix | Number of fixations on the word
FFD | Duration of the first fixation
TRT | Total fixation duration, including regressions
GPT | Sum of all prior fixations (go-past time)
fixProp | Proportion of participants that fixated the current word
Table 2.1: Target Variables

Our brain processes language and generates cognitive processing data such as gaze patterns and brain activity, and these signals can be recorded while reading. The problem is formulated as a regression task: predict five eye-tracking features for each token of each sentence in the test set. Each training sample consists of a token in a sentence and the corresponding features. We are required to predict the five features nFix, FFD, TRT, GPT, and fixProp. We were given a CSV file with sentences to train our model. Tokens in the sentences were split in the same manner as they were presented to the participants during the reading experiments; hence, they do not necessarily follow linguistically correct tokenization. Sentence endings were marked with an <EOS> symbol. A prediction had to be made for each token for all five target variables. The evaluation metric used was the mean absolute error.
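The setup can be made concrete with a short data-loading sketch. This is only illustrative: the column names (sentence_id, word, and the five targets) are assumptions about the task CSV layout rather than the exact schema.

```python
# Minimal, illustrative loading of the per-token regression data.
# Column names are assumed; adjust them to the actual CSV schema.
import pandas as pd

TARGETS = ["nFix", "FFD", "GPT", "TRT", "fixProp"]

df = pd.read_csv("training_data.csv")
df = df[df["word"] != "<EOS>"]              # drop the sentence-end marker

# Group tokens back into sentences for the language model ...
sentences = df.groupby("sentence_id")["word"].apply(list).tolist()

# ... and keep the aligned per-token targets for the regression head.
targets = df[TARGETS].to_numpy()
```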

3 Dataset

We use the eye-tracking data of the Zurich Cognitive Language Processing Corpus (ZuCo 1.0 and ZuCo 2.0) Hollenstein et al. (2020), recorded during normal reading. The training and test data contain 800 and 191 sentences respectively. In recent years, collecting these signals has become increasingly easy and less expensive Papoutsaki et al. (2016); as a result, using cognitive features to improve NLP tasks has become more popular. The data contains features averaged over all readers and scaled to the range 0 to 100, facilitating evaluation via the mean absolute error (MAE). The 5 eye-tracking features included in the dataset are shown in Table 2.1. The heatmap of the correlation matrix of the 5 target variables is shown in Figure 3.1. It is observed that, apart from GPT, the features are correlated with each other to a large extent.

Figure 3.1: Correlation Heatmap of Targets
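As a quick sanity check, the heatmap in Figure 3.1 can be reproduced from the target columns alone. A minimal sketch, assuming the DataFrame df from the loading sketch above:

```python
# Correlation heatmap of the five eye-tracking targets (cf. Figure 3.1).
import matplotlib.pyplot as plt
import seaborn as sns

TARGETS = ["nFix", "FFD", "GPT", "TRT", "fixProp"]

corr = df[TARGETS].corr()        # Pearson correlation between the targets
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Correlation Heatmap of Targets")
plt.tight_layout()
plt.show()
```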

4 Preprocessing

We first removed the <EOS> token at the end of each sentence. We looked at similar work Clifton et al. (2007), Demberg and Keller (2008) and Just and Carpenter (1980) to find features that we could add. As GPT consistently performed the worst in all our initial experiments, we tried replacing GPT with a new feature (TRT - GPT) during training. The results improved slightly, and we kept this change for the remaining experiments. To incorporate word-level characteristics (i.e., characteristics that belong to a word, e.g., word length) and syntactic information, we introduced the following features (a minimal extraction sketch is given after the list):

  • Number of characters in a word (wordlen)

  • Number of characters in the lemmatized word subtracted from the number of characters in the word. Obtained using the NLTK package Bird and Loper (2004) (lemwordlen)

  • Whether a word is a stopword: we appended 1 if it is a stopword and -1 otherwise (stopword)

  • Whether a word is a number: we appended 1 if it is a number and -1 otherwise (number)

  • Whether a word is at the end of its sentence: we appended 1 if it is the last word and -1 otherwise (endword). Our hypothesis for adding endword was that it has a strong correlation with GPT, as endwords have significantly higher GPT values than other words.

  • Part-of-speech tag of each word in a sentence (postag). We one-hot encoded the POS tags, which gave us a sparse matrix of 20 features. The POS tags have a high statistical correlation with all the target variables as they reflect the syntactic structure of a sentence. Obtained using the NLTK package.

  • TF-IDF value of each word, treating each sentence as a separate document and each word as a separate term (tfidf). For the tfidf score, words unrecognized by the built-in tokenizer (e.g., single-digit numbers or words containing “-”) were assigned 0.01 if the word was among “a”, “A” or “I”, and 0.1 otherwise. This provides a hyperparameter (TF-IDF Error) that can be tuned for better results. Obtained using the sklearn package Pedregosa et al. (2011).
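A minimal sketch of these hand-crafted features, using NLTK and scikit-learn, is given below. The TF-IDF fallback values (0.01 / 0.1) follow the description above; the helper names and the single-word POS tagging are illustrative assumptions, not the exact implementation.

```python
# Illustrative extraction of the word-level features described above.
# Requires nltk.download() for "stopwords", "wordnet" and
# "averaged_perceptron_tagger".
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer

lemmatizer = WordNetLemmatizer()
stop_set = set(stopwords.words("english"))

def word_features(word, is_last):
    """Per-word features; the POS tag is one-hot encoded later."""
    lemma = lemmatizer.lemmatize(word.lower())
    return {
        "wordlen": len(word),
        "lemwordlen": len(word) - len(lemma),
        "stopword": 1 if word.lower() in stop_set else -1,
        "number": 1 if word.isdigit() else -1,
        "endword": 1 if is_last else -1,
        "postag": nltk.pos_tag([word])[0][1],
    }

def tfidf_scores(sentences, fallback=0.1):
    """Treat each sentence as a document; unrecognised tokens get a fallback."""
    vec = TfidfVectorizer()
    mat = vec.fit_transform([" ".join(s) for s in sentences])
    vocab = vec.vocabulary_
    scores = []
    for i, sent in enumerate(sentences):
        row = mat[i].toarray().ravel()
        for w in sent:
            idx = vocab.get(w.lower())
            if idx is None:
                scores.append(0.01 if w in ("a", "A", "I") else fallback)
            else:
                scores.append(row[idx])
    return scores
```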

Scatter plots of word length (number of characters) against the target variables are shown in Figures 4.1 to 4.4; a clear positive correlation between them is observed. We expected the last word of a sentence to have an impact on the GPT score, which is evident from the barplot in Figure 4.5: the GPT score for an endword is usually quite high compared to a non-endword. We also created barplots of stopword against the target variables (Figures 4.6 to 4.10). For stopwords, the values of all the target variables are comparatively lower, which points to stopword being a significant feature. The effect of adding these features is further explored as an ablation study in Section 6.3.

Figure 4.1: nFix vs wordlen
Figure 4.2: FFD vs wordlen
Figure 4.3: GPT vs wordlen
Figure 4.4: TRT vs wordlen
Figure 4.5: GPT vs endword
Figure 4.6: nFix vs stopword
Figure 4.7: FFD vs stopword
Figure 4.8: GPT vs stopword
Figure 4.9: TRT vs stopword
Figure 4.10: fixProp vs stopword

5 Model Description

The model has two major components, the Feature Model and the Language Model, which encode the engineered features and the sentence respectively. The resulting encoded representations are mean pooled and passed to a fully connected layer with sigmoid activation to output a (5x1) score vector for each token in the sentence. The score vector represents the values of nFix, FFD, TRT, GPT and fixProp respectively. A minimal sketch of this architecture is given after the figures below.

Figure 5.1: Feature Model
Figure 5.2: Language Head
Figure 5.3: Complete Model Flow
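The following PyTorch sketch illustrates the two-branch design under stated assumptions: the hidden sizes, the number of transformer-encoder layers in the feature branch, and the exact way the two encodings are merged are illustrative choices, and the extra features are assumed to be already aligned with the subword tokens.

```python
# Illustrative two-branch model: RoBERTa encodes the sentence, a small
# dense + transformer stack encodes the engineered features, and the two
# encodings are averaged before a per-token regression head.
import torch
import torch.nn as nn
from transformers import RobertaModel

class FeatureModel(nn.Module):
    """Dense layers followed by a transformer-encoder layer over the features."""
    def __init__(self, n_features, d_model=768):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(n_features, d_model), nn.ReLU(),
            nn.Linear(d_model, d_model),
        )
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)

    def forward(self, feats):                  # (batch, seq_len, n_features)
        return self.encoder(self.proj(feats))  # (batch, seq_len, d_model)

class GazeModel(nn.Module):
    """Full model: language branch + feature branch -> 5 gaze targets."""
    def __init__(self, n_features, n_targets=5):
        super().__init__()
        self.lm = RobertaModel.from_pretrained("roberta-base")
        hidden = self.lm.config.hidden_size
        self.feature_model = FeatureModel(n_features, hidden)
        self.head = nn.Linear(hidden, n_targets)

    def forward(self, input_ids, attention_mask, feats):
        tok = self.lm(input_ids=input_ids,
                      attention_mask=attention_mask).last_hidden_state
        fused = (tok + self.feature_model(feats)) / 2   # mean of both branches
        return torch.sigmoid(self.head(fused))          # (batch, seq_len, 5)
```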

6 Experiments

6.1 Evaluation Metric

We used the R2 score metric instead of the mean absolute error because R2 provides a better understanding of the model’s performance in the absence of a baseline score. Since the R2 score has an upper bound of 1, it is possible to judge how well the model is performing for a particular value of the R2 score.

R^{2}(y,\hat{y}) = 1 - \frac{\sum_{i=1}^{n}(y_{i}-\hat{y}_{i})^{2}}{\sum_{i=1}^{n}(y_{i}-\bar{y})^{2}}
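For reference, the metric can be computed directly from this formula or with scikit-learn's r2_score; the small arrays below are made-up numbers for illustration only.

```python
# Sanity check: manual R^2 versus scikit-learn's implementation.
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.2, 1.5, 4.8, 2.0])   # made-up target values
y_pred = np.array([3.0, 1.7, 4.5, 2.4])   # made-up predictions

ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
print(1 - ss_res / ss_tot)                 # manual R^2
print(r2_score(y_true, y_pred))            # same value from scikit-learn
```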

6.2 Models

We experimented with a variety of models, in particular recent attention-based Vaswani et al. (2017) pre-trained language models like RoBERTa Liu et al. (2019), ELECTRA Clark et al. (2020), BERT Devlin et al. (2019) and DistilBERT Sanh et al. (2020). Each language model was used to encode the text as described in Section 5, and the rest of the pipeline was kept the same. We compare this method with Hollenstein et al. (2019), which employed GloVe (200d) Pennington et al. (2014) embeddings and a Bidirectional LSTM Hochreiter and Schmidhuber (1997) to encode sentences. In addition to GloVe, we also tried the more recent ConceptNet Numberbatch embeddings (300d) Speer et al. (2018), which have been shown to outperform GloVe on a variety of tasks. The results for all experiments are tabulated in Table 6.3.
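The encoder swapping can be sketched with the Hugging Face Auto classes; the checkpoint names below are the standard public ones and are assumptions about the exact versions used in the experiments.

```python
# Swap pre-trained encoders while keeping the rest of the pipeline fixed.
from transformers import AutoModel, AutoTokenizer

CHECKPOINTS = {
    "RoBERTa": "roberta-base",
    "BERT": "bert-base-uncased",
    "ELECTRA": "google/electra-base-discriminator",
    "DistilBERT": "distilbert-base-uncased",
}

def encode(name, sentence_tokens):
    """Return per-subword hidden states for a pre-tokenized sentence."""
    tokenizer = AutoTokenizer.from_pretrained(CHECKPOINTS[name])
    model = AutoModel.from_pretrained(CHECKPOINTS[name])
    enc = tokenizer(sentence_tokens, is_split_into_words=True,
                    return_tensors="pt")
    return model(**enc).last_hidden_state   # (1, n_subwords, hidden_size)
```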

6.3 Features

In addition to various architectures, we tried a number of feature engineering techniques with inferred/transformed features as described in Section 4. We performed an ablation study with each feature to judge the importance/benefit of adding each of the new features.

We note that increasing the number of features does not always help. Analysing the data, we saw that most of the POS tags rarely occurred and there were only 6-7 frequently occurring tags. We decided to club the remaining tags under their respective larger categories (e.g., proper nouns (NNP) were clubbed into the noun category (NN)) and marked the remaining small number of categories as <UNK>, while one-hot encoding all of them. Lastly, we standardized the targets for better training stability; the standardization was applied after transforming GPT to TRT - GPT.
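A minimal sketch of the tag clubbing and the target transformation is given below. The set of frequent tags kept is an illustrative assumption; the GPT-to-(TRT - GPT) replacement and the standardization follow the description above.

```python
# Illustrative POS-tag clubbing and target standardization.
import numpy as np

FREQUENT_TAGS = ("NN", "VB", "JJ", "RB", "PRP", "DT", "IN")  # assumed list

def club_pos_tag(tag):
    """Map rare Penn Treebank tags onto a broader category or <UNK>."""
    for base in FREQUENT_TAGS:
        if tag.startswith(base):     # e.g. NNP, NNS -> NN; VBD, VBG -> VB
            return base
    return "<UNK>"

def transform_targets(y):
    """y columns: nFix, FFD, GPT, TRT, fixProp.

    Replaces GPT with TRT - GPT, then standardizes every column,
    returning the statistics needed to invert the transform later.
    """
    y = y.astype(float).copy()
    y[:, 2] = y[:, 3] - y[:, 2]            # GPT -> TRT - GPT
    mean, std = y.mean(axis=0), y.std(axis=0)
    return (y - mean) / std, mean, std
```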

Feature Name | Feature Description | Mean R2 Score | Std Deviation
word len | Number of characters in the word | 0.6145 | 0.07363
is stopword | Whether a word is a stopword | 0.6475 | 0.06043
is number | Whether a word is a number | 0.6194 | 0.06731
lem word len | Difference in the number of characters between the word and its lemmatized form | 0.6125 | 0.06649
is endword | Whether the word is the last word of its sentence | 0.5918 | 0.10606
tfidf score | TF-IDF score of the word in the corresponding sentence | 0.6265 | 0.06367
pos tag | Part-of-speech tag of the word | 0.6284 | 0.059786
Table 6.1: Features, their descriptions, and the mean and standard deviation of the R2 scores over the target variables
Hyperparameter | Value
Batch Size | 4
Num Warm-Up Steps | 4
Learning Rate | 3e-5
Max Epochs | 120
Beta 1 | 0.91
Delta | 0.0001
Beta 2 | 0.998
Early Stopping Patience | 8
Weight Decay | 1e-5
TF-IDF Error | 0.1
Validation Split Ratio | 0.2
Training Split Ratio | 0.8
Table 6.2: Hyperparameter Values
  • Effect of scaling: Since there are many features with large differences in their means, it is helpful to scale the features to the same range, or to the same mean and standard deviation. Hence we compared two scaling methods:

    • Min-max scaling:
      X_{norm} = (X - X_{min}) / (X_{max} - X_{min})

    • Standard scaling:
      X_{norm} = (X - X_{mean}) / X_{std}

Model | Feature Norm | Extra Features | nFix | FFD | GPT | TRT | fixProp
RoBERTa | Std Scl | STT | 0.8842 | 0.9246 | 0.7343 | 0.8823 | 0.9509
RoBERTa | Min Max | All | 0.8835 | 0.9132 | 0.7383 | 0.8542 | 0.9461
RoBERTa | Min Max | STT | 0.8417 | 0.9090 | 0.6456 | 0.8152 | 0.9492
BERT | Min Max | All | 0.7433 | 0.8528 | 0.4574 | 0.6925 | 0.9173
BiLSTM Minibatch | Min Max | All + GloVe | 0.6595 | 0.7275 | 0.6281 | 0.5988 | 0.7583
BiLSTM Minibatch | Min Max | All + NB | 0.6379 | 0.7203 | 0.5992 | 0.5388 | 0.7592
RoBERTa Token | Std Scl | All | 0.7038 | 0.6512 | 0.4697 | 0.6877 | 0.7146
BERT Token | Std Scl | All | 0.7231 | 0.7107 | 0.3440 | 0.6376 | 0.7635
BERT | Std Scl | All | 0.6497 | 0.6456 | 0.5409 | 0.6088 | 0.7171
BiLSTM Minibatch | Std Scl | All + GloVe | 0.6850 | 0.5976 | 0.5267 | 0.5940 | 0.7158
RoBERTa Token | Std Scl | All | 0.6483 | 0.5861 | 0.5475 | 0.6280 | 0.6566
RoBERTa | Std Scl | All | 0.6360 | 0.5869 | 0.4798 | 0.6061 | 0.6623
BiLSTM Stochastic | Std Scl | All + GloVe | 0.6282 | 0.6247 | 0.4394 | 0.5860 | 0.6893
RoBERTa Token | Std Scl | All | 0.6349 | 0.5914 | 0.4767 | 0.6123 | 0.6452
RoBERTa Token | Std Scl | All | 0.5991 | 0.5847 | 0.5417 | 0.5686 | 0.6524
BERT Token | Std Scl | All | 0.6182 | 0.5677 | 0.4495 | 0.6026 | 0.6960
BERT | Std Scl | All | 0.6019 | 0.5847 | 0.4159 | 0.5771 | 0.6404
BERT | Std Scl | All | 0.5681 | 0.5437 | 0.4891 | 0.5170 | 0.6371
DistilBERT | No Scaling | All | 0.4804 | 0.5292 | 0.0983 | 0.4183 | 0.6324
Table 6.3: Compiled results of all the top-performing models. STT means Single Target Training (one model to predict one feature). NB refers to the ConceptNet Numberbatch word embeddings Speer et al. (2017) (https://github.com/commonsense/conceptnet-numberbatch). The architecture was kept the same (except for input dimensions); the learning rate was reduced for smaller numbers of features to reach optimal values, determined through trial and error.

    With the same model (multi-target with a RoBERTa encoder), standard scaling gave an R2 score of 0.62 while min-max scaling gave 0.81. The reason min-max scaling performs significantly better might be that the targets in this dataset are always positive (a short scikit-learn illustration of the two scalers is given after this list).

  • Single Target vs. Multi-Target models: With a large number of input features and a relatively small dataset, we hypothesized that large pre-trained models may fail to converge to a suitable minimum during the limited fine-tuning. Hence, we trained separate instances of the same model on each of nFix, FFD, TRT, GPT and fixProp. This further increased the R2 score on the validation set to 0.88.
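As referenced above, here is a short illustration of the two feature-scaling options compared in the Effect of scaling item; X_train and X_val are stand-ins for the engineered feature matrices from Section 4.

```python
# Min-max vs. standard scaling of the engineered features (illustrative).
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

rng = np.random.default_rng(0)
X_train = rng.random((100, 27)) * 10      # stand-in for the training features
X_val = rng.random((20, 27)) * 10         # stand-in for the validation features

minmax = MinMaxScaler().fit(X_train)      # (x - min) / (max - min)
standard = StandardScaler().fit(X_train)  # (x - mean) / std

# Fit on the training split only, then apply the same transform to validation.
X_val_minmax = minmax.transform(X_val)
X_val_standard = standard.transform(X_val)
```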

7 Results

In Table 6.3, we tabulate the R2 scores of each of our models on the validation set.

References