
Comparison Study Between Token Classification and Sequence Classification In Text Classification

Amir Jafari

I.   Introduction

Text classification engines have been used as an alternative and complement to human annotation for many years. Essay tests are an example of text responses in which students are given a particular prompt topic to write about and elaborate on. These texts are evaluated for their writing quality in areas such as grammar, fluency of elaboration, and how well they are organized. However, evaluating texts in large quantities by hand is impractical, and classifying them is often complicated and difficult [1].

Text classification engines cannot genuinely read and comprehend texts the way human raters can. They do not use direct qualities of a text, such as how fluent or grammatically correct it is, whereas human raters may explicitly evaluate many of those intrinsic variables to produce a prediction. This gives a clear distinction between machine scoring and human labeling. The authors of [2] studied whether human raters or machine learning classification models are better at classifying scientific research abstracts into a fixed set of discipline groups. Text engines are mostly trained by collecting a few thousand sample texts that were previously assessed by human raters. There has always been skepticism over machine scoring, mostly because computer models do not understand written text the way humans do [3]. However, by comparing the labels assigned by human raters and machine models and computing a statistical metric such as Cohen's Quadratic Weighted Kappa (QWK), we can measure the quality of the scoring models (human vs. machine), as long as the sample data is statistically significant.

There are two major categories of text classification engines in the field of natural language processing: traditional and neural network-based models. Traditional text classification engines examine a text using carefully crafted features, and the effectiveness of such models is closely correlated with the caliber of the underlying components. The most important element of such engines is manually designing the features and rules, which is time-consuming. As a modern alternative, a feature-free method for learning the relationship between a text and its assigned label is investigated in [4] using a recurrent neural network model. For the purpose of text classification engines, various neural network models are used in the field [5, 6, 7, 8].

Neural network models have recently been utilized for text classifiers [9, 10, 4] and have produced better results than statistical models with handmade features. Word representations in particular are utilized as the input to a neural network model, which produces a single dense vector for each word in the text. Based on a non-linear transformation of the representation in each layer, a label is assigned. It has been demonstrated that, across several domains, neural network models are more reliable than statistical models without the use of handmade features.

Text classification engines have used both convolutional neural network [11] and recurrent neural network [12] models. A single-layer LSTM has been used over the word sequence to model the text, while a two-level hierarchical CNN structure has been used to model sentences and documents separately. It is well known that LSTMs excel at modeling temporal and ordered data, while CNNs excel at capturing local information. The effectiveness of LSTMs and CNNs under identical text classification settings has been discussed in [10].

Pre-trained language models have recently been used to significantly increase the performance of phrase prediction tasks by combining representations from many layers, creating an auxiliary sentence, using multitask learning, etc.

Large pre-trained language models such as GPT [13], BERT [14], and XLNet [15] have recently demonstrated remarkable abilities in representation and generalization. These models now perform a variety of downstream tasks, including text summarization, generation, classification, and regression, with enhanced accuracy. There are numerous innovative methods for improving already-trained language models. Auxiliary sentence construction was suggested by [16] to improve sentiment classification. To address sequential sentence classification tasks, Cohan et al. [17] introduced additional distinct tokens to obtain representations of each sentence. Work in [16] has also explored several fine-tuning techniques, such as combining text representations from various layers and leveraging multi-task learning.

Given the enormous success of pre-trained language models like BERT in learning text representations with deep semantics, it makes sense to use BERT-type models, or transformer-based models in general, to learn text representations. The BERT approach, which relies heavily on self-attention, can capture interactions between any two words across an entire article (long text). Previous work [16] demonstrates that merging text representations from various layers does not significantly increase performance. Moreover, the length of the texts in the text classification task is close to the maximum allowed by the BERT model, making it challenging to create an auxiliary sentence.

In this article, we propose adding two types of classification heads to our base model for the text classification engine, to learn text representations that fully reflect deep semantics. These two models are the Token Classifier and the Sequence Classifier. A token classifier uses all the output states of the model (features that are propagated over many layers of the pre-trained model) and maps all the numerical representations to a label, whereas the Sequence Classifier utilizes a slice of the information from all the states and maps that representation to a label.

In summary, our contributions are:

  • Investigate the use of two classifiers of different natures in text classification engines on a standard dataset, which allows us to compare results.

  • Explore the advantages and disadvantages of each method.

  • Label long texts that are longer than 256 tokens using the token classification method.

II.   Data Description

Modern natural language processing systems learn effective models with the help of annotated data as supervision. These models can typically only be utilized within the language they are trained on (typically English), because the training data is only available in that language. Since it is impractical to collect data in every language, cross-lingual language understanding (XLU) and low-resource cross-language transfer have gained popularity. The Cross-lingual Natural Language Inference (XNLI) corpus is an extension of the Multi-Genre NLI (MultiNLI) corpus. The dataset [18] contains 392,702 train, 2,490 validation, and 5,010 test samples per language across 15 languages (see Table 1 for the English subset). The model is trained on the training set, and the test set is used to find the best hyper-parameters and model. The validation set is an unseen set that is used later for testing the performance of the classifiers.

Table 1: XNLI Data Stats

Set          Samples   Tokens    Texts > 256 Tokens   Average Label   % Label 0   % Label 1   % Label 2
Train        392702    392702    66                   1               33.33       33.33       33.33
Test         5010      5010      0                    1               33.33       33.33       33.33
Validation   2490      2490      0                    1               33.33       33.33       33.33
Table 2: Sample data

Sample 1
  Premise:    Conceptually cream skimming has two basic dimensions - product and geography.
  Hypothesis: Product and geography are what make cream skimming work.
  Label:      1

Sample 2
  Premise:    'you know during the season and i guess at at your level uh you lose them to the next level if if they decide to recall the the parent team the Braves decide to call to recall a guy from triple A then a double A guy goes up to replace him and a single A guy goes up to replace him'
  Hypothesis: You lose the things to the following level if the people recall.
  Label:      0

Sample 3
  Premise:    Gays and lesbians.
  Hypothesis: Heterosexuals.
  Label:      2

Table 1 shows basic statistics of the English portion of the XNLI dataset, such as the number of samples in each set, the overall number of tokens across all samples, the number of texts whose token count is greater than 256, and the distribution of the labels. The XNLI dataset is used as a sample dataset for testing the scoring hypothesis with the two different methods. The dataset is chosen because the label distribution and the mean of the labels are perfectly balanced.

Table 2 shows a sample of the text data used for classification. The Premise and Hypothesis texts are combined to construct our sample text. The labels are integer encoded: 0 represents entailment, 1 represents neutral, and 2 represents contradiction. This sample is suitable for a comparison study of the different natures of the transformer classifiers.
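For illustration, a minimal sketch of this data preparation step is shown below, assuming the English XNLI split is loaded through the Hugging Face datasets library; the exact way the premise and hypothesis are concatenated in our pipeline is an assumption.

```python
# Minimal sketch (assumption: English XNLI loaded via the Hugging Face
# `datasets` library; the concatenation scheme is illustrative).
from datasets import load_dataset

xnli = load_dataset("xnli", "en")  # splits: train, validation, test

def combine(example):
    # Join premise and hypothesis into one sample text.
    # Labels: 0 = entailment, 1 = neutral, 2 = contradiction.
    example["text"] = example["premise"] + " " + example["hypothesis"]
    return example

xnli = xnli.map(combine)
print(xnli["train"][0]["text"], xnli["train"][0]["label"])
```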

III.   Background for Text Classification Engines

The development of neural network-based natural language processing has had a significant impact on the current possibilities for text classification engines. Because the terminology used in a machine learning context frequently differs from that used in psychometrics and is occasionally ambiguous, a few preliminary comments on the meaning of some of the terms used, as well as some additional methodological background, are provided first. Following that, more background information is provided along with two significant NLP techniques based on neural networks and their accompanying AES successes.

i.   Terms

  • Model: A particular machine learning architecture with many parameters that need to be tuned to find a mapping between a text and a label. It can be pre-trained on a specific training dataset.

  • Token/Tokenization: In an NLP task, a text sequence is the input to the model. Breaking this text sequence into chunks of words or subwords and mapping them to a series of numbers is called tokenization.

  • Padding: Neural networks only accept input sequences of fixed size, while texts are variable in length. To overcome this issue, pad tokens are added to any sequence that is shorter than the maximum length.

  • Embeddings: The embeddings give the tokens a meaning by converting a single token into a high-dimensional vector.

  • Feature: In the text classification regime, the features refer to the embeddings of each token of a text sequence.

  • Language Model: A language model is a model that predicts the next word, subword, or character in a text.

ii.   Text Classification Modeling Approaches

  • Traditional Approaches: The bag-of-words (BOW) approach was commonly used in text classification for a number of years; the aim of these models is to extract features from the text and label the text based on those features [19, 20, 21].

  • Recurrent Neural Nets (RNN): The gated recurrent unit (GRU) [22] and the long short-term memory (LSTM) [23] are the two most common types of RNN models. They are used very often in the field of AES.

  • Transformer Models: Transformer models are neural networks based on an attention mechanism [24] that are pre-trained on large text corpora for the purpose of language modeling.

Transformer-based language models of various architecture types (BERT, ELECTRA, RoBERTa, etc.) first undergo self-supervised training on sizable text corpora. They are then fine-tuned to a particular task (such as classification, question answering, token classification, generation, or summarization) in a subsequent stage using supervised learning and hand-coded labels. In the next sections we explain our methodology for using different transformer classifiers on long or short texts.

IV.   Long Text Tokenization

Powerful neural network architectures known as Transformers [24, 25, 26] have achieved state-of-the-art status in a number of machine learning domains, including Natural Language Processing (NLP), Neural Machine Translation (NMT) [27], and text generation and summarization. Unfortunately, the attention of a conventional Transformer scales quadratically with the number of tokens, making it unaffordable for sequences longer than 512 tokens. Several solutions to this problem have been proposed in [28, 29, 30, 31]. Most methods limit the attention mechanism to focus on nearby neighborhoods or include structural priors on attention such as sparsity, pooling-based compression, clustering, or convolution techniques.

A dataset must first be preprocessed into an input format before it can be used to train a model: the data must be transformed and put together into batches of texts. This preprocessing step is called tokenization; a tokenizer breaks text into a series of tokens and then turns those tokens into ID numbers. Transformer architectures are trained on large datasets with a restriction on the size of the tokenized input, and most transformers have a limit of 512 input tokens.

In this work we do not change the model architecture to address the issue of long texts by increasing the limit size; instead, we propose to use the 256-token architecture with a method based on overlapping segments. In the Token Classifier model, we introduce a stride hyper-parameter that controls the amount of text by which consecutive segments overlap. A single long text is chunked into N texts of 256-token size in such a way that we track the overflowing tokens.

Figure 1: Long Text Chunking

In Figure 1, a text containing 532 tokens was passed through a tokenizer that returned tokenized inputs of length 512. Since the number of tokens in the text was more than the maximum length, the text was truncated to length 512 and the remaining tokens were put in the next entry. The last 20 token IDs of the first entry are also present at the beginning of the second entry due to the use of the stride parameter. Since the second entry falls short of the maximum length of 512, padding was added to it.
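A minimal sketch of this chunking step is shown below, assuming the Hugging Face tokenizer's overflow options are used; the Electra checkpoint name and the stride of 20 (as in the Figure 1 example) are illustrative, while the 256-token limit matches the setting used in our experiments.

```python
# Minimal sketch of long-text chunking with overlapping segments
# (assumptions: Hugging Face tokenizer overflow options, illustrative
# checkpoint name, stride of 20 as in the Figure 1 example).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/electra-base-discriminator")

long_text = "word " * 600  # stand-in for a text longer than 256 tokens
encoding = tokenizer(
    long_text,
    max_length=256,                  # chunk size used in this work
    truncation=True,
    stride=20,                       # overlap between consecutive chunks
    return_overflowing_tokens=True,  # keep the overflow instead of dropping it
    padding="max_length",            # pad the final, shorter chunk
)

# Each row of input_ids is one 256-token chunk of the original text;
# overflow_to_sample_mapping ties every chunk back to its source sample.
print(len(encoding["input_ids"]), encoding["overflow_to_sample_mapping"])
```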

V.   Methodology

i.   Token Classifier For Text Labeling

The token classification model architecture is unquestionably one of the most popular among the machine learning techniques used in NLP. A lot of work has been done in the field of token classification, such as dependency parsing [32], detecting label errors in token classification data [33], and token sequence labeling vs. clause classification for English emotion stimulus detection [34]. Tokenization, or simply identifying the beginning and end of the words and punctuation in the text, is the initial stage of token classification. The next step is to decide which class (tag) each token belongs to. The choice of classes for problems like part-of-speech tagging is simple: they are the part-of-speech classes that each word may have, such as noun, verb, adjective, etc. Other tasks, like chunking, require subtler class definitions, since one or more nearby tokens may appear to belong to the same sort of phrasal chunk but actually be part of a different chunk. This setup is also useful for a named entity recognition system.

However, we propose to use the Token Classifier as a text classifier that labels whole texts. The XNLI dataset is used in this case as a sample set of texts. Every text has a label, and that label is assigned to every single token present in the text, as shown in Figure 2. The model predicts a label for every token, and finally the average of all the token labels is assigned as the final label.

Figure 2: Token Classifier Model

For texts whose length is more than 256 tokens, the overflow option of the tokenizer is used, and all the overflowing tokens are mapped back to a unique sample ID. We can then train on a text with more than 256 tokens by introducing the stride hyper-parameter: a long text is scanned by taking the first 256 tokens and then using a stride of n to overlap with the next 256-token chunk, until the whole long text is covered by 256-token chunks.

Finally, we can train the Token Classifier in the token classification setting with this chunking mechanism, which addresses the limitation of using only the first 256 tokens.
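The following is a minimal inference sketch of this idea, not the exact training code: a token classification head with three labels sits on top of the base model, and the per-token predictions over the real (non-padding) tokens are averaged and rounded to obtain the text label. The checkpoint name is an illustrative assumption.

```python
# Minimal sketch of text labeling with a token classification head
# (assumption: Electra base checkpoint; not the exact training code).
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

name = "google/electra-base-discriminator"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForTokenClassification.from_pretrained(name, num_labels=3)

def predict_text_label(text: str) -> int:
    enc = tokenizer(text, max_length=256, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits          # shape (1, seq_len, 3)
    token_labels = logits.argmax(dim=-1)      # one predicted label per token
    mask = enc["attention_mask"].bool()
    # Average the per-token labels over real tokens and round to a class.
    return int(token_labels[mask].float().mean().round())

print(predict_text_label("Product and geography are what make cream skimming work."))
```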

ii.   Sequence Classifier For Text Labeling

Text classification is a well-known and extensively studied task in Natural Language Processing (NLP) and text mining. It refers to the act of assigning one or more labels from a set of classes to a document [35]. Simple binary classification, multi-class classification, and multi-label classification are examples of text classification problem types. Simple binary classification, for instance, asks whether a document has positive sentiment or not, while multi-label classification allows for the assignment of many labels to a single document. Text classification engines are a subcategory of text classification.

Tokenizing the text input is the initial stage of the classification process using a transformer model, just as in the other traditional methods. The format of the tokenization is strictly bound to the chosen pre-trained transformer. The next step is to fine-tune the Sequence Classifier model on the set of labels.

Figure 3: Sequence Classifier Model

As shown in Figure 3, only the prediction for the first token is used to map the features to the target labels, whereas in the token classifier all the token predictions are used in predicting the final label for the text.
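For contrast, a minimal inference sketch of the Sequence Classifier head is shown below: the representation of the first token is mapped to a single prediction for the whole text. As before, the checkpoint name is an illustrative assumption.

```python
# Minimal sketch of the Sequence Classifier head: one prediction per text,
# derived from the first-token representation (illustrative checkpoint).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

name = "google/electra-base-discriminator"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=3)

def predict_text_label(text: str) -> int:
    enc = tokenizer(text, max_length=256, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits   # shape (1, 3): one prediction per text
    return int(logits.argmax(dim=-1))
```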

VI.   Results

The results of scoring using Sequence Classification and Token Classification are compared. First, the pre-trained Electra model [36] is fine-tuned on the training set. A grid search is used to find the best model according to our main metric, the Quadratic Weighted Kappa (QWK), which we use to evaluate model performance (a minimal QWK computation is sketched after the list below). The hyper-parameters over which we optimize our models are:

  • Learning Rate

  • Batch Size

  • Stride
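The QWK sketch below uses scikit-learn's implementation of Cohen's kappa with quadratic weights; the exact implementation used in our experiments is not shown here, so treat this as an assumption, and the labels are illustrative rather than experimental results.

```python
# Minimal QWK sketch (assumption: scikit-learn's Cohen's kappa with
# quadratic weights; labels are illustrative, not from the experiments).
from sklearn.metrics import accuracy_score, cohen_kappa_score

y_true = [0, 1, 2, 1, 0, 2]   # human labels
y_pred = [0, 1, 2, 2, 0, 1]   # model labels

qwk = cohen_kappa_score(y_true, y_pred, weights="quadratic")
acc = accuracy_score(y_true, y_pred)
print(f"QWK={qwk:.3f}  accuracy={acc:.3f}")
```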

i.   Grid Search

Table 3: Grid Search - Sequence Classification

experiment_id   epoch   num label   test acc   test QWK   batch size   lr
3               1       3           0.805      0.766      60           0.000035
2               1       3           0.807      0.766      50           0.000025
1               1       3           0.805      0.766      50           0.000035
4               2       3           0.799      0.760      60           0.000025
Table 4: Grid Search - Token Classification

experiment   epoch   num label   test acc   test QWK   stride   max token   batch size   lr
8            4       3           0.825      0.800      250      256         12           0.000035
4            5       3           0.810      0.792      200      256         10           0.000025
5            3       3           0.821      0.790      250      256         10           0.000025
1            4       3           0.819      0.789      200      256         10           0.000035
3            5       3           0.815      0.789      300      256         10           0.000035
2            7       3           0.817      0.788      250      256         10           0.000035
6            10      3           0.817      0.787      300      256         10           0.000025
7            4       3           0.814      0.780      200      256         12           0.000035

Table 3 shows the results of training a Sequence Classifier Electra model on the training set and saving the best model according to the test set. The highest test QWK for the Sequence Classification method is 0.766.

Table 4 shows the results of training a Token Classifier Electra model on the training set and saving the best model according to the test set. As can be seen, in the token classification method the stride is an additional hyper-parameter introduced to deal with long texts. The highest test QWK for the token classification method is 0.800.

Now it is time to evaluate the best models on the validation set, which the networks have not seen before, to reach a final decision on the performance of each model.

ii.   Validation Results

The final models are tested on the validation set, and Table 5 shows each model's performance. As can be seen, the Token Classifier has higher accuracy (test acc and val acc) and performs well on scoring texts. Besides QWK, the standardized mean difference (SMD) is used to measure efficacy in terms of a continuous measurement (lower is better); a sketch of its computation follows Table 5.

Table 5: Validation Results

Model                 val acc   val qwk   val smd
Token Classifier      0.827     0.803     0.043
Sequence Classifier   0.814     0.781     0.052
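Since the exact SMD formula is not spelled out in this paper, the sketch below uses a common pooled-standard-deviation variant as an assumption, with illustrative labels.

```python
# Minimal SMD sketch (assumption: pooled-standard-deviation variant of the
# standardized mean difference; labels are illustrative).
import numpy as np

def smd(machine: np.ndarray, human: np.ndarray) -> float:
    pooled_sd = np.sqrt((machine.std(ddof=1) ** 2 + human.std(ddof=1) ** 2) / 2.0)
    return float(abs(machine.mean() - human.mean()) / pooled_sd)

machine = np.array([0, 1, 2, 2, 0, 1], dtype=float)
human = np.array([0, 1, 2, 1, 0, 2], dtype=float)
print(f"SMD={smd(machine, human):.3f}")
```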

VII.   Conclusion and Future Work

In this work, the performance of Sequence Classifiers and Token Classifiers is compared. The token classification model has better performance on text classification and labeling texts. Using all of the prediction positions that the model outputs helps extract more information for predicting labels. The sequence classifier is the most commonly used model for scoring, but its performance is not as good as that of the token classification method, because the sequence classifier uses only the first prediction token, ignores the information in all the other positions, and then maps the text to its label. The use of all the predictions extracted from the features provides better performance, given that the base models are exactly the same in this study. For future work, different types of aggregation in token classification can be explored instead of taking the mean of the label of every single token. We could also verify the performance on the Kaggle essay dataset.

References

  • [1] Y. Attali and J. Burstein, “Automated essay scoring with e-rater® v. 2,” The Journal of Technology, Learning and Assessment, vol. 4, no. 3, 2006.
  • [2] Y. C. Goh, X. Q. Cai, W. Theseira, G. Ko, and K. A. Khor, “Evaluating human versus machine learning performance in classifying research abstracts,” Scientometrics, vol. 125, no. 2, pp. 1197–1212, 2020.
  • [3] E. B. Page and N. S. Petersen, “The computer moves into essay grading: Updating the ancient test,” Phi delta kappan, vol. 76, no. 7, p. 561, 1995.
  • [4] K. Taghipour and H. T. Ng, “A neural approach to automated essay scoring,” in Proceedings of the 2016 conference on empirical methods in natural language processing, 2016, pp. 1882–1891.
  • [5] G. Liang, B.-W. On, D. Jeong, H.-C. Kim, and G. S. Choi, “Automated essay scoring: A siamese bidirectional lstm neural network architecture,” Symmetry, vol. 10, no. 12, p. 682, 2018.
  • [6] P. U. Rodriguez, A. Jafari, and C. M. Ormerod, “Language models and automated essay scoring,” arXiv preprint arXiv:1909.09482, 2019.
  • [7] M. Uto and M. Okano, “Robust neural automated essay scoring using item response theory,” in International Conference on Artificial Intelligence in Education.   Springer, 2020, pp. 549–561.
  • [8] Y. Farag, H. Yannakoudakis, and T. Briscoe, “Neural automated essay scoring and coherence modeling for adversarially crafted input,” arXiv preprint arXiv:1804.06898, 2018.
  • [9] D. Alikaniotis, H. Yannakoudakis, and M. Rei, “Automatic text scoring using neural networks,” arXiv preprint arXiv:1606.04289, 2016.
  • [10] F. Dong and Y. Zhang, “Automatic features for essay scoring–an empirical study,” in Proceedings of the 2016 conference on empirical methods in natural language processing, 2016, pp. 1072–1077.
  • [11] Y. Chen, “Convolutional neural network for sentence classification,” Master’s thesis, University of Waterloo, 2015.
  • [12] T. Mikolov, M. Karafiát, L. Burget, J. Cernockỳ, and S. Khudanpur, “Recurrent neural network based language model.” in Interspeech, vol. 2, no. 3.   Makuhari, 2010, pp. 1045–1048.
  • [13] A. Radford, K. Narasimhan, T. Salimans, I. Sutskever et al., “Improving language understanding by generative pre-training,” 2018.
  • [14] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
  • [15] F. Dong, Y. Zhang, and J. Yang, “Attention-based recurrent convolutional neural network for automatic essay scoring,” in Proceedings of the 21st conference on computational natural language learning (CoNLL 2017), 2017, pp. 153–162.
  • [16] C. Sun, X. Qiu, Y. Xu, and X. Huang, “How to fine-tune bert for text classification?” in China national conference on Chinese computational linguistics.   Springer, 2019, pp. 194–206.
  • [17] A. Cohan, I. Beltagy, D. King, B. Dalvi, and D. S. Weld, “Pretrained language models for sequential sentence classification,” arXiv preprint arXiv:1909.04054, 2019.
  • [18] A. Conneau, G. Lample, R. Rinott, A. Williams, S. R. Bowman, H. Schwenk, and V. Stoyanov, “Xnli: Evaluating cross-lingual sentence representations,” arXiv preprint arXiv:1809.05053, 2018.
  • [19] Y.-Y. Chen, C.-L. Liu, C.-H. Lee, T.-H. Chang et al., “An unsupervised automated essay-scoring system,” IEEE Intelligent systems, vol. 25, no. 5, pp. 61–67, 2010.
  • [20] C. Leacock and M. Chodorow, “C-rater: Automated scoring of short-answer questions,” Computers and the Humanities, vol. 37, no. 4, pp. 389–405, 2003.
  • [21] V. Ramalingam, A. Pandian, P. Chetry, and H. Nigam, “Automated essay grading using machine learning algorithm,” in Journal of Physics: Conference Series, vol. 1000, no. 1.   IOP Publishing, 2018, p. 012030.
  • [22] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase representations using rnn encoder-decoder for statistical machine translation,” arXiv preprint arXiv:1406.1078, 2014.
  • [23] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
  • [24] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
  • [25] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio, “Graph attention networks,” arXiv preprint arXiv:1710.10903, 2017.
  • [26] J. Vig, “A multiscale visualization of attention in the transformer model,” arXiv preprint arXiv:1906.05714, 2019.
  • [27] M. X. Chen, O. Firat, A. Bapna, M. Johnson, W. Macherey, G. Foster, L. Jones, N. Parmar, M. Schuster, Z. Chen et al., “The best of both worlds: Combining recent advances in neural machine translation,” arXiv preprint arXiv:1804.09849, 2018.
  • [28] I. Beltagy, M. E. Peters, and A. Cohan, “Longformer: The long-document transformer,” arXiv preprint arXiv:2004.05150, 2020.
  • [29] R. Child, S. Gray, A. Radford, and I. Sutskever, “Generating long sequences with sparse transformers,” arXiv preprint arXiv:1904.10509, 2019.
  • [30] I. Bello, B. Zoph, A. Vaswani, J. Shlens, and Q. V. Le, “Attention augmented convolutional networks,” in Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 3286–3295.
  • [31] A. Gulati, J. Qin, C.-C. Chiu, N. Parmar, Y. Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y. Wu et al., “Conformer: Convolution-augmented transformer for speech recognition,” arXiv preprint arXiv:2005.08100, 2020.
  • [32] R. L. Milidiu, C. E. M. Crestana, and C. N. dos Santos, “A token classification approach to dependency parsing,” in 2009 Seventh Brazilian Symposium in Information and Human Language Technology.   IEEE, 2009, pp. 80–88.
  • [33] W.-C. Wang and J. Mueller, “Detecting label errors in token classification data,” arXiv preprint arXiv:2210.03920, 2022.
  • [34] L. A. M. Oberländer and R. Klinger, “Token sequence labeling vs. clause classification for english emotion stimulus detection,” in Proceedings of the Ninth Joint Conference on Lexical and Computational Semantics, vol. 1, 2020, pp. 58–70.
  • [35] F. Sebastiani, “Machine learning in automated text categorization,” ACM computing surveys (CSUR), vol. 34, no. 1, pp. 1–47, 2002.
  • [36] K. Clark, M.-T. Luong, Q. V. Le, and C. D. Manning, “Electra: Pre-training text encoders as discriminators rather than generators,” arXiv preprint arXiv:2003.10555, 2020.