
Enhancing Pre-trained Language Model with Lexical Simplification

Rongzhou Bao1,2,3, Jiayi Wang1,2,3, Zhuosheng Zhang1,2,3, Hai Zhao1,2,3
1 Department of Computer Science and Engineering, Shanghai Jiao Tong University
2 Key Laboratory of Shanghai Education Commission for Intelligent Interaction
and Cognitive Engineering, Shanghai Jiao Tong University, Shanghai, China
3 MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University, Shanghai, China
[email protected]
{wangjiayi_102_23,zhangzs}@sjtu.edu.cn,[email protected]
Abstract

For both human readers and pre-trained language models (PrLMs), lexical diversity may lead to confusion and inaccuracy when understanding the underlying semantic meanings of given sentences. By substituting complex words with simple alternatives, lexical simplification (LS) is a recognized method to reduce such lexical diversity and thereby improve the understandability of sentences. In this paper, we leverage LS and propose a novel approach that effectively improves the performance of PrLMs in text classification. A rule-based simplification process is applied to a given sentence, and PrLMs are encouraged to predict the real label of the sentence with auxiliary inputs from the simplified version. Even with strong PrLMs (BERT and ELECTRA) as baselines, our approach still yields further improvements on various text classification tasks.

1 Introduction

Pre-trained language models (PrLMs) such as BERT Devlin et al. (2018), RoBERTa Liu et al. (2019), and ELECTRA Clark et al. (2020) have led to strong performance gains in downstream natural language understanding (NLU) tasks, including text classification. However, Li et al. (2020); Jin et al. (2019) demonstrate that it only takes a few simple synonym replacements to mislead the predictions of PrLMs on various text classification tasks. Such results indicate that lexical diversity can negatively affect the accuracy with which PrLMs understand semantic meaning.

In order to reduce lexical diversity, previous works have proposed approaches for lexical simplification (LS) Gooding and Kochmar (2019); Qiang et al. (2020). By substituting complex words with simpler alternatives in original sentences, LS generates simplified versions that are much easier for human readers to understand. Inspired by these studies, we leverage LS as a paraphrasing tool to enhance the prediction accuracy of PrLMs in text classification tasks.

An LS procedure well suited to neural networks (e.g., PrLMs) is crucial to our overall approach. However, existing LS methods are not designed for PrLMs: current methods Gooding and Kochmar (2019); Qiang et al. (2020) mainly aim to ease the reading process for human readers rather than to improve the prediction accuracy of neural networks. Furthermore, current LS methods are very time-consuming, since they apply large pre-trained neural networks to detect and replace complex words recursively Qiang et al. (2020). Therefore, we design a lexical simplification method based on lemmatization and rare word replacement (abbreviated as LRLS), which is both more efficient and better suited to our purpose of generating simplified versions of given sentences.

In order to better accommodate the LRLS simplification method to PrLMs and improve overall performance, an auxiliary framework is designed. The simplified sentence generated by LRLS serves as an auxiliary input in both the training and inference phases of PrLMs. In this way, PrLMs can make decisions based on both the original sentence and its simplified perspective, and the challenge posed by lexical diversity in text classification is significantly reduced.

Model | SST-2 | MR | CR | SUBJ | AG | Avg
BERT-base | 92.4 | 86.1 | 90.0 | 97.3 | 94.2 | 92.0
+LS | 93.5 (+1.1) | 88.1 (+2.0) | 90.8 (+0.8) | 98.0 (+0.7) | 95.0 (+0.8) | 93.1 (+1.1)
ELECTRA-large | 96.7 | 90.0 | 94.3 | 97.4 | 94.6 | 94.6
+LS | 97.5 (+0.8) | 91.4 (+1.4) | 94.5 (+0.2) | 98.1 (+0.7) | 95.3 (+0.7) | 95.3 (+0.7)
Table 1: Performances (%) across five text classification tasks for models with and without LS.

A series of experiments is conducted on various text classification tasks. Empirical results show that our approach notably improves the performance of PrLMs, and ablation studies prove the effectiveness of our LRLS method. Furthermore, we compare LRLS with other paraphrasing methods used in data augmentation, such as random replacement of several words by synonyms Wu et al. (2019); Wei and Zou (2019), back-translation Xie et al. (2019); Edunov et al. (2018), and cutoff Shen et al. (2020). The analysis demonstrates that our LRLS method remains the most effective.

2 Method

2.1 LRLS Lexical Simplification Process

A well-adapted LS process is essential to our approach. Previous works Li et al. (2020); Jin et al. (2019) show that the predictions of PrLMs can easily be misled by replacing only a few words in the given sentences with their synonyms. By carefully examining the adversarial examples, we find that changing the tense of verbs, changing nouns between singular and plural, and replacing words with their less frequent synonyms account for the majority of adversarial examples. This observation is also confirmed by Mozes et al. (2020).

Inspired by this observation, our LRLS method consists of two major steps: (1) lemmatization, transforming verbs and nouns into their corresponding lemmas, and (2) replacing rare words with their more common synonyms. First, we employ the Natural Language Toolkit (NLTK) to detect the verbs and nouns in the given sentence, transforming every verb to its infinitive form and every noun to its singular form. Second, according to a word frequency list (https://github.com/hermitdave/FrequencyWords), we label every word whose frequency is below a threshold $n_f$ as a rare word. We then use the word embeddings from Mrkšić et al. (2016), which are specially curated for locating synonyms, to find the top $n_s$ synonyms of each identified rare word by cosine similarity, and replace the rare word with the candidate of highest frequency. A part-of-speech (POS) check ensures that every synonym candidate has the same POS as the original word.
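To make the two steps concrete, the following is a minimal sketch of the LRLS procedure rather than our released implementation; the NLTK resources, the word_freq dictionary, the way the counter-fitted vectors are loaded, and the default values of $n_f$ and $n_s$ are all illustrative assumptions.

```python
# A minimal sketch of LRLS (illustrative, not the exact implementation).
# Assumes NLTK with the punkt, averaged_perceptron_tagger, and wordnet
# resources, a dict `word_freq` built from the FrequencyWords list, and the
# counter-fitted vectors loaded as gensim KeyedVectors, e.g.:
#   from gensim.models import KeyedVectors
#   syn_vectors = KeyedVectors.load_word2vec_format(
#       "counter-fitted-vectors.txt", binary=False, no_header=True)
import nltk
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def lrls_simplify(sentence, word_freq, syn_vectors, n_f=1000, n_s=8):
    """Lemmatize verbs/nouns, then replace rare words with frequent synonyms."""
    tokens = nltk.word_tokenize(sentence)
    simplified = []
    for word, tag in nltk.pos_tag(tokens):
        # Step 1: lemmatization -- verbs to their infinitive, nouns to singular.
        if tag.startswith("VB"):
            word = lemmatizer.lemmatize(word, pos="v")
        elif tag.startswith("NN"):
            word = lemmatizer.lemmatize(word, pos="n")
        # Step 2: a word with corpus frequency below n_f is treated as rare and
        # replaced by the most frequent of its n_s nearest synonyms, subject to
        # a (crude, out-of-context) POS check against the original tag.
        if word_freq.get(word.lower(), 0) < n_f and word.lower() in syn_vectors:
            candidates = [w for w, _ in syn_vectors.most_similar(word.lower(), topn=n_s)
                          if nltk.pos_tag([w])[0][1] == tag]
            if candidates:
                word = max(candidates, key=lambda w: word_freq.get(w, 0))
        simplified.append(word)
    return " ".join(simplified)
```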

2.2 Simplified Sentence As Auxiliary Input

Following Devlin et al. (2018), the original sentence and its simplified version are concatenated into a single input sequence. In our approach, the original and simplified sentences are differentiated in two ways. First, a special separation token ([SEP]) is inserted between the two sentences. Second, a learned segment embedding is added to every token to indicate whether it belongs to the original or the simplified sentence. In both the training and inference phases, we feed PrLMs the original-simplified sequence as input. The rest of the implementation remains the same as the original PrLMs.
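As an illustration of this input format, the sketch below builds the original-simplified pair with the standard sentence-pair encoding of the HuggingFace transformers library; the model name, the example sentence, and the reuse of the hypothetical lrls_simplify helper from the sketch above are assumptions, not the exact experimental setup.

```python
# Illustrative only: feeding a PrLM the original sentence and its simplified
# version as a sentence pair ([CLS] original [SEP] simplified [SEP]).
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

original = "the film's gorgeous visuals cannot mask its hollow narrative"
simplified = lrls_simplify(original, word_freq, syn_vectors)  # hypothetical helper

# token_type_ids produced by the tokenizer give every token the segment
# embedding that marks it as part of the original or the simplified sentence.
inputs = tokenizer(original, simplified, return_tensors="pt", truncation=True)
logits = model(**inputs).logits
```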

Figure 1: Examples showing how auxiliary inputs from simplified sentences help the PrLMs make the right prediction. In the result column, Baseline shows the original prediction made by BERT, and LRLS-Aux shows the prediction generated with the auxiliary inputs from simplified sentences.

3 Experimental Setup

3.1 Benchmark Datasets

We conduct our experiments on five benchmark text classification tasks: (1) SST-2: Stanford Sentiment Treebank Socher et al. (2013), (2) CR: customer reviews Hu and Liu (2004); Liu et al. (2015), (3) SUBJ: subjectivity/objectivity dataset Pang and Lee (2004), (4) MR: movie reviews Pang and Lee (2005), and (5) AG: AG's News, a classification task over four news topics: World, Sports, Business, and Science.

3.2 Baseline Models

We use (1) BERT-base Devlin et al. (2018) with 12 layers, 768 hidden units, 12 heads and 110M parameters, and (2) ELECTRA-large Clark et al. (2020) with 24 layers, 1024 hidden units, 16 heads and 340M parameters as our baseline PrLMs.

4 Experiments

In this section, comprehensive experiments and analyses are conducted. For all experiments, we report results averaged over three different random seeds.

4.1 Our Approach Makes Gains

As shown in Table 1, we run both BERT-base Devlin et al. (2018) and ELECTRA-large Clark et al. (2020), with and without LS, across all five datasets. The average gain is 1.1 for BERT-base and 0.7 for ELECTRA-large. As ELECTRA-large is a very strong baseline, these results prove the effectiveness of our approach. As shown in Figure 1, we select several examples from MR and SST-2 to further illustrate how PrLMs benefit from the auxiliary input of simplified sentences.

4.2 Impact of Lexical Simplification Process

Our LRLS method is composed of two steps: the transformation of verbs and nouns into their lemmas, and the replacement of rare words. To investigate the impact of different LS methods, we first apply the two steps separately and compare them with the full LRLS method. We also include BERT-LS Qiang et al. (2020), which leverages the masked language model of BERT to generate synonym candidates for rare words, for further comparison.

As shown in Table 2, the lemma transformation and rare word replacement are both effective, and combining the two further improves performance. Our method also exceeds BERT-LS. Moreover, our method is more than a hundred times faster than BERT-LS, since it is entirely rule-based, while BERT-LS uses a large pre-trained neural network to detect and replace complex words recursively.

Method | MR | SST-2
BERT-base | 86.4 | 92.4
Lemma | 87.6 | 93.1
RR | 87.7 | 92.9
BERT-LS | 87.9 | 93.1
LRLS | 88.1 | 93.5
Table 2: Performances (%) using different LS methods. Lemma represents the transformation of verbs and nouns into their lemmas; RR represents the replacement of rare words.

4.3 Words Replacement Hyperparameters

The rare word replacement process is controlled by two hyperparameters: $n_f$ and $n_s$. $n_f$ is the frequency threshold below which a word is labeled as rare and replaced; the larger $n_f$, the more words are replaced. $n_s$ is the number of synonym candidates; the larger $n_s$, the higher the chance that a rare word is replaced by a more common but less similar candidate. To investigate the effect of these two hyperparameters, we vary them separately and conduct experiments on MR and SST-2 to measure the impact on performance.

As shown in Figure 2, the best performance gain is obtained with middle-sized $n_f$ and $n_s$, which is consistent with our expectation: if $n_f$ and $n_s$ are too small, the simplified sentence will be almost identical to the original version; conversely, if they are too large, the simplification may change the underlying meaning of the sentence.

Figure 2: Average performance gain over MR and SST-2. $n_s$ is the number of synonym candidates and $n_f$ is the frequency threshold below which a word is replaced.

4.4 Alternative Frameworks

We use simplified sentences as auxiliary inputs to improve the prediction accuracy of PrLMs. However, there are other possible frameworks for incorporating lexical simplification into PrLMs.

One alternative framework is to feed the PrLM only the simplified sentences in both the training and inference phases. In this case, predictions are made solely from the simplified versions.

Another framework is to leverage LS as a data augmentation technique. To illustrate, let $D=\{x_i, y_i\}_{i=1\dots N}$ denote the training dataset. For a given sample $\{x_i, y_i\}$, we generate an augmented sample by simplifying the sentence $x_i$ to $x'_i$ while preserving the label $y_i$. In this way, we obtain an augmented dataset $D'=\{x'_i, y_i\}_{i=1\dots N}$. PrLMs can thus learn from both the training set $D$ and the augmented set $D'$.
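A minimal sketch of this augmentation alternative, assuming train_set is a list of (sentence, label) pairs and reusing the hypothetical lrls_simplify helper from the earlier sketch:

```python
# Build D' by simplifying every training sentence while keeping its label,
# then fine-tune the PrLM on the union of D and D'.
augmented = [(lrls_simplify(x, word_freq, syn_vectors), y) for x, y in train_set]
combined = train_set + augmented
```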

Experiments are conducted on BERT-base to compare our framework with the two alternative frameworks mentioned above.

Method | MR | SST-2
BERT-base | 86.4 | 92.4
LRLS only | 86.5 | 92.1
LRLS Aug | 87.9 | 92.6
LRLS Aux | 88.1 | 93.5
Table 3: Performances (%) using different frameworks to leverage simplified sentences. LRLS only represents predictions made solely from simplified sentences; LRLS Aug represents the use of simplified sentences for training data augmentation; LRLS Aux represents using simplified sentences as auxiliary inputs.

As shown in Table 3, using simplified sentences as the only input (LRLS only) slightly harms the performance of the PrLM, because part of the semantic meaning carried by the original sentences may be lost during the simplification process. Experiments also show that leveraging lexical simplification for data augmentation (LRLS Aug) benefits the overall performance. However, this framework doubles the training time, and its performance is still worse than that of our framework (LRLS Aux).

4.5 Alternative Paraphrasing Methods

While we leverage the LRLS method to paraphrase the original sentence and generate auxiliary inputs for PrLMs, we also investigate whether other commonly used paraphrasing techniques are equally effective.

These paraphrasing methods include (1) random replacement of several words by their synonyms Wu et al. (2019); Wei and Zou (2019), (2) back-translation, i.e., translating an existing example $x$ from language A into another language B and then translating it back into A to obtain a paraphrased example $x'$ Xie et al. (2019); Edunov et al. (2018), and (3) randomly deleting several words in the sentence (cutoff) Shen et al. (2020).
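For reference, below are minimal sketches of the random-replacement and cutoff baselines as described above (back-translation is omitted since it requires a translation model). WordNet is used here as a stand-in synonym source and the hyperparameters are illustrative, so these are not the cited works' exact implementations.

```python
import random
from nltk.corpus import wordnet  # requires the NLTK 'wordnet' corpus

def random_synonym_replacement(sentence, n=2):
    """Replace up to n randomly chosen words with a random WordNet synonym."""
    tokens = sentence.split()
    for idx in random.sample(range(len(tokens)), min(n, len(tokens))):
        lemmas = [l.name().replace("_", " ")
                  for syn in wordnet.synsets(tokens[idx]) for l in syn.lemmas()
                  if l.name().lower() != tokens[idx].lower()]
        if lemmas:
            tokens[idx] = random.choice(lemmas)
    return " ".join(tokens)

def cutoff(sentence, p=0.1):
    """Randomly drop roughly a fraction p of the words."""
    kept = [t for t in sentence.split() if random.random() > p]
    return " ".join(kept) if kept else sentence
```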

Each of the above paraphrasing methods is applied to the original sentences to generate auxiliary inputs, which are then incorporated into PrLMs. Performances on MR and SST-2 with the different paraphrasing methods are compared.

As shown in Table 4, cutoff slightly harms the overall performance, because randomly deleting several words from the original sentence to generate a paraphrased version tends to distort the original semantic meaning and adds noise to predictions. Although back-translation and random replacement can slightly boost the performance of PrLMs, our LRLS method remains the most effective.

Method | MR | SST-2
BERT-base | 86.4 | 92.4
+back-translation | 87.0 | 92.8
+cutoff | 86.3 | 91.6
+random replacement | 87.3 | 92.5
+LRLS | 88.0 | 93.5
Table 4: Performances (%) using different paraphrasing techniques to generate auxiliary inputs.

5 Conclusion

This paper proposes a novel approach that leverages lexical simplification to reduce lexical diversity and enhance the performance of PrLMs on text classification. Experiments on various text classification tasks demonstrate that our approach consistently improves strong baselines.

Within the framework, we incorporate a specially designed lexical simplification process based on lemmatization and rare word replacement (LRLS) for better performance. Our comprehensive analysis also shows that, compared with other paraphrasing techniques used in previous works, LRLS is a more effective paraphrasing method for offering auxiliary information for prediction.

Furthermore, an effective framework (LRLS Aux) leveraging LRLS as auxiliary information is designed. Unlike data augmentation, which only leverages paraphrased information in the training phase, LRLS Aux incorporates the information in both the training and inference phases and achieves larger performance gains. Such a framework may shed light on future studies.

References

  • Clark et al. (2020) Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. 2020. ELECTRA: Pre-training text encoders as discriminators rather than generators. In ICLR.
  • Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  • Edunov et al. (2018) Sergey Edunov, Myle Ott, Michael Auli, and David Grangier. 2018. Understanding back-translation at scale.
  • Gooding and Kochmar (2019) Sian Gooding and Ekaterina Kochmar. 2019. Recursive context-aware lexical simplification. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4855–4865, Hong Kong, China. Association for Computational Linguistics.
  • Hu and Liu (2004) Minqing Hu and Bing Liu. 2004. Mining and summarizing customer reviews. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’04, New York, NY, USA. Association for Computing Machinery.
  • Jin et al. (2019) Di Jin, Zhijing Jin, Joey Tianyi Zhou, and Peter Szolovits. 2019. Is bert really robust? natural language attack on text classification and entailment. arXiv preprint arXiv:1907.11932.
  • Li et al. (2020) Linyang Li, Ruotian Ma, Qipeng Guo, Xiangyang Xue, and Xipeng Qiu. 2020. BERT-ATTACK: Adversarial attack against BERT using BERT. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online. Association for Computational Linguistics.
  • Liu et al. (2015) Qian Liu, Zhiqiang Gao, Bing Liu, and Yuanlin Zhang. 2015. Automated rule selection for aspect extraction in opinion mining. In Proceedings of the 24th International Conference on Artificial Intelligence, IJCAI’15, page 1291–1297. AAAI Press.
  • Liu et al. (2019) Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692.
  • Mozes et al. (2020) Maximilian Mozes, Pontus Stenetorp, Bennett Kleinberg, and Lewis D. Griffin. 2020. Frequency-guided word substitutions for detecting textual adversarial examples.
  • Mrkšić et al. (2016) Nikola Mrkšić, Diarmuid Ó Séaghdha, Blaise Thomson, Milica Gašić, Lina Rojas-Barahona, Pei-Hao Su, David Vandyke, Tsung-Hsien Wen, and Steve Young. 2016. Counter-fitting word vectors to linguistic constraints.
  • Pang and Lee (2004) Bo Pang and Lillian Lee. 2004. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04), pages 271–278, Barcelona, Spain.
  • Pang and Lee (2005) Bo Pang and Lillian Lee. 2005. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL’05), pages 115–124, Ann Arbor, Michigan. Association for Computational Linguistics.
  • Qiang et al. (2020) Jipeng Qiang, Yun Li, Zhu Yi, Yunhao Yuan, and Xindong Wu. 2020. Lexical simplification with pretrained encoders. AAAI.
  • Shen et al. (2020) Dinghan Shen, Mingzhi Zheng, Yelong Shen, Yanru Qu, and Weizhu Chen. 2020. A simple but tough-to-beat data augmentation approach for natural language understanding and generation.
  • Socher et al. (2013) Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In EMNLP.
  • Wei and Zou (2019) Jason Wei and Kai Zou. 2019. EDA: Easy data augmentation techniques for boosting performance on text classification tasks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 6383–6389, Hong Kong, China. Association for Computational Linguistics.
  • Wu et al. (2019) Xing Wu, Shangwen Lv, Liangjun Zang, Jizhong Han, and Songlin Hu. 2019. Conditional bert contextual augmentation. In International Conference on Computational Science, pages 84–95. Springer.
  • Xie et al. (2019) Qizhe Xie, Zihang Dai, Eduard Hovy, Minh-Thang Luong, and Quoc V Le. 2019. Unsupervised data augmentation for consistency training. arXiv preprint arXiv:1904.12848.