Transformers for Headline Selection for Russian News Clusters
Abstract
In this paper, we explore various multilingual and Russian pre-trained transformer-based models for the Dialogue Evaluation 2021 shared task on headline selection. Our experiments show that the combined approach is superior to individual multilingual and monolingual models. We present an analysis of several ways to obtain sentence embeddings and to learn a ranking model on top of them. We achieve 87.28% and 86.60% accuracy on the public and private test sets, respectively.
Keywords: headline selection, news, embeddings, transformer, bert, multilingual, russian
1 Introduction
The task of clustering news stories not only has wide application in industry, but also helps to explore the limits of sentence embeddings obtained with different models. For example, news aggregators actively use clustering algorithms to build news feeds from multiple sources and to select a single headline for each cluster. Recent progress in multilingual models [13], trained on dozens or even hundreds of languages at once, makes it possible to apply them to monolingual tasks, in particular to Russian-language tasks [8]. At the same time, Russian BERT-based models are actively evolving, and comparing them with more universal multilingual ones is of interest.
Alongside clustering, the task of generating or selecting a headline for a news cluster has a wide range of applications. However, despite the rapid progress of generative models and the availability of strong state-of-the-art systems [6], it is not always possible to use them in industry, because the quality of generated text cannot be guaranteed and the demand for computational resources is high. An alternative is to select a ready-made headline from those already present in the cluster. This task can be solved as a classification or ranking problem.
In this paper, we consider the cluster headline selection task. We found little related work apart from [9], where a simple rule-based system is proposed. We use the corpora provided by the Dialogue Evaluation 2021 shared task organizers [3] and present the solution that achieved the best result among all participants. Our code is publicly available at https://github.com/sopilnyak/headline-selection.
2 Experimental evaluation
The training corpus for choosing the best headline is provided by the Dialogue Evaluation 2021 shared task. It consists of pairs of news identifiers (URLs), each labeled with one of four tags: left, right, draw, or bad. The last label means that the annotators identified the pair as a clustering error. The test set is divided into two parts, for the public and private leaderboards, containing news headlines for two specific dates: May 27, 2020 and May 29, 2020. The result is evaluated with a weighted accuracy in which the bad label is omitted; the weights for the remaining labels are shown in Table 1.
Table 1: Weights used to compute the weighted accuracy.

 | left | right | draw
---|---|---|---
left | 1 | 0 | 0.5
right | 0 | 1 | 0.5
draw | 0.5 | 0.5 | 1
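For illustration, the metric can be computed as the mean weight over all pairs whose gold label is not bad. The sketch below assumes this averaging scheme; the exact evaluation script is provided by the organizers and may differ in details.

```python
# Weighted accuracy with the weights from Table 1; pairs labeled "bad" are
# skipped. Averaging over non-"bad" pairs is an assumption about the official
# scorer, not a reproduction of it.
WEIGHTS = {
    ("left", "left"): 1.0,  ("left", "right"): 0.0,  ("left", "draw"): 0.5,
    ("right", "left"): 0.0, ("right", "right"): 1.0, ("right", "draw"): 0.5,
    ("draw", "left"): 0.5,  ("draw", "right"): 0.5,  ("draw", "draw"): 1.0,
}

def weighted_accuracy(gold, predicted):
    pairs = [(g, p) for g, p in zip(gold, predicted) if g != "bad"]
    return sum(WEIGHTS[g, p] for g, p in pairs) / len(pairs)

# Example: one exact match and two half-credit draws -> (1 + 0.5 + 0.5) / 3 ≈ 0.667
print(weighted_accuracy(["left", "draw", "bad", "right"],
                        ["left", "right", "left", "draw"]))
```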
2.1 Embeddings
For each headline from the training corpus, embeddings are obtained from various Russian and multilingual models. We use pretrained BERT-based models trained both on Russian monolingual corpora (RuBERT [5], SBERT [7]) and on multiple languages including Russian (mT5 [14], XLM-RoBERTa [12]). In addition, a multilingual version of USE [11] embeddings is used. These models show state-of-the-art results on a number of NLP benchmarks [13], including Russian-language ones [8], so it was natural to test them on the task of selecting the best headline for a cluster.
To obtain a headline embedding, we average the word embeddings from layer 19 (of 25) for SBERT, XLM-R and mT5 and from layer 8 (of 13) for RuBERT, taking the actual length of the headline into account. For mT5, which is trained mainly for seq2seq tasks, the decoder is removed and the embedding is taken from layer 19 of the encoder. We use the recommended pretrained tokenizers from the transformers library [10]; these tokenizers are based on the WordPiece and SentencePiece models [4].
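A minimal sketch of this pooling step is shown below, assuming a Hugging Face checkpoint that exposes its hidden states. The model name and layer index are illustrative, and for mT5 the same pooling would be applied to the encoder output only.

```python
# Masked mean pooling over a middle hidden layer, as described above.
# "xlm-roberta-large" and LAYER = 19 are illustrative choices.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "xlm-roberta-large"
LAYER = 19  # hidden_states[0] is the embedding layer, so this indexes encoder layer 19

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME, output_hidden_states=True).eval()

def headline_embeddings(texts):
    """Return one vector per headline: the average of its non-padding token vectors."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**batch)
    hidden = outputs.hidden_states[LAYER]                 # (batch, seq_len, dim)
    mask = batch["attention_mask"].unsqueeze(-1).float()  # 1 for real tokens, 0 for padding
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)

vectors = headline_embeddings(["Пример заголовка", "Другой заголовок новости"])
print(vectors.shape)  # (2, 1024) for xlm-roberta-large
```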
2.2 Classification
We then train a classifier on top of the embeddings. We use the CatBoost ranking model [2], a gradient boosting over decision trees algorithm. Pairs of headline embeddings are fed to the model, with the better headline in each pair treated as the "positive" element and the other one as the "negative" element. We choose PairLogitLoss (1) as the target loss function.
$$\mathrm{PairLogitLoss} = -\sum_{(p,\,n) \in \mathrm{Pairs}} \log \frac{1}{1 + e^{-(a_p - a_n)}} \qquad (1)$$

where $a_p$ and $a_n$ are the model scores for the positive and negative headlines of a pair.
An ensemble of ranking models is trained on the different sets of embedding features. The number of decision trees in CatBoost is set to a fixed value, and the best iteration is chosen based on the validation score. We obtain the final headline rank by averaging the ranks predicted by each of the models; each pair is then assigned one of the left, right, or draw labels depending on the resulting rank difference: if the difference exceeds a threshold in one direction the winner is left, if it exceeds the threshold in the other direction the winner is right, and values in between correspond to draw.
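The following sketch illustrates this ranking setup with the CatBoost PairLogit loss. The random vectors, iteration count, and draw threshold are placeholders rather than the competition configuration, and only a single model is trained instead of the five-model ensemble.

```python
import numpy as np
from catboost import CatBoost, Pool

rng = np.random.default_rng(0)
DIM = 16          # illustrative embedding size; real transformer embeddings are larger
N_PAIRS = 200     # illustrative number of training pairs
THRESHOLD = 0.1   # illustrative draw threshold; the paper's value is not reproduced here

# Dummy embeddings standing in for the transformer features described above.
left_emb = rng.normal(size=(N_PAIRS, DIM))
right_emb = rng.normal(size=(N_PAIRS, DIM))
winners = rng.choice(["left", "right"], size=N_PAIRS)

def make_pool(emb_a, emb_b, winners):
    """Each pair becomes a two-element group; the preferred headline gets label 1."""
    features, labels, group_ids = [], [], []
    for i, (a, b, w) in enumerate(zip(emb_a, emb_b, winners)):
        features.extend([a, b])
        labels.extend([1, 0] if w == "left" else [0, 1])
        group_ids.extend([i, i])
    return Pool(data=np.vstack(features), label=labels, group_id=group_ids)

model = CatBoost({"loss_function": "PairLogit", "iterations": 200, "verbose": False})
model.fit(make_pool(left_emb, right_emb, winners))

# Inference: score both headlines of a pair and compare the scores.
diff = model.predict(left_emb) - model.predict(right_emb)
pred = np.where(diff > THRESHOLD, "left", np.where(diff < -THRESHOLD, "right", "draw"))
print(pred[:10])
```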
Table 2: Weighted accuracy (%) of classifiers trained on each type of embeddings and of their ensemble (Blend-5).

Model | Validation | Public LB test | Private LB test
---|---|---|---
SBERT | 85.98 | 84.48 | 83.41 |
RuBERT | 83.97 | 81.38 | 81.64 |
XLM-R | 87.93 | 84.30 | 84.13 |
mT5 | 88.52 | 84.48 | 82.60 |
USE | 81.23 | 80.79 | 80.68 |
Blend-5 | 88.77 | 87.28 | 86.60 |
To explore the models and compare results, we train a classifier on top of each type of embedding separately. Every model was trained on an Nvidia Tesla P100 GPU provided by Google Colab. Table 2 shows the results for the various embeddings. The best result is obtained by the ensemble of the five models described above (referred to as Blend-5), although the single multilingual model mT5 shows comparable accuracy.
During the Dialogue Evaluation 2021 competition we achieved 86.00% and 85.40% accuracy on the public and private test sets by averaging word embeddings from the top layer. Further experiments showed that middle layers perform better than the top layer, improving these results to 87.28% and 86.60% respectively.
2.3 Analysis and results
In this section, we analyze the model and explore the impact of several aspects of our approach. We list some examples of wrongly predicted labels and report evaluation results for different variants of the model on the private LB test set.
Error analysis. We selected 300 examples where the model confused the left and right labels, skipping examples with the draw label. The errors can be divided into several types, listed below together with corresponding examples and their English translations. The model tends to prefer titles that lack facts and sometimes chooses verbose headlines. Other wrongly selected headlines include opinions and biased titles. Finally, given a pair of nearly equivalent titles, the model may choose the wrong label.
Headlines containing insufficient facts
gold: Ту-22МЗМ Казанского авиазавода испытали на сверхзвуке
Kazan aircraft factory's Tu-22M3M tested in supersonic mode
pred: В ОПК рассказали об испытаниях модернизированного ракетоносца Ту-22М3М
The defense industry spoke about the tests of the modernized Tu-22M3M missile carrier
gold: День проведения парада Победы будет нерабочим — Песков
Victory Parade day will be non-working — Peskov
pred: Песков заявил о большой вероятности объявления еще одного выходного
Peskov said there is a high probability of announcing another day off
Verbose headlines
gold: С 27 мая москвичи могут бесплатно сдать тест на антитела
From May 27, Moscow residents can take an antibody test for free
pred: Как и где сдать тест на антитела к коронавирусу в Москве. С 27 мая это может сделать любой желающий
How and where to take a coronavirus antibody test in Moscow. Anyone can do it from May 27
gold: Украинский суд признал нацистской символику дивизии СС «Галичина»
The Ukrainian court recognized the symbols of the SS division "Galicia" as Nazi
pred: «Победа справедливости, здравого смысла и закона»: Вятрович проиграл суд по делу о символике СС «Галичина»
"Victory of justice, common sense and law": Vyatrovich lost the court case concerning the SS "Galicia" symbols
Biased headlines
gold: Роскомнадзор начнет блокировать в России пиратские приложения
Roskomnadzor will begin to block pirated applications in Russia
pred: Госдума приняла спорный законопроект о блокировке приложений с пиратским контентом
State Duma passed controversial bill on blocking applications with pirated content
gold: Доллар в обменниках ускорил рост
Dollar in exchange offices accelerated growth
pred: Заманивают иностранцев под покупку облигаций. Почему гривня снова падает
Luring foreigners to buy bonds. Why is the hryvnia falling again
Equivalent pairs of headlines
gold: Россияне массово забирают валюту из банков
Russians massively withdraw currency from banks
pred: Жители страны массово снимают валюту с банковских счетов
Residents of the country are massively withdrawing currency from bank accounts
gold: Россиянам разъяснили, когда можно будет поехать в отпуск за рубеж
Russians were told when it will be possible to go on vacation abroad
pred: Россиянам озвучили возможные сроки возобновления поездок за границу
Russians were told the possible timeframe for resuming trips abroad
Thus, a good headline can be defined as precise, short, unbiased, and containing as many significant facts as possible. However, classifiers based on language models can rank headlines that do not meet these criteria higher than manually selected ones. We assume that adding more training data or introducing multitask learning that combines other training objectives, such as information extraction, could help achieve better results.
Sentence representations. We explore a different way to obtain sentence embeddings from the language model's top layer: using the embedding of the first token, known as the [CLS] token, which is a common alternative to averaging the word embeddings. We compare the results for all models except mT5, which does not have a [CLS] token in its vocabulary. As shown in Table 3, averaging gives slightly better results for most of the models. We assume this is because the [CLS] embedding captures high-level semantic meaning [13], while for the headline selection task it is more important not to lose token-level information, in line with the criteria of a good headline formulated above. Moreover, we analyze whether sentence embeddings from the middle layers are more suitable than those from the last ones. Experiments show that layers 17 to 19 (of 25) for SBERT, XLM-R and mT5 and layers 8 to 9 (of 13) for RuBERT perform better than the top layers. The reason may be the same: the top layers encode more high-level semantic meaning than the middle ones.
Table 3: Weighted accuracy (%) on the private LB test set for different sentence representations. Layer indices are given as X/Y: layer X (of 25) for SBERT, XLM-R and mT5, layer Y (of 13) for RuBERT.

Sentence representation | SBERT | RuBERT | XLM-R | mT5 | Blend-5
---|---|---|---|---|---
Top layer: average embeddings | 78.62 | 75.50 | 80.01 | 81.60 | 85.20 |
Top layer: [CLS] token embeddings | 77.77 | 76.53 | 75.48 | — | 84.50 |
Layer 23/11: average embeddings | 81.23 | 78.65 | 82.59 | 81.83 | 85.64 |
Layer 21/9: average embeddings | 83.31 | 81.41 | 83.30 | 82.48 | 86.31 |
Layer 19/8: average embeddings | 83.41 | 81.64 | 84.13 | 82.60 | 86.60 |
Layer 17/8: average embeddings | 83.55 | 81.64 | 84.06 | 83.56 | 86.93 |
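A short sketch of the two representations compared in Table 3 is given below: the top-layer embedding of the first ([CLS]) token versus the masked average of a middle layer. The RuBERT checkpoint name is an assumption, since the paper does not name a specific one, and mT5 is excluded because its vocabulary has no [CLS] token.

```python
# Two sentence representations: top-layer [CLS] embedding vs. masked mean of a
# middle layer. Shown for RuBERT (layer 8 of 13, as in the table); the exact
# checkpoint name is an assumption.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "DeepPavlov/rubert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME, output_hidden_states=True).eval()

def cls_and_mean(texts, layer=8):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = model(**batch)
    cls_vec = out.last_hidden_state[:, 0]                  # first ([CLS]) token, top layer
    mask = batch["attention_mask"].unsqueeze(-1).float()
    hidden = out.hidden_states[layer]                       # chosen middle layer
    mean_vec = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)
    return cls_vec, mean_vec
```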
3 Conclusion
In this paper, we study applications of pre-trained models for headline selection and demonstrate the superiority of ensembles of modern BERT-based models. We have shown that multilingual models such as mT5 demonstrate decent results on this task and are superior to single-language models under the same conditions.
Further research could address additional training of the top layers of multilingual models on Russian-language corpora, as well as fine-tuning lightweight models such as multilingual USE or LASER [1] to reduce the system requirements for headline selection.
References
- [1] Artetxe Mikel, Schwenk Holger. Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond // CoRR. — 2018. — Vol. abs/1812.10464. — 1812.10464.
- [2] Dorogush Anna Veronika, Ershov Vasily, Gulin Andrey. CatBoost: gradient boosting with categorical features support // CoRR. — 2018. — Vol. abs/1810.11363. — 1810.11363.
- [3] Gusev Ilya, Smurov Ivan. Russian News Clustering and Headline Selection Shared Task // Computational Linguistics and Intellectual Technologies: Papers from the Annual Conference “Dialogue”. — 2021.
- [4] Kudo Taku, Richardson John. SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing // CoRR. — 2018. — Vol. abs/1808.06226. — 1808.06226.
- [5] Kuratov Yuri, Arkhipov Mikhail. Adaptation of Deep Bidirectional Multilingual Transformers for Russian Language // CoRR. — 2019. — Vol. abs/1905.07213. — 1905.07213.
- [6] Brown Tom B., Mann Benjamin, Ryder Nick et al. Language Models are Few-Shot Learners. — 2020. — 2005.14165.
- [7] Reimers Nils, Gurevych Iryna. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks // CoRR. — 2019. — Vol. abs/1908.10084. — 1908.10084.
- [8] RussianSuperGLUE: A Russian Language Understanding Evaluation Benchmark / Tatiana Shavrina, Alena Fenogenova, Anton Emelyanov et al. // arXiv preprint arXiv:2010.15925. — 2020.
- [9] Thirunarayan Krishnaprasad, Immaneni Trivikram, Shaik Mastan. Selecting Labels for News Document Clusters. — 2007. — 01. — P. 119–130.
- [10] Transformers: State-of-the-Art Natural Language Processing / Thomas Wolf, Lysandre Debut, Victor Sanh et al. // Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. — Online : Association for Computational Linguistics, 2020. — Oct. — P. 38–45. — Access mode: https://www.aclweb.org/anthology/2020.emnlp-demos.6.
- [11] Universal Sentence Encoder / Daniel Cer, Yinfei Yang, Sheng-yi Kong et al. // CoRR. — 2018. — Vol. abs/1803.11175. — 1803.11175.
- [12] Unsupervised Cross-lingual Representation Learning at Scale / Alexis Conneau, Kartikay Khandelwal, Naman Goyal et al. // CoRR. — 2019. — Vol. abs/1911.02116. — 1911.02116.
- [13] XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization / Junjie Hu, Sebastian Ruder, Aditya Siddhant et al. — 2020. — 2003.11080.
- [14] Xue Linting, Constant Noah, Roberts Adam et al. mT5: A massively multilingual pre-trained text-to-text transformer. — 2021. — 2010.11934.