Overview of the VLSP 2022 - Abmusu Shared Task: A Data Challenge for Vietnamese Abstractive Multi-document Summarization
Mai-Vu Tran, Hoang-Quynh Le, Duy-Cat Can, Quoc-An Nguyen VNU University of Engineering and Technology, Hanoi, Vietnam.
{vutm, lhquynh, catcd, annq}@vnu.edu.vn
∗Corresponding author
Abstract
This paper presents an overview of the VLSP 2022 - Vietnamese abstractive multi-document summarization (Abmusu) shared task for Vietnamese news. The task was hosted at the 9th annual workshop on Vietnamese Language and Speech Processing (VLSP 2022).
The goal of the Abmusu shared task is to develop summarization systems that can automatically create abstractive summaries for a set of documents on a topic. The model input is multiple news documents on the same topic, and the corresponding output is a related abstractive summary. Within the scope of the Abmusu shared task, we focus only on Vietnamese news summarization and build a human-annotated dataset of 1,839 documents in 600 clusters, collected from Vietnamese news in 8 categories.
Participating models are evaluated and ranked by ROUGE-2 F1 score, the most common evaluation metric for document summarization.
Figure 1: The annotation process.
1 Introduction
In the era of information explosion, mining data effectively has huge potential but is a difficult problem that requires time, money and labour. Multi-document summarization is a natural language processing task that helps address this problem. Given a set of documents as input, a summarization system aims to select or generate the important information to create a brief summary of these documents Ježek and Steinberger (2008). It is a complex problem that has gained attention from the research community.
Several past challenges and shared tasks have focused on summarization. Among the earliest are the Document Understanding Conference (DUC) summarization challenges (http://www-nlpir.nist.gov/projects/duc), organized seven times from 2000 to 2007, and the Text Analysis Conference (TAC) summarization shared tasks (http://tac.nist.gov/tracks/), organized five times on news and biomedical text summarization from 2008 to 2014.
In recent years, further summarization shared tasks have been launched to support research and development in this field for English, such as DocEng 2019 Lins et al. (2019) and BioNLP-MEDIQA 2021 Abacha et al. (2021), etc.
Based on output characteristics, there are two major approaches to automatic summarization, i.e., extractive and abstractive summarization. Extractive summarization selects the most important sentences (or sections) from the documents, while abstractive summarization tries to write a new summary based on the original important information Allahyari et al. (2017). Since the early 1950s, various methods have been proposed for extractive summarization, ranging from frequency-based methods Khan et al. (2019) to machine learning-based methods Gambhir and Gupta (2017). Extractive methods are fast and simple, but the resulting summaries are far from manually created ones, a gap that can be narrowed with the abstractive approach El-Kassas et al. (2021). In the multi-document setting, extractive approaches show significant disadvantages in arranging and combining information from several documents. In recent years, sequence-to-sequence (seq2seq) learning has made abstractive summarization practical Hou et al. (2017). Encoder-decoder models such as PEGASUS Zhang et al. (2020), BART Lewis et al. (2020), and T5 Raffel et al. (2020) achieve promising results for abstractive multi-document summarization.
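To make the frequency-based extractive family concrete, the following is a minimal Python sketch (illustrative only, not taken from any Abmusu system): sentences are scored by the average corpus frequency of their words, and the top-scoring sentences form the summary. A real Vietnamese system would replace the naive splitting with a proper word segmenter and sentence splitter.

```python
from collections import Counter
import re


def frequency_based_summary(documents, num_sentences=3):
    """Toy frequency-based extractive summarizer (illustrative sketch only)."""
    # Naive sentence and word splitting; Vietnamese systems typically use a
    # dedicated word segmenter instead of whitespace/regex splitting.
    sentences = [s.strip() for doc in documents
                 for s in re.split(r"(?<=[.!?])\s+", doc) if s.strip()]
    words = [w.lower() for s in sentences for w in re.findall(r"\w+", s)]
    freq = Counter(words)

    def score(sentence):
        tokens = re.findall(r"\w+", sentence.lower())
        return sum(freq[t] for t in tokens) / (len(tokens) or 1)

    # Keep the highest-scoring sentences, restored to their original order.
    top = sorted(range(len(sentences)), key=lambda i: score(sentences[i]),
                 reverse=True)[:num_sentences]
    return " ".join(sentences[i] for i in sorted(top))
```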
Studies on this problem for Vietnamese text are still in the early stages, with a few initial achievements, especially for extractive approaches. In recent years, there has been growing interest in developing automatic abstractive summarization systems. Despite these attempts, the lack of a comprehensive benchmark dataset has limited the comparison of different techniques for Vietnamese. The VLSP 2022 - Abmusu shared task was set up to give researchers an opportunity to propose, assess and advance their research and, further, to promote research on abstractive multi-document summarization for Vietnamese text.
The remainder of the paper is organized as follows:
Section 2 gives a detailed description of the Abmusu shared task.
Section 3 describes the task data, including data collection, construction and annotation methodology.
Section 4 describes the competition, baselines, approaches and the respective results.
Finally, Section 5 concludes the paper.
2 Task Description
The VLSP 2022 Abmusu shared task addresses abstractive multi-document summarization.
The goal of the Abmusu shared task is to develop summarization systems that can automatically create abstractive summaries for a set of documents on a topic. The model input is multiple news documents on the same topic, and the corresponding output is a related abstractive summary. Within the scope of the Abmusu shared task, we focus only on Vietnamese news.
For multi-document summarization, the Abmusu task aims to summarize multiple input documents that contain information related to the same topic; we call such sets ‘document clusters’. Each cluster contains documents on the same topic, and the goal of the shared task is to build models that automatically create one abstractive summary per cluster.
3 Task Data
3.1 Data Preparation
The data was automatically collected and filtered from Vietnamese electronic news in 8 categories, including economy, society, culture, science and technology, etc. It is divided into training, validation and test datasets. Each dataset contains several document clusters, and each cluster contains documents on the same topic. For the training and validation datasets, a manually created reference abstractive summary is provided per cluster.
The test set is formatted in the same way as the training and validation sets, but without the abstractive summary.
The data preparation process is described in Figure 1. We used INCEpTION (https://inception-project.github.io/) Klie et al. (2018) as the annotation tool; it is a semantic annotation platform offering intelligent assistance and knowledge management.
Human annotators and experts participated in the annotation process, and an annotation guideline with full definitions and illustrative examples was provided.
We used a multi-step process to create the summarization data; each data sample required the involvement of at least one annotator and one reviewer:
• Crawl data from news websites by category.
• Group documents into clusters by highlighted hashtag, category, posting time, and similarity.
• Remove duplicate or highly similar documents (a possible similarity check is sketched after this list).
• Remove clusters with too few articles, and manually review the remaining clusters/documents.
• Randomly choose additional clusters to ensure coverage of difficult test cases.
• Create the summary manually (by the annotators).
• Re-check the quality and length of each summary (by the reviewers); unqualified data is relabelled by another annotator.
• Refine all data (by expert reviewers).
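The duplicate-removal step referred to above could, for example, rely on a simple lexical-overlap check such as the sketch below; the Jaccard threshold of 0.8 is an assumption for illustration, not the value used by the organisers.

```python
import re


def near_duplicate_pairs(documents, threshold=0.8):
    """Flag document pairs whose word-level Jaccard similarity exceeds a
    threshold (illustrative sketch; the 0.8 threshold is an assumption)."""
    word_sets = [set(re.findall(r"\w+", doc.lower())) for doc in documents]
    flagged = []
    for i in range(len(documents)):
        for j in range(i + 1, len(documents)):
            union = word_sets[i] | word_sets[j]
            if not union:
                continue
            jaccard = len(word_sets[i] & word_sets[j]) / len(union)
            if jaccard >= threshold:
                flagged.append((i, j, jaccard))
    return flagged
```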
As a result, we prepared a total of 1,839 documents in 600 clusters, divided into training, validation and test sets.
Figure 2 shows the distribution of categories in the training/validation set and the test set.
Table 1 and Table 2 describe the statistics of the Abmusu dataset in detail at the token and sentence levels.
The compression ratio of the Abmusu dataset is about 0.09 at the token level and 0.07-0.08 at the sentence level (e.g., on the training set, an average summary of 168.48 tokens against 1924.75 tokens per cluster); the manually created summaries typically contain about five sentences.
Figure 2: The data statistics by categories.
Aspects                         Training    Validation    Test
Average
  Documents per Cluster             3.11          3.04       3.05
  Tokens per Cluster             1924.75       1815.41    1762.40
  Tokens per Raw text             619.88        597.17     578.46
  Tokens per Anchor text           41.65         35.58      40.33
  Tokens per Summary              168.48        167.68     153.05
Compression ratio
  Multi-document Summary            0.09          0.09       0.09

Table 1: Average statistics and compression ratio at token-level
Aspects                         Training    Validation    Test
Average
  Sentences per Cluster            66.93         60.69      61.07
  Sentences per Raw text           21.56         19.96      20.04
  Sentences per Anchor text         1.72          1.27       1.57
  Sentences per Summary             4.82          4.94       4.93
Compression ratio
  Multi-document Summary            0.07          0.08       0.08

Table 2: Average statistics and compression ratio at sentence-level
4 Challenge Results
4.1 Data Format and Submission
Each data example includes the title, anchor text and body text of every single document in a cluster. Each cluster also has a category tag and a manually created summary.
The test set provided to the participating teams is formatted in the same way as the training and validation data, but without the manually created summary.
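For illustration only, a single cluster record could be represented as in the sketch below; the field names are hypothetical and do not necessarily match the schema actually released to the teams.

```python
# Hypothetical structure of one Abmusu cluster record (field names are
# illustrative, not the official schema).
cluster_example = {
    "category": "economy",           # one of the 8 news categories (illustrative value)
    "single_documents": [
        {
            "title": "...",          # headline of one news article
            "anchor_text": "...",    # the article's lead/teaser paragraph
            "raw_text": "...",       # full body text of the article
        },
        # ... further articles on the same topic
    ],
    # Present in the training/validation data only; absent from the test set.
    "summary": "...",
}
```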
The evaluation was performed on the AIhub platform (http://aihub.ml/). The test data was divided into two parts, a Public Test and a Private Test, and the Private Test was opened after the Public Test.
Each team was allowed a limited number of submissions to the Public Test (with a cap per day) and to the Private Test (with no per-day limit).
4.2 Evaluation Metrics
The official evaluation measures are the ROUGE-2 scores, with ROUGE-2 F1 (R2-F1) as the main score for ranking. ROUGE-2 Recall (R2-R), Precision (R2-P) and R2-F1 between the predicted summary and the reference summary are calculated following Lin (2004):

R2\text{-}P = \frac{\sum_{b \in \text{bigrams}(\text{Ref})} \min\big(\text{Count}_{\text{Pred}}(b), \text{Count}_{\text{Ref}}(b)\big)}{\sum_{b \in \text{bigrams}(\text{Pred})} \text{Count}_{\text{Pred}}(b)}   (1)

R2\text{-}R = \frac{\sum_{b \in \text{bigrams}(\text{Ref})} \min\big(\text{Count}_{\text{Pred}}(b), \text{Count}_{\text{Ref}}(b)\big)}{\sum_{b \in \text{bigrams}(\text{Ref})} \text{Count}_{\text{Ref}}(b)}   (2)

R2\text{-}F1 = \frac{2 \cdot R2\text{-}P \cdot R2\text{-}R}{R2\text{-}P + R2\text{-}R}   (3)
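As a reference point, these formulas can be reproduced with a small dependency-free sketch such as the one below; the official scorer on AIhub may differ in tokenization and preprocessing details.

```python
from collections import Counter


def rouge2(prediction: str, reference: str):
    """Minimal ROUGE-2 precision/recall/F1 over word bigrams, mirroring
    Equations (1)-(3); the official evaluation may tokenize differently."""
    def bigram_counts(text):
        tokens = text.lower().split()
        return Counter(zip(tokens, tokens[1:]))

    pred, ref = bigram_counts(prediction), bigram_counts(reference)
    overlap = sum((pred & ref).values())            # clipped matching bigrams
    precision = overlap / max(sum(pred.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```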
4.3 Baselines
The committee provided four baselines as shared task benchmarks:
• Ad-hoc rule-based baseline: the summary is the concatenation of the first and the last sentences of all component documents in each cluster (see the sketch after this list).
• Anchor text-based baseline: the summary is the concatenation of the anchor texts of all component documents in each cluster.
• Extractive baseline: the summary is generated by an extractive summarization model using LexRank Erkan and Radev (2004) and MMR Goldstein and Carbonell (1998).
• Abstractive baseline: the summary is generated by the abstractive summarization model ViT5 Phan et al. (2022).
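The rule-based and anchor-text baselines follow directly from their descriptions, and the abstractive baseline simply feeds the concatenated cluster text through a pretrained ViT5 model; the sketch below is a hedged reconstruction, where the cluster field names and the checkpoint identifier "VietAI/vit5-base" are assumptions rather than the organisers' exact code.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM


def rule_based_summary(cluster):
    """Ad-hoc rule-based baseline: concatenate the first and last sentence of
    every document in the cluster ('documents' is a hypothetical field holding
    one list of sentences per article)."""
    parts = []
    for sentences in cluster["documents"]:
        if sentences:
            parts.append(sentences[0])
            if len(sentences) > 1:
                parts.append(sentences[-1])
    return " ".join(parts)


def anchor_text_summary(cluster):
    """Anchor text-based baseline: concatenate all anchor texts in the cluster."""
    return " ".join(cluster["anchor_texts"])


# Abstractive baseline: raw cluster text through ViT5 without fine-tuning.
# The checkpoint name is an assumption; any ViT5 checkpoint could be substituted.
tokenizer = AutoTokenizer.from_pretrained("VietAI/vit5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("VietAI/vit5-base")


def abstractive_summary(cluster_text, max_length=256):
    inputs = tokenizer(cluster_text, return_tensors="pt",
                       truncation=True, max_length=1024)
    output_ids = model.generate(**inputs, max_length=max_length,
                                num_beams=4, early_stopping=True)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```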
4.4 Participants
Registered teams came from research groups at domestic and international universities (VNU-HUS, VNU-UET, HUST, PTIT, etc.) and from industry (Viettel, VinGroup, CMC, TopCV, VCCorp, etc.).
Of these, 28 teams signed the data agreement, and 16 teams participated officially by submitting at least one run on the evaluation platform.
Participating teams could use any tools and resources to build their models.
In total, the participating teams made 287 submissions.
Post-challenge panels (http://aihub.ml/competitions/341) are now open on AIhub to support further research improvements.
4.5 Results
An interesting observation is that the rule-based baseline achieved surprisingly high results (ranked 7th in Table 3). This can be explained by the fact that most news articles are written in an explanatory or inductive style, so the first and last sentences often carry important information. The extractive baseline (ranked 6th) performed much better than the anchor-text baseline (ranked 19th), contrary to the assumption that the anchor text can serve as a simple summary of the news text.
For the abstractive baseline, we simply fed the raw data through the ViT5 model without any parameter tuning, so its low result (ranked 20th) is unsurprising.
The proposed models followed two main approaches: abstractive summarization and hybrid approaches. Participating teams used a variety of techniques, including similarity scoring (TF-IDF, cosine similarity, etc.), graph-based methods (e.g., LexRank Erkan and Radev (2004), TextRank Mihalcea and Tarau (2004), PageRank Brin and Page (1998)), sentence classification (long short-term memory Hochreiter and Schmidhuber (1997), BERT Kenton and Toutanova (2019), etc.) and text correlation.
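As an illustration of the graph-based family mentioned above, the following TextRank-style sketch (not any team's actual system; it assumes scikit-learn and NumPy are available) ranks sentences by a damped random walk over their TF-IDF cosine-similarity graph:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def textrank_order(sentences, damping=0.85, iterations=50):
    """Return sentence indices ordered by TextRank-style importance
    (illustrative sketch only)."""
    sim = cosine_similarity(TfidfVectorizer().fit_transform(sentences))
    np.fill_diagonal(sim, 0.0)
    # Row-normalise similarities into transition probabilities.
    row_sums = sim.sum(axis=1, keepdims=True)
    trans = np.divide(sim, row_sums, out=np.zeros_like(sim),
                      where=row_sums > 0)
    n = len(sentences)
    scores = np.full(n, 1.0 / n)
    for _ in range(iterations):
        scores = (1 - damping) / n + damping * (trans.T @ scores)
    return sorted(range(n), key=lambda i: scores[i], reverse=True)
```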
The results on the Private Test were taken as the official results for ranking the teams in the Abmusu shared task. The ROUGE-2 results of the top 5 teams and the 4 baselines are shown in Table 3 (see Appendix A for the full results).
All teams achieved higher performance than the anchor-text baseline and the abstractive baseline.
Five teams achieved a higher F1 score than our extractive and rule-based baselines.
The best ROUGE-2 F1 obtained was 0.3035, with corresponding ROUGE-2 P of 0.2298 and ROUGE-2 R of 0.4969.
Rank  Team                   R2-F1         R2-P          R2-R
1     LBMT                   0.3035 (1)    0.2298 (11)   0.4969 (1)
2     The coach              0.2937 (2)    0.2284 (12)   0.4463 (2)
3     CIST AI                0.2805 (3)    0.2629 (6)    0.3192 (6)
4     TheFinalYear           0.2785 (4)    0.2272 (13)   0.4040 (4)
5     NLP HUST               0.2689 (5)    0.2773 (4)    0.2829 (12)
6     Extractive baseline    0.2625 (6)    0.2464 (7)    0.3174 (8)
7     Rule-based baseline    0.2611 (7)    0.2634 (5)    0.2947 (10)
19    Anchor baseline        0.1886 (18)   0.2306 (10)   0.1734 (19)
20    Abstractive baseline   0.1497 (19)   0.3061 (1)    0.1025 (20)
Table 3: The official top 5 results on the Private Test. The number highlighted in bold is the highest result in each column. The number in the bracket () is the corresponding rank of a score. Baseline results are shown in italic.
5 Conclusions
The VLSP 2022 - Abmusu shared task was designed to promote research on abstractive multi-document summarization. We aimed to compare different summarization approaches and to provide a standard test-bed for future research.
The Abmusu dataset was constructed carefully and is expected to make a significant contribution to related work.
Abmusu attracted the attention of the research community; participating teams came up with many different approaches and used a variety of advanced technologies and resources. We achieved some exciting and promising results, which serve as useful benchmarks for future research.
Finally, we are happy to conclude that the VLSP 2022 - Abmusu shared task ran successfully and is expected to contribute significantly to the Vietnamese text mining and natural language processing communities.
Acknowledgments
The data was supported by the Project “Research and Development of Vietnamese Multi-document Summarization Based on Advanced Language Models” of Vietnam National University, Hanoi (Code: QG.22.61).
The shared task committee would like to express its gratitude to Dagoras Technology and Communications JSC. for their technical and financial support.
We also thank all members of the Data Science and Knowledge Technology Laboratory, FIT, UET, VNU for their continuous support and encouragement.
References
Abacha et al. (2021)
Asma Ben Abacha, Yassine M’rabet, Yuhao Zhang, Chaitanya Shivade, Curtis Langlotz, and Dina Demner-Fushman. 2021. Overview of the MEDIQA 2021 shared task on summarization in the medical domain. In Proceedings of the 20th Workshop on Biomedical Language Processing, pages 74–85.
Allahyari et al. (2017)
Mehdi Allahyari, Seyedamin Pouriyeh, Mehdi Assefi, Saeid Safaei, Elizabeth D. Trippe, Juan B. Gutierrez, and Krys Kochut. 2017. Text summarization techniques: A brief survey. International Journal of Advanced Computer Science and Applications (IJACSA), 8(10).
Brin and Page (1998)
Sergey Brin and Lawrence Page. 1998. The anatomy of a large-scale hypertextual web search engine. Computer Networks and ISDN Systems, 30(1-7):107–117.
El-Kassas et al. (2021)
Wafaa S. El-Kassas, Cherif R. Salama, Ahmed A. Rafea, and Hoda K. Mohamed. 2021. Automatic text summarization: A comprehensive survey. Expert Systems with Applications, 165:113679.
Erkan and Radev (2004)
Günes Erkan and Dragomir R. Radev. 2004. LexRank: Graph-based lexical centrality as salience in text summarization. Journal of Artificial Intelligence Research, 22:457–479.
Gambhir and Gupta (2017)
Mahak Gambhir and Vishal Gupta. 2017. Recent automatic text summarization techniques: a survey. Artificial Intelligence Review, 47(1):1–66.
Goldstein and Carbonell (1998)
Jade Goldstein and Jaime G. Carbonell. 1998. Summarization: (1) using MMR for diversity-based reranking and (2) evaluating summaries. In TIPSTER TEXT PROGRAM PHASE III: Proceedings of a Workshop held at Baltimore, Maryland, October 13-15, 1998, pages 181–195.
Hochreiter and Schmidhuber (1997)
Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.
Hou et al. (2017)
Liwei Hou, Po Hu, and Chao Bei. 2017. Abstractive document summarization via neural model with joint attention. In National CCF Conference on Natural Language Processing and Chinese Computing, pages 329–338. Springer.
Ježek and Steinberger (2008)
Karel Ježek and Josef Steinberger. 2008. Automatic text summarization (the state of the art 2007 and new challenges). In Proceedings of Znalosti, pages 1–12.
Kenton and Toutanova (2019)
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, pages 4171–4186.
Khan et al. (2019)
Rahim Khan, Yurong Qian, and Sajid Naeem. 2019. Extractive based text summarization using k-means and TF-IDF. International Journal of Information Engineering and Electronic Business, 11(3):33.
Klie et al. (2018)
Jan-Christoph Klie, Michael Bugert, Beto Boullosa, Richard Eckart de Castilho, and Iryna Gurevych. 2018. The INCEpTION platform: Machine-assisted and knowledge-oriented interactive annotation. In Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations (COLING 2018), pages 5–9. Association for Computational Linguistics.
Lewis et al. (2020)
Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880.
Lin (2004)
Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81.
Lins et al. (2019)
Rafael Dueire Lins, Rafael Ferreira Mello, and Steve Simske. 2019. DocEng’19 competition on extractive text summarization. In Proceedings of the ACM Symposium on Document Engineering 2019, pages 1–2.
Mihalcea and Tarau (2004)
Rada Mihalcea and Paul Tarau. 2004. TextRank: Bringing order into text. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, pages 404–411.
Phan et al. (2022)
Long Phan, Hieu Tran, Hieu Nguyen, and Trieu H. Trinh. 2022. ViT5: Pretrained text-to-text transformer for Vietnamese language generation. arXiv preprint arXiv:2205.06457.
Raffel et al. (2020)
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67.
Zhang et al. (2020)
Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter Liu. 2020. PEGASUS: Pre-training with extracted gap-sentences for abstractive summarization. In International Conference on Machine Learning, pages 11328–11339. PMLR.
Appendix A: The official results on the Private Test
Rank  Team                  R2-F1        R2-P         R2-R         R1-F1        R1-P         R1-R         RL-F1        RL-P         RL-R
1     LBMT                  0.3035 (1)   0.2298 (11)  0.4969 (1)   0.5067 (1)   0.4076 (16)  0.7147 (1)   0.4809 (1)   0.3868 (15)  0.6780 (1)
2     The coach             0.2937 (2)   0.2284 (12)  0.4463 (2)   0.4962 (2)   0.4072 (17)  0.6676 (4)   0.4701 (2)   0.3857 (16)  0.6326 (4)
3     CIST AI               0.2805 (3)   0.2629 (6)   0.3192 (6)   0.4876 (4)   0.4635 (6)   0.5352 (9)   0.4541 (4)   0.4314 (6)   0.4988 (7)
4     TheFinalYear          0.2785 (4)   0.2272 (13)  0.4040 (4)   0.4956 (3)   0.4221 (15)  0.6409 (5)   0.4612 (3)   0.3929 (14)  0.5964 (5)
5     NLP HUST              0.2689 (5)   0.2773 (4)   0.2829 (12)  0.4732 (6)   0.4903 (5)   0.4836 (12)  0.4373 (5)   0.4537 (5)   0.4465 (12)
6     Extractive baseline   0.2625 (6)   0.2464 (7)   0.3174 (8)   0.4772 (5)   0.4582 (9)   0.5391 (8)   0.4339 (6)   0.4164 (9)   0.4905 (9)
7     Rule-based baseline   0.2611 (7)   0.2634 (5)   0.2947 (10)  0.4627 (8)   0.4601 (8)   0.5053 (11)  0.4273 (8)   0.4257 (7)   0.4659 (11)
8     VNU Brothers          0.2544 (8)   0.3030 (2)   0.2406 (14)  0.4595 (9)   0.5315 (2)   0.4312 (17)  0.4194 (12)  0.4850 (2)   0.3937 (17)
9     FCoin                 0.2544 (8)   0.2307 (9)   0.3027 (9)   0.4697 (7)   0.4302 (12)  0.5411 (7)   0.4296 (7)   0.3941 (13)  0.4938 (8)
10    vts                   0.2448 (9)   0.2114 (15)  0.3188 (7)   0.4516 (12)  0.4048 (18)  0.5438 (6)   0.4208 (10)  0.3768 (18)  0.5074 (6)
11    Blue Sky              0.2412 (10)  0.2384 (8)   0.2610 (13)  0.4588 (10)  0.4604 (7)   0.4761 (13)  0.4194 (12)  0.4205 (8)   0.4358 (13)
12    HUSTLANG              0.2361 (11)  0.2880 (3)   0.2157 (17)  0.4360 (16)  0.5176 (4)   0.3981 (18)  0.4000 (15)  0.4750 (3)   0.3651 (18)
13    SGSUM                 0.2322 (12)  0.2106 (16)  0.2896 (11)  0.4575 (11)  0.4279 (13)  0.5282 (10)  0.4235 (9)   0.3954 (12)  0.4897 (10)
14    vc-datamining         0.2304 (13)  0.1663 (20)  0.4371 (3)   0.4496 (14)  0.3450 (20)  0.7036 (2)   0.4201 (11)  0.3218 (20)  0.6590 (2)
15    TCV-AI                0.2288 (14)  0.1687 (19)  0.3976 (5)   0.4502 (13)  0.3485 (19)  0.6813 (3)   0.4190 (13)  0.3245 (19)  0.6340 (3)
16    Team Attention        0.2131 (15)  0.2159 (14)  0.2265 (16)  0.4274 (18)  0.4251 (14)  0.4514 (15)  0.3848 (18)  0.3835 (17)  0.4056 (15)
17    Cyber Intellect       0.2116 (16)  0.2085 (17)  0.2270 (15)  0.4464 (15)  0.4468 (10)  0.4627 (14)  0.4028 (14)  0.4030 (10)  0.4177 (14)
18    HHH                   0.1919 (17)  0.1915 (18)  0.2076 (18)  0.4228 (19)  0.4350 (11)  0.4336 (16)  0.3888 (16)  0.4005 (11)  0.3984 (16)
19    Anchor baseline       0.1886 (18)  0.2306 (10)  0.1734 (19)  0.4321 (17)  0.5210 (3)   0.3900 (19)  0.3869 (17)  0.4659 (4)   0.3498 (19)
20    Abstractive baseline  0.1497 (19)  0.3061 (1)   0.1025 (20)  0.3226 (20)  0.5801 (1)   0.2299 (20)  0.2895 (19)  0.5205 (1)   0.2065 (20)
Table 4: The official results on the Private Test. The number highlighted in bold is the highest result in each column. The number in the bracket () is the corresponding rank of a score. Baseline results are shown in italic.