Unsupervised Abstractive Summarization of Bengali Text Documents
Abstract
Abstractive summarization systems generally rely on large collections of document-summary pairs. However, the performance of abstractive systems remains a challenge due to the unavailability of parallel data for low-resource languages like Bengali. To overcome this problem, we propose a graph-based unsupervised abstractive summarization system in the single-document setting for Bengali text documents, which requires only a Part-Of-Speech (POS) tagger and a pre-trained language model trained on Bengali texts. We also provide a human-annotated dataset with document-summary pairs to evaluate our abstractive model and to support the comparison of future abstractive summarization systems for the Bengali language. We conduct experiments on this dataset and compare our system with several well-established unsupervised extractive summarization systems. Our unsupervised abstractive summarization model outperforms the baselines without being exposed to any human-annotated reference summaries. (We make our code and dataset publicly available at https://github.com/tafseer-nayeem/BengaliSummarization for reproducibility.)
1 Introduction
The process of shortening a large text document while preserving the most relevant information of the source is known as automatic text summarization. A good summary should be coherent, non-redundant, and grammatically readable while retaining the original document's most important content (Nenkova and McKeown, 2012; Nayeem et al., 2018). There are two types of summarization: extractive and abstractive. Extractive summarization ranks and selects important sentences from the original text, whereas abstractive methods generate human-like sentences using natural language generation techniques. Traditional abstractive techniques include sentence compression, syntactic reorganization, sentence fusion, and lexical paraphrasing (Lin and Ng, 2019). Compared to extraction, abstractive summary generation is a considerably more challenging task.
Multi-sentence compression (MSC), originally called sentence fusion (Barzilay and McKeown, 2005; Nayeem and Chali, 2017b), summarizes a cluster of related sentences into a single sentence. The success of neural sequence-to-sequence (seq2seq) models with attention (Bahdanau et al., 2015; Luong et al., 2015) provides an effective way to generate text, and such models have been extensively applied to abstractive summarization of English documents (Rush et al., 2015; Chopra et al., 2016; Nallapati et al., 2016; Miao and Blunsom, 2016; Paulus et al., 2018; Nayeem et al., 2019). These models are usually trained on large numbers of gold summaries, but no large-scale human-annotated abstractive summaries are available for a low-resource language like Bengali. In contrast, an unsupervised approach reduces the human effort and cost of collecting and annotating large amounts of paired training data. Therefore, we choose to build an effective Bengali text summarizer with an unsupervised approach. Our contributions are summarized as follows:
• To the best of our knowledge, our Bengali Text Summarization model (BenSumm) is the very first unsupervised model to generate abstractive summaries from Bengali text documents, while being simple yet robust.
• We introduce a highly abstractive dataset with document-summary pairs to evaluate our model, written by professional summary writers of the National Curriculum and Textbook Board (NCTB) (http://www.nctb.gov.bd/).
• We design an unsupervised abstractive sentence generation model that performs sentence fusion on Bengali texts. Our model requires only a POS tagger and a pre-trained language model, and is easily reproducible.
2 Related Work
Many researchers have worked on text summarization and introduced different extractive and abstractive methods. Nevertheless, very few attempts have been made at Bengali text summarization, despite Bangla being one of the most widely spoken languages in the world (https://w.wiki/57). Das and Bandyopadhyay (2010) developed a Bengali opinion-based text summarizer that determines sentiment information in the original texts for a given topic. Haque et al. (2017, 2015) worked on extractive Bengali text summarization using pronoun replacement, sentence ranking with term frequency, numerical figures, and overlap of title words with the document sentences. Unfortunately, these methods are limited to extractive summarization, which ranks important sentences from the document instead of generating new sentences, a task that is especially challenging for an extremely low-resource language like Bengali. Moreover, there is no human-annotated dataset for comparing abstractive summarization methods in this language.
Jing and McKeown (2000) worked on Sentence Compression (SC), which has received considerable attention in the NLP community. Its potential utility for extractive text summarization made SC popular for single- and multi-document summarization (Nenkova and McKeown, 2012). TextRank (Mihalcea and Tarau, 2004) and LexRank (Erkan and Radev, 2004) are graph-based methods for extracting important sentences from a document. Clarke and Lapata (2008) and Filippova (2010) presented a first intermediate step towards abstractive summarization, compressing original sentences for summary generation. Word-graph based approaches were first proposed by Filippova (2010) and require only a POS tagger and a list of stopwords. Boudin and Morin (2013) improved Filippova's approach by re-ranking the compression paths according to keyphrases, which resulted in more informative sentences. Nayeem et al. (2018) developed an unsupervised abstractive summarization system that jointly performs sentence fusion and paraphrasing.
3 BenSumm Model
We describe here each of the steps involved in our Bengali Unsupervised Abstractive Text Summarization model (BenSumm) for the single-document setting. Our preprocessing step includes tokenization, removal of stopwords, Part-Of-Speech (POS) tagging, and filtering of punctuation marks. We use NLTK (https://www.nltk.org) and BNLP (https://bnlp.readthedocs.io/en/latest/) to preprocess each sentence and obtain a more accurate representation of the information.
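To make the pipeline concrete, a minimal preprocessing sketch is given below. This is an illustration rather than our exact code: the stopword subset and the `pos_tag` stub are placeholders standing in for the full BNLP stopword list and POS tagger.

```python
import string
from typing import List, Tuple

# Placeholder subset; a real system would load the full Bengali stopword list.
BENGALI_STOPWORDS = {"এবং", "ও", "এই", "যে"}
PUNCTUATION = set(string.punctuation) | {"।"}  # includes the Bengali full stop (danda)

def tokenize(sentence: str) -> List[str]:
    # Simple whitespace tokenization as a stand-in for the NLTK/BNLP tokenizers.
    return sentence.split()

def pos_tag(tokens: List[str]) -> List[Tuple[str, str]]:
    # Stub: a real system would call the BNLP POS tagger here.
    return [(token, "NN") for token in tokens]

def preprocess(sentence: str) -> List[Tuple[str, str]]:
    """Tokenize, drop punctuation and stopwords, and attach POS tags."""
    tokens = [t for t in tokenize(sentence)
              if t not in PUNCTUATION and t not in BENGALI_STOPWORDS]
    return pos_tag(tokens)
```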
3.1 Sentence Clustering
The clustering step allows us to group similar sentences from a given document. This step is critical to ensure good coverage of the whole document and to avoid redundancy by selecting at most one sentence from each cluster (Nayeem and Chali, 2017a). The Term Frequency-Inverse Document Frequency (TF-IDF) representation does not work well for this purpose (Aggarwal and Zhai, 2012). Therefore, we calculate the cosine similarity between sentence vectors obtained from the ULMFiT pre-trained language model (Howard and Ruder, 2018). We use hierarchical agglomerative clustering with Ward's method (Murtagh and Legendre, 2014). There will be a minimum of 2 and a maximum of $\lceil N/2 \rceil$ clusters, where $N$ denotes the number of sentences in the document. We choose the number of clusters for a given document using the silhouette value. The clusters are highly coherent, since each cluster must contain sentences similar to every other sentence in the same cluster, even if the clusters are small. The silhouette score is computed as:
$$s = \frac{b - a}{\max(a, b)} \qquad (1)$$
where $a$ denotes the mean distance to the other instances of the same cluster (intra-cluster distance) and $b$ is the mean distance to the instances of the next closest cluster.
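A sketch of this cluster-selection step with scikit-learn is shown below; it assumes `sentence_vectors` holds the ULMFiT sentence embeddings, and the search range of 2 to ⌈N/2⌉ clusters follows the description above. Vectors are length-normalized so that the Euclidean distances used by Ward's method track cosine similarity.

```python
from math import ceil

import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

def cluster_sentences(sentence_vectors: np.ndarray) -> np.ndarray:
    """Cluster sentence vectors, picking the cluster count by silhouette score."""
    # Normalize so Euclidean distance is monotone in cosine similarity.
    X = sentence_vectors / np.linalg.norm(sentence_vectors, axis=1, keepdims=True)
    n = len(X)
    if n < 4:
        return np.zeros(n, dtype=int)  # too few sentences to cluster meaningfully
    best_score, best_labels = -1.0, None
    for k in range(2, ceil(n / 2) + 1):
        labels = AgglomerativeClustering(n_clusters=k, linkage="ward").fit_predict(X)
        score = silhouette_score(X, labels)  # mean of Eq. (1) over all sentences
        if score > best_score:
            best_score, best_labels = score, labels
    return best_labels

# Usage: labels = cluster_sentences(vectors); sentences sharing a label form a cluster.
```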
3.2 Word Graph (WG) Construction
Textual graphs have been shown to produce effective abstractive summaries (Ganesan et al., 2010). We build an abstractive summarizer with a sentence fusion technique by generating word graphs (Filippova, 2010; Boudin and Morin, 2013) for the Bengali language. This method is entirely unsupervised and needs only a POS tagger, which makes it highly suitable for the low-resource setting. Given a cluster of related sentences $S = \{s_1, s_2, \ldots, s_n\}$, we construct a word graph by iteratively adding sentences to it, following Filippova (2010) and Boudin and Morin (2013). The words are represented as vertices along with their parts-of-speech (POS) tags. Directed edges are formed by connecting adjacent words from the sentences. After the first sentence is added to the graph as word nodes (punctuation included), words from the other related sentences are mapped onto existing nodes with the same word form and POS tag. Each sentence of the cluster is connected to dummy start and end nodes to mark the beginning and end of the sentence. After constructing the word graph, we generate the $K$-shortest paths from the dummy start node to the end node (see Figure 1).
[Figure 1: Example word graph (WG) constructed from a cluster of two related sentences.]
Figure 2 presents two sentences from one of the source document clusters, along with the possible paths and their weights generated using the word-graph approach. Figure 1 illustrates an example WG for these two sentences.
[Figure 2: Two related sentences from a source document cluster and the weighted candidate paths generated from the word graph.]
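The construction above can be sketched with networkx as follows, assuming each sentence is a list of (word, POS) pairs produced by the preprocessing step. Edge weighting is simplified to inverse co-occurrence counts; Filippova (2010) uses a more elaborate weighting function.

```python
from itertools import islice

import networkx as nx

def build_word_graph(sentences):
    """sentences: a cluster of related sentences, each a list of (word, POS) pairs."""
    G = nx.DiGraph()
    for sentence in sentences:
        # Words with the same form and POS tag map onto the same node.
        path = ["START"] + list(sentence) + ["END"]
        for u, v in zip(path, path[1:]):
            if G.has_edge(u, v):
                G[u][v]["count"] += 1
            else:
                G.add_edge(u, v, count=1)
    # Simplified weighting: edges shared by many sentences become cheaper to traverse.
    for u, v, data in G.edges(data=True):
        data["weight"] = 1.0 / data["count"]
    return G

def k_shortest_paths(G, k=10):
    """Return up to k candidate fusion paths (node sequences without START/END)."""
    paths = nx.shortest_simple_paths(G, "START", "END", weight="weight")
    return [p[1:-1] for p in islice(paths, k)]
```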
After clustering the sentences of a document, a word graph is created for each cluster to obtain abstractive fusions of the related sentences. We obtain multiple weighted sentences from each cluster (see Figure 2) using the ranking strategy of Boudin and Morin (2013); a simplified sketch of this selection step follows below. We take the top-ranked sentence from each cluster and merge these sentences to generate the final summary. The overall process is presented in Figure 3. We also provide a detailed illustration of our framework on an example source document in the Appendix.
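The sketch below, reusing `build_word_graph` and `k_shortest_paths` from the previous block, scores each candidate path by its total edge weight normalized by length and keeps the best fusion per cluster. This length-normalized score is a simplified stand-in for the keyphrase-based re-ranking of Boudin and Morin (2013).

```python
def path_score(G, nodes):
    """Total edge weight of a fusion path, normalized by length (lower is better)."""
    path = ["START"] + list(nodes) + ["END"]
    cost = sum(G[u][v]["weight"] for u, v in zip(path, path[1:]))
    return cost / max(len(nodes), 1)

def summarize(clusters, k=10, min_words=8):
    """clusters: list of sentence clusters, each a list of (word, POS)-tagged sentences."""
    summary = []
    for cluster in clusters:
        G = build_word_graph(cluster)
        # Discard fusions that are too short, following Filippova (2010).
        candidates = [p for p in k_shortest_paths(G, k) if len(p) >= min_words]
        if candidates:
            best = min(candidates, key=lambda nodes: path_score(G, nodes))
            summary.append(" ".join(word for word, _ in best))
    return " ".join(summary)
```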
[Figure 3: Overview of the overall BenSumm summarization process.]
4 Experiments
This section presents our experimental details for assessing the performance of the proposed BenSumm model.
Dataset
We conduct experiments on our dataset, which consists of 139 human-written abstractive document-summary pairs produced by professional summary writers of the National Curriculum and Textbook Board (NCTB). The NCTB is responsible for developing the curriculum and distributing textbooks, and the majority of Bangladeshi schools follow these books (https://w.wiki/ZwJ). We collected the document-summary pairs from several printed copies of NCTB books. The overall statistics of the datasets are presented in Table 1. From the dataset, we measure the copy rate between the source documents and the human summaries. As the table shows, our dataset is highly abstractive and will serve as a robust benchmark for future work on this task. Moreover, to demonstrate the effectiveness of our proposed framework, we also experiment with an extractive dataset, BNLPC (http://www.bnlpc.org/research.php) (Haque et al., 2015). For this extractive evaluation, we remove the abstractive sentence fusion component to compare with the baselines.
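For reference, the copy rate in Table 1 can be read as the fraction of reference-summary tokens that also appear in the source document. A minimal sketch under that assumed definition:

```python
def copy_rate(source: str, summary: str) -> float:
    """Fraction of summary tokens that also occur in the source document."""
    source_tokens = set(source.split())
    summary_tokens = summary.split()
    if not summary_tokens:
        return 0.0
    copied = sum(1 for token in summary_tokens if token in source_tokens)
    return copied / len(summary_tokens)

# A copy rate of 27% implies roughly 73% of the reference summary's words are novel.
```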
Automatic Evaluation
We evaluate our system (BenSumm) using the automatic evaluation metric ROUGE F1 (Lin, 2004) without any word limit (https://git.io/JUhq6). We extract the $k$-best sentences from our system and from the baseline systems we compare against. We report unigram and bigram overlap (ROUGE-1 and ROUGE-2) to measure informativeness, and the longest common subsequence (ROUGE-L) to measure the fluency of the summaries. Since ROUGE computes scores based on lexical overlap at the surface level, no language-specific changes are required to evaluate Bengali summaries.
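As an illustration of why no language-specific changes are needed, a minimal token-level ROUGE-N F1 can be written in a few lines. This sketch is not the exact package cited above, and a full evaluation would also include the LCS-based ROUGE-L.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Multiset of n-grams in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_f1(reference_tokens, candidate_tokens, n=1):
    """ROUGE-N F1 from clipped n-gram overlap; works for any tokenized language."""
    ref = ngram_counts(reference_tokens, n)
    cand = ngram_counts(candidate_tokens, n)
    overlap = sum((ref & cand).values())  # clipped counts
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```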
| | NCTB | BNLPC |
|---|---|---|
| Total #Samples | 139 | 200 |
| Source Document Length | 91.33 | 150.75 |
| Human Reference Length | 36.23 | 67.06 |
| Summary Copy Rate | 27% | 99% |

Table 1: Overall statistics of the NCTB and BNLPC datasets.
| NCTB [Abstractive] | R-1 | R-2 | R-L |
|---|---|---|---|
| Random Baseline | 9.43 | 1.45 | 9.08 |
| GreedyKL | 10.01 | 1.84 | 9.46 |
| LexRank | 10.65 | 1.78 | 10.04 |
| TextRank | 10.69 | 1.62 | 9.98 |
| SumBasic | 10.57 | 1.85 | 10.09 |
| BenSumm [Abs] (ours) | 12.17 | 1.92 | 11.35 |

| BNLPC [Extractive] | R-1 | R-2 | R-L |
|---|---|---|---|
| Random Baseline | 35.57 | 28.56 | 35.04 |
| GreedyKL | 48.85 | 43.80 | 48.55 |
| LexRank | 45.73 | 39.37 | 45.17 |
| TextRank | 60.81 | 56.46 | 60.58 |
| SumBasic | 35.51 | 26.58 | 34.72 |
| BenSumm [Ext] (ours) | 61.62 | 55.97 | 61.09 |

Table 2: ROUGE F1 scores (R-1, R-2, R-L) of our models and the baselines on the NCTB (abstractive) and BNLPC (extractive) datasets.
Baseline Systems
We compare our system with several well-established baseline systems: LexRank (Erkan and Radev, 2004), TextRank (Mihalcea and Tarau, 2004), GreedyKL (Haghighi and Vanderwende, 2009), and SumBasic (Nenkova and Vanderwende, 2005). We use an open-source implementation of these summarizers (https://git.io/JUhq1) and adapt it for the Bengali language. It is important to note that these summarizers are completely extractive and were designed for English, whereas our model is unsupervised and abstractive.
Results
We report our model's performance compared with the baselines in terms of F1 scores of R-1, R-2, and R-L in Table 2. According to Table 2, our abstractive summarization model outperforms all the extractive baselines on all ROUGE metrics, even though the dataset itself is highly abstractive (the reference summaries contain almost 73% new words). Moreover, we compare the extractive version of our model, BenSumm [Ext], which omits the sentence fusion component, and obtain better scores in terms of R-1 and R-L than the baselines. Finally, we present an example of our model's output in Figure 4. We have also designed a Bengali document summarization tool (see Figure 5) capable of providing both an extractive and an abstractive summary for an input document (a video demonstration is available at https://youtu.be/LrnskktiXcg).
Human Evaluation
Though ROUGE (Lin, 2004) has been shown to correlate well with human judgments, it is biased towards surface-level lexical similarities, which makes it less appropriate for evaluating abstractive summaries. Therefore, we assign three different evaluators to rate each summary generated by our abstractive system (BenSumm [Abs]) on three aspects: Content, Readability, and Overall Quality. They evaluated each system-generated summary with scores ranging from 1 to 5, where 1 represents very poor performance and 5 represents very good performance. Here, content measures how well the summary conveys the meaning of the original input document, and readability covers grammatical correctness and the overall coherence of the summary sentences. We obtain average scores of 4.41, 3.95, and 4.2 in content, readability, and overall quality, respectively.
5 Conclusion and Future Work
In this paper, we have developed an unsupervised abstractive text summarization system for Bengali text documents. We have implemented a graph-based model that fuses multiple related sentences and requires only a POS tagger and a pre-trained language model. Experimental results on our proposed dataset demonstrate the superiority of our approach over strong extractive baselines. We have also designed a Bengali document summarization tool that provides both extractive and abstractive summaries of a given document. One limitation of our model is that it cannot generate new words. In the future, we would like to jointly model multi-sentence compression and paraphrasing in our system.
Acknowledgments
We want to thank all the anonymous reviewers for their thoughtful comments and constructive suggestions for future improvements to this work.
References
- Aggarwal and Zhai (2012) Charu C Aggarwal and ChengXiang Zhai. 2012. A survey of text clustering algorithms. In Mining text data, pages 77–128. Springer.
- Bahdanau et al. (2015) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.
- Barzilay and McKeown (2005) Regina Barzilay and Kathleen R. McKeown. 2005. Sentence fusion for multidocument news summarization. Comput. Linguist., 31(3):297–328.
- Boudin and Morin (2013) Florian Boudin and Emmanuel Morin. 2013. Keyphrase extraction for n-best reranking in multi-sentence compression. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 298–305, Atlanta, Georgia. Association for Computational Linguistics.
- Chopra et al. (2016) Sumit Chopra, Michael Auli, and Alexander M. Rush. 2016. Abstractive sentence summarization with attentive recurrent neural networks. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 93–98, San Diego, California. Association for Computational Linguistics.
- Clarke and Lapata (2008) James Clarke and Mirella Lapata. 2008. Global inference for sentence compression an integer linear programming approach. J. Artif. Int. Res., 31(1):399–429.
- Das and Bandyopadhyay (2010) Amitava Das and Sivaji Bandyopadhyay. 2010. Topic-based Bengali opinion summarization. In Coling 2010: Posters, pages 232–240, Beijing, China. Coling 2010 Organizing Committee.
- Erkan and Radev (2004) Günes Erkan and Dragomir R. Radev. 2004. LexRank: Graph-based lexical centrality as salience in text summarization. J. Artif. Int. Res., 22(1):457–479.
- Filippova (2010) Katja Filippova. 2010. Multi-sentence compression: Finding shortest paths in word graphs. In Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010), pages 322–330, Beijing, China. Coling 2010 Organizing Committee.
- Ganesan et al. (2010) Kavita Ganesan, ChengXiang Zhai, and Jiawei Han. 2010. Opinosis: A graph based approach to abstractive summarization of highly redundant opinions. In Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010), pages 340–348, Beijing, China. Coling 2010 Organizing Committee.
- Haghighi and Vanderwende (2009) Aria Haghighi and Lucy Vanderwende. 2009. Exploring content models for multi-document summarization. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 362–370, Boulder, Colorado. Association for Computational Linguistics.
- Haque et al. (2017) Md Haque, Suraiya Pervin, Zerina Begum, et al. 2017. An innovative approach of Bangla text summarization by introducing pronoun replacement and improved sentence ranking. Journal of Information Processing Systems, 13(4).
- Haque et al. (2015) Md Majharul Haque, Suraiya Pervin, and Zerina Begum. 2015. Automatic Bengali news documents summarization by introducing sentence frequency and clustering. In 2015 18th International Conference on Computer and Information Technology (ICCIT), pages 156–160. IEEE.
- Howard and Ruder (2018) Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 328–339, Melbourne, Australia. Association for Computational Linguistics.
- Jing and McKeown (2000) Hongyan Jing and Kathleen R. McKeown. 2000. Cut and paste based text summarization. In Proceedings of the 1st North American Chapter of the Association for Computational Linguistics Conference, NAACL 2000, page 178–185, USA. Association for Computational Linguistics.
- Lin (2004) Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.
- Lin and Ng (2019) Hui Lin and Vincent Ng. 2019. Abstractive summarization: A survey of the state of the art. In The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, The Thirty-First Innovative Applications of Artificial Intelligence Conference, IAAI 2019, The Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019, Honolulu, Hawaii, USA, January 27 - February 1, 2019, pages 9815–9822. AAAI Press.
- Luong et al. (2015) Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412–1421, Lisbon, Portugal. Association for Computational Linguistics.
- Miao and Blunsom (2016) Yishu Miao and Phil Blunsom. 2016. Language as a latent variable: Discrete generative models for sentence compression. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 319–328, Austin, Texas. Association for Computational Linguistics.
- Mihalcea and Tarau (2004) Rada Mihalcea and Paul Tarau. 2004. TextRank: Bringing order into text. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, pages 404–411, Barcelona, Spain. Association for Computational Linguistics.
- Murtagh and Legendre (2014) Fionn Murtagh and Pierre Legendre. 2014. Ward’s hierarchical agglomerative clustering method: Which algorithms implement ward’s criterion? J. Classif., 31(3):274–295.
- Nallapati et al. (2016) Ramesh Nallapati, Bowen Zhou, Cicero dos Santos, Çağlar Gülçehre, and Bing Xiang. 2016. Abstractive text summarization using sequence-to-sequence RNNs and beyond. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, pages 280–290, Berlin, Germany. Association for Computational Linguistics.
- Nayeem and Chali (2017a) Mir Tafseer Nayeem and Yllias Chali. 2017a. Extract with order for coherent multi-document summarization. In Proceedings of TextGraphs-11: the Workshop on Graph-based Methods for Natural Language Processing, pages 51–56, Vancouver, Canada. Association for Computational Linguistics.
- Nayeem and Chali (2017b) Mir Tafseer Nayeem and Yllias Chali. 2017b. Paraphrastic fusion for abstractive multi-sentence compression generation. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, CIKM ’17, page 2223–2226, New York, NY, USA. Association for Computing Machinery.
- Nayeem et al. (2018) Mir Tafseer Nayeem, Tanvir Ahmed Fuad, and Yllias Chali. 2018. Abstractive unsupervised multi-document summarization using paraphrastic sentence fusion. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1191–1204, Santa Fe, New Mexico, USA. Association for Computational Linguistics.
- Nayeem et al. (2019) Mir Tafseer Nayeem, Tanvir Ahmed Fuad, and Yllias Chali. 2019. Neural diverse abstractive sentence compression generation. In Advances in Information Retrieval - 41st European Conference on IR Research, ECIR 2019, Cologne, Germany, April 14-18, 2019, Proceedings, Part II, volume 11438 of Lecture Notes in Computer Science, pages 109–116. Springer.
- Nenkova and McKeown (2012) Ani Nenkova and Kathleen McKeown. 2012. A survey of text summarization techniques. In Mining text data, pages 43–76. Springer.
- Nenkova and Vanderwende (2005) Ani Nenkova and Lucy Vanderwende. 2005. The impact of frequency on summarization. Microsoft Research, Redmond, Washington, Tech. Rep. MSR-TR-2005, 101.
- Paulus et al. (2018) Romain Paulus, Caiming Xiong, and Richard Socher. 2018. A deep reinforced model for abstractive summarization. In International Conference on Learning Representations.
- Rush et al. (2015) Alexander M. Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 379–389, Lisbon, Portugal. Association for Computational Linguistics.
Appendix A Appendix
A detailed illustration of our BenSumm model with outputs from each step for a sample input document is presented in Figure 6.
[Figure 6: Step-by-step outputs of our BenSumm model for a sample input document.]