BNLP: Natural language processing toolkit for Bengali

Sagor Sarker
Begum Rokeya University, Rangpur, Bangladesh
[email protected]

Abstract

BNLP is an open-source language processing toolkit for Bengali that provides tokenization, word embedding, part-of-speech (POS) tagging, and named entity recognition (NER). BNLP ships pre-trained models for model-based tokenization, embedding, POS tagging, and NER in Bengali. Its pre-trained CRF-based models achieve F1 scores of 80.75 for POS tagging and 66.88 for NER. BNLP is widely used by the Bengali research communities, with 25K downloads, 138 stars, and 31 forks. BNLP is available at https://github.com/sagorbrur/bnlp.

1 Introduction

Natural language processing (NLP) is one of the most important fields in computational linguistics. Tokenization, embedding, POS tagging, NER, text classification, and language modeling are some of its sub-tasks. Any computational linguistics researcher or developer needs hands-on tools to perform these sub-tasks efficiently. Thanks to recent advances in NLP, many tools and methods exist for word tokenization, word embedding, POS tagging, and NER in English. NLTK Loper and Bird (2002), coreNLP Manning et al. (2014), spaCy Honnibal and Montani (2017), AllenNLP Gardner et al. (2018), Flair Akbik et al. (2019), and stanza Qi et al. (2020) are a few of them. These tools provide a variety of methods for tokenization, embedding, POS tagging, NER, and language modeling in English. Support for low-resource languages like Bengali is limited or absent. A recent tool, iNLTK Arora (2020), is an initial approach for several Indic languages, including Bengali. But because it groups Bengali with other Indic languages, dedicated monolingual support for Bengali, such as easy pre-processing, tokenization, embedding, POS tagging, and NER, is missing. Besides, iNLTK is mostly based on a deep learning (DL) language model pipeline, which needs DL infrastructure to perform NLP tasks. That makes iNLTK a verbose, language-model-centric tool for Bengali. BNLP, on the other hand, is an entirely machine learning (ML) based toolkit that can process Bengali NLP tasks instantly. Table 1 provides a detailed feature comparison between BNLP and other tools.

Tool	Support Bengali	ML Based	Pre-trained Model	Tokenizer	Embedding	POS	NER
NLTK
spaCy
Flair
stanza
inltk
BNLP

Table 1: BNLP feature comparison with other popular tools

BNLP, an open-source language processing toolkit for Bengali, is built to address this problem and lowers the barrier to Bengali NLP tasks by:

• Providing several tokenization methods to tokenize Bengali text efficiently
• Providing several embedding methods to embed Bengali words using a pre-trained model, with an option to train an embedding model from scratch
• Providing a ready-to-use option for POS tagging and NER of Bengali sentences, with an option to train a CRF-based POS tagger or NER model from scratch

BNLP also offers several widely used text preprocessing techniques, such as removing stopwords, punctuation, and foreign words. The BNLP GitHub repository (https://github.com/sagorbrur/bnlp) hosts the source code of the package and the pre-trained models, and the documentation is available online (https://bnlp.readthedocs.io/). BNLP is released under the permissive MIT license. It is easy to install via pip or by cloning the repository, and easy to plug into any Python project.

2 Related Works

There is a significant number of open-source NLP tools for the English language, such as NLTK Loper and Bird (2002), coreNLP Manning et al. (2014), spaCy Honnibal and Montani (2017), AllenNLP Gardner et al. (2018), Flair Akbik et al. (2019), and stanza Qi et al. (2020). These tools are mostly built for English and have limited or no support for low-resource languages. For a low-resource language like Bengali in particular, there is a severe scarcity of processing tools. iNLTK Arora (2020) is an initial approach that helps process Bengali with tokenization and language model support. But because it covers a group of different Indic languages, dedicated monolingual attention to Bengali is missing. With that concern in mind, we built BNLP specifically for Bengali, providing tokenization, embedding, POS tagging, and NER support. Besides, iNLTK is mostly based on a DL language model pipeline, which needs DL infrastructure to perform NLP tasks. That makes iNLTK a verbose, language-model-centric tool for Bengali. BNLP, on the other hand, is an entirely ML based toolkit that can process Bengali NLP tasks instantly.

3 BNLP API

Our design principle was to make the tool easy to use with a few lines of code. A researcher or developer can integrate the tool by installing a simple Python package. In this section, we describe how to perform different NLP tasks on Bengali text using the BNLP toolkit.

Figure 1: An example of doing basic tokenization using BNLP

Corpus	Articles	Sentences	Tokens
Wikipedia	99139	1818523	32908419
News Articles	127867	4017940	60526710
Total	227006	5836463	93435129

Table 2: Statistics of Datasets used for training sentencepiece, word2vec, fasttext Models

	Sentences	Train	Test
POS	2997	2247	750
NER	67719	64155	3564

Table 3: Statistics of POS and NER datasets

Task	Precision	Recall	F1
POS	81.74	79.78	80.75
NER	74.15	60.91	66.88

Table 4: Evaluation results

3.1 Tokenizers

BNLP provides three tokenization options for Bengali text. As rule-based tokenizers, BNLP provides the Basic Tokenizer, a punctuation-splitting tokenizer, and an NLTK (https://github.com/nltk/nltk) tokenizer. Because the NLTK tokenizer targets English, we modified its output for Bengali, taking into account the differences between English and Bengali punctuation. For model-based tokenization, BNLP provides a sentencepiece (https://github.com/google/sentencepiece) tokenizer for Bengali text, called Bengali Sentencepiece. The Bengali Sentencepiece API offers two options: using the pre-trained sentencepiece model or training a new one. Anyone can tokenize Bengali text with the pre-trained sentencepiece model or train their own Bengali sentencepiece model by calling the train API. Figure 1 shows an example of the BNLP basic tokenizer.
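To make the idea of a punctuation-splitting tokenizer concrete, the following is a minimal standard-library sketch. It is not BNLP's implementation: the function name `basic_tokenize` and the punctuation set (ASCII punctuation plus the Bengali danda and double danda) are assumptions made for illustration only.

```python
import string

# Assumed punctuation set: ASCII punctuation plus the Bengali danda and double danda.
PUNCTUATION = set(string.punctuation) | {"।", "॥"}


def basic_tokenize(text):
    """Split on whitespace and emit every punctuation mark as its own token."""
    tokens = []
    current = []  # characters of the word currently being built
    for ch in text:
        if ch.isspace():
            # Whitespace ends the current word and is discarded.
            if current:
                tokens.append("".join(current))
                current = []
        elif ch in PUNCTUATION:
            # Punctuation ends the current word and becomes a separate token.
            if current:
                tokens.append("".join(current))
                current = []
            tokens.append(ch)
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens


print(basic_tokenize("আমি ভাত খাই। তুমি কি খাও?"))
```

The sketch walks over characters instead of using a `\w+` regular expression because Bengali vowel signs are combining marks, which Python's `\w` does not match, so a regex of that form would break words apart.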

3.2 Embedding

BNLP provides two embedding options for Bengali words: word2vec Mikolov et al. (2013) and fasttext Bojanowski et al. (2016). Both Bengali word2vec and fasttext offer two options: embedding Bengali words with a pre-trained model, or training a Bengali word2vec or fasttext model from scratch. For both embedding models, we used the gensim (https://github.com/RaRe-Technologies/gensim) embedding API and trained on Bengali corpora. Figure 2 shows an example of generating a word vector using a pre-trained model and BNLP.

3.3 POS Tagging

BNLP provides a ready-to-use starting point for Bengali POS tagging: a method that tags the parts of speech in a given sentence using a pre-trained CRF-based McCallum and Li (2003) model. BNLP also provides an option to train a CRF-based POS model on custom POS datasets.
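A linear-chain CRF tagger scores tag sequences from per-token feature dictionaries. The paper does not list the features BNLP uses, so the feature set below is purely hypothetical; it only sketches the kind of input a CRF-based POS or NER model is typically trained on. The names `word_features` and `sentence_features` are illustrative.

```python
def word_features(sentence, i):
    """Build a feature dictionary for the token at position i (hypothetical feature set)."""
    word = sentence[i]
    features = {
        "bias": 1.0,
        "word": word,
        "prefix2": word[:2],   # first two code points of the token
        "suffix2": word[-2:],  # last two code points of the token
        "is_digit": word.isdigit(),
        "is_first": i == 0,
        "is_last": i == len(sentence) - 1,
    }
    # Neighbouring tokens give the model local context.
    if i > 0:
        features["prev_word"] = sentence[i - 1]
    if i < len(sentence) - 1:
        features["next_word"] = sentence[i + 1]
    return features


def sentence_features(sentence):
    """Return one feature dictionary per token of a tokenized sentence."""
    return [word_features(sentence, i) for i in range(len(sentence))]


for feats in sentence_features(["আমি", "ভাত", "খাই"]):
    print(feats)
```

CRF libraries such as sklearn-crfsuite accept sequences of dictionaries of this shape, paired with one tag per token, as training input.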

3.4 NER Tagging

BNLP provides a ready-to-use starting point for Bengali NER: a method that tags the named entities in a given sentence using a pre-trained CRF-based NER model. BNLP also provides an option to train a CRF-based NER model on custom NER datasets. Figure 3 shows an example of named entity tagging using BNLP.

Apart from this, BNLP provides some extra utility methods, such as getting Bengali stopwords, letters, and punctuation from the Corpus class.
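Stopword and punctuation lists like these are typically used to filter a token sequence before further processing. The sketch below shows that filtering step with a tiny hand-picked stopword sample; the lists and the function name `clean_tokens` are assumptions for illustration and are not the contents of BNLP's Corpus class.

```python
import string

# Tiny illustrative sample of Bengali stopwords (assumed, not BNLP's list).
STOPWORDS = {"এবং", "ও", "কিন্তু"}
# Assumed punctuation set: ASCII punctuation plus the Bengali danda.
PUNCTUATION = set(string.punctuation) | {"।"}


def clean_tokens(tokens):
    """Drop tokens that are stopwords or punctuation marks, keeping the original order."""
    return [t for t in tokens if t not in STOPWORDS and t not in PUNCTUATION]


print(clean_tokens(["আমি", "এবং", "তুমি", "ভাত", "খাই", "।"]))
```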

4 Pre-trained Models

BNLP provides several pre-trained Bengali models: (i) sentencepiece, (ii) word2vec, (iii) fasttext, (iv) CRF-based POS tagging, and (v) CRF-based NER tagging.

Sentencepiece: Training different language models requires a good subword-level vocabulary. We built a subword-based vocabulary by training a sentencepiece model on the Bengali Wikipedia and news article datasets. We trained a sentencepiece unigram language model Kudo (2018) with a vocabulary size of 50000.

Word2Vec: We trained the Bengali word2vec model on the Bengali Wikipedia and news article datasets using the gensim word2vec pipeline. We trained the model with an embedding dimension of 300, a window size of 5, a minimum word occurrence count of 1, and 8 workers. We trained it for a total of 50000 iterations.

Fasttext: We trained the Bengali fasttext model on the Bengali Wikipedia and news article datasets. For training fasttext we set the embedding dimension to 300, the window size to 5, the minimum word occurrence count to 1, the model type to skip-gram, and the learning rate to 0.05. We trained for a total of 50 epochs, and the final loss was 0.318668.
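What distinguishes fasttext from word2vec is that, following Bojanowski et al. (2016), a word is represented through its character n-grams, with `<` and `>` added as boundary markers. The sketch below extracts such n-grams in plain Python. The function name `char_ngrams` is illustrative, the default range of 3 to 6 follows the fastText paper, and the settings used for the BNLP model are not stated in this paper.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of a word wrapped in boundary markers."""
    wrapped = f"<{word}>"
    ngrams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of length n over the wrapped word.
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams


print(char_ngrams("where", 3, 3))
print(char_ngrams("বাংলা", 3, 3))
```

Note that the sketch slices by Unicode code point, so Bengali vowel signs count as separate characters within an n-gram.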

CRF-Based POS Tagging Model: We trained our CRF-based POS tagging model on the nltr (https://github.com/abhishekgupta92/bangla_pos_tagger) dataset. We split the data into 75% train and 25% test. The POS tagging model achieves an F1 score of 80.75.

CRF-Based NER Model: We trained our CRF-based NER model on NER-Bengali-Datasets Karim et al. (2019), using 64155 sentences for training and 3564 for testing (Table 3). The NER model achieves an F1 score of 66.88 (Table 4).

Table 4 provides detailed evaluation results of the POS tagging and NER models.
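The F1 score is the harmonic mean of precision and recall, so the F1 column of Table 4 can be recomputed from the other two columns. The short check below does this; the function name `f1_score` is illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both on the same scale)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall values taken from Table 4.
for task, precision, recall in [("POS", 81.74, 79.78), ("NER", 74.15, 60.91)]:
    print(task, round(f1_score(precision, recall), 2))
```

Rounded to two decimals, this gives 80.75 for POS and 66.88 for NER, matching Table 4.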

4.1 Datasets

For training sentencepiece, word2vec, and fasttext, we used raw Bengali text from two sources: the Bengali Wikipedia dump (https://dumps.wikimedia.org/bnwiki/latest/) and news articles crawled from different news portal sites. As shown in Table 2, our raw data contains a total of 99139 Bengali Wikipedia articles and 127867 news articles. The Wikipedia corpus contains a total of 1818523 sentences with 32908419 tokens. The news article corpus contains a total of 4017940 sentences with 60526710 tokens.

For POS tagging we used the nltr dataset, which contains a total of 2997 sentences. We split it into 2247 training and 750 test sentences and trained our POS tagging model. For NER we used NER-Bengali-Datasets Karim et al. (2019), which contains a total of 67719 sentences, with 64155 for training and 3564 for testing. Table 3 provides detailed statistics of the POS and NER datasets.
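The POS split sizes follow from a 75%/25% split of 2997 sentences. The sketch below reproduces the counts with a simple ordered split that rounds the training size down; whether the data was shuffled first, and how rounding was handled in practice, is not stated in the paper, so both are assumptions here.

```python
def train_test_split(items, train_fraction=0.75):
    """Split a list into a leading train part and a trailing test part (no shuffling)."""
    n_train = int(len(items) * train_fraction)  # training size is rounded down
    return items[:n_train], items[n_train:]


# Stand-in for the 2997 POS-tagged sentences; only the counts matter here.
sentences = list(range(2997))
train, test = train_test_split(sentences)
print(len(train), len(test))
```

With 2997 items this yields 2247 training and 750 test items, the same sizes as the POS row of Table 3.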

5 Conclusion and Future Work

The BNLP language processing toolkit provides tokenization, embedding, POS tagging, and NER facilities for Bengali. Its pre-trained CRF-based models achieve F1 scores of 80.75 for POS tagging and 66.88 for NER (Table 4). BNLP is widely used and appreciated by the Bengali research communities.

We are working on extending BNLP with support tools such as stemming, lemmatization, and corpus support. We are also working on adding language-model-based support so that researchers can use BNLP efficiently for different downstream tasks. While these features are under development, we hope that BNLP will accelerate Bengali NLP research and development.

References

Akbik et al. (2019) Alan Akbik, Tanja Bergmann, Duncan Blythe, Kashif Rasul, Stefan Schweter, and Roland Vollgraf. 2019. FLAIR: An easy-to-use framework for state-of-the-art NLP. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 54–59, Minneapolis, Minnesota. Association for Computational Linguistics.
Arora (2020) Gaurav Arora. 2020. iNLTK: Natural language toolkit for indic languages. In Proceedings of Second Workshop for NLP Open Source Software (NLP-OSS), pages 66–71, Online. Association for Computational Linguistics.
Bojanowski et al. (2016) Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomás Mikolov. 2016. Enriching word vectors with subword information. CoRR, abs/1607.04606.
Gardner et al. (2018) Matt Gardner, Joel Grus, Mark Neumann, Oyvind Tafjord, Pradeep Dasigi, Nelson F. Liu, Matthew E. Peters, Michael Schmitz, and Luke Zettlemoyer. 2018. Allennlp: A deep semantic natural language processing platform. CoRR, abs/1803.07640.
Honnibal and Montani (2017) Matthew Honnibal and Ines Montani. 2017. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. To appear.
Karim et al. (2019) Redwanul Karim, M. A. Islam, Sazid Simanto, Saif Chowdhury, Kalyan Roy, Adnan Neon, Md Hasan, Adnan Firoze, and Mohammad Rahman. 2019. A step towards information extraction: Named entity recognition in bangla using deep learning. Journal of Intelligent and Fuzzy Systems, 37:1–13.
Kudo (2018) Taku Kudo. 2018. Subword regularization: Improving neural network translation models with multiple subword candidates. CoRR, abs/1804.10959.
Loper and Bird (2002) Edward Loper and Steven Bird. 2002. Nltk: the natural language toolkit. CoRR, cs.CL/0205028.
Manning et al. (2014) Christopher Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 55–60, Baltimore, Maryland. Association for Computational Linguistics.
McCallum and Li (2003) Andrew McCallum and Wei Li. 2003. Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pages 188–191.
Mikolov et al. (2013) Tomás Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. In 1st International Conference on Learning Representations, ICLR 2013, Scottsdale, Arizona, USA, May 2-4, 2013, Workshop Track Proceedings.
Qi et al. (2020) Peng Qi, Yuhao Zhang, Yuhui Zhang, Jason Bolton, and Christopher D Manning. 2020. Stanza: A python natural language processing toolkit for many human languages. arXiv preprint arXiv:2003.07082.