BNLP: Natural language processing toolkit for Bengali

Sagor Sarker
Begum Rokeya University, Rangpur, Bangladesh
[email protected]
Abstract

BNLP is an open-source language processing toolkit for Bengali that provides tokenization, word embedding, part-of-speech (POS) tagging, and named entity recognition (NER). BNLP ships pre-trained models for model-based tokenization, embedding, POS tagging, and NER in Bengali. The pre-trained POS and NER models reach F1 scores of 80.75 and 66.88, respectively. BNLP is widely used by the Bengali research community, with 25K downloads, 138 stars, and 31 forks. BNLP is available at https://github.com/sagorbrur/bnlp.

1 Introduction

Natural language processing (NLP) is one of the most important fields in computational linguistics. Tokenization, embedding, POS tagging, NER, text classification, and language modeling are some of its sub-tasks. Any computational linguistics researcher or developer needs hands-on tools to do these sub-tasks efficiently. Thanks to recent advances in NLP, many tools and methods exist for word tokenization, word embedding, POS tagging, and NER in English. NLTK Loper and Bird (2002), coreNLP Manning et al. (2014), spaCy Honnibal and Montani (2017), AllenNLP Gardner et al. (2018), Flair Akbik et al. (2019), and stanza Qi et al. (2020) are a few of them. These tools provide a variety of methods for tokenization, embedding, POS tagging, NER, and language modeling in English. Support for low-resource languages like Bengali is limited or absent. A recent tool, iNLTK Arora (2020), is an initial approach for different Indic languages including Bengali. But because it groups Bengali with other Indic languages, dedicated monolingual support such as easy pre-processing, tokenization, embedding, POS tagging, and NER for Bengali is missing. Besides, iNLTK is mostly built on a deep learning (DL) language model pipeline, which needs DL infrastructure to run NLP tasks. That makes iNLTK a verbose, language-model-centric tool for Bengali. BNLP, in contrast, is a purely machine learning (ML) based toolkit that can run Bengali NLP tasks immediately, without DL infrastructure. Table 1 provides a detailed feature comparison between BNLP and other tools.

Columns: Tool, Support Bengali, ML Based, Pre-trained Model (Tokenizer, Embedding, POS, NER, LM)
Rows: NLTK, spaCy, Flair, stanza, iNLTK, BNLP
Table 1: BNLP feature comparison with other popular tools

BNLP, an open-source language processing toolkit for Bengali, is built to address this problem and lowers the barrier to Bengali NLP tasks by:

  • Providing different tokenization methods to tokenize Bengali text efficiently

  • Providing different embedding methods to embed Bengali words using a pre-trained model, with an option to train an embedding model from scratch

  • Providing a ready-to-use option for POS tagging and NER of Bengali sentences, with an option to train a CRF-based POS tagger or NER model from scratch

BNLP offers several widely used text preprocessing techniques, such as removing stopwords, punctuation, and foreign words. The source code of the package and the pre-trained models are hosted on GitHub (https://github.com/sagorbrur/bnlp), and the documentation is available at https://bnlp.readthedocs.io/. BNLP is released under the permissive MIT license. It can be installed via pip or by cloning the repository, and it plugs easily into any Python project.

2 Related Works

There is a significant number of open-source NLP tools for the English language, including NLTK Loper and Bird (2002), coreNLP Manning et al. (2014), spaCy Honnibal and Montani (2017), AllenNLP Gardner et al. (2018), Flair Akbik et al. (2019), and stanza Qi et al. (2020). These tools are mostly built for English and have limited or no support for low-resource languages. For a low-resource language like Bengali in particular, processing tools are scarce. iNLTK Arora (2020) is an initial approach that helps process Bengali with tokenization and language model support. But because it groups Bengali with other Indic languages, dedicated monolingual support for Bengali is missing. With that concern in mind, we built BNLP specifically for Bengali, with support for tokenization, embedding, POS tagging, and NER. Besides, iNLTK is mostly built on a DL language model pipeline, which needs DL infrastructure to run NLP tasks. That makes iNLTK a verbose, language-model-centric tool for Bengali. BNLP, in contrast, is a purely ML based toolkit that can run Bengali NLP tasks immediately.

3 BNLP API

Our design principle was to make the tool usable with a few lines of code. A researcher or developer can integrate it by installing a single Python package. In this section, we describe how to do different NLP tasks on Bengali text using the BNLP toolkit.

Figure 1: An example of doing basic tokenization using BNLP
Figure 2: An example of generating word vector using trained model and BNLP
Corpus Articles Sentences Tokens
Wikipedia 99139 1818523 32908419
News Articles 127867 4017940 60526710
Total 227006 5836463 93435129
Table 2: Statistics of Datasets used for training sentencepiece, word2vec, fasttext Models
Task Sentences Train Test
POS 2997 2247 750
NER 67719 64155 3564
Table 3: Statistics of POS and NER datasets
Task Precision Recall F1
POS 81.74 79.78 80.75
NER 74.15 60.91 66.88
Table 4: Evaluation results

3.1 Tokenizers

BNLP provides three tokenization options for Bengali text. As rule-based tokenizers, BNLP provides the Basic Tokenizer, a punctuation-splitting tokenizer, and an NLTK (https://github.com/nltk/nltk) tokenizer. Because the NLTK tokenizer targets English, we modified its output for Bengali, taking into account the differences between English and Bengali punctuation. As a model-based tokenizer, BNLP provides a sentencepiece (https://github.com/google/sentencepiece) tokenizer for Bengali text called Bengali Sentencepiece. The Bengali Sentencepiece API offers two options: using the pre-trained sentencepiece model and training a new one. Anyone can tokenize Bengali text with the pre-trained model or train their own Bengali sentencepiece model by calling the train API. Figure 1 shows an example of the BNLP basic tokenizer.
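To make the idea of a punctuation-splitting tokenizer concrete, here is a minimal sketch in plain Python. It is not BNLP's implementation: the function name `basic_tokenize` and the exact punctuation set are assumptions made for illustration. The one Bengali-specific detail it shows is that the danda (।) and double danda (॥) must be treated as punctuation alongside the ASCII marks.

```python
import re
import string

# Assumption: ASCII punctuation plus the Bengali danda (U+0964) and
# double danda (U+0965). BNLP's own punctuation list may differ.
BENGALI_PUNCT = "\u0964\u0965"
PUNCT_CLASS = re.escape(string.punctuation + BENGALI_PUNCT)

# A token is either one punctuation mark, or a run of characters that are
# neither whitespace nor punctuation.
TOKEN_RE = re.compile(f"[{PUNCT_CLASS}]|[^\\s{PUNCT_CLASS}]+")


def basic_tokenize(text):
    """Split text on whitespace and emit each punctuation mark as its own token."""
    return TOKEN_RE.findall(text)


# Usage: the sentence-final danda becomes a separate token.
print(basic_tokenize("আমি গান গাই\u0964"))
```

A regular expression keeps the sketch short; a tokenizer tuned for Bengali would likely need extra rules, for example for abbreviations and numerals.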

3.2 Embedding

BNLP provides two embedding options for Bengali words: word2vec Mikolov et al. (2013) and fasttext Bojanowski et al. (2016). Each offers two modes: embedding Bengali words with a pre-trained model, or training a Bengali word2vec or fasttext model from scratch. For both embedding models, we used the gensim (https://github.com/RaRe-Technologies/gensim) embedding API and trained on Bengali corpora. Figure 2 shows an example of generating a word vector using a pre-trained model and BNLP.

3.3 POS Tagging

BNLP provides a ready-to-use option for Bengali POS tagging: a method that tags the parts of speech in a given sentence using a pre-trained CRF-based McCallum and Li (2003) model. BNLP also provides an option to train a CRF-based POS model on custom POS datasets.
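A CRF tagger does not see raw text; each token is first turned into a set of features. The sketch below shows one plausible way to build such features in plain Python. The feature names and the helper functions `word_features` and `sentence_features` are hypothetical and are not taken from BNLP's source. CRF libraries such as sklearn-crfsuite accept input in this shape (one list of feature dictionaries per sentence), but which library and which features BNLP uses is not stated here.

```python
def word_features(tokens, i):
    """Build a feature dictionary for the token at position i (hypothetical feature set)."""
    word = tokens[i]
    features = {
        "bias": 1.0,
        "word": word,
        "prefix2": word[:2],       # first two code points
        "suffix2": word[-2:],      # last two code points
        "is_digit": word.isdigit(),
        "is_first": i == 0,
        "is_last": i == len(tokens) - 1,
    }
    # Neighbouring words give the model local context.
    if i > 0:
        features["prev_word"] = tokens[i - 1]
    if i < len(tokens) - 1:
        features["next_word"] = tokens[i + 1]
    return features


def sentence_features(tokens):
    """Return one feature dictionary per token in the sentence."""
    return [word_features(tokens, i) for i in range(len(tokens))]


# Usage: features for a three-word sentence.
for feats in sentence_features(["আমি", "ভাত", "খাই"]):
    print(feats)
```

The same feature pipeline can serve both POS tagging and NER; only the label set attached to each token changes.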

3.4 NER Tagging

Figure 3: An example of doing NER using BNLP

BNLP provides a ready-to-use option for Bengali NER: a method that tags the named entities in a given sentence using a pre-trained CRF-based NER model. BNLP also provides an option to train a CRF-based NER model on custom NER datasets. Figure 3 shows an example of named entity tagging using BNLP.

Apart from this, BNLP provides some extra utility methods, such as getting Bengali stopwords, letters, and punctuation from the Corpus class.

4 Pre-trained Models

BNLP provides several pre-trained Bengali models: (i) sentencepiece, (ii) word2vec, (iii) fasttext, (iv) a CRF-based POS tagging model, and (v) a CRF-based NER model.

Sentencepiece: Training language models requires a good subword-level vocabulary. We built a subword-based vocabulary by training a sentencepiece model on the Bengali Wikipedia and news article datasets. We trained a sentencepiece unigram language model Kudo (2018) with a vocabulary size of 50000.
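To illustrate how a unigram language model tokenizes text once it is trained, the sketch below picks the highest-scoring segmentation with a Viterbi search. The toy vocabulary and its scores are invented for illustration (they are not normalized probabilities and do not come from the trained model), and the function name `viterbi_segment` is hypothetical. The real sentencepiece tokenizer additionally handles whitespace markers and text normalization, which are omitted here.

```python
import math


def viterbi_segment(text, log_probs):
    """Return the segmentation of text into vocabulary pieces with the highest total score."""
    n = len(text)
    max_len = max(len(piece) for piece in log_probs)
    best = [-math.inf] * (n + 1)   # best[i]: best score for text[:i]
    back = [0] * (n + 1)           # back[i]: start index of the last piece in that best path
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in log_probs and best[start] + log_probs[piece] > best[end]:
                best[end] = best[start] + log_probs[piece]
                back[end] = start
    if best[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    # Walk the back-pointers from the end of the text to recover the pieces.
    pieces = []
    end = n
    while end > 0:
        start = back[end]
        pieces.append(text[start:end])
        end = start
    return pieces[::-1]


# Toy vocabulary (assumption): two frequent pieces plus single-character fallbacks.
word_a, word_b = "বাংলা", "দেশ"
text = word_a + word_b
log_probs = {ch: math.log(0.01) for ch in set(text)}
log_probs.update({word_a: math.log(0.1), word_b: math.log(0.1)})

print(viterbi_segment(text, log_probs))
```

Because the two longer pieces score far better than eight single characters, the search prefers them; a larger learned vocabulary applies the same principle at scale.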

Word2Vec: We trained a Bengali word2vec model on the Bengali Wikipedia and news article datasets using the gensim word2vec pipeline. We used an embedding dimension of 300, a window size of 5, a minimum word count of 1, and 8 worker threads. We trained it for a total of 50000 iterations.

Fasttext: We trained a Bengali fasttext model on the Bengali Wikipedia and news article datasets. We used an embedding dimension of 300, a window size of 5, a minimum word count of 1, the skip-gram model type, and a learning rate of 0.05. We trained for a total of 50 epochs, reaching a loss of 0.318668.
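What distinguishes fasttext from word2vec is that a word is represented through its character n-grams, so that unseen or inflected words still receive a vector. The sketch below extracts those n-grams in plain Python. The helper name `char_ngrams` is hypothetical, and the range of 3 to 6 characters is the fastText library default, assumed here because the n-gram range used for the BNLP model is not stated.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of a word wrapped in boundary markers."""
    wrapped = f"<{word}>"   # "<" and ">" mark the start and end of the word
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams


# Usage: trigrams only, then the assumed default range for a Bengali word.
print(char_ngrams("where", 3, 3))
print(char_ngrams("বাংলা"))
```

Note that slicing a Python string works on Unicode code points, so a Bengali vowel sign or conjunct can be separated from its base consonant inside an n-gram.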

CRF-Based POS Tagging Model: We trained our CRF-based POS tagging model on the nltr (https://github.com/abhishekgupta92/bangla_pos_tagger) dataset. We split the data into 75% train and 25% test. The POS tagging model achieves an F1 score of 80.75.

CRF-Based NER Model: We trained our CRF-based NER model on NER-Bengali-Datasets Karim et al. (2019). We split the data into 64155 training and 3564 test sentences (Table 3), roughly a 95%/5% split. The NER model achieves an F1 score of 66.88.

Table 4 provides detailed evaluation results for the POS tagging and NER models.
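The F1 values in Table 4 are consistent with the harmonic mean of the reported precision and recall. The short check below recomputes them; the function name `f1_score` is only illustrative, and whether the original scores were averaged per tag or over all tokens is not stated.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both on the same scale, e.g. percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in Table 4.
print(round(f1_score(81.74, 79.78), 2))  # POS
print(round(f1_score(74.15, 60.91), 2))  # NER
```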

4.1 Datasets

For training the sentencepiece, word2vec, and fasttext models we used raw Bengali text from two sources. One is the Wikipedia dump (https://dumps.wikimedia.org/bnwiki/latest/) and the other is news articles crawled from different news portals. As shown in Table 2, our raw data contains a total of 99139 Bengali Wikipedia articles and 127867 news articles. The Wikipedia corpus contains a total of 1818523 sentences with 32908419 tokens. The news article corpus contains a total of 4017940 sentences with 60526710 tokens.

For POS tagging we used the nltr dataset, which contains a total of 2997 sentences. We split it into 2247 training and 750 test sentences and trained our POS tagging model. For NER we used NER-Bengali-Datasets Karim et al. (2019), which contains a total of 67719 sentences, with 64155 for training and 3564 for testing. Table 3 provides detailed statistics of the POS and NER datasets.
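The POS split sizes follow from a 75% cut of the 2997 sentences when the cut point is rounded down. The sketch below reproduces that arithmetic with a hypothetical helper, `split_train_test`; whether the data was shuffled before splitting is not stated, so the sketch keeps the original order.

```python
def split_train_test(items, train_fraction=0.75):
    """Split a list into a leading train part and a trailing test part."""
    cut = int(len(items) * train_fraction)  # rounds down
    return items[:cut], items[cut:]


# Stand-in for the 2997 POS-tagged sentences.
sentences = list(range(2997))
train, test = split_train_test(sentences)
print(len(train), len(test))
```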

5 Conclusion and Future Work

The BNLP language processing toolkit provides tokenization, embedding, POS tagging, and NER facilities for Bengali. Its pre-trained models reach F1 scores of 80.75 for POS tagging and 66.88 for NER. BNLP is widely used and appreciated by the Bengali research community.

We are working on extending BNLP with further tools such as stemming, lemmatization, and corpus support. We are also working on adding language-model-based support so that researchers can use BNLP efficiently for different downstream tasks. While these features are under development, we hope that BNLP will accelerate Bengali NLP research and development.

References