End-to-End Bengali Speech Recognition
Abstract
Bengali is a prominent language of the Indian subcontinent. However, while many state-of-the-art acoustic models exist for other prominent languages spoken in the region, research and resources for Bengali are few and far between. In this work, we apply CTC-based CNN-RNN networks, a prominent deep learning based end-to-end automatic speech recognition technique, to the Bengali ASR task. We also propose and evaluate the applicability and efficacy of small 7x3 and 3x3 convolution kernels, which are widely used in the computer vision domain primarily because of their FLOP and parameter efficiency. We propose two convolution blocks, the 2-layer Block A and the 4-layer Block B, each with a 7x3 kernel in the first layer and 3x3 kernels in the subsequent layers. Using the publicly available Large Bengali ASR Training data set, we benchmark and evaluate the performance of seven deep neural network configurations of varying complexity and depth on the Bengali ASR task. Our best model, built on Block B, achieves a WER of 13.67%, an absolute reduction of 1.39% over a comparable model with larger convolution kernels of size 41x11 and 21x11.
Index Terms: automatic speech recognition, convolutional neural network, recurrent neural network, connectionist temporal classification, Bengali.
1 Introduction
Bengali, also referred to as Bangla, is the seventh most spoken native language in the world, the second most widely spoken language in India and the official language of Bangladesh. It has around 185 million speakers in the Indian subcontinent. Bengali consists of 29 consonants, 7 vowels and 7 nasalized vowels, and shares its phonetic space with other Magadhan languages such as Assamese and Odia. However, unlike various prominent languages spoken in the Indian subcontinent, research and resources for automatic speech recognition in Bengali are scarce.
With the emergence of voice as a natural form of human-computer interaction and the advent of large datasets, automatic speech recognition (ASR) has seen major strides over the past few years in terms of robustness and practical applications. Deep learning, facilitated by Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN), as well as advancements in end-to-end training methodologies such as the Seq2Seq and Connectionist Temporal Classification paradigms, has enabled the development of state-of-the-art end-to-end ASR pipelines.
In this paper, we investigate the performance and applicability of existing deep learning based end-to-end automatic speech recognition paradigms, specifically CNN-RNN hybrid models trained end-to-end with the Connectionist Temporal Classification (CTC) loss [1], on the Bengali ASR task. We also perform ablation experiments to investigate the efficacy of the smaller convolution kernels widely used in computer vision ([2], [3]), which offer benefits such as fewer parameters [2] and lower memory addressing cost [4], over the larger filters prevalent in the speech recognition domain [1]. To this end, we benchmark and evaluate seven CNN-RNN based deep neural network architectures of varying complexity and depth on the Large Bengali ASR Training data set [5], a large-scale, publicly available Bengali corpus with around 200,000 utterances.
2 Related Works
2.1 Deep learning and end-to-end ASR: Primer
Traditional automatic speech recognition systems, spearheaded by HMM-GMM acoustic models in which HMMs handle temporal alignment and GMMs model the emission probabilities of the aligned tri-phone states, dominated the field for a long time and were only recently surpassed by HMM-DNN systems, in which deep neural networks replace the GMMs.
The advent of sequence-to-sequence (Seq2Seq) models, paired with the development of Connectionist Temporal Classification (CTC) [6], revolutionized end-to-end training of speech recognition systems: a single network is trained to map the audio sequence directly to the output text, with recurrent neural network ([6], [7], [1]) and attention ([8], [9]) based end-to-end models setting state-of-the-art results on several benchmark datasets.
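Concretely, given an input sequence x and a target transcription y, CTC [6] introduces a blank symbol and marginalizes over all frame-level label paths π that collapse to y under the mapping B (which removes repeated labels and blanks), training the network to minimize the negative log-likelihood:

$$P(\mathbf{y}\mid\mathbf{x}) = \sum_{\pi \in \mathcal{B}^{-1}(\mathbf{y})} \prod_{t=1}^{T} p(\pi_t \mid \mathbf{x}), \qquad \mathcal{L}_{\mathrm{CTC}} = -\log P(\mathbf{y}\mid\mathbf{x}).$$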
Most recent deep learning based end-to-end systems use log-spectrogram or MFCC features as inputs to the network. Given the innate capability of deep neural networks to learn features from raw inputs, several recent works have switched to learning features directly from the raw speech signal. To this end, [10] recently proposed time-delay convolution based trainable filterbanks, observing performance comparable to mel-filterbank based systems on the WSJ dataset.
2.2 Speech Recognition and Indian Languages
Languages spoken in the Indian subcontinent are a topic of interest primarily due to their sheer number of speakers, but the lack of large-scale, publicly available datasets poses a challenge for the development and evaluation of deep learning based systems.
Most recently, a multilingual phone recognition system was proposed by [11], covering four Indian languages, viz. Kannada, Telugu, Bengali and Odia. They proposed a traditional HMM-DNN based phone recognition system leveraging Articulatory Features (AFs). Their DNN subsystem used the tanh activation function in the hidden layers and was trained in a greedy, layer-by-layer supervised setting. However, unlike [11], which proposed a phoneme recognition system evaluated on a very small dataset (19.2 hours in total), we propose and evaluate CNN-RNN based end-to-end systems on a very large dataset.
Addressing the challenges involved in building robust speech recognition systems for Indian languages, [12] proposed training phoneme recognition systems on articulatory movements captured by an electromagnetic articulograph. They evaluated multiple phoneme recognition models, similar to [11], with the addition of an HMM-RNN based acoustic model. However, they collected a small dataset of only two male speakers per language, a sample too small to draw conclusions about the generality and applicability of the approach.
3 Datasets
The Large Bengali ASR Training data set (http://openslr.org/53/) [5] is used for evaluating models on the Bengali ASR task. Since dev/test splits are not provided for the Bengali data set, we propose our own evaluation procedure, splitting the data into train, validation and test sets (Table 1), available upon request.
Table 1: Train, validation and test splits of the Large Bengali ASR Training data set.

| Split | Utterances | Duration |
|---|---|---|
| train | 148,110 | 145.90 hrs |
| val | 26,138 | 25.79 hrs |
| test | 43,562 | 42.91 hrs |
| Total | 217,810 | 214.60 hrs |
4 Proposed Approach
In the following subsections, we describe the evaluated CNN-RNN based end-to-end speech recognition networks, the changes we propose to the original convolution block of [1], and the various model architectures evaluated in this work.
The intention behind evaluating multiple networks of varying depth and complexity is multi-fold:

1. To investigate the performance trend when
   (a) the number of CNN layers is changed, and
   (b) the number of RNN layers is changed.
2. To benchmark models for environments with different computational constraints.
4.1 Model Outline
Fig. 1 depicts the general outline of the CNN-RNN based end-to-end speech recognition models popular in recent literature ([1], [7]). The convolution block represents a stack of n convolution layers. Fig. 2 shows the original 2-layer convolution stack proposed in [1], where each convolution layer is followed by a Batch Normalization layer [13] and a HardTanh non-linearity.
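As a concrete illustration of this outline, the following PyTorch sketch shows the data flow (convolution block → bidirectional GRU stack → per-timestep classifier → log-probabilities for the CTC loss). It is a minimal sketch: the conv_block argument and the layer sizes are placeholders, not the exact configuration of any specific model evaluated here.

```python
import torch.nn as nn

class CnnRnnCtcModel(nn.Module):
    """Minimal CNN-RNN acoustic model outline (Fig. 1): conv block -> BiGRU stack -> CTC head."""

    def __init__(self, conv_block, conv_out_features, rnn_layers, rnn_hidden, num_labels):
        super().__init__()
        self.conv_block = conv_block                        # 2D conv stack over (freq, time)
        self.rnn = nn.GRU(input_size=conv_out_features,     # flattened (channels * freq) per frame
                          hidden_size=rnn_hidden,
                          num_layers=rnn_layers,
                          bidirectional=True,
                          batch_first=True)
        self.classifier = nn.Linear(2 * rnn_hidden, num_labels)  # label set includes the CTC blank

    def forward(self, spectrograms):
        # spectrograms: (batch, 1, freq, time) normalized log-spectrograms
        x = self.conv_block(spectrograms)                   # (batch, channels, freq', time')
        b, c, f, t = x.shape
        x = x.permute(0, 3, 1, 2).reshape(b, t, c * f)      # one feature vector per time step
        x, _ = self.rnn(x)                                  # (batch, time', 2 * rnn_hidden)
        return self.classifier(x).log_softmax(dim=-1)       # log-probs consumed by the CTC loss
```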
4.2 Proposed Convolution Blocks
The initial layers of a CNN based deep neural network are of significant importance in terms of both overall network complexity and performance: they consume the most memory and account for most of the floating point operations [2], owing to the larger spatial dimensions of the input, which is gradually downsampled with network depth. Convolutional neural networks built solely from kernels with very small receptive fields (3x3 [2], 7x7 [3]) have significantly pushed the state of the art on various computer vision tasks and challenges such as ImageNet and COCO, are widely used in recent work on CNN design ([3], [4], [14], [15], [16], [17], [18], [19]), and have several benefits over larger filters, viz.:
1. Smaller kernels are more parameter efficient [2].
2. Multiple smaller kernels stacked on top of each other effectively increase the discriminative power of the decision function by increasing the total number of non-linearities [2].
3. Large convolution filters applied with a stride at the very first layer sharply reduce the spatial resolution of the feature maps, which may be too drastic and may hurt the performance of the network.
We propose two convolution blocks with 2 and 4 convolution layers, henceforth referred to as Block A and Block B, respectively (Fig. 3). Instead of the large rectangular filters used in [1], we use a 7x3 kernel in the first layer and smaller 3x3 kernels in all subsequent layers.
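A sketch of the two blocks is given below, following the per-layer Conv → BatchNorm → HardTanh pattern of Fig. 2. The kernel layout (7x3 first, 3x3 afterwards) is as proposed; the channel widths (32 throughout for Block A, 32/32/64/64 for Block B), the activation clipping range and the unit strides are illustrative assumptions, with the channel widths chosen so that the parameter counts match Table 2.

```python
import torch.nn as nn

def conv_bn_act(in_ch, out_ch, kernel, padding):
    """Conv2d -> BatchNorm -> HardTanh, the per-layer pattern shown in Fig. 2."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=kernel, padding=padding),
        nn.BatchNorm2d(out_ch),
        nn.Hardtanh(0, 20, inplace=True),  # clipping range is an assumption
    )

class ConvBlockA(nn.Sequential):
    """Block A: one 7x3 layer followed by one 3x3 layer."""
    def __init__(self, channels=32):
        super().__init__(
            conv_bn_act(1, channels, (7, 3), (3, 1)),
            conv_bn_act(channels, channels, (3, 3), (1, 1)),
        )

class ConvBlockB(nn.Sequential):
    """Block B: one 7x3 layer followed by three 3x3 layers."""
    def __init__(self, channels=(32, 32, 64, 64)):
        c1, c2, c3, c4 = channels
        super().__init__(
            conv_bn_act(1, c1, (7, 3), (3, 1)),
            conv_bn_act(c1, c2, (3, 3), (1, 1)),
            conv_bn_act(c2, c3, (3, 3), (1, 1)),
            conv_bn_act(c3, c4, (3, 3), (1, 1)),
        )
```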
Table 2 lists the total number of floating point operations (FLOPs), in millions, and the number of parameters for each proposed convolution block, as well as for the 2-layer convolution block proposed in DeepSpeech 2 [1]. As evident from the table, the proposed convolution blocks have significantly fewer FLOPs and parameters.
Table 2: FLOPs and parameter counts of the convolution blocks.

| Block | FLOPs | Parameters |
|---|---|---|
| DS 2 [1] | 1,640 M | 251.17 K |
| Block A | 69.80 M | 10.08 K |
| Block B | 398.1 M | 65.76 K |
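Under the channel widths assumed in the sketch above, the parameter counts of Table 2 can be reproduced directly:

```python
def count_params(module):
    """Total number of trainable parameters (conv weights/biases + BatchNorm affine parameters)."""
    return sum(p.numel() for p in module.parameters())

print(count_params(ConvBlockA()))  # 10080 -> 10.08 K, matching Table 2
print(count_params(ConvBlockB()))  # 65760 -> 65.76 K, matching Table 2
```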
Table 3: Evaluated model configurations. RNN Config lists the number of bidirectional GRU layers and the hidden units per layer.

| Model | CNN Block | RNN Config |
|---|---|---|
| A-3GRU | Block A | 3, 512 |
| A-4GRU | Block A | 4, 512 |
| A-5GRU | Block A | 5, 512 |
| B-3GRU | Block B | 3, 512 |
| B-4GRU | Block B | 4, 512 |
| B-5GRU | Block B | 5, 512 |
| B-5GRU-Large | Block B | 5, 800 |
| 2CNN-5GRU | DS 2 | 5, 800 |
4.3 Proposed Networks
We propose and evaluate seven network configurations of varying complexity, listed in Table 3. Three distinct RNN stacks are evaluated, with 3, 4 and 5 bidirectional GRU layers respectively. Despite the challenges they pose for deployment in an online setting, bidirectional recurrent neural networks routinely outperform comparable unidirectional models [1]. The RNN Config column gives the number of bidirectional layers followed by the number of hidden units per layer. All configurations have 512 hidden units per bidirectional GRU layer, except B-5GRU-Large, which has 800.
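The configurations of Table 3 can be expressed compactly and plugged into the model skeleton sketched in Section 4.1; the snippet below is illustrative, and conv_out_features and num_labels depend on the spectrogram resolution and the Bengali character set respectively.

```python
# (CNN block class, number of bidirectional GRU layers, hidden units per layer)
CONFIGS = {
    "A-3GRU":       (ConvBlockA, 3, 512),
    "A-4GRU":       (ConvBlockA, 4, 512),
    "A-5GRU":       (ConvBlockA, 5, 512),
    "B-3GRU":       (ConvBlockB, 3, 512),
    "B-4GRU":       (ConvBlockB, 4, 512),
    "B-5GRU":       (ConvBlockB, 5, 512),
    "B-5GRU-Large": (ConvBlockB, 5, 800),
}

def build_model(name, conv_out_features, num_labels):
    """Instantiate one of the proposed configurations (the 2CNN-5GRU baseline is omitted here)."""
    block_cls, rnn_layers, rnn_hidden = CONFIGS[name]
    return CnnRnnCtcModel(block_cls(), conv_out_features, rnn_layers, rnn_hidden, num_labels)
```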
4.4 Implementation and Training Details
Following [1], we use normalized log-spectrograms, computed with a sliding window of width 20 ms, a stride of 10 ms and a 160-point FFT, as inputs to the network. The network is trained end-to-end with the CTC loss function [6], which enables prediction of character sequences directly from the input.
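A minimal NumPy sketch of this front-end is shown below. The 20 ms window, 10 ms stride and 160-point FFT follow the text; the 8 kHz sample rate (which makes the 20 ms window span exactly the 160 FFT points), the Hamming window and the per-utterance normalization are assumptions made for illustration.

```python
import numpy as np

def log_spectrogram(audio, sample_rate=8000, window_ms=20, stride_ms=10, n_fft=160):
    """Normalized log-spectrogram features (window/stride/FFT size per Section 4.4)."""
    win = int(sample_rate * window_ms / 1000)    # 160 samples at the assumed 8 kHz rate
    hop = int(sample_rate * stride_ms / 1000)    # 80 samples
    window = np.hamming(win)                     # analysis window is an assumption

    frames = []
    for start in range(0, len(audio) - win + 1, hop):
        frame = audio[start:start + win] * window
        magnitude = np.abs(np.fft.rfft(frame, n=n_fft))   # 160-point FFT -> 81 frequency bins
        frames.append(np.log(magnitude + 1e-8))           # log compression
    feats = np.stack(frames, axis=1)                       # (freq_bins, time)

    # Per-utterance mean/variance normalization (one possible normalization scheme).
    return (feats - feats.mean()) / (feats.std() + 1e-8)
```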
Training was performed on a single GTX 1080 Ti system with a batch size of 20, using Stochastic Gradient Descent with an exponential learning rate decay schedule and an initial learning rate of . Early stopping [20] with a patience of 3 was used for regularization, and all networks were trained for a maximum of 30 epochs.
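A condensed PyTorch sketch of this training setup is given below: CTC loss, SGD with exponential learning-rate decay, and early stopping on a validation metric with a patience of 3. The learning rate, decay factor and momentum are placeholders rather than the values used in our experiments, and evaluate_wer is a hypothetical helper passed in by the caller.

```python
import torch
import torch.nn as nn

def train(model, train_loader, val_loader, evaluate_wer,
          init_lr=1e-3, decay=0.99, max_epochs=30, patience=3, blank_idx=0):
    """Training-loop sketch: CTC loss + SGD + exponential LR decay + early stopping [20]."""
    ctc_loss = nn.CTCLoss(blank=blank_idx, zero_infinity=True)
    optimizer = torch.optim.SGD(model.parameters(), lr=init_lr, momentum=0.9)  # placeholder values
    scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=decay)

    best_wer, bad_epochs = float("inf"), 0
    for epoch in range(max_epochs):
        model.train()
        # loader is assumed to yield lengths at the model's output frame rate
        for spectrograms, targets, output_lengths, target_lengths in train_loader:
            log_probs = model(spectrograms).permute(1, 0, 2)   # CTCLoss expects (time, batch, labels)
            loss = ctc_loss(log_probs, targets, output_lengths, target_lengths)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()

        val_wer = evaluate_wer(model, val_loader)              # hypothetical helper
        if val_wer < best_wer:
            best_wer, bad_epochs = val_wer, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:                         # early stopping with patience 3
                break
```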
Inference: At inference time, we pair the proposed models with a 4-gram Kneser-Ney language model, trained on a large corpus of Bengali text collected from Bengali news sites and blogs using the KenLM toolkit [21], and decode with a beam size of 100. Results without the language model are also provided for the test set.
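The decoder implementation is not specified above; as one possible realization, the open-source pyctcdecode library provides CTC beam search with shallow fusion of a KenLM model, roughly as follows. The label list, model path and LM weights below are placeholders, not the values used in our experiments.

```python
import numpy as np
from pyctcdecode import build_ctcdecoder

# Labels in the same order as the acoustic model's output layer; "" denotes the CTC blank.
# Placeholder inventory -- the real list comes from the training character set.
labels = ["", " ", "অ", "আ", "ই", "ঈ"]

decoder = build_ctcdecoder(
    labels,
    kenlm_model_path="bengali_4gram.arpa",  # hypothetical path to the 4-gram Kneser-Ney model
    alpha=0.5,                               # LM weight (placeholder)
    beta=1.0,                                # word insertion bonus (placeholder)
)

# log_probs: (time, num_labels) output of the acoustic model for one utterance.
log_probs = np.log(np.full((50, len(labels)), 1.0 / len(labels)))  # dummy input for illustration
transcript = decoder.decode(log_probs, beam_width=100)
```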
5 Experiments and Results
Table 4 reports the word error rate (WER) on the Bengali ASR task for the various networks on the validation and test sets. Validation WER is calculated without a language model, whereas results both with and without the language model are shown for the test set.
Table 4: WER (%) on the validation and test sets.

| Model | Val (no LM) | Test (no LM) | Test (with LM) |
|---|---|---|---|
| A-3GRU | 37.7 | 37.86 | 14.99 |
| A-4GRU | 35.26 | 34.55 | 14.56 |
| A-5GRU | 37.76 | 36.75 | 14.24 |
| B-3GRU | 37.13 | 36.46 | 15.26 |
| B-4GRU | 34.10 | 33.51 | 14.50 |
| B-5GRU | 32.33 | 31.90 | 13.79 |
| B-5GRU-Large | 32.13 | 31.45 | 13.67 |
| 2CNN-5GRU | 34.71 | 33.74 | 15.06 |
Model B-5GRU-Large is the best performing configuration, achieving a WER of 13.67%, followed closely by B-5GRU. It is worth noting that B-5GRU, despite having fewer hidden units per GRU layer than 2CNN-5GRU (512 vs. 800), outperforms the latter by a significant margin of almost 2% absolute WER (without a language model), while also using a convolution block with fewer parameters and FLOPs, validating the applicability of smaller convolution kernels to the ASR task.
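For reference, WER as reported here is the word-level edit distance between hypothesis and reference transcription, normalized by the number of reference words; a small self-contained implementation:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming table for edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution / match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```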
6 Conclusions
In this work, we investigated the performance and applicability of a prominent end-to-end automatic speech recognition pipeline, a CNN-RNN based deep neural network trained with the CTC loss function, for the Bengali language, evaluating and benchmarking various network configurations of different complexities. Our studies also validate the applicability of the smaller convolution kernels widely used in the computer vision domain to ASR tasks: the proposed convolution Block A performs on par with the block built from larger 41x11 and 21x11 kernels despite having drastically fewer parameters, while convolution Block B outperforms it outright. The proposed network configurations can be applied to other Magadhan languages, and comparable results can be expected, as these languages share their phonetic space with Bengali.
References
- [1] D. Amodei, S. Ananthanarayanan, R. Anubhai, J. Bai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, Q. Cheng, G. Chen et al., “Deep speech 2: End-to-end speech recognition in english and mandarin,” in International conference on machine learning, 2016, pp. 173–182.
- [2] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
- [3] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
- [4] N. Ma, X. Zhang, H.-T. Zheng, and J. Sun, “Shufflenet v2: Practical guidelines for efficient cnn architecture design,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018.
- [5] O. Kjartansson, S. Sarin, K. Pipatsrisawat, M. Jansche, and L. Ha, “Crowd-Sourced Speech Corpora for Javanese, Sundanese, Sinhala, Nepali, and Bangladeshi Bengali,” in Proc. The 6th Intl. Workshop on Spoken Language Technologies for Under-Resourced Languages (SLTU), Gurugram, India, Aug. 2018, pp. 52–55. [Online]. Available: http://dx.doi.org/10.21437/SLTU.2018-11
- [6] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, “Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks,” in Proceedings of the 23rd international conference on Machine learning. ACM, 2006, pp. 369–376.
- [7] A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates et al., “Deep speech: Scaling up end-to-end speech recognition,” arXiv preprint arXiv:1412.5567, 2014.
- [8] J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, “Attention-based models for speech recognition,” in Advances in neural information processing systems, 2015, pp. 577–585.
- [9] W. Chan, N. Jaitly, Q. Le, and O. Vinyals, “Listen, attend and spell: A neural network for large vocabulary conversational speech recognition,” in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2016, pp. 4960–4964.
- [10] N. Zeghidour, N. Usunier, G. Synnaeve, R. Collobert, and E. Dupoux, “End-to-end speech recognition from the raw waveform,” in Proc. Interspeech 2018, 2018, pp. 781–785. [Online]. Available: http://dx.doi.org/10.21437/Interspeech.2018-2414
- [11] M. K E, K. S. Rao, D. B. Jayagopi, and V. Ramasubramanian, “Indian languages asr: A multilingual phone recognition framework with ipa based common phone-set, predicted articulatory features and feature fusion,” in Proc. Interspeech 2018, 2018, pp. 1016–1020. [Online]. Available: http://dx.doi.org/10.21437/Interspeech.2018-2529
- [12] D. Dash, M. Kim, K. Teplansky, and J. Wang, “Automatic speech recognition with articulatory information and a unified dictionary for hindi, marathi, bengali and oriya,” in Proc. Interspeech 2018, 2018, pp. 1046–1050. [Online]. Available: http://dx.doi.org/10.21437/Interspeech.2018-2122
- [13] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in International conference on machine learning, 2015, pp. 448–456.
- [14] K. He, X. Zhang, S. Ren, and J. Sun, “Identity mappings in deep residual networks,” in European conference on computer vision. Springer, 2016, pp. 630–645.
- [15] S. Zagoruyko and N. Komodakis, “Wide residual networks,” arXiv preprint arXiv:1605.07146, 2016.
- [16] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2818–2826.
- [17] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
- [18] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi, “Inception-v4, inception-resnet and the impact of residual connections on learning,” in Thirty-First AAAI Conference on Artificial Intelligence, 2017.
- [19] G. Huang, S. Liu, L. Van der Maaten, and K. Q. Weinberger, “Condensenet: An efficient densenet using learned group convolutions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2752–2761.
- [20] R. Caruana, S. Lawrence, and C. L. Giles, “Overfitting in neural nets: Backpropagation, conjugate gradient, and early stopping,” in Advances in neural information processing systems, 2001, pp. 402–408.
- [21] K. Heafield, “KenLM: Faster and smaller language model queries,” in Proceedings of the Sixth Workshop on Statistical Machine Translation. Association for Computational Linguistics, 2011, pp. 187–197.