Multilingual and code-switching ASR challenges for
low resource Indian languages
Abstract
Recently, there has been increasing interest in multilingual automatic speech recognition (ASR), where a single speech recognition system caters to multiple low resource languages by taking advantage of small amounts of labeled corpora in each of them. With multilingualism becoming common in today’s world, there has been increasing interest in code-switching ASR as well. In code-switching, multiple languages are freely interchanged within a single sentence or between sentences. The success of low-resource multilingual and code-switching ASR often depends on the variety of languages in terms of their acoustics and linguistic characteristics, the amount of data available, and how carefully these factors are considered in building the ASR system. In this challenge, we focus on building multilingual and code-switching ASR systems through two subtasks covering a total of seven Indian languages, namely Hindi, Marathi, Odia, Tamil, Telugu, Gujarati and Bengali. For this purpose, we provide a total of 600 hours of transcribed speech data, comprising train and test sets, in these languages, including two code-switched language pairs, Hindi-English and Bengali-English. We also provide a baseline recipe (https://github.com/navana-tech/baseline_recipe_is21s_indic_asr_challenge) for both tasks, with WERs of 30.73% and 32.45% on the test sets of the multilingual and code-switching subtasks, respectively.
Index Terms: Multilingual ASR, Code-switching
1 Introduction
India is a country of language continuum, where every few kilometres, the dialect/language changes [1]. Various language families or genealogical types have been reported, into which the vast number of Indian languages can be classified, including Austro-Asiatic, Dravidian, Indo-Aryan, Tibeto-Burman and, more recently, Tai-Kadai and Great Andamanese [2, 3]. However, there are no sharp boundaries among these language families; rather, languages across different families share linguistic traits, including retroflex sounds, the absence of prepositions and many more, resulting in acoustic and linguistic richness. According to the 2001 census, 29 Indian languages have more than a million speakers. Among these, 22 languages have been given the status of official languages by the Government of India [4, 5]. Most of these languages are low resource. Many of them do not have a written script, and, hence, speech technology solutions, such as automatic speech recognition (ASR), would greatly benefit such communities [6]. Another common linguistic phenomenon in multilingual societies is code-switching, typically between an Indian language and (Indian) English. Understanding code-switching patterns in different languages and developing accurate code-switching ASR remain a challenge due to the lack of large code-switched corpora.
In such resource-constrained settings, techniques that exploit unique properties of, and similarities among, the Indian languages could help build multilingual and code-switching ASR systems. Prior works have shown that multilingual ASR systems that pool data from multiple languages can exploit common acoustic properties across similar phonemes or graphemes [7, 6, 8, 9]. This is achieved by gathering a large amount of data from multiple low-resource languages. Multilingual ASR strategies are also effective in handling the code-switching phenomena in the speech of the source languages. However, the right choice of languages is important for good performance [10], as significant variations between the languages could degrade ASR performance under multilingual scenarios. In such cases, a dedicated monolingual ASR could perform better even with less speech data than a multilingual or code-switching ASR [11, 12, 9].
Considering the aforementioned factors, in this challenge we have selected six Indian languages, namely Hindi, Marathi, Odia, Telugu, Tamil and Gujarati, for multilingual ASR, and two code-switched language pairs, Hindi-English and Bengali-English, for the task of code-switching ASR. Unlike prior works on multilingual ASR, the selected languages 1) reflect the influence of three major language families – Indo-Aryan, Dravidian and Austro-Asiatic – which influence most Indian languages [4], 2) cover four demographic regions of India – East, West, South and North, and 3) ensure a continuum across languages. A multilingual ASR system built on these languages is thus expected to extend well to other low-resource languages [6]. Further, most multilingual ASR studies in the literature have considered languages other than Indian languages. Works that do consider Indian languages use data that is either not publicly available or limited in size [5, 13, 6, 11, 14]. This challenge significantly contributes in this context, as we provide a large corpus compared to the publicly available data for Indian languages. Publicly available corpora for code-switched speech are also limited, and this challenge introduces two freely available datasets consisting of Hindi-English and Bengali-English code-switched speech.
We provide approximately 600 hours of data in six Indian languages: Hindi, Marathi, Odia, Telugu, Tamil, and Gujarati. In addition, we provide 150 hours of transcribed code-switched speech in two language pairs, Hindi-English and Bengali-English. Speech recordings in different languages come from different domains. For example, the Odia data comes from the healthcare, agriculture and financial domains, while the Hindi-English and Bengali-English data are drawn from a repository of technical lectures on a diverse range of computer science topics. We release a baseline system that participants can compare their systems with and use as a starting point. For the evaluation of participating teams, we also release held-out blind test sets.
The challenge comprises two subtasks. Subtask1 involves building a multilingual ASR system in six languages: Hindi, Marathi, Odia, Telugu, Tamil, and Gujarati. The blind test set comprises recordings from a subset (or all) of these six languages. Subtask2 involves building a code-switching ASR system separately for the Hindi-English and Bengali-English code-switched pairs. The blind test set comprises recordings from these two code-switched language pairs. Baseline systems are developed using hybrid DNN-HMM models for both subtasks, as well as an end-to-end model for the code-switching subtask. For subtask1, the averaged WERs on the test and blind test sets are 30.73% and 32.73%, respectively. For subtask2, the averaged WERs on the test and blind test sets are 33.35% and 28.52% with the GMM-HMM system, 29.37% and 32.09% with the TDNN system, and 28.45% and 34.08% with the end-to-end system, respectively. The next section elaborates on the two subtasks, detailing dataset creation and the characteristics of the data specific to each subtask.
Table 1: Data details for the multilingual ASR subtask. Each language has train (Trn), test (Tst) and blind test (Blnd) columns.
| Hindi | | | Marathi | | | Odia | | | Telugu | | | Tamil | | | Gujarati | | |
| Trn | Tst | Blnd | Trn | Tst | Blnd | Trn | Tst | Blnd | Trn | Tst | Blnd | Trn | Tst | Blnd | Trn | Tst | Blnd |
Size (hrs) | 95.05 | 5.55 | 5.49 | 93.89 | 5 | 0.67 | 94.54 | 5.49 | 4.66 | 40 | 5 | 4.39 | 40 | 5 | 4.41 | 40 | 5 | 5.26
Channel compression | 3GP | 3GP | 3GP | 3GP | 3GP | M4A | M4A | M4A | M4A | PCM | PCM | PCM | PCM | PCM | PCM | PCM | PCM | PCM
Unique sentences | 4506 | 386 | 316 | 2543 | 200 | 120 | 820 | 65 | 124 | 34176 | 2997 | 2506 | 30329 | 3060 | 2584 | 20257 | 3069 | 3419
Speakers | 59 | 19 | 18 | 31 | 31 | - | - | - | - | 464 | 129 | 129 | 448 | 118 | 118 | 94 | 15 | 18
Vocab (words) | 6092 | 1681 | 1359 | 3245 | 547 | 350 | 1584 | 334 | 334 | 43270 | 10859 | 9602 | 50124 | 12279 | 10732 | 39428 | 10482 | 11424
2 Data Details of Two Subtasks
2.1 Multilingual ASR
Motivation. In India, though there are many low resource languages, these languages descend from a few language families and share common linguistic properties. Multilingual ASR systems could thus be useful to exploit these common properties in order to build effective ASR systems. Keeping this in mind, the data for this challenge is collected from six languages that are influenced by three major language families. We believe that the data from these languages provides enough variability and, at the same time, covers enough common properties to build ASR for low resource languages. However, building a single ASR system across the six languages is challenging due to variability in graphemes, phonemes and acoustics [15]. This challenge session provides an opportunity to address these challenges and build robust multilingual ASR systems. Furthermore, the data from the challenge is made public to the research community to provide opportunities for better multilingual ASR in the future.
2.1.1 Dataset Description
Table 1 shows the data details for the multilingual ASR subtask for each language. The Odia audio files are collected from four districts, representative of four different dialect regions – Sambalpur (North-Western Odia), Mayurbhanj (North-Eastern Odia), Puri (Central and Standard Odia) and Koraput (Southern Odia). Across all six languages, the percentage of out-of-vocabulary (OOV) words between train & test and between train & blind test lies in the ranges 17.2% to 32.8% and 8.4% to 31.1%, respectively. The grapheme set in the data follows the Indian language speech sound label set (ILSL12) standard [15]. The total numbers of graphemes are 69, 61, 68, 64, 50 and 65 for Hindi (Hin), Marathi (Mar), Odia (Oda), Telugu (Tel), Tamil (Tam) and Gujarati (Guj), respectively, of which 16, 16, 16, 17, 7 and 17, respectively, are diacritic marks.
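For reference, the OOV percentages reported above can be reproduced with a simple vocabulary comparison. The following is a minimal sketch, assuming whitespace-tokenized transcription files (the file names are placeholders) and that the OOV rate is the fraction of unique test words absent from the training vocabulary:

```python
# Sketch: out-of-vocabulary (OOV) rate between a train and a test transcription file.
# File names are placeholders; we assume one transcription per line, whitespace-tokenized.

def vocabulary(path):
    """Collect the set of unique word types in a transcription file."""
    words = set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            words.update(line.split())
    return words

train_vocab = vocabulary("train_text.txt")
test_vocab = vocabulary("test_text.txt")

# OOV rate: unique test words never seen in training.
oov = test_vocab - train_vocab
print(f"OOV rate: {100.0 * len(oov) / len(test_vocab):.1f}%")
```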
2.1.2 Characteristics of the dataset
The Hin, Mar and Oda data are collected from native speakers of the respective languages in a reading task. For the data collection, the speakers and the text are chosen to cover different language variations for better generalizability. The speakers of Hin and Mar belong to a high-literacy group, whereas the speakers of Oda belong to a semi-literate group. The text data for Hindi and Marathi is collected from storybooks, while the Odia text is collected from the agriculture, finance and healthcare domains. In addition to speaker, text and nativity variability, other variabilities also exist in the speech data, including phoneme mispronunciations and accent variations. In order to retain these variabilities during data validation, the Oda data went through a manual check. Since the Hin and Mar speech data come from a high-literacy group, we use a semi-automatic ASR-based validation pipeline that likewise retains these variabilities.
The automatic validation is done separately for Hindi and Marathi using the following two measures – 1) the WER obtained from an ASR system [16] and 2) likelihood scores obtained from the decoded ASR lattices. In both cases, the ASR is trained separately for Hin and Mar on approximately 500 hours of unvalidated (noisy) data from each language. WER-based data validation has been used in prior work [16]. We believe that the WER-based criterion discards audios containing insertion, deletion and/or substitution errors made while reading the stimuli. In addition, unlike prior works, we include a lattice-based likelihood criterion for discarding very noisy audios or audios with many incorrect pronunciations. For Hindi, we keep all audios whose lattice likelihood is above 3.49 and whose WER is equal to 0.0%. Similarly, for Marathi we keep audios whose WER is less than or equal to 12.5% and whose lattice likelihood is greater than 3.23. The thresholds are chosen to obtain approximately 100 hours of data for each of Hin and Mar while ensuring high likelihoods and low WERs in the selected audios. The selected data is then split into train (Trn) and test (Tst) sets without sentence overlap and with out-of-vocabulary (OOV) rates of about 30% between train and test sets. The Tel, Tam and Guj data are taken from the Interspeech 2018 low resource automatic speech recognition challenge for Indian languages, for which the data was provided by SpeechOcean.com and Microsoft [17]. The train and test sets are used as-is for this challenge; however, the blind test set is modified with speed perturbation chosen randomly between 1.1 and 1.4 (in increments of 0.05), and/or by adding one noise chosen randomly from white, babble and three noises from the MUSAN dataset [18], with the signal-to-noise ratio chosen randomly between 18 dB and 30 dB in steps of 1 dB. This modification is applied randomly to 29.0%, 23.8% and 34.1% of the Tel, Tam and Guj data, respectively.
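The WER/likelihood-based selection described above can be sketched as follows. The thresholds are those quoted in the text, but the per-utterance score format and the function names are illustrative assumptions, not the actual validation scripts:

```python
# Sketch of the semi-automatic validation filter for Hindi and Marathi.
# Assumes per-utterance WER (%) and average lattice likelihoods have already been
# produced by a seed ASR system; the utterance list below is purely illustrative.

def keep_utterance(lang, wer, lattice_likelihood):
    """Return True if the utterance passes the language-specific selection criteria."""
    if lang == "hindi":
        # Hindi: keep only perfectly read utterances with confident lattices.
        return wer == 0.0 and lattice_likelihood > 3.49
    if lang == "marathi":
        # Marathi: a relaxed WER threshold with its own likelihood cut-off.
        return wer <= 12.5 and lattice_likelihood > 3.23
    raise ValueError(f"No selection rule defined for language: {lang}")

# Hypothetical per-utterance scores: (utterance id, language, WER %, lattice likelihood).
utterances = [
    ("hin_0001", "hindi", 0.0, 3.61),
    ("hin_0002", "hindi", 4.2, 3.70),    # rejected: non-zero WER
    ("mar_0001", "marathi", 10.0, 3.30),
]
selected = [utt for utt in utterances if keep_utterance(utt[1], utt[2], utt[3])]
print([utt[0] for utt in selected])
```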
2.2 Code-switching ASR
Motivation. Code-switched speech in Indian languages, in the form of publicly available corpora, is a rare resource. This subtask is a first step towards addressing this gap for research on code-switched speech. The code-switched speech is drawn from spoken tutorials on various topics in computer science. These spoken tutorials are available for a wide range of topics and in multiple Indian languages, and are accompanied by sentence-level transcriptions and corresponding timestamps. However, this data comes with a number of challenges, including label noise, which we describe in more detail in Section 2.2.2. Understanding these challenges will be important in scaling up solutions to more real-life data for a larger number of Indian languages.
Table 2: Data details for the code-switching ASR subtask. Each language pair has train (Trn), test (Tst) and blind test (Blnd) columns.
| Hindi-English | | | Bengali-English | | |
| Trn | Tst | Blnd | Trn | Tst | Blnd |
Size (hrs) | 89.86 | 5.18 | 6.24 | 46.11 | 7.02 | 5.53
Unique sentences | 44249 | 2890 | 3831 | 22386 | 3968 | 2936
Speakers | 520 | 30 | 35 | 267 | 40 | 32
Vocab (words) | 17830 | 3212 | 3527 | 13645 | 4500 | 3742
2.2.1 Dataset Description
The Hindi-English and Bengali-English datasets are extracted from spoken tutorials. These tutorials cover a range of technical topics, and the code-switching predominantly arises from the technical content of the lectures. The segments file in the baseline recipe provides sentence time-stamps; these were used to derive segments from each audio file, aligned with the transcripts given in the text file. Table 2 shows the details of the data used for the code-switching ASR subtask. All audio files in both datasets are sampled at 16 kHz with 16-bit encoding. The test-train sentence overlap for the Hindi-English and Bengali-English subtasks is 33.9% and 10.8%, whereas the blind test-train overlap is 2.1% and 2.9%, respectively. Speaker information for these datasets was not available; however, we do have information about the underlying tutorial from which each sentence is derived. We assume that each tutorial comes from a different speaker, and these are the speaker numbers reported in Table 2. The percentage of OOV words in the test and blind test sets is 12.5% & 19.6% for the Hindi-English subtask and 22.9% & 27.3% for the Bengali-English subtask, respectively.
2.2.2 Characteristics and Artefacts in the Dataset
As mentioned earlier, the code-switched speech is drawn from tutorials on various topics in computer science, with transcriptions including mathematical symbols and other technical content. We note here that these tutorials were not created specifically for ASR, but for end-user consumption as videos of tutorials in various Indian languages; specifically in our case, the transcriptions were scripts for video narrators. There are various sources of noise in the transcriptions that are outlined in detail below:
Misalignments: Each spoken tutorial came with transcriptions in the form of subtitles with corresponding timestamps. These timestamps were aligned more with how the tutorial videos proceeded than with the underlying speech signal. This led to misalignments between the transcriptions and the segment start and end times specified for each transcription. While this remains an issue for the training and development segments, the transcriptions corresponding to the blind test segments were manually edited to exactly match the underlying speech.
Inconsistent script usage: There were multiple instances of the same English word appearing both in the Latin script and in the native scripts of Hindi and Bengali in the training data. Given this inconsistency in script usage for English words, the ASR predictions of English words could either be in the native script or in the Latin script. To allow for both English words and their transliterations in the respective native scripts to be counted as correct during the final word error rate computations, we introduce a transliterated WER (T-WER) metric along with the standard WER metric. While the standard WER only counts a word as correct if it is an exact match with the word in the reference text, T-WER will count an English word in the reference text as correctly predicted if it appears either in English or in its transliterated form in the native script. We compute T-WER rates for the blind test audio files. To support this computation, the blind test reference text was manually annotated such that every English word appears only in the Latin script. Following this, every English word in the reference transcriptions was transliterated using Google’s transliteration API and further manually edited to remove valid Hindi words and fix any transliteration errors. This yielded a list of English-to-native-script mappings, and this mapping file is used in the final T-WER computation to map English words to their transliterated forms.
Punctuations: Since our data is sourced from spoken tutorials on coding and computer science topics, many punctuation symbols (e.g., semicolons) are enunciated in the speech; this is quite unique to this dataset. There were also many instances of enunciated punctuations occurring in the text not as symbols but as words (e.g., ’slash’ instead of ’/’), as well as many non-enunciated punctuations in the text (exclamation marks, commas, etc.). We manually edited the blind test transcriptions to remove non-enunciated punctuations and to normalize punctuations written as words into their respective symbols (e.g., + instead of plus); a sketch of this normalization is given after this list of artefacts.
Mixed words: In the Bengali-English transcriptions, we saw multiple occurrences of mixed Bengali-English words (which was unique to Bengali-English and did not show up in Hindi-English). To avoid any ambiguities with evaluating such words, we transliterated these mixed words entirely into Bengali.
Incomplete audio: Since the lecture audios are spliced into individual segments, a few segments have incomplete audio either at the start or at the end of the utterance. For the blind test audio files, which went through careful manual annotation, such partially spoken words were transcribed only if they were mostly clear to the annotator; otherwise they were omitted from the transcription.
Merged English words: Since the code-switching arises mostly due to technical content in the audio, there are instances of English words being merged together in the transcriptions without any word boundary markers. For example, words denoting websites and function calls generally have two or more words used together without spaces. For the blind test transcriptions, we segregated such merged English words into their constituents in order to avoid ambiguous occurrences of both merged and individual words arising together.
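To make the punctuation handling described above concrete, here is a minimal sketch of the kind of normalization applied to the blind test references. The word-to-symbol mapping and the set of non-enunciated symbols shown are illustrative examples only, not the exhaustive lists used for the challenge:

```python
import re

# Illustrative mappings only: spoken punctuation words are normalized to their symbols,
# and punctuation that is never enunciated is stripped from the reference text.
SPOKEN_TO_SYMBOL = {"plus": "+", "slash": "/", "semicolon": ";", "equals": "="}
NON_ENUNCIATED = "!,?"

def normalize_reference(text):
    # Replace spoken punctuation words with their symbols.
    tokens = [SPOKEN_TO_SYMBOL.get(tok.lower(), tok) for tok in text.split()]
    text = " ".join(tokens)
    # Drop punctuation marks that are not actually spoken in the audio.
    return re.sub(f"[{re.escape(NON_ENUNCIATED)}]", "", text)

print(normalize_reference("print a plus b ;"))          # -> "print a + b ;"
print(normalize_reference("Welcome to the tutorial!"))  # -> "Welcome to the tutorial"
```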
3 Experiments and Results
3.1 Experimental setup
3.1.1 Multilingual ASR
Hybrid DNN-HMM: The ASR model is built using the Kaldi toolkit with a sequence-trained time-delay neural network (TDNN) architecture optimized using the lattice-free MMI objective function [19]. We use an architecture comprising 6 TDNN blocks, each with a layer dimension of 512.
Lexicon: A single lexicon is used, containing the combined vocabulary of all six languages. For each language, the lexicon entries are obtained automatically using a rule-based system that maps graphemes to phonemes. For the mapping, we follow the Indian language speech sound label set (ILSL12) [15].
Language model (LM): A single LM is built from the text transcriptions of the train sets of all six languages. We use a 3-gram language model built with the IRSTLM toolkit within Kaldi. Since the LM has paths that contain words from multiple languages, the decoded output could contain code-mixing across the six languages.
In the experiments, the word error rate (WER) is used as the evaluation measure for the multilingual ASR. For comparison, we also build monolingual ASR systems using language-specific training data, lexicons and LMs built from the language-specific train text transcriptions. For the evaluation on the blind test set, we report two measures to account separately for channel-matched and channel-mismatched conditions between the train/test sets and the blind test set: the WER averaged across all six languages, and the WER averaged across all languages except Marathi, whose blind test set differs in its channel encoding scheme.
3.1.2 Code-switching ASR
Hybrid DNN-HMM: The ASR model is built using the Kaldi toolkit, with the same baseline architecture used for both the Hindi-English and Bengali-English subtasks. We use MFCC acoustic features to build speaker-adapted GMM-HMM models and, similar to subtask1, we also build hybrid DNN-HMM ASR systems using TDNNs comprising 8 TDNN blocks with a layer dimension of 768.
End-to-end ASR: The hybrid CTC-attention model based on Transformer [20] uses a CTC weight of and an attention weight of . A -layer encoder network, and a -layer decoder network is used, each with units, with a dropout rate. Each layer contains eight -dimensional attention heads which are concatenated to form a -dimensional attention vector. Models were trained for a maximum of epochs with an early-stopping patience of using the Noam optimizer from [20] with a learning rate of and warmup steps. Label smoothing and preprocessing using spectral augmentation is also used. The top 5 models with the best validation accuracy are averaged and this averaged checkpoint is used for decoding. Decoding is performed with a beam size of and a CTC weight of .
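The checkpoint-averaging step mentioned above can be sketched as follows, assuming PyTorch-style state dictionaries saved to disk; the checkpoint paths are hypothetical, and toolkits such as ESPnet ship their own averaging utilities:

```python
import torch

def average_checkpoints(paths):
    """Average model parameters across several checkpoints (e.g., the 5 best by validation accuracy)."""
    avg = None
    for path in paths:
        state = torch.load(path, map_location="cpu")
        if avg is None:
            avg = {k: v.clone().float() for k, v in state.items()}
        else:
            for k in avg:
                avg[k] += state[k].float()
    # Element-wise mean of all parameters.
    return {k: v / len(paths) for k, v in avg.items()}

# Hypothetical checkpoint files for the five best validation epochs.
best = [f"exp/transformer/snapshot.ep.{i}" for i in (27, 31, 34, 38, 40)]
torch.save(average_checkpoints(best), "exp/transformer/model.avg.pt")
```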
Lexicon: Two different lexicon files are used, one each for the Hindi-English and Bengali-English subtasks. First, the set of all words in the training set, i.e., the training vocabulary, is generated. If a word is a Devanagari/Bengali script word, the corresponding pronunciation is simply the word split into characters, since both languages have largely phonetic orthographies. For English lexicon entries, an open source g2p package (https://github.com/Kyubyong/g2p) is used; it spells out numbers, looks up the CMUDict dictionary [21], and predicts pronunciations for OOVs as well. We also map punctuation symbols to their corresponding English words, since these punctuations are enunciated in the audio.
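A minimal sketch of this lexicon generation, using the g2p_en package from the repository linked above for English entries; the script-detection heuristic and the example words are illustrative assumptions:

```python
# Sketch of lexicon generation for the code-switching subtasks.
# Native-script (Devanagari/Bengali) words are mapped to their character sequence,
# since the orthographies are largely phonetic; English words go through g2p_en
# (https://github.com/Kyubyong/g2p). Example words are illustrative.
import re
from g2p_en import G2p

g2p = G2p()
DEVANAGARI = re.compile(r"[\u0900-\u097F]")
BENGALI = re.compile(r"[\u0980-\u09FF]")

def pronunciation(word):
    if DEVANAGARI.search(word) or BENGALI.search(word):
        # Native-script word: pronunciation is simply its sequence of characters.
        return " ".join(word)
    # English word (or spelled-out number/punctuation): predict ARPAbet phones with g2p_en.
    return " ".join(p for p in g2p(word) if p.strip())

for w in ["फाइल", "software", "variable"]:
    print(w, pronunciation(w))
```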
Language model: Two separate language models are built, one for each code-switching subtask. For LM training, we use a trigram language model with Kneser-Ney discounting, built using the SRILM toolkit [22] within the Kaldi recipe.
In the experiments for code-switched speech, along with the standard WER measure, we also provide T-WER values (where T-WER was defined earlier in Section 2.2.2).
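To make the T-WER definition from Section 2.2.2 concrete, the following is a minimal sketch: a standard word-level Levenshtein alignment in which a reference English word also matches its native-script transliteration. The mapping shown is a toy stand-in for the manually curated mapping file:

```python
# Sketch of the transliterated WER (T-WER) idea. The mapping below is illustrative only.
translit = {"software": "सॉफ्टवेयर", "file": "फाइल"}

def words_match(ref_word, hyp_word):
    if ref_word == hyp_word:
        return True
    # Reference English words are also accepted in their native-script transliterated form.
    return translit.get(ref_word) == hyp_word

def t_wer(ref, hyp):
    """Word-level edit distance where substitutions are free when words_match() holds."""
    n, m = len(ref), len(hyp)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if words_match(ref[i - 1], hyp[j - 1]) else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return 100.0 * d[n][m] / max(n, 1)

ref = "यह software एक file खोलता है".split()
hyp = "यह सॉफ्टवेयर एक file खोलता है".split()
print(f"T-WER: {t_wer(ref, hyp):.1f}%")  # the transliterated 'software' counts as correct
```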
3.2 Baseline results
3.2.1 Multilingual ASR
Comparing multilingual and monolingual ASR: Table 3 shows the WERs obtained on the test and blind test sets for each of the six languages, along with the WER averaged across all six languages. The WER obtained with multilingual ASR is lower than the monolingual WER for Tamil. Although the WER of the multilingual ASR system is higher for the remaining languages, the multilingual system does not require any explicit language identification (LID) front-end; in practice, the performance of monolingual ASR therefore also depends on the effectiveness of LID in the Indian context. Further, it is known that multilingual ASR can yield a better acoustic model by exploiting common properties among multiple languages. However, the performance of the multilingual ASR also depends on the quality of the language model, which, in this work, could introduce noise due to code-mixing of words across languages.
Table 3: WER (%) of the multilingual (Multi) and monolingual (Mono) ASR baselines on the test (Tst) and blind test (Blnd) sets.
| | Hindi | Marathi | Odia | Tamil | Telugu | Gujarati | Avg
Multi | Tst | 40.41 | 22.44 | 39.06 | 33.35 | 30.62 | 19.27 | 30.73
Multi | Blnd | 37.20 | 29.04 | 38.46 | 34.09 | 31.44 | 26.15 | 32.73
Mono | Tst | 31.39 | 18.61 | 35.36 | 34.78 | 28.71 | 18.23 | 27.85
Mono | Blnd | 27.45 | 20.41 | 31.28 | 35.82 | 29.35 | 25.98 | 28.38
3.2.2 Code-switching ASR
Table 4: WER (%) of the code-switching ASR baselines on the test (Tst) and blind test (Blnd) sets, with the original unaligned (UnA) and realigned (ReA) training data.
| GMM-HMM (Kaldi) | | TDNN (Kaldi) | | Transformer (end-to-end) |
| Tst | Blnd | Tst | Blnd | Tst | Blnd
Hin-Eng (UnA) | 44.30 | 25.53 | 36.94 | 28.90 | 27.7 | 33.65
Ben-Eng (UnA) | 39.19 | 32.81 | 34.31 | 35.52 | 37.2 | 43.94
Avg (UnA) | 41.75 | 29.17 | 35.63 | 32.21 | 32.45 | 38.80
Hin-Eng (ReA) | 31.56 | 24.66 | 28.40 | 29.03 | 25.9 | 31.19
Ben-Eng (ReA) | 35.14 | 32.39 | 30.34 | 35.15 | 31.0 | 36.97
Avg (ReA) | 33.35 | 28.52 | 29.37 | 32.09 | 28.45 | 34.08
Table 5: WER and T-WER (%) on the blind test sets with realigned (ReA) training data.
| GMM-HMM (Kaldi) | | TDNN (Kaldi) | | Transformer (end-to-end) |
| WER | T-WER | WER | T-WER | WER | T-WER
Hin-Eng | 24.66 | 22.72 | 29.03 | 26.20 | 31.19 | 29.80
Ben-Eng | 32.39 | 31.42 | 35.15 | 33.39 | 36.97 | 36.00
Avg | 28.52 | 27.07 | 32.09 | 29.79 | 34.08 | 32.90
Table 4 shows the WERs (along with the averaged WERs) for both the Hindi-English and Bengali-English datasets. As mentioned in Section 2.2.2, there are misalignments in some of the training audio files between the transcriptions and the timestamps. We present results using the original alignments that we obtained with the transcriptions (referred to as unaligned, UnA). In an attempt to fix the misalignment issues, we also force-align the training files at the level of the entire tutorial with its complete transcription and recompute the segment timestamps. We retrain our systems using these realigned training files. These numbers are labeled with (ReA) for realigned.
As expected, we observe that the averaged ReA WERs are consistently better than the UnA WERs. While the TDNN and end-to-end systems give better WERs than the GMM-HMM system on the test set, the speaker-adapted triphone GMM-HMM model performs the best among the three ASR systems on the blind test set.
Table 5 shows the corresponding WERs and T-WERs for the realigned (ReA) blind test sets. T-WER, being a more relaxed evaluation metric, is always lower than WER. The Hindi-English code-switched data yields better WERs than Bengali-English, likely due to the smaller number of OOVs and the larger amount of training data. The blind test values improve further when the transliterated (T-WER) scores are computed, as discussed in Section 2.2.2.
4 Conclusion
In this paper, we presented the dataset details, baseline recipes and results for the multilingual and code-switching ASR challenges for low resource Indian languages, organized as a special session at Interspeech 2021. The challenge involves two subtasks dealing with 1) multilingual ASR and 2) code-switching ASR. Through this challenge, participants have the opportunity to address two important problems specific to multilingual societies, particularly in the Indian context – data scarcity and the code-switching phenomenon. We also provide a total of 600 hours of transcribed speech data, which is a reasonably large corpus for six different Indian languages, especially when compared to the existing publicly available datasets for Indian languages. Baseline ASR systems have been developed using hybrid DNN-HMM and end-to-end models. Furthermore, carefully curated held-out blind test sets are released to evaluate the participating teams’ performance.
References
- [1] “Read on to know more about Indian languages,” URL: https://mhrd.gov.in/sites/upload_files/mhrd/files/upload_document/languagebr.pdf, last accessed on 20-06-2019, 2001.
- [2] J. Heitzman and R. L. Worden, India: A country study. Federal Research Division, 1995.
- [3] “Office of the Registrar General & Census Commissioner India, Part A: Family-wise grouping of the 122 scheduled and non-scheduled languages–2001,” URL: http://censusindia.gov.in/Census_Data_2001/Census_Data_Online/Language/statement9.aspx, 2001.
- [4] “Office of the Registrar General & Census Commissioner India, Part A: Distribution of the 22 scheduled languages - India, States & Union Territories - 2001 Census,” URL: http://www.censusindia.gov.in/Census_Data_2001/Census_Data_Online/Language/parta.htm, last accessed on 20-06-2019, 2001.
- [5] H. B. Sailor and T. Hain, “Multilingual speech recognition using language-specific phoneme recognition as auxiliary task for Indian languages,” Proc. Interspeech 2020, pp. 4756–4760, 2020.
- [6] A. Datta, B. Ramabhadran, J. Emond, A. Kannan, and B. Roark, “Language-agnostic multilingual modeling,” in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 8239–8243.
- [7] Y.-C. Chen, J.-Y. Hsu, C.-K. Lee, and H.-y. Lee, “DARTS-ASR: Differentiable architecture search for multilingual speech recognition and adaptation,” arXiv preprint arXiv:2005.07029, 2020.
- [8] S. Tong, P. N. Garner, and H. Bourlard, “An investigation of multilingual ASR using end-to-end LF-MMI,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 6061–6065.
- [9] H. Lin, L. Deng, D. Yu, Y.-f. Gong, A. Acero, and C.-H. Lee, “A study on multilingual acoustic modeling for large vocabulary ASR,” in 2009 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2009, pp. 4333–4336.
- [10] J. Cui, B. Kingsbury, B. Ramabhadran, A. Sethy, K. Audhkhasi, X. Cui, E. Kislal, L. Mangu, M. Nussbaum-Thom, M. Picheny et al., “Multilingual representations for low resource speech recognition and keyword search,” in 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU). IEEE, 2015, pp. 259–266.
- [11] V. Pratap, A. Sriram, P. Tomasello, A. Hannun, V. Liptchinsky, G. Synnaeve, and R. Collobert, “Massively multilingual ASR: 50 languages, 1 model, 1 billion parameters,” arXiv preprint arXiv:2007.03001, 2020.
- [12] M. Müller, S. Stüker, and A. Waibel, “Multilingual adaptation of RNN based ASR systems,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5219–5223.
- [13] K. Manjunath, K. S. Rao, D. B. Jayagopi, and V. Ramasubramanian, “Indian languages ASR: A multilingual phone recognition framework with IPA based common phone-set, predicted articulatory features and feature fusion.” in INTERSPEECH, 2018, pp. 1016–1020.
- [14] C. Liu, Q. Zhang, X. Zhang, K. Singh, Y. Saraf, and G. Zweig, “Multilingual graphemic hybrid ASR with massive data augmentation,” arXiv preprint arXiv:1909.06522, 2019.
- [15] I. L. T. Consortium, A. Consortium et al., “Indian language speech sound label set (ILSL12),” 2016.
- [16] R. Gretter, “Euronews: a multilingual benchmark for ASR and LID,” in Fifteenth Annual Conference of the International Speech Communication Association, 2014.
- [17] B. M. L. Srivastava, S. Sitaram, R. K. Mehta, K. D. Mohan, P. Matani, S. Satpal, K. Bali, R. Srikanth, and N. Nayak, “Interspeech 2018 low resource automatic speech recognition challenge for Indian languages.” in SLTU, 2018, pp. 11–14.
- [18] D. Snyder, G. Chen, and D. Povey, “Musan: A music, speech, and noise corpus,” arXiv preprint arXiv:1510.08484, 2015.
- [19] D. Povey, V. Peddinti, D. Galvez, P. Ghahremani, V. Manohar, X. Na, Y. Wang, and S. Khudanpur, “Purely sequence-trained neural networks for ASR based on lattice-free MMI.” in Interspeech, 2016, pp. 2751–2755.
- [20] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” arXiv preprint arXiv:1706.03762, 2017.
- [21] R. L. Weide, “The CMU pronouncing dictionary,” URL: http://www.speech.cs.cmu.edu/cgi-bin/cmudict, 1998.
- [22] A. Stolcke, “SRILM-an extensible language modeling toolkit,” in Seventh international conference on spoken language processing, 2002.