Streaming end-to-end bilingual ASR systems with joint language identification
Abstract
Multilingual ASR technology simplifies model training and deployment, but its accuracy is known to depend on the availability of language information at runtime. Since language identity is seldom known beforehand in real-world scenarios, it must be inferred on-the-fly with minimum latency. Furthermore, in voice-activated smart assistant systems, language identity is also required for downstream processing of ASR output. In this paper, we introduce streaming, end-to-end, bilingual systems that perform both ASR and language identification (LID) using the recurrent neural network transducer (RNN-T) architecture. On the input side, embeddings from pretrained acoustic-only LID classifiers are used to guide RNN-T training and inference, while on the output side, language targets are jointly modeled with ASR targets. The proposed method is applied to two language pairs: English-Spanish as spoken in the United States, and English-Hindi as spoken in India. Experiments show that for English-Spanish, the bilingual joint ASR-LID architecture matches monolingual ASR and acoustic-only LID accuracies. For the more challenging (owing to within-utterance code switching) case of English-Hindi, English ASR and LID metrics show degradation. Overall, in scenarios where users switch dynamically between languages, the proposed architecture offers a promising simplification over running multiple monolingual ASR models and an LID classifier in parallel.
Index Terms: multilingual, streaming speech recognition, language identification, end-to-end, RNN-T
1 Introduction
Multilingual automatic speech recognition (ASR) is an active area of research for two reasons: (1) it improves performance across languages, particularly the under-resourced ones, by allowing data sharing and cross-lingual knowledge transfer [1, 2, 3, 4, 5, 6]; and (2) it allows the same model to be used for two or more languages [7], thereby creating seamless multilingual experiences via simplified model training, maintenance, and deployment. In hybrid ASR systems with separate acoustic model (AM) and language model (LM) components, shared hidden layer training [8, 9, 10] and transfer learning [11, 12, 13] have been proposed to address the data scarcity problem for under-resourced languages. However, achieving the model unification objective is more challenging in hybrid systems given the AM-LM separation. With recent end-to-end ASR advancements such as listen, attend and spell (LAS) [14] and recurrent neural network transducers (RNN-T) [15], unified multilingual ASR systems are now possible. Among these popular frameworks, RNN-T is favored in online applications owing to its support for streaming ASR [16]. We mainly focus on the objective of delivering a seamless experience to multilingual users without requiring them to specify the language before each interaction.
A commonly adopted architecture for creating a live multilingual experience is to run parallel ASRs behind the scenes while a language detector identifies the spoken language [17, 18]. Multilingual ASR systems can simplify such architectures by eliminating the need to run multiple monolingual ASRs, but previous research on multilingual end-to-end models suggests that language information (assumed to be known beforehand in most studies) plays a crucial role in achieving acceptable levels of performance [19, 20, 21, 22, 23]. While language information can be provided at runtime by preselecting the language, such an experience could be counter-intuitive because multilingual users often prefer to seamlessly switch between languages.
Streaming language detectors have been proposed as a solution for guiding multilingual ASR on-the-fly [24]. Also, recent studies [17, 18] demonstrate that language identification (LID) can be improved by combining conventional acoustic representations with textual cues from the ASR decoder. To leverage these ideas, along with the fact that RNN-T models the joint evolution of acoustic and lexical information, we propose a multilingual RNN-T architecture that consumes embeddings from an acoustic LID classifier and predicts both the spoken content and the spoken language. Accurate LID can improve ASR performance via language-specific second-pass rescoring and also enable appropriate downstream processing by language-specific components like natural language understanding. Furthermore, multilingual joint ASR-LID models can simplify dynamic language switching by replacing several monolingual ASRs and a language detector with just one system. The key contributions of this paper are summarized below.
Joint ASR-LID: We append ground-truth transcripts with language tags and include them in the model’s vocabulary. This approach has been explored for LAS [25], but, to the best of our knowledge, has not been studied in the context of RNN-T.
Acoustic LID embeddings: Frame-level embeddings from an acoustic LID classifier (trained beforehand) are used to guide RNN-T training and inference. This is similar in principle to the idea of deriving language information from an auxiliary RNN-T [24], but it has certain advantages (see Section 2).
Biasing joint network: Previous studies provide language information to RNN-T’s encoder or decoder [21]. In this work, we study the effect of influencing RNN-T’s joint network.
Languages: Previous studies mostly focus on groups of related languages, e.g. Indic, Arabic and Nordic [21, 24]. Motivated by the real-world usage patterns in bilingual markets, this work explores two pairs of unrelated languages: English-Hindi and English-Spanish.
In the rest of this paper, multilingual and bilingual will be used interchangeably—the latter will mostly be used when referring to experimental details or results.
2 Methods
A typical RNN-T ASR architecture comprises a transcription network (encoder), a prediction network (decoder) and a joint network. The encoder is an RNN that converts an input acoustic feature vector, $x_t$, to a hidden representation, $h_t^{enc}$, and the decoder is an RNN that receives the last non-blank label observed, $y_{u-1}$, and outputs a hidden representation, $h_u^{dec}$:

$h_t^{enc} = f^{enc}(x_t), \qquad h_u^{dec} = f^{dec}(y_{u-1})$    (1)

The joint network is a feedforward network that combines the encoder and decoder outputs to produce logits, $z_{t,u}$:

$z_{t,u} = f^{joint}(h_t^{enc}, h_u^{dec})$    (2)

which in turn are passed through a softmax layer to produce a probability distribution for the next output symbol, $y_u$ (either blank or one of the ASR targets):

$P(y_u \mid x_{1:t}, y_{1:u-1}) = \mathrm{softmax}(z_{t,u})$    (3)
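For illustration, the following minimal (non-streaming) PyTorch sketch mirrors Eqs. (1)–(3); the layer sizes loosely follow Table 2, but the class, the argument names and the full (t, u) grid computation are illustrative assumptions rather than our actual implementation.

```python
import torch
import torch.nn as nn

class RNNTSketch(nn.Module):
    """Minimal RNN-T skeleton following Eqs. (1)-(3); sizes are illustrative."""
    def __init__(self, feat_dim=192, vocab_size=4001, hidden=1024, joint_dim=512):
        super().__init__()
        # Transcription network (encoder): acoustic features x_t -> h_t^enc
        self.encoder = nn.LSTM(feat_dim, hidden, num_layers=5, batch_first=True)
        # Prediction network (decoder): last non-blank label y_{u-1} -> h_u^dec
        self.embed = nn.Embedding(vocab_size, 512)
        self.decoder = nn.LSTM(512, hidden, num_layers=2, batch_first=True)
        # Joint network: combines encoder and decoder outputs into logits z_{t,u}
        self.joint_hidden = nn.Linear(2 * hidden, joint_dim)
        self.output = nn.Linear(joint_dim, vocab_size)  # ASR targets + blank

    def forward(self, feats, labels):
        # feats: (B, T, feat_dim) stacked acoustic frames; labels: (B, U) label history
        h_enc, _ = self.encoder(feats)               # Eq. (1), encoder
        h_dec, _ = self.decoder(self.embed(labels))  # Eq. (1), decoder
        # Combine over the full (T, U) grid
        h_enc, h_dec = torch.broadcast_tensors(h_enc.unsqueeze(2), h_dec.unsqueeze(1))
        z = self.output(torch.tanh(self.joint_hidden(torch.cat([h_enc, h_dec], -1))))  # Eq. (2)
        return z.log_softmax(dim=-1)                 # Eq. (3): distribution per (t, u)
```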
A naive approach to train multilingual ASR using RNN-T is to simply pool data from the languages of interest and define the output symbol space as the union of the individual symbol sets. As the literature has shown, better results could be achieved by supplementing $x_t$ with auxiliary language information, $l_t$, to obtain an improved encoder representation, $\tilde{h}_t^{enc}$:

$\tilde{h}_t^{enc} = f^{enc}(x_t, l_t)$    (4)
In this paper, we also study the effect of supplying language information to the joint network. This enables us to determine if language information is more helpful at the input where lower-level features are extracted, or deeper inside the network where higher-level acoustic and lexical information are combined. Depending on whether $l_t$ is provided to the encoder, joint network or both, the model’s logits can be computed using Eq. (5), (6) or (7), respectively. Further implementation details are presented in Sections 2.2 and 2.3.

$z_{t,u} = f^{joint}(\tilde{h}_t^{enc}, h_u^{dec})$    (5)

$z_{t,u} = f^{joint}(h_t^{enc}, h_u^{dec}, l_t)$    (6)

$z_{t,u} = f^{joint}(\tilde{h}_t^{enc}, h_u^{dec}, l_t)$    (7)
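To make the three conditioning options concrete, the sketch below extends the skeleton above; the boolean flags and the dimension handling are our illustrative choices, and the affected input layers would need to be widened by the size of $l_t$.

```python
import torch

def language_conditioned_logits(model, feats, labels, lang_vec,
                                to_encoder=True, to_joint=False):
    """Sketch of Eqs. (5)-(7). `lang_vec` is l_t with shape (B, T, L): a one-hot code
    tiled over frames (oracle, Sec. 2.2) or frame-wise LID embeddings (inferred, Sec. 2.3).
    NOTE: the encoder / joint_hidden input sizes must account for the extra L dims."""
    enc_in = torch.cat([feats, lang_vec], dim=-1) if to_encoder else feats   # Eqs. (5), (7)
    h_enc, _ = model.encoder(enc_in)
    h_dec, _ = model.decoder(model.embed(labels))
    h_enc, h_dec = torch.broadcast_tensors(h_enc.unsqueeze(2), h_dec.unsqueeze(1))
    joint_in = torch.cat([h_enc, h_dec], dim=-1)
    if to_joint:                                                             # Eqs. (6), (7)
        joint_in = torch.cat([joint_in,
                              lang_vec.unsqueeze(2).expand(-1, -1, h_dec.size(2), -1)], dim=-1)
    return model.output(torch.tanh(model.joint_hidden(joint_in)))            # logits z_{t,u}
```

The encoder-only, joint-only and combined variants correspond to the E, J and B suffixes used for systems A2, A3 and A5 in the remainder of the paper.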
To train a multilingual joint ASR-LID model we extend the RNN-T’s output symbol space to include language targets. Further details related to training and inference are presented in Section 2.4. The proposed multilingual joint ASR-LID architecture is schematically summarized in Fig. 1.

[Figure 1: Schematic of the proposed multilingual joint ASR-LID architecture.]
2.1 Baselines
ASR performance is evaluated against two baselines: monolingual RNN-T models (A0) and multilingual RNN-T models trained without utilizing language adaptation methods, i.e. by simply pooling training data from the languages of interest (A1). LID performance is evaluated against an acoustic-only baseline, i.e. a recurrent architecture trained to classify the languages of interest using acoustic features only.
2.2 Multilingual ASR using oracle language identity
The simple pooling approach (A1) is known to be suboptimal, and several studies have shown that language-adaptive training is essential to improving the recognition accuracy of multilingual ASR [26, 27, 21]. A popular way of achieving this is to encode the language identity as a constant (utterance-level) one-hot vector. Since language identity is unknown at runtime in dynamic multilingual settings, we treat one-hot language encoding as an oracle approach. As mentioned earlier (Eqs. (5)–(7)), we evaluate the effect of supplying one-hot language vectors to different parts of the RNN-T: encoder input (A2E), joint network (A2J), or both (A2B).
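As a small sketch of the oracle setup (the function and the language ordering are hypothetical), the utterance-level one-hot code is simply repeated at every frame:

```python
import torch

def oracle_language_vector(lang_id: int, num_langs: int, num_frames: int) -> torch.Tensor:
    """Constant (utterance-level) one-hot language code, tiled to every frame."""
    one_hot = torch.zeros(num_langs)
    one_hot[lang_id] = 1.0
    return one_hot.expand(num_frames, num_langs)  # (T, num_langs), identical at each frame

# e.g., hi-in in an {en-in, hi-in} system, for a 250-frame utterance:
l_t = oracle_language_vector(lang_id=1, num_langs=2, num_frames=250)
```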
2.3 Multilingual ASR using inferred language identity
To achieve realistic language-adaptive training, the one-hot vectors in A2 must be replaced with inferred language information. In [24], an auxiliary RNN-T model is used to derive language information on-the-fly (a sequence of blanks until the language tag is emitted, followed by another sequence of blanks); while this approach is effective, an auxiliary RNN-T must be trained for the same language group that the ASR system is designed to recognize. In this paper, we propose to use frame-wise embeddings from an acoustic LID classifier that is trained beforehand. If the LID classifier is trained on a large group of languages, one could potentially use the same model to generate embeddings regardless of the ASR’s target language group. Another advantage of using acoustic embeddings is that they can be easily consumed by the joint network to be combined with the encoder and decoder representations. Note, however, that because the language embeddings are updated frame by frame, the guidance the ASR network receives early in an utterance may not be very valuable: there may not be enough evidence to accurately detect the language after observing just a few frames of audio. Similar to the oracle experiments, we evaluate three different ways of supplying inferred language information to the RNN-T (A3E, A3J and A3B).
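A sketch of the inferred-embedding path, assuming an acoustic LID classifier shaped like the one in Table 2 (3 LSTM layers, a 32-dimensional projection, softmax); the class and layer names below are ours, not our actual implementation:

```python
import torch
import torch.nn as nn

class AcousticLIDSketch(nn.Module):
    """Streaming acoustic LID classifier; its projection output serves as l_t."""
    def __init__(self, feat_dim=192, hidden=256, emb_dim=32, num_langs=2):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers=3, batch_first=True)
        self.projection = nn.Linear(hidden, emb_dim)
        self.classifier = nn.Linear(emb_dim, num_langs)

    def forward(self, feats):
        h, _ = self.lstm(feats)                   # (B, T, hidden), runs frame by frame
        emb = self.projection(h)                  # (B, T, 32): frame-wise LID embeddings
        post = self.classifier(emb).softmax(-1)   # (B, T, num_langs): frame-wise posteriors
        return emb, post

# Train beforehand on utterance-level language labels, then freeze and feed `emb`
# (or, later, `post`) to the RNN-T as l_t at every frame.
lid = AcousticLIDSketch().eval()
with torch.no_grad():
    l_t, lid_posteriors = lid(torch.randn(1, 250, 192))
```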
2.4 Multilingual joint ASR-LID modeling
To train multilingual joint ASR-LID models, we extend the RNN-T’s output symbol space to include language targets—meaning that the model can output either blank, one of the language tags, or one of the ASR tokens. Training targets are generated by simply appending language tags to the ground-truth transcripts. The intuition behind placing language tags as the utterance-final tokens is that the network’s belief about the underlying language improves with the incoming acoustic feature frames and the partially predicted text. Akin to the end-of-sentence (<eos>) symbol, which indicates the end of an utterance and can optionally be used for endpointing, appending language tags to utterance ends encourages the network to make language predictions after sufficient evidence has been observed.
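Target preparation for the joint models amounts to one extra utterance-final token; a minimal sketch, where the tag strings and the tokenizer interface are hypothetical placeholders:

```python
LANG_TAGS = {"en-in": "<lang:en-in>", "hi-in": "<lang:hi-in>"}  # added to the subword vocabulary

def make_joint_targets(transcript, language, tokenize):
    """ASR subword targets followed by an utterance-final language tag."""
    return tokenize(transcript) + [LANG_TAGS[language]]

# Using whitespace splitting in place of the 4k-subword tokenizer, for illustration only:
print(make_joint_targets("play some music", "en-in", str.split))
# ['play', 'some', 'music', '<lang:en-in>']
```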
Table 1: Dataset statistics.

| System; Language | Train (ASR) | Train (LID) | Test (ASR, LID) |
|---|---|---|---|
| English-Spanish; en-us | 6.2k | 1.9k | 22 |
| English-Spanish; esp-us | 3.5k | 1.7k | 22 |
| English-Hindi; en-in | 12.6k | 1.0k | 83 |
| English-Hindi; hi-in | 2.3k | 1.1k | 24 |
Joint ASR-LID models can be trained without (A4) or with (A5E, A5J, and A5B) language embeddings on the input side. Based on the observations made for <eos>-driven endpointing using RNN-T [28], ASR deletions are expected to increase with the introduction of auxiliary language tags. Therefore, inspired by [28], we employ emission penalties during beam search. For language tags to be considered valid candidates in the decoding beam, language posteriors must exceed a threshold $\beta$ after being modified by an exponent parameter $\alpha$ ($\alpha > 1$ implies higher penalty). Eq. (8) summarizes this condition, where $\ell$ denotes a language tag:

$\left[ P(y_u = \ell \mid x_{1:t}, y_{1:u-1}) \right]^{\alpha} > \beta$    (8)
We evaluate ASR and LID performance for different values of $\alpha$ and $\beta$, including the extreme case of $\beta = 1$, which prevents the model from emitting language tags altogether. While the penalized inference mechanism—which suppresses premature language tag emissions—is expected to minimize deletions and thereby improve ASR performance, LID performance might suffer if predictions are made based on the model’s 1-best output. Therefore, in all our experiments with joint models, language is always predicted using the unscaled model posteriors available after the last audio frame; in other words, language prediction does not rely on the presence of language tags in the 1-best output. Note that language tags, when emitted, are also ignored during word error rate (WER) computation.
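A hedged sketch of the penalized emission rule of Eq. (8) and of the utterance-final LID decision follows; the beam-search machinery itself is elided, and the function and argument names are ours:

```python
import torch

def language_tag_allowed(log_probs, lang_idx, alpha=1.0, beta=1.0):
    """Eq. (8): admit a language tag into the beam only if its exponentiated posterior
    clears the threshold; beta = 1 never admits tags, beta = 0 always does."""
    return bool(log_probs[lang_idx].exp().pow(alpha) > beta)

def predict_language(final_log_probs, lang_indices):
    """LID decision from the unscaled posteriors available after the last audio frame,
    independent of whether a language tag appeared in the 1-best hypothesis."""
    return int(torch.argmax(final_log_probs[lang_indices].exp()))

# During beam search, candidate language tags failing `language_tag_allowed` are pruned;
# (alpha, beta) = (1, 1) therefore suppresses tag emission entirely, while
# `predict_language` still produces an LID output for every utterance.
```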
3 Experimental setup
Table 2: Summary of the experimental setup.

| Category | Details |
|---|---|
| Acoustic features | 64-dimensional log filter-bank energies (LFBEs) extracted at 10 ms intervals using 25 ms windows; feature normalization; three-frame stacking, downsampled to 30 ms. SpecAugment with frequency masking [29] employed for RNN-T training: two masks applied per utterance with a maximum width of 24 channels. Acoustic LID: augmentation via simulated reverberation. |
| Training targets | RNN-T: vocabulary of 4k subwords via byte pair encoding [30], using language-specific and pooled training text corpora for monolingual and bilingual models, respectively. Acoustic LID: utterance-level language labels. |
| Model architectures | RNN-T: 5 encoder LSTM layers and 2 decoder LSTM layers with 1024 units each, and a 512-dimensional embedding layer at the decoder input; the joint network has 512 hidden units followed by a nonlinear activation and softmax. Acoustic LID: 3 LSTM layers with 256 units each, followed by a 32-dimensional projection and softmax; the projection-layer output is provided as embeddings to the RNN-T. |
| Training | Distributed training on 24 and 16 GPUs for RNN-T and acoustic LID, respectively; batch size of 64 per GPU. Dropout rate of 0.2 in the RNN-T encoder and decoder. Adam optimizer with a warmup-hold-decay learning rate (LR) schedule. Stratified sampling: training batches retain the natural language distribution of the pooled dataset. All RNN-T models trained for 225k steps. |
| Inference | ASR 1-best obtained using a beam width of 16 and a temperature of 1. Acoustic LID models after 75k training steps used for embedding extraction and baseline evaluation. |
Table 3: Results for all experiments: ASR word error rate reduction (WERR, %) relative to the monolingual models (A0) and LID accuracy (%) relative to the acoustic LID classifier; (α, β) are the language-tag emission penalty parameters of Eq. (8).

| Model | Description | (α, β) | WERR en-in | WERR hi-in | WERR en-us | WERR esp-us | LID acc. en-in | LID acc. hi-in | LID acc. en-us | LID acc. esp-us |
|---|---|---|---|---|---|---|---|---|---|---|
| – | Acoustic LID classifier | – | – | – | – | – | 0.0 | 0.0 | 0.0 | 0.0 |
| A0 | Monolingual training | – | 0.0 | 0.0 | 0.0 | 0.0 | – | – | – | – |
| A1 | Simple data pooling | – | -9.4 | -11.2 | -0.5 | 0.2 | – | – | – | – |
| A2E | Oracle language input | – | -1.9 | 13.1 | -0.2 | 6.3 | – | – | – | – |
| A2J | Oracle language input | – | -1.8 | 10.6 | 0.0 | 4.4 | – | – | – | – |
| A2B | Oracle language input | – | -1.3 | 14.7 | 0.0 | 4.1 | – | – | – | – |
| A3E | Inferred LID embeddings | – | -9.4 | -20.3 | -2.5 | 0.1 | – | – | – | – |
| A3J | Inferred LID embeddings | – | -8.3 | -19.1 | -0.4 | 0.0 | – | – | – | – |
| A3B | Inferred LID embeddings | – | -8.9 | -21.0 | -3.4 | -0.8 | – | – | – | – |
| A4 | Joint training | (1, 0) | -64.6 | -18.5 | -57.4 | -69.8 | -11.7 | -0.4 | 1.3 | -0.1 |
| A4 | Joint training | (2, 0.1) | -27.8 | -4.5 | -21.8 | -25.1 | -11.9 | -0.5 | 1.3 | 0.2 |
| A4 | Joint training | (1, 1) | -22.4 | -1.8 | -7.2 | -5.0 | -21.1 | -1.4 | 1.7 | -3.1 |
| A5E | Inferred LID embeddings + joint training | (1, 1) | -14.3 | 0.7 | 3.5 | -2.7 | -5.4 | -2.4 | -2.7 | 0.3 |
| A5J | Inferred LID embeddings + joint training | (1, 1) | -6.8 | 4.9 | -1.0 | -1.6 | -7.4 | -1.6 | -3.8 | 1.1 |
| A5B | Inferred LID embeddings + joint training | (1, 1) | -18.7 | 0.3 | -2.3 | -2.0 | -6.9 | -1.9 | 0.8 | -0.1 |
| A6 | A5J with LID posteriors | (1, 1) | -7.0 | 4.4 | -0.7 | 0.8 | -6.1 | -1.6 | 0.1 | 0.3 |
| A7 | A6 without WW-only utterances | (1, 1) | -4.9 | 4.7 | -0.2 | 0.9 | -5.5 | -1.2 | -2.1 | 1.4 |
We conduct experiments using two pairs of languages: English-Spanish, as spoken in the United States (languages denoted as en-us and esp-us, respectively), and English-Hindi, as spoken in India (languages denoted as en-in and hi-in, respectively). Table 1 captures the dataset statistics.
Of the two language pairs studied, English-Hindi shows a higher degree of within-utterance code switching owing to the colloquial nature of spoken Hindi (which involves frequent use of common English words such as “play”, “call”, “book”, “volume”, etc.) and the presence of named entities from a number of other Indian languages. Utterances in en-in are transcribed using Latin script, whereas utterances in hi-in are transcribed using both Latin and Devanagari scripts: the latter for Hindi words and the former for words in English and other languages. Since named entities such as Hindi movie titles, song names, etc. are common to both languages, a bilingual model can output such words in either Latin or Devanagari script. These aspects make English-Hindi a challenging language pair to work with. Table 2 summarizes the remaining details pertaining to our experimental setup.
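To make the front end in Table 2 concrete, here is a minimal sketch of three-frame stacking with downsampling from 10 ms to 30 ms frames (feature normalization and SpecAugment are omitted; the function name and array layout are our assumptions):

```python
import numpy as np

def stack_and_downsample(lfbe: np.ndarray, stack: int = 3) -> np.ndarray:
    """(T, 64) LFBE frames at 10 ms -> (T // 3, 192) stacked frames at 30 ms."""
    T = (lfbe.shape[0] // stack) * stack                # drop any trailing partial group
    return lfbe[:T].reshape(-1, stack * lfbe.shape[1])  # concatenate 3 consecutive frames

feats = stack_and_downsample(np.random.randn(998, 64))
print(feats.shape)  # (332, 192)
```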
4 Results and discussion
Table 3 summarizes the results obtained from all of our experiments. We report word error rate reductions (WERR) relative to monolingual ASR systems, and LID accuracies relative to acoustic LID classifiers.
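Here, WERR is understood in the usual relative sense, so a positive value indicates a lower WER than the corresponding baseline and a negative value indicates degradation:

$\mathrm{WERR} = \dfrac{\mathrm{WER}_{baseline} - \mathrm{WER}_{model}}{\mathrm{WER}_{baseline}} \times 100\%$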
4.1 Results using data pooling and oracle language inputs
The simple data pooling approach (A1) results in performance degradations in general. The effect is more prominent for the English-Hindi pair, where a large number of words common to the two languages have both Latin and Devanagari representations, which can cause script-based confusions when training data are pooled. Providing oracle language information (A2E, A2J, A2B) yields healthy WERRs, especially for hi-in and esp-us, the relatively underrepresented languages in their respective pairs. These results validate the benefits of incorporating language information and serve as an upper bound for the gains achievable with language-based inputs.
4.2 Results using inferred language embeddings
Next, we evaluate the models trained with on-the-fly LID embeddings from a pretrained acoustic LID classifier. For the (en-us, esp-us) pair, A3J is almost on par with the baseline. For hi-in, however, a significant drop in performance is observed. One plausible explanation is the presence of Indic words in English utterances and significant code switching in Hindi utterances, both of which reduce the discriminatory power of the embeddings. Note that we use frame-level language embeddings in order to build a streaming-friendly model, and hence the model could be susceptible to within-utterance language fluctuations.
4.3 Results for joint ASR-LID training
For the joint training approach (A4) with the vanilla inference strategy $(\alpha, \beta) = (1, 0)$, we observe a dramatic increase in WERs (large negative WERRs) across languages, with the increase in deletions (owing to early language tag emissions) being the largest contributor (2–6 times higher deletions compared to the monolingual counterparts). By regulating language tag emissions via $\alpha$ and $\beta$, we observe that the ASR metrics improve. The extreme penalty scheme $(\alpha, \beta) = (1, 1)$, where language tags are never emitted, performs best in terms of WERR. Note that this choice does not significantly impact LID accuracy, since LID is governed solely by the utterance-final posteriors.
Combining joint training with inferred LID embeddings (A5E, A5J, A5B) brings us one step closer to bridging the performance gap relative to the ASR and LID baselines. A5J offers the best results in general, and its performance for (en-us, esp-us) is comparable to or slightly worse than that of the baseline systems. Results for the more difficult (en-in, hi-in) pair show interesting trends. For hi-in, A5J yields a 4.9% WERR and an LID accuracy that is slightly worse than the baseline, while for en-in, both ASR and LID metrics show larger gaps relative to their respective baselines (although the gaps are much smaller as compared to A3J and A4). These results suggest that the embedding-driven joint models might have learnt to become more consistent—but not fully accurate—with regard to the output script produced.
We improve upon A5J in two ways. (1) Posteriors from the acoustic LID model are used instead of embeddings from its projection layer in order to provide a more discriminative signal to the network (similar to the oracle language input). This approach (A6) yields marginal performance gains. (2) Since a significant portion of Alexa traffic comprises wake-word-only utterances (“alexa”), models could confuse accent with language (e.g., an en-us “alexa” utterance could be misinterpreted as esp-us if spoken with a strong Spanish accent). To mitigate such effects, we train A6 after filtering out all wake-word-only utterances and observe useful improvements in general (A7).
5 Conclusions
This paper employs the RNN-T architecture to build streaming, end-to-end, bilingual systems for joint speech recognition and language identification. We use a lightweight acoustic LID classifier to provide on-the-fly language embeddings to different components of the RNN-T, and demonstrate that providing language information to the joint network performs best. By penalizing language tag emissions and making language predictions using utterance-final posteriors, we show that ASR performance can be improved without impacting LID accuracy. The modeling techniques proposed in this work are language agnostic and can be scaled to multiple languages.
For the English-Spanish language pair, the joint ASR-LID model achieves comparable performance relative to the monolingual ASR and acoustic LID systems. In the case of English-Hindi, we observe a slight performance degradation for English owing to code switching effects combined with the presence of dual (Latin and Devanagari) word representations. For Hindi, the joint model surpasses baseline ASR performance while almost matching LID accuracy. The experimental evidence from two significantly different language pairs indicates that joint ASR-LID training is a promising direction to pursue, given the modeling simplification and compute savings it offers.
References
- [1] T. Schultz and A. Waibel, “Language-independent and language-adaptive acoustic modeling for speech recognition,” Speech Communication, vol. 35, no. 1-2, pp. 31–51, 2001.
- [2] A. Ghoshal, P. Swietojanski, and S. Renals, “Multilingual training of deep-neural networks,” in Proceedings of the ICASSP, Vancouver, Canada, 2013.
- [3] K. Vesely, M. Karafiat, F. Grezl, M. Janda, and E. Egorova, “The language-independent bottleneck features,” in Proceedings of the Spoken Language Technology Workshop (SLT), 2012 IEEE. IEEE, 2012, pp. 336–341.
- [4] S. Scanzio, P. Laface, L. Fissore, R. Gemello, and F. Mana, “On the use of a multilingual neural network front-end,” in Ninth Annual Conference of the International Speech Communication Association, 2008.
- [5] L. Besacier, E. Barnard, A. Karpov, and T. Schultz, “Automatic speech recognition for under-resourced languages: A survey,” Speech Communication, vol. 56, pp. 85–100, 2014.
- [6] Q. B. Nguyen, J. Gehring, M. Müller, S. Stüker, and A. Waibel, “Multilingual shifting deep bottleneck features for low-resource ASR,” in Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on. IEEE, 2014, pp. 5607–5611.
- [7] N. T. Vu, D.-C. Lyu, J. Weiner, D. Telaar, T. Schlippe, F. Blaicher, E.-S. Chng, T. Schultz, and H. Li, “A first speech recognition system for Mandarin-English code-switch conversational speech,” in 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2012, pp. 4889–4892.
- [8] S. Thomas, S. Ganapathy, and H. Hermansky, “Multilingual MLP features for low-resource LVCSR systems,” in 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2012, pp. 4269–4272.
- [9] J.-T. Huang, J. Li, D. Yu, L. Deng, and Y. Gong, “Cross-language knowledge transfer using multilingual deep neural network with shared hidden layers,” in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2013, pp. 7304–7308.
- [10] H. Arsikere, A. Sapru, and S. Garimella, “Multi-dialect acoustic modeling using phone mapping and online i-vectors,” Proc. Interspeech 2019, pp. 2125–2129, 2019.
- [11] G. Heigold, V. Vanhoucke, A. Senior, P. Nguyen, M. Ranzato, M. Devin, and J. Dean, “Multilingual acoustic models using distributed deep neural networks,” in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2013, pp. 8619–8623.
- [12] S. Feng and T. Lee, “Improving cross-lingual knowledge transferability using multilingual TDNN-BLSTM with language-dependent pre-final layer,” in Interspeech, 2018, pp. 2439–2443.
- [13] S. Stüker, M. Müller, Q. B. Nguyen, and A. Waibel, “Training time reduction and performance improvements from multilingual techniques on the Babel ASR task,” in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2014, pp. 6374–6378.
- [14] W. Chan, N. Jaitly, Q. V. Le, and O. Vinyals, “Listen, attend and spell,” arXiv preprint arXiv:1508.01211, 2015.
- [15] A. Graves, “Sequence transduction with recurrent neural networks,” arXiv preprint arXiv:1211.3711, 2012.
- [16] Y. He, T. N. Sainath, R. Prabhavalkar, I. McGraw, R. Alvarez, D. Zhao, D. Rybach, A. Kannan, Y. Wu, R. Pang et al., “Streaming end-to-end speech recognition for mobile devices,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 6381–6385.
- [17] S. Wang, L. Wan, Y. Yu, and I. L. Moreno, “Signal combination for language identification,” arXiv preprint arXiv:1910.09687, 2019.
- [18] C. Chandak, Z. Raeesy, A. Rastrow, Y. Liu, X. Huang, S. Wang, D. Joo, and R. Maas, “Streaming language identification using combination of acoustic representations and ASR hypotheses,” submitted to Interspeech 2020.
- [19] H. Seki, S. Watanabe, T. Hori, J. Le Roux, and J. Hershey, “An end-to-end language-tracking speech recognizer for mixed-language speech,” 2018.
- [20] M. Müller, S. Stüker, and A. Waibel, “Neural language codes for multilingual acoustic models,” arXiv preprint arXiv:1807.01956, 2018.
- [21] A. Kannan, A. Datta, T. N. Sainath, E. Weinstein, B. Ramabhadran, Y. Wu, A. Bapna, Z. Chen, and S. Lee, “Large-scale multilingual speech recognition with a streaming end-to-end model,” arXiv preprint arXiv:1909.05330, 2019.
- [22] B. Li, T. N. Sainath, K. C. Sim, M. Bacchiani, E. Weinstein, P. Nguyen, Z. Chen, Y. Wu, and K. Rao, “Multi-dialect speech recognition with a single sequence-to-sequence model,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 4749–4753.
- [23] B. Li, Y. Zhang, T. Sainath, Y. Wu, and W. Chan, “Bytes are all you need: End-to-end multilingual speech recognition and synthesis with bytes,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 5621–5625.
- [24] A. Waters, N. Gaur, P. Haghani, P. Moreno, and Z. Qu, “Leveraging language id in multilingual end-to-end speech recognition,” in 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2019, pp. 928–935.
- [25] S. Watanabe, T. Hori, and J. R. Hershey, “Language independent end-to-end architecture for joint language identification and speech recognition,” in 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2017, pp. 265–271.
- [26] M. Müller and A. Waibel, “Using language adaptive deep neural networks for improved multilingual speech recognition,” IWSLT, 2015.
- [27] M. Müller, S. Stüker, and A. Waibel, “Neural codes to factor language in multilingual speech recognition,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 8638–8642.
- [28] S.-Y. Chang, R. Prabhavalkar, Y. He, T. N. Sainath, and G. Simko, “Joint endpointing and decoding with end-to-end models,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 5626–5630.
- [29] D. S. Park, W. Chan, Y. Zhang, C.-C. Chiu, B. Zoph, E. D. Cubuk, and Q. V. Le, “SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition,” in Proc. Interspeech 2019, 2019, pp. 2613–2617. [Online]. Available: http://dx.doi.org/10.21437/Interspeech.2019-2680
- [30] R. Sennrich, B. Haddow, and A. Birch, “Neural machine translation of rare words with subword units,” 2015.