
END-TO-END CONTEXTUAL ASR BASED ON POSTERIOR DISTRIBUTION ADAPTATION FOR HYBRID CTC/ATTENTION SYSTEM

Abstract

End-to-end (E2E) speech recognition architectures assemble all components of a traditional speech recognition system into a single model. Although this simplifies the ASR system, it introduces a contextual ASR drawback: the E2E model performs worse on utterances containing infrequent proper nouns. In this work, we propose to add a contextual bias attention (CBA) module to an attention-based encoder-decoder (AED) model to improve its ability to recognize contextual phrases. Specifically, CBA uses the context vector of the source attention in the decoder to attend to a specific bias embedding. Jointly learned with the basic AED parameters, CBA can tell the model when and where to bias its output probability distribution. At inference, a list of bias phrases is preloaded and we adapt the posterior distributions of both CTC and the attention decoder according to the bias phrase attended by CBA. We evaluate the proposed method on GigaSpeech and achieve a consistent relative improvement in the recall rate of bias phrases, ranging from 15% to 28% over the baseline model. Meanwhile, our method shows strong anti-bias ability: the performance on general test sets degrades by only 1.7% even when 2,000 bias phrases are present.

Index Terms—  contextual bias attention, posterior adaptation, end-to-end, speech recognition

1 Introduction

End-to-end (E2E) ASR models have achieved remarkable progress in the last few years [1, 2, 3]. One of the most popular models, the hybrid CTC/attention model, effectively exploits the advantages of both architectures [4], resulting in performance comparable to conventional DNN/HMM ASR systems [5]. However, end-to-end models rely heavily on the scale and distribution of the training data, which means they can hardly recognize proper nouns that are rarely seen in the training utterances. Moreover, traditional n-gram language models and the corresponding WFST-based [6] decoding techniques, which can be conveniently customized in real applications, are difficult to incorporate into end-to-end systems [7]. Obvious performance degradation is observed in recognition scenarios such as placing phone calls through a personal AI assistant [8], where contact names are the main targets to be correctly recognized. Basically, contextual phrases provide additional information about what a user may say, and this information is only available at inference time [7]. Context often includes the user's contacts, locations, or song collections, which contain rare words or proper nouns. Therefore, it is necessary to inform the system of the presence of these words and encourage the model to hit the corresponding phrases during decoding.

Previous works have explored shallow fusion methods [9, 10, 11] in which a contextual language model is injected as a rescoring function. The n-grams of context phrases whose likelihood the system wishes to increase are compiled into a WFST. Whenever a word boundary is reached for a specific word $w$ as decoding proceeds, the system rescores that word given its history $w_{H}$ using the contextual WFST score. Because biasing is applied at the end of a word, proper nouns are prone to being pruned before biasing takes effect. Data augmentation methods have also been examined, such as utilizing synthesized audio data containing proper nouns during training [12]. However, the synthesized data are customized to a given recognition task and lack generalization ability on other tasks. Another resource-intensive augmentation strategy improves proper-noun coverage by tagging a large amount of unsupervised data [11, 13]. These unsupervised utterances are decoded by a state-of-the-art conventional ASR system and only utterances with high confidence are kept. In [14, 15], an all-neural approach for contextual biasing is proposed, where a separate bias encoder component is introduced to model the contextual phrases. Although experimental results show this approach outperforms shallow-fusion biasing on many tasks, it suffers from false triggering as the biasing influence is not controllable, resulting in a poor word error rate when more bias phrases are loaded. As end-to-end systems have difficulty generating words they rarely see, [16] proposed a transformation method which maps rare entity words to common words via pronunciation and treats the mapped words as an alternative form of the original word during recognition. With this mapping, the original words can be correctly recognized.

In this work, we propose a novel approach to address this problem based on bias attention in the hybrid CTC/attention structure. We split the bias phrases into word pieces with a BPE model. Next, we employ a single LSTM module as the bias encoder to embed the word pieces and regard the last LSTM state as the context embedding of a given bias phrase. We then feed the embeddings of all bias phrases and the context vector of the source attention in the transformer decoder into a contextual bias attention (CBA) module. The CBA module consists of two simple linear transformations of the context vector and the bias embeddings, which serve as query and key respectively. A scaled dot-product attention is computed, which produces a spike on a specific bias phrase when the corresponding context information occurs in the audio segment. We then adjust the posterior probabilities of both the attention decoder and CTC outputs according to the bias attention result. During decoding, biasing takes place before beam search, which decreases the risk of pruning. Furthermore, the proposed system has a strong anti-context ability on utterances we do not want to bias, which means the ASR performance degrades only slightly on these utterances.

The rest of the paper is organized as follows. In Section 2, we briefly review the baseline transformer CTC/attention structure and the contextual E2E modeling technique. Our proposed CBA method is presented in Section 3 followed by experimental setup and results in Section 4. We conclude this paper with our findings and future work in Section 5.

2 BACKGROUND

2.1 Hybrid CTC/attention model

We use the hybrid CTC/attention ASR model as our baseline, which exploits the benefits of both CTC and the attention decoder during training and decoding [4]. A multi-task learning framework is employed to improve robustness and achieve fast convergence in the training stage. The monotonic alignment property of CTC relieves the attention decoder of the burden of estimating the desired alignment. The loss function is described as

$L_{mtl}=\lambda\log p_{ctc}(y|X)+(1-\lambda)\log p_{attn}(y|X)$ (1)

where $X=\{x_{1},...,x_{T_{src}}\}$ and $y=\{y_{1},...,y_{T_{tgt}}\}$ denote the feature sequence and the target sequence respectively. $\lambda$ is a tunable parameter which satisfies $0\leq\lambda\leq 1$.

Because the attention decoder decodes in label steps while CTC operates in frame steps, the model computes the probability of each partial hypothesis with CTC and the attention model respectively and incorporates both scores in beam search. The attention decoder starts decoding from the start symbol $\langle sos\rangle$, and the top $K$ units of the posterior distribution are selected to construct partial hypotheses. The CTC prefix probability of a given partial hypothesis $h$ is then calculated as:

$p_{ctc}(h,...|X)=\sum_{v}p_{ctc}(h\cdot v|X)$ (2)

where $v$ ranges over all possible label sequences except the empty string. The CTC prefix score is then combined with $p_{attn}(h|X)$ as in (1).
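To make the joint scoring concrete, the following is a minimal sketch of how a candidate extension of a partial hypothesis could be scored during beam search. It assumes two hypothetical helpers, ctc_prefix_score and attn_log_prob, that return log-probabilities for the extended hypothesis given the features X, and a combination weight lam analogous to $\lambda$ in (1); none of these names come from the paper or from ESPnet.

def joint_score(hyp, next_token, X, lam, ctc_prefix_score, attn_log_prob):
    """Combine the CTC prefix score (Eq. (2)) with the attention decoder
    score for one candidate extension of a partial hypothesis."""
    extended = hyp + [next_token]
    log_p_ctc = ctc_prefix_score(extended, X)   # log p_ctc(h, ...|X)
    log_p_att = attn_log_prob(extended, X)      # log p_attn(h|X)
    return lam * log_p_ctc + (1.0 - lam) * log_p_att

In beam search, the extensions with the highest joint scores are kept as new partial hypotheses.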

2.2 CLAS model

An effective method that integrates contextual information into E2E modeling is CLAS [14], an all-neural structure based on the Listen, Attend and Spell (LAS) model. A list of additional bias phrases is given, denoted $z=\{z_{1},...,z_{N}\}$, and a bias-encoder module embeds each phrase into a fixed-dimensional representation $h^{z}=\{h^{z}_{0},...,h^{z}_{N}\}$, where $h^{z}_{0}=h^{z}_{nobias}$ corresponds to the no-bias option. A bias attention is then computed over $h^{z}$, with the decoder hidden state $d_{t}$ used as the attention query, analogous to the audio attention. Given the audio and previous labels, CLAS explicitly models the probability of seeing particular phrases:

$\alpha_{t}^{z_{i}}=P(z_{i}|X;y_{<t})$ (3)

where $t$ denotes the decoding step and $i$ denotes the index of the bias phrase.

A concatenation of the audio context vector and the bias context vector, $c_{t}=[c_{t}^{x};c_{t}^{z}]$, is then fed into the LAS decoder as usual. In this way, the bias context information can influence the output predictions:

$P(y_{t}|y_{<t};X;z)=\mathrm{softmax}(\mathbf{W_{s}}[c_{t}^{x};c_{t}^{z};d_{t}]+b_{s})$ (4)

It is up to the model to determine which bias phrase might be relevant at each decoding step, and the target distribution is modified accordingly.
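The projection in Eq. (4) amounts to concatenating the audio context, the bias context, and the decoder state before a single linear layer and softmax. The sketch below illustrates this reading in PyTorch; the dimension names are illustrative assumptions rather than values from [14].

import torch
import torch.nn as nn

class CLASOutputLayer(nn.Module):
    """Output projection of Eq. (4): softmax(W_s [c_x; c_z; d_t] + b_s)."""
    def __init__(self, d_audio, d_bias, d_dec, vocab_size):
        super().__init__()
        self.proj = nn.Linear(d_audio + d_bias + d_dec, vocab_size)

    def forward(self, c_x, c_z, d_t):
        # c_x: audio context vector, c_z: bias context vector, d_t: decoder state
        fused = torch.cat([c_x, c_z, d_t], dim=-1)
        return torch.softmax(self.proj(fused), dim=-1)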

3 Contextual bias attention

3.1 Architecture

The overall structure of the proposed model is shown in Fig. 1. We add a bias attention component to the standard conformer-transformer structure. The bias encoder consists of a single LSTM module, and the last LSTM state is used as the fixed-dimensional embedding of the entire phrase. Unlike [14], we employ a simple scaled dot-product attention [17] between query and key, where the query is the context vector of the source attention in the last decoder layer and the key is the output of the bias encoder:

$\alpha_{t}^{z_{i}}=\mathrm{softmax}\left((\mathbf{W_{q}}c^{N_{d}-1}_{t})(\mathbf{W_{k}}h^{z}_{i})/\sqrt{d_{model}}\right)$ (5)

where $\mathbf{W_{q}}$ and $\mathbf{W_{k}}$ denote linear transformations of the query and key respectively, $d_{model}$ denotes the dimension of the key, and $N_{d}$ denotes the number of decoder layers. Bias attention is expected to extract the potential bias phrase from the bias list at each label step according to the computed attention probabilities. In our method, we generate a bias label at each label step and then calculate the bias loss on the attention distribution. We discuss this in the next subsection.


Fig. 1: Illustration of contextual bias attention system based on the conformer-transformer structure.
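A minimal sketch of the bias attention in Eq. (5) is given below, assuming the decoder source-attention context vectors and the bias-phrase embeddings are already computed; the tensor shapes are illustrative assumptions, not values taken from the paper.

import math
import torch
import torch.nn as nn

class ContextualBiasAttention(nn.Module):
    """Scaled dot-product attention between the source-attention context
    vector (query) and the bias-phrase embeddings (keys), as in Eq. (5)."""
    def __init__(self, d_model):
        super().__init__()
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.d_model = d_model

    def forward(self, context, bias_emb):
        # context:  (batch, T_label, d_model) source-attention context vectors
        # bias_emb: (N_bias + 1, d_model)     index 0 is the no-bias embedding
        q = self.w_q(context)                            # (batch, T_label, d_model)
        k = self.w_k(bias_emb)                           # (N_bias + 1, d_model)
        scores = q @ k.transpose(0, 1) / math.sqrt(self.d_model)
        return torch.softmax(scores, dim=-1)             # alpha_t^{z_i}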

3.2 CBA training

In the training stage, we first need to create a bias list $z$ for each training batch. As our purpose is to improve the recognition accuracy of proper nouns, we naturally employ a named entity recognition (NER) annotator to label entities in the reference transcripts associated with the training batch. In the following experiments, we use the Stanford CoreNLP pipeline [12] as the NER extractor, which runs several named entity recognizers and combines their results. However, only part of the reference transcripts contain named entities, and we process the remaining transcripts by randomly selecting word $n$-grams as in [14]. Therefore, a bias list $z$ contains both real named entities and randomly selected word $n$-grams.

Specifically, the bias list is created as follows. For a given training batch, we apply NER on the reference transcripts $r_{1},...,r_{N_{batch}}$ and split the references into two parts, one with entity results and one without. For the references without entities, $k$ word $n$-grams are randomly selected from each reference, where $k$ is drawn uniformly from $[1,N_{phrases}]$ and $n$ is drawn uniformly from $[1,N_{order}]$. Here, $N_{phrases}$ denotes the maximal number of phrases that can be selected from a reference and $N_{order}$ denotes the maximal order of a selected phrase. Take a reference with 10 words as an example, $ref=[w_{1},w_{2},...,w_{10}]$. With $N_{phrases}=2$ and $N_{order}=3$, we may select two bias phrases, e.g. $z_{1}=[w_{2}]$ and $z_{2}=[w_{5},w_{6},w_{7}]$, of order 1 and order 3 respectively. The total number of bias phrases in a training batch depends on the random selection process and can differ between batches.

Next, we create the bias label for each training reference, which is used for the bias loss calculation (a sketch of this labeling is given at the end of this subsection). For a reference with bias phrases, we split the word sequence into a word-piece sequence through BPE. For example, the first reference in a training batch, "call hanna phone" with bias phrase "hanna", may be split into "call han @@na phone" and labeled "0 1 1 0", where "0" is the no-bias id and "1" is the id of the bias phrase in the bias list. The second reference, "I will go to shanghai" with bias phrase "shanghai", may be split into "I will go to shang @@hai" and labeled "0 0 0 0 2 2". The remaining references in the batch are processed in the same way, with bias ids indexed from "1" up to the length of the bias list. After bias attention is performed at each label step, we use the attention distribution and the bias label to calculate the loss of bias attention:

$L_{bias}=-\sum_{t}\log\alpha_{t}^{z_{t}}$ (6)

where $z_{t}$ denotes the bias label at label step $t$. Furthermore, we combine the standard CTC/attention loss and the bias loss as the optimization objective in the training process:

$L_{all}=-L_{mtl}+\beta L_{bias}$ (7)

where $\beta$ denotes a tunable parameter which satisfies $0\leq\beta\leq 1$.
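To make the labeling scheme above concrete, here is a minimal sketch of bias-label creation for one reference. It assumes word-level bias phrases and a hypothetical bpe_split function that maps a word to its word pieces (e.g. "hanna" -> ["han", "@@na"]); neither the function nor its behavior is defined in the paper.

def make_bias_labels(words, bias_phrases, bpe_split):
    """Assign a bias id to every word piece of a reference.

    words:        list of words, e.g. ["call", "hanna", "phone"]
    bias_phrases: list of word-level phrases, e.g. [["hanna"]]; phrase i
                  gets id i + 1, and id 0 is reserved for the no-bias option
    bpe_split:    assumed callable mapping a word to its word pieces
    """
    labels = []
    for pos, word in enumerate(words):
        bias_id = 0  # no-bias by default
        for i, phrase in enumerate(bias_phrases):
            for start in range(len(words) - len(phrase) + 1):
                if words[start:start + len(phrase)] == phrase and start <= pos < start + len(phrase):
                    bias_id = i + 1
        labels.extend([bias_id] * len(bpe_split(word)))
    return labels

# make_bias_labels(["call", "hanna", "phone"], [["hanna"]], bpe_split) -> [0, 1, 1, 0]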

3.3 Posterior adaptation

In the inference stage, we prepare the contextual phrases $z$ and send them to the bias encoder, which outputs $h^{z}$ before decoding starts. During beam search, the CBA module calculates the correlation between $c^{N_{d}-1}_{t}$ and $h^{z}$ at each decoding step, and biasing is performed based on the bias phrase distribution of Eq. (5). The phrase with the largest probability (excluding $h^{z}_{0}$, which corresponds to the no-bias option) is biased. We boost the posteriors of both the attention decoder and CTC according to the selected bias phrase. First, the biasing score $bias\_score$ is added to the linear output of the attention decoder at the word-piece units of the attended phrase before the softmax is applied, resulting in boosted probabilities on these word-piece units.

Next, we bias the CTC posteriors corresponding to each encoded audio feature according to the source attention distribution in the last layer of the attention decoder:

$\alpha_{t}^{h^{x}_{i}}=P(h^{x}_{i}|X;y_{<t})$ (8)

where $h^{x}$ denotes the encoded audio features and $i$ denotes the index into $h^{x}$. In contrast to the biasing mechanism of the attention decoder, the bias score $bias\_score$ is multiplied by the attention probability $\alpha_{t}^{h^{x}_{i}}$ before being added to the related word-piece units of the CTC linear output. By default, the bias score of the attention decoder is the same as the one used for CTC. Consequently, the posteriors of both CTC and the attention decoder are biased toward the potential contextual phrases before pruning is performed on the corresponding word-piece units, which improves the recognition accuracy on bias phrases.
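The following sketch illustrates the posterior boosting described above, assuming the attended bias phrase has already been mapped to its word-piece ids and that att_logits and ctc_logits are the pre-softmax outputs of the attention decoder and the CTC layer; the variable names are illustrative, not taken from the paper or from ESPnet.

import torch

def bias_posteriors(att_logits, ctc_logits, src_attn, piece_ids, bias_score=5.0):
    """Boost the attention-decoder and CTC outputs toward the attended phrase.

    att_logits: (vocab,)        attention decoder linear output before softmax
    ctc_logits: (T_enc, vocab)  CTC linear output per encoded frame
    src_attn:   (T_enc,)        source attention distribution, Eq. (8)
    piece_ids:  list of word-piece ids of the attended bias phrase
    """
    att_logits = att_logits.clone()
    ctc_logits = ctc_logits.clone()
    # Attention decoder: add a constant bias score before the softmax.
    att_logits[piece_ids] += bias_score
    # CTC: weight the bias score by the source attention over encoder frames.
    ctc_logits[:, piece_ids] += bias_score * src_attn.unsqueeze(-1)
    return torch.softmax(att_logits, dim=-1), torch.softmax(ctc_logits, dim=-1)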

4 EXPERIMENTS

4.1 Data preparation

We evaluate our proposed method on GigaSpeech [18], a multi-domain English speech recognition corpus with 10,000 hours of high-quality labeled audio covering a variety of topics such as arts, science, and sports. We apply the Stanford CoreNLP NER to the whole training transcripts and obtain approximately one million utterances with NER results, about one-eighth of the dataset. For the training data without NER results, we randomly select word n-grams as described in Section 3.2. We set $N_{phrases}=2$ and $N_{order}=3$, which means that $1\sim 2$ bias phrases, each containing $1\sim 3$ words, can be selected from a specific utterance.
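A minimal sketch of this random n-gram selection for an utterance without NER results, under the $N_{phrases}$ / $N_{order}$ scheme of Section 3.2 (the function name is ours, and overlap between selected phrases is not handled here since the paper does not specify it):

import random

def sample_bias_ngrams(words, n_phrases_max=2, n_order_max=3):
    """Randomly select word n-grams from a reference without NER results."""
    k = random.randint(1, n_phrases_max)                      # number of phrases
    phrases = []
    for _ in range(k):
        n = random.randint(1, min(n_order_max, len(words)))   # phrase order
        start = random.randint(0, len(words) - n)
        phrases.append(words[start:start + n])
    return phrases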

The evaluation sets in GigaSpeech consist of a dev set with 5,715 utterances and a test set with 19,930 utterances. We keep the dev and test sets as general sets and do not apply NER to extract bias phrases from them. Instead, we construct bias test sets based on the commonvoice training set [19], which is contributed by volunteers who record their voices online and is publicly available. After the whole commonvoice training corpus is labeled by the NER annotator, we select the NER results from four entity classes: person names, locations, organizations, and other. The filtered utterances are collected as bias test sets, denoted PER, LOC, ORG and OTHER. A summary of the four test sets is shown in Table 1.

In the experiments, we find that some training batches have fewer bias phrases than others, and we suggest adding extra distractors to them. To examine the effect of distractors in bias training, we supplement extra n-grams when the number of extracted bias phrases is less than 200. We count the word frequencies in the training transcripts, exclude the top 20% most frequent words, and randomly select distractors from the remaining words.

Table 1: Details of the bias test sets. Bias list size is the number of unique bias phrases in the test set, as some utterances share the same bias phrase.
Test set Number of utterances Bias list size
PER 16246 10524
LOC 1046 464
ORG 823 360
OTHER 1161 328

The total number of utterances in the commonvoice training set is around 564,000, while fewer than 20,000 utterances contain entities in the above four classes, and "person names" accounts for the highest proportion.

4.2 Model structure

The model in our experiments consists of a conformer encoder with 12 blocks and a transformer decoder with 6 blocks. The full structural details can be found in the "espnet/egs2/gigaspeech" recipe of the ESPnet ASR toolkit [20]. However, we use 80-dimensional filter-bank features computed on a 25 ms window with a 10 ms shift instead of the raw-waveform input of the ESPnet recipe. The bias encoder is a single LSTM layer with 512 nodes. The output dimension of the trainable parameter matrices in the bias attention equals $d_{model}$. As our system follows the multi-task learning framework, we set $\lambda=0.3$ in (1) and $\beta=0.5$ in (7). During decoding, we exclude the influence of a language model and only evaluate the performance of joint decoding with CTC and the attention decoder.
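For convenience, the settings reported in this subsection can be summarized as follows; this is only a recap of the values stated above, not an actual ESPnet configuration file, and the key names are ours.

# Summary of the settings reported in Section 4.2 (illustrative key names).
config = {
    "encoder": {"type": "conformer", "num_blocks": 12},
    "decoder": {"type": "transformer", "num_blocks": 6},
    "frontend": {"features": "fbank", "dim": 80, "win_ms": 25, "shift_ms": 10},
    "bias_encoder": {"type": "lstm", "layers": 1, "units": 512},
    "ctc_weight_lambda": 0.3,      # lambda in Eq. (1)
    "bias_loss_weight_beta": 0.5,  # beta in Eq. (7)
    "decoding": {"language_model": None, "joint_ctc_attention": True},
}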

4.3 Bias-free testing on various test sets

We first train the baseline hybrid CTC/attention model based on the conformer-transformer structure. We then use the baseline model as the initial model, add our bias encoder and bias attention components to it, and carry out bias training. We compare the WER of the baseline model and our bias models when decoding with an empty list of bias phrases on a variety of test sets, from the general GigaSpeech test sets to the commonvoice bias test sets. To better evaluate the biasing performance, we also use the recall rate of bias phrases as a measurement, i.e. the percentage of bias phrases in the test set that are correctly recognized. The WER and recall rate are shown in Table 2. The three models perform similarly when tested in bias-free mode, whereas the WER on the organization set degrades for the bias model trained with distractors. This may be caused by the small scale of the organization set, which has only 823 utterances. For the general GigaSpeech test sets, we only report WER and find little performance degradation compared to the baseline model.

Table 2: Results of bias-free testing with the compared models.
Baseline No-distractors Distractors
Test set WER recall WER recall WER recall
giga_dev 11.8 *** 11.9 *** 11.9 ***
giga_test 11.7 *** 11.8 *** 11.8 ***
PER 18.7 0.51 18.8 0.50 18.7 0.50
LOC 17.3 0.61 17.3 0.61 17.3 0.61
ORG 16.0 0.62 16.2 0.57 16.4 0.61
OTHER 16.0 0.71 15.9 0.72 16.1 0.71

4.4 Bias testing on bias tests

Next, we study the performance of our bias models on the bias test sets. We load the full bias list of each bias set and carry out bias testing, except for the "person names" set, which we divide into five subsets, each decoded with its own bias list, because loading all 10,524 bias phrases at once could yield results inconsistent with the other three sets. We tune $bias\_score$ in the biasing process and find similar performance, so we simply set $bias\_score=5$ in the following tests. We obtain a consistent improvement in the recall rate of bias phrases on three test sets, ranging from 15% to 28%. For the five subsets of "person names", the recall-rate improvement shrinks compared to the other three sets, as the number of bias phrases is $2\sim 3$ times larger. The bias model trained with distractors always performs better than the model trained without distractors. The results are shown in Table 3.

Table 3: Results of bias testing with the compared models.
No-distractors Distractors
Test set WER recall WER recall
LOC 16.4 0.70 16.0 0.72
ORG 15.4 0.67 15.2 0.68
OTHER 15.6 0.77 15.3 0.78
No-distractors Distractors
Test set WER recall WER recall
PER_sub1 19.7 0.53 19.0 0.58
PER_sub2 17.8 0.55 17.4 0.58
PER_sub3 16.8 0.53 16.2 0.60
PER_sub4 19.9 0.55 19.3 0.58
PER_sub5 19.4 0.56 19.0 0.57
PER_aggre 18.6 0.54 18.1 0.58

We extend the experiment by increasing the number of loaded bias phrases of each bias set to 1,000, where the additional phrases are selected from the "person names" bias set. Only the bias model trained with distractors is examined. We obtain results similar to the previous ones, which means the bias model can adapt to a large number of phrases even when distractors are introduced. The results are shown in Table 4.

Table 4: Results of bias testing with more distractor phrases loaded.
Bias: full loaded Bias: 1000 loaded
Test set WER recall WER recall
LOC 16.0 0.72 16.5 0.68
ORG 15.2 0.68 15.8 0.65
OTHER 15.3 0.78 15.6 0.76

4.5 False triggering testing on general sets

Finally, we examine the false-triggering behavior of the bias model. To verify its strong anti-context ability, we load bias phrases into the bias encoder and perform biasing while decoding the general test sets. The number of loaded bias phrases ranges from 100 to 2,000, picked from the "person names" bias set, which has the largest number of bias phrases. All the activation parameters are consistent with the previous experiments. The WER is shown in Table 5.

Table 5: WER on the giga_dev and giga_test sets as a function of the number of bias phrases loaded. Tested on the bias model trained with distractors.
phrase number
Test set 100 500 1000 2000
giga_dev 11.8 11.9 11.9 12.0
giga_test 11.7 11.8 11.9 12.0

Although we observe gradual WER degradation as the number of bias phrases grows, the performance is still acceptable when the loaded number increases to 2,000, with only a 1.7% relative WER increase on the giga_dev set.
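As a quick check of this figure, taking the 100-phrase result in Table 5 as the reference point:

$(12.0-11.8)/11.8\approx 0.017=1.7\%$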

5 CONCLUSIONS

In this work, we have proposed a novel contextualized E2E ASR model that leverages contextual bias attention to guide bias activation during beam-search decoding. We described the data preparation process, in which an NER annotator is used to extract bias phrases for both training and decoding. Experiments are conducted on GigaSpeech to investigate the model's anti-context ability and contextual biasing ability. While the performance degrades by only 1.7% on general testing when 2,000 bias phrases are present, the recall rate of bias phrases in bias testing improves consistently by 15% to 28%, which demonstrates the effectiveness and practicality of our proposal. Investigating the CBA module on other E2E architectures, e.g. RNN-T, is our next direction.

References

  • [1] William Chan, Navdeep Jaitly, Quoc Le, and Oriol Vinyals, “Listen, attend and spell: A neural network for large vocabulary conversational speech recognition,” in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016, pp. 4960–4964.
  • [2] Dzmitry Bahdanau, Jan Chorowski, Dmitriy Serdyuk, Philémon Brakel, and Yoshua Bengio, “End-to-end attention-based large vocabulary speech recognition,” in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016, pp. 4945–4949.
  • [3] Chung-Cheng Chiu, Tara N Sainath, Yonghui Wu, Rohit Prabhavalkar, Patrick Nguyen, Zhifeng Chen, Anjuli Kannan, Ron J Weiss, Kanishka Rao, Ekaterina Gonina, et al., “State-of-the-art speech recognition with sequence-to-sequence models,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 4774–4778.
  • [4] Shinji Watanabe, Takaaki Hori, Suyoun Kim, John R Hershey, and Tomoki Hayashi, “Hybrid ctc/attention architecture for end-to-end speech recognition,” IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 8, pp. 1240–1253, 2017.
  • [5] Geoffrey Hinton, Li Deng, Dong Yu, George E Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N Sainath, et al., “Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups,” IEEE Signal processing magazine, vol. 29, no. 6, pp. 82–97, 2012.
  • [6] Takaaki Hori and Atsushi Nakamura, “Speech recognition algorithms using weighted finite-state transducers,” Synthesis Lectures on Speech and Audio Processing, vol. 9, no. 1, pp. 1–162, 2013.
  • [7] Zhehuai Chen, Mahaveer Jain, Yongqiang Wang, Michael L Seltzer, and Christian Fuegen, “End-to-end contextual speech recognition using class language models and a token passing decoder,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 6186–6190.
  • [8] Justin Scheiner, Ian Williams, and Petar Aleksic, “Voice search language model adaptation using contextual information,” in 2016 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2016, pp. 253–257.
  • [9] Ian Williams, Anjuli Kannan, Petar S Aleksic, David Rybach, and Tara N Sainath, “Contextual speech recognition in end-to-end neural network systems using beam search.,” in Interspeech, 2018, pp. 2227–2231.
  • [10] Anjuli Kannan, Yonghui Wu, Patrick Nguyen, Tara N Sainath, Zhijeng Chen, and Rohit Prabhavalkar, “An analysis of incorporating an external language model into a sequence-to-sequence model,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 1–5828.
  • [11] Ding Zhao, Tara N. Sainath, David Rybach, Pat Rondon, Deepti Bhatia, Bo Li, and Ruoming Pang, “Shallow-Fusion End-to-End Contextual Biasing,” in Proc. Interspeech 2019, 2019, pp. 1418–1422.
  • [12] Khe Chai Sim, Françoise Beaufays, Arnaud Benard, Dhruv Guliani, Andreas Kabel, Nikhil Khare, Tamar Lucassen, Petr Zadrazil, Harry Zhang, Leif Johnson, et al., “Personalization of end-to-end speech recognition on mobile devices for named entities,” in 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2019, pp. 23–30.
  • [13] Zhiheng Huang, Wei Xu, and Kai Yu, “Bidirectional lstm-crf models for sequence tagging,” arXiv preprint arXiv:1508.01991, 2015.
  • [14] Golan Pundak, Tara N Sainath, Rohit Prabhavalkar, Anjuli Kannan, and Ding Zhao, “Deep context: end-to-end contextual speech recognition,” in 2018 IEEE spoken language technology workshop (SLT). IEEE, 2018, pp. 418–425.
  • [15] Uri Alon, Golan Pundak, and Tara N Sainath, “Contextual speech recognition with difficult negative training examples,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 6440–6444.
  • [16] Rongqing Huang, Ossama Abdel-Hamid, Xinwei Li, and Gunnar Evermann, “Class lm and word mapping for contextual biasing in end-to-end asr,” arXiv preprint arXiv:2007.05609, 2020.
  • [17] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin, “Attention is all you need,” in Advances in neural information processing systems, 2017, pp. 5998–6008.
  • [18] Guoguo Chen, Shuzhou Chai, Guanbo Wang, Jiayu Du, Wei-Qiang Zhang, Chao Weng, Dan Su, Daniel Povey, Jan Trmal, Junbo Zhang, et al., “Gigaspeech: An evolving, multi-domain asr corpus with 10,000 hours of transcribed audio,” arXiv preprint arXiv:2106.06909, 2021.
  • [19] R. Ardila, M. Branson, K. Davis, M. Henretty, M. Kohler, J. Meyer, R. Morais, L. Saunders, F. M. Tyers, and G. Weber, “Common voice: A massively-multilingual speech corpus,” in Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), 2020, pp. 4211–4215.
  • [20] Shinji Watanabe, Takaaki Hori, Shigeki Karita, Tomoki Hayashi, Jiro Nishitoba, Yuya Unno, Nelson Enrique Yalta Soplin, Jahn Heymann, Matthew Wiesner, Nanxin Chen, et al., “Espnet: End-to-end speech processing toolkit,” arXiv preprint arXiv:1804.00015, 2018.