Pre-training for Spoken Language Understanding with Joint Textual and Phonetic Representation Learning
Abstract
In the traditional cascading architecture for spoken language understanding (SLU), it has been observed that automatic speech recognition (ASR) errors can be detrimental to the performance of natural language understanding. End-to-end (E2E) SLU models have been proposed to directly map speech input to the desired semantic frame with a single model, hence mitigating ASR error propagation. Recently, pre-training technologies have been explored for these E2E models. In this paper, we propose a novel joint textual-phonetic pre-training approach for learning spoken language representations, aiming to exploit the full potential of phonetic information to improve SLU robustness to ASR errors. We explore phoneme labels as high-level speech features, and design and compare pre-training tasks based on conditional masked language model objectives and inter-sentence relation objectives. We also investigate the efficacy of combining textual and phonetic information during fine-tuning. Experimental results on the spoken language understanding benchmarks Fluent Speech Commands and SNIPS show that the proposed approach significantly outperforms strong baseline models and improves the robustness of spoken language understanding to ASR errors.
Index Terms: spoken language understanding, pre-training, joint text and speech representation learning
1 Introduction
Spoken language understanding (SLU) is a critical component of goal-oriented spoken dialogue systems that power various voice assistants. SLU interprets a spoken query by predicting its intent (intent classification, IC) and its semantic concepts, or slots (slot filling, SF). For example, the intent of the speech command “play a popular song by brian epstein” to a voice assistant is PlayMusic, and the slots are sort:popular, music_item:song, artist:brian epstein. The conventional architecture for SLU is a cascaded paradigm, where an automatic speech recognition (ASR) system converts speech signals to text and a natural language understanding (NLU) system then predicts intent and slots on the ASR output. It has been observed that ASR errors can cause severe performance degradation in the downstream NLU systems [1]. To improve SLU robustness to ASR errors, many prior works exploit multiple ASR hypotheses. [2] developed a reranking approach combining ASR N-best lists and NLU. [3] and [4] exploited word confusion networks (WCN) for call classification and WCN-based conditional random fields (CRF) for SF. [5] used unsupervised word representations incorporating acoustic relationships learned from WCNs for IC. [6] used lattice embeddings computed with an RNN for IC. [7] augmented the manual transcripts of the SLU training data by simulating ASR errors on them, where the ASR confusability functions for error simulation are learned from the specific ASR system in the cascaded SLU. Recently, end-to-end (E2E) approaches have been proposed to address error propagation and to enable joint optimization of ASR and NLU. E2E approaches directly map speech to the desired semantic frame with a single model. Pre-training text or speech representations has been introduced to alleviate data scarcity, including encoding speech and text separately (two-stream models) and jointly (single-stream models). Some E2E SLU works model IC only [8, 9, 10, 11, 12, 13, 14], while others jointly perform IC and SF and optionally generate ASR transcripts [15, 16, 17, 18]. Different from these E2E approaches, this work aims at improving NLU performance on ASR 1-best in the conventional cascaded architecture, considering situations where downstream NLU systems have access only to the ASR 1-best as input, without access to the original audio.
Phonetic information has also been explored to improve robustness to ASR errors, but only with limited effort, such as augmenting word embeddings with phone embeddings [19, 20], using phone boundaries to compress speech features [21], and combining speech feature vectors and phone embeddings [22]. In contrast, our work attempts to exploit the full potential of phonetic information for improving SLU robustness to ASR errors, by learning joint textual-phonetic representations through pre-training and also exploring phonetic information in fine-tuning. The major contributions of this paper are two-fold:
- We propose a single-stream pre-trained model to learn joint textual-phonetic semantic representations for SLU. We design and study a variety of pre-training tasks for this purpose. We also investigate incorporating phonetic features in fine-tuning and the combined effect with the proposed pre-trained models.
- On the SLU benchmarks Fluent Speech Commands (FSC) and SNIPS, the proposed approach consistently improves SLU performance on ASR 1-best and significantly outperforms strong baseline models.
2 Joint Textual-Phonetic Representation Learning
In this section, we first introduce the proposed pre-training approach for learning joint semantic representations from textual and phonetic information. We then describe the approach of combining textual and phonetic information during fine-tuning.
2.1 Pre-trained Model
Figure 1 illustrates the architecture of the proposed pre-training approach. In this study, we use the manual transcripts of the Librispeech 960-hour data [23] and the Fisher corpus (Fisher English Training Speech Part 1: LDC2004S13 and LDC2004T19; Part 2: LDC2005S13 and LDC2005T19) as the pre-training data set, denoted $\mathcal{D}$, for our proposed pre-trained model. We use phoneme labels to represent high-level speech features. Given each sentence $W = w_1, \ldots, w_n$ (where $w_i$ denotes the $i$th word), we construct its phone label sequence $P = p_1, \ldots, p_m$ by looking up the phoneme sequence for each word in a pronunciation dictionary; we use the CMU pronunciation dictionary (http://www.speech.cs.cmu.edu/cgi-bin/cmudict). For words that are not covered by the CMU dictionary, we use a special token UNK to represent their phone sequence; 1.3% of the words in $\mathcal{D}$ receive a UNK label. The phone sequence for words with missing pronunciations could alternatively be generated with a grapheme-to-phoneme (g2p) model. In future work, we plan to explore phone labels generated through forced alignment [9].
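For illustration, a minimal sketch of this dictionary lookup (not the authors' code; the function name and the toy dictionary below are made up):

```python
# Illustrative sketch: build the phone label sequence P for a sentence W by
# looking up each word in a CMU-style pronunciation dictionary. Words missing
# from the dictionary receive the special UNK phone label.
from typing import Dict, List

def build_phone_sequence(words: List[str],
                         cmu_dict: Dict[str, List[str]]) -> List[str]:
    """Return the concatenated phone labels for a tokenized sentence."""
    phones: List[str] = []
    for word in words:
        # e.g. cmu_dict["song"] == ["S", "AO1", "NG"]
        phones.extend(cmu_dict.get(word.lower(), ["UNK"]))
    return phones

# Example usage with a tiny hand-made dictionary
toy_dict = {"play": ["P", "L", "EY1"], "a": ["AH0"], "song": ["S", "AO1", "NG"]}
print(build_phone_sequence(["play", "a", "song"], toy_dict))
# ['P', 'L', 'EY1', 'AH0', 'S', 'AO1', 'NG']
```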
Each pair $(W, P)$ is one input training sample for the pre-trained model. The input is embedded through the input representation layer as the element-wise sum of token embedding, position embedding, and segment embedding. These embeddings are then fed into a multi-layer bidirectional Transformer encoder to learn joint contextualized representations of the textual and phonetic sequences.
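A minimal PyTorch sketch of this input representation layer, assuming words and phones share one token vocabulary and that segment ids 0/1 mark the text/phone segments (both are assumptions for illustration, not details taken from the paper):

```python
# Sketch of the single-stream input representation: element-wise sum of
# token, position, and segment embeddings for the "[CLS] W [SEP] P" layout.
import torch
import torch.nn as nn

class JointInputEmbedding(nn.Module):
    def __init__(self, vocab_size: int, max_len: int, hidden: int = 768):
        super().__init__()
        self.token = nn.Embedding(vocab_size, hidden)   # word and phone tokens (shared vocab here)
        self.position = nn.Embedding(max_len, hidden)
        self.segment = nn.Embedding(2, hidden)          # 0 = text segment, 1 = phone segment

    def forward(self, token_ids: torch.Tensor, segment_ids: torch.Tensor) -> torch.Tensor:
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        # Element-wise sum of the three embeddings, fed to the Transformer encoder
        return self.token(token_ids) + self.position(positions) + self.segment(segment_ids)
```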
2.2 Pre-training Tasks
We design three pre-training tasks for learning joint textual-phonetic semantic representations: masked language modeling conditioned on the phone sequence (condMLM), masked speech modeling conditioned on the word sequence (condMSM), and word-speech alignment (WSA), inspired by UNITER [24], a single-stream image-text representation model.
Masked language modeling conditioned on the phone sequence. We randomly mask tokens in the text sequence $W$ and replace the masked tokens with the special token [MASK]. As in BERT, masking is carried out as 80% [MASK] substitution, 10% random word substitution, and 10% unchanged. The model learns to use the remaining unmasked tokens in $W$ and the entire unmasked phone sequence $P$ to predict the masked tokens, and the goal is to minimize the cross-entropy loss, denoted $\mathcal{L}_{\mathrm{condMLM}}$.
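A sketch of this BERT-style corruption step, assuming a simple token-level implementation (the function and parameter names are illustrative):

```python
# Sketch of the 80/10/10 masking scheme: a fraction of positions is selected,
# then 80% become [MASK], 10% a random token, and 10% are kept unchanged.
import random
from typing import List, Tuple

def mask_tokens(tokens: List[str], vocab: List[str],
                mask_prob: float = 0.15) -> Tuple[List[str], List[int]]:
    corrupted = list(tokens)
    targets = []                      # positions the model must predict
    for i in range(len(tokens)):
        if random.random() < mask_prob:
            targets.append(i)
            r = random.random()
            if r < 0.8:
                corrupted[i] = "[MASK]"
            elif r < 0.9:
                corrupted[i] = random.choice(vocab)
            # else: keep the original token
    return corrupted, targets
```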
Masked speech modeling conditioned on the word sequence. We randomly mask tokens in the phone sequence $P$ and replace them with [MASK]. The model then uses the remaining unmasked phones and the entire unmasked sentence $W$ to predict the masked phones. The optimization is conducted by minimizing the cross-entropy loss of this prediction, denoted $\mathcal{L}_{\mathrm{condMSM}}$. Previous studies show that whole-word masking outperforms WordPiece-based masking. In addition, to avoid introducing noise from aligning the phone sequence of a word to its WordPiece tokenization, we use whole-word masking for both words and phones; that is, we mask all WordPiece tokens of a word at once and, similarly, all phones of a word at once. Considering the asymmetric complexity of predicting words from phones and vice versa, we investigate the oneMod and twoMod masking strategies. In the oneMod strategy, for each input sample we randomly choose either the word sequence or the phone sequence and conduct random masking on the chosen modality only, never masking both the word and phone sequences of the same sample. In contrast, in the twoMod strategy, both the word sequence and the phone sequence are randomly masked.
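The difference between oneMod and twoMod could be sketched as follows; the helper names are hypothetical, and whole-word masking is represented by selecting word indices whose WordPiece tokens and phones are then masked together:

```python
# Sketch of the oneMod vs. twoMod masking strategies at the whole-word level.
import random
from typing import Set, Tuple

def choose_masked_words(num_words: int, mask_prob: float) -> Set[int]:
    return {i for i in range(num_words) if random.random() < mask_prob}

def make_masking_plan(num_words: int, strategy: str,
                      mask_prob: float) -> Tuple[Set[int], Set[int]]:
    """Return (word indices to mask in W, word indices to mask in P)."""
    if strategy == "oneMod":
        # Mask only one modality per sample, chosen at random
        if random.random() < 0.5:
            return choose_masked_words(num_words, mask_prob), set()
        return set(), choose_masked_words(num_words, mask_prob)
    if strategy == "twoMod":
        # Mask both modalities independently
        return (choose_masked_words(num_words, mask_prob),
                choose_masked_words(num_words, mask_prob))
    raise ValueError(f"unknown strategy: {strategy}")
```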
Word-speech alignment. We design a binary classification task to learn word-speech alignment between a text sequence $W$ and a phone sequence $P$. Given the pairs $(W, P)$ of sentences in $\mathcal{D}$ and their phone sequences, we construct “[CLS] W [SEP] P” as positive samples and “[CLS] W [SEP] $P_{\mathrm{rand}}$” as negative samples, where $P_{\mathrm{rand}}$ is randomly sampled from another pair $(W', P')$ with $W' \neq W$. The hidden state of [CLS] is fed into a softmax classifier to decide whether the word sequence and the phone sequence match. The training objective is to minimize the softmax cross-entropy loss, denoted $\mathcal{L}_{\mathrm{WSA}}$.
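A sketch of the positive/negative pair construction, assuming the corpus is given as a list of (word tokens, phone tokens) pairs:

```python
# Sketch of constructing WSA training samples: each sentence paired with its own
# phone sequence (label 1) and with a phone sequence from a different sentence (label 0).
import random

def make_wsa_samples(corpus):
    """corpus: list of (word_tokens, phone_tokens). Yields (input_tokens, label)."""
    for idx, (words, phones) in enumerate(corpus):
        yield ["[CLS]", *words, "[SEP]", *phones], 1            # aligned pair
        other = random.choice([j for j in range(len(corpus)) if j != idx])
        yield ["[CLS]", *words, "[SEP]", *corpus[other][1]], 0  # mismatched pair
```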
The overall training objective is the combination of the objectives of these pre-training tasks. For each positive sample of the WSA task, the loss is $\mathcal{L} = \mathcal{L}_{\mathrm{condMLM}} + \mathcal{L}_{\mathrm{condMSM}} + \mathcal{L}_{\mathrm{WSA}}$. For each negative sample of the WSA task, condMLM and condMSM fall back to MLM and MSM, respectively, and the loss is $\mathcal{L} = \mathcal{L}_{\mathrm{MLM}} + \mathcal{L}_{\mathrm{MSM}} + \mathcal{L}_{\mathrm{WSA}}$.
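A compact sketch of how the per-sample loss might be assembled from these terms (placeholder function, not the authors' implementation):

```python
# Sketch of the per-sample loss combination described above.
def pretraining_loss(is_positive_pair, l_condmlm, l_condmsm, l_mlm, l_msm, l_wsa):
    if is_positive_pair:
        # Word and phone sequences match: cross-modal conditioning is valid
        return l_condmlm + l_condmsm + l_wsa
    # Mismatched pair: fall back to unconditional MLM/MSM plus the alignment loss
    return l_mlm + l_msm + l_wsa
```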

2.3 Fine-tuning for SLU
Fine-tuning for SLU in this work covers two settings: intent classification (IC) only, and jointly performing IC and slot filling (SF) in a multi-task learning framework [25]. We prepend the special token [CLS] to each tokenized sequence in the SLU training set and append [SEP] to it. Given this input sequence to a pre-trained model, the output hidden states are denoted $h_1, \ldots, h_T$, where $h_1$ corresponds to [CLS]. IC is then modeled as:
$$y^{i} = \mathrm{softmax}\big(W^{i}\, f(h_{1}) + b^{i}\big) \qquad (1)$$
where $f(\cdot)$ is a non-linear feed-forward layer with tanh activation. During inference, the intent label is predicted as the $\arg\max$ over $y^{i}$. During IC-only fine-tuning, the model is trained by minimizing the softmax cross-entropy loss of IC.
For jointly performing IC and SF, in addition to modeling IC as Eq. 1, the final hidden states of the other tokens, $h_2, \ldots, h_T$, are fed into a softmax layer to classify over the SF labels in the BIO scheme. To be compatible with WordPiece tokenization, each input word is tokenized with the WordPiece tokenizer, and the hidden state corresponding to its first sub-token is used as the input to the softmax classifier:
$$y_{n}^{s} = \mathrm{softmax}\big(W^{s} h_{n} + b^{s}\big), \quad n \in \{2, \ldots, T\} \qquad (2)$$
where $h_n$ is the hidden state corresponding to the first sub-token of word $x_n$. The joint model is fine-tuned by minimizing the sum of the softmax cross-entropy losses of IC and SF.
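A minimal sketch of the first-sub-token selection for slot filling, assuming precomputed indices of each word's first WordPiece token (names and shapes are illustrative):

```python
# Sketch: gather h_n for the first sub-token of each input word, then classify
# over BIO slot labels with a linear softmax layer.
import torch
import torch.nn as nn

def slot_logits(hidden_states: torch.Tensor,
                word_starts: torch.Tensor,
                slot_classifier: nn.Linear) -> torch.Tensor:
    # hidden_states: (seq_len, hidden); word_starts: (num_words,) indices of first sub-tokens
    first_subtoken_states = hidden_states[word_starts]   # (num_words, hidden)
    return slot_classifier(first_subtoken_states)        # (num_words, num_slot_labels)
```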
We propose an approach to introduce phone labels during SLU fine-tuning. For the input embedding layer, given a word $w$ and the phone sequence $p_1, \ldots, p_m$ representing its pronunciation, the augmented input embedding for $w$ is computed in Eq. 3 as the weighted element-wise sum of the standard input embedding and the phone embeddings (we compared computing the phone embedding as the sum or the mean of the single phone embeddings and observed better SLU performance from sum-pooling):
$$\tilde{e}(w) = e(w) + \lambda \sum_{j=1}^{m} e_{p}(p_j) \qquad (3)$$
where $e_{p}(\cdot)$ denotes a trainable phone embedding; $e(w)$ denotes the standard element-wise sum of token embedding, position embedding, and segment embedding; and $\lambda$ is a hyperparameter optimized on the validation set of each SLU task.
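A PyTorch sketch of Eq. 3, assuming per-word phone ids are zero-padded and sum-pooled before the weighted addition (the module and parameter names are illustrative, and the base embedding module is assumed to return the standard token + position + segment sum):

```python
# Sketch of phone-augmented input embeddings for SLU fine-tuning (Eq. 3).
import torch
import torch.nn as nn

class PhoneAugmentedEmbedding(nn.Module):
    def __init__(self, base_embedding: nn.Module, num_phones: int,
                 hidden: int = 768, lam: float = 0.1):
        super().__init__()
        self.base = base_embedding                     # token + position + segment sum
        self.phone = nn.Embedding(num_phones, hidden, padding_idx=0)
        self.lam = lam                                 # lambda, tuned on the SLU validation set

    def forward(self, token_ids, segment_ids, phone_ids):
        # phone_ids: (batch, seq_len, max_phones_per_word), zero-padded
        phone_sum = self.phone(phone_ids).sum(dim=2)   # sum-pooling over each word's phones
        return self.base(token_ids, segment_ids) + self.lam * phone_sum
```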
Table 1: Dataset statistics.

|  | FSC | Snips |
|---|---|---|
| Train | 23,132 | 13,084 |
| Valid | 3,118 | 700 |
| Test | 3,793 | 700 |
| Intents | 31 | 7 |
| Slot Types | - | 39 |
| WER(%) (valid/test), Sys1 | 39.0/19.2 | 40.8/42.3 [26, 27] |
| WER(%) (valid/test), Sys2 | 36.7/15.2 | - |
Table 2: SLU results on test set ASR 1-best: ICAcc (%) on FSC (two ASR systems, WER 19.2 and 15.2) and ICAcc/semER (%) on Snips, fine-tuned without (w/o PE) and with (w/ PE) phone embeddings.

| Model | FSC ICAcc (WER 19.2), w/o PE | FSC ICAcc (WER 19.2), w/ PE | FSC ICAcc (WER 15.2), w/o PE | FSC ICAcc (WER 15.2), w/ PE | Snips ICAcc, w/o PE | Snips semER, w/o PE | Snips ICAcc, w/ PE | Snips semER, w/ PE |
|---|---|---|---|---|---|---|---|---|
| BERT-Base | 87.6 | 89.0 | 92.4 | 94.4 | 82.1 | 57.1 | 82.7 | 54.6 |
| +MLM 15% | 88.5 | 89.1 | 93.0 | 93.5 | 82.0 | 57.2 | 84.4 | 54.9 |
| +MLM 15%+NSP | 88.2 | 89.5 | 92.1 | 93.8 | 81.1 | 57.3 | 85.0 | 55.5 |
| +condMLM 100%+condMSM 100% (oneMod) | 89.2 | 89.7 | 94.2 | 94.9 | 83.7 | 55.3 | 85.4 | 53.4 |
| +condMLM 30%+condMSM 30% (twoMod) | 89.2 | 90.5 | 93.7 | 95.1 | 82.7 | 55.8 | 83.1 | 56.0 |
| +condMLM 30%+condMSM 30% (twoMod)+WSA | 88.5 | 90.0 | 93.1 | 94.8 | 81.6 | 56.9 | 85.0 | 54.7 |
| +condMLM 100%+MLM 15% (oneMod) | 88.0 | 90.1 | 92.8 | 94.7 | 81.7 | 56.5 | 80.9 | 56.1 |
| +condMSM 100%+MLM 15% (oneMod) | 88.0 | 89.9 | 92.6 | 94.9 | 80.0 | 57.2 | 82.3 | 55.3 |
| +condMLM 100%+condMSM 100%+MLM 15% (oneMod) | 88.1 | 89.4 | 93.3 | 94.4 | 81.3 | 56.7 | 84.1 | 55.7 |
| Oracle BERT-Base (Snips) | - | - | - | - | 93.9 | - | 94.3 | - |
Table 3: Mean reciprocal rank (MRR) for confusion word pair retrieval with trained input embeddings.

| Pre-trained Models | MRR |
|---|---|
| BERT-Base | 0.1012 |
| +MLM 15%+NSP | 0.1180 |
| +condMLM 100%+condMSM 100% (oneMod) | 0.1591 |
| +condMLM 30%+condMSM 30% (twoMod) | 0.1396 |
3 Experiments
3.1 Experimental Setup
We evaluate our proposed approach on two SLU benchmarks: Fluent Speech Commands (FSC) [9] and Snips [28]. FSC includes recordings of 248 unique English command phrases (e.g., “turn the lights off in the kitchen”) spoken to a virtual assistant by 77 speakers. Following prior work [9], the three slot values annotated for each audio file (“action”, “object”, “location”) are combined as the intent of the utterance, and we conduct IC on FSC. We decode 1-best hypotheses for the validation and test sets of FSC with two off-the-shelf Kaldi ASR systems (Sys1: ASpIRE Chain Model, https://kaldi-asr.org/models/m1; Sys2: Librispeech ASR Model, https://kaldi-asr.org/models/m13), using trigram decoding and RNNLM rescoring. Both ASR systems use the same CMU dictionary as the one used for phone label lookup.
The second dataset is Snips [28], collected from the Snips personal voice assistant. Since the original Snips release only comprises text data, without released natural speech, we use the ASR hypotheses for the Snips validation and test sets from [26, 27], where the authors synthesized audio from text with the Google TTS system and decoded it with Kaldi Sys1. Table 1 summarizes the statistics of the datasets.
Evaluation Metrics. We report intent classification accuracy (ICAcc) on the FSC test set ASR 1-best, and both ICAcc and semantic error rate (semER) [17] on the Snips test set ASR 1-best. SemER jointly evaluates IC and SF. We count correct slots ($C$: slot names and values correctly identified), deletion errors ($D$: slot names appearing in the reference but not in the hypothesis), insertion errors ($I$: extraneous slot names in the hypothesis), and substitution errors ($S$: correct slot names in the hypothesis but incorrect slot values, plus IC errors). SemER is then computed as:
$$\mathrm{semER} = \frac{D + I + S}{C + D + S} \qquad (4)$$
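A worked example of Eq. 4 with hypothetical counts:

```python
# Eq. 4 with made-up counts: 7 correct slots, 1 deletion, 1 insertion, 2 substitutions.
def sem_er(correct: int, deletions: int, insertions: int, substitutions: int) -> float:
    return (deletions + insertions + substitutions) / (correct + deletions + substitutions)

print(sem_er(7, 1, 1, 2))  # (1 + 1 + 2) / (7 + 1 + 2) = 0.4
```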
For the baseline pre-trained model, we use English uncased BERT-Base (https://github.com/google-research/bert), pre-trained on the BooksCorpus [29] and English Wikipedia. We further pre-train BERT on different combinations of the condMLM, condMSM, and WSA tasks, as well as the masked language modeling (MLM) and next sentence prediction (NSP) tasks used in BERT pre-training. We then conduct SLU fine-tuning on the pre-trained models for evaluation. We also investigate the efficacy of the oneMod and twoMod masking strategies and different masking percentages for condMLM and condMSM. The maximum sequence length is 256, the batch size is 64, and the number of training steps is 100K. Adam [30] is used for optimization. The initial learning rate is optimized among {1e-4, 5e-5} for pre-training and among {3e-5, 5e-5} for SLU fine-tuning. The dropout probability is 0.1.
3.2 Results and Analysis
SLU Results. Table 2 shows intent classification accuracy (ICAcc) on two sets of ASR 1-best for the FSC test set, with WERs of 19.2 and 15.2, respectively. The baseline ICAccs from fine-tuning BERT-Base are 87.6 and 92.4, comparable to the 90.11 ICAcc reported in [11] from pipelining an E2E ASR system unadapted to SLU datasets with BERT-Base NLU. We further pre-train BERT on our pre-training data using MLM with 15% masking, as well as adding NSP. Since BERT is mostly pre-trained on written text, further pre-training BERT on $\mathcal{D}$, which is composed of speech transcripts, serves as an adaptation to reduce the mismatch between written text and spoken language in the SLU tasks. Our results confirm this hypothesis, as ICAccs on the two sets of ASR 1-best improve from 87.6 to 88.5 and from 92.4 to 93.0. Adding NSP to MLM pre-training does not yield improvement.
For our proposed pre-training model, we initialize with BERT-Base and then further pre-train the model by combining different pre-training tasks. For example, condMLM 100% + condMSM 100% (oneMod) denotes the configuration in which, for each training sample, we first randomly choose whether to mask the speech or the text modality, then mask all tokens of the chosen modality and use the entire sequence of the other modality to recover the masked tokens. We compare SLU results from pre-training with different task combinations on the validation set of each SLU task. We observe that the proposed model achieves significant improvement over the baseline on the validation set, and that the same performance ranking of pre-training configurations carries over to the SLU test set. As shown in Table 2, both condMLM 100% + condMSM 100% (oneMod) and condMLM 30% + condMSM 30% (twoMod) achieve the best ICAcc of 89.2 on the ASR 1-best with WER 19.2 after fine-tuning without phone embeddings, a 1.6% absolute gain over the baseline. Adding phone embeddings during fine-tuning yields an additional 1.3% gain, for an overall 2.9% absolute gain (87.6 to 90.5) over the baseline. Similarly, on the test set with WER 15.2, pre-training with condMLM 100% + condMSM 100% (oneMod) obtains a 1.8% absolute gain; adding phone embeddings during fine-tuning further improves ICAcc by 0.7% absolute, raising the overall improvement to 2.5% absolute, with the best overall gain being 2.7% absolute (92.4 to 95.1). These results demonstrate that the proposed pre-training and fine-tuning approaches significantly and consistently improve SLU performance on ASR hypotheses with different WERs.
Table 2 also shows ICAcc and semER results on the ASR 1-best of the Snips test set (WER 42.3). The baseline ICAcc and semER from fine-tuning BERT-Base are 82.1 and 57.1. Pre-training with condMLM 100% + condMSM 100% (oneMod) achieves ICAcc 83.7 and semER 55.3 after fine-tuning without phone embeddings, absolute gains of 1.6% and 1.8% over the baseline. Adding phone embeddings during fine-tuning yields additional absolute gains of 1.7% and 1.9%, for overall absolute gains of 3.3% (82.1 to 85.4) on ICAcc and 3.7% (57.1 to 53.4) on semER. In all cases, adding phone features during fine-tuning achieves a solid improvement on top of the pre-trained models. One possible explanation is that the pre-trained models are designed to learn generic joint textual-phonetic representations, whereas the proposed fine-tuning approach exploits phone features directly for SLU. Since computing the WSA loss requires both the $W$ and $P$ sequences, computing the condMLM 100% + condMSM 100% (oneMod) and WSA losses on the same training samples is infeasible. We add WSA to condMLM 30% + condMSM 30% (twoMod), but it does not produce gains, probably because learning word-phone sequence alignment is relatively easy and a more difficult WSA task is required. We also evaluate using the concatenation of $W$ and $P$ during fine-tuning (i.e., the same input format as pre-training) instead of adding phone embeddings, but do not observe consistent improvement over adding phone embeddings.
Our proposed pre-training and fine-tuning approaches are agnostic to the ASR systems in cascaded SLU. All of our SLU fine-tuning uses the SLU training set manual transcripts and NLU labels, and the same fine-tuned models are used for inferring intents and slots on the ASR 1-best of the validation and test sets generated by different ASR systems. In contrast, both prior works on improving SLU robustness to ASR errors [26, 27] require training on the output of the specific ASR system. [26] fine-tunes ELMo with a combined language model loss and confusion-aware loss on the ASR 1-best and WCNs of the SLU training set. The fine-tuned ELMo embeddings are used by a biLSTM for SLU and achieve 89.55% ICAcc on the Snips test set ASR output.
The two-stage SLU approach [27] first pre-trains a biLSTM on general-domain text and then further pre-trains the model on ASR lattices of the SLU training set. Their approach achieves 95.37% ICAcc on the Snips test set ASR hypotheses. Both prior approaches learn a significant amount of information about the specific ASR system, which may contribute to the substantial boost in SLU performance. We confirm this hypothesis by fine-tuning BERT-Base on the SLU training set ASR 1-best with NLU labels and evaluating ICAcc on the Snips test set ASR 1-best. This oracle setup achieves 93.9 ICAcc without phone embeddings and 94.3 with phone embeddings.
Model Analysis. To analyze the efficacy of the proposed pre-training model for aligning textual and phonetic representations, we select the 20 most frequent confusion word pairs from the FSC validation set ASR 1-best (WER 36.7%) and exclude pairs containing words not covered by the BERT vocabulary. Each word pair is used for retrieval, with the ASR-hypothesized word as the query and the reference word as the target. We evaluate mean reciprocal rank (MRR) for this retrieval by computing the cosine distance between the trained input embeddings from the pre-trained models. Table 3 shows that, compared to an MRR of 0.1012 from BERT-Base, the proposed model obtained by further pre-training BERT-Base with condMLM 100%+condMSM 100% (oneMod) significantly improves MRR to 0.1591, confirming that the proposed pre-training model significantly reduces the representation distance between acoustically confusable words.
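A sketch of this retrieval probe, assuming an embedding lookup `emb`, a candidate vocabulary, and a list of (ASR-hypothesized word, reference word) confusion pairs; all names here are illustrative:

```python
# Sketch of the MRR probe: for each confusion pair, rank all candidate words by
# cosine similarity to the ASR-hypothesized query (equivalent to ranking by
# ascending cosine distance) and record the reciprocal rank of the reference word.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def mean_reciprocal_rank(pairs, vocab, emb) -> float:
    reciprocal_ranks = []
    for query, reference in pairs:                 # reference is assumed to be in vocab
        ranked = sorted(vocab, key=lambda w: cosine(emb[query], emb[w]), reverse=True)
        reciprocal_ranks.append(1.0 / (ranked.index(reference) + 1))
    return sum(reciprocal_ranks) / len(reciprocal_ranks)
```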
4 Conclusion
We propose a novel pre-training approach to learn joint textual-phonetic representations for SLU. We design and study different pre-training tasks. We also propose incorporating phonetic features in fine-tuning. On the FSC and Snips benchmarks, both the proposed pre-training and fine-tuning approaches consistently improve SLU on ASR 1-best with different WERs, and the gains are additive, achieving overall 2.7% to 3.3% absolute gains in intent accuracy and a 3.7% absolute reduction in semantic error rate over strong baselines. Future work includes exploring more effective speech features and pre-training tasks.
References
- [1] W. Y. Wang, R. Artstein, A. Leuski, and D. R. Traum, “Improving spoken dialogue understanding using phonetic mixture models,” in AAAI, 2011.
- [2] F. Morbini, K. Audhkhasi, R. Artstein, M. Segbroeck, K. Sagae, P. Georgiou, D. R. Traum, and S. Narayanan, “A reranking approach for recognition and classification of speech input in conversational dialogue systems,” in SLT, Proceedings, 2012, pp. 49–54.
- [3] D. Hakkani-Tür, F. Béchet, G. Riccardi, and G. Tür, “Beyond ASR 1-best: Using word confusion networks in spoken language understanding,” Comput. Speech Lang., vol. 20, no. 4, pp. 495–514, 2006. [Online]. Available: https://doi.org/10.1016/j.csl.2005.07.005
- [4] G. Tur and A. Deoras, “Semantic parsing using word confusion networks with conditional random fields,” in Interspeech, 2013.
- [5] P. G. Shivakumar, M. Yang, and P. G. Georgiou, “Spoken language intent detection using confusion2vec,” in Interspeech, G. Kubin and Z. Kacic, Eds. ISCA, 2019, pp. 819–823. [Online]. Available: https://doi.org/10.21437/Interspeech.2019-2226
- [6] F. Ladhak, A. Gandhe, M. Dreyer, L. Mathias, A. Rastrow, and B. Hoffmeister, “Latticernn: Recurrent neural networks over lattices,” in Interspeech. ISCA, 2016, pp. 695–699. [Online]. Available: https://doi.org/10.21437/Interspeech.2016-1583
- [7] E. Simonnet, S. Ghannay, N. Camelin, and Y. Estève, “Simulating ASR errors for training SLU systems,” in LREC, 2018. [Online]. Available: http://www.lrec-conf.org/proceedings/lrec2018/summaries/827.html
- [8] D. Serdyuk, Y. Wang, C. Fuegen, A. Kumar, B. Liu, and Y. Bengio, “Towards end-to-end spoken language understanding,” in ICASSP. IEEE, 2018, pp. 5754–5758. [Online]. Available: http://arxiv.org/abs/1802.08395
- [9] L. Lugosch, M. Ravanelli, P. Ignoto, V. S. Tomar, and Y. Bengio, “Speech model pre-training for end-to-end spoken language understanding,” in Interspeech, G. Kubin and Z. Kacic, Eds. ISCA, 2019, pp. 814–818. [Online]. Available: https://doi.org/10.21437/Interspeech.2019-2396
- [10] L. Sari, S. Thomas, and M. Hasegawa-Johnson, “Training spoken language understanding systems with non-parallel speech and text,” in ICASSP. IEEE, 2020, pp. 8109–8113. [Online]. Available: https://doi.org/10.1109/ICASSP40776.2020.9054664
- [11] P. Wang, L. Wei, Y. Cao, J. Xie, and Z. Nie, “Large-scale unsupervised pre-training for end-to-end spoken language understanding,” in ICASSP. IEEE, 2020, pp. 7999–8003. [Online]. Available: https://doi.org/10.1109/ICASSP40776.2020.9053163
- [12] W. Cho, D. Kwak, J. W. Yoon, and N. S. Kim, “Speech to text adaptation: Towards an efficient cross-modal distillation,” in Interspeech, H. Meng, B. Xu, and T. F. Zheng, Eds. ISCA, 2020, pp. 896–900. [Online]. Available: https://doi.org/10.21437/Interspeech.2020-1246
- [13] M. Radfar, A. Mouchtaris, and S. Kunzmann, “End-to-end neural transformer based spoken language understanding,” in Interspeech. ISCA, 2020.
- [14] B. Sharma, M. C. Madhavi, and H. Li, “Leveraging acoustic and linguistic embeddings from pretrained speech and language models for intent classification,” CoRR, vol. abs/2102.07370, 2021. [Online]. Available: https://arxiv.org/abs/2102.07370
- [15] P. Haghani, A. Narayanan, M. Bacchiani, G. Chuang, N. Gaur, P. J. Moreno, R. Prabhavalkar, Z. Qu, and A. Waters, “From audio to semantics: Approaches to end-to-end spoken language understanding,” in IEEE Spoken Language Technology Workshop. IEEE, 2018, pp. 720–726. [Online]. Available: https://doi.org/10.1109/SLT.2018.8639043
- [16] N. A. Tomashenko, A. Caubrière, Y. Estève, A. Laurent, and E. Morin, “Recent advances in end-to-end spoken language understanding,” in Statistical Language and Speech Processing, ser. Lecture Notes in Computer Science, vol. 11816. Springer, 2019, pp. 44–55. [Online]. Available: https://doi.org/10.1007/978-3-030-31372-2_4
- [17] M. Rao, A. Raju, P. Dheram, B. Bui, and A. Rastrow, “Speech to semantics: Improve ASR and NLU jointly via all-neural interfaces,” in Interspeech, H. Meng, B. Xu, and T. F. Zheng, Eds. ISCA, 2020, pp. 876–880. [Online]. Available: https://doi.org/10.21437/Interspeech.2020-2976
- [18] Y. Qian, X. Bian, Y. Shi, N. Kanda, L. Shen, Z. Xiao, and M. Zeng, “Speech-language pre-training for end-to-end spoken language understanding,” CoRR, vol. abs/2102.06283, 2021. [Online]. Available: https://arxiv.org/abs/2102.06283
- [19] H. Liu, M. Ma, L. Huang, H. Xiong, and Z. He, “Robust neural machine translation with joint textual and phonetic embedding,” in ACL, A. Korhonen, D. R. Traum, and L. Màrquez, Eds. Association for Computational Linguistics, 2019, pp. 3044–3049. [Online]. Available: https://doi.org/10.18653/v1/p19-1291
- [20] X. Li, H. Xue, W. Chen, Y. Liu, Y. Feng, and Q. Liu, “Improving the robustness of speech translation,” CoRR, vol. abs/1811.00728, 2018. [Online]. Available: http://arxiv.org/abs/1811.00728
- [21] E. Salesky, M. Sperber, and A. W. Black, “Exploring phoneme-level speech representations for end-to-end speech translation,” in ACL. Association for Computational Linguistics, 2019, pp. 1835–1841. [Online]. Available: https://doi.org/10.18653/v1/p19-1179
- [22] E. Salesky and A. W. Black, “Phone features improve speech translation,” in ACL. Association for Computational Linguistics, 2020, pp. 2388–2397. [Online]. Available: https://doi.org/10.18653/v1/2020.acl-main.217
- [23] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: An ASR corpus based on public domain audio books,” in ICASSP, 2015, pp. 5206–5210.
- [24] Y. Chen, L. Li, L. Yu, A. E. Kholy, F. Ahmed, Z. Gan, Y. Cheng, and J. Liu, “UNITER: universal image-text representation learning,” in ECCV, ser. Lecture Notes in Computer Science, vol. 12375. Springer, 2020, pp. 104–120. [Online]. Available: https://doi.org/10.1007/978-3-030-58577-8_7
- [25] Q. Chen, Z. Zhuo, and W. Wang, “BERT for joint intent classification and slot filling,” CoRR, vol. abs/1902.10909, 2019. [Online]. Available: http://arxiv.org/abs/1902.10909
- [26] C. Huang and Y. Chen, “Learning ASR-robust contextualized embeddings for spoken language understanding,” in ICASSP. IEEE, 2020, pp. 8009–8013. [Online]. Available: https://doi.org/10.1109/ICASSP40776.2020.9054689
- [27] ——, “Learning spoken language representations with neural lattice language modeling,” in ACL. Association for Computational Linguistics, 2020, pp. 3764–3769. [Online]. Available: https://doi.org/10.18653/v1/2020.acl-main.347
- [28] A. Coucke, A. Saade, A. Ball, T. Bluche, A. Caulier, D. Leroy, C. Doumouro, T. Gisselbrecht, F. Caltagirone, T. Lavril, M. Primet, and J. Dureau, “Snips voice platform: an embedded spoken language understanding system for private-by-design voice interfaces,” CoRR, vol. abs/1805.10190, 2018.
- [29] Y. Zhu, R. Kiros, R. S. Zemel, R. Salakhutdinov, R. Urtasun, A. Torralba, and S. Fidler, “Aligning books and movies: Towards story-like visual explanations by watching movies and reading books,” in 2015 ICCV, 2015, pp. 19–27.
- [30] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” CoRR, vol. abs/1412.6980, 2014.