
Large-scale Transfer Learning for Low-resource Spoken Language Understanding

Abstract

End-to-end Spoken Language Understanding (SLU) models are made increasingly large and complex to achieve state-of-the-art accuracy. However, the increased complexity of a model also introduces a high risk of over-fitting, which is a major challenge in SLU tasks due to the limited available data. In this paper, we propose an attention-based SLU model together with three encoder enhancement strategies to overcome the data sparsity challenge. The first strategy is a transfer-learning approach that improves the feature extraction capability of the encoder: the encoder component is pre-trained with a large amount of Automatic Speech Recognition (ASR) annotated data using the standard Transformer architecture, and the SLU model is then fine-tuned with a small amount of target labelled data. The second strategy adopts multi-task learning: the SLU model and the speech recognition model share the same underlying encoder, which improves robustness and generalization ability. The third strategy, inspired by the Component Fusion (CF) idea, involves a Bidirectional Encoder Representation from Transformers (BERT) model and aims to boost the capability of the decoder with an auxiliary network; it hence reduces the risk of over-fitting and, indirectly, augments the ability of the underlying encoder. Experiments on the FluentAI dataset show that the cross-lingual transfer learning and multi-task strategies improve accuracy by up to 4.52% and 3.89% respectively, compared to the baseline.

*Corresponding author  [email protected]

1 Introduction

A conventional SLU pipeline mainly consists of two components [1]: an Automatic Speech Recognition (ASR) module that generates transcriptions or N-best hypotheses, and a Natural Language Understanding (NLU) module that classifies the transcriptions into intents; in such a pipeline, speech recognition errors are propagated and amplified in the subsequent NLU process. Although the performance of SLU has been significantly improved with the rapid development of end-to-end speech recognition systems [2, 3, 4, 5, 6, 7], it still cannot satisfy application requirements due to the complexity of real-world scenarios.

Not all speech recognition errors harm the SLU module; many of them have no impact on the final performance [8, 9]. The SLU component attends mainly to keywords while discarding most other irrelevant words [10]. Thus, a joint optimization approach can focus the model on improving the transcription accuracy of words related to target events [11, 12]. Recently, many efforts have been dedicated to end-to-end SLU, in which the domain and the intent are predicted directly from the input audio [13, 14, 15, 16, 17, 18, 19]. Previous research has shown that a large amount of data is the determining factor for the excellent performance of a model [14]. However, due to the lack of audio and the ambiguity of intents, it is difficult to obtain sufficient in-domain labeled data. Transfer learning has become a common strategy to address the problem of insufficient data [20, 21, 22]. Different transfer learning strategies have been applied to SLU models, and all of them yield competitive, complementary results [23, 24]. In this paper, this strategy is also applied to amplify the feature extraction capability of the encoder component: the encoder is pre-trained with a large amount of speech recognition labeled data and then transferred to the SLU model.

Recently, [13] proposed and compared various encoder-decoder approaches to optimize each module of SLU in an end-to-end manner, and showed that an intermediate text representation is crucial for SLU and that jointly training the full model is advantageous. Attention-based models have been widely used in speech recognition and provide impressive performance [5, 6, 7, 25, 26, 27]. Inspired by this, we propose a Transformer-based multi-task strategy to incorporate textual information into the SLU model. Since the text information only acts on the decoder component in the speech recognition task, it can be treated as an adaptive regularizer that adjusts the encoder parameters and thereby contributes to improving intent prediction performance. It should be noted that the lack of textual corpora is also a major challenge when training language models. To address this problem, various methods have been proposed to expand corpora over the past decade [28, 29, 30]. In addition, a textual-level transfer learning strategy that merges a pre-trained representation into the decoder is also explored. The pre-trained representation is obtained with the BERT model, which is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers [31].

The encoder and decoder are mutually independent but are connected by the attention block, through which they can be optimized collaboratively during training. To maximize performance, both the encoder and the decoder are optimized with transfer learning strategies. In this paper, we first propose a self-attention based end-to-end SLU structure and apply a cross-lingual transfer learning method to address the problem of insufficient acoustic data. Then we propose a Transformer-based multi-task strategy that conducts intent classification and speech recognition in parallel. Finally, a textual-level transfer learning structure is designed to aggregate the pre-trained BERT model into the decoder component, which improves the feature extraction capability of the decoder and, indirectly, of the encoder.

2 Methodology

In this section, a self-attention based end-to-end SLU model is proposed. Next, a Transformer-based multi-task structure is designed to take intermediate textual information into account. Finally, the CF structure is implemented in the decoder as an enhancement of the auxiliary network for the multi-task structure.

2.1 Self-attention based End-to-end SLU

Self-attention layers have been shown to be superior to recurrent layers in computational complexity when the sequence length is smaller than the representation dimensionality, and they can also yield more interpretable models than convolutional layers [32]. Inspired by these advantages, an attention-based encoder-decoder structure is designed to solve SLU problems. The architecture consists of several stacks of layers. Each layer of the encoder and decoder consists of a multi-head attention module and a position-wise fully connected feed-forward network. A max-pooling layer aggregates the output of the encoder along the time axis, as illustrated in Figure 1. The softmax function is used to estimate the posterior probabilities of intents.

We denote the input acoustic frames as $x=(x_{1},...,x_{T})$, where $x_{t}\in R^{d}$ $(1\leqslant t\leqslant T)$ denotes the log-mel filter-bank (FBank) features used in this work, $d$ is the dimension of the FBank features, and $T$ is the number of frames in $x$. The ground-truth posterior distribution for utterance $u$ is defined as $q^{u}=(q_{1}^{u},...,q_{I}^{u})$, represented in one-hot format. The cross-entropy criterion is used for training, and the cost function $\mathcal{L}_{slu}^{u}$ for each utterance is defined as Equation 1.

\mathcal{L}_{slu}^{u}(\theta)=-\sum_{i=1}^{I}q_{i}^{u}\log p(y_{i}^{u}|x;\theta) (1)

where $u$ is the index of the speech utterance, $\theta$ denotes the model parameters, and $I$ is the number of intents. $y_{i}^{u}$ denotes the $i$-th predicted intent, and $p(y_{i}^{u}|x;\theta)$ is the posterior probability of $y_{i}^{u}$ given $x$ and $\theta$.
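As a concrete illustration, the following PyTorch sketch shows one way to realize the structure above: a Transformer encoder over FBank frames, max-pooling along the time axis, and an intent classifier trained with the cross-entropy loss of Equation 1. The layer counts, dimensions, and intent inventory size are illustrative assumptions rather than the exact configuration used in this paper.

```python
# PyTorch sketch of the attention-based SLU baseline (Section 2.1).
# Layer counts, dimensions, and the number of intents are assumptions.
import torch
import torch.nn as nn


class SLUClassifier(nn.Module):
    def __init__(self, feat_dim=320, d_model=512, n_heads=8,
                 n_layers=3, n_intents=31):
        super().__init__()
        self.input_proj = nn.Linear(feat_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=1024,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.classifier = nn.Linear(d_model, n_intents)

    def forward(self, fbank):                      # fbank: (B, T, feat_dim)
        h = self.encoder(self.input_proj(fbank))   # (B, T, d_model)
        h = h.max(dim=1).values                    # max-pool over the time axis
        return self.classifier(h)                  # intent logits (B, n_intents)


# Equation 1: cross-entropy between the one-hot intent label and the
# softmax over the intent logits (CrossEntropyLoss applies the softmax).
model = SLUClassifier()
fbank = torch.randn(4, 200, 320)                   # 4 utterances, 200 frames
intent = torch.randint(0, 31, (4,))
loss_slu = nn.functional.cross_entropy(model(fbank), intent)
```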

2.2 Encoder Augmentation Strategies

2.2.1 Cross-lingual Pre-training

Human languages share some commonality in both acoustic and phonetic aspects, so features extracted from some languages can be shared with other languages at certain levels of abstraction. [33] showed that adapting English-trained models on Hungarian data yields substantial performance gains over models trained only with Hungarian data. Inspired by this, we study cross-lingual transfer learning for the attention-based SLU model. It is achieved by pre-training the encoder with speech from a language different from the target language, and then fine-tuning the encoder-decoder model with a small amount of target annotated data.

The key approach is to first train a Transformer-based speech recognition model with a large amount of rich-resource speech and word-level transcriptions, and then migrate the well-trained encoder component to the intent model. This is feasible because the encoder in SLU maps the source acoustic features to a high-dimensional representation and relies on large amounts of data for better representation capability, just as in speech recognition. Acoustic transfer learning makes it possible to transfer the representation capability of an encoder trained with rich-resource data to an intent classification task with insufficient data. In this work, we adopt the encoder from speech recognition for intent recognition directly and explore its effectiveness.
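The sketch below illustrates this transfer step under two assumptions: the ASR checkpoint ("asr_transformer.pt") stores its encoder weights under an "encoder." prefix, and the SLU model reuses the SLUClassifier sketch from Section 2.1. It also shows the fixed and fine-tuned variants compared later in Section 3.2.

```python
# Hypothetical transfer of a pre-trained ASR encoder into the SLU model.
import torch

slu_model = SLUClassifier()                       # the sketch from Section 2.1

asr_state = torch.load("asr_transformer.pt", map_location="cpu")
encoder_state = {k[len("encoder."):]: v
                 for k, v in asr_state.items()
                 if k.startswith("encoder.")}
slu_model.encoder.load_state_dict(encoder_state)

# "Fix" variant: freeze the transferred encoder and train only the rest.
for p in slu_model.encoder.parameters():
    p.requires_grad = False

# "Fine-tune" variant: leave requires_grad=True and update the whole model
# with the small amount of target intent data.
```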

2.2.2 Multi-task Training

The multi-task structure consists of three components: an encoder module for acoustic representation, a decoder for the speech recognition task, and another decoder for the intent prediction task. The intent prediction decoder is placed directly after the acoustic encoder, which is a compromise compared with the conventional SLU pipeline, since the text predicted by the speech recognition module can be inaccurate. The multi-task structure is illustrated in Figure 1.

Figure 1: Structures for the base model and augmentation strategies: (1) the attention-based SLU model (left); (2) the left encoder together with the right decoder forms the basic Transformer structure; (3) the intent classification model together with the Transformer forms the multi-task structure.

In this work, the intent prediction task aims to map the acoustic feature sequence into a semantic space and treats intent prediction as a semantic classification task. A latent operation in this procedure is translating the sequence of acoustic features into text, just as in speech recognition. Speech recognition and intent prediction therefore follow the same procedure in translating acoustic features into a high-level semantic representation, so the multi-task architecture is designed to share the same acoustic representation between speech recognition and SLU, and the two tasks are optimized jointly. Since our ultimate goal is to predict intents directly from the input acoustic features, the speech recognition component can be thought of as a regularizer for the SLU task, offering an inductive bias to it.

Figure 2: Structure of the speech recognition decoder in the auxiliary network. It consists of two sections: (a) shows the decoder structure used in both the encoder pre-training and multi-task strategies; (b) is used for the BERT fusion strategy, which boosts the linguistic extraction capacity of the decoder.

The same attention-based model as in Section 2.1 is used for intent prediction. In order to perform intent prediction and speech recognition in parallel, an additional stack of self-attention layers and a linear layer followed by a softmax classification layer are coupled with the encoder to output the posterior probabilities for speech recognition. As illustrated in Figure 1, this model consists of two sub-models: an attention-based intent prediction sub-model with only acoustic features as input, and a speech recognition sub-model that accepts both the acoustic representation from the encoder and the text input to the decoder. The encoder in the bottom-left area together with the decoder component, which is detailed in Figure 2(a), forms the typical Transformer architecture. During training, the loss function for speech recognition uses the cross-entropy criterion, and the output labels are represented in one-hot format. The loss function for each utterance in the speech recognition task is described in Equations 2 and 3.

\mathcal{L}_{asr}^{u}(\theta)=\sum_{t=1}^{T}\mathcal{L}_{asr}^{t}(\theta) (2)
\mathcal{L}_{asr}^{t}(\theta)=-\sum_{v=1}^{V}q_{v}^{t}\log p(y_{v}^{t}|x;y^{<t};\theta) (3)

where $x=(x_{1},...,x_{T})$ denotes the input acoustic features, $\theta$ the model parameters, $T$ the text length of each utterance, and $V$ the vocabulary size of speech recognition. $y_{v}^{t}$ is the predicted token at time $t$, while $y^{<t}$ denotes the partial token sequence before $t$. The ground-truth label distribution for the speech recognition task at time $t$, $q^{t}=(q_{1}^{t},...,q_{V}^{t})$, is represented in one-hot format. The loss function of the composite system is the combination of the SLU loss and the speech recognition loss with an interpolation weight $\lambda\in[0,1]$, as shown in Equation 4.

\mathcal{L}^{u}(\theta)=\mathcal{L}_{slu}^{u}(\theta)+\lambda\mathcal{L}_{asr}^{u}(\theta) (4)

Both the SLU and speech recognition tasks update the encoder parameters. In principle, we should emphasize the SLU model and the intent training data, since our ultimate goal is intent prediction; this is achieved by adjusting the parameter $\lambda$ to scale the effect of speech recognition. Involving the speech recognition model leads to several advantages. First, since the amount of annotated data for the SLU task is insufficient, the encoder can produce more representative acoustic features when trained with speech recognition data. Second, more robust features are extracted when the encoder serves two tasks instead of one, which also efficiently mitigates the over-fitting problem.
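The sketch below illustrates how the joint objective of Equation 4 could be computed for one batch, assuming the shared encoder, intent classifier, and ASR decoder are exposed as submodules named encoder, classifier, and asr_decoder (hypothetical names) and that the ASR decoder is teacher-forced on the reference text.

```python
# Sketch of the joint objective in Equation 4; submodule names are assumptions.
import torch.nn.functional as F


def multitask_loss(model, fbank, intent, tokens, lam=1.0, pad_id=0):
    enc_out = model.encoder(model.input_proj(fbank))          # shared encoder

    # Intent branch: max-pool over time, then classify (Equation 1).
    intent_logits = model.classifier(enc_out.max(dim=1).values)
    loss_slu = F.cross_entropy(intent_logits, intent)

    # ASR branch: predict tokens[1:] from tokens[:-1] (Equations 2-3).
    asr_logits = model.asr_decoder(tokens[:, :-1], enc_out)   # (B, T, V)
    loss_asr = F.cross_entropy(asr_logits.transpose(1, 2),    # (B, V, T)
                               tokens[:, 1:], ignore_index=pad_id)

    # Equation 4: interpolate the two losses with weight lambda.
    return loss_slu + lam * loss_asr
```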

2.2.3 BERT Fusion

As noted above, the lack of an in-domain text corpus is a major challenge when training language models. To address this problem, text-level transfer learning strategies have recently been explored. [34] proposed a component fusion method to incorporate an externally trained neural network language model into an attention-based speech recognition system, with significant improvements. Inspired by that, we merge a pre-trained representation into the decoder to improve performance as well.

BERT is a conceptually simple and empirically powerful model that has been shown to outperform many other architectures on many NLP tasks; it is the first fine-tuning based representation model to achieve state-of-the-art performance on a large suite of sentence-level and token-level tasks [31]. In this paper, we apply the BERT model to the multi-task structure to extract a more powerful linguistic representation and thereby improve intent prediction, as shown in Figure 2(b). BERT fusion only affects the speech recognition task, enabling it to output more precise text predictions, so it can be thought of as an indirect way of enhancing encoder performance. The training procedure is similar to the multi-task training method, but the text input is given to both the decoder and the BERT model.
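A minimal sketch of this fusion is shown below, assuming a frozen Hugging Face BERT model and a simple concatenation-plus-projection fusion operator; the exact fusion block follows the CF recipe of [34] and is not specified here, and the alignment of decoder and BERT token positions is also an assumption.

```python
# Sketch of fusing a frozen BERT representation into the ASR decoder states.
import torch
import torch.nn as nn
from transformers import BertModel

bert = BertModel.from_pretrained("bert-base-uncased")
for p in bert.parameters():                 # BERT parameters stay fixed
    p.requires_grad = False

fusion_proj = nn.Linear(512 + 768, 512)     # decoder dim + BERT hidden size


def fuse_with_bert(decoder_states, input_ids, attention_mask):
    # decoder_states: (B, T, 512); BERT output: (B, T, 768).
    # Both are assumed to use the same tokenization so positions align.
    with torch.no_grad():
        bert_out = bert(input_ids=input_ids,
                        attention_mask=attention_mask).last_hidden_state
    fused = fusion_proj(torch.cat([decoder_states, bert_out], dim=-1))
    return fused                             # fed to the output softmax layer
```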

3 Experimental setups

3.1 Dataset

In the experiments, two datasets are used to train and test the different structures. The FluentAI dataset described in [35] is used to train and evaluate the baseline model and the SLU model with the different strategies. As shown in Table 1, this dataset is sampled as 16 kHz single-channel wav files. Each audio clip contains a single command and is labeled with three slots: action, object, and location. There are 248 different phrases with a total of 19 hours of audio. The second dataset, shown in Table 2, is the open-source Mandarin speech corpus AISHELL-ASR0009-OS1, which is used to pre-train the encoder component. It consists of 178 hours of speech recorded by 400 speakers from different accent areas in China.

Table 1: FluentAI Speech Command dataset
Split Speakers Utterances Hours
Train 77 23,132 14.7
Valid 10 3,118 1.9
Test 10 3,793 2.4
Total 97 30,043 19.0

3.2 Results and Analysis

All experiments use 80-dimensional FBank features with a frame length of 25 ms and a frame shift of 10 ms. Mean and variance normalization is applied at the utterance level; the features are then down-sampled by a factor of 3, and 4 consecutive vectors are stacked.
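For reference, the following sketch shows one plausible implementation of this front-end; the exact stacking and padding conventions are assumptions.

```python
# Sketch of the acoustic front-end: per-utterance mean/variance normalization,
# downsampling by a factor of 3, and stacking of 4 consecutive frames.
import numpy as np


def prepare_features(fbank):                          # fbank: (T, 80)
    fbank = (fbank - fbank.mean(axis=0)) / (fbank.std(axis=0) + 1e-8)
    fbank = fbank[::3]                                # downsample by 3
    n = (len(fbank) // 4) * 4                         # drop the ragged tail
    return fbank[:n].reshape(n // 4, 4 * fbank.shape[1])


feats = prepare_features(np.random.randn(1000, 80))
print(feats.shape)                                    # (83, 320)
```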

The SLU model described in Section 2.1 is treated as the baseline. The results in Table 3 show that the performance of the SLU model has a strong correlation with the number of parameters, which is attributable to the small training dataset and the complexity of the model. The best accuracy of 91.91% is achieved when the encoder has 3 layers and the decoder 6 layers. These settings are adopted in the subsequent experiments to compare the different enhancement strategies.

Table 2: AISHELL-ASR0009-OS1 dataset
Split Hours Male Female
Train 150 161 179
Valid 10 12 28
Test 5 13 7
Total 165 186 214
Table 3: Results of various configures of SLU model
$N_{enc/dec}$ $N_{head}$ $N_{k/v}$ $d_{m/i}$ Acc
2/0 2 32 256/512 88.67
3/0 8 64 256/512 86.68
3/0 8 64 512/1024 90.38
3/6 8 64 512/1024 91.91
Table 4: Intent prediction accuracy for different strategies
Methodologies Tune/Fix Scales Accuracy
Baseline - - 91.91
EP Fix - 94.86
EP Fine-tune - 93.25
MT - 0.1 92.41
MT - 0.5 95.25
MT - 1.0 95.28
MT and BF Fix 1.0 95.49
EP and MT Fine-tune 1.0 96.07
EP, MT and BF Fine-tune, Fix 1.0 94.91

EP: Encoder Pre-training. MT: Multi-task. BF: BERT Fusion

Figure 3: Validation Intent Losses for the Baseline, Multi-task with Encoder Pre-training, and Multi-task with Encoder Pre-training and BERT Fusion.

Cross-lingual transfer learning is implemented by first training a Transformer-based speech recognition model with 150 hours of AISHELL data, and then transferring the well-trained encoder to the SLU model directly. Two experiments, fixing the encoder parameters and fine-tuning them, are conducted to compare their performance. The results in Table 4 indicate that both strategies improve the performance of the SLU model, which means that an encoder trained on a different language can be migrated to the target language in the acoustic space. Table 4 also shows that a larger improvement of 3.21% is obtained when the encoder parameters are fixed. This reflects the simplicity of the FluentAI dataset: training tends to over-fit when more parameters are updated. If the encoder were trained with more data, it would have more robust generalization capability.

The multi-task experiment is also implemented with the FluentAI dataset. Table 4 gives results for different speech recognition scales and indicates that the best accuracy of 95.28% is obtained when the scale is set to 1.0. This shows that the speech recognition model can bring benefits to SLU given an appropriate scale. In practice it is difficult to balance the parameter $\lambda$; the main point is that we want the auxiliary task to promote the shared part of the two tasks in a data-driven manner, or to act as a regularizer for the SLU task. The scale 1.0 is applied in the following experiments.

The BERT fusion strategy is conducted on top of the multi-task structure. The BERT model consists of 12 layers, each with 768 hidden units and 12 heads, and has about 110M parameters. The parameters of the BERT model are fixed in all subsequent experiments. Table 4 shows that this strategy gives 3.58% and 0.21% improvements over the baseline and the multi-task method respectively, which indicates that the BERT model is capable of improving the performance of the SLU model.

In addition, different combinations of these strategies are explored. Table 4 shows that the combination of cross-lingual pre-training and multi-task strategies obtains an accuracy of 96.07%, and the combination of all three strategies gives 94.91%. Both produce better performance than the baseline. Figure 3 depicts the validation intent loss over training epochs; both compound strategies obtain lower losses and converge faster. Theoretically, the combination of all three strategies should give the best performance. However, the experiments show that cross-lingual encoder pre-training with the multi-task strategy improves accuracy the most. We attribute this to data sparsity: models are difficult to train well with limited data, and the sparsity of labeled data is usually accompanied by over-fitting, which complicates tuning and optimization during training.

4 Conclusion

In this paper, we proposed an attention-based end-to-end SLU model and evaluated different augmentation strategies based on it. We showed that cross-lingual encoder pre-training, the multi-task strategy, and BERT fusion all improve intent classification performance. These enhancement strategies can also be extended to other areas to improve their performance. Due to the limited data, the model is prone to over-fitting and sensitive to the model parameters. More investigation on how to efficiently address data sparsity in model training will be conducted in future work.

5 Acknowledgement

This paper is supported by the National Key Research and Development Program of China under grants No. 2018YFB1003500, No. 2018YFB0204400 and No. 2017YFB1401202. The corresponding author is Jianzong Wang from Ping An Technology (Shenzhen) Co., Ltd.

References

  • [1] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” in Advances in neural information processing systems, 2014, pp. 3104–3112.
  • [2] H. Inaguma, J. Cho, M. K. Baskar, T. Kawahara, and S. Watanabe, “Transfer learning of language-independent end-to-end asr with language model fusion,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2019, pp. 6096–6100.
  • [3] N. Moritz, T. Hori, and J. Le Roux, “Triggered attention for end-to-end speech recognition,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2019, pp. 5666–5670.
  • [4] Y. Jia, M. Johnson, W. Macherey, R. J. Weiss, Y. Cao, C.-C. Chiu, N. Ari, S. Laurenzo, and Y. Wu, “Leveraging weakly supervised data to improve end-to-end speech-to-text translation,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2019, pp. 7180–7184.
  • [5] N. Moritz, T. Hori, and J. L. Roux, “Streaming automatic speech recognition with the transformer model,” in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2020.
  • [6] H. Miao, G. Cheng, C. Gao, P. Zhang, and Y. Yan, “Transformer-based online ctc/attention end-to-end speech recognition architecture,” in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2020.
  • [7] H. Inaguma, Y. Gaur, L. Lu, J. Li, and Y. Gong, “Minimum latency training strategies for streaming sequence-to-sequence asr,” in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2020.
  • [8] S. Bhosale, I. Sheikh, S. H. Dumpala, and S. K. Kopparapu, “End-to-End Spoken Language Understanding: Bootstrapping in Low Resource Scenarios,” in Proc. Interspeech 2019, 2019, pp. 1188–1192.
  • [9] R. Masumura, T. Tanaka, A. Ando, H. Kamiyama, T. Oba, S. Kobashikawa, and Y. Aono, “Improving Conversation-Context Language Models with Multiple Spoken Language Understanding Models,” in Proc. Interspeech 2019, 2019, pp. 834–838.
  • [10] A. Ray, Y. Shen, and H. Jin, “Robust spoken language understanding via paraphrasing,” in Proc. Interspeech 2018, 2018, pp. 3454–3458.
  • [11] Y. Li, X. Zhao, W. Xu, and Y. Yan, “Cross-lingual multi-task neural architecture for spoken language understanding,” in Proc. Interspeech 2018, 2018, pp. 566–570.
  • [12] R. Gupta, A. Rastogi, and D. Hakkani-Tür, “An efficient approach to encoding context for spoken language understanding,” in Proc. Interspeech 2018, 2018, pp. 3469–3473.
  • [13] P. Haghani, A. Narayanan, M. Bacchiani, G. Chuang, N. Gaur, P. Moreno, R. Prabhavalkar, Z. Qu, and A. Waters, “From audio to semantics: Approaches to end-to-end spoken language understanding,” in 2018 IEEE Spoken Language Technology Workshop (SLT).   IEEE, 2018, pp. 720–726.
  • [14] D. Serdyuk, Y. Wang, C. Fuegen, A. Kumar, B. Liu, and Y. Bengio, “Towards end-to-end spoken language understanding,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2018, pp. 5754–5758.
  • [15] Y.-P. Chen, R. Price, and S. Bangalore, “Spoken language understanding without speech recognition,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2018, pp. 6189–6193.
  • [16] V. Renkens et al., “Capsule networks for low resource spoken language understanding,” arXiv preprint arXiv:1805.02922, 2018.
  • [17] P. Wang, L. Wei, Y. Cao, J. Xie, and Z. Nie, “Large-scale unsupervised pre-training for end-to-end spoken language understanding,” in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2020, pp. 7999–8003.
  • [18] Y. Huang, H. Kuo, S. Thomas, Z. Kons, K. Audhkhasi, B. Kingsbury, R. Hoory, and M. Picheny, “Leveraging unpaired text data for training end-to-end speech-to-intent systems,” in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 7984–7988.
  • [19] P. Wang, L. Wei, Y. Cao, J. Xie, and Z. Nie, “Large-scale unsupervised pre-training for end-to-end spoken language understanding,” in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 7999–8003.
  • [20] M. Xie, N. Jean, M. Burke, D. Lobell, and S. Ermon, “Transfer learning from deep features for remote sensing and poverty mapping,” in Thirtieth AAAI Conference on Artificial Intelligence, 2016.
  • [21] Z. Huang, Z. Pan, and B. Lei, “Transfer learning with deep convolutional neural network for sar target classification with limited labeled data,” Remote Sensing, vol. 9, no. 9, p. 907, 2017.
  • [22] C. Tan, F. Sun, T. Kong, W. Zhang, C. Yang, and C. Liu, “A survey on deep transfer learning,” in International conference on artificial neural networks.   Springer, 2018, pp. 270–279.
  • [23] N. Tomashenko, A. Caubrière, and Y. Estève, “Investigating adaptation and transfer learning for end-to-end spoken language understanding from speech,” 2019.
  • [24] A. Caubrière, N. Tomashenko, A. Laurent, E. Morin, N. Camelin, and Y. Estève, “Curriculum-based transfer learning for an effective end-to-end spoken language understanding and domain portability,” in Proc. Interspeech 2019, 2019, pp. 1198–1202.
  • [25] P. Wang, J. Cui, C. Weng, and D. Yu, “Large Margin Training for Attention Based End-to-End Speech Recognition,” in Proc. Interspeech 2019, 2019, pp. 246–250.
  • [26] J. Li, X. Wang, Y. Li et al., “The speechtransformer for large-scale mandarin chinese speech recognition,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2019, pp. 7095–7099.
  • [27] O. Hrinchuk, M. Popova, and B. Ginsburg, “Correction of automatic speech recognition with transformer sequence-to-sequence model,” in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 7074–7078.
  • [28] E. Dikici and M. Saraçlar, “Semi-supervised and unsupervised discriminative language model training for automatic speech recognition,” Speech Communication, vol. 83, pp. 54–63, 2016.
  • [29] I. Sutskever, J. Martens, and G. E. Hinton, “Generating text with recurrent neural networks,” in Proceedings of the 28th international conference on machine learning (ICML-11), 2011, pp. 1017–1024.
  • [30] A. Zgank, “Cross-lingual speech recognition between languages from the same language family,” Proceedings of the Romanian Academy Series A-mathematics Physics Technical Sciences Information Science, vol. 20, no. 2, pp. 184–191, 2019.
  • [31] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” NAACL, 2018.
  • [32] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in neural information processing systems, 2017, pp. 5998–6008.
  • [33] L. Tóth, J. Frankel, G. Gosztolya, and S. King, “Cross-lingual portability of mlp-based tandem features–a case study for english and hungarian,” 2008.
  • [34] C. Shan, C. Weng, G. Wang, D. Su, M. Luo, D. Yu, and L. Xie, “Component fusion: Learning replaceable language model component for end-to-end speech recognition system,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2019, pp. 5361–5635.
  • [35] L. Lugosch, M. Ravanelli, P. Ignoto, V. S. Tomar, and Y. Bengio, “Speech Model Pre-training for End-to-End Spoken Language Understanding,” in Proc. Interspeech 2019, 2019, pp. 814–818.