Leveraging Text Data Using Hybrid Transformer-LSTM Based End-to-End ASR in Transfer Learning

Abstract

In this work, we study how to leverage extra text data to improve low-resource end-to-end ASR under a cross-lingual transfer learning setting. To this end, we extend our prior work [1] and propose a hybrid Transformer-LSTM based architecture. This architecture not only takes advantage of the highly effective encoding capacity of the Transformer network but also benefits from extra text data due to the LSTM-based independent language model network. We conduct experiments on our in-house Malay corpus, which contains limited labeled data and a large amount of extra text. Results show that the proposed architecture outperforms the previous LSTM-based architecture [1] by 24.2% relative word error rate (WER) when both are trained using the limited labeled data. Starting from this, we obtain a further 25.4% relative WER reduction by transfer learning from another, resource-rich language. Moreover, we obtain an additional 13.6% relative WER reduction by boosting the LSTM decoder of the transferred model with the extra text data. Overall, our best model outperforms the vanilla Transformer ASR by 11.9% relative WER. Last but not least, the proposed hybrid architecture offers much faster inference than both the LSTM and Transformer architectures.

Index Terms: cross-lingual transfer learning, transformer, lstm, unpaired text, independent language model

1 Introduction

The end-to-end (E2E) architecture has become a promising strategy for ASR. In this strategy, a single network is employed to directly map acoustic features into a sequence of characters or subwords, without the pronunciation dictionary required by conventional hidden Markov model based systems. Furthermore, the components of the E2E network can be jointly trained with a common objective criterion to achieve overall optimization, which greatly simplifies the ASR development process. Although the simplicity of the E2E ASR architecture is attractive, especially for new languages, it requires a huge amount of labeled training data.

In this work, we focus on E2E ASR for a low-resource language. Specifically, we assume that the target language possesses a limited amount of labeled data to train E2E systems, while an extra text corpus of the language can be easily collected. Additionally, we assume that we possess a large amount of labeled data from another resource-rich source language. This is a common scenario in real-world applications.

The extra text is usually employed to train language models (LM) applied during decoding [2, 3, 4] and re-scoring [5]. Such techniques not only require external language models but also lead to slow inference. To tackle this problem, [1] proposed a long short-term memory (LSTM) based encoder-decoder architecture that allows the LM capacity of the decoder to be improved using the extra text data. However, it used an LSTM encoder, which has limited modeling capacity and slow training. On the other hand, the Transformer [6] has been a promising approach for E2E ASR due to its high modeling capacity and fast training. However, its decoder closely interacts with the encoder output through an encoder-decoder cross-attention, so it is not straightforward to employ extra text data to improve the decoder.

In this work, we propose a hybrid Transformer-LSTM architecture which combines the advantages of [1] and [6]. It not only has the high encoding capacity of the Transformer but also benefits from the extra text data due to the LSTM-based independent language model decoder. To further benefit from the labeled data of another language, we employ cross-lingual transfer learning, a popular approach to the limited-resource problem in ASR [7, 8, 9, 10, 11], on the proposed architecture. Specifically, we first use labeled data of the resource-rich language to train an ASR model and then transfer it to the target language. Lastly, the extra text data is used to boost the decoder of the transferred model.

The paper is organized as follows. Section 2 describes the baseline architectures of [1] and [6]. The proposed techniques are then presented in Section 3. The experimental setup and results are presented in Sections 4 and 5, respectively. Section 6 concludes our work.

2 Baseline architectures

2.1 LSTM-based encoder-decoder architecture

The LSTM-based encoder-decoder architecture [1], denoted as $A_1$ in the rest of this paper, consists of a bidirectional LSTM (BLSTM) encoder and an LSTM-based decoder, as shown in Fig. 1. Let $\langle\mathbf{X}, \mathbf{Y}\rangle$ be a training utterance, where $\mathbf{X}$ is a sequence of acoustic features and $\mathbf{Y}=\{y_1, y_2, \dots, y_{|\mathbf{Y}|}\}$ is a sequence of output units. The encoder acts as an acoustic model which maps the acoustic features $\mathbf{X}$ into an intermediate representation $\mathbf{h}$. Then, the decoder, which consists of an embedding layer, an LSTM and a projection layer, generates one output unit at each decoding step $i$ as follows,

$s_i = \mathrm{LSTM}(s_{i-1}, \mathrm{embedding}(y_{i-1}))$   (1)
$c_i = \mathrm{attention}(\mathbf{h}, s_i)$   (2)
$P(y_i \mid \mathbf{X}, y_{<i}) = \mathrm{softmax}(\mathrm{proj}(s_i) + \mathrm{proj}(c_i))$   (3)

where $c_i$ is the context vector, $s_{i-1}$ and $s_i$ are the decoder hidden states at steps $i-1$ and $i$ respectively, and $\mathrm{embedding}(\cdot)$ and $\mathrm{proj}(\cdot)$ are the embedding and projection layers respectively.

Figure 1: LSTM-based encoder-decoder architecture ($A_1$) [1], where the decoder acts as an independent language model.

From Equation (1), the LSTM is conditioned only on the previous decoder hidden state and the previous output unit. In other words, the LSTM acts as an independent language model that can easily be updated with text-only data [1].
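For concreteness, a minimal PyTorch sketch of one decoding step of the $A_1$ decoder is given below. Layer names, dimensions and the single-head attention are our own illustrative assumptions rather than the exact implementation of [1]; the point is that the LSTM update in Equation (1) never sees the encoder output, so the embedding/LSTM path can also be trained on text-only batches by simply omitting the context term.

import torch.nn as nn
import torch.nn.functional as F

class IndependentLMDecoder(nn.Module):
    """Sketch of the A1 decoder (Equations (1)-(3)); sizes are illustrative."""
    def __init__(self, vocab_size, emb_dim=256, hidden_dim=256, enc_dim=640):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTMCell(emb_dim, hidden_dim)          # Eq. (1): no encoder input
        self.attention = nn.MultiheadAttention(hidden_dim, num_heads=1,
                                               kdim=enc_dim, vdim=enc_dim,
                                               batch_first=True)
        self.proj_s = nn.Linear(hidden_dim, vocab_size)       # proj(s_i)
        self.proj_c = nn.Linear(hidden_dim, vocab_size)       # proj(c_i)

    def step(self, y_prev, state, h_enc=None):
        """One decoding step; with h_enc=None the module behaves as a pure LM."""
        s_i, m_i = self.lstm(self.embedding(y_prev), state)   # Eq. (1)
        logits = self.proj_s(s_i)
        if h_enc is not None:                                 # ASR mode: add acoustic context
            c_i, _ = self.attention(s_i.unsqueeze(1), h_enc, h_enc)   # Eq. (2)
            logits = logits + self.proj_c(c_i.squeeze(1))     # Eq. (3)
        return F.log_softmax(logits, dim=-1), (s_i, m_i)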

2.2 Transformer encoder-decoder architecture

The Transformer was proposed in [6] for sequence-to-sequence modeling in natural language processing tasks and was later adopted for ASR in [12, 13]. The model architecture, denoted as $A_2$, is shown in Fig. 2.

Figure 2: Transformer architecture ($A_2$).

The encoder is shown in the left half of Fig. 2. It consists of $N_e$ encoder blocks, each of which has two sub-blocks: self-attention and a position-wise feed-forward network (FFN). The self-attention employs multi-head attention (MHA), which is a function of three inputs: query $Q$, key $K$ and value $V$. It includes multiple heads which can be processed in parallel. Each head employs dot-product attention ($\mathrm{DotProdAtt}$) as follows,

$\mathrm{DotProdAtt}(Q,K,V)=\mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$   (4)

where $d_k$ is the hidden dimension. In addition, layer normalization [14] and residual connections [15] are applied in each encoder block for effective training.
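For reference, Equation (4) corresponds to the following minimal PyTorch function. This is a plain single-head illustration; the actual model uses ESPnet's multi-head attention modules, which run several such heads in parallel on linearly projected $Q$, $K$ and $V$ and concatenate the results.

import math
import torch

def dot_prod_att(Q, K, V):
    """Q: (B, T_q, d_k), K: (B, T_k, d_k), V: (B, T_k, d_v)."""
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(Q.size(-1))  # QK^T / sqrt(d_k)
    weights = torch.softmax(scores, dim=-1)    # normalize over key positions
    return torch.matmul(weights, V)            # weighted sum of values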

The decoder is shown in the right half of Fig. 2 and consists of $N_d$ decoder blocks. Unlike an encoder block, each decoder block has one additional sub-block, the cross-attention. Its query comes from the output of the previous sub-block, while its key and value come from the encoder output.
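The sketch below of a single decoder block (again with illustrative dimensions and simplified normalization placement, not the exact ESPnet code) makes this coupling explicit: the cross-attention query comes from the block's own hidden states, but its key and value are the encoder output h_enc, so the decoder cannot be run, or trained, on text alone.

import torch.nn as nn

class DecoderBlock(nn.Module):
    """Sketch of one A2 decoder block: self-attention, cross-attention, FFN."""
    def __init__(self, d_model=256, n_heads=4, d_ff=2048):
        super().__init__()
        self.self_att = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_att = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(d_model) for _ in range(3))

    def forward(self, y, h_enc, causal_mask):
        q = self.norm1(y)
        y = y + self.self_att(q, q, q, attn_mask=causal_mask, need_weights=False)[0]
        q = self.norm2(y)
        y = y + self.cross_att(q, h_enc, h_enc, need_weights=False)[0]  # needs encoder output
        return y + self.ffn(self.norm3(y))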

3 Proposed techniques

To exploit extra text data while retaining high modeling capacity, we propose the hybrid Transformer-LSTM architecture in Section 3.1. Then, in Section 3.2, we describe how extra text data is used to improve the proposed architecture under the cross-lingual transfer learning setting.

3.1 The Transformer-LSTM architecture

In this section, we first compare the approaches presented in Sections 2.1 and 2.2. Then, based on the comparison, we propose a novel architecture that takes advantage of both approaches.

Previous work [16] showed that the Transformer not only produces better encoding representations but also trains faster than its LSTM counterpart. First, the Transformer encoder uses dot-product attention (see Equation (4)), which gives each position access to information from all other positions in the same sequence regardless of their distance. In contrast, although in theory the LSTM can model long-range dependencies, in practice it has difficulty capturing dependencies between distant elements [17], which limits its modeling capacity for long sequences such as acoustic signals. Second, by avoiding recurrence entirely, the Transformer has no sequential dependencies within a layer and hence can maximize parallelism during training. In contrast, training an LSTM-based network is slow due to the recurrent nature of the LSTM.

Despite being highly effective, the Transformer decoder is not easy to improve using text-only data. Specifically, the decoder includes the cross-attention sub-block, which is conditioned on the encoder output. In contrast, the LSTM-based decoder (Section 2.1) can easily be boosted using the text data. Another issue of the Transformer decoder is slow inference [18]: to generate an output $y_i$, the decoder needs to process all previous decoding units $y_{1:i-1}$. The LSTM-based decoder, on the other hand, offers faster inference since it only needs the last output unit $y_{i-1}$ to generate $y_i$.

Based on the above comparisons, we propose a hybrid architecture that takes advantage of both the Transformer and LSTM architectures. Specifically, the encoder is taken from the Transformer, while the decoder is taken from the LSTM architecture. The benefits of the proposed architecture lie in two aspects. First, it has high modeling capacity, as well as faster training and decoding. Second, the LSTM-based decoder allows us to easily leverage text data to boost the decoder, yielding improved ASR performance. We denote the proposed architecture as $A_3$ in the rest of this paper.
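A rough sketch of $A_3$ under the same assumptions as the earlier snippets is given below: a standard Transformer encoder (torch.nn.TransformerEncoder as a stand-in for the 12-block ESPnet encoder, and a linear layer standing in for the VGG frontend) feeds the LSTM-based IndependentLMDecoder sketched in Section 2.1.

import torch.nn as nn

class HybridTransformerLSTM(nn.Module):
    """Illustrative A3: Transformer encoder + LSTM independent-LM decoder."""
    def __init__(self, feat_dim, vocab_size, d_model=256, n_heads=4,
                 d_ff=2048, n_blocks=12):
        super().__init__()
        self.frontend = nn.Linear(feat_dim, d_model)      # stand-in for the VGG frontend
        layer = nn.TransformerEncoderLayer(d_model, n_heads, d_ff, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_blocks)
        self.decoder = IndependentLMDecoder(vocab_size, hidden_dim=256, enc_dim=d_model)

    def encode(self, feats):
        return self.encoder(self.frontend(feats))         # h in Section 2.1

    def decode_step(self, y_prev, state, h_enc):
        # Each step needs only the previous output unit and the LSTM state,
        # which is what makes inference faster than the Transformer decoder.
        return self.decoder.step(y_prev, state, h_enc)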

3.2 Exploiting extra text data under cross-lingual transfer learning

To tackle the low-resource training problem, we first perform cross-lingual transfer learning. We start by training E2E models on the source language. We then replace the language-dependent components of the decoder (i.e. the embedding and output projection layers) of the source language with those of the target language. Finally, the models are fine-tuned using the labeled data of the target language. Although transfer learning is not our focus, to achieve the best performance we carefully examine various transfer settings, as presented in Section 5.2. More importantly, we aim to boost the transferred model of the proposed architecture using extra text data of the target language. Fig. 3 describes our process.
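A hedged sketch of the transfer step, using the names from the sketches above (the actual experiments use ESPnet): the source-language encoder, and in one setting the decoder LSTM, is kept, while the language-dependent embedding and projection layers are re-initialized for the target-language BPE vocabulary before fine-tuning.

import torch.nn as nn

def transfer_to_target(source_model, target_vocab_size, transfer_decoder_lstm=False):
    """Replace the language-dependent layers of a trained source model."""
    dec = source_model.decoder
    emb_dim = dec.embedding.embedding_dim
    dec.embedding = nn.Embedding(target_vocab_size, emb_dim)            # new target embedding
    dec.proj_s = nn.Linear(dec.proj_s.in_features, target_vocab_size)   # new output projections
    dec.proj_c = nn.Linear(dec.proj_c.in_features, target_vocab_size)
    if not transfer_decoder_lstm:
        dec.lstm.reset_parameters()   # "Encoder"-only setting: do not reuse source LSTM weights
    return source_model               # then fine-tune on target-language labeled data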

From Fig. 3, the entire process consists of two main steps. In the first step, we merge the extra text and the labeled data to fine-tune the transferred model. This avoids the so-called catastrophic forgetting problem mentioned in [1]. Specifically, at each training iteration, we mix a batch of labeled data consisting of $B_{labeled}$ utterances with a batch of text data consisting of $B_{text}$ utterances to fine-tune the transferred model with the following loss function:

$L_{total}(\theta)=(1-\lambda)L_{ASR}(\theta)+\lambda L_{LM}(\theta_{d})$   (5)

where $\lambda$ denotes an interpolation factor, $\theta$ and $\theta_d$ denote the entire set of E2E parameters and the decoder parameters respectively, and $L_{ASR}(\theta)$ and $L_{LM}(\theta_d)$ denote the ASR loss and the LM loss generated by the labeled data and the text data respectively. In the second step, the model is further fine-tuned with the labeled data of the target language. Similar to [1], we empirically found that the second step is necessary to improve overall performance.
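As an illustration, one fine-tuning iteration of Step 1 could look as follows; asr_nll and decoder_nll are assumed helper functions for the ASR loss over the full network and the LM loss over the decoder alone, and the optimizer choice is an assumption for the sketch.

def fine_tune_step(model, labeled_batch, text_batch, optimizer, lam=0.7):
    """One Step-1 iteration implementing Equation (5); helper losses are assumed."""
    feats, ys = labeled_batch          # B_labeled paired utterances
    text_ys = text_batch               # B_text text-only sentences
    optimizer.zero_grad()
    asr_loss = model.asr_nll(feats, ys)        # L_ASR(theta): full encoder-decoder
    lm_loss = model.decoder_nll(text_ys)       # L_LM(theta_d): decoder run as a pure LM
    total = (1.0 - lam) * asr_loss + lam * lm_loss
    total.backward()                   # the encoder receives gradients only through L_ASR
    optimizer.step()
    return total.item()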

Figure 3: Boosting the transferred model using extra text data of the target language.

4 Experimental setup

4.1 Data

We conduct experiments on our in-house corpus of the Malay language, which consists of limited labeled data plus extra text. We split the labeled data into three sets for training, validation and evaluation; the detailed division is shown in Table 1. We perform speed perturbation based data augmentation [19] on the training data. For the extra text data, we examine two sets: the first set has over 8 million sentences, denoted as $T1$, while the second set is a subset of the first consisting of 2 million sentences, denoted as $T2$.

For the source language, which is English, we use two subsets of the National Speech Corpus (NSC) [20] to train the source models. The first subset, denoted as $S_{200h}$, consists of 200 hours, to which we also apply speed perturbation based data augmentation. The second subset consists of 1000 hours and is denoted as $S_{1000h}$.

Table 1: Detailed division of the labeled Malay data.
Train Dev Test
#Speakers 57 6 6
#Utterances 8500 1785 1957
Length (hours) 20.6 4.3 4.1

4.2 E2E setting

The ESPnet toolkit [21] is used to train our E2E architectures. We use 80 mel-scale filterbank coefficients with pitch as input features, and 500 Byte-Pair Encoding (BPE) units as output units. For all E2E architectures, the acoustic features are first processed by the VGG network [22]. The detailed settings of the architectures are given in Table 2. Each BLSTM layer has 320 cells, while each LSTM layer of the decoder has 256 cells. In each self-attention and cross-attention sub-block, we use 4 heads with a hidden dimension of 2048. The FFN consists of two ReLU activation functions and two affine transforms of size 2048. To allow transfer learning, we use the same settings for both the source and target languages. During the fine-tuning process of Section 3.2, we set $B_{labeled}=30$ and $B_{text}=90$.

Table 2: Settings of the different E2E architectures.
Architecture   #Params   Encoder                Decoder
$A_1$          77.4M     6 BLSTM                1 LSTM
$A_2$          120M      12 (self-att + FFN)    6 (self-att + cross-att + FFN)
$A_3$          81.1M     12 (self-att + FFN)    1 LSTM

5 Results and analysis

The overall ASR performance on the test set after applying the proposed techniques (Section 3) is presented in Table 3. In the following subsections, we describe and elaborate on the results for each proposed technique.

Table 3: The WER (%) on the test set of different E2E models. $S_{1000h}$-Encoder denotes that we use the data $S_{1000h}$ of the source language to train a source model and then transfer only the encoder of the source model to the target language. $T1$ denotes an extra text corpus of the target language that consists of 8 million sentences.
No.   Architecture   Transfer learning setting   Extra text   WER (%)
1     $A_1$          -                           -            18.2
2     $A_2$          -                           -            12.3
3     $A_3$          -                           -            13.8
4     $A_2$          $S_{1000h}$-Encoder         -            10.1
5     $A_3$          $S_{1000h}$-Encoder         -            10.3
6     $A_3$          $S_{1000h}$-Encoder         $T1$         8.9

5.1 Results of different E2E architectures

This section presents the results of the different E2E architectures, i.e. $A_1$, $A_2$ and $A_3$, trained using only the labeled data of the target language. The word error rate (WER) results are reported in the first three rows of Table 3. We also report the decoding speed of these models in Table 4. As we can see, $A_3$ not only outperforms $A_1$ by 24.2% relative WER but also offers 1.5 times faster decoding. This indicates that employing the Transformer for the encoder is very effective. $A_2$ achieves the best WER, but decodes much more slowly than $A_3$.

Table 4: Decoding speed of different E2E architectures.
Architecture   Decoding speed (seconds/utt)
$A_1$          7.5
$A_2$          31.61
$A_3$          5.00

5.2 Results of transfer learning

In this section, we examine the ASR performance of different transfer learning settings. First, we examine the effect of using different amounts of source-language data on the target language's performance. Second, we analyze different levels of transfer: (1) only the $l$ (e.g. $l$ = 3, 6, 9) bottom layers of the encoder are transferred; (2) the entire encoder is transferred; (3) both the encoder and the decoder (except the embedding and projection layers) are transferred. These experiments were conducted only for $A_2$ and $A_3$, since $A_1$ produced the worst results. The results are given in Table 5.

We observe that transferring the entire encoder achieves the best results for both $A_2$ and $A_3$. Additionally, transferring from 1000 hours of source data is noticeably better than transferring from 200 hours with data augmentation. The best results are summarized in rows 4 and 5 of Table 3. It can be seen that cross-lingual transfer learning leads to significant improvements for both $A_2$ and $A_3$. For example, the result in row 5 outperforms that of row 3 by 25.4% relative WER (from 13.8% to 10.3%).

Table 5: ASR performance of different transfer learning settings.
Architecture   Source data    Transferred modules        Dev WER (%)   Test WER (%)
$A_2$          $S_{200h}$     Encoder + Decoder          15.1          11.0
                              Encoder                    14.9          10.4
                              9 bottom encoder layers    16.2          11.6
                              6 bottom encoder layers    16.9          12.3
                              3 bottom encoder layers    17.3          12.6
$A_2$          $S_{1000h}$    Encoder + Decoder          14.2          10.4
                              Encoder                    13.9          10.1
                              9 bottom encoder layers    15.3          11.1
                              6 bottom encoder layers    16.7          12.3
                              3 bottom encoder layers    17.1          12.7
$A_3$          $S_{200h}$     Encoder + Decoder          15.5          11.0
                              Encoder                    15.1          10.8
                              9 bottom encoder layers    16.4          12.2
                              6 bottom encoder layers    17.4          12.8
                              3 bottom encoder layers    17.6          13.1
$A_3$          $S_{1000h}$    Encoder + Decoder          14.6          10.7
                              Encoder                    14.3          10.3
                              9 bottom encoder layers    15.5          11.3
                              6 bottom encoder layers    16.7          12.2
                              3 bottom encoder layers    17.2          12.7

5.3 Results of utilizing extra text data

We first present the ASR performance on the development data when the extra text data $T1$ and $T2$ are used to fine-tune $A_3$ after cross-lingual transfer learning. The results are shown in Fig. 4. We observe that using extra text data is very effective and that $T1$ produces a substantial improvement over $T2$. We also observe that Step 2 (see Fig. 3), i.e. using the labeled data to fine-tune the E2E network, is essential. Finally, $\lambda=0.7$ yields the best results in most cases.

We then apply $T1$ with $\lambda=0.7$ on the test set; the result is reported in row 6 of Table 3. Using the extra text data significantly improves $A_3$, by 13.6% relative WER (from 10.3% to 8.9%). With the help of the extra text data, the proposed architecture outperforms the Transformer baseline by 11.9% relative WER (from 10.1% to 8.9%).

Figure 4: Performance on the dev data after cross-lingual transfer learning when different amounts of extra text data, i.e. $T1$ and $T2$, are used to fine-tune $A_3$. $\lambda$ is the interpolation factor in Equation (5). Step 1 and Step 2 are explained in Fig. 3.

We now investigate the effect of an external language model on the proposed architecture $A_3$. We train a recurrent neural network LM (RNN-LM), a 1-layer LSTM with 1024 cells, on both the transcriptions of the training data and the extra text data $T1$, and then integrate the RNN-LM into the inference process of $A_3$ (rows 5 and 6 in Table 3). Results are reported in Table 6. As we can see, even after fine-tuning with $T1$, $A_3$ still benefits from the external RNN-LM, and we observe a 34.8% relative WER reduction (from 8.9% to 5.8%).

Table 6: ASR performance of $A_3$ (rows 5 and 6 from Table 3) on the test set with and without the RNN-LM.
Row No. (from Table 3)   External LM   WER (%)
5                        -             10.3
5                        + RNN-LM      6.8
6                        -             8.9
6                        + RNN-LM      5.8
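The exact integration method of the RNN-LM is not detailed here; a common choice in ESPnet-style decoding is shallow fusion, i.e. a log-linear interpolation of the E2E decoder score and the RNN-LM score during beam search. A per-step sketch under that assumption, with lm_weight as an assumed hyperparameter:

def fused_scores(asr_log_probs, lm_log_probs, lm_weight=0.3):
    """Combine per-token log-probabilities from the E2E decoder and the RNN-LM."""
    return asr_log_probs + lm_weight * lm_log_probs   # used to rank beam-search hypotheses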

6 Conclusions

In this paper, we first proposed the Transformer-LSTM based architecture which not only takes advantage of the highly effective encoding capacity of the Transformer, but also benefits from the extra text data due to the LSTM-based independent language model decoder. We then examined exploiting extra text data to boost the LSTM decoder under cross-lingual transfer learning. Experimental results show that, with the help of the extra text data, the proposed architecture significantly outperforms baselines. Additionally, the proposed architecture also offers faster decoding.

7 Acknowledgements

This work is supported by the project of the Alibaba-NTU Singapore Joint Research Institute. The computational work for this article was partially performed on resources of the National Supercomputing Centre, Singapore (https://www.nscc.sg).

References

  • [1] V. T. Pham, H. Xu, K. Yerbolat, Z. Zeng, E. S. Chng, C. Ni, B. Ma, and H. Li, “Independent language model architecture for end-to-end ASR,” in Proc. of ICASSP, 2020, pp. 7054–7058.
  • [2] A. Graves and N. Jaitly, “Towards end-to-end speech recognition with recurrent neural networks,” in Proc. of ICML, 2014, pp. 1764–1772.
  • [3] T. Hori, J. Cho, and S. Watanabe, “End-to-end speech recognition with word-based RNN language models,” in Proc. of HLT, 2018, pp. 389–396.
  • [4] Ç. Gülçehre, O. Firat, K. Xu, K. Cho, L. Barrault, H. Lin, F. Bougares, H. Schwenk, and Y. Bengio, “On using monolingual corpora in neural machine translation,” CoRR, vol. abs/1503.03535, 2015.
  • [5] W. Chan, N. Jaitly, Q. V. Le, and O. Vinyals, “Listen, attend and spell: A neural network for large vocabulary conversational speech recognition,” in Proc. of ICASSP, 2016, pp. 4960–4964.
  • [6] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” in Proc. of NIPS, 2017, pp. 5998–6008.
  • [7] J. Huang, J. Li, D. Yu, L. Deng, and Y. Gong, “Cross-language knowledge transfer using multilingual deep neural network with shared hidden layers,” in Proc. of ICASSP, 2013, pp. 7304–7308.
  • [8] G. Heigold, V. Vanhoucke, A. Senior, P. Nguyen, M. Ranzato, M. Devin, and J. Dean, “Multilingual acoustic models using distributed deep neural networks,” in Proc. of ICASSP, 2013, pp. 8619–8623.
  • [9] C. Shan, C. Weng, G. Wang, D. Su, M. Luo, D. Yu, and L. Xie, “Investigating end-to-end speech recognition for Mandarin-English code-switching,” in Proc. of ICASSP, 2019, pp. 6056–6060.
  • [10] S. Tong, P. N. Garner, and H. Bourlard, “Multilingual training and cross-lingual adaptation on CTC-based acoustic model,” CoRR, vol. abs/1711.10025, 2017.
  • [11] S. Dalmia, R. Sanabria, F. Metze, and A. W. Black, “Sequence-based multi-lingual low resource speech recognition,” CoRR, vol. abs/1802.07420, 2018.
  • [12] L. Dong, S. Xu, and B. Xu, “Speech-transformer: A no-recurrence sequence-to-sequence model for speech recognition,” in Proc. of ICASSP, 2018, pp. 5884–5888.
  • [13] S. Karita, X. Wang, S. Watanabe, T. Yoshimura, W. Zhang, N. Chen, T. Hayashi, T. Hori, H. Inaguma, Z. Jiang, M. Someki, N. Yalta, and R. Yamamoto, “A comparative study on Transformer vs RNN in speech applications,” in Proc. of ASRU, 2019, pp. 449–456.
  • [14] L. J. Ba, J. R. Kiros, and G. E. Hinton, “Layer normalization,” CoRR, vol. abs/1607.06450, 2016.
  • [15] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. of CVPR, 2016, pp. 770–778.
  • [16] M. X. Chen, O. Firat, A. Bapna, M. Johnson, W. Macherey, G. Foster, L. Jones, N. Parmar, M. Schuster, Z. Chen, Y. Wu, and M. Hughes, “The best of both worlds: Combining recent advances in neural machine translation,” CoRR, vol. abs/1804.09849, 2018.
  • [17] G. Tang, M. Müller, A. Rios, and R. Sennrich, “Why self-attention? A targeted evaluation of neural machine translation architectures,” in Proc. of EMNLP, 2018, pp. 4263–4272.
  • [18] B. Zhang, D. Xiong, and J. Su, “Accelerating neural transformer via an average attention network,” in Proc. of ACL, 2018, pp. 1789–1798.
  • [19] T. Ko, V. Peddinti, D. Povey, and S. Khudanpur, “Audio augmentation for speech recognition,” in Proc. of INTERSPEECH, 2015, pp. 3586–3589.
  • [20] J. X. Koh et al., “Building the Singapore English national speech corpus,” in Proc. of INTERSPEECH, 2019, pp. 321–325.
  • [21] S. Watanabe, T. Hori, S. Karita, T. Hayashi, J. Nishitoba, Y. Unno, N. E. Y. Soplin, J. Heymann, M. Wiesner, N. Chen et al., “ESPnet: End-to-end speech processing toolkit,” in Proc. of INTERSPEECH, 2018, pp. 2207–2211.
  • [22] S. Kim, T. Hori, and S. Watanabe, “Joint CTC-attention based end-to-end speech recognition using multi-task learning,” in Proc. of ICASSP, 2017, pp. 4835–4839.