Conversational Speech Recognition by Learning
Conversation-level Characteristics
Abstract
Conversational automatic speech recognition (ASR) is the task of recognizing conversational speech that involves multiple speakers. Unlike sentence-level ASR, conversational ASR can naturally take advantage of conversation-specific characteristics such as role preference and topical coherence. This paper proposes a conversational ASR model that explicitly learns conversation-level characteristics under the prevalent end-to-end neural framework. The highlights of the proposed model are twofold. First, a latent variational module (LVM) is attached to a Conformer-based encoder-decoder ASR backbone to learn role preference and topical coherence. Second, a topic model is adopted to bias the outputs of the decoder towards words in the predicted topics. Experiments on two Mandarin conversational ASR tasks show that the proposed model achieves a maximum 12% relative character error rate (CER) reduction.
Index Terms— Conversational ASR, end-to-end ASR, latent variational module, topic-related rescoring
1 Introduction
A typical automatic speech recognition (ASR) system works at the sentence level: it is trained on sentence-level speech-text pairs and recognizes speech at the (short) utterance level. In contrast, conversational ASR has great potential to exploit characteristics specific to multi-speaker conversation. Role preferences, such as style and emotion, affect the characteristics of the current conversation [1]. Topical coherence, i.e., the tendency of words that are semantically related to one or more underlying topics to co-occur in a conversation, and other conversation-level phenomena have also received widespread attention [2]. Previous works have explored long-context language models [3], longer input features [4, 5], context attention mechanisms [6] and other methods [7] to implicitly learn contextual information in conversations [8, 7]. However, they do not explicitly exploit the inherent characteristics of conversations, such as role preference, topical coherence and speaker turn-taking, and learning such characteristics in an explicit way may further improve the performance of conversational ASR.
In this paper, we propose a conversational ASR model that learns conversational characteristics through a Conditional Variational Auto-Encoder (CVAE) [9] and a topic model [10]. Inspired by [1], we use a role-customized variational module and a topic-customized variational module to obtain representations of role preference and topical coherence, respectively. Additionally, a Combined Topic Model (CombinedTM) [10], which uses contextualized document embeddings and has a stronger ability to express topical coherence, is used to rescore the top-k outputs of the ASR model.
We carry out experiments on Mandarin two-speaker telephone conversations. Specifically, results on two datasets, HKUST [11] and DDT, show that the proposed method achieves a maximum relative character error rate (CER) reduction of 12%. Moreover, the proposed method can in principle be applied to conversations involving more speakers, such as multi-party meetings, as well as to other languages.
2 Related Work
End-to-end ASR models are becoming increasingly popular due to their excellent performance and relative ease of construction. As the most influential family of sequence-to-sequence models, which adopt multi-head attention to capture global information within a sequence, the Transformer [12] and its variants [13, 14] have recently received much attention due to their superior performance on a wide range of tasks including ASR [15, 16, 17, 18, 19, 20, 21, 12, 22].
A common idea for applying Transformer-based models to long sequential tasks such as conversational ASR is to model long-context information. Recently, long-context end-to-end models that can learn information across sentence boundaries have drawn much interest in long-sequence prediction [23, 24], machine translation [25, 1] and speech recognition [26, 8, 7]. In [6], a cross-attention end-to-end speech recognizer is designed to address speaker turn-taking, while a CVAE-based model for conversational machine translation is proposed in [1].
3 The Proposed Method
As shown in Figure 1, the proposed model consists of a speech encoder, a latent variational module (LVM), a decoder and a rescoring module. First, speech input features and dialog embeddings are fed into the speech encoder and the text encoder, respectively. Then, latent vectors derived from the variational modules are sent into the decoder to characterize topic and role information. At training time, the latent vectors are drawn from the posterior networks of the variational modules, while at inference they are drawn from the prior networks. Finally, a topic model is used to rescore the outputs of the decoder. Each module of the proposed conversational ASR model is elaborated below.

3.1 Input Representation
At the $t$-th sentence of a conversation, the input of our model consists of three parts: the speech feature $X_t$ of the current sentence, the target text $Y_t$, and the contextual input feature, which comprises a role context $C^{\mathrm{role}}_t$ and a topic context $C^{\mathrm{topic}}_t$. Here, $C^{\mathrm{role}}_t$ denotes the historical transcripts of the current speaker, and $C^{\mathrm{topic}}_t$ denotes all historical transcripts in the current conversation, e.g., $C^{\mathrm{role}}_t = \{Y_{t-2}, Y_{t-4}, \dots\}$ and $C^{\mathrm{topic}}_t = \{Y_{t-1}, Y_{t-2}, \dots\}$. The text data is processed in a format in which the two speakers take turns sentence by sentence, so the role preference of the other speaker can be expressed analogously. To distinguish different sentences, we add a start symbol and an end symbol at the beginning and end of each sentence. All text inputs are then represented as word embedding vectors.
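To make the context construction concrete, the following minimal Python sketch (our illustration, not the authors' code; the function name, the `role_len`/`topic_len` arguments and the `<sos>`/`<eos>` strings are assumptions) builds the role and topic contexts for the $t$-th turn under the alternating-speaker setting described above.

```python
# Illustrative sketch: build the role context (same-speaker history) and the
# topic context (full history) for turn t, assuming strictly alternating speakers.
from typing import List, Tuple

SOS, EOS = "<sos>", "<eos>"

def build_contexts(transcripts: List[str], t: int,
                   role_len: int = 2, topic_len: int = 3) -> Tuple[List[str], List[str]]:
    """Return (role_context, topic_context) for the t-th sentence (0-based)."""
    # The same speaker spoke at t-2, t-4, ...; keep the most recent `role_len` turns.
    role_ctx = [transcripts[i] for i in range(t - 2, -1, -2)][:role_len]
    # The topic context is simply the most recent `topic_len` turns from both speakers.
    topic_ctx = [transcripts[i] for i in range(t - 1, -1, -1)][:topic_len]
    # Mark sentence boundaries before embedding, as described in Sec. 3.1.
    wrap = lambda s: f"{SOS} {s} {EOS}"
    return [wrap(s) for s in role_ctx], [wrap(s) for s in topic_ctx]

# Example: turns alternate between speaker A and speaker B.
turns = ["A: hello", "B: hi there", "A: how are you", "B: fine thanks"]
role_ctx, topic_ctx = build_contexts(turns, t=3)
print(role_ctx)   # ['<sos> B: hi there <eos>']
print(topic_ctx)  # the most recent three turns, newest first
```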
3.2 Speech Encoder
Recently, Gulati et al. combined Transformers and convolutional neural networks into the Conformer [13], which simultaneously captures local and global contextual information in ASR tasks and achieves superior performance. Here, we stack Conformer blocks as our speech encoder. Specifically, each Conformer block includes a multi-head self-attention module (MHSA), a convolution module (CONV) and a feed-forward module (FFN). Assuming the input of the $i$-th block is $x_i$, the operations in this block can be expressed as:
$\hat{x}_i = x_i + \mathrm{MHSA}(x_i)$  (1)

$\tilde{x}_i = \hat{x}_i + \mathrm{CONV}(\hat{x}_i)$  (2)

$x_{i+1} = \tilde{x}_i + \mathrm{FFN}(\tilde{x}_i)$  (3)
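For illustration, the following PyTorch sketch implements the simplified block of Eqs. (1)-(3); the full Conformer [13] additionally uses macaron-style half-step feed-forward modules, layer normalization and relative positional encoding. All sizes below are assumptions, not the paper's settings.

```python
# A minimal PyTorch sketch of the simplified Conformer block in Eqs. (1)-(3).
import torch
import torch.nn as nn

class SimplifiedConformerBlock(nn.Module):
    def __init__(self, d_model: int = 256, n_heads: int = 4, kernel_size: int = 15):
        super().__init__()
        self.mhsa = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Depthwise 1-D convolution over time captures local context.
        self.conv = nn.Sequential(
            nn.Conv1d(d_model, d_model, kernel_size,
                      padding=kernel_size // 2, groups=d_model),
            nn.SiLU(),
        )
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.SiLU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, T, d_model)
        x = x + self.mhsa(x, x, x, need_weights=False)[0]      # Eq. (1)
        x = x + self.conv(x.transpose(1, 2)).transpose(1, 2)   # Eq. (2)
        x = x + self.ffn(x)                                    # Eq. (3)
        return x

x = torch.randn(2, 100, 256)          # a batch of 2 utterances, 100 frames each
y = SimplifiedConformerBlock()(x)     # same shape as the input
```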
3.3 Latent Variational Module
The latent variational module consists of a text encoder and two specific VAEs: a role VAE and a topic VAE. Each VAE consists of multiple transformer layers, a posterior network and a prior network, as shown in the left half of Figure 1. The two VAEs characterize role preference and topical coherence in conversations by learning role-specific latent vectors $z^{\mathrm{role}}_t$ and topic-specific latent vectors $z^{\mathrm{topic}}_t$.
Role VAE. The structure of the text encoder (TEnc) is a standard transformer encoder [12]. The intermediate representations of role preference and topical coherence are generated by the same text encoder:
$h^{\mathrm{role}}_t = \mathrm{TEnc}(\mathrm{Wd}(C^{\mathrm{role}}_t))$  (4)

$h^{\mathrm{topic}}_t = \mathrm{TEnc}(\mathrm{Wd}(C^{\mathrm{topic}}_t))$  (5)
where Wd stands for the word embedding operation. Mean-pooling across time is then applied to $h^{\mathrm{role}}_t$ and $h^{\mathrm{topic}}_t$, whose sequence lengths equal the lengths of the corresponding context sentences. We use the historical text of the same speaker in previous turns of the current dialog to generate the role variational representation $z^{\mathrm{role}}_t$, which follows an isotropic Gaussian distribution [27],
$z^{\mathrm{role}}_t \sim p(z^{\mathrm{role}}_t \mid C^{\mathrm{role}}_t) = \mathcal{N}\big(\mu^{\mathrm{role}}_t, (\sigma^{\mathrm{role}}_t)^2 \mathbf{I}\big)$  (6)

where I denotes the identity matrix and

$\mu^{\mathrm{role}}_t = \mathrm{MLP}(h^{\mathrm{role}}_t)$  (7)

$\sigma^{\mathrm{role}}_t = \mathrm{Softplus}(\mathrm{MLP}(h^{\mathrm{role}}_t))$  (8)
Here, MLP denotes a linear layer and Softplus is the activation function.
At training time, the posterior distribution is conditioned on the sentences related to the current role together with the current transcript. Through a KL divergence term, the prior network learns a role-specific distribution by approximating the posterior network [9]. The distribution of the posterior network is given by
$z^{\mathrm{role}}_t \sim q(z^{\mathrm{role}}_t \mid C^{\mathrm{role}}_t, Y_t) = \mathcal{N}\big(\tilde{\mu}^{\mathrm{role}}_t, (\tilde{\sigma}^{\mathrm{role}}_t)^2 \mathbf{I}\big)$  (9)

where

$\tilde{\mu}^{\mathrm{role}}_t = \mathrm{MLP}(\tilde{h}^{\mathrm{role}}_t)$  (10)

$\tilde{\sigma}^{\mathrm{role}}_t = \mathrm{Softplus}(\mathrm{MLP}(\tilde{h}^{\mathrm{role}}_t))$  (11)
Here, $\tilde{h}^{\mathrm{role}}_t$ denotes the vectors computed by the transformer layers of the posterior network. The conditional probability is learned by the posterior network during training; at inference, we instead draw the latent vector from the prior network, which avoids depending on the recognition result of the current sentence and on its possible deviation from the true transcript.
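A minimal sketch of the Gaussian prior/posterior heads of Eqs. (6)-(11), assuming a 100-dimensional latent space as in Sec. 4.2; the class and variable names are ours, not the paper's.

```python
# Sketch of a Gaussian head: a linear layer predicts the mean and a second
# linear layer followed by Softplus predicts the standard deviation.
import torch
import torch.nn as nn

class GaussianHead(nn.Module):
    def __init__(self, d_in: int = 256, d_latent: int = 100):
        super().__init__()
        self.mu = nn.Linear(d_in, d_latent)
        self.sigma = nn.Sequential(nn.Linear(d_in, d_latent), nn.Softplus())

    def forward(self, h: torch.Tensor):
        # h: mean-pooled context representation, shape (B, d_in)
        mu, sigma = self.mu(h), self.sigma(h)
        z = mu + sigma * torch.randn_like(sigma)   # reparameterization trick
        return z, mu, sigma

prior = GaussianHead()       # sees only the context (inference path)
posterior = GaussianHead()   # sees context plus current transcript (training path)
h_ctx = torch.randn(8, 256)
z_role, mu_p, sigma_p = prior(h_ctx)
```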
Topic VAE. For topical coherence, we use a structure similar to the above to extract the topic representation $z^{\mathrm{topic}}_t$:
$z^{\mathrm{topic}}_t \sim p(z^{\mathrm{topic}}_t \mid C^{\mathrm{topic}}_t) = \mathcal{N}\big(\mu^{\mathrm{topic}}_t, (\sigma^{\mathrm{topic}}_t)^2 \mathbf{I}\big)$  (12)

where I denotes the identity matrix and

$\mu^{\mathrm{topic}}_t = \mathrm{MLP}(h^{\mathrm{topic}}_t)$  (13)

$\sigma^{\mathrm{topic}}_t = \mathrm{Softplus}(\mathrm{MLP}(h^{\mathrm{topic}}_t))$  (14)
At training time, the prior network learns the distribution of topical coherence by approximating the posterior network. The distribution of the posterior network is given by
$z^{\mathrm{topic}}_t \sim q(z^{\mathrm{topic}}_t \mid C^{\mathrm{topic}}_t, Y_t) = \mathcal{N}\big(\tilde{\mu}^{\mathrm{topic}}_t, (\tilde{\sigma}^{\mathrm{topic}}_t)^2 \mathbf{I}\big)$  (15)

where

$\tilde{\mu}^{\mathrm{topic}}_t = \mathrm{MLP}(\tilde{h}^{\mathrm{topic}}_t)$  (16)

$\tilde{\sigma}^{\mathrm{topic}}_t = \mathrm{Softplus}(\mathrm{MLP}(\tilde{h}^{\mathrm{topic}}_t))$  (17)

and $\tilde{h}^{\mathrm{topic}}_t$ denotes the vectors computed by the transformer layers of the topic posterior network.
3.4 Decoder
We use an attention decoder in the proposed model. As shown in Figure 1, the latent variables are obtained from the posterior networks during training and from the prior networks at inference. A transformer layer is used to merge these latent vectors into the decoder states:
$\tilde{s}_t = \mathrm{Transformer}(s_t, z^{\mathrm{role}}_t, z^{\mathrm{topic}}_t)$  (18)
where $s_t$ is the hidden state of the decoder. We then feed $\tilde{s}_t$ into a softmax layer to obtain the probabilities of the target characters.
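The following is one possible (assumed) realization of Eq. (18): the projected latent vectors are prepended to the decoder states and fused with a single transformer layer. The paper's exact fusion may differ; shapes and names are illustrative.

```python
# Hedged sketch of merging role/topic latent vectors with decoder states.
import torch
import torch.nn as nn

d_model, d_latent, vocab = 256, 100, 3653     # 3653 output characters for HKUST
proj = nn.Linear(d_latent, d_model)           # map latents into the model dimension
fuse = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
out = nn.Linear(d_model, vocab)

s = torch.randn(8, 20, d_model)               # decoder states: (B, T_dec, d_model)
z_role = torch.randn(8, d_latent)
z_topic = torch.randn(8, d_latent)

# Prepend the two latent vectors as extra "tokens" so self-attention can mix
# them into every decoder position, then drop them again.
latents = torch.stack([proj(z_role), proj(z_topic)], dim=1)   # (B, 2, d_model)
fused = fuse(torch.cat([latents, s], dim=1))[:, 2:, :]        # (B, T_dec, d_model)
logits = out(fused)            # logits; a softmax gives the character probabilities
```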
3.5 Training Objectives
We adopt a two-stage training strategy. We first train a sentence-level speech recognition model with the cross-entropy objective:
$\mathcal{L}_{\mathrm{CE}} = -\log P(Y_t \mid X_t)$  (19)
Then, we finetune the model by minimizing $\mathcal{L}_{\mathrm{CE}}$ together with the following objective:
$\mathcal{L}_{\mathrm{LVM}} = -\,\mathbb{E}\big[\log P(Y_t \mid X_t, z^{\mathrm{role}}_t, z^{\mathrm{topic}}_t)\big] + \mathrm{KL}\big(q(z^{\mathrm{role}}_t \mid C^{\mathrm{role}}_t, Y_t)\,\big\|\,p(z^{\mathrm{role}}_t \mid C^{\mathrm{role}}_t)\big) + \mathrm{KL}\big(q(z^{\mathrm{topic}}_t \mid C^{\mathrm{topic}}_t, Y_t)\,\big\|\,p(z^{\mathrm{topic}}_t \mid C^{\mathrm{topic}}_t)\big)$  (20)
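The KL terms in Eq. (20) have a closed form for the diagonal Gaussians produced by the prior and posterior networks; a sketch, with illustrative tensor shapes:

```python
# Closed-form KL divergence between two diagonal Gaussians, as used for the
# KL terms in Eq. (20).
import torch

def kl_diag_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    """KL( N(mu_q, sigma_q^2) || N(mu_p, sigma_p^2) ), summed over latent dims."""
    var_q, var_p = sigma_q ** 2, sigma_p ** 2
    kl = torch.log(sigma_p / sigma_q) + (var_q + (mu_q - mu_p) ** 2) / (2 * var_p) - 0.5
    return kl.sum(dim=-1)   # one KL value per utterance in the batch

mu_q, sigma_q = torch.zeros(8, 100), torch.ones(8, 100)
mu_p, sigma_p = torch.zeros(8, 100), torch.ones(8, 100)
print(kl_diag_gaussians(mu_q, sigma_q, mu_p, sigma_p))  # zeros: identical distributions
```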
Table 1. CER (%) on HKUST and the DDT dev/test sets for different component combinations ("Y" = enabled).

| Model | ExtLM | RoleVAE | TopicVAE | AttRes | TopicRes | HKUST | DDT/dev | DDT/test |
|---|---|---|---|---|---|---|---|---|
| Conformer | - | - | - | - | - | 20.25 | 23.43 | 22.71 |
| | Y | - | - | - | - | 20.45 | 23.23 | 22.12 |
| | - | Y | - | - | - | 19.94 | 22.70 | 22.13 |
| | - | - | Y | - | - | 19.81 | 22.62 | 22.36 |
| | - | Y | Y | - | - | 19.46 | 22.35 | 22.06 |
| | - | Y | Y | Y | - | 19.32 | 20.35 | 20.13 |
| | - | Y | Y | Y | Y | 19.19 | 20.02 | 19.96 |
| | Y | Y | Y | Y | Y | 19.31 | 20.12 | 21.05 |
| H-Transformer | - | - | - | - | - | 20.14 | 23.01 | 22.53 |
3.6 Topic Model Rescoring
We rescore the outputs of the ASR model during attention rescoring [28]. Specifically, we classify the conversations in the training set into $K$ topics with the topic model CombinedTM; each topic is represented by a set of keywords, and the keyword sets of different topics do not overlap. The top-n hypotheses generated by the speech recognition model are sent to the topic model, which is trained on the transcripts of the speech dataset, to obtain a $K$-dimensional vector $p = [p_1, \dots, p_K]$ of topic probabilities for each hypothesis. For the $j$-th word $y_j$ of a hypothesis, if $y_j$ appears among the keywords of the $k$-th topic, its score is recalculated as
$\hat{s}(y_j) = s(y_j) + p_k$  (21)
During attention rescoring, we add the topic score to the attention score:

$s_{\mathrm{final}} = s_{\mathrm{att}} + s_{\mathrm{topic}}$  (22)

where $s_{\mathrm{att}}$ is the score calculated by the rescoring decoder and $s_{\mathrm{topic}}$ aggregates the word-level scores of Eq. (21).
We then use the new scores to reorder the n-best hypotheses, and the sentence with the highest score is taken as the final recognition result.
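To summarize the rescoring procedure, here is an illustrative Python sketch (the data structures and the simple additive boost are our assumptions, not the authors' implementation) that applies Eqs. (21)-(22) to an n-best list:

```python
# Topic-biased n-best rescoring: words that fall in a topic's keyword set get
# their scores boosted by that topic's probability, and hypotheses are reranked.
from typing import Dict, List, Set

def topic_rescore(hypotheses: List[str],
                  att_scores: List[float],
                  topic_probs: List[List[float]],      # per-hypothesis, K dims
                  topic_keywords: Dict[int, Set[str]]) -> str:
    best_hyp, best_score = None, float("-inf")
    for hyp, s_att, probs in zip(hypotheses, att_scores, topic_probs):
        s_topic = 0.0
        for word in hyp.split():
            for k, keywords in topic_keywords.items():
                if word in keywords:                   # Eq. (21): boost by p_k
                    s_topic += probs[k]
                    break                              # keyword sets are disjoint
        s_final = s_att + s_topic                      # Eq. (22)
        if s_final > best_score:
            best_hyp, best_score = hyp, s_final
    return best_hyp

# Toy usage with two hypotheses and two topics.
print(topic_rescore(["book a flight", "look a flight"],
                    [-1.2, -1.1],
                    [[0.9, 0.1], [0.9, 0.1]],
                    {0: {"flight", "book"}, 1: {"weather"}}))
```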
4 Experiments
4.1 Dataset
We conduct our experiments on two Mandarin conversational datasets, HKUST [11] and DATATANG (DDT). The HKUST dataset contains 200 hours of speech; its dev set is used to evaluate recognition results. The DDT dataset contains 350 hours of speech; its dev and test sets are used for evaluation.
We obtain the topic boundary and speaker information of each dialog round from the additional tags of the data, which are used to distinguish speakers and to detect topic changes. For each corpus, the detailed configurations of the acoustic features and the Conformer model are the same as in the ESPnet Conformer recipes [29]. We use 3653 and 3126 characters as output units for HKUST and DDT, respectively.
4.2 Implementation Details
We train our models with ESPnet [29]. Speed perturbation with ratios 0.9, 1.0 and 1.1, together with SpecAugment [30], is used for data augmentation. The baseline models are trained at the independent sentence level, without speaker or context information.
In our experiments, we use a 2-layer text encoder and a 2-layer transformer as the feature extractor of the VAEs, as shown in the left of Figure 1. Latent variables with 100 dimensions are used to represent the speaker and topic information, respectively. In the rescoring experiments, we divide the conversations in HKUST into 50 topics; for DDT, we use only 3 topics, since its topics are highly homogeneous. In addition, a session-level RNNLM [2] trained on the training-set text is applied.
We also reproduce a comparison model, the Hierarchical Transformer [7] (H-Transformer), which consists of a text encoder with 4 transformer layers and a Conformer ASR module, in our setup.
4.3 Results
Table 1 shows the results of our experiments. Both the role VAE and the topic VAE improve the final recognition accuracy, and combining them achieves a further improvement.
Since the HKUST dataset contains open-domain topics, the session-level language model makes the final recognition result worse on HKUST. In addition, the topic-based rescoring has a positive effect on both datasets, although its gain on the open-domain HKUST set is smaller than on DDT, which exhibits more obvious topical coherence.
In general, after adding the variational module and the rescoring module, the recognition performance improves considerably, resulting in a relative 12% CER reduction on the DDT set.
4.4 Context Length Analysis
In a conversation, even on the same topic, the speaker's speaking habits and the topics under discussion change as the number of conversation rounds increases. At the same time, more recent sentences may contain historical information that is more helpful for recognizing the current sentence. Therefore, we explore the role context length and the topic context length of the VAE inputs on the HKUST dataset.
As shown in Tables 2 and 3, we design experiments to explore the impact of context length on model performance, and find that the proposed model reaches its lowest CERs with a role context length of 2 and a topic context length of 3.
Table 2. CER (%) on HKUST with different role context lengths.

| Role Context Length | 1 | 2 | 3 |
|---|---|---|---|
| CER (%) | 20.01 | 19.94 | 19.99 |

Table 3. CER (%) on HKUST with different topic context lengths.

| Topic Context Length | 1 | 2 | 3 |
|---|---|---|---|
| CER (%) | 20.05 | 19.96 | 19.81 |
5 Conclusion
This paper proposes a novel model for learning conversation-level characteristics, including role preference and topical coherence, in conversational speech recognition. We design a latent variational module (LVM) that contains two specific VAEs to learn role preference and topical coherence. On this basis, a topic model is used to rescore the top-k outputs of the ASR model, biasing the results towards words used in the predicted topics. Experimental results on conversational ASR tasks indicate that the proposed method effectively improves ASR performance.
References
- [1] Yunlong Liang, Fandong Meng, Yufeng Chen, Jinan Xu, and Jie Zhou, “Modeling bilingual conversational characteristics for neural chat translation,” in ACL, 2021.
- [2] Wayne Xiong, Lingfeng Wu, Jun Zhang, and Andreas Stolcke, “Session-level language modeling for conversational speech,” in EMNLP, 2018, pp. 2764–2768.
- [3] Kazuki Irie, Albert Zeyer, Ralf Schlüter, and Hermann Ney, “Training language models for long-span cross-sentence evaluation,” in ASRU. IEEE, 2019, pp. 419–426.
- [4] Takaaki Hori, Niko Moritz, Chiori Hori, and Jonathan Le Roux, “Transformer-based long-context end-to-end speech recognition.,” in Interspeech, 2020, pp. 5011–5015.
- [5] Takaaki Hori, Niko Moritz, Chiori Hori, and Jonathan Le Roux, “Advanced long-context end-to-end speech recognition using context-expanded transformers,” arXiv preprint arXiv:2104.09426, 2021.
- [6] Suyoun Kim, Siddharth Dalmia, and Florian Metze, “Cross-attention end-to-end asr for two-party conversations,” arXiv preprint arXiv:1907.10726, 2019.
- [7] Ryo Masumura, Naoki Makishima, et al., “Hierarchical transformer-based large-context end-to-end asr with large-context knowledge distillation,” in ICASSP. IEEE, 2021, pp. 5879–5883.
- [8] Suyoun Kim and Florian Metze, “Dialog-context aware end-to-end speech recognition,” in SLT. IEEE, 2018, pp. 434–440.
- [9] Kihyuk Sohn, Honglak Lee, and Xinchen Yan, “Learning structured output representation using deep conditional generative models,” Advances in neural information processing systems, vol. 28, pp. 3483–3491, 2015.
- [10] Federico Bianchi, Silvia Terragni, and Dirk Hovy, “Pre-training is a hot topic: Contextualized document embeddings improve topic coherence,” in ACL, 2021, pp. 759–766.
- [11] Yi Liu, Pascale Fung, Yongsheng Yang, Christopher Cieri, Shudong Huang, and David Graff, “Hkust/mts: A very large scale mandarin telephone speech corpus,” in ISCSLP. Springer, 2006, pp. 724–735.
- [12] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin, “Attention is all you need,” in NIPS, 2017.
- [13] Anmol Gulati, James Qin, Chung-Cheng Chiu, et al., “Conformer: Convolution-augmented transformer for speech recognition,” in Interspeech. ISCA, 2020, pp. 2613–2617.
- [14] Xiong Wang, Sining Sun, Lei Xie, and Long Ma, “Efficient conformer with prob-sparse attention mechanism for end-to-end speech recognition,” in Interspeech. ISCA, 2021, pp. 4578–4582.
- [15] Qiang Wang, Bei Li, Tong Xiao, Jingbo Zhu, Changliang Li, Derek F Wong, and Lidia S Chao, “Learning deep transformer models for machine translation,” in ACL, 2019, pp. 1810–1822.
- [16] Alessandro Raganato, Jörg Tiedemann, et al., “An analysis of encoder representations in transformer-based machine translation,” in EMNLP. The Association for Computational Linguistics, 2018.
- [17] Linhao Dong, Shuang Xu, and Bo Xu, “Speech-transformer: a no-recurrence sequence-to-sequence model for speech recognition,” in ICASSP. IEEE, 2018, pp. 5884–5888.
- [18] Shigeki Karita, Nanxin Chen, et al., “A comparative study on transformer vs rnn in speech applications,” in ASRU. IEEE, 2019, pp. 449–456.
- [19] Haoneng Luo, Shiliang Zhang, Ming Lei, and Lei Xie, “Simplified self-attention for transformer-based end-to-end speech recognition,” in SLT. IEEE, 2021, pp. 75–81.
- [20] Chung-Cheng Chiu, Tara N Sainath, Yonghui Wu, Rohit Prabhavalkar, Patrick Nguyen, Zhifeng Chen, Anjuli Kannan, Ron J Weiss, Kanishka Rao, Ekaterina Gonina, et al., “State-of-the-art speech recognition with sequence-to-sequence models,” in ICASSP. IEEE, 2018, pp. 4774–4778.
- [21] William Chan, Navdeep Jaitly, Quoc Le, and Oriol Vinyals, “Listen, attend and spell: A neural network for large vocabulary conversational speech recognition,” in ICASSP. IEEE, 2016, pp. 4960–4964.
- [22] Pengcheng Guo, Florian Boyer, et al., “Recent developments on espnet toolkit boosted by conformer,” in ICASSP. IEEE, 2021, pp. 5874–5878.
- [23] Iz Beltagy, Matthew E Peters, and Arman Cohan, “Longformer: The long-document transformer,” arXiv preprint arXiv:2004.05150, 2020.
- [24] Haoyi Zhou, Shanghang Zhang, Jieqi Peng, et al., “Informer: Beyond efficient transformer for long sequence time-series forecasting,” in AAAI, 2021.
- [25] Longyue Wang, Zhaopeng Tu, Andy Way, and Qun Liu, “Exploiting cross-sentence context for neural machine translation,” in EMNLP, 2017.
- [26] Ryo Masumura, Tomohiro Tanaka, Takafumi Moriya, Yusuke Shinohara, Takanobu Oba, and Yushi Aono, “Large context end-to-end automatic speech recognition via extension of hierarchical recurrent encoder-decoder models,” in ICASSP. IEEE, 2019, pp. 5661–5665.
- [27] Tianming Wang and Xiaojun Wan, “T-cvae: Transformer-based conditioned variational autoencoder for story completion.,” in IJCAI, 2019, pp. 5233–5239.
- [28] Zhuoyuan Yao, Di Wu, Xiong Wang, et al., “Wenet: Production oriented streaming and non-streaming end-to-end speech recognition toolkit,” in Interspeech. ISCA, 2021.
- [29] Shinji Watanabe, Takaaki Hori, et al., “Espnet: End-to-end speech processing toolkit,” arXiv preprint arXiv:1804.00015, 2018.
- [30] Daniel S Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D Cubuk, and Quoc V Le, “Specaugment: A simple data augmentation method for automatic speech recognition,” in Interspeech. ISCA, 2019, pp. 2613–2617.