
Topic Model Robustness to Automatic Speech Recognition Errors in Podcast Transcripts

Raluca Alexandra Fetic (Podimo, Gasværksvej 16, Copenhagen V, Denmark, DK-1654), Mikkel Jordahn (DTU Compute, Technical University of Denmark, Richard Petersens Plads B324, Kongens Lyngby, Denmark, DK-2800), Lucas Chaves Lima (Podimo, Gasværksvej 16, Copenhagen V, Denmark, DK-1654), Rasmus Arpe Fogh Egebæk (DTU Compute, Technical University of Denmark, Richard Petersens Plads B324, Kongens Lyngby, Denmark, DK-2800), Martin Carsten Nielsen (DTU Compute, Technical University of Denmark, Richard Petersens Plads B324, Kongens Lyngby, Denmark, DK-2800), Benjamin Biering (Podimo, Gasværksvej 16, Copenhagen V, Denmark, DK-1654), and Lars Kai Hansen (DTU Compute, Technical University of Denmark, Richard Petersens Plads B324, Kongens Lyngby, Denmark, DK-2800)
Abstract.

For a multilingual podcast streaming service, it is critical to be able to deliver relevant content to all users independent of language. Podcast content relevance is conventionally determined using various metadata sources. However, with the increasing quality of speech recognition in many languages, utilizing automatic transcriptions to provide better content recommendations becomes possible. In this work, we explore the robustness of a Latent Dirichlet Allocation topic model when applied to transcripts created by an automatic speech recognition engine. Specifically, we explore how increasing transcription noise influences topics obtained from transcriptions in Danish; a low resource language. First, we observe a baseline of cosine similarity scores between topic vectors from automatic transcriptions and the descriptions of the podcasts written by the podcast creators. We then observe how the cosine similarities between topic vectors decrease as transcription noise increases and conclude that even when automatic speech recognition transcripts are erroneous, it is still possible to obtain high-quality topic vectors.

Podcasts, Automatic Speech Recognition, Topic modeling, Recommendation Systems
copyright: ACM; journal year: 2021; conference: RecSys '21: ACM Conference Series on Recommender Systems, September 27–October 05, 2021, Amsterdam, NL; ccs: Information systems → Information retrieval; ccs: Computing methodologies → Natural language processing

1. Introduction

Podcasts have become an increasingly popular audio format in recent years. They encompass a variety of on-demand audio, such as radio, news, and entertainment in the form of informal discussions, interviews, or even narrated content similar to audiobooks, spanning many different categories. Despite this growth in popularity, finding a new podcast to listen to remains an open challenge. Research in podcast recommendation has not yet kept up, and efficient, high-quality recommendations are key to ensuring a high-quality streaming service (Jones et al., 2021). Podcast recommendation is a challenging task given the very large number of podcast episodes that lack metadata on both the podcast and episode level. Previous work has shown that transcription-based topic modeling plays a crucial role in podcast recommendation, as users tend to focus on the topics of podcasts rather than their audio style (Molgaard et al., 2007; Yang et al., 2019).

Topic modeling techniques such as Latent Dirichlet Allocation (LDA) (Blei et al., 2002, 2003) and Probabilistic Latent Semantic Indexing (PLSI) (Hofmann, 1999) are widely used to discover topics in high-quality texts (e.g., news, blog posts, etc.). As podcasts are usually not scripted, a transcript generated by an Automatic Speech Recognition (ASR) system is necessary to perform topic modeling. Podcasts represent a particularly challenging audio format for speech recognition systems because a large number of artifacts are commonly present: multiple speakers and overlapping speech, background music and jingles, audio effects, and "real-life" recording conditions (e.g., background noise), all of which make ASR-generated podcast transcripts prone to errors.

While ASR systems have been widely explored and developed for high-resource languages such as English, comparable systems are significantly lacking for many low-resource languages (by low-resource language, we refer to a smaller language with a limited amount of training data for NLP tasks such as speech recognition). As such, the systems available in these settings can be expected to produce more errors than their high-resource counterparts. This motivates research on downstream tasks in the low-resource setting to ensure multilingual applicability. In this paper, we study the robustness of an LDA topic model applied to podcast transcriptions generated by an ASR engine in Danish, a low-resource language. We utilize podcast episodes with author descriptions and assume that these descriptions contain a good representation of topics. Hence, a high similarity between topic vectors from author descriptions and automated transcripts also indicates a good representation of topics in the automated transcript. To evaluate the robustness of the topic model, we first construct a baseline by computing the cosine similarity between topic vector representations of author descriptions and automated transcripts. We then introduce noise into the automated transcripts, controlled by a variable noise injection rate determining how often words should be replaced, and observe how the cosine similarity changes. We experiment with two types of noise: simulated ASR noise sampled from a conditional distribution derived from a transcription error dataset (see Section 3.2), and, for reference, words sampled uniformly from the topic model vocabulary. Empirically, over a dataset of 587 episodes from 24 Danish podcasts, we find that the topic model is much more robust to simulated ASR noise than to noise from a uniform distribution. We present evidence that the LDA topic model is robust and captures an informative representation of topics, even in the face of imperfect transcriptions.

The remainder of this paper is structured as follows. In Section 2, we present the necessary background on topic modeling and robustness of downstream NLP systems against ASR errors. Section 3 defines the problem and details the methods used by each component to test the podcast topic modeling robustness. Section 4 presents the experimental setup. In Section 5, we present and discuss the results. Lastly, in Section 6, we conclude the paper and propose future research directions.

2. Background

2.1. Automatic Speech Recognition

Converting speech in audio to raw text is done using an Automatic Speech Recognition (ASR) system. Speech recognition systems have improved drastically during recent years with advancements such as various data augmentation techniques (Park et al., 2019), pre-training procedures on unlabelled speech data (Baevski et al., 2020a), and noisy student training (Park et al., 2020). State-of-the-art performance on the common English benchmark dataset LibriSpeech (Panayotov et al., 2015) is as low as 1.4% and 3.3% Word Error Rate (WER) on the test-clean and test-other partitions, respectively (Zhang et al., 2020). Labeled speech recognition training data is highly accessible in English (Panayotov et al., 2015; Kahn et al., 2020) but much less so in many other languages. Even with crowd-sourced data initiatives such as CommonVoice from Mozilla (Ardila et al., 2019) (https://paperswithcode.com/dataset/common-voice), the recognition performance gap between low- and high-resource languages remains large. Pre-training speech recognition models with cross-lingual data has helped bridge the gap significantly (Conneau et al., 2020). However, transcripts from ASR systems for low-resource languages are likely to be error-prone, especially for complex audio data such as podcasts.

2.2. Topic Modeling and Evaluation

Topic models are used to explore and structure a large set of documents according to latent semantic content. To improve downstream tasks such as search and recommendation of podcasts, a promising approach is to use a topic model to extract the relevant topics of a podcast and thereby enhance its representation (Molgaard et al., 2007; Yang et al., 2019). Topic modeling has been extensively studied, and various approaches exist. For instance, Latent Semantic Indexing (LSI) (Deerwester et al., 1990) uses Singular Value Decomposition (SVD) or Non-negative Matrix Factorization (NMF) (Lee and Seung, 1999) on a term-by-document matrix to construct a latent space representation, which can be queried for comparison of documents. An extension of LSI, Probabilistic Latent Semantic Indexing (PLSI), models topics as distributions over words, and documents as a probabilistic mixture of those topics (Hofmann, 1999). A similar, and very popular, approach is Latent Dirichlet Allocation (LDA). LDA differs from PLSI by placing Dirichlet priors on the topic and word distributions, making it more robust to unseen data (Blei et al., 2002; Griffiths and Steyvers, 2003). Numerous extensions of the LDA model have been studied. One such extension, the Correlated Topic Model (CTM) (Blei and Lafferty, 2006), explores the correlations among topics generated by the LDA model. Other extensions include Collective LDA (Shen et al., 2008), which combines multiple corpora during training, and approaches that explore the influence of document age on the topics (Wang and McCallum, 2006; Nallapati et al., 2007).

The quality of a topic model can be evaluated in different ways. A common practice is to evaluate a trained model in terms of perplexity (Wallach et al., 2009) or topic coherence. Topic coherence, as opposed to perplexity, is closer to how humans judge the quality of topics (Newman et al., 2010b). Examples of topic coherence measures include UCI coherence (Newman et al., 2010a), $U_{mass}$ coherence (Mimno et al., 2011), and coherence based on word embeddings (Fang et al., 2016).

2.3. Robustness to Noise

Downstream NLP tasks on ASR transcripts need to be robust to noise because transcription errors are common. Robustness to noise is often evaluated by constructing a baseline result with clean text, injecting varying degrees of noise into the text, and examining how the result changes (Su et al., 2016; Belinkov and Bisk, 2017). To investigate the effects of noise, it is necessary to select different ways of injecting plausible ASR noise into transcripts. A recent study explored the feasibility of improving the robustness of speech-enabled systems with three methods of noise (Cui et al., 2021): rule-based substitution, which randomly substitutes a candidate word with a phonetically similar one; statistics-based confusion substitution, which samples replacement words from a pre-constructed ASR confusion matrix; and model-based substitution, which utilizes a generative GPT model to directly produce ASR-like text. Another study investigated the stability of topics over noisy sources by testing for topic model agreement (Greene et al., 2014) after subjecting the training data to insertion of frequent words, deletion, and rule-based phonetic substitution errors (Su et al., 2016).

The robustness of a downstream task varies greatly depending on the specific task and the type of noise. For instance, topic modeling has previously been shown to be robust to the deletion of random words, whereas the insertion of new words and phonetic substitution errors has a larger negative impact on topic stability (Su et al., 2016). Another relevant downstream task, neural machine translation with character-based models, has been shown to struggle even with small perturbations to the input data (Belinkov and Bisk, 2017).

3. Methods

3.1. Transcript Generation

We produce podcast transcripts by passing podcast audio through a Danish transcription system developed at the Technical University of Denmark (DTU) as part of the Danspeech project (https://danspeech.github.io/danspeech/html/index.html). The system is based on the wav2vec 2.0 framework (Baevski et al., 2020b). The model was pre-trained on approximately 945 hours of podcast episodes and 400 hours of audiobooks, and fine-tuned with the Connectionist Temporal Classification (CTC) loss function (Graves et al., 2006) on 200 hours of labeled data from the Nordisk Språkteknologi (NST) Danish training dataset (https://www.nb.no/sprakbanken/ressurskatalog/oai-nb-no-sbr-55/) and 267 hours of aligned audiobook data. The Fairseq library (Ott et al., 2019) was used for both pre-training and fine-tuning. During inference, the ASR engine performs prefix beam search (Hannun et al., 2014) with an open-source Danish 3-gram language model (https://danspeech.github.io/danspeech/html/lms.html#danspeech.language_models.DSLWikiLeipzig3gram) when decoding the probabilities emitted by the wav2vec 2.0 model.

3.2. Noise Injection

We inject noise into the ASR transcripts by means of a noise injection rate parameter, $\beta$, that determines the frequency at which we substitute words in the transcript. More specifically, for each word in a given transcript we independently decide whether a substitution should take place with probability $\beta$. Examples of substitutions at various levels of $\beta$ are presented in Table 1.

Table 1. Examples of how a transcription changes with automatic speech recognition statistics-based substitutions at varying levels of the noise injection rate parameter $\beta$. The author description (shown below the table) is the same for all rows.

$\beta$ | Transcription
0 | historien er rig på spændende fortællinger om drama krig voldelig politiske omvæltninger og fyldt med mystik hemmeligheder og fascinerende menneskeskæbner
0.3 | kane er rig på fortællinger om drama krig ——- politiske omvæltninger har fyldt ved mystik er og fascineren menneskeskæbner
1.0 | historie har rik p så fortælling er ——- strama til ——- ——- ——- er film som ——- er var fascineren ——-

Description: Historien er rig på spændende fortællinger om drama, krig, voldelige politiske omvæltninger og er fyldt med mystik, hemmeligheder og fascinerende menneskeskæbner. {…} Episode 4: Røde agenter Revolutionen i Rusland i 1917 blev startskuddet til en international politisk kamp for at udbrede socialismen til hele verden { … } I programmet medvirker historikerne Niels Erik Rosenfeldt og Morten Møller.

When a word is selected for substitution, we apply one of two substitution methods. The first method samples a word uniformly at random from the topic model vocabulary; the second uses a statistics-based confusion matrix approach (see Section 4.2). When performing statistics-based confusion substitutions, we sample replacement words $\bar{w}$ from a conditional error distribution with the candidate probability given as

(1) $P(\bar{w}\,|\,w)=\dfrac{W(\bar{w})}{\sum_{\bar{w}\in V(w)}W(\bar{w})}$

Here, $W(\bar{w})$ denotes the weight of a candidate word and $V(w)$ denotes the candidate set for word $w$.
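To make the procedure concrete, the following is a minimal sketch of the noise injection step in Python. The names `confusion_dists` (mapping each word to its error candidates and their probabilities, as estimated in Section 4.2) and `vocabulary` are ours and purely illustrative, not the exact implementation used in our experiments.

```python
import random

def inject_noise(transcript_tokens, beta, confusion_dists, vocabulary, mode="asr"):
    """Independently substitute each token with probability beta.

    mode="uniform" samples replacements uniformly from the topic model
    vocabulary; mode="asr" samples from the word's conditional error
    distribution (Equation 1). Words without a known error distribution
    are deleted, mirroring the handling of unknown words in Section 4.2.
    """
    noisy = []
    for word in transcript_tokens:
        if random.random() >= beta:
            noisy.append(word)                       # keep the original word
        elif mode == "uniform":
            noisy.append(random.choice(vocabulary))  # uniform substitution
        else:
            dist = confusion_dists.get(word)
            if dist is None:
                continue                             # unknown word: delete it
            candidates, probs = dist
            noisy.append(random.choices(candidates, weights=probs, k=1)[0])
    return noisy
```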

3.3. Topic Modeling and Document Vector Representation Similarity

To produce topic representations, we use a general-purpose Danish topic model trained with LDA on an external, multi-domain text corpus. Our motivation for doing so is two-fold: first and foremost, we rely on an external text corpus because the topic modeling process requires large amounts of textual data, which is not readily available to us in a pure podcast-domain setup. Second, we choose LDA as the modeling framework due to its inherently probabilistic approach to representing topics, which has been shown to be robust to unseen data.

The LDA model can be seen as a probabilistic function $\gamma(d)\mapsto\boldsymbol{d}_{T}$, which maps a given document $d$ to a topic vector $\boldsymbol{d}_{T}=[p_{t_{1}},\dots,p_{t_{|T|}}]$, where $p_{t}$ represents the probability of each topic $t$ for the document $d$. Thus, to compute the similarity of topics present in a pair of documents, $d_{1}$ and $d_{2}$, we pass each document through the topic model and compute the similarity between the resulting document-level topic vectors as

(2) ${\rm Similarity}(d_{1},d_{2})=\cos(\gamma(d_{1}),\gamma(d_{2}))$

where $\cos$ is the cosine similarity. When comparing two ordered sets of document pairs, $(S_{1},S_{2})$, we denote the average similarity between the sets as the CorpusSimilarity (CS), computed as

(3) ${\rm CS}(S_{1},S_{2})=\frac{1}{|S_{1}|}\sum_{i}{\rm Similarity}(S_{1}(i),S_{2}(i))$

where $S_{k}(i)$ is document $d_{i}$ in set $S_{k}$.
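As an illustration, Equations (2) and (3) could be computed with Gensim roughly as in the sketch below. The helper names are ours, and the documents are assumed to already be preprocessed token lists.

```python
import numpy as np
from gensim import matutils

def topic_vector(lda, dictionary, tokens):
    """gamma(d): map a preprocessed document to a dense topic vector."""
    bow = dictionary.doc2bow(tokens)
    # minimum_probability=0 keeps all |T| topic probabilities in the output
    sparse = lda.get_document_topics(bow, minimum_probability=0.0)
    return matutils.sparse2full(sparse, lda.num_topics)

def similarity(lda, dictionary, d1_tokens, d2_tokens):
    """Equation (2): cosine similarity between two documents' topic vectors."""
    v1 = topic_vector(lda, dictionary, d1_tokens)
    v2 = topic_vector(lda, dictionary, d2_tokens)
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

def corpus_similarity(lda, dictionary, s1, s2):
    """Equation (3): average pairwise similarity over two ordered document sets."""
    return float(np.mean([similarity(lda, dictionary, d1, d2)
                          for d1, d2 in zip(s1, s2)]))
```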

4. Experiments

4.1. Podcast Dataset

The podcast dataset we use to investigate the robustness of topic modeling consists of 587 episodes from 24 podcast shows in Danish. The 24 shows belong to 8 categories assigned by the content creators, such as "Culture & leisure", "Health & personal development", "History & religion", and "True crime & mysteries". The podcast shows are all single speaker with a limited amount of audio artifacts such as background music and jingles. We extend each episode description with the episode title, podcast title, podcast description, and podcast category. We only include episodes with a high-quality author description, defining high-quality descriptions as description-transcription pairings with an initial cosine similarity above 0.5. The number of episodes per podcast included in the dataset is presented in Figure 2 in the Appendix, which shows that the distribution of episodes across shows is imbalanced, with a few podcasts contributing the majority of the episodes. We produce ASR transcripts for all episodes using the ASR engine described in Section 3.1.

4.2. Statistics-Based Automatic Speech Recognition Noise

To create the word-level conditional ASR error distributions used for statistics-based confusion substitution (see Section 3.2), we first construct a dataset of clean text and ASR transcription pairs. We use the same ASR engine as described in Section 3.1, except that the wav2vec 2.0 model was fine-tuned only on the NST training dataset. The data consists of approximately 77 hours of audio from the NST test dataset and 267 hours of audiobook data, resulting in 229,499 reference-transcript pairs. The probability of a candidate word for a given word is then obtained by counting the frequency with which the word is wrongfully transcribed as the candidate word, normalized over the candidate set as shown in Equation (1). Examples of candidate distributions for specific words are presented in Table 2 and Table 3, where the "Random" candidate groups many potential errors with very low probability; a sketch of how such distributions can be estimated is given after Table 3. The tables show that typical ASR errors tend to retain semantic meaning to some degree, which gives rise to the hypothesis that topic models are likely to be robust to ASR noise. If an unknown word is encountered, it is simply deleted. Across the podcast transcripts, unknown words occur 20% of the time.

Table 2. Candidate distribution for the word ”nogensinde”.
Word Error candidates Probability
Nogensinde Nogen sinde 0.917
Sinde 0.053
Nogen 0.024
Nogen sider 0.003
Står 0.003
Table 3. Candidate distribution for the word ”lavet”.
Word Error candidates Probability
Lavet Lave 0.409
Lade 0.136
Lavede 0.136
Ladet 0.091
Random 0.227
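The following sketch illustrates how such candidate distributions could be estimated by counting substitution errors over word-aligned reference-transcript pairs. It assumes the pairs have already been aligned (e.g., with an edit-distance alignment) and produces the `confusion_dists` structure used in the noise injection sketch in Section 3.2; it is not necessarily the exact implementation used in our experiments.

```python
from collections import Counter, defaultdict

def build_confusion_distributions(aligned_word_pairs):
    """Estimate P(candidate | word) from (reference_word, hypothesis_word) pairs.

    Only substitution errors (reference word != hypothesis word) are counted;
    the counts are then normalized over each word's candidate set (Equation 1).
    """
    counts = defaultdict(Counter)
    for ref_word, hyp_word in aligned_word_pairs:
        if ref_word != hyp_word:
            counts[ref_word][hyp_word] += 1

    confusion_dists = {}
    for word, cand_counts in counts.items():
        total = sum(cand_counts.values())
        candidates = list(cand_counts)
        probs = [cand_counts[c] / total for c in candidates]
        confusion_dists[word] = (candidates, probs)
    return confusion_dists
```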

4.3. Topic Model

We train an LDA topic model on a Danish Wikipedia dataset consisting of 264,505 documents from The Danish Gigaword Corpus (Derczynski et al., 2021) using the Gensim framework (Řehůřek and Sojka, 2010). All documents are preprocessed by first performing word tokenization and removing punctuation and other special characters, including numbers. Next, we apply part-of-speech (POS) filtering, keeping adjectives, nouns, and verbs. Finally, we perform lemmatization, lowercase all letters, and vectorize the documents into bag-of-words (BOW) representations. The BOW vocabulary may contain n-grams and is limited by removing uncommon n-grams that appear in fewer than ten documents and very common n-grams that occur in more than 90% of the documents. We fix the LDA parameters $\alpha=\frac{1}{|T|}$ and $\eta=0.1$, and choose the remaining hyper-parameters by grid search, optimizing for topic coherence and using the $U_{mass}$ coherence metric to select the best model. We tune the following hyper-parameters: number of topics $[10,20,\underline{30},40,50,60,70,80,90,100]$, BOW vocabulary $[\underline{\text{unigrams}},\ \text{unigrams}+\text{bigrams}]$, and variational Bayes iterations $[5,10,15,\underline{20}]$. The underlined values yielded the best model in terms of $U_{mass}$ coherence.
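A minimal Gensim sketch of this training setup is shown below, assuming `preprocessed_docs` is the list of token lists produced by the preprocessing pipeline above. The default parameter values correspond to the best configuration found in the grid search, but the code is illustrative rather than our exact training script.

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel, CoherenceModel

def train_lda(preprocessed_docs, num_topics=30, vb_iterations=20):
    """Train an LDA model and score it with U_mass coherence."""
    dictionary = Dictionary(preprocessed_docs)
    # drop terms in fewer than 10 documents or in more than 90% of documents
    dictionary.filter_extremes(no_below=10, no_above=0.9)
    corpus = [dictionary.doc2bow(doc) for doc in preprocessed_docs]

    lda = LdaModel(
        corpus=corpus,
        id2word=dictionary,
        num_topics=num_topics,
        alpha=[1.0 / num_topics] * num_topics,  # symmetric alpha = 1/|T|
        eta=0.1,
        iterations=vb_iterations,               # variational Bayes iterations
    )
    u_mass = CoherenceModel(
        model=lda, corpus=corpus, dictionary=dictionary, coherence="u_mass"
    ).get_coherence()
    return lda, dictionary, u_mass
```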

4.4. Evaluation of Topic Robustness over Noisy Sources

For the experiments, we construct a baseline by computing the CS between topic vectors of two document sets $(S_{1},S_{2})$ as described in Section 3.3. We then measure robustness to noise as the average change in magnitude of the cosine similarity scores after injecting noise into the documents in $S_{2}$ at varying values of $\beta$, as described in Section 3.2. We vary the noise substitution method between experiments to allow for a comparison between uniform and statistics-based noise.

We conduct two complementary experiments:

(1)

Testing for similarity between podcast descriptions ($S_{1}$) and noisy ASR transcripts ($S_{2}$) at varying levels of noise. This allows us to identify whether the topic model is robust to transcription errors, and whether ASR transcripts contain enough information for the topic model to produce meaningful topic vectors.

(2)

Testing for similarity between raw ASR transcripts ($S_{1}$) and noisy ASR transcripts ($S_{2}$) at varying levels of noise. This further investigates how robust the topic model is to transcription errors, under the assumption that the raw transcript provides meaningful topic vectors.

For each experiment, we vary $\beta$ in $[0,1]$ with steps of $0.1$. To reduce variance, we repeat this procedure 50 times and report the average and the standard error.
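Putting the pieces together, the evaluation loop could look roughly like the sketch below, reusing the hypothetical `inject_noise` and `corpus_similarity` helpers from the earlier sketches; it reports the mean CS and its standard error at each noise level.

```python
import numpy as np

def evaluate_robustness(lda, dictionary, s1_docs, s2_docs,
                        confusion_dists, vocabulary, mode="asr", n_repeats=50):
    """Mean and standard error of the CS score for beta in {0.0, 0.1, ..., 1.0}."""
    results = {}
    for beta in np.arange(0.0, 1.01, 0.1):
        scores = []
        for _ in range(n_repeats):
            noisy_s2 = [inject_noise(doc, beta, confusion_dists, vocabulary, mode)
                        for doc in s2_docs]
            scores.append(corpus_similarity(lda, dictionary, s1_docs, noisy_s2))
        scores = np.asarray(scores)
        results[round(float(beta), 1)] = (
            scores.mean(),
            scores.std(ddof=1) / np.sqrt(n_repeats),  # standard error of the mean
        )
    return results
```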

5. Results and Discussion

The results of the two series of experiments are presented in Figure 1(a) and Figure 1(b), respectively.

Figure 1. Analysis of topic model robustness using cosine similarity; the shaded area shows the standard error of the mean.
(a) CS score between descriptions and noisy ASR transcripts at various levels of $\beta$.
(b) CS score between raw and noisy ASR transcripts at various levels of $\beta$.
Table 4. The number of different podcast shows that are present in the cosine similarity deciles at β=0\beta=0.
Decile 1 2 3 4 5 6 7 8 9 10
No. Unique podcasts 8 12 14 15 14 11 13 14 11 5

As can be seen from Figure 1, injecting noise into the ASR transcripts affects the CS under both noise distributions. We observe relative decreases in CS of 45.9% (from 0.85 to 0.46) and 49% (from 1.00 to 0.51) when injecting uniform noise, while the impact of statistics-based ASR noise is limited to 8.8% (from 0.85 to 0.775) and 9% (from 1.00 to 0.91). This is evidence that topic vectors are significantly more robust to statistics-based ASR noise than to uniformly distributed noise. We also observe that the mean CS score between descriptions and transcripts at $\beta=0$ is 0.85, which is significantly higher than the lower bound, roughly estimated to be 0.45 (found at $\beta=1$ after injecting uniform noise); this is evidence that ASR transcripts can be used to produce meaningful topic vectors. We suspect that the small change in topic distributions under statistics-based noise is mainly due to the type of errors produced by the ASR engine which, as seen in Table 2 and Table 3, tend to retain semantic meaning. Note that $\beta$ can be seen as an estimate of the WER in a podcast transcript (see Figure 3 in Appendix A), which suggests that even when an ASR engine produces a transcript with a high word error rate, the transcript will still be viable for topic modeling because the errors have little impact on the topic distribution after LDA preprocessing.

When comparing transcripts and author descriptions for the baseline at $\beta=0$, the standard deviation of the similarity scores is 0.128. Furthermore, the gap between the minimum value of 0.500 and the maximum value of 0.997 is large. We investigate why this occurs by splitting the podcast episodes into deciles based on their cosine similarity and counting the number of unique podcasts in each decile, as seen in Table 4. The smallest numbers of unique podcast shows are found in the top and bottom deciles. Furthermore, in the 1st decile, 42 of the 59 episodes come from the same podcast, "Sagen om Amagermanden" (which is also the most represented show), and in the 10th decile, 52 of the 58 episodes are from the podcast show "Hvor er mit ansigt?". This result suggests that either the quality of the podcast descriptions or the performance of the ASR engine is highly dependent on the specific podcast show. However, since all podcast shows in the dataset are single speaker with a limited amount of background music, we suspect that the quality of the author descriptions is the primary reason why episodes originating from the same podcast show consistently perform poorly in some cases.

6. Conclusion

In this work, we conducted experiments to evaluate the quality and robustness of topic representations produced by a general LDA topic model when exposed to noisy ASR transcripts in a low-resource language (Danish). More specifically, we investigated how injecting two different noise profiles into raw ASR transcripts influences the topic distributions across a podcast dataset at varying levels of noise. We created the dataset by leveraging an ASR system to obtain podcast transcripts. We chose to include only podcast episodes that had a high-quality author-written description as part of the metadata, to allow for comparison between description and transcript topic vectors, relying on the assumption that a well-written description is very similar in terms of topic distribution to the episode content. To obtain topic vectors, we trained a general-purpose LDA model on Danish Wikipedia data. We experimented with injecting two types of noise into the raw ASR transcripts, namely uniformly random noise and simulated ASR noise. We obtained similarity baselines by computing vector similarities of the raw transcripts with their respective descriptions and of the raw transcripts with themselves, at various levels of noise injection. We found that injecting random noise into the transcripts significantly lowered the CS score (by 45.9% and 49% relative to baseline), whereas injecting simulated ASR noise only slightly lowered the CS score (by 8.8% and 9% relative to baseline). Given our findings, we conclude that even when an ASR engine produces podcast transcripts with high WERs, we can still obtain meaningful topic representations. We hypothesize that this is because even when an ASR engine produces erroneous text, the majority of the word-level errors carry semantic meaning similar to the underlying truth. This encourages the use of ASR engines and transcriptions for podcast recommendation. Even in cases where the ASR engine is challenged by complex audio, the transcripts could be sufficient to obtain topic representations that capture the semantic content of a podcast episode. Furthermore, our results indicate that, for topic modeling of podcast content, ASR transcripts would generally be more robust than author-written descriptions, whose quality largely depends on the individual content creators. We base this conclusion on the fact that episodes originating from the same podcast show consistently exhibited either very high or very low topic overlap between their transcripts and their descriptions.

The ASR noise investigated in this paper was based on ASR errors from clean recordings without common podcast artifacts such as overlapping speech and ambient sounds. In the future, it would be worth investigating whether topic models are also robust to ASR noise stemming from these types of errors. Furthermore, there are indications that other downstream tasks might not be as robust to noise as topic modeling; hence, analyzing the robustness of other relevant downstream tasks to ASR noise remains an open research question.

7. Acknowledgments

We gratefully acknowledge support from Innovation Fund Denmark in the form of their Innobooster and Innoexplorer grants (grant numbers 0173-00670B and 0160-00023, respectively).

References

  • Ardila et al. (2019) Rosana Ardila, Megan Branson, Kelly Davis, Michael Henretty, Michael Kohler, Josh Meyer, Reuben Morais, Lindsay Saunders, Francis M Tyers, and Gregor Weber. 2019. Common voice: A massively-multilingual speech corpus. arXiv preprint arXiv:1912.06670 (2019).
  • Baevski et al. (2020a) Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, and Michael Auli. 2020a. wav2vec 2.0: A framework for self-supervised learning of speech representations. arXiv preprint arXiv:2006.11477 (2020).
  • Baevski et al. (2020b) Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, and Michael Auli. 2020b. wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations. CoRR abs/2006.11477 (2020). arXiv:2006.11477 https://arxiv.org/abs/2006.11477
  • Belinkov and Bisk (2017) Yonatan Belinkov and Yonatan Bisk. 2017. Synthetic and natural noise both break neural machine translation. arXiv preprint arXiv:1711.02173 (2017).
  • Blei and Lafferty (2006) D. Blei and J. Lafferty. 2006. Correlated Topic Models. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 18 (2006), 147. http://www.cs.cmu.edu/~lafferty/pub/ctm.pdf
  • Blei et al. (2002) David Blei, Andrew Ng, and Michael Jordan. 2002. Latent Dirichlet Allocation. In Advances in Neural Information Processing Systems, T. Dietterich, S. Becker, and Z. Ghahramani (Eds.), Vol. 14. MIT Press, Vancouver, British Columbia, Canada. https://proceedings.neurips.cc/paper/2001/file/296472c9542ad4d4788d543508116cbc-Paper.pdf
  • Blei et al. (2003) David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet Allocation. J. Mach. Learn. Res. 3 (March 2003), 993–1022.
  • Conneau et al. (2020) Alexis Conneau, Alexei Baevski, Ronan Collobert, Abdelrahman Mohamed, and Michael Auli. 2020. Unsupervised cross-lingual representation learning for speech recognition. arXiv preprint arXiv:2006.13979 (2020).
  • Cui et al. (2021) Tong Cui, Jinghui Xiao, Liangyou Li, Xin Jiang, and Qun Liu. 2021. An Approach to Improve Robustness of NLP Systems against ASR Errors. (March 2021). arXiv:2103.13610 [cs.CL]
  • Deerwester et al. (1990) Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. 1990. Indexing by latent semantic analysis. Journal of the American Society for Information Science 41, 6 (1990), 391–407. https://doi.org/10.1002/(SICI)1097-4571(199009)41:6<391::AID-ASI1>3.0.CO;2-9 arXiv:https://asistdl.onlinelibrary.wiley.com/doi/pdf/10.1002/%28SICI%291097-4571%28199009%2941%3A6%3C391%3A%3AAID-ASI1%3E3.0.CO%3B2-9
  • Derczynski et al. (2021) Leon Derczynski, Manuel R. Ciosici, Rebekah Baglini, Morten H. Christiansen, Jacob Aarup Dalsgaard, Riccardo Fusaroli, Peter Juel Henrichsen, Rasmus Hvingelby, Andreas Kirkedal, Alex Speed Kjeldsen, Claus Ladefoged, Finn Årup Nielsen, Jens Madsen, Malte Lau Petersen, Jonathan Hvithamar Rystrøm, and Daniel Varab. 2021. The Danish Gigaword Corpus. In Proceedings of the 23rd Nordic Conference on Computational Linguistics. NEALT.
  • Fang et al. (2016) Anjie Fang, Craig Macdonald, Iadh Ounis, and Philip Habel. 2016. Using Word Embedding to Evaluate the Coherence of Topics from Twitter Data. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval (Pisa, Italy) (SIGIR ’16). Association for Computing Machinery, New York, NY, USA, 1057–1060. https://doi.org/10.1145/2911451.2914729
  • Graves et al. (2006) Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. 2006. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd international conference on Machine learning. 369–376.
  • Greene et al. (2014) Derek Greene, Derek O’Callaghan, and Pádraig Cunningham. 2014. How Many Topics? Stability Analysis for Topic Models. CoRR abs/1404.4606 (2014). arXiv:1404.4606 http://arxiv.org/abs/1404.4606
  • Griffiths and Steyvers (2003) Thomas Griffiths and Mark Steyvers. 2003. Prediction and Semantic Association. In Advances in Neural Information Processing Systems, S. Becker, S. Thrun, and K. Obermayer (Eds.), Vol. 15. MIT Press, Vancouver, British Columbia, Canada. https://proceedings.neurips.cc/paper/2002/file/cb8da6767461f2812ae4290eac7cbc42-Paper.pdf
  • Hannun et al. (2014) Awni Y Hannun, Andrew L Maas, Daniel Jurafsky, and Andrew Y Ng. 2014. First-pass large vocabulary continuous speech recognition using bi-directional recurrent dnns. arXiv preprint arXiv:1408.2873 (2014).
  • Hofmann (1999) Thomas Hofmann. 1999. Probabilistic Latent Semantic Analysis. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence (Stockholm, Sweden) (UAI’99). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 289–296.
  • Jones et al. (2021) Rosie Jones, Hamed Zamani, Markus Schedl, Ching-Wei Chen, Sravana Reddy, Ann Clifton, Jussi Karlgren, Helia Hashemi, Aasish Pappu, Zahra Nazari, Longqi Yang, Oguz Semerci, Hugues Bouchard, and Ben Carterette. 2021. Current Challenges and Future Directions in Podcast Information Access. arXiv:2106.09227 [cs.IR]
  • Kahn et al. (2020) Jacob Kahn, Morgane Rivière, Weiyi Zheng, Evgeny Kharitonov, Qiantong Xu, Pierre-Emmanuel Mazaré, Julien Karadayi, Vitaliy Liptchinsky, Ronan Collobert, Christian Fuegen, et al. 2020. Libri-light: A benchmark for asr with limited or no supervision. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 7669–7673.
  • Lee and Seung (1999) Daniel Lee and H. Seung. 1999. Learning the Parts of Objects by Non-Negative Matrix Factorization. Nature 401 (11 1999), 788–91. https://doi.org/10.1038/44565
  • Mimno et al. (2011) David Mimno, Hanna Wallach, Edmund Talley, Miriam Leenders, and Andrew Mccallum. 2011. Optimizing Semantic Coherence in Topic Models. EMNLP 2011 - Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference, 262–272.
  • Molgaard et al. (2007) Lasse Lohilahti Molgaard, Kasper Winther Jorgensen, and Lars Kai Hansen. 2007. Castsearch-context based spoken document retrieval. In 2007 IEEE International Conference on Acoustics, Speech and Signal Processing-ICASSP’07, Vol. 4. IEEE, IV–93.
  • Nallapati et al. (2007) Ramesh M. Nallapati, Susan Ditmore, John D. Lafferty, and Kin Ung. 2007. Multiscale Topic Tomography. In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (San Jose, California, USA) (KDD ’07). Association for Computing Machinery, New York, NY, USA, 520–529. https://doi.org/10.1145/1281192.1281249
  • Newman et al. (2010a) David Newman, Jey Han Lau, Karl Grieser, and Timothy Baldwin. 2010a. Automatic Evaluation of Topic Coherence. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics (Los Angeles, California) (HLT ’10). Association for Computational Linguistics, USA, 100–108.
  • Newman et al. (2010b) David Newman, Youn Noh, Edmund Talley, Sarvnaz Karimi, and Timothy Baldwin. 2010b. Evaluating Topic Models for Digital Libraries. In Proceedings of the 10th Annual Joint Conference on Digital Libraries (Gold Coast, Queensland, Australia) (JCDL ’10). Association for Computing Machinery, New York, NY, USA, 215–224. https://doi.org/10.1145/1816123.1816156
  • Ott et al. (2019) Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A Fast, Extensible Toolkit for Sequence Modeling. In Proceedings of NAACL-HLT 2019: Demonstrations.
  • Panayotov et al. (2015) Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. 2015. Librispeech: an asr corpus based on public domain audio books. In 2015 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, 5206–5210.
  • Park et al. (2019) Daniel S Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D Cubuk, and Quoc V Le. 2019. Specaugment: A simple data augmentation method for automatic speech recognition. arXiv preprint arXiv:1904.08779 (2019).
  • Park et al. (2020) Daniel S Park, Yu Zhang, Ye Jia, Wei Han, Chung-Cheng Chiu, Bo Li, Yonghui Wu, and Quoc V Le. 2020. Improved noisy student training for automatic speech recognition. arXiv preprint arXiv:2005.09629 (2020).
  • Řehůřek and Sojka (2010) Radim Řehůřek and Petr Sojka. 2010. Software Framework for Topic Modelling with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks. ELRA, Valletta, Malta, 45–50. http://is.muni.cz/publication/884893/en.
  • Shen et al. (2008) Zhi-Yong Shen, Jun Sun, and Yi-Dong Shen. 2008. Collective Latent Dirichlet Allocation. In 2008 Eighth IEEE International Conference on Data Mining. 1019–1024. https://doi.org/10.1109/ICDM.2008.75
  • Su et al. (2016) Jing Su, Derek Greene, and Oisín Boydell. 2016. Topic Stability over Noisy Sources. In Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT). The COLING 2016 Organizing Committee, Osaka, Japan, 85–93. https://aclanthology.org/W16-3913
  • Wallach et al. (2009) Hanna M Wallach, Iain Murray, Ruslan Salakhutdinov, and David Mimno. 2009. Evaluation methods for topic models. In Proceedings of the 26th annual international conference on machine learning. 1105–1112.
  • Wang and McCallum (2006) Xuerui Wang and Andrew McCallum. 2006. Topics over Time: A Non-Markov Continuous-Time Model of Topical Trends. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Philadelphia, PA, USA) (KDD ’06). Association for Computing Machinery, New York, NY, USA, 424–433. https://doi.org/10.1145/1150402.1150450
  • Yang et al. (2019) Longqi Yang, Yu Wang, Drew Dunne, Michael Sobolev, Mor Naaman, and Deborah Estrin. 2019. More Than Just Words: Modeling Non-Textual Characteristics of Podcasts. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining (Melbourne VIC, Australia) (WSDM ’19). Association for Computing Machinery, New York, NY, USA, 276–284. https://doi.org/10.1145/3289600.3290993
  • Zhang et al. (2020) Yu Zhang, James Qin, Daniel S Park, Wei Han, Chung-Cheng Chiu, Ruoming Pang, Quoc V Le, and Yonghui Wu. 2020. Pushing the limits of semi-supervised learning for automatic speech recognition. arXiv preprint arXiv:2010.10504 (2020).

Appendix A Additional Figures

Figure 2. Distribution of episodes per podcast.
Figure 3. Word Error Rate as a function of $\beta$.