
A Review of Speaker Diarization: Recent Advances with Deep Learning

Tae Jin Park Naoyuki Kanda Dimitrios Dimitriadis Kyu J. Han Shinji Watanabe Shrikanth Narayanan University of Southern California, Los Angeles, USA Microsoft, Redmond, USA ASAPP, Mountain View, USA Johns Hopkins University, Baltimore, USA
Abstract

Speaker diarization is the task of labeling audio or video recordings with classes that correspond to speaker identity, or in short, the task of identifying “who spoke when”. In the early years, speaker diarization algorithms were developed for speech recognition on multispeaker audio recordings to enable speaker-adaptive processing. Over time, these algorithms also gained value as a standalone application, providing speaker-specific meta-information for downstream tasks such as audio retrieval. More recently, with the emergence of deep learning technology, which has driven revolutionary changes in research and practices across speech application domains, rapid advancements have been made in speaker diarization. In this paper, we review not only the historical development of speaker diarization technology but also the recent advancements in neural speaker diarization approaches. Furthermore, we discuss how speaker diarization systems have been integrated with speech recognition applications and how the recent surge of deep learning is driving the joint modeling of these two components so that they complement each other. By considering these exciting technical trends, we believe that this paper is a valuable contribution to the community, providing a survey that consolidates the recent developments with neural methods and thus facilitates further progress toward more efficient speaker diarization.

keywords:
speaker diarization , automatic speech recognition , deep learning
journal: Computer, Speech and Language

1 Introduction

“Diarize” means making a note of or keeping events in a diary. Speaker diarization, like keeping a record of events in such a diary, addresses the question of “who spoke when” [1, 2, 3] by logging speaker-specific salient events on multiparticipant (or multispeaker) audio data. Through the diarization process, the audio data is divided and clustered into groups of speech segments with the same speaker identity/label. As a result, salient events, such as non-speech/speech transitions or speaker turn changes, are automatically detected. In general, this process does not require any prior knowledge of the speakers, such as their real identity or the number of participating speakers in the audio data. Because it separates audio streams by these speaker-specific events, speaker diarization can be effectively employed for indexing or analyzing various types of audio data, e.g., audio/video broadcasts from media stations, conversations in conferences, personal videos from online social media or hand-held devices, court proceedings, business meetings, and earnings reports in the financial sector, just to name a few.

Traditionally, speaker diarization systems consist of multiple, independent sub-modules, as presented in Fig. 1. To mitigate artifacts from the acoustic environment, various front-end processing techniques, for example, speech enhancement, dereverberation, speech separation, or target speaker extraction, are employed. Voice or speech activity detection (SAD) is then applied to separate speech from non-speech events. Raw speech signals in the selected speech portions are transformed into acoustic features or embedding vectors. In the clustering stage, the transformed speech portions are grouped and labeled by speaker class, and in the post-processing stage the clustering results are further refined. In general, each of these sub-modules is optimized individually.

Fig. 1: Traditional speaker diarization system.

1.1 Historical Development of Speaker Diarization

During the early years of diarization technology (in the 1990s), the research objective was to benefit automatic speech recognition (ASR) on air traffic control dialogues and broadcast news recordings by separating each speaker’s speech segments and enabling speaker-adaptive training of acoustic models [4, 5, 6, 7, 8, 9, 10]. In this period, some fundamental approaches for measuring the distance between speech segments for speaker change detection and clustering, such as the generalized likelihood ratio (GLR) [4] and the Bayesian information criterion (BIC) [11], were developed and quickly became the gold standard. All these efforts collectively laid out paths to consolidate activities across research groups worldwide, leading to several research consortia and challenges in the early 2000s, among which were the Augmented Multiparty Interaction (AMI) Consortium [12] supported by the European Commission and the Rich Transcription (RT) Evaluation [13] hosted by the National Institute of Standards and Technology (NIST). These efforts, spanning from a few years to a decade, fostered further advancements in speaker diarization technologies across different data domains, from broadcast news [14, 15, 16, 17, 18] and conversational telephone speech (CTS) [19, 20, 21, 22] to meeting conversations [23, 24, 25, 26, 27]. The new approaches resulting from these advancements include, but are not limited to, beamforming [28], information bottleneck clustering (IBC) [27], variational Bayesian (VB) approaches [29], and joint factor analysis (JFA) [22].

The speaker-specific representation in a total variability space derived from a simplified JFA, known as the i-vector [30], found great success in speaker recognition and was quickly adopted by speaker diarization systems as a feature representation for short speech segments obtained in an unsupervised fashion. The i-vector successfully replaced its predecessors, such as raw mel-frequency cepstral coefficients (MFCCs) or speaker factors (eigenvoices) [31], to bolster clustering performance in speaker diarization, being combined with principal component analysis (PCA) [32, 33], the variational Bayesian Gaussian mixture model (VB-GMM) [34], mean shift [35], and probabilistic linear discriminant analysis (PLDA) [36].

Since the advent of deep learning in the 2010s, a considerable amount of research has sought to take advantage of the powerful modeling capabilities of neural networks for speaker diarization. One representative example is the extraction of speaker embeddings using neural networks, such as d-vectors [37, 38, 39] or x-vectors [40], which most often are embedding vector representations based on the bottleneck layer output of a deep neural network (DNN) trained for speaker recognition. The shift from i-vectors to these neural embeddings contributed to enhanced performance, easier training with more data [41], and robustness against speaker variability and acoustic conditions. More recently, end-to-end neural diarization (EEND), where the individual sub-modules of the traditional speaker diarization system (c.f., Fig. 1) are replaced by a single neural network, has gained increasing attention with promising results [42, 43]. This research direction, although not fully matured yet, could open up unprecedented opportunities to address challenges in the field of speaker diarization, such as joint optimization with other speech applications and handling overlapping speech, provided that large-scale data is available for training such powerful neural network-based models.

1.2 Motivation

To date, there have been two well-rounded overview papers in the area of speaker diarization that survey the development of speaker diarization technology with different focuses. In [2], various speaker diarization systems and their subtasks in the context of broadcast news and CTS data are reviewed up to the mid-2000s. Thus, the historical progress of speaker diarization technology development in the 1990s and early 2000s is covered. In contrast, the focus of [3] is placed more on speaker diarization for meeting speech and its respective challenges. That paper thus puts more weight on technologies that mitigate problems specific to meeting environments, where there are usually more participants than in broadcast news or CTS data and multi-modal data is frequently available. Since these two papers were published, speaker diarization systems have gone through many notable changes, especially owing to the leapfrog advancements in deep learning approaches addressing technical challenges across multiple machine learning domains. We believe that this survey work is a valuable contribution to the community, consolidating the recent developments with neural methods and thus facilitating further progress toward more efficient diarization.

1.3 Overview and Taxonomy of Speaker Diarization

Table 1: Table of Taxonomy

  • Single-module optimization, trained based on a non-diarization objective (Sections 2.1–2.6): front-end [44, 45, 46], speaker embedding [47, 48, 40], SAD [49], etc.

  • Single-module optimization, trained based on a diarization objective (Section 3.1): affinity matrix refinement [50], IDEC [51], TS-VAD [52], etc.

  • Joint optimization, trained based on a non-diarization objective (Section 2.7): VB-HMM [53], VBx [54]. Out of scope: joint front-end & ASR [55, 56, 57, 58, 59, 60], joint speaker identification & speech separation [61, 62], etc.

  • Joint optimization, trained based on a diarization objective (Section 3.2): UIS-RNN [41], RPN [63], online RSAN [64], EEND [42, 43], etc.; (Section 4): joint ASR & speaker diarization [65, 66, 67, 68], etc.

To categorize the diverse set of existing speaker diarization technologies, spanning both the modularized speaker diarization systems that preceded the deep learning era and the neural network-based systems of recent years, a proper grouping is helpful. The main categorization we adopt in this paper is based on two criteria, resulting in a total of four categories, as shown in Table 1. The first criterion is whether the model is trained with a speaker diarization-oriented objective function or not. Any trainable approach that optimizes a model in a multispeaker situation and learns the relations between speakers is categorized into the “diarization objective” class. The second criterion is whether multiple modules are jointly optimized toward some objective function. If only a single sub-module is replaced with a trainable one, the method is categorized into the “single-module optimization” class. Conversely, joint modeling of segmentation and clustering [41], joint modeling of speech separation and speaker diarization [64], or a fully end-to-end neural diarization system [42, 43] is categorized into the “joint optimization” class.

Note that the intention of this categorization is to help readers quickly survey the broad developments in the field; it is not our intention to rank the categories as superior or inferior. Also, while we are aware of many techniques that fall into the “non-diarization objective” and “joint optimization” category (e.g., joint front-end and ASR [55, 56, 57, 58, 59, 60] and joint speaker identification and speech separation [61, 62]), we exclude them from this paper to focus on the review of speaker diarization techniques.

1.4 Diarization Evaluation Metrics

1.4.1 Diarization Error Rate

The accuracy of a speaker diarization system is measured using the diarization error rate (DER) [69], which is the sum of three different error types: false alarm (FA) of speech, missed detection of speech, and confusion between speaker labels:

\textbf{DER}=\frac{\text{FA}+\text{Missed}+\text{Speaker-Confusion}}{\text{Total Duration of Time}}. \quad (1)

To establish a one-to-one mapping between the hypothesis outputs and the reference transcript, the Hungarian algorithm [70] is employed. In the 2006 RT evaluation [69], a 0.25 s “no score” collar (also referred to as a “score collar”) is set around every boundary of each reference segment to mitigate the effect of inconsistent annotations and human errors in the reference transcripts, and this evaluation scheme has been the most widely used in speaker diarization studies.
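For illustration, the following is a minimal frame-level sketch of the DER computation in Eq. (1), written in Python with NumPy. It assumes at most one active speaker at a time, speaker labels that are already matched between reference and hypothesis, and no scoring collar; official scoring tools additionally perform the Hungarian mapping and collar handling, so this sketch is not a replacement for them.

```python
import numpy as np

def der(reference, hypothesis, frame=0.01):
    """Frame-level DER following Eq. (1).
    reference/hypothesis: lists of (start_sec, end_sec, speaker_label) with at most
    one active speaker at a time. Speaker labels are assumed to be already matched
    (e.g., via the Hungarian algorithm); no scoring collar is applied."""
    total = max(e for _, e, _ in reference + hypothesis)
    n = int(np.ceil(total / frame))
    ref = np.full(n, "", dtype=object)
    hyp = np.full(n, "", dtype=object)
    for s, e, spk in reference:
        ref[int(s / frame):int(e / frame)] = spk
    for s, e, spk in hypothesis:
        hyp[int(s / frame):int(e / frame)] = spk
    miss = np.sum((ref != "") & (hyp == ""))         # missed speech
    fa = np.sum((ref == "") & (hyp != ""))           # false alarm of speech
    conf = np.sum((ref != "") & (hyp != "") & (ref != hyp))  # speaker confusion
    return float(miss + fa + conf) / float(np.sum(ref != ""))

# Toy example: the hypothesis mislabels part of speaker B and misses the tail.
ref = [(0.0, 2.0, "A"), (2.0, 4.0, "B")]
hyp = [(0.0, 2.5, "A"), (2.5, 3.5, "A")]
print(der(ref, hyp))
```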

1.4.2 Jaccard Error Rate

The Jaccard error rate (JER) was first introduced in the DIHARD II evaluation. The goal of JER is to evaluate each speaker with equal weight. Unlike DER, which is estimated over the whole recording at once, per-speaker error rates are first computed and then averaged to obtain JER. Specifically, JER is computed as follows:

\textbf{JER}=\frac{1}{N}\sum_{i}^{N_{\mathrm{ref}}}\frac{\mathrm{FA}_{i}+\mathrm{MISS}_{i}}{\mathrm{TOTAL}_{i}}. \quad (2)

In Eq. (2), $\mathrm{TOTAL}_{i}$ is the union of the $i$-th speaker’s speaking time in the reference transcript and the $i$-th speaker’s speaking time in the hypothesis. $N_{\mathrm{ref}}$ is the number of speakers in the reference transcript. Note that the speaker confusion counted in DER is reflected in the $\mathrm{FA}_{i}$ term in the calculation of JER. Since JER uses a union operation between the reference and the hypothesis, JER never exceeds 100%, whereas DER can become much larger than 100%. DER and JER are highly correlated, but if a subset of speakers is dominant in the given audio recording, JER tends to be higher than usual.

1.4.3 Word-level Diarization Error Rate

While DER is based on the duration of the speaking time of each speaker, the word-level DER (WDER) is designed to measure the error on the lexical (output transcription) side. The motivation for WDER is the discrepancy between DER and the accuracy of the final transcript output, since DER relies on the duration of the speaking time, which is not always aligned with word boundaries. The concept of the word-breakage ratio was proposed in Silovsky et al. [71], where word breakage shares a similar idea with WDER. Unlike WDER, the word-breakage ratio measures the number of speaker-change points that occur inside a word boundary. The work in Park and Georgiou [72] suggested the term WDER, evaluating the diarization output against the ground-truth transcription. More recently, a joint ASR and speaker diarization system was evaluated with WDER in Shafey et al. [65]. Although the way of calculating WDER differs across studies, the underlying mechanism is that the diarization error is calculated by counting the correctly or incorrectly labeled words.
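As an illustration of this counting mechanism, the following is a minimal Python sketch. It assumes that the reference and hypothesis words are already aligned one-to-one; published WDER formulations (e.g., [65]) additionally account for ASR insertion, deletion, and substitution errors.

```python
def wder(ref_word_speakers, hyp_word_speakers):
    """Fraction of words carrying the wrong speaker label. Assumes the reference and
    hypothesis word sequences are already aligned one-to-one; published WDER variants
    also account for ASR recognition errors."""
    assert len(ref_word_speakers) == len(hyp_word_speakers)
    wrong = sum(r != h for r, h in zip(ref_word_speakers, hyp_word_speakers))
    return wrong / len(ref_word_speakers)

# One of five words is attributed to the wrong speaker.
print(wder(["A", "A", "B", "B", "B"], ["A", "A", "B", "A", "B"]))
```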

1.5 Paper Organization

The rest of the paper is organized as follows.

  • In Section 2, we overview techniques belonging to the “non-diarization objective” class in the proposed taxonomy, mostly those used in the traditional, modular speaker diarization systems. While there is some overlap with the counterpart sections of the aforementioned two survey papers [2, 3] in terms of reviewing notable developments in the past, this section adds the latest schemes for the corresponding components of speaker diarization systems.

  • In Section 3, we discuss advancements mostly leveraging DNNs trained with the diarization objective where single sub-modules are independently optimized (Subsection 3.1) or jointly optimized (Subsection 3.2) toward fully end-to-end speaker diarization.

  • In Section 4, we present a perspective of how speaker diarization has been investigated in the context of ASR, reviewing historical interactions between these two domains to peek into the past, present and future of speaker diarization applications.

  • Section 5 provides information on speaker diarization challenges and corpora to facilitate research activities and anchor technology advances. We also discuss evaluation metrics such as DER, JER and Word-level DER (WDER) in this section.

  • We share a few examples of how speaker diarization systems are employed in both research and industry practices in Section 6 and conclude this work in Section 7, providing summary and future challenges in speaker diarization.

2 Modular Speaker Diarization Systems

This section provides an overview of algorithms for speaker diarization belonging to the “non-diarization objective” class, as shown in Table 1. Each subsection corresponds to one module of the traditional speaker diarization system shown in Fig. 1. In addition to the introductory explanation of each module, this section also summarizes the recent techniques within each module.

2.1 Front-end Processing

This section describes the front-end techniques used for speech enhancement, dereverberation, speech separation, and speech extraction as part of the speaker diarization pipeline. Let $s_{i,f,t}\in\mathbb{C}$ be the short-time Fourier transform (STFT) representation of source speaker $i$ at frequency bin $f$ and frame $t$. The observed noisy signal $x_{t,f}$ can be represented by a mixture of the source signals, room impulse responses $h_{i,f,t}\in\mathbb{C}$, and additive noise $n_{t,f}\in\mathbb{C}$,

x_{t,f}=\sum_{i=1}^{K}\sum_{\tau}h_{i,f,\tau}\,s_{i,f,t-\tau}+n_{t,f}, \quad (3)

where $K$ denotes the number of speakers present in the audio signal.

The aim of the front-end techniques described in this section is to estimate the original source signal $\hat{\mathbf{x}}_{i,t}$ given the observation $\mathbf{X}=(\{x_{t,f}\}_{f})_{t}$ for the downstream diarization task,

\hat{\mathbf{x}}_{i,t}=\mathrm{FrontEnd}(\mathbf{X}),\quad i=1,\dots,K, \quad (4)

where $\hat{\mathbf{x}}_{i,t}\in\mathbb{C}^{D}$ denotes the $i$-th speaker’s estimated STFT spectrum with $D$ frequency bins at frame $t$.

Although there are numerous speech enhancement, dereverberation, and separation algorithms, e.g.,  [73, 74, 75], herein most of the recent techniques used in the DIHARD challenge series [76, 77, 78], LibriCSS meeting recognition task [79, 80], and the CHiME-6 challenge track 2 [81, 82, 83] are covered.

2.1.1 Speech Enhancement and Denoising

Speech enhancement techniques focus mainly on suppressing the noise component of noisy speech, and they have shown significant improvements thanks to deep learning. For example, long short-term memory (LSTM)-based speech enhancement [84, 85] is used as a front-end technique in the DIHARD II baseline [77], i.e.,

\hat{\mathbf{x}}_{t}=\mathrm{LSTM}(\mathbf{X}), \quad (5)

where we only consider the single-source case (i.e., $K=1$) and omit the source index $i$. This is a regression-based approach that minimizes the objective function

\mathcal{L}_{\mathrm{MSE}}=\|\mathbf{s}_{t}-\hat{\mathbf{x}}_{t}\|^{2}. \quad (6)

The log power spectrum or an ideal ratio mask is often used as the target domain of the output $\mathbf{s}_{t}$. The speech enhancement used in [86] applies this objective function at each layer in a progressive manner.
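As a rough illustration of this kind of regression-based enhancement, the following is a minimal PyTorch sketch; the layer sizes, feature dimensions, and training details are illustrative and do not reproduce any particular published system.

```python
# Minimal sketch: an LSTM maps noisy log-power spectra to a clean-target estimate,
# trained with the MSE objective of Eq. (6). All sizes are illustrative assumptions.
import torch
import torch.nn as nn

class LstmEnhancer(nn.Module):
    def __init__(self, n_bins=257, hidden=512):
        super().__init__()
        self.lstm = nn.LSTM(n_bins, hidden, num_layers=2, batch_first=True)
        self.proj = nn.Linear(hidden, n_bins)

    def forward(self, noisy):                 # noisy: (batch, frames, n_bins)
        h, _ = self.lstm(noisy)
        return self.proj(h)                   # enhanced estimate, same shape

model = LstmEnhancer()
criterion = nn.MSELoss()
noisy = torch.randn(4, 100, 257)              # toy batch of noisy log-power spectra
clean = torch.randn(4, 100, 257)              # corresponding clean targets
loss = criterion(model(noisy), clean)         # Eq. (6)
loss.backward()
```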

The effectiveness of speech enhancement can be further improved via multichannel processing, including minimum variance distortionless response (MVDR) beamforming [73]. For example, [80] demonstrates a significant DER improvement from 18.3% to 13.9% on the LibriCSS meeting task with mask-based MVDR beamforming [87, 88].

2.1.2 Dereverberation

Compared with other front-end techniques, the major dereverberation techniques used in various tasks are based on statistical signal processing methods. One of the most widely used techniques is the weighted prediction error (WPE) based dereverberation [89, 90, 91].

The basic idea of WPE, for the case of a single source (i.e., $K=1$) without noise, is to decompose the original signal model of Eq. (3) into the early reflection $x_{t,f}^{\text{early}}$ and the late reverberation $x_{t,f}^{\text{late}}$ as follows:

x_{t,f}=\sum_{\tau}h_{f,\tau}\,s_{f,t-\tau}=x_{t,f}^{\text{early}}+x_{t,f}^{\text{late}}. \quad (7)

WPE estimates filter coefficients $\hat{h}^{\text{wpe}}_{f,\tau}\in\mathbb{C}$ that maintain the early reflection while suppressing the late reverberation, based on maximum likelihood estimation:

\hat{x}^{\text{early}}_{t,f}=x_{t,f}-\sum_{\tau=\Delta}^{L}\hat{h}^{\text{wpe}}_{f,\tau}\,x_{f,t-\tau}, \quad (8)

where $\Delta$ denotes the number of frames separating the early reflection from the late reverberation, and $L$ denotes the filter size.

WPE is widely used as one of the gold-standard front-end processing methods; for example, it is part of both the baseline and the top-performing systems of the DIHARD and CHiME challenges [76, 77, 78, 81, 82]. Although the performance improvement from WPE-based dereverberation is not large, it provides consistent improvements across almost all tasks. Moreover, because WPE is based on linear filtering and does not introduce signal distortions, it can be safely combined with downstream front-end and back-end processing steps. Similar to the speech enhancement techniques, WPE-based dereverberation yields additional performance improvements when applied to multichannel signals.
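The following is a minimal single-channel NumPy sketch in the spirit of Eq. (8): a delayed linear prediction filter is estimated per frequency bin by iteratively re-weighted least squares and subtracted from the observation. The tap, delay, and iteration values are illustrative assumptions, and practical systems typically rely on multichannel WPE implementations.

```python
import numpy as np

def wpe_single_channel(X, taps=10, delay=3, iters=3, eps=1e-8):
    """X: (frames, freq_bins) complex STFT of one channel.
    Returns a dereverberated STFT following the form of Eq. (8)."""
    T, F = X.shape
    Y = X.copy()
    for f in range(F):
        x = X[:, f]
        y = x.copy()
        for _ in range(iters):
            # Time-varying weights from the current estimate of the desired signal power.
            w = 1.0 / np.maximum(np.abs(y) ** 2, eps)
            # Delayed, stacked observations: column k holds x delayed by (delay + k) frames.
            B = np.zeros((T, taps), dtype=complex)
            for k in range(taps):
                d = delay + k
                B[d:, k] = x[:T - d]
            # Weighted least-squares estimate of the late-reverberation prediction filter.
            R = B.conj().T @ (B * w[:, None])
            p = B.conj().T @ (w * x)
            g = np.linalg.solve(R + eps * np.eye(taps), p)
            y = x - B @ g          # subtract the predicted late reverberation (Eq. (8))
        Y[:, f] = y
    return Y
```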

2.1.3 Speech Separation

Speech separation is a promising family of techniques when the amount of overlapping speech is significant. The effectiveness of multichannel speech separation based on beamforming has been widely confirmed [28, 92, 93]. For example, in the CHiME-6 challenge [81], guided source separation (GSS) [93] based multichannel speech extraction techniques were used to achieve the top result. On the other hand, single-channel speech separation techniques [44, 45, 46] often do not show significant effectiveness in realistic multispeaker scenarios, such as the LibriCSS [79] or CHiME-6 tasks [81], where speech signals are continuous and contain both overlapping and overlap-free regions. Single-channel speech separation systems often produce a redundant non-speech or even a duplicated speech signal for the non-overlap regions, and this “leakage” of audio causes many false alarms of speech activity. A leakage filtering method was proposed in [94] to tackle this problem, and a significant improvement in diarization performance was observed after including this processing step in the top-ranked system of the VoxCeleb Speaker Recognition Challenge 2020 [95].

2.2 Speech Activity Detection

SAD, also known as voice activity detection (VAD), distinguishes speech from non-speech such as background noise. SAD plays a significant role not only in speaker diarization but also in speaker recognition and speech recognition systems, since it is a pre-processing step whose errors propagate through the whole pipeline. A SAD system mostly consists of two major parts. The first is a feature extraction front-end, where acoustic features such as the zero crossing rate [96], pitch [97], signal energy [98], higher order statistics in the linear predictive coding residual domain [99], or MFCCs are often used. The other is a classifier, where a model predicts whether the input frame contains speech or not. Systems based on statistical models of the spectrum [100], Gaussian mixture models (GMMs) [101], and hidden Markov models (HMMs) [102, 103] have traditionally been used. After deep learning approaches gained popularity in the speech signal processing field, numerous DNN-based systems, such as those based on MLPs [104], convolutional neural networks (CNNs) [105], and LSTMs [106], have also been proposed, with superior performance to the traditional methods.

The performance of SAD largely affects the overall performance of the speaker diarization system, as it can create a significant number of false positive salient events or miss speech segments [107]. A common practice in speaker diarization tasks is to report DER with the “oracle SAD” setup, which indicates that the system uses SAD output identical to the ground truth. Conversely, the system output obtained with an actual speech activity detector is referred to as the “system SAD” output.

2.3 Segmentation

In the context of speaker diarization, speech segmentation is the process of breaking the input audio stream into multiple segments so that each segment contains speech from a single speaker. The output unit of the speaker diarization system is therefore determined by the segmentation process. In general, speech segmentation methods for speaker diarization fall into two major categories: segmentation by speaker-change point detection and uniform segmentation.

Segmentation by speaker-change point detection was the gold standard of earlier speaker diarization systems, where speaker-change points are detected by comparing two hypotheses: $H_{0}$ assumes that both the left and right speech windows are from the same speaker, whereas $H_{1}$ assumes that the two speech windows are from different speakers. To test these two hypotheses, metric-based approaches [108, 109] were most widely applied. In metric-based approaches, the distribution of the speech features is assumed to follow a Gaussian distribution $\mathcal{N}(\mu,\Sigma)$ with mean $\mu$ and covariance $\Sigma$. The two hypotheses $H_{0}$ and $H_{1}$ can then be represented as follows:

H_{0}: \mathbf{x}_{1}\cdots\mathbf{x}_{N}\sim\mathcal{N}(\mu,\Sigma), \quad (9)
H_{1}: \mathbf{x}_{1}\cdots\mathbf{x}_{i}\sim\mathcal{N}\left(\mu_{1},\Sigma_{1}\right),\quad \mathbf{x}_{i+1}\cdots\mathbf{x}_{N}\sim\mathcal{N}\left(\mu_{2},\Sigma_{2}\right),

where $(\mathbf{x}_{i}\,|\,i=1,\cdots,N)$ is the sequence of speech features under the hypothesis test. A slew of criteria for the metric-based approach were proposed to quantify the likelihood of the two hypotheses. Examples include the Kullback–Leibler (KL) distance [110], the generalized likelihood ratio (GLR) [111, 112, 113], and BIC [108, 114]. Among these criteria, BIC has been the most widely used, followed by numerous variants [115, 116, 117, 118]. Thus, in this section, we introduce BIC as a representative metric-based method. If we apply BIC to the hypotheses described in Eq. (9), the BIC value between the two models from the two hypotheses is expressed as follows:

BIC(i)=N\log|\Sigma|-N_{1}\log\left|\Sigma_{1}\right|-N_{2}\log\left|\Sigma_{2}\right|-\lambda P, \quad (10)

where the sample covariance $\Sigma$ is computed from $\{\mathbf{x}_{1},\cdots,\mathbf{x}_{N}\}$, $\Sigma_{1}$ from $\{\mathbf{x}_{1},\cdots,\mathbf{x}_{i}\}$, and $\Sigma_{2}$ from $\{\mathbf{x}_{i+1},\cdots,\mathbf{x}_{N}\}$, and $P$ is the penalty term [108] defined as

P=\frac{1}{2}\left(d+\frac{1}{2}d(d+1)\right)\log N, \quad (11)

where $d$ denotes the dimension of the feature; $N_{1}$ and $N_{2}$ are the frame lengths of the two windows, respectively, and $N=N_{1}+N_{2}$. The penalty weight $\lambda$ is generally set to $\lambda=1$. A change point is declared when the following condition becomes true:

\left\{\max_{i}BIC(i)\right\}>0. \quad (12)
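The following NumPy sketch illustrates BIC-based change point detection with Eqs. (10)–(12). It assumes the rows of X are acoustic feature vectors (e.g., MFCC frames); the margin and the covariance regularization are illustrative choices.

```python
import numpy as np

def bic_value(X, i, lam=1.0):
    """BIC(i) of Eq. (10) for splitting the feature matrix X (N, d) at frame i."""
    N, d = X.shape
    logdet = lambda S: np.linalg.slogdet(S)[1]
    cov = lambda Z: np.cov(Z, rowvar=False) + 1e-6 * np.eye(d)  # regularized sample covariance
    P = 0.5 * (d + 0.5 * d * (d + 1)) * np.log(N)               # penalty term, Eq. (11)
    return (N * logdet(cov(X))
            - i * logdet(cov(X[:i]))
            - (N - i) * logdet(cov(X[i:]))
            - lam * P)

def detect_change_point(X, margin=10):
    """Declare a change point if max_i BIC(i) > 0 (Eq. (12)); return its index or None."""
    candidates = list(range(margin, len(X) - margin))
    scores = [bic_value(X, i) for i in candidates]
    best = int(np.argmax(scores))
    return candidates[best] if scores[best] > 0 else None
```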

In general, if speech segmentation is performed using the speaker-change point detection method, the length of each segment is not consistent. Therefore, after the advent of the i-vector [30] and DNN-based embeddings [47, 40], segmentation based on speaker-change point detection was mostly replaced with uniform segmentation [35, 119, 39], since the varying segment lengths introduced additional variability into the speaker representations and deteriorated their fidelity.

In uniform segmentation schemes, the given audio stream is segmented with a fixed window length and overlap length. Thus, the unit duration of the speaker diarization output stays constant. However, uniform segmentation of the input signals for diarization poses a potential problem because it introduces a trade-off related to the segment length: the segments need to be sufficiently short to safely assume that each contains only one speaker, yet long enough to capture sufficient acoustic information for extracting reliable speaker representations.
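The following minimal Python sketch illustrates uniform segmentation with a fixed window and hop; the 1.5 s window and 0.75 s hop are illustrative values only.

```python
def uniform_segments(duration, win=1.5, hop=0.75):
    """Split [0, duration) seconds into fixed-length, overlapping segments.
    Window/hop values are illustrative; the last segment may be shorter."""
    segments, start = [], 0.0
    while start < duration:
        end = min(start + win, duration)
        segments.append((round(start, 3), round(end, 3)))
        if end == duration:
            break
        start += hop
    return segments

# Example: a 5-second stream with a 1.5 s window and a 0.75 s hop.
print(uniform_segments(5.0))
```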

2.4 Speaker Representations and Similarity Measure

Speaker representations play a crucial role in speaker diarization systems when measuring the similarity between speech segments. This section covers such speaker representations together with the similarity measures, because the two are tightly connected. We first introduce metric-based similarity measures, which were popular from the late 1990s to the early 2000s, in Section 2.4.1. We then introduce, in Section 2.4.2 and Section 2.4.3, the speaker representations widely used in speaker diarization systems, which are usually employed together with the uniform segmentation method.

2.4.1 Metric Based Similarity Measure

From the late 1990s to the early 2000s, metric-based approaches were most commonly used to measure the similarity between speech segments in speaker diarization systems. Methods used for speaker segmentation were also applied to measure the similarity between segments, such as the KL distance [110], GLR [111, 112, 113], and BIC [108, 114]. As in the case of segmentation, the BIC-based method, in which the similarity between two segments is computed by Eq. (10), was one of the most extensively used metrics due to its effectiveness and ease of implementation. Metric-based approaches are usually employed together with segmentation approaches based on speaker-change point detection. Agglomerative hierarchical clustering (AHC) is then often applied to obtain the diarization result, as detailed in Section 2.5.1.

2.4.2 Joint Factor Analysis, i-vector and PLDA

Before the advent of speaker representations such as the i-vector [30] or x-vector [40], the Gaussian mixture model based universal background model (GMM-UBM) [120] applied to acoustic features demonstrated success in speaker verification tasks. A UBM is a large GMM (typically with 512 to 2048 mixtures) trained to represent the speaker-independent distribution of acoustic features. Thus, a GMM-UBM model can be described by the following quantities: the mixture weights, mean vectors, and covariance matrices of the mixtures. The log-likelihood ratio between a speaker-adapted GMM and the speaker-independent GMM-UBM is used for speaker verification. Despite their success in modeling speaker identity, GMM-UBM based speaker verification systems have suffered from intersession variability [121], i.e., the variability exhibited by a given speaker from one recording session to another. This difficulty occurs because the relevance maximum a posteriori (MAP) adaptation step during the speaker enrollment process captures not only the speaker-specific characteristics of the speech but also unwanted channel noise and other nuisances from the acoustic environment.

Joint factor analysis (JFA) [121, 122] was proposed to compensate for these variability issues by separately modeling the inter-speaker variability and the channel or session variability. The JFA approach employs a GMM supervector, which is a concatenation of the means of the adapted GMM. For example, suppose an $F\times 1$ speaker-independent GMM mean vector $m_{c}$, where $c$ is the mixture component index and $F$ is the dimension of the feature. Then, a supervector $\mathbf{M}$ of dimension $CF\times 1$ is formed by concatenating the $F$-dimensional mean vectors of the $C$ mixture components. Thus, the supervector $\mathbf{M}$ can be described as follows:

\mathbf{M}=\left[m_{1}^{t},\,m_{2}^{t},\,\ldots,\,m_{C}^{t}\right]^{t}. \quad (13)

In the JFA approach, the given GMM supervector is decomposed into speaker-independent, speaker-dependent, channel-dependent, and residual components. Thus, the ideal speaker supervector $\mathbf{M}_{J}$ can be decomposed as indicated in Eq. (14), where $\mathbf{m}_{J}$ denotes a speaker-independent supervector from the UBM, $\mathbf{V}$ denotes a speaker-dependent component matrix, $\mathbf{U}$ denotes a channel-dependent component matrix, and $\mathbf{D}$ denotes a speaker-dependent residual component matrix. Along with these component matrices, the vector $\mathbf{y}$ contains the speaker factors, the vector $\mathbf{x}$ the channel factors, and the vector $\mathbf{z}$ the speaker-specific residual factors. All of these vectors have a prior distribution of $N(0,1)$.

\mathbf{M}_{J}=\mathbf{m}_{J}+\mathbf{V}\mathbf{y}+\mathbf{U}\mathbf{x}+\mathbf{D}\mathbf{z}. \quad (14)

The JFA approach was followed by the study in [30], in which it was discovered that the channel factors in JFA also contain information about the speakers. Thus, Dehak et al. [30] proposed a new method combining the channel and speaker spaces into a single variability space through a total variability matrix. The total variability matrix $\mathbf{T}$ models both the channel and the speaker variability, and the latent variable $\mathbf{w}$ weights the columns of the matrix $\mathbf{T}$. The variable $\mathbf{w}$ is referred to as the i-vector and is also considered a speaker representation vector. Each speaker and channel in a GMM supervector $\mathbf{M}_{I}$ can be modeled as follows:

\mathbf{M}_{I}=\mathbf{m}_{I}+\mathbf{T}\mathbf{w}, \quad (15)

where $\mathbf{m}_{I}$ is a speaker-independent and channel-independent supervector, which can be taken as the UBM supervector. The process of extracting an i-vector $\mathbf{w}$ for a given recording is formulated as a MAP estimation problem [123, 30] using, as parameters, the Baum–Welch statistics extracted with the UBM, the mean supervector $\mathbf{m}_{I}$, and the total variability matrix $\mathbf{T}$ trained with the EM algorithm. The idea of a speaker representation was greatly popularized through the use of i-vectors, where the speaker representation vector numerically characterizes the vocal tract of each speaker. i-Vector speaker representations have been employed not only in speaker recognition studies but also in numerous speaker diarization studies [35, 36, 124] and have shown superior performance compared to the metric-based methods such as BIC, GLR, and KL distance mentioned in the previous subsection.

Intersession variability in the i-vector approach has been further compensated using back-end procedures, such as linear discriminant analysis (LDA) [125, 126] and within-class covariance normalization (WCCN) [127, 128], followed by simple cosine similarity scoring. Cosine similarity scoring was later replaced with a probabilistic LDA (PLDA) model in [129]. In the following studies [130, 131], a method applying a Gaussianization of the i-vectors so that the Gaussian assumptions hold in the PLDA, referred to as G-PLDA or simplified PLDA, was proposed for speaker verification. In general, PLDA models a given speaker representation $\phi_{ij}$ of the $i$-th speaker in the $j$-th session as follows:

\phi_{ij}=\boldsymbol{\mu}+\mathbf{F}\mathbf{h}_{i}+\mathbf{G}\mathbf{w}_{ij}+\epsilon_{ij}. \quad (16)

Here, $\boldsymbol{\mu}$ is the mean vector, $\mathbf{F}$ is the speaker variability matrix, $\mathbf{G}$ is the channel variability matrix, and $\epsilon_{ij}$ is a residual component. In addition, $\mathbf{h}_{i}$ and $\mathbf{w}_{ij}$ are latent variables specific to the speaker and the session, respectively. In G-PLDA, both latent variables, $\mathbf{h}_{i}$ and $\mathbf{w}_{ij}$, are assumed to follow a standard Gaussian prior. During the training process of the PLDA, $\boldsymbol{\mu}$, $\mathbf{\Sigma}$, $\mathbf{F}$, and $\mathbf{G}$ are estimated using the expectation maximization (EM) algorithm. Based on the estimated parameters, two hypotheses are tested: hypothesis $H_{0}$ for the case in which two samples are from the same speaker, and hypothesis $H_{1}$ for the case in which two samples are from different speakers. Under hypothesis $H_{0}$, the given speaker representations $\phi_{1}$ and $\phi_{2}$ are modeled as follows with a common latent variable $\mathbf{h}_{12}$:

\left[\begin{array}{l}\phi_{1}\\ \phi_{2}\end{array}\right]=\left[\begin{array}{l}\boldsymbol{\mu}\\ \boldsymbol{\mu}\end{array}\right]+\left[\begin{array}{lll}\mathbf{F}&\mathbf{G}&0\\ \mathbf{F}&0&\mathbf{G}\end{array}\right]\left[\begin{array}{l}\mathbf{h}_{12}\\ \mathbf{w}_{1}\\ \mathbf{w}_{2}\end{array}\right]+\left[\begin{array}{l}\epsilon_{1}\\ \epsilon_{2}\end{array}\right]. \quad (28)

On the other hand, under hypothesis $H_{1}$, $\phi_{1}$ and $\phi_{2}$ are modeled as follows with separate latent variables $\mathbf{h}_{1}$ and $\mathbf{h}_{2}$:

\left[\begin{array}{c}\phi_{1}\\ \phi_{2}\end{array}\right]=\left[\begin{array}{c}\boldsymbol{\mu}\\ \boldsymbol{\mu}\end{array}\right]+\left[\begin{array}{cccc}\mathbf{F}&\mathbf{G}&0&0\\ 0&0&\mathbf{F}&\mathbf{G}\end{array}\right]\left[\begin{array}{c}\mathbf{h}_{1}\\ \mathbf{w}_{1}\\ \mathbf{h}_{2}\\ \mathbf{w}_{2}\end{array}\right]+\left[\begin{array}{c}\epsilon_{1}\\ \epsilon_{2}\end{array}\right]. \quad (41)

In G-PLDA, it is assumed that $\phi$ is generated from a Gaussian distribution, which results in the following conditional density function [132]:

p\left(\phi\mid\mathbf{h},\mathbf{w}\right)=\mathcal{N}\left(\phi\mid\boldsymbol{\mu}+\mathbf{F}\mathbf{h}+\mathbf{G}\mathbf{w},\,\mathbf{\Sigma}\right). \quad (42)

Using Eqs. (16)–(41), the log-likelihood ratio can be described as follows:

s\left(\phi_{1},\phi_{2}\right)=\log p\left(\phi_{1},\phi_{2}\mid H_{0}\right)-\log p\left(\phi_{1},\phi_{2}\mid H_{1}\right). \quad (43)

The log-likelihood ratio $s(\phi_{1},\phi_{2})$ in the above equation was originally used in speaker verification to choose between hypotheses $H_{0}$ and $H_{1}$ by checking whether $s(\phi_{1},\phi_{2})$ is positive or negative. PLDA on speaker representations is also employed in speaker diarization, where the log-likelihood ratio $s(\phi_{1},\phi_{2})$ is used to measure the similarity between clusters. Further details regarding the clustering approach using PLDA are described in Section 2.5.1.

2.4.3 Neural Network Based Speaker Representations

Fig. 2: Diagram of d-vector model.
Fig. 3: Diagram of x-vector model.

Speaker representations for speaker diarization have also been heavily affected by the rise of deep learning approaches. The idea behind DNN-based representation learning was first introduced for face recognition tasks [133, 134]. The fundamental idea of a neural network-based representation is to use a deep neural network to map the input signal source (an image or an audio clip) to a dense vector of floating-point numbers. This is achieved by forward-propagating the input signal through the network and taking the values of a designated layer. The mapping from the input signal to the speaker embedding relies on the nonlinear modeling capability of the multiple layers of the DNN. In doing so, the training process allows the neural network to learn the mapping without specifying any components or factors, in contrast to traditional factor analysis models built on decomposable components. In this sense, the components in JFA are more explainable than the parameters of DNN models trained for speaker embedding extraction. In addition, DNN-based speaker representation learning does not involve predefined probabilistic models (e.g., GMM-UBM) for the input acoustic features. Related to this, DNN-based speaker representations achieve improved efficiency during the inference phase: the solution used by factor-analysis based methods involves a computationally intensive matrix inversion operation [132], whereas DNN-based embedding extractors involve less demanding operations, namely multiple linear transformations with nonlinear function computations, to obtain the speaker representation vector. Thus, the representation learning process has become more straightforward and the inference speed has improved compared with the traditional factor-analysis based methods. Among the many neural network-based speaker representations, the d-vector [47] remains one of the most prominent speaker representation extraction frameworks. Stacked filter bank features, which include context frames, are employed as the input, and multiple fully connected layers are trained with a cross-entropy loss. Speaker representation vectors, also referred to as d-vectors, are obtained from the last fully connected layer, as indicated in Fig. 2. The d-vector scheme appears in numerous speaker diarization papers, e.g., in [39, 41].

DNN-based speaker representations were further improved with the x-vector [48, 40], which demonstrated superior performance, winning the NIST speaker recognition challenge 2018 [135] and the first DIHARD challenge [76]. Fig. 3 shows the structure of an x-vector extractor. The time-delay architecture and the statistics pooling layer differentiate the x-vector architecture from that of the d-vector. The statistics pooling layer aggregates the frame-level outputs from the previous layer, computes their mean and standard deviation, and passes them on to the following layer. Thus, it allows the extraction of x-vectors from variable-length inputs. This is advantageous not only for speaker verification but also for speaker diarization, because diarization systems must process segments that are shorter than the predetermined uniform segment length when a segment is truncated at the end of an utterance.
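The following PyTorch sketch illustrates the statistics pooling operation described above; the feature dimension and frame counts are illustrative.

```python
import torch

def statistics_pooling(frame_features):
    """Aggregate frame-level features (batch, frames, dim) into a fixed-size
    utterance-level vector by concatenating their mean and standard deviation,
    mirroring the role of the statistics pooling layer."""
    mean = frame_features.mean(dim=1)
    std = frame_features.std(dim=1)
    return torch.cat([mean, std], dim=1)   # (batch, 2 * dim)

# Variable-length inputs both map to a fixed-size (2 * dim) vector per utterance.
short = torch.randn(1, 50, 512)
long = torch.randn(1, 700, 512)
print(statistics_pooling(short).shape, statistics_pooling(long).shape)
```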

2.5 Clustering

A clustering algorithm is applied to group the speech segments based on the speaker representations and similarity measures explained in the previous section. Here, we introduce the most commonly used clustering methods for speaker diarization.

Fig. 4: Agglomerative Hierarchical Clustering.

2.5.1 Agglomerative Hierarchical Clustering

AHC is a clustering method that has been consistently employed in many speaker diarization systems with different distance metrics, such as BIC [108, 136], KL [137], and PLDA [76, 82, 138]. AHC is an iterative process that merges the existing clusters until a stopping criterion is met. The AHC process starts with the calculation of the similarity between the N initial singleton clusters. At each step, the pair of clusters with the highest similarity is merged. The iterative merging process of AHC can be illustrated as a dendrogram, as presented in Fig. 4.

One of the most important aspects of AHC is the stopping criterion. For the speaker diarization task, the AHC process can be stopped using either a similarity threshold or a target number of clusters. Ideally, if PLDA is used as the distance metric, the AHC process should be stopped at $s(\phi_{1},\phi_{2})=0$ in Eq. (43). In practice, however, the stopping threshold is adjusted on a development set to obtain an accurate number of clusters. Conversely, if the number of speakers is known or estimated by other methods, the AHC process can be stopped when the number of clusters reaches the predetermined number of speakers $k$.
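The following sketch illustrates threshold-based or count-based stopping for AHC using SciPy's hierarchical clustering with cosine distances. It is only an illustration: the linkage method, distance metric, and threshold are assumptions standing in for the PLDA scoring and tuning described above.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

def ahc_labels(embeddings, threshold=None, num_speakers=None):
    """Cluster segment embeddings (N, dim) with average-linkage AHC.
    Stop either at a distance threshold or at a known number of speakers."""
    dists = pdist(embeddings, metric="cosine")        # pairwise cosine distances
    Z = linkage(dists, method="average")
    if num_speakers is not None:
        return fcluster(Z, t=num_speakers, criterion="maxclust")
    return fcluster(Z, t=threshold, criterion="distance")

# Example with toy embeddings and an (illustrative) tuned threshold.
emb = np.random.randn(20, 128)
print(ahc_labels(emb, threshold=0.7))
```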

2.5.2 Spectral Clustering

Spectral clustering is a widely used clustering approach for speaker diarization. While there are many variations, spectral clustering generally involves the following steps (a minimal sketch follows the list).

  i. Affinity matrix calculation: There are many ways to generate the affinity matrix $\mathbf{A}$, depending on how the affinity value is processed. The raw affinity value $d$ can be processed by kernels such as $\exp\left(-d^{2}/\sigma^{2}\right)$, where $\sigma$ is a scaling parameter. Alternatively, the raw affinity values can be masked by zeroing the values below a threshold so that only the prominent values are kept.

  ii. Laplacian matrix calculation [139]: The graph Laplacian can be calculated in two ways, normalized and unnormalized. The degree matrix $\mathbf{D}$ contains the diagonal elements $d_{i}=\sum_{j=1}^{n}a_{ij}$, where $a_{ij}$ is the element in the $i$-th row and $j$-th column of the affinity matrix $\mathbf{A}$.

    (a) Normalized graph Laplacian:

      \mathbf{L}=\mathbf{D}^{-1/2}\mathbf{A}\mathbf{D}^{-1/2}. \quad (44)

    (b) Unnormalized graph Laplacian:

      \mathbf{L}=\mathbf{D}-\mathbf{A}. \quad (45)

  iii. Eigendecomposition: The graph Laplacian matrix $\mathbf{L}$ is decomposed into the eigenvector matrix $\mathbf{X}$ and the diagonal matrix $\mathbf{\Lambda}$ containing the eigenvalues, i.e., $\mathbf{L}=\mathbf{X}\mathbf{\Lambda}\mathbf{X}^{\top}$.

  iv. Re-normalization (optional): The rows of $\mathbf{X}$ are normalized so that $y_{ij}=x_{ij}/\left(\sum_{j}x_{ij}^{2}\right)^{1/2}$, where $x_{ij}$ and $y_{ij}$ are the elements in the $i$-th row and $j$-th column of the matrices $\mathbf{X}$ and $\mathbf{Y}$, respectively.

  v. Speaker counting: The number of speakers is estimated by finding the maximum eigengap [139, 140].

  vi. Spectral embedding clustering: The $k$ smallest eigenvalues $\lambda_{1},\lambda_{2},\ldots,\lambda_{k}$ and the corresponding $k$ eigenvectors $\mathbf{v}_{1},\mathbf{v}_{2},\ldots,\mathbf{v}_{k}$ are stacked to construct a matrix $\boldsymbol{U}\in\mathbb{R}^{n\times k}$. The row vectors of $\boldsymbol{U}$ are referred to as $k$-dimensional spectral embeddings. Finally, the spectral embeddings are clustered using a clustering algorithm; in general, k-means clustering [141] is employed for this step.
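As referenced above, the following NumPy/scikit-learn sketch strings these steps together using the unnormalized Laplacian of Eq. (45), eigengap-based speaker counting, and k-means on the spectral embeddings; it omits the optional re-normalization and assumes a precomputed affinity matrix.

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_cluster(A, max_speakers=8):
    """A: (n, n) symmetric affinity matrix between segment embeddings."""
    D = np.diag(A.sum(axis=1))
    L = D - A                                      # unnormalized graph Laplacian, Eq. (45)
    eigvals, eigvecs = np.linalg.eigh(L)           # eigenvalues in ascending order
    # Speaker counting via the maximum eigengap among the smallest eigenvalues.
    gaps = np.diff(eigvals[:max_speakers + 1])
    k = int(np.argmax(gaps)) + 1
    # k-dimensional spectral embeddings from the k smallest eigenvectors.
    U = eigvecs[:, :k]
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(U)
    return labels, k

# Example with a toy affinity matrix built from random embeddings.
emb = np.random.randn(30, 64)
A = np.maximum(emb @ emb.T, 0.0)
np.fill_diagonal(A, 0.0)
labels, k = spectral_cluster(A)
```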

Fig. 5: General steps of spectral clustering.

Among the many variations of the spectral clustering algorithm, the Ng–Jordan–Weiss (NJW) algorithm [142] is often employed for the speaker diarization task, with variations in the kernel used to calculate the affinity values [143, 144, 33]. Unlike the AHC approach, spectral clustering is mostly used with the cosine distance [143, 144, 33, 39, 140]. In addition, LSTM-based similarity measurement combined with spectral clustering [145] has also exhibited competitive performance. Depending on the dataset, the spectral clustering approach with cosine distance outperforms AHC with PLDA [140, 83] when the same speaker representation is used for both clustering methods.

2.5.3 Other Clustering Algorithms

The k-means algorithm is often employed in studies on speaker diarization [146, 147, 39, 41, 63] due to its simplicity and ease of implementation. However, the k-means algorithm generally underperforms [39, 41] the more widely used clustering algorithms such as spectral clustering and AHC. In addition, a few speaker diarization studies have employed the mean-shift clustering algorithm [148], which iteratively assigns the given data points to clusters by finding the modes of a non-parametric distribution. The mean-shift clustering algorithm was employed for the speaker diarization task with the KL distance in [149], with i-vectors and cosine distance in [35, 150], and with i-vectors and PLDA in [151].

2.6 Post-processing

2.6.1 Resegmentation

Resegmentation is a process of refining the speaker boundaries that are roughly estimated by the clustering procedure. In [152], a Viterbi resegmentation method based on the Baum–Welch algorithm was introduced. In this method, the estimation of a Gaussian mixture model for each speaker and Viterbi-based resegmentation using the estimated speaker GMMs are applied alternately.

A method for modeling the diarization process with a variational Bayesian hidden Markov model (VB-HMM) was proposed and shown to be superior to Viterbi resegmentation [53, 153]. VB-HMM-based diarization can be seen as a joint optimization of segmentation and clustering, which is introduced separately in Section 2.7.

2.6.2 System Fusion

As another direction of post-processing, there has been a series of studies on fusing multiple diarization results to improve the diarization accuracy. While it is widely known that system combination generally yields better results for various tasks (e.g., speech recognition [154] or speaker recognition [155]), the combination of multiple diarization hypotheses poses several unique problems. First, speaker labels are not standardized across different diarization systems. Second, the estimated number of speakers may differ among the systems. Finally, the estimated time boundaries may also differ among the systems. System combination methods for speaker diarization therefore need to handle these problems during the fusion of multiple hypotheses.

In [156], a method for selecting the best diarization result among many diarization systems was proposed. In this method, the whole diarization result for a recording from each diarization system is treated as one object to be clustered. AHC is applied to the set of diarization results, in which the distance between two clusters is measured by the symmetric DER between the diarization results belonging to the two clusters. The iterative merging process of AHC is executed until the number of clusters becomes two. Finally, within the bigger of the two final clusters (according to the number of elements in each cluster), the diarization result that has the smallest distance to all other diarization results is selected as the final result. In [157], two diarization systems are combined by finding a matching between the two sets of speaker clusters and then performing resegmentation based on the matching result.

Fig. 6: Example of DOVER system.

More recently, the diarization output voting error reduction (DOVER) method [158] was proposed to combine multiple diarization results based on a voting scheme. In the DOVER method, the speaker labels of the different diarization systems are aligned one by one to minimize the DER between the hypotheses (steps 2 and 3 in Fig. 6). After aligning all hypotheses, each system votes with its speaker label for each segmented region (each system may have a different voting weight), and the speaker label that gains the highest voting weight is selected for each segmented region (step 4 in Fig. 6). In case multiple speaker labels receive the same voting weight, a heuristic is employed to break the tie (such as selecting the result from the first system).

The DOVER method has an implicit assumption that there is no overlapping speech, i.e., at most one speaker is assigned to each time index. To combine diarization hypotheses with overlapping speakers, two methods were recently proposed. In [94], the authors proposed a modified DOVER method, in which the speaker labels in the different diarization results are first aligned with a root hypothesis, and the speech activity of each speaker is estimated based on the weighted voting score for each speaker in each small segment. Raj et al. [159] proposed a method called DOVER-Lap, in which the speakers of multiple hypotheses are aligned via weighted k-partite graph matching, and the number of speakers $K$ in each small segment is estimated based on the weighted average over multiple systems to select the top-$K$ voted speaker labels. Both the modified DOVER and DOVER-Lap showed DER improvements for speaker diarization results with speaker overlaps.

2.7 Joint Optimization of Segmentation and Clustering

This subsection introduces the VB-HMM-based diarization technique, which can be regarded as a joint optimization of segmentation and clustering and thus cannot be well categorized in Sections 2.1–2.6. The VB-HMM framework was proposed as an extension of VB-based speaker clustering [160, 161] by introducing an HMM to constrain the speaker transitions. In the VB-HMM framework [53], the speech features $\mathbf{X}=(\mathbf{x}_{t}\,|\,t=1,\ldots,T)$ are assumed to be generated from an HMM in which each HMM state corresponds to one of $K$ possible speakers. Suppose that we have $M$ HMM states; an $M$-dimensional variable $\mathbf{Z}=(\mathbf{z}_{t}\,|\,t=1,\ldots,T)$ is introduced, where the $k$-th element of $\mathbf{z}_{t}$ is 1 if the $k$-th speaker is speaking at time index $t$, and 0 otherwise. At the same time, the distribution of $\mathbf{x}_{t}$ is modeled based on a hidden variable $\mathbf{Y}=\{\mathbf{y}_{k}\,|\,k=1,\ldots,K\}$, where $\mathbf{y}_{k}$ denotes a low-dimensional vector for the $k$-th speaker. Given these notations, the joint probability of $\mathbf{X}$, $\mathbf{Y}$, and $\mathbf{Z}$ is decomposed as follows:

P(\mathbf{X},\mathbf{Z},\mathbf{Y})=P(\mathbf{X}|\mathbf{Z},\mathbf{Y})\,P(\mathbf{Z})\,P(\mathbf{Y}), \quad (46)

where $P(\mathbf{X}|\mathbf{Z},\mathbf{Y})$ is the emission probability modeled by a GMM whose mean vectors are represented by $\mathbf{Y}$, $P(\mathbf{Z})$ is the transition probability of the HMM, and $P(\mathbf{Y})$ is the prior distribution of $\mathbf{Y}$. Because $\mathbf{Z}$ represents the trajectory of speakers, the diarization problem can be expressed as the inference of the $\mathbf{Z}$ that maximizes the posterior distribution $P(\mathbf{Z}|\mathbf{X})=\int P(\mathbf{Z},\mathbf{Y}|\mathbf{X})\,d\mathbf{Y}$. Since it is intractable to solve this problem directly, the VB method is used to estimate the model parameters that approximate $P(\mathbf{Z},\mathbf{Y}|\mathbf{X})$.

Recently, a simplified version of VB-HMM that works on x-vectors, known as VBx, was proposed [54, 162]. In VBx, $P(\mathbf{X}|\mathbf{Z},\mathbf{Y})$ is calculated from the x-vectors based on the PLDA model. While the original VB-HMM works at the granularity of frame-level features, VBx works at the granularity of x-vectors, and can thus be seen as a clustering method that jointly models speaker turns and speaker durations.

The VB-HMM diarization was originally designed as a standalone diarization framework. However, it requires parameter initialization to start the VB estimation, and the parameters are usually initialized based on the result of another speaker clustering method. In that context, VB-HMM is widely employed as the final step of speaker diarization (e.g., [163, 119, 164]). For example, in [164], AHC was first performed to under-cluster the x-vectors, and VBx was then applied to obtain better clusters given the AHC-based result as the initial parameters. Finally, VB-HMM was further applied to refine the boundaries obtained by VBx.

Table 2: Overview of speaker diarization techniques using deep learning. Note that there are also a lot of studies that use deep learning for front-end, SAD, segmentation, and speaker embedding extraction, which are introduced in Section 2.

EEND [42, 43] itself is a joint model of SAD, segmentation, speaker embedding, and clustering. It can also be seen that a speech separation module is implicitly embedded in the model to cope with speaker overlaps.

3 Recent Advances in Speaker Diarization Using Deep Learning

This section introduces various recent efforts toward deep learning-based speaker diarization techniques. The methods that incorporate deep learning into a single component of speaker diarization, such as clustering or post-processing, are introduced in Section 3.1. The methods that unify several components of speaker diarization into a single neural network are introduced in Section 3.2. For an overview of speaker diarization techniques using deep learning, refer to Table 2. It should be noted that some of these works take the additional input of speaker profiles. Such methods may not be categorized as diarization techniques in the traditional definition. Nevertheless, we introduce them because they are optimized in a multispeaker situation to learn the relations between speakers and are hence categorized as “trained based on the diarization objective” in Table 1.

3.1 Single-module Optimization

3.1.1 Speaker clustering Enhanced by Deep Learning

Fig. 7: Speaker diarization with graph neural network.

Enhancing the clustering procedure with deep learning is an active research area, and several methods have been proposed for speaker diarization. This section covers representative works in this direction.

An approach based on the graph neural network (GNN) was proposed in [50]. As shown in Fig. 7, this method aims at purifying the similarity matrix used in spectral clustering (Section 2.5.2). Assuming a sequence of speaker embeddings $\{\mathbf{e}_{1},\ldots,\mathbf{e}_{N}\}$, where $N$ is the sequence length, the input to the GNN is $\{\mathbf{x}_{i}^{0}=\mathbf{e}_{i}\,|\,i=1,\ldots,N\}$. The output $\mathbf{x}_{i}^{(p)}$ of the $p$-th layer of the GNN is then:

\mathbf{x}_{i}^{(p)}=\sigma\Big(\mathbf{W}\sum_{j}\mathbf{L}_{i,j}\,\mathbf{x}_{j}^{(p-1)}\Big),   (47)

where \mathbf{L} is the normalized affinity matrix with added self-connections, \mathbf{W} is a trainable weight matrix for the p-th layer, and \sigma(\cdot) is a nonlinear function. The GNN is trained by minimizing the distance between the reference and the estimated affinity matrices, where the distance is calculated using a combination of the histogram loss [177] and the nuclear norm [178]. The GNN-based speaker diarization method was evaluated on the CALLHOME dataset and an in-house meeting dataset, and significantly outperformed the conventional clustering methods used for comparison.
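A minimal NumPy sketch of one propagation step of Eq. (47) is shown below, assuming a symmetric normalization of the self-connected affinity matrix and a ReLU nonlinearity; the layer sizes, the stand-in affinity matrix, and the untrained weight matrix are illustrative only.

```python
import numpy as np

def normalize_affinity(A):
    """Add self-connections and apply symmetric normalization,
    L = D^{-1/2} (A + I) D^{-1/2}, one common choice for L in Eq. (47)."""
    A_hat = A + np.eye(A.shape[0])
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_hat @ D_inv_sqrt

def gnn_layer(X, L, W):
    """One propagation step of Eq. (47) in matrix form: sigma(L X W)."""
    return np.maximum(L @ X @ W, 0.0)

# Toy usage with hypothetical sizes: N=6 embeddings of dimension 32
rng = np.random.default_rng(0)
E = rng.standard_normal((6, 32))            # speaker embeddings
A = np.abs(np.corrcoef(E))                  # stand-in affinity matrix
L = normalize_affinity(A)
W = rng.standard_normal((32, 16)) * 0.1     # trainable weights (random here)
H = gnn_layer(E, L, W)                      # refined node representations
refined_affinity = H @ H.T                  # similarity matrix for spectral clustering
```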

Besides that, different approaches have been proposed to generate the affinity matrix. In [165], a self-attention-based neural network model was introduced to directly generate a similarity matrix from a sequence of speaker embeddings. In [166], several affinity matrices with different temporal resolutions were fused into a single affinity matrix based on a neural network.

A different approach aiming at improving clustering, called deep embedded clustering (DEC), was proposed in [179]. The goal of DEC is to transform the input features (here, speaker embeddings) to make them more separable into a given number of clusters/speakers. In order to make the clustering differentiable, each embedding is assigned a probability of “belonging” to each of the available speaker clusters, i.e., q_{ij} can be interpreted as the probability of assigning sample i to cluster j (a soft assignment):

q_{ij}=\frac{\left(1+\|z_{i}-\mu_{j}\|^{2}/a\right)^{-\frac{a+1}{2}}}{\sum_{l}\left(1+\|z_{i}-\mu_{l}\|^{2}/a\right)^{-\frac{a+1}{2}}},\qquad p_{ij}=\frac{q_{ij}^{2}/f_{j}}{\sum_{l}q_{il}^{2}/f_{l}},   (48)

where z_{i} is the bottleneck feature of sample i, a is the degree of freedom of the Student’s t-distribution, \mu_{j} is the centroid of the j-th cluster, and f_{j}=\sum_{i}q_{ij} is the soft cluster frequency. The clusters are iteratively refined based on the target distribution p_{ij}, computed from the bottleneck features estimated using an autoencoder.

The initial version of DEC had some problems, and a refined algorithm called improved DEC (IDEC) was later proposed with better accuracy on speaker diarization [180, 51]. Firstly, there was a potential risk that the neural network converges to a trivial solution that generates corrupted embeddings. To avoid this risk, Guo et al. [180] proposed to explicitly preserve the local structure of the data by adding a reconstruction loss between the output of the autoencoder and the input feature. Dimitriadis [51] further addressed the issue by introducing a loss function that encourages the distribution of speaker turns to be uniform across all speakers, i.e., all speakers contribute equally to the session. This assumption is not always valid for real recordings, but it constrains the solution space enough to avoid empty clusters without affecting the overall performance. Finally, Dimitriadis [51] also proposed an additional loss term that penalizes the distance from the centroid \mu_{j}, bringing the behavior of the algorithm closer to k-means.

Overall, the loss function of IDEC consists of four terms: L_{c}, the clustering error term originally proposed in DEC; L_{r}, the reconstruction error term [180]; L_{u}, the uniform “speaker airtime” distribution loss [51]; and L_{MSE}, the loss measuring the distance of the bottleneck features from the centroids [51]:

L=\alpha L_{c}+\beta L_{r}+\gamma L_{u}+\delta L_{MSE},   (49)

where \alpha, \beta, \gamma, and \delta are the weights on the loss terms, fine-tuned on held-out data.
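A minimal sketch of the DEC quantities of Eq. (48) and of the clustering term L_{c} of Eq. (49) is given below, assuming that L_{c} is the KL divergence between the target and the soft-assignment distributions, as in the original DEC formulation; the array shapes and values are illustrative.

```python
import numpy as np

def dec_soft_assignments(Z, centroids, a=1.0):
    """Student-t soft assignments q_ij of Eq. (48)."""
    # squared distances between bottleneck features and centroids: (N, K)
    d2 = ((Z[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    q = (1.0 + d2 / a) ** (-(a + 1.0) / 2.0)
    return q / q.sum(axis=1, keepdims=True)

def dec_target_distribution(q):
    """Sharpened target p_ij of Eq. (48); f_j is the soft cluster frequency."""
    f = q.sum(axis=0)                        # soft count per cluster
    p = (q ** 2) / f
    return p / p.sum(axis=1, keepdims=True)

def clustering_loss(p, q, eps=1e-12):
    """Clustering term L_c of Eq. (49), here taken as KL(P || Q)."""
    return float((p * np.log((p + eps) / (q + eps))).sum())

# Toy usage: 8 bottleneck features of dimension 16, 2 clusters
rng = np.random.default_rng(0)
Z = rng.standard_normal((8, 16))
mu = rng.standard_normal((2, 16))
q = dec_soft_assignments(Z, mu)
p = dec_target_distribution(q)
print(clustering_loss(p, q))
```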

3.1.2 Learning the Distance Estimator

In this section, a novel approach using a trainable distance function is presented. The basic idea builds on relational recurrent neural networks (RRNNs). RRNNs were introduced in [181, 182, 183] to address “relational information learning” problems. Such models learn relations between a sequence of input features, like the notion of “closer” or “further”, e.g., that two points in space are closer to each other than to a third one. Speaker diarization can be seen as part of this class of problems, since the final decision depends on the distance between speech segments and speaker profiles or centroids.

There are several issues that potentially limit the accuracy of speaker diarization systems. Firstly, as mentioned in Section 2.3, the duration of the segments used when extracting speaker embeddings poses a trade-off between time resolution and the robustness of the extracted speaker representations. Secondly, speaker embedding extractors are not explicitly trained to provide optimal representations for speaker diarization, despite the fact that these invariant, discriminative representations are used to separate thousands of speakers [40]. Thirdly, the distance metric is often based on a heuristic approach and/or dependent on assumptions that do not necessarily hold, e.g., Gaussianity in the case of PLDA [130]. Finally, the audio chunks are treated independently and any temporal context is simply ignored in conventional clustering methods, as described in Section 2.5. These issues can be attributed to the distance metric function, and most of them can be addressed with RRNNs, where a data-driven, memory-based approach bridges the performance gap between heuristic and trainable distance estimation approaches.

In this context, an approach for learning the relationship between the speaker cluster centroids (or speaker profiles) and the embeddings was proposed in [167] (Fig. 8). In this work, the diarization process is treated as a classification task on already segmented audio, as in Section 2.3, segmented either uniformly [146] or based on estimated speaker-change points [184]. The speaker embeddings x_{j} for each segment, which are assumed to be speaker-homogeneous, are extracted and then compared with all the available speaker profiles or speaker centroids. The most suitable speaker label is assigned to each segment by minimizing a distance-based loss function, i.e., by learning the relationship between embeddings and profiles. As discussed in [167], the RRNN-based distance estimation exhibits consistent performance improvements when compared with more traditional distance estimation approaches such as the cosine distance [30] or the PLDA-based [130] distance. Note that although the task in [167] is speaker identification, an extension to speaker diarization is rather straightforward when the speaker profiles are pre-estimated, either as centroids using any of the traditional clustering algorithms (Sections 2.5 and 3.1.1), or using prior knowledge.

Fig. 8: Continuous speaker identification system based on RMC. The speech signal is segmented uniformly and each segment x_{t} is compared against all the available speaker profiles according to a distance metric d(\cdot,\cdot). A speaker label s_{t,j} is assigned to each x_{t} by minimizing this metric.

3.1.3 Post Processing Based on Deep Learning

Fig. 9: Target Speaker Voice Activity Detector

There are a few recent studies on neural network-based diarization methods that are applied on top of the result of a traditional clustering-based speaker diarization. These methods can be categorized as an extension of post-processing. Medennikov et al. [83, 52] proposed target-speaker voice activity detection (TS-VAD) to achieve accurate speaker diarization even under noisy conditions with many speaker overlaps. TS-VAD assumes that a set of i-vectors \mathcal{E}=\{\mathbf{e}_{k}\in\mathbb{R}^{f}\,|\,k=1,\ldots,K\} is available, one for each speaker in the audio, where f is the dimension of the i-vector and K is the number of speakers. As presented in Fig. 9, TS-VAD takes not only a sequence of MFCCs, \mathbf{X}=(\mathbf{x}_{t}\in\mathbb{R}^{d}\,|\,t=1,\ldots,T), where d is the MFCC dimension and T the length of the sequence, but also the set of i-vectors \mathcal{E}. Given \mathbf{X} and \mathcal{E}, the model outputs a sequence of K-dimensional vectors \mathbf{O}=(\mathbf{o}_{t}\in\mathbb{R}^{K}\,|\,t=1,\ldots,T), where the k-th element of \mathbf{o}_{t} represents the probability of speech activity of the speaker corresponding to \mathbf{e}_{k} at time frame t. In other words, the k-th element of \mathbf{o}_{t} is expected to be 1 if the speaker of \mathbf{e}_{k} is speaking at time t, and 0 otherwise.

Because TS-VAD requires the i-vectors of speakers, pre-processing to obtain the i-vectors is necessary. The procedure proposed in [83, 52] is as follows:

  1. Apply clustering-based diarization.

  2. Estimate i-vectors for each speaker given the diarization result.

  3. Repeat (a) and (b):

     (a) Apply TS-VAD given the estimated i-vectors.

     (b) Refine the i-vectors given the TS-VAD result.

TS-VAD was proposed as part of the winning system of the CHiME-6 Challenge [81], and showed a significantly better DER compared with the conventional clustering-based approach [83]. However, it has a drawback: the maximum number of speakers that the model can handle is limited by the dimension of the output vector.
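The sketch below illustrates the input/output structure of a TS-VAD-style model in PyTorch. It is a simplification of the actual model: here a shared detector is applied to each target speaker independently, whereas TS-VAD also combines information across speakers with an additional recurrent layer; the feature dimensions and layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class TSVADSketch(nn.Module):
    """Simplified TS-VAD-style model: per-frame speech activity for each
    enrolled speaker, conditioned on that speaker's i-vector."""

    def __init__(self, feat_dim=40, ivec_dim=100, hidden=128):
        super().__init__()
        self.frame_enc = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU())
        # shared detector applied once per target speaker
        self.detector = nn.LSTM(hidden + ivec_dim, hidden,
                                batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 1)

    def forward(self, feats, ivectors):
        # feats: (B, T, feat_dim), ivectors: (B, K, ivec_dim)
        h = self.frame_enc(feats)                          # (B, T, hidden)
        outputs = []
        for k in range(ivectors.shape[1]):
            ivec = ivectors[:, k:k + 1, :].expand(-1, h.shape[1], -1)
            o, _ = self.detector(torch.cat([h, ivec], dim=-1))
            outputs.append(torch.sigmoid(self.head(o)))    # (B, T, 1)
        return torch.cat(outputs, dim=-1)                  # (B, T, K)

# Toy usage: 200 frames of 40-dim MFCC and 4 speaker i-vectors
model = TSVADSketch()
probs = model(torch.randn(1, 200, 40), torch.randn(1, 4, 100))
print(probs.shape)  # torch.Size([1, 200, 4])
```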

As a different approach, Horiguchi et al. proposed applying the EEND model (detailed in Section 3.2.4) to refine the result of a clustering-based speaker diarization [168]. A clustering-based speaker diarization method can handle a large number of speakers but is unable to handle overlapped speech; conversely, EEND has the opposite characteristics. To use the two methods complementarily, the authors first applied a conventional clustering method. Then, a two-speaker EEND model was iteratively applied to each pair of detected speakers to refine the time boundaries of the overlapped regions.

3.2 Joint Optimization for Speaker Diarization

3.2.1 Joint Segmentation and Clustering

A model called unbounded interleaved-state recurrent neural networks (UIS-RNN) was proposed to replace the segmentation and clustering modules with a trainable model [41]. Given an input sequence of embeddings \mathbf{X}=(\mathbf{x}_{t}\in\mathbb{R}^{d}\,|\,t=1,\ldots,T), UIS-RNN generates the diarization result \mathbf{Y}=(y_{t}\in\mathbb{N}\,|\,t=1,\ldots,T) as a sequence of speaker indices, one for each time frame. The joint probability of \mathbf{X} and \mathbf{Y} can be decomposed by the chain rule as follows:

P(\mathbf{X},\mathbf{Y})=P(\mathbf{x}_{1},y_{1})\prod_{t=2}^{T}P(\mathbf{x}_{t},y_{t}|\mathbf{x}_{1:t-1},y_{1:t-1}).   (50)

To model the distribution of the speaker change, UIS-RNN then introduces a latent variable \mathbf{Z}=(z_{t}\in\{0,1\}\,|\,t=2,\ldots,T), where z_{t} becomes 1 if the speaker indices at time t-1 and t are different, and 0 otherwise. The joint probability including \mathbf{Z} is then decomposed as follows:

P(\mathbf{X},\mathbf{Y},\mathbf{Z})=P(\mathbf{x}_{1},y_{1})\prod_{t=2}^{T}P(\mathbf{x}_{t},y_{t},z_{t}|\mathbf{x}_{1:t-1},y_{1:t-1},z_{1:t-1}).   (51)

Finally, the term P(\mathbf{x}_{t},y_{t},z_{t}|\mathbf{x}_{1:t-1},y_{1:t-1},z_{1:t-1}) is further decomposed into three components:

P(\mathbf{x}_{t},y_{t},z_{t}|\mathbf{x}_{1:t-1},y_{1:t-1},z_{1:t-1})=P(\mathbf{x}_{t}|\mathbf{x}_{1:t-1},y_{1:t})\,P(y_{t}|z_{t},y_{1:t-1})\,P(z_{t}|z_{1:t-1}).   (52)

Here, P(\mathbf{x}_{t}|\mathbf{x}_{1:t-1},y_{1:t}) denotes the sequence generation probability and is modeled by gated recurrent unit (GRU)-based recurrent neural networks. P(y_{t}|z_{t},y_{1:t-1}) denotes the speaker assignment probability and is modeled by a distance-dependent Chinese restaurant process [185], which can model a distribution over an unbounded number of speakers. Finally, P(z_{t}|z_{1:t-1}) represents the speaker change probability and is modeled by a Bernoulli distribution. Since all components are trainable, UIS-RNN can be trained in a supervised fashion by finding the parameters that maximize \log P(\mathbf{X},\mathbf{Y},\mathbf{Z}) over the training data. Inference is conducted by finding the \mathbf{Y} that maximizes \log P(\mathbf{X},\mathbf{Y}) given \mathbf{X} via beam search in an online fashion. Although UIS-RNN works online, it demonstrated a better DER than that of an offline system based on spectral clustering.
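As an illustration of the speaker-assignment prior, the sketch below follows our reading of the distance-dependent Chinese restaurant process used in [41]: when a change occurs (z_t = 1), an existing speaker is chosen in proportion to the number of contiguous blocks it has produced so far, and a new speaker with weight alpha; the exact parameterization in UIS-RNN may differ.

```python
import numpy as np

def speaker_assignment_probs(prev_speakers, alpha=1.0):
    """Sketch of P(y_t | z_t = 1, y_{1:t-1}) as a distance-dependent CRP:
    existing speakers weighted by their number of contiguous blocks,
    a new speaker weighted by alpha."""
    # count contiguous blocks per speaker in the label history
    blocks = {}
    for i, s in enumerate(prev_speakers):
        if i == 0 or s != prev_speakers[i - 1]:
            blocks[s] = blocks.get(s, 0) + 1
    current = prev_speakers[-1]
    speakers = [s for s in blocks if s != current]   # a change excludes the current speaker
    weights = np.array([blocks[s] for s in speakers] + [alpha], dtype=float)
    return speakers + ["new"], weights / weights.sum()

# Toy usage: label history A A B A, with a speaker change at the next frame
labels, probs = speaker_assignment_probs(["A", "A", "B", "A"])
print(list(zip(labels, probs)))
```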

3.2.2 Joint Segmentation, Embedding Extraction, and Re-segmentation

Fig. 10: (a) RPN for speaker diarization, (b) diarization procedure based on RPN.

A speaker diarization method based on region proposal networks (RPN) was proposed to jointly perform segmentation, speaker embedding extraction, and resegmentation [63]. The RPN was originally proposed to detect multiple objects in a two-dimensional image [186]; a one-dimensional variant along the time axis is used for speaker diarization.

As can be seen in Fig. 10 (a), the STFT features, of size time by frequency bins, are first converted to a feature map of size time by frequency by channels using CNNs. Then, three other types of neural networks are applied to sliding windows of various sizes (called “anchors”) along the time axis. For each anchor, the three neural networks perform SAD, speaker embedding extraction, and region refinement, respectively. Here, SAD estimates the probability of speech activity for the anchor region. Speaker embedding extraction generates an embedding that represents the speaker characteristics of the audio corresponding to the anchor region. Finally, region refinement estimates the difference between the shape (i.e., duration and center position) of the anchor and that of the corresponding reference region.

The inference procedure of the RPN is presented in Fig. 10 (b). The RPN is first applied to list the anchors whose speech activity probability is higher than a pre-determined threshold. The anchors are then clustered using a conventional clustering method (e.g., k-means) based on the speaker embeddings estimated for each anchor. Finally, highly overlapping anchors are removed after region refinement, a procedure known as non-maximum suppression.
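A minimal sketch of greedy non-maximum suppression on one-dimensional time regions is shown below; the IoU threshold and the segment values are illustrative.

```python
import numpy as np

def nms_1d(segments, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression on 1-D time regions.
    `segments` is an (N, 2) array of (start, end) times after region
    refinement; `scores` are the speech-activity probabilities."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        # temporal IoU between the kept segment and the remaining ones
        inter = np.maximum(
            0.0,
            np.minimum(segments[rest, 1], segments[i, 1])
            - np.maximum(segments[rest, 0], segments[i, 0]),
        )
        union = (segments[rest, 1] - segments[rest, 0]) \
            + (segments[i, 1] - segments[i, 0]) - inter
        order = rest[inter / union <= iou_threshold]
    return keep

segs = np.array([[0.0, 2.0], [0.5, 2.2], [3.0, 4.0]])
print(nms_1d(segs, np.array([0.9, 0.8, 0.95])))  # -> [2, 0]
```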

The RPN-based speaker diarization system has the advantage of handling overlapped speech with possibly any number of speakers. It is also much simpler than the conventional speaker diarization pipeline. It was shown on multiple datasets that this system achieves significantly better DER than the conventional clustering-based speaker diarization system [63, 80].

3.2.3 Joint Speech Separation and Diarization

There is also recent research on the joint modeling of speech separation and speaker diarization. Kounades-Bastian et al. [187, 188] proposed incorporating a speech activity model into speech separation based on a spatial covariance model with non-negative matrix factorization. They derived an EM algorithm to estimate the separated speech and the speech activity of each speaker from multichannel overlapped speech. While their method jointly performs speaker diarization and speech separation, it is based on statistical modeling, and the estimation is conducted solely from the observation, i.e., without any model training.

Fig. 11: Joint speech separation, speaker counting, and speaker diarization model.

Neumann et al. [64, 169] later proposed a trainable model, the online Recurrent Selective Attention Network (online RSAN), for joint speech separation, speaker counting, and speaker diarization based on a single neural network (Fig. 11). The neural network takes as input a spectrogram \mathbf{X}_{b}\in\mathbb{R}^{T\times F}, a residual mask \mathbf{R}_{b,i-1}\in\mathbb{R}^{T\times F}, and a speaker embedding \mathbf{e}_{b-1,i}\in\mathbb{R}^{d}, where b is the index of the audio block; i, the index of the speaker; T, the length of the audio block; and F, the number of frequency bins of the spectrogram. It outputs a speech mask \mathbf{M}_{b,i}\in\mathbb{R}^{T\times F} and an updated speaker embedding \mathbf{e}_{b,i} for the corresponding speaker. The neural network is applied iteratively for each audio block b and for each speaker i as follows:

For each audio block b=1,2,\ldots, the following steps are repeated for i=1,2,\ldots:

  (i) Apply the neural network to \mathbf{X}_{b}, the residual mask \mathbf{R}_{b,i-1}, and the previous-block embedding \mathbf{e}_{b-1,i} (\mathbf{e}_{b-1,i} is set to \mathbf{0} if it was not calculated previously) to obtain \mathbf{M}_{b,i} and \mathbf{e}_{b,i}.

  (ii) \mathbf{R}_{b,i}=\max(\mathbf{R}_{b,i-1}-\mathbf{M}_{b,i},\mathbf{0}).

  (iii) If \frac{1}{TF}\sum_{t,f}\mathbf{R}_{b,i}(t,f)<\text{threshold}, stop the iteration over i.

A separated speech signal for speaker i in audio block b can be obtained as \mathbf{M}_{b,i}\odot\mathbf{X}_{b}, where \odot denotes element-wise multiplication. The speaker embedding \mathbf{e}_{b,i} is used to keep track of the same speaker across adjacent blocks. Thanks to this iterative approach, the neural network can cope with a variable number of speakers while jointly performing speech separation and speaker diarization. The online RSAN was evaluated on a real meeting dataset with up to six speakers and showed better results than a clustering-based method [169].
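The block-wise iteration can be sketched as follows, with a dummy stand-in for the trained mask-estimation network; the all-one initialization of the residual mask and the stopping threshold are assumptions of this sketch.

```python
import numpy as np

def rsan_block(X, prev_embeddings, mask_net, emb_dim=32, threshold=0.05):
    """One block of an online-RSAN-style iteration: peel off one source mask
    at a time until little unexplained energy remains in the residual.
    `mask_net(X, R, e)` is a hypothetical stand-in for the trained network;
    it must return a (T, F) mask and an updated speaker embedding."""
    T, F = X.shape
    R = np.ones((T, F))                  # residual mask (assumed all-ones initially)
    masks, embeddings = [], []
    i = 0
    while True:
        e_prev = prev_embeddings[i] if i < len(prev_embeddings) else np.zeros(emb_dim)
        M, e = mask_net(X, R, e_prev)
        R = np.maximum(R - M, 0.0)
        masks.append(M)
        embeddings.append(e)
        i += 1
        if R.mean() < threshold:         # stop when the residual is (nearly) empty
            break
    return masks, embeddings             # separated speech: M * X per speaker

# Dummy network that "extracts" a fixed fraction of the remaining residual
def dummy_net(X, R, e_prev):
    return 0.6 * R, np.ones(32)

masks, embs = rsan_block(np.abs(np.random.randn(100, 257)), [], dummy_net)
print(len(masks))
```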

3.2.4 Fully End-to-end Neural Diarization

Fig. 12: Two-speaker end-to-end neural diarization model

Recently, the framework called EEND was proposed [42, 43], which performs the entire speaker diarization procedure with a single neural network. The architecture of EEND is shown in Fig. 12. The input to the EEND model is a T-length sequence of acoustic features (e.g., log Mel-filterbank), \mathbf{X}=(\mathbf{x}_{t}\in\mathbb{R}^{F}\,|\,t=1,\ldots,T). A neural network then outputs the corresponding speaker label sequence \mathbf{Y}=(\mathbf{y}_{t}\,|\,t=1,\ldots,T), where \mathbf{y}_{t}=[y_{t,k}\in\{0,1\}\,|\,k=1,\ldots,K]. Here, y_{t,k}=1 represents the speech activity of speaker k at time frame t, and K is the maximum number of speakers that the neural network can output. Importantly, y_{t,k} and y_{t,k'} can both be 1 for different speakers k and k', indicating that these two speakers are speaking simultaneously (i.e., overlapping speech). The neural network is trained to maximize \log P(\mathbf{Y}|\mathbf{X})\sim\sum_{t}\sum_{k}\log P(y_{t,k}|\mathbf{X}) over the training data, assuming conditional independence of the outputs y_{t,k}. Because there can be multiple valid reference labels \mathbf{Y} obtained by swapping the speaker index k, the loss function is calculated for all possible reference labels and the one with the minimum loss is used for error back-propagation, which is inspired by the permutation-free objective used in speech separation [45]. EEND was initially proposed with a bidirectional long short-term memory (BLSTM) network [42] and was soon extended to a self-attention-based network [43], showing state-of-the-art DER on two-speaker data such as the two-speaker excerpt of the CALLHOME dataset (LDC2001S97) and the dialogue audio of the Corpus of Spontaneous Japanese [189].
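A minimal PyTorch sketch of this permutation-free objective is given below; it enumerates all speaker permutations of the reference labels, which is feasible only for a small maximum number of speakers K, and the toy shapes are illustrative.

```python
import itertools
import torch
import torch.nn.functional as F

def pit_diarization_loss(logits, labels):
    """Permutation-free training objective of EEND: compute the frame-wise
    binary cross-entropy for every speaker permutation of the reference
    labels and keep the smallest one.
    logits: (T, K) raw network outputs; labels: (T, K) 0/1 speech activity."""
    K = labels.shape[1]
    losses = []
    for perm in itertools.permutations(range(K)):
        permuted = labels[:, list(perm)]
        losses.append(
            F.binary_cross_entropy_with_logits(logits, permuted, reduction="mean")
        )
    return torch.stack(losses).min()

# Toy usage: 10 frames, 2 speakers, with an overlapping region in the middle
labels = torch.tensor([[1, 0]] * 4 + [[1, 1]] * 2 + [[0, 1]] * 4, dtype=torch.float)
logits = torch.randn(10, 2, requires_grad=True)
loss = pit_diarization_loss(logits, labels)
loss.backward()
print(loss.item())
```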

EEND has multiple advantages. First, EEND can handle overlapping speech in a principled way. Second, the network is directly optimized toward the maximization of diarization accuracy, from which we can expect high accuracy. Third, it can be retrained on real data (i.e., not synthetic data) simply by feeding a reference diarization label, which is often not straightforward for prior approaches. However, EEND also has several limitations. First, the model architecture limits the maximum number of speakers that the model can cope with. Second, EEND consists of BLSTM or self-attention based neural networks, making online processing difficult. Third, it was empirically suggested that EEND tends to overfit to the distribution of the training data [42].

Fig. 13: End-to-end neural diarization with encoder-decoder-based attractor (EDA).

To cope with an unbounded number of speakers, several extensions of EEND have been investigated. Horiguchi et al. [170] proposed an extension of EEND with encoder-decoder-based attractors (EDA) (Fig. 13). This method applies an LSTM-based encoder-decoder to the output of EEND to generate multiple attractors. Attractors are generated until the attractor existence probability falls below a threshold. Then, each attractor is multiplied with the embeddings generated by EEND to calculate the speech activity of each speaker. EEND-EDA was evaluated on the CALLHOME (two to six speakers) and DIHARD 2 (one to nine speakers) datasets and showed better performance than the clustering-based baseline system.
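The sketch below illustrates the attractor computation in a simplified form: an LSTM encoder summarizes the frame-level embeddings (the actual EEND-EDA shuffles their time order first), an LSTM decoder emits one attractor per step until the existence probability drops below a threshold, and speech activities are obtained from dot products between embeddings and attractors. Layer sizes are assumptions, and the toy usage biases the untrained existence head so that the loop produces some attractors.

```python
import torch
import torch.nn as nn

class EDASketch(nn.Module):
    """Simplified encoder-decoder attractor calculation on top of
    frame-level embeddings (a sketch of the EEND-EDA idea)."""

    def __init__(self, dim=256):
        super().__init__()
        self.encoder = nn.LSTM(dim, dim, batch_first=True)
        self.decoder = nn.LSTM(dim, dim, batch_first=True)
        self.exist = nn.Linear(dim, 1)

    def forward(self, emb, max_speakers=10, threshold=0.5):
        # emb: (1, T, dim) frame-level embeddings from the EEND encoder
        _, state = self.encoder(emb)
        attractors = []
        zero_in = torch.zeros(1, 1, emb.shape[-1])
        for _ in range(max_speakers):
            out, state = self.decoder(zero_in, state)      # one attractor per step
            if torch.sigmoid(self.exist(out)).item() < threshold:
                break                                       # no more speakers
            attractors.append(out.squeeze(1))
        A = torch.cat(attractors, dim=0)                   # (K, dim)
        activity = torch.sigmoid(emb.squeeze(0) @ A.t())   # (T, K) speech activity
        return activity, A

eda = EDASketch(dim=64)
nn.init.constant_(eda.exist.bias, 2.0)   # untrained toy model: keep existence high
activity, attractors = eda(torch.randn(1, 500, 64))
print(activity.shape, attractors.shape)
```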

On the other hand, Fujita et al. [171] proposed another approach that outputs the speech activities one speaker after another by using a speaker-wise conditional chain rule. In this method, a neural network is trained to produce the posterior probability P(\mathbf{y}_{k}|\mathbf{y}_{1},\ldots,\mathbf{y}_{k-1},\mathbf{X}), where \mathbf{y}_{k}=(y_{t,k}\in\{0,1\}\,|\,t=1,\ldots,T) is the speech activity of the k-th speaker. The joint speech activity probability of all speakers can then be estimated from the following speaker-wise conditional chain rule:

P(\mathbf{y}_{1},\ldots,\mathbf{y}_{K}|\mathbf{X})=\prod_{k=1}^{K}P(\mathbf{y}_{k}|\mathbf{y}_{1},\ldots,\mathbf{y}_{k-1},\mathbf{X}).   (53)

During inference, the neural network is applied repeatedly until the speech activity \mathbf{y}_{k} of the last estimated speaker approaches zero. Kinoshita et al. [174] proposed a different approach that combines EEND and speaker clustering. In their method, a neural network is trained to generate speaker embeddings and speech activity probabilities. Speaker clustering, constrained by the speech activity estimated by EEND, is then applied to align the estimated speakers across different processing blocks.

There are also a few recent attempts to extend EEND to online processing. Xue et al. [172] proposed a method using a speaker tracing buffer to better align the speaker labels of adjacent processing blocks. Han et al. [173] proposed a block-online version of EEND-EDA [170] that carries over the hidden state of the LSTM encoder to generate the attractors block by block.

4 Speaker Diarization in the Context of ASR

From a conventional perspective, speaker diarization is considered a pre-processing step for ASR. In the traditional system structures for speaker diarization, presented in Fig. 1, speech inputs are processed sequentially across the diarization components without considering the ASR performance, which is usually measured using the word error rate (WER). WER is the number of misrecognized words (substitution, insertion, and deletion errors) divided by the number of reference words. One issue is that the tight boundaries of the speech segments produced by speaker diarization have a high chance of causing unexpected word truncation or deletion errors in ASR decoding. In this section, we discuss how speaker diarization systems have been developed in the context of ASR, not only achieving better WER by preventing speaker diarization from degrading ASR performance, but also benefiting from ASR outputs to enhance diarization performance. More recently, a few pioneering proposals have been made for the joint modeling of speaker diarization and ASR, which will also be introduced in this section.

4.1 Early Works

The lexical information from the ASR output has been employed in speaker diarization systems in a few different ways. The earliest approach was in the RT03 evaluation [1], which used word boundary information for segmentation. In [1], a general ASR system for broadcast news data was built, in which the basic components are segmentation, speaker clustering, speaker adaptation, and system combination after ASR decoding from two sub-systems with different adaptation methods. The authors used the word boundary information from the ASR system for speech segmentation and compared it with BIC-based speech segmentation. While the performance gain from the ASR-based segmentation was insignificant, this was the first attempt to take advantage of ASR output to enhance diarization performance. In addition, the ASR result was used to refine SAD in IBM's submission [190] to the RT07 evaluation. The system in [190] incorporates word alignments from a speaker-independent ASR module and refines the SAD result to reduce false alarms so that the speaker diarization system can achieve better clustering quality. The segmentation system in [71] also takes advantage of word alignments from ASR. The authors in [71] focused on the word-breakage problem, in which words from the ASR output are truncated by the segmentation results since the segmentation results and the decoded word sequences are not aligned. A word-breakage ratio was therefore proposed to measure the rate of change points detected inside intervals corresponding to words, and DER and the word-breakage ratio were used to measure the influence of the word truncation problem. While the aforementioned early speaker diarization systems that leverage the ASR output focus on word alignment information to refine the SAD or segmentation result, the speaker diarization system in [191] created a dictionary of phrases commonly appearing in broadcast news. The phrases in this dictionary provide the identity of who is speaking, who will speak, and who spoke in the broadcast news scenario. For example, “This is [name]” indicates who was the speaker of the broadcast news section. Although the early studies on speaker diarization did not fully leverage lexical information to drastically improve DER, the idea of integrating information from the ASR output has been adopted by many studies to refine or improve the speaker diarization output.

Fig. 14: Integration of lexical information and acoustic information.
Fig. 15: Integration of lexical information and acoustic information.

4.2 Using Lexical Information from ASR

More recent speaker diarization systems that take advantage of the ASR transcript have employed DNN models to capture the linguistic patterns in the ASR output and thereby enhance the speaker diarization result. The authors in [192] proposed a way of using linguistic information for speaker diarization tasks in which the participants have distinct roles that are known to the speaker diarization system. Fig. 14 shows a diagram of the speaker diarization system discussed in [192]. In this system, a neural text-based speaker change detector and a text-based role recognizer are employed. By using both linguistic and acoustic information, the DER was significantly improved compared with the acoustic-only system.

Lexical information from the ASR output was also used for speaker segmentation [72] by employing a sequence-to-sequence model that outputs speaker turn tokens. Based on the estimated speaker turns, the input utterance is segmented accordingly. The experimental results in [72] indicate that both acoustic and lexical information can be exploited, with an extra advantage obtained from the word boundaries given by the ASR output.

The authors of [193] presented follow-up research along this thread. Unlike the system in [72], the lexical information from the ASR module is integrated into the speech segment clustering process by employing an integrated adjacency matrix. The adjacency matrix is obtained by a max operation between the acoustic affinity matrix, created from the affinities among audio segments, and the lexical affinity matrix, created by segmenting the word sequence into word chunks that are likely to be spoken by the same speaker. Fig. 15 presents a diagram that explains how lexical information is integrated into an affinity matrix with acoustic information. The integrated adjacency matrix leads to improved speaker diarization performance on the CALLHOME American English dataset.
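A minimal sketch of this element-wise max fusion is shown below; the lexical matrix here is a hypothetical 0/1 matrix marking segments that fall within the same ASR-derived word chunk, not the output of any particular system.

```python
import numpy as np
from scipy.spatial.distance import cdist

def fuse_affinity(acoustic_aff, lexical_aff):
    """Element-wise max fusion of an acoustic affinity matrix and a lexical
    (word-chunk based) affinity matrix over the same speech segments."""
    assert acoustic_aff.shape == lexical_aff.shape
    return np.maximum(acoustic_aff, lexical_aff)

# Toy usage: 4 segments; acoustic affinity from cosine similarity of embeddings,
# lexical affinity as a hypothetical 0/1 matrix over ASR-derived word chunks.
emb = np.random.randn(4, 64)
acoustic = 1.0 - cdist(emb, emb, metric="cosine")
lexical = np.array([[1, 1, 0, 0],
                    [1, 1, 0, 0],
                    [0, 0, 1, 1],
                    [0, 0, 1, 1]], dtype=float)
fused = fuse_affinity(acoustic, lexical)   # input to spectral clustering
```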

4.3 Joint ASR and Speaker Diarization with Deep Learning

Motivated by the recent success of deep learning and end-to-end modeling, several models have been proposed to jointly perform ASR and speaker diarization. As discussed in the previous section, ASR results contain strong cues for improving speaker diarization. On the other hand, speaker diarization results can be used to improve the accuracy of ASR, for example, by adapting the ASR model toward each estimated speaker. Joint modeling can leverage this inter-dependency to improve both ASR and speaker diarization. In the evaluation, a WER metric that counts word hypotheses with speaker-attribution errors as misrecognized words, such as the speaker-attributed WER [194] or the concatenated minimum-permutation WER (cpWER) [81], is often used. ASR-specific metrics (e.g., speaker-agnostic WER) or diarization-specific metrics (e.g., the DER mentioned in Section 1.4.1) are also used complementarily.
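For concreteness, the following sketch computes cpWER under the simplifying assumption that the reference and the hypothesis contain the same number of speakers (the official scoring additionally handles mismatched speaker counts): each speaker's words are concatenated, and the speaker permutation with the fewest word errors is selected.

```python
import itertools

def word_errors(ref, hyp):
    """Levenshtein distance at the word level (substitutions + deletions + insertions)."""
    r, h = ref.split(), hyp.split()
    d = list(range(len(h) + 1))
    for i, rw in enumerate(r, 1):
        prev, d[0] = d[0], i
        for j, hw in enumerate(h, 1):
            cur = min(d[j] + 1, d[j - 1] + 1, prev + (rw != hw))
            prev, d[j] = d[j], cur
    return d[-1]

def cp_wer(ref_by_spk, hyp_by_spk):
    """Concatenated minimum-permutation WER over per-speaker word streams,
    assuming equal numbers of reference and hypothesis speakers."""
    refs = list(ref_by_spk.values())
    hyps = list(hyp_by_spk.values())
    n_ref_words = sum(len(r.split()) for r in refs)
    best = min(
        sum(word_errors(ref, hyps[j]) for ref, j in zip(refs, perm))
        for perm in itertools.permutations(range(len(hyps)))
    )
    return best / n_ref_words

# Toy usage: correct words, speaker labels only permuted -> cpWER of 0.0
ref = {"spkA": "hello how are you", "spkB": "i am fine thanks"}
hyp = {"s1": "i am fine thanks", "s2": "hello how are you"}
print(cp_wer(ref, hyp))
```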

Fig. 16: Joint ASR and diarization by inserting a speaker tag in the transcription.

The first approach is the introduction of a speaker tag in the transcription of end-to-end ASR models (Fig. 16). Shafey et al. [65] proposed inserting a speaker role tag (e.g., ⟨doctor⟩ and ⟨patient⟩) into the output of a recurrent neural network-transducer (RNN-T)-based ASR system. This method was evaluated on doctor-patient conversations, and a significant reduction in the word diarization error rate (WDER) was reported with only a marginal degradation of WER. Similarly, Mao et al. [66] proposed inserting a speaker identity tag into the output of an attention-based encoder-decoder ASR system, and showed an improvement in DER, especially when oracle utterance boundaries were not given. The works by Shafey et al. and Mao et al. showed that the insertion of speaker tags is a simple and promising way to jointly perform ASR and speaker diarization. On the other hand, the speaker roles or speaker identity tags need to be determined and fixed during training. Thus, it is difficult to cope with an arbitrary number of speakers using this approach.
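A tiny sketch of how such a tagged hypothesis can be turned into speaker-attributed text is shown below; the tag names are illustrative and not those used by any specific system.

```python
def split_by_speaker_tags(tokens, tags=("<doctor>", "<patient>")):
    """Turn a single token stream with inline speaker-role tags (as produced
    by a jointly trained ASR model of the kind described above) into
    per-speaker word sequences."""
    current, result = None, {}
    for tok in tokens:
        if tok in tags:
            current = tok.strip("<>")
            result.setdefault(current, [])
        elif current is not None:
            result[current].append(tok)
    return {spk: " ".join(words) for spk, words in result.items()}

hyp = "<doctor> how are you feeling <patient> much better thank you".split()
print(split_by_speaker_tags(hyp))
# {'doctor': 'how are you feeling', 'patient': 'much better thank you'}
```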

Fig. 17: Joint decoding framework for ASR and speaker diarization.

The second approach is a MAP-based joint decoding framework. Kanda et al. [67] formulated the joint decoding of ASR and speaker diarization as follows (see also Fig. 17). Assume that a sequence of observations is represented by \mathcal{X}=\{\mathbf{X}_{1},\ldots,\mathbf{X}_{U}\}, where U denotes the number of segments (e.g., generated by applying SAD to a long audio) and \mathbf{X}_{u} denotes the acoustic feature sequence of the u-th segment. Further assume that word hypotheses with time boundary information are represented by \mathcal{W}=\{\mathbf{W}_{1},\ldots,\mathbf{W}_{U}\}, where \mathbf{W}_{u} is the speech recognition hypothesis corresponding to segment u. Here, \mathbf{W}_{u}=(\mathbf{W}_{1,u},\ldots,\mathbf{W}_{K,u}) contains all the speakers' hypotheses in segment u, where K denotes the number of speakers and \mathbf{W}_{k,u} represents the speech recognition hypothesis of the k-th speaker in segment u. Finally, a tuple of speaker embeddings \mathcal{E}=(\mathbf{e}_{1},\ldots,\mathbf{e}_{K}), where \mathbf{e}_{k}\in\mathbb{R}^{d} is the d-dimensional speaker embedding of the k-th speaker, is also assumed. With these notations, the joint decoding of multispeaker ASR and diarization can be formulated as the problem of finding the most likely \hat{\mathcal{W}} as follows:

\hat{\mathcal{W}} = \operatorname*{argmax}_{\mathcal{W}} P(\mathcal{W}|\mathcal{X})   (54)
= \operatorname*{argmax}_{\mathcal{W}} \Big\{ \sum_{\mathcal{E}} P(\mathcal{W},\mathcal{E}|\mathcal{X}) \Big\}   (55)
\approx \operatorname*{argmax}_{\mathcal{W}} \Big\{ \max_{\mathcal{E}} P(\mathcal{W},\mathcal{E}|\mathcal{X}) \Big\},   (56)

where the Viterbi approximation is applied to obtain the final equation. This maximization problem is further decomposed into two iterative problems as follows:

\hat{\mathcal{W}}^{(i)} = \operatorname*{argmax}_{\mathcal{W}} P(\mathcal{W}|\hat{\mathcal{E}}^{(i-1)},\mathcal{X}),   (57)
\hat{\mathcal{E}}^{(i)} = \operatorname*{argmax}_{\mathcal{E}} P(\mathcal{E}|\hat{\mathcal{W}}^{(i)},\mathcal{X}),   (58)

where i is the iteration index of the procedure. In [67], Eq. (57) is modeled by target-speaker ASR [195, 196, 197, 59] and Eq. (58) is modeled by overlap-aware speaker embedding estimation. This method achieved a speaker-attributed WER similar to that of target-speaker ASR with oracle speaker embeddings on two-speaker conversation data from the Corpus of Spontaneous Japanese [189]. On the other hand, it requires an iterative application of target-speaker ASR and speaker embedding extraction, which makes it challenging to apply the method in an online mode.

Fig. 18: End-to-end speaker-attributed ASR

As a third line of approaches, end-to-end speaker-attributed ASR (SA-ASR) models were recently proposed to jointly perform speaker counting, multi-talker ASR, and speaker identification [175, 176]. Contrary to the first two approaches, the end-to-end SA-ASR model takes an additional input of speaker profiles and identifies the index of the speaker profile based on an attention mechanism (Fig. 18). Thanks to the attention mechanism for speaker identification and the multi-talker ASR capability based on serialized output training [198], there is no limitation on the number of speakers that the model can cope with. If relevant speaker profiles are supplied at inference time, the end-to-end SA-ASR model can automatically transcribe the utterances while identifying the speaker of each utterance based on the supplied profiles. On the other hand, if the relevant speaker profiles are not available prior to inference, the end-to-end SA-ASR model can still be applied using dummy profiles, and speaker clustering on the internal speaker embeddings of the model (the “speaker query” in Fig. 18) is used to diarize the speakers [68]. The end-to-end SA-ASR model was evaluated on the LibriCSS dataset [79] and exhibited significantly better cpWER than the combination of multi-talker ASR and speaker diarization [199].

5 Diarization Evaluation Series and Datasets

This section describes the evaluation series and the commonly used datasets for speaker diarization evaluations. A summary of the most commonly used datasets that include English is shown in Table 3.

Table 3: Diarization Evaluation Datasets

Dataset | Language | Size (hr) | Style | # Spkr.
CALLHOME | Multilingual | 20 | Conversation | 2–7
AMI | English | 100 | Meeting | 3–5
ICSI meeting | English | 72 | Meeting | 3–10
CHiME-5/6 | English | 50 | Conversation | 4
VoxConverse | Multilingual | 74 | YouTube video | 1–21
LibriCSS | English | 10 | Read speech | 8
DH I Tr.1,2 | Multilingual | 19 (dev), 21 (eval) | Miscellaneous | 1–7
DH II Tr.1,2 | Multilingual | 24 (dev), 22 (eval) | Miscellaneous | 1–8
DH II Tr.3,4 | Multilingual | 262 (dev), 31 (eval) | Miscellaneous | 4
DH III Tr.1,2 | Multilingual | 34 (dev), 33 (eval) | Miscellaneous | 1–7

Most of the content is in English, with only a small amount of non-English content.

  • CALLHOME NIST SRE 2000 (Disk-8), often referred to as the CALLHOME dataset, is the most widely used dataset for speaker diarization in recent papers. This dataset contains 500 sessions of multilingual telephone speech. Each session has two to seven speakers, with two dominant speakers in each conversation.

  • AMI Corpus The AMI database [200] includes 100 h of meeting recordings from multiple sites in 171 meeting sessions. The AMI database provides an audio source recorded using lapel microphones that are separately recorded and amplified for each speaker. Another audio source is recorded using omnidirectional microphone arrays mounted on the table during the meeting. The AMI database is a suitable dataset for the evaluation of speaker diarization systems integrated with an ASR module, since AMI provides forced alignment data that contains word- and phoneme-level timings along with the transcript and speaker labels. Each meeting session has three to five speakers.

  • ICSI Meeting Corpus The ICSI meeting corpus [201] contains 75 meetings of four meeting types. The ICSI meeting corpus provides word-level timings along with the transcript and speaker labels. The audio was recorded using close-talking individual microphones and six tabletop microphones to provide speaker-specific channels and multichannel recordings. Each meeting has 3 to 10 participants.

  • CHiME-5/6 challenge and its dataset The CHiME-5 challenge [202] and CHiME-6 challenge [81] were designed as a series of ASR competitions for daily conversations of multiple speakers. The dataset was provided for the CHiME-5 challenge and contains 50 h of multiparty real conversations in everyday home environments. It contains speaker labels, segmentation, and corresponding transcriptions. The audio was recorded using six four-channel microphone arrays located in the kitchen and dining/living rooms of a house, as well as binaural microphones worn by the participants. The number of participants is fixed at four. While the oracle diarization results were allowed to be used for the ASR task in the CHiME-5 challenge, CHiME-6 challenge track 2 requires the results of both ASR and diarization. The primary evaluation metric for that track was cpWER, which counts both speaker-attribution errors and word recognition errors in the WER calculation. DER and JER were also evaluated as secondary metrics, without a “score collar” and with overlapped regions included. The CHiME-5/6 corpus was also used as one track of the DIHARD 2 challenge.

  • VoxSRC Challenge and VoxConverse corpus The VoxCeleb Speaker Recognition Challenge (VoxSRC) is a recent evaluation series for speaker recognition systems [203, 95]. The goal of VoxSRC is to test how well current technology can cope with speech “in the wild”. This evaluation series initially started with a pure speaker verification task [203], and a diarization task was added as track 4 in the latest evaluation, the VoxCeleb Speaker Recognition Challenge 2020 (VoxSRC-20) [95]. The VoxConverse dataset [204] was used for the speaker diarization task, with DER as the primary metric and JER as the secondary metric. The VoxConverse dataset contains 74 h of human conversation extracted from YouTube videos. The dataset is divided into a development set (20.3 h, 216 recordings) and a test set (53.5 h, 310 recordings). The number of speakers in each recording varies widely, from 1 to 21 speakers. The audio includes various types of noise such as background music, laughter, etc. It also contains a significant proportion of overlapping speech, ranging from 0% to 30.1% depending on the recording. While the dataset contains visual information as well as audio, as of June 2021 only the audio of the development set had been released, under a Creative Commons Attribution 4.0 International License for research purposes. The audio of the evaluation set was used as a blind test set.

  • LibriCSS The LibriCSS corpus [79] contains 10 h of multichannel recordings and was designed for research on speech separation, speech recognition, and speaker diarization. It was created by playing back audio from the LibriSpeech corpus [205] in a real meeting room and recording it with a 7-channel microphone array. It consists of 10 sessions, each of which is further decomposed into six 10-min mini-sessions. Each mini-session was made from the audio of eight speakers and designed to have a different overlap ratio, from 0% to 40%. To facilitate research, a baseline system for speech separation and ASR [79] and a baseline system that integrates speech separation, speaker diarization and ASR [80] have been developed and released.

  • DIHARD Challenge and its dataset The DIHARD evaluations [206, 77] focus on the performance gap of state-of-the-art diarization systems on challenging domains. The first DIHARD challenge, DIHARD 1, started with track 1 (oracle SAD) and track 2 (system SAD). The evaluation data was a collection of various corpora. It includes very challenging datasets, such as clinical interviews, web videos, and speech in the wild (e.g., recordings in restaurants), as well as relatively less challenging datasets, such as CTS and audiobooks, to diversify the domains. DIHARD 2 additionally included a multichannel speaker diarization task in track 3 (oracle SAD) and track 4 (system SAD) using the recordings from the CHiME-5 corpus [202]. In the latest DIHARD challenge, DIHARD 3, the CTS dataset was added, whereas the multichannel tracks 3 and 4 were excluded. The DIHARD challenge employs DER and JER as the evaluation metrics, without a “score collar” and with overlapped regions included.

  • Rich Transcription Evaluation Series The RT evaluation [13] is the pioneering evaluation series that initiated deeper investigation of speaker diarization in relation to ASR. The main purpose of this effort was to create ASR technologies that would produce transcriptions with descriptive metadata, such as who spoke when, which is where speaker diarization comes in. Thus, the main tasks in the evaluation were ASR and speaker diarization. The domains of interest were broadcast news, CTS, and meeting recordings with multiple participants. Throughout the period of 2002-2009, the RT evaluation series promoted and gauged advances in speaker diarization and ASR technology. The evaluations in this period are named the RT evaluations (RT-02, RT-03S, RT-03F, and RT-05F) and the RT Meeting Recognition evaluations (RT-06S, RT-07S and RT-09). These evaluations and their datasets include speaker diarization as a part of automatic metadata extraction (MDE).

  • Other datasets

    There are also several corpora that have been used for diarization research but are not covered in the list above. The Corpus of Spontaneous Japanese [189] contains about 12 h of two-speaker dialogue recorded using headset microphones. AISHELL-4 [207] is a relatively new Mandarin Chinese dataset containing 118 h of conferences with four to eight speakers. It was recorded with an 8-channel circular microphone array as well as headset microphones for each participant. The ESTER-1 [208] and ESTER-2 [209] evaluation campaign datasets are a set of French recordings designed for three task categories: Segmentation (S), Transcription (T) and Information Extraction (E). In the ESTER-1 and ESTER-2 evaluation campaigns, speaker diarization was evaluated as one of the core tasks, among other tasks including speaker tracking, sound event tracking, and transcription. The datasets for ESTER-1 and ESTER-2 include 100 h and 150 h of manually transcribed French radio broadcast news, respectively. ETAPE [210] is also a French speech processing evaluation dataset that contains 36 h of TV and radio shows with both prepared and spontaneous speech. Unlike the ESTER evaluation series, ETAPE targets cross-show speaker diarization.

6 Applications

    6.1 Meeting Transcription

    The goal of meeting transcription is to automatically generate speaker-attributed transcripts of real-life meetings based on their audio and, optionally, video recordings. Accurate meeting transcription is one of the processing steps in a pipeline for several tasks, such as summarization and topic extraction. Similarly, the same transcription system can be used in other domains such as healthcare [211].

    Although this task was introduced by NIST in the RT evaluation series back in 2003 [194, 201, 212], the initial systems had very poor performance, and consequently commercialization of the technology was not possible. However, recent advances in the areas of speech recognition [213, 214], far-field speech processing [215, 216, 217], speaker ID and diarization [218, 41, 76], have greatly improved the speaker-attributed transcription accuracy, enabling such commercialization. Bimodal processing combining cameras with microphone arrays has further improved the overall performance [219, 220].

    Depending on the application scenario, customer needs, and business scope, different constraints may be imposed on meeting transcription systems. For example, it is most often required to provide the resulting transcriptions with low latency, making diarization and recognition even more challenging. However, the architecture of the transcription system can substantially improve the overall performance, e.g., by using microphone arrays of known geometry as the input device. Also, in the case where the expected meeting attendees are known beforehand, the transcription system can further improve speaker attribution, all while providing the exact name of the speaker instead of randomly generated discrete speaker labels.

    Two different scenarios in this space are presented: first, a fixed-geometry microphone array combined with a fish-eye camera system; second, an ad-hoc geometry microphone array system without a camera. In both scenarios, a “non-binding” list of participants and their corresponding speaker profiles is considered to be known. In particular, the transcription system has access to the invitees’ names and profiles; however, the actual attendees may not accurately match those invited. As such, there is an option to include “unannounced” participants. In addition, some of the invitees may not have profiles. In both scenarios, there is a constraint of low-latency transcription, where initial results need to be shown with low latency, and the finalized results can be updated later in an offline mode. Some of the technical challenges to overcome are [221]:

    1. Although ASR on overlapping speech is one of the main challenges in meeting transcription, limited progress has been made over the years. Numerous multichannel speech separation methods have been proposed based on independent component analysis (ICA) or spatial clustering [222, 223, 224, 225, 226, 227], but their application to a meeting setup has had limited success. In addition, neural network-based separation methods such as permutation invariant training (PIT) [45] or deep clustering (DC) [44] cannot adequately address reverberation and background noise [228].

    2. Flexible framework: It is desirable that the transcription system is capable of processing all the available information, such as the multichannel audio and visual cues. The system needs to process a dynamically changing number of audio channels without loss of performance. As such, the architecture needs to be modular enough to encompass the different settings.

    3. The speaker-attributed ASR of natural meetings requires online or streaming ASR, audio pre-processing such as dereverberation, and accurate diarization and speaker identification. These multiple processing steps are usually optimized separately, and thus the overall pipeline is frequently inefficient.

    4. The use of multiple, unsynchronized audio streams, e.g., audio captured with mobile devices, adds complexity to the meeting setup and processing. In return, we gain potentially better spatial coverage since the devices are usually distributed around the room and near the speakers. As part of the application scenario, the meeting participants bring their personal devices, which can be repurposed to improve the overall quality of the meeting transcription. On the other hand, while there are several pioneering studies [229], it is unclear what the best strategies are for consolidating multiple asynchronous audio streams and to what extent they work for natural meetings in online and offline setups.

    Based on these considerations, an architecture for a meeting transcription system with asynchronous distant microphones was proposed in [184]. In this work, various fusion strategies were investigated: from early fusion (beamforming of the audio signals), to mid-fusion (combination of senones per channel), to late fusion (combination of the diarization and ASR results) [158]. The resulting system performance was benchmarked on real-world meeting recordings against fixed-geometry systems. As mentioned above, the requirement of speaker-attributed transcriptions with low latency was also adhered to. In addition to the end-to-end system analysis, [184] proposed the idea of “leave-one-out beamforming” in the asynchronous multi-microphone setup, enriching the “diversity” of the resulting signals, as proposed in [230]. Finally, it describes how an online, incremental version of recognizer output voting error reduction (ROVER) [154] can process both the ASR and diarization outputs, enhancing the overall speaker-attributed ASR performance.

    6.2 Conversational Interaction Analysis and Behavioral Modeling

    Speech and spoken language are central to conversational interactions. They carry crucial information about a speaker’s intent, emotions, identity, age, and other individual and interpersonal traits and state variables, including health state. Computational advances are increasingly allowing access to such rich information [231, 232]. For example, knowing how much, and how, a child speaks in an interaction carries critical information about the child’s developmental state, and offers clues to clinicians in diagnosing disorders such as autism [233]. Such analyses are made possible by capturing and processing the audio recordings of the interactions, which often involve two or more people. An important foundational step is the identification and association of the speech portions belonging to specific individuals involved in the conversation. The technologies providing these capabilities are SAD and speaker diarization. Speech portions segmented with speaker-specific information provided by speaker diarization, even without any explicit lexical transcription, can offer important information to domain experts, who can take advantage of speaker diarization results for quantitative turn-taking analysis.

    A domain that is most relevant to such analyses of spoken conversational interactions is behavioral signal processing (BSP) [234, 231], which refers to the technology and algorithms for modeling and understanding human communicative, affective, and social behaviors. For example, these may include analyzing how positive or negative a person is, how empathic an individual is toward another, what the behavior patterns reveal about the relationship status, and the health condition of an individual [232]. BSP involves addressing all the complexities of spontaneous conversational interactions, with additional challenges involved in handling and understanding the emotional, social, and interpersonal behavioral dynamics revealed through the verbal and nonverbal cues of the interaction participants. Therefore, knowledge of speaker-specific vocal information plays a significant role in BSP, requiring highly accurate speaker diarization. For example, a speaker diarization module is employed as a pre-processing step for analyzing psychotherapy mechanisms and quality [235] and for suicide risk assessment [236].

    Another popular application of speaker diarization for conversational interaction analysis is medical doctor-patient interactions. In the system described in [237], the nature of a patient’s memory problems is detected from conversations between neurologists and patients. The speech and language features extracted from the ASR transcripts, combined with the speaker diarization results, are used to predict the type of disorder. An automated assistant system for medical-domain transcription is proposed in [238], which includes a speaker diarization module, an ASR module, and a natural language generation module. The automated assistant accepts an audio clip and outputs grammatically correct sentences describing the topic of the conversation, the subject, and the subject’s symptoms.

    6.3 Audio Indexing

    Content-based audio indexing is a well-known application domain for speaker diarization. It can provide metainformation, such as the content or data type of given audio data, to make information retrieval efficient, since machine-driven search queries rely on such metadata. The more diverse the available information, the better the efficiency that can be achieved in retrieving audio content from a database.

    One useful piece of information for audio indexing would be ASR transcripts, which capture the content of the speech portions in the audio data. Speaker diarization can augment those transcripts in terms of “who spoke when”, which was the main purpose of the RT evaluation series [13], as discussed in Sections 4.1 and 5.3. The utterances aggregated per speaker by a speaker diarization system also enable per-speaker summaries or keyword lists, which can serve as additional query values to retrieve relevant content from the database. In [239], we can get a view of how speaker diarization outputs can be linked to information search in consumer-facing applications.

    6.4 Conversational AI

    Thanks to advances in ASR technology, the applications of ASR have evolved from simple voice command recognition systems to conversational AI systems. The fundamental idea of conversational AI is to build a machine that humans can talk to and interact with, which calls for capabilities that simple voice command recognition systems lack. In this sense, focusing on a speaker of interest in a multiparty setting is one of the most important features of conversational AI, and speaker diarization therefore becomes an essential component. For example, a conversational AI system in a car can pay attention to the specific speaker who is requesting information from the navigation system by applying speaker diarization along with ASR.

    Smart speakers and voice assistants are the most popular products in which speaker diarization plays a significant role for conversational AI. Since response time and online processing are crucial factors in real-life settings, the demand for end-to-end speaker diarization systems integrated into the ASR pipeline is growing. The performance of incremental (online) ASR and speaker diarization of commercial ASR services is evaluated and compared in [240]. The real-time, low-latency aspect of speaker diarization is expected to be emphasized even more in future systems, since the performance of online diarization and online ASR still has much room for improvement.

    7 Challenges and the Future of Speaker Diarization

    This paper has provided a comprehensive overview of speaker diarization techniques, highlighting the recent development of deep learning-based diarization approaches. In the early days, a speaker diarization system was developed as a pipeline of sub-modules, including front-end processing, SAD, segmentation, speaker embedding extraction, clustering, and post-processing, leading to a standalone system without much connection to other components of a given speech application. With the emergence of deep learning technology, more and more advancements have been made for speaker diarization, from methods that replace a single module with a deep-learning-based counterpart to fully end-to-end neural diarization. Furthermore, as speech recognition technology has become more accessible, a trend to tightly integrate speaker diarization into ASR systems has emerged, for example by using the ASR output to improve the accuracy of speaker diarization. Recently, joint modeling of speaker diarization and speech recognition has been investigated in an attempt to enhance the overall performance of both. Thanks to these achievements, speaker diarization systems have already been used in many applications, including meeting transcription, conversational interaction analysis, audio indexing, and conversational AI systems.

    As we have seen, tremendous progress has been made in speaker diarization systems. Nevertheless, there is still much room for improvement. As a final remark, we conclude this paper by listing the remaining challenges for speaker diarization toward future research and development.

    Online processing of speaker diarization

    Most speaker diarization methods assume that the entire recording can be observed before diarization is executed. However, numerous applications, such as meeting transcription systems or smart agents, require very short latency for assigning speaker labels. While there have been several attempts to build online speaker diarization systems, both clustering-based (e.g., [218]) and neural network-based (e.g., [41, 172, 173]), it still remains a challenging problem.
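    To make the latency constraint concrete, the following minimal sketch shows one generic form of online clustering-based diarization: each incoming segment embedding is greedily attached to the closest existing speaker centroid or opens a new speaker. The cosine-similarity threshold and running-mean update are illustrative assumptions rather than a published recipe:

```python
import numpy as np

class OnlineDiarizer:
    """Greedy online clustering: attach each incoming segment embedding to
    the closest existing speaker centroid, or open a new speaker when no
    centroid is similar enough. Threshold and update rule are illustrative
    assumptions, not a published recipe."""

    def __init__(self, threshold=0.6):
        self.threshold = threshold
        self.centroids = []   # running mean embedding per speaker
        self.counts = []      # number of segments assigned per speaker

    def assign(self, emb):
        emb = np.asarray(emb, dtype=float)
        emb = emb / (np.linalg.norm(emb) + 1e-8)
        if self.centroids:
            sims = [float(np.dot(emb, c) / (np.linalg.norm(c) + 1e-8))
                    for c in self.centroids]
            best = int(np.argmax(sims))
            if sims[best] >= self.threshold:
                n = self.counts[best]
                # update running mean of the matched speaker's centroid
                self.centroids[best] = (self.centroids[best] * n + emb) / (n + 1)
                self.counts[best] += 1
                return best            # existing speaker index
        self.centroids.append(emb)     # open a new speaker
        self.counts.append(1)
        return len(self.centroids) - 1
```

    Because each decision is made immediately and never revisited, early mistakes propagate through the rest of the recording, which is one reason online diarization still lags behind its offline counterpart.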

    Domain mismatch

    A model trained on data from a specific domain often works poorly on data from another domain. For example, it is experimentally known that the EEND model tends to overfit to the distribution of speaker overlaps in the training data [42]. Such a domain mismatch issue is universal for any training-based method. Given the growing interest in trainable speaker diarization systems, it will become increasingly important to evaluate their ability to handle a wide variety of inputs. International evaluation efforts for speaker diarization, such as the DIHARD challenges [206, 77, 241] and VoxSRC [203, 95], are also of great importance in this direction.

    Speaker overlap

    Overlapping speech from multiple talkers is an inevitable aspect of conversation. For example, an average of 12% to 15% speaker overlap was observed in meeting recordings [242, 92], and the rate can be even higher in daily conversations [243, 202, 81]. Nevertheless, many traditional speaker diarization systems, especially clustering-based ones, have focused only on non-overlapping regions, and overlapping regions have even been excluded from the evaluation metric [244]. While the topic has been studied for many years (e.g., the early works [245, 246]), there is growing interest in handling speaker overlap for better speaker diarization, including the application of speech separation [94], post-processing [247, 168], and joint modeling of speech separation and speaker diarization [64, 175].
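    As a simple illustration of how such overlap statistics can be quantified, the following sketch estimates the fraction of speech time during which two or more speakers are simultaneously active, given reference (start, end, speaker) segments; the frame-based approximation and segment format are assumptions for illustration:

```python
def overlap_ratio(segments, step=0.01):
    """Estimate the fraction of total speech time in which two or more
    speakers are active, from (start, end, speaker) reference segments.
    Uses a simple frame-based approximation with frame step in seconds."""
    t_end = max(end for _, end, _ in segments)
    n_frames = int(round(t_end / step))
    speech, overlap = 0, 0
    for i in range(n_frames):
        t = i * step
        active = sum(1 for s, e, _ in segments if s <= t < e)
        if active >= 1:
            speech += 1
        if active >= 2:
            overlap += 1
    return overlap / max(speech, 1)

# Two speakers overlapping between 3.0 s and 4.0 s out of 6.0 s of speech
print(overlap_ratio([(0.0, 4.0, "A"), (3.0, 6.0, "B")]))  # ~0.167
```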

    Integration with ASR

    Many applications require ASR results along with speaker diarization results. In the modular combination of speaker diarization and ASR, some systems place speaker diarization before ASR [83], while others place it after ASR [221, 193]. Both types of systems have shown strong performance on specific tasks, and determining the best system architecture for combined speaker diarization and ASR is still an open problem [80]. Furthermore, there is another line of research that jointly performs speaker diarization and ASR [65, 66, 67, 175], which was introduced in Section 4. The joint modeling approach could leverage the inter-dependency between speaker diarization and ASR to perform both tasks better. However, it has not yet been fully investigated whether such joint frameworks outperform well-tuned modular systems. Overall, the integration of speaker diarization and ASR remains one of the most actively investigated topics.
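    As a minimal sketch of one modular combination (assuming word-level time stamps from ASR and hypothetical input formats), the following code assigns each recognized word to the diarization label with the largest temporal overlap, yielding a speaker-attributed transcript:

```python
def attribute_words(asr_words, diar_segments):
    """Assign each ASR word (word, start, end) to the diarization label
    (start, end, speaker) with the largest temporal overlap. Input formats
    are assumptions for illustration, not a specific system's interface."""
    transcript = []
    for word, w_start, w_end in asr_words:
        best_spk, best_ov = "unk", 0.0
        for s_start, s_end, spk in diar_segments:
            ov = min(w_end, s_end) - max(w_start, s_start)
            if ov > best_ov:
                best_spk, best_ov = spk, ov
        transcript.append((best_spk, word))
    return transcript

diar = [(0.0, 2.0, "spk1"), (2.0, 5.0, "spk2")]
asr = [("hello", 0.2, 0.6), ("there", 0.7, 1.1), ("thanks", 2.3, 2.8)]
print(attribute_words(asr, diar))
# [('spk1', 'hello'), ('spk1', 'there'), ('spk2', 'thanks')]
```

    Joint modeling approaches, in contrast, aim to avoid this hard post-hoc assignment by estimating speaker labels and word hypotheses together.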

    Audiovisual modeling

    Visual information provides strong cues for identifying speakers. For example, video captured by a fish-eye camera was used to improve the accuracy of speaker diarization in a meeting transcription task [221]. Visual information was also used to significantly improve the accuracy of speaker diarization on YouTube videos [204]. While these studies showed the effectiveness of visual information, audiovisual speaker diarization has so far been investigated far less than audio-only speaker diarization, leaving considerable room for improvement.

    References

    • Tranter et al. [2003] S. E. Tranter, K. Yu, D. A. Reynolds, G. Evermann, D. Y. Kim, P. C. Woodland, An investigation into the interactions between speaker diarisation systems and automatic speech transcription, CUED/F-INFENG/TR-464 (2003).
    • Tranter and Reynolds [2006] S. E. Tranter, D. A. Reynolds, An overview of automatic speaker diarization systems, IEEE Transactions on Audio, Speech, and Language Processing 14 (2006) 1557–1565.
    • Anguera et al. [2012] X. Anguera, S. Bozonnet, N. Evans, C. Fredouille, G. Friedland, O. Vinyals, Speaker diarization: A review of recent research, IEEE Transactions on Audio, Speech, and Language Processing 20 (2012) 356–370.
    • Gish et al. [1991] H. Gish, M.-H. Siu, J. R. Rohlicek, Segregation of speakers for speech recognition and speaker identification, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 1991, pp. 873–876.
    • Siu et al. [1992] M.-H. Siu, Y. George, H. Gish, An unsupervised, sequential learning algorithm for segmentation of speech waveforms with multiple speakers, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 1992, pp. 189–192.
    • Rohlicek et al. [1992] J. R. Rohlicek, D. Ayuso, M. Bates, R. Bobrow, A. Boulanger, H. Gish, P. Jeanrenaud, M. Meteer, M. Siu, Gisting conversational speech, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 1992, pp. 113–116.
    • Jain et al. [1996] U. Jain, M. A. Siegler, S.-J. Doh, E. Gouvea, J. Huerta, P. J. Moreno, B. Raj, R. M. Stern, Recognition of continuous broadcast news with multiple unknown speakers and environments, in: Proceedings of ARPA Spoken Language Technology Workshop, 1996, pp. 61–66.
    • Padmanabhan et al. [1996] M. Padmanabhan, L. R. Bahl, D. Nahamoo, M. A. Picheny, Speaker clustering and transformation for speaker adaptation in large-vocabulary speech recognition systems, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 1996, pp. 701–704.
    • Gauvain et al. [1998] J.-L. Gauvain, L. Lamel, G. Adda, Partitioning and transcription of broadcast news data, in: Proceedings of the International Conference on Spoken Language Processing, 1998, pp. 1335–1338.
    • Liu and Kubala [1999] D. Liu, F. Kubala, Fast speaker change detection for broadcast news transcription and indexing, in: Proceedings of the International Conference on Spoken Language Processing, 1999, pp. 1031–1034.
    • Chen and Gopalakrishnan [1998] S. S. Chen, P. S. Gopalakrishnan, Speaker, environment and channel change detection and clustering via the Bayesian Information Criterion, in: Tech. Rep., IBM T. J. Watson Research Center, 1998, pp. 127–132.
    • AMI [????] AMI, AMI Consortium. http://www.amiproject.org/index.html.
    • NIST [????] NIST, Rich Transcription Evaluation. https://www.nist.gov/itl/iad/mig/rich-transcription-evaluation.
    • Ajmera and Wooters [2003] J. Ajmera, C. Wooters, A robust speaker clustering algorithm, in: Proceedings of IEEE Workshop on Automatic Speech Recognition and Understanding, 2003, pp. 411–416.
    • Tranter and Reynolds [2004] S. E. Tranter, D. A. Reynolds, Speaker diarisation for broadcast news, in: Odyssey, 2004, pp. 337–344.
    • Reynolds and Torres-Carrasquillo [2005] D. A. Reynolds, P. Torres-Carrasquillo, Approaches and applications of audio diarization, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 2005, pp. 953–956.
    • Zhu et al. [2005] X. Zhu, C. Barras, S. Meignier, J.-L. Gauvain, Combining speaker identification and BIC for speaker diarization, in: Proceedings of the Annual Conference of the International Speech Communication Association, 2005, pp. 2441–2444.
    • Meignier et al. [2006] S. Meignier, D. Moraru, C. Fredouille, J.-F. Bonastre, L. Besacier, Step-by-step and integrated approaches in broadcast news speaker diarization, Computer, Speech & Language 20 (2006) 303–330.
    • Rosenberg et al. [2002] A. E. Rosenberg, A. Gorin, Z. Liu, P. Parthasarathy, Unsupervised speaker segmentation of telephone conversations, in: Proceedings of the International Conference on Spoken Language Processing, 2002, pp. 565–568.
    • Liu and Kubala [2003] D. Liu, F. Kubala, A cross-channel modeling approach for automatic segmentation of conversational telephone speech, in: Proceedings of IEEE Workshop on Automatic Speech Recognition and Understanding, 2003, pp. 333–338.
    • Tranter et al. [2004] S. E. Tranter, K. Yu, G. Evermann, P. C. Woodland, Generating and evaluating segmentations for automatic speech recognition of conversational telephone speech, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 2004, pp. 753–756.
    • Kenny et al. [2010] P. Kenny, D. Reynolds, F. Castaldo, Diarization of telephone conversations using factor analysis, IEEE Journal of Selected Topics in Signal Processing 4 (2010) 1059–1070.
    • Ajmera et al. [2004] J. Ajmera, G. Lathoud, L. McCowan, Clustering and segmenting speakers and their locations in meetings, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 2004, pp. 605–608.
    • Jin et al. [2004] Q. Jin, K. Laskowski, T. Schultz, A. Waibel, Speaker segmentation and clustering in meetings, in: Proceedings of the International Conference on Spoken Language Processing, 2004, pp. 597–600.
    • Anguera et al. [2006] X. Anguera, C. Wooters, J. Hernando, Purity algorithms for speaker diarization of meetings data, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, volume I, 2006, pp. 1025–1028.
    • Leeuwen and Konecny [2007] D. A. V. Leeuwen, M. Konecny, Progress in the AMIDA speaker diarization system for meeting data, in: Proceedings of International Evaluation Workshops CLEAR 2007 and RT 2007, 2007, pp. 475–483.
    • Vijayasenan et al. [2009] D. Vijayasenan, F. Valente, H. Bourlard, An information theoretic approach to speaker diarization of meeting data, IEEE Transactions on Audio, Speech, and Language Processing 17 (2009) 1382–1393.
    • Anguera et al. [2007] X. Anguera, C. Wooters, J. Hernando, Acoustic beamforming for speaker diarization of meetings, IEEE Transactions on Audio, Speech, and Language Processing 15 (2007) 2011–2023.
    • Valente et al. [2010] F. Valente, P. Motlicek, D. Vijayasenan, Variational Bayesian speaker diarization of meeting recordings, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 2010, pp. 4954–4957.
    • Dehak et al. [2011] N. Dehak, P. Kenny, R. Dehak, P. Dumouchel, P. Ouellet, Front-end factor analysis for speaker verification, IEEE Transactions on Audio, Speech, and Language Processing 19 (2011).
    • Castaldo et al. [2008] F. Castaldo, D. Colibro, E. Dalmasso, P. Laface, C. Vair, Stream-based speaker segmentation using speaker factors and eigenvoices, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 2008, pp. 4133–4136.
    • Shum et al. [2011] S. Shum, N. Dehak, E. Chuangsuwanich, D. Reynolds, J. Glass, Exploiting intra-conversation variability for speaker diarization, in: Proceedings of the Annual Conference of the International Speech Communication Association, 2011.
    • Shum et al. [2012] S. Shum, N. Dehak, J. Glass, On the use of spectral and iterative methods for speaker diarization, in: Proceedings of the Annual Conference of the International Speech Communication Association, 2012, pp. 482–485.
    • Shum et al. [2013] S. H. Shum, N. Dehak, R. Dehak, J. R. Glass, Unsupervised methods for speaker diarization: An integrated and iterative approach, IEEE Transactions on Audio, Speech, and Language Processing 21 (May 2013) 2015–2028.
    • Senoussaoui et al. [2013] M. Senoussaoui, P. Kenny, T. Stafylakis, P. Dumouchel, A study of the cosine distance-based mean shift for telephone speech diarization, IEEE/ACM Transactions on Audio, Speech, and Language Processing 22 (2013) 217–227.
    • Sell and Garcia-Romero [2014] G. Sell, D. Garcia-Romero, Speaker diarization with plda i-vector scoring and unsupervised calibration, in: Proceedings of IEEE Spoken Language Technology Workshop, IEEE, 2014, pp. 413–417.
    • Variani et al. [2014] E. Variani, X. Lei, E. McDermott, I. L. Moreno, J. G-Dominguez, Deep neural networks for small footprint text-dependent speaker verification, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 2014, pp. 4052–4056.
    • Heigold et al. [2016] G. Heigold, I. Moreno, S. Bengio, N. Shazeer, End-to-end text-dependent speaker verification, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 2016, pp. 5115–5119.
    • Wang et al. [2018] Q. Wang, C. Downey, L. Wan, P. A. Mansfield, I. L. Moreno, Speaker diarization with LSTM, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 2018, pp. 5239–5243.
    • Snyder et al. [2018] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, S. Khudanpur, X-vectors: Robust DNN embeddings for speaker recognition, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 2018, pp. 5329–5333.
    • Zhang et al. [2019] A. Zhang, Q. Wang, Z. Zhu, J. Paisley, C. Wang, Fully supervised speaker diarization, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 2019, pp. 6301–6305.
    • Fujita et al. [2019a] Y. Fujita, N. Kanda, S. Horiguchi, K. Nagamatsu, S. Watanabe, End-to-end neural speaker diarization with permutation-free objectives, in: Proceedings of the Annual Conference of the International Speech Communication Association, 2019a, pp. 4300–4304.
    • Fujita et al. [2019b] Y. Fujita, N. Kanda, S. Horiguchi, Y. Xue, K. Nagamatsu, S. Watanabe, End-to-end neural speaker diarization with self-attention, in: Proceedings of IEEE Workshop on Automatic Speech Recognition and Understanding, IEEE, 2019b, pp. 296–303.
    • Hershey et al. [2016] J. R. Hershey, Z. Chen, J. Le Roux, S. Watanabe, Deep clustering: Discriminative embeddings for segmentation and separation, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, 2016, pp. 31–35.
    • Kolbæk et al. [2017] M. Kolbæk, D. Yu, Z.-H. Tan, J. Jensen, Multitalker speech separation with utterance-level permutation invariant training of deep recurrent neural networks, IEEE/ACM Transactions on Audio, Speech, and Language Processing 25 (2017) 1901–1913.
    • Luo and Mesgarani [2019] Y. Luo, N. Mesgarani, Conv-tasnet: Surpassing ideal time–frequency magnitude masking for speech separation, IEEE/ACM Transactions on Audio, Speech, and Language Processing 27 (2019) 1256–1266.
    • Variani et al. [2014] E. Variani, X. Lei, E. McDermott, I. L. Moreno, J. Gonzalez-Dominguez, Deep neural networks for small footprint text-dependent speaker verification, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, 2014, pp. 4052–4056.
    • Snyder et al. [2017] D. Snyder, D. Garcia-Romero, D. Povey, S. Khudanpur, Deep neural network embeddings for text-independent speaker verification., in: Proceedings of the Annual Conference of the International Speech Communication Association, 2017, pp. 999–1003.
    • Drugman et al. [2015] T. Drugman, Y. Stylianou, Y. Kida, M. Akamine, Voice activity detection: Merging source and filter-based information, IEEE Signal Processing Letters 23 (2015) 252–256.
    • Wang et al. [2020] J. Wang, X. Xiao, J. Wu, R. Ramamurthy, F. Rudzicz, M. Brudno, Speaker diarization with session-level speaker embedding refinement using graph neural networks, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, 2020, pp. 7109–7113.
    • Dimitriadis [2019] D. Dimitriadis, Enhancements for Audio-only Diarization Systems, arXiv preprint arXiv:1909.00082 (2019).
    • Medennikov et al. [2020] I. Medennikov, M. Korenevsky, T. Prisyach, Y. Khokhlov, M. Korenevskaya, I. Sorokin, T. Timofeeva, A. Mitrofanov, A. Andrusenko, I. Podluzhny, A. Laptev, A. Romanenko, Target-speaker voice activity detection: a novel approach for multi-speaker diarization in a dinner party scenario, in: Proceedings of the Annual Conference of the International Speech Communication Association, 2020, pp. 274–278.
    • Diez et al. [2018] M. Diez, L. Burget, P. Matejka, Speaker diarization based on bayesian hmm with eigenvoice priors., in: Odyssey, 2018, pp. 147–154.
    • Diez et al. [2019] M. Diez, L. Burget, S. Wang, J. Rohdin, J. Cernockỳ, Bayesian HMM based x-vector clustering for speaker diarization., in: Proceedings of the Annual Conference of the International Speech Communication Association, 2019, pp. 346–350.
    • Yu et al. [2017] D. Yu, X. Chang, Y. Qian, Recognizing multi-talker speech with permutation invariant training, in: Proceedings of the Annual Conference of the International Speech Communication Association, 2017, pp. 2456–2460.
    • Seki et al. [2018] H. Seki, T. Hori, S. Watanabe, J. Le Roux, J. R. Hershey, A purely end-to-end system for multi-speaker speech recognition, in: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, volume 1, 2018, pp. 2620–2630.
    • Chang et al. [2019] X. Chang, Y. Qian, K. Yu, S. Watanabe, End-to-end monaural multi-speaker ASR system without pretraining, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 2019, pp. 6256–6260.
    • Kanda et al. [2019a] N. Kanda, Y. Fujita, S. Horiguchi, R. Ikeshita, K. Nagamatsu, S. Watanabe, Acoustic modeling for distant multi-talker speech recognition with single-and multi-channel branches, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 2019a, pp. 6630–6634.
    • Kanda et al. [2019b] N. Kanda, S. Horiguchi, R. Takashima, Y. Fujita, K. Nagamatsu, S. Watanabe, Auxiliary interference speaker loss for target-speaker speech recognition, in: Proceedings of the Annual Conference of the International Speech Communication Association, 2019b, pp. 236–240.
    • Wang et al. [2021] X. Wang, N. Kanda, Y. Gaur, Z. Chen, Z. Meng, T. Yoshioka, Exploring end-to-end multi-channel asr with bias information for meeting transcription, in: Proceedings of IEEE Workshop on Automatic Speech Recognition and Understanding, 2021.
    • Wang et al. [2019] P. Wang, Z. Chen, X. Xiao, Z. Meng, T. Yoshioka, T. Zhou, L. Lu, J. Li, Speech separation using speaker inventory, in: Proceedings of IEEE Workshop on Automatic Speech Recognition and Understanding, 2019, pp. 230–236.
    • Han et al. [2020] C. Han, Y. Luo, C. Li, T. Zhou, K. Kinoshita, S. Watanabe, M. Delcroix, H. Erdogan, J. R. Hershey, N. Mesgarani, et al., Continuous speech separation using speaker inventory for long multi-talker recording, arXiv preprint arXiv:2012.09727 (2020).
    • Huang et al. [2020] Z. Huang, S. Watanabe, Y. Fujita, P. García, Y. Shao, D. Povey, S. Khudanpur, Speaker diarization with region proposal network, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, 2020, pp. 6514–6518.
    • von Neumann et al. [2019] T. von Neumann, K. Kinoshita, M. Delcroix, S. Araki, T. Nakatani, R. Haeb-Umbach, All-neural online source separation, counting, and diarization for meeting analysis, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, 2019, pp. 91–95.
    • Shafey et al. [2019] L. E. Shafey, H. Soltau, I. Shafran, Joint Speech Recognition and Speaker Diarization via Sequence Transduction, in: Proceedings of the Annual Conference of the International Speech Communication Association, ISCA, 2019, pp. 396–400.
    • Mao et al. [2020] H. H. Mao, S. Li, J. McAuley, G. Cottrell, Speech recognition and multi-speaker diarization of long conversations, in: Proceedings of the Annual Conference of the International Speech Communication Association, 2020, pp. 691–695.
    • Kanda et al. [2019] N. Kanda, S. Horiguchi, Y. Fujita, Y. Xue, K. Nagamatsu, S. Watanabe, Simultaneous speech recognition and speaker diarization for monaural dialogue recordings with target-speaker acoustic models, in: Proceedings of IEEE Workshop on Automatic Speech Recognition and Understanding, 2019, pp. 31–38.
    • Kanda et al. [2021] N. Kanda, X. Chang, Y. Gaur, X. Wang, Z. Meng, Z. Chen, T. Yoshioka, Investigation of end-to-end speaker-attributed ASR for continuous multi-talker recordings, in: Proceedings of IEEE Spoken Language Technology Workshop, 2021.
    • Fiscus et al. [2006] J. G. Fiscus, J. Ajot, M. Michel, J. S. Garofolo, The rich transcription 2006 spring meeting recognition evaluation, in: Proceedings of International Workshop on Machine Learning and Multimodal Interaction, NIST, 2006, pp. 309–322.
    • Kuhn [1955] H. W. Kuhn, The hungarian method for the assignment problem, Naval research logistics quarterly 2 (1955) 83–97.
    • Silovsky et al. [2012] J. Silovsky, J. Zdansky, J. Nouza, P. Cerva, J. Prazak, Incorporation of the asr output in speaker segmentation and clustering within the task of speaker diarization of broadcast streams, in: International Workshop on Multimedia Signal Processing, IEEE, 2012, pp. 118–123.
    • Park and Georgiou [2018] T. J. Park, P. Georgiou, Multimodal speaker segmentation and diarization using lexical and acoustic cues via sequence to sequence neural networks, in: Proceedings of the Annual Conference of the International Speech Communication Association, 2018, pp. 1373–1377.
    • Haeb-Umbach et al. [2019] R. Haeb-Umbach, S. Watanabe, T. Nakatani, M. Bacchiani, B. Hoffmeister, M. L. Seltzer, H. Zen, M. Souden, Speech processing for digital home assistants: Combining signal processing with deep-learning techniques, IEEE Signal Processing Magazine 36 (2019) 111–124.
    • Vincent et al. [2018] E. Vincent, T. Virtanen, S. Gannot, Audio source separation and speech enhancement, John Wiley & Sons, 2018.
    • Wang and Chen [2018] D. Wang, J. Chen, Supervised speech separation based on deep learning: An overview, IEEE/ACM Transactions on Audio, Speech, and Language Processing 26 (2018) 1702–1726.
    • Sell et al. [2018] G. Sell, D. Snyder, A. McCree, D. Garcia-Romero, J. Villalba, M. Maciejewski, V. Manohar, N. Dehak, D. Povey, S. Watanabe, et al., Diarization is hard: Some experiences and lessons learned for the JHU team in the inaugural DIHARD challenge., in: Proceedings of the Annual Conference of the International Speech Communication Association, 2018, pp. 2808–2812.
    • Ryant et al. [2019] N. Ryant, K. Church, C. Cieri, A. Cristia, J. Du, S. Ganapathy, M. Liberman, The second DIHARD diarization challenge: Dataset, task, and baselines, in: Proceedings of the Annual Conference of the International Speech Communication Association, 2019, pp. 978–982.
    • Diez et al. [2018] M. Diez, F. Landini, L. Burget, J. Rohdin, A. Silnova, K. Zmolíková, O. Novotnỳ, K. Veselỳ, O. Glembek, O. Plchot, et al., BUT system for DIHARD speech diarization challenge 2018., in: Proceedings of the Annual Conference of the International Speech Communication Association, 2018, pp. 2798–2802.
    • Chen et al. [2020] Z. Chen, T. Yoshioka, L. Lu, T. Zhou, Z. Meng, Y. Luo, J. Wu, X. Xiao, J. Li, Continuous speech separation: Dataset and analysis, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, 2020, pp. 7284–7288.
    • Raj et al. [2021] D. Raj, P. Denisov, Z. Chen, H. Erdogan, Z. Huang, M. He, S. Watanabe, J. Du, T. Yoshioka, Y. Luo, N. Kanda, J. Li, S. Wisdom, J. R. Hershey, Integration of speech separation, diarization, and recognition for multi-speaker meetings: System description, comparison, and analysis, in: Proceedings of IEEE Spoken Language Technology Workshop, 2021.
    • Watanabe et al. [2020] S. Watanabe, M. Mandel, J. Barker, E. Vincent, A. Arora, X. Chang, S. Khudanpur, V. Manohar, D. Povey, D. Raj, et al., CHiME-6 challenge: Tackling multispeaker speech recognition for unsegmented recordings, in: 6th International Workshop on Speech Processing in Everyday Environments (CHiME 2020), 2020.
    • Arora et al. [2020] A. Arora, D. Raj, A. S. Subramanian, K. Li, B. Ben-Yair, M. Maciejewski, P. Żelasko, P. Garcia, S. Watanabe, S. Khudanpur, The JHU multi-microphone multi-speaker asr system for the CHiME-6 challenge, arXiv preprint arXiv:2006.07898 (2020).
    • Medennikov et al. [2020] I. Medennikov, M. Korenevsky, T. Prisyach, Y. Khokhlov, M. Korenevskaya, I. Sorokin, T. Timofeeva, A. Mitrofanov, A. Andrusenko, I. Podluzhny, et al., The STC system for the CHiME-6 challenge, in: CHiME 2020 Workshop on Speech Processing in Everyday Environments, 2020.
    • Gao et al. [2018] T. Gao, J. Du, L.-R. Dai, C.-H. Lee, Densely connected progressive learning for lstm-based speech enhancement, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, 2018, pp. 5054–5058.
    • Erdogan et al. [2015] H. Erdogan, J. R. Hershey, S. Watanabe, J. Le Roux, Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, 2015, pp. 708–712.
    • Loizou [2013] P. C. Loizou, Speech enhancement: theory and practice, CRC press, 2013.
    • Heymann et al. [2016] J. Heymann, L. Drude, R. Haeb-Umbach, Neural network based spectral mask estimation for acoustic beamforming, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, 2016, pp. 196–200.
    • Erdogan et al. [2016] H. Erdogan, J. R. Hershey, S. Watanabe, M. I. Mandel, J. Le Roux, Improved MVDR beamforming using single-channel mask prediction networks, in: Proceedings of the Annual Conference of the International Speech Communication Association, 2016, pp. 1981–1985.
    • Nakatani et al. [2010] T. Nakatani, T. Yoshioka, K. Kinoshita, M. Miyoshi, B.-H. Juang, Speech dereverberation based on variance-normalized delayed linear prediction, IEEE Transactions on Audio, Speech, and Language Processing 18 (2010) 1717–1731.
    • Yoshioka and Nakatani [2012] T. Yoshioka, T. Nakatani, Generalization of multi-channel linear prediction methods for blind mimo impulse response shortening, IEEE Transactions on Audio, Speech, and Language Processing 20 (2012) 2707–2720.
    • Drude et al. [2018] L. Drude, J. Heymann, C. Boeddeker, R. Haeb-Umbach, NARA-WPE: A python package for weighted prediction error dereverberation in numpy and tensorflow for online and offline processing, in: Speech Communication; 13th ITG-Symposium, VDE, 2018, pp. 1–5.
    • Yoshioka et al. [2018] T. Yoshioka, H. Erdogan, Z. Chen, X. Xiao, F. Alleva, Recognizing overlapped speech in meetings: A multichannel separation approach using neural networks, in: Proceedings of the Annual Conference of the International Speech Communication Association, 2018, pp. 3038–3042.
    • Boeddecker et al. [2018] C. Boeddecker, J. Heitkaemper, J. Schmalenstroeer, L. Drude, J. Heymann, R. Haeb-Umbach, Front-end processing for the CHiME-5 dinner party scenario, in: Proceedings of CHiME 2018 Workshop on Speech Processing in Everyday Environments, 2018, pp. 35–40.
    • Xiao et al. [2020] X. Xiao, N. Kanda, Z. Chen, T. Zhou, T. Yoshioka, Y. Zhao, G. Liu, J. Wu, J. Li, Y. Gong, Microsoft speaker diarization system for the voxceleb speaker recognition challenge 2020, arXiv preprint arXiv:2010.11458 (2020).
    • Nagrani et al. [2020] A. Nagrani, J. S. Chung, J. Huh, A. Brown, E. Coto, W. Xie, M. McLaren, D. A. Reynolds, A. Zisserman, VoxSRC 2020: The second VoxCeleb speaker recognition challenge, arXiv preprint arXiv:2012.06867 (2020).
    • ITU-T [1996] ITU-T, A silence compression scheme for G.729 optimized for terminals conforming to Recommendation V.70, ITU-T Recommendation G.729 (1996).
    • Chengalvarayan [1999] R. Chengalvarayan, Robust energy normalization using speech/nonspeech discriminator for german connected digit recognition, in: Sixth European Conference on Speech Communication and Technology, 1999.
    • Woo et al. [2000] K.-H. Woo, T.-Y. Yang, K.-J. Park, C. Lee, Robust voice activity detection algorithm for estimating noise spectrum, Electronics Letters 36 (2000) 180–181.
    • Nemer et al. [2001] E. Nemer, R. Goubran, S. Mahmoud, Robust voice activity detection using higher-order statistics in the lpc residual domain, IEEE Transactions on Speech and Audio Processing 9 (2001) 217–231.
    • Sohn et al. [1999] J. Sohn, N. S. Kim, W. Sung, A statistical model-based voice activity detection, IEEE signal processing letters 6 (1999) 1–3.
    • Ng et al. [2012] T. Ng, B. Zhang, L. Nguyen, S. Matsoukas, X. Zhou, N. Mesgarani, K. Veselỳ, P. Matějka, Developing a speech activity detection system for the darpa rats program, in: Proceedings of the Annual Conference of the International Speech Communication Association, 2012, pp. 1969–1972.
    • Pfau et al. [2001] T. Pfau, D. P. Ellis, A. Stolcke, Multispeaker speech activity detection for the icsi meeting recorder, in: IEEE Workshop on Automatic Speech Recognition and Understanding, 2001. ASRU’01., IEEE, 2001, pp. 107–110.
    • Sarikaya and Hansen [1998] R. Sarikaya, J. H. Hansen, Robust detection of speech activity in the presence of noise, in: Proceedings of the International Conference on Spoken Language Processing, volume 4, Citeseer, 1998, pp. 1455–8.
    • Ryant et al. [2013] N. Ryant, M. Liberman, J. Yuan, Speech activity detection on youtube using deep neural networks., in: Proceedings of the Annual Conference of the International Speech Communication Association, 2013, pp. 728–731.
    • Thomas et al. [2014] S. Thomas, S. Ganapathy, G. Saon, H. Soltau, Analyzing convolutional neural networks for speech activity detection in mismatched acoustic conditions, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 2014, pp. 2519–2523.
    • Gelly and Gauvain [2017] G. Gelly, J.-L. Gauvain, Optimization of rnn-based speech activity detection, IEEE/ACM Transactions on Audio, Speech, and Language Processing 26 (2017) 646–656.
    • Haws et al. [2016] D. Haws, D. Dimitriadis, G. Saon, S. Thomas, M. Picheny, On the importance of event detection for asr, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 2016.
    • Chen et al. [1998] S. Chen, P. Gopalakrishnan, et al., Speaker, environment and channel change detection and clustering via the bayesian information criterion, in: Proceedings DARPA broadcast news transcription and understanding workshop, volume 8, Virginia, USA, 1998, pp. 127–132.
    • Kemp et al. [2000] T. Kemp, M. Schmidt, M. Westphal, A. Waibel, Strategies for automatic segmentation of audio data, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, volume 3, 2000, pp. 1423–1426.
    • Siegler et al. [1997] M. A. Siegler, U. Jain, B. Raj, R. M. Stern, Automatic segmentation, classification and clustering of broadcast news audio, in: Proc. DARPA speech recognition workshop, volume 1997, 1997.
    • Gish et al. [1991] H. Gish, M.-H. Siu, J. R. Rohlicek, Segregation of speakers for speech recognition and speaker identification, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 1991, pp. 873–876.
    • Bonastre et al. [2000] J.-F. Bonastre, P. Delacourt, C. Fredouille, T. Merlin, C. Wellekens, A speaker tracking system based on speaker turn detection for nist evaluation, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, volume 2, IEEE, 2000, pp. 1177–1180.
    • Gangadharaiah et al. [2004] R. Gangadharaiah, B. Narayanaswamy, N. Balakrishnan, A novel method for two-speaker segmentation, in: Eighth International Conference on Spoken Language Processing, 2004.
    • Tritschler and Gopinath [1999] A. Tritschler, R. A. Gopinath, Improved speaker segmentation and segments clustering using the bayesian information criterion, in: Sixth European Conference on Speech Communication and Technology, 1999.
    • Delacourt and Wellekens [2000] P. Delacourt, C. J. Wellekens, Distbic: A speaker-based segmentation for audio data indexing, Speech Communication 32 (2000) 111–126.
    • Mori and Nakagawa [2001] K. Mori, S. Nakagawa, Speaker change detection and speaker clustering using vq distortion for broadcast news speech recognition, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, volume 1, 2001, pp. 413–416.
    • Ajmera et al. [2004] J. Ajmera, I. McCowan, H. Bourlard, Robust speaker change detection, IEEE signal processing letters 11 (2004) 649–651.
    • Malegaonkar et al. [2006] A. Malegaonkar, A. Ariyaeeinia, P. Sivakumaran, J. Fortuna, Unsupervised speaker change detection using probabilistic pattern matching, IEEE signal processing letters 13 (2006) 509–512.
    • Sell et al. [2018] G. Sell, D. Snyder, A. McCree, D. Garcia-Romero, J. Villalba, M. Maciejewski, V. Manohar, N. Dehak, D. Povey, S. Watanabe, S. Khudanpur, Diarization is hard: some experiences and lessons learned for the JHU team in the inaugural DIHARD challenge, in: Proceedings of the Annual Conference of the International Speech Communication Association, 2018, pp. 2808–2812.
    • Reynolds et al. [2000] D. A. Reynolds, T. F. Quatieri, R. B. Dunn, Speaker verification using adapted gaussian mixture models, Digital signal processing 10 (2000) 19–41.
    • Kenny et al. [2007] P. Kenny, G. Boulianne, P. Ouellet, P. Dumouchel, Speaker and session variability in gmm-based speaker verification, IEEE Transactions on Audio, Speech, and Language Processing 15 (2007) 1448–1460.
    • Kenny et al. [2008] P. Kenny, P. Ouellet, N. Dehak, V. Gupta, P. Dumouchel, A study of interspeaker variability in speaker verification, IEEE Transactions on Audio, Speech, and Language Processing 16 (2008) 980–988.
    • Kenny et al. [2005] P. Kenny, G. Boulianne, P. Dumouchel, Eigenvoice modeling with sparse training data, IEEE Transactions on Speech and Audio Processing 13 (2005) 345–354.
    • Zhu and Pelecanos [2016] W. Zhu, J. Pelecanos, Online speaker diarization using adapted i-vector transforms, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, 2016, pp. 5045–5049.
    • Kanagasundaram et al. [2012] A. Kanagasundaram, D. Dean, R. Vogt, M. McLaren, S. Sridharan, M. Mason, Weighted lda techniques for i-vector based speaker verification, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 2012, pp. 4781–4784.
    • Kanagasundaram et al. [2014] A. Kanagasundaram, D. Dean, S. Sridharan, M. McLaren, R. Vogt, i-vector based speaker recognition using advanced channel compensation techniques, Computer Speech & Language 28 (2014) 121–140.
    • Senoussaoui et al. [2010] M. Senoussaoui, P. Kenny, N. Dehak, P. Dumouchel, et al., An i-vector extractor suitable for speaker recognition with both microphone and telephone speech., in: Odyssey, 2010, p. 6.
    • Kanagasundaram et al. [2011] A. Kanagasundaram, R. Vogt, D. Dean, S. Sridharan, M. Mason, i-vector based speaker recognition on short utterances, in: Proceedings of the 12th Annual Conference of the International Speech Communication Association, International Speech Communication Association, 2011, pp. 2341–2344.
    • Matějka et al. [2011] P. Matějka, O. Glembek, F. Castaldo, M. J. Alam, O. Plchot, P. Kenny, L. Burget, J. Černocky, Full-covariance ubm and heavy-tailed plda in i-vector speaker verification, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, 2011, pp. 4828–4831.
    • Garcia-Romero and Espy-Wilson [2011] D. Garcia-Romero, C. Y. Espy-Wilson, Analysis of i-vector Length Normalization in Speaker Recognition Systems, in: Proceedings of the Annual Conference of the International Speech Communication Association, 2011, pp. 249–252.
    • Kenny [2010] P. Kenny, Bayesian speaker verification with heavy-tailed priors., in: Odyssey, volume 14, 2010.
    • Jiang et al. [2014] Y. Jiang, K. A. Lee, L. Wang, Plda in the i-supervector space for text-independent speaker verification, EURASIP Journal on Audio, Speech, and Music Processing 2014 (2014) 1–13.
    • Sun et al. [2014] Y. Sun, X. Wang, X. Tang, Deep learning face representation from predicting 10,000 classes, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1891–1898.
    • Taigman et al. [2014] Y. Taigman, M. Yang, M. Ranzato, L. Wolf, Deepface: Closing the gap to human-level performance in face verification, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2014, pp. 1701–1708.
    • Villalba et al. [2019] J. Villalba, N. Chen, D. Snyder, D. Garcia-Romero, A. McCree, G. Sell, J. Borgstrom, F. Richardson, S. Shon, F. Grondin, et al., State-of-the-art speaker recognition for telephone and video speech: The JHU-MIT submission for NIST SRE18., in: Proceedings of the Annual Conference of the International Speech Communication Association, 2019, pp. 1488–1492.
    • Han and Narayanan [2007] K. J. Han, S. S. Narayanan, A robust stopping criterion for agglomerative hierarchical clustering in a speaker diarization system, in: Proceedings of the Annual Conference of the International Speech Communication Association, 2007.
    • Rougui et al. [2006] J. E. Rougui, M. Rziza, D. Aboutajdine, M. Gelgon, J. Martinez, Fast incremental clustering of gaussian mixture speaker models for scaling up retrieval in on-line broadcast, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, volume 5, IEEE, 2006.
    • Novoselov et al. [2019] S. Novoselov, A. Gusev, A. Ivanov, T. Pekhovsky, A. Shulipa, A. Avdeeva, A. Gorlanov, A. Kozlov, Speaker diarization with deep speaker embeddings for dihard challenge ii., in: Proceedings of the Annual Conference of the International Speech Communication Association, 2019, pp. 1003–1007.
    • Von Luxburg [2007] U. Von Luxburg, A tutorial on spectral clustering, Statist. and Comput. 17 (2007) 395–416.
    • Park et al. [2019] T. J. Park, K. J. Han, M. Kumar, S. Narayanan, Auto-tuning spectral clustering for speaker diarization using normalized maximum eigengap, IEEE Signal Processing Letters 27 (2019) 381–385.
    • MacQueen et al. [1967] J. MacQueen, et al., Some methods for classification and analysis of multivariate observations, in: Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, volume 14, Oakland, CA, USA, 1967, pp. 281–297.
    • Ng et al. [2001] A. Ng, M. Jordan, Y. Weiss, On spectral clustering: Analysis and an algorithm, Advances in neural information processing systems 14 (2001) 849–856.
    • Ning et al. [2006] H. Ning, M. Liu, H. Tang, T. S. Huang, A spectral clustering approach to speaker diarization, in: Proceedings of the International Conference on Spoken Language Processing, 2006, pp. 2178–2181.
    • Luque and Hernando [2012] J. Luque, J. Hernando, On the use of agglomerative and spectral clustering in speaker diarization of meetings, in: Odyssey, 2012, pp. 130–137.
    • Lin et al. [2019] Q. Lin, R. Yin, M. Li, H. Bredin, C. Barras, Lstm based similarity measurement with spectral clustering for speaker diarization, Proc. Interspeech 2019 (2019) 366–370.
    • Zajíc et al. [2016] Z. Zajíc, M. Kunešová, V. Radová, Investigation of Segmentation in i-vector Based Speaker Diarization of Telephone Speech, in: International Conference on Speech and Computer, 2016, pp. 411–418.
    • Dimitriadis and Fousek [2017] D. Dimitriadis, P. Fousek, Developing on-line speaker diarization system, in: Proceedings of the Annual Conference of the International Speech Communication Association, 2017, pp. 2739–2743.
    • Comaniciu and Meer [2002] D. Comaniciu, P. Meer, Mean shift: A robust approach toward feature space analysis, IEEE Transactions on pattern analysis and machine intelligence 24 (2002) 603–619.
    • Stafylakis et al. [2010] T. Stafylakis, V. Katsouros, G. Carayannis, Speaker clustering via the mean shift algorithm, Recall 2 (2010) 7.
    • Senoussaoui et al. [2013] M. Senoussaoui, P. Kenny, P. Dumouchel, T. Stafylakis, Efficient iterative mean shift based cosine dissimilarity for multi-recording speaker clustering, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, 2013, pp. 7712–7715.
    • Salmun et al. [2017] I. Salmun, I. Shapiro, I. Opher, I. Lapidot, Plda-based mean shift speakers’ short segments clustering, Computer Speech and Language 45 (2017) 411–436.
    • Kenny et al. [2010] P. Kenny, D. Reynolds, F. Castaldo, Diarization of telephone conversations using factor analysis, IEEE Journal of Selected Topics in Signal Processing 4 (2010) 1059–1070.
    • Diez et al. [2019] M. Diez, L. Burget, F. Landini, J. Černockỳ, Analysis of speaker diarization based on Bayesian HMM with eigenvoice priors, IEEE/ACM Transactions on Audio, Speech, and Language Processing 28 (2019) 355–368.
    • Fiscus [1997] J. G. Fiscus, A post-processing system to yield reduced word error rates: Recognizer output voting error reduction (ROVER), in: Proceedings of IEEE Workshop on Automatic Speech Recognition and Understanding, IEEE, 1997, pp. 347–354.
    • Brummer et al. [2007] N. Brummer, L. Burget, J. Cernocky, O. Glembek, F. Grezl, M. Karafiat, D. A. van Leeuwen, P. Matejka, P. Schwarz, A. Strasheim, Fusion of heterogeneous speaker recognition systems in the STBU submission for the NIST speaker recognition evaluation 2006, IEEE Transactions on Audio, Speech, and Language Processing 15 (2007) 2072–2084.
    • Huijbregts et al. [2009] M. Huijbregts, D. van Leeuwen, F. Jong, The majority wins: a method for combining speaker diarization systems, in: Proceedings of the Annual Conference of the International Speech Communication Association, ISCA, 2009, pp. 924–927.
    • Bozonnet et al. [2010] S. Bozonnet, N. Evans, X. Anguera, O. Vinyals, G. Friedland, C. Fredouille, System output combination for improved speaker diarization, in: Proceedings of the Annual Conference of the International Speech Communication Association, ISCA, 2010, pp. 2642–2645.
    • Stolcke and Yoshioka [2019] A. Stolcke, T. Yoshioka, DOVER: A method for combining diarization outputs, in: Proceedings of IEEE Workshop on Automatic Speech Recognition and Understanding, IEEE, 2019, pp. 757–763.
    • Raj et al. [2021] D. Raj, L. P. Garcia-Perera, Z. Huang, S. Watanabe, D. Povey, A. Stolcke, S. Khudanpur, DOVER-Lap: A method for combining overlap-aware diarization outputs, in: Proceedings of IEEE Spoken Language Technology Workshop, 2021.
    • Valente [2005] F. Valente, Variational Bayesian methods for audio indexing, Ph.D. thesis, 2005.
    • Kenny [2008] P. Kenny, Bayesian analysis of speaker diarization with eigenvoice priors, CRIM, Montreal, Technical Report (2008).
    • Landini et al. [2020] F. Landini, J. Profant, M. Diez, L. Burget, Bayesian HMM clustering of x-vector sequences (VBx) in speaker diarization: theory, implementation and analysis on standard tasks, arXiv preprint arXiv:2006.07898 (2020).
    • Sell and Garcia-Romero [2015] G. Sell, D. Garcia-Romero, Diarization resegmentation in the factor analysis subspace, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, 2015, pp. 4794–4798.
    • Diez et al. [2020] M. Diez, L. Burget, F. Landini, S. Wang, H. Černockỳ, Optimizing Bayesian HMM based x-vector clustering for the second DIHARD speech diarization challenge, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, 2020, pp. 6519–6523.
    • Lin et al. [2020] Q. Lin, Y. Hou, M. Li, Self-attentive similarity measurement strategies in speaker diarization, in: Proceedings of the Annual Conference of the International Speech Communication Association, 2020, pp. 284–288.
    • Park et al. [2021] T. J. Park, M. Kumar, S. Narayanan, Multi-scale speaker diarization with neural affinity score fusion, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 2021, pp. 7173–7177.
    • Flemotomos and Dimitriadis [2020] N. Flemotomos, D. Dimitriadis, A memory augmented architecture for continuous speaker identification in meetings, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 2020, pp. 6524–6528.
    • Horiguchi et al. [2020] S. Horiguchi, P. Garcia, Y. Fujita, S. Watanabe, K. Nagamatsu, End-to-end speaker diarization as post-processing, arXiv preprint arXiv:2012.10055 (2020).
    • Kinoshita et al. [2020] K. Kinoshita, M. Delcroix, S. Araki, T. Nakatani, Tackling real noisy reverberant meetings with all-neural source separation, counting, and diarization system, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, 2020, pp. 381–385.
    • Horiguchi et al. [2020] S. Horiguchi, Y. Fujita, S. Watanabe, Y. Xue, K. Nagamatsu, End-to-end speaker diarization for an unknown number of speakers with encoder-decoder based attractors, in: Proceedings of the Annual Conference of the International Speech Communication Association, 2020, pp. 269–273.
    • Fujita et al. [2020] Y. Fujita, S. Watanabe, S. Horiguchi, Y. Xue, J. Shi, K. Nagamatsu, Neural speaker diarization with speaker-wise chain rule, arXiv preprint arXiv:2006.01796 (2020).
    • Xue et al. [2020] Y. Xue, S. Horiguchi, Y. Fujita, S. Watanabe, K. Nagamatsu, Online end-to-end neural diarization with speaker-tracing buffer, arXiv preprint arXiv:2006.02616 (2020).
    • Han et al. [2021] E. Han, C. Lee, A. Stolcke, BW-EDA-EEND: Streaming end-to-end neural speaker diarization for a variable number of speakers, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 2021, pp. 7193–7197.
    • Kinoshita et al. [2021] K. Kinoshita, M. Delcroix, N. Tawara, Integrating end-to-end neural and clustering-based diarization: Getting the best of both worlds, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 2021, pp. 7198–7202.
    • Kanda et al. [2020] N. Kanda, Y. Gaur, X. Wang, Z. Meng, Z. Chen, T. Zhou, T. Yoshioka, Joint speaker counting, speech recognition, and speaker identification for overlapped speech of any number of speakers, in: Proceedings of the Annual Conference of the International Speech Communication Association, 2020, pp. 36–40.
    • Kanda et al. [2021] N. Kanda, Z. Meng, L. Lu, Y. Gaur, X. Wang, Z. Chen, T. Yoshioka, Minimum Bayes risk training for end-to-end speaker-attributed ASR, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 2021, pp. 6503–6507.
    • Ustinova and Lempitsky [2016] E. Ustinova, V. Lempitsky, Learning deep embeddings with histogram loss, Proceedings of Advances in Neural Information Processing Systems 29 (2016) 4170–4178.
    • Recht et al. [2010] B. Recht, M. Fazel, P. A. Parrilo, Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization, SIAM review 52 (2010) 471–501.
    • Xie et al. [2016] J. Xie, R. Girshick, A. Farhadi, Unsupervised deep embedding for clustering analysis, in: Proceedings of International Conference on Machine Learning, 2016, pp. 478–487.
    • Guo et al. [2017] X. Guo, L. Gao, X. Liu, J. Yin, Improved deep embedded clustering with local structure preservation, in: Proceedings of International Joint Conference on Artificial Intelligence, 2017, pp. 1753–1759.
    • Santoro et al. [2018] A. Santoro, R. Faulkner, D. Raposo, J. Rae, M. Chrzanowski, T. Weber, D. Wierstra, O. Vinyals, R. Pascanu, T. Lillicrap, Relational Recurrent Neural Networks, in: Proceedings of Advances in Neural Information Processing Systems, 2018, pp. 7299–7310.
    • Santoro et al. [2016] A. Santoro, S. Bartunov, M. Botvinick, D. Wierstra, T. Lillicrap, Meta-learning with Memory-Augmented Neural Networks, in: Proceedings of International Conference on Machine Learning, 2016, pp. 1842––1850.
    • Sukhbaatar et al. [2015] S. Sukhbaatar, J. Weston, R. Fergus, et al., End-to-End Memory Networks, in: Proceedings of Advances in Neural Information Processing Systems, 2015, pp. 2440–2448.
    • Yoshioka et al. [2019] T. Yoshioka, D. Dimitriadis, A. Stolcke, W. Hinthorn, Z. Chen, M. Zeng, X. Huang, Meeting Transcription Using Asynchronous Distant Microphones, in: Proceedings of the Annual Conference of the International Speech Communication Association, 2019, pp. 2968–2972.
    • Blei and Frazier [2011] D. M. Blei, P. I. Frazier, Distance dependent chinese restaurant processes., Journal of Machine Learning Research 12 (2011).
    • Ren et al. [2016] S. Ren, K. He, R. Girshick, J. Sun, Faster R-CNN: Towards real-time object detection with region proposal networks, IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (2016) 1137–1149.
    • Kounades-Bastian et al. [2017a] D. Kounades-Bastian, L. Girin, X. Alameda-Pineda, S. Gannot, R. Horaud, An EM algorithm for joint source separation and diarisation of multichannel convolutive speech mixtures, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, 2017a, pp. 16–20.
    • Kounades-Bastian et al. [2017b] D. Kounades-Bastian, L. Girin, X. Alameda-Pineda, R. Horaud, S. Gannot, Exploiting the intermittency of speech for joint separation and diarization, in: Proceedings of IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, IEEE, 2017b, pp. 41–45.
    • Maekawa [2003] K. Maekawa, Corpus of spontaneous japanese: Its design and evaluation, in: ISCA & IEEE Workshop on Spontaneous Speech Processing and Recognition, 2003, pp. 7–12.
    • Huang et al. [2007] J. Huang, E. Marcheret, K. Visweswariah, G. Potamianos, The ibm rt07 evaluation systems for speaker diarization on lecture meetings, in: Multimodal Technologies for Perception of Humans, Springer, 2007, pp. 497–508.
    • Canseco-Rodriguez et al. [2004] L. Canseco-Rodriguez, L. Lamel, J.-L. Gauvain, Speaker diarization from speech transcripts, in: Proceedings of the International Conference on Spoken Language Processing, volume 4, 2004, pp. 3–7.
    • Flemotomos et al. [2020] N. Flemotomos, P. Georgiou, S. Narayanan, Linguistically aided speaker diarization using speaker role information, in: Odyssey, 2020, pp. 117–124.
    • Park et al. [2019] T. J. Park, K. J. Han, J. Huang, X. He, B. Zhou, P. Georgiou, S. Narayanan, Speaker diarization with lexical information, in: Proceedings of the Annual Conference of the International Speech Communication Association, 2019, pp. 391–395.
    • Fiscus et al. [2007] J. Fiscus, J. Ajot, J. Garofolo, The Rich Transcription 2007 meeting recognition evaluation, 2007, pp. 373–389.
    • Zmolikova et al. [2017] K. Zmolikova, M. Delcroix, K. Kinoshita, T. Higuchi, A. Ogawa, T. Nakatani, Speaker-aware neural network based beamformer for speaker extraction in speech mixtures., in: Proceedings of the Annual Conference of the International Speech Communication Association, 2017, pp. 2655–2659.
    • Delcroix et al. [2018] M. Delcroix, K. Zmolikova, K. Kinoshita, A. Ogawa, T. Nakatani, Single channel target speaker extraction and recognition with speaker beam, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, 2018, pp. 5554–5558.
    • Delcroix et al. [2019] M. Delcroix, S. Watanabe, T. Ochiai, K. Kinoshita, S. Karita, A. Ogawa, T. Nakatani, End-to-end SpeakerBeam for single channel target speech recognition., in: Proceedings of the Annual Conference of the International Speech Communication Association, 2019, pp. 451–455.
    • Kanda et al. [2020] N. Kanda, Y. Gaur, X. Wang, Z. Meng, T. Yoshioka, Serialized output training for end-to-end overlapped speech recognition, in: Proceedings of the Annual Conference of the International Speech Communication Association, 2020, pp. 2797–2801.
    • Kanda et al. [2021] N. Kanda, G. Ye, Y. Gaur, X. Wang, Z. Meng, Z. Chen, T. Yoshioka, End-to-end speaker-attributed asr with transformer, arXiv preprint arXiv:2104.02128 (2021).
    • Carletta et al. [2005] J. Carletta, S. Ashby, S. Bourban, M. Flynn, M. Guillemot, T. Hain, J. Kadlec, V. Karaiskos, W. Kraaij, M. Kronenthal, et al., The ami meeting corpus: A pre-announcement, in: International workshop on machine learning for multimodal interaction, Springer, 2005, pp. 28–39.
    • Janin et al. [2003] A. Janin, D. Baron, J. Edwards, D. Ellis, D. Gelbart, N. Morgan, B. Peskin, T. Pfau, E. Shriberg, A. Stolcke, C. Wooters, The ICSI meeting corpus, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 2003, pp. I–364–I–367.
    • Barker et al. [2018] J. Barker, S. Watanabe, E. Vincent, J. Trmal, The fifth ’chime’ speech separation and recognition challenge: Dataset, task and baselines, in: Proceedings of the Annual Conference of the International Speech Communication Association, 2018, pp. 1561–1565.
    • Chung et al. [2019] J. S. Chung, A. Nagrani, E. Coto, W. Xie, M. McLaren, D. A. Reynolds, A. Zisserman, VoxSRC 2019: The first VoxCeleb speaker recognition challenge, arXiv preprint arXiv:1912.02522 (2019).
    • Chung et al. [2020] J. S. Chung, J. Huh, A. Nagrani, T. Afouras, A. Zisserman, Spot the conversation: Speaker diarisation in the wild, in: Proceedings of the Annual Conference of the International Speech Communication Association, 2020, pp. 299–303.
    • Panayotov et al. [2015] V. Panayotov, G. Chen, D. Povey, S. Khudanpur, LibriSpeech: an ASR corpus based on public domain audio books, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, 2015, pp. 5206–5210.
    • Ryant et al. [2018] N. Ryant, K. Church, C. Cieri, A. Cristia, J. Du, S. Ganapathy, M. Liberman, The first dihard speech diarization challenge, in: Proceedings of the Annual Conference of the International Speech Communication Association, 2018.
    • Fu et al. [2021] Y. Fu, L. Cheng, S. Lv, Y. Jv, Y. Kong, Z. Chen, Y. Hu, L. Xie, J. Wu, H. Bu, et al., Aishell-4: An open source dataset for speech enhancement, separation, recognition and speaker diarization in conference scenario, arXiv preprint arXiv:2104.03603 (2021).
    • Gravier et al. [2004] G. Gravier, J.-F. Bonastre, E. Geoffrois, S. Galliano, K. McTait, K. Choukri, The ester evaluation campaign for the rich transcription of french broadcast news., in: LREC, 2004.
    • Galliano et al. [2009] S. Galliano, G. Gravier, L. Chaubard, The ester 2 evaluation campaign for the rich transcription of french radio broadcasts, in: Tenth Annual Conference of the International Speech Communication Association, 2009.
    • Gravier et al. [2012] G. Gravier, G. Adda, N. Paulson, M. Carré, A. Giraudel, O. Galibert, The etape corpus for the evaluation of speech-based tv content processing in the french language, in: LREC-Eighth international conference on Language Resources and Evaluation, 2012, p. na.
    • Chiu et al. [2017] C.-C. Chiu, A. Tripathi, K. Chou, C. Co, N. Jaitly, D. Jaunzeikare, A. Kannan, P. Nguyen, H. Sak, A. Sankar, et al., Speech recognition for medical conversations, arXiv preprint arXiv:1711.07274 (2017).
    • Carletta et al. [2006] J. Carletta, S. Ashby, S. Bourban, M. Flynn, M. Guillemot, T. Hain, J. Kadlec, V. Karaiskos, W. Kraaij, M. Kronenthal, G. Lathoud, M. Lincoln, A. Lisowska, I. McCowan, W. Post, D. Reidsma, P. Wellner, The AMI meeting corpus: a pre-announcement, in: Proceedings of International Workshop on Machine Learning for Multimodal Interaction, 2006, pp. 28–39.
    • Xiong et al. [2016] W. Xiong, J. Droppo, X. Huang, F. Seide, M. Seltzer, A. Stolcke, D. Yu, G. Zweig, Achieving human parity in conversational speech recognition, arXiv preprint arXiv:1610.05256 (2016).
    • Saon et al. [2017] G. Saon, G. Kurata, T. Sercu, K. Audhkhasi, S. Thomas, D. Dimitriadis, X. Cui, B. Ramabhadran, M. Picheny, L.-L. Lim, et al., English conversational telephone speech recognition by humans and machines, arXiv preprint arXiv:1703.02136 (2017).
    • Yoshioka et al. [2015] T. Yoshioka, N. Ito, M. Delcroix, A. Ogawa, K. Kinoshita, M. Fujimoto, C. Yu, W. Fabian, M. Espi, T. Higuchi, S. Araki, T. Nakatani, The NTT CHiME-3 system: advances in speech enhancement and recognition for mobile multi-microphone devices, in: Proceedings of IEEE Workshop on Automatic Speech Recognition and Understanding, 2015, pp. 436–443.
    • Du et al. [2016] J. Du, Y. Tu, L. Sun, F. Ma, H. Wang, J. Pan, C. Liu, J. Chen, C. Lee, The USTC-iFlytek system for CHiME-4 challenge, in: Proceedings of CHiME-4 Workshop, 2016, pp. 36–38.
    • Li et al. [2017] B. Li, T. N. Sainath, A. Narayanan, J. Caroselli, M. Bacchiani, A. Misra, I. Shafran, H. Sak, G. Pundak, K. Chin, K. C. Sim, R. J. Weiss, K. W. Wilson, E. Variani, C. Kim, O. Siohan, M. Weintraub, E. McDermott, R. Rose, M. Shannon, Acoustic modeling for Google Home, in: Proceedings of the Annual Conference of the International Speech Communication Association, 2017, pp. 399–403.
    • Dimitriadis and Fousek [2017] D. Dimitriadis, P. Fousek, Developing on-line speaker diarization system, in: Proceedings of the Annual Conference of the International Speech Communication Association, 2017, pp. 2739–2743.
    • He et al. [2016] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
    • He et al. [2017] K. He, G. Gkioxari, P. Dollár, R. Girshick, Mask R-CNN, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2961–2969.
    • Yoshioka et al. [2019] T. Yoshioka, I. Abramovski, C. Aksoylar, Z. Chen, M. David, D. Dimitriadis, Y. Gong, I. Gurvich, X. Huang, Y. Huang, A. Hurvitz, L. Jiang, S. Koubi, E. Krupka, I. Leichter, C. Liu, P. Parthasarathy, A. Vinnikov, L. Wu, X. Xiao, W. Xiong, H. Wang, Z. Wang, J. Zhang, Y. Zhao, T. Zhou, Advances in Online Audio-Visual Meeting Transcription, in: Proceedings of IEEE Workshop on Automatic Speech Recognition and Understanding, 2019, pp. 276–283.
    • Buchner et al. [2005] H. Buchner, R. Aichner, W. Kellermann, A generalization of blind source separation algorithms for convolutive mixtures based on second-order statistics, IEEE Transactions on Speech and Audio Processing 13 (2005) 120–134.
    • Sawada et al. [2007] H. Sawada, S. Araki, S. Makino, Measuring dependence of bin-wise separated signals for permutation alignment in frequency-domain BSS, in: Proceedings of the IEEE International Symposium on Circuits and Systems, 2007, pp. 3247–3250.
    • Nesta et al. [2011] F. Nesta, P. Svaizer, M. Omologo, Convolutive BSS of short mixtures by ICA recursively regularized across frequencies, IEEE Transactions on Audio, Speech, and Language Processing 19 (2011) 624–639.
    • Sawada et al. [2011] H. Sawada, S. Araki, S. Makino, Underdetermined convolutive blind source separation via frequency bin-wise clustering and permutation alignment, IEEE Transactions on Audio, Speech, and Language Processing 19 (2011) 516–527.
    • Ito et al. [2014] N. Ito, S. Araki, T. Yoshioka, T. Nakatani, Relaxed disjointness based clustering for joint blind source separation and dereverberation, in: Proceedings of International Workshop on Acoustic Echo and Noise Control, 2014, pp. 268–272.
    • Drude and Haeb-Umbach [2017] L. Drude, R. Haeb-Umbach, Tight integration of spatial and spectral features for BSS with deep clustering embeddings, in: Proceedings of the Annual Conference of the International Speech Communication Association, 2017, pp. 2650–2654.
    • Maciejewski et al. [2018] M. Maciejewski, G. Sell, L. P. Garcia-Perera, S. Watanabe, S. Khudanpur, Building corpora for single-channel speech separation across multiple domains, arXiv preprint arXiv:1811.02641 (2018).
    • Araki et al. [2018] S. Araki, N. Ono, K. Kinoshita, M. Delcroix, Meeting recognition with asynchronous distributed microphone array using block-wise refinement of mask-based MVDR beamformer, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 2018, pp. 5694–5698.
    • Stolcke [2011] A. Stolcke, Making the most from multiple microphones in meeting recordings, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 2011, pp. 4992–4995.
    • Narayanan and Georgiou [2013] S. Narayanan, P. G. Georgiou, Behavioral signal processing: Deriving human behavioral informatics from speech and language, Proceedings of the IEEE 101 (2013) 1203–1233.
    • Bone et al. [2017] D. Bone, C.-C. Lee, T. Chaspari, J. Gibson, S. Narayanan, Signal processing and machine learning for mental health research and clinical applications, IEEE Signal Processing Magazine 34 (2017) 189–196.
    • Kumar et al. [2020] M. Kumar, S. H. Kim, C. Lord, S. Narayanan, Speaker diarization for naturalistic child-adult conversational interactions using contextual information, Journal of the Acoustical Society of America 147 (2020) EL196–EL200. doi:10.1121/10.0000736.
    • Georgiou et al. [2011] P. G. Georgiou, M. P. Black, S. S. Narayanan, Behavioral signal processing for understanding (distressed) dyadic interactions: some recent developments, in: Proceedings of the joint ACM workshop on Human gesture and behavior understanding, 2011, pp. 7–12.
    • Xiao et al. [2016] B. Xiao, C. Huang, Z. E. Imel, D. C. Atkins, P. Georgiou, S. S. Narayanan, A technology prototype system for rating therapist empathy from audio recordings in addiction counseling, PeerJ Computer Science 2 (2016) e59.
    • Chakravarthula et al. [2020] S. N. Chakravarthula, M. Nasir, S.-Y. Tseng, H. Li, T. J. Park, B. Baucom, C. J. Bryan, S. Narayanan, P. Georgiou, Automatic prediction of suicidal risk in military couples using multimodal interaction cues from couples conversations, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 2020, pp. 6539–6543.
    • Mirheidari et al. [2017] B. Mirheidari, D. Blackburn, K. Harkness, T. Walker, A. Venneri, M. Reuber, H. Christensen, Toward the automation of diagnostic conversation analysis in patients with memory complaints, Journal of Alzheimer’s Disease 58 (2017) 373–387.
    • Finley et al. [2018] G. P. Finley, E. Edwards, A. Robinson, N. Sadoughi, J. Fone, M. Miller, D. Suendermann-Oeft, M. Brenndoerfer, N. Axtmann, An automated assistant for medical scribes, in: Proceedings of the Annual Conference of the International Speech Communication Association, 2018, pp. 3212–3213.
    • Guo et al. [2016] A. Guo, A. Faria, J. Riedhammer, Remeeting – Deep insights to conversations, in: Proceedings of the Annual Conference of the International Speech Communication Association, 2016, pp. 1964–1965.
    • Addlesee et al. [2020] A. Addlesee, Y. Yu, A. Eshghi, A comprehensive evaluation of incremental speech recognition and diarization for conversational AI, in: Proceedings of the International Conference on Computational Linguistics, 2020, pp. 3492–3503.
    • Ryant et al. [2020] N. Ryant, K. Church, C. Cieri, J. Du, S. Ganapathy, M. Liberman, Third DIHARD challenge evaluation plan, arXiv preprint arXiv:2006.05815 (2020).
    • Cetin and Shriberg [2006] O. Cetin, E. Shriberg, Speaker overlaps and ASR errors in meetings: Effects before, during, and after the overlap, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, volume 1, IEEE, 2006, pp. 357–360.
    • Kanda et al. [2019] N. Kanda, C. Boeddeker, J. Heitkaemper, Y. Fujita, S. Horiguchi, K. Nagamatsu, R. Haeb-Umbach, Guided source separation meets a strong ASR backend: Hitachi/Paderborn University joint investigation for dinner party ASR, in: Proceedings of the Annual Conference of the International Speech Communication Association, 2019, pp. 1248–1252.
    • Garofolo et al. [2004] J. S. Garofolo, C. D. Laprun, J. G. Fiscus, The rich transcription 2004 spring meeting recognition evaluation, NIST, 2004.
    • Otterson and Ostendorf [2007] S. Otterson, M. Ostendorf, Efficient use of overlap information in speaker diarization, in: Proceedings of IEEE Workshop on Automatic Speech Recognition and Understanding, IEEE, 2007, pp. 683–686.
    • Boakye et al. [2008] K. Boakye, B. Trueba-Hornero, O. Vinyals, G. Friedland, Overlapped speech detection for improved speaker diarization in multiparty meetings, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, 2008, pp. 4353–4356.
    • Bullock et al. [2020] L. Bullock, H. Bredin, L. P. Garcia-Perera, Overlap-aware diarization: Resegmentation using neural end-to-end overlapped speech detection, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, 2020, pp. 7114–7118.