
A Review of Speaker Diarization: Recent Advances with Deep Learning

Tae Jin Park Naoyuki Kanda Dimitrios Dimitriadis Kyu J. Han Shinji Watanabe Shrikanth Narayanan University of Southern California, Los Angeles, USA Microsoft, Redmond, USA ASAPP, Mountain View, USA Johns Hopkins University, Baltimore, USA
Abstract

Speaker diarization is the task of labeling audio or video recordings with classes that correspond to speaker identity, or in short, the task of identifying “who spoke when”. In the early years, speaker diarization algorithms were developed for speech recognition on multispeaker audio recordings to enable speaker-adaptive processing. Over time, these algorithms also gained value as a standalone application, providing speaker-specific meta-information for downstream tasks such as audio retrieval. More recently, with the emergence of deep learning technology, which has driven revolutionary changes in research and practices across speech application domains, rapid advancements have been made in speaker diarization. In this paper, we review not only the historical development of speaker diarization technology but also the recent advancements in neural speaker diarization approaches. Furthermore, we discuss how speaker diarization systems have been integrated with speech recognition applications and how the recent surge of deep learning is driving the joint modeling of these two components so that they complement each other. By considering these exciting technical trends, we believe that this paper is a valuable contribution to the community, providing a survey that consolidates the recent developments with neural methods and thus facilitates further progress toward more efficient speaker diarization.

keywords:
speaker diarization , automatic speech recognition , deep learning
journal: Computer, Speech and Language

1 Introduction

“Diarize” means making a note of or keeping events in a diary. Speaker diarization, like keeping a record of events in such a diary, addresses the question of “who spoke when” [1, 2, 3] by logging speaker-specific salient events on multiparticipant (or multispeaker) audio data. Through the diarization process, the audio data is divided and clustered into groups of speech segments with the same speaker identity/label. As a result, salient events, such as non-speech/speech transitions or speaker turn changes, are automatically detected. In general, this process does not require any prior knowledge of the speakers, such as their real identity or the number of participating speakers in the audio data. Because it separates audio streams by these speaker-specific events, speaker diarization can be effectively employed for indexing or analyzing various types of audio data, e.g., audio/video broadcasts from media stations, conversations in conferences, personal videos from online social media or hand-held devices, court proceedings, business meetings, and earnings reports in the financial sector, just to name a few.

Traditionally, speaker diarization systems consist of multiple, independent sub-modules, as presented in Fig. 1. To mitigate artifacts from the acoustic environment, various front-end processing techniques, for example, speech enhancement, dereverberation, speech separation, or target speaker extraction, are employed. Voice or speech activity detection (SAD) is then applied to separate speech from non-speech events. Raw speech signals in the selected speech portions are transformed into acoustic features or embedding vectors. In the clustering stage, the transformed speech portions are grouped and labeled by speaker class, and in the post-processing stage the clustering results are further refined. In general, each of these sub-modules is optimized individually.

Fig. 1: Traditional speaker diarization system.

1.1 Historical Development of Speaker Diarization

During the early years of diarization technology (in the 1990s), the research objective was to benefit automatic speech recognition (ASR) on air traffic control dialogues and broadcast news recordings by separating each speaker’s speech segments and enabling speaker-adaptive training of acoustic models [4, 5, 6, 7, 8, 9, 10]. In this period, some fundamental approaches for measuring the distance between speech segments for speaker change detection and clustering, such as the generalized likelihood ratio (GLR) [4] and the Bayesian information criterion (BIC) [11], were developed and quickly became the gold standard. All these efforts collectively laid out paths to consolidate activities across research groups worldwide, leading to several research consortia and challenges in the early 2000s, among which were the Augmented Multiparty Interaction (AMI) Consortium [12] supported by the European Commission and the Rich Transcription (RT) Evaluation [13] hosted by the National Institute of Standards and Technology (NIST). These efforts, spanning from a few years to a decade, fostered further advancements in speaker diarization technologies across different data domains, from broadcast news [14, 15, 16, 17, 18] and conversational telephone speech (CTS) [19, 20, 21, 22] to meeting conversations [23, 24, 25, 26, 27]. The new approaches resulting from these advancements include, but are not limited to, beamforming [28], information bottleneck clustering (IBC) [27], variational Bayesian (VB) approaches [29], and joint factor analysis (JFA) [22].

The speaker-specific representation in a total variability space derived from a simplified JFA, known as the i-vector [30], found great success in speaker recognition and was quickly adopted by speaker diarization systems as a feature representation for short speech segments obtained in an unsupervised fashion. The i-vector successfully replaced its predecessors, such as raw mel-frequency cepstral coefficients (MFCCs) or speaker factors (eigenvoices) [31], to bolster clustering performance in speaker diarization, being combined with principal component analysis (PCA) [32, 33], the variational Bayesian Gaussian mixture model (VB-GMM) [34], mean shift [35], and probabilistic linear discriminant analysis (PLDA) [36].

Since the advent of deep learning in the 2010s, a considerable amount of research has sought to take advantage of the powerful modeling capabilities of neural networks for speaker diarization. One representative example is the extraction of speaker embeddings using neural networks, such as d-vectors [37, 38, 39] or x-vectors [40], which most often are embedding vector representations based on the bottleneck layer output of a deep neural network (DNN) trained for speaker recognition. The shift from i-vectors to these neural embeddings contributed to enhanced performance, easier training with more data [41], and robustness against speaker variability and acoustic conditions. More recently, end-to-end neural diarization (EEND), where the individual sub-modules of the traditional speaker diarization system (c.f., Fig. 1) are replaced by a single neural network, has gained increasing attention with promising results [42, 43]. This research direction, although not fully matured yet, could open up unprecedented opportunities to address challenges in the field of speaker diarization, such as joint optimization with other speech applications and handling overlapping speech, provided that large-scale data is available for training such powerful neural network-based models.

1.2 Motivation

To date, there have been two well-rounded overview papers in the area of speaker diarization that survey the development of speaker diarization technology with different focuses. In [2], various speaker diarization systems and their subtasks in the context of broadcast news and CTS data are reviewed up to the mid-2000s. Thus, the historical progress of speaker diarization technology development in the 1990s and early 2000s is covered. In contrast, the focus of [3] is placed more on speaker diarization for meeting speech and its respective challenges. That paper thus puts more weight on technologies that mitigate problems specific to meeting environments, where there are usually more participants than in broadcast news or CTS data and multi-modal data is frequently available. Since these two papers were published, speaker diarization systems have gone through many notable changes, especially owing to the leapfrog advancements in deep learning approaches addressing technical challenges across multiple machine learning domains. We believe that this survey work is a valuable contribution to the community, consolidating the recent developments with neural methods and thus facilitating further progress toward more efficient diarization.

1.3 Overview and Taxonomy of Speaker Diarization

Table 1: Table of Taxonomy

  • Single-module optimization, trained based on a non-diarization objective (Sections 2.1–2.6): front-end [44, 45, 46], speaker embedding [47, 48, 40], SAD [49], etc.

  • Single-module optimization, trained based on a diarization objective (Section 3.1): affinity matrix refinement [50], IDEC [51], TS-VAD [52], etc.

  • Joint optimization, trained based on a non-diarization objective (Section 2.7): VB-HMM [53], VBx [54]. Out of scope: joint front-end & ASR [55, 56, 57, 58, 59, 60], joint speaker identification & speech separation [61, 62], etc.

  • Joint optimization, trained based on a diarization objective (Section 3.2): UIS-RNN [41], RPN [63], online RSAN [64], EEND [42, 43], etc.; (Section 4): joint ASR & speaker diarization [65, 66, 67, 68], etc.

To categorize the diverse set of existing speaker diarization technologies, spanning both the modularized speaker diarization systems that preceded the deep learning era and the neural network-based systems of recent years, a proper grouping is helpful. The main categorization we adopt in this paper is based on two criteria, resulting in a total of four categories, as shown in Table 1. The first criterion is whether the model is trained with a speaker diarization-oriented objective function or not. Any trainable approach that optimizes a model in a multispeaker situation and learns the relations between speakers is categorized into the “diarization objective” class. The second criterion is whether multiple modules are jointly optimized toward some objective function. If only a single sub-module is replaced with a trainable one, the method is categorized into the “single-module optimization” class. Conversely, joint modeling of segmentation and clustering [41], joint modeling of speech separation and speaker diarization [64], or a fully end-to-end neural diarization system [42, 43] is categorized into the “joint optimization” class.

Note that the intention of this categorization is to help readers quickly survey the broad developments in the field; it is not our intention to rank the categories as superior or inferior. Also, while we are aware of many techniques that fall into the “non-diarization objective” and “joint optimization” category (e.g., joint front-end and ASR [55, 56, 57, 58, 59, 60] and joint speaker identification and speech separation [61, 62]), we exclude them from this paper to focus on the review of speaker diarization techniques.

1.4 Diarization Evaluation Metrics

1.4.1 Diarization Error Rate

The accuracy of a speaker diarization system is measured using the diarization error rate (DER) [69], which is the sum of three different error types: false alarm (FA) of speech, missed detection of speech, and confusion between speaker labels:

\textbf{DER}=\frac{\text{FA}+\text{Missed}+\text{Speaker-Confusion}}{\text{Total Duration of Time}}. \quad (1)

To establish a one-to-one mapping between the hypothesis outputs and the reference transcript, the Hungarian algorithm [70] is employed. In the 2006 RT evaluation [69], a 0.25 s “no score” collar (also referred to as a “score collar”) is set around every boundary of each reference segment to mitigate the effect of inconsistent annotations and human errors in the reference transcripts, and this evaluation scheme has been the most widely used in speaker diarization studies.
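For illustration, the following is a minimal frame-level sketch of the DER computation in Eq. (1), written in Python with NumPy. It assumes at most one active speaker at a time, speaker labels that are already matched between reference and hypothesis, and no scoring collar; official scoring tools additionally perform the Hungarian mapping and collar handling, so this sketch is not a replacement for them.

```python
import numpy as np

def der(reference, hypothesis, frame=0.01):
    """Frame-level DER following Eq. (1).
    reference/hypothesis: lists of (start_sec, end_sec, speaker_label) with at most
    one active speaker at a time. Speaker labels are assumed to be already matched
    (e.g., via the Hungarian algorithm); no scoring collar is applied."""
    total = max(e for _, e, _ in reference + hypothesis)
    n = int(np.ceil(total / frame))
    ref = np.full(n, "", dtype=object)
    hyp = np.full(n, "", dtype=object)
    for s, e, spk in reference:
        ref[int(s / frame):int(e / frame)] = spk
    for s, e, spk in hypothesis:
        hyp[int(s / frame):int(e / frame)] = spk
    miss = np.sum((ref != "") & (hyp == ""))         # missed speech
    fa = np.sum((ref == "") & (hyp != ""))           # false alarm of speech
    conf = np.sum((ref != "") & (hyp != "") & (ref != hyp))  # speaker confusion
    return float(miss + fa + conf) / float(np.sum(ref != ""))

# Toy example: the hypothesis mislabels part of speaker B and misses the tail.
ref = [(0.0, 2.0, "A"), (2.0, 4.0, "B")]
hyp = [(0.0, 2.5, "A"), (2.5, 3.5, "A")]
print(der(ref, hyp))
```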

1.4.2 Jaccard Error Rate

The Jaccard error rate (JER) was first introduced in the DIHARD II evaluation. The goal of JER is to evaluate each speaker with equal weight. Unlike DER, which is estimated over the whole recording at once, per-speaker error rates are first computed and then averaged to obtain JER. Specifically, JER is computed as follows:

\textbf{JER}=\frac{1}{N}\sum_{i}^{N_{\mathrm{ref}}}\frac{\mathrm{FA}_{i}+\mathrm{MISS}_{i}}{\mathrm{TOTAL}_{i}}. \quad (2)

In Eq. (2), $\mathrm{TOTAL}_{i}$ is the union of the $i$-th speaker’s speaking time in the reference transcript and the $i$-th speaker’s speaking time in the hypothesis. $N_{\mathrm{ref}}$ is the number of speakers in the reference transcript. Note that the speaker confusion counted in DER is reflected in the $\mathrm{FA}_{i}$ term in the calculation of JER. Since JER uses a union operation between the reference and the hypothesis, JER never exceeds 100%, whereas DER can become much larger than 100%. DER and JER are highly correlated, but if a subset of speakers is dominant in the given audio recording, JER tends to be higher than usual.

1.4.3 Word-level Diarization Error Rate

While DER is based on the duration of the speaking time of each speaker, the word-level DER (WDER) is designed to measure the error on the lexical (output transcription) side. The motivation for WDER is the discrepancy between DER and the accuracy of the final transcript output, since DER relies on the duration of the speaking time, which is not always aligned with word boundaries. The concept of the word-breakage ratio was proposed in Silovsky et al. [71], where word breakage shares a similar idea with WDER. Unlike WDER, the word-breakage ratio measures the number of speaker-change points that occur inside a word boundary. The work in Park and Georgiou [72] suggested the term WDER, evaluating the diarization output against the ground-truth transcription. More recently, a joint ASR and speaker diarization system was evaluated with WDER in Shafey et al. [65]. Although the way of calculating WDER differs across studies, the underlying mechanism is that the diarization error is calculated by counting the correctly or incorrectly labeled words.
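As an illustration of this counting mechanism, the following is a minimal Python sketch. It assumes that the reference and hypothesis words are already aligned one-to-one; published WDER formulations (e.g., [65]) additionally account for ASR insertion, deletion, and substitution errors.

```python
def wder(ref_word_speakers, hyp_word_speakers):
    """Fraction of words carrying the wrong speaker label. Assumes the reference and
    hypothesis word sequences are already aligned one-to-one; published WDER variants
    also account for ASR recognition errors."""
    assert len(ref_word_speakers) == len(hyp_word_speakers)
    wrong = sum(r != h for r, h in zip(ref_word_speakers, hyp_word_speakers))
    return wrong / len(ref_word_speakers)

# One of five words is attributed to the wrong speaker.
print(wder(["A", "A", "B", "B", "B"], ["A", "A", "B", "A", "B"]))
```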

1.5 Paper Organization

The rest of the paper is organized as follows.

  • In Section 2, we overview techniques belonging to the “non-diarization objective” class in the proposed taxonomy, mostly those used in the traditional, modular speaker diarization systems. While there is some overlap with the counterpart sections of the aforementioned two survey papers [2, 3] in terms of reviewing notable developments in the past, this section adds the latest schemes for the corresponding components of speaker diarization systems.

  • In Section 3, we discuss advancements mostly leveraging DNNs trained with the diarization objective where single sub-modules are independently optimized (Subsection 3.1) or jointly optimized (Subsection 3.2) toward fully end-to-end speaker diarization.

  • In Section 4, we present a perspective of how speaker diarization has been investigated in the context of ASR, reviewing historical interactions between these two domains to peek into the past, present and future of speaker diarization applications.

  • Section 5 provides information on speaker diarization challenges and corpora to facilitate research activities and anchor technology advances. We also discuss evaluation metrics such as DER, JER and Word-level DER (WDER) in this section.

  • We share a few examples of how speaker diarization systems are employed in both research and industry practices in Section 6 and conclude this work in Section 7, providing summary and future challenges in speaker diarization.

2 Modular Speaker Diarization Systems

This section provides an overview of algorithms for speaker diarization belonging to the “non-diarization objective” class, as shown in Table 1. Each subsection corresponds to one module of the traditional speaker diarization system shown in Fig. 1. In addition to the introductory explanation of each module, this section also summarizes the recent techniques within each module.

2.1 Front-end Processing

This section describes the front-end techniques used for speech enhancement, dereverberation, speech separation, and speech extraction as part of the speaker diarization pipeline. Let $s_{i,f,t}\in\mathbb{C}$ be the short-time Fourier transform (STFT) representation of source speaker $i$ at frequency bin $f$ and frame $t$. The observed noisy signal $x_{t,f}$ can be represented by a mixture of the source signals, room impulse responses $h_{i,f,t}\in\mathbb{C}$, and additive noise $n_{t,f}\in\mathbb{C}$,

x_{t,f}=\sum_{i=1}^{K}\sum_{\tau}h_{i,f,\tau}\,s_{i,f,t-\tau}+n_{t,f}, \quad (3)

where $K$ denotes the number of speakers present in the audio signal.

The aim of the front-end techniques described in this section is to estimate the original source signal $\hat{\mathbf{x}}_{i,t}$ given the observation $\mathbf{X}=(\{x_{t,f}\}_{f})_{t}$ for the downstream diarization task,

\hat{\mathbf{x}}_{i,t}=\mathrm{FrontEnd}(\mathbf{X}),\quad i=1,\dots,K, \quad (4)

where $\hat{\mathbf{x}}_{i,t}\in\mathbb{C}^{D}$ denotes the $i$-th speaker’s estimated STFT spectrum with $D$ frequency bins at frame $t$.

Although there are numerous speech enhancement, dereverberation, and separation algorithms, e.g.,  [73, 74, 75], herein most of the recent techniques used in the DIHARD challenge series [76, 77, 78], LibriCSS meeting recognition task [79, 80], and the CHiME-6 challenge track 2 [81, 82, 83] are covered.

2.1.1 Speech Enhancement and Denoising

Speech enhancement techniques focus mainly on suppressing the noise component of noisy speech, and they have shown significant improvements thanks to deep learning. For example, long short-term memory (LSTM)-based speech enhancement [84, 85] is used as a front-end technique in the DIHARD II baseline [77], i.e.,

\hat{\mathbf{x}}_{t}=\mathrm{LSTM}(\mathbf{X}), \quad (5)

where we only consider the single-source case (i.e., $K=1$) and omit the source index $i$. This is a regression-based approach that minimizes the objective function

\mathcal{L}_{\mathrm{MSE}}=\|\mathbf{s}_{t}-\hat{\mathbf{x}}_{t}\|^{2}. \quad (6)

The log power spectrum or an ideal ratio mask is often used as the target domain of the output $\mathbf{s}_{t}$. The speech enhancement used in [86] applies this objective function at each layer in a progressive manner.
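As a rough illustration of this kind of regression-based enhancement, the following is a minimal PyTorch sketch; the layer sizes, feature dimensions, and training details are illustrative and do not reproduce any particular published system.

```python
# Minimal sketch: an LSTM maps noisy log-power spectra to a clean-target estimate,
# trained with the MSE objective of Eq. (6). All sizes are illustrative assumptions.
import torch
import torch.nn as nn

class LstmEnhancer(nn.Module):
    def __init__(self, n_bins=257, hidden=512):
        super().__init__()
        self.lstm = nn.LSTM(n_bins, hidden, num_layers=2, batch_first=True)
        self.proj = nn.Linear(hidden, n_bins)

    def forward(self, noisy):                 # noisy: (batch, frames, n_bins)
        h, _ = self.lstm(noisy)
        return self.proj(h)                   # enhanced estimate, same shape

model = LstmEnhancer()
criterion = nn.MSELoss()
noisy = torch.randn(4, 100, 257)              # toy batch of noisy log-power spectra
clean = torch.randn(4, 100, 257)              # corresponding clean targets
loss = criterion(model(noisy), clean)         # Eq. (6)
loss.backward()
```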

The effectiveness of speech enhancement can be further improved via multichannel processing, including minimum variance distortionless response (MVDR) beamforming [73]. For example, [80] demonstrates a significant DER improvement from 18.3% to 13.9% on the LibriCSS meeting task with mask-based MVDR beamforming [87, 88].

2.1.2 Dereverberation

Compared with other front-end techniques, the major dereverberation techniques used in various tasks are based on statistical signal processing methods. One of the most widely used techniques is the weighted prediction error (WPE) based dereverberation [89, 90, 91].

The basic idea of WPE, for the case of a single source (i.e., $K=1$) without noise, is to decompose the original signal model of Eq. (3) into the early reflection $x_{t,f}^{\text{early}}$ and the late reverberation $x_{t,f}^{\text{late}}$ as follows:

x_{t,f}=\sum_{\tau}h_{f,\tau}\,s_{f,t-\tau}=x_{t,f}^{\text{early}}+x_{t,f}^{\text{late}}. \quad (7)

WPE estimates filter coefficients $\hat{h}^{\text{wpe}}_{f,\tau}\in\mathbb{C}$ that maintain the early reflection while suppressing the late reverberation, based on maximum likelihood estimation:

\hat{x}^{\text{early}}_{t,f}=x_{t,f}-\sum_{\tau=\Delta}^{L}\hat{h}^{\text{wpe}}_{f,\tau}\,x_{f,t-\tau}, \quad (8)

where $\Delta$ denotes the number of frames separating the early reflection from the late reverberation, and $L$ denotes the filter size.

WPE is widely used as one of the gold-standard front-end processing methods; for example, it is part of both the baseline and the top-performing systems of the DIHARD and CHiME challenges [76, 77, 78, 81, 82]. Although the performance improvement from WPE-based dereverberation is not large, it provides consistent improvements across almost all tasks. Moreover, because WPE is based on linear filtering and does not introduce signal distortions, it can be safely combined with downstream front-end and back-end processing steps. Similar to the speech enhancement techniques, WPE-based dereverberation yields additional performance improvements when applied to multichannel signals.
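The following is a minimal single-channel NumPy sketch in the spirit of Eq. (8): a delayed linear prediction filter is estimated per frequency bin by iteratively re-weighted least squares and subtracted from the observation. The tap, delay, and iteration values are illustrative assumptions, and practical systems typically rely on multichannel WPE implementations.

```python
import numpy as np

def wpe_single_channel(X, taps=10, delay=3, iters=3, eps=1e-8):
    """X: (frames, freq_bins) complex STFT of one channel.
    Returns a dereverberated STFT following the form of Eq. (8)."""
    T, F = X.shape
    Y = X.copy()
    for f in range(F):
        x = X[:, f]
        y = x.copy()
        for _ in range(iters):
            # Time-varying weights from the current estimate of the desired signal power.
            w = 1.0 / np.maximum(np.abs(y) ** 2, eps)
            # Delayed, stacked observations: column k holds x delayed by (delay + k) frames.
            B = np.zeros((T, taps), dtype=complex)
            for k in range(taps):
                d = delay + k
                B[d:, k] = x[:T - d]
            # Weighted least-squares estimate of the late-reverberation prediction filter.
            R = B.conj().T @ (B * w[:, None])
            p = B.conj().T @ (w * x)
            g = np.linalg.solve(R + eps * np.eye(taps), p)
            y = x - B @ g          # subtract the predicted late reverberation (Eq. (8))
        Y[:, f] = y
    return Y
```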

2.1.3 Speech Separation

Speech separation is a promising family of techniques when the amount of overlapping speech is significant. The effectiveness of multichannel speech separation based on beamforming has been widely confirmed [28, 92, 93]. For example, in the CHiME-6 challenge [81], guided source separation (GSS) [93] based multichannel speech extraction techniques were used to achieve the top result. On the other hand, single-channel speech separation techniques [44, 45, 46] often do not show significant effectiveness in realistic multispeaker scenarios, such as the LibriCSS [79] or CHiME-6 tasks [81], where speech signals are continuous and contain both overlapping and overlap-free regions. Single-channel speech separation systems often produce a redundant non-speech or even a duplicated speech signal for the non-overlap regions, and this “leakage” of audio causes many false alarms of speech activity. A leakage filtering method was proposed in [94] to tackle this problem, and a significant improvement in diarization performance was observed after including this processing step in the top-ranked system of the VoxCeleb Speaker Recognition Challenge 2020 [95].

2.2 Speech Activity Detection

SAD, also known as voice activity detection (VAD), distinguishes speech from non-speech such as background noise. SAD plays a significant role not only in speaker diarization but also in speaker recognition and speech recognition systems, since it is a pre-processing step whose errors propagate through the whole pipeline. A SAD system mostly consists of two major parts. The first is a feature extraction front-end, where acoustic features such as the zero crossing rate [96], pitch [97], signal energy [98], higher order statistics in the linear predictive coding residual domain [99], or MFCCs are often used. The other is a classifier, where a model predicts whether the input frame contains speech or not. Systems based on statistical models of the spectrum [100], Gaussian mixture models (GMMs) [101], and hidden Markov models (HMMs) [102, 103] have traditionally been used. After deep learning approaches gained popularity in the speech signal processing field, numerous DNN-based systems, such as those based on MLPs [104], convolutional neural networks (CNNs) [105], and LSTMs [106], have also been proposed, with superior performance to the traditional methods.

The performance of SAD largely affects the overall performance of the speaker diarization system, as it can create a significant number of false positive salient events or miss speech segments [107]. A common practice in speaker diarization tasks is to report DER with the “oracle SAD” setup, which indicates that the system uses SAD output identical to the ground truth. Conversely, the system output obtained with an actual speech activity detector is referred to as the “system SAD” output.

2.3 Segmentation

In the context of speaker diarization, speech segmentation is the process of breaking the input audio stream into multiple segments so that each segment contains speech from a single speaker. The output unit of the speaker diarization system is therefore determined by the segmentation process. In general, speech segmentation methods for speaker diarization fall into two major categories: segmentation by speaker-change point detection and uniform segmentation.

Segmentation by speaker-change point detection was the gold standard of earlier speaker diarization systems, where speaker-change points are detected by comparing two hypotheses: $H_{0}$ assumes that both the left and right speech windows are from the same speaker, whereas $H_{1}$ assumes that the two speech windows are from different speakers. To test these two hypotheses, metric-based approaches [108, 109] were most widely applied. In metric-based approaches, the distribution of the speech features is assumed to follow a Gaussian distribution $\mathcal{N}(\mu,\Sigma)$ with mean $\mu$ and covariance $\Sigma$. The two hypotheses $H_{0}$ and $H_{1}$ can then be represented as follows:

H_{0}: \mathbf{x}_{1}\cdots\mathbf{x}_{N}\sim\mathcal{N}(\mu,\Sigma), \quad (9)
H_{1}: \mathbf{x}_{1}\cdots\mathbf{x}_{i}\sim\mathcal{N}\left(\mu_{1},\Sigma_{1}\right),\quad \mathbf{x}_{i+1}\cdots\mathbf{x}_{N}\sim\mathcal{N}\left(\mu_{2},\Sigma_{2}\right),

where $(\mathbf{x}_{i}\,|\,i=1,\cdots,N)$ is the sequence of speech features under the hypothesis test. A slew of criteria for the metric-based approach were proposed to quantify the likelihood of the two hypotheses. Examples include the Kullback–Leibler (KL) distance [110], the generalized likelihood ratio (GLR) [111, 112, 113], and BIC [108, 114]. Among these criteria, BIC has been the most widely used, followed by numerous variants [115, 116, 117, 118]. Thus, in this section, we introduce BIC as a representative metric-based method. If we apply BIC to the hypotheses described in Eq. (9), the BIC value between the two models from the two hypotheses is expressed as follows:

BIC(i)=N\log|\Sigma|-N_{1}\log\left|\Sigma_{1}\right|-N_{2}\log\left|\Sigma_{2}\right|-\lambda P, \quad (10)

where the sample covariance $\Sigma$ is computed from $\{\mathbf{x}_{1},\cdots,\mathbf{x}_{N}\}$, $\Sigma_{1}$ from $\{\mathbf{x}_{1},\cdots,\mathbf{x}_{i}\}$, and $\Sigma_{2}$ from $\{\mathbf{x}_{i+1},\cdots,\mathbf{x}_{N}\}$, and $P$ is the penalty term [108] defined as

P=\frac{1}{2}\left(d+\frac{1}{2}d(d+1)\right)\log N, \quad (11)

where $d$ denotes the dimension of the feature; $N_{1}$ and $N_{2}$ are the frame lengths of the two windows, respectively, and $N=N_{1}+N_{2}$. The penalty weight $\lambda$ is generally set to $\lambda=1$. A change point is declared when the following condition becomes true:

\left\{\max_{i}BIC(i)\right\}>0. \quad (12)
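The following NumPy sketch illustrates BIC-based change point detection with Eqs. (10)–(12). It assumes the rows of X are acoustic feature vectors (e.g., MFCC frames); the margin and the covariance regularization are illustrative choices.

```python
import numpy as np

def bic_value(X, i, lam=1.0):
    """BIC(i) of Eq. (10) for splitting the feature matrix X (N, d) at frame i."""
    N, d = X.shape
    logdet = lambda S: np.linalg.slogdet(S)[1]
    cov = lambda Z: np.cov(Z, rowvar=False) + 1e-6 * np.eye(d)  # regularized sample covariance
    P = 0.5 * (d + 0.5 * d * (d + 1)) * np.log(N)               # penalty term, Eq. (11)
    return (N * logdet(cov(X))
            - i * logdet(cov(X[:i]))
            - (N - i) * logdet(cov(X[i:]))
            - lam * P)

def detect_change_point(X, margin=10):
    """Declare a change point if max_i BIC(i) > 0 (Eq. (12)); return its index or None."""
    candidates = list(range(margin, len(X) - margin))
    scores = [bic_value(X, i) for i in candidates]
    best = int(np.argmax(scores))
    return candidates[best] if scores[best] > 0 else None
```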

In general, if speech segmentation is performed using the speaker-change point detection method, the length of each segment is not consistent. Therefore, after the advent of the i-vector [30] and DNN-based embeddings [47, 40], segmentation based on speaker-change point detection was mostly replaced with uniform segmentation [35, 119, 39], since the varying segment lengths introduced additional variability into the speaker representations and deteriorated their fidelity.

In uniform segmentation schemes, the given audio stream is segmented with a fixed window length and overlap length. Thus, the unit duration of the speaker diarization output stays constant. However, uniform segmentation of the input signals for diarization poses a potential problem because it introduces a trade-off related to the segment length: the segments need to be sufficiently short to safely assume that each contains only one speaker, yet long enough to capture sufficient acoustic information for extracting reliable speaker representations.
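The following minimal Python sketch illustrates uniform segmentation with a fixed window and hop; the 1.5 s window and 0.75 s hop are illustrative values only.

```python
def uniform_segments(duration, win=1.5, hop=0.75):
    """Split [0, duration) seconds into fixed-length, overlapping segments.
    Window/hop values are illustrative; the last segment may be shorter."""
    segments, start = [], 0.0
    while start < duration:
        end = min(start + win, duration)
        segments.append((round(start, 3), round(end, 3)))
        if end == duration:
            break
        start += hop
    return segments

# Example: a 5-second stream with a 1.5 s window and a 0.75 s hop.
print(uniform_segments(5.0))
```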

2.4 Speaker Representations and Similarity Measure

Speaker representations play a crucial role in speaker diarization systems when measuring the similarity between speech segments. This section covers such speaker representations together with the similarity measures, because the two are tightly connected. We first introduce metric-based similarity measures, which were popular from the late 1990s to the early 2000s, in Section 2.4.1. We then introduce, in Section 2.4.2 and Section 2.4.3, the speaker representations widely used in speaker diarization systems, which are usually employed together with the uniform segmentation method.

2.4.1 Metric Based Similarity Measure

From the late 1990s to the early 2000s, metric-based approaches were most commonly used to measure the similarity between speech segments in speaker diarization systems. Methods used for speaker segmentation were also applied to measure the similarity between segments, such as the KL distance [110], GLR [111, 112, 113], and BIC [108, 114]. As in the case of segmentation, the BIC-based method, in which the similarity between two segments is computed by Eq. (10), was one of the most extensively used metrics due to its effectiveness and ease of implementation. Metric-based approaches are usually employed together with segmentation approaches based on speaker-change point detection. Agglomerative hierarchical clustering (AHC) is then often applied to obtain the diarization result, as detailed in Section 2.5.1.

2.4.2 Joint Factor Analysis, i-vector and PLDA

Before the advent of speaker representations such as the i-vector [30] or x-vector [40], the Gaussian mixture model based universal background model (GMM-UBM) [120] applied to acoustic features demonstrated success in speaker verification tasks. A UBM is a large GMM (typically with 512 to 2048 mixtures) trained to represent the speaker-independent distribution of acoustic features. Thus, a GMM-UBM model can be described by the following quantities: the mixture weights, mean vectors, and covariance matrices of the mixtures. The log-likelihood ratio between a speaker-adapted GMM and the speaker-independent GMM-UBM is used for speaker verification. Despite their success in modeling speaker identity, GMM-UBM based speaker verification systems have suffered from intersession variability [121], i.e., the variability exhibited by a given speaker from one recording session to another. This difficulty occurs because the relevance maximum a posteriori (MAP) adaptation step during the speaker enrollment process captures not only the speaker-specific characteristics of the speech but also unwanted channel noise and other nuisances from the acoustic environment.

Joint factor analysis (JFA) [121, 122] was proposed to compensate for these variability issues by separately modeling the inter-speaker variability and the channel or session variability. The JFA approach employs a GMM supervector, which is a concatenation of the means of the adapted GMM. For example, suppose an $F\times 1$ speaker-independent GMM mean vector $m_{c}$, where $c$ is the mixture component index and $F$ is the dimension of the feature. Then, a supervector $\mathbf{M}$ of dimension $CF\times 1$ is formed by concatenating the $F$-dimensional mean vectors of the $C$ mixture components. Thus, the supervector $\mathbf{M}$ can be described as follows:

\mathbf{M}=\left[m_{1}^{t},\,m_{2}^{t},\,\ldots,\,m_{C}^{t}\right]^{t}. \quad (13)

In the JFA approach, the given GMM supervector is decomposed into speaker-independent, speaker-dependent, channel-dependent, and residual components. Thus, the ideal speaker supervector $\mathbf{M}_{J}$ can be decomposed as indicated in Eq. (14), where $\mathbf{m}_{J}$ denotes a speaker-independent supervector from the UBM, $\mathbf{V}$ denotes a speaker-dependent component matrix, $\mathbf{U}$ denotes a channel-dependent component matrix, and $\mathbf{D}$ denotes a speaker-dependent residual component matrix. Along with these component matrices, the vector $\mathbf{y}$ contains the speaker factors, the vector $\mathbf{x}$ the channel factors, and the vector $\mathbf{z}$ the speaker-specific residual factors. All of these vectors have a prior distribution of $N(0,1)$.

\mathbf{M}_{J}=\mathbf{m}_{J}+\mathbf{V}\mathbf{y}+\mathbf{U}\mathbf{x}+\mathbf{D}\mathbf{z}. \quad (14)

The JFA approach was followed by the study in [30], in which it was discovered that the channel factors in JFA also contain information about the speakers. Thus, Dehak et al. [30] proposed a new method combining the channel and speaker spaces into a single variability space through a total variability matrix. The total variability matrix $\mathbf{T}$ models both the channel and the speaker variability, and the latent variable $\mathbf{w}$ weights the columns of the matrix $\mathbf{T}$. The variable $\mathbf{w}$ is referred to as the i-vector and is also considered a speaker representation vector. Each speaker and channel in a GMM supervector $\mathbf{M}_{I}$ can be modeled as follows:

\mathbf{M}_{I}=\mathbf{m}_{I}+\mathbf{T}\mathbf{w}, \quad (15)

where $\mathbf{m}_{I}$ is a speaker-independent and channel-independent supervector, which can be taken as the UBM supervector. The process of extracting an i-vector $\mathbf{w}$ for a given recording is formulated as a MAP estimation problem [123, 30] using, as parameters, the Baum–Welch statistics extracted with the UBM, the mean supervector $\mathbf{m}_{I}$, and the total variability matrix $\mathbf{T}$ trained with the EM algorithm. The idea of a speaker representation was greatly popularized through the use of i-vectors, where the speaker representation vector numerically characterizes the vocal tract of each speaker. i-Vector speaker representations have been employed not only in speaker recognition studies but also in numerous speaker diarization studies [35, 36, 124] and have shown superior performance compared to the metric-based methods such as BIC, GLR, and KL distance mentioned in the previous subsection.

Intersession variability in the i-vector approach has been further compensated using back-end procedures, such as linear discriminant analysis (LDA) [125, 126] and within-class covariance normalization (WCCN) [127, 128], followed by simple cosine similarity scoring. Cosine similarity scoring was later replaced with a probabilistic LDA (PLDA) model in [129]. In the following studies [130, 131], a method applying a Gaussianization of the i-vectors so that the Gaussian assumptions hold in the PLDA, referred to as G-PLDA or simplified PLDA, was proposed for speaker verification. In general, PLDA models a given speaker representation $\phi_{ij}$ of the $i$-th speaker in the $j$-th session as follows:

\phi_{ij}=\boldsymbol{\mu}+\mathbf{F}\mathbf{h}_{i}+\mathbf{G}\mathbf{w}_{ij}+\epsilon_{ij}. \quad (16)

Here, $\boldsymbol{\mu}$ is the mean vector, $\mathbf{F}$ is the speaker variability matrix, $\mathbf{G}$ is the channel variability matrix, and $\epsilon_{ij}$ is a residual component. In addition, $\mathbf{h}_{i}$ and $\mathbf{w}_{ij}$ are latent variables specific to the speaker and the session, respectively. In G-PLDA, both latent variables, $\mathbf{h}_{i}$ and $\mathbf{w}_{ij}$, are assumed to follow a standard Gaussian prior. During the training process of the PLDA, $\boldsymbol{\mu}$, $\mathbf{\Sigma}$, $\mathbf{F}$, and $\mathbf{G}$ are estimated using the expectation maximization (EM) algorithm. Based on the estimated parameters, two hypotheses are tested: hypothesis $H_{0}$ for the case in which two samples are from the same speaker, and hypothesis $H_{1}$ for the case in which two samples are from different speakers. Under hypothesis $H_{0}$, the given speaker representations $\phi_{1}$ and $\phi_{2}$ are modeled as follows with a common latent variable $\mathbf{h}_{12}$:

\left[\begin{array}{l}\phi_{1}\\ \phi_{2}\end{array}\right]=\left[\begin{array}{l}\boldsymbol{\mu}\\ \boldsymbol{\mu}\end{array}\right]+\left[\begin{array}{lll}\mathbf{F}&\mathbf{G}&0\\ \mathbf{F}&0&\mathbf{G}\end{array}\right]\left[\begin{array}{l}\mathbf{h}_{12}\\ \mathbf{w}_{1}\\ \mathbf{w}_{2}\end{array}\right]+\left[\begin{array}{l}\epsilon_{1}\\ \epsilon_{2}\end{array}\right]. \quad (28)

On the other hand, under hypothesis $H_{1}$, $\phi_{1}$ and $\phi_{2}$ are modeled as follows with separate latent variables $\mathbf{h}_{1}$ and $\mathbf{h}_{2}$:

\left[\begin{array}{c}\phi_{1}\\ \phi_{2}\end{array}\right]=\left[\begin{array}{c}\boldsymbol{\mu}\\ \boldsymbol{\mu}\end{array}\right]+\left[\begin{array}{cccc}\mathbf{F}&\mathbf{G}&0&0\\ 0&0&\mathbf{F}&\mathbf{G}\end{array}\right]\left[\begin{array}{c}\mathbf{h}_{1}\\ \mathbf{w}_{1}\\ \mathbf{h}_{2}\\ \mathbf{w}_{2}\end{array}\right]+\left[\begin{array}{c}\epsilon_{1}\\ \epsilon_{2}\end{array}\right]. \quad (41)

In G-PLDA, it is assumed that $\phi$ is generated from a Gaussian distribution, which results in the following conditional density function [132]:

p\left(\phi\mid\mathbf{h},\mathbf{w}\right)=\mathcal{N}\left(\phi\mid\boldsymbol{\mu}+\mathbf{F}\mathbf{h}+\mathbf{G}\mathbf{w},\,\mathbf{\Sigma}\right). \quad (42)

Using Eqs. (16)–(41), the log-likelihood ratio can be described as follows:

s\left(\phi_{1},\phi_{2}\right)=\log p\left(\phi_{1},\phi_{2}\mid H_{0}\right)-\log p\left(\phi_{1},\phi_{2}\mid H_{1}\right). \quad (43)

The log-likelihood ratio $s(\phi_{1},\phi_{2})$ in the above equation was originally used in speaker verification to choose between hypotheses $H_{0}$ and $H_{1}$ by checking whether $s(\phi_{1},\phi_{2})$ is positive or negative. PLDA on speaker representations is also employed in speaker diarization, where the log-likelihood ratio $s(\phi_{1},\phi_{2})$ is used to measure the similarity between clusters. Further details regarding the clustering approach using PLDA are described in Section 2.5.1.

2.4.3 Neural Network Based Speaker Representations

Fig. 2: Diagram of d-vector model.
Fig. 3: Diagram of x-vector model.

Speaker representations for speaker diarization have also been heavily affected by the rise of deep learning approaches. The idea behind DNN-based representation learning was first introduced for face recognition tasks [133, 134]. The fundamental idea of a neural network-based representation is to use a deep neural network to map the input signal source (an image or an audio clip) to a dense vector of floating-point numbers. This is achieved by forward-propagating the input signal through the network and taking the values of a designated layer. The mapping from the input signal to the speaker embedding relies on the nonlinear modeling capability of the multiple layers of the DNN. In doing so, the training process allows the neural network to learn the mapping without specifying any components or factors, in contrast to traditional factor analysis models built on decomposable components. In this sense, the components in JFA are more explainable than the parameters of DNN models trained for speaker embedding extraction. In addition, DNN-based speaker representation learning does not involve predefined probabilistic models (e.g., GMM-UBM) for the input acoustic features. Related to this, DNN-based speaker representations achieve improved efficiency during the inference phase: the solution used by factor-analysis based methods involves a computationally intensive matrix inversion operation [132], whereas DNN-based embedding extractors involve less demanding operations, namely multiple linear transformations with nonlinear function computations, to obtain the speaker representation vector. Thus, the representation learning process has become more straightforward and the inference speed has improved compared with the traditional factor-analysis based methods. Among the many neural network-based speaker representations, the d-vector [47] remains one of the most prominent speaker representation extraction frameworks. Stacked filter bank features, which include context frames, are employed as the input, and multiple fully connected layers are trained with a cross-entropy loss. Speaker representation vectors, also referred to as d-vectors, are obtained from the last fully connected layer, as indicated in Fig. 2. The d-vector scheme appears in numerous speaker diarization papers, e.g., in [39, 41].

DNN-based speaker representations were further improved with the x-vector [48, 40], which demonstrated superior performance, winning the NIST speaker recognition challenge 2018 [135] and the first DIHARD challenge [76]. Fig. 3 shows the structure of an x-vector extractor. The time-delay architecture and the statistics pooling layer differentiate the x-vector architecture from that of the d-vector. The statistics pooling layer aggregates the frame-level outputs from the previous layer, computes their mean and standard deviation, and passes them on to the following layer. Thus, it allows the extraction of x-vectors from variable-length inputs. This is advantageous not only for speaker verification but also for speaker diarization, because diarization systems must process segments that are shorter than the predetermined uniform segment length when a segment is truncated at the end of an utterance.
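The following PyTorch sketch illustrates the statistics pooling operation described above; the feature dimension and frame counts are illustrative.

```python
import torch

def statistics_pooling(frame_features):
    """Aggregate frame-level features (batch, frames, dim) into a fixed-size
    utterance-level vector by concatenating their mean and standard deviation,
    mirroring the role of the statistics pooling layer."""
    mean = frame_features.mean(dim=1)
    std = frame_features.std(dim=1)
    return torch.cat([mean, std], dim=1)   # (batch, 2 * dim)

# Variable-length inputs both map to a fixed-size (2 * dim) vector per utterance.
short = torch.randn(1, 50, 512)
long = torch.randn(1, 700, 512)
print(statistics_pooling(short).shape, statistics_pooling(long).shape)
```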

2.5 Clustering

A clustering algorithm is applied to group the speech segments based on the speaker representations and similarity measures explained in the previous section. Here, we introduce the most commonly used clustering methods for speaker diarization.

Fig. 4: Agglomerative Hierarchical Clustering.

2.5.1 Agglomerative Hierarchical Clustering

AHC is a clustering method that has been consistently employed in many speaker diarization systems with different distance metrics, such as BIC [108, 136], KL [137], and PLDA [76, 82, 138]. AHC is an iterative process that merges the existing clusters until a stopping criterion is met. The AHC process starts with the calculation of the similarity between the N initial singleton clusters. At each step, the pair of clusters with the highest similarity is merged. The iterative merging process of AHC can be illustrated as a dendrogram, as presented in Fig. 4.

One of the most important aspects of AHC is the stopping criterion. For the speaker diarization task, the AHC process can be stopped using either a similarity threshold or a target number of clusters. Ideally, if PLDA is used as the distance metric, the AHC process should be stopped at $s(\phi_{1},\phi_{2})=0$ in Eq. (43). In practice, however, the stopping threshold is adjusted on a development set to obtain an accurate number of clusters. Conversely, if the number of speakers is known or estimated by other methods, the AHC process can be stopped when the number of clusters reaches the predetermined number of speakers $k$.
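The following sketch illustrates threshold-based or count-based stopping for AHC using SciPy's hierarchical clustering with cosine distances. It is only an illustration: the linkage method, distance metric, and threshold are assumptions standing in for the PLDA scoring and tuning described above.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

def ahc_labels(embeddings, threshold=None, num_speakers=None):
    """Cluster segment embeddings (N, dim) with average-linkage AHC.
    Stop either at a distance threshold or at a known number of speakers."""
    dists = pdist(embeddings, metric="cosine")        # pairwise cosine distances
    Z = linkage(dists, method="average")
    if num_speakers is not None:
        return fcluster(Z, t=num_speakers, criterion="maxclust")
    return fcluster(Z, t=threshold, criterion="distance")

# Example with toy embeddings and an (illustrative) tuned threshold.
emb = np.random.randn(20, 128)
print(ahc_labels(emb, threshold=0.7))
```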

2.5.2 Spectral Clustering

Spectral clustering is a widely used clustering approach for speaker diarization. While there are many variations, spectral clustering generally involves the following steps (a minimal sketch follows the list).

  i. Affinity matrix calculation: There are many ways to generate the affinity matrix $\mathbf{A}$, depending on how the affinity value is processed. The raw affinity value $d$ can be processed by kernels such as $\exp\left(-d^{2}/\sigma^{2}\right)$, where $\sigma$ is a scaling parameter. Alternatively, the raw affinity values can be masked by zeroing the values below a threshold so that only the prominent values are kept.

  ii. Laplacian matrix calculation [139]: The graph Laplacian can be calculated in two ways, normalized and unnormalized. The degree matrix $\mathbf{D}$ contains the diagonal elements $d_{i}=\sum_{j=1}^{n}a_{ij}$, where $a_{ij}$ is the element in the $i$-th row and $j$-th column of the affinity matrix $\mathbf{A}$.

    (a) Normalized graph Laplacian:

      \mathbf{L}=\mathbf{D}^{-1/2}\mathbf{A}\mathbf{D}^{-1/2}. \quad (44)

    (b) Unnormalized graph Laplacian:

      \mathbf{L}=\mathbf{D}-\mathbf{A}. \quad (45)

  iii. Eigendecomposition: The graph Laplacian matrix $\mathbf{L}$ is decomposed into the eigenvector matrix $\mathbf{X}$ and the diagonal matrix $\mathbf{\Lambda}$ containing the eigenvalues, i.e., $\mathbf{L}=\mathbf{X}\mathbf{\Lambda}\mathbf{X}^{\top}$.

  iv. Re-normalization (optional): The rows of $\mathbf{X}$ are normalized so that $y_{ij}=x_{ij}/\left(\sum_{j}x_{ij}^{2}\right)^{1/2}$, where $x_{ij}$ and $y_{ij}$ are the elements in the $i$-th row and $j$-th column of the matrices $\mathbf{X}$ and $\mathbf{Y}$, respectively.

  v. Speaker counting: The number of speakers is estimated by finding the maximum eigengap [139, 140].

  vi. Spectral embedding clustering: The $k$ smallest eigenvalues $\lambda_{1},\lambda_{2},\ldots,\lambda_{k}$ and the corresponding $k$ eigenvectors $\mathbf{v}_{1},\mathbf{v}_{2},\ldots,\mathbf{v}_{k}$ are stacked to construct a matrix $\boldsymbol{U}\in\mathbb{R}^{n\times k}$. The row vectors of $\boldsymbol{U}$ are referred to as $k$-dimensional spectral embeddings. Finally, the spectral embeddings are clustered using a clustering algorithm; in general, k-means clustering [141] is employed for this step.
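As referenced above, the following NumPy/scikit-learn sketch strings these steps together using the unnormalized Laplacian of Eq. (45), eigengap-based speaker counting, and k-means on the spectral embeddings; it omits the optional re-normalization and assumes a precomputed affinity matrix.

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_cluster(A, max_speakers=8):
    """A: (n, n) symmetric affinity matrix between segment embeddings."""
    D = np.diag(A.sum(axis=1))
    L = D - A                                      # unnormalized graph Laplacian, Eq. (45)
    eigvals, eigvecs = np.linalg.eigh(L)           # eigenvalues in ascending order
    # Speaker counting via the maximum eigengap among the smallest eigenvalues.
    gaps = np.diff(eigvals[:max_speakers + 1])
    k = int(np.argmax(gaps)) + 1
    # k-dimensional spectral embeddings from the k smallest eigenvectors.
    U = eigvecs[:, :k]
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(U)
    return labels, k

# Example with a toy affinity matrix built from random embeddings.
emb = np.random.randn(30, 64)
A = np.maximum(emb @ emb.T, 0.0)
np.fill_diagonal(A, 0.0)
labels, k = spectral_cluster(A)
```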

Fig. 5: General steps of spectral clustering.

Among the many variations of the spectral clustering algorithm, the Ng–Jordan–Weiss (NJW) algorithm [142] is often employed for the speaker diarization task, with variations in the kernel used to calculate the affinity values [143, 144, 33]. Unlike the AHC approach, spectral clustering is mostly used with the cosine distance [143, 144, 33, 39, 140]. In addition, LSTM-based similarity measurement combined with spectral clustering [145] has also exhibited competitive performance. Depending on the dataset, the spectral clustering approach with cosine distance outperforms AHC with PLDA [140, 83] when the same speaker representation is used for both clustering methods.

2.5.3 Other Clustering Algorithms

The k-means algorithm is often employed in studies on speaker diarization [146, 147, 39, 41, 63] due to its simplicity and ease of implementation. However, the k-means algorithm generally underperforms [39, 41] the more widely used clustering algorithms such as spectral clustering and AHC. In addition, a few speaker diarization studies have employed the mean-shift clustering algorithm [148], which iteratively assigns the given data points to clusters by finding the modes of a non-parametric distribution. The mean-shift clustering algorithm was employed for the speaker diarization task with the KL distance in [149], with i-vectors and cosine distance in [35, 150], and with i-vectors and PLDA in [151].

2.6 Post-processing

2.6.1 Resegmentation

Resegmentation is a process of refining the speaker boundaries that are roughly estimated by the clustering procedure. In [152], a Viterbi resegmentation method based on the Baum–Welch algorithm was introduced. In this method, the estimation of a Gaussian mixture model for each speaker and Viterbi-based resegmentation using the estimated speaker GMMs are applied alternately.

A method for modeling the diarization process with a variational Bayesian hidden Markov model (VB-HMM) was proposed and shown to be superior to Viterbi resegmentation [53, 153]. VB-HMM-based diarization can be seen as a joint optimization of segmentation and clustering, which is introduced separately in Section 2.7.

2.6.2 System Fusion

As another direction of post-processing, there has been a series of studies on fusing multiple diarization results to improve the diarization accuracy. While it is widely known that system combination generally yields better results for various tasks (e.g., speech recognition [154] or speaker recognition [155]), the combination of multiple diarization hypotheses poses several unique problems. First, speaker labels are not standardized across different diarization systems. Second, the estimated number of speakers may differ among the systems. Finally, the estimated time boundaries may also differ among the systems. System combination methods for speaker diarization therefore need to handle these problems during the fusion of multiple hypotheses.

In [156], a method for selecting the best diarization result among many diarization systems was proposed. In this method, the whole diarization result for a recording from each diarization system is treated as one object to be clustered. AHC is applied to the set of diarization results, in which the distance between two clusters is measured by the symmetric DER between the diarization results belonging to the two clusters. The iterative merging process of AHC is executed until the number of clusters becomes two. Finally, within the bigger of the two final clusters (according to the number of elements in each cluster), the diarization result that has the smallest distance to all other diarization results is selected as the final result. In [157], two diarization systems are combined by finding a matching between the two sets of speaker clusters and then performing resegmentation based on the matching result.

Fig. 6: Example of DOVER system.

More recently, the diarization output voting error reduction (DOVER) method [158] was proposed to combine multiple diarization results based on a voting scheme. In the DOVER method, the speaker labels of the different diarization systems are aligned one by one to minimize the DER between the hypotheses (steps 2 and 3 in Fig. 6). After aligning all hypotheses, each system votes with its speaker label for each segmented region (each system may have a different voting weight), and the speaker label that gains the highest voting weight is selected for each segmented region (step 4 in Fig. 6). In case multiple speaker labels receive the same voting weight, a heuristic is employed to break the tie (such as selecting the result from the first system).

The DOVER method has an implicit assumption that there is no overlapping speech, i.e., at most one speaker is assigned to each time index. To combine diarization hypotheses with overlapping speakers, two methods were recently proposed. In [94], the authors proposed a modified DOVER method, in which the speaker labels in the different diarization results are first aligned with a root hypothesis, and the speech activity of each speaker is estimated based on the weighted voting score for each speaker in each small segment. Raj et al. [159] proposed a method called DOVER-Lap, in which the speakers of multiple hypotheses are aligned via weighted k-partite graph matching, and the number of speakers $K$ in each small segment is estimated based on the weighted average over multiple systems to select the top-$K$ voted speaker labels. Both the modified DOVER and DOVER-Lap showed DER improvements for speaker diarization results with speaker overlaps.

2.7 Joint Optimization of Segmentation and Clustering

This subsection introduces the VB-HMM-based diarization technique, which can be regarded as a joint optimization of segmentation and clustering and thus cannot be well categorized in Sections 2.1–2.6. The VB-HMM framework was proposed as an extension of VB-based speaker clustering [160, 161] by introducing an HMM to constrain the speaker transitions. In the VB-HMM framework [53], the speech features $\mathbf{X}=(\mathbf{x}_{t}\,|\,t=1,\ldots,T)$ are assumed to be generated from an HMM in which each HMM state corresponds to one of $K$ possible speakers. Suppose that we have $M$ HMM states; an $M$-dimensional variable $\mathbf{Z}=(\mathbf{z}_{t}\,|\,t=1,\ldots,T)$ is introduced, where the $k$-th element of $\mathbf{z}_{t}$ is 1 if the $k$-th speaker is speaking at time index $t$, and 0 otherwise. At the same time, the distribution of $\mathbf{x}_{t}$ is modeled based on a hidden variable $\mathbf{Y}=\{\mathbf{y}_{k}\,|\,k=1,\ldots,K\}$, where $\mathbf{y}_{k}$ denotes a low-dimensional vector for the $k$-th speaker. Given these notations, the joint probability of $\mathbf{X}$, $\mathbf{Y}$, and $\mathbf{Z}$ is decomposed as follows:

P(\mathbf{X},\mathbf{Z},\mathbf{Y})=P(\mathbf{X}|\mathbf{Z},\mathbf{Y})\,P(\mathbf{Z})\,P(\mathbf{Y}), \quad (46)

where $P(\mathbf{X}|\mathbf{Z},\mathbf{Y})$ is the emission probability modeled by a GMM whose mean vectors are represented by $\mathbf{Y}$, $P(\mathbf{Z})$ is the transition probability of the HMM, and $P(\mathbf{Y})$ is the prior distribution of $\mathbf{Y}$. Because $\mathbf{Z}$ represents the trajectory of speakers, the diarization problem can be expressed as the inference of the $\mathbf{Z}$ that maximizes the posterior distribution $P(\mathbf{Z}|\mathbf{X})=\int P(\mathbf{Z},\mathbf{Y}|\mathbf{X})\,d\mathbf{Y}$. Since it is intractable to solve this problem directly, the VB method is used to estimate the model parameters that approximate $P(\mathbf{Z},\mathbf{Y}|\mathbf{X})$.

Recently, a simplified version of VB-HMM that works on x-vectors, known as VBx, was proposed [54, 162]. In VBx, $P(\mathbf{X}|\mathbf{Z},\mathbf{Y})$ is calculated from the x-vectors based on the PLDA model. While the original VB-HMM works at the granularity of frame-level features, VBx works at the granularity of x-vectors, and can thus be seen as a clustering method that jointly models speaker turns and speaker durations.

The VB-HMM diarization was originally designed as a standalone diarization framework. However, it requires parameter initialization to start the VB estimation, and the parameters are usually initialized based on the result of another speaker clustering method. In that context, VB-HMM is widely employed as the final step of speaker diarization (e.g., [163, 119, 164]). For example, in [164], AHC was first performed to under-cluster the x-vectors, and VBx was then applied to obtain better clusters given the AHC-based result as the initial parameters. Finally, VB-HMM was further applied to refine the boundaries obtained by VBx.

Table 2: Overview of speaker diarization techniques using deep learning. Note that there are also a lot of studies that use deep learning for front-end, SAD, segmentation, and speaker embedding extraction, which are introduced in Section 2.

EEND [42, 43] itself is a joint model of SAD, segmentation, speaker embedding, and clustering. It can also be seen that a speech separation module is implicitly embedded in the model to cope with speaker overlaps.

3 Recent Advances in Speaker Diarization Using Deep Learning

This section introduces various recent efforts toward deep learning-based speaker diarization techniques. The methods that incorporate deep learning into a single component of speaker diarization, such as clustering or post-processing, are introduced in Section 3.1. The methods that unify several components of speaker diarization into a single neural network are introduced in Section 3.2. For an overview of speaker diarization techniques using deep learning, refer to Table 2. It should be noted that some of these works take the additional input of speaker profiles. Such methods may not be categorized as diarization techniques in the traditional definition. Nevertheless, we introduce them because they are optimized in a multispeaker situation to learn the relations between speakers and are hence categorized as “trained based on the diarization objective” in Table 1.

3.1 Single-module Optimization

3.1.1 Speaker clustering Enhanced by Deep Learning

Fig. 7: Speaker diarization with graph neural network.

Enhancing the clustering procedure with deep learning is an active research area, and several methods have been proposed for speaker diarization. This section covers representative works in this direction.

An approach based on the graph neural network (GNN) was proposed in [50]. As shown in Fig. 7, this method aims at purifying the similarity matrix used in spectral clustering (Section 2.5.2). Assuming a sequence of speaker embeddings $\{\mathbf{e}_{1},\ldots,\mathbf{e}_{N}\}$, where $N$ is the sequence length, the input to the GNN is $\{\mathbf{x}_{i}^{0}=\mathbf{e}_{i}\,|\,i=1,\ldots,N\}$. The output $\mathbf{x}_{i}^{(p)}$ of the $p$-th layer of the GNN is then:

\mathbf{x}_{i}^{(p)}=\sigma\Big(\mathbf{W}\sum_{j}\mathbf{L}_{i,j}\,\mathbf{x}_{j}^{(p-1)}\Big),   (47)

where \mathbf{L} is the normalized affinity matrix with added self-connections, \mathbf{W} is a trainable weight matrix for the p-th layer, and \sigma(\cdot) is a nonlinear function. The GNN is trained by minimizing the distance between the reference and the estimated affinity matrices, where the distance is calculated using a combination of the histogram loss [177] and the nuclear norm [178]. The GNN-based speaker diarization method was evaluated on the CALLHOME dataset and an in-house meeting dataset, and significantly outperformed the conventional clustering methods used for comparison.
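A minimal NumPy sketch of one propagation step of Eq. (47) is shown below, assuming a symmetric normalization of the self-connected affinity matrix and a ReLU nonlinearity; the layer sizes, the stand-in affinity matrix, and the untrained weight matrix are illustrative only.

```python
import numpy as np

def normalize_affinity(A):
    """Add self-connections and apply symmetric normalization,
    L = D^{-1/2} (A + I) D^{-1/2}, one common choice for L in Eq. (47)."""
    A_hat = A + np.eye(A.shape[0])
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_hat @ D_inv_sqrt

def gnn_layer(X, L, W):
    """One propagation step of Eq. (47) in matrix form: sigma(L X W)."""
    return np.maximum(L @ X @ W, 0.0)

# Toy usage with hypothetical sizes: N=6 embeddings of dimension 32
rng = np.random.default_rng(0)
E = rng.standard_normal((6, 32))            # speaker embeddings
A = np.abs(np.corrcoef(E))                  # stand-in affinity matrix
L = normalize_affinity(A)
W = rng.standard_normal((32, 16)) * 0.1     # trainable weights (random here)
H = gnn_layer(E, L, W)                      # refined node representations
refined_affinity = H @ H.T                  # similarity matrix for spectral clustering
```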

Besides that, different approaches have been proposed to generate the affinity matrix. In [165], a self-attention-based neural network model was introduced to directly generate a similarity matrix from a sequence of speaker embeddings. In [166], several affinity matrices with different temporal resolutions were fused into a single affinity matrix based on a neural network.

A different approach aiming at improving clustering, called deep embedded clustering (DEC), was proposed in [179]. The goal of DEC is to transform the input features (here, speaker embeddings) to make them more separable into a given number of clusters/speakers. In order to make the clustering differentiable, each embedding is assigned a probability of “belonging” to each of the available speaker clusters, i.e., q_{ij} can be interpreted as the probability of assigning sample i to cluster j (a soft assignment):

q_{ij}=\frac{\left(1+\|z_{i}-\mu_{j}\|^{2}/a\right)^{-\frac{a+1}{2}}}{\sum_{l}\left(1+\|z_{i}-\mu_{l}\|^{2}/a\right)^{-\frac{a+1}{2}}},\qquad p_{ij}=\frac{q_{ij}^{2}/f_{j}}{\sum_{l}q_{il}^{2}/f_{l}},   (48)

where z_{i} is the bottleneck feature of sample i, a is the degree of freedom of the Student’s t-distribution, \mu_{j} is the centroid of the j-th cluster, and f_{j}=\sum_{i}q_{ij} is the soft cluster frequency. The clusters are iteratively refined based on the target distribution p_{ij}, computed from the bottleneck features estimated using an autoencoder.

The initial version of DEC had some problems, and a refined algorithm called improved DEC (IDEC) was later proposed with better accuracy on speaker diarization [180, 51]. Firstly, there was a potential risk that the neural network converges to a trivial solution that generates corrupted embeddings. To avoid this risk, Guo et al. [180] proposed to explicitly preserve the local structure of the data by adding a reconstruction loss between the output of the autoencoder and the input feature. Dimitriadis [51] further addressed the issue by introducing a loss function that encourages the distribution of speaker turns to be uniform across all speakers, i.e., all speakers contribute equally to the session. This assumption is not always valid for real recordings, but it constrains the solution space enough to avoid empty clusters without affecting the overall performance. Finally, Dimitriadis [51] also proposed an additional loss term that penalizes the distance from the centroid \mu_{j}, bringing the behavior of the algorithm closer to k-means.

Overall, the loss function of IDEC consists of four terms: L_{c}, the clustering error term originally proposed in DEC; L_{r}, the reconstruction error term [180]; L_{u}, the uniform “speaker airtime” distribution loss [51]; and L_{MSE}, the loss measuring the distance of the bottleneck features from the centroids [51]:

L=\alpha L_{c}+\beta L_{r}+\gamma L_{u}+\delta L_{MSE},   (49)

where \alpha, \beta, \gamma, and \delta are the weights on the loss terms, fine-tuned on held-out data.
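A minimal sketch of the DEC quantities of Eq. (48) and of the clustering term L_{c} of Eq. (49) is given below, assuming that L_{c} is the KL divergence between the target and the soft-assignment distributions, as in the original DEC formulation; the array shapes and values are illustrative.

```python
import numpy as np

def dec_soft_assignments(Z, centroids, a=1.0):
    """Student-t soft assignments q_ij of Eq. (48)."""
    # squared distances between bottleneck features and centroids: (N, K)
    d2 = ((Z[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    q = (1.0 + d2 / a) ** (-(a + 1.0) / 2.0)
    return q / q.sum(axis=1, keepdims=True)

def dec_target_distribution(q):
    """Sharpened target p_ij of Eq. (48); f_j is the soft cluster frequency."""
    f = q.sum(axis=0)                        # soft count per cluster
    p = (q ** 2) / f
    return p / p.sum(axis=1, keepdims=True)

def clustering_loss(p, q, eps=1e-12):
    """Clustering term L_c of Eq. (49), here taken as KL(P || Q)."""
    return float((p * np.log((p + eps) / (q + eps))).sum())

# Toy usage: 8 bottleneck features of dimension 16, 2 clusters
rng = np.random.default_rng(0)
Z = rng.standard_normal((8, 16))
mu = rng.standard_normal((2, 16))
q = dec_soft_assignments(Z, mu)
p = dec_target_distribution(q)
print(clustering_loss(p, q))
```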

3.1.2 Learning the Distance Estimator

In this section, a novel approach using a trainable distance function is presented. The basic idea builds on relational recurrent neural networks (RRNNs). RRNNs were introduced in [181, 182, 183] to address “relational information learning” problems. Such models learn relations between a sequence of input features, like the notion of “closer” or “further”, e.g., that two points in space are closer to each other than to a third one. Speaker diarization can be seen as part of this class of problems, since the final decision depends on the distance between speech segments and speaker profiles or centroids.

There are several issues that potentially limit the accuracy of speaker diarization systems. Firstly, as mentioned in Section 2.3, the duration of the segments used when extracting speaker embeddings poses a trade-off between time resolution and the robustness of the extracted speaker representations. Secondly, speaker embedding extractors are not explicitly trained to provide optimal representations for speaker diarization, despite the fact that these invariant, discriminative representations are used to separate thousands of speakers [40]. Thirdly, the distance metric is often based on a heuristic approach and/or dependent on assumptions that do not necessarily hold, e.g., Gaussianity in the case of PLDA [130]. Finally, the audio chunks are treated independently and any temporal context is simply ignored in conventional clustering methods, as described in Section 2.5. These issues can be attributed to the distance metric function, and most of them can be addressed with RRNNs, where a data-driven, memory-based approach bridges the performance gap between heuristic and trainable distance estimation approaches.

In this context, an approach for learning the relationship between the speaker cluster centroids (or speaker profiles) and the embeddings was proposed in [167] (Fig. 8). In this work, the diarization process is treated as a classification task on already segmented audio, as in Section 2.3, segmented either uniformly [146] or based on estimated speaker-change points [184]. The speaker embeddings x_{j} for each segment, which are assumed to be speaker-homogeneous, are extracted and then compared with all the available speaker profiles or speaker centroids. The most suitable speaker label is assigned to each segment by minimizing a distance-based loss function, i.e., by learning the relationship between embeddings and profiles. As discussed in [167], the RRNN-based distance estimation exhibits consistent performance improvements when compared with more traditional distance estimation approaches such as the cosine distance [30] or the PLDA-based [130] distance. Note that although the task in [167] is speaker identification, an extension to speaker diarization is rather straightforward when the speaker profiles are pre-estimated, either as centroids using any of the traditional clustering algorithms (Sections 2.5 and 3.1.1), or using prior knowledge.

Fig. 8: Continuous speaker identification system based on RMC. The speech signal is segmented uniformly and each segment x_{t} is compared against all the available speaker profiles according to a distance metric d(\cdot,\cdot). A speaker label s_{t,j} is assigned to each x_{t} by minimizing this metric.

3.1.3 Post Processing Based on Deep Learning

Fig. 9: Target Speaker Voice Activity Detector

There are a few recent studies on neural network-based diarization methods that are applied on top of the result of a traditional clustering-based speaker diarization. These methods can be categorized as an extension of post-processing. Medennikov et al. [83, 52] proposed target-speaker voice activity detection (TS-VAD) to achieve accurate speaker diarization even under noisy conditions with many speaker overlaps. TS-VAD assumes that a set of i-vectors \mathcal{E}=\{\mathbf{e}_{k}\in\mathbb{R}^{f}\,|\,k=1,\ldots,K\} is available, one for each speaker in the audio, where f is the dimension of the i-vector and K is the number of speakers. As presented in Fig. 9, TS-VAD takes not only a sequence of MFCCs, \mathbf{X}=(\mathbf{x}_{t}\in\mathbb{R}^{d}\,|\,t=1,\ldots,T), where d is the MFCC dimension and T the length of the sequence, but also the set of i-vectors \mathcal{E}. Given \mathbf{X} and \mathcal{E}, the model outputs a sequence of K-dimensional vectors \mathbf{O}=(\mathbf{o}_{t}\in\mathbb{R}^{K}\,|\,t=1,\ldots,T), where the k-th element of \mathbf{o}_{t} represents the probability of speech activity of the speaker corresponding to \mathbf{e}_{k} at time frame t. In other words, the k-th element of \mathbf{o}_{t} is expected to be 1 if the speaker of \mathbf{e}_{k} is speaking at time t, and 0 otherwise.

Because TS-VAD requires the i-vectors of speakers, pre-processing to obtain the i-vectors is necessary. The procedure proposed in [83, 52] is as follows:

  1. Apply clustering-based diarization.

  2. Estimate i-vectors for each speaker given the diarization result.

  3. Repeat (a) and (b):

     (a) Apply TS-VAD given the estimated i-vectors.

     (b) Refine the i-vectors given the TS-VAD result.

TS-VAD was proposed as part of the winning system of the CHiME-6 Challenge [81], and showed a significantly better DER compared with the conventional clustering-based approach [83]. However, it has a drawback: the maximum number of speakers that the model can handle is limited by the dimension of the output vector.
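The sketch below illustrates the input/output structure of a TS-VAD-style model in PyTorch. It is a simplification of the actual model: here a shared detector is applied to each target speaker independently, whereas TS-VAD also combines information across speakers with an additional recurrent layer; the feature dimensions and layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class TSVADSketch(nn.Module):
    """Simplified TS-VAD-style model: per-frame speech activity for each
    enrolled speaker, conditioned on that speaker's i-vector."""

    def __init__(self, feat_dim=40, ivec_dim=100, hidden=128):
        super().__init__()
        self.frame_enc = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU())
        # shared detector applied once per target speaker
        self.detector = nn.LSTM(hidden + ivec_dim, hidden,
                                batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 1)

    def forward(self, feats, ivectors):
        # feats: (B, T, feat_dim), ivectors: (B, K, ivec_dim)
        h = self.frame_enc(feats)                          # (B, T, hidden)
        outputs = []
        for k in range(ivectors.shape[1]):
            ivec = ivectors[:, k:k + 1, :].expand(-1, h.shape[1], -1)
            o, _ = self.detector(torch.cat([h, ivec], dim=-1))
            outputs.append(torch.sigmoid(self.head(o)))    # (B, T, 1)
        return torch.cat(outputs, dim=-1)                  # (B, T, K)

# Toy usage: 200 frames of 40-dim MFCC and 4 speaker i-vectors
model = TSVADSketch()
probs = model(torch.randn(1, 200, 40), torch.randn(1, 4, 100))
print(probs.shape)  # torch.Size([1, 200, 4])
```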

As a different approach, Horiguchi et al. proposed applying the EEND model (detailed in Section 3.2.4) to refine the result of a clustering-based speaker diarization [168]. A clustering-based speaker diarization method can handle a large number of speakers but is unable to handle overlapped speech; conversely, EEND has the opposite characteristics. To use the two methods complementarily, the authors first applied a conventional clustering method. Then, a two-speaker EEND model was iteratively applied to each pair of detected speakers to refine the time boundaries of the overlapped regions.

3.2 Joint Optimization for Speaker Diarization

3.2.1 Joint Segmentation and Clustering

A model called unbounded interleaved-state recurrent neural networks (UIS-RNN) was proposed to replace the segmentation and clustering modules with a trainable model [41]. Given an input sequence of embeddings \mathbf{X}=(\mathbf{x}_{t}\in\mathbb{R}^{d}\,|\,t=1,\ldots,T), UIS-RNN generates the diarization result \mathbf{Y}=(y_{t}\in\mathbb{N}\,|\,t=1,\ldots,T) as a sequence of speaker indices, one for each time frame. The joint probability of \mathbf{X} and \mathbf{Y} can be decomposed by the chain rule as follows:

P(\mathbf{X},\mathbf{Y})=P(\mathbf{x}_{1},y_{1})\prod_{t=2}^{T}P(\mathbf{x}_{t},y_{t}|\mathbf{x}_{1:t-1},y_{1:t-1}).   (50)

To model the distribution of the speaker change, UIS-RNN then introduces a latent variable \mathbf{Z}=(z_{t}\in\{0,1\}\,|\,t=2,\ldots,T), where z_{t} becomes 1 if the speaker indices at time t-1 and t are different, and 0 otherwise. The joint probability including \mathbf{Z} is then decomposed as follows:

P(\mathbf{X},\mathbf{Y},\mathbf{Z})=P(\mathbf{x}_{1},y_{1})\prod_{t=2}^{T}P(\mathbf{x}_{t},y_{t},z_{t}|\mathbf{x}_{1:t-1},y_{1:t-1},z_{1:t-1}).   (51)

Finally, the term P(\mathbf{x}_{t},y_{t},z_{t}|\mathbf{x}_{1:t-1},y_{1:t-1},z_{1:t-1}) is further decomposed into three components:

P(\mathbf{x}_{t},y_{t},z_{t}|\mathbf{x}_{1:t-1},y_{1:t-1},z_{1:t-1})=P(\mathbf{x}_{t}|\mathbf{x}_{1:t-1},y_{1:t})\,P(y_{t}|z_{t},y_{1:t-1})\,P(z_{t}|z_{1:t-1}).   (52)

Here, P(\mathbf{x}_{t}|\mathbf{x}_{1:t-1},y_{1:t}) denotes the sequence generation probability and is modeled by gated recurrent unit (GRU)-based recurrent neural networks. P(y_{t}|z_{t},y_{1:t-1}) denotes the speaker assignment probability and is modeled by a distance-dependent Chinese restaurant process [185], which can model a distribution over an unbounded number of speakers. Finally, P(z_{t}|z_{1:t-1}) represents the speaker change probability and is modeled by a Bernoulli distribution. Since all components are trainable, UIS-RNN can be trained in a supervised fashion by finding the parameters that maximize \log P(\mathbf{X},\mathbf{Y},\mathbf{Z}) over the training data. Inference is conducted by finding the \mathbf{Y} that maximizes \log P(\mathbf{X},\mathbf{Y}) given \mathbf{X} via beam search in an online fashion. Although UIS-RNN works online, it demonstrated a better DER than that of an offline system based on spectral clustering.
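As an illustration of the speaker-assignment prior, the sketch below follows our reading of the distance-dependent Chinese restaurant process used in [41]: when a change occurs (z_t = 1), an existing speaker is chosen in proportion to the number of contiguous blocks it has produced so far, and a new speaker with weight alpha; the exact parameterization in UIS-RNN may differ.

```python
import numpy as np

def speaker_assignment_probs(prev_speakers, alpha=1.0):
    """Sketch of P(y_t | z_t = 1, y_{1:t-1}) as a distance-dependent CRP:
    existing speakers weighted by their number of contiguous blocks,
    a new speaker weighted by alpha."""
    # count contiguous blocks per speaker in the label history
    blocks = {}
    for i, s in enumerate(prev_speakers):
        if i == 0 or s != prev_speakers[i - 1]:
            blocks[s] = blocks.get(s, 0) + 1
    current = prev_speakers[-1]
    speakers = [s for s in blocks if s != current]   # a change excludes the current speaker
    weights = np.array([blocks[s] for s in speakers] + [alpha], dtype=float)
    return speakers + ["new"], weights / weights.sum()

# Toy usage: label history A A B A, with a speaker change at the next frame
labels, probs = speaker_assignment_probs(["A", "A", "B", "A"])
print(list(zip(labels, probs)))
```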

3.2.2 Joint Segmentation, Embedding Extraction, and Re-segmentation

Fig. 10: (a) RPN for speaker diarization, (b) diarization procedure based on RPN.

A speaker diarization method based on region proposal networks (RPN) was proposed to jointly perform segmentation, speaker embedding extraction, and resegmentation [63]. The RPN was originally proposed to detect multiple objects in a two-dimensional image [186]; a one-dimensional variant along the time axis is used for speaker diarization.

As can be seen in Fig. 10 (a), the STFT features, of size time by frequency bins, are first converted to a feature map of size time by frequency by channels using CNNs. Then, three other types of neural networks are applied to sliding windows of various sizes (called “anchors”) along the time axis. For each anchor, the three neural networks perform SAD, speaker embedding extraction, and region refinement, respectively. Here, SAD estimates the probability of speech activity for the anchor region. Speaker embedding extraction generates an embedding that represents the speaker characteristics of the audio corresponding to the anchor region. Finally, region refinement estimates the difference between the shape (i.e., duration and center position) of the anchor and that of the corresponding reference region.

The inference procedure of the RPN is presented in Fig. 10 (b). The RPN is first applied to list the anchors whose speech activity probability is higher than a pre-determined threshold. The anchors are then clustered using a conventional clustering method (e.g., k-means) based on the speaker embeddings estimated for each anchor. Finally, highly overlapping anchors are removed after region refinement, a procedure known as non-maximum suppression.
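A minimal sketch of greedy non-maximum suppression on one-dimensional time regions is shown below; the IoU threshold and the segment values are illustrative.

```python
import numpy as np

def nms_1d(segments, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression on 1-D time regions.
    `segments` is an (N, 2) array of (start, end) times after region
    refinement; `scores` are the speech-activity probabilities."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        # temporal IoU between the kept segment and the remaining ones
        inter = np.maximum(
            0.0,
            np.minimum(segments[rest, 1], segments[i, 1])
            - np.maximum(segments[rest, 0], segments[i, 0]),
        )
        union = (segments[rest, 1] - segments[rest, 0]) \
            + (segments[i, 1] - segments[i, 0]) - inter
        order = rest[inter / union <= iou_threshold]
    return keep

segs = np.array([[0.0, 2.0], [0.5, 2.2], [3.0, 4.0]])
print(nms_1d(segs, np.array([0.9, 0.8, 0.95])))  # -> [2, 0]
```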

The RPN-based speaker diarization system has the advantage of handling overlapped speech with possibly any number of speakers. It is also much simpler than the conventional speaker diarization pipeline. It was shown on multiple datasets that this system achieves significantly better DER than the conventional clustering-based speaker diarization system [63, 80].

3.2.3 Joint Speech Separation and Diarization

There is also recent research on the joint modeling of speech separation and speaker diarization. Kounades-Bastian et al. [187, 188] proposed incorporating a speech activity model into speech separation based on a spatial covariance model with non-negative matrix factorization. They derived an EM algorithm to estimate the separated speech and the speech activity of each speaker from multichannel overlapped speech. While their method jointly performs speaker diarization and speech separation, it is based on statistical modeling, and the estimation is conducted solely from the observation, i.e., without any model training.

Fig. 11: Joint speech separation, speaker counting, and speaker diarization model.

Neumann et al. [64, 169] later proposed a trainable model, the online Recurrent Selective Attention Network (online RSAN), for joint speech separation, speaker counting, and speaker diarization based on a single neural network (Fig. 11). The neural network takes as input a spectrogram \mathbf{X}_{b}\in\mathbb{R}^{T\times F}, a residual mask \mathbf{R}_{b,i-1}\in\mathbb{R}^{T\times F}, and a speaker embedding \mathbf{e}_{b-1,i}\in\mathbb{R}^{d}, where b is the index of the audio block; i, the index of the speaker; T, the length of the audio block; and F, the number of frequency bins of the spectrogram. It outputs a speech mask \mathbf{M}_{b,i}\in\mathbb{R}^{T\times F} and an updated speaker embedding \mathbf{e}_{b,i} for the corresponding speaker. The neural network is applied iteratively for each audio block b and for each speaker i as follows:

For each audio block b=1,2,\ldots, the following steps are repeated for i=1,2,\ldots:

  (i) Apply the neural network to \mathbf{X}_{b}, the residual mask \mathbf{R}_{b,i-1}, and the previous-block embedding \mathbf{e}_{b-1,i} (\mathbf{e}_{b-1,i} is set to \mathbf{0} if it was not calculated previously) to obtain \mathbf{M}_{b,i} and \mathbf{e}_{b,i}.

  (ii) \mathbf{R}_{b,i}=\max(\mathbf{R}_{b,i-1}-\mathbf{M}_{b,i},\mathbf{0}).

  (iii) If \frac{1}{TF}\sum_{t,f}\mathbf{R}_{b,i}(t,f)<\text{threshold}, stop the iteration over i.

A separated speech signal for speaker i in audio block b can be obtained as \mathbf{M}_{b,i}\odot\mathbf{X}_{b}, where \odot denotes element-wise multiplication. The speaker embedding \mathbf{e}_{b,i} is used to keep track of the same speaker across adjacent blocks. Thanks to this iterative approach, the neural network can cope with a variable number of speakers while jointly performing speech separation and speaker diarization. The online RSAN was evaluated on a real meeting dataset with up to six speakers and showed better results than a clustering-based method [169].
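The block-wise iteration can be sketched as follows, with a dummy stand-in for the trained mask-estimation network; the all-one initialization of the residual mask and the stopping threshold are assumptions of this sketch.

```python
import numpy as np

def rsan_block(X, prev_embeddings, mask_net, emb_dim=32, threshold=0.05):
    """One block of an online-RSAN-style iteration: peel off one source mask
    at a time until little unexplained energy remains in the residual.
    `mask_net(X, R, e)` is a hypothetical stand-in for the trained network;
    it must return a (T, F) mask and an updated speaker embedding."""
    T, F = X.shape
    R = np.ones((T, F))                  # residual mask (assumed all-ones initially)
    masks, embeddings = [], []
    i = 0
    while True:
        e_prev = prev_embeddings[i] if i < len(prev_embeddings) else np.zeros(emb_dim)
        M, e = mask_net(X, R, e_prev)
        R = np.maximum(R - M, 0.0)
        masks.append(M)
        embeddings.append(e)
        i += 1
        if R.mean() < threshold:         # stop when the residual is (nearly) empty
            break
    return masks, embeddings             # separated speech: M * X per speaker

# Dummy network that "extracts" a fixed fraction of the remaining residual
def dummy_net(X, R, e_prev):
    return 0.6 * R, np.ones(32)

masks, embs = rsan_block(np.abs(np.random.randn(100, 257)), [], dummy_net)
print(len(masks))
```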

3.2.4 Fully End-to-end Neural Diarization

Fig. 12: Two-speaker end-to-end neural diarization model

Recently, the framework called EEND was proposed [42, 43], which performs the entire speaker diarization procedure with a single neural network. The architecture of EEND is shown in Fig. 12. The input to the EEND model is a T-length sequence of acoustic features (e.g., log Mel-filterbank), \mathbf{X}=(\mathbf{x}_{t}\in\mathbb{R}^{F}\,|\,t=1,\ldots,T). A neural network then outputs the corresponding speaker label sequence \mathbf{Y}=(\mathbf{y}_{t}\,|\,t=1,\ldots,T), where \mathbf{y}_{t}=[y_{t,k}\in\{0,1\}\,|\,k=1,\ldots,K]. Here, y_{t,k}=1 represents the speech activity of speaker k at time frame t, and K is the maximum number of speakers that the neural network can output. Importantly, y_{t,k} and y_{t,k'} can both be 1 for different speakers k and k', indicating that these two speakers are speaking simultaneously (i.e., overlapping speech). The neural network is trained to maximize \log P(\mathbf{Y}|\mathbf{X})\sim\sum_{t}\sum_{k}\log P(y_{t,k}|\mathbf{X}) over the training data, assuming conditional independence of the outputs y_{t,k}. Because there can be multiple valid reference labels \mathbf{Y} obtained by swapping the speaker index k, the loss function is calculated for all possible reference labels and the one with the minimum loss is used for error back-propagation, which is inspired by the permutation-free objective used in speech separation [45]. EEND was initially proposed with a bidirectional long short-term memory (BLSTM) network [42] and was soon extended to a self-attention-based network [43], showing state-of-the-art DER on two-speaker data such as the two-speaker excerpt of the CALLHOME dataset (LDC2001S97) and the dialogue audio of the Corpus of Spontaneous Japanese [189].
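A minimal PyTorch sketch of this permutation-free objective is given below; it enumerates all speaker permutations of the reference labels, which is feasible only for a small maximum number of speakers K, and the toy shapes are illustrative.

```python
import itertools
import torch
import torch.nn.functional as F

def pit_diarization_loss(logits, labels):
    """Permutation-free training objective of EEND: compute the frame-wise
    binary cross-entropy for every speaker permutation of the reference
    labels and keep the smallest one.
    logits: (T, K) raw network outputs; labels: (T, K) 0/1 speech activity."""
    K = labels.shape[1]
    losses = []
    for perm in itertools.permutations(range(K)):
        permuted = labels[:, list(perm)]
        losses.append(
            F.binary_cross_entropy_with_logits(logits, permuted, reduction="mean")
        )
    return torch.stack(losses).min()

# Toy usage: 10 frames, 2 speakers, with an overlapping region in the middle
labels = torch.tensor([[1, 0]] * 4 + [[1, 1]] * 2 + [[0, 1]] * 4, dtype=torch.float)
logits = torch.randn(10, 2, requires_grad=True)
loss = pit_diarization_loss(logits, labels)
loss.backward()
print(loss.item())
```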

EEND has multiple advantages. First, EEND can handle overlapping speech in a principled way. Second, the network is directly optimized toward the maximization of diarization accuracy, from which we can expect high accuracy. Third, it can be retrained on real data (i.e., not synthetic data) simply by feeding a reference diarization label, which is often not straightforward for prior approaches. However, EEND also has several limitations. First, the model architecture limits the maximum number of speakers that the model can cope with. Second, EEND consists of BLSTM or self-attention based neural networks, making online processing difficult. Third, it was empirically suggested that EEND tends to overfit to the distribution of the training data [42].

Fig. 13: End-to-end neural diarization with encoder-decoder-based attractor (EDA).

To cope with an unbounded number of speakers, several extensions of EEND have been investigated. Horiguchi et al. [170] proposed an extension of EEND with encoder-decoder-based attractors (EDA) (Fig. 13). This method applies an LSTM-based encoder-decoder to the output of EEND to generate multiple attractors. Attractors are generated until the attractor existence probability falls below a threshold. Then, each attractor is multiplied with the embeddings generated by EEND to calculate the speech activity of each speaker. EEND-EDA was evaluated on the CALLHOME (two to six speakers) and DIHARD 2 (one to nine speakers) datasets and showed better performance than the clustering-based baseline system.
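The sketch below illustrates the attractor computation in a simplified form: an LSTM encoder summarizes the frame-level embeddings (the actual EEND-EDA shuffles their time order first), an LSTM decoder emits one attractor per step until the existence probability drops below a threshold, and speech activities are obtained from dot products between embeddings and attractors. Layer sizes are assumptions, and the toy usage biases the untrained existence head so that the loop produces some attractors.

```python
import torch
import torch.nn as nn

class EDASketch(nn.Module):
    """Simplified encoder-decoder attractor calculation on top of
    frame-level embeddings (a sketch of the EEND-EDA idea)."""

    def __init__(self, dim=256):
        super().__init__()
        self.encoder = nn.LSTM(dim, dim, batch_first=True)
        self.decoder = nn.LSTM(dim, dim, batch_first=True)
        self.exist = nn.Linear(dim, 1)

    def forward(self, emb, max_speakers=10, threshold=0.5):
        # emb: (1, T, dim) frame-level embeddings from the EEND encoder
        _, state = self.encoder(emb)
        attractors = []
        zero_in = torch.zeros(1, 1, emb.shape[-1])
        for _ in range(max_speakers):
            out, state = self.decoder(zero_in, state)      # one attractor per step
            if torch.sigmoid(self.exist(out)).item() < threshold:
                break                                       # no more speakers
            attractors.append(out.squeeze(1))
        A = torch.cat(attractors, dim=0)                   # (K, dim)
        activity = torch.sigmoid(emb.squeeze(0) @ A.t())   # (T, K) speech activity
        return activity, A

eda = EDASketch(dim=64)
nn.init.constant_(eda.exist.bias, 2.0)   # untrained toy model: keep existence high
activity, attractors = eda(torch.randn(1, 500, 64))
print(activity.shape, attractors.shape)
```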

On the other hand, Fujita et al. [171] proposed another approach that outputs the speech activities one speaker after another by using a speaker-wise conditional chain rule. In this method, a neural network is trained to produce the posterior probability P(\mathbf{y}_{k}|\mathbf{y}_{1},\ldots,\mathbf{y}_{k-1},\mathbf{X}), where \mathbf{y}_{k}=(y_{t,k}\in\{0,1\}\,|\,t=1,\ldots,T) is the speech activity of the k-th speaker. The joint speech activity probability of all speakers can then be estimated from the following speaker-wise conditional chain rule:

P(\mathbf{y}_{1},\ldots,\mathbf{y}_{K}|\mathbf{X})=\prod_{k=1}^{K}P(\mathbf{y}_{k}|\mathbf{y}_{1},\ldots,\mathbf{y}_{k-1},\mathbf{X}).   (53)

During inference, the neural network is applied repeatedly until the speech activity \mathbf{y}_{k} of the last estimated speaker approaches zero. Kinoshita et al. [174] proposed a different approach that combines EEND and speaker clustering. In their method, a neural network is trained to generate speaker embeddings and speech activity probabilities. Speaker clustering, constrained by the speech activity estimated by EEND, is then applied to align the estimated speakers across different processing blocks.

There are also a few recent attempts to extend EEND to online processing. Xue et al. [172] proposed a method using a speaker tracing buffer to better align the speaker labels of adjacent processing blocks. Han et al. [173] proposed a block-online version of EEND-EDA [170] that carries over the hidden state of the LSTM encoder to generate the attractors block by block.

4 Speaker Diarization in the Context of ASR

From a conventional perspective, speaker diarization is considered a pre-processing step for ASR. In the traditional system structures for speaker diarization, presented in Fig. 1, speech inputs are processed sequentially across the diarization components without considering the ASR performance, which is usually measured using the word error rate (WER). WER is the number of misrecognized words (substitution, insertion, and deletion errors) divided by the number of reference words. One issue is that the tight boundaries of the speech segments produced by speaker diarization have a high chance of causing unexpected word truncation or deletion errors in ASR decoding. In this section, we discuss how speaker diarization systems have been developed in the context of ASR, not only achieving better WER by preventing speaker diarization from degrading ASR performance, but also benefiting from ASR outputs to enhance diarization performance. More recently, a few pioneering proposals have been made for the joint modeling of speaker diarization and ASR, which will also be introduced in this section.

4.1 Early Works

The lexical information from the ASR output has been employed in speaker diarization systems in a few different ways. The earliest approach was in the RT03 evaluation [1], which used word boundary information for segmentation. In [1], a general ASR system for broadcast news data was built, in which the basic components are segmentation, speaker clustering, speaker adaptation, and system combination after ASR decoding from two sub-systems with different adaptation methods. The authors used the word boundary information from the ASR system for speech segmentation and compared it with BIC-based speech segmentation. While the performance gain from the ASR-based segmentation was insignificant, this was the first attempt to take advantage of ASR output to enhance diarization performance. In addition, the ASR result was used to refine SAD in IBM's submission [190] to the RT07 evaluation. The system in [190] incorporates word alignments from a speaker-independent ASR module and refines the SAD result to reduce false alarms so that the speaker diarization system can achieve better clustering quality. The segmentation system in [71] also takes advantage of word alignments from ASR. The authors in [71] focused on the word-breakage problem, in which words from the ASR output are truncated by the segmentation results since the segmentation results and the decoded word sequences are not aligned. A word-breakage ratio was therefore proposed to measure the rate of change points detected inside intervals corresponding to words, and DER and the word-breakage ratio were used to measure the influence of the word truncation problem. While the aforementioned early speaker diarization systems that leverage the ASR output focus on word alignment information to refine the SAD or segmentation result, the speaker diarization system in [191] created a dictionary of phrases commonly appearing in broadcast news. The phrases in this dictionary provide the identity of who is speaking, who will speak, and who spoke in the broadcast news scenario. For example, “This is [name]” indicates who was the speaker of the broadcast news section. Although the early studies on speaker diarization did not fully leverage lexical information to drastically improve DER, the idea of integrating information from the ASR output has been adopted by many studies to refine or improve the speaker diarization output.

Fig. 14: Integration of lexical information and acoustic information.
Fig. 15: Integration of lexical information and acoustic information.

4.2 Using Lexical Information from ASR

More recent speaker diarization systems that take advantage of the ASR transcript have employed DNN models to capture the linguistic patterns in the ASR output and thereby enhance the speaker diarization result. The authors in [192] proposed a way of using linguistic information for speaker diarization tasks in which the participants have distinct roles that are known to the speaker diarization system. Fig. 14 shows a diagram of the speaker diarization system discussed in [192]. In this system, a neural text-based speaker change detector and a text-based role recognizer are employed. By using both linguistic and acoustic information, the DER was significantly improved compared with the acoustic-only system.

Lexical information from the ASR output was also used for speaker segmentation [72] by employing a sequence-to-sequence model that outputs speaker turn tokens. Based on the estimated speaker turns, the input utterance is segmented accordingly. The experimental results in [72] indicate that both acoustic and lexical information can be exploited, with an extra advantage obtained from the word boundaries given by the ASR output.

The authors of [193] presented follow-up research along this thread. Unlike the system in [72], the lexical information from the ASR module is integrated into the speech segment clustering process by employing an integrated adjacency matrix. The adjacency matrix is obtained by a max operation between the acoustic affinity matrix, created from the affinities among audio segments, and the lexical affinity matrix, created by segmenting the word sequence into word chunks that are likely to be spoken by the same speaker. Fig. 15 presents a diagram that explains how lexical information is integrated into an affinity matrix with acoustic information. The integrated adjacency matrix leads to improved speaker diarization performance on the CALLHOME American English dataset.
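A minimal sketch of this element-wise max fusion is shown below; the lexical matrix here is a hypothetical 0/1 matrix marking segments that fall within the same ASR-derived word chunk, not the output of any particular system.

```python
import numpy as np
from scipy.spatial.distance import cdist

def fuse_affinity(acoustic_aff, lexical_aff):
    """Element-wise max fusion of an acoustic affinity matrix and a lexical
    (word-chunk based) affinity matrix over the same speech segments."""
    assert acoustic_aff.shape == lexical_aff.shape
    return np.maximum(acoustic_aff, lexical_aff)

# Toy usage: 4 segments; acoustic affinity from cosine similarity of embeddings,
# lexical affinity as a hypothetical 0/1 matrix over ASR-derived word chunks.
emb = np.random.randn(4, 64)
acoustic = 1.0 - cdist(emb, emb, metric="cosine")
lexical = np.array([[1, 1, 0, 0],
                    [1, 1, 0, 0],
                    [0, 0, 1, 1],
                    [0, 0, 1, 1]], dtype=float)
fused = fuse_affinity(acoustic, lexical)   # input to spectral clustering
```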

4.3 Joint ASR and Speaker Diarization with Deep Learning

Motivated by the recent success of deep learning and end-to-end modeling, several models have been proposed to jointly perform ASR and speaker diarization. As discussed in the previous section, ASR results contain strong cues for improving speaker diarization. On the other hand, speaker diarization results can be used to improve the accuracy of ASR, for example, by adapting the ASR model toward each estimated speaker. Joint modeling can leverage this inter-dependency to improve both ASR and speaker diarization. In the evaluation, a WER metric that counts word hypotheses with speaker-attribution errors as misrecognized words, such as the speaker-attributed WER [194] or the concatenated minimum-permutation WER (cpWER) [81], is often used. ASR-specific metrics (e.g., speaker-agnostic WER) or diarization-specific metrics (e.g., the DER mentioned in Section 1.4.1) are also used complementarily.
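For concreteness, the following sketch computes cpWER under the simplifying assumption that the reference and the hypothesis contain the same number of speakers (the official scoring additionally handles mismatched speaker counts): each speaker's words are concatenated, and the speaker permutation with the fewest word errors is selected.

```python
import itertools

def word_errors(ref, hyp):
    """Levenshtein distance at the word level (substitutions + deletions + insertions)."""
    r, h = ref.split(), hyp.split()
    d = list(range(len(h) + 1))
    for i, rw in enumerate(r, 1):
        prev, d[0] = d[0], i
        for j, hw in enumerate(h, 1):
            cur = min(d[j] + 1, d[j - 1] + 1, prev + (rw != hw))
            prev, d[j] = d[j], cur
    return d[-1]

def cp_wer(ref_by_spk, hyp_by_spk):
    """Concatenated minimum-permutation WER over per-speaker word streams,
    assuming equal numbers of reference and hypothesis speakers."""
    refs = list(ref_by_spk.values())
    hyps = list(hyp_by_spk.values())
    n_ref_words = sum(len(r.split()) for r in refs)
    best = min(
        sum(word_errors(ref, hyps[j]) for ref, j in zip(refs, perm))
        for perm in itertools.permutations(range(len(hyps)))
    )
    return best / n_ref_words

# Toy usage: correct words, speaker labels only permuted -> cpWER of 0.0
ref = {"spkA": "hello how are you", "spkB": "i am fine thanks"}
hyp = {"s1": "i am fine thanks", "s2": "hello how are you"}
print(cp_wer(ref, hyp))
```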

Fig. 16: Joint ASR and diarization by inserting a speaker tag in the transcription.

The first approach is the introduction of a speaker tag in the transcription of end-to-end ASR models (Fig. 16). Shafey et al. [65] proposed inserting a speaker role tag (e.g., ⟨doctor⟩ and ⟨patient⟩) into the output of a recurrent neural network-transducer (RNN-T)-based ASR system. This method was evaluated on doctor-patient conversations, and a significant reduction in the word diarization error rate (WDER) was reported with only a marginal degradation of WER. Similarly, Mao et al. [66] proposed inserting a speaker identity tag into the output of an attention-based encoder-decoder ASR system, and showed an improvement in DER, especially when oracle utterance boundaries were not given. The works by Shafey et al. and Mao et al. showed that the insertion of speaker tags is a simple and promising way to jointly perform ASR and speaker diarization. On the other hand, the speaker roles or speaker identity tags need to be determined and fixed during training. Thus, it is difficult to cope with an arbitrary number of speakers using this approach.
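A tiny sketch of how such a tagged hypothesis can be turned into speaker-attributed text is shown below; the tag names are illustrative and not those used by any specific system.

```python
def split_by_speaker_tags(tokens, tags=("<doctor>", "<patient>")):
    """Turn a single token stream with inline speaker-role tags (as produced
    by a jointly trained ASR model of the kind described above) into
    per-speaker word sequences."""
    current, result = None, {}
    for tok in tokens:
        if tok in tags:
            current = tok.strip("<>")
            result.setdefault(current, [])
        elif current is not None:
            result[current].append(tok)
    return {spk: " ".join(words) for spk, words in result.items()}

hyp = "<doctor> how are you feeling <patient> much better thank you".split()
print(split_by_speaker_tags(hyp))
# {'doctor': 'how are you feeling', 'patient': 'much better thank you'}
```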

Fig. 17: Joint decoding framework for ASR and speaker diarization.

The second approach is a MAP-based joint decoding framework. Kanda et al. [67] formulated the joint decoding of ASR and speaker diarization as follows (see also Fig. 17). Assume that a sequence of observations is represented by \mathcal{X}=\{\mathbf{X}_{1},\ldots,\mathbf{X}_{U}\}, where U denotes the number of segments (e.g., generated by applying SAD to a long audio) and \mathbf{X}_{u} denotes the acoustic feature sequence of the u-th segment. Further assume that word hypotheses with time boundary information are represented by \mathcal{W}=\{\mathbf{W}_{1},\ldots,\mathbf{W}_{U}\}, where \mathbf{W}_{u} is the speech recognition hypothesis corresponding to segment u. Here, \mathbf{W}_{u}=(\mathbf{W}_{1,u},\ldots,\mathbf{W}_{K,u}) contains all the speakers' hypotheses in segment u, where K denotes the number of speakers and \mathbf{W}_{k,u} represents the speech recognition hypothesis of the k-th speaker in segment u. Finally, a tuple of speaker embeddings \mathcal{E}=(\mathbf{e}_{1},\ldots,\mathbf{e}_{K}), where \mathbf{e}_{k}\in\mathbb{R}^{d} is the d-dimensional speaker embedding of the k-th speaker, is also assumed. With these notations, the joint decoding of multispeaker ASR and diarization can be formulated as the problem of finding the most likely \hat{\mathcal{W}} as follows:

\hat{\mathcal{W}} = \operatorname*{argmax}_{\mathcal{W}} P(\mathcal{W}|\mathcal{X})   (54)
= \operatorname*{argmax}_{\mathcal{W}} \Big\{ \sum_{\mathcal{E}} P(\mathcal{W},\mathcal{E}|\mathcal{X}) \Big\}   (55)
\approx \operatorname*{argmax}_{\mathcal{W}} \Big\{ \max_{\mathcal{E}} P(\mathcal{W},\mathcal{E}|\mathcal{X}) \Big\},   (56)

where the Viterbi approximation is applied to obtain the final equation. This maximization problem is further decomposed into two iterative problems as follows:

\hat{\mathcal{W}}^{(i)} = \operatorname*{argmax}_{\mathcal{W}} P(\mathcal{W}|\hat{\mathcal{E}}^{(i-1)},\mathcal{X}),   (57)
\hat{\mathcal{E}}^{(i)} = \operatorname*{argmax}_{\mathcal{E}} P(\mathcal{E}|\hat{\mathcal{W}}^{(i)},\mathcal{X}),   (58)

where i is the iteration index of the procedure. In [67], Eq. (57) is modeled by target-speaker ASR [195, 196, 197, 59] and Eq. (58) is modeled by overlap-aware speaker embedding estimation. This method achieved a speaker-attributed WER similar to that of target-speaker ASR with oracle speaker embeddings on two-speaker conversation data from the Corpus of Spontaneous Japanese [189]. On the other hand, it requires an iterative application of target-speaker ASR and speaker embedding extraction, which makes it challenging to apply the method in an online mode.

Fig. 18: End-to-end speaker-attributed ASR

As a third line of approaches, end-to-end speaker-attributed ASR (SA-ASR) models were recently proposed to jointly perform speaker counting, multi-talker ASR, and speaker identification [175, 176]. Contrary to the first two approaches, the end-to-end SA-ASR model takes an additional input of speaker profiles and identifies the index of the speaker profile based on an attention mechanism (Fig. 18). Thanks to the attention mechanism for speaker identification and the multi-talker ASR capability based on serialized output training [198], there is no limitation on the number of speakers that the model can cope with. If relevant speaker profiles are supplied at inference time, the end-to-end SA-ASR model can automatically transcribe the utterances while identifying the speaker of each utterance based on the supplied profiles. On the other hand, if the relevant speaker profiles are not available prior to inference, the end-to-end SA-ASR model can still be applied using dummy profiles, and speaker clustering on the internal speaker embeddings of the model (the “speaker query” in Fig. 18) is used to diarize the speakers [68]. The end-to-end SA-ASR model was evaluated on the LibriCSS dataset [79] and exhibited significantly better cpWER than the combination of multi-talker ASR and speaker diarization [199].

5 Diarization Evaluation Series and Datasets

This section describes the evaluation series and the commonly used datasets for speaker diarization evaluations. A summary of the most commonly used datasets that include English is shown in Table 3.

Table 3: Diarization Evaluation Datasets

Dataset | Language | Size (hr) | Style | # Spkr.
CALLHOME | Multilingual | 20 | Conversation | 2–7
AMI | English | 100 | Meeting | 3–5
ICSI meeting | English | 72 | Meeting | 3–10
CHiME-5/6 | English | 50 | Conversation | 4
VoxConverse | Multilingual | 74 | YouTube video | 1–21
LibriCSS | English | 10 | Read speech | 8
DH I Tr.1,2 | Multilingual | 19 (dev), 21 (eval) | Miscellaneous | 1–7
DH II Tr.1,2 | Multilingual | 24 (dev), 22 (eval) | Miscellaneous | 1–8
DH II Tr.3,4 | Multilingual | 262 (dev), 31 (eval) | Miscellaneous | 4
DH III Tr.1,2 | Multilingual | 34 (dev), 33 (eval) | Miscellaneous | 1–7

Most of the content is in English, with only a small amount of non-English content.

  • CALLHOME NIST SRE 2000 (Disk-8), often referred to as the CALLHOME dataset, is the most widely used dataset for speaker diarization in recent papers. This dataset contains 500 sessions of multilingual telephone speech. Each session has two to seven speakers, with two dominant speakers in each conversation.

  • AMI Corpus The AMI database [200] includes 100 h of meeting recordings from multiple sites in 171 meeting sessions. The AMI database provides an audio source recorded using lapel microphones that are separately recorded and amplified for each speaker. Another audio source is recorded using omnidirectional microphone arrays mounted on the table during the meeting. The AMI database is a suitable dataset for the evaluation of speaker diarization systems integrated with an ASR module, since AMI provides forced alignment data that contains word- and phoneme-level timings along with the transcript and speaker labels. Each meeting session has three to five speakers.

  • ICSI Meeting Corpus The ICSI meeting corpus [201] contains 75 meetings of four meeting types. The ICSI meeting corpus provides word-level timings along with the transcript and speaker labels. The audio was recorded using close-talking individual microphones and six tabletop microphones to provide speaker-specific channels and multichannel recordings. Each meeting has 3 to 10 participants.

  • CHiME-5/6 challenge and its dataset The CHiME-5 challenge [202] and CHiME-6 challenge [81] were designed as a series of ASR competitions for daily conversations of multiple speakers. The dataset was provided for the CHiME-5 challenge and contains 50 h of multiparty real conversations in everyday home environments. It contains speaker labels, segmentation, and corresponding transcriptions. The audio was recorded using six four-channel microphone arrays located in the kitchen and dining/living rooms of a house, as well as binaural microphones worn by the participants. The number of participants is fixed at four. While the oracle diarization results were allowed to be used for the ASR task in the CHiME-5 challenge, CHiME-6 challenge track 2 requires the results of both ASR and diarization. The primary evaluation metric for that track was cpWER, which counts both speaker-attribution errors and word recognition errors in the WER calculation. DER and JER were also evaluated as secondary metrics, without a “score collar” and with overlapped regions included. The CHiME-5/6 corpus was also used as one track of the DIHARD 2 challenge.

  • VoxSRC Challenge and VoxConverse corpus The VoxCeleb Speaker Recognition Challenge (VoxSRC) is a recent evaluation series for speaker recognition systems [203, 95]. The goal of VoxSRC is to test how well current technology can cope with speech “in the wild”. This evaluation series initially started with a pure speaker verification task [203], and a diarization task was added as track 4 in the latest evaluation, the VoxCeleb Speaker Recognition Challenge 2020 (VoxSRC-20) [95]. The VoxConverse dataset [204] was used for the speaker diarization task, with DER as the primary metric and JER as the secondary metric. The VoxConverse dataset contains 74 h of human conversation extracted from YouTube videos. The dataset is divided into a development set (20.3 h, 216 recordings) and a test set (53.5 h, 310 recordings). The number of speakers in each recording varies widely, from 1 to 21 speakers. The audio includes various types of noise such as background music, laughter, etc. It also contains a significant proportion of overlapping speech, ranging from 0% to 30.1% depending on the recording. While the dataset contains visual information as well as audio, as of June 2021 only the audio of the development set had been released, under a Creative Commons Attribution 4.0 International License for research purposes. The audio of the evaluation set was used as a blind test set.

  • LibriCSS The LibriCSS corpus [79] contains 10 h of multichannel recordings and was designed for research on speech separation, speech recognition, and speaker diarization. It was created by playing back audio from the LibriSpeech corpus [205] in a real meeting room and recording it with a 7-channel microphone array. It consists of 10 sessions, each of which is further decomposed into six 10-min mini-sessions. Each mini-session was made from the audio of eight speakers and designed to have a different overlap ratio, from 0% to 40%. To facilitate research, a baseline system for speech separation and ASR [79] and a baseline system that integrates speech separation, speaker diarization and ASR [80] have been developed and released.

  • DIHARD Challenge and its dataset The DIHARD evaluations [206, 77] focus on the performance gap of state-of-the-art diarization systems on challenging domains. The first DIHARD challenge, DIHARD 1, started with track 1 (oracle SAD) and track 2 (system SAD). The evaluation data was a collection of various corpora. It includes very challenging datasets, such as clinical interviews, web videos, and speech in the wild (e.g., recordings in restaurants), as well as relatively less challenging datasets, such as CTS and audiobooks, to diversify the domains. DIHARD 2 additionally included a multichannel speaker diarization task in track 3 (oracle SAD) and track 4 (system SAD) using the recordings from the CHiME-5 corpus [202]. In the latest DIHARD challenge, DIHARD 3, the CTS dataset was added, whereas the multichannel tracks 3 and 4 were excluded. The DIHARD challenge employs DER and JER as the evaluation metrics, without a “score collar” and with overlapped regions included.

  • Rich Transcription Evaluation Series The RT evaluation [13] is the pioneering evaluation series that initiated deeper investigation of speaker diarization in relation to ASR. The main purpose of this effort was to create ASR technologies that would produce transcriptions with descriptive metadata, such as who spoke when, which is where speaker diarization comes in. Thus, the main tasks in the evaluation were ASR and speaker diarization. The domains of interest were broadcast news, CTS, and meeting recordings with multiple participants. Throughout the period of 2002-2009, the RT evaluation series promoted and gauged advances in speaker diarization and ASR technology. The evaluations in this period are named the RT evaluations (RT-02, RT-03S, RT-03F, and RT-05F) and the RT Meeting Recognition evaluations (RT-06S, RT-07S and RT-09). These evaluations and their datasets include speaker diarization as a part of automatic metadata extraction (MDE).

  • Other datasets

    There are also several corpora that have been used for diarization research but are not covered in the list above. The Corpus of Spontaneous Japanese [189] contains about 12 h of two-speaker dialogue recorded using headset microphones. AISHELL-4 [207] is a relatively new Mandarin Chinese dataset containing 118 h of conferences with four to eight speakers. It was recorded with an 8-channel circular microphone array as well as headset microphones for each participant. The ESTER-1 [208] and ESTER-2 [209] evaluation campaign datasets are a set of French recordings designed for three task categories: Segmentation (S), Transcription (T) and Information Extraction (E). In the ESTER-1 and ESTER-2 evaluation campaigns, speaker diarization was evaluated as one of the core tasks, among other tasks including speaker tracking, sound event tracking, and transcription. The datasets for ESTER-1 and ESTER-2 include 100 h and 150 h of manually transcribed French radio broadcast news, respectively. ETAPE [210] is also a French speech processing evaluation dataset that contains 36 h of TV and radio shows with both prepared and spontaneous speech. Unlike the ESTER evaluation series, ETAPE targets cross-show speaker diarization.

6 Applications

    6.1 Meeting Transcription

    The goal of meeting transcription is to automatically generate speaker-attributed transcripts of real-life meetings based on their audio and, optionally, video recordings. Accurate meeting transcription is one of the processing steps in a pipeline for several tasks, such as summarization and topic extraction. Similarly, the same transcription system can be used in other domains such as healthcare [211].

    Although this task was introduced by NIST in the RT evaluation series back in 2003 [194, 201, 212], the initial systems had very poor performance, and consequently commercialization of the technology was not possible. However, recent advances in the areas of speech recognition [213, 214], far-field speech processing [215, 216, 217], speaker ID and diarization [218, 41, 76], have greatly improved the speaker-attributed transcription accuracy, enabling such commercialization. Bimodal processing combining cameras with microphone arrays has further improved the overall performance [219, 220].

    Depending on the application scenario, customer needs, and business scope, different constraints may be imposed on meeting transcription systems. For example, it is most often required to provide the resulting transcriptions with low latency, making diarization and recognition even more challenging. However, the architecture of the transcription system can substantially improve the overall performance, e.g., by using microphone arrays of known geometry as the input device. Also, in the case where the expected meeting attendees are known beforehand, the transcription system can further improve speaker attribution, all while providing the exact name of the speaker instead of randomly generated discrete speaker labels.

    Two different scenarios in this space are presented: first, a fixed-geometry microphone array combined with a fish-eye camera system; second, an ad-hoc geometry microphone array system without a camera. In both scenarios, a “non-binding” list of participants and their corresponding speaker profiles is considered to be known. In particular, the transcription system has access to the invitees’ names and profiles; however, the actual attendees may not accurately match those invited. As such, there is an option to include “unannounced” participants. In addition, some of the invitees may not have profiles. In both scenarios, there is a constraint of low-latency transcription, where initial results need to be shown with low latency, and the finalized results can be updated later in an offline mode. Some of the technical challenges to overcome are [221]:

    1. Although ASR on overlapping speech is one of the main challenges in meeting transcription, limited progress has been made over the years. Numerous multichannel speech separation methods have been proposed based on independent component analysis (ICA) or spatial clustering [222, 223, 224, 225, 226, 227], but their application to a meeting setup has had limited success. In addition, neural network-based separation methods such as permutation invariant training (PIT) [45] or deep clustering (DC) [44] cannot adequately address reverberation and background noise [228].

    2. Flexible framework: It is desirable that the transcription system is capable of processing all the available information, such as the multichannel audio and visual cues. The system needs to process a dynamically changing number of audio channels without loss of performance. As such, the architecture needs to be modular enough to encompass the different settings.

    3. The speaker-attributed ASR of natural meetings requires online or streaming ASR, audio pre-processing such as dereverberation, and accurate diarization and speaker identification. These multiple processing steps are usually optimized separately, and thus the overall pipeline is frequently inefficient.

    4. The use of multiple, unsynchronized audio streams, e.g., audio captured with mobile devices, adds complexity to the meeting setup and processing. In return, we gain potentially better spatial coverage since the devices are usually distributed around the room and near the speakers. As part of the application scenario, the meeting participants bring their personal devices, which can be repurposed to improve the overall quality of the meeting transcription. On the other hand, while there are several pioneering studies [229], it is unclear what the best strategies are for consolidating multiple asynchronous audio streams and to what extent they work for natural meetings in online and offline setups.

    Based on these considerations, an architecture for a meeting transcription system with asynchronous distant microphones was proposed in [184]. In this work, various fusion strategies were investigated: from early fusion (beamforming of the audio signals), to mid-fusion (combination of senones per channel), to late fusion (combination of the diarization and ASR results) [158]. The resulting system performance was benchmarked on real-world meeting recordings against fixed-geometry systems. As mentioned above, the requirement of speaker-attributed transcriptions with low latency was also adhered to. In addition to the end-to-end system analysis, [184] proposed the idea of “leave-one-out beamforming” in the asynchronous multi-microphone setup, enriching the “diversity” of the resulting signals, as proposed in [230]. Finally, it describes how an online, incremental version of recognizer output voting error reduction (ROVER) [154] can process both the ASR and diarization outputs, enhancing the overall speaker-attributed ASR performance.

    6.2 Conversational Interaction Analysis and Behavioral Modeling

    Speech and spoken language are central to conversational interactions. They carry crucial information about a speaker’s intent, emotions, identity, age, and other individual and interpersonal traits and state variables, including health state. Computational advances are increasingly allowing access to such rich information [231, 232]. For example, knowing how much, and how, a child speaks in an interaction carries critical information about the child’s developmental state, and offers clues to clinicians in diagnosing disorders such as autism [233]. Such analyses are made possible by capturing and processing the audio recordings of the interactions, which often involve two or more people. An important foundational step is the identification and association of the speech portions belonging to specific individuals involved in the conversation. The technologies providing these capabilities are SAD and speaker diarization. Speech portions segmented with speaker-specific information provided by speaker diarization, even without any explicit lexical transcription, can offer important information to domain experts, who can take advantage of speaker diarization results for quantitative turn-taking analysis.

    A domain that is most relevant to such analyses of spoken conversational interactions is behavioral signal processing (BSP) [234, 231], which refers to the technology and algorithms for modeling and understanding human communicative, affective, and social behaviors. For example, these may include analyzing how positive or negative a person is, how empathic an individual is toward another, what the behavior patterns reveal about the relationship status, and the health condition of an individual [232]. BSP involves addressing all the complexities of spontaneous conversational interactions, with additional challenges involved in handling and understanding the emotional, social, and interpersonal behavioral dynamics revealed through the verbal and nonverbal cues of the interaction participants. Therefore, knowledge of speaker-specific vocal information plays a significant role in BSP, requiring highly accurate speaker diarization. For example, a speaker diarization module is employed as a pre-processing step for analyzing psychotherapy mechanisms and quality [235] and for suicide risk assessment [236].

    Another popular application of speaker diarization for conversational interaction analysis is medical doctor-patient interactions. In the system described in [237], the nature of a patient’s memory problems is detected from conversations between neurologists and patients. The speech and language features extracted from the ASR transcripts, combined with the speaker diarization results, are used to predict the type of disorder. An automated assistant system for medical-domain transcription is proposed in [238], which includes a speaker diarization module, an ASR module, and a natural language generation module. The automated assistant accepts an audio clip and outputs grammatically correct sentences describing the topic of the conversation, the subject, and the subject’s symptoms.

    6.3 Audio Indexing

    Content-based audio indexing is a well-known application domain for speaker diarization. It can provide metainformation, such as the content or data type of given audio data, to make information retrieval efficient, since machine-driven search queries rely on such metadata. The more diverse the available information, the better the efficiency that can be achieved in retrieving audio content from a database.

    One useful piece of information for audio indexing would be ASR transcripts, which capture the content of the speech portions in the audio data. Speaker diarization can augment those transcripts in terms of “who spoke when”, which was the main purpose of the RT evaluation series [13], as discussed in Sections 4.1 and 5.3. The utterances aggregated per speaker by a speaker diarization system also enable per-speaker summaries or keyword lists, which can serve as additional query values to retrieve relevant content from the database. In [239], we can get a view of how speaker diarization outputs can be linked to information search in consumer-facing applications.

    6.4 Conversational AI

    Thanks to advances in ASR technology, the applications of ASR have evolved from simple voice command recognition systems to conversational AI systems. The fundamental idea of conversational AI is to build a machine that humans can talk to and interact with, which calls for capabilities that simple voice command recognition systems lack. In this sense, focusing on a speaker of interest in a multiparty setting is one of the most important features of conversational AI, and speaker diarization therefore becomes an essential component. For example, a conversational AI system in a car can pay attention to the specific speaker who is requesting information from the navigation system by applying speaker diarization along with ASR.

    Smart speakers and voice assistants are the most popular products in which speaker diarization plays a significant role for conversational AI. Since response time and online processing are crucial factors in real-life settings, the demand for end-to-end speaker diarization systems integrated into the ASR pipeline is growing. The performance of incremental (online) ASR and speaker diarization of commercial ASR services is evaluated and compared in [240]. The real-time, low-latency aspect of speaker diarization is expected to be emphasized even more in future systems, since the performance of online diarization and online ASR still has much room for improvement.

    7 Challenges and the Future of Speaker Diarization

    This paper has provided a comprehensive overview of speaker diarization techniques, highlighting the recent development of deep learning-based diarization approaches. In the early days, a speaker diarization system was developed as a pipeline of sub-modules, including front-end processing, SAD, segmentation, speaker embedding extraction, clustering, and post-processing, leading to a standalone system without much connection to other components of a given speech application. With the emergence of deep learning technology, more and more advancements have been made for speaker diarization, from methods that replace a single module with a deep-learning-based counterpart to fully end-to-end neural diarization. Furthermore, as speech recognition technology has become more accessible, a trend to tightly integrate speaker diarization into ASR systems has emerged, for example by using the ASR output to improve the accuracy of speaker diarization. Recently, joint modeling of speaker diarization and speech recognition has been investigated in an attempt to enhance the overall performance of both. Thanks to these achievements, speaker diarization systems have already been used in many applications, including meeting transcription, conversational interaction analysis, audio indexing, and conversational AI systems.

    As we have seen, tremendous progress has been made in speaker diarization systems. Nevertheless, there is still much room for improvement. As a final remark, we conclude this paper by listing the remaining challenges for speaker diarization toward future research and development.

    Online processing of speaker diarization

    Most speaker diarization methods assume that the entire recording can be observed before diarization is executed. However, numerous applications, such as meeting transcription systems or smart agents, require very short latency for assigning speaker labels. While there have been several attempts to build online speaker diarization systems, both clustering-based (e.g., [218]) and neural network-based (e.g., [41, 172, 173]), it still remains a challenging problem.
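    To make the latency constraint concrete, the following minimal sketch shows one generic form of online clustering-based diarization: each incoming segment embedding is greedily attached to the closest existing speaker centroid or opens a new speaker. The cosine-similarity threshold and running-mean update are illustrative assumptions rather than a published recipe:

```python
import numpy as np

class OnlineDiarizer:
    """Greedy online clustering: attach each incoming segment embedding to
    the closest existing speaker centroid, or open a new speaker when no
    centroid is similar enough. Threshold and update rule are illustrative
    assumptions, not a published recipe."""

    def __init__(self, threshold=0.6):
        self.threshold = threshold
        self.centroids = []   # running mean embedding per speaker
        self.counts = []      # number of segments assigned per speaker

    def assign(self, emb):
        emb = np.asarray(emb, dtype=float)
        emb = emb / (np.linalg.norm(emb) + 1e-8)
        if self.centroids:
            sims = [float(np.dot(emb, c) / (np.linalg.norm(c) + 1e-8))
                    for c in self.centroids]
            best = int(np.argmax(sims))
            if sims[best] >= self.threshold:
                n = self.counts[best]
                # update running mean of the matched speaker's centroid
                self.centroids[best] = (self.centroids[best] * n + emb) / (n + 1)
                self.counts[best] += 1
                return best            # existing speaker index
        self.centroids.append(emb)     # open a new speaker
        self.counts.append(1)
        return len(self.centroids) - 1
```

    Because each decision is made immediately and never revisited, early mistakes propagate through the rest of the recording, which is one reason online diarization still lags behind its offline counterpart.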

    Domain mismatch

    A model trained on data from a specific domain often works poorly on data from another domain. For example, it is experimentally known that the EEND model tends to overfit to the distribution of speaker overlaps in the training data [42]. Such a domain mismatch issue is universal for any training-based method. Given the growing interest in trainable speaker diarization systems, it will become increasingly important to evaluate their ability to handle a wide variety of inputs. International evaluation efforts for speaker diarization, such as the DIHARD challenges [206, 77, 241] and VoxSRC [203, 95], are also of great importance in this direction.

    Speaker overlap

    Overlapping speech from multiple talkers is an inevitable aspect of conversation. For example, an average of 12% to 15% speaker overlap was observed in meeting recordings [242, 92], and the rate can be even higher in daily conversations [243, 202, 81]. Nevertheless, many traditional speaker diarization systems, especially clustering-based ones, have focused only on non-overlapping regions, and overlapping regions have even been excluded from the evaluation metric [244]. While the topic has been studied for many years (e.g., the early works [245, 246]), there is growing interest in handling speaker overlap for better speaker diarization, including the application of speech separation [94], post-processing [247, 168], and joint modeling of speech separation and speaker diarization [64, 175].
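    As a simple illustration of how such overlap statistics can be quantified, the following sketch estimates the fraction of speech time during which two or more speakers are simultaneously active, given reference (start, end, speaker) segments; the frame-based approximation and segment format are assumptions for illustration:

```python
def overlap_ratio(segments, step=0.01):
    """Estimate the fraction of total speech time in which two or more
    speakers are active, from (start, end, speaker) reference segments.
    Uses a simple frame-based approximation with frame step in seconds."""
    t_end = max(end for _, end, _ in segments)
    n_frames = int(round(t_end / step))
    speech, overlap = 0, 0
    for i in range(n_frames):
        t = i * step
        active = sum(1 for s, e, _ in segments if s <= t < e)
        if active >= 1:
            speech += 1
        if active >= 2:
            overlap += 1
    return overlap / max(speech, 1)

# Two speakers overlapping between 3.0 s and 4.0 s out of 6.0 s of speech
print(overlap_ratio([(0.0, 4.0, "A"), (3.0, 6.0, "B")]))  # ~0.167
```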

    Integration with ASR

    Many applications require ASR results along with speaker diarization results. In the modular combination of speaker diarization and ASR, some systems place speaker diarization before ASR [83], while others place it after ASR [221, 193]. Both types of systems have shown strong performance on specific tasks, and determining the best system architecture for combined speaker diarization and ASR is still an open problem [80]. Furthermore, there is another line of research that jointly performs speaker diarization and ASR [65, 66, 67, 175], which was introduced in Section 4. The joint modeling approach could leverage the inter-dependency between speaker diarization and ASR to perform both tasks better. However, it has not yet been fully investigated whether such joint frameworks outperform well-tuned modular systems. Overall, the integration of speaker diarization and ASR remains one of the most actively investigated topics.
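    As a minimal sketch of one modular combination (assuming word-level time stamps from ASR and hypothetical input formats), the following code assigns each recognized word to the diarization label with the largest temporal overlap, yielding a speaker-attributed transcript:

```python
def attribute_words(asr_words, diar_segments):
    """Assign each ASR word (word, start, end) to the diarization label
    (start, end, speaker) with the largest temporal overlap. Input formats
    are assumptions for illustration, not a specific system's interface."""
    transcript = []
    for word, w_start, w_end in asr_words:
        best_spk, best_ov = "unk", 0.0
        for s_start, s_end, spk in diar_segments:
            ov = min(w_end, s_end) - max(w_start, s_start)
            if ov > best_ov:
                best_spk, best_ov = spk, ov
        transcript.append((best_spk, word))
    return transcript

diar = [(0.0, 2.0, "spk1"), (2.0, 5.0, "spk2")]
asr = [("hello", 0.2, 0.6), ("there", 0.7, 1.1), ("thanks", 2.3, 2.8)]
print(attribute_words(asr, diar))
# [('spk1', 'hello'), ('spk1', 'there'), ('spk2', 'thanks')]
```

    Joint modeling approaches, in contrast, aim to avoid this hard post-hoc assignment by estimating speaker labels and word hypotheses together.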

    Audiovisual modeling

    Visual information provides strong cues for identifying speakers. For example, video captured by a fish-eye camera was used to improve the accuracy of speaker diarization in a meeting transcription task [221]. Visual information was also used to significantly improve the accuracy of speaker diarization on YouTube videos [204]. While these studies showed the effectiveness of visual information, audiovisual speaker diarization has so far been investigated far less than audio-only speaker diarization, leaving considerable room for improvement.

    References

    • Tranter et al. [2003] S. E. Tranter, K. Yu, D. A. Reynolds, G. Evermann, D. Y. Kim, P. C. Woodland, An investigation into the interactions between speaker diarisation systems and automatic speech transcription, CUED/F-INFENG/TR-464 (2003).
    • Tranter and Reynolds [2006] S. E. Tranter, D. A. Reynolds, An overview of automatic speaker diarization systems, IEEE Transactions on Audio, Speech, and Language Processing 14 (2006) 1557–1565.
    • Anguera et al. [2012] X. Anguera, S. Bozonnet, N. Evans, C. Fredouille, G. Friedland, O. Vinyals, Speaker diarization: A review of recent research, IEEE Transactions on Audio, Speech, and Language Processing 20 (2012) 356–370.
    • Gish et al. [1991] H. Gish, M.-H. Siu, J. R. Rohlicek, Segregation of speakers for speech recognition and speaker identification, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 1991, pp. 873–876.
    • Siu et al. [1992] M.-H. Siu, Y. George, H. Gish, An unsupervised, sequential learning algorithm for segmentation of speech waveforms with multiple speakers, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 1992, pp. 189–192.
    • Rohlicek et al. [1992] J. R. Rohlicek, D. Ayuso, M. Bates, R. Bobrow, A. Boulanger, H. Gish, P. Jeanrenaud, M. Meteer, M. Siu, Gisting conversational speech, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 1992, pp. 113–116.
    • Jain et al. [1996] U. Jain, M. A. Siegler, S.-J. Doh, E. Gouvea, J. Huerta, P. J. Moreno, B. Raj, R. M. Stern, Recognition of continuous broadcast news with multiple unknown speakers and environments, in: Proceedings of ARPA Spoken Language Technology Workshop, 1996, pp. 61–66.
    • Padmanabhan et al. [1996] M. Padmanabhan, L. R. Bahl, D. Nahamoo, M. A. Picheny, Speaker clustering and transformation for speaker adaptation in large-vocabulary speech recognition systems, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 1996, pp. 701–704.
    • Gauvain et al. [1998] J.-L. Gauvain, L. Lamel, G. Adda, Partitioning and transcription of broadcast news data, in: Proceedings of the International Conference on Spoken Language Processing, 1998, pp. 1335–1338.
    • Liu and Kubala [1999] D. Liu, F. Kubala, Fast speaker change detection for broadcast news transcription and indexing, in: Proceedings of the International Conference on Spoken Language Processing, 1999, pp. 1031–1034.
    • Chen and Gopalakrishnan [1998] S. S. Chen, P. S. Gopalakrishnan, Speaker, environment and channel change detection and clustering via the Bayesian Information Criterion, in: Tech. Rep., IBM T. J. Watson Research Center, 1998, pp. 127–132.
    • AMI [????] AMI, AMI Consortium. http://www.amiproject.org/index.html.
    • NIST [????] NIST, Rich Transcription Evaluation. https://www.nist.gov/itl/iad/mig/rich-transcription-evaluation.
    • Ajmera and Wooters [2003] J. Ajmera, C. Wooters, A robust speaker clustering algorithm, in: Proceedings of IEEE Workshop on Automatic Speech Recognition and Understanding, 2003, pp. 411–416.
    • Tranter and Reynolds [2004] S. E. Tranter, D. A. Reynolds, Speaker diarisation for broadcast news, in: Odyssey, 2004, pp. 337–344.
    • Reynolds and Torres-Carrasquillo [2005] D. A. Reynolds, P. Torres-Carrasquillo, Approaches and applications of audio diarization, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 2005, pp. 953–956.
    • Zhu et al. [2005] X. Zhu, C. Barras, S. Meignier, J.-L. Gauvain, Combining speaker identification and BIC for speaker diarization, in: Proceedings of the Annual Conference of the International Speech Communication Association, 2005, pp. 2441–2444.
    • Meignier et al. [2006] S. Meignier, D. Moraru, C. Fredouille, J.-F. Bonastre, L. Besacier, Step-by-step and integrated approaches in broadcast news speaker diarization, Computer, Speech & Language 20 (2006) 303–330.
    • Rosenberg et al. [2002] A. E. Rosenberg, A. Gorin, Z. Liu, P. Parthasarathy, Unsupervised speaker segmentation of telephone conversations, in: Proceedings of the International Conference on Spoken Language Processing, 2002, pp. 565–568.
    • Liu and Kubala [2003] D. Liu, F. Kubala, A cross-channel modeling approach for automatic segmentation of conversational telephone speech, in: Proceedings of IEEE Workshop on Automatic Speech Recognition and Understanding, 2003, pp. 333–338.
    • Tranter et al. [2004] S. E. Tranter, K. Yu, G. Evermann, P. C. Woodland, Generating and evaluating segmentations for automatic speech recognition of conversational telephone speech, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 2004, pp. 753–756.
    • Kenny et al. [2010] P. Kenny, D. Reynolds, F. Castaldo, Diarization of telephone conversations using factor analysis, IEEE Journal of Selected Topics in Signal Processing 4 (2010) 1059–1070.
    • Ajmera et al. [2004] J. Ajmera, G. Lathoud, L. McCowan, Clustering and segmenting speakers and their locations in meetings, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 2004, pp. 605–608.
    • Jin et al. [2004] Q. Jin, K. Laskowski, T. Schultz, A. Waibel, Speaker segmentation and clustering in meetings, in: Proceedings of the International Conference on Spoken Language Processing, 2004, pp. 597–600.
    • Anguera et al. [2006] X. Anguera, C. Wooters, J. Hernando, Purity algorithms for speaker diarization of meetings data, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, volume I, 2006, pp. 1025–1028.
    • Leeuwen and Konecny [2007] D. A. V. Leeuwen, M. Konecny, Progress in the AMIDA speaker diarization system for meeting data, in: Proceedings of International Evaluation Workshops CLEAR 2007 and RT 2007, 2007, pp. 475–483.
    • Vijayasenan et al. [2009] D. Vijayasenan, F. Valente, H. Bourlard, An information theoretic approach to speaker diarization of meeting data, IEEE Transactions on Audio, Speech, and Language Processing 17 (2009) 1382–1393.
    • Anguera et al. [2007] X. Anguera, C. Wooters, J. Hernando, Acoustic beamforming for speaker diarization of meetings, IEEE Transactions on Audio, Speech, and Language Processing 15 (2007) 2011–2023.
    • Valente et al. [2010] F. Valente, P. Motlicek, D. Vijayasenan, Variational Bayesian speaker diarization of meeting recordings, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 2010, pp. 4954–4957.
    • Dehak et al. [2011] N. Dehak, P. Kenny, R. Dehak, P. Dumouchel, P. Ouellet, Front-end factor analysis for speaker verification, IEEE Transactions on Audio, Speech, and Language Processing 19 (2011).
    • Castaldo et al. [2008] F. Castaldo, D. Colibro, E. Dalmasso, P. Laface, C. Vair, Stream-based speaker segmentation using speaker factors and eigenvoices, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 2008, pp. 4133–4136.
    • Shum et al. [2011] S. Shum, N. Dehak, E. Chuangsuwanich, D. Reynolds, J. Glass, Exploiting intra-conversation variability for speaker diarization, in: Proceedings of the Annual Conference of the International Speech Communication Association, 2011.
    • Shum et al. [2012] S. Shum, N. Dehak, J. Glass, On the use of spectral and iterative methods for speaker diarization, in: Proceedings of the Annual Conference of the International Speech Communication Association, 2012, pp. 482–485.
    • Shum et al. [2013] S. H. Shum, N. Dehak, R. Dehak, J. R. Glass, Unsupervised methods for speaker diarization: An integrated and iterative approach, IEEE Transactions on Audio, Speech, and Language Processing 21 (May 2013) 2015–2028.
    • Senoussaoui et al. [2013] M. Senoussaoui, P. Kenny, T. Stafylakis, P. Dumouchel, A study of the cosine distance-based mean shift for telephone speech diarization, IEEE/ACM Transactions on Audio, Speech, and Language Processing 22 (2013) 217–227.
    • Sell and Garcia-Romero [2014] G. Sell, D. Garcia-Romero, Speaker diarization with plda i-vector scoring and unsupervised calibration, in: Proceedings of IEEE Spoken Language Technology Workshop, IEEE, 2014, pp. 413–417.
    • Variani et al. [2014] E. Variani, X. Lei, E. McDermott, I. L. Moreno, J. G-Dominguez, Deep neural networks for small footprint text-dependent speaker verification, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 2014, pp. 4052–4056.
    • Heigold et al. [2016] G. Heigold, I. Moreno, S. Bengio, N. Shazeer, End-to-end text-dependent speaker verification, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 2016, pp. 5115–5119.
    • Wang et al. [2018] Q. Wang, C. Downey, L. Wan, P. A. Mansfield, I. L. Moreno, Speaker diarization with LSTM, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 2018, pp. 5239–5243.
    • Snyder et al. [2018] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, S. Khudanpur, X-vectors: Robust DNN embeddings for speaker recognition, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 2018, pp. 5329–5333.
    • Zhang et al. [2019] A. Zhang, Q. Wang, Z. Zhu, J. Paisley, C. Wang, Fully supervised speaker diarization, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 2019, pp. 6301–6305.
    • Fujita et al. [2019a] Y. Fujita, N. Kanda, S. Horiguchi, K. Nagamatsu, S. Watanabe, End-to-end neural speaker diarization with permutation-free objectives, in: Proceedings of the Annual Conference of the International Speech Communication Association, 2019a, pp. 4300–4304.
    • Fujita et al. [2019b] Y. Fujita, N. Kanda, S. Horiguchi, Y. Xue, K. Nagamatsu, S. Watanabe, End-to-end neural speaker diarization with self-attention, in: Proceedings of IEEE Workshop on Automatic Speech Recognition and Understanding, IEEE, 2019b, pp. 296–303.
    • Hershey et al. [2016] J. R. Hershey, Z. Chen, J. Le Roux, S. Watanabe, Deep clustering: Discriminative embeddings for segmentation and separation, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, 2016, pp. 31–35.
    • Kolbæk et al. [2017] M. Kolbæk, D. Yu, Z.-H. Tan, J. Jensen, Multitalker speech separation with utterance-level permutation invariant training of deep recurrent neural networks, IEEE/ACM Transactions on Audio, Speech, and Language Processing 25 (2017) 1901–1913.
    • Luo and Mesgarani [2019] Y. Luo, N. Mesgarani, Conv-tasnet: Surpassing ideal time–frequency magnitude masking for speech separation, IEEE/ACM Transactions on Audio, Speech, and Language Processing 27 (2019) 1256–1266.
    • Variani et al. [2014] E. Variani, X. Lei, E. McDermott, I. L. Moreno, J. Gonzalez-Dominguez, Deep neural networks for small footprint text-dependent speaker verification, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, 2014, pp. 4052–4056.
    • Snyder et al. [2017] D. Snyder, D. Garcia-Romero, D. Povey, S. Khudanpur, Deep neural network embeddings for text-independent speaker verification., in: Proceedings of the Annual Conference of the International Speech Communication Association, 2017, pp. 999–1003.
    • Drugman et al. [2015] T. Drugman, Y. Stylianou, Y. Kida, M. Akamine, Voice activity detection: Merging source and filter-based information, IEEE Signal Processing Letters 23 (2015) 252–256.
    • Wang et al. [2020] J. Wang, X. Xiao, J. Wu, R. Ramamurthy, F. Rudzicz, M. Brudno, Speaker diarization with session-level speaker embedding refinement using graph neural networks, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, 2020, pp. 7109–7113.
    • Dimitriadis [2019] D. Dimitriadis, Enhancements for Audio-only Diarization Systems, arXiv preprint arXiv:1909.00082 (2019).
    • Medennikov et al. [2020] I. Medennikov, M. Korenevsky, T. Prisyach, Y. Khokhlov, M. Korenevskaya, I. Sorokin, T. Timofeeva, A. Mitrofanov, A. Andrusenko, I. Podluzhny, A. Laptev, A. Romanenko, Target-speaker voice activity detection: a novel approach for multi-speaker diarization in a dinner party scenario, in: Proceedings of the Annual Conference of the International Speech Communication Association, 2020, pp. 274–278.
    • Diez et al. [2018] M. Diez, L. Burget, P. Matejka, Speaker diarization based on bayesian hmm with eigenvoice priors., in: Odyssey, 2018, pp. 147–154.
    • Diez et al. [2019] M. Diez, L. Burget, S. Wang, J. Rohdin, J. Cernockỳ, Bayesian HMM based x-vector clustering for speaker diarization., in: Proceedings of the Annual Conference of the International Speech Communication Association, 2019, pp. 346–350.
    • Yu et al. [2017] D. Yu, X. Chang, Y. Qian, Recognizing multi-talker speech with permutation invariant training, in: Proceedings of the Annual Conference of the International Speech Communication Association, 2017, pp. 2456–2460.
    • Seki et al. [2018] H. Seki, T. Hori, S. Watanabe, J. Le Roux, J. R. Hershey, A purely end-to-end system for multi-speaker speech recognition, in: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, volume 1, 2018, pp. 2620–2630.
    • Chang et al. [2019] X. Chang, Y. Qian, K. Yu, S. Watanabe, End-to-end monaural multi-speaker ASR system without pretraining, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 2019, pp. 6256–6260.
    • Kanda et al. [2019a] N. Kanda, Y. Fujita, S. Horiguchi, R. Ikeshita, K. Nagamatsu, S. Watanabe, Acoustic modeling for distant multi-talker speech recognition with single-and multi-channel branches, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 2019a, pp. 6630–6634.
    • Kanda et al. [2019b] N. Kanda, S. Horiguchi, R. Takashima, Y. Fujita, K. Nagamatsu, S. Watanabe, Auxiliary interference speaker loss for target-speaker speech recognition, in: Proceedings of the Annual Conference of the International Speech Communication Association, 2019b, pp. 236–240.
    • Wang et al. [2021] X. Wang, N. Kanda, Y. Gaur, Z. Chen, Z. Meng, T. Yoshioka, Exploring end-to-end multi-channel asr with bias information for meeting transcription, in: Proceedings of IEEE Workshop on Automatic Speech Recognition and Understanding, 2021.
    • Wang et al. [2019] P. Wang, Z. Chen, X. Xiao, Z. Meng, T. Yoshioka, T. Zhou, L. Lu, J. Li, Speech separation using speaker inventory, in: Proceedings of IEEE Workshop on Automatic Speech Recognition and Understanding, 2019, pp. 230–236.
    • Han et al. [2020] C. Han, Y. Luo, C. Li, T. Zhou, K. Kinoshita, S. Watanabe, M. Delcroix, H. Erdogan, J. R. Hershey, N. Mesgarani, et al., Continuous speech separation using speaker inventory for long multi-talker recording, arXiv preprint arXiv:2012.09727 (2020).
    • Huang et al. [2020] Z. Huang, S. Watanabe, Y. Fujita, P. García, Y. Shao, D. Povey, S. Khudanpur, Speaker diarization with region proposal network, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, 2020, pp. 6514–6518.
    • von Neumann et al. [2019] T. von Neumann, K. Kinoshita, M. Delcroix, S. Araki, T. Nakatani, R. Haeb-Umbach, All-neural online source separation, counting, and diarization for meeting analysis, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, 2019, pp. 91–95.
    • Shafey et al. [2019] L. E. Shafey, H. Soltau, I. Shafran, Joint Speech Recognition and Speaker Diarization via Sequence Transduction, in: Proceedings of the Annual Conference of the International Speech Communication Association, ISCA, 2019, pp. 396–400.
    • Mao et al. [2020] H. H. Mao, S. Li, J. McAuley, G. Cottrell, Speech recognition and multi-speaker diarization of long conversations, in: Proceedings of the Annual Conference of the International Speech Communication Association, 2020, pp. 691–695.
    • Kanda et al. [2019] N. Kanda, S. Horiguchi, Y. Fujita, Y. Xue, K. Nagamatsu, S. Watanabe, Simultaneous speech recognition and speaker diarization for monaural dialogue recordings with target-speaker acoustic models, in: Proceedings of IEEE Workshop on Automatic Speech Recognition and Understanding, 2019, pp. 31–38.
    • Kanda et al. [2021] N. Kanda, X. Chang, Y. Gaur, X. Wang, Z. Meng, Z. Chen, T. Yoshioka, Investigation of end-to-end speaker-attributed ASR for continuous multi-talker recordings, in: Proceedings of IEEE Spoken Language Technology Workshop, 2021.
    • Fiscus et al. [2006] J. G. Fiscus, J. Ajot, M. Michel, J. S. Garofolo, The rich transcription 2006 spring meeting recognition evaluation, in: Proceedings of International Workshop on Machine Learning and Multimodal Interaction, NIST, 2006, pp. 309–322.
    • Kuhn [1955] H. W. Kuhn, The hungarian method for the assignment problem, Naval research logistics quarterly 2 (1955) 83–97.
    • Silovsky et al. [2012] J. Silovsky, J. Zdansky, J. Nouza, P. Cerva, J. Prazak, Incorporation of the asr output in speaker segmentation and clustering within the task of speaker diarization of broadcast streams, in: International Workshop on Multimedia Signal Processing, IEEE, 2012, pp. 118–123.
    • Park and Georgiou [2018] T. J. Park, P. Georgiou, Multimodal speaker segmentation and diarization using lexical and acoustic cues via sequence to sequence neural networks, in: Proceedings of the Annual Conference of the International Speech Communication Association, 2018, pp. 1373–1377.
    • Haeb-Umbach et al. [2019] R. Haeb-Umbach, S. Watanabe, T. Nakatani, M. Bacchiani, B. Hoffmeister, M. L. Seltzer, H. Zen, M. Souden, Speech processing for digital home assistants: Combining signal processing with deep-learning techniques, IEEE Signal Processing Magazine 36 (2019) 111–124.
    • Vincent et al. [2018] E. Vincent, T. Virtanen, S. Gannot, Audio source separation and speech enhancement, John Wiley & Sons, 2018.
    • Wang and Chen [2018] D. Wang, J. Chen, Supervised speech separation based on deep learning: An overview, IEEE/ACM Transactions on Audio, Speech, and Language Processing 26 (2018) 1702–1726.
    • Sell et al. [2018] G. Sell, D. Snyder, A. McCree, D. Garcia-Romero, J. Villalba, M. Maciejewski, V. Manohar, N. Dehak, D. Povey, S. Watanabe, et al., Diarization is hard: Some experiences and lessons learned for the JHU team in the inaugural DIHARD challenge., in: Proceedings of the Annual Conference of the International Speech Communication Association, 2018, pp. 2808–2812.
    • Ryant et al. [2019] N. Ryant, K. Church, C. Cieri, A. Cristia, J. Du, S. Ganapathy, M. Liberman, The second DIHARD diarization challenge: Dataset, task, and baselines, in: Proceedings of the Annual Conference of the International Speech Communication Association, 2019, pp. 978–982.
    • Diez et al. [2018] M. Diez, F. Landini, L. Burget, J. Rohdin, A. Silnova, K. Zmolíková, O. Novotnỳ, K. Veselỳ, O. Glembek, O. Plchot, et al., BUT system for DIHARD speech diarization challenge 2018., in: Proceedings of the Annual Conference of the International Speech Communication Association, 2018, pp. 2798–2802.
    • Chen et al. [2020] Z. Chen, T. Yoshioka, L. Lu, T. Zhou, Z. Meng, Y. Luo, J. Wu, X. Xiao, J. Li, Continuous speech separation: Dataset and analysis, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, 2020, pp. 7284–7288.
    • Raj et al. [2021] D. Raj, P. Denisov, Z. Chen, H. Erdogan, Z. Huang, M. He, S. Watanabe, J. Du, T. Yoshioka, Y. Luo, N. Kanda, J. Li, S. Wisdom, J. R. Hershey, Integration of speech separation, diarization, and recognition for multi-speaker meetings: System description, comparison, and analysis, in: Proceedings of IEEE Spoken Language Technology Workshop, 2021.
    • Watanabe et al. [2020] S. Watanabe, M. Mandel, J. Barker, E. Vincent, A. Arora, X. Chang, S. Khudanpur, V. Manohar, D. Povey, D. Raj, et al., CHiME-6 challenge: Tackling multispeaker speech recognition for unsegmented recordings, in: 6th International Workshop on Speech Processing in Everyday Environments (CHiME 2020), 2020.
    • Arora et al. [2020] A. Arora, D. Raj, A. S. Subramanian, K. Li, B. Ben-Yair, M. Maciejewski, P. Żelasko, P. Garcia, S. Watanabe, S. Khudanpur, The JHU multi-microphone multi-speaker asr system for the CHiME-6 challenge, arXiv preprint arXiv:2006.07898 (2020).
    • Medennikov et al. [2020] I. Medennikov, M. Korenevsky, T. Prisyach, Y. Khokhlov, M. Korenevskaya, I. Sorokin, T. Timofeeva, A. Mitrofanov, A. Andrusenko, I. Podluzhny, et al., The STC system for the CHiME-6 challenge, in: CHiME 2020 Workshop on Speech Processing in Everyday Environments, 2020.
    • Gao et al. [2018] T. Gao, J. Du, L.-R. Dai, C.-H. Lee, Densely connected progressive learning for lstm-based speech enhancement, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, 2018, pp. 5054–5058.
    • Erdogan et al. [2015] H. Erdogan, J. R. Hershey, S. Watanabe, J. Le Roux, Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, 2015, pp. 708–712.
    • Loizou [2013] P. C. Loizou, Speech enhancement: theory and practice, CRC press, 2013.
    • Heymann et al. [2016] J. Heymann, L. Drude, R. Haeb-Umbach, Neural network based spectral mask estimation for acoustic beamforming, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, 2016, pp. 196–200.
    • Erdogan et al. [2016] H. Erdogan, J. R. Hershey, S. Watanabe, M. I. Mandel, J. Le Roux, Improved MVDR beamforming using single-channel mask prediction networks, in: Proceedings of the Annual Conference of the International Speech Communication Association, 2016, pp. 1981–1985.
    • Nakatani et al. [2010] T. Nakatani, T. Yoshioka, K. Kinoshita, M. Miyoshi, B.-H. Juang, Speech dereverberation based on variance-normalized delayed linear prediction, IEEE Transactions on Audio, Speech, and Language Processing 18 (2010) 1717–1731.
    • Yoshioka and Nakatani [2012] T. Yoshioka, T. Nakatani, Generalization of multi-channel linear prediction methods for blind mimo impulse response shortening, IEEE Transactions on Audio, Speech, and Language Processing 20 (2012) 2707–2720.
    • Drude et al. [2018] L. Drude, J. Heymann, C. Boeddeker, R. Haeb-Umbach, NARA-WPE: A python package for weighted prediction error dereverberation in numpy and tensorflow for online and offline processing, in: Speech Communication; 13th ITG-Symposium, VDE, 2018, pp. 1–5.
    • Yoshioka et al. [2018] T. Yoshioka, H. Erdogan, Z. Chen, X. Xiao, F. Alleva, Recognizing overlapped speech in meetings: A multichannel separation approach using neural networks, in: Proceedings of the Annual Conference of the International Speech Communication Association, 2018, pp. 3038–3042.
    • Boeddecker et al. [2018] C. Boeddecker, J. Heitkaemper, J. Schmalenstroeer, L. Drude, J. Heymann, R. Haeb-Umbach, Front-end processing for the CHiME-5 dinner party scenario, in: Proceedings of CHiME 2018 Workshop on Speech Processing in Everyday Environments, 2018, pp. 35–40.
    • Xiao et al. [2020] X. Xiao, N. Kanda, Z. Chen, T. Zhou, T. Yoshioka, Y. Zhao, G. Liu, J. Wu, J. Li, Y. Gong, Microsoft speaker diarization system for the voxceleb speaker recognition challenge 2020, arXiv preprint arXiv:2010.11458 (2020).
    • Nagrani et al. [2020] A. Nagrani, J. S. Chung, J. Huh, A. Brown, E. Coto, W. Xie, M. McLaren, D. A. Reynolds, A. Zisserman, VoxSRC 2020: The second VoxCeleb speaker recognition challenge, arXiv preprint arXiv:2012.06867 (2020).
    • ITU-T [1996] ITU-T, A silence compression scheme for G.729 optimized for terminals conforming to Recommendation V.70, ITU-T Recommendation G.729 (1996).
    • Chengalvarayan [1999] R. Chengalvarayan, Robust energy normalization using speech/nonspeech discriminator for german connected digit recognition, in: Sixth European Conference on Speech Communication and Technology, 1999.
    • Woo et al. [2000] K.-H. Woo, T.-Y. Yang, K.-J. Park, C. Lee, Robust voice activity detection algorithm for estimating noise spectrum, Electronics Letters 36 (2000) 180–181.
    • Nemer et al. [2001] E. Nemer, R. Goubran, S. Mahmoud, Robust voice activity detection using higher-order statistics in the lpc residual domain, IEEE Transactions on Speech and Audio Processing 9 (2001) 217–231.
    • Sohn et al. [1999] J. Sohn, N. S. Kim, W. Sung, A statistical model-based voice activity detection, IEEE signal processing letters 6 (1999) 1–3.
    • Ng et al. [2012] T. Ng, B. Zhang, L. Nguyen, S. Matsoukas, X. Zhou, N. Mesgarani, K. Veselỳ, P. Matějka, Developing a speech activity detection system for the darpa rats program, in: Proceedings of the Annual Conference of the International Speech Communication Association, 2012, pp. 1969–1972.
    • Pfau et al. [2001] T. Pfau, D. P. Ellis, A. Stolcke, Multispeaker speech activity detection for the icsi meeting recorder, in: IEEE Workshop on Automatic Speech Recognition and Understanding, 2001. ASRU’01., IEEE, 2001, pp. 107–110.
    • Sarikaya and Hansen [1998] R. Sarikaya, J. H. Hansen, Robust detection of speech activity in the presence of noise, in: Proceedings of the International Conference on Spoken Language Processing, volume 4, Citeseer, 1998, pp. 1455–8.
    • Ryant et al. [2013] N. Ryant, M. Liberman, J. Yuan, Speech activity detection on youtube using deep neural networks., in: Proceedings of the Annual Conference of the International Speech Communication Association, 2013, pp. 728–731.
    • Thomas et al. [2014] S. Thomas, S. Ganapathy, G. Saon, H. Soltau, Analyzing convolutional neural networks for speech activity detection in mismatched acoustic conditions, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 2014, pp. 2519–2523.
    • Gelly and Gauvain [2017] G. Gelly, J.-L. Gauvain, Optimization of rnn-based speech activity detection, IEEE/ACM Transactions on Audio, Speech, and Language Processing 26 (2017) 646–656.
    • Haws et al. [2016] D. Haws, D. Dimitriadis, G. Saon, S. Thomas, M. Picheny, On the importance of event detection for asr, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 2016.
    • Chen et al. [1998] S. Chen, P. Gopalakrishnan, et al., Speaker, environment and channel change detection and clustering via the bayesian information criterion, in: Proceedings DARPA broadcast news transcription and understanding workshop, volume 8, Virginia, USA, 1998, pp. 127–132.
    • Kemp et al. [2000] T. Kemp, M. Schmidt, M. Westphal, A. Waibel, Strategies for automatic segmentation of audio data, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, volume 3, 2000, pp. 1423–1426.
    • Siegler et al. [1997] M. A. Siegler, U. Jain, B. Raj, R. M. Stern, Automatic segmentation, classification and clustering of broadcast news audio, in: Proc. DARPA speech recognition workshop, volume 1997, 1997.
    • Gish et al. [1991] H. Gish, M.-H. Siu, J. R. Rohlicek, Segregation of speakers for speech recognition and speaker identification, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 1991, pp. 873–876.
    • Bonastre et al. [2000] J.-F. Bonastre, P. Delacourt, C. Fredouille, T. Merlin, C. Wellekens, A speaker tracking system based on speaker turn detection for nist evaluation, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, volume 2, IEEE, 2000, pp. 1177–1180.
    • Gangadharaiah et al. [2004] R. Gangadharaiah, B. Narayanaswamy, N. Balakrishnan, A novel method for two-speaker segmentation, in: Eighth International Conference on Spoken Language Processing, 2004.
    • Tritschler and Gopinath [1999] A. Tritschler, R. A. Gopinath, Improved speaker segmentation and segments clustering using the bayesian information criterion, in: Sixth European Conference on Speech Communication and Technology, 1999.
    • Delacourt and Wellekens [2000] P. Delacourt, C. J. Wellekens, Distbic: A speaker-based segmentation for audio data indexing, Speech Communication 32 (2000) 111–126.
    • Mori and Nakagawa [2001] K. Mori, S. Nakagawa, Speaker change detection and speaker clustering using vq distortion for broadcast news speech recognition, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, volume 1, 2001, pp. 413–416.
    • Ajmera et al. [2004] J. Ajmera, I. McCowan, H. Bourlard, Robust speaker change detection, IEEE signal processing letters 11 (2004) 649–651.
    • Malegaonkar et al. [2006] A. Malegaonkar, A. Ariyaeeinia, P. Sivakumaran, J. Fortuna, Unsupervised speaker change detection using probabilistic pattern matching, IEEE signal processing letters 13 (2006) 509–512.
    • Sell et al. [2018] G. Sell, D. Snyder, A. McCree, D. Garcia-Romero, J. Villalba, M. Maciejewski, V. Manohar, N. Dehak, D. Povey, S. Watanabe, S. Khudanpur, Diarization is hard: some experiences and lessons learned for the JHU team in the inaugural DIHARD challenge, in: Proceedings of the Annual Conference of the International Speech Communication Association, 2018, pp. 2808–2812.
    • Reynolds et al. [2000] D. A. Reynolds, T. F. Quatieri, R. B. Dunn, Speaker verification using adapted gaussian mixture models, Digital signal processing 10 (2000) 19–41.
    • Kenny et al. [2007] P. Kenny, G. Boulianne, P. Ouellet, P. Dumouchel, Speaker and session variability in gmm-based speaker verification, IEEE Transactions on Audio, Speech, and Language Processing 15 (2007) 1448–1460.
    • Kenny et al. [2008] P. Kenny, P. Ouellet, N. Dehak, V. Gupta, P. Dumouchel, A study of interspeaker variability in speaker verification, IEEE Transactions on Audio, Speech, and Language Processing 16 (2008) 980–988.
    • Kenny et al. [2005] P. Kenny, G. Boulianne, P. Dumouchel, Eigenvoice modeling with sparse training data, IEEE Transactions on Speech and Audio Processing 13 (2005) 345–354.
    • Zhu and Pelecanos [2016] W. Zhu, J. Pelecanos, Online speaker diarization using adapted i-vector transforms, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, 2016, pp. 5045–5049.
    • Kanagasundaram et al. [2012] A. Kanagasundaram, D. Dean, R. Vogt, M. McLaren, S. Sridharan, M. Mason, Weighted lda techniques for i-vector based speaker verification, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 2012, pp. 4781–4784.
    • Kanagasundaram et al. [2014] A. Kanagasundaram, D. Dean, S. Sridharan, M. McLaren, R. Vogt, i-vector based speaker recognition using advanced channel compensation techniques, Computer Speech & Language 28 (2014) 121–140.
    • Senoussaoui et al. [2010] M. Senoussaoui, P. Kenny, N. Dehak, P. Dumouchel, et al., An i-vector extractor suitable for speaker recognition with both microphone and telephone speech., in: Odyssey, 2010, p. 6.
    • Kanagasundaram et al. [2011] A. Kanagasundaram, R. Vogt, D. Dean, S. Sridharan, M. Mason, i-vector based speaker recognition on short utterances, in: Proceedings of the 12th Annual Conference of the International Speech Communication Association, International Speech Communication Association, 2011, pp. 2341–2344.
    • Matějka et al. [2011] P. Matějka, O. Glembek, F. Castaldo, M. J. Alam, O. Plchot, P. Kenny, L. Burget, J. Černocky, Full-covariance ubm and heavy-tailed plda in i-vector speaker verification, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, 2011, pp. 4828–4831.
    • Garcia-Romero and Espy-Wilson [2011] D. Garcia-Romero, C. Y. Espy-Wilson, Analysis of i-vector Length Normalization in Speaker Recognition Systems, in: Proceedings of the Annual Conference of the International Speech Communication Association, 2011, pp. 249–252.
    • Kenny [2010] P. Kenny, Bayesian speaker verification with heavy-tailed priors., in: Odyssey, volume 14, 2010.
    • Jiang et al. [2014] Y. Jiang, K. A. Lee, L. Wang, Plda in the i-supervector space for text-independent speaker verification, EURASIP Journal on Audio, Speech, and Music Processing 2014 (2014) 1–13.
    • Sun et al. [2014] Y. Sun, X. Wang, X. Tang, Deep learning face representation from predicting 10,000 classes, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1891–1898.
    • Taigman et al. [2014] Y. Taigman, M. Yang, M. Ranzato, L. Wolf, Deepface: Closing the gap to human-level performance in face verification, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2014, pp. 1701–1708.
    • Villalba et al. [2019] J. Villalba, N. Chen, D. Snyder, D. Garcia-Romero, A. McCree, G. Sell, J. Borgstrom, F. Richardson, S. Shon, F. Grondin, et al., State-of-the-art speaker recognition for telephone and video speech: The JHU-MIT submission for NIST SRE18., in: Proceedings of the Annual Conference of the International Speech Communication Association, 2019, pp. 1488–1492.
    • Han and Narayanan [2007] K. J. Han, S. S. Narayanan, A robust stopping criterion for agglomerative hierarchical clustering in a speaker diarization system, in: Proceedings of the Annual Conference of the International Speech Communication Association, 2007.
    • Rougui et al. [2006] J. E. Rougui, M. Rziza, D. Aboutajdine, M. Gelgon, J. Martinez, Fast incremental clustering of gaussian mixture speaker models for scaling up retrieval in on-line broadcast, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, volume 5, IEEE, 2006.
    • Novoselov et al. [2019] S. Novoselov, A. Gusev, A. Ivanov, T. Pekhovsky, A. Shulipa, A. Avdeeva, A. Gorlanov, A. Kozlov, Speaker diarization with deep speaker embeddings for dihard challenge ii., in: Proceedings of the Annual Conference of the International Speech Communication Association, 2019, pp. 1003–1007.
    • Von Luxburg [2007] U. Von Luxburg, A tutorial on spectral clustering, Statist. and Comput. 17 (2007) 395–416.
    • Park et al. [2019] T. J. Park, K. J. Han, M. Kumar, S. Narayanan, Auto-tuning spectral clustering for speaker diarization using normalized maximum eigengap, IEEE Signal Processing Letters 27 (2019) 381–385.
    • MacQueen et al. [1967] J. MacQueen, et al., Some methods for classification and analysis of multivariate observations, in: Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, volume 14, Oakland, CA, USA, 1967, pp. 281–297.
    • Ng et al. [2001] A. Ng, M. Jordan, Y. Weiss, On spectral clustering: Analysis and an algorithm, Advances in neural information processing systems 14 (2001) 849–856.
    • Ning et al. [2006] H. Ning, M. Liu, H. Tang, T. S. Huang, A spectral clustering approach to speaker diarization, in: Proceedings of the International Conference on Spoken Language Processing, 2006, pp. 2178–2181.
    • Luque and Hernando [2012] J. Luque, J. Hernando, On the use of agglomerative and spectral clustering in speaker diarization of meetings, in: Odyssey, 2012, pp. 130–137.
    • Lin et al. [2019] Q. Lin, R. Yin, M. Li, H. Bredin, C. Barras, Lstm based similarity measurement with spectral clustering for speaker diarization, Proc. Interspeech 2019 (2019) 366–370.
    • Zajíc et al. [2016] Z. Zajíc, M. Kunešová, V. Radová, Investigation of Segmentation in i-vector Based Speaker Diarization of Telephone Speech, in: International Conference on Speech and Computer, 2016, pp. 411–418.
    • Dimitriadis and Fousek [2017] D. Dimitriadis, P. Fousek, Developing on-line speaker diarization system, in: Proceedings of the Annual Conference of the International Speech Communication Association, 2017, pp. 2739–2743.
    • Comaniciu and Meer [2002] D. Comaniciu, P. Meer, Mean shift: A robust approach toward feature space analysis, IEEE Transactions on pattern analysis and machine intelligence 24 (2002) 603–619.
    • Stafylakis et al. [2010] T. Stafylakis, V. Katsouros, G. Carayannis, Speaker clustering via the mean shift algorithm, Recall 2 (2010) 7.
    • Senoussaoui et al. [2013] M. Senoussaoui, P. Kenny, P. Dumouchel, T. Stafylakis, Efficient iterative mean shift based cosine dissimilarity for multi-recording speaker clustering, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, 2013, pp. 7712–7715.
    • Salmun et al. [2017] I. Salmun, I. Shapiro, I. Opher, I. Lapidot, Plda-based mean shift speakers’ short segments clustering, Computer Speech and Language 45 (2017) 411–436.
    • Kenny et al. [2010] P. Kenny, D. Reynolds, F. Castaldo, Diarization of telephone conversations using factor analysis, IEEE Journal of Selected Topics in Signal Processing 4 (2010) 1059–1070.
    • Diez et al. [2019] M. Diez, L. Burget, F. Landini, J. Černockỳ, Analysis of speaker diarization based on Bayesian HMM with eigenvoice priors, IEEE/ACM Transactions on Audio, Speech, and Language Processing 28 (2019) 355–368.
    • Fiscus [1997] J. G. Fiscus, A post-processing system to yield reduced word error rates: Recognizer output voting error reduction (ROVER), in: Proceedings of IEEE Workshop on Automatic Speech Recognition and Understanding, IEEE, 1997, pp. 347–354.
    • Brummer et al. [2007] N. Brummer, L. Burget, J. Cernocky, O. Glembek, F. Grezl, M. Karafiat, D. A. van Leeuwen, P. Matejka, P. Schwarz, A. Strasheim, Fusion of heterogeneous speaker recognition systems in the STBU submission for the NIST speaker recognition evaluation 2006, IEEE Transactions on Audio, Speech, and Language Processing 15 (2007) 2072–2084.
    • Huijbregts et al. [2009] M. Huijbregts, D. van Leeuwen, F. Jong, The majority wins: a method for combining speaker diarization systems, in: Proceedings of the Annual Conference of the International Speech Communication Association, ISCA, 2009, pp. 924–927.
    • Bozonnet et al. [2010] S. Bozonnet, N. Evans, X. Anguera, O. Vinyals, G. Friedland, C. Fredouille, System output combination for improved speaker diarization, in: Proceedings of the Annual Conference of the International Speech Communication Association, ISCA, 2010, pp. 2642–2645.
    • Stolcke and Yoshioka [2019] A. Stolcke, T. Yoshioka, DOVER: A method for combining diarization outputs, in: Proceedings of IEEE Workshop on Automatic Speech Recognition and Understanding, IEEE, 2019, pp. 757–763.
    • Raj et al. [2021] D. Raj, L. P. Garcia-Perera, Z. Huang, S. Watanabe, D. Povey, A. Stolcke, S. Khudanpur, DOVER-Lap: A method for combining overlap-aware diarization outputs, in: Proceedings of IEEE Spoken Language Technology Workshop, 2021.
    • Valente [2005] F. Valente, Variational Bayesian methods for audio indexing, Ph.D. thesis, 2005.
    • Kenny [2008] P. Kenny, Bayesian analysis of speaker diarization with eigenvoice priors, CRIM, Montreal, Technical Report (2008).
    • Landini et al. [2020] F. Landini, J. Profant, M. Diez, L. Burget, Bayesian HMM clustering of x-vector sequences (VBx) in speaker diarization: theory, implementation and analysis on standard tasks, arXiv preprint arXiv:2006.07898 (2020).
    • Sell and Garcia-Romero [2015] G. Sell, D. Garcia-Romero, Diarization resegmentation in the factor analysis subspace, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, 2015, pp. 4794–4798.
    • Diez et al. [2020] M. Diez, L. Burget, F. Landini, S. Wang, H. Černockỳ, Optimizing Bayesian HMM based x-vector clustering for the second DIHARD speech diarization challenge, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, 2020, pp. 6519–6523.
    • Lin et al. [2020] Q. Lin, Y. Hou, M. Li, Self-attentive similarity measurement strategies in speaker diarization, in: Proceedings of the Annual Conference of the International Speech Communication Association, 2020, pp. 284–288.
    • Park et al. [2021] T. J. Park, M. Kumar, S. Narayanan, Multi-scale speaker diarization with neural affinity score fusion, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 2021, pp. 7173–7177.
    • Flemotomos and Dimitriadis [2020] N. Flemotomos, D. Dimitriadis, A memory augmented architecture for continuous speaker identification in meetings, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 2020, pp. 6524–6528.
    • Horiguchi et al. [2020] S. Horiguchi, P. Garcia, Y. Fujita, S. Watanabe, K. Nagamatsu, End-to-end speaker diarization as post-processing, arXiv preprint arXiv:2012.10055 (2020).
    • Kinoshita et al. [2020] K. Kinoshita, M. Delcroix, S. Araki, T. Nakatani, Tackling real noisy reverberant meetings with all-neural source separation, counting, and diarization system, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, 2020, pp. 381–385.
    • Horiguchi et al. [2020] S. Horiguchi, Y. Fujita, S. Watanabe, Y. Xue, K. Nagamatsu, End-to-end speaker diarization for an unknown number of speakers with encoder-decoder based attractors, in: Proceedings of the Annual Conference of the International Speech Communication Association, 2020, pp. 269–273.
    • Fujita et al. [2020] Y. Fujita, S. Watanabe, S. Horiguchi, Y. Xue, J. Shi, K. Nagamatsu, Neural speaker diarization with speaker-wise chain rule, arXiv preprint arXiv:2006.01796 (2020).
    • Xue et al. [2020] Y. Xue, S. Horiguchi, Y. Fujita, S. Watanabe, K. Nagamatsu, Online end-to-end neural diarization with speaker-tracing buffer, arXiv preprint arXiv:2006.02616 (2020).
    • Han et al. [2021] E. Han, C. Lee, A. Stolcke, BW-EDA-EEND: Streaming end-to-end neural speaker diarization for a variable number of speakers, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 2021, pp. 7193–7197.
    • Kinoshita et al. [2021] K. Kinoshita, M. Delcroix, N. Tawara, Integrating end-to-end neural and clustering-based diarization: Getting the best of both worlds, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 2021, pp. 7198–7202.
    • Kanda et al. [2020] N. Kanda, Y. Gaur, X. Wang, Z. Meng, Z. Chen, T. Zhou, T. Yoshioka, Joint speaker counting, speech recognition, and speaker identification for overlapped speech of any number of speakers, in: Proceedings of the Annual Conference of the International Speech Communication Association, 2020, pp. 36–40.
    • Kanda et al. [2021] N. Kanda, Z. Meng, L. Lu, Y. Gaur, X. Wang, Z. Chen, T. Yoshioka, Minimum Bayes risk training for end-to-end speaker-attributed ASR, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 2021, pp. 6503–6507.
    • Ustinova and Lempitsky [2016] E. Ustinova, V. Lempitsky, Learning deep embeddings with histogram loss, Proceedings of Advances in Neural Information Processing Systems 29 (2016) 4170–4178.
    • Recht et al. [2010] B. Recht, M. Fazel, P. A. Parrilo, Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization, SIAM review 52 (2010) 471–501.
    • Xie et al. [2016] J. Xie, R. Girshick, A. Farhadi, Unsupervised deep embedding for clustering analysis, in: Proceedings of International Conference on Machine Learning, 2016, pp. 478–487.
    • Guo et al. [2017] X. Guo, L. Gao, X. Liu, J. Yin, Improved deep embedded clustering with local structure preservation, in: Proceedings of International Joint Conference on Artificial Intelligence, 2017, pp. 1753–1759.
    • Santoro et al. [2018] A. Santoro, R. Faulkner, D. Raposo, J. Rae, M. Chrzanowski, T. Weber, D. Wierstra, O. Vinyals, R. Pascanu, T. Lillicrap, Relational Recurrent Neural Networks, in: Proceedings of Advances in Neural Information Processing Systems, 2018, pp. 7299–7310.
    • Santoro et al. [2016] A. Santoro, S. Bartunov, M. Botvinick, D. Wierstra, T. Lillicrap, Meta-learning with Memory-Augmented Neural Networks, in: Proceedings of International Conference on Machine Learning, 2016, pp. 1842––1850.
    • Sukhbaatar et al. [2015] S. Sukhbaatar, J. Weston, R. Fergus, et al., End-to-End Memory Networks, in: Proceedings of Advances in Neural Information Processing Systems, 2015, pp. 2440–2448.
    • Yoshioka et al. [2019] T. Yoshioka, D. Dimitriadis, A. Stolcke, W. Hinthorn, Z. Chen, M. Zeng, X. Huang, Meeting Transcription Using Asynchronous Distant Microphones, in: Proceedings of the Annual Conference of the International Speech Communication Association, 2019, pp. 2968–2972.
    • Blei and Frazier [2011] D. M. Blei, P. I. Frazier, Distance dependent chinese restaurant processes., Journal of Machine Learning Research 12 (2011).
    • Ren et al. [2016] S. Ren, K. He, R. Girshick, J. Sun, Faster R-CNN: Towards real-time object detection with region proposal networks, IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (2016) 1137–1149.
    • Kounades-Bastian et al. [2017a] D. Kounades-Bastian, L. Girin, X. Alameda-Pineda, S. Gannot, R. Horaud, An EM algorithm for joint source separation and diarisation of multichannel convolutive speech mixtures, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, 2017a, pp. 16–20.
    • Kounades-Bastian et al. [2017b] D. Kounades-Bastian, L. Girin, X. Alameda-Pineda, R. Horaud, S. Gannot, Exploiting the intermittency of speech for joint separation and diarization, in: Proceedings of IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, IEEE, 2017b, pp. 41–45.
    • Maekawa [2003] K. Maekawa, Corpus of spontaneous japanese: Its design and evaluation, in: ISCA & IEEE Workshop on Spontaneous Speech Processing and Recognition, 2003, pp. 7–12.
    • Huang et al. [2007] J. Huang, E. Marcheret, K. Visweswariah, G. Potamianos, The ibm rt07 evaluation systems for speaker diarization on lecture meetings, in: Multimodal Technologies for Perception of Humans, Springer, 2007, pp. 497–508.
    • Canseco-Rodriguez et al. [2004] L. Canseco-Rodriguez, L. Lamel, J.-L. Gauvain, Speaker diarization from speech transcripts, in: Proceedings of the International Conference on Spoken Language Processing, volume 4, 2004, pp. 3–7.
    • Flemotomos et al. [2020] N. Flemotomos, P. Georgiou, S. Narayanan, Linguistically aided speaker diarization using speaker role information, in: Odyssey, 2020, pp. 117–124.
    • Park et al. [2019] T. J. Park, K. J. Han, J. Huang, X. He, B. Zhou, P. Georgiou, S. Narayanan, Speaker diarization with lexical information, in: Proceedings of the Annual Conference of the International Speech Communication Association, 2019, pp. 391–395.
    • Fiscus et al. [2007] J. Fiscus, J. Ajot, J. Garofolo, The Rich Transcription 2007 meeting recognition evaluation, 2007, pp. 373–389.
    • Zmolikova et al. [2017] K. Zmolikova, M. Delcroix, K. Kinoshita, T. Higuchi, A. Ogawa, T. Nakatani, Speaker-aware neural network based beamformer for speaker extraction in speech mixtures., in: Proceedings of the Annual Conference of the International Speech Communication Association, 2017, pp. 2655–2659.
    • Delcroix et al. [2018] M. Delcroix, K. Zmolikova, K. Kinoshita, A. Ogawa, T. Nakatani, Single channel target speaker extraction and recognition with speaker beam, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, 2018, pp. 5554–5558.
    • Delcroix et al. [2019] M. Delcroix, S. Watanabe, T. Ochiai, K. Kinoshita, S. Karita, A. Ogawa, T. Nakatani, End-to-end SpeakerBeam for single channel target speech recognition., in: Proceedings of the Annual Conference of the International Speech Communication Association, 2019, pp. 451–455.
    • Kanda et al. [2020] N. Kanda, Y. Gaur, X. Wang, Z. Meng, T. Yoshioka, Serialized output training for end-to-end overlapped speech recognition, in: Proceedings of the Annual Conference of the International Speech Communication Association, 2020, pp. 2797–2801.
    • Kanda et al. [2021] N. Kanda, G. Ye, Y. Gaur, X. Wang, Z. Meng, Z. Chen, T. Yoshioka, End-to-end speaker-attributed asr with transformer, arXiv preprint arXiv:2104.02128 (2021).
    • Carletta et al. [2005] J. Carletta, S. Ashby, S. Bourban, M. Flynn, M. Guillemot, T. Hain, J. Kadlec, V. Karaiskos, W. Kraaij, M. Kronenthal, et al., The ami meeting corpus: A pre-announcement, in: International workshop on machine learning for multimodal interaction, Springer, 2005, pp. 28–39.
    • Janin et al. [2003] A. Janin, D. Baron, J. Edwards, D. Ellis, D. Gelbart, N. Morgan, B. Peskin, T. Pfau, E. Shriberg, A. Stolcke, C. Wooters, The ICSI meeting corpus, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 2003, pp. I–364–I–367.
    • Barker et al. [2018] J. Barker, S. Watanabe, E. Vincent, J. Trmal, The fifth ’chime’ speech separation and recognition challenge: Dataset, task and baselines, in: Proceedings of the Annual Conference of the International Speech Communication Association, 2018, pp. 1561–1565.
    • Chung et al. [2019] J. S. Chung, A. Nagrani, E. Coto, W. Xie, M. McLaren, D. A. Reynolds, A. Zisserman, VoxSRC 2019: The first VoxCeleb speaker recognition challenge, arXiv preprint arXiv:1912.02522 (2019).
    • Chung et al. [2020] J. S. Chung, J. Huh, A. Nagrani, T. Afouras, A. Zisserman, Spot the conversation: Speaker diarisation in the wild, in: Proceedings of the Annual Conference of the International Speech Communication Association, 2020, pp. 299–303.
    • Panayotov et al. [2015] V. Panayotov, G. Chen, D. Povey, S. Khudanpur, LibriSpeech: an ASR corpus based on public domain audio books, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, 2015, pp. 5206–5210.
    • Ryant et al. [2018] N. Ryant, K. Church, C. Cieri, A. Cristia, J. Du, S. Ganapathy, M. Liberman, The first dihard speech diarization challenge, in: Proceedings of the Annual Conference of the International Speech Communication Association, 2018.
    • Fu et al. [2021] Y. Fu, L. Cheng, S. Lv, Y. Jv, Y. Kong, Z. Chen, Y. Hu, L. Xie, J. Wu, H. Bu, et al., Aishell-4: An open source dataset for speech enhancement, separation, recognition and speaker diarization in conference scenario, arXiv preprint arXiv:2104.03603 (2021).
    • Gravier et al. [2004] G. Gravier, J.-F. Bonastre, E. Geoffrois, S. Galliano, K. McTait, K. Choukri, The ester evaluation campaign for the rich transcription of french broadcast news., in: LREC, 2004.
    • Galliano et al. [2009] S. Galliano, G. Gravier, L. Chaubard, The ester 2 evaluation campaign for the rich transcription of french radio broadcasts, in: Tenth Annual Conference of the International Speech Communication Association, 2009.
    • Gravier et al. [2012] G. Gravier, G. Adda, N. Paulson, M. Carré, A. Giraudel, O. Galibert, The etape corpus for the evaluation of speech-based tv content processing in the french language, in: LREC-Eighth international conference on Language Resources and Evaluation, 2012, p. na.
    • Chiu et al. [2017] C.-C. Chiu, A. Tripathi, K. Chou, C. Co, N. Jaitly, D. Jaunzeikare, A. Kannan, P. Nguyen, H. Sak, A. Sankar, et al., Speech recognition for medical conversations, arXiv preprint arXiv:1711.07274 (2017).
    • Carletta et al. [2006] J. Carletta, S. Ashby, S. Bourban, M. Flynn, M. Guillemot, T. Hain, J. Kadlec, V. Karaiskos, W. Kraaij, M. Kronenthal, G. Lathoud, M. Lincoln, A. Lisowska, I. McCowan, W. Post, D. Reidsma, P. Wellner, The AMI meeting corpus: a pre-announcement, in: Proceedings of International Workshop on Machine Learning for Multimodal Interaction, 2006, pp. 28–39.
    • Xiong et al. [2016] W. Xiong, J. Droppo, X. Huang, F. Seide, M. Seltzer, A. Stolcke, D. Yu, G. Zweig, Achieving human parity in conversational speech recognition, arXiv preprint arXiv:1610.05256 (2016).
    • Saon et al. [2017] G. Saon, G. Kurata, T. Sercu, K. Audhkhasi, S. Thomas, D. Dimitriadis, X. Cui, B. Ramabhadran, M. Picheny, L.-L. Lim, et al., English conversational telephone speech recognition by humans and machines, arXiv preprint arXiv:1703.02136 (2017).
    • Yoshioka et al. [2015] T. Yoshioka, N. Ito, M. Delcroix, A. Ogawa, K. Kinoshita, M. Fujimoto, C. Yu, W. Fabian, M. Espi, T. Higuchi, S. Araki, T. Nakatani, The NTT CHiME-3 system: advances in speech enhancement and recognition for mobile multi-microphone devices, in: Proceedings of IEEE Workshop on Automatic Speech Recognition and Understanding, 2015, pp. 436–443.
    • Du et al. [2016] J. Du, Y. Tu, L. Sun, F. Ma, H. Wang, J. Pan, C. Liu, J. Chen, C. Lee, The USTC-iFlytek system for CHiME-4 challenge, in: Proceedings of CHiME-4 Workshop, 2016, pp. 36–38.
    • Li et al. [2017] B. Li, T. N. Sainath, A. Narayanan, J. Caroselli, M. Bacchiani, A. Misra, I. Shafran, H. Sak, G. Pundak, K. Chin, K. C. Sim, R. J. Weiss, K. W. Wilson, E. Variani, C. Kim, O. Siohan, M. Weintraub, E. McDermott, R. Rose, M. Shannon, Acoustic modeling for Google Home, in: Proceedings of the Annual Conference of the International Speech Communication Association, 2017, pp. 399–403.
    • Dimitriadis and Fousek [2017] D. Dimitriadis, P. Fousek, Developing on-line speaker diarization system, in: Proceedings of the Annual Conference of the International Speech Communication Association, 2017, pp. 2739–2743.
    • He et al. [2016] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
    • He et al. [2017] K. He, G. Gkioxari, P. Dollár, R. Girshick, Mask R-CNN, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2961–2969.
    • Yoshioka et al. [2019] T. Yoshioka, I. Abramovski, C. Aksoylar, Z. Chen, M. David, D. Dimitriadis, Y. Gong, I. Gurvich, X. Huang, Y. Huang, A. Hurvitz, L. Jiang, S. Koubi, E. Krupka, I. Leichter, C. Liu, P. Parthasarathy, A. Vinnikov, L. Wu, X. Xiao, W. Xiong, H. Wang, Z. Wang, J. Zhang, Y. Zhao, T. Zhou, Advances in Online Audio-Visual Meeting Transcription, in: Proceedings of IEEE Workshop on Automatic Speech Recognition and Understanding, 2019, pp. 276–283.
    • Buchner et al. [2005] H. Buchner, R. Aichner, W. Kellermann, A generalization of blind source separation algorithms for convolutive mixtures based on second-order statistics, IEEE Transactions on Speech and Audio Processing 13 (2005) 120–134.
    • Sawada et al. [2007] H. Sawada, S. Araki, S. Makino, Measuring dependence of bin-wise separated signals for permutation alignment in frequency-domain BSS, in: Proceedings of the IEEE International Symposium on Circuits and Systems, 2007, pp. 3247–3250.
    • Nesta et al. [2011] F. Nesta, P. Svaizer, M. Omologo, Convolutive BSS of short mixtures by ICA recursively regularized across frequencies, IEEE Transactions on Audio, Speech, and Language Processing 19 (2011) 624–639.
    • Sawada et al. [2011] H. Sawada, S. Araki, S. Makino, Underdetermined convolutive blind source separation via frequency bin-wise clustering and permutation alignment, IEEE Transactions on Audio, Speech, and Language Processing 19 (2011) 516–527.
    • Ito et al. [2014] N. Ito, S. Araki, T. Yoshioka, T. Nakatani, Relaxed disjointness based clustering for joint blind source separation and dereverberation, in: Proceedings of International Workshop on Acoustic Echo and Noise Control, 2014, pp. 268–272.
    • Drude and Haeb-Umbach [2017] L. Drude, R. Haeb-Umbach, Tight integration of spatial and spectral features for BSS with deep clustering embeddings, in: Proceedings of the Annual Conference of the International Speech Communication Association, 2017, pp. 2650–2654.
    • Maciejewski et al. [2018] M. Maciejewski, G. Sell, L. P. Garcia-Perera, S. Watanabe, S. Khudanpur, Building corpora for single-channel speech separation across multiple domains, arXiv preprint arXiv:1811.02641 (2018).
    • Araki et al. [2018] S. Araki, N. Ono, K. Kinoshita, M. Delcroix, Meeting recognition with asynchronous distributed microphone array using block-wise refinement of mask-based MVDR beamformer, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 2018, pp. 5694–5698.
    • Stolcke [2011] A. Stolcke, Making the most from multiple microphones in meeting recordings, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 2011, pp. 4992–4995.
    • Narayanan and Georgiou [2013] S. Narayanan, P. G. Georgiou, Behavioral signal processing: Deriving human behavioral informatics from speech and language, Proceedings of the IEEE 101 (2013) 1203–1233.
    • Bone et al. [2017] D. Bone, C.-C. Lee, T. Chaspari, J. Gibson, S. Narayanan, Signal processing and machine learning for mental health research and clinical applications, IEEE Signal Processing Magazine 34 (2017) 189–196.
    • Kumar et al. [2020] M. Kumar, S. H. Kim, C. Lord, S. Narayanan, Speaker diarization for naturalistic child-adult conversational interactions using contextual information, Journal of the Acoustical Society of America 147 (2020) EL196–EL200. doi:10.1121/10.0000736.
    • Georgiou et al. [2011] P. G. Georgiou, M. P. Black, S. S. Narayanan, Behavioral signal processing for understanding (distressed) dyadic interactions: some recent developments, in: Proceedings of the joint ACM workshop on Human gesture and behavior understanding, 2011, pp. 7–12.
    • Xiao et al. [2016] B. Xiao, C. Huang, Z. E. Imel, D. C. Atkins, P. Georgiou, S. S. Narayanan, A technology prototype system for rating therapist empathy from audio recordings in addiction counseling, PeerJ Computer Science 2 (2016) e59.
    • Chakravarthula et al. [2020] S. N. Chakravarthula, M. Nasir, S.-Y. Tseng, H. Li, T. J. Park, B. Baucom, C. J. Bryan, S. Narayanan, P. Georgiou, Automatic prediction of suicidal risk in military couples using multimodal interaction cues from couples conversations, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 2020, pp. 6539–6543.
    • Mirheidari et al. [2017] B. Mirheidari, D. Blackburn, K. Harkness, T. Walker, A. Venneri, M. Reuber, H. Christensen, Toward the automation of diagnostic conversation analysis in patients with memory complaints, Journal of Alzheimer’s Disease 58 (2017) 373–387.
    • Finley et al. [2018] G. P. Finley, E. Edwards, A. Robinson, N. Sadoughi, J. Fone, M. Miller, D. Suendermann-Oeft, M. Brenndoerfer, N. Axtmann, An automated assistant for medical scribes, in: Proceedings of the Annual Conference of the International Speech Communication Association, 2018, pp. 3212–3213.
    • Guo et al. [2016] A. Guo, A. Faria, J. Riedhammer, Remeeting – Deep insights to conversations, in: Proceedings of the Annual Conference of the International Speech Communication Association, 2016, pp. 1964–1965.
    • Addlesee et al. [2020] A. Addlesee, Y. Yu, A. Eshghi, A comprehensive evaluation of incremental speech recognition and diarization for conversational AI, in: Proceedings of the International Conference on Computational Linguistics, 2020, pp. 3492–3503.
    • Ryant et al. [2020] N. Ryant, K. Church, C. Cieri, J. Du, S. Ganapathy, M. Liberman, Third DIHARD challenge evaluation plan, arXiv preprint arXiv:2006.05815 (2020).
    • Cetin and Shriberg [2006] O. Cetin, E. Shriberg, Speaker overlaps and ASR errors in meetings: Effects before, during, and after the overlap, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, volume 1, IEEE, 2006, pp. 357–360.
    • Kanda et al. [2019] N. Kanda, C. Boeddeker, J. Heitkaemper, Y. Fujita, S. Horiguchi, K. Nagamatsu, R. Haeb-Umbach, Guided source separation meets a strong ASR backend: Hitachi/Paderborn University joint investigation for dinner party ASR, in: Proceedings of the Annual Conference of the International Speech Communication Association, 2019, pp. 1248–1252.
    • Garofolo et al. [2004] J. S. Garofolo, C. D. Laprun, J. G. Fiscus, The rich transcription 2004 spring meeting recognition evaluation, NIST, 2004.
    • Otterson and Ostendorf [2007] S. Otterson, M. Ostendorf, Efficient use of overlap information in speaker diarization, in: Proceedings of IEEE Workshop on Automatic Speech Recognition and Understanding, IEEE, 2007, pp. 683–686.
    • Boakye et al. [2008] K. Boakye, B. Trueba-Hornero, O. Vinyals, G. Friedland, Overlapped speech detection for improved speaker diarization in multiparty meetings, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, 2008, pp. 4353–4356.
    • Bullock et al. [2020] L. Bullock, H. Bredin, L. P. Garcia-Perera, Overlap-aware diarization: Resegmentation using neural end-to-end overlapped speech detection, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, IEEE, 2020, pp. 7114–7118.