
Multi-target Extractor and Detector for Unknown-number Speaker Diarization

Chin-Yi Cheng, Hung-Shin Lee, Yu Tsao, and Hsin-Min Wang
Abstract

Strong representations of target speakers can help extract important information about speakers and detect corresponding temporal regions in multi-speaker conversations. In this study, we propose a neural architecture that simultaneously extracts speaker representations consistent with the speaker diarization objective and detects the presence of each speaker on a frame-by-frame basis, regardless of the number of speakers in a conversation. A speaker representation extractor (producing what we call the z-vector), implemented by a residual network, and a time-speaker contextualizer, which processes data in both the temporal and speaker dimensions, are integrated into a unified framework. Tests on the CALLHOME corpus show that our model outperforms most of the methods proposed so far. Evaluations in a more challenging setting, with the number of speakers per conversation ranging from 2 to 7, show that our model achieves 6.4% to 30.9% relative diarization error rate reductions over several typical baselines.

Index Terms:
speaker diarization, speaker representations

I Introduction

Speaker diarization (SD) is the process of determining when individual speakers are active in a recording. The aim is to generate a diary of the presence of each speaker at each point in time. This technique has been widely used in speech processing in various scenarios, such as conversations, broadcast news, debates, and cocktail parties [1]. However, SD is still far from robust due to the challenges posed by variations in recording channels, environments, reverberation, ambient noise, and the number of speakers [2].

Researchers have tackled SD tasks using probabilistic models [3] or neural networks (NNs) [4]. Many methods involve two steps, segmentation and clustering. In the segmentation step, a session is divided into a series of short segments, typically using a sliding window of 1.5 seconds with 50% overlap. Then, a speaker model is used to extract the speaker representation (e.g., the x-vector [5, 6, 7, 8], i-vector [9, 10], or d-vector [11, 12]) for each segment. During clustering, segments with homogeneous characteristics form a group. Many clustering techniques have been used based on various similarity measures, such as probabilistic linear discriminant analysis (PLDA) and cosine similarity [13, 14, 15, 10]. For example, agglomerative hierarchical clustering (AHC) and spectral clustering (SC) were used in [6, 10] and [16, 17], respectively. The unbounded interleaved-state recurrent neural network (UIS-RNN) derived from the Gaussian mixture model (GMM) [18, 19] and hidden Markov model (HMM) [7] was used in [12]. Moreover, some post-processing methods, such as variational Bayes (VB) [20] and the LSTM-based method [21], have been used to refine the initial SD results.

Several recent studies [22, 23, 24, 25] have focused on end-to-end SD. Fujita et al. [22] reformulated the SD task as a multi-label classification problem and used the permutation-invariant training (PIT) [26] technique. In [27], reliable speaker representations were derived using a selector to assist the voice activity detector in diarizing a session. Self-attention [28, 29] and frame selection [30] have also been used in end-to-end SD.

Traditional segmentation-clustering methods cannot handle overlapping speech in a session. Target speaker voice activity detection (TS-VAD) [31] can handle overlapping speech well. It relies on the x-vector/SC method [5, 8] to provide initial speech regions for each “target” (active) speaker, and then extracts his/her i-vector from the corresponding frames. Finally, it uses the i-vectors of all speakers and MFCCs to generate the SD result. Unfortunately, it can only be applied to sessions with a fixed number of speakers due to a tensor concatenation of speaker representations. Inspired by the dual-path recurrent neural network (DPRNN) [32, 33], we propose a unified structure called Multi-target Extractor and Detector (MTEAD), which can handle sessions with various numbers of speakers using a single model. Furthermore, since the quality of speaker representations has an impact on SD performance, we extend TS-VAD by using a neural extractor that can extract speaker representations suitable for SD. Note that in our framework, the number of speakers in a test session is determined by the initial x-vector/AHC SD process.

Our main contributions are threefold. First, we extend the practical scope of TS-VAD while inheriting its excellent SD performance. Second, we design an extractor that can be jointly trained with the SD model to extract speaker representations. Using this extractor, we achieve better SD performance and avoid the pretraining of the i-vector extractor. Third, unlike some previous studies [6, 12] that only dealt with non-overlapping speech, our model can handle overlapping speech.

Figure 1: Structure of MTEAD. $\oplus$ denotes the concatenation operator. $\mathbf{X}^{i}$ and $\mathbf{s}^{i}$ are the $\rm S_d$ frames and diarization scoring vectors for speaker $i$, respectively. $\otimes$ denotes the element-wise product of the MFCCs and the binary masks (speaker occurrence labels) obtained in the first step (x-vector/AHC diarization).

II Multi-target Extractor and Detector

II-A Framework

Inspired by TS-VAD [31], we propose a two-step SD method called MTEAD, which can be adapted to different numbers of speakers. MTEAD not only addresses the main weakness of TS-VAD (i.e., handling only sessions with a fixed number of speakers, e.g., four in [31]), but also uses improved speaker representations. As shown in Fig. 1, the first step of MTEAD (i.e., Extractor) relies on a traditional x-vector/AHC SD method to initially detect the speech regions for each speaker (i.e., speaker occurrence labels). Then, the frames corresponding to each speaker are used to extract his/her speaker representation. Here, the representation may be a traditional i-vector or x-vector, or our specially designed z-vector for SD. The second step of MTEAD (i.e., Detector) requires two inputs: the frame-level MFCCs and the representation of each speaker. The frame-level MFCCs are first passed through a four-layer convolutional neural network (CNN), and then concatenated with each speaker representation. The output of Detector is the final diarization result for each speaker.
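To make the data flow concrete, the following is a minimal PyTorch-style sketch of the two-step MTEAD pipeline described above. The module interfaces, tensor shapes, and the masking step are illustrative assumptions based on Fig. 1, not the authors' exact implementation.

```python
import torch
import torch.nn as nn


class MTEADPipeline(nn.Module):
    """Sketch of the two-step MTEAD flow; shapes and module interfaces are illustrative."""

    def __init__(self, extractor: nn.Module, cnn_frontend: nn.Module, detector: nn.Module):
        super().__init__()
        self.extractor = extractor        # z-vector Extractor (or a frozen x-/i-vector model)
        self.cnn_frontend = cnn_frontend  # four-layer CNN over frame-level MFCCs
        self.detector = detector          # Feature Mixer + Time-Speaker Contextualizers

    def forward(self, mfcc: torch.Tensor, masks: torch.Tensor) -> torch.Tensor:
        # mfcc:  (T, F) frame-level MFCCs of the whole session
        # masks: (N, T) binary speaker-occurrence labels from the initial x-vector/AHC pass
        frames = self.cnn_frontend(mfcc)                      # (T, D) frame features
        spk_reps = torch.stack([
            self.extractor(mfcc * m.unsqueeze(-1))            # mask the MFCCs per speaker (Fig. 1)
            for m in masks
        ])                                                    # (N, E) one representation per speaker
        n_spk, n_frames = spk_reps.size(0), frames.size(0)
        # Pair every frame with every speaker representation by concatenation.
        joint = torch.cat([
            frames.unsqueeze(0).expand(n_spk, n_frames, -1),
            spk_reps.unsqueeze(1).expand(n_spk, n_frames, -1),
        ], dim=-1)                                            # (N, T, D + E)
        return self.detector(joint)                           # (N, T) frame-level speaker activity
```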

In TS-VAD, the second step is implemented using BiLSTM. The first two layers of the BiLSTM take the frame-level MFCCs and four i-vectors as input, and output four speaker detection ($\rm S_d$) vector sequences. Then, the four $\rm S_d$ vector sequences are concatenated along the feature dimension and passed through the third layer of the BiLSTM to generate the final diarization result for each speaker. It is this concatenation that restricts TS-VAD to four-speaker recordings. Furthermore, it uses the i-vector as the speaker representation, which is outperformed by the x- and z-vectors in our experiments.

II-B Detector: Feature Mixer / Time-Speaker Contextualizer

The Detector consists of a Feature Mixer and three consecutive Time-Speaker Contextualizers implemented using BiLSTM. The MFCCs are passed through CNNs and concatenated with each speaker representation. These concatenated features are then processed by Feature Mixer to generate the corresponding $\rm S_d$ frames $\mathbf{X}^{i}$ for the $i$-th speaker, as shown in Fig. 1. As illustrated in Fig. 2, the stack of the $\rm S_d$ frames of all speakers, $\mathbf{X}_{in}\in\mathbb{R}^{D\times T\times N}$, is the input of the Time-Speaker Contextualizer, where $D$, $T$, and $N$ denote the dimension of the $\rm S_d$ vector, the number of frames, and the number of speakers, respectively. Inspired by DPRNN, the Time-Speaker Contextualizer is designed to handle different numbers of speakers in one session. It contains two stages. In the first stage, the $\rm S_d$ vector sequence of each speaker is passed through the Across-time Contextualizer separately to generate the temporal contextual information using

$\tilde{\mathbf{X}}^{i} = \mathrm{Linear}\big(\mathrm{Contextualizer}_{T}(\mathbf{X}^{i})\big) + \mathbf{X}^{i}.$   (1)

The output of the first stage, $\tilde{\mathbf{X}}$, is the stack of $\tilde{\mathbf{X}}^{i}$, $i=1,\dots,N$. The input of the second stage, $\mathbf{Y}$, is generated by reshaping $\tilde{\mathbf{X}}$. Slicing the input by timeframes yields $\mathbf{Y}^{j}\in\mathbb{R}^{D\times N}$, $j=1,\dots,T$, which is treated as the activity of individual speakers in a single frame $j$. Then, $\mathbf{Y}^{j}$ is passed through the Across-speaker Contextualizer to generate the speaker contextual information using

$\tilde{\mathbf{Y}}^{j} = \mathrm{Linear}\big(\mathrm{Contextualizer}_{S}(\mathbf{Y}^{j})\big) + \mathbf{Y}^{j}.$   (2)

$\tilde{\mathbf{Y}}^{j}$, $j=1,\dots,T$, is stacked and reshaped to $\mathbf{X}_{out}\in\mathbb{R}^{D\times T\times N}$. $\mathbf{X}_{out}$ is used as the input, $\mathbf{X}_{in}$, of the subsequent Time-Speaker Contextualizer. Finally, for each speaker, the corresponding $\rm S_d$ vector sequence from the output of the last Time-Speaker Contextualizer is passed through a linear-sigmoid layer to generate the final diarization result.

In Eq. (2), the number of speakers $N$ is the length of the input sequence; therefore, it is variable. The Across-speaker Contextualizer allows the sharing of speakers’ information within a single frame, which can help Detector discern differences among speaker representations. On the other hand, by scanning along the frames, the Across-time Contextualizer can help Detector determine whether each speaker is active in each frame. Therefore, MTEAD achieves information sharing among all speakers and frames through the Across-speaker and Across-time Contextualizers, which not only preserves the advantages of TS-VAD, but also removes the limitation of handling only conversations with a fixed number of speakers.
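As a concrete illustration of Eqs. (1) and (2), below is a minimal PyTorch sketch of one Time-Speaker Contextualizer block. The (N, T, D) tensor layout and the hidden size are assumptions; the paper only specifies that both contextualizers are BiLSTM-based with linear projections and residual connections.

```python
import torch
import torch.nn as nn


class TimeSpeakerContextualizer(nn.Module):
    """A sketch of one Time-Speaker Contextualizer block (Eqs. (1)-(2));
    layer sizes are assumptions, not the paper's exact configuration."""

    def __init__(self, dim: int, hidden: int = 128):
        super().__init__()
        self.across_time = nn.LSTM(dim, hidden, batch_first=True, bidirectional=True)
        self.time_proj = nn.Linear(2 * hidden, dim)
        self.across_spk = nn.LSTM(dim, hidden, batch_first=True, bidirectional=True)
        self.spk_proj = nn.Linear(2 * hidden, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, T, D) stacked Sd vector sequences: N speakers, T frames, D-dim vectors
        h, _ = self.across_time(x)      # Eq. (1): scan along time for each speaker separately
        x = self.time_proj(h) + x       # linear projection + residual
        y = x.transpose(0, 1)           # (T, N, D): slice by frame
        h, _ = self.across_spk(y)       # Eq. (2): scan across speakers within each frame
        y = self.spk_proj(h) + y        # linear projection + residual
        return y.transpose(0, 1)        # back to (N, T, D) for the next block
```

Because the number of speakers only appears as the sequence length of the across-speaker LSTM, the same parameters apply to any N.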

Figure 2: The Time-Speaker Contextualizer. The operator $\ocircle$ denotes the residual addition of two tensors. $\mathbf{X}^{i}$ and $\mathbf{Y}^{j}$ are passed through the Across-time and Across-speaker Contextualizers, respectively.

II-C Extractor: z-vector (diari“z”ation vector)

We argue that the quality of speaker representations is critical to SD performance. Therefore, we design an extractor specifically for extracting speaker representations suitable for SD. As shown in Fig. 1, using the speaker speech regions provided by x-vector/AHC diarization and the MFCCs of the session, the Z-vector Extractor generates the corresponding speaker representations. (While we were preparing this paper, a similar work extended TS-VAD by jointly training a speaker embedding network [34]; however, it does not handle a variable number of speakers.) The Z-vector Extractor comprises a ResNet and Attentive Statistics Pooling (ASP) [35]. The ResNet here is similar to ResNet50, but the last layer is replaced by the ASP layer. The input to the ResNet is a sequence of 500 MFCC frames from each speaker, and the output dimension is set to match that of the x-vector. (The x- and z-vector extractors have about 8M and 30M parameters, respectively.) Therefore, the z-vectors can be used as input to Detector instead of x- or i-vectors. In MTEAD, speaker representation extraction and speaker detection are combined into a unified process by integrating Extractor with Detector. Also, since Extractor and Detector are trained jointly, the z-vector is expected to be more suitable for SD than the x- and i-vectors.
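To illustrate the pooling step, here is a minimal PyTorch sketch of attentive statistics pooling as described in [35], which the z-vector Extractor places on top of its ResNet trunk. The attention hidden size and the final projection to the 40-dimensional z-vector are assumptions.

```python
import torch
import torch.nn as nn


class AttentiveStatsPooling(nn.Module):
    """Sketch of attentive statistics pooling [35]; hidden size is an assumption."""

    def __init__(self, in_dim: int, attn_dim: int = 128):
        super().__init__()
        self.attention = nn.Sequential(
            nn.Linear(in_dim, attn_dim), nn.Tanh(), nn.Linear(attn_dim, 1))

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (T, C) frame-level features from the ResNet trunk
        w = torch.softmax(self.attention(h), dim=0)   # (T, 1) per-frame attention weights
        mean = (w * h).sum(dim=0)                     # attention-weighted mean over frames
        var = (w * (h - mean) ** 2).sum(dim=0)        # attention-weighted variance
        std = torch.sqrt(var.clamp(min=1e-8))         # attention-weighted standard deviation
        return torch.cat([mean, std], dim=-1)         # (2C,) utterance-level statistics


# A final linear layer (not shown) would map the pooled statistics to the
# 40-dimensional z-vector so it can replace the x-/i-vector in Detector.
```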

II-D Joint training and loss

When using z-vectors, Extractor and Detector are jointly trained from scratch using a binary cross-entropy loss between speaker labels and predicted SD results. The per-speaker losses are summed directly and back-propagated. When using x-vectors and i-vectors, Extractor is replaced by other extractors pretrained by the Kaldi recipe, and only Detector is trained using the same binary cross-entropy loss.
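A minimal sketch of the training objective described above, assuming Detector outputs per-speaker activity probabilities already aligned with the reference labels (no permutation matching is needed because the speaker order is fixed by the initial x-vector/AHC pass):

```python
import torch
import torch.nn.functional as F


def diarization_loss(pred: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Binary cross-entropy summed over speakers.

    pred, labels: (N, T) tensors; pred holds sigmoid outputs, labels are 0/1
    frame-level speaker activities. The mean-over-frames reduction is an assumption.
    """
    per_speaker = F.binary_cross_entropy(pred, labels, reduction="none").mean(dim=1)  # (N,)
    return per_speaker.sum()  # sum the per-speaker losses, as described above
```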

TABLE I: DER (%) of MTEAD with x-vector on the SWB+SRE dataset.
Ratio $|G|/(|G|+|L|)$   Threshold   Oracle   Ideal
0% 29.92 26.37 16.65
25% 29.82 25.91 12.03
50% 30.05 26.26 11.14
75% 29.80 26.22 10.36
100% 30.43 27.74 10.22

III Experiments and Results

Two corpora were used in our experiments: one was simulated from the Switchboard and NIST SRE datasets, and the other was CALLHOME. The results were evaluated by diarization error rate (DER) and Jaccard error rate (JER) [36, 37, 38], with a standard 250 ms collar. Overlapping speech frames were counted. Both Feature Mixer and the Across-time Contextualizer were implemented using BiLSTM. The Across-speaker Contextualizer can be implemented using Transformer [39] or BiLSTM. As the number of speakers is limited in both datasets (i.e., less than 10), a lightweight BiLSTM is sufficient to gather all information about the speakers. For model training, we used the Adam optimizer and Noam scheduler with a learning rate of 0.1 and a batch size of 8. The dimensions of the i-vector, x-vector, and z-vector were all set to 40. (See https://github.com/chinyi0523/MTEAD for details.)
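For reference, a sketch of how the Adam + Noam setup might be wired in PyTorch; the model dimension and warmup steps are assumptions not stated in the paper, and the 0.1 learning rate acts as a global scale on the schedule.

```python
import torch


def noam_lambda(step: int, model_dim: int = 256, warmup: int = 4000) -> float:
    # Noam schedule: linear warmup followed by inverse-square-root decay.
    step = max(step, 1)
    return model_dim ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)


# Hypothetical wiring (model is assumed to be an MTEAD instance):
# optimizer = torch.optim.Adam(model.parameters(), lr=0.1, betas=(0.9, 0.98), eps=1e-9)
# scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=noam_lambda)
# scheduler.step()  # called once per optimizer step
```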

III-A SWB+SRE Simulated Corpus

The total number of speakers in the 683 hours of data from the SRE and Switchboard datasets was 6,392. We simulated the training and evaluation data by Algorithm 1 in [28]. The speakers in the two sets did not overlap. To achieve 20% overlap, for the cases of two, three, and four speakers, the parameter $\beta$ in the algorithm was set to 3, 6, and 9 seconds, resulting in 137, 226, and 320 hours of data, respectively.

During training, the ground truth in Rich Transcription Time Marked (RTTM) format was used as the speaker occurrence labels for each session. During inference, we used x-vector/AHC to produce the RTTM file for each test session. The Threshold in x-vector/AHC was set based on the performance on the training set. Generation of speaker representations followed the methods described in Secs. II-A and II-D.

Results and discussion. First, we investigated the impact of the quality of speaker representations on SD performance. Extractor in MTEAD (cf. Fig. 1) was replaced with a pretrained x-vector extractor. During MTEAD training, each target speaker’s x-vector was extracted from all his/her original utterances in Switchboard and NIST-SRE (denoted as G (global)) or from the corresponding speech frames in a session based on the ground truth RTTM (denoted as L (local)). The ratio $|G|/(|G|+|L|)$ indicates the degree to which global speaker representations were used for training. The Threshold and Oracle tests denote that the number of clusters in AHC is automatically determined and set to the true number of speakers, respectively. Ideal denotes the (cheating) case where the target speaker’s x-vector was extracted from the test session based on the ground truth RTTM. As shown in Table I, the ratio has little impact on SD performance, showing that x-vectors extracted from a limited amount of speech in a training session are as effective as x-vectors extracted from a large amount of speech in the entire training dataset. Regardless of the ratio, all Ideal test conditions outperformed their Threshold and Oracle counterparts. These results demonstrate the importance of accurate speaker representations, showing that extracting better speaker representations is key to producing excellent SD results.

Next, we compared the performance of different speaker representation models, including z-vector, i-vector, and x-vector. As shown in Table II, while the MTEAD model with i-vectors or x-vectors showed better performance than the baseline x-vector/AHC method, the MTEAD model with z-vectors achieved the best performance. The results confirm that the z-vector jointly trained with the MTEAD model is more effective than the x-vector and i-vector obtained from pretrained models. Speaker representations targeting SD did yield better SD performance.

TABLE II: Results (%) on the SWB+SRE dataset. All MTEAD models are based on the initial diarization of x-vector/AHC.
Method   Threshold (DER, JER)   Oracle (DER, JER)
x-vector/AHC 38.49 53.38 40.38 52.58
MTEAD (i-vector) 23.72 36.21 19.50 29.24
MTEAD (x-vector) 25.16 38.91 18.61 28.88
MTEAD (z-vector) 23.06 32.67 13.46 18.95

III-B CALLHOME (LDC2001S97)

CALLHOME is a telephony dataset containing conversations in multiple languages. The dataset consists of a total of 500 conversations recorded at a sampling rate of 8 kHz. The number of speakers in each conversation varies from 2 to 7. Since the CALLHOME dataset is too small to train our model, we used the SWB+SRE dataset for pretraining.

During training, we pretrained the MTEAD models on SWB+SRE. We divided the CALLHOME set equally into two parts, following the Kaldi recipe. CALLHOME-1 was used for fine-tuning the pretrained models, while CALLHOME-2 was used for evaluation. We determined the Threshold in x-vector/AHC based on the performance on CALLHOME-1. During inference, we used x-vector/AHC to produce RTTM files on CALLHOME-2. We used the same methods as in the experiments on SWB+SRE to generate speaker representations.

As the number of speakers in a CALLHOME session varies from 2 to 7, it is necessary to pretrain and fine-tune a TS-VAD model for each speaker number separately. To this end, the SWB+SRE and CALLHOME-1 datasets were split into 2-, 3-, and 4-speaker subsets for training the corresponding TS-VAD models. For each TS-VAD model, we also trained the corresponding MTEAD* model using the same training data and procedure for fair comparison. MTEAD was trained on all training data containing different numbers of speakers.

TABLE III: Results (%) on CALLHOME based on i-vectors.
Method   Oracle #2 (DER, JER)   Oracle #3 (DER, JER)   Oracle #4 (DER, JER)
SA-EEND-EDA [29] 8.35 N/A 13.20 N/A 21.71 N/A
x-vector/AHC 9.17 24.94 15.24 37.04 20.28 45.35
TS-VAD 9.51 20.60 14.71 33.62 20.18 44.77
MTEAD* 8.72 17.90 14.50 33.60 18.15 43.24
MTEAD 7.82 17.87 13.10 32.43 18.12 39.02
TABLE IV: Results (%) on CALLHOME. All MTEAD models were based on the initial diarization of x-vector/AHC.
Method   Threshold/Estimated (DER, JER)   Oracle (DER, JER)
x-vector/AHC [40] 20.71 N/A 20.14 N/A
x-vector/AHC+VB [40] 19.51 N/A 18.61 N/A
SA-EEND-EDA [29] 15.29 N/A 15.43 N/A
MTEAD (i-vector) 14.52 30.09 14.10 27.92
MTEAD (x-vector) 14.55 30.01 13.15 26.80
MTEAD (z-vector) 14.31 29.21 12.66 24.56
TABLE V: DER (%) on the 2-speaker CALLHOME task.
Method   DER   Rel. DER reduction (%)
x-vector/AHC [29] 8.93 -
BLSTM-EEND [24] 23.07 -158.3
SA-EEND [28] 10.99 -23.1
SA-EEND-EDA [29] 8.35 6.5
SA-EEND-EDA + Frame Selection [30] 7.84 12.2
DIVE [27] 6.70 24.9
MTEAD 7.82 12.4

Results and discussion. First, we compared MTEAD with TS-VAD. TS-VAD, MTEAD*, and MTEAD were all based on the initial diarization of x-vector/AHC, and were all implemented with i-vectors. For each specific number of speakers, MTEAD* and TS-VAD were trained on the same training data. The experiments were conducted under the 2-, 3-, and 4-speaker conditions, assuming the number of speakers was known. The evaluation results on CALLHOME-2 are shown in Table III. It is clear that MTEAD* consistently outperformed the corresponding TS-VAD and x-vector/AHC baselines under each condition. Moreover, MTEAD outperformed MTEAD* because it was trained with all training data containing different numbers of speakers, whereas each MTEAD* model was trained with only the subset of training data with a specific number of speakers. These results demonstrate the advantage of the MTEAD Detector: it can be trained on all training data regardless of the number of speakers, which TS-VAD cannot do. Notably, after overcoming this weakness of TS-VAD, MTEAD with i-vectors already surpasses SA-EEND-EDA [29].

Next, we compared the performance of different speaker representation models, including the z-vector, i-vector, and x-vector. As can be seen from Table IV, all three MTEAD models outperformed not only the x-vector/AHC and x-vector/AHC+VB baselines, but also the strong SA-EEND-EDA model [29]. Furthermore, z-vector-based MTEAD outperformed i-vector-based and x-vector-based MTEAD under both Threshold and Oracle conditions, and larger improvements were observed under the Oracle condition. We speculate that this is because when the number of speakers is correct in the initial diarization, the model can estimate a more accurate z-vector for each speaker. In contrast, under the Threshold condition, an incorrect number of speakers predicted by x-vector/AHC may cause some z-vectors to not match actual speakers, resulting in smaller reductions in DER and JER. The results in Tables III and IV show that, with its well-designed Extractor and Detector, MTEAD is a flexible and effective SD model that can extract more accurate speaker representations and handle conversations with different numbers of speakers. Compared to x-vector/AHC, x-vector/AHC+VB, and SA-EEND-EDA, MTEAD with z-vector achieved relative DER reductions of 30.9%, 26.7%, and 6.4%, respectively.

Finally, we compared MTEAD with other models. Since most NN-based SD models were only evaluated in 2-speaker experiments, we compared different models on the 2-speaker CALLHOME task. From Table V, MTEAD outperformed all models except DIVE [27], with a 12.4% relative reduction in DER compared to x-vector/AHC [29]. According to Tables IV and V, MTEAD outperformed most of the models compared in this study in both 2-speaker and multi-speaker tasks. This study focuses on improving the shortcomings of TS-VAD. Although there are many other advanced end-to-end or unsupervised SD models [41, 42], we did not include them in the comparison due to their different training conditions and paper length constraints.

IV Conclusion

We have proposed the MTEAD model for speaker diarization, which consists of a Time-Speaker Contextualizer-based detector and a z-vector extractor. The detector allows MTEAD to handle conversations with varying numbers of speakers and to use data with any number of speakers during training, addressing the weaknesses of TS-VAD while retaining its strengths. The z-vector extractor, dedicated to speaker diarization, also improves performance compared to traditional i-vector and x-vector representations. The experimental datasets used in this study are relatively clean; we will explore more challenging tasks such as DIHARD or CHiME in the future.

References

  • [1] X. Anguera, S. Bozonnet, N. Evans, C. Fredouille, and G. Friedland, “Speaker diarization: A review of recent research,” IEEE Transactions on Audio, Speech and Language Processing, vol. 20, no. 2, pp. 356–370, 2012.
  • [2] N. Ryant, P. Singh, V. Krishnamohan, R. Varma, K. Church, C. Cieri, J. Du, S. Ganapathy, and M. Liberman, “The third DIHARD diarization challenge,” in Proc. Interspeech, 2021.
  • [3] M. Diez, L. Burget, and P. Matejka, “Speaker diarization based on Bayesian HMM with eigenvoice priors,” in Proc. Odyssey, 2018.
  • [4] Y. Fujita, N. Kanda, S. Horiguchi, K. Nagamatsu, and S. Watanabe, “End-to-end neural speaker diarization with permutation-free objectives,” in Proc. Interspeech, 2019.
  • [5] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, “X-vectors: Robust DNN embeddings for speaker recognition,” in Proc. ICASSP, 2018.
  • [6] D. Garcia-Romero, D. Snyder, G. Sell, D. Povey, and A. McCree, “Speaker diarization using deep neural network embeddings,” in Proc. ICASSP, 2017.
  • [7] M. Diez, L. Burget, S. Wang, J. Rohdin, and H. Černocký, “Bayesian HMM based x-vector clustering for speaker diarization,” in Proc. Interspeech, 2019.
  • [8] G. Sell, D. Snyder, A. Mccree, D. Garcia-Romero, J. Villalba, M. Maciejewski, V. Manohar, N. Dehak, D. Povey, S. Watanabe, and S. Khudanpur, “Diarization is hard: Some experiences and lessons learned for the JHU team in the inaugural DIHARD challenge,” in Proc. Interspeech, 2018.
  • [9] N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, “Front-end factor analysis for speaker verification,” IEEE Transactions on Audio, Speech and Language Processing, vol. 19, pp. 788–798, 2011.
  • [10] G. Sell and D. Garcia-Romero, “Speaker diarization with PLDA i-vector scoring and unsupervised calibration,” in Proc. IEEE SLT, 2014.
  • [11] L. Wan, Q. Wang, A. Papir, and I. L. Moreno, “Generalized end-to-end loss for speaker verification,” in Proc. ICASSP, 2018.
  • [12] A. Zhang, Q. Wang, Z. Zhu, J. Paisley, and C. Wang, “Fully supervised speaker diarization,” in Proc. ICASSP, 2018.
  • [13] S. Ioffe, “Probabilistic linear discriminant analysis,” in Proc. ECCV, 2006.
  • [14] S. J. Prince and J. H. Elder, “Probabilistic linear discriminant analysis for inferences about identity,” in Proc. IEEE ICCV, 2007.
  • [15] P. Kenny, T. Stafylakis, P. Ouellet, M. Jahangir Alam, and P. Dumouchel, “PLDA for speaker verification with utterances of arbitrary duration,” in Proc. ICASSP, 2013.
  • [16] H. Ning, M. Liu, H. Tang, and T. Huang, “A spectral clustering approach to speaker diarization,” in Proc. Interspeech, 2006.
  • [17] T. J. Park, K. J. Han, J. Huang, X. He, B. Zhou, P. Georgiou, and S. Narayanan, “Speaker diarization with lexical information,” in Proc. Interspeech, 2019.
  • [18] S. Bozonnet, N. W. Evans, and C. Fredouille, “The lia-eurecom RT’09 speaker diarization system: Enhancements in speaker modelling and cluster purification,” in Proc. ICASSP, 2010.
  • [19] S. H. Shum, N. Dehak, R. Dehak, and J. R. Glass, “Unsupervised methods for speaker diarization: An integrated and iterative approach,” IEEE Transactions on Audio, Speech and Language Processing, vol. 21, no. 10, pp. 2015–2028, 2013.
  • [20] G. Sell and D. Garcia-Romero, “Diarization resegmentation in the factor analysis subspace,” in Proc. ICASSP, 2015.
  • [21] M. Sahidullah, J. Patino, S. Cornell, R. Yin, S. Sivasankaran, H. Bredin, P. Korshunov, A. Brutti, R. Serizel, E. Vincent, N. Evans, S. Marcel, S. Squartini, and C. Barras, “The speed submission to DIHARD II: Contributions & lessons learned,” 2019. [Online]. Available: http://arxiv.org/abs/1911.02388
  • [22] Y. Fujita, S. Watanabe, S. Horiguchi, Y. Xue, and K. Nagamatsu, “End-to-End neural diarization: Reformulating speaker diarization as simple multi-label classification,” 2020. [Online]. Available: http://arxiv.org/abs/2003.02966
  • [23] T. von Neumann, K. Kinoshita, M. Delcroix, S. Araki, T. Nakatani, and R. Haeb-Umbach, “All-neural online source separation, counting, and diarization for meeting analysis,” in Proc. ICASSP, 2018.
  • [24] Y. Fujita, N. Kanda, S. Horiguchi, K. Nagamatsu, and S. Watanabe, “End-to-end neural speaker diarization with permutation-free objectives,” in Proc. Interspeech, 2019.
  • [25] L. Bullock, H. Bredin, and L. P. Garcia-Perera, “Overlap-aware diarization: Resegmentation using neural end-to-end overlapped speech detection,” in Proc. ICASSP, 2020.
  • [26] D. Yu, M. Kolbæk, Z.-H. Tan, and J. Jensen, “Permutation invariant training of deep models for speaker-independent multi-talker speech separation,” in Proc. ICASSP, 2017.
  • [27] N. Zeghidour, O. Teboul, and D. Grangier, “DIVE: End-to-end speech diarization via iterative speaker embedding,” in Proc. IEEE ASRU, 2021.
  • [28] Y. Fujita, N. Kanda, S. Horiguchi, Y. Xue, K. Nagamatsu, and S. Watanabe, “End-to-end neural speaker diarization with self-attention,” in Proc. IEEE ASRU, 2019.
  • [29] S. Horiguchi, Y. Fujita, S. Watanabe, Y. Xue, and K. Nagamatsu, “End-to-end speaker diarization for an unknown number of speakers with encoder-decoder based attractors,” in Proc. Interspeech, 2020.
  • [30] S. Horiguchi, P. Garcia, Y. Fujita, S. Watanabe, and K. Nagamatsu, “End-to-end speaker diarization as post-processing,” in Proc. ICASSP, 2021.
  • [31] I. Medennikov, M. Korenevsky, T. Prisyach, Y. Khokhlov, M. Korenevskaya, I. Sorokin, T. Timofeeva, A. Mitrofanov, A. Andrusenko, I. Podluzhny, A. Laptev, and A. Romanenko, “Target-speaker voice activity detection: A novel approach for multi-speaker diarization in a dinner party scenario,” in Proc. Interspeech, 2020.
  • [32] Y. Luo, Z. Chen, and T. Yoshioka, “Dual-path RNN: Efficient long sequence modeling for time-domain single-channel speech separation,” in Proc. ICASSP, 2020.
  • [33] C. Li, Y. Luo, C. Han, J. Li, T. Yoshioka, T. Zhou, M. Delcroix, K. Kinoshita, C. Boeddeker, Y. Qian, S. Watanabe, and Z. Chen, “Dual-path RNN for long recording speech separation,” in Proc. IEEE SLT, 2021.
  • [34] W. Wang, Q. Lin, D. Cai, and M. Li, “Similarity measurement of segment-level speaker embeddings in speaker diarization,” IEEE/ACM Transactions on Audio Speech and Language Processing, vol. 30, pp. 2645–2658, 2022.
  • [35] K. Okabe, T. Koshinaka, and K. Shinoda, “Attentive statistics pooling for deep speaker embedding,” in Proc. Interspeech, 2018.
  • [36] L. Hamers, Y. Hemeryck, G. Herweyers, M. Janssen, H. Keters, R. Rousseau, and A. Vanhoutte, “Similarity measures in scientometric research: The Jaccard index versus Salton’s cosine formula,” Information Processing and Management, vol. 25, no. 3, pp. 315–318, 1989.
  • [37] R. Real and J. M. Vargas, “The probabilistic basis of Jaccard’s index of similarity,” Systematic Biology, vol. 45, no. 3, pp. 380–385, 1996.
  • [38] N. Ryant, K. Church, C. Cieri, A. Cristia, J. Du, S. Ganapathy, and M. Liberman, “The second DIHARD diarization challenge: Dataset, task, and baselines,” in Proc. Interspeech, 2019.
  • [39] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” in Proc. NIPS, 2017.
  • [40] Z. Huang, S. Watanabe, Y. Fujita, P. García, Y. Shao, D. Povey, and S. Khudanpur, “Speaker diarization with region proposal network,” in Proc. ICASSP, 2020.
  • [41] Y. Dissen, F. Kreuk, and J. Keshet, “Self-supervised speaker diarization,” in Proc. Interspeech, 2022.
  • [42] K. Kinoshita, M. Delcroix, and T. Iwata, “Tight integration of neural- and clustering-based diarization through deep unfolding of infinite Gaussian mixture model,” in Proc. ICASSP, 2022.