Auto-Tuning Spectral Clustering for Speaker Diarization Using Normalized Maximum Eigengap
Abstract
We propose a new spectral clustering framework that can auto-tune the parameters of the clustering algorithm in the context of speaker diarization. The proposed framework uses normalized maximum eigengap (NME) values to estimate the number of clusters as well as the parameters for the threshold of the elements of each row in an affinity matrix during spectral clustering, without any parameter tuning on a development set. Even with this hands-off approach, we achieve comparable or better performance across various evaluation sets than with the traditional clustering methods that use careful parameter tuning and development data. The relative improvement of 17% in terms of speaker error rate in the well-known CALLHOME evaluation set shows the effectiveness of our proposed auto-tuning spectral clustering.
Index Terms:
Auto-Tuning, Spectral Clustering, Eigengap Heuristic, Speaker Diarization
I Introduction
Speaker diarization is the problem of identifying “who spoke when” and assigning speaker identity labels in a given audio stream. In general, speaker diarization systems are composed of three major parts: a speech segmentation module, a speaker embedding extractor, and a clustering module. The speech segmentation module removes non-speech parts from the audio stream and breaks the speech parts into segments that are supposedly homogeneous in terms of speaker identity. The speaker embedding extractor captures the speaker characteristics of a given segment in a speaker embedding, which is a learned low-dimensional vector representation. Finally, the clustering module groups the speaker embeddings from the same speakers into the same clusters.
Spectral clustering has been widely adopted in numerous speaker diarization studies [1, 2, 3, 4, 5, 6]. Spectral clustering is a graph-based clustering technique that uses an affinity matrix, each element of which is the distance between a pair of speaker embeddings. Through the Laplacian matrix computations, the affinity matrix is converted into spectral embeddings, which are clustered by the k-means algorithm [7]. Despite its popularity, spectral clustering has the limitation that its performance is sensitive to the quality of the affinity matrix. Due to the noisy nature of speaker embeddings and distance metrics, it is highly likely that some elements of the affinity matrix carry noisy signals that could degrade the clustering process. To address this issue, the spectral clustering algorithms in recent studies employ either a scaling parameter [1, 3, 2] or a row-wise thresholding parameter [5] to put different weights across the elements of the affinity matrix. The downside of these approaches is that the parameters for either scaling or thresholding need to be optimized on a development set to obtain the maximum benefit. The burden of such hyper-parameter tuning in spectral clustering makes it harder to generalize the clustering algorithm to unseen testing environments.
In this paper, we propose a new spectral clustering framework that self-tunes the parameters of clustering so that there is no need for any hyper-parameter tuning with a development dataset. More specifically, our proposed framework estimates both the threshold p for row-wise binarization of a given affinity matrix and the number of clusters k. To estimate these parameters without the help of a development set, we use the normalized maximum eigengap (NME) value, g_p, which depends on p and can be obtained by the eigengap heuristic [16]. We hypothesize that there exists a piecewise linear relationship between p and g_p, and that the ratio p/g_p is a good proxy for the diarization error rate (DER). Using this ratio, we can select the p value at which DER would presumably be the lowest.
To show the experimental evidence, we compare the proposed clustering method with widely used clustering methods that need to be optimized on a development set. Our proposed method is compared with the well-known spectral clustering approach [8] that has appeared in a number of speaker diarization studies [1, 3, 2], and with the agglomerative hierarchical clustering (AHC) approach coupled with probabilistic linear discriminant analysis (PLDA) [9, 10], which has also appeared in recent studies [11, 12, 13]. In addition, the performance of the development-set-optimized version of our proposed spectral clustering method is also tested to verify the benefit of our NME-based auto-tuning approach. The experimental results reveal that the ratio p/g_p is a good proxy for DER, and that the proposed auto-tuning approach can show comparable or even better performance than widely used development-set-optimized clustering algorithms.
II Traditional Spectral Clustering Algorithm
II-A Ng-Jordan-Weiss (NJW) Algorithm
Spectral clustering is a graph-based clustering technique based on an affinity matrix and its eigenvalues. The affinity matrix is a similarity matrix for a given set of data points, and each element is determined by the distance between a pair of data points in the given input. This algorithm is widely used in diverse fields, such as image segmentation [14], multi-type relational data [15], and speaker diarization [1, 2, 3, 4, 5, 6], due to its simple implementation and decent performance. Among the many variants of spectral clustering algorithms, the Ng-Jordan-Weiss (NJW) algorithm [8] has been the most widely used for speaker diarization tasks. The NJW algorithm consists of three main steps: creation of the affinity matrix, Laplacian matrix computations, and k-means clustering [7]. To form an affinity matrix, the NJW algorithm employs a kernel method. The similarity measure, which we refer to as d(i, j), between two speaker embeddings w_i and w_j from two speech segments is obtained by the cosine similarity measure:
d(i, j) = \frac{\mathbf{w}_i \cdot \mathbf{w}_j}{\|\mathbf{w}_i\| \, \|\mathbf{w}_j\|}    (1)
Each entry a_ij in the affinity matrix A is defined as follows:
a_{ij} = \exp\!\left( -\frac{d(i, j)^2}{\sigma^2} \right)    (2)
where σ is a scaling factor that needs to be tuned. A can be considered as an undirected graph G = (V, E), where V represents vertices and E represents undirected edges. In the NJW algorithm, this affinity matrix is normalized with the diagonal degree matrix D as follows:
L = D^{-1/2} A D^{-1/2}    (3)
where d_ii = Σ_j a_ij and n is the dimension of A. The normalized matrix L is used to find the eigenvectors, among which the k eigenvectors with the largest eigenvalues form a spectral embedding matrix of size n × k. Each row in this spectral embedding matrix is clustered into one of k clusters using the k-means clustering.
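For illustration, below is a minimal NumPy/scikit-learn sketch of the NJW pipeline described above. It assumes the speaker embeddings are stacked as rows of an (n, d) array and that the number of clusters k and the scaling factor σ are already known (e.g., tuned on a development set); the function name and the kernel form follow our reading of (1)–(3) and are not the authors' reference implementation.

```python
import numpy as np
from sklearn.cluster import KMeans


def njw_spectral_clustering(embeddings, k, sigma):
    """NJW spectral clustering sketch: kernel affinity, normalization, k-means."""
    # Cosine similarity between every pair of embeddings, as in (1).
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    d = unit @ unit.T

    # Kernelized affinity matrix with a tuned scaling factor sigma, as in (2).
    a = np.exp(-(d ** 2) / sigma ** 2)

    # Normalize with the diagonal degree matrix, as in (3).
    deg_inv_sqrt = np.diag(1.0 / np.sqrt(a.sum(axis=1)))
    l_norm = deg_inv_sqrt @ a @ deg_inv_sqrt

    # Spectral embedding matrix from the k eigenvectors with the largest eigenvalues.
    _, eigvecs = np.linalg.eigh(l_norm)      # eigenvalues in ascending order
    spectral = eigvecs[:, -k:]               # n x k spectral embedding matrix

    # Cluster each row of the spectral embedding matrix with k-means.
    return KMeans(n_clusters=k, n_init=10).fit_predict(spectral)
```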
II-B Limitations of the Traditional Spectral Clustering
Despite its success, the NJW algorithm has inherent limitations in the context of speaker diarization.
II-B1 Sensitivity to the Quality of an Affinity Matrix
The similarity values we obtain from distance measures, for example, the cosine similarity in (1), for an affinity matrix are merely estimates and depend on how representative the speaker embeddings are in terms of speaker characteristics. It is likely that some entries in the affinity matrix carry noisy signals that could degrade the clustering process down the road. Thus, without a proper scheme to mitigate the effect of such inaccurate information in the affinity matrix, noisy similarity values could lead to a poor clustering result.
II-B2 Adaptivity to the Variability in Data
Due to the above issue, a number of schemes have been proposed in the literature to put different weights across the elements of the affinity matrix. In previous studies, [5] kept only the entries of each row of the affinity matrix above a percentile-based threshold, and [1, 3] used scaling factors to control the weights of each element of the affinity matrix. The downside of these approaches is that the parameters for either thresholding or scaling need to be tuned using a development dataset. This could make the clustering performance dependent on how the development data is selected. Requiring such hyper-parameter tuning becomes a burden when generalizing the clustering algorithm to unseen testing conditions.
III Normalized Maximum Eigengap Analysis
In this section, we present the details of our proposed spectral clustering framework using the NME analysis. The procedure is described in Algorithm 1 (implementation available at https://github.com/tango4j/Auto-Tuning-Spectral-Clustering). The following is the itemized description:
III-A Steps of the Proposed Clustering Method
III-A1 Affinity Matrix
Unlike the NJW algorithm, the affinity matrix A in our proposed framework is formed with raw cosine similarity values as in (1), without a kernel or a scaling parameter. From all n speech segments in the given input utterance, we get n × n similarity values as below:
A(i, j) = \frac{\mathbf{w}_i \cdot \mathbf{w}_j}{\|\mathbf{w}_i\| \, \|\mathbf{w}_j\|}    (4)
where i and j are indexes of the speech segments.
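As a concrete sketch of this step, assuming the embeddings of all n segments are stacked as rows of a NumPy array (the function name is ours, used only for illustration):

```python
import numpy as np


def cosine_affinity(embeddings):
    """Raw cosine-similarity affinity matrix A of (4), with no kernel or scaling."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return unit @ unit.T  # n x n matrix; entry (i, j) is the cosine similarity
```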
III-A2 p-Neighbor Binarization
The cosine similarity values in the affinity matrix are binarized to either 0 or 1 to mitigate the effect of unreliable similarity values. This can be done by converting the p largest similarity values in each row to 1 while zeroing out the rest of the values. p is an integer and is determined by the NME analysis described later.
A_p(i, j) = \begin{cases} 1, & \text{if } A(i, j) \text{ is among the } p \text{ largest elements of row } i \text{ of } A \\ 0, & \text{otherwise} \end{cases}    (5)
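A sketch of the row-wise binarization with a hypothetical helper name; note that, with raw cosine similarities, the p largest entries of each row include the diagonal self-similarity:

```python
import numpy as np


def binarize_rows(affinity, p):
    """p-neighbor binarization of (5): keep the p largest entries per row as 1."""
    n = affinity.shape[0]
    binarized = np.zeros_like(affinity)
    # Column indices of the p largest similarities in each row.
    top_p = np.argpartition(affinity, -p, axis=1)[:, -p:]
    binarized[np.arange(n)[:, None], top_p] = 1.0
    return binarized
```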
III-A3 Symmetrization
To transform the affinity matrix into an undirected adjacency matrix from a graph theory perspective, we perform symmetrization by taking the average of A_p and its transpose as follows:
\bar{A}_p = \frac{1}{2}\left( A_p + A_p^{T} \right)    (6)
III-A4 Laplacian
Compute the unnormalized graph Laplacian L_p from the symmetrized matrix:
L_p = D_p - \bar{A}_p    (7)
where D_p is the diagonal degree matrix with entries d_i = Σ_j Ā_p(i, j).
III-A5 Singular Value Decomposition (SVD)
Perform the SVD to obtain the eigenvalues of the Laplacian matrix L_p:
L_p = U_p \Sigma_p V_p^{T}    (8)
III-A6 Eigengap Vector
Create an eigengap vector e_p as follows, using the eigenvalues obtained from (8):
\mathbf{e}_p = \left[\, \lambda_2 - \lambda_1, \; \lambda_3 - \lambda_2, \; \ldots, \; \lambda_n - \lambda_{n-1} \,\right]    (9)
where λ_k is the k-th smallest eigenvalue (sorted in ascending order), given the p value used for the binarization process in step 2).
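Steps 3) through 6) can be sketched as follows (helper names are ours). Since L_p is symmetric and positive semi-definite, an eigendecomposition with np.linalg.eigh yields the same spectrum as the SVD in (8), so it is used here for simplicity:

```python
import numpy as np


def laplacian_eigenvalues(binarized):
    """Symmetrize as in (6), build the unnormalized Laplacian of (7), and return
    its eigenvalues in ascending order along with the corresponding eigenvectors."""
    sym = 0.5 * (binarized + binarized.T)
    laplacian = np.diag(sym.sum(axis=1)) - sym
    eigvals, eigvecs = np.linalg.eigh(laplacian)  # same spectrum as the SVD in (8)
    return eigvals, eigvecs


def eigengap_vector(eigvals):
    """Eigengap vector e_p of (9): differences between consecutive sorted eigenvalues."""
    return np.diff(eigvals)
```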
III-A7 Normalized Maximum Eigengap (NME)
The NME analysis is the most critical part of the auto-tuning in the proposed spectral clustering algorithm, as we compare the NME values across every p and determine the proper p at which DER is presumably minimized. We will discuss this in more detail in Section IV. The NME value g_p for a given p is defined as below:
g_p = \frac{\max(\mathbf{e}_p)}{\lambda_n + \epsilon}    (10)
where λ_n is the largest eigenvalue of L_p and ε is a very small value added to avoid division by zero. We obtain the ratio between the pruning threshold p for the row-wise binarization and the NME value g_p:
r_p = \frac{p}{g_p}    (11)
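Following our reading of (10) and (11) above (maximum eigengap normalized by the largest eigenvalue, then divided into p), the NME value and the ratio can be sketched as below; the constant EPS stands in for ε, whose exact value is not reproduced here:

```python
import numpy as np

EPS = 1e-10  # stand-in for the small constant in (10); the exact value is an assumption


def nme_and_ratio(eigvals, p):
    """NME value g_p of (10) and the ratio r_p = p / g_p of (11)."""
    gaps = np.diff(eigvals)                  # eigengap vector e_p
    g_p = gaps.max() / (eigvals[-1] + EPS)   # max eigengap normalized by the largest eigenvalue
    return g_p, p / g_p
```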
III-A8 Estimation of p̂ and k̂
The value r_p is calculated for every p and stored in the list r as below:
\mathbf{r} = \left[\, r_1, r_2, \ldots, r_n \,\right]    (12)
Based on our observation, which we discuss further in Section IV, r_p is a very good proxy for DER. Thus, we find the p value that gives the minimum value in r. Consequently, the parameter p̂ attempts to minimize DER:
\hat{p} = \arg\min_{p} \, r_p    (13)
With this p̂, we estimate the number of clusters k̂:
\hat{k} = \arg\max_{k} \, \mathbf{e}_{\hat{p}}(k)    (14)
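Putting the pieces together, a sketch of the search in (12)–(14): sweep p, keep the candidate with the smallest ratio r_p, and read k̂ off the position of the maximum eigengap. The helpers are the hypothetical ones from the earlier sketches, and the search range shown here (all p from 1 to n − 1) is a simplification that may differ from the range used in the experiments:

```python
import numpy as np


def estimate_p_and_k(embeddings, p_candidates=None):
    """Auto-tune p as in (13) and estimate the number of clusters k as in (14)."""
    affinity = cosine_affinity(embeddings)
    n = affinity.shape[0]
    if p_candidates is None:
        p_candidates = range(1, n)
    best = None  # (r_p, p, eigenvalues) of the best candidate so far
    for p in p_candidates:
        eigvals, _ = laplacian_eigenvalues(binarize_rows(affinity, p))
        _, r_p = nme_and_ratio(eigvals, p)
        if best is None or r_p < best[0]:
            best = (r_p, p, eigvals)
    _, p_hat, eigvals = best
    # (14): the estimated number of clusters is the position of the maximum eigengap.
    k_hat = int(np.argmax(np.diff(eigvals))) + 1
    return p_hat, k_hat
```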
III-A9 Spectral Embedding
We take the k̂ smallest eigenvalues and their corresponding eigenvectors to obtain the matrix S of k̂-dimensional spectral embeddings:
S = \left[\, \mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_{\hat{k}} \,\right] \in \mathbb{R}^{n \times \hat{k}}    (15)
III-A10 k-means Clustering
We use the k-means clustering algorithm [7] on the rows of S to obtain k̂ clusters.
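Finally, steps 9) and 10) as a sketch that ties the hypothetical helpers from the previous steps together; the commented line at the end shows a toy usage with random embeddings standing in for real speaker embeddings:

```python
import numpy as np
from sklearn.cluster import KMeans


def nme_spectral_clustering(embeddings):
    """End-to-end sketch of the proposed NME-based auto-tuning spectral clustering."""
    p_hat, k_hat = estimate_p_and_k(embeddings)
    affinity = cosine_affinity(embeddings)
    _, eigvecs = laplacian_eigenvalues(binarize_rows(affinity, p_hat))
    # (15): eigenvectors of the k_hat smallest eigenvalues form the spectral embeddings.
    spectral = eigvecs[:, :k_hat]
    return KMeans(n_clusters=k_hat, n_init=10).fit_predict(spectral)


# Toy usage: labels = nme_spectral_clustering(np.random.randn(50, 128))
```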
IV Validation of NME Analysis
Since our approach of pruning the graph connections of the affinity matrix with the p-neighbor binarization scheme is heavily dependent on the value of p, an in-depth analysis is needed of the relationship between the NME value g_p and the pruning parameter p.
IV-A Eigengap and the Purity of Clusters
It has been known that the size of the eigengap can be used as a quality criterion for spectral clustering [16]. The relation of the size of the eigengap to the purity of clusters has been investigated in [16, 17] using perturbation theory, more specifically the Davis-Kahan theorem. In this work, we use the NME value to gauge the purity of the clusters, as the purity is directly linked to speaker diarization performance. In doing so, we search for the most probable number of clusters and the most adequate pruning parameter p together using the eigenvalues.
IV-B g_p Versus p
Having a higher p value in an affinity matrix generally leads to a larger g_p value with a higher purity measure of the clusters, since the graph gets more connections within each cluster. However, since all the connections have the equal weight of 1, an excessive number of connections (i.e., a high p value) gives rise to a poor estimation of the number of clusters and a poor diarization result, even though it gives a high g_p value. This can be easily understood by thinking of an affinity matrix whose elements are all equal to 1, which would always yield only one cluster regardless of the actual number of clusters. As depicted in Fig. 1(a), we see a gradual increase of g_p as p increases, while this tendency stops at a certain point in Fig. 1(a). As we increase p beyond that point, the estimated number of clusters drops and g_p increases again, meaning that we get a higher g_p value with a smaller estimated number of clusters.
Table I:

| Oracle SAD | COS+NJW-SC | COS+AHC | PLDA+AHC | COS+B-SC | COS+NME-SC |
|---|---|---|---|---|---|
| | Spk. Err. (DER) | Spk. Err. (DER) | Spk. Err. (DER) | Spk. Err. (DER) | Spk. Err. (DER) |
| CALLHOME | 24.05 | 21.13 | 8.39 | 8.78 | 7.29 |
| CHAES-eval | 30.31 | 31.99 | 24.27 | 4.4 | 2.48 |
| CH109 | 13.06 | 29.8 | 9.72 | 2.25 | 2.63 |
| RT03 | 6.56 | 5.66 | 1.73 | 0.88 | 2.21 |
Table II:

| System SAD | COS+NJW-SC | | COS+AHC | | PLDA+AHC | | COS+B-SC | | COS+NME-SC | |
|---|---|---|---|---|---|---|---|---|---|---|
| | DER | Spk. Err. | DER | Spk. Err. | DER | Spk. Err. | DER | Spk. Err. | DER | Spk. Err. |
| CALLHOME | 26.99 | 20.67 | 20.14 | 13.82 | 12.96 | 6.64 | 13.23 | 6.91 | 11.73 | 5.41 |
| CHAES-eval | 12.04 | 7.73 | 9.96 | 5.85 | 5.52 | 1.45 | 5.07 | 1.00 | 5.04 | 0.97 |
| CH109 | 5.85 | 1.56 | 28.92 | 24.63 | 6.89 | 2.6 | 5.75 | 1.46 | 5.61 | 1.32 |
| RT03 | 6.42 | 3.88 | 6.24 | 4.7 | 3.53 | 0.99 | 3.1 | 0.56 | 3.13 | 0.59 |
IV-C r_p as a Good Proxy for DER
Since an excessive number of connections leads to poor clustering results, the p value should be minimized to get an accurate number of clusters, while the g_p value should be maximized to get a higher purity of clusters. Thus, we calculate the ratio r_p to find the best p value, measuring how large p is in proportion to g_p. It is clearly shown in Fig. 1(b) and Fig. 1(c) that the ratio of p to g_p, namely r_p, is a very good proxy for DER. As described in (11), r_p indicates the slope in the g_p versus p plot. The lowest r_p value means that the resulting clusters have the highest purity measure in proportion to p. In Fig. 1, the solid vertical lines show the estimated point of the lowest DER, while the dotted vertical lines indicate where the actual DER is the lowest.
V Experimental Results
V-A Test Setup
To isolate the contribution of the clustering algorithms, we used the same speaker embedding extractor proposed in [18, 13] for all the experiments in this work. The evaluation method and metrics followed [19]. The estimated number of speakers was limited to a maximum of eight speakers for all the experiments. We tested the following five clustering algorithms:
V-A1 COS+NJW-SC
The NJW spectral clustering algorithm [8] described in Section II, applied to a cosine-similarity-based affinity matrix; the scaling factor σ is tuned on a development set.
V-A2 COS+AHC
Agglomerative hierarchical clustering on the cosine similarity scores, with the clustering threshold tuned on a development set.
V-A3 PLDA+AHC
AHC coupled with PLDA scoring [9, 10], as used in [11, 12, 13], with the clustering threshold tuned on a development set.
V-A4 COS+B-SC
This is our proposed spectral clustering framework with the p-neighbor binarization scheme only, without the NME-based auto-tuning approach; i.e., p is optimized on each development set instead of using p̂ from (13).
V-A5 COS+NME-SC
This is our proposed NME-based clustering algorithm, which includes the proposed auto-tuning approach. No hyper-parameter tuning or optimization is done. The p value is searched for each utterance over a range determined by n, the total number of segments in the given input utterance.
V-B Datasets
V-B1 NIST SRE 2000 (LDC2001S97)
This dataset is the most widely used for speaker diarization in the recent literature and is referred to as CALLHOME. CALLHOME contains two to seven speakers per utterance. For the CALLHOME dataset, twofold cross validation is conducted to match the test conditions of [18, 13] for all the experiments.
V-B2 CALLHOME American English Speech (CHAES) (LDC97S42)
This is a corpus that contains only English speech data, with two to four speakers per utterance. CHAES is divided into a train, dev, and eval set, and we report the results on the eval set. Both the train set and the dev set are used for parameter tuning. The subset of CHAES that contains only two speakers is referred to as CH109 in the literature, and the CH109 dataset is tested by providing the number of speakers in advance to all the tested systems (i.e., no estimation of the number of speakers is involved for CH109). The rest of the utterances in CHAES are used as a dev set for CH109.
V-B3 RT03 (LDC2007S10)
RT03 is an English dataset that contains utterances with two to four speakers. We use the split provided by the authors of [5], using only the Switchboard utterances.
V-C Experiments
V-C1 Oracle SAD
Table I shows the experimental results with the oracle SAD. Note that, except for the RT03 dataset, NME-SC shows very competitive performance with no parameter tuning at all. The DER of NME-SC is impressive especially for the CALLHOME dataset, where each utterance has a varying number of speakers and our proposed auto-tuning approach is most advantageous.
V-C2 System SAD
Table II shows the experimental results with the system SAD. We used the ASpIRE SAD model [20], which is publicly available. With the system SAD setting, which is closer to in-the-wild scenarios, NME-SC outperforms all the other methods except on RT03, where NME-SC shows performance very close to the dev-set-optimized COS+B-SC method.
V-D Discussions
The performance gain from NJW-SC to B-SC indicates that the p-neighbor binarization scheme with the unnormalized Laplacian approach can be effective, as it shows a very distinct performance difference. More importantly, the performance gain from B-SC to NME-SC shows that the value of p can be effectively auto-tuned even without optimizing on a development set. We also see the performance improvement of NME-SC over PLDA+AHC, hinting that our proposed clustering scheme can achieve a competitive speaker diarization result without employing PLDA as a distance measure. These results all validate the effectiveness of the proposed auto-tuning spectral clustering framework with the NME analysis.
VI Conclusions
In this paper, a new framework of auto-tuning spectral clustering was introduced. The experimental results show that our proposed NME-based spectral clustering method is competitive in performance while not requiring any hyper-parameter tuning. It is promising that the proposed method outperformed the widely used AHC method with PLDA. Future work includes a theoretical analysis of why the ratio of the tuning parameter p to the NME value g_p is a proxy for DER, and how this generalizes to various data in real production systems.
References
- [1] H. Ning, M. Liu, H. Tang, and T. S. Huang, “A spectral clustering approach to speaker diarization,” in Proc. 9th Int. Conf. Spoken Lang. Process., Sep. 2006, pp. 2178–2181.
- [2] J. Luque and J. Hernando, “On the use of agglomerative and spectral clustering in speaker diarization of meetings,” in Proc. Odyssey: The Speaker and Lang. Recognit. Workshop, Jun. 2012, pp. 130–137.
- [3] S. Shum, N. Dehak, and J. Glass, “On the use of spectral and iterative methods for speaker diarization,” in Proc. 13th Annual Conf. Int. Speech Commun. Association, Sep. 2012, pp. 482–485.
- [4] S. H. Shum, N. Dehak, R. Dehak, and J. R. Glass, “Unsupervised methods for speaker diarization: An integrated and iterative approach,” IEEE Trans. Audio, Speech, Lang. Process., vol. 21, no. 10, pp. 2015–2028, May 2013.
- [5] Q. Wang, C. Downey, L. Wan, P. A. Mansfield, and I. L. Moreno, “Speaker diarization with LSTM,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., Apr. 2018, pp. 5239–5243.
- [6] Q. Lin, R. Yin, M. Li, H. Bredin, and C. Barras, “LSTM based similarity measurement with spectral clustering for speaker diarization,” in Proc. INTERSPEECH, Sep. 2019, pp. 366–370.
- [7] S. Lloyd, “Least squares quantization in PCM,” IEEE Trans. Inf. Theory, vol. 28, no. 2, pp. 129–137, 1982.
- [8] A. Y. Ng, M. I. Jordan, and Y. Weiss, “On spectral clustering: Analysis and an algorithm,” in Proc. Adv. Neural Inf. Process. Syst., Dec. 2002, pp. 849–856.
- [9] S. Ioffe, “Probabilistic linear discriminant analysis,” in Proc. Eur. Conf. Comput. Vision, May 2006, pp. 531–542.
- [10] S. J. Prince and J. H. Elder, “Probabilistic linear discriminant analysis for inferences about identity,” in Proc. IEEE 11th Int. Conf. Comput. Vision, Oct. 2007, pp. 1–8.
- [11] D. Garcia-Romero, D. Snyder, G. Sell, D. Povey, and A. McCree, “Speaker diarization using deep neural network embeddings,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., Mar. 2017, pp. 4930–4934.
- [12] G. Sell, D. Snyder, A. McCree, D. Garcia-Romero, J. Villalba, M. Maciejewski, V. Manohar, N. Dehak, D. Povey, S. Watanabe et al., “Diarization is hard: Some experiences and lessons learned for the JHU team in the inaugural DIHARD challenge.” in Proc. INTERSPEECH, Sep. 2018, pp. 2808–2812.
- [13] D. Snyder, “Callhome diarization recipe using x-vectors,” Github, May 4, 2018. [Online]. Available: https://david-ryan-snyder.github.io/2018/05/04/model_callhome_diarization_v2.html, [Accessed Oct. 9, 2019].
- [14] L. Zelnik-Manor and P. Perona, “Self-tuning spectral clustering,” in Proc. Adv. Neural Inf. Process. Syst., Dec. 2005, pp. 1601–1608.
- [15] B. Long, Z. M. Zhang, X. Wu, and P. S. Yu, “Spectral clustering for multi-type relational data,” in Proc. 23rd Int. Conf. Mach. Learn., Jun. 2006, pp. 585–592.
- [16] U. Von Luxburg, “A tutorial on spectral clustering,” Statist. and Comput., vol. 17, no. 4, pp. 395–416, 2007.
- [17] G. W. Stewart and J. Sun, Matrix Perturbation Theory. Boston, MA, USA: Academic Press, 1990.
- [18] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, “X-vectors: Robust DNN embeddings for speaker recognition,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., Apr. 2018, pp. 5329–5333.
- [19] J. G. Fiscus, J. Ajot, M. Michel, and J. S. Garofolo, “The rich transcription 2006 spring meeting recognition evaluation,” in Proc. Int. Workshop Mach. Learn. Multimodal Interaction, May 2006, pp. 309–322.
- [20] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz et al., “The Kaldi speech recognition toolkit,” in Proc. IEEE Workshop Autom. speech Recognit. and Underst., Dec. 2011.