Wespeaker baselines for VoxSRC2023

Abstract

This report showcases the results achieved using the wespeaker toolkit for the VoxSRC2023 Challenge. Our aim is to provide participants, especially those with limited experience, with clear and straightforward guidelines to develop their initial systems. Via well-structured recipes and strong results, we hope to offer an accessible and good enough start point for all interested individuals. In this report, we describe the results achieved on the VoxSRC2023 dev set using the pretrained models, you can check the CodaLab evaluation server for the results on the evaluation set. Any feedback and contribution are always welcome

Index Terms: wespeaker, voxsrc2023

1 The VoxSRC Challenges

The VoxSRC (VoxCeleb Speaker Recognition Challenge) is an annual competition that focuses on the task of speaker recognition using the VoxCeleb dataset. Speaker recognition is a field within audio processing that aims to identify and authenticate individuals based on their unique vocal characteristics.

The VoxSRC Challenge serves as a platform for researchers and practitioners to showcase their advancements in speaker recognition technology. It provides a standardized evaluation framework, allowing participants to compare their methods and algorithms against each other.

VoxSRC 2023 consists of four tracks, which are consistent with the previous year’s competition. Tracks 1, 2, and 3 are dedicated to speaker verification, where participants are required to determine whether two speech samples originate from the same person. The evaluation for Tracks 1 and 2 will be conducted on the same dataset, with Track 1’s training data restricted to the VoxCeleb2 dev set, while participants can freely use any data for Track 2.

Track 3 aims to promote domain adaptation research, providing an evaluation set from another domain (CnCeleb dataset). It includes a large set of unlabelled data and a small set of labelled data from the target domain to serve as the adaptation data. The objective is to address the challenges of adapting speaker verification models to different domains.

On the other hand, Track 4 focuses on speaker diarisation, challenging participants to accurately segment multi-speaker audio into distinct portions that correspond to individual speakers. This track addresses the problem of determining “who spoke when” in a given audio recording.

Table 1: Table 1 presents the results achieved using different architectures on the VoxCeleb dataset and VoxSRC2023 development set. The ”dev” portion of part 2 is used as the training set. These results serve as a baseline for Track 1 and Track 2 in the VoxSRC challenge. The

p_{target}

value is set to

0.05

for the voxsrc23_val dataset, while for other datasets, it is set to

0.01

Architecture	voxceleb1_O		voxceleb1_E		voxceleb1_H		voxsrc23_val
Architecture	EER(%)	minDCF	EER(%)	minDCF	EER(%)	minDCF	EER(%)	minDCF
CAM++	0.654	0.087	0.805	0.092	1.576	0.164	3.899	0.211
ECAPA-TDNN	0.728	0.099	0.929	0.100	1.721	0.169	4.392	0.228
ResNet34	0.723	0.069	0.867	0.097	1.532	0.146	3.660	0.213
ResNet221	0.505	0.045	0.676	0.067	1.213	0.111	2.991	0.168
ResNet293	0.447	0.043	0.657	0.066	1.183	0.111	2.867	0.169

2 WeSpeaker: Speaker Embedding Toolkit for Research & Production

2.1 Open-source speech processing toolkits

In the field of speech processing, the research community has made significant contributions to the open-source domain. Initially, toolkits such as HTK (Hidden Markov Model Toolkit) [1] and Kaldi [2] played a pivotal role in enabling researchers and industry applications. However, the emergence of deep learning toolkits like PyTorch and TensorFlow has brought about a shift in the landscape.

Recently, PyTorch-based toolkits such as SpeechBrain [3] and ESPnet [4] have gained popularity due to their user-friendly interfaces and support for rapid prototyping, making them accessible to new researchers. While these toolkits serve a broad range of applications, Wenet stands out by focusing specifically on end-to-end speech recognition. Its primary aim is to bridge the gap between research advancements and practical deployment in real-world scenarios.

2.2 Wespeaker

In [5], we introduced Wespeaker, a speaker embedding learning toolkit designed for research and production purposes. Wespeaker is characterized by its lightweight code base and emphasis on high-quality speaker embedding learning, demonstrating impressive performance on multiple datasets. While prioritizing accessibility for researchers, Wespeaker also incorporates deployment codes that are compatible with both CPUs and GPUs, thereby facilitating the integration of research findings into practical production systems.

2.3 Design principles

As mentioned in previous section, there are different speech toolkits which include speaker embedding learning functions, our proposed wespeaker stands out for its simpliness, effectiveness and deployment friendliness. THe design principles are as follows,

•

Light-weight: Wespeaker is designed specifically for deep speaker embedding learning with clean and simple codes¹¹1If you are interested in other tasks such as ASR, KWS, TTS, etc. We have different speficifcly designed toolkits for different tasks, please visit https://github.com/wenet-e2e for more details. It is purely built upon PyTorch and its ecosystem, and has no dependencies on Kaldi [2].
•

Production oriented: All models in Wespeaker can be easily exported by torch Just In Time (JIT) or as the ONNX format, which can be easily adopted in the deployment environment. Sample deployment codes are also provided.

2.4 Supported functionalities

Wespeaker supports different popular speaker embedding learning models, margin based softmax training objectives and several pooling functions.

Refer to caption — Fig. 1: Supported functions in Wespeaker

Model Architectures

•

TDNN based x-vector [6], this is a milestone work that leads the following deep speaker embedding era.
•

ResNet based r-vector and its deeper version, this is the best system of VoxSRC 2019 [7] and CNSRC 2022 [8].
•

ECAPA-TDNN, a modified version of TDNN, this is the champion system of VoxSRC 2020 [9].
•

RepVGG decouples the training time and inference time architecture, resulting in good performance and inference speed. This is the best system of VoxSRC 2021 [10].
•

CAM++, a modified a densely connected time delay neural network (D-TDNN) that utilizes a context-aware masking mechanism. It also incorporates a novel multi-granularity pooling technique to capture contextual information at various levels.

Pooling functions

Pooling functions aggregate frame-level features into segment-level representations, where Wespeaker supports the statistics-based and attention-based ones.

Loss functions

Loss functions play a crucial role in deep speaker embedding learning. We provide support for various types of loss functions, including the standard softmax cross-entropy loss, as well as different margin-based variants [11, 12]. These variants include A-softmax [13, 14], AM-softmax [15], and AAM-softmax [16].

In addition to supporting different loss functions, we also provide support for commonly used techniques such as the inter-topk and sub-center algorithms. These techniques aim to enhance the discriminative ability of the learned embeddings by considering specific subsets of samples within a mini-batch or using sub-centers to improve intra-class compactness.

Scoring back-ends

In the toolkit, we offer a basic implementation of the two-covariance Probabilistic Linear Discriminant Analysis (PLDA). We encourage users to explore different adaptation methods [2, 17, 18, 19] with PLDA for the Track 3 cross-domain evaluation condition.

2.5 Easy hands-on

We have included pretrained models in the toolkit to assist users in quickly verifying results on relevant datasets. However, we would like to emphasize that we DO NOT recommend users to solely submit results based on the provided single systems. We encourage users to explore different methods of combining systems, either among the models we provide or with ones trained by themselves.

We provide the python binding for wespeaker for users to quickly try the pretrained models, further details could be found on the project webpage https://github.com/wenet-e2e/wespeaker/tree/master/runtime/binding/python

With the wespeakeruntime package installed, you can easily extract embeddings from WAV files specified in the wav.scp file and save them into embed.ark using the following code:

⬇

1import wespeakerruntime as wespeaker

2wav_scp_path = "path/to/wav.scp"

3embed_ark_path = "embed.ark"

5speaker = wespeaker.Speaker(lang=’chs’)

6speaker.extract_embedding_kaldiio(

7 wav_scp_path, embed_ark_path

8 )

Moreover, we released several pretrained models as dipicted in Table 2, both in pytorch “.pt” format and runtime “.onnx” format, check https://github.com/wenet-e2e/wespeaker/blob/master/docs/pretrained.md for details on how to use it.

Table 2: Pretrained models provided

Datasets	Languages	Pretrained model
VoxCeleb	EN	CAM++ / CAM++_LM
VoxCeleb	EN	ResNet34 / ResNet34_LM
VoxCeleb	EN	ResNet152_LM
VoxCeleb	EN	ResNet221_LM
VoxCeleb	EN	ResNet293_LM

3 Results

3.1 Track 1 & 2

The results on the VoxCeleb1 evaluation dataset and the VoxSRC 2023 development set are presented in Table 1. The models employed for these evaluations are specified in Table 2.

3.2 Track 3

There are various technology roadmap for unsupervised domain adaptation, and we only provide the results of the voxceleb pre-trained model in Table 3.

Table 3: Results on the validation set of Track3. The

p_{target}

value is set to

0.01

Architecture	Mean Normalization	EER(%)	minDCF
ResNet34	N	14.570	0.617
ResNet34	Y	11.395	0.594

3.3 Track 4

We used the open-source pyannote [20] toolkit as our Voice Activity Detection (VAD) system ²²2This is different from the silero VAD used in [5]. The ResNet34_LM model was adopted as the speaker embedding extractor. For speaker clustering, we implemented the spectral clustering algorithm and adapted it specifically for the diarization task. The results on the VoxConverse dev and test sets are shown in Table 4³³3Dev and test set of VoxConverse are used as the validate set for the VoxSRC23 Track 4.

Table 4: Results on the VoxConverse dev and test sets

Test set	MISS(%)	FA(%)	SC(%)	DER(%)
VoxConverse dev	2.7	0.2	1.8	4.8
VoxConverse test	3.2	0.7	3.0	7.0

4 Suggestions for performance improvement

Our primary objective is to offer a robust initial model that serves as a strong starting point for further improvement. We aim to provide researchers with a solid foundation from which they can develop and enhance new algorithms. By supplying a sufficiently good initial model, we aspire to facilitate the development of novel methodologies within the research community. We didn’t specifically do optimizations for Track 2, 3, and 4. Instead, we would like to provide some suggestions to provide several potential directions to work on:

4.1 Track 2

•

Increase Data Volume: Expand the training dataset by adding more data.
•

Explore Large Pretrained Models [21]: Consider utilizing large pretrained models like WavLM [22] to leverage their extensive knowledge learned from vast amounts of audio data.
•

Pretrained ASR Model Initialization: Phoneme information has been proven to be beneficial for building speaker verification systems [23]. Consider initializing your speaker embedding model with pretrained Automatic Speech Recognition (ASR) models. Several papers presented during ICASSP 2023 verified the effectiveness [24, 25].
•

Hard Mining Strategy: Find confused speakers and add an extra inter-topK penalty on them is an effective way to improve performance in challenges [26, 27, 28]. Some of them have already been supported in Wespeaker⁴⁴4https://github.com/wenet-e2e/wespeaker/pull/115.

4.2 Track 3

•

Distribution Alignment: Employ adversarial training or other strategies to align the distributions of the source and target domains.
•

Pseudo Label Learning: Utilize clustering algorithms or other methods to assign pseudo labels to unlabeled data from the target domain. It is important to note that these pseudo labels may contain noise, and exploring techniques [28, 29, 30] for training robust systems with noisy labels is a crucial topic.
•

Unsupervised PLDA Adaptation: Building upon the implemented PLDA codes, you can incorporate PLDA adaptation mechanisms, such as the Kaldi version [2] and the CORAL series [17, 18], to further enhance performance⁵⁵5A good reference on PLDA adaptation can be found as [31].

4.3 Track 4

•

VAD Tuning. Currently, the errors caused by VAD is still quite high, improving the VAD might be a good choice.
•

We only include the basic clustering algorithms here, you can try more alogrithms and reclustering methods such as VBx [32, 33].

4.4 Results on the evaluation set

We submit the best single system (ResNet293) to the CodaLab evaluation server, check the following links for the numbers and rankings.

•

Track 1 https://zeus.robots.ox.ac.uk/competitions/competitions/17#results
•

Track 2: https://zeus.robots.ox.ac.uk/competitions/competitions/16#results
•

Track 3: https://zeus.robots.ox.ac.uk/competitions/competitions/14#results
•

Track 4: https://zeus.robots.ox.ac.uk/competitions/competitions/18#results

5 The story

The voxceleb dataset is the largest opensource and high-quality dataset for speaker recognition and the wespeaker team has a long history of supporting the voxceleb dataset and voxsrc challenges, the core members have achieved top rankings in previous VoxSRC competitions [7, 34, 35]⁶⁶6VoxSRC2019: 1st place, VoxSRC2020: 2nd place, VoxSRC2022: 3rd place.

We have observed that there is often a disparity between the results reported in current research papers and the performance achieved in system reports for challenges, even when the training and evaluation data are the same. In order to provide a reliable starting point for researchers, we initiated wespeaker, aimed at delivering a reliable baseline system and a user-friendly toolkit. Moreover, contributers from the Wenet opensource community helped with the efficient data management techniques that enable scaling to industrial-sized datasets, as well as deployment codes for rapid prototyping in production environments.

Knowing that this would be the final VoxSRC challenge, the Wespeaker team is eager to contribute and support the event by providing an easy-to-use toolkit and baseline systems. We hope more participants can enjoy the challenge and focus on the algorithm improvement, without struggling with basic experimental setups.

6 Acknowledgement

We would like to extend our sincere appreciation to the VoxSRC challenge organizers for their invaluable contribution in open-sourcing this remarkable dataset and organizing such meaningful challenges. We would also like to express our gratitude to the wenet open-source community, whose dedication and collective efforts have played a pivotal role in the success and growth of wespeaker. Enjoy the Challenge and Welcome to contribute.

References

[1] S. Young, G. Evermann, M. Gales, T. Hain, D. Kershaw, X. Liu, G. Moore, J. Odell, D. Ollason, D. Povey et al., “The htk book,” Cambridge university engineering department, vol. 3, no. 175, p. 12, 2002.
[2] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz et al., “The kaldi speech recognition toolkit,” in IEEE 2011 workshop on automatic speech recognition and understanding. IEEE Signal Processing Society, 2011.
[3] M. Ravanelli, T. Parcollet, P. Plantinga, A. Rouhe, S. Cornell, L. Lugosch, C. Subakan, N. Dawalatabad, A. Heba, J. Zhong et al., “Speechbrain: A general-purpose speech toolkit,” arXiv preprint arXiv:2106.04624, 2021.
[4] S. Watanabe, T. Hori, S. Karita, T. Hayashi, J. Nishitoba, Y. Unno, N.-E. Y. Soplin, J. Heymann, M. Wiesner, N. Chen et al., “Espnet: End-to-end speech processing toolkit,” Proc. Interspeech, pp. 2207–2211, 2018.
[5] H. Wang, C. Liang, S. Wang, Z. Chen, B. Zhang, X. Xiang, Y. Deng, and Y. Qian, “Wespeaker: A research and production oriented speaker embedding learning toolkit,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5.
[6] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, “X-vectors: Robust dnn embeddings for speaker recognition,” in 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, 2018, pp. 5329–5333.
[7] H. Zeinali, S. Wang, A. Silnova, P. Matějka, and O. Plchot, “But system description to voxceleb speaker recognition challenge 2019,” arXiv preprint arXiv:1910.12592, 2019.
[8] Z. Chen, B. Liu, B. Han, L. Zhang, and Y. Qian, “The sjtu x-lance lab system for cnsrc 2022,” arXiv preprint arXiv:2206.11699, 2022.
[9] B. Desplanques, J. Thienpondt, and K. Demuynck, “Ecapa-tdnn: Emphasized channel attention, propagation and aggregation in tdnn based speaker verification,” Proc. Interspeech, pp. 3830–3834, 2020.
[10] M. Zhao, Y. Ma, M. Liu, and M. Xu, “The speakin system for voxceleb speaker recognition challange 2021,” arXiv preprint arXiv:2109.01989, 2021.
[11] M. Hajibabaei and D. Dai, “Unified hypersphere embedding for speaker recognition,” arXiv preprint arXiv:1807.08312, 2018.
[12] X. Xiang, S. Wang, H. Huang, Y. Qian, and K. Yu, “Margin matters: Towards more discriminative deep neural network embeddings for speaker recognition,” in 2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC). IEEE, 2019, pp. 1652–1656.
[13] W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, and L. Song, “Sphereface: Deep hypersphere embedding for face recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 212–220.
[14] Z. Huang, S. Wang, and K. Yu, “Angular softmax for short-duration text-independent speaker verification,” Proc. Interspeech, pp. 3623–3627, 2018.
[15] F. Wang, J. Cheng, W. Liu, and H. Liu, “Additive margin softmax for face verification,” IEEE Signal Processing Letters, vol. 25, no. 7, pp. 926–930, 2018.
[16] J. Deng, J. Guo, X. Niannan, and S. Zafeiriou, “Arcface: Additive angular margin loss for deep face recognition,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[17] M. J. Alam, G. Bhattacharya, and P. Kenny, “Speaker verification in mismatched conditions with frustratingly easy domain adaptation.” in Odyssey, vol. 2018, 2018, pp. 176–180.
[18] K. A. Lee, Q. Wang, and T. Koshinaka, “The coral+ algorithm for unsupervised domain adaptation of plda,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 5821–5825.
[19] P.-M. Bousquet and M. Rouvier, “On robustness of unsupervised domain adaptation for speaker recognition,” in Interspeech, 2019.
[20] H. Bredin, R. Yin, J. M. Coria, G. Gelly, P. Korshunov, M. Lavechin, D. Fustes, H. Titeux, W. Bouaziz, and M.-P. Gill, “pyannote.audio: neural building blocks for speaker diarization,” in ICASSP 2020, IEEE International Conference on Acoustics, Speech, and Signal Processing, 2020.
[21] Z. Chen, S. Chen, Y. Wu, Y. Qian, C. Wang, S. Liu, Y. Qian, and M. Zeng, “Large-scale self-supervised speech representation learning for automatic speaker verification,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 6147–6151.
[22] S. Chen, C. Wang, Z. Chen, Y. Wu, S. Liu, Z. Chen, J. Li, N. Kanda, T. Yoshioka, X. Xiao et al., “Wavlm: Large-scale self-supervised pre-training for full stack speech processing,” IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1505–1518, 2022.
[23] S. Wang, J. Rohdin, L. Burget, O. Plchot, Y. Qian, K. Yu, and J. Cernockỳ, “On the usage of phonetic information for text-independent speaker embedding extraction.” in Interspeech, 2019, pp. 1148–1152.
[24] D. Liao, T. Jiang, F. Wang, L. Li, and Q. Hong, “Towards a unified conformer structure: from asr to asv task,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5.
[25] D. Cai, W. Wang, M. Li, R. Xia, and C. Huang, “Pretraining conformer with asr for speaker verification,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5.
[26] Z. Chen, B. Han, X. Xiang, H. Huang, B. Liu, and Y. Qian, “Build a sre challenge system: Lessons from voxsrc 2022 and cnsrc 2022,” arXiv preprint arXiv:2211.00815, 2022.
[27] M. Zhao, Y. Ma, Y. Ding, Y. Zheng, M. Liu, and M. Xu, “Multi-query multi-head attention pooling and inter-topk penalty for speaker verification,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 6737–6741.
[28] B. Han, Z. Chen, and Y. Qian, “Exploring binary classification loss for speaker verification,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5.
[29] R. Tao, K. A. Lee, R. K. Das, V. Hautamäki, and H. Li, “Self-supervised speaker recognition with loss-gated learning,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 6142–6146.
[30] B. Han, Z. Chen, and Y. Qian, “Self-supervised speaker verification using dynamic loss-gate and label correction,” in Interspeech, 2022.
[31] Q. Wang, K. Okabe, K. A. Lee, and T. Koshinaka, “Generalized domain adaptation framework for parametric back-end in speaker recognition,” arXiv preprint arXiv:2305.15567, 2023.
[32] M. Diez, S. Wang, and J. Rohdin, “Bayesian hmm based x-vector clustering for speaker diarization.” in Interspeech, 2019, pp. 346–350.
[33] F. Landini, J. Profant, M. Diez, and L. Burget, “Bayesian hmm clustering of x-vector sequences (vbx) in speaker diarization: theory, implementation and analysis on standard tasks,” Computer Speech & Language, vol. 71, p. 101254, 2022.
[34] X. Xiang, “The xx205 system for the voxceleb speaker recognition challenge 2020,” arXiv preprint arXiv:2011.00200, 2020.
[35] Z. Chen, B. Han, X. Xiang, H. Huang, B. Liu, and Y. Qian, “Sjtu-aispeech system for voxceleb speaker recognition challenge 2022,” arXiv preprint arXiv:2209.09076, 2022.