Unsupervised representation learning for speaker recognition via Contrastive Equilibrium Learning
Abstract
In this paper, we propose a simple but powerful unsupervised learning method for speaker recognition, namely Contrastive Equilibrium Learning (CEL), which increases the uncertainty on nuisance factors latent in the embeddings by employing a uniformity loss. To preserve speaker discriminability, a contrastive similarity loss is used jointly. Experimental results showed that the proposed CEL significantly outperforms the state-of-the-art unsupervised speaker verification systems, and the best performing model achieved 8.01% and 4.01% EER on the VoxCeleb1 and VOiCES evaluation sets, respectively. On top of that, supervised speaker embedding networks trained from initial parameters pre-trained via CEL showed better performance than those trained with randomly initialized parameters.
Index Terms— Speaker recognition, unsupervised learning, uniformity loss, contrastive learning.
1 Introduction
Over the last decade, various deep learning-based speaker embedding techniques have been developed for speaker recognition. Although these deep embedding techniques have shown outstanding performance on large-scale datasets [1, 2, 3], most of them are trained in a fully supervised manner [4, 5, 6, 7, 8, 9]. However, since it is difficult to obtain a large amount of labeled data in real-life applications, learning speech representations without supervision is an important issue.
Recently, many efforts have been made to obtain good speech representations in an unsupervised manner. In [10, 11], to learn the distribution of speech sequences, the speech representations were trained using a probabilistic contrastive loss, which induces features that capture information useful for predicting future samples. In [12], multi-layer transformer encoders and a multi-head self-attention mechanism were employed to achieve bidirectional encoding of the speech representations. The works in [13, 14] proposed speech representations that capture the speaker identity by maximizing the mutual information between local embeddings extracted from the same utterance, exploiting the short-term active-speaker stationarity hypothesis to create contrastive samples from unlabeled data.
Maximizing the similarity between local features from the same utterance showed reasonable performance [13, 14]. However, segments cropped from the same utterance are likely to share not only the same speaker identity but also the same nuisance factors (e.g., environment, noise, etc.). This causes the embeddings to learn the shared nuisance attributes, which leads to low speaker identification or verification performance. To alleviate this issue, the authors of [15, 16] proposed augmenting the segments using different noises and Room Impulse Responses (RIRs). Furthermore, Huh et al. [16] proposed an Augmentation Adversarial Training (AAT) strategy that penalizes the ability to predict the augmentation type so that the embeddings are optimized to be channel-invariant. Although AAT shows a meaningful improvement over previous work on unsupervised representation learning in terms of speaker verification, optimizing the adversarial loss through a gradient reversal layer is known to be unstable and sensitive to the hyper-parameter setting [17, 18].
In this paper, we propose a simple but powerful training strategy, Contrastive Equilibrium Learning (CEL). Unlike the conventional techniques, the proposed CEL increases the uncertainty on nuisance factors latent in the embeddings by employing the uniformity loss. Minimizing the uniformity loss forces the embeddings to be in the equilibrium state (i.e., uniformly distributed over the unit hyper-sphere), which leads to having the highest entropy [19]. In order to formulate the uniformity loss, we exploit the total pairwise potential calculated with a Gaussian kernel function similarly to [20, 21, 22].
However, optimizing only the uniformity loss may also increase the uncertainty on the speaker information inherent in the embeddings. In order to preserve speaker discriminability, we jointly use a similarity loss function. The similarity loss functions are formulated using end-to-end contrastive learning-based objectives.
Experimental results show that the proposed CEL significantly outperforms the state-of-the-art unsupervised speaker verification systems on the VoxCeleb1 test set and the VOiCES evaluation set. On top of that, supervised speaker embedding networks trained with initial parameters pre-trained via CEL outperformed those trained with randomly initialized parameters. Through these results, we demonstrate that the proposed CEL can be used to find good initial parameters for conventional deep speaker embedding systems.
2 Proposed Method
The overall process of the proposed method CEL is shown in Fig. 1. To train the speaker embedding network in an unsupervised learning fashion, we first randomly crop two segments from a single utterance and individually apply augmentations with different additive noises and reverberations. The speaker embeddings are extracted from the front-end encoder and normalized to be on the unit hyper-sphere.
Assuming that there are no overlapping speakers within a batch, we train the front-end encoder via the proposed CEL strategy, which consists of uniformity loss and similarity loss.
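To make the batch construction concrete, the following PyTorch-style sketch outlines one possible implementation of the two-crop pipeline described above. The crop length (180 frames at a 10-ms hop, i.e., 160 samples per frame at 16 kHz), the augmentation callable, and the encoder interface are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn.functional as F

def build_batch(waveforms, encoder, crop_len=180 * 160, augment=None):
    """Crop two segments per utterance, augment each crop independently,
    and return two sets of L2-normalized speaker embeddings.

    waveforms: list of 1-D waveform tensors (unlabeled utterances),
               each assumed to be at least crop_len samples long.
    encoder:   front-end network mapping a batch of segments to embeddings
               (feature extraction is assumed to happen inside the encoder).
    augment:   callable applying random additive noise / reverberation.
    """
    seg1, seg2 = [], []
    for wav in waveforms:
        # Randomly pick two start points and crop two segments from one utterance.
        hi = max(1, wav.numel() - crop_len)
        s1, s2 = int(torch.randint(0, hi, (1,))), int(torch.randint(0, hi, (1,)))
        a, b = wav[s1:s1 + crop_len], wav[s2:s2 + crop_len]
        if augment is not None:          # different noise / RIR for each crop
            a, b = augment(a), augment(b)
        seg1.append(a)
        seg2.append(b)
    e1 = F.normalize(encoder(torch.stack(seg1)), dim=-1)   # (N, D), unit norm
    e2 = F.normalize(encoder(torch.stack(seg2)), dim=-1)   # (N, D), unit norm
    return e1, e2
```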
2.1 Loss functions
In the proposed CEL framework, a batch is comprised of a total of $2N$ segments cropped from $N$ utterances, where the two differently augmented segments of the $i$-th utterance are denoted by $x_{1,i}$ and $x_{2,i}$. Given the front-end encoder network $f$, which maps a speech segment of feature dimension $F$ and frame length $T$ to an $L_2$-normalized embedding of dimension $D$, $(\mathbf{e}_{1,i}, \mathbf{e}_{2,i}) = (f(x_{1,i}), f(x_{2,i}))$ denotes a positive pair of two speaker embeddings obtained with different augmentations.
Uniformity loss. In order to force the embeddings to reach an equilibrium state, namely a state of minimal energy (i.e., the distribution optimizing this metric should converge to uniform distribution on the hyper-sphere), we leverage the pairwise Gaussian potential kernel known as the Radial Basis Function (RBF) kernel,
$$G_t(\mathbf{u}, \mathbf{v}) \triangleq e^{-t \lVert \mathbf{u} - \mathbf{v} \rVert_2^2}, \tag{1}$$

where $t > 0$ is a fixed parameter. Similarly to [20, 21, 22], the uniformity loss is defined as the logarithm of the average pairwise Gaussian potential as follows:

$$\mathcal{L}_{unif} \triangleq \log \mathbb{E}_{\mathbf{u}, \mathbf{v}} \left[ G_t(\mathbf{u}, \mathbf{v}) \right], \tag{2}$$

where minimizing equation (2) drives the embedding vectors toward the uniform distribution on the unit hyper-sphere [21].
Analogous to [22], the uniformity loss within a batch can be calculated by considering, within each augmented view, every pair of embeddings $\mathbf{e}_{a,i}$ and $\mathbf{e}_{a,j}$:

$$\mathcal{L}_{unif} = \frac{1}{2} \sum_{a \in \{1, 2\}} \log \left( \frac{1}{N(N-1)} \sum_{i=1}^{N} \sum_{j \ne i} G_t(\mathbf{e}_{a,i}, \mathbf{e}_{a,j}) \right). \tag{3}$$
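As a minimal illustration, the batch-level uniformity term can be computed as in the PyTorch sketch below, assuming the notation above (two sets of unit-norm embeddings and the kernel parameter $t$). It mirrors the reconstruction in equations (2)-(3) and is not the authors' implementation.

```python
import torch

def uniformity_loss(e, t=2.0):
    """Log of the average pairwise Gaussian potential within one view, Eq. (2).

    e: (N, D) tensor of L2-normalized embeddings.
    """
    sq_dists = torch.pdist(e, p=2).pow(2)        # squared distances of all N*(N-1)/2 pairs
    return sq_dists.mul(-t).exp().mean().log()

def batch_uniformity(e1, e2, t=2.0):
    """Batch-level uniformity term, Eq. (3): averaged over the two augmented views."""
    return 0.5 * (uniformity_loss(e1, t) + uniformity_loss(e2, t))
```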
Similarity loss. In order to ensure that the embeddings of positive pairs are similar while pushing the embeddings of negative pairs apart, we exploit the angular prototypical loss proposed in [7]. For our unsupervised learning setting, the angular prototypical similarity loss is formulated as follows:
$$S_{i,j} = w \cdot \cos(\mathbf{e}_{2,i}, \mathbf{e}_{1,j}) + b, \tag{4}$$

$$\mathcal{L}_{A\text{-}Prot} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{e^{S_{i,i}}}{\sum_{j=1}^{N} e^{S_{i,j}}}, \tag{5}$$

where $S_{i,j}$ is the affine transformation of the cosine similarity between two speaker embeddings of dimension $D$, and $w$ and $b$ are trainable parameters for scale and bias, respectively. Additionally, we take the following angular contrastive similarity loss into account:
$$\mathcal{L}_{A\text{-}Cont} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{e^{S_{i,i}}}{e^{S_{i,i}} + \sum_{j \ne i} \left( e^{w \cdot \cos(\mathbf{e}_{2,i}, \mathbf{e}_{1,j}) + b} + e^{w \cdot \cos(\mathbf{e}_{2,i}, \mathbf{e}_{2,j}) + b} \right)}. \tag{6}$$
This loss function has been shown effective in many recent representation learning methods [10, 23].
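The two similarity objectives can be sketched in the same style. Here $w$ and $b$ are the trainable scale and bias of equation (4), the embeddings are assumed to be unit-normalized, and the angular contrastive variant reflects one plausible reading of equation (6) rather than the exact original formulation.

```python
import torch
import torch.nn.functional as F

def angular_prototypical_loss(e1, e2, w, b):
    """Angular prototypical loss, Eqs. (4)-(5): each query e2[i] should match
    its own positive e1[i] against the other N-1 prototypes in the batch.

    e1, e2: (N, D) L2-normalized embeddings; w, b: learnable scale and bias tensors.
    """
    cos = e2 @ e1.t()                              # cosine similarities (unit-norm inputs)
    logits = w.clamp(min=1e-6) * cos + b           # affine transform S_{i,j}, Eq. (4)
    labels = torch.arange(e1.size(0), device=e1.device)
    return F.cross_entropy(logits, labels)

def angular_contrastive_loss(e1, e2, w, b):
    """Angular contrastive loss in the spirit of Eq. (6): negatives for query
    e2[i] are the embeddings of all other utterances from both views."""
    N = e1.size(0)
    s12 = w.clamp(min=1e-6) * (e2 @ e1.t()) + b    # query vs. first view
    s22 = w.clamp(min=1e-6) * (e2 @ e2.t()) + b    # query vs. second view
    mask = torch.eye(N, dtype=torch.bool, device=e1.device)
    s22 = s22.masked_fill(mask, float('-inf'))     # drop trivial self-similarities
    logits = torch.cat([s12, s22], dim=1)          # (N, 2N); positive is column i of s12
    labels = torch.arange(N, device=e1.device)
    return F.cross_entropy(logits, labels)
```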
The front-end encoder network is trained using the two objective functions, i.e., the combination of the similarity loss and the uniformity loss weighted by a factor $\lambda$, as follows:

$$\mathcal{L}_{total} = \mathcal{L}_{sim} + \lambda \, \mathcal{L}_{unif}, \tag{7}$$

where $\mathcal{L}_{sim}$ is either $\mathcal{L}_{A\text{-}Prot}$ or $\mathcal{L}_{A\text{-}Cont}$.
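Putting the pieces together, a single CEL training step under equation (7) might look as follows. This sketch reuses the hypothetical helpers from the previous snippets (build_batch, angular_prototypical_loss, batch_uniformity) and is purely illustrative.

```python
def cel_training_step(waveforms, encoder, w, b, optimizer, lam=1.0, t=2.0, augment=None):
    """One CEL update combining similarity and uniformity losses, Eq. (7).

    Swap in angular_contrastive_loss for the Unif + A-Cont variant.
    """
    e1, e2 = build_batch(waveforms, encoder, augment=augment)
    loss = angular_prototypical_loss(e1, e2, w, b) + lam * batch_uniformity(e1, e2, t)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```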
3 Experiments
3.1 Unsupervised learning
Experimental setups. During training, we randomly cropped each input utterance into two 180-frame segments with a 10-ms window hop-size, and the two crops were independently augmented with different MUSAN noises [24] and convolved with RIR filters as in [16]. The Fast ResNet-34 proposed in [7] was used as the front-end encoder, and 40-dimensional log Mel-spectrograms were extracted as acoustic features. In all experiments, the constant factor $t$ of the pairwise Gaussian kernel was fixed to 2. The networks were trained in an unsupervised manner on the development set of VoxCeleb2 [2] with 1,092,009 utterances and evaluated on the original test set of VoxCeleb1 [1]. We also used the VOiCES 2019 Challenge development set, with the trial list provided by [3], to verify the generalization capacity of the experimented systems on out-of-domain data.
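As a rough sketch of the augmentation step (additive MUSAN-style noise followed by RIR convolution), one possible NumPy implementation is shown below. The file lists, SNR range, and single-channel handling are assumptions for illustration, not the exact settings of [16].

```python
import random
import numpy as np
import soundfile as sf
from scipy.signal import fftconvolve

def augment_numpy(segment, noise_paths, rir_paths, snr_db_range=(5, 20)):
    """Add a randomly chosen noise at a random SNR, then convolve with a random RIR.

    segment: 1-D float numpy array (one cropped waveform segment).
    noise_paths, rir_paths: lists of wav file paths (placeholders).
    """
    noise, _ = sf.read(random.choice(noise_paths))
    if noise.ndim > 1:                                  # mix down multi-channel noise
        noise = noise.mean(axis=1)
    noise = np.resize(noise, segment.shape)             # loop or trim to segment length
    snr_db = random.uniform(*snr_db_range)
    seg_pow = np.mean(segment ** 2) + 1e-8
    noise_pow = np.mean(noise ** 2) + 1e-8
    noisy = segment + noise * np.sqrt(seg_pow / (noise_pow * 10 ** (snr_db / 10)))
    rir, _ = sf.read(random.choice(rir_paths))
    if rir.ndim > 1:
        rir = rir[:, 0]
    rir = rir / (np.abs(rir).max() + 1e-8)               # peak-normalize the RIR
    return fftconvolve(noisy, rir, mode="full")[: segment.size]
```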
The benchmarks contain Disent. [25], CDDL [26], GCL [15], I-vector, and AAT with angular prototypical loss [16]. Disent. and CDDL employed the cross-modal synchrony between faces and audio in a video for learning the embeddings in a self-supervised fashion. GCL and AAT exploited data augmentation using different noises and RIRs in an unsupervised manner. Furthermore, AAT used adversarial training so that the embeddings can be learned to be channel-invariant.
The networks were trained using the Adam optimizer with an initial learning rate of 0.001, decreased by 5% every 10 epochs. The models were trained on a single NVIDIA Tesla M40 GPU with 24 GB of memory for 500 epochs. For back-end scoring, the cosine similarity-based technique used in [2, 7, 16] was adopted, and the Equal Error Rate (EER) was used as the performance measure.
Results. We first compare the models trained via the proposed CEL with the previous benchmarks on unsupervised representation learning, as shown in Table 1, where the hyper-parameters were fixed to $\lambda = 1$ and a batch size of 200. We experimented with two proposed models using either the angular prototypical (A-Prot) or the angular contrastive (A-Cont) loss as the similarity loss together with the uniformity loss (Unif). The results on both the VoxCeleb1 and VOiCES evaluation sets show that our models outperformed the conventional works. The best performing model, trained with the Unif and A-Prot losses, achieved 8.01% and 4.01% EER on the VoxCeleb1 and VOiCES evaluation sets, respectively.
We further conduct an ablation study on the VoxCeleb1 original test set to demonstrate the effect of each hyper-parameter on the speaker verification performance, as shown in Table 2. Two components were analyzed: the batch size and the uniformity weight $\lambda$. In these experiments, the A-Prot loss was used as the similarity loss. Contrastive objectives are usually known to benefit from larger batch sizes [7]. However, since we assume that every utterance contains only one person's speech, increasing the batch size may have a negative impact [16]. This is shown in the results of Table 2, where the best performance was observed at a batch size of 200, which outperformed larger batch sizes. Also, as shown in Table 2, we observed that the equal-weighted summation of the uniformity and similarity losses (i.e., $\lambda = 1$) is the best performing point in our settings.
Table 1: Comparison with previous unsupervised and self-supervised speaker verification systems on the VoxCeleb1 test set and the VOiCES development set. "Aug." denotes the use of data augmentation.

| Model | Aug. | VoxCeleb1 EER [%] | VOiCES EER [%] |
|---|---|---|---|
| Disent. [25] | – | 22.09 | – |
| CDDL [26] | – | 17.52 | – |
| GCL [15] | ✓ | 15.26 | – |
| I-vector [16] | – | 15.28 | 17.49 |
| Prot [16] | – | 27.30 | 29.69 |
| Prot [16] | ✓ | 10.16 | 5.82 |
| A-Prot [16] | – | 25.37 | 32.21 |
| A-Prot [16] | ✓ | 9.56 | 5.65 |
| AAT + Prot [16] | ✓ | 9.36 | 5.26 |
| AAT + A-Prot [16] | ✓ | 8.65 | 4.96 |
| Unif + A-Prot (CEL) | ✓ | 8.01 | 4.01 |
| Unif + A-Cont (CEL) | ✓ | 8.05 | 4.69 |
Table 2: Ablation study on the VoxCeleb1 test set for the batch size and the uniformity-loss weight λ (A-Prot similarity loss).

| Ablation | Batch size | Unif weight λ | EER [%] |
|---|---|---|---|
| Batch size | 200 | 1.0 | 8.01 |
| | 300 | 1.0 | 8.25 |
| | 500 | 1.0 | 8.22 |
| | 800 | 1.0 | 8.12 |
| Unif weight λ | 200 | 0.5 | 8.72 |
| | 200 | 1.0 | 8.01 |
| | 200 | 2.0 | 8.20 |
Table 3: Supervised fine-tuning results with randomly initialized parameters versus parameters pre-trained via CEL, evaluated on the VoxCeleb1 test set and the VOiCES development set.

| Pre-training dataset | Pre-training objective | Fine-tuning dataset | Fine-tuning objective | VoxCeleb1 EER [%] | VoxCeleb1 MinDCF | VOiCES EER [%] | VOiCES MinDCF |
|---|---|---|---|---|---|---|---|
| Random initialization | – | VoxCeleb1 w/ labels | A-Prot | 5.32 | 0.3792 | 6.58 | 0.4852 |
| | | | A-Cont | 5.21 | 0.3750 | 6.97 | 0.4909 |
| | | | GE2E | 5.98 | 0.4258 | 7.75 | 0.6182 |
| | | | CosFace | 5.51 | 0.3402 | 7.45 | 0.4494 |
| | | | ArcFace | 5.53 | 0.3516 | 6.68 | 0.4944 |
| | | | AdaCos | 5.89 | 0.4047 | 7.26 | 0.5820 |
| VoxCeleb2 w/o labels | Unif + A-Prot | VoxCeleb1 w/ labels | A-Prot | 2.33 | 0.1741 | 2.78 | 0.2110 |
| | | | A-Cont | 2.35 | 0.1804 | 2.69 | 0.2138 |
| | | | GE2E | 2.52 | 0.1876 | 2.59 | 0.1906 |
| | | | CosFace | 2.83 | 0.1947 | 2.74 | 0.2081 |
| | | | ArcFace | 2.84 | 0.1797 | 2.89 | 0.2148 |
| | | | AdaCos | 2.76 | 0.2061 | 2.82 | 0.2438 |
| VoxCeleb2 w/o labels | Unif + A-Cont | VoxCeleb1 w/ labels | A-Prot | 2.42 | 0.1774 | 2.65 | 0.2107 |
| | | | A-Cont | 2.36 | 0.1766 | 2.63 | 0.2048 |
| | | | GE2E | 2.63 | 0.1917 | 2.39 | 0.1806 |
| | | | CosFace | 2.84 | 0.1871 | 3.03 | 0.2323 |
| | | | ArcFace | 2.84 | 0.1979 | 2.64 | 0.2171 |
| | | | AdaCos | 2.79 | 0.2225 | 2.96 | 0.2553 |
3.2 Fine-tuning using the models pre-trained via CEL
Experimental setups. The front-end encoders were pre-trained as in Section 3.1 using the proposed CEL technique. We leveraged two pre-trained models using either the A-Prot or the A-Cont loss together with the Unif loss, where $\lambda = 1$ and the batch size was 200. To fine-tune the speaker embedding networks in a supervised manner, the development set of VoxCeleb1 with 148,642 utterances from 1,211 speakers was used. We used 300-frame segments from randomly sampled utterances without data augmentation. The following objective functions were employed for fine-tuning the networks:
• A-Prot [7]: Angular prototypical loss given in equation (5),
• A-Cont: Angular contrastive loss given in equation (6),
• GE2E [8]: Generalized end-to-end loss based on the cosine similarity with learnable parameters,
• CosFace [27, 28]: Large margin cosine loss, also called AM-softmax loss,
• ArcFace [29]: Additive angular margin softmax loss, also called AAM-softmax loss; in these experiments, we set the additive angular margin to 0.2 and the scale factor to 30 (a generic sketch of this loss is given after this list),
• AdaCos [30]: Adaptively scaling cosine-based softmax loss.
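For reference, the sketch below shows a generic AAM-softmax (ArcFace) classification head with the additive angular margin of 0.2 and scale of 30 used here. It follows the published formulation [29] and stands in for, rather than reproduces, the authors' fine-tuning code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AAMSoftmax(nn.Module):
    """Additive angular margin softmax (ArcFace) classification head."""
    def __init__(self, emb_dim, num_speakers, margin=0.2, scale=30.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_speakers, emb_dim) * 0.01)
        self.margin, self.scale = margin, scale

    def forward(self, embeddings, labels):
        # Cosine similarity between embeddings and class (speaker) weight vectors.
        cos = F.linear(F.normalize(embeddings), F.normalize(self.weight))
        theta = torch.acos(cos.clamp(-1.0 + 1e-7, 1.0 - 1e-7))
        # Add the angular margin only to the target-speaker logit.
        target = F.one_hot(labels, cos.size(1)).bool()
        logits = torch.where(target, torch.cos(theta + self.margin), cos)
        return F.cross_entropy(self.scale * logits, labels)
```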
The Adam optimizer with an initial learning rate of 0.001, reduced by 10% every 10 epochs, was used for 250 epochs, and we trained the models using the same GPU and back-end scoring method as in Section 3.1. Two measures were analyzed: the EER and the Minimum Detection Cost Function (MinDCF). The parameters of the MinDCF were set to $C_{miss} = 1$, $C_{fa} = 1$, and $P_{target} = 0.05$.
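A minimal NumPy sketch of the minimum detection cost computation with these parameters is given below. The score arrays are hypothetical inputs, and the cost is normalized by the best trivial (accept-all or reject-all) system, as is standard.

```python
import numpy as np

def min_dcf(target_scores, nontarget_scores, p_target=0.05, c_miss=1.0, c_fa=1.0):
    """Minimum normalized detection cost over all decision thresholds."""
    scores = np.concatenate([target_scores, nontarget_scores])
    labels = np.concatenate([np.ones_like(target_scores), np.zeros_like(nontarget_scores)])
    order = np.argsort(scores)
    labels = labels[order]
    # Sweep the threshold over the sorted scores: everything at or below it is rejected.
    p_miss = np.cumsum(labels) / max(labels.sum(), 1)                    # false rejections
    p_fa = 1.0 - np.cumsum(1 - labels) / max((1 - labels).sum(), 1)      # false acceptances
    dcf = c_miss * p_target * p_miss + c_fa * (1 - p_target) * p_fa
    # Normalize by the cost of the best trivial system.
    return dcf.min() / min(c_miss * p_target, c_fa * (1 - p_target))
```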
Results. The speaker verification results using parameters either randomly initialized or pre-trained via CEL are given in Table 3. All systems fine-tuned from initial parameters pre-trained via CEL showed much better performance than those trained with randomly initialized weights. These results demonstrate that CEL can be used to find good initial parameters for conventional deep speaker embedding systems. Moreover, the models based on contrastive objectives showed better performance than the softmax-based models. The best performances were 2.33% and 2.39% EER on the VoxCeleb1 and VOiCES sets, respectively.
The comparison of previous works with our models trained from initial parameters pre-trained via CEL is shown in Table 4. The experimental results show a remarkable improvement over the conventional benchmarks. When using the VoxCeleb1 development set for fine-tuning, the proposed model achieved an EER of 2.33%. By employing the larger datasets VoxCeleb2 and VoxCeleb1&2, we obtained 2.05% and 1.81% EER, respectively. These results outperform the state-of-the-art methods.
Table 4: Comparison with previous supervised systems on the VoxCeleb1 test set. For our models, "A → B" denotes CEL pre-training with objective A followed by supervised fine-tuning with objective B.

| Model | Training set | Back-end | EER [%] |
|---|---|---|---|
| Nagrani et al. [1] | VoxCeleb1 | PLDA | 8.80 |
| Ravanelli & Bengio [13] | VoxCeleb1 | Cosine | 5.80 |
| Han et al. [31] | VoxCeleb1 | PLDA | 5.11 |
| Kang et al. [18] | VoxCeleb1 | PLDA | 4.40 |
| Okabe et al. [6] | VoxCeleb1 | Cosine | 3.85 |
| Xie et al. [5] | VoxCeleb2 | Cosine | 3.22 |
| Xiang et al. [9] | VoxCeleb2 | Cosine | 2.69 |
| Kaldi recipe [32] | VoxCeleb2 | PLDA | 2.51 |
| Monteiro et al. [32] | VoxCeleb2 | LRD-E2E | 2.51 |
| Chung et al. [7] | VoxCeleb2 | Cosine | 2.21 |
| Ours (Unif + A-Prot → A-Prot) | VoxCeleb1 | Cosine | 2.33 |
| Ours (Unif + A-Cont → ArcFace) | VoxCeleb2 | Cosine | 2.05 |
| Ours (Unif + A-Cont → GE2E) | VoxCeleb1&2 | Cosine | 1.81 |
4 Conclusion
In this work, we introduced a simple but powerful training strategy, namely Contrastive Equilibrium Learning. The proposed CEL increases the uncertainty on nuisance factors latent in the embeddings by employing the uniformity loss. Additionally, to preserve speaker discriminability, similarity losses are used jointly. Experimental results showed that the proposed CEL significantly outperforms the state-of-the-art unsupervised speaker verification systems on the VoxCeleb1 and VOiCES sets. On top of that, supervised speaker embedding networks trained with initial parameters pre-trained via CEL showed better performance than those trained with randomly initialized parameters.
Acknowledgements. This research was supported and funded by the Korean National Police Agency. [Project Name: Real-time speaker recognition via voiceprint analysis / Project Number: PA-J000001-2017-101]
References
- [1] A. Nagrani, J. S. Chung, and A. Zisserman, “VoxCeleb: a large-scale speaker identification dataset,” in Proc. INTERSPEECH, 2017, pp. 2616–2620.
- [2] J. S. Chung, A. Nagrani, and A. Zisserman, “VoxCeleb2: Deep speaker recognition,” in Proc. INTERSPEECH, 2018, pp. 1086–1090.
- [3] M. K. Nandwana, J. v. Hout, C. Richey, M. McLaren, M. A. Barrios, and A. Lawson, "The VOiCES from a distance challenge 2019," in Proc. INTERSPEECH, 2019.
- [4] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, "X-vectors: Robust DNN embeddings for speaker recognition," in Proc. ICASSP, 2018.
- [5] W. Xie, A. Nagrani, J. S. Chung, and A. Zisserman, “Utterance-level aggregation for speaker recognition in the wild,” in Proc. ICASSP, 2019, pp. 5791–5795.
- [6] K. Okabe, T. Koshinaka, and K. Shinoda, “Attentive statistics pooling for deep speaker embedding,” in Proc. INTERSPEECH, 2018, pp. 3573–3577.
- [7] J. S. Chung et al., “In defence of metric learning for speaker recognition,” in Proc. INTERSPEECH, 2020.
- [8] L. Wan, Q. Wang, A. Papir, and I. L. Moreno, “Generalized end-to-end loss for speaker verification,” in Proc. ICASSP, 2018, pp. 4879–4883.
- [9] X. Xiang, S. Wang, H. Huang, Y. Qian, and K. Yu, “Margin matters: Towards more discriminative deep neural network embeddings for speaker recognition,” in Proc. APSIPA, 2019, pp. 1652–1656.
- [10] A. v. d. Oord, Y. Li, and O. Vinyals, “Representation learning with contrastive predictive coding,” arXiv preprint arXiv:1807.03748, 2018.
- [11] Y.-A. Chung, W.-N. Hsu, H. Tang, and J. Glass, “An unsupervised autoregressive model for speech representation learning,” in Proc. INTERSPEECH, 2019.
- [12] A. T. Liu, S. Yang, P.-H. Chi, P. Hsu, and H. Lee, “Mockingjay: Unsupervised speech representation learning with deep bidirectional transformer encoders,” in Proc. ICASSP, 2020, pp. 6419–6423.
- [13] M. Ravanelli and Y. Bengio, “Learning speaker representations with mutual information,” in Proc. INTERSPEECH, 2019, pp. 1153–1157.
- [14] A. Jati and P. Georgiou, “Neural predictive coding using convolutional neural networks toward unsupervised learning of speaker characteristics,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 10, pp. 1577–1589, 2019.
- [15] N. Inoue and K. Goto, “Semi-supervised contrastive learning with generalized contrastive loss and its application to speaker recognition,” arXiv preprint arXiv:2006.04326, 2020.
- [16] J. Huh, H. S. Heo, J. Kang, S. Watanabe, and J. S. Chung, “Augmentation adversarial training for unsupervised speaker recognition,” arXiv preprint arXiv:2007.12085, 2020.
- [17] Y. Zhang et al., “Learning to speak fluently in a foreign language: Multilingual speech synthesis and cross-language voice cloning,” in Proc. INTERSPEECH, 2019, pp. 2080–2084.
- [18] W. H. Kang, S. H. Mun, M. H. Han, and N. S. Kim, “Disentangled speaker and nuisance attribute embedding for robust speaker verification,” IEEE Access, vol. 8, pp. 141838–141849, 2020.
- [19] C. Vignat, A. Hero, and J. Costa, "A geometric characterization of maximum Rényi entropy distributions," in Proc. IEEE ISIT, 2006, pp. 1822–1826.
- [20] H. Cohn and A. Kumar, “Universally optimal distribution of points on spheres,” Journal of the American Mathematical Society, vol. 20, no. 1, pp. 99–148, 2007.
- [21] S. Borodachov, D. Hardin, and E. Saff, Discrete energy on rectifiable sets, Springer, 2019.
- [22] T. Wang and P. Isola, “Understanding contrastive representation learning through alignment and uniformity on the hypersphere,” in Proc. ICML, 2020.
- [23] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick, “Momentum contrast for unsupervised visual representation learning,” in Proc. CVPR, 2020, pp. 9729–9738.
- [24] D. Snyder, G. Chen, and D. Povey, "MUSAN: A music, speech, and noise corpus," arXiv preprint arXiv:1510.08484, 2015.
- [25] A. Nagrani, J. S. Chung, S. Albanie, and A. Zisserman, “Disentangled speech embeddings using cross-modal self-supervision,” in Proc. ICASSP, 2020.
- [26] S.-W. Chung, H. G. Kang, and J. S. Chung, “Seeing voices and hearing voices: learning discriminative embeddings using cross-modal self-supervision,” in Proc. INTERSPEECH, 2020.
- [27] H. Wang et al., "CosFace: Large margin cosine loss for deep face recognition," in Proc. CVPR, 2018.
- [28] F. Wang, J. Cheng, W. Liu, and H. Liu, "Additive margin softmax for face verification," IEEE Signal Processing Letters, vol. 25, no. 7, pp. 926–930, 2018.
- [29] J. Deng, J. Guo, N. Xue, and S. Zafeiriou, "ArcFace: Additive angular margin loss for deep face recognition," in Proc. CVPR, 2019, pp. 4690–4699.
- [30] X. Zhang, R. Zhao, Y. Qiao, X. Wang, and H. Li, "AdaCos: Adaptively scaling cosine logits for effectively learning deep face representations," in Proc. CVPR, 2019, pp. 10815–10824.
- [31] M. H. Han, W. H. Kang, S. H. Mun, and N. S. Kim, “Information preservation pooling for speaker embedding,” in Proc. Odyssey, 2020, pp. 60–66.
- [32] J. Monteiro, I. Albuquerque, J. Alam, R. D. Hjelm, and T. Falk, “An end-to-end approach for the verification problem: learning the right distance,” in Proc. ICML, 2020.