
MSV Challenge 2022: NPU-HC Speaker Verification System for Low-resource Indian Languages

Yue Li1, Li Zhang1, Namin Wang2, Jie Liu2, Lei Xie1
1Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science,
Northwestern Polytechnical University (NPU), China
2Huawei Cloud
[email protected],   [email protected]
* Corresponding author.
Abstract

This report describes the NPU-HC speaker verification system submitted to the O-COCOSDA Multi-lingual Speaker Verification (MSV) Challenge 2022, which focuses on developing speaker verification systems for low-resource Asian languages. We participate in the I-MSV track, which aims to develop speaker verification systems for various Indian languages. In this challenge, we first explore different neural network frameworks for low-resource speaker verification. Then we leverage vanilla fine-tuning and weight transfer fine-tuning to transfer the out-domain pre-trained models to the in-domain Indian dataset. Specifically, the weight transfer fine-tuning aims to constrain the distance of the weights between the pre-trained model and the fine-tuned model, which takes advantage of the previously acquired discriminative ability from the large-scale out-domain datasets and avoids catastrophic forgetting and overfitting at the same time. Finally, score fusion is adopted to further improve performance. Together with the above contributions, we obtain 0.223% EER on the public evaluation set, ranking 2nd place on the leaderboard. On the private evaluation set, the EER of our submitted system is 2.123% and 0.630% for the constrained and unconstrained sub-tasks of the I-MSV track, leading to the 1st and 3rd place in the ranking, respectively.

Index Terms—  speaker verification, low-resource language, fine-tuning

1 Introduction

Speaker verification (SV) is the task of verifying whether an input utterance matches the claimed identity Rosenberg (1976). In recent years, deep learning has achieved remarkable success in SV tasks, but current methods usually rely on a huge amount of labeled training data Nagrani et al. (2017); Fan et al. (2020). However, obtaining massive labeled data for every language is a time-consuming and costly task. Therefore, the 2022 MSV challenge has been particularly designed for understanding and comparing current SV techniques in low-resource Asian languages, where labeled speaker data is limited in quantity. Specifically, the challenge includes two evaluation tracks, A-MSV and I-MSV. The former focuses on the development of SV systems in various Asian languages, while the latter focuses on SV for Indian languages. In this challenge, we participate in the I-MSV track, which includes constrained and unconstrained sub-tasks. For the constrained task, the speaker verification model can be trained only with the fixed training set provided by the challenge organizer, while extra training data can be used for the unconstrained task. In particular, the fixed training set consists of 1000 audio recordings spoken by 100 speakers, about 155 hours in total, which is not enough to train a robust speaker verification system from scratch. The major challenge in the I-MSV track is therefore the data limitation.

Low-resource speaker verification has drawn much attention recently. A straightforward approach to the low-resource problem is to leverage high-resource labeled datasets (e.g., from another language) to pre-train the speaker verification models Zhang et al. (2020); Gusev et al. (2020); Shahnawazuddin et al. (2021). However, bringing in additional training data usually leads to a domain mismatch problem, i.e., a distribution change or domain shift between the two domains that degrades the performance of SV systems Wang and Deng (2018). To deal with domain mismatch, recent approaches include domain adversarial training Wang et al. (2018); Rohdin et al. (2019), back-end processing Garcia-Romero et al. (2014); Lee et al. (2019), and fine-tuning strategies Zhang et al. (2021a); Tong et al. (2020).

In this challenge, we first explore different popular speaker verification models in the constrained I-MSV sub-task, most of which are variants of ECAPA-TDNN Desplanques et al. (2020) and ResNet34 Heo et al. (2020). For the unconstrained I-MSV, we first leverage the high-resource English datasets VoxCeleb1&2 Nagrani et al. (2017, 2020) for pre-training to improve the performance of our SV systems. Then, to address the domain mismatch induced by the language difference, we study vanilla fine-tuning and weight transfer fine-tuning Zhang et al. (2022); Li et al. (2023) strategies to transfer the pre-trained model to the in-domain model with the given Indian language dataset. Specifically, weight transfer fine-tuning Zhang et al. (2022); Li et al. (2023) constrains the distance between the weights of the pre-trained model and the fine-tuned model to mitigate catastrophic forgetting and overfitting during fine-tuning. In addition, we use score average fusion to improve the performance of our SV systems. The experimental results demonstrate the effectiveness of the above methods, and we finally take 1st and 3rd place in the constrained and unconstrained I-MSV sub-tasks, respectively. Our final code, available at https://github.com/RioLLee/MSVChallenge, is based on the SpeechBrain toolkit Ravanelli et al. (2021).

2 Methodology

In this section, we describe different neural network frameworks in low-resource speaker verification and the fine-tuning methods for domain adaptation from out-domain pre-trained models to the in-domain Indian dataset. Moreover, we introduce weight transfer fine-tuning, which constrains the distance between the weights of the pre-trained model and the fine-tuned model, to improve the performance of the speaker verification systems.

2.1 ECAPA-TDNN

ECAPA-TDNN is known as one of the state-of-the-art speaker embedding models. It has achieved striking performance in many speaker verification challenges with large amounts of labeled data Zhang et al. (2021c); Desplanques et al. (2020); Thienpondt et al. (2021). Therefore, in the 2022 I-MSV challenge, we explore two ECAPA-TDNN variants with 1024 and 2048 channels for low-resource speaker verification.

As shown in Table 1, ECAPA-TDNN consists of three SE-Res2Block layers, a multi-layer aggregation layer, and a channel- and context-dependent statistics pooling layer Desplanques et al. (2020). In Table 1, T refers to the length of the input feature, C is the number of channels of the convolutional neural network, and D is the embedding dimension. The loss function we use in this report is the additive angular margin softmax (AAMSoftmax) loss Deng et al. (2019), and N is the number of speakers in the training dataset.

Table 1: ECAPA-TDNN Structure
Layer          Kernel Size   Stride   Output Shape
Conv1D         5             1        T × C
SE-Res2Block1  3             2        T × C
SE-Res2Block2  3             3        T × C
SE-Res2Block3  3             4        T × C
Conv1D         1             1        T × (3 × C)
ASP            -             -        6 × C
Linear         1             -        D
AAMSoftmax     -             -        N
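
The AAMSoftmax head in the last row of Table 1 can be sketched in PyTorch as follows. This is a minimal illustration rather than our released recipe; the scale and margin values follow Section 3.2, and the class weight matrix has one row per training speaker.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AAMSoftmax(nn.Module):
    """Additive angular margin softmax (ArcFace-style) classification head."""

    def __init__(self, embedding_dim, num_speakers, scale=30.0, margin=0.2):
        super().__init__()
        # One class weight vector per training speaker (N rows).
        self.weight = nn.Parameter(torch.empty(num_speakers, embedding_dim))
        nn.init.xavier_normal_(self.weight)
        self.scale = scale
        self.margin = margin

    def forward(self, embeddings, labels):
        # Cosine similarity between L2-normalized embeddings and class weights.
        cosine = F.linear(F.normalize(embeddings), F.normalize(self.weight))
        theta = torch.acos(cosine.clamp(-1.0 + 1e-7, 1.0 - 1e-7))
        # Add the angular margin only to the target-class logits.
        target = F.one_hot(labels, num_classes=self.weight.size(0)).bool()
        logits = torch.where(target, torch.cos(theta + self.margin), cosine)
        return F.cross_entropy(self.scale * logits, labels)

For a D-dimensional embedding and N training speakers, the head would be instantiated as AAMSoftmax(D, N).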

2.2 ResNet34-SE

The deep residual network (ResNet) is a well-known deep neural network that mitigates the vanishing gradient problem with shortcut connections. In recent years, ResNet has become a popular backbone in the speaker verification field Heo et al. (2020); Zhang et al. (2021c). In this challenge, we try two kinds of ResNet models: one with squeeze-and-excitation (SE) Hu et al. (2018) attention and one with a variant of SE attention Zhang et al. (2021b). The model structure of ResNet34-SE is illustrated in Table 2. Specifically, we use attentive statistics pooling (ASP) Okabe et al. (2018) as the pooling layer in ResNet34-SE.

Table 2: ResNet34-SE Structure
Layer        Kernel Size   Stride   Output Shape
Conv2D       3 × 3         1 × 1    T × 80 × C
Res1         3 × 3         1 × 1    T × 80 × C
SE-Module    -             -        T × 80 × C
Res2         3 × 3         2 × 2    T × 40 × C
SE-Module    -             -        T × 40 × C
Res3         3 × 3         2 × 2    T/2 × 20 × C
SE-Module    -             -        T/2 × 20 × C
Res4         3 × 3         2 × 2    T/4 × 10 × C
SE-Module    -             -        T/4 × 10 × C
Flatten      -             -        T/8 × (10 × C)
ASP          -             -        10 × C
Linear       1             -        D
AAM-Softmax  -             -        N
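
The SE-Module rows in Table 2 follow the squeeze-and-excitation design of Hu et al. (2018). The following is a minimal PyTorch sketch, assuming 4-dimensional (batch, channel, time, frequency) feature maps; the reduction ratio of 8 follows Section 3.2.

import torch
import torch.nn as nn

class SEModule(nn.Module):
    """Squeeze-and-excitation channel attention (Hu et al., 2018)."""

    def __init__(self, channels, reduction=8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        # x: (batch, channels, time, freq) feature maps from a residual stage.
        scale = self.fc(x.mean(dim=(2, 3)))           # squeeze: global average pooling
        return x * scale.unsqueeze(-1).unsqueeze(-1)  # excite: channel-wise reweighting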

2.3 Fine-tuning

Leveraging additional out-domain datasets leads to a domain mismatch between the large out-domain datasets and the small in-domain Indian dataset Qin et al. (2021); Zhang et al. (2022); Li et al. (2023). The mismatch lies in various aspects, including cross-language and cross-recording-device differences. Vanilla fine-tuning is the most common approach to deal with domain mismatch: the weights of the model to be fine-tuned are initialized with those of the pre-trained model, and the model is then trained on the target-domain dataset. Specifically, in the unconstrained I-MSV, we use the VoxCeleb1&2 development sets to pre-train the speaker verification models and then fine-tune them with the Indian language training set, adapting the models to the target domain while maintaining their discrimination ability.
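
A minimal sketch of this procedure is given below. The model builder, checkpoint path, and data loader are placeholder names for illustration rather than our released code; the optimizer settings follow Section 3.2.

import torch

# Placeholder names (build_speaker_model, indian_train_loader) for illustration.
model = build_speaker_model()                     # e.g., ECAPA-TDNN or ResNet34-SE
state = torch.load("voxceleb_pretrained.pt", map_location="cpu")
model.load_state_dict(state)                      # initialize from the out-domain model

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=2e-4)
for features, labels in indian_train_loader:      # in-domain Indian training set
    loss = model(features, labels)                # AAMSoftmax classification loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()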

However, vanilla fine-tuning only initializes the weights of the fine-tuned model with those of the pre-trained model, without addressing the catastrophic forgetting and overfitting problems. Therefore, we introduce a weight transfer loss as in Zhang et al. (2022); Li et al. (2023) to deal with these problems, which constrains the distance between the weights of the pre-trained model and those of the fine-tuned model during the fine-tuning process. Specifically, supposing the weights of the pre-trained model and the fine-tuned model are $W^{s}$ and $W^{t}$ respectively, the weight transfer loss $L_{wt}$ is calculated as

$L_{wt}=\left\|W^{s}-W^{t}\right\|_{2}$   (1)

The final loss function during fine-tuning is

$L_{ft}=L_{CE}+L_{wt}+L_{2},$   (2)

where $L_{CE}$ is the speaker classification loss (AAMSoftmax) and $L_{2}$ is the common L2 regularization loss.
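
A sketch of the weight transfer loss in Eq. (1) is given below, assuming that the pre-trained and fine-tuned models share the same architecture so that parameters can be matched by name; the $L_{2}$ term in Eq. (2) can be applied through the optimizer's weight decay, as described in Section 3.2.

import torch

def weight_transfer_loss(pretrained_model, finetuned_model):
    """Eq. (1): L2 distance between pre-trained (W^s) and fine-tuned (W^t) weights."""
    sq_dist = torch.zeros(())
    pretrained = dict(pretrained_model.named_parameters())
    for name, param in finetuned_model.named_parameters():
        # Accumulate squared differences over all matched parameters.
        sq_dist = sq_dist + (param - pretrained[name].detach()).pow(2).sum()
    return torch.sqrt(sq_dist)

# Eq. (2), with L2 regularization handled by the optimizer's weight decay:
# loss = aam_softmax_loss + weight_transfer_loss(pretrained_model, model)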

3 Experiments & Analysis

3.1 Datasets & Augmentation

In the I-MSV track, the development data consists of speech in Indian languages, collected in multiple sessions using five different sensors. In the evaluation set, the enrollment data consists of English utterances captured in multiple sessions using only a headphone as the sensor. There are two test sets, a public one and a private one, both of which exhibit language and recording-device mismatch with respect to the enrollment data.

In the constrained sub-task of the I-MSV track, we only use the released development Indian dataset as the training set for our speaker verification models. In the unconstrained sub-task of the I-MSV track, we leverage the VoxCeleb1&2 Nagrani et al. (2017, 2020) as our pre-trained datasets. Then we fine-tune the pre-trained models with the released Indian dataset.

Online data augmentation Cai et al. (2020) is used for all our speaker verification models. Specifically, we adopt SpecAugment Park et al. (2019) with frequency/time masking and time warping, additive noise augmentation Snyder et al. (2015), reverberation augmentation Habets (2006), and speed perturbation. The augmentation configurations are listed below, followed by a minimal sketch of the additive noise and reverberation augmentations:

  • Frequency-Domain SpecAug: We apply time and frequency masking as well as time warping to the input spectrum (frequency-domain implementation) Park et al. (2019).

  • Additive Noise: We add the noise, music, and babble types from MUSAN Snyder et al. (2015) to the original speech.

  • Reverberation: We simulate reverberant speech by convolving clean speech with different RIRs from Habets (2006).

  • Speed perturbation: We apply speed perturbation with factors of 0.9 and 1.1 in the training stage.
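
As an illustration of the additive noise and reverberation items above, the following sketch mixes a MUSAN clip into clean speech at a target SNR and convolves speech with a room impulse response. The function names and the fixed-SNR interface are illustrative assumptions, not the SpeechBrain recipe we actually used.

import torch
import torch.nn.functional as F

def mix_at_snr(speech, noise, snr_db):
    """Additive noise augmentation: mix a MUSAN clip into clean speech at a
    target SNR (dB); the noise clip is assumed to be at least as long."""
    noise = noise[: speech.numel()]
    speech_power = speech.pow(2).mean()
    noise_power = noise.pow(2).mean().clamp(min=1e-10)
    scale = torch.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise

def add_reverb(speech, rir):
    """Reverberation augmentation: convolve clean speech with a simulated RIR
    and trim the result back to the original length."""
    rir = rir / rir.norm()  # energy-normalize the impulse response
    reverberant = F.conv1d(
        speech.view(1, 1, -1),
        torch.flip(rir, dims=[0]).view(1, 1, -1),  # flip so conv1d performs true convolution
        padding=rir.numel() - 1,
    ).view(-1)
    return reverberant[: speech.numel()]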

3.2 Experimental Setup

Model Configuration The channel numbers of the TDNN layers in ECAPA-TDNN are 1024 or 2048, and the corresponding embedding dimensions are 192 or 256. For ResNet34, we train five ResNet34-related models with a similar structure, whose residual blocks have {64, 128, 256, 512} or {32, 64, 128, 256} channels, followed by multi-head attention statistics pooling. In particular, we add SE blocks with a reduction ratio of 8 to the last layer of the residual blocks in our models.

Training Details Eighty-dimensional Mel-filterbank features with a 25 ms window size and a 10 ms window shift are extracted as model inputs. During training, the learning rate of all models varies between 1e-8 and 1e-3 following the triangular2 policy Smith (2017), and the optimizer is Adam Kingma and Ba (2014). The scale and margin hyperparameters of AAM-Softmax are set to 30 and 0.2, respectively. To prevent overfitting, we apply a weight decay of 2e-4 to all weights in the models.
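
The optimizer and learning-rate schedule described above can be set up roughly as follows in PyTorch; `model` and the step size are placeholders, since the exact cycle length depends on the number of mini-batches in the actual recipe.

import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=2e-4)
scheduler = torch.optim.lr_scheduler.CyclicLR(
    optimizer,
    base_lr=1e-8,
    max_lr=1e-3,
    step_size_up=10000,
    mode="triangular2",    # peak learning rate is halved after every cycle
    cycle_momentum=False,  # required with Adam, which has no momentum buffer
)
# Per training step: loss.backward(); optimizer.step(); scheduler.step()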

Score Average Fusion We split the enrollment audio recordings into segments of random length between 10 and 60 seconds, and then average the speaker embeddings extracted from all recordings of the same speaker to obtain that speaker's enrollment embedding. To further improve the performance of the speaker verification systems, we use score average fusion, with systems selected according to their performance on the public test set.
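
A minimal sketch of the enrollment averaging and equal-weight score fusion steps is shown below; the function names are illustrative.

import torch

def enrollment_embedding(segment_embeddings):
    """Average the embeddings of all enrollment segments of one speaker to
    obtain a single speaker-level enrollment embedding."""
    return torch.stack(segment_embeddings).mean(dim=0)

def fuse_scores(per_system_scores):
    """Score average fusion: equal-weight mean of the scores that the selected
    systems (chosen on the public test set) assign to the same trial."""
    return sum(per_system_scores) / len(per_system_scores)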

Score Metric In the test phase, we use cosine similarity as the scoring criterion. The performance metric is equal error rate (EER) Reynolds et al. (2017).
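
The scoring and metric computation can be sketched as follows; this is an illustrative NumPy implementation of cosine scoring and EER, not the official scoring tool.

import numpy as np

def cosine_score(enroll, test):
    """Cosine similarity between enrollment and test embeddings (trial score)."""
    enroll, test = np.asarray(enroll, float), np.asarray(test, float)
    return float(enroll @ test / (np.linalg.norm(enroll) * np.linalg.norm(test)))

def compute_eer(scores, labels):
    """Equal error rate: the operating point where the false-acceptance rate
    equals the false-rejection rate. `labels` is 1 for target trials, 0 otherwise."""
    scores, labels = np.asarray(scores, float), np.asarray(labels, int)
    thresholds = np.sort(np.unique(scores))
    far = np.array([(scores[labels == 0] >= t).mean() for t in thresholds])
    frr = np.array([(scores[labels == 1] < t).mean() for t in thresholds])
    idx = int(np.argmin(np.abs(far - frr)))
    return 100.0 * (far[idx] + frr[idx]) / 2.0  # EER in percent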

3.3 Experimental Results

We evaluate all models with the above-mentioned strategies on the public test datasets of the constrained and unconstrained I-MSV. The results are summarized in Table 3 and Table 4, respectively. For the constrained I-MSV, as shown in Table 3, the best single model is ECAPA_2048 with the lowest EER of 1.764% among all speaker verification models. After fusing the scores from ECAPA_1024, ECAPA_2048, and ResNet34SE_256, we obtain the best fusion EER of 1.677%, a relative EER reduction of about 5% compared with the best single model.

Table 3: EER of all systems on the public test dataset of the constrained I-MSV
     Index      Model      EER(%)
     A      ECAPA_1024      1.881
     B      ECAPA_2048      1.764
     C      ResNet34SE_512      2.030
     D      ResNet34SE_256      1.864
     E      ResNetDTCF_512      1.899
     Fusion      [A+B+D]      1.677

In Table 4, the ResNet34SE_512_fine-tune model achieves the best single-model result with an EER of 0.289%. Fusing the scores from ECAPA_2048_weight_transfer and ResNet34SE_512_fine-tune leads to our best EER of 0.223%, a relative EER reduction of 23% compared with the best single model. We also observe that the fine-tuned models perform considerably better than the corresponding models trained from scratch. The largest gain is obtained by ResNet34SE_512, whose EER on the unconstrained I-MSV is relatively 86% lower than on the constrained I-MSV.

Table 4: EER of all systems on the public test dataset of the unconstrained I-MSV
Index Model EER(%)
A1 ECAPA_1024_fine-tune 0.510
A2 ECAPA_1024_weight_transfer 0.508
B1 ECAPA_2048_fine-tune 0.680
B2 ECAPA_2048_weight_transfer 0.324
C1 ResNet34SE_512_fine-tune 0.289
C2 ResNet34SE_512_weight_transfer 0.494
D1 ResNet34SE_256_fine-tune 0.693
D2 ResNet34SE_256_weight_transfer 0.862
E1 ResNetDTCF_512_fine-tune 1.216
E2 ResNetDTCF_512_weight_transfer 0.561
Fusion [B2+C1] 0.223

As shown in Table 5, on the private test set of the constrained I-MSV our best submitted system is the fused model, which achieves an EER of 2.123%. For the unconstrained I-MSV, the best system is ResNet34SE_512_fine-tune with an EER of 0.630%.

Table 5: EER of submitted systems on the private test dataset
Track Model EER(%)
Constrained I-MSV fused model 2.123
Unconstrained I-MSV ResNet34SE_512_fine-tune 0.630

4 Discussion

This paper introduces the main approaches used in our submitted systems for the MSV Challenge 2022: ECAPA-TDNN and ResNet34-SE models for SV in low-resource Indian languages, vanilla fine-tuning and weight transfer fine-tuning strategies to transfer pre-trained models to the Indian dataset, and score average fusion. Through this study, we find that there is still substantial room for improving speaker verification performance in low-resource languages. For the constrained I-MSV, we plan to explore recent low-resource learning strategies such as few-shot learning Wang et al. (2020); Yang et al. (2022). For the unconstrained I-MSV, data augmentation strategies such as cross-lingual voice conversion Shahnawazuddin et al. (2020) are a promising way to expand the training data.

5 Conclusion

In this report, we describe our submission to the I-MSV track of the 2022 Multilingual Speaker Verification (MSV) Challenge. We first explore ECAPA-TDNN and ResNet34-SE for low-resource Indian language speaker verification, finding that all of these models outperform the baseline model. Moreover, vanilla fine-tuning and weight transfer fine-tuning are introduced to deal with the domain mismatch between the in-domain Indian dataset and the large-scale out-domain datasets. Finally, our experiments show that score fusion further benefits the speaker verification systems developed for the Indian languages. Together with the above approaches, our final systems achieve EERs of 2.123% and 0.630% on the constrained and unconstrained I-MSV sub-tasks, taking 1st and 3rd place in the respective rankings.

References

  • Cai et al. (2020) Weicheng Cai, Jinkun Chen, Jun Zhang, and Ming Li. 2020. On-the-fly data loader and utterance-level aggregation for speaker and language recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 28:1038–1051.
  • Deng et al. (2019) Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. 2019. Arcface: Additive angular margin loss for deep face recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4690–4699.
  • Desplanques et al. (2020) Brecht Desplanques, Jenthe Thienpondt, and Kris Demuynck. 2020. ECAPA-TDNN: Emphasized channel attention, propagation and aggregation in tdnn based speaker verification. INTERSPEECH.
  • Fan et al. (2020) Yue Fan, JW Kang, LT Li, KC Li, HL Chen, ST Cheng, PY Zhang, ZY Zhou, YQ Cai, and Dong Wang. 2020. Cn-celeb: a challenging chinese speaker recognition dataset. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7604–7608. IEEE.
  • Garcia-Romero et al. (2014) Daniel Garcia-Romero, Alan McCree, Stephen Shum, Niko Brummer, and Carlos Vaquero. 2014. Unsupervised domain adaptation for i-vector speaker recognition. In Proceedings of Odyssey: The Speaker and Language Recognition Workshop, volume 8.
  • Gusev et al. (2020) Aleksei Gusev, Vladimir Volokhov, Alisa Vinogradova, Tseren Andzhukaev, Andrey Shulipa, Sergey Novoselov, Timur Pekhovsky, and Alexander Kozlov. 2020. Stc-innovation speaker recognition systems for far-field speaker verification challenge 2020. In INTERSPEECH, pages 3466–3470.
  • Habets (2006) Emanuel AP Habets. 2006. Room impulse response generator. Technische Universiteit Eindhoven, Tech. Rep, 2(2.4):1.
  • Heo et al. (2020) Hee Soo Heo, Bong-Jin Lee, Jaesung Huh, and Joon Son Chung. 2020. Clova baseline system for the voxceleb speaker recognition challenge 2020. INTERSPEECH.
  • Hu et al. (2018) Jie Hu, Li Shen, and Gang Sun. 2018. Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7132–7141.
  • Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. ICLR.
  • Lee et al. (2019) Kong Aik Lee, Qiongqiong Wang, and Takafumi Koshinaka. 2019. The coral+ algorithm for unsupervised domain adaptation of plda. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5821–5825. IEEE.
  • Li et al. (2023) Zhang Li, Wang Qing, Wang Hongji, Li Yue, Rao Wei, Wang Yannan, and Xie Lei. 2023. Distance-based weight transfer for fine-tuning from near-field to far-field speaker verification. ICASSP.
  • Nagrani et al. (2020) Arsha Nagrani, Joon Son Chung, Weidi Xie, and Andrew Zisserman. 2020. Voxceleb: Large-scale speaker verification in the wild. Computer Speech & Language, 60:101027.
  • Nagrani et al. (2017) Arsha Nagrani, Joon Son Chung, and Andrew Zisserman. 2017. Voxceleb: a large-scale speaker identification dataset. INTERSPEECH.
  • Okabe et al. (2018) Koji Okabe, Takafumi Koshinaka, and Koichi Shinoda. 2018. Attentive statistics pooling for deep speaker embedding. In Proc. Interspeech 2018, pages 2252–2256.
  • Park et al. (2019) Daniel S Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D Cubuk, and Quoc V Le. 2019. Specaugment: A simple data augmentation method for automatic speech recognition. INTERSPEECH.
  • Qin et al. (2021) Xiaoyi Qin, Chao Wang, Yong Ma, Min Liu, Shilei Zhang, and Ming Li. 2021. Our learned lessons from cross-lingual speaker verification: The crmi-dku system description for the short-duration speaker verification challenge 2021. In Interspeech, pages 2317–2321.
  • Ravanelli et al. (2021) Mirco Ravanelli, Titouan Parcollet, Peter Plantinga, Aku Rouhe, Samuele Cornell, Loren Lugosch, Cem Subakan, Nauman Dawalatabad, Abdelwahab Heba, Jianyuan Zhong, et al. 2021. Speechbrain: A general-purpose speech toolkit. arXiv preprint arXiv:2106.04624.
  • Reynolds et al. (2017) Douglas Reynolds, Elliot Singer, Seyed O Sadjadi, Timothee Kheyrkhah, Audrey Tong, Craig Greenberg, Lisa Mason, and Jaime Hernandez-Cordero. 2017. The 2016 nist speaker recognition evaluation. Technical report, MIT Lincoln Laboratory Lexington United States.
  • Rohdin et al. (2019) Johan Rohdin, Themos Stafylakis, Anna Silnova, Hossein Zeinali, Lukáš Burget, and Oldřich Plchot. 2019. Speaker verification using end-to-end adversarial language adaptation. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6006–6010. IEEE.
  • Rosenberg (1976) Aaron E Rosenberg. 1976. Automatic speaker verification: A review. Proceedings of the IEEE, 64(4):475–487.
  • Shahnawazuddin et al. (2020) S Shahnawazuddin, Nagaraj Adiga, Kunal Kumar, Aayushi Poddar, and Waquar Ahmad. 2020. Voice conversion based data augmentation to improve children’s speech recognition in limited data scenario. In Interspeech, pages 4382–4386.
  • Shahnawazuddin et al. (2021) S Shahnawazuddin, Waquar Ahmad, Nagaraj Adiga, and Avinash Kumar. 2021. Children’s speaker verification in low and zero resource conditions. Digital Signal Processing, 116:103115.
  • Smith (2017) Leslie N Smith. 2017. Cyclical learning rates for training neural networks. In 2017 IEEE winter conference on applications of computer vision (WACV), pages 464–472. IEEE.
  • Snyder et al. (2015) David Snyder, Guoguo Chen, and Daniel Povey. 2015. MUSAN: A music, speech, and noise corpus. arXiv preprint arXiv:1510.08484.
  • Thienpondt et al. (2021) Jenthe Thienpondt, Brecht Desplanques, and Kris Demuynck. 2021. The idlab voxsrc-20 submission: Large margin fine-tuning and quality-aware score calibration in dnn based speaker verification. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5814–5818. IEEE.
  • Tong et al. (2020) Ying Tong, Wei Xue, Shanluo Huang, Lu Fan, Chao Zhang, Guohong Ding, and Xiaodong He. 2020. The jd ai speaker verification system for the ffsvc 2020 challenge. In INTERSPEECH, pages 3476–3480.
  • Wang and Deng (2018) Mei Wang and Weihong Deng. 2018. Deep visual domain adaptation: A survey. Neurocomputing, 312:135–153.
  • Wang et al. (2018) Qing Wang, Wei Rao, Sining Sun, Leib Xie, Eng Siong Chng, and Haizhou Li. 2018. Unsupervised domain adaptation via domain adversarial training for speaker recognition. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4889–4893. IEEE.
  • Wang et al. (2020) Yaqing Wang, Quanming Yao, James T Kwok, and Lionel M Ni. 2020. Generalizing from a few examples: A survey on few-shot learning. ACM computing surveys (csur), 53(3):1–34.
  • Yang et al. (2022) Seunghan Yang, Debasmit Das, Janghoon Cho, Hyoungwoo Park, and Sungrack Yun. 2022. Domain agnostic few-shot learning for speaker verification. INTERSPEECH.
  • Zhang et al. (2022) Li Zhang, Yue Li, Namin Wang, Jie Liu, and Lei Xie. 2022. Npu-hc speaker verification system for far-field speaker verification challenge 2022. INTERSPEECH.
  • Zhang et al. (2021a) Li Zhang, Qing Wang, Kong Aik Lee, Lei Xie, and Haizhou Li. 2021a. Multi-level transfer learning from near-field to far-field speaker verification. arXiv preprint arXiv:2106.09320.
  • Zhang et al. (2021b) Li Zhang, Qing Wang, and Lei Xie. 2021b. Duality temporal-channel-frequency attention enhanced speaker representation learning. In 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pages 206–213. IEEE.
  • Zhang et al. (2020) Li Zhang, Jian Wu, and Lei Xie. 2020. Npu speaker verification system for interspeech 2020 far-field speaker verification challenge. INTERSPEECH.
  • Zhang et al. (2021c) Li Zhang, Huan Zhao, Qinling Meng, Yanli Chen, Min Liu, and Lei Xie. 2021c. Beijing zkj-npu speaker verification system for voxceleb speaker recognition challenge 2021. INTERSPEECH.