
Adversarial Attacks and Robust Defenses in Speaker Embedding based Zero-Shot Text-to-Speech System
* Corresponding Author: Ming Li

Ze Li1,2, Yao Shi3, Yunfei Xu3, Ming Li1,2∗ [email protected] 1 School of Computer Science, Wuhan University, Wuhan, China 2 Suzhou Municipal Key Laboratory of Multimodal Intelligent Systems, Duke Kunshan University, Kunshan, China 3 AI Center, OPPO, Beijing, China
Abstract

Speaker embedding based zero-shot Text-to-Speech (TTS) systems enable high-quality speech synthesis for unseen speakers using minimal data. However, these systems are vulnerable to adversarial attacks, where an attacker introduces imperceptible perturbations to the original speaker's audio waveform, causing the synthesized speech to sound like another person. This vulnerability poses significant security risks, including speaker identity spoofing and unauthorized voice manipulation. This paper investigates two primary defense strategies to address these threats: adversarial training and adversarial purification. Adversarial training enhances the model's robustness by integrating adversarial examples during the training process, thereby improving resistance to such attacks. Adversarial purification, on the other hand, employs diffusion probabilistic models to revert adversarially perturbed audio to its clean form. Experimental results demonstrate that these defense mechanisms can significantly reduce the impact of adversarial perturbations, enhancing the security and reliability of speaker embedding based zero-shot TTS systems in adversarial environments.

Index Terms:
zero-shot text-to-speech, adversarial attack, anti-spoofing, adversarial training, diffusion probabilistic model

I Introduction

With the rapid advancement of deep learning technologies, Text-to-Speech (TTS) systems have made significant progress [1, 2, 3], particularly with the emergence of Zero-Shot TTS. This technology enables the generation of natural speech for any speaker from short audio samples. Currently, the mainstream Zero-Shot TTS approaches include speaker embedding based methods [4, 5, 6] and language model based ones [7, 8]. Speaker embedding based Zero-Shot TTS utilizes a speaker encoder alongside a TTS component, whereas large speech generation models formulate the Zero-Shot TTS task as a language modeling task within the neural codec domain.

Although Zero-Shot TTS technology has shown great potential, it also faces new challenges, particularly regarding security and robustness. Malicious attacks have become a severe concern as these systems are increasingly deployed in scenarios that demand high security and reliability. In particular, speaker embedding based Zero-Shot TTS systems are vulnerable to various spoofing attacks. Research has shown that even deep neural network based speaker encoders [9, 10, 11] are susceptible to malicious spoofing attacks such as impersonation [12], replay attacks [13], voice conversion [14], and adversarial attacks [15]. Attackers can manipulate their own voice to alter the extracted speaker embeddings, leading the system to generate speech resembling the target speaker and thereby enabling a range of fraudulent activities, including voice forgery and identity impersonation.

This study focuses on adversarial attacks in speaker embedding based Zero-Shot TTS systems. Adversarial attacks are typically carried out by generating adversarial examples, which are crafted by introducing imperceptible perturbations through optimization methods such as the Fast Gradient Sign Method [16], Projected Gradient Descent (PGD) [17], optimizer-based approaches [18], and Carlini-Wagner [19], to induce misclassification.

The prevailing method for defending against adversarial attacks is adversarial training [20, 21], which enhances the model’s robustness by exposing it to adversarial examples during the training phase. Although adversarial training is widely regarded as the most effective defense strategy, it requires substantial computational resources, and the model remains susceptible to unseen attacks that differ from the adversarial methods used during training. Another approach is adversarial purification, which focuses on designing effective purification models to mitigate the adversarial perturbations in input samples. Currently, diffusion models have proven to be the state-of-the-art purification models in both the vision domain [22] and the audio domain, such as in background noise removal [23] and speech command recognition [24] tasks.

This paper investigates adversarial attacks and defenses for speaker embedding based Zero-Shot TTS systems. For the attack phase, we utilize PGD [17] and Adam [25] optimizer-based approaches to generate adversarial examples targeting the speaker encoder of the Zero-Shot TTS system in a white-box attack scenario. For the defense phase, we evaluate and compare two strategies: an active defense through adversarial training and a passive defense via adversarial purification using diffusion models. The demo page can be found at https://se-zs-tts.github.io/.

Figure 1: Attack and Defense Framework for Speaker Embedding Based Zero-Shot TTS System

II Methods

This section presents our methods for adversarial attacks on speaker embedding based zero-shot TTS system, along with corresponding defense measures, including adversarial training and adversarial purification. The overall framework is illustrated in Fig. 1.

II-A Adversarial Example Generation

Adversarial examples refer to instances with imperceptible perturbations that are deliberately introduced. These perturbations are obtained by solving an optimization problem and lead a well-trained model to make incorrect predictions. This study utilizes PGD [17] and Adam [25] optimizer-based methods to generate adversarial examples. Both methods are gradient-based white-box attacks. We aim to attack the speaker encoder of a speaker embedding based Zero-Shot TTS system by adding small perturbations to the input speech, causing the output speaker embeddings to change, which leads the TTS system to generate speech that mimics the target speaker.

The core idea of the PGD method is to iteratively update the input samples based on the gradient sign of the loss function while constraining the perturbation magnitude. Specifically, given input speech $x$, target labels $y'$, and a well-trained speaker encoder model $f(\cdot)$, the predicted labels $\hat{y}$ are obtained by computing the index of the maximum cosine similarity between the speaker embeddings of the source speeches and those of the adversarial examples, both extracted by the speaker encoder. The adversarial perturbation $\delta$ is generated by:

$\hat{y}=\left[\arg\max_{j}\ \mathrm{cosine}\big(f(x)_{i},\, f(x+\delta)_{j}\big)\right]_{i}$ (1)
$\delta_{t+1}=\delta_{t}-\alpha\cdot \mathrm{sign}\big(\nabla_{\delta_{t}}\mathrm{Loss}(\hat{y},y^{\prime})\big), \quad \text{s.t.}\ \|\delta\|_{\infty}\leq\epsilon$ (2)

where $\mathrm{sign}(\cdot)$ denotes the sign of the gradient, and $\alpha$ and $\epsilon$ control the magnitude of each update and the maximum allowable perturbation, respectively.
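
As a concrete illustration, the following is a minimal PGD sketch in PyTorch. The encoder interface, the negative cosine-similarity loss toward the target embedding, and the fixed step size are assumptions for illustration; the paper's exact loss and step-size schedule (Section III-B) are not reproduced here.

```python
import torch
import torch.nn.functional as F

def pgd_attack(speaker_encoder, x, target_emb, eps, alpha=4e-3, n_iter=20):
    """Targeted PGD on the waveform: push f(x + delta) toward the target
    speaker embedding while keeping ||delta||_inf <= eps (Eqs. 1-2)."""
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(n_iter):
        emb = speaker_encoder(x + delta)
        # Assumed loss: negative cosine similarity to the target embedding
        loss = -F.cosine_similarity(emb, target_emb, dim=-1).mean()
        loss.backward()
        with torch.no_grad():
            delta -= alpha * delta.grad.sign()   # gradient-sign step (Eq. 2)
            delta.clamp_(-eps, eps)              # project onto the L-inf ball
        delta.grad.zero_()
    return (x + delta).detach()
```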

Compared to PGD, Adam is an adaptive learning rate optimization algorithm. In each iteration, Adam not only utilizes the current gradient information but also incorporates momentum from previous iterations to update the perturbation:

$m_{t}=\beta_{1}m_{t-1}+(1-\beta_{1})\nabla_{\delta_{t}}\mathrm{Loss}(\hat{y},y^{\prime}),$ (3)
$v_{t}=\beta_{2}v_{t-1}+(1-\beta_{2})\big(\nabla_{\delta_{t}}\mathrm{Loss}(\hat{y},y^{\prime})\big)^{2},$
$\delta_{t+1}=\delta_{t}-lr\cdot\frac{m_{t}}{\sqrt{v_{t}}+\xi}, \quad \text{s.t.}\ \|\delta\|_{\infty}\leq\epsilon$

where $m_{t}$ and $v_{t}$ are the first-moment and second-moment estimates of the gradient, respectively, $lr$ is the learning rate, $\xi$ is a numerical stability constant, and $\beta_{1}$ and $\beta_{2}$ are the decay rates for the first- and second-moment estimates, respectively.
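
A corresponding sketch of the Adam-based attack is given below; the moment estimates of Eq. (3) are maintained internally by torch.optim.Adam, and the same cosine-similarity loss assumption as in the PGD sketch applies.

```python
import torch
import torch.nn.functional as F

def adam_attack(speaker_encoder, x, target_emb, eps, lr=1e-3, n_iter=50):
    """Targeted attack driven by the Adam optimizer on the perturbation delta."""
    delta = torch.zeros_like(x, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)   # handles m_t, v_t of Eq. (3)
    for _ in range(n_iter):
        opt.zero_grad()
        emb = speaker_encoder(x + delta)
        loss = -F.cosine_similarity(emb, target_emb, dim=-1).mean()
        loss.backward()
        opt.step()
        with torch.no_grad():
            delta.clamp_(-eps, eps)          # enforce ||delta||_inf <= eps
    return (x + delta).detach()
```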

II-B Adversarial Training

Adversarial training is one of the most widely used and effective methods for defending against adversarial attacks, as it enhances the model's robustness by incorporating adversarial examples into the training process. In this work, we also employ adversarial training. For each speech sample within a batch $\{(x_{i},y_{i})\}_{i=1}^{b}$, we randomly assign a target speaker label $y'$ that differs from the source speaker and then apply adversarial attack methods to generate adversarial examples $\{(\hat{x}_{i},y^{\prime}_{i})\}_{i=1}^{b}$. Subsequently, these adversarial examples are labeled with the source speaker's label and are used alongside the source speech, $\{(\hat{x}_{i},y_{i})\cup(x_{i},y_{i})\}_{i=1}^{b}$, to fine-tune the well-trained speaker encoder model. Finally, the fine-tuned speaker encoder model is used to retrain the zero-shot TTS system.
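
A hedged sketch of one adversarial fine-tuning step is shown below. The attack function (either sketch above), the ArcFace-style classification head, and its (embedding, label) interface are illustrative assumptions, not the authors' implementation.

```python
import torch

def adversarial_training_step(speaker_encoder, arcface_head, criterion,
                              optimizer, attack_fn, x, y, target_embs, eps):
    """Craft targeted adversarial examples, relabel them with the source
    speaker labels, and train on the union of clean and adversarial data."""
    x_adv = attack_fn(speaker_encoder, x, target_embs, eps)
    x_all = torch.cat([x, x_adv], dim=0)
    y_all = torch.cat([y, y], dim=0)    # adversarial examples keep source labels
    optimizer.zero_grad()
    logits = arcface_head(speaker_encoder(x_all), y_all)  # margin head takes labels
    loss = criterion(logits, y_all)
    loss.backward()
    optimizer.step()
    return loss.item()
```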

II-C Adversarial Purification

Diffusion-based adversarial purification is an emerging defense technique against adversarial attacks, which utilizes diffusion models to remove adversarial perturbations from input data, thereby restoring clean speech for effective defense. As a plug-and-play module, diffusion models effectively circumvent the issues of domain shifts and secondary training associated with adversarial training. Furthermore, they do not require training on predefined adversarial examples, which endows them with solid generalization capabilities and allows them to address a wide range of attack methods.

A diffusion model normally consists of a forward diffusion process and a reverse sampling process. The forward diffusion process gradually adds Gaussian noise to the input speech until the distribution of the noisy speech converges to a standard Gaussian distribution:

$q(x_{t}\mid x_{0})=\mathcal{N}\big(x_{t};\ \sqrt{\bar{\alpha}_{t}}\,x_{0},\ (1-\bar{\alpha}_{t})\mathbf{I}\big)$ (4)

where $x_{0}$ is the clean speech, $x_{t}$ represents the noisy speech at time step $t$, and the hyperparameter $\bar{\alpha}_{t}$ controls the noise level.

The reverse sampling process takes the standard Gaussian noise as input and gradually denoises the noisy speech to recover clean speech. The reverse process is approximated by learning a model $p_{\theta}(x_{t-1}\mid x_{t})$:

$p_{\theta}(x_{t-1}\mid x_{t})=\mathcal{N}\big(x_{t-1};\ \mu_{\theta}(x_{t},t),\ \Sigma_{\theta}(x_{t},t)\big)$ (5)

where $\mu_{\theta}(x_{t},t)$ and $\Sigma_{\theta}(x_{t},t)$ represent the predicted mean and covariance at time step $t$, respectively.

The optimization objective is to minimize the speech reconstruction error, which is achieved using a Mean Squared Error (MSE) loss function:

$L=\mathrm{MSE}(x_{0},\hat{x}_{0})=\|x_{0}-\hat{x}_{0}\|_{2}^{2}/N$ (6)

where $\hat{x}_{0}$ represents the speech obtained by denoising the noisy speech $x_{t}$, and $N$ is the number of samples in the speech $x_{0}$.
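
The purification procedure can be sketched as follows: partially diffuse the input with Eq. (4), then denoise it with the learned reverse process of Eq. (5). The `reverse_step` method and the diffusion depth `n_star` are placeholders standing in for one reverse sampling step of a DiffWave-style model and for the tunable denoising strength discussed in Section IV-C.

```python
import torch

@torch.no_grad()
def purify(diffusion_model, x_adv, alpha_bar, n_star):
    """Partially diffuse the (possibly adversarial) waveform for n_star forward
    steps (Eq. 4), then run the learned reverse process (Eq. 5) back to step 0.
    alpha_bar holds the cumulative noise-schedule products."""
    a = alpha_bar[n_star]
    x_t = (a ** 0.5) * x_adv + ((1.0 - a) ** 0.5) * torch.randn_like(x_adv)
    for t in range(n_star, 0, -1):
        x_t = diffusion_model.reverse_step(x_t, t)   # p_theta(x_{t-1} | x_t)
    return x_t
```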

Additionally, considering that adversarial purification might affect clean audio, we introduce a binary classifier before the diffusion module to distinguish between audio samples with adversarial perturbations and those that are clean.
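
Building on the purification sketch above, the classifier-gated defense reduces to a simple conditional; the `detector` interface (returning the probability that the input is adversarial) is an assumption.

```python
def defend(waveform, detector, diffusion_model, alpha_bar, n_star, threshold=0.5):
    """Purify only inputs flagged as adversarial; pass clean audio untouched."""
    if detector(waveform) > threshold:
        return purify(diffusion_model, waveform, alpha_bar, n_star)
    return waveform
```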

III Experimental Settings

III-A Speaker Embedding Based Zero-Shot TTS Training

We utilize the ResNet34 [26] architecture as the speaker encoder model and the VITS [3] structure as the zero-shot TTS component. For ResNet34, the residual block channels are set to {64,128,256,512}, and the output feature maps are aggregated with a global statistics pooling layer that calculates each feature map’s means and standard deviations. The acoustic features are 80-dimensional log Mel-filterbank energies with a frame length of 25ms and a hop size of 10ms.
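
For illustration, a minimal torchaudio front-end matching these feature settings might look as follows; the sample rate and FFT size are assumptions rather than details taken from the paper.

```python
import torch
import torchaudio

# 80-dim log Mel-filterbank front-end: 25 ms window / 10 ms hop (assuming 16 kHz audio)
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_fft=400, win_length=400, hop_length=160, n_mels=80)

def extract_fbank(waveform):
    # waveform: (batch, samples) -> (batch, 80, frames) log Mel-filterbank energies
    return torch.log(mel(waveform) + 1e-6)
```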

The speaker encoder model is pretrained on the VoxCeleb2 [27] development set and tested on the VoxCeleb1-O [28] test set. We adopt on-the-fly data augmentation [29] to add additive background noise or convolutional reverberation noise to the time-domain waveform. The MUSAN [30] and RIR Noise [31] datasets are used as noise sources and room impulse response functions, respectively. The input utterances are truncated to 2 seconds. We employ the ArcFace [32] classifier, with the margin and scale parameters set to 0.2 and 32, respectively. Network parameters are updated using an SGD optimizer with an initial learning rate of 0.1. The learning rate is decayed by a factor of 0.1 every ten epochs until it reaches 1e-5.

We use the clean subsets of the train and development sets from LibriTTS [33] to train the zero-shot TTS component. The speaker embeddings for the utterances are obtained from the well-trained speaker encoder. The TTS network parameters are updated using the AdamW [34] optimizer with a learning rate of 2e-4. The batch size is 32, and the total number of epochs is 40.

III-B Speaker Encoder Adversarial Training

The adversarial attack methods are described in II-A. For the PGD method, the perturbation limit $\epsilon$ is set to 5% of the maximum amplitude of each audio sample, with 20 iterations, and the step size $\alpha$ decreases from 4e-3 to 4e-4 with cosine decay. For the Adam optimizer-based method, $\epsilon$ is likewise set to 5% of the maximum amplitude of each audio sample, with 50 iterations, and the learning rate $lr$ decays from 1e-3 to 1e-5 with cosine decay.
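
The cosine decay of the attack step size can be implemented as a simple per-iteration schedule; the closed form below is an assumed parameterization that interpolates between the stated start and end values.

```python
import math

def cosine_decay(step, total_steps, start, end):
    """Cosine decay from `start` to `end`, e.g. PGD alpha from 4e-3 to 4e-4."""
    cos = 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
    return end + (start - end) * cos

# Per-iteration step sizes for a 20-step PGD attack
alphas = [cosine_decay(t, 19, 4e-3, 4e-4) for t in range(20)]
```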

In adversarial training, the batch-wise random targeted attack strategy is employed, where a batch of speech samples is selected, and each sample within the batch is randomly assigned a target sample with a different speaker identity. Adversarial examples are then generated using the adversarial attack method and combined with the original speeches to fine-tune the speaker encoder model. No data augmentation is applied. The model is optimized using an Adam optimizer with a cosine decay learning rate schedule, starting at 1e-3 and decaying to 1e-5. The batch size is 256, and the number of epochs is 3.

III-C Diffusion-based Adversarial Purification Training

DiffWave [35], a representative diffusion model in the waveform domain, is used as our defensive purification model. We use the same settings as those in [24] for diffusion parameters. The VoxCeleb2 development set is employed for model training, with input utterances truncated to 2 seconds. The learning rate is set to 2e-4, and the batch size is 16.

We introduce a ResNet18-based binary classifier before the diffusion model to prevent the diffusion model from damaging normal speech. The ArcFace classifier (m=0.2, s=32) is used for classification. A subset of the VoxCeleb2 development set is selected, from which 256,000 adversarial samples are generated using the adversarial attack methods and split 9:1 for training and testing. The model is updated using the Adam optimizer with a cosine decay learning rate schedule, starting at 1e-3 and decaying to 1e-5. The batch size is 256, and the total number of epochs is 10.

IV Results and Analysis

We used the adversarial attack methods described in II-A to randomly generate 2,560 adversarial samples for each method on the VoxCeleb2 dataset for evaluation. Table I presents the results for each method in terms of attack efficacy, defense performance, and synthesis quality. We define the defense as successful if the speaker embedding of the adversarial sample is most similar to the source speech’s. Conversely, the attack is successful if the adversarial sample’s speaker embedding is most similar to the target speech’s.
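
A simplified version of this success criterion, restricted to a pairwise comparison against the source and target embeddings for illustration, can be sketched as follows; the full evaluation may rank against more speakers.

```python
import torch.nn.functional as F

def evaluate(adv_emb, src_emb, tgt_emb):
    """Attack succeeds if the adversarial embedding is closer to the target
    speaker; defense succeeds if it is closer to the source speaker."""
    sim_src = F.cosine_similarity(adv_emb, src_emb, dim=-1)
    sim_tgt = F.cosine_similarity(adv_emb, tgt_emb, dim=-1)
    return {"attack_success_rate": (sim_tgt > sim_src).float().mean().item(),
            "defense_success_rate": (sim_src > sim_tgt).float().mean().item()}
```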

TABLE I: The performance of speaker embedding-based zero-shot TTS systems under various defense modes against different attack methods. Ori., Tgt., Adv., and Adv.(Syn) represent the source speech, target speech, adversarial samples, and the speech synthesized by the TTS system from the adversarial samples, respectively.
Defense | Attack Method | Attack Success Rate [%] | Defense Success Rate [%] | Ori. vs Adv. Similarity | Tgt. vs Adv. Similarity | EER [%] | Ori. vs Adv.(Syn) Similarity | Tgt. vs Adv.(Syn) Similarity
None | - | - | - | - | - | 0.957 | 0.370 | -0.001
None | Adam-based | 99.53 | 0.47 | 0.134 | 0.934 | 0.957 | 0.048 | 0.291
None | PGD | 100 | 0 | 0.074 | 0.959 | 0.957 | 0.023 | 0.311
Adversarial Training with Adam-based Attack | Adam-based | 9.65 | 90.35 | 0.747 | 0.421 | 2.350 | 0.296 | 0.149
Adversarial Training with Adam-based Attack | PGD | 37.85 | 62.15 | 0.618 | 0.546 | 2.350 | 0.239 | 0.190
Adversarial Training with PGD Attack | Adam-based | 1.56 | 98.44 | 0.839 | 0.335 | 4.626 | 0.364 | 0.126
Adversarial Training with PGD Attack | PGD | 4.02 | 95.98 | 0.781 | 0.392 | 4.626 | 0.331 | 0.145
Adversarial Purification | Adam-based | 0.39 | 91.41 | 0.549 | 0.183 | 0.957 | 0.181 | 0.048
Adversarial Purification | PGD | 2.34 | 83.98 | 0.479 | 0.157 | 0.957 | 0.154 | 0.044

IV-A Adversarial Attack Results

Both attack methods exhibited substantial effectiveness, nearly achieving a 100% attack success rate. After applying the attacks, the cosine similarity between the adversarial samples and the target speech was measured at 0.934 and 0.959, respectively. In contrast, the cosine similarity between the adversarial samples and the source speech decreased to 0.134 and 0.074. Quantitatively, the PGD attack method demonstrated greater effectiveness compared to the Adam-based attack method.

IV-B Adversarial Training Defense Results

After incorporating adversarial samples into the training process, the model’s robustness improved, with defense success rates rising from 0.47% and 0% to 90.35% and 95.98%, respectively. However, the introduction of adversarial samples, as cross-domain data, caused a degree of performance degradation on normal data. The impact of this degradation was more pronounced with stronger attack methods, as evidenced by the EER on the Vox1-O, which increased from 0.957% to 2.35% and 4.626%. Additionally, it is noteworthy that models trained with adversarial samples generated by the weaker Adam-based attack exhibited significantly reduced defense performance against stronger PGD attacks. In contrast, models trained with PGD-generated adversarial samples retained strong defense capabilities against Adam-based attacks, achieving a high defense success rate of 98.44%.

Figure 2: Adversarial Purification Performance Across Different Diffusion Steps.

IV-C Adversarial Purification Defense Results

Diffusion-based adversarial purification is a promising emerging technique. We observe that after adversarial purification, the attack success rates of adversarial samples decreased from 99.53% and 100% to 0.39% and 2.34%, respectively. However, the defense success rates reached only 91.41% and 83.98%. This is because the perturbations removed by the diffusion module cannot perfectly match the added perturbations; consequently, the denoised speaker embeddings may resemble those of other speakers rather than the source speaker's. Moreover, controlling the denoising strength in diffusion-based adversarial purification is crucial. As shown in Fig. 2, as the number of diffusion steps increases, the attack success rate and the similarity with the target speaker decrease, but the defense success rate and the similarity with the source speaker also eventually decrease after a certain inflection point.

Additionally, it is important to note that the diffusion module can also introduce some damage to normal speech. As shown in Fig. 2, the similarity between the normal speech and the source speaker rapidly declines with increasing diffusion steps. This degradation can further affect the quality of TTS synthesis. Therefore, we introduced a ResNet18-based discriminator in front of the diffusion model. Experimental results show that, after training on both positive and negative samples, this discriminator achieved a 100% recognition rate for adversarial samples of the attack types it was trained on. We will evaluate on unseen adversarial methods in the future.

IV-D Zero-Shot TTS Synthesis Results

We also explored the impact of different methods on the quality of zero-shot TTS synthesis. By extracting the speaker embeddings from adversarial samples and synthesizing speech using the text ”Good morning, good afternoon, and good evening!”, we evaluated the results by calculating the cosine similarity between the speaker embeddings of the synthesized speech and those of the source and target speakers. We observed that adversarial training defenses resulted in adversarial samples maintaining a high similarity to the source speaker, but with a relatively high similarity to the target speaker as well. In contrast, adversarial purification methods significantly reduced the similarity to the target speaker but also degraded a substantial portion of the source speaker’s information.

V Conclusion

This paper explores adversarial attacks and robust defenses in speaker embedding based zero-shot TTS systems. For the attack phase, we employ PGD and Adam-based white-box attack methods to target the speaker encoder of the zero-shot TTS system, aiming to guide the TTS system into synthesizing speech that closely resembles the target speaker. To mitigate the potential threats posed by these attacks, we implemented a traditional active defense strategy, adversarial training, as well as a novel passive defense strategy based on diffusion models for adversarial purification. We assessed the effectiveness of these defenses, their impact on model performance, and their effects on synthesis quality.

References

  • [1] Y. Ren, C. Hu, X. Tan, T. Qin, S. Zhao, Z. Zhao, and T. Liu, “Fastspeech 2: Fast and high-quality end-to-end text to speech,” in ICLR.   OpenReview.net, 2021.
  • [2] J. Kim, S. Kim, J. Kong, and S. Yoon, “Glow-tts: A generative flow for text-to-speech via monotonic alignment search,” Advances in Neural Information Processing Systems, vol. 33, pp. 8067–8077, 2020.
  • [3] J. Kim, J. Kong, and J. Son, “Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech,” in International Conference on Machine Learning.   PMLR, 2021, pp. 5530–5540.
  • [4] S. Arik, J. Chen, K. Peng, W. Ping, and Y. Zhou, “Neural voice cloning with a few samples,” Advances in neural information processing systems, vol. 31, 2018.
  • [5] Y. Jia, Y. Zhang, R. Weiss, Q. Wang, J. Shen, F. Ren, P. Nguyen, R. Pang, I. Lopez Moreno, Y. Wu et al., “Transfer learning from speaker verification to multispeaker text-to-speech synthesis,” Advances in neural information processing systems, vol. 31, 2018.
  • [6] Y. Wu, X. Tan, B. Li, L. He, S. Zhao, R. Song, T. Qin, and T. Liu, “Adaspeech 4: Adaptive text to speech in zero-shot scenarios,” in INTERSPEECH.   ISCA, 2022, pp. 2568–2572.
  • [7] C. Wang, S. Chen, Y. Wu, Z. Zhang, L. Zhou, S. Liu, Z. Chen, Y. Liu, H. Wang, J. Li et al., “Neural codec language models are zero-shot text to speech synthesizers,” arXiv preprint arXiv:2301.02111, 2023.
  • [8] S. Chen, S. Liu, L. Zhou, Y. Liu, X. Tan, J. Li, S. Zhao, Y. Qian, and F. Wei, “Vall-e 2: Neural codec language models are human parity zero-shot text to speech synthesizers,” arXiv preprint arXiv:2406.05370, 2024.
  • [9] W. Cai, J. Chen, and M. Li, “Exploring the encoding layer and loss function in end-to-end speaker and language recognition system,” arXiv preprint arXiv:1804.05160, 2018.
  • [10] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, “X-vectors: Robust DNN embeddings for speaker recognition,” in ICASSP.   IEEE, 2018, pp. 5329–5333.
  • [11] B. Desplanques, J. Thienpondt, and K. Demuynck, “ECAPA-TDNN: emphasized channel attention, propagation and aggregation in TDNN based speaker verification,” in INTERSPEECH.   ISCA, 2020, pp. 3830–3834.
  • [12] R. G. Hautamäki, T. Kinnunen, V. Hautamäki, and A.-M. Laukkanen, “Automatic versus human speaker verification: The case of voice mimicry,” Speech Communication, vol. 72, pp. 13–31, 2015.
  • [13] J. A. V. López and E. Lleida, “Detecting replay attacks from far-field recordings on speaker verification systems,” in BIOID, ser. Lecture Notes in Computer Science, vol. 6583.   Springer, 2011, pp. 274–285.
  • [14] F. Alegre, A. Amehraye, and N. W. D. Evans, “Spoofing countermeasures to protect automatic speaker verification from voice conversion,” in ICASSP.   IEEE, 2013, pp. 3068–3072.
  • [15] F. Kreuk, Y. Adi, M. Cissé, and J. Keshet, “Fooling end-to-end speaker verification with adversarial examples,” in ICASSP.   IEEE, 2018, pp. 1962–1966.
  • [16] I. J. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and harnessing adversarial examples,” in ICLR (Poster), 2015.
  • [17] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu, “Towards deep learning models resistant to adversarial attacks,” in ICLR (Poster).   OpenReview.net, 2018.
  • [18] Y. Wang, J. Liu, X. Chang, J. Wang, and R. J. Rodríguez, “Ab-fgsm: Adabelief optimizer and fgsm-based approach to generate adversarial examples,” Journal of Information Security and Applications, vol. 68, p. 103227, 2022.
  • [19] N. Carlini and D. A. Wagner, “Towards evaluating the robustness of neural networks,” in IEEE Symposium on Security and Privacy.   IEEE Computer Society, 2017, pp. 39–57.
  • [20] Y. Cao, D. Xu, X. Weng, Z. Mao, A. Anandkumar, C. Xiao, and M. Pavone, “Robust trajectory prediction against adversarial attacks,” in Conference on Robot Learning.   PMLR, 2023, pp. 128–137.
  • [21] H. Wu, S. Liu, H. Meng, and H. Lee, “Defense against adversarial attacks on spoofing countermeasures of ASV,” in ICASSP.   IEEE, 2020, pp. 6564–6568.
  • [22] F.-A. Croitoru, V. Hondru, R. T. Ionescu, and M. Shah, “Diffusion models in vision: A survey,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 9, pp. 10850–10869, 2023.
  • [23] J. Kim, J. Heo, H. Shin, C. Lim, and H. Yu, “Diff-sv: A unified hierarchical framework for noise-robust speaker verification using score-based diffusion probabilistic models,” in ICASSP.   IEEE, 2024, pp. 10341–10345.
  • [24] S. Wu, J. Wang, W. Ping, W. Nie, and C. Xiao, “Defending against adversarial audio via diffusion model,” in ICLR.   OpenReview.net, 2023.
  • [25] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in ICLR (Poster), 2015.
  • [26] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in CVPR.   IEEE Computer Society, 2016, pp. 770–778.
  • [27] J. S. Chung, A. Nagrani, and A. Zisserman, “Voxceleb2: Deep speaker recognition,” in INTERSPEECH.   ISCA, 2018, pp. 1086–1090.
  • [28] A. Nagrani, J. S. Chung, and A. Zisserman, “Voxceleb: A large-scale speaker identification dataset,” in INTERSPEECH.   ISCA, 2017, pp. 2616–2620.
  • [29] W. Cai, J. Chen, J. Zhang, and M. Li, “On-the-Fly Data Loader and Utterance-Level Aggregation for Speaker and Language Recognition,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, pp. 1038–1051, 2020.
  • [30] D. Snyder, G. Chen, and D. Povey, “Musan: A music, speech, and noise corpus,” arXiv preprint arXiv:1510.08484, 2015.
  • [31] T. Ko, V. Peddinti, D. Povey, M. L. Seltzer, and S. Khudanpur, “A study on data augmentation of reverberant speech for robust speech recognition,” in ICASSP.   IEEE, 2017, pp. 5220–5224.
  • [32] J. Deng, J. Guo, N. Xue, and S. Zafeiriou, “Arcface: Additive angular margin loss for deep face recognition,” in CVPR.   Computer Vision Foundation / IEEE, 2019, pp. 4690–4699.
  • [33] H. Zen, V. Dang, R. Clark, Y. Zhang, R. J. Weiss, Y. Jia, Z. Chen, and Y. Wu, “Libritts: A corpus derived from librispeech for text-to-speech,” in INTERSPEECH.   ISCA, 2019, pp. 1526–1530.
  • [34] I. Loshchilov, F. Hutter et al., “Fixing weight decay regularization in adam,” arXiv preprint arXiv:1711.05101, vol. 5, 2017.
  • [35] Z. Kong, W. Ping, J. Huang, K. Zhao, and B. Catanzaro, “Diffwave: A versatile diffusion model for audio synthesis,” in ICLR.   OpenReview.net, 2021.