Multi-speaker Text-to-speech Training with Speaker Anonymized Data
Abstract
The trend of scaling up speech generation models poses a threat of biometric information leakage from the voices in the training data, raising privacy and security concerns. In this paper, we investigate training multi-speaker text-to-speech (TTS) models using data that underwent speaker anonymization (SA), a process that aims to hide the speaker identity of the input speech while preserving other attributes. Two signal processing-based and three deep neural network-based SA methods were used to anonymize VCTK, a multi-speaker TTS dataset, which was then used to train an end-to-end TTS model, VITS, to perform unseen speaker TTS during the testing phase. We conducted extensive objective and subjective experiments to evaluate the anonymized training data, as well as the performance of the downstream TTS models trained using those data. Importantly, we found that UTMOS, a data-driven subjective rating predictor, and GVD, a metric that measures the gain of voice distinctiveness, are good indicators of downstream TTS performance. We summarize these insights in the hope of helping future researchers determine the suitability of an SA system for multi-speaker TTS training.
Index Terms:
speaker anonymization, speech synthesis, text-to-speech, multi-speaker training
I Introduction
Scaling up speech generation models in terms of both model size and training data has become a trend in the research community. For instance, multiple works have reported training their models on more than 100k hours of data [1, 2, 3]. However, the larger the model becomes, the more likely it is to memorize parts of the training data [4]. In the task of speech generation, memorization results in biometric information leakage, which causes security and privacy issues. For instance, a speaker whose voice was used in the training data may be memorized by the model, and that voice could be maliciously regenerated to spoof a voice authentication system. With the increasing interest in data privacy protection, including legal movements like the European General Data Protection Regulation (GDPR), scaling up speech generation models will become difficult.
A possible solution to the above-mentioned problem is to train speech generation models with data that underwent a so-called speaker anonymization (SA) process, which attempts to erase the biometric information of the input speech while preserving certain properties. This research field has been greatly advanced and promoted by the VoicePrivacy Challenge (VPC) series [5, 6], where the organizers established a series of standardized dataset settings and evaluation protocols. In VPC, the two primary metrics adopted to evaluate SA systems were the equal error rate (EER) calculated with an automatic speaker verification model, and the word error rate (WER) obtained from an automatic speech recognition model. The former is referred to as the privacy metric and the latter as the utility metric. On the other hand, evaluating these SA systems in the context of speech generation model training has not yet been investigated, and it is unknown whether an SA system that performs well in terms of EER and WER can also excel in the downstream speech generation task.
In this paper, as a proof-of-concept, we investigate training a multi-speaker text-to-speech (TTS) model with speaker anonymized data, with the hope of providing a reasonable framework to evaluate SA systems in terms of the performance of the downstream TTS task. We adopted two signal processing-based and three deep neural network-based SA methods and used them to anonymize a multi-speaker TTS dataset. The anonymized datasets were then used to train multi-speaker TTS models, which were evaluated on the task of unseen speaker TTS. An extensive experimental evaluation was conducted: we report the objective metrics from VPC’22 as well as subjective evaluation results on both the anonymized training data and the downstream TTS outputs. The contributions of this work are as follows:
• This is the first work to investigate the impact of speaker-anonymized training data on a downstream speech generation task, specifically, multi-speaker TTS training.
• We identified the relationship between the performance measurements of SA systems and the performance of the downstream multi-speaker TTS models, and provided guidelines for future researchers to develop better SA systems.
II Problem Formulation
The problem formulation and goals are illustrated in Figure 1. Suppose we have an initial user dataset, denoted as D, from which we wish to erase biometric information. An anonymization process is performed on D to obtain its anonymized version, which is then used to train a multi-speaker TTS model. There are two goals in this problem setting. First, the anonymized dataset needs to fulfill a pre-defined speaker anonymization criterion. Second, the TTS performance should be maximized, as evaluated on a pre-defined downstream task. For the anonymization criterion, in addition to adopting the objective privacy metric (i.e., EER) as in the VPCs, we also measure the subjective speaker verifiability as proposed in VPC [5]. For the downstream task, we consider unseen speaker TTS, which is to generate the voice of a designated speaker given the text and a short reference speech sample. Other speech generation tasks, such as speaker adaptive TTS and speech enhancement, are left as future work.
As we will describe in Section III, the anonymization process can include a data-driven model trained on other speech datasets. Here we assume that the training data of the SA model does not need to be anonymized (for instance, because it is public and we are authorized to use it). We consider this setting to be practical because, as we will show in later sections, the amount of training data for most SA systems is much smaller than that used to train large speech generation models. We would like to emphasize once again that the focus of this paper is on properly using first-party data while protecting users’ privacy.

III Speaker Anonymization Systems
In this section, we describe the speaker anonymization systems adopted in this work. These systems were adopted either because they are representative (for instance, baselines of VPC’22), or because they are easy to reproduce (for instance, with open-source implementations).
III-A Signal processing based systems
Signal processing-based SA systems have the benefit of requiring no training and offering efficient inference. In this work, we adopt two methods, each of which modifies a certain speech parameter obtained from speech analysis.
III-A1 Pitch shift
Simply shifting the pitch of the input speech, implemented with a time-scale modification approach, was shown to be on par with the baselines in VPC’22 [7]. However, in our preliminary experiments, we found that this approach resulted in poor perceptual quality, which hurt the performance of the downstream TTS model. We thus adopted a Python wrapper [8] of WORLD, a high-quality vocoder [9]. Anonymization is done by randomly shifting the extracted f0 sequence up or down by 3 to 5 semitones and then synthesizing the anonymized waveform with the other speech parameters left unchanged.
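To make the procedure concrete, below is a minimal Python sketch of WORLD-based pitch-shift anonymization using the pyworld package; the uniform sampling of the shift amount and the file I/O details are our assumptions rather than the exact implementation used in this work.

```python
import numpy as np
import pyworld as pw
import soundfile as sf

def pitch_shift_anonymize(in_path, out_path):
    """Anonymize an utterance by shifting its F0 with the WORLD vocoder.

    Assumptions: the shift amount is sampled uniformly from 3-5 semitones
    with a random direction; spectral envelope and aperiodicity are reused
    unchanged.
    """
    x, fs = sf.read(in_path)
    x = np.ascontiguousarray(x, dtype=np.float64)  # WORLD expects float64

    # WORLD analysis: F0 contour, spectral envelope, aperiodicity
    f0, sp, ap = pw.wav2world(x, fs)

    # Randomly shift the F0 contour up or down by 3-5 semitones
    semitones = np.random.uniform(3.0, 5.0) * np.random.choice([-1.0, 1.0])
    f0_shifted = f0 * (2.0 ** (semitones / 12.0))

    # Re-synthesize with the other speech parameters left untouched
    y = pw.synthesize(f0_shifted, sp, ap, fs)
    sf.write(out_path, y, fs)
```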
III-A2 VPC’22 B2: Spectral envelope modification
In contrast to modifying the frequency component, the B2 baseline in VPC’22 modifies the spectral envelope of the input speech, resulting in a change in timbre. It is based on the idea proposed in [10]. First, linear predictive coding (LPC) coefficients extracted from the input speech are converted to pole positions. Then, the phase φ of each pole with a non-zero imaginary part is raised to the power of the McAdams coefficient α, such that the transformed pole has a new, shifted phase of φ^α. The poles are then converted back to LPC coefficients, from which the anonymized waveform is synthesized. The VPC’22 B2 version samples the McAdams coefficient from a uniform distribution. We adopted the implementation from the VoicePAT toolkit [11].
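The following frame-wise sketch illustrates the McAdams transformation; the fixed alpha, LPC order, and overlap-add framing are placeholder choices (the actual VPC’22 B2 baseline randomizes alpha per utterance and differs in implementation detail).

```python
import numpy as np
import librosa
import scipy.signal as sig

def mcadams_anonymize(x, fs, alpha=0.8, lpc_order=20, frame_sec=0.02):
    """Frame-wise McAdams transformation of the spectral envelope (sketch)."""
    win = int(frame_sec * fs)
    hop = win // 2
    window = np.hanning(win)
    y = np.zeros(len(x))
    for start in range(0, len(x) - win, hop):
        frame = x[start:start + win] * window
        if np.max(np.abs(frame)) < 1e-6:   # skip (near-)silent frames
            continue
        a = librosa.lpc(frame, order=lpc_order)   # LPC coefficients [1, a1, ..., ap]
        residual = sig.lfilter(a, [1.0], frame)   # excitation (prediction error)
        poles = np.roots(a)
        # Raise the phase of each complex pole to the power alpha (magnitude kept)
        new_poles = np.array([
            p if np.imag(p) == 0 else
            np.abs(p) * np.exp(1j * np.sign(np.angle(p)) * np.abs(np.angle(p)) ** alpha)
            for p in poles
        ])
        a_new = np.real(np.poly(new_poles))       # back to LPC coefficients
        y[start:start + win] += sig.lfilter([1.0], a_new, residual)
    return y
```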

III-B Deep neural network based systems
Compared to signal processing-based methods, deep neural network (DNN)-based SA systems were reported to perform better in almost all metrics in VPC’22 [12]. In this work, we adopt three DNN-based systems, as illustrated in Figure 2. All DNN-based methods attempt to factorize the input speech into several components and change the speaker representation to achieve anonymization.
III-B1 VPC’22 B1b
The VPC’22 B1b baseline is an improved version of the system proposed in [13]. It first factorizes the input speech into the f0, a linguistic representation, and a speaker representation. The linguistic representation is the frame-level output of the encoder of an automatic speech recognition (ASR) model, and the speaker representation is the x-vector [14]. The anonymization module finds the 200 speaker representations in a pre-defined speaker pool that are farthest from the input speaker representation and then averages a random 100 of them to form the anonymized speaker representation. The decoder takes the f0, the linguistic representation, and the anonymized speaker representation to generate the final anonymized speech. We followed the official implementation released by the VPC’22 organizers [15].
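As an illustration, the select-and-average strategy can be sketched in a few lines of numpy; the cosine similarity used here and the array-based pool interface are assumptions, since the official recipe computes speaker affinities with its own tooling.

```python
import numpy as np

def anonymize_xvector(x_vec, pool, n_farthest=200, n_avg=100, seed=None):
    """Select-and-average anonymization of a speaker x-vector (sketch).

    `pool` is an (N, D) array of x-vectors from the external speaker pool.
    """
    rng = np.random.default_rng(seed)
    # Cosine similarity between the input x-vector and every pool speaker
    sim = pool @ x_vec / (np.linalg.norm(pool, axis=1) * np.linalg.norm(x_vec))
    # Indices of the n_farthest least similar pool speakers
    farthest = np.argsort(sim)[:n_farthest]
    # Average a random subset of them to obtain the anonymized x-vector
    chosen = rng.choice(farthest, size=n_avg, replace=False)
    return pool[chosen].mean(axis=0)
```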
III-B2 GAN
The GAN method was proposed in [16] and differs from VPC’22 B1b in two aspects. First, the linguistic representation is the phoneme sequence from the ASR model instead of the frame-level output, which in practice contains less speaker information, leading to better anonymization performance. Second, instead of relying on a speaker pool, the anonymization module is a generative adversarial network trained to map a normal distribution to the approximate distribution of x-vectors. The sampled x-vector is ensured to have a cosine similarity smaller than 0.7 with the x-vector of the input speaker. We adopted the implementation from the VoicePAT toolkit [11].
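Conceptually, the sampling step amounts to a rejection loop; in the sketch below, `generator` is a hypothetical callable mapping standard-normal noise of dimension `generator.latent_dim` to an x-vector and stands in for the trained GAN.

```python
import numpy as np

def sample_anonymized_xvector(generator, x_vec, max_cos=0.7, max_tries=100):
    """Draw an artificial x-vector from a GAN generator by rejection sampling."""
    for _ in range(max_tries):
        z = np.random.randn(generator.latent_dim).astype(np.float32)
        candidate = np.asarray(generator(z))  # hypothetical generator call
        cos = candidate @ x_vec / (np.linalg.norm(candidate) * np.linalg.norm(x_vec))
        if cos < max_cos:   # keep only sufficiently dissimilar x-vectors
            return candidate
    raise RuntimeError("no sufficiently dissimilar x-vector found")
```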
III-B3 NACLM
The final method is based on neural audio codec language models (NACLMs), which originated with AudioLM [17]. We include this system for its good EER performance, as we will show in Section IV-C. The idea is to condition a language model on so-called semantic tokens and an acoustic prompt to generate acoustic tokens (also known as neural codec tokens) that have the same content as the semantic tokens and the same acoustic characteristics as the acoustic prompt. These tokens are then passed to a synthesizer to obtain the final waveform. Here, the semantic tokens are the HuBERT [18] token sequence of the input speech, and the acoustic prompt is an acoustic token sequence randomly sampled from the speaker pool. We used the official implementation provided by the authors, which is based on Bark, an open-source NACLM-based TTS system [19].
IV Experimental Evaluation
The evaluation consists of two parts: in the first part (Section IV-C), we evaluate the anonymized training data, and in the second part (Section IV-D), we evaluate the TTS systems trained on those data. Finally, in Section IV-E, we discuss the relationship between the metrics from the two parts.
IV-A Data and implementation
The dataset used to train and evaluate the TTS systems (i.e., D in Section II) was VCTK [20]. All samples were downsampled to 16 kHz. Following [21], we held out 11 speakers for evaluation. During inference, the input to the TTS system consisted of the texts of the last 50 samples of each evaluation speaker, with sample 005 of that speaker as the reference. Following [21], the TTS system was pre-trained on LJSpeech [22]. The TTS system was VITS [23] with x-vectors [14] as the speaker embedding. We used the implementation provided in ESPnet2-TTS [24].
The implementation of the SA systems was described in Section III. For the training data of the DNN-based SA systems, both VPC’22 B1b and GAN shared a similar setting to that of VPC’22: the ASR model, the x-vector extractor, and the decoder were trained on LibriSpeech [25], VoxCeleb 1 & 2 [26, 27], and LibriTTS train-clean-100 [28], respectively. The speaker pool was the LibriTTS train-other-500 set. For the NACLM system, since Bark was used directly without any re-training, the setting differs largely from those of VPC’22 B1b and GAN, resulting in an unfair comparison; readers should note this difference. Finally, the LibriSpeech test set was used to evaluate the SA systems, as in VPC.
IV-B Evaluation Metrics and Protocols
The objective evaluation of the SA systems was carried out with the VoicePAT toolkit [11]. As in the VPC series, the EER (larger is better for privacy) and the WER (lower is better) were reported. In addition, we also report the gain of voice distinctiveness (GVD) proposed in [5]; the larger the GVD, the better the voice distinctiveness is maintained throughout the anonymization process. Finally, we added the UTMOS score, a widely used perceptual rating predictor trained with human ratings [29], where larger is better.
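For reference, the EER can be computed from ASV trial scores as follows; this is a generic sketch, not the exact VoicePAT implementation.

```python
import numpy as np

def compute_eer(target_scores, nontarget_scores):
    """Equal error rate (in %) from ASV similarity scores.

    target_scores: scores of same-speaker trials;
    nontarget_scores: scores of different-speaker trials.
    """
    target_scores = np.asarray(target_scores, dtype=float)
    nontarget_scores = np.asarray(nontarget_scores, dtype=float)
    scores = np.concatenate([target_scores, nontarget_scores])
    labels = np.concatenate([np.ones_like(target_scores),
                             np.zeros_like(nontarget_scores)])
    labels = labels[np.argsort(scores)]
    # Sweep the threshold over the sorted scores:
    # false rejection rate (targets scored below threshold) and
    # false acceptance rate (non-targets scored above threshold)
    frr = np.cumsum(labels) / labels.sum()
    far = 1.0 - np.cumsum(1.0 - labels) / (1.0 - labels).sum()
    idx = np.argmin(np.abs(frr - far))
    return 100.0 * (frr[idx] + far[idx]) / 2.0   # percentage, as in Table I
```

A higher EER after anonymization means the ASV system is less able to link the anonymized speech back to the original speaker.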
As the ultimate goal of this work is to train TTS systems with high quality, it is essential to conduct listening tests to assess the perceptual quality. In this work, we assessed subjective naturalness and speaker similarity. Listeners were asked to evaluate the naturalness of the speech on a 5-point scale, where higher is better. For similarity, following the protocol of the Voice Conversion Challenge 2020 [30], a natural speech sample from the reference speaker and a generated sample were presented, and listeners were asked to judge whether the two samples were produced by the same speaker on a 4-point scale. We evaluated naturalness and similarity for both the SA outputs and the TTS outputs, referred to as SA-NAT, SA-SIM, TTS-NAT, and TTS-SIM, respectively. Notably, for SA-SIM, the goal was to assess how dissimilar the anonymized speech was to the original speech; thus, lower is better. In contrast, for TTS-SIM, higher is better. We used crowd-sourcing to recruit 500 listeners and obtained 2750 ratings per system. Recordings of the natural samples (GT) were also included to serve as an upper bound.
TABLE I: SA evaluation (EER, WER, GVD, UTMOS, SA-NAT, SA-SIM) and TTS evaluation (TTS-NAT, TTS-SIM) results.

System | EER | WER | GVD | UTMOS | SA-NAT | SA-SIM | TTS-NAT | TTS-SIM
Natural | – | 2.98 | – | 3.96 | 3.95 ± 0.03 | 3.62 ± 0.03 | 3.84 ± 0.03 | 3.66 ± 0.02
Unanonymized | – | – | – | – | – | – | 3.82 ± 0.03 | 2.59 ± 0.04
Pitch shift | 4.94 | 3.18 | -0.66 | 3.31 | 2.92 ± 0.04 | 2.39 ± 0.04 | 3.13 ± 0.03 | 2.41 ± 0.04
VPC’22 B2 | 5.91 | 12.02 | -2.53 | 2.21 | 1.48 ± 0.03 | 1.99 ± 0.03 | 1.82 ± 0.03 | 1.82 ± 0.03
VPC’22 B1b | 9.28 | 3.97 | -8.85 | 3.90 | 3.58 ± 0.03 | 2.06 ± 0.02 | 3.04 ± 0.03 | 1.61 ± 0.03
GAN | 39.67 | 7.87 | -3.70 | 3.73 | 3.34 ± 0.04 | 1.28 ± 0.02 | 3.42 ± 0.03 | 1.85 ± 0.04
NACLM | 45.77 | 7.12 | -2.40 | 3.45 | 2.84 ± 0.04 | 1.36 ± 0.02 | 2.53 ± 0.03 | 2.02 ± 0.04

IV-C Evaluation results of the anonymized training data
We first look at the SA evaluation results, shown in Table I. No single system was dominant across all six SA metrics. As perceptual evaluation has been largely overlooked in the SA literature, we are especially interested in the subjective results, which are visualized in Figure 3. An ideal SA system should give a high SA-NAT score and a low SA-SIM score, yet no single system dominated both: VPC’22 B1b and GAN achieved the best SA-NAT and SA-SIM scores, respectively. We also found that the DNN-based SA systems outperform the signal processing-based systems.
We are also interested in whether the primary objective metrics adopted by VPC (namely, EER and WER) correlate well with the subjective metrics (SA-SIM and SA-NAT). The linear correlation coefficients between EER and SA-SIM and between WER and SA-NAT are -0.946 and -0.828, respectively. While EER correlates strongly with SA-SIM, the correlation between WER and SA-NAT is weaker. We further found that the linear correlation coefficient between UTMOS and SA-NAT is 0.984. This suggests that, compared to WER, UTMOS is a better indicator of subjective naturalness.
IV-D Evaluation results of the downstream TTS task
We then look at the TTS evaluation results, which are shown in Table I and visualized in Figure 3. The unanonymized system refers to a TTS model trained directly with unanonymized data, thus serving as an upper bound. An ideal TTS system should yield high TTS-NAT and TTS-SIM scores, yet again no single system dominated both: GAN and pitch shift were the best systems in terms of TTS-NAT and TTS-SIM, respectively.
TABLE II: Linear correlation coefficients between the SA evaluation metrics and the TTS evaluation metrics.

SA evaluation metric | TTS-NAT | TTS-SIM
WER (Obj.) | -0.785 | -0.529
EER (Obj.) | 0.231 | -0.061
GVD (Obj.) | -0.220 | 0.827
UTMOS (Obj.) | 0.874 | 0.341
SA-NAT (Sub.) | 0.929 | 0.469
SA-SIM (Sub.) | 0.477 | 0.864
IV-E Important indicators of the TTS performance
In this subsection, we investigate whether we can determine the goodness of an SA method before actually training and evaluating the downstream TTS system. From Table I, we calculated the linear correlation coefficient of each metric on the SA evaluation side with each TTS evaluation metric and summarize the results in Table II. Although the best indicators for TTS-NAT and TTS-SIM were SA-NAT and SA-SIM, respectively, conducting listening tests is costly and should be avoided if possible. We therefore seek informative indicators among the objective metrics only.
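The objective part of this analysis is easy to reproduce; the sketch below applies scipy's Pearson correlation to the rounded Table I values of the five SA systems, so the coefficients only approximately match the (presumably unrounded) values in Table II.

```python
from scipy.stats import pearsonr

# Metrics of the five SA systems, copied from Table I
# (order: Pitch shift, VPC'22 B2, VPC'22 B1b, GAN, NACLM)
sa_metrics = {
    "WER":   [3.18, 12.02, 3.97, 7.87, 7.12],
    "EER":   [4.94, 5.91, 9.28, 39.67, 45.77],
    "GVD":   [-0.66, -2.53, -8.85, -3.70, -2.40],
    "UTMOS": [3.31, 2.21, 3.90, 3.73, 3.45],
}
tts_metrics = {
    "TTS-NAT": [3.13, 1.82, 3.04, 3.42, 2.53],
    "TTS-SIM": [2.41, 1.82, 1.61, 1.85, 2.02],
}

for sa_name, sa_vals in sa_metrics.items():
    for tts_name, tts_vals in tts_metrics.items():
        r, _ = pearsonr(sa_vals, tts_vals)
        print(f"{sa_name} vs {tts_name}: r = {r:+.3f}")
```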
For TTS-NAT, UTMOS has the highest correlation (0.874), with WER being the second highest in magnitude (-0.785). This again shows the suitability of UTMOS as the utility metric in the context of SA. As for TTS-SIM, GVD was the only metric with a strong correlation (0.827). We would like to highlight this important result: although VPC’22 used it only as a secondary metric, GVD measures the SA system’s ability to preserve voice distinctiveness (i.e., diversity), which is especially important for multi-speaker TTS training. Consider an extreme case where the SA system maps all speakers in the input dataset to a single canonical speaker. Such an SA system can yield a high EER, but a GVD approaching minus infinity. The resulting anonymized dataset essentially becomes a single-speaker dataset, which would not be suitable for multi-speaker training.
V Conclusion and Future Directions
In this work, we trained multi-speaker TTS models using speaker anonymized data. Five SA systems were used to anonymize the training data of an end-to-end TTS model, which was then evaluated on an unseen speaker TTS task. From our extensive experimental results, we conclude that a good SA system for anonymizing multi-speaker TTS training data should ensure (1) a high UTMOS score, indicating high-quality output, and (2) a high GVD, indicating little loss of speaker diversity. Future work includes improving current SA systems with respect to these metrics and exploring more speech generation tasks.
Finally, we would like to raise attention to a fundamental question: What is a valid speaker anonymization threshold? Readers should note that SA-SIM and TTS-SIM have different meanings: SA-SIM is the anonymization criterion to be satisfied, and TTS-SIM is a metric for the downstream TTS task to be maximized. Only when the SA-SIM (or EER) threshold is met does maximizing the TTS metrics become meaningful. For instance, looking at Table I, if we set the threshold to an SA-SIM score of 1.5, then only GAN and NACLM qualify as SA systems; if we set the threshold to 2.5, then all systems qualify. The determination of the threshold should come prior to subsequent SA model development. That being said, we believe this question lies beyond the scope of computer science research, requiring discussion among researchers, legal specialists, and even the public to reach a consensus.
References
- [1] M. Łajszczak, G. Cámbara, Y. Li, F. Beyhan, A. van Korlaar, F. Yang, A. Joly, Á. Martín-Cortinas, A. Abbas, A. Michalski et al., “BASE TTS: Lessons from building a billion-parameter text-to-speech model on 100k hours of data,” arXiv preprint arXiv:2402.08093, 2024.
- [2] Z. Ju, Y. Wang, K. Shen, X. Tan, D. Xin, D. Yang, Y. Liu, Y. Leng, K. Song, S. Tang et al., “NaturalSpeech 3: Zero-shot speech synthesis with factorized codec and diffusion models,” arXiv preprint arXiv:2403.03100, 2024.
- [3] A. Vyas, B. Shi, M. Le, A. Tjandra, Y.-C. Wu, B. Guo, J. Zhang, X. Zhang, R. Adkins, W. Ngan et al., “Audiobox: Unified audio generation with natural language prompts,” arXiv preprint arXiv:2312.15821, 2023.
- [4] N. Carlini, D. Ippolito, M. Jagielski, K. Lee, F. Tramer, and C. Zhang, “Quantifying memorization across neural language models,” in Proc. ICLR, 2023.
- [5] N. Tomashenko, X. Wang, E. Vincent, J. Patino, B. M. L. Srivastava, P.-G. Noé, A. Nautsch, N. Evans, J. Yamagishi, B. O’Brien et al., “The voiceprivacy 2020 challenge: Results and findings,” Computer Speech & Language, vol. 74, p. 101362, 2022.
- [6] N. Tomashenko, X. Miao, P. Champion, S. Meyer, X. Wang, E. Vincent, M. Panariello, N. Evans, J. Yamagishi, and M. Todisco, “The voiceprivacy 2024 challenge evaluation plan,” arXiv preprint arXiv:2404.02677, 2024.
- [7] C. O. Mawalim, S. Okada, and M. Unoki, “Speaker anonymization by pitch shifting based on time-scale modification,” in Proc. Symp. on Security and Privacy in Speech Communication, 2022, pp. 35–42.
- [8] “Jeremycchsu/python-wrapper-for-world-vocoder,” https://github.com/JeremyCCHsu/Python-Wrapper-for-World-Vocoder.
- [9] M. Morise, F. Yokomori, and K. Ozawa, “WORLD: A vocoder-based high-quality speech synthesis system for real-time applications,” IEICE Transactions on Information and Systems, vol. 99, no. 7, pp. 1877–1884, 2016.
- [10] J. Patino, N. Tomashenko, M. Todisco, A. Nautsch, and N. Evans, “Speaker Anonymisation Using the McAdams Coefficient,” in Proc. Interspeech, 2021, pp. 1099–1103.
- [11] S. Meyer, X. Miao, and N. T. Vu, “Voicepat: An efficient open-source evaluation toolkit for voice privacy research,” IEEE Open Journal of Signal Processing, vol. 5, pp. 257–265, 2024.
- [12] N. Tomashenko, X. Miao, P. Champion, S. Meyer, X. Wang, E. Vincent, M. Panariello, N. Evans, J. Yamagishi, and M. Todisco, “The voiceprivacy 2022 challenge,” 2nd Symposium on Security and Privacy in Speech Communication, 2023.
- [13] F. Fang, X. Wang, J. Yamagishi, I. Echizen, M. Todisco, N. Evans, and J.-F. Bonastre, “Speaker Anonymization Using X-vector and Neural Waveform Models,” in Proc. ISCA Workshop on Speech Synthesis, 2019, pp. 155–160.
- [14] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, “X-vectors: Robust dnn embeddings for speaker recognition,” in Proc. ICASSP, 2018, pp. 5329–5333.
- [15] “Voice-privacy-challenge/voice-privacy-challenge-2022,” https://github.com/Voice-Privacy-Challenge/Voice-Privacy-Challenge-2022.
- [16] S. Meyer, F. Lux, J. Koch, P. Denisov, P. Tilli, and N. T. Vu, “Prosody is not identity: A speaker anonymization approach using prosody cloning,” in Proc. ICASSP, 2023, pp. 1–5.
- [17] Z. Borsos, R. Marinier, D. Vincent, E. Kharitonov, O. Pietquin, M. Sharifi, D. Roblek, O. Teboul, D. Grangier, M. Tagliasacchi, and N. Zeghidour, “AudioLM: A Language Modeling Approach to Audio Generation,” IEEE/ACM TASLP, vol. 31, pp. 2523–2533, 2023.
- [18] W.-N. Hsu, B. Bolte, Y.-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed, “HuBERT: Self-supervised speech representation learning by masked prediction of hidden units,” IEEE/ACM TASLP, vol. 29, pp. 3451–3460, 2021.
- [19] “eurecom-asp/spk_anon_nac_lm,” https://github.com/eurecom-asp/spk_anon_nac_lm.
- [20] C. Veaux, J. Yamagishi, and K. MacDonald, “CSTR VCTK Corpus: English Multi-speaker Corpus for CSTR Voice Cloning Toolkit,” 2017.
- [21] E. Casanova, J. Weber, C. D. Shulby, A. C. Junior, E. Gölge, and M. A. Ponti, “YourTTS: Towards zero-shot multi-speaker TTS and zero-shot voice conversion for everyone,” in Proc. ICML, vol. 162, 17–23 Jul 2022, pp. 2709–2720.
- [22] K. Ito and L. Johnson, “The LJ Speech Dataset,” https://keithito.com/LJ-Speech-Dataset/, 2017.
- [23] J. Kim, J. Kong, and J. Son, “Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech,” in Proc. ICML, 2021, pp. 5530–5540.
- [24] T. Hayashi, R. Yamamoto, T. Yoshimura, P. Wu, J. Shi, T. Saeki, Y. Ju, Y. Yasuda, S. Takamichi, and S. Watanabe, “ESPnet2-TTS: Extending the edge of TTS research,” arXiv preprint arXiv:2110.07840, 2021.
- [25] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “LibriSpeech: An ASR corpus based on public domain audio books,” in Proc. ICASSP, 2015, pp. 5206–5210.
- [26] A. Nagrani, J. S. Chung, W. Xie, and A. Zisserman, “Voxceleb: Large-scale speaker verification in the wild,” Computer Speech & Language, vol. 60, p. 101027, 2020.
- [27] J. S. Chung, A. Nagrani, and A. Zisserman, “VoxCeleb2: Deep Speaker Recognition,” in Proc. Interspeech, 2018, pp. 1086–1090.
- [28] H. Zen, V. Dang, R. Clark, Y. Zhang, R. J. Weiss, Y. Jia, Z. Chen, and Y. Wu, “LibriTTS: A Corpus Derived from LibriSpeech for Text-to-Speech,” in Proc. Interspeech, 2019, pp. 1526–1530.
- [29] T. Saeki, D. Xin, W. Nakata, T. Koriyama, S. Takamichi, and H. Saruwatari, “UTMOS: UTokyo-SaruLab System for VoiceMOS Challenge 2022,” in Proc. Interspeech, 2022, pp. 4521–4525.
- [30] Y. Zhao, W.-C. Huang, X. Tian, J. Yamagishi, R. K. Das, T. Kinnunen, Z. Ling, and T. Toda, “Voice Conversion Challenge 2020 - Intra-lingual semi-parallel and cross-lingual voice conversion -,” in Proc. Joint Workshop for the BC and VCC 2020, 2020, pp. 80–98.