Ming Cheng · Xingjian Diao · Shitong Cheng · Wenjun Liu
Department of Computer Science, Dartmouth College, 15 Thayer Drive, Hanover, NH 03755, USA
Ming Cheng, e-mail: [email protected]
Xingjian Diao, e-mail: [email protected]
Shitong Cheng, e-mail: [email protected]
Wenjun Liu, e-mail: [email protected]
SAIC: Integration of Speech Anonymization and Identity Classification
Abstract
Speech anonymization and de-identification have garnered significant attention recently, especially in healthcare applications such as telehealth consultations, patient voiceprint matching, and real-time patient monitoring. Speaker identity classification, which involves recognizing specific speakers from audio in order to learn identity features, is crucial for de-identification. Since few studies have effectively combined speech anonymization with identity classification, we propose SAIC – an innovative pipeline that integrates Speech Anonymization and Identity Classification. SAIC achieves state-of-the-art performance on the speaker identity classification task on the VoxCeleb1 dataset, with a top-1 accuracy of 96.1%. Although SAIC is not trained or evaluated specifically on clinical data, the results strongly support the model’s effectiveness and its potential to generalize to the healthcare domain, providing insightful guidance for future work.
1 Introduction
Significant research has focused on using AI techniques for anonymization and de-identification in the ethics and healthcare domains, especially for protecting health records and patient notes zuccon2014identification ; ahmed2020identification ; dernoncourt2017identification ; venugopal2022privacy . Meanwhile, anonymization of speech has not been widely explored, with only a few studies developing methods on small-scale datasets han2020voice ; chen2023voicecloak . In parallel, speaker identity classification tasks, which require accurately identifying individuals from their audio audiomae ; niizumi2022masked , play a crucial role in privacy protection services. These tasks involve disentangling a person’s unique vocal characteristics (voiceprint), essentially understanding the speaker’s identity information within speech. While this precision is valuable in itself, it also opens up possibilities for enhancing speech anonymization techniques. Ideally, if a system can understand and isolate identity features in speech, it could then modify, obscure, or remove these features to anonymize the audio effectively. Since limited work has integrated anonymization with identity classification, an unsolved challenge remains: Is it feasible to develop a model that simultaneously achieves high-quality speech anonymization and maintains accurate speaker identity classification?
To address this research gap, we propose SAIC – a novel pipeline for speech anonymization and identity classification. After training, SAIC can extract accurate content and identity embeddings, removing identity information from the original audio. Moreover, it can merge the content of one audio clip with the voiceprint of another speaker, generating synthesized speech that maintains content integrity under an altered identity.
In summary, our contribution is threefold:
• We propose SAIC, a novel pipeline that effectively integrates speech anonymization and identity classification. High-quality content embeddings and identity embeddings are extracted through robust encoders.
• On the speaker identity classification task on the VoxCeleb1 dataset, SAIC outperforms existing work and achieves a state-of-the-art result, with a top-1 accuracy of 96.1%.
• SAIC can synthesize new audio by merging the content from one speaker’s audio with the voiceprint of another, effectively generating synthesized speech that preserves the original content while adopting a different vocal identity.
2 Related Work
2.1 Speech Anonymization and De-Identification
An early approach known as DROPSY justin2015speaker was proposed to conceal the speaker’s identity. It builds a diphone recognition system for speech recognition, followed by a speech synthesis system that transforms a speaker’s speech into that of a different individual. In a separate effort, VoicePrivacy tomashenko2020introducing was proposed to propel advancements in speech data anonymization. Its benchmark seeks to minimize the disclosure of the speaker’s identity while preserving the distinctiveness of the speech.
Recent studies in voice privacy preservation have proposed methods spanning multiple dimensions, such as differentially private approaches shamsabadi2022differentially , naturalness- and timbre-preserving anonymization deng2023v , and adversarial examples chen2023voicecloak . However, they use limited validation datasets and mainly focus on specific scenarios, potentially limiting the generalizability of the proposed techniques.
To address the research gap mentioned above, we propose SAIC, a novel pipeline for speaker de-identification and privacy preservation. We evaluate our model on VoxCeleb1 nagrani2017voxceleb , a commonly used large-scale dataset, and the state-of-the-art results indicate the effectiveness of our model.
2.2 Speaker Identity Classification
The task of speaker identity classification has garnered significant attention in recent years, driven by its applications in various domains including privacy protection, voice-controlled systems, and human-computer interaction.
With the development of Transformers vaswani2017attention and ViT dosovitskiy2020image , multiple studies have adopted these architectures as the backbone. For example, SS-AST gong2022ssast pretrains the AST model with joint discriminative and generative masked spectrogram patch modeling, while wav2vec 2.0 baevski2020wav2vec focuses on learning powerful representations from speech audio. Although ViT-based methods outperform CNN-based ones in various AI tasks, they usually require massive data and repeated pretraining, and they struggle to handle temporal dynamics without strong data augmentations islam2022recent ; he2022masked . Therefore, our model adopts a CNN backbone and follows the mainstream encoder-decoder structure lu2016training ; toshniwal2017multitask ; hu2020dasgil ; karita2018sequence , achieving significant results while requiring fewer computational resources.
Considering the effectiveness of MAE-based methods on various downstream tasks he2022masked ; gong2022contrastive ; audiomae ; diao2023av ; tong2022videomae , recent work on identity classification mainly follows the mask-and-reconstruction strategy for representation learning maeast ; niizumi2022masked ; m2d . However, these methods do not consider the integration of identity classification and audio anonymization, and they lack strong identity disentanglement capabilities. We therefore propose SAIC, which effectively removes identity information while also achieving superior classification performance.
3 Method
We address the challenge of implementing de-identification for speaker privacy protection through the proposed SAIC pipeline. Formally, given input audio $x_A$ of speaker $A$ and $x_B$ of speaker $B$, our goal is to synthesize new audio $\hat{x}$ with speaker $B$'s identity and speaker $A$'s content information. This removes the identity information of speaker $A$ for privacy protection.
The training and inference of the SAIC pipeline are shown in Figures 1 and 2, respectively. Inspired by gabbay2019demystifying , the pipeline training contains two stages. The first stage aims to extract accurate content embeddings ($e_c \in \mathcal{Z}$) and speaker embeddings ($e_s \in \mathcal{Z}$) from the content and the speaker ID, where $\mathcal{Z}$ indicates the latent space. Moreover, the Fusion Decoder (FD) is trained to reconstruct the original audio through the latent optimization strategy gabbay2019demystifying . The second stage focuses on optimizing the Content Encoder (CE), the Speaker Encoder (SE), and the Fusion Decoder (FD) to reconstruct audio. Specifically, let $x_i \in \mathcal{X}$ be the input audio of speaker $i$, where $\mathcal{X}$ indicates the ground-truth domain; it is input into CE and SE to extract content embeddings ($\hat{e}_c$) and speaker identity embeddings ($\hat{e}_s$), respectively, which lie in the latent space produced by the two encoders. Afterward, $\hat{e}_c$ and $\hat{e}_s$ are input into FD to reconstruct the audio. Through this pipeline, the two encoders and the decoder are well trained for inference.
During inference, we take audio inputs from two different speakers $A$ and $B$, aiming to remove the speaker identity information of $A$. In this phase, all encoders and the decoder are frozen.
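Below is a minimal PyTorch sketch of this inference step. The module definitions, embedding sizes, and mel-spectrogram shape are illustrative assumptions standing in for the actual trained CE, SE, and FD; only the data flow mirrors the description above.

```python
# Sketch of SAIC inference: frozen CE/SE/FD, content of speaker A combined with
# the identity of speaker B. Modules below are stand-ins, not the real layers.
import torch
import torch.nn as nn

mel_bins, frames = 80, 128          # hypothetical mel-spectrogram shape
content_dim, speaker_dim = 128, 64  # hypothetical embedding sizes

CE = nn.Sequential(nn.Flatten(), nn.Linear(mel_bins * frames, content_dim))
SE = nn.Sequential(nn.Flatten(), nn.Linear(mel_bins * frames, speaker_dim))
FD = nn.Sequential(nn.Linear(content_dim + speaker_dim, 512), nn.ReLU(),
                   nn.Linear(512, mel_bins * frames))

@torch.no_grad()  # all modules are frozen at inference time
def anonymize(mel_a: torch.Tensor, mel_b: torch.Tensor) -> torch.Tensor:
    """Return audio (as a mel spectrogram) with A's content and B's identity."""
    e_c = CE(mel_a)                 # content embedding from speaker A
    e_s = SE(mel_b)                 # identity embedding from speaker B
    out = FD(torch.cat([e_c, e_s], dim=-1))
    return out.view(-1, mel_bins, frames)

# Usage with random placeholders for two speakers' mel spectrograms.
mel_a = torch.randn(1, mel_bins, frames)
mel_b = torch.randn(1, mel_bins, frames)
anonymized = anonymize(mel_a, mel_b)   # shape: (1, 80, 128)
```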


3.1 Two-Stage Pipeline Training
Training Stage 1
The first stage aims to extract accurate content and speaker embeddings through latent optimization gabbay2019demystifying . Specifically, given the content and the speaker ID, the corresponding embeddings, $e_c$ and $e_s$, are obtained in the latent space. The two embeddings are then input into the Fusion Decoder (FD) to reconstruct the audio. To train FD through the latent optimization strategy, we employ the VGG perceptual loss hoshen2019non as the stage-1 objective $\mathcal{L}_{1}$, summed over each speaker $i$:
$$\mathcal{L}_{1} = \sum_{i=1}^{N} \left\| \Phi\!\left( FD\!\left(e_c^{(i)} + z_i,\; e_s^{(i)}\right) \right) - \Phi\!\left(x_i\right) \right\|_2^2 + \lambda \left\| e_c^{(i)} \right\|_2^2 \qquad (1)$$
where $FD$ is the Fusion Decoder, $\Phi$ denotes the VGG feature extractor, $N$ indicates all speakers, $z_i$ represents Gaussian noise of fixed variance, and $\lambda \| e_c^{(i)} \|_2^2$ is an activation decay penalty applied to the content embeddings ($e_c$) to regularize the content.
After stage 1, accurate content and speaker embeddings have been obtained, and the decoder is well trained to generate audio from the two embeddings.
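The following sketch illustrates one stage-1 latent-optimization step under stated assumptions: per-clip content embeddings and per-speaker identity embeddings are free learnable parameters, the Fusion Decoder is a stand-in MLP, and a plain L2 distance replaces the VGG perceptual loss; the noise scale, decay weight, and all dimensions are hypothetical.

```python
# Sketch of the stage-1 latent-optimization step (Eq. 1), with placeholders.
import torch
import torch.nn as nn

num_clips, num_speakers = 1000, 40   # hypothetical dataset size
content_dim, speaker_dim = 128, 64   # hypothetical embedding sizes
mel_bins, frames = 80, 128           # hypothetical mel-spectrogram shape

# Learnable per-clip content embeddings and per-speaker identity embeddings.
content_emb = nn.Embedding(num_clips, content_dim)
speaker_emb = nn.Embedding(num_speakers, speaker_dim)

# Stand-in Fusion Decoder: maps (content, speaker) embeddings to a mel spectrogram.
fusion_decoder = nn.Sequential(
    nn.Linear(content_dim + speaker_dim, 512),
    nn.ReLU(),
    nn.Linear(512, mel_bins * frames),
)

def perceptual_loss(pred, target):
    # Placeholder for the VGG perceptual loss hoshen2019non; a plain L2
    # distance keeps this sketch self-contained.
    return ((pred - target) ** 2).mean()

optimizer = torch.optim.Adam(
    list(content_emb.parameters()) + list(speaker_emb.parameters())
    + list(fusion_decoder.parameters()), lr=1e-3)

def stage1_step(clip_ids, speaker_ids, mel_targets, noise_std=0.1, decay=1e-3):
    e_c = content_emb(clip_ids)                           # per-clip content codes
    e_s = speaker_emb(speaker_ids)                        # per-speaker identity codes
    e_c_noisy = e_c + noise_std * torch.randn_like(e_c)   # fixed-variance Gaussian noise
    recon = fusion_decoder(torch.cat([e_c_noisy, e_s], dim=-1))
    recon = recon.view(-1, mel_bins, frames)
    # Perceptual reconstruction term + activation decay on the content codes.
    loss = perceptual_loss(recon, mel_targets) + decay * e_c.pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# One optimization step on random placeholder data.
clip_ids = torch.randint(0, num_clips, (8,))
speaker_ids = torch.randint(0, num_speakers, (8,))
mel_targets = torch.randn(8, mel_bins, frames)
print(stage1_step(clip_ids, speaker_ids, mel_targets))
```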
Training Stage 2
In stage 2 of pipeline training, we design an encoder-decoder architecture to generate the audio. Specifically, we construct two encoders (CE and SE) to learn accurate embeddings from the original audio, and we reuse FD with weights shared from the first stage to reconstruct the audio.
To guide SAIC to learn the exact content and speaker embeddings, we employ an MSE loss as the embedding loss $\mathcal{L}_{emb}$:
$$\mathcal{L}_{emb} = \left\| \hat{e}_c - e_c \right\|_2^2 + \left\| \hat{e}_s - e_s \right\|_2^2 \qquad (2)$$
where $e_c$ and $e_s$ are the embeddings obtained from the first stage, and $\hat{e}_c$ and $\hat{e}_s$ are the outputs of CE and SE.
Similar to the first stage, we apply the VGG perceptual loss as the reconstruction loss $\mathcal{L}_{rec}$ (without Gaussian noise) to guide the model to generate correct and precise audio:
$$\mathcal{L}_{rec} = \sum_{i=1}^{N} \left\| \Phi\!\left( FD\!\left( CE(x_i),\; SE(x_i) \right) \right) - \Phi\!\left(x_i\right) \right\|_2^2 \qquad (3)$$
Finally, the combined loss for pipeline training in the second stage is expressed below:
$$\mathcal{L} = \mathcal{L}_{emb} + \mathcal{L}_{rec} \qquad (4)$$
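A minimal sketch of the stage-2 objective (Eqs. 2-4) is given below, under the same illustrative assumptions as the stage-1 sketch; the stand-in encoders and decoder and the plain L2 reconstruction term are placeholders for the actual SAIC modules and VGG perceptual loss.

```python
# Sketch of the stage-2 combined loss (Eqs. 2-4) with placeholder modules.
import torch
import torch.nn as nn
import torch.nn.functional as F

mel_bins, frames = 80, 128
content_dim, speaker_dim = 128, 64

content_encoder = nn.Sequential(nn.Flatten(), nn.Linear(mel_bins * frames, content_dim))
speaker_encoder = nn.Sequential(nn.Flatten(), nn.Linear(mel_bins * frames, speaker_dim))
fusion_decoder = nn.Sequential(
    nn.Linear(content_dim + speaker_dim, 512), nn.ReLU(),
    nn.Linear(512, mel_bins * frames))

def stage2_loss(mel, e_c_stage1, e_s_stage1):
    """mel: (B, mel_bins, frames); e_*_stage1: embeddings learned in stage 1."""
    e_c_hat = content_encoder(mel)
    e_s_hat = speaker_encoder(mel)
    # Eq. 2: pull the encoder outputs toward the stage-1 embeddings.
    l_emb = F.mse_loss(e_c_hat, e_c_stage1) + F.mse_loss(e_s_hat, e_s_stage1)
    # Eq. 3: reconstruction term (no Gaussian noise); a plain L2 stands in
    # for the VGG perceptual loss.
    recon = fusion_decoder(torch.cat([e_c_hat, e_s_hat], dim=-1))
    l_rec = F.mse_loss(recon.view(-1, mel_bins, frames), mel)
    # Eq. 4: combined objective.
    return l_emb + l_rec

# Usage with random placeholders.
mel = torch.randn(8, mel_bins, frames)
e_c1, e_s1 = torch.randn(8, content_dim), torch.randn(8, speaker_dim)
print(stage2_loss(mel, e_c1, e_s1).item())
```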

3.2 Content Encoder
The Content Encoder (CE) is constructed from sequential convolution blocks. Following chou2019one , it uses six repeated residual blocks as its main structure. Each residual block includes a convolutional module and an instance normalization layer huang2017arbitrary . This structure extracts the high-dimensional content information in the audio while omitting identity information from the content embedding.
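A minimal sketch of one such residual block with instance normalization is shown below; the channel count, kernel size, and use of 1-D convolutions over mel-spectrogram frames are assumptions, since the exact configuration is not specified here. The Speaker Encoder described next stacks the same kind of unit before its dense layers.

```python
# Sketch of a residual block with instance normalization (illustrative sizes).
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, channels: int = 256, kernel_size: int = 5):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size,
                              padding=kernel_size // 2)
        self.norm = nn.InstanceNorm1d(channels)  # strips channel-wise statistics
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.act(self.norm(self.conv(x)))  # residual connection

# The Content Encoder stacks six such blocks over the mel-spectrogram frames.
content_encoder = nn.Sequential(*[ResBlock(80) for _ in range(6)])
mel = torch.randn(1, 80, 128)        # (batch, mel_bins, frames)
print(content_encoder(mel).shape)    # torch.Size([1, 80, 128])
```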
3.3 Speaker Encoder
The Speaker Encoder (SE) contains a sequential residual block (repeated six times) and a sequential fully connected dense layer (repeated twice), following chou2019one . Similar to the design of CE, the residual blocks extract high-dimensional identity information from the audio; the fully connected dense layers then map the extracted features into the specified dimension and fully separate the embeddings of different speakers.
3.4 Fusion Decoder
The Fusion Decoder (FD) fuses the content and identity embeddings for audio generation. Specifically, it consists of two sub-modules: the first decodes content embeddings through sequential convolution blocks, and the second decodes voiceprint features through dense layers, based on chou2019one . Adaptive Instance Normalization (AdaIN) huang2017arbitrary is applied in the decoder.
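The sketch below illustrates AdaIN as it is commonly applied in such decoders: the speaker embedding is mapped to per-channel scale and bias that modulate instance-normalized content features. The dimensions and the 1-D formulation are assumptions, not the exact SAIC configuration.

```python
# Sketch of Adaptive Instance Normalization (AdaIN) for decoder conditioning.
import torch
import torch.nn as nn

class AdaIN1d(nn.Module):
    def __init__(self, channels: int, speaker_dim: int):
        super().__init__()
        self.norm = nn.InstanceNorm1d(channels, affine=False)
        # Dense layer mapping the speaker embedding to per-channel scale and bias.
        self.affine = nn.Linear(speaker_dim, 2 * channels)

    def forward(self, content: torch.Tensor, speaker: torch.Tensor) -> torch.Tensor:
        # content: (B, channels, frames), speaker: (B, speaker_dim)
        scale, bias = self.affine(speaker).chunk(2, dim=-1)
        return self.norm(content) * (1 + scale.unsqueeze(-1)) + bias.unsqueeze(-1)

adain = AdaIN1d(channels=80, speaker_dim=64)
content_feat = torch.randn(2, 80, 128)
speaker_emb = torch.randn(2, 64)
print(adain(content_feat, speaker_emb).shape)   # torch.Size([2, 80, 128])
```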
4 Experiments
We conduct experiments on the VoxCeleb1 dataset nagrani2017voxceleb and evaluate the de-identification results through the identity classification task.
4.1 Dataset
The VoxCeleb1 dataset is a diverse, real-environment audio dataset collected from public YouTube videos. It contains 153,516 audio clips from 1,251 speakers of various ages, roles, and identities. The dataset can be used for speech recognition, speaker classification, and speech information analysis. To evaluate the de-identification quality of our model, we conduct the speaker classification task and compare the results with other related work.
4.2 Evaluation Metrics
The evaluation of our model follows the steps below:
• For each speaker $A$ in the test set, we randomly choose a different speaker $B$; their audio is fed into CE and SE, respectively.
• After SAIC, the synthesized audio $\hat{x}$ contains the identity information of speaker $B$ and the speech content of speaker $A$. $\hat{x}$ is then input into a powerful pre-trained VoiceEncoder wan2018generalized from the Resemblyzer library in Python to extract speaker identity embeddings.
• Finally, we use the embeddings extracted by VoiceEncoder to find the corresponding speaker ID. Since we only find one best-matched speaker each time, we report the top-1 accuracy as the evaluation metric; a minimal sketch of this protocol is given after this list.
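The sketch below uses the real Resemblyzer VoiceEncoder for this matching step; the file paths and the single-reference-utterance-per-speaker enrollment scheme are assumptions made for illustration, not the exact evaluation setup.

```python
# Sketch of top-1 speaker matching with Resemblyzer's VoiceEncoder.
import numpy as np
from resemblyzer import VoiceEncoder, preprocess_wav

encoder = VoiceEncoder()

# Hypothetical enrollment: one reference utterance per speaker ID.
reference_wavs = {
    "id10001": "ref/id10001.wav",
    "id10002": "ref/id10002.wav",
}
ref_embeds = {spk: encoder.embed_utterance(preprocess_wav(path))
              for spk, path in reference_wavs.items()}

def top1_speaker(synth_wav_path: str) -> str:
    """Return the speaker ID whose reference embedding is closest (cosine)."""
    emb = encoder.embed_utterance(preprocess_wav(synth_wav_path))
    sims = {spk: float(np.dot(emb, ref) /
                       (np.linalg.norm(emb) * np.linalg.norm(ref)))
            for spk, ref in ref_embeds.items()}
    return max(sims, key=sims.get)

# Top-1 accuracy counts a hit when the prediction matches the
# identity-providing speaker B (hypothetical file name below).
pred = top1_speaker("synth/contentA_identityB.wav")
```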
4.3 Quantitative Results
Comparison With State-of-The-Art
Table 1: Speaker identity classification results on the VoxCeleb1 dataset.

Methods | Backbone | MAE-Based | Top-1 Acc (%)
MAE-AST maeast | ViT-B | ✓ | 63.3
SS-AST gong2022ssast | ViT-B | | 64.3
wav2vec 2.0 baevski2020wav2vec | Transformer | | 75.2
HuBERT hsu2021hubert | Transformer | | 81.4
M2D m2d | ViT-B | ✓ | 94.8
Audio-MAE audiomae | ViT-B | ✓ | 94.8
MSM-MAE niizumi2022masked | ViT-B | ✓ | 95.3
SAIC (Ours) | CNN | | 96.1
The quantitative results of our model compared with related work are shown in Table 1, from which we can observe that our model achieves the state-of-the-art speaker identity classification result, with a top-1 accuracy of 96.1%.
Since the latest work mainly adopts MAE-based structures that follow the mask-and-reconstruction strategy, this result demonstrates the effectiveness of our disentanglement approach, with a lead of 1.3% in top-1 accuracy over both M2D m2d and Audio-MAE audiomae .
In addition to its effectiveness in speaker identity classification, our model has application-level advantages over the MAE-based ones. Since SAIC disentangles the speaker identity and the content of the speech with high quality (only about 3.9% of the speaker identities are not fully extracted), our model can be applied to many privacy-preserving real-world applications in the future, especially in the healthcare area: telehealth consultations, patient voiceprint matching, real-time patient monitoring, etc.
Transformers vs. CNN
As shown in Table 1, most recent relevant work utilizes either Transformers vaswani2017attention or ViT dosovitskiy2020image as the backbone, whereas SAIC uses ResBlocks as the main body of its architecture. Considering that Transformer-based models typically require massive data and strong data augmentations to handle temporal dynamics and variations islam2022recent ; diao2023ft2tf , our CNN-based architecture showcases the advantage of model efficiency. The ResBlock’s capability to effectively handle local features and temporal dynamics makes it more suitable for tasks requiring granular audio analysis and sustained temporal coherence, such as speech disentanglement and synthesis, where accurate extraction and reconstruction of audio across time are crucial for achieving high-quality outcomes.
Moreover, our model achieves a significant lead over several Transformer-based models (20.9% over wav2vec 2.0 baevski2020wav2vec , 14.7% over HuBERT hsu2021hubert ). This showcases the effectiveness of CNNs over Transformers on small datasets or pipeline training without strong data augmentations, a significant advantage in the healthcare area where obtaining large-scale datasets is rarely possible.
4.4 Qualitative Results


Audio Reconstruction and Synthesis
The visualization of audio reconstruction during training and audio generation during inference is shown in Figure 4. Since the original mel spectrogram and the reconstructed one are highly similar, we can observe that SAIC reconstructs accurate and precise audio through the disentanglement strategy. This indicates that all the trainable modules (CE, SE, FD) are well optimized. Furthermore, during inference, our model utilizes the identity of a different speaker, resulting in a distinctly different mel spectrogram. This clearly demonstrates the model’s ability to de-identify audio effectively, thus providing robust privacy protection. In real-world applications, our model could be deployed for the secure handling of sensitive patient information during telehealth consultations, medical dictations, and other audio-based interactions.
Audio Disentanglement
The qualitative result of audio disentanglement, visualized by t-SNE on the VoxCeleb1 dataset, is shown in Figure 5, where each point represents the features of a distinct utterance (results on 40 speakers are shown as an example). For voiceprint disentanglement, SAIC distinguishes each speaker clearly: points of the same color are clustered together, while points of different colors are well separated. Meanwhile, the content embeddings of the audio of different speakers are all clustered into one blob. Since a two-dimensional t-SNE mapping cannot cluster information as complex as audio semantics, this shows that the embedding disentangled by the Content Encoder indeed captures the audio content and contains almost no identity information about the speaker. This visualization further indicates that SAIC can be utilized for healthcare privacy protection.
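A minimal sketch of such a t-SNE projection is shown below, assuming the identity embeddings extracted by SAIC are available as a NumPy array together with integer speaker labels; the array contents here are random placeholders.

```python
# Sketch of a t-SNE visualization of speaker identity embeddings.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Hypothetical inputs: one identity embedding per utterance and its speaker label.
speaker_embeddings = np.random.randn(400, 64)       # (num_utterances, emb_dim)
speaker_labels = np.random.randint(0, 40, size=400) # 40 speakers, as in Fig. 5

# Project to 2-D; perplexity is a tunable assumption.
points = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(
    speaker_embeddings)

plt.figure(figsize=(6, 6))
plt.scatter(points[:, 0], points[:, 1], c=speaker_labels, cmap="tab20", s=8)
plt.title("t-SNE of speaker identity embeddings (40 speakers)")
plt.tight_layout()
plt.savefig("tsne_speaker_embeddings.png")
```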
5 Conclusion
In this paper, we introduce SAIC, a novel pipeline that effectively integrates speech disentanglement and speaker classification. SAIC is built with a CNN backbone in an encoder-decoder architecture. The well-trained model is utilized for speech de-identification and new audio generation during the inference phase. The identity classification results on the VoxCeleb1 dataset strongly demonstrate the effectiveness of SAIC, with a top-1 accuracy of 96.1%.
Although SAIC is not trained or evaluated specifically on clinical data, the results strongly support the model’s effectiveness and its potential to generalize to the healthcare area, including patient voiceprint matching, real-time monitoring, etc.
Acknowledgements.
We express our gratitude to Dartmouth College alumnus Gokul Srinivasan and Professor SouYoung Jin from the Department of Computer Science, Dartmouth College, Hanover, US, for their invaluable support and contributions throughout our research process.

References
- (1) Niizumi, D., Takeuchi, D., Ohishi, Y., Harada, N. & Kashino, K. Masked Modeling Duo: Learning Representations by Encouraging Both Networks to Model the Input. ICASSP 2023-2023 IEEE International Conference On Acoustics, Speech And Signal Processing (ICASSP). pp. 1-5 (2023)
- (2) Huang, P., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F. & Feichtenhofer, C. Masked autoencoders that listen. ArXiv Preprint ArXiv:2207.06405. (2022)
- (3) Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S. & Others An image is worth 16x16 words: Transformers for image recognition at scale. ArXiv Preprint ArXiv:2010.11929. (2020)
- (4) Islam, K. Recent advances in vision transformer: A survey and outlook of recent work. ArXiv Preprint ArXiv:2203.01536. (2022)
- (5) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A., Kaiser, Ł. & Polosukhin, I. Attention is all you need. Advances In Neural Information Processing Systems. 30 (2017)
- (6) Baevski, A., Zhou, Y., Mohamed, A. & Auli, M. wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances In Neural Information Processing Systems. 33 pp. 12449-12460 (2020)
- (7) Hsu, W., Bolte, B., Tsai, Y., Lakhotia, K., Salakhutdinov, R. & Mohamed, A. Hubert: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Transactions On Audio, Speech, And Language Processing. 29 pp. 3451-3460 (2021)
- (8) Baade, A., Peng, P. & Harwath, D. Mae-ast: Masked autoencoding audio spectrogram transformer. ArXiv Preprint ArXiv:2203.16691. (2022)
- (9) Gong, Y., Lai, C., Chung, Y. & Glass, J. Ssast: Self-supervised audio spectrogram transformer. Proceedings Of The AAAI Conference On Artificial Intelligence. 36, 10699-10709 (2022)
- (10) Niizumi, D., Takeuchi, D., Ohishi, Y., Harada, N. & Kashino, K. Masked spectrogram modeling using masked autoencoders for learning general-purpose audio representation. HEAR: Holistic Evaluation Of Audio Representations. pp. 1-24 (2022)
- (11) Tomashenko, N., Srivastava, B., Wang, X., Vincent, E., Nautsch, A., Yamagishi, J., Evans, N., Patino, J., Bonastre, J., Noé, P. & Others Introducing the VoicePrivacy initiative. ArXiv Preprint ArXiv:2005.01387. (2020)
- (12) Lu, L., Zhang, X. & Renais, S. On training the recurrent neural network encoder-decoder for large vocabulary end-to-end speech recognition. 2016 IEEE International Conference On Acoustics, Speech And Signal Processing (ICASSP). pp. 5060-5064 (2016)
- (13) Toshniwal, S., Tang, H., Lu, L. & Livescu, K. Multitask learning with low-level auxiliary tasks for encoder-decoder based speech recognition. ArXiv Preprint ArXiv:1704.01631. (2017)
- (14) Hu, H., Qiao, Z., Cheng, M., Liu, Z. & Wang, H. Dasgil: Domain adaptation for semantic and geometric-aware image-based localization. IEEE Transactions On Image Processing. 30 pp. 1342-1353 (2020)
- (15) Karita, S., Ogawa, A., Delcroix, M. & Nakatani, T. Sequence training of encoder-decoder model using policy gradient for end-to-end speech recognition. 2018 IEEE International Conference On Acoustics, Speech And Signal Processing (ICASSP). pp. 5839-5843 (2018)
- (16) Chou, J., Yeh, C. & Lee, H. One-shot voice conversion by separating speaker and content representations with instance normalization. ArXiv Preprint ArXiv:1904.05742. (2019)
- (17) Noé, P., Bonastre, J., Matrouf, D., Tomashenko, N., Nautsch, A. & Evans, N. Speech pseudonymisation assessment using voice similarity matrices. ArXiv Preprint ArXiv:2008.13144. (2020)
- (18) Zuccon, G., Kotzur, D., Nguyen, A. & Bergheim, A. De-identification of health records using Anonym: Effectiveness and robustness across datasets. Artificial Intelligence In Medicine. 61, 145-151 (2014)
- (19) Ahmed, T., Aziz, M. & Mohammed, N. De-identification of electronic health record using neural network. Scientific Reports. 10, 18600 (2020)
- (20) Dernoncourt, F., Lee, J., Uzuner, O. & Szolovits, P. De-identification of patient notes with recurrent neural networks. Journal Of The American Medical Informatics Association. 24, 596-606 (2017)
- (21) Venugopal, R., Shafqat, N., Venugopal, I., Tillbury, B., Stafford, H. & Bourazeri, A. Privacy preserving generative adversarial networks to model electronic health records. Neural Networks. 153 pp. 339-348 (2022)
- (22) Justin, T., Štruc, V., Dobrišek, S., Vesnicer, B., Ipšić, I. & Mihelič, F. Speaker de-identification using diphone recognition and speech synthesis. 2015 11th IEEE International Conference And Workshops On Automatic Face And Gesture Recognition (FG). 4 pp. 1-7 (2015)
- (23) Han, Y., Li, S., Cao, Y., Ma, Q. & Yoshikawa, M. Voice-indistinguishability: Protecting voiceprint in privacy-preserving speech data release. 2020 IEEE International Conference On Multimedia And Expo (ICME). pp. 1-6 (2020)
- (24) Shamsabadi, A., Srivastava, B., Bellet, A., Vauquier, N., Vincent, E., Maouche, M., Tommasi, M. & Papernot, N. Differentially private speaker anonymization. ArXiv Preprint ArXiv:2202.11823. (2022)
- (25) Deng, J., Teng, F., Chen, Y., Chen, X., Wang, Z. & Xu, W. V-Cloak: Intelligibility-, Naturalness- & Timbre-Preserving Real-Time Voice Anonymization. 32nd USENIX Security Symposium (USENIX Security 23). pp. 5181-5198 (2023)
- (26) Chen, M., Lu, L., Wang, J., Yu, J., Chen, Y., Wang, Z., Ba, Z., Lin, F. & Ren, K. VoiceCloak: Adversarial Example Enabled Voice De-Identification with Balanced Privacy and Utility. Proceedings Of The ACM On Interactive, Mobile, Wearable And Ubiquitous Technologies. 7, 1-21 (2023)
- (27) He, K., Chen, X., Xie, S., Li, Y., Dollár, P. & Girshick, R. Masked autoencoders are scalable vision learners. Proceedings Of The IEEE/CVF Conference On Computer Vision And Pattern Recognition. pp. 16000-16009 (2022)
- (28) Gong, Y., Rouditchenko, A., Liu, A., Harwath, D., Karlinsky, L., Kuehne, H. & Glass, J. Contrastive audio-visual masked autoencoder. ArXiv Preprint ArXiv:2210.07839. (2022)
- (29) Diao, X., Cheng, M. & Cheng, S. AV-MaskEnhancer: Enhancing Video Representations through Audio-Visual Masked Autoencoder. ArXiv Preprint ArXiv:2309.08738. (2023)
- (30) Nagrani, A., Chung, J. & Zisserman, A. Voxceleb: a large-scale speaker identification dataset. ArXiv Preprint ArXiv:1706.08612. (2017)
- (31) Wan, L., Wang, Q., Papir, A. & Moreno, I. Generalized end-to-end loss for speaker verification. 2018 IEEE International Conference On Acoustics, Speech And Signal Processing (ICASSP). pp. 4879-4883 (2018)
- (32) Huang, X. & Belongie, S. Arbitrary style transfer in real-time with adaptive instance normalization. Proceedings Of The IEEE International Conference On Computer Vision. pp. 1501-1510 (2017)
- (33) Diao, X., Cheng, M., Barrios, W. & Jin, S. FT2TF: First-Person Statement Text-To-Talking Face Generation. ArXiv Preprint ArXiv:2312.05430. (2023)
- (34) Tong, Z., Song, Y., Wang, J. & Wang, L. Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training. Advances In Neural Information Processing Systems. 35 pp. 10078-10093 (2022)
- (35) Gabbay, A. & Hoshen, Y. Demystifying inter-class disentanglement. ArXiv Preprint ArXiv:1906.11796. (2019)
- (36) Hoshen, Y., Li, K. & Malik, J. Non-Adversarial Image Synthesis with Generative Latent Nearest Neighbors. Proceedings Of The IEEE Conference On Computer Vision And Pattern Recognition. pp. 5811-5819 (2019)