Ming Cheng · Xingjian Diao · Shitong Cheng · Wenjun Liu
Department of Computer Science, Dartmouth College, 15 Thayer Drive, Hanover, NH 03755, USA
Ming Cheng, e-mail: [email protected]
Xingjian Diao, e-mail: [email protected]
Shitong Cheng, e-mail: [email protected]
Wenjun Liu, e-mail: [email protected]
SAIC: Integration of Speech Anonymization and Identity Classification
Abstract
Speech anonymization and de-identification have garnered significant attention recently, especially in healthcare applications such as telehealth consultations, patient voiceprint matching, and real-time patient monitoring. Speaker identity classification, which involves recognizing specific speakers from audio in order to learn identity features, is crucial for de-identification. Since few studies have effectively combined speech anonymization with identity classification, we propose SAIC – an innovative pipeline that integrates Speech Anonymization and Identity Classification. SAIC achieves state-of-the-art performance on the speaker identity classification task on the VoxCeleb1 dataset, with a top-1 accuracy of 96.1%. Although SAIC is not trained or evaluated specifically on clinical data, the results strongly support the model’s effectiveness and its potential to generalize to the healthcare domain, providing insightful guidance for future work.
1 Introduction
Significant research has focused on using AI techniques for anonymization and de-identification in the ethics and healthcare domains, especially for protecting health records and patient notes zuccon2014identification ; ahmed2020identification ; dernoncourt2017identification ; venugopal2022privacy . Meanwhile, anonymization of speech has not been widely explored, with only a few studies developing methods on small-scale datasets han2020voice ; chen2023voicecloak . In parallel, speaker identity classification tasks, which require accurately identifying individuals from their audio audiomae ; niizumi2022masked , play a crucial role in privacy protection services. These tasks involve disentangling a person’s unique vocal characteristics (voiceprint), essentially understanding the speaker’s identity information within speech. While this precision is valuable in itself, it also opens up possibilities for enhancing speech anonymization techniques. Ideally, if a system can understand and isolate identity features in speech, it could then modify, obscure, or remove these features to anonymize the audio effectively. Since limited work has integrated anonymization with identity classification, an unsolved challenge remains: Is it feasible to develop a model that simultaneously achieves high-quality speech anonymization and maintains accurate speaker identity classification?
To address this research gap, we propose SAIC – a novel pipeline for speech anonymization and identity classification. After training, SAIC can extract accurate content and identity embeddings, removing identity information from the original audio. Moreover, it can merge the content of one audio clip with the voiceprint of another speaker, generating synthesized speech that maintains content integrity under an altered identity.
In summary, our contribution is threefold:
• We propose SAIC, a novel pipeline that effectively integrates speech anonymization and identity classification. High-quality content embeddings and identity embeddings are extracted through robust encoders.
• On the speaker identity classification task on the VoxCeleb1 dataset, SAIC outperforms existing work and achieves a state-of-the-art result, with a top-1 accuracy of 96.1%.
• SAIC can synthesize new audio by merging the content from one speaker’s audio with the voiceprint of another, effectively generating synthesized speech that preserves the original content while adopting a different vocal identity.
2 Related Work
2.1 Speech Anonymization and De-Identification
An early approach known as DROPSY justin2015speaker was proposed to conceal the speaker’s identity. It builds a diphone recognition system for speech recognition, followed by a speech synthesis system that transforms a speaker’s speech into that of a different individual. In a separate effort, VoicePrivacy tomashenko2020introducing was proposed to propel advancements in speech data anonymization. Its benchmark seeks to minimize the disclosure of the speaker’s identity while preserving the distinctiveness of the speech.
Recent studies in voice privacy preservation have proposed methods spanning multiple dimensions, such as differentially private approaches shamsabadi2022differentially , naturalness- and timbre-preserving anonymization deng2023v , and adversarial examples chen2023voicecloak . However, they use limited validation datasets and mainly focus on specific scenarios, potentially limiting the generalizability of the proposed techniques.
To address the research gap mentioned above, we propose SAIC, a novel pipeline for speaker de-identification and privacy preservation. We evaluate our model on VoxCeleb1 nagrani2017voxceleb , a commonly used large-scale dataset, and the state-of-the-art results indicate the effectiveness of our model.
2.2 Speaker Identity Classification
The task of speaker identity classification has garnered significant attention in recent years, driven by its applications in various domains including privacy protection, voice-controlled systems, and human-computer interaction.
With the development of Transformers vaswani2017attention and ViT dosovitskiy2020image , multiple studies have adopted these architectures as the backbone. For example, SS-AST gong2022ssast pretrains the AST model with joint discriminative and generative masked spectrogram patch modeling, while wav2vec 2.0 baevski2020wav2vec focuses on learning powerful representations from speech audio. Although ViT-based methods outperform CNN-based ones in various AI tasks, they usually require massive data and repeated pretraining, and they struggle to handle temporal dynamics without strong data augmentations islam2022recent ; he2022masked . Therefore, our model adopts a CNN backbone and follows the mainstream encoder-decoder structure lu2016training ; toshniwal2017multitask ; hu2020dasgil ; karita2018sequence , achieving significant results while requiring fewer computational resources.
Considering the effectiveness of MAE-based methods on various downstream tasks he2022masked ; gong2022contrastive ; audiomae ; diao2023av ; tong2022videomae , recent work on identity classification mainly follows the mask-and-reconstruction strategy for representation learning maeast ; niizumi2022masked ; m2d . However, these methods do not consider the integration of identity classification and audio anonymization, and they lack strong identity disentanglement capabilities. We therefore propose SAIC, which effectively removes identity information while also achieving superior classification performance.
3 Method
We address the challenge of implementing de-identification for speaker privacy protection through the proposed SAIC pipeline. Formally, given input audio $x_A$ of speaker $A$ and $x_B$ of speaker $B$, our goal is to synthesize new audio $\hat{x}$ with speaker $B$'s identity and speaker $A$'s content information. This removes the identity information of speaker $A$ for privacy protection.
The training and inference of the SAIC pipeline are shown in Figures 1 and 2, respectively. Inspired by gabbay2019demystifying , the pipeline training contains two stages. The first stage aims to extract accurate content embeddings ($e_c \in \mathcal{Z}$) and speaker embeddings ($e_s \in \mathcal{Z}$) from the content and the speaker ID, where $\mathcal{Z}$ indicates the latent space. Moreover, the Fusion Decoder (FD) is trained to reconstruct the original audio through the latent optimization strategy gabbay2019demystifying . The second stage focuses on optimizing the Content Encoder (CE), the Speaker Encoder (SE), and the Fusion Decoder (FD) to reconstruct audio. Specifically, let $x_i \in \mathcal{X}$ be the input audio of speaker $i$, where $\mathcal{X}$ indicates the ground-truth domain; it is input into CE and SE to extract content embeddings ($\hat{e}_c$) and speaker identity embeddings ($\hat{e}_s$), respectively, which lie in the latent space produced by the two encoders. Afterward, $\hat{e}_c$ and $\hat{e}_s$ are input into FD to reconstruct the audio. Through this pipeline, the two encoders and the decoder are well trained for inference.
During inference, we take audio inputs from two different speakers $A$ and $B$, aiming to remove the speaker identity information of $A$. In this phase, all encoders and the decoder are frozen.
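Below is a minimal PyTorch sketch of this inference step. The module definitions, embedding sizes, and mel-spectrogram shape are illustrative assumptions standing in for the actual trained CE, SE, and FD; only the data flow mirrors the description above.

```python
# Sketch of SAIC inference: frozen CE/SE/FD, content of speaker A combined with
# the identity of speaker B. Modules below are stand-ins, not the real layers.
import torch
import torch.nn as nn

mel_bins, frames = 80, 128          # hypothetical mel-spectrogram shape
content_dim, speaker_dim = 128, 64  # hypothetical embedding sizes

CE = nn.Sequential(nn.Flatten(), nn.Linear(mel_bins * frames, content_dim))
SE = nn.Sequential(nn.Flatten(), nn.Linear(mel_bins * frames, speaker_dim))
FD = nn.Sequential(nn.Linear(content_dim + speaker_dim, 512), nn.ReLU(),
                   nn.Linear(512, mel_bins * frames))

@torch.no_grad()  # all modules are frozen at inference time
def anonymize(mel_a: torch.Tensor, mel_b: torch.Tensor) -> torch.Tensor:
    """Return audio (as a mel spectrogram) with A's content and B's identity."""
    e_c = CE(mel_a)                 # content embedding from speaker A
    e_s = SE(mel_b)                 # identity embedding from speaker B
    out = FD(torch.cat([e_c, e_s], dim=-1))
    return out.view(-1, mel_bins, frames)

# Usage with random placeholders for two speakers' mel spectrograms.
mel_a = torch.randn(1, mel_bins, frames)
mel_b = torch.randn(1, mel_bins, frames)
anonymized = anonymize(mel_a, mel_b)   # shape: (1, 80, 128)
```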


3.1 Two-Stage Pipeline Training
Training Stage 1
The first stage aims to extract accurate content and speaker embeddings through latent optimization gabbay2019demystifying . Specifically, given the content and the speaker ID, the corresponding embeddings, $e_c$ and $e_s$, are obtained in the latent space. The two embeddings are then input into the Fusion Decoder (FD) to reconstruct the audio. To train FD through the latent optimization strategy, we employ the VGG perceptual loss hoshen2019non as the stage-1 objective $\mathcal{L}_{1}$, summed over each speaker $i$:
$$\mathcal{L}_{1} = \sum_{i=1}^{N} \left\| \Phi\!\left( FD\!\left(e_c^{(i)} + z_i,\; e_s^{(i)}\right) \right) - \Phi\!\left(x_i\right) \right\|_2^2 + \lambda \left\| e_c^{(i)} \right\|_2^2 \qquad (1)$$
where $FD$ is the Fusion Decoder, $\Phi$ denotes the VGG feature extractor, $N$ indicates all speakers, $z_i$ represents Gaussian noise of fixed variance, and $\lambda \| e_c^{(i)} \|_2^2$ is an activation decay penalty applied to the content embeddings ($e_c$) to regularize the content.
After stage 1, accurate content and speaker embeddings have been obtained, and the decoder is well trained to generate audio from the two embeddings.
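The following sketch illustrates one stage-1 latent-optimization step under stated assumptions: per-clip content embeddings and per-speaker identity embeddings are free learnable parameters, the Fusion Decoder is a stand-in MLP, and a plain L2 distance replaces the VGG perceptual loss; the noise scale, decay weight, and all dimensions are hypothetical.

```python
# Sketch of the stage-1 latent-optimization step (Eq. 1), with placeholders.
import torch
import torch.nn as nn

num_clips, num_speakers = 1000, 40   # hypothetical dataset size
content_dim, speaker_dim = 128, 64   # hypothetical embedding sizes
mel_bins, frames = 80, 128           # hypothetical mel-spectrogram shape

# Learnable per-clip content embeddings and per-speaker identity embeddings.
content_emb = nn.Embedding(num_clips, content_dim)
speaker_emb = nn.Embedding(num_speakers, speaker_dim)

# Stand-in Fusion Decoder: maps (content, speaker) embeddings to a mel spectrogram.
fusion_decoder = nn.Sequential(
    nn.Linear(content_dim + speaker_dim, 512),
    nn.ReLU(),
    nn.Linear(512, mel_bins * frames),
)

def perceptual_loss(pred, target):
    # Placeholder for the VGG perceptual loss hoshen2019non; a plain L2
    # distance keeps this sketch self-contained.
    return ((pred - target) ** 2).mean()

optimizer = torch.optim.Adam(
    list(content_emb.parameters()) + list(speaker_emb.parameters())
    + list(fusion_decoder.parameters()), lr=1e-3)

def stage1_step(clip_ids, speaker_ids, mel_targets, noise_std=0.1, decay=1e-3):
    e_c = content_emb(clip_ids)                           # per-clip content codes
    e_s = speaker_emb(speaker_ids)                        # per-speaker identity codes
    e_c_noisy = e_c + noise_std * torch.randn_like(e_c)   # fixed-variance Gaussian noise
    recon = fusion_decoder(torch.cat([e_c_noisy, e_s], dim=-1))
    recon = recon.view(-1, mel_bins, frames)
    # Perceptual reconstruction term + activation decay on the content codes.
    loss = perceptual_loss(recon, mel_targets) + decay * e_c.pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# One optimization step on random placeholder data.
clip_ids = torch.randint(0, num_clips, (8,))
speaker_ids = torch.randint(0, num_speakers, (8,))
mel_targets = torch.randn(8, mel_bins, frames)
print(stage1_step(clip_ids, speaker_ids, mel_targets))
```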
Training Stage 2
In stage 2 of pipeline training, we design an encoder-decoder architecture to generate the audio. Specifically, we construct two encoders (CE and SE) to learn accurate embeddings from the original audio, and we reuse FD with weights shared from the first stage to reconstruct the audio.
To guide SAIC to learn the exact content and speaker embeddings, we employ an MSE loss as the embedding loss $\mathcal{L}_{emb}$:
$$\mathcal{L}_{emb} = \left\| \hat{e}_c - e_c \right\|_2^2 + \left\| \hat{e}_s - e_s \right\|_2^2 \qquad (2)$$
where $e_c$ and $e_s$ are the embeddings obtained from the first stage, and $\hat{e}_c$ and $\hat{e}_s$ are the outputs of CE and SE.
Similar to the first stage, we apply the VGG perceptual loss as the reconstruction loss $\mathcal{L}_{rec}$ (without Gaussian noise) to guide the model to generate correct and precise audio:
$$\mathcal{L}_{rec} = \sum_{i=1}^{N} \left\| \Phi\!\left( FD\!\left( CE(x_i),\; SE(x_i) \right) \right) - \Phi\!\left(x_i\right) \right\|_2^2 \qquad (3)$$
Finally, the combined loss for pipeline training in the second stage is expressed below:
$$\mathcal{L} = \mathcal{L}_{emb} + \mathcal{L}_{rec} \qquad (4)$$
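A minimal sketch of the stage-2 objective (Eqs. 2-4) is given below, under the same illustrative assumptions as the stage-1 sketch; the stand-in encoders and decoder and the plain L2 reconstruction term are placeholders for the actual SAIC modules and VGG perceptual loss.

```python
# Sketch of the stage-2 combined loss (Eqs. 2-4) with placeholder modules.
import torch
import torch.nn as nn
import torch.nn.functional as F

mel_bins, frames = 80, 128
content_dim, speaker_dim = 128, 64

content_encoder = nn.Sequential(nn.Flatten(), nn.Linear(mel_bins * frames, content_dim))
speaker_encoder = nn.Sequential(nn.Flatten(), nn.Linear(mel_bins * frames, speaker_dim))
fusion_decoder = nn.Sequential(
    nn.Linear(content_dim + speaker_dim, 512), nn.ReLU(),
    nn.Linear(512, mel_bins * frames))

def stage2_loss(mel, e_c_stage1, e_s_stage1):
    """mel: (B, mel_bins, frames); e_*_stage1: embeddings learned in stage 1."""
    e_c_hat = content_encoder(mel)
    e_s_hat = speaker_encoder(mel)
    # Eq. 2: pull the encoder outputs toward the stage-1 embeddings.
    l_emb = F.mse_loss(e_c_hat, e_c_stage1) + F.mse_loss(e_s_hat, e_s_stage1)
    # Eq. 3: reconstruction term (no Gaussian noise); a plain L2 stands in
    # for the VGG perceptual loss.
    recon = fusion_decoder(torch.cat([e_c_hat, e_s_hat], dim=-1))
    l_rec = F.mse_loss(recon.view(-1, mel_bins, frames), mel)
    # Eq. 4: combined objective.
    return l_emb + l_rec

# Usage with random placeholders.
mel = torch.randn(8, mel_bins, frames)
e_c1, e_s1 = torch.randn(8, content_dim), torch.randn(8, speaker_dim)
print(stage2_loss(mel, e_c1, e_s1).item())
```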

3.2 Content Encoder
The Content Encoder (CE) is constructed from sequential convolution blocks. Following chou2019one , it uses six repeated residual blocks as its main structure. Each residual block includes a convolutional module and an instance normalization layer huang2017arbitrary . This structure extracts the high-dimensional content information in the audio while omitting identity information from the content embedding.
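A minimal sketch of one such residual block with instance normalization is shown below; the channel count, kernel size, and use of 1-D convolutions over mel-spectrogram frames are assumptions, since the exact configuration is not specified here. The Speaker Encoder described next stacks the same kind of unit before its dense layers.

```python
# Sketch of a residual block with instance normalization (illustrative sizes).
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, channels: int = 256, kernel_size: int = 5):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size,
                              padding=kernel_size // 2)
        self.norm = nn.InstanceNorm1d(channels)  # strips channel-wise statistics
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.act(self.norm(self.conv(x)))  # residual connection

# The Content Encoder stacks six such blocks over the mel-spectrogram frames.
content_encoder = nn.Sequential(*[ResBlock(80) for _ in range(6)])
mel = torch.randn(1, 80, 128)        # (batch, mel_bins, frames)
print(content_encoder(mel).shape)    # torch.Size([1, 80, 128])
```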
3.3 Speaker Encoder
The Speaker Encoder (SE) contains a sequential residual block (repeated six times) and a sequential fully connected dense layer (repeated twice), following chou2019one . Similar to the design of CE, the residual blocks extract high-dimensional identity information from the audio; the fully connected dense layers then map the extracted features into the specified dimension and fully separate the embeddings of different speakers.
3.4 Fusion Decoder
The Fusion Decoder (FD) fuses the content and identity embeddings for audio generation. Specifically, it consists of two sub-modules: the first decodes content embeddings through sequential convolution blocks, and the second decodes voiceprint features through dense layers, based on chou2019one . Adaptive Instance Normalization (AdaIN) huang2017arbitrary is applied in the decoder.
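The sketch below illustrates AdaIN as it is commonly applied in such decoders: the speaker embedding is mapped to per-channel scale and bias that modulate instance-normalized content features. The dimensions and the 1-D formulation are assumptions, not the exact SAIC configuration.

```python
# Sketch of Adaptive Instance Normalization (AdaIN) for decoder conditioning.
import torch
import torch.nn as nn

class AdaIN1d(nn.Module):
    def __init__(self, channels: int, speaker_dim: int):
        super().__init__()
        self.norm = nn.InstanceNorm1d(channels, affine=False)
        # Dense layer mapping the speaker embedding to per-channel scale and bias.
        self.affine = nn.Linear(speaker_dim, 2 * channels)

    def forward(self, content: torch.Tensor, speaker: torch.Tensor) -> torch.Tensor:
        # content: (B, channels, frames), speaker: (B, speaker_dim)
        scale, bias = self.affine(speaker).chunk(2, dim=-1)
        return self.norm(content) * (1 + scale.unsqueeze(-1)) + bias.unsqueeze(-1)

adain = AdaIN1d(channels=80, speaker_dim=64)
content_feat = torch.randn(2, 80, 128)
speaker_emb = torch.randn(2, 64)
print(adain(content_feat, speaker_emb).shape)   # torch.Size([2, 80, 128])
```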
4 Experiments
We conduct experiments on the VoxCeleb1 dataset nagrani2017voxceleb and evaluate the de-identification results through the identity classification task.
4.1 Dataset
The VoxCeleb1 dataset is a diverse, real-environment audio dataset collected from public YouTube videos. It contains 153,516 audio clips from 1,251 speakers of various ages, roles, and identities. The dataset can be used for speech recognition, speaker classification, and speech information analysis. To evaluate the de-identification quality of our model, we conduct the speaker classification task and compare the results with other related work.
4.2 Evaluation Metrics
The evaluation of our model follows the steps below:
• For each speaker $A$ in the test set, we randomly choose a different speaker $B$; their audio is fed into CE and SE, respectively.
• After SAIC, the synthesized audio $\hat{x}$ contains the identity information of speaker $B$ and the speech content of speaker $A$. $\hat{x}$ is then input into a powerful pre-trained VoiceEncoder wan2018generalized from the Resemblyzer library in Python to extract speaker identity embeddings.
• Finally, we use the embeddings extracted by VoiceEncoder to find the corresponding speaker ID. Since we only find one best-matched speaker each time, we report the top-1 accuracy as the evaluation metric; a minimal sketch of this protocol is given after this list.
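The sketch below uses the real Resemblyzer VoiceEncoder for this matching step; the file paths and the single-reference-utterance-per-speaker enrollment scheme are assumptions made for illustration, not the exact evaluation setup.

```python
# Sketch of top-1 speaker matching with Resemblyzer's VoiceEncoder.
import numpy as np
from resemblyzer import VoiceEncoder, preprocess_wav

encoder = VoiceEncoder()

# Hypothetical enrollment: one reference utterance per speaker ID.
reference_wavs = {
    "id10001": "ref/id10001.wav",
    "id10002": "ref/id10002.wav",
}
ref_embeds = {spk: encoder.embed_utterance(preprocess_wav(path))
              for spk, path in reference_wavs.items()}

def top1_speaker(synth_wav_path: str) -> str:
    """Return the speaker ID whose reference embedding is closest (cosine)."""
    emb = encoder.embed_utterance(preprocess_wav(synth_wav_path))
    sims = {spk: float(np.dot(emb, ref) /
                       (np.linalg.norm(emb) * np.linalg.norm(ref)))
            for spk, ref in ref_embeds.items()}
    return max(sims, key=sims.get)

# Top-1 accuracy counts a hit when the prediction matches the
# identity-providing speaker B (hypothetical file name below).
pred = top1_speaker("synth/contentA_identityB.wav")
```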
4.3 Quantitative Results
Comparison With State-of-The-Art
Table 1: Speaker identity classification results on the VoxCeleb1 dataset.

Methods | Backbone | MAE-Based | Top-1 Acc (%)
MAE-AST maeast | ViT-B | ✓ | 63.3
SS-AST gong2022ssast | ViT-B | | 64.3
wav2vec 2.0 baevski2020wav2vec | Transformer | | 75.2
HuBERT hsu2021hubert | Transformer | | 81.4
M2D m2d | ViT-B | ✓ | 94.8
Audio-MAE audiomae | ViT-B | ✓ | 94.8
MSM-MAE niizumi2022masked | ViT-B | ✓ | 95.3
SAIC (Ours) | CNN | | 96.1
The quantitative results of our model compared with related work are shown in Table 1, from which we can observe that our model achieves the state-of-the-art speaker identity classification result, with a top-1 accuracy of 96.1%.
Since the latest work mainly adopts MAE-based structures that follow the mask-and-reconstruction strategy, this result demonstrates the effectiveness of our disentanglement approach, with a lead of 1.3% in top-1 accuracy over both M2D m2d and Audio-MAE audiomae .
In addition to its effectiveness in speaker identity classification, our model has application-level advantages over the MAE-based ones. Since SAIC disentangles the speaker identity and the content of the speech with high quality (only about 3.9% of the speaker identities are not fully extracted), our model can be applied to many privacy-preserving real-world applications in the future, especially in the healthcare area: telehealth consultations, patient voiceprint matching, real-time patient monitoring, etc.
Transformers vs. CNN
As shown in Table 1, most recent relevant work utilizes either Transformers vaswani2017attention or ViT dosovitskiy2020image as the backbone, whereas SAIC uses ResBlocks as the main body of its architecture. Considering that Transformer-based models typically require massive data and strong data augmentations to handle temporal dynamics and variations islam2022recent ; diao2023ft2tf , our CNN-based architecture showcases the advantage of model efficiency. The ResBlock’s capability to effectively handle local features and temporal dynamics makes it more suitable for tasks requiring granular audio analysis and sustained temporal coherence, such as speech disentanglement and synthesis, where accurate extraction and reconstruction of audio across time are crucial for achieving high-quality outcomes.
Moreover, our model achieves a significant lead over several Transformer-based models (20.9% over wav2vec 2.0 baevski2020wav2vec , 14.7% over HuBERT hsu2021hubert ). This showcases the effectiveness of CNNs over Transformers on small datasets or pipeline training without strong data augmentations, a significant advantage in the healthcare area where obtaining large-scale datasets is rarely possible.
4.4 Qualitative Results


Audio Reconstruction and Synthesis
The visualization of audio reconstruction during training and audio generation during inference is shown in Figure 4. Since the original mel spectrogram and the reconstructed one are highly similar, we can observe that SAIC reconstructs accurate and precise audio through the disentanglement strategy. This indicates that all the trainable modules (CE, SE, FD) are well optimized. Furthermore, during inference, our model utilizes the identity of a different speaker, resulting in a distinctly different mel spectrogram. This clearly demonstrates the model’s ability to de-identify audio effectively, thus providing robust privacy protection. In real-world applications, our model could be deployed for the secure handling of sensitive patient information during telehealth consultations, medical dictations, and other audio-based interactions.
Audio Disentanglement
The qualitative result of audio disentanglement, visualized by t-SNE on the VoxCeleb1 dataset, is shown in Figure 5, where each point represents the features of a distinct utterance (results on 40 speakers are shown as an example). For voiceprint disentanglement, SAIC distinguishes each speaker clearly: points of the same color are clustered together, while points of different colors are well separated. Meanwhile, the content embeddings of the audio of different speakers are all clustered into one blob. Since a two-dimensional t-SNE mapping cannot cluster information as complex as audio semantics, this shows that the embedding disentangled by the Content Encoder indeed captures the audio content and contains almost no identity information about the speaker. This visualization further indicates that SAIC can be utilized for healthcare privacy protection.
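A minimal sketch of such a t-SNE projection is shown below, assuming the identity embeddings extracted by SAIC are available as a NumPy array together with integer speaker labels; the array contents here are random placeholders.

```python
# Sketch of a t-SNE visualization of speaker identity embeddings.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Hypothetical inputs: one identity embedding per utterance and its speaker label.
speaker_embeddings = np.random.randn(400, 64)       # (num_utterances, emb_dim)
speaker_labels = np.random.randint(0, 40, size=400) # 40 speakers, as in Fig. 5

# Project to 2-D; perplexity is a tunable assumption.
points = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(
    speaker_embeddings)

plt.figure(figsize=(6, 6))
plt.scatter(points[:, 0], points[:, 1], c=speaker_labels, cmap="tab20", s=8)
plt.title("t-SNE of speaker identity embeddings (40 speakers)")
plt.tight_layout()
plt.savefig("tsne_speaker_embeddings.png")
```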
5 Conclusion
In this paper, we introduce SAIC, a novel pipeline that effectively integrates speech disentanglement and speaker classification. SAIC is built with a CNN backbone in an encoder-decoder architecture. The well-trained model is utilized for speech de-identification and new audio generation during the inference phase. The identity classification results on the VoxCeleb1 dataset strongly demonstrate the effectiveness of SAIC, with a top-1 accuracy of 96.1%.
Although SAIC is not trained or evaluated specifically on clinical data, the results strongly support the model’s effectiveness and its potential to generalize to the healthcare area, including patient voiceprint matching, real-time monitoring, etc.
Acknowledgements.
We express our gratitude to Dartmouth College alumnus Gokul Srinivasan and Professor SouYoung Jin from the Department of Computer Science, Dartmouth College, Hanover, US, for their invaluable support and contributions throughout our research process.

References
- (1) Niizumi, D., Takeuchi, D., Ohishi, Y., Harada, N. & Kashino, K. Masked Modeling Duo: Learning Representations by Encouraging Both Networks to Model the Input. ICASSP 2023-2023 IEEE International Conference On Acoustics, Speech And Signal Processing (ICASSP). pp. 1-5 (2023)
- (2) Huang, P., Xu, H., Li, J., Baevski, A., Auli, M., Galuba, W., Metze, F. & Feichtenhofer, C. Masked autoencoders that listen. ArXiv Preprint ArXiv:2207.06405. (2022)
- (3) Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S. & Others An image is worth 16x16 words: Transformers for image recognition at scale. ArXiv Preprint ArXiv:2010.11929. (2020)
- (4) Islam, K. Recent advances in vision transformer: A survey and outlook of recent work. ArXiv Preprint ArXiv:2203.01536. (2022)
- (5) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A., Kaiser, Ł. & Polosukhin, I. Attention is all you need. Advances In Neural Information Processing Systems. 30 (2017)
- (6) Baevski, A., Zhou, Y., Mohamed, A. & Auli, M. wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances In Neural Information Processing Systems. 33 pp. 12449-12460 (2020)
- (7) Hsu, W., Bolte, B., Tsai, Y., Lakhotia, K., Salakhutdinov, R. & Mohamed, A. Hubert: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Transactions On Audio, Speech, And Language Processing. 29 pp. 3451-3460 (2021)
- (8) Baade, A., Peng, P. & Harwath, D. Mae-ast: Masked autoencoding audio spectrogram transformer. ArXiv Preprint ArXiv:2203.16691. (2022)
- (9) Gong, Y., Lai, C., Chung, Y. & Glass, J. Ssast: Self-supervised audio spectrogram transformer. Proceedings Of The AAAI Conference On Artificial Intelligence. 36, 10699-10709 (2022)
- (10) Niizumi, D., Takeuchi, D., Ohishi, Y., Harada, N. & Kashino, K. Masked spectrogram modeling using masked autoencoders for learning general-purpose audio representation. HEAR: Holistic Evaluation Of Audio Representations. pp. 1-24 (2022)
- (11) Tomashenko, N., Srivastava, B., Wang, X., Vincent, E., Nautsch, A., Yamagishi, J., Evans, N., Patino, J., Bonastre, J., Noé, P. & Others Introducing the VoicePrivacy initiative. ArXiv Preprint ArXiv:2005.01387. (2020)
- (12) Lu, L., Zhang, X. & Renais, S. On training the recurrent neural network encoder-decoder for large vocabulary end-to-end speech recognition. 2016 IEEE International Conference On Acoustics, Speech And Signal Processing (ICASSP). pp. 5060-5064 (2016)
- (13) Toshniwal, S., Tang, H., Lu, L. & Livescu, K. Multitask learning with low-level auxiliary tasks for encoder-decoder based speech recognition. ArXiv Preprint ArXiv:1704.01631. (2017)
- (14) Hu, H., Qiao, Z., Cheng, M., Liu, Z. & Wang, H. Dasgil: Domain adaptation for semantic and geometric-aware image-based localization. IEEE Transactions On Image Processing. 30 pp. 1342-1353 (2020)
- (15) Karita, S., Ogawa, A., Delcroix, M. & Nakatani, T. Sequence training of encoder-decoder model using policy gradient for end-to-end speech recognition. 2018 IEEE International Conference On Acoustics, Speech And Signal Processing (ICASSP). pp. 5839-5843 (2018)
- (16) Chou, J., Yeh, C. & Lee, H. One-shot voice conversion by separating speaker and content representations with instance normalization. ArXiv Preprint ArXiv:1904.05742. (2019)
- (17) Noé, P., Bonastre, J., Matrouf, D., Tomashenko, N., Nautsch, A. & Evans, N. Speech pseudonymisation assessment using voice similarity matrices. ArXiv Preprint ArXiv:2008.13144. (2020)
- (18) Zuccon, G., Kotzur, D., Nguyen, A. & Bergheim, A. De-identification of health records using Anonym: Effectiveness and robustness across datasets. Artificial Intelligence In Medicine. 61, 145-151 (2014)
- (19) Ahmed, T., Aziz, M. & Mohammed, N. De-identification of electronic health record using neural network. Scientific Reports. 10, 18600 (2020)
- (20) Dernoncourt, F., Lee, J., Uzuner, O. & Szolovits, P. De-identification of patient notes with recurrent neural networks. Journal Of The American Medical Informatics Association. 24, 596-606 (2017)
- (21) Venugopal, R., Shafqat, N., Venugopal, I., Tillbury, B., Stafford, H. & Bourazeri, A. Privacy preserving generative adversarial networks to model electronic health records. Neural Networks. 153 pp. 339-348 (2022)
- (22) Justin, T., Štruc, V., Dobrišek, S., Vesnicer, B., Ipšić, I. & Mihelič, F. Speaker de-identification using diphone recognition and speech synthesis. 2015 11th IEEE International Conference And Workshops On Automatic Face And Gesture Recognition (FG). 4 pp. 1-7 (2015)
- (23) Han, Y., Li, S., Cao, Y., Ma, Q. & Yoshikawa, M. Voice-indistinguishability: Protecting voiceprint in privacy-preserving speech data release. 2020 IEEE International Conference On Multimedia And Expo (ICME). pp. 1-6 (2020)
- (24) Shamsabadi, A., Srivastava, B., Bellet, A., Vauquier, N., Vincent, E., Maouche, M., Tommasi, M. & Papernot, N. Differentially private speaker anonymization. ArXiv Preprint ArXiv:2202.11823. (2022)
- (25) Deng, J., Teng, F., Chen, Y., Chen, X., Wang, Z. & Xu, W. V-Cloak: Intelligibility-, Naturalness- & Timbre-Preserving Real-Time Voice Anonymization. 32nd USENIX Security Symposium (USENIX Security 23). pp. 5181-5198 (2023)
- (26) Chen, M., Lu, L., Wang, J., Yu, J., Chen, Y., Wang, Z., Ba, Z., Lin, F. & Ren, K. VoiceCloak: Adversarial Example Enabled Voice De-Identification with Balanced Privacy and Utility. Proceedings Of The ACM On Interactive, Mobile, Wearable And Ubiquitous Technologies. 7, 1-21 (2023)
- (27) He, K., Chen, X., Xie, S., Li, Y., Dollár, P. & Girshick, R. Masked autoencoders are scalable vision learners. Proceedings Of The IEEE/CVF Conference On Computer Vision And Pattern Recognition. pp. 16000-16009 (2022)
- (28) Gong, Y., Rouditchenko, A., Liu, A., Harwath, D., Karlinsky, L., Kuehne, H. & Glass, J. Contrastive audio-visual masked autoencoder. ArXiv Preprint ArXiv:2210.07839. (2022)
- (29) Diao, X., Cheng, M. & Cheng, S. AV-MaskEnhancer: Enhancing Video Representations through Audio-Visual Masked Autoencoder. ArXiv Preprint ArXiv:2309.08738. (2023)
- (30) Nagrani, A., Chung, J. & Zisserman, A. Voxceleb: a large-scale speaker identification dataset. ArXiv Preprint ArXiv:1706.08612. (2017)
- (31) Wan, L., Wang, Q., Papir, A. & Moreno, I. Generalized end-to-end loss for speaker verification. 2018 IEEE International Conference On Acoustics, Speech And Signal Processing (ICASSP). pp. 4879-4883 (2018)
- (32) Huang, X. & Belongie, S. Arbitrary style transfer in real-time with adaptive instance normalization. Proceedings Of The IEEE International Conference On Computer Vision. pp. 1501-1510 (2017)
- (33) Diao, X., Cheng, M., Barrios, W. & Jin, S. FT2TF: First-Person Statement Text-To-Talking Face Generation. ArXiv Preprint ArXiv:2312.05430. (2023)
- (34) Tong, Z., Song, Y., Wang, J. & Wang, L. Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training. Advances In Neural Information Processing Systems. 35 pp. 10078-10093 (2022)
- (35) Gabbay, A. & Hoshen, Y. Demystifying inter-class disentanglement. ArXiv Preprint ArXiv:1906.11796. (2019)
- (36) Hoshen, Y., Li, K. & Malik, J. Non-Adversarial Image Synthesis with Generative Latent Nearest Neighbors. Proceedings Of The IEEE Conference On Computer Vision And Pattern Recognition. pp. 5811-5819 (2019)