
Controlling your Attributes in Voice

Xuyuan Li, Zengqiang Shang, Li Wang, and Pengyuan Zhang

Xuyuan Li and Pengyuan Zhang are with the Laboratory of Speech and Intelligent Information Processing, Institute of Acoustics, Chinese Academy of Sciences, Beijing 100190, China, and also with the University of Chinese Academy of Sciences, Beijing 100049, China (e-mail: [email protected]; [email protected]). Zengqiang Shang and Li Wang are with the Laboratory of Speech and Intelligent Information Processing, Institute of Acoustics, Chinese Academy of Sciences, Beijing 100190, China (e-mail: [email protected]; [email protected]).
Abstract

Attribute control in generative tasks aims to modify personal attributes, such as age and gender, while preserving the identity information in the source sample. Although significant progress has been made in controlling facial attributes in image generation, similar approaches for speech generation remain largely unexplored. This letter proposes a novel method for controlling speaker attributes in speech without parallel data. Our approach consists of two main components: a GAN-based speaker representation variational autoencoder that extracts speaker identity and attributes from the speaker vector, and a two-stage voice conversion model that captures the natural expression of speaker attributes in speech. Experimental results show that our proposed method not only achieves attribute control at the speaker representation level but also enables manipulation of speaker age and gender at the speech level while preserving speech quality and speaker identity.

Index Terms:
Speaker attribute control, speech generation, speaker generation, voice conversion

I Introduction

Personal attributes such as age and gender convey critical information in human-computer interaction. The ability to control these attributes in generative models to produce more diverse samples is important for advancing areas such as identity recognition, self-supervised learning, and privacy preservation [1, 2, 3, 4]. While techniques for controlling personal attributes in generated facial images have become increasingly mature [5, 6, 7], methods for manipulating these attributes in speech remain underexplored.

The universal impact of age, gender, and other speaker attributes on the voice has been widely established [8, 9, 10, 11]. Meanwhile, the accuracy of recognizing these attributes has improved significantly [12, 13, 14], benefiting from the development of neural network models. In the field of speech generation, some works [15, 16, 17] have explored how to decouple and remove personal attributes from the speaker representations extracted by speaker recognition (SR) models. However, at the speech level, both recent speaker cloning models [18, 19, 20] and speech privacy protection models [21, 22, 23] focus primarily on controlling the overall speaker identity. While there have been efforts to generate virtual speakers that reflect certain speaker attributes using facial images [24, 25] or text prompts [26, 27], these approaches still struggle to control specific speaker attributes and speaker identity independently.

This letter introduces a novel approach to controlling speaker attributes in speech. Without the need for parallel data, our proposed method enables the expression of speaker attributes in speech to be shared across different speakers. The process is structured in two key steps: 1) speaker attribute extraction; 2) conditional voice conversion. In the first step, we design a GAN-based speaker representation variational autoencoder (SRVAE), which decomposes the speaker representations extracted by the speaker identification model into attribute and identity components and generates a new speaker representation from attribute labels and the identity component. In the second step, we propose a two-stage voice conversion (TSVC) model, which refines attribute-dependent average speech into speaker-specific speech based on the components extracted from the generated speaker representation.

We evaluate our method on two speaker attributes: age and gender. As in other speech generation tasks with emotion and style control, controlling speaker attributes in speech involves a trade-off between attribute label consistency, speaker identity consistency, and speech quality. In Sec. IV, we present a series of subjective and objective experiments to analyze our approach across these three aspects. Our contributions can be summarized as follows:

1) We propose a novel approach that, to the best of our knowledge, is the first to achieve age and gender control at the speech level, while preserving speaker identity information.

2) We propose the SRVAE, which can generate non-existent speaker representations with predefined age and gender labels, and extract attribute embeddings from speaker representations.

3) We propose the TSVC, which enables the sharing of attribute expressions across different speakers in speech.

II Methods


Figure 1: Overall workflow of our proposed method (a), detailed structure of SRVAE (b), and detailed structure of TSVC (c). $L^{p}$ denotes the predefined attribute label, and $L^{o}$ denotes the original attribute label.

II-A Overall Workflow

Fig. 1 (a) illustrates the overall workflow of our method. First, we apply the approach described in [28] to extract sparse phonetic posteriorgrams (SPPGs) and pitch sequences from the source speech, which serve as the semantic features. Additionally, a WavLM-based SR model [29] is used to extract the original speaker vector from the source speech. Next, SRVAE generates a non-existent speaker vector based on the original speaker vector and the predefined attribute labels. To capture more commonality and variation across different speakers, we use the encoder of SRVAE to decompose the synthetic speaker vector into age, gender, and identity embeddings as the conditions for TSVC, rather than using the speaker vector directly. Finally, the TSVC model conditions on these embeddings and the predefined attribute labels to transform the semantic features into speech with modified attributes.
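To make the data flow concrete, the sketch below strings these components together for a single utterance. All callables here (extract_sppg_and_pitch, sr_model, srvae, tsvc, vocoder) are hypothetical placeholders for the components described above, and the srvae.encode/decode interface is an assumption; this is an illustration of the workflow, not the authors' API.

```python
# Hypothetical inference pipeline mirroring Fig. 1 (a); every callable is a placeholder.
def control_attributes(source_wav, target_age_band: int, target_gender: int,
                       extract_sppg_and_pitch, sr_model, srvae, tsvc, vocoder):
    sppg, pitch = extract_sppg_and_pitch(source_wav)      # semantic features, as in [28]
    spk_vec = sr_model(source_wav)                        # original speaker vector, as in [29]
    # Generate a non-existent speaker vector carrying the predefined labels,
    # then re-encode it into age, gender, and identity embeddings for TSVC.
    _, _, z_id_src = srvae.encode(spk_vec)
    fake_vec = srvae.decode(target_age_band, target_gender, z_id_src)
    z_age, z_gender, z_id = srvae.encode(fake_vec)
    acoustic = tsvc(sppg, pitch, target_age_band, target_gender, z_age, z_gender, z_id)
    return vocoder(acoustic)                              # waveform synthesis (HiFi-GAN)
```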

II-B Speaker Representation Variational Autoencoder

Fig. 1 (b) illustrates the detailed architecture of SRVAE, a variational autoencoder trained with a GAN objective. Specifically, we design an encoder consisting of three independent branches, which transforms the speaker vector into three latent embeddings: $\mathbf{z}_{age}$, $\mathbf{z}_{gender}$, and $\mathbf{z}_{identity}$. Each branch is composed of multiple residual blocks, each containing two linear layers with layer normalization and ReLU activation. After the age and gender branches, we employ a linear layer to output the probabilities of the attribute labels for classification training. For the identity branch, we introduce a contrastive loss that maximizes the cosine similarity between identity embeddings of the same speaker and minimizes it between identity embeddings of different speakers. On the decoder side, inspired by [30], we use the attribute labels $\mathbf{l}_{age}$ and $\mathbf{l}_{gender}$ as input instead of the attribute embeddings to improve training stability. After the decoder module, we introduce a discriminator for generative adversarial training. Both the decoder and the discriminator share the same backbone structure as the encoder branches.
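As a concrete illustration, a minimal PyTorch sketch of such an encoder is given below. The speaker-vector dimension, hidden size, block count, and classifier-head shapes are assumptions chosen to match the description (512-dimensional residual MLP branches, seven age bands), not the authors' exact implementation.

```python
import torch
import torch.nn as nn


class ResidualMLPBlock(nn.Module):
    """Two linear layers with layer normalization and ReLU, plus a residual path."""

    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim), nn.LayerNorm(dim), nn.ReLU(),
            nn.Linear(dim, dim), nn.LayerNorm(dim), nn.ReLU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.net(x)


class SRVAEEncoder(nn.Module):
    """Three independent branches mapping a speaker vector to age, gender, and identity embeddings."""

    def __init__(self, spk_dim: int = 512, hidden: int = 512,
                 n_blocks: int = 3, n_age_bands: int = 7):
        super().__init__()

        def branch():
            return nn.Sequential(nn.Linear(spk_dim, hidden),
                                 *[ResidualMLPBlock(hidden) for _ in range(n_blocks)])

        self.age_branch, self.gender_branch, self.id_branch = branch(), branch(), branch()
        self.age_head = nn.Linear(hidden, n_age_bands)   # age-band classifier for L_ce
        self.gender_head = nn.Linear(hidden, 2)          # gender classifier for L_ce

    def forward(self, spk_vec: torch.Tensor):
        z_age = self.age_branch(spk_vec)
        z_gender = self.gender_branch(spk_vec)
        z_identity = self.id_branch(spk_vec)             # trained with the contrastive loss
        return z_age, z_gender, z_identity, self.age_head(z_age), self.gender_head(z_gender)
```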

Unlike previous works [15, 16, 17], we do not introduce operations such as gradient reversal or mutual information minimization to enforce independence between the attribute and identity embeddings. Instead, we design a cyclic consistency training strategy that enables attribute sharing between speakers, as shown in Alg. 1. In the consistency training step, the encoder is frozen to guide the non-existent speaker vectors generated by the decoder toward the domain of real speaker vectors.

Algorithm 1: Cyclic Consistency Training

Require: Encoder $E$ with parameters $\theta$, decoder $D$ with parameters $\phi$, optimizers $g_{\theta}$ and $g_{\phi}$ for $\theta$ and $\phi$ respectively, input speaker vector $\mathbf{x}$, output reconstructed speaker vector $\mathbf{x}_{rec}$.

1:  for each training iteration do
2:      $\mathbf{z}^{o}_{age}, \mathbf{z}^{o}_{gender}, \mathbf{z}_{identity} = E(\mathbf{x})$
3:      $\mathbf{x}^{o}_{rec} = D(\mathbf{l}^{o}_{age}, \mathbf{l}^{o}_{gender}, \mathbf{z}_{identity})$
4:      if train discriminator then
5:          Train one step for the discriminator.
6:      Calculate the classifier loss $\mathcal{L}_{ce}$, contrastive loss $\mathcal{L}_{cl}$, reconstruction loss $\mathcal{L}_{mse}$, and adversarial loss $\mathcal{L}_{adv}$:
            $\mathcal{L}_{(\theta,\phi)} = \mathcal{L}_{ce} + \mathcal{L}_{cl} + \mathcal{L}_{mse} + \mathcal{L}_{adv}$
7:      Update parameters: $(\theta,\phi) \leftarrow g_{(\theta,\phi)}\big(\nabla_{(\theta,\phi)}\mathcal{L}_{(\theta,\phi)}\big)$
8:      if train consistency then
9:          Randomly sample fake attribute labels $\mathbf{l}^{f}_{age}$ and $\mathbf{l}^{f}_{gender}$.
10:         Freeze encoder $E$.
11:         $\mathbf{x}^{\text{non-existent}}_{rec} = D(\mathbf{l}^{f}_{age}, \mathbf{l}^{f}_{gender}, \mathbf{z}_{identity})$
12:         Calculate the classifier loss $\mathcal{L}_{ce}$, contrastive loss $\mathcal{L}_{cl}$, and adversarial loss $\mathcal{L}_{adv}$:
                $\mathcal{L}_{\phi} = \mathcal{L}_{ce} + \mathcal{L}_{cl} + \mathcal{L}_{adv}$
13:         Update parameters: $\phi \leftarrow g_{\phi}\big(\nabla_{\phi}\mathcal{L}_{\phi}\big)$
14: end for
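For illustration, the following PyTorch sketch implements one iteration of this strategy under several assumptions: the encoder and decoder follow the interfaces sketched above, the contrastive loss is passed in as a helper, all loss terms are weighted equally, the adversarial term is a simple score-based surrogate, and the discriminator update (steps 4-5) is omitted. The original training procedure may differ in these details.

```python
import torch
import torch.nn.functional as F


def cyclic_consistency_step(encoder, decoder, discriminator, opt_enc_dec, opt_dec,
                            x, l_age, l_gender, spk_ids, contrastive_loss,
                            train_consistency: bool, n_age_bands: int = 7):
    """x: (B, D) real speaker vectors; l_age, l_gender, spk_ids: int64 label tensors."""
    # --- reconstruction step: update encoder and decoder jointly (steps 2-7) ---
    z_age, z_gender, z_id, age_logits, gender_logits = encoder(x)
    x_rec = decoder(l_age, l_gender, z_id)                 # decode from labels + identity
    loss = (F.cross_entropy(age_logits, l_age)             # classifier loss L_ce
            + F.cross_entropy(gender_logits, l_gender)
            + contrastive_loss(z_id, spk_ids)              # contrastive loss L_cl
            + F.mse_loss(x_rec, x)                         # reconstruction loss L_mse
            - discriminator(x_rec).mean())                 # adversarial loss L_adv (sketch)
    opt_enc_dec.zero_grad()
    loss.backward()
    opt_enc_dec.step()

    # --- consistency step: freeze the encoder, update only the decoder (steps 8-13) ---
    if train_consistency:
        z_id_frozen = z_id.detach()                        # reuse identity embedding without encoder grads
        l_age_f = torch.randint_like(l_age, n_age_bands)   # random fake attribute labels
        l_gender_f = torch.randint_like(l_gender, 2)
        x_fake = decoder(l_age_f, l_gender_f, z_id_frozen)  # non-existent speaker vector
        # Gradients flow through the encoder back to the decoder, but opt_dec only
        # holds decoder parameters, so the encoder itself stays frozen.
        _, _, z_id_f, age_logits_f, gender_logits_f = encoder(x_fake)
        loss_c = (F.cross_entropy(age_logits_f, l_age_f)
                  + F.cross_entropy(gender_logits_f, l_gender_f)
                  + contrastive_loss(z_id_f, spk_ids)
                  - discriminator(x_fake).mean())
        opt_dec.zero_grad()
        loss_c.backward()
        opt_dec.step()
```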

II-C Two-stage Voice Conversion

We treat speech generation as a two-stage process: 1) generating average speech based on predefined ages and genders, and 2) refining the average speech into speech from a specific speaker based on the given attribute and identity embeddings. As illustrated in Fig. 1 (c), in the first stage, we design an average generator that uses the attribute labels, SPPGs, and pitch sequences to produce average acoustic features. The network of this generator consists of a Bi-LSTM layer and multiple convolutional residual blocks with AdaIN [31]. Notably, we do not use the attribute embeddings in the average generator. Since no independence-guided operations are applied in SRVAE, the attribute embeddings are speaker-dependent. We believe these embeddings can help the second-stage module share the expression of attributes across the speech of similar speakers, but they are not suitable for generating speaker-independent average acoustic features.
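As a concrete example of how the attribute labels can modulate the first stage, a minimal sketch of one convolutional residual block with AdaIN conditioning is shown below. The kernel size, the instance-norm statistics, and the label-embedding projection are illustrative assumptions; only the AdaIN mechanism follows [31].

```python
import torch
import torch.nn as nn


class AdaINResBlock(nn.Module):
    """1-D convolutional residual block whose normalization is modulated by a condition vector."""

    def __init__(self, channels: int, cond_dim: int, kernel: int = 3):
        super().__init__()
        self.conv1 = nn.Conv1d(channels, channels, kernel, padding=kernel // 2)
        self.conv2 = nn.Conv1d(channels, channels, kernel, padding=kernel // 2)
        self.norm = nn.InstanceNorm1d(channels, affine=False)
        self.affine = nn.Linear(cond_dim, 2 * channels)    # predicts per-channel scale and bias

    def adain(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        scale, bias = self.affine(cond).chunk(2, dim=-1)   # each (B, C)
        return self.norm(x) * (1 + scale.unsqueeze(-1)) + bias.unsqueeze(-1)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        """x: (B, C, T) acoustic features; cond: (B, cond_dim) attribute-label embedding."""
        h = torch.relu(self.adain(self.conv1(x), cond))
        h = self.adain(self.conv2(h), cond)
        return x + h
```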

In the second stage, we design a detail generator, an ordinary differential equation (ODE) model trained with flow matching [32], to model the mapping between the average acoustic features $X_{0}$ and the speaker-specific acoustic features $X_{T}$. The backbone of this generator follows the transformer-based network from SF-Speech [18]. During training, random Gaussian noise $\xi$ is added to $X_{0}$ to obtain a continuous middle-state distribution, expressed as $X_{t} = (1-t)(X_{0} + \xi) + tX_{T}$. The attribute embeddings, identity embeddings, and differential step $t$ are then repeated along the time axis and added to $X_{t}$, yielding $X^{\prime}_{t}$. Finally, to improve pronunciation accuracy, we concatenate the SPPGs, pitch, and $X^{\prime}_{t}$ along the hidden dimension as the input to the transformer-based network, which predicts the ODE direction $dX_{t}$.
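A minimal sketch of this training objective is given below. It assumes that the attribute and identity embeddings share the acoustic-feature dimensionality and that the scalar step $t$ is added directly rather than through a learned embedding; both are simplifications of the description above, and `net` stands in for the transformer backbone.

```python
import torch
import torch.nn.functional as F


def detail_generator_loss(net, x0_avg, xT_target, sppg, pitch, z_age, z_gender, z_id):
    """x0_avg, xT_target: (B, T, D) average / target acoustic features;
    sppg, pitch: (B, T, *) frame-level conditions; z_age, z_gender, z_id: (B, D)."""
    b, frames, dim = x0_avg.shape
    t = torch.rand(b, 1, 1, device=x0_avg.device)          # differential step t ~ U(0, 1)
    noise = torch.randn_like(x0_avg)                       # Gaussian noise xi added to X_0
    x0_noisy = x0_avg + noise
    x_t = (1.0 - t) * x0_noisy + t * xT_target             # X_t = (1 - t)(X_0 + xi) + t X_T
    # Repeat utterance-level conditions along the time axis and add them to X_t (-> X'_t);
    # the raw scalar t is added here for brevity, whereas real systems usually embed it.
    cond = (z_age + z_gender + z_id).unsqueeze(1).expand(-1, frames, -1)
    x_t_cond = x_t + cond + t.expand(-1, frames, dim)
    inp = torch.cat([sppg, pitch, x_t_cond], dim=-1)       # concatenate along the hidden dimension
    pred = net(inp)                                        # predicted ODE direction dX_t
    target = xT_target - x0_noisy                          # exact direction for this linear path
    return F.mse_loss(pred, target)
```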

III Experiment

We consider four questions in our experiments: 1) Can SRVAE correctly extract attribute and identity embeddings, and generate corresponding speaker vectors given attribute labels? 2) Can the speech generated by TSVC be consistent with the predefined attribute labels while maintaining naturalness? 3) Can the modified speech still be recognized as the original speaker's voice? 4) Are SRVAE and TSVC effective for this task? Audio samples are available on our demo page: https://lixuyuan102.github.io/control-your-attributes-in-speech/.

III-A Dataset

We conduct our experiments on a subset of VoxCeleb2 [33] with age labels from [34]. This subset contains about 5,000 speakers from 168K videos. We divide age into seven bands: under 12, 12-25, 25-40, 40-55, 55-65, 65-75, and over 75.
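For reference, a small helper mapping an age in years to these band indices might look as follows; the treatment of ages falling exactly on a boundary is an assumption, since the text specifies only the interval edges.

```python
def age_band(age: float) -> int:
    """Map an age in years to one of the seven band indices (0-6) used above."""
    upper_bounds = [12, 25, 40, 55, 65, 75]
    for band, bound in enumerate(upper_bounds):
        if age < bound:
            return band
    return len(upper_bounds)          # 75 or older -> band 6
```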

III-B Implementation Details

We set up three models for comparison: one baseline method and two ablation methods. Method 1 (Baseline in Tab. II) is an ODE model with the attribute labels and the original speaker vector as conditions. Method 2 (-TSVC) removes the average generator from the proposed TSVC model and employs an ODE model that directly maps Gaussian noise to the target acoustic features. Method 3 (-SRVAE) eliminates the SRVAE, instead using the attribute labels and the original speaker vector as conditions to train the TSVC.

We employ 3 residual blocks with 512 hidden dimensions to form the encoder and decoder of SRVAE, and 2 for its discriminator. For TSVC, we employ 3 convolutional residual blocks with 512 hidden dimensions for the average generator and 8 transformer layers with 1024 hidden dimensions for the detail generator. A HiFi-GAN [35] vocoder was trained to convert the acoustic features into speech waveforms. During training, the pitch sequences are normalized sentence by sentence to remove speaker-specific information.

All models were trained on 2 Nvidia V100 GPUs. We employed an AdamW optimizer with a learning rate starting at 0.0001 and decaying linearly to train SRVAE for 800K steps, and another AdamW optimizer with a peak learning rate of 0.0001, warmed up linearly for 5,000 steps and decayed with cosine annealing over the remaining steps, to train TSVC for 300K steps.
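A sketch of the TSVC optimizer and schedule described above (AdamW, peak learning rate 1e-4, 5,000 linear warmup steps, cosine annealing over the remaining steps of 300K) is shown below. Using LambdaLR is an assumed implementation choice, and weight decay and betas are left at PyTorch defaults since they are not reported.

```python
import math
import torch


def make_tsvc_optimizer(model, peak_lr: float = 1e-4,
                        warmup: int = 5_000, total_steps: int = 300_000):
    opt = torch.optim.AdamW(model.parameters(), lr=peak_lr)

    def lr_lambda(step: int) -> float:
        if step < warmup:
            return step / max(1, warmup)                       # linear warmup to the peak LR
        progress = (step - warmup) / max(1, total_steps - warmup)
        return 0.5 * (1.0 + math.cos(math.pi * progress))      # cosine annealing to zero

    sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)  # call sched.step() every training step
    return opt, sched
```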

TABLE I: Evaluation Results of Attribute and Identity Embedding on Real and Generated Speaker Vectors.
Speaker Vector ACC-age ↑ ACC-gender ↑ CSIS ↑ CSID ↓
Real 96.3% 99.6% 0.861 0.208
Generated 98.0% 99.9% 0.827 0.194

IV Results and Analysis

In this section, we evaluate the effectiveness of the proposed method at both the speaker vector and speech levels.

IV-A Evaluations of Generated Speaker Vector

We assessed the extracted attribute embeddings by measuring their classification accuracy with the classifiers in the SRVAE encoder. For the extracted identity embeddings, we compared the cosine similarity between embeddings from the same speaker (CSIS) with that between embeddings from different speakers (CSID). The classification accuracy was computed for 2,000 speaker vectors, while the cosine similarities were calculated over 2,000 pairs of speaker vectors.
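For clarity, the sketch below shows one way to compute CSIS and CSID from a batch of identity embeddings; the pair-selection strategy (all off-diagonal pairs here, rather than 2,000 sampled pairs) is an assumption for illustration.

```python
import torch
import torch.nn.functional as F


def csis_csid(emb: torch.Tensor, speaker_ids: torch.Tensor):
    """emb: (N, D) identity embeddings; speaker_ids: (N,) integer speaker labels."""
    sim = F.cosine_similarity(emb.unsqueeze(1), emb.unsqueeze(0), dim=-1)   # (N, N) similarity matrix
    same = speaker_ids.unsqueeze(1) == speaker_ids.unsqueeze(0)             # same-speaker mask
    off_diag = ~torch.eye(len(emb), dtype=torch.bool, device=emb.device)    # drop self-pairs
    csis = sim[same & off_diag].mean().item()     # mean similarity within speakers
    csid = sim[~same].mean().item()               # mean similarity across speakers
    return csis, csid
```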

As shown in Tab. I, both the attribute embeddings extracted from real speaker vectors and those extracted from generated non-existent speaker vectors demonstrate high classification accuracy. Notably, since the SRVAE encoder is trained exclusively on real speaker vectors, these similar accuracies suggest that the non-existent speaker vectors generated by SRVAE reside in the same representation space as real speaker vectors. Furthermore, the CSIS and CSID values for the real speaker vectors show a gap of 0.653, demonstrating that the identity embeddings extracted by SRVAE can accurately represent speaker identity. Meanwhile, the gap for the generated speaker vectors is 0.633, a slight decrease, yet still sufficient to indicate that the proposed SRVAE is capable of extracting accurate identity embeddings from the modified non-existent speaker vectors.

TABLE II: Evaluation Results of Speech Gender Consistency, Identity Consistency, Intelligibility, and Quality
Model SACC-gender ↑ ICMOS ↑ QCMOS ↑ SIMS SIMD SIMS-age SIMS-gender
GT 88.5% +0.29 +0.12 0.934 0.596 n/a n/a
Baseline 34.3% -0.06 -0.11 0.913 0.598 0.900 0.940
Proposed 84.2% 0 0 0.766 0.605 0.813 0.707
-TSVC 45.0% -0.27 -0.48 0.807 0.609 0.826 0.796
-SRVAE 57.5% -0.12 -0.21 0.835 0.613 0.851 0.800

IV-B Evaluations of Generated Speech

At the speech level, we conducted several subjective tests to assess the generated speech in terms of speech quality, attribute consistency, and identity consistency. For speech quality, we used Quality Comparative Mean Opinion Scores (QCMOS) to assess sound quality and naturalness, and Intelligibility Comparative Mean Opinion Scores (ICMOS) to evaluate pronunciation, as the VoxCeleb2 dataset does not include text transcriptions. For gender consistency, we randomly presented generated and real samples and asked participants to predict the speaker's gender. For age evaluation, we first asked participants to predict the speaker age of real speech without any interval division. Based on these predictions, the age intervals most reliably perceived by humans were determined to be 0-12, 12-25, 25-55, and >55. Finally, given the original speech and its age interval as a reference, participants were asked to predict the age interval of the modified speech. For identity consistency, participants were presented with a modified or real speech sample from speaker A as a reference and asked to select the speech with the same speaker identity from a test pair consisting of speech from speaker B and another speech sample from speaker A. All subjective experiments involved at least 15 participants, each assigned 100 test questions. Additionally, we used the WavLM-based SR model to calculate speaker similarity between speech from the same speaker (SIMS) and from different speakers (SIMD), as in the speaker vector evaluation.

Results on speech quality: As shown in Tab. II, the proposed model performs best among the compared models in terms of intelligibility and quality of the generated speech, but there is still a 0.29 ICMOS gap compared to real audio. We attribute this to the SPPG model's lack of robustness on in-the-wild speech. The baseline model ranks second in this evaluation; we observed that it overfitted the training data, so that more than half of the modified speech remained very similar to the source speech. Finally, the comparison with the ablated models demonstrates the effectiveness of our proposed approaches in modeling the natural expression of speaker attributes in speech.

Results on attribute consistency: Our method achieves a gender classification accuracy second only to the ground truth, 49.9% higher than the baseline model, as shown in Tab. II. In the age evaluation, our method obtains a confusion matrix most similar to that of the ground truth, while the confusion matrix of the baseline exhibits a more diffuse distribution, as illustrated in Fig. 2. Additionally, after ablating SRVAE and TSVC, the gender prediction accuracy decreases by 26.7% and 39.2%, respectively. Similarly, the consistency between the subjectively predicted age intervals and the predefined labels also decreases. These results further demonstrate the effectiveness of SRVAE and TSVC for attribute control.


Figure 2: Confusion matrix for subjective prediction of speaker age.

Results on identity consistency: Tab. II shows the objective results obtained with the SR model. All methods achieve similar SIMD scores, while their SIMS scores exhibit a decreasing trend as attribute consistency increases. This trend is more pronounced when testing on speech with modified gender. This phenomenon can be explained by the fact that it is challenging for humans to recognize a speech pair with a large gap in fundamental frequency as belonging to the same identity. Although the SIMS scores decrease, they still exhibit a significant gap compared to SIMD, indicating that the modified speech can still be recognized by the SR model as originating from the source speaker. Fig. 3 shows the subjective results. The probability of participants selecting the correct audio is positively correlated with the gap between SIMS and SIMD in Tab. II, except for "-TSVC". We believe this is because the poor quality of its generated speech affects participants' perception.


Figure 3: AB test results of speaker identity.

V Conclusion

In this letter, we introduce a two-step method for controlling speaker attributes in speech. Experimental results show that SRVAE can generate non-existent speaker vectors with predefined attributes, benefiting from cyclic consistency training. By combining SRVAE with TSVC, we achieve speaker age and gender control at the speech level while preserving as much of the original speaker's identity and speech quality as possible. This capability suggests potential applications in areas such as audiobooks, video dubbing, and privacy protection. The approach can also be applied to the text-to-speech task simply by replacing the input semantic features with text. Future work could explore whether the diverse speech generated by our method can improve tasks such as speaker recognition and self-supervised speech representation learning.

References

  • [1] Ziming Yang, Jian Liang, Chaoyou Fu, Mandi Luo, and Xiao-Yu Zhang, “Heterogeneous face recognition via face synthesis with identity-attribute disentanglement,” IEEE Transactions on Information Forensics and Security, vol. 17, pp. 1344–1358, 2022.
  • [2] Feng Liu, Minchul Kim, Anil Jain, and Xiaoming Liu, “Controllable and guided face synthesis for unconstrained face recognition,” in European Conference on Computer Vision. Springer, 2022, pp. 701–719.
  • [3] Sola Shirai and Jacob Whitehill, “Privacy-preserving annotation of face images through attribute-preserving face synthesis,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2019, pp. 0–0.
  • [4] Krishnakant Singh, Thanush Navaratnam, Jannik Holmer, Simone Schaub-Meyer, and Stefan Roth, “Is synthetic data all we need? benchmarking the robustness of models trained with synthetic images,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 2505–2515.
  • [5] Zhenliang He, Wangmeng Zuo, Meina Kan, Shiguang Shan, and Xilin Chen, “Attgan: Facial attribute editing by only changing what you want,” IEEE transactions on image processing, vol. 28, no. 11, pp. 5464–5478, 2019.
  • [6] Xianxu Hou, Xiaokang Zhang, Hanbang Liang, Linlin Shen, Zhihui Lai, and Jun Wan, “Guidedstyle: Attribute knowledge guided style manipulation for semantic face editing,” Neural Networks, vol. 145, pp. 209–220, 2022.
  • [7] Xin Ning, Feng He, Xiaoli Dong, Weijun Li, Fayadh Alenezi, and Prayag Tiwari, “Icgnet: An intensity-controllable generation network based on covering learning for face attribute synthesis,” Information Sciences, vol. 660, pp. 120130, 2024.
  • [8] Linda H Leeper and Richard Culatta, “Speech fluency: effect of age, gender and context,” Folia phoniatrica et logopaedica, vol. 47, no. 1, pp. 1–14, 1995.
  • [9] Linda Mortensen, Antje S Meyer, and Glyn W Humphreys, “Age-related effects on speech production: A review,” Language and Cognitive Processes, vol. 21, no. 1-3, pp. 238–290, 2006.
  • [10] Bum Ju Lee, Boncho Ku, Jun-Su Jang, and Jong Yeol Kim, “A novel method for classifying body mass index on the basis of speech signals for future clinical applications: a pilot study,” Evidence-Based Complementary and Alternative Medicine, vol. 2013, no. 1, pp. 150265, 2013.
  • [11] Albert Rilliard, David Doukhan, Rémi Uro, and Simon Devauchelle, “Evolution of voices in french audiovisual media across genders and age in a diachronic perspective,” arXiv preprint arXiv:2404.16104, 2024.
  • [12] Zhong-Qiu Wang and Ivan Tashev, “Learning utterance-level representations for speech emotion and age/gender recognition using deep neural networks,” in 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, 2017, pp. 5150–5154.
  • [13] Anvarjon Tursunov, Mustaqeem, Joon Yeon Choeh, and Soonil Kwon, “Age and gender recognition using a convolutional neural network with a specially designed multi-attention module through speech spectrograms,” Sensors, vol. 21, no. 17, pp. 5892, 2021.
  • [14] Felix Burkhardt, Johannes Wagner, Hagen Wierstorf, Florian Eyben, and Björn Schuller, “Speech-based age and gender prediction with transformers,” in Speech Communication; 15th ITG Conference. VDE, 2023, pp. 46–50.
  • [15] Francisco Teixeira, Alberto Abad, Bhiksha Raj, and Isabel Trancoso, “Privacy-oriented manipulation of speaker representations,” IEEE Access, 2024.
  • [16] Chau Luu, Steve Renals, and Peter Bell, “Investigating the contribution of speaker attributes to speaker separability using disentangled speaker representations,” in Interspeech 2022. ISCA, 2022, pp. 610–614.
  • [17] Parvaneh Janbakhshi and Ina Kodrasi, “Adversarial-free speaker identity-invariant representation learning for automatic dysarthric speech classification.,” in INTERSPEECH, 2022, pp. 2138–2142.
  • [18] Xuyuan Li, Zengqiang Shang, Hua Hua, Peiyang Shi, Chen Yang, Li Wang, and Pengyuan Zhang, “Sf-speech: Straightened flow for zero-shot voice clone on small-scale dataset,” arXiv preprint arXiv:2410.12399, 2024.
  • [19] Zhihao Du, Qian Chen, Shiliang Zhang, Kai Hu, Heng Lu, Yexin Yang, Hangrui Hu, Siqi Zheng, Yue Gu, Ziyang Ma, et al., “Cosyvoice: A scalable multilingual zero-shot text-to-speech synthesizer based on supervised semantic tokens,” arXiv preprint arXiv:2407.05407, 2024.
  • [20] Philip Anastassiou, Jiawei Chen, Jitong Chen, Yuanzhe Chen, Zhuo Chen, Ziyi Chen, Jian Cong, Lelai Deng, Chuang Ding, Lu Gao, et al., “Seed-tts: A family of high-quality versatile speech generation models,” arXiv preprint arXiv:2406.02430, 2024.
  • [21] Xiaoxiao Miao, Ruijie Tao, Chang Zeng, and Xin Wang, “A benchmark for multi-speaker anonymization,” arXiv preprint arXiv:2407.05608, 2024.
  • [22] Michele Panariello, Francesco Nespoli, Massimiliano Todisco, and Nicholas Evans, “Speaker anonymization using neural audio codec language models,” in ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 4725–4729.
  • [23] Jixun Yao, Qing Wang, Pengcheng Guo, Ziqian Ning, and Lei Xie, “Distinctive and natural speaker anonymization via singular value transformation-assisted matrix,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 2944–2956, 2024.
  • [24] Jae Hyun Park, Joon-Gyu Maeng, TaeJun Bak, and Young-Sun Joo, “Synthe-sees: Face based text-to-speech for virtual speaker,” in ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 10321–10325.
  • [25] Minyoung Lee, Eunil Park, and Sungeun Hong, “Fvtts: Face based voice synthesis for text-to-speech,” in Proc. Interspeech 2024, 2024, pp. 4953–4957.
  • [26] Jiarui Hai, Karan Thakkar, Helin Wang, Zengyi Qin, and Mounya Elhilali, “Dreamvoice: Text-guided voice conversion,” arXiv preprint arXiv:2406.16314, 2024.
  • [27] Zhengyang Chen, Xuechen Liu, Erica Cooper, Junichi Yamagishi, and Yanmin Qian, “Generating speakers by prompting listener impressions for pre-trained multi-speaker text-to-speech systems,” arXiv preprint arXiv:2406.08812, 2024.
  • [28] Max Morrison, Cameron Churchwell, Nathan Pruyne, and Bryan Pardo, “Fine-grained and interpretable neural speech editing,” Interspeech 2024, 2024.
  • [29] Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, et al., “Wavlm: Large-scale self-supervised pre-training for full stack speech processing,” IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1505–1518, 2022.
  • [30] Ziqi Pan, Li Niu, Jianfu Zhang, and Liqing Zhang, “Disentangled information bottleneck,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2021, vol. 35, pp. 9285–9293.
  • [31] Xun Huang and Serge Belongie, “Arbitrary style transfer in real-time with adaptive instance normalization,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 1501–1510.
  • [32] Xingchao Liu, Chengyue Gong, and Qiang Liu, “Flow straight and fast: Learning to generate and transfer data with rectified flow,” arXiv preprint arXiv:2209.03003, 2022.
  • [33] Joon Son Chung, Arsha Nagrani, and Andrew Zisserman, “Voxceleb2: Deep speaker recognition,” Interspeech 2018, 2018.
  • [34] Naohiro Tawara, Atsunori Ogawa, Yuki Kitagishi, and Hosana Kamiyama, “Age-vox-celeb: Multi-modal corpus for facial and speech estimation,” in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 6963–6967.
  • [35] Jungil Kong, Jaehyeon Kim, and Jaekyoung Bae, “Hifi-gan: Generative adversarial networks for efficient and high fidelity speech synthesis,” Advances in neural information processing systems, vol. 33, pp. 17022–17033, 2020.