
Glow-WaveGAN 2: High-quality Zero-shot Text-to-speech Synthesis and Any-to-any Voice Conversion

Abstract

The zero-shot scenario for speech generation aims at synthesizing a novel unseen voice with only one utterance of the target speaker. Although the challenges of adapting new voices in the zero-shot scenario exist in both stages – acoustic modeling and vocoding – previous works usually consider the problem from only one stage. In this paper, we extend our previous Glow-WaveGAN to Glow-WaveGAN 2, aiming to solve the problem at both stages for high-quality zero-shot text-to-speech and any-to-any voice conversion. We first build a universal WaveGAN model for extracting the latent distribution p(z) of speech and reconstructing the waveform from it. Then a flow-based acoustic model only needs to learn the same p(z) from texts, which naturally avoids the mismatch between the acoustic model and the vocoder, resulting in high-quality generated speech without model fine-tuning. Based on a continuous speaker space and the reversible property of flows, the conditional distribution can be obtained for any speaker, and thus we can further conduct high-quality zero-shot speech generation for new speakers. We particularly investigate two methods to construct the speaker space, namely a pre-trained speaker encoder and a jointly-trained speaker encoder. The superiority of Glow-WaveGAN 2 is demonstrated through TTS and VC experiments conducted on the LibriTTS and VCTK corpora.

Index Terms: Zero-shot, speech synthesis, voice conversion, variational auto-encoder, flow model

1 Introduction

Figure 1: (a) Universal WaveGAN. (b) Multi-speaker acoustic model with optional speaker encoders. (c) Zero-shot TTS inference protocol from the waveform ω_tgt of the target speaker. (d) VC part, which converts source speech ω_src with content p(c) from speaker s_src to target speech ω_tgt of speaker s_tgt.

Recently, text-to-speech (TTS) and voice conversion (VC) have achieved significant improvements with the rapid developments of sequence-to-sequence (seq2seq) based acoustic models [1, 2, 3, 4] and high-quality neural vocoders [5, 6, 7, 8]. With a large amount of studio-recorded high-quality data, it is easy to extend these models to multi-speaker scenarios [9, 10]. But since sizable training data of a new speaker is usually unavailable for customization, speaker adaptation in speech synthesis has attracted rising attention [11, 12, 13, 14]. Considering the amount of data available for the target speaker, adaptive speech synthesis can be divided into few-shot and zero-shot methods. Few-shot adaptation aims at generating new voices with a few samples of the target speaker, where fine-tuning the whole or a part of a pre-trained multi-speaker model is usually adopted [15, 16, 17]. However, fine-tuning is time-consuming and leads to individual speaker-dependent models, which is unfriendly to speaker customization. Moreover, avoiding over-fitting is vital for adaptation with limited data. By contrast, zero-shot methods aim to generate a new voice with only one utterance of the specific speaker, generally without model fine-tuning. In this work, we focus on zero-shot speech generation, including text-to-speech (TTS) and voice conversion (VC).

For both zero-shot TTS and VC, two critical problems affect the performance of speech generation: 1) whether the acoustic model can stably produce natural speech with the target speaker’s identity given only one utterance, and 2) whether the vocoder can reconstruct the waveform of unseen speakers from predicted acoustic features. For the first problem, an effective way in zero-shot scenarios is to train a speaker-discriminative encoder to model the speaker space, so that arbitrary voices can be generated under the constraint of a speaker representation extracted from only one utterance of a novel speaker [18, 19]. In terms of the vocoder problem, recent high-quality neural vocoders are usually speaker-dependent, so the quality of generated speech for unseen speakers unavoidably deteriorates. To address this, a WaveRNN-based universal vocoder [20] and a MelGAN-based universal vocoder [21] were proposed, but a quality gap still exists between seen and unseen speakers.

Recently, Glow-TTS [22] proposed to model the distribution of the mel spectrogram with a flow to improve the quality of synthesized speech, and a pre-trained speaker encoder was further injected in SC-GlowTTS [10] for zero-shot TTS. But fine-tuning the vocoder is still critical to the quality and similarity for unseen speakers. VITS [23] and our Glow-WaveGAN [24] were proposed at the same time to avoid this problem by modeling a latent speech representation, but there are many differences between them. VITS reconstructs speech from the posterior distribution p(z|c) of the linear spectrogram through a conditional VAE, and utilizes the reversible characteristics of the flow to transform only the mean of the text prior c to improve the expressiveness of the prior distribution [23]. It therefore restricts the affine coupling layers in the flow to be volume-preserving [25], and it cannot produce speech of unseen speakers because of its discrete speaker space modeling. As for the two-stage Glow-WaveGAN, the goal of the WaveGAN is to learn a latent speech distribution p(z) directly from the waveform through an unconditional VAE and GAN, while the acoustic model learns the same distribution p(z) with a non-volume-preserving flow from texts, which also avoids the mismatch problem [24].

Since p(z) in WaveGAN is unconditional, it can extract rich information from the waveform, such as timbre and content, which suggests its potential for reconstructing the waveform of unseen speakers. To this end, we propose Glow-WaveGAN 2, focusing on high-quality zero-shot text-to-speech synthesis and any-to-any voice conversion, where we build a universal WaveGAN and investigate different speaker encoders in the flow to provide a continuous speaker constraint s on p(z|s) for zero-shot scenarios. Given one utterance of an arbitrary speaker during inference, we can use the encoder of the universal WaveGAN to extract the speaker-related latent distribution p(z), and then generate the speaker’s voice with our conditional acoustic model and the decoder of the WaveGAN. Experimental results on the VCTK and LibriTTS corpora show that the proposed methods can generate high-quality target voices without model fine-tuning in both zero-shot TTS and any-to-any VC.

2 Method

As a substantial extension of our previous work [24], the proposed model mainly focuses on zero-shot high-quality text-to-speech synthesis and voice conversion. As shown in Figure 1, Glow-WaveGAN 2 contains three modules that are elaborated in the following subsections: 1) a robust variational universal WaveGAN, which acts as both feature extractor and vocoder to extract the latent distribution p(z) from speech and reconstruct speech from the sampled z, respectively; 2) a multi-speaker Glow-TTS [22], which generates the latent representation p(z|s) with the speaker identity s as the conditioning constraint; and 3) an additional speaker encoder, which learns the speaker identity s for adaptation.

Table 1: SECS of different models on different corpora, where GT-cross-spk is the SECS of different speakers. In VC scenarios, “s2s” indicates converting seen speakers to seen speakers, “u2s” means converting unseen speakers to seen speakers, “s2u” means converting seen speakers to unseen speakers, and “u2u” means converting unseen speakers to unseen speakers.
Training data: LibriTTS
Models                 TTS                       VC
                       seen     unseen           s2s      u2s      s2u      u2u
GT-same-spk            0.830
GT-cross-spk           0.544
GlowTTS-HiFiGAN        0.772    -                0.748    -        -        -
VITS                   0.807    -                0.766    -        -        -
Glow-WaveGAN           0.819    -                0.769    -        -        -
Glow-WaveGAN2-joint    0.791    0.749            0.762    0.781    0.742    0.760
Glow-WaveGAN2-pre      0.822    0.784            0.771    0.802    0.771    0.807

Training data: VCTK
Models                 TTS                       VC
                       seen     unseen           s2s      u2s      s2u      u2u
GT-same-spk            0.815
GT-cross-spk           0.551
GlowTTS-HiFiGAN        0.783    -                0.790    -        -        -
VITS                   0.798    -                0.812    -        -        -
Glow-WaveGAN           0.805    -                0.811    -        -        -
Glow-WaveGAN2-joint    0.793    0.731            0.805    0.794    0.720    0.717
Glow-WaveGAN2-pre      0.804    0.774            0.807    0.795    0.757    0.762

2.1 Universal WaveGAN

A robust vocoder for unseen speakers is critical for zero-shot speech generation. To this end, we build a variational auto-encoder (VAE) [26] based universal WaveGAN model for speech reconstruction, which already showed a strong ability to reconstruct unseen speakers in our previous work [24].

In the universal WaveGAN, the encoder aims at extracting the distribution of the acoustic representation z ~ p(z|w) from the waveform w, where z also captures speaker characteristics with the help of a pitch predictor. Meanwhile, the decoder reconstructs the speech ŵ ~ p(w|z) from the sampled z. To ensure the quality of the reconstructed speech, we use adversarial training to model the distribution p(w) of the waveform. The training objective of the WaveGAN is the same as in [24].

The encoder of the WaveGAN can be treated as a robust feature extractor, and the extracted speech representation z contains the speaker information. The decoder plays the role of the vocoder to reconstruct a high-quality speech waveform for both seen and unseen speakers. In this way, the proposed method guarantees the speech quality of zero-shot speech reconstruction.
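For concreteness, below is a minimal PyTorch sketch of this VAE-style reconstruction path over raw waveform, with toy layer sizes chosen only for illustration; the actual WaveGAN in [24] additionally includes a pitch predictor and GAN discriminators, so this is not the authors' architecture.

```python
import torch
import torch.nn as nn

class ToyWaveGAN(nn.Module):
    """Toy VAE over raw waveform: encoder -> p(z|w), decoder -> reconstruction (illustrative only)."""

    def __init__(self, latent_dim: int = 64, hop: int = 256):
        super().__init__()
        # Encoder: downsample the waveform to frame-level mean / log-variance of z.
        self.encoder = nn.Sequential(
            nn.Conv1d(1, 128, kernel_size=hop, stride=hop),     # frame-rate features
            nn.ReLU(),
            nn.Conv1d(128, 2 * latent_dim, kernel_size=3, padding=1),
        )
        # Decoder: upsample a sampled z back to waveform samples.
        self.decoder = nn.Sequential(
            nn.Conv1d(latent_dim, 128, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.ConvTranspose1d(128, 1, kernel_size=hop, stride=hop),
            nn.Tanh(),
        )

    def forward(self, wav: torch.Tensor):
        stats = self.encoder(wav)                                # (B, 2*latent_dim, T_frames)
        mu, logvar = stats.chunk(2, dim=1)                       # parameters of p(z|w)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()     # reparameterized sample
        wav_hat = self.decoder(z)                                # reconstruction ~ p(w|z)
        return wav_hat, z, mu, logvar

model = ToyWaveGAN()
wav = torch.randn(1, 1, 256 * 100)                               # 100 frames of dummy audio
wav_hat, z, mu, logvar = model(wav)
print(z.shape, wav_hat.shape)                                    # (1, 64, 100), (1, 1, 25600)
```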

2.2 Multi-Speaker Glow-TTS

With the extracted latent speech representation, the skeleton of our acoustic model follows Glow-TTS [22] to model the conditional distribution p(z|t,s) from the text t, where s is the conditional constraint providing the speaker identity. In detail, the text encoder transforms the text t into a linguistic prior distribution p(c|t). Considering the frame-rate difference between the text and the latent z, monotonic alignment search (MAS) is adopted to align the prior c and the acoustic feature z. With the speaker embedding fed into the flow-based decoder f, the target z from the WaveGAN is transformed into the aligned prior distribution p(c) through the forward pass of f during training. Thus the log-likelihood of the target distribution p(z) can be obtained through the flow-based decoder:

\log P_Z(z|t,s) = \log P_C(c|t) + \log\left|\det\frac{\partial f_{dec}^{-1}(z,s)}{\partial z}\right|   (1)

Note that we specifically sample z from the same p(z) produced by the WaveGAN encoder at each training step, which mitigates the mismatch problem between the acoustic model and the vocoder and achieves high synthesis quality without fine-tuning. During inference, the predicted ẑ is generated from the text prior by the reverse decoding process f^{-1} with the speaker condition. Since the text encoder is speaker-independent, it is easy to conduct voice conversion through the forward f and reverse f^{-1} flow decoding with different speaker constraints [22], as shown in Figure 1 (d).
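For intuition, the following toy sketch evaluates Eq. (1) with a single element-wise affine transform standing in for the flow decoder f; the real model stacks many non-volume-preserving coupling layers, so all names and sizes here are illustrative assumptions rather than the authors' implementation.

```python
import torch

torch.manual_seed(0)
T, D = 4, 8                                   # frames, latent channels (toy sizes)

# Aligned text prior p(c|t): per-frame Gaussian parameters from the text encoder.
mu_c, log_sigma_c = torch.zeros(T, D), torch.zeros(T, D)

# Stand-in flow decoder f: an element-wise affine map whose parameters depend on s.
s = torch.randn(16)                           # speaker embedding (toy)
scale = torch.sigmoid(s[:D]) + 0.5            # strictly positive, speaker-dependent
shift = s[D:D + D]

z = torch.randn(T, D)                         # latent sampled from the WaveGAN encoder

# Inverse pass c = f^{-1}(z, s) and log-determinant of its Jacobian.
c = (z - shift) / scale
log_det = -torch.log(scale).sum() * T         # dc/dz is diagonal with entries 1/scale

# Eq. (1): log p(z|t, s) = log p_C(c|t) + log|det df^{-1}/dz|
log_p_c = torch.distributions.Normal(mu_c, log_sigma_c.exp()).log_prob(c).sum()
log_p_z = log_p_c + log_det
print(float(log_p_z))
```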

2.3 Speaker Representation

To achieve zero-shot TTS and any-to-any VC, a key problem is how to model the speaker characteristics, especially for unseen speakers. In this paper, we investigate two alternative ways, a pre-trained encoder and a jointly-trained encoder, to build the speaker space in the acoustic model and produce a speaker-dependent z for unseen speakers.

A straightforward way to represent speakers is to utilize a pre-trained speaker encoder based on speaker embedding extraction in speaker verification [11, 27, 28, 29], through which we can obtain the speaker representation s from only one utterance of a few seconds to generate speech of unseen speakers. Besides, a jointly-trained encoder based on speaker classification is also adopted in our work to learn s from p(z) within the Glow-TTS model. With the jointly-trained speaker module, we optimize the acoustic model with an extra cross-entropy objective. Specifically, we sample two vectors {z_1, z_2} from the same distribution p(z), where z_1 and z_2 are used to train the acoustic model and the speaker classification module, respectively.
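Below is a minimal PyTorch sketch of how such a jointly-trained speaker branch could look, following the layer description given later in Section 3.1 (two 1-D convolutions with kernel size 5, layer normalization, dropout, a 128-channel projection with mean pooling, and a cross-entropy classifier); the hidden sizes and loss combination are assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class JointSpeakerEncoder(nn.Module):
    """Speaker branch trained jointly with the acoustic model (sketch of Sec. 2.3)."""

    def __init__(self, latent_dim: int = 64, emb_dim: int = 128, n_speakers: int = 1151):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv1d(latent_dim, 256, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(256, 256, kernel_size=5, padding=2),
            nn.ReLU(),
        )
        self.norm = nn.LayerNorm(256)
        self.dropout = nn.Dropout(0.1)
        self.proj = nn.Conv1d(256, emb_dim, kernel_size=1)
        self.classifier = nn.Linear(emb_dim, n_speakers)     # used only during training

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        h = self.convs(z)                                    # (B, 256, T)
        h = self.dropout(self.norm(h.transpose(1, 2)).transpose(1, 2))
        return self.proj(h).mean(dim=-1)                     # mean pooling -> (B, emb_dim)

spk_enc = JointSpeakerEncoder()
z2 = torch.randn(8, 64, 100)                                 # second sample z_2 from p(z)
speaker_ids = torch.randint(0, 1151, (8,))
s = spk_enc(z2)                                              # speaker embedding s
ce_loss = nn.functional.cross_entropy(spk_enc.classifier(s), speaker_ids)
# total_loss = acoustic_model_loss_on_z1 + ce_loss           # z_1 trains the flow; weighting not specified
```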

2.4 Zero-shot TTS and VC

As shown in Figure 1 (c) and (d), given an utterance of any target speaker during inference, the speaker identity s_tgt can be extracted by the speaker encoder and treated as the speaker constraint in our model. At inference time of zero-shot TTS, s_tgt is used as the condition for the acoustic model to generate the sampled z containing the target speaker information. At inference time of zero-shot VC, the source speaker s_src conditions the flow module to generate the speaker-independent linguistic prior distribution p(c), which eliminates the source speaker information. Then p(c) is fed into the reversed flow conditioned on the target speaker s_tgt to generate the target sample z_tgt. In this way, zero-shot TTS and VC can be achieved for generating speech of any speaker.
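The conversion path of Figure 1 (d) can be illustrated with a toy speaker-conditioned invertible map: a forward pass with s_src strips the source identity, and an inverse pass with s_tgt imposes the target identity. The module below is a deliberately simplified stand-in for the real stacked flow decoder and the universal WaveGAN decoder.

```python
import torch
import torch.nn as nn

class ToyConditionalFlow(nn.Module):
    """Element-wise affine flow whose parameters are predicted from a speaker embedding."""

    def __init__(self, latent_dim: int = 64, spk_dim: int = 256):
        super().__init__()
        self.to_scale_shift = nn.Linear(spk_dim, 2 * latent_dim)

    def _params(self, s):
        scale, shift = self.to_scale_shift(s).chunk(2, dim=-1)
        return torch.nn.functional.softplus(scale) + 1e-3, shift   # keep scale > 0

    def forward(self, z, s):        # z -> c : strip speaker information
        scale, shift = self._params(s)
        return (z - shift.unsqueeze(-1)) / scale.unsqueeze(-1)

    def inverse(self, c, s):        # c -> z : impose speaker information
        scale, shift = self._params(s)
        return c * scale.unsqueeze(-1) + shift.unsqueeze(-1)

flow = ToyConditionalFlow()
z_src = torch.randn(1, 64, 120)                    # latent of source speech from the WaveGAN encoder
s_src, s_tgt = torch.randn(1, 256), torch.randn(1, 256)   # speaker embeddings (Sec. 2.3)

c = flow.forward(z_src, s_src)                     # speaker-independent linguistic prior
z_tgt = flow.inverse(c, s_tgt)                     # target-speaker latent
# wav_tgt = wavegan_decoder(z_tgt)                 # final waveform via the universal decoder

# Sanity check: inverting with the *source* embedding recovers the original latent.
assert torch.allclose(flow.inverse(c, s_src), z_src, atol=1e-5)
```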

3 Experiments and results

3.1 Basic Setup

We conduct experiments on two different datasets, VCTK [30] and LibriTTS [31], at 24 kHz. The VCTK corpus contains 109 English speakers, from which we reserve 3 male and 3 female speakers as unseen speakers. As for the LibriTTS corpus, we use 1,151 speakers from the train-clean-360 and train-clean-100 subsets to train the different systems, while the 39 speakers from the test-clean subset are treated as unseen speakers. The jointly-trained speaker encoder consists of two 1-D convolution layers with kernel size 5, followed by layer normalization and dropout. Finally, another convolution layer with 128 channels is adopted to extract the speaker embedding through mean pooling. As for the pre-trained speaker encoder, we utilize the voice encoder in the Resemblyzer tool [27, 29] to extract a 256-dimensional speaker representation.
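As a reference for reproducing this step, extracting the 256-dimensional representation with the Resemblyzer voice encoder looks roughly as follows (the audio path is a placeholder):

```python
from resemblyzer import VoiceEncoder, preprocess_wav

wav = preprocess_wav("reference_utterance.wav")   # placeholder path; resamples to 16 kHz and trims silence
encoder = VoiceEncoder()
embedding = encoder.embed_utterance(wav)          # numpy array of shape (256,), L2-normalized
print(embedding.shape)
```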

To evaluate the capability of the proposed Glow-WaveGAN 2 in zero-shot speech generation (audio samples can be found at https://leiyi420.github.io/glow-wavegan2/), we set up two state-of-the-art models as baselines: (1) GlowTTS-HiFiGAN, which contains a multi-speaker flow-based acoustic model predicting the mel-spectrogram and a GAN-based vocoder reconstructing the waveform from it; and (2) VITS [23], which conducts multi-speaker synthesis in an end-to-end manner. Both of them utilize the speaker ID as the speaker condition of the acoustic model. As for the Glow-WaveGAN family, we build different models for evaluation: (1) the basic Glow-WaveGAN with explicit speaker labels like the above baselines, (2) Glow-WaveGAN2-joint with the jointly-trained speaker encoder, and (3) Glow-WaveGAN2-pre with the pre-trained speaker encoder.

Table 2: MOS scores of different systems on two corpora for TTS and VC with 95% confidence interval.
Training data: LibriTTS
Models                 TTS                          VC
                       seen         unseen          s2s          u2s          s2u          u2u
Ground-truth           4.38 ± 0.08
GlowTTS-HiFiGAN        3.36 ± 0.13  -               3.45 ± 0.12  -            -            -
VITS                   3.77 ± 0.11  -               3.68 ± 0.13  -            -            -
Glow-WaveGAN           3.78 ± 0.10  -               3.63 ± 0.11  -            -            -
Glow-WaveGAN2-joint    3.73 ± 0.12  3.69 ± 0.14     3.78 ± 0.10  3.59 ± 0.14  3.53 ± 0.12  3.65 ± 0.15
Glow-WaveGAN2-pre      3.69 ± 0.09  3.65 ± 0.11     3.73 ± 0.08  3.62 ± 0.09  3.51 ± 0.12  3.68 ± 0.11

Training data: VCTK
Models                 TTS                          VC
                       seen         unseen          s2s          u2s          s2u          u2u
Ground-truth           4.45 ± 0.07
GlowTTS-HiFiGAN        3.44 ± 0.11  -               3.55 ± 0.09  -            -            -
VITS                   3.83 ± 0.08  -               3.77 ± 0.09  -            -            -
Glow-WaveGAN           3.85 ± 0.09  -               3.80 ± 0.08  -            -            -
Glow-WaveGAN2-joint    3.86 ± 0.11  3.87 ± 0.12     3.82 ± 0.08  3.73 ± 0.09  3.64 ± 0.11  3.75 ± 0.12
Glow-WaveGAN2-pre      3.81 ± 0.14  3.78 ± 0.09     3.78 ± 0.08  3.66 ± 0.11  3.58 ± 0.12  3.69 ± 0.14

3.2 Speaker Similarity Evaluation

We first evaluate the speaker similarity of the synthesized speech for TTS and VC across different models in an objective manner, which is critical to the performance of zero-shot speech generation. Following [10, 32], we use the Speaker Encoder Cosine Similarity (SECS) between the speaker embeddings extracted from synthesized and ground-truth audio as the objective measure, where SECS scores range from 0 to 1 and a higher score means higher speaker similarity.
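Concretely, SECS is the cosine similarity between the two speaker embeddings; a minimal numpy sketch is shown below (with non-negative Resemblyzer-style embeddings the value falls between 0 and 1).

```python
import numpy as np

def secs(emb_synthesized: np.ndarray, emb_ground_truth: np.ndarray) -> float:
    """Speaker Encoder Cosine Similarity between two speaker embeddings."""
    num = float(np.dot(emb_synthesized, emb_ground_truth))
    den = float(np.linalg.norm(emb_synthesized) * np.linalg.norm(emb_ground_truth))
    return num / den

# Example with random non-negative embeddings standing in for real ones.
rng = np.random.default_rng(0)
e1, e2 = rng.random(256), rng.random(256)
print(round(secs(e1, e2), 3))
```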

Table 1 shows the SECS results of different models. Note that the two baselines and the Glow-WaveGAN system adopt explicit speaker labels to model speaker identity, so they can only conduct speech synthesis and voice conversion for seen speakers. For both TTS and VC on seen speakers, we find that the VITS and Glow-WaveGAN models achieve higher similarity than the GlowTTS-HiFiGAN model, especially on the LibriTTS dataset, and the Glow-WaveGAN family is slightly better than VITS for seen speakers in general.

For zero-shot TTS with the proposed models, the results show a gap in speaker similarity between seen and unseen speakers for the Glow-WaveGAN2-joint model. We believe this is mainly because the jointly-trained speaker module only learns a weak speaker space from the limited training speakers, since the gap is smaller on LibriTTS, which has more speakers, than on VCTK. Based on this assumption, we also evaluate zero-shot TTS with Glow-WaveGAN2-pre, which uses the pre-trained speaker encoder, where the gap between seen and unseen speakers is clearly reduced.

As for the any-to-any voice conversion task, we find that Glow-WaveGAN2-pre outperforms Glow-WaveGAN2-joint in general. The results on LibriTTS show that the Glow-WaveGAN2-pre model achieves SECS scores on unseen target speakers similar to those on seen targets. It is worth noticing that the SECS score declines when the source speaker is seen, from which we argue that the model may still maintain some identity of the source training speakers in the VC procedure. For the VCTK corpus, the proposed methods achieve similar scores for the same target speaker, whether the source speaker is seen or unseen.

Figure 2: Speaker visualization of ground-truth and generated speech, where different shapes represent different speakers. The blue and yellow dots represent the unseen ground-truth and generated speakers, and the red and green dots indicate seen ground-truth and generated speakers.

3.3 Speaker Representation Visualization

To further investigate the speaker similarity of speech synthesized by the proposed methods trained individually on the two corpora, the speaker embeddings of the generated speech are visualized through t-SNE [33], as shown in Figure 2. For both the seen (green and red dots) and unseen (blue and yellow dots) speakers on both TTS and VC tasks, the generated audio and the corresponding reference speech of each speaker form a distinct cluster for our two proposed methods, which shows that the proposed methods can effectively model and control speaker identities.
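A minimal sketch of this visualization step with scikit-learn's t-SNE is shown below, using random vectors as placeholders for the extracted speaker embeddings; the perplexity and other settings are assumptions rather than the values used for Figure 2.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Placeholder data: 4 speakers x 20 utterances x 256-dim embeddings (random stand-ins).
rng = np.random.default_rng(0)
embeddings = np.concatenate([rng.normal(loc=i, size=(20, 256)) for i in range(4)])
speaker_ids = np.repeat(np.arange(4), 20)

# Project the embeddings to 2-D and plot one cluster per speaker.
points = TSNE(n_components=2, perplexity=15, init="pca", random_state=0).fit_transform(embeddings)
for spk in np.unique(speaker_ids):
    mask = speaker_ids == spk
    plt.scatter(points[mask, 0], points[mask, 1], label=f"speaker {spk}", s=10)
plt.legend()
plt.savefig("speaker_tsne.png")
```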

Compared with the inseparable clusters of unseen speakers formed by ground-truth and synthesized speech in the Glow-WaveGAN2-pre model with the pre-trained speaker encoder, there still exist boundaries between generated and ground-truth audio for some speakers in the Glow-WaveGAN2-joint model, especially for the smaller training corpus, i.e., VCTK, whose speaker names begin with “p” in Figure 2.

3.4 Speech Naturalness Evaluation

To evaluate the speech naturalness and quality of different systems, we conduct Mean Opinion Score (MOS) tests, as shown in Table 2. In each MOS test, 20 listeners rate 20 randomly chosen utterances for each model. In general, the MOS scores on VCTK are better than those on LibriTTS due to its better recording quality.
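For reference, the mean score and a 95% confidence interval can be computed from the collected ratings as in the sketch below; the exact aggregation used for Table 2 is an assumption.

```python
import numpy as np
from scipy import stats

def mos_with_ci(ratings, confidence: float = 0.95):
    """Mean opinion score with a t-distribution confidence interval half-width."""
    ratings = np.asarray(ratings, dtype=float)
    mean = ratings.mean()
    half_width = stats.sem(ratings) * stats.t.ppf((1 + confidence) / 2.0, len(ratings) - 1)
    return mean, half_width

scores = [4, 5, 4, 3, 5, 4, 4, 5, 3, 4]          # example ratings from listeners
mean, hw = mos_with_ci(scores)
print(f"{mean:.2f} ± {hw:.2f}")
```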

The MOS results of both TTS and VC on seen speakers demonstrate that the Glow-WaveGAN family and the VITS model achieve obviously higher scores than the GlowTTS-HiFiGAN model, which we attribute to the mismatch problem mentioned in Section 1. As for the unseen speakers in TTS, both proposed models achieve MOS scores similar to those of the seen speakers, which indicates the effectiveness of the proposed models for zero-shot TTS. In the VC scenarios, the MOS results indicate that there is no significant difference between the Glow-WaveGAN family and the VITS model on seen speakers, where they both achieve satisfactory scores. When the source or target speakers are unseen, the MOS results show that the proposed Glow-WaveGAN 2 models can still produce high-quality converted speech with only one utterance of the target speaker. Therefore, the experimental results demonstrate that our proposed models can generate high-quality speech in both zero-shot TTS and VC.

3.5 Cross-dataset Evaluation

Table 3: SECS and MOS results with 95% confidence interval for cross-dataset evaluation.
Training data   Testing data   Model   SECS (TTS)   SECS (VC)   MOS (TTS)      MOS (VC)
LibriTTS        VCTK           joint   0.728        0.693       3.59 ± 0.11    3.54 ± 0.13
LibriTTS        VCTK           pre     0.812        0.723       3.48 ± 0.12    3.57 ± 0.10
VCTK            LibriTTS       joint   0.650        0.635       3.63 ± 0.13    3.31 ± 0.09
VCTK            LibriTTS       pre     0.731        0.661       3.56 ± 0.08    3.25 ± 0.14

To further evaluate the generalization ability of our proposed models in zero-shot speech generation, we also calculate SECS scores and conduct MOS tests across datasets. Specifically, in TTS the target speaker comes from a dataset different from the training one, and in VC both the target and source speakers come from a dataset different from the training one. Results are summarized in Table 3. For speaker similarity, we find that the model trained on the LibriTTS corpus obtains higher SECS scores in cross-dataset evaluation, since it can learn a richer speaker space with more speakers. As for the subjective evaluation, it is interesting that the quality of the generated speech from Glow-WaveGAN2-joint is better than that from Glow-WaveGAN2-pre in the cross-dataset setting. We conjecture the reason is that the speaker constraint s and the target z are both derived directly from the p(z) of speech, which means they may carry channel information of the speech that affects the quality of the generated speech in both training and inference.

4 Conclusions

In this paper, we propose Glow-WaveGAN 2, aiming at generating high-quality speech for zero-shot speech synthesis and any-to-any voice conversion. Specifically, we utilize a universal WaveGAN to build a robust feature extractor and a high-quality universal vocoder. The flow-based multi-speaker acoustic model then learns the latent distribution conditioned on speaker constraints. We explore different speaker modeling strategies, and the results show that the proposed methods can produce high-quality speech in terms of naturalness and similarity for zero-shot speech generation.

References

  • [1] Y. Wang, R. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang et al., “Tacotron: Towards end-to-end speech synthesis,” in The Annual Conference of the International Speech Communication Association (Interspeech), 2017, pp. 4006–4010.
  • [2] S. Ö. Arık, M. Chrzanowski, A. Coates, G. Diamos, A. Gibiansky, Y. Kang, X. Li, J. Miller, A. Ng, J. Raiman et al., “Deep voice: Real-time neural text-to-speech,” in the International Conference on Machine Learning (ICML).   PMLR, 2017, pp. 195–204.
  • [3] Y. Ren, Y. Ruan, X. Tan, T. Qin, S. Zhao, Z. Zhao, and T.-Y. Liu, “Fastspeech: fast, robust and controllable text to speech,” in The International Conference on Neural Information Processing Systems (NeurIPS), 2019, pp. 3171–3180.
  • [4] O. Barbany and M. Cernak, “Fastvc: Fast voice conversion with non-parallel data,” Proc. Joint Workshop for the Blizzard Challenge and Voice Conversion Challenge 2020, pp. 145–149, 2020.
  • [5] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, “WaveNet: A Generative Model for Raw Audio,” in Proc. 9th ISCA Workshop on Speech Synthesis Workshop (SSW 9), 2016, p. 125.
  • [6] R. Prenger, R. Valle, and B. Catanzaro, “Waveglow: A flow-based generative network for speech synthesis,” in the International Conference on Acoustics, Speech, and Signal Processing (ICASSP).   IEEE, 2019, pp. 3617–3621.
  • [7] K. Kumar, R. Kumar, T. de Boissiere, L. Gestin, W. Z. Teoh, J. Sotelo, A. de Brébisson, Y. Bengio, and A. C. Courville, “Melgan: Generative adversarial networks for conditional waveform synthesis,” Advances in neural information processing systems, vol. 32, pp. 14 910–14 921, 2019.
  • [8] J. Kong, J. Kim, and J. Bae, “Hifi-gan: Generative adversarial networks for efficient and high fidelity speech synthesis,” Advances in Neural Information Processing Systems, vol. 33, pp. 17 022–17 033, 2020.
  • [9] A. Gibiansky, S. Arik, G. Diamos, J. Miller, K. Peng, W. Ping, J. Raiman, and Y. Zhou, “Deep voice 2: Multi-speaker neural text-to-speech,” Advances in neural information processing systems, vol. 30, pp. 2962–2970, 2017.
  • [10] E. Casanova, C. Shulby, E. Gölge, N. M. Müller, F. S. de Oliveira, A. Candido Jr., A. da Silva Soares, S. M. Aluisio, and M. A. Ponti, “SC-GlowTTS: An Efficient Zero-Shot Multi-Speaker Text-To-Speech Model,” in The Annual Conference of the International Speech Communication Association (Interspeech), 2021, pp. 3645–3649.
  • [11] E. Cooper, C.-I. Lai, Y. Yasuda, F. Fang, X. Wang, N. Chen, and J. Yamagishi, “Zero-shot multi-speaker text-to-speech with state-of-the-art neural speaker embeddings,” in the International Conference on Acoustics, Speech, and Signal Processing (ICASSP).   IEEE, 2020, pp. 6184–6188.
  • [12] Y. Yan, X. Tan, B. Li, T. Qin, S. Zhao, Y. Shen, and T.-Y. Liu, “Adaspeech 2: Adaptive text to speech with untranscribed data,” in the International Conference on Acoustics, Speech, and Signal Processing (ICASSP).   IEEE, 2021, pp. 6613–6617.
  • [13] S. Arik, J. Chen, K. Peng, W. Ping, and Y. Zhou, “Neural voice cloning with a few samples,” Advances in neural information processing systems, vol. 31, 2018.
  • [14] H.-T. Luong and J. Yamagishi, “Nautilus: a versatile voice cloning system,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 2967–2981, 2020.
  • [15] Y. Taigman, L. Wolf, A. Polyak, and E. Nachmani, “Voiceloop: Voice fitting and synthesis via a phonological loop,” in the International Conference on Learning Representations (ICLR), 2018.
  • [16] H.-T. Luong, S. Takaki, G. E. Henter, and J. Yamagishi, “Adapting and controlling dnn-based speech synthesis using input codes,” in the International Conference on Acoustics, Speech, and Signal Processing (ICASSP).   IEEE, 2017, pp. 4905–4909.
  • [17] M. Chen, X. Tan, B. Li, Y. Liu, T. Qin, S. Zhao, and T.-Y. Liu, “Adaspeech: Adaptive text to speech for custom voice,” in the International Conference on Learning Representations (ICLR), 2021.
  • [18] E. Nachmani, A. Polyak, Y. Taigman, and L. Wolf, “Fitting new speakers based on a short untranscribed sample,” in the International Conference on Machine Learning (ICML).   PMLR, 2018, pp. 3683–3691.
  • [19] S. Pascual, M. Ravanelli, J. Serrà, A. Bonafonte, and Y. Bengio, “Learning problem-agnostic speech representations from multiple self-supervised tasks,” in The Annual Conference of the International Speech Communication Association (Interspeech), 2019, pp. 161–165.
  • [20] J. Lorenzo-Trueba, T. Drugman, J. Latorre, T. Merritt, B. Putrycz, R. Barra-Chicote, A. Moinet, and V. Aggarwal, “Towards achieving robust universal neural vocoding,” in the Annual Conference of the International Speech Communication Association (Interspeech), 2019, pp. 181–185.
  • [21] W. Jang, D. Lim, and J. Yoon, “Universal melgan: A robust neural vocoder for high-fidelity waveform generation in multiple domains,” arXiv preprint arXiv:2011.09631, 2020.
  • [22] J. Kim, S. Kim, J. Kong, and S. Yoon, “Glow-tts: A generative flow for text-to-speech via monotonic alignment search,” Advances in Neural Information Processing Systems, vol. 33, pp. 8067–8077, 2020.
  • [23] J. Kim, J. Kong, and J. Son, “Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech,” in the International Conference on Machine Learning (ICML).   PMLR, 2021, pp. 5530–5540.
  • [24] J. Cong, S. Yang, L. Xie, and D. Su, “Glow-WaveGAN: Learning Speech Representations from GAN-Based Variational Auto-Encoder for High Fidelity Flow-Based Speech Synthesis,” in The Annual Conference of the International Speech Communication Association (Interspeech), 2021, pp. 2182–2186.
  • [25] L. Dinh, J. Sohl-Dickstein, and S. Bengio, “Density estimation using real nvp,” arXiv preprint arXiv:1605.08803, 2016.
  • [26] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” in the International Conference on Learning Representations (ICLR), 2014.
  • [27] Y. Jia, Y. Zhang, R. J. Weiss, Q. Wang, J. Shen, F. Ren, Z. Chen, P. Nguyen, R. Pang, I. L. Moreno et al., “Transfer learning from speaker verification to multispeaker text-to-speech synthesis,” in The International Conference on Neural Information Processing Systems (NeurIPS), 2018, pp. 4485–4495.
  • [28] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, “X-vectors: Robust dnn embeddings for speaker recognition,” in the International Conference on Acoustics, Speech, and Signal Processing (ICASSP).   IEEE, 2018, pp. 5329–5333.
  • [29] L. Wan, Q. Wang, A. Papir, and I. L. Moreno, “Generalized end-to-end loss for speaker verification,” in the International Conference on Acoustics, Speech, and Signal Processing (ICASSP).   IEEE, 2018, pp. 4879–4883.
  • [30] C. Veaux, J. Yamagishi, K. MacDonald et al., “Superseded - CSTR VCTK corpus: English multi-speaker corpus for CSTR voice cloning toolkit,” 2016.
  • [31] H. Zen, V. Dang, R. Clark, Y. Zhang, R. J. Weiss, Y. Jia, Z. Chen, and Y. Wu, “Libritts: A corpus derived from librispeech for text-to-speech,” in The Annual Conference of the International Speech Communication Association (Interspeech), 2019, pp. 1526–1530.
  • [32] S. Choi, S. Han, D. Kim, and S. Ha, “Attentron: Few-shot text-to-speech utilizing attention-based variable-length embedding,” in The Annual Conference of the International Speech Communication Association (Interspeech), 2020, pp. 2007–2011.
  • [33] L. Van der Maaten and G. Hinton, “Visualizing data using t-sne.” Journal of machine learning research, vol. 9, no. 11, pp. 2579–2605, 2008.