
Glow-WaveGAN 2: High-quality Zero-shot Text-to-speech Synthesis and Any-to-any Voice Conversion

Abstract

The zero-shot scenario for speech generation aims at synthesizing a novel unseen voice with only one utterance of the target speaker. Although the challenges of adapting new voices in the zero-shot scenario exist in both stages – acoustic modeling and vocoding – previous works usually consider the problem from only one stage. In this paper, we extend our previous Glow-WaveGAN to Glow-WaveGAN 2, aiming to solve the problem at both stages for high-quality zero-shot text-to-speech and any-to-any voice conversion. We first build a universal WaveGAN model for extracting the latent distribution p(z) of speech and reconstructing the waveform from it. Then a flow-based acoustic model only needs to learn the same p(z) from texts, which naturally avoids the mismatch between the acoustic model and the vocoder, resulting in high-quality generated speech without model fine-tuning. Based on a continuous speaker space and the reversible property of flows, the conditional distribution can be obtained for any speaker, and thus we can further conduct high-quality zero-shot speech generation for new speakers. We particularly investigate two methods to construct the speaker space, namely a pre-trained speaker encoder and a jointly-trained speaker encoder. The superiority of Glow-WaveGAN 2 is demonstrated through TTS and VC experiments conducted on the LibriTTS and VCTK corpora.

Index Terms: Zero-shot, speech synthesis, voice conversion, variational auto-encoder, flow model

1 Introduction

Figure 1: (a) Universal WaveGAN. (b) Multi-speaker acoustic model with optional speaker encoders. (c) Zero-shot TTS inference protocol from the waveform ω_tgt of the target speaker. (d) VC part, which converts source speech ω_src with content p(c) from speaker s_src to target speech ω_tgt of speaker s_tgt.

Recently, text-to-speech (TTS) and voice conversion (VC) have achieved significant improvements with the rapid developments of sequence-to-sequence (seq2seq) based acoustic models [1, 2, 3, 4] and high-quality neural vocoders [5, 6, 7, 8]. With a large amount of studio-recorded high-quality data, it is easy to extend these models to multi-speaker scenarios [9, 10]. But since sizable training data of a new speaker is usually unavailable for customization, speaker adaptation in speech synthesis has attracted rising attention [11, 12, 13, 14]. Considering the amount of data available for the target speaker, adaptive speech synthesis can be divided into few-shot and zero-shot methods. Few-shot adaptation aims at generating new voices with a few samples of the target speaker, where fine-tuning the whole or a part of a pre-trained multi-speaker model is usually adopted [15, 16, 17]. However, fine-tuning is time-consuming and leads to individual speaker-dependent models, which is unfriendly to speaker customization. Moreover, avoiding over-fitting is vital for adaptation with limited data. By contrast, zero-shot methods aim to generate a new voice with only one utterance of the specific speaker, generally without model fine-tuning. In this work, we focus on zero-shot speech generation, including text-to-speech (TTS) and voice conversion (VC).

For both zero-shot TTS and VC, two critical problems affect the performance of speech generation: 1) whether the acoustic model can stably produce natural speech with the target speaker’s identity given only one utterance, and 2) whether the vocoder can reconstruct the waveform of unseen speakers from predicted acoustic features. For the first problem, an effective way in zero-shot scenarios is to train a speaker-discriminative encoder to model the speaker space, so that arbitrary voices can be generated under the constraint of a speaker representation extracted from only one utterance of a novel speaker [18, 19]. In terms of the vocoder problem, recent high-quality neural vocoders are usually speaker-dependent, so the quality of generated speech for unseen speakers unavoidably deteriorates. To address this, a WaveRNN-based universal vocoder [20] and a MelGAN-based universal vocoder [21] were proposed, but a quality gap still exists between seen and unseen speakers.

Recently, Glow-TTS [22] proposed to model the distribution of the mel spectrogram with a flow to improve the quality of synthesized speech, and a pre-trained speaker encoder was further injected in SC-GlowTTS [10] for zero-shot TTS. But fine-tuning the vocoder is still critical to the quality and similarity for unseen speakers. VITS [23] and our Glow-WaveGAN [24] were proposed at the same time to avoid this problem by modeling a latent speech representation, but there are many differences between them. VITS reconstructs speech from the posterior distribution p(z|c) of the linear spectrogram through a conditional VAE, and utilizes the reversible characteristics of the flow to transform only the mean of the text prior c to improve the expressiveness of the prior distribution [23]. It therefore restricts the affine coupling layers in the flow to be volume-preserving [25], and it cannot produce speech of unseen speakers because of its discrete speaker space modeling. As for the two-stage Glow-WaveGAN, the goal of the WaveGAN is to learn a latent speech distribution p(z) directly from the waveform through an unconditional VAE and GAN, while the acoustic model learns the same distribution p(z) with a non-volume-preserving flow from texts, which also avoids the mismatch problem [24].

Since p(z) in WaveGAN is unconditional, it can extract rich information from the waveform, such as timbre and content, which suggests its potential for reconstructing the waveform of unseen speakers. To this end, we propose Glow-WaveGAN 2, focusing on high-quality zero-shot text-to-speech synthesis and any-to-any voice conversion, where we build a universal WaveGAN and investigate different speaker encoders in the flow to provide a continuous speaker constraint s on p(z|s) for zero-shot scenarios. Given one utterance of an arbitrary speaker during inference, we can use the encoder of the universal WaveGAN to extract the speaker-related latent distribution p(z), and then generate the speaker’s voice with our conditional acoustic model and the decoder of the WaveGAN. Experimental results on the VCTK and LibriTTS corpora show that the proposed methods can generate high-quality target voices without model fine-tuning in both zero-shot TTS and any-to-any VC.

2 Method

As a substantial extension of our previous work [24], the proposed model mainly focuses on zero-shot high-quality text-to-speech synthesis and voice conversion. As shown in Figure 1, Glow-WaveGAN 2 contains three modules that are elaborated in the following subsections: 1) a robust variational universal WaveGAN, which acts as both feature extractor and vocoder to extract the latent distribution p(z) from speech and reconstruct speech from the sampled z, respectively; 2) a multi-speaker Glow-TTS [22], which generates the latent representation p(z|s) with the speaker identity s as the conditioning constraint; and 3) an additional speaker encoder, which learns the speaker identity s for adaptation.

Table 1: SECS of different models on different corpora, where GT-cross-spk is the SECS of different speakers. In VC scenarios, “s2s” indicates converting seen speakers to seen speakers, “u2s” means converting unseen speakers to seen speakers, “s2u” means converting seen speakers to unseen speakers, and “u2u” means converting unseen speakers to unseen speakers.
Training data: LibriTTS
Models                 TTS                       VC
                       seen     unseen           s2s      u2s      s2u      u2u
GT-same-spk            0.830
GT-cross-spk           0.544
GlowTTS-HiFiGAN        0.772    -                0.748    -        -        -
VITS                   0.807    -                0.766    -        -        -
Glow-WaveGAN           0.819    -                0.769    -        -        -
Glow-WaveGAN2-joint    0.791    0.749            0.762    0.781    0.742    0.760
Glow-WaveGAN2-pre      0.822    0.784            0.771    0.802    0.771    0.807

Training data: VCTK
Models                 TTS                       VC
                       seen     unseen           s2s      u2s      s2u      u2u
GT-same-spk            0.815
GT-cross-spk           0.551
GlowTTS-HiFiGAN        0.783    -                0.790    -        -        -
VITS                   0.798    -                0.812    -        -        -
Glow-WaveGAN           0.805    -                0.811    -        -        -
Glow-WaveGAN2-joint    0.793    0.731            0.805    0.794    0.720    0.717
Glow-WaveGAN2-pre      0.804    0.774            0.807    0.795    0.757    0.762

2.1 Universal WaveGAN

A robust vocoder for unseen speakers is critical for zero-shot speech generation. To this end, we build a variational auto-encoder (VAE) [26] based universal WaveGAN model for speech reconstruction, which already showed a strong ability to reconstruct unseen speakers in our previous work [24].

In the universal WaveGAN, the encoder aims at extracting the distribution of the acoustic representation z ~ p(z|w) from the waveform w, where z also captures speaker characteristics with the help of a pitch predictor. Meanwhile, the decoder reconstructs the speech ŵ ~ p(w|z) from the sampled z. To ensure the quality of the reconstructed speech, we use adversarial training to model the distribution p(w) of the waveform. The training objective of the WaveGAN is the same as in [24].

The encoder of the WaveGAN can be treated as a robust feature extractor, and the extracted speech representation z contains the speaker information. The decoder plays the role of the vocoder to reconstruct a high-quality speech waveform for both seen and unseen speakers. In this way, the proposed method guarantees the speech quality of zero-shot speech reconstruction.
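For concreteness, below is a minimal PyTorch sketch of this VAE-style reconstruction path over raw waveform, with toy layer sizes chosen only for illustration; the actual WaveGAN in [24] additionally includes a pitch predictor and GAN discriminators, so this is not the authors' architecture.

```python
import torch
import torch.nn as nn

class ToyWaveGAN(nn.Module):
    """Toy VAE over raw waveform: encoder -> p(z|w), decoder -> reconstruction (illustrative only)."""

    def __init__(self, latent_dim: int = 64, hop: int = 256):
        super().__init__()
        # Encoder: downsample the waveform to frame-level mean / log-variance of z.
        self.encoder = nn.Sequential(
            nn.Conv1d(1, 128, kernel_size=hop, stride=hop),     # frame-rate features
            nn.ReLU(),
            nn.Conv1d(128, 2 * latent_dim, kernel_size=3, padding=1),
        )
        # Decoder: upsample a sampled z back to waveform samples.
        self.decoder = nn.Sequential(
            nn.Conv1d(latent_dim, 128, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.ConvTranspose1d(128, 1, kernel_size=hop, stride=hop),
            nn.Tanh(),
        )

    def forward(self, wav: torch.Tensor):
        stats = self.encoder(wav)                                # (B, 2*latent_dim, T_frames)
        mu, logvar = stats.chunk(2, dim=1)                       # parameters of p(z|w)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()     # reparameterized sample
        wav_hat = self.decoder(z)                                # reconstruction ~ p(w|z)
        return wav_hat, z, mu, logvar

model = ToyWaveGAN()
wav = torch.randn(1, 1, 256 * 100)                               # 100 frames of dummy audio
wav_hat, z, mu, logvar = model(wav)
print(z.shape, wav_hat.shape)                                    # (1, 64, 100), (1, 1, 25600)
```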

2.2 Multi-Speaker Glow-TTS

With the extracted latent speech representation, the skeleton of our acoustic model follows Glow-TTS [22] to model the conditional distribution p(z|t,s) from the text t, where s is the conditional constraint providing the speaker identity. In detail, the text encoder transforms the text t into a linguistic prior distribution p(c|t). Considering the frame-rate difference between the text and the latent z, monotonic alignment search (MAS) is adopted to align the prior c and the acoustic feature z. With the speaker embedding fed into the flow-based decoder f, the target z from the WaveGAN is transformed into the aligned prior distribution p(c) through the forward pass of f during training. Thus the log-likelihood of the target distribution p(z) can be obtained through the flow-based decoder:

\log P_Z(z|t,s) = \log P_C(c|t) + \log\left|\det\frac{\partial f_{dec}^{-1}(z,s)}{\partial z}\right|   (1)

Note that we specifically sample z from the same p(z) produced by the WaveGAN encoder at each training step, which mitigates the mismatch problem between the acoustic model and the vocoder and achieves high synthesis quality without fine-tuning. During inference, the predicted ẑ is generated from the text prior by the reverse decoding process f^{-1} with the speaker condition. Since the text encoder is speaker-independent, it is easy to conduct voice conversion through the forward f and reverse f^{-1} flow decoding with different speaker constraints [22], as shown in Figure 1 (d).
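For intuition, the following toy sketch evaluates Eq. (1) with a single element-wise affine transform standing in for the flow decoder f; the real model stacks many non-volume-preserving coupling layers, so all names and sizes here are illustrative assumptions rather than the authors' implementation.

```python
import torch

torch.manual_seed(0)
T, D = 4, 8                                   # frames, latent channels (toy sizes)

# Aligned text prior p(c|t): per-frame Gaussian parameters from the text encoder.
mu_c, log_sigma_c = torch.zeros(T, D), torch.zeros(T, D)

# Stand-in flow decoder f: an element-wise affine map whose parameters depend on s.
s = torch.randn(16)                           # speaker embedding (toy)
scale = torch.sigmoid(s[:D]) + 0.5            # strictly positive, speaker-dependent
shift = s[D:D + D]

z = torch.randn(T, D)                         # latent sampled from the WaveGAN encoder

# Inverse pass c = f^{-1}(z, s) and log-determinant of its Jacobian.
c = (z - shift) / scale
log_det = -torch.log(scale).sum() * T         # dc/dz is diagonal with entries 1/scale

# Eq. (1): log p(z|t, s) = log p_C(c|t) + log|det df^{-1}/dz|
log_p_c = torch.distributions.Normal(mu_c, log_sigma_c.exp()).log_prob(c).sum()
log_p_z = log_p_c + log_det
print(float(log_p_z))
```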

2.3 Speaker Representation

To achieve zero-shot TTS and any-to-any VC, a key problem is how to model the speaker characteristics, especially for unseen speakers. In this paper, we investigate two alternative ways, a pre-trained encoder and a jointly-trained encoder, to build the speaker space in the acoustic model and produce a speaker-dependent z for unseen speakers.

A straightforward way to represent speakers is to utilize a pre-trained speaker encoder based on speaker embedding extraction in speaker verification [11, 27, 28, 29], through which we can obtain the speaker representation s from only one utterance of a few seconds to generate speech of unseen speakers. Besides, a jointly-trained encoder based on speaker classification is also adopted in our work to learn s from p(z) within the Glow-TTS model. With the jointly-trained speaker module, we optimize the acoustic model with an extra cross-entropy objective. Specifically, we sample two vectors {z_1, z_2} from the same distribution p(z), where z_1 and z_2 are used to train the acoustic model and the speaker classification module, respectively.
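Below is a minimal PyTorch sketch of how such a jointly-trained speaker branch could look, following the layer description given later in Section 3.1 (two 1-D convolutions with kernel size 5, layer normalization, dropout, a 128-channel projection with mean pooling, and a cross-entropy classifier); the hidden sizes and loss combination are assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class JointSpeakerEncoder(nn.Module):
    """Speaker branch trained jointly with the acoustic model (sketch of Sec. 2.3)."""

    def __init__(self, latent_dim: int = 64, emb_dim: int = 128, n_speakers: int = 1151):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv1d(latent_dim, 256, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(256, 256, kernel_size=5, padding=2),
            nn.ReLU(),
        )
        self.norm = nn.LayerNorm(256)
        self.dropout = nn.Dropout(0.1)
        self.proj = nn.Conv1d(256, emb_dim, kernel_size=1)
        self.classifier = nn.Linear(emb_dim, n_speakers)     # used only during training

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        h = self.convs(z)                                    # (B, 256, T)
        h = self.dropout(self.norm(h.transpose(1, 2)).transpose(1, 2))
        return self.proj(h).mean(dim=-1)                     # mean pooling -> (B, emb_dim)

spk_enc = JointSpeakerEncoder()
z2 = torch.randn(8, 64, 100)                                 # second sample z_2 from p(z)
speaker_ids = torch.randint(0, 1151, (8,))
s = spk_enc(z2)                                              # speaker embedding s
ce_loss = nn.functional.cross_entropy(spk_enc.classifier(s), speaker_ids)
# total_loss = acoustic_model_loss_on_z1 + ce_loss           # z_1 trains the flow; weighting not specified
```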

2.4 Zero-shot TTS and VC

As shown in Figure 1 (c) and (d), given an utterance of any target speaker during inference, the speaker identity s_tgt can be extracted by the speaker encoder and treated as the speaker constraint in our model. At inference time of zero-shot TTS, s_tgt is used as the condition for the acoustic model to generate the sampled z containing the target speaker information. At inference time of zero-shot VC, the source speaker s_src conditions the flow module to generate the speaker-independent linguistic prior distribution p(c), which eliminates the source speaker information. Then p(c) is fed into the reversed flow conditioned on the target speaker s_tgt to generate the target sample z_tgt. In this way, zero-shot TTS and VC can be achieved for generating speech of any speaker.
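The conversion path of Figure 1 (d) can be illustrated with a toy speaker-conditioned invertible map: a forward pass with s_src strips the source identity, and an inverse pass with s_tgt imposes the target identity. The module below is a deliberately simplified stand-in for the real stacked flow decoder and the universal WaveGAN decoder.

```python
import torch
import torch.nn as nn

class ToyConditionalFlow(nn.Module):
    """Element-wise affine flow whose parameters are predicted from a speaker embedding."""

    def __init__(self, latent_dim: int = 64, spk_dim: int = 256):
        super().__init__()
        self.to_scale_shift = nn.Linear(spk_dim, 2 * latent_dim)

    def _params(self, s):
        scale, shift = self.to_scale_shift(s).chunk(2, dim=-1)
        return torch.nn.functional.softplus(scale) + 1e-3, shift   # keep scale > 0

    def forward(self, z, s):        # z -> c : strip speaker information
        scale, shift = self._params(s)
        return (z - shift.unsqueeze(-1)) / scale.unsqueeze(-1)

    def inverse(self, c, s):        # c -> z : impose speaker information
        scale, shift = self._params(s)
        return c * scale.unsqueeze(-1) + shift.unsqueeze(-1)

flow = ToyConditionalFlow()
z_src = torch.randn(1, 64, 120)                    # latent of source speech from the WaveGAN encoder
s_src, s_tgt = torch.randn(1, 256), torch.randn(1, 256)   # speaker embeddings (Sec. 2.3)

c = flow.forward(z_src, s_src)                     # speaker-independent linguistic prior
z_tgt = flow.inverse(c, s_tgt)                     # target-speaker latent
# wav_tgt = wavegan_decoder(z_tgt)                 # final waveform via the universal decoder

# Sanity check: inverting with the *source* embedding recovers the original latent.
assert torch.allclose(flow.inverse(c, s_src), z_src, atol=1e-5)
```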

3 Experiments and results

3.1 Basic Setup

We conduct experiments on two different datasets, VCTK [30] and LibriTTS [31], at 24 kHz. The VCTK corpus contains 109 English speakers, from which we reserve 3 male and 3 female speakers as unseen speakers. As for the LibriTTS corpus, we use 1,151 speakers from the train-clean-360 and train-clean-100 subsets to train the different systems, while the 39 speakers from the test-clean subset are treated as unseen speakers. The jointly-trained speaker encoder consists of two 1-D convolution layers with kernel size 5, followed by layer normalization and dropout. Finally, another convolution layer with 128 channels is adopted to extract the speaker embedding through mean pooling. As for the pre-trained speaker encoder, we utilize the voice encoder in the Resemblyzer tool [27, 29] to extract a 256-dimensional speaker representation.
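As a reference for reproducing this step, extracting the 256-dimensional representation with the Resemblyzer voice encoder looks roughly as follows (the audio path is a placeholder):

```python
from resemblyzer import VoiceEncoder, preprocess_wav

wav = preprocess_wav("reference_utterance.wav")   # placeholder path; resamples to 16 kHz and trims silence
encoder = VoiceEncoder()
embedding = encoder.embed_utterance(wav)          # numpy array of shape (256,), L2-normalized
print(embedding.shape)
```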

To evaluate the capability of the proposed Glow-WaveGAN 2 in zero-shot speech generation (audio samples can be found at https://leiyi420.github.io/glow-wavegan2/), we set up two state-of-the-art models as baselines: (1) GlowTTS-HiFiGAN, which contains a multi-speaker flow-based acoustic model predicting the mel-spectrogram and a GAN-based vocoder reconstructing the waveform from it; and (2) VITS [23], which conducts multi-speaker synthesis in an end-to-end manner. Both of them utilize the speaker ID as the speaker condition of the acoustic model. As for the Glow-WaveGAN family, we build different models for evaluation: (1) the basic Glow-WaveGAN with explicit speaker labels like the above baselines, (2) Glow-WaveGAN2-joint with the jointly-trained speaker encoder, and (3) Glow-WaveGAN2-pre with the pre-trained speaker encoder.

Table 2: MOS scores of different systems on two corpora for TTS and VC with 95% confidence interval.
Training data: LibriTTS
Models                 TTS                          VC
                       seen         unseen          s2s          u2s          s2u          u2u
Ground-truth           4.38 ± 0.08
GlowTTS-HiFiGAN        3.36 ± 0.13  -               3.45 ± 0.12  -            -            -
VITS                   3.77 ± 0.11  -               3.68 ± 0.13  -            -            -
Glow-WaveGAN           3.78 ± 0.10  -               3.63 ± 0.11  -            -            -
Glow-WaveGAN2-joint    3.73 ± 0.12  3.69 ± 0.14     3.78 ± 0.10  3.59 ± 0.14  3.53 ± 0.12  3.65 ± 0.15
Glow-WaveGAN2-pre      3.69 ± 0.09  3.65 ± 0.11     3.73 ± 0.08  3.62 ± 0.09  3.51 ± 0.12  3.68 ± 0.11

Training data: VCTK
Models                 TTS                          VC
                       seen         unseen          s2s          u2s          s2u          u2u
Ground-truth           4.45 ± 0.07
GlowTTS-HiFiGAN        3.44 ± 0.11  -               3.55 ± 0.09  -            -            -
VITS                   3.83 ± 0.08  -               3.77 ± 0.09  -            -            -
Glow-WaveGAN           3.85 ± 0.09  -               3.80 ± 0.08  -            -            -
Glow-WaveGAN2-joint    3.86 ± 0.11  3.87 ± 0.12     3.82 ± 0.08  3.73 ± 0.09  3.64 ± 0.11  3.75 ± 0.12
Glow-WaveGAN2-pre      3.81 ± 0.14  3.78 ± 0.09     3.78 ± 0.08  3.66 ± 0.11  3.58 ± 0.12  3.69 ± 0.14

3.2 Speaker Similarity Evaluation

We first evaluate the speaker similarity of the synthesized speech for TTS and VC across different models in an objective manner, which is critical to the performance of zero-shot speech generation. Following [10, 32], we use the Speaker Encoder Cosine Similarity (SECS) between the speaker embeddings extracted from synthesized and ground-truth audio as the objective measure, where SECS scores range from 0 to 1 and a higher score means higher speaker similarity.
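Concretely, SECS is the cosine similarity between the two speaker embeddings; a minimal numpy sketch is shown below (with non-negative Resemblyzer-style embeddings the value falls between 0 and 1).

```python
import numpy as np

def secs(emb_synthesized: np.ndarray, emb_ground_truth: np.ndarray) -> float:
    """Speaker Encoder Cosine Similarity between two speaker embeddings."""
    num = float(np.dot(emb_synthesized, emb_ground_truth))
    den = float(np.linalg.norm(emb_synthesized) * np.linalg.norm(emb_ground_truth))
    return num / den

# Example with random non-negative embeddings standing in for real ones.
rng = np.random.default_rng(0)
e1, e2 = rng.random(256), rng.random(256)
print(round(secs(e1, e2), 3))
```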

Table 1 shows the SECS results of different models. Note that the two baselines and the Glow-WaveGAN system adopt explicit speaker labels to model speaker identity, so they can only conduct speech synthesis and voice conversion for seen speakers. For both TTS and VC on seen speakers, we find that the VITS and Glow-WaveGAN models achieve higher similarity than the GlowTTS-HiFiGAN model, especially on the LibriTTS dataset, and the Glow-WaveGAN family is slightly better than VITS for seen speakers in general.

For zero-shot TTS with the proposed models, the results show a gap in speaker similarity between seen and unseen speakers for the Glow-WaveGAN2-joint model. We believe this is mainly because the jointly-trained speaker module only learns a weak speaker space from the limited training speakers, since the gap is smaller on LibriTTS, which has more speakers, than on VCTK. Based on this assumption, we also evaluate zero-shot TTS with Glow-WaveGAN2-pre, which uses the pre-trained speaker encoder, where the gap between seen and unseen speakers is clearly reduced.

As for the any-to-any voice conversion task, we find that Glow-WaveGAN2-pre outperforms Glow-WaveGAN2-joint in general. The results on LibriTTS show that the Glow-WaveGAN2-pre model achieves SECS scores on unseen target speakers similar to those on seen targets. It is worth noticing that the SECS score declines when the source speaker is seen, from which we argue that the model may still maintain some identity of the source training speakers in the VC procedure. For the VCTK corpus, the proposed methods achieve similar scores for the same target speaker, whether the source speaker is seen or unseen.

Figure 2: Speaker visualization of ground-truth and generated speech, where different shapes represent different speakers. The blue and yellow dots represent the unseen ground-truth and generated speakers, and the red and green dots indicate seen ground-truth and generated speakers.

3.3 Speaker Representation Visualization

To further investigate the speaker similarity of speech synthesized by the proposed methods trained individually on the two corpora, the speaker embeddings of the generated speech are visualized through t-SNE [33], as shown in Figure 2. For both the seen (green and red dots) and unseen (blue and yellow dots) speakers on both TTS and VC tasks, the generated audio and the corresponding reference speech of each speaker form a distinct cluster for our two proposed methods, which shows that the proposed methods can effectively model and control speaker identities.
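A minimal sketch of this visualization step with scikit-learn's t-SNE is shown below, using random vectors as placeholders for the extracted speaker embeddings; the perplexity and other settings are assumptions rather than the values used for Figure 2.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Placeholder data: 4 speakers x 20 utterances x 256-dim embeddings (random stand-ins).
rng = np.random.default_rng(0)
embeddings = np.concatenate([rng.normal(loc=i, size=(20, 256)) for i in range(4)])
speaker_ids = np.repeat(np.arange(4), 20)

# Project the embeddings to 2-D and plot one cluster per speaker.
points = TSNE(n_components=2, perplexity=15, init="pca", random_state=0).fit_transform(embeddings)
for spk in np.unique(speaker_ids):
    mask = speaker_ids == spk
    plt.scatter(points[mask, 0], points[mask, 1], label=f"speaker {spk}", s=10)
plt.legend()
plt.savefig("speaker_tsne.png")
```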

Compared with the inseparable clusters of unseen speakers formed by ground-truth and synthesized speech in the Glow-WaveGAN2-pre model with the pre-trained speaker encoder, there still exist boundaries between generated and ground-truth audio for some speakers in the Glow-WaveGAN2-joint model, especially for the smaller training corpus, i.e., VCTK, whose speaker names begin with “p” in Figure 2.

3.4 Speech Naturalness Evaluation

To evaluate the speech naturalness and quality of different systems, we conduct Mean Opinion Score (MOS) tests, as shown in Table 2. In each MOS test, 20 listeners rate 20 randomly chosen utterances for each model. In general, the MOS scores on VCTK are better than those on LibriTTS due to its better recording quality.
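For reference, the mean score and a 95% confidence interval can be computed from the collected ratings as in the sketch below; the exact aggregation used for Table 2 is an assumption.

```python
import numpy as np
from scipy import stats

def mos_with_ci(ratings, confidence: float = 0.95):
    """Mean opinion score with a t-distribution confidence interval half-width."""
    ratings = np.asarray(ratings, dtype=float)
    mean = ratings.mean()
    half_width = stats.sem(ratings) * stats.t.ppf((1 + confidence) / 2.0, len(ratings) - 1)
    return mean, half_width

scores = [4, 5, 4, 3, 5, 4, 4, 5, 3, 4]          # example ratings from listeners
mean, hw = mos_with_ci(scores)
print(f"{mean:.2f} ± {hw:.2f}")
```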

The MOS results of both TTS and VC on seen speakers demonstrate that the Glow-WaveGAN family and the VITS model achieve obviously higher scores than the GlowTTS-HiFiGAN model, which we attribute to the mismatch problem mentioned in Section 1. As for the unseen speakers in TTS, both proposed models achieve MOS scores similar to those of the seen speakers, which indicates the effectiveness of the proposed models for zero-shot TTS. In the VC scenarios, the MOS results indicate that there is no significant difference between the Glow-WaveGAN family and the VITS model on seen speakers, where they both achieve satisfactory scores. When the source or target speakers are unseen, the MOS results show that the proposed Glow-WaveGAN 2 models can still produce high-quality converted speech with only one utterance of the target speaker. Therefore, the experimental results demonstrate that our proposed models can generate high-quality speech in both zero-shot TTS and VC.

3.5 Cross-dataset Evaluation

Table 3: SECS and MOS results with 95% confidence interval for cross-dataset evaluation.
Training data   Testing data   Model   SECS (TTS)   SECS (VC)   MOS (TTS)      MOS (VC)
LibriTTS        VCTK           joint   0.728        0.693       3.59 ± 0.11    3.54 ± 0.13
LibriTTS        VCTK           pre     0.812        0.723       3.48 ± 0.12    3.57 ± 0.10
VCTK            LibriTTS       joint   0.650        0.635       3.63 ± 0.13    3.31 ± 0.09
VCTK            LibriTTS       pre     0.731        0.661       3.56 ± 0.08    3.25 ± 0.14

To further evaluate the generalization ability of our proposed models in zero-shot speech generation, we also calculate SECS scores and conduct MOS tests across datasets. Specifically, in TTS the target speaker comes from a dataset different from the training one, and in VC both the target and source speakers come from a dataset different from the training one. Results are summarized in Table 3. For speaker similarity, we find that the model trained on the LibriTTS corpus obtains higher SECS scores in cross-dataset evaluation, since it can learn a richer speaker space with more speakers. As for the subjective evaluation, it is interesting that the quality of the generated speech from Glow-WaveGAN2-joint is better than that from Glow-WaveGAN2-pre in the cross-dataset setting. We conjecture the reason is that the speaker constraint s and the target z are both derived directly from the p(z) of speech, which means they may carry channel information of the speech that affects the quality of the generated speech in both training and inference.

4 Conclusions

In this paper, we propose Glow-WaveGAN 2, aiming at generating high-quality speech for zero-shot speech synthesis and any-to-any voice conversion. Specifically, we utilize a universal WaveGAN to build a robust feature extractor and a high-quality universal vocoder. The flow-based multi-speaker acoustic model then learns the latent distribution conditioned on speaker constraints. We explore different speaker modeling strategies, and the results show that the proposed methods can produce high-quality speech in terms of naturalness and similarity for zero-shot speech generation.

References

  • [1] Y. Wang, R. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang et al., “Tacotron: Towards end-to-end speech synthesis,” in The Annual Conference of the International Speech Communication Association (Interspeech), 2017, pp. 4006–4010.
  • [2] S. Ö. Arık, M. Chrzanowski, A. Coates, G. Diamos, A. Gibiansky, Y. Kang, X. Li, J. Miller, A. Ng, J. Raiman et al., “Deep voice: Real-time neural text-to-speech,” in the International Conference on Machine Learning (ICML).   PMLR, 2017, pp. 195–204.
  • [3] Y. Ren, Y. Ruan, X. Tan, T. Qin, S. Zhao, Z. Zhao, and T.-Y. Liu, “Fastspeech: fast, robust and controllable text to speech,” in The International Conference on Neural Information Processing Systems (NeurIPS), 2019, pp. 3171–3180.
  • [4] O. Barbany and M. Cernak, “Fastvc: Fast voice conversion with non-parallel data,” Proc. Joint Workshop for the Blizzard Challenge and Voice Conversion Challenge 2020, pp. 145–149, 2020.
  • [5] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, “WaveNet: A Generative Model for Raw Audio,” in Proc. 9th ISCA Workshop on Speech Synthesis Workshop (SSW 9), 2016, p. 125.
  • [6] R. Prenger, R. Valle, and B. Catanzaro, “Waveglow: A flow-based generative network for speech synthesis,” in the International Conference on Acoustics, Speech, and Signal Processing (ICASSP).   IEEE, 2019, pp. 3617–3621.
  • [7] K. Kumar, R. Kumar, T. de Boissiere, L. Gestin, W. Z. Teoh, J. Sotelo, A. de Brébisson, Y. Bengio, and A. C. Courville, “Melgan: Generative adversarial networks for conditional waveform synthesis,” Advances in neural information processing systems, vol. 32, pp. 14 910–14 921, 2019.
  • [8] J. Kong, J. Kim, and J. Bae, “Hifi-gan: Generative adversarial networks for efficient and high fidelity speech synthesis,” Advances in Neural Information Processing Systems, vol. 33, pp. 17 022–17 033, 2020.
  • [9] A. Gibiansky, S. Arik, G. Diamos, J. Miller, K. Peng, W. Ping, J. Raiman, and Y. Zhou, “Deep voice 2: Multi-speaker neural text-to-speech,” Advances in neural information processing systems, vol. 30, pp. 2962–2970, 2017.
  • [10] E. Casanova, C. Shulby, E. Gölge, N. M. Müller, F. S. de Oliveira, A. Candido Jr., A. da Silva Soares, S. M. Aluisio, and M. A. Ponti, “SC-GlowTTS: An Efficient Zero-Shot Multi-Speaker Text-To-Speech Model,” in The Annual Conference of the International Speech Communication Association (Interspeech), 2021, pp. 3645–3649.
  • [11] E. Cooper, C.-I. Lai, Y. Yasuda, F. Fang, X. Wang, N. Chen, and J. Yamagishi, “Zero-shot multi-speaker text-to-speech with state-of-the-art neural speaker embeddings,” in the International Conference on Acoustics, Speech, and Signal Processing (ICASSP).   IEEE, 2020, pp. 6184–6188.
  • [12] Y. Yan, X. Tan, B. Li, T. Qin, S. Zhao, Y. Shen, and T.-Y. Liu, “Adaspeech 2: Adaptive text to speech with untranscribed data,” in the International Conference on Acoustics, Speech, and Signal Processing (ICASSP).   IEEE, 2021, pp. 6613–6617.
  • [13] S. Arik, J. Chen, K. Peng, W. Ping, and Y. Zhou, “Neural voice cloning with a few samples,” Advances in neural information processing systems, vol. 31, 2018.
  • [14] H.-T. Luong and J. Yamagishi, “Nautilus: a versatile voice cloning system,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 2967–2981, 2020.
  • [15] Y. Taigman, L. Wolf, A. Polyak, and E. Nachmani, “Voiceloop: Voice fitting and synthesis via a phonological loop,” in the International Conference on Learning Representations (ICLR), 2018.
  • [16] H.-T. Luong, S. Takaki, G. E. Henter, and J. Yamagishi, “Adapting and controlling dnn-based speech synthesis using input codes,” in the International Conference on Acoustics, Speech, and Signal Processing (ICASSP).   IEEE, 2017, pp. 4905–4909.
  • [17] M. Chen, X. Tan, B. Li, Y. Liu, T. Qin, S. Zhao, and T.-Y. Liu, “Adaspeech: Adaptive text to speech for custom voice,” in the International Conference on Learning Representations (ICLR), 2021.
  • [18] E. Nachmani, A. Polyak, Y. Taigman, and L. Wolf, “Fitting new speakers based on a short untranscribed sample,” in the International Conference on Machine Learning (ICML).   PMLR, 2018, pp. 3683–3691.
  • [19] S. Pascual, M. Ravanelli, J. Serrà, A. Bonafonte, and Y. Bengio, “Learning problem-agnostic speech representations from multiple self-supervised tasks,” in The Annual Conference of the International Speech Communication Association (Interspeech), 2019, pp. 161–165.
  • [20] J. Lorenzo-Trueba, T. Drugman, J. Latorre, T. Merritt, B. Putrycz, R. Barra-Chicote, A. Moinet, and V. Aggarwal, “Towards achieving robust universal neural vocoding,” in the Annual Conference of the International Speech Communication Association (Interspeech), 2019, pp. 181–185.
  • [21] W. Jang, D. Lim, and J. Yoon, “Universal melgan: A robust neural vocoder for high-fidelity waveform generation in multiple domains,” arXiv preprint arXiv:2011.09631, 2020.
  • [22] J. Kim, S. Kim, J. Kong, and S. Yoon, “Glow-tts: A generative flow for text-to-speech via monotonic alignment search,” Advances in Neural Information Processing Systems, vol. 33, pp. 8067–8077, 2020.
  • [23] J. Kim, J. Kong, and J. Son, “Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech,” in the International Conference on Machine Learning (ICML).   PMLR, 2021, pp. 5530–5540.
  • [24] J. Cong, S. Yang, L. Xie, and D. Su, “Glow-WaveGAN: Learning Speech Representations from GAN-Based Variational Auto-Encoder for High Fidelity Flow-Based Speech Synthesis,” in The Annual Conference of the International Speech Communication Association (Interspeech), 2021, pp. 2182–2186.
  • [25] L. Dinh, J. Sohl-Dickstein, and S. Bengio, “Density estimation using real nvp,” arXiv preprint arXiv:1605.08803, 2016.
  • [26] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” in the International Conference on Learning Representations (ICLR), 2014.
  • [27] Y. Jia, Y. Zhang, R. J. Weiss, Q. Wang, J. Shen, F. Ren, Z. Chen, P. Nguyen, R. Pang, I. L. Moreno et al., “Transfer learning from speaker verification to multispeaker text-to-speech synthesis,” in The International Conference on Neural Information Processing Systems (NeurIPS), 2018, pp. 4485–4495.
  • [28] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, “X-vectors: Robust dnn embeddings for speaker recognition,” in the International Conference on Acoustics, Speech, and Signal Processing (ICASSP).   IEEE, 2018, pp. 5329–5333.
  • [29] L. Wan, Q. Wang, A. Papir, and I. L. Moreno, “Generalized end-to-end loss for speaker verification,” in the International Conference on Acoustics, Speech, and Signal Processing (ICASSP).   IEEE, 2018, pp. 4879–4883.
  • [30] C. Veaux, J. Yamagishi, K. MacDonald et al., “Superseded - CSTR VCTK corpus: English multi-speaker corpus for CSTR voice cloning toolkit,” 2016.
  • [31] H. Zen, V. Dang, R. Clark, Y. Zhang, R. J. Weiss, Y. Jia, Z. Chen, and Y. Wu, “Libritts: A corpus derived from librispeech for text-to-speech,” in The Annual Conference of the International Speech Communication Association (Interspeech), 2019, pp. 1526–1530.
  • [32] S. Choi, S. Han, D. Kim, and S. Ha, “Attentron: Few-shot text-to-speech utilizing attention-based variable-length embedding,” in The Annual Conference of the International Speech Communication Association (Interspeech), 2020, pp. 2007–2011.
  • [33] L. Van der Maaten and G. Hinton, “Visualizing data using t-sne.” Journal of machine learning research, vol. 9, no. 11, pp. 2579–2605, 2008.