
MHTTS: FAST MULTI-HEAD TEXT-TO-SPEECH FOR SPONTANEOUS SPEECH WITH IMPERFECT TRANSCRIPTION

Abstract

Neural network based end-to-end Text-to-Speech (TTS) has greatly improved the quality of synthesized speech, yet how to efficiently use massive spontaneous speech without transcription remains an open problem. In this paper we propose MHTTS, a fast multi-speaker TTS system that is robust to transcription errors and spontaneous speaking-style data. Specifically, we introduce a multi-head model and transfer text information from a high-quality corpus with manual transcription to spontaneous speech with imperfectly recognized transcription by training them jointly. MHTTS has three advantages: 1) it synthesizes higher-quality multi-speaker voice with faster inference; 2) it transfers correct text information to data with imperfect transcription, whether simulated by corruption or produced by an Automatic Speech Recognizer (ASR); 3) it can exploit massive real spontaneous speech with imperfect transcription and synthesize expressive voice.

* Corresponding author. † Equal contribution.

Index Terms—  Speech synthesis, multi-speaker, inference speedup, imperfect transcripts, spontaneous speech

1 Introduction

Text-to-speech (TTS) aims to synthesize natural and intelligible voice from text. Deep neural networks [1, 2, 3, 4, 5, 6] have greatly improved the quality of synthesized speech and expanded the capability of TTS systems. However, most TTS models can only be trained on recorded data with manual transcription. Such data is hard to collect, while massive spontaneous speech without transcription is available in real life. How to utilize speech data with no transcription is therefore becoming an important topic.

Adaptive methods [7, 8] adapt to untranscribed speech data by fine-tuning a pre-trained multi-speaker TTS model, but they only use dozens or hundreds of utterances for adaptation, which may not be enough to model the speaking style of real-life spontaneous speech. Another possible approach is to generate transcripts with an ASR system [9, 10, 11], but the generated transcripts contain inevitable textual errors. [12] analyses the impact of different types of textual errors and concludes that attention-based autoregressive models [3, 13] are only partially robust to imperfect transcripts: substitution and deletion errors pose a serious problem for them.

Transformer [14] based non-autoregressive models [5, 6] greatly speed up the synthesis process, synthesize more robust voice than autoregressive models, and have reached state-of-the-art performance. However, our experiments reveal that textual errors, whether simulated or produced by ASR, still substantially degrade the quality and pronunciation error rate of the synthesized voice.

Speech disentanglement [15, 16, 17] is a developing topic that aims to decompose speech into four disentangled components: text content, timbre, pitch, and rhythm. Timbre carries the speaker's identity, while pitch and rhythm relate to the speaking style. Inspired by this perspective, if we can decompose the overall system into two disentangled sub-parts, one processing text information and the other merging text with speaker-specific information (identity and speaking style), then we can design a training strategy that transfers the text information of a large corpus with correct transcription to the target spontaneous speech with imperfectly recognized transcription, without disturbing the learning of the target's speaking style.

In this paper we propose MHTTS, a multi-speaker TTS model based on multi-head structure that has three main advantages:

  • In the common multi-speaker scenario, MHTTS speeds up the synthesis process and produces higher quality voice than transformer based non-autoregressive models, e.g. Fastspeech. The computation complexity of the transformer is quadratic in the sequence length, while that of MHTTS is only linear.

  • In a two-speaker scenario where the transcription of one corpus is correct and that of the other is imperfect, MHTTS greatly reduces pronunciation errors in the synthesized voice of the imperfectly transcribed speaker by transferring text information.

  • MHTTS can utilize massive real-life spontaneous speech with ASR-generated transcription and synthesize spontaneous-style voice with few pronunciation errors. In MHTTS, we try to disentangle the learning of the text hidden representation from the learning of speaking style.

2 METHODOLOGY

2.1 theory

Suppose there are corpora $D_{i}, i=0,1,\dots,N-1$ of $N$ different speakers, and each $D_{i}$ contains text-speech pairs $(T_{i},A_{i})_{j}, j=0,\dots,M_{i}$. Here $A$ denotes the acoustic feature that can be converted to speech by a neural vocoder [18, 19, 20]. The following process is designed to predict $A_{i,j}$ from $T_{i,j}$: $T_{i,j}$ is first processed by a general text encoder block $F$, which takes the text data of all corpora as input, to get a text hidden representation $h_{i,j}$; then $h_{i,j}$ is processed by a speaker-specific block $G_{i}$ to predict $A_{i,j}$, as shown in Fig. 1. The blocks $G_{i}, i=0,1,\dots,N-1$ form the heads of MHTTS.

Fig. 1: Brief structure of our model. $F$ denotes the general text encoder block, $G$ denotes the speaker-specific blocks.
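To make the data flow concrete, here is a minimal PyTorch sketch of this layout. The encoder and head modules passed in are hypothetical stand-ins for blocks $F$ and $G_{i}$; only the forward pass is shown.

```python
import torch
import torch.nn as nn

class MultiHeadTTS(nn.Module):
    """Minimal sketch of the shared-encoder / per-speaker-head layout."""

    def __init__(self, shared_encoder: nn.Module, speaker_heads: nn.ModuleList):
        super().__init__()
        self.F = shared_encoder   # block F, shared by all corpora
        self.G = speaker_heads    # one block G_i per speaker i

    def forward(self, text_ids: torch.Tensor, speaker_id: int) -> torch.Tensor:
        h = self.F(text_ids)              # text hidden representation h_{i,j}
        return self.G[speaker_id](h)      # predicted acoustic features A_{i,j}
```

During training, each utterance is routed through the head matching its speaker, so the shared weights of $F$ see text from every corpus while each $G_{i}$ only ever sees its own speaker's data.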

To disentangle the learning of the text representation from the speaker-related features, the model should satisfy the following conditions:

  1. The text representations $h$ contain minimal speaker-specific information.

  2. The text representations $h$ should contain as much information about $A$ as possible, provided the first condition is satisfied.

  3. The speaker-specific blocks $G$ have the capacity to reconstruct $A$ from $h$.

The first condition guarantees that the weights of block $F$ contain minimal speaker-specific information [15]. The first and second conditions guarantee that block $F$ maps text data to good text hidden representations, in the sense that $h$ can easily be converted to $A$ using the speaker-specific information contained in the weights of block $G$. The second condition also helps the transfer of text information during training, since all corpora share block $F$: in the extreme case that block $F$ is the identity mapping, the multi-speaker TTS task degrades to $N$ disjoint single-speaker TTS tasks. The third condition is trivial.

From the perspective of information bottleneck theory [21], if we use $I(\cdot\,;\cdot)$ to denote mutual information, these conditions can be expressed as maximizing the following quantity:

$$L=\frac{1}{N}\sum_{i}\Big[-\lambda I(h_{i};\mathbbm{1})+\gamma I(h_{i};A_{i})+I(G_{i}(h_{i});A_{i})\Big] \qquad (1)$$

where $\mathbbm{1}$ denotes the indicator vector of the speaker, and $\lambda$ and $\gamma$ are weights. The three terms in Eqn. 1 correspond to the three conditions respectively.

The maximization of the first term $-I(h_{i};\mathbbm{1})$ is usually achieved by using a gradient reversal layer (GRL) [22] and a domain classifier [22] to encourage $h_{i}$ to be speaker-invariant. However, the performance of GRL is very sensitive to its hyperparameter, and it is not applicable in the 2-speaker scenario, especially when one corpus is much smaller. Instead, we found that Layer Normalization (LN) [23], which normalizes the output distributions of block $F$, serves the same purpose well.
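As a minimal sketch (assuming a PyTorch implementation), the learnable affine parameters of LN are simply disabled so the normalization cannot re-introduce a speaker-dependent scale or offset; re-scaling and re-centering are deferred to block $G$, as noted in Fig. 2(a).

```python
import torch
import torch.nn as nn

hidden_size = 512
# LayerNorm without learnable re-scaling/re-centering, applied to the output of block F.
norm = nn.LayerNorm(hidden_size, elementwise_affine=False)

h = torch.randn(8, 120, hidden_size)   # (batch, expanded text length, hidden)
h_normalized = norm(h)                 # zero mean, unit variance per frame
```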

The maximization of the third term $I(G_{i}(h_{i});A_{i})$ is actually the minimization of the target cost function, e.g. the $L1$- or $L2$-norm distance between the predicted acoustic features and the real features.

Now we consider the transformation $T_{i}\xrightarrow{F}h_{i}\xrightarrow{G_{i}}A_{i}$ to analyze the second term $I(h_{i};A_{i})$. Suppose that in expectation $A_{i}$ is well reconstructed from $T_{i}$, that is, $I(G_{i}(h_{i});A_{i})$ reaches a high value. If $G_{i}$ is the identity mapping, then $h_{i}=G_{i}(h_{i})$ and $I(h_{i};A_{i})=I(G_{i}(h_{i});A_{i})$, which is discouraged by the first term $-I(h_{i};\mathbbm{1})$ since $A_{i}$ contains full information about $\mathbbm{1}$. If instead $G_{i}$ has a very large capacity, $F$ may degrade to the identity mapping, in which case $h_{i}=T_{i}$ and $I(h_{i};A_{i})$ reaches its minimal value $I(T_{i};A_{i})$ by the data processing inequality. In general, $G_{i}$ should have just sufficient capacity to hold the speaker-specific information and add it to the text hidden representation $h_{i}$.

2.2 architecture

2.2.1 block $F$

The block $F$ transforms text data into the text hidden representation $h$. It is shared by all speaker corpora and accounts for most of the computation.

Fastspeech [5, 6] uses Feed-Forward Transformer [14] (FFT) blocks to build its encoder and decoder. The multi-head attention structure [14] in FFT makes it capable of modeling long-term dependencies and generating in parallel, but also makes the computation complexity quadratic in the sequence length.

To further speed up inference, we use a U-Net architecture [24, 25] to build block $F$. The down-sampling and up-sampling blocks in the U-Net can model long-term dependencies while keeping the computation complexity linear in the sequence length. The U-Net is followed by an LN layer, which replaces the GRL as discussed in Sec. 2.1. We use the Length Regulator [5], also used in Fastspeech, to handle the length mismatch between the text and acoustic feature sequences. The overall architecture of block $F$ is shown in Fig. 2.


(a) Overall architecture of block $F$. Here Layer Normalization is performed without re-scaling or re-centering; re-scaling and re-centering are performed afterwards in block $G$.


(b) Detailed architecture of the U-Net. $M$ down-sampling blocks (left) are followed by $M$ up-sampling blocks (right) with residual connections.
Fig. 2: Overall structure of block $F$.
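Below is a minimal PyTorch sketch of block $F$ in this spirit. The embedding front-end, ReLU activations, and nearest-neighbour upsampling are illustrative assumptions not specified in the paper, and the Length Regulator is omitted for brevity; the channel count (512), kernel size (3), and depth (7) follow Sec. 3.1.1.

```python
import torch
import torch.nn as nn
import torch.nn.functional as nnf

class DownBlock(nn.Module):
    """One down-sampling block: a strided 1-D convolution halving the time axis."""
    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size, stride=2,
                              padding=kernel_size // 2)

    def forward(self, x):
        return torch.relu(self.conv(x))

class UpBlock(nn.Module):
    """One up-sampling block: upsample to the skip length, convolve, add the skip."""
    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size,
                              padding=kernel_size // 2)

    def forward(self, x, skip):
        x = nnf.interpolate(x, size=skip.shape[-1], mode="nearest")
        return torch.relu(self.conv(x)) + skip   # residual connection

class BlockF(nn.Module):
    """Sketch of the shared text encoder: embedding -> 1-D U-Net -> affine-free LN."""
    def __init__(self, vocab_size: int, channels: int = 512, depth: int = 7):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, channels)
        self.downs = nn.ModuleList([DownBlock(channels) for _ in range(depth)])
        self.ups = nn.ModuleList([UpBlock(channels) for _ in range(depth)])
        self.norm = nn.LayerNorm(channels, elementwise_affine=False)

    def forward(self, text_ids):                   # text_ids: (batch, text length)
        x = self.embed(text_ids).transpose(1, 2)   # (batch, channels, time)
        skips = []
        for down in self.downs:
            skips.append(x)
            x = down(x)
        for up, skip in zip(self.ups, reversed(skips)):
            x = up(x, skip)
        return self.norm(x.transpose(1, 2))        # h: (batch, time, channels)
```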

2.2.2 block $G$

As discussed in Sec. 2.1, the design of block $G$ is non-trivial. We experimented with architectures for $G$ under the constraint that no temporal connections are allowed, i.e. there are no convolutional, recurrent, or fully-connected self-attention layers inside $G$. Fig. 3 shows the chosen architecture of $G$, which consists only of linear and LN layers.


Fig. 3: Architecture of block $G$.
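A minimal sketch of one head in this style follows; the number of layers and the 80-bin mel output are illustrative assumptions rather than the paper's exact configuration.

```python
import torch.nn as nn

class BlockG(nn.Module):
    """Sketch of one speaker-specific head built only from linear and LN layers.

    There are no convolutional, recurrent, or self-attention layers, so every
    output frame depends only on the corresponding frame of h.
    """
    def __init__(self, hidden_size: int = 512, n_mels: int = 80, n_layers: int = 3):
        super().__init__()
        layers = []
        for _ in range(n_layers):
            layers += [nn.Linear(hidden_size, hidden_size),
                       nn.LayerNorm(hidden_size),   # re-scaling / re-centering happen here
                       nn.ReLU()]
        layers.append(nn.Linear(hidden_size, n_mels))
        self.net = nn.Sequential(*layers)

    def forward(self, h):        # h: (batch, frames, hidden_size)
        return self.net(h)       # predicted mel-spectrogram frames
```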

3 EXPERIMENTS

In this section, we test MHTTS on the following subtopics: voice quality and inference speedup in the multi-speaker scenario, the ability to handle imperfect transcripts, and real-life spontaneous speech. We compare MHTTS with Fastspeech, one of the most popular non-autoregressive models, which offers fast and high-quality voice synthesis.

3.1 voice quality and inference speedup

3.1.1 experimental setting

We conduct multi-speaker experiments on AISHELL-3 [26], a large-scale, high-fidelity Mandarin speech corpus. The corpus contains 85 hours of emotion-neutral 44.1 kHz recordings spoken by 218 native Mandarin speakers. We extract mel-spectrograms with a 12.5 ms hop size and a 50 ms window as acoustic features. A well-trained WaveRNN [19] model is used as the vocoder to synthesize voice from mel-spectrograms.
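A minimal sketch of such feature extraction with librosa is given below; the file path, FFT size, and 80 mel bins are illustrative assumptions, since the paper does not specify its extraction toolkit.

```python
import librosa
import numpy as np

wav, sr = librosa.load("example.wav", sr=None)             # keep the native sample rate
hop_length = int(0.0125 * sr)                              # 12.5 ms hop
win_length = int(0.05 * sr)                                # 50 ms window
n_fft = 2 ** int(np.ceil(np.log2(win_length)))             # next power of two >= window

mel = librosa.feature.melspectrogram(
    y=wav, sr=sr, n_fft=n_fft, hop_length=hop_length,
    win_length=win_length, n_mels=80)
log_mel = np.log(np.clip(mel, a_min=1e-5, a_max=None))     # (n_mels, frames)
```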

We use the official implementation of Fastspeech (https://github.com/TensorSpeech/TensorFlowTTS) with its default hyperparameters, and adjust the hyperparameters of MHTTS so that it uses approximately the same number of parameters during inference. The hidden size and kernel size are 512 and 3, respectively, for all layers in MHTTS, and the number of down-sampling and up-sampling blocks is 7. We use the Adam [27] optimizer with $\beta_{1}=0.9, \beta_{2}=0.98, \epsilon=10^{-6}$ and a linearly decaying learning rate schedule. We train both models for 100K steps.
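For reference, a sketch of this optimizer setup in PyTorch follows; the peak learning rate of 1e-3 is an assumption, as the paper only specifies the Adam hyperparameters and the linear decay.

```python
import torch

model = torch.nn.Linear(512, 80)      # placeholder for the full MHTTS model
total_steps = 100_000

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3,
                             betas=(0.9, 0.98), eps=1e-6)
# Linear decay from the peak learning rate to zero over 100K steps.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: max(0.0, 1.0 - step / total_steps))
# Inside the training loop: loss.backward(); optimizer.step(); scheduler.step()
```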

3.1.2 evaluation

We randomly choose 4 speakers from AISHELL-3 (2 men and 2 women, each with 10 utterances) for evaluation. 20 native Mandarin speakers are asked to judge the voice quality and similarity of each utterance. Table 1 shows the mean opinion score (MOS) of voice quality, the similarity mean opinion score (SMOS), the number of active parameters, and the real time factor (RTF) of 4 methods: GT, the ground truth recording; GT mel + vocoder, voice synthesized by the vocoder from the mel-spectrogram of the ground truth recording; Fastspeech (FS); MHTTS (MH).

Method MOS SMOS Params Inference Speed (RTF)
GT $4.05\pm 0.11$ $4.40\pm 0.05$ - -
GT mel + vocoder $3.97\pm 0.11$ $4.33\pm 0.07$ - -
FS $3.53\pm 0.12$ $3.78\pm 0.10$ 30M $5.6\times 10^{-2}$
MH $3.72\pm 0.09$ $3.93\pm 0.09$ 27M $2.3\times 10^{-2}$
Table 1: MOS, SMOS, number of parameters, and inference speed of each case. MOS and SMOS are reported with 95% confidence intervals.
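The confidence intervals above can be reproduced with a standard normal approximation; a small sketch follows (the paper does not state which interval estimator it uses, so this is an assumption).

```python
import numpy as np

def mos_with_ci(scores, z: float = 1.96):
    """Mean opinion score with a normal-approximation 95% confidence interval."""
    scores = np.asarray(scores, dtype=float)
    mean = scores.mean()
    half_width = z * scores.std(ddof=1) / np.sqrt(len(scores))
    return mean, half_width

mean, ci = mos_with_ci([4, 5, 3, 4, 4, 5, 4, 3])   # toy ratings
print(f"MOS = {mean:.2f} ± {ci:.2f}")
```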

Higher MOS or SMOS indicates higher quality or similarity. RTF denotes the time (in seconds) required to synthesize one second of waveform; it is measured with a single thread on a single core of an Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz. In the regular multi-speaker scenario, MHTTS achieves slightly higher MOS and SMOS than Fastspeech and a 2.4x inference speedup with approximately the same number of parameters.
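RTF can be measured as in the sketch below; `synthesize` is a hypothetical callable that returns a waveform as a 1-D array of samples, and pinning PyTorch to one thread mirrors the single-thread setup described above.

```python
import time
import torch

torch.set_num_threads(1)   # single-thread CPU inference, as in the measurement setup

def real_time_factor(synthesize, text: str, sample_rate: int) -> float:
    """RTF = wall-clock synthesis time divided by the duration of the generated audio."""
    start = time.perf_counter()
    waveform = synthesize(text)                    # hypothetical text -> waveform call
    elapsed = time.perf_counter() - start
    audio_seconds = len(waveform) / sample_rate
    return elapsed / audio_seconds
```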

3.2 imperfect transcription

In this section, we conduct experiments to test the ability to transfer text information. We analyze two kinds of imperfect transcription: simulated errors and ASR transcription.

3.2.1 experimental setting of simulated errors

The feature extraction, model structure, and training strategy are identical to Sec. 3.1.1, except that the number of training steps is 20K. The dataset consists of two corpora:

  1. Source corpus: a large internal single-speaker Mandarin corpus containing 140000 female 16 kHz recordings with correct transcription.

  2. Target corpus: a well-known open-source high-quality Mandarin corpus (https://www.data-baker.com) containing 10000 male 48 kHz recordings with correct transcription. All recordings are converted to 16 kHz.

3.2.2 evaluation of simulated errors

We corrupt the transcription of the target corpus in a similar way to [12]. For every character of a sentence, with probability $P$, the character is corrupted by one of the following methods (a minimal sketch of the procedure follows the list):

  1. Insertion: a random character from the vocabulary is inserted next to it.

  2. Deletion: the character is deleted.

  3. Replacement: the character is replaced by a random character from the vocabulary.
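The corruption procedure can be sketched as follows; the uniform choice among the three operations is an assumption about details the paper (and [12]) leave open.

```python
import random

def corrupt_transcript(chars, vocab, p):
    """Corrupt each character with probability p by insertion, deletion, or replacement."""
    out = []
    for ch in chars:
        if random.random() < p:
            op = random.choice(["insert", "delete", "replace"])
            if op == "insert":
                out.extend([ch, random.choice(vocab)])   # keep it, add a random neighbour
            elif op == "delete":
                continue                                  # drop the character
            else:
                out.append(random.choice(vocab))          # swap in a random character
        else:
            out.append(ch)
    return out

# Example: corrupt_transcript(list("今天天气很好"), vocab=list("今天天气很好吗呢"), p=0.2)
```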

We vary the corruption probability $P$ and the numbers of utterances in the source corpus and the corrupted target corpus, and evaluate the character error rate (CER) of the synthesized voice of the target speaker. We compare three methods: MHTTS (MH); Fastspeech (FS); and Fastspeech_single (FS_s), which is Fastspeech trained only on the corrupted target corpus. For each method we synthesize 200 utterances (2524 characters) for CER evaluation.
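CER itself is the Levenshtein edit distance between the intended text and a transcript of the synthesized audio (presumably obtained by ASR or human listeners; the paper does not specify), normalized by the reference length. A minimal implementation of the metric is sketched below.

```python
def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = (substitutions + deletions + insertions) / reference length."""
    m, n = len(reference), len(hypothesis)
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i
    for j in range(n + 1):
        dist[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # substitution
    return dist[m][n] / max(m, 1)
```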

#Source #Target $P$ FS_s CER(%) FS CER(%) MH CER(%)
10000 500 10% 5.62 5.07 1.03
20% 9.51 5.56 0.71
30% 14.50 8.64 1.27
40% 21.16 12.04 1.03
50% 32.09 14.66 1.67
1000 10% 3.24 2.46 0.79
20% 6.50 2.85 0.87
30% 9.03 4.12 1.74
40% 15.69 7.45 1.74
50% 29.00 9.83 1.58
2000 10% 3.88 1.66 0.79
20% - 2.06 0.87
30% - 1.90 1.66
40% - 4.04 1.90
50% - 5.71 1.98
4000 10% - 1.03 0.87
20% - 1.74 1.11
30% - - 2.30
40% - - 2.14
50% - - 1.82
Table 2: CER evaluated on simulated errors with 10000 utterances of the source corpus. "-" indicates that the synthesized speech is unrecognizable.
#Source #Target $P$ FS_s CER(%) FS CER(%) MH CER(%)
50000 500 10% 5.62 3.96 0.55
20% 9.51 4.91 0.87
30% 14.50 7.37 0.87
40% 21.16 11.65 0.95
50% 32.09 15.53 1.66
1000 10% 3.24 2.69 0.39
20% 6.50 2.38 0.79
30% 9.03 3.57 1.35
40% 15.69 8.32 1.43
50% 29.00 8.40 1.69
2000 10% 3.88 1.51 0.71
20% - 1.51 0.87
30% - 2.22 1.03
40% - 3.49 2.14
50% - 4.36 1.51
4000 10% - 1.26 1.03
20% - 1.35 1.11
30% - 1.82 1.74
40% - - 1.35
50% - - 1.74
Table 3: CER evaluated on simulated errors with 50000 utterances of the source corpus.
#Source #Target $P$ FS_s CER(%) FS CER(%) MH CER(%)
100000 500 10% 5.62 3.88 0.95
20% 9.51 5.23 0.87
30% 14.50 6.58 0.63
40% 21.16 11.73 1.27
50% 32.09 13.95 1.27
1000 10% 3.24 1.66 0.48
20% 6.50 3.25 0.95
30% 9.03 3.41 0.95
40% 15.69 6.26 1.82
50% 29.00 6.66 1.35
2000 10% 3.88 1.27 0.71
20% - 1.51 1.03
30% - 2.85 1.82
40% - 2.22 1.98
50% - 4.75 1.90
4000 10% - 1.03 0.71
20% - 1.03 1.27
30% - 2.06 1.82
40% - - 1.72
50% - - 1.76
Table 4: CER evaluated on simulated errors with 100000 utterances of the source corpus.

Table 2, Table 3 and Table 4 show the results, from which we draw several conclusions:

  1. MHTTS performs consistently and substantially better than Fastspeech and Fastspeech_single, demonstrating a much stronger capability of transferring text information.

  2. Fastspeech also shows some capability of transferring text information when compared with Fastspeech_single.

  3. The CER of Fastspeech and Fastspeech_single tends to drop as the number of target utterances increases, but training collapse may occur. MHTTS performs stably.

3.2.3 experimental setting of ASR transcription

The feature extraction, model structure, and training strategy are identical to Sec. 3.2.1. The dataset consists of two corpora:

  1. Source corpus: the same source corpus used in Sec. 3.2.1. All recordings are converted to 8 kHz.

  2. Target corpus: real-life Mandarin spontaneous 8 kHz speech of 4 speakers (2 men and 2 women), recorded in an internal customer service hotline system. Transcription is not available.

3.2.4 evaluation of ASR transcription

We use an ASR module to transcribe the target corpus and compare the CER of the synthesized voice of the target speakers, similarly to Sec. 3.2.2. Results are shown in Table 5. MHTTS performs significantly better than Fastspeech, and MHTTS is even more robust to ASR transcription errors than to simulated errors.

#Source ASR CER(%) Target ID Gender #Target FS CER(%) MH CER(%)
140000 14.09 A male 1164 9.86 0.87
B female 1265 6.58 0.63
C female 2059 8.08 0.71
D male 3110 7.29 0.71
Table 5: CER evaluated on spontaneous real data with ASR transcription.

3.3 spontaneous style

In this section we conduct experiments to test if MHTTS can synthesize expressive voice given spontaneous speech in real life.

3.3.1 experimental setting

The feature extraction, model structure, training strategy and dataset are identical to Sec.3.2.3.

3.3.2 evaluation

We use the same ASR module as in Sec. 3.2.4 to transcribe the target spontaneous corpus. We compare the MOS and SMOS of the following methods: GT, the ground truth spontaneous recording; Fastspeech (FS); MHTTS (MH). 20 native Mandarin speakers are asked to judge voice quality and similarity, focusing on spontaneous style (timbre, pitch, intonation, rhythm and stress) when evaluating similarity. Table 6 shows the results.

Settings MOS SMOS
#Source Target ID GT FS MH GT FS MH
140000 A $4.01\pm 0.12$ $2.20\pm 0.14$ $3.43\pm 0.13$ $4.31\pm 0.09$ $2.47\pm 0.20$ $3.66\pm 0.19$
B $3.81\pm 0.16$ $2.18\pm 0.17$ $3.38\pm 0.14$ $4.22\pm 0.15$ $2.49\pm 0.18$ $3.41\pm 0.18$
C $3.98\pm 0.13$ $2.02\pm 0.17$ $3.27\pm 0.18$ $4.49\pm 0.11$ $2.79\pm 0.20$ $3.61\pm 0.18$
D $3.85\pm 0.17$ $1.98\pm 0.16$ $3.31\pm 0.17$ $3.95\pm 0.13$ $2.11\pm 0.18$ $3.12\pm 0.19$
Table 6: MOS and SMOS evaluated on real spontaneous data with ASR transcription.

As can be seen, the performance of Fastspeech is much worse than that of MHTTS, probably for two reasons: Fastspeech is not robust to ASR transcription errors, and the real spontaneous recordings contain varied pronunciation styles for each phoneme as well as speaker noises that do not correspond to any word. This result demonstrates the effectiveness of MHTTS in dealing with real-life spontaneous speech.

4 CONCLUSION

In this paper we propose MHTTS, a fast multi-speaker TTS system that utilizes untranscribed real-life spontaneous speech. We introduce a multi-head structure that disentangles the learning of the text representation from speaker-related information, which enables text information transfer and training with spontaneous speech. We achieve better results on several subtopics compared with a state-of-the-art model. This work is an important step towards automatic TTS training on real-life spontaneous speech. For future work, we will explore the network structure and training strategy in more detail to further reduce CER and synthesize more expressive voice.

References

  • [1] Andrew Gibiansky, Sercan Arik, Gregory Diamos, John Miller, Kainan Peng, Wei Ping, Jonathan Raiman, and Yanqi Zhou, “Deep voice 2: Multi-speaker neural text-to-speech,” in Advances in Neural Information Processing Systems. 2017, vol. 30, Curran Associates, Inc.
  • [2] Wei Ping, Kainan Peng, Andrew Gibiansky, Sercan Ömer Arik, Ajay Kannan, Sharan Narang, Jonathan Raiman, and John Miller, “Deep voice 3: 2000-speaker neural text-to-speech.,” CoRR, vol. abs/1710.07654, 2017.
  • [3] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerry-Ryan, et al., “Natural tts synthesis by conditioning wavenet on mel spectrogram predictions,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 4779–4783.
  • [4] Naihan Li, Shujie Liu, Yanqing Liu, Sheng Zhao, and Ming Liu, “Neural speech synthesis with transformer network,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2019, vol. 33, pp. 6706–6713.
  • [5] Y. Ren, Y. Ruan, X. Tan, T. Qin, S. Zhao, Z. Zhao, and T.-Y. Liu, “Fastspeech: Fast, robust and controllable text to speech,” in NeurIPS 2019, November 2019.
  • [6] Yi Ren, Chenxu Hu, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu, “Fastspeech 2: Fast and high-quality end-to-end text to speech,” CoRR, vol. abs/2006.04558, 2020.
  • [7] Yuzi Yan, Xu Tan, Bohan Li, Tao Qin, Sheng Zhao, Yuan Shen, and Tie-Yan Liu, “Adaspeech 2: Adaptive text to speech with untranscribed data,” in ICASSP 2021. 2021, pp. 6613–6617, IEEE.
  • [8] Hieu-Thi Luong and Junichi Yamagishi, “Nautilus: A versatile voice cloning system,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 2967–2981, 2020.
  • [9] Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Nagendra Goel, Mirko Hannemann, Yanmin Qian, Petr Schwarz, and Georg Stemmer, “The kaldi speech recognition toolkit,” in IEEE 2011 Workshop on Automatic Speech Recognition and Understanding, 2011.
  • [10] Wei Han, Zhengdong Zhang, Yu Zhang, Jiahui Yu, Chung-Cheng Chiu, James Qin, Anmol Gulati, Ruoming Pang, and Yonghui Wu, “Contextnet: Improving convolutional neural networks for automatic speech recognition with global context,” in INTERSPEECH, 2020.
  • [11] Zhuoyuan Yao, Di Wu, Xiong Wang, Binbin Zhang, Fan Yu, Chao Yang, Zhendong Peng, Xiaoyu Chen, Lei Xie, and Xin Lei, “Wenet: Production oriented streaming and non-streaming end-to-end speech recognition toolkit,” in INTERSPEECH, 2021.
  • [12] Jason Fong, Pilar Oplustil Gallegos, Zack Hodari, and Simon King, “Investigating the robustness of sequence-to-sequence text-to-speech models to imperfectly-transcribed training data,” in INTERSPEECH, 2019.
  • [13] H. Tachibana, K. Uenoyama, and S. Aihara, “Efficiently trainable text-to-speech system based on deep convolutional networks with guided attention,” in ICASSP, 2018.
  • [14] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems. 2017, vol. 30, Curran Associates, Inc.
  • [15] Alessandro Achille and Stefano Soatto, “Emergence of invariance and disentanglement in deep representations,” J. Mach. Learn. Res., vol. 19, pp. 50:1–50:34, 2018.
  • [16] Kaizhi Qian, Yang Zhang, Shiyu Chang, David Cox, and Mark Hasegawa-Johnson, “Unsupervised speech decomposition via triple information bottleneck,” in ICML 2020, 2020, pp. 7792–7802.
  • [17] Jennifer Williams and Simon King, “Disentangling style factors from speaker representations.,” in Interspeech, 2019, pp. 3945–3949.
  • [18] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, “Wavenet: A generative model for raw audio,” in 9th ISCA Speech Synthesis Workshop, 2016.
  • [19] N. Kalchbrenner, E. Elsen, K. Simonyan, S. Noury, N. Casagrande, E. Lockhart, F. Stimberg, A. van den Oord, S. Dieleman, and K. Kavukcuoglu, “Efficient neural audio synthesis,” in ICML, 2018.
  • [20] Kundan Kumar, Rithesh Kumar, Thibault de Boissiere, Lucas Gestin, Wei Zhen Teoh, Jose Sotelo, Alexandre de Brébisson, Yoshua Bengio, and Aaron C. Courville, “Melgan: Generative adversarial networks for conditional waveform synthesis,” in Advances in Neural Information Processing Systems, 2019, pp. 14881–14892.
  • [21] Andrew M Saxe, Yamini Bansal, Joel Dapello, Madhu Advani, Artemy Kolchinsky, Brendan D Tracey, and David D Cox, “On the information bottleneck theory of deep learning,” Journal of Statistical Mechanics: Theory and Experiment, vol. 2019, no. 12, pp. 124020, 2019.
  • [22] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky, “Domain-adversarial training of neural networks,” Journal of Machine Learning Research, vol. 17, no. 59, pp. 1–35, 2016.
  • [23] Lei Jimmy Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton, “Layer normalization,” CoRR, vol. abs/1607.06450, 2016.
  • [24] O. Ronneberger, P.Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in MICCAI, 2015, vol. 9351, pp. 234–241.
  • [25] Dabiao Ma, Zhiba Su, Wenxuan Wang, and Yuhao Lu, “FPETS: fully parallel end-to-end text-to-speech system,” in AAAI, 2020, pp. 8457–8463.
  • [26] Yao Shi, Hui Bu, Xin Xu, Shaoji Zhang, and Ming Li, “AISHELL-3: A multi-speaker mandarin TTS corpus and the baselines,” CoRR, vol. abs/2010.11567, 2020.
  • [27] Diederik P. Kingma and Jimmy Ba, “Adam: A method for stochastic optimization,” in ICLR 2015, 2015.