A study on the efficacy of model pre-training in developing neural text-to-speech system

Abstract

In the development of neural text-to-speech systems, model pre-training with a large amount of non-target speakers’ data is a common approach. However, in terms of ultimately achieved system performance for target speaker(s), the actual benefits of model pre-training are uncertain and unstable, depending very much on the quantity and text content of training data. This study aims to understand better why and how model pre-training can positively contribute to TTS system performance. It is postulated that the pre-training process plays a critical role in learning text-related variation in speech, while further training with the target speaker’s data aims to capture the speaker-related variation. Different test sets are created with varying degrees of similarity to target speaker data in terms of text content. Experiments show that leveraging a speaker-independent TTS trained on speech data with diverse text content can improve the target speaker TTS on domain-mismatched text. We also attempt to reduce the amount of pre-training data for a new text domain and improve the data and computational efficiency. It is found that the TTS system could achieve comparable performance when the pre-training data is reduced to 1/8 of its original size.

Index Terms—  Text to Speech, Pre-training, Data Reduction

1 Introduction

In recent years, neural text-to-speech (TTS) technology has demonstrated significant success in generating high-quality speech with good naturalness and expressiveness for a target speaker [1, 2, 3, 4, 5]. In research and development of neural TTS, the text content of training and test data is often highly similar and in the same text domain. For many real-world applications, TTS systems need to deal with text input with arbitrary content across a wide range of domains. Their performance may deteriorate substantially on domain-mismatched text [6] due to the limited content and domain coverage of training data. It is generally costly or impractical to increase the quantity and diversity of training data for a specific target speaker, whilst speech data from other “non-target” speakers may be easily accessible and available. Leveraging a large amount of non-target speakers’ data from different sources has become a common and appealing approach to developing high-performance TTS systems when training data from the target speaker(s) are limited [7, 8, 9, 10, 11]. However, it has been noticed that the actual benefits of using training speech from other speakers can be uncertain and unstable [7]. Understandably, the benefits depend very much on the content and quality of the non-target speakers’ data. Three aspects need to be considered: (1) coverage and domain of the text content; (2) speaker similarity with respect to the target speaker; and (3) acoustic condition. The present study mainly focuses on the aspect of text content.

Our preliminary explorations showed that TTS systems trained solely on the target speaker’s data did not perform well in predicting appropriate prosody for domain-mismatched input text. Prosody is of utmost importance in determining the naturalness and expressiveness of speech. We postulate that speech prosody can be seen as the combination of two components of variation in speech, namely text-based variation and speaker-based variation. The text-based component, termed text prosody henceforth, refers to the general prosodic characteristics that are basic and essential to expressing the intended text content, e.g., lexical tone, lexical stress and sentence intonation [12]. If text prosody is not realized or controlled properly, the synthesized speech will sound unnatural and inappropriate, even if the pronunciation is largely correct. In [13], native listeners were found to be capable of detecting abnormality in non-native speech whose International Phonetic Alphabet (IPA) transcription is identical to that produced by native speakers. The speaker-based component of prosodic variation [14], termed speaker prosody, is concerned primarily with an individual’s speaking style. In particular, we differentiate it from timbre, which refers mainly to voice (phonation) characteristics.

We conjecture that text prosody can be captured by pre-training the TTS model on a large amount of speech data with diverse text content involving multiple non-target speakers. A speech generation system fine-tuned from such a pre-trained model with the target speaker’s data is expected to perform better on domain-mismatched text input than a corresponding system built without a pre-trained model. In a typical multi-speaker TTS model design [15, 16], text content and speaker identity are processed by separate modules: a speaker-independent TTS model operates on text embeddings, while speaker embeddings carry speaker identity. The speaker-independent TTS model deals only with text information, regardless of speaker variation. After being fine-tuned on target speaker data, the speaker-independent TTS model is expected to retain the text prosody learned during pre-training. In this way, effective pre-training with diverse text content can contribute to the performance of the target speaker TTS system in handling domain-mismatched text.

In some cases, it may not be convenient to collect extensive speech data with diverse content for model pre-training. Traditionally, large amounts of parallel text-audio data are acquired by: (1) studio recording, with the target speaker(s) reading text scripts for many hours; or (2) downloading long speech recordings from the Internet, dividing them into sentence-level utterances and aligning them with given text transcriptions [17, 18, 19]. Both approaches are tedious, labour-intensive and costly. Using an excessive amount of pre-training data may also raise other concerns, namely long training time and high demand for computational resources. For applications of speech generation in specific text domains, we investigate different approaches to reducing the required amount of pre-training data while maintaining the desired TTS performance.

The contributions of this paper are as follows. First, we show that using diverse speech data to pre-train a speaker-independent TTS model can improve the performance of the target speaker TTS on domain-mismatched text. Second, test sets with different degrees of similarity to the text domain of the target speaker data are designed and used to study how the pre-trained speaker-independent TTS model improves target speaker TTS performance. Third, we propose a method to improve data and computational efficiency by reducing the pre-training data for a specific new text domain.

Fig. 1: Diagram of the TTS model

2 Data Description

2.1 Pre-training Data

Pre-training data are used to train a speaker-independent model, which is then fine-tuned on target speaker data. As our research questions concern the content coverage and domain of the pre-training data, a large-scale English speech corpus created for automatic speech recognition (ASR), namely LibriSpeech, is adopted [18]. LibriSpeech contains around 1,000 hours of speech data from 2,484 speakers. The speech content is based on LibriVox audiobooks, which cover a wide range of topics. The complete LibriSpeech corpus is used as pre-training data. For the investigation of pre-training data reduction, 40,000 text-audio pairs (around 1/8 of the data in LibriSpeech) are used.

2.2 Target Speaker Data

The target speaker data for fine-tuning the pre-trained speaker-independent TTS model are obtained from the LJSpeech corpus, which contains 13,100 audio clips (about 24 hours) of high-quality English speech produced by a female speaker reading passages from 7 non-fiction books. Experiments are carried out to evaluate the performance of TTS systems fine-tuned with the whole LJSpeech corpus or with smaller subsets of it.

3 Effect of Pre-training on Target Speaker TTS Performance

Two TTS systems are used to show how the pre-trained speaker-independent TTS model improves target speaker TTS performance. The first system is trained only on the target speaker data, i.e., without pre-training on other data, and is named TTS w/o pre-training. The second system, named TTS w/ pre-training, is built by using the target speaker data (LJSpeech) to fine-tune a speaker-independent model pre-trained on LibriSpeech. The two systems are evaluated on three specially designed test sets, which have different degrees of similarity to the text content of the target speaker data. The three test sets are named T-SIM, T-RAN and T-DIFF, in order from the highest to the lowest degree of similarity to the text content of the target speaker data.

3.1 TTS Model Description

The TTS model, as shown in Figure 1, consists of a trainable speaker table and a speaker-independent TTS model. The speaker table models the voice characteristics of individual speakers. The speaker-independent model follows the non-autoregressive TTS system FastSpeech 2 [5]. In this study, the TTS model predicts pitch at the phoneme level, which is found to give better TTS performance than the frame-level prediction in FastSpeech 2. The text encoder converts the phoneme sequence into text embeddings, which are combined with speaker embeddings to predict pitch and duration using the respective predictor modules. The length regulator (LR) module up-samples all phoneme-level features to frame-level features, which are then transformed into a mel-spectrogram by the decoder. The MelGAN vocoder [20] is used to generate the speech waveform from the mel-spectrogram.
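
To make the data flow concrete, the following PyTorch sketch shows how a trainable speaker table can condition a FastSpeech 2-style speaker-independent backbone with phoneme-level pitch and duration prediction and a length regulator. The module internals, dimensions and duration handling are simplified illustrative assumptions rather than the exact implementation used in this work.

import torch
import torch.nn as nn

class SketchTTS(nn.Module):
    def __init__(self, n_phonemes=100, n_speakers=2500, d_model=256, n_mels=80):
        super().__init__()
        # Trainable speaker table: one embedding vector per speaker.
        self.speaker_table = nn.Embedding(n_speakers, d_model)
        # Speaker-independent modules (simple stand-ins for FastSpeech 2 blocks).
        self.text_encoder = nn.Embedding(n_phonemes, d_model)
        self.pitch_predictor = nn.Linear(d_model, 1)      # phoneme-level pitch
        self.duration_predictor = nn.Linear(d_model, 1)   # phoneme-level log-duration
        self.pitch_embed = nn.Linear(1, d_model)
        self.decoder = nn.Linear(d_model, n_mels)

    def forward(self, phonemes, speaker_id):
        # phonemes: LongTensor (T_ph,); speaker_id: LongTensor scalar.
        text_emb = self.text_encoder(phonemes)                  # (T_ph, d)
        spk_emb = self.speaker_table(speaker_id)                # (d,)
        h = text_emb + spk_emb                                  # speaker conditioning
        pitch = self.pitch_predictor(h)                         # phoneme-level pitch
        log_dur = self.duration_predictor(h).squeeze(-1)
        durations = log_dur.exp().round().clamp(min=1).long()   # frames per phoneme
        h = h + self.pitch_embed(pitch)
        frames = torch.repeat_interleave(h, durations, dim=0)   # length regulator (LR)
        return self.decoder(frames)                             # mel-spectrogram (T_frames, n_mels)

model = SketchTTS()
mel = model(torch.randint(0, 100, (12,)), torch.tensor(3))      # one 12-phoneme utterance, speaker 3

Under this sketch, pre-training updates the shared speaker-independent modules with data from all non-target speakers, and fine-tuning then continues training with the target speaker's data.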

3.2 Design of Domain-mismatched Test Sets

We collect around 50,000 sentences from various sources to form a general text set for assessing TTS performance on domain-mismatched text. Three test sets, each containing 60 sentences, are selected from the general text set for subjective evaluation. A phoneme-based subword bigram language model is trained on the text domain of the target speaker data (LJSpeech). The phoneme-level subwords for each word are obtained by applying Byte Pair Encoding (BPE) [21] to the phoneme sequence of the word, with a subword vocabulary size of 200. The language model is applied to calculate perplexity scores for all 50,000 sentences in the general text set. A sentence similar to the text domain of the target speaker data achieves low perplexity under the language model trained on that text domain. The 60 sentences with the lowest perplexity scores make up the test set T-SIM, which is considered to lie in the same or a similar domain as the target speaker text data. The test set T-DIFF, which has a low degree of similarity to the text domain of the target speaker data, comprises the 60 sentences with the highest perplexity scores. The 60 sentences in T-RAN are randomly sampled from the general text set.
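
The following is a minimal Python sketch of this perplexity-based test-set construction. It assumes the sentences have already been phonemized and segmented into phoneme-level BPE subwords (e.g. with a 200-unit BPE model), and it uses simple add-one smoothing for the bigram language model; these details are illustrative assumptions rather than the exact recipe used in this work.

import math
import random
from collections import Counter

def train_bigram(tokenized_corpus, vocab_size=200):
    # Bigram LM over phoneme-level BPE subwords with add-one smoothing.
    unigrams, bigrams = Counter(), Counter()
    for toks in tokenized_corpus:
        toks = ["<s>"] + toks + ["</s>"]
        unigrams.update(toks[:-1])
        bigrams.update(zip(toks[:-1], toks[1:]))
    def logprob(prev, cur):
        return math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size + 2))
    return logprob

def perplexity(logprob, toks):
    toks = ["<s>"] + toks + ["</s>"]
    ll = sum(logprob(p, c) for p, c in zip(toks[:-1], toks[1:]))
    return math.exp(-ll / (len(toks) - 1))

def build_test_sets(general_set, target_domain_tokens, n=60, seed=0):
    # general_set: list of (sentence, subword_tokens) pairs from the 50,000-sentence pool.
    lm = train_bigram(target_domain_tokens)                        # LM on the LJSpeech text domain
    ranked = sorted(general_set, key=lambda st: perplexity(lm, st[1]))
    t_sim = [s for s, _ in ranked[:n]]                             # lowest perplexity (most similar)
    t_diff = [s for s, _ in ranked[-n:]]                           # highest perplexity (most different)
    t_ran = [s for s, _ in random.Random(seed).sample(general_set, n)]
    return t_sim, t_diff, t_ran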

4 Pre-training Data Reduction for a New Text Domain

The previous section investigates how the pre-trained speaker-independent TTS model improves the target speaker TTS system on the general text set. In some cases, however, we want to improve the target speaker TTS performance on a specific new text domain. We assume that a speaker-independent model pre-trained on data similar to the new text domain can effectively transfer the text-based prosody to the target speaker TTS system. To examine this idea, around 9,000 sentences from novels are prepared as the new text domain, which differs from the text domain of the target speaker data. We aim to select a subset of LibriSpeech as pre-training data whilst achieving comparable performance for the target speaker TTS system. Two methods are developed to select data with high text similarity to the new text domain from the LibriSpeech corpus.

4.1 Perplexity-Based Method

In this method, a phoneme-based subword bigram language model is trained on the new text domain. Text in the LibriSpeech corpus with low perplexity under this language model is assumed to have a high degree of similarity to the new text domain, and the corresponding text-audio pairs in LibriSpeech are selected as pre-training data.
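
A brief sketch of this selection step, reusing the train_bigram and perplexity helpers from the sketch in subsection 3.2, is shown below; the corpus representation and field names are illustrative assumptions.

def select_by_perplexity(librispeech, new_domain_tokens, k=40000):
    # librispeech: list of dicts such as {"audio": path, "text": str, "tokens": [subwords]}.
    lm = train_bigram(new_domain_tokens)                           # LM trained on the new text domain
    ranked = sorted(librispeech, key=lambda u: perplexity(lm, u["tokens"]))
    return ranked[:k]                                              # k lowest-perplexity text-audio pairs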

4.2 BERT-Based Method

BERT [22] learns general-purpose text representations from a large amount of open-domain text data. In this study, a pre-trained BERT model takes the text input and generates token representations for each sentence. The sentence-level vector is obtained by average pooling the token representations from the last encoder layer of BERT over the sentence. The new text domain is then represented by the centroid of the sentence-level vectors of the text belonging to the new text domain. The degree of similarity between a given text and the new text domain is measured by the L2 distance between the sentence-level vector of that text and the centroid vector of the new text domain. The text-audio pairs with the highest text similarity to the new text domain are selected as pre-training data.
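
The sketch below illustrates this selection with the Hugging Face Transformers library; the checkpoint name, one-sentence-at-a-time processing and the pooling over all tokens (including special tokens) are simplifications assumed for illustration.

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased").eval()

@torch.no_grad()
def sentence_vector(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    hidden = bert(**inputs).last_hidden_state                     # (1, T, 768), last encoder layer
    return hidden.mean(dim=1).squeeze(0)                          # average pooling over tokens

def select_by_bert(librispeech_texts, new_domain_texts, k=40000):
    # The new text domain is represented by the centroid of its sentence vectors.
    centroid = torch.stack([sentence_vector(t) for t in new_domain_texts]).mean(dim=0)
    scored = [(torch.dist(sentence_vector(t), centroid).item(), i)  # L2 distance to the centroid
              for i, t in enumerate(librispeech_texts)]
    scored.sort()                                                  # smallest distance = most similar
    return [i for _, i in scored[:k]]                              # indices of the selected pairs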

5 Results and Analysis

All subjective tests are conducted on our internal crowdsourced listening test platform, with at least 15 native judges for the comparison MOS (CMOS) test [23] and 15 judges for the five-point MOS test for each test case. Readers are encouraged to listen to the demo examples at https://patrick-g-zhang.github.io/pt-reduction/.

5.1 Subjective Evaluations on Domain-Mismatched Text

Table 1 shows the MOS and CMOS comparison between TTS w/o pre-training and TTS w/ pre-training on the three test sets described in subsection 3.2. The target speaker data is the 24-hour LJSpeech corpus, and the pre-training data is LibriSpeech. TTS w/o pre-training achieves the highest MOS on T-SIM and the lowest on T-DIFF, indicating that its performance drops as the test text diverges from the text domain of the target speaker data. On T-DIFF and T-RAN, TTS w/ pre-training outperforms TTS w/o pre-training by a large margin, showing that the pre-trained speaker-independent model improves the performance of the target speaker TTS on domain-mismatched text. Moreover, the MOS gap and CMOS between the two systems are more prominent on T-DIFF than on T-RAN, meaning that the improvement grows as the test text becomes more different from the text domain of the target speaker data. On T-SIM, TTS w/ pre-training shows no performance gain (CMOS -0.033) over TTS w/o pre-training, which might result from the fact that T-SIM is similar to the text domain of the target speaker data (LJSpeech), so the text-based prosody learned by the pre-trained speaker-independent model does not benefit TTS w/ pre-training on T-SIM. Since T-RAN is randomly sampled and can represent the general text set, we can claim that TTS w/ pre-training outperforms TTS w/o pre-training on domain-mismatched text.

Table 1: The MOS and CMOS comparison on three test sets when the target speaker data is 24-hour LJSpeech.
Set      TTS w/o pre-training   TTS w/ pre-training   CMOS
T-SIM    3.95 ± 0.06            3.93 ± 0.07           -0.033
T-DIFF   3.67 ± 0.07            3.79 ± 0.06           +0.318
T-RAN    3.88 ± 0.08            3.98 ± 0.07           +0.287

As shown in Table 2, the CMOS results of the two TTS systems, in terms of pronunciation and prosody, are compared on T-RAN. When comparing the two systems, the raters focus only on pronunciation or prosody (tone and rhythm) rather than the overall impression of the speech. Compared with the baseline TTS w/o pre-training system, TTS w/ pre-training mainly improves the prosodic variation of speech rather than pronunciation. This result agrees with our assumption that TTS w/ pre-training improves on domain-mismatched text mainly because the text-based prosody learnt from the pre-training data is transferred to the target speaker TTS system.

Table 2: The pronunciation and prosody CMOS comparison on T-RAN.
Focus setting      CMOS
pronunciation      -0.041
tone and rhythm    +0.157

The two TTS systems are also compared when a 1.5-hour subset of the LJSpeech corpus is used for fine-tuning, as shown in Table 3. On all three test sets, TTS w/ pre-training performs much better than TTS w/o pre-training. Although 1.5 hours of speech data is sufficient for a TTS system to produce intelligible speech [7], the speech generated by TTS w/o pre-training still sounds unstable and jittery due to model overfitting. The trend across the three test sets is the same as in Table 1: the improvement is most notable on T-DIFF and least marked on T-SIM. TTS w/ pre-training performs better than TTS w/o pre-training on T-SIM in this case, unlike the corresponding result in Table 1 where the target speaker data is over 20 hours. After listening to the audio samples, we attribute the improvement on T-SIM to the more stable voice of TTS w/ pre-training, which suggests that pre-training mitigates the overfitting problem when the target speaker data is limited.

Table 3: The MOS and CMOS comparison on three test sets when the target speaker data is 1.5-hour LJSpeech.
Set      TTS w/o pre-training   TTS w/ pre-training   CMOS
T-SIM    3.72 ± 0.07            3.81 ± 0.06           +0.394
T-DIFF   3.34 ± 0.08            3.65 ± 0.07           +0.958
T-RAN    3.66 ± 0.08            3.84 ± 0.07           +0.591

5.2 TTS Performance Evaluation on Pre-training Data Reduction for a New Text Domain

For the pre-training data reduction task, 60 sentences are randomly sampled from the new text domain for subjective evaluation. Four target speaker TTS systems are evaluated, as shown in Table 4. All four systems are fine-tuned from pre-trained speaker-independent TTS models and differ only in their pre-training data: i) Random: 40,000 text-audio pairs randomly sampled from LibriSpeech, which serves as the baseline system; ii) Full: the complete LibriSpeech corpus, which serves as the topline system; iii) Perplexity-based: 40,000 text-audio pairs selected from LibriSpeech with the method described in subsection 4.1; iv) BERT-based: 40,000 text-audio pairs selected from LibriSpeech using the method described in subsection 4.2.

Table 4: The TTS performance comparison among four target speaker TTS systems.
Pre-training data   MOS           CMOS1    CMOS2
Random              3.84 ± 0.08   0.0      -
Full                3.94 ± 0.07   +0.304   0.0
Perplexity-based    3.90 ± 0.08   +0.246   -0.063
BERT-based          3.91 ± 0.08   +0.260   -0.035

We conduct CMOS tests to compare the two target speaker TTS models whose pre-training data is selected by the proposed methods against the target speaker TTS model with Random pre-training data, shown in column CMOS1 of Table 4. The target speaker TTS systems with both BERT-based and Perplexity-based pre-training data significantly improve TTS performance on the new text domain compared with the baseline system (target speaker TTS with Random pre-training data). This confirms the effectiveness of the proposed pre-training data reduction approaches. CMOS tests are also performed to compare these systems with the topline system, as shown in column CMOS2 of Table 4. The target speaker TTS systems with both BERT-based and Perplexity-based pre-training data are comparable to the topline system.

6 Conclusion

In this study, we show that a pre-trained speaker-independent TTS model can improve the performance of the target speaker TTS model on domain-mismatched text compared with a TTS model trained only on the target speaker data. Subjective evaluation shows that, with the help of the diverse text content used in pre-training, the improvement comes mainly from prosody rather than pronunciation. To improve the data efficiency of pre-training, two methods are applied to reduce the pre-training data of the speaker-independent TTS model for a new text domain. The target speaker TTS system pre-trained on a selected subset of LibriSpeech performs comparably to the system pre-trained on the complete LibriSpeech corpus, which significantly reduces the cost of pre-training data preparation.

References

  • [1] Xu Tan, Tao Qin, Frank Soong, and Tie-Yan Liu, “A survey on neural speech synthesis,” arXiv preprint arXiv:2106.15561, 2021.
  • [2] Yuxuan Wang, RJ Skerry-Ryan, Daisy Stanton, Yonghui Wu, Ron J Weiss, Navdeep Jaitly, Zongheng Yang, Ying Xiao, Zhifeng Chen, Samy Bengio, et al., “Tacotron: Towards end-to-end speech synthesis,” in Proc. Interspeech, Aug. 2017, pp. 4006–4010.
  • [3] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerrv-Ryan, R. A. Saurous, Y. Agiomvrgiannakis, and Y. Wu, “Natural tts synthesis by conditioning wavenet on mel spectrogram predictions,” in Proc. ICASSP, 2018, pp. 4779–4783.
  • [4] Isaac Elias, Heiga Zen, Jonathan Shen, Yu Zhang, Ye Jia, Ron J Weiss, and Yonghui Wu, “Parallel tacotron: Non-autoregressive and controllable tts,” in Proc. ICASSP, 2021, pp. 5709–5713.
  • [5] Yi Ren, Chenxu Hu, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu, “Fastspeech 2: Fast and high-quality end-to-end text to speech,” in International Conference on Learning Representations, 2020.
  • [6] Mutian He, Yan Deng, and Lei He, “Robust Sequence-to-Sequence Acoustic Modeling with Stepwise Monotonic Attention for Neural TTS,” in Proc. Interspeech, 2019, pp. 1293–1297.
  • [7] Yu-An Chung, Yuxuan Wang, Wei-Ning Hsu, Yu Zhang, and RJ Skerry-Ryan, “Semi-supervised training for improving data efficiency in end-to-end speech synthesis,” in Proc. ICASSP, 2019, pp. 6940–6944.
  • [8] Sercan Ö Arık, Jitong Chen, Kainan Peng, Wei Ping, and Yanqi Zhou, “Neural voice cloning with a few samples,” in Proc. NIPS, 2018, pp. 10040–10050.
  • [9] Yutian Chen, Yannis Assael, Brendan Shillingford, David Budden, Scott Reed, Heiga Zen, Quan Wang, Luis C Cobo, Andrew Trask, Ben Laurie, et al., “Sample efficient adaptive text-to-speech,” in Proc. ICLR, 2018.
  • [10] Erica Cooper, Cheng-I Lai, Yusuke Yasuda, Fuming Fang, Xin Wang, Nanxin Chen, and Junichi Yamagishi, “Zero-shot multi-speaker text-to-speech with state-of-the-art neural speaker embeddings,” in Proc. ICASSP. IEEE, 2020, pp. 6184–6188.
  • [11] Daxin Tan, Hingpang Huang, Guangyan Zhang, and Tan Lee, “Cuhk-ee voice cloning system for icassp 2021 m2voc challenge,” arXiv preprint arXiv:2103.04699, 2021.
  • [12] Paul Alexander Taylor, Text-to-speech synthesis, Cambridge University Press, 2009.
  • [13] Katsura Aoyama and Susan G Guion, “Prosody in second language acquisition,” Language experience in second language speech learning: In honor of James Emil Flege, vol. 17, pp. 281, 2007.
  • [14] Ye Jia, Yu Zhang, Ron Weiss, Quan Wang, Jonathan Shen, Fei Ren, Patrick Nguyen, Ruoming Pang, Ignacio Lopez Moreno, Yonghui Wu, et al., “Transfer learning from speaker verification to multispeaker text-to-speech synthesis,” Proc. NIPS, vol. 31, 2018.
  • [15] Junichi Yamagishi, Takashi Nose, Heiga Zen, Zhen-Hua Ling, Tomoki Toda, Keiichi Tokuda, Simon King, and Steve Renals, “Robust speaker-adaptive hmm-based text-to-speech synthesis,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 17, no. 6, pp. 1208–1230, 2009.
  • [16] Shan Yang, Zhizheng Wu, and Lei Xie, “On the training of dnn-based average voice model for speech synthesis,” in Proc. APSIPA. IEEE, 2016, pp. 1–6.
  • [17] Min Chu, Chun Li, Hu Peng, and Eric Chang, “Domain adaptation for tts systems,” in Proc. ICASSP, 2002, vol. 1, pp. I–453.
  • [18] Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur, “Librispeech: an asr corpus based on public domain audio books,” in Proc. ICASSP, 2015, pp. 5206–5210.
  • [19] Heiga Zen, Viet Dang, Rob Clark, Yu Zhang, Ron J Weiss, Ye Jia, Zhifeng Chen, and Yonghui Wu, “Libritts: A corpus derived from librispeech for text-to-speech,” in Proc. Interspeech, 2019, pp. 1526–1530.
  • [20] Kundan Kumar, Rithesh Kumar, Thibault de Boissiere, Lucas Gestin, Wei Zhen Teoh, Jose Sotelo, Alexandre de Brébisson, Yoshua Bengio, and Aaron C Courville, “Melgan: Generative adversarial networks for conditional waveform synthesis,” Proc. NIPS, vol. 32, 2019.
  • [21] Rico Sennrich, Barry Haddow, and Alexandra Birch, “Neural machine translation of rare words with subword units,” in Proc. ACL, 2016, pp. 1715–1725.
  • [22] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” in Proc. NAACL, 2019, pp. 4171–4186.
  • [23] Philipos C Loizou, “Speech quality assessment,” in Multimedia analysis, processing and communications, pp. 623–654. Springer, 2011.