
Fine-grained Style Modeling, Transfer and Prediction in Text-to-Speech Synthesis via Phone-Level Content-Style Disentanglement

Abstract

This paper presents a novel design of a neural network system for fine-grained style modeling, transfer and prediction in expressive text-to-speech (TTS) synthesis. Fine-grained modeling is realized by extracting style embeddings from the mel-spectrograms of phone-level speech segments. Collaborative learning and adversarial learning strategies are applied in order to achieve effective disentanglement of content and style factors in speech and to alleviate the "content leakage" problem in style modeling. The proposed system can be used for varying-content speech style transfer in the single-speaker scenario. The results of objective and subjective evaluation show that our system performs better than other fine-grained speech style transfer models, especially in terms of content preservation. By incorporating a style predictor, the proposed system can also be used for text-to-speech synthesis. Audio samples are provided for system demonstration at https://daxintan-cuhk.github.io/pl-csd-speech.

Index Terms: speech synthesis, style transfer, prosody

1 Introduction

Human speech production manifests a complex integration of physical, cognitive and affective processes. The realization of a spoken utterance involves three main factors, namely the content factor, the speaker factor and the style factor. The content factor is determined by the linguistic content of speech. The speaker factor refers to voice characteristics that are pertinent to recognizing the speaker. While an unambiguous definition of “style” may not exist, in this study the style factor is assumed to cover broadly any aspect of the speech utterance that is not determined by its linguistic content and the speaker’s inherent voice characteristics [1]. The style of speech is often related to speaking situation, attitude, emotion, language proficiency, etc.

Different tasks of spoken language processing can be viewed as purposeful processes of analyzing and manipulating one or more of the three factors. Automatic speech recognition (ASR)[2, 3] and text-to-speech synthesis (TTS)[4, 5, 6, 7, 8, 9] are focused on the content factor, with the other two factors being suppressed or ignored. Speaker recognition[10, 11] and voice conversion[12, 13, 14] aim to capture and manipulate the speaker factor, while emotion recognition[15] and expressive speech synthesis deal primarily with the style factor. In relation to expressive TTS, speech style transfer (SST)[16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26] refers to the process of transferring the style of one speech utterance (the reference speech) into another utterance (the source speech). In a neural network based SST model, a style-related embedding is obtained from the reference speech, and subsequently combined with the text embedding of the source speech (and possibly a speaker embedding) to condition the generation of output speech.

Style embedding can be defined and extracted at different levels of granularity. The studies in [16] and [17] extended the Tacotron system with a fixed-length style embedding at the utterance level. Variational auto-encoders (VAE) and hierarchical structures were applied to improve the representation capability of learned style embeddings in [18], [19] and [20]. To facilitate fine-grained style control on specific parts of an utterance, phone-level style embedding was investigated in [21] and [22]. In [21], a secondary attention module was used to generate style embeddings from the mel-spectrogram of reference speech. In [22], style embeddings were derived from pitch and intensity features aggregated at the phone level.

The main issue in the extraction of style embedding is "content leakage"[23]. Typically, an SST model comprises a style encoder and a text encoder. The style encoder aims to encode the style factor of the reference speech into a style embedding, while the text encoder generates a text embedding from the text of the source speech. The entire model is trained in an end-to-end manner with the goal of reconstructing the input utterance. The same training utterance acts as both reference and source speech, i.e., its acoustic features go to the style encoder and its transcription goes to the text encoder. Since no constraint is exerted to enforce the style encoder to focus exclusively on style-related information of the input speech, a certain degree of content information may be "leaked" into the learned style embedding. In the worst case, the text encoder has no effect at all, and the style embedding contains all the information required for reconstructing the input speech. Such a "corrupted" style embedding would not be useful for style transfer when the source speech and reference speech are different in content. In [23], this problem was addressed by minimizing the mutual information between content and style embeddings during training. In [27], a pairwise training strategy was proposed to enforce correct mapping from input text to different speech utterances.

This paper presents a novel design of a neural network model for fine-grained style modeling, transfer and prediction in expressive text-to-speech (TTS) synthesis. For fine-grained modeling and control, style embeddings are generated directly from the mel-spectrogram at the phone level. In order to achieve effective disentanglement of content and style factors, we propose to apply collaborative learning to force the content encoder to focus on phone identification and adversarial learning to force the style encoder to ignore content information. With a properly trained style predictor, the proposed system can be regarded as a unified framework for both speech style transfer and text-to-speech synthesis without reference speech.

Figure 1: Overview of the proposed system

2 The Proposed System

The proposed system is depicted in Figure 1. The training process comprises two parts: phone-level training and utterance-level training. The phone-level content-style disentanglement (PL-CSD) module is obtained by phone-level training, and subsequently used in utterance-level training. Upon completion of model training, the system can be used for speech style transfer and text-to-speech synthesis.

2.1 Phone-level training

The PL-CSD module is trained at this stage. Let $u_i$ denote the $i^{th}$ utterance and $t_i$ be the corresponding text transcription. $t_i$ is expressed as a phone sequence, i.e., $t_i = [w_1, ..., w_{m_i}]$, where $m_i$ is the number of phones and $w_k$ denotes the $k^{th}$ phone. By applying forced alignment, with silence/pause segments excluded, $u_i$ is divided into $m_i$ segments, denoted as $\{s_1, ..., s_{m_i}\}$, where segment $s_k$ corresponds to phone $w_k$. Without considering the temporal dependency, the collection of $(s_k, w_k)$ pairs are treated as independent data instances for training PL-CSD. $s_k$ is represented by frame-level mel-spectrogram features $[f_1, ..., f_{n_k}]$, where $n_k$ is the number of frames in $s_k$. Note that $w_k$ must be one of the phones in the concerned language. In this study, we use the 39 English phones as defined in the ARPABET.
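As a concrete illustration, the sketch below (not the authors' code) assembles such $(s_k, w_k)$ pairs from a forced-alignment output; the silence labels, frame parameters and function name are our own assumptions.

SILENCE_LABELS = {"sil", "sp", "spn", ""}   # assumed silence/pause markers

def make_phone_segments(mel, alignment, sample_rate=22050, hop_length=256):
    """Slice an utterance-level mel-spectrogram (n_frames, n_mels) into
    phone-level segments, given (phone, start_sec, end_sec) intervals."""
    pairs = []
    for phone, start, end in alignment:
        if phone.lower() in SILENCE_LABELS:
            continue                                   # silence/pause segments are excluded
        f_start = int(round(start * sample_rate / hop_length))
        f_end = int(round(end * sample_rate / hop_length))
        if f_end > f_start:
            pairs.append((mel[f_start:f_end], phone))  # one (s_k, w_k) pair
    return pairs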

As shown in Figure 2, the PL-CSD module consists of the following components (a code sketch of these components is given after the list):

Content encoder ($E_c$): to encode the content factor of phone segment $s$ into the content embedding $z_c$, i.e., $z_c = E_c(s)$

Style encoder ($E_s$): to encode the style factor of phone segment $s$ into the style embedding $z_s$, i.e., $z_s = E_s(s)$

Content-to-phone classifier ($C_c$): to predict the phone identity $w_c$ from the content embedding $z_c$, i.e., $w_c = C_c(z_c)$

Style-to-phone classifier ($C_s$): to predict the phone identity $w_s$ from the style embedding $z_s$, i.e., $w_s = C_s(z_s)$

Decoder ($D$): to reconstruct the phone segment $s'$ from the content embedding $z_c$ and the style embedding $z_s$, i.e., $s' = D(z_c, z_s)$

Segment classifier ($C_{seg}$): to discriminate whether a phone segment is from natural speech (true) or synthesized (false), i.e., $P_{true} = C_{seg}(\{s, s'\})$
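A minimal PyTorch sketch of these components is given below. It follows the dimensions reported in Section 3 (mel-spectrogram inputs, 64-dimensional embeddings, 39 phone classes); the class names, the classifier and discriminator architectures, and any detail not stated in the paper are assumptions rather than the authors' exact implementation.

import torch
import torch.nn as nn

N_MELS, EMB_DIM, N_PHONES = 80, 64, 39

class SegmentEncoder(nn.Module):
    """Shared design assumed for the content encoder E_c and style encoder E_s:
    a bidirectional LSTM whose last cell state is projected to an embedding."""
    def __init__(self, hidden=256):
        super().__init__()
        self.lstm = nn.LSTM(N_MELS, hidden, bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * hidden, EMB_DIM)

    def forward(self, mel):                      # mel: (batch, frames, n_mels)
        _, (_, c) = self.lstm(mel)               # last cell state: (2, batch, hidden)
        return self.proj(torch.cat([c[0], c[1]], dim=-1))   # (batch, EMB_DIM)

class PhoneClassifier(nn.Module):
    """Content-to-phone classifier C_c / style-to-phone classifier C_s."""
    def __init__(self, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(EMB_DIM, hidden), nn.ReLU(),
                                 nn.Linear(hidden, N_PHONES))

    def forward(self, z):                        # z: (batch, EMB_DIM)
        return self.net(z)                       # phone logits

class SegmentDecoder(nn.Module):
    """Decoder D: an autoregressive LSTM that predicts the next mel frame and a
    stop gate from the previous frame concatenated with [z_c ; z_s]."""
    def __init__(self, hidden=512):
        super().__init__()
        self.lstm = nn.LSTM(N_MELS + 2 * EMB_DIM, hidden, batch_first=True)
        self.mel_out = nn.Linear(hidden, N_MELS)
        self.gate_out = nn.Linear(hidden, 1)

    def forward(self, prev_frames, z_c, z_s):    # teacher-forced decoding
        cond = torch.cat([z_c, z_s], dim=-1).unsqueeze(1)
        cond = cond.expand(-1, prev_frames.size(1), -1)
        h, _ = self.lstm(torch.cat([prev_frames, cond], dim=-1))
        return self.mel_out(h), self.gate_out(h).squeeze(-1)

class SegmentDiscriminator(nn.Module):
    """Segment classifier C_seg: probability that a segment is natural speech."""
    def __init__(self, hidden=256):
        super().__init__()
        self.lstm = nn.LSTM(N_MELS, hidden, batch_first=True)
        self.out = nn.Linear(hidden, 1)

    def forward(self, mel):
        _, (h, _) = self.lstm(mel)
        return torch.sigmoid(self.out(h[-1])).squeeze(-1)

The same SegmentEncoder design would be instantiated twice, once as $E_c$ and once as $E_s$; only the training objectives described below differentiate their roles.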

Figure 2: Detailed design of the PL-CSD module

2.1.1 Auto-encoder training

The content encoder $E_c$, the style encoder $E_s$ and the decoder $D$ together make up an auto-encoder model. The difference between the reconstructed segment and the original one is measured by two terms: the spectrogram loss $L_s$, defined as the $L_2$ norm of the difference between their mel-spectrograms, and the gate loss $L_g$, the binary cross-entropy on the stop gates that indicate the end of the segment [5]. For decoder training, the ground-truth mel-spectrogram frame at the current time step is concatenated with the content embedding and the style embedding to predict the mel-spectrogram of the next frame. The overall auto-encoder loss $L_{auto}$ is given as,

L_{auto}(\theta_{E_c}, \theta_{E_s}, \theta_{D}) = \sum_{s_k} \sum_{L \in \{L_s, L_g\}} L\big(s_k, D(E_c(s_k), E_s(s_k))\big)   (1)

where $\theta$ refers to the model parameters to be trained.
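A sketch of how Eq. (1) could be computed with the hypothetical modules above is shown below; equal weighting of the two terms and a mean-squared-error form of the $L_2$ term are assumptions.

import torch
import torch.nn.functional as F

def auto_encoder_loss(E_c, E_s, D, mel, gate_target):
    """mel: (batch, frames, n_mels); gate_target: (batch, frames) floats, 1 at the last frame."""
    z_c, z_s = E_c(mel), E_s(mel)
    # teacher forcing: the ground-truth frame at step t conditions the prediction of step t+1
    prev = F.pad(mel, (0, 0, 1, 0))[:, :-1]
    mel_pred, gate_logits = D(prev, z_c, z_s)
    l_spec = F.mse_loss(mel_pred, mel)                                      # spectrogram loss L_s
    l_gate = F.binary_cross_entropy_with_logits(gate_logits, gate_target)   # gate loss L_g
    return l_spec + l_gate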

Basic auto-encoder training is not expected to achieve disentanglement, as the roles of the content encoder and style encoder are not differentiated. As described below, collaborative training is suggested for training the content encoder while adversarial training is applied to the style encoder.

2.1.2 Collaborative training of content encoder

The content embedding is expected to carry information pertinent to phone identification. An auxiliary content-to-phone classifier is introduced to assess the quality of the content embedding. By training the content encoder and the content-to-phone classifier collaboratively, the content embedding is forced to capture phone identity information. The content-to-phone classification loss is defined as,

L_c(\theta_{E_c}, \theta_{C_c}) = \sum_{(s_k, w_k)} -\log P\big(w_k \,|\, C_c(E_c(s_k))\big)   (2)

To ensure that content embeddings extracted from different segments that belong to the same phone are similar, the following contrast loss $L_{contra}$ is imposed,

L_{contra}(\theta_{E_c}) = \sum_{(s_i, s_j)} \mathbb{1}_{w_i = w_j} \, \| E_c(s_i) - E_c(s_j) \|_2   (3)
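The two collaborative losses might be computed per mini-batch as sketched below; the contrast term is approximated over same-phone pairs within the batch, and equal weighting of the two terms is assumed.

import torch
import torch.nn.functional as F

def collaborative_loss(E_c, C_c, mel, phone_ids):
    """mel: (batch, frames, n_mels); phone_ids: (batch,) integer phone labels."""
    z_c = E_c(mel)
    l_cls = F.cross_entropy(C_c(z_c), phone_ids)               # Eq. (2)
    # Eq. (3): pull together content embeddings of same-phone segments in the batch
    same = phone_ids.unsqueeze(0) == phone_ids.unsqueeze(1)    # (batch, batch) mask
    same &= ~torch.eye(len(phone_ids), dtype=torch.bool, device=same.device)
    dist = torch.cdist(z_c, z_c, p=2)                          # pairwise L2 distances
    l_contra = dist[same].mean() if same.any() else dist.new_zeros(())
    return l_cls + l_contra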

2.1.3 Adversarial training of style encoder

In contrast to the content embedding, the style embedding should not encode any information about phone identity; in other words, the phone carried by a segment should not be identifiable from its style embedding. This is achieved via adversarial training [28]. In the discrimination phase, the parameters of the style encoder are fixed, and the style-to-phone classifier is trained to perform phone identification from the style embedding,

L_s^{dis}(\theta_{C_s}) = \sum_{(s_k, w_k)} -\log P\big(w_k \,|\, C_s(E_s(s_k))\big)   (4)

In the generation phase, the parameters of the style-to-phone classifier are fixed, and the style encoder is trained such that the segment's phone identity cannot be predicted from the style embedding. The following loss is defined to enforce equal posterior probabilities across all phones,

L_s^{gen}(\theta_{E_s}) = \sum_{s_k} \sum_{w=1}^{N_w} \Big\| P\big(w \,|\, C_s(E_s(s_k))\big) - \frac{1}{N_w} \Big\|_2   (5)

where $N_w$ denotes the total number of phones.
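The two alternating phases could be implemented as sketched below; the optimizer handling and a squared-error form of Eq. (5) are assumptions. Here opt_Cs updates only $C_s$ and opt_Es updates only $E_s$, which realizes the parameter freezing described above.

import torch
import torch.nn.functional as F

def style_discrimination_step(E_s, C_s, opt_Cs, mel, phone_ids):
    """Phase 1: fix E_s, train C_s to recover the phone from z_s (Eq. 4)."""
    with torch.no_grad():
        z_s = E_s(mel)                                   # no gradient flows into E_s
    loss = F.cross_entropy(C_s(z_s), phone_ids)
    opt_Cs.zero_grad(); loss.backward(); opt_Cs.step()
    return loss.item()

def style_generation_step(E_s, C_s, opt_Es, mel, n_phones=39):
    """Phase 2: fix C_s, train E_s so the phone posterior becomes uniform (Eq. 5)."""
    post = F.softmax(C_s(E_s(mel)), dim=-1)              # P(w | C_s(E_s(s_k)))
    loss = (post - 1.0 / n_phones).pow(2).sum(dim=-1).mean()   # squared-error form of Eq. (5)
    opt_Es.zero_grad(); loss.backward(); opt_Es.step()   # opt_Es holds only E_s params, so C_s stays fixed
    return loss.item()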

2.1.4 Adversarial enhancement

An adversarial enhancement process is applied to strengthen the model's reconstruction capability. A discriminator network $C_{seg}$ is utilized to judge whether a given segment is from natural speech or synthesized, while the auto-encoder serves as a generator that tries to confuse the discriminator. Let $s_k' = D(E_c(s_k), E_s(s_k))$ denote the reconstructed segment. For the discriminator, we have,

L_{seg}^{dis}(\theta_{C_{seg}}) = \sum_{s_k} -\big[ \log C_{seg}(s_k) + \log\big(1 - C_{seg}(s_k')\big) \big]   (6)

For the generator, the loss is given as,

L_{seg}^{gen}(\theta_{E_c}, \theta_{E_s}, \theta_{D}) = \sum_{s_k} -\log C_{seg}(s_k')   (7)
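A sketch of these two losses, assuming the discriminator outputs the probability of a segment being natural (as in the hypothetical SegmentDiscriminator above):

import torch

def discriminator_loss(C_seg, real_mel, fake_mel, eps=1e-7):
    """Eq. (6): C_seg should output 1 for natural and 0 for reconstructed segments."""
    p_real = C_seg(real_mel)
    p_fake = C_seg(fake_mel.detach())          # detach: do not update the generator here
    return -(torch.log(p_real + eps) + torch.log(1.0 - p_fake + eps)).mean()

def generator_loss(C_seg, fake_mel, eps=1e-7):
    """Eq. (7): the auto-encoder is rewarded when C_seg is fooled."""
    return -torch.log(C_seg(fake_mel) + eps).mean()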

2.1.5 Overall training algorithm

The overall training algorithm for PL-CSD is summarized in Table 1; a code sketch of one training pass follows the table.

Table 1: PL-CSD training algorithm
Input: segments and corresponding phones: $(s_k, w_k)$
Repeat until convergence:
1. Train $E_c$, $E_s$, $D$ by minimizing Eq. (1)
2. Train $E_c$, $C_c$ by minimizing Eq. (2) and Eq. (3)
3. Fix $E_s$, train $C_s$ by minimizing Eq. (4)
4. Fix $C_s$, train $E_s$ by minimizing Eq. (5)
5. Fix $E_c$, $E_s$, $D$, train $C_{seg}$ by minimizing Eq. (6)
6. Fix $C_{seg}$, train $E_c$, $E_s$, $D$ by minimizing Eq. (7)
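Putting the pieces together, one pass of the Table 1 loop might look like the sketch below, which reuses the hypothetical loss functions defined earlier; the optimizer grouping (one optimizer per step) and the data loader interface are assumptions.

import torch
import torch.nn.functional as F

def train_pl_csd_epoch(loader, E_c, E_s, D, C_c, C_s, C_seg, opts):
    """One epoch of the Table 1 loop; `opts` is a dict of optimizers, one per step."""
    for mel, gate, phone_ids in loader:
        # Step 1: auto-encoder reconstruction, Eq. (1)
        loss = auto_encoder_loss(E_c, E_s, D, mel, gate)
        opts["auto"].zero_grad(); loss.backward(); opts["auto"].step()
        # Step 2: collaborative content training, Eqs. (2)-(3)
        loss = collaborative_loss(E_c, C_c, mel, phone_ids)
        opts["content"].zero_grad(); loss.backward(); opts["content"].step()
        # Steps 3-4: adversarial style training, Eqs. (4)-(5)
        style_discrimination_step(E_s, C_s, opts["C_s"], mel, phone_ids)
        style_generation_step(E_s, C_s, opts["E_s"], mel)
        # Steps 5-6: adversarial enhancement, Eqs. (6)-(7)
        prev = F.pad(mel, (0, 0, 1, 0))[:, :-1]
        fake_mel, _ = D(prev, E_c(mel), E_s(mel))
        loss = discriminator_loss(C_seg, mel, fake_mel)
        opts["C_seg"].zero_grad(); loss.backward(); opts["C_seg"].step()
        loss = generator_loss(C_seg, fake_mel)
        opts["gen"].zero_grad(); loss.backward(); opts["gen"].step()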

2.2 Utterance-level training

Upon completion of phone-level training, utterance-level training is carried out to optimize the text encoder and the acoustic model for speech generation. The process is similar to the training of a basic neural TTS system. For each training utterance with text transcription, a sequence of phone-level style embeddings is obtained from the style encoder. The text transcription is converted to a phone sequence, which is passed to the text encoder to generate the text embedding sequence. The style embedding sequence and the text embedding sequence are of the same length. They are combined by concatenating the two embeddings for each phone in the sequence. The acoustic model takes in the combined embedding sequence and outputs the mel-spectrogram of speech. The text encoder and the acoustic model are optimized jointly to minimize the mean square error between the generated mel-spectrogram and the ground-truth one.

After this training process, the derived text embedding sequence and style embedding sequence of each utterance serve as a training pair for the style predictor. The basic function of the style predictor is to map a text embedding sequence to a style embedding sequence.

2.3 Generation

Upon completion of phone-level and utterance-level training as described above, the system is ready for speech generation from given text. Two cases of speech generation are supported: speech style transfer and text-to-speech synthesis. In the speech style transfer case, the style encoder processes the reference speech to derive the style embedding sequence. The text encoder processes the phone sequence converted from the input text (corresponding to the source speech) and generates the text embedding sequence. As the two embedding sequences may have different lengths, linear interpolation is carried out on the style embedding sequence to make it the same length as the text embedding sequence. The two embedding sequences are concatenated at each time step, and the combined sequence is then presented to the acoustic model for speech generation. In the case of text-to-speech synthesis, the input text is first converted into a phone sequence. The style predictor is used to derive the style embedding sequence from the phone sequence, while the text encoder generates the text embedding sequence as usual. The two embedding sequences are combined and presented to the acoustic model for speech generation.
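The length-matching and concatenation steps in the style transfer case might be implemented as sketched below; the function names and the use of torch.nn.functional.interpolate are our own choices, not the paper's implementation.

import torch
import torch.nn.functional as F

def match_lengths(style_seq, target_len):
    """Linearly interpolate a (n_ref_phones, emb_dim) style sequence to target_len steps."""
    x = style_seq.t().unsqueeze(0)                      # (1, emb_dim, n_ref_phones)
    x = F.interpolate(x, size=target_len, mode="linear", align_corners=True)
    return x.squeeze(0).t()                             # (target_len, emb_dim)

def build_acoustic_input(text_emb, ref_style_seq):
    """Concatenate the text embedding and the length-matched style embedding per phone."""
    style_seq = match_lengths(ref_style_seq, text_emb.size(0))
    return torch.cat([text_emb, style_seq], dim=-1)     # (n_phones, 2 * emb_dim)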

3 Experimental Setup

The LJ Speech dataset [29] is used in this study. It contains 13,100 utterances from a single speaker, with a total duration of about 24 hours. 90% of the utterances are used as training data and the remaining 10% as test data. Forced alignment of speech is carried out using the Montreal Forced Aligner [30] based on the CMU Pronouncing Dictionary. Mel-spectrograms are obtained with a setting similar to the standard Tacotron 2 model. In the PL-CSD module, both the content encoder and the style encoder are made up of a bidirectional LSTM of dimension 512 (256 in each direction), in which the last cell state is projected to the embedding via a linear layer. The decoder contains a unidirectional LSTM of dimension 512, with two linear layers applied to the output at each time step: one predicts the mel-spectrogram and the other indicates the end of the sequence. The dimensions of the content embedding and the style embedding are both 64. For utterance-level training, the acoustic model adopts the standard Tacotron 2 structure, where the dimension of the text embedding is 64. The WaveGlow vocoder [31] is used to generate the speech waveform from the predicted mel-spectrogram. The style predictor adopts the feed-forward Transformer block structure as in [7] and [8], which is a stack of self-attention layers and 1D convolutions.

4 Results and Discussion

4.1 Visualization of embeddings

t-distributed Stochastic Neighbor Embedding (t-SNE) [32] is applied for visualization. The analysis is carried out on the 7 English vowels that occur most frequently in the dataset. For each phone, 200 segments are randomly selected from the test set, and their content and style embeddings are derived from the trained content and style encoders, respectively. Figure 3 shows the t-SNE scatter plots of the content embeddings and the style embeddings of the vowel segments, where each color represents one vowel. It can be seen that the content embeddings of segments of the same vowel tend to form clusters, whereas the style embeddings of different vowel segments do not appear to be separable. This indicates that phone identity information is well captured by the content embedding but is largely absent from the style embedding.
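The visualization could be reproduced along the lines of the sketch below, using scikit-learn's t-SNE; the perplexity and other settings are assumptions, not the values used in the paper.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_embeddings(embeddings, labels, title):
    """embeddings: (n, dim) array of content or style embeddings; labels: phone symbols."""
    points = TSNE(n_components=2, perplexity=30, init="pca",
                  random_state=0).fit_transform(np.asarray(embeddings))
    for vowel in sorted(set(labels)):
        idx = [i for i, lab in enumerate(labels) if lab == vowel]
        plt.scatter(points[idx, 0], points[idx, 1], s=5, label=vowel)
    plt.legend(); plt.title(title); plt.show()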

Figure 3: The t-SNE visualization of content embeddings and style embeddings for vowel segments (panels: vowel content embeddings; vowel style embeddings).

4.2 Evaluation of generated speech

Two SST systems, from [21] and [22], are compared with our system. All of these systems aim at fine-grained style modeling. The system in [21] adopted a secondary attention mechanism to extract the style embedding implicitly, whereas our system carries out the extraction with explicit content-style disentanglement. The system in [22] derived style embeddings from pitch and intensity features, whereas our system uses the mel-spectrogram as input.

4.2.1 Objective evaluation

Being able to reconstruct an input utterance from its text embedding and style embedding is a basic requirement to guarantee that no information is lost in the disentanglement process. In this part of the evaluation, the source speech and the reference speech are from the same utterance in the test set. The similarity between the synthesized speech and the original speech is used to measure the performance of reconstruction. Following [16], we use Voicing Decision Error (VDE), Gross Pitch Error (GPE), F0 Frame Error (FFE) and Mel Cepstral Distortion (MCD) as the evaluation metrics (a sketch of the F0-based metrics is given after Table 2). As shown in Table 2, the proposed model shows performance that is comparable to or slightly better than the two previously reported systems.

Table 2: Objective evaluation on speech reconstruction
System        VDE (%) ↓   GPE (%) ↓   FFE (%) ↓   MCD ↓
Lee [21]      9.05        8.16        13.04       10.91
Klimkov [22]  12.00       5.43        14.53       11.67
Ours          11.03       4.57        13.15       10.49
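For reference, the F0-based metrics could be computed as sketched below from frame-aligned F0 tracks; the voicing convention (0 marks an unvoiced frame) and the 20% gross-error threshold follow common practice and are assumptions rather than the paper's evaluation code. MCD is omitted here as it requires mel-cepstral extraction.

import numpy as np

def f0_metrics(f0_ref, f0_syn, tol=0.2):
    """Return VDE, GPE and FFE (in %) for two frame-aligned F0 tracks."""
    f0_ref, f0_syn = np.asarray(f0_ref, float), np.asarray(f0_syn, float)
    v_ref, v_syn = f0_ref > 0, f0_syn > 0
    n = len(f0_ref)
    voicing_err = np.sum(v_ref != v_syn)                      # voicing decision errors
    both = v_ref & v_syn
    gross = both & (np.abs(f0_syn - f0_ref) > tol * f0_ref)   # >20% pitch deviation
    vde = voicing_err / n
    gpe = gross.sum() / max(both.sum(), 1)                    # over co-voiced frames only
    ffe = (voicing_err + gross.sum()) / n                     # frames with either kind of error
    return 100 * vde, 100 * gpe, 100 * ffe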

When the proposed system is used for style transfer, the reference speech may have different text content from the source speech. If the content-style disentanglement is successful, the synthesized speech is expected to contain the same content as the source speech and carry the style of the reference speech. We use an end-to-end ASR system provided in the ESPnet toolkit [10] to evaluate the content similarity between the synthesized speech and the source speech. The word error rate (WER) and phone error rate (PER) of the synthesized speech are computed with respect to the transcription of the source speech. The system in [21] is found to suffer from a serious "content leakage" problem and fails to retain the source content, i.e., the synthesized speech carries both the content and the style of the reference speech and neglects the source speech. Compared with the system of [22], our system is significantly better at preserving the source content. The comparison is shown in Table 3.

Table 3: ASR error rates on style-changed speech
System        WER (%) ↓   PER (%) ↓
Lee [21]      90.4        74.1
Klimkov [22]  29.2        14.5
Ours          21.4        8.5

4.2.2 Subjective evaluation

Subjective listening tests were carried out to evaluate reconstructed speech and style-transferred speech separately. For evaluating speech reconstruction, the listeners were asked to rate the overall similarity between the synthesized speech and the source speech (natural speech). For evaluating style transfer, separate ratings were given on the content similarity between the synthesized speech and the source speech, as well as the style similarity between the synthesized speech and the reference speech. The rating score ranges from 0 (completely different) to 5 (exactly the same). A total of 30 listeners participated in the listening test, which was administered via the Amazon Mechanical Turk platform. Each listener was required to evaluate 30 sets of test utterances in both cases.

Table 4: MOS results with 95% confidence intervals
System        Reconstruction: Overall ↑   Style transfer: Content ↑   Style transfer: Style ↑
Lee [21]      2.09 ± 0.17                 -                           -
Klimkov [22]  2.95 ± 0.13                 2.97 ± 0.15                 2.53 ± 0.16
Ours          3.47 ± 0.11                 3.41 ± 0.13                 2.54 ± 0.15

Table 4 shows the results of the subjective evaluation. In speech reconstruction, our system attains a significantly higher score than the baseline systems, which indicates that our style embedding has better representation capability. In speech style transfer, the system in [21] is not able to perform varying-content style transfer. Compared with the system in [22], our system obtains a similar rating on style similarity and a significantly higher score on content similarity. This demonstrates that our system performs better in content preservation.

4.3 Case study

Figure 4 shows an example of content-style recombination that illustrates the effect of style transfer. The top pane shows the spectrograms and pitch contours of two different utterances of natural speech. The middle pane shows the reconstructed utterances, i.e., synthesized with both the text embedding and the style embedding coming from the same utterance. The bottom pane shows the results of speech generation with the two utterances' text embeddings (content) swapped. The utterances synthesized with the same style embedding, i.e., those in the same column of the figure, show highly similar pitch contour patterns. The most notable similarities are marked by the colored boxes.

Figure 4: Spectrograms and pitch contours of two test utterances of natural speech and synthesized speech with different combinations of text and style embeddings.

5 Conclusion

This paper presents a novel system design for fine-grained style modeling, transfer and prediction in expressive text-to-speech synthesis. The proposed collaborative learning and adversarial learning strategies are effective in disentangling the style and content factors. The evaluation results show that our system alleviates the "content leakage" problem and improves content preservation in speech style transfer. Further investigation will be carried out to extend the system to multi-speaker style transfer.

6 Acknowledgement

This research is partially supported by a Tier 3 funding from ITSP (Ref: ITS/309/18) of the Hong Kong SAR Government, and a Knowledge Transfer Project Fund (Ref: KPF20QEP26) from the Chinese University of Hong Kong.

References

  • [1] R. Liu, B. Sisman, G. Gao, and H. Li, “Expressive tts training with frame and style reconstruction loss,” arXiv preprint arXiv:2008.01490, 2020.
  • [2] A. Gulati, J. Qin, C.-C. Chiu, N. Parmar, Y. Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y. Wu et al., “Conformer: Convolution-augmented transformer for speech recognition,” arXiv preprint arXiv:2005.08100, 2020.
  • [3] W. Han, Z. Zhang, Y. Zhang, J. Yu, C.-C. Chiu, J. Qin, A. Gulati, R. Pang, and Y. Wu, “Contextnet: Improving convolutional neural networks for automatic speech recognition with global context,” arXiv preprint arXiv:2005.03191, 2020.
  • [4] Y. Wang, R. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio et al., “Tacotron: Towards end-to-end speech synthesis,” arXiv preprint arXiv:1703.10135, 2017.
  • [5] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerrv-Ryan et al., “Natural tts synthesis by conditioning wavenet on mel spectrogram predictions,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2018, pp. 4779–4783.
  • [6] N. Li, S. Liu, Y. Liu, S. Zhao, and M. Liu, “Neural speech synthesis with transformer network,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. 6706–6713.
  • [7] Y. Ren, Y. Ruan, X. Tan, T. Qin, S. Zhao, Z. Zhao, and T.-Y. Liu, “Fastspeech: Fast, robust and controllable text to speech,” in Advances in Neural Information Processing Systems, 2019, pp. 3171–3180.
  • [8] Y. Ren, C. Hu, T. Qin, S. Zhao, Z. Zhao, and T.-Y. Liu, “Fastspeech 2: Fast and high-quality end-to-end text-to-speech,” arXiv preprint arXiv:2006.04558, 2020.
  • [9] C. Yu, H. Lu, N. Hu, M. Yu, C. Weng, K. Xu, P. Liu, D. Tuo, S. Kang, G. Lei et al., “Durian: Duration informed attention network for multimodal synthesis,” arXiv preprint arXiv:1909.01700, 2019.
  • [10] S. Watanabe, T. Hori, S. Karita, T. Hayashi, J. Nishitoba, Y. Unno, N. E. Y. Soplin, J. Heymann, M. Wiesner, N. Chen et al., “Espnet: End-to-end speech processing toolkit,” arXiv preprint arXiv:1804.00015, 2018.
  • [11] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, “X-vectors: Robust dnn embeddings for speaker recognition,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2018, pp. 5329–5333.
  • [12] K. Qian, Y. Zhang, S. Chang, X. Yang, and M. Hasegawa-Johnson, “Autovc: Zero-shot voice style transfer with only autoencoder loss,” in International Conference on Machine Learning.   PMLR, 2019, pp. 5210–5219.
  • [13] J.-c. Chou, C.-c. Yeh, and H.-y. Lee, “One-shot voice conversion by separating speaker and content representations with instance normalization,” arXiv preprint arXiv:1904.05742, 2019.
  • [14] T. Kaneko, H. Kameoka, K. Tanaka, and N. Hojo, “Stargan-vc2: Rethinking conditional methods for stargan-based voice conversion,” arXiv preprint arXiv:1907.12279, 2019.
  • [15] M. B. Akçay and K. Oğuz, “Speech emotion recognition: Emotional models, databases, features, preprocessing methods, supporting modalities, and classifiers,” Speech Communication, vol. 116, pp. 56–76, 2020.
  • [16] R. Skerry-Ryan, E. Battenberg, Y. Xiao, Y. Wang, D. Stanton, J. Shor, R. J. Weiss, R. Clark, and R. A. Saurous, “Towards end-to-end prosody transfer for expressive speech synthesis with tacotron,” arXiv preprint arXiv:1803.09047, 2018.
  • [17] Y. Wang, D. Stanton, Y. Zhang, R. Skerry-Ryan, E. Battenberg, J. Shor, Y. Xiao, F. Ren, Y. Jia, and R. A. Saurous, “Style tokens: Unsupervised style modeling, control and transfer in end-to-end speech synthesis,” arXiv preprint arXiv:1803.09017, 2018.
  • [18] Y.-J. Zhang, S. Pan, L. He, and Z.-H. Ling, “Learning latent representations for style control and transfer in end-to-end speech synthesis,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2019, pp. 6945–6949.
  • [19] T. Kenter, V. Wan, C.-A. Chan, R. Clark, and J. Vit, “Chive: Varying prosody in speech synthesis with a linguistically driven dynamic hierarchical conditional variational network,” in International Conference on Machine Learning, 2019, pp. 3331–3340.
  • [20] W.-N. Hsu, Y. Zhang, R. J. Weiss, H. Zen, Y. Wu, Y. Wang, Y. Cao, Y. Jia, Z. Chen, J. Shen et al., “Hierarchical generative modeling for controllable speech synthesis,” arXiv preprint arXiv:1810.07217, 2018.
  • [21] Y. Lee and T. Kim, “Robust and fine-grained prosody control of end-to-end speech synthesis,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2019, pp. 5911–5915.
  • [22] V. Klimkov, S. Ronanki, J. Rohnke, and T. Drugman, “Fine-grained robust prosody transfer for single-speaker neural text-to-speech,” arXiv preprint arXiv:1907.02479, 2019.
  • [23] T.-Y. Hu, A. Shrivastava, O. Tuzel, and C. Dhir, “Unsupervised style and content separation by minimizing mutual information for speech synthesis,” in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2020, pp. 3267–3271.
  • [24] S. Karlapati, A. Moinet, A. Joly, V. Klimkov, D. Sáez-Trigueros, and T. Drugman, “Copycat: Many-to-many fine-grained prosody transfer for neural text-to-speech,” arXiv preprint arXiv:2004.14617, 2020.
  • [25] T. Li, S. Yang, L. Xue, and L. Xie, “Controllable emotion transfer for end-to-end speech synthesis,” in 2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP).   IEEE, 2021, pp. 1–5.
  • [26] G. Zhang, Y. Qin, and T. Lee, “Learning syllable-level discrete prosodic representation for expressive speech generation,” Proc. Interspeech 2020, pp. 3426–3430, 2020.
  • [27] S. Ma, D. Mcduff, and Y. Song, “Neural tts stylization with adversarial and collaborative games,” in International Conference on Learning Representations, 2018.
  • [28] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in neural information processing systems, 2014, pp. 2672–2680.
  • [29] K. Ito, “The lj speech dataset,” https://keithito.com/LJ-Speech-Dataset/, 2017.
  • [30] M. McAuliffe, M. Socolof, S. Mihuc, M. Wagner, and M. Sonderegger, “Montreal forced aligner: Trainable text-speech alignment using kaldi.” in Interspeech, 2017, pp. 498–502.
  • [31] R. Prenger, R. Valle, and B. Catanzaro, “Waveglow: A flow-based generative network for speech synthesis,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2019, pp. 3617–3621.
  • [32] L. v. d. Maaten and G. Hinton, “Visualizing data using t-sne,” Journal of machine learning research, vol. 9, no. Nov, pp. 2579–2605, 2008.