EMOTIONAL VOICE CONVERSION WITH CYCLE-CONSISTENT
ADVERSARIAL NETWORK
Abstract
Emotional Voice Conversion, or emotional VC, is a technique for converting speech from one emotional state into another while keeping the basic linguistic information and speaker identity. Previous approaches for emotional VC need parallel data and use the dynamic time warping (DTW) method to temporally align the source-target speech parameters. These approaches often define a minimum generation loss, such as an L1 or L2 loss, as the objective function for learning the model parameters. Recently, cycle-consistent generative adversarial networks (CycleGAN) have been used successfully for non-parallel VC. This paper investigates the efficacy of using CycleGAN for emotional VC tasks. Rather than attempting to learn a mapping between parallel training data using a frame-to-frame minimum generation loss, the CycleGAN uses two discriminators and one classifier to guide the learning process, where the discriminators aim to differentiate between the natural and converted speech and the classifier aims to classify the underlying emotion of the natural and converted speech. The training process of the CycleGAN models randomly pairs source-target speech parameters, without any temporal alignment operation. The objective and subjective evaluation results confirm the effectiveness of using CycleGAN models for emotional VC. The non-parallel training of the CycleGAN indicates its potential for non-parallel emotional VC.
Index Terms— Emotional voice conversion, generative adversarial networks, CycleGAN
1 Introduction
Human speech is a complex signal that carries rich information, including linguistic, para-linguistic and non-linguistic information. While linguistic and para-linguistic information controlled by the speaker help convey the message precisely, non-linguistic information such as the emotion accompanying speech plays an important role in human social interaction. Compared with Voice Conversion (VC) [1, 2, 3, 4], which is a technique to convert one speaker's voice to sound like that of another, emotional Voice Conversion, or emotional VC, is a technique for converting speech from one emotional state into another while keeping the basic linguistic information and speaker identity.
Most VC approaches focus more on the conversion of short-time spectral features and less on the conversion of prosodic features such as F0 [5, 6, 7, 8, 3]. In these works, F0 is usually converted by a simple logarithmic Gaussian (LG) normalized transform. For emotional VC, however, parametrically modeling the prosodic features is important, since prosody plays an important role in conveying various types of non-linguistic information, such as intention, attitude and mood, which represent the emotions of the speaker [9]. Recently, there has been active research in modeling prosodic features of speech for emotional VC, most of which involves modeling two prosodic elements, namely the F0 contour and the energy contour. A Gaussian mixture model (GMM) and a classification regression tree model were adopted to model the F0 contour conversion from neutral speech to emotional speech in [10]. A system for transforming the emotion in speech was built in [11], where the F0 contour was modeled and generated by context-sensitive syllable-based HMMs, the duration was transformed using phone-based relative decision trees, and the spectrum was converted using a GMM-based or a codebook selection approach.
Prosody is inherently supra-segmental and hierarchical in nature, and its conversion is affected by both short- and long-term dependencies, such as the sequence of segments, syllables and words within an utterance, as well as the lexical and syntactic systems of a language [12, 13, 14, 15, 16]. There have been many attempts to model prosodic characteristics at multiple temporal levels, such as the phone, syllable and phrase levels [17, 18, 19, 20]. The continuous wavelet transform (CWT) can effectively model F0 at different temporal scales and significantly improve speech synthesis performance [21]. CWT methods have also been adopted for emotional VC: CWT was used for F0 modeling within a non-negative matrix factorization (NMF) model [22], and for F0 and energy contour modeling within a deep bidirectional LSTM (DBLSTM) model [23]. Using the CWT to decompose F0 at different scales has also been explored in [9, 24], where neural networks (NNs) or deep belief networks (DBNs) were adopted.
While previous work has shown the efficacy of using GMMs, DBLSTMs, NNs and DBNs to model the feature mapping for spectral and prosodic features, these approaches all need parallel data and parallel training: the source and target data must share parallel scripts, and a dynamic time warping (DTW) method is used to temporally align the source and target features before training the models. Parallel training data is more difficult to collect than non-parallel data in many cases. Besides, the use of DTW may introduce alignment errors, which degrade VC performance. Moreover, previous emotional VC approaches often define a minimum generation loss, such as an L1 or L2 loss, as the objective function. One issue with using a minimum generation loss is the over-smoothing effect often observed in the generated speech parameters. Since this loss may also be inconsistent with human perception of speech, directly optimizing the model parameters with a minimum generation loss may not generate speech that sounds natural to humans. Generative adversarial networks (GANs) [25] have been incorporated into TTS and VC systems [26], where it was found that GAN models are capable of generating more natural spectral parameters and F0 than the conventional minimum generation error training algorithm, regardless of its hyper-parameter settings. Since any utterance spoken with the source or target emotional state can be used as a training sample, a non-parallel emotional VC model that achieves performance comparable to its parallel counterparts would be more flexible, more practical and more valuable than parallel emotional VC systems. The recently proposed cycle-consistent generative adversarial network (CycleGAN) [27], which belongs to the large family of GAN models, provides a potential way to achieve non-parallel emotional VC. The CycleGAN was originally designed to transform styles in images, where the styles of the images are translated while the content remains unchanged. CycleGAN models have been used successfully for developing non-parallel VC systems [28, 29].
In this paper, we investigate the efficacy of using CycleGAN models for emotional VC tasks. Emotional VC is similar to image style transformation, where we can regard the underlying linguistic information as analogous to image content and the accompanying emotion as analogous to image style. Rather than attempting to learn a mapping between parallel training data using a frame-to-frame minimum generation loss, the CycleGAN uses two discriminators and one classifier to guide the learning process: the discriminators aim to differentiate between natural and converted speech, and the classifier aims to classify the underlying emotion of the natural and converted speech. The spectral features, F0 contour and energy contour are simultaneously converted by the CycleGAN model. We use CWT or logarithmic representations of the F0 and energy features. Although the training data we use is parallel, a non-parallel training process is adopted to learn the CycleGAN model, which means that source and target features are randomly paired during training, without any temporal alignment. The objective and subjective evaluation results confirm the effectiveness of using CycleGAN models for emotional VC. The advantages offered by the CycleGAN model include (i) utilizing a GAN loss instead of a minimum generation loss, (ii) eliminating source-target alignment errors and (iii) flexible non-parallel training. The non-parallel training of the CycleGAN indicates its potential for non-parallel emotional VC.
The rest of this paper is organized as follows: Section 2 introduces the CycleGAN model for emotional VC and Section 3 describes the details of implementation. Section 4 gives the experimental setups and evaluations. Conclusions are drawn in Section 5.
2 EMOTIONAL VC WITH CYCLEGAN
The CycleGAN model consists of two generators ($G_{X \to Y}$ and $G_{Y \to X}$), two discriminators ($D_X$ and $D_Y$) and one emotion classifier ($C$), as shown in Fig. 1, where we denote the spectral and prosodic features in the domain of emotion $X$ as $x$ and the spectral and prosodic features in the domain of emotion $Y$ as $y$, respectively. $G_{X \to Y}(x)$ denotes the spectral and prosodic features converted from emotion $X$ to emotion $Y$ by the generator $G_{X \to Y}$, while $G_{Y \to X}(G_{X \to Y}(x))$ denotes the features converted back to emotion $X$ by the generator $G_{Y \to X}$ from $G_{X \to Y}(x)$. To effectively learn the parameters of the generators, discriminators and classifier, several losses are defined as follows.
Adversarial Loss: Generator $G_{X \to Y}$ serves as a mapping function from emotion domain $X$ to emotion domain $Y$, while generator $G_{Y \to X}$ does the opposite, serving as a mapping function from emotion domain $Y$ to emotion domain $X$. The discriminators, $D_X$ and $D_Y$, aim to distinguish between genuine and converted spectral and prosodic features, i.e., discriminator $D_Y$ distinguishes between $y$ and $G_{X \to Y}(x)$, and discriminator $D_X$ distinguishes between $x$ and $G_{Y \to X}(y)$. To this end, an adversarial loss, which measures how distinguishable the converted features $G_{X \to Y}(x)$ are from the genuine target-domain features $y$, is defined as
$$\mathcal{L}_{adv}(G_{X \to Y}, D_Y) = \mathbb{E}_{y \sim P(y)}\big[\log D_Y(y)\big] + \mathbb{E}_{x \sim P(x)}\big[\log\big(1 - D_Y(G_{X \to Y}(x))\big)\big] \qquad (1)$$
The adversarial loss $\mathcal{L}_{adv}(G_{Y \to X}, D_X)$, which distinguishes the converted features $G_{Y \to X}(y)$ from the genuine source-domain features $x$, has a similar formulation.
Cycle Consistency Loss: The adversarial loss makes $G_{X \to Y}(x)$ and $y$, or $G_{Y \to X}(y)$ and $x$, as similar as possible, while the cycle consistency loss guarantees that an input can retain its original form after passing through the two generators $G_{X \to Y}$ and $G_{Y \to X}$. Using the notation in Fig. 1, the reconstruction $G_{Y \to X}(G_{X \to Y}(x))$ should not diverge too much from $x$. This is very important for emotional voice conversion, since we do not want to change the linguistic and speaker information during the conversion process. The cycle consistency loss is defined as
$$\mathcal{L}_{cyc}(G_{X \to Y}, G_{Y \to X}) = \mathbb{E}_{x \sim P(x)}\big[\big\| G_{Y \to X}(G_{X \to Y}(x)) - x \big\|_1\big] + \mathbb{E}_{y \sim P(y)}\big[\big\| G_{X \to Y}(G_{Y \to X}(y)) - y \big\|_1\big] \qquad (2)$$
where $\|\cdot\|_1$ denotes the L1 norm.
Emotion Classification Loss: To explicitly guide the CycleGAN model to learn the emotion conversion function, we add an additional emotion classification loss to the original model. Specifically, an accompanying emotion classifier $C$, as shown in Fig. 1, is trained, which determines whether $x$, $G_{Y \to X}(y)$ and $G_{Y \to X}(G_{X \to Y}(x))$ match the desired emotion label $c_X$, as well as whether $y$, $G_{X \to Y}(x)$ and $G_{X \to Y}(G_{Y \to X}(y))$ match the desired emotion label $c_Y$. To achieve this, the following emotion classification loss is introduced:
$$\mathcal{L}_{cls} = \mathcal{L}_{cls}^{X} + \mathcal{L}_{cls}^{Y} \qquad (3)$$
where
$$\mathcal{L}_{cls}^{X} = \mathbb{E}_{x}\big[\ell(C(x), c_X)\big] + \mathbb{E}_{y}\big[\ell(C(G_{Y \to X}(y)), c_X)\big] + \mathbb{E}_{x}\big[\ell\big(C(G_{Y \to X}(G_{X \to Y}(x))), c_X\big)\big] \qquad (4)$$
and
$$\mathcal{L}_{cls}^{Y} = \mathbb{E}_{y}\big[\ell(C(y), c_Y)\big] + \mathbb{E}_{x}\big[\ell(C(G_{X \to Y}(x)), c_Y)\big] + \mathbb{E}_{y}\big[\ell\big(C(G_{X \to Y}(G_{Y \to X}(y))), c_Y\big)\big] \qquad (5)$$
In equations (4) and (5), $\ell(\cdot, \cdot)$ can be any divergence function used for classification problems, e.g., the binary cross-entropy loss.
Full Objective: Combining the above adversarial loss, cycle consistency loss and emotion classification loss, the full training objective is:
$$\mathcal{L} = \mathcal{L}_{adv}(G_{X \to Y}, D_Y) + \mathcal{L}_{adv}(G_{Y \to X}, D_X) + \lambda_{cyc}\mathcal{L}_{cyc} + \lambda_{cls}\mathcal{L}_{cls} \qquad (6)$$
where $\lambda_{cyc}$ and $\lambda_{cls}$ are trade-off parameters that adjust the relative weights of these loss terms.
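To make the interplay of the three losses concrete, the following is a minimal PyTorch-style sketch of one generator update under the full objective in Eq. (6). The module names (`G_xy`, `G_yx`, `D_x`, `D_y`, `C`), the logit-output binary cross-entropy formulation and the default weights are our assumptions, not the authors' implementation.

```python
# Sketch only: generator-side losses under Eqs. (1)-(6), assuming the
# discriminators and classifier output logits. Discriminator and classifier
# updates happen in a separate alternating step (omitted here).
import torch
import torch.nn.functional as F

bce = F.binary_cross_entropy_with_logits

def generator_step(G_xy, G_yx, D_x, D_y, C, x, y, c_x, c_y,
                   lambda_cyc=10.0, lambda_cls=1.0):
    fake_y, fake_x = G_xy(x), G_yx(y)          # converted features
    rec_x, rec_y = G_yx(fake_y), G_xy(fake_x)  # cycle reconstructions

    # Adversarial terms: converted features should fool the discriminators
    # (non-saturating generator-side counterpart of Eq. (1) and its mirror).
    dy_fake, dx_fake = D_y(fake_y), D_x(fake_x)
    adv = bce(dy_fake, torch.ones_like(dy_fake)) + bce(dx_fake, torch.ones_like(dx_fake))

    # Cycle-consistency loss, Eq. (2): L1 between inputs and reconstructions.
    cyc = (rec_x - x).abs().mean() + (rec_y - y).abs().mean()

    # Emotion-classification loss, Eqs. (3)-(5): converted and reconstructed
    # features should carry the desired emotion label. c_x / c_y are label
    # tensors broadcastable to the classifier output; the terms on real x, y
    # only affect the classifier update and are omitted here.
    cls = bce(C(fake_x), c_x) + bce(C(rec_x), c_x) + \
          bce(C(fake_y), c_y) + bce(C(rec_y), c_y)

    return adv + lambda_cyc * cyc + lambda_cls * cls
```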
3 IMPLEMENTATION
In this paper, speech features including Mel-cepstral coefficients (MCCs) and F0 are computed using WORLD [30]. The spectral features used for conversion are 36-dimensional MCCs. The energy contour and the F0 contour, as well as their corresponding CWT decompositions, are computed as in [23]. We use the CWT or logarithmic representation for the F0 and energy features. For convenience, we denote the CWT representations of the F0 and energy contours as $F0_{\text{CWT}}$ and $E_{\text{CWT}}$, respectively, and the logarithmic F0 as $\log F0$. The network architectures of the generators, discriminators and classifier are shown in Table 1. The DBLSTM models, which serve as the baseline, have the same network architecture as in [23]. The hyper-parameters and training details are made available online (source code: https://github.com/liusongxiang/CycleGAN-EmoVC) but are left out here due to space limitations.
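To make the analysis step concrete, below is a minimal sketch using the pyworld bindings for WORLD. The file name, the simple spectral-sum energy definition and the omission of the CWT decomposition are assumptions rather than the authors' exact pipeline.

```python
# Hedged sketch of WORLD-based feature analysis; frame settings use pyworld
# defaults, and the 10-scale CWT decomposition of [23] is not reproduced here.
import numpy as np
import pyworld as pw
import soundfile as sf

x, fs = sf.read("neutral_0001.wav")        # hypothetical file name
x = x.astype(np.float64)

f0, t = pw.harvest(x, fs)                  # F0 contour
sp = pw.cheaptrick(x, f0, t, fs)           # smoothed spectral envelope
ap = pw.d4c(x, f0, t, fs)                  # aperiodicity (copied over at conversion time)

mcc = pw.code_spectral_envelope(sp, fs, 36)    # 36-dimensional MCCs
log_f0 = np.log(f0[f0 > 0])                    # log F0 on voiced frames
log_energy = np.log(np.sum(sp, axis=1) + 1e-10)  # a simple log energy contour (assumption)
# The CWT representations F0_CWT / E_CWT would be obtained by decomposing the
# interpolated, normalized log F0 and log energy contours as in [23].
```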
Table 1: Network architectures of the generator and the discriminator/classifier (IN: instance normalization [31]; LReLU: leaky ReLU).

| Generator | |
|---|---|
| Conv block | Conv@3×9, 1→64, IN, ReLU |
| Down-sampling block | Conv@4×8, 64→128, stride=2, IN, ReLU; Conv@4×8, 128→256, stride=2, IN, ReLU |
| Residual block ×6 | Conv@3×3, 256→256, IN, ReLU; Conv@3×3, 256→256, IN |
| Up-sampling block | ConvTran@4×4, 256→128, stride=2, IN, ReLU; ConvTran@4×4, 128→64, stride=2, IN, ReLU |
| Output layer | Conv@7×7, 64→1 |
| **Discriminator/Classifier** | |
| Conv block | Conv@4×4, 1→64, stride=2, LReLU |
| Stride block ×4 | Conv@4×4, stride=2, LReLU |
| Output layer | Conv@1×1 |
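As a rough illustration, a PyTorch sketch of the generator as we read Table 1 follows; the paddings and the treatment of the concatenated features as a single-channel 2-D map are our assumptions.

```python
# Sketch of the Table 1 generator (kernel sizes and channel counts as we read
# the table; paddings chosen only so that down/up-sampling roughly cancel).
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, ch=256):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.InstanceNorm2d(ch), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1), nn.InstanceNorm2d(ch))

    def forward(self, x):
        return x + self.block(x)   # residual connection

class Generator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 64, (3, 9), padding=(1, 4)), nn.InstanceNorm2d(64), nn.ReLU(),
            nn.Conv2d(64, 128, (4, 8), stride=2, padding=(1, 3)), nn.InstanceNorm2d(128), nn.ReLU(),
            nn.Conv2d(128, 256, (4, 8), stride=2, padding=(1, 3)), nn.InstanceNorm2d(256), nn.ReLU(),
            *[ResBlock(256) for _ in range(6)],
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1), nn.InstanceNorm2d(128), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.InstanceNorm2d(64), nn.ReLU(),
            nn.Conv2d(64, 1, 7, padding=3))

    def forward(self, x):   # x: (batch, 1, feature_dim, frames)
        return self.net(x)
```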
Table 2: Experimental setups (speech features converted by each model; for DBLSTM-1 and CycleGAN-1, F0 is converted by the logarithmic Gaussian normalized transform).

| Model | Converted Features |
|---|---|
| DBLSTM-1 | MCC |
| DBLSTM-2 | MCC, $\log F0$ |
| DBLSTM-3 | MCC, $F0_{\text{CWT}}$ |
| DBLSTM-4 | MCC, $F0_{\text{CWT}}$, $E_{\text{CWT}}$ |
| CycleGAN-1 | MCC |
| CycleGAN-2 | MCC, $\log F0$ |
| CycleGAN-3 | MCC, $F0_{\text{CWT}}$ |
| CycleGAN-4 | MCC, $F0_{\text{CWT}}$, $E_{\text{CWT}}$ |
In the training and conversion stages, the MCC features and prosodic features are concatenated, so the model maps these features together. The features are normalized to zero mean and unit variance before being fed into the CycleGAN and DBLSTM models. During conversion, the aperiodicity component remains intact and is directly copied over. We first compute the logarithm-scale F0 and energy contours from the converted CWT-represented F0 and energy features, respectively. Then a mean-variance denormalization and an exponential operation are applied to obtain the linear-scale F0 and energy contours of the target emotion from the normalized logarithm-scale ones. If we denote the converted spectrum as $\hat{S}$, which is computed from the converted MCCs, and the converted linear-scale energy contour as $\hat{E}$, the final converted spectrum is computed as follows: (i) compute the energy contour $E_S$ of $\hat{S}$; (ii) compute the element-wise ratio $r = \hat{E} / E_S$; (iii) scale the $n$-th frame vector $\hat{S}_n$ by $r_n$ to obtain the final converted spectrum.
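A hedged sketch of this post-processing is given below. It assumes the inverse CWT recomposition and MCC decoding have already produced the normalized log-scale contours and the converted spectrum; the spectral-sum energy and variable names are illustrative assumptions.

```python
# Sketch of prosody denormalization and energy-based spectral scaling,
# steps (i)-(iii) above; not the authors' exact implementation.
import numpy as np

def postprocess(log_f0_norm, log_e_norm, sp_conv, stats):
    """log_f0_norm, log_e_norm: normalized log-scale F0/energy contours already
    recomposed from the converted CWT coefficients (recomposition omitted).
    sp_conv: converted spectrum decoded from the converted MCCs (frames x bins).
    stats: target-emotion mean/std used for mean-variance denormalization."""
    f0_conv = np.exp(log_f0_norm * stats["f0_std"] + stats["f0_mean"])  # linear-scale F0
    e_conv  = np.exp(log_e_norm  * stats["e_std"]  + stats["e_mean"])   # linear-scale energy

    e_sp = np.sum(sp_conv, axis=1)        # (i) energy contour of the converted spectrum (assumed definition)
    r = e_conv / (e_sp + 1e-10)           # (ii) element-wise ratio
    sp_final = sp_conv * r[:, None]       # (iii) scale each frame by its ratio
    return f0_conv, sp_final              # aperiodicity is copied over unchanged
```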
4 EXPERIMENTS AND RESULTS
4.1 Experiment conditions
We use the CASIA Chinese Emotional Corpus, recorded by the Institute of Automation, Chinese Academy of Sciences, in which the same set of sentences is spoken by two female and two male speakers in six different emotional tones: happy, sad, angry, surprise, fear and neutral. We choose three emotions (sad, neutral and angry), which form a strong contrast, from one female speaker. For each emotion, we use 260 utterances as the training set, 20 utterances as the validation set and another 20 utterances as the evaluation set.
Note that although the training data is parallel, the training process of the CycleGAN models randomly pairs source and target features, so dynamic time warping (DTW) alignment is not needed. Since the DBLSTM models by nature need a frame-to-frame mapping between the source and target features, a DTW process is necessary to temporally align the source and target spectral features as well as the prosodic features, i.e., the F0 and energy representations. The experimental setups are listed in Table 2, where each model performs two conversion tasks: neutral-to-sad conversion and neutral-to-angry conversion.
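The sketch below illustrates, under our assumptions rather than the authors' code, how a non-parallel training pair can be drawn: a random fixed-length segment from a random utterance of each emotion, with no script matching or DTW.

```python
# Illustrative non-parallel pairing for CycleGAN training. The inputs are
# assumed to be lists of (frames x feature_dim) numpy arrays of normalized
# features; the DBLSTM baseline would instead DTW-align parallel utterances.
import numpy as np

def sample_nonparallel_pair(neutral_feats, emotional_feats, segment_len=128, rng=np.random):
    def random_segment(feat):
        # Pick a random fixed-length crop (segment_len is an assumed setting).
        start = rng.randint(0, max(1, feat.shape[0] - segment_len))
        return feat[start:start + segment_len]

    x = random_segment(neutral_feats[rng.randint(len(neutral_feats))])      # source emotion
    y = random_segment(emotional_feats[rng.randint(len(emotional_feats))])  # target emotion
    return x, y   # randomly paired, no temporal alignment
```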
Table 3: Average MCD and LogF0-MSE results for the neutral-to-sad and neutral-to-angry conversions.

| Model | MCD (dB), Sad | MCD (dB), Angry | LogF0-MSE, Sad | LogF0-MSE, Angry |
|---|---|---|---|---|
| Source | 10.87 | 14.56 | 0.063 | 0.098 |
| DBLSTM-1 | 9.97 | 10.59 | 0.065 | 0.132 |
| DBLSTM-2 | 9.55 | 9.74 | 0.027 | 0.039 |
| DBLSTM-3 | 10.60 | 11.38 | 0.029 | 0.045 |
| DBLSTM-4 | 10.57 | 11.83 | 0.025 | 0.042 |
| CycleGAN-1 | 10.43 | 10.60 | 0.065 | 0.132 |
| CycleGAN-2 | 10.70 | 10.49 | 0.030 | 0.057 |
| CycleGAN-3 | 10.04 | 10.55 | 0.030 | 0.075 |
| CycleGAN-4 | 10.30 | 10.26 | 0.034 | 0.059 |
4.2 Objective evaluation
The Mel Cepstral Distortion (MCD) is used for the objective evaluation of spectral conversion. The MCD is computed as:
$$\text{MCD [dB]} = \frac{10}{\ln 10} \sqrt{2 \sum_{d=1}^{D} \left(c_d^{(t)} - c_d^{(c)}\right)^2} \qquad (7)$$
where $c_d^{(t)}$ and $c_d^{(c)}$ represent the $d$-th dimension of the target and the converted Mel-cepstra, respectively. The LogF0 mean squared error (MSE) is computed to evaluate the F0 conversion, which has the form
$$\text{MSE} = \frac{1}{N} \sum_{n=1}^{N} \left(\log F0_n^{(t)} - \log F0_n^{(c)}\right)^2 \qquad (8)$$
where $\log F0_n^{(t)}$ and $\log F0_n^{(c)}$ denote the target and the converted F0 features at frame $n$, respectively. The average MCD and LogF0-MSE results are shown in Table 3. The MCD and LogF0-MSE between the source and the target emotion are computed as a reference.
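For reference, a small sketch of how these two metrics can be computed is given below; it assumes the target and converted sequences are already time-aligned and of equal length, and excluding the 0th cepstral coefficient is a common convention rather than something stated above.

```python
# Sketch of the objective metrics in Eqs. (7) and (8).
import numpy as np

def mcd_db(mcc_target, mcc_converted):
    # MCCs of shape (frames, dim); the 0th (energy-like) coefficient is typically excluded.
    diff = mcc_target[:, 1:] - mcc_converted[:, 1:]
    return np.mean(10.0 / np.log(10) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1)))

def logf0_mse(f0_target, f0_converted):
    # Evaluate on frames that are voiced in both sequences.
    voiced = (f0_target > 0) & (f0_converted > 0)
    return np.mean((np.log(f0_target[voiced]) - np.log(f0_converted[voiced])) ** 2)
```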
Based on the MCD results, the best-performing approach is DBLSTM-2, which converts the spectral features and the logarithmic F0 contour. CycleGAN-2 has the worst MCD for neutral-to-sad conversion, while DBLSTM-4 has the worst MCD for neutral-to-angry conversion. Comparing CycleGAN-3 with DBLSTM-3 and CycleGAN-4 with DBLSTM-4, we see that the CycleGANs achieve lower MCDs than their DBLSTM counterparts for both conversions, although there is no explicitly defined minimum generation loss when training the CycleGANs. Based on the LogF0-MSE results, DBLSTM-4 has the lowest value for the sad emotion and DBLSTM-2 has the lowest for the angry emotion. Comparing DBLSTM-1 to DBLSTM-(2-4) and CycleGAN-1 to CycleGAN-(2-4), we can see that simultaneously modeling the F0 features (CWT or logarithmic representations) with the spectral features achieves better conversion results than just using a simple logarithmic Gaussian normalized transform for F0 conversion. It is reasonable that the DBLSTMs achieve the lowest values for both the MCD and LogF0-MSE metrics, since they are trained by optimizing an explicitly defined minimum generation loss between the DTW-aligned source and target speech features, and the MCD computation also uses DTW to align the converted and the genuine speech features.
4.3 Subjective Evaluation
Table 4: Subjective emotion classification results (percentage of listener labels).

| Model | Target | Perceived Sad | Perceived Angry | Perceived Neutral |
|---|---|---|---|---|
| DBLSTM-1 | Sad | 45.2% | 12.3% | 42.5% |
| DBLSTM-1 | Angry | 2.5% | 83.8% | 13.7% |
| CycleGAN-1 | Sad | 43.8% | 0.8% | 55.4% |
| CycleGAN-1 | Angry | 3.3% | 78.9% | 17.8% |
| CycleGAN-2 | Sad | 65.6% | 2.2% | 32.2% |
| CycleGAN-2 | Angry | 6.7% | 62.5% | 30.8% |
| CycleGAN-3 | Sad | 56.7% | 2.1% | 41.2% |
| CycleGAN-3 | Angry | 3.3% | 62.3% | 34.4% |
| CycleGAN-4 | Sad | 56.9% | 0.9% | 42.2% |
| CycleGAN-4 | Angry | 1.2% | 74.4% | 24.4% |
A subjective emotion classification test is conducted, where each model has 20 testing utterances (10 for each conversion). The listeners are asked to label each stimulus as more 'sad' or more 'angry' when compared with a neutral reference. 16 listeners take part in this test. Since the waveforms converted by DBLSTM-(2-4) are clearly worse in speech naturalness than those from the other settings according to a preliminary listening test, we only conduct subjective evaluations for five models, i.e., DBLSTM-1 and CycleGAN-(1-4). The subjective classification results are shown in Table 4. We see some inconsistency between the objective metrics and the subjective evaluation results, which is often encountered in the VC and TTS literature.
For the conversion from neutral to sad, CycleGAN-2, which converts the spectral features together with the logarithmic F0, achieves the best result. For the conversion from neutral to angry, DBLSTM-1 achieves the best result, while CycleGAN-1 also achieves a good result, with a degradation of only 5.8%. The neutral-to-sad conversion achieves lower classification accuracy than the neutral-to-angry conversion under both the DBLSTM and CycleGAN settings, except for CycleGAN-2. One possible reason is that perceiving the emotional state of sad speech is more difficult than that of angry speech when using neutral speech as the reference: sad speech is characterized by low energy and a slow speaking rate, while angry speech is characterized by high energy and a fast speaking rate. Comparing the CycleGAN settings, which convert different feature combinations, we see that different combinations suit different emotion conversions: CycleGAN-1 performs well for neutral-to-angry conversion, while CycleGAN-2 performs well for neutral-to-sad conversion. The subjective emotion classification test shows the effectiveness of using the CycleGAN model for emotional VC. Since the training process of the CycleGANs is non-parallel, with source and target speech parameters randomly paired, this work supports the utility of the CycleGAN approach for emotional VC trained on non-parallel emotional databases.
5 CONCLUSIONS
This paper investigates the efficacy of using CycleGAN for emotional VC tasks. Rather than attempting to learn a mapping between parallel training data using a frame-to-frame minimum generation loss, the CycleGAN uses two discriminators and one classifier to guide the learning process, where the discriminators aim to differentiate between the natural and converted speech and the classifier aims to classify the underlying emotion of the natural and converted speech. The training process of the CycleGAN models randomly pairs source and target speech parameters, so a DTW process is not needed. The objective and subjective evaluation results confirm the effectiveness of using CycleGAN models for emotional VC. To sum up, the advantages offered by the CycleGAN model include (i) utilizing a GAN loss instead of a minimum generation loss, (ii) eliminating source-target alignment errors and (iii) flexible non-parallel training. The non-parallel training process also indicates the potential of using non-parallel emotional speech data for developing emotional VC systems, which will be our future work.
References
- [1] M. Abe, S. Nakamura, K. Shikano, and H. Kuwabara, “Voice conversion through vector quantization,” Journal of the Acoustical Society of Japan (E), vol. 11, no. 2, pp. 71–76, 1990.
- [2] K. Shikano, S. Nakamura, and M. Abe, “Speaker adaptation and voice conversion by codebook mapping,” in Circuits and Systems, 1991., IEEE International Symposium on. IEEE, 1991, pp. 594–597.
- [3] S. Liu, J. Zhong, L. Sun, X. Wu, X. Liu, and H. Meng, “Voice conversion across arbitrary speakers based on a single target-speaker utterance.,” in Proc. Interspeech, 2018, pp. 496–500.
- [4] S. Liu, Y. Cao, X. Wu, L. Sun, X. Liu, and H. Meng, “Jointly trained conversion model and wavenet vocoder for non-parallel voice conversion using mel-spectrograms and phonetic posteriorgrams,” Proc. Interspeech 2019, pp. 714–718, 2019.
- [5] T. Toda, A. Black, and K. Tokuda, “Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 8, pp. 2222–2235, 2007.
- [6] L. Sun, S. Kang, K. Li, and H. Meng, “Voice conversion using deep bidirectional long short-term memory based recurrent neural networks,” in Proc. Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, pp. 4869–4873.
- [7] T. Nakashika, T. Takiguchi, and Y. Minami, “Non-parallel training in voice conversion using an adaptive restricted boltzmann machine,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 11, pp. 2032–2045, 2016.
- [8] C. Hsu, H. Hwang, Y. Wu, Y. Tsao, and H. Wang, “Voice conversion from unaligned corpora using variational autoencoding wasserstein generative adversarial networks,” arXiv preprint arXiv:1704.00849, 2017.
- [9] Z. Luo, J. Chen, T. Takiguchi, and Y. Ariki, “Emotional voice conversion with adaptive scales f0 based on wavelet transform using limited amount of emotional data,” Proc. Interspeech 2017, pp. 3399–3403, 2017.
- [10] J. Tao, Y. Kang, and A. Li, “Prosody conversion from neutral speech to emotional speech,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 4, pp. 1145–1154, 2006.
- [11] Z. Inanoglu and S. Young, “A system for transforming the emotion in speech: Combining data-driven conversion techniques for prosody and voice quality,” in Eighth Annual Conference of the International Speech Communication Association, 2007.
- [12] Y. Xu, “Speech prosody: A methodological review,” Journal of Speech Sciences, vol. 1, no. 1, pp. 85–115, 2011.
- [13] D. Hirst and A. Di Cristo, Intonation systems: a survey of twenty languages, Cambridge University Press, 1998.
- [14] K. Yu, “Review of f0 modelling and generation in hmm based speech synthesis,” in Signal Processing (ICSP), 2012 IEEE 11th International Conference on. IEEE, 2012, vol. 1, pp. 599–604.
- [15] A. Wennerstrom, The music of everyday speech: Prosody and discourse analysis, Oxford University Press, 2001.
- [16] J. J. Ohala and H. Kawasaki-Fukumori, “Alternatives to the sonority hierarchy for explaining segmental sequential constraints,” Language and its ecology: Essays in memory of Einar Haugen, vol. 100, pp. 343, 1997.
- [17] J. Latorre and M. Akamine, “Multilevel parametric-base f0 model for speech synthesis,” in Ninth Annual Conference of the International Speech Communication Association, 2008.
- [18] C. Wu, C. Hsia, C. Lee, and M. Lin, “Hierarchical prosody conversion using regression-based clustering for emotional speech synthesis,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 6, pp. 1394–1405, 2010.
- [19] N. Obin, A. Lacheret, and X. Rodet, “Stylization and trajectory modelling of short and long term speech prosody variations,” in Proc. Interspeech, 2011.
- [20] Y. Qian, Z. Wu, B. Gao, and F. Soong, “Improved prosody generation by maximizing joint probability of state and longer units,” IEEE Transactions on Audio, Speech and Language Processing, vol. 19, no. 6, pp. 1702–1710, 2011.
- [21] A. Suni, D. Aalto, T. Raitio, P. Alku, M. Vainio, et al., “Wavelets for intonation modeling in hmm speech synthesis,” in 8th ISCA Workshop on Speech Synthesis, Proceedings, Barcelona, August 31-September 2, 2013. ISCA, 2013.
- [22] H. Ming, D. Huang, M. Dong, H. Li, L. Xie, and S. Zhang, “Fundamental frequency modeling using wavelets for emotional voice conversion,” in 2015 International Conference on Affective Computing and Intelligent Interaction (ACII). IEEE, 2015, pp. 804–809.
- [23] H. Ming, D. Huang, L. Xie, J. Wu, M. Dong, and H. Li, “Deep bidirectional lstm modeling of timbre and prosody for emotional voice conversion,” Proc. Interspeech 2016, pp. 2453–2457, 2016.
- [24] Z. Luo, T. Takiguchi, and Y. Ariki, “Emotional voice conversion using neural networks with different temporal scales of f0 based on wavelet transform.,” in SSW, 2016, pp. 140–145.
- [25] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in neural information processing systems, 2014, pp. 2672–2680.
- [26] Y. Saito, S. Takamichi, and H. Saruwatari, “Statistical parametric speech synthesis incorporating generative adversarial networks,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 1, pp. 84–96, 2018.
- [27] J. Zhu, T. Park, P. Isola, and A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” arXiv preprint, 2017.
- [28] T. Kaneko and H. Kameoka, “Parallel-data-free voice conversion using cycle-consistent adversarial networks,” arXiv preprint arXiv:1711.11293, 2017.
- [29] F. Fang, J. Yamagishi, I. Echizen, and J. Lorenzo-Trueba, “High-quality nonparallel voice conversion based on cycle-consistent adversarial network,” arXiv preprint arXiv:1804.00425, 2018.
- [30] M. Morise, F. Yokomori, and K. Ozawa, “World: a vocoder-based high-quality speech synthesis system for real-time applications,” IEICE TRANSACTIONS on Information and Systems, vol. 99, no. 7, pp. 1877–1884, 2016.
- [31] D. Ulyanov, A. Vedaldi, and V.S. Lempitsky, “Instance normalization: the missing ingredient for fast stylization.,” CoRR abs/1607.0, 2016.