
SAR: Self-Supervised Anti-Distortion Representation for End-To-End Speech Model
*Corresponding author: Xulong Zhang, [email protected]

Jianzong Wang1, Xulong Zhang1,∗, Haobin Tang1,2, Aolan Sun1, Ning Cheng1, Jing Xiao1
1Ping An Technology (Shenzhen) Co., Ltd.
2University of Science and Technology of China
https://largeaudiomodel.com
Abstract

In recent Text-to-Speech (TTS) systems, a neural vocoder often generates speech samples by solely conditioning on acoustic features predicted from an acoustic model. However, the predicted acoustic features always contain distortions compared to the ground truth, especially in the common case of poor acoustic modeling due to low-quality training data. To overcome such limits, we propose a Self-supervised learning framework to learn an Anti-distortion acoustic Representation (SAR) that replaces human-crafted acoustic features, by introducing a distortion prior into an auto-encoder pre-training process. Through both objective and subjective evaluation, the acoustic representation learned by the proposed framework is shown to be more robust to distortion than the commonly used mel-spectrogram.

Index Terms:
self-supervised learning, anti-distortion, auto-encoder, speech synthesis

I Introduction

Text-to-speech synthesis (TTS) [1, 2, 3, 4, 5] is a conditional generation task in which each input text can map to many possible output utterances, conditioned on factors such as the speaker. A contemporary TTS system primarily comprises three key modules: text analysis, an acoustic model, and a vocoder [6]. The text analysis module normalizes the input text and divides it into phonemes. The acoustic model builds the mapping from phoneme embeddings to acoustic features such as the spectrum. The vocoder finally transforms the acoustic features into the waveform of the synthesized audio. In the back-end of a TTS system, acoustic features are generated by the acoustic model and subsequently fed into the vocoder, which produces speech samples. However, when the speech data used for training is of low quality, the predicted acoustic features often contain distortions or are incomplete, which directly degrades the quality of the synthesized speech.

To overcome the above issues, we propose to exploit the high-level coherent structure that can be observed both across frequency and across time in speech signals. We believe that there exists a high-level structure connecting the different acoustic parts, and that speech can only be generated when these acoustic parts are coherent with one another. Moreover, given the observed correlations and this coherent structure, missing components may be inferred from the remaining ones. Similarly, when faced with corrupted acoustic features, the learned representation of this high-level coherent structure is expected to help reconstruct undistorted speech. Besides, to make the auto-encoder focus more on representation learning, we only use a recurrent structure in the encoder, while the decoder is a simple stack of fully-connected layers. The trained encoder, with latent space masking disabled, is used as an acoustic feature extraction component, and the newly extracted acoustic features are used to train an acoustic model. For the neural vocoder, to speed up training, we connect a trained neural vocoder to the simple decoder of the auto-encoder and fine-tune them jointly. During vocoder training, we keep the latent space masking strategy so that the anti-distortion property is not lost to overfitting.

To obtain this high-level coherent structure within speech, we turn to the methodologies of self-supervised learning. However, most speech representation learning approaches focus on learning contextual information that is more global and discard local details, and their downstream tasks are mostly classification tasks such as speech recognition. Nonetheless, the local details are key factors in regression tasks such as speech synthesis. Thus, we propose a self-supervised representation learning method that encodes both the detailed local information and the contextual information, incorporated with a distortion-aware prior. For pre-training, an auto-encoder is built to reconstruct the mel-spectrogram, during which a distortion-aware prior is introduced by randomly masking the latent space of the auto-encoder with various ratios. The distortion-aware prior forces the auto-encoder to learn a high-level coherent structure capable of retrieving missing information from the rest of the latent space features. The learned latent space features are used to replace the human-crafted mel-spectrogram when building the downstream speech synthesis system. In our experiments, we adopt Tacotron2 [7] as the acoustic model and WaveGlow [8] as the neural vocoder for TTS systems. The primary contributions of our work are as follows:

  • Distortion-aware priors are introduced into representation learning through a masking strategy to impart anti-distortion properties to the learned acoustic representations.

  • The learned acoustic representations with masking can be used in place of hand-crafted mel-spectrograms to build TTS systems with more refined acoustic features.

  • The joint training of the vocoder and the encoder helps train TTS models with low-quality speech data.

II Related Work

Two main issues keep the acoustic features from being distortion-free. Firstly, although the recently proposed end-to-end acoustic models [9, 7, 10, 11, 12, 13, 14, 15, 16] and neural vocoders [17, 8, 18, 19, 20, 21] have shown great progress in synthesizing human-like speech compared to conventional methods such as [22], there is always a gap between the ground-truth acoustic features and the predicted ones. As the ground-truth acoustic features are used to train the acoustic model and the neural vocoder separately, the prediction error, or the distortion, always exists. Some researchers approach this issue by training neural vocoders with the predicted acoustic features of the counterpart acoustic model [7], but this method has its limits: it does not generalize to different acoustic models, and one needs to retrain the neural vocoder each time the corresponding acoustic model changes. Secondly, the prediction capability of an acoustic model is affected by multiple factors such as the amount and the quality of speech data in the training set. Specifically, due to the great cost of collecting high-quality studio speech data, many data collection tasks are conducted in unprofessional rooms with background noise and even reverberation. Meanwhile, some acoustic models may be trained with data of low sampling rate, due to the limitation of the codec on the device or the transmission channel. The low-quality speech data leads to poor convergence of acoustic models, and hence distortions appear in the predicted acoustic features.

In the frequency dimension, the fundamental frequency [23] and the harmonics generated through the vibration of the vocal cords are highly correlated mathematically, with each harmonic being an integer multiple of the fundamental frequency. Meanwhile, on the time scale, the dynamic features such as the delta and delta-delta cepstral coefficients adopted in Maximum Likelihood Parameter Generation (MLPG) [24] to help better predict static acoustic features indicate the existence of a contextual correlation between speech frames.

Speech representations have been learned through auto-encoders in several previous works. Chorowski et al. [25] exploited the auto-encoder to reconstruct spectrogram frames by applying different constraints to the latent space. However, their work focused more on unsupervised speech-token mapping than on speech synthesis. The auto-encoder in [26] learns deep latent features for speech synthesis, but no contextual information is incorporated. More recently, He et al. [27] proposed an end-to-end structure to learn new acoustic features that replace human-crafted ones for neural vocoders, reducing inference cost while keeping high voice quality. Modeling directly in the time domain is hard, and they applied many tricks to make the waveform-to-waveform mapping trainable. Therefore, the representation learning in our work is performed in the frequency domain, with our focus on learning an anti-distortion acoustic representation for speech synthesis at the same time. The idea of learning an intermediate feature similar to the mel-spectrogram is also inspired by [28, 29].

As for self-supervised approaches, most self-supervised learning methods focus on utilizing large amounts of unlabeled data, which are much easier to collect, for representation pre-training, and then use the pre-trained representation encoder for downstream tasks. Bidirectional Encoder Representations from Transformers (BERT) [30, 31], which is pre-trained to learn informative semantic representations, is now widely used as a pre-trained model for almost all kinds of natural language processing tasks. In the field of speech, regarding the prediction of future information, Contrastive Predictive Coding [32] is one of the self-supervised learning methods that extracts useful speech representations by predicting future data samples. In [33], speech representations are learned by reconstructing altered frames of acoustic features.

III Proposed Method

III-A Self-supervised pre-training

Figure 1: The architecture of speech representation learning. (a) The architecture of the self-supervised auto-encoder. The blanked-out blocks at the bottom of the decoder denote masking of the encoder outputs, implemented as random dropout. (b) Fine-tuning of the recurrent encoder during neural vocoder training. Similarly, the corrupted blocks below the vocoder denote masking blocks. (c) Training of the acoustic model using the learned anti-distortion representations.

III-A1 Introducing distortion-aware prior through masking

Self-supervised learning methods usually design an objective in order to learn informative representations from unlabeled data [33, 34, 25, 35]. Reconstruction is one such objective, extracting representations by reconstructing unlabeled data through a specific model structure such as an auto-encoder [25, 35, 36]. However, most of these methods focus on learning contextual features and discard local features. In our proposal, targeting speech synthesis tasks and local feature recovery, we choose to mask the learned latent space features. By masking latent space features, the auto-encoder is forced to learn a higher-level coherent structure from the relatively low-level input mel-spectrogram. These learned latent space features are the ones we use to replace the human-crafted mel-spectrogram for downstream speech synthesis tasks. Specifically, as depicted in Figure 1 (a), the latent space features at each time step are randomly masked at a ratio $\alpha$ during training, while $\alpha=0$ during the inference stage. The ratio $\alpha$ is sampled from a uniform distribution whose minimum and maximum are 0 and $\alpha_{max}$, respectively.

$\alpha \sim f(\alpha)$ (1)
$f(\alpha) = \frac{1}{b-a}, \quad a < \alpha < b$ (2)
$f(\alpha) = 0, \quad \text{otherwise}$ (3)

where $a=0$ and $b=\alpha_{max}$.
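For concreteness, the masking step can be realized with a dropout call whose rate is re-sampled at every training step. The following is a minimal PyTorch sketch of our interpretation; the function name and tensor layout are our own and not taken from the paper.

```python
import torch
import torch.nn.functional as F

def mask_latent(z: torch.Tensor, alpha_max: float = 0.2, training: bool = True) -> torch.Tensor:
    # z: latent features of shape (batch, time, dim).
    # A fresh masking ratio alpha ~ U(0, alpha_max) is drawn for every call;
    # masking is switched off entirely at inference (alpha = 0).
    if not training:
        return z
    alpha = float(torch.empty(1).uniform_(0.0, alpha_max))
    # The paper describes the masking as dropout; note that F.dropout also
    # rescales the kept values by 1 / (1 - alpha).
    return F.dropout(z, p=alpha, training=True)
```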

With this masking strategy, the distortion-aware prior is introduced into the training of the auto-encoder. Hence, the learned latent space features are forced into a high-level coherent structure in which different parts of the features are correlated with each other. Thus, the missing parts of the learned representation can be inferred from the remaining features, which constitutes the anti-distortion property.

III-A2 Low-level acoustic representation

To extract a high-level acoustic representation with anti-distortion properties for speech synthesis tasks, the low-level acoustic representation should at least contain enough local acoustic texture of speech. The mel-spectrogram is a common choice for neural vocoder training, and neural vocoders conditioned on the ground-truth mel-spectrogram are capable of synthesizing very natural speech samples. Based on this prior work, we pick the mel-spectrogram as the low-level acoustic representation for the speech frame reconstruction objective. Thus, the auto-encoder is trained to reconstruct the mel-spectrogram with its latent space features randomly masked.

Figure 2: The detailed architecture of the auto-encoder. The numbers above the blocks represent the layer sizes of the network. FC Layer represents Fully-Connected Layer, PReLU represents Parametric Rectified Linear Unit, and BLSTM represents Bidirectional Long Short-Term Memory. (a) Encoder of the auto-encoder, (b) Masking strategy during the auto-encoder training, (c) Decoder of the auto-encoder.

III-A3 Using auto-encoder for representation learning

This masking process is the key to learning a high-level coherent structure, in which the missing parts can be inferred from the remaining features. An auto-encoder is utilized to reconstruct the mel-spectrogram in our proposed self-supervised pre-training with a distortion-aware prior. The overall structure of the auto-encoder pre-training is illustrated in Figure 1 (a): a typical encoder-decoder structure, but with the latent space features randomly masked, denoted by the white blocks after applying the masking strategy. $L_{recon}$ represents the reconstruction loss in Equations (9) and (10). Figure 2 presents the detailed structure of the auto-encoder.

To incorporate both the speech texture of the current frame and contextual information, a recurrent encoder $E(\cdot)$ is used to encode a sequence of mel-spectrogram frames $m(t)$ in bidirectional order, where $t$ is the frame time step. Since what we propose is a pre-training framework, the recurrent encoder can consist of any type of layer that captures contextual information. In our experiments, as shown in Figure 2, the recurrent encoder is simply a stack of fully connected (FC) layers $F_{n}(\cdot)$ and Bidirectional Long Short-Term Memory (BLSTM) layers $B_{n}(\cdot)$, followed by a tanh-activated FC layer $F(\cdot)$, where $n$ denotes the number of layers. The encoding procedure of the auto-encoder model can be written as follows:

E()=F(Bn(Fn()))E(\cdot)=F(B_{n}(F_{n}(\cdot))) (4)
z(t)=E(m(t))z(t)=E(m(t)) (5)

For each time step $t$, the output of the recurrent encoder $z(t)$ is randomly masked with masking ratio $\alpha$, which is simply implemented by dropout $drop(\cdot,\alpha)$. This is the key to learning a high-level coherent structure in which the missing parts can be inferred from the remaining features.

$z^{\prime}(t) = drop(z(t), \alpha)$ (6)

The masked latent space features $z^{\prime}(t)$ are then fed into a decoder $D(\cdot)$ to output reconstructed mel-spectrogram frames $\hat{m}(t)$. The decoder is designed to be simple and only consists of several FC layers, solely for feed-forward feature mapping.

$D(\cdot) = F_{n}(\cdot)$ (7)
$\hat{m}(t) = D(z^{\prime}(t))$ (8)

This design forces the learned high-level coherent structure of speech to reside mainly in the latent space features, which makes these learned features as informative as possible. Overall, as illustrated in Figure 1 (a), the pre-training stage follows a common auto-encoder training scheme with the Mean-Square Error (MSE) as the reconstruction criterion.

$\min\limits_{E(\cdot),\,D(\cdot)} L_{recon}$ (9)

where

$L_{recon} = \mathbb{E}\left[\|\hat{m}(t) - m(t)\|_{2}^{2}\right]$ (10)
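A minimal sketch of one pre-training step, assuming simplified feed-forward stand-ins for $E(\cdot)$ and $D(\cdot)$ (the BLSTM-based encoder of Figure 2 is sketched separately in Section IV-B); the code below is illustrative only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Simplified stand-ins for E(.) and D(.), not the exact architecture of Fig. 2.
encoder = nn.Sequential(nn.Linear(80, 256), nn.PReLU(), nn.Linear(256, 256), nn.Tanh())
decoder = nn.Sequential(nn.Linear(256, 128), nn.PReLU(), nn.Linear(128, 80))
optimizer = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-4)

def pretrain_step(mel: torch.Tensor, alpha_max: float = 0.2) -> float:
    # mel: ground-truth mel-spectrogram frames, shape (batch, time, 80)
    z = encoder(mel)                                  # Eq. (5): z(t) = E(m(t))
    alpha = float(torch.empty(1).uniform_(0.0, alpha_max))
    z_masked = F.dropout(z, p=alpha, training=True)   # Eq. (6): distortion-aware masking
    mel_hat = decoder(z_masked)                       # Eq. (8): m_hat(t) = D(z'(t))
    loss = F.mse_loss(mel_hat, mel)                   # Eq. (10): L_recon
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```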

III-B Downstream neural vocoder training

In common neural vocoder training [8], acoustic features such as the mel-spectrogram are extracted from ground-truth speech samples and condition the neural vocoder for speech generation. For the downstream neural vocoder training, there are two differences compared to the common scheme. Firstly, the recurrent encoder part $E(\cdot)$ of the trained auto-encoder is jointly fine-tuned with the neural vocoder, so that the recurrent encoder can be further adapted directly to speech sample generation. Secondly, the mel-spectrogram is replaced by the learned representation SAR to condition the neural vocoder $V(\cdot)$.

As presented in Figure 1 (b), the mel-spectrogram of each frame $m(t)$ is first obtained by human-crafted signal processing from ground-truth speech samples, as in the common scheme. Then, the frames of the mel-spectrogram $m(t)$ are fed into the recurrent encoder $E(\cdot)$ to get the learned representation of each frame $z(t)$. The masking strategy of the pre-training stage is kept to maintain the anti-distortion property, which transforms $z(t)$ into the masked representation $z^{\prime}(t)$.

$z(t) = E(m(t))$ (11)

where $E(\cdot)$ is pre-trained as described in Section III-A3 but fine-tuned in this second, neural vocoder training stage.

Finally, the frames of the masked representation $z^{\prime}(t)$ condition the neural vocoder for downstream training. The masking strategy is only activated during training, so that the distortion-aware prior is incorporated into both the recurrent encoder and the neural vocoder. In the copy-synthesis inference of the downstream neural vocoder, the masking strategy is deactivated. In other words, frames of $z(t)$ instead of $z^{\prime}(t)$ are used to condition the neural vocoder to generate speech samples, because the fine-tuned encoder generates fixed latent representations that replace the mel-spectrogram in supervising the neural vocoder training.

Hence, during training, the conditional features are the masked learned representation $z^{\prime}(t)$. During copy-synthesis with the neural vocoder, the representation encoder is fixed and becomes a feature extractor, which extracts $z(t)$ for each frame to condition the neural vocoder for waveform generation. In the fine-tuning stage, the pre-trained auto-encoder is connected to a pre-trained neural vocoder, originally conditioned on the mel-spectrogram, to further adapt the learned representation directly to the waveform. Finally, the learned latent space features $z(t)$ are used to replace the mel-spectrogram.
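A minimal sketch of one joint fine-tuning step is given below; the `vocoder.loss(audio, cond=...)` call is a hypothetical WaveGlow-style interface (standing in for the negative of Equation (12)) and is our assumption, not the paper's actual code.

```python
import torch
import torch.nn.functional as F

def vocoder_finetune_step(mel, audio, encoder, vocoder, optimizer, alpha_max=0.2):
    # mel:   ground-truth mel-spectrogram frames, shape (batch, time, 80)
    # audio: the matching waveform segment, shape (batch, samples)
    z = encoder(mel)                                        # z(t) = E(m(t))
    alpha = float(torch.empty(1).uniform_(0.0, alpha_max))
    z_masked = F.dropout(z, p=alpha, training=True)         # masking kept during training
    loss = vocoder.loss(audio, cond=z_masked)               # condition on SAR instead of mel
    optimizer.zero_grad()
    loss.backward()                                         # gradients reach both encoder and vocoder
    optimizer.step()
    return loss.item()
```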

Figure 3: The architecture of the neural vocoder used in downstream neural vocoder training. "Split" denotes a split along the channel dimension. "WN" denotes the modified WaveNet. $x$ denotes the ground-truth audio, and $z$ denotes the variable sampled from the Gaussian distribution.

The neural vocoder in this training stage basically follows the design of WaveGlow, shown in Figure 3; the likelihood function of the model is as follows:

$\log p_{\theta}(x) = -\frac{z(x)^{T}z(x)}{2\sigma^{2}} + \sum_{j=0}^{n_{cp}}\log s_{j}(x, m(t)) + \sum_{k=0}^{n_{cv}}\log\det|W_{k}|$ (12)

where

$z \sim \mathcal{N}(z; 0, \bm{I})$ (13)
$x = f_{0} \circ f_{1} \circ \cdots \circ f_{k}(z)$ (14)
$z = f_{k}^{-1} \circ f_{k-1}^{-1} \circ \cdots \circ f_{0}^{-1}(x)$ (15)

where the first term is derived from the log-likelihood of a spherical Gaussian distribution of the variable $z(x)$. $\sigma^{2}$ represents the assumed variance of the Gaussian distribution, while the remaining terms account for the change of variables. $n_{cp}$ denotes the number of coupling layers and $n_{cv}$ denotes the number of convolutional layers. $s_{j}$ denotes the scaling term of the affine coupling transformation applied during the change of variables, and $m(t)$ denotes the mel-spectrogram. $W_{k}$ denotes the weights used in the 1x1 convolutions; these weights are initialized to be orthonormal, thereby ensuring their invertibility. $\log\det$ denotes the log-determinant of the Jacobian.
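For orientation, the training criterion is the negative of Equation (12). A minimal sketch of that loss is shown below, assuming the flow's forward pass returns its per-layer log-scales and 1x1-convolution log-determinants; it mirrors the original WaveGlow objective rather than any code released with this paper.

```python
import torch

def waveglow_nll(z, log_s_list, log_det_w_list, sigma=1.0):
    # Negative log-likelihood of Eq. (12): the spherical-Gaussian term on z(x),
    # minus the per-coupling-layer log-scales and the log-determinants of the
    # invertible 1x1 convolutions.
    gauss_term = torch.sum(z * z) / (2 * sigma ** 2)
    log_s_term = sum(torch.sum(ls) for ls in log_s_list)
    log_det_term = sum(torch.sum(ld) for ld in log_det_w_list)
    return gauss_term - log_s_term - log_det_term
```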

As illustrated in Figure 1 (c), unlike the downstream neural vocoder training stage, which jointly trains both the recurrent encoder and the neural vocoder, the downstream acoustic model training only utilizes the recurrent encoder as a feature extractor. Moreover, the masking strategy is deactivated, and the acoustic model directly uses $z(t)$ instead of the masked features $z^{\prime}(t)$ as the target output for training. There are two conjectures about the advantages of the neural vocoder training stage. Firstly, connecting the auto-encoder with the neural vocoder further adapts the learned representation $z(t)$ directly to waveform generation, which leads to a better match. Secondly, connecting the trained auto-encoder and the trained neural vocoder makes the joint modeling much easier to train, since the output of the auto-encoder exactly matches the conditional acoustic feature input of the neural vocoder. During inference, only the decoder part of the auto-encoder, which is simply a stack of FC layers, is kept connected with the neural vocoder to generate speech samples. In other words, the input latent space features $z(t)$, together with the decoder part of the auto-encoder, replace the conditional acoustic features.

The whole inference pipeline of the back-end of the downstream-trained TTS system works just as in the common scheme. The acoustic model $A(\cdot)$ follows Tacotron2; it accepts text information $s$ as input and generates a sequence of predicted acoustic representations $\hat{z}(t)$. Then, the frames of $\hat{z}(t)$ condition the neural vocoder to generate speech samples. The objective function of the model is as follows:

$\min\limits_{A(\cdot)} L = \mathbb{E}\left[\|\hat{z}(t) - z(t)\|_{2}^{2}\right]$ (16)

where

$\hat{z}(t) = A(s)$ (17)
$z(t) = E(m(t))$ (18)
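A minimal sketch of one acoustic-model training step under Equations (16)-(18) follows; `acoustic_model(text)` is a hypothetical Tacotron2-style interface (the real model additionally predicts stop tokens and uses teacher forcing), so this is illustrative only.

```python
import torch
import torch.nn.functional as F

def acoustic_model_step(text, mel, acoustic_model, encoder, optimizer):
    # The frozen recurrent encoder acts purely as a feature extractor here,
    # with masking deactivated, so the target is z(t) rather than z'(t).
    with torch.no_grad():
        z_target = encoder(mel)             # Eq. (18): z(t) = E(m(t))
    z_pred = acoustic_model(text)           # Eq. (17): z_hat(t) = A(s)
    loss = F.mse_loss(z_pred, z_target)     # Eq. (16)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```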

III-C Inference stage

During inference, the whole back-end of the TTS system uses the learned acoustic representation. First, the predicted representation $\hat{z}(t)$ is generated by the acoustic model; then $\hat{z}(t)$ conditions the neural vocoder to generate speech samples.

$\hat{z}(t) = A(m(t))$ (19)
$\hat{x} = V(\hat{z}(t))$ (20)

Thus, for the acoustic model, the only difference is the change of target acoustic features. Although the newly predicted acoustic features $\hat{z}$ still have a gap from their ground-truth counterparts, their anti-distortion property significantly reduces the effects of the distortions.
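A tiny inference sketch, assuming the same hypothetical module interfaces as above (`acoustic_model(text)` and a WaveGlow-style `vocoder.infer(...)`): in evaluation mode the dropout-based masking is inactive, so the unmasked prediction conditions the vocoder.

```python
import torch

@torch.no_grad()
def synthesize(text, acoustic_model, vocoder):
    # Eval mode disables the dropout-based masking, matching Eqs. (19)-(20).
    acoustic_model.eval()
    vocoder.eval()
    z_hat = acoustic_model(text)      # predicted SAR frames
    return vocoder.infer(z_hat)       # waveform samples
```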

IV Experiments

IV-A Dataset

We evaluate our proposed representation on both English and Mandarin datasets. For English, the self-supervised pre-training is based on the VCTK [37] dataset (109 English speakers with 400 sentences per speaker), and the downstream neural vocoder is trained with the LJSPEECH [38] dataset, a single-speaker dataset with 13,100 sentences. The Mandarin dataset used for pre-training is an internal dataset composed of 3 male and 3 female speakers with an average of around 9,000 read-out sentences per speaker. The downstream Mandarin dataset is the single-speaker CSMSC [39] corpus with 10,000 sentences in total. The above datasets are split into training, validation, and test sets at percentages of 90%, 5%, and 5%, respectively.

IV-B Model configuration

In our experiments on the auto-encoder self-supervised pre-training, the encoder of the auto-encoder first transforms an 80-dim mel-spectrum through two FC layers with 256 hidden units. To circumvent the issue of the ReLU activation output being entirely zero when all input features are negative, the Parametric Rectified Linear Unit (PReLU) is used as the nonlinear activation function. After the FC transform, two BLSTM layers are stacked to encode the whole sequence in a many-to-many scheme, with the number of output states also being 256. Finally, an FC layer with a hyperbolic tangent (Tanh) activation further transforms the output of the BLSTM into $z(t)$, constraining the representation to lie within $[-1,1]$. The decoder is much simpler, stacking a PReLU-activated FC layer and a linear FC layer to output the final acoustic representation. The first FC layer contains 128 hidden units, whereas the output FC layer consists of 80 units. As for training, the batch size is 64, and the Adam optimizer is applied with a 1e-4 learning rate. For checkpoint selection, early stopping is applied based on the validation loss. The ratio $\alpha$ is sampled from a uniform distribution over the interval $[0,0.2]$.
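A sketch of the encoder and decoder described above is given below. Two details are our assumptions: we read the "256 output states" of the BLSTM as 128 hidden units per direction (so the concatenated output is 256), and the dimension of $z(t)$ is not stated in the text, so `latent_dim` is a hypothetical parameter.

```python
import torch
import torch.nn as nn

class SAREncoder(nn.Module):
    # Encoder of Fig. 2(a): two PReLU FC layers (256 units), two BLSTM layers,
    # and a tanh-activated FC output layer.
    def __init__(self, mel_dim: int = 80, hidden: int = 256, latent_dim: int = 128):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(mel_dim, hidden), nn.PReLU(),
            nn.Linear(hidden, hidden), nn.PReLU(),
        )
        self.blstm = nn.LSTM(hidden, hidden // 2, num_layers=2,
                             bidirectional=True, batch_first=True)
        self.out = nn.Sequential(nn.Linear(hidden, latent_dim), nn.Tanh())

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, time, mel_dim) -> z: (batch, time, latent_dim), in [-1, 1]
        h, _ = self.blstm(self.fc(mel))
        return self.out(h)

class SARDecoder(nn.Module):
    # Decoder of Fig. 2(c): a PReLU FC layer with 128 units followed by a
    # linear FC layer back to the 80-dim mel-spectrogram.
    def __init__(self, latent_dim: int = 128, mel_dim: int = 80):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(latent_dim, 128), nn.PReLU(),
                                 nn.Linear(128, mel_dim))

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return self.net(z)
```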

For downstream tasks in our experiments, WaveGlow is selected as the neural vocoder. We use a parameter setup similar to that in [8]. Each audio clip contains 16,000 samples. The batch size is 8, with the Adam optimizer and a learning rate of 1e-4. For ground-truth acoustic feature extraction, the Fast Fourier Transform (FFT) window size is 1024 with a hop size of 16 milliseconds (256 samples at a 16000 Hz sampling rate, and 128 at 8000 Hz). As for the acoustic model, Tacotron2 [7] is utilized to predict acoustic features. The Pre-net dimension is set to 256, the embedding dimension is 512, and the dimension of the attention recurrent neural network (RNN) is 1024. The dimension of the decoder RNN is set to 1024, and the Postnet embedding dimension is 512. The rest of the hyper-parameters are the same as in [7].
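For reference, a minimal mel-spectrogram extraction sketch with the settings above; the mel filter bank and the log compression follow librosa defaults and are our assumptions, not the paper's exact front end.

```python
import librosa
import numpy as np

def extract_mel(wav_path: str, sr: int = 16000) -> np.ndarray:
    # 1024-point FFT, 16 ms hop (256 samples at 16 kHz), 80 mel bands.
    y, _ = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                         hop_length=256, n_mels=80)
    return np.log(np.clip(mel, 1e-5, None)).T  # (frames, 80)
```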

IV-C Evaluation systems

To objectively compare the anti-distortion property of the learned representation under different corruption conditions, we built 3 copy-synthesis systems based on both Mandarin and English datasets:

  • Mel-WaveGlow: Waveglow based on mel-spectrogram.

  • SAR-WaveGlow: Waveglow based on the learned representation with auto-encoder self-supervised pre-training.

  • AR-WaveGlow: Waveglow based on the learned representation without auto-encoder self-supervised pre-training, i.e., the recurrent encoder of the auto-encoder is directly connected to a trained Waveglow for downstream fine-tune training.

We designed two types of distortions to corrupt the extracted ground-truth acoustic features; a simulation sketch follows the list below:

  • White noise: white noise is simulated by sampling from a standard Gaussian distribution; it is additive noise applied directly to the acoustic features at a target Signal-to-Noise Ratio (SNR). In our experimental setup, we tried two different target SNRs: 10 dB and 15 dB.

  • Masking: similar to the masking strategy in the auto-encoder pre-training, we randomly masked the acoustic features with masking ratios $\alpha=0.1$ and $\alpha=0.2$.
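A minimal sketch of how these two corruptions can be simulated on a feature matrix; the function names are ours and illustrative only.

```python
import numpy as np

def add_white_noise(feat: np.ndarray, snr_db: float) -> np.ndarray:
    # Additive Gaussian noise scaled so the feature-to-noise power ratio
    # matches the target SNR (10 dB or 15 dB in the experiments).
    noise = np.random.randn(*feat.shape)
    scale = np.sqrt(np.mean(feat ** 2) / (np.mean(noise ** 2) * 10 ** (snr_db / 10)))
    return feat + scale * noise

def mask_features(feat: np.ndarray, alpha: float) -> np.ndarray:
    # Randomly zero a fraction alpha of the feature entries (0.1 or 0.2).
    keep = np.random.rand(*feat.shape) >= alpha
    return feat * keep
```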

IV-D Objective evaluation results

TABLE I: ESTOI scores comparison on the anti-distortion property on the Mandarin dataset

System       | Raw   | Mask $\alpha=0.1$ | Mask $\alpha=0.2$ | SNR 15 dB | SNR 10 dB
Mel-WaveGlow | 0.927 | 0.837             | 0.754             | 0.830     | 0.726
AR-WaveGlow  | 0.881 | 0.862             | 0.847             | 0.859     | 0.822
SAR-WaveGlow | 0.891 | 0.877             | 0.855             | 0.881     | 0.859

TABLE II: ESTOI scores comparison on the anti-distortion property on the English dataset

System       | Raw   | Mask $\alpha=0.1$ | Mask $\alpha=0.2$ | SNR 15 dB | SNR 10 dB
Mel-WaveGlow | 0.904 | 0.783             | 0.681             | 0.782     | 0.667
AR-WaveGlow  | 0.859 | 0.841             | 0.816             | 0.856     | 0.842
SAR-WaveGlow | 0.866 | 0.855             | 0.830             | 0.860     | 0.846

The Extended Short-Time Objective Intelligibility (ESTOI) scores are computed for objective evaluation. The ESTOI score is sensitive both to incorrect spectral profile reconstructions and to inconsistent temporal pattern reconstructions. A higher ESTOI score indicates better voice quality and demonstrates a stronger anti-distortion property. Tables I and II report the ESTOI evaluation on 100 sentences randomly selected from the test sets of the CSMSC and LJSPEECH corpora, respectively. For the uncorrupted acoustic features, the original mel-spectrogram achieves the best ESTOI score. However, under the different distortion conditions, the ESTOI scores of Mel-WaveGlow degrade dramatically, while the ESTOI scores of SAR-WaveGlow suffer only a minor loss compared to the raw condition. This shows the strong anti-distortion property of the learned representation against both types of corruption. Even though the noise-adding corruption is not seen during representation learning, the learned representation still generalizes well. Also, by comparing the ESTOI scores of SAR-WaveGlow and AR-WaveGlow, the self-supervised pre-training part of our proposal proves to be necessary for anti-distortion representation training: in both Table I and Table II, every ESTOI score of SAR-WaveGlow is consistently better than that of AR-WaveGlow.
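For reproducibility, ESTOI can be computed with an off-the-shelf implementation such as pystoi; the sketch below is our assumption about the evaluation tooling, not the paper's actual script.

```python
import librosa
from pystoi import stoi

def estoi_score(ref_path: str, syn_path: str, sr: int = 16000) -> float:
    # Extended STOI between a reference recording and its (copy-)synthesis.
    ref, _ = librosa.load(ref_path, sr=sr)
    syn, _ = librosa.load(syn_path, sr=sr)
    n = min(len(ref), len(syn))                 # align lengths before scoring
    return stoi(ref[:n], syn[:n], sr, extended=True)
```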

Figure 4: MOS test results of different tasks on uncorrupted or corrupted datasets. (a) Waveglow copy-synthesis, (b) TTS system.

IV-E Subjective evaluation results

To evaluate the anti-distortion property of the learned representation in speech synthesis scenarios, we built Mandarin TTS systems for subjective evaluation. According to our presumption, there exists a coherent structure within acoustic features that helps recover missing parts. In real-world scenarios, there are cases in which we can only retrieve low-sampling-rate speech data of a specific speaker, due to limited internet bandwidth or device limits, while still wanting to build a high-sampling-rate TTS system. To mimic those scenarios, we simulate a low-quality speech corpus by first downsampling and then upsampling the speech data. Specifically, we downsampled the CSMSC corpus from 16000 Hz to 8000 Hz and then upsampled it back to 16000 Hz, so that the high-frequency information is removed. Based on this corrupted CSMSC corpus, we separately trained two Tacotron2 models, between which the only difference is the acoustic features, i.e., the mel-spectrogram $m(t)$ versus the learned representation $z(t)$. The previously trained 16000 Hz WaveGlow models were reused for both acoustic features.
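A minimal sketch of this corpus corruption, assuming librosa resampling and soundfile for writing; the exact resampling filter used in the paper is not stated, so this is illustrative.

```python
import librosa
import soundfile as sf

def simulate_low_quality(in_path: str, out_path: str, hi_sr: int = 16000, lo_sr: int = 8000):
    # Remove high-frequency content by downsampling to 8 kHz and then
    # upsampling back to 16 kHz, mimicking band-limited low-quality data.
    y, _ = librosa.load(in_path, sr=hi_sr)
    y_lo = librosa.resample(y, orig_sr=hi_sr, target_sr=lo_sr)
    y_back = librosa.resample(y_lo, orig_sr=lo_sr, target_sr=hi_sr)
    sf.write(out_path, y_back, hi_sr)
```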

We conducted two Mean Opinion Score (MOS) evaluations based on the same setup: 50 sentences were selected from the CSMSC test set, and 30 Mandarin speakers participated in the score ratings. We first conducted a MOS test on the results of copy-synthesis with WaveGlow models using different acoustic features. As shown in Figure 4(a), for uncorrupted acoustic features, the neural vocoder based on the mel-spectrogram achieves the best voice quality; the learned representation introduces only a minor degradation of the copy-synthesis voice quality compared to the mel-spectrogram. The other MOS test compared the anti-distortion property of the predicted acoustic features from acoustic models trained on the low-quality corrupted CSMSC corpus. Although the learned representation introduces some degradation into the copy-synthesis waveform, according to Figure 4(b), it shows stronger robustness to the low-quality training data for acoustic modeling. The learned representation, together with the jointly trained neural vocoder, appears capable of recovering some of the missing high-frequency information in the generated speech.

V Conclusions

In this work, we propose an anti-distortion self-supervised learning framework to create a new acoustic representation (SAR) that replaces the hand-crafted mel-spectrogram. This is based on the intuition that the mel-spectrogram contains a high-level coherent structure, so missing parts can be reconstructed from the remaining features. By introducing a distortion-aware prior into the auto-encoder pre-training stage, the anti-distortion property is granted to SAR. This anti-distortion property is verified by both objective and subjective analyses. We showed that the self-supervised pre-training stage is necessary for learning a representation with the anti-distortion property. Moreover, the anti-distortion property of SAR is superior to that of the mel-spectrogram, and it also generalizes to unseen corruptions such as white noise addition. We also built TTS systems based on SAR, and the subjective analysis shows the robustness of SAR to low-quality training data.

VI Acknowledgement

This work was supported by the Key Research and Development Program of Guangdong Province (grant No. 2021B0101400003). The corresponding author is Xulong Zhang ([email protected]).

References

  • [1] J. Kim, J. Kong, and J. Son, “Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech,” in International Conference on Machine Learning.   PMLR, 2021, pp. 5530–5540.
  • [2] B. Zhao, X. Zhang, J. Wang, N. Cheng, and J. Xiao, “nnspeech: Speaker-guided conditional variational autoencoder for zero-shot multi-speaker text-to-speech,” in 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP2022).   IEEE, 2022, pp. 4293–4297.
  • [3] H. Tang, X. Zhang, J. Wang, N. Cheng, and J. Xiao, “Learning speech representations with flexible hidden feature dimensions,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2023.
  • [4] Z. Zeng, J. Wang, N. Cheng, and J. Xiao, “Prosody Learning Mechanism for Speech Synthesis System Without Text Length Limit,” in Proc. Interspeech 2020, 2020, pp. 4422–4426.
  • [5] X. Zhang, J. Wang, N. Cheng, and J. Xiao, “Semi-supervised learning based on reference model for low-resource tts,” in 2022 18th International Conference on Mobility, Sensing and Networking (MSN), 2022, pp. 966–971.
  • [6] X. Tan, T. Qin, F. Soong, and T.-Y. Liu, “A survey on neural speech synthesis,” arXiv preprint arXiv:2106.15561, 2021.
  • [7] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerrv-Ryan, R. A. Saurous, Y. Agiomvrgiannakis, and Y. Wu, “Natural tts synthesis by conditioning wavenet on mel spectrogram predictions,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 4779–4783.
  • [8] R. Prenger, R. Valle, and B. Catanzaro, “Waveglow: A flow-based generative network for speech synthesis,” in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 3617–3621.
  • [9] Y. Wang, R. J. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio, Q. V. Le, Y. Agiomyrgiannakis, R. Clark, and R. A. Saurous, “Tacotron: Towards end-to-end speech synthesis,” in INTERSPEECH 2017, 2017, pp. 4006–4010.
  • [10] A. Łańcucki, “Fastpitch: Parallel text-to-speech with pitch prediction,” in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2021, pp. 6588–6592.
  • [11] N. Li, S. Liu, Y. Liu, S. Zhao, and M. Liu, “Neural speech synthesis with transformer network,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, no. 01, 2019, pp. 6706–6713.
  • [12] C. Yu, H. Lu, N. Hu, M. Yu, C. Weng, K. Xu, P. Liu, D. Tuo, S. Kang, G. Lei et al., “Durian: Duration informed attention network for speech synthesis.” in INTERSPEECH, 2020, pp. 2027–2031.
  • [13] Y. Ren, Y. Ruan, X. Tan, T. Qin, S. Zhao, Z. Zhao, and T.-Y. Liu, “Fastspeech: Fast, robust and controllable text to speech,” Advances in Neural Information Processing Systems, vol. 32, 2019.
  • [14] Y. Ren, C. Hu, X. Tan, T. Qin, S. Zhao, Z. Zhao, and T.-Y. Liu, “Fastspeech 2: Fast and high-quality end-to-end text to speech,” arXiv preprint arXiv:2006.04558, 2020.
  • [15] Z. Zeng, J. Wang, N. Cheng, T. Xia, and J. Xiao, “Aligntts: Efficient feed-forward text-to-speech system without explicit alignment,” in ICASSP 2020-2020 IEEE international conference on acoustics, speech and signal processing (ICASSP).   IEEE, 2020, pp. 6714–6718.
  • [16] H. Tang, X. Zhang, J. Wang, N. Cheng, and J. Xiao, “Qi-tts: Questioning intonation control for emotional speech synthesis,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2023.
  • [17] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, “Wavenet: A generative model for raw audio,” SSW, p. 125, 2016.
  • [18] N. Kalchbrenner, E. Elsen, K. Simonyan, S. Noury, N. Casagrande, E. Lockhart, F. Stimberg, A. van den Oord, S. Dieleman, and K. Kavukcuoglu, “Efficient neural audio synthesis,” in International Conference on Machine Learning, 2018, pp. 2410–2419.
  • [19] K. Matsubara, T. Okamoto, R. Takashima, T. Takiguchi, T. Toda, Y. Shiga, and H. Kawai, “Full-band lpcnet: A real-time neural vocoder for 48 khz audio with a cpu,” IEEE Access, 2021.
  • [20] A. Oord, Y. Li, I. Babuschkin, K. Simonyan, O. Vinyals, K. Kavukcuoglu, G. Driessche, E. Lockhart, L. Cobo, F. Stimberg et al., “Parallel wavenet: Fast high-fidelity speech synthesis,” in International conference on machine learning.   PMLR, 2018, pp. 3918–3926.
  • [21] N. Kalchbrenner, E. Elsen, K. Simonyan, S. Noury, N. Casagrande, E. Lockhart, F. Stimberg, A. Oord, S. Dieleman, and K. Kavukcuoglu, “Efficient neural audio synthesis,” in International Conference on Machine Learning.   PMLR, 2018, pp. 2410–2419.
  • [22] H. Zen, A. Senior, and M. Schuster, “Statistical parametric speech synthesis using deep neural networks,” in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 2013, pp. 7962–7966.
  • [23] M. Morise, F. Yokomori, and K. Ozawa, “World: A vocoder-based high-quality speech synthesis system for real-time applications,” IEICE Transactions on Information and Systems, vol. 99, no. 7, pp. 1877–1884, 2016.
  • [24] K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, and T. Kitamura, “Speech parameter generation algorithms for hmm-based speech synthesis,” in 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.00CH37100), vol. 3, 2000, pp. 1315–1318.
  • [25] J. Chorowski, R. J. Weiss, S. Bengio, and A. van den Oord, “Unsupervised speech representation learning using wavenet autoencoders,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 27, no. 12, pp. 2041–2053, 2019.
  • [26] S. Takaki and J. Yamagishi, “A deep auto-encoder based low-dimensional feature extraction from fft spectral envelopes for statistical parametric speech synthesis,” in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016, pp. 5535–5539.
  • [27] Y. He, H. Zhang, and Y. Wang, “Rawnet: Fast end-to-end neural vocoder.” arXiv preprint arXiv:1904.05351, 2019.
  • [28] E. Dunbar, X. N. Cao, J. Benjumea, J. Karadayi, M. Bernard, L. Besacier, X. Anguera, and E. Dupoux, “The zero resource speech challenge 2017,” in 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).   IEEE, 2017, pp. 323–330.
  • [29] M. Versteegh, R. Thiolliere, T. Schatz, X. N. Cao, X. Anguera, A. Jansen, and E. Dupoux, “The zero resource speech challenge 2015,” in Sixteenth annual conference of the international speech communication association, 2015.
  • [30] A. van den Oord, Y. Li, and O. Vinyals, “Representation learning with contrastive predictive coding,” arXiv preprint arXiv:1807.03748, 2018.
  • [31] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Jun. 2019, pp. 4171–4186.
  • [32] D. Jiang, X. Lei, W. Li, N. Luo, Y. Hu, W. Zou, and X. Li, “Improving transformer-based speech recognition using unsupervised pre-training.” arXiv preprint arXiv:1910.09932, 2019.
  • [33] A. T. Liu, S.-W. Li, and H.-y. Lee, “Tera: Self-supervised learning of transformer encoder representation for speech,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 2351–2366, 2021.
  • [34] J. Pennington, R. Socher, and C. Manning, “Glove: Global vectors for word representation,” in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1532–1543.
  • [35] D. Jiang, W. Li, M. Cao, W. Zou, and X. Li, “Speech simclr: Combining contrastive and reconstruction objective for self-supervised speech representation learning,” in INTERSPEECH 2021, pp. 1544–1548.
  • [36] H. Tang, X. Zhang, J. Wang, N. Cheng, Z. Zeng, E. Xiao, and J. Xiao, “Tgavc: Improving autoencoder voice conversion with text-guided and adversarial training,” in 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2021, pp. 938–945.
  • [37] C. Veaux, J. Yamagishi, and K. MacDonald, “CSTR VCTK corpus: English multi-speaker corpus for CSTR voice cloning toolkit,” The Centre for Speech Technology Research (CSTR), University of Edinburgh, 2016.
  • [38] K. Ito and L. Johnson, “The lj speech dataset,” https://keithito.com/LJ-Speech-Dataset/, 2017.
  • [39] Biao-Bei, “Chinese standard mandarin speech corpus,” https://www.data-baker.com/open_source.html, 2018.