
ZMM-TTS: Zero-shot Multilingual and Multispeaker Speech Synthesis Conditioned on Self-supervised Discrete Speech Representations

Cheng Gong,  Xin Wang,  Erica Cooper, 
Dan Wells,  Longbiao Wang,  Jianwu Dang, 
Korin Richmond,  and Junichi Yamagishi
Cheng Gong, Longbiao Wang and Jianwu Dang are with the Tianjin University, Tianjin, China. Xin Wang, Erica Cooper, and Junichi Yamagishi are with the National Institute of Informatics (NII), Tokyo, Japan. Dan Wells and Korin Richmond are with the Centre for Speech Technology Research, University of Edinburgh, United Kingdom. This work was done when Cheng Gong was a visiting Ph.D. student at NII. (Corresponding authors: Longbiao Wang; Junichi Yamagishi.) This work was supported in part by the National Natural Science Foundation of China under Grant (U23B2053, 62176182), the China Scholarship Council (CSC) No. 202206250146, MEXT KAKENHI Grants (21H04906, 21K17775, 21K11951), and the National Research Council of Canada’s Ideation Fund: ‘Small teams – Big Ideas’.
Abstract

Neural text-to-speech (TTS) has achieved human-like synthetic speech for single-speaker, single-language synthesis. However, multilingual TTS systems are limited to resource-rich languages because large paired text and studio-quality audio corpora are unavailable for most languages. Moreover, TTS systems are typically built using a single speaker’s voice, but there is growing interest in developing systems that can synthesize voices for new speakers using only a few seconds of their speech. This paper presents ZMM-TTS, a multilingual and multispeaker framework utilizing quantized latent speech representations from a large-scale, pre-trained, self-supervised model. Our paper combines text-based and speech-based self-supervised learning models for multilingual speech synthesis. Our proposed model has zero-shot generalization ability not only for unseen speakers but also for unseen languages. We have conducted comprehensive subjective and objective evaluations through a series of experiments. Our model has proven effective in terms of speech naturalness and similarity for both seen and unseen speakers in six high-resource languages. We also tested the effectiveness of our method on two hypothetically low-resource languages. The results are promising, indicating that our proposed approach can synthesize audio that is intelligible and has a high degree of similarity to the target speaker’s voice, even without any training data for the new, unseen language.

Index Terms:
Text-to-speech, Multilingual, Self-supervised representations, Low-resource, Zero-shot

I INTRODUCTION

Text-to-speech (TTS) technology, which converts a text string into a speech waveform [1, 2, 3, 4], has advanced significantly, driven by deep learning. These advances have benefited a variety of applications, including audio-book narration, news reading, conversational assistants, and engaging user experiences in virtual worlds. However, this success depends on abundant high-quality data, and data requirements have thus become a bottleneck for both research on and deployment of neural TTS [5]. Preparing data resources and building TTS systems for a specific language can be costly, let alone repeating this for thousands of languages around the world. In this paper, our aim is to build a unified model for multilingual and multispeaker synthesis, opening possibilities for adaptation to unseen speakers and unseen languages in low-resource scenarios.

The expansion of the TTS systems’ coverage of the world’s languages has attracted significant attention from both academia and industry[6, 7, 8]. Phonetically-based speech processing systems often require pronunciation dictionaries to map words to a sequence of phonetic units. Recent studies have demonstrated that pre-trained models for phonetic representations, such as PnG BERT [9], Mixed-Phoneme BERT [10] and Phoneme-level BERT [11], can be beneficial for advanced TTS systems. Motivated by the impressive cross-lingual transferability of multilingual language models, several works [5, 12] perform masked-language model (MLM) pre-training with multilingual text-only data for TTS. Considering the abundant resources of textual data in the real world, which are usually more readily accessible compared with audio, it is worthwhile to investigate the utilization of language models pre-trained on large text resources for phoneme representations in multilingual synthesis tasks.

To synthesize speech with a target speaker’s timbre, end-to-end models usually learn a speaker embedding during training that enables speaker selection at inference time. Furthermore, several of these models enable zero-shot speaker synthesis using a speaker vector extracted from a short audio clip[13, 14]. Despite recent advances, the similarity gap between seen and unseen speakers remains an open research question, and training these models still requires a significant number of speakers, posing challenges in developing high-quality models for low-resource languages[8]. Furthermore, in real-world data, most of the attributes that represent speech are interrelated and difficult to disentangle. To control a multilingual and multispeaker system, it is crucial to disentangle speaker and language[6]. Efforts using adversarial learning or consistency loss [6, 8] have been made to mitigate the performance degradation resulting from this entanglement.

However, these state-of-the-art multilingual and multispeaker systems mainly rely on a large amount of training data, which is not available for every language. The intermediate features used in these studies are often Mel spectrograms, which have a high correlation in time and frequency, making it difficult to disentangle speaker-dependent information. Fortunately, self-supervised learning (SSL) speech representations [15, 16, 17] have been shown to be useful for speech processing tasks, such as speech recognition [18, 19], speech reconstruction [20], and voice conversion [21]. The learned representations are used as input for a supervised model, which is often fine-tuned to improve task performance or reduce the labeled data required. Recently, several TTS models[22, 23] have started using discrete vector-quantized speech representations as intermediate features instead of traditional Mel spectrograms for prediction. As a result, the quantized output has less speaker-dependent information than the Mel spectrograms[23]. While a designed fine-tuning protocol for the learned representations indeed improves automatic speech recognition (ASR) performance [24], in low-resource settings, relatively little attention has been paid to the study of TTS models using learned SSL representations.

Recently, large-scale TTS systems [22, 25, 26, 27] that leverage data-driven representations, i.e., either discrete tokens or continuous vectors from an auto-encoder [28, 29] have been widely adopted for zero-shot speech synthesis. Several of these models [22, 25] employ an autoregressive architecture to generate discrete tokens one by one, like a language model. These autoregressive models suffer from a slow inference speed, unstable prosody, and word skipping/repeating issues. To overcome this limitation, non-autoregressive large-scale TTS systems, such as NaturalSpeech 2/3 [30, 31] and HierSpeech++ [32], have been investigated. However, all the aforementioned large-scale TTS systems heavily rely on abundant data and primarily focus on resource-rich languages rather than low-resource ones.

In this paper, we propose a unified zero-shot multilingual multispeaker TTS framework called ZMM-TTS that leverages discrete speech representations from a multilingual speech-based SSL model. ZMM-TTS consists of a text-to-discrete speech representations module (txt2vec) and a representations-to-waveform module (vec2wav). Regarding the txt2vec module, we adopt a standard end-to-end TTS architecture comprising token embeddings, an encoder, and a decoder to predict discrete representations. To utilize the strong cross-lingual transferability of multilingual language models, we adopt a pre-trained large-scale multilingual language model for phoneme representations. For the vec2wav module, an additional multi-stage and multi-head vector quantization (VQ) model was adopted to preserve the discrete information at different time resolutions.

In summary, we have made the following contributions:

  • We propose ZMM-TTS, a multilingual and multispeaker framework using discrete audio representations from a large-scale multilingual pre-trained self-supervised model as an intermediate representation to replace the widely-used Mel spectrograms.

  • We investigate the impact of various input representations on multilingual synthesis tasks that use SSL discrete representations. Our research combines phoneme representations from a pre-trained text-based multilingual language model and speech-based SSL representations in a speech synthesis task.

  • ZMM-TTS can perform zero-shot multispeaker TTS in a target language with high quality and speaker similarity using only a few seconds of speech during inference.

  • We also verify the effectiveness of the ZMM-TTS in low-resource and zero-shot scenarios on two hypothetically low-resource languages. “Zero-shot” refers to the capability of synthesizing speech in a low-resource language without training or fine-tuning the model using the data from that language. This observation is quite promising, as it indicates that our proposed approach can synthesize intelligible audio with high speaker similarity, even when no specific training data are available for the previously unseen language. Note that “unseen” in this context refers to the absence of dedicated training data for ZMM-TTS, while the unseen language is still included in the training data of both the pre-trained language model and self-supervised model.

We encourage the reader to listen to our samples, which can be found at https://gongchenghhu.github.io/TASLP-demo/. The source code has been released on https://github.com/nii-yamagishilab/ZMM-TTS.

The remainder of this paper is organized as follows. Section II provides a comprehensive review of related work in the field. Section III presents a detailed description of the proposed approach. The experimental setup is described in Section IV. In Section V, we evaluate and analyze the proposed system using six languages, in particular, the quality of unseen speakers’ voices. Section VI evaluates and analyzes the proposed system, assuming two unseen low-resource languages. Finally, Section VII concludes the paper, summarizing the findings and highlighting future research directions. In Appendix A, we show a comparison with the latest large-scale models that utilize larger amounts of data.

II RELATED WORK

This section reviews related work about zero-shot and multilingual speech synthesis. We also review related studies on speech synthesis with the aid of SSL representations, especially in the multilingual cases.

II-A Multilingual Speech Synthesis Systems

II-A1 Zero-shot multispeaker TTS

The concept of zero-shot multispeaker TTS was initially introduced in [33] and the main idea is to employ embeddings from an external speaker encoder as a reference signal. Subsequent research by [14] aimed to reduce the quality gap between seen and unseen speakers by utilizing more informative embeddings. [34] further explored the utilization of attentive speaker embeddings for the purpose of encoding general speaking styles, while [35] proposed a speaker-conditional architecture and investigated a flow-based decoder capable of generating unseen speaker voices. Despite these efforts, the task of zero-shot multispeaker TTS has yet to be fully solved. Moreover, capturing a wide range of voice properties requires a substantial amount of high-quality data from numerous varied speakers.

Regarding multispeaker multilingual TTS, a primary idea is to train the decoder by incorporating learnable language and speaker embeddings[6]. However, the majority of speech attributes are interrelated and difficult to disentangle through embedding tables alone. Attempts have been made to mitigate the decline in multi-language multispeaker performance caused by speaker and language entanglement, especially in cross-lingual scenarios. [6] incorporated adversarial domain training, enabling them to transfer different voices between languages. [36] proposed the use of mutual information minimization to keep speaker consistency in cross-lingual synthesis. [37] utilized joint training incorporating a speaker classifier to enhance speaker similarity.

II-A2 System structure

Previous multilingual TTS models [6, 7] are primarily built using Tacotron [1, 3]. When synthesizing speech, Tacotron-based models often repeat or skip words due to the autoregressive approach that uses attention to align input text and target speech. Instead of autoregressive iteration, several non-autoregressive models [38, 39, 2, 40] adopt Transformer encoder and decoder blocks to generate Mel spectrograms in parallel. For example, [41] created a multilingual speech synthesis system comprising a Transformer-based acoustic predictor and a WaveNet [42] neural vocoder. [43] presented a multilingual, multiaccented, multispeaker speech synthesis model based on RADTTS [44], which is a parallel flow-based generative model. Recently, several fully end-to-end multilingual speech synthesis systems [45, 8] were built on VITS [46], which can produce natural-sounding synthesized speech. Moreover, VITS requires training only a single model, so there is no need for an independent vocoder. Among these fully end-to-end systems, YourTTS [8], which aims at zero-shot multispeaker TTS and zero-shot voice conversion, presented promising results for a few language combinations. However, it had limited success in transferring to languages with few speakers and used a curriculum learning approach, making training cumbersome.

Unlike these systems that all use acoustic features such as the Mel spectrograms as intermediate features, we built a multilingual synthesis system that is based on SSL representations to take advantage of better speaker/content disentanglement in discrete SSL representations.

II-B Input Representations

II-B1 Characters or Phonemes

Traditionally, end-to-end TTS models have utilized input representations in the form of characters or phonemes [1, 2]. Employing characters as the default input for end-to-end TTS models requires the model to implicitly learn how to convert graphemes to phonemes as part of the synthesis task. Incorporating phoneme input simplifies the TTS task, eliminating the need for the model to learn complicated grapheme-to-phoneme rules for languages such as English. Furthermore, language-independent pronunciation information can be obtained through linguistic expertise, for example by building a unified symbol set covering all the languages in a multilingual model, often derived from the International Phonetic Alphabet (IPA) [5, 47].

II-B2 Pre-trained representations

Over the past decade, self-supervised pre-training on extensive text corpora, employing language model (LM) or MLM objectives, has demonstrated remarkable success in the field of TTS[9, 10, 11]. Toward multilingual tasks, [12] proposed XPhoneBERT which is the first pre-trained multilingual model for phoneme representations for TTS. Based on the monolingual experiment results, using XPhoneBERT as a phoneme encoder for input has been shown to significantly enhance the performance of a powerful neural TTS model in terms of naturalness and prosody. It also helps to produce high-quality speech using low-resource training data. [5, 48] conducted pre-training on a multilingual text-only dataset using an MLM. They then trained this model in a supervised manner on paired text and audio data, while keeping the language-aware embedding layer frozen. This approach enables the model to synthesize languages that are not present in the paired data but exist in the text-only data.

Alternative approaches adopt UTF-8 bytes [49] or phonological features (PFs) [50, 51] as input representations. UTF-8 byte representations encode typographic rather than phonological information. The model does not learn unseen byte combinations, requiring retraining when a new language is added. For PFs, IPA symbols can be converted to PFs in accordance with certain aspects of their articulation, as implemented in Epitran [52]. Although this conversion is based on an explicit database of IPA symbols and diacritics, it can be considered universally applicable to any language for which an IPA transcription is available. However, there are no pre-trained phoneme-level encoders similar to XPhoneBERT currently available. The effect of different input representations is still not clear in multilingual speech synthesis with SSL representations. Taking into account the complexity of the model and the support for new languages and low-resource scenarios, we evaluated the effects of using characters, IPA, and XPhoneBERT as input for multilingual TTS.

II-C SSL Representations for speech synthesis

Well-designed proxy tasks enable SSL to explore the characteristics of unlabeled data and enhance the model’s representation ability. The derived representations of the SSL model have useful information for many downstream tasks[53, 54, 55, 56, 57]. Wav2vec 2.0 [15] is the most popular pre-trained model that uses SSL on a large amount of unlabeled speech data, and it has demonstrated strong representation abilities in ASR tasks and achieved outstanding results. A multilingual wav2vec 2.0, denoted as XLSR-53, was also proposed in [58]. The results showed a significant improvement over monolingual wav2vec 2.0, particularly in languages with limited resources.

For SSL in monolingual TTS downstream tasks, [54] introduced WavThruVec, a two-stage architecture that employed high-dimensional wav2vec 2.0 embeddings as an intermediate speech representation. The paper [59] proposed VQTTS, which combines a non-autoregressive acoustic model that maps text to discrete tokens with a vec2wav vocoder. MQ-TTS [60] proposed using multiple codebooks to quantize intermediate features for real-world spontaneous speech. Using wav2vec 2.0 acoustic representations of synthesized audio, [57] improved prosody modeling by incorporating prosodic characteristics from neighboring utterances.

For multilingual tasks, [23] demonstrated that the VQ features in wav2vec 2.0 contain significantly less speaker information than Mel spectrograms, and proposed DSE-TTS, a cross-lingual TTS model that encodes timbre-related and linguistic-related speaker information separately. The authors of [61] proposed extracting unsupervised phonetic representations (UPR) from wav2vec 2.0 for multilingual speech synthesis when target-language pronunciation dictionaries are unavailable. In [62], the authors built a TTS system for the Scottish Gaelic language based on self-supervised discrete acoustic unit sequences. Scaling TTS to many languages was proposed in [63] by leveraging massively multilingual joint speech and text representation learning.

However, integrating SSL-based representations into TTS in low-resource settings is still a novel concept, and it is unclear which type of input representation is most suitable for TTS. Furthermore, these methods mostly use monolingual rather than multilingual speech representations, even though the latter may benefit low-resource scenarios.

II-D Large-scale TTS systems

Recently, considerable progress has been made in zero-shot TTS by scaling up corpus and model sizes. These large-scale TTS systems [22, 25, 26, 27] usually quantize the continuous speech waveform into discrete tokens through neural audio codec models and model these tokens with autoregressive LMs. For example, VALL-E [22] was the first neural codec language model for speech synthesis utilizing a discrete audio unit and language models. In addition to autoregressive architectures, several large-scale models [30, 31, 64, 32, 65] adopt non-autoregressive architectures to improve model generation speed and robustness. While several large-scale models [66, 27, 67], such as VALL-E-X [66], are multilingual and have been trained on extensive corpora in both Chinese and English, current large-scale models are more focused on generalization to unseen speakers rather than on adapting to unseen languages. These large-scale models perform well only in resource-rich languages while lacking application in low-resource languages.

The differences between our ZMM-TTS system and previous large-scale models are highlighted in Table I. Compared with previous large-scale models, ZMM-TTS is designed to have language adaptation ability for low-resource and unseen languages. Typically, previous models were validated on only one language or a few high-resource languages. In contrast, the ZMM-TTS model aims to support not only unseen speakers but also unseen languages.

TABLE I: Comparing ZMM-TTS with previous large-scale TTS models on task capabilities.
Systems Languages Few-shot Zero-shot
[25, 22, 31, 65, 64, 32, 26] Single Validated only on one language Unseen speakers
[66, 27, 67] Multiple Validated only on high-resource languages Unseen speakers
ZMM-TTS Multiple Adaptable to multiple low-resource languages Unseen speakers and languages

III METHOD

Our proposed model is a multilingual and multispeaker synthesis system that uses self-supervised discrete speech representations from a pre-trained multilingual wav2vec 2.0 (XLSR-53) model. Quantization in XLSR-53 is based on product quantization, choosing quantized representations from codebooks $\bm{C}\in\mathbb{R}^{G\times M\times D}$, where $G=2$ is the number of separate codebooks in XLSR-53, each codebook has $M=320$ entries, and each quantized representation has $D=384$ dimensions.

Given a dataset $\mathcal{S}=\{x_{i},y_{i}\}$, where $y_{i}$ is an audio sample and $x_{i}=\{x_{0},x_{1},\ldots,x_{T_{m}}\}$ is its corresponding text sequence, we use a pre-trained XLSR-53 model to encode each audio sample into a discrete code index sequence $V=\{v_{1}^{1:G},v_{2}^{1:G},\ldots,v_{T_{c}}^{1:G}\}$ with corresponding discrete representations $R=\{r_{1}^{1:G},r_{2}^{1:G},\ldots,r_{T_{c}}^{1:G}\}$. There is a one-to-one correspondence between $V$ and $R$: once we obtain $V$, we can obtain $R$ by looking up the codebooks $\bm{C}$, i.e., $r=C[:,v,:]$.
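To make the index-to-representation lookup concrete, the following is a minimal sketch with placeholder tensors. The shapes follow the XLSR-53 description above ($G=2$, $M=320$, $D=384$), while the tensor contents and variable names are purely illustrative.

```python
import torch

# Minimal sketch of the lookup r = C[:, v, :], using placeholder codebooks.
G, M, D, T_c = 2, 320, 384, 100
C = torch.randn(G, M, D)                  # stand-in for the XLSR-53 codebooks
V = torch.randint(0, M, (T_c, G))         # discrete code indices, one per frame and codebook

# R[t, g] = C[g, V[t, g]]: advanced indexing over (frame, codebook) pairs.
R = C[torch.arange(G).unsqueeze(0), V]    # shape (T_c, G, D)
assert R.shape == (T_c, G, D)
```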

Figure 1 depicts the architectures in which our proposed model can be implemented. It consists of two steps, as shown in the following formula. First, the discrete code index sequences $V$ are predicted from the text $x$, and the codebook of XLSR-53, $\bm{C}$, is looked up to obtain the discrete representations $R$; then the waveform is predicted from the discrete representations $R$.

\begin{split}\text{txt2vec}&:(x,S,L)\rightarrow R,\\ \text{vec2wav}&:(R,S,L)\rightarrow\text{Mel}\rightarrow\text{Speech}\quad\texttt{OR}\quad(R,S,L)\rightarrow\text{Speech},\end{split} (1)

where $S$ denotes the speaker representation and $L$ represents the language identity. Note that the language input and its related embedding layers can be discarded, enabling direct inference on unseen languages.

The txt2vec module is similar to the acoustic model in prior speech synthesis approaches, so we constructed it on the basis of the popular acoustic model FastSpeech. Vec2wav is similar to a vocoder. To obtain a broader understanding of the potential complementarity and differences between SSL representations and Mel features, we propose two distinct approaches to implement vec2wav. The first converts the discrete representations $R$ into Mel spectrograms and then into a waveform through an additional vocoder, and the second converts $R$ directly into audio. Both methods make use of the up-sampling operation based on transposed convolution in HiFi-GAN [68], as in [54]. It should be noted that although there are $G$ different codebooks, in the vec2wav model the input $R$ is represented by the flattened representations from all $G$ codebooks together. In the following two subsections, we explain these two components in turn.

Refer to caption
Figure 1: Overview of ZMM-TTS. The modules txt2vec and vec2wav are trained independently. Language ID is utilized for high-resource languages and few-shot adaptation, but it is not used for direct inference without fine-tuning.

III-A Predicting Discrete Code Index from Text

Refer to caption
Figure 2: Architecture of our proposed txt2vec model.

The txt2vec model is based on FastSpeech [39], a non-autoregressive (NAR) acoustic model with explicit duration modeling. In our work, we use a learnable aligner [69] instead of relying on the predicted phoneme durations of an autoregressive teacher TTS model. This aligner is more robust for long sequences. The decoder part is the same as that of FastSpeech, which is a Feed-Forward Transformer (FFT). For the encoder part, depending on the type of input representation, it can be either FFT-like or BERT-like. Figure 2 shows the structure of txt2vec.

III-A1 Input representations and text encoder

We attempt to use different input representations including characters, IPA, and pre-trained phoneme representations for multilingual TTS.

  a) Characters: To extend a character-based input vocabulary to multilingual TTS systems, we only need to concatenate the character sets of each language in the training corpus.

  b) Phonemes-IPA: To bring the many languages together under the same input space, we also use IPA, mapping all orthographic text into IPA representations through the use of Epitran [52].

  c) Pre-trained phoneme representations: First, we convert text sentences into a sequence of phonemes using the CharsiuG2P toolkit (https://github.com/lingjzhu/CharsiuG2P), since these are the input units that XPhoneBERT expects. Then, we convert this sequence of phonemes into a sequence of phoneme representations using a pre-trained XPhoneBERT. XPhoneBERT uses the BERT-base architecture and is pre-trained following the RoBERTa approach on 330M phoneme-level sentences across nearly 100 languages. Here, we use XPhoneBERT as the input phoneme encoder rather than the FFT used as the phoneme encoder in FastSpeech.

In summary, when characters or IPA symbols are used as input, FFT is used as the encoder. However, when phonemes from CharsiuG2P are used as input, pre-trained XPhoneBERT is used as the encoder.
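As an illustration of option (c), the sketch below encodes a phoneme string with XPhoneBERT via HuggingFace Transformers. The checkpoint identifier and the space-separated phoneme string are assumptions; in the actual pipeline the phonemes come from the CharsiuG2P toolkit.

```python
from transformers import AutoModel, AutoTokenizer

# Hedged sketch: the checkpoint name and the phoneme string below are assumptions.
tokenizer = AutoTokenizer.from_pretrained("vinai/xphonebert-base")
encoder = AutoModel.from_pretrained("vinai/xphonebert-base")

phonemes = "h ə l oʊ w ɜː l d"                   # hypothetical G2P output for "hello world"
inputs = tokenizer(phonemes, return_tensors="pt")
hidden = encoder(**inputs).last_hidden_state     # (1, sequence length, hidden size)
# In txt2vec, these hidden states play the role of the encoder output and are
# combined with speaker and language embeddings before the duration predictor and decoder.
```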

III-A2 Multispeaker and multilingual control

To enable control over language and speaker identity, we add language and speaker embeddings as inputs to the decoder and the duration predictor. To extract speaker embeddings, we use a pre-trained ECAPA-TDNN speaker encoder model. For language embeddings, we adopt a trainable language embedding table to extract the language embedding of the target language. In detail, with the target language ID, a 64-dimensional language embedding can be produced by a language look-up table.
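A minimal sketch of this conditioning is given below, assuming the publicly released SpeechBrain ECAPA-TDNN checkpoint; the file name, language count, and language-ID mapping are illustrative.

```python
import torch
import torchaudio
from speechbrain.pretrained import EncoderClassifier

# Speaker embedding from a pre-trained ECAPA-TDNN model (checkpoint name assumed).
spk_encoder = EncoderClassifier.from_hparams(source="speechbrain/spkrec-ecapa-voxceleb")
wav, sr = torchaudio.load("reference.wav")           # a few seconds of the target speaker
spk_emb = spk_encoder.encode_batch(wav).squeeze()    # fixed-dimensional speaker embedding

# Trainable language look-up table: language ID -> 64-dimensional embedding.
NUM_LANGUAGES = 6
lang_table = torch.nn.Embedding(NUM_LANGUAGES, 64)
lang_emb = lang_table(torch.tensor([2]))             # e.g., ID 2 for German in a hypothetical mapping
```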

III-A3 Loss Function

In this txt2vec stage, the text encoder operates on the input tokens $x$ and produces hidden states, which, combined with the language and speaker embeddings, are used by the duration predictor to predict the duration $d$. The up-sampled encoder outputs, together with the language and speaker embeddings, are used as the decoder input. The decoder then predicts both the discrete indices $V$ and the corresponding representations $R$.

During training, we first adopt a classification task that is optimized by the cross-entropy loss between $\hat{p}(v)$ (the predicted probability distribution over $V$) and the target probability distribution $p(v)$. Given a sequence of decoder output features $Q=\{q_{1},q_{2},\ldots,q_{T}\}$, the probability distribution $p(v_{t}^{g}|h_{t}^{g})$ is obtained through

\begin{split}h_{t}^{g}&=\texttt{Split}(F(q_{t})),\\ p(v_{t}^{g}|h_{t}^{g})&=\texttt{Softmax}(h_{t}^{g}),\end{split} (2)

where $F$ is a fully-connected layer that maps $q_{t}$ to $G\times D$ dimensions. The output is then partitioned according to the number of codebooks, $G$, to obtain a hidden representation with $D$ dimensions for computing the corresponding classification probability. The classification loss $\mathcal{L}_{cla}$ can be represented as

\mathcal{L}_{cla}=\frac{1}{T}\frac{1}{G}\sum_{t=1}^{T}\sum_{g=1}^{G}\texttt{CrossEntropy}(\hat{p}(v_{t}^{g}),p(v_{t}^{g})) (3)

To learn the cross-lingual information from shared quantized latent speech representations in continuous space, we also incorporated predictions of $R$ during training as

\hat{r}_{t}^{g}=\sum_{i=1}^{M}p(v_{t}^{g}=i|h_{t}^{g})\,C_{g,i} (4)

Eq. (4) is a weighted sum of the code vectors, where the weights are the predicted probabilities of choosing the codes. In Eq. (4), $p(v_{t}^{g}=i|h_{t}^{g})$ represents the probability of predicting the code index value $v_{t}^{g}$ as $i$ at time $t$. Here, $g\in\{1,\ldots,G\}$ denotes the $g$-th codebook and $C_{g,i}$ denotes the vector corresponding to the $i$-th index in the $g$-th codebook of $\bm{C}\in\mathbb{R}^{G\times M\times D}$. The regression L2 loss $\mathcal{L}_{mse}$ between the predicted and target $R$ can be represented as

\mathcal{L}_{mse}=\frac{1}{T}\sum_{t=1}^{T}\frac{1}{G}\sum_{g=1}^{G}\left\|\hat{r}_{t}^{g}-r_{t}^{g}\right\|_{2}^{2} (5)

The total loss is the sum of the classification loss, the regression loss between the ground-truth and predicted discrete features, and the duration loss:

\mathcal{L}_{total}=\mathcal{L}_{cla}+\mathcal{L}_{mse}+\mathcal{L}_{dur} (6)

Given the monotonic alignment assumption between text tokens and discrete representation frames, $\mathcal{L}_{dur}$ is optimized by maximizing the joint likelihood of each token and frame to find the most likely monotonic path [69, 44].

Note that since the target $p(v_{t}^{g})$ is a one-hot vector, the target $r_{t}^{g}$ is a single entry in the codebook rather than a sum weighted by softmax probabilities as in Eq. (4). Therefore, at inference time, we directly select the index with the maximum probability in $P(i|h_{t}^{g})$ and look up the corresponding $\hat{r}_{t}^{g}$ in the codebooks $\bm{C}\in\mathbb{R}^{G\times M\times D}$ as the input to the next-stage model:

v_{t}^{g}=\arg\max_{i}(P(i|h_{t}^{g})),\quad\hat{r}_{t}^{g}=C[g,v_{t}^{g},:] (7)
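The training objective and inference rule above can be summarized in a few lines of PyTorch. The sketch below uses placeholder tensors and collapses the projection $F(\cdot)$ and the classification head into one linear layer that outputs $M$ logits per codebook, which is a simplification of the description above.

```python
import torch
import torch.nn.functional as F

G, M, D, T = 2, 320, 384, 50
C = torch.randn(G, M, D)                            # XLSR-53 codebooks (placeholder)
Q = torch.randn(T, 512)                             # decoder outputs q_t
V = torch.randint(0, M, (T, G))                     # target code indices v_t^g
R = C[torch.arange(G).unsqueeze(0), V]              # target representations r_t^g, (T, G, D)

head = torch.nn.Linear(512, G * M)                  # simplification of F(.) followed by Split(.)
logits = head(Q).view(T, G, M)
probs = F.softmax(logits, dim=-1)                   # p(v_t^g | h_t^g)

# Eq. (3): cross-entropy averaged over frames and codebooks.
loss_cla = F.cross_entropy(logits.reshape(T * G, M), V.reshape(T * G))

# Eqs. (4)-(5): probability-weighted sum of code vectors, regressed onto R.
R_hat = torch.einsum("tgm,gmd->tgd", probs, C)
loss_mse = F.mse_loss(R_hat, R)
loss_total = loss_cla + loss_mse                    # plus the duration loss in the full model

# Eq. (7): at inference time, take the arg-max index and look up its code vector.
V_pred = probs.argmax(dim=-1)                       # (T, G)
R_pred = C[torch.arange(G).unsqueeze(0), V_pred]    # input to the vec2wav stage
```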

III-B Predicting a waveform from discrete representations

Although most methods convert features directly into audio [54], spectral features may still have a few advantages in several low-resource tasks [70]. To obtain a broader understanding of the potential complementarity and differences between SSL representations and Mel features, we adopted two different ways to implement the vec2wav model: with and without an independent Mel-based vocoder as in the following formula,

\text{vec2wav}:(\text{vec2mel}+\text{vocoder})\ \big\|\ \text{vec2wavVQ}(R) (8)

where vec2mel and vec2wavVQ (R) represent the two proposed methods that convert discrete representations into Mel spectrograms and into a waveform, respectively.

III-B1 vec2wav model with Mel-based vocoder

Refer to caption
Figure 3: vec2wav with independent Mel-based vocoder.
Refer to caption
Figure 4: vec2wav without independent Mel-based vocoder.

In the process of converting the discrete features $R$ into audio, we can still use a Mel-based vocoder, which requires us to first convert the discrete features $R$ into Mel spectrograms.

This vec2wav model consists of four parts: up-sampling, down-sampling, a decoder, and a pre-trained Mel-based vocoder, as shown in Figure 3. To learn Mel spectrograms from $R$, the duration model is no longer necessary. Although the sequences $R$ and Mel have different “sampling rates,” they can be treated as being “uniformly aligned.” For example, if we adopt a 12.5 ms frame shift to extract Mel spectrograms, then, considering that the frame shift of XLSR is 20 ms, the length ratio of $R$ : Mel : Speech would be $5:8:1600$. In this work, we adopt an up-sampling factor of 8 followed by down-sampling by a factor of 5 to map discrete representations to Mel ones. Transposed convolution is used for up-sampling, followed by a Multi-Receptive Field Fusion (MRF) module, similar to the HiFi-GAN generator. With a total up-sampling factor of 8, the generator configuration was changed to up-sampling rates of (4, 2) with corresponding kernel sizes of (12, 8), while the hyper-parameters of the residual blocks are the same as those in HiFi-GAN V1. For down-sampling, we use average pooling. The decoder part is the same as that of txt2vec, which is an FFT. The decoder takes the output of the average pooling together with the language and speaker embeddings to produce the Mel spectrograms. The optimization objective of this vec2mel stage is defined as $\mathcal{L}_{Mel}={\left\|\hat{\text{Mel}}-\text{Mel}\right\|}_{2}^{2}$.
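The frame-rate bookkeeping above can be sketched as follows, assuming a 20 ms XLSR-53 frame shift and a 12.5 ms Mel hop; the channel sizes and the plain transposed convolutions stand in for the MRF-based up-sampling blocks of the actual model.

```python
import torch

# 5 SSL frames (100 ms of speech) should map to 8 Mel frames: upsample x8, then average-pool /5.
T_r, in_dim = 5, 768                                 # flattened representations from G codebooks (2 x 384)
R = torch.randn(1, in_dim, T_r)                      # (batch, channels, frames)

up = torch.nn.Sequential(                            # up-sampling rates (4, 2), kernel sizes (12, 8)
    torch.nn.ConvTranspose1d(in_dim, 256, kernel_size=12, stride=4, padding=4),
    torch.nn.ConvTranspose1d(256, 256, kernel_size=8, stride=2, padding=3),
)
down = torch.nn.AvgPool1d(kernel_size=5, stride=5)   # down-sample by a factor of 5

mel_rate = down(up(R))                               # shape (1, 256, 8): aligned with the Mel frame rate
assert mel_rate.shape[-1] == 8
```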

III-B2 vec2wav model without Mel-based vocoder

Instead of using an additional vocoder to convert Mel spectrograms to a waveform, we also propose a vec2wav model as in [54], which directly maps discrete representations to a waveform. To learn cross-lingual information from SSL representations at different time resolutions, we include a multi-stage multi-head codebook for waveform modeling, as in [71, 72].

The details of the architecture and operation of the multi-stage encoder are provided in [71, 72]. Here, we illustrate the audio waveform generation process with a 2-stage encoder as an example, as shown in Figure 4, which consists of the following steps: 1) The first-stage encoder receives the discrete sequences $R$ and produces the first hidden representation sequences. These sequences are then down-sampled and passed to the second-stage encoder to obtain the second hidden representation, which has a lower time resolution. 2) The second hidden representation sequences are quantized using codebook 2 to obtain $z^{(2)}$. These sequences are then up-sampled and combined with the first hidden representation sequences to obtain $z^{(1)}$ using codebook 1. 3) The residual output of $z^{(1)}$ and $z^{(2)}$ is used to reconstruct the waveform using the frame decoder and the HiFi-GAN-based waveform generator.

To up-sample the audio from $R$, the generator’s configuration was changed to a sequence of up-sampling rates (5, 4, 4, 2, 2) with corresponding kernel sizes of (11, 8, 8, 4, 4).

During the training process, a UnivNet [73] discriminator is utilized for adversarial training of this model. The final loss is composed of three main components: a waveform-level loss $\mathcal{L}_{w}$, calculated using the HiFi-GAN loss function to compare the real waveform with the reconstructed waveform; a frame-level loss $\mathcal{L}_{d}$, calculated from the mean squared error between the two discrete representations; and VQ and stabilization losses, which are similar to those described in [71].
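To make the two-stage scheme more concrete, the following is a heavily simplified sketch of residual-style quantization at two time resolutions, which is one reading of steps 2) and 3) above. The codebook sizes, dimensions, down-sampling factor, and single-head codebooks are assumptions; the real model uses multi-head codebooks and learned encoders as described in [71, 72].

```python
import torch
import torch.nn.functional as F

def nearest_code(x, codebook):
    """Replace each frame of x (T, D) with its nearest entry of codebook (K, D)."""
    idx = torch.cdist(x, codebook).argmin(dim=-1)
    return codebook[idx]

T, D, K = 40, 256, 64                                      # placeholder sizes
h1 = torch.randn(T, D)                                     # 1st-stage encoder output (high time resolution)
h2 = F.avg_pool1d(h1.t().unsqueeze(0), 4).squeeze(0).t()   # 2nd-stage input at 1/4 the resolution

codebook2 = torch.randn(K, D)
z2 = nearest_code(h2, codebook2)                           # coarse codes z^(2)

z2_up = z2.repeat_interleave(4, dim=0)                     # back to the 1st-stage resolution
codebook1 = torch.randn(K, D)
z1 = nearest_code(h1 - z2_up, codebook1)                   # residual codes z^(1)

frame_repr = z1 + z2_up                                    # passed to the frame decoder and HiFi-GAN generator
```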

IV Experiments

IV-A Dataset

We investigated six languages, namely English (en), French (fr), German (ge), Portuguese (pt), Spanish (es), and Swedish (sw), and built a multilingual and multispeaker dataset to train the model. The multilingual and multispeaker dataset contains samples from two multispeaker datasets, multilingual LibriSpeech (MLS) [74] and GlobalPhone (GLB) [75], and three single-speaker datasets, CSS10 [76], LJSpeech (LJS) [77], and NST Swedish Speech [78]. The MLS dataset comprises eight languages and is derived from Librivox audiobooks. GlobalPhone is a database of multilingual read speech, including corresponding transcriptions and pronunciation dictionaries in 20 languages. CSS10 is a compilation of monolingual single-speaker speech datasets for ten distinct languages. LJSpeech is a single-speaker English dataset used in many studies. NST is a monolingual Swedish database designed for speech synthesis, initially created by Nordic Language Technology.

To ensure a balance of languages, speakers, and gender, we screened the aforementioned databases and combined them into our multilingual and multispeaker database, as shown in Table II. In addition to the two multispeaker databases MLS and GLB, for which most speakers have about 100 audio samples, we also selected 100 audio samples per language from the three single-speaker databases CSS10, LJS, and NST. We included some data from single-speaker databases to enable a fairer comparison with single-speaker and single-language models in the future, although such a comparison is not included in this work.

To eliminate the potential impact of varying sampling rates as a confounding variable, we resampled all audio to a consistent 16 kHz rate and applied amplitude normalization using sv56 [79]. We set the frame and hop sizes to 1,024 and 200 samples (12.5 ms), respectively, to extract the Mel spectrograms from 16 kHz raw speech.
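A sketch of this preprocessing is shown below; the 80-band Mel configuration and the simple peak normalization are assumptions, since the paper uses the sv56 tool for level normalization and does not state the number of Mel bands here.

```python
import torchaudio

# Resample to 16 kHz and extract Mel spectrograms with frame size 1,024 and hop size 200 (12.5 ms).
wav, sr = torchaudio.load("sample.wav")
wav = torchaudio.functional.resample(wav, sr, 16000)
wav = 0.95 * wav / wav.abs().max()                   # crude stand-in for sv56 amplitude normalization

mel_fn = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_fft=1024, win_length=1024, hop_length=200, n_mels=80
)
mel = mel_fn(wav)                                    # (channels, n_mels, frames)
```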

TABLE II: Details of the training corpus. #Spk represents the total number of speakers per gender and language. #Sent represents the total number of utterances per gender and language. “MLS”, “GLB” and “Other” represent the number of speakers selected from the MLS dataset, the GlobalPhone dataset, and the three single-speaker datasets, respectively.
Lang Gender #Spk Dur(h) #Sent MLS GLB Other
en Female 46 12.06 3,521 45 0 1(LJ)
Male 45 11.87 3,420 45 0 0
fr Female 45 12.29 4,661 0 45 0
Male 46 11.88 4,775 0 45 1(CSS10)
ge Female 46 11.92 3,892 38 7 1(CSS10)
Male 45 12.47 6,576 0 45 0
pt Female 46 9.15 3,596 3 43 0
Male 45 9.76 3,793 4 41 0
es Female 45 10.40 3,206 24 21 0
Male 46 10.06 3,168 21 24 1(CSS10)
sw Female 45 9.57 4,866 0 45 0
Male 46 9.32 4,776 0 45 1(NST)
TABLE III: Details of the pre-trained model.
Name Modality Lang Training data
XLSR-53 Audio 53 56K hours
ECAPA-TDNN Audio >5 2794 hours
XPhoneBERT Text 94 330M sentences

IV-B Pre-trained Models

In addition to utilizing the aforementioned six languages’ paired datasets to train our TTS system directly, and to incorporate more multilingual knowledge into our system, we also employed two pre-trained multilingual models, XLSR-53 (https://huggingface.co/facebook/wav2vec2-large-xlsr-53) and XPhoneBERT (https://github.com/VinAIResearch/XPhoneBERT), trained on audio-only and text-only data, respectively, as shown in Table III. For the speaker encoder, we use the publicly available ECAPA-TDNN model (https://huggingface.co/speechbrain/spkrec-ecapa-voxceleb), which was trained on the multilingual VoxCeleb 1&2 datasets.
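As an illustration of the speech-based SSL model in Table III, the sketch below loads the footnoted XLSR-53 checkpoint and extracts its continuous features. Exposing the discrete code indices additionally requires the model’s internal quantizer, which is omitted here, and the audio file name is illustrative.

```python
import torch
import torchaudio
from transformers import Wav2Vec2Model

# Load the multilingual wav2vec 2.0 (XLSR-53) encoder from the footnoted checkpoint.
xlsr = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-large-xlsr-53")
xlsr.eval()

wav, sr = torchaudio.load("sample_16k.wav")          # 16 kHz mono audio assumed
with torch.no_grad():
    feats = xlsr(wav).last_hidden_state              # (1, frames, 1024), roughly one frame per 20 ms
```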

TABLE IV: Experimental Setup. txt2intfeat and intfeat2wav represent text to intermediate features and intermediate features to waveform, respectively. SCL is speaker consistency loss. ECAPA and H/ASP are two different speaker encoders. Mel and SSL respectively represent the use of Mel spectrograms or self-supervised learning representations as intermediate features.
SystemID txt2intfeat intfeat2wav Spk Mel SSL SCL
FSM1 FastSpeech HiFi-GAN ECAPA ✓ – –
FSM2 FastSpeech vec2wavVQ (Mel) ECAPA ✓ – –
ZMM-TTS1 txt2vec vec2mel+HiFi-GAN ECAPA ✓ ✓ –
ZMM-TTS2 txt2vec vec2wavVQ (R) ECAPA – ✓ –
YourTTS VITS (end-to-end) H/ASP – – ✓
YourTTSE VITS (end-to-end) ECAPA – – ✓
YourTTSW VITS (end-to-end) ECAPA – – –

IV-C Experimental Setup and Model Architecture

TABLE V: MOS, DMOS, and SECS results for seen speakers among different systems. The gray color scale indicates the relative value, and darker color indicates a better result. MOS and DMOS are reported with 95% confidence intervals.
Metrics Lang Characters Pre-trained phoneme representations Characters GT
FSM-based ZMM-TTS FSM-based ZMM-TTS YourTTS-based
FSM1c FSM2c ZMM-TTS1c ZMM-TTS2c FSM1x FSM2x ZMM-TTS1x ZMM-TTS2x YourTTS YourTTSE YourTTSW
MOS en 1.08±0.07 1.45±0.22 2.68±0.28 2.98±0.23 1.35±0.13 1.63±0.20 2.92±0.25 3.35±0.30 2.17±0.22 - - 4.05±0.21
fr 1.47±0.22 2.07±0.27 2.90±0.29 4.20±0.22 1.60±0.22 2.48±0.25 3.08±0.34 4.27±0.21 3.57±0.27 - - 4.48±0.19
ge 1.60±0.25 2.08±0.26 3.02±0.25 3.90±0.28 1.63±0.25 2.10±0.29 2.67±0.26 3.70±0.29 2.97±0.33 - - 3.92±0.26
pt 1.45±0.20 1.73±0.22 2.77±0.34 3.15±0.31 1.48±0.19 2.05±0.26 2.63±0.32 3.12±0.34 2.45±0.34 - - 4.33±0.20
es 1.37±0.14 1.68±0.21 3.18±0.26 3.70±0.21 1.83±0.17 2.02±0.22 3.25±0.26 3.87±0.24 3.17±0.29 - - 3.75±0.28
sw 1.33±0.16 1.88±0.21 2.98±0.29 3.32±0.34 1.77±0.24 2.47±0.25 3.08±0.28 3.65±0.32 2.93±0.30 - - 4.18±0.25
DMOS en 3.80±0.28 3.40±0.32 4.35±0.27 4.42±0.29 3.55±0.30 3.95±0.24 4.60±0.20 4.63±0.23 4.37±0.26 - - 4.92±0.09
fr 3.40±0.39 3.70±0.39 4.28±0.34 4.27±0.34 3.33±0.38 3.77±0.42 4.32±0.33 4.62±0.24 4.08±0.34 - - 4.58±0.24
ge 3.20±0.37 3.80±0.34 4.42±0.24 4.58±0.19 3.58±0.37 3.73±0.38 4.50±0.24 4.52±0.25 4.05±0.31 - - 4.55±0.24
pt 2.73±0.36 2.63±0.39 3.83±0.33 4.10±0.30 2.83±0.35 3.12±0.37 3.82±0.35 3.73±0.37 3.30±0.39 - - 4.23±0.35
es 3.67±0.35 3.68±0.37 4.32±0.31 4.35±0.30 4.18±0.27 4.02±0.31 4.45±0.29 4.47±0.29 4.67±0.19 - - 4.78±0.18
sw 3.13±0.41 3.37±0.44 3.92±0.38 4.30±0.32 3.17±0.40 3.57±0.41 3.85±0.41 4.38±0.28 3.93±0.40 - - 4.30±0.30
SECS en 0.792 0.813 0.907 0.906 0.852 0.855 0.912 0.914 0.922 0.749 0.688 0.999
fr 0.926 0.922 0.955 0.955 0.935 0.923 0.955 0.958 0.964 0.841 0.823 0.999
ge 0.874 0.861 0.949 0.949 0.901 0.889 0.947 0.950 0.943 0.847 0.807 0.999
pt 0.848 0.839 0.936 0.940 0.891 0.866 0.937 0.946 0.939 0.802 0.765 0.999
es 0.897 0.882 0.936 0.941 0.917 0.897 0.934 0.942 0.946 0.830 0.797 0.999
sw 0.876 0.877 0.940 0.943 0.895 0.901 0.941 0.947 0.944 0.810 0.772 0.999
TABLE VI: MOS, DMOS, and SECS results for unseen speakers among different systems. The gray color scale indicates the relative value, and darker color indicates a better result. MOS and DMOS are reported with 95% confidence intervals.
Metrics Lang Characters Pre-trained phoneme representations Characters GT
FSM-based ZMM-TTS FSM-based ZMM-TTS YourTTS-based
FSM1c FSM2c ZMM-TTS1c ZMM-TTS2c FSM1x FSM2x ZMM-TTS1x ZMM-TTS2x YourTTS YourTTSE YourTTSW
MOS en 1.15±0.15 1.45±0.19 2.87±0.22 2.98±0.24 1.47±0.19 1.67±0.19 2.53±0.24 3.48±0.26 2.58±0.25 - - 3.98±0.26
fr 1.35±0.13 1.98±0.24 3.27±0.30 4.32±0.18 1.70±0.21 2.57±0.25 2.90±0.31 4.43±0.19 3.28±0.30 - - 4.38±0.18
ge 1.55±0.24 2.03±0.28 3.22±0.31 3.58±0.34 1.72±0.27 2.27±0.28 3.17±0.27 3.55±0.33 3.05±0.33 - - 4.47±0.20
pt 1.53±0.22 1.97±0.20 3.27±0.29 3.52±0.29 1.82±0.23 2.33±0.29 3.33±0.26 4.00±0.24 3.15±0.26 - - 4.07±0.25
es 1.32±0.17 1.75±0.20 3.28±0.23 3.95±0.25 1.68±0.17 2.65±0.26 3.12±0.26 3.77±0.28 3.50±0.24 - - 3.90±0.28
sw 1.52±0.19 1.88±0.23 2.78±0.26 3.83±0.29 1.87±0.23 2.60±0.27 2.95±0.30 3.72±0.27 2.25±0.30 - - 3.47±0.32
DMOS en 2.87±0.27 2.73±0.33 3.07±0.40 2.93±0.39 2.52±0.30 2.13±0.32 2.60±0.40 2.70±0.40 3.85±0.36 - - 4.82±0.12
fr 2.85±0.40 3.15±0.39 3.20±0.42 3.40±0.41 2.77±0.38 2.97±0.43 3.35±0.44 3.48±0.39 3.87±0.38 - - 4.55±0.24
ge 2.30±0.35 2.83±0.43 2.82±0.40 3.15±0.41 2.58±0.39 2.87±0.40 2.95±0.41 3.17±0.42 3.35±0.39 - - 4.40±0.31
pt 2.48±0.35 2.75±0.39 3.63±0.43 3.78±0.38 3.02±0.37 3.03±0.37 3.25±0.40 3.55±0.39 3.57±0.38 - - 4.77±0.19
es 3.18±0.36 3.45±0.34 4.12±0.33 3.98±0.35 3.60±0.36 3.53±0.39 3.93±0.35 4.07±0.36 4.25±0.34 - - 4.68±0.23
sw 2.77±0.42 2.90±0.42 3.43±0.44 3.62±0.44 2.90±0.40 3.07±0.48 3.43±0.44 3.88±0.41 3.52±0.43 - - 4.77±0.18
SECS en 0.681 0.707 0.791 0.789 0.739 0.729 0.788 0.783 0.857 0.656 0.618 0.999
fr 0.867 0.865 0.911 0.905 0.877 0.892 0.918 0.911 0.936 0.812 0.786 0.999
ge 0.811 0.819 0.873 0.890 0.842 0.843 0.873 0.893 0.913 0.796 0.747 0.999
pt 0.821 0.806 0.890 0.899 0.840 0.825 0.885 0.883 0.911 0.714 0.680 0.999
es 0.870 0.861 0.904 0.915 0.880 0.886 0.901 0.916 0.929 0.772 0.741 0.999
sw 0.837 0.836 0.896 0.906 0.858 0.857 0.900 0.911 0.917 0.747 0.711 0.999

The proposed ZMM-TTS model can be implemented in two ways, ZMM-TTS1 and ZMM-TTS2, which differ in the vec2wav model: ZMM-TTS1 uses an independent Mel-based vocoder, while ZMM-TTS2 does not. We consider the FastSpeech-based model (FSM) for comparison. It is also implemented in two ways: one uses the HiFi-GAN vocoder, and the other uses a vec2wavVQ (R)-like structure as the vocoder. In addition, we also compare with a zero-shot multilingual model, YourTTS, which is based on the fully end-to-end model VITS.

In summary, the differences of the various TTS systems are listed in Table IV. Furthermore, for ZMM-TTS and FSM, we also compare different input representations, using different suffixes to indicate them. For example, ZMM-TTS1c, ZMM-TTS1i and ZMM-TTS1x indicate that the inputs to the ZMM-TTS1 model are characters, IPA, and pre-trained phoneme representations, respectively. The implementation details are summarized as follows and each module is trained independently:

  • HiFi-GAN: For waveform generation in FSM1 and ZMM-TTS1, HiFi-GAN (https://github.com/jik876/hifi-gan) is chosen as our neural vocoder. This HiFi-GAN is trained for 2.5M steps on a single NVIDIA A100 GPU with a batch size of 16 using our six-language training data.

  • FastSpeech: Similar to txt2vec in Section III-A, we also train a FastSpeech-like baseline system as an acoustic model. This FastSpeech has the same encoder, decoder, and speaker and language representations as txt2vec, but its output is Mel representations rather than SSL ones. FastSpeech, txt2vec, and vec2mel were all trained for 1.2M steps on a single NVIDIA A100 GPU with a batch size of 16. Both FastSpeech and txt2vec were built from an open-source repository (https://github.com/keonlee9420/Comprehensive-Transformer-TTS).

    As mentioned in Section III-A-1, when incorporating pre-trained phoneme representations, models utilize XPhoneBERT as the phoneme encoder. XPhoneBERT is kept fixed during the initial 25% of the training steps and is subsequently updated during the remaining training steps.

  • vec2wavVQ (Mel): For a fairer comparison, we also train a vocoder with a VQ codebook, as in vec2wavVQ (R) in Section III-B2. The vec2wavVQ (Mel) has the same multi-stage encoder and frame decoder as vec2wavVQ (R), and both use 2-stage 4-head codebooks. The input of vec2wavVQ (Mel) is Mel representations rather than the discrete ones used in vec2wavVQ (R). Vec2wavVQ (Mel) and vec2wavVQ (R) were both trained for 1M steps on a single NVIDIA A100 GPU with a batch size of 16.

  • YourTTS: YourTTS is a multilingual system based on VITS that has achieved state-of-the-art (SOTA) results in zero-shot multispeaker TTS. The original YourTTS implementation uses the H/ASP model [80] as a speaker encoder. Therefore, we also train a YourTTSE, which means YourTTS using the same speaker encoder ECAPA as our ZMM-TTS model. Furthermore, there is a speaker consistency loss (SCL) in YourTTS to improve speaker similarity. Therefore, we also train a YourTTSW, in which “W” means YourTTS without SCL, as in our proposed model. In YourTTSW, ECAPA is also used as a speaker encoder rather than H/ASP. All these YourTTS-based models were trained for one million steps on a single NVIDIA A100 GPU with a batch size of 32, and we always select the best checkpoint based on the development set.

IV-D Evaluation Methods

IV-D1 Subjective Evaluation Methods

We synthesized sentences from test sets in six languages for each TTS system. For each language, we select 6 (3 female, 3 male) seen speakers and 6 (3 female, 3 male) unseen speakers for the test set, and two sentences for each speaker. Ten native listeners per language participated in our listening tests. Listeners evaluated one test utterance at a time, initially assigning a rating on a Likert scale from 1 to 5 for the Mean Opinion Score (MOS) to assess naturalness. Subsequently, they provided ratings for speaker similarity concerning a reference utterance using a Differential MOS (DMOS) scale, ranging from 1 (indicating a different speaker) to 5 (indicating the same speaker). Reference utterances were chosen randomly from the original speech of the target speaker. Due to the high cost of subjective evaluation, only the original YourTTS is used as the YourTTS related baseline to test MOS and DMOS. Our preliminary experimental results also demonstrated that YourTTS achieves better performance than YourTTSE and YourTTSW.

IV-D2 Objective Evaluation Methods

To assess the similarity between the synthesized voice and the original speaker, we determine the Speaker Encoder Cosine Similarity (SECS) by measuring the cosine similarity between the speaker embeddings of two audio samples extracted from the speaker encoder. The SECS score falls within the range of -1 to 1, with a higher value indicating better speaker similarity. In accordance with prior research [35, 8], we compute the SECS using the speaker encoder from the Resemblyzer package (https://github.com/resemble-ai/Resemblyzer), facilitating comparisons with those studies.
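The SECS computation reduces to a cosine similarity between two embeddings; a minimal sketch using the footnoted Resemblyzer package is shown below, with illustrative file names.

```python
import numpy as np
from resemblyzer import VoiceEncoder, preprocess_wav

encoder = VoiceEncoder()
emb_syn = encoder.embed_utterance(preprocess_wav("synthesized.wav"))
emb_ref = encoder.embed_utterance(preprocess_wav("reference.wav"))

# Cosine similarity in [-1, 1]; higher values indicate better speaker similarity.
secs = float(np.dot(emb_syn, emb_ref) / (np.linalg.norm(emb_syn) * np.linalg.norm(emb_ref)))
```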

We also objectively evaluate performance by measuring the intelligibility of the speech content using an ASR system. We synthesized 2,000 sentences for each language (1,000 each with seen and unseen speaker embeddings). Note that the language of the target speaker and the text are always consistent. Although our model also has the ability to perform cross-lingual synthesis, considering that there is no clear standard for the evaluation of cross-lingual synthesis, this paper only focuses on the intra-lingual scenario.

The evaluation texts are from the CMU ARCTIC sentences (http://festvox.org/cmu_arctic/cmuarctic.data) for English, the test data of MLS for French, German, Portuguese, and Spanish, and the test data of NST for Swedish. All synthesized sentences were sent to the Whisper (https://github.com/openai/whisper) model for ASR. We computed the character error rate (CER) between the input text and the ASR-produced transcripts.
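The intelligibility measurement can be sketched as below, transcribing a synthesized sample with Whisper and scoring it with the jiwer package; the model size, language code, reference text, and the minimal text normalization are assumptions.

```python
import whisper
from jiwer import cer

model = whisper.load_model("large")                          # Whisper model size assumed
result = model.transcribe("synthesized_fr.wav", language="fr")

reference = "bonjour tout le monde"                          # hypothetical input text
hypothesis = result["text"].lower().strip()
print(f"CER: {100 * cer(reference, hypothesis):.2f}%")
```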

V Results and analysis

The evaluation results for speech naturalness and speaker similarity are shown in Tables V and VI for seen and unseen target speakers, respectively.

V-A Subjective Evaluation Results

Refer to caption
(a)
Refer to caption
(b)
Figure 5: Visualization of speaker embeddings extracted from ECAPA-TDNN and H/ASP.

V-A1 Comparison between FSM and ZMM-TTS models

In terms of speech naturalness, our proposed systems ZMM-TTS1 and ZMM-TTS2 are significantly better than the FastSpeech-based systems (FSM1 and FSM2), according to a Mann-Whitney U test at $\alpha=0.05$ with Holm-Bonferroni correction, in each language under both seen and unseen speaker conditions. Our training data come mainly from MLS and GLB. The sound quality of these datasets is significantly worse than that of several monolingual datasets, such as LJSpeech [77] and the Chinese Standard Mandarin Speech Corpus [81], that are commonly used in speech synthesis. For example, most recordings of GLB were made in ordinary rooms rather than professional recording studios, and several audio samples contain some noise.
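The significance test can be reproduced with standard statistics libraries, as in the hedged sketch below with placeholder rating lists; scipy provides the Mann-Whitney U test and statsmodels the Holm-Bonferroni correction.

```python
from scipy.stats import mannwhitneyu
from statsmodels.stats.multitest import multipletests

# Placeholder 1-5 MOS ratings for one language; real tests pool the listening-test scores.
mos_zmm_tts2 = [4, 4, 5, 3, 4, 4, 5, 4]
mos_fsm1 = [1, 2, 1, 1, 2, 1, 1, 2]
mos_yourtts = [3, 2, 3, 4, 3, 3, 2, 3]

p_values = [
    mannwhitneyu(mos_zmm_tts2, mos_fsm1).pvalue,
    mannwhitneyu(mos_zmm_tts2, mos_yourtts).pvalue,
]
reject, p_corrected, _, _ = multipletests(p_values, alpha=0.05, method="holm")
print(reject, p_corrected)
```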

Although the combination of FastSpeech and HiFi-GAN underlying the baseline FSM model is a currently popular recipe that has achieved SOTA results on many monolingual synthesis tasks, it obtains the worst speech-naturalness results on our multilingual datasets. This result indicates that for FSM models based on Mel spectrograms, the sound quality of synthesized audio may be limited by the sound quality of the training data itself. On the other hand, since the intermediate features used by our ZMM-TTS model are extracted from SSL models trained on large-scale non-ideal data, the proposed system is considered to be less sensitive to the sound quality of the training data. The fact that the proposed ZMM-TTS2x achieves higher MOS values than the ground-truth audio for seen speakers (Spanish) and unseen speakers (French and Swedish) is likely also due to the limited sound quality of the ground-truth audio.

In terms of speaker similarity, our proposed ZMM-TTS system achieves better DMOS than the FSM-based system. The speaker encoder used in the ZMM-TTS and FSM-based systems is the same. Also, the difference between the DMOS of the ZMM-TTS and FSM-based systems is smaller than their difference in MOS. This suggests that the improvement in naturalness using features extracted from SSL models is more prominent than the improvement in speaker similarity because discrete features may contain less speaker information.

V-A2 Effectiveness of VQ in vec2wav

Another finding is that adding the VQ module in the process of converting the Mel spectrograms or discrete representations to a waveform can improve the naturalness of the audio. The results in Tables V and VI show that FSM2 outperformed FSM1 on MOS. One potential reason is that the discrete representations in vec2wavVQ (Mel) may reduce some redundant information, such as noise, to some extent. Similar findings are observed when comparing ZMM-TTS1 and ZMM-TTS2. ZMM-TTS2 has an additional VQ module and converts discrete representations directly to speech without an independent vocoder. Directly mapping discrete representations to a waveform instead of a Mel spectrogram prevents error propagation caused by inaccurate Mel spectrogram prediction in the vec2mel stage and hence, like the result of FSM1 and FSM2, multi-stage discrete representations learned by additional VQ may also be helpful for improving sound quality.

In terms of speaker similarity, the use of VQ in vec2wav does not always result in an improvement. We also find that MOS is not always correlated with DMOS. For example, as shown in the French results in Table V, the naturalness of ZMM-TTS2c is significantly higher than that of ZMM-TTS1c (4.20 vs 2.90), while their DMOS values are very similar (4.27 vs 4.28).

V-A3 Comparison between YourTTS and ZMM-TTS models

On speech naturalness, the YourTTS method outperforms the two-stage FSM baseline while remaining worse than two of our proposed systems, ZMM-TTS2c and ZMM-TTS2x. We observed that the YourTTS model exhibits instability in its stochastic duration predictor, resulting in the production of unnatural durations for certain speakers and sentences.

On speaker similarity, ZMM-TTS2c and ZMM-TTS2x have better performance than the YourTTS systems under the seen-speaker condition in five languages (en, fr, ge, pt, and sw). However, under unseen conditions, YourTTS is better than our method in terms of DMOS in English, French, German, and Spanish. This will be analyzed in detail in the next subsection.

V-A4 Comparison between different input representations and languages

Interestingly, determining which is better for speech naturalness, character-based inputs or pre-trained phoneme representation based inputs, depends on the specific language and synthesis systems. FSM1x and FSM2x always have a higher MOS than FSM1c and FSM2c among the FSM-based systems. Utilizing pre-trained phoneme representations as input and XPhoneBERT as the text encoder can enhance the quality of synthesized speech. However, for our proposed ZMM-TTS system, better naturalness is not always achieved with ZMM-TTS1x and ZMM-TTS2x. For example, in Table VI, comparing the MOS of ZMM-TTS2x and ZMM-TTS2c, ZMM-TTS2x has achieved better results in English, French, and Portuguese, while ZMM-TTS2c has achieved better results in other languages. Furthermore, we observed that the MOS of ZMM-TTS2x is significantly better than ZMM-TTS2c in English. This may be because XPhoneBERT was pre-trained on a dataset with more English text than other languages. We will investigate how the amount of pre-training text affects XPhoneBERT’s representation performance in the future.

In terms of speaker similarity, different text representations have little impact. This result aligns with expectations as speaker and semantic information are not closely related.

V-A5 Comparison between seen and unseen speakers

An interesting result is that the naturalness of the synthesized speech of unseen speakers is not worse than that of seen speakers. However, there was still a gap between seen and unseen speakers in terms of DMOS for most languages. Although the pre-training process for the speaker encoder involves more than 7,000 speakers, the TTS system is only trained on a few hundred speakers. It is still challenging to achieve the ideal zero-shot performance when training with limited multilingual data.

V-B Objective Evaluation Results

TABLE VII: CER (%) of ground-truth recordings (GT) and synthesized audio samples from various models.
Characters IPA Pre-trained phoneme representations Characters GT
FSM-based ZMM-TTS FSM-based ZMM-TTS FSM-based ZMM-TTS YourTTS-based
FSM1c FSM2c ZMM-TTS1c ZMM-TTS2c FSM1i FSM2i ZMM-TTS1i ZMM-TTS2i FSM1x FSM2x ZMM-TTS1x ZMM-TTS2x YourTTS YourTTSE YourTTSW
Seen speakers
en 9.30 11.30 8.70 08.20 5.71 06.98 05.97 05.66 2.90 3.70 5.10 5.20 07.50 07.22 08.61 0.44
fr 4.43 05.21 6.91 06.72 6.07 07.05 08.49 08.32 3.32 3.81 6.03 5.91 06.92 08.91 10.89 3.49
ge 3.82 04.94 4.81 04.62 5.26 06.34 05.63 05.46 2.15 2.72 4.22 4.01 07.21 08.50 10.69 2.84
pt 2.62 03.32 4.31 04.17 3.63 04.40 05.26 05.05 2.77 3.41 5.42 5.11 12.03 14.36 16.45 2.18
es 2.40 03.23 3.41 03.32 3.07 03.71 03.68 03.59 1.58 2.01 2.67 2.73 05.74 07.78 08.82 1.88
sw 7.03 10.72 9.81 10.25 7.26 10.51 11.57 11.71 4.13 6.17 9.72 9.63 17.62 21.76 25.14 2.64
Unseen speakers
en 8.81 10.43 8.25 07.62 5.14 06.39 05.59 05.21 2.66 3.28 4.77 4.52 07.23 06.87 08.61 0.44
fr 4.42 05.23 6.37 06.02 5.88 06.88 07.74 07.64 3.29 3.61 5.42 5.43 07.02 08.21 09.46 3.49
ge 3.67 04.42 4.01 03.88 5.05 05.86 04.74 04.69 2.17 2.49 3.52 3.44 06.87 08.36 09.58 2.84
pt 2.56 03.08 3.97 03.63 3.55 04.19 04.56 04.48 2.69 3.18 4.92 4.84 12.37 14.15 16.11 2.18
es 1.94 02.57 2.83 02.96 2.68 03.17 03.19 03.22 1.56 1.88 2.44 2.32 06.17 07.55 08.80 1.88
sw 6.61 10.42 8.53 08.71 7.04 10.22 10.20 09.78 3.79 5.42 8.61 8.82 17.92 21.18 24.75 2.64

V-B1 SECS analysis

First, we found a very strong correlation (r = 0.807), measured by the Pearson correlation coefficient (PCC), between the SECS and DMOS values. Compared with the FSM-based model, ZMM-TTS performs better on both SECS and DMOS for both seen and unseen speakers. Compared with the YourTTS model, ZMM-TTS achieves similar SECS for seen speakers, but there are differences for unseen speakers, particularly unseen English speakers. One possible reason for this difference is that the original implementation of YourTTS utilizes SCL to improve generalization to the characteristics of unseen speakers. The different SECS results of YourTTSE and YourTTSW also demonstrate the impact of SCL.
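For illustration, the following minimal sketch (not the authors' evaluation script; the score lists are hypothetical placeholders) shows how such a Pearson correlation between objective SECS values and subjective DMOS scores could be computed with SciPy.

```python
# Minimal sketch: Pearson correlation between paired objective and subjective scores.
# The two lists below are hypothetical placeholders, not values from Table V/VI.
from scipy.stats import pearsonr

secs_scores = [0.82, 0.76, 0.91, 0.68, 0.88]   # e.g., per-system SECS values
dmos_scores = [3.9, 3.4, 4.3, 2.9, 4.1]        # matching DMOS values

r, p_value = pearsonr(secs_scores, dmos_scores)
print(f"Pearson r = {r:.3f} (p = {p_value:.3g})")
```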

To investigate whether this difference is related to language and to the two speaker encoders, we plotted speaker embeddings of seen and unseen speakers extracted from natural speech via t-SNE [82]. Figure 5 presents the t-SNE visualization and shows that several unseen speakers are not close to the seen speakers in the training set. For example, there are two obvious outliers among the ECAPA-TDNN speaker embeddings of the unseen English speakers. We calculated the SECS values of the six unseen English speakers whose speech was synthesized by the ZMM-TTS1c system. The two English outlier points in Figure 5(a) had SECS values of 0.766 and 0.644, significantly lower than those of speakers located closer to the seen English speakers. This explains why the speaker similarity of our proposed ZMM-TTS1c for unseen English speakers is significantly lower than that of YourTTS.
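A minimal sketch of such a visualization is given below, assuming speaker embeddings (e.g., ECAPA-TDNN vectors) have already been extracted into an array; the random data, dimensions, and labels are placeholders and not the embeddings shown in Figure 5.

```python
# Minimal sketch: 2-D t-SNE projection of per-utterance speaker embeddings.
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

embeddings = np.random.randn(200, 192)   # placeholder: 200 utterances x 192-dim embeddings
labels = np.repeat(np.arange(20), 10)    # placeholder: 20 speakers, 10 utterances each

points = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(embeddings)
plt.scatter(points[:, 0], points[:, 1], c=labels, cmap="tab20", s=8)
plt.title("t-SNE of speaker embeddings (illustrative)")
plt.savefig("tsne_speakers.png")
```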

Although many different types of SOTA neural speaker embeddings exist, ZMM-TTS currently uses only a popular speaker representation from ECAPA-TDNN; we will explore performance with other speaker embeddings, such as H/ASP and SCL, in future work.

V-B2 CER analysis

Table VII summarizes the obtained CER. The first finding is that the FSM1 and FSM2 baseline models achieve better (lower) CER, although their sound quality is worse than that of the other systems. This shows that speech generated with Mel spectrograms as intermediate features is intelligible but does not sound as natural as speech generated with SSL representations. Additionally, speech synthesized through Mel spectrograms tends to over-smooth the high-frequency region more than speech synthesized through SSL representations.

We also find that the CER is not always correlated with the MOS values. Although the CER of YourTTS is worse, its sound quality is clearly better than that of the FSM-based model. Furthermore, the CER of YourTTS in Portuguese and Swedish is significantly worse than in the other languages, which also explains why the YourTTS system is significantly worse than our ZMM-TTS on MOS in these two languages. YourTTS uses only character transcriptions rather than phonemes, which makes it more prone to mispronunciation issues, as mentioned in [8].

Additionally, we found that the systems using pre-trained phoneme representations as input perform best, and IPA is better than characters only in English. A notable result is that the CER of FSM2 is significantly worse than that of FSM1. For example, the pronunciation of the word “advice” in a sentence sometimes becomes incorrect and sounds more like the word “addressed,” resulting in ASR recognition errors. This shows that learning a discrete representation for the Mel spectrogram-to-speech conversion in vec2wavVQ (Mel) can improve naturalness, but some fine-grained information related to linguistic content may be lost.
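For clarity, the CER metric itself is simply the character-level edit distance between the ASR transcript and the reference text, normalized by the reference length. The self-contained sketch below illustrates this computation only; it is not the evaluation pipeline of Section IV-D, which relies on an ASR model to produce the hypothesis transcripts.

```python
# Minimal sketch: character error rate (CER) between a reference text and an ASR transcript.
def edit_distance(ref: str, hyp: str) -> int:
    # Standard Levenshtein distance via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)]

def cer(reference: str, hypothesis: str) -> float:
    return edit_distance(reference, hypothesis) / max(len(reference), 1)

# Toy example inspired by the mispronunciation case mentioned above.
print(f"CER = {100 * cer('advice', 'addressed'):.1f}%")
```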

VI Low-resource scenarios

TABLE VIII: Low-resource results. Total training data is 35 and 21 hours for Italian and Polish, respectively.
Size Method | Italian: UTMOS SECS CER MOS IMOS DMOS | Polish: UTMOS SECS CER MOS IMOS DMOS
Few-shot scenarios
15 min FSM2x 2.58 0.85 3.27 2.51±0.14 3.84±0.12 2.62±0.27 2.11 0.89 5.24 2.03±0.17 3.26±0.17 2.29±0.28
ZMM-TTS2x 2.97 0.91 3.92 3.33±0.17 4.09±0.13 2.95±0.28 2.77 0.93 8.27 3.54±0.23 3.49±0.18 2.91±0.30
FSM2c 2.10 0.78 7.13 1.62±0.14 3.02±0.13 2.29±0.25 1.45 0.83 28.36 1.41±0.16 1.53±0.17 1.81±0.23
ZMM-TTS2c 2.86 0.90 6.07 2.74±0.18 3.25±0.14 2.78±0.29 2.55 0.92 13.34 2.83±0.24 2.34±0.17 2.40±0.30
5 min FSM2x 2.60 0.86 3.39 2.73±0.16 3.80±0.12 2.90±0.27 2.34 0.90 6.12 2.52±0.21 3.31±0.18 2.46±0.30
ZMM-TTS2x 2.92 0.90 4.02 3.09±0.17 3.88±0.13 2.78±0.29 2.69 0.92 10.01 3.28±0.23 3.28±0.20 2.84±0.32
2.5 min FSM2x 2.48 0.84 3.12 2.44±0.14 3.62±0.12 2.75±0.26 2.22 0.89 11.29 2.05±0.20 2.56±0.18 2.28±0.30
ZMM-TTS2x 2.76 0.89 4.38 2.90±0.17 3.80±0.13 2.67±0.29 2.61 0.91 14.64 3.05±0.23 2.83±0.17 2.52±0.32
Zero-shot scenarios
0 min FSM2x 2.27 0.73 4.10 1.66±0.15 3.30±0.14 1.43±0.15 2.33 0.77 15.93 1.60±0.18 1.80±0.17 1.34±0.16
ZMM-TTS2x 3.27 0.83 5.11 2.88±0.20 3.47±0.15 1.52±0.25 2.42 0.85 15.40 2.95±0.26 2.57±0.19 2.05±0.29
High-resource baseline
1 h YourTTS 1.64 0.74 89.77 - - - 1.40 0.65 88.86 - - -
35/21h YourTTS 2.91 0.96 6.01 - - - 1.70 0.71 89.68 - - -
35/21h FSM2x 2.27 0.91 3.29 2.25±0.12 4.21±0.14 3.53±0.26 2.00 0.88 2.89 2.08±0.17 3.79±0.21 2.58±0.31
35/21h ZMM-TTS2x 3.02 0.96 3.76 4.12±0.17 4.32±0.20 4.21±0.24 2.87 0.95 4.49 3.99±0.20 4.28±0.20 3.90±0.26
35/21h FSM2c 2.17 0.88 2.47 2.01±0.15 3.79±0.14 2.99±0.27 1.79 0.85 3.71 1.78±0.17 3.78±0.15 2.49±0.31
35/21h ZMM-TTS2c 2.92 0.96 2.22 3.77±0.17 4.29±0.12 4.19±0.24 2.70 0.95 2.42 4.04±0.17 4.29±0.14 3.93±0.26
- GT 2.94 0.99 3.17 4.58±0.14 4.80±0.08 4.32±0.24 2.82 0.99 3.95 4.64±0.12 4.88±0.08 4.50±0.18

In addition to the high-resource languages, we also evaluate the performance of the proposed method in unseen low-resource language TTS scenarios to investigate language adaptability with limited training data.

VI-A Experimental conditions

VI-A1 Dataset

We chose Italian and Polish as the two unseen languages for this experiment. In the language family tree [83], Italian is closely related to French, Spanish, and Portuguese, all of which belong to the Romance family. Polish is relatively distant from our six high-resource languages and belongs to the Slavic language family. We selected two speakers, with IDs 1595 and 6892, from MLS for our Italian and Polish data, respectively.

VI-A2 Implementation Details

Considering the performance on the six high-resource languages, we chose three models, FSM2, ZMM-TTS2, and YourTTS, to apply to the new languages. We used both characters and pre-trained phoneme representations as input. Low-resource language synthesis is investigated in two scenarios: few-shot and zero-shot. Few-shot means that the model is fine-tuned using data from the target language before synthesizing speech in that language. In contrast, as mentioned in Section I, zero-shot means that our model performs inference on unseen languages without model fine-tuning.

For the few-shot scenario, we created training sets of different sizes (2.5, 5, and 15 minutes of audio) and used this limited data to fine-tune the model trained on six languages in Section IV. For language embeddings, we reserve a free ID for new languages, which is trained from scratch using the fine-tuning data only. Similarly, since different languages may use different characters, symbol embeddings for any characters unseen during pre-training must also be learned from the fine-tuning data only. Specifically, in our experiments, Italian and Polish had 3 and 14 symbols, respectively, that were unseen in the six pre-training languages. All few-shot models are fine-tuned for 10,000 steps with a batch size of 16. To compare the performance of the same language under high- and low-resource conditions, we also perform fine-tuning with much larger datasets of 35 h for Italian and 21 h for Polish.
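A minimal PyTorch sketch of this adaptation setup is given below. The table sizes, variable names, and the way the embedding layers are exposed are illustrative assumptions, not the actual ZMM-TTS code.

```python
# Minimal sketch: reserving a free language ID and growing the symbol-embedding
# table so that unseen characters can be learned from the fine-tuning data only.
import torch
import torch.nn as nn

NUM_PRETRAIN_LANGS = 6       # six high-resource languages
NUM_PRETRAIN_SYMBOLS = 100   # placeholder size of the pre-training symbol set
NUM_NEW_SYMBOLS = 14         # e.g., 14 unseen characters for Polish
EMB_DIM = 256                # placeholder embedding dimension

# Reserve one extra slot so a new language can use a free ID (index 6),
# trained from scratch during fine-tuning.
lang_emb = nn.Embedding(NUM_PRETRAIN_LANGS + 1, EMB_DIM)

# Grow the symbol table: copy the pre-trained rows, leave new rows randomly
# initialized so they are learned from the small adaptation set.
old_symbol_emb = nn.Embedding(NUM_PRETRAIN_SYMBOLS, EMB_DIM)  # stands in for the pre-trained table
new_symbol_emb = nn.Embedding(NUM_PRETRAIN_SYMBOLS + NUM_NEW_SYMBOLS, EMB_DIM)
with torch.no_grad():
    new_symbol_emb.weight[:NUM_PRETRAIN_SYMBOLS] = old_symbol_emb.weight
```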

For the zero-shot scenario, to address the mismatch between language embeddings of seen and unseen languages, we removed the language embedding layer. Therefore, the models tested on the zero-shot scenario are trained as in Section IV on six seen languages but without a language embedding layer. Additionally, we used pre-trained phoneme representations as input, making it easily applicable to many unseen languages.

VI-A3 Evaluation Methods

Objective Evaluation. We kept 100 sentences for the speech naturalness and speaker similarity tests and 1,000 sentences for the ASR tests, as described in Section IV-D. Furthermore, we adopted a publicly available automatic MOS (UTMOS) prediction model [84] to assess naturalness, as in [5, 32].

Subjective Evaluation. We synthesized eight sentences from the test sets for each TTS system, and 15 native listeners of each language participated in our listening tests. First, we employed the same evaluation as described in Section IV-D to assess MOS and DMOS. Furthermore, ensuring the intelligibility of generated speech is crucial, particularly in low-resource scenarios. To this end, we also designed an intelligibility test (IMOS) in which listeners rated the extent to which words were mispronounced or not understandable. The IMOS scale also ranges from 1 (very bad) to 5 (very good).

VI-B Experimental results and analysis

The results on unseen languages are shown in Table VIII.

VI-B1 Comparison between different input representations

For the high-resource baselines, the intelligibility results (CER/IMOS) of ZMM-TTS2x and ZMM-TTS2c were quite similar. However, when only 15 minutes of data were used for fine-tuning, the speech synthesized by ZMM-TTS2x was much more intelligible than that of ZMM-TTS2c. The same phenomenon is observed for FSM2x and FSM2c. This result demonstrates that the pre-trained phoneme representations from XPhoneBERT are more suitable for low-resource scenarios.

VI-B2 Comparison of FSM and ZMM-TTS models

In both subjective and objective evaluations, the ZMM-TTS2x model produces significantly more natural speech (MOS/UTMOS) with better speaker similarity (DMOS/SECS) than the FSM2x model, in both high- and low-resource scenarios for the two unseen languages. This result indicates that self-supervised features can reconstruct sound quality and timbre better than Mel spectrograms, even with limited data. Furthermore, based on the subjective IMOS evaluations, the intelligibility of ZMM-TTS2x surpasses that of FSM2x in low-resource scenarios, which demonstrates the advantage of our model in these settings.

VI-B3 Comparison of ZMM-TTS and YourTTS models

We can see that YourTTS cannot be fine-tuned with limited data from an unseen language. Even if one hour of data is used for fine-tuning, YourTTS still cannot synthesize understandable audio. Although promising results are evident in several language combinations with sufficient training data, YourTTS demonstrates limited efficacy when adapting to languages with limited available speakers.

VI-B4 Comparison of different data sizes

As more training data is added, the synthesized speech quality improves, particularly in terms of speech intelligibility. Furthermore, the results for speaker similarity (SECS/DMOS) of ZMM-TTS2x are similar across different quantities of training data (2.5, 5, and 15 mins). In general, with only 2.5 mins of speech from an unseen speaker in a new language, our fine-tuning of the ZMM-TTS2x still resulted in high speaker similarity in terms of SECS.

VI-B5 Comparison of two languages

We found that for Italian, using pre-trained phoneme representations as input, we were able to synthesize intelligible speech in zero-shot scenarios. In Polish, there is a significant gap between the zero-shot and few-shot scenarios, but both FSM2x and ZMM-TTS2x were able to generate intelligible speech utilizing a few minutes of data with pre-trained phoneme representations as input. The SSL speech-based features do not seem to provide an advantage over the Mel spectrograms, unlike in Italian. Although Italian and Polish are both included in the XLSR training data, the performance of SSL representations may be heavily affected by the similarity between the domains of the ZMM-TTS pre-training data and the fine-tuning data. Future work may include investigating the impact of language similarity on cross-lingual transfer.

VII Conclusion

In this paper, we propose a new method for generating multilingual speech using self-supervised discrete speech representations. We experimented with different input representations and incorporated a pre-trained multilingual phoneme encoder for multilingual tasks. Our experimental results show that our approach improves the speaker similarity and naturalness of synthetic speech in multilingual tasks, even for unseen speakers. Additionally, our framework achieves high intelligibility and speaker similarity in limited-data or zero-shot scenarios for a new language.

As future work, we plan to apply this model to many more languages and explore advanced language and speaker adaptation strategies. To make this happen, we will also need to investigate flexible language representations to handle unseen languages more accurately than the language IDs used in the current system.

Appendix A Comparison with large-scale TTS systems

Although our model primarily targets multilingual and low-resource scenarios, given the remarkable breakthroughs achieved by large-scale TTS models for monolingual speech synthesis, we also compared our model with such systems on English from various perspectives.

A-A Experimental conditions

A-A1 Implementation details

For a fair comparison, in addition to the ZMM-TTS2x model trained on six languages in Section IV, we also trained another model on the LibriTTS [85] dataset. We compared our ZMM-TTS2x with the following three strong zero-shot TTS baselines:

  • VALL-E-X [66]. It is a multilingual version of VALL-E that also employs a combination of autoregressive and non-autoregressive methods for discrete token generation. We use an open-source implementation and checkpoint (https://github.com/Plachtaa/VALL-E-X). The training data for this model comprises three languages: Chinese (ch), English (en), and Japanese (jp).

  • HierSpeech++ [32]. This work is an extended version of HierSpeech [64]. It uses a non-autoregressive model for continuous vector generation. There are three components in HierSpeech++: a hierarchical speech synthesizer, text-to-vec (TTV), and speech super-resolution (SpeechSR). We conducted experiments using two different pre-trained HierSpeech++ models, labeled HierSpeech++(a) and HierSpeech++(b). The difference between them lies in the training data scale of the hierarchical speech synthesizer. HierSpeech++(a) used the complete LibriTTS train-clean data, while HierSpeech++(b) utilized the LibriTTS train-clean-100 and 360 subsets. However, the training data for TTV in both HierSpeech++(a) and HierSpeech++(b) remained consistent, employing the complete LibriTTS train-clean data. Note that, for purposes of fair comparison, we did not use the super-resolution model. We used the official code and checkpoint for the experiments (https://github.com/sh-lee-prml/HierSpeechpp).

  • StyleTTS 2 [65]. It leverages style diffusion and adversarial training with large speech language models (a 12-layer WavLM pre-trained on 94k hours of data) to achieve human-level TTS synthesis. We use the official code and checkpoint (https://github.com/yl4579/StyleTTS2).

A-A2 Evaluation dataset

We chose LibriSpeech [86] test-clean as our benchmark dataset for the zero-shot TTS task. This widely-used test set comprises 40 different speakers and 5.4 hours of speech. To conduct the benchmark experiments, we followed the approach described in [31], and we randomly selected 25 sentences for each speaker from the LibriSpeech test-clean dataset.

A-A3 Metrics

We evaluate the intelligibility of synthesized speech using WER as described in Section IV-D. Following previous works [22, 31], we compute SECS using the SOTA speaker verification model WavLM-Large (https://github.com/microsoft/UniSpeech/tree/main/downstreams/speaker_verification) to evaluate speaker similarity, enabling comparison with those studies. We also use UTMOS, as in Section VI. In addition to evaluating speech quality, we measure the efficiency of the proposed model via its Real-Time Factor (RTF), i.e., the time it takes to generate one second of audio on a GPU, and via the number of parameters in the model. We test the RTF on a single GPU (NVIDIA RTX 4090 with 24 GB memory). Note that VALL-E-X and StyleTTS 2 generate audio at a sample rate of 24 kHz, while ZMM-TTS2x and HierSpeech++ use a sample rate of 16 kHz. To ensure consistency, we resampled all audio to 16 kHz and applied amplitude normalization using sv56 before conducting the WER, SECS, and UTMOS tests.
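The SECS and RTF metrics reduce to simple computations once speaker embeddings and synthesis outputs are available. The sketch below is illustrative only: the dummy TTS function and the random embeddings are placeholders standing in for the actual WavLM-based speaker verification model and the TTS systems under test.

```python
# Minimal sketch: speaker-embedding cosine similarity (SECS) and real-time factor (RTF).
import time
import numpy as np

def secs(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    # Cosine similarity between speaker embeddings of reference and synthesized speech.
    return float(np.dot(emb_a, emb_b) / (np.linalg.norm(emb_a) * np.linalg.norm(emb_b)))

def real_time_factor(synthesize, text: str, sample_rate: int = 16000) -> float:
    # RTF = wall-clock synthesis time divided by the duration of the generated audio,
    # i.e., the time needed to generate one second of speech.
    start = time.time()
    wav = synthesize(text)
    elapsed = time.time() - start
    return elapsed / (len(wav) / sample_rate)

# Toy usage with stand-in components.
dummy_tts = lambda text: np.zeros(16000)              # "generates" one second of silence
ref_emb, syn_emb = np.random.randn(256), np.random.randn(256)
print("SECS:", secs(ref_emb, syn_emb))
print("RTF :", real_time_factor(dummy_tts, "hello world"))
```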

TABLE IX: Evaluation results for ZMM-TTS and recent large-scale TTS systems on LibriSpeech test-clean. LT-460 denotes LibriTTS train-clean-100 and 360 subsets, and LT-960 additionally utilizes a train-other-500 subset with LT-460. AL-1&3 denotes AISHELL-1 and AISHELL-3, and JP-CV denotes the Japanese Common Voice.
Method | Training corpus: Datasets, Language, Speakers, Hours | Evaluation metrics: WER, SECS, UTMOS, Params, RTF
VALL-E-X LT-960, AL-1&3, JP-CV, others en, jp, ch >3,122 1,739 26.77 0.512 3.29 395M 5.917
HierSpeech++(a) LT-960 en 2,311 555 2.03 0.591 4.40 204M 0.217
HierSpeech++(b) LT-460 en 1,151 245 2.17 0.555 4.38 204M 0.222
StyleTTS 2 LT-460 en 1,151 245 3.06 0.455 4.23 191M 0.070
ZMM-TTS2x(a) MLS, GLB, LJ, CSS10, NST en, fr, ge, pt, es, sw 546 130 4.06 0.432 3.89 167M 0.003
ZMM-TTS2x(b) LT-960 en 2,311 555 2.37 0.644 4.07 167M 0.003
Ground-truth - - - - 2.14 - 4.13 - -

A-B Experimental results and analysis

Table IX contains information on the efficiency and quality of all models being compared.

A-B1 SECS analysis

We found that VALL-E-X, HierSpeech++(a), and HierSpeech++(b) outperform our ZMM-TTS2x (a) in terms of unseen speaker similarity on the LibriSpeech test set. This gap may be due to differences in the scale of training data: the ZMM-TTS2x (a) training corpus includes only 546 speakers, while the other baseline systems have over 1,000 or 2,000 speakers in their training data. When the scale of training data is reduced, the SECS of HierSpeech++(b) declines significantly compared with HierSpeech++(a). With the same data scale, our proposed ZMM-TTS2x (b) achieved the best SECS among the recent zero-shot models compared. This demonstrates that our proposed model scales with the amount of training data and can generalize to unseen speakers.

A-B2 UTMOS analysis

After analyzing the scores predicted by the automatic MOS evaluation model, we noticed a correspondence between the quality of the generated audio and the training corpus. Models trained on the clean TTS databases LT-960 or LT-460 all achieved better UTMOS values. The training corpora of the ZMM-TTS2x (a) model and the VALL-E-X model do not exclusively consist of high-quality clean audio; they also contain data with noticeable noise, such as GlobalPhone in ZMM-TTS2x (a) and other databases in VALL-E-X.

A-B3 WER analysis

We observed that the speech intelligibility achieved by the VALL-E-X model is the worst among all the models. This is primarily due to the model's autoregressive process, which often results in synthesized speech containing errors such as skipped and repeated words. On the other hand, our ZMM-TTS2x (a) model, despite being trained on a smaller corpus covering multiple languages, achieved WER results in English comparable with those of the other models, which indicates the robustness of our model.

A-B4 Latency analysis

Undoubtedly, the autoregressive process of VALL-E-X results in the slowest synthesis speed, making real-time synthesis challenging even on GPUs. Due to its non-autoregressive structure, ZMM-TTS2x is capable of real-time, high-speed synthesis on a GPU. Although HierSpeech++ is also non-autoregressive, the iterative process of its diffusion model still impacts synthesis speed. Additionally, ZMM-TTS2x has around 167M parameters and is thus smaller than the other TTS systems compared.

Note that the baseline models focus solely on resource-rich languages such as English. In contrast, our model has strong language adaptability, making it more suitable for low-resource languages.

References

  • [1] Y. Wang, R. J. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio, Q. V. Le, Y. Agiomyrgiannakis, R. Clark, and R. A. Saurous, “Tacotron: Towards end-to-end speech synthesis,” in Proc. Interspeech.   ISCA, 2017, pp. 4006–4010.
  • [2] Y. Ren, C. Hu, X. Tan, T. Qin, S. Zhao, Z. Zhao, and T. Liu, “Fastspeech 2: Fast and high-quality end-to-end text to speech,” in Proc. ICLR, 2021.
  • [3] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Ryan, R. A. Saurous, Y. Agiomyrgiannakis, and Y. Wu, “Natural TTS synthesis by conditioning WaveNet on Mel spectrogram predictions,” in Proc. ICASSP.   IEEE, 2018, pp. 4779–4783.
  • [4] X. Tan, J. Chen, H. Liu, J. Cong, C. Zhang, Y. Liu, X. Wang, Y. Leng, Y. Yi, L. He et al., “NaturalSpeech: End-to-end text to speech synthesis with human-level quality,” arXiv preprint arXiv:2205.04421, 2022.
  • [5] T. Saeki, S. Maiti, X. Li, S. Watanabe, S. Takamichi, and H. Saruwatari, “Learning to speak from text: Zero-shot multilingual text-to-speech with unsupervised text pretraining,” in Proc. IJCAI, 2023.
  • [6] Y. Zhang, R. J. Weiss, H. Zen, Y. Wu, Z. Chen, R. J. Skerry-Ryan, Y. Jia, A. Rosenberg, and B. Ramabhadran, “Learning to speak fluently in a foreign language: Multilingual speech synthesis and cross-language voice cloning,” in Proc. Interspeech.   ISCA, 2019, pp. 2080–2084.
  • [7] T. Nekvinda and O. Dusek, “One model, many languages: Meta-learning for multilingual text-to-speech,” in Proc. Interspeech.   ISCA, 2020, pp. 2972–2976.
  • [8] E. Casanova, J. Weber, C. D. Shulby, A. C. Júnior, E. Gölge, and M. A. Ponti, “YourTTS: Towards zero-shot multi-speaker TTS and zero-shot voice conversion for everyone,” in Proc. ICML, vol. 162.   PMLR, 2022, pp. 2709–2720.
  • [9] Y. Jia, H. Zen, J. Shen, Y. Zhang, and Y. Wu, “PnG BERT: Augmented BERT on phonemes and graphemes for neural TTS,” Proc. Interspeech, pp. 151–155, 2021.
  • [10] G. Zhang, K. Song, X. Tan, D. Tan, Y. Yan, Y. Liu, G. Wang, W. Zhou, T. Qin, T. Lee et al., “Mixed-phoneme BERT: Improving bert with mixed phoneme and sup-phoneme representations for text to speech,” in Proc. Interspeech, 2022, pp. 456–460.
  • [11] Y. A. Li, C. Han, X. Jiang, and N. Mesgarani, “Phoneme-level BERT for enhanced prosody of text-to-speech with grapheme predictions,” in Proc. ICASSP.   IEEE, 2023, pp. 1–5.
  • [12] L. T. Nguyen, T. Pham, and D. Q. Nguyen, “XPhoneBERT: A pre-trained multilingual model for phoneme representations for text-to-speech,” in Proc. Interspeech, 2023, pp. 5506–5510.
  • [13] Y. Jia, Y. Zhang, R. Weiss, Q. Wang, J. Shen, F. Ren, z. Chen, P. Nguyen, R. Pang, I. Lopez Moreno, and Y. Wu, “Transfer learning from speaker verification to multispeaker text-to-speech synthesis,” in Proc. NIPS, vol. 31, 2018, pp. 4480–4490.
  • [14] E. Cooper, C.-I. Lai, Y. Yasuda, F. Fang, X. Wang, N. Chen, and J. Yamagishi, “Zero-shot multi-speaker text-to-speech with state-of-the-art neural speaker embeddings,” in Proc. ICASSP, 2020, pp. 6184–6188.
  • [15] A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, “Wav2vec 2.0: A framework for self-supervised learning of speech representations,” in Proc. NIPS, 2020, pp. 12 449–12 460.
  • [16] W.-N. Hsu, B. Bolte, Y.-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed, “HuBERT: Self-supervised speech representation learning by masked prediction of hidden units,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 3451–3460, 2021.
  • [17] S. Chen, C. Wang, Z. Chen, Y. Wu, S. Liu, Z. Chen, J. Li, N. Kanda, T. Yoshioka, X. Xiao, J. Wu, L. Zhou, S. Ren, Y. Qian, Y. Qian, J. Wu, M. Zeng, X. Yu, and F. Wei, “WavLM: Large-scale self-supervised pre-training for full stack speech processing,” IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1505–1518, 2022.
  • [18] B. Thomas, S. Kessler, and S. Karout, “Efficient adapter transfer of self-supervised speech models for automatic speech recognition,” in Proc. ICASSP, 2022, pp. 7102–7106.
  • [19] Z. Chen, S. Chen, Y. Wu, Y. Qian, C. Wang, S. Liu, Y. Qian, and M. Zeng, “Large-scale self-supervised speech representation learning for automatic speaker verification,” in Proc. ICASSP.   IEEE, 2022, pp. 6147–6151.
  • [20] H.-S. Choi, J. Lee, W. Kim, J. Lee, H. Heo, and K. Lee, “Neural analysis and synthesis: Reconstructing speech from self-supervised representations,” in Proc. NIPS, vol. 34, 2021, pp. 16 251–16 265.
  • [21] W.-C. Huang, S.-W. Yang, T. Hayashi, H.-Y. Lee, S. Watanabe, and T. Toda, “S3PRL-VC: Open-source voice conversion framework with self-supervised speech representations,” in Proc. ICASSP, 2022, pp. 6552–6556.
  • [22] C. Wang, S. Chen, Y. Wu, Z. Zhang, L. Zhou, S. Liu, Z. Chen, Y. Liu, H. Wang, J. Li et al., “Neural codec language models are zero-shot text to speech synthesizers,” arXiv preprint arXiv:2301.02111, 2023.
  • [23] S. Liu, Y. Guo, C. Du, X. Chen, and K. Yu, “DSE-TTS: Dual speaker embedding for cross-lingual text-to-speech,” in Proc. Interspeech, 2023, pp. 616–620.
  • [24] A. Pasad, J.-C. Chou, and K. Livescu, “Layer-wise analysis of a self-supervised speech representation model,” in Proc. ASRU, 2021, pp. 914–921.
  • [25] Z. Borsos, R. Marinier, D. Vincent, E. Kharitonov, O. Pietquin, M. Sharifi, D. Roblek, O. Teboul, D. Grangier, M. Tagliasacchi et al., “Audiolm: a language modeling approach to audio generation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2023.
  • [26] E. Kharitonov, D. Vincent, Z. Borsos, R. Marinier, S. Girgin, O. Pietquin, M. Sharifi, M. Tagliasacchi, and N. Zeghidour, “Speak, read and prompt: High-fidelity text-to-speech with minimal supervision,” Transactions of the Association for Computational Linguistics, vol. 11, pp. 1703–1718, 2023.
  • [27] M. Le, A. Vyas, B. Shi, B. Karrer, L. Sari, R. Moritz, M. Williamson, V. Manohar, Y. Adi, J. Mahadeokar et al., “Voicebox: Text-guided multilingual universal speech generation at scale,” Advances in neural information processing systems, vol. 36, 2024.
  • [28] N. Zeghidour, A. Luebs, A. Omran, J. Skoglund, and M. Tagliasacchi, “Soundstream: An end-to-end neural audio codec,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 30, pp. 495–507, 2021.
  • [29] A. Défossez, J. Copet, G. Synnaeve, and Y. Adi, “High fidelity neural audio compression,” Transactions on Machine Learning Research, 2023.
  • [30] K. Shen, Z. Ju, X. Tan, Y. Liu, Y. Leng, L. He, T. Qin, S. Zhao, and J. Bian, “Naturalspeech 2: Latent diffusion models are natural and zero-shot speech and singing synthesizers,” arXiv preprint arXiv:2304.09116, 2023.
  • [31] Z. Ju, Y. Wang, K. Shen, X. Tan, D. Xin, D. Yang, Y. Liu, Y. Leng, K. Song, S. Tang et al., “Naturalspeech 3: Zero-shot speech synthesis with factorized codec and diffusion models,” arXiv preprint arXiv:2403.03100, 2024.
  • [32] S.-H. Lee, H.-Y. Choi, S.-B. Kim, and S.-W. Lee, “HierSpeech++: Bridging the gap between semantic and acoustic representation of speech by hierarchical variational inference for zero-shot speech synthesis,” arXiv preprint arXiv:2311.12454, 2023.
  • [33] S. Arik, J. Chen, K. Peng, W. Ping, and Y. Zhou, “Neural voice cloning with a few samples,” in Proc. NIPS, vol. 31, 2018.
  • [34] Y. Wang, D. Stanton, Y. Zhang, R.-S. Ryan, E. Battenberg, J. Shor, Y. Xiao, Y. Jia, F. Ren, and R. A. Saurous, “Style tokens: Unsupervised style modeling, control and transfer in end-to-end speech synthesis,” in Proc. ICML.   PMLR, 2018, pp. 5180–5189.
  • [35] E. Casanova, C. Shulby, E. Gölge, N. M. Müller, F. S. de Oliveira, A. Candido Jr., A. da Silva Soares, S. M. Aluisio, and M. A. Ponti, “SC-GlowTTS: An efficient zero-shot multi-speaker text-to-speech model,” in Proc. Interspeech, 2021, pp. 3645–3649.
  • [36] D. Xin, T. Komatsu, S. Takamichi, and H. Saruwatari, “Disentangled speaker and language representations using mutual information minimization and domain adaptation for cross-lingual TTS,” in Proc. ICASSP.   IEEE, 2021, pp. 6608–6612.
  • [37] J. Yang and L. He, “Cross-lingual text-to-speech using multi-task learning and speaker classifier joint training,” arXiv preprint arXiv:2201.08124, 2022.
  • [38] N. Li, S. Liu, Y. Liu, S. Zhao, and M. Liu, “Neural speech synthesis with Transformer network,” in Proc. AAAI, 2019, pp. 6706–6713.
  • [39] Y. Ren, Y. Ruan, X. Tan, T. Qin, S. Zhao, Z. Zhao, and T.-Y. Liu, “Fastspeech: Fast, robust and controllable text to speech,” in Advances in Neural Information Processing Systems, vol. 32, 2019.
  • [40] A. Lancucki, “FastPitch: Parallel text-to-speech with pitch prediction,” in Proc. ICASSP.   IEEE, 2021, pp. 6588–6592.
  • [41] J. Yang and L. He, “Towards universal text-to-speech,” in Proc. Interspeech, 2020, pp. 3171–3175.
  • [42] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. W. Senior, and K. Kavukcuoglu, “Wavenet: A generative model for raw audio,” in ISCA, September 2016.   ISCA, 2016, p. 125.
  • [43] R. Badlani, R. Valle, K. J. Shih, J. F. Santos, S. Gururani, and B. Catanzaro, “RAD-MMM: Multilingual multiaccented multispeaker text to speech,” in Proc. Interspeech, 2023, pp. 626–630.
  • [44] K. J. Shih, R. Valle, R. Badlani, A. Lancucki, W. Ping, and B. Catanzaro, “RAD-TTS: Parallel flow-based TTS with robust alignment learning and diverse synthesis,” in Proc. ICML Workshop on Invertible Neural Networks, Normalizing Flows, and Explicit Likelihood Models, 2021.
  • [45] H. Cho, W. Jung, J. Lee, and S. H. Woo, “SANE-TTS: Stable and natural end-to-end multilingual text-to-speech,” in Proc. Interspeech.   ISCA, 2022.
  • [46] J. Kim, J. Kong, and J. Son, “Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech,” in Proc. ICML.   PMLR, 2021, pp. 5530–5540.
  • [47] M. Chen, M. Chen, S. Liang, J. Ma, L. Chen, S. Wang, and J. Xiao, “Cross-lingual, multi-speaker text-to-speech synthesis using neural speaker embedding,” in Proc. Interspeech.   ISCA, 2019, pp. 2105–2109.
  • [48] T. Saeki, S. Maiti, X. Li, S. Watanabe, S. Takamichi, and H. Saruwatari, “Text-inductive graphone-based language adaptation for low-resource speech synthesis,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 1829–1844, 2024.
  • [49] B. Li, Y. Zhang, T. N. Sainath, Y. Wu, and W. Chan, “Bytes are all you need: End-to-end multilingual speech recognition and synthesis with bytes,” in Proc. ICASSP.   IEEE, 2019, pp. 5621–5625.
  • [50] M. Staib, T. H. Teh, A. Torresquintero, D. S. R. Mohan, L. Foglianti, R. Lenain, and J. Gao, “Phonological Features for 0-Shot Multilingual Speech Synthesis,” in Proc. Interspeech 2020, 2020, pp. 2942–2946.
  • [51] D. Wells and K. Richmond, “Cross-lingual transfer of phonological features for low-resource speech synthesis,” in Proc. SSW, 2021, pp. 160–165.
  • [52] D. R. Mortensen, S. Dalmia, and P. Littell, “Epitran: Precision G2P for many languages,” in Proc. LREC, Paris, France, May 2018.
  • [53] Y. Wang, J. Li, H. Wang, Y. Qian, C. Wang, and Y. Wu, “Wav2vec-switch: Contrastive learning from original-noisy speech pairs for robust speech recognition,” in Proc. ICASSP.   IEEE, 2022, pp. 7097–7101.
  • [54] H. Siuzdak, P. Dura, P. van Rijn, and N. Jacoby, “WavThruVec: Latent speech representation as intermediate features for neural speech synthesis,” in Proc. Interspeech.   ISCA, 2022, pp. 833–837.
  • [55] J.-h. Lin, Y. Y. Lin, C.-M. Chien, and H.-y. Lee, “S2VC: A framework for any-to-any voice conversion with self-supervised pretrained representations,” in Proc. Interspeech, 2021, pp. 836–840.
  • [56] S. Chen, Y. Wu, C. Wang, S. Liu, Z. Chen, P. Wang, G. Liu, J. Li, J. Wu, X. Yu, and F. Wei, “Why does Self-Supervised Learning for Speech Recognition Benefit Speaker Recognition?” in Proc. Interspeech.   ISCA, Sep. 2022, pp. 3699–3703.
  • [57] Y.-J. Zhang, C. Zhang, W. Song, Z. Zhang, Y. Wu, and X. He, “Prosody modelling with pre-trained cross-utterance representations for improved speech synthesis,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 2812–2823, 2023.
  • [58] A. Conneau, A. Baevski, R. Collobert, A. Mohamed, and M. Auli, “Unsupervised cross-lingual representation learning for speech recognition,” in Proc. Interspeech.   ISCA, Aug. 2021, pp. 2426–2430.
  • [59] C. Du, Y. Guo, X. Chen, and K. Yu, “VQTTS: High-Fidelity Text-to-Speech Synthesis with Self-Supervised VQ Acoustic Feature,” in Proc. Interspeech, Incheon, Korea, Sep. 2022, pp. 1596–1600.
  • [60] L.-W. Chen, S. Watanabe, and A. Rudnicky, “A vector quantized approach for text to speech synthesis on real-world spontaneous speech,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 11, 2023, pp. 12 644–12 652.
  • [61] C. Liu, Z.-H. Ling, and L.-H. Chen, “Pronunciation dictionary-free multilingual speech synthesis by combining unsupervised and supervised phonetic representations,” in Proc. Interspeech, 2022, pp. 4282–4286.
  • [62] D. Wells, K. Richmond, and W. Lamb, “A Low-Resource Pipeline for Text-to-Speech from Found Data With Application to Scottish Gaelic,” in Proc. INTERSPEECH 2023, 2023, pp. 4324–4328.
  • [63] T. Saeki, G. Wang, N. Morioka, I. Elias, K. Kastner, A. Rosenberg, B. Ramabhadran, H. Zen, F. Beaufays, and H. Shemtov, “Extending multilingual speech synthesis to 100+ languages without transcribed data,” arXiv preprint arXiv:2402.18932, 2024.
  • [64] S.-H. Lee, S.-B. Kim, J.-H. Lee, E. Song, M.-J. Hwang, and S.-W. Lee, “HierSpeech: Bridging the gap between text and speech by hierarchical variational inference using self-supervised representations for speech synthesis,” Advances in Neural Information Processing Systems, vol. 35, pp. 16 624–16 636, 2022.
  • [65] Y. A. Li, C. Han, V. Raghavan, G. Mischler, and N. Mesgarani, “StyleTTS 2: Towards human-level text-to-speech through style diffusion and adversarial training with large speech language models,” Advances in Neural Information Processing Systems, vol. 36, 2024.
  • [66] Z. Zhang, L. Zhou, C. Wang, S. Chen, Y. Wu, S. Liu, Z. Chen, Y. Liu, H. Wang, J. Li et al., “Speak foreign languages with your own voice: Cross-lingual neural codec language modeling,” arXiv preprint arXiv:2303.03926, 2023.
  • [67] L. Barrault, Y.-A. Chung, M. C. Meglioli, D. Dale, N. Dong, M. Duppenthaler, P.-A. Duquenne, B. Ellis, H. Elsahar, J. Haaheim et al., “Seamless: Multilingual expressive and streaming speech translation,” arXiv preprint arXiv:2312.05187, 2023.
  • [68] J. Kong, J. Kim, and J. Bae, “HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis,” in Proc. NIPS, vol. 33.   Curran Associates, Inc., 2020, pp. 17 022–17 033.
  • [69] R. Badlani, A. Łańcucki, K. J. Shih, R. Valle, W. Ping, and B. Catanzaro, “One TTS alignment to rule them all,” in Proc. ICASSP.   IEEE, 2022, pp. 6092–6096.
  • [70] D. Berrebbi, J. Shi, B. Yan, O. López-Francisco, J. Amith, and S. Watanabe, “Combining spectral and self-supervised features for low resource speech recognition and translation,” in Proc. Interspeech, 2022, pp. 3533–3537.
  • [71] H. Guo, F. Xie, F. K. Soong, X. Wu, and H. Meng, “A multi-stage multi-codebook VQ-VAE approach to high-performance neural TTS,” in Proc. Interspeech.   ISCA, 2022, pp. 1611–1615.
  • [72] H. Guo, F. Xie, X. Wu, F. K. Soong, and H. Meng, “MSMC-TTS: Multi-stage multi-codebook VQ-VAE based neural TTS,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 1811–1824, 2023.
  • [73] W. Jang, D. Lim, J. Yoon, B. Kim, and J. Kim, “UnivNet: A neural vocoder with multi-resolution spectrogram discriminators for high-fidelity waveform generation,” in Proc. INTERSPEECH, 2021, pp. 2207–2211.
  • [74] V. Pratap, Q. Xu, A. Sriram, G. Synnaeve, and R. Collobert, “MLS: A large-scale multilingual dataset for speech research,” in Proc. Interspeech, 2020, p. 2757–2761.
  • [75] T. Schultz, N. T. Vu, and T. Schlippe, “GlobalPhone: A multilingual text & speech database in 20 languages,” in Proc. ICASSP, 2013, pp. 8126–8130.
  • [76] K. Park and T. Mulc, “CSS10: A collection of single speaker speech datasets for 10 languages,” in Proc. Interspeech, 2019, pp. 1566–1570.
  • [77] K. Ito and L. Johnson, “The LJ speech dataset,” https://keithito.com/LJ-Speech-Dataset/, 2017.
  • [78] N. L. Technology, “NST Swedish speech synthesis,” https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-18/, 2003.
  • [79] I. T. Union. Recommendation G.191: Software Tools and Audio Coding Standardization. (2005, Nov 11). [Online]. Available: https://www.itu.int/rec/T-REC-P.56/en
  • [80] H. S. Heo, B.-J. Lee, J. Huh, and J. S. Chung, “CLOVA baseline system for the VoxCeleb Speaker Recognition Challenge 2020,” arXiv preprint arXiv:2009.14153, 2020.
  • [81] Data-baker. Chinese Standard Mandarin Speech Corpus. (2022, Nov). [Online]. Available: https://www.data-baker.com/open_source.html
  • [82] L. Van der Maaten and G. Hinton, “Visualizing data using t-SNE.” Journal of machine learning research, vol. 9, no. 11, 2008.
  • [83] P. Do, M. Coler, J. Dijkstra, and E. Klabbers, “Text-to-speech for under-resourced languages: Phoneme mapping and source language selection in transfer learning,” in Proc. ELRA/ISCA SIG on Under-Resourced Languages, 2022, pp. 16–22.
  • [84] T. Saeki, D. Xin, W. Nakata, T. Koriyama, S. Takamichi, and H. Saruwatari, “UTMOS: UTokyo-SaruLab system for VoiceMOS Challenge 2022,” in Proc. Interspeech, 2022, pp. 4521–4525.
  • [85] H. Zen, V. Dang, R. Clark, Y. Zhang, R. J. Weiss, Y. Jia, Z. Chen, and Y. Wu, “LibriTTS: A corpus derived from LibriSpeech for text-to-speech,” arXiv preprint arXiv:1904.02882, 2019.
  • [86] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “LibriSpeech: An ASR corpus based on public domain audio books,” in Proc. ICASSP.   IEEE, 2015, pp. 5206–5210.
[Uncaptioned image] Cheng Gong received his B.S. and M.S. degrees from Hohai University, Nanjing, China, in 2016 and 2019, respectively. He is currently working toward the Ph.D. degree at the Tianjin Key Laboratory of Cognitive Computing and Application, Tianjin University, Tianjin, China. He is also an exchange Ph.D. candidate at the National Institute of Informatics, Japan, funded by the China Scholarship Council (CSC). His research interests include expressive speech synthesis and multilingual speech synthesis.
[Uncaptioned image] Xin Wang is a project associate professor at the National Institute of Informatics and a JST PRESTO researcher, Japan. He was among the organizing team of the recent ASVspoof Challenges and the VoicePrivacy initiatives in 2020 and 2022. His research focuses on speech audio generation, text-to-speech synthesis, anti-spoofing, and other speech security and privacy related tasks. He is a Member of IEEE.
[Uncaptioned image] Erica Cooper received the B.Sc. and M.Eng. degrees in electrical engineering and computer science from the Massachusetts Institute of Technology, Cambridge, MA, USA, in 2009 and 2010, respectively, and the Ph.D. degree in computer science from Columbia University, New York, NY, USA, in 2019. She was a postdoctoral researcher at the National Institute of Informatics, Tokyo, Japan from 2019 to 2024. Her research interests include statistical machine learning, speech synthesis, and speech quality assessment.
[Uncaptioned image] Dan Wells is a PhD student at the Centre for Speech Technology Research, University of Edinburgh. He received the BA degree in Linguistics from the University of Cambridge in 2013 and the MSc in Speech and Language Processing from the University of Edinburgh in 2015. His research interests are in speech synthesis, in particular methods for reducing linguistic data requirements when building a voice for a new language, and speech representation learning. He also has several years of industry experience in automatic speaker and speech recognition.
[Uncaptioned image] Longbiao Wang received his Dr. Eng. degree from Toyohashi University of Technology, Japan, in 2008. He was an assistant professor in the faculty of Engineering at Shizuoka University, Japan from April 2008 to September 2012. He was an associate professor at Nagaoka University of Technology, Japan from Oct. 2012 to Aug. 2016. He is currently a professor, director of Tianjin Key Laboratory of Cognitive Computing and Application and vice dean of School of Artificial Intelligence at Tianjin University, China. His research interests include robust speech recognition, speaker recognition, acoustic signal processing and natural language processing.
[Uncaptioned image] Jianwu Dang (M’12) graduated from Tsinghua University, China, in 1982, and received his M.S. degree from the same university in 1984. He worked at Tianjin University as a lecturer from 1984 to 1988. He was awarded the Ph.D. degree from Shizuoka University, Japan, in 1992. He worked at ATR Human Information Processing Labs., Japan, as a senior researcher from 1992 to 2001. He joined the University of Waterloo, Canada, as a visiting scholar for one year beginning in 1998. Since 2001, he has worked at the Japan Advanced Institute of Science and Technology (JAIST) as a professor. He joined the Institut de la Communication Parlée (ICP), Centre National de la Recherche Scientifique (CNRS), France, as a first-class research scientist from 2002 to 2003. Since 2009, he has been with Tianjin University, Tianjin, China. His research interests cover all fields of speech science, including brain science and speech signal processing. He built MRI-based bio-physiological models of speech and swallowing, and endeavors to apply these models in clinical settings.
[Uncaptioned image] Korin Richmond is a Reader in Speech Technology (UK Associate Professor) at the Centre for Speech Technology Research, University of Edinburgh. With experience in human language and speech technology going back to 1991, he received an MA (Linguistics and Russian, 1995), MSc (Cognitive Science and Natural Language Processing, 1997) and PhD (titled “Estimating Articulatory Parameters from the Acoustic Speech Signal”, 2002) from the University of Edinburgh. He has lectured at the University of Edinburgh since 2016, and has diverse research interests, including: speech synthesis; pronunciation modelling and lexicography; speech therapy; articulatory data and modelling. He has over 110 publications in speech technology and processing (h-index 33; i10-index 66). He is an IEEE Senior Fellow and has served two 3-year terms as a member of the IEEE Speech and Language Technical Committee.
[Uncaptioned image] Junichi Yamagishi (Senior Member, IEEE) received a Ph.D. degree from the Tokyo Institute of Technology (Tokyo Tech), Tokyo, Japan, in 2006. From 2007 to 2013 he was a research fellow in the Centre for Speech Technology Research, University of Edinburgh, U.K. He became an associate professor with the National Institute of Informatics, Japan, in 2013, where he is currently a professor. His research interests include speech processing, machine learning, signal processing, biometrics, digital media cloning, and media forensics.